Read Aladore as a Word Tree

Why not?

Check out an Aladore word tree visualization

created using Word Tree, http://www.jasondavies.com/wordtree

Basically, you choose a word, and the visualization shows you everything that follows that word anywhere in the text.  If you choose a word that is only used once, there will be just one line of text.  If you choose a commonly used word, you will get an amazing cross section of the entire text arranged on the tree.  It is a quick and interesting way to browse the text or get a sense of what it says about specific topics.
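If you are curious what is going on underneath, here is a minimal Python sketch of the idea.  It is not how the Word Tree site itself works, just an illustration: the function name `continuations` and the `width` parameter are my own inventions, and the sample sentence is borrowed from the opening of Aladore.

```python
import re
from collections import Counter


def continuations(text, word, width=4):
    """Count the phrases (up to `width` words long) that follow `word`."""
    # Lowercase everything and keep only runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    follows = Counter()
    for i, token in enumerate(tokens):
        if token == word.lower():
            phrase = " ".join(tokens[i + 1 : i + 1 + width])
            if phrase:  # skip a match at the very end of the text
                follows[phrase] += 1
    return follows


sample = (
    "Sir Ywain sat in the Hall of Sulney and did justice upon wrong-doers. "
    "Moreover, his steward stood beside him."
)
branches = continuations(sample, "sir", width=3)
```

Each key in `branches` is one branch of the tree, and the count tells you how thick that branch would be.  A word used once gives a single branch; a common word gives many.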



Still Working on Aladore!

Okay, so despite the lack of Aladore-related posts for a while, Digital Aladore is still working on the final EPUB!

I just decided to take some more time to think about the final formatting and polish.  However, let’s say you are really, really dying to read my new text: you could go read it where I posted it on Juxta… just kidding, here is a downloadable plain text version, saved in ODT (WordPress doesn’t let me post txt, sorry for the silly workaround):


Or if you prefer, here is a very basic HTML version, again saved in ODT:


I offer these unformatted versions in case you want to do something with the plain text, like use it for textual analysis or something else fun!  For example you could:

Do some text analysis with TAPoR ware, http://taporware.ualberta.ca/~taporware/textTools

Explore the text with visualizations at Voyant, http://voyant-tools.org

Or try out some serious visual analytics using Jigsaw, http://www.cc.gatech.edu/gvu/ii/jigsaw

(They are all free!)

Let me know if you come up with something interesting!


1914 versus 1915!

Earlier in the project I thought it would be interesting to compare the raw texts created by different OCR engines.  This was not simply to benchmark accuracy–the idea was that each engine might make slightly different types of mistakes, thus combining the outputs could result in a better text overall.  However, since my testing made it clear that Tesseract was far superior to other Free OCR engines and that ABBYY FineReader did not seem to be any more accurate for the purposes of Digital Aladore, I decided to abandon this comparison.  Since Tesseract performed so well with the digitized Aladore witnesses, creating separate texts would not contribute to my main goal of creating a Good reading edition.

However, I want to compare the Tesseract OCR text of the two main print editions of Aladore, 1914 (Edinburgh) and 1915 (New York).  The qualifier “OCR text of” is key here–I am not directly comparing the print editions or even the digitized images of them!  I am comparing the HTML text created by my OCR workflow using digitized JPEGs of the print pages… Thus any textual differences are a bit complicated to unravel, potentially originating anywhere along the transmission process:

First, there could be actual differences in the editions and simple errors introduced/fixed by the reprint.  Secondly, each actual copy of the book is unique–the two that were digitized may have different blemishes or flaws.  Thirdly, since the two books were digitized by different people and machines, different errors may have been introduced by the process (I discovered one already).  Finally, my OCR and HTML editing workflow may introduce different errors in each edition because of slight differences in pagination and the qualities of the images.

So this comparison is NOT about directly comparing the two printed texts–although it might highlight some differences (or not).  It is actually a sort of distorted version of textual criticism (which I talked a lot about earlier in the project). We have two new witnesses with a complex transmission back story.  I want to collate these newly created witnesses, to better understand how they are related and how they can be combined to create a single Best text. 

There are a number of software tools that can carry out this comparison, helping us surface both technical and textual issues with the HTML files.  I started from the HTML edited using BlueFish, with each edition consisting of six separate files.  First, I wanted to ensure the two witnesses had a uniform format so that the comparison would reveal textual differences rather than a bunch of inconsistencies in HTML tags.

Sigil is a surprisingly efficient tool for this job, following steps similar to the EPUB creation I described in an earlier post.  Open Sigil, click File > Add > Existing files, and select the six HTML files for an edition.  In the “Book Browser” pane (left side by default) highlight the six files you just added, then right click and select Merge.  This quickly and cleanly combines the files into a single HTML file.  Next, I did a quick Find & Replace sweep for known issues, such as “So” being recognized as “80” and the exotic ligature characters “ﬁ” (which is not “fi”) and “ﬂ” (which is not “fl”).  I also did a quick check of the Report (Tools > Reports > Characters in HTML) and Spellcheck for any strange OCR artifacts.  These are obvious transmission errors that just need to be fixed!  Then I right click on the HTML file in the “Book Browser” and choose Save As.  This allows you to “export” an HTML file from the EPUB.
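I did this sweep by hand in Sigil, but the same kind of cleanup could be sketched in Python.  This is only an illustration under my own assumptions: the function names are made up, and I treat “80” as something to review by eye rather than replace blindly, since a real number 80 could appear in a text.

```python
import re

# Unicode ligature characters that OCR may emit, mapped to plain letters.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}


def fix_ligatures(text):
    """Replace single-character ligatures with ordinary letter pairs."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def find_suspect_80(text):
    """Return each standalone "80" with a little context for manual review."""
    return [m.group(0) for m in re.finditer(r".{0,15}\b80\b.{0,15}", text)]


cleaned = fix_ligatures("the \ufb01rst \ufb02ower")
suspects = find_suspect_80("80 he rode on. In 1806 nothing happened.")
```

The word boundaries (`\b`) keep the search from flagging digits inside a longer number such as 1806.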

After completing these steps for the two editions, we have two HTML files ready for comparison.  I will describe the next step in the next post!
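In the meantime, here is a rough sketch of what a word-level comparison can look like, using Python’s standard `difflib` module.  This is not the tool I used for Digital Aladore, just a way to picture collation.  The function names are hypothetical, and the “gathercd” misreading in the sample is invented for the example.

```python
import difflib
import re


def to_words(html):
    """Strip tags and split into words, so markup differences are ignored."""
    text = re.sub(r"<[^>]+>", " ", html)
    return text.split()


def collate(witness_a, witness_b):
    """Return (operation, words_a, words_b) wherever the two texts differ."""
    a, b = to_words(witness_a), to_words(witness_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two tiny made-up witnesses: different tags, one differing word.
text_1914 = "<p>And one man had gathered sticks where he ought not</p>"
text_1915 = "<p >And one man had gathercd sticks where he ought not</p>"
variants = collate(text_1914, text_1915)
```

Because the tags are stripped first, the slightly different `<p>` and `<p >` do not show up as variants; only the differing word does.  That is the point of giving both witnesses a uniform format before comparing.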

Aladore 1915 text

Sorry, no new posts for a while–but Digital Aladore has not been idle!

I have been processing the second digital witness, Aladore 1915 (New York: E.P. Dutton & Company, digitized by Internet Archive in 2006).  This time around things went quite quickly and efficiently, since I wasn’t testing various options and I am now familiar with all the software.  The page background was cleaner in this digitized edition, which I think made the OCR a tiny bit more accurate.  However, the actual book seemed to have a few more print errors–for example a single letter or punctuation mark missing or distorted.  I think this sometimes happened with later printings of a book, since they were often reproduced from plates used in the first printing.  Wear and tear on the older plates can introduce errors into the new text.

Like the first time around, I divided the page images into six batches (i.e. directories) to simplify processing.  I preprocessed the pages with ScanTailor, ran OCR with YAGF, and batch edited the HTML with BlueFish.  Those three steps, including the computer’s processing time, took about four to five hours to complete in total.  You could rush through the process faster, but I think this estimate reflects a fairly careful and non-stressful pace.

I am curious to compare this new text with the first one I created, so I will be setting that up next!  Stay tuned…

Edit HTML with Bluefish

We enter the next stage of Digital Aladore with our ugly raw HTML in hand… and we want to make it into nice reflowable text to convert into an ebook.  Welcome to 5. Editing Text!

What is the most efficient way to rip through those ugly tags and fix up our text?

I am pretty sure it’s Bluefish Editor, http://bluefish.openoffice.nl

Bluefish is a powerful Free software (GNU GPLv3) text editor that supports web development and programming.  It is important to note that it is not a WYSIWYG editor: there is no graphical preview or editing.  For Digital Aladore, I am interested in Bluefish’s advanced find & replace features which allow you to use regular expressions and carry out operations across any size batch of files.  Exciting!

Opening Aladore 1914 with Bluefish.

First, I open the batch of six HTML files output by YAGF OCR of Aladore 1914.  I inspect the HTML tags to make sure I understand what is going on.  Since I will add formatting later during the epub stage, I want to strip away almost everything, but in a controlled manner!

First to go is the style tag in the headers:

<style type="text/css">
p, li { white-space: pre-wrap; }

Advanced Find & Replace.

We don’t need it, so out it goes with Advanced Find & Replace!  In the screen shot above you can see that Bluefish lets us select an entire directory (and even set the number of recursive levels below it) to Find & Replace on at once.  I love this!  But, this first operation is pretty tame… only six items gone, one for each HTML file header.

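Next, I get rid of all the style attributes inside the paragraph tags since they are not really necessary or meaningful.  Simply run Advanced Find & Replace with the regular expression  style=".*?"  and we find 9,202 items to replace with nothing!  Give it a second and they are all gone.  Note that this leaves the space before the attribute behind, so every opening tag now reads <p > rather than <p>.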
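For anyone who would rather script this than use Bluefish, the same regular expression works in Python.  This is just a sketch on a single shortened line; the function name `strip_style` is my own, and you would still need to loop over your files yourself.

```python
import re

# Non-greedy match: stop at the first closing quote after style=".
STYLE_ATTR = re.compile(r'style=".*?"')


def strip_style(html):
    """Remove every style="..." attribute; returns (new_html, count)."""
    return STYLE_ATTR.subn("", html)


line = '<p style=" margin-top:0px; text-indent:0px;">CHAPTER I.</p>'
cleaned, count = strip_style(line)
```

Here `cleaned` is `<p >CHAPTER I.</p>` and `count` is 1.  The `?` after `.*` matters: without it the match would be greedy and could swallow everything up to the last quotation mark on the line.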

Now we need to do some more strategic thinking to sort out the patterns left by YAGF OCR.  Basically, each line of the text is contained in its own paragraph tag <p>…</p>.  Between paragraphs is an empty line consistently tagged <p ><br /></p>.  And the end of a page is always:

<p ><br /></p>
<p ><br /></p>
<p > </p>

Ultimately, I want to get rid of those tags, replacing them so that: 1) the line breaks are removed, 2) the page breaks are removed, 3) complete paragraphs are contained in paragraph tags.

#3 is the tricky part, so to do this, we have to go about things in just the right order.  First, I remove the page breaks by replacing the string shown above with nothing.  This resulted in 364 replacements, exactly the number of page images we ran OCR on–Good!

This leaves an empty line in the HTML text files for each page break which I now remove using: Tools > Filters > Strip empty lines.

Next, I remove all the paragraph breaks by replacing <p ><br /></p> with nothing (501 results). Do NOT strip the lines this time, because the resulting empty lines will create the proper paragraphs in the next step.

Now, replace the closing tag of one line together with the opening tag of the next:

</p>
<p >

with a space (7096 results). This pattern represents a line break which should be removed to create our wrapping paragraphs.  Since the previous step left an empty line between actual paragraph breaks, the first <p > and last </p> of each paragraph do not match the search string, resulting in correctly tagged paragraphs!

Finally, I replace “- ” (hyphen space) with nothing (259 results).  This catches all the words that were hyphenated at a line break, while leaving all the actually hyphenated phrases alone.
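To see why the order matters, here is the whole sequence as a small Python sketch run on an abbreviated page.  It is an illustration, not what Bluefish does internally.  I am assuming the line-break pattern is a closing </p>, a newline, and the next opening <p >, which is how I read the step above; the function name `reflow` is made up.

```python
import re

PAGE_BREAK = "<p ><br /></p>\n<p ><br /></p>\n<p > </p>"
PARAGRAPH_BREAK = "<p ><br /></p>"
LINE_BREAK = "</p>\n<p >"  # assumed form of the line-break pattern


def reflow(html):
    html = html.replace(PAGE_BREAK, "")        # 1. drop page breaks
    html = re.sub(r"\n\s*\n", "\n", html)      # 2. strip the empty lines left behind
    html = html.replace(PARAGRAPH_BREAK, "")   # 3. drop paragraph breaks, keep the empty lines
    html = html.replace(LINE_BREAK, " ")       # 4. join the lines within each paragraph
    html = html.replace("- ", "")              # 5. rejoin words hyphenated at a line break
    return html


# Abbreviated sample page (middle lines omitted for brevity).
raw = "\n".join([
    "<p >CHAPTER I.</p>",
    "<p ><br /></p>",
    "<p >SIR YWAIN sat in the Hall of Sulney and</p>",
    "<p >did justice upon wrong-doers. And one man</p>",
    "<p >stood beside him, and put him in remem-</p>",
    "<p ><br /></p>",
    "<p ><br /></p>",
    "<p > </p>",
    "<p >brance of all the misery that had else been forgotten.</p>",
])
result = reflow(raw)
```

The heading and the body end up as two separate paragraphs, “remem-” and “brance” are rejoined across the page break, and “wrong-doers” keeps its hyphen because no space follows it.  If you stripped empty lines after step 3 as well, step 4 would glue the heading onto the body text.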

After that the batch looks good!  There are a few more issues that need to be fixed and errors to be searched for, but they are better off done with other tools, i.e. Sigil the ebook editor.  More soon!

Raw HTML text

After YAGF (Tesseract) OCR we have a batch of large HTML files.  For Aladore 1914 it was six HTML files around 1600 lines each.  The first page looks like this:





SIR YWAIN sat in the Hall of Sulney and

did justice upon wrong-doers. And one man

had gathered sticks where he ought not, and

this was for the twentieth time; and another

had snared a rabbit of his lord’s, and this was

for the fortieth time; and another had beaten

his wife, and she him, and this was for the

hundredth time: so that Sir Ywain was weary

of the sight of them. Moreover, his steward

stood beside him, and put him in remem-

Each line of text from the page image has been tagged as one HTML paragraph, so the breaks follow the original printed page.  Take a look at the markup:

<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">ALADORE.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">CHAPTER I.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">OF THE HALL OF SULNEY AND HOW</p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN LEFT IT.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN sat in the Hall of Sulney and</p>

Each paragraph has a big “style=” attribute that is sort of meaningless since they are all the same.  For my purposes, it’s just ugly and unnecessary.  However, the important thing is that everything is consistent.  I will not use the format described by the HTML tags, but because they are consistent in how they are used, I can easily transform the text.  With a little thought and find & replace, we can easily create a reflowable text, turning the HTML above into something like:




SIR YWAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remembrance of all the misery that had else been forgotten.

Which leads us to the next section of the project!

Digital Text

But now we (and our texts) live in the Digital world.  What does that mean for transmission?

I think texts are getting crazier… Texts on paper, even hand copied, are fairly static and stable over centuries.  How we interpret them changes a lot, but the witnesses are pretty solid.

Digital texts, on the other hand, are so easy to change, copy, distribute, re-distribute, etc., etc.  The internet is a textual Wild West where everyone copies back and forth from each other, and text is constantly transforming.  It is almost impossible to trace its life as it tumbles through the digital world.

The last few decades have seen a chaotic proliferation of free online books, often of highly suspect quality and consistency, almost always with poor metadata.  Projects such as the Universal Library/Million Book Project rushed to create scanned copies of, well, millions of books–most of which are of terrible quality.  Pages are missing, images are out of focus or capture a hand, files got jumbled (you can see some of the wreckage on Internet Archive: https://archive.org/details/millionbooks)… These digitized images of books are supposed to be a type of facsimile edition–attempting to exactly represent the qualities of the original book.  However, simple images of pages do not take advantage of the digital medium.

To enable search functions, OCR is used to create transcriptions of the images–essentially creating a new version of the text, a low accuracy machine generated/interpreted version… Since the OCR is machine readable, it is used to generate all sorts of other formats such as epubs or txt.  Few of these receive any human editing.

However, as early as the 1970s, Project Gutenberg was creating ebook editions of public domain works specifically focused on general Reading.  They are not attempting to create authoritative or critical editions.  Nor are they trying to exactly reproduce a specific print copy of the work.  Instead, the texts are proofread and edited by real people, to generate good Human readable ebooks.  Anyone (i.e. YOU) can volunteer to become a proofreader; check out their Distributed Proofreaders: http://www.pgdp.net

This project follows on that type of tradition.

However, there are a lot of other directions you can go with digital texts.  There is growing demand for academic quality texts to use as raw material for digital analytic techniques.  Standards such as Text Encoding Initiative (TEI) offer possibilities of enriching text with semantic markup.

You can also do a lot of neat stuff with online readers to create digital editions that collate and display several variant texts together, or a critical text with the variant readings as hyperlinks.  For example, check out the “Online Critical Pseudepigrapha”: http://ocp.tyndale.ca

A digital text can represent the extremely complicated nature of transmission, supporting multiple readings of the text (set in a historical context) rather than a single critical edition.  For example, check out the “Homer Multitext Project”: http://www.homermultitext.org/about.html

Anyway, forget about all that–I will get back to Aladore soon…

Textual Criticism

When I started into this section of the project, I briefly mentioned some concepts related to textual criticism: Transmission, Witnesses, and Collation.  In a casual fashion, we have been exploring the transmission of Aladore, how the text is embodied in specific witnesses.  Soon we will more carefully collate our digital witnesses in pursuit of a good Reading text.  But, now is a good time to step back and reflect about transmission.

“Close-up of male hand with manuscript in Bridwell Library,” Southern Methodist University, Central University Libraries, DeGolyer Library. http://www.flickr.com/photos/smu_cul_digitalcollections/8679147466

Using textual criticism to create new critical editions is an ancient and important form of scholarship and communication–I am not going to outline thousands of years of rich history here, but…  It is important to think about Transmission.  Textual criticism is sort of like trying to sort out a game of “Telephone.”  The author had some intention for the text, then captured a version of the text on paper, which got published, which got published again, etc.  Due to the limitations of physical and intellectual reality, none of these texts are the same.  When you see bunches of handwritten copies of a work (such as ancient scrolls and manuscripts), the physical process of transmitting the text from author to reader is obvious.  You can see that a bunch of different scribes spent hours and hours copying from one witness to create a new one.  It’s obvious that mistakes will happen–but also that different interpretations will happen.  Even though this human element is less visible today, the life of a text is still incredibly complex, passing through the hands and minds of countless people (and machines).

Does every letter and every word match in our different versions of Aladore?  Editors and typesetters might fix or add errors.  A printing press may introduce some anomaly.  Our copy might have a stain across a page.  Our digital version may be missing something…

Texts are always alive and are never static.