Category: 5.5 Comparing Texts

Aladore 1915 text

Sorry, no new posts for a while–but Digital Aladore has not been idle!

I have been processing the second digital witness, Aladore 1915 (New York: E.P. Dutton & Company, digitized by Internet Archive in 2006).  This time around things went quickly and efficiently, since I wasn’t testing various options and I am now familiar with all the software.  The page background was cleaner in this digitized edition, which I think made the OCR a tiny bit more accurate.  However, the actual book seemed to have a few more print errors–for example, a single letter or punctuation mark missing or distorted.  I think this sometimes happened with later printings of a book, since they were often reproduced from the plates used for the first printing; wear and tear on the older plates can introduce errors into the new text.

Like the first time around, I divided the page images into six batches (i.e. directories) to simplify processing.  I preprocessed the pages with ScanTailor, ran OCR with YAGF, and batch edited the HTML with Bluefish.  Those three steps, including the computer’s processing time, took about four to five hours in total.  You could rush through the process faster, but I think this estimate reflects a fairly careful, non-stressful pace.
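As an aside, the batching step is easy to script.  Here is a minimal Python sketch of the idea, assuming hypothetical directory and file names (the post doesn’t use any scripts for this, so treat it as an illustration only):

```python
# Minimal sketch: split a directory of digitized page JPEGs into six
# numbered batch directories. All names here are hypothetical.
import shutil
from pathlib import Path

pages = sorted(Path("aladore_1915_pages").glob("*.jpg"))
num_batches = 6
per_batch = -(-len(pages) // num_batches)  # ceiling division

for i, page in enumerate(pages):
    batch = Path(f"batch_{i // per_batch + 1:02d}")
    batch.mkdir(exist_ok=True)
    shutil.copy2(page, batch / page.name)  # copy, keeping the original safe
```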

I am curious to compare this new text with the first one I created, so I will be setting that up next!  Stay tuned…

1914 versus 1915!

Earlier in the project I thought it would be interesting to compare the raw texts created by different OCR engines.  This was not simply to benchmark accuracy–the idea was that each engine might make slightly different types of mistakes, so combining the outputs could result in a better text overall.  However, since my testing made it clear that Tesseract was far superior to other Free OCR engines and that ABBYY FineReader did not seem to be any more accurate for the purposes of Digital Aladore, I decided to abandon this comparison.  Since Tesseract performed so well with the digitized Aladore witnesses, creating separate texts with different engines would not contribute to my main goal of creating a Good reading edition.

However, I want to compare the Tesseract OCR text of the two main print editions of Aladore, 1914 (Edinburgh) and 1915 (New York).  The qualifier “OCR text of” is key here–I am not directly comparing the print editions or even the digitized images of them!  I am comparing the HTML text created by my OCR workflow using digitized JPEGs of the print pages… Thus any textual differences are a bit complicated to unravel, potentially originating anywhere along the transmission process:

1) There could be actual differences between the editions, including simple errors introduced or fixed by the reprint.

2) Each physical copy of the book is unique–the two copies that were digitized may have different blemishes or flaws.

3) Since the two books were digitized by different people and machines, the digitization process may have introduced different errors (I discovered one already).

4) My OCR and HTML editing workflow may introduce different errors in each edition because of slight differences in pagination and image quality.

So this comparison is NOT about directly comparing the two printed texts–although it might highlight some differences (or not).  It is actually a sort of distorted version of textual criticism (which I talked a lot about earlier in the project).  We have two new witnesses with a complex transmission backstory.  I want to collate these newly created witnesses to better understand how they are related and how they can be combined to create a single Best text.

There are a number of software tools that can carry out this comparison, helping us surface both technical and textual issues with the HTML files.  I started from the HTML edited with Bluefish, each edition consisting of six separate files.  First, I wanted to ensure the two witnesses had a uniform format, so that the comparison would reveal textual differences rather than a bunch of inconsistencies in HTML tags.

Sigil is a surprisingly efficient tool for this job, following steps similar to the EPUB creation I described in an earlier post.  Open Sigil, click File > Add > Existing files, and select the six HTML files for an edition.  In the “Book Browser” pane (left side by default) highlight the six files you just added, then right click and select Merge.  This quickly and cleanly combines the files into a single HTML file.  Next, I did a quick Find & Replace sweep for known issues, such as “So” being recognized as “80” and the exotic ligature characters “ﬁ” (U+FB01, a single character, not “fi”) and “ﬂ” (U+FB02, not “fl”).  I also did a quick check of the Report (Tools > Reports > Characters in HTML) and a Spellcheck for any strange OCR artifacts.  These are obvious transmission errors that just need to be fixed!  Finally, right click on the HTML file in the “Book Browser” and choose Save As, which lets you “export” an HTML file from the EPUB.
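If you would rather script this sweep than click through Sigil’s Find & Replace, something like the following works.  This is a minimal sketch of my own: the file names are hypothetical, and the “80” rule is a crude heuristic that should be checked match by match:

```python
# Minimal sketch of the cleanup sweep: normalize the ligature characters
# and the "So" -> "80" misreading. File names are hypothetical.
import re
from pathlib import Path

html = Path("aladore_1915.html").read_text(encoding="utf-8")

html = html.replace("\ufb01", "fi")  # U+FB01 ligature 'ﬁ' -> plain "fi"
html = html.replace("\ufb02", "fl")  # U+FB02 ligature 'ﬂ' -> plain "fl"

# Crude heuristic: "80" followed by a lowercase word is almost always a
# misread "So" in this text. Review each match before trusting it.
html = re.sub(r"\b80(?=\s+[a-z])", "So", html)

Path("aladore_1915_swept.html").write_text(html, encoding="utf-8")
```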

After completing these steps for the two editions, we have two HTML files ready for comparison.  I will describe the next step in the next post!

Collation 1

To do a quick and easy comparison of the 1914 and 1915 HTML texts, the best tool is Notepad++, http://notepad-plus-plus.org.  This is a weird case where Windows has an awesome Free software tool that Linux doesn’t!  Notepad++ is a powerful and flexible text editor with extensive features to make coding easier.

The Notepad++ community also creates a bunch of plugins to extend its functionality.  For this task we need the Compare plugin.  You may need to add it: on the menu click Plugins > Show Plugin Manager, then find Compare, check the box, and click Install.  Now you are ready to compare any type of text-based file–easy!

Simply open the files you want to compare (Notepad++ uses tabs), then click Plugins > Compare > Compare.  With our two Aladore HTML files, it will look something like this:

Compare on Notepad++

The texts are aligned and scroll in sync, with a representation of the differences displayed on the right side.  Each line with a discrepancy is highlighted (but not the actual differing characters or words).  The type of change is indicated by colors and icons (for example, line added, line deleted, or line moved).  This quickly reveals simple formatting issues.

In the example pictured above, the red highlights reveal a paragraph that was broken incorrectly in the 1915 text.  Both files can be edited, so this is easily fixed by deleting the extra <p> tags and empty lines.  The text was the same; only the HTML tagging was incorrect.  I quickly worked through the red (line deleted) and green (line added) highlights, which were all similar formatting issues.  This resulted in two HTML files with 1105 lines each, confirming that the editions are nearly identical!

However, this still leaves hundreds of yellow highlighted lines, which simply indicate some change somewhere in the line (i.e. within a single paragraph, <p> to </p>).  The exact difference is NOT highlighted.  The majority of these differences are a single character, such as “S” versus “s”.  It would be painstaking to find them all using Compare.
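For the curious, a short script can recover what Compare does not show.  This is my own addition, not part of the Notepad++ workflow: a minimal Python sketch using difflib that assumes the two files now have matching line counts, as described above (file names are hypothetical):

```python
# Minimal sketch: report the exact character-level differences between
# two line-aligned witnesses. File names are hypothetical.
from difflib import SequenceMatcher
from pathlib import Path

a_lines = Path("aladore_1914.html").read_text(encoding="utf-8").splitlines()
b_lines = Path("aladore_1915.html").read_text(encoding="utf-8").splitlines()

for n, (a, b) in enumerate(zip(a_lines, b_lines), start=1):
    if a == b:
        continue
    # get_opcodes() yields (tag, i1, i2, j1, j2) spans; print the unequal ones
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op != "equal":
            print(f"line {n}: {a[i1:i2]!r} (1914) vs {b[j1:j2]!r} (1915)")
```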

Honestly, it isn’t really necessary to go any further for this project, but to explore a few more tools, we will look more into these differences in the next post…
Juxta Collation!

Let’s look at a more nuanced tool that was specifically created for collating witnesses: Juxta, http://www.juxtasoftware.org

Juxta is a Free and open source project (APLv2) developed by NINES (Networked Infrastructure for Nineteenth-Century Electronic Scholarship) to support textual scholarship.  People have found a wide variety of uses for the software.  For example, here is a post by Stephanie Kingsley at the Juxta blog about using it to edit OCR transcripts, much like I am doing at Digital Aladore.  Juxta was first available as a Java-based desktop application, but the most recent version, called Juxta Web Service, is online only.  The code can be found on GitHub if you want to start your own instance.  However, a complete pipeline is available to use for free at the Juxta Commons, http://juxtacommons.org

Go register for an account and start Collating your own texts online!  Oh, what fun!

Seriously, Juxta is a simple and powerful tool that quickly reveals the exact differences between the witnesses.  For the purposes of Digital Aladore, where the source texts are the simple HTML files I prepared in the last post, the older desktop version and the newer web version are almost identical.  The web version just looks a little slicker and enables easy sharing of your work.

The desktop version is very simple.  Just click the Plus icon to add each witness and click the Refresh icon to collate the selected texts.  This will generate a view like this:

Aladore on Juxta desktop.

There are two main ways to visualize the collation of the witnesses: the Heatmap or the Side-by-side view.  The Heatmap is the default view, the upper right pane in the screenshot above.  The text displayed is the “base text”, i.e. one of the witnesses–in this screenshot, the 1915 edition.  The base can be switched by clicking on a different witness in the left pane.  Areas where the other witnesses differ from the base text are highlighted in blue.  If you have many witnesses, the color will be lighter or darker depending on how much variance is present.  For example, if all the witnesses have a different word in one location, it will be dark blue; if only one of five witnesses has a different word, it will be highlighted in light blue.  Clicking on a highlighted area brings up a window showing the alternative reading (i.e. what the other witness says that differs from the base text).

The Side-by-side view displays two witnesses aligned next to each other with highlights on the differences.  Lines visually connect the differences so that you can easily see how they relate.  A histogram showing the areas of variance can be opened to easily navigate through the text.

Side-by-side and histogram on Juxta desktop.

Using Juxta Commons is basically the same, although the workflow is a little more complicated using the browser-based controls.  It enables more input types and advanced processing of text sources, which we don’t need for Digital Aladore.  After logging in, you need to add sources, i.e. upload your files or connect to a URL.  Once a source is uploaded, click the little arrow next to the file name to “Prepare Witness.”  A processed version will then show up in the Witnesses window.  Once you have all the witnesses ready, check off the ones you want to compare, click on Witnesses at the top of the window, and click “Create Set with selected.”  The screen will look something like this:

Aladore on Juxta Commons.

Collation of the witnesses may take a few minutes; a green circle will appear to the left of the Set name when processing is complete.  Click on the Set’s name to open the visualizations.  The view options are the same as in the desktop version.  Side-by-side of Digital Aladore 1914 and 1915 looks like this:

Juxta Commons Side-by-side view.

So the collation is all set up–we will start USING it in the next post…

Using Juxta

So the collation of the 1914 and 1915 Aladore texts is pretty neat–but it is also useful.  I don’t think it is revealing any interesting differences between the two printed editions, but it is surfacing many simple errors in the OCR that I haven’t spotted by other means.  The 1914 and 1915 print editions seem to be identical except for a few page breaks.  I do not need to unravel issues with the transmission or make any complex editorial decisions based on textual scholarship.  Instead, the comparison is part of a process to get the new OCR witnesses to better match the digital image witnesses.  The OCR of each edition has slightly different errors which are highlighted by the collation.  Thus, by combining the information of two imperfect witnesses (all witnesses are imperfect!), we can create a new best text that is more accurate than its parents.
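The actual combining happens by hand in Juxta, as described below, but purely to illustrate the principle, here is a toy Python sketch of my own: wherever the aligned witnesses agree, the reading is accepted; wherever they disagree, a human checks the page images (file names are hypothetical):

```python
# Toy illustration of combining two imperfect witnesses: agreements are
# accepted automatically; disagreements are queued for human review
# against the original page images. File names are hypothetical.
from pathlib import Path

w1914 = Path("aladore_1914.html").read_text(encoding="utf-8").splitlines()
w1915 = Path("aladore_1915.html").read_text(encoding="utf-8").splitlines()

review_queue = []
for n, (a, b) in enumerate(zip(w1914, w1915), start=1):
    if a != b:
        review_queue.append((n, a, b))  # both readings kept for the editor

print(f"{len(review_queue)} lines need a human decision")
```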

Here is an example:

OCR errors in Juxta.

The 1915 text has “world’s. four roads” and the 1914 “world’s {our roads”, so both have errors!  Looking at the page images, “world’s four roads” is obviously the correct reading, but the 1915 edition has a tiny blemish at the end of a line which OCRed as a period, and the “f” of “four” in the 1914 edition is slightly faint, contributing to its misidentification.  These are exactly the sort of OCR errors that are hard to detect any other way, since they do not show up in spell checking and are not visually obvious.
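For what it’s worth, the same difflib approach sketched in an earlier post will pinpoint exactly where these two readings diverge:

```python
# The two OCR readings from the example above; the non-equal opcodes
# surface the stray period and the '{' misread of 'f'.
from difflib import SequenceMatcher

a = "world's. four roads"  # 1915 OCR reading
b = "world's {our roads"   # 1914 OCR reading

for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if op != "equal":
        print(op, repr(a[i1:i2]), "->", repr(b[j1:j2]))
```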

In case you want to play along, I shared the comparison on Juxta Commons: http://juxtacommons.org/shares/qWLf2p

However, Juxta Commons is not ideal for actually fixing the errors, since it does not allow you to edit the source texts.  You would have to open the base text in an external text editor to fix issues as you look through the collation.  Instead, I prefer the desktop version, which does allow you to edit on the fly–handy!  One other difference is that the desktop application displays the raw HTML rather than the rendered text shown on the Commons.  This can be distracting if your tags are too ugly, but I like seeing them, since I want to ensure both the text content and the tags are correct.

Here is my workflow for the process:

1) Download and install the desktop version of Juxta from http://www.juxtasoftware.org/download

2) Open Juxta, and click File > New Comparison Set.

Juxta menu bar and icons.

3) Click the Plus icon to add your witnesses (i.e. Digital Aladore 1914 and 1915).

4) Click the Refresh icon to collate the witnesses.  This brings up a dialog box with options for the collation.  Since I want to detect ALL differences, I uncheck all the boxes.  Processing might take a few seconds.

Collate options.

You will now have a workspace something like this:

Aladore collated on Juxta.

Clicking on a witness listed in the “Comparison Set” pane (left) changes the base text.  The selected base text will appear in the lower right “Source” pane.  The upper right pane can be switched between “Collation view” and “Comparison view” (i.e. Heatmap and Side-by-side in Juxta Commons terminology, as described in the previous post) using the tabs at the bottom of the pane.  Clicking on a word in the Collation pane will move the source text and highlight the selection in the Source pane.

5) Choose one of the source texts to edit (I used Aladore 1914).  Then, click “Edit” in the lower right corner of the Source pane.  This allows you to edit the text, but nothing is saved until you click “Update.”

6) Now, I scroll through the text on the Collation pane (I prefer to use the “Comparison view”) and decide which errors need to be edited in the base text.  I click on the highlighted errors in the Collation pane (if using Comparison view, make sure you click on the side representing your base text or it will switch the Source pane).  This brings up the corresponding spot in the Source pane to edit.  If the correct reading is not obvious, I quickly check the original page image (I have the directory open with thumbnails and a preview window so they are easy to reference).

7) When finished working through the entire collation, click “Update” on the lower right corner of the Source pane and save the edited text with a new name.

8) Click on the new text in the Comparison Set pane.  Then, click File > Export Source Document.  This saves a text file of the new witness.

Done!

This little activity caught a lot more errors than I expected!  In particular, the 1914 text had “b” instead of “h” in many places.  There were also many misplaced periods in random locations.  The interface is a little cumbersome for editing in this way, but I definitely think it is a useful tool.

Another Juxta use!

After creating my best text combining my OCR transcripts of Aladore 1914 and 1915 editions, I realized I had a “newer” version of the 1914 text that I didn’t use! When creating the draft Aladore EPUB using the 1914 text, I spent some time correcting the paragraph breaks since they are not very accurate in the OCR text.  The version I used for the Juxta collations did not include these corrections…

No big deal–I just put the text I created with Juxta in the last post back into Juxta as a witness and collated it with the forgotten 1914 version.  Then I could scroll through to quickly add edits to the best text.  Done.

Basically, I was combining the edits of the two texts–kind of an amazing idea when you think about it, since they happened asynchronously with different source texts.  Juxta is useful as a versioning tool–I am sure there are others available aimed at coders, but Juxta is free and simple to use.  I suggest the desktop version unless you want to share online.  Also, if you have a WordPress site, there is a plugin to embed Juxta collations.

Digital Aladore versus Internet Archive!

No, this isn’t about some kind of battle–I love Internet Archive!

But it goes back to the beginning of this project: I was trying to read the Aladore EPUB available from Internet Archive, and it was terrible!  These files are automatically generated by OCR and not human edited (a tag in the HTML headers says they were generated by “abbyy to epub tool, v0.2”), so maybe I should be impressed at how good they are… but the thousands of tiny errors make for a frustrating and ugly reading experience.

When I first read the Internet Archive Aladore EPUB back in July 2014, I did some quick find & replace in Calibre to clean up the EPUB a bit.  However, some of the errors are very difficult to eliminate without painstakingly editing the entire text.  Particularly annoying are the running headers and page numbers: they cause a lot of OCR errors, so they are not predictable enough to remove with find & replace.

At that point, I just read it and dealt with the crummy text…  If you want something better, there are two options: 1) edit the OCR text available on Internet Archive, or 2) start from scratch and do the whole process yourself.

Although option one would probably be easier in the short term, the Digital Aladore project followed option two–because it seemed more interesting!  Trust me, I know the breadth of this project wasn’t entirely necessary, but I think it demonstrates how a single person using only Free software can create a quality digitized text.  To finish polishing and editing the text, larger projects that can utilize the power of crowdsourcing, like Project Gutenberg’s Distributed Proofreaders (or Distributed Proofreaders Canada), are great.  But what if it’s such an obscure or personal text that you can’t generate that kind of participation and interest?  I think Digital Aladore shows it’s not impossible to just go it alone.  Be empowered over your EPUBs!  Create, Edit, Read!

Anyway, to see how far (or not very far) the Digital Aladore text has come by starting over from scratch, I thought it would be interesting to do a Juxta comparison with the Internet Archive text.  To set up the comparison, I had to smooth out some technical content in the HTML, since I really just want to compare the written text, not the tagging.  Here is a quick outline of what I did, since the concepts may be helpful if you want to start from an existing EPUB and polish it up rather than go through the entire OCR process:

1) Open the EPUB with Sigil (downloaded from https://archive.org/details/aladorehen00newbrich ).

2) Explore the contents to understand how the files are divided.  In this case, the EPUB has a bunch of files named “leaf”, which are the covers and random pages from the front matter.  The actual text is contained in three arbitrarily divided HTML files named “part”.

3) Merge the HTML files containing the text (in the “Book Browser” pane highlight the files, then right click and select Merge).

4) The text contains a bunch of page divisions represented by <div> tags.  They are not very accurate and won’t relate to the Digital Aladore text, so I wanted to remove them all: Advanced Find & Replace, using Regex, with the pattern <div class=".*?" id=".*?"/>

5) Then search for div, since there are a few more scattered around to remove.

6) Next, remove the illustrations, since I just want to compare the text.  I looked at how they were tagged in the files and eliminated them with this regex Find & Replace string (see the sketch after these steps for a scripted version of this and step 4): <p class="illus"><img alt=".*?" src=".*?"/></p>

7) Finally, right click on the HTML file and Save As to export it from the EPUB.
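If you would rather script steps 4 and 6 than run them in Sigil, here is a minimal Python sketch using the regex patterns quoted above (the file names are hypothetical):

```python
# Minimal sketch: strip the page-division <div> tags and the illustration
# paragraphs using the regex patterns quoted in the steps above.
# File names are hypothetical.
import re
from pathlib import Path

html = Path("aladore_ia.html").read_text(encoding="utf-8")

html = re.sub(r'<div class=".*?" id=".*?"/>', "", html)  # page divisions
html = re.sub(r'<p class="illus"><img alt=".*?" src=".*?"/></p>', "", html)  # illustrations

Path("aladore_ia_text.html").write_text(html, encoding="utf-8")
```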

Ready to compare!

Here it is at Juxta Commons: http://juxtacommons.org/shares/BUCgJl