Tagged: juxta

Digital Aladore versus Internet Archive!

No, this isn’t about some kind of battle–I love Internet Archive!

But, it goes back to the beginning of this project:  I was trying to read the Aladore EPUB available from Internet Archive.  It was terrible!  These files are automatically generated by OCR and not human edited (a tag in the HTML headers says they were generated by “abbyy to epub tool, v0.2.”), so maybe I should be impressed at how good they are… but the thousands of tiny errors make for a frustrating and ugly reading experience.

When I first read the Internet Archive Aladore EPUB back in July 2014, I did some quick find & replace to clean up the EPUB a bit using Calibre.  However, some of the errors are very difficult to eliminate without painstaking editing of the entire text.  Particularly annoying are the headers and page numbers.  They both cause a lot of OCR errors, so are not predictable enough to remove with find & replace.

At that point, I just read it, and dealt with the crummy text…  If you want something better, there are two options: 1) edit the OCR text available on Internet Archive, or 2) start from scratch and do the whole process yourself.

Although option one would probably be easier in the short term, the Digital Aladore project followed option two–because it seemed more interesting!  Trust me, I know the breadth of this project wasn’t entirely necessary, but I think it demonstrates how a single person using only Free software can create a quality digitized text.  For the final polishing and editing of a text, larger projects that can harness the power of crowdsourcing, like Project Gutenberg’s Distributed Proofreaders (or Distributed Proofreaders Canada), are great.  But what if it’s such an obscure or personal text that you can’t generate that kind of participation and interest?  I think Digital Aladore shows it’s not impossible to just go it alone.  Be empowered over your EPUBs!  Create, Edit, Read!

Anyway, to see how far (or not very far) the Digital Aladore text has come by starting over from scratch, I thought it would be interesting to do a Juxta comparison with the Internet Archive text.  To set up the comparison I had to strip out some of the technical markup in the HTML, since I really just want to compare the written text, not the tagging.  Here is a quick outline of what I did, since the concepts may be helpful if you want to start from an existing EPUB and polish it up, rather than go through the entire OCR process:

Open the EPUB (downloaded from https://archive.org/details/aladorehen00newbrich ) with Sigil.

Explore the contents to understand how the files are divided.  In this case the EPUB has a bunch of files named “leaf” which are the covers and random pages from the front matter.  The actual text is contained in three arbitrarily divided HTML files named “part.”

Merge the HTML containing the text (In the “Book Browser” pane highlight the files, then right click and select Merge).

I noticed that the text contains a bunch of page divisions represented by <div> tags.  They are not very accurate, and won’t relate to the Digital Aladore text, so I wanted to remove them all.  I used Advanced Find & Replace in Regex mode, searching for <div class=".*?" id=".*?"/> and replacing it with nothing.

Then, search for div, since there are a few more scattered around to remove.

Next, I needed to remove the illustrations, since I just wanted to compare the text.  I looked at how they were tagged in the files, and used this regex Find & Replace string to eliminate them: <p class="illus"><img alt=".*?" src=".*?"/></p>

Finally, right click on the HTML file and Save As to export it from the EPUB.
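If you would rather script this clean-up than click through Sigil, the same idea can be sketched in a few lines of Python.  This is only a rough sketch under some assumptions: the first two patterns are the regexes from the steps above, while the catch-all LEFTOVER_DIV pattern, the function name strip_layout_tags, and the sample HTML are all made up for illustration.  Real files may well need different patterns.

```python
import re

# Patterns copied from the Sigil Find & Replace steps above.
PAGE_DIV = re.compile(r'<div class=".*?" id=".*?"/>')
ILLUSTRATION = re.compile(r'<p class="illus"><img alt=".*?" src=".*?"/></p>')
# Assumption: any remaining opening or closing div tag can simply be dropped.
LEFTOVER_DIV = re.compile(r'</?div[^>]*>')


def strip_layout_tags(html: str) -> str:
    """Remove page-division and illustration markup, keeping the text."""
    html = PAGE_DIV.sub("", html)
    html = ILLUSTRATION.sub("", html)
    html = LEFTOVER_DIV.sub("", html)
    return html


# Made-up sample markup, only to show the effect of the three patterns.
sample = (
    '<p>First paragraph.</p><div class="newpage" id="page-12"/>'
    '<p class="illus"><img alt="" src="images/leaf12.png"/></p>'
    '<div><p>Second paragraph.</p></div>'
)
cleaned = strip_layout_tags(sample)
print(cleaned)
```

It is worth checking the result by eye afterwards, since a lazy .*? can still match more than intended on unusual markup.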

Ready to compare!

Here it is at Juxta Commons: http://juxtacommons.org/shares/BUCgJl

Another Juxta use!

After creating my best text by combining my OCR transcripts of the 1914 and 1915 editions of Aladore, I realized I had a “newer” version of the 1914 text that I didn’t use!  When creating the draft Aladore EPUB using the 1914 text, I spent some time correcting the paragraph breaks since they are not very accurate in the OCR text.  The version I used for the Juxta collations did not include these corrections…

No big deal–I just put the text I created with Juxta in the last post back into Juxta as a witness and collated it with the forgotten 1914 version.  Then I could scroll through to quickly add edits to the best text.  Done.

Basically I was combining the edits of the two texts, which is kind of an amazing idea when you think about it, since they happened asynchronously with different source texts.  Juxta is useful as a versioning tool–I am sure there are some others available aimed at coders.  But Juxta is free and simple to use.  I suggest the desktop version unless you want to share online.  Also, if you have a WordPress site, there is a plugin to embed Juxta collations.

Using Juxta

So the collation of the 1914 and 1915 Aladore texts is pretty neat–but it is also useful.  I don’t think it is revealing any interesting differences between the two printed editions, but it is surfacing many simple errors in the OCR that I haven’t spotted by other means.  The 1914 and 1915 print editions seem to be identical except for a few page breaks.  I do not need to unravel issues with the transmission or make any complex editorial decisions based on textual scholarship.  Instead, the comparison is part of a process to get the new OCR witnesses to better match the digital image witnesses.  The OCR of each edition has slightly different errors which are highlighted by the collation.  Thus, by combining the information of two imperfect witnesses (all witnesses are imperfect!), we can create a new best text that is more accurate than its parents.

Here is an example:

OCR errors in Juxta.

The 1915 text has “world’s. four roads” and the 1914 text has “world’s {our roads”, so both have errors!  Looking at the page images, “world’s four roads” is obviously the correct reading, but the 1915 edition has a tiny blemish at the end of a line which OCRed as a period, and the “f” of “four” in the 1914 edition is slightly faint, which contributed to its misidentification.  These are exactly the sort of OCR errors that are hard to detect in any other way since they do not show up when spell checking and are not visually obvious.
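To get a feel for what a collation does with an example like this, here is a minimal Python sketch that uses the standard library’s difflib to report where two witnesses disagree, word by word.  It is not how Juxta itself works internally; the function name differing_spans is made up, and the two strings are just the fragments quoted above.

```python
import difflib


def differing_spans(base: str, witness: str):
    """Return (base_words, witness_words) pairs wherever the two texts disagree."""
    a, b = base.split(), witness.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The two OCR readings quoted above.
ocr_1914 = "world’s {our roads"
ocr_1915 = "world’s. four roads"

for left, right in differing_spans(ocr_1914, ocr_1915):
    print(left, "<->", right)
```

Neither side of a reported pair is necessarily right, as this example shows, so each difference still needs a look at the page image.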

In case you want to play along, I shared the comparison on Juxta Commons: http://juxtacommons.org/shares/qWLf2p

However, Juxta Commons is not ideal for actually fixing the errors, since it does not allow you to edit the source texts.  You will have to open the base text in an external text editor to fix issues as you look through the collation on Juxta.  Instead, I prefer to use the desktop version which does allow you to edit on the fly–handy!  One other difference is that the desktop application displays the raw HTML rather than the rendered text shown on the Commons.  This can be distracting if your tags are too ugly but I like seeing them since I want to ensure both the text content and tags are correct.

Here is my workflow for the process:

1) Download and install the desktop version of Juxta from http://www.juxtasoftware.org/download

2) Open Juxta, and click File > New Comparison Set.

Juxta menu bar and icons.

3) Click the Plus icon to add your witnesses (i.e. Digital Aladore 1914 and 1915).

4) Click the Refresh icon to collate the witnesses.  This brings up a dialog box with options for the collation.  Since I want to detect ALL differences, I uncheck all the boxes.  Processing might take a few seconds.

collate options.

You will now have a workspace something like this:

Aladore collated on Juxta.

Clicking on a witness listed in the “Comparison Set” pane (left) changes the base text.  The selected base text will appear in the lower right “Source” pane.  The upper right pane can be switched between “Collation view” and “Comparison view” (i.e. Heatmap or Side-by-side in Juxta Commons terminology, as described in the previous post) using the tabs at the bottom of the pane.  Clicking on a word in the Collation pane will scroll the Source pane to that spot and highlight the selection there.

5) Choose one of the source texts to edit (I used Aladore 1914).  Then, click “Edit” in the lower right corner of the Source pane.  This allows you to edit the text, but nothing is saved until you click “Update.”

6) Now, I scroll through the text on the Collation pane (I prefer to use the “Comparison view”) and decide which errors need to be edited in the base text.  I click on the highlighted errors in the Collation pane (if using Comparison view, make sure you click on the side representing your base text or it will switch the Source pane).  This brings up the corresponding spot in the Source pane to edit.  If the correct reading is not obvious, I quickly check the original page image (I have the directory open with thumbnails and a preview window so they are easy to reference).

7) When finished working through the entire collation, click “Update” on the lower right corner of the Source pane and save the edited text with a new name.

8) Click on the new text in the Comparison Set pane.  Then, click File > Export Source Document.  This saves a text file of the new witness.

Done!

This little activity caught a lot more errors than I expected!  In particular, the 1914 text had “b” instead of “h” in many places.  There were also many misplaced periods in random locations.  The interface is a little cumbersome for editing in this way, but I definitely think it is a useful tool.
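If you are curious which letters your OCR confuses most often, a character-level comparison can tally them.  The following is a rough sketch using Python’s difflib, not anything built into Juxta.  The function name confusion_counts and the sample sentence are invented for illustration, and only one-for-one character swaps are counted.

```python
from collections import Counter
import difflib


def confusion_counts(reference: str, ocr: str) -> Counter:
    """Count one-for-one character substitutions as (reference char, OCR char)."""
    matcher = difflib.SequenceMatcher(None, reference, ocr, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired up character by character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(reference[i1:i2], ocr[j1:j2]))
    return pairs


# Invented sentence showing the "b" for "h" confusion described above.
counts = confusion_counts("he had heard the horn", "be had beard the horn")
print(counts.most_common())
```

A pair that turns up again and again, like “h” read as “b”, is a good candidate for a targeted search through the rest of the text.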

Juxta Collation!

Let’s look at a more nuanced tool that was specifically created for collating witnesses: Juxta, http://www.juxtasoftware.org

Juxta is a Free and open source project (APLv2) designed to support textual scholarship that was developed by NINES (i.e. Networked Infrastructure for Nineteenth-Century Electronic Scholarship).  People have found a wide variety of uses for the software.  For example, here is a post by Stephanie Kingsley at the Juxta blog about using it for editing OCR transcripts, much like I am doing at Digital Aladore.  Juxta was first available as a Java-based desktop application, but the most recent version is online only, called Juxta Web Service.  The code can be found on GitHub if you want to start your own instance.  However, a complete pipeline is available to use for free at the Juxta Commons, http://juxtacommons.org

Go register for an account and start Collating your own texts online!  Oh, what fun!

Seriously, Juxta is a simple and powerful tool that quickly reveals the exact differences between the witnesses.  For the purposes of Digital Aladore, where the source texts are the simple HTML files I prepared in the last post, the older desktop version and the newer web version are almost identical.  The web version just looks a little slicker and enables easy sharing of your work.

The desktop version is very simple.  Just click the Plus icon to add each witness and click the Refresh icon to collate the selected texts.  This will generate a view like this:

Aladore on Juxta desktop.

There are two main ways to visualize the collation of the witnesses: Heatmap or Side-by-side view.  The Heatmap is the default view, the upper right pane in the screenshot above.  The text displayed is the “base text”, i.e. one of the witnesses, in this screenshot the 1915 edition.  The base can be switched by clicking on a different witness in the left pane.  Areas where the other witnesses differ from the base text are highlighted in blue.  If you have many witnesses, the color will be lighter or darker depending on how much variance is present.  For example, if all the witnesses have a different word in one location, it would be dark blue.  If only one of five witnesses has a different word, it would be highlighted in light blue.  Clicking on the highlighted area brings up a window showing the alternative reading (i.e. what the other witness says that differs from the base text).
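The idea behind the Heatmap shading can be illustrated with a small Python sketch: for each word of the base text, count what fraction of the other witnesses disagree at that spot.  This is only my approximation of the concept, not Juxta’s actual algorithm.  The function name variance_by_token and the sample witnesses are made up, and words that a witness adds without a counterpart in the base are ignored.

```python
import difflib


def variance_by_token(base: str, witnesses: list[str]) -> list[float]:
    """For each word of the base text, the fraction of witnesses that differ there."""
    base_words = base.split()
    counts = [0] * len(base_words)
    if not witnesses:
        return [0.0] * len(base_words)
    for text in witnesses:
        matcher = difflib.SequenceMatcher(None, base_words, text.split(), autojunk=False)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            # "replace" and "delete" cover base words the witness lacks or changes.
            if tag in ("replace", "delete"):
                for i in range(i1, i2):
                    counts[i] += 1
    return [c / len(witnesses) for c in counts]


# Invented witnesses: two of the four disagree with the base on the second word.
base = "the four roads"
others = ["the four roads", "the {our roads", "the tour roads", "the four roads"]
shading = variance_by_token(base, others)
print(shading)
```

A higher number corresponds to a darker highlight: 0.0 would be unshaded, and 1.0 would mean every other witness differs from the base at that word.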

The Side-by-side view displays two witnesses aligned next to each other with highlights on the differences.  Lines visually connect the differences so that you can easily see how they relate.  A histogram showing the areas of variance can be opened to easily navigate through the text.

Side-by-side and histogram on Juxta desktop.

Using Juxta Commons is basically the same, although the workflow is a little more complicated using the browser-based controls.  The web version enables more input types and advanced processing of text sources, which we don’t need for Digital Aladore.  After logging in, you need to add sources, i.e. upload your files or connect to a URL.  Once the source is uploaded, click the little arrow next to the file name to “Prepare Witness.”  A processed version will now show up in the Witnesses window.  Once you have all the witnesses ready, check off the ones you want to compare, click on Witnesses at the top of the window, and click “Create Set with selected.”  The screen will look something like this:

Aladore on Juxta Commons.

Collation of the witnesses may take a few minutes; a green circle will appear to the left of the Set name when processing is complete.  Click on the Set’s name to open the visualizations.  The view options are the same as on the desktop version.  Side-by-side of Digital Aladore 1914 and 1915 looks like this:

Juxta Commons Side-by-side view.

So the collation is all set up, and we will start USING it in the next post…