Earlier in the project I thought it would be interesting to compare the raw texts created by different OCR engines. This was not simply to benchmark accuracy; the idea was that each engine might make slightly different types of mistakes, so combining the outputs could result in a better text overall. However, my testing made it clear that Tesseract was far superior to the other free OCR engines, and that ABBYY FineReader did not seem any more accurate for the purposes of Digital Aladore, so I decided to abandon this comparison. Since Tesseract performed so well with the digitized Aladore witnesses, creating separate texts would not contribute to my main goal of creating a Good reading edition.
However, I do want to compare the Tesseract OCR text of the two main print editions of Aladore, 1914 (Edinburgh) and 1915 (New York). The qualifier “OCR text of” is key here: I am not directly comparing the print editions, or even the digitized images of them! I am comparing the HTML text created by my OCR workflow from digitized JPEGs of the print pages. Thus any textual differences are a bit complicated to unravel, since they could originate anywhere along the transmission process:
1. There could be actual differences between the editions: simple errors introduced or fixed by the reprint.
2. Each physical copy of the book is unique; the two copies that were digitized may have different blemishes or flaws.
3. Since the two books were digitized by different people and machines, different errors may have been introduced by the digitization process (I discovered one already).
4. My OCR and HTML editing workflow may introduce different errors in each edition because of slight differences in pagination and in the quality of the images.
So this comparison is NOT about directly comparing the two printed texts, although it might highlight some differences (or not). It is actually a sort of distorted version of textual criticism (which I talked about a lot earlier in the project). We have two new witnesses with a complex transmission backstory. I want to collate these newly created witnesses to better understand how they are related and how they can be combined to create a single Best text.
There are a number of software tools that can carry out this comparison, helping to surface both technical and textual issues with the HTML files. I started from the HTML edited in Bluefish, with each edition consisting of six separate files. First, I wanted to ensure the two witnesses had a uniform format, so that the comparison would reveal textual differences rather than a bunch of inconsistencies in HTML tags.
Sigil is a surprisingly efficient tool for this job, following steps similar to the EPUB creation I described in an earlier post. Open Sigil, click File > Add > Existing files, and select the six HTML files for an edition. In the “Book Browser” pane (left side by default), highlight the six files you just added, then right click and select Merge. This quickly and cleanly combines the files into a single HTML file. Next, I did a quick Find & Replace sweep for known issues, such as “So” being recognized as “80”, and the exotic ligature characters “ﬁ” (a single character, not “f” plus “i”) and “ﬂ” (not “f” plus “l”). I also did a quick check of the Report (Tools > Reports > Characters in HTML) and Spellcheck for any strange OCR artifacts. These are obvious transmission errors that just need to be fixed! Then I right click on the HTML file in the “Book Browser” and choose Save As, which allows you to “export” an HTML file from the EPUB.
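For anyone who prefers the command line, the same merge-and-clean step could be roughly scripted. This is only a sketch: the glob pattern is hypothetical, the concatenation is naive (Sigil merges the body content properly, while this just joins whole files), and the replacement pairs are the ones discussed above.

```python
import glob

# Known OCR transmission errors to fix. The ligature characters U+FB01
# and U+FB02 are single characters that merely look like "fi" and "fl".
REPLACEMENTS = [
    ("\ufb01", "fi"),  # "fi" ligature -> plain letters
    ("\ufb02", "fl"),  # "fl" ligature -> plain letters
]

def clean_ocr_text(text):
    """Apply each known find-and-replace fix in order."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    # "So" misread as "80": only reasonably safe after a sentence break,
    # and even then each hit really deserves a manual review.
    return text.replace(". 80 ", ". So ")

def merge_edition(pattern):
    """Concatenate one edition's HTML files (sorted by name) and clean them."""
    parts = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            parts.append(f.read())
    return clean_ocr_text("\n".join(parts))
```

A blind global replace of “80” would clobber legitimate numbers, which is why Sigil's interactive Find & Replace (where you confirm each hit) is the safer way to do this step.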
After completing these steps for the two editions, we have two HTML files ready for comparison. I will describe the next step in the next post!