No, this isn’t about some kind of battle–I love Internet Archive!
But, it goes back to the beginning of this project: I was trying to read the Aladore EPUB available from Internet Archive. It was terrible! These files are automatically generated by OCR and not human edited (a tag in the HTML headers says they were generated by “abbyy to epub tool, v0.2.”), so maybe I should be impressed at how good they are… but the thousands of tiny errors make for a frustrating and ugly reading experience.
When I first read the Internet Archive Aladore EPUB back in July 2014, I did some quick find & replace to clean up the EPUB a bit using Calibre. However, some of the errors are very difficult to eliminate without painstaking editing of the entire text. Particularly annoying are the headers and page numbers. They both cause a lot of OCR errors, so are not predictable enough to remove with find & replace.
At that point, I just read it and put up with the crummy text… If you want something better, there are two options: 1) edit the OCR text available on Internet Archive, or 2) start from scratch and do the whole process yourself.
Although option one would probably be easier in the short term, the Digital Aladore project followed option two, because it seemed more interesting! Trust me, I know the breadth of this project wasn’t entirely necessary, but I think it demonstrates how a single person using only Free software can create a quality digitized text. To finish polishing and editing the text, larger projects that can harness the power of crowdsourcing, like Project Gutenberg’s Distributed Proofreaders (or Distributed Proofreaders Canada), are great. But what if it’s such an obscure or personal text that you can’t generate that kind of participation and interest? I think Digital Aladore shows it’s not impossible to just go it alone. Be empowered over your EPUBs! Create, Edit, Read!
So to see how far (or not very far) the Digital Aladore text has come by starting over from scratch, I thought it would be interesting to do a Juxta comparison with the Internet Archive text. To set up the comparison I had to strip some of the markup out of the HTML, since I really just want to compare the written text, not the tagging. Here is a quick outline of what I did, since the concepts may be helpful if you want to start from an existing EPUB and polish it up, rather than go through the entire OCR process:
Open the EPUB with Sigil (downloaded from https://archive.org/details/aladorehen00newbrich )
Explore the contents to understand how the files are divided. In this case the EPUB has a bunch of files named “leaf” which are the covers and random pages from the front matter. The actual text is contained in three arbitrarily divided HTML files named “part.”
Merge the HTML containing the text (In the “Book Browser” pane highlight the files, then right click and select Merge).
I noticed that the text contains a bunch of page divisions represented by <div> tags. They are not very accurate, and won’t relate to the Digital Aladore text, so I wanted to remove them all. I used Advanced Find & Replace in Regex mode, searching for this pattern and replacing it with nothing:
<div class=".*?" id=".*?"/>
Then, search for div, since there are a few more scattered around to remove.
Next, I need to remove the illustrations since I just want to compare the text. I looked at how they were tagged in the files, and used this regex Find & Replace string to eliminate them:
<p class="illus"><img alt=".*?" src=".*?"/></p>
Finally, right click on the HTML file and Save As to export it from the EPUB.
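If you would rather script the cleanup than click through Sigil, the two regex removals above can be sketched in Python. This is only a rough sketch, not what I actually ran: the function name strip_page_markup and the sample markup are made up for illustration, and only the two patterns come from the steps above.

```python
import re

# The two patterns quoted in the steps above. Substituting an empty
# string deletes every match, like a Find & Replace with nothing in
# the "Replace" box.
PAGE_DIV = re.compile(r'<div class=".*?" id=".*?"/>')
ILLUSTRATION = re.compile(r'<p class="illus"><img alt=".*?" src=".*?"/></p>')


def strip_page_markup(html: str) -> str:
    """Remove self-closing page-division divs and illustration paragraphs.

    Hypothetical helper name. In Python, "." does not match a newline
    by default, so a tag broken across lines would be left untouched.
    """
    html = PAGE_DIV.sub("", html)
    html = ILLUSTRATION.sub("", html)
    return html


# Invented sample markup; the class, id, and file names are assumptions,
# not copied from the Internet Archive EPUB.
sample = (
    '<div class="newpage" id="page-12"/>'
    "<p>First paragraph.</p>"
    '<p class="illus"><img alt="" src="images/leaf0031.png"/></p>'
    "<p>Second paragraph.</p>"
)

cleaned = strip_page_markup(sample)
print(cleaned)
```

As in Sigil, a plain search for div afterwards is still worthwhile, since any stray tags that don’t fit the first pattern will survive this pass.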
Ready to compare!
Here it is at Juxta Commons: http://juxtacommons.org/shares/BUCgJl