Since we have decent individual page images for Aladore, our OCR workflow going forward is similar to what DIY book scanners are doing. Book scanners will shoot photographs of each page and aim to have the results at around 300 PPI. Our Aladore images are in that ballpark, so they should produce reasonable OCR results. Basically, I want an OCR solution with these features:
- batch process hundreds of individual page images
- good accuracy
- OCR area can be selected (avoids further processing later on)
- corrects line breaks (avoids further processing)
I have a Windows 7 and an Ubuntu 14 machine at home, so I could use anything that runs on these platforms. The most basic option is to simply install Tesseract and run it from the command line/terminal. First, download and install the most recent version from https://code.google.com/p/tesseract-ocr or from a Linux repository. Then open a terminal and follow the recipe to test some OCR action!
The basic command is “tesseract imagename.extension outputfilename”. Tesseract adds the “.txt” extension to the output name itself. For example, this command runs OCR on the first page of Aladore and outputs the text as a file called “test1914tess.txt”:
tesseract aladoren00newbuoft_0021.jpg test1914tess
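Since the goal is to process hundreds of page images, the same command can be wrapped in a small script. The following is only a rough sketch in Python, not something I have settled on: the function names `build_commands` and `run_batch` are made up for illustration, and it assumes all the page images sit in one folder as .jpg files and that `tesseract` is on the system path.

```python
from pathlib import Path
import subprocess


def build_commands(image_dir, pattern="*.jpg"):
    """Return one tesseract command (as an argument list) per page image.

    Images are taken in sorted filename order, which matches page order
    only if the files are numbered consistently (an assumption here).
    """
    commands = []
    for image in sorted(Path(image_dir).glob(pattern)):
        # Tesseract appends ".txt" to the output base name on its own,
        # so the base is simply the image path without its extension.
        output_base = image.with_suffix("")
        commands.append(["tesseract", str(image), str(output_base)])
    return commands


def run_batch(image_dir, pattern="*.jpg"):
    """Run tesseract on every matching image; stop at the first failure."""
    for command in build_commands(image_dir, pattern):
        subprocess.run(command, check=True)
```

Calling `run_batch("pages")` would leave one text file next to each image in a hypothetical "pages" folder, for example aladoren00newbuoft_0021.txt beside aladoren00newbuoft_0021.jpg.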
Here is what the page image looks like:
And here is the Tesseract output:
OF THE HALL OF SULNEY AND HOW
SIR YWAIN LEFT IT.
SIR YWAINl sat in the Hall of Sulney and
did justice upon wrong-doers. And one man
had gathered sticks where he ought not, and
this was for the twentieth time; and another
had snared a rabbit of his lord’s, and this was
for the fortieth time; and another had beaten
his wife, and she him, and this was for the
hundredth time: so that Sir Ywain was weary
of the sight of them. Moreover, his steward
stood beside him, and put him in remem-
1 Ywain=Ewain or Ewan.
So the results are very accurate! The only problem it has is with the superscript 1 for the footnote about Ywain, which came out as a letter “l” stuck on the end of the name. Luckily, this is the last footnote of the book (I think). However, in terms of generating our good reading text there are some limitations which point out the challenges going forward (some of which can be solved by using different software).
First, every page in the book has the header “Aladore” and a page number in the footer. I don’t want those in the reading text. The Aladore header is fairly hard to remove in a batch process: you cannot just find and replace, because the word appears within the text as well. The page numbers will also just cause us editing headaches going forward. There are basically two options to get rid of them: 1) crop them out of the page images before OCR or 2) select an area for OCR rather than recognizing the whole page. Either option would be easy if every page was EXACTLY the same, but they are not. The text we want wanders around quite a bit.
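A possible third route, different from the two options above, is to clean up the text after OCR instead of touching the images. A plain find and replace fails, but a pattern that only matches a line containing nothing except the header, or nothing except a number, would leave the word alone inside the story. Here is a minimal sketch in Python, assuming each page is OCRed to its own text file, that the header comes out as the first non-blank line, and that the page number comes out as the last one. The name `strip_running_lines` is hypothetical, and a header that Tesseract misreads would slip through.

```python
import re

# A line that holds only the running header, in any letter case.
HEADER_RE = re.compile(r"^\s*aladore\s*$", re.IGNORECASE)
# A line that holds only a short number (assumed to be the page number).
PAGE_NUMBER_RE = re.compile(r"^\s*\d{1,3}\s*$")


def strip_running_lines(page_text):
    """Drop a header-only first line and a number-only last line."""
    lines = page_text.splitlines()

    # Skip blank lines at the top, then remove the header if present.
    while lines and not lines[0].strip():
        lines.pop(0)
    if lines and HEADER_RE.match(lines[0]):
        lines.pop(0)

    # Skip blank lines at the bottom, then remove the page number if present.
    while lines and not lines[-1].strip():
        lines.pop()
    if lines and PAGE_NUMBER_RE.match(lines[-1]):
        lines.pop()

    return "\n".join(lines).strip("\n")
```

Because only the first and last lines of each page are ever examined, a sentence in the middle of a page that mentions Aladore is not affected.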
Second, Tesseract outputs the recognized text with the existing line breaks. We need to get rid of line breaks and page breaks to make a nicely flowing epub text.
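The line breaks could also be handled after the fact with a bit of text processing. The sketch below, in Python, is one way it might go and not a finished solution: `unwrap_lines` is a made-up name, and it assumes that paragraphs in the OCR output are separated by blank lines. It also rejoins words split by a hyphen at the end of a line, which has a cost: a genuinely hyphenated word such as “wrong-doers” would lose its hyphen if it happened to break across two lines.

```python
import re


def unwrap_lines(text):
    """Join the lines of each paragraph into one flowing line."""
    # Rejoin words hyphenated across a line break, e.g. "remem-" at the
    # end of one line followed by the rest of the word on the next.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat blank lines as paragraph boundaries (an assumption about
    # the OCR output) and collapse all other whitespace to single spaces.
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

Run on the sample page, the two lines “had gathered sticks where he ought not, and” and “this was for the twentieth time; and another” would come out as one continuous line joined by a single space.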
Overall, I was impressed with the recognition accuracy and the ease of use.