Now that I have big batches of nicely pre-processed images from ScanTailor, it's finally time to capture some text!
To get started, I click on the Open Image icon, navigate to the ScanTailor output directory for a batch, and select all the images (about 60). Next, click on Settings > OCR Settings to make sure it is set to Tesseract-OCR. Then, make sure the HTML output icon is selected.
Now, I click Recognize all pages and let it go to work!
After a few minutes, the recognized text shows up in the right panel.
I quickly scan through it looking for any issues. At this point I am not going to fix any formatting; I am just checking for anything that looks majorly scrambled. I was actually amazed by the accuracy. There were very few errors that I could detect, and I fixed only a handful per batch. Spelling errors are highlighted with a red underline, so I check those as I go along. There are only a couple of actual spelling errors in a batch, usually two words stuck together. In Aladore, 99% of the detected issues are just names or archaic words, since it is filled with things like “assotted”, “aforetime”, and “thitherward”. You start to see a few pattern errors emerge: capitalized “SO” at the beginning of a chapter often becomes “80”; “ff” often becomes “fI” or “f1” or “f'”; or “O” becomes zero. It is easy to check the text against the original page by choosing it from the thumbnails on the left side.
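Since those pattern errors repeat so predictably, they could also be flagged by a small script rather than by eye alone. This is only a rough sketch, not something YAGF or Tesseract provides: the function name, the labels, and the regular expressions are my own assumptions based on the confusions listed above, and it assumes the recognized text has been copied out as plain text. It only points at suspicious spots; each one still has to be checked against the page image.

```python
import re

# Hypothetical patterns built from the confusions described above.
# They are illustrative assumptions, not part of YAGF or Tesseract.
SUSPECT_PATTERNS = {
    "SO read as 80": re.compile(r"\b80\b"),
    "ff read as fI, f1 or f'": re.compile(r"[a-z]f[I1']"),
    "O read as zero": re.compile(r"[A-Za-z]0|0[A-Za-z]"),
}

def flag_suspects(text):
    """Return (line_number, label, matched_text) for each suspicious match."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((line_number, label, match.group(0)))
    return hits

# Made-up sample text containing one instance of each confusion.
sample = "80 it befell aforetime.\nHe rode of1 thitherward.\nThe kn0t held."
for hit in flag_suspects(sample):
    print(hit)
```

The patterns are deliberately loose, so they will raise false alarms (a genuine “80” or a possessive like “wolf's” typed with a straight apostrophe, for example). That seems like the right trade-off here, since the goal is to replicate the printed text and a person makes the final call on every change.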
It's important to note that at this point we are entering the realm of editorship. We are making decisions about the transmission of the text. I am trying to just replicate what is in the printed text, but what if the page has a blemish obscuring a good reading, or if the printer made a mistake? Even replication requires interpretation. The errors I miss at this step will be perpetuated downstream…
Back to practical matters: while working on these batches, I noticed YAGF has two annoying interface issues. First, when clicking an image thumbnail in the left pane, the image appears in the main window at something like 500%, so you can’t see anything; you need to zoom out several times to actually view the page. Second, there is no way to remove all the items from the current project or start a new project. If you don’t want to spend your time dragging each image to the trash icon to remove them individually, you simply have to close the program and reopen it.
Although the OCR was surprisingly accurate, there is one area where Tesseract has major trouble: italics. If you notice any italics in the page images, be sure to fix those passages manually. They will be obvious in the text, because they will be a jumble of nonsense!
After checking the text over, I click the Save icon and save as HTML. That’s it! Done.
Assuming you don’t have any italics, this step is amazingly simple and goes quickly!
I will show you what the resulting text looks like in the next post…