Okay, I have been messing around with dozens of workflow options, and I have finally settled on one version. It is not necessarily the most efficient, but I found it has the best balance between accuracy and automation. The basic steps are:
- preprocess image files with ScanTailor,
- OCR with YAGF and export as HTML,
- edit HTML with BlueFish,
- create ebook from HTML with Sigil.
You might say, “gosh that’s unnecessarily ugly!”
And you are partially right. Each of these applications has overlapping features that can do more than what I am using them for, but I found each has specific limitations and strengths. I am only using the features that the application is good at!
For example, the way I am using YAGF is very simple and automated. I could in fact use YAGF to preprocess the images and select the text areas for recognition, thus avoiding ScanTailor. However, YAGF’s image processing is not as good, and its interface is too cumbersome for these tasks. It is much more efficient to use ScanTailor and feed YAGF preprocessed images that require no further user input. But then, you say, why not replace YAGF with command-line Tesseract, automated with a Windows batch file or Linux shell script? Simply because YAGF automatically combines multiple pages into a single HTML file, whereas the command line generates one file per page. Thus, it saves me one transformation step (combining the HTML files and removing the extra headers), plus it gives me some simple visual feedback to catch any major recognition errors.
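To give a feel for the transformation step that YAGF saves me, here is a minimal sketch of merging per-page HTML files by hand. It is not part of my workflow, and the names `merge_pages` and `BODY_RE` are made up for illustration. It assumes each page is a small, well-formed HTML document with a single `<body>` element, which real OCR output may not always honor.

```python
import re

# Matches the contents of the first <body> element in a page.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def merge_pages(pages, title="Merged OCR output"):
    """Combine per-page HTML strings into one document with a single header."""
    bodies = []
    for html in pages:
        match = BODY_RE.search(html)
        # Fall back to the whole string if a page has no <body> element.
        bodies.append(match.group(1).strip() if match else html.strip())
    joined = "\n".join(bodies)
    return (
        "<html>\n<head><title>" + title + "</title></head>\n"
        "<body>\n" + joined + "\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    page1 = "<html><head><title>p1</title></head><body><p>page one</p></body></html>"
    page2 = "<html><head><title>p2</title></head><body><p>page two</p></body></html>"
    print(merge_pages([page1, page2]))
```

In practice you would read the page strings from the files the command line produced, in page order, and write the result to a single file.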
Furthermore, you may ask, why export the OCR as HTML when all you want is the text? Ah, well, that is the clever bit, sort of… The hOCR output is intended for creating searchable PDFs using utilities that combine the page images with text tagged by layout position (for example, check out hocr2pdf, available with the ExactImage utilities on Linux). I don’t in fact need any of the tags. However, they make it much easier to reformat the text into the form I want because I can easily find the line, paragraph, and page breaks. The plain text doesn’t have enough information to transform consistently: it’s hard to tell the difference between a page break, a paragraph break, and a random spacing error. Thus, the HTML gives me more than I need, but it is easy to strip it down to what I want. The plain text doesn’t give me enough, and it’s impossible to build it up except by tedious manual labor.
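Here is a rough sketch of what “stripping it down” can look like in code. I do this step by hand in an editor, so treat this only as an illustration. It assumes the HTML uses the hOCR class names that Tesseract emits (`ocr_page`, `ocr_par`, `ocr_line`); YAGF’s export may differ in detail. The names `HocrText` and `hocr_to_text` and the page-break marker are my own inventions.

```python
from html.parser import HTMLParser

PAGE_BREAK = "\n\n<!-- page break -->\n\n"  # arbitrary marker, pick your own


class HocrText(HTMLParser):
    """Collect text as pages -> paragraphs -> lines, guided by hOCR classes."""

    def __init__(self):
        super().__init__()
        self.pages = []  # each page is a list of paragraphs (lists of lines)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_page" in classes:
            self.pages.append([])            # start a new page
        elif "ocr_par" in classes and self.pages:
            self.pages[-1].append([])        # start a new paragraph
        elif "ocr_line" in classes and self.pages and self.pages[-1]:
            self.pages[-1][-1].append("")    # start a new line

    def handle_data(self, data):
        text = " ".join(data.split())        # collapse whitespace
        # Ignore text that is not inside a page, paragraph, and line.
        if text and self.pages and self.pages[-1] and self.pages[-1][-1]:
            line = self.pages[-1][-1][-1]
            self.pages[-1][-1][-1] = (line + " " + text).strip()


def hocr_to_text(html):
    """Return plain text with blank lines between paragraphs and a marker between pages."""
    parser = HocrText()
    parser.feed(html)
    parser.close()
    pages = []
    for page in parser.pages:
        paragraphs = [" ".join(l for l in lines if l) for lines in page]
        pages.append("\n\n".join(p for p in paragraphs if p))
    return PAGE_BREAK.join(pages)
```

Lines within a paragraph are joined with spaces, so the line breaks that came from the printed page disappear while paragraph and page breaks survive. That distinction is exactly what the plain text export loses.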
So there you go. Four steps, four tools. I will explain it all soon!