Okay, here we go! Are you excited?
But wait, “post-processing”? Yeah, I guess it’s all in your point of view: I am calling it pre- because it’s processing before I feed it to Tesseract-OCR; they call it post- because you do it after scanning to make the images more usable. In any case, ScanTailor is basically a batch image editing platform optimized for efficiently turning unedited page scans into lovely readable pages for PDF, DjVu, etc. You input any type of image file, and you output black and white TIFF page images. Many of the GUI OCR applications have similar processing built in; even Tesseract via command line does some image pre-processing. ABBYY goes a step further and does the final PDF creation as well.
It is limited (niche?), but what ScanTailor does, it does well and efficiently. And it’s free. It runs on Windows and Linux; I use it with Ubuntu 14.
A full user guide can be found here: https://github.com/scantailor/scantailor/wiki/User-Guide
And here’s a nice video tutorial to get you started: http://vimeo.com/12524529
So what am I trying to do with ScanTailor?
As I described in an earlier post, every Aladore page image has a header and footer that will be annoying, ugly, and generally undesirable in our final EPUB edition. Furthermore, the header and footer tend to cause OCR errors, so they are hard to eliminate from the text output. None of the GUI OCR options managed to consistently select only the text block (ABBYY can identify headers, but it wasn’t 100% reliable and didn’t separate the page numbers). The exact location of the text block wanders around from page to page, so we cannot pre-set a selection area for groups of images. We could manually select the text block of each page via the interface of YAGF or OCRFeeder, but I found both too cumbersome to efficiently work through hundreds of pages. So as an alternative, I decided to crop each page down to the main text block before carrying out OCR.
ScanTailor can do this cropping efficiently and has the added bonus of excellent image processing to prepare the pages for OCR.
Here is my workflow:
1) Set up batches of images. I decided to work on about 60 pages at a time to make things more manageable: a large enough batch to be efficient, but not so large that it takes a huge amount of time to work through. I divided Aladore into six sections (each in its own directory), being careful not to break up any chapters.
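If you would rather plan the batches than eyeball them, here is a minimal Python sketch of the idea. It is not part of ScanTailor: the function name, the chapter labels, and the page counts are all made up for illustration. It just groups whole chapters until the next one would push a batch past the target size.

```python
def batch_chapters(chapter_page_counts, target=60):
    """Group consecutive chapters into batches of at most about `target` pages.

    A chapter is never split across batches. A batch is closed as soon as
    adding the next chapter would push it past `target`.
    """
    batches, current, current_pages = [], [], 0
    for chapter, pages in chapter_page_counts:
        if current and current_pages + pages > target:
            batches.append(current)
            current, current_pages = [], 0
        current.append(chapter)
        current_pages += pages
    if current:
        batches.append(current)
    return batches


# Hypothetical chapter lengths, not the real Aladore page counts
example = [("ch1", 25), ("ch2", 30), ("ch3", 20), ("ch4", 35), ("ch5", 10)]
print(batch_chapters(example))
```

A single chapter longer than the target still ends up in a batch of its own, so nothing is ever cut in half.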
2) Start a new ScanTailor project, i.e. start the application and load one of the Aladore directories.
3) Fix DPI. Basically, you can estimate the DPI by dividing the pixel dimensions of the digital image by the physical dimensions (in inches) of the area it imaged, i.e. roughly the original printed page size. In his video tutorial, Joseph Artsimovich also gives a method to estimate it by measuring the pixel height of six lines of text in the image. I estimated my Aladore images to be between 300 and 400 DPI, but it’s not essential to get it exactly right. I decided to just use 300×300.
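The page-size estimate is just a division, which can be sketched in a couple of lines of Python. The numbers below are hypothetical and were not measured from the Aladore scans; the function name is mine.

```python
def estimate_dpi(pixels, inches):
    """Dots per inch: image pixels divided by the physical length they cover."""
    if inches <= 0:
        raise ValueError("physical size must be positive")
    return pixels / inches


# Hypothetical example: a scan 2400 pixels tall of a page about 7.5 inches tall
print(estimate_dpi(2400, 7.5))  # 320.0
```

Anything that lands in the right neighborhood is good enough here, since the value only has to be roughly right.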
4) Select Content. Since the Aladore images are already processed, we can skip the Fix Orientation, Split Pages, and Deskew steps. For good results, the selection cannot be fully automated. First, I do two manual selections to get the batch started: one for all the left-hand pages and one for all the right-hand pages of the book, since the location of the text block tends to be more similar among pages on the same side. I go to an average page and resize the selection box to be below the header, above the page number, and almost the full page width horizontally. Then I click “Apply to” and choose “every other page.” Then I repeat this on the next page to set the selection for the other half of the book.
This will give us a pretty good selection, but now I go through the tedious step of checking every single page. Since blotches and marks on the paper often cause random OCR errors, it is best to get the selection box close to the text. ScanTailor will white out everything outside the box, thus eliminating the many flaws. At a minimum, I tweak the first and last page of each chapter, since the text block is smaller there. Beyond that, about 1 in 6 pages in Aladore needed just a little tweak up or down. This doesn’t take as long as it sounds, less time than it took to write this post…
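To picture what the two “every other page” passes in step 4 do, here is a small Python sketch. It assumes that “every other page” means all pages in the batch with the same odd/even position as the current page; that is my reading of the option, not something checked against ScanTailor’s code. The box labels are placeholders.

```python
def apply_to_every_other(boxes, current_index, box):
    """Return a copy of `boxes` with `box` set on every page that shares
    the odd/even position of `current_index` (assumed meaning of the option)."""
    updated = list(boxes)
    for i in range(current_index % 2, len(updated), 2):
        updated[i] = box
    return updated


# Five pages with no selection yet; "L" and "R" stand in for real boxes
pages = [None] * 5
pages = apply_to_every_other(pages, 2, "L")  # set from a left-hand page
pages = apply_to_every_other(pages, 3, "R")  # repeat on the next page
print(pages)  # ['L', 'R', 'L', 'R', 'L']
```

After those two passes every page has a starting box, and the page-by-page check only has to nudge the ones that are off.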
5) Margins. This step is mainly cosmetic to ensure consistent output. The area added by the Margins will be blank white. First, we can add “hard margins” around each selection, like you would with a Word document, to ensure that the text is not all the way at the edge of the page. Second, using “soft margins” we can decide if we want all the pages to be the same size and where the text selection is positioned in the resulting page. By default ScanTailor will add white space to make all pages the size of the largest in the batch. Having a consistent page size is actually helpful for our purposes, because the OCR will detect the blank space at the beginning and end of chapters. The defaults are fine for my purposes!
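The arithmetic behind the two kinds of margins can be sketched as follows. This is only my illustration of the behavior described above, not ScanTailor’s actual implementation, and the pixel sizes are hypothetical: a hard margin is added on every side of each selection, and then every page is padded out to the largest resulting width and height.

```python
def padded_page_sizes(content_sizes, hard_margin):
    """Add `hard_margin` pixels on every side of each (width, height) selection,
    then return those sizes plus the common page size all pages get padded to."""
    with_margins = [(w + 2 * hard_margin, h + 2 * hard_margin)
                    for w, h in content_sizes]
    page_w = max(w for w, _ in with_margins)
    page_h = max(h for _, h in with_margins)
    return with_margins, (page_w, page_h)


# Hypothetical selections: a full page, a short chapter-ending page, a slightly smaller page
sizes = [(1500, 2400), (1500, 900), (1480, 2380)]
per_page, common = padded_page_sizes(sizes, hard_margin=60)
print(common)  # (1620, 2520)
```

The short chapter-ending page comes out the same size as the full pages, with the difference filled in as white space.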
6) Output!! This stage allows you to set the parameters for image processing. You can set the DPI, Mode (B&W, color), adjust the filtering, add Dewarping, and adjust the noise reduction (Despeckling). The defaults worked well for Aladore, at 600 DPI (2x the input DPI).
Click the play button and ScanTailor starts the actual image processing, saving the output to a new directory. The computer works for a minute, and then we have a batch of black and white TIFFs ready for OCR!