One frustration is that although most OCR platforms can automatically detect page layout, they are not very good at it. Titles are often split into multiple sections instead of a single block, imperfections at the edge of the page are detected as images, and lines with a bit of extra space are detected as columns. Furthermore, they do not let you choose a single page element to recognize without adjusting every page manually. In the case of Aladore, each page has a header (saying “Aladore”), a text block (the actual text), and a footer (the page number). I would like to simply choose the text block and ignore everything else, but that is only possible by manually selecting the text on each page. Not a very viable option unless you have PLENTY of time.
We could just OCR the whole page and deal with the header and footer by post-processing the text output. However, as I have mentioned in other posts, getting rid of the header and footer is complicated. You cannot just find and replace “Aladore” and all the numbers, since they may also appear in the text body. Furthermore, because the header and footer use different font sizes and character spacing, they generate more OCR errors than the main text, which makes them even harder to isolate and delete.
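To see why naive find-and-replace goes wrong, here is a minimal sketch (the sample text is invented for illustration) of what happens when a page's body happens to contain the header word and a number:

```python
import re

# Invented sample OCR output: header line, two body lines, footer line.
page_text = """ALADORE
Then Ywain went out from Aladore, and the number 12
was written above the gate.
12"""

# Naive cleanup: delete every "Aladore" and every run of digits.
cleaned = re.sub(r"(?i)aladore", "", page_text)
cleaned = re.sub(r"\d+", "", cleaned)

# The header and footer are gone -- but so are the legitimate
# "Aladore" and "12" inside the sentence itself.
print(cleaned)
```

And this is the easy case: in practice the header and footer come back from OCR with their own errors (“ALAD0RE”, “I2”), so even a more careful pattern would miss them.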
I decided it might be simplest to crop away all the extra page information before doing OCR. The idea is to go from a full page image down to just the text block I want, eliminating the header and page number.
This sounds like it should be an easy batch process, but reality… is different, because the text block is not in the same place on every page. There is considerable variation in the printed book, compounded by variation in position during scanning.
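The batch idea itself is trivial; a fixed-box crop is a few lines with Pillow (a sketch with a synthetic blank page standing in for a real scan, and made-up crop coordinates). The catch is exactly the problem above: the hard-coded box only works if the text block sits in the same spot on every page.

```python
from PIL import Image

# A synthetic 1240x1754 "page" stands in for a real scanned image here.
page = Image.new("L", (1240, 1754), color=255)

# Crop box is (left, upper, right, lower) in pixels. These coordinates
# are invented; on a real book the text block shifts from page to page,
# so a single fixed box clips text on some pages and keeps the header
# on others.
text_block = page.crop((150, 220, 1050, 1600))
print(text_block.size)  # (900, 1380)
```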
To overcome these issues, I evaluated a number of options. I will list a few possibilities here for informational purposes, although in the end I decided to use ScanTailor. ScanTailor will be covered in a separate post, but these rejected options may be useful for other applications.
The simplest way to crop images is (of course) to use photo management software. There are a number of options that enable efficient cropping. For example, on Windows I use FastStone Image Viewer (freeware). When you open a crop window, you can set a size and move quickly through an entire directory. This way you can move the crop box around for each page as needed, while keeping the general size without reselecting it every time. It is not ideal, because you have to look at every page individually, but a 400-page book won’t take too long…
For PDFs, I use a helpful cropping utility called Briss (GNU GPLv3). Imagine you have a PDF article you want to read on an e-reader: it has huge margins with a tiny text body in the middle of the page, or the text is split into columns. To make it easier to read, Briss lets you crop away all the empty space or turn the columns into separate pages. The trick is that Briss first clusters the pages based on their layout, so you can apply different crops to different clusters of pages.
This helps overcome the issue of the text body’s variable position on the page. However, it is important to note that Briss does not actually change the image embedded in the PDF. It only changes the crop recorded in the PDF metadata, so that the page is displayed the way you selected. This means using Briss alone won’t improve our OCR, because Tesseract extracts the full image files from the PDF before doing OCR.
Theoretically, we could still use it via this ugly workaround: 1) put the page images into a PDF, 2) crop with Briss, 3) “print” to a new PDF, thus discarding the unnecessary image data, 4) OCR. The main problem with this solution is that our images are JPGs, a lossy format. Each time they are reprocessed, the quality goes down: a JPG put into a PDF and then extracted is noticeably lower quality, which will hurt the OCR. No good.
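The generation-loss claim is easy to check for yourself. A small Pillow sketch (with a synthetic high-frequency image standing in for a scanned page) shows that even a single JPEG encode/decode round trip alters pixel values:

```python
from PIL import Image
import io

# Synthetic high-frequency grayscale image standing in for a scanned page.
img = Image.new("L", (200, 200))
img.putdata([(x * y) % 256 for y in range(200) for x in range(200)])

def jpeg_roundtrip(im, quality=75):
    # Encode to JPEG in memory, then decode again -- one "reprocessing" pass.
    buf = io.BytesIO()
    im.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("L")

once = jpeg_roundtrip(img)

# Total absolute pixel difference after a single round trip.
diff = sum(abs(p - q) for p, q in zip(img.getdata(), once.getdata()))
print(diff > 0)  # True: one pass alone already changed the pixels
```

Every extra pass through the PDF-embed-and-extract pipeline is another round trip like this, each one smearing the glyph edges that OCR depends on.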
This post is getting long, so I will move the final tool idea to the next post…