Digital Aladore has been idle for a few weeks while I was ill and dealing with other deadlines, but hopefully we can get the steam rolling again…
It's time to start sorting out options for Optical Character Recognition (OCR) to capture the text of Aladore from our digital witnesses (i.e. page images). OCR has a surprisingly long history, in use since at least the 1960s for enterprise data entry applications and for improving accessibility for the visually impaired. Today, one of the most common ways to encounter OCR is as a feature built into PDF applications, such as Adobe Acrobat or PDF-XChange Viewer (I do not recommend either of these programs: Acrobat is too expensive, and PDF-XChange seems to be bloated nagware). This use of OCR is focused on creating a simple transcript underlying the page images to enable search. This is not ideal for Digital Aladore, since we want to create a non-PDF text. The other option is a dedicated desktop OCR platform. Probably the most used and most accurate OCR platform is the commercial application ABBYY FineReader. While I have used ABBYY at my work, it is expensive, proprietary software. I want to steer this project towards open source and Libre solutions.
There are basically four parts to most OCR platforms:
- image pre-processing (options that optimize the input for the chosen OCR engine, such as scaling, de-skew, and sharpening)
- layout analysis (recognizes the structure of a document to select text areas for OCR. This can provide position data for the resulting recognized text to reproduce the layout or allow it to be embedded below the PDF image)
- OCR engine (algorithms that can recognize characters in an image and output a text character. These are often optimized by “training” or lexicons that add context to the possible interpretations to improve accuracy.)
- GUI (helps the user efficiently process documents, combining access to the other elements, settings, and output options)
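To make the division of labor between the first three parts concrete, here is a toy sketch in Python. It is only an illustration under simplifying assumptions, not how any real OCR platform is written: the page is a tiny grid of grayscale values, and the names `binarize`, `find_text_lines`, `recognize`, and `run_pipeline` are made up for this example. The "engine" step is a placeholder that counts ink pixels instead of recognizing characters.

```python
from typing import List, Tuple

# A toy "page image": rows of grayscale pixels, 0 = black ink, 255 = white paper.
Image = List[List[int]]


def binarize(image: Image, threshold: int = 128) -> Image:
    """Pre-processing: reduce grayscale to pure black (0) and white (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in image]


def find_text_lines(image: Image) -> List[Tuple[int, int]]:
    """Layout analysis: return (start, end) row ranges that contain ink.

    The end index is exclusive, so image[start:end] is one band of text.
    """
    bands: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(image):
        has_ink = 0 in row
        if has_ink and start is None:
            start = y                      # a new band of text begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge
    return bands


def recognize(band: Image) -> str:
    """OCR engine placeholder: a real engine would classify glyph shapes here."""
    ink = sum(px == 0 for row in band for px in row)
    return f"<line with {ink} ink pixels>"


def run_pipeline(image: Image) -> List[str]:
    """Chain the stages: pre-process, find text areas, recognize each one."""
    clean = binarize(image)
    return [recognize(clean[a:b]) for a, b in find_text_lines(clean)]
```

The fourth part, the GUI, would simply sit on top of something like `run_pipeline`, exposing the threshold and other settings to the user.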
Tesseract is the most commonly used open source OCR engine (licensed Apache 2.0). Development began at Hewlett-Packard in 1985, and it was released as open source in 2005. Google has been sponsoring further development since 2006. The engine can be run as a command line application, but it is also used by a wide range of GUI applications. It is generally considered the best open source option. It is usually not as accurate as ABBYY, but it is faster.
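For a sense of what running Tesseract from the command line looks like, here is a minimal Python sketch that wraps the basic invocation `tesseract <image> <output base> -l <language>`. It assumes Tesseract is installed and on the PATH with English language data available; the helper names `build_tesseract_command` and `ocr_page` and any file names are hypothetical, chosen only for this example.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            language: str = "eng") -> List[str]:
    """Build the basic command: tesseract <image> <output base> -l <language>."""
    return ["tesseract", image_path, output_base, "-l", language]


def ocr_page(image_path: str, output_base: str, language: str = "eng") -> str:
    """Run Tesseract on one page image and return the recognized text."""
    # Assumption: the tesseract executable is installed and on the PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, language),
        check=True,  # raise an error if Tesseract exits with a failure code
    )
    # Tesseract appends ".txt" to the output base for plain text output.
    return Path(output_base + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_page("page.png", "page")` on a hypothetical page image would leave a `page.txt` file next to it and return its contents. Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.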
There are a few other open source OCR engines. CuneiForm was a commercial competitor to ABBYY for many years, but it was released under a free BSD license in 2008. It is usually not as accurate or fast as Tesseract. Ocrad, which is part of the GNU project, is a distinct engine used by some applications. It is very fast, but not very accurate.
We will be testing a couple of different GUIs and the main engines. To get an overview of the options, check out a comparison at Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software