Linux has a few good Free GUI OCR options that are still actively developed. The two most popular applications are YAGF and OCRFeeder, both easily installed via repositories or software center, and both licensed under the GNU GPLv3. You will need to install the OCR engines separately, so make sure you have the tesseract-ocr packages as well. YAGF integrates only with Tesseract and Cuneiform, while OCRFeeder can detect any installed engine.
To test the ease of use, I loaded the first three pages of Aladore unedited, recognized the text, and saved the output.
- YAGF’s very simple interface and workflow make jumping right into OCR easy. Just select some images and click recognize all pages. The text will show up as a single document in the right-hand panel. If you change any options and click recognize again, the text is added to the end, which is convenient for quickly comparing output. To save the text, cut and paste, or click save to create a plain text or HTML file. Switching between Tesseract and Cuneiform is a simple menu selection. A number of other features, such as auto preprocessing, page skew correction, auto region detection, and spell check, can be accessed from the other icons on the interface. It will take a bit of playing around before you figure out how to use them efficiently. The interface is not completely polished and the options are fairly basic, but overall YAGF is a complete and easy-to-use OCR platform. It was last updated in August 2014; however, the most up-to-date version, 0.9.4.2, is not available in package repositories (Ubuntu Software Center offers 0.9.2.1).
- OCRFeeder has a more polished interface, but the workflow is not as straightforward as YAGF’s. It offers more fine-grained control of each text/image region of the document. Unfortunately, unless you use the fully automatic detection, this means the process takes much longer. OCRFeeder supports any OCR engine you install on your system, and you can actually use different engines on each text region. For preprocessing, OCRFeeder integrates Unpaper, a tool created to process scanned book page images. Again in contrast to YAGF, it offers more control over the options. One nice feature is output to ODT format (OpenDocument Text, the standard format for Open/LibreOffice Writer). Each of the detected regions is translated to a text/image box in the document. This means you can translate the page layout to an editable format, rather than just a searchable PDF or plain text. However, I don’t need any of this functionality for Digital Aladore! The overly complicated workflow is not necessary since I just want the simple, unformatted text. Still, if you have a document with lots of images that you want to properly capture and embed in your new text, OCRFeeder’s workflow would be better than YAGF’s. Version 0.8 was released in August 2014, although most repositories still offer only 0.7.11 (February 2013).
There are a few other options which I quickly tested, but found more work to install and/or use than necessary while offering no further functionality. These include tesseract-gui, gImageReader, lios, and ocrivist.
For my purposes, YAGF seems like the best OCR GUI, but it’s a toss-up for now.
The main use of the GUIs is to help with layout detection. I want to detect the text body only (leaving off the header and page numbers), but these applications cannot do this automatically. It would be too much work to go through 400 pages and manually adjust the text box, so I need to look for a preprocessing method that eliminates the headers and footers. Although a GUI makes OCR convenient, I am not sure it is necessary for this project, since it may be just as easy to create a batch process that runs the pages directly through Tesseract.
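As a rough sketch of what such a batch process might look like, the Python below builds one Tesseract command per page image and computes a crop box that trims a fixed band from the top and bottom of each page. It is only an illustration under assumptions, not a tested workflow for Aladore: the function names, the directory arguments, the `*.png` pattern, and the 8% header/footer bands are all hypothetical, and the crop box is only calculated here (it would still need to be applied with an image library before OCR). The command itself follows the basic `tesseract imagename outputbase -l lang` form.

```python
import subprocess
import sys
from pathlib import Path


def body_crop_box(width, height, header_frac=0.08, footer_frac=0.08):
    """Return (left, top, right, bottom) covering the text body only.

    The fractions are guesses (assumptions); they would need tuning
    against the actual page scans.
    """
    top = int(height * header_frac)
    bottom = height - int(height * footer_frac)
    return (0, top, width, bottom)


def tesseract_command(image_path, out_dir, lang="eng"):
    """Build the command line for one page.

    Tesseract appends ".txt" to the output base name itself.
    """
    image_path = Path(image_path)
    output_base = Path(out_dir) / image_path.stem
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def batch_commands(image_dir, out_dir, pattern="*.png"):
    """One command per image, in sorted (page) order."""
    pages = sorted(Path(image_dir).glob(pattern))
    return [tesseract_command(page, out_dir) for page in pages]


if __name__ == "__main__":
    # Usage (hypothetical script name): python batch_ocr.py pages/ text/
    image_dir, out_dir = sys.argv[1], sys.argv[2]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in batch_commands(image_dir, out_dir):
        subprocess.run(cmd, check=True)  # stop on the first failed page
```

The 4-tuple from `body_crop_box` uses the (left, upper, right, lower) order that common imaging libraries expect for cropping, so the cropping step could be slotted in ahead of the Tesseract call once suitable band sizes are known.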