At the beginning of the project, I was planning to use commercial OCR software to compare with the open options. However, at this point I have decided NOT to create a full text with a proprietary platform, and will just focus on using Tesseract. I just don’t have enough time to do both well! To give you a sense of what it would be like to use a nice proprietary system, here is an extensive post about ABBYY FineReader:
Founded back in 1989, ABBYY is based in Russia and is probably the most common commercial OCR engine. Their products are often integrated into document processing platforms and content management systems. For example, many museums and libraries use the CONTENTdm digital collection management system to provide access to digitized materials. A FineReader plugin can OCR images as they are uploaded, making the content searchable. Of course, the plugin costs thousands of dollars on top of your tens of thousands of dollars in subscription fees!
ABBYY also develops a desktop GUI OCR platform featuring powerful PDF creation and editing capabilities that rival Adobe Acrobat. It will cost you about $169, although you can get a very limited free trial (30 days, 100 pages, 3 pages at a time). This price tag gets you more out-of-the-box functionality than any open software. For example, Tesseract can recognize around 40 languages, but it comes only with English, so the others require downloading and installing training packages. FineReader comes with 190 languages and 48 language dictionaries to refine its recognition. Tesseract theoretically allows users to create custom word lists, but no one seems able to get the feature to work. FineReader integrates with Microsoft Word’s dictionaries so you can easily set up a custom word list. Finally, many specialized office/library tasks are optimized, for example, creating searchable PDF/A documents designed for long-term digital archival storage.
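For comparison, here is a rough sketch of what the Tesseract side of this looks like when scripted from Python. It only assembles the command line; the helper name `build_tesseract_command` and the file names are my own placeholders. It assumes the language packs you ask for are already installed and, for the `pdf` option, a Tesseract release that ships the `pdf` output config (3.03 or later, as far as I know).

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, languages=("eng",), searchable_pdf=False):
    """Return the argument list for one Tesseract run (hypothetical helper)."""
    # Several installed language packs can be combined with "+", e.g. "eng+fra".
    command = ["tesseract", str(image), str(output_base), "-l", "+".join(languages)]
    if searchable_pdf:
        # Assumption: the installed Tesseract includes the "pdf" output config.
        command.append("pdf")
    return command


def run_ocr(image, output_base, **options):
    """Run Tesseract on one image, failing early if it is not installed."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, **options), check=True)


if __name__ == "__main__":
    # Placeholder file names; prints the command instead of running it.
    print(build_tesseract_command("page_001.png", "page_001", languages=("eng", "fra")))
```

Nothing here is specific to Aladore; it just shows that the language choice and the output type are command-line arguments in Tesseract rather than menu settings.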
I have used a few older versions of FineReader on computers at work. Here is a screenshot of FineReader 11, which looks about the same as every other platform reviewed at Digital Aladore so far:
I decided to download the free trial of FineReader 12 and test a few pages of Aladore. FineReader 12 has been updated to match the Windows 8 style, but doesn’t make great use of screen space unless you have a giant monitor:
The interface and workflow are basically the same as the open platforms: open some images, recognize the text, edit it in the window, and output in a new format. The display offers a few more advanced features, such as highlighting low-confidence recognitions and non-dictionary words so that you can quickly check them. Basically, FineReader just offers MANY more options and settings, allowing more advanced control.
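The idea behind that highlighting is simple enough to sketch. This is NOT how FineReader does it internally (I have no idea); it is just a minimal illustration that assumes you already have each recognized word paired with a confidence score from 0 to 100, plus a set of known words. The function name, the threshold of 80, and the sample data are all made up for the example.

```python
def flag_words_for_review(words, dictionary, min_confidence=80):
    """Return (word, reasons) pairs for words a proofreader should check.

    words: iterable of (text, confidence) pairs, confidence from 0 to 100.
    dictionary: set of known lowercase words.
    """
    flagged = []
    for text, confidence in words:
        reasons = []
        if confidence < min_confidence:
            reasons.append("low confidence")
        # Ignore surrounding punctuation and case before the dictionary lookup.
        normalized = text.strip(".,;:!?\"'()").lower()
        if normalized and normalized not in dictionary:
            reasons.append("not in dictionary")
        if reasons:
            flagged.append((text, reasons))
    return flagged


if __name__ == "__main__":
    # Invented sample data, not real OCR output.
    sample = [("Aladore", 95), ("the", 99), ("tbe", 60)]
    known = {"aladore", "the"}
    for word, reasons in flag_words_for_review(sample, known):
        print(word, reasons)
```

A proper name like "Aladore" only passes because it was added to the word set by hand, which is why a custom word list matters so much for a text like this.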
Here is how it compares to Tesseract:
- processing is a bit slower
- layout detection is better and more nuanced (can detect headers/footers, tables, etc.)
- accuracy is the same (with a simple English text like Aladore)
However, output is where ABBYY is considerably different from other options. First, its excellent layout detection allows it to reproduce complex formatting such as tables, sidebars, or bulleted lists. Second, it can create a wide variety of output files, including Word documents, full-featured PDFs, spreadsheets, and ebooks. Yes, it can directly create an EPUB!
The EPUB formatting is not great, so it would still require a lot of editing and tweaking…
Yeah, FineReader is pretty slick, but it’s not that great! It is still a lot of work to create a good text. Most people use FineReader to efficiently create searchable PDFs without editing the OCR output. Since they see the image layer as the main means of actually reading the text, the underlying transcript doesn’t have to be very accurate.
I will stick with Free software, and end up with the same results…