I have been playing around with the many no-cost GUI OCR platforms out there, and I want to report some findings. This post looks at some options available on Windows, none of which I found to be viable for the purposes of Digital Aladore. Another post will cover the options on Linux.
If you want to test out some super quick OCR without installing anything, the easiest is to use Google Drive.
Go into your Drive, click on the settings icon (the gear), navigate down to Upload settings, and click on “Convert text from uploaded PDF and image files.” Now upload an image and it will appear in a new Google Doc with the OCR text below it. This feature is also available in the Drive app for Android, so you can directly OCR your photo “scans” of documents. Obviously this is not a solution for Aladore, but I thought I would pass it on in case you have some immediate OCR needs!
As for desktop GUI OCR platforms on Windows, here are some notes just to get the information out there, even if it's a negative result for this project:
- This is an odd site that provides lots of information about the main commercial OCR platforms, plus a download of its own freeware. It is actually part of a niche document imaging retailer, the ScanStore. SimpleOCR is free for non-commercial use, but is not open source. Word recognition is good. However, on nearly every page it interprets large spaces in the middle of a paragraph as column breaks, which jumbles the order of the text. This kind of mistake is impossible to recover from without painstaking cut and paste, and there are no settings to change the behavior. This platform would be unusable with Aladore.
- This is a commonly cited freeware (not open) option that uses the Tesseract engine. However, it was consistently blocked by my security programs as containing a virus. Numerous reviews mention that the installer includes PUPs (potentially unwanted programs) and malware. I am not sure what to think, but I decided it wasn’t worth pursuing since there is nothing special about it.
- Basically a Windows version of the Linux application Tesseract-GUI. It is open source (GNU GPL v3), but hasn’t been updated since 2010. The project never really got off the ground, so it is too buggy to use.
- The interface looks pretty good (more like the ABBYY setup), but I could not get it to actually recognize any text. It is Python-based and open source (GNU GPL v2), but it hasn’t been updated since 2011.
- gImageReader is the most usable on this list and has a free, open source license (GNU GPLv3). It is easy to install on Windows and works without tweaking. However, it is slow at loading page images and at layout detection. Also, the OCR is somehow less accurate than Tesseract via the command line or via the Linux GUI options; I don’t know how this is possible. It does have a number of nice features for editing and polishing the output text. It was last updated in April 2014.
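
If you want to repeat the comparison between a GUI and Tesseract on the command line, something like the following minimal sketch could work. It assumes the `tesseract` executable is already installed and on your PATH, and the helper names (`build_tesseract_command`, `ocr_page`) and the file name in the usage note are made up for illustration, not taken from any of the tools above.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for a basic Tesseract command-line call.

    Tesseract's basic usage is: tesseract <image> <outputbase> -l <lang>
    It writes the recognized text to <outputbase>.txt.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def ocr_page(image_path, output_base, language="eng"):
    """Run Tesseract on one page image and return the recognized text.

    Assumption: the tesseract executable is installed and on PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, language)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

For example, `ocr_page("page_001.png", "page_001")` (a hypothetical file name) would write `page_001.txt` next to your script and return its contents, which you can then set side by side with the text a GUI produced for the same page image.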
But don’t worry, I found some other options on Linux that run much better!