2016-01-10

Update: Tesseract OCR in 2016

"Using Tesseract via Command Line" has consistently been the most popular post on Digital Aladore. However, the project's hosting and installers have changed since then, so I thought I should update the information.

Tesseract used to be hosted at Google Code, which closed up shop in August 2015. The project has transitioned to GitHub, with the main page at https://github.com/tesseract-ocr/tesseract and the wiki at https://github.com/tesseract-ocr/tesseract/wiki

There is no longer an official Windows installer for the current release. If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)

If you want a more up-to-date version on Windows, you have to compile it yourself (unless you use Cygwin or MSYS2) or use one of these third-party installers:

A nice portable install of both engine and language data from a guy named Simon
A bleeding edge installer from Mannheim University Library

It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.
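As a rough illustration of the unzip step, here is a minimal Python sketch that copies the language files out of a downloaded zip into a "tessdata" folder. The function name and any paths you pass in are hypothetical, and it assumes the language files end in ".traineddata". The location of "tessdata" depends on how you installed the engine, so check your own install before running anything like this.

```python
import zipfile
from pathlib import Path


def install_language_data(archive, tessdata_dir):
    """Copy every .traineddata file from a zip archive into tessdata_dir.

    Both arguments are paths chosen by the caller; nothing here knows
    where Tesseract itself is installed.
    """
    archive = Path(archive)
    tessdata_dir = Path(tessdata_dir)
    tessdata_dir.mkdir(parents=True, exist_ok=True)

    installed = []
    with zipfile.ZipFile(archive) as zf:
        for member in zf.infolist():
            name = Path(member.filename).name
            if member.is_dir() or not name.endswith(".traineddata"):
                continue  # skip folders, READMEs, and other extras
            # Write the file flat into tessdata, ignoring the zip's folders.
            (tessdata_dir / name).write_bytes(zf.read(member))
            installed.append(name)
    return sorted(installed)
```

Calling it with the downloaded zip and your tessdata folder returns the list of language files it copied, which is a quick way to confirm that the archive contained what you expected.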

On Linux, installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and build it with make.
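Whichever route you take, it is worth checking that the engine is on your PATH before going further. The sketch below is one way to do that from Python. It assumes the first line of the "tesseract --version" banner looks like "tesseract 3.04.00", and it reads both output streams because some releases print the banner to stderr. The helper names are made up for this example.

```python
import shutil
import subprocess


def parse_version(banner):
    """Return the version token from the first line of the banner, or None."""
    lines = banner.strip().splitlines()
    first_line = lines[0].split() if lines else []
    if len(first_line) >= 2 and first_line[0] == "tesseract":
        return first_line[1]
    return None


def installed_version():
    """Return the installed Tesseract version, or None if it is not found."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Combine both streams, since the banner may arrive on either one.
    return parse_version(result.stdout + result.stderr)
```

If installed_version() returns None, the engine is either not installed or not on your PATH.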

Now that you have it installed, the commands on the old post should work fine!
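If you would rather script the OCR step than type it each time, the basic "tesseract image outputbase -l lang" call can be wrapped in a few lines of Python. This is only a sketch: the function names and file names are placeholders, and it assumes the tesseract command is on your PATH and that the language you ask for is present in tessdata.

```python
import subprocess


def build_command(image, output_base, lang="eng"):
    """Build the argument list for a basic Tesseract run."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_ocr(image, output_base, lang="eng"):
    """Run Tesseract and return the name of the text file it writes.

    Raises FileNotFoundError if the tesseract command cannot be found,
    and CalledProcessError if Tesseract exits with an error.
    """
    subprocess.run(build_command(image, output_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name by default.
    return f"{output_base}.txt"
```

For example, run_ocr("page.png", "page") would run "tesseract page.png page -l eng" and return "page.txt".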
