Using Tesseract via Command Line has consistently been the most wildly popular post on Digital Aladore. However, due to some changes, I thought I should update the information.
For current information, see the project’s Wiki page at: https://github.com/tesseract-ocr/tesseract/wiki
There is no longer an official Windows installer for the current release.
If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)
- A nice portable install of both engine and language data from a guy named Simon
- A bleeding edge installer from Mannheim University Library
It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.
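Because the engine and the language data are installed separately, it can be worth confirming that Tesseract actually sees the language you unzipped. The following is a minimal sketch for a Unix-style shell, assuming a release that supports the `--list-langs` option; `has_language` is a hypothetical helper name, not part of Tesseract.

```shell
#!/bin/sh
# has_language is a hypothetical helper (not part of Tesseract).
# It succeeds if the given language code appears in Tesseract's language list.
# Assumption: the installed release supports the --list-langs option.
has_language() {
    lang=$1    # language code, e.g. eng
    # Some releases print the list on stderr, so merge both streams.
    tesseract --list-langs 2>&1 | grep -qx "$lang"
}
```

For instance, `has_language eng && echo "English data found"` should print the message only when `eng.traineddata` is in the “tessdata” directory that Tesseract is using.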
On Linux, installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and use make.
Now that you have it installed, the commands on the old post should work fine!
Okay, just one last tool background post before we hit the “real” workflow I settled on. As I touched on in an earlier post, Tesseract is surprisingly easy to use from the command line. Figuring out how to use it is a good chance to practice your old school computing skills. Here are some more extensive notes, in case you are interested in pursuing this option.
Tesseract has a basic manual that is barely human readable. So if you are trying to figure anything out, it might be easier to search the user group on Google, https://groups.google.com/forum/#!forum/tesseract-ocr.
There are a few basic things to know about input to maximize the quality of the output. The most likely image flaws to cause recognition errors are skew, uneven lighting, and black edges on the scans. These should be avoided during scanning or fixed before input (using ScanTailor, for example). The newer releases of Tesseract can handle most image formats and include automatic pre-processing (i.e. binarization, basically converting to black and white only, plus noise reduction). Some applications give you the ability to tweak or retrain the OCR engine. However, except with new languages or extremely odd fonts, this will not improve Tesseract very much. So don’t bother!
The standard output is a UTF-8 TXT file. If you add the option ‘hocr’, it will output a special HTML format that includes layout position for the recognized text. This file can be used to create searchable PDFs.
Command line use is pretty simple. It is easiest on a Linux system, but I thought I would describe the Windows workflow since many users don’t even realize command line is an option. The best way to use Tesseract directly on Windows is to look in the start menu folder “Tesseract-OCR”, right click the icon for “Console”, and choose “Run as Administrator” (if you don’t run as admin, tesseract will likely not have the correct permissions to actually create files).
Now just follow the basic recipe given by the manual, although you will have to supply file paths for the input and output files:
tesseract [filepath]\inputfile.name [filepath]\outputfile hocr
Don’t put an extension on the output file name because Tesseract will do it automatically. The hocr option is added if you want HTML output with layout information or is left off for plain text.
It would be insanely tedious to do more than one file this way, so luckily it’s very easy to create a Windows batch file to automate the process (or even easier via a Linux shell script; if you need to brush up on the Linux shell, check out the great resource LinuxCommand.org).
To create a batch file, open Notepad and follow this recipe, replacing “[filepath]” with the actual locations of the input files (e.g. “C:\temp\testscans\*.tif”, using the correct extension for your files) and the output directory (ending with a backslash, since the output file name is appended directly to it):
Set _SourcePath=[filepath]
Set _OutputPath=[filepath]
Set _Tesseract="C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% %%A %_OutputPath%%%~nA
Then click save as, type in a file name plus the extension “.bat”. This process will run Tesseract on each file with the given extension in the source directory, outputting a text file for each in the output directory. To run the batch process, simply double click the .bat file!
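For Linux users, a rough shell equivalent of the batch file might look like the sketch below. It assumes `tesseract` is on the PATH; `ocr_batch` and its three arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# ocr_batch is a hypothetical helper: it runs Tesseract on every image with a
# given extension in one directory and writes a text file per page to another.
# Assumption: the tesseract command is installed and on the PATH.
ocr_batch() {
    src_dir=$1    # directory holding the page images
    ext=$2        # image extension without the dot, e.g. tif
    out_dir=$3    # directory for the text output
    mkdir -p "$out_dir"
    for img in "$src_dir"/*."$ext"; do
        [ -e "$img" ] || continue            # the pattern matched nothing
        base=$(basename "$img" ".$ext")      # file name without its extension
        echo "Converting $img..."
        tesseract "$img" "$out_dir/$base"    # Tesseract adds .txt on its own
    done
}
```

After loading the function into a shell, a call such as `ocr_batch ~/testscans tif ~/testscans/text` (placeholder paths) would process every .tif file in the first directory.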
Okay, now you have a huge batch of .txt files that you want to combine. Time for some more command line! Open the console and use the command
cd [filepath] to navigate to the directory where all the text files are located. Then enter the command
copy *.txt combinedfile.txt which copies the content of every txt file in the directory into the new file “combinedfile.txt”.
Feeling like a Windows power user yet?
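On Linux, the same combining step can be sketched with `cat`. The version below is only an illustration: `combine_pages` is a hypothetical helper, and it assumes the page files sort correctly by name (zero-padded page numbers, for example) and that the combined file is written outside the directory of page files so it is never read back in as input.

```shell
#!/bin/sh
# combine_pages is a hypothetical helper: it concatenates every .txt file in a
# directory, in the sorted order the shell expands the pattern, into one file.
# Assumption: out_file lives outside txt_dir.
combine_pages() {
    txt_dir=$1     # directory holding the per-page .txt files
    out_file=$2    # path of the combined file
    : > "$out_file"                      # start with an empty output file
    for page in "$txt_dir"/*.txt; do
        [ -e "$page" ] || continue       # the pattern matched nothing
        cat "$page" >> "$out_file"
    done
}
```

A call might look like `combine_pages ~/testscans/text ~/combinedfile.txt`, again with placeholder paths.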
For example, recognizing the first page of Aladore 1914:
Tesseract generates this very accurate text without any preprocessing:
ALADORE. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YWAINl sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
While the CuneiForm engine generates this text:
ALA DORK, CHAPTER I. SIR Y%’AIM sat, in the Ha11 of Sulney and dKI ]Ustice upon %’fong-doers» And one &an had gathered sticks where he ought not, and this vedas for the twentieth time; and another had snared a rabbit of his lord,’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth tiine: so that Sir Vwain was weary of the sIght of theln» Moreover» his steward stood besN1e 4Imq and pQt him In reIACAI”
I kind of like “ALA DORK”… However, using YAGF’s automatic preprocessing improves the result greatly. After processing, CuneiForm generates this text (still not as accurate as Tesseract):
A L A D 0 R E. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YwAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’ s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredt.h time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
Ocrad is noticeably faster (all three can be tested on OCRFeeder), and generates this text:
OF THE HALL OF SULNEY AND HOW SIR Y_AIN LEFT IT.
SIR YWAIN’ sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; aod another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
After testing, I see no reason not to use the most popular option, Tesseract. It is extremely accurate out-of-the-box for a text such as Aladore, with no need for training, tweaking, or preprocessing. It can be used efficiently via command line or GUI applications. And finally, the license is Free.
Since we have decent individual page images for Aladore, our OCR workflow going forward is similar to what DIY Bookscanners are doing. Book scanners will shoot photographs of each page and aim to have the results at around 300 PPI. Our Aladore images are in that ballpark, so should produce reasonable OCR results. Basically, I want an OCR solution with these features:
- batch process hundreds of individual page images
- good accuracy
- OCR area can be selected (avoids further processing later on)
- corrects line breaks (avoids further processing)
I have a Windows 7 and an Ubuntu 14 machine at home, so I could use anything that runs on these platforms. The most basic option is to simply install Tesseract and run it from the command line/terminal. First, download and install the most recent version from https://code.google.com/p/tesseract-ocr or from a Linux repository. Then open a terminal and follow the recipe to test some OCR action!
The basic command is “tesseract imagename.extension outputfilename”. For example, this command runs OCR on the first page of Aladore and outputs the text as a file called “test1914tess.txt”:
tesseract aladoren00newbuoft_0021.jpg test1914tess
Here is what the page image looks like:
And here is the Tesseract output:
OF THE HALL OF SULNEY AND HOW
SIR YWAIN LEFT IT.
SIR YWAINl sat in the Hall of Sulney and
did justice upon wrong-doers. And one man
had gathered sticks where he ought not, and
this was for the twentieth time; and another
had snared a rabbit of his lord’s, and this was
for the fortieth time; and another had beaten
his wife, and she him, and this was for the
hundredth time: so that Sir Ywain was weary
of the sight of them. Moreover, his steward
stood beside him, and put him in remem-
1 Ywain=Ewain or Ewan.
So the results are very accurate! The only problem it has is with the subscript 1 for the footnote about Ywain. Luckily, this is the last footnote of the book (I think). However, in terms of generating our good reading text there are some limitations which point out the challenges going forward (some of which can be solved by using different software).
First, every page in the book has the header “Aladore” and a page number in the footer. I don’t want those in the reading text. The Aladore header is fairly hard to remove in a batch process: you cannot just find and replace because the word appears within the text as well. The page numbers will also just cause us editing headaches going forward. There are basically two options to get rid of them: 1) crop them out of the page images before OCR or 2) select an area for OCR rather than recognizing the whole page. Either option would be easy if every page was EXACTLY the same, but they are not. The text we want wanders around quite a bit.
Second, Tesseract outputs the recognized text with the existing line breaks. We need to get rid of line breaks and page breaks to make a nicely flowing epub text.
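To give a sense of what removing the line breaks involves, here is a rough sketch using `awk`. It is not the workflow settled on in these posts. `unwrap_text` is a hypothetical helper, and it makes two loud assumptions: blank lines separate paragraphs, and a hyphen at the end of a line always marks a word split across lines (so a genuine compound such as “wrong-doers” would lose its hyphen if it happened to break there).

```shell
#!/bin/sh
# unwrap_text is a hypothetical helper: it reads OCR text on standard input and
# writes it to standard output with each paragraph joined into a single line.
# Assumptions: blank lines separate paragraphs, and a line ending in a hyphen
# is a word that was split across two lines.
unwrap_text() {
    awk '
        # A blank line marks a paragraph break: flush what has been collected.
        /^[ \t]*$/ {
            if (buf != "") { print buf; print ""; buf = "" }
            next
        }
        {
            if (buf == "") {
                buf = $0
            } else if (buf ~ /-$/) {
                # Drop the trailing hyphen and rejoin the split word.
                buf = substr(buf, 1, length(buf) - 1) $0
            } else {
                buf = buf " " $0
            }
        }
        END { if (buf != "") print buf }
    '
}
```

It could be used as `unwrap_text < test1914tess.txt > test1914flow.txt`, where the output file name is a placeholder.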
Overall I was impressed with the recognition and ease of use.
Digital Aladore has been idle for a few weeks while I was ill and dealing with other deadlines, but hopefully we can get the steam rolling again…
It’s time to start sorting out options for Optical Character Recognition to capture the text of Aladore from our digital witnesses (i.e. page images). OCR has a surprisingly long history, in use since at least the 1960s for enterprise data entry applications and for improving accessibility for the visually impaired. Today, one of the most common ways to encounter OCR is as a feature built into PDF applications, such as Adobe Acrobat or PDF-XChange Viewer (I do not recommend either of these programs: Acrobat is too expensive, and PDF-XChange seems to be bloated nagware). This use of OCR is focused on creating a simple transcript underlying the page images to enable search. This is not ideal for Digital Aladore, since we want to create a non-PDF text. The other option is a dedicated desktop OCR platform. Probably the most used and most accurate OCR platform is the commercial application ABBYY FineReader. While I have used ABBYY at my work, it is expensive, proprietary software. I want to steer this project towards open source and Libre solutions.
There are basically four parts to most OCR platforms:
- image pre-processing (options that optimize the input for the chosen OCR engine, such as scaling, de-skew, and sharpening)
- layout analysis (recognizes the structure of a document to select text areas for OCR. This can provide position data for the resulting recognized text to reproduce the layout or allow it to be embedded below the PDF image)
- OCR engine (algorithms that can recognize characters in an image and output a text character. These are often optimized by “training” or lexicons that add context to the possible interpretations to improve accuracy.)
- GUI (helps the user efficiently process documents, combining access to the other elements, settings, and output options)
Tesseract is the most commonly used open source OCR engine (licensed Apache 2.0). It was first developed by Hewlett Packard in 1985, but eventually released as open source in 2005. Google has been sponsoring further development since 2006. The engine can now be run as a command line application, but it is also used by a wide range of GUI applications. It is generally considered the best open source option. It is usually not as accurate as ABBYY, but it is faster.
There are a few other open source OCR engines. CuneiForm was a commercial competitor to ABBYY for many years, but it was released under a free BSD license in 2008. It is usually not as accurate or fast as Tesseract. Ocrad is a unique engine used by some applications that is part of the GNU project. It is very fast, but not very accurate.
We will be testing a couple of different GUIs and the main engines. To get an overview of the options, check out a comparison at Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software