Using Tesseract via command line

Okay, just one last tool background post before we hit the “real” workflow I settled on.  As I touched on in an earlier post, Tesseract is surprisingly easy to use from the command line.  Figuring out how to use it is a good chance to practice your old-school computing skills.  Here are some more extensive notes, in case you are interested in pursuing this option.

Tesseract has a basic manual that is barely human readable. So if you are trying to figure anything out, it might be easier to search the user group on Google, https://groups.google.com/forum/#!forum/tesseract-ocr.

There are a few basic things to know about input to maximize the quality of the output.  The image flaws most likely to cause recognition errors are skew, uneven lighting, and black edges on the scans.  These should be avoided during scanning or fixed before input (using ScanTailor, for example).  The newer releases of Tesseract can handle most image formats and include automatic pre-processing (i.e. binarization, basically converting to black and white only, plus noise reduction).  Some applications give you the ability to tweak or retrain the OCR engine; however, except with new languages or extremely odd fonts, this will not improve Tesseract very much.  So don’t bother!

The standard output is a UTF-8 TXT file.  If you add the option ‘hocr’, it will output a special HTML format that includes layout position for the recognized text.  This file can be used to create searchable PDFs.

Command line use is pretty simple.  It is easiest on a Linux system, but I thought I would describe the Windows workflow since many users don’t even realize command line is an option.  The best way to use Tesseract directly on Windows is to look in the start menu folder “Tesseract-OCR”, right click the icon for “Console”, and choose “Run as Administrator” (if you don’t run as admin, tesseract will likely not have the correct permissions to actually create files).

[Screenshot: Using Tesseract on the Windows console]

Now just follow the basic recipe given by the manual, although you will have to supply file paths for the input and output files:

Tesseract [file path]\inputfile.name [filepath]\outputfile hocr

Don’t put an extension on the output file name because Tesseract will add it automatically. Add the hocr option if you want HTML output with layout information, or leave it off for plain text.
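For anyone on Linux, the same recipe can be wrapped in a small shell function. This is only a minimal sketch: it assumes tesseract is installed and on your PATH, and the function name ocr_one is my own invention, not part of Tesseract.

```shell
# ocr_one is a hypothetical helper name, not a Tesseract command.
# Usage: ocr_one INPUT_IMAGE OUTPUT_BASE [hocr]
ocr_one() {
    local input="$1"
    local output_base="$2"
    local format="${3:-}"
    # OUTPUT_BASE has no extension because Tesseract appends one itself.
    if [ "$format" = "hocr" ]; then
        # HTML output with layout information.
        tesseract "$input" "$output_base" hocr
    else
        # Plain text output.
        tesseract "$input" "$output_base"
    fi
}
```

For example, ocr_one scans/page1.tif output/page1 hocr would ask for the hOCR version of one page (the paths here are placeholders).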

It would be insanely tedious to do more than one file this way, so luckily it’s very easy to create a Windows batch file to automate the process (or even easier via a Linux shell script; if you need to brush up on the Linux shell, check out the great resource LinuxCommand.org).

To create a batch file, open Notepad and follow this recipe, replacing “[filepath]” with the actual location of the input and output directories (e.g. “C:\temp\testscans\*.tif”, using the correct extension for your files):

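@Echo off
rem Input files to process, given as a wildcard, and the folder for the output.
Set _SourcePath=[filepath]\*.tif
Set _OutputPath=[filepath]\
rem Full path to the Tesseract executable; adjust to your install location.
Set _Tesseract="C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
rem Run Tesseract on each match; the output name is the input file name without its extension.
rem Note: as written, this assumes the file paths contain no spaces.
For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% %%A %_OutputPath%%%~nA
rem Clear the temporary variables.
Set "_SourcePath="
Set "_OutputPath="
Set "_Tesseract="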

Then click Save As and type in a file name ending with the extension “.bat” (set “Save as type” to “All Files” so Notepad does not tack “.txt” onto the end).  This process will run Tesseract on each file with the given extension in the source directory, outputting a text file for each in the output directory.  To run the batch process, simply double click the .bat file!
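The Linux shell version of the same loop might look something like the sketch below. It assumes tesseract is on your PATH and that your scans end in .tif; the function name ocr_dir and the directory arguments are placeholders of my own. The optional third argument is passed to Tesseract's -l language option (for example fra for French), which only works if that language data is installed.

```shell
# ocr_dir is a hypothetical helper name; the directories are placeholders.
# Usage: ocr_dir SOURCE_DIR OUTPUT_DIR [LANGUAGE]
ocr_dir() {
    local src="$1"
    local out="$2"
    local lang="${3:-}"
    local image base
    mkdir -p "$out"
    for image in "$src"/*.tif; do
        # Skip the literal pattern when no .tif files match.
        [ -e "$image" ] || continue
        # Strip the directory and the .tif extension to get the output base name.
        base=$(basename "$image" .tif)
        echo "Converting $image..."
        if [ -n "$lang" ]; then
            tesseract "$image" "$out/$base" -l "$lang"
        else
            tesseract "$image" "$out/$base"
        fi
    done
}
```

Something like ocr_dir ~/testscans ~/output would then produce one .txt file per scan. Because every path is quoted, file names with spaces should be handled as well.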

Okay, now you have a huge batch of .txt files that you want to combine.  Time for some more command line!  Open the console and use the command cd [filepath] to navigate to the directory where all the text files are located. Then enter the command copy *.txt combinedfile.txt which copies the content of every txt file in the directory into the new file “combinedfile.txt”.
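On Linux, the combining step can be done with cat. Here is a minimal sketch; the function name combine_txt is hypothetical, and it skips the output file itself in case it lives in the same directory as the inputs.

```shell
# combine_txt is a hypothetical helper name.
# Usage: combine_txt TEXT_DIR OUTPUT_FILE
combine_txt() {
    local dir="$1"
    local output="$2"
    local f
    # Start with an empty output file so reruns do not duplicate content.
    : > "$output"
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue
        # Append every text file except the output file itself.
        if [ ! "$f" -ef "$output" ]; then
            cat "$f" >> "$output"
        fi
    done
}
```

For example, combine_txt ~/output ~/output/combinedfile.txt. The files are appended in the shell's sorted glob order, so zero-padded names (page01, page02, ...) keep the pages in sequence.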

Feeling like a Windows power user yet?

5 comments

  1. Pingback: Actual OCR Workflow!! | Digital Aladore
  2. Pingback: Update: Tesseract OCR in 2016 | Digital Aladore
  3. James Arnold

    Many thanks for this extremely clearly-written post: such a relief for a novice user after all the gobbledegook put out by the experts. I have one question, relating to an academic project of my own. The texts I’m working with are in French, and this is in theory no great problem for Tesseract: to convert one page of .tiff or .png into a French .txt I would insert the command “-l fra”. Can this be incorporated into your batch file to work across multiple image documents? Your batch file works very well indeed, but it’s not coping with the French, as indeed it is not designed to. Is there an easy solution to this?

  4. dangojangodango

    Thanks for your comment! I haven’t been working on Digital Aladore for a bit, so sorry for not responding. I hope you have already figured out a solution! But here is a response for posterity…
    In the batch file, the line:

    For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% %%A %_OutputPath%%%~nA

    creates a loop that repeats the command using the variables you set. So “%_Tesseract%” is your Tesseract .exe that you want to call, “%%A” is an input file name found in the “%_SourcePath%” folder, and “%_OutputPath%%%~nA” is a new file name in the output folder. When put together by the loop, each iteration ends up being a standard Tesseract command just as you would type it in the terminal. For example, the batch file above would essentially type this:

    "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe" C:\temp\testscans\example.tif C:\temp\output\example

    into the CMD window for you. Notice that I made the file paths absolute in the batch file to ensure that it could be run from anywhere (i.e. if you were using the command line in the directory of input files with Tesseract properly installed, the command would look much simpler, such as “tesseract input.tif output”). But really all it’s doing is creating that basic Tesseract command over and over in a loop for all the .tif files in the given input directory.

    Thus, to change the language or other Tesseract parameters, you can simply add them into the line above. For example, to create French output, this should work:

    For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% %%A %_OutputPath%%%~nA -l fra

    However, Digital Aladore is a bit out of date at this point, so if you are using the current Tesseract releases where language data is fully separated from the OCR engine, you may also need to tell Tesseract where to find the language data directory (see https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage ). In this case, the batch file command would look like:

    For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% --tessdata-dir /usr/share %%A %_OutputPath%%%~nA -l fra

    However, I don’t have time to test this right now!
