Okay, just one last tool background post before we hit the “real” workflow I settled on. As I touched on in an earlier post, Tesseract is surprisingly easy to use from the command line, and figuring out how to use it is a good chance to practice your old-school computing skills. Here are some more extensive notes, in case you are interested in pursuing this option.
Tesseract has a basic manual that is barely human-readable, so if you are trying to figure anything out, it might be easier to search the user group on Google Groups: https://groups.google.com/forum/#!forum/tesseract-ocr.
There are a few basic things to know about input to maximize the quality of the output. The image flaws most likely to cause recognition errors are skew, uneven lighting, and black edges on the scans. These should be avoided during scanning or fixed before input (using ScanTailor, for example). The newer releases of Tesseract can handle most image formats and include automatic pre-processing (binarization, which basically means converting to black and white only, plus noise reduction). Some applications give you the ability to tweak or retrain the OCR engine; however, except with new languages or extremely odd fonts, this will not improve Tesseract very much. So don’t bother!
The standard output is a UTF-8 TXT file. If you add the option ‘hocr’, it will output a special HTML format that includes layout position for the recognized text. This file can be used to create searchable PDFs.
Command line use is pretty simple. It is easiest on a Linux system, but I thought I would describe the Windows workflow since many users don’t even realize the command line is an option. The best way to use Tesseract directly on Windows is to look in the start menu folder “Tesseract-OCR”, right-click the icon for “Console”, and choose “Run as Administrator” (if you don’t run as admin, Tesseract will likely not have the correct permissions to actually create files).
Now just follow the basic recipe given by the manual, although you will have to supply file paths for the input and output files:
tesseract [filepath]\inputfile.name [filepath]\outputfile hocr
Don’t put an extension on the output file name because Tesseract will add it automatically. Add the hocr option if you want HTML output with layout information; leave it off for plain text.
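For anyone taking the Linux route instead, here is a rough sketch of the same recipe as a small shell function. It only wraps the “tesseract input outputbase [hocr]” pattern described above. The function name ocr_one and the TESSERACT_BIN variable are names made up for this example (TESSERACT_BIN simply defaults to whatever “tesseract” is on your PATH), and the file paths are placeholders.

```shell
#!/bin/sh
# Sketch: OCR a single scan, either as plain text or as hOCR.
# ocr_one and TESSERACT_BIN are hypothetical names used only in this example;
# TESSERACT_BIN falls back to the "tesseract" found on PATH.

ocr_one() {
    input=$1        # e.g. scans/page01.tif
    outdir=$2       # directory that receives the output file
    mode=${3:-}     # empty for plain text, "hocr" for HTML with layout data

    # Output base = input file name without its directory or extension.
    # Tesseract appends the output extension on its own.
    name=$(basename "$input")
    base="$outdir/${name%.*}"

    # $mode is left unquoted on purpose: an empty mode adds no argument.
    "${TESSERACT_BIN:-tesseract}" "$input" "$base" $mode
}
```

Calling “ocr_one scans/page01.tif out” should leave a plain text file named after the scan in “out”, while “ocr_one scans/page01.tif out hocr” asks for the layout-aware HTML instead (the extension of the hOCR file differs between Tesseract versions, so check what your install produces).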
It would be insanely tedious to do more than one file this way, so luckily it’s very easy to create a Windows batch file to automate the process (or, even easier, a Linux shell script; if you need to brush up on the Linux shell, check out the great resource LinuxCommand.org).
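To create a batch file, open Notepad and follow this recipe, replacing each “[filepath]” with the actual location of the input files and of the output directory (e.g. “C:\temp\testscans\*.tif” for the source, using the correct extension for your files, and an output directory ending in a backslash). This simple version assumes paths and file names without spaces:
Set _SourcePath=[filepath]
Set _OutputPath=[filepath]
Set _Tesseract="C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% %%A %_OutputPath%%%~nA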
Then click “Save As” and type in a file name ending in the extension “.bat”. This process will run Tesseract on each file with the given extension in the source directory, outputting a text file for each in the output directory. To run the batch process, simply double-click the .bat file!
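Since a Linux shell script is supposed to be even easier, here is a minimal sketch of what the counterpart to the batch file might look like. It is not a tested production script: ocr_dir and TESSERACT_BIN are names invented for this example, the “.tif” extension is an assumption you should change to match your scans, and the directories are whatever you pass in.

```shell
#!/bin/sh
# Sketch of a shell counterpart to the Windows batch file.
# ocr_dir and TESSERACT_BIN are hypothetical names for this example;
# TESSERACT_BIN falls back to the "tesseract" found on PATH.

ocr_dir() {
    src_dir=$1      # directory holding the scanned images
    out_dir=$2      # directory that receives one text file per image
    mkdir -p "$out_dir"

    for img in "$src_dir"/*.tif; do
        # If no file matches, the pattern stays literal; skip it.
        [ -e "$img" ] || continue

        # Same idea as %%~nA in the batch file: name without the extension.
        name=$(basename "$img" .tif)

        echo "Converting $img..."
        # Quoting keeps file names with spaces in one piece.
        "${TESSERACT_BIN:-tesseract}" "$img" "$out_dir/$name"
    done
}
```

Saved in a script with a final line such as “ocr_dir /path/to/scans /path/to/text”, this would walk the scan directory once and print a “Converting” line for each image, just like the batch file does.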
Okay, now you have a huge batch of .txt files that you want to combine. Time for some more command line! Open the console and use the command
cd [filepath] to navigate to the directory where all the text files are located. Then enter the command
copy *.txt combinedfile.txt which copies the content of every txt file in the directory into the new file “combinedfile.txt”.
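On Linux, the combining step could be sketched along the following lines. The function name combine_txt is made up for this example. One deliberate difference from the copy command: the combined file is expected to live outside the directory of page files (or to use a different extension), so that it never matches the “*.txt” pattern itself.

```shell
#!/bin/sh
# Sketch: concatenate every .txt file in a directory into one file.
# combine_txt is a hypothetical name used only in this example.

combine_txt() {
    txt_dir=$1      # directory holding the per-page text files
    combined=$2     # output file; keep it outside txt_dir

    # Start from an empty output file so reruns do not append twice.
    : > "$combined"

    # The shell expands *.txt in sorted order, so the pages follow
    # their file names (page10 sorts before page2 unless zero-padded).
    for f in "$txt_dir"/*.txt; do
        [ -e "$f" ] || continue
        cat "$f" >> "$combined"
    done
}
```

For example, “combine_txt /path/to/text /path/to/combinedfile.txt” would gather all the pages into one file. Zero-padded file names (page001, page002, and so on) keep the pages in reading order.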
Feeling like a Windows power user yet?