Using Tesseract via Command Line has consistently been the most popular post on Digital Aladore. However, due to some changes, I thought I should update the information.
See also the Wiki page at: https://github.com/tesseract-ocr/tesseract/wiki
There is no longer an official Windows installer for the current release.
If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)
- A nice portable install of both engine and language data from a guy named Simon
- A bleeding edge installer from Mannheim University Library
It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.
On Linux, installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and use make.
Now that you have it installed, the commands on the old post should work fine!
It’s been a long and busy summer, but nothing much has happened at Digital Aladore. Far too long since the last posts! There are really only a few more to go before the project can wrap up and release the final ebook to the world. If only I could find the time…
If you need some heavy reading in traditional typography, check out Edmund G. Gress, The Art & Practice of Typography, digitized by the Smithsonian:
It is amusing to download the EPUB–why, oh why was this created? Not only is the OCR appalling on all the strange fonts and columns, but it misses the point of showing off the art of typography!
First, the OCR could be vastly improved with just a few tiny edits. For example, the first paragraph of the preface reads:
IN the preface to the first edition of “The Art and Practice of Typograpliy,” the author stated that he did not “anticipate again having tlie pleasure of producing a book as elaborate as tliis one,” but the favor witli wliich tlie volume was received made anotlier edition advisable
It takes one human glance to realize that “h” is not being recognized (by ABBYY), which a computer should realize as well with a simple spelling dictionary. (Readers of Digital Aladore, of course, could fix this file up in no time!)
Meanwhile, the varied examples of type are reduced to this handful of CSS rules:
font-family: "Palatino Linotype", "Book Antiqua", Palatino, Georgia, "Times New Roman", serif;
font-family: Georgia, "Palatino Linotype", "Book Antiqua", Palatino, "Times New Roman", serif;
display: block; text-align: center; margin: 1em auto;
Here is an amusing example: page 170,
is reduced by “ABBYY to EPUB” to this:
<div class="newpage" id="page-170"/>
<p> THE ART AND PRACTICE OF TYPOGRAPHY</p>
<p> EXAMPLE 465</p>
<p> Evolution of Roman lower-case type-faces. (A) Pen-made Roman capitals. (B) Development into Minuscules or lower-case thru rapid lettering. (C) Black Letter or German Text developed from Roman Uncials. (D) White Letter, the open, legible Caroline Minuscules, on which Jenson based his Roman type-face of 1470. (E) A recent typeface closely modeled on Jenson s Roman types. (F) Joseph Moxon's letters of 1676. (G) Caslon s type-face of 1722</p>
<p> The face first selected—and witlioiit Iicsitatioii—was foundries and tliat are available for niacliine composition.</p>
<p> Caslon Oldstyle as originally designed. Scotcli Roman was It may be well to inject liere a warning that most so-called</p>
<p> the second selection, Cheltenham Oldstyle the tliird, Clois- Caslon Oldstylcs are not as good as the one selected (Ex-</p>
<p> ter Oldstyle tlie fourth, Bodoni Book the fifth, and French amjile KiT-B) ; that Jenson Oldstyle is inferior to Clois-</p>
<p> Oldstyle the sixtli. (All shown in Example 4(57-) ter Oldstyle (Example I(i7-A) as a re])resentative of the</p>
<p> Type-faces designed and cut for j^rivate use were not original Jenson type. However, good representatives of</p>
<p> considered in making these selections, as it was believed Scotch Roman (Example 'KiT-D) are obtainable under the</p>
<p> best to adhere to type-faces that are procurable from most name of Wayside, of National Roman, etc.</p>
<p> ABCDEFGH IJ K L M N O P O R S T U V WX YZ</p>
<p> a h c cl e f g h i j k I m n o ]:> q r s t u v w x y z How it appears assembieel</p>
<p> (A) Modernized Oldstyle, the Miller & Richard type-face of about 1852</p>
<p> ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdef ghi jklmnopqrstuvwxyz How it appears assembled</p>
<p> (B) Century Expanded, the Benton "modern" type-face of 1901</p>
<p> EXAMPLE 466</p>
<p> Two standard type-faces that rate high in legibility, but that are colorless in the mass and lacking in the pleasing irregularities of form that characterized Roman type-faces before the nineteenth century. The various qualities of legibility found in Modernized Oldstyle have been converted to narrower letter shapes and more "modern " form in Century Expanded</p>
Which means it looks something like this on your ereader:
The example illustrates the issues we have been dealing with at Digital Aladore–ebooks are awesome, but how can we bring the craft back into publishing?
Sorry, no new posts for a while–but Digital Aladore has not been idle!
I have been processing the second digital witness, Aladore 1915 (New York: E.P. Dutton & Company, digitized by Internet Archive in 2006). This time around things went quite quickly and efficiently, since I wasn’t testing various options and I am now familiar with all the software. The page background was cleaner in this digitized edition, which I think made the OCR a tiny bit more accurate. However, the actual book seemed to have a few more print errors–for example, a single letter or punctuation mark missing or distorted. I think this sometimes happened with later printings of a book, since they were often reproduced from plates used in the first printing. Wear and tear on the older plates can introduce errors into the new text.
Like the first time around, I divided the page images into six batches (i.e. directories) to simplify processing. I preprocessed the pages with ScanTailor, ran OCR with YAGF, and batch edited the HTML with BlueFish. Those three steps, including the computer’s processing time, took about four to five hours to complete in total. You could rush through the process faster, but I think this time estimate is a fairly careful and non-stressful pace.
I am curious to compare this new text with the first one I created, so I will be setting that up next! Stay tuned…
Now that I have big batches of nicely pre-processed images from ScanTailor, it’s finally time to capture some text!
To get started, I click on the Open Image icon, navigate to the ScanTailor output directory for a batch, and select all the images (about 60). Next, click on Settings > OCR Settings to make sure it is set to Tesseract-OCR. Then, make sure the HTML output icon is selected.
Now, I click Recognize all pages and let it go to work!
After a few minutes, the recognized text shows up in the right panel.
I quickly scan through it looking for any issues. At this point I am not going to fix any formatting, just checking for anything that looks majorly scrambled. I was actually amazed by the accuracy. There were very few errors that I could detect, and I fixed only a handful per batch. Spelling errors are highlighted with a red underline, so I check those as I go along. There are only a couple of actual spelling errors in each batch, usually two words stuck together. In Aladore, 99% of the detected issues are just names or archaic words, since it is filled with things like “assotted”, “aforetime”, and “thitherward”. You start to see a few pattern errors emerge: capitalized “SO” at the beginning of a chapter often becomes “80”; “ff” often becomes “fI” or “f1” or “f'”; or “O” becomes zero. It is easy to reference the original page to check the text by choosing it from the thumbnails on the left side.
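Since these pattern errors repeat, a quick search-and-replace pass can knock most of them out before the manual read-through. Here is a rough sketch with GNU sed (the sample text and file name are made up, and the patterns come from my batches; yours will differ):

```shell
# Tiny made-up sample showing two of the recurring errors:
printf '%s\n' '80 it came to pass' 'a difIerent matter' > batch.txt

# Fix "SO" read as "80" at the start of a chapter, and "ff" read as "fI":
sed -i -e 's/^80 /SO /' -e 's/fI/ff/g' batch.txt
cat batch.txt
# SO it came to pass
# a different matter
```

Eyeball every substitution before trusting it: a blanket replace like this can mangle legitimate words just as easily as it fixes broken ones.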
It’s important to note that at this point we are entering the realm of editorship. We are making decisions about the transmission of the text. I am trying to just replicate what is in the printed text, but what if the page has a blemish obscuring a good reading, or if the printing made a mistake? Even replication requires interpretation. The errors I miss at this step will be perpetuated downstream…
Back to practical matters: while working on these batches, I noticed YAGF has two annoying interface issues. First, when clicking an image thumbnail on the left pane, the image appears in the main window at something like 500% zoom so you can’t see anything–you need to zoom out several times to actually view the page. Second, there is no way to remove all the items from the current project or start a new project. If you don’t want to spend your time dragging each image to the trash icon to remove them individually, you simply have to close the program and reopen it.
Although the OCR recognition was surprisingly accurate, there is one area where Tesseract has major trouble: italics. If you notice any italics, be sure to go fix it manually. It will be obvious in the text, because it will be a jumble of nonsense!
After checking the text over, I click the Save icon and save as HTML. That’s it! Done.
Assuming you don’t have any italics, this step is amazingly simple and goes quickly!
I will show you what the resulting text looks like in the next post…
Okay, I have been messing around with dozens of workflow options, and I have finally settled on one version. It is not necessarily the most efficient, but I found it has the best balance between accuracy and automation. The basic steps are:
- preprocess image files with ScanTailor,
- OCR with YAGF and export as HTML,
- edit HTML with BlueFish,
- create ebook from HTML with Sigil.
You might say, “gosh that’s unnecessarily ugly!”
And you are partially right. Each of these applications has overlapping features that can do more than what I am using them for, but I found each has specific limitations and strengths. I am only using the features that the application is good at!
For example, the way I am using YAGF is very simple and automated. I could in fact use YAGF for preprocessing the images and selection of the text areas for recognition, thus avoiding ScanTailor. However, YAGF’s image processing is not as good and the interface is too cumbersome for these tasks. It is much more efficient to use ScanTailor and feed YAGF preprocessed images that require no further user input. But you say, then why not replace YAGF with command line use of Tesseract, automated with a Windows batch file or Linux shell script? Simply because YAGF automatically combines multiple pages into a single HTML file, rather than one file for each page generated by the command line. Thus, it saves me one transformation step (combining the HTML files and removing the extra headers), plus gives me some simple visual feedback to catch any major recognition errors.
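For the curious, the transformation step that YAGF saves me would look something like this in a Linux shell: concatenate the per-page HTML files while keeping only what sits inside each body, so the repeated headers are dropped. This is only a sketch with made-up file names and deliberately simplified markup (real hocr files carry attributes on every tag):

```shell
# Two fake per-page files standing in for one-file-per-page OCR output:
printf '<html><head>junk</head><body><p>page one text</p></body></html>' > page1.html
printf '<html><head>junk</head><body><p>page two text</p></body></html>' > page2.html

# Keep only the body contents of each page and glue them together:
for f in page1.html page2.html; do
  sed -e 's/.*<body>//' -e 's|</body>.*||' "$f"
  echo
done > combined.html
cat combined.html
```

With YAGF, none of this is needed–the single combined HTML file comes out of the recognize step for free.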
Furthermore, you may ask why export the OCR as HTML when all you want is the text? Ah, well that is the clever bit, sort of… The hocr output is intended for creating searchable PDFs, using utilities that combine the text, tagged with layout position, with the page images (for example, check out hocr2pdf, available with the ExactImage utilities on Linux). I don’t in fact need any of the tags. However, they make it much easier to reformat the text into the form I want, because I can easily find the line, paragraph, and page breaks. The plain text doesn’t have enough information to transform consistently: it’s hard to tell the difference between a page break, a paragraph break, or a random spacing error. Thus, the HTML gives me more than I need, but it is easy to strip it down to what I want. The plain text doesn’t give me enough, and it’s impossible to build it up except by tedious manual labor.
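As a concrete illustration of why the tags help: with HTML in hand, a single sed pass can turn the paragraph tags into real breaks and throw the rest of the markup away. A sketch with a made-up one-line sample (actual hocr output has class and position attributes on every element, but the idea is the same):

```shell
# A fake fragment of OCR output HTML:
printf '<body><p>the first paragraph</p><p>the second paragraph</p></body>' > page.html

# Turn each closing </p> into a line break, then strip every remaining tag (GNU sed):
sed -e 's|</p>|\n|g' -e 's/<[^>]*>//g' page.html > page.txt
cat page.txt
# the first paragraph
# the second paragraph
```

Try doing that reliably from plain text, where a paragraph break and a stray blank line look identical.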
So there you go. Four more steps, four tools. I will explain it all soon!
Okay, just one last tool background post before we hit the “real” workflow I settled on. As I touched on in an earlier post, Tesseract is surprisingly easy to use from the command line. Figuring out how to use it is a good chance to practice your old school computing skills. Here are some more extensive notes, in case you are interested in pursuing this option.
Tesseract has a basic manual that is barely human readable. So if you are trying to figure anything out, it might be easier to search the user group on Google, https://groups.google.com/forum/#!forum/tesseract-ocr.
There are a few basic things to know about input to maximize the quality of the output. The most likely image flaws to cause recognition errors are skew, uneven lighting, and black edges on the scans. These should be avoided during scanning or fixed before input (using ScanTailor, for example). The newer releases of Tesseract can handle most image formats and include automatic pre-processing (i.e. binarization, basically converting to black and white only, plus noise reduction). Some applications give you the ability to tweak or retrain the OCR engine; however, except with new languages or extremely odd fonts, this will not improve Tesseract very much. So don’t bother!
The standard output is a UTF-8 TXT file. If you add the option ‘hocr’, it will output a special HTML format that includes layout position for the recognized text. This file can be used to create searchable PDFs.
Command line use is pretty simple. It is easiest on a Linux system, but I thought I would describe the Windows workflow since many users don’t even realize the command line is an option. The best way to use Tesseract directly on Windows is to look in the start menu folder “Tesseract-OCR”, right-click the icon for “Console”, and choose “Run as Administrator” (if you don’t run as admin, Tesseract will likely not have the correct permissions to actually create files).
Now just follow the basic recipe given by the manual, although you will have to supply file paths for the input and output files:
tesseract [filepath]\inputfile.name [filepath]\outputfile hocr
Don’t put an extension on the output file name because Tesseract will do it automatically. The hocr option is added if you want HTML output with layout information or is left off for plain text.
It would be insanely tedious to do more than one file this way, so luckily it’s very easy to create a Windows batch file to automate the process (or even easier, a Linux shell script. If you need to brush up on the Linux shell, check out the great resource LinuxCommand.org).
To create a batch file, open notepad, and follow this recipe replacing “[filepath]” with the actual location of the input and output directories (e.g. “C:\temp\testscans\*.tif”, using the correct extension for your files):
Set _SourcePath=C:\temp\testscans\*.tif
Set _OutputPath=C:\temp\testout\
Set _Tesseract="C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
For %%A in (%_SourcePath%) Do Echo Converting %%A...&%_Tesseract% %%A %_OutputPath%%%~nA
Then click save as, type in a file name plus the extension “.bat”. This process will run Tesseract on each file with the given extension in the source directory, outputting a text file for each in the output directory. To run the batch process, simply double click the .bat file!
Okay, now you have a huge batch of .txt files that you want to combine. Time for some more command line! Open the console and use the command
cd [filepath] to navigate to the directory where all the text files are located. Then enter the command
copy *.txt combinedfile.txt which copies the content of every txt file in the directory into the new file “combinedfile.txt”.
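For Linux users, the whole batch-and-combine routine collapses into a few lines of shell. In this sketch the tesseract command is stubbed out with a fake function just so the example runs anywhere; delete that line to use the real engine (all paths and file names here are examples, not from an actual project):

```shell
# Stand-in for the real tesseract binary, for demonstration only:
tesseract() { echo "recognized text from $1" > "$2.txt"; }

mkdir -p scans out
touch scans/page1.tif scans/page2.tif

# Same job as the Windows batch file: OCR every .tif into the output dir...
for f in scans/*.tif; do
  echo "Converting $f..."
  tesseract "$f" "out/$(basename "$f" .tif)"
done

# ...and the same job as "copy *.txt combinedfile.txt":
cat out/*.txt > combinedfile.txt
```

Note how `basename "$f" .tif` does the work of the batch file’s `%%~nA`: it strips the directory and extension so Tesseract gets a clean output base name.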
Feeling like a Windows power user yet?
At the beginning of the project, I was planning to use commercial OCR software to compare with the open options. However, at this point I have decided NOT to create a full text with a proprietary platform, and to just focus on using Tesseract–I just don’t have enough time to do both well! To give you a sense of what it would be like to use a nice proprietary system, here is an extensive post about ABBYY FineReader:
Founded back in 1989, ABBYY is based in Russia and is probably the most common commercial OCR engine. Their products are often integrated into document processing platforms and content management systems. For example, many museums and libraries use CONTENTdm digital collection management to provide access to digitized materials. A FineReader plugin can OCR images as they are uploaded making the content searchable. Of course, the plugin costs thousands of dollars on top of your tens of thousands of dollars in subscription fees!
ABBYY also develops a desktop GUI OCR platform featuring powerful PDF creation and editing capabilities that rival Adobe Acrobat. It will cost you about $169, although you can get a very limited free trial (30 days, 100 pages, 3 pages at a time). This price tag gets you more out-of-the-box functionality than any open software. For example, Tesseract can recognize around 40 languages, but requires downloading and installing training packages (it comes with English only). FineReader comes with 190 languages and 48 language dictionaries to refine its recognition. Tesseract theoretically allows users to create custom word lists, but no one can seem to get the feature to work; FineReader integrates with Microsoft Word’s dictionaries so you can easily set up a custom word list. Finally, many specialized office/library tasks are optimized–for example, creating searchable PDF/A documents designed for long-term digital archival storage.
I have used a few older versions of FineReader on computers at work. Here is a screenshot of FineReader 11, which looks about the same as every other platform reviewed at Digital Aladore so far:
I decided to download the free trial FineReader 12 and test a few pages of Aladore. FineReader 12 has been updated to match Windows 8 style, but doesn’t make great use of screen space, unless you have a giant monitor:
The interface and workflow are basically the same as the open platforms: open some images, recognize the text, edit it in the window, and output in a new format. The display offers a few more advanced features, such as highlighting low-confidence recognitions and non-dictionary words so that you can quickly check them. Basically, FineReader just offers MANY more options and settings, allowing more advanced control.
Here is how it compares to Tesseract:
- processing is a bit slower
- layout detection is better and more nuanced (can detect headers/footers, tables, etc.)
- accuracy is the same (with a simple English text like Aladore)
However, output is where ABBYY is considerably different from the other options. First, its excellent layout detection allows it to reproduce complex formatting such as tables, sidebars, or bulleted lists. Second, it can create a wide variety of output files, including Word documents, full-featured PDFs, spreadsheets, and ebooks–yes, it can directly create an EPUB!
The EPUB formatting is not great, so it would still require a lot of editing and tweaking…
Yeah, FineReader is pretty slick–but it’s not that great! It is still a lot of work to create a good text. Most people use FineReader to efficiently create searchable PDFs without editing the OCR output. Since they see the image layer as the main means of actually reading the text, the underlying transcript doesn’t have to be very accurate.
I will stick with Free software, and end up with the same results…
For example, recognizing the first page of Aladore 1914:
Tesseract generates this very accurate text without any preprocessing:
ALADORE. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YWAINl sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
While the Cuneiform engine generates this text:
ALA DORK, CHAPTER I. SIR Y%’AIM sat, in the Ha11 of Sulney and dKI ]Ustice upon %’fong-doers» And one &an had gathered sticks where he ought not, and this vedas for the twentieth time; and another had snared a rabbit of his lord,’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth tiine: so that Sir Vwain was weary of the sIght of theln» Moreover» his steward stood besN1e 4Imq and pQt him In reIACAI”
I kind of like “ALA DORK”… However, using YAGF’s automatic preprocessing improves the result greatly. After processing, Cuneiform generates this text (still not as accurate as Tesseract):
A L A D 0 R E. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YwAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’ s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredt.h time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
Ocrad is noticeably faster (all three can be tested on OCRFeeder), and generates this text:
OF THE HALL OF SULNEY AND HOW SIR Y\_AIN LEFT IT.
SIR YWAIN’ sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; aod another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
After testing, I see no reason not to use the most popular option, Tesseract. It is extremely accurate out-of-the-box for a text such as Aladore, with no need for training, tweaking, or preprocessing. It can be used efficiently via command line or GUI applications. And finally, the license is Free.
Linux has a few good Free GUI OCR options that are still actively developed. The two most popular applications are YAGF and OCRFeeder, both easily installed via repositories or the software center, and both licensed under the GNU GPLv3. You will need to install the OCR engines separately, so make sure you have the tesseract-ocr packages as well. YAGF integrates only Tesseract and Cuneiform, while OCRFeeder can detect any installed engine.
To test the ease of use, I loaded the first three pages of Aladore unedited, recognized the text, and saved the output.
- YAGF’s very simple interface and workflow make jumping right into OCR easy. Just select some images and click recognize all pages. The text will show up as a single document in the right-hand panel. If you change any options and click recognize again, the text is added to the end–this is convenient for quickly comparing output. To save the text, cut and paste, or click save to create a plain text or HTML file. Switching between Tesseract and Cuneiform is a simple menu selection. A number of other features, such as auto preprocessing, correcting page skew, auto region detection, and spell check, can be accessed from the other icons on the interface. It will take a bit of playing around before you figure out how to use them efficiently. The interface is not completely polished and the options are fairly basic, but overall YAGF is a complete and easy-to-use OCR platform. It was last updated in August 2014; however, the most up-to-date version, 0.9.4.2, is not available in package repositories (Ubuntu Software Center offers 0.9.2.1).
- OCRFeeder has a more polished interface; however, the workflow is not as straightforward as YAGF’s. It offers more fine-grained control of each text/image region of the document. Unfortunately, unless you use the fully automatic detection, this means the process takes much longer. OCRFeeder supports any OCR engine you install on your system, and you can actually use different engines on each text region. For preprocessing, OCRFeeder integrates Unpaper, a tool created to process scanned book page images. Again in contrast to YAGF, it offers more control over the options. One nice feature is output to ODT format (OpenDocument Text, the standard format for Open/LibreOffice Writer). Each of the detected regions will be translated to a text/image box on the document. This means you can translate the page layout to an editable format, rather than just a searchable PDF or plain text. However, I don’t need any of this functionality for Digital Aladore! The overly complicated workflow is not necessary since I just want the simple, unformatted text. However, if you have a document with lots of images that you want to properly capture and embed in your new text, OCRFeeder’s workflow would be better than YAGF’s. Version 0.8 was released in August 2014, although most repositories still offer only 0.7.11 (February 2013).
There are a few other options which I quickly tested, but found more work to install and/or use than necessary, while offering no further functionality. These include tesseract-gui, gImageReader, lios, and ocrivist.
For my purposes, YAGF seems like the best OCR GUI, but it’s a toss-up for now.
The main use of the GUIs is to help with layout detection. I want to detect the text body only (leaving off the header and page numbers), but these applications cannot do this automatically. It would be too much work to go through 400 pages and manually adjust the text box, so I need to look for a preprocessing method to eliminate the headers and footers. Although the GUI makes OCR convenient, I am not sure it is necessary for this project, since it may be just as easy to create a batch process to run the pages directly through Tesseract.
I have been playing around with the many no cost GUI OCR platforms out there, and I want to report some findings. This post looks at some options available on Windows, none of which I found to be very viable for the purposes of Digital Aladore. Another post will relate information about options on Linux.
If you want to test out some super quick OCR without installing anything, the easiest is to use Google Drive.
Go into your Drive, click on the setting icon (gear thing), navigate down to Upload settings, and click on “Convert text from uploaded PDF and image files.” Now upload an image and it will appear in a new Google Doc with the OCR text below. This feature is also available on the Drive app for Android so you can directly OCR your photo “scans” of documents. Obviously this is not a solution for Aladore, but thought I would pass it on in case you had some immediate OCR needs!
As for desktop GUI OCR platforms on Windows, here are some notes just to get the information out there, even if it’s a negative result for this project:
- This is an odd site that provides lots of information about the main commercial OCR platforms, plus download of their own freeware. It is actually part of a niche document imaging retailer, the ScanStore. SimpleOCR is free for non-commercial use, but is not open source. Word recognition is good. However, it very often (every page) seems to interpret large spaces in the middle of a paragraph as columns, thus jumbling the order of the text. This kind of mistake is impossible to recover from without painstaking cut and paste. There are no settings to change this behavior. This platform would be unusable with Aladore.
- This is a commonly cited freeware (not open source) option that uses the Tesseract engine. However, it was consistently blocked by my security programs as containing a virus. Numerous reviews mention that the installer includes PUPs (potentially unwanted programs) and malware. I am not sure what to think, but I decided it wasn’t worth pursuing since there is nothing special about it.
- Basically a Windows version of the Linux application Tesseract-GUI. It is open source (GNU GPL v3), but hasn’t been updated since 2010. The project never really got off the ground, so it is too buggy to use.
- The interface looks pretty good (more like the ABBYY set up), but I could not get it to actually recognize any text. Python based, open source (GNU GPL v2), but it hasn’t been updated since 2011.
- gImageReader is the most usable on this list and has a Free open source license (GNU GPLv3). It is easy to install on Windows and works without tweaking. However, it is slow at loading page images and layout detection. Also, the OCR is somehow less accurate than Tesseract via the command line or via the Linux GUI options–I don’t know how this is possible. It does have a number of nice features for editing/polishing the output text. It was last updated in April 2014.
But don’t worry, I found some other options on Linux that run much better!