OCR now!

“How Ywain looked into the Water of a Well”, Aladore p.36.

Section 4. Capturing Text has gotten too long, rambling through the tool options, so we are moving on to 4.5 OCR.  Time to get down to business!  It’s been a slog, but we are almost there.  Well, we are almost to having some raw text anyway…

For those keeping track, I corrected the digital skew on the image above, trying to make it look more square.

Using Tesseract via command line

Okay, just one last tool background post before we hit the “real” workflow I settled on.  As I touched on in an earlier post, Tesseract is surprisingly easy to use from the command line.  Figuring out how to use it is a good chance to practice your old school computing skills.  Here are some more extensive notes, in case you are interested in pursuing this option.

Tesseract has a basic manual that is barely human readable. So if you are trying to figure anything out, it might be easier to search the user group on Google, https://groups.google.com/forum/#!forum/tesseract-ocr.

There are a few basic things to know about input to maximize the quality of the output.  The most likely image flaws to cause recognition errors are skew, uneven lighting, and black edges on the scans.  These should be avoided during scanning or fixed before input (using ScanTailor, for example).  The newer releases of Tesseract can handle most image formats and include automatic pre-processing (i.e. binarization, basically converting to black and white only, plus noise reduction).  Some applications give you the ability to tweak or retrain the OCR engine; however, except with new languages or extremely odd fonts, this will not improve Tesseract very much.  So don’t bother!

The standard output is a UTF-8 TXT file.  If you add the option ‘hocr’, it will output a special HTML format that includes layout position for the recognized text.  This file can be used to create searchable PDFs.
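To give a feel for the layout information mentioned above, here is a minimal Python sketch that pulls line positions out of hocr markup.  It assumes the usual hocr convention of elements with class “ocr_line” carrying a “bbox x0 y0 x1 y1” entry in their title attribute; the sample markup is hand-written for illustration, not real Tesseract output, and the names LineBoxes and line_boxes are my own.

```python
import re
from html.parser import HTMLParser

# hocr stores the position of each element in its title attribute,
# e.g. title="bbox 36 92 618 130; baseline 0 -2"
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class LineBoxes(HTMLParser):
    """Collect the bounding box of every element whose class is ocr_line."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("class") != "ocr_line":
            return
        match = BBOX.search(attributes.get("title") or "")
        if match:
            self.boxes.append(tuple(int(n) for n in match.groups()))


def line_boxes(hocr_text):
    """Return a list of (x0, y0, x1, y1) tuples, one per recognized line."""
    parser = LineBoxes()
    parser.feed(hocr_text)
    parser.close()
    return parser.boxes


# Hand-written fragment for illustration only
sample = (
    "<span class='ocr_line' title='bbox 36 92 618 130'>SIR YWAIN sat</span>"
    "<span class='ocr_line' title='bbox 36 140 618 178; baseline 0 -2'>did justice</span>"
)
print(line_boxes(sample))  # [(36, 92, 618, 130), (36, 140, 618, 178)]
```

Tools that build searchable PDFs use these boxes to place the invisible text layer over the page image.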

Command line use is pretty simple.  It is easiest on a Linux system, but I thought I would describe the Windows workflow since many users don’t even realize command line is an option.  The best way to use Tesseract directly on Windows is to look in the start menu folder “Tesseract-OCR”, right click the icon for “Console”, and choose “Run as Administrator” (if you don’t run as admin, tesseract will likely not have the correct permissions to actually create files).

Using Tesseract on Windows Console

Now just follow the basic recipe given by the manual, although you will have to supply file paths for the input and output files:

tesseract [filepath]\inputfile.name [filepath]\outputfile hocr

Don’t put an extension on the output file name because Tesseract will do it automatically. The hocr option is added if you want HTML output with layout information or is left off for plain text.
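If you would rather drive Tesseract from a script than type the command by hand, the same recipe can be sketched in Python.  This is only an illustration: it assumes tesseract is on your PATH, and the function names build_command and run_ocr are made up for this example.

```python
import subprocess


def build_command(image, output_base, hocr=False, exe="tesseract"):
    """Build the argument list: tesseract <input> <output base> [hocr].

    output_base has no extension; Tesseract adds one itself.
    """
    command = [exe, str(image), str(output_base)]
    if hocr:
        # Ask for HTML output with layout positions instead of plain text
        command.append("hocr")
    return command


def run_ocr(image, output_base, hocr=False):
    """Run Tesseract on one image and raise an error if it fails."""
    subprocess.run(build_command(image, output_base, hocr), check=True)


print(build_command("page036.tif", "page036", hocr=True))
# ['tesseract', 'page036.tif', 'page036', 'hocr']
```

Passing the arguments as a list means paths with spaces need no extra quoting.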

It would be insanely tedious to do more than one file this way, so luckily it’s very easy to create a Windows batch file to automate the process (or, even easier, a Linux shell script).  If you need to brush up on the Linux shell, check out the great resource LinuxCommand.org.

To create a batch file, open notepad, and follow this recipe replacing “[filepath]” with the actual location of the input and output directories (e.g. “C:\temp\testscans\*.tif”, using the correct extension for your files):

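@Echo off
Rem Input images to OCR; change the extension to match your scans
Set _SourcePath=[filepath]\*.tif
Rem Output directory; keep the trailing backslash
Set _OutputPath=[filepath]\
Set _Tesseract="C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
Rem Quotes keep paths containing spaces in one piece; %%~nA is the file name without its extension
For %%A in ("%_SourcePath%") Do Echo Converting %%~fA...&%_Tesseract% "%%~fA" "%_OutputPath%%%~nA"
Rem Clear the temporary variables
Set "_SourcePath="
Set "_OutputPath="
Set "_Tesseract="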

Then click save as, type in a file name plus the extension “.bat”.  This process will run Tesseract on each file with the given extension in the source directory, outputting a text file for each in the output directory.  To run the batch process, simply double click the .bat file!
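For anyone not on Windows, the same loop can be sketched in Python.  Treat this as a rough equivalent rather than a tested replacement: it assumes tesseract is on your PATH, and plan_batch and run_batch are names invented for this example.

```python
import subprocess
from pathlib import Path


def plan_batch(source_dir, output_dir, pattern="*.tif"):
    """Pair each matching image with an extension-free output path."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    return [
        (image, output_dir / image.stem)
        for image in sorted(source_dir.glob(pattern))
    ]


def run_batch(source_dir, output_dir, pattern="*.tif", exe="tesseract"):
    """Run Tesseract once per image, writing one text file per page."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for image, output_base in plan_batch(source_dir, output_dir, pattern):
        print(f"Converting {image}...")
        subprocess.run([exe, str(image), str(output_base)], check=True)


# Example call (directory names are placeholders):
# run_batch("testscans", "ocr-output")
```

Keeping the planning step separate from the running step makes it easy to print the list of files first and check it before committing to a long OCR run.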

Okay, now you have a huge batch of .txt files that you want to combine.  Time for some more command line!  Open the console and use the command cd [filepath] to navigate to the directory where all the text files are located. Then enter the command copy *.txt combinedfile.txt which copies the content of every txt file in the directory into the new file “combinedfile.txt”.
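Here is a small Python sketch of the same combining step, in case you want control over the page order.  It is an illustration under a couple of assumptions: the files are UTF-8, and their names sort in page order (zero-padded numbers such as page001.txt, page002.txt).  The function name combine_text_files is made up.

```python
from pathlib import Path


def combine_text_files(directory, output_name="combinedfile.txt"):
    """Concatenate every .txt file in directory, in name order, into one file."""
    directory = Path(directory)
    output_path = directory / output_name
    # List the pages before creating the output, and skip the output itself
    # so a second run does not fold the old combined file back in.
    pages = sorted(p for p in directory.glob("*.txt") if p.name != output_name)
    with output_path.open("w", encoding="utf-8") as combined:
        for page in pages:
            combined.write(page.read_text(encoding="utf-8"))
    return output_path


# Example call (the directory name is a placeholder):
# combine_text_files("ocr-output")
```

Note that the sort is alphabetical, so page10.txt would come before page2.txt unless the numbers are zero-padded.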

Feeling like a Windows power user yet?

Actual OCR Workflow!!

Okay, I have been messing around with dozens of workflow options, and I have finally settled on one version.  It is not necessarily the most efficient, but I found it has the best balance between accuracy and automation.  The basic steps are:

  • preprocess image files with ScanTailor,
  • OCR with YAGF and export as HTML,
  • edit HTML with BlueFish,
  • create ebook from HTML with Sigil.

You might say, “gosh that’s unnecessarily ugly!”

And you are partially right.  Each of these applications has overlapping features that can do more than what I am using them for, but I found each has specific limitations and strengths.  I am only using the features that the application is good at!

For example, the way I am using YAGF is very simple and automated.  I could in fact use YAGF for preprocessing the images and selection of the text areas for recognition, thus avoiding ScanTailor.  However, YAGF’s image processing is not as good and the interface is too cumbersome for these tasks.  It is much more efficient to use ScanTailor and feed YAGF preprocessed images that require no further user input.  But you say, then why not replace YAGF with command line use of Tesseract, automated with a Windows batch file or Linux shell script?  Simply because YAGF automatically combines multiple pages into a single HTML file, rather than one file for each page generated by the command line.  Thus, it saves me one transformation step (combining the HTML files and removing the extra headers), plus gives me some simple visual feedback to catch any major recognition errors.

Furthermore, you may ask, why export the OCR as HTML when all you want is the text?  Ah, well that is the clever bit, sort of…  The hocr output is intended for creating searchable PDFs using utilities that combine the text tagged with layout position and the page images (for example, check out hocr2pdf available with the Exactimage utilities on Linux).  I don’t in fact need any of the tags.  However, they make it much easier to reformat the text into the form I want because I can easily find the line, paragraph, and page breaks.  The plain text doesn’t have enough information to transform consistently: it’s hard to tell the difference between a page break, a paragraph break, and a random spacing error.  Thus, the HTML gives me more than I need, but it is easy to strip it down to what I want.  The plain text doesn’t give me enough, and it’s impossible to build it up except by tedious manual labor.

So there you go.  Four more steps, four tools.  I will explain it all soon!

Pre-process with ScanTailor

Okay, here we go!  Are you excited?

ScanTailor: Aladore loaded into a new project.

ScanTailor, http://scantailor.org, is a free software tool for post-processing scanned images (if I haven’t emphasised this enough yet, check out “What is free software” from FSF).

But wait, “post-processing”?  Yeah, I guess it’s all in your point of view: I am calling it pre- because it’s processing before I feed it to Tesseract-OCR; they call it post- because you do it after scanning to make the images more usable.  In any case, ScanTailor is basically a batch image editing platform optimized to enable efficient processing of unedited page scans into lovely readable pages for PDFs/DJVU, etc.  You input any type of image file, and you output black and white TIFF page images.  Many of the GUI OCR applications have similar processing built in; even Tesseract via command line does some image pre-processing.  ABBYY goes a step further and does the final PDF creation as well.

It is limited (niche?), but what ScanTailor does, it does well and efficiently.  And it’s Free.  It runs on Windows and Linux; I use it with Ubuntu 14.

A full user guide can be found here: https://github.com/scantailor/scantailor/wiki/User-Guide

And here’s a nice video tutorial to get you started: http://vimeo.com/12524529

So what am I trying to do with ScanTailor?

As I described in an earlier post, every Aladore page image has a header and footer that will be annoying, ugly, and generally undesirable in our final EPUB edition.  Furthermore, the header and footer tend to cause OCR errors, so they are hard to eliminate from the text output.  None of the GUI OCR options managed to consistently select only the text block (ABBYY can identify headers, but it wasn’t 100% and didn’t separate the page numbers).  The exact location of the text block wanders around from page to page, so we can not pre-set a selection area for groups of images.  We could manually select the text block of each page via the interface of YAGF or OCRFeeder, but I found both too cumbersome to efficiently work through hundreds of pages.  So as an alternative, I decided to crop each page down to the main text block before carrying out OCR.

ScanTailor can do this cropping efficiently and has the added bonus of excellent image processing to prepare the pages for OCR.

Here is my workflow:

1) Set up batches of images.  I decided to work on about 60 pages at a time to make things more manageable: a large enough batch to be efficient, but not too large that it took a huge amount of time to work on.  I divided Aladore into six sections (each in a directory), being careful to not break up any chapters.

2) Start a new ScanTailor project. i.e. Start the application and load one of the Aladore directories.

3) Fix DPI.  Basically, you can estimate the DPI by dividing the pixel dimensions of the digital image by the physical dimensions, in inches, of the area it was imaging (i.e. ~original printed page size).  In his video tutorial, Joseph Artsimovich also gives a method to estimate based on measuring the pixel height of six lines of text in the image.  I estimated my Aladore images to be between 300 and 400 DPI, but it’s not essential to get it exactly right.  I decided to just use 300×300.

4) Select Content.  Since the Aladore images are already processed, we can skip the Fix Orientation, Split Pages, and Deskew steps.  For good results, the selection cannot be fully automated.  First, I do two manual selections to get the batch started: one for all the left hand pages and one for all the right hand pages of the book, since the location of the text block tends to be more similar within each group.  I go to an average page and resize the selection box to be below the header, above the page number, and almost the full page width horizontally.  Then I click “Apply to” and choose “every other page.”  Then I repeat this on the next page to set the selection for the other half of the book.

Manually selecting the text block.

This will give us a pretty good selection–but now I go through the tedious step of checking every single page.  Since blotches and marks on the paper often cause random OCR errors, it is best to get the selection box close to the text.  ScanTailor will white out everything outside the box, thus eliminating the many flaws.  Basically, I tweak the first and last page of each chapter, since the text block is smaller.  About 1 in 6 pages in Aladore needed just a little tweak up or down as I check the pages.  This doesn’t take as long as it sounds, less time than it took to write this post…

Tweaking selection on last page of a chapter.

5) Margins.  This step is mainly cosmetic to ensure consistent output.  The area added by the Margins will be blank white.  First, we can add “hard margins” around each selection, like you would with a Word document, to ensure that the text is not all the way at the edge of the page.  Second, using “soft margins” we can decide if we want all the pages to be the same size and where the text selection is positioned in the resulting page.  By default ScanTailor will add white space to make all pages the size of the largest in the batch.  Having a consistent page size is actually helpful for our purposes, because the OCR will detect the blank space at the beginning and end of chapters.  The defaults are fine for my purposes!

6) Output!! This stage allows you to set the parameters for image processing.  You can set the DPI, Mode (B&W, color), adjust the filtering, add Dewarping, and adjust the noise reduction (Despeckling).  The defaults worked well for Aladore, at 600 DPI (2x the input DPI).
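Going back to step 1 for a moment, the batching rule (roughly 60 pages, never splitting a chapter) is easy to sketch in Python.  I did this by hand, so take this as an illustration only: the chapter lengths below are invented, and batch_chapters is a made-up name.

```python
def batch_chapters(chapter_pages, target=60):
    """Group whole chapters into batches of at most about `target` pages.

    chapter_pages is a list of (chapter name, page count) pairs in book order.
    A chapter longer than target still gets a batch of its own.
    """
    batches, current, count = [], [], 0
    for chapter, pages in chapter_pages:
        # Close the current batch if this chapter would push it past the target
        if current and count + pages > target:
            batches.append(current)
            current, count = [], 0
        current.append(chapter)
        count += pages
    if current:
        batches.append(current)
    return batches


# Invented page counts, for illustration only
example = [("I", 25), ("II", 30), ("III", 20), ("IV", 45), ("V", 10)]
print(batch_chapters(example))  # [['I', 'II'], ['III'], ['IV', 'V']]
```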

Aladore ready for output!

Click the play button and ScanTailor starts the actual image processing, saving the output to a new directory.  The computer works for a minute, and then we have a batch of black and white TIFFs ready for OCR!
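As a footnote to the Fix DPI step, the estimate is just pixels divided by inches.  A tiny Python sketch, with made-up measurements (I did not record the exact pixel size of the Aladore scans here):

```python
def estimate_dpi(width_px, height_px, width_in, height_in):
    """Estimate horizontal and vertical DPI from image and page dimensions."""
    horizontal = width_px / width_in
    vertical = height_px / height_in
    return horizontal, vertical


# Hypothetical numbers: a 1500 x 2250 pixel image of a 5 x 7.5 inch page
print(estimate_dpi(1500, 2250, 5.0, 7.5))  # (300.0, 300.0)
```

If the two values disagree noticeably, the physical measurement is probably off, or the image includes more than just the page.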

ScanTailor output image.

OCR with YAGF

Now that I have big batches of nicely pre-processed images from ScanTailor, it’s finally time to Capture some text!

YAGF is a free software OCR GUI for Linux, http://sourceforge.net/projects/yagf-ocr.  I reviewed it in an earlier post Linux OCR.

To get started, I click on the Open Image icon, navigate to the ScanTailor output directory for a batch, and select all the images (about 60).  Next, click on Settings > OCR Settings to make sure it is set to Tesseract-OCR.  Then, make sure the HTML output icon is selected.

Now, I click Recognize all pages and let it go to work!

YAGF working.

After a few minutes, the recognized text shows up in the right panel.

I quickly scan through it looking for any issues.  At this point I am not going to fix any formatting, just checking for anything that looks majorly scrambled.  I was actually amazed by the accuracy.  There were very few errors that I could detect.  I fixed only a handful per batch.  Spelling errors are highlighted with red underline so I check those as I go along.  There are only a couple of actual spelling errors in the batch, usually two words stuck together.  In Aladore 99% of the detected issues are just names or archaic words, since it is filled with things like “assotted”, “aforetime”, and “thitherward”.  You start to see a few pattern errors emerge: capitalized “SO” at the beginning of a chapter often becomes “80”; “ff” often becomes “fI” or “f1” or “f'”; or “O” becomes zero.  It is easy to reference the original page to check the text by choosing it from the thumbnails on the left side.
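Pattern errors like these can also be hunted for automatically once the text is saved.  Below is a rough Python sketch that flags two of them: words mixing letters and digits (such as “f1” or a zero standing in for “O”), and a line that begins with “80”.  It only flags candidates for a human to check and will not catch “fI”.  The function name flag_suspects and the sample text are invented for this example.

```python
import re

# A word made of letters and digits that contains at least one of each
MIXED = re.compile(
    r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*[0-9])[A-Za-z0-9]+\b"
)


def flag_suspects(text):
    """Return (line number, token) pairs that look like OCR pattern errors."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        # "SO" at the start of a chapter is often misread as "80"
        if line.startswith("80 "):
            suspects.append((number, "80"))
        for match in MIXED.finditer(line):
            suspects.append((number, match.group()))
    return suspects


# Invented sample lines, for illustration only
sample = "80 they rode on\nand cut o f1 the way\nat 0nce"
print(flag_suspects(sample))  # [(1, '80'), (2, 'f1'), (3, '0nce')]
```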

It’s important to note that at this point we are entering the realm of editorship.  We are making decisions about the transmission of the text.  I am trying to just replicate what is in the printed text, but what if the page has a blemish obscuring a good reading or if the printing contains a mistake?  Even replication requires interpretation.  The errors I miss at this step will be perpetuated downstream…

Back to practical matters: while working on these batches, I noticed YAGF has two annoying interface issues.  First, when clicking an image thumbnail on the left pane, the image appears in the main window at something like 500%, so you can’t see anything; you need to zoom out several times to actually view the page.  Secondly, there is no way to remove all the items from the current project or start a new project.  If you don’t want to spend your time dragging each image to the trash icon to remove them individually, you simply have to close the program and reopen.

Although the OCR recognition was surprisingly accurate, there is one area where Tesseract has major trouble: italics.  If you notice any italics, be sure to go fix it manually.  It will be obvious in the text, because it will be a jumble of nonsense!

Anyway, after checking the text over, I click the Save icon and save as HTML.  That’s it!  Done.

Assuming you don’t have any italics, this step is amazingly simple and goes quickly!

I will show you what the resulting text looks like in the next post…

Raw HTML text

After YAGF (Tesseract) OCR we have a batch of large HTML files.  For Aladore 1914 it was six HTML files around 1600 lines each.  The first page looks like this:

ALADORE.

CHAPTER I.

OF THE HALL OF SULNEY AND HOW

SIR YWAIN LEFT IT.

SIR YWAIN sat in the Hall of Sulney and

did justice upon wrong-doers. And one man

had gathered sticks where he ought not, and

this was for the twentieth time; and another

had snared a rabbit of his lord’s, and this was

for the fortieth time; and another had beaten

his wife, and she him, and this was for the

hundredth time: so that Sir Ywain was weary

of the sight of them. Moreover, his steward

stood beside him, and put him in remem-

Each line of text from the page image has been tagged as one HTML paragraph, so the breaks follow the original printed page.   Take a look at the mark up:

<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">ALADORE.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">CHAPTER I.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">OF THE HALL OF SULNEY AND HOW</p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN LEFT IT.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN sat in the Hall of Sulney and</p>

Each paragraph has a big “style=” attribute that is sort of meaningless since they are all nearly identical.  For my purposes, it’s just ugly and unnecessary.  However, the important thing is that everything is consistent.  I will not use the format described by the HTML tags, but because they are consistent in how they are used, I can easily transform the text.  With a little thought and find & replace, we can easily create a reflowable text, making the HTML above into something like:

ALADORE.

CHAPTER I.

OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT.

SIR YWAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remembrance of all the misery that had else been forgotten.
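I did this transformation with find & replace in an editor, but the idea can be sketched in a few lines of Python.  This is an illustration, not the method I actually used, and it rests on two assumptions: an empty paragraph (the ones containing only a line break) marks the end of a block, and a line ending in a hyphen is a word split across lines.  That second assumption is naive, since it would also remove the hyphen from a word like “wrong-doers” if it happened to break at the hyphen.  The names ParagraphCollector and reflow are made up.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element; empty ones become ''."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def reflow(html_text):
    """Join printed lines into blocks, splitting at empty paragraphs."""
    collector = ParagraphCollector()
    collector.feed(html_text)
    collector.close()
    blocks, current = [], ""
    for line in collector.paragraphs:
        if not line:
            # Empty paragraph: close the block we have been building
            if current:
                blocks.append(current)
            current = ""
        elif current.endswith("-"):
            # Assume a word split across lines: drop the hyphen and join
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        blocks.append(current)
    return blocks
```

Feeding it the markup shown earlier would give “ALADORE.”, “CHAPTER I.”, and the two-line chapter title joined into one block, with the body text following as a single flowing block.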

Which leads us to the next section of the project!

Update: Tesseract OCR in 2016

Using Tesseract via Command Line has consistently been the most wildly popular post on Digital Aladore. However, due to some changes, I thought I should update the information.

Tesseract used to be hosted at Google Code, which closed up shop in August 2015. The project has transitioned to GitHub, with the main page at https://github.com/tesseract-ocr/tesseract and the Wiki page at https://github.com/tesseract-ocr/tesseract/wiki

There is no longer an official Windows installer for the current release. If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)

If you want a more up-to-date version on Windows there are some third party installers put together for you, otherwise you have to compile it yourself (unless you use Cygwin or MSYS2):

It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.
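If Tesseract complains that it cannot load a language, the data files are usually not where it expects them.  Here is a small Python sketch for checking, assuming the usual naming where each language is a file called [language code].traineddata (for example eng.traineddata) inside the tessdata directory.  The function name missing_languages and the example path are made up.

```python
from pathlib import Path


def missing_languages(tessdata_dir, languages=("eng",)):
    """Return the language codes with no .traineddata file in tessdata_dir."""
    tessdata_dir = Path(tessdata_dir)
    return [
        language
        for language in languages
        if not (tessdata_dir / f"{language}.traineddata").is_file()
    ]


# Example call (the path is a placeholder; use your own install location):
# print(missing_languages(r"C:\Program Files (x86)\Tesseract-OCR\tessdata"))
```

An empty list means every requested language file was found.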

On Linux, installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and use make.

Now that you have it installed, the commands on the old post should work fine!