OCR engines

Of the Free OCR engines, the biggies are Tesseract (Apache v2), Cuneiform (BSD), and Ocrad (GPLv2). In all my tests Tesseract was the most accurate and Ocrad was the fastest.

For example, recognizing the first page of Aladore 1914:

aladoren00newbuoft_0021.jpg

Tesseract generates this very accurate text without any preprocessing:

ALADORE. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YWAINl sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-

The Cuneiform engine, also run without preprocessing, generates this text:

ALA DORK, CHAPTER I. SIR Y%’AIM sat, in the Ha11 of Sulney and dKI ]Ustice upon %’fong-doers» And one &an had gathered sticks where he ought not, and this vedas for the twentieth time; and another had snared a rabbit of his lord,’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth tiine: so that Sir Vwain was weary of the sIght of theln» Moreover» his steward stood besN1e 4Imq and pQt him In reIACAI”

I kind of like “ALA DORK”… However, using YAGF’s automatic preprocessing improves the result greatly. After preprocessing, Cuneiform generates this text (still not as accurate as Tesseract):

A L A D 0 R E. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YwAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’ s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredt.h time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-

Ocrad is noticeably faster than the other two (all three can be tested in OCRFeeder), and generates this text:

ALADORE.

CHAPTER I.
OF THE HALL OF SULNEY AND HOW SIR Y\_AIN LEFT IT.
SIR YWAIN’ sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; aod another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
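The comparisons above are by eye. One rough way to put a number on them is to score each engine's output against a hand-corrected reference. The sketch below is only an illustration: the reference sentence is my own hand-typed correction (an assumption, not something any engine produced), the `similarity` helper is a hypothetical name, and `difflib`'s ratio is a general similarity score rather than a proper character error rate.

```python
from difflib import SequenceMatcher


def similarity(reference, candidate):
    """Return a 0..1 similarity score between two strings (1.0 = identical)."""
    return SequenceMatcher(None, reference, candidate).ratio()


# Hand-corrected first sentence (assumption: typed by hand, not OCR output).
REFERENCE = "SIR YWAIN sat in the Hall of Sulney and did justice upon wrong-doers."

# First sentence of each engine's output, copied from the samples above.
SAMPLES = {
    "Tesseract": "SIR YWAINl sat in the Hall of Sulney and did justice upon wrong-doers.",
    "Cuneiform (no preprocessing)": "SIR Y%’AIM sat, in the Ha11 of Sulney and dKI ]Ustice upon %’fong-doers»",
    "Ocrad": "SIR YWAIN’ sat in the Hall of Sulney and did justice upon wrong-doers.",
}

for engine, text in SAMPLES.items():
    print(f"{engine}: {similarity(REFERENCE, text):.3f}")
```

A single sentence is far too small a sample to rank engines that are close to each other, so treat the scores as a sanity check on the eyeball comparison, not a benchmark.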

After testing, I see no reason not to use the most popular option, Tesseract. It is extremely accurate out-of-the-box for a text such as Aladore, with no need for training, tweaking, or preprocessing. It can be used efficiently from the command line or through GUI applications. And finally, the license is Free.
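As a minimal sketch of the command line route, the basic invocation is `tesseract IMAGE OUTPUTBASE -l LANG`, which writes the recognized text to `OUTPUTBASE.txt`. The Python wrapper below assumes the `tesseract` binary is on your PATH with English language data installed. The function names and the `page_0021` output name are hypothetical, chosen just for illustration.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_page(image_path, output_base, lang="eng"):
    """Run Tesseract on one page image and return the recognized text.

    Tesseract writes its result to OUTPUTBASE.txt, which is read back here.
    Raises subprocess.CalledProcessError if Tesseract exits with an error.
    """
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For the page used in this post, calling `ocr_page("aladoren00newbuoft_0021.jpg", "page_0021")` would leave the text in `page_0021.txt` and return it as a string. Looping the same call over a directory of page images is the obvious next step for a whole book.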

 

 
