The Digital Aladore project has more than 25 posts, so as we enter stage “4. Capturing Text” I want to quickly review where we are.
The Background stage looked at the author of Aladore, Henry Newbolt. He was a Victorian celebrity, but also had a complex love life that impacted his writing. In fact Aladore seems to be a product of his passionate affair with Alice Hylton who illustrated the novel.
The Witnesses stage looked at the print and digital publication history of Aladore. There was a short diversion into two stories by Edgar Allan Poe to practice creating and distributing EPUBs.
Digital Witnesses (3.5) reflected on how traditional concepts of textual criticism relate to digitization and unraveling digital texts.
Now we are at the Capturing Text stage. We officially have our JPG page images of the two digital witnesses of Aladore. We will now be testing a few different OCR platforms to “capture” our digital text.
Looking ahead, here is the plan:
5. Compare Editions: After we have a couple of full transcripts of the witnesses, we will use some tools to compare the texts. Curious? Okay, check out http://www.juxtasoftware.org for a preview of a great tool! We will do this with the output of the different OCR platforms to evaluate them, but also with the two corrected texts to see if there are any differences between the editions.
6. Edit into a Best Text: we will fix up the raw OCR transcript into a more correct text–following the evidence of the page images if the text appears incorrect. Will we find everything that is inaccurate? No… but we just want a good reading text, not a completely authoritative edition.
7. Massage epub: At this point hopefully we have a lovely text in some format–but we need to make it into a beautiful EPUB! I haven’t thought too much about the intricacies of EPUB formatting yet… but here is a good lead of someone who IS thinking about the presentation of text in ebook formats: Yellow Buick Review, http://yellowbuickreview.wordpress.com. I can slap together a working EPUB using Sigil, but hope to learn more about the finer points, such as CSS to provide better formatting.
8. Release: Once everything looks great, I will put the epub out into the world (i.e. www). It will most likely be at Archive.org.
Sound good? Sound possible to finish some day?
Digital Aladore has been idle for a few weeks while I was ill and dealing with other deadlines, but hopefully we can get the steam rolling again…
It's time to start sorting out options for Optical Character Recognition (OCR) to capture the text of Aladore from our digital witnesses (i.e. page images). OCR has a surprisingly long history: it has been in use since at least the 1960s for enterprise data entry applications and for improving accessibility for the visually impaired. Today, one of the most common ways to encounter OCR is as a feature built into PDF applications, such as Adobe Acrobat or PDF-XChange Viewer (I do not recommend either of these programs: Acrobat is too expensive, and PDF-XChange seems to be bloated nagware). This use of OCR is focused on creating a simple transcript underlying the page images to enable search. This is not ideal for Digital Aladore, since we want to create a non-PDF text. The other option is a dedicated desktop OCR platform. Probably the most used and most accurate OCR platform is the commercial application ABBYY FineReader. While I have used ABBYY at my work, it is expensive, proprietary software. I want to steer this project towards open source and Libre solutions.
There are basically four parts to most OCR platforms:
- image pre-processing (options that optimize the input for the chosen OCR engine, such as scaling, de-skew, and sharpening)
- layout analysis (recognizes the structure of a document to select text areas for OCR. This can provide position data for the resulting recognized text to reproduce the layout or allow it to be embedded below the PDF image)
- OCR engine (algorithms that can recognize characters in an image and output a text character. These are often optimized by “training” or lexicons that add context to the possible interpretations to improve accuracy.)
- GUI (helps the user efficiently process documents, combining access to the other elements, settings, and output options)
Tesseract is the most commonly used open source OCR engine (licensed Apache 2.0). It was first developed by Hewlett Packard in 1985, but eventually released as open source in 2005. Google has been sponsoring further development since 2006. The engine can now be run as a command line application, but it is also used by a wide range of GUI applications. It is generally considered the best open source option. It is usually not as accurate as ABBYY, but it is faster.
There are a few other open source OCR engines. CuneiForm was a commercial competitor to ABBYY for many years, but it was released under a free BSD license in 2008. It is usually not as accurate or fast as Tesseract. Ocrad is a unique engine used by some applications that is part of the GNU project. It is very fast, but not very accurate.
We will be testing a couple of different GUIs and the main engines. To get an overview of the options, check out a comparison at Wikipedia: http://en.wikipedia.org/wiki/Comparison_of_optical_character_recognition_software
Since we have decent individual page images for Aladore, our OCR workflow going forward is similar to what DIY book scanners are doing. Book scanners will shoot photographs of each page and aim to have the results at around 300 PPI. Our Aladore images are in that ballpark, so they should produce reasonable OCR results. Basically, I want an OCR solution with these features:
- batch process hundreds of individual page images
- good accuracy
- OCR area can be selected (avoids further processing later on)
- corrects line breaks (avoids further processing)
I have a Windows 7 and an Ubuntu 14 machine at home, so I could use anything that runs on these platforms. The most basic option is to simply install Tesseract and run it from the command line/terminal. First, download and install the most recent version from https://code.google.com/p/tesseract-ocr or from a Linux repository. Then open a terminal and follow the recipe to test some OCR action!
The basic command is “tesseract imagename.extension outputfilename”. For example, this command runs OCR on the first page of Aladore and outputs the text as a file called “test1914tess.txt” (Tesseract adds the “.txt” extension itself):
tesseract aladoren00newbuoft_0021.jpg test1914tess
Here is what the page image looks like:
And here is the Tesseract output:
OF THE HALL OF SULNEY AND HOW
SIR YWAIN LEFT IT.
SIR YWAINl sat in the Hall of Sulney and
did justice upon wrong-doers. And one man
had gathered sticks where he ought not, and
this was for the twentieth time; and another
had snared a rabbit of his lord’s, and this was
for the fortieth time; and another had beaten
his wife, and she him, and this was for the
hundredth time: so that Sir Ywain was weary
of the sight of them. Moreover, his steward
stood beside him, and put him in remem-
1 Ywain=Ewain or Ewan.
So the results are very accurate! The only problem is the superscript 1 marking the footnote about Ywain, which comes out as a stray “l”. Luckily, this is the last footnote of the book (I think). However, in terms of generating our good reading text there are some limitations which point out the challenges going forward (some of which can be solved by using different software).
First, every page in the book has the header “Aladore” and a page number in the footer. I don’t want those in the reading text. The Aladore header is fairly hard to remove in a batch process–you cannot just find and replace because the word appears within the text as well. The page numbers will also just cause us editing headaches going forward. There are basically two options to get rid of them: 1) crop them out of the page images before OCR or 2) select an area for OCR rather than recognizing the whole page. Either option would be easy if every page were EXACTLY the same, but they are not. The text we want wanders around quite a bit.
Second, Tesseract outputs the recognized text with the existing line breaks. We need to get rid of line breaks and page breaks to make a nicely flowing epub text.
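To make the line-break problem concrete, here is a minimal Python sketch of one way to re-flow the output. It is only an illustration under some assumptions: a blank line is treated as a paragraph boundary, and a hyphen at the end of a line is assumed to mark a word split across lines. The function name `unwrap_ocr_text` and the sample (with an invented break in “twentieth”) are made up for this example.

```python
import re

def unwrap_ocr_text(raw):
    """Join hard-wrapped OCR lines into flowing paragraphs."""
    paragraphs = []
    # Assumption: a blank line in the OCR output marks a paragraph boundary.
    for block in re.split(r"\n\s*\n", raw.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        text = ""
        for line in lines:
            if text.endswith("-"):
                # Assumption: a trailing hyphen means the word continues
                # on the next line, so drop the hyphen and join directly.
                text = text[:-1] + line
            elif text:
                text += " " + line
            else:
                text = line
        paragraphs.append(text)
    return "\n\n".join(paragraphs)

# Hypothetical sample: the break inside "twentieth" is invented for the demo.
sample = (
    "did justice upon wrong-doers. And one man\n"
    "had gathered sticks where he ought not, and\n"
    "this was for the twen-\n"
    "tieth time"
)
print(unwrap_ocr_text(sample))
```

One caveat with this approach: a genuinely hyphenated compound that happens to break at the hyphen (say “wrong-” at the end of a line and “doers” on the next) would lose its hyphen, so the result would still need a proofreading pass against the page images.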
Overall I was impressed with the recognition and ease of use.
I have been playing around with the many no-cost GUI OCR platforms out there, and I want to report some findings. This post looks at some options available on Windows, none of which I found to be very viable for the purposes of Digital Aladore. Another post will relate information about options on Linux.
If you want to test out some super quick OCR without installing anything, the easiest is to use Google Drive.
Go into your Drive, click on the setting icon (gear thing), navigate down to Upload settings, and click on “Convert text from uploaded PDF and image files.” Now upload an image and it will appear in a new Google Doc with the OCR text below. This feature is also available on the Drive app for Android so you can directly OCR your photo “scans” of documents. Obviously this is not a solution for Aladore, but thought I would pass it on in case you had some immediate OCR needs!
As for desktop GUI OCR platforms on Windows, here are some notes just to get the information out there, even if it's a negative result for this project:
- This is an odd site that provides lots of information about the main commercial OCR platforms, plus download of their own freeware. It is actually part of a niche document imaging retailer, the ScanStore. SimpleOCR is free for non-commercial use, but is not open source. Word recognition is good. However, it very often (every page) seems to interpret large spaces in the middle of a paragraph as columns, thus jumbling the order of the text. This kind of mistake is impossible to recover from without painstaking cut and paste. There are no settings to change this behavior. This platform would be unusable with Aladore.
- This is a commonly cited freeware (not open) option that uses the Tesseract engine. However, it was consistently blocked by my security programs as containing a virus. Numerous reviews mention that the installer includes PUPs (potentially unwanted programs) and malware. I am not sure what to think, but I decided it wasn’t worth pursuing since there is nothing special about it.
- Basically a Windows version of the Linux application Tesseract-GUI. It is open source (GNU GPL v3), but hasn’t been updated since 2010. The project never really got off the ground, so it is too buggy to use.
- The interface looks pretty good (more like the ABBYY set up), but I could not get it to actually recognize any text. Python based, open source (GNU GPL v2), but it hasn’t been updated since 2011.
- gImageReader is the most usable on this list and has a Free open source license (GNU GPLv3). It is easy to install on Windows and works without tweaking. However, it is slow at loading page images and layout detection. Also, the OCR is somehow less accurate than Tesseract via command line or via Linux GUI options–I don’t know how this is possible. It does have a number of nice features for editing/polishing the output text. It was last updated in April 2014.
But don’t worry, I found some other options on Linux that run much better!
Just a quick thought to pass on some tools I have been using:
Since I have a big batch of image files for the two editions of Aladore, I wanted to normalize the file names just to make things easy for myself and keep everything straight. For example, the page images from Internet Archive Aladore 1914 were named like “aladoren00newbuoft_0001.jpg”. I think the name comes from “aladore” (title) + “n00” (?) + “newb” (author newbolt) + “uoft” (scanning center, University of Toronto). I wanted to switch them to a more memorable and sensible name, like “aladore_1914_ia_0001.jpg”. It makes things a lot easier when you are working on the command line (with Tesseract) if you don’t have to remember something weird.
There are a few options out there to do this. Personally I use AdvancedRenamer on Windows and GPRename on Linux.
- AdvancedRenamer: Great utility for batch renaming files. You can change the file name plus a lot of the embedded metadata or tags such as timestamps or geolocation. Freeware. Very handy and powerful.
- GPRename: Simple Perl based utility. A lot of nice automatic options built in, like changing the case or trimming spaces.
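If you would rather script the renaming than use a GUI tool, the same idea fits in a few lines of Python. This is just a sketch, assuming every file follows the “aladoren00newbuoft_NNNN.jpg” pattern described above; the function names are made up for illustration, and the dry-run default only prints what would happen.

```python
import re
from pathlib import Path

# Pattern for the Internet Archive file names described above.
PATTERN = re.compile(r"^aladoren00newbuoft_(\d{4})\.jpg$")

def new_name(old_name, prefix="aladore_1914_ia"):
    """Map 'aladoren00newbuoft_0001.jpg' to 'aladore_1914_ia_0001.jpg'."""
    match = PATTERN.match(old_name)
    if match is None:
        return None  # leave files that do not fit the pattern alone
    return f"{prefix}_{match.group(1)}.jpg"

def rename_all(directory, dry_run=True):
    """Rename every matching file in a directory (prints only when dry_run)."""
    for path in sorted(Path(directory).iterdir()):
        target = new_name(path.name)
        if target is None:
            continue
        print(f"{path.name} -> {target}")
        if not dry_run:
            path.rename(path.with_name(target))
```

Calling `rename_all("pages")` on a (hypothetical) “pages” directory lists the planned changes; passing `dry_run=False` actually renames the files.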
P.S. If you use Xfce, there is Bulk-renamer as a plugin for Thunar, http://docs.xfce.org/xfce/thunar/bulk-renamer/start.
Linux has a few good Free GUI OCR options that are still actively developed. The two most popular applications are YAGF and OCRFeeder, both easily installed via repositories or software center, both licensed GNU GPLv3. You will need to separately install the OCR engines, so make sure you have tesseract-ocr packages as well. YAGF easily integrates only Tesseract and Cuneiform, while OCRFeeder can detect any installed engine.
To test the ease of use, I loaded the first three pages of Aladore unedited, recognized the text, and saved the output.
- YAGF’s very simple interface and workflow makes jumping right into OCR easy. Just select some images and click recognize all pages. The text will show up as a single document on the right hand panel. If you change any options and click recognize again, the text is added to the end–this is convenient for quickly comparing output. To save the text, cut and paste, or click save to create a plain text or HTML file. Switching between Tesseract and Cuneiform is a simple menu selection. A number of other features, such as auto preprocessing, correcting page skew, auto region detection, and spell check can be accessed from the other icons on the interface. It will take a bit of playing around before you figure out how to efficiently use them. The interface is not completely polished and the options are fairly basic, but overall YAGF is a complete and easy to use OCR platform. It was last updated in August 2014; however, the most up-to-date version, 0.9.4.2, is not available in package repositories (Ubuntu Software Center offers 0.9.2.1).
- OCRFeeder has a more polished interface, however the workflow is not as straightforward as YAGF. It offers more fine-grained control of each text/image region of the document. Unfortunately, unless you use the full automatic detection this means the process takes much longer. OCRFeeder supports any OCR engine you install on your system, and you can actually use different engines on each text region. For preprocessing, OCRFeeder integrates Unpaper, a tool created to process scanned book page images. Again in contrast to YAGF, it offers more control over the options. One nice feature is output to ODT format (OpenDocument Text, the standard format for Open/LibreOffice Writer). Each of the detected regions will be translated to a text/image box on the document. This means you can translate the page layout to an editable format, rather than just a searchable PDF or plain text. However, I don’t need any of this functionality for Digital Aladore! The overly complicated workflow is not necessary since I just want the simple, unformatted text. Still, if you have a document with lots of images that you want to properly capture and embed in your new text, OCRFeeder’s workflow would be better than YAGF. Version 0.8 was released in August 2014, although most repositories still offer only 0.7.11 (February 2013).
There are a few other options which I quickly tested, but found more work to install and/or use than necessary while offering no further functionality. These include tesseract-gui, gImageReader, lios, and ocrivist.
For my purposes, YAGF seems like the best OCR GUI, but it's a toss-up for now.
The main use of the GUIs is to help with layout detection. I want to detect the text body only (leaving off the header and page numbers), but these applications cannot do this automatically. It would be too much work to go through 400 pages and manually adjust the text box, so I need to look for a preprocessing method of eliminating the headers and footers. Although the GUI makes OCR convenient, I am not sure it is necessary to this project, since it may be just as easy to create a batch process to run the processing directly through Tesseract.
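As a rough idea of what such a batch process could look like, here is a small Python sketch that runs the basic “tesseract imagename outputfilename” command once per page image. It assumes Tesseract is installed and on the PATH; the directory names and function names are hypothetical, chosen only for this example.

```python
import subprocess
from pathlib import Path

def build_commands(images, out_dir):
    """Return one tesseract command per page image, in file-name order."""
    commands = []
    for image in sorted(Path(p) for p in images):
        # Tesseract appends ".txt" to the output base name itself.
        out_base = Path(out_dir) / image.stem
        commands.append(["tesseract", str(image), str(out_base)])
    return commands

def run_batch(image_dir, out_dir):
    """Run Tesseract on every JPG in image_dir, one text file per page."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    images = Path(image_dir).glob("*.jpg")
    for command in build_commands(images, out_dir):
        # check=True stops the batch if Tesseract reports an error.
        subprocess.run(command, check=True)
```

Running `run_batch("pages", "ocr_text")` would leave one text file per page in “ocr_text”, ready to be joined into a single transcript afterwards.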
For example, recognizing the first page of Aladore 1914:
Tesseract generates this very accurate text without any preprocessing:
ALADORE. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YWAINl sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
While the Cuneiform engine generates this text:
ALA DORK, CHAPTER I. SIR Y%’AIM sat, in the Ha11 of Sulney and dKI ]Ustice upon %’fong-doers» And one &an had gathered sticks where he ought not, and this vedas for the twentieth time; and another had snared a rabbit of his lord,’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth tiine: so that Sir Vwain was weary of the sIght of theln» Moreover» his steward stood besN1e 4Imq and pQt him In reIACAI”
I kind of like “ALA DORK”… However, using YAGF’s automatic preprocessing improves the result greatly. After processing, Cuneiform generates this text (still not as accurate as Tesseract):
A L A D 0 R E. CHAPTER I. OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT. SIR YwAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’ s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredt.h time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
Ocrad is noticeably faster (all three can be tested on OCRFeeder), and generates this text:
OF THE HALL OF SULNEY AND HOW SIR Y\_AIN LEFT IT.
SIR YWAIN’ sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; aod another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remem-
After testing, I see no reason not to use the most popular option, Tesseract. It is extremely accurate out-of-the-box for a text such as Aladore, with no need for training, tweaking, or preprocessing. It can be used efficiently via command line or GUI applications. And finally, the license is Free.
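For a slightly less eyeball-based comparison of engine output, Python’s standard difflib module can list where two transcripts disagree. The sketch below is only an illustration: the helper names are made up, and the two sample strings are short snippets taken from the Tesseract and Ocrad results above.

```python
import difflib

def similarity(text_a, text_b):
    """Word-level similarity between two transcripts (1.0 = identical)."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()

def word_differences(text_a, text_b):
    """List (words in a, words in b) pairs wherever the transcripts differ."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

tesseract_line = "this was for the twentieth time; and another had snared a rabbit"
ocrad_line = "this was for the twentieth time; aod another had snared a rabbit"

print(word_differences(tesseract_line, ocrad_line))  # [('and', 'aod')]
print(round(similarity(tesseract_line, ocrad_line), 3))
```

Note that this only measures how much two outputs agree with each other, not which one is right; judging accuracy still means checking against the page image.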
One frustration is that although most of the OCR platforms can automatically detect page layout, they are not very good at it. Titles are often split into multiple sections instead of a single block, imperfections on the side of the page are detected as images, or lines with a bit of extra space are detected as columns. Furthermore, they do not allow you to choose a single page element to recognize without adjusting every page manually. In the case of Aladore, each page has a header (saying “Aladore”), a text block (the actual text), and a footer (page number)–I would like to simply choose the text block and ignore everything else. This is only possible by manually selecting the text on each page. Not a very viable option unless you have PLENTY of time.
We could just OCR the whole page and try to deal with the header and footer in post-processing the text output. However, as I have mentioned in other posts, getting rid of the header and footer is complicated. You cannot just find and replace “Aladore” and all the numbers, since they may also appear in the text body. Furthermore, because the header and footer use different font sizes and character spacing, they generate more OCR errors than the main text. This makes it even harder to isolate them to delete from the text.
I decided it might be simplest to crop away all the extra page information before doing OCR. The idea is to go from a full page image, down to just the text block I want, eliminating the header and page number.
This sounds like it should be an easy batch process, but reality… is different, because the text block is not in the same place on every page. There is considerable variation in the printed book which is further exaggerated by variation in position during scanning.
To overcome these issues, I evaluated a number of options. I will list a few possibilities here for informational purposes, although in the end I decided to use ScanTailor. ScanTailor will be covered in a separate post, but I think these rejected options may be useful for other applications.
The simplest way to crop images is (of course) to use some photo management software. There are a number of options that enable efficient cropping. For example, on Windows I use FastStone Image Viewer (Freeware). When you open a crop window you can set a size and move quickly through an entire directory. This way you can move the crop box around for each page as needed, but keep the general size without needing to reselect every time. It is not ideal, because you have to look at every page individually, but a 400 page book won’t take too long…
For PDFs, I use a helpful cropping utility called Briss (GNU GPLv3). Imagine you have a PDF article you want to read on an ereader. It has huge margins with a tiny text body in the middle of the page or text split in columns. To make it easier to read, Briss lets you crop away all empty space or turn the columns into separate pages. The trick is that Briss actually clusters the pages based on their layout first, enabling you to apply different crops for different clusters of pages.
This helps overcome the issue of the text body’s variable position on the page. However, it is important to note that Briss is not actually changing the image embedded in the PDF. It is only changing how it is marked in the PDF metadata, so that it will be displayed in the way you selected. This means using Briss alone won’t improve our OCR, because Tesseract actually extracts the full image files from the PDF before doing OCR.
Theoretically, we could still use it, via this ugly work-around: 1) put the page images into a PDF, 2) crop with Briss, 3) “print” to a new PDF thus discarding the unnecessary image data, 4) OCR. The main problem with this solution is that our images are JPG, a lossy format. Each time they go through reprocessing the quality will go down. A JPG put into a PDF and then extracted is noticeably lower quality, which will impact the quality of our OCR. No good.
This post is getting long, so I will move the final tool idea to the next post…
Continuing from the last post, basically I went looking for a Briss-like application for individual image files. What I discovered is the unique Java application ImageJ. Although it is a bit of a tangent, I thought I would pass along my notes since it might be useful for some other projects–and anyway, it's a really fascinating program!
Since ImageJ was developed at the National Institutes of Health, a US government agency, it is in the public domain. ImageJ can also be used via Fiji, an optimized package of ImageJ core, dependencies, and plugins focused on helping process scientific research images. Fiji development is active and provides continuous updates. ImageJ has a very steep learning curve, even if you are familiar with other image editing software, since it uses unconventional terminology and workflows. However, it offers many unique and powerful features–I have barely scratched the surface. If you want to figure out how to do something, surf around on forums, or read the giant User Guide. I could not figure out how to automatically cluster the page images like Briss does, but I think it might be possible (?). However, based on a variety of forum posts I found all over the place and some experimentation, I pieced together this simplistic workflow to crop batches of page images:
- Open a batch of images in ImageJ (multiple items can be selected from the file browser). I used one chapter of pages.
- Create a “stack”, i.e. combine the opened images into a single window that can be worked on as a batch: On the menu, select Image > Stacks > Images to Stack. Then click OK to create the stack.
- Create a “Z Projection” to make the features of all pages visible at once, so we can decide where to crop: Select Image > Stacks > Z Project. The default projection should be “Average Intensity”, but for a better representation for my purposes, use the drop down menu to switch it to “Sum Slices” and click OK.
- Select the text area on the Z Projection: After processing we have a new window (the “Z projection”) displaying the “Sum” of the page images. Basically we can see exactly where all the text is for every page at once. Use the rectangle selection tool (on the toolbar) to click and drag a rectangle over the part of the page we want to keep. ImageJ calls this the “region of interest”!
- Transfer this selection to the “Stack”: click on the stack you first created. On the menu, go to Edit > Selection > Restore Selection. The selection box you drew on the Z Projection will be transferred to the Stack.
- Crop: on the menu, go to Image > Crop. The stack will be cropped.
- Export the cropped pages from the stack: on the menu, go to File > Save As > Image Sequence. The dialog box will let you set the parameters for the export. Select the option “Use slice labels as file name” to use the original file names.
After a bit of processing, you will have a batch of nicely cropped page images! This option is pretty neat and efficient, but since we are not adjusting for the variation of each page individually, it leaves bits of the header or footer on some pages. It would be better if we could get a program to automatically detect the text body instead.
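To show the logic of the stack, projection, and crop steps outside ImageJ, here is a toy Python sketch. It is not how ImageJ works internally: the pages are tiny made-up grids (1 = ink, 0 = paper), the function names are invented, and real page images would need an imaging library.

```python
def sum_projection(pages):
    """Add same-sized pages together pixel by pixel (1 = ink, 0 = paper)."""
    height, width = len(pages[0]), len(pages[0][0])
    total = [[0] * width for _ in range(height)]
    for page in pages:
        for y in range(height):
            for x in range(width):
                total[y][x] += page[y][x]
    return total

def crop_stack(pages, box):
    """Apply one (left, top, right, bottom) box to every page in the stack."""
    left, top, right, bottom = box
    return [[row[left:right] for row in page[top:bottom]] for page in pages]

# Two toy 4x4 "pages": row 0 holds the header; the text block sits lower
# and is shifted down by one row on the second page.
page_a = [[0, 1, 1, 0],
          [0, 0, 0, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0]]
page_b = [[0, 1, 1, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0],
          [1, 1, 1, 1]]

# The projection shows where ink falls on any page of the stack.
projection = sum_projection([page_a, page_b])
for row in projection:
    print(row)

# A box chosen by eye from the projection: keep rows 2-3, all columns.
cropped = crop_stack([page_a, page_b], (0, 2, 4, 4))
```

Because the same box is applied to every page, a page whose text block has shifted far enough keeps a sliver of header or footer, which is why automatic detection of the text body would be preferable.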
More on that in the next post!
At the beginning of the project, I was planning to use commercial OCR software to compare with the open options. However, at this point I have decided NOT to create a full text with a proprietary platform, and just focus on using Tesseract–I just don’t have enough time to do both well! To give you a sense of what it would be like to use a nice proprietary system, here is an extensive post about ABBYY FineReader:
Founded back in 1989, ABBYY is based in Russia and is probably the most common commercial OCR engine. Their products are often integrated into document processing platforms and content management systems. For example, many museums and libraries use CONTENTdm digital collection management to provide access to digitized materials. A FineReader plugin can OCR images as they are uploaded making the content searchable. Of course, the plugin costs thousands of dollars on top of your tens of thousands of dollars in subscription fees!
ABBYY also develops a desktop GUI OCR platform featuring powerful PDF creation and editing capabilities that rival Adobe Acrobat. It will cost you about $169, although you can get a very limited free trial (30-days, 100 pages, 3 pages at a time). This price tag gets you more out-of-the-box functionality built in than any open software. For example, Tesseract can recognize around 40 languages, but requires downloading and installing training packages (it comes only with English). FineReader comes with 190 languages and 48 language dictionaries to refine its recognition. Tesseract theoretically allows users to create custom word lists, but no one can seem to get the feature to work. FineReader integrates with Microsoft Word’s dictionaries so you can easily set up a custom word list. Finally, many specialized office/library tasks are optimized, for example, creating searchable PDF/A documents designed for long-term digital archival storage.
I have used a few older versions of FineReader on computers at work. Here is a screenshot of FineReader 11, which looks about the same as every other platform reviewed at Digital Aladore so far:
I decided to download the free trial FineReader 12 and test a few pages of Aladore. FineReader 12 has been updated to match Windows 8 style, but doesn’t make great use of screen space, unless you have a giant monitor:
The interface and workflow are basically the same as the open platforms: open some images, recognize the text, edit it in the window, and output in a new format. The display offers a few more advanced features, such as highlighting low-confidence recognitions and non-dictionary words so that you can quickly check them. Basically, FineReader just offers MANY more options and settings, allowing more advanced control.
Here is how it compares to Tesseract:
- processing is a bit slower
- layout detection is better and more nuanced (can detect headers/footers, tables, etc.)
- accuracy is the same (with a simple English text like Aladore)
However, output is where ABBYY is considerably different from other options. First, its excellent layout detection allows it to reproduce complex formatting such as tables, side bars, or bulleted lists. Second, it can create a wide variety of output files, including word documents, full featured PDFs, spreadsheets, and ebooks–yes, it can directly create an EPUB!
The EPUB formatting is not great, so it would still require a lot of editing and tweaking…
Yeah, FineReader is pretty slick–but it's not that great! It is still a lot of work to create a good text. Most people use FineReader to efficiently create searchable PDFs without editing the OCR output. Since they see the image layer as the main means of actually reading the text, the underlying transcript doesn’t have to be very accurate.
I will stick with Free software, and end up with the same results…