Onward to EPUB3?

DigitalAladore 1.0 is a valid EPUB2. To recap: EPUB was chosen for the ebook because it is a free and open format built on open web standards (in contrast to proprietary formats such as Kindle AZW). And we love Free because of the many practical benefits of open source development plus the moral ideals of respecting the user’s freedom.

The EPUB2 standard was first released in 2007, but has since been superseded by EPUB3, released in October 2011. EPUB3 was designed to take advantage of new elements introduced in HTML5 and to allow more interactive functionality (scripting). However, support for the full specification continues to be very poor. The only readers with full support seem to be commercial apps that deliver interactive books in a closed ecosystem. For example, AZARDI offers a cost-free reading app that has good support for the advanced features of EPUB3, but it is focused on secure “content fulfillment” of interactive textbook subscriptions. To publish to the platform, authors must use the company’s proprietary ebook creation application. Kobo and Apple have developed tweaked versions of EPUB3 that do not fully comply with the standard and that focus on the possibilities for improved DRM rather than on functionality not found in EPUB2.

However, for simple functionality (i.e. a linear novel) EPUB3 is supported by most reading devices. I decided to update the Aladore EPUB2 to an EPUB3 version for future compatibility, higher specs, and improved semantic inflection. Guidelines now suggest adding larger images and cover images than I used in the EPUB2 to ensure they don’t look terrible on HD tablets. So while DigitalAladore 1.0 was optimized for older e-ink ereaders, the EPUB3 version will be optimized for larger, more powerful devices.

However, Sigil does not currently support the creation of ebooks following the EPUB3 spec. If you make changes to the markup following EPUB3, Sigil will actually correct them back to EPUB2 when saving the file.  So, to create the Aladore EPUB3 we have to do a few extra steps:

  • Replace all the image files with larger versions using Sigil.
  • Use the Sigil plugin ePub3-itizer to export a pseudo EPUB3. Sigil developers intend to implement full EPUB3 creation and editing support soon, so this plugin is considered a “stop-gap measure.” It changes the HTML headers, restructures a few files, and adds the nav.xhtml.
  • Unzip the ePub3-itizer output to edit the contents. Because Sigil limits the markup to XHTML that is valid under the EPUB2 spec, it is not possible to add HTML5 tags such as section or EPUB3 attributes such as epub:type inside Sigil (thus, the export is what I call a pseudo EPUB3). I used the IDPF Accessibility Guidelines (The epub:type attribute) plus the attribute vocabulary, the EPUB 3 Structural Semantics Vocabulary, to add some semantic structure to the text. This markup can be used for styling the document with CSS, but is also useful for machine processing and accessibility options. You can mark up sections of the ebook (frontmatter, body, backmatter), divisions within them (abstract, chapters), types of content (footnote), or individual elements (title). In the EPUB2 I had added div tags with class attributes, which I now converted to section tags; for example, each <div class="chapter"> became <section epub:type="chapter">. I used these epub:type values: cover, titlepage, chapter, epigraph, toc, and loi. Since I made each chapter a single XHTML file, another option would be to add the epub:type attribute to the body element. However, those attributes would be lost if the HTML files were merged, so I prefer the section tags.
  • Delete the toc.ncx file.  This file was used by older reading devices to provide navigation functionality, but it is not part of the EPUB3 spec as it is replaced by nav.xhtml. However, many people seem to be leaving this file in the EPUB for legacy support. If you leave it, everything should work fine, but the file will NOT fully validate.
  • Re-Zip the new EPUB3. EPUBs need to be zipped in the correct order or they will not function. This means you must create the zip archive first (in Windows right click somewhere and choose New > Compressed (zip) Folder), then add the mimetype file (drag it into the new zip folder). Then all the rest of the content can be added. Finally, change the extension from .zip to .epub.
  • Validate with the IDPF EPUB Validator.
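If you would rather not drag files around by hand, the re-zip step can be scripted. Here is a minimal sketch in Python (not the method I used for Aladore). The function name build_epub and the folder layout are my own assumptions for illustration. The one firm rule it encodes is that the mimetype file must be the first entry in the archive and must be stored without compression.

```python
import zipfile
from pathlib import Path


def build_epub(source_dir, epub_path):
    """Zip an unpacked EPUB folder, writing the mimetype entry first.

    Hypothetical helper: source_dir is the unzipped ePub3-itizer output,
    epub_path is the file to create (keep it outside source_dir).
    """
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype file goes in first and is stored uncompressed.
        archive.write(source / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Everything else can follow in any order, compressed as usual.
        for path in sorted(source.rglob("*")):
            arcname = path.relative_to(source).as_posix()
            if path.is_file() and arcname != "mimetype":
                archive.write(path, arcname,
                              compress_type=zipfile.ZIP_DEFLATED)
```

Calling build_epub("aladore_unzipped", "aladore.epub") (both names made up) writes the .epub directly, so there is no need to rename a .zip afterwards. The result should still go through the validator.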

The sketchyTech blog talks about the differences created by this process in more detail if you want to hear from someone else…
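The div-to-section conversion from the steps above can also be automated on the unzipped files. This is only a rough sketch in Python, and it rests on two assumptions: the chapter div contains no nested div tags, and the html element of each file already declares the epub namespace (xmlns:epub="http://www.idpf.org/2007/ops"). The function name is made up for illustration.

```python
import re

# Matches one chapter div and captures everything inside it.
# Assumes there are no nested <div> tags inside the chapter.
CHAPTER_DIV = re.compile(r'<div class="chapter">(.*?)</div>', re.DOTALL)


def divs_to_sections(xhtml):
    """Rewrite each chapter div as a section carrying epub:type."""
    return CHAPTER_DIV.sub(r'<section epub:type="chapter">\1</section>', xhtml)
```

For anything more complicated than one flat div per file, a real XML parser would be the safer choice.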

But basically, that’s it!  Not too complicated, although it requires some thought about 1) the quality of images to include, 2) changes to styling with larger screens in mind, and 3) consideration of semantic inflection to provide better accessibility and machine readability. I will post the new Aladore EPUB3 soon!

Poetry Markup

If you remember, I talked a bit about the difficulties of formatting the poetry in an ebook (Aladore didn’t turn out tooooo complicated, since the poems and structures were pretty simple). And if you remember way back, we started on the journey To Aladore with a bit of poetry, the Song of the Children in Paladore.

So, to preview the new poetry markup and styling, I put the new CSS inline so we can revisit the song:

To Aladore, to Aladore,
Who goes the pilgrim way?
Who goes with us to Aladore
Before the dawn of day?

O if we go the pilgrim way,
Tell us, tell us true,
How do they make their pilgrimage
That walk the way with you?

O you must make your pilgrimage
By noonday and by night,
By seven years of the hard hard road
And an hour of starry light.

O if we go by the hard hard road,
Tell us, tell us true,
What shall they find in Aladore
That walk the way with you?

You shall find dreams in Aladore,
All that ever were known:
And you shall dream in Aladore
The dreams that were your own.

O then, O then to Aladore,
We’ll go the pilgrim way,
To Aladore, to Aladore,
Before the dawn of day.

Do you like it better than the Blockquote on the old To Aladore post?

Here is what it looked like in print:

To Aladore, p.250, 1914 edition

You can see that the print verse is actually in a slightly smaller font. To reproduce a bit of these typographical techniques used to set off the poetry from body text, I added a few more CSS tweaks: font-size: 0.95em; word-spacing: 0.2em.

Aladore on WordPress post?

Want to see a preview? I put all the new Aladore CSS inline for the first page of chapter III. It doesn’t necessarily come out exactly right, since this is WordPress and the posts don’t serve up the exact HTML I type in. But if you put all that new markup together, you should get something like this:

CHAPTER III.
HOW IT FORTUNED TO YWAIN TO FIND A STAFF IN THE PLACE OF HIS SWORD.

THEN Ywain turned his face towards the village, and went down the hill: and he went with a good heart, for though the boy had left him, yet he hoped not to be long without him, and even now when he looked straight forward it seemed that he had the joy of his company and his laughter. But when he turned and looked beside him, there was but his own shadow; black it lay and long, and about the edges of it a brightness was shining. Then he remembered that the sun was low and night rising among the hollows, and he bethought him of his supper and sleep.

So he went quickly to the village, and passed through it and came to the farmer’s house that lay beside the great wood: and the farmwife gave him welcome, as one that knew not who he was, but could well pitch her guess within a mile or so. And she whispered to her husband, but he was hard of hearing and full of slumber from the fields. So when Ywain had supped, they showed him where he should lie. And when he was come there he laid him down, and the day went from him in a moment and he knew no more whether he were alive or dead.

There are a few more tweaks to do on this text, which we will talk about in this section!

Take a Peek at the Markup!

A few posts ago I outlined the improvements to the underlying markup of the text, moving us beyond the draft ebook.  Let’s take a concrete look at what this means.

Here is what Chapter 3 of the draft Aladore EPUB looked like:

<h2 id="sigil_toc_id_3" style="text-align: center;">CHAPTER III.<br />
HOW IT FORTUNED TO YWAIN TO FIND A STAFF IN THE PLACE OF HIS SWORD.</h2>

<p>THEN Ywain turned his face towards the village, and went down the hill: and he went with a good heart, for though the boy had left him, yet he hoped not to be long without him, and even now when he looked straight forward it seemed that he had the joy of his company and his laughter. But when he turned and looked beside him, there was but his own shadow; black it lay and long, and about the edges of it a brightness was shining. Then he remembered that the sun was low and night rising among the hollows, and he bethought him of his supper and sleep.</p>

<p>So he went quickly to the village, and passed through it and came to the farmer's house that lay beside the great wood: and the farmwife gave him welcome, as one that knew not who he was, but could well pitch her guess within a mile or so. And she whispered to her husband, but he was hard of hearing and full of slumber from the fields. So when Ywain had supped, they showed him where he should lie. And when he was come there he laid him down, and the day went from him in a moment and he knew no more whether he were alive or dead.</p>

The new version of the markup looks like this:

<div class="chapter">
<h2 class="chapterHeading"><span class="chapterNumber">CHAPTER III.</span><br />HOW IT FORTUNED TO YWAIN TO FIND A STAFF IN THE PLACE OF HIS SWORD.</h2>

<p class="firstParagraph">THEN Ywain turned his face towards the village, and went down the hill: and he went with a good heart, for though the boy had left him, yet he hoped not to be long without him, and even now when he looked straight forward it seemed that he had the joy of his company and his laughter. But when he turned and looked beside him, there was but his own shadow; black it lay and long, and about the edges of it a brightness was shining. Then he remembered that the sun was low and night rising among the hollows, and he bethought him of his supper and sleep.</p>

<p>So he went quickly to the village, and passed through it and came to the farmer's house that lay beside the great wood: and the farmwife gave him welcome, as one that knew not who he was, but could well pitch her guess within a mile or so. And she whispered to her husband, but he was hard of hearing and full of slumber from the fields. So when Ywain had supped, they showed him where he should lie. And when he was come there he laid him down, and the day went from him in a moment and he knew no more whether he were alive or dead.</p>

Note the new div with class chapter, the chapterHeading class on the h2, the chapterNumber span, and the firstParagraph class on the opening paragraph.  Here is the new CSS relevant to this selection:

body {
font-family: Georgia, serif;
margin-left: 1.1em;
margin-right: 1.1em;
}

h2.chapterHeading {
font-family: Georgia, serif;
font-size: 1.15em;
line-height: 1.6;
text-align: center;
margin-top: 5em;
margin-left: 3em;
margin-right: 3em;
}

span.chapterNumber {
font-size: 1.35em;
letter-spacing: 0.1em;
line-height: 2.5;
}

p {
text-indent: 1.2em;
line-height: 1.5;
margin-top: 0em;
margin-bottom: 0em;
}

p.firstParagraph {
text-indent: 0;
margin-top: 0.5em;
}

Which should get us to something that looks like this:

Chapter 3 rendered by the Readium app.

Aladore Regex

In earlier posts, I mentioned using advanced features of find & replace to do some automated editing of the HTML tags in the text.  Let me give you another example to show how useful some clever find & replace can be!

Before uploading the draft Aladore EPUB, I wanted to quickly edit the format of the chapter titles.  At this point the chapter headings looked like this:

<p>CHAPTER IV.</p>

<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>

They look exactly the same as normal body paragraphs.  Of course, in the print version this is not the case.

Aladore 1914, page 20.

To better match the look of the print book, I decided the headings should be centered and tagged h2.  This will more clearly set them off from the body text, something like this:

<h2 style="text-align: center;">CHAPTER IV.</h2>

<h2 style="text-align: center;">HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

However, I also wanted to quickly generate a table of contents.  If I tagged the heading as shown above, the automatically generated TOC would have two separate entries for each chapter.  So a quick and dirty alternative is to put a <br /> between the two parts of the chapter title, making them a single h2 unit that will appear correctly in the TOC.  This solution looks like:

<h2 style="text-align: center;">CHAPTER IV.<br />HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

So, how can we automatically find the 58 chapter headings and get them tagged correctly without doing it by hand?  Luckily, Sigil supports Find & Replace with Regular Expressions (i.e. Regex, make sure you have regex chosen for the Mode in the find window). With a few handy expressions and some logical thinking, we can sort this out in no time.  If you want to learn about and practice Regex, check out RegExr.

The main things we need to work with for this application are the lookahead and lookbehind expressions.  In this case it is very easy to test the accuracy of an expression: click Count All and it should find exactly 58 items. Any other number means the expression is missing some chapter headings or catching something that is not one.

Sigil find & replace.

First, I need to replace the <p> in front of the title “CHAPTER…” with <h2 style="text-align: center;">.  To do this, use a lookahead, (?=ABC).  This means we are using “CHAPTER” in our search, but will not select it for the Replace function.  The Find looks like this:

<p>(?=CHAPTER)

It will find and select only the <p> that appear before “CHAPTER”, but it does not select CHAPTER.  Awesome!

Next, I want to find the p tags between the two sections of the chapter heading and replace them with <br />.  Since the roman numerals that follow CHAPTER make it a string of variable length, we cannot use a lookbehind here: in this regex flavor a lookbehind has to match a fixed number of characters.  Instead we need to figure out a regex that will find only the chapter subtitle (for example, HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.) to use in another lookahead.  Every chapter subtitle is in all uppercase and at least 20 characters long, AND there is no other string in the book that has those qualities.  We can use these facts to create our find string: [A-Z,\s,']{20,}.  This expression means: find any string that includes ONLY the uppercase letters A through Z, \s white space, or ' (which needs to be included since some titles have possessives) AND that is 20 characters or more in length.  (The commas inside the brackets are not separators; they simply add the comma itself to the set of allowed characters.)  I didn’t do any serious analysis to decide on the number 20. I just looked at a few of the subtitles, counted the number of characters, and tested a few numbers by entering the expression and clicking Count All.  If I got 58 results, I knew I was on the right track.  The expression will find only our chapter subtitles, so we can use it in a lookahead to select the two p tags between the chapter number and the subtitle.  The find looks like this:

</p>

<p>(?=[A-Z,\s,']{20,})

It will select only the two p tags, which I replace with a break.

Finally, we need to replace the </p> after the chapter subtitle with a </h2>.  This time we need to use a lookbehind, (?<=ABC), because there is no consistent string after the subtitle that we can use in searching.  In a lookbehind, we cannot use a string of open length as I did above.  However, the same qualities of the subtitle give us enough information to create a distinctive search that will exclude everything that is NOT a subtitle.  In this case, a string of exactly three characters, including only uppercase letters and a period, [A-Z,.]{3}, in front of a </p> gives us the 58 results we want.  This works because no other paragraph in the book ends with a word in uppercase letters.  The lookbehind Find expression looks like this:

(?<=[A-Z,.]{3})</p>

It selects only the </p> tag, which I replace with </h2> to close off the heading.  Now, we have all the chapter titles tagged correctly and looking… well, sort of beautiful.  When we get into more polishing, we will use similar expressions to add the style tags needed for CSS.
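If you want to try these three replacements outside Sigil, here is a small sketch using Python's re module. Sigil's regex engine is not Python's, so treat this as an approximation. The sample text and the function name are made up for illustration, and I use \s* where the Sigil search has a literal blank line between the tags.

```python
import re

# Made-up sample: one chapter heading followed by an ordinary paragraph.
sample = (
    "<p>CHAPTER IV.</p>\n\n"
    "<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>\n\n"
    "<p>Now when Ywain awoke, the sun was high.</p>\n"
)


def tag_chapter_headings(html):
    # Step 1: the <p> directly before "CHAPTER" becomes the opening h2.
    html = re.sub(r"<p>(?=CHAPTER)", '<h2 style="text-align: center;">', html)
    # Step 2: the </p> ... <p> pair before an all-caps subtitle becomes <br />.
    html = re.sub(r"</p>\s*<p>(?=[A-Z,\s,']{20,})", "<br />", html)
    # Step 3: a </p> preceded by three uppercase/period characters closes the h2.
    html = re.sub(r"(?<=[A-Z,.]{3})</p>", "</h2>", html)
    return html


print(tag_chapter_headings(sample))
```

The heading comes out as a single h2 with a <br /> in the middle, and the ordinary paragraph is left alone because it ends in lowercase letters. Python has the same restriction as described above: the lookbehind has to be a fixed width, which is why {3} works there and {20,} would not.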

Isn’t it amazing that we have gone from textual transmission to Regex?  Ha, ha, I love Digital Aladore!

Collation 1

To do a quick and easy comparison of the 1914 and 1915 HTML texts, the best tool is Notepad++, http://notepad-plus-plus.org.  This is a weird case where Windows has an awesome Free software tool that Linux doesn’t!  Notepad++ is a powerful and flexible text editor with extensive features to make coding easier.

The Notepad++ community also creates bunches of plugins to extend its functionality.  For this task we need the Compare plugin.  You may need to add it: on the menu click Plugins > Show Plugin Manager, then find Compare, check the box, and click install.  Now, you are ready to compare any type of text based file–easy!

Simply open the files you want to compare (Notepad++ uses tabs), then click Plugins > Compare > Compare.  With our two Aladore HTML files, it will look something like this:

Compare on Notepad++

The texts are aligned and scroll in sync, with a representation of the differences displayed on the right side.  Each line with a discrepancy is highlighted (but not the actual different characters or words).  The type of change is indicated by colors and icons (for example, line added, line deleted, or line moved).  This quickly reveals simple formatting issues.

In the example pictured above, the red highlights reveal a paragraph that was broken incorrectly in the 1915 text.  Both files can be edited, so this is easily fixed by deleting the extra <p> tags and empty lines.  The text was the same, only the HTML tagging was incorrect.  I quickly worked through the red (line deleted) and green (line added) highlights, which were all similar formatting issues.  This resulted in two HTML files with 1105 lines each.  This confirms that the editions are nearly identical!

However, this still leaves hundreds of yellow highlighted lines, which simply indicate some change somewhere in the line (i.e. within a single paragraph, <p> to </p>).  The exact difference is NOT highlighted.  The majority of these differences are a single character, such as “S” versus “s”.  It would be painstaking to find them all using Compare.
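To give a feel for what a character-level comparison looks like, here is a tiny sketch using difflib from Python's standard library. It is not one of the tools used in this project. The two sample lines are made up, and the function name is only for illustration.

```python
import difflib


def char_differences(line_a, line_b):
    """Return (change type, text in a, text in b) for each differing run."""
    matcher = difflib.SequenceMatcher(None, line_a, line_b, autojunk=False)
    return [
        (tag, line_a[i1:i2], line_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two made-up versions of the same paragraph, differing by one letter.
print(char_differences("<p>So he went</p>", "<p>so he went</p>"))
# [('replace', 'S', 's')]
```

Run over each pair of aligned lines in the two files, something like this would report exactly which characters changed instead of only flagging the line.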

Honestly, it isn’t really necessary to go any further for this project, but to explore a few more tools, we will look more into these differences in the next post…

Edit HTML with Bluefish

We enter the next stage of Digital Aladore with our ugly raw HTML in hand… and we want to make it into nice reflowable text to convert into an ebook.  Welcome to 5. Editing Text!

What is the most efficient way to rip through those ugly tags and fix up our text?

I am pretty sure it’s Bluefish Editor, http://bluefish.openoffice.nl

Bluefish is a powerful Free software (GNU GPLv3) text editor that supports web development and programming.  It is important to note that it is not a WYSIWYG editor: there is no graphical preview or editing.  For Digital Aladore, I am interested in Bluefish’s advanced find & replace features which allow you to use regular expressions and carry out operations across any size batch of files.  Exciting!

Opening Aladore 1914 with Bluefish.

First, I open the batch of six HTML files output by YAGF OCR of Aladore 1914.  I inspect the HTML tags to make sure I understand what is going on.  Since I will add formatting later during the epub stage, I want to strip away almost everything, but in a controlled manner!

First to go is the style tag in the headers:

<style type="text/css">
p, li { white-space: pre-wrap; }
</style>

Advanced Find & Replace.

We don’t need it, so out it goes with Advanced Find & Replace!  In the screenshot above you can see that Bluefish lets us select an entire directory (and even set the number of recursive levels below it) to run Find & Replace on at once.  I love this!  But this first operation is pretty tame… only six items gone, one for each HTML file header.

Next, I get rid of all the style attributes on the remaining tags, since they are not really necessary or meaningful.  Simply run Advanced Find & Replace for  style=".*?"  and we find 9,202 items to replace with nothing!  Give it a second and they are all gone.

Now we need to do some more strategic thinking to sort out the patterns left by YAGF OCR.  Basically, each line of the text is contained in its own paragraph tag <p>…</p>.  Between paragraphs is an empty line consistently tagged <p ><br /></p>.  And the end of a page is always:

<p ><br /></p>
<p ><br /></p>
<p > </p>

Ultimately, I want to get rid of those tags, replacing them so that: 1) the line breaks are removed, 2) the page breaks are removed, 3) complete paragraphs are contained in paragraph tags.

#3 is the tricky part, so to do this, we have to go about things in just the right order.  First, I remove the page breaks by replacing the string shown above with nothing.  This resulted in 364 replacements, exactly the number of page images we ran OCR on–Good!

This leaves an empty line in the HTML text files for each page break which I now remove using: Tools > Filters > Strip empty lines.

Next, I remove all the paragraph breaks by replacing <p ><br /></p> with nothing (501 results). Do NOT strip the lines this time, because the resulting empty lines will create the proper paragraphs in the next step.

Now, replace:

</p>
<p >

with a space (7096 results). This pattern represents a line break, which should be removed to create our wrapping paragraphs.  Since the previous step left an empty line between actual paragraph breaks, the first <p> and last </p> of each paragraph do not match the replace string, resulting in correctly tagged paragraphs!

Finally, I replace “- ” (hyphen space) with nothing (259 results).  This catches all the words that were hyphenated at a line break, while leaving all the actually hyphenated phrases alone.
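The order of these replacements matters, so here is the whole sequence as a small Python sketch. I did the real work in Bluefish, not in Python. The sample lines and the function name are made up for illustration, and the style values are shortened.

```python
import re

PAGE_BREAK = "<p ><br /></p>\n<p ><br /></p>\n<p > </p>"
EMPTY_PARAGRAPH = "<p ><br /></p>"

# Made-up sample: one paragraph split over a page break, then a second one.
raw = "\n".join([
    '<p style=" margin-top:0px;">SIR YWAIN sat in the Hall of Sulney and</p>',
    '<p style=" margin-top:0px;">did justice upon wrong-doers. His steward put him in remem-</p>',
    '<p style="-qt-paragraph-type:empty;"><br /></p>',
    '<p style="-qt-paragraph-type:empty;"><br /></p>',
    '<p style=" margin-top:0px;"> </p>',
    '<p style=" margin-top:0px;">brance of all the misery.</p>',
    '<p style="-qt-paragraph-type:empty;"><br /></p>',
    '<p style=" margin-top:0px;">Then he rose up.</p>',
])


def reflow(html):
    # Drop every style attribute; this leaves "<p >" behind.
    html = re.sub(r'style=".*?"', "", html)
    # Remove page breaks, then strip the empty lines they leave.
    html = html.replace(PAGE_BREAK, "")
    html = "\n".join(line for line in html.split("\n") if line.strip())
    # Remove the paragraph-break markers but KEEP the blank lines this time.
    html = html.replace(EMPTY_PARAGRAPH, "")
    # Join printed lines; real paragraph breaks are protected by the blank line.
    html = html.replace("</p>\n<p >", " ")
    # Rejoin words that were hyphenated at a line break.
    return html.replace("- ", "")


print(reflow(raw))
```

On the sample this prints two complete paragraphs separated by a blank line, with “remem-” and “brance” rejoined and “wrong-doers” left alone. Swapping the order of the steps (for example, stripping empty lines after removing the paragraph markers) would merge everything into one paragraph.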

After that the batch looks good!  There are a few more issues that need to be fixed and errors to be searched for, but they are better off done with other tools, i.e. Sigil the ebook editor.  More soon!

Raw HTML text

After YAGF (Tesseract) OCR we have a batch of large HTML files.  For Aladore 1914 it was six HTML files around 1600 lines each.  The first page looks like this:

ALADORE.

CHAPTER I.

OF THE HALL OF SULNEY AND HOW

SIR YWAIN LEFT IT.

SIR YWAIN sat in the Hall of Sulney and

did justice upon wrong-doers. And one man

had gathered sticks where he ought not, and

this was for the twentieth time; and another

had snared a rabbit of his lord’s, and this was

for the fortieth time; and another had beaten

his wife, and she him, and this was for the

hundredth time: so that Sir Ywain was weary

of the sight of them. Moreover, his steward

stood beside him, and put him in remem-

Each line of text from the page image has been tagged as one HTML paragraph, so the breaks follow the original printed page.  Take a look at the markup:

<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">ALADORE.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">CHAPTER I.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">OF THE HALL OF SULNEY AND HOW</p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN LEFT IT.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN sat in the Hall of Sulney and</p>

Each paragraph has a big style attribute that is sort of meaningless since they are all the same.  For my purposes, it’s just ugly and unnecessary.  However, the important thing is that everything is consistent.  I will not use the format described by the HTML tags, but because they are consistent in how they are used, I can easily transform the text.  With a little thought and find & replace, we can easily create a reflowable text, turning the HTML above into something like:

ALADORE.

CHAPTER I.

OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT.

SIR YWAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remembrance of all the misery that had else been forgotten.

Which leads us to the next section of the project!

OCR with YAGF

Now that I have big batches of nicely pre-processed images from ScanTailor, it’s finally time to Capture some text!

YAGF is a free software OCR GUI for Linux, http://sourceforge.net/projects/yagf-ocr.  I reviewed it in an earlier post Linux OCR.

To get started, I click on the Open Image icon, navigate to the ScanTailor output directory for a batch, and select all the images (about 60).  Next, click on Settings > OCR Settings to make sure it is set to Tesseract-OCR.  Then, make sure the HTML output icon is selected.

Now, I click Recognize all pages and let it go to work!

YAGF working.

After a few minutes, the recognized text shows up in the right panel.

I quickly scan through it looking for any issues.  At this point I am not going to fix any formatting, just checking for anything that looks majorly scrambled.  I was actually amazed by the accuracy.  There were very few errors that I could detect, and I fixed only a handful per batch.  Spelling errors are highlighted with a red underline, so I check those as I go along.  There are only a couple of actual spelling errors in the batch, usually two words stuck together.  In Aladore 99% of the detected issues are just names or archaic words, since it is filled with things like “assotted”, “aforetime”, and “thitherward”.  You start to see a few pattern errors emerge: capitalized “SO” at the beginning of a chapter often becomes “80”; “ff” often becomes “fI” or “f1” or “f'”; or “O” becomes zero.  It is easy to reference the original page to check the text by choosing it from the thumbnails on the left side.

It’s important to note that at this point we are entering the realm of editorship.  We are making decisions about the transmission of the text.  I am trying to just replicate what is in the printed text, but what if the page has a blemish obscuring a good reading, or if the printer made a mistake?  Even replication requires interpretation.  The errors I miss at this step will be perpetuated downstream…

Back to practical matters: while working on these batches, I noticed YAGF has two annoying interface issues.  First, when clicking an image thumbnail in the left pane, the image appears in the main window at something like 500%, so you can’t see anything; you need to zoom out several times to actually view the page.  Second, there is no way to remove all the items from the current project or start a new project.  If you don’t want to spend your time dragging each image to the trash icon to remove them individually, you simply have to close the program and reopen it.

Although the OCR recognition was surprisingly accurate, there is one area where Tesseract has major trouble: italics.  If you notice any italics, be sure to go fix it manually.  It will be obvious in the text, because it will be a jumble of nonsense!

Anyway,

After checking the text over, I click the Save icon and save as HTML.  That’s it!  Done.

Assuming you don’t have any italics, this step is amazingly simple and goes quickly!

I will show you what the resulting text looks like in the next post…

Actual OCR Workflow!!

Okay, I have been messing around with dozens of workflow options, and I have finally settled on one version.  It is not necessarily the most efficient, but I found it has the best balance between accuracy and automation.  The basic steps are:

  • preprocess image files with ScanTailor,
  • OCR with YAGF and export as HTML,
  • edit HTML with Bluefish,
  • create ebook from HTML with Sigil.

You might say, “gosh that’s unnecessarily ugly!”

And you are partially right.  Each of these applications has overlapping features that can do more than what I am using them for, but I found each has specific limitations and strengths.  I am only using the features that the application is good at!

For example, the way I am using YAGF is very simple and automated.  I could in fact use YAGF for preprocessing the images and selection of the text areas for recognition, thus avoiding ScanTailor.  However, YAGF’s image processing is not as good and the interface is too cumbersome for these tasks.  It is much more efficient to use ScanTailor and feed YAGF preprocessed images that require no further user input.  But you say, then why not replace YAGF with command line use of Tesseract, automated with a Windows batch file or Linux shell script?  Simply because YAGF automatically combines multiple pages into a single HTML file, rather than one file for each page generated by the command line.  Thus, it saves me one transformation step (combining the HTML files and removing the extra headers), plus gives me some simple visual feedback to catch any major recognition errors.

Furthermore, you may ask why export the OCR as HTML when all you want is the text?  Ah, well that is the clever bit, sort of… The hocr output is intended for creating searchable PDFs using utilities that combine the text tagged with layout position and the page images (for example, check out hocr2pdf available with the Exactimage utilities on Linux).  I don’t in fact need any of the tags.  However, they make it much easier to reformat the text into the form I want because I can easily find the line, paragraph, and page breaks.  The plain text doesn’t have enough information to transform consistently: it’s hard to tell the difference between a page break, paragraph break, or random spacing error.  Thus, the HTML gives me more than I need, but it is easy to strip it down to what I want.  The plain text doesn’t give me enough, and it’s impossible to build it up except by tedious manual labor.

So there you go.  Four more steps, four tools.  I will explain it all soon!