Category: 5. Editing Text

Edit HTML with Bluefish

We enter the next stage of Digital Aladore with our ugly raw HTML in hand… and we want to make it into nice reflowable text to convert into an ebook.  Welcome to 5. Editing Text!

What is the most efficient way to rip through those ugly tags and fix up our text?

I am pretty sure it's Bluefish Editor, http://bluefish.openoffice.nl

Bluefish is a powerful Free software (GNU GPLv3) text editor that supports web development and programming.  It is important to note that it is not a WYSIWYG editor: there is no graphical preview or editing.  For Digital Aladore, I am interested in Bluefish’s advanced find & replace features, which allow you to use regular expressions and carry out operations across any size batch of files.  Exciting!

Opening Aladore 1914 with Bluefish.

First, I open the batch of six HTML files output by YAGF OCR of Aladore 1914.  I inspect the HTML tags to make sure I understand what is going on.  Since I will add formatting later during the epub stage, I want to strip away almost everything, but in a controlled manner!

First to go is the style tag in the headers:

<style type="text/css">
p, li { white-space: pre-wrap; }
</style>

Advanced Find & Replace.

We don’t need it, so out it goes with Advanced Find & Replace!  In the screen shot above you can see that Bluefish lets us select an entire directory (and even set the number of recursive levels below it) to Find & Replace on at once.  I love this!  But, this first operation is pretty tame… only six items gone, one for each HTML file header.

Next, I get rid of all the style attributes on the remaining tags, since they are not really necessary or meaningful.  Simply run Advanced Find & Replace (with regular expressions enabled) for  style=".*?"  and we find 9,202 items to replace with nothing!  Give it a second and they are all gone.
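For anyone who would rather script this step than click through Bluefish, here is a minimal Python sketch of the same replacement. The pattern is the one used above; the function name and the sample line are made up for illustration, and the real YAGF output may carry different style values.

```python
import re

# Same non-greedy pattern as in the Bluefish search: style=".*?"
STYLE_ATTR = re.compile(r'style=".*?"')

def strip_style_attributes(html: str) -> tuple[str, int]:
    """Remove every style="..." attribute; return the new text and the match count."""
    return STYLE_ATTR.subn("", html)

# Hypothetical sample line, not taken from the actual OCR output
sample = '<p style="margin-top:0px;">Sir Ywain</p>'
cleaned, count = strip_style_attributes(sample)
print(cleaned)  # the space before the attribute is left behind, giving "<p >"
print(count)
```

Note that the space in front of the attribute survives, which is why the tags look like <p > in the steps that follow.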

Now we need to do some more strategic thinking to sort out the patterns left by YAGF OCR.  Basically, each line of the text is contained in its own paragraph tag <p>…</p>.  Between paragraphs is an empty line consistently tagged <p ><br /></p>.  And the end of a page is always:

<p ><br /></p>
<p ><br /></p>
<p > </p>

Ultimately, I want to get rid of those tags, replacing them so that: 1) the line breaks are removed, 2) the page breaks are removed, 3) complete paragraphs are contained in paragraph tags.

#3 is the tricky part, so to do this, we have to go about things in just the right order.  First, I remove the page breaks by replacing the string shown above with nothing.  This resulted in 364 replacements, exactly the number of page images we ran OCR on–Good!

This leaves an empty line in the HTML text files for each page break which I now remove using: Tools > Filters > Strip empty lines.

Next, I remove all the paragraph breaks by replacing <p ><br /></p> with nothing (501 results). Do NOT strip the lines this time, because the resulting empty lines will create the proper paragraphs in the next step.

Now, replace:

</p>
<p >

with a space (7096 results). This pattern represents a line break which should be removed to create our wrapping paragraphs.  Since the previous step left an empty line between actual paragraph breaks, the first <p> and last </p> of each paragraph do not match the replace string, resulting in correctly tagged paragraphs!

Finally, I replace “- ” (hyphen space) with nothing (259 results).  This catches all the words that were hyphenated at a line break, while leaving all the actually hyphenated phrases alone.
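The order of these replacements matters, so here is a rough Python sketch of the whole sequence on a tiny invented sample. It assumes Unix line endings and that the page-break filler paragraph contains a plain space; the function name and sample text are hypothetical, and this is not how Bluefish does it internally.

```python
PAGE_BREAK = "<p ><br /></p>\n<p ><br /></p>\n<p > </p>"
PARA_BREAK = "<p ><br /></p>"
LINE_BREAK = "</p>\n<p >"

def rebuild_paragraphs(html: str) -> str:
    # 1. Remove page breaks, then strip the empty lines they leave behind.
    html = html.replace(PAGE_BREAK, "")
    html = "\n".join(line for line in html.split("\n") if line.strip())
    # 2. Remove paragraph breaks, but keep the resulting empty lines.
    html = html.replace(PARA_BREAK, "")
    # 3. Join consecutive lines; the empty lines stop real paragraphs from merging.
    html = html.replace(LINE_BREAK, " ")
    # 4. Rejoin words that were hyphenated at a line break.
    return html.replace("- ", "")

# Invented sample: two paragraphs, the second one split by a page break
sample = "\n".join([
    "<p >Sir Ywain rode forth at</p>",
    "<p >dawn, seek-</p>",
    "<p >ing the city.</p>",
    "<p ><br /></p>",
    "<p >And he came to a</p>",
    "<p ><br /></p>",
    "<p ><br /></p>",
    "<p > </p>",
    "<p >hill by the sea.</p>",
])
result = rebuild_paragraphs(sample)
print(result)
```

Running step 2 before step 1 would break things, because the page-break string itself contains two paragraph breaks.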

After that the batch looks good!  There are a few more issues that need to be fixed and errors to be searched for, but they are better off done with other tools, i.e. Sigil the ebook editor.  More soon!

New action plan

I have some deadlines this week… so to produce some demo products from this project, I decided to temporarily shortcut the original project outline.

Aladore, p.407

I have all the text ready for the 1914 edition of Aladore, and rather than spending time preparing the 1915 text as well, I will push through with the processing of 1914 to a draft EPUB edition.  Then, I will go back and create the 1915 text and carry out the originally planned comparisons sometime in the next few weeks.

In the meantime, we have a few more tools to visit en route to the first draft!

Creating an EPUB 1

After Bluefish tidied up our HTML text, I originally planned to do some comparisons between editions, but since only the 1914 text is ready, I decided to jump ahead to creating a draft ebook.

First, let’s talk a bit about EPUB.

EPUB (electronic publication) is an open and Free standard maintained by the International Digital Publishing Forum, designed for reflowable ebooks that can be properly displayed on any screen size.  Full documentation about the standard, as well as a lot of helpful resources, can be found at the IDPF website, http://idpf.org.  Each EPUB file is basically a ZIP package containing XHTML files for the sections of the book, image files, CSS style sheets, and metadata describing the contents.  A good way of looking at it is that an EPUB contains the elements of a complete website of the book.  If you want to take a look inside, make a copy of an EPUB on your computer and unzip it (use 7-zip, a great Free archive manager, http://www.7-zip.org).  You will find a couple of directories, including one labeled text containing the main XHTML files.
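If you prefer to peek inside without unzipping anything, Python's standard zipfile module can list the package contents. The sketch below builds a tiny stand-in archive in memory so it runs on its own; the file names inside it are illustrative assumptions, and a real EPUB will have more files and may use different directory names.

```python
import io
import zipfile

def list_epub_contents(epub_file) -> list[str]:
    """Return the paths stored inside an EPUB (which is just a ZIP archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()

# Build a tiny stand-in for an EPUB in memory (illustrative layout only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # The mimetype entry comes first and is stored uncompressed.
    archive.writestr("mimetype", "application/epub+zip",
                     compress_type=zipfile.ZIP_STORED)
    archive.writestr("META-INF/container.xml", "<container/>")
    archive.writestr("OEBPS/Text/chapter1.xhtml", "<html/>")
buffer.seek(0)

names = list_epub_contents(buffer)
for name in names:
    print(name)
```

To inspect a real ebook, pass its path instead of the buffer, for example list_epub_contents("copy-of-book.epub") with whatever file name you have.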

Because it is an open and relatively simple standard, it is easy to create and/or hand tweak your own EPUBs.  If you were really bored, you could hand write all the necessary files, put them in the correct directories, and zip them together.  Luckily there are a number of good tools to help create EPUB in a more automated fashion!

As usual, rather than just telling you what method I actually used, I will outline another method first–just to give you some options!

One of the quickest ways to create a usable EPUB is with the Writer2epub extension.  First, you need OpenOffice, https://www.openoffice.org, or LibreOffice, http://www.libreoffice.org.  I won’t get into the debate about which one you should choose, but basically they are identical, Free software office suites.  They can easily replace the popular Microsoft equivalent that doesn’t even need to be named… I have always used OpenOffice on my Windows computers and LibreOffice on my Linux ones–no good reason why!  If you have one installed, download the latest Writer2epub extension from http://writer2epub.it/en (older versions of the extension no longer work, so if you installed it a while ago, update!) and open it to install.  This should add a little EPUB tool bar.

Writer2epub tool bar.

Now open the HTML text with LibreOffice, and save it as an ODT.  If we were trying to create the entire book using this method, you would need to combine all the other HTML files as well (copy & paste works fine!).  The whole document should currently be set to style “Text body.” Now, I go through the text and change the style of each chapter title to “Heading 3.”  At the same time I scan the text for any errors.

Editing Aladore with LibreOffice.

The main thing to remember if trying to create an EPUB by this method is don’t get too fancy.  Use only styles to add formatting (and only the basic styles).  If you use the tab key, tables, headers/footers, or weird fonts, you are more likely to get strange results.  Also, note that many ereaders are a bit dated in hardware and software.  They often only support EPUB2.  The only image types they can render are JPEG, GIF, PNG, and SVG+XML.  It’s safest to stick to GIF for simple graphics (like logos) and JPEGs for general images.

When everything looks good, click on the EPUB E in the tool bar.  This brings up the Writer2epub window.  First, we need to enter metadata for the ebook:

Add metadata.

This information will be embedded with the files to ensure the ebook can be identified.  You could also choose a cover at this time.  Next, we need to adjust the Document Preferences:

Adjust the document preferences.

I check off the options to split the files before Headings 1, 2, and 3.  This creates a separate XHTML file for each chapter, ensuring that we have a page break in the text and making smaller files for the ereader to load.  Then click okay.  LibreOffice will think for a second, and a dialog box showing a log of the epub creation macro will appear to confirm it was successful.  The EPUB file will be saved in the same place as the ODT with the same name.

New epub viewed with Calibre.

Easy! Fast!  Thank you Luca “Luke” Calcinai (creator of w2e)!

However, for the purposes of Digital Aladore, I wanted a bit more control and options, so I used Sigil instead.  See the next post!

 

EPUB Creation 2

To create my Aladore edition, I decided to use the wonderful Free software (GPLv3) EPUB editor, Sigil, https://github.com/user-none/Sigil

I have used Sigil to tweak and create ebooks for many years.  At the beginning of this project several months ago, I was sad to read that active development was going to stop.  In February 2014, Sigil’s main developer John Schember posted that development was stalled and suggested using Calibre’s newly expanded epub editor (we will visit Calibre in another post soon!).  However, at the end of September 2014, Sigil surged back to life with a major new release!  Hooray!  Thank you John Schember and Kevin Hendricks!

A good user guide and tutorials can be found hosted on the old Google code page, http://web.sigil.googlecode.com/git/files/OEBPS/Text/introduction.html, or downloaded as an epub, https://github.com/user-none/Sigil/blob/master/docs/Sigil_User_Guide_0_7_2.epub.

Sigil is a true, full featured EPUB editor.  It can easily import text or HTML files, add images and style sheets, generate table of contents and index, and create a cover.  It allows you to toggle between WYSIWYG and code editing views.  It has a number of utilities built in to tidy and validate your EPUB.  Its only disadvantage at this point is that it is based on EPUB2 and does not yet support the full features of the newest standard, EPUB3.

So let’s open Sigil and get going!

Starting work on Aladore!

First, I go to File > Add > Existing Files and select all the six HTML files containing the text of Aladore (generated by YAGF, edited by Bluefish).  The directory structure of the EPUB is represented in the “Book Browser” in the left hand pane.  I can see the HTML files I added.  Double click on the file in the Book Browser to open it in the main window where all the active files are tabbed.

Now, I work through each file to break it into chapters.  Normally, each chapter is contained in its own HTML file.  This ensures that there is a page break and that the file sizes remain small enough for good performance on ereaders (which may have very low specs).  I put the cursor in front of the chapter title and press Ctrl+Enter (or select Edit > Split at Cursor from the menu, or click the Split at Cursor icon in the tool bar).  The chapter title will now be at the head of a new HTML file.  Select the chapter title and change the style to H1.  You may want to use H2 or H3 to allow some diversity in your heading formatting, but the disadvantage is that Sigil interprets this as a hierarchy.  The table of contents will have all the H2+ nested underneath the earlier H1.  In practical terms, this means when you access the TOC on your ereader you have to click through a hierarchy rather than straight to a chapter.  With Sigil it is just important to get the headings marked–it is very easy to change the level as a batch later on if you want to tweak the formatting.

Making chapter breaks.

I keep the directory of page images handy in case I need to refer to the original to make any corrections as I scan through the text.  Spelling errors are only marked on the code view:

Code view with spelling errors underlined.

If your text has a lot of strange words, like Aladore does, it is a good idea to set up a custom dictionary to work with.  Go to Edit > Preferences > Spellcheck Dictionaries and add a new dictionary:

Add new dictionary.

Now, as I work through the text I add the common weird words (for example, the name of the hero Ywain) to the new Aladore dictionary by right clicking in Code view.  This makes it easier to identify actual errors.

Now it’s just a long slog through the chapters, carefully tagging the headings and touching up the text.  Mainly, I am just fixing the paragraph breaks as needed and looking for anything that seems weird.

At this point, I also begin to notice patterns of errors.  For example, “80” often appears in place of “So,” and zero in place of “O”.  Time for some more Find & Replace!

Sigil Find & Replace.

Luckily, Sigil offers advanced find & replace features much like Bluefish.  It supports regular expressions and can work on selections or the whole book at once.
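As a rough illustration of the kind of regular expression that helps with the “80” and zero confusions mentioned above, here is a small Python sketch. The word-boundary patterns and the sample sentence are my own assumptions, not the exact searches run in Sigil, and a blind replace like this would also clobber any genuine standalone 80 or 0, so each match deserves a look.

```python
import re

# Each pair is (pattern, replacement); \b keeps longer numbers such as 1800 intact.
OCR_FIXES = [
    (re.compile(r"\b80\b"), "So"),
    (re.compile(r"\b0\b"), "O"),
]

def fix_ocr_confusions(text: str) -> str:
    """Apply each OCR fix in turn and return the corrected text."""
    for pattern, replacement in OCR_FIXES:
        text = pattern.sub(replacement, text)
    return text

# Invented sample sentence
sample = "80 he rode on, crying: 0 Ywain! In 1800 none knew."
fixed = fix_ocr_confusions(sample)
print(fixed)
```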

Another feature helpful for isolating errors is generating a Report via Tools > Reports. The report gives summary information about everything in the EPUB.  For example, the HTML section lists all the files with their size, word count, spelling errors count, and other statistics.  Looking at the Characters section often highlights errors which would be impossible to otherwise identify:

Character Report.

The character report for Aladore listed a bunch of instances of the exotic characters “ﬁ” (which is not “f i”) and “ﬂ” (which is not “f l”).  These single-character ligatures are strange OCR artifacts and need to be replaced with normal characters.  Clicking on the report places the cursor on an example so you can jump right into Find & Replace.  Finally, I click Tools > Spellcheck > Spellcheck which generates a spelling error report similar to the character report.  You can quickly scan through this list of errors to locate issues–even where there are many strange words as with Aladore.  Clicking on a word in the list highlights it in the HTML so you can fix it.
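A similar character tally is easy to sketch in Python. The example below assumes the exotic characters are the Unicode ligatures U+FB01 and U+FB02 (an assumption on my part, since the report only shows how they look); the function names and sample text are invented for illustration.

```python
from collections import Counter

# Assumed culprits: the single-character "fi" and "fl" ligatures.
LIGATURES = str.maketrans({"\ufb01": "fi", "\ufb02": "fl"})

def non_ascii_report(text: str) -> Counter:
    """Count every character outside plain ASCII, like a tiny character report."""
    return Counter(ch for ch in text if ord(ch) > 127)

def expand_ligatures(text: str) -> str:
    """Replace each ligature with its ordinary two-letter spelling."""
    return text.translate(LIGATURES)

# Invented sample containing two "fi" ligatures and one "fl" ligature
sample = "The \ufb01rst \ufb02ower of the \ufb01eld"
report = non_ascii_report(sample)
cleaned = expand_ligatures(sample)
print(report)
print(cleaned)
```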

Alright, keep going, we almost have an EPUB!  Next post…

What about the Illustrations!

But Wait! you say… what about that whole thing with Henry’s passionate affair with aristocrat/illustrator Alice??  Ah yes, we need to deal with the illustrations!

Don’t worry, I didn’t forget.

Before the OCR processing routine, I moved all the illustration pages to their own directory.  The illustrations are identical in the 1914 and 1915 editions.  However, the 1914 pages are heavily yellowed, making for a dimmer, less crisp scan image.  I decided to use the 1915 images from the Internet Archive which are brighter with a whiter background. Generally when editing raw scans I adjust the levels and run unsharp mask to improve the image quality.  However, these images have been heavily processed already and are in the lossy JPG format, so further tinkering only seemed to make things worse.

Since these are page images, it is simplest to start editing with our old friend ScanTailor.  This time however, I used more of the advanced features to fix up the illustrations.  First, I used Deskew to generally straighten the images.  I ran the automatic Select Content which found all the illustrations perfectly.  On Margins, I deselected “Match size with other pages” and set a 1 mm margin on each side, since I want only the illustration with no extra page area.  On Output, I lowered the Resolution to 300 DPI, since any higher is overkill.  I changed the Mode to Color/Grayscale, since ScanTailor’s black and white processing is too extreme for images.

ScanTailor b&w processing, kind of neat, but…

Finally, I used ScanTailor’s Dewarping feature.  As I have mentioned a few times, none of the illustrations appear square.  This is partially because physical books are not flat and photographic lenses introduce some distortion.  Straight lines on the resulting page image will not appear parallel/square.  This can affect the cosmetic look of your page images, but also cause OCR errors.  Thus, ScanTailor offers the Dewarping operation to quickly minimize the distortion.

Dewarping Ywain.

On the central pane, click on the Dewarping tab.  A grid appears over the page image.  Drag the edges of the grid to match the lines that should be square in your image.  ScanTailor takes your distorted grid and reprojects the image as if it was square, fixing the apparent perspective issues.

Run Output and ScanTailor gives us a set of 15 nicely processed TIF images ranging in size from 4 to 7 MB.

We are making important editorial decisions throughout this process.  For example, by throwing out blank pages and close cropping the illustrations, I have discarded some of the information and format of the original book.  By deciding to dewarp, I have introduced my subjective interpretation of “straightness”.  These decisions will shape how the illustrations look and relate to the text in the new edition.

Finally, since Digital Aladore is focused on a reading text (not maximum image fidelity) we need to think about the EPUB.  I do not want to unnecessarily bloat the ebook size.  The ScanTailor output comes in at 80 MB, way too big for an ebook.  Instead, the quality and resolution should be appropriate to a small black and white screen.  EPUB2 and most ereaders do not support TIF images, so we will need to convert the images to a more acceptable format and size.

 

Editing Illustrations

To finish processing the illustrations I use my go-to image editor: GIMP, http://www.gimp.org

GIMP is Free, open, and runs on any OS.  If you are familiar with how Photoshop functions, it’s fairly easy to understand.  Batch processing is achieved through scripting, which has a steeper learning curve than Photoshop action sets, but can be an extremely powerful tool.

Ywain in GIMP.

First, I opened the 15 illustrations and simply exported them as JPG (the most supported format for EPUB images).  This makes images around 1500 x 1900 pixels, for a total of about 8 MB–and the images still look pretty good…

However, the average ereader has a screen size of only about 600 x 800 pixels (although newer models are pushing 758 x 1024).  Leaving the larger images enables tablets and computers to view the full quality images, but it also makes the EPUB unnecessarily large for ereaders that can’t display them.

I decided to make a second set of scaled JPGs to test in the ebook to see how they performed on an actual reader.  I scaled each image down based on the approximate size of 600 x 800.  This resulted in a batch just over 2 MB.
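The scaling itself is simple arithmetic: shrink by whichever ratio (width or height) is tighter so the whole image fits inside the target box without changing its proportions. Here is a small Python sketch of that calculation; the function name is made up, and the 1500 x 1900 input is only the approximate size mentioned above, so real images will come out slightly different.

```python
def fit_within(width: int, height: int,
               max_width: int = 600, max_height: int = 800) -> tuple[int, int]:
    """Scale (width, height) down to fit inside the box, keeping the aspect ratio."""
    # Use the tighter of the two ratios, and never enlarge a small image.
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)

# Approximate size of the full-resolution JPGs
scaled = fit_within(1500, 1900)
print(scaled)
```

For an image of roughly 1500 x 1900 the width is the limiting side, so the result is a little shorter than the full 800 pixel screen height.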

P.S. I uploaded the full set of JPGs to a new illustration gallery!

Finishing off the EPUB

Now that we have the scaled JPG illustrations, it’s easy to insert them into the ebook with Sigil.

Simply find the correct location (I followed the layout used in the print book) and use Insert > File.  Sigil will copy the image into the Images directory of the EPUB and insert the basic img tag into the HTML.  If we want the illustration to display properly on an ereader screen we will need to tweak the tags, but that will be another day…

Inserting images.

Now we have to finish off a few more loose ends.

First, our ebook needs some metadata.  Click the Metadata icon or Tools > Metadata Editor.

Add metadata.

At a minimum an EPUB requires a Title, but an Author and Date are usually included as well.  The Metadata Editor allows you to embed more complex metadata if desired.  For this draft EPUB I didn’t get fancy, but I added an entry for Alice as illustrator.

Next we need to create the Table of Contents via Tools > Table of Contents > Generate Table of Contents.  Sigil allows you to decide which headings to include in the TOC.  It is also easy to change the level of any heading in the book individually or as a batch during this operation–very convenient!

I also went into the first HTML section and reproduced the title page from the front matter of the original book.

Finally, I created a quick cover image and inserted it using Tools > Add Cover.  I am not exactly proud of this cover, but it will serve for now:

Draft cover image.

The draft EPUB needs a lot more polish, but it is functional.  So, that’s it for now!