Tagged: Sigil

Google Docs EPUB

Some news on the EPUB creation front:

Google Docs just enabled a feature to export as EPUB! To use it, simply open the doc and look under File > Download as > EPUB Publication.

This is a handy and very easy method to create an ebook. However, the consistency and quality aren’t good. The markup it creates is downright bizarre, with tons of unnecessary <span> tags and strange CSS. It also does not create a cover. In theory you could open this Google Doc EPUB with Sigil and do some polishing up, but given how unnecessarily complex the markup is, it would be more work than starting fresh.

So if you need a super quick EPUB for some reason, just click the “Download as” option. Otherwise, stick with the tools that provide better markup results, such as Writer2ePub and Sigil.

News at Sigil Ebook

Since Digital Aladore is more than a year old (see The Idea), I thought I should check in with a few of the key tools for any news. First up is Sigil Ebook editor, used for creating the various EPUB versions. As I have said many times, it is a great tool! There are a few features like the character report and auto merging html that I wish were in my everyday text editor.

After a scary period where it looked like development on Sigil might stall, I am happy to see it surge back to an active project full of interesting changes. This week version 0.9.1 was released, stabilizing a host of new features and moving the application towards full EPUB3 support. Also be sure to check through the Plugin Index to find many useful extensions for the editor.

Creating an editor that supports both EPUB2 and 3 is a bit complicated. As I mentioned in an earlier post, older versions of Sigil automatically correct markup and packaging to match the EPUB2 standard. To fix this issue, version 0.9.1 replaces Xerces (XML parser) and Tidy (HTML parser) with Python lxml and Google Gumbo, and makes the FlightCrew EPUB2 validator a plugin rather than a built-in tool.

Despite the major overhaul under the hood, using Sigil remains almost unchanged, which is great. So thank you to current maintainers Kevin Hendricks and Doug Massay and everyone else who makes this Free and open tool available!

Check out the code or get the latest version at Github.

 

Onward to EPUB3?

DigitalAladore 1.0 is a valid EPUB2. To recap: EPUB was chosen for the ebook because it is a free and open format built on open web standards (in contrast to proprietary formats such as Kindle AZW). And we love Free because of the many practical benefits of open source development plus the moral ideals of respecting the user’s freedom.

The EPUB2 standard was first released in 2007, but has since been superseded by EPUB3 released in October 2011. EPUB3 was designed to take advantage of new elements introduced in HTML5 and allow more interactive functionality (script). However, support of the full specification continues to be very poor. The only readers with full support seem to be commercial apps that deliver interactive books in a closed ecosystem. For example, AZARDI offers a cost-free reading app that has good support of advanced features of EPUB3, but it is focused on secure “content fulfillment” of interactive textbook subscriptions. To publish to the platform, authors must use their proprietary ebook creation application. Kobo and Apple have developed tweaked versions of EPUB3 that do not fully comply with the standard and focus on the possibilities for improved DRM, rather than functionality not found in EPUB2.

However, for simple functionality (i.e. a linear novel) EPUB3 is supported by most reading devices. I decided to update the Aladore EPUB2 to an EPUB3 version for future compatibility, higher specs, and improved semantic inflection. Guidelines now suggest adding larger images and cover images than I used in the EPUB2 to ensure they don’t look terrible on HD tablets. So while DigitalAladore 1.0 was optimized for older e-ink ereaders, the EPUB3 version will be optimized for larger, more powerful devices.

However, Sigil does not currently support the creation of ebooks following the EPUB3 spec. If you make changes to the markup following EPUB3, Sigil will actually correct them back to EPUB2 when saving the file.  So, to create the Aladore EPUB3 we have to do a few extra steps:

  • Replace all the image files with larger versions using Sigil.
  • Use the Sigil plugin ePub3-itizer to export a pseudo EPUB3. Sigil developers intend to implement full EPUB3 creation and editing support soon, so this plugin is considered a “stop-gap measure.” It changes the HTML headers, restructures a few files, and adds the nav.xhtml.
  • Unzip the ePub3-itizer output to edit the contents. Because Sigil limits the markup to XHTML valid to the EPUB2 spec, it is not possible to add HTML5 tags such as section or EPUB3 attributes such as epub:type (thus, it is what I call a pseudo EPUB3). I used the IDPF Accessibility Guidelines (The epub:type attribute) plus the attribute vocab EPUB 3 Structural Semantics Vocabulary to add some semantic structure to the text. This markup can be used for styling the document with CSS, but is also useful for machine processing and accessibility options. You can mark up sections of the ebook (frontmatter, body, backmatter), divisions within (abstract, chapters), types of content (footnote), or individual elements (title). In the EPUB2 I had added div tags with class attributes, which I now converted to section tags. For example, each <div class="chapter"> became <section epub:type="chapter">. I used these epub:type values: cover, titlepage, chapter, epigraph, toc, and loi. Since I made each chapter a single XHTML file, another option would be to add the epub:type attribute to the body element. However, those attributes would be lost if merging the HTML, so I prefer the section tags.
  • Delete the toc.ncx file.  This file was used by older reading devices to provide navigation functionality, but it is not part of the EPUB3 spec as it is replaced by nav.xhtml. However, many people seem to be leaving this file in the EPUB for legacy support. If you leave it, everything should work fine, but the file will NOT fully validate.
  • Re-Zip the new EPUB3. EPUBs need to be zipped in the correct order or they will not function. This means you must create the zip archive first (in Windows right click somewhere and choose New > Compressed (zip) Folder), then add the mimetype file (drag it into the new zip folder). Then all the rest of the content can be added. Finally, change the extension from .zip to .epub.
  • Validate with the IDPF EPUB Validator.
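For anyone who would rather script the re-zip step than drag files around, here is a minimal Python sketch using only the standard library. The function name zip_epub and the folder layout are my own assumptions, not part of any Sigil tool. It writes the mimetype file first and stores it uncompressed (which the EPUB container format expects), then adds everything else.

```python
import zipfile
from pathlib import Path


def zip_epub(source_dir, epub_path):
    """Zip an unpacked EPUB folder, putting the mimetype file first.

    source_dir is assumed to contain a 'mimetype' file at its top level;
    epub_path should live outside source_dir so it is not zipped into itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(epub_path, "w", compression=zipfile.ZIP_DEFLATED) as epub:
        # The mimetype entry goes in first and is stored without compression.
        epub.write(source / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        # Everything else follows, keeping paths relative to the EPUB root.
        for path in sorted(source.rglob("*")):
            relative = path.relative_to(source).as_posix()
            if path.is_file() and relative != "mimetype":
                epub.write(path, relative)
```

Calling zip_epub("Aladore_unzipped", "Aladore.epub") (both names are hypothetical) produces the .epub directly, so there is no need to rename a .zip afterwards. It is still worth running the result through the validator.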

The sketchyTech blog talks about the differences created by this process in more detail if you want to hear from someone else…

But basically, that’s it!  Not too complicated, although it requires some thought about 1) the quality of images to include, 2) changes to styling with larger screens in mind, and 3) consideration of semantic inflection to provide better accessibility and machine readability. I will post the new Aladore EPUB3 soon!

Aladore Regex

In earlier posts, I mentioned using advanced features of find & replace to do some automated editing of the HTML tags in the text.  Let me give you another example to show how useful some clever find & replace can be!

Before uploading the draft Aladore EPUB, I wanted to quickly edit the format of the chapter titles.  At this point the chapter headings looked like this:

<p>CHAPTER IV.</p>

<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>

They look exactly the same as normal body paragraphs.  Of course, in the print version this is not the case.

Aladore 1914, page 20.

To better match the look of the print book, I decided the headings should be centered and tagged h2.  This will more clearly set them off from the body text, something like this:

<h2 style="text-align: center;">CHAPTER IV.</h2>

<h2 style="text-align: center;">HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

However, I also wanted to quickly generate a table of contents.  If I tagged the heading as shown above, the automatically generated TOC would have two separate entries for each chapter.  So a quick and dirty alternative is to put a <br /> between the two parts of the chapter title, making them a single h2 unit that will appear correctly in the TOC.  This solution looks like:

<h2 style="text-align: center;">CHAPTER IV.<br />HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

So, how can we automatically find the 58 chapter headings and get them tagged correctly without doing it by hand?  Luckily, Sigil supports Find & Replace with Regular Expressions (i.e. Regex, make sure you have regex chosen for the Mode in the find window). With a few handy expressions and some logical thinking, we can sort this out in no time.  If you want to learn about and practice Regex, check out RegExr.

The main things we need to work with for this application are the lookahead and lookbehind expressions.  In this case it is very easy to test the accuracy of the expression–click Count All and it should be exactly 58 items, otherwise you are not catching only/all the chapter headings.

Sigil find & replace.

First, I need to replace the <p> in front of the title “CHAPTER…” with <h2 style="text-align: center;">.  To do this, use a lookahead, (?=ABC).  This means we are using “CHAPTER” in our search, but will not select it for the Replace function.  The Find looks like this:

<p>(?=CHAPTER)

It will find and select only the <p> that appear before “CHAPTER”, but it does not select CHAPTER.  Awesome!

Next, I want to find the p tags between the two sections of the chapter heading and replace them with <br />.  Since the roman numerals that follow CHAPTER make it a variable string length, for technical reasons we cannot use a lookbehind.  Instead we need to figure out a regex that will find only the chapter subtitle (for example, HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.) to use in another lookahead.  Every chapter subtitle is in all uppercase and at least 20 characters–AND there is no other string in the book that has those qualities.  We can use these facts to create our find string: [A-Z,\s,’]{20,}.  This expression means find any string that includes ONLY the uppercase letters A through Z, \s white spaces, commas, or ’ (needs to be included since some titles have possessives) AND that is 20 characters or more in length.  Note that the commas inside the brackets are not separators: a character class treats them as literal commas, so the set also matches commas, which is harmless here.  I didn’t do any serious analysis to decide on the number 20–I just looked at a few of the subtitles, counted the number of characters, and tested a few numbers by entering the expression and clicking Count All.  If I got 58 results, I knew I was on the right track.  The expression will find only our chapter subtitles, so we can use it in a lookahead to select the two p tags between the chapter number and the subtitle.  The find looks like this:

</p>

<p>(?=[A-Z,\s,']{20,})

It will select only the two p tags, which I replace with a break.

Finally, we need to replace the </p> after the chapter subtitle with a </h2>.  This time we need to use a lookbehind (?<=ABC), because there is no consistent string after the subtitle that we can use in searching.  In a lookbehind, we cannot use a string of open length as I did above.  However, the same qualities of the subtitle will give us enough information to create a distinctive search that will exclude everything NOT a subtitle.  In this case, a string of three characters, including only uppercase letters and a period [A-Z,.]{3} (the comma inside the brackets is also matched literally), in front of a </p> will give us the 58 results we want.  This is because no other paragraph in the book ends with a word in uppercase letters.  The lookbehind Find expression looks like this:

(?<=[A-Z,.]{3})</p>

It selects only the </p> tag, which I replace with </h2> to close off the heading.  Now, we have all the chapter titles tagged correctly and looking… well, sort of beautiful.  When we get into more polishing, we will use similar expressions to add the style tags needed for CSS.
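If you want to try the three passes outside of Sigil, here is a rough Python sketch of the same idea. The function name and the body sentence in the usage example are made up for illustration, and I use \s* in place of the literal line break between the two p tags. Otherwise the expressions are the ones shown above, and the lookbehind works because it has a fixed length of three characters.

```python
import re


def tag_chapter_headings(html):
    """Apply the three find & replace passes described above, in order."""
    # Pass 1: the <p> in front of CHAPTER becomes a centered h2 (lookahead).
    html = re.sub(r"<p>(?=CHAPTER)", '<h2 style="text-align: center;">', html)
    # Pass 2: the </p> ... <p> pair before an all-caps subtitle becomes a break.
    html = re.sub(r"</p>\s*<p>(?=[A-Z,\s,']{20,})", "<br />", html)
    # Pass 3: a </p> preceded by three uppercase/period characters closes the h2.
    html = re.sub(r"(?<=[A-Z,.]{3})</p>", "</h2>", html)
    return html


sample = (
    "<p>CHAPTER IV.</p>\n\n"
    "<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>\n\n"
    "<p>An ordinary body paragraph.</p>\n"  # made-up placeholder text
)
print(tag_chapter_headings(sample))
```

The order matters: the third pass relies on the second having already removed the </p> after the chapter number, which also ends in uppercase characters and a period.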

Isn’t it amazing that we have gone from textual transmission to Regex?  Ha, ha, I love Digital Aladore!

1914 versus 1915!

Earlier in the project I thought it would be interesting to compare the raw texts created by different OCR engines.  This was not simply to benchmark accuracy–the idea was that each engine might make slightly different types of mistakes, thus combining the outputs could result in a better text overall.  However, since my testing made it clear that Tesseract was far superior to other Free OCR engines and that ABBYY FineReader did not seem to be any more accurate for the purposes of Digital Aladore, I decided to abandon this comparison.  Since Tesseract performed so well with the digitized Aladore witnesses, creating separate texts would not contribute to my main goal of creating a Good reading edition.

However, I want to compare the Tesseract OCR text of the two main print editions of Aladore, 1914 (Edinburgh) and 1915 (New York).  The qualifier “OCR text of” is key here–I am not directly comparing the print editions or even the digitized images of them!  I am comparing the HTML text created by my OCR workflow using digitized JPEGs of the print pages… Thus any textual differences are a bit complicated to unravel, potentially originating anywhere along the transmission process:

First, there could be actual differences in the editions and simple errors introduced/fixed by the reprint.  Secondly, each actual copy of the book is unique–the two that were digitized may have different blemishes or flaws.  Thirdly, since the two books were digitized by different people and machines, different errors may have been introduced by the process (I discovered one already).  Finally, my OCR and HTML editing workflow may introduce different errors in each edition because of slight differences in pagination and the qualities of the images.

So this comparison is NOT about directly comparing the two printed texts–although it might highlight some differences (or not).  It is actually a sort of distorted version of textual criticism (which I talked a lot about earlier in the project). We have two new witnesses with a complex transmission back story.  I want to collate these newly created witnesses, to better understand how they are related and how they can be combined to create a single Best text. 

There are a number of software tools that can carry out this comparison, helping us surface both technical and textual issues with the HTML files.  I started from the HTML edited using Bluefish, with each edition consisting of six separate files.  First, I wanted to ensure the two witnesses had a uniform format so that the comparison would reveal textual differences rather than a bunch of inconsistencies in HTML tags.

Sigil is a surprisingly efficient tool for this job, following steps similar to the EPUB creation I described in an earlier post.  Open Sigil, click File > Add > Existing files, and select the six HTML files for an edition.  In the “Book Browser” pane (left side by default) highlight the six files you just added, then right click and select Merge.  This quickly and cleanly combines the files into a single HTML.  Next, I did a quick Find & Replace sweep for known issues, such as “So” being recognized as “80” and the exotic characters “ﬁ” (which is not “f i”) and “ﬂ” (which is not “f l”).  I also did a quick check of the Report (Tools > Reports > Characters in HTML) and Spellcheck for any strange OCR artifacts.  These are obvious transmission errors that just need to be fixed!  Then I right click on the HTML file in the “Book Browser” and choose Save As.  This allows you to “export” an HTML file from the EPUB.

After completing these steps for the two editions, we have two HTML files ready for comparison.  I will describe the next step in the next post!

Finishing off the EPUB

Now that we have the scaled JPG illustrations, it’s easy to insert them into the ebook with Sigil.

Simply find the correct location (I followed the layout used in the print book) and use Insert > File.  Sigil will copy the image into the Images directory of the EPUB and insert the basic img tag into the HTML.  If we want the illustration to display properly on an ereader screen we will need to tweak the tags, but that will be another day…

Inserting images.

Now we have to finish off a few more loose ends.

First, our ebook needs some metadata.  Click the Metadata icon or Tools > Metadata Editor.

Add metadata.

At a minimum an EPUB requires a Title, but an Author and Date are usually included as well.  The Metadata Editor allows you to embed more complex metadata if desired.  For this draft EPUB I didn’t get fancy, but I added an entry for Alice as illustrator.

Next we need to create the Table of Contents via Tools > Table of Contents > Generate Table of Contents.  Sigil allows you to decide which headings to include in the TOC.  It is also easy to change the level of any heading in the book individually or as a batch during this operation–very convenient!

I also went into the first HTML section and reproduced the title page from the front matter of the original book.

Finally, I created a quick cover image and inserted it using Tools > Add Cover.  I am not exactly proud of this cover, but it will serve for now:

Draft cover image.

The draft EPUB needs a lot more polish, but it is functional.  So, that’s it for now!

EPUB Creation 2

To create my Aladore edition, I decided to use the wonderful Free software (GPLv3) EPUB editor, Sigil, https://github.com/user-none/Sigil

I have used Sigil to tweak and create ebooks for many years.  At the beginning of this project several months ago, I was sad to read that active development was going to stop.  In February 2014, Sigil’s main developer John Schember posted that development was stalled and suggested using Calibre’s newly expanded epub editor (we will visit Calibre in another post soon!).  However, at the end of September 2014, Sigil surged back to life with a major new release!  Hooray!  Thank you John Schember and Kevin Hendricks!

A good user guide and tutorials can be found hosted on the old Google code page, http://web.sigil.googlecode.com/git/files/OEBPS/Text/introduction.html, or downloaded as an epub, https://github.com/user-none/Sigil/blob/master/docs/Sigil_User_Guide_0_7_2.epub.

Sigil is a true, full featured EPUB editor.  It can easily import text or HTML files, add images and style sheets, generate table of contents and index, and create a cover.  It allows you to toggle between WYSIWYG and code editing views.  It has a number of utilities built in to tidy and validate your EPUB.  Its only disadvantage at this point is that it is based on EPUB2 and does not yet support the full features of the newest standard, EPUB3.

So let’s open Sigil and get going!

Starting work on Aladore!

First, I go to File > Add > Existing Files and select all six HTML files containing the text of Aladore (generated by YAGF, edited by Bluefish).  The directory structure of the EPUB is represented in the “Book Browser” in the left hand pane.  I can see the HTML files I added.  Double click on a file in the Book Browser to open it in the main window where all the active files are tabbed.

Now, I work through each file to break it into chapters.  Normally, each chapter is contained in its own HTML file.  This ensures that there is a page break and that the file sizes remain small enough for good performance on ereaders (which may have very low specs).  I put the cursor in front of the chapter title and press Ctrl+Enter (or select Edit > Split at Cursor from the menu, or click the Split at Cursor icon in the tool bar).  The chapter title will now be at the head of a new HTML file.  Select the chapter title and change the style to H1.  You may want to use H2 or H3 to allow some diversity in your heading formatting, but the disadvantage is that Sigil interprets this as a hierarchy.  The table of contents will have all the H2+ nested underneath the earlier H1.  In practical terms, this means when you access the TOC on your ereader you have to click through a hierarchy rather than straight to a chapter.  With Sigil it is just important to get the headings marked–it is very easy to change the level as a batch later on if you want to tweak the formatting.

Making chapter breaks.

I keep the directory of page images handy in case I need to refer to the original to make any corrections as I scan through the text.  Spelling errors are only marked in the Code view:

Code view with spelling errors underlined.

If your text has a lot of strange words, like Aladore does, it is a good idea to set up a custom dictionary to work with.  Go to Edit > Preferences > Spellcheck Dictionaries and add a new dictionary:

Add new dictionary.

Now, as I work through the text I add the common weird words (for example, the name of the hero Ywain) to the new Aladore dictionary by right clicking in Code view.  This makes it easier to identify actual errors.

Now it’s just a long slog through the chapters, carefully tagging the headings and touching up the text.  Mainly, I am just fixing the paragraph breaks as needed and looking for anything that seems weird.

At this point, I also begin to notice patterns of errors.  For example, “80” often appears in place of “So,” and zero in place of “O”.  Time for some more Find & Replace!
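As a rough illustration of what such a replace can look like in regex form, here is a small Python sketch. The function name is hypothetical, and it assumes the text contains no genuine numerals (true enough for a romance with roman-numeral chapter headings), so any standalone “80” or “0” can safely be treated as an OCR slip.

```python
import re


def fix_ocr_confusions(text):
    """Swap standalone '80' for 'So' and standalone '0' for 'O'.

    Assumes the text has no real numbers; \\b keeps the match from
    touching digits that sit inside a longer run such as 1880.
    """
    text = re.sub(r"\b80\b", "So", text)
    text = re.sub(r"\b0\b", "O", text)
    return text
```

For a text that does contain real numbers, it would be safer to step through the matches one at a time, which is what the Find & Replace window lets you do anyway.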

Sigil Find & Replace.

Luckily, Sigil offers advanced find & replace features much like Bluefish.  It supports regular expressions and can work on selections or the whole book at once.

Another feature helpful for isolating errors is generating a Report via Tools > Reports. The report gives summary information about everything in the EPUB.  For example, the HTML section lists all the files with their size, word count, spelling errors count, and other statistics.  Looking at the Characters section often highlights errors which would be impossible to otherwise identify:

Character Report.

The character report for Aladore listed a bunch of instances of the exotic characters “ﬁ” (which is not “f i”) and “ﬂ” (which is not “f l”).  These are strange OCR artifacts and need to be replaced with normal characters.  Clicking on the report places the cursor on an example so you can jump right into Find & Replace.  Finally, I click Tools > Spellcheck > Spellcheck which generates a spelling error report similar to the character report.  You can quickly scan through this list of errors to locate issues–even where there are many strange words as with Aladore.  Clicking on a word in the list highlights it in the HTML so you can fix it.
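The idea behind the character report is simple enough to approximate in a few lines of Python. This is only a sketch, not how Sigil does it: the function names are my own, and I am assuming the exotic characters are the single-character Unicode ligatures U+FB01 and U+FB02.

```python
from collections import Counter

# Assumed OCR artifacts: single-character ligatures and their plain spellings.
LIGATURES = {"\ufb01": "fi", "\ufb02": "fl"}


def character_report(text):
    """Count every non-ASCII character, the usual hiding place for OCR oddities."""
    counts = Counter(text)
    return {char: count for char, count in counts.items() if ord(char) > 127}


def replace_ligatures(text):
    """Swap each ligature for its ordinary two-letter spelling."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text
```

Running character_report before and after replace_ligatures is a quick way to confirm the artifacts are gone. Anything still listed (curly quotes, dashes, accented letters) is probably legitimate but worth a glance.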

Alright, keep going, we almost have an EPUB!  Next post…

Poe’s Blackwood articles

In the last post, I mentioned that Aladore was first published in Blackwood’s Magazine.

While researching the publication, I noticed that Edgar Allan Poe wrote two short humorous pieces ridiculing the overly sensational stories typical in these magazines.  They were originally published as “The Psyche Zenobia” and “The Scythe of Time” in The American Museum of Science, Literature, and the Arts (Baltimore: Brooks & Snodgrass) November 1838.   The stories were retitled “How to write a Blackwood article” and “A Predicament” in later collections and can be found in a number of free online editions of Poe’s writing.

As a quick test project, I decided to make a mini ebook of these two short stories.

I used two source files (witnesses) that are in the Public Domain and easy to access:

1) An edited HTML version from Project Gutenberg in The Works of Edgar Allan Poe, Volume 4 (of 5) of the Raven Edition, eBook #2150, released 2008.  http://www.gutenberg.org/ebooks/2150

2) A scanned copy of The Works of Edgar Allan Poe, Volume IV (New York: A.C. Armstrong & Son, 1884) available at Internet Archive. https://archive.org/details/worksofedgaralla04poeeuoft

To create the ebook, I used the great open-source editing software Sigil, which I will discuss in more detail in a future post.  Unfortunately, the project is no longer actively developed, but for more info you can check out:

http://en.wikipedia.org/wiki/Sigil_%28application%29

https://github.com/user-none/Sigil

For this test project, I started with the HTML text from Project Gutenberg and checked it against the scanned original making edits and upgrades as needed.  Sigil helps package everything together to create a proper EPUB file (more on epub another day).  I created a quick cover from a screen shot of a scanned book page, and an automatic table of contents. Then I edited the embedded metadata file.  Sigil automatically validates the EPUB, and it is good to go!

Unfortunately WordPress doesn’t allow upload of epubs, so I can’t just immediately share it here…

I need to find a good home for distributing the file, but for now I put it up on ge.tt.

Two stories by Edgar Allan Poe relating to that “justly celebrated publication” Blackwood’s Magazine: http://ge.tt/1FcuEHz1/v/0

Updated file location: Internet Archive, Community Text Collection, https://archive.org/details/PoeBlackwoodArticle

P.S. If you are sick of reading, some nice person also created a Public Domain audio book version of these two stories available at LibriVox:

https://librivox.org/two-poe-tales

P.S. I also forgot to mention that the stories are full of stereotypes and language that are offensive to modern ears, as you might expect from nineteenth century satire…