Reflecting on ColorOurCollections

My new favorite national holiday is #ColorOurCollections week, Feb 1-5, 2016.

The idea originated at the New York Academy of Medicine (NYAM) to merge the adult coloring book craze with digital collections. Amazing libraries around the world joined in, sharing some fun and beautiful coloring books. Open Culture listed some of the big names, such as Bodleian, Smithsonian, DPLA, and Europeana. However, it was great to see so many less famous libraries creating awesome coloring books highlighting their fascinating collections, and having some fun!

Nonetheless, #ColorOurCollections also highlighted some less-than-best practices in the creation and distribution of digital files. Many of these libraries are at the forefront of digital preservation and/or user experience, but put out PDFs that violated all the rules… It’s a bit disappointing, since it’s not that hard to get a few little details right, and libraries should be leading by example.

A lot of libraries did a great job, but here are some little things that bugged me about many coloring book offerings:

  • Random file names. If you are creating a PDF for public distribution, don’t name it “coloringbook1.pdf” or “color-our-collections.pdf” or “jim_file3.pdf”. The file name is one of the few bits of metadata that can be easily understood without even opening the file. Make your file name descriptive and meaningful, providing the basic metadata (creator, title, and date) in a place where everyone can see it. Example: “ThisLibrary_CoolColoringBook_2016.pdf”
  • Huge file sizes. A coloring book is a mostly black and white document designed to be printed on approximately letter-size paper. A reasonable file size for something around 15 pages is under 2 MB, and could be smaller. I saw many PDFs in the range of 25 MB, and some over 65 MB! Larger file sizes will not give better quality. This is for public web distribution, not a professional press run printing glossy coffee table books. Think about your users and your web servers. People have to DOWNLOAD the PDF! Please make it a reasonable size.
  • No embedded metadata. When creating a PDF you should always check the embedded metadata. If you are exporting a PDF from LibreOffice Writer or MS Word, it will have embedded metadata automatically created based on your profile. Unfortunately, many people don’t realize that and have never checked their profile. Many of the coloring books thus have metadata like “Title: Microsoft Word – coloringbook_draft3_fromJim.docx” and an Author that is the profile name of whoever first created the file. This metadata will be displayed when users import the PDF into an ebook management tool, such as Calibre. Furthermore, this information is not helpful for future users trying to understand where the file came from and what it is, and it could be a bit embarrassing depending on what the automatically generated information contains. I suggest you carefully edit the metadata before exporting the final version of your PDF. It should contain a meaningful title, a subject such as “#ColorOurCollections 2016”, an author/creator that relates to the institution, and a URL to find more information.
  • Lack of image metadata. If you send out a document highlighting some fascinating treasures of your collections, there had better be a clear means for users to find out more information! Every image used in the coloring book needs metadata directly on the page where it appears. Each page does not need the full archival description, but please give enough information for the users to find the item in your online collections. A title, identifier, and URL are nice. I think these references need to be given on each coloring page, not in a separate reference and index page. Online resources and printed coloring book pages are quickly disassociated from their original context, so don’t expect that the information given on an introduction or TOC page will be available to users.
  • Lack of overall context. Many of the coloring books were just pages of images. That is great for many users, but I would like to see a short introduction page that explains the context. Where did these images come from? Why are they interesting in the scope of your collections? Where can I learn more? This is an easy chance to communicate with patrons and invite them into our collections, which is the point of #ColorOurCollections.
  • Links to paid databases. A few coloring books had reference links to paid databases. I found this a bit insulting and against the spirit of #ColorOurCollections. One of the most amazing aspects of digital collections is the ability to democratically open up the public domain to the PUBLIC. We are able to take fragile materials traditionally hidden away in a locked basement, and give them out freely to the world! It is disappointing to see objects in the public domain digitized and then LOCKED back up in a proprietary, paid database. It’s even more disappointing to see those overpriced rip-offs promoted in a library coloring book.
  • Grey backgrounds. Sorry, but this is a coloring book! Who wants to color on grey paper? Who wants to waste printer ink printing a grey page background? Some images are just more appropriate for a coloring book than others. You cannot just desaturate a digitized image and call it a coloring book. Digitized pages have a color, and that page background needs to be removed to make a quality coloring book page. Generally, most coloring book images should be fully binarized, i.e. only pure black and white. Using GIMP you could desaturate (Colors > Desaturate) or convert to greyscale (Image > Mode > Grayscale) the image, then use a Threshold (Colors > Threshold) to eliminate the “color” of the page background. The coloring book image should be reduced to clean black lines and white background. ScanTailor is a great tool that can do this pre-processing for you for many coloring book appropriate images. Play around with the output options in Black & White mode, tweaking “Thickness” and “Despeckling” until you get a good result.
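The threshold step in that last item can be sketched in a few lines of Python. This is only an illustration, assuming the page has already been reduced to greyscale values from 0 (black) to 255 (white). The function name `binarize` and the default cutoff of 128 are my own choices, not anything taken from GIMP or ScanTailor.

```python
def binarize(pixels, threshold=128):
    """Map every greyscale value to pure black (0) or pure white (255).

    `pixels` is a list of rows, each row a list of ints in 0..255.
    Values at or above `threshold` become white, everything else black.
    """
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# A tiny made-up "scan": dark ink (20-60) on a greyish page background (180-200).
scan = [
    [190, 185, 40, 200],
    [180, 20, 60, 195],
]
clean = binarize(scan)
```

The greyish background values all land on pure white and the ink on pure black. On a real scan the right cutoff depends on the page, which is why it is worth playing with the threshold slider rather than trusting a fixed number.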

Why no Digital Aladore coloring book?

ScanTailor b&w processing

I was thinking about putting together an Aladore coloring book, but I found the images had too many shades of greyscale hatching to reduce nicely to clean lines. Processing ends up with too many black blobs and too little detail. The images just don’t work as a coloring page! Here is a one-page PDF attempt just to show you what I mean:


Anyway, #ColorOurCollections was good fun, and I am looking forward to it next year!


Google Docs EPUB

Some news on the EPUB creation front:

Google Docs just enabled a feature to export as EPUB! To use it, simply open the doc and look under File > Download as > EPUB Publication.

This is a handy and very easy method to create an ebook. However, the consistency and quality aren’t good. The markup it creates is downright bizarre, with tons of unnecessary <span> tags and strange CSS. It also does not create a cover. In theory you could open this Google Doc EPUB with Sigil and do some polishing up, but given how unnecessarily complex the markup is, it would be more work than starting fresh.

So if you need a super quick EPUB for some reason, just click the “Download as” option. Otherwise, stick with the tools that provide better markup results, such as Writer2ePub and Sigil.

Update: Tesseract OCR in 2016

Using Tesseract via Command Line has consistently been the most wildly popular post on Digital Aladore. However, due to some changes, I thought I should update the information.

Tesseract used to be hosted at Google Code, which closed up shop in August 2015. The project has transitioned to GitHub, with the main page at

and the Wiki page at:

There is no longer an official Windows installer for the current release. If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)

If you want a more up-to-date version on Windows, there are some third-party installers put together for you; otherwise you have to compile it yourself (unless you use Cygwin or MSYS2):

It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.

On Linux, installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and use make.

Now that you have it installed, the commands on the old post should work fine!
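For anyone scripting this rather than typing it, here is a minimal Python sketch of the basic call. It assumes the `tesseract` executable is on your PATH and that the English language data is installed. The helper names `build_tesseract_command` and `run_ocr` and the file names in the comment are placeholders of mine, not anything from the old post.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a basic Tesseract run.

    Tesseract appends the extension itself, so the recognized text
    ends up in `<output_base>.txt`.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)


# Hypothetical usage: reads page.png and writes page.txt
# run_ocr("page.png", "page")
```

Keeping the command as a list of arguments (rather than one long string) avoids trouble with spaces in file names.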

Public Domain Day 2016!

Happy 2016 from Digital Aladore!

January 1st brings us to another joyous Public Domain Day, the holiday where lots of people celebrate the New Year AND a new crop of works entering the public domain.

Sadly, in America we have NOTHING to celebrate. Because of bizarre copyright extensions, we will not have any works entering the public domain until 2019. It is stunning to think that while the rest of the world is celebrating, the USA has not had a happy Public Domain Day since 1978… When copyright was first introduced in America the term was 14 years; current works now enjoy life of the author + 70 years, or, for works of corporate authorship, 95 years from publication. The extensions in 1978 and 1998 applied retroactively to old works, creating a crazy tangle of rules (check out a summary from Peter B. Hirtle). That retroactive reach highlights the nonsense of the move: the rationale for extended terms was incentivizing creation, but it seems hard to fathom it motivating a bunch of dead people! Meanwhile the preservation of our cultural resources has become illegal, with fragile artifacts such as our film heritage literally disappearing.

Here is what I said last year, and things have only gotten worse:

Recent research and economic modeling suggest that current copyright terms are too long and do NOT provide incentive for creation. Instead our shared culture is being locked away by corporate profiteers. In fact, the majority of works still protected by copyright are orphans, out of print with no likelihood of ever being used again commercially. Projects like Digital Aladore, Free software, and honestly the majority of the internet point out that creators aren’t purely profit driven. It’s time to reform copyright to benefit the creators rather than hoarders of capital (who already have plenty of power and wealth!).

North of the border, in Canada, things are more cheerful this year. The works of a lot of great authors and thinkers will become freely available resources to drive current learning, thought, and creativity. Libraries and Archives will be able to legally preserve, digitize, and provide access to valuable cultural creations. Check out the Public Domain Review’s Class of 2016 for some highlights. However, there is a pall on the celebrations. The Trans-Pacific Partnership trade deal threatens to force countries to have a minimum copyright term of life + 70 years.

A sad holiday indeed. Learn more at the Center for the Study of the Public Domain:

“What do these laws mean to you? As you can read in our analysis here, they impose great (and in many cases unnecessary) costs on creativity, on libraries and archives, on education and on scholarship. More broadly, they impose costs on our collective culture. We have little reason to celebrate on Public Domain Day because our public domain has been shrinking, not growing.”

Nonetheless, here at Digital Aladore we wish you all the Best for the New Year!



DigitalAladore 1.5, EPUB3 Edition!

Is DigitalAladore 1.0 looking crummy on your ultra high def 10 inch tablet screen?

Well, give DigitalAladore 1.5 a try! Following the workflow outlined in previous posts, I generated an Aladore EPUB3 edition. The images are much bigger and the CSS is slightly tweaked with larger screens in mind. Personally, I still find reading ebooks on tablets a bit unsatisfying, slightly too big and bright. But I think this version will look pretty good! However, at over 9MB it might be slow to load on your e-ink reader.

So without further ado, you can find the new EPUB3 at Internet Archive:

DigitalAladore 1.5: Aladore, by Henry Newbolt (1914, epub3)

News at Sigil Ebook

Since Digital Aladore is more than a year old (see The Idea), I thought I should check in with a few of the key tools for any news. First up is Sigil Ebook editor, used for creating the various EPUB versions. As I have said many times, it is a great tool! There are a few features like the character report and auto merging html that I wish were in my everyday text editor.

After a scary period where it looked like development on Sigil might stall, I am happy to see it surge back to an active project full of interesting changes. This week, version 0.9.1 was released, stabilizing a host of new features that move the application towards full EPUB3 support. Also be sure to check through the Plugin Index to find many useful extensions for the editor.

Creating an editor that supports both EPUB2 and 3 is a bit complicated. As I mentioned in an earlier post, older versions of Sigil automatically correct markup and packaging to match the EPUB2 standard. To fix this issue, version 0.9.1 replaces Xerces (XML parser) and Tidy (HTML parser) with Python lxml and Google Gumbo, and makes the FlightCrew EPUB2 validator a plugin rather than a built-in tool.

Despite the major overhaul under the hood, using Sigil remains almost unchanged, which is great. So thank you to current maintainers Kevin Hendricks and Doug Massay and everyone else who makes this Free and open tool available!

Check out the code or get the latest version at GitHub.


Thoughts About EPUB3

When I first started looking into the EPUB3 specs, I was excited by the possibilities of a more powerful ebook format. Just think of all the neat things you can create with simple CSS and JS! I imagined creating little “epub apps” like a calculator or timer. It would be a neat way to add functionality to very simple devices such as the Sony Reader. I created a few test versions; however, these demos often worked in Calibre’s built-in reader but were not functional on any actual ereaders.

Of course, the point would be to go beyond silly little apps and add some interesting and valuable extensions to the ebook, such as text collation or visualizations. Simple adjustable collation tools could be embedded so that the reader could query the text while reading. Some of this functionality has been built into the reading apps on some devices, such as Kindle X-Ray. Simple interactive elements would be useful for textbooks and manuals to make information delivery more interesting. Imagine something like Jupyter Notebook, which can run embedded Python code.

Unfortunately, there just isn’t good support for the advanced features of EPUB3 in an open and flexible way.  As I mentioned in a previous post, device makers only seem interested in the possibilities of further limiting users with tougher DRM, rather than enabling new possibilities. In the ideal world we could combine the open format with open hardware and software!

Remove the NCX

Interested in minute EPUB intricacies? Good, you are in for a treat!

One of the steps I mentioned for going from EPUB2 to EPUB3 is removing the toc.ncx file. This is actually a somewhat involved step (that is probably unnecessary), so I thought I would expand on it a bit. It also gives you a chance to poke around EPUB innards…

The NCX, i.e. “Navigation Center eXtended”, was a feature to enhance navigation and accessibility based on the DAISY/NISO Standard. It was required in EPUB2, where the spine’s toc attribute points to it. However, the EPUB 3.0.1 spec tells you the NCX is superseded. Instead you are required to include an EPUB Navigation Document (i.e. nav.xhtml) that makes use of the HTML5 nav element. Basically you need to set up a <nav>, with <ol> inside, with <li> that have <a> relative links to parts of the ebook.
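To make that nesting concrete, here is a small Python sketch that writes out a bare-bones Navigation Document. It is only an illustration: the function name `build_nav` and the chapter file names are made up, and a real nav.xhtml will usually carry more (headings, nested lists, landmarks). The epub:type="toc" attribute and the two namespaces are the parts the spec cares about.

```python
from xml.sax.saxutils import escape, quoteattr


def build_nav(entries, title="Table of Contents"):
    """Build a minimal EPUB3 Navigation Document as a string.

    `entries` is a list of (href, label) pairs in reading order.
    """
    items = "\n".join(
        f"        <li><a href={quoteattr(href)}>{escape(label)}</a></li>"
        for href, label in entries
    )
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" '
        'xmlns:epub="http://www.idpf.org/2007/ops">\n'
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        '    <nav epub:type="toc">\n'
        "      <ol>\n"
        f"{items}\n"
        "      </ol>\n"
        "    </nav>\n"
        "  </body>\n"
        "</html>\n"
    )


# Hypothetical chapter files, just for illustration.
nav = build_nav([("chapter01.xhtml", "Chapter I"), ("chapter02.xhtml", "Chapter II")])
```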

Since this file is a valid HTML document, it can be easily rendered by the reading device. Thus the new navigation file can serve as both a human-readable and a machine-readable TOC. You can write the table of contents once in <nav>, then use it both at the beginning of the book (for people) and for the device to understand the reading order of the digital files and provide extended navigation.

So in EPUB3 you need a Nav doc, but do you need to get rid of NCX? No, not really… NCX Superseded says that we “MAY” include the NCX since it will not interfere with anything, “but EPUB 3 Reading Systems must ignore the NCX.” I.e. older devices will keep looking for NCX, but newer ones will definitely not.

So we have an EPUB2, we create a new Nav based TOC, the question is NCX to keep or not to keep… I really don’t have a good answer.  It seems there is no reason to not keep it?

But if you want to get rid of it and build a pure EPUB3, it’s more complicated than just deleting one file. Here’s what you need to do:

  • Unzip your EPUB (it is probably already unzipped if you are monkeying around with the EPUB2 to 3 transition), navigate to the OEBPS directory.
  • Delete the toc.ncx.
  • Open the content.opf file in a text editor. This is an XML file that defines the ebook.
  • Look for the <manifest> element and find an <item> listing for the NCX (easiest just to search for “.ncx”). It should look something like this: <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>. Delete it!
  • Find the <spine> element. It should have a toc attribute that looks like this: <spine toc="ncx">. Delete the whole attribute, as it is optional in EPUB3, leaving <spine>.
  • Save your cleansed content.opf!
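If you would rather not edit the XML by hand, the same two changes can be sketched in Python with the standard library. Treat this as a rough illustration rather than a tested tool: the function name `strip_ncx` and the sample package below are my own inventions, it assumes the usual OPF namespace, and it does not delete the toc.ncx file itself.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
NCX_MEDIA_TYPE = "application/x-dtbncx+xml"


def strip_ncx(opf_text):
    """Remove the NCX manifest item and the spine's toc attribute.

    Takes the content.opf as a string and returns the cleaned XML string.
    """
    # Keep the OPF namespace as the default one when serializing.
    ET.register_namespace("", OPF_NS)
    root = ET.fromstring(opf_text)

    manifest = root.find(f"{{{OPF_NS}}}manifest")
    for item in list(manifest):
        if item.get("media-type") == NCX_MEDIA_TYPE:
            manifest.remove(item)

    spine = root.find(f"{{{OPF_NS}}}spine")
    spine.attrib.pop("toc", None)  # no error if the attribute is already gone

    return ET.tostring(root, encoding="unicode")


# A made-up, stripped-down package file for illustration.
sample = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uid">
  <manifest>
    <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
    <item href="nav.xhtml" id="nav" media-type="application/xhtml+xml" properties="nav"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="nav"/>
  </spine>
</package>"""
cleaned = strip_ncx(sample)
```

Matching on the media type rather than the file name means it still works if the NCX is called something other than toc.ncx.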

Let me know what you decide!

Onward to EPUB3?

DigitalAladore 1.0 is a valid EPUB2. To recap: EPUB was chosen for the ebook because it is a free and open format built on open web standards (in contrast to proprietary formats such as Kindle AZW). And we love Free because of the many practical benefits of open source development plus the moral ideals of respecting the user’s freedom.

The EPUB2 standard was first released in 2007, but has since been superseded by EPUB3 released in October 2011. EPUB3 was designed to take advantage of new elements introduced in HTML5 and allow more interactive functionality (script). However, support of the full specification continues to be very poor. The only readers with full support seem to be commercial apps that deliver interactive books in a closed ecosystem. For example, AZARDI offers a cost-free reading app that has good support of advanced features of EPUB3, but it is focused on secure “content fulfillment” of interactive textbook subscriptions. To publish to the platform, authors must use their proprietary ebook creation application. Kobo and Apple have developed tweaked versions of EPUB3 that do not fully comply with the standard and focus on the possibilities for improved DRM, rather than functionality not found in EPUB2.

However, for simple functionality (i.e. a linear novel) EPUB3 is supported by most reading devices. I decided to update the Aladore EPUB2 to an EPUB3 version for future compatibility, higher specs, and improved semantic inflection. Guidelines now suggest adding larger images and cover images than I used in the EPUB2 to ensure they don’t look terrible on HD tablets. So while DigitalAladore 1.0 was optimized for older e-ink ereaders, the EPUB3 version will be optimized for larger, more powerful devices.

However, Sigil does not currently support the creation of ebooks following the EPUB3 spec. If you make changes to the markup following EPUB3, Sigil will actually correct them back to EPUB2 when saving the file.  So, to create the Aladore EPUB3 we have to do a few extra steps:

  • Replace all the image files with larger versions using Sigil.
  • Use the Sigil plugin ePub3-itizer to export a pseudo EPUB3. Sigil developers intend to implement full EPUB3 creation and editing support soon, so this plugin is considered a “stop-gap measure.” It changes the HTML headers, restructures a few files, and adds the nav.xhtml.
  • Unzip the ePub3-itizer output to edit the contents. Because Sigil limits the markup to XHTML valid to the EPUB2 spec, it is not possible to add HTML5 tags such as section or EPUB3 attributes such as epub:type (thus, it is what I call a pseudo EPUB3). I used the IDPF Accessibility Guidelines (The epub:type attribute) plus the attribute vocab EPUB 3 Structural Semantics Vocabulary to add some semantic structure to the text. This markup can be used for styling the document with CSS, but is also useful for machine processing and accessibility options. You can mark up sections of the ebook (frontmatter, body, backmatter), divisions within (abstract, chapters), types of content (footnote), or individual elements (title). I had added div tags with class attributes in the EPUB2, which I converted to section tags. For example, each <div class="chapter"> became <section epub:type="chapter">. I used these epub:type values: cover, titlepage, chapter, epigraph, toc, and loi. Since I made each chapter a single XHTML file, another option would be to add the epub:type attribute to the body element. However, those attributes would be lost if merging the HTML, so I prefer the section tags.
  • Delete the toc.ncx file.  This file was used by older reading devices to provide navigation functionality, but it is not part of the EPUB3 spec as it is replaced by nav.xhtml. However, many people seem to be leaving this file in the EPUB for legacy support. If you leave it, everything should work fine, but the file will NOT fully validate.
  • Re-Zip the new EPUB3. EPUBs need to be zipped in the correct order or they will not function. This means you must create the zip archive first (in Windows right click somewhere and choose New > Compressed (zip) Folder), then add the mimetype file (drag it into the new zip folder). Then all the rest of the content can be added. Finally, change the extension from .zip to .epub.
  • Validate with the IDPF EPUB Validator.
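The re-zipping step is the one most likely to go wrong by hand, so here is a Python sketch of it. The key points are that the mimetype file goes into the archive first and is stored without compression. The function name `zip_epub` and the throwaway demo directory are my own, and the sketch assumes the output file is written somewhere outside the directory being zipped.

```python
import os
import tempfile
import zipfile


def zip_epub(source_dir, epub_path):
    """Pack an unzipped EPUB directory into `epub_path`.

    The mimetype file goes in first and uncompressed; everything else
    follows with normal deflate compression.
    """
    with zipfile.ZipFile(epub_path, "w") as archive:
        archive.write(
            os.path.join(source_dir, "mimetype"),
            "mimetype",
            compress_type=zipfile.ZIP_STORED,
        )
        for folder, _subdirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                relative = os.path.relpath(full_path, source_dir)
                if relative == "mimetype":
                    continue  # already added as the first entry
                # Zip entries always use forward slashes, even on Windows.
                archive.write(
                    full_path,
                    relative.replace(os.sep, "/"),
                    compress_type=zipfile.ZIP_DEFLATED,
                )


# Demo with a throwaway directory standing in for a real unzipped EPUB.
with tempfile.TemporaryDirectory() as work:
    book = os.path.join(work, "book")
    os.makedirs(os.path.join(book, "OEBPS"))
    with open(os.path.join(book, "mimetype"), "w") as handle:
        handle.write("application/epub+zip")
    with open(os.path.join(book, "OEBPS", "content.opf"), "w") as handle:
        handle.write("<package/>")
    target = os.path.join(work, "book.epub")
    zip_epub(book, target)
    with zipfile.ZipFile(target) as archive:
        first_entry = archive.infolist()[0]
        entry_names = archive.namelist()
```

Writing straight to a .epub file name also saves the rename from .zip at the end. The result should still go through the validator, since the demo above checks only the packaging order, not the contents.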

The sketchyTech blog talks about the differences created by this process in more detail if you want to hear from someone else…

But basically, that’s it!  Not too complicated, although it requires some thought about 1) the quality of images to include, 2) changes to styling with larger screens in mind, and 3) consideration of semantic inflection to provide better accessibility and machine readability. I will post the new Aladore EPUB3 soon!

Let’s Read Together: Chapter One



SIR YWAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remembrance of all the misery that had else been forgotten.

And in the midst of his judging there was brought into the hall a child that had been found in the road, a boy of seven years as it seemed: and he was dressed in fine hunting green, but not after the fashion of that day or country. Also when they spoke to him he answered becomingly, but in a speech that no one could understand.

So Sir Ywain had him set by the table at his own side, and now and again as he judged those wrong-doers, he cast a look upon the child. And always the child looked back at him with bright eyes, and even when there was no looking between them, he listened to what was being said, and smiled as though that which was weariness to others was to him something new and joyful. But as the hour passed, Sir Ywain felt his mind slacken more and more, and whenever he saw the boy smiling, his own heart became heavier and heavier between his shoulders, and his life and the life of his people seemed like a high-road, dusty and endless, that might never be left without trespassing. And though he would not break off from his judging, yet he groaned over the offenders instead of rebuking them; and when he should have punished, he dismissed them upon their promise, so that his steward was mortified, and the guilty could not believe their ears.

Then when all was said and done the hall was cleared, and Sir Ywain was left alone with the boy.

But the steward, looking slyly back through the hinges of the door, saw that his lord and the child were speaking together; and he perceived that they understood one another well enough, though how this should have come about he was not able to guess, having himself heard the boy answering to all questions in none but an outlandish tongue.

Then he saw Sir Ywain rise up, and suddenly he was aware that his lord was calling for him loudly and with a hearty voice, as he would call for him long since, when they were at the wars together. And when he went in, Sir Ywain bade him summon all the household.

Now when the household were come into the hall they stood at a little distance from the dais, in the order of their service, and Sir Ywain stood above them in front of the high table. And beside him was the boy, and before him was his own brother, who was now an esquire grown, with hawk on wrist.

Then Sir Ywain bade his brother kneel down, and there he made him knight, taking his sword from him and laying it on his shoulder, and afterwards belting it again round his body. And he took the keys from his own girdle and the gold spurs from his own feet, and said aloud: I call you all to witness that as I have done off my knighthood and the Honour of Sulney, and given them to this my brother Sir Turquin, so also by these tokens do I deliver unto him the quiet possession of my house and goods and the seisin of all my lands, to hold unto him and his heirs for ever, by the service due and accustomed for the same. And henceforth I go free.

How Sir Ywain was led away of a Child

Then his brother, who was both glad and sorry, and moreover was still in doubt how this might end, stood holding the keys and the spurs, and looking at him without a word. And he looked also at the child, and he saw that for all the difference in their years, the eyes of Sir Ywain had become like the boy’s eyes: and as he looked his heart became heavy, and for a moment he envied his brother and feared for himself. But in his fear he moved his hands, and the keys clanked and the spurs clinked together, and his heart leaped up again for joy of his possessions.

And all this Ywain saw as it were a great way off, and he smiled, and forgot it again instantly. And the boy took his hand, and they went down the hall together. And when they came to the door to pass out, the steward got before them and bowed as he was used to do, and he spoke very gravely to Sir Ywain, reminding him that this same afternoon had been appointed among the lords, his neighbours, for the witnessing of certain charters.

But Ywain and the boy looked at one another and laughed, and the steward saw that they laughed at the lords and at him and at the very greatness of the business: and he was enraged, and turned away and went to his new master.

Then Sir Turquin came hastily after them, and he laid his hand upon his brother’s arm and bent his head a little, and spoke to him so that none else should hear, and he said: What is this that you are doing; for no man leaves all that he has, and departs suddenly, taking nothing with him. But those two went from him without answering, and they passed, as it seemed, very swiftly along the road under the woodside, and were hidden from him. And again, as he stood still watching, he saw them going swiftly above the wood where there was no path, but only the bare wold before them.

Keep reading: get the Digital Aladore ebook at Internet Archive!