Visualizing Aladore the Book

Recently, images of Jason Shulman’s “Photographs of Film” project came across my Twitter feed. I love long exposure photography, so I was really enjoying the idea, but it also brought up the controversy surrounding his exhibit about a year ago. The issue is explained by researcher Kevin L. Ferguson in “To Cite or to Steal? When a Scholarly Project Turns Up in a Gallery”.

What is most interesting to me is Ferguson’s insistence on openness about his methods and techniques, versus Shulman’s deliberate secrecy. Readers of Digital Aladore will not be surprised that despite many years working as a studio artist, I sympathize with the openness!

Ferguson’s article included a handy mini-tutorial about ImageJ–which set me off to find some old images that I never shared on this blog. Early in the project, I covered using ImageJ to preprocess Aladore’s page images (Preprocessing 2). However, you can do many more interesting and beautiful things with ImageJ…

Like visualize the entire text block at once:

I find this image oddly compelling. It had a practical purpose: helping find the text block to crop out of the page images, thus simplifying OCR. But it is also a unique visualization of the physical book, allowing us to read it in a new way. We can see the pages summarized visually, perhaps revealing new insights into this artifact, hinting at the inky physicality missing in the digital images.

Anyway, these images are really interesting to look at. So let’s make some more!

Get a copy of ImageJ2 (a newer version than the one used in my old posts; I suggest the Fiji distribution). Make sure you have an up-to-date 64-bit version of Java installed (if you use 32-bit Java your memory will be limited).

Get a big stack of images, for example a full copy of Aladore from Internet Archive. Visit the Aladore 1914 book page and download the “SINGLE PAGE PROCESSED JPEG ZIP”. Unzip the package.

Open ImageJ / Fiji, go to “Edit” > “Options” > “Memory & threads”. Give it as much memory as seems reasonable for your system.

Now open your images: “File” > “Import” > “Image sequence”, navigate into the image folder and click “Open”. On the “Sequence Options”, check “Use virtual stack” to save memory. This will create an image stack. Browse through the stack by using the slider bar at the bottom.

If you want to work with these full-sized images, you will need lots of RAM; otherwise resize the stack with “Image” > “Scale”. For example, Aladore is just over 400 page images, at around 2800 x 1800 pixels each. Running a Z Projection with “Sum Slices” required around 15 GB of memory. Scaled at “0.5”, it was able to run on 4 GB.

If you would like to crop the stack to clean up the edges, drag a rectangle over the image and “Image” > “Crop”.

The image above was created using “Image” > “Stacks” > “Z Project”, then selecting “Projection type” > “Min Intensity”. The processing will take a while, depending on your machine.
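Under the hood, “Min Intensity” just keeps the darkest value each pixel position takes anywhere in the stack. A toy sketch with NumPy (the arrays here are made up; the real stack would be loaded from the page images):

```python
import numpy as np

# A toy stack of three 2x2 greyscale "pages" (0 = black ink, 255 = white paper)
stack = np.array([
    [[255, 255], [ 10, 255]],
    [[255,  20], [255, 255]],
    [[255, 255], [255,  30]],
])

# "Min Intensity" keeps the darkest pixel at each position across all pages,
# so the ink of the whole text block shows through at once
min_proj = stack.min(axis=0)
# min_proj == [[255, 20], [10, 30]]
```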

Here are some other interesting visualizations to try:

Image > Stacks > Make Montage reveals the entire book in one image. Are there patterns and rhythms in the layout?

Image > Stacks > Z Project > Projection type > Sum Slices creates this ghostly image:

Image > Stacks > Z Project > Projection type > Standard Deviation gives a more sinister-feeling visualization of the text block:

Image > Stacks > Plot Z-axis Profile charts the mean intensity of each slice, so the downward spikes are the blank pages after each illustration:

Image > Stacks > Orthogonal Views creates an interesting way to navigate the stack. The images on the right and bottom reveal cross sections of the book (image stack); use the cross-hairs to cut through the stack:
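For the curious, these projections are all simple reductions along the stack axis. A NumPy sketch with a made-up stack (the real one would come from the page images):

```python
import numpy as np

# Pretend stack: 400 "pages" of 28x18 greyscale pixels
stack = np.random.default_rng(0).integers(0, 256, size=(400, 28, 18))

sum_proj = stack.sum(axis=0)          # "Sum Slices": values accumulate into a ghostly overlay
std_proj = stack.std(axis=0)          # "Standard Deviation": bright where the pages disagree
z_profile = stack.mean(axis=(1, 2))   # "Plot Z-axis profile": one mean intensity per page,
                                      # so blank pages stand out as spikes in the chart
```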

ImageJ is powerful, capable of creating fascinating visualizations from batches of images. It was developed for scientists working with image data, often biomedical microscopy tracking cells or counting biomarkers. Digital humanities researchers such as Ferguson have expanded its use to media studies and visual culture. However, don’t let that stop you from “misusing” it to create something beautiful.

On that note, check out Ferguson’s recent use of ImageJ to manipulate video, “Edge”.

Reflecting on ColorOurCollections

My new favorite national holiday is #ColorOurCollections week, Feb 1-5, 2016.

The idea originated at the New York Academy of Medicine (NYAM) to merge the adult coloring book craze with digital collections. Amazing libraries around the world joined in, sharing some fun and beautiful coloring books. Open Culture listed some of the big names, such as the Bodleian, Smithsonian, DPLA, and Europeana. However, it was great to see so many less famous libraries creating awesome coloring books highlighting their fascinating collections, and having some fun!

Nonetheless, #ColorOurCollections also highlighted some less-than-best practices in the creation and distribution of digital files. Many of these libraries are at the forefront of digital preservation and/or user experience, but put out PDFs that violated all the rules… It’s a bit disappointing since it’s not that hard to get a few little details right, and libraries should be leading by example.

A lot of libraries did a great job, but here are some little things that bugged me about many coloring book offerings:

  • Random file names. If you are creating a PDF for public distribution don’t name it “coloringbook1.pdf” or “color-our-collections.pdf” or “jim_file3.pdf”. The file name is one of the few bits of metadata that can be easily understood without even opening the file. Make your file name descriptive and meaningful, providing the basic metadata (creator, title, and date) in a place where everyone can see it. Example: “ThisLibrary_CoolColoringBook_2016.pdf”
  • Huge file sizes. A coloring book is a mostly black and white document designed to be printed on approximately letter-size paper. A reasonable file size for something around 15 pages is under 2 MB, and could be smaller. I saw many PDFs in the range of 25 MB, and some over 65 MB! Larger file sizes will not give better quality; this is for public web distribution, not a professional press run printing glossy coffee table books. Think about your users and your web servers. People have to DOWNLOAD the PDF! Please make it a reasonable size.
  • No embedded metadata. When creating a PDF you should always check the embedded metadata. If you are exporting a PDF from LibreOffice Writer or MS Word, it will have embedded metadata automatically created based on your profile. Unfortunately, many people don’t realize that and have never checked their profile. Many of the coloring books thus have metadata like “Title: Microsoft Word – coloringbook_draft3_fromJim.docx” and an Author that is the profile name of whoever first created the file. This metadata will be displayed when users import the PDF into an ebook management tool, such as Calibre. Furthermore, this information is not helpful for future users trying to understand where the file came from and what it is, and could be a bit embarrassing depending on what the automatically generated information contains. I suggest you carefully edit the metadata before exporting the final version of your PDF. It should contain a meaningful title, a subject such as “#ColorOurCollections 2016”, an author/creator that relates to the institution, and a URL to find more information.
  • Lack of image metadata. If you send out a document highlighting some fascinating treasures of your collections, there had better be a clear means for users to find out more information! Every image used in the coloring book needs metadata directly on the page where it appears. Each page does not need the full archival description, but please give enough information for users to find the item in your online collections. A title, identifier, and URL are enough. I think these references need to be given on each coloring page, not on a separate reference and index page. Online resources and printed coloring book pages are quickly disassociated from their original context; don’t expect that the information given on an introduction or TOC page will be available to users.
  • Lack of overall context. Many of the coloring books were just pages of images. That is great for many users, but I would like to see a short introduction page that explains the context. Where did these images come from? Why are they interesting in the scope of your collections? Where can I learn more? This is an easy chance to communicate with patrons and invite them into our collections–which is the point of #ColorOurCollections.
  • Links to paid databases. A few coloring books had reference links to paid databases. I found this a bit insulting and against the spirit of #ColorOurCollections. One of the most amazing aspects of digital collections is the ability to democratically open up the public domain to the PUBLIC. We are able to take fragile materials traditionally hidden away in a locked basement, and give them out freely to the world! It is disappointing to see objects in the public domain digitized and then LOCKED back up in a proprietary, paid database. It’s even more disappointing to see those overpriced rip-offs promoted in a library coloring book.
  • Grey backgrounds. Sorry, but this is a coloring book! Who wants to color on grey paper? Who wants to waste printer ink printing a grey page background? Some images are just more appropriate for a coloring book than others. You cannot just desaturate a digitized image and call it a coloring book. Digitized pages have a color, and that page background needs to be removed to make a quality coloring book page. Generally, most coloring book images should be fully binarized, i.e. only pure black and white. Using GIMP you could desaturate (Colors > Desaturate) or greyscale (Image > Mode > Greyscale) the image, then use a threshold (Colors > Threshold) to eliminate the “color” of the page background. The coloring book image should be reduced to clean black lines and a white background. ScanTailor is a great tool that can do this pre-processing for you for many coloring-book-appropriate images. Play around with the output options in Black & White mode, tweaking “Thickness” and “Despeckling” until you get a good result.
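The Threshold step is simple enough to script as well. Here is the same idea sketched with NumPy (the 128 cutoff is a placeholder you would tune per scan):

```python
import numpy as np

def binarize(page, cutoff=128):
    """Threshold a greyscale page: ink becomes pure black (0), paper pure white (255)."""
    page = np.asarray(page)
    return np.where(page < cutoff, 0, 255)

# A toy 1x4 strip of pixels: dark ink, mid-grey hatching, light page tint, white
strip = [[12, 100, 180, 250]]
# binarize(strip) -> [[0, 0, 255, 255]]: the grey page background is gone
```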

Why no Digital Aladore coloring book?

ScanTailor b&w processing

I was thinking about putting together an Aladore coloring book, but I found the images had too many shades of greyscale hatching to reduce nicely to clean lines. Processing ends up with too many black blobs and too little detail. The images just don’t work as coloring pages! Here is a one-page PDF attempt just to show you what I mean:


Anyway, #ColorOurCollections was good fun, and I am looking forward to it next year!


Google Docs EPUB

Some news on the EPUB creation front:

Google Docs just enabled a feature to export as EPUB! To use it, simply open the doc and look under File > Download as > EPUB Publication.

This is a handy and very easy method to create an ebook. However, the consistency and quality aren’t good. The markup it creates is downright bizarre, with tons of unnecessary <span> tags and strange CSS. It also does not create a cover. In theory you could open this Google Doc EPUB with Sigil and do some polishing up, but given how unnecessarily complex the markup is, it would be more work than starting fresh.

So if you need a super quick EPUB for some reason, just click the “Download as” option. Otherwise, stick with the tools that provide better markup results, such as Writer2ePub and Sigil.

Update: Tesseract OCR in 2016

Using Tesseract via Command Line has consistently been the most popular post on Digital Aladore. However, due to some changes, I thought I should update the information.

Tesseract used to be hosted at Google Code, which closed up shop in August 2015. The project has transitioned to Github, with the main page at

and the Wiki page at:

There is no longer an official Windows installer for the current release. If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)

If you want a more up-to-date version on Windows there are some third party installers put together for you, otherwise you have to compile it yourself (unless you use Cygwin or MSYS2):

It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.

On Linux installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and use make.

Now that you have it installed, the commands on the old post should work fine!
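As a sanity check, the basic invocation is unchanged. Here is a small Python sketch that builds the command, and only actually runs it when the Tesseract binary and the image are both present (the file names are made up):

```python
import os
import shutil
import subprocess

def ocr_page(image_path, output_base, lang="eng"):
    """Build the Tesseract command; run it only if the binary and image exist."""
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if shutil.which("tesseract") and os.path.exists(image_path):
        subprocess.run(cmd, check=True)  # writes output_base + ".txt"
    return cmd

# e.g. ocr_page("aladore_042.jpg", "aladore_042") would produce aladore_042.txt
```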

Public Domain Day 2016!

Happy 2016 from Digital Aladore!

January 1st brings us to another joyous Public Domain Day, the holiday where lots of people celebrate the New Year AND a new crop of works entering the public domain.

Sadly, in America we have NOTHING to celebrate. Because of bizarre copyright extensions, we will not have any works entering the public domain until 2019. It is stunning to think that while the rest of the world is celebrating, the USA has not had a happy Public Domain Day since 1978… When copyright was first introduced in America the term was 14 years; current works now enjoy life of the author + 70 years, or, for works of corporate authorship, 95 years from publication. The extensions in 1978 and 1998 applied retrospectively to old works, creating a crazy tangle of rules (check out a summary from Peter B. Hirtle), which highlights the nonsense of the move: the rationale for extended terms was incentivizing creation, but it seems hard to fathom it motivating a bunch of dead people! Meanwhile the preservation of our cultural resources has become illegal, with fragile artifacts such as our film heritage literally disappearing.

Here is what I said last year, and things have only gotten worse:

Recent research and economic modeling suggest that current copyright terms are too long and do NOT provide incentive for creation. Instead our shared culture is being locked away by corporate profiteers. In fact, the majority of works still protected by copyright are orphans, out of print with no likelihood of ever being used again commercially. Projects like Digital Aladore, Free software, and honestly the majority of the internet point out that creators aren’t purely profit driven. It’s time to reform copyright to benefit the creators rather than hoarders of capital (who already have plenty of power and wealth!).

North of the border, in Canada, things are more cheerful this year. The works of a lot of great authors and thinkers will become freely available resources to drive current learning, thought, and creativity. Libraries and archives will be able to legally preserve, digitize, and provide access to valuable cultural creations. Check out the Public Domain Review’s Class of 2016 for some highlights. However, there is a pall on the celebrations: the Trans-Pacific Partnership trade deal threatens to force countries to adopt a minimum copyright term of life + 70 years.

A sad holiday indeed, learn more at the Center for the Study of the Public Domain:

“What do these laws mean to you? As you can read in our analysis here, they impose great (and in many cases unnecessary) costs on creativity, on libraries and archives, on education and on scholarship. More broadly, they impose costs on our collective culture. We have little reason to celebrate on Public Domain Day because our public domain has been shrinking, not growing.”

Nonetheless, here at Digital Aladore we wish you all the best for the New Year!



DigitalAladore 1.5, EPUB3 Edition!

Is DigitalAladore 1.0 looking crummy on your ultra-high-def 10-inch tablet screen?

Well, give DigitalAladore 1.5 a try! Following the workflow outlined in previous posts, I generated an Aladore EPUB3 edition. The images are much bigger and the CSS is slightly tweaked with larger screens in mind. Personally, I still find reading ebooks on tablets a bit unsatisfying, slightly too big and bright. But I think this version will look pretty good! However, at over 9MB it might be slow to load on your e-ink reader.

So without further ado, you can find the new EPUB3 at Internet Archive,

DigitalAladore 1.5: Aladore, by Henry Newbolt (1914, epub3),

News at Sigil Ebook

Since Digital Aladore is more than a year old (see The Idea), I thought I should check in with a few of the key tools for any news. First up is the Sigil ebook editor, used for creating the various EPUB versions. As I have said many times, it is a great tool! There are a few features, like the character report and auto-merging HTML, that I wish were in my everyday text editor.

After a scary period where it looked like development on Sigil might stall, I am happy to see it surge back to an active project full of interesting changes. This week version 0.9.1 was released stabilizing a host of new features moving the application towards full EPUB3 support. Also be sure to check through the Plugin Index to find many useful extensions for the editor.

Creating an editor that supports both EPUB2 and 3 is a bit complicated. As I mentioned in an earlier post, older versions of Sigil automatically correct markup and packaging to match the EPUB2 standard. To fix this issue, version 0.9.1 replaces Xerces (xml parser) and Tidy (html parser) with Python lxml and Google Gumbo, and makes the FlightCrew EPUB2 validator a plugin rather than built in tool.

Despite the major overhaul under the hood, using Sigil remains almost unchanged, which is great. So thank you to current maintainers Kevin Hendricks and Doug Massay and everyone else who makes this Free and open tool available!

Check out the code or get the latest version at Github.


Thoughts About EPUB3

When I first started looking into the EPUB3 specs, I was excited by the possibilities of a more powerful ebook format. Just think of all the neat things you can create with simple CSS and JS! I imagined creating little “epub apps” like a calculator or timer. It would be a neat way to add functionality to very simple devices such as the Sony Reader. I created a few test versions; however, these demos often worked in Calibre’s built-in reader but were not functional on any actual ereaders.

Of course, the point would be to go beyond silly little apps and add some interesting and valuable extensions to the ebook, such as text collation or visualizations. Simple adjustable collation tools could be embedded so that the reader could query the text while reading. Some of this functionality has been built into the reading apps on some devices, such as Kindle X-Ray. Simple interactive elements would be useful for textbooks and manuals to make information delivery more interesting. Imagine something like Jupyter Notebook, which can run embedded Python code.

Unfortunately, there just isn’t good support for the advanced features of EPUB3 in an open and flexible way.  As I mentioned in a previous post, device makers only seem interested in the possibilities of further limiting users with tougher DRM, rather than enabling new possibilities. In the ideal world we could combine the open format with open hardware and software!

Remove the NCX

Interested in minute EPUB intricacies? Good, you are in for a treat!

One of the steps I mentioned for going from EPUB2 to EPUB3 is removing the toc.ncx file. This is actually a somewhat involved step (that is probably unnecessary) so I thought I would expand on it a bit. It also gives you a chance to poke around EPUB innards…

The NCX, i.e. “Navigation Center eXtended”, was a feature to enhance navigation and accessibility based on the DAISY/NISO standard. It was required in EPUB2, referenced by the spine’s toc attribute. However, the EPUB 3.0.1 spec tells you the NCX is superseded. Instead you are required to include an EPUB Navigation Document (i.e. nav.xhtml) that makes use of the HTML5 nav element. Basically you need to set up a <nav>, with an <ol> inside, containing <li> elements with <a> relative links to parts of the ebook.

Since this file is a valid HTML document, it can be easily rendered by the reading device. Thus the new navigation file can serve both as a human and machine readable TOC.  You can write the table of contents once in <nav>, use it at the beginning of the book (for people) and for the device to understand the reading order of the digital files to provide extended navigation.
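For example, a minimal EPUB Navigation Document might look like this (the file and chapter names are illustrative, not taken from the Aladore build):

```xml
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>Table of Contents</title></head>
<body>
  <!-- epub:type="toc" marks this nav as the machine-readable TOC -->
  <nav epub:type="toc">
    <ol>
      <li><a href="chapter01.xhtml">Chapter I</a></li>
      <li><a href="chapter02.xhtml">Chapter II</a></li>
    </ol>
  </nav>
</body>
</html>
```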

So in EPUB3 you need a Nav doc, but do you need to get rid of the NCX? No, not really… The spec’s “NCX Superseded” section says that we “MAY” include the NCX since it will not interfere with anything, “but EPUB 3 Reading Systems must ignore the NCX.” I.e. older devices will keep looking for the NCX, but newer ones will definitely not.

So we have an EPUB2 and we create a new Nav-based TOC; the question is the NCX, to keep or not to keep… I really don’t have a good answer. It seems there is no reason not to keep it?

But if you want to get rid of it, building a pure EPUB3, it’s more complicated than just deleting one file. Here’s what you need to do:

  • Unzip your EPUB (it is probably already unzipped if you are monkeying around with the EPUB2 to 3 transition), navigate to the OEBPS directory.
  • Delete the toc.ncx.
  • Open the content.opf file in a text editor. This is an XML file that defines the ebook.
  • Look for the <manifest> element and find an <item> listing for the NCX (easiest just to search for “.ncx”).  It should look something like this: <item href="toc.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>. Delete it!
  • Find the <spine> element. It should have a toc attribute that looks like this: <spine toc="ncx">. Delete the whole attribute, as it is optional in EPUB3, leaving <spine>.
  • Save your cleansed content.opf!

Let me know what you decide!

Onward to EPUB3?

DigitalAladore 1.0 is a valid EPUB2. To recap: EPUB was chosen for the ebook because it is a free and open format built on open web standards (in contrast to proprietary formats such as Kindle AZW). And we love Free because of the many practical benefits of open source development plus the moral ideals of respecting the user’s freedom.

The EPUB2 standard was first released in 2007, but has since been superseded by EPUB3 released in October 2011. EPUB3 was designed to take advantage of new elements introduced in HTML5 and allow more interactive functionality (script). However, support of the full specification continues to be very poor. The only readers with full support seem to be commercial apps that deliver interactive books in a closed ecosystem. For example, AZARDI offers a cost-free reading app that has good support of advanced features of EPUB3, but it is focused on secure “content fulfillment” of interactive textbook subscriptions. To publish to the platform, authors must use their proprietary ebook creation application. Kobo and Apple have developed tweaked versions of EPUB3 that do not fully comply with the standard and focus on the possibilities for improved DRM, rather than functionality not found in EPUB2.

However, for simple functionality (i.e. a linear novel) EPUB3 is supported by most reading devices. I decided to update the Aladore EPUB2 to an EPUB3 version for future compatibility, higher specs, and improved semantic inflection. Guidelines now suggest using larger images and cover images than I used in the EPUB2, to ensure they don’t look terrible on HD tablets. So while DigitalAladore 1.0 was optimized for older e-ink ereaders, the EPUB3 version will be optimized for larger, more powerful devices.

However, Sigil does not currently support the creation of ebooks following the EPUB3 spec. If you make changes to the markup following EPUB3, Sigil will actually correct them back to EPUB2 when saving the file.  So, to create the Aladore EPUB3 we have to do a few extra steps:

  • Replace all the image files with larger versions using Sigil.
  • Use the Sigil plugin ePub3-itizer to export a pseudo EPUB3. Sigil developers intend to implement full EPUB3 creation and editing support soon, so this plugin is considered a “stop-gap measure.” It changes the HTML headers, restructures a few files, and adds the nav.xhtml.
  • Unzip the ePub3-itizer output to edit the contents. Because Sigil limits the markup to XHTML valid to the EPUB2 spec, it is not possible to add HTML5 tags such as section or EPUB3 attributes such as epub:type (thus, it is what I call a pseudo EPUB3). I used the IDPF Accessibility Guidelines (The epub:type attribute) plus the attribute vocab EPUB 3 Structural Semantics Vocabulary to add some semantic structure to the text. This markup can be used for styling the document with CSS, but is also useful for machine processing and accessibility options. You can mark up sections of the ebook (frontmatter, body, backmatter), divisions within (abstract, chapters), types of content (footnote), or individual elements (title). I added div tags with attributes in the EPUB2, which I converted to section tags; for example, each <div class="chapter"> became <section epub:type="chapter">. I used these epub:type values: cover, titlepage, chapter, epigraph, toc, and loi. Since I made each chapter a single XHTML file, another option would be to add the epub:type attribute to the body element. However, those attributes would be lost if merging the HTML, so I prefer the section tags.
  • Delete the toc.ncx file.  This file was used by older reading devices to provide navigation functionality, but it is not part of the EPUB3 spec as it is replaced by nav.xhtml. However, many people seem to be leaving this file in the EPUB for legacy support. If you leave it, everything should work fine, but the file will NOT fully validate.
  • Re-Zip the new EPUB3. EPUBs need to be zipped in the correct order or they will not function. This means you must create the zip archive first (in Windows right click somewhere and choose New > Compressed (zip) Folder), then add the mimetype file (drag it into the new zip folder). Then all the rest of the content can be added. Finally, change the extension from .zip to .epub.
  • Validate with the IDPF EPUB Validator.

The sketchyTech blog talks about the differences created by this process in more detail if you want to hear from someone else…

But basically, that’s it! Not too complicated, although it requires some thought about 1) the quality of images to include, 2) changes to styling with larger screens in mind, and 3) semantic inflection to provide better accessibility and machine readability. I will post the new Aladore EPUB3 soon!