The edited text and HTML created by the Digital Aladore project have been incorporated into the “aladore-book” project to create a web-based edition with data features–visit the site and give it a read or check out the text data it creates!
aladore-book uses the static site generator Jekyll to transform the text into a complete website hosted on GitHub Pages. The text is stored as a series of plain text Markdown files, one for each chapter. This text is used to generate both the web pages of the book (for reading online) and the data derivatives (designed for importing into other tools for analysis).
The source code of the project is available in a GitHub repository. This allows the book and text to be easily shared, adapted, and modified by anyone–or used as a template for creating other books.
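To give a feel for the "one Markdown file per chapter in, data derivative out" idea, here is a minimal sketch in Python. It is not how aladore-book itself works (that project does the job with Jekyll); the function names, the `*.md` file pattern, and the choice of a word-count CSV as the derivative are all assumptions made for illustration.

```python
import csv
import re
from pathlib import Path


def chapter_word_counts(chapter_dir):
    """Return a list of (filename, word_count) for each Markdown chapter file."""
    rows = []
    for path in sorted(Path(chapter_dir).glob("*.md")):
        text = path.read_text(encoding="utf-8")
        # Count runs of letters (with optional internal apostrophes) as words.
        # Note: any YAML front matter at the top of a file is counted too.
        words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text)
        rows.append((path.name, len(words)))
    return rows


def write_counts_csv(rows, out_path):
    """Write the per-chapter counts as a small CSV data derivative."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["chapter_file", "word_count"])
        writer.writerows(rows)
```

Calling `write_counts_csv(chapter_word_counts("chapters"), "word_counts.csv")` on a hypothetical `chapters` folder would produce a file that can be opened in a spreadsheet or imported into an analysis tool.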
Recently, images of Jason Shulman’s “Photographs of Film” project came across my Twitter feed. I love long exposure photography, so I was really enjoying the idea, but it also brought up the controversy surrounding his exhibit about a year ago. The issue is explained by researcher Kevin L. Ferguson in “To Cite or to Steal? When a Scholarly Project Turns Up in a Gallery”.
What is most interesting to me is Ferguson’s insistence on openness about his methods and techniques, versus Shulman’s deliberate secrecy. Readers of Digital Aladore will not be surprised that despite many years working as a studio artist, I sympathize with the openness!
Ferguson’s article included a handy mini-tutorial about ImageJ–which set me off to find some old images that I never shared on this blog. Early in the project, I covered using ImageJ to preprocess Aladore’s page images (Preprocessing 2). However, you can do many more interesting and beautiful things with ImageJ…
Like visualize the entire text block at once:
I find this image oddly compelling. It had a practical purpose of helping find the text block to crop out of the page images, thus simplifying OCR. But it is also a unique visualization of the physical book, allowing us to read in a new way. We can see the pages summarized visually, perhaps revealing new insights into this artifact, hinting at the inky physicality missing in the digital images.
Anyway, they are really interesting to look at. So let’s do some more!
Get a copy of ImageJ2 (this is a newer version than the one used in my old posts; I suggest using the Fiji distribution). Make sure you have an up-to-date 64-bit version of Java installed (if you use 32-bit Java your memory will be limited).
Get a big stack of images, for example a full copy of Aladore from Internet Archive. Visit the Aladore 1914 book page and download the “SINGLE PAGE PROCESSED JPEG ZIP”. Unzip the package.
Open ImageJ / Fiji, go to “Edit” > “Options” > “Memory & threads”. Give it as much memory as seems reasonable for your system.
Now open your images: “File” > “Import” > “Image sequence”, navigate into the image folder and click “Open”. On the “Sequence Options”, check “Use virtual stack” to save memory. This will create an image stack. Browse through the stack by using the slider bar at the bottom.
If you want to work with these full-sized images, you will need lots of RAM; otherwise, resize the stack with “Image” > “Scale”. For example, Aladore is just over 400 page images, at around 2800 x 1800 pixels each. Running Z Project with Sum Slices required around 15 GB of memory. Scaled at “0.5”, it was able to run on 4 GB.
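A quick back-of-the-envelope calculation shows why scaling helps so much. This is a rough sketch that only counts raw pixel data; the bytes-per-pixel values are assumptions about how the images are held in memory, and ImageJ's own working copies and overhead are ignored, so the results are a lower bound and will not match the measured figures above.

```python
def stack_bytes(n_images, width, height, bytes_per_pixel, scale=1.0):
    """Rough size in bytes of an uncompressed image stack held in memory."""
    scaled_w = int(width * scale)
    scaled_h = int(height * scale)
    return n_images * scaled_w * scaled_h * bytes_per_pixel


GB = 1024 ** 3

# Assumed storage formats: 1 byte per pixel for 8-bit grey,
# 4 bytes per pixel for RGB or 32-bit float data.
for label, bpp in [("8-bit grey", 1), ("32-bit (RGB or float)", 4)]:
    full = stack_bytes(400, 2800, 1800, bpp) / GB
    half = stack_bytes(400, 2800, 1800, bpp, scale=0.5) / GB
    print(f"{label}: full {full:.2f} GB, scaled 0.5 {half:.2f} GB")
```

Whatever the pixel format, scaling by 0.5 halves both width and height, so the stack needs only a quarter of the memory.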
If you would like to crop the stack to clean up the edges, drag a rectangle over the image, then choose “Image” > “Crop”.
The image above was created using “Image” > “Stacks” > “Z Project”, then selecting “Projection type” > “Min Intensity”. The processing will take a while depending on your machine.
Here are some other interesting visualizations to try:
Image > Stacks > Make montage, reveals the entire book in one image. Are there patterns and rhythms in the layout?
Image > Stacks > Z Project > Projection type > Sum Slices, creates this ghostly image:
Image > Stacks > Z Project > Projection type > Standard Deviation, a more sinister feeling visualization of the text block:
Image > Stacks > Plot Z-axis profile, charts the mean intensity, so the downward spikes are the blank pages after each illustration:
Image > Stacks > Orthogonal views, creates an interesting way to navigate the stack. The images on the right and bottom reveal cross sections of the book (image stack); use the cross-hairs to cut through the stack:
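If you are curious what the projections and the Z-axis profile are doing under the hood, the same ideas can be sketched with NumPy. This is only an illustration on a tiny made-up stack, not ImageJ's actual code: the function names are hypothetical, and NumPy's default standard deviation may be normalized differently from ImageJ's.

```python
import numpy as np


def z_projections(stack):
    """Collapse a (pages, height, width) stack along the page axis."""
    stack = np.asarray(stack, dtype=np.float64)
    return {
        "min": stack.min(axis=0),  # darkest value seen at each pixel
        "sum": stack.sum(axis=0),  # total intensity at each pixel
        "std": stack.std(axis=0),  # how much each pixel varies across pages
    }


def z_axis_profile(stack):
    """Mean intensity of every page, in page order."""
    stack = np.asarray(stack, dtype=np.float64)
    return stack.mean(axis=(1, 2))


# Tiny synthetic "book": three 2x2 pages, 0 = black ink, 255 = white paper.
pages = np.array([
    [[255, 255], [0, 255]],
    [[255, 0], [0, 255]],
    [[255, 255], [255, 255]],
])
projections = z_projections(pages)
profile = z_axis_profile(pages)
```

In the minimum projection, any pixel that was inked on at least one page comes out black, which is why the text block shows up so clearly.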
ImageJ is powerful and capable of creating fascinating visualizations of batches of images. It is developed for scientists working with imagery data, often biomedical microscopy tracking the numbers of cells or counting biomarkers. Digital humanities researchers such as Ferguson have expanded its use to media studies and visual culture. However, don’t let that stop you from “misusing” it to create something beautiful.
On that note, check out Ferguson’s recent use of ImageJ to manipulate video, “Edge”.
My new favorite national holiday is #ColorOurCollections week, Feb 1-5, 2016.
The idea originated at the New York Academy of Medicine (NYAM) to merge the adult coloring book craze with digital collections. Amazing libraries around the world joined in, sharing some fun and beautiful coloring books. Open Culture listed some of the big names, such as Bodleian, Smithsonian, DPLA, and Europeana. However, it was great to see so many less famous libraries creating awesome coloring books highlighting their fascinating collections–and having some fun!
Nonetheless, #ColorOurCollections also highlighted some less-than-best practices in the creation and distribution of digital files. Many of these libraries are on the forefront of digital preservation and/or user experience, but put out PDFs that violated all the rules… It’s a bit disappointing since it’s not that hard to get a few little details right, and libraries should be leading by example.
A lot of libraries did a great job, but here are some little things that bugged me about many coloring book offerings:
- Random file names. If you are creating a PDF for public distribution, don’t name it “coloringbook1.pdf” or “color-our-collections.pdf” or “jim_file3.pdf”. The file name is one of the few bits of metadata that can be easily understood without even opening the file. Make your file name descriptive and meaningful, providing the basic metadata (creator, title, and date) in a place where everyone can see it. Example: “ThisLibrary_CoolColoringBook_2016.pdf”
- Huge file sizes. A coloring book is a mostly black and white document designed to be printed on approximately letter-size paper. A reasonable file size for something around 15 pages is under 2 MB, and could be smaller. I saw many PDFs in the range of 25 MB, and some over 65 MB! Larger file sizes will not give better quality–this is for public web distribution, not a professional press run printing glossy coffee table books. Think about your users and your web servers. People have to DOWNLOAD the PDF! Please make it a reasonable size.
- No embedded metadata. When creating a PDF you should always check the embedded metadata. If you are exporting a PDF from LibreOffice Writer or MS Word, it will have embedded metadata automatically created based on your profile. Unfortunately, many people don’t realize that and have never checked their profile. Many of the coloring books thus have metadata like “Title: Microsoft Word – coloringbook_draft3_fromJim.docx” and an Author that is the profile name of whoever first created the file. This metadata will be displayed when users import the PDF into an ebook management tool, such as Calibre. Furthermore, this information is not helpful for future users trying to understand where the file came from and what it is–and could be a bit embarrassing depending on what the automatically generated information contains. I suggest you carefully edit the metadata before exporting the final version of your PDF. It should contain a meaningful title, a subject such as “#ColorOurCollections 2016”, an author/creator that relates to the institution, and a URL to find more information.
- Lack of image metadata. If you send out a document highlighting some fascinating treasures of your collections–there had better be a clear means for users to find out more information! Every image used in the coloring book needs metadata directly on the page where it appears. Each page does not need the full archival description, but please give enough information for the users to find the item in your online collections. A title, identifier, and URL are nice. I think these references need to be given on each coloring page, not in a separate reference or index page. Online resources and printed coloring book pages are quickly disassociated from their original context–don’t expect that the information given on an introduction or TOC page will be available to users.
- Lack of overall context. Many of the coloring books were just pages of images. That is great for many users, but I would like to see a short introduction page that explains the context. Where did these images come from? Why are they interesting in the scope of your collections? Where can I learn more? This is an easy chance to communicate with patrons and invite them into our collections–which is the point of #ColorOurCollections.
- Links to paid databases. A few coloring books had reference links to paid databases. I found this a bit insulting and against the spirit of #ColorOurCollections. One of the most amazing aspects of digital collections is the ability to democratically open up the public domain to the PUBLIC. We are able to take fragile materials traditionally hidden away in a locked basement, and give them out freely to the world! It is disappointing to see objects in the public domain digitized and then LOCKED back up in a proprietary, paid database. It’s even more disappointing to see those overpriced rip-offs promoted in a library coloring book.
- Grey backgrounds. Sorry, but this is a coloring book! Who wants to color on grey paper? Who wants to waste printer ink printing a grey page background? Some images are just more appropriate for a coloring book than others. You cannot just desaturate a digitized image and call it a coloring book. Digitized pages have a color, and that page background needs to be removed to make a quality coloring book page. Generally, most coloring book images should be fully binarized, i.e. only pure black and white. Using GIMP you could desaturate (Colors > Desaturate) or greyscale (Image > Mode > Greyscale) the image, then use a Threshold (Colors > Threshold) to eliminate the “color” of the page background. The coloring book image should be reduced to clean black lines and white background. ScanTailor is a great tool that can do this pre-processing for you for many coloring book appropriate images. Play around with the output options in Black & White mode, tweaking “Thickness” and “Despeckling” until you get a good result.
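For anyone wondering what the threshold step in the last item actually does to the pixels, here is a minimal sketch using NumPy. The function name, the sample values, and the cut-off of 128 are all assumptions chosen for illustration; a real scan will need its own threshold, picked by eye.

```python
import numpy as np


def binarize(grey, threshold=128):
    """Map a greyscale array to pure black (0) and pure white (255)."""
    grey = np.asarray(grey)
    # Pixels at or above the threshold become white paper, the rest black ink.
    return np.where(grey >= threshold, 255, 0).astype(np.uint8)


# A light grey page background (200) with darker line work (40 and 90).
scan = np.array([
    [200, 200, 40],
    [200, 90, 200],
])
clean = binarize(scan, threshold=128)
```

The grey background snaps to white and the line work to black, which is exactly why images with lots of mid-grey hatching tend to collapse into blobs.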
Why no Digital Aladore coloring book?
I was thinking about putting together an Aladore coloring book, but I found the images had too many shades of greyscale hatching to reduce nicely to clean lines. Processing ends up with too many black blobs, with too little detail. The images just don’t work as a coloring page! Here is a one-page PDF attempt just to show you what I mean:
Anyway, #ColorOurCollections was good fun, and I am looking forward to it next year!
Some news on the EPUB creation front:
Google Docs just enabled a feature to export as EPUB! To use it, simply open the doc and look under File > Download as > EPUB Publication.
This is a handy and very easy method to create an ebook. However, the consistency and quality aren’t good. The markup it creates is downright bizarre, with tons of unnecessary <span> tags and strange CSS. It also does not create a cover. In theory you could open this Google Doc EPUB with Sigil and do some polishing up, but given how unnecessarily complex the markup is, it would be more work than starting fresh.
Using Tesseract via Command Line has consistently been the most wildly popular post on Digital Aladore. However, due to some changes, I thought I should update the information.
For current information, see the Wiki page at: https://github.com/tesseract-ocr/tesseract/wiki
There is no longer an official Windows installer for the current release.
If you want to use an installer, the old version on Google Code from 2012 still works, featuring Tesseract version 3.02.02 (I am not sure if this link will go away when Google Code moves to archive status in January 2016). This package is handy for English users since it will install everything you need in one go. (Update 2016-01-30: the link did go away in January, so there is no longer access to the old Google Code pages. It is easiest to use one of the installers listed below. If you want to compile it yourself, check out the directions from bantilan.)
- A nice portable install of both engine and language data from a guy named Simon
- A bleeding edge installer from Mannheim University Library
It is important to note that the OCR engine and language data are now completely separate. Install the Tesseract engine first, then unzip the language data into the “tessdata” directory.
On Linux, installation is easier. Use your distro’s software repository (the package is usually called ‘tesseract-ocr’), or download the latest release and use make.
Now that you have it installed, the commands on the old post should work fine!
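As a quick reminder of what a basic call looks like, here is a small Python sketch that builds and runs the usual `tesseract IMAGE OUTPUTBASE -l LANG` command. The helper names and the example file names are hypothetical, and it assumes the `tesseract` program is on your PATH with the matching language data (for example `eng.traineddata`) in the “tessdata” directory.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG"""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract if it is on the PATH; return True when it ran cleanly."""
    if shutil.which("tesseract") is None:
        return False  # engine not installed or not on the PATH
    result = subprocess.run(build_tesseract_command(image_path, output_base, lang))
    return result.returncode == 0
```

For example, `run_ocr("page_0001.jpg", "page_0001")` should leave the recognized text in `page_0001.txt` if everything is installed correctly.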
Happy 2016 from Digital Aladore!
January 1st brings us to another joyous Public Domain Day, the holiday where lots of people celebrate the New Year AND a new crop of works entering the public domain.
Sadly, in America we have NOTHING to celebrate. Because of bizarre copyright extensions, we will not have any works entering the public domain until 2019. It is stunning to think that while the rest of the world is celebrating, the USA has not had a happy Public Domain Day since 1978… When copyright was first introduced in America the term was 14 years; current works now enjoy life of the author + 70 years or, for works of corporate authorship, 95 years from publication. The extensions in 1978 and 1998 applied retrospectively to old works, creating a crazy tangle of rules (check out a summary from Peter B. Hirtle), which highlights the nonsense of the move: the rationale for extended terms was incentivizing creation, but it seems hard to fathom it motivating a bunch of dead people! Meanwhile the preservation of our cultural resources has become illegal, with fragile artifacts such as our film heritage literally disappearing.
Here is what I said last year, and things have only gotten worse:
Recent research and economic modeling suggest that current copyright terms are too long and do NOT provide incentive for creation. Instead our shared culture is being locked away by corporate profiteers. In fact, the majority of works still protected by copyright are orphans–out of print with no likelihood of ever being used again commercially. Projects like Digital Aladore, Free software, and honestly the majority of the internet point out that creators aren’t purely profit driven. It’s time to reform copyright to benefit the creators rather than hoarders of capital (who already have plenty of power and wealth!).
North of the border, in Canada, things are more cheerful this year. The works of a lot of great authors and thinkers will become freely available resources to drive current learning, thought, and creativity. Libraries and Archives will be able to legally preserve, digitize, and provide access to valuable cultural creations. Check out the Public Domain Review’s Class of 2016 for some highlights. However, there is a pall on the celebrations. The Trans-Pacific Partnership trade deal threatens to force countries to have a minimum of life+70 years copyright term.
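The term arithmetic behind these dates is simple enough to sketch. The helper names below are hypothetical, and the sketch assumes terms run to the end of the calendar year, so a work becomes free on the following January 1. The life+50 figure for Canada is the term in force at the time of writing and is not stated in the post itself; the real rules have many more special cases than this.

```python
def public_domain_year_life_plus(death_year, term_years):
    """Year a work enters the public domain under a life + N years term."""
    # Protection runs to the end of the calendar year, so the work is free
    # on January 1 of the following year.
    return death_year + term_years + 1


def public_domain_year_from_publication(publication_year, term_years=95):
    """Year a work enters the public domain under a fixed term from publication."""
    return publication_year + term_years + 1


# US works published in 1923 under a 95-year term:
us_1923 = public_domain_year_from_publication(1923)

# An author who died in 1965, under life+50 versus life+70:
life_50 = public_domain_year_life_plus(1965, 50)
life_70 = public_domain_year_life_plus(1965, 70)
```

Here `us_1923` comes out as 2019, matching the date mentioned above, while the same 1965 death year gives 2016 under life+50 and 2036 under life+70: twenty more years locked away.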
A sad holiday indeed, learn more at the Center for the Study of the Public Domain:
“What do these laws mean to you? As you can read in our analysis here, they impose great (and in many cases unnecessary) costs on creativity, on libraries and archives, on education and on scholarship. More broadly, they impose costs on our collective culture. We have little reason to celebrate on Public Domain Day because our public domain has been shrinking, not growing.”
Nonetheless, here at Digital Aladore we wish you all the Best for the New Year!
Is DigitalAladore 1.0 looking crummy on your ultra high def 10 inch tablet screen?
Well, give DigitalAladore 1.5 a try! Following the workflow outlined in previous posts, I generated an Aladore EPUB3 edition. The images are much bigger and the CSS is slightly tweaked with larger screens in mind. Personally, I still find reading ebooks on tablets a bit unsatisfying, slightly too big and bright. But I think this version will look pretty good! However, at over 9MB it might be slow to load on your e-ink reader.
So without further ado, you can find the new EPUB3 at Internet Archive:
DigitalAladore 1.5: Aladore, by Henry Newbolt (1914, epub3), https://archive.org/details/AladoreHenryNewbolt3
Since Digital Aladore is more than a year old (see The Idea), I thought I should check in with a few of the key tools for any news. First up is Sigil Ebook editor, used for creating the various EPUB versions. As I have said many times, it is a great tool! There are a few features like the character report and auto merging html that I wish were in my everyday text editor.
After a scary period where it looked like development on Sigil might stall, I am happy to see it surge back to an active project full of interesting changes. This week version 0.9.1 was released, stabilizing a host of new features moving the application towards full EPUB3 support. Also be sure to check through the Plugin Index to find many useful extensions for the editor.
Creating an editor that supports both EPUB2 and 3 is a bit complicated. As I mentioned in an earlier post, older versions of Sigil automatically correct markup and packaging to match the EPUB2 standard. To fix this issue, version 0.9.1 replaces Xerces (XML parser) and Tidy (HTML parser) with Python lxml and Google Gumbo, and makes the FlightCrew EPUB2 validator a plugin rather than a built-in tool.
Despite the major overhaul under the hood, using Sigil remains almost unchanged, which is great. So thank you to current maintainers Kevin Hendricks and Doug Massay and everyone else who makes this Free and open tool available!
Check out the code or get the latest version at GitHub.
When I first started looking into the EPUB3 specs, I was excited by the possibilities of a more powerful ebook format. Just think of all the neat things you can create with simple CSS and JS! I imagined creating little “epub apps” like a calculator or timer. It would be a neat way to add functionality to very simple devices such as the Sony Reader. I created a few test versions; however, while these demos often worked in Calibre’s built-in reader, they were not functional on any actual ereaders.
Of course, the point would be to go beyond silly little apps and add some interesting and valuable extensions to the ebook, such as text collation or visualizations. Simple adjustable collation tools could be embedded so that the reader could query the text while reading. Some of this functionality has been built into the reading apps on some devices, such as Kindle X-Ray. Simple interactive elements would be useful for textbooks and manuals to make information delivery more interesting. Imagine something like Jupyter Notebook, which can run embedded Python code.
Unfortunately, there just isn’t good support for the advanced features of EPUB3 in an open and flexible way. As I mentioned in a previous post, device makers only seem interested in the possibilities of further limiting users with tougher DRM, rather than enabling new possibilities. In the ideal world we could combine the open format with open hardware and software!