My new favorite national holiday is #ColorOurCollections week, Feb 1-5, 2016.
The idea originated at the New York Academy of Medicine (NYAM) to merge the adult coloring book craze with digital collections. Amazing libraries around the world joined in, sharing some fun and beautiful coloring books. Open Culture listed some of the big names, such as the Bodleian, the Smithsonian, DPLA, and Europeana. However, it was great to see so many less famous libraries creating awesome coloring books highlighting their fascinating collections, and having some fun!
Nonetheless, #ColorOurCollections also highlighted some less-than-best practices in the creation and distribution of digital files. Many of these libraries are at the forefront of digital preservation and/or user experience, yet put out PDFs that violated all the rules… It’s a bit disappointing, since it’s not that hard to get a few little details right, and libraries should be leading by example.
A lot of libraries did a great job, but here are some little things that bugged me about many coloring book offerings:
- Random file names. If you are creating a PDF for public distribution, don’t name it “coloringbook1.pdf” or “color-our-collections.pdf” or “jim_file3.pdf”. The file name is one of the few bits of metadata that can be easily understood without even opening the file. Make your file name descriptive and meaningful, providing the basic metadata (creator, title, and date) in a place where everyone can see it. Example: “ThisLibrary_CoolColoringBook_2016.pdf”
- Huge file sizes. A coloring book is a mostly black-and-white document designed to be printed at approximately letter-size paper. A reasonable file size for something around 15 pages is under 2 MB, and could be smaller. I saw many PDFs in the range of 25 MB, and some over 65 MB! Larger file sizes will not give better quality: this is for public web distribution, not a professional press run printing glossy coffee-table books. Think about your users and your web servers. People have to DOWNLOAD the PDF! Please make it a reasonable size.
- No embedded metadata. When creating a PDF you should always check the embedded metadata. If you are exporting a PDF from LibreOffice Writer or MS Word, it will have embedded metadata automatically generated from your user profile. Unfortunately, many people don’t realize that and have never checked their profile. Many of the coloring books thus have metadata like “Title: Microsoft Word – coloringbook_draft3_fromJim.docx” and an Author that is the profile name of whoever first created the file. This metadata will be displayed when users import the PDF into an ebook management tool, such as Calibre. Furthermore, this information is not helpful for future users trying to understand where the file came from and what it is, and could be a bit embarrassing depending on what the automatically generated information contains. I suggest you carefully edit the metadata before exporting the final version of your PDF. It should contain a meaningful title, a subject such as “#ColorOurCollections 2016”, an author/creator that relates to the institution, and a URL to find more information.
- Lack of image metadata. If you send out a document highlighting some fascinating treasures of your collections, there had better be a clear means for users to find out more information! Every image used in the coloring book needs metadata directly on the page where it appears. Each page does not need the full archival description, but please give enough information for users to find the item in your online collections. A title, identifier, and URL are enough. I think these references need to be given on each coloring page, not on a separate reference and index page. Online resources and printed coloring book pages are quickly disassociated from their original context; don’t expect that the information given on an introduction or TOC page will be available to users.
- Lack of overall context. Many of the coloring books were just pages of images. That is great for many users, but I would like to see a short introduction page that explains the context. Where did these images come from? Why are they interesting in the scope of your collections? Where can I learn more? This is an easy chance to communicate with patrons and invite them into our collections–which is the point of #ColorOurCollections.
- Links to paid databases. A few coloring books had reference links to paid databases. I found this a bit insulting and against the spirit of #ColorOurCollections. One of the most amazing aspects of digital collections is the ability to democratically open up the public domain to the PUBLIC. We are able to take fragile materials traditionally hidden away in a locked basement and give them out freely to the world! It is disappointing to see objects in the public domain digitized and then LOCKED back up in a proprietary, paid database. It’s even more disappointing to see those overpriced rip-offs promoted in a library coloring book.
- Grey backgrounds. Sorry, but this is a coloring book! Who wants to color on grey paper? Who wants to waste printer ink printing a grey page background? Some images are just more appropriate for a coloring book than others. You cannot just desaturate a digitized image and call it a coloring book. Digitized pages have a color, and that page background needs to be removed to make a quality coloring book page. Generally, most coloring book images should be fully binarized, i.e. only pure black and white. Using GIMP you could desaturate (Colors > Desaturate) or convert to grayscale (Image > Mode > Grayscale) the image, then use a Threshold (Colors > Threshold) to eliminate the “color” of the page background. The coloring book image should be reduced to clean black lines and white background. ScanTailor is a great tool that can do this pre-processing for you for many coloring book appropriate images. Play around with the output options in Black & White mode, tweaking “Thickness” and “Despeckling” until you get a good result.
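To see why the 2 MB target above is realistic, here is a back-of-envelope calculation in Python. The compression ratio at the end is a rough rule of thumb for clean line art, not a measured figure:

```python
# Back-of-envelope: raw size of one letter-size page at 300 DPI, 1 bit per pixel
width_px = int(8.5 * 300)   # 2550 pixels across
height_px = 11 * 300        # 3300 pixels down
raw_bits = width_px * height_px
raw_mb = raw_bits / 8 / 1024 / 1024
print(f"Uncompressed 1-bit page: {raw_mb:.2f} MB")  # about 1.0 MB

# Fax-style (CCITT G4) or PNG compression on clean black-and-white line art
# typically shrinks this by an order of magnitude or more, so a 15-page
# binarized coloring book fits comfortably under 2 MB.
```

If your PDF comes out at 25 MB, the images inside it are almost certainly still greyscale or color scans rather than binarized line art.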
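The Threshold step is simple enough to sketch without an image library. This toy example uses a 3×3 list of greyscale values standing in for a scanned page; the cutoff value 128 is just a starting point, where GIMP lets you pick the cutoff interactively:

```python
def binarize(gray, threshold=128):
    """Map greyscale pixels (0-255) to pure black (0) or white (255),
    the same operation as GIMP's Colors > Threshold."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]

# A toy 3x3 "scan": dark ink (30-40) on a light grey page background (200-225)
page = [[ 30, 200, 210],
        [220,  40, 215],
        [205, 225,  35]]
print(binarize(page))
# [[0, 255, 255], [255, 0, 255], [255, 255, 0]]
```

Every grey background pixel lands above the cutoff and becomes pure white, which is exactly what removes the “color” of the paper from the coloring page.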
Why no Digital Aladore coloring book?
I was thinking about putting together an Aladore coloring book, but I found the images had too many shades of grayscale hatching to reduce nicely to clean lines. Processing ends up with too many black blobs and too little detail. The images just don’t work as coloring pages! Here is a one-page PDF attempt just to show you what I mean:
Anyway, #ColorOurCollections was good fun, and I am looking forward to it next year!
But Wait! you say… what about that whole thing with Henry’s passionate affair with aristocrat/illustrator Alice?? Ah yes, we need to deal with the illustrations!
Don’t worry, I didn’t forget.
Before the OCR processing routine, I moved all the illustration pages to their own directory. The illustrations are identical in the 1914 and 1915 editions. However, the 1914 pages are heavily yellowed, making for a dimmer, less crisp scan image. I decided to use the 1915 images from the Internet Archive which are brighter with a whiter background. Generally when editing raw scans I adjust the levels and run unsharp mask to improve the image quality. However, these images have been heavily processed already and are in the lossy JPG format, so further tinkering only seemed to make things worse.
Since these are page images, it is simplest to start editing with our old friend ScanTailor. This time, however, I used more of the advanced features to fix up the illustrations. First, I used Deskew to generally straighten the images. I ran the automatic Select Content, which found all the illustrations perfectly. On Margins, I deselected “Match size with other pages” and set a 1 mm margin on each side, since I want only the illustration with no extra page area. On Output, I lowered the Resolution to 300 DPI, since any higher is overkill. I changed the Mode to Color/Grayscale, since ScanTailor’s black and white processing is too extreme for images.
Finally, I used ScanTailor’s Dewarping feature. As I have mentioned a few times, none of the illustrations appear square. This is partially because physical books are not flat and photographic lenses introduce some distortion. Straight lines on the resulting page image will not appear parallel/square. This can affect the cosmetic look of your page images, but it can also cause OCR errors. Thus, ScanTailor offers the Dewarping operation to quickly minimize the distortion.
On the central pane, click on the Dewarping tab. A grid appears over the page image. Drag the edges of the grid to match the lines that should be square in your image. ScanTailor takes your distorted grid and reprojects the image as if it was square, fixing the apparent perspective issues.
Run Output and ScanTailor gives us a set of 15 nicely processed TIF images ranging in size from 4 to 7 MB.
We are making important editorial decisions throughout this process. For example, by throwing out blank pages and close cropping the illustrations, I have discarded some of the information and format of the original book. By deciding to dewarp, I have introduced my subjective interpretation of “straightness”. These decisions will shape how the illustrations look and relate to the text in the new edition.
Finally, since Digital Aladore is focused on a reading text (not maximum image fidelity) we need to think about the EPUB. I do not want to unnecessarily bloat the ebook size. The ScanTailor output comes in at 80 MB, way too big for an ebook. Instead, the quality and resolution should be appropriate to a small black and white screen. EPUB2 and most ereaders do not support TIF images, so we will need to convert the images to a more acceptable format and size.
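As a rough sizing exercise, here is the arithmetic for matching a scan to an ereader screen. The 4×6 inch illustration size and the 600×800 screen resolution are assumptions for illustration, not measurements from the Aladore scans:

```python
# A 600 DPI scan of a ~4x6 inch illustration is far larger than a
# typical 6-inch e-ink display can even show.
scan_dpi = 600
illo_inches = (4.0, 6.0)               # assumed physical size of an illustration
scan_px = (int(illo_inches[0] * scan_dpi), int(illo_inches[1] * scan_dpi))

screen_px = (600, 800)                 # a common 6-inch e-ink resolution
scale = min(screen_px[0] / scan_px[0], screen_px[1] / scan_px[1])
target_px = (round(scan_px[0] * scale), round(scan_px[1] * scale))
print(scan_px, "->", target_px)        # (2400, 3600) -> (533, 800)
```

Downscaling to roughly the screen resolution, plus converting TIF to a supported format like PNG or JPEG, is what brings that 80 MB of output down to something an EPUB can reasonably carry.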
Okay, here we go! Are you excited?
But wait, “post-processing”? Yeah, I guess it’s all in your point of view: I am calling it pre-processing because it happens before I feed the images to Tesseract-OCR; they call it post-processing because you do it after scanning, to make the images more usable. In any case, ScanTailor is basically a batch image editing platform optimized for efficiently processing unedited page scans into lovely readable pages for PDFs/DjVu, etc. You input any type of image file, and you output black-and-white TIFF page images. Many of the GUI OCR applications have similar processing built in; even Tesseract via the command line does some image pre-processing. ABBYY goes a step further and does the final PDF creation as well.
It is limited (niche?), but what ScanTailor does, it does well and efficiently. And it’s free. It runs on Windows and Linux; I use it with Ubuntu 14.
A full user guide can be found here: https://github.com/scantailor/scantailor/wiki/User-Guide
And here’s a nice video tutorial to get you started: http://vimeo.com/12524529
So what am I trying to do with ScanTailor?
As I described in an earlier post, every Aladore page image has a header and footer that will be annoying, ugly, and generally undesirable in our final EPUB edition. Furthermore, the header and footer tend to cause OCR errors, and those errors are hard to eliminate from the text output. None of the GUI OCR options managed to consistently select only the text block (ABBYY can identify headers, but it wasn’t 100% and didn’t separate the page numbers). The exact location of the text block wanders around from page to page, so we cannot pre-set a selection area for groups of images. We could manually select the text block of each page via the interface of YAGF or OCRFeeder, but I found both too cumbersome to efficiently work through hundreds of pages. So as an alternative, I decided to crop each page down to the main text block before carrying out OCR.
ScanTailor can do this cropping efficiently and has the added bonus of excellent image processing to prepare the pages for OCR.
Here is my workflow:
1) Set up batches of images. I decided to work on about 60 pages at a time to make things more manageable: a large enough batch to be efficient, but not so large that it takes a huge amount of time to work through. I divided Aladore into six sections (each in its own directory), being careful not to break up any chapters.
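The mechanical part of this batching is easy to sketch in Python. The file names are fake, and note this naive version splits purely by count: keeping chapters together, as I did, still takes a manual adjustment of the boundaries:

```python
def batch_pages(pages, batch_size=60):
    """Split an ordered list of page file names into batches of ~batch_size."""
    return [pages[i:i + batch_size] for i in range(0, len(pages), batch_size)]

# Demo with hypothetical page names instead of real scans
pages = [f"aladore_{n:04d}.jpg" for n in range(1, 341)]  # 340 pages
batches = batch_pages(pages)
print([len(b) for b in batches])  # [60, 60, 60, 60, 60, 40]
```

Each batch would then be copied into its own directory and opened as a separate ScanTailor project.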
2) Start a new ScanTailor project. i.e. Start the application and load one of the Aladore directories.
3) Fix DPI. Basically, you can estimate the DPI by dividing the pixel dimensions of the digital image by the physical dimensions of the area it was imaging (i.e. ~the original printed page size). In his video tutorial, Joseph Artsimovich also gives a method of estimating based on measuring the pixel height of six lines of text in the image. I estimated my Aladore images to be between 300-400 DPI, but it’s not essential to get it exactly right. I decided to just use 300×300.
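The pixel-dimensions method is just a division; here it is as a tiny Python helper. The numbers in the demo are hypothetical, not measurements from the actual Aladore scans:

```python
def estimate_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Estimate scan DPI from the image's pixel dimensions and the
    physical size (in inches) of the printed area it captured."""
    return (pixel_width / page_width_in, pixel_height / page_height_in)

# Hypothetical example: a 1650x2550 page image of a 5 x 7.5 inch printed page
print(estimate_dpi(1650, 2550, 5.0, 7.5))  # (330.0, 340.0)
```

The horizontal and vertical estimates rarely agree exactly, which is another reason not to agonize over the value; anything in the right ballpark works for ScanTailor.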
4) Select Content. Since the Aladore images are already processed, we can skip the Fix Orientation, Split Pages, and Deskew steps. For good results, the selection cannot be fully automated. First, I do two manual selections to get the batch started: one for all the left-hand and one for all the right-hand pages of the book, since the text block sits in a different position on each side. I go to an average page and resize the selection box to be below the header, above the page number, and almost the full page width horizontally. Then I click “Apply to” and choose “every other page.” Then I repeat on the next page to set the selection for the other half of the book.
This will give us a pretty good selection, but now I go through the tedious step of checking every single page. Since blotches and marks on the paper often cause random OCR errors, it is best to get the selection box close to the text: ScanTailor will white out everything outside the box, eliminating many of the flaws. I always tweak the first and last page of each chapter, since the text block is smaller there. Beyond that, about 1 in 6 pages in Aladore needed just a little nudge up or down as I checked through. This doesn’t take as long as it sounds: less time than it took to write this post…
5) Margins. This step is mainly cosmetic to ensure consistent output. The area added by the Margins will be blank white. First, we can add “hard margins” around each selection, like you would with a Word document, to ensure that the text is not all the way at the edge of the page. Second, using “soft margins” we can decide if we want all the pages to be the same size and where the text selection is positioned in the resulting page. By default ScanTailor will add white space to make all pages the size of the largest in the batch. Having a consistent page size is actually helpful for our purposes, because the OCR will detect the blank space at the beginning and end of chapters. The defaults are fine for my purposes!
6) Output!! This stage allows you to set the parameters for image processing. You can set the DPI, Mode (B&W, color), adjust the filtering, add Dewarping, and adjust the noise reduction (Despeckling). The defaults worked well for Aladore, at 600 DPI (2x the input DPI).
Click the play button and ScanTailor starts the actual image processing, saving the output to a new directory. The computer works for a minute, and then we have a batch of black and white TIFFs ready for OCR!