Tagged: pdf

Reflecting on ColorOurCollections

My new favorite national holiday is #ColorOurCollections week, Feb 1-5, 2016.

The idea originated at the New York Academy of Medicine (NYAM) to merge the adult coloring book craze with digital collections. Amazing libraries around the world joined in, sharing some fun and beautiful coloring books. Open Culture listed some of the big names, such as Bodleian, Smithsonian, DPLA, and Europeana. However, it was great to see so many less famous libraries creating awesome coloring books highlighting their fascinating collections, and having some fun!

Nonetheless, #ColorOurCollections also highlighted some less-than-best practices in the creation and distribution of digital files. Many of these libraries are on the forefront of digital preservation and/or user experience, but put out PDFs that violated all the rules… It's a bit disappointing since it's not that hard to get a few little details right, and libraries should be leading by example.

A lot of libraries did a great job, but here are some little things that bugged me about many coloring book offerings:

  • Random file names. If you are creating a PDF for public distribution don’t name it “coloringbook1.pdf” or “color-our-collections.pdf” or “jim_file3.pdf”. The file name is one of the few bits of metadata that can be easily understood without even opening the file. Make your file name descriptive and meaningful, providing the basic metadata (creator, title, and date) in a place where everyone can see it. Example: “ThisLibrary_CoolColoringBook_2016.pdf”
  • Huge file sizes. A coloring book is a mostly black and white document designed to be printed on approximately letter-size paper. A reasonable file size for something around 15 pages is under 2 MB, and could be smaller. I saw many PDFs in the range of 25 MB, and some over 65 MB! Larger file sizes will not give better quality: this is for public web distribution, not a professional press run printing glossy coffee table books. Think about your users and your web servers. People have to DOWNLOAD the PDF! Please make it a reasonable size.
  • No embedded metadata. When creating a PDF you should always check the embedded metadata. If you are exporting a PDF from LibreOffice Writer or MS Word, it will have embedded metadata automatically created based on your profile. Unfortunately, many people don’t realize that and have never checked their profile. Many of the coloring books thus have metadata like “Title: Microsoft Word - coloringbook_draft3_fromJim.docx” and an Author that is the profile name of whoever first created the file. This metadata will be displayed when users import the PDF into an ebook management tool, such as Calibre. Furthermore, this information is not helpful for future users trying to understand where the file came from and what it is, and could be a bit embarrassing depending on what the automatically generated information contains. I suggest you carefully edit the metadata before exporting the final version of your PDF. It should contain a meaningful title, a subject such as “#ColorOurCollections 2016”, an author/creator that relates to the institution, and a URL to find more information.
  • Lack of image metadata. If you send out a document highlighting some fascinating treasures of your collections, there had better be a clear means for users to find out more information! Every image used in the coloring book needs metadata directly on the page where it appears. Each page does not need the full archival description, but please give enough information for the users to find the item in your online collections. A title, identifier, and URL are nice. I think these references need to be given on each coloring page, not in a separate reference and index page. Online resources and printed coloring book pages are quickly disassociated from their original context, so don’t expect that the information given on an introduction or TOC page will be available to users.
  • Lack of overall context. Many of the coloring books were just pages of images. That is great for many users, but I would like to see a short introduction page that explains the context. Where did these images come from? Why are they interesting in the scope of your collections? Where can I learn more? This is an easy chance to communicate with patrons and invite them into our collections–which is the point of #ColorOurCollections.
  • Links to paid databases. A few coloring books had reference links to paid databases. I found this a bit insulting and against the spirit of #ColorOurCollections. One of the most amazing aspects of digital collections is the ability to democratically open up the public domain to the PUBLIC. We are able to take fragile materials traditionally hidden away in a locked basement, and give them out freely to the world! It is disappointing to see objects in the public domain digitized and then LOCKED back up in a proprietary, paid database. It's even more disappointing to see those overpriced rip-offs promoted in a library coloring book.
  • Grey backgrounds. Sorry, but this is a coloring book! Who wants to color on grey paper? Who wants to waste printer ink printing a grey page background? Some images are just more appropriate for a coloring book than others. You cannot just desaturate a digitized image and call it a coloring book. Digitized pages have a color, and that page background needs to be removed to make a quality coloring book page. Generally, most coloring book images should be fully binarized, i.e. only pure black and white. Using GIMP you could desaturate (Colors > Desaturate) or greyscale (Image > Mode > Grayscale) the image, then use a Threshold (Colors > Threshold) to eliminate the “color” of the page background. The coloring book image should be reduced to clean black lines and white background. ScanTailor is a great tool that can do this pre-processing for you for many coloring book appropriate images. Play around with the output options in Black & White mode, tweaking “Thickness” and “Despeckling” until you get a good result.
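
To make the threshold idea from the last bullet concrete, here is a minimal sketch in plain Python. It assumes a tiny greyscale image stored as nested lists of 0-255 values; the function name `binarize` and the cutoff of 128 are hypothetical choices for illustration, not settings taken from GIMP or ScanTailor.

```python
def binarize(pixels, threshold=128):
    """Map every greyscale value to pure black (0) or pure white (255)."""
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


# Hypothetical 2x4 scan: light grey "paper" (200-220) with dark ink (30-60).
scan = [
    [210, 40, 205, 220],
    [200, 215, 60, 30],
]

clean = binarize(scan)
for row in clean:
    print(row)
```

Everything lighter than the cutoff (the grey page background) becomes white, and everything darker (the ink) becomes black. On real scans the right cutoff depends on the paper tone, which is why tools like GIMP let you drag the threshold while previewing.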

Why no Digital Aladore coloring book?

ScanTailor b&w processing

I was thinking about putting together an Aladore coloring book, but I found the images had too many shades of greyscale hatching to reduce nicely to clean lines. Processing ends up with too many black blobs and too little detail. The images just don’t work as a coloring page! Here is a one page PDF attempt just to show you what I mean:

DigitalAladore_YwainColoringPage_2016

Anyway, #ColorOurCollections was good fun, and I am looking forward to it next year!

 

PDF Reading

I messed around a lot with PDFs during this project, but then decided to avoid using them during processing.  However, I thought I should write a post about PDFs since we all use them every day.

Many people don’t realize there are alternatives to Adobe Reader.  Issues with Adobe data breaches or privacy invasions of the new Adobe Digital Editions 4.0 might make you consider switching to something else… PDFreaders.org provides a list of Free Software PDF readers so you can replace the slightly creepy, proprietary Adobe with something that respects your freedom.

In the past I used Foxit Reader (freeware).  I really liked the interface and features, but Foxit is also proprietary and has some issues with privacy and unwanted software bundled in installs and updates.  I was bothered by an update that suddenly opted me into their cloud service with no explanation.  They have recently contributed to the open source community, as code for the Foxit rendering engine is the basis for Google’s PDFium project.  Nonetheless, I quit Foxit…

On Windows, I now use Sumatra PDF as my everyday reader (which is not listed at PDFreaders.org, but is open source).  You can download the application from the main site: http://blog.kowalczyk.info/software/sumatrapdf/free-pdf-reader.html, or look at the code (GPLv3) here: https://code.google.com/p/sumatrapdf.

Reading Aladore on Sumatra.

Sumatra is very simple and stripped down.  Unlike Adobe Reader, it starts up very quickly and doesn’t run any background helpers, definitely no bloat.  It is also more flexible: it can read PDF, ePub, Mobi, XPS, DjVu, CHM, CBZ, and CBR files, so it is a great all-around reader.  On rare occasions the PDF rendering is slower than Adobe or Foxit, usually when the page images are highly compressed in the PDF container.  Sumatra provides a message saying “page is rendering”, which is an improvement over Adobe where you can scroll through a complex document and find mysteriously empty pages that may later appear.  Similarly, Sumatra never mysteriously freezes while sorting complex rendering out; it just honestly tells you it is working on it!  I appreciate full user feedback!

On Linux, almost everyone already has some open source reader.  I use Evince as a basic all-around reader, https://wiki.gnome.org/Apps/Evince.  If you like it, Evince is easy to install on Windows as well, but it honestly runs better on Linux.

Here are some other handy utilities I found useful for manipulating PDFs while working on Digital Aladore:

  • Did you know you can open any PDF with LibreOffice Draw?  This enables you to annotate and edit the PDF and save it in different formats.  VERY HANDY!
  • PDF Shaper (freeware from GloryLogic) is an easy to use set of PDF manipulating tools.  It offers features such as splitting, combining, extracting images, and conversions.  Slick and smooth interface.
  • K2pdfopt is an open source tool focused on reformatting PDFs for ereaders or mobile devices.  Very useful for converting multi-column texts into simpler PDFs.
  • jPDF Tweak is a GNU open source project that is the self-proclaimed “Swiss Army Knife for PDF files.”  The learning curve is steep, but it can do an amazing number of tweaks!

What applications do you use for PDF reading, editing, and creation?

Scanning Error Discovery!

I mentioned earlier that the scanned version of Aladore 1914 had 418 pages while the 1915 edition had 416 pages.  There weren’t any obvious differences between the front and back matter, so it meant somewhere in there the 1915 scan was missing two pages… scary!

To try to discover the difference, it would be easiest to have the two PDFs side-by-side or overlaid to visually compare.  A side-by-side view is a feature of many paid full version PDF tools, such as Adobe Acrobat or Nitro Pro.  This still requires turning pages on each PDF separately, which can get annoying if the problem is a few hundred pages in!  I am not aware of any free PDF reader that has this view.  A paid option (with a free trial) that specializes in this task is DiffPDF.  However, as usual, I want to steer towards as open as possible instead.

An open source alternative is (the quite similarly named) diff-pdf, http://vslavik.github.io/diff-pdf.  It is not very polished, but is pretty neat once you figure it out.  Basically, download the package and unzip.  The program is run from the command line.  To make things simple, I put the two Aladore PDFs I wanted to compare into the diff-pdf directory.  I then ran the command:

[file path]\diff-pdf.exe --view aladore_1914_IA.pdf aladore_1915_IA.pdf

This analyses the listed files and starts up a GUI visualization of the two PDFs overlaid.  It looks like this:

Aladore title page in diff-pdf

Luckily, the discrepancy between the two scanned editions was easy to find–it was immediately obvious in diff-pdf when the editions got out of sync.  It turns out the 1915 scan is missing pages viii and ix, in the table of contents.  This is a typical scanning error: basically the pages probably stuck together, and the operator didn’t notice when turning the page.  Luckily it was in the table of contents and not the text itself.
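
The "out of sync" idea can also be expressed as a tiny sketch in Python. This is not what diff-pdf does internally (it compares the rendered pages visually); it just illustrates finding the first position where two page sequences stop matching. The page labels below are hypothetical stand-ins, and `first_divergence` is a name made up for this example.

```python
from itertools import zip_longest


def first_divergence(pages_a, pages_b):
    """Return the index of the first position where the sequences differ, or None."""
    for index, (a, b) in enumerate(zip_longest(pages_a, pages_b)):
        if a != b:
            return index
    return None


# Hypothetical page labels around the table of contents.
scan_1914 = ["vi", "vii", "viii", "ix", "x", "xi"]
scan_1915 = ["vi", "vii", "x", "xi"]  # viii and ix are missing

index = first_divergence(scan_1914, scan_1915)
print(index, scan_1914[index], scan_1915[index])
```

Once the first mismatch is known, you only need to look at a handful of pages around it rather than paging through the whole book.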

Nice to have that mystery solved!

Digitization

Before we leave the “Witnesses” section of the project, I think we should take a look at digitization.  We are dealing with digital images of old print books, so the process of creating them has a major impact on the transmission of our witnesses.

A typical book digitization workflow goes something like this:

  • Get a book (some intelligent administrative system for deciding what to digitize, hopefully?  If you survey what actually gets digitized in the world, you will realize it is unfortunately not often guided by a very careful plan.  Much digitization is still purely ad hoc off the side of the desk…)
  • Scan the book to create digital images
  • Process/edit the digital images
  • Prepare access/display images (and hopefully archival master images for storage)

Here is a bit more detail on the process.  For scanning books there are basically two options:

  • Non-destructive: use a book scanner (such as ATIZ BookDrive). These specialized scanners have a cradle that holds the book open in a natural angled position (rather than flat).  They usually have two cameras on a frame so that one is pointed at each page of the opened book.  To scan, you turn the page and pull down a glass platen to hold the book in place.  The cameras quickly take a shot of each page.  Lift the platen, turn the page… etc. etc. etc.  It takes a lot of tedious human work, but the results are good (assuming the operator keeps their fingers out of the image, check out the Artful Accidents of Google Books for some problem examples).  Usually this is done by digitization labs at big universities or organizations such as Internet Archive.  However, there is also a vibrant DIY community.  Check out DIY Book Scanner to get started, http://www.diybookscanner.org!
  • Destructive: If the particular copy of the book you have is no longer valuable, it can be disbound to make scanning faster and easier.  A specialized device literally cuts the binding off the book.  With the pages loose, whole stacks can be automatically fed into a document scanner.  Feed it into something like these Fujitsu Production Scanners and the whole book will be scanned in a couple of minutes.  Here is a blog post from University of British Columbia about destructive scanning: http://digitize.library.ubc.ca/digitizers-blog/the-discoverer-and-other-book-destruction-techniques

Scanning a book results in a huge group of image files.  If you are using a camera based book scanner these are usually in a RAW format from the camera’s manufacturer.  Other types of scanners will usually save the page images as TIFFs.  These unedited scan images usually need to be cropped to size and enhanced.  The readability of the image can usually be improved by adjusting the levels and running unsharp mask.  The edited files will usually be saved as TIFFs, since it is the most accepted archival format.

Now that you have all these beautiful, readable TIFFs, you need to make them available to users, that is, create access copies and provide metadata.  The edited TIFFs are usually converted into a smaller sized display version for serving to the public online.  For example, the book viewers on Hathi and Internet Archive use JPEGs.  You can check this by right clicking on an individual page and viewing or saving the image (the viewer on Scholars Portal Books also uses JPEGs, but has embedded them in a more complex page structure, so you can’t just right click on the image). This step is often automated by a digital asset management system.  Other sites only provide a download as a PDF.  PDF versions are usually constructed using Adobe Acrobat or ABBYY FineReader (many scanning software suites also have this functionality), combining the individual page images into a single file.  PDF creation usually compresses the images to create a much smaller file size.  OCR is completed during this processing as well.

The Aladore PDFs from Internet Archive and Scholars Portal Books (U of Toronto) were created using LuraDocument.  This is a system similar to ABBYY FineReader, which takes the group of individual page images, runs OCR, compresses the images, and creates a PDF.  OCR used this way allows the text to be associated with its actual location on the page image, thus making the PDF searchable. (For my project, I don’t want to create a PDF and I don’t care about the location of the text on the page. I just want the text, so my workflow will be a little different.)

If you compare the two PDFs, as I have ranted a bunch of times now, you can see that IA’s version is horrible!  The pages load very slowly and images look bad.  Nonetheless, they started with exactly the same images as the U of T version.  They just used much higher compression.  This creates a smaller file size (IA 11 MB vs. UoT 57 MB), but the image quality is much worse and rendering time is increased.

So, I guess the point is: like the ancient world of scribes hand copying scrolls, textual transmission is still an active human process.  It is full of human decisions and human mistakes.  The text is not an abstract object, but something that passes through our hands, through the hands of a bunch of scanner operators anyway…

Working with PDF?

As I outlined earlier, Aladore was only scanned twice (a 1914 and a 1915 edition), but there are many different access derivatives available online with mysteriously different qualities.

Internet Archive, Hathi, and Scholars Portal Books each provide a “read online” feature, but if you want to download your own copy, it will be a PDF of course.

PDFs are kind of strange files.  Standardized and portable, yet so much variation in how they behave depending on the elements hidden inside.

Here is a comparison of the available PDFs to get a sense of the range:

Aladore 1914:

  • Internet Archive, 418 pages, 10.92 MB, produced by LuraDocument, OCR
  • Scholars Portal Books, 418 pages, 57.29 MB, produced by LuraDocument, OCR

LuraDocument is compression and OCR software used for processing individual page images into a PDF.  These two files are identical, except that IA has done a ton more compression on their PDF.  Because of it, the IA PDF is very slow to turn pages (render) and is very ugly.  It is terrible to read even on a laptop, much less an ereader.  Making smaller file sizes was a bigger concern in the early 2000s. Nowadays, IA should be putting up better quality.
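
To put the difference in perspective, here is a quick back-of-the-envelope calculation in Python using the page count and file sizes listed above. The only assumption added is treating 1 MB as 1024 KB; the per-page numbers are averages and say nothing about how individual pages were compressed.

```python
PAGES = 418
sizes_mb = {"Internet Archive": 10.92, "Scholars Portal Books": 57.29}

# Average size of one page in each PDF (assuming 1 MB = 1024 KB).
for source, size in sizes_mb.items():
    kb_per_page = size * 1024 / PAGES
    print(f"{source}: about {kb_per_page:.0f} KB per page")

ratio = sizes_mb["Scholars Portal Books"] / sizes_mb["Internet Archive"]
print(f"Size ratio: about {ratio:.1f}x")
```

So the two files hold the same 418 page images, but the IA copy squeezes each page into roughly a fifth of the space.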

Aladore 1915:

  • Internet Archive, 416 pages, 13.57 MB, produced by LuraDocument, OCR
  • Hathi Trust, 417 pages, 122 MB, produced by image server / pdf::api3, no OCR

Again, the IA version is mangled by compression.  They don’t always do this!  For example, check out the Blackwood’s Magazine, Vol. 195 (January to June 1914) where Aladore was first published.  I already mentioned how much I dislike the Hathi version, due to the way they embed the page image in a standard 8×10 background and add a watermark.  The “image server / pdf::api3” is a Perl module for creating PDFs.  It is doing the equivalent of “print to PDF” on the fly from the content server’s JPEG images.

You might notice that the page numbers don’t quite line up.  The Hathi version has one extra page because they add a digital object title page to the front of the book.  However, that still means that the 1914 scan has 418 pages, and the 1915 scan has 416.  When scanning books, skipping a pair of pages is the most likely mistake–so our 1915 scan may have skipped two pages… or the printing omitted two pages somewhere?  or they were torn out by some angry reader in 1923?? The front and back matter look identical, so I will have to take a closer look.
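
The page-count reasoning above is simple enough to write down as a few lines of Python, using only the numbers from the lists in this post (the variable names are just for illustration):

```python
hathi_1915_pages = 417
hathi_added_title_pages = 1  # the digital object title page Hathi adds to the front
ia_1914_pages = 418
ia_1915_pages = 416

# Remove Hathi's extra page to get the number of pages actually scanned.
scanned_1915 = hathi_1915_pages - hathi_added_title_pages
print(scanned_1915 == ia_1915_pages)

missing = ia_1914_pages - scanned_1915
print(f"The 1915 scan is {missing} pages shorter than the 1914 scan")
```

Both 1915 PDFs agree once the added title page is discounted, so the two-page gap belongs to the scan itself rather than to one site's processing.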

If you really want to learn about the PDF format, here are three resources:

IDR Solutions blog has a large series of articles about the technical and practical aspects of PDFs, “Understanding the PDF File Format: Overview,” https://blog.idrsolutions.com/2013/01/understanding-the-pdf-file-format-overview/#pdf-images

iText RUPS (Reading and Updating PDF Syntax) is a Java program that allows you to browse the internal file structure of a PDF. It's a maze, but very interesting.  It can be helpful for “de-bugging” PDF files, but the editing features are not functional yet.  http://sourceforge.net/projects/itextrups

Once you peek inside the secretive contents of the PDF, just flip through your own copy of the handy Adobe PDF Reference to understand what you are seeing, it's only 756 pages long!  http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf

The PDF standard was first developed by Adobe, but is now maintained by ISO, so you can buy the official copy of ISO 32000-1:2008 for a mere 198 Swiss francs.  Or get the complete documentation for free from Adobe, http://www.adobe.com/devnet/pdf/pdf_reference.html

Not that I would ever check out this junk!

Anyway, after looking at these PDFs I realized it would be much better to NOT work with them… and I figured out a workaround!  More on that soon.

Digitized Aladore

Finally, let's get some digital witnesses to work with!

There are basically two scanned books available online in many different versions with different processing.

Aladore, 1914 Title page

First, is a copy of the 1914 Blackwood standard edition scanned at University of Toronto in 2006.  Several different versions of the scanned book are available.  U of T previously hosted PDFs on their own library website.  You can still get their two versions from the legacy links, although the catalog no longer points to them:

U of T 1914 edition: http://scans.library.utoronto.ca/pdf/1/5/aladoren00newbuoft/aladoren00newbuoft.pdf

U of T 1914 edition, processed to black and white: http://scans.library.utoronto.ca/pdf/1/5/aladoren00newbuoft/aladoren00newbuoft_bw.pdf

The U of T library now points to Scholars Portal Books, hosted by the Ontario Council of University Libraries.  The listing is here: http://books1.scholarsportal.info/viewdoc.html?id=75462

These PDFs are identical.  They are large (57.2 MB) and well made.

Internet Archive also hosts a derivative of this scan.  However, the PDF is much lower quality (10.9 MB) and has slow performance due to the odd layered post-processing.  IA also provides automatically generated alternative formats, such as EPUB, but the accuracy of the transcription is horrible. On the upside, IA provides much more metadata than any of the other sites.  https://archive.org/details/aladoren00newbuoft

 

Aladore, 1915 Title page

Second, is a copy of the 1915 Dutton edition scanned at University of California Libraries in 2006.  This copy exists in two versions online.

University of California Libraries point to the record on Hathi Trust Digital Library.  Using Hathi can be annoying because they limit downloading many of their items, despite the fact that they are in the public domain.  They have a sort of paywall that requires logging in from a partner institution to access the full site.  The second annoyance is that Hathi adds a huge border around the PDF pages that has a reference to the source file, plus a watermark over the bottom of the page.  In this case the watermark says “Digitized by Internet Archive, Original from University of California.”  Internet Archive does NOT include this watermark on their copy!  Unfortunately, in providing this format, Hathi does not seem to consider READING and readers.  It also seems like a pathetic possessiveness over public domain materials.  Color and black+white PDFs are available from the Hathi catalog listing, and are of high quality (119 MB): http://catalog.hathitrust.org/Record/006155073

Internet Archive also hosts a version of this scan, but again, their post-processing creates a lower quality (13.5 MB) and less useable PDF: https://archive.org/details/aladorehen00newbrich

Looking at the metadata provided by Internet Archive is really fascinating.  The digitization of both editions was sponsored by Microsoft.  One was shot using a Canon EOS 5D (at 400 ppi), the other a 1Ds (at 500 ppi), almost certainly using an ATIZ BookDrive.  The 1915 was shot November 7th 2006, and the 1914 was shot ten days later.  The operators were “scanner-melissa-cunningham” and “scanner-katie-lawson.”

Even with all this random metadata, we do not know much about the digitization project or the post-processing.  To me it is strange that we do not better document the process, to understand the intentions behind how these objects were created.  It is interesting that the library catalogs do not represent ANY of the digitization metadata.  The catalogs only refer to the original object and seem completely uninterested in the digital one or how it came into existence.

New location for files

Okay, I decided to find a good home for the files I am creating… The best I can come up with is the Internet Archive (archive.org) Community Text Collection.  If you are not familiar with the collections at the Internet Archive, it is definitely worth exploring.  They are vast and fascinating, including everything from scans of historical books to live music bootlegs uploaded by fans (with permission from the bands).  There is even a huge collection of broadcast TV and movies.  The Wayback Machine provides access to their collection of archived websites–amazing and useful.

So here is the new page to download versions of the “Two Stories by Edgar Allan Poe relating to Blackwood’s Magazine” that I was using to test out some ideas:

https://archive.org/details/PoeBlackwoodArticle

Update:  I noticed one odd/funny thing that Archive.org has done to my files.  I first uploaded my handmade EPUB.  Later I uploaded the PDF I created based on the EPUB.  Archive.org assumes that the PDF is the original document, so it then automatically generated alternative formats based on the PDF. Thus there is now a second EPUB, one that was generated by OCRing from my PDF!  Very strange.  I recommend downloading only my original EPUB, which has a file name “Poe_Blackwood_Article.epub” and is 118.5 K.

PDF Poe

I just suddenly thought, even though this blog is all about creating epubs for ereaders, it's actually a bit annoying to have an epub if you don’t have an ereader.  So, I quickly converted the test Poe Two Blackwoods Stories epub into a PDF.

Converting EPUB into PDF is actually a bit harder than it should be.  EPUBs are basically a zip file with the text contained in XHTML files.  Like any web page there is no exact page layout encoded.  So to create a PDF, the text must be laid out on a standard page format.  One method is to create a CSS file that will add proper layout and formatting to your html, but it takes some tweaking to get it working correctly, and you still need a method to create a PDF from this layout.  Ideally, as a book editor you would simply copy the text into a word processor/publishing program and do the layout by hand to ensure all the page breaks and formatting make sense.

However, I don’t want to spend a bunch of time re-editing this file!

The quickest automated conversion method is to use Calibre’s (free, open-source ebook manager, http://calibre-ebook.com) built-in conversion tool.  However, you have to fiddle around with the settings a lot to get good results, because the converter is aimed at making a custom PDF designed for a specific screen size (i.e. your particular ereader or tablet), not a standard print PDF.

I used a different, slightly silly workaround:

Using 7-zip (free, open-source archive manager, http://www.7-zip.org) I extracted all the files from the epub (7-zip automatically recognizes epubs as zips).  The actual text will appear inside a directory as XHTML files–in this case, there is one for each “chapter” of the book.  I opened each with Firefox, and printed it as a PDF.  However, I changed the print settings to make the font bigger and add some margins to the page.  This crudely converts the free flowing epub text into a page layout.  Then I took the PDF printout of each chapter and merged them into a single PDF.  Finally, I edited the PDF metadata, just to try to be thorough about everything.
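
If you would rather script the first step than click through 7-zip, the same "an EPUB is just a zip" idea can be sketched with Python's standard `zipfile` module. This is only an illustration: the helper name `list_xhtml` is made up, and the in-memory stand-in EPUB with its `OEBPS/chapter1.xhtml` style file names is hypothetical, not the layout of my Poe file.

```python
import io
import zipfile


def list_xhtml(epub_file):
    """Return the names of the XHTML/HTML files inside an EPUB (a zip archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.lower().endswith((".xhtml", ".html", ".htm"))
        )


# Build a tiny stand-in EPUB in memory; these file names are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/chapter1.xhtml", "<html><body><p>One</p></body></html>")
    archive.writestr("OEBPS/chapter2.xhtml", "<html><body><p>Two</p></body></html>")
    archive.writestr("OEBPS/style.css", "p { margin: 1em; }")

print(list_xhtml(buffer))
```

With a real file you would pass its path instead of the buffer, and `zipfile.ZipFile(path).extractall(target_dir)` would unpack everything, just like the 7-zip step. Printing each chapter to PDF and merging them would still be separate steps.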

This creates a very basic PDF, basically recreating the look you would have on an ereader.

Since PDF is an accepted format at wordpress.com, I can share it right here:

Poe_two_blackwoods_stories