Working with PDF?

As I outlined earlier, Aladore was only scanned twice (a 1914 and a 1915 edition), but there are many different access derivatives available online, with mysteriously different qualities.

Internet Archive, Hathi, and Scholars Portal Books each provide a “read online” feature, but if you want to download your own copy, it will be a PDF of course.

PDFs are kind of strange files.  Standardized and portable, yet so much variation in how they behave depending on the elements hidden inside.

Here is a comparison of the available PDFs to get a sense of the range:

Aladore 1914:

  • Internet Archive, 418 pages, 10.92 MB, produced by LuraDocument, OCR
  • Scholars Portal Books, 418 pages, 57.29 MB, produced by LuraDocument, OCR

LuraDocument is compression and OCR software used for processing individual page images into a PDF.  These two files are identical, except that IA has applied far more compression to its PDF.  Because of that, the IA PDF is very slow to turn pages (render) and is very ugly.  It is terrible to read even on a laptop, much less an e-reader.  Keeping file sizes small was a bigger concern in the early 2000s; nowadays, IA should be putting up better quality.
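To put a rough number on that difference, here is a small Python sketch that uses only the page counts and file sizes listed above. The variable and function names are my own, and treating 1 MB as 1024 KB is an assumption about how the sizes were reported.

```python
# Page counts and file sizes (MB) for the 1914 scan, as listed above.
pdfs_1914 = {
    "Internet Archive": {"pages": 418, "size_mb": 10.92},
    "Scholars Portal Books": {"pages": 418, "size_mb": 57.29},
}


def kb_per_page(info):
    """Average size per page in KB, assuming 1 MB = 1024 KB."""
    return info["size_mb"] * 1024 / info["pages"]


for source, info in pdfs_1914.items():
    print(f"{source}: about {kb_per_page(info):.0f} KB per page")

# How much larger the less-compressed file is.
ratio = (
    pdfs_1914["Scholars Portal Books"]["size_mb"]
    / pdfs_1914["Internet Archive"]["size_mb"]
)
print(f"Scholars Portal file is about {ratio:.1f} times larger")
```

By this estimate the IA file averages roughly 27 KB per page against about 140 KB for Scholars Portal, a bit more than a fivefold difference for the same page images.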

Aladore 1915:

  • Internet Archive, 416 pages, 13.57 MB, produced by LuraDocument, OCR
  • Hathi Trust, 417 pages, 122 MB, produced by image server / pdf::api3, no OCR

Again, the IA version is mangled by compression.  They don’t always do this!  For example, check out Blackwood’s Magazine, Vol. 195 (January to June 1914), where Aladore was first published.  I already mentioned how much I dislike the Hathi version, due to the way they embedded the page image in a standard 8×10 background and added a watermark.  The “image server / pdf::api3” producer refers to a Perl module for creating PDFs.  It is doing the equivalent of “print to PDF” on the fly from the content server’s JPEG images.

You might notice that the page numbers don’t quite line up.  The Hathi version has one extra page because they add a digital object title page to the front of the book.  However, that still means that the 1914 scan has 418 pages, and the 1915 scan has 416.  When scanning books, skipping a pair of pages is the most likely mistake–so our 1915 scan may have skipped two pages… or the printing omitted two pages somewhere?  Or they were torn out by some angry reader in 1923??  The front and back matter look identical, so I will have to take a closer look.
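The page arithmetic is simple, but spelling it out keeps the counts straight. This is just a sketch using the numbers from the lists above; the variable names are mine.

```python
# Page counts as reported for each PDF above.
ia_1914 = 418
ia_1915 = 416
hathi_1915 = 417

# Hathi prepends one digital object title page to the book.
hathi_title_pages = 1
hathi_1915_scanned = hathi_1915 - hathi_title_pages

# After removing the added title page, the two 1915 PDFs agree.
print("1915 PDFs agree:", hathi_1915_scanned == ia_1915)

# What is left is the gap between the 1914 and 1915 scans.
gap = ia_1914 - ia_1915
print("Pages unaccounted for in the 1915 scan:", gap)
```

The gap is an even number of pages, which is at least consistent with a skipped pair, though it does not say which explanation is the right one.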

If you really want to learn about the PDF format, here are three resources:

The IDR Solutions blog has a large series of articles about the technical and practical aspects of PDFs, “Understanding the PDF File Format: Overview.”

iText RUPS (Reading and Updating PDF Syntax) is a Java program that allows you to browse the internal file structure of a PDF–it’s a maze, but very interesting.  It can be helpful for debugging PDF files, but the editing features are not functional yet.
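If you just want a quick peek without installing anything, a few facts are often visible in the raw bytes. The following is a rough Python sketch, not a real PDF parser: the function name `peek_pdf` and the sample bytes are made up for illustration, and it assumes the relevant objects are stored uncompressed with a plain parenthesized Producer string. Many PDFs pack their objects into compressed streams or encode the string differently, in which case this finds nothing.

```python
import re


def peek_pdf(data: bytes) -> dict:
    """Pull a few plainly visible facts out of raw PDF bytes (best effort)."""
    # The file should start with a header such as %PDF-1.4.
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # Producer as a simple literal string; escaped parentheses are not handled.
    producer = re.search(rb"/Producer\s*\(([^)]*)\)", data)
    # Count page objects; the lookahead stops /Pages (the page tree) matching.
    pages = len(re.findall(rb"/Type\s*/Page(?![a-zA-Z])", data))
    return {
        "version": header.group(1).decode("ascii") if header else None,
        "producer": producer.group(1).decode("latin-1") if producer else None,
        "page_objects": pages,
    }


# A tiny hand-written fragment standing in for a real file (hypothetical).
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Pages /Count 2 /Kids [2 0 R 3 0 R] >>\nendobj\n"
    b"2 0 obj\n<< /Type /Page >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page >>\nendobj\n"
    b"4 0 obj\n<< /Producer (Example Producer) >>\nendobj\n"
)
print(peek_pdf(sample))
```

For a real file you would read the bytes with `open(path, "rb").read()` and pass them in. For anything beyond a quick look, a proper tool like RUPS is the better choice.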

Once you peek inside the secretive contents of the PDF, just flip through your own copy of the handy Adobe PDF Reference to understand what you are seeing. It’s only 756 pages long!

The PDF standard was first developed by Adobe, but is now maintained by ISO, so you can buy the official copy of ISO 32000-1:2008 for a mere 198 Swiss francs.  Or get the complete documentation for free from Adobe.

Not that I would ever check out this junk!

Anyway, after looking at these PDFs I realized it would be much better to NOT work with them… and I figured out a workaround!  More on that soon.

