Before we leave the “Witnesses” section of the project, I think we should take a look at digitization. We are dealing with digital images of old print books, so the process of creating them has a major impact on the transmission of our witnesses.
A typical book digitization workflow goes something like this:
- Get a book (selected by some intelligent administrative system for deciding what to digitize, hopefully? If you survey what actually gets digitized in the world, you will realize it is unfortunately not often guided by a very careful plan. Much digitization is still purely ad hoc, done off the side of the desk…)
- Scan the book to create digital images
- Process/edit the digital images
- Prepare access/display images (and hopefully archival master images for storage)
Here is a bit more detail on the process. For scanning books there are basically two options:
- Non-destructive: use a book scanner (such as the ATIZ BookDrive). These specialized scanners have a cradle that holds the book open in a natural angled position (rather than flat). They usually have two cameras on a frame so that one is pointed at each page of the opened book. To scan, you turn the page and pull down a glass platen to hold the book in place. The cameras quickly take a shot of each page. Lift the platen, turn the page, and repeat. It takes a lot of tedious human work, but the results are good (assuming the operator keeps their fingers out of the image; check out the Artful Accidents of Google Books for some problem examples). Usually this is done by digitization labs at big universities or organizations such as Internet Archive. However, there is also a vibrant DIY community. Check out DIY Book Scanner to get started: http://www.diybookscanner.org
- Destructive: if the particular copy of the book you have is no longer valuable, it can be disbound to make scanning faster and easier. A specialized device literally cuts the binding off the book. With the pages loose, whole stacks can be automatically fed into a document scanner. Feed them into something like these Fujitsu Production Scanners and the whole book will be scanned in a couple of minutes. Here is a blog post from University of British Columbia about destructive scanning: http://digitize.library.ubc.ca/digitizers-blog/the-discoverer-and-other-book-destruction-techniques
Scanning a book results in a huge group of image files. If you are using a camera-based book scanner, these are usually in a RAW format from the camera’s manufacturer. Other types of scanners will usually save the page images as TIFFs. These unedited scan images usually need to be cropped to size and enhanced. The readability of the image can usually be improved by adjusting the levels and applying an unsharp mask. The edited files will usually be saved as TIFFs, since TIFF is the most widely accepted archival format.
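To make those two enhancement steps a bit more concrete, here is a minimal sketch in plain Python, working on a tiny grayscale "image" stored as a list of rows. It is not what any particular scanning suite does: the function names are my own, and the blur is a simple 3x3 box blur where real tools normally use a Gaussian blur. In practice you would use an image editor or an imaging library rather than code like this.

```python
def adjust_levels(pixels, black, white):
    """Stretch grayscale values so `black` maps to 0 and `white` maps to 255.

    Values outside the [black, white] range are clipped.
    """
    if white <= black:
        raise ValueError("white point must be greater than black point")
    scale = 255.0 / (white - black)
    return [
        [min(255, max(0, round((v - black) * scale))) for v in row]
        for row in pixels
    ]


def box_blur(pixels):
    """3x3 box blur; pixels at the border average only the neighbours that exist."""
    height, width = len(pixels), len(pixels[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                pixels[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


def unsharp_mask(pixels, amount=1.0):
    """Sharpen by adding back the difference between the image and its blur."""
    blurred = box_blur(pixels)
    return [
        [min(255, max(0, round(v + amount * (v - b)))) for v, b in zip(row, blur_row)]
        for row, blur_row in zip(pixels, blurred)
    ]


# A faded scan: "ink" at 100 and "paper" at 200 instead of near 0 and 255.
faded = [[100, 100, 200, 200]]
print(adjust_levels(faded, black=100, white=200))
print(unsharp_mask(faded))
```

The levels step spreads the faded tones across the full range, and the unsharp mask exaggerates the jump between ink and paper at the edge of a letter, which is why the two together tend to make text easier to read.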
Now that you have all these beautiful, readable TIFFs, you need to make them available to users, that is, create access copies and provide metadata. The edited TIFFs are usually converted into a smaller display version for serving to the public online. For example, the book viewers on Hathi and Internet Archive use JPEGs. You can check this by right-clicking on an individual page and viewing or saving the image (the viewer on Scholars Portal Books also uses JPEGs, but has embedded them in a more complex page structure, so you can’t just right-click on the image). This step is often automated by a digital asset management system. Other sites only provide a download as a PDF. PDF versions are usually constructed using Adobe Acrobat or ABBYY FineReader (many scanning software suites also have this functionality), combining the individual page images into a single file. PDF creation usually compresses the images to create a much smaller file size. OCR is completed during this processing as well.
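As a rough illustration of the "smaller display version" step, here is a sketch of how the pixel dimensions of an access copy might be worked out from an archival master. The function name and the example sizes are hypothetical, not taken from any of the sites mentioned above; the only idea shown is shrinking the longer side to a limit while keeping the page's proportions.

```python
def access_size(width, height, max_side):
    """Return (width, height) scaled so the longer side is at most `max_side`.

    The aspect ratio is preserved, and images already small enough are
    left at their original size rather than enlarged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# Hypothetical master scan of 3000 x 4500 pixels, display copy capped at 1500.
print(access_size(3000, 4500, 1500))
```

A digital asset management system would typically do this resizing and the JPEG conversion in one automated pass over the whole folder of TIFFs.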
The Aladore PDFs from Internet Archive and Scholars Portal Books (U of Toronto) were created using LuraDocument. This system is similar to ABBYY FineReader: it takes the group of individual page images, runs OCR, compresses the images, and creates a PDF. OCR used this way associates the text with its actual location on the page image, which makes the PDF searchable. (For my project, I don’t want to create a PDF and I don’t care about the location of the text on the page. I just want the text, so my workflow will be a little different.)
If you compare the two PDFs, as I have ranted a bunch of times now, you can see that IA’s version is horrible! The pages load very slowly and the images look bad. Nonetheless, they started with exactly the same images as the U of T version. They just used much higher compression. This creates a smaller file size (IA 11 MB vs. UoT 57 MB), but the image quality is much worse and rendering time is increased.
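To put the two file sizes in perspective, here is a quick back-of-the-envelope calculation. The only inputs are the 11 MB and 57 MB figures quoted above; the function name is just for illustration.

```python
def size_comparison(small_mb, large_mb):
    """Return (how many times larger the big file is, fraction of size saved)."""
    ratio = large_mb / small_mb
    savings = 1 - small_mb / large_mb
    return ratio, savings


ratio, savings = size_comparison(small_mb=11, large_mb=57)
print(f"The UoT file is {ratio:.1f}x the size of the IA file.")
print(f"The IA compression saves about {savings:.0%} of the file size.")
```

So the IA version is roughly a fifth of the size, which is a big saving in storage and bandwidth, but it is paid for in image quality and rendering time.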
So, I guess the point is: like the ancient world of scribes hand copying scrolls, textual transmission is still an active human process. It is full of human decisions and human mistakes. The text is not an abstract object, but something that passes through our hands, through the hands of a bunch of scanner operators anyway…