But Wait! you say… what about that whole thing with Henry’s passionate affair with aristocrat/illustrator Alice?? Ah yes, we need to deal with the illustrations!
Don’t worry, I didn’t forget.
Before the OCR processing routine, I moved all the illustration pages to their own directory. The illustrations are identical in the 1914 and 1915 editions. However, the 1914 pages are heavily yellowed, making for a dimmer, less crisp scan image. I decided to use the 1915 images from the Internet Archive, which are brighter with a whiter background. Generally, when editing raw scans, I adjust the levels and run unsharp mask to improve the image quality. However, these images have been heavily processed already and are in the lossy JPG format, so further tinkering only seemed to make things worse.
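For raw scans that have not already been processed, the levels-and-sharpen routine mentioned above might look something like this in Python. This is only a rough sketch, assuming the Pillow library is installed. The helper names `levels_lut` and `enhance_scan`, the black and white points (30 and 225), and the unsharp mask settings are all made up for illustration, not values I used on these pages.

```python
def levels_lut(black=30, white=225):
    """Build a 256-entry lookup table that stretches [black, white] to [0, 255]."""
    if not 0 <= black < white <= 255:
        raise ValueError("need 0 <= black < white <= 255")
    span = white - black
    lut = []
    for v in range(256):
        scaled = round((v - black) * 255 / span)
        lut.append(max(0, min(255, scaled)))  # clamp values outside the range
    return lut


def enhance_scan(src_path, dst_path, black=30, white=225):
    """Apply a levels stretch and an unsharp mask to a grayscale copy of a scan."""
    from PIL import Image, ImageFilter  # assumption: Pillow is available

    with Image.open(src_path) as img:
        gray = img.convert("L")
        leveled = gray.point(levels_lut(black, white))
        # Illustrative settings only; tune by eye for each scan set.
        sharpened = leveled.filter(
            ImageFilter.UnsharpMask(radius=2, percent=120, threshold=3)
        )
        sharpened.save(dst_path)
```

Anything darker than the black point becomes pure black and anything lighter than the white point becomes pure white, which is what clears up a yellowed background. On already-processed JPGs like the 1915 set, the same stretch tends to exaggerate compression artifacts.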
Since these are page images, it is simplest to start editing with our old friend ScanTailor. This time, however, I used more of the advanced features to fix up the illustrations. First, I used Deskew to generally straighten the images. I ran the automatic Select Content, which found all the illustrations perfectly. On Margins, I deselected “Match size with other pages” and set a 1 mm margin on each side, since I want only the illustration with no extra page area. On Output, I lowered the Resolution to 300 DPI, since any higher is overkill. I changed the Mode to Color/Grayscale, since ScanTailor’s black and white processing is too extreme for images.
Finally, I used ScanTailor’s Dewarping feature. As I have mentioned a few times, none of the illustrations appear square. This is partially because physical books are not flat and photographic lenses introduce some distortion. Lines that are straight and parallel on the physical page will not appear parallel or square in the resulting page image. This can affect the cosmetic look of your page images, and can also cause OCR errors. Thus, ScanTailor offers the Dewarping operation to quickly minimize the distortion.
On the central pane, click on the Dewarping tab. A grid appears over the page image. Drag the edges of the grid to match the lines that should be square in your image. ScanTailor takes your distorted grid and reprojects the image as if it were square, fixing the apparent perspective issues.
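To get a feel for what "reprojecting" means, here is a simplified sketch in Python. It is not ScanTailor's actual algorithm (which handles curved page surfaces, not just four corners). It only assumes a four-cornered grid and uses plain bilinear interpolation, and the function names `grid_point` and `source_coords` are hypothetical. For every pixel in a square output image, it works out where in the distorted source image that pixel should be sampled from.

```python
def grid_point(corners, u, v):
    """Map (u, v) in the unit square to a point inside the dragged grid.

    corners: (top_left, top_right, bottom_right, bottom_left) as (x, y) pairs.
    """
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = corners
    # Interpolate along the top and bottom edges, then between the two.
    top_x = x0 + (x1 - x0) * u
    top_y = y0 + (y1 - y0) * u
    bot_x = x3 + (x2 - x3) * u
    bot_y = y3 + (y2 - y3) * u
    return (top_x + (bot_x - top_x) * v, top_y + (bot_y - top_y) * v)


def source_coords(corners, out_w, out_h):
    """For each output pixel, find the matching position in the source image."""
    coords = []
    for row in range(out_h):
        v = row / (out_h - 1) if out_h > 1 else 0.0
        line = []
        for col in range(out_w):
            u = col / (out_w - 1) if out_w > 1 else 0.0
            line.append(grid_point(corners, u, v))
        coords.append(line)
    return coords
```

The corners of the output land exactly on the corners of the grid you dragged, and everything in between is stretched to fit. That stretching is where the subjective judgment comes in: the result is only as "straight" as the grid you drew.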
Run Output and ScanTailor gives us a set of 15 nicely processed TIF images ranging in size from 4 to 7 MB.
We are making important editorial decisions throughout this process. For example, by throwing out blank pages and closely cropping the illustrations, I have discarded some of the information and format of the original book. By deciding to dewarp, I have introduced my subjective interpretation of “straightness”. These decisions will shape how the illustrations look and relate to the text in the new edition.
Finally, since Digital Aladore is focused on a reading text (not maximum image fidelity), we need to think about the EPUB. I do not want to unnecessarily bloat the ebook size. The ScanTailor output comes in at 80 MB, way too big for an ebook. Instead, the quality and resolution should be appropriate to a small black and white screen. EPUB2 and most ereaders do not support TIF images, so we will need to convert the images to a more acceptable format and size.
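As a rough idea of what that conversion could look like, here is a minimal Python sketch, again assuming the Pillow library is installed. The 600 x 800 pixel target (a guess at a typical small e-ink screen), the JPEG quality of 80, and the helper names `fit_within` and `tif_to_jpeg` are all assumptions for illustration, not settled choices for the final edition.

```python
def fit_within(width, height, max_width=600, max_height=800):
    """Return (w, h) scaled down to fit the box, keeping aspect ratio.

    Images already smaller than the box are left at their original size.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


def tif_to_jpeg(src_path, dst_path, max_width=600, max_height=800, quality=80):
    """Convert one TIF page image to a small grayscale JPEG."""
    from PIL import Image  # assumption: Pillow is available

    with Image.open(src_path) as img:
        gray = img.convert("L")  # grayscale suits a black and white screen
        size = fit_within(gray.width, gray.height, max_width, max_height)
        small = gray.resize(size, Image.LANCZOS)
        small.save(dst_path, "JPEG", quality=quality, optimize=True)
```

For example, a 2400 x 3600 pixel illustration would be reduced to 533 x 800 pixels, which should bring each file down from several megabytes to a small fraction of that.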