The aim of the next stage of the project, Capturing Text as I decided to call it, is to test some OCR platforms to create a usable digital text from the digitized page images of Aladore.
I have been talking a lot about the witnesses available online, and I was frustrated with the PDF versions. PDF is just not a very good format to work with and the quality was too variable. Although OCR is possible on PDF files, most of the programs are really set up to use individual page images. To get the best possible digital text, I need the best quality images I can get.
No one provides individual page images per se… but actually they do! Most online reader platforms are really just a container for serving up JPG pages. For example, check out the Internet Archive reader display of Aladore 1915, https://archive.org/stream/aladorehen00newbrich#page/n7/mode/2up
Right click on one of the pages and choose “view image”. You find a JPG! You can get better quality files by zooming in first and then viewing the image. If you study the resulting URL you will be able to figure out the pattern for naming the individual page images and qualities.
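Once you have spotted the pattern, the list of page URLs can be generated rather than typed out. Here is a minimal Python sketch of that idea. The template below is entirely hypothetical (example.org, `book_id`, and `scale` are placeholders I made up for illustration), so swap in whatever pattern you actually see in the URL after choosing “view image”.

```python
def page_image_urls(template, first_page, last_page, **fields):
    """Fill a URL template once per page number and return the list of URLs."""
    return [
        template.format(page=n, **fields)
        for n in range(first_page, last_page + 1)
    ]


# Hypothetical pattern, NOT a real reader URL: replace it with the one
# you find by inspecting the image URL in your browser.
TEMPLATE = "https://example.org/reader/{book_id}/page_{page:04d}.jpg?scale={scale}"

# Build URLs for pages 1 to 3 of a made-up book at a made-up quality setting.
urls = page_image_urls(TEMPLATE, 1, 3, book_id="some-book", scale=1)
for url in urls:
    print(url)
```

The printed list can then be handed to a download manager, which saves pasting each address by hand.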
I use the free and open download manager DownThemAll!, which is an extension for Firefox: http://www.downthemall.net
This allowed me to efficiently download a full set of images for the 1914 (from Internet Archive) and 1915 (from HathiTrust) editions. While this is not the way IA and Hathi intended us to use the images stuck in the online readers, flipping through every page of the book in the reader would request exactly the same files.
So, Awesome! Digital witnesses ready to go!