Tagged: editing

Aladore Regex

In earlier posts, I mentioned using advanced features of find & replace to do some automated editing of the HTML tags in the text.  Let me give you another example to show how useful some clever find & replace can be!

Before uploading the draft Aladore EPUB, I wanted to quickly edit the format of the chapter titles.  At this point the chapter headings looked like this:

<p>CHAPTER IV.</p>

<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>

They look exactly the same as normal body paragraphs.  Of course, in the print version this is not the case.

Aladore 1914, page 20.

Aladore 1914, page 20.

To better match the look of the print book, I decided the headings should be centered and tagged h2.  This will more clearly set them off from the body text, something like this:

<h2 style="text-align: center;">CHAPTER IV.</h2>

<h2 style="text-align: center;">HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

However, I also wanted to quickly generate a table of contents.  If I tagged the heading as shown above, the automatically generated TOC would have two separate entries for each chapter.  So a quick and dirty alternative is to put a <br /> between the two parts of the chapter title, making them a single h2 unit that will appear correctly in the TOC.  This solution looks like:

<h2 style="text-align: center;">CHAPTER IV.<br />HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

So, how can we automatically find the 58 chapter headings and get them tagged correctly without doing it by hand?  Luckily, Sigil supports Find & Replace with Regular Expressions (i.e. Regex, make sure you have regex chosen for the Mode in the find window). With a few handy expressions and some logical thinking, we can sort this out in no time.  If you want to learn about and practice Regex, check out RegExr.

The main things we need to work with for this application are the lookahead and lookbehind expressions.  In this case it is very easy to test the accuracy of the expression–click Count All and it should be exactly 58 items, otherwise you are not catching only/all the chapter headings.

Sigil find & replace.

Sigil find & replace.

First, I need to replace the <p> in front of the title “CHAPTER…” with <h2 style=”text-align: center;”>.  To do this, use a lookahead, (?=ABC).  This means we are using “CHAPTER” in our search, but will not select it for the Replace function.  The Find looks like this:

<p>(?=CHAPTER)

It will find and select only the <p> that appear before “CHAPTER”, but it does not select CHAPTER.  Awesome!

Next, I want to find the p tags between the two sections of the chapter heading and replace them with <br />.  Since the roman numerals that follow CHAPTER make it a variable string length, for technical reasons we can not use a lookbehind.  Instead we need to figure out a regex that will find only the chapter subtitle (for example, HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.) to use in another lookahead.  Every chapter subtitle is in all uppercase and at least 20 characters–AND there is no other string in the book that has those qualities.  We can use these facts to create our find string: [A-Z,\s,’]{20,}.  This expression means find any string that includes ONLY the uppercase letters A through Z, \s white spaces, or ‘ (needs to be included since some titles have possessives) AND that is 20 characters or more in length.  I didn’t do any serious analysis to decide on the number 20–I just looked a few of the subtitles, counted the number of characters, and tested a few numbers by entering the expression and clicking Count All.  If I got 58 results, I knew I was on the right track.  The expression will find only our chapter subtitles, so we can use it in a lookahead to select the two p tags between the chapter number and the subtitle.  The find looks like this:

</p>

<p>(?=[A-Z,\s,']{20,})

It will select only the two p tags, which I replace with a break.

Finally, we need to replace the </p> after the chapter subtitle with a <h2>.  This time we need to use a lookbehind (?<=ABC), because there is no consistent string after the subtitle that we can use in searching.  In a lookbehind, we can not use a string of open length as I did above.  However, the same qualities of the subtitle will give us enough information to create a distinctive search that will exclude everything NOT a subtitle.  In this case, a string of three characters, including only uppercase letters and a period [A-Z,.]{3} in front of a </p> will give us the 58 results we want.  This is because no other paragraph in the book ends with a word in uppercase letters.  The lookbehind Find expression looks like this:

(?<=[A-Z,.]{3})</p>

It selects only the </p> tag, which I replace with </h2> to close off the heading.  Now, we have all the chapter titles tagged correctly and looking… well, sort of beautiful.  When we get into more polishing, we will use similar expressions to add the style tags needed for CSS.

Isn’t it amazing that we have gone from textual transmission to Regex?  Ha, ha, I love Digital Aladore!

Advertisements

Using Juxta

So the collation of the 1914 and 1915 Aladore texts is pretty neat–but it is also useful.  I don’t think it is revealing any interesting differences between the two printed editions, but it is surfacing many simple errors in the OCR that I haven’t spotted by other means.  The 1914 and 1915 print editions seem to be identical except for a few page breaks.  I do not need to unravel issues with the transmission or make any complex editorial decisions based on textual scholarship.  Instead, the comparison is part of a process to get the new OCR witnesses to better match the digital image witnesses.  The OCR of each edition has slightly different errors which are highlighted by the collation.  Thus, by combining the information of two imperfect witnesses (all witnesses are imperfect!), we can create a new best text that is more accurate than its parents.

Here is an example:

OCR errors in Juxta.

OCR errors in Juxta.

The 1915 has “world’s. four roads” and the 1914 “world’s {our roads”, so both have errors!  Looking at the page images “world’s four roads” is obviously the correct reading, but the 1915 edition has a tiny blemish at the end of a line which OCRed as a period, and the “f” of four in 1914 edition is slightly faint contributing to its missed identification.  These are exactly the sort of OCR errors that are hard to detect in any other way since they do not show up when spell checking and are not visually obvious.

In case you want to play along, I shared the comparison on Juxta Commons: http://juxtacommons.org/shares/qWLf2p

However, Juxta Commons is not ideal for actually fixing the errors, since it does not allow you to edit the source texts.  You will have to open the base text in an external text editor to fix issues as you look through the collation on Juxta.  Instead, I prefer to use the desktop version which does allow you to edit on the fly–handy!  One other difference is that the desktop application displays the raw HTML rather than the rendered text shown on the Commons.  This can be distracting if your tags are too ugly but I like seeing them since I want to ensure both the text content and tags are correct.

Here is my workflow for the process:

1) Download and install the desktop version of Juxta from http://www.juxtasoftware.org/download

2) Open Juxta, and click File > New Comparison Set.

Juxta menu bar and icons.

Juxta menu bar and icons.

3) Click the Plus icon to add your witnesses (i.e. Digital Aladore 1914 and 1915).

4) Click the Refresh icon to collate the witnesses.  This brings up a dialog box with options for the collation.  Since I want to detect ALL differences, I uncheck all the boxes.  Processing might take a few seconds.

collate options.

collate options.

You will now have a workspace something like this:

Aladore collated on Juxta.

Aladore collated on Juxta.

Clicking on a witness listed on the “Comparison Set” pane (left) changes the base text.  The selected base text will appear in the lower right “Source” pane.  The upper right pane can be switched between “Collation view” or “Comparison view” (i.e. Heatmap or Side-by-side in Juxta Commons terminology, as described in the previous post) using the tabs on the bottom of the pane.  Clicking on a word in Collation pane will move the source text and highlight the selection in the Source pane.

5) Choose one of the source texts to edit (I used Aladore 1914).  Then, click “Edit” in the lower right corner of the Source pane.  This allows you to edit the text, but nothing is saved until you click “Update.”

6) Now, I scroll through the text on the Collation pane (I prefer to use the “Comparison view”) and decide which errors need to be edited in the base text.  I click on the highlighted errors in the Collation pane (if using Comparison view, make sure you click on the side representing your base text or it will switch the Source pane).  This brings up the corresponding spot in the Source pane to edit.  If the correct reading is not obvious, I quickly check the original page image (I have the directory open with thumbnails and a preview window so they are easy to reference).

7) When finished working through the entire collation, click “Update” on the lower right corner of the Source pane and save the edited text with a new name.

8) Click on the new text in the Comparison Set pane.  Then, click File > Export Source Document.  This saves a text file of the new witness.

Done!

This little activity caught a lot more errors than I expected!  In particular, the 1914 text had “b” instead of “h” in many places.  There was also many misplaced periods in random locations.  The interface is a little cumbersome for editing in this way, but I definitely think it is a useful tool.