Aladore Regex

In earlier posts, I mentioned using advanced features of find & replace to do some automated editing of the HTML tags in the text.  Let me give you another example to show how useful some clever find & replace can be!

Before uploading the draft Aladore EPUB, I wanted to quickly edit the format of the chapter titles.  At this point the chapter headings looked like this:

<p>CHAPTER IV.</p>

<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>

They look exactly the same as normal body paragraphs.  Of course, in the print version this is not the case.

Aladore 1914, page 20.

To better match the look of the print book, I decided the headings should be centered and tagged h2.  This will more clearly set them off from the body text, something like this:

<h2 style="text-align: center;">CHAPTER IV.</h2>

<h2 style="text-align: center;">HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

However, I also wanted to quickly generate a table of contents.  If I tagged the headings as shown above, the automatically generated TOC would have two separate entries for each chapter.  A quick and dirty alternative is to put a <br /> between the two parts of the chapter title, making them a single h2 unit that will appear correctly in the TOC.  This solution looks like:

<h2 style="text-align: center;">CHAPTER IV.<br />HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</h2>

So, how can we automatically find the 58 chapter headings and get them tagged correctly without doing it by hand?  Luckily, Sigil supports Find & Replace with Regular Expressions, or Regex (make sure you have Regex chosen for the Mode in the find window).  With a few handy expressions and some logical thinking, we can sort this out in no time.  If you want to learn about and practice Regex, check out RegExr.

The main things we need to work with for this application are the lookahead and lookbehind expressions.  In this case it is very easy to test the accuracy of an expression: click Count All and it should report exactly 58 items.  Any other number means the expression is either missing some chapter headings or catching something that is not a chapter heading.
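If you want to try the same sanity check outside Sigil, here is a minimal Python sketch.  The sample text and the count_all helper are invented for illustration (the real book has 58 chapters, the sample has two), and the helper only approximates what Count All reports.

```python
import re

# Invented sample text, not the real book file (which has 58 chapters).
sample = (
    "<p>CHAPTER I.</p>\n\n"
    "<p>AN INVENTED SUBTITLE FOR THIS SKETCH.</p>\n\n"
    "<p>Some body text.</p>\n\n"
    "<p>CHAPTER II.</p>\n\n"
    "<p>ANOTHER INVENTED SUBTITLE FOR THIS SKETCH.</p>\n\n"
    "<p>More body text.</p>\n"
)


def count_all(pattern, text):
    """Hypothetical helper: count the non-overlapping matches of pattern."""
    return sum(1 for _ in re.finditer(pattern, text))


expected = 2  # would be 58 for the real book
found = count_all(r"<p>CHAPTER", sample)

# If the numbers differ, the pattern catches too much or too little.
print("pattern looks right" if found == expected else "pattern needs work")
```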

Sigil find & replace.

First, I need to replace the <p> in front of the title “CHAPTER…” with <h2 style="text-align: center;">.  To do this, use a lookahead, (?=ABC).  This means we are using “CHAPTER” in our search, but will not select it for the Replace function.  The Find looks like this:

<p>(?=CHAPTER)

It will find and select only the <p> tags that appear directly before “CHAPTER”, without selecting CHAPTER itself.  Awesome!
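To see the lookahead at work outside Sigil, here is a small Python sketch of this first replacement.  It assumes Python's re module treats this pattern the same way Sigil does, and the body paragraph in the sample is made up.

```python
import re

# One heading from the book plus an invented body paragraph.
html = (
    "<p>CHAPTER IV.</p>\n\n"
    "<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>\n\n"
    "<p>Body text here.</p>"
)

# The lookahead (?=CHAPTER) checks what follows <p> without consuming it,
# so only the <p> tag itself is replaced.
step_one = re.sub(r"<p>(?=CHAPTER)", '<h2 style="text-align: center;">', html)

print(step_one)
```

Only the first tag changes; the subtitle and body paragraphs keep their <p> tags because neither is followed by CHAPTER.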

Next, I want to find the p tags between the two sections of the chapter heading and replace them with <br />.  Because the roman numerals that follow CHAPTER vary in length, we can not use a lookbehind here: a lookbehind has to match a string of fixed length.  Instead we need to figure out a regex that will find only the chapter subtitle (for example, HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.) to use in another lookahead.  Every chapter subtitle is in all uppercase and at least 20 characters long, AND there is no other string in the book that has those qualities.  We can use these facts to create our find string: [A-Z,\s,']{20,}.  This expression means find any string that includes ONLY the uppercase letters A through Z, commas, \s white spaces, or ' (needed since some titles have possessives) AND that is 20 characters or more in length.  Note that inside square brackets a comma is a literal character, not a separator, so the class matches commas too, and listing the comma once would be enough.  I didn’t do any serious analysis to decide on the number 20.  I just looked at a few of the subtitles, counted the number of characters, and tested a few numbers by entering the expression and clicking Count All.  If I got 58 results, I knew I was on the right track.  The expression will find only our chapter subtitles, so we can use it in a lookahead to select the two p tags between the chapter number and the subtitle.  The find looks like this:

</p>

<p>(?=[A-Z,\s,']{20,})

It will select only the two p tags, which I replace with a break.
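Here is the same second step as a hedged Python sketch.  One assumption to flag: I wrote \s* between the two tags to stand in for the line breaks in the Find box, which may not be exactly how Sigil treats them.  The body paragraph is invented.

```python
import re

# Output of the first replacement, plus an invented body paragraph.
after_step_one = (
    '<h2 style="text-align: center;">CHAPTER IV.</p>\n\n'
    "<p>HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>\n\n"
    "<p>Body text here.</p>"
)

# </p>, optional whitespace, <p>, but only when 20 or more characters from
# the class (uppercase letters, commas, whitespace, apostrophe) come next.
pattern = r"</p>\s*<p>(?=[A-Z,\s,']{20,})"

step_two = re.sub(pattern, "<br />", after_step_one)

print(step_two)
```

The tags in front of the body paragraph are left alone, since its first lowercase letter ends the run of allowed characters long before 20.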

Finally, we need to replace the </p> after the chapter subtitle with a </h2>.  This time we need to use a lookbehind (?<=ABC), because there is no consistent string after the subtitle that we can use in searching.  In a lookbehind, we can not use a string of open length as I did above.  However, the same qualities of the subtitle will give us enough information to create a distinctive search that will exclude everything NOT a subtitle.  In this case, a string of exactly three characters, each an uppercase letter, a comma, or a period, [A-Z,.]{3}, in front of a </p> will give us the 58 results we want.  This is because no other paragraph in the book ends with a word in uppercase letters.  The lookbehind Find expression looks like this:

(?<=[A-Z,.]{3})</p>

It selects only the </p> tag, which I replace with </h2> to close off the heading.  Now, we have all the chapter titles tagged correctly and looking… well, sort of beautiful.  When we get into more polishing, we will use similar expressions to add the style tags needed for CSS.
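As a last hedged sketch, here is the lookbehind step in Python, starting from the output of the second replacement with an invented body paragraph.  It also shows that Python's re module, like the engine in Sigil as described above, refuses a lookbehind of open length; the exact error wording is Python's and may differ from what Sigil reports.

```python
import re

# Output of the second replacement, plus an invented body paragraph.
after_step_two = (
    '<h2 style="text-align: center;">CHAPTER IV.<br />'
    "HOW YWAIN CAME TO AN HERMITAGE IN A WOOD.</p>\n\n"
    "<p>Body text here.</p>"
)

# Fixed-width lookbehind: exactly three characters from the class must sit
# directly in front of </p>, and only the tag itself is replaced.
closed = re.sub(r"(?<=[A-Z,.]{3})</p>", "</h2>", after_step_two)
print(closed)

# An open-length lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=[A-Z,\s,']{20,})</p>")
    open_length_allowed = True
except re.error:
    open_length_allowed = False

print(open_length_allowed)
```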

Isn’t it amazing that we have gone from textual transmission to Regex?  Ha, ha, I love Digital Aladore!
