Edit HTML with Bluefish

We enter the next stage of Digital Aladore with our ugly raw HTML in hand… and we want to make it into nice reflowable text to convert into an ebook.  Welcome to 5. Editing Text!

What is the most efficient way to rip through those ugly tags and fix up our text?

I am pretty sure it's Bluefish Editor, http://bluefish.openoffice.nl

Bluefish is a powerful Free software (GNU GPLv3) text editor that supports web development and programming.  It is important to note that it is not a WYSIWYG editor: there is no graphical preview or editing.  For Digital Aladore, I am interested in Bluefish’s advanced find & replace features, which allow you to use regular expressions and carry out operations across any size batch of files.  Exciting!

Opening Aladore 1914 with Bluefish.


First, I open the batch of six HTML files output by YAGF OCR of Aladore 1914.  I inspect the HTML tags to make sure I understand what is going on.  Since I will add formatting later during the epub stage, I want to strip away almost everything, but in a controlled manner!

First to go is the style tag in the headers:

<style type="text/css">
p, li { white-space: pre-wrap; }

Advanced Find & Replace.


We don’t need it, so out it goes with Advanced Find & Replace!  In the screen shot above you can see that Bluefish lets us select an entire directory (and even set the number of recursive levels below it) to Find & Replace on at once.  I love this!  But, this first operation is pretty tame… only six items gone, one for each HTML file header.
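For anyone who would rather script this step than click through the dialog, here is a rough Python sketch of the same idea: one regex replacement applied to every HTML file under a directory, with a limit on how many levels of subdirectories to descend. The function name, the `.html` suffix filter, and the depth convention are my own assumptions for illustration, not anything Bluefish exposes.

```python
import re
from pathlib import Path


def replace_in_tree(root, pattern, replacement, max_depth=0, suffix=".html"):
    """Apply a regex replacement to every matching file under root.

    max_depth=0 touches only files directly inside root; each extra level
    allows one more layer of subdirectories (a hypothetical convention).
    Returns the total number of replacements made.
    """
    root = Path(root)
    # DOTALL lets "." cross line breaks, so multi-line blocks can match.
    regex = re.compile(pattern, re.DOTALL)
    total = 0
    for path in sorted(root.rglob("*" + suffix)):
        if not path.is_file():
            continue
        depth = len(path.relative_to(root).parts) - 1
        if depth > max_depth:
            continue
        text = path.read_text(encoding="utf-8")
        new_text, count = regex.subn(replacement, text)
        if count:
            path.write_text(new_text, encoding="utf-8")
        total += count
    return total
```

Called on a folder of OCR output with a pattern that matches the header style block, this should report one replacement per file, which is a handy sanity check. It overwrites files in place, so work on a copy.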

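Next, I get rid of all the inline style attributes, since they are not really necessary or meaningful.  Simply run Advanced Find & Replace for  style=".*?"  and we find 9,202 items to replace with nothing!  Give it a second and they are all gone.  Note that the pattern does not match the space in front of the attribute, which is why the tags below look like <p > rather than <p>.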
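To see what that regular expression does outside of Bluefish, here is a small Python sketch using the same non-greedy pattern. The sample line is made up for illustration; the real YAGF output carries different style values.

```python
import re

# The same non-greedy pattern used in the find & replace step.
# ".*?" stops at the first closing quote instead of the last one on the line.
STYLE_ATTR = re.compile(r'style=".*?"')


def strip_style_attributes(html):
    """Return (cleaned_html, number_of_attributes_removed)."""
    return STYLE_ATTR.subn("", html)


# Made-up sample line, not copied from the actual OCR output.
sample = '<p style="margin:0px;">sample text</p>'
cleaned, count = strip_style_attributes(sample)
print(cleaned)  # <p >sample text</p>
print(count)    # 1
```

The leftover space inside the tag comes from the pattern not consuming the space before the attribute.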

Now we need to do some more strategic thinking to sort out the patterns left by YAGF OCR.  Basically, each line of the text is contained in its own paragraph tag <p>…</p>.  Between paragraphs is an empty line consistently tagged <p ><br /></p>.  And the end of a page is always:

<p ><br /></p>
<p ><br /></p>
<p > </p>

Ultimately, I want to get rid of those tags, replacing them so that: 1) the line breaks are removed, 2) the page breaks are removed, 3) complete paragraphs are contained in paragraph tags.

#3 is the tricky part, so we have to go about things in just the right order.  First, I remove the page breaks by replacing the three-line string shown above with nothing.  This resulted in 364 replacements, exactly the number of page images we ran OCR on. Good!

This leaves an empty line in the HTML text files for each page break which I now remove using: Tools > Filters > Strip empty lines.
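The two steps so far (remove the page-break block, then strip the empty lines left behind) can be sketched in Python like this. I am assuming the three tags sit on consecutive lines separated by a plain "\n", and that whitespace-only lines count as empty; this is not how Bluefish implements its filter, just an equivalent for illustration.

```python
# The three-line page-break block, assuming Unix line endings.
PAGE_BREAK = "<p ><br /></p>\n<p ><br /></p>\n<p > </p>"


def remove_page_breaks(html):
    """Delete every page-break block; return (new_html, blocks_removed)."""
    count = html.count(PAGE_BREAK)
    return html.replace(PAGE_BREAK, ""), count


def strip_empty_lines(text):
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in text.split("\n") if line.strip())


# Made-up sample: the end of one page followed by the start of the next.
sample = "<p >end of page one</p>\n" + PAGE_BREAK + "\n<p >start of page two</p>\n"
without_breaks, pages = remove_page_breaks(sample)
tidy = strip_empty_lines(without_breaks)
```

The order matters here: the page-break block contains two copies of the paragraph-break tag, so it has to be removed before any paragraph breaks are touched. Comparing the count against the number of page images is a cheap check that the pattern matched every page.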

Next, I remove all the paragraph breaks by replacing <p ><br /></p> with nothing (501 results). Do NOT strip the lines this time, because the resulting empty lines will create the proper paragraphs in the next step.

Now, replace:

</p>
<p >

with a space (7096 results). This pattern, the closing tag of one line followed by the opening tag of the next, represents a line break which should be removed to create our wrapping paragraphs.  Since the previous step left an empty line at each actual paragraph break, the first <p > and last </p> of each paragraph do not match the replace string, resulting in correctly tagged paragraphs!
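Here is a minimal Python sketch of these two steps together, assuming the search string spans the closing </p> of one line, the newline, and the opening <p > of the next, which is what the explanation above implies. The sample text is invented.

```python
PARA_BREAK = "<p ><br /></p>"  # marks a real paragraph break
LINE_JOIN = "</p>\n<p >"       # end of one OCR line + start of the next


def merge_lines(html):
    """Turn one-tag-per-line markup into one tag per paragraph.

    Returns (new_html, number_of_line_breaks_joined).
    """
    # Step 1: delete paragraph-break tags but keep the empty line they leave.
    html = html.replace(PARA_BREAK, "")
    # Step 2: join consecutive lines. The empty line keeps "</p>" and "<p >"
    # apart at real paragraph breaks, so those boundaries do not match.
    count = html.count(LINE_JOIN)
    return html.replace(LINE_JOIN, " "), count


sample = (
    "<p >first line of a</p>\n"
    "<p >paragraph ends here.</p>\n"
    "<p ><br /></p>\n"
    "<p >second paragraph</p>\n"
    "<p >continues.</p>"
)
merged, joins = merge_lines(sample)
print(merged)
```

If the empty lines had been stripped after step 1, every boundary would match in step 2 and the whole file would collapse into one giant paragraph.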

Finally, I replace “- ” (hyphen space) with nothing (259 results).  This catches all the words that were hyphenated at a line break, while leaving all the actually hyphenated phrases alone.
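The hyphen fix is a plain string replacement, so the Python equivalent is tiny. It assumes the lines have already been joined with a single space, so a word split across a line break now looks like "won- derful". The sample sentence is made up.

```python
def join_hyphenated(text):
    """Remove every "- " (hyphen + space); return (new_text, count)."""
    count = text.count("- ")
    return text.replace("- ", ""), count


# "won- derful" was split at a line break; "well-known" is a real hyphen.
sample = "a won- derful and well-known tale"
fixed, count = join_hyphenated(sample)
print(fixed)  # a wonderful and well-known tale
print(count)  # 1
```

One thing to watch for: a hyphen used as a dash with spaces on both sides would also match, so it is worth skimming the matches before replacing them all.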

After that the batch looks good!  There are a few more issues that need to be fixed and errors to be searched for, but they are better off done with other tools, i.e. Sigil the ebook editor.  More soon!

