Raw HTML text

After OCR with YAGF (Tesseract), we have a batch of large HTML files. For Aladore 1914 it was six HTML files of around 1600 lines each. The first page looks like this:

ALADORE.

CHAPTER I.

OF THE HALL OF SULNEY AND HOW

SIR YWAIN LEFT IT.

SIR YWAIN sat in the Hall of Sulney and

did justice upon wrong-doers. And one man

had gathered sticks where he ought not, and

this was for the twentieth time; and another

had snared a rabbit of his lord’s, and this was

for the fortieth time; and another had beaten

his wife, and she him, and this was for the

hundredth time: so that Sir Ywain was weary

of the sight of them. Moreover, his steward

stood beside him, and put him in remem-

Each line of text from the page image has been tagged as one HTML paragraph, so the breaks follow the original printed page. Take a look at the markup:

<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">ALADORE.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">CHAPTER I.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">OF THE HALL OF SULNEY AND HOW</p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN LEFT IT.</p>
<p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"><br /></p>
<p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">SIR YWAIN sat in the Hall of Sulney and</p>

Each paragraph carries a long "style" attribute that is effectively meaningless, since every text line has identical values. For my purposes, it's just ugly and unnecessary. However, the important thing is that everything is consistent. I will not use the formatting described by the HTML tags, but because they are used consistently, I can easily transform the text. With a little thought and find & replace, we can create a reflowable text, turning the HTML above into something like:

ALADORE.

CHAPTER I.

OF THE HALL OF SULNEY AND HOW SIR YWAIN LEFT IT.

SIR YWAIN sat in the Hall of Sulney and did justice upon wrong-doers. And one man had gathered sticks where he ought not, and this was for the twentieth time; and another had snared a rabbit of his lord’s, and this was for the fortieth time; and another had beaten his wife, and she him, and this was for the hundredth time: so that Sir Ywain was weary of the sight of them. Moreover, his steward stood beside him, and put him in remembrance of all the misery that had else been forgotten.
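The same transformation can also be scripted instead of done by hand with find & replace. What follows is only a minimal sketch in Python, not the method used for this project: the function name `reflow` is made up for illustration, and it assumes that an empty paragraph (the ones containing only a line break) marks a real break in the text, and that a line ending in a hyphen is a word split across two printed lines. That second assumption will occasionally be wrong for genuinely hyphenated compounds, so the output still needs proofreading.

```python
import html
import re

# Matches one OCR paragraph; the style attribute is ignored entirely.
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
# Matches any tag left inside a paragraph, such as <br />.
INNER_TAG = re.compile(r"<[^>]+>")


def reflow(markup: str) -> str:
    """Join per-line <p> elements into reflowable paragraphs (hypothetical helper)."""
    paragraphs = []  # finished paragraphs
    current = ""     # paragraph currently being built
    for match in P_TAG.finditer(markup):
        # Drop inner tags and decode entities such as &amp;.
        line = html.unescape(INNER_TAG.sub("", match.group(1))).strip()
        if not line:
            # Assumption: an empty <p> marks a break between blocks of text.
            if current:
                paragraphs.append(current)
                current = ""
        elif current.endswith("-"):
            # Assumption: a trailing hyphen is a word split across lines.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

Calling `reflow()` on the contents of one of the HTML files returns plain text with one blank line between paragraphs, which can then be checked against the page images.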

Which leads us to the next section of the project!
