Wednesday, September 30, 2015

Hoyle Bibliography: technology update (part 3)

For background to this short post, please see part 1 and part 2 of the technology update. I've pretty much finished the work of producing MS Word from my XML files, but my approach is quite different from the one I expected.

I thought the model would be XML->HTML for the web and XML->MS Word for the print version. It turns out to be easier, much easier, to go XML->HTML->MS Word!

To look at the sample file, the HTML version of Whist.3 is here. Below is the translation to MS Word:

[Image: the Whist.3 description as rendered in MS Word]
Now you'll notice there isn't much in the way of formatting: no borders on the table, no nice margins or spacing, no bold table headers, etc. That's deliberate. One can always add styling later and it can be quite hard to remove if there's too much. What I have done is get all the text rendered correctly: smallcaps, italics, superscripts, etc. And the crazy table with the rows and columns that span cells. [Aside: As you can learn here, spanning columns is simple; spanning rows is much more difficult.]

Other than spanning rows, the hardest thing was managing whitespace. There is a whole section in my XSLT/XPath book on whitespace including a subsection "Solving Whitespace Problems" with subsections "Too Much Whitespace" and "Too Little Whitespace". I had problems with both. It was necessary:
  • to have the XML->HTML transformation use the stricter <xsl:output method="xml"/> rather than method="html"
  • to have the HTML->MS Word transformation use <xsl:strip-space elements="*"/>
  • to write a function to "normalize" all text data--that is, collapse consecutive whitespace into a single space, while preserving a single leading or trailing space.
Okay, TMI, I know. But I wanted to write it all down so I wouldn't lose it.
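
For readers who want to see the third item concretely, here is a minimal sketch of that kind of normalization. It is written in Python rather than XSLT, so it is a stand-in for the idea and not the stylesheet function itself. The names normalize_space and normalize_keep_edges are made up for this illustration. XPath's built-in normalize-space() trims both ends of a string, which is why an extra step is needed to put a single edge space back; the treatment of whitespace-only text below is an assumption.

```python
def normalize_space(text):
    """Trim the ends and collapse inner runs of whitespace to one space.

    Roughly what XPath's normalize-space() does. Python's split() treats a
    few more characters as whitespace than XPath does.
    """
    return " ".join(text.split())


def normalize_keep_edges(text):
    """Collapse whitespace runs, keeping one leading and one trailing space."""
    if not text:
        return ""
    core = normalize_space(text)
    if not core:
        # Assumption: whitespace-only text collapses to a single space.
        return " "
    lead = " " if text[0].isspace() else ""
    trail = " " if text[-1].isspace() else ""
    return lead + core + trail


print(repr(normalize_keep_edges("  printed   for\n F. Cogan, ")))
```

Keeping the edge space matters when text sits next to inline markup: without it, the word before an italic title would run straight into the title.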

There may well be better ways to do this. I found myself frequently at the boundaries of my knowledge. But with a lot of Googling and reading, I've found that many others have been down this path and come up with similar solutions.

OK, enough technology. Back to bibliography!

Sunday, September 27, 2015

Hoyle Bibliography: technology update (part 2)

Another techie update...

In my last essay, I gave an overview of the technology I am using for the Hoyle bibliography. One of the claims I made is that storing the descriptions in a highly-structured format would allow me to render them both on the web and in a word processing document. If truth be told, until quite recently, I had never tested that claim, except on the most trivial data. But now I'm ready to declare success!

To review the acronyms briefly, I am storing each bibliographical description in an XML file. I use another language, XSLT (Extensible Stylesheet Language Transformations) to translate the data into HTML for display on the web. I've always assumed that I could modify the XSLT to translate the data into an MS Word file, but had tested that only for unformatted text. It remained to deal with the annoyances of superscripts, subscripts, italics, tables, etc.

Well, I'm quite relieved to be able to report that everything works! In the last essay, I showed the XML for the collation formula for Whist.3, which is displayed as:

12o: A–D12 E4 [$½ (-A2,B2) signed; missigning B4 as B5]; 52 leaves, pp. [8] [1] 2–96

You can see the full bibliographical description on my website here, rendered as HTML. I wrote a new XSLT program that reads the same XML and plops the collation formula into a file that MS Word can read. More on that program in a moment. Here is the output, readable by MS Word:

<?xml version="1.0" encoding="utf-8"?><?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
   <w:body>
      <w:p>
         <w:r>
            <w:t>
               <w:rPr>
                  <w:i w:val="on"/>
               </w:rPr>A Short Treatise on the Game of Whist<w:rPr>
                  <w:i w:val="off"/>
               </w:rPr>, printed for F. Cogan, third London edition, 1743.<w:p/>
               <w:p/>
               <w:t>Collation: 12<w:rPr>
                     <w:vertAlign w:val="superscript"/>
                  </w:rPr>o<w:rPr>
                     <w:vertAlign w:val="baseline"/>
                  </w:rPr>: A–D<w:rPr>
                     <w:vertAlign w:val="superscript"/>
                  </w:rPr>12<w:rPr>
                     <w:vertAlign w:val="baseline"/>
                  </w:rPr> E<w:rPr>
                     <w:vertAlign w:val="superscript"/>
                  </w:rPr>4<w:rPr>
                     <w:vertAlign w:val="baseline"/>
                  </w:rPr> [$½ (-A2,B2) signed; missigning B4 as B5]; 52 leaves, pp. [<w:rPr>
                     <w:i w:val="on"/>
                  </w:rPr>8<w:rPr>
                     <w:i w:val="off"/>
                  </w:rPr>] [1] 2–96 </w:t>
            </w:t>
         </w:r>
      </w:p>
   </w:body>
</w:wordDocument>


All those impenetrable tags beginning <w:....> are the incantations that MS Word needs for formatting.

If you are feeling ambitious, you can copy that text into a file and save it as Whist3.xml or some such. Note that the file extension must be .xml. Then launch MS Word and open the file. You should get something that looks like this:

[Image: the file as displayed in MS Word]


Notice that I've dealt with paragraph breaks, superscripts, italics, and more. Success!

Not shown in this example are other things I'll need to do: tables, headers, etc. Fortunately, I've solved those items as well. 

Back to the program. The really good news is that there is about an 80% overlap between the XSLT used to translate to HTML and to MS Word. Now that I am learning which parts of the XSLT are the same and which must be customized, I can recode the XSLT a bit more intelligently so that the common 80% is in one file, and the two 20% specializations are in other files.

I can't say I was ever worried about getting my descriptions into MS Word, but it's awfully nice to know it works!

Thursday, September 10, 2015

Hoyle Bibliography: technology update

While it has been ages since I last posted on this blog, I have been monumentally busy with Hoyle. Last November, I announced that I would be starting an online descriptive bibliography of Hoyle. This post highlights the progress I have made, both with content, and with the supporting technology.

Underlying Approach

The bibliography nears a major milestone. I have completed descriptions of all but a handful of the 18th century editions of Hoyle, the task I had originally contemplated. Inevitably, my scope has expanded, and I’m well into the 19th century. It is difficult to find a graceful stopping point. In addition to the content, the bibliography is a significant and apparently unique effort in the digital humanities. So far I have created 170 bibliographical descriptions, storing each in a file for validation, processing, and display. The programming effort has been substantial and is ongoing, but continues to pay for itself many times over. This blog essay discusses the technology I have developed in the course of compiling the Hoyle bibliography.

My primary goal was to create bibliographical descriptions of the books that could be presented in multiple ways—initially as a web site and then in a word processing document leading to print publication. I expect that others will be able to extract data programmatically from my descriptions if  desired—perhaps a library wishes to update its catalogue or a collector wishes to build a checklist. This goal, one data source with multiple presentations, dictated storing the descriptions in a highly-structured way.

A second goal was to avoid errors and inconsistencies in bibliographical descriptions and their presentation. As to the descriptions, collation formulas and pagination statements should total to the same number of leaves. Deletions and signing errors should refer to leaves actually in the collation formula. Signature references and page references should point to the same page. I have seen each of these errors in printed bibliographies—mistakes are inevitable. Formatting is equally error prone. Fredson Bowers’ Principles of Bibliographical Description is the standard for descriptive bibliography, including the collation formula and pagination statement. Bowers requires dexterous use of brackets, italics, commas, semicolons, superscripts and subscripts. Proofreading is hardly...foolproof. It seemed as though there should be better solutions.

The desire to avoid errors led to the same design decision suggested earlier, highly structured data. Following other digital humanities projects, particularly TEI (about which more below), I chose XML as an underlying technology. A brief excerpt from one of my book descriptions will show how structured XML data can reduce error. Consider Whist.3 (my description is online here), which has one of the simpler collation formulas:

12o: A–D12 E4 [$½ (-A2,B2) signed; missigning B4 as B5]; 52 leaves, pp. [8] [1] 2–96

The data used to produce the collation formula is:

        <collation>
            <format>12</format>
            <collationFormula>
                <gatherings>
                    <gatheringRange signed="true">
                        <sigStart>A</sigStart>
                        <sigEnd>D</sigEnd>
                        <leaves>12</leaves>
                    </gatheringRange>
                    <gatheringRange signed="true">
                        <sigStart>E</sigStart>
                        <leaves>4</leaves>
                    </gatheringRange>
                </gatherings>
                <signatureLeaves>$½</signatureLeaves>
                <anomSignatures>
                    <anomSignature>
                        <anomType>-</anomType>
                        <sigRef>A2</sigRef>
                    </anomSignature>
                    <anomSignature>
                        <anomType>-</anomType>
                        <sigRef>B2</sigRef>
                    </anomSignature>
                </anomSignatures>
                <signingErrors>
                    <signingError>
                        <sigRef>B4</sigRef>
                        <badSig>B5</badSig>
                    </signingError>
                </signingErrors>
            </collationFormula>
            <totalLeaves>52</totalLeaves>
            <pagination>
                <pageRanges>
                    <pageRange numbered="false" range="true">
                        <start>8</start>
                    </pageRange>
                    <pageRange numbered="false">
                        <start>1</start>
                    </pageRange>
                    <pageRange numbered="true">
                        <start>2</start>
                        <end>96</end>
                    </pageRange>
                </pageRanges>
            </pagination>
        </collation>

XML is a hierarchical structure: elements have values (the book's format is 12, a duodecimo) and attributes have values (page 1 is unnumbered, pages 2-96 are). Everything is text and therefore readable by humans, particularly when indented in an outline form that reveals the structure. In the example above, the collation consists of format, collation formula, total leaves, and pagination. The collation formula consists of gatherings, signature leaves (indicating normal signing), and anomalous signatures. Each gathering range within the gatherings has a starting signature (sigStart), an optional ending signature, and a number of leaves. A gathering range may be signed or unsigned. The pagination section is similar. More complicated books will use other optional elements.
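
To make the element/attribute distinction concrete, here is a small Python sketch that reads a trimmed copy of the Whist.3 data with the standard library's xml.etree.ElementTree. The tuple layout and variable names are choices made for this illustration only.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the Whist.3 collation shown above.
COLLATION_XML = """
<collation>
  <format>12</format>
  <collationFormula>
    <gatherings>
      <gatheringRange signed="true">
        <sigStart>A</sigStart><sigEnd>D</sigEnd><leaves>12</leaves>
      </gatheringRange>
      <gatheringRange signed="true">
        <sigStart>E</sigStart><leaves>4</leaves>
      </gatheringRange>
    </gatherings>
  </collationFormula>
  <totalLeaves>52</totalLeaves>
</collation>
"""

root = ET.fromstring(COLLATION_XML)

# Element value: the text between <format> and </format>.
book_format = root.findtext("format")

gatherings = []
for g in root.iter("gatheringRange"):
    start = g.findtext("sigStart")
    # sigEnd is optional; a single gathering starts and ends on one signature.
    end = g.findtext("sigEnd", default=start)
    leaves = int(g.findtext("leaves"))
    # Attribute value: signed="true" on the opening tag.
    signed = g.get("signed") == "true"
    gatherings.append((start, end, leaves, signed))

print(book_format, gatherings)
```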

How does this encoding help avoid error? First, the data is validated against an XML schema I created. The schema is a formal description of the rules for describing a book. The schema requires elements such as collation, collation formula, signature leaves, etc. The element anomalous signatures is optional, as are elements for signing errors, duplicated signatures, doubled alphabets, insertions, deletions, and free form notes. Failure to include a required element or inclusion of an unexpected element will generate an error.

Moreover, the XML schema restricts each element as to allowed types and values. For example, format is limited to a small set of values such as 8 for octavo, 12 for duodecimo, etc. Entering 13 into the format field will generate an error. The schema is rather complex, but does an admirable job of preventing errors.
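
The schema itself is not reproduced in this post. As a rough stand-in, the Python sketch below imitates two of the checks a schema performs: required elements and an enumerated list of allowed values. It is not XML Schema validation, and both ALLOWED_FORMATS and REQUIRED_CHILDREN are assumptions made for the illustration (only 8 and 12 are named above as formats).

```python
import xml.etree.ElementTree as ET

# Assumed for illustration; the real lists live in the schema.
ALLOWED_FORMATS = {"2", "4", "8", "12", "16", "18", "24"}
REQUIRED_CHILDREN = ("format", "collationFormula", "totalLeaves", "pagination")


def check_collation(xml_text):
    """Return a list of error messages; an empty list means no problems found."""
    errors = []
    root = ET.fromstring(xml_text)
    for name in REQUIRED_CHILDREN:
        if root.find(name) is None:
            errors.append("missing required element: " + name)
    fmt = root.findtext("format")
    if fmt is not None and fmt.strip() not in ALLOWED_FORMATS:
        errors.append("format not allowed: " + fmt.strip())
    return errors


good = ("<collation><format>12</format><collationFormula/>"
        "<totalLeaves>52</totalLeaves><pagination/></collation>")
bad = good.replace("<format>12</format>", "<format>13</format>")

print(check_collation(good))
print(check_collation(bad))
```

A real schema does far more (element order, data types, nested structure), which is why the work is better left to a validating editor or parser than to hand-written checks like these.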

One might expect that all of the tags, required structure, and rules for allowed values would add substantial effort when inputting data. Indeed the above snippet of XML for Whist.3 is much more verbose than the collation formula. Surprisingly, there is much less data entry. Much less. Modern tools will read the required structure contained in the XML schema, insert most of the tags, and suggest allowed values for the data. Most of the typing is done for you. And as we shall see below, you don’t have to worry about brackets, italics, superscripts and the like—that is handled elsewhere.

Once the data is structured and individual elements are known to have valid values, it is possible to check them for internal consistency. For example, I have written a program to read the collation formula, count the number of leaves it implies, and compare that count with the element total leaves, flagging any discrepancy as an error. Similarly, the pagination statement implies a total number of pages that is expected to be twice the number of leaves. In the example above, there are four gatherings of 12 leaves and one of 4, totaling 52 leaves and 104 pages. Check.
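
The arithmetic of that check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual program: it assumes single-letter signatures drawn from the 23-letter alphabet (no J, U, or W), and it reads a pageRange with range="true" and no end as a count of unnumbered pages, which is an interpretation based on how the example renders as [8].

```python
import xml.etree.ElementTree as ET

# Assumption: signatures run through the 23-letter alphabet (no J, U, W).
SIGNATURE_ALPHABET = "ABCDEFGHIKLMNOPQRSTVXYZ"

COLLATION_XML = """
<collation>
  <collationFormula>
    <gatherings>
      <gatheringRange signed="true">
        <sigStart>A</sigStart><sigEnd>D</sigEnd><leaves>12</leaves>
      </gatheringRange>
      <gatheringRange signed="true">
        <sigStart>E</sigStart><leaves>4</leaves>
      </gatheringRange>
    </gatherings>
  </collationFormula>
  <totalLeaves>52</totalLeaves>
  <pagination>
    <pageRanges>
      <pageRange numbered="false" range="true"><start>8</start></pageRange>
      <pageRange numbered="false"><start>1</start></pageRange>
      <pageRange numbered="true"><start>2</start><end>96</end></pageRange>
    </pageRanges>
  </pagination>
</collation>
"""


def leaves_in_formula(root):
    """Count the leaves implied by the gathering ranges."""
    total = 0
    for g in root.iter("gatheringRange"):
        start = g.findtext("sigStart")
        end = g.findtext("sigEnd", default=start)
        gatherings = (SIGNATURE_ALPHABET.index(end)
                      - SIGNATURE_ALPHABET.index(start) + 1)
        total += gatherings * int(g.findtext("leaves"))
    return total


def pages_in_pagination(root):
    """Count the pages implied by the pagination statement."""
    total = 0
    for p in root.iter("pageRange"):
        start = int(p.findtext("start"))
        end = p.findtext("end")
        if end is not None:
            total += int(end) - start + 1   # e.g. 2-96 is 95 pages
        elif p.get("range") == "true":
            total += start                  # e.g. [8] is eight pages
        else:
            total += 1                      # e.g. [1] is one page
    return total


root = ET.fromstring(COLLATION_XML)
leaves = leaves_in_formula(root)
pages = pages_in_pagination(root)
declared = int(root.findtext("totalLeaves"))

if leaves != declared:
    print("ERROR: formula implies", leaves, "leaves but totalLeaves is", declared)
if pages != 2 * leaves:
    print("ERROR: pagination implies", pages, "pages for", leaves, "leaves")
print(leaves, "leaves,", pages, "pages")
```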

Much more validation is possible. For example, I give references in terms of both signature and page, such as A5v–E4r (2–95) for a range or E4v (96) for a page. Once we are certain that the collation formula and pagination statements are consistent, we know the page number for each leaf. I was able to write a program that verifies that leaf A5v is page 2, E4r is page 95, and E4v is page 96. It is no exaggeration to say that the program has detected hundreds of errors. Perhaps thousands. By entering both the signature and page reference, I have to make two errors that are consistent with one another before mistakes of reference appear in the bibliography.
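
Here is a minimal Python sketch of the signature-to-page lookup for Whist.3. The gathering sizes and page sequence are typed in by hand from the collation above, and page_for is a hypothetical name; a real check would derive both lists from the XML. It assumes single-letter signatures and references of the form A5v.

```python
# Whist.3: A-D in twelves, E in four (52 leaves).
GATHERINGS = [("A", 12), ("B", 12), ("C", 12), ("D", 12), ("E", 4)]

# Page label for each side of each leaf, in order: eight unnumbered
# preliminary pages, then [1], then 2-96. None means "no page number".
PAGE_LABELS = [None] * 8 + list(range(1, 97))


def page_for(sig_ref):
    """Return the page number for a reference such as 'A5v', or None."""
    gathering, leaf, side = sig_ref[0], int(sig_ref[1:-1]), sig_ref[-1]
    if side not in ("r", "v"):
        raise ValueError("side must be r or v: " + sig_ref)
    leaves_before = 0
    for name, size in GATHERINGS:
        if name == gathering:
            if not 1 <= leaf <= size:
                raise ValueError("no such leaf: " + sig_ref)
            break
        leaves_before += size
    else:
        raise ValueError("no such gathering: " + sig_ref)
    # Each leaf has two sides; the recto comes first.
    position = 2 * (leaves_before + leaf - 1) + (0 if side == "r" else 1)
    return PAGE_LABELS[position]


for sig_ref, claimed_page in [("A5v", 2), ("E4r", 95), ("E4v", 96)]:
    actual = page_for(sig_ref)
    status = "ok" if actual == claimed_page else "MISMATCH"
    print(sig_ref, claimed_page, actual, status)
```

A reference such as A5v (3) would be reported as a mismatch, since the collation places page 2 on that side of the leaf.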

I only wish there were a similar way to validate quasi-facsimile transcription!

XML works with another language called XSLT (Extensible Stylesheet Language Transformations) to render XML in other formats such as text or HTML. It is an XSLT stylesheet that transforms the collation as expressed in XML into Bowers format. All the “knowledge” of Bowers' rules is in one program. As a result, when entering the collation for a book, I do not have to type brackets, italics, or superscripts—a major time saving for data entry.

An amusing example demonstrates the strength of the approach. At Rare Book School, I learned to describe signing errors by saying “missigning B4 as B5”. Bowers prefers “misprinting B4 as B5” and has no objection to quoting the erroneous signature, writing “misprinting B4 as ‘B5’” (see Bowers, p. 270). Regardless of which is preferred, I can change the output for all 170 book descriptions by making a minor change to one XSLT stylesheet rather than editing each of the 170 descriptions. Neat!
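
The rendering is done in XSLT, which is not shown here. To illustrate the principle that the wording lives in exactly one place, the sketch below is a plain-text Python stand-in: it produces the collation line without superscripts or italics, and it handles only the case in the example, where every anomalous signature shares one type. SIGNING_ERROR_VERB and render_collation are names invented for the illustration.

```python
import xml.etree.ElementTree as ET

# The single place where the wording is decided.
SIGNING_ERROR_VERB = "missigning"

COLLATION_XML = """<collation>
  <format>12</format>
  <collationFormula>
    <gatherings>
      <gatheringRange signed="true">
        <sigStart>A</sigStart><sigEnd>D</sigEnd><leaves>12</leaves>
      </gatheringRange>
      <gatheringRange signed="true">
        <sigStart>E</sigStart><leaves>4</leaves>
      </gatheringRange>
    </gatherings>
    <signatureLeaves>$½</signatureLeaves>
    <anomSignatures>
      <anomSignature><anomType>-</anomType><sigRef>A2</sigRef></anomSignature>
      <anomSignature><anomType>-</anomType><sigRef>B2</sigRef></anomSignature>
    </anomSignatures>
    <signingErrors>
      <signingError><sigRef>B4</sigRef><badSig>B5</badSig></signingError>
    </signingErrors>
  </collationFormula>
  <totalLeaves>52</totalLeaves>
  <pagination>
    <pageRanges>
      <pageRange numbered="false" range="true"><start>8</start></pageRange>
      <pageRange numbered="false"><start>1</start></pageRange>
      <pageRange numbered="true"><start>2</start><end>96</end></pageRange>
    </pageRanges>
  </pagination>
</collation>"""

EN_DASH = "\u2013"


def render_collation(root, verb=SIGNING_ERROR_VERB):
    """Render the collation as one line of plain text."""
    gatherings = []
    for g in root.iter("gatheringRange"):
        start, end = g.findtext("sigStart"), g.findtext("sigEnd")
        span = start if end is None else start + EN_DASH + end
        gatherings.append(span + g.findtext("leaves"))

    # Assumption: all anomalies share the type of the first one.
    anomalies = list(root.iter("anomSignature"))
    anom_text = ""
    if anomalies:
        refs = ",".join(a.findtext("sigRef") for a in anomalies)
        anom_text = " (" + anomalies[0].findtext("anomType") + refs + ")"

    signing = root.findtext("collationFormula/signatureLeaves") + anom_text + " signed"
    errors = ["{} {} as {}".format(verb, e.findtext("sigRef"), e.findtext("badSig"))
              for e in root.iter("signingError")]

    pages = []
    for p in root.iter("pageRange"):
        text = p.findtext("start")
        if p.findtext("end") is not None:
            text += EN_DASH + p.findtext("end")
        if p.get("numbered") == "false":
            text = "[" + text + "]"     # unnumbered pages go in brackets
        pages.append(text)

    return "{}o: {} [{}]; {} leaves, pp. {}".format(
        root.findtext("format"), " ".join(gatherings),
        "; ".join([signing] + errors),
        root.findtext("totalLeaves"), " ".join(pages))


root = ET.fromstring(COLLATION_XML)
print(render_collation(root))
print(render_collation(root, verb="misprinting"))
```

Switching every description to the Bowers wording means changing the one constant; none of the data files need to be touched.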

A third goal was to automate the production of indices for the bibliography. The top-level index classifies works as (a) separate works; (b) publishers’ collections of works published separately; and (c) collected editions. It is produced programmatically. Other programs produce other indices:
  • An index of short titles and short imprints
  • A chronological list of all editions and issues
  • A list of games and subjects treated in Hoyle with a chronological list of books for each game or subject
  • An index by publisher or printer
  • A list of institutions holding copies of Hoyle (see here, for example, for libraries in the British Isles) and the books held at a given library (for example, the Bodleian, which has the largest collection of Hoyles in the world)
  • Lists of Hoyles in each of the standard gaming bibliographies, such as Horr, Jessel, and Rather and Goldwater.
Each time I add a new bibliographical description, I can regenerate all of the indices and indeed the entire website by running one program.

The final goal is perhaps the most ambitious—to develop a platform that other bibliographers can use. I have no intention of turning the technology into a commercial product, and have built it with laser-like focus on my needs rather than as a general solution to bibliographical description.  I would expect, however, that hobbyist programmers familiar with the technologies I used should have little difficulty extending it to their needs. I would be eager to hear from anyone who is interested.

Afterword: A Note on Technology

I initially explored the Text Encoding Initiative (TEI) as a way to encode book descriptions. I found that, as the name suggests, the standards were focused on encoding text and other contents, rather than encoding characteristics of the physical book. A TEI Work Group on physical bibliography made a good start at encoding a collation statement, but their work did not proceed to completion and did not become part of a TEI release.  I used theirs as a starting point for my own work. 

I am using early and well-supported versions of products in the XML suite: XML 1.0, XML Schema 1.0, and XSLT 1.0 (including XPath 1.0). While there are some attractions to using later versions, they are not always supported by browsers, and I wanted the web version to work with Firefox, Chrome, Internet Explorer, and Safari, not all of which support more recent versions of XSLT and XPath.

I use oXygen XML Editor 17 as an XML development environment. It is an awesome tool and I fear that I am only using a fraction of its capabilities.

Where I need to ensure consistency of the book descriptions beyond what XML Schema provides, I write programs in Python 3.4. Python also creates the various indices I described earlier. Python is a general purpose programming language that excels in handling text and has excellent libraries for reading and writing XML files.