Encoding Guide for Early Printed Books


Post-processing is, in general terms, any kind of processing performed upon the encoded text after its initial encoding is complete. In practice (and in the context we are discussing) post-processing falls into several categories.

Automated encoding

Automated encoding processes insert markup into the text with little or no direct human intervention. They typically work by detecting textual features of interest (such as names, dates, or quotations) through their formatting or their textual behavior, and their effectiveness depends directly on how consistently and distinctively these features are represented in the text. If all dates consist of four numeric digits, and no other four-digit numbers appear in the text, marking all dates automatically is very easy. Similarly, if all personal names begin with a capital letter, it may be possible to develop a system that identifies names by searching for capitalized words, determining whether they fall at the beginning of a sentence, and testing to see whether they appear in a dictionary of proper nouns. Such systems have a harder time with features that vary widely in their presentation (for instance, place names that are inconsistently capitalized and may include common nouns or personal names: e.g. King of Prussia). In very early texts, where spelling, capitalization, and punctuation may be highly variable, automated encoding may be nearly impossible. For later texts, though, and for large collections of long documents, the payoff may be considerable. It is beyond our scope here to describe automated encoding methods in detail, but a number of projects are exploring such approaches, including the Perseus Project at Tufts University. Automated markup is an important research topic from a computational standpoint, as well as being commercially significant, so although at the moment there are not many tools specifically aimed at humanities projects, the field is developing steadily.
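The four-digit-date scenario above can be sketched very simply. This is a minimal illustration, not a production tool, and it assumes (as the scenario does) that every date is a four-digit number and that no other four-digit numbers occur in the text:

```python
import re

# Rule-based feature detection: wrap every four-digit number in a
# TEI <date> element. This only works under the assumption stated
# above -- that all and only dates appear as four-digit numbers.
DATE_PATTERN = re.compile(r"\b(\d{4})\b")

def tag_dates(text):
    """Insert <date> markup around each four-digit number."""
    return DATE_PATTERN.sub(r"<date>\1</date>", text)

sample = "Printed at London in 1632, and reissued in 1641."
print(tag_dates(sample))
# -> Printed at London in <date>1632</date>, and reissued in <date>1641</date>.
```

The fragility of the approach is visible even here: a four-digit page count or catalogue number would be tagged as a date, which is exactly why such rules work only when the feature's presentation is consistent and distinctive.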

There remains the question of whether automated markup processes are useful. In theory, if dates (for example) can be detected easily through automated processes, they can also be detected easily by search processes, so that the advantage of having the markup actually present in the text might not be very great. In the early days of markup, computer speeds were slow enough that identifying textual features in advance through markup could represent a very significant saving of time. A program that had to identify proper nouns by testing for capitalization, sentence position, and absence from a dictionary could not possibly respond to searches in real time, whereas it could search the text for the persName element reasonably quickly. Modern computer speeds are now approaching rates that shift this balance: it is possible that before long, heuristics for identifying many textual features will be accurate, ingenious, and fast enough that certain kinds of basic markup are not necessary. However, any markup which depends on human recognition or interpretation will still need to be entered by hand.

In addition to fully automated markup, there also exists the possibility of semi-automated processes that help a human being discover possible features quickly, and mark them easily if they are indeed the feature desired. For instance, an automated process might search through the text, find words and phrases that might be names, and allow the encoder to choose whether to mark them as personal, place, or organizational names with a single keystroke—a much quicker process than having the encoder work through the text manually, typing in elements by hand. Semi-automated processes also allow the detection of much more variable and uncertain features, since they permit an element of human judgment and choice. The development of more complex heuristics for identifying textual features may also enable the development of very powerful semi-automated encoding tools.
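The division of labor just described—an automated pass that proposes candidates, and a human who decides—can be sketched as follows. The candidate pattern (naively, runs of capitalized words) and the element names are illustrative assumptions; the decision function stands in for the single-keystroke interface described above:

```python
import re

# Semi-automated tagging: the machine finds candidates, a human decides.
# CANDIDATE naively matches runs of capitalized words; a real tool would
# use better heuristics and an interactive interface.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def tag_candidates(text, decide):
    """Wrap each candidate in the element chosen by decide(), or leave it
    untouched if decide() returns None (i.e. the human rejects it)."""
    def wrap(match):
        element = decide(match.group(0))  # e.g. "persName", "placeName", or None
        if element is None:
            return match.group(0)
        return "<{0}>{1}</{0}>".format(element, match.group(0))
    return CANDIDATE.sub(wrap, text)

# Here a lookup table stands in for the encoder's keystroke-by-keystroke choices.
decisions = {"Margaret Cavendish": "persName", "London": "placeName"}
print(tag_candidates("Margaret Cavendish lived near London.",
                     lambda c: decisions.get(c)))
# -> <persName>Margaret Cavendish</persName> lived near <placeName>London</placeName>.
```

Because the final judgment rests with a human, the candidate pattern can afford to over-generate in a way that a fully automated process could not.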

Error discovery and correction

Processes for error discovery are clearly crucial for any text encoding project, and although human proofreading will always have an important place, there are also a number of tools one can use to catch errors that human beings would find difficult or burdensome to detect. XML validation is of course one of the most important of these, and it acts as the first line of defense by catching gross encoding errors: places where the encoding does not match the rules set out by the schema at all. But validity is only the first step: a valid file may still be full of errors, inconsistencies, and omissions. Additional tools are very helpful in identifying and possibly fixing these problems. We discuss this topic in more detail in Error Checking.
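As a small illustration of automated error discovery, the sketch below checks that a document is at least well-formed XML, using only the Python standard library. Note that this catches only the grossest structural errors; validation against a schema—the first line of defense described above—requires a schema-aware tool such as lxml or a command-line validator:

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_string):
    """Return None if the XML parses cleanly, or a description of the
    first structural error encountered."""
    try:
        ET.fromstring(xml_string)
        return None
    except ET.ParseError as err:
        return str(err)

print(check_well_formed("<div><p>sound structure</p></div>"))  # None
print(check_well_formed("<div><p>missing close tag</div>"))    # error message
```

A batch version of this check, run over a whole directory of transcriptions, is the kind of tool that spares human proofreaders from hunting for unclosed tags by eye.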

Transformation into other formats

One of the most common kinds of post-processing is to transform your XML data into some other format, usually as part of your publication process. These transformations may be performed just once, as the final step in production, or they may be set up as repeatable processes that allow for the source XML to be updated and the output process run again to produce fresh output. If you expect to be making changes to your encoded files that would require updating the output, having repeatable processes is very useful, and in this case they should require as little hand work as possible: ideally, they should be completely automated.

Another principle to bear in mind is that your source XML should match the broadest and most far-reaching goals of your project, rather than your specific (and possibly temporary) publication needs. If you treat your transcribed TEI files as archival sources, from which you can derive specific kinds of output as needed, you gain flexibility and longevity for your data. The transformation process described here thus becomes a way to convert your archival-quality data into whatever particular format you need for some specific purpose.

Some of the most common output functions and formats are:

  • HTML: Because HTML is still the predominant language used in web publishing, it is also the most common destination format. TEI encoding projects typically transcribe materials in TEI/XML, and then, once the text is completely encoded and corrected, transform it into HTML for publication on the web. XSLT makes this process very easy, and it can be used either as part of a one-time transformation or as part of a dynamic system that converts the data on the fly in response to a user request. There are also numerous other ways to convert TEI to HTML, including various kinds of scripts. Such tools and systems are common and comparatively easy to write.
  • Non-archival XML: For many purposes, the archival-quality TEI you create through your encoding process may not be ideal for specific publication purposes. For instance, if you have encoded line breaks in prose using the lb element, some search engines may not be able to search for words that are broken across the line break. It would be short-sighted to neglect the line breaks simply because some software misbehaves; instead, you could generate a derived file for use in your current publication system, which eliminates the lb encoding. Similarly, you might wish to experiment with a publication tool that only works on files encoded with TEI Lite. Again, it would be short-sighted to base your encoding decisions on the behavior of a single tool, but you could generate TEI Lite from your archival TEI files for use in that system.
  • Interchange formats for sharing with other projects: If you have developed significant TEI customizations, but also wish to contribute data to a common repository or a joint encoding project with different encoding practices, it may be very useful to be able to convert your files to a more generic TEI encoding: for instance, converting locally defined elements to a standard TEI equivalent. If your customizations support a more detailed or fine-grained encoding than the TEI allows, this conversion process may end up reducing the descriptive nuance of your encoding. (Conversely, if you find that you lose nothing by the conversion, you may realize that your customizations were unnecessary.)
  • Non-XML output formats: For some publication purposes it may be useful to produce outputs in non-XML formats, such as PDF for producing printed output, or a tab-delimited data set listing poems and the languages they are written in, for use in a spreadsheet program.
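The TEI-to-HTML conversion mentioned in the first item above would normally be done in XSLT, but its essential shape—walking the TEI tree and emitting an HTML equivalent for each element—can be sketched in a few lines. The element mapping here is a toy assumption covering only a handful of elements:

```python
import xml.etree.ElementTree as ET

# A toy TEI-to-HTML converter: each TEI element is mapped to an HTML
# element, with anything unrecognized falling back to <span>. A real
# project would use XSLT and a far richer mapping.
TAG_MAP = {"div": "div", "head": "h2", "p": "p", "hi": "em"}

def tei_to_html(elem):
    out = ET.Element(TAG_MAP.get(elem.tag, "span"))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(tei_to_html(child))
    return out

tei = ET.fromstring('<div><head>Chapter 1</head><p>Some <hi>emphatic</hi> prose.</p></div>')
print(ET.tostring(tei_to_html(tei), encoding="unicode"))
# -> <div><h2>Chapter 1</h2><p>Some <em>emphatic</em> prose.</p></div>
```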
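The line-break example in the second item above—deriving a non-archival copy for a search engine that cannot see across lb boundaries—can be sketched as a string-level transformation. This assumes empty lb elements (possibly carrying attributes); a real pipeline would more likely perform this step in XSLT on the parsed tree:

```python
import re

# Derive a search-friendly copy by deleting <lb/> elements, so that
# words broken across line breaks rejoin. The archival file is left
# untouched; only the derived copy loses the line-break encoding.
LB = re.compile(r"<lb\b[^>]*/>\s*")

def strip_line_breaks(xml_text):
    return LB.sub("", xml_text)

archival = "<p>a poem of consider<lb/>able length</p>"
print(strip_line_breaks(archival))
# -> <p>a poem of considerable length</p>
```

The key point is the direction of the derivation: the richer archival encoding is preserved, and the impoverished copy exists only to accommodate a particular tool.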
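Finally, the tab-delimited data set mentioned in the last item above might be produced along these lines. The element and attribute choices here (an lg element of type poem carrying xml:lang, with a head for the title) are illustrative assumptions, not a fixed TEI recipe:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def poems_to_tsv(xml_text):
    """Extract one tab-delimited row (title, language) per poem."""
    root = ET.fromstring(xml_text)
    rows = []
    for poem in root.iter("lg"):
        if poem.get("type") == "poem":
            title = poem.findtext("head", default="(untitled)")
            lang = poem.get(XML_LANG, "unknown")
            rows.append(title + "\t" + lang)
    return "\n".join(rows)

doc = """<body>
  <lg type="poem" xml:lang="la"><head>Ad Lectorem</head><l>...</l></lg>
  <lg type="poem" xml:lang="en"><head>To the Reader</head><l>...</l></lg>
</body>"""
print(poems_to_tsv(doc))
# -> Ad Lectorem	la
#    To the Reader	en
```

The resulting file opens directly in any spreadsheet program, which is precisely the appeal of such lightweight non-XML derivatives.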