Encoding Guide for Early Printed Books


Post-processing is, in general terms, any kind of processing performed upon the encoded text after its initial encoding is complete. In practice (and in the context we are discussing) post-processing falls into several categories.

Automated encoding

Automated encoding processes insert markup into the text with little or no direct human intervention. They typically work by detecting textual features of interest (such as names, dates, or quotations) through their formatting or their textual behavior, and their effectiveness depends directly on how consistently and distinctively these features are represented in the text. If all dates consist of four numeric digits, and no other four-digit numbers appear in the text, marking all dates automatically is very easy. Similarly, if all personal names begin with a capital letter, it may be possible to develop a system that identifies names by searching for capitalized words, determining whether they fall at the beginning of a sentence, and testing to see whether they appear in a dictionary of proper nouns. Such systems have a harder time with features that vary widely in their presentation (for instance, place names that are inconsistently capitalized and may include common nouns or personal names: e.g. King of Prussia). In very early texts, where spelling, capitalization, and punctuation may be highly variable, automated encoding may be nearly impossible. For later texts, though, and for large collections of long documents, the payoff may be considerable. It is beyond our scope here to describe automated encoding methods in detail, but a number of projects are exploring such approaches, including the Perseus Project at Tufts University. Automated markup is an important research topic from a computational standpoint, as well as being commercially significant, so although at the moment there are not many tools specifically aimed at humanities projects, the field is developing steadily.
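The four-digit-date scenario above can be sketched very simply. This is a minimal illustration, not a production tool, and it assumes (as the scenario does) that every date is a four-digit number and that no other four-digit numbers occur in the text:

```python
import re

# Rule-based feature detection: wrap every four-digit number in a
# TEI <date> element. This only works under the assumption stated
# above -- that all and only dates appear as four-digit numbers.
DATE_PATTERN = re.compile(r"\b(\d{4})\b")

def tag_dates(text):
    """Insert <date> markup around each four-digit number."""
    return DATE_PATTERN.sub(r"<date>\1</date>", text)

sample = "Printed at London in 1632, and reissued in 1641."
print(tag_dates(sample))
# -> Printed at London in <date>1632</date>, and reissued in <date>1641</date>.
```

The fragility of the approach is visible even here: a four-digit page count or catalogue number would be tagged as a date, which is exactly why such rules work only when the feature's presentation is consistent and distinctive.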

There remains the question of whether automated markup processes are useful. In theory, if dates (for example) can be detected easily through automated processes, they can also be detected easily by search processes, so that the advantage of having the markup actually present in the text might not be very great. In the early days of markup, computer speeds were slow enough that identifying textual features in advance through markup could represent a very significant saving of time. A program that had to identify proper nouns by testing for capitalization, sentence position, and absence from a dictionary could not possibly respond to searches in real time, whereas it could search the text for the persName element reasonably quickly. Modern computer speeds are now approaching rates that shift this balance: it is possible that before long, heuristics for identifying many textual features will be accurate, ingenious, and fast enough that certain kinds of basic markup are not necessary. However, any markup which depends on human recognition or interpretation will still need to be entered by hand.

In addition to fully automated markup, there also exists the possibility of semi-automated processes that help a human being discover possible features quickly, and mark them easily if they are indeed the feature desired. For instance, an automated process might search through the text, find words and phrases that might be names, and allow the encoder to choose whether to mark them as personal, place, or organizational names with a single keystroke—a much quicker process than having the encoder work through the text manually, typing in elements by hand. Semi-automated processes also allow the detection of much more variable and uncertain features, since they permit an element of human judgment and choice. The development of more complex heuristics for identifying textual features may also enable the development of very powerful semi-automated encoding tools.
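The division of labor just described—an automated pass that proposes candidates, and a human who decides—can be sketched as follows. The candidate pattern (naively, runs of capitalized words) and the element names are illustrative assumptions; the decision function stands in for the single-keystroke interface described above:

```python
import re

# Semi-automated tagging: the machine finds candidates, a human decides.
# CANDIDATE naively matches runs of capitalized words; a real tool would
# use better heuristics and an interactive interface.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def tag_candidates(text, decide):
    """Wrap each candidate in the element chosen by decide(), or leave it
    untouched if decide() returns None (i.e. the human rejects it)."""
    def wrap(match):
        element = decide(match.group(0))  # e.g. "persName", "placeName", or None
        if element is None:
            return match.group(0)
        return "<{0}>{1}</{0}>".format(element, match.group(0))
    return CANDIDATE.sub(wrap, text)

# Here a lookup table stands in for the encoder's keystroke-by-keystroke choices.
decisions = {"Margaret Cavendish": "persName", "London": "placeName"}
print(tag_candidates("Margaret Cavendish lived near London.",
                     lambda c: decisions.get(c)))
# -> <persName>Margaret Cavendish</persName> lived near <placeName>London</placeName>.
```

Because the final judgment rests with a human, the candidate pattern can afford to over-generate in a way that a fully automated process could not.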

Error discovery and correction

Processes for error discovery are clearly crucial for any text encoding project, and although human proofreading will always have an important place, there are also a number of tools one can use to catch errors that human beings would find difficult or burdensome to detect. XML validation is of course one of the most important of these, and it acts as the first line of defense by catching gross encoding errors: places where the encoding does not match the rules set out by the schema at all. But validity is only the first step: a valid file may still be full of errors, inconsistencies, and omissions. Additional tools are very helpful in identifying and possibly fixing these problems. We discuss this topic in more detail in Error Checking.
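As a small illustration of automated error discovery, the sketch below checks that a document is at least well-formed XML, using only the Python standard library. Note that this catches only the grossest structural errors; validation against a schema—the first line of defense described above—requires a schema-aware tool such as lxml or a command-line validator:

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_string):
    """Return None if the XML parses cleanly, or a description of the
    first structural error encountered."""
    try:
        ET.fromstring(xml_string)
        return None
    except ET.ParseError as err:
        return str(err)

print(check_well_formed("<div><p>sound structure</p></div>"))  # None
print(check_well_formed("<div><p>missing close tag</div>"))    # error message
```

A batch version of this check, run over a whole directory of transcriptions, is the kind of tool that spares human proofreaders from hunting for unclosed tags by eye.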

Transformation into other formats

One of the most common kinds of post-processing is to transform your XML data into some other format, usually as part of your publication process. These transformations may be performed just once, as the final step in production, or they may be set up as repeatable processes that allow for the source XML to be updated and the output process run again to produce fresh output. If you expect to be making changes to your encoded files that would require updating the output, having repeatable processes is very useful, and in this case they should require as little hand work as possible: ideally, they should be completely automated.

Another principle to bear in mind is that your source XML should match the broadest and most far-reaching goals of your project, rather than your specific (and possibly temporary) publication needs. If you treat your transcribed TEI files as archival sources, from which you can derive specific kinds of output as needed, you gain flexibility and longevity for your data. The transformation process described here thus becomes a way to convert your archival-quality data into whatever particular format you need for some specific purpose.

Some of the most common output functions and formats are:

  • HTML: Because HTML is still the predominant language used in web publishing, it is also the most common destination format. TEI encoding projects typically transcribe materials in TEI/XML, and then, once the text is completely encoded and corrected, transform it into HTML for publication on the web. XSLT makes this process very easy, and it can be used either as part of a one-time transformation or as part of a dynamic system that converts the data on the fly in response to a user request. There are also numerous other ways to convert TEI to HTML, including various kinds of scripts. Such tools and systems are common and comparatively easy to write.
  • Non-archival XML: For many purposes, the archival-quality TEI you create through your encoding process may not be ideal for specific publication purposes. For instance, if you have encoded line breaks in prose using the lb element, some search engines may not be able to search for words that are broken across the line break. It would be short-sighted to neglect the line breaks simply because some software misbehaves; instead, you could generate a derived file for use in your current publication system, which eliminates the lb encoding. Similarly, you might wish to experiment with a publication tool that only works on files encoded with TEI Lite. Again, it would be short-sighted to base your encoding decisions on the behavior of a single tool, but you could generate TEI Lite from your archival TEI files for use in that system.
  • Interchange formats for sharing with other projects: If you have developed significant TEI customizations, but also wish to contribute data to a common repository or a joint encoding project with different encoding practices, it may be very useful to be able to convert your files to a more generic TEI encoding: for instance, converting locally defined elements to a standard TEI equivalent. If your customizations support a more detailed or fine-grained encoding than the TEI allows, this conversion process may end up reducing the descriptive nuance of your encoding. (Conversely, if you find that you lose nothing by the conversion, you may realize that your customizations were unnecessary.)
  • Non-XML output formats: For some publication purposes it may be useful to produce outputs in non-XML formats, such as PDF for producing printed output, or a tab-delimited data set listing poems and the languages they are written in, for use in a spreadsheet program.
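The TEI-to-HTML conversion mentioned in the first item above would normally be done in XSLT, but its essential shape—walking the TEI tree and emitting an HTML equivalent for each element—can be sketched in a few lines. The element mapping here is a toy assumption covering only a handful of elements:

```python
import xml.etree.ElementTree as ET

# A toy TEI-to-HTML converter: each TEI element is mapped to an HTML
# element, with anything unrecognized falling back to <span>. A real
# project would use XSLT and a far richer mapping.
TAG_MAP = {"div": "div", "head": "h2", "p": "p", "hi": "em"}

def tei_to_html(elem):
    out = ET.Element(TAG_MAP.get(elem.tag, "span"))
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(tei_to_html(child))
    return out

tei = ET.fromstring('<div><head>Chapter 1</head><p>Some <hi>emphatic</hi> prose.</p></div>')
print(ET.tostring(tei_to_html(tei), encoding="unicode"))
# -> <div><h2>Chapter 1</h2><p>Some <em>emphatic</em> prose.</p></div>
```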
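The line-break example in the second item above—deriving a non-archival copy for a search engine that cannot see across lb boundaries—can be sketched as a string-level transformation. This assumes empty lb elements (possibly carrying attributes); a real pipeline would more likely perform this step in XSLT on the parsed tree:

```python
import re

# Derive a search-friendly copy by deleting <lb/> elements, so that
# words broken across line breaks rejoin. The archival file is left
# untouched; only the derived copy loses the line-break encoding.
LB = re.compile(r"<lb\b[^>]*/>\s*")

def strip_line_breaks(xml_text):
    return LB.sub("", xml_text)

archival = "<p>a poem of consider<lb/>able length</p>"
print(strip_line_breaks(archival))
# -> <p>a poem of considerable length</p>
```

The key point is the direction of the derivation: the richer archival encoding is preserved, and the impoverished copy exists only to accommodate a particular tool.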
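Finally, the tab-delimited data set mentioned in the last item above might be produced along these lines. The element and attribute choices here (an lg element of type poem carrying xml:lang, with a head for the title) are illustrative assumptions, not a fixed TEI recipe:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def poems_to_tsv(xml_text):
    """Extract one tab-delimited row (title, language) per poem."""
    root = ET.fromstring(xml_text)
    rows = []
    for poem in root.iter("lg"):
        if poem.get("type") == "poem":
            title = poem.findtext("head", default="(untitled)")
            lang = poem.get(XML_LANG, "unknown")
            rows.append(title + "\t" + lang)
    return "\n".join(rows)

doc = """<body>
  <lg type="poem" xml:lang="la"><head>Ad Lectorem</head><l>...</l></lg>
  <lg type="poem" xml:lang="en"><head>To the Reader</head><l>...</l></lg>
</body>"""
print(poems_to_tsv(doc))
# -> Ad Lectorem	la
#    To the Reader	en
```

The resulting file opens directly in any spreadsheet program, which is precisely the appeal of such lightweight non-XML derivatives.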