In creating TEI-encoded texts, there exist all of the same opportunities for error that plague any kind of publication. There will inevitably be challenges in maintaining the consistency and accuracy of both the transcription and the encoding process, and these challenges increase with the number of people involved and also with the complexity and detail of the encoding. However, because the encoding can be read and examined with XML-aware software tools, errors and inconsistencies in the encoding can be discovered and sometimes even fixed automatically. There are also various techniques (some obvious and familiar, others less so) for catching errors of transcription. We discuss below some methods that have been used by the WWP and other text encoding projects, which may prove useful.
Both hand transcription and OCR produce texts that are likely to contain typographical errors, no matter how much care is taken to avoid them. Probably the most accurate method of text capture is double keying by a vendor, in which the text is typed in twice and the two results are compared, revealing any typographical errors; error rates from this process can be as low as one character error in 20 pages of text (an accuracy rate of 99.995%). However, even with this method you will need to be prepared to check the output you receive to make sure it meets the specified levels of accuracy. And with any other transcription method, proofreading will be an essential part of your workflow.
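Double keying works because two typists are unlikely to make the same mistake in the same place, so any disagreement between the two versions marks a likely typo. The comparison step can be sketched as a simple character-level diff; this is an illustrative sketch only (the function name and sample text are our own, not part of any vendor's actual process):

```python
import difflib

def compare_keyings(version_a: str, version_b: str) -> list[str]:
    """Report character-level discrepancies between two independently
    keyed transcriptions of the same passage (illustrative sketch)."""
    discrepancies = []
    matcher = difflib.SequenceMatcher(None, version_a, version_b)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # Each non-matching span is a place where the two keyings differ,
            # and therefore a place where at least one typist erred.
            discrepancies.append(
                f"{tag}: {version_a[a1:a2]!r} vs {version_b[b1:b2]!r}"
            )
    return discrepancies

# A disagreement flags a likely typo in one of the two keyings.
print(compare_keyings("The quick brown fox", "The qu1ck brown fox"))
```

In practice the vendor resolves each flagged disagreement against the source page, which is why the combined error rate is so much lower than that of a single typist.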
Proofreading an XML-encoded document is slightly more complex than proofreading an ordinary transcription, because some of the information you are checking may be captured in the encoding itself rather than in the content. For instance, if you are representing typographical errors in the original by encoding both the error and a corrected reading, you need to ascertain that both are correct. This means that whatever method you use to display or print the text for proofreading will need to provide access to both readings in some manner. Similarly, any information about presentation (such as font shifts, indentation, and so forth) will need to be made available for proofreading. It is not difficult to design a proofreading output (either printable or viewable online) that will show all the information you need in a meaningful way, but it may take some thought and also some training of your proofreaders to make sure that they understand what they are supposed to be looking for. You can also proofread the XML-encoded file itself (again, by printing it out or viewing it online), and this can be very helpful as a way of identifying errors in the encoding, but it may be harder to catch typographical errors in the transcription using such a view, because the presence of the markup may pose a distraction. We recommend two proofreading passes: one in which the XML itself is proofread, and a second pass using some form of formatted output to allow the proofreader to catch any errors that slipped through the first pass. In both cases, the proofreader would be comparing the text against the source copy, line by line. This is particularly important if your transcription captures old-style spellings or errors in the original, which cannot be checked without reference to the source.
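As one illustration of a proofreading output that provides access to both readings: if errors in the original are encoded TEI-style with sic and corr inside a choice element, a script can flatten each passage twice, once per reading, so proofreaders can check each view against the source. The following is a simplified Python sketch under our own assumptions (TEI namespaces omitted; the sample markup is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment: the source reads "phisick",
# and the encoder has supplied a corrected reading as well.
SAMPLE = ("<p>She studied <choice><sic>phisick</sic>"
          "<corr>physick</corr></choice> at home.</p>")

def reading(xml_text: str, drop: str) -> str:
    """Produce a plain-text proofreading view in which every <choice>
    keeps only one of its readings (simplified: no TEI namespace)."""
    root = ET.fromstring(xml_text)
    for choice in root.iter("choice"):
        for el in list(choice):
            if el.tag == drop:
                choice.remove(el)
    return "".join(root.itertext())

# One proofreading view per reading: check each against the source page.
print(reading(SAMPLE, drop="corr"))  # view showing the original spelling
print(reading(SAMPLE, drop="sic"))   # view showing the corrected reading
```

A real proofreading output would of course be formatted rather than plain text, but the principle is the same: no encoded reading should be invisible to the proofreader.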
The simplest kinds of encoding errors are errors of invalidity and ill-formedness: cases where the XML markup is simply broken. The easiest way to deal with such errors is to prevent them before they occur. A good XML editor (some are described in the section on transcription and markup) will ensure that your XML is well-formed, and will read your schema (the rules that define the structure of your documents) and know which elements are valid. Software of this sort will help constrain your encoding so that invalid markup is identified immediately (or, better still, is prevented before it happens) and can be fixed promptly. It can also help encoders identify the correct element to use, and can thus prevent common mistakes. Creating many kinds of XML errors under these circumstances will actually take some ingenuity, or else gross inattention and negligence. As a final safeguard, it is essential to check the validity of every file as part of your regular error-checking process.
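Such a safeguard can be as simple as a script that attempts to parse each document and reports any that fail. Note that the sketch below checks well-formedness only; validating against your schema requires a schema-aware tool such as a validating XML editor or a validating library, and the function name here is our own:

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text: str):
    """Return None if the document parses cleanly, or the parser's error
    message if it does not. (Well-formedness only; schema validation
    requires a separate, schema-aware tool.)"""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)

# A routine check might read every file in the project and report failures.
print(well_formedness_error("<div><p>fine</p></div>"))  # no error
print(well_formedness_error("<div><p>broken</div>"))    # mismatched tag
```

Run over every file at regular intervals (for example, before each backup or release), a check like this catches broken markup before it propagates.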
If you are using good XML editing software to support your transcription, the simple kinds of encoding errors described above will be blessedly rare. However, XML software can only prevent errors involving violations of the rules established in your schema. Encoding errors that involve using the wrong element—as long as that element does not violate those rules—will not be caught. Because humanities texts are so complex, the TEI schema is necessarily complex as well, and there are typically a fairly large number of elements that are valid in any given context. If an encoder encodes a passage of text as verse rather than prose, or encodes a word as a place name rather than a personal name, the XML software has no way of noticing the error. For this reason, it’s important to have additional mechanisms in place for checking the encoding. These mechanisms can take several forms.
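One such mechanism is a set of project-specific rules, in the spirit of Schematron, that check constraints the schema cannot express. The sketch below is purely illustrative: the authority list and both rules are invented for the example, and a real project would maintain such rules in a dedicated rule language or checking tool:

```python
import xml.etree.ElementTree as ET

# Hypothetical project authority list of known personal names.
KNOWN_PERSONS = {"Katherine Philips", "Margaret Cavendish"}

def custom_checks(xml_text: str) -> list[str]:
    """Run Schematron-style checks that go beyond schema validity
    (simplified: no TEI namespace; rules are invented examples)."""
    root = ET.fromstring(xml_text)
    problems = []
    # Rule 1: every page break should carry a page number,
    # even though the schema may allow <pb> without one.
    for pb in root.iter("pb"):
        if "n" not in pb.attrib:
            problems.append("pb element missing its n attribute")
    # Rule 2: personal names should match the project authority list,
    # catching, e.g., a place name encoded as a personal name.
    for name in root.iter("persName"):
        if (name.text or "").strip() not in KNOWN_PERSONS:
            problems.append(f"unrecognized persName: {name.text!r}")
    return problems

doc = "<div><pb/><p>A letter from <persName>London</persName>.</p></div>"
print(custom_checks(doc))
```

Checks of this kind cannot prove the encoding correct, but they surface exactly the class of plausible-but-wrong markup that schema validation must let through.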