Documentation
Documentation is arguably the most important part of any text encoding project, and the most neglected. There is a natural human tendency to make decisions and take action first and write down the results later, and in digital projects this tendency is reinforced by the need to show progress to funders and readers. When resources are scarce, digitizing additional texts can seem more important than spending funds on writing about how the work was done. However, for text encoding projects of the kind that we are considering in this guide, there are several crucial differences that shift the balance. First, most scholarly text encoding projects are producing materials for which the methodology is arguably as important as the content. In order to be useful and credible, scholarly research tools must state their methods openly and must be able to show that those methods have been carried out systematically. Without information on how a text was transcribed and edited, a researcher may be unable to determine its value or interpret its contents. For example, the significance of a search result may be profoundly altered by whether personal names are being distinguished from place names automatically or by a human encoder.
Second, the value of scholarly text collections depends on their consistency of approach. However, most scholarly text encoding projects are complex, long-term undertakings involving more than one person. If they endure for more than a year, they are very likely to employ a succession of people, many of them students working for a year or only a few months. Without documentation, it is nearly impossible to ensure that the transcription and encoding will be done consistently over time. An investment of effort in documenting your practices will save many times that effort in redoing misdirected work. The effect on morale is also immense: discovering errors is burdensome and difficult work, and the knowledge that the problems could have been avoided is demoralizing.
Finally, scholarly projects (particularly at this early stage of the digital domain) are inevitably setting precedents for each other, and learning things that may advance our understanding of how digital work is done. By documenting your ideas and practices, you make the value of your work more public, and you also enable others to offer advice you might not have anticipated.
We recommend documenting the following areas of your work, in as much detail as you can afford.
- Your schema and customizations: Even if you are using a fairly standard TEI schema without any locally created elements, you should document the schema you are using. This means indicating which TEI modules or tagsets you are using, the elements (if any) you have eliminated, and the attribute values for which you have created controlled vocabularies. If you are using a customized schema, you should keep a copy of the TEI customization file, and you should document the specific customizations you have made, including the reason you made them. (This is as much for yourself as for others: you will inevitably forget why you needed that particular element, or why it had that particular structure.) For more information on customization, see TEI Customization.
- Your encoding practices: In addition to documenting the schema itself, you should also document how you use it. For any TEI schema, there are innumerable ways of actually doing the encoding, and to ensure consistency in your encoded files you should document which elements you use for what purpose, how you recognize specific textual features (how do you identify place names? what if they are not capitalized? is the grassy knoll a place name?), and which features should be encoded and which should be ignored. This documentation may also serve as the basis for your encoder training, and as a reference tool so that encoders can look up specific tags or features as they work.
- Your editorial practices: As with any editing project, documentation of editorial practices is essential both for readers and to help you work consistently. This documentation should include your transcription practices (how you handle illegibility, how you regularize spacing, how you treat line-end hyphenation, whether you preserve original capitalization and typeface, and so forth) and also your editorial practices such as handling of variant readings, choice of source, whether you silently emend typographical errors, and the like. The TEI header provides ways of expressing this kind of documentation as part of the encoded file, but it should also exist as an accessible part of your project’s publication environment.
- Your tools and procedures: Any tools that you develop for your own use should be documented, partly so that you can easily train people to use them and partly so that your next programmer can support them. The code should be well commented, and there should also be a file describing what the tool is for and how it is used. If you use several tools (for instance, error checking tools, transformation scripts, etc.), it is a good idea to maintain a list that indicates where the source code for each one is located, who wrote it, and where the documentation is. With staff turnover, it is remarkable how quickly information like this is lost, and it can often be essential in troubleshooting or updating a tool. Procedures should also be documented, again to help with training and ensure consistency, but also to help you remember how you have decided to handle your texts and why.
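To make the first and third recommendations more concrete, here are two short sketches in TEI XML. They are illustrations under stated assumptions, not recommended settings: the schema name `exampleProject`, the choice of modules, the removed element, the controlled vocabulary, and the wording of every policy statement are hypothetical and would need to be replaced with your project's own decisions. The first fragment assumes the customization file is written in the TEI's ODD format and shows the module selection, an eliminated element, and a closed value list, with the reasons recorded alongside each change.

```xml
<!-- Hypothetical customization fragment; all names and choices are illustrative. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="exampleProject" start="TEI">
  <!-- Modules in use. -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="textstructure"/>
  <!-- Eliminated element: we assume this project never marks direct speech. -->
  <moduleRef key="core" except="said"/>
  <!-- Only the two naming elements this project needs. -->
  <moduleRef key="namesdates" include="persName placeName"/>

  <!-- Controlled vocabulary for name/@type, so encoders cannot invent values. -->
  <elementSpec ident="name" module="core" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="closed" mode="replace">
          <valItem ident="person">
            <desc>A personal name, identified by a human encoder.</desc>
          </valItem>
          <valItem ident="place">
            <desc>A place name, whether or not it is capitalized in the source.</desc>
          </valItem>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```

The second fragment sketches how transcription and editorial practices might be recorded in the TEI header of an encoded file. The attribute values shown are ones defined by the TEI Guidelines, but which values apply, and the prose that explains them, are assumptions for the sake of the example.

```xml
<!-- Hypothetical header fragment; the policies described are illustrative. -->
<encodingDesc xmlns="http://www.tei-c.org/ns/1.0">
  <editorialDecl>
    <correction status="medium" method="markup">
      <p>Typographical errors are not silently emended. The original reading
         and the correction are both recorded in the markup.</p>
    </correction>
    <normalization method="markup">
      <p>Original capitalization is preserved. Any regularized form is
         recorded alongside the original reading.</p>
    </normalization>
    <hyphenation eol="some">
      <p>Line-end hyphens are retained only where the word would be
         hyphenated in running text.</p>
    </hyphenation>
  </editorialDecl>
</encodingDesc>
```

Keeping statements like these next to the schema and inside each file means that a later encoder, or a reader of the published collection, can check a policy without having to reconstruct it from the encoded texts.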