Encoding Guide for Early Printed Books


Transcription and Markup

Transcription of the text may take place in a number of ways, and it is worth observing from the start that in many cases transcription and markup will take place as part of a single process, rather than being done separately. We treat them here together as closely related activities.

Transcription processes and methods will be familiar to anyone who has edited or even studied a primary source; guidance is provided in standard sources such as D. C. Greetham’s Textual Scholarship: An Introduction or Scholarly Editing: A Guide to Research. The challenge is to determine how much of the source information to capture (given the impossibility—not just practical, but theoretical—of capturing it all), and how to represent the various complexities of the text: illegibility, insertions, font shifts, and so forth. In a digital transcription as in a hand- or typewritten transcription, one is moving between media with very different expressive capacities: the transcription is a model, not a replica, of its original.

This guide, of course, is considering transcription in a very specific context: the creation of a TEI-encoded document. Unlike a typical diplomatic transcription, or a transcription using a word processor, TEI encoding allows (and requires) you to express all of the textual features that are not simple characters and spaces using TEI markup. For instance, if a span of text is in italics, rather than underlining it or using some other convention, you mark the text with the appropriate TEI element. (Which element to use is covered in the encoding portion of this guide.) Similarly, if a passage of text is deleted, instead of using a transcriptional convention such as a strikethrough, you mark the text with a TEI element that says, in effect, that this text was deleted in the source. This may appear to be a departure from traditional transcriptional methods, in which the presentational features of the source were represented using conventions such as underlining, brackets, and other commonly available marks. However, since those conventions are themselves a form of markup, using TEI markup is not fundamentally different: it uses a different set of codes, and it provides ways of marking additional features, but the essential goal of capturing the details of the source text is the same. The transcription and markup process—including methods of capturing the text itself, basic issues of transcription, and approaches for adding markup—is described in more detail below.
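
As a minimal sketch of the two cases just mentioned (the elements shown here, <hi> and <del>, are the usual TEI choices, but the encoding portion of this guide discusses the options, and the values of the rend attribute are defined by each project):

  <!-- a span printed in italics in the source -->
  <hi rend="italic">text printed in italics</hi>

  <!-- a passage struck out in the source -->
  <del rend="strikethrough">text deleted in the source</del>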

Methods of capture

The first step (logically if not always temporally) in transcribing and tagging the text is actually getting the words of the text—the stream of letters and spaces—into digital form. There are a number of ways of accomplishing this, each with advantages, disadvantages, and costs:

  • Hand transcription by project participants
  • Hand transcription by a vendor
  • Scanning and OCR

In hand transcription, the text is transcribed by typing it directly into the computer. It makes sense to insert some markup during this process as well, since there are usually features of the text (such as font shifts, formatting, and so forth) that will need to be represented in some manner; doing so with TEI markup is more efficient than using some other set of conventions and then converting the results later. Hand transcription may of course be used on any text, but it is particularly applicable in cases where the text cannot be digitized by the automatic means described below: for instance, if the text is in manuscript, or is in very small, old, or variable type, or if it has suffered damage that affects legibility, or if the document is fragile or rare and cannot be disbound for scanning. The transcription may be performed by project participants (such as students, library staff, faculty, or volunteers) or by a vendor. For high volumes of comparatively simple text, such as large collections of novels, using a vendor may be the fastest and most cost-effective way to capture the text, and vendors are increasingly able to provide basic TEI markup as part of the transcription process. This approach may also be the most accurate: vendors often use a double-keying process in which the text is typed in twice and then compared for differences. You can specify a guaranteed accuracy rate, which may be as high as roughly 99.995% (on the order of one erroneous character in every 20 pages of typical text); the cost increases as the guaranteed accuracy rises. For small quantities of text, or for cases where the text is esoteric or difficult to read, using a vendor may be more trouble than it’s worth.

Scanning and OCR is an automated process by which the text can be captured in digital form. The text is first scanned as a digital image, and the image is then processed by software that recognizes and interprets the letter shapes and captures them as text. This process is steadily improving but is still not perfect, and it works best on documents that are clearly printed. For older texts, in which there may be foxing or other damage and the typesetting may be irregular, the resulting text may need to be hand-corrected to catch errors. Manuscript documents still cannot be digitized in this way at all. Scanning and OCR can be done in-house, but they are typically performed either by vendors or by larger digitization units within a library, where large quantities of documents can be processed quickly; this often involves disbinding the book so that individual sheets can be fed through the scanner automatically. Robotic systems which include automated page turning, for books that cannot be disbound, are now emerging, but they are still rare and expensive. Some OCR systems will insert basic XML encoding to represent the document structures that can be inferred from the formatting (such as paragraphs, headings, line and page breaks, and the like), but this encoding should be hand-checked for accuracy.
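
As a purely schematic illustration, the structural features that can be inferred from formatting alone might, once converted to TEI, look something like this (all values shown are placeholders):

  <!-- page break, heading, paragraph, and line break inferred from layout -->
  <pb n="12"/>
  <head>CHAPTER II.</head>
  <p>First paragraph of the chapter, with a line break<lb/>
  recorded where the original formatting indicates one ...</p>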

More detailed information about these processes can be found in the following sources:

Methods of markup

As part of the initial process of text capture, or as its immediate sequel, markup is added to the text to represent its internal structures and important features.

If you are transcribing documents by hand, the likelihood is that you will be marking them up by hand as well. Because XML markup consists of ordinary text characters, it can simply be typed into the stream of text using any ordinary text editor, or even a word processor, as long as the results are saved as plain text (with no formatting). However, there is now a genre of software designed specifically to assist in creating and editing XML documents, and to support the kinds of processes that people doing this work find useful. These XML editors are typically capable of reading and understanding the rules (the schema or DTD files) that govern the structure of your documents, and they can therefore detect errors in your encoding, and even prevent them in advance by allowing you to enter only elements that the rules permit at a given point. In addition, they may offer features that speed the encoding process by reducing the amount of typing needed to insert the markup. Examples of XML editors currently in widespread use include <oXygen/>, Emacs, jEdit, and XMLSpy, among many others. They vary in their power, their features, and how easy they are to learn and use. To make a good choice, it is best to talk with other text encoding projects and see what they recommend.
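
As a small illustration of both points: the markup below is nothing more than ordinary characters typed into the text stream, and a schema-aware editor would immediately flag the second version, because TEI element names are case-sensitive and <persname> (unlike <persName>) is not a TEI element. (The name is a placeholder.)

  <!-- correctly typed markup: just ordinary characters in the text stream -->
  <p>A letter from <persName>Margaret Cavendish</persName> to her publisher.</p>

  <!-- an easy slip to make by eye, but one a schema-aware editor will catch -->
  <p>A letter from <persname>Margaret Cavendish</persname> to her publisher.</p>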

Markup can also be inserted automatically, using information that is already present in the transcribed file to infer the presence of structural and content features that need to be encoded. In a word-processing file or in the output of an OCR process, there may be sufficient formatting information (such as paragraph indentation or inter-paragraph spacing, font shifts, marginal white space, and so forth) to allow for automated encoding of pages, columns, line breaks, paragraphs, sections, and other simple structural features. It may also be possible to automatically encode names (using capitalization and comparison with a dictionary), sentences (using punctuation and spacing), dates (using number patterns), and quotations (from the presence of quotation marks). Clearly the success of the encoding will depend on the consistency of the textual symptoms you use as a basis; one missing quotation mark could produce wildly incorrect results. In some documents, automated routines of this sort may be very productive, and in others they may be more trouble than they are worth. In your document analysis, it is worth taking stock of the various features that might lend themselves to such automation. The tools for automated encoding of this sort typically do not exist in standard, public form, partly because the heuristics for successfully identifying these features vary so much from one set of documents to another; most projects that use them either build them themselves, or modify tools built by another project.
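
To give a rough sense of what such a routine involves, the following XSLT 2.0 sketch (our own illustration, not a standard tool) wraps any text found between double quotation marks inside a paragraph in a TEI <quote> element; a real up-conversion stylesheet would need to be tuned to the quotation conventions, and the inconsistencies, of your particular documents.

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tei="http://www.tei-c.org/ns/1.0"
      xmlns="http://www.tei-c.org/ns/1.0"
      exclude-result-prefixes="tei">

    <!-- copy everything through unchanged by default -->
    <xsl:template match="@* | node()">
      <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- within paragraphs, wrap text between double quotation marks in <quote> -->
    <xsl:template match="tei:p/text()">
      <xsl:analyze-string select="." regex="&quot;([^&quot;]+)&quot;">
        <xsl:matching-substring>
          <quote><xsl:value-of select="regex-group(1)"/></quote>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

  </xsl:stylesheet>

Note that this sketch assumes plain double quotation marks and silently discards them, recording the quotation only through the markup; whether to retain such characters as text or represent them solely by encoding is exactly the kind of decision your transcription policy needs to settle in advance.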

If you are tagging your texts by hand, and particularly if your project involves multiple encoders, there are also techniques you can use to help ensure consistency:

  • Develop and maintain a template for new documents. It should include, at a minimum, a skeletal TEI header with all the elements required for your project, plus a simple document outline showing the basic elements that will appear in all of your documents. For instance, if you are encoding a collection of letters, your template might include elements for the opener of each letter (the dateline, salutation, and so forth); a sketch of such a template follows this list. We provide some sample templates to illustrate how this might work, and these may be taken and modified as needed to suit local practice.
  • Develop stylesheets to display and check essential information. Stylesheets describe how documents are presented on screen and in print. A useful rule of thumb is that any time you display your data, in any format at all, you will instantly spot inconsistencies and errors. For this reason it is a good idea to develop a variety of displays for internal use, so that you are the first to see these errors. Create a simple stylesheet using XSLT or CSS (we provide some samples, and a brief sketch also follows this list) that makes it easy to see when essential information is missing or inconsistent. Your encoders can use the stylesheet as a quick check on whether they are doing things right, and project managers can use it to look for errors and inconsistencies across multiple encoded files.
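
By way of illustration, a skeletal template for a collection of letters might begin along these lines (a deliberately minimal sketch; the sample templates provided with this guide are fuller, and the header elements required will differ from project to project):

  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title><!-- title of the letter --></title>
        </titleStmt>
        <publicationStmt>
          <p><!-- project publication statement --></p>
        </publicationStmt>
        <sourceDesc>
          <p><!-- description of the source document --></p>
        </sourceDesc>
      </fileDesc>
    </teiHeader>
    <text>
      <body>
        <div type="letter">
          <opener>
            <dateline><!-- place and date --></dateline>
            <salute><!-- salutation --></salute>
          </opener>
          <p><!-- body of the letter --></p>
          <closer>
            <salute><!-- closing salutation --></salute>
            <signed><!-- signature --></signed>
          </closer>
        </div>
      </body>
    </text>
  </TEI>

And as one example of a checking stylesheet, the small XSLT sketch below (again our own illustration, assuming the TEI namespace shown in the template above) reports the title recorded in each file, or warns when it has been left empty; run across a batch of files, even a check this simple tends to surface missing and inconsistent metadata quickly.

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tei="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>

    <!-- print each title in the header, or a warning if it has been left empty -->
    <xsl:template match="/">
      <xsl:for-each select="//tei:titleStmt/tei:title">
        <xsl:choose>
          <xsl:when test="normalize-space(.) = ''">
            <xsl:text>WARNING: empty title element&#10;</xsl:text>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
            <xsl:text>&#10;</xsl:text>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </xsl:template>
  </xsl:stylesheet>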