Encoding document appearance: general points


General notes on what kinds of renditional information we do and do not capture

Although the TEI markup system places the greatest weight on the informational structures of the text, it also allows for the capture of information about textual appearance, and most digital projects are interested, to some degree, in capturing this information. The level of detail and the range of renditional features represented will vary greatly from project to project, depending on the nature of the texts being encoded and the kinds of research purposes to which they will be put. Now that so many projects include digital images of the texts as well as full-text transcriptions, the evidentiary burden on the transcription is lessened: users who are interested in the text’s appearance can get a much better sense from an image than from even the most detailed renditional encoding. However, in cases where the renditional information may directly affect how the text is read (for instance, italics used for emphasis, or a set of specialized terms set in blackletter, or a poem with complex indentation that models its rhyme scheme), it should be represented in the encoding.

For projects that are going to serve a scholarly audience over the long term, we recommend capturing rendition in sufficient detail that the general appearance of the source text can be reproduced. Features like indentation, alignment, font shifts, case, super- and subscription, line and column breaks, and delimiters such as quotation marks are constitutive of the reader’s impression of the text and may convey important historical detail to the reader. In the design of your output you may use this information or may choose to generalize or ignore it for specific purposes, but having the information available strengthens your long-term options. If your funding is limited, or if you cannot foresee using specific pieces of information, then any of these facets can of course be omitted. We do not recommend capturing any information that cannot be represented systematically using a controlled vocabulary or measurement, or any information that cannot be captured with certainty based on the materials you have to work with. For example, if you are transcribing from microfilm, it will be difficult to describe the weight of the paper or the exact size of the type.

There are essentially two ways of representing renditional information in your encoding. The first is to treat the renditional information as evidence of a structural phenomenon and to encode the structural information, documenting the typical rendition for each feature. The documentation allows the original rendition to be recovered, and it also allows stylesheets to represent that rendition to the reader when they format the final output. For example, suppose all emphatic words in the text are set in italics and you encode all of them as emph. As long as you document the connection between italics in the source and emph in your transcription, you can display the text with all instances of emph set in italics, as in the original. The italicization is not captured explicitly in the encoding, but it is recorded indirectly.
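As a minimal sketch of this first approach, the fragments below document in the TEI header that emph corresponds to italics in the source, and then use emph in the transcription with no explicit renditional markup. The CSS declaration, the wording of the tagUsage note, and the sample sentence are illustrative assumptions; how defaults are documented in practice is a project decision.

```xml
<!-- In the TEI header: record the typical rendition of emph. -->
<encodingDesc>
  <tagsDecl>
    <!-- Assumed example: a CSS description that applies to every emph. -->
    <rendition scheme="css" selector="emph">font-style: italic;</rendition>
    <namespace name="http://www.tei-c.org/ns/1.0">
      <!-- The prose note is invented for this example. -->
      <tagUsage gi="emph">Emphasized words are set in italics in the source.</tagUsage>
    </namespace>
  </tagsDecl>
</encodingDesc>

<!-- In the transcription: the italics are implied by emph, not stated. -->
<p>She had <emph>never</emph> agreed to the plan.</p>
```

A stylesheet can then render every emph in italics, which reproduces the source only as long as the documented connection holds for every instance.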

This approach works well if two conditions are met:

If even a single emphasized word is treated differently (set in all capitals, for example), this system will fail. It also carries the risk of tempting encoders to ignore small variations: either by encoding them as emph and misrepresenting their appearance or, possibly worse, by failing to encode them as emph because their appearance is different.

The second approach is to represent the renditional information explicitly in the encoding. The TEI provides a rend attribute that is available on all appropriate elements, and this attribute can be used in simple or complex ways to describe the appearance of the source text. The TEI does not specify how the values for the rend attribute should be defined; we have found that in order to capture complex renditional information it’s necessary to impose some internal structure on this attribute, using a system we call rendition ladders. (For more information see the entry on rendition ladders.) This approach allows many different aspects of rendition (such as case, font, and indentation) to be captured in the single rend attribute without confusion. We have designed a fairly robust descriptive system that is documented in this guide and may be extended as necessary by other projects.
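A brief sketch of explicit renditional encoding follows. Because the TEI leaves the values of rend undefined, the keyword(value) notation used here is an assumed, illustrative syntax and should not be read as the definitive form; the entry on rendition ladders describes the actual system. The sample text is invented.

```xml
<!-- Each emph states its own appearance in rend (assumed notation). -->
<p>She had <emph rend="slant(italic)">never</emph> agreed, and she
  <emph rend="case(allcaps)">ALWAYS</emph> said so.</p>

<!-- Several aspects of rendition combined in a single rend value. -->
<l rend="indent(2)slant(italic)">And softly fell the evening rain</l>
```

Giving each aspect of rendition its own keyword is what keeps a combined value such as the one on the verse line unambiguous.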

In practice, combining these two approaches is most effective. Since rendition is often fairly consistent throughout a text, for many elements it is possible to identify a default rendition that will cover nearly all cases. Specific variations, or elements whose rendition is highly variable (such as the indentation of verse lines), can be captured using the rend attribute. For information on encoding default rendition, see renditional defaults.
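The combined approach might look like the following sketch. It assumes that the header documents italics as the default rendition of emph, and it uses an illustrative keyword(value) notation for rend values; both the notation and the sample stanza are assumptions made for this example.

```xml
<lg type="stanza">
  <!-- No rend needed: this emph follows the documented default (italics). -->
  <l>The first line <emph>begins</emph> at the margin,</l>
  <!-- Verse indentation varies line by line, so it is recorded explicitly. -->
  <l rend="indent(1)">the second is set in by one step,</l>
  <l>the third returns to the margin,</l>
  <!-- An exception to the default is recorded on the element itself. -->
  <l rend="indent(1)">and the fourth is <emph rend="case(allcaps)">LOUD</emph>.</l>
</lg>
```

Only the departures from the default and the highly variable features carry a rend value, which keeps the transcription compact while still allowing the source's appearance to be recovered.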

This guide provides detailed entries on a number of topics in the encoding of rendition: