Basic Manuscript and Physical Document Encoding

Julia Flanders

2013-04-26

Some philosophical issues

Note that these are really two separate, though closely related issues:

There are some aspects of the encoding of physical document structures which are common to print and MS documents, so we treat them together

Similarly, there are some issues having to do with our perception of the physical exemplar (esp. having to do with legibility and conjecture) that are common to both.

By and large, the TEI is focused, methodologically, on the text as linguistic rather than material information: its encoding provisions for genre, language, and content are rich and detailed, while its provisions for material information are fairly minimal. Textual materiality poses some interesting conceptual problems for markup systems:

We raise these mostly as signposts to issues that may be of interest, rather than attempting an adequate treatment here; there is a lot of lively debate on this topic, and if you're interested we can provide some pointers.

For now, we’re just going to cover some practical encoding points.

A sample manuscript page

A sample manuscript page illustrating various MS features.

Basic prose tagging

These are simple, intuitive elements, many of which have direct counterparts in HTML.
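A minimal sketch of this kind of basic prose tagging (the sample text is invented for illustration); the rough HTML counterparts are div ~ div, head ~ h1, p ~ p, hi ~ em, and list/item ~ ul/li:

    <div type="chapter">
       <head>Chapter the First</head>
       <p>It was a <hi rend="italic">very</hi> dark and stormy night.</p>
       <list>
          <item>first point</item>
          <item>second point</item>
       </list>
    </div>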

Transcriptional complexities: choices

In these examples we’re still looking at parallelism, but instead of managing it through a linking mechanism, we manage it through an enclosing element.

These examples don’t actually violate the ideal document tree view, but they do make it slightly more complex: almost as if a twig has split and then rejoined.

This approach is useful for smaller, more local instances of parallel text. A number of kinds of local editorial change are commonly made in the process of transcription and editing: processes of regularization and correction that, in print, are often done silently and noted in an introduction:

In print-based editing, these choices are exclusionary: whichever kind of reading you decide to show the reader, its complementary version has to be suppressed (it could be indicated in a note or an appendix but it can’t typically be displayed as part of the regular reading surface)

In an XML transcription, however, it’s possible to represent both (or in principle multiple) readings in a data structure that shows their parallelism and treats them as alternatives, which can then be chosen (displayed, searched, etc.) when desired.

In TEI, this mechanism is the choice element, which represents a moment of textual forking, where instead of a single reading the text offers a choice of readings
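A minimal sketch of choice with the three most common TEI pairings (orig/reg for regularization, sic/corr for correction, abbr/expan for abbreviation); the sample readings are invented for illustration:

    <p>She
       <choice><orig>doe's</orig><reg>does</reg></choice> not
       <choice><sic>beleive</sic><corr>believe</corr></choice> the
       <choice><abbr>Dr</abbr><expan>Doctor</expan></choice> at all.</p>

Each pair records both the source reading and the editorial alternative, so a later process can choose which one to display or search without losing the other.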

Transcriptional complexities: revision

What’s at stake here: because the transcription of manuscript materials (and often printed texts as well) involves significant effort of decipherment, and in many cases conjecture or interpretation, and because primary sources are informationally complex (authorial revision, erasures, missing letters, illegible passages, etc.), a responsible transcription needs to capture not just the end product but also information about the process and the editorial decision-making: not a clean-looking, innocent, butter-wouldn’t-melt-in-its-mouth transcription, but one that preserves information about what was difficult or unclear.

Conventions for accomplishing this are familiar from print: carets and brackets for marking insertions and deletions, italics to indicate unclear text, footnotes to propose hypothetical readings or to describe damaged sections.

In text markup, the goal is to formalize as much of this information as possible and represent it systematically

Next: the basic encoding features for this purpose: unclear, supplied, gap, add, and del.
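A minimal sketch showing these elements together; the sample text, reasons, and measurements are invented for illustration:

    <p>The letter was
       <del rend="strikethrough">sent</del>
       <add place="above">delivered</add> on the
       <unclear reason="faded">fourth</unclear> of
       <supplied reason="omitted">June</supplied>; the rest of the line is lost:
       <gap reason="damage" quantity="12" unit="chars"/></p>

Here del and add record the author's revision, unclear and supplied record the editor's conjectures, and gap records text that cannot be transcribed at all.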

Encoding the physical document

In TEI, the primary emphasis of the encoding is on the text stream (paragraphs, divisions, and so forth)
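Physical features such as page and line breaks are therefore typically recorded as empty milestone elements (pb, lb) interleaved with that text stream; a minimal sketch, with invented sample text:

    <p>This paragraph begins near the foot of one page
       <pb n="4"/>and carries over onto the next,
       <lb/>with each new line of the source
       <lb/>marked where it occurs.</p>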