Basic Tagging

A Sample Text Fragment

This is a deliberately simple example that we’ve put together to demonstrate basic markup. It has some features that are common to prose documents:

a heading
a few paragraphs
a list
some embedded features like names, quotations, foreign words

It’s worth noting before we even look at the markup that in fact there’s already structure being expressed here, using typography and layout, but the structure is intended to be legible by humans. Part of what we’re doing in our XML markup is to translate this typographical markup into a form that the computer can understand.

Basic Prose Tagging

These are simple, intuitive elements, many of which have direct counterparts in HTML.

There are several levels of structure being represented here:

Large-scale elements that represent major document divisions:

<div>: basic structural chunking, can nest recursively inside itself

Mid-level structural elements that can go inside a major division, and that contain words and phrase-level elements:

<head>
<p>
<list>, <item>

Small elements that represent individual words and phrases:

<name>
<date>
<emph>
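As a rough illustration of how these three levels nest, here is a minimal sketch of a marked-up fragment. The text itself (the heading, the names, the date, the list items) is invented for this example and is not the sample fragment discussed above; the TEI namespace and header are omitted to keep the focus on the nesting.

```xml
<!-- Large-scale element: one major division of the document -->
<div>
  <!-- Mid-level elements: a heading, paragraphs, and a list -->
  <head>A Letter from the Coast</head>
  <p>On <date when="1848-03-12">12 March 1848</date>,
    <name>Eliza Hart</name> wrote that the crossing had been
    <emph>very</emph> rough.</p>
  <p>She asked her sister to send:</p>
  <list>
    <item>a warm shawl</item>
    <item>two pairs of stockings</item>
  </list>
</div>
```

Note that the small elements (<date>, <name>, <emph>) sit inside a <p>, which in turn sits inside the <div>. A second <div> could be nested inside this one in the same way.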

Transcriptional Complexities: Choices

In these examples, we’re still looking at parallelism, but instead of managing it through a linking mechanism, we’re managing it in a different way: through an enclosing element.

These examples don’t actually violate the ideal document tree view, but they make it slightly more complex: almost as if a twig has split and then rejoined.

This approach is useful for smaller and more local examples of parallel text. There are a number of kinds of local editorial changes that are often made in the process of transcription and editing: processes of regularization and correction that are often done silently and noted in an introduction:

correction of typographical errors in the source
regularization or modernization of spelling and typography
expansion of abbreviations

In print-based editing, these choices are mutually exclusive: whichever reading you decide to show the reader, its complementary version has to be suppressed. (It could be indicated in a note or an appendix, but it can’t typically be displayed as part of the regular reading surface.)

In an XML transcription, however, it’s possible to represent both (or in principle multiple) readings in a data structure that shows their parallelism and treats them as alternatives, which can then be chosen (displayed, searched, etc.) when desired.

In TEI, this mechanism is the <choice> element, which represents a moment of textual forking, where instead of a single reading the text offers a choice of readings.
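The three kinds of editorial change listed above each pair up naturally inside <choice>. The following is a minimal sketch using standard TEI element pairs (<sic>/<corr>, <orig>/<reg>, <abbr>/<expan>); the sentence being transcribed is invented for illustration.

```xml
<p>I
  <!-- Correction of a typographical error: source reading plus corrected reading -->
  <choice><sic>recieved</sic><corr>received</corr></choice>
  your letter
  <!-- Regularization of spelling: original form plus regularized form -->
  <choice><orig>vpon</orig><reg>upon</reg></choice>
  my return from
  <!-- Expansion of an abbreviation: abbreviated form plus expansion -->
  <choice><abbr>Mr.</abbr><expan>Mister</expan></choice>
  <name>Hart</name>’s house.</p>
```

Each <choice> is the "split twig": the tree forks into two parallel readings and then rejoins. Display software can then be told to show, say, only the <corr>, <reg>, and <expan> readings for a modernized text, or only the <sic>, <orig>, and <abbr> readings for a diplomatic one.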

Transcriptional Complexities: Revision

What’s at stake here: transcribing manuscript materials (and often printed texts as well) involves significant efforts of decipherment, and in many cases conjecture or interpretation. Primary sources are also informationally complex (authorial revision, erasures, missing letters, illegible passages, etc.). A responsible transcription therefore needs to capture not just the end product but also information about the process and the editorial decision-making: not a clean-looking, innocent, butter-wouldn’t-melt-in-its-mouth transcription, but one that preserves information about what was difficult or unclear.

Conventions for accomplishing this are familiar from print: carets and brackets for marking insertions and deletions, italics to indicate unclear text, footnotes to indicate hypothetical readings or to describe damaged sections.

In text markup, the goal is to formalize as much of this information as possible and represent it systematically:

to classify the reasons for illegibility (where possible) and to formalize the criteria for deciding whether a given letter is illegible or simply unclear
to make it possible to control the display of the reading surface of the text: to show or hide the deleted words and hypothetical readings, perhaps even to let the reader control the threshold of conjecture at which readings are displayed or hidden ("only show me things you’re really certain about")

Next: show basic encoding features: <unclear>, <supplied>, <gap>, <add>, <del>
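As a preview of those elements, here is a minimal sketch of how they might appear together in a single transcribed sentence. The sentence is invented, and the attribute values shown (the reasons, the certainty level, the placement of the addition, the extent of the gap) are illustrative assumptions rather than readings from a real source.

```xml
<p>We sailed
  <!-- A word the author struck out, followed by the replacement written above the line -->
  <del rend="overstrike">yesterday</del>
  <add place="above">this morning</add>
  from
  <!-- Hard to read: the transcriber offers a reading and records how certain it is -->
  <unclear reason="faded" cert="medium">Dover</unclear>
  and reached
  <!-- Cannot be read at all: nothing is transcribed, only the extent and the reason -->
  <gap reason="illegible" quantity="6" unit="chars"/>
  by night<!-- Text absent from the source and provided by the editor -->
  <supplied reason="omitted">fall</supplied>.</p>
```

Because the reason and certainty are recorded as attributes rather than in a footnote, a display could in principle hide every <unclear> reading below a chosen certainty level, or toggle deleted words on and off.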