Descriptive Markup

Julia Flanders

2007-05-13/14

What is text encoding?

I'd like to start by situating the activity of text encoding in intellectual space:

Text markup sounds like a technical concept, but like so many things it is a more basic idea that has come to our attention because technology and the resulting media shifts have made us aware of it.

In fact it is an expression of motives and practices that have been around for a long time.

Markup is several things:

Note that markup and text encoding are essentially the same thing, for our purposes.

Text encoding in the ancient world

This example of scriptio continua lacks word breaks, which are a very basic form of text markup, but it does have other kinds of markup: the positioning of the lines and the differently colored inks.

The markup serves to granularize the flow of language and allows the reader to parse its significance more easily.

If we think of written language as a secondary derivation from oral language, this kind of markup is the written equivalent of the pauses and inflections that make spoken words comprehensible (both by separating them from one another and by giving them additional emphasis or coloration). In other words, punctuational markup represents an authentically linguistic level of meaning.

If we think of written language as a separate sign system, then the markup would seem more like an indigenous part of the formalism of written language: part of its distinctive expressiveness.

Either way, this kind of markup becomes an important part of our apprehension of the document: as manuscript and printing practices become more refined, the structured visual presentation of documents becomes a way of conveying much more nuanced information about the structure and semantics of the documents.

Text encoding in the early modern world

By the time print is in the ascendant, the visual markup of the page has become highly formalized, and in addition the structures of printed texts have become significantly codified, so that more complex reading practices are possible:

Text encoding in the digital world

In the digital world of text markup, we are dealing with an entirely different representational system: one that is not rooted in the phenomenology of print, but rather allows us (if we choose) to represent things from a very different perspective.

It is a representational system whose basic components are not information about formatting, but rather information about structure and function.

This approach is often called descriptive markup.

Essentially, descriptive markup is based on the idea that the best way to represent a document digitally is by describing it: not by giving instructions to a particular system on what to do with it, but by saying, in general terms, what each of its parts is.

Underlying this philosophy is the idea that presentation derives from the nature and function of documentary parts:

Three foundational assumptions:

Note that this is a significant departure from earlier kinds of document markup, which served to give instructions to specific processing systems (e.g. typesetting engines) on how to format or process the text.
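To make the idea concrete, here is a minimal sketch in Python. It assumes a small invented XML fragment; the element names poem, title, and line are illustrative only and are not taken from any particular encoding scheme. The markup says only what each part is, and two separate functions then decide how to present it.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptive markup: the element names say what each part is,
# not how it should look. The names are invented for this illustration.
SOURCE = """<poem>
  <title>The Tyger</title>
  <line>Tyger Tyger, burning bright,</line>
  <line>In the forests of the night;</line>
</poem>"""


def render_plain(poem):
    """Present the poem as plain text: title in capitals, lines indented."""
    out = [poem.findtext("title").upper(), ""]
    out += ["    " + line.text for line in poem.findall("line")]
    return "\n".join(out)


def render_html(poem):
    """Present the same poem as HTML: title as a heading, lines with breaks.

    No character escaping is done here; the sample text does not need it.
    """
    lines = "<br/>".join(line.text for line in poem.findall("line"))
    return "<h1>" + poem.findtext("title") + "</h1><p>" + lines + "</p>"


poem = ET.fromstring(SOURCE)
print(render_plain(poem))
print(render_html(poem))
```

The source is never edited to change its appearance: each presentation is derived from what the parts are, and a third output format could be added later without touching the encoded text.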

Descriptive versus procedural markup

Note the difference between these two systems:
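One way to see the contrast is to ask the same question of both kinds of markup. The sketch below is an illustration under stated assumptions: the dot-commands (.font, .skip, .center, .indent) are invented stand-ins for a typesetting language, and the XML element names are likewise hypothetical. The question asked is "which text is a heading?"

```python
import xml.etree.ElementTree as ET

# Procedural: invented formatter commands telling one system what to do.
PROCEDURAL = """.skip 2
.font bold
.center
Chapter One
.font roman
.indent 5
It was a dark and stormy night.
.font bold
Beware
.font roman
of the dog.
"""

# Descriptive: hypothetical element names saying what each part is.
DESCRIPTIVE = """<chapter>
  <head>Chapter One</head>
  <p>It was a dark and stormy night. <emph>Beware</emph> of the dog.</p>
</chapter>"""


def bold_text(procedural):
    """Collect text lines set while the (invented) bold font is active."""
    bold, found = False, []
    for line in procedural.splitlines():
        if line.startswith(".font"):
            bold = line.split()[1] == "bold"
        elif line.startswith("."):
            continue  # other formatting commands carry no text
        elif bold:
            found.append(line)
    return found


def headings(descriptive):
    """Collect the text of every element explicitly identified as a heading."""
    return [h.text for h in ET.fromstring(descriptive).iter("head")]


print(bold_text(PROCEDURAL))   # guesses "bold means heading"
print(headings(DESCRIPTIVE))   # asks for headings directly
```

With the procedural version, the best available guess (bold text) also picks up the emphasized word, because the markup records only what the formatter should do. The descriptive version can be asked for headings directly, because the heading is named as such.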

The Rhetoric of Descriptive Markup

In the digital world, our relationship to textuality becomes somewhat more complex, because our use of the digital medium is so often (though not exclusively) to represent materials from other media.

Hence digital text encoding has two different modes:

1. A transcriptional mode: in which you're creating a second-order representation of a textual artifact (using or reproducing the visual markup of the original artifact); the original markup (spacing, punctuation, other formatting) either gets subsumed into the new markup system and is expressed using the vocabulary of that system, or gets described as part of the encoding (as a secondary fact about the document's structure). By analogy with the print world, we might liken this to a scholarly edition of a text, in which the original text is being represented through a different set of formatting conventions that aim to convey the same meaning as those of the original.

2. An authorial mode: in which you're creating a new textual artifact with its own original markup systems, with no backwards look towards print at all: perhaps a look forward to future processing and perhaps remaining agnostic about how it will be used or presented.

In our discussion of the TEI, we will be focusing primarily on the former, because it is the current emphasis of the TEI and similar encoding systems, but the latter is also of increasing interest to scholars.
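The two fates of original formatting in the transcriptional mode can be sketched with a short fragment. This is a simplified illustration: the sentence is invented, the TEI namespace is omitted for brevity, and the choice of elements (title, emph, hi) with the rend attribute is one plausible encoding, not the only one. Where the encoder can say what the italics mean, they are subsumed into an element that names the function; where the encoder cannot or will not say, the italics are simply described.

```python
import xml.etree.ElementTree as ET

# Simplified TEI-style fragment (namespace omitted; the sentence is invented).
# <title> and <emph> subsume the italics into a statement of function;
# <hi> only records that the text was highlighted. rend describes the look.
FRAGMENT = """<p>She had been reading <title rend="italic">Pamela</title>
and found it <emph rend="italic">very</emph> affecting, as the
<hi rend="italic">Advertisement</hi> had promised.</p>"""


def describe(fragment):
    """Return (element name, recorded appearance, text) for each phrase."""
    root = ET.fromstring(fragment)
    return [(el.tag, el.get("rend"), el.text) for el in root]


for tag, rend, text in describe(FRAGMENT):
    print(tag, rend, text)
```

All three phrases looked the same on the original page, and the rend value preserves that fact, but only the first two carry an interpretation of what the appearance meant.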

Advantages of descriptive markup

There are a number of practical advantages to descriptive markup and the kinds of digital objects it produces:

There are also conceptual benefits, once you move beyond this kind of prosaic organizational information and start to consider humanities texts.

Additional assumptions

There are some additional assumptions that go along with the idea of descriptive markup.

Some complications

These assumptions are pretty much true for the kinds of information that first motivated the development of SGML: for instance, technical documentation, legal forms, and documents generated and used by the military and the IRS, all of which needed to be encoded not for immediate output, but for long-term storage, maintenance, and output in multiple formats (including formats that couldn't be foreseen).

And encoding systems that emerged out of this same tradition, like the TEI (and EAD, DocBook, EpiDoc, etc.), all emphasize structural markup that identifies the parts of the document by their structure rather than their appearance. Even a brain-dead renegade like HTML has been steadily moving from its initial emphasis on presentation (the i and font elements, etc.) to greater structural expressiveness, precisely because it turns out this is a more sustainable, cost-effective way of doing things. QED.

However, the use of markup to describe humanities texts (particularly those from the early era of print and before) has revealed complexities that need to be taken into account and that make this kind of markup more of a challenge to apply:

Not to mention the fact that even if one regards presentation as secondary, for humanities scholars it turns out to be a very important secondary consideration indeed: they still want to know how the document looked.

Motives for Text Encoding

We've been talking so far about text encoding as a theoretical pursuit, but of course it's also an intensely practical activity and takes work to actually perform, so it's fair enough to ask why people do it, and why they use systems like the TEI Guidelines.

The practical motives for text encoding are situated within a fairly complex set of social and technological constraints and goals; there are three very significant goals [don't elaborate!]:

To fulfill the first goal, all you need is a format that is non-proprietary and human-readable: XML, for instance (and we'll say more in a few minutes about what this means).

To fulfill the second goal, you need more than this: an adequately detailed markup system, one that can capture the kind of information you are interested in and enable the kinds of things you plan to do with your data in the future. In other words, it must make it worth your while to store the information for the long term.

To fulfill the third goal, you need more still: a markup system that is shared by other people, who agree to use it in the same way you do.

This is why the TEI exists: to provide a long-term, detailed, analytically rich markup system that is understood by an entire community and can be used to create sharable, durable representations of the textual objects that community cares about.
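The third requirement, a shared system, can be illustrated with a deliberately crude sketch. It assumes a hypothetical agreed vocabulary of four element names; a real community would publish a full schema and validate against it, and nothing below reflects how the TEI itself is defined or checked. Both sample documents are well-formed XML, so both would meet the first goal, but only one uses the names the community has agreed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreed vocabulary; a stand-in for a published schema.
SHARED_VOCABULARY = {"text", "head", "p", "title"}


def unknown_elements(xml_source, vocabulary=SHARED_VOCABULARY):
    """Return element names in the document that are outside the vocabulary."""
    root = ET.fromstring(xml_source)
    return sorted({el.tag for el in root.iter()} - vocabulary)


# Two invented encodings of the same sentence.
mine = "<text><head>Preface</head><p>See <title>Emma</title>.</p></text>"
theirs = ("<text><heading>Preface</heading>"
          "<p>See <booktitle>Emma</booktitle>.</p></text>")

print(unknown_elements(mine))
print(unknown_elements(theirs))
```

The second document is just as readable and just as detailed as the first, but tools and collaborators that expect the shared names will not recognize its headings or titles, which is the practical cost of a private vocabulary.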

Text Encoding is Never Simple

Text encoding sits right at the intersection between technology and humanistic/cultural research; at the moment it is the central representational technology for the digital humanities.

It is important to present this not as a simple act of copying or of making a digital facsimile:

Text encoding fits into this as the chief means of creating textual representations: research objects which are of interest because of their textual information.

As a result, text encoding:

These considerations make text encoding more difficult, but also more interesting, both to learn and to perform.

More difficult:

More interesting: