Overview of Descriptive Markup and the TEI

Julia Flanders


Thought Experiment: What is this text?

Let’s do a thought experiment...

What kind of text is this? If you recognize it because you’re familiar with it, hold off for a moment...

If we’re unfamiliar with this text, can we tell what genre it’s in?

Can you quickly identify the place where the second stanza begins?

How about this one?

What’s different about this version?

Motives for text encoding

To fulfill the first goal, all you need is something that’s platform-independent and human-readable: XML, for instance

To fulfill the second goal, you need more than this: you need an adequately detailed markup system, one that can capture the kind of information you are interested in and enable the kinds of things you plan to do with your data in the future; in other words, one that makes it worth your while to store information in the long term

To fulfill the third goal, you need more than this: you need a markup system that is shared by other people, who agree to use it in the same way you do

This is why the TEI exists: because in order to share information usefully, you need something that functions like a standard.

What is Descriptive Markup?
Why is it Important?

In this class we’re going to be focusing on a particular kind of markup, often called descriptive markup

Essentially, descriptive markup is based on the idea that the best way to represent a document is by describing it; not by giving instructions to a particular system on what to do with it but by saying, in general terms, what each of its parts is.

Underlying this philosophy is the idea that presentation derives from the nature and function of documentary parts:

Note that this is a significant departure from earlier kinds of document markup, which served to give instructions to specific processing systems (e.g. typesetting engines) on how to format or process the text.

Descriptive versus procedural markup

Descriptive markup, broadly speaking, is about representing a source:

Whereas procedural markup is about giving orders:

There’s clearly a continuum here: from purely procedural approaches on the one hand (in which the only thing we care about is giving instructions concerning output) to purely descriptive approaches on the other (in which the only thing we care about is the representation of the source)

In accomplishing that representation we may or may not be interested in how the source looked: descriptive approaches may focus on structure or on presentation or both
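
To make the contrast concrete, here is a minimal sketch in Python. It assumes a made-up two-stanza poem encoded with the TEI elements lg (line group) and l (verse line); the bracketed commands in the procedural version are invented for illustration and do not belong to any real typesetting system. With the descriptive version, finding where the second stanza begins is a simple query rather than a guess based on spacing.

```python
import xml.etree.ElementTree as ET

# Descriptive encoding: each part is labeled by what it is.
# <lg> and <l> are TEI element names; the poem text is a placeholder.
DESCRIPTIVE = """
<lg type="poem">
  <lg type="stanza">
    <l>First line of the first stanza</l>
    <l>Second line of the first stanza</l>
  </lg>
  <lg type="stanza">
    <l>First line of the second stanza</l>
    <l>Second line of the second stanza</l>
  </lg>
</lg>
"""

# Procedural encoding of the same words. The bracketed commands are
# hypothetical: they tell a formatter what to do, not what the parts are.
PROCEDURAL = (
    "[indent 0]First line of the first stanza[newline]"
    "[indent 2]Second line of the first stanza[newline]"
    "[skip 1 line]"
    "[indent 0]First line of the second stanza[newline]"
    "[indent 2]Second line of the second stanza[newline]"
)


def first_line_of_stanza(xml_text, n):
    """Return the text of the first verse line of the n-th stanza (1-based)."""
    root = ET.fromstring(xml_text.strip())
    # Direct children of the poem that are marked as stanzas.
    stanzas = root.findall("lg[@type='stanza']")
    return stanzas[n - 1].find("l").text


print(first_line_of_stanza(DESCRIPTIVE, 2))
```

Nothing in the procedural string says that a stanza exists at all; a program would have to infer it from the skipped line, and that inference breaks as soon as the formatting conventions change.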

Additional assumptions

This began as one of the primary tenets of SGML encoding: that instead of trying to describe what documents look like, it’s more powerful and efficient to describe their structure, and then control appearance afterwards.

Several underlying assumptions here:

Advantages of descriptive markup

These assumptions are pretty much true for the kinds of information that first motivated the development of SGML: for instance, technical documentation, legal forms, and documents generated and used by the military and the IRS, all of which needed to be encoded not for immediate output, but for long-term storage, maintenance, and output in multiple formats (including formats that couldn’t be foreseen).

And in cases where they are true, there are obvious practical benefits to separating presentation and structure, which are probably either familiar or self-evident or both
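
As a rough illustration of one such practical benefit, the sketch below renders a single descriptively encoded source in two different ways. The element names (div, head, p, emph) are TEI elements, but the sample text, the two style tables, and the render function are assumptions made up for this example, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET
from html import escape

# One descriptive source; the content is a placeholder.
SOURCE = "<div><head>A Title</head><p>Some <emph>emphasized</emph> text.</p></div>"

# Two hypothetical presentations of the same structure:
# each maps an element name to the strings placed before and after it.
HTML_STYLE = {
    "div": ("<section>", "</section>"),
    "head": ("<h1>", "</h1>"),
    "p": ("<p>", "</p>"),
    "emph": ("<em>", "</em>"),
}
TEXT_STYLE = {
    "head": ("", "\n\n"),
    "p": ("", "\n"),
    "emph": ("*", "*"),
}


def render(elem, style, quote=lambda s: s):
    """Render an element and its descendants according to a style table."""
    before, after = style.get(elem.tag, ("", ""))
    parts = [before, quote(elem.text or "")]
    for child in elem:
        parts.append(render(child, style, quote))
        # Text that follows the child but still belongs to this element.
        parts.append(quote(child.tail or ""))
    parts.append(after)
    return "".join(parts)


root = ET.fromstring(SOURCE)
print(render(root, HTML_STYLE, quote=escape))
print(render(root, TEXT_STYLE))
```

The source is never edited to get the second output; only the style table changes, which is the sense in which appearance is controlled afterwards.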

There are also conceptual benefits, once you move beyond these kinds of prosaic organizational information and start to consider humanities texts

Some complications

So encoding systems like the TEI (and EAD, DocBook, EpiDoc, etc.) all emphasize structural markup that identifies the parts of the document by their structure rather than their appearance, and even a brain-dead renegade like HTML has been steadily moving from its initial emphasis on presentation (the i and font elements, etc.) to greater structural expressiveness, precisely because it turns out this is a more sustainable, cost-effective way of doing things. QED.

However, if we reexamine those earlier assumptions in light of this new humanities emphasis, they appear much more problematic:

Not to mention that even if one regards presentation as secondary, for humanities scholars it turns out to be a very important secondary indeed: they still want to know how the document looked.

We’re going to talk about renditional markup a bit later on; for the moment, we want to sketch out the issue so that you can be aware of it as we proceed.
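
As a small preview, and only as a sketch: one way descriptive markup can record how the source looked is with the TEI rend attribute, which describes the appearance of the original rather than instructing a processor. The sample sentence and the values "indent" and "italic" below are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder sentence; rend values describe the source, they are not commands.
SOURCE = '<p rend="indent">A sentence with one <hi rend="italic">italicized</hi> word.</p>'


def appearance_notes(xml_text):
    """List (element name, rend value) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(e.tag, e.get("rend")) for e in root.iter() if e.get("rend") is not None]


print(appearance_notes(SOURCE))
```

A later rendering may or may not reproduce the italics; either way, the fact that the source had them is preserved.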

Text Encoding is Never Simple

Central issues of humanities computing: understanding the intersection between technology and humanistic/cultural research

Important to present this not as a simple act of copying, making a digital facsimile:

Text encoding fits into this as the chief means of creating textual representations: research objects which are of interest because of their textual information

As a result, text encoding:

These considerations make text encoding more difficult, but also more interesting, both to learn and to perform.

More difficult:

More interesting:

What is the TEI?

Technically: The TEI is a standards organization that exists to create, maintain, and disseminate a standard for humanities text encoding

Organizationally: The TEI is an international consortium whose members are institutions that want the TEI to continue to exist

Socially: The TEI is a community of people and projects who use text encoding in a wide variety of ways, and who communicate with one another about their research and the practical problems associated with it.

The TEI is also, importantly, the set of guidelines and XML specifications that make up the TEI Guidelines.

It’s important to note that the TEI is not a fixed tag set that is written in stone

The TEI Guidelines

The TEI Guidelines are a flexible specification:

Areas of Usage

Note as well that the TEI’s domain is strongly international, both in the kinds of materials it is used for (Tibetan manuscripts, graphical narratives from pre-Columbian Mexico, Near Eastern stone inscriptions) and in the international membership community it intends to serve

TEI documentation is being translated into multiple languages: