Descriptive Markup

What is text encoding?

I'd like to start by situating the activity of text encoding in intellectual space:

From the viewpoint of the humanities scholar, text encoding looks as if it's coming over from computer science: an activity that takes place on computers and requires some technical knowledge (of software, of data standards, of encoding languages).
In fact, rather than likening a text encoder to a computer programmer, I want to draw some different lines of connection.
First, with an anthropologist or ethnographer somewhat in the tradition of Clifford Geertz: the text encoder is an observer and documenter of the textual world, and the encoding they produce has (at least potentially) something of the quality of a thick description: a contextualized, interpretative account of the details of the textual landscape.
Another affiliation: the text encoder is also very much like a critical editor, creating an analytical representation of the text which provides systematic, expert knowledge about it.

Text markup sounds like a technical concept, but like so many things it is a more basic idea that has come to our attention because technology and resulting media shifts make us aware of it

In fact it is an expression of motives and practices that have been around for a long time.

Markup is several things:

a way of formalizing and externalizing the structures in a text
a way of adding further information to the text that interests us
a meta-text that comments on, interprets, or extends the meaning of a text

Note that markup and text encoding are essentially the same thing, for our purposes.

Text encoding in the ancient world

This example of scriptio continua lacks word breaks, which are a very basic form of text markup, but it does have other kinds of markup: the line positioning, and the differently colored inks

The markup serves to granularize the flow of language and allow the reader to parse its significance more easily

If we think of written language as a secondary derivation from oral language, this kind of markup is the written equivalent of the pauses and inflections that make spoken words comprehensible (both by separating them from one another and by giving them additional emphasis or coloration): in other words, punctuational markup represents an authentically linguistic level of meaning

If we think of written language as a separate sign system, then the markup would seem more like an indigenous part of the formalism of written language: part of its distinctive expressiveness

Either way, this kind of markup becomes an important part of our apprehension of the document: as manuscript and printing practices become more sophisticated, the structured visual presentation of documents becomes a way of conveying much more nuanced information about their structure and semantics

Text encoding in the early modern world

By the time printing comes to the ascendant, the visual markup of the page has become highly formalized, and in addition the structures of printed texts have become significantly codified, so that more complex reading practices are possible:

reading practices that take advantage of this codification to gather more meaning, more efficiently and transparently, from the page
reading in which the semantics of specific formal components of the page actually determines the significance of words
For instance, in this example of an early dictionary, the formatting tells us that particular components of the page have a specific function: the heading that locates us within the alphabet, the headword, the definition
In other words, the markup represents and makes visible the intellectual structures of the text—or, putting it the other way round, the text is granularized as a set of formal structures that are made apparent to the reader through the visual markup
Obviously not all of the formatting works towards this end: some can be thought of almost as a visual surplus: decorative features, choices that don't affect our apprehension of textual meaning directly but provide context (such as font size or typeface, margins, interlinear spacing)

Text encoding in the digital world

In the digital world of text markup, we are dealing with an entirely different representational system: one that is not rooted in the phenomenology of print, but rather allows us (if we choose) to represent things from a very different perspective

A representational system whose basic components are not information about formatting, but rather information about structure and function.

This approach is often called descriptive markup

extremely important development in the history of electronic document management
a long and interesting history of debates about its real nature, what to call it
but at a simple level, there are a few important premises

Essentially, descriptive markup is based on the idea that the best way to represent a document digitally is by describing it; not by giving instructions to a particular system on what to do with it but by saying, in general terms, what each of its parts is.

Underlying this philosophy is the idea that presentation derives from the nature and function of documentary parts:

a heading is bold because it's a heading; a lemma has parentheses around it to mark it as a lemma in this text
and therefore in encoding, we should identify the parts of the document's structure first of all, and then base any additional information (such as details of presentation) on that structure

Three foundational assumptions:

Presentation expresses structure and function
Markup should identify structure (primary)
Stylesheets produce presentation (secondary)

Note that this is a significant departure from earlier kinds of document markup, which served to give instructions to specific processing systems (e.g. typesetting engines) on how to format or process the text.

Descriptive versus procedural markup

Note the difference between these two systems:

troff says: "center this text"
XML says: "just FYI, this text is a (centered) heading"
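
To make the contrast concrete, here is a minimal sketch in Python. The troff request .ce ("center the next input line") is real; the XML fragment is an illustration only, with the element name head and the attribute rend borrowed from TEI conventions, and the helper troff_requests is a hypothetical name invented for this example.

```python
import xml.etree.ElementTree as ET

# Procedural markup (troff): ".ce" tells the formatter what to DO with the
# next line. It says nothing about what kind of text that line is.
procedural = ".ce\nWhat is text encoding?\n"

# Descriptive markup (XML): the element name says what the text IS.
# "head" and "rend" follow TEI conventions and are used here for illustration.
descriptive = '<head rend="center">What is text encoding?</head>'


def troff_requests(source):
    """Return the names of the formatting requests (lines starting with '.')."""
    requests = []
    for line in source.splitlines():
        parts = line[1:].split() if line.startswith(".") else []
        if parts:
            requests.append(parts[0])
    return requests


element = ET.fromstring(descriptive)

print(troff_requests(procedural))   # the instructions given to the formatter
print(element.tag, element.attrib)  # what the text is, plus how it looked
```

All we can learn from the troff source is which instruction was issued; from the XML we learn that the text is a heading, with its centering recorded as a secondary fact about it.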

The Rhetoric of Descriptive Markup

In the digital world, our relationship to textuality becomes somewhat more complex, because our use of the digital medium is so often (though not exclusively) to represent materials from other media.

Hence digital text encoding has two different modes:

1. A transcriptional mode: in which you're creating a second-order representation of a textual artifact (using or reproducing the visual markup of the original artifact); the original markup (spacing, punctuation, other formatting) either gets subsumed into the new markup system and is expressed using the vocabulary of that system, or gets described as part of the encoding (as a secondary fact about the document's structure). By analogy with the print world, we might liken this to a scholarly edition of a text, in which the original text is being represented through a different set of formatting conventions that aim to convey the same meaning as those of the original.

2. An authorial mode: in which you're creating a new textual artifact with its own original markup systems, with no backwards look towards print at all: perhaps a look forward to future processing and perhaps remaining agnostic about how it will be used or presented.

In our discussion of the TEI, we will be focusing primarily on the former, because it is the current emphasis of the TEI and similar encoding systems, but the latter is also of increasing interest to scholars.
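
The two options within the transcriptional mode can be sketched roughly as follows. The sentence is invented; imagine a printed source that sets the phrase "terra incognita" in italics. The element foreign and the attribute rend follow TEI conventions, but the fragments omit the TEI namespace for brevity, and the function recorded_appearance is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# Option A: the original italics are subsumed into the new markup system.
# The encoding records only what the italics meant (a foreign phrase).
SUBSUMED = "<p>They sailed into <foreign>terra incognita</foreign>.</p>"

# Option B: the same structural claim, with the original appearance also
# described as a secondary fact about the element.
DESCRIBED = (
    '<p>They sailed into <foreign rend="italic">terra incognita</foreign>.</p>'
)


def recorded_appearance(source):
    """List (element name, rend value) for every element that records its look."""
    root = ET.fromstring(source)
    return [
        (el.tag, el.get("rend"))
        for el in root.iter()
        if el.get("rend") is not None
    ]


print(recorded_appearance(SUBSUMED))
print(recorded_appearance(DESCRIBED))
```

In the first encoding the look of the source can no longer be recovered; in the second it survives, but attached to the structure rather than standing in for it.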

Advantages of descriptive markup

There are a number of practical advantages to descriptive markup and the kinds of digital objects it produces:

lets you use the same underlying data with multiple presentations
allows you to change presentation easily through stylesheets, etc.
in other words, it gives you a more natural way of interacting with the document
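
As a rough illustration of the first two points, the sketch below renders one encoded dictionary entry in two ways. Everything in it is an assumption made for the example: the element names entry, headword, and definition, the invented entry itself, and the "stylesheets", which are reduced here to Python dictionaries mapping an element name to an output template (a real project would more likely use XSLT or CSS).

```python
import xml.etree.ElementTree as ET

# One descriptive source. The entry is invented for illustration.
SOURCE = """<entry>
  <headword>abacus</headword>
  <definition>a counting frame</definition>
</entry>"""

# Two toy stylesheets: each maps an element name to an output template.
# Note: the HTML templates do no character escaping; this is a sketch only.
PLAIN = {"headword": "{text}:", "definition": "  {text}"}
HTML = {"headword": "<b>{text}</b>", "definition": "<p>{text}</p>"}


def render(source, stylesheet):
    """Render each child of the root element using the given stylesheet."""
    root = ET.fromstring(source)
    lines = []
    for child in root:
        template = stylesheet[child.tag]
        lines.append(template.format(text=(child.text or "").strip()))
    return "\n".join(lines)


print(render(SOURCE, PLAIN))
print(render(SOURCE, HTML))
```

Changing the presentation means editing a stylesheet entry; the encoded source is never touched.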

There are also conceptual benefits, once you move beyond these kinds of prosaic organizational information and start to consider humanities texts

if you mark up the structure, you can treat it as an object of analysis: literary analysis, historical analysis, rhetorical analysis, linguistic analysis, etc.
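
A small sketch of what treating structure as an object of analysis might look like: counting verse lines per speaker in an encoded dramatic scene. The scene text is invented. The element names sp, speaker, l, and p follow TEI conventions for speeches, speaker labels, verse lines, and prose, but the fragment omits the TEI namespace, and the function name verse_lines_per_speaker is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# An invented scene: two speakers, one speaking verse and one speaking prose.
SCENE = """<div type="scene">
  <sp><speaker>First Voice</speaker>
    <l>An invented line of verse,</l>
    <l>and a second invented line.</l>
  </sp>
  <sp><speaker>Second Voice</speaker>
    <p>An invented reply in prose.</p>
  </sp>
  <sp><speaker>First Voice</speaker>
    <l>A third invented line.</l>
  </sp>
</div>"""


def verse_lines_per_speaker(source):
    """Count <l> (verse line) elements inside each speaker's speeches."""
    counts = Counter()
    for speech in ET.fromstring(source).iter("sp"):
        speaker = speech.findtext("speaker", default="").strip()
        counts[speaker] += len(speech.findall("l"))
    return counts


print(verse_lines_per_speaker(SCENE))
```

The question "who speaks verse, and how much?" can be asked at all only because the encoding says what each part is, not merely how it was laid out.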

Additional assumptions

There are some additional assumptions that go along with the idea of descriptive markup.

that the relationship between structure and presentation is consistent (even if perhaps complex)
that presentation is not decorative but functional: that is, that it exists to express function, not for any other purpose
that presentation is variable while structure is constant (in other words, the structure expresses something fundamental about the document while presentation expresses something secondary)

Some complications

These assumptions are pretty much true for the kinds of information which were first motivating the development of SGML: for instance, technical documentation, legal forms, documents generated and used by the military and the IRS, all of which needed to be encoded not for immediate output, but for long-term storage, maintenance, and output in multiple formats (including formats that couldn't be foreseen).

And encoding systems that emerged out of this same tradition, like the TEI (and EAD, DocBook, EpiDoc, etc.), all emphasize structural markup that identifies the parts of the document by their structural role rather than their appearance; even a brain-dead renegade like HTML has been steadily moving from its initial emphasis on presentation (the b, i, and font elements, etc.) toward greater structural expressiveness, precisely because this turns out to be a more sustainable, cost-effective way of doing things. QED.

However, the use of markup to describe humanities texts (particularly those from the early era of print and before) has revealed complexities that need to be taken into account and make this kind of markup more of a challenge to apply:

the relationship between structure and presentation may not be consistent at all, particularly when dealing with older texts
either by accident/sloppiness/practical constraints (such as the need to fit more or less onto a given page: think of the setting of Shakespearean plays as prose or verse depending on available space)
or because in fact presentation in humanities texts may well be decorative rather than (or in addition to being) functional: it may exist to comment on, complexify, ironize, adorn, or distract from the content
and further, there has been an important line of commentary within editorial theory and text encoding theory both, arguing that the distinction between an "essential/fundamental" content and a variable/inessential presentation is false: that in fact the presentation and the physical substance of the document are constitutive of meaning and inseparable from it.

Not to mention the fact that even if one regards presentation as secondary, for humanities scholars it turns out to be a very important secondary indeed: they still want to know how the document looked.

Motives for Text Encoding

We've been talking so far about text encoding as a theoretical pursuit, but of course it's also an intensely practical activity and takes work to actually perform, so it's fair enough to ask why people do it, and why they use systems like the TEI Guidelines.

The practical motives for text encoding are situated within a fairly complex set of social and technological constraints and goals; there are three very significant goals:

To store information for the long term, in a format that is not vulnerable to changes in hardware and software
To analyse information and represent the results of the analysis in some way
To share information with colleagues and other projects, and to publish it for future use.

To fulfill the first goal, all you need is a format that is non-proprietary and human-readable: XML for instance (and we'll say more in a few minutes about what this means)

doesn't matter how detailed, what kind of markup

To fulfill the second goal, you need more than this: you need an adequately detailed markup system: a system that can capture the kind of information you are interested in, and enable the kinds of things you plan to do with your data in the future: in other words, make it worth your while to store information in the long term

To fulfill the third goal, you need more than this: you need a markup system that is shared by other people, who agree to use it in the same way you do

for this, you need some sort of infrastructure for developing and maintaining the markup system and, even more importantly, its documentation, so that people who want to use it have a place to find it and learn about it.
you might be able to come up with a perfectly good encoding system all by yourself; if you lived on a desert island, you wouldn't have any motive to do otherwise
but insofar as text encoding is a community-oriented activity, inventing your own system from scratch can be a very solipsistic exercise

This is why the TEI exists: to provide a long-term, detailed, analytically rich markup system that is understood by an entire community and can be used to create sharable, durable representations of the textual objects that community cares about.

Text Encoding is Never Simple

Text encoding sits right at the intersection between technology and humanistic/cultural research—at the moment it is the central representational technology for the digital humanities

Important to present this not as a simple act of copying, making a digital facsimile:

instead, think of it as part of the intellectual strategy of research
creating research objects that are of value: whether broad or specific, advanced or basic

Text encoding fits into this as the chief means of creating textual representations: research objects which are of interest because of their textual information

not simply the letters and words themselves, but also the text’s structure and its contents
text encoding allows the researcher to represent the text in complex ways
and allows the addition of specialized research knowledge as well as basic information necessary to elucidate arcane texts.

As a result, text encoding:

creates a model of the text: a representation that will be used for research purposes
is a strategic act: it exists to serve the specific purposes of its user. It is not a neutral or objective process
is thus discipline-specific: it adds certain kinds of information and focuses attention on certain kinds of information, and it ignores and eliminates other kinds of information.

These considerations make text encoding more difficult, but also more interesting, both to learn and to perform.

More difficult:

because it involves complex analysis and decision-making
because it involves specialized knowledge of the research objects and the audience

More interesting:

because its work is directly implicated in the scholarly research that will be performed on the text
and in fact is in some ways inseparable from it.