Advanced Markup Concepts

Basic encoding

So far we have been looking mostly at two very basic functions that markup can perform:

identifying the boundaries of textual features
naming textual features

These are both really important functions—fundamental to the ways markup can be useful to us. Knowing the boundaries of things, and distinguishing between them, are an essential foundation for everything else we do.

We might think of this kind of markup as being almost like clothing: it closely follows the form of the feature it’s surrounding, it draws our attention to it, it marks out its boundaries (not always in a "natural" way...).

“Advanced” encoding

In this next session, we are going to move beyond this basic concept of markup to consider some more complex things that markup can help us do:

create information-bearing links between things
establish connections between parallel structures
create information structures that don’t necessarily follow the shape of the text and are not limited by its boundaries

The analogy here would not be clothing but maybe architecture.

Notes and annotations

One very important form of structural complexity: a hypertextual sprout or fork or jump in the textual stream

For example, a footnote: which sprouts off from the text at a certain point
Or an endnote, where there’s in effect a cross-reference from a place in the text to a subsequent explanation

In TEI, all types of annotation are encoded using the <note> element (except in certain special cases, for which <annotation> or <annotationBlock> may be used). Notes can be classified to indicate who is responsible for them (with @resp) or what kind of note they are (with @type), using any classification system that seems useful, e.g. annotation, correction, hypothesis, context, gloss.

We’re illustrating here several different levels of annotation:

at a specific point (but note that in this case we don’t know anything very specific about what the annotation is really annotating)
a specific name
a specific quotation
an arbitrary word not already marked up
a location in the text

The notes themselves can go anywhere; what we are illustrating here has you putting the notes into a special division in the back matter.
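A minimal sketch of how these levels of annotation might look, with the notes gathered in a back-matter division and each one pointing at what it annotates via @target. The sample sentence, the xml:id values, and the @type classifications are invented for illustration; they are not prescribed by the TEI.

```xml
<text>
  <body>
    <p>We left <placeName xml:id="pn01">Bristol</placeName> at dawn,<anchor xml:id="a01"/>
      and the mate called it <quote xml:id="q01">as fair a wind as ever blew</quote>.
      By noon the <seg xml:id="s01">offing</seg> was clear of sails.</p>
  </body>
  <back>
    <div type="notes">
      <!-- attached to a bare location in the text, marked with an empty anchor -->
      <note target="#a01" type="context">A comment on this point in the narrative.</note>
      <!-- attached to a name that was already marked up -->
      <note target="#pn01" type="gloss">A comment on the place named here.</note>
      <!-- attached to a quotation -->
      <note target="#q01" type="hypothesis">A comment on the source of this quotation.</note>
      <!-- attached to an arbitrary word, wrapped in seg only so it can be pointed at -->
      <note target="#s01" type="gloss">A comment explaining this word.</note>
    </div>
  </back>
</text>
```

Pointing from the note to the text (rather than the other way round) keeps the transcription itself uncluttered; the reverse direction, a reference in the text pointing to the note, is equally possible.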

In addition, there may be some kinds of annotation that can be handled in other ways, not with <note> but with something more flexible; in this example we show place names being linked to a gazetteer or placeography, which records additional information about the place (a regularized version of the name, a brief note about it, a location given as latitude and longitude). This same approach could be used for the names of people as well. We will talk more about these later in the workshop.
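As a rough sketch of the gazetteer approach, a place name in the text might point with @ref to an entry in a <listPlace> kept elsewhere in the file (or in a separate file). The spelling in the transcription, the xml:id, and the wording of the note are invented for illustration, and the coordinates are approximate.

```xml
<!-- in the transcription -->
<p>We left <placeName ref="#bristol">Bristoll</placeName> at dawn.</p>

<!-- in the placeography -->
<listPlace>
  <place xml:id="bristol">
    <!-- regularized form of the name -->
    <placeName>Bristol</placeName>
    <!-- brief note about the place -->
    <note>Port city in the south-west of England.</note>
    <!-- latitude and longitude, separated by a space -->
    <location>
      <geo>51.4545 -2.5879</geo>
    </location>
  </place>
</listPlace>
```

Every occurrence of the name, however it is spelled in the source, can point to the same entry, so the extra information is recorded only once.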

Figures and Images

Figures:

Heading (transcribed from the source): <head>
A link to the image file: <graphic> with @url
A description of the image, for accessibility: <figDesc>
A caption (transcribed from the source): <ab>
A transcription of the writing that is found within the graphic: <floatingText>
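Put together, the parts listed above might look something like the following sketch. The file path and all of the transcribed text are invented for illustration.

```xml
<figure>
  <!-- heading transcribed from the source -->
  <head>The Author at her Desk</head>
  <!-- link to the image file -->
  <graphic url="images/frontispiece.jpg"/>
  <!-- description supplied by the encoder, for accessibility -->
  <figDesc>Engraving of a woman seated at a writing desk, holding a quill,
    with an open book propped in front of her.</figDesc>
  <!-- caption transcribed from the source -->
  <ab>Drawn from the life.</ab>
  <!-- writing that appears inside the image itself -->
  <floatingText>
    <body>
      <ab>Poems on Several Occasions</ab>
    </body>
  </floatingText>
</figure>
```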

Facsimiles and Page Images

Facsimiles of pages and parts of pages:

The @facs attribute is available on any element; use as appropriate
It points to an image file (could be a whole page, or just a detail)
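For instance, @facs might appear on a page break to point to an image of the whole page, and on a smaller element to point to a detail. The file names here are hypothetical.

```xml
<!-- image of the whole page that starts here -->
<pb n="12" facs="images/page012.jpg"/>
<p>The letter closes with the writer's name,
  <!-- image of just the signature -->
  <signed facs="images/page012_signature.jpg">A. B.</signed></p>
```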

Representing Rendition

We’ve been talking about showing page images and facsimiles, which is a great way to give a very accurate visual representation of the source, but doesn’t give us any data we can process about how the source document looked. When we want to include that data in our transcription, we can do so using the @rend attribute, which is available on all TEI elements.

If you just want to say one simple thing about the appearance of an element (e.g. italics, centered, bold) you can use the simple keyword approach at the top here.

If you need to say more than one thing about the rendition of a given element, then you need to provide some internal structure inside the @rend attribute. One way to do this is with something known as rendition ladders (which are not in wide use, but are fairly elegant). Another approach would be to use CSS style descriptors.

If you’re not using the CSS method, you make up the values yourself: the TEI does not provide any suggested values.
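The three approaches could be sketched as follows. The keyword and ladder values are made up for this example, in keeping with the point that the TEI does not supply them; the ladder syntax shown (a key followed by its value in parentheses) is one convention for rendition ladders, not the only possible one.

```xml
<!-- simple keyword: one statement about appearance -->
<title rend="italic">The Blazing World</title>

<!-- rendition ladder: several key(value) pairs in one attribute -->
<head rend="slant(italic)align(center)">Chapter the First</head>

<!-- CSS style descriptors in place of invented values -->
<head rend="font-style: italic; text-align: center">Chapter the First</head>
```

Current versions of the TEI also offer a separate @style attribute intended for CSS, which may be a better home for the third form if your project does not need to keep everything in @rend.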

Critical apparatus

We can also represent a plurality of editorial opinions or textual witnesses as a piece of critical apparatus, using the <app> element. The optional <lem> gives the reading of the "base text", and the two <rdg> elements each represent a different editorial view of what the text really means.
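A small sketch of an apparatus entry of this kind, with a base reading and two competing editorial readings. The line of verse, the readings, and the #ed1 and #ed2 pointers (which would need to be defined elsewhere, for instance in the header) are invented for illustration.

```xml
<l>And in her hand she bore a
  <app>
    <!-- reading of the base text -->
    <lem>silver</lem>
    <!-- one editor's view -->
    <rdg resp="#ed1">sylvan</rdg>
    <!-- another editor's view -->
    <rdg resp="#ed2">slender</rdg>
  </app>
  bough</l>
```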

What if we have a plurality of readings because we have multiple witnesses? Here’s an example of a hypothetical text that is a critical amalgam of two separate witnesses with slight local differences. The witnesses themselves are documented in a <listWit> element, and the individual readings are associated with the appropriate witness using the @wit attribute. Note that you can associate a given reading with more than one witness if that’s more economical.
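The witness-based version might be sketched like this. Note that this sketch assumes three witnesses rather than two, purely so that one reading can be shown shared between two of them; the sigla, descriptions, and text are all invented.

```xml
<!-- the witnesses, documented once -->
<listWit>
  <witness xml:id="A">First manuscript copy</witness>
  <witness xml:id="B">Second manuscript copy</witness>
  <witness xml:id="C">First printed edition</witness>
</listWit>

<!-- in the text: each reading names the witness or witnesses that have it -->
<l>The
  <app>
    <rdg wit="#A #C">winter</rdg>
    <rdg wit="#B">wintry</rdg>
  </app>
  night was long</l>
```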

Generic markup structures

We’ve gone over a fair number of TEI elements with quite specific purposes so far, and there are hundreds more out there: the TEI has anticipated a large number of the textual features that we’re going to want to encode, and created elements for them. However, the universe of texts is much larger than the universe of the TEI, and the TEI knows this:

it can’t possibly anticipate all the things people are going to want to encode
even if it could, that’s not a good way to design an encoding system

Instead, the TEI provides a fall-back mechanism, a set of generic elements that encoders can use to encode the unforeseen. In these generic elements, instead of giving the element itself a very specific meaning (for instance, personal name, stage direction), the element itself carries almost no meaning at all: it just says "thing!" The semantics, the meaning of the element, is carried in an attribute value, which can be made up by the encoder.

As we show here, there are three main generic elements in TEI, one for each structural level:

At the division level, we’ve already encountered the <div> element, from which we can fabricate chapters and sections and things of that scale
At the chunk level (things like paragraphs and lists), we have the <ab> element (stands for "anonymous block")
At the word or phrase level, we have the <seg> element (short for "segment").
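As an illustration of the three levels working together, here is a sketch that encodes a recipe, a kind of text for which the TEI has no dedicated elements. All of the @type values are made up by the (hypothetical) encoder, which is exactly the point.

```xml
<!-- division level: a recipe -->
<div type="recipe">
  <head>To make a Seed Cake</head>
  <!-- chunk level: a block that is neither a paragraph nor a list -->
  <ab type="ingredients">Take
    <!-- phrase level: a measurement -->
    <seg type="measurement">two pounds</seg> of flour and
    <seg type="measurement">one pound</seg> of butter.</ab>
  <ab type="method">Rub the butter into the flour and bake it an hour.</ab>
</div>
```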

Empty elements used as milestones

Simplest option: instead of encoding the feature by enclosing it in an element, just mark its boundaries with empty elements

The most common case of this is with milestone elements:

Elements that divide the text into segments according to some system: pages, columns, lines
works perfectly for an information structure which is completely flat and divides up the whole text into parts: page breaks, signatures, reels of a movie
i.e. there’s nothing in the text that isn’t on some page; there’s nothing in a paragraph that’s not on some line
in these cases, you mark the boundaries between segments, so each boundary element marks the end of one segment and the start of the next.
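A brief sketch of milestones in use: the page and line breaks fall wherever they fall in the source, without disturbing the paragraph that contains them. The text and numbering are invented; the generic <milestone> element covers systems, such as reels, that have no element of their own.

```xml
<p>The ship stood out to sea before the
  <lb/>wind had fully risen, and by the time
  <pb n="6"/>
  <lb/>we lost sight of land it was blowing hard.</p>

<!-- a division for which the TEI has no specific element -->
<milestone unit="reel" n="2"/>
```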

Empty elements used as endpoints

But in addition there are other cases where it’s handy to be able to mark the ends of an element at arbitrary places, rather than having to fit the element neatly into the document hierarchy

The classic example is additions and deletions: authors often add or delete large chunks of text that don’t match the textual structure

For these, as we saw briefly yesterday, we can mark them much more effectively by putting an empty element at each end, sort of like marking the boundaries of an impromptu soccer field by putting your shoes at each end

Then create a link between the two, using the pointing system we talked about yesterday...
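For an addition that starts in the middle of one paragraph and ends in the middle of the next, this might be sketched as follows, with <addSpan> marking the start and pointing via @spanTo to an <anchor> at the end. The xml:id and the text are invented; <delSpan> works the same way for deletions.

```xml
<p>The first paragraph begins as originally written.
  <!-- start of the addition; spanTo points to where it ends -->
  <addSpan spanTo="#add01end"/>This added passage starts partway through one paragraph</p>
<p>and runs on into the next.<anchor xml:id="add01end"/>
  The second paragraph then continues as originally written.</p>
```

Enclosing the same passage in a single <add> element would not be well-formed XML, since it would have to start inside one <p> and end inside another.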

Fragmentation

Take what is logically a single content object

encode it as multiple separate XML elements
indicate that each XML element is only a partial element
optionally have each partial element indicate which is the next piece of the whole content object

TEI provides two methods for doing this; the first is the @part attribute...

The @part attribute can be used for serial cases:

all fragments are in sequential order
no intervening occurrence of the same element type that is not part of the aggregate element
e.g., good for <l> but sometimes not <q>
available on <l>, <lg>, <div>, <seg>, <ab>, <s>, <cl>, <phr>, <w>, <m>, <c>
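A typical serial case is a verse line shared between two speakers in a play. In this sketch, the values I and F on @part mark the initial and final fragments of one metrical line (M is available for fragments in the middle).

```xml
<sp>
  <speaker>Hamlet</speaker>
  <!-- initial part of the line -->
  <l part="I">Did you not speak to it?</l>
</sp>
<sp>
  <speaker>Horatio</speaker>
  <!-- final part of the same line -->
  <l part="F">My lord, I did;</l>
</sp>
```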

Another approach to fragmentation

The @next and @prev attributes can be used for any cases:

available on every element when the linking module (segmentation and alignment) is included in the schema
each fragment must bear either @next or @prev
probably better if each fragment bears both
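A sketch of a quotation interrupted by narration, where each fragment bears a pointer to its neighbour so that the pieces can be reassembled even if other <q> elements intervene. The text and xml:id values are invented.

```xml
<p>
  <!-- first fragment: points forward to the next piece -->
  <q xml:id="q05a" next="#q05b">I shall not go,</q> she said, turning to the window,
  <!-- second fragment: points back to the previous piece -->
  <q xml:id="q05b" prev="#q05a">until the letter has come.</q>
</p>
```

With more than two fragments, the middle ones would carry both @next and @prev.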