A Minimalist Introduction to XML

Julia Flanders and Syd Bauman



A good place to start in understanding XML (what it is and why it’s important) is to think about it as a way of structuring or organizing information.

Structure is what makes digital information useful to us: it helps us find things, it helps us identify them and understand what they are, and it helps us communicate about them to other people. We can think of the difference between structured and unstructured information as being similar to the difference between approaches to organizing your tools (or anything else you organize).

In the first image, the only structure is the outer container; essentially, that toolbox only lets us say "it’s in there somewhere..." but doesn’t give us a way of predicting exactly where something will be, or grouping similar or related things together.

In the middle image, there is a bit more structure: the toolbox has compartments to group things together. But the groupings here are not very rigorous (at least to my eye) and might not be very precise as a way of finding what you need.

On the right we have the logical extreme: every tool has a very specific location; similar tools are grouped together. You can instantly see if something is missing. If you had a team of workers who all used this same toolbox, they could trade boxes and still be able to find things with their eyes closed.

It’s important to note that we’re not claiming here that it’s always useful to have as much structure as possible. Sometimes the first or second toolbox is exactly what you want: just a big container to hold things together. The questions we’re going to explore in this class are: how much structure is useful, and how do we represent it?

XML Structures the World

XML provides a very specific way of structuring information and of looking at the world. It asks us to identify individual pieces of information that we are interested in (by naming them), and it asks us to group them together. We can represent these groupings in various visual ways:


So what’s the relationship between XML and TEI? This is where things get slightly tricky...

Let’s start with TEI: TEI is an encoding language: a language for describing the structure of documents. We can think of the TEI as being like the language we use to label the slots in our toolbox: TEI provides us with terms like "paragraph" and "heading" and defines what they mean. We can think of TEI as being analogous to a specific toolbox: one that groups together all of the different chisels and names them in a specific way, and separates the chisels from the hammers.

There are lots of encoding languages: another one is HTML. HTML also gives us terms for labeling things, although it has a much smaller vocabulary than the TEI. We could think of HTML as being analogous to a simpler toolbox, one with fewer, more generic slots; instead of having different size slots for each different kind of chisel, it just has a bunch of more generic bins.

So: TEI and HTML are encoding languages. What is XML? XML is the mechanism for defining encoding languages. In the context of our toolbox metaphor, XML is the stuff you make toolboxes and labels out of. It doesn’t determine how many little slots your toolbox has, or what you call them, but it says "you have to have a box and any little slots inside it have to have labels". It provides a general mechanism for representing data structures.
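To make the "stuff you make toolboxes out of" idea concrete, here is a minimal sketch in Python, using only the standard library's xml.etree.ElementTree parser. The two fragments are simplified illustrations written for this example (no namespaces, no headers), one using TEI-style element names and one using HTML-style names. The point is only that the same generic XML parser reads both, because both are written in the same notation; the helper name element_names is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Two tiny fragments: same notation (XML), different vocabularies.
# Simplified for illustration; real TEI and XHTML documents carry more apparatus.
tei_like = "<div><head>Chapter 1</head><p>It was a dark and stormy night.</p></div>"
html_like = "<div><h1>Chapter 1</h1><p>It was a dark and stormy night.</p></div>"


def element_names(xml_text):
    """Return the names of all elements in document order (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


tei_names = element_names(tei_like)
html_names = element_names(html_like)

print(tei_names)   # the TEI-style vocabulary used above
print(html_names)  # the HTML-style vocabulary used above
```

The parser knows nothing about what "head" or "h1" means; it only knows how boxes and labels are written down. Deciding which labels exist, and what they mean, is the job of the encoding language.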

Languages and meta-languages

Another (more accurate) way of putting this is to say that XML is a meta-language, while TEI and HTML are languages.

A language has a vocabulary and a grammar: the vocabulary tells you what words exist in the language, and the grammar tells you how they may be used. In learning the TEI, we’re going to learn a specific encoding language: we’re going to learn a lot of its vocabulary and grammar rules: what things you can "say" in TEI, what information structures you can describe, what kinds of slots there are in the toolbox.

A meta-language provides the mechanism for defining words and grammatical structures. We aren’t going to talk about this mechanism directly, but if you’re curious I can tell you more about it over lunch. At a simpler level, the XML meta-language also gives us the notation for writing in text encoding languages, and that is what we are going to look at next.

XML Elements

An XML element is like a slot in a toolbox, or a box in your attic, or a field in your database: it’s a thing that identifies and labels a piece of information that you want to know about.

In the toolbox, we mark the boundaries with pieces of wood (or plastic or whatever) so that things stay put.

In XML, we mark the boundaries with things called tags, and these also serve as a label, a way of giving the name for the element.

Since we’ve said that what we’re looking at here is a language, we can also think of elements as being like the nouns in our language: the things we want to name and talk about.
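As a rough illustration of elements and tags, the sketch below writes a toolbox as XML and reads it back with Python's standard xml.etree.ElementTree module. The element names "toolbox" and "tool" are invented for this example and do not come from TEI or any other encoding language. Each start-tag (such as <tool>) and end-tag (such as </tool>) marks the boundaries of an element and gives its name.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: one <toolbox> element containing two <tool> elements.
# The tags mark where each element begins and ends, and also label it.
fragment = "<toolbox><tool>chisel</tool><tool>hammer</tool></toolbox>"

toolbox = ET.fromstring(fragment)

# For each slot inside the toolbox, record its label (the element name)
# and what it holds (the text between the start-tag and the end-tag).
labels = [(child.tag, child.text) for child in toolbox]

print(toolbox.tag)
for name, content in labels:
    print(f"  {name}: {content}")
```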

XML Attributes

An XML attribute is like an adjective: it modifies an element, just as an adjective modifies a noun, and provides more information about it. Attributes are a bit more detailed than adjectives: they tell you which aspect of the element is being described, such as its type, size, or color (the attribute name), along with the specific descriptor within that category of information (the attribute value).

So, just to get the hang of this: give me a noun... then give me an adjective... now translate this into an element with an attribute.
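Here is one way that translation might look, taking "cat" as the noun and "grey" as the adjective. This is only a sketch: the element name "cat" and the attribute name "color" are invented for the exercise and are not part of TEI. The Python standard library's xml.etree.ElementTree module is used simply to show which part is the element name, which is the attribute name, and which is the attribute value.

```python
import xml.etree.ElementTree as ET

# Noun "cat" -> element name.
# Adjective "grey" -> attribute value; the attribute name "color" says
# which aspect of the cat is being described.
# Both names are hypothetical, chosen only for this exercise.
fragment = '<cat color="grey">Tibbles</cat>'

element = ET.fromstring(fragment)

element_name = element.tag     # the "noun"
attributes = element.attrib    # attribute names mapped to attribute values
color = element.get("color")   # the "adjective" for the aspect "color"

print(element_name, attributes, color)
```

Note that the attribute lives inside the start-tag, and that its value is written in quotation marks.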

XML in Detail

So let’s look at an XML document in more detail. What elements do you see? What attributes? What do they mean?

A few more things to note as we look at a sample XML document up close:

Rules for all XML documents

The final things you need to know about XML documents: the rules they have to follow!

These rules come from the W3C as part of the XML standard: all XML languages have to obey these rules. So when you write XHTML, for instance, you also need to obey the rules of well-formedness.

First, they must observe the basic rules of XML, which we’ve seen in practice in our examples but need to spell out explicitly here:

If they obey these rules, they are said to be well-formed.
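A quick way to see well-formedness in action is to hand a few fragments to a generic XML parser and see which ones it refuses. The sketch below uses Python's standard xml.etree.ElementTree module, whose fromstring function raises ParseError on input that is not well-formed. The sample fragments and the helper name is_well_formed are made up for this illustration, and the four violations shown (overlapping tags, a missing end-tag, an unquoted attribute value, more than one root element) are a selection of commonly cited rules rather than a complete list.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a generic XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Illustrative fragments; the element names are TEI-style but the samples are invented.
samples = {
    "properly nested": "<p>A <hi>nested</hi> phrase.</p>",
    "overlapping tags": "<p>A <hi>nested</p></hi>",
    "missing end-tag": "<p>An unclosed paragraph.",
    "unquoted attribute value": "<p n=1>Paragraph one.</p>",
    "two root elements": "<p>One.</p><p>Two.</p>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Notice that the parser needs no knowledge of TEI, or of any encoding language, to make these judgments: well-formedness is about the notation only.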

Rules for specific encoding languages

In addition, XML documents may conform to the rules established by a specific XML language (such as TEI or XHTML).

These rules are very similar to the rules for speaking a human language: are you using the right words? Are you using them in the right way?

When we test an XML document for validity, we are checking the document to see whether it matches the rules for a specific language.
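To give a feel for what a validity check does, here is a toy sketch in Python. It is not how validation is really done: in practice the rules of a language are written in a schema (using a schema language such as DTD, RELAX NG, or W3C XML Schema) and checked by validating software. Here the "language" is a small hand-written table, ALLOWED_CHILDREN, invented for this example; it borrows a few TEI-style names but is far simpler than the real TEI rules. The function validity_errors is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-language: the vocabulary is the set of keys, and the
# grammar says which child elements each element may contain.
ALLOWED_CHILDREN = {
    "div": {"head", "p"},
    "head": set(),
    "p": {"hi"},
    "hi": set(),
}


def validity_errors(xml_text):
    """Return a list of rule violations; an empty list means the document passed."""
    root = ET.fromstring(xml_text)  # raises ParseError if not even well-formed
    errors = []
    for parent in root.iter():
        # "Are you using the right words?"
        if parent.tag not in ALLOWED_CHILDREN:
            errors.append(f"unknown element <{parent.tag}>")
            continue
        # "Are you using them in the right way?"
        for child in parent:
            known = child.tag in ALLOWED_CHILDREN
            if known and child.tag not in ALLOWED_CHILDREN[parent.tag]:
                errors.append(f"<{child.tag}> not allowed inside <{parent.tag}>")
    return errors


valid_doc = "<div><head>Title</head><p>Some <hi>text</hi>.</p></div>"
wrong_word = "<div><heading>Title</heading></div>"
wrong_place = "<div><p>Text <head>Title</head></p></div>"

for label, doc in [("valid", valid_doc), ("wrong word", wrong_word), ("wrong place", wrong_place)]:
    print(label, validity_errors(doc))
```

All three sample documents are well-formed; only the first also follows the rules of our made-up language. That is the difference between the two kinds of rules: well-formedness belongs to XML itself, while validity belongs to a particular encoding language.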