Publishing and Transforming TEI Documents

This tutorial provides an overview of XML publication platforms and outlines the basic framework for the rest of the tutorials in Transformation and Publication. It covers why one might want to publish TEI data and how one might go about publishing it.

XML workflow

No one needs to convince us of the importance of the overall topic here: "transforming and publishing TEI". It's why we create TEI data. However, we may need to do some preliminary clarification and scoping to get a full sense of what we mean, of what the possibilities are and what kinds of "publishing and transforming" they entail.

One way to orient ourselves in the landscape of "publishing and transforming" is to think about how we use our own data. If we think of the life cycle of a TEI project, there are numerous places along the timeline where we want to express different views of our data, for internal or external viewing: proofreading and error-catching formats that extract specific structures to let us catch inconsistencies in the encoding; web publication; print publication; formats for contribution to collaborative projects (where we might want to simplify or alter our markup to match the target encoding of those projects); metadata formats (e.g. to expose to metadata harvesters); and archival formats for committing to a repository.

Usually when we think about transforming and publishing TEI data, we are talking specifically about transforming for print or web publication. However, as the list above shows, we may need to transform our TEI for other reasons throughout our workflow. For example, we may want to transform into formats that facilitate proofreading and error catching, formats for contribution to collaborative projects, metadata formats, or archival formats.

Single-source publishing and XSLT

Another way to approach the topic of this workshop is to think about transformation and publishing as a variety of informational avenues that radiate out from our TEI data. Even though it's probably a familiar concept to many of you, it's worth noting an important assumption that underlies much of our work with XML: we're creating a single XML source from which we are going to generate many different kinds of output.

This is important because the XML source is an expensive and valuable information object: it represents a careful modeling of our research materials, we've put a lot of work into it (transcription, encoding, proofreading, correction, annotation, other kinds of enhancement) and we want to exploit it in many different ways, automatically, not by hand.

When we generate these different varieties of output, we are often losing information: erasing distinctions that are present in the source (but unnecessary in the output), or moving from a representationally rich language (like TEI) to a representationally impoverished language (like HTML).

But because these output formats are generated automatically rather than by hand, this information loss doesn't matter: the source retains its informational richness and represents the full set of possibilities from which any specific output can be generated.

Put another way: when we create XML, we're creating a single source from which we are going to generate many different kinds of output, and transformation and publication are the informational avenues that radiate out from that source.

Our XML data is quite rich: it records a great deal of information and can support many uses. However, creating rich data is expensive and often involves a lot of labor. Because our XML data is both valuable and expensive (and time-consuming) to create, we want to preserve it intact. What we want is a way to automatically transform our XML into other, less rich output formats without overwriting or impoverishing the original.
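
The idea of one untouched source feeding several outputs can be sketched in code. The tool this tutorial uses for the job is XSLT; the Python below is only a stand-in built on the standard library, and the TEI fragment, the function names, and the choice of two output formats are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny, invented TEI fragment standing in for a real project file.
SOURCE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>It was the <hi rend="italic">best</hi> of times.</p>
    <p>It was the worst of times.</p>
  </body></text>
</TEI>"""

def paragraphs(root):
    """Return every TEI <p> element in document order."""
    return root.findall(".//tei:p", TEI_NS)

def to_plain_text(root):
    """Proofreading view: one paragraph per line, all markup dropped."""
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs(root))

def to_html(root):
    """Web view: keep the paragraphs, drop finer distinctions such as <hi>."""
    body = ET.Element("body")
    for p in paragraphs(root):
        ET.SubElement(body, "p").text = "".join(p.itertext()).strip()
    return ET.tostring(body, encoding="unicode")

root = ET.fromstring(SOURCE)
plain = to_plain_text(root)  # output 1: plain text
html = to_html(root)         # output 2: simplified HTML
```

Both outputs have lost the `hi` markup, but the parsed source still contains it: the information loss happens only in the generated formats, never in the source.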

Some examples

Charles Brockden Brown Archive

Mark Twain Project

Paul the Simple, XML source

Petrus Plaoul

The projects listed above are examples of TEI projects that rely on this process of transformation.

The Charles Brockden Brown Archive allows you to download the TEI files, read the transcriptions, and look at the page images. As you can see, the markup is fairly minimal (only div, p, and lb). The transformation is also relatively simple: it shows the page images (recorded with the facs attribute) and the line breaks that the encoder has marked. The site and reading interface, however, are a bit more complicated, and a project like this would probably require that you hire a web developer. Also, if you go to the search button, you will see that the search uses XTF, which we will discuss later.

The Mark Twain Project also has a site that would require a lot of time and money invested on the project's part. If you click through to one of the letters, you can see sidebars containing notes that are highlighted when clicked. The search function on this site also requires someone with extensive technical experience to set it up.

Transformation as a power tool

A third important aspect of our topic is the idea of data as a mutable, protean substance: as a kind of plastic informational model that we can reshape and manipulate as needed.

The four examples here represent pretty much the same pieces of data: any one of them could be generated from any of the others. And yet these differences might matter in the context of some particular tool or standard way of doing things.

The point is that our data is almost never trapped in its current format: when we understand it as transformable, we gain power over it and we can use it more flexibly. If a collaborator needs some information extracted from our data, or if they put their fields in a slightly different order, it's not a problem: we can transform our data to match theirs.
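
As a rough illustration of data that is not trapped in its format, the sketch below takes one invented record and re-expresses it in two other shapes. The record, the field names, and the helper functions are hypothetical, and Python's standard library is used here only as a stand-in for what an XSLT stylesheet would do.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, encoded with child elements.
AS_ELEMENTS = "<person><name>Paul</name><role>author</role></person>"

def elements_to_attributes(xml_text):
    """Re-express each child element as an attribute on the parent."""
    src = ET.fromstring(xml_text)
    out = ET.Element(src.tag, {child.tag: child.text for child in src})
    return ET.tostring(out, encoding="unicode")

def elements_to_json(xml_text, field_order):
    """Emit the same record as JSON, in whatever field order is requested."""
    src = ET.fromstring(xml_text)
    record = {name: src.findtext(name) for name in field_order}
    return json.dumps(record)

as_attributes = elements_to_attributes(AS_ELEMENTS)
# A collaborator who wants the role before the name just asks for that order.
as_json = elements_to_json(AS_ELEMENTS, ["role", "name"])
```

All three versions carry the same two facts; only the packaging differs, which is why any one of them can be generated from another.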

Scope and ambition

The chief tool for doing all of these kinds of work is a programming language called XSLT (Extensible Stylesheet Language Transformations). It can be used on its own to generate different kinds of transformed and manipulated data (such as HTML, KML, JSON, or other XML formats), and it is also built into many (most? all?) of the XML publication systems that we use, such as XTF, as the way that they take XML data and manipulate it as part of their publication activities. Either way, it gives us a way of manipulating our XML data: to extract pieces of it, reshape them, change their format, and generally do whatever we want with them.

Let's talk for a moment about what we're going to cover in this seminar (and what we're not going to cover).

This seminar is aimed at people who have TEI data and not much else: we aren't assuming familiarity with programming, or with XML publishing tools.

Our goal is to help you learn about what's involved in using your TEI data: in publishing it, in manipulating and transforming it into other formats, exploiting its informational potential; we'd like you to come away, first of all, with a sense of what is possible.

How about in concrete, practical terms? What are we actually going to cover? XSLT is hugely powerful (it is a full-fledged programming language), but as a result it's a big topic. Hence we are not aiming here to teach you XSLT in any kind of comprehensive way. What we are aiming to do is give you an understanding of how XSLT works, both on its own and in the context of XML publishing systems. On the first two days, we are going to look at a lot of examples, and we're going to experiment with a lot of things that XSLT can do, including generating HTML and KML. On the third day, we are going to install and set up two fairly simple XML publishing tools: XTF and TEI Boilerplate. So at a minimum, by the end of the workshop you will be able to take your TEI data and publish it on the web in some basic ways.

By the end of the workshop, you should also have a good sense of whether XSLT is something you want to know more about and learn in a more systematic way, and if it is, we encourage you to take a more intensive XSLT workshop: Syd teaches one at DHSI, and Syd and David teach one at Brown every so often. This workshop is a good starting point for either of those workshops.

It is important to note what this primer will cover and what it will not. For our purposes, we will be focusing on a language called XSLT (Extensible Stylesheet Language Transformations). We will be focusing specifically on transforming TEI data, although the language is capable of transforming into and out of any XML language, as well as other types of data altogether. We will mostly be focusing on how XSLT works in the context of publishing systems, using tools like XTF and TEI Boilerplate. However, there is much more that you can do with XSLT!

Simple Publication with XSLT

XSLT (Extensible Stylesheet Language Transformations) allows you to transform XML documents into other formats, including other XML formats.

Essentially, XSLT allows you to map a given XML element onto another XML element, saying "take in the following construct, and put out this other construct." The output could be a construct in the same language, or in a different language such as XHTML, as in the example here.

XSLT allows you to transform from one XML format into another. When you write an XSLT stylesheet, you are essentially saying "take this construct and turn it into another construct." The transformation can occur within the same language (turning one TEI element into another) or from one language to another (taking a TEI element and transforming it into HTML). In the example listed, the TEI element text is transformed into the HTML element body. This matters because the resulting HTML can be styled with stylesheet languages like CSS.
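
The "take this construct, put out that construct" idea can be mimicked in a few lines. This is not XSLT and not how any particular publishing system works; it is a minimal Python sketch using the standard library, in which the `RULES` table, the fallback to `span`, and the sample TEI are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Illustrative mapping only: TEI element name -> HTML element name.
RULES = {"text": "body", "div": "div", "head": "h2", "p": "p", "hi": "em"}

def transform(node):
    """Rebuild a TEI element as HTML, one rule per element name."""
    local_name = node.tag.replace(TEI, "")
    # Elements without a rule fall back to a generic <span>.
    out = ET.Element(RULES.get(local_name, "span"))
    out.text = node.text
    for child in node:
        new_child = transform(child)
        new_child.tail = child.tail  # keep the text that follows the child
        out.append(new_child)
    return out

SOURCE = (
    '<text xmlns="http://www.tei-c.org/ns/1.0"><div><head>Chapter 1</head>'
    "<p>A <hi>short</hi> example.</p></div></text>"
)

html = ET.tostring(transform(ET.fromstring(SOURCE)), encoding="unicode")
```

Here the TEI `text` element comes out as an HTML `body`, and each nested element is renamed according to its own rule, which is roughly the role that template rules play in an XSLT stylesheet.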

XML Databases and Publication Frameworks

Tools designed to manage large groups of XML files, with more advanced functionality: fast, efficient searching; transformations involving groups of files. Examples: eXist, DBXML, Xindice, XTF, MarkLogic.

The XML database and publication framework universe

These kinds of tools are designed to manage large groups of XML files, and to provide certain kinds of advanced functionality. One is fast, efficient searching. Another is transformations involving groups of files: not just transforming each file separately, but taking parts of different files and creating new result files, for instance a sorted list of the first lines from all the poems in a collection.

How do databases fit into a larger XML publication framework? What do they do? They create and store indexed information: that is, information from the source XML files that has been preprocessed to make it more accessible and easier to manipulate. For instance, they might store tables of all the document metadata (author, title, genre, date, etc.) so that it can be searched and sorted more quickly. They also contain a representation of the document's structure in a format that makes it easier to process, so that certain kinds of navigation are easier. Within the XML publication framework, the database sits and waits for queries to come in. When it receives a query, it performs the necessary searching and returns a result (in the form of an XML fragment, or a node set, or some proprietary structure). The result can then be transformed using XSLT (e.g. into HTML for delivery to a browser, or into some other XML format for other processing).

XML databases exist as separate modules that can be used as the basis for XML publishing systems, for instance eXist, DBXML, and Xindice (Apache).

XML databases and publication frameworks are designed to manage large groups of XML files. They provide certain kinds of advanced functionality, such as fast, efficient searching and transformations of large groups of files. Rather than treating each file separately, they can perform transformations that take parts of different files and create new result files. For instance, this type of tool could create a sorted list of the first lines of all the poems in a collection.
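
The first-lines example can be sketched as follows. A real XML database would do this over indexed files on disk; here, as an assumption for the sake of a self-contained example, the "collection" is a Python dictionary of invented poems, and the file names and helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def make_poem(*lines):
    """Build a minimal TEI document with one line group (invented content)."""
    body = "".join(f"<l>{line}</l>" for line in lines)
    return (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><lg>'
        + body
        + "</lg></body></text></TEI>"
    )

# Stand-ins for separate files in a collection.
COLLECTION = {
    "poem1.xml": make_poem("Morning comes on quiet feet", "and leaves before noon"),
    "poem2.xml": make_poem("A bell rings twice", "across the empty square"),
    "poem3.xml": make_poem("Zephyrs stir the reeds", "beside the water"),
}

def first_line(xml_text):
    """Return the text of the first <l> element in a document."""
    root = ET.fromstring(xml_text)
    line = root.find(".//tei:l", NS)
    return "".join(line.itertext()).strip()

# One result built from parts of many files: (first line, file name), sorted.
first_lines = sorted(
    (first_line(xml_text), filename) for filename, xml_text in COLLECTION.items()
)
```

The result is a new list that exists in none of the source files individually, which is the kind of cross-file output described above.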

So, how do databases fit into a larger XML publication framework, and what do they do?

They can create and store indexed information: information from the source XML files that has been preprocessed to make it more accessible and easier to manipulate. For instance, they might store tables of all the document metadata so that it can be searched and sorted more quickly.

XML databases contain a representation of the document's structure that makes it easier to process, so that certain kinds of navigation are easier. As you can see in the example, the words from the opening of Charles Dickens' A Tale of Two Cities are indexed, so that a query for the word "times" is returned more quickly than if the search engine had to run through the entire document character by character.
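
To make the indexing idea concrete, here is a deliberately simple word index over the opening words of the novel. Real XML databases use far more sophisticated structures; the function name and the decision to record word positions are assumptions for this sketch.

```python
import re
from collections import defaultdict

OPENING = "It was the best of times, it was the worst of times"

def build_index(text):
    """Map each lowercased word to the positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        index[word].append(position)
    return dict(index)

# Built once, ahead of time.
index = build_index(OPENING)

# Answering a query is now a single dictionary lookup, not a scan of the text.
hits = index.get("times", [])
```

The cost of reading the whole text is paid once, when the index is built; each later query only consults the table.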

Within the XML publication framework, the database sits and waits for queries to come in. When it receives a query, it performs the necessary searching and returns a result in the form of an XML fragment, node set, or some proprietary structure. The result is then transformed using XSLT. So, for example, the result could be transformed into HTML for delivery to a browser, as in the example on the slide (which renders the search hits red).
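
The last step, turning a search result into HTML with the hits marked in red, might look something like this. In the framework described here that step is done with XSLT; the Python function below is a hypothetical stand-in, and the inline `style` attribute is just one assumed way of making the hits red.

```python
from html import escape

def highlight(sentence, query):
    """Wrap each occurrence of the query word in a red <span>."""
    out = []
    for word in sentence.split():
        core = word.strip(",.;:!?")  # ignore surrounding punctuation
        if core.lower() == query.lower():
            marked = '<span style="color: red">' + escape(core) + "</span>"
            out.append(escape(word).replace(escape(core), marked, 1))
        else:
            out.append(escape(word))
    return "<p>" + " ".join(out) + "</p>"

highlighted = highlight(
    "It was the best of times, it was the worst of times", "times"
)
```

The returned string is an HTML paragraph that a browser could display directly, with both occurrences of the query word wrapped in a red span.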

The Bigger Picture

The tools you need, and the people you need, can be imagined as a rough continuum of increasing scale, complexity, difficulty, and cost.

At the simplest level, there are things you can do (or learn to do) by yourself, with very little in the way of equipment or software: tools like XSLT and CSS will go a long way towards producing simple, effective interfaces for browsing and reading small sets of documents.

At a slightly more complex level, as the number of documents increases and as you want to do more ambitious things with them (such as visualizations or complex searching), you need software tools that are a little more challenging to manage. These are perfectly within the capabilities of a humanist but require more time: this is not something you can do on the side of another job; it becomes someone's major job responsibility.

Going a bit further, we get to XML publication frameworks that require a professional systems administrator, someone who really understands the installation and configuration of things like web servers, XML databases, etc. These are the kinds of tools we need to build things like data mining or text/topic analysis into our publications, and also to publish larger collections of documents that require more server power and speed.

For production-level publication, where you may actually be charging money for access (and hence need to do things like authentication) and may have higher standards of performance and reliability, you need to start engaging with your institutional IT organization to make sure that things like backups, server maintenance, etc. are being handled at the appropriate level of professionalism. This is also the level of scale at which we start to be able to work effectively with multiple large data sets: for instance, multiple projects of substantial size.

Finally, if we want to ensure the long-term sustainability of projects, we need to engage with systems like institutional repositories and the data curators who can help us ensure that data will be maintained, migrated, etc. after the project itself is no longer funded.

So, considering where the three examples we looked at earlier might fit in: Paul the Simple is the work of a single scholar, acting alone; the Charles Brockden Brown Archive involves a small amount of professional systems administration; and the Mark Twain Project has a much larger staff, embedded in the CDL (which is where XTF was built).

The types of systems you will need in place to transform and publish TEI data exist on a continuum of complexity, difficulty, and cost. There's a lot you can learn to do by yourself! However, functionality is limited if you want to create and maintain something on your own. More expensive and complicated frameworks need institutional support in order to survive, and not all of us have access to this kind of support. Moving forward (in this primer and for your own data), it is important to think about where your project fits along this continuum.
