Encoding Guide for Early Printed Books


Customization of your TEI schema

The TEI Guidelines are a very comprehensive treatment of a very large subject: the encoding and representation of humanities texts in digital form. The Guidelines thus cover far more than any individual user or project is likely to need, and for this reason they are explicitly designed so that they can be customized to match the specific needs of any given project or individual. This customization process permits you, for example, to leave out the elements and attributes you do not need, to substitute project-specific terms for generic ones, and to make optional features required or constrain their content more tightly.

There are a number of reasons why customization is both useful and important. First, as a matter of practical convenience, it is much easier to work with an encoding language that only contains the elements and attributes you are actually going to use. It prevents confusion and minimizes the likelihood of using the wrong element; it also reduces the amount of documentation you and your encoders need to read and maintain. Second, customization permits you to describe your data in a way that more precisely matches the goals and interests of your project, by using project-specific descriptive terms rather than the generic ones provided by the unmodified TEI Guidelines. A customized schema can also allow you to ensure a higher degree of consistency in your encoding, by making certain features required or by constraining their content more tightly. For example, if you are encoding a set of plays, and you want to ensure that the speaker of each speech is always identified, you can make that identification required instead of optional. Finally, customization provides a way of expressing your project’s encoding methods and choices in an explicit, formal way, which can be very helpful when sharing data with other projects or when working with a programmer or interface designer. The customized TEI schema will tell your collaborators exactly what elements you use, and what their permitted structure and attributes are; this information can often make the task of developing a useful interface much easier.
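The drama example can be made concrete. The sketch below is illustrative only: the ODD fragment is a hypothetical customization (not taken from any published schema) that marks the who attribute of the TEI sp element as required, and the two small Python functions mimic what a schema generated from it would then enforce. In a real project this check is carried out by an XML validator using the generated schema, not by a script like this one; the sample speeches and all function names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical ODD fragment: change <sp> so that @who is required.
ODD_FRAGMENT = """
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="playsExample" start="TEI">
  <elementSpec ident="sp" module="core" mode="change">
    <attList>
      <attDef ident="who" mode="change" usage="req"/>
    </attList>
  </elementSpec>
</schemaSpec>
"""

# Hypothetical encoded speeches; the second one has no speaker identification.
SAMPLE_TEXT = """
<div xmlns="http://www.tei-c.org/ns/1.0" type="scene">
  <sp who="#hamlet"><speaker>Ham.</speaker><l>To be, or not to be</l></sp>
  <sp><speaker>Oph.</speaker><l>Good my lord</l></sp>
</div>
"""


def required_attributes(odd_xml):
    """Return {element name: set of attribute names marked usage="req"}."""
    root = ET.fromstring(odd_xml)
    required = {}
    for spec in root.iter(f"{{{TEI_NS}}}elementSpec"):
        names = {
            att.get("ident")
            for att in spec.iter(f"{{{TEI_NS}}}attDef")
            if att.get("usage") == "req"
        }
        if names:
            required[spec.get("ident")] = names
    return required


def missing_required(doc_xml, required):
    """List (element, attribute) pairs where a required attribute is absent."""
    root = ET.fromstring(doc_xml)
    problems = []
    for elem in root.iter():
        local_name = elem.tag.split("}")[-1]  # strip the namespace part
        for att in sorted(required.get(local_name, ())):
            if att not in elem.attrib:
                problems.append((local_name, att))
    return problems


rules = required_attributes(ODD_FRAGMENT)
problems = missing_required(SAMPLE_TEXT, rules)
print(rules)
print(problems)
```

With the unmodified TEI, both speeches would be acceptable; under the customized rule, the second speech is reported because it does not identify its speaker.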

Because customization is such an important aspect of TEI use, there are TEI customizations available in all sorts of places: from other projects, from digital library organizations, from colleagues, from the TEI itself. When should you use an existing customization and when should you put the effort into creating your own? The first thing to consider is your goal in creating a TEI transcription: are you trying to create data that will work seamlessly with some existing collection, or are you trying to express a new or personal view of the text? If you want your data to work well with material that has already been produced for a specific project, or if you think an existing customization does a good job of expressing the encoding approach you want to use, then it makes sense to use such a customization for your own data, rather than creating your own from scratch. Similarly, if you think an existing customization might be a close fit, it may make sense and save work to use it as a starting point, making additional small modifications of your own. However, if you are starting a new project whose primary goal is to represent a new set of texts in a distinctive way, then you should probably invest the time in developing a customization specifically for that project. If you are applying for funding, the costs of this work can be written into the grant proposal as an important first step in your work plan.

For almost any project of any size (30 documents or more), we recommend doing an initial analysis of your data and a review of similar projects to see what TEI customizations exist in your field. If you find any candidates, try encoding a small set of sample documents using these customizations, to test them out and see where they fit and where they may chafe. At this stage it is also worth doing an initial simple customization of your own and encoding some test documents that way as well, in the same spirit of inquiry. What this exploration will reveal is both the distinctive features of your documents, and your own emerging ideas about what is worth marking. With this information, you will then be in a much stronger position to create a full customization that really meets your needs—or to determine that some existing schema will do perfectly well. Above all, don’t be tempted to use someone else’s customization just because it’s handy, if it doesn’t really fit your data. It is much better to get someone to help you create a version that will suit your purposes well over the long term, and it will save you time and frustration in the end.

A special note is needed on the most common TEI customization, TEI Lite. This schema has entered into wide use because it is comparatively small and simple to use, and in particular it has been heavily adopted by large-scale digital library projects. However, TEI Lite was not intended for use in actual encoding projects; it was initially developed as a training schema, to demonstrate the customization process. Ironically, this effort to show the ease of customization has instead discouraged it by providing an easy alternative. TEI Lite is not well suited to the purposes of most scholarly encoding projects, for two important reasons. First, it lacks elements that may be needed for specific text types, thus forcing the encoder to seek some inappropriate substitute. And second, it cannot represent the specific constraints (such as controlled vocabularies) of any particular project, so it cannot help you provide a precise, illuminating representation of your texts. TEI Lite can offer an easy way to experiment with TEI, but it should not be used for any actual project encoding. The TEI Lite schema itself should also never be edited directly; if you want to create a customization that is based on TEI Lite, you should take the customization file for the TEI Lite schema and make your further modifications to that. (See below for more information on how to do this.)

What exactly is a customization?

In order to understand what a TEI customization is, it’s useful first to understand how the TEI customization process works. The process we describe here applies to the current version of the TEI Guidelines (P5); details on how the process worked in the previous version of the TEI (P4) are given below.

To understand the TEI customization process, let’s first consider how the TEI Guidelines are constructed. The Guidelines are themselves encoded in TEI, in a file referred to as the ODD (for One Document Does it all), which contains three kinds of information:

  • the words of the prose TEI Guidelines
  • individual pieces of documentation for individual TEI elements and attributes
  • small pieces of schema code which, taken together, constitute the TEI schema.

To generate a readable version of the TEI Guidelines, or a usable version of the TEI schema, this source file is processed by a tool called “roma” which assembles all the prose material into a single sequential document, collects all of the bits of documentation together into a single ordered reference, and assembles all of the schema fragments into a single, usable schema.

When this process is performed on the entire TEI, the result is the complete TEI schema and Guidelines (as published in print, or on the TEI web site). TEI customization intervenes in this process by specifying more precisely which parts of the TEI should be included, and by describing modifications, so that what is generated is not the complete TEI Guidelines but a version that reflects the choices and modifications that have been made in the customization process. The customization process produces both a customized schema and a customized set of documentation containing references only to the elements and attributes which actually appear in that schema.

The customization itself is expressed as a second ODD file, whose contents are in effect a description of what is being included, omitted, or altered. To generate a customized schema and documentation, this customization ODD is fed to the roma processor together with the main TEI ODD.
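To give a sense of what such a description looks like, here is a minimal sketch of the schema-specification part of a customization ODD, wrapped in a few lines of Python that read it back and report what is included, omitted, and altered. The fragment is hypothetical (the identifier exampleProject and the particular choices are assumptions made for illustration), and the Python is only a convenient way of inspecting the file; it does not do the work of the roma processor.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical customization: five modules, one element removed, one altered.
CUSTOMIZATION = """
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0" ident="exampleProject" start="TEI">
  <!-- included: the modules this project draws on -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="drama"/>
  <!-- omitted: an element of the drama module the project never uses -->
  <elementSpec ident="camera" module="drama" mode="delete"/>
  <!-- altered: <div> is kept, but its type attribute becomes required -->
  <elementSpec ident="div" module="textstructure" mode="change">
    <attList>
      <attDef ident="type" mode="change" usage="req"/>
    </attList>
  </elementSpec>
</schemaSpec>
"""


def summarize(odd_xml):
    """Collect module keys and the elements marked for deletion or change."""
    root = ET.fromstring(odd_xml)
    summary = {"modules": [], "delete": [], "change": []}
    for ref in root.iter(TEI + "moduleRef"):
        summary["modules"].append(ref.get("key"))
    for spec in root.iter(TEI + "elementSpec"):
        mode = spec.get("mode")
        if mode in ("delete", "change"):
            summary[mode].append(spec.get("ident"))
    return summary


summary = summarize(CUSTOMIZATION)
for label, names in summary.items():
    print(f"{label}: {', '.join(names)}")
```

Note how short the file is: everything not mentioned in it is inherited unchanged from the main TEI ODD.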

How do I create a customization?

In some sense, creating a TEI customization file (i.e. an ODD that describes how your schema will differ from the TEI default) is very much like encoding any other TEI document. You can use any XML editor to author it; all you need to know is what elements are used to describe the kinds of features (inclusion and omission of elements, changes to attribute values, etc.) that such a document contains. However, these features are unique to the ODD and may be challenging to people who are not familiar with how schemas are written. For this reason, the TEI has provided a simpler approach, via a web tool called “Roma” (which is related to, but not the same as, the “roma” tool mentioned above). Roma the web tool (http://www.tei-c.org/Roma/) provides a fairly simple web interface through which you can select which modules, elements, and attributes will be included in the custom schema, create controlled value lists for attributes, and make other more advanced changes. Once you have created your customization file, you can then also use Roma to generate the custom schema and reference documentation. You can also use Roma to make further customizations to an existing customization file. This is useful if you want to base your customization on an existing ODD from another project. (See below for more detail.) Documentation for getting started with Roma is also available at the TEI web site (http://www.tei-c.org/Guidelines/Customization/use_roma.xml).
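As an illustration of the kind of markup involved, the sketch below shows how a controlled value list for an attribute might be written by hand in an ODD, and uses Python to read the permitted values back out. The choice of element (name), attribute (type), and values is hypothetical, as are the function and variable names; a file produced through Roma may differ in detail.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical ODD fragment: restrict name/@type to a closed list of values.
ODD_FRAGMENT = """
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             ident="name" module="core" mode="change">
  <attList>
    <attDef ident="type" mode="change">
      <valList type="closed" mode="add">
        <valItem ident="person"/>
        <valItem ident="place"/>
        <valItem ident="organization"/>
      </valList>
    </attDef>
  </attList>
</elementSpec>
"""


def allowed_values(odd_xml, element, attribute):
    """Return the values listed for element/@attribute, in document order."""
    root = ET.fromstring(odd_xml)
    for spec in root.iter(TEI + "elementSpec"):
        if spec.get("ident") != element:
            continue
        for att in spec.iter(TEI + "attDef"):
            if att.get("ident") == attribute:
                return [item.get("ident") for item in att.iter(TEI + "valItem")]
    return []  # no controlled list declared for this attribute


NAME_TYPES = allowed_values(ODD_FRAGMENT, "name", "type")
print(NAME_TYPES)
print("place" in NAME_TYPES, "ship" in NAME_TYPES)
```

Because the list is declared as closed, a schema generated from this fragment would be expected to reject any other value of type on name.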

The WWP Guide customization

The WWP has developed a TEI customization for our use in encoding early printed books, called EMPB (or Early Modern Printed Books). A version of this customization is published with this Guide. It includes the TEI modules that are required for the kind of detailed encoding that we describe here, and it also provides useful values for common attributes. For some projects, it may provide exactly what is needed. It could also serve as a good head start in designing your own customization.

The most significant features of the EMPB customization are:

  • Selection of modules: EMPB includes the TEI modules for encoding names and dates, transcription of primary sources, figures, linking, drama, and verse.
  • Added elements: EMPB includes some additional elements not covered in the TEI.
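The module selection listed above could be expressed in an ODD roughly as follows. This is not the published EMPB customization, only a sketch of a similar selection: the identifier empbLikeSketch is made up, the module keys are the standard TEI P5 names for the modules described, and the four base modules (tei, header, core, textstructure) are added on the assumption that they are needed for any ordinary TEI document. The sketch does not attempt the added elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Assumed base modules, followed by the modules named in the list above.
BASE_MODULES = ["tei", "header", "core", "textstructure"]
SELECTED_MODULES = {
    "namesdates": "names and dates",
    "transcr": "transcription of primary sources",
    "figures": "figures",
    "linking": "linking",
    "drama": "drama",
    "verse": "verse",
}


def build_schema_spec(ident, modules):
    """Build a schemaSpec element with one moduleRef per module key."""
    spec = ET.Element(f"{{{TEI_NS}}}schemaSpec", {"ident": ident, "start": "TEI"})
    for key in modules:
        ET.SubElement(spec, f"{{{TEI_NS}}}moduleRef", {"key": key})
    return spec


spec = build_schema_spec("empbLikeSketch", BASE_MODULES + list(SELECTED_MODULES))
if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
    ET.indent(spec)
odd_text = ET.tostring(spec, encoding="unicode")
print(odd_text)
```

The printed schemaSpec could be pasted into the body of an ODD file and then refined further, for example by deleting unused elements from each module.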

Customization and TEI conformance

One of the important reasons to use standards like the TEI Guidelines is that they can help make humanities data more widely useful, by eliminating unnecessary variations and making the data more consistent from project to project. While customization is a very important part of TEI use, it can also introduce local variations that pose challenges if you want to exchange data with other projects, or use common tools on your encoded texts. As a general rule, it’s a good idea to follow the TEI standard unless the change you are making is demonstrably better, or essential for your project.

The TEI itself uses the concept of conformance to describe where a customization, or a TEI-encoded document, falls within the spectrum of variation. In general terms, a document is TEI-conformant if it can be validated against a schema that is directly derived from the TEI Guidelines. Directly derived here means that the schema in question contains only TEI elements, and that these elements may appear only where specified in the TEI Guidelines. A conformant schema might be more tightly constrained than the default TEI schema: it may have fewer elements, and it may impose tighter restrictions on where they may appear. But it may not add new elements and it may not loosen the constraints on document structure.

The TEI’s goal in having a concept of conformance is to assist projects and software developers in improving the possibility of data interchange and interoperability. Since data exchange and tool development are still nascent areas for the TEI, it remains to be seen how much of a difference conformance makes, and whether in fact it will prove easier to interchange conformant documents than non-conformant documents. Whatever the outcome, however, it is important to note that data interchange is only useful if the data itself is of high quality. If a project has data that cannot be encoded accurately or usefully with the TEI Guidelines as they stand, then customization (even a non-conformant customization) is far better than inaccurate encoding which misrepresents the nature of the data. Many projects may find that they need to add new elements, or alter the way the document’s structure is represented, in ways that cannot be accommodated within a conformant TEI schema.

TEI conformance thus functions not as a standard of quality, but as a statement about commonality of approach. Many excellent and useful TEI projects have of necessity developed encoding approaches that are not TEI conformant. Their data can typically be converted into TEI-conformant output for exchange purposes. In your own project design, you should consider conformance as one way among many of assessing the social and practical value of your encoding.

For a more detailed account of conformance (and the related concepts of conformable documents and TEI extensions), please see the TEI Guidelines, chapter 23.3.

How to modify an existing customization

Although digital projects often create their own TEI schemas from scratch, there may be cases where another project in your field has already done useful work in creating a discipline-specific schema (for instance, to encode Renaissance drama or oral histories) and you want to build your own project’s work on that foundation. In these cases, you may find that their schema does exactly what you want, but there may also be small changes that you want to make (for instance, to add a few more values to a controlled list of terms). How should you approach this problem?

First, never modify a TEI schema directly: always modify the ODD file and generate a new schema, using Roma. Similarly, when asking for another project’s TEI customization (or when sharing one of your own), always exchange the ODD file, not a generated schema. There are several good reasons for this:

  • Ease of maintenance: The ODD file expresses just your local changes, whereas the generated schema includes information drawn from the main TEI ODD. When the TEI ODD is updated, if all of your changes are expressed in your local ODD file, you may be able to easily generate a new schema that applies your changes to the updated version of the TEI. However, if you have made changes directly to the schema itself, you will need to make those changes to the schema over again. If the changes are well documented, this may be merely painstaking; if they are not, the process may be extremely irksome.
  • Clarity of documentation: The ODD file contains information about the changes that have been made to the schema, which will permit subsequent readers or users to understand how your data is encoded. If you modify the schema directly, the ODD file will no longer reflect these changes accurately.

Once you have the ODD file for the schema you are borrowing, you can do several things:

  • You can generate a schema (using Roma) and use it to encode some test documents of your own. This may help you decide whether you can use the schema as is, without making further customizations.
  • You can upload the ODD into Roma and make further customizations to it, to suit your own project’s needs. The resulting ODD will serve as your own project’s sharable documentation of how your TEI schema is designed. If other projects request your schema, this is what you should give them.
  • From this ODD, you can then generate your own custom schema and documentation to use in your own project.

Customizations in P4

The customization process in P4 is similar, but not identical, to the process described above. As with P5, in P4 the customization process involves creating a customization file that describes the specific differences between your own version and the unmodified TEI. To generate a usable schema, this file is sent through a process that reads both the main TEI files and your customization file, and creates a flattened schema that can be used for encoding. This flattened schema should be considered strictly an output file, and should never be modified; if you want to make further customizations, you should change the customization file and regenerate the schema.

In P4, there is also a web tool available to assist with the customization process; it is called the TEI Pizza Chef and can be found at the TEI web site: http://www.tei-c.org/pizza.html.

P4 is no longer the current version of the TEI Guidelines, and the Pizza Chef is also a somewhat outdated tool; if you are starting a new project, you should use P5.