Encoding Guide for Early Printed Books
Document analysis
Like project analysis, document analysis is an attempt to make explicit the constraints and assumptions that will affect your encoding practice. Project analysis looks outwards to contextual factors such as funding, audience, and duration; document analysis looks in towards your texts and examines their details. You will need to examine a representative set of sample documents and (ideally) do a preliminary encoding or marginal annotation indicating what significant document features are present and how they would be represented in TEI. The more samples you consider, the more likely you are to anticipate all of the challenges your collection entails, but clearly there are problems of scale: you want to be able to learn about your collection and plan your work accordingly before you actually encode the entire thing.
You should consider issues like the following:
- Homogeneity
Are your documents all very similar, either structurally or generically? do they all contain the same basic set of textual features, or do they differ greatly from one another? A collection of very similar documents can be simpler to encode, partly because (if the documents are structurally very similar) you may be able to create basic templates that anticipate the documents’ essential features, and partly because it will be easier to maintain a consistent approach. In a collection that covers a broad variety of texts, there will necessarily be a wider range of textual features, but in addition there may be variations on these features that need to be taken into account: bibliographic references in many different styles, names with a range of different presentational conventions, tables of contents with varying structures, different treatment of quotations or dialogue, and the like. These variations may affect the complexity of your encoding schema (the set of rules that govern your text encoding) or may complicate the searching and presentation of your collection. This is not a reason to avoid complex collections, but a reason to plan ahead if you are working with one.
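As an illustration of the template idea, a collection of structurally similar books might start each file from a skeleton along the following lines. This is only a sketch: the element names are standard TEI P5, but the particular divisions chosen (a dedication, numbered chapters), the `type` values, and the bracketed placeholder text are assumptions made for the example, and your own documents will dictate what the template should contain.

```xml
<!-- Hypothetical starting template for one book in a homogeneous collection -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>[Title of the book]</title>
      </titleStmt>
      <publicationStmt>
        <p>[Publication details for the electronic file]</p>
      </publicationStmt>
      <sourceDesc>
        <p>[Description of the printed source]</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <titlePage>
        <docTitle>
          <titlePart>[Title as printed]</titlePart>
        </docTitle>
      </titlePage>
      <div type="dedication">
        <p>[Dedication text]</p>
      </div>
    </front>
    <body>
      <div type="chapter" n="1">
        <head>[Chapter heading]</head>
        <p>[First paragraph]</p>
      </div>
    </body>
  </text>
</TEI>
```

A template like this is most useful when nearly every document fits it. In a heterogeneous collection it may be safer to keep only the header portion fixed.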
- Genre
What genres are represented in your documents? are they focused in one particular area, or do they represent a wide generic range? Some genres (such as prose fiction) are comparatively simple to encode, at least at the structural level, though they may require extra work if you plan to include detailed word-level encoding (for instance, glossaries of hard words, spelling modernization, identification of people and places, etc.). Others, like drama or scientific writing, involve more encoding per unit of text and are much more labor-intensive to encode unless some of the work can be automated.
A consideration of genre is also important as you choose what parts of the TEI Guidelines you will be using. The basic elements that are available as part of the TEI core allow for a simple encoding of the major genres (prose, verse, drama), but for more detailed encoding of verse or drama, and for specialized genres such as dictionaries, manuscript descriptions, linguistic corpora, or scholarly editions, you will need to use some of the additional TEI modules. Taking an inventory of the genres present in your collection at the outset will help you create an adequate schema from the start and reduce the need for later changes.
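To give a rough sense of how the amount of markup varies by genre, the fragments below encode comparable amounts of text as prose, verse, and drama using only elements from the TEI core. The sample text is invented for the illustration, and the `type` values are one possible choice rather than a recommendation.

```xml
<!-- Prose: one element per paragraph -->
<p>The coach set out before dawn, and by noon had reached the ford.</p>

<!-- Verse: a line group with one element per line -->
<lg type="couplet">
  <l>The coach set out before the break of day,</l>
  <l>And reached the ford by noon along the way.</l>
</lg>

<!-- Drama: a speech with its speaker label, a stage direction, and a line -->
<sp>
  <speaker>Coachman</speaker>
  <stage type="delivery">Aside</stage>
  <l>We reach the ford by noon, or not at all.</l>
</sp>
```

Even at this basic level the dramatic fragment needs several elements for a single line of text, which is one reason drama tends to be more labor-intensive to encode.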
- Chronology
Do your documents represent a broad chronological range, or are they all from the same time period? Historical changes in printing practices may have a significant effect on how certain textual features are presented, which may in turn affect how they are encoded, particularly if you are interested in capturing the material details of the document. In particular, conventions for representing the structural divisions of a document (systems of headings and clarity of nesting relationships) and for distinguishing between verse and prose, or between direct speech and third-person narration, may be very different from document to document. These differences can make it more challenging to articulate consistent ways of identifying and encoding these features. This may not affect a reader’s experience of each document individually, but to provide coherent and meaningful ways of searching and analysing the collection as a whole, the encoding system needs to provide some degree of uniformity.
Similarly, the time span of your collection may affect the uniformity of more fine-grained features like spelling, the presentation of names, and even which words the text presents as being in a foreign language. These are smaller issues, but if you are planning to do a very detailed encoding in which rendition plays a significant role, issues like these will make certain encoding decisions more complex, and knowing about them in advance will help you make more durable decisions. For instance, you might be encoding names and need to make a consistent decision about what to include in that category (honorifics? titles of nobility? suffixes like Esq.?). You will need to decide what to do if at one point in your collection names are italicized (but not titles of nobility or suffixes), at another point names are italicized (including all of these features), and at still a third point names are not italicized at all. Should rendition be your guide to what constitutes a name in each set of texts? or can you make a consistent decision about name structures and account for rendition by some other means?
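One way to approach the question about names is to keep the name structure constant and record the rendition separately. The sketch below assumes three hypothetical books that print the same (invented) name in three different ways. It uses the TEI elements `persName`, `roleName`, and `hi` with the global `rend` attribute. The value `italic` and the decision to include the title of nobility inside the name are assumptions made for the example.

```xml
<!-- Book A: the printer italicizes the name but not the title of nobility -->
<persName><roleName type="nobility">Lord</roleName> <hi rend="italic">Ashby</hi></persName>

<!-- Book B: the whole name, title included, is italicized -->
<persName rend="italic"><roleName type="nobility">Lord</roleName> Ashby</persName>

<!-- Book C: no italics at all -->
<persName><roleName type="nobility">Lord</roleName> Ashby</persName>
```

In all three cases a search for personal names finds the same structure, while the differing typography is still recorded for readers who are interested in it.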
- Languages
Do your documents contain more than one language? more than one writing system? In XML, characters from non-Roman alphabets can be represented using Unicode. TEI P5 also provides a mechanism for representing characters that are not (yet) available in Unicode. However, there may be other issues you need to address that go beyond transcription. For instance, if your documents are in multiple languages, you should probably identify the languages used and indicate the language of each text or text segment, to allow users to limit their searches usefully. This is particularly useful if you are using a search engine that allows for stop-words (frequently used words that are often excluded from searches to improve efficiency), since these words will of course differ from language to language.
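A minimal sketch of language identification might look like the following. It shows two separate fragments: a declaration of the languages used, which belongs in the TEI header, and a passage of text in which the global `xml:lang` attribute marks the language of a division and of a phrase within it. The sample sentence is invented, and the choice of English and Latin is an assumption for the example.

```xml
<!-- Fragment 1, in the TEI header: declare the languages that occur -->
<profileDesc>
  <langUsage>
    <language ident="en">English</language>
    <language ident="la">Latin</language>
  </langUsage>
</profileDesc>

<!-- Fragment 2, in the text: mark the language of each segment -->
<div xml:lang="en">
  <p>He took for his motto <foreign xml:lang="la">festina lente</foreign>,
    and kept to it.</p>
</div>
```

Because `xml:lang` is inherited by child elements, only the points where the language changes need to be marked. A search tool can then apply the appropriate stop-word list to each segment.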
- Print or manuscript
Are your texts printed or manuscript, or a mix of the two? if manuscript, are they fair copies intended for circulation, or are they drafts or more informal materials with less systematic self-presentation? Manuscript materials in themselves pose a range of special problems, since handwriting is intrinsically a less highly formalized representational system than print, and since position on the page is largely unconstrained. While it is possible to create useful XML representations of manuscript materials, approaches to doing so require careful thought about exactly what information is being represented and by what means. The encoding of manuscript materials lies outside the scope of this guide, although we discuss the encoding of manuscript revisions and annotations to print documents.
- Legibility
Do your documents pose any problems of legibility, whether because of damage to the originals, tight binding, poor reproduction, heavy revision, difficult handwriting, or systematic deletion? Legibility issues pose both logistical challenges and special encoding challenges. On the logistical side there is the need for encoder training, to help encoders decipher difficult handwriting or infer missing or hard-to-read characters, and to help ensure consistency across a group of encoders (for instance to make sure that they agree on a threshold below which something is officially unclear). Legibility problems also need to be represented systematically through the encoding so that the reader can understand the status of the transcribed characters being presented, and whether they are reliable or questionable.
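The fragment below sketches how three different legibility situations might be distinguished, using the TEI elements `unclear`, `gap`, and `supplied` (the last of which comes from the transcription module rather than the core). The sentence is invented, and the `reason` and `cert` values are assumptions for the example. A project would normally define its own controlled list of values and its own threshold for when a reading counts as unclear.

```xml
<p>
  <!-- Characters the encoder can read, but without full confidence -->
  The ship was lost off <unclear reason="faded" cert="medium">Dover</unclear>
  <!-- Two characters that cannot be read at all -->
  in the year 16<gap reason="illegible" quantity="2" unit="chars"/>,
  <!-- Text the encoder has supplied where the page is damaged -->
  with all <supplied reason="damage">hands</supplied> aboard.
</p>
```

Keeping these three cases distinct lets a reader see which characters were read with some doubt, which were omitted, and which were inferred by the encoder.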