Document Analysis

What this is for

In order to be sure that we encode our texts consistently and address their complexities in a well-thought-out way, we do a basic preliminary analysis of each text before we start encoding it. This process helps identify difficult textual issues early and allows us to discuss and research them. It also helps the encoder conceptualize the structure of the text and the relationships between its parts, so as to tag it more accurately and consistently.

This form is a guide to help you think through the preliminary document analysis. As you begin, you will probably want to work through many of these questions in consultation with others; as you become more familiar with encoding issues, you will be able to take more individual responsibility for developing solutions. At any point, if you come across an interesting encoding problem, or something you'd like feedback or clarification on before your presentation, please post to DSGTAG-L.

Basic Structure

Sketch the basic structure of the document (as a tree or a chart or in whatever way makes sense). Include at least the first three levels of division inside <body>, or more if the structure is complex. This is intended to flush out difficult hierarchical issues and help you develop an encoding strategy before you start encoding.

Issues to think about: What structural components form the basic divisions of the text? What are first-level <div>s, second-level <div>s, and so on? Do we already have a named element, or a type attribute for <div>, that describes them or are they completely anomalous? Do similar <div>s always appear at the same level of the hierarchy, or are they sometimes nested at a different level?
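As an illustration of the kind of sketch meant here, the following shows three levels of division for a hypothetical prose work. The type values and the overall arrangement are assumptions made for the example, not a list of approved values; check the project documentation for the types actually in use.

```xml
<body>
  <!-- first-level division (type value assumed for illustration) -->
  <div type="part">
    <head>(heading of the part)</head>
    <!-- second-level division -->
    <div type="chapter">
      <head>(heading of the chapter)</head>
      <!-- third-level division: here, a letter embedded in the chapter -->
      <div type="letter">
        <p>(text of the letter)</p>
      </div>
    </div>
  </div>
</body>
```

If the same kind of division (for example, the letter) turns up at different depths in different parts of the book, note that in your sketch, since it affects how consistently the text can be tagged.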

In the case of poetry, analyse the line groupings carefully and decide which attributes of <lg> will be appropriate. You may need to refer to the line group tutorial for a full list of possible line groups and how to encode them.
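A minimal sketch of nested line groups, assuming a hypothetical poem made up of stanzas and a refrain. The type values shown are illustrative assumptions; the line group tutorial remains the authority on which values to use.

```xml
<!-- outer line group for the whole poem (type value assumed) -->
<lg type="poem">
  <lg type="stanza">
    <l>(first line of the stanza)</l>
    <l>(second line of the stanza)</l>
  </lg>
  <!-- a recurring group of lines, tagged separately from the stanzas -->
  <lg type="refrain">
    <l>(refrain line)</l>
  </lg>
</lg>
```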

Local Features

Examine several representative sections of the text, paying particular attention to the start and end of the section: how are the boundaries between sections handled? Are there features like epigraphs, signatures, multiple headings, ruled lines or other delimiters marking section boundaries? Whatever you find, identify the appropriate encoding.

Skim through the text, looking out for features like block quotations, embedded letters, figures, and other features that may require special handling. For each one, identify the element or elements you'll need to encode it.

Look through several sample pages taken from different parts of the document, and notice any words or phrases that are highlighted (in italics, all caps, or other distinctive treatment). For each one, identify the reason for the highlighting and determine what TEI element is appropriate (for instance <persName> or some other naming element, <emph>, <foreign>, <term>, etc.).
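The distinction between the reason for the highlighting and its appearance can be sketched as follows. The sentence is invented for the example, and the rend values and the use of xml:lang assume TEI P5 conventions rather than any specific project practice.

```xml
<!-- all four phrases are italicized in the (invented) source,
     but each is highlighted for a different reason -->
<p>She told <persName rend="italic">Mrs. Smith</persName> that it was
<emph rend="italic">not</emph> to be borne, closed with
<foreign xml:lang="la" rend="italic">vale</foreign>, and explained the word
<term rend="italic">sonnet</term> in a postscript.</p>
```

In each case the element records why the phrase is highlighted, while the attribute records how it looks on the page.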

Physical Bibliographic Issues

Does the source document (i.e. the original book, not our photocopy) have any anomalies such as pages missing or out of order? Are there any problems with illegibility (either in the photocopy or in the source text), or damage? Is it a complete book, xeroxed from start to finish, or is it part of some larger work whose structure may need to be taken into account? Assess the extent and cause of any damage or illegibility and the appropriate treatment.

Check the pagination and collation (the sequence of gatherings, as indicated by the signature marks at the bottom of the pages) for accuracy and for missing sections. It may help to sketch the collation on paper before you start encoding. If page numbers are missing or out of order, check to see whether the flow of the text is continuous (indicating an error in the page numbering) or discontinuous (indicating an error in the printing, binding, or xeroxing).

Title Page

Think about how to encode the various parts of the title page, particularly for Renaissance texts. Issues to consider: how is the title itself divided? What other information is there, and how should it be encoded? Consider the content even more than the typography as an indication of how to encode it.
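A rough sketch of a title page divided by content rather than by typography, using standard TEI title page elements. The particular division into main title, subtitle, byline, and imprint is an assumption for illustration; a real title page may need more parts or different ones.

```xml
<titlePage>
  <docTitle>
    <!-- the title divided according to its content, not its line breaks -->
    <titlePart type="main">(main title)</titlePart>
    <titlePart type="sub">(subtitle)</titlePart>
  </docTitle>
  <byline>By <docAuthor>(author's name as printed)</docAuthor></byline>
  <docImprint>(place of publication and names of printer or bookseller)
    <docDate>(date as printed)</docDate></docImprint>
</titlePage>
```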

Textual Features

Linking and cross-referencing: does the document contain footnotes, endnotes, side notes (i.e. marginal notes), errata lists, subscriber lists, table of contents, index, internal cross-references, or other referencing mechanisms? Can these be accommodated using ordinary WWP methods, or do they pose any special challenges?
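One possible way of connecting a footnote to its point of reference, and of recording a marginal note, is sketched below. It assumes TEI P5 pointing conventions (xml:id and target); the identifier n01 is hypothetical, and the project's ordinary methods may differ in detail.

```xml
<!-- the reference mark in the text points to the note by its identifier -->
<p>(sentence in the main text)<ref target="#n01">*</ref> (text continues)</p>
<note xml:id="n01" place="bottom">(text of the footnote)</note>

<!-- a side note with no reference mark in the text -->
<note place="margin">(text of the marginal note)</note>
```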

Special Characters

Does the document contain unfamiliar characters? Are there any characters or abbreviations that will require expansion (e.g. letters marked with macrons)? Check the list of entity references to see how special characters should be encoded, and note, for further research, any characters for which no entity reference is listed.

Handwritten Additions

If the document contains any handwriting, you will need to assess whose handwriting it is, if possible, and whether it needs to be encoded. If it does, then you will also need to assess whether it poses any structural problems: does it span several elements? How will its position be indicated? Is it legible? Does it contain cross-outs or other complexities? Does it use any characters which will need special treatment or scholarly interpretation (for instance, letters which might be either capitals or lower case; contractions; marks which are not letters)?
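A minimal sketch of how a handwritten addition, a cross-out, and an uncertain reading might be recorded, assuming TEI P5 transcription elements. The hand identifier h1 is hypothetical and would need a corresponding hand description in the header; whether this matches project practice should be confirmed before use.

```xml
<p>(printed text)
<!-- handwritten addition in the margin, attributed to hand h1 (hypothetical) -->
<add hand="#h1" place="margin">(handwritten addition)</add>
(printed text)
<!-- a word crossed out by the same hand -->
<del hand="#h1" rend="strikethrough">(crossed-out word)</del>
<!-- a reading the encoder is not sure of -->
<unclear reason="illegible">(uncertain reading)</unclear></p>
```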

Other Idiosyncrasies

Note any other features of the text that may need special treatment or further research.