Methodology for Transcription and Editing

General Principles

The WWP encodes works in English, or in English translation, by women before 1850.

In general, we encode entire texts rather than excerpts, and our transcription includes all front and back matter, including paratextual material which may not necessarily be by the author. We excerpt only in cases where the desired text is a very small part of the total published document, and where transcribing the entire document is currently impractical.

Our transcriptions are encoded in XML, following the spirit of the Text Encoding Initiative (TEI) Guidelines, with documented TEI extensions to accommodate the needs of our particular corpus and approach. We produce a transcription of each text that preserves the original spellings, typographical errors, lineation, hyphenation, and other details of the text. The transcription also records corrected readings for typographical errors, expansions for uncommon abbreviations, and regularized versions of old-style typography (such as the use of “i” for “j” and “u” for “v”). These alternative readings are represented through the TEI markup so as to support flexibility in display: using an appropriate stylesheet, we can display the text with or without original lineation, typography, errors, and so forth. These practices are described in more detail below.

Theory of the text

We treat the text as a document more than as a work of literature: hence our approach emphasizes transcription of the full document rather than only the “work,” and preservation of renditional details, original spellings, and errors, rather than their effacement. In addition, each document is treated as a circulating cultural artifact, whose historical specificity is part of its value. As a result, we do not emend the text or create critical or synthetic editions; each encoded text is a transcription of a particular physical object.

XML and the TEI tend to imply a theory of the text which emphasizes its structure as an important ontological fact about the text’s existence. The WWP believes that this kind of encoding provides an intuitive and productive way for scholarly users to read and navigate the text; however, we do not insist on it as the only possible theory of the text.

Inclusion and exclusion criteria

The WWP’s inclusion criteria in principle are very broad: over the (very) long term, we seek to capture a substantial subset of printed texts by women in English dating from 1850 or before. Within those broad constraints, the question of which texts to focus on, and which to transcribe first, has been challenging and our approach has changed somewhat over time. In the early phases of the project, we gave priority to texts based on a number of factors including unavailability in print, scholarly and pedagogical demand, and our ability to handle the text within our operational constraints. With the advent of large-scale digitization efforts such as the Internet Archive, EEBO and ECCO, and Google Books, simple scholarly access is of diminished concern, and we have also increased our attention to questions of cultural and geographic diversity. We assess current scholarly interest in periods, genres, and issues, and seek to develop the collection in broadly representative directions, while giving priority to texts which respond to the current and developing needs of the scholarly community.

Our steering committee meets regularly during the course of the year to discuss priorities for transcription, which often take the form of a thematic cluster of texts (to be accompanied by one or more related exhibits in Women Writers in Context). These choices are driven by the research interests of our community of users (including the WWP itself). We actively seek nominations of texts and authors from the public. Our choices and prioritizations necessarily represent a balance between aspiration and practicality; given our limited resources for transcription, the value of very long and very complex texts needs to be balanced against their cost in time and effort. When possible, we partner on transcription with other projects and with external collaborators, but our encoding system is complex enough that it requires fairly substantial training and unfortunately does not lend itself to crowd-sourcing or short-term volunteer effort.

The collection has thus far focused almost entirely on printed works; the early phases of the project deliberately postponed the encoding of manuscript materials until we could devote attention to the methodological issues involved. However, we do plan to include manuscript texts by women from the same period. We are currently engaged in an extended pilot to explore these issues. At the conclusion of this pilot, we will probably seek external funding for a broader initiative involving a team of scholarly experts and a wider range of manuscript texts.

In addition to English texts written by women, the WWO textbase includes texts co-authored by men; texts of doubtful authorship, where the WWP feels there is good reason to believe the author was female; texts translated into English by women (the original author may be male, although these texts would have a somewhat lower priority); texts written by women in other languages and translated into English by men (again, with a slightly lower priority); historical accounts of trials or other events which claim to report women’s words more or less directly; narratives dictated by women to male transcribers (even where it seems likely that the transcription is not verbatim); texts written under a female pseudonym which have circulated as women’s writing (whether or not the author is actually female; again, with a somewhat lower priority).

The aim behind these choices is to give an inclusive cross-section of the written culture as it circulated at the time, by authors who represented themselves (or were represented) as women, with significant diversity of theme, geography, topicality, genre, format, and audience. For example, allowing the inclusion of dictated texts such as some slave narratives makes it possible to include texts by illiterate women which would be excluded if we insisted upon a strict construction of authorship. Similarly, historical reports of women’s words (for instance, in the context of a witchcraft trial) give a view of women’s discourse which would otherwise be inaccessible. Texts in translation have circulated within the culture of English women’s writing and represent an important component of that culture. Categories like these need to be distinguished from writing which is straightforwardly “by women” (for instance, for purposes of linguistic comparison), but this can be accomplished by appropriate identification.

Choice of Edition

The WWP always transcribes from a specific copy of an early edition, contemporaneous with the author unless particular circumstances dictate otherwise (e.g. posthumous publication). Where possible and appropriate, we use the first edition. In cases where a later edition is of equal or greater scholarly importance (because of authorial revision, censorship, etc.), we also aim to encode the later edition, although we may not be able to do so immediately.

As a rule, the copy chosen for transcription is the only source of information for that transcription; the WWP does not provide a record of variants, emendations, etc. from other copies or editions. However, in cases of illegibility, the transcription may be supplemented with readings from other copies of the same edition. In very rare cases, where only a single flawed copy of the chosen edition is available, we may consult other editions. The source of such readings is always explicitly documented. A possible future project (perhaps developed as a specialized exhibit) might be to publish a cluster of texts that survive in significantly varying versions as linked transcriptions, allowing for easy comparison between editions. See Principles of Transcription, below.

Texts which were encoded at the project’s inception were occasionally chosen on other grounds, since the project’s editorial goals have changed somewhat over the years. Transcriptions which do not currently follow the principles outlined above will be updated over time to conform to our current editorial practice.

Principles of Transcription

Treatment of textual variants

Textual variants from editions other than the one being transcribed are not included, except as supplementary readings in cases of illegibility.

Hyphenation

Line-end hyphens are preserved. Soft hyphens are distinguished from hard hyphens, and are recorded using a SOFT HYPHEN character (‘’, U+00AD). They may be displayed or suppressed depending on whether original lineation is expressed or not. In cases where it is unclear whether a line-end hyphen is hard or soft, we follow the hyphenation for that word used elsewhere in the same text; if the word does not appear elsewhere, we record a hard hyphen.
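
The effect of this distinction on display can be sketched in a few lines of Python. This is only an illustration, not the WWP's actual stylesheet logic: the function name `render`, the sample lines, and the rule for joining lines are all assumptions made for the example.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD, used here to mark a soft line-end hyphen

def render(lines, keep_lineation):
    """Render transcribed lines with or without the original line breaks."""
    if keep_lineation:
        # Original lineation: show each soft hyphen as a visible hyphen.
        return "\n".join(line.replace(SOFT_HYPHEN, "-") for line in lines)
    out = ""
    for line in lines:
        if line.endswith(SOFT_HYPHEN):
            # Soft hyphen: drop it and join the two halves of the word.
            out += line[:-1]
        elif line.endswith("-"):
            # Hard hyphen: keep it and join without an intervening space.
            out += line
        else:
            out += line + " "
    return out.rstrip()

# Hypothetical sample: "well-known" has a hard hyphen, "author" a soft one.
lines = ["the well-", "known au\u00ad", "thor wrote"]
print(render(lines, keep_lineation=True))
print(render(lines, keep_lineation=False))  # the well-known author wrote
```

With lineation kept, both hyphens appear at the line ends; with lineation suppressed, only the hard hyphen in "well-known" survives.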

Typographical errors

Typographical errors in the original document are recorded, together with a corrected reading, using TEI’s mechanism for recording error.
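
As a rough illustration of how paired readings support flexible display, the Python sketch below parses a small fragment and produces either the original or the corrected reading. It assumes the standard TEI P5 elements `<choice>`, `<sic>`, and `<corr>`; the WWP's own encoding (which includes documented TEI extensions) may differ in detail. The sample sentence and the function name `reading` are invented for the example, and the fragment omits the TEI namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; real TEI files declare the TEI namespace.
FRAGMENT = (
    "<p>She <choice><sic>recieved</sic><corr>received</corr></choice>"
    " the letter.</p>"
)

def reading(elem, prefer):
    """Flatten elem to text, keeping only the `prefer` child of each <choice>."""
    if elem.tag == "choice":
        # Keep the preferred alternative; ignore anything between alternatives.
        return "".join(reading(c, prefer) for c in elem if c.tag == prefer)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(reading(child, prefer))
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(FRAGMENT)
original = reading(root, "sic")    # She recieved the letter.
corrected = reading(root, "corr")  # She received the letter.
print(original)
print(corrected)
```

The same transcription thus yields both views, and the choice between them is made at display time rather than at encoding time.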

Regularization

The WWP regularizes interword spacing to a single space. We regularize space between words and any following punctuation to zero spaces.
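
A minimal Python sketch of these two spacing rules follows. The function name `regularize_spacing` and the particular set of punctuation marks are assumptions made for illustration; in practice these rules are applied by the transcriber rather than by a script.

```python
import re

def regularize_spacing(text):
    """Apply the two spacing rules to a run of transcribed text."""
    # Collapse runs of spaces between words to a single space.
    text = re.sub(r" {2,}", " ", text)
    # Remove any space between a word and the punctuation that follows it.
    # The punctuation set below is an assumption for this sketch.
    text = re.sub(r" +([,.;:!?])", r"\1", text)
    return text

sample = "What  is this ?  Nothing , truly ."
print(regularize_spacing(sample))  # What is this? Nothing, truly.
```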

Original typography and spelling

Old-style typography (including the early modern use of i and j, u and v, and vv and w) is preserved, together with a normalized reading, using TEI’s mechanism for handling original readings.
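
To make the pairing of original and normalized readings concrete, the sketch below builds the markup for a few old-style spellings. It assumes the standard TEI P5 elements `<choice>`, `<orig>`, and `<reg>`, which may not match the WWP's extended schema exactly; the helper `make_choice` and the sample words are hypothetical.

```python
import xml.etree.ElementTree as ET

def make_choice(orig_text, reg_text):
    """Pair an original reading with its regularized form."""
    el = ET.Element("choice")
    ET.SubElement(el, "orig").text = orig_text
    ET.SubElement(el, "reg").text = reg_text
    return el

# Illustrative examples of i/j, u/v, and vv/w usage.
pairs = [("iudge", "judge"), ("loue", "love"), ("vvhich", "which")]
encoded = [ET.tostring(make_choice(o, r), encoding="unicode") for o, r in pairs]
for line in encoded:
    print(line)
# The last line printed is:
# <choice><orig>vvhich</orig><reg>which</reg></choice>
```

Because only the typography is regularized, the `<reg>` reading keeps the original spelling in every other respect.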

The WWP does not currently record modernized spellings. Doing so poses a number of challenges, not least of which is financial; it would be a very large undertaking and one which would require special funding. In addition, however, there are conceptual challenges such as the frequent difficulty (especially in our oldest texts) of determining the correct modern equivalent. Modernization for many texts is closer to translation than to spelling correction; a word which appears to be a direct modern equivalent may in fact have a rather different meaning, and to substitute it may create a misleading impression of the text — particularly for readers less familiar with early texts, whom modernization is intended to help. Although offering a modernized reading may in some cases make a text more accessible, our experience shows that novices are usually able to adjust, and may even learn more from contact with a less mediated version of the text.

Special characters and Unicode

The advent of Unicode has made it possible to represent nearly all of the printed characters in early modern printed books without difficulty. However, we occasionally encounter characters (such as alchemical symbols or inverted characters) that are not included in the Unicode standard. For these we use the TEI’s mechanism for representing characters not in Unicode (see the TEI Guidelines for more detail).

Handwritten additions and deletions

Handwritten additions which are roughly contemporary with the text are transcribed in full. Deletions are encoded, with the original printed text being transcribed as content. If the deleted text is illegible, that fact is also encoded. The presence of modern handwriting (such as shelf marks or other inscriptions) is represented as a gap using the TEI’s mechanism for capturing omissions.

Illegibility

Illegible passages are captured using the TEI’s mechanisms for representing illegible text; characters or words which are unclear but can be tentatively deciphered are captured using the TEI’s <unclear> element; if multiple readings seem possible, all are captured as alternatives. If the text is completely illegible in the copy being transcribed, but legible in another copy that meets the standards for inclusion described above, we may include the legible reading using the TEI’s <supplied> element to document the source. Text not legible in any copy is recorded as a <gap>.
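
The three cases can be illustrated with a short Python sketch that renders each one with a different editorial signal. The fragment, the `#copyB` pointer, the attribute values, and the bracket conventions are assumptions for the example, not the WWP's published display rules; the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one tentative reading, one reading supplied
# from another copy, and one passage illegible in every copy.
FRAGMENT = (
    '<p>The <unclear>quene</unclear> sent '
    '<supplied source="#copyB">word</supplied> to '
    '<gap reason="illegible"/> that night.</p>'
)

def display(elem):
    """Flatten elem to text, flagging unclear, supplied, and missing text."""
    if elem.tag == "gap":
        return "[...]"
    inner = (elem.text or "") + "".join(
        display(c) + (c.tail or "") for c in elem
    )
    if elem.tag == "unclear":
        return inner + "[?]"
    if elem.tag == "supplied":
        return "[" + inner + "]"
    return inner

rendered = display(ET.fromstring(FRAGMENT))
print(rendered)  # The quene[?] sent [word] to [...] that night.
```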

Features omitted from transcription

The WWP’s approach to transcription focuses on the linguistic text, and while we also provide some basic information about non-linguistic features of the text, we do so in a simplified way. There are also a number of features which we do not transcribe. These are omitted largely to enable us to encode more efficiently, and to focus on making more texts available rather than on giving exhaustive visual detail about correspondingly fewer texts. Finally, we understand that no transcription can ever capture strictly visual or graphical detail with sufficient accuracy to replace the original for certain kinds of study. Scholars who need information of this sort will need to consult the original in any case, and for us to attempt to duplicate that information here would be wasteful.

The WWP records the presence of illustrations, together with a brief description of the illustration and a transcription of any text which appears within the illustration. We also encode the presence of ornaments and ruled lines. We do not distinguish between different kinds of ornaments or rules. For our purposes, an illustration is any graphical feature which contains representational content; an ornament is any purely formal or abstract graphic (e.g. a border of acanthus leaves).

The WWP does not transcribe running headers and footers, with the exception of page numbers, signatures, catchwords, and press figures.

The WWP does not transcribe bookplates, modern handwriting, or modern library stamps. The omission of these features is indicated in the transcription using the TEI’s <gap> element.

The WWP does not transcribe smudges, foxing, dead insects, or other non-textual, non-graphical marks. No indication at all is made of their presence, unless they render text illegible or unclear.

Treatment of Document Rendition

With electronic texts, there is a large difference between the information that is recorded and the way the text is displayed (on the screen or in print output). The WWP records a great deal of renditional information, both directly (in a renditional attribute which records many details of the document’s original presentation) and indirectly (in the use we make of renditional information in deciding what a given textual element is). However, in displaying texts electronically, or in creating printed output, we are guided not only by the document’s original appearance but also by considerations of readability in the new format. Our aim in displaying the document is to present the same information that the original document conveyed — for instance, the presence of paragraphs and stanzas — but without necessarily using the same means of conveying it. Thus while different documents may use indentation or line spacing to show a paragraph break, our standard display regularizes the display of paragraphs to a single style. Similarly, different documents may use varying amounts of space to separate poetic stanzas; our standard display makes it clear where the stanza breaks are, but does not seek to reproduce the exact amount of original spacing. Over time, we are developing additional viewing options that will give the reader some control over these features (for instance, to choose whether to display page and line breaks, or to choose between original and regularized spelling). However, we do not plan to offer anything approaching a facsimile view of the text; the increasing availability of page images makes this less useful, and the level of detail at which our encoding represents formatting information does not support a truly realistic display.

That said, the WWP seeks to capture as many as possible of the meaningful renditional features of the text, for their informational value. By “meaningful” we mean features which affect the reading of the text, where “reading” is understood to mean all aspects of reception, not simply the absorption of strictly denotational meaning. Meaningful renditional features are those which affect the way the reader knows what kind of textual feature she or he is looking at and understands its relationship to other textual features.

Our transcription records most (though not all) significant details of the appearance of the source, including:

font shifts (roman, italic, blackletter)
capitalization and use of small capitals
text alignment with respect to page margins: left, right, center
relative indentation
line, column, and page breaks
rough positioning on the page (for marginal notes, annotations, and the like)
end-of-line hyphenation
wrong-font letters
turn-unders and turn-overs in verse
significant use of relative white space to delineate textual structure (stanzas, paragraphing, etc.)
inverted letters
rotated text
underlining
superscripts
the presence of dropped, raised, or decorated initial capitals

We do not record:

absolute or relative type size
font of punctuation
specific type face information
absolute line spacing or vertical white space
baseline irregularities
broken type
running heads (except page numbers)
kerning and word spacing irregularities (except where these may be significant to the determination of word boundaries)
swash characters
ligatures (except digraphs such as æ)

Documentation and Metadata

In addition to transcribing the full text of each document, the WWP also records certain kinds of metadata, or information describing the document and its transcription. This information is recorded in the TEI header. For more detail on TEI headers, see the TEI Guidelines.

Information about the source copy

The WWP records the author’s name (if known), the publication details of the original text, the location of the source copy used in our transcription (including the library catalogue number where possible), and the Wing or STC number where applicable.

Language

The WWP records the main language of the document and any other languages used in the document.
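
As a hedged illustration of how such language information might be read back out of a TEI header, the sketch below uses the standard TEI P5 `<langUsage>` and `<language>` elements. The header content is invented, the TEI namespace is omitted for brevity, and the sketch does not show how the main language is distinguished from the others.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment for a text in English with some Latin.
HEADER = """<teiHeader>
  <profileDesc>
    <langUsage>
      <language ident="en">English</language>
      <language ident="la">Latin</language>
    </langUsage>
  </profileDesc>
</teiHeader>"""

root = ET.fromstring(HEADER)
# Map each language code to its human-readable name.
languages = {lang.get("ident"): lang.text for lang in root.iter("language")}
print(languages)  # {'en': 'English', 'la': 'Latin'}
```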

Genre

The WWP assigns a rough genre classification to each text.

Keywords

Over time, the WWP will record topical keywords for each document or (in the case of multiple works published together) for each textual unit.

Details of encoding and editing

The WWP records information about the general editorial practices used in preparing the textbase, and also information about the specific practices used for the individual document, if it requires special treatment.