The Word Vector Interface
Code for the Word Vector Interface can be found on GitHub at https://github.com/NEU-DSG/wwp-w2vonline. The Interface itself is an RStudio Shiny application developed by Jonathan D. Fitzgerald and Parth Tandel. It uses Benjamin Schmidt’s R package wordVectors (which itself includes a version of Tomas Mikolov’s original code for word2vec) to train and query word embedding models.
The GitHub repository includes the models used in the Word Vector Interface, as well as a JSON catalog containing descriptions of those models.
The Word Vector Interface is powered by three collections of documents from three different projects. Each corpus was transcribed and encoded in XML, following the guidelines of the Text Encoding Initiative (TEI). By using texts marked up in TEI, we were able to make nuanced choices about the textual features used to generate the files on which we trained the word embedding models.
The Women Writers Project’s Women Writers Online (WWO) corpus collects works which were authored, translated, or compiled by women, published between 1526 and 1850. The corpus is available online to subscribing institutions or individuals, and information on one-month free trials is available here. The XML files are freely available on request. For consideration, send an email with a brief description of your research to firstname.lastname@example.org.
The Interface also includes models trained on texts from the Victorian Women Writers Project (VWWP), as well as the Early English Books Online (EEBO), Eighteenth-Century Collections Online (ECCO), and Evans Early American Imprint TEI texts published by the Text Creation Partnership (TCP). From the vast TCP corpus, we selected sub-corpora which roughly mirror the WWO corpus in number of words per century. The VWWP began at Indiana University in 1995 and is concerned with the exposure of lesser-known British women writers of the 19th century. The collection contains a range of genres, including: poetry, novels, children’s books, political pamphlets, religious tracts, and histories.
In order to get plain text out of the WWO XML, we transformed the corpus using a suite of tools written in XSLT and XQuery, native programming languages for XML documents. The XSLT stylesheet uses TEI encoding to make nuanced choices about significant textual content. For example, abbreviations are expanded and errors are corrected within our text outputs, based on TEI encoding such as <expan>.
In addition to transforming the output based on particular elements, we use XQuery scripts to remove elements that skew results with word embedding models, such as speaker labels in drama. Modern writing, such as metadata and text written by WWP encoders and staff, is also removed. The XQuery scripts also allow us to construct sub-corpora by extracting specified sections of XML documents using XPath.
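The kind of element-aware extraction described above can be sketched in Python. This is only an illustration, not the project's XSLT or XQuery: the namespace, the list of skipped elements (<abbr>, <sic>, <speaker>, <teiHeader>), the sample document, and the function names are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the sample; the real corpus may use a different one.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical list of elements whose contents are dropped entirely:
# original abbreviations and errors (their expansions and corrections
# are kept instead), speaker labels, and header metadata.
SKIP = {"abbr", "sic", "speaker", "teiHeader"}


def local_name(tag):
    """Strip the namespace from an ElementTree tag such as '{ns}p'."""
    return tag.rsplit("}", 1)[-1]


def extract_text(element):
    """Recursively collect text, omitting elements named in SKIP."""
    if local_name(element.tag) in SKIP:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(extract_text(child))
        # Tail text belongs to the parent, so keep it even if the child was skipped.
        parts.append(child.tail or "")
    return "".join(parts)


def tei_to_plain_text(xml_string):
    """Parse an XML string and return whitespace-normalized plain text."""
    root = ET.fromstring(xml_string)
    return " ".join(extract_text(root).split())


sample = f"""<TEI xmlns="{TEI_NS}"><text><body>
<sp><speaker>Queen.</speaker>
<p>The <choice><abbr>K.</abbr><expan>King</expan></choice> is
<choice><sic>cmoe</sic><corr>come</corr></choice>.</p></sp>
</body></text></TEI>"""

print(tei_to_plain_text(sample))  # The King is come.
```

Skipping whole elements by name keeps the logic declarative: adding another element to exclude only requires extending the set.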
These tools can be found on GitHub at https://github.com/NEU-DSG/wwp-public-code-share/tree/master/fulltext.
We have also included two models with some additional regularization, using the routines developed by the Visualizing English Print project: a version of the full WWO corpus and one including only the language attributable to WWO authors. Full details on the VEP project’s regularization routines are available in their GitHub repository and documentation—we borrow their code with thanks and appreciation.
The Women Writers Vector Toolkit (WWVT) provides various models for exploration and comparison. In addition to two models that include every file in the WWO (one including front and back matter and one with exclusively the main body of each text), the interface also offers the option to compare models by publication date and by genre.
The WWO corpus spans more than three centuries and includes information on the date of each text’s first edition, making the corpus ideal for exploring changes in word usage over time. Publication dates are used to sort and combine documents by their century of publication, with each sub-corpus containing all the documents published in a single century, except for the 16th and 17th centuries, which are combined. The 16th and 17th centuries together provide enough words for a more accurate model and are approximately the same size as the 18th- and 19th-century models. These models can be used for diachronic studies of literary and cultural change.
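As a rough sketch of the sorting step, the grouping by century might look like the following in Python. The function names, the label strings, the sample identifiers, and the convention that a year such as 1700 counts as the 18th century are assumptions for illustration; the source does not specify how boundary years are handled.

```python
from collections import defaultdict


def century_bucket(year):
    """Return a sub-corpus label for a first-edition publication year."""
    century = year // 100 + 1  # assumed convention: 1526 -> 16, 1700 -> 18
    if century in (16, 17):
        # The 16th and 17th centuries are combined into one sub-corpus.
        return "16th-17th"
    return f"{century}th"


def group_by_century(documents):
    """Group (document_id, year) pairs into sub-corpora keyed by label."""
    groups = defaultdict(list)
    for doc_id, year in documents:
        groups[century_bucket(year)].append(doc_id)
    return dict(groups)


# Hypothetical document identifiers and dates.
docs = [("text-a", 1526), ("text-b", 1688), ("text-c", 1792), ("text-d", 1850)]
print(group_by_century(docs))
```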
To create proxies for genre, we were able to leverage TEI encoding of document structures. For example, the interface contains two models, one trained on a corpus with exclusively the contents of paragraphs (the TEI <p> element) and the other with the contents of line groups (the TEI <lg> element), enabling a basic comparison between prose and verse.
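A minimal Python sketch of splitting a document into prose and verse by element name follows. It is not the project's XQuery; the helper name, the un-namespaced sample, and the choice to take only the outermost matching elements (so nested line groups are not counted twice) are assumptions.

```python
import xml.etree.ElementTree as ET


def collect(element, wanted):
    """Return the text of the outermost descendants whose local name is `wanted`."""
    results = []
    for child in element:
        if child.tag.rsplit("}", 1)[-1] == wanted:
            # Take all text inside the match and normalize whitespace.
            results.append(" ".join("".join(child.itertext()).split()))
        else:
            results.extend(collect(child, wanted))
    return results


sample = """<body>
<p>She wrote of the sea.</p>
<lg><l>The waves did roar,</l>
<l>upon the shore.</l></lg>
</body>"""

root = ET.fromstring(sample)
prose = collect(root, "p")   # contents of paragraphs
verse = collect(root, "lg")  # contents of line groups
print(prose)
print(verse)
```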
Another corpus drawn from WWO’s TEI markup is the “authorial” corpus, which includes only those textual contents written by WWO authors, and excludes tables of contents, indices, editorial prefaces and dedications, subscriber lists, and so on. This corpus was prepared using an XSLT stylesheet developed by Ash Clark and Sarah Connell that is available in the WWP’s GitHub repository.
We also used TEI markup to create several corpora in which names of places and persons were tokenized by inserting underscores between words inside of <placeName> elements and removing extraneous spaces and punctuation. These corpora were prepared using an XSLT stylesheet developed by Laura Johnson and Ash Clark that is available in the WWP’s GitHub repository.
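The effect of this tokenization on a single name can be approximated in Python. This is a sketch rather than the XSLT stylesheet itself: the function name and the rule that anything other than letters and digits counts as a separator are assumptions, and the real stylesheet may treat apostrophes, hyphens, or other characters differently.

```python
import re


def tokenize_name(name):
    """Join the words of a name with underscores, dropping spaces and punctuation."""
    # [^\W_]+ matches runs of letters and digits (Unicode-aware), excluding underscores.
    words = re.findall(r"[^\W_]+", name)
    return "_".join(words)


# Text content as it might appear inside a name element in the source XML.
print(tokenize_name("New  England,"))
print(tokenize_name("Queen Elizabeth."))
```

Because the name becomes a single token, a model trained on the result can learn one vector for the whole name instead of separate vectors for each word.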
TCP Corpora Preparation and Parsing
The TCP corpora are designed as parallels to the WWO corpus, approximating the word-counts per century in Women Writers Online. To collect texts from the TCP files, we used a Python script which counts the words in the plain-text versions of each text and then uses publication date metadata to select a subset of texts from the TCP collection with approximately the same number of words per century as in the WWO collection. In the future, we plan to use more precise methods for creating parallel corpora, drawing on the markup to select texts with similar numbers of elements, for example, to ensure a rough balance in genres between the texts in the two corpora.
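One way such a selection could work is sketched below. This is not the project's script: the greedy strategy, the random shuffling, the whitespace-based word count, the century convention, and all names and figures are assumptions for illustration.

```python
import random


def count_words(text):
    """Count whitespace-delimited words in a plain-text string."""
    return len(text.split())


def century_of(year):
    """Assumed convention: 1650 -> 17, 1750 -> 18."""
    return year // 100 + 1


def select_parallel_subset(texts, targets, seed=0):
    """Pick texts until each century approaches its target word count.

    texts:   list of (text_id, year, word_count) tuples
    targets: dict mapping century number to the target word count
    """
    shuffled = list(texts)
    random.Random(seed).shuffle(shuffled)  # avoid always favoring the same texts
    totals = {century: 0 for century in targets}
    selected = []
    for text_id, year, words in shuffled:
        century = century_of(year)
        # Accept a text only if it does not push its century past the target.
        if century in targets and totals[century] + words <= targets[century]:
            totals[century] += words
            selected.append(text_id)
    return selected, totals


# Hypothetical identifiers, dates, and word counts.
candidates = [("A", 1650, 100), ("B", 1660, 100), ("C", 1750, 50)]
chosen, totals = select_parallel_subset(candidates, {17: 150, 18: 50})
print(chosen, totals)
```

A greedy pass like this never exceeds a target but may fall short of it; a closer match would need a more careful search over combinations of texts.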
Once the plain text of each file has been cleaned, normalized, sorted, and combined, the resulting text files can be used to train models with word2vec. To choose the best-fitting parameters for our data, we compared models’ cosine similarities of word pairs that, based on our knowledge of these corpora, we expect to have high degrees of similarity (such as: “before” and “after”; “father” and “mother”; “king” and “prince”; and “holy” and “sacred”). We found that we obtained the best results from a model created with a window size of 6 words, vectors of 100 dimensions, 10 iterations, and negative sampling set at 15. Models trained on the smallest datasets differed slightly, performing best with negative sampling set at 5.
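The word-pair comparison can be illustrated with a small Python sketch. The project's models are trained and queried in R with wordVectors; the code below only shows the cosine similarity calculation itself. The three-dimensional toy vectors, the averaging of scores across pairs, and the function names are assumptions made for the example, not values or methods from the actual models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pair_similarity(vectors, pairs):
    """Average cosine similarity over the word pairs present in the model."""
    scores = [
        cosine_similarity(vectors[first], vectors[second])
        for first, second in pairs
        if first in vectors and second in vectors
    ]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy vectors standing in for a trained model (hypothetical values).
toy_model = {
    "king": [0.9, 0.1, 0.2],
    "prince": [0.8, 0.2, 0.3],
    "holy": [0.1, 0.9, 0.1],
    "sacred": [0.2, 0.8, 0.2],
}
expected_pairs = [("king", "prince"), ("holy", "sacred")]
print(mean_pair_similarity(toy_model, expected_pairs))
```

Under this approach, candidate parameter settings would each produce a model, and the setting whose model scores highest on the expected pairs would be preferred.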
With the model parameters in place, we then tested our three regularization processes: XSLT and XQuery alone; Northwestern University’s MorphAdorner; and MorphAdorner tuned to be sensitive to Early Modern English vocabularies and spelling practices. We again tested models with word pairings and found that the XSLT and XQuery alone produced the best results. In the future, we plan to test other methods for text regularization, and add models that have been trained on the same corpora, but with different regularization processes, to allow for comparison. We will also add more robust methods for validation, update our models based on the results of that validation, and publish our validation routines, along with the code used to train the models, in this space.
For links to view and download the routines and code for this project, see here.