Methods

The Word Vector Interface

Code for the Word Vector Interface can be found on GitHub. The Interface itself is an RStudio Shiny application developed by Jonathan D. Fitzgerald and Parth Tandel. It uses Benjamin Schmidt’s R package wordVectors (which itself includes a version of Tomas Mikolov’s original code for word2vec) to train and query word embedding models.

The GitHub repository includes the models used in the Word Vector Interface, as well as a JSON catalog containing descriptions of those models.

Corpora

The Word Vector Interface is powered by three collections of documents from three different projects. Each corpus was transcribed and encoded in XML, following the guidelines of the Text Encoding Initiative (TEI). By using texts marked up in TEI, we were able to make nuanced choices about the textual features used to generate the files on which we trained the word embedding models.

The Women Writers Project’s Women Writers Online (WWO) corpus collects works authored, translated, or compiled by women and published between 1526 and 1850. The corpus is available online to subscribing institutions and individuals (information on one-month free trials is on the WWP website). The XML files are freely available on request: send an email with a brief description of your research to wwp@northeastern.edu.

The Interface also includes models trained on texts from the Victorian Women Writers Project (VWWP). The VWWP began at Indiana University in 1995 and aims to bring attention to lesser-known British women writers of the 19th century. The collection contains a range of genres, including poetry, novels, children’s books, political pamphlets, religious tracts, and histories.

Finally, the Interface contains several models trained on the Early English Books Online (EEBO), Eighteenth-Century Collections Online (ECCO), and Evans Early American Imprint TEI texts, which were created by the Text Creation Partnership (TCP) and developed into corpora by the Visualizing English Print (VEP) project. These include the VEP Early Modern 1080 Collection, the Evans and EEBO corpora from the VEP TCP Collection, and the VEP Super Science Collection.

Corpus Preparation

To get plain text out of the WWO XML, we transformed the corpus using a suite of tools written in XSLT and XQuery, languages designed for working with XML documents. The XSLT stylesheet uses the TEI encoding to make nuanced choices about significant textual content. For example, our text outputs expand abbreviations and correct errors, drawing on the <abbr> and <expan> elements for the former and the <sic> and <corr> elements for the latter.
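As a rough illustration of this kind of choice, here is a minimal Python sketch (not the WWP's XSLT) that walks a TEI fragment and, inside a <choice> wrapper, keeps <expan> and <corr> while dropping the paired <abbr> and <sic>. The sample sentence, the function names, and the assumption that the pairs are wrapped in <choice> are ours, made only for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: within <choice>, the element named by the value is preferred
# over the element named by the key.
PREFERRED = {"abbr": "expan", "sic": "corr"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def plain_text(elem):
    """Return the text of elem, preferring expansions and corrections."""
    parts = [elem.text or ""]
    children = list(elem)
    names = {local_name(child.tag) for child in children}
    in_choice = local_name(elem.tag) == "choice"
    for child in children:
        # Drop <abbr>/<sic> only when the preferred sibling is present,
        # so an unpaired abbreviation is kept rather than lost.
        dropped = in_choice and PREFERRED.get(local_name(child.tag)) in names
        if not dropped:
            parts.append(plain_text(child))
        # Text following a child belongs to the parent's output either way.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical sample, not taken from WWO.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">The book '
    "<choice><abbr>wch</abbr><expan>which</expan></choice> she "
    "<choice><sic>wrot</sic><corr>wrote</corr></choice>.</p>"
)

print(plain_text(ET.fromstring(SAMPLE)))  # The book which she wrote.
```

An unpaired <abbr> or <sic> passes through unchanged in this sketch; a real pipeline would need its own policy for that case.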

In addition to transforming the output based on particular elements, we use XQuery scripts to remove elements that skew results with word embedding models, such as speaker labels in drama. Modern writing, such as metadata and text written by WWP encoders and staff, is also removed. The XQuery scripts also allow us to construct sub-corpora by extracting specified sections of XML documents using XPath.
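The two operations described here, removing elements and selecting a section by path, can be sketched in Python as follows. This is an analogue under stated assumptions, not the WWP's XQuery: the sample document, the function names, and the choice of <body> as the extracted section are hypothetical, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def drop_elements(root, name):
    """Remove every TEI element called `name`, keeping the text after it."""
    tag = "{%s}%s" % (NS["tei"], name)
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != tag:
                continue
            if child.tail:
                # The tail is text that follows the element; reattach it
                # to the previous sibling or to the parent before removal.
                idx = list(parent).index(child)
                if idx == 0:
                    parent.text = (parent.text or "") + child.tail
                else:
                    prev = parent[idx - 1]
                    prev.tail = (prev.tail or "") + child.tail
            parent.remove(child)


def section_text(root, xpath):
    """Join the text of every section matched by `xpath`, one space apart."""
    parts = ["".join(section.itertext()) for section in root.findall(xpath, NS)]
    return " ".join(" ".join(parts).split())


# Hypothetical sample, not taken from WWO.
SAMPLE = (
    '<text xmlns="http://www.tei-c.org/ns/1.0">'
    "<front><p>A note by the encoders.</p></front>"
    "<body><sp><speaker>Queen.</speaker> <l>Speak on.</l></sp></body>"
    "</text>"
)

root = ET.fromstring(SAMPLE)
drop_elements(root, "speaker")
print(section_text(root, ".//tei:body"))  # Speak on.
```

Selecting only the body also leaves out the front matter, so the encoders' note in the sample never reaches the output.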

These tools can be found in the WWP Public Code Share on GitHub.

We have also included a model trained on the WWO corpus with some additional regularization, using the routines developed by the Visualizing English Print project. Full details on the VEP project’s regularization routines are available in the VEP-Pipeline GitHub repository and documentation; we borrow their code with thanks and appreciation.

Corpus Parsing

The Women Writers Vector Toolkit (WWVT) provides various models for exploration and comparison. In addition to models trained on the full WWO collection, the interface also offers the option to compare WWO models by publication date. The WWO collection spans more than three centuries and includes information on the date of each text’s first edition, making the corpus ideal for exploring changes in word usage over time. Publication dates are used to sort and combine documents by their century of publication, with each sub-corpus containing all the documents published in a single century, except for the 16th and 17th centuries, which are combined. These models can be used for diachronic studies of literary and cultural change.
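The grouping rule can be sketched in a few lines of Python. The document identifiers, the period labels, and the convention that a year such as 1700 counts toward the 18th century are assumptions made for this example, not details of the WWP's routines.

```python
from collections import defaultdict


def period_for(year):
    """Map a publication year to a sub-corpus label."""
    century = year // 100 + 1  # e.g. 1526 -> 16, 1792 -> 18
    if century in (16, 17):
        # The 16th and 17th centuries share one sub-corpus.
        return "16th-17th"
    return "%dth" % century


def group_by_period(first_edition_years):
    """Group document ids by the period of their first edition."""
    groups = defaultdict(list)
    for doc_id, year in first_edition_years.items():
        groups[period_for(year)].append(doc_id)
    return dict(groups)


# Hypothetical document ids and dates.
SAMPLE_DATES = {"doc-a": 1526, "doc-b": 1688, "doc-c": 1792, "doc-d": 1850}

print(group_by_period(SAMPLE_DATES))
```

With the sample dates, the first two documents fall into the combined early sub-corpus and the other two into the 18th- and 19th-century sub-corpora respectively.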

Another corpus drawn from WWO’s TEI markup is the “authorial” corpus, which includes only those textual contents written by WWO authors, and excludes tables of contents, indices, editorial prefaces and dedications, subscriber lists, and so on. This corpus was prepared using the “FulltextBot,” an XSLT stylesheet developed by Ash Clark and Sarah Connell that is available in the WWP’s GitHub repository. A similar corpus includes only the contents of the <body> element, excluding front and back matter regardless of authorship.

We also used TEI markup to create several corpora in which names of places and persons were tokenized, so that each multi-word name is treated as a single token. This was done by inserting underscores between words inside <persName> and <placeName> elements and removing extraneous spaces and punctuation. These corpora were prepared using the “Element Tokenizer,” an XSLT stylesheet developed by Juniper Johnson and Ash Clark that is available in the WWP’s GitHub repository.
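A minimal Python sketch of the same idea follows. It is not the Element Tokenizer itself: the sample sentence, the function names, and the rule for what counts as a word (letters and digits, with internal apostrophes and hyphens kept) are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NAME_TAGS = {"{%s}persName" % TEI_NS, "{%s}placeName" % TEI_NS}

# A "word" here is a run of letters/digits, optionally continued by an
# internal apostrophe or hyphen (an assumption made for this sketch).
WORD = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def join_name(text):
    """Collapse a name to one token: 'St. James's Park' -> 'St_James's_Park'."""
    return "_".join(WORD.findall(text))


def tokenize_names(root):
    """Replace the content of each name element with its single-token form."""
    for elem in list(root.iter()):
        if elem.tag in NAME_TAGS:
            token = join_name("".join(elem.itertext()))
            for child in list(elem):
                elem.remove(child)
            elem.text = token
    return "".join(root.itertext())


# Hypothetical sample, not taken from WWO.
SAMPLE = (
    "<p xmlns='http://www.tei-c.org/ns/1.0'>Letters from "
    "<persName>Mary  Wollstonecraft</persName>, written in "
    "<placeName>St. James's Park</placeName>.</p>"
)

print(tokenize_names(ET.fromstring(SAMPLE)))
```

For the sample, the two names come out as Mary_Wollstonecraft and St_James's_Park, while the punctuation outside the name elements is left untouched.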

Model Testing

The WWP has held two phases of model training and testing since the launch of the Toolkit. Models trained prior to 2025 were tested against a basic list of word pairs that, based on our knowledge of these corpora, we expect to have high degrees of similarity (such as “before” and “after”; “father” and “mother”; “king” and “prince”; and “holy” and “sacred”). With these testing routines, we obtained the best results from models created with a window size of 6 words, vectors of 100 dimensions, 10 iterations, and negative sampling set at 15.
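For readers working in Python rather than R, these settings might be expressed as follows. This is a sketch that swaps in the third-party gensim library for the wordVectors package used to train the WWVT models. The mapping of the vector setting of 100 to gensim's vector_size, and of iterations to epochs, is our assumption, and the function name is hypothetical.

```python
# Pre-2025 settings reported above, written with gensim's keyword names.
# Assumption: the vector setting of 100 corresponds to vector_size,
# and "iterations" corresponds to epochs.
PRE_2025_PARAMS = {
    "vector_size": 100,
    "window": 6,
    "epochs": 10,
    "negative": 15,
}


def train(sentences, params=PRE_2025_PARAMS):
    """Train a word2vec model on an iterable of token lists."""
    # gensim is imported lazily so that the settings above can be
    # inspected even where the library is not installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=sentences, **params)
```

Other gensim defaults, such as the minimum word count, are left untouched here and may differ from the defaults in wordVectors, so results would not be directly comparable.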

In 2025, the WWP released an Evaluation Guide for Word Embedding Models and an essay on Word Vector Modeling Evaluation, both authored by Avery Blankenship. These guides were released with the WWP’s resources for using Python to train word2vec models, but we adapted the code to work with the R-trained models in the WWVT. Drawing from the keyword lists generated for the Python testing routines, we calculated cosine similarity scores for 3,373 word pairs that we expect to have high degrees of similarity. These improved testing routines led us to slightly modify our parameters; models trained after 2025 have a window size of 5 words, vectors of 100 dimensions, 10 iterations, and negative sampling set at 20. As of winter 2026, we have begun updating the models in the WWVT with these new parameters and with the texts recently added to WWO; we expect to complete this process by summer 2026.
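Cosine similarity, the measure used in this testing, is the dot product of two word vectors divided by the product of their lengths. The sketch below scores a list of word pairs in plain Python. The three-dimensional toy vectors and the function names are hypothetical and stand in for the vectors of a trained model; this is not the WWP's evaluation code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_pairs(vectors, pairs):
    """Score each pair whose two words are both in the model vocabulary."""
    scores = {}
    for first, second in pairs:
        if first in vectors and second in vectors:
            scores[(first, second)] = cosine_similarity(
                vectors[first], vectors[second]
            )
    return scores


# Toy vectors invented for illustration; real models use many more dimensions.
TOY_VECTORS = {
    "king": [0.9, 0.1, 0.3],
    "prince": [0.8, 0.2, 0.4],
    "holy": [0.1, 0.9, 0.2],
    "sacred": [0.2, 0.8, 0.1],
}

PAIRS = [("king", "prince"), ("holy", "sacred"), ("king", "sacred")]

for pair, score in score_pairs(TOY_VECTORS, PAIRS).items():
    print(pair, round(score, 3))
```

Pairs with a word missing from the vocabulary are skipped rather than scored, which matters when the same pair list is run against models trained on different corpora.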

Model Versioning

As the WWP team continues adding new texts to Women Writers Online, we will continue to update the models in the Toolkit. Current and previous versions of our models are available in our GitHub repository.

Downloads

For links to view and download the routines and code for this project, see our Downloads page.