WWP Advanced Institutes: Word Vectors for the Thoughtful Humanist


To apply for participation in one of the WWP’s institutes on word embedding models, please send the following information to wwp@neu.edu:

  • Your name and institutional affiliation.
  • A contact email address where we can reach you.
  • The seminar you are applying to attend.
  • A brief (two or three paragraphs) statement of interest describing your work with digital methods (teaching and/or research) and how this institute would support it. If you have a specific course or research project in mind, please describe it as well.
  • This institute includes the option to bring a data corpus of your own (see below for details). If you have a data corpus you’d like to work on, please describe it briefly: how many and what kind of documents? How many words total? What format is the data currently in? If the corpus uses TEI or other XML markup, please briefly describe the markup and its purpose.
  • Please note that the May 2022 seminar will be conducted virtually, so no travel arrangements will be necessary.

Applicants will be notified by the dates listed on the seminar schedule.

Information on the Data Corpus

All four institutes will start by focusing on several established data collections, including a large collection of women’s writing based on Women Writers Online and several complementary corpora of male-authored texts drawn from EEBO/ECCO and other digital collections. Participants are also encouraged, but not required, to bring a corpus of their own to experiment with. Here are a few things to take into account when planning your corpus:

  • Size: You will need a corpus of at least 500,000 words, but to get statistically meaningful results, a minimum corpus size of 2 million words is better. For logistical reasons, your corpus shouldn’t be larger than 50 million words.
  • Format: Your corpus needs to be in plain text, and we can provide guidance on conversion from word-processing formats (e.g. Word, RTF). Although TEI data is not required for these institutes, if your data is already in TEI there may be some additional forms of analysis to explore; we will work with you to take advantage of whatever markup you have.
  • Scope and content: Word embedding models offer ways of analyzing associations between words, discovering vocabularies and vocabulary shifts that are characteristic of a document collection, and making comparisons between corpora. Although you can learn something interesting from any corpus, your results may be most revealing if your corpus coheres around a period, a genre, or some other property. The texts in your corpus should primarily be in the same language. Individual texts can be of any length or genre (as long as the total corpus word count is sufficient). Since we will be providing a corpus of women’s writing from 1550–1900, and several parallel corpora of male-authored texts covering the same period (both of which will offer options for genre-specific subsets), you might think about what kind of document collection could produce interesting comparisons.
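
If you would like a rough sense of whether your corpus falls within the size range described above, the following Python sketch may help. It is only an illustration, not an official WWP tool: it assumes your corpus is a folder of UTF-8 plain-text files ending in .txt, and the word-matching rule and the function names (count_words, corpus_word_count, assess_size) are our own assumptions. Other tools may count words somewhat differently, so treat the total as an estimate.

```python
import re
import sys
from pathlib import Path

# Size thresholds taken from the guidance above.
MIN_WORDS = 500_000
RECOMMENDED_WORDS = 2_000_000
MAX_WORDS = 50_000_000

# Assumption: a "word" is a run of letters/digits, optionally joined by
# apostrophes or hyphens (e.g. "lady's", "well-known").
WORD_RE = re.compile(r"\w+(?:['’-]\w+)*")


def count_words(text: str) -> int:
    """Return an approximate word count for one string of plain text."""
    return len(WORD_RE.findall(text))


def corpus_word_count(corpus_dir) -> int:
    """Sum approximate word counts over every .txt file under corpus_dir."""
    total = 0
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        # Undecodable bytes are replaced rather than raising an error.
        total += count_words(path.read_text(encoding="utf-8", errors="replace"))
    return total


def assess_size(total: int) -> str:
    """Compare a word count against the size guidance."""
    if total < MIN_WORDS:
        return "too small"
    if total > MAX_WORDS:
        return "too large"
    if total < RECOMMENDED_WORDS:
        return "usable, below recommended size"
    return "recommended size"


if __name__ == "__main__" and len(sys.argv) > 1:
    words = corpus_word_count(sys.argv[1])
    print(f"{words:,} words: {assess_size(words)}")
```

Saved under a hypothetical name such as corpus_size.py, it could be run as: python corpus_size.py path/to/your/corpus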

For all of the institutes, we will pre-train models, which will be hosted in the WWP Lab for use during the institute and for a period of three months following the event; in special cases we may be able to extend the hosting period (for instance, to support a course).

For the intensive institutes, we will walk participants through the process of training models and (if appropriate) using XQuery to select specific TEI/XML features for analysis. Prior to the event, we will work with participants to prepare their data to ensure a smooth process during the institute itself.

We’ll send more detailed instructions to participants as part of the event preparation.

Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.