Downloads

Word Vector Interface and Models

Code for the Word Vector Interface can be found on GitHub. The Interface itself is an RStudio Shiny application developed by Jonathan D. Fitzgerald and Parth Tandel. It uses Benjamin Schmidt’s R package wordVectors (which itself includes a version of Tomas Mikolov’s original code for word2vec) to train and query word embedding models.

This GitHub repository includes the models used in the Word Vector Interface, as well as a JSON catalog containing descriptions of those models.

Plain-Text Generation Routines

The scripts and stylesheets that generate plaintext from XML can be found in the WWP Public Code Share on GitHub. This repository contains an XSLT stylesheet and several XQueries developed by Women Writers Project staff.

The XSLT stylesheet takes a single document encoded according to the conventions of the Text Encoding Initiative (TEI) and modifies the XML so that, within the main content, most serializers can extract plaintext from the document’s text nodes. The XQuery scripts extract plaintext from XML text nodes in the simplest possible manner.
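
As a rough illustration of what extracting plaintext from text nodes involves, here is a minimal Python sketch. It is not the WWP's XSLT or XQuery code, only a stand-in for the general idea. It assumes the document uses the standard TEI namespace and that the main content sits inside a text element. The function name extract_plaintext and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the input uses the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_plaintext(xml_string: str) -> str:
    """Return the text nodes of the main content as one whitespace-normalized string."""
    root = ET.fromstring(xml_string)
    # Look for the <text> element so that header metadata is skipped.
    main = root.find(".//tei:text", TEI_NS)
    if main is None:
        # Fall back to the whole document if no <text> element is found.
        main = root
    # itertext() yields every text node in document order.
    raw = "".join(main.itertext())
    # Collapse runs of whitespace left over from indentation and line breaks.
    return " ".join(raw.split())


# Hypothetical sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>A <hi>short</hi> example.</p></body></text>
</TEI>"""

plain = extract_plaintext(SAMPLE)
print(plain)
```

A sketch like this ignores the editorial decisions a real pipeline has to make, such as how to treat notes, abbreviations, or line-end hyphenation. Preparing the XML before extraction is the usual way to handle those cases.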

Walkthroughs

Walkthroughs for training and querying a model using the wordVectors package in R can be found in the WWP Public Code Share on GitHub or downloaded as a release. This repository contains six RMarkdown files that provide introductions to R and RStudio, model training and querying, visualization, evaluation, and other frameworks for interacting with word embedding models. The Toolkit hosts static, web-friendly versions of the R walkthroughs.

We also offer a set of Python notebooks that provide substantially the same content as the R walkthroughs. They are available in this repository or for download as a release.
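
For readers who want a sense of what querying a model means before opening the walkthroughs, the following Python sketch ranks words by cosine similarity, a measure commonly used to compare word vectors. It is not taken from the walkthroughs or notebooks. The function names and the three-dimensional toy vectors are hypothetical, and a trained model would have far more words and dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(model, word, n=2):
    """Return the n words whose vectors are most similar to the vector for `word`."""
    target = model[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in model.items()
        if other != word  # skip the query word itself
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical toy "model": a mapping from words to made-up vectors.
TOY_MODEL = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

neighbors = nearest_words(TOY_MODEL, "queen")
for neighbor, score in neighbors:
    print(neighbor, round(score, 3))
```

In this toy data, "king" ranks above "apple" as a neighbor of "queen" because its vector points in nearly the same direction.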

Additional Resources

Several checklists and guides are available on the main WWP site. These worksheets are short documents intended to help with:

preparing a corpus,
validating models,
experimenting with settings for training models, and
using findings from word vector models to jump-start a corpus-wide search.