Word Vector Interface and Models
Code for the Word Vector Interface can be found on GitHub at https://github.com/NEU-DSG/wwp-w2vonline. The Interface itself is an RStudio Shiny application developed by Jonathan D. Fitzgerald and Parth Tandel. It uses Benjamin Schmidt’s R package wordVectors (which itself includes a version of Tomas Mikolov’s original code for word2vec) to train and query word embedding models.
This GitHub repository includes the models used in the Word Vector Interface, as well as a JSON catalog containing descriptions of those models.
Plain-Text Generation Routines
The stylesheet and scripts that generate plaintext from XML can be found at https://github.com/NEU-DSG/wwp-public-code-share/tree/master/fulltext. This directory contains an XSLT stylesheet and several XQueries developed by Women Writers Project staff.
The XSLT stylesheet takes as input a single document encoded according to the conventions of the Text Encoding Initiative (TEI). It modifies the XML so that, within the main content, most serializers can extract plaintext directly from the document’s text nodes. The XQuery scripts extract plaintext from XML text nodes in the simplest possible manner.
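To illustrate the general idea of pulling plaintext out of XML text nodes, here is a minimal sketch in Python. It is not the Women Writers Project's XSLT or XQuery code and does not reproduce its handling of TEI features. The function name extract_plaintext and the sample document are hypothetical, and the sketch assumes that the document's main content sits in a TEI text element that is a direct child of the root.

```python
import xml.etree.ElementTree as ET

# The standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def extract_plaintext(tei_xml: str) -> str:
    """Return whitespace-normalized text from a TEI document's <text> element.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(tei_xml)
    # Start from <text> so that <teiHeader> metadata is left out.
    main = root.find(f"{{{TEI_NS}}}text")
    if main is None:
        main = root  # Assumption: fall back to the whole document.
    # itertext() yields every text node below the element, in document order.
    raw = "".join(main.itertext())
    # Collapse runs of whitespace (indentation, newlines) into single spaces.
    return " ".join(raw.split())


# Hypothetical sample document, not taken from any real collection.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Hello <hi>plain</hi> text.</p>
      <p>Second paragraph.</p>
    </body>
  </text>
</TEI>"""

print(extract_plaintext(sample))
```

A naive extraction like this depends on the whitespace already present between elements. If the two paragraphs in the sample were written with no whitespace between them, their text would run together, which is the kind of problem that preparing the XML beforehand can avoid.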
Walkthroughs for training and querying a model using the wordVectors package can be found at https://github.com/NEU-DSG/wwp-public-code-share/tree/master/WordVectors. This directory contains two RMarkdown files. The introduction includes detailed instructions and comments and covers the full process, from installing packages to testing models; it is designed for those who have a basic familiarity with R but no previous experience with word2vec. The template file includes code for training and querying models with minimal instruction; it is designed to make these processes convenient for those who are already familiar with the basics.
The repository also contains a "data" folder with a small set of test texts, an "output" folder where any exports can be saved, and a project file to help organize work in this directory.
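For readers new to word embedding models, the following Python sketch shows roughly what querying a model involves: ranking words by the cosine similarity of their vectors to a target word's vector. It is not the wordVectors package or its API, which the walkthroughs use in R. The function names and the three-dimensional toy vectors are invented for illustration; a trained model would have many more dimensions and a far larger vocabulary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_words(vectors, word, n=3):
    """Return the n words most similar to `word`, excluding the word itself.

    Hypothetical helper for illustration only.
    """
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Invented toy vectors; a real model learns these from a corpus.
toy_vectors = {
    "queen": [0.9, 0.8, 0.1],
    "king": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
    "pear": [0.2, 0.1, 0.9],
}

for neighbor, score in closest_words(toy_vectors, "queen", n=2):
    print(f"{neighbor}\t{score:.3f}")
```

With these toy vectors, "king" ranks first for "queen" because the two vectors point in nearly the same direction, while the fruit words score much lower.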