Glossary

This glossary is a list of terms selected and defined with the intention of making word embedding models accessible to readers at many different levels of expertise.

Algorithm
A process or set of rules used to solve a problem or make a calculation. An algorithm may be expressed in structured language such as a computer program, or described in a visualization such as a flowchart. Algorithms are particularly useful for repeatable tasks with multiple steps.
Code
An expression of one or more algorithms. Code is generally written to the standards of a particular programming language. When written in this way, code can be read and its instructions acted upon by computers.
Context word
During the model training process, the words that appear within the window around the target word (i.e. the words that are assessed as “related to” the target word).
Continuous bag-of-words model
An approach to training a word embedding model in which the model tries to predict the target word from a set of context words that are present in a window surrounding the target word. The term “bag-of-words” refers to the fact that the order of words within the window is not significant: each window (X words on either side of the target word) is treated as a bag of words rather than a series.
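To make “bag of words” concrete, here is a minimal sketch in R (the statistical programming language used elsewhere in this project) that gathers the context words around a single target word in a toy sentence. The sentence, the helper function context_words, and the window size of 2 are all invented for illustration; a real training process repeats this for every word in the corpus.

    # A toy sentence, split into words.
    words <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")

    # Collect the context words within a window of 2 on either side of a target word.
    context_words <- function(words, target_position, window = 2) {
      positions <- max(1, target_position - window):min(length(words), target_position + window)
      words[setdiff(positions, target_position)]
    }

    # For the target word "fox" (position 4), the unordered "bag" of context words:
    context_words(words, 4)   # "quick" "brown" "jumps" "over"

In the continuous bag-of-words approach, these four context words (in any order) are what the model uses to try to predict the target word “fox.”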
Corpus (Plural: Corpora)
A collection of textual data. A corpus is usually analyzed or manipulated in some fashion.
Cosine similarity
In the context of word embedding models, the relationship between two words, expressed as the geometrical relationship between their respective vectors in the corpus vector space. If the cosine of the angle between two word vectors is 1, the two words are identically positioned within the vector space. If the cosine is 0, the two words are as unrelated as possible within the vector space (negative cosine similarities do not occur in this analysis since the numbers that contribute to the computation are all positive).
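To make the geometry concrete, here is a minimal sketch in R of how the cosine similarity of two word vectors can be computed: the dot product of the two vectors divided by the product of their lengths. The two three-dimensional “word vectors” below are invented for illustration; vectors in a real model have hundreds of dimensions.

    # Cosine similarity: the dot product of two vectors divided by the product of their lengths.
    cosine_similarity <- function(a, b) {
      sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
    }

    # Two invented three-dimensional "word vectors"; vectors pointing in
    # similar directions score close to 1, unrelated vectors close to 0.
    queen <- c(0.8, 0.1, 0.3)
    king  <- c(0.7, 0.2, 0.3)
    cosine_similarity(queen, king)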
Data (Singular: Datum)
Facts, statistics, or information collected for analysis. Data can come in all shapes and sizes and may look different across disciplines as well as among groups within disciplines.
Documentation
A way of providing information about the choices made in a project; documentation is useful for accessibility, transparency, and archiving. On this site, we have documentation about how the code was implemented, how corpora were selected and prepared, and more.
Embedding
Broadly speaking, an embedding is a compressed vector representation of a corpus which renders the data more dense, resulting both in faster processing and in more meaningful analytical outcomes. When representing a corpus with (for instance) a 1000-word vocabulary as a vector space, each word can be considered by itself as a vector with 1000 dimensions, 999 of which are zero. However, this kind of “sparse” vector space gives no information about connections between words, and it is also computationally unwieldy because of the large number of dimensions. (Think of a 1000-dimensional cube!) When we train an embedding model, we use the corpus itself (and the groupings of words that are observed via the window) to discover relationships between words: from this training process we get a model in which each word is a vector with a smaller number of dimensions representing the actual words in whose company it is most commonly found. (The term “embedding” describes the way in which the training process “flattens” the high-dimensional space; think about what would happen if a heavy weight landed on top of a paper cube and embedded its top layer in its bottom layer.) This model has fewer dimensions and is also much “denser” (less “sparse”) because none of the dimensions is zero (all represent actual word relationships). As a result, it is both easier to handle computationally, and also more information-rich, since it captures the results of all of the observations of the actual corpus.
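The contrast between “sparse” and “dense” can be illustrated with a few lines of R. The tiny five-word vocabulary and the numbers in the dense vector below are invented for illustration; a real corpus has thousands of vocabulary words, and the dense values are learned during training rather than chosen by hand.

    # A tiny vocabulary (a real corpus might have tens of thousands of words).
    vocabulary <- c("fox", "dog", "queen", "king", "ship")

    # Sparse ("one-hot") representation of "queen": one dimension per vocabulary
    # word, a 1 in its own position and 0 everywhere else. It says nothing about
    # which words are related to which.
    one_hot_queen <- as.numeric(vocabulary == "queen")
    one_hot_queen    # 0 0 1 0 0

    # A dense embedding of "queen" (values invented): far fewer dimensions, none
    # of them structurally zero, each one learned from the company the word keeps.
    dense_queen <- c(0.82, -0.15, 0.31)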
Iteration
Within the model training process, a single cycle through the process of examining each word and its context (see window) and adjusting its positioning in vector space. The model training process takes multiple iterations, and the more iterations the more refined the model (but the more time required to produce it). The degree of that refinement depends on the size of the input corpus, with larger corpora requiring fewer iterations. The number of iterations is specified as a parameter during the model training process, and a reasonable starting point (as a sort of generic default) is 10.
k-means clustering
Generally speaking, “clustering” is a process of analyzing a data set to discover clusters of related data points. Starting with a decision about how many clusters there are (this is a parameter that can be set by the researcher), the clustering algorithm tries to find the best fit within the data set: that is, the set of cluster boundaries that result in the minimum distance between data points within each cluster. (Intuitively, this is a bit like a human being looking at a map with dots representing human population, and identifying the likely locations of town centers.) k-means clustering uses an algorithm which seeks to identify a specified number (k) of “centroids”: that is, points that are the mathematical centers of clusters. It starts by randomly selecting a set of k points (think of these as being like “candidate centroids”), and then assigns every data point within the set to the closest candidate centroid. Next, it calculates the actual centroid of each resulting cluster (which typically shifts the centroid a bit). Next, it reassigns each data point to the centroid that is now closest. These two steps are repeated, with the centroids’ positions and the data points that form their “clusters” being incrementally adjusted, until the process reaches equilibrium or until a specified maximum number of iterations is completed. In the context of word embedding models, k-means clustering can be used to identify clusters of related words within the model.
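As a rough sketch of how this looks in practice, the example below runs R’s built-in kmeans() function on a toy matrix of “word vectors.” The words, the two-dimensional values, and the choice of k = 2 are invented for illustration; in a real project the rows would come from a trained model, and the choice of k would be an analytical decision.

    # A toy matrix of word vectors: one named row per word, one column per dimension.
    vectors <- rbind(
      sea    = c(0.9, 0.1),
      ocean  = c(0.8, 0.2),
      ship   = c(0.7, 0.3),
      prayer = c(0.1, 0.9),
      psalm  = c(0.2, 0.8)
    )

    # Ask for k = 2 clusters: kmeans() alternates between assigning each word to
    # its nearest centroid and recomputing the centroids, as described above.
    set.seed(42)    # make the randomly chosen starting centroids repeatable
    clusters <- kmeans(vectors, centers = 2)

    clusters$cluster    # which cluster each word ended up in
    clusters$centers    # where the two centroids landed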
Model
A representation of a concept or thing. When created for analysis, models are distillations of available data. Models are always lossy compared to the original versions—they reduce complexity in order to aid understanding. Because of this, models must be constructed carefully, so that loss is minimized in areas where the model needs to be representative. For this project, we used textual models (XML representations of actual books) to create word embedding models.
Natural language processing (NLP)
A domain of computer science and related fields focused on processing and analyzing human (“natural” rather than machine) language. NLP includes things like machine translation of text from one language to another, parsing of language (to identify parts of speech and syntactic structures), identification of meaning and affect (as in sentiment analysis), and speech recognition.
Negative sampling
Within the process of training a word embedding model, negative sampling is a technique for reducing the computational burden of the training process. During the training process, each iteration updates the information about the position of each word in the model (making it progressively more accurate). Because there are many thousands of words in the model, doing that update with every iteration is time-consuming and computationally costly. With negative sampling, instead of updating every word, the training process updates only the words directly observed within the window, plus a random sampling of the other words in the model. Recommended values for the number of negative samples are 5 to 20 for smaller corpora and 2 to 5 for larger ones.
Programming language
A strictly-defined set of syntactical rules, used when writing or executing code. The term “programming language” may also be used more broadly to describe a language’s “implementations”: programs which read, validate, compile, or execute code.
R
A programming language and environment used for statistical computing and graphics.
Skip-gram model (or continuous skip-gram model)
An approach to training a word embedding model in which the training process tries to predict the context words that will appear within a window around a given target word. The term “skip-gram” (analogous to “n-gram”) refers to the fact that the training process considers the target word together with each of the context words in turn; each pair of words is a “skip-gram” in the sense that it skips over the intervening words within the window. (So for instance in the sentence “the quick brown fox jumps over the lazy dog”, the word pair “quick dog” is a skip-gram since it skips over “brown fox jumps over the lazy”.) The skip-gram approach gives different weights to words based on their distance from the target word, so unlike the continuous bag-of-words model, word order within the window is to some extent treated as significant.
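The sketch below lists the (target word, context word) pairs, i.e. the skip-grams, that a window of 2 would produce for a single target word in a toy sentence. The sentence, the helper function skip_gram_pairs, and the window size are invented for illustration, and this simple version does not apply the distance-based weighting described above.

    words <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")

    # All (target, context) pairs for one target position, within a window of 2.
    skip_gram_pairs <- function(words, target_position, window = 2) {
      positions <- max(1, target_position - window):min(length(words), target_position + window)
      context <- setdiff(positions, target_position)
      data.frame(target = words[target_position], context = words[context])
    }

    skip_gram_pairs(words, 4)    # the pairs fox-quick, fox-brown, fox-jumps, fox-over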
Target word
During the model training process, the word around which the window is centered (and whose context words are assessed as part of the calculation of what words are related to one another).
TEI
Stands for “Text Encoding Initiative.” TEI is a collaboratively developed XML schema that is used to standardize the markup of textual documents. The Women Writers Project uses TEI to encode the texts in WWO, and this project uses models trained on TEI-encoded files that have been selectively transformed into plain text.
Training
In the context of word embedding models, training is the process by which we create a model of a text corpus: in other words, the process of analyzing the corpus to observe what words appear near each other, and turning those observations computationally into a mathematical model. The training process and the resulting model are controlled by several different parameters that affect how those observations and computations are done: for instance, the level of word proximity that is taken into consideration (the window) and the number of observations (iterations) made during the training process.
Vector
In mathematics, a vector is a line segment connecting two points in space that has both directionality (i.e. it goes from point A to point B rather than the other way around) and extent (i.e. the two points are a specific distance apart). If you picture a dandelion head with the little silver seeds sticking out of it, or a person with their hair standing on end, each seed or hair could be represented by a distinct vector. (And their angular relationships could be represented by cosine similarity; see above.) In word embedding models, each word in the corpus could be imagined as the endpoint of a vector (a single dandelion seed at the end of its tiny hair), with the other endpoint being the geometrical origin (akin to the center of the dandelion). While the dandelion exists in three-dimensional space, a model's vector space has hundreds of dimensions. The specific vector for a given word is calculated based on the words that tend to appear near it. (What counts as “nearness” is controlled by the size of the window we set when we train the model; see window below.) Each iteration of the training process adds information to the calculation of each word’s location in vector space, by re-examining the relative position of words in the corpus.
Window
A span of text on either side of a given target word that determines how many context words are taken into consideration when identifying words as being “near to one another” or “related” or “used together.” The window’s length is specified as a parameter during the model training process. The size of the window affects the kinds of similarities between words that are brought to visibility: a larger window will tend to emphasize topical similarities, whereas a smaller window will tend to emphasize functional and syntactic similarities. The appropriate size of the window will vary depending on the form and length of the texts in the corpus. Window sizes between 5 and 10 will work as a starting point for many projects.
Word embedding models
Word embedding models are a technique from the fields of machine learning and natural language processing in which the words in large textual datasets are represented numerically and mapped to vectors. Essentially, word embedding models allow for a spatial understanding of the words in a corpus, such that one can calculate the proximity of particular words (closer words are those used in more similar contexts) as well as their positioning along paths, or vectors (words on the same vector are those with similar relationships to one another).
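As a sketch of how the pieces of this glossary fit together in practice, the example below shows what a training call might look like using the wordVectors R package (one package commonly used for this kind of work, though not the only option). The file names, the query word, and the specific parameter values are placeholders invented for illustration, and the parameter names reflect our understanding of that package; consult its documentation before adapting this.

    library(wordVectors)

    # Train a model on a plain-text corpus (file names are placeholders).
    model <- train_word2vec(
      "corpus.txt",              # plain-text corpus prepared from the source files
      "corpus_vectors.bin",      # where the trained model is saved
      vectors = 100,             # number of dimensions in the embedding
      window  = 6,               # consider 6 words on either side of each target word
      iter    = 10,              # number of iterations over the corpus
      negative_samples = 5       # negative sampling (see above)
    )

    # Words whose vectors are closest (by cosine similarity) to a query word.
    closest_to(model, "grace")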
XML
Stands for “eXtensible Markup Language.” Like HTML, XML uses tags to layer meaning and hierarchical relationships onto text. Unlike HTML, XML has no defined tag names, and is in fact customizable. Anyone can define their own flavor of XML; the only constraint is that XML documents must be well-formed (that is, a document has only one outermost tag, and all tags are closed). TEI is one such flavor of XML, as is XSLT.
XQuery
A programming language used to obtain and manipulate data from XML. XQuery shares many of the features of XSLT, but XQuery code is not written in XML.
XSLT
Stands for “eXtensible Stylesheet Language Transformations.” XSLT is a programming language used to transform XML documents into other kinds of documents, such as HTML, plain text, or even another XML file. XSLT code is itself written in XML.