Word Vectors for the Thoughtful Humanist Institute Data Preparation Guide and Checklist
By Juniper Johnson, English PhD Student at Northeastern University
This guide is designed to walk you through the process of preparing your data for a word embedding model. Given that every corpus is different, this should not be treated as a “one size fits all” guide but should instead be considered a set of suggestions for getting your data ready for training a model. Since data preparation can involve quite a few different tasks, we have broken this guide into five stages:
- Initial data exploration
- Understanding your data’s “noise”
- Developing a preparation plan
- Cleaning your data
- Reflecting and repeating
By organizing the process into these five stages, our hope is not only that data preparation becomes less intimidating, but also that you come to see how even the most complicated data-driven tasks can be broken down into more easily digestible parts. We suggest that you keep detailed notes as you progress through this guide and begin to modify your data. Keeping notes is an important habit when working with data: it ensures that your work can be reproduced, by you or by others, and it will help you retrace your steps if something goes wrong. We also recommend that you make changes to your data incrementally. It may be tempting to load a bunch of code, hit “run,” and see what happens. In doing so, however, you run the risk of making substantial changes to your data without really knowing how those changes were made or whether the code is even working properly. Finally, like most aspects of working with word embedding models, preparing your data is likely to be iterative. You may make some decisions up front about what your data should look like and then, after training the model, discover that your data should actually be processed differently. Not only is this iteration a great way to learn about the impacts of your different processing choices, but it can also help you develop models that are more efficient and better suited to the questions you want to investigate.
Stage 1: Data Exploration
The word “data” can be a somewhat loaded term. When we think of “data” it may be tempting to conjure up images of complex charts with numbers that are only interpretable by those with a penchant for math. However, when we say “data” in this guide, what we really mean is a set of texts that are linked together by some heuristic. We call this set of texts a “corpus” (plural “corpora”). Your corpus may be one that you have downloaded pre-assembled, or it may comprise a set of texts that you have decided to bring together. In either case, an important first step in preparing your data is exploration. Take some time to look at some of the texts in your corpus and ask yourself the following questions:
- What file format are the texts in (plain text, XML, PDF, something else)?
- How many texts are there, and how long are they?
- Do the files contain metadata, transcription notes, or other material that is not part of the texts themselves?
- How are the documents structured (chapters, headers, page numbers, speaker labels)?
- How consistent are spelling, abbreviations, and OCR quality across the corpus?
Recording your answers as a brief corpus profile will make the later stages of preparation much easier.
There are a number of ways to collect and compile information for your corpus profile. In addition to making some basic observations about your data on your own, there are a number of user-friendly (and free) tools that can provide useful information about your texts. For example, Voyant Tools is web-based and does not require any installation; its tutorials and documentation are helpful for understanding what kinds of textual analysis it can perform. AntConc, unlike Voyant, needs to be downloaded and installed, but it also has a lot of useful documentation.
With either of these tools, it is easy to survey a corpus for important structural and thematic elements using different functions: word frequency, concordance, collocations, word clusters, n-grams, sentence length, and vocabulary density. To gain an “aerial” view of your data, you can also use topic modeling. One easy-to-use program is the DARIAH Topics Explorer. While there are many different perspectives on topic modeling as a form of textual analysis, it can be very useful for reading across a larger corpus for key themes, and it can therefore help you better understand what is in your corpus as you plan for data preparation. A word of warning, however: you should never make assumptions about your data based on an analytical method you don’t understand well. While some of the initial patterns identified through topic modeling may give you new ideas for experimenting with your corpus, you can only draw reliable conclusions from those patterns if you understand how topic modeling works. Your data only works if you work for it!
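If you prefer to stay in code rather than a graphical tool, the following is a minimal sketch of the same kind of “aerial” survey using the tm and topicmodels packages in R; the folder name “corpus” and the choice of ten topics are illustrative assumptions, not recommendations for your project.

```r
# A rough topic-model skim of a folder of plain-text files.
# "corpus" and k = 10 are placeholder choices for illustration only.
library(tm)
library(topicmodels)

docs <- VCorpus(DirSource("corpus", pattern = "\\.txt$"))
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(docs)

# Fit a small LDA model and list the top ten words in each topic
lda <- LDA(dtm, k = 10, control = list(seed = 42))
print(terms(lda, 10))
```

As with the DARIAH Topics Explorer, treat the output as a prompt for further exploration rather than as evidence on its own.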
On the topic of reading, another useful way to understand a corpus—especially if it was not hand curated or if you are unfamiliar with the file format, structure, and content—is simply to choose random files or sections and begin reading. Reading with data preparation in mind surfaces different features than reading strictly for content does. Take a sampling of the corpus by choosing random pages or files and reading them with the above questions in mind. If you are using a text editor like Oxygen, BBEdit, or Atom, you can also read across a corpus using simple features like “find all” and, for XML documents, XPath. Using tools to aid in exploring a corpus has two positive effects: increased knowledge and improved navigability. The more time you spend navigating the corpus without an initial agenda, the easier it is to understand what it contains. Once you have gathered all of this information, the next steps are analysis and creating an action plan.
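For those comfortable with a little scripting, here is a minimal sketch in base R of the same kind of survey; the folder name “corpus” is an illustrative assumption, and the frequency counts are a rough skim rather than a substitute for Voyant or AntConc.

```r
# Survey a folder of plain-text files: sample a few for close reading and
# skim the most frequent word forms. "corpus" is a placeholder path.
files <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)

# A handful of random files to read with the questions above in mind
to_read <- sample(files, size = min(3, length(files)))
print(to_read)

# The twenty most frequent word forms across the whole corpus
words <- unlist(lapply(files, function(f) {
  txt <- tolower(paste(readLines(f, warn = FALSE), collapse = " "))
  strsplit(txt, "[^a-z]+")[[1]]
}))
print(head(sort(table(words[words != ""]), decreasing = TRUE), 20))
```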
Stage 2: Data Analysis and Identifying “Noise”
After your initial data exploration, the next step is to analyze your data to identify which features are “noise” that may affect your word embedding model. Here, “noise” refers to anything in your data that may get in the way of your actual analysis. You can think of computers as having complete and utter trust in you: they will analyze your corpus in the exact form that you provide. So, if your corpus is a set of Victorian novels and one of the text files contains a thousand words of footnotes, your computer is going to trust that you know what you’re doing and not ask any questions about what those footnotes are doing there. For this reason, you should account for the ways that noise in your data may make your results less legible. Noise typically falls into a few categories:
Metadata: Metadata can be understood as data that describes other data. Metadata about your texts may or may not be useful to your analysis. However, remember that when the model is trained, any metadata in the corpus will be treated the same as the rest of your textual data and so, if you don’t want to study the language of your metadata, you should remove it.
Examples of metadata include:
- publication statements
- source descriptions
- data licenses or agreement statements
- citation information
- encoding descriptions
- revision statements
- copyright information
Transcription Information: If your data was transcribed, there may be transcription notes or other artifacts of the transcription process. Much as with metadata, it is likely you will want to remove these, as they are not part of the original documents. This information might include:
- transcribers’ notes
- figure descriptions
- transcribers’ or editors’ annotations
- markers for uncertainty, additions, or deletions
Structuring Features: The textual features that mark document structure vary widely, but they are generally noisy: repeated headers, numbers, and other words that, if not removed, the word embedding model will treat just like the contents of the documents. Features that we recommend removing include the following (see the sketch after this list for one way to strip simple cases such as page numbers and running headers):
- page numbers
- chapter titles or headers
- section, book, act, or volume headers
- illustrations
- data tables or graphs
- speaker labels (in drama; you may also want to consider removing stage directions)
- usernames or timestamps
- URLs
- running headers and footers
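As promised above, here is a minimal sketch in base R of stripping two of the simplest structuring features, page numbers and a repeated running header; the file paths and the header string are illustrative assumptions about how your particular corpus happens to be laid out.

```r
# Remove standalone page numbers and a verbatim running header from one file.
# Paths and the header text are placeholders.
lines <- readLines("corpus/novel.txt", warn = FALSE)

# Drop lines that contain nothing but a page number
lines <- lines[!grepl("^\\s*\\d+\\s*$", lines)]

# Drop a running header that repeats verbatim on every page
lines <- lines[trimws(lines) != "THE COMPLETE WORKS OF EXAMPLE AUTHOR"]

writeLines(lines, "corpus/novel-cleaned.txt")
```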
Paratexts: Paratexts are all the elements of a text that are not the text itself. Understanding whether or not paratexts may be of use to you is particularly important for the study of literature. Depending on the structure of your textual data, there may be paratexts that do not make sense to keep in your corpus. For example, tables of contents describe the contents of a text, but they are not the text itself. Before removing any paratexts, determine how frequent these features are, identify how you would remove them, and decide whether you are interested in them. Most often, it is more effective to remove paratexts, including the following:
- tables of contents
- pronunciation guides
- prefaces
- appendices
- frontispieces
- title pages
- acknowledgements
- abstracts
- letters from the editor (letters from the author may also be worth removing, depending on your project)
- advertisements
- indices
Transformations and Regularizations
Ideally, your corpus should contain plain text files. For corpora that contain data that is not already in plain text, an important consideration is how you will transform the text. If your data is in TEI or any other form of XML, it is fairly straightforward to use XQuery to transform XML to plain text. Additionally, because features are explicitly tagged in XML documents, you can combine cleaning and transformation into the same process, removing certain elements or transforming only certain portions of the text.
The WWP has several XQueries for this purpose on GitHub. These include an XQuery that can be used to transform non-TEI XML data to plain text.
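If you would rather work in R than XQuery, the sketch below shows one way to do a comparable transformation with the xml2 package; the element names (teiHeader, note) and file paths are illustrative, and this is not a substitute for the WWP’s XQueries.

```r
# Strip unwanted elements from a TEI/XML file and flatten the rest to plain text.
# Element names and paths are placeholders for whatever your own plan specifies.
library(xml2)

doc <- read_xml("xml/sample.xml")

# Remove the TEI header and editorial notes before extracting text
xml_remove(xml_find_all(doc, "//*[local-name() = 'teiHeader']"))
xml_remove(xml_find_all(doc, "//*[local-name() = 'note']"))

# Everything that remains, flattened to plain text
writeLines(xml_text(doc), "plaintext/sample.txt")
```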
Another important aspect of data preparation is regularization. Common forms of regularization include modernizing archaic spellings, expanding abbreviations, fixing OCR errors, and correcting misspelled words. Regularization matters because your model will process each spelling variation or OCR error as a separate word from the regularized version of the same word. Regularizing your data helps ensure that your model won’t treat, for example, “New York City” and “NYC” as two unrelated entities. If you are using a corpus that was already prepared, it is important to consider what steps (if any) have been taken to regularize the text. See if you can find other versions of your texts that are more or less regularized to use in comparison. Regularization can be quite time-intensive, but if you choose not to do it, you may find inconsistencies in your final word embedding model (for example, the same word spelled several different ways will be represented by several different vectors).
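Here is a minimal sketch of dictionary-style regularization in base R; the variant spellings are illustrative examples only, and a real project would build its list from the inconsistencies you actually find in your corpus.

```r
# Replace known spelling variants with a regularized form across one file.
# The lookup table and file paths are placeholders.
text <- readLines("plaintext/sample.txt", warn = FALSE)

regularizations <- c(
  "publick"      = "public",   # archaic spelling
  "to-day"       = "today",    # hyphenated form
  "N\\.Y\\.C\\." = "NYC"       # abbreviation variant (periods escaped for regex)
)

for (pattern in names(regularizations)) {
  text <- gsub(pattern, regularizations[[pattern]], text, ignore.case = TRUE)
}

writeLines(text, "plaintext/sample-regularized.txt")
```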
Stage 3: Data Preparation Plan
After completing data analysis for your corpus, the next stage is to create a plan for how you want to modify your original corpus before model training. This plan can take many forms, but the information that it should contain is: a) what textual features you will be removing or changing, b) what documents from within your corpus will be modified for each change, and c) how you will make these changes. This last piece of information, in particular, is important to document and, as you start implementing these changes, is likely to evolve.
As with the earlier recommendation to take notes about your corpus, taking notes about the changes you make throughout data preparation is essential, especially if you make a change that you may later want to revert. Whether or not you are using a form of version control for your corpus—either with tools like GitHub or by saving different copies on your own computer—documenting the changes between versions of your corpus is essential. You should also make sure to preserve a completely unmodified version of your corpus before you begin to make any changes. In the event that you want to go back to an earlier form of your corpus, having this documentation (both as a plan and as a record of changes) is very helpful. For example, here is a brief outline of a generic data preparation plan, followed by a small sketch of one way to preserve your original files:
Corpus Level
Data transformation routines (as needed)
Restructuring (create a new version with original saved as-is)
File/Document Level
Metadata and paratexts
Regularizations
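To make the “preserve an unmodified copy” habit concrete, here is a minimal sketch in base R of one possible directory setup; the folder names are illustrative, and tools like Git accomplish the same thing more robustly.

```r
# Keep untouched originals in one folder and write all modifications elsewhere.
# Folder names are placeholders.
dir.create("corpus-raw", showWarnings = FALSE)       # originals, never edited
dir.create("corpus-cleaned", showWarnings = FALSE)   # everything you modify

originals <- list.files("corpus", pattern = "\\.txt$", full.names = TRUE)
file.copy(originals, "corpus-raw")

# Every cleaning step should read from corpus-raw/ and write to corpus-cleaned/,
# so the original files are never edited in place.
```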
Stage 4: Clean and Modify Corpus
After creating your data preparation plan, the next step is to put this plan into action and modify your corpus. This process can take shape in a variety of ways, depending on many factors: your experience with different tools, your research questions, and the composition of your corpus (content, format, and quantity). Indeed, there are many different tools and tutorials for data manipulation that are great for all different skill levels. Here are some useful resources for data cleaning and preparation:
OpenRefine
OpenRefine is a free, Google-supported tool for working with data that allows for exploration, transformation, data matching, and manipulation. Here are some tutorials and documentation on how to use it:
- “Getting Started with OpenRefine” by Miriam Posner (for an undergraduate class)
- “Getting Started with OpenRefine” by Thomas Padilla (includes use cases)
- OpenRefine official documentation for users and developers
Regular Expressions
Regular expressions are helpful for finding, modifying, or removing repeated patterns in data by describing a sequence of characters in a text or dataset. Many text editors have an option to use regular expressions with their “find all” or “find and replace” features. While there are a few different regular expression flavors (it is helpful to check which one your preferred text editor uses), here are some generally useful introductory and intermediate resources:
- “Understanding Regular Expressions” from the Programming Historian by Doug Knox
- Regexone has many different interactive tutorials
- Regular Expressions Cookbook by Jan Goyvaerts and Steven Levithan
The following are some regular expressions that we at the WWP use in data preparation (pay attention to case sensitivity and spacing):
- Chapter labels: for different cases of capital letters or headings spelled out, see the following variations of the simple regex → Chapter \w+
  - Different titles → (chapter|book) \w+ (separate all choices with a | )
  - Numbers spelled out → Chapter (the )?\w+(\.\w+|-\w+)?( (ONE|TWO|THREE|FOUR|FIVE|SIX|SEVEN|EIGHT|NINE)\w*)?
  - Headings on a new line → \nChapter \w+ or \n(chapter|book) \w+
- Illustrations: for illustration captions using square brackets and prefaced with the string “Illustration” → \[Illustration([^\]]+)?\]
- Timestamps: following the pattern 00:00 am or pm → \d\d:\d\d ?(am|pm)?
- Dates: for a month followed by a year → (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}
- Speaker labels: for speaker names that appear alone on a single line → \n[A-Za-z\s-]+\n
- Roman numerals: in chapter titles → ([MDCLXVI]+$\W)
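As one example of putting these patterns to work, here is a minimal sketch in base R that drops chapter and book headings from a plain-text file; the file paths are illustrative, and the same pattern can just as easily be used in a text editor’s find-and-replace dialog.

```r
# Remove lines that consist only of a heading like "Chapter One" or "BOOK III".
# File paths are placeholders.
lines <- readLines("corpus-raw/novel.txt", warn = FALSE)

lines <- lines[!grepl("^(chapter|book) \\w+\\s*$", lines, ignore.case = TRUE)]

writeLines(lines, "corpus-cleaned/novel.txt")
```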
A Note Regarding Underscores
The word2vec R package treats a corpus as a “bag of words.” During the model training process, punctuation is largely removed, but _underscores_ are an exception. In texts from Project Gutenberg and several other text transcription projects, underscores are frequently used to mark italics. We advise researchers to remove these underscores because the model treats an underscored word as distinct from the non-underscored version of the same word, even though they are identical from a human reading perspective. When used deliberately, however, this behavior can also be quite useful. If there are words or phrases that you would like to be treated as a single token in the model, joining them with an underscore (e.g. free_trade or queer_liberation) will let you explore phrases in a trained model.
Using this feature in your data manipulation stage is fairly easy. With a “find and replace” feature, search for the words or phrases of interest and replace them with the same phrase using underscores in place of spaces. For more information, here is a helpful exploration by Kavita Ganesan: “How to incorporate phrases into Word2Vec – a text mining approach.”
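Here is a minimal sketch in base R of both moves: stripping the underscores that mark italics and then joining a phrase of interest into a single token. The phrase “free trade” and the file paths are illustrative.

```r
# Strip italics underscores, then join a chosen phrase with an underscore so
# the model treats it as one token. Phrase and paths are placeholders.
text <- readLines("corpus-cleaned/novel.txt", warn = FALSE)

# Remove paired underscores used to mark italics (e.g. _word_ becomes word)
text <- gsub("_([^_]+)_", "\\1", text)

# Join a phrase you want to study as a single token
text <- gsub("free trade", "free_trade", text, ignore.case = TRUE)

writeLines(text, "corpus-cleaned/novel-phrases.txt")
```

Note that the underscore removal runs first, so the underscores you add deliberately are not stripped out again.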
If you are working with TEI data, you can use the markup to have more systematic tokenization of named entities and other features. The WWP has published an “Element Tokenizer” XSLT transformation that tokenizes the content of specified elements in XML documents. You can read more about the development of this transformation in “Experiments in Tokenization for Word Embedding Models,” by Juniper Johnson.
Stage 5: Reflect and Repeat
The final stage of data preparation before training a word embedding model is simply to reflect on the process so far. Regardless of how much you document and plan, new issues with data preparation will arise as you are doing the work. Before you train your first (or twentieth) model, it is useful to think about how you have prepared the data for this step. Are there issues that you could not or did not address in this round of preparation? If so, what are these and why did you choose to leave them as is? Might you need to change them in the future?
Data preparation is iterative; it is tempting to try and make all the changes at once, but slowing the process down to observe, reflect, and explore the data and resulting model is an important step, especially at the beginning of a project. After training your first model, there will likely be new changes or ways to organize your data that you will be interested in exploring. Documenting your preparation process and reflecting on it will help as you move forward to testing your model and exploring the effect of training parameters.