Project Showcase and Discussion

Julia Flanders and Sarah Connell

2019-04-01

Textual Corpora and Computational Text Analysis Project

This assignment is designed for an undergraduate class focused on the ways that digital tools and methods can be used to support diversity, equity, and inclusion. The class is typically a mixture of English and Computer Science majors; this assignment is part of the second major unit in the course, following a text encoding project. The assignment follows an introductory activity in the text analysis unit, in which students experiment with web-based analysis tools to compare two related documents and then write blog posts about them on the class WordPress site. In the word2vec assignment, students are asked to develop research questions, which can be on any topic, and then assemble corpora related to their questions; they learn how to train and query word embedding models using R and RStudio Server, then write up their results in blog posts for the class site. This is a complex assignment and one that requires substantial in-class workshopping, not just on technical skills but also on developing research questions, building corpora, and identifying queries that can help answer the research questions.

Student Project: Gender Representation in Popular Culture

This first example student project examines the ways that gender is represented in popular periodicals. The student constructed two corpora—one from Vogue and one from GQ—to study how periodicals aimed at women and men differ in their language usage and representations of gender. The comparative approach can be particularly productive; it takes more work, because the students need to build two corpora, but it also provides an easier way to see what is distinctive or notable about each corpus.

Lowercasing text and removing punctuation are standard practice, and most projects will also remove metadata, as this project did. Other data preparation decisions can require more thought. For example, this student determined that advertisements, while not part of the primary contents of each magazine, also contribute to the social norms that periodicals produce, and so elected to keep them.
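The basic normalization steps described above can be sketched in a few lines. This is a minimal illustration in Python rather than the R workflow used in the class, and the function name normalize and the sample sentence are invented for the example.

```python
import string

def normalize(text):
    """Lowercase text, replace ASCII punctuation with spaces, and split into tokens."""
    lowered = text.lower()
    # Map every ASCII punctuation character to a space, so that a hyphenated
    # word such as "well-dressed" becomes two tokens instead of one fused token.
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return lowered.translate(table).split()

# Invented sample sentence, not taken from either magazine corpus.
sample = "The Well-Dressed Man: What to Wear, and When."
print(normalize(sample))
# ['the', 'well', 'dressed', 'man', 'what', 'to', 'wear', 'and', 'when']
```

Note that string.punctuation covers only ASCII characters, so typographic quotation marks and dashes would need to be added to the table for a real magazine corpus.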

Queries and findings

The student discusses several queries in her blog post; this example highlights some particularly stark differences between the corpora. From the closest terms for woman, it is clear that Vogue is using more terms focusing on women’s individual identities, while GQ is using terms that connect with the domestic and familial roles that women play. As the student observes, this is not particularly surprising given the audiences of these two periodicals, but it does show how powerful comparative analyses can be with word embedding models.
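A comparison like this one amounts to running the same closest-terms query against two separately trained models and looking at what each list contains that the other does not. The sketch below assumes each model is a simple mapping from words to vectors; the vectors and the placeholder terms are invented for illustration and are not results from the Vogue or GQ models.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest(model, word, n=2):
    """Return the n words whose vectors are most similar to the vector for `word`."""
    target = model[word]
    scored = [(other, cosine(target, vec))
              for other, vec in model.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:n]]

# Invented two-dimensional vectors standing in for two trained models.
corpus_a = {"woman": [0.9, 0.1], "term_one": [0.8, 0.2],
            "term_two": [0.7, 0.1], "term_three": [0.1, 0.9]}
corpus_b = {"woman": [0.1, 0.9], "term_one": [0.9, 0.2],
            "term_two": [0.3, 0.8], "term_three": [0.2, 0.9]}

near_a = closest(corpus_a, "woman")
near_b = closest(corpus_b, "woman")
# Terms that are close to the query word in one corpus but not the other.
print("only in A:", sorted(set(near_a) - set(near_b)))
print("only in B:", sorted(set(near_b) - set(near_a)))
```

Because the two models are trained independently, their vector spaces are not directly comparable; what can be compared is the ranked list of neighbors each model returns for the same query.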

Student Project: Public Perception of COVID-19 Vaccines

In this second project, the student used a single corpus of Tweets to investigate public perceptions of COVID-19 vaccines. Twitter data requires some additional preprocessing to reduce noise, and so the student removed Twitter artifacts such as mentions and links. In this project, we also see some complicated decisions about which language from the corpus is related to the semantics that the project is studying; in this case, the student decided that hashtags are central enough to Twitter discourse that they merited inclusion. She also did some additional preprocessing work by combining key phrases so that they could be treated as single tokens—this step is necessary because it is not possible to query exact multi-word strings with word2vec.
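The Twitter-specific cleaning steps might look something like the following. This is a sketch under assumptions: the phrase list, the function name clean_tweet, and the sample Tweet are all hypothetical, since the student's actual phrase list and code are not reproduced here.

```python
import re

# Hypothetical key phrases to merge into single tokens.
PHRASES = ["side effects", "second dose"]

def clean_tweet(text):
    """Strip links and mentions, keep hashtag words, and join key phrases."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"@\w+", " ", text)          # remove mentions
    text = re.sub(r"#(\w+)", r"\1", text)      # keep the hashtag word, drop the '#'
    text = re.sub(r"[^\w\s]", " ", text)       # remove remaining punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    for phrase in PHRASES:
        # Replace "side effects" with "side_effects" so the model sees one token.
        pattern = rf"\b{re.escape(phrase)}\b"
        text = re.sub(pattern, phrase.replace(" ", "_"), text)
    return text.split()

# Invented example Tweet, not drawn from the student's corpus.
tweet = "@user Got my #Moderna shot, side effects were mild! https://t.co/abc"
print(clean_tweet(tweet))
# ['got', 'my', 'moderna', 'shot', 'side_effects', 'were', 'mild']
```

Links are removed before mentions because a URL can itself contain an @ character, and phrases are joined last so that the underscore survives the punctuation step.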

Queries and findings

The student used vector math to get at the three vaccines’ distinctive associations with side effects. Essentially, this query takes each vaccine and searches for its associations both as distinct from the other two vaccines and in combination with the associations for side effects. For instance, the student looked at the terms that are particular to Moderna and not the other two vaccines when combined with the contexts for side effects. She was able to validate her conclusion that there were more specific physical side effects associated with Moderna by looking at some recent medical studies. This example shows an application of vector math, and also demonstrates the importance of asking students to connect their projects with existing scholarship. Reflecting this need, the assignment asks that each project reference at least one scholarly source.
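The query described above can be read as simple arithmetic on vectors: add the vectors for the vaccine of interest and for side effects, subtract the vectors for the other two vaccines, and then look for the words closest to the result. The following sketch uses invented three-dimensional vectors and placeholder names (vaccine_b, vaccine_c, symptom_x, symptom_y); it uses a plain sum, which may differ from how a particular word2vec package weights the terms.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def combine(model, positive, negative):
    """Add the vectors for `positive` words and subtract those for `negative` words."""
    total = [0.0] * len(next(iter(model.values())))
    for word in positive:
        total = [t + v for t, v in zip(total, model[word])]
    for word in negative:
        total = [t - v for t, v in zip(total, model[word])]
    return total

def closest_to(model, vector, exclude, n=1):
    """Rank all words not in `exclude` by similarity to `vector`."""
    scored = [(w, cosine(vector, v)) for w, v in model.items() if w not in exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:n]]

# Invented toy vectors; the dimensions loosely stand for
# (vaccines in general, one specific vaccine, side effects).
model = {
    "moderna":      [1.0, 0.8, 0.1],
    "vaccine_b":    [1.0, -0.4, 0.1],
    "vaccine_c":    [1.0, -0.4, 0.1],
    "side_effects": [0.1, 0.0, 1.0],
    "symptom_x":    [0.0, 0.7, 0.7],
    "symptom_y":    [0.0, -0.5, 0.8],
    "appointment":  [0.9, 0.0, 0.0],
}

query_words = ["moderna", "side_effects", "vaccine_b", "vaccine_c"]
query = combine(model, ["moderna", "side_effects"], ["vaccine_b", "vaccine_c"])
print(closest_to(model, query, exclude=query_words))
```

Subtracting the other two vaccines pushes the query away from what all three share, so general vaccine vocabulary (here, appointment) drops down the list while terms tied to both the chosen vaccine and side effects rise.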

Ideological differences and terminology

In addition to looking at differences among the three vaccines, this project also examines political impacts on attitudes towards vaccination. One interesting result is in the strong associations between vax and conspiracy theories about the vaccine. Importantly, the student calls attention to the need to examine source materials, since it is impossible to tell from the results whether these associations are coming from those who oppose vaccines, or from others describing what they perceive to be anti-vaccination attitudes. Looking directly at the corpus shows varied results, with some neutral uses of vax to simply describe the experience of being vaccinated, and some Tweets describing the vaccine as an irreversible genetic modification. This is a key strategy for teaching and working with word embedding models: it’s essential to return to the input corpora in order to understand results from the models.
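One lightweight way to return to the input corpus is a keyword-in-context listing, which shows each occurrence of a term with a few words on either side. The sketch below is an assumption about how this could be done, not the student's own method, and the two sample lines are invented placeholders rather than real Tweets.

```python
def keyword_in_context(documents, keyword, window=4):
    """Return each occurrence of `keyword` with up to `window` tokens on either side."""
    hits = []
    for doc in documents:
        tokens = doc.lower().split()
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append(f"{left} [{token}] {right}".strip())
    return hits

# Invented placeholder lines standing in for cleaned Tweets.
documents = [
    "got my vax today and feeling fine",
    "people keep arguing about the vax online",
]
for line in keyword_in_context(documents, "vax", window=3):
    print(line)
```

Reading the hits side by side makes it possible to tell apart the different kinds of usage that the model collapses into a single vector.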

Showcase: Accrediting early modern histories

This project uses a model trained on a corpus of seventeenth-century histories to examine the ways that language around credit was used in discussions about what should constitute history and what should be instead considered fiction or fable. The central questions are: what are the affordances of credit for historians describing their work? And, what can the discourse around credit show about how early modern historians defined their discipline during a period when the pressures of empire and the disruptions of the Civil Wars made national origin stories particularly important, while at the same time the historicity of Britain’s early traditions was being vigorously questioned?

Methods and preliminary results

The corpus for this project comprises 52 texts and 10.9 million words, collected from the EEBO-TCP files.

The words closest to credit in vector space show a very close association with historiographic evaluation in this corpus. Words that tend to be used with credit include ones that connect with discussions of history as a discipline and historical research as a process (historian, authors, authour, historians, pains, antiquity); describe different kinds of historical sources (testimony, relation, account, story, romance, fictions); and evaluate validity or make arguments about the past (reputation, opinion, certainty, truth, shew, disparage, deserves, impartiality, critiques).

Corpus analysis

Despite the strong associations between credit and historiographic validity, it is possible to find terms related to the commercial connotations of credit using vector math. In fact, the more financial connotations of credit are very much present in this collection of seventeenth-century histories, and even seem to be part of the appeal of credit as an evaluative framework.

Further exploration

These methods can also be used to investigate particular historians and historical figures. For example, the twelfth-century historian Geoffrey of Monmouth (who was widely recognized as the source for most of Britain’s Arthurian traditions and who was equally widely criticized for his inclusion of fictional materials) was particularly troubling for some British historians because his credit was so thoroughly bound up with that of King Arthur, who most felt deserved a more reliable chronicler. This connection is quite evident in the model, which shows numerous variations of Geoffrey’s name when Arthur and credit are queried together.

Scoping analysis

In fact, the suspect credit available to Geoffrey of Monmouth is such a pervasive concern for the authors in this corpus that a simple query for one common spelling of his first name returns, in addition to variant spellings of Geoffrey, terms specific to this historian and to debates about his historicity. These include two other historians, Giraldus Cambrensis and John Westhampstead, who had written skeptically about Geoffrey’s work. Arthur is closely related to Geoffrey as well, again indicating the very close connection between historian and historical figures.

Showcase: Archival descriptions of LGBT Collections

This project explores finding aids of LGBTQ archive collections, focusing on the relationship between archival description and identity. Using digital humanities methods (including word vector models), it explores the semantics of description and structured data as they relate to use, access, discoverability, and representation. Central questions for this project include: what is the semantic relationship between formal data structure and archival description? How do controlled vocabularies of structured data appear in computational analysis? How is identity described in structured data and how do we access this information?

Corpus building

This corpus contains 304 finding aids (1.6 million words) collected from ArchiveGrid (a digital repository of over five million archival descriptions) using the keyword search lgbt OR queer. The finding aids come from the four archives that returned the most records for this search. As documents, finding aids vary greatly in form and description level. Across these differences, finding aids contain important metadata about archival collections: titles, creators, physical descriptions, abstracts, biographical or historical notes, scope and contents, and subject headings/indexing terms.

Corpus analysis

All the files in this corpus are plain text, but their structures are an integral part of the data itself. A word cloud generated with Voyant Tools makes it easy to see that the most frequent words in this corpus are box and folder. Box appears over 79,000 times and folder almost 65,000 times. Combined, these two words account for roughly 144,000 tokens, or 8.6% of the entire corpus word count. Given that structural or formal semantics are such a large part of this corpus, where do they manifest in computational analyses like word vector models?
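The same kind of count can be checked outside Voyant Tools. As a minimal sketch, the function below counts how often a set of words occurs and what share of all tokens they represent; the function name share_of_corpus and the ten-token sample are invented for illustration.

```python
from collections import Counter

def share_of_corpus(tokens, words):
    """Return the combined count of `words` and their share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    combined = sum(counts[w] for w in words)
    return combined, combined / total

# Invented ten-token sample imitating a finding aid's container list.
tokens = "box 1 folder 3 correspondence box 2 folder 1 photographs".split()
combined, share = share_of_corpus(tokens, ["box", "folder"])
print(combined, f"{share:.1%}")  # 4 40.0%
```

Applied to the full corpus, the same calculation would show how much of the text consists of structural vocabulary before any model is trained.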

Results: Clusters

Clustering is an interesting way to read a word vector model. These three clusters show words closely associated with each other in vector space and each is a distinct part of this corpus. The first cluster is about material descriptions, including the types of documents, items, and artifacts that are in these collections. The second cluster is about use and access, including words that describe or define the processes of archiving, using, and accessing collections. The third cluster deals with subjects in the collections, including much of the related terminology for subject headings or keywords that are used to define or describe these collections. Each cluster, to some extent, is indicative of the different structures and contents of this type of data.
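Clustering groups words whose vectors sit near each other. The source does not say which algorithm produced these clusters; k-means is one common choice, and the sketch below assumes it. The words and two-dimensional vectors are invented to echo the three themes described above and are not taken from the actual model.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(model, seeds, iterations=10):
    """Cluster words with k-means, starting from the vectors of the `seeds` words."""
    centroids = [model[s] for s in seeds]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        # Assignment step: each word joins the cluster with the nearest centroid.
        for word, vec in model.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(vec, centroids[i]))
            clusters[nearest].append(word)
        # Update step: move each centroid to the mean of its members,
        # keeping the old centroid if a cluster ended up empty.
        centroids = [mean([model[w] for w in members]) if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return clusters

# Invented toy vectors, one pair of words per theme.
model = {
    "photographs": [0.9, 0.1], "correspondence": [0.8, 0.2],
    "access": [0.1, 0.9], "restrictions": [0.2, 0.8],
    "activism": [-0.8, -0.7], "community": [-0.9, -0.8],
}
for cluster in kmeans(model, seeds=["photographs", "access", "activism"]):
    print(cluster)
```

Seeding the clusters with chosen words keeps this example deterministic; library implementations typically pick starting points at random, so cluster contents can vary between runs.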

Results: Queries

This corpus was originally generated with the keywords lgbt and queer on ArchiveGrid, and their corresponding vectors are an interesting way to parse the relationship of these words throughout the finding aids. Being the broader descriptive term, lgbt is related to a larger variety of words that includes not just identity labels like transgender and intersex, but also words relating to the use of and access to archival materials. For instance, this vector includes educational, promotes, and highlights, all of which are verbs and adjectives relating to use and not just people. However, queer is a term with a much more varied semantic past and present, having been (and continuing to be, for some) a derogatory word as well as an identity label. Some of this history and context is reflected in the list of related words, especially with asexual, intersex, questioning, and assimilation. The distinctive contexts for queer are also evident in its closer connection with lgbtqia than is the case for lgbt.

Future questions and exploration

As this project progresses, the next step is to create a larger dataset of LGBT finding aids from across the country, paying attention to institutional differences. In the end, by using a variety of digital humanities methods (including other forms of textual and computational analysis), this project seeks to study how these tools may help scholars better understand and respond to the social implications of data structure and classification within archival materials.

Analogies in vector space

Now let’s walk through a slightly more complex use of word embedding models for research: how do analogies work? Can we outline how Heuser applies word2vec to his research questions, and work through the logic of the explanatory analogy? As we’ve seen already, projects like this bring together a corpus and a research question, with a method and some particular queries; can we identify all of these for Heuser’s work?
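As a starting point for that discussion, the arithmetic behind an analogy query can be sketched with a toy model. To answer "a is to b as c is to ?", we compute the vector b - a + c and look for the nearest word that was not part of the query. The example below uses the widely cited man, king, woman, queen illustration with invented three-dimensional vectors; it is not Heuser's query or data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(model, a, b, c, n=1):
    """Answer 'a is to b as c is to ?' by ranking words near the vector b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(model[a], model[b], model[c])]
    scored = [(w, cosine(target, v)) for w, v in model.items() if w not in (a, b, c)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:n]]

# Invented toy vectors; the dimensions loosely stand for (male, female, royal).
model = {
    "man":    [1.0, 0.0, 0.0],
    "woman":  [0.0, 1.0, 0.0],
    "king":   [1.0, 0.0, 1.0],
    "queen":  [0.0, 1.0, 1.0],
    "castle": [0.0, 0.0, 1.0],
}
print(analogy(model, "man", "king", "woman"))  # ['queen']
```

The query words themselves are excluded from the results because they are usually very close to the combined vector and would otherwise crowd out the answer.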

Analyzing results and testing queries

Can we characterize the kind of research project that Heuser is pursuing here (thinking about the conclusions he actually draws)? Do you find these conclusions persuasive, or are there any other explanations that you can think of for the presence of genius in this list? What other evidence would you like to see to support this conclusion? What other queries would you want to try? Since we have a model trained on ECCO, using the same parameters that Heuser did, we can investigate these queries directly—what do we find? What do we see with other models?

Try some queries of your own

When we try the same query in a model trained on the full corpus of WWO, we also see genius fairly high on the list, although the overall cosine similarities are lower—note that there are 84 million words in the ECCO corpus that Heuser used, and about 12 million words in Women Writers Online.

What do we make of this? What additional information would we need to have before we could draw any conclusions? What other queries could we try?

Discussion

What kinds of arguments can we make?

What kinds of arguments can’t we make?

What kinds of data and supporting evidence are necessary?