One of the interesting intersections of digital humanities scholarship and library science is the use and description of archival records. With the increase of digital materials and information comes a proliferation of digital records and corresponding data. As scholars continue to understand the ways text as data can be conceptualized, defined, quantified, and visualized in the humanities, textual records and metadata are similar topics of recent scholarship. In “The Reconfiguration of the Archive as Data to Be Mined,” Michael Moss, David Thomas, and Tim Gollins argue that “the enormous scale of digital records has changed the way scholarly resources are read,” so that the increasing focus on volume means “that, in [the] future, archives will no longer be conceived of as collections of texts, but as data to be made sense of” (120). Scholars are making sense of archives and collections as data in a number of promising ways, including usage studies, data archaeology and preservation, and inclusive cataloging practices.
Within many of these threads of critical inquiry, scholars are rethinking and reconceptualizing library and archival description as not just a form of metadata, but human-made data where historic, racial, cultural, and social biases and attitudes are marked and encoded. As part of my ongoing project for Northeastern University’s Digital Humanities graduate certificate, I am exploring the relationship between archival description and identity in LGBT collections through digital humanities methods, focusing on use, access, discoverability, and representation. In this preliminary stage, I am using word vector modeling with my corpus of EAD finding aids of LGBT collections to ask: what is the semantic relationship between formal data structure and archival description? How do controlled vocabularies of structured data show up in computational analysis? How is identity described in structured data and how do we access this information?
Before beginning any form of analysis (computational, textual, or otherwise), one of the largest and most important challenges is understanding the structure and content of your data. While the question of defining and understanding data in digital humanities is still a site of ongoing and important scholarship, having a basic understanding of the scope, shape, and depth of your data is helpful regardless of your analytical methodology.1 To begin thinking critically and theoretically about description for LGBT collections, I am using a corpus built with ArchiveGrid, a digital collection of over five million archival material descriptions supported by OCLC Research. Created to improve the discovery of archival materials, ArchiveGrid has an easy-to-use interface that organizes keyword results by category: people, groups, places, archives, archive locations, and topics. My primary corpus was created by identifying finding aids with the keyword search “lgbt OR queer.”2 I then used the summary view to organize the records by archive and gathered finding aids from the four institutions with the largest number of results for this search.
My data is organized this way because of an important critical consideration. LGBTQ collections exist primarily due to the tireless work of researchers, archivists, activists, and community members in the 1970s, 80s, and 90s, who helped to document the LGBTQ community across the United States (Chenier 172). Thus, LGBTQ archival material was largely a community effort centered in local nonprofits and organizations. As interest in LGBTQ history grows, these collections and materials have become increasingly institutionalized, moving from community and local organizations to large universities and centralized institutes like major libraries and museums. This shift raises important theoretical and practical concerns: what does this transition mean for the use and discoverability of these materials? This is another consideration that I hope to explore in my project and, in this preliminary stage, one that I have operationalized by gathering records from four archives and institutions, two of which are dedicated to collecting and preserving LGBT history and two of which are not. These include: ONE National Gay and Lesbian Archive; the Gay, Lesbian, Bisexual, Transgender Historical Society; the New York Public Library; and the Cornell University Division of Rare and Manuscript Collections.
In total, this corpus contains over 1.6 million words in 304 finding aids. As documents, finding aids vary greatly in form and description level: some are encoded in XML using the Encoded Archival Description (EAD) standard, some are simply PDFs, and some contain description down to the individual folder or container. Just as collections differ in size, the length and depth of description in finding aids also differ widely. For example, the longest finding aid in this corpus is approximately 225,000 words3 and the shortest has 137 words.4 Across these differences, finding aids contain important metadata about archival collections: titles, creators, physical descriptions, abstracts, biographical or historical notes, scope and contents, and subject headings/indexing terms. While other datasets might organize information by categories and in spreadsheets, this metadata is the internal data structure for finding aids in my corpus (see Figure 1). To maintain some degree of homogeneity in this corpus across institutional records, I collected the material that was the most reproduced throughout all of the collections when there was any discrepancy.5
However, even with these parameters for data collection, the formal structure of the data is easy to observe throughout this corpus. For example, some of the most common words in finding aids appear at the item and container level of collection description: words like “box” and “folder” that label the physical containers in which materials are housed. In the finding aids that contain this information, it is typically arranged as a container number followed by a brief description of the materials inside (see Figure 2). These words have the highest frequency across the corpus, with “box” appearing a total of 79,344 times and “folder” 64,984 times. Together they account for 144,328 words, or 8.6% of the total word count of this corpus (for a visualization of this, see Figure 3). Given that structural or formal semantics is such a large part of this corpus, where does this show up in computational analysis like word vector models? And when it does, what does it mean and how can we best try to understand it?
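The frequency figures above can be checked with simple arithmetic, and the same kind of count can be reproduced on any plain-text corpus. The following is a minimal Python sketch, not the code used for this project: the function name `token_share`, the tokenizing pattern, and the sample string are all assumptions made for illustration.

```python
import re
from collections import Counter

def token_share(text, targets):
    """Return {word: (count, share of all tokens)} for each target word."""
    # Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {word: (0, 0.0) for word in targets}
    return {word: (counts[word], counts[word] / total) for word in targets}

# Invented sample in the style of a container list (not taken from the corpus).
sample = "Box 1 Folder 1 Correspondence Box 1 Folder 2 Clippings"
shares = token_share(sample, ["box", "folder"])

# The two counts reported in the text, added together.
combined = 79_344 + 64_984  # 144328
```

In the invented sample, container labels make up most of the tokens, which is the same pattern, in miniature, that the corpus-wide counts point to.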
Word embedding and vector space models are forms of computational analysis that use a spatial framework to study the relationships between words in an entire corpus of texts. That is, they calculate the spatial proximity and relationality of words across an entire corpus, which can then be explored through different access points and configurations like clusters and single word queries. Following the data preparation process for the first Women Writers Project (WWP) Institute, I used the R package “wordVectors” by Ben Schmidt and Jian Li to train a model on my combined corpus of finding aids. For this model, I used the settings that the WWP found worked best in creating its word2vec models and interface: 100 dimensions, a window size of 6, 10 iterations, and 15 negative samples.6 When training a model, word2vec operates on the level of the corpus as a whole; if there is any internal structure in your data (like my examples of “box” and “folder”), these features can greatly influence the model. On the other hand, word vector models can be a good way to look at the internal structures in data. Depending on what your research interests are and how you plan to use word vector models, internal data structures may be something to remove or to keep.
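To see why container labels can influence a model so strongly, it may help to look at what a context window collects. The sketch below is a simplified Python illustration of the window idea only; it is not the wordVectors package's code, and actual word2vec training involves further steps (such as negative sampling) that are left out. The function name `context_windows`, the sample tokens, and the small window of 2 (rather than the 6 used for the model) are assumptions chosen to keep the example readable.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        yield target, left + right

# Invented container-list line (not taken from the corpus).
tokens = "box 1 folder 3 correspondence and clippings".split()
windows = list(context_windows(tokens, window=2))

for target, context in windows:
    print(target, "->", context)
```

Here the descriptive word “correspondence” has “folder” and “3” in its context, so label words and numbers become part of what the model learns about nearby descriptive terms.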
To illustrate this dichotomy and the effects of data preparation on word vector models, I used regular expressions to remove significant instances of “box” and “folder” in my corpus. Here I define a significant instance as a container-level description where these words are labels for physical containers of archival materials. While I was unable to remove all instances, I focused on looking for instances of “box” and “folder” with numbers, not words. I wanted to keep the textual description but remove the container labels.7 For all of the other results that I am sharing, I used my original corpus without this form of data cleaning. While it is usually preferable to remove metadata, I chose to begin by exploring my full dataset because the internal metadata structures are so inherent and important to how these documents function. Even so, the top ten related words for “box” and “folder” before and after cleaning my corpus make for an instructive comparison.
Before cleaning my corpus:
V(box): box, folder, 29.b, 43, 49, 48, 29a, 44, 46, 38
V(folder): folder, box, 29b, 29a, 38, 49, 43, 48, 2, 1
After cleaning my corpus:
V(box): box, folder, oversize, clamshell, preferred, artifact, citation, flat, cartons, boxes
V(folder): folder, citation, preferred, box, name, oversize, artifact, identification, kanemoto, or
Before removing these two words, query results for both terms showed folder and box numbers and almost nothing else. Querying these words once the corpus had been cleaned shows much more specific results, with descriptive words about the contents and items in these containers. As this example demonstrates, understanding your research question when using computational analysis is just as important as understanding your methods of data preparation: both can have a significant effect on how you can create or interpret your data. Now that I have seen promising results from removing data headings, I am interested to see if this applies to words other than “box” and “folder.”
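The cleaning step described above could be sketched along the following lines. This is a hedged Python illustration rather than the regular expressions actually used for this project, which are not reproduced here: the single pattern is an assumption built to cover the label variants listed in note 7 (Box 1, Box: 1, Box 1/20, Box: 1a, Box::2a), and the function name `strip_container_labels` and the sample lines are hypothetical. It also assumes that the whole label (word plus number) is removed.

```python
import re

# Assumed pattern: "box" or "folder", optional colons and spaces, a number,
# an optional single letter suffix, and an optional "/total" part.
CONTAINER_LABEL = re.compile(
    r"\b(?:box|folder)\s*:*\s*\d+[a-z]?(?:/\d+)?\b",
    re.IGNORECASE,
)

def strip_container_labels(text):
    """Remove numbered box/folder labels but keep the words in descriptions."""
    cleaned = CONTAINER_LABEL.sub(" ", text)
    # Collapse the runs of spaces left behind by the removed labels.
    return re.sub(r"[ \t]+", " ", cleaned).strip()

# Invented examples (not taken from the corpus).
print(strip_container_labels("Box 1 Folder 3 Correspondence, 1985"))
print(strip_container_labels("Box::2a Clippings in a clamshell box"))
```

Because the pattern requires a number, a phrase like “clamshell box” is left alone, which matches the goal of keeping textual description while removing container labels. A single pattern like this would still miss other label formats, so some instances would remain, as they did in the cleaned corpus.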
While querying cosine similarities for individual terms is an interesting way to look at the data structure, one of the ways you can further explore your corpus is by looking at clusters. K-means clusters are a way to “read” word vector models by showing words that are closely associated with each other in vector space.
Cluster : includes, materials, publications, other, clippings, notes, material, articles, letters, miscellaneous, various, journal, manuscripts, reviews, newspaper
Cluster : access, permission, must, copyright, conditions, use, governing, researchers, publish, given, requests, restrictions, owner, submitted, behalf
Cluster : gay, lesbian, subject, women, issues, men, lgbt, bisexual, political, publisher, queer, liberation, transgender, politics, movement
Returning to my full dataset, the three clusters I have selected each show a distinct “facet” of this corpus. Thomas Padilla and Devin Higgins use “facets” to highlight how data “serves purposes aside from reading and viewing,” including substance, use, access, and affordances (332). The first cluster is a facet about material type in descriptions, including words describing or identifying documents, items, and artifacts that are in these collections. The second cluster is a facet about use and access, including words that describe or define the process of archiving, using, and accessing collections. The third cluster is a facet dealing with subjects in the collections, including much of the related terminology for subject headings or keywords that are used to define or describe these collections. Each cluster is indicative of different “facets” and concerns of this type of data, both in content and in structure.
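For readers unfamiliar with how clusters like those above are formed, here is a minimal k-means sketch in Python. It is an illustration of the general technique under stated assumptions, not the clustering code used for this project: the two-dimensional vectors are invented (a trained model here would have 100 dimensions), the starting centroids are fixed by hand so the result is repeatable, and the function name `kmeans` is hypothetical.

```python
import math

def kmeans(vectors, initial_centroids, iterations=10):
    """Group words by nearest centroid, recomputing centroids on each pass."""
    centroids = [list(c) for c in initial_centroids]  # copy; inputs stay untouched
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each word joins its nearest centroid (Euclidean distance).
        for word, vec in vectors.items():
            assignment[word] = min(
                range(len(centroids)),
                key=lambda i: math.dist(vec, centroids[i]),
            )
        # Update step: move each centroid to the mean of its member vectors.
        for i in range(len(centroids)):
            members = [vectors[w] for w, c in assignment.items() if c == i]
            if members:
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment

# Invented coordinates for illustration only.
toy = {
    "letters": [0.9, 0.1],
    "clippings": [0.8, 0.2],
    "access": [0.1, 0.9],
    "copyright": [0.2, 0.8],
}
clusters = kmeans(toy, initial_centroids=[[1.0, 0.0], [0.0, 1.0]])
print(clusters)
```

In this toy case the material-type words fall into one group and the access words into another. With real vectors, the number of clusters and the starting centroids are choices that can change which groupings appear.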
In this preliminary stage of my research, I am interested in using word vector models to generate new questions and possibilities for developing both my corpus and research methods. When I first generated this primary corpus with ArchiveGrid, I used only one query: “lgbt OR queer.” This interface organizes information about finding aids by keywords, so I wanted to investigate how searching for words closest to these keywords might generate new keywords or uncover semantic relationships.
V(lgbt): lgbt (1.0), broader (0.64), groups (0.62), topics (0.61), highlighting (0.61), individuals (0.60), transgender (0.60), highlight (0.59), queer (0.59), jewish (0.58), intersexuality (0.57), related (0.57), lgbtq (0.57), intersex (0.57), media’s (0.56), promotes (0.56), educational (0.56), recreational (0.56), audiences (0.56), primarily (0.56)
V(queer): queer (1.0), nation (0.69), asexual (0.65), intersex (0.62), questioning (0.59), ally (0.59), assimilation (0.59), lgbt (0.59), multicultural (0.57), ubiquitous (0.57), cybercenter (0.57), emerged (0.57), cultural (0.57), stages (0.56), cultures (0.56), jqcc (0.55), separatists (0.54), polyamory (0.54), lgbtqia (0.54), geek (0.54)
As a broader descriptive term, “lgbt” is more closely related to a wider variety of words, including not just identity labels like “transgender” and “intersex” but also words that mark usage like “highlight,” “promotes,” and “educational.” However, “queer” is a term that has a much different and more varied semantic past and present, having been (and continuing to be, for some) a derogatory word as well as an accepted and widely used identity label. Some of this history and the question of terminology are potentially reflected in this list of related words, especially with “asexual,” “intersex,” “questioning,” and “assimilation.” Many of these results matched my expectations based on how I created this corpus, but there were a few surprising results that led to even further questions: Why is “ubiquitous” related to “queer”? What should I make of the relationship between “queer” and “assimilation” or “queer” and “ally”? What information is left out and what does this say about my larger corpus?
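The scores in the lists above are cosine similarities, which is why each query word matches itself at 1.0. As a rough illustration of what such a query computes, here is a small Python sketch. It assumes a model represented as a plain dictionary of word vectors; the vectors are invented, and the function names `cosine_similarity` and `closest_to` are hypothetical rather than calls from any particular package.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_to(model, query, n=3):
    """Return the n words whose vectors are most similar to the query word's."""
    query_vec = model[query]
    scored = [(word, cosine_similarity(query_vec, vec)) for word, vec in model.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Invented two-dimensional vectors for illustration only.
toy_model = {
    "lgbt": [1.0, 0.2],
    "transgender": [0.9, 0.3],
    "box": [0.0, 1.0],
}
neighbors = closest_to(toy_model, "lgbt")
for word, score in neighbors:
    print(f"{word} ({score:.2f})")
```

Because cosine similarity measures direction rather than length, a word can score as close to a query term simply by appearing in similar contexts, which is one reason unexpected neighbors like those noted above deserve a closer look at the underlying texts.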
As I move forward with my project, I will continue to add to my corpus and pay attention to the semantics of description, data structure, and access. One of the significant threads of discourse surrounding archival description and library cataloging standards relates to particular concerns within queer theory about gender, sexuality, and identity. As scholars think about ways to “queer” these spaces and practices to allow for more inclusive ways to record and describe materials, the challenge is both to identify instances of outdated and harmful language and to create solutions within these existing systems of encoding data.8 Beyond whether or not the language of description is inclusive or exclusive, standards create fixed, stable classifications that (while useful in organizing a large amount of data) can create problems for points of entry, access, and representation.
In their essay “What’s Gender Got to Do with It? A Critique of RDA 9.7,” Amber Billey, Emily Drabinski, and K.R. Roberto explore the implications of new cataloging standards. They write: “the adoption of an identity category often says more about the category and its context than it does about the people who take on that identity category. For queer theorists, gender and sex are always negotiated and socially constituted; fixing them, as RDA asks catalogers to do, denies the shifting and contextual nature of gender identities” (414). While data structure and replicability are important, how do you balance the need for homogeneity with the potential for inclusive, even recuperative, description?
This issue is not just about social justice, empathy, and access, but is also about usage. If the goal is to increase use, engagement, and discoverability of these important resources, cataloging and describing materials with current, community-approved (or even community-generated) taxonomies is essential. As scholars continue to think of archives and collections as data, we must consider the implications of structure for both machines and humans. How can word vector models (and other forms of computational analysis) be used to “discover” and locate problematic archival description? What are the identities that are being overlooked? What are the complexities of identities that language is erasing or replacing in archival collections? Is there a difference in description semantics based on who is doing the describing (community- or institution-based catalogers)? How can such an analysis help create recuperative and empathy-based description practices and standards? In the end, I am hoping to continue using computational analysis methods like word vector models to move beyond the “box” (that is, the structural semantics of finding aid descriptions) and better understand how these words relate to the material and topics within it.
Billey, Amber, Emily Drabinski, and K.R. Roberto. “What’s Gender Got to Do with It? A Critique of RDA 9.7.” Cataloging & Classification Quarterly, vol. 52, no. 4, 2014, pp. 412-421. https://doi.org/10.1080/01639374.2014.882465
Chenier, Elise. “Reclaiming the Lesbian Archives.” The Oral History Review, vol. 43, no. 1, 2016, pp. 170-182. https://doi.org/10.1093/ohr/ohw025
Moss, Michael, David Thomas, and Tim Gollins. “The Reconfiguration of the Archive as Data to Be Mined.” Archivaria, no. 86, 2019, pp. 118-151. https://muse.jhu.edu/article/711160
Padilla, Thomas, and Devin Higgins. “Library Collections as Humanities Data: The Facet Effect.” Public Services Quarterly, vol. 10, no. 4, 2014, pp. 324-335. https://doi.org/10.1080/15228959.2014.963780
1. For more information regarding the shape of data in digital humanities, I recommend Christof Schöch’s “Big? Smart? Clean? Messy? Data in the Humanities.”
2. I chose “lgbt OR queer” for my search in hopes of broadening my results. I chose this terminology based on my experience of working with and searching for LGBT collections. However, after doing more exploration on ArchiveGrid, I quickly realized that the keyword “gay” generated more results, stemming from “gay” being a far more common umbrella term than “lgbt” or “queer.” As I add to this corpus, I will likely change my search terms to get a larger sample size, then analyze the specific language in description and classification, or use terms that will give me less granularity. Additionally, the word “queer” has a wide range of historic denotations and connotations, some relating to LGBTQ individuals and topics and others not. Using “queer” as a search term led to the inclusion of texts that had nothing to do with LGBTQ issues or cataloging.
3. This finding aid is a guide to the Ford Madox Ford Collection, housed at Cornell University Division of Rare and Manuscript Collections. This collection is not LGBT-related but was included as a result because of the use of “queer” at item-level description for letters where the first line of the text was the folder title. Refining or broadening search terms is important because it brings attention to the ways that language changes and how it affects both discoverability and classification with historic materials.
4. This finding aid is a guide to the International Gay Information Center Collection (publication and typescripts) at the New York Public Library.
5. Maintaining similar data structures for finding aids across an institution is time- and resource-intensive, so it is common to see finding aids with different levels of information, including container lists or not, depending on when they were encoded. The finding aids from the New York Public Library differed greatly in this matter, so for control purposes, I copied only the collection-level description, leaving out the container lists.
6. These specific dimensions were tested by the Women Writers Project for the WWO corpus. For explanation: 100 dimensions means that the model will reduce the much higher potential dimensionality of all the relationships between words in the corpus to 100 dimensions. I set the model to 10 iterations, or 10 passes through the corpus. The window size indicates how many words on either side of the target word are considered context in training the model, or how closely (or loosely) related they are. The negative sample indicates a value of non-context words that, when the model is training, are updated in each iteration; this is used to reduce the computational burden of the model training process. For more information about these specific parameters and terms, I recommend the WWP Word Vector Toolkit glossary and methodology pages.
7. One of the challenges of removing structural data from plain text documents is the variety of ways these structures are marked. For instance, while “box” is a common label, there are many configurations of it: Box 1, Box: 1, Box 1/20, Box: 1a, Box::2a. Using a few different regular expressions, I was able to remove 78,514 instances of “box” with only 1,317 instances remaining. For “folder,” I was able to remove 64,701 instances and only 2,029 remain.
8. For more information about this topic, I suggest Emily Drabinski’s “Queering the Catalog: Queer Theory and the Politics of Correction” (The Library Quarterly: Information, Community, Policy, vol. 83, no. 2, 2013, pp. 94-111), Kara Keeling’s “Queer OS” (Cinema Journal, vol. 52, no. 2, Winter 2014, pp. 152-157), and Out of the Closet, Into the Archives: Researching Sexual Histories (edited by Amy L. Stone and Jaime Cantrell, State University of New York Press, 2015).