By Becky Standard
In July 2019 I was selected to attend the workshop “Word Vectors for the Thoughtful Humanist” expertly hosted by the Women Writers Project (WWP) at Northeastern University. Funding from the National Endowment for the Humanities made it possible for me to travel to Boston for this three-day event to explore how word-embedding models might reveal characteristics of a hand-picked selection of texts. Although I had a smattering of experience with other digital humanities techniques, word vectors and word-embedding models were especially intriguing because they promised analysis at scale on massive corpora without overwhelming prep work or computational power beyond a desktop machine.
Building the Corpus
The motivating resource for my project was the William B. Cairns Collection of American Women Writers, 1650–1940. Held in Special Collections at the University of Wisconsin–Madison Libraries, the archive includes letters, diaries, travel journals, and ephemera but primarily novels by more than 2000 women. Works by Emily Dickinson, Louisa May Alcott, Harriet Beecher Stowe, Kate Chopin, Mary E. Wilkins Freeman, and authors of similar renown were gathered comprehensively, but its lesser-known novelists from the late nineteenth and early twentieth centuries interested me most. The clearinghouse for locating women’s digital works is Mary Mark Ockerbloom’s massive A Celebration of Women Writers at the University of Pennsylvania, and it led me to three troves of proofread material.
The Library of Southern Literature collection in Documenting the American South (better known as DocSouth) held at the University of North Carolina at Chapel Hill has a concentration of famous authors, autobiographies, narratives by enslaved people, and works for children, so I gathered just eight novels. More expansive for my purposes was the Wright American Fiction project at Indiana University: 2887 titles covering 1851 to 1875, which provided 149. Project Gutenberg was the source for 89 texts. At submission to the workshop, my corpus totaled 246 novels, short novels, novellas, and novelettes by 107 different writers. By matching electronic versions to their printed counterparts, I found that the shortest text was The Voice, published by Margaret Deland in 1912, which clocked in at 12,494 words (approximately 84 printed pages), and by far the longest was the hefty 708-page Ancestors by Gertrude Atherton from 1907. At 301,020 words it beat its nearest competitor by 21,369 words, more than a single Voice. The oldest novel was 1839’s Autumn Hours, and Fireside Reading by Caroline M. Kirkland (writing as Mrs. C. M. Kirkland), although it was not published until 1854. The most recent was Gertrude Atherton’s Black Oxen from 1923.
Naturally, the three digital sources and the Cairns Collection have differing selection criteria. The Cairns Collection focuses on “the literary record of women writers” because “the voices of women in American literary history reflect visions and styles as diverse as their experiences” (Special Collections). Its authors are based in the United States during a set timeframe, but it has no restrictions on genre, length, or impact. Project Gutenberg is a generalist resource that exists simply to “encourage the creation and distribution of eBooks” (“About Project Gutenberg”), and the selection is decentralized since volunteers submit their personal choices. The Library of Southern Literature has its selection bias in its name, and it further limits its contents to “Dr. Robert Bain’s bibliography of the hundred most important southern literary works” with expansions suggested by Joseph M. Flora and William L. Andrews for coverage stretching from the seventeenth century to 1924 (“Library of Southern Literature”). “Most important” would exclude nearly all writers targeted for my corpus. The Wright American Fiction project uses a three-volume compilation by Lyle H. Wright at the Huntington Library thought to be “the most comprehensive bibliography of American adult fiction of the 18th and 19th centuries” (“Wright American Fiction”). How comprehensive? No Sui Sin Far/Edith Maude Eaton, but would she have been considered English or Canadian, even though she wrote the bulk of her fiction in the United States, or a twentieth-century figure? No Harriet Ann Jacobs, but would her fictionalized autobiography have counted? No Frances Watkins Harper, although Iola Leroy was successful and Harriet Wilson’s Our Nig is included. 
And “comprehensive” can mean a starting point of all works that were widely distributed, mass produced, advertised broadly, published and circulated in major metropolitan areas, written in English, brought to Wright’s attention, or predominantly written by the white and well educated. I have not yet checked to see how many of the authors in the Cairns Collection were included by Wright, but that examination could be instructive. The Cairns Collection too is hampered by funding levels and limited opportunities for new acquisitions.
My corpus appears to be reasonably representative of digital texts available from Project Gutenberg, DocSouth, and Wright American Fiction by lesser-known novelists in the Cairns Collection. Many writers have a single work, and I tried not to include more than one novel from a series. The prolific Ann S. Stephens is represented by twenty — that’s entirely too many, but I heavily downloaded her work toward the end of compiling the texts just in case I wasn’t going to hit the word target. My cleaned corpus has 23,279,229 words (over the twenty million suggested limit), so I could have skipped a few of her offerings after all.
Genre novels greatly affect the content of this corpus. Romance tales and religious fiction are common. Many carry their prime character’s name — Hope Mills, Kate Danton, Miss McDonald, Dora Darling — as their title and focus on a marriage plot, a daughter’s role, or a secret preventing said marriage or conflicting with said daughter’s role. There are detective stories, works about the Civil War, historical novels, temperance tracts, pulp fiction, and murder mysteries. Later works turn toward independent women: Gertrude Van Deusen in Helen M. Winslow’s A Woman for a Mayor from 1909; Shirley Hollister, a stenographer who rehabs a barn, installs her ailing mother and brother there, and then transcribes a conversation that thwarts a business deal in 1917’s The Enchanted Barn by Grace Livingston Hill; and Priscilla Glenn, who escapes her village to become a nurse in The Place Beyond the Winds by Harriet T. Comstock from 1914. This focus on occupations outside the domestic sphere is what grabbed my attention.
My plan was to compare work roles and descriptions about workers across time, corpora, and gender. This was intended to be pure discovery: no sophisticated research questions, nuanced analysis, or narrow scope. What did the texts in this large corpus have to say about work and gender? From A Dictionary of Employments Open to Women from 1898 (British but applicable) I had relevant job terms to pursue (maybe not insect setter or inspector of baby farms), and I added a few from Harvard Library’s Women Working, 1800–1930 that would not have been among my first thoughts but could be in my corpus: musician, book agent, sales rep.
After arriving at the workshop and getting an overview of word vectors, cosine similarity, clustering, the word2vec process, and other word-embedding concepts, we were let loose with the Women Writers Vector Toolkit to experiment with queries. Not knowing much about my corpus, I dove into analogies to isolate woman-associated terms from man-associated terms. Some simple queries using man – woman and woman – man located gendered word clusters for work, labor, advancement, and career. For women, work, according to cosine similarities, was closest to domestic tasks through words including sewing, crochet, washing, housekeeping, and spinning. For men, the results were broader and perhaps closer to some concepts we might think of today: business, tools, labor, and brains.
Labor generated toil, drudgery, and daily but also potentially value-charged terms such as industry, discipline, and usefulness for women. The cluster for men also had toil and usefulness but included profession, promotion, and advancement. Advancement had a tumble of results (subversive for men was interesting), and career for women was associated with destiny, scheme, and imaginings (having a career only in their dreams or by scheming?) whereas enterprise, collegiate, success, reputation, and manhood popped up for men. These positive terms seem to tie a career to the concept of “manhood.”
Next I tried some words related to specific types of work: factory, cook, and cotton. The gendered divide led to loom, spinning, cottage, and seamstress for the woman-associated side of factory. But results for men were more general: workmen, carpenter, city, and office. Gender also affected the types and locations of work. Although cook brought the expected maid, housekeeper, housemaid, and nurse for woman, man yielded a surprising group: scrimp, guzzling, toddy, rum, and whiskey. The temperance novels are clearly influencing some of these results (variants of guzzling include “guzzling the clear liquor as if it were so much water” from Mary Schuyler Crowninshield’s Latitude 19° [publishing as Mrs. Schuyler Crowninshield]). For cotton the woman-associated terms were calico, linen, striped, and silk. Instead of focusing on garments, fashion, or the domestic realm, man offered words from the business and manufacturing side: merchants, stock, and goods.
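For readers who have not seen this kind of query spelled out, here is a minimal sketch of one common way to pose it: add the difference between two gender terms to a target word and rank the remaining vocabulary by cosine similarity. The toy three-dimensional vectors, the function name `gendered_neighbors`, and the exact arithmetic are all assumptions for illustration. This is not the toolkit's code, and a real word2vec model would have hundreds of dimensions learned from the texts.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# Toy vectors invented for this example (not taken from any trained model).
toy = {
    "woman":    [1.0, 0.0, 0.2],
    "man":      [0.0, 1.0, 0.2],
    "work":     [0.5, 0.5, 1.0],
    "sewing":   [0.9, 0.1, 0.8],
    "business": [0.1, 0.9, 0.8],
    "garden":   [0.5, 0.5, 0.1],
}

def gendered_neighbors(term, toward, away, vectors, n=2):
    """Rank words by similarity to term + (toward - away)."""
    query = add(vectors[term], sub(vectors[toward], vectors[away]))
    skip = {term, toward, away}  # leave the query words out of the results
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

woman_side = gendered_neighbors("work", "woman", "man", toy)
man_side = gendered_neighbors("work", "man", "woman", toy)
print("woman-associated:", [w for w, _ in woman_side])
print("man-associated:", [w for w, _ in man_side])
```

Swapping the `toward` and `away` arguments is all it takes to flip from the woman-associated side of a word to the man-associated side.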
Through a workshop sandbox we had access to Women Writers Online (the full WWO corpus covers 1526–1850), Eighteenth Century Collections Online, the Victorian Women Writers Project (VWWP), and texts that individual participants had gathered. The range spread across time and genre: narratives of enslaved people from North America, early Italian scientific manuscripts, Native American literature, Wikipedia comments, the Folger edition of Shakespeare’s works, women’s biographies, congressional policy debates, Jane Austen’s writings, and finding aids for LGBTQ collections at Cornell University. Most applicable to compare with my corpus were the fiction-heavy collections.
For a cross-corpora study, servant seemed like an apt job term to start with. Maid would stand in for woman too often. Other job-related words might not fit into specific centuries as technologies and cultures changed. The word clusters for servant were a surprise. No matter the corpus or time period, the closest terms to servant were usually descriptions of a person’s character. In the WWO full corpus, a combined VWWP and WWO corpus, Native American literature from the Internet Archive, Eighteenth Century Collections Online, and the Early Modern 1080 corpus published by the Visualizing Early Print project, these were some of the closest words: faithful, humble, obedient, and dutiful.[1] Narratives by African Americans in DocSouth and a separate collection of novels from the Wright American Fiction project had a few notable differences. These had the first occurrences of other job terms: coachman, waiter, footman, lackey, and valet. Not until result eleven in the Wright corpus did any typically woman-associated role appear: maid.
In stark contrast, my corpus featured women’s jobs in the cluster closest to servant: maid, chambermaid, and nurse. The list included footman and messenger too, so labor was more dominant. There were no descriptive terms for character. Instead, there were words associated with race: negro at number eight and negress at sixteen. No other collection — not even the African American narratives or the works from the South, where they might be expected — had any race terms clustered with servant. Race was not a topic I had considered examining, and I’m curious about what distinctions could be teased out from the corpus. Are novels about the Civil War skewing the outcome? Does this difference mean that more fully realized characters were being portrayed instead of the dutiful tropes of earlier periods? How would the terms change if the corpus were split at the century mark or if works explicitly about the Civil War era were excluded? The top twenty results for labor in the WWO full corpus include slave, negroes, maroons, and manumission, and I wonder if servants are intentionally demarcated from slaves even though they often performed the same tasks and an individual might shift between roles depending on time and location. Is this difference found for servant about race or gender? This could be a meaningful path to follow. Or it could be a dead end. One caution about this result is that the WWO corpus contains fiction, nonfiction, drama, and verse with only roughly a third of it made up of fiction. Seven texts by Lydia Maria Child refer to manumission, other abolitionist tracts describe enslaved people, and Aphra Behn’s poetry is overrepresented with its mentions of slaves. 
Only 10 works of fiction contain labor (57 instances), 32 use slave (133 occurrences), and 48 have servant (including 172 appearances in Margaret Cavendish’s Philosophical Letters mainly when her fictional correspondents conclude their missives), so at these quantities and ratios I may be reading way too much into what’s merely a straightforward genre-driven variation.[2]
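The caution above rests on two different counts: how many works contain a term at all, and how many times the term appears in total. A rough sketch of that tally follows, assuming the texts are already loaded as plain strings keyed by title. The mini-corpus and the function name `term_stats` are invented for illustration.

```python
import re

def term_stats(texts, term):
    """Return (works containing term, total occurrences, per-text counts)."""
    # \b keeps "servant" from matching inside "servants" or "observant".
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    per_text = {title: len(pattern.findall(body)) for title, body in texts.items()}
    works = sum(1 for count in per_text.values() if count > 0)
    total = sum(per_text.values())
    return works, total, per_text

# Invented sample standing in for real texts.
sample = {
    "letters": "Your humble servant. Your obedient servant. A servant of the crown.",
    "novel_a": "The servant opened the door.",
    "novel_b": "She spoke of labor and of love.",
}

works, total, per_text = term_stats(sample, "servant")
print(works, total, per_text)
```

A single text with many repetitions (like the letter closings in this sample) can dominate the total while adding only one to the count of works, which is the kind of skew worth checking before interpreting a cluster.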
Our workshop group included a few people skilled in R programming, and the WWP team was game to troubleshoot our installations of RStudio, so we had an opportunity to pick up a few R commands to operate directly on our texts and craft more advanced queries. We’d also been reading about other R-based approaches that could be adapted. Ben Schmidt at New York University has developed extensive tools in R (including the Hathidy package for the HathiTrust), and one of his word-embedding models analyzed the six million newspaper pages from the Chronicling America project at the Library of Congress. He generated vectors to locate food words and then created cluster dendrograms of their cosine similarities to separate them into a discursive “meatspace” and “veggiespace.” Modeling the terms in this way allowed him to demonstrate what he calls “useful distinctions about food preparation practices” (Schmidt, “Word Embeddings”). Foods like lamb and poultry are more associated with sweetness (think fruit sauces or honey glazes), and celery and tomatoes are closer to salt. Could a similar distinction be made with work words in my corpus? Is there a women’s workspace and a men’s workspace? Domestic/public or inside/outside workspaces?
A huge caveat — aside from the obvious one that my knowledge of R programming is still novice level at best — is that Schmidt’s newspaper collection contains numerous recipes and therefore lots of food words. My novels have varied topics, and particularly during the nineteenth century and in some genres, work is not vital to the narratives. With my corpus this approach would perhaps be more like looking in Schmidt’s newspaper collection for work words rather than food words: they’re in there and sometimes will be key, but likely they aren’t a focus. But this sounded like fun to research after the workshop. And it was.
Although my primary focus was mapping workspace, perhaps Ben Schmidt’s meatspace vs. veggiespace dichotomy was pushing me toward food, and I took a spaghetti approach: I tried everything to see what might be fruitful (to add a term from another foodspace and mangle the metaphor). James Clawson, one of the experienced programmers in our workshop, developed word vector utilities (updated in August 2020) to streamline tasks and readily access analysis tools. Mainly I used his original code to throw terms into a heatmap but couldn’t produce striking grids like his. In my corpus, comparing talent with career and character with career reveals that talent is related to refinement and education but has little connection to career. Do these results identify a negative value judgment associated with a career? Career occurs 701 times in my texts. Electra Grey, an aspiring artist in Augusta J. Evans’s Macaria; or, Altars of Sacrifice (1864) refers to her potential career as “persevering, patient toil.” In Evans’s St. Elmo (1866) Edna Earl, characterized as a “literary female,” is cautioned against a career “fraught with trials” by a priest (of course he also tells her she’s selfish and being tempted by Satan). Anne Nunn ponders a life of “active accomplishment” in Gertrude Atherton’s The Gorgeous Isle (1908) but knows of none open to women aside from “those dedicated to fashion, which did not interest her.” Isabel Otis is assured in that same novel that she’ll have “a most illustrious and distinctive career,” but she’ll “be terribly lonely” (she retorts that she’s too independent to “yoke comfortably with any man”). Many comments pit love against work, so that relationships “shatter,” “endanger,” “ruin,” or “destroy” careers. Perhaps hunting for violent verbs (and using James’s updated code) would be a next step here.
In the heatmap for beauty and purity, brunettes bubble up, and talent is here too (all of these comparisons make me think about pageants). Connections between work and gender may lurk in some combination of pairs, and trying amplified heatmaps would be the next segment of this exploration.
In a different blog post, Ben Schmidt developed a process for identifying gender binaries and incorporating vector rejection to locate gendered synonym pairs. I started with the obvious woman/man binary and got obvious results. Woman is associated with fashion, appearance, domestic tasks, and possibly sexuality: crocheting, sewing, negligee, and nightdress are on the list, but disrobed is also toward the top. Man is primarily associated with defense, offense, the military, or violence: rifle, pistol, sword, sabre, and spurs fit with the many Civil War–era narratives. Using the combined corpus containing works from the VWWP and only the nineteenth-century texts from WWO for comparison offered more colorful and judgmental language. Women are described as artless, coquettish, virginal, and seductive. Men, however, are again mostly identified with their arsenals.
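Vector rejection itself is a small piece of arithmetic: subtract from a word vector its component along some axis, here a gender axis built as woman minus man. The sketch below shows only that operation on invented toy vectors. The seamstress/tailor pair is a hypothetical example, not a result from my corpus, and this is not Schmidt's R code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def reject(v, axis):
    """Remove from v its component along axis (vector rejection)."""
    scale = dot(v, axis) / dot(axis, axis)
    return [a - scale * b for a, b in zip(v, axis)]

# Toy vectors invented for illustration.
toy = {
    "woman":      [1.0, 0.0, 0.0],
    "man":        [0.0, 1.0, 0.0],
    "seamstress": [0.9, 0.1, 0.7],
    "tailor":     [0.1, 0.9, 0.7],
}

# Gender axis: the direction that separates "woman" from "man".
axis = [a - b for a, b in zip(toy["woman"], toy["man"])]

before = cosine(toy["seamstress"], toy["tailor"])
after = cosine(reject(toy["seamstress"], axis), reject(toy["tailor"], axis))
print(f"similarity before rejection: {before:.3f}")
print(f"similarity after rejection:  {after:.3f}")
```

Words that differ mostly along the gender axis move much closer together once that axis is removed, which is how candidate gendered synonym pairs can be surfaced.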
Then I tried creating a goodness vector to get at negative and positive terms for women and men.
For women, negative terms tend to be emotion-based — heartbroken, dreads, inconsolable — or sinister/supernatural — deathly, omened, phantasy. For positive terms, disrobe appears again, but so do chaste and divinely. Positive terms for men are often possibly about work or at least involve action outside the home: ferryman, carpenter, farmer, sleds, sleigh, horses, saddle. Negative words focus on character: deserter, drunkard, scoundrel, bloodthirsty. Of note is that bride and groom both fall in the positive zone, and unwomanly is negative. For comparison, I again tried the combined VWWP and nineteenth-century WWO corpus and found a greater emphasis on emotion and stronger character labels. Women get negative terms like neuralgia, fickleness, and hysteria (also revealing the nineteenth-century fixation on health), and men get reeking, dastardly, and skulking plus poisoner and murderer. This could be a rich line of inquiry for how work, character traits, and gender intersect. Here I got stuck trying to make a slopegraph, which is the next step in the analysis, so I will need to learn more R before compiling other synonym pairs.
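One plausible way to build a goodness vector is to average a few positive seed words, average a few negative ones, and take the difference. Other words can then be scored by their cosine similarity to that axis. The seed words, the toy two-dimensional vectors, and the helper names below are all assumptions made for this sketch. The post does not record which seeds were actually used.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(vectors):
    """Element-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Toy vectors invented for illustration.
toy = {
    "good":      [1.0, 0.2],
    "kind":      [0.8, 0.4],
    "bad":       [-1.0, 0.2],
    "cruel":     [-0.8, 0.4],
    "chaste":    [0.7, 0.6],
    "scoundrel": [-0.9, 0.5],
    "carpenter": [0.3, 0.9],
}

positive_seeds = ["good", "kind"]   # hypothetical seed words
negative_seeds = ["bad", "cruel"]   # hypothetical seed words

pos = mean([toy[w] for w in positive_seeds])
neg = mean([toy[w] for w in negative_seeds])
goodness = [p - n for p, n in zip(pos, neg)]

candidates = ["chaste", "scoundrel", "carpenter"]
scores = {w: cosine(toy[w], goodness) for w in candidates}
ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
for word, score in ranked:
    print(f"{word:10s} {score:+.2f}")
```

Scores near +1 sit at the positive end of the axis and scores near -1 at the negative end. Combining this axis with a gender axis would give the two coordinates a slopegraph or scatterplot needs.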
Using Schmidt’s queries as a guide, I then stepped back into workspace to iteratively generate job names from my corpus until I had collected 150 that include the common maid, nurse, governess, magistrate, farmer, and clergyman and others like attaché, maummy, calker, and gandy. The resulting dendrogram has some clusters of similar types of work, but other related terms are far apart. Lawyer and clerk are near, but attorney is distant and paired with magistrate. However they all are part of a large group of professional terms: editor, reporter, officer, financier, agent. But valet and porter are members of that group too, and they’re separate from concierge and almost opposite servant, nurse, and housekeeper. Perhaps some interior/exterior framing is influencing the categorization. Some religious jobs — chaplain, clergyman, priest, minister, bishop — are next to servant, nurse, and housekeeper. Is this connection about caregiving? Visits to homes? Are these male jobs associated with traditional women’s work? Maybe social class is driving connections: concierges work at upscale hotels and for guests who likely are in a professional or upper class.
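A dendrogram is a drawing of the merge order produced by hierarchical clustering. As a rough sketch of what happens underneath, the following groups a handful of job terms by cosine distance using average linkage. The choice of average linkage, the toy vectors, and the function names are assumptions. The R tools I used may cluster differently, and the drawing step is left out.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def agglomerate(vectors):
    """Average-linkage clustering; returns merges as (cluster_a, cluster_b, distance)."""
    def linkage(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(cosine_distance(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    clusters = [(name,) for name in vectors]  # start with one cluster per word
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy vectors invented for illustration.
toy = {
    "lawyer":      [0.9, 0.1, 0.2],
    "clerk":       [0.8, 0.2, 0.3],
    "nurse":       [0.1, 0.9, 0.2],
    "housekeeper": [0.3, 0.8, 0.3],
}

merges = agglomerate(toy)
for a, b, distance in merges:
    print(f"{' + '.join(a)}  |  {' + '.join(b)}  at distance {distance:.3f}")
```

Reading the merges from first to last corresponds to reading a dendrogram from its leaves toward its root: early merges are tight pairs, and the last merge joins the most dissimilar groups.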
Plotting these job terms in two-dimensional vector space with woman, lady, man, and gentleman shows some gendered clustering. A domestic constellation has woman/lady at its center (in red), and that’s distinct from the grouping with man/gentleman (in pink) with carpenter, clergyman, broker, and councillor. Doctor floats closer to groom and waiter than lawyer (obscured at the center right near carpenter). Gardener, functioning in a domestic sphere, is with washer and washerwoman near lady but distant from forester (more than .5 below).
More illuminating is comparing that plot with one generated from the same process on the combo VWWP-WWO corpus. Here lady/gentleman may have a stronger bond than lady/woman, and there’s a high concentration of domestic workers near that pair. Woman is situated with more independent professions: seamstress, milliner, nun, governess. Maybe social class has a stronger effect than gender here? Or perhaps women who work outside the home have specific jobs, but otherwise women are not typically associated with work?
Dividing workspace by gender proved troublesome. The dendrograms didn’t suggest obvious delineations, and these full plots featured small clusters of gendered terms but otherwise mostly had male-dominated professions. Maybe that was the key: work was associated with men but not women. I tried to suss out other divisions by plotting dueling terms: medical jobs vs. domestic jobs (nurse/doctor vs. servant/maid), religion vs. profession (some religious vocations were on the professional side of the line with doctor and lawyer), school vs. home (this one made sense), farmer vs. housekeeper (both terms were isolated in vast white space), race vs. maleness (no obvious connections or disjunctures), violence vs. femaleness or maleness (same), and on and on.
Plotting domestic vs. professional as the defining points of workspace was somewhat revealing (despite the infiltration of nonwork words). Although the graph is heavily skewed, some professional positions are clearly distant from home-based jobs.
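To make the idea of two defining points concrete, here is a minimal sketch that gives each job term a pair of coordinates: its cosine similarity to domestic and its cosine similarity to professional. This is only one possible way to lay out such a plot, and the utilities I used may compute positions differently. The toy vectors and the function name `place` are invented.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def place(terms, pole_x, pole_y, vectors):
    """Map each term to (similarity to pole_x, similarity to pole_y)."""
    return {
        term: (cosine(vectors[term], vectors[pole_x]),
               cosine(vectors[term], vectors[pole_y]))
        for term in terms
    }

# Toy vectors invented for illustration.
toy = {
    "domestic":     [1.0, 0.0, 0.2],
    "professional": [0.0, 1.0, 0.2],
    "housekeeper":  [0.9, 0.1, 0.3],
    "lawyer":       [0.1, 0.9, 0.3],
    "gardener":     [0.6, 0.4, 0.3],
}

coords = place(["housekeeper", "lawyer", "gardener"], "domestic", "professional", toy)
for term, (x, y) in coords.items():
    # A positive lean means the term sits closer to the domestic pole.
    print(f"{term:12s} domestic={x:.2f} professional={y:.2f} lean={x - y:+.2f}")
```

Terms that score similarly on both coordinates are the ones that blur the divide, and nonwork words can be filtered out before plotting by restricting `terms` to a curated job list.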
With this experimentation behind me, the next task is to learn more R. Extrapolating from others’ work got me only this far. Although my current corpus is large (and contains multitudes), splitting it could provide a fresh round of inquiries: nineteenth-century vs. twentieth-century publication dates, narratives taking place before/after the Civil War, genre novels separate from general fiction. Discarding texts so that each author is represented just once could help with balance (or would simply shift it), and then I could add works by other Cairns Collection authors. Looking at a similar corpus with male authors could help evaluate how gender is associated with work in this time frame. Given the high occurrence of race-related language, distinct corpora of white and black authors could unearth other connections between gender, race, and labor.
My alternate topic for the workshop, if this one had not panned out, was on animal imagery and gender. Quick searches of my corpus turned up 200 instances of rabbit, 284 occurrences of cat, 89 references to turtles (some snapping, some green), plus whales, dogs, birds, squirrels, elephants, spiders, bears, frogs, toads, and other vertebrates and invertebrates. Many of these creatures will have no relation to gender, but descriptions like that of Miss Muggridge, who “drew several deep breaths, which sounded very like the sighs of a whale” (in Eliza A. Dupuy’s The Adventures of a Gentleman in Search of Miss Smith from 1852), at least invite further exploration. And the corpus contains an abundance of food words too.
“About Project Gutenberg.” Project Gutenberg. www.gutenberg.org/about/.
Clawson, James M. “Updates to Word Vector Utilities.” James M. Clawson blog, August 27, 2020. jmclawson.net/blog/posts/updates-to-word-vector-utilities/.
———. “Word Vector Utilities.” James M. Clawson blog, August 12, 2019. jmclawson.net/blog/posts/word-vector-utilities/.
“Library of Southern Literature.” Documenting the American South. docsouth.unc.edu/southlit/.
Ockerbloom, Mary Mark. A Celebration of Women Writers. digital.library.upenn.edu/women/.
Phillips, Mrs. (Leonora). A Dictionary of Employments Open to Women: With Details of Wages, Hours of Work, and Other Information. Women’s Institute, 1898.
Schmidt, Ben. “Rejecting the Gender Binary: A Vector-Space Operation.” Ben’s Bookworm Blog, October 30, 2015. bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html.
———. “Word Embeddings for Digital Humanists.” Ben’s Bookworm Blog, October 25, 2015. bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.
Special Collections. “The William B. Cairns Collection of American Women Writers 1650–1940.” University of Wisconsin–Madison Libraries. www.library.wisc.edu/specialcollections/collections/the-william-b-cairns-collection-of-american-women-writers-1650-1940/.
“Women Working, 1800–1930.” Harvard Library. curiosity.lib.harvard.edu/women-working-1800-1930.
Women Writers Project. Women Writers Online. wwp.northeastern.edu/wwo/.
———. Women Writers Vector Toolkit. lab.wwp.northeastern.edu/wwvt/.
“Wright American Fiction.” Indiana University. webapp1.dlib.indiana.edu/TEIgeneral/welcome.do?brand=wright.
About the author
Becky Standard works for the University of Chicago, the University of Illinois at Urbana-Champaign, and the University of Wisconsin–Madison.
Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.
[1] In the WWO full corpus a number of these options close letters. Margaret Cavendish in particular often signs off as a “Humble Servant” and indeed calls herself the “most Obedient, Loyal, humble Subject and Servant” of Charles II in her dedication to The Life of the Thrice Noble, High and Puissant Prince William Cavendish from 1667. These terms also show up as descriptors for religious adherents in a variety of texts. Excluding such usages might eliminate or at least weaken this distinction between the corpora. A quick look at some of the 4029 instances of servant in my novels reveals that the majority identify people performing this role, some dutifully and some not. Deeper examination of all these occurrences and their context is warranted. Thanks to Sarah Connell, assistant director of the WWP, for pointing to letters as the likely culprits behind these servant clusters.
[2] Thanks to Sarah Connell for observing that the WWO’s abolitionist tracts were probably heavily influencing the labor results.