A Word Embedding Model of One’s Own: Modern Fiction from Materialism to Spiritualism

By James Clawson

This post is part of a series we will be publishing with projects from the July 2019 Institutes Series: Word Vectors for the Thoughtful Humanist. For more information on the seminar, see here

Woolf’s essays heading into the third decade of the 20th century (especially “Modern Fiction,” “Mr. Bennett and Mrs. Brown,” and “The Russian Point of View”) show her interest in the changing shape of literature of the time, especially as it related to character development. Contrasting earlier writers from the Victorian and Edwardian eras with her contemporary “Georgians,” Woolf pays close attention to the ways characters were revealed, tracing a move from a dedication to describing the physical world around characters to an interest in their inner thoughts.

Woolf knows what she’s talking about, and her writing shows the deep consideration given by someone who is more than just a common reader. Nevertheless, it is tempting to see how her understanding of these trends relates to a broader corpus of texts — to the “great unread” that has been increasingly considered by thoughtful humanists (Moretti). When, for instance, Woolf distinguishes the cook in Victorian fiction — “a leviathan in the lower depths” — from a contemporary “creature of sunshine and fresh air” (“Mr. Bennett and Mrs. Brown”), can we sense that distinction? When she cheekily writes of December 1910 as the time when human character changed (“Mr. Bennett and Mrs. Brown”), is her proffered calendrical break well placed to describe this shift? And more generally, does a broad corpus of literature agree with Woolf’s telling of subtler shifts in the fictional characterization of certain classes of characters?

But, you may say, the title of this blog post mentions word embedding models — what has that got to do with Woolf’s writing on character in fiction? I will try to explain.

As an enthusiastic neophyte to and, thanks to last summer’s workshop, maybe-could-be wielder of word embedding models, I can’t ignore the allure of searching for measurable vectors that relate to Woolf’s distinctions of the inner and outer worlds: a waning physicality of realism inherited from the previous century, and a growing consideration for the unseen world of the mind inspired by Freud, James, and others. Ultimately, I’d like to find out whether her modes of materialism and spiritualism can be measured; if so, word embedding models seem a good way to test Woolf’s argument against a larger set of texts. But to start directly with these matters would mean starting in water over my head.

Although Woolf describes the changes as applying universally across “[a]ll human relations,” including “masters and servants, husbands and wives, parents and children” (“Mr. Bennett and Mrs. Brown”), this broad scope is best studied from a smaller scale. Her distinctions between Victorian and Georgian cooks, along with the general tendency for fiction from this time to position itself for an upwardly middle-class readership, make the first relationship of service an ideal place to begin.

I. First Test: Embedded Servants

Many good explanations of word embedding models already exist, so space won’t be dedicated here toward explaining how or why they work.1 But what a word embedding model shows is the likelihood of words to be found in similar contexts, considering a particular collection of texts. If a corpus or a reading list is made up of books from one period of time and by authors of a given national identity writing in the same genre, it will yield a model that is helpful to understand the words that are likeliest to be used in that era, place, and literature. If two such models are compared against each other, the comparison allows us to understand differences over time or among locations or across genres.
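For readers who have not met the measure before, the "nearness" of two words in a model is usually reported as cosine similarity. The following is a minimal sketch in Python, not the R code used for this post; the three-dimensional vectors are invented purely for illustration, whereas a trained model would supply vectors of a hundred or more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (made-up numbers, not taken from any trained model).
toy_model = {
    "maid":   [0.9, 0.1, 0.0],
    "butler": [0.8, 0.2, 0.1],
    "ocean":  [0.0, 0.1, 0.9],
}

# Words used in similar contexts point in similar directions.
sim_servants = cosine_similarity(toy_model["maid"], toy_model["butler"])
sim_unrelated = cosine_similarity(toy_model["maid"], toy_model["ocean"])
```

In this toy example "maid" scores far closer to "butler" than to "ocean," which is the kind of relationship a model trained on novels is expected to surface.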

To measure Woolf’s words against the ground truth of the fiction in the periods she’s writing about, it’s necessary to compare a few different corpora, standardizing for genre and for authors’ broad national identities in order to compare over time.

A. Texts and Models

I drew the texts used here from a list of titles linked on Wikipedia’s category pages of British novels for each year, between the years of 1800 and 1922. For instance, the page for 1901 British novels links to the entries for nine different works: Anna Lombard, Erewhon Revisited, The First Men in the Moon, The House with the Green Shutters, The Inheritors, Kim, Lysbeth, The Purple Cloud, and The Tangled Skein. Using an R script to crawl and scrape these category pages, I collected a list of 4,719 titles between these years, then whittled it down using David Robinson’s gutenbergr package in R to a smaller 289 works available from Project Gutenberg; of the 1901 titles listed above, only The House with the Green Shutters, The First Men in the Moon, and The Tangled Skein remain.2 Still using R, I downloaded and stored these texts locally, and from this collection trained word embedding models using two sets of novels divided by year of publication.3

Model      Years      Novels  Words
Victorian  1836–1901  166     22,765,428
Georgian   1910–1922  91      8,558,048
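The division by publication year can be sketched as follows. This is an illustrative Python fragment rather than the R code behind the table: the year ranges are copied from the table, while the function name and the placeholder title are hypothetical.

```python
# Year ranges copied from the table above; names here are hypothetical.
PERIODS = {
    "Victorian": (1836, 1901),
    "Georgian": (1910, 1922),
}

def assign_period(year):
    """Return the corpus a publication year falls into, or None if excluded."""
    for name, (start, end) in PERIODS.items():
        if start <= year <= end:
            return name
    return None

titles = [
    ("The First Men in the Moon", 1901),
    ("The Thirty-Nine Steps", 1915),
    ("Placeholder Edwardian Novel", 1905),  # invented title, not in the corpus
]

corpora = {}
for title, year in titles:
    period = assign_period(year)
    if period is not None:  # years between the two ranges are left out
        corpora.setdefault(period, []).append(title)
```

Novels published between the two ranges belong to neither model, which is why the two corpora together hold fewer works than the 289 collected.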

B. Vectors of Distant Reading

Once the models are made, they can begin to reveal the structure of the language on which they’re trained. Woolf’s comparison of the cook of a Victorian household to one of the “Georgian” period offers a good test for first consideration: 

The Victorian cook lived like a leviathan in the lower depths, formidable, silent, obscure, inscrutable; the Georgian cook is a creature of sunshine and fresh air; in and out of the drawing-room, now to borrow the Daily Herald, now to ask advice about a hat. (“Mr. Bennett and Mrs. Brown”)

Woolf here suggests that cooks — and, she later clarifies, other servants — were relegated to a certain sphere in fiction before 1901, one that differed markedly from that in fiction after the coronation of George V in 1910. Her descriptions — “formidable, silent, obscure” — suggest a kind of characterization centered on Victorian unknowability, contrasting markedly with the friendly clarity of Georgian “sunshine and fresh air,” at least when it comes to the cook and other servants of a house.

To test Woolf’s assertion, it’s possible to query the models to look for words along the vector defined by her descriptions for each period. Unfortunately, Woolf’s style is too singular to take without modulation. Looking for the words closest to her descriptions of “formidable” and “inscrutable” spoils the search, as the models tend to return words matching her register and theme: “gloomy,” “remote,” “ungainly,” and others.4 Rather than looking for the words Woolf herself uses, it is better to start from a general vector of words one might use to describe a person; we can then sort this list by median cosine similarity to another vector of words like “cook” and “butler” and “maid” to see which descriptive words are used most often in relation to these characters. The results are telling:

These two heat maps — showing results from the Victorian model on the left and the Georgian model on the right — are constructed along two different axes.5 At the top of the charts, column headers show the top ten words that are closest to the vector “cook” + “servant” + “maid” + “butler” in each of the models. The column headers are generally pretty stable; the real value is in the differences among the words at the start of the rows. These descriptive words are taken from longer vectors of descriptions,6 which are sorted by median nearness to the column headers and then limited to the top 20. This vector changes from one model to the next: the two 100-word lists overlap by 51 words, so each model contributes 49 unique words. Limiting the lists to the top 20 of each shows an overlap of only 7 words, with some changes in ranking.

These lists don’t necessarily show words that are used to describe servants; rather, for each model, higher cosine similarity scores and higher rankings show words that are likelier to be used in the same context as the words for servants.7 Sometimes these words or others like them describe the characters who are servants or maids, cooks or butlers; other times they describe something or someone else entirely. Either way, they characterize the contexts in which these groups of characters appear.
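The ranking behind the heat maps can be approximated in a few lines. What follows is a sketch in Python under stated assumptions, not the code that produced the figures: the six-word model is invented, the helper names are hypothetical, and the real procedure takes ten column words and keeps the top 20 of 100 row words.

```python
import math
from statistics import median

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def add_vectors(vectors):
    """Component-wise sum, e.g. "cook" + "maid" + "butler"."""
    return [sum(components) for components in zip(*vectors)]

def closest_words(model, target, n):
    """The n words whose vectors are most similar to the target vector."""
    ranked = sorted(model, key=lambda w: cosine_similarity(model[w], target), reverse=True)
    return ranked[:n]

def rank_by_median(model, row_words, column_words, top_n):
    """Sort row words by their median similarity to the column words."""
    scores = {
        row: median(cosine_similarity(model[row], model[col]) for col in column_words)
        for row in row_words
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Invented three-dimensional vectors, for illustration only.
toy_model = {
    "cook":     [0.9, 0.1, 0.1],
    "maid":     [0.8, 0.2, 0.1],
    "butler":   [0.9, 0.2, 0.0],
    "faithful": [0.7, 0.3, 0.1],
    "chatty":   [0.3, 0.8, 0.2],
    "stormy":   [0.0, 0.1, 0.9],
}

servant_vector = add_vectors([toy_model[w] for w in ("cook", "maid", "butler")])
columns = closest_words(toy_model, servant_vector, n=3)   # column headers
rows = rank_by_median(toy_model, ["stormy", "chatty", "faithful"], columns, top_n=2)
```

Using the median rather than the mean keeps one unusually high or low column from dominating a row's position.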

The model of Victorian fiction shows higher scores for words like “faithful,” “patient,” “humble,” and “meek” — suggesting that servants are associated with qualities of being servile. Even outside these top four, adjectives typically suggest favorable qualities that an employer might notice in a servant: “staid,” “discreet,” “quiet,” “modest.” Meanwhile, the model of “Georgian” fiction shows a wider selection of adjectives. While “faithful” and “patient” drop in value, novels contemporaneous to Woolf’s writing elevate the words “homely” and “garrulous.” Consideration for unique words is also revealing. Later novels were likelier to describe characters in proximity to “disciplinarian” and “vivacious” qualities. And while Victorian British fiction might associate servants with being “talkative” or “quiet,” “genteel” or “respectable,” Georgian fiction seems to make space for those who were “chatty” or “cheerful,” or who even might be considered “dignified” or “nice.”

C. Texts and Contexts

Word vector models don’t offer direct ways to understand their results in the context of the novels, but it’s possible to verify the likelihood of these results by returning to the full-text works themselves.8 It is unsurprisingly easy to find Victorian novels discussing servants in the context of humbleness, respectability, and faithfulness. H. G. Wells’s 1895 The Wonderful Visit offers a telling scene in which “respectable” is used to describe writers rather than a servant, but in a way that foregrounds many of these relevant considerations of the depiction of character in fiction:

Delia must remain what she really was — a servant girl. I know that to give a mere servant girl, or at least an English servant girl, the refined feelings of a human being, to present her as speaking with anything but an intolerable confusion of aspirates, places me outside the pale of respectable writers. Association with servants, even in thought, is dangerous in these days.

Victorian class anxiety, Wells suggests, drives contemporary Victorian concern for a certain kind of character among servants. Fiction from this period works to differentiate the serving and the served, and the word embedding model reflects evidence of this effort. Returning to the other texts in this group, we see these descriptions continually adopted to show the distinction of the two groups. Servants and serving-class tradespeople are humble or faithful in these novels most often at the close of a letter to those for whom they work, as Anthony Trollope’s 1865 novel Miss Mackenzie and countless other works show. They act kindly when they respect class boundaries, as illustrated by the butler in R. D. Blackmore’s 1872 The Maid of Sker, who “kindly invited me in […] because he did not consider it proper that an officer of the Royal Navy should dine with the maids.” And they are discreet when they keep the secrets of the families for whom they work, as seen in George MacDonald’s 1871 children’s book At the Back of the North Wind, which features “the most discreet of servants […who] did not talk about family affairs […] concerning master and his family.” Even in instances in which the distinctions may seem challenged (as, for instance, in Charles Dickens’s 1850 David Copperfield, in which Agnes Wickfield “looked as staid and as discreet a housekeeper as the old house could have”), the action already assumes a common language of difference between the family of a house and the humans who serve that family; Agnes is, after all, Mr. Wickfield’s “little housekeeper, his daughter.”

These patterns change subtly in fiction after 1910. As the second heat map shows, the highest values of cosine similarity to any singular noun register with adjectives like “humble” or “faithful,” a pattern that matches well with that of the Victorian corpus. But the order of rows shifts to indicate the typical ranking of each adjective. Georgian texts continue to recall servants by their “sort,” their “kind,” or their capacity to serve, as in John Buchan’s 1915 The Thirty-Nine Steps, where “they were just the sort that you would find in a respectable middle-class household […] not the gossiping kind.” And Wells, again, writing in his 1916 novel Mr. Britling Sees It Through, allows one character to be “a little disappointed to find that the servants didn’t wear livery,” describing them nevertheless as “slim, discreet and attentive […] with a gentle piety in their manner instead of pride.” These examples seem to show novels after 1910 continuing the Victorian trend of limiting the fictional depiction of servants to their ability to serve, but they also show a calcified and static understanding of “servant” as a type, revealing an image that seems to have become a kind of proxy or a simulacrum that could overshadow the real. Jeffery Farnol’s The Amateur Gentleman from 1913 makes this substitution of the image explicit as a lower-class character expresses over-the-top and unwelcome declarations of servility in order to affect an air he hopes others will see as genteel: 

“Sir,” said he, hat gracefully flourished in the air, “your very humble obedient servant to command.”

“A humble obedient fiddlestick, sir!” retorted the new comer.

Although later novels often retain rigid expectations to categorize characters of the servant class in particular ways, works like The Amateur Gentleman capitalize on this rigidity to experiment with class mobility, and works like Mr. Britling Sees It Through point out where class expectations exist to fall short. Still other novels show maids, cooks, and other servants in contexts less commonly seen in the Victorian era: witness the “happy-go-lucky and informal” scene in D. H. Lawrence’s 1922 Aaron’s Rod, with “the waiter and the maid-servant cheerful and bustling”; the “garrulous butler” who enjoys delivering bad news in P. G. Wodehouse’s 1920 work The Coming of Bill; and the full cast of servants in Edgar Jepson’s 1920 The Loudwater Mystery who are so unencumbered by traditional bonds of service that they like and respect the man who killed their hated master.

These distinctions do not fall squarely along the lines Woolf resolves as belonging to the Victorians or the Georgians. Words that belong to the Victorian context do not suggest an air of mystery or inscrutability, nor do these words seem to be considered in any significant way along other lines Woolf describes, while Georgian descriptions notably do not all resolve toward sunny clarity. But there is a change. If we trust these results, Woolf was correct in sensing a shift, but it seems to be different in kind. In the decade between 1901 and 1910, however incompletely, fiction made room for more human depictions of those who were often previously overlooked.

II. Second Test: The Fabric of Things

Woolf’s consideration for the differences among older and newer literature does not stop at cooks and servants in the Victorian and Georgian eras. Her essay “Modern Fiction” is chiefly concerned with contrasting Edwardian materialists, so-called “because they are concerned not with the spirit but with the body,” against the Georgian spiritualists, who “reveal the flickerings of that innermost flame which flashes its message through the brain.” In “Mr. Bennett and Mrs. Brown,” she widens her attack against the earlier writers to say

the Edwardian tools are the wrong ones for us to use. They have laid an enormous stress on the fabric of things. They have given us a house in the hope that we may be able to deduce the human beings who live there.

These ideas resonate throughout Woolf’s writing, which repeatedly disparages an inclination to focus on the details of material items when describing character.9

The contrast of Edwardian materialism and Georgian spiritualism suggests a broader set of questions that are worth exploring: first, whether a change of this sort can be measured in the literature; second, whether it changes the way Woolf says it did; third, whether the change corresponds to the date of 1910. This work is ongoing and these questions are big, so I’ll only just touch on them briefly here.

A. Methods: Corpora, Models, Vectors

The distinction Woolf draws between materialism and spiritualism is a complicated one, warranting more careful consideration than a simple before-and-after test of adjectives.10 A series of overlapping corpora, each containing something like a decade’s worth of novels is necessary to track any potential curve over time — so, at the suggestion of Julia Flanders and Sarah Connell, I prepared a number of them.11 Spanning from 1888 to 1922, these 13 corpora of British novels were then used to train 13 word embedding models with the same settings used before. Finally, for each model, vectors of materialism and spiritualism were measured for their median values of cosine similarity when judged against vectors for servants in tests that were similar to those heatmap tests in the previous section.
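The idea of overlapping corpora can be made concrete with a short sketch. The exact boundaries of the 13 corpora are not given here, so the width and step below are assumptions, chosen only because an 11-year inclusive span advanced two years at a time happens to yield 13 windows between 1888 and 1922; the function name is hypothetical and the language is Python rather than the R used for the project.

```python
def overlapping_windows(first_year, last_year, width, step):
    """Inclusive (start, end) year ranges of `width` years, `step` years apart."""
    windows = []
    start = first_year
    while start + width - 1 <= last_year:
        windows.append((start, start + width - 1))
        start += step
    return windows

# Assumed width and step; the actual corpus boundaries may differ.
windows = overlapping_windows(1888, 1922, width=11, step=2)
```

Because neighboring windows share most of their years, a model trained on each one should change gradually from its neighbor, which is what makes a curve over time readable.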

B. Results

In the charts below, the lines connect the cosine similarity values of each model’s two groups of the top 20 words (selected from longer lists of 100 words) along vectors loosely representing “materialism” and “spiritualism.”

As shown in the legend of this chart, vectors were chosen by searching for those words that were nearest to the results of simple formulas: materialism was defined as “material” + “body” + “house” – “spirit” – “soul”; and spiritualism was defined as “spirit” + “soul” + “human” – “material” – “body”. For the sake of simplicity, these formulas show almost literal expressions of the words as Woolf uses them, and the resulting words for each grouping (again, selecting the top 20 out of a list of 100) seem to make sense. Words in gold like “house,” “bedrooms,” and “fixtures” tend to show consideration for what Woolf calls “the fabric of things” (“Mr. Bennett and Mrs. Brown”), the “unimportant things” that are “trivial and transitory” (“Modern Fiction”). Meanwhile, words in navy like “sinner,” “child,” and “slave” consider those themes that “would sound on and on in our ears after the book was finished” (“Mr. Bennett and Mrs. Brown”).
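The two formulas amount to adding some word vectors and subtracting others. A minimal Python sketch of that arithmetic follows; the two-dimensional vectors are invented so that the result is easy to read, and the helper name is hypothetical rather than part of any package used in this project.

```python
def combine(model, plus, minus):
    """Add the vectors for `plus` words and subtract those for `minus` words."""
    dims = len(next(iter(model.values())))
    result = [0.0] * dims
    for word in plus:
        result = [r + x for r, x in zip(result, model[word])]
    for word in minus:
        result = [r - x for r, x in zip(result, model[word])]
    return result

# Invented vectors: first dimension leans "material," second leans "spiritual."
toy_model = {
    "material": [1.0, 0.0],
    "body":     [0.8, 0.2],
    "house":    [0.9, 0.1],
    "spirit":   [0.1, 0.9],
    "soul":     [0.0, 1.0],
    "human":    [0.4, 0.6],
}

materialism = combine(toy_model, plus=["material", "body", "house"], minus=["spirit", "soul"])
spiritualism = combine(toy_model, plus=["spirit", "soul", "human"], minus=["material", "body"])
```

In the toy space the two composite vectors point in roughly opposite directions; in a trained model, the words nearest each composite would then be collected and ranked as in the earlier heat maps.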

The placement of these words on the chart, determined by each word’s similarity to the words for servants, suggests something different from what Woolf describes. Perhaps surprisingly, the median values for each group do show some kind of inflection around 1910, with a convergence of the modes of materialism and spiritualism as they relate to servants, but the trends shift after that point to diverge throughout the teens and early twenties. It is possible that the set of corpora, which is incomplete and stops after 1922, is inadequately selected, neglecting many of the works Woolf was thinking about when she wrote “Modern Fiction” in 1919, let alone those additional novels she may have had in mind when writing “Mr. Bennett and Mrs. Brown” in 1924 and “The Russian Point of View” in 1925. It is certainly true that better vectors could be imagined to approximate materialism and spiritualism as she meant them.12 But it is also possible that Woolf’s inclination doesn’t bear out on a broader scale or that it cannot be seen by this kind of study. Comparing novelists’ treatment of servants to their treatment of other kinds of characters offers context to these concerns.

Despite the reservations I might have about the ability of word embedding models to test for shifts in these modes, two more charts suggest that there may be validity in this approach.

The first of these moves to consider materialism and spiritualism in the relationship of husband and wife. Notwithstanding the inversion of modes from what is shown in the previous figure, with the blue curve above the yellow, this version also shows a rise in the spiritual mode which peaks around 1910 before declining. 

The second shows the relationship of parents and children. It, too, shows spiritualism growing to a crest in 1910 before slowly falling once more through the next decade.

More work certainly needs to be done to tune these vectors of materialism and spiritualism and to understand their measures’ change over time. The small evidence shown here seems to suggest that Woolf was right in reading some kind of a shift toward the spiritualist mode and right in sensing the era around 1910 to herald some kind of change in directions, but the models seem to disagree with her assessment of the timing and enduring nature of these changes.

III. Conclusions

Woolf is certainly a citizen of a different time. Her consideration for the master-servant “relationship” rings differently than it does today, and the later charts showing the inversion of modes underscore how truly unequal were the depictions of the relationships she considered.

However falteringly or incompletely, Woolf saw herself as working toward her own understanding of empathy, and her scope of sympathy often seemed willing to transgress class boundaries. In spite of her privileged position, she saw shared human experience as overcoming many distinctions of status. Her writing shows interest, unusual in her time, for the lives and histories of those who worked by cooking, serving, and tending to the comforts of the already comfortable. We see this interest, for instance, in the biographical sketch for Lily Wilson featuring in an astonishingly long footnote in Flush (109–113); as the maid of Elizabeth Barrett Browning, Wilson is “extremely obscure,” but her life nevertheless “cries aloud for the services of a biographer” (109) — and Woolf supplies this service. We might see this interest, too, in Woolf’s unpublished sketches of the lives of servants, one of which Susan Dick described as showing Woolf’s advocacy for “the importance of the unrecorded lives of servants” (qtd. in Jones 1). 

But these readings may be generous. In the case of Flush, we have to remember that Wilson’s story is relegated to a footnote in a biography about a dog, and the “class ventriloquism” that Clara Jones identifies in another of Woolf’s unpublished sketches of a servant shows the limits of attempting to breach a class gap (2). In other novels, too, Monica Miller sees that “Woolf refracts servant characters’ emotional states through collections of objects” (113), a technique that certainly evokes the materialist mode over the spiritualist.

Regardless of Woolf’s own fiction, she has a lot to say about the writing of others. Considering her essays about fiction against the very works she discusses can be revealing.  

In making any survey, even the freest and loosest, of modern fiction, it is difficult not to take it for granted that modern techniques for studying literature seem somehow an improvement upon the old. In the end, one cannot avoid the actual reading of books — nor does one wish to avoid it! — but distant reading with the aid of word embedding models makes it possible to consider an impossibly large collection (or collections) of works.

Sources

Anthony, Laurence. AntConc. Version 3.5.8. laurenceanthony.net/software/antconc.

Blackmore, R. D. The Maid of Sker. 1872. 

Buchan, John. The Thirty-Nine Steps. 1915.

“Category: British Novels by Year.” Wikipedia. en.wikipedia.org/wiki/Category:British_novels_by_year.

Clawson, James M. “w2v_utilities.R” GitHub Gist, 13 Aug. 2019. gist.github.com/jmclawson/21c6a40c78fd66d708bec45d5c0b52e2.

Dickens, Charles. David Copperfield. 1850.

Farnol, Jeffery. The Amateur Gentleman. 1913.

Fitzgerald, Jonathan and Sarah Connell. “Word Vectors Template.” GitHub, 3 June 2019, github.com/NEU-DSG/wwp-public-code-share/blob/master/WordVectors/template_word2vec.Rmd

Fitzgerald, Jonathan. “Word Embedding Models Are the New Topic Models.” NULab for Texts, Maps, and Networks, 25 July 2017. web.northeastern.edu/nulab/word-embedding-model.

Heuser, Ryan. “Word Vectors in the Eighteenth Century, Episode 1: Concepts.” 14 Apr. 2016. ryanheuser.org/word-vectors-1.

Jepson, Edgar. The Loudwater Mystery. 1920.

Johnson, Laura. “Word Vectors for the Thoughtful Humanist Institute: Data Preparation Guide and Checklist.” docs.google.com/document/u/0/d/1Qj4mrEsPQEX6hagqEyEFPq–ufHvhxpavc0w2_h8j_E

Jones, Clara. “Virginia Woolf’s 1931 ‘Cook Sketch’.” Woolf Studies Annual, vol. 20, 2014, pages 1–23.

Lawrence, D. H. Aaron’s Rod. 1922.

MacDonald, George. At the Back of the North Wind. 1871.

Miller, Monica J. “Odds, Ends, and Others: Objects and the Narration of Woolf’s Servant Characters.” Woolf Studies Annual, vol. 16, 2016, pages 111–131.

Moretti, Franco. “Conjectures on World Literature.” New Left Review vol. 1, Jan.–Feb. 2000. newleftreview.org/issues/II1/articles/franco-moretti-conjectures-on-world-literature.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. r-project.org

Recchia, Gabriel. “‘Numberless Degrees of Similitude’: A Response to Ryan Heuser’s ‘Word Vectors in the Eighteenth Century, Part 1.’” Two New Things, 11 June 2016. www.twonewthings.com/gabrielrecchia/2016/06/11/numberless-degrees-of-similitude-word-vectors.

Robinson, David. “gutenbergr.” CRAN, R package, version 0.1.5, 10 Sept. 2019. cran.r-project.org/web/packages/gutenbergr

Schmidt, Ben. “Vector Space Models for the Digital Humanities.” Ben’s Bookworm Blog, 25 Oct. 2015. bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.

—. “wordVectors.” GitHub, R package, Version 2.0, 21 Feb. 2017. github.com/bmschmidt/wordVectors.

Trollope, Anthony. Miss Mackenzie. 1865.

Wells, H. G. Mr. Britling Sees It Through. 1916. 

—. The Wonderful Visit. 1895.

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, 2016. ggplot2.tidyverse.org. CRAN, R package, Version 3.3.2, 19 June 2020. cran.r-project.org/web/packages/ggplot2.

Wodehouse, P. G. The Coming of Bill. 1920. 

Woolf, Virginia. Flush. 1933. Oxford World Classics, 2009.

—. “Modern Fiction.” 1919–1921. The Common Reader, 1925. http://gutenberg.net.au/ebooks03/0300031h.html#C12

Woolf, Virginia. “Mr. Bennett and Mrs. Brown.” 1924. http://www.columbia.edu/~em36/MrBennettAndMrsBrown.pdf

—. “The Russian Point of View.” The Common Reader 1925. http://gutenberg.net.au/ebooks03/0300031h.html#C15

About the author

I learned about word vector embeddings by attending the Women Writers Project institute “Word Vectors for the Thoughtful Humanist” at Northeastern University in July 2019. Our institute was full of generous and enthusiastic participants, lucky to be led by WWP’s Sarah Connell and Julia Flanders, and further fortunate to have at hand the insights of Syd Bauman, Laura Johnson, Laura Nelson, Anjelica Oswald, Ashley Clark, and others.

As Associate Professor and Ann Petry Endowed Professor of English at Grambling State University in Louisiana, I teach courses in literature, writing, and data analytics. My areas of focus include twentieth-century British fiction, poetry, and travel writing, and my book Durrell Re-Read: Crossing the Liminal in Lawrence Durrell’s Major Novels was published by Fairleigh Dickinson University Press in 2016. More recently, my research applies digital humanities methods to study literature, and my current work considers authorship in Renaissance Latin and English texts and across genre by twentieth-century British writers. You can reach me by Twitter (@jmclawson) or via email (clawsonj@gram.edu).

Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project do not necessarily represent those of the National Endowment for the Humanities.

  1. Having established that I won’t waste space describing what word embedding models can do, I’ll nevertheless add something down in a footnote. A model in this sense is nothing more than a simplified representation of some original thing, optimized for one purpose: a road atlas is a model of geophysical space optimized for finding one’s way while driving; a family tree is a model of history optimized for heteronormative understandings of filiation and marriage. As each model is created for its purpose, this purpose will also dictate the level of detail to be found in the model: a road atlas will show major streets and intersections, but it may omit unpaved roadways to fit details on the page; a family tree might omit the detail of certain family units that don’t result in offspring.

    In this sense, a word embedding model is a simplified representation of all the words in a collection of texts, optimized to show the commonest or likeliest neighbors of each word. Retaining one instance of every word in a set of texts, a word embedding model imagines a way of organizing all of these words so that each is near many different cohorts of similar words. Like every model, it has limitations. We can’t read a word embedding model like a book, but we must instead query it for distances between given words to try to understand the characteristics of the various neighborhoods to which a word belongs. And we can’t go much further than that within a model. We can’t, for instance, unravel it to try to understand why “porridge” somehow lives near “bear.” To understand the model in context, we have to go back to the full texts themselves.

    Nevertheless, a word embedding model is an ideal way to study literature for characteristics other than plot or character, to study the characteristics of language caught within pages as one might study the physiology of a mosquito trapped within amber.

  2. Since all nine of these novels are listed on Project Gutenberg’s website, it is clear that something is lacking in the methodology of corpus collection using gutenbergr; nevertheless, this is the method with which the corpus was collected.
  3. I should add that, while gutenbergr attempts to strip headers and footers from these texts, because they were shared with the Women Writers Project for use during last summer’s workshop, they also went through an additional process of expert scrubbing by Laura Johnson.

    Although not sized equally, these corpora are each sufficiently large to build a word embedding model. I then used Ben Schmidt’s “wordVectors” package for R to build models with the settings suggested by the Women Writers Project’s template: 100-dimensional vectors, trained on a window size of 6 words, with 10 iterations and a negative sampling value of 15.

  4. The 20 words closest to “formidable,” “obscure,” and “silent” are these: “formidable,” “obscure,” “silent,” “gloomy,” “remote,” “ungainly,” “animated,” “awkward,” “indistinct,” “isolated,” “insignificant,” “lounger,” “solitary,” “taciturn,” “manageable,” “motionless,” “approachable,” “unconcerned,” “uncomfortable,” “moody.” All of these words had greater than 0.53 cosine similarity to the three words queried.
  5. These heatmaps were made in R using ggplot2. Code for simplifying the creation of these kinds of visualizations is available on GitHub.
  6. These vectors of personal descriptions were taken by finding 100 words nearest to “mysterious” + “quiet” + “talkative” + “kind” + “faithful”. There might be a more parsimonious selection of adjectives one might use to find this vector, but the resulting lists of words look reasonable. Moreover, by first finding a longer list of 100 words and then paying attention only to the 20 that show the greatest cosine similarity to the column headers, the comparison allows each model to filter out unimportant descriptors. Median values were chosen to avoid skewing by extreme values in individual columns.
  7. It might be helpful at this point to see Gabriel Recchia’s discussion relating word vectors to “semantic differentials.” If the word “polite” can fall somewhere on a spectrum of “stale”-ness, then the word “cook” might fall somewhere on spectra of “faithful”-ness and “chatty”-ness. The lists shown here just surface those scales on which servants score highly in Victorian fiction and Georgian fiction.
  8. Tools like AntConc make it possible to search for lists of particular words (like “faithful,” “patient,” “humble,” and so on) in the context of other words (like “servant,” “maid,” “cook”) within a stipulated window or context horizon. Such tools are useful for trying to unravel some of those passages that have informed the training of a model, but they won’t be able to reveal every strand in the weave. AntConc will only show collocations in a text, while a word embedding model can, via the same transitive property upon which it’s built, extrapolate correlations in a set of texts.
  9. Woolf is especially clear in drawing distinctions on this point between the fiction of Britain and Russia. The “dark places of psychology” are “difficult for us to grasp, incomprehensible to our [English] predecessors” and understood by “No one [in England] but a modern, no one perhaps but a Russian”; moreover, “remarks upon modern English fiction can hardly avoid some mention of the Russian influence […]. If we are sick of our own materialism the least considerable of their novelists has by right of birth a natural reverence for the human spirit” (“Modern Fiction”). English novelists will describe a character’s “buttons and wrinkles; her ribbons and warts,” while Russian novelists “would pierce through the flesh; would reveal the soul — the soul alone” (“Mr. Bennett and Mrs. Brown”). And English novelists tell a story in a particular way — “we should begin with his house; we should solidify his surroundings” — while Russian novels “are seething whirlpools, gyrating sandstorms, waterspouts which hiss and boil and suck us in […] composed purely and wholly of the stuff of the soul” (“The Russian Point of View”). In light of these descriptions, it would also be interesting to compare models of British and Russian fiction from this period.
  10. “I am not saying that one went out, as one might into a garden, and there saw that a rose had flowered, or that a hen had laid an egg. The change was not sudden and definite like that.” (“Mr. Bennett and Mrs. Brown”)
  11. The corpora that went into the models prepared for this second set of tests have the following details:

    – c1898 (1888–1898): 67 novels; 5,839,250 words
    – c1900 (1890–1900): 61 novels; 5,295,466 words
    – c1902 (1892–1902): 64 novels; 5,204,355 words
    – c1904 (1894–1904): 68 novels; 5,463,370 words
    – c1906 (1896–1906): 69 novels; 5,796,085 words
    – c1908 (1898–1908): 71 novels; 6,013,836 words
    – c1910 (1900–1910): 80 novels; 6,844,079 words
    – c1912 (1902–1912): 83 novels; 7,481,804 words
    – c1914 (1904–1914): 80 novels; 7,160,934 words
    – c1916 (1906–1916): 80 novels; 7,685,435 words
    – c1918 (1908–1918): 75 novels; 7,304,288 words
    – c1920 (1910–1920): 74 novels; 7,170,370 words
    – c1922 (1912–1922): 77 novels; 7,066,495 words

  12. To these lists of abstract concepts, one might add contrasting pairs of words like “expression” / “impression.” Instead of abstract concepts, one might consider lists of concrete things like the “innumerable” bells and buttons Woolf sees in materialists’ works. But it will always be difficult to define two such groups when one of them is defined in part by its vagueness: “Whether we call it life or spirit, truth or reality, this, the essential thing, has moved off, or on […] life is a luminous halo, a semi-transparent envelope” (“Modern Fiction”).
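
To make notes 4 and 6 a little more concrete, here is a minimal sketch of the kind of query they describe: add several word vectors together, find the words nearest to that sum by cosine similarity, and then take the median similarity of the strongest few against a column header. It is written in plain Python rather than the R used for this project, the three-dimensional vectors are invented toy values, and every function name is hypothetical. It is not the “wordVectors” API and it does not reproduce any of the results reported above.

```python
import math
from statistics import median

# Toy 3-dimensional vectors, invented for illustration only. A trained
# model would instead hold one 100-dimensional vector per vocabulary word.
toy_model = {
    "mysterious": [0.9, 0.1, 0.0],
    "quiet": [0.8, 0.3, 0.1],
    "silent": [0.85, 0.2, 0.05],
    "gloomy": [0.7, 0.1, 0.3],
    "cook": [0.2, 0.9, 0.1],
    "maid": [0.3, 0.8, 0.2],
    "button": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sum_vectors(model, words):
    """Add the vectors for several words, as in "mysterious" + "quiet"."""
    dims = len(model[words[0]])
    return [sum(model[w][i] for w in words) for i in range(dims)]


def closest_to(model, target, n):
    """Return the n (word, similarity) pairs nearest to a target vector."""
    scored = [(word, cosine(vec, target)) for word, vec in model.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


def median_similarity(model, descriptors, header, top_n):
    """Median of the top_n similarities between descriptors and a header word."""
    sims = sorted((cosine(model[w], model[header]) for w in descriptors),
                  reverse=True)
    return median(sims[:top_n])


# Step 1: a longer list of descriptors near the summed seed words.
seed = sum_vectors(toy_model, ["mysterious", "quiet"])
descriptors = [word for word, _ in closest_to(toy_model, seed, 4)]

# Step 2: keep only the strongest few against a column header and summarize.
score = median_similarity(toy_model, descriptors, "cook", 2)
print(descriptors, round(score, 3))
```

With a real model, the seed list would be the five adjectives in note 6, the first step would keep 100 words, and the second step would keep the 20 most similar to each column header before taking the median.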
