By Caterina Agostini
Corpus, Methods, and Guidelines
The objects of my research are scientific thinking, language, and modes of communication from 1543 to the 1630s, when the astronomer Copernicus, the physician Vesalius, the scientist Galileo, and numerous others published texts which, from a modern perspective, mark the starting point of the Scientific Revolution and its early developments. Historian of science Thomas Kuhn argued that scientific discoveries and paradigms of thinking are correlated, and that this correlation helps explain the causes leading to the development of the modern scientific method. The result of that analysis is what he called “the structure of scientific revolutions.”
My research advances understanding of scientific and technical literature by analyzing a variety of authors through the lens of genre. The texts in my corpus are works on science, mostly in Italian (1580s-1630s); however, it is worth keeping in mind that, at the time, it was not unusual to insert quotations from scientific texts in Latin. Latin texts took a variety of forms: some were originally written in Latin by classical authors, others had been translated into Latin from Greek, and others still were in Latin because their authors intended to reach an international readership at universities or in other academic contexts. In my corpus, I included a selection of Galileo’s main works and Marino’s Baroque poem, Adone, as a significant sample of technical and non-technical writing on science in the Italian vernacular in the early modern period. This area of my work also examines the influence of Galileo on Adone, particularly the tenth canto, in which Marino discusses science and the importance of Galileo’s work. While prose prevailed in scientific discourse, poetic renderings existed alongside it, both as celebratory writing and in a didactic form (Galilei, Segre).
My research corpus contains 1,039,821 words in twelve documents. I assembled it by copying and pasting texts from the National Edition of Galileo’s works, edited by Antonio Favaro, and from the HathiTrust Digital Library. The assembled plain-text version needed cleaning and regularization before I had a set of machine-readable documents. Juniper Johnson, a Graduate Research Assistant at Northeastern University, helped me to refine my corpus for use with the WWP’s Vector Toolkit. Since the majority of my texts are in Italian, defining stopwords was a trial-and-error procedure to see what would work best. First, I took the English stopword list provided by the WWP team and filled in a parallel list of Italian translations for those terms. Next, I found modern-day Italian stopword lists on GitHub and integrated them with other, equivalent terms and with alternative spellings that would be acceptable in pre-standardized, early modern Italian writing. For this blog post, I found it helpful to consult examples of word vectors in non-contemporary literature: Sarah Connell’s research on seventeenth-century values of credit and authority, and Ryan Heuser’s studies of eighteenth-century literature.
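The stopword procedure can be sketched in a few lines of Python. This is a minimal illustration, not the list or the code used for the project: the sample words, the variant spellings, and the function names are all assumptions chosen for the example.

```python
# Hypothetical sample lists; a real stopword list would be much longer.
MODERN_STOPWORDS = {"il", "la", "e", "di", "che", "per", "non", "una", "un"}

# Assumed early modern spelling variants of common function words.
EARLY_MODERN_VARIANTS = {"et", "de", "vna", "vn"}


def build_stopwords(*lists):
    """Merge several stopword lists into one lowercase set."""
    merged = set()
    for words in lists:
        merged.update(w.lower() for w in words)
    return merged


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


stopwords = build_stopwords(MODERN_STOPWORDS, EARLY_MODERN_VARIANTS)

# A made-up sequence of words, not a quotation from the corpus.
sample = "et la terra non si muove per vna forza".split()
print(remove_stopwords(sample, stopwords))
```

Keeping the modern list and the historical variants in separate sets makes it easy to add or drop a spelling and rerun the filter, which suits a trial-and-error workflow.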
When starting a project on text analysis, it sometimes helps to begin with a brief exploration of the corpus using web-based platforms like Voyant Tools, for example by generating a word cloud of the top forty-five terms. I tried this approach for a sample of my corpus, Galileo’s Dialogo sopra i due massimi sistemi del mondo (“Dialogue Concerning the Two Chief World Systems,” 1632), the book that caused conflicts with the Church, eventually leading to the Inquisition trial against Galileo in 1633. The book has 177,492 words; the forty-five most frequent words in this word cloud visualization include “terra” (“Earth”), “moto” (“motion”), “esser” (“being”), “sole” (“Sun”), and “luna” (“Moon”).
Some of my preliminary questions are these: When did scientific language become standardized in Italian? How do humanistic and scientific communication relate to authority and logical arguments? What associations are present between subcategories of scientific disciplines? For instance, what language justified the emergence of astrology as a less legitimate branch of astronomy?
To answer these questions, I worked with the National Edition of Galileo’s works curated by Antonio Favaro (1890–1922), whose collected texts have become the standard for works by Galileo and texts related to them. Performing text analysis with word vectors has helped me to retrace and pinpoint scientific communication in the early modern period.
After preparing my corpus and training my model, I could examine a processed representation of the textual data contained in my documents. Vectors and vector models are useful in understanding different facets of a collection: not only the presence of certain words, but also their absence can be significant. In a vector model, semantic proximity translates into spatial proximity. A vector can be pictured as a line that has both a specific length and a specific direction or orientation in space. A word-embedding model represents a text corpus so that, in a certain sense, each word projects some meaning, based on its position and proximity to other words in vector space. I have been testing window settings as important parameters for my assumptions about the relevance of relative proximities for individual words. In word vector analysis, a window is a span of text of a specified length around a given word; we can control its size by setting a parameter.
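To make the idea of a window concrete, here is a small sketch that collects the words falling inside a window around each occurrence of a target word. The function name and the sample tokens are invented for illustration, and toolkits differ in how exactly they count the window, so this shows the concept rather than any particular implementation.

```python
def context_windows(tokens, target, window=5):
    """Collect the words within `window` positions of each `target`."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, i - window):i]      # words before the target
            right = tokens[i + 1:i + 1 + window]     # words after the target
            contexts.append(left + right)
    return contexts


# A made-up sequence of tokens, not a quotation from the corpus.
tokens = "il moto della terra intorno al sole".split()
print(context_windows(tokens, "terra", window=2))
```

Raising `window` from 2 to 3 in this example would also pull in “moto” on the left and “sole” on the right, which is the effect that larger window settings have on what the model treats as context.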
Window Settings for Italian Baroque Prose
In my preliminary research, I have found that window settings alter the relevance of results significantly. The syntax of early modern Italian is arguably more elaborate than that of a language like English. Because of the consistent use of articles, and also the abundant combinations of adjectives and adverbs that accompany nouns in early modern Italian, it was worth trying both wider and narrower windows to get results at scale for specific queries on words related to science, astronomy, and astrology. I adjusted the window size gradually from ten to one hundred, both for words that had a standardized, agreed meaning in scientific practice in Galileo’s time and for words that still carried some degree of ambiguity and thus show ambivalent uses among authors, if not within one author’s texts. My results lead me to these conclusions: words that are specific to the discipline, such as “scienza” (“science”) and “astronomia” (“astronomy”), generate productive results regardless of window size. Words that span several semantic territories, by contrast, have to be investigated obliquely; for example, “astronomy” would embrace both astronomy and astrology at that time. Thus, I searched for one thematic word, “influssi” (“influences”), that I derived from the Inquisition wording (“La Denuncia Contro il Galilei,” in Poppi, Antonino. Cremonini e Galilei inquisiti a Padova nel 1604: Nuovi Documenti d’Archivio. Padova: Editrice Antenore, 1992, pp. 51-54). I also narrowed my word searches by part of speech: given the complexity of agreement and inflection in Italian, I decided to privilege nouns over more variable forms, for consistency in my word vector investigations. One notable exception was the verb “ragionare” (“to reason”), which is applicable to mathematics as well as to experiments in physics and observations in astronomy.
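One way to see how much a window setting changes the results for a given word is to compare the sets of nearby words produced by two settings. The sketch below uses simple co-occurrence counts as a stand-in for a trained model, which is a swapped-in simplification; the function names and the token sequence are invented. With trained models, the same overlap score could be computed on each model’s list of nearest neighbors.

```python
from collections import Counter


def neighbors(tokens, target, window, top_n=10):
    """Most frequent words within `window` positions of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(nearby)
    return {word for word, _ in counts.most_common(top_n)}


def jaccard(a, b):
    """Share of items that two sets have in common (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# A made-up token sequence, not a quotation from the corpus.
tokens = "stelle influssi pianeti cielo stelle influssi pianeti terra".split()
narrow = neighbors(tokens, "influssi", window=1)
wide = neighbors(tokens, "influssi", window=2)
print(sorted(narrow), sorted(wide), jaccard(narrow, wide))
```

A score close to 1.0 across settings would suggest that a word behaves consistently regardless of window size, while a low score flags a word whose associations depend on how much context is taken in.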
In broader terms, iterations in queries have contributed to my digital analysis for this project on science at Galileo’s time. Since we are working with textual sources, there is a constant re-reading of the plain text materials — as we also usually read one source more than once, in order to understand and study it fully. Every time we repeat the machine reading, small adjustments through word queries, window sizes, and the extension of the corpus help us to achieve a better picture of the model.
Word vectors offer an opportunity to explore the semantic spaces and relationships within a large corpus, discover analogies between words, and study details of register and genre. In fact, literary strategies used in scientific writing reflect not only a paradigm shift in the writer’s mindset, but also a shifting appreciation of literature and of scientific content on both the author’s and the reader’s side. What makes one’s writing authoritative, consequently rendering the author an expert? I intend to examine how writers chose to express new ideas through a humanistic-sounding discourse, thus encouraging humanistic ways of thinking about nature and representations of scientific observation while fostering scientific communication on a larger scale. For the parts in Latin, I have also explored how Italian scientific language derived from Latin and how Italian came to be applied to technical fields. How did Latin, and the role of translations, affect the shaping of scientific language? One insight that I have derived from my research is that the appreciation of literary texts shifts over time. Assessing scientific vocabulary in books on science, or in poems, would particularly benefit from dedicated corpora in which texts are grouped by decade. This area of investigation has not been fully covered in the current project, and I believe there is potential to evaluate how the model reads those texts, as opposed to how they were perceived at the time of their original publication.
I find this type of text analysis to be effective for Galileo’s works, considering the far-reaching innovation that they brought to the Italian language and the range of languages in which he wrote: Latin (Sidereus Nuncius, “The Starry Messenger”), Italian (Saggiatore, “The Assayer,” and Dialogo sopra i due massimi sistemi del mondo, “Dialogue Concerning the Two Chief World Systems”), and even the Paduan dialect (Dialogo de Cecco di Ronchitti da Bruzene in perpuosito de la stella nuova, “Dialogue of Cecco of Ronchitti from Brugine, on the New Star”). Crystal Hall has argued that reading and writing represent two facets of an author, given that Galileo’s persuasion techniques sometimes draw from literary texts, such as chivalric epic poems; in Hall’s words, “As Galileo begins to write and criticize, he also begins to experiment with the literary, not factual potential of the epic poets who might help him to make his case” (43). Furthermore, I would add, Galileo authored some entries for the academic standard of the Italian language, the Dizionario della Crusca. His interest in words lay not only in using them for his books, but also in defining their meaning and scope.
Querying the Model
For my corpus, validity tests have included the names of cities, names of scientific disciplines, and scientific instruments. Queries have centered around words for “knowledge,” “culture,” and “ignorance.” To validate my model and, consequently, my corpus, I need to test whether the model is a useful representation of the texts. One way to do so is to verify how vector math affects my results, for addition, subtraction, and analogies respectively. Negative sampling, a training setting in which the model contrasts each observed word-and-context pair with a few randomly drawn words, is another parameter that can help reduce noise in the digital process of text analysis. Because so many settings are involved, keeping a digital lab notebook is an important practice for recording and retracing actions.
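A digital lab notebook can be as simple as a dated record of each query together with the settings in force. The sketch below is one possible shape for such a record, assuming a plain list of entries saved as one JSON object per line; the field names, the function names, and the placeholder values are all hypothetical and do not describe the notebook kept for this project.

```python
import json
from datetime import date


def log_entry(notebook, query, settings, results, day=None):
    """Append one query record to the notebook (a list of dicts)."""
    entry = {
        "date": (day or date.today()).isoformat(),
        "query": query,
        "settings": settings,
        "results": results,
    }
    notebook.append(entry)
    return entry


def to_json_lines(notebook):
    """Serialize the notebook as one JSON object per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in notebook)


notebook = []
# The settings and results below are placeholders, not project findings.
log_entry(notebook, "influssi", {"window": 10}, ["placeholder"], day=date(2020, 1, 1))
print(to_json_lines(notebook))
```

Recording the window size and other settings next to each query makes it possible to tell later whether two different result lists came from different questions or simply from different parameters.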
Using the Word Vector Interface, some searches have shown the following results. Querying “scientia” and “scienza,” that is, the words for “knowledge” in Latin and Italian, pulls up results in Latin and Italian respectively. I was keen to check whether the distinction between languages would be maintained. That seems to be the case, after double-checking a number of concept-words and connectives in the two languages. Clustering terms with varying window sizes also preserves linguistic autonomy, in the instances I have encountered through the Word Vector Interface.
To measure the nearness between words, I have used cosine similarity, a geometrical measure based on the angle between two vectors; its values can range from negative one to one, and closely related words score near one. Individual queries for “scienza” (“science”) showed associations with astronomy, mathematics, probability, and wisdom, to mention just a few of the top-ranking results; running a query for “conoscenza” (“knowledge”) instead revealed words related to emotions and power dynamics. Using the “Operations” function, I could search for “scienza” (“science”) + “conoscenza” (“knowledge”) and compare that output to the previous results, in which I searched those two main terms for “knowledge” in Italian separately: the result of the combined lexical unit leaned towards vigilance, attempts, and the impact on the community at large. Moving further to try one analogy query expanding on this lexical theme, I searched for “scienza” + “conoscenza” and subtracted “bugia” (“falsehood”) from that meaning unit: the results included words like “science,” “examples,” “revealing,” “understanding,” and “learning.” While individual queries have proved useful in framing my questions and finding results, I have found clustering lists to be harder to interpret, given the randomization that is part of the process.
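The arithmetic behind these queries can be illustrated with a toy example. The two-dimensional vectors below are invented purely for demonstration, since a trained model assigns each word a vector with many more dimensions, and the function names are assumptions rather than the interface’s own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combine(plus, minus=()):
    """Add the `plus` vectors and subtract the `minus` vectors."""
    total = [0.0] * len(plus[0])
    for vec in plus:
        total = [t + v for t, v in zip(total, vec)]
    for vec in minus:
        total = [t - v for t, v in zip(total, vec)]
    return total


# Invented vectors; they do not come from the model described here.
toy = {
    "scienza": [1.0, 0.0],
    "conoscenza": [0.0, 1.0],
    "bugia": [-1.0, 0.0],
}

# "scienza" + "conoscenza" - "bugia", as in the combined query.
query = combine([toy["scienza"], toy["conoscenza"]], minus=[toy["bugia"]])
for word, vec in toy.items():
    print(word, round(cosine_similarity(query, vec), 3))
```

In a real query, the combined vector would be compared against every word in the model and the highest-scoring words returned. In this toy case the subtracted word ends up pointing away from the combined query, so its score is negative.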
From Querying with Word Vectors, to Theoretical Questions in Digital Methods
One recent project, Mapping the Republic of Letters, run by Stanford University, has examined “the surviving correspondence of the mathematician and astronomer Galileo Galilei…to map his social and intellectual networks.” It is hard to overlook the fact that some letters ended up being used to wrap deli meat in Florence in the eighteenth century, when a Galileo scholar found and saved those papers as an important primary source. Therefore, the transitory and accidental nature of the discovery of letters (and of any early modern source) is a factor to keep in mind. More than ever, the fact that we have sources, no matter how extensive, does not imply that we have all the relevant scientific writing that was actually in print in Galileo’s lifetime. However, a frame of reference comes from the scientific networks that Galileo addressed, such as the Jesuit scholars in Rome and across Europe, who had been part of the conversation on science since the sixteenth century. Professor William R. Shea has discussed the role of the Jesuits as mathematicians and theologians in the early modern period, thus explaining Galileo’s intense communication with Jesuit fathers to test, verify, and validate his own theories and discoveries (William R. Shea and Mariano Artigas. 2005. Galileo in Rome: The Rise and Fall of a Troublesome Genius. New York: Oxford University Press).
The history of science, medical humanities, and illustrations in scientific books have been among my main interests in pursuing a Ph.D. degree at Rutgers University. The topics in my doctoral research have inspired me since childhood. Galileo, to me, is a brilliant mind, a great scientist, and an author whose works I would look for in my hometown libraries. Born in Padua, I became fascinated with the locales and culture where Galileo spent “the eighteen happiest years” of his life, as he wrote while under house arrest in Arcetri, near Florence, in a letter addressed to Fortunio Liceti, who was living in Padua (June 3, 1640; Opere di Galileo Galilei, ed. Favaro, XVIII, 207-09).
The goal is to find methods, both long-established and new digital ones, that let us work with incomplete data in ways that are not inaccurate, deceptive, or overly simplified. So far, one way to work around such issues is a methodological shift toward relative magnitudes and relative timespans, as I suggested earlier with the idea of dividing corpora by genre and decade to test those options as well. In combination with philology, Italian studies, and the history of science and medicine, word vectors as a methodology help me to assess word relevance, proximity, and co-occurrence, all the while expanding my horizons in research questions. The very process of entering queries, as a matter of fact, makes me more flexible in itemizing the core values of my research and in building word-related networks that I can search, validate, and update as the model reveals more.
Bellini, Eraldo. Umanisti e lincei. Letteratura e scienza a Roma nell’età di Galileo. Editrice Antenore, 1997.
Byers, William. The Blind Spot: Science and the Crisis of Uncertainty. Princeton University Press, 2011.
Cohn, Samuel K. Cultures of Plague: Medical Thinking at the End of the Renaissance. Oxford University Press, 2012.
Eamon, William. Science and the Secrets of Nature. Princeton University Press, 1994.
Favaro, Antonio, ed. Opere di Galileo Galilei. Barbera, 1890-1922. 20 vols.
Galilei, Galileo. Sidereus Nuncius. Tommaso Baglioni, 1610.
—. Il Saggiatore. Giacomo Mascardi, 1623.
Galilei, Galileo, Ottavio Besomi, and Mario Helbing, eds. Dialogo sopra i due massimi sistemi del mondo, tolemaico e copernicano. Antenore, 1998.
Getto, Giovanni. Barocco in prosa e poesia. Rizzoli, 1969. “La prosa scientifica,” pp. 403-76.
Hall, Crystal. Galileo’s Reading. Cambridge University Press, 2013.
Kuhn, Thomas. The Structure of Scientific Revolutions. University of Chicago Press, 1962.
Marino, Giambattista. Adone. Giacomo Sarzina, 1623.
Pitt, Joseph C. Galileo, Human Knowledge, and the Book of Nature. Kluwer Academic Publishers, 1992.
Poppi, Antonino. Cremonini e Galilei inquisiti a Padova nel 1604: Nuovi Documenti d’Archivio. Editrice Antenore, 1992.
Ronchitti, Cecco, Girolamo Spinelli, and Galileo Galilei. Dialogo de Cecco di Ronchitti . . . in perpuosito de la stella. Pietro Paulo Tozzi, 1605.
Segre, Michael. In the Wake of Galileo. Rutgers University Press, 1991.
Shea, William R., and Mariano Artigas. Galileo in Rome: The Rise and Fall of a Troublesome Genius. Oxford University Press, 2005.
Tesauro, Emanuele. Il cannocchiale aristotelico. Paolo Baglioni, 1663.
Wear, Andrew, R. K. French, and Iain M. Lonie, eds. The Medical Renaissance of the Sixteenth Century. Cambridge University Press, 2009.
About the author
Caterina Agostini, Ph.D. candidate at Rutgers University Department of Italian, attended the WWP institute “Word Vectors for the Thoughtful Humanist” at Northeastern University Snell Library in July 2019. The WWP has received a grant from the National Endowment for the Humanities Institutes for Advanced Topics in Digital Humanities for a series of advanced seminars on word embedding models and their applications to teaching and research in the humanities.
The WWP committee encouraged scholars to build a corpus of texts and experiment with word embedding models via WWP Sandbox folders. While learning about the Women Writers Vector Toolkit, Caterina found it helpful to experiment with the interface that allows researchers to conduct word vector analysis on texts already available in the Women Writers Project. The workshop, guidance, and support of Julia Flanders, Sarah Connell, Syd Bauman, Juniper Johnson, and Anjelica Oswald have been extraordinary: many thanks, with gratitude, to all WWP scholars at Northeastern, as well as to the Rutgers Digital Humanities Initiative and Lab and the Ms. Elena Petronio Scholarship at Rutgers.
Caterina presented “Before the Eclipse: Corresponding with Galileo Galilei on the Topic of Astronomy” online at the 2020 Renaissance Society of America (RSA) annual meeting. Since presentation and discussion were online, after the lockdown, you can find the session “Correspondence and Disciplinary Shifts Across Time” for “New Technologies and Renaissance Studies 3: Meaning Making in Text, Space, and Time,” part of a series sponsored by Iter: Gateway to the Middle Ages & Renaissance. More discussions are available via the Twitter conference hashtag for Digital Humanities and the Twitter conversations around the Renaissance Society of America (April 2, 2020).
As a Digital Humanities Graduate Research Specialist, Caterina consults and runs workshops on scholarly digital publishing, mapping, and word vectors and text analysis at Rutgers Digital Humanities Initiative and Lab.
Are you interested in Italian texts, science history, or digital humanities?
Send an e-mail message to Caterina: email@example.com
Word Vectors for the Thoughtful Humanist has been made possible in part by a major grant from the National Endowment for the Humanities: Exploring the human endeavor. Any views, findings, conclusions, or recommendations expressed in this project, do not necessarily represent those of the National Endowment for the Humanities.