Corpus Search Guide

Like any tool, word embedding models are often most revealing when put into dialogue with other tools and methods, so that you can get a sense of how the tool itself is inflecting what you see. It is also useful to be able to go back to your corpus to confirm interesting results and make sure you understand how they were derived. Returning to your corpus can also help you identify problems with your model that might affect your results. This handout offers some strategies for examining your corpus and suggests some tools to make this easier.

Strategies for searching

If your corpus is in one big document (but not too big), you can just open and search it in a plain-text editor; some good options include:

  • BBEdit (free, Mac only)
  • Oxygen XML Editor (not free, but a free trial is available)

Wildcards and regular expressions are helpful in cases where spelling or part of speech may vary; a short example follows.
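
As a minimal illustration, the Python sketch below uses the standard re module to catch spelling variants with a single pattern; the word, its variant forms, and the sample sentence are invented for demonstration.

    import re

    # Hypothetical example: match British and American spellings of "honour"
    # along with some inflected forms (honour, honor, honours, honored, ...).
    pattern = re.compile(r"\bhonou?r(?:s|ed|ing)?\b", re.IGNORECASE)

    text = "They honored the guest, as honour demanded."
    for match in pattern.finditer(text):
        print(match.group(), match.start())

The optional "u" (honou?r) absorbs the spelling variation, and the optional suffix group catches several inflected forms in one search.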

If your corpus consists of multiple documents, you can search across multiple files in your text editor. You can also load the corpus into an analysis tool like AntConc or Lexos; these offer more powerful options for analyzing large amounts of text, including useful features such as searching for words in context and counting word frequencies.

If your corpus documents contain markup, use an XML-aware editor like Oxygen. “XML-aware” just means a text editor that knows what XML is and how to read it:

  • In an XML-aware editor, you can search only the text content, excluding the tags. You can also restrict a search to particular elements, which can be useful for more targeted searching.
  • You can also use XPath if you want to limit your search to specific contexts: for instance, to discover how a word is used in poetry or in drama (a short example follows this list).
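
If you prefer to script this kind of restriction, the sketch below is a minimal illustration using Python's standard xml.etree.ElementTree module, which supports a subset of XPath; the element names (poem, l) and the sample markup are invented stand-ins for whatever encoding your corpus actually uses.

    import xml.etree.ElementTree as ET

    # Hypothetical markup: poems encoded as <poem> elements containing <l> lines.
    sample = """<doc>
      <poem><l>The sea was grey</l><l>and the sea was wide</l></poem>
      <prose>The sea trade flourished in those years.</prose>
    </doc>"""

    root = ET.fromstring(sample)
    # Limit the search to lines of poetry, ignoring the prose:
    for line in root.findall(".//poem/l"):
        if "sea" in line.text:
            print(line.text)

In an XML-aware editor like Oxygen, an equivalent XPath expression (for instance, //poem/l[contains(., 'sea')]) can typically be run directly from the editor's XPath toolbar.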

In addition to simple word searching, there are other kinds of text analysis that can provide useful comparative perspectives on the language in your corpus:

  • It can be helpful to verify the overall frequency of specific words, as a reality check on cosine similarities. For instance, if your model seems to reveal an improbable word pairing, it may be because one or both words are quite rare. This may prompt you to retrain your model with a higher word frequency requirement. (The sketch after this list shows one way to script frequency counts and a simple keyword-in-context view.)
  • It can also be very helpful to look quickly at the contexts where a word is generally used: what genres, what parts of the text (for example, you might check whether the word is distributed evenly across texts).
  • To get a sense of how specific words are used in proximity to others:
    • You can use regular expressions in a tool like BBEdit or Oxygen
    • You can do a proximity search in a tool like AntConc or Voyant
    • You can use the keyword-in-context (KWIC) view in Voyant or AntConc
  • It may also be interesting to experiment with other tools and methods (e.g., topic modeling) to see what they reveal about the vocabulary of your corpus. Understanding that vocabulary provides necessary context for your results and can make your findings more precise overall.
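
If you want to script these checks rather than work in a tool, the sketch below shows one way to count word frequencies and print a simple keyword-in-context view using only Python's standard library; the file name (corpus.txt) and the query word are placeholders.

    import re
    from collections import Counter

    # Placeholder file name: substitute the path to your own corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read().lower())

    freq = Counter(tokens)
    query = "sea"  # placeholder query word
    print(query, "occurs", freq[query], "times")

    # Simple KWIC view: five words of context on either side of each hit.
    for i, token in enumerate(tokens):
        if token == query:
            left = " ".join(tokens[max(0, i - 5):i])
            right = " ".join(tokens[i + 1:i + 6])
            print(f"{left} [{token}] {right}")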

Regular expressions

There are many useful resources for learning regular expressions, including this tutorial by the Programming Historian and the Regex101 site for testing your regular expressions. The Regular Expressions Info site has many resources, including instructions for using regexes to search for words in proximity to each other.
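
As a small illustration of that proximity technique, the Python sketch below uses a pattern of the kind described on that site to find one word within a few words of another; the words, the distance, and the sample sentence are invented for demonstration.

    import re

    # Match "sea" followed by "storm" with at most five words in between.
    proximity = re.compile(r"\bsea\W+(?:\w+\W+){0,5}?storm\b", re.IGNORECASE)

    text = "The sea rose before the storm broke over the harbour."
    match = proximity.search(text)
    if match:
        print(match.group())  # -> "sea rose before the storm"

To match the two words in either order, you would repeat the pattern with the words swapped and join the two alternatives with |.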

Tools

BBEdit: A free “Bare Bones” text editor for macOS. It supports search across multiple files using regular expressions.

Oxygen XML Editor: An XML editor that can be used with any kind of text file. It supports search across multiple files using regular expressions. Oxygen is not free, but can be accessed using a 30-day trial.

Voyant: A web-based reading and analysis environment for digital texts. This suite of tools gives you counts of words and lets you compare patterns in word locations and frequencies, or examine keywords in context, along with a few other options. Voyant can handle up to approximately 4 million words and can be useful for exploring general patterns in word usage in your corpora.

Lexos: A tool for preparing and analyzing digital texts; it offers several fairly sophisticated options for text preparation, and a range of different analytical possibilities, including some that operate on individual texts and others that operate on a corpus. Importantly, it also preserves all the changes that are made to your texts, so that any results can be reproduced. You can use the web version to get a sense of how Lexos operates, but you will likely want to download a local copy for your analysis, since that will perform better for large files.

AntConc: A corpus analysis toolkit for concordancing and text analysis. AntConc is free and available to download for macOS and Windows. AntConc performs better with many small files rather than one or two large ones; there is no limit on how many words you can analyze, but larger corpora will take longer to work with. The Programming Historian has a tutorial for working with AntConc (as well as many other tutorials for analyzing texts).