Sample Model and Query Log

This document is a sample of a note-taking space for recording settings, queries, and results when experimenting with your word embedding models. You will likely end up working with many different models as you experiment, and it’s generally a good idea to keep a log with some basic information about your models and corpus in addition to recording specific results. This saves you from misremembering crucial information about your models and prepares you to discuss your results with others.

Some of the information you might record includes the specifics of which model you were using, the parameters used to train it, and any other details that will affect the results but might be hard to reconstruct after the fact. You may find it helpful to use this space as a log of the things you’ve tried so that you can return to them later and start developing a sense of where relevant patterns or areas for further exploration are emerging. Make sure to log failures as well: for instance, if a particular set of model parameters hurts the model’s accuracy, make a note of it. By logging both successes and failures, you help your future self make smarter decisions about the directions your project should take. You can never predict when something you tried the day before might become unexpectedly relevant!

Model Log

This log is a crucial way to record the details of the corpus, training process, and parameters for the models you use. It’s tempting to pack this information into filenames (e.g. “wwo_2021-05-24_100iterations_5window_300_dimensions.01”), but this approach has obvious limitations! Whenever you train a model, get in the habit of writing down the details and associating a filename or ID with the model; you can then easily refer to it in your query log (see below). This example is just to give an idea, so feel free to experiment with whatever way of recording this information works best for you. The key pieces of information to record are:

Data preparation choices, for example:

  • What state was your corpus in before you loaded it into R or Python? Multiple text files? A CSV?
  • What preprocessing did you do before loading your corpus into R or Python? (combining any key terms with underscores, removing metadata or other textual content, and so on). Essentially, in what ways was the text of the corpus edited before training your model on it? It may also be helpful to record why you made certain preprocessing changes.
  • If you are working with the R wordVectors package, did you use the prep_word2vec function? If so, did you lowercase your corpus? Bundle ngrams? What other libraries or packages did you use to preprocess your corpus? (For a sense of what this step can look like, see the sketch after this list.)
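As one concrete illustration of this step, here is a minimal sketch using the R wordVectors package; the folder and file names are hypothetical, and the parameter values are only examples.

```r
# Data-preparation sketch with the R wordVectors package.
# "my_corpus" (a folder of plain-text files) and the destination
# file name are hypothetical examples.
library(wordVectors)

prep_word2vec(
  origin = "my_corpus",                  # folder of plain-text files
  destination = "my_corpus_prepped.txt", # single combined training file
  lowercase = TRUE,                      # lowercase the whole corpus
  bundle_ngrams = 2                      # join common bigrams with underscores
)
```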

Parameters used in model training, for example (illustrated in the sketch after this list):

  • Iterations
  • Window size
  • Vectors
  • Negative samples
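As a hedged illustration of how these four parameters map onto the R wordVectors package’s train_word2vec function, here is a sketch; the file names are hypothetical, and the values simply echo the example filename above.

```r
# Training sketch with the R wordVectors package; file names are hypothetical.
# The four logged parameters map onto train_word2vec arguments as commented.
library(wordVectors)

model <- train_word2vec(
  "my_corpus_prepped.txt",              # prepared training file
  output_file = "my_corpus_model.bin",  # where the trained model is saved
  iter = 100,                           # iterations
  window = 5,                           # window size
  vectors = 300,                        # vectors (dimensions per word)
  negative_samples = 5                  # negative samples
)
```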

Keeping track of the settings is important because word vector models are probabilistic: every time you train a model, you’re going to get a slightly different version than the previous one, even if you use all of the same settings! It can be helpful to think of model training as an experimental process where you change one setting at a time in order to see how changes to a single setting impact your results; changing multiple settings at once can make it difficult to distinguish what is actually producing your results. Of course, even with minor changes between models, you’ll still want to keep a detailed log, both to keep track of which changes were made when and to make your results easier to reproduce.
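To put the one-setting-at-a-time approach into practice, you might retrain with everything held constant except, say, the window size. This is only a sketch, and the file names are again hypothetical.

```r
# Sketch: vary only the window size, holding all other settings constant,
# so any differences between the resulting models can be traced to it.
library(wordVectors)

for (w in c(5, 10, 15)) {
  train_word2vec(
    "my_corpus_prepped.txt",
    output_file = paste0("my_corpus_window", w, ".bin"),
    iter = 100, vectors = 300, negative_samples = 5,
    window = w  # the single setting being changed
  )
}
```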

| Date and model name/ID | Corpus and data preparation details | Iterations | Window size | Vectors | Negative samples |
|---|---|---|---|---|---|
|  |  |  |  |  |  |
|  |  |  |  |  |  |

Query Log

There are plenty of ways to organize this kind of information, so feel free to experiment and find an approach that works for you. A tabular format might make the notes easier to scan through, but if you’re taking a lot of notes, a regular document might work better. The key things to record (somehow!) are:

  • When was this?
  • What model(s) were you using?
  • What query did you try? (e.g. the specific query term(s), any operators, and so on)
  • What results did you get? For instance, what cosine similarities for specific words? (Example queries and results appear in the sketch after the table below.)
| Date and Model | Settings | Query | Results |
|---|---|---|---|
|  |  |  |  |
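To give a concrete sense of what a logged query might look like, here is a hedged sketch using the R wordVectors package; the model file name and the query terms are hypothetical examples.

```r
# Query sketch with the R wordVectors package; file and terms are hypothetical.
library(wordVectors)

model <- read.vectors("my_corpus_model.bin")

# Nearest neighbors of a single term, with the cosine similarities
# you would record in the Results column.
closest_to(model, "nature", n = 10)

# A query using operators: vector arithmetic over word vectors.
closest_to(model, ~ "king" - "man" + "woman")

# Cosine similarity between two specific words.
cosineSimilarity(model[["grace"]], model[["beauty"]])
```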