<name>s: the WWP Approach

This article discusses how the Women Writers Project has implemented usage of the TEI key= attribute, including our reasoning for the mechanism chosen and the checksum algorithm used. A basic understanding of at least SGML, if not TEI, is presumed.
At the WWP, names of people are encoded with
<persName>; names of places are encoded with
<placeName>, and names of other things are encoded
with <name> (exception: in a <respStmt> of the
<teiHeader> a <persName> is not allowed, so
<name> is used). For example
<q>Therefore,</q> said <persName>King Arthur</persName> unto <persName>Sir Bedivere</persName>, <q>take thou here <name>Excalibur</name> my good sword and go with it to yonder water's side …
We encode a person's name with a particular element in order to facilitate searching for occurrences of the name. There are other possible advantages, however; e.g., allowing a brief biography to be linked to the name as it is displayed on your screen.
It is easy to search for a consistently spelled name. Even if there are only a few clearly defined possibilities (e.g., Jon or Jonathan), searching is not so difficult. However, during the period covered by the WWP textbase, names (and many other words) were inconsistently spelled (e.g., Davies vs. Davyes or Davis). Furthermore, many individuals have more than one name (e.g., Margaret Cavendish is also referred to as the Duchess of Newcastle). Identification of names is a particularly pertinent issue when dealing with women writers, who may have written under both married and maiden names, or under pseudonyms. People searching the WWP textbase may be looking for references to a given person, regardless of which name was used and how it was spelled.
Luckily, the TEI provides a mechanism that simultaneously
facilitates both searching by person and linking a brief
biography to a person's name. Many TEI elements have two
special attributes declared for this very purpose:
reg= and key=. Of the elements that bear
these attributes[1],
the WWP uses only <name>,
<rs>, <pubPlace>, <persName>, and
<placeName>.
The reg= attribute can be used to hold a
normalized or regularized form of the name used
(P3, p. 156). Thus, you might see
<cit><quote lang="la">Officium Dominorum con <persName reg="Davies, Lady Eleanor">Elleanorum Tichet</persName>, alias <persName reg="Davies, Lady Eleanor">Davyes</persName>, alias <persName reg="Davies, Lady Eleanor">Douglas</persName>.</quote> <bibl>Davies, The Blasphemous Charge Against Her, p. 7.</bibl></cit>
While this can be quite useful in facilitating searches, there are quite a few drawbacks.
The key= attribute can be used to provide an alternative identifier for the object being named,
such as a database record key
(P3, p.
156). The idea is very similar to
that of reg=, but because the format of the value
is not predetermined by convention, we can design a system
to avoid the disadvantages listed above.
You might see, for example
<cit><quote lang="la">Officium Dominorum con <persName key="EDV01">Elleanorum Tichet</persName>, alias <persName key="EDV01">Davyes</persName>, alias <persName key="EDV01">Douglas</persName>.</quote> <bibl>Davies, The Blasphemous Charge Against Her, p. 7.</bibl></cit>
Here, all occurrences of a specific person's name (in this case, Lady Eleanor Davies), even if an alias or misspelled, are brought together by associating a particular string with each. In order to ascertain who is really being referred to, a user (or her software) would need to look up the record keyed "EDV01" in a database provided along with the encoded text. Conversely, and probably more commonly, if a user wants to ask her software to find all references to Davies, whether she was called "Davies", "Touchet", "Audeley", "Lady Eleanor", or even rendered in Latin as "Elleanorum Tichet", she (or her software) would need to look up the correct key in the database and search for it.
Thus, there is one
added disadvantage inherent in using a key as opposed to a
regularization: it is always necessary to look up the key in
a database to find out who the person, or what the object,
is. That is, there is one level of indirection built into
the system. Nonetheless, the ability to overcome the
difficulties with reg= far outweighs this minor
disadvantage.
As mentioned above, quite a few TEI elements have a
key= attribute upon which we could place a key.
But for which elements (i.e., text objects) should we go to
the effort? For which kinds of real world objects that are
discussed in our texts would our users find a keyed database
useful?
It does not make sense to key something which is not
consistently encoded. For example, at the WWP we encode a
word or phrase other than a proper noun that refers to a
person as an <rs> if and only if it is renditionally
distinct (i.e., has at least one typographic feature like
italicization, typeface, typesize, small capitals, or all
capitals that is different than surrounding text; e.g., is
in italics when surrounding text is roman). Therefore it
would probably not be particularly helpful to key those
<rs>s, because a search based on the key would only
turn up those occurrences which were renditionally distinct
in the source.
However, there is such scholarly utility in keying all occurrences of a particular type of text object consistently that it may make sense to encode them all just so that you can attach keys to them. It is for this very reason that the WWP encodes all personal names of real people, whether renditionally distinct or not.
After some discussion, we divided the list of elements that perhaps should be keyed into three groups: those to key now, those to key eventually, and those not to key at all.
Those elements which we have decided should be keyed now
are names of people who exist outside the text. These names
are all encoded with either <persName> (inside
<text>) or <name> (inside <teiHeader>).
We decided not to encode non-name references to people
(e.g., the Murcian Lord,
our Saviour), because it is often difficult to tell
exactly what is and is not a reference, let alone to whom it
refers. Also, we would have to start encoding all of
them, not just those that are renditionally distinct. We
decided not to encode references to fictional people, mostly
because we do not think our users will find it as useful,
but also because it could be very hard to determine whether
or not the same name, appearing
in two different works, refers to the same fictional
character in both instances -- which is hard enough
to do for names of real people.
Of course, we are only talking about names of humans here.
For example, Socks, the first cat,
would not be encoded with a <persName>, but rather
would be encoded as a <name> if and only if it were
renditionally distinct, and would not be keyed.
Besides key= on <persName> (or, in the
<teiHeader>, <name>), we are using the same
keys to indicate individual people on the specifications of
the TEI attributes hand= and resp=,
where they are used.
Someday we would like to key <title> elements (this requires an extension to the TEI DTDs, which we have already made), which we use to encode titles of
works mentioned in the textbase. However, since our present
resources are already significantly stretched encoding a
large quantity of texts to very high standards, we have
decided to postpone keying <title>s.
For now we are not planning to key any other elements.
What are the desirable attributes of a key? The answer, of course, depends on the use you plan to put it to. In our case, we are keying a significant, but not extremely large number of names; humans will have to type the keys in as they enter text, although the cut-and-paste features of a text editor could be used to assist; and the keys will be used in a database of our own design, which will be distributed with the textbase.
Keys should resist error rather than encouraging it. Mistyping a character, or transposing two characters, or other common typos should not produce another key that exists in our database. Although we can easily write a program that checks to see that any given key is actually in the database, it takes a person to catch the case where a key that is in the database is applied to the wrong person. So you would not want CS1 and CS2 to be the keys for two people whose initials are CS; the two keys need to be more different than that.
The situation in which two keys are differentiated only by the case of one or more characters is error-prone and to be avoided. The easiest way to avoid this is to have all keys be case insensitive.
Keys should have some kind of similarity to the principal name of the person being keyed: that is, Charlotte Smith's key should be something more like Smith_C than like 13245. This makes any study of the text at least have some potential of flushing out key problems: if you see a key that does not look like it fits, it could be worth checking.
The keys for two people with the same (or even very similar) names should be different enough to make confusion in encoding less likely.
Keys should not be an agony to type. This mainly means they should not be horrendously long, but also that long strings of arbitrary characters should be avoided, and that there should be no more than a few `hard-to-type' characters (like } and #).
Difficult characters should be avoided. As mentioned
above, this includes characters that are hard to type.
More importantly, though, unusual characters that may not
survive translation from one character set to another
(e.g. "$") should be avoided. Last, there may be a
mild advantage to sticking to SGML NAME characters
(alphanumeric, dot, and minus sign). If only those
characters are used, the declared value of the
key= attribute can be changed to NAME (or
NMTOKEN or NUTOKEN, depending on the scheme used), and
SGML can catch some data entry errors. However, since this
requires a modification to the TEI DTDs, and the vast
majority of errors (including the more important ones of a
valid key being applied to the wrong person) will slip
right past SGML, this is not a major concern.
Furthermore, at the WWP we often use generic tools (rather than specialized SGML-aware tools) to parse our texts. In order to allow those tools to continue to work properly, the characters "<", ">", "=", '"', "'" and "&" should be avoided.
After some brief discussion we came up with the following format.
FSsssssss.ddc

Where:
  F        is the first letter of the person's first name;
  Ssssssss is the first eight letters of the surname (fewer if the surname is shorter);
  .        is a literal period separating the name portion from the rest;
  dd       is a pair of randomly chosen disambiguation letters; and
  c        is the check letter.
Our keys are case-insensitive, and all punctuation and accentuation are removed from the name prior to generating a key. E.g., our Director's key is "CDeBoerLa.gyp". Note the absence of the hyphen; also note that the case shifts are solely for human readers -- the key "cdeboerla.gyp" is equivalent.
When a key is generated, the computer picks two letters at random to serve as disambiguation characters, and calculates the check character.
These keys resist typographical error very well -- the check character allows a computer program to rapidly find most typos. They are also easy for a human to proofread, as they bear some similarity to the source. They do not involve unusual or difficult-to-type characters (not even digits), are not terribly long, and have only a few arbitrary characters.
Because most of the first name is not used, and only the first eight characters of the surname are used, it is likely that two people with similar names will end up with identical initial portions of keys. For example, Agnes Campanelle and Alfred Campanelli (whose names I just found in the Providence phone book) both have identical initial portions of their keys: ACampanel. However, because two disambiguation letters are appended, the system can differentiate up to 676 different people with names that boil down to the same first part of a key.
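As a concrete sketch of how the initial portion of a key is built (in Python; the function name and the cleanup rule are mine, and real practice would also need to strip accents), the Campanelle/Campanelli collision looks like this:

```python
import re

def initial_portion(first_name: str, surname: str) -> str:
    """Sketch of building the initial portion of a key: the first letter
    of the first name plus the first eight letters of the surname, with
    case folded and punctuation (hyphens, apostrophes, spaces) removed."""
    first = re.sub(r"[^a-z]", "", first_name.lower())
    last = re.sub(r"[^a-z]", "", surname.lower())
    return first[:1] + last[:8]

# The two phone-book names collide on the initial portion:
print(initial_portion("Agnes", "Campanelle"))   # acampanel
print(initial_portion("Alfred", "Campanelli"))  # acampanel
```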
Check digits and check letters are characters appended to data, such as credit card numbers, names, or files, so that errors in data entry or transmission can be discovered. Given a name, a prearranged computational scheme (algorithm) is used to generate an extra check character, resulting in an expanded name. On subsequent re-entry or transmission of the expanded name, the generation of the check character is repeated and compared with the entered or transmitted check character. Disagreement is a sure sign of error.
For example, the United States Postal Service uses a very simple check digit system for the barcode representation of a ZIP+4 code. The code itself consists of 9 digits. To generate a check digit, the 9 real digits are added together, and only the digit in the ones place of the sum is considered (which is the same as saying the sum of the digits taken modulo 10); this new digit is subtracted from 10, and the ones digit of the result is the check digit.
Take the ZIP+4 code here at the WWP: 02912-1841. The digits are added up, resulting in 28. The ones digit (8) is subtracted from 10, and the result (2) is the check digit.
This particular algorithm, like many checksum algorithms, is clever in that it is set up so that if you perform the first part of the process used to generate the check digit (add up all the digits and look at only the ones place) on the expanded number (nine digits plus the check digit), the result is always zero. In the WWP case, if you take the sum of the digits 0, 2, 9, 1, 2, 1, 8, 4, 1, and 2 you get 30; the ones digit of 30 is zero.
If a smudge on the envelope causes the barcode reader to read a "2" instead of the "9" in our ZIP+4 code, the result would not be zero; thus the machine would instantly know that there was an error. (If you take the sum of the digits 02212-1841 and 2 you get 23; the ones digit of 23 is three, not the expected zero.)
This is a reasonable algorithm in that it is easy to generate and even easier to check. However, it is not very robust. Although it will catch any single digit error, 9.1% of all multi-digit errors will slip by unnoticed, and it will not catch any errors created by the transposition of digits.
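The USPS scheme just described is simple enough to express in a few lines (a Python sketch; the function name is mine), including a demonstration of the transposition weakness:

```python
def usps_check_digit(zip_plus_4: str) -> int:
    """USPS barcode check digit: sum the nine digits, keep the ones
    place, and subtract from 10 (keeping the ones place again)."""
    digits = [int(c) for c in zip_plus_4 if c.isdigit()]
    return (10 - sum(digits) % 10) % 10

check = usps_check_digit("02912-1841")   # digits sum to 28, so check is 2
print(check)                             # 2

# Validation: the expanded code's digits sum to a multiple of 10 ...
assert (sum(int(c) for c in "029121841") + check) % 10 == 0

# ... but a transposition slips right past, because addition is commutative:
assert usps_check_digit("02921-1841") == check
```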
When we developed an algorithm for a check character we had several desired features and an important constraint. The constraint was one that many projects bear: time. We wanted to create, code, test, and document our procedures in under two weeks.
Despite the time pressure, we wanted an algorithm that would use only a letter of the alphabet as the check character. Although digits or some non-alphanumeric characters could be used if needed, we felt that for aesthetic reasons we would prefer to stick with letters.
We also wanted a relatively rigorous check that would catch the majority of typographical errors, including at least any single character error (e.g., typing "KNurphy.cfr" for "KMurphy.cfr") and any adjacent character transposition (e.g., typing "CDeBeorLa.gyp" for "CDeBoerLa.gyp").
And of course, we wanted an algorithm that was easy to code and debug in any common computer language.
Our algorithm works only on the letters. The period used as a separator is stripped out. The basic idea is to treat the sequence of letters as a number in base 26 (with a=0, b=1, ..., z=25), and use the remainder after the resulting number is divided by 29 (i.e., take the number modulo 29) as the check character. We chose 29 because it is the smallest prime number larger than the number of letters in our alphabet.
There are two confounding factors. First, in order to improve the set of errors caught by the check character, the initial portion of the key is padded to a length of 9 characters. A pad character that is not in the alphabet is used (in particular, an asterisk, but any non-letter would do), and the arithmetic is actually done base 27. Second, for various reasons, the end result is adjusted by a constant (in particular, 56).
So, for example, to re-calculate the check letter for the key "PCaton.xzc", we would take the base 27 `number' "pcaton***xz", shift it one base-27 place (leaving a zero digit in the position the check character will occupy), take that modulo 29, subtract the result from 56, and take that modulo 29. Thus what we are doing is checking to ensure that

    c_27 = ( 56 - ( ("pcaton***xz"_27 * 27) mod 29 ) ) mod 29

and we find that it does: "pcaton***xz"_27 * 27 is 25 modulo 29, and (56 - 25) mod 29 = 2, which is "c" (with a=0).
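A few lines of Python can confirm this arithmetic (Python's unbounded integers sidestep the overflow issue discussed below; multiplying after each digit is added is what produces the one-place shift for the check character's position):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz*"   # a=0 ... z=25, and the pad *=26

# Build the shifted base-27 value of "pcaton***xz": multiply after adding
# each digit, so the check character's position holds a zero digit.
n = 0
for ch in "pcaton***xz":
    n = (n + ALPHABET.index(ch)) * 27

print(n % 29)                  # 25
print((56 - n % 29) % 29)      # 2, i.e. the letter "c" -- the key checks
```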
There is one obvious problem. The number that results from taking a huge number modulo 29 will have 29 possible values (range: 0-28); but the alphabet only has 26 letters (range: a-z or 0-25). There are two solutions that came to mind immediately. The first is to use three non-alphabetic characters for the three "on beyond zebra" possible values. The second is to simply consider as invalid any key which results in a check value "on beyond zebra". Since there are many possible disambiguation character combinations for each initial portion of a key, we could simply not use those combinations which result in a check character beyond Z. We would be giving up approximately 3 of every 29 values, resulting in only 606 disambiguations left per initial portion of a key.
For example, if we came across the name Mary Robinson, we would ask the computer to generate an appendix portion for the initial key portion "MRobinson". The program would randomly generate two disambiguation characters; let's say "ac". Then the program calculates the check value. In this case it would come up with the value 27. Since there is no 28th letter of the alphabet, it would consider this an invalid combination, and generate two more random disambiguation characters: let's say "ca". Then the program would calculate the check value again. This time the result would be 15, or the 16th letter of the alphabet, "p". The resulting key would be "MRobinson.cap", leaving approximately 605 other keys available for her mother Mary and whomever else might have the first initial M and the surname Robinson.
We felt that the likelihood of running into over 600 different people who had the same first initial and first eight letters of their last names in a textbase of pre-Victorian women's English writings is so small that we chose the second solution. If we actually did run into that many keys that needed to be disambiguated, adding three more characters to our `alphabet' would allow up to 676 disambiguations with very little effort, and no need to change any existing keys.
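The generation procedure, including the retry when a check value lands `on beyond zebra', might be sketched in Python like this (function names are mine; the check routine follows the base-27, mod-29, constant-56 scheme described in this article):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz*"   # a=0 ... z=25, pad *=26

def check_value(expanded: str) -> int:
    """Check value (0-28) of an 11-character padded key portion."""
    i = 0
    for ch in expanded:
        i = (i + ALPHABET.index(ch)) * 27 % 29
    return (56 - i) % 29

def generate_key(initial: str) -> str:
    """Pad the initial portion to nine characters, then draw random
    disambiguation letters until the check value is a usable letter."""
    padded = initial.lower().ljust(9, "*")
    while True:
        dd = "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(2))
        v = check_value(padded + dd)
        if v <= 25:                       # 26-28 would be "on beyond zebra"
            return initial + "." + dd + ALPHABET[v]

# The Mary Robinson example: "ac" yields 27 (rejected); "ca" yields 15 ("p")
print(check_value("mrobinsonac"))   # 27
print(check_value("mrobinsonca"))   # 15
```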
Another problem that crops up is that many computer languages have trouble handling an 11-digit base 27 number (its maximum value is greater than 5 quadrillion base 10, and takes 53 bits to represent in base 2). This is easily worked around by taking advantage of the fact that reducing modulo 29 can be interleaved with the multiplications and additions used to build the number up, without changing the final remainder. This allows us to determine the value of our very large base 27 number modulo 29 one digit at a time. The easiest way to explain this is with a base 10 example. Suppose we wanted to determine 4570 modulo 6 (that is, the remainder when 4570 is divided by 6). The easiest way is to program our computer to do the desired operation in one quick and easy step:

    answer = 4570 modulo 6

often written as 4570 % 6.
But remember that it is possible that 4570 would be a number too large to hold at one time. So, we could also "build" the number 4570 from its constituent digits, and take the resulting number modulo six to find our answer. That is

    answer = ( ( ( 4*10 + 5 )*10 + 7 )*10 + 0 ) % 6

Of course, we haven't gained anything here. We still end up calculating the number 4570. However, because we can reduce modulo 6 after each step without changing the final remainder, we can also say that

    answer = ( ( ( 4*10 % 6 + 5 )*10 % 6 + 7 )*10 % 6 + 0 ) % 6
The advantage of thinking of it this way is
that every time we perform a multiplication (thus
increasing the number), we also perform a modulus
operation, thus `knocking down' the
number to somewhere in the range of 0-5. Of course, in our case the number we start with is much larger than 4570 (maybe as large as 5 quadrillion), but we can `knock it down' to somewhere in the range of 0-28 every time we multiply by 27 and add another digit. In this way, we will never have to deal with a huge number directly.
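The digit-at-a-time reduction just described is a one-loop routine in any language; here is a Python sketch (names are mine):

```python
def mod_by_digits(digits, base, m):
    """Reduce a number, given most-significant digit first, modulo m
    without ever holding the full number: reduce after every step."""
    r = 0
    for d in digits:
        r = (r * base + d) % m
    return r

print(mod_by_digits([4, 5, 7, 0], 10, 6))   # 4, the same as 4570 % 6
```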
As mentioned earlier, this check character system creates keys that are quite resistant to typical typographical error. All single character errors and all transposed characters (whether adjacent or not) are caught (proofs in an appendix). Because the check is only one letter, it is not difficult to type, does not make the key too long, and is certain to survive transmission from one computer to another.
Furthermore, because the algorithm relies on a letter's position in the alphabet, rather than on whatever number a given computer uses internally to represent that letter, the algorithm can be applied regardless of what character set is being used. (Although parties must agree on the alphabet.) That is, a key generated using ASCII on a Macintosh can be validated by a program using EBCDIC on a mainframe or by a program using a flavor of Unicode on a Be box.
The algorithm is easy to code in almost any modern computer language. We used Perl for our initial implementation because it is easily ported between Unix and Macintosh platforms; we also have implementations in Rexx and Microsoft Excel. It is reasonably likely that we will have a C implementation next semester. Copies of the programs we use should be available via our web site sometime next semester.
The following is high-level pseudo-code for a routine that does the actual check character calculation. This routine expects to be handed an `expanded' key without a check character. E.g., if the key JRowley.bri were to be checked for validity (or if you needed a check character to add to the key parts "JRowley" and "br"), this routine would expect to be handed "jrowley**br". It returns an integer between zero and 28 (inclusive). If the returned value is 25 or less, it should be interpreted as a letter of the alphabet (a=0, b=1, ..., z=25). If the returned value is 26 or greater, either the key is invalid (because the returned value will never match the check letter), or, in the case of generating a new key, different disambiguation characters are needed.
1. flastdd = [INPUT of 11 lowercase letters (or "*")]
2. i = 0
3. for each of the 11 characters
4.     i = ( i + base27to_int( this-char ) ) * 27
5.     i = i mod 29
6. end loop
7. i = ( 56 - i ) mod 29
8. return i
The mythical routine base27to_int() used on
line 4 simply returns the integer that corresponds to the
value of a letter or asterisk if it is considered a base
27 number (with a=0, b=1, ... z=25, *=26). In many
computer languages this can be handled very easily by a
built-in index() function.
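For reference, here is a Python rendition of the routine (a sketch; Python's str.index plays the role of base27to_int, and the adjustment by the constant 56 from the formula given earlier is folded in before returning):

```python
def check_char_value(flastdd: str) -> int:
    """Expects the 11-character expanded key (lowercase letters or '*');
    returns 0-28, where 0-25 name the check letter (a=0 ... z=25)."""
    base27 = "abcdefghijklmnopqrstuvwxyz*"
    i = 0
    for ch in flastdd:
        i = (i + base27.index(ch)) * 27   # accumulate one base-27 digit
        i = i % 29                        # knock the number back down
    return (56 - i) % 29                  # adjust by the constant 56

print(check_char_value("jrowley**br"))   # 8, i.e. "i": JRowley.bri is valid
print(check_char_value("pcaton***xz"))   # 2, i.e. "c": PCaton.xzc is valid
```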
The system described herein has a serious limitation described above: it can disambiguate only 606 (or with added characters 676) names with the same first initial and first eight letters of the surname. For the WWP this is probably not a problem. But other projects might find that far too restrictive.
The principles described here can be applied to slightly different schemes for generating keys, however. For example, rather than using the first letter of the first name and the first eight of the surname, one could use the first two letters of the first name and the first seven letters of the surname. The two randomly generated letters could still only disambiguate approximately 606 different initial portions of keys, but for most sets of names of European origin, a lot less disambiguation would be needed, as there is more variation in the second letter of first names than in the eighth letter of surnames. For example Agnes Campanelle and Alfred Campanelli would not need to be disambiguated. Neither would Mary and Mildred Robinson. The initial portions of their keys would be "MaRobinso" and "MiRobinso", so it would be OK if their disambiguation characters were exactly the same (say "dc") -- the resulting keys would differ by two characters (MaRobinso.dcu and MiRobinso.dcg).
Obviously the amount of disambiguation needed can be altered by the first name / last name source for letters split, or by allowing more than nine letters from the name. The amount of disambiguation available can be altered by changing the number of disambiguation letters. Using three would disambiguate approximately 15,757 initial portions of keys; using six would disambiguate over 276 million[3]. Of course, since you will be adding a check letter, even when you have three disambiguation letters you may want to start worrying about combinations that are four letter words.
A 12-character expanded name is coded into a number base 27, call it N. N is of the form

    N = a_11 * 27^11 + … + a_i * 27^i + … + a_1 * 27 + a_0

where the a_i coefficients are all in the range [0,1,...,26]. We must show that common errors always change the remainder when N is divided by 29. We consider single-character errors and transpositions.

Suppose a single character of N is in error, i.e., say a_i has been mistakenly entered as b_i, so that

    N' = a_11 * 27^11 + … + b_i * 27^i + … + a_1 * 27 + a_0

Then N - N' = (a_i - b_i) * 27^i. We note that 29 cannot divide a_i - b_i, which is at most 26 in magnitude (range -26 to +26); nor, since it is a prime, can 29 divide 27^i, which has 3 as its only prime factor. Therefore it does not divide N - N', and so N' ≠ N mod 29.
Suppose there is a transposition of two letters, corresponding to a transposition of two coefficients, a_i and a_j, where i > j. Now N - N' will be of the form

    N - N' = (a_i - a_j) * 27^i + (a_j - a_i) * 27^j
           = (a_i - a_j) * 27^i - (a_i - a_j) * 27^j
           = (a_i - a_j) * 27^j * (27^(i-j) - 1)

As for the single-character error, above, 29 cannot divide either of the first two factors. We must examine whether 29 can divide the third, i.e., a number of the form 27^k - 1, where k is an integer in the range 1-11. This is a matter of simply testing all 11 values of k, one at a time, and none of these is divisible by 29. (Actually this is the case for any k in the range 1-27.) So once again, the error results in an N' that has a different remainder from N, mod 29. (And, indeed, the method would work for strings up to 28 characters.)
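The claim about 27^k - 1 is easy to confirm by brute force (Python):

```python
# 27 = 3^3 and 29 is prime, so 29 never divides 27^k; testing 27^k - 1
# for k = 1 ... 27 confirms that transpositions are always caught.
print(all((27**k - 1) % 29 != 0 for k in range(1, 28)))   # True

# By Fermat's little theorem 27^28 = 1 (mod 29), so 28 is the first
# exponent where the check would fail -- hence the 28-character limit.
print((27**28 - 1) % 29)                                   # 0
```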
It is possible that multiple errors would compensate for each other, giving a false validity check. The frequency of false checks, in cases of completely random garble, would be about 1/29 ≈ 3.4%. Thus this method detects over 96% of errors, including all of the most common ones.
[1] The
complete list is: <name>, <rs>,
<measure>, <pubPlace>, <persName>,
<surname>, <forename>, <genName>,
<nameLink>, <addName>, <roleName>,
<placeName>, <settlement>, <region>,
<country>, <bloc>, <offset>,
<geogName>, and <geog>. But, interestingly
enough, not <author>, <docAuthor>,
<editor>, <publisher>, or
<title>. [return]
[2] It should be noted that the only person in the room who voted against truncating the surname was the one person whose surname is greater than eight letters long. Another example of the tyranny of the majority. [return]
[3] Scary, isn't it? Every man, woman, and child in the United States could be uniquely identified by 6 letters. But it gets worse. Everybody on the planet could be uniquely identified by one of the over 8 billion combinations of 7 letters. [return]