<name>s: the WWP Approach

This article discusses how the Women Writers Project has implemented usage of the TEI key= attribute, including our reasoning for the mechanism chosen and the checksum algorithm used. A basic understanding of at least SGML, if not TEI, is presumed.
At the WWP, names of people are encoded with
<persName>; names of places are encoded with
<placeName>, and names of other things are encoded
with <name> (exception: in a <respStmt> of the
<teiHeader> a <persName> is not allowed, so
<name> is used). For example
<q>Therefore,</q> said <persName>King Arthur</persName> unto <persName>Sir Bedivere</persName>, <q>take thou here <name>Excalibur</name> my good sword and go with it to yonder water's side …
We encode a person's name with a particular element in order to facilitate searching for occurrences of the name. There are other possible advantages, however; e.g., allowing a brief biography to be linked to the name as it is displayed on your screen.
It is easy to search for a consistently spelled name. Even if there are only a few clearly defined possibilities (e.g., Jon or Jonathan), searching is not so difficult. However, during the period covered by the WWP textbase, names (and many other words) were inconsistently spelled (e.g., Davies vs. Davyes or Davis). Furthermore, many individuals have more than one name (e.g., Margaret Cavendish is also referred to as the Duchess of Newcastle). Identification of names is a particularly pertinent issue when dealing with women writers, who may have written under both married and maiden names, or under pseudonyms. People searching the WWP textbase may be looking for references to a given person, regardless of which name was used and how it was spelled.
Luckily, the TEI provides a mechanism that simultaneously
facilitates both searching by person and linking a brief
biography to a person's name. Many TEI elements have two
special attributes declared for this very purpose:
reg= and key=. Of the elements that bear
these attributes[1],
the WWP uses only <name>,
<rs>, <pubPlace>, <persName>, and
<placeName>.
The reg= attribute can be used to hold a
normalized or regularized form of the name used
(P3, p. 156). Thus, you might see
<cit><quote lang="la">Officium Dominorum con <persName reg="Davies, Lady Eleanor">Elleanorum Tichet</persName>, alias <persName reg="Davies, Lady Eleanor">Davyes</persName>, alias <persName reg="Davies, Lady Eleanor">Douglas</persName>.</quote> <bibl>Davies, The Blasphemous Charge Against Her, p. 7.</bibl></cit>
While this can be quite useful in facilitating searches, there are quite a few drawbacks.
The key= attribute can be used to provide an alternative identifier for the object being named,
such as a database record key
(P3, p.
156). The idea is very similar to
that of reg=, but because the format of the value
is not predetermined by convention, we can design a system
to avoid the disadvantages listed above.
You might see, for example
<cit><quote lang="la">Officium Dominorum con <persName key="EDV01">Elleanorum Tichet</persName>, alias <persName key="EDV01">Davyes</persName>, alias <persName key="EDV01">Douglas</persName>.</quote> <bibl>Davies, The Blasphemous Charge Against Her, p. 7.</bibl></cit>
Here, all occurrences of a specific person's name (in this case, Lady Eleanor Davies), even if an alias or misspelled, are brought together by associating a particular string with each. In order to ascertain who is really being referred to, a user (or her software) would need to look up the record keyed "EDV01" in a database provided along with the encoded text. Conversely, and probably more commonly, if a user wants to ask her software to find all references to Davies, whether she was called "Davies", "Touchet", "Audeley", "Lady Eleanor", or even rendered in Latin as "Elleanorum Tichet", she (or her software) would need to look up the correct key in the database and search for it.
Thus, there is one
added disadvantage inherent in using a key as opposed to a
regularization: it is always necessary to look up the key in
a database to find out who the person, or what the object,
is. That is, there is one level of indirection built into
the system. Nonetheless, the ability to overcome the
difficulties with reg= far outweighs this minor
disadvantage.
As mentioned above, quite a few TEI elements have a
key= attribute upon which we could place a key.
But for which elements (i.e., text objects) should we go to
the effort? For which kinds of real world objects that are
discussed in our texts would our users find a keyed database
useful?
It does not make sense to key something which is not
consistently encoded. For example, at the WWP we encode a
word or phrase other than a proper noun that refers to a
person as an <rs> if and only if it is renditionally
distinct (i.e., has at least one typographic feature like
italicization, typeface, typesize, small capitals, or all
capitals that is different than surrounding text; e.g., is
in italics when surrounding text is roman). Therefore it
would probably not be particularly helpful to key those
<rs>s, because a search based on the key would only
turn up those occurrences which were renditionally distinct
in the source.
However, there is such scholarly utility in keying all occurrences of a particular type of text object consistently that it may make sense to encode them all just so that you can attach keys to them. It is for this very reason that the WWP encodes all personal names of real people, whether renditionally distinct or not.
After some discussion, we divided the list of elements that perhaps should be keyed into three groups: those to key now, those to key eventually, and those not to key at all.
Those elements which we have decided should be keyed now
are names of people who exist outside the text. These names
are all encoded with either <persName> (inside
<text>) or <name> (inside <teiHeader>).
We decided not to encode non-name references to people
(e.g., the Murcian Lord,
our Saviour), because it is often difficult to tell
exactly what is and is not a reference, let alone to whom it
refers. Also, we would have to start encoding all of
them, not just those that are renditionally distinct. We
decided not to encode references to fictional people, mostly
because we do not think our users will find it as useful,
but also because it could be very hard to determine whether
or not the same name, appearing
in two different works, refers to the same fictional
character in both instances -- which is hard enough
to do for names of real people.
Of course, we are only talking about names of humans here.
For example, Socks, the first cat,
would not be encoded with a <persName>, but rather
would be encoded as a <name> if and only if it were
renditionally distinct, and would not be keyed.
Besides key= on <persName> (or, in the
<teiHeader>, <name>), we are using the same
keys to indicate individual people on the specifications of
the TEI attributes hand= and resp=,
where they are used.
Someday we would like to key <title> elements (this requires an extension to the TEI DTDs, which we have already made), which we use to encode titles of
works mentioned in the textbase. However, since our present
resources are already significantly stretched encoding a
large quantity of texts to very high standards, we have
decided to postpone keying <title>s.
For now we are not planning to key any other elements.
What are the desirable attributes of a key? The answer, of course, depends on the use you plan to put it to. In our case, we are keying a significant, but not extremely large number of names; humans will have to type the keys in as they enter text, although the cut-and-paste features of a text editor could be used to assist; and the keys will be used in a database of our own design, which will be distributed with the textbase.
Keys should resist error rather than encouraging it. Mistyping a character, or transposing two characters, or other common typos should not produce another key that exists in our database. Although we can easily write a program that checks to see that any given key is actually in the database, it takes a person to catch the case where a key that is in the database is applied to the wrong person. So you would not want CS1 and CS2 to be the keys for two people whose initials are CS; the two keys need to be more different than that.
The situation in which two keys are differentiated only by the case of one or more characters is error-prone and to be avoided. The easiest way to avoid this is to have all keys be case insensitive.
Keys should have some kind of similarity to the principal name of the person being keyed: that is, Charlotte Smith's key should be something more like Smith_C than like 13245. This makes any study of the text at least have some potential of flushing out key problems: if you see a key that does not look like it fits, it could be worth checking.
The keys for two people with the same (or even very similar) names should be different enough to make confusion in encoding less likely.
Keys should not be an agony to type. This mainly means they should not be horrendously long, but also that long strings of arbitrary characters should be avoided, and that there should be no more than a few `hard-to-type' characters (like } and #).
Difficult characters should be avoided. As mentioned
above, this includes characters that are hard to type.
More importantly, though, unusual characters that may not
survive translation from one character set to another
(e.g. "$") should be avoided. Last, there may be a
mild advantage to sticking to SGML NAME characters
(alphanumeric, dot, and minus sign). If only those
characters are used, the declared value of the
key= attribute can be changed to NAME (or
NMTOKEN or NUTOKEN, depending on the scheme used), and
SGML can catch some data entry errors. However, since this
requires a modification to the TEI DTDs, and the vast
majority of errors (including the more important ones of a
valid key being applied to the wrong person) will slip
right past SGML, this is not a major concern.
Furthermore, at the WWP we often use generic tools (rather than specialized SGML-aware tools) to parse our texts. In order to allow those tools to continue to work properly, the characters "<", ">", "=", '"', "'" and "&" should be avoided.
After some brief discussion we came up with the following format.
FSsssssss.ddc

Where:
  F        is the first letter of the person's first name;
  Ssssssss is the first eight letters of the surname (fewer if the surname is shorter);
  .        is a literal period separating the name portion from the rest;
  dd       is a pair of randomly chosen disambiguation letters; and
  c        is the check letter.
Our keys are case-insensitive, and all punctuation and accentuation are removed from the name prior to generating a key. E.g., our Director's key is "CDeBoerLa.gyp". Note the absence of the hyphen; also note that the case shifts are solely for human readers -- the key "cdeboerla.gyp" is equivalent.
When a key is generated, the computer picks two letters at random to serve as disambiguation characters, and calculates the check character.
These keys resist typographical error very well -- the check character allows a computer program to rapidly find most typos. They are also easy for a human to proofread, as they bear some similarity to the source. They do not involve unusual or difficult-to-type characters (not even digits), are not terribly long, and have only a few arbitrary characters.
Because most of the first name is not used, and only the first eight characters of the surname are used, it is likely that two people with similar names will end up with identical initial portions of keys. For example, Agnes Campanelle and Alfred Campanelli (whose names I just found in the Providence phone book) both have identical initial portions of their keys: ACampanel. However, because two disambiguation letters are appended, the system can differentiate up to 676 different people with names that boil down to the same first part of a key.
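As a concrete sketch of how the initial portion of a key is built (in Python; the function name and the cleanup rule are mine, and real practice would also need to strip accents), the Campanelle/Campanelli collision looks like this:

```python
import re

def initial_portion(first_name: str, surname: str) -> str:
    """Sketch of building the initial portion of a key: the first letter
    of the first name plus the first eight letters of the surname, with
    case folded and punctuation (hyphens, apostrophes, spaces) removed."""
    first = re.sub(r"[^a-z]", "", first_name.lower())
    last = re.sub(r"[^a-z]", "", surname.lower())
    return first[:1] + last[:8]

# The two phone-book names collide on the initial portion:
print(initial_portion("Agnes", "Campanelle"))   # acampanel
print(initial_portion("Alfred", "Campanelli"))  # acampanel
```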
Check digits and check letters are characters appended to data, such as credit card numbers, names, or files, so that errors in data entry or transmission can be discovered. Given a name, a prearranged computational scheme (algorithm) is used to generate an extra check character, resulting in an expanded name. On subsequent re-entry or transmission of the expanded name, the generation of the check character is repeated and compared with the entered or transmitted check character. Disagreement is a sure sign of error.
For example, the United States Postal Service uses a very simple check digit system for the barcode representation of a ZIP+4 code. The code itself consists of 9 digits. To generate a check digit, the 9 real digits are added together, and only the digit in the ones place of the sum is considered (which is the same as saying the sum of the digits taken modulo 10); this new digit is subtracted from 10, and the ones digit of the result is the check digit.
Take the ZIP+4 code here at the WWP: 02912-1841. The digits are added up, resulting in 28. The ones digit (8) is subtracted from 10, and the result (2) is the check digit.
This particular algorithm, like many checksum algorithms, is clever in that it is set up so that if you perform the first part of the process used to generate the check digit (add up all the digits and look at only the ones place) on the expanded number (nine digits plus the check digit), the result is always zero. In the WWP case, if you take the sum of the digits 0, 2, 9, 1, 2, 1, 8, 4, 1, and 2 you get 30; the ones digit of 30 is zero.
If a smudge on the envelope causes the barcode reader to read a "2" instead of the "9" in our ZIP+4 code, the result would not be zero; thus the machine would instantly know that there was an error. (If you take the sum of the digits 02212-1841 and 2 you get 23; the ones digit of 23 is three, not the expected zero.)
This is a reasonable algorithm in that it is easy to generate and even easier to check. However, it is not very robust. Although it will catch any single digit error, 9.1% of all multi-digit errors will slip by unnoticed, and it will not catch any errors created by the transposition of digits.
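The USPS scheme just described is simple enough to express in a few lines (a Python sketch; the function name is mine), including a demonstration of the transposition weakness:

```python
def usps_check_digit(zip_plus_4: str) -> int:
    """USPS barcode check digit: sum the nine digits, keep the ones
    place, and subtract from 10 (keeping the ones place again)."""
    digits = [int(c) for c in zip_plus_4 if c.isdigit()]
    return (10 - sum(digits) % 10) % 10

check = usps_check_digit("02912-1841")   # digits sum to 28, so check is 2
print(check)                             # 2

# Validation: the expanded code's digits sum to a multiple of 10 ...
assert (sum(int(c) for c in "029121841") + check) % 10 == 0

# ... but a transposition slips right past, because addition is commutative:
assert usps_check_digit("02921-1841") == check
```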
When we developed an algorithm for a check character we had several desired features and an important constraint. The constraint was one that many projects bear: time. We wanted to create, code, test, and document our procedures in under two weeks.
Despite the time pressure, we wanted an algorithm that would use only a letter of the alphabet as the check character. Although digits or some non-alphanumeric characters could be used if needed, we felt that for aesthetic reasons we would prefer to stick with letters.
We also wanted a relatively rigorous check that would catch the majority of typographical errors, including at least any single character error (e.g., typing "KNurphy.cfr" for "KMurphy.cfr") and any adjacent character transposition (e.g., typing "CDeBeorLa.gyp" for "CDeBoerLa.gyp").
And of course, we wanted an algorithm that was easy to code and debug in any common computer language.
Our algorithm works only on the letters. The period used as a separator is stripped out. The basic idea is to treat the sequence of letters as a number in base 26 (with a=0, b=1, ..., z=25), and use the remainder after the resulting number is divided by 29 (i.e., take the number modulo 29) as the check character. We chose 29 because it is the smallest prime number larger than the number of letters in our alphabet.
There are two confounding factors. First, in order to improve the set of errors caught by the check character, the initial portion of the key is padded to a length of 9 characters. A pad character that is not in the alphabet is used (in particular, an asterisk, but any non-letter would do), and the arithmetic is actually done base 27. Second, for various reasons, the end result is adjusted by a constant (in particular, 56).
So, for example, to re-calculate the check letter for the key "PCaton.xzc", we would take the base 27 `number' "pcaton***xz", shift it one base-27 place (leaving a zero digit in the position the check character will occupy), take that modulo 29, subtract the result from 56, and take that modulo 29. Thus what we are doing is checking to ensure that

    c_27 = ( 56 - ( ("pcaton***xz"_27 * 27) mod 29 ) ) mod 29

and we find that it does: "pcaton***xz"_27 * 27 is 25 modulo 29, and (56 - 25) mod 29 = 2, which is "c" (with a=0).
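A few lines of Python can confirm this arithmetic (Python's unbounded integers sidestep the overflow issue discussed below; multiplying after each digit is added is what produces the one-place shift for the check character's position):

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz*"   # a=0 ... z=25, and the pad *=26

# Build the shifted base-27 value of "pcaton***xz": multiply after adding
# each digit, so the check character's position holds a zero digit.
n = 0
for ch in "pcaton***xz":
    n = (n + ALPHABET.index(ch)) * 27

print(n % 29)                  # 25
print((56 - n % 29) % 29)      # 2, i.e. the letter "c" -- the key checks
```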
There is one obvious problem. The number that results from taking a huge number modulo 29 will have 29 possible values (range: 0-28); but the alphabet only has 26 letters (range: a-z or 0-25). There are two solutions that came to mind immediately. The first is to use three non-alphabetic characters for the three "on beyond zebra" possible values. The second is to simply consider as invalid any key which results in a check value "on beyond zebra". Since there are many possible disambiguation character combinations for each initial portion of a key, we could simply not use those combinations which result in a check character beyond Z. We would be giving up approximately 3 of every 29 values, resulting in only 606 disambiguations left per initial portion of a key.
For example, if we came across the name Mary Robinson, we would ask the computer to generate an appendix portion for the initial key portion "MRobinson". The program would randomly generate two disambiguation characters; let's say "ac". Then the program calculates the check value. In this case it would come up with the value 27. Since there is no 28th letter of the alphabet, it would consider this an invalid combination, and generate two more random disambiguation characters: let's say "ca". Then the program would calculate the check value again. This time the result would be 15, or the 16th letter of the alphabet, "p". The resulting key would be "MRobinson.cap", leaving approximately 605 other keys available for her mother Mary and whomever else might have the first initial M and the surname Robinson.
We felt that the likelihood of running into over 600 different people who had the same first initial and first eight letters of their last names in a textbase of pre-Victorian women's English writings is so small that we chose the second solution. If we actually did run into that many keys that needed to be disambiguated, adding three more characters to our `alphabet' would allow up to 676 disambiguations with very little effort, and no need to change any existing keys.
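The generation procedure, including the retry when a check value lands `on beyond zebra', might be sketched in Python like this (function names are mine; the check routine follows the base-27, mod-29, constant-56 scheme described in this article):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz*"   # a=0 ... z=25, pad *=26

def check_value(expanded: str) -> int:
    """Check value (0-28) of an 11-character padded key portion."""
    i = 0
    for ch in expanded:
        i = (i + ALPHABET.index(ch)) * 27 % 29
    return (56 - i) % 29

def generate_key(initial: str) -> str:
    """Pad the initial portion to nine characters, then draw random
    disambiguation letters until the check value is a usable letter."""
    padded = initial.lower().ljust(9, "*")
    while True:
        dd = "".join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(2))
        v = check_value(padded + dd)
        if v <= 25:                       # 26-28 would be "on beyond zebra"
            return initial + "." + dd + ALPHABET[v]

# The Mary Robinson example: "ac" yields 27 (rejected); "ca" yields 15 ("p")
print(check_value("mrobinsonac"))   # 27
print(check_value("mrobinsonca"))   # 15
```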
Another problem that crops up is that many computer languages have trouble handling an 11-digit base 27 number (its maximum value is greater than 5 quadrillion base 10, and takes 53 bits to represent in base 2). This is easily worked around by taking advantage of the fact that reducing modulo 29 can be interleaved with the multiplications and additions used to build the number up, without changing the final remainder. This allows us to determine the value of our very large base 27 number modulo 29 one digit at a time. The easiest way to explain this is with a base 10 example. Suppose we wanted to determine 4570 modulo 6 (that is, the remainder when 4570 is divided by 6). The easiest way is to program our computer to do the desired operation in one quick and easy step:

    answer = 4570 modulo 6

often written as 4570 % 6.
But remember that it is possible that 4570 would be a number too large to hold at one time. So, we could also "build" the number 4570 from its constituent digits, and take the resulting number modulo six to find our answer. That is

    answer = ( ( ( 4*10 + 5 )*10 + 7 )*10 + 0 ) % 6

Of course, we haven't gained anything here. We still end up calculating the number 4570. However, because we can reduce modulo 6 after each step without changing the final remainder, we can also say that

    answer = ( ( ( 4*10 % 6 + 5 )*10 % 6 + 7 )*10 % 6 + 0 ) % 6
The advantage of thinking of it this way is
that every time we perform a multiplication (thus
increasing the number), we also perform a modulus
operation, thus `knocking down' the
number to somewhere in the range of 0-5. Of course, in our case the number we start with is much larger than 4570 (maybe as large as 5 quadrillion), but we can `knock it down' to somewhere in the range of 0-28 every time we multiply by 27 and add another digit. In this way, we will never have to deal with a huge number directly.
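The digit-at-a-time reduction just described is a one-loop routine in any language; here is a Python sketch (names are mine):

```python
def mod_by_digits(digits, base, m):
    """Reduce a number, given most-significant digit first, modulo m
    without ever holding the full number: reduce after every step."""
    r = 0
    for d in digits:
        r = (r * base + d) % m
    return r

print(mod_by_digits([4, 5, 7, 0], 10, 6))   # 4, the same as 4570 % 6
```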
As mentioned earlier, this check character system creates keys that are quite resistant to typical typographical error. All single character errors and all transposed characters (whether adjacent or not) are caught (proofs in an appendix). Because the check is only one letter, it is not difficult to type, does not make the key too long, and is certain to survive transmission from one computer to another.
Furthermore, because the algorithm relies on a letter's position in the alphabet, rather than on whatever number a given computer uses internally to represent that letter, the algorithm can be applied regardless of what character set is being used. (Although parties must agree on the alphabet.) That is, a key generated using ASCII on a Macintosh can be validated by a program using EBCDIC on a mainframe or by a program using a flavor of Unicode on a Be box.
The algorithm is easy to code in almost any modern computer language. We used Perl for our initial implementation because it is easily ported between Unix and Macintosh platforms; we also have implementations in Rexx and Microsoft Excel. It is reasonably likely that we will have a C implementation next semester. Copies of the programs we use should be available via our web site sometime next semester.
The following is high-level pseudo-code for a routine that does the actual check character calculation. This routine expects to be handed an `expanded' key without a check character. E.g., if the key JRowley.bri were to be checked for validity (or if you needed a check character to add to the key parts "JRowley" and "br"), this routine would expect to be handed "jrowley**br". It returns an integer between zero and 28 (inclusive). If the returned value is 25 or less, it should be interpreted as a letter of the alphabet (a=0, b=1, ..., z=25). If the returned value is 26 or greater, either the key is invalid (because the returned value will never match the check letter), or, in the case of generating a new key, different disambiguation characters are needed.
1. flastdd = [INPUT of 11 lowercase letters (or "*")]
2. i = 0
3. for each of the 11 characters
4.     i = ( i + base27to_int( this-char ) ) * 27
5.     i = i mod 29
6. end loop
7. i = ( 56 - i ) mod 29
8. return i
The mythical routine base27to_int() used on
line 4 simply returns the integer that corresponds to the
value of a letter or asterisk if it is considered a base
27 number (with a=0, b=1, ... z=25, *=26). In many
computer languages this can be handled very easily by a
built-in index() function.
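For reference, here is a Python rendition of the routine (a sketch; Python's str.index plays the role of base27to_int, and the adjustment by the constant 56 from the formula given earlier is folded in before returning):

```python
def check_char_value(flastdd: str) -> int:
    """Expects the 11-character expanded key (lowercase letters or '*');
    returns 0-28, where 0-25 name the check letter (a=0 ... z=25)."""
    base27 = "abcdefghijklmnopqrstuvwxyz*"
    i = 0
    for ch in flastdd:
        i = (i + base27.index(ch)) * 27   # accumulate one base-27 digit
        i = i % 29                        # knock the number back down
    return (56 - i) % 29                  # adjust by the constant 56

print(check_char_value("jrowley**br"))   # 8, i.e. "i": JRowley.bri is valid
print(check_char_value("pcaton***xz"))   # 2, i.e. "c": PCaton.xzc is valid
```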
The system described herein has a serious limitation described above: it can disambiguate only 606 (or with added characters 676) names with the same first initial and first eight letters of the surname. For the WWP this is probably not a problem. But other projects might find that far too restrictive.
The principles described here can be applied to slightly different schemes for generating keys, however. For example, rather than using the first letter of the first name and the first eight of the surname, one could use the first two letters of the first name and the first seven letters of the surname. The two randomly generated letters could still only disambiguate approximately 606 different initial portions of keys, but for most sets of names of European origin, a lot less disambiguation would be needed, as there is more variation in the second letter of first names than in the eighth letter of surnames. For example Agnes Campanelle and Alfred Campanelli would not need to be disambiguated. Neither would Mary and Mildred Robinson. The initial portions of their keys would be "MaRobinso" and "MiRobinso", so it would be OK if their disambiguation characters were exactly the same (say "dc") -- the resulting keys would differ by two characters (MaRobinso.dcu and MiRobinso.dcg).
Obviously the amount of disambiguation needed can be altered by the first name / last name source for letters split, or by allowing more than nine letters from the name. The amount of disambiguation available can be altered by changing the number of disambiguation letters. Using three would disambiguate approximately 15,757 initial portions of keys; using six would disambiguate over 276 million[3]. Of course, since you will be adding a check letter, even when you have three disambiguation letters you may want to start worrying about combinations that are four letter words.
A 12-character expanded name is coded into a number base 27, call it N. N is of the form

    N = a_11 * 27^11 + … + a_i * 27^i + … + a_1 * 27 + a_0

where the a_i coefficients are all in the range [0,1,...,26]. We must show that common errors always change the remainder when N is divided by 29. We consider single-character errors and transpositions.

Suppose a single character of N is in error, i.e., say a_i has been mistakenly entered as b_i, so that

    N' = a_11 * 27^11 + … + b_i * 27^i + … + a_1 * 27 + a_0

Then N - N' = (a_i - b_i) * 27^i. We note that 29 cannot divide a_i - b_i, which is at most 26 in magnitude (range -26 to +26); nor, since it is a prime, can 29 divide 27^i, which has 3 as its only prime factor. Therefore it does not divide N - N', and so N' ≠ N mod 29.
Suppose there is a transposition of two letters, corresponding to a transposition of two coefficients, a_i and a_j, where i > j. Now N - N' will be of the form

    N - N' = (a_i - a_j) * 27^i + (a_j - a_i) * 27^j
           = (a_i - a_j) * 27^i - (a_i - a_j) * 27^j
           = (a_i - a_j) * 27^j * (27^(i-j) - 1)

As for the single-character error, above, 29 cannot divide either of the first two factors. We must examine whether 29 can divide the third, i.e., a number of the form 27^k - 1, where k is an integer in the range 1-11. This is a matter of simply testing all 11 values of k, one at a time, and none of these is divisible by 29. (Actually this is the case for any k in the range 1-27.) So once again, the error results in an N' that has a different remainder from N, mod 29. (And, indeed, the method would work for strings up to 28 characters.)
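The claim about 27^k - 1 is easy to confirm by brute force (Python):

```python
# 27 = 3^3 and 29 is prime, so 29 never divides 27^k; testing 27^k - 1
# for k = 1 ... 27 confirms that transpositions are always caught.
print(all((27**k - 1) % 29 != 0 for k in range(1, 28)))   # True

# By Fermat's little theorem 27^28 = 1 (mod 29), so 28 is the first
# exponent where the check would fail -- hence the 28-character limit.
print((27**28 - 1) % 29)                                   # 0
```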
It is possible that multiple errors would compensate for each other, giving a false validity check. The frequency of false checks, in cases of completely random garble, would be about 1/29 ≈ 3.4%. Thus this method detects over 96% of errors, including all of the most common ones.
[1] The
complete list is: <name>, <rs>,
<measure>, <pubPlace>, <persName>,
<surname>, <forename>, <genName>,
<nameLink>, <addName>, <roleName>,
<placeName>, <settlement>, <region>,
<country>, <bloc>, <offset>,
<geogName>, and <geog>. But, interestingly
enough, not <author>, <docAuthor>,
<editor>, <publisher>, or
<title>. [return]
[2] It should be noted that the only person in the room who voted against truncating the surname was the one person whose surname is greater than eight letters long. Another example of the tyranny of the majority. [return]
[3] Scary, isn't it? Every man, woman, and child in the United States could be uniquely identified by 6 letters. But it gets worse. Everybody on the planet could be uniquely identified by one of the over 8 billion combinations of 7 letters. [return]