What is distinctive about digital research materials?
What I want to do in this first session is situate text encoding within
a larger frame of reference: as a specific way of representing scholarly
information
- partly as a way of understanding text encoding (and other forms
of digital research tools)
- and partly as a way of understanding how we represent and use
scholarly information, research materials, more generally
- because one crucial question being asked in the digital
humanities domain is essentially "what changes?"; "does anything
change?"
- i.e. are we radically altering how humanities research is done?
or the kinds of arguments we make?
- put another way, is it all worth the effort?
In order to do this, I think it will be helpful if we can lay out in
front of ourselves the kinds of research materials we’re familiar with,
and talk about how we’re used to thinking with them: what do they do for
us, informationally? how do they present the source materials to us?
So: what kinds of research sources have you used in the past year?
- primary sources? in what media? originals, facsimiles, reprints,
editions, microfilm reproductions, digital transcriptions (what kind?)
- secondary sources?
- derived data? in what form?
- metadata (library catalogues, finding aids, etc.)
How would we characterize the types of information found in these
sources?
- sources that show us visual evidence
- sources that give us derived analysis (quantitative, qualitative)
- sources that give us a description, in language
- sources that give us a description, in formal terms
- sources that give us an argument, in language
- sources that give us an argument, in other forms?
How do we evaluate these various types of sources? how are they
successful or unsuccessful?
- Visual evidence: level of granularity, fineness of detail,
accuracy (of color, etc.)
- Derived analysis: the intellectual basis of the analysis, i.e. the
accuracy and relevance of its disciplinary assumptions, and the
usefulness of the insight it gives us
- Prose description: the richness of detail, the persuasiveness
(i.e. how it convinces us of the author’s trustworthiness and usefulness
as a witness), and also its comparability to other descriptions (i.e.
whether descriptive terms are used consistently)
- Formal description: consistency, appropriate granularity of the data
Is any of this inflected by discipline?
- what kinds of sources do historians use?
- what kinds of sources do literary scholars use?
- linguists?
- ontologists?
- other groups?
What is text encoding? Where does it fit in?
So we can try to situate the activity of text encoding in this
intellectual space:
- From the viewpoint of the humanities scholar, text encoding looks
as if it’s coming over from computer science: as an activity that takes
place on computers and requires some technical knowledge (of software,
of data standards, of encoding languages)
- in fact, there are some other lines of connection that make it
clearer why it should be of interest to us
- anthropological: the text encoder is an observer and
documenter of the textual world, and the encoding he/she produces has
(at least potentially) something of the quality of a thick
description: a contextualized, interpretive account of the
details of the textual landscape.
- editorial: the text encoder is also very much like a critical
editor, creating an analytical representation of the text which
provides systematic, expert knowledge about it
- interpretive, critical: the encoder can also act as an
interpretive commentator, using markup to add context and layers of
interpretive information
Perhaps most importantly, text encoding is a modelling activity: a
process of creating an analytical representation of an object (e.g. a
document) or an information system
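To make this concrete, here is a minimal sketch in Python (using the
standard library’s ElementTree) of what such an analytical
representation can look like. The element names (lg, l, choice, orig,
reg) follow TEI conventions, but this particular passage and its
encoding choices are invented for illustration:

    # A minimal sketch, assuming TEI-style element names; the encoding
    # choices themselves are invented for illustration.
    import xml.etree.ElementTree as ET

    encoded = """
    <lg type="stanza">
      <l n="1">Shall I compare thee to a summer's day?</l>
      <l n="2">Thou art more
        <choice><orig>louely</orig><reg>lovely</reg></choice>
        and more temperate:</l>
    </lg>
    """

    root = ET.fromstring(encoded)
    # The markup is the analytical layer: it lets us ask structural and
    # editorial questions that the raw character stream cannot answer.
    for line in root.findall("l"):
        print("line", line.get("n"))
    for choice in root.iter("choice"):
        print("original spelling:", choice.findtext("orig"),
              "| regularized:", choice.findtext("reg"))

The point is that the queries run against the markup, not against the
raw character stream: the encoding is the model we interrogate.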
Sampling and modelling
It may be useful to talk more about the concept of data modelling at this point...
- an increasingly common concept in digital humanities, with much
discussion about what it means
I’m using the term modelling here as distinct from a concept
like sampling:
- Sampling takes slices or samples of the world:
visually (like a digital camera), sonically (like a digital sound
recorder), or in some other way
- the classic example is the bitmap image: a matrix of colored dots
that represent an image at some resolution, high or low (see the sketch
after this list)
- Modelling creates an analytic representation of the
world: as a function, a formalization, a mathematical representation, a
conceptual model, some kind of surrogate
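As a toy illustration of sampling, here is a sketch assuming an
invented 4x4 grayscale "image": a bitmap is just a matrix of intensity
values, and lowering the resolution means averaging detail away:

    # A toy sketch of sampling; the 4x4 grayscale "image" is invented.
    image = [
        [0, 0, 255, 255],
        [0, 0, 255, 255],
        [255, 255, 0, 0],
        [255, 255, 0, 0],
    ]

    def downsample(pixels):
        """Average each 2x2 block into one pixel (half the resolution)."""
        out = []
        for r in range(0, len(pixels), 2):
            row = []
            for c in range(0, len(pixels[0]), 2):
                block = (pixels[r][c] + pixels[r][c + 1]
                         + pixels[r + 1][c] + pixels[r + 1][c + 1])
                row.append(block // 4)
            out.append(row)
        return out

    print(downsample(image))  # [[0, 255], [255, 0]]: same scene, less detail

Nothing analytic has been added here: the lower-resolution version is
simply a coarser slice of the same visual field.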
Sampling produces what I would tentatively call a
depiction: a version that aspires to be the
source:
- Measured in terms of fidelity
- Example: a high-resolution photograph (higher resolution = better
depiction)
Modelling produces a version that aspires to yield
information about the source for a specific purpose:
- Measured in terms of functionality against the purpose in question
- A topographical map: functional for understanding geographical
features
- A road map: functional for navigating in a car
- A satellite map: functional for viewing weather systems
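The same idea in code (the territory, place names, and figures below
are all invented): two models of one landscape, each functional only
for the purpose it was built for:

    # A sketch of modelling as purpose-driven representation.
    roads = {  # road map as a graph: town -> [(neighbour, distance in km)]
        "Aford": [("Bton", 12), ("Cby", 30)],
        "Bton": [("Aford", 12), ("Cby", 9)],
        "Cby": [("Aford", 30), ("Bton", 9)],
    }
    elevations = {"Aford": 40, "Bton": 210, "Cby": 95}  # topographic model (m)

    # Functional for navigation: the nearest town by road from Aford
    print(min(roads["Aford"], key=lambda edge: edge[1]))
    # Functional for terrain questions: the highest settlement
    print(max(elevations, key=elevations.get))

Neither model aspires to be the territory; each yields information
about it for a specific purpose, and is measured against that purpose.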
Text encoding vs. information modelling
Note that text encoding may not even be the best or
most evocative term for all of this:
- it describes the markup of text streams
- but for things like modelling contextual information (e.g. the
personography example, sketched below) the emphasis is on the
structures created, not on the text itself
- somewhere in between is the terrain of text which is itself
regularly structured
- when we think of what all of these kinds of markup are really
doing, we might better describe it as information
modelling or data modelling
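Here is a minimal sketch of the personography case. The element names
echo TEI’s personography conventions, but the person and dates are
invented:

    # A sketch assuming TEI-style personography elements; the person
    # and dates are invented for illustration.
    import xml.etree.ElementTree as ET

    person = ET.fromstring("""
    <person xml:id="p001">
      <persName>
        <forename>Jane</forename>
        <surname>Smith</surname>
      </persName>
      <birth when="1712-03-04"/>
      <occupation>printer</occupation>
    </person>
    """)

    # Nearly everything informative lives in the structure and the
    # attribute values, not in running prose: this is information
    # modelling rather than the markup of a text stream.
    print(person.findtext("persName/surname"),
          person.find("birth").get("when"))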
And note as well that text markup is not the only way to model data, or
text data:
- databases have been used for a long time for this purpose
- with a greater emphasis on structure and less on the nuances of
text
- not fundamentally different, just a different emphasis: on what is
consistent vs. on what is variable
- databases tend to cease modelling at the point where the text
becomes highly variable: within paragraphs, within lines of poetry: the
markup of individual words
- though now with XML databases, we’re starting to see approaches
that are really hybrids (see the sketch below)
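To make the contrast concrete, here is a sketch (the letter, names,
and table layout are all invented) of the same item modelled both
ways: a relational row capturing what is consistent across records,
and inline markup reaching down into the variable prose:

    # A sketch of the database/markup contrast, using an invented letter.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE letters (id TEXT, sender TEXT, year INTEGER)")
    db.execute("INSERT INTO letters VALUES ('L001', 'Jane Smith', 1740)")
    # The database stops modelling where the text becomes highly variable:
    print(db.execute("SELECT sender, year FROM letters").fetchone())

    # Markup can continue modelling down to the individual word:
    encoded = ("<letter n='L001'>Dear <persName>Tom</persName>, "
               "I write to you from <placeName>Bristol</placeName>.</letter>")
    print(encoded)

An XML database stores and queries documents like the marked-up letter
directly, which is what makes the hybrid of the two emphases possible.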