Recognizing subdivisions
For projects dealing with printed books, recognizing subdivisions is usually fairly straightforward. Any subdivisions which are important to the comprehension of the text have typically been clearly signalled by the printer, using headings, ornamental dividers, or other typographical signals. A crucial part of the document analysis is determining what signals are used.
However, there are at least three kinds of cases where recognizing subdivisions may be more complex:
- In very cheaply printed books, subdivisions may have been marked haphazardly, or idiosyncrasies of printing may obscure subdivision markers.
- In very early printed books, the conventions of sectioning are not as firmly established as in later books, and in particular the features we expect (e.g. a clear hierarchy of subdivisions, sub-subdivisions, etc. that can be laid out as an outline) are not in evidence.
- Certain narrative genres contain cases where textual subdivisions are signalled only by the narrative (for instance, nested stories, signalled perhaps by a phrase like “So she began the tale of the beggar’s wife…”) rather than by any clear printed mark. In cases like these, both the boundary between subdivisions and the manner in which they nest within one another may be unclear; it may be unclear, for instance, whether a given story is being told by a speaker or by a character in the speaker’s story.
In these situations, one might imagine wanting to indicate alternative boundary points (for instance, to show that a given subdivision might begin in one of two possible places), or to indicate uncertainty about a boundary point. This kind of encoding is certainly possible, although using the information might prove more difficult; you would need to decide what sort of behavior you want from your interface. Before undertaking encoding of this sort, it is worth considering what it would gain you and how, if at all, you plan to use it.
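To make the question of interface behavior more concrete, here is a minimal sketch in Python of how alternative boundary points and a degree of certainty might be modeled once they have been extracted from an encoded text. It is not tied to any particular encoding scheme: the class names, the character offsets, and the numeric certainty scale are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BoundaryCandidate:
    """One possible starting point for a subdivision (hypothetical model)."""
    offset: int        # assumed: character offset into the transcription
    certainty: float   # assumed scale: 0.0 (pure guess) to 1.0 (certain)
    note: str = ""     # the encoder's reason for proposing this point


@dataclass
class Subdivision:
    """A subdivision whose starting point may be open to interpretation."""
    label: str
    candidates: List[BoundaryCandidate] = field(default_factory=list)

    def is_ambiguous(self) -> bool:
        # More than one recorded starting point means the boundary is contested.
        return len(self.candidates) > 1

    def preferred_start(self) -> Optional[BoundaryCandidate]:
        # One possible interface policy: show the most certain candidate.
        if not self.candidates:
            return None
        return max(self.candidates, key=lambda c: c.certainty)


# Illustrative values only; the offsets and certainties are invented.
tale = Subdivision(
    label="the tale of the beggar's wife",
    candidates=[
        BoundaryCandidate(1145, 0.4, "begins with the framing sentence"),
        BoundaryCandidate(1200, 0.6, "begins after the framing sentence"),
    ],
)

start = tale.preferred_start()
```

Even in a sketch this small, the interface has to adopt a policy: here `preferred_start` silently picks the most certain candidate, whereas a different interface might display every candidate, or flag the subdivision as ambiguous and leave the choice to the reader. Deciding which of these you actually want is the kind of planning worth doing before recording the alternatives at all.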