Creating and Documenting Electronic Texts


Chapter 2: Document Analysis

2.1: What is document analysis?

Deciding to create an electronic text is just like deciding to begin any other type of construction project. While the desire to dive right in and begin building is tempting, any worthwhile endeavour will begin with a thorough planning stage. In the case of digitized text creation, this stage is called document analysis. Document analysis is the task of examining the physical object in order to acquire an understanding of the work being digitized and to decide what the purpose and future of the project entail. The digitization of texts is not simply making groups of words available to an online community; it involves the creation of an entirely new object. This is why achieving a sense of what it is that you are creating is critical. The blueprint for construction will allow you to define the foundation of the project. It will also allow you to recognise any problems or issues that have the potential to derail the project at a later point.

Document analysis is all about definition — defining the document context, defining the document type and defining the different document features and relationships. At no other point in the project will you have the opportunity to spend as much quality time with your document. This is when you need to become intimately acquainted with the format, structure, and content of the texts. Document analysis is not limited to physical texts, but as the goal of this guide is to advise on the creation of digital texts from the physical object this will be the focus of the chapter. For discussions of document analysis on objects other than text, please refer to such studies as the Yale University Library Project Open Book, the Library of Congress American Memory Project and National Digital Library Program, and Scoping the Future of Oxford's Digital Collections.

2.2: How should I start?

2.2.1: Project objectives

One of the first tasks to perform in document analysis is to define the goals of the project and the context under which they are being developed. This could be seen as one of the more difficult tasks in the document analysis procedure, as it relies less upon the physical analysis of the document and more upon the theoretical positions taken with the project. This is the stage where you need to ask yourself why the document is being encoded. Are you looking simply to preserve a digitized copy of the document in a format that will allow an almost exact future replication? Is your goal to encode the document in a way that will assist in a linguistic analysis of the work? Or perhaps there will be a combination of structural and thematic encoding, so that users will be able to perform full-text searches of the document? Regardless of the choice made, the project objectives must be carefully defined, as all subsequent decisions hinge upon them.

It is also important to take into consideration the external influences on the project. Often the bodies that oversee digitization projects, either in a funding or advisory capacity, have specific conditions that must be fulfilled. They might for example have markup requirements or standards (linguistic, TEI/SGML, or EAD perhaps) that must be taken into account when establishing an encoding methodology. Also, if you are creating the electronic text for scholarly purposes, then it is very likely that the standards of this community will need to be adhered to. Again, it must be remembered that the electronic version of a text is a distinct object and must be treated as such. Just as you would adhere to a publishing standard of practice with a printed text, so must you follow the standard for electronic texts. The most stringent scholarly community, the textual critics and bibliographers, will have specific, established guidelines that must be considered in order to gain the requisite scholarly authority. Therefore, if you were creating a text to be used or approved by this community their criteria would have to be integrated into the project standards, with the subsequent influence on both the objectives and the creative process taken into account. If the digitization project includes image formats, then there are specific archiving standards held by the electronic community that might have to be met — this will not only influence the purchase of hardware and software, but will have an impact on the way in which the electronic object will finally be structured. External conditions are easily overlooked during the detailed analysis of the physical object, so be sure that the standards and policies that influence the outcome of the project are given serious thought, as having to modify the documents retrospectively can prove both detrimental and expensive.

This is also a good time to evaluate who the users of your project are likely to be. While you might have personal goals to achieve with the project — perhaps a level of encoding that relates to your own area of expertise — many of the objectives will relate to your user base. Do you see the work being read by secondary school pupils? Undergraduates? Academics? The general public? Be prepared for the fact that every user will want something different from your text. While you cannot satisfy each desire, trying to evaluate what information might be the most important to your audience will allow you to address the needs and concerns you deem most appropriate and necessary. Also, if there are specific objectives that you wish users to derive from the project then this too needs to be established at the outset. If the primary purpose for the texts is as a teaching mechanism, then this will have a significant influence on how you choose to encode the document. Conversely, if your texts are being digitized so that users will be able to perform complex thematic searches, then both the markup of content and the content of the markup will differ somewhat. Regardless of the decision, be sure that the outcome of this evaluation becomes integrated with the previously determined project objectives.

You must also attempt to assess what tools users will have at their disposal to retrieve your document. The hardware and software capabilities of your users will differ, sometimes dramatically, and will most likely present some sort of restriction or limitation upon their ability to access your project. SGML-encoded text requires specialised software, such as Panorama, to be read. Even HTML has tagsets that early browsers may not be able to read. It is essential that you take these variants into consideration during the planning stage. There might be priorities in the project that require accessibility for all users, which would affect the methodology of the project. However, don't let user limitations stunt the encoding goals for the document. Hardware and software are constantly being upgraded, so although some of the encoding objectives might not be fully functional during the initial stages of the project, they stand a good chance of becoming accessible in the near future.

2.2.2: Document context

The first stage of document analysis is not only necessary for detailing the goals and objectives of the project, but also serves as an opportunity to examine the context of the document. This is a time to gather as much information as possible about the documents being digitized. The amount gathered varies from project to project, but in an ideal situation you will have a complete transmission and publication history for the document. There are a few key reasons for this. Firstly, knowing how the object being encoded was created will allow you to understand any textual variations or anomalies. This, in turn, will assist in making informed encoding decisions at later points in the project. The difference between a printer error and an authorial variation not only affects the content of the document, but also the way in which it is marked up. Secondly, the depth of information gathered will give the document the authority desired by the scholarly community. A text about which little is known can only be used with much hesitation. While some users might find it more than acceptable for simply printing out or reading, there can be no authoritative scholarly analysis performed on a text with no background history. Thirdly, a quality electronic text will have a TEI header attached (see Chapter 6). The TEI header records all the information about the electronic text's print source. The more you know about the source, the fuller and more conclusive your header will be — which will again provide scholarly authority. Lastly, understanding the history of the document will allow you to understand its physicality.
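As a sketch of the kind of source documentation the header captures, a minimal TEI header might look something like the following (the work, publisher and dates are purely illustrative):

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>North and South: an electronic edition</title>
    </titleStmt>
    <publicationStmt>
      <p>Unpublished draft for project-internal use.</p>
    </publicationStmt>
    <sourceDesc>
      <!-- Bibliographic record of the print source being digitized -->
      <bibl>
        <author>Gaskell, Elizabeth</author>
        <title>North and South</title>
        <pubPlace>London</pubPlace>
        <publisher>Chapman and Hall</publisher>
        <date>1855</date>
      </bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
```

The &lt;sourceDesc&gt; element is where the publication history gathered at this stage is eventually recorded; the fuller the record, the greater the scholarly authority of the electronic text.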

The physicality of the text is an interesting issue — and one on which very few scholars fully agree. Clearly, an understanding of the physical object provides a sense of the format, necessary for a proper structural encoding of the text, but it also augments a contextual understanding. Peter Shillingsburg theorises that the 'electronic medium has extended the textual world; it has not overthrown books nor the discipline of concentrated "lines" of thought; it has added dimensions and ease of mobility to our concepts of textuality' (Shillingsburg 1996, 164). How is this so? Simply put, the electronic medium will allow you to explore the relationships in and amongst your texts. While the physical object has trained readers to follow a more linear narrative, the electronic document will provide you with an opportunity to develop the variant branches found within the text. Depending upon the decided project objectives, you are free to highlight, augment or furnish your users with as many different associations as you find significant in the text. Yet to do this, you must fully understand the ontology of the texts and then be able to delineate this textuality through the encoding of the computerised object.

It is important to remember that the transmission history does not end with the publication of the printed document. Tracking the creation of the electronic text, including the revision history, is a necessary element of the encoding process. The fluidity of electronic texts precludes the guarantee that every version of the document will remain in existence, so the responsibility lies with the project creator to ensure that all revisions and developments are noted. While some of the documentation might seem tedious, an electronic transmission history will serve two primary purposes. One, it will help keep the project creator(s) aware of what has developed in the creation of the electronic text. If there are quite a few staff members working on the documents, you will be able to keep track of what has been accomplished with the texts and to check that the project methodology is being followed. Two, users of the documents will be able to see what emendations or regularisations have been made and to track what the various stages of the electronic object were. Again, this will prove useful to a scholarly community, like the textual critics, whose research is grounded in the idea of textual transmission and history.
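In TEI terms, this electronic transmission history is typically recorded in the revision description of the header; the entries below are invented for illustration, listed most recent first:

```xml
<revisionDesc>
  <change>
    <date>1999-03-10</date>
    <respStmt><name>KW</name><resp>encoder</resp></respStmt>
    <item>Regularised long s throughout, following the project methodology.</item>
  </change>
  <change>
    <date>1999-02-01</date>
    <respStmt><name>KW</name><resp>encoder</resp></respStmt>
    <item>Completed first proofreading of chapters 1 to 10 against the source.</item>
  </change>
</revisionDesc>
```

Each &lt;change&gt; entry names who made the emendation, when, and what was done — precisely the record of stages that both project staff and scholarly users need.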

2.3: Visual and structural analysis

Once the project objectives and document context have been established, you can move on to an analysis of the physical object. The first step is to provide the source texts with a classification. Defining the document type is a critical part of the digitization process as it establishes the foundation for the initial understanding of the text's structure. At this point you should have an idea of what documents are going to be digitized for the project. Even if you are not sure precisely how many texts will be in the final project, it is important to have a representative sample of the types of documents being digitized. Examine the sample documents and decide what categories they fall under. The structure and content of a letter will differ greatly from that of a novel or poem, so it is critical to make these naming classifications early in the process. Not only are there structural differences between varying document types but also within the same type. One novel might consist solely of prose, another might combine prose and images, while yet another might have letters and poetry scattered throughout the prose narrative. Having an honest representative sample will provide you with the structural information needed to make fundamental encoding decisions.

Deciding upon document type will give you an initial sense of the shape of the text. There are basic structural assumptions that come with classification: looking for the stanzas in poetry or the paragraphs in prose for example. Having established the document type, you can begin to assign the texts a more detailed structure. Without worrying about the actual tag names, as this comes later in the process, label all of the features you wish to encode. For example, if you are digitizing a novel, you might initially break it into large structural units: title page, table of contents, preface, body, back matter, etc. Once this is done you might move on to smaller features: titles, heads, paragraphs, catchwords, pagination, plates, annotations and so forth. One way to keep the naming in perspective is to create a structure outline. This will allow you to see how the structure of your document is developing, whether you have omitted any necessary features, or if you have labelled too much.
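Such a structure outline need be nothing more elaborate than an indented list; a partial sketch for a hypothetical novel might read:

```
Novel
  Title page
  Table of contents
  Preface
  Body
    Chapter
      Head
      Paragraph
        Quotation
        Annotation (footnote)
      Plate
  Back matter
    Index
  Running features: pagination, catchwords
```

Reviewing an outline like this against the sample documents quickly shows where a feature has been omitted or where the labelling has become too fine-grained.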

Once the features to be encoded have been decided upon, the relationships between them can then be examined. Establishing the hierarchical sequence of the document should not be too arduous a task — especially if you have already developed a structural outline. It should at this point be apparent, if we stick with the example of a novel, that the work is contained within front matter, body matter, and back matter. Within front matter we find such things as epigraphs, prologues, and title pages. The body matter is comprised of chapters, which are constructed with paragraphs. Within the paragraphs can be found quotations, figures, and notes. This is an established and understandable hierarchy. There is also a sequential relationship where one element logically follows another. Using the above representation, if every body has chapters, paragraphs, and notes, then you would expect to find a sequence of <chapter> then <paragraph> then <note>, not <chapter>, <note>, then <paragraph>. Again, the more you understand about the type of text you are encoding, the easier this process will be. While the level of structural encoding will ultimately depend upon the project objectives, this is an opportune time to explore the form of the text in as much detail as possible. Having these data will influence later encoding decisions, and being able to refer to these results will be much easier than having to sift through the physical object at a later point to resolve a structural dilemma.
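Expressed in TEI-style markup (the element names follow the TEI Guidelines, but the content is invented), the hierarchy just described nests like this:

```xml
<text>
  <front>
    <titlePage> ... </titlePage>
    <div type="preface"><p> ... </p></div>
  </front>
  <body>
    <div type="chapter" n="1">
      <head>Chapter I</head>
      <p>An opening paragraph, which may itself contain a
         <q>quotation</q>, a figure, or a
         <note place="foot">footnote</note>.</p>
    </div>
  </body>
  <back>
    <div type="index"> ... </div>
  </back>
</text>
```

The nesting makes both relationships explicit: the hierarchy (notes sit inside paragraphs, paragraphs inside chapters) and the sequence (front matter precedes body, which precedes back matter).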

The analysis also brings to light any issues or problems with the physical document. Are parts of the source missing? Perhaps the text has been water damaged and certain lines are unreadable? If the document is a manuscript or letter perhaps the writing is illegible? These are all instances that can be explored at an early stage of the project. While these problems will add a level of complexity to the encoding project, they must be dealt with in an honest fashion. If the words of a letter are illegible and you insert text that represents your best guess at the actual wording then this needs to be encoded. The beauty of document analysis is that by examining the documents prior to digitization you stand a good chance of recognising these issues and establishing an encoding methodology. The benefit of this is threefold: firstly, having identified and dealt with this problem at the start you will have fewer issues arise during the digitization process; secondly, there will be an added level of consistency during the encoding stage and retrospective revision won't be necessary; thirdly, the project will benefit from the thorough level of accuracy desired and expected by the scholarly community.
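The TEI provides elements for exactly this situation: a gap for text omitted as unreadable, and markers for doubtful or editorially supplied readings. A sketch, with the wording invented for illustration:

```xml
<p>
  <!-- Two lines lost to water damage, recorded rather than silently skipped -->
  <gap reason="water damage" extent="2 lines"/>
  ... I remain your
  <!-- A reading the encoder can make out only doubtfully -->
  <unclear reason="illegible">affectionate</unclear> friend,
  <!-- The encoder's best guess at an illegible word -->
  <supplied reason="illegible" resp="KW">Margaret</supplied>.
</p>
```

The resp attribute records who made the guess, preserving exactly the honesty and consistency that the encoding methodology should guarantee.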

This is also a good time to examine the physical document and attempt to anticipate problems with the digitization process. Fragile spines, flaking or foxed paper, badly inked text, all will create difficulties during the scanning process and increase the likelihood of project delays if not anticipated at an early stage. This is another situation that requires examining representative samples of texts. It could be that one text was cared for in the immaculate conditions of a Special Collections facility while another was stored in a damp corner of a bookshelf. You need to be prepared for as many document contingencies as possible. Problems not only arise out of the condition of the physical object, but also out of such things as typography. OCR digitization is heavily reliant upon the quality and type of fonts used in the text. As will be discussed in greater detail in Chapter 3, OCR software is optimised for laser quality printed text. This means that the older the printed text, the more degradation in the scanning results. These types of problems are critical to identify, as decisions will have to be made about how to deal with them — decisions that will become a significant part of the project methodology.

2.4: Typical textual features

The final stage of document analysis is deciding which features of the text to encode. Once again, knowing the goals and objectives of the project will be of great use as you try to establish the breadth of your element definition. You have control over how much of the document you want to encode, taking into account how much time and manpower are dedicated to the project. Once you've made a decision about the level of encoding that will go into the project, you need to make the practical decision of what to tag. There are three basic categories to consider: structure, format and content.

In terms of structure there are quite a few typical elements that are encoded. This is a good time to examine the structural outline to determine what skeletal features need to be marked up. In most cases, the primary divisions of text — chapters, sections, stanzas, etc. — and the supplementary parts — paragraphs, lines, pages — are all assigned tag names. With structural markup, it is helpful to know how detailed an encoding methodology is being followed. As you will discover, you can encode almost anything in a document, so it will be important to have established what level of markup is necessary and to then adhere to those boundaries.

The second step is to analyse the format of the document. What appearance-based features need to translate between the print and electronic objects? Some of the common elements relate to attributes such as bold, italic and typeface. Then there are other aspects that take a bit more thought, such as special characters. These require entity references, for example &AElig; for Æ. However, cases do exist of characters which cannot be encoded, and alternate provisions must be made. Format issues also include notes and annotations (items that figure heavily in scholarly texts), marginal glosses, and indentations. Elements of format are easily forgotten, so be sure to go through the representative documents and choose the visual aspects of the text that must be carried through to the electronic object.
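In TEI encoding, such format features are commonly carried by the rend attribute and by standard entity references; the sentence below is invented for illustration:

```xml
<p>The <hi rend="italic">Quarterly Review</hi> noted that
   &AElig;schylus appears only in a marginal gloss:
   <note place="margin">cf. the <hi rend="gothic">Agamemnon</hi></note>.</p>
```

Marking the rendition rather than the mere appearance means the electronic text records that a phrase was italicised, leaving the display software free to realise that however it can.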

The third encoding feature concerns document content. This is where you will go through the document looking for features that are neither structural nor format based. This is the point where you can highlight the content information necessary to the text and the user. Refer back to the decisions made about textual relationships and what themes and ideas should be highlighted. If, for example, you are creating a database of author biographies you might want to encode such features as author's name, place of birth, written works, spouse, etc. Having a clear sense of the likely users of the project will make these decisions easier — and perhaps more straightforward. This is also a good time to evaluate what the methodology will be for dealing with textual revisions, deletions, and additions — either authorial or editorial. Again, it is not so critical here to define what element tags you are using but rather to arrive at a listing of features that need to be encoded. Once these steps have been taken you are ready to move on to the digitization process.
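At this stage a feature list is enough, but it can help to sketch how content markup might eventually look. The biography element names below are invented for illustration (a real project would map them to its chosen DTD); the revision example uses the TEI elements for additions and deletions:

```xml
<!-- Hypothetical content markup for an author-biography database -->
<bio>
  <name type="author">Elizabeth Gaskell</name>
  <birthPlace>London</birthPlace>
  <spouse>William Gaskell</spouse>
  <work>North and South</work>
</bio>

<!-- An authorial revision, encoded rather than silently accepted -->
<p>She walked <del hand="author">slowly</del>
   <add hand="author">swiftly</add> home.</p>
```

Content markup of this kind is what makes complex thematic searching possible: a user can ask for every author born in London, rather than merely every occurrence of the word.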

© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party. 
Arts and Humanities Data Service 