Once the project objectives and document context have been established, you can move on to an analysis of the physical object. The first step is to classify the source texts. Defining the document type is a critical part of the digitization process, as it establishes the foundation for the initial understanding of the text's structure. At this point you should have an idea of which documents are going to be digitized for the project. Even if you are not sure precisely how many texts will be in the final project, it is important to have a representative sample of the types of documents being digitized. Examine the sample documents and decide what categories they fall under. The structure and content of a letter will differ greatly from that of a novel or poem, so it is critical to make these classifications early in the process. There are structural differences not only between document types but also within the same type. One novel might consist solely of prose, another might be composed of prose and images, and yet another might have letters and poetry scattered throughout the prose narrative. Having an honest representative sample will provide you with the structural information needed to make fundamental encoding decisions.
Deciding upon document type will give you an initial sense of the shape of the text. There are basic structural assumptions that come with classification: looking for the stanzas in poetry or the paragraphs in prose, for example. Having established the document type, you can begin to assign the texts a more detailed structure. Without worrying about the actual tag names, which come later in the process, label all of the features you wish to encode. For example, if you are digitizing a novel, you might initially break it into large structural units: title page, table of contents, preface, body, back matter, etc. Once this is done you might move on to smaller features: titles, heads, paragraphs, catchwords, pagination, plates, annotations and so forth. One way to keep the naming in perspective is to create a structural outline. This will allow you to see how the structure of your document is developing, whether you have omitted any necessary features, and whether you have labelled too much.
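As a rough illustration, a structural outline of this kind can be kept as a simple nested list of working labels and compared against the features actually seen in the sample. The sketch below is only one way to do this: the assignment of features to units, and the helper names `render_outline` and `unlabelled`, are assumptions made for the example, and the labels are working names rather than final tag names.

```python
# Hypothetical structural outline for a novel. The large units and smaller
# features are the ones named in the text; which feature sits in which unit
# is an assumption made purely for illustration.
outline = {
    "title page": ["title"],
    "table of contents": ["head"],
    "preface": ["head", "paragraph"],
    "body": ["head", "paragraph", "catchword", "pagination", "plate", "annotation"],
    "back matter": ["head", "paragraph"],
}


def render_outline(outline):
    """Return the outline as indented text, one label per line."""
    lines = []
    for unit, features in outline.items():
        lines.append(unit)
        for feature in features:
            lines.append("  " + feature)
    return "\n".join(lines)


def unlabelled(outline, observed):
    """Return features observed in the sample that the outline omits."""
    labelled = set(outline)
    for features in outline.values():
        labelled.update(features)
    return sorted(set(observed) - labelled)


print(render_outline(outline))

# Features noted while examining a sample volume (hypothetical).
print(unlabelled(outline, ["paragraph", "plate", "footnote"]))
```

Reviewing the printed outline makes it easier to spot a unit that has been labelled in too much detail, while the second call lists anything seen in the sample that has not yet been given a label.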
Once the features to be encoded have been decided upon, the relationships between them can be examined. Establishing the hierarchical sequence of the document should not be too arduous a task, especially if you have already developed a structural outline. If we stick with the example of a novel, it should at this point be apparent that the work is contained within front matter, body matter, and back matter. Within front matter we find such things as epigraphs, prologues, and title pages. The body matter is composed of chapters, which are constructed from paragraphs. Within the paragraphs can be found quotations, figures, and notes. This is an established and understandable hierarchy. There is also a sequential relationship, where one element logically follows another. Using the above representation, if every body has chapters, paragraphs, and notes, then you would expect to find a sequence of <chapter> then <paragraph> then <note>, not <chapter>, <note>, then <paragraph>. Again, the more you understand about the type of text you are encoding, the easier this process will be. While the level of structural encoding will ultimately depend upon the project objectives, this is an opportune time to explore the form of the text in as much detail as possible. These data will inform later encoding decisions, and referring to them will be much easier than sifting through the physical object at a later point to resolve a structural dilemma.
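The hierarchy described above can be written down as a small table of which features may sit directly inside which, and a sample text can then be checked against it. The following is a minimal sketch under stated assumptions: the label names, the tuple representation of a document, and the function `find_misplaced` are all hypothetical and chosen only for this example.

```python
# Working hierarchy for a novel, following the description in the text.
# Keys are containers; values list the features expected directly inside them.
ALLOWED_CHILDREN = {
    "novel": ["front", "body", "back"],
    "front": ["title page", "epigraph", "prologue"],
    "body": ["chapter"],
    "chapter": ["paragraph"],
    "paragraph": ["quotation", "figure", "note"],
}


def find_misplaced(node, allowed=ALLOWED_CHILDREN):
    """Walk a (label, children) tree and report children in unexpected places."""
    label, children = node
    problems = []
    for child in children:
        child_label = child[0]
        if child_label not in allowed.get(label, []):
            problems.append(f"{child_label} not expected directly inside {label}")
        problems.extend(find_misplaced(child, allowed))
    return problems


# chapter, then paragraph, then note: matches the expected sequence.
expected = ("body", [("chapter", [("paragraph", [("note", [])])])])

# chapter, then note, then paragraph: the note appears outside any paragraph.
unexpected = ("body", [("chapter", [("note", []), ("paragraph", [])])])

print(find_misplaced(expected))
print(find_misplaced(unexpected))
```

A report of this kind does not mean the source is wrong. It may simply show that the working hierarchy needs to allow for a feature in a place you had not anticipated.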
The analysis also brings to light any issues or problems with the physical document. Are parts of the source missing? Has the text been water damaged, leaving certain lines unreadable? If the document is a manuscript or letter, is the handwriting illegible? These are all instances that can be explored at an early stage of the project. While these problems will add a level of complexity to the encoding project, they must be dealt with in an honest fashion. If the words of a letter are illegible and you insert text that represents your best guess at the actual wording, then this intervention needs to be encoded. The beauty of document analysis is that by examining the documents prior to digitization you stand a good chance of recognising these issues and establishing an encoding methodology for them. The benefit of this is threefold: first, having identified and dealt with these problems at the start, you will have fewer issues arise during the digitization process; second, there will be an added level of consistency during the encoding stage, and retrospective revision will not be necessary; third, the project will benefit from the thorough level of accuracy desired and expected by the scholarly community.
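One way to make such a methodology concrete before choosing tag names is to record, for every stretch of text, whether it was read directly, supplied as a best guess, or lost altogether. The sketch below assumes a hypothetical `Segment` record and status values ("legible", "guess", "lost"). The sample wording is invented for the example and does not come from any real letter.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    status: str = "legible"  # assumed values: "legible", "guess", "lost"
    reason: str = ""         # e.g. "water damage", "illegible hand"


def render(segments):
    """Return a reading text in which every editorial intervention is visible."""
    parts = []
    for seg in segments:
        if seg.status == "legible":
            parts.append(seg.text)
        elif seg.status == "guess":
            parts.append(f"[{seg.text}?]")  # best guess, flagged as such
        else:
            parts.append("[lost]")          # nothing recoverable
    return " ".join(parts)


def interventions(segments):
    """List every non-legible segment with the reason that was recorded."""
    return [(s.status, s.text, s.reason) for s in segments if s.status != "legible"]


# Invented sample line from a damaged letter.
line = [
    Segment("I remain your"),
    Segment("obedient", status="guess", reason="illegible hand"),
    Segment("", status="lost", reason="water damage"),
]

print(render(line))
print(interventions(line))
```

Keeping the status and the reason alongside the text means the decision made during analysis is applied the same way every time, and it can later be mapped onto whatever encoding the project adopts.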