Creating & Documenting Electronic Texts


Chapter 6: Documentation and Metadata

6.1 What is Metadata and why is it important?

Simply put, metadata is one piece of data which describes another piece of data. In the context of digital resources, a typical metadata record would be expected to describe the nature of a resource, who created it, what format it is held in, where it is held, and so on. In recent years the issue of metadata has become a serious topic for those concerned with the creation and management of digital resources. When digital resources first started to emerge, much of the focus of activity was centred on the creation process, without much thought as to how these resources would be documented and found by others. In the academic arena, announcements of the availability of resources tended to be made within an interested community, usually through subject-based discussion lists. However, as use of the web has steadily increased, many institutions have come to depend on it as a crucial means of storing and distributing information. The means by which this information is organized has now become a central issue if the web is to continue to be an effective tool for the digital information age.

While there is an overwhelming consensus that a practical metadata model is required, a single one has yet to emerge which will satisfy the needs of the net community as a whole. This section of the Guide will look at two metadata models currently in use, the Dublin Core Element Set and the TEI Header, but we begin with an overview of the problem as it stands at the moment.

The concept of metadata has been around much longer than the web, and while there exist a great number of metadata formats, it is most often associated with the work of the library community. The web is commonly likened to an enormous library for the digital age, and while this analogy may not stand up to any serious scrutiny, it is a useful one to make as it highlights the perceived problems associated with metadata and digital resources and points towards possible solutions. At its inception the web was neither designed nor intended as a forum for the organised publication and retrieval of information, and therefore no system for effectively cataloguing information held on the web was devised. Due to this lack of formal cataloguing procedures the web has evolved into a "chaotic repository for the collective output of the world's digital 'printing presses'" (Lynch 1997). Locating an item on a library shelf is a relatively simple task due to our familiarity with a long-established procedure for doing so. Library metadata systems, such as MARC, follow a strictly defined set of rules which are applied by a trained body of professionals. The web has few such parallels.

One of the most common ways of locating items on the web is via a search engine, and it is to these that the proper application of metadata would be most beneficial. While search engines are undeniably powerful, they do not operate in an effective and precise enough way to make them trustworthy tools of retrieval. It is estimated that there are in the region of three and a half million web sites containing five hundred million unique files (OCLC Web characterisation Project, June 99), but only one third of these are indexed by search engines. The web contains much that is difficult to catalogue in a straightforward manner: multimedia packages, audio and visual material, not to mention pages which are automatically generated, all demand consideration in any system which attempts to catalogue them. The method by which search engines index a web site is based on the frequency of occurrence of the words which appear in a document rather than on any real notion of its content. The indiscriminate nature of the searches not only makes it difficult to find what you are looking for but often buries any potentially useful information in a flurry of unwanted, unrelated “hits”. The growing commercialisation of the web has influenced the nature of search engines and made them even more unreliable and of dwindling practical use to the academic community.

While search engines are now able to make better use of the HTML <META> tag (although the tag can be abused by index spamming), it is perhaps a case of too little too late. Initiatives such as the Dublin Core go some way towards redressing the balance, but these are still being refined and have numerous shortcomings. In an attempt to maintain its simplicity, the Dublin Core trades off much of its potential precision in a quest for general acceptance, and so fails to achieve its hoped-for functionality. The Dublin Core element set is, in places, too general to coherently describe the complex relationships which exist within many digital resources, and lacks the required rigidity, in areas such as the use of controlled vocabulary, to make it easily interoperable. This applies particularly to the original 15 unqualified elements, but the work of bodies such as the Dublin Core Data Model working group, implementing Dublin Core in RDF/XML, is providing potential solutions to these problems. While a single metadata scheme, adopted and implemented wholesale, would be the ideal, it is likely that a proliferation of metadata schemes will emerge and be used by different communities. This makes the current work centred on integrated services and interoperability all the more important.

Conclusion and current developments

A solution to the problem of how to document data on the web, so that it can be located and retrieved with the minimum of effort, is now essential if the web is to continue to thrive as a major provider of our daily resources. It is generally recognized that what is required is a metadata scheme which contains “the librarian’s classification and selection skills…complemented by the computer scientist’s ability to automate the task of indexing and storing information” (Lynch 1997). Existing models do not go far enough in providing a framework that satisfies the precise requirements of different communities and discipline groups, and until clear guidelines become available on how metadata records should be created in a standardized way, little progress will be made. In the foreseeable future it is unlikely that some outside agent will prepare your metadata for you, and proper investment in web cataloguing methods is essential if they are to be implemented successfully.

New developments and proposals are being investigated in an attempt to find solutions in the face of these seemingly insurmountable problems. The Warwick Framework, for example, suggests the concept of a container architecture, which can support the coexistence of several independently developed and maintained metadata packages which may serve other functions (rights management, administrative metadata, etc.). Rather than attempt to provide a metadata scheme for all web resources, the Warwick Framework uses the Dublin Core as a starting point, but allows individual communities to extend this to fit their own subject-specific requirements. This movement towards a more decentralized, modular and community-based solution, where the "communities of expertise" themselves create the metadata they need, has much to offer. In the UK, various funded organisations such as the AHDS, and projects like ROADS and DESIRE, are all involved in assisting the development of subject-based information gateways that provide metadata-based services tailored to the needs of particular user communities.

It is clear that there is still some way to go before the problems of metadata for describing digital resources have been adequately resolved. Initiatives created to investigate the issues are still in their infancy, but hopefully solutions will be found, either globally or within distinct communities, which will provide a framework simple enough to be used by the maximum number of people with the minimum degree of inconvenience.

6.2 The TEI Header

The work and objectives of the Text Encoding Initiative (TEI) and the guidelines it produced for text encoding and interchange have already been discussed in the previous chapter. In this section dealing with metadata, we will focus on how the TEI has approached the problems particular to the effective documentation of electronic texts. This section will look at the TEI Header and, specifically, the version of the header as provided by the TEI Lite DTD.

Unlike the Dublin Core element set the TEI Header is not designed specifically for describing and locating objects on the web although it can be used for this purpose. The TEI Header provides a mechanism for fully documenting all aspects of an electronic text. The TEI Header does not only limit itself to documenting the text but also provides a system for documenting its source, its encoding practices, and the process of its creation. The TEI Header is therefore an essential resource of information for users of the text, for software that has to process the metadata information, and for cataloguers in libraries, museums, and archives. In contrast with the Dublin Core, whose inclusion in any document is voluntary, the presence of the TEI Header is mandatory if the document is to be considered TEI conformant.

As with the full TEI Lite tag-set, the TEI Header offers a number of elements for use in a structured way, of which only one, the <fileDesc>, is mandatory. These elements can be extended by the addition of attributes. The TEI Header can therefore range from a very large and complex document to a simple, concise piece of metadata. The most basic valid TEI Lite header would look something like:

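<teiHeader>
<fileDesc>
<titleStmt><title>A guide to good practice</title></titleStmt>
<publicationStmt><p>Published by the AHDS, 1999</p></publicationStmt>
<sourceDesc><p>A dual web and print publication</p></sourceDesc>
</fileDesc>
</teiHeader>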

At its simplest a TEI Lite Header requires no more than a description of the electronic file itself, a description which includes some kind of statement on what the text is called, how it is published, and if it has been derived or transcribed from another source.
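The three statements just described can also be assembled programmatically. The following is a minimal sketch in Python rather than a definitive implementation: it assumes the header is being produced as well-formed XML (the standard library has no SGML writer), and the function name minimal_tei_header and the use of <p> to hold the prose statements are illustrative choices, not something prescribed by this Guide.

```python
import xml.etree.ElementTree as ET


def minimal_tei_header(title, publication, source):
    """Build a header holding only the mandatory <fileDesc> and its
    three mandatory children (hypothetical helper, for illustration)."""
    header = ET.Element("teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")

    # What the text is called
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # How the text is published (prose held in <p> is an assumption)
    publication_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication_stmt, "p").text = publication

    # Where the text was derived or transcribed from
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source
    return header


header = minimal_tei_header(
    "A guide to good practice",
    "Published by the AHDS, 1999",
    "A dual web and print publication",
)
print(ET.tostring(header, encoding="unicode"))
```

Building the header from a function of this kind makes it harder to omit one of the required statements by accident, although it is no substitute for validating the finished document against the DTD.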

A typical TEI Header would hopefully contain more detailed information relating to a document. In general the header should be regarded as providing information analogous to that found on the title page of a printed book, combined with the information usually found in an electronic readme file. As with the Dublin Core <META> tag, the TEI Header appears at the beginning of a text (although it can be held separately from the document), between the SGML prolog (i.e. the SGML declaration and the DTD) and the front matter of the text itself:

<!DOCTYPE tei.2 PUBLIC "-//TEI//DTD TEI Lite 1.6//EN">
<tei.2>
<teiHeader>
[header details go here]
</teiHeader>
<text>
[front matter and the text itself go here]
</text>
</tei.2>


The metadata contained within the TEI Header can also be used as an effective resource for the information management of texts. In the same way that an on-line library catalogue allows different search options and views of a collection, the metadata in the TEI Header can be manipulated to present different access points into a collection of electronic texts. For example, rather than maintain a separate, static catalogue or database, the Oxford Text Archive (OTA) uses the metadata stored in the TEI Headers of its holdings to assist in the identification and retrieval of resources. In addition to being able to perform simple searches for the author or title of a work, users of the OTA catalogue can submit complex queries on a number of available options, such as searching for resources by language, genre, time period, and even by file format.

In addition to its use in dynamically constructing indexes and catalogues, the metadata contained within the TEI Header can also be used to create other metadata and catalogue records. TEI Header metadata can be extracted and mapped onto other well-established resource cataloguing standards, such as library MARC records, or onto emerging standards such as the Dublin Core element set and the Resource Description Framework (RDF). This is a relatively simple task since the TEI Header was closely modelled on existing standards in library cataloguing. For example, the TEI Lite <author> tag within the <titleStmt> is analogous to the 100 MARC AUTHOR record field and also to the Dublin Core CREATOR element. There is no need, therefore, to maintain several different metadata formats when they can simply be filtered from one central information source.
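The kind of filtering described above might be sketched as follows. This is an illustrative Python example only: it assumes the header is available as well-formed XML, and the TEI_TO_DC table and the tei_header_to_dc function are hypothetical names. The <author> to CREATOR correspondence is the one given in the text; the <title> to TITLE correspondence is an assumption added for the sake of the example, as is the sample header.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: path inside the header -> Dublin Core label.
# Only the <author> -> CREATOR pairing comes from the Guide itself.
TEI_TO_DC = {
    "fileDesc/titleStmt/title": "TITLE",
    "fileDesc/titleStmt/author": "CREATOR",
}


def tei_header_to_dc(header):
    """Filter a simple Dublin Core record out of a parsed <teiHeader>."""
    record = {}
    for path, dc_label in TEI_TO_DC.items():
        values = [el.text.strip() for el in header.findall(path) if el.text]
        if values:
            # Dublin Core elements are repeatable, so keep a list
            record[dc_label] = values
    return record


# Sample header invented for illustration
sample_header = ET.fromstring("""
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>King Lear</title>
      <author>William Shakespeare</author>
    </titleStmt>
    <publicationStmt><p>Distributed by a text archive</p></publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition</p></sourceDesc>
  </fileDesc>
</teiHeader>
""")

print(tei_header_to_dc(sample_header))
# {'TITLE': ['King Lear'], 'CREATOR': ['William Shakespeare']}
```

Keeping the crosswalk in a single table means that the header remains the one central source, and further target formats could be supported by adding further tables rather than further copies of the metadata.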


The TEI Lite Header Tag Set

Although the TEI Lite Header has only one required element (the <fileDesc>), it is recommended that all four of the principal elements which comprise the Header be used. The TEI Header provides scope to describe practically all of the textual and non-textual aspects of an electronic text, so when creating a Header it is always advisable to include as much information as possible.

The following overview of the four main elements which go to make up the Header is by no means exhaustive; a more comprehensive account with examples can be found in the Gentle Introduction to SGML.

The four recommended elements which go to make a <teiHeader> are:

<fileDesc>: the file description. This element contains a full bibliographic description of an electronic file.

<encodingDesc>: the encoding description. This element documents the relationship between an electronic text and the source(s) from which it was derived.

<profileDesc>: the profile description. This element provides a detailed description of the non-bibliographic aspects of a text, specifically the languages and sub-languages used, the situation in which it was produced, the participants and their setting.

<revisionDesc>: the revision description. This element summarizes the revision history of a file.

The elements within the TEI Header fall into three broad categories of content:

- Descriptions (containing the suffix Desc) can contain simple prose descriptions of the content of the element. These can also contain specific sub-elements.
- Statements (containing the suffix Stmt) indicate that the element groups together a number of specialized elements recording some structured information.
- Declarations (containing the suffix Decl) enclose information about specific encoding practices applied to the electronic text.
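Because the category is signalled by the element name itself, software can recover it with a simple suffix test. The short Python sketch below illustrates the convention; the names SUFFIX_CATEGORIES and header_category are hypothetical, and elements with none of the three suffixes (such as <extent>) are simply reported as "other".

```python
# Naming convention described above: suffix -> broad category of content
SUFFIX_CATEGORIES = {
    "Desc": "description",
    "Stmt": "statement",
    "Decl": "declaration",
}


def header_category(element_name):
    """Return the broad category implied by a header element's suffix."""
    for suffix, category in SUFFIX_CATEGORIES.items():
        if element_name.endswith(suffix):
            return category
    return "other"


for name in ("fileDesc", "titleStmt", "editorialDecl", "extent"):
    print(name, "->", header_category(name))
```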

The File Description: <fileDesc>

The file description contains a full bibliographic description of the computer file itself. It should provide enough useful information in itself to construct a meaningful bibliographic citation or library catalogue entry. The <fileDesc> contains three mandatory, and four optional elements:

<titleStmt>: groups information relating to the title of the work and those responsible for its intellectual content. Details of any major funding or sponsoring bodies can also be recorded here. This element is mandatory.

<editionStmt>: groups together information relating to one edition of a text. This element may contain information on the edition or version of the electronic work being documented.

<extent>: simply records the size of the electronic text in a recognizable format, e.g. bytes, Mb, words, etc.

<publicationStmt>: records details of the publication or distribution of the electronic text, including a statement on its availability status (e.g. freely available, restricted, forbidden, etc.). An <idno> is also included to provide a useful mechanism for identifying a bibliographic item by assigning it one or more unique identifiers. This element is mandatory.

<seriesStmt>: groups together information about a series, if any, to which a publication belongs. Again an <idno> element is supplied to help with identifying the unique individual work.

<notesStmt>: groups together any notes providing information about a text additional to that recorded in other parts of the bibliographic description. This general element can be used in a variety of ways to record potentially significant details about the text and its features which have not already been accommodated elsewhere in the header.

<sourceDesc>: groups together details of the source or sources from which the electronic edition was derived. This element may contain a simple prose description of the text or more complex bibliographic elements may be employed to provide a structured bibliographic reference for the work. This element is mandatory.
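A simple automated check for the three mandatory elements listed above might look like the following Python sketch. It is illustrative only and is not a replacement for validation against the DTD: it assumes the <fileDesc> is available as well-formed XML, the names MANDATORY_CHILDREN and missing_mandatory are hypothetical, and the sample record is invented.

```python
import xml.etree.ElementTree as ET

# The three children of <fileDesc> described above as mandatory
MANDATORY_CHILDREN = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_mandatory(file_desc):
    """List the mandatory child elements absent from a parsed <fileDesc>."""
    present = {child.tag for child in file_desc}
    return [name for name in MANDATORY_CHILDREN if name not in present]


# Invented sample which deliberately omits <publicationStmt>
sample_file_desc = ET.fromstring(
    "<fileDesc>"
    "<titleStmt><title>King Lear</title></titleStmt>"
    "<extent>1 Mb</extent>"
    "<sourceDesc><p>Transcribed from a printed edition</p></sourceDesc>"
    "</fileDesc>"
)

print(missing_mandatory(sample_file_desc))  # ['publicationStmt']
```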

The Encoding Description: <encodingDesc>

<encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived. The <encodingDesc> can contain a simple prose description detailing such features as the purpose(s) for which the work was encoded, as well as any other relevant information concerning the process by which it was assembled or collected. While there are no mandatory elements within the <encodingDesc>, those available are useful for documenting the rationale behind how and why certain elements have been implemented.

<projectDesc>: used to describe, in prose, the purpose for which the electronic text was encoded (for example, if a text forms a part of a larger collection, or was created with a particular audience in mind).

<samplingDecl>: useful in identifying the rationale behind the sampling procedure for a corpus.

<editorialDecl>: provides details of the editorial principles applied during the encoding of a text, for example it can record whether the text has been normalized or how quotations in a text have been handled.

<tagsDecl>: groups information on how the SGML tags have been used, and how often, within a text.

<refsDecl>: commonly used to identify which SGML elements contain identifying information, and whether this information is represented as attribute values or as content.

<classDecl>: defines which descriptive classification schemes (if any) have been used by other parts of the header.

The Profile Description: <profileDesc>

<profileDesc>: details the non-bibliographic aspects of a text, specifically the languages used in the text, the situation in which the text was produced, and the participants involved in its creation.

<creation>: groups information detailing the time and place of creation of a text.

<langUsage>: records the languages (including dialects, sub-languages, etc.) used in the text.

<textClass>: describes the nature or topic of the text in terms of a standard classification scheme. Included in this element is a useful <keywords> tag which can be used to identify the particular classification scheme used, and which keywords from this scheme were applied.

The Revision Description: <revisionDesc>

<revisionDesc>: provides a detailed system for recording changes made to the text. This element is of particular use in the administration of files, recording when changes were made to a text and by whom. The <revisionDesc> should be updated every time a significant alteration has been made to a text.

The TEI Header: Conclusion

The above overview hopefully demonstrates the comprehensive nature of the TEI Header as a mechanism for documenting electronic texts. The emergence of the electronic text over the past decade has presented librarians and cataloguers with many new challenges. Existing library cataloguing procedures, while inadequate to properly document all the features of electronic texts, were used as a secure foundation onto which additional features directly relevant to the electronic text could be grafted. The TEI Header has proved to be an invaluable tool for those concerned with documenting electronic resources; its supremacy in this field can be measured by the increasing number of electronic text centres, libraries, and archives which have adopted its framework. The Oxford Text Archive has found it indispensable as a means of managing its large collection of disparate electronic texts, not only as a mechanism for creating its searchable catalogue, but as a means of creating other forms of metadata which can communicate with other information systems.

Ironically it is the same generality and flexibility offered by the TEI Guidelines (P3) on creating a header which have hindered the progress of one of the main goals of the TEI and the hopes of the electronic text community as a whole, namely the interoperability and interchangeability of metadata. Unlike the Dublin Core element set, which has a strictly defined set of rules governing its content, the TEI Header has a set of guidelines, which allow for widely divergent approaches to header creation. While this is not a major problem for individual texts, or texts within a single collection, the variant ways in which the guidelines are interpreted and put into practice make interoperability with other systems using TEI Headers more difficult than first imagined. As with the Dublin Core element set, what is required is the wholesale adoption of a mutually acceptable code of practice which header creators could implement. One final aspect of the TEI Header is a cause of irritation to those creating and managing TEI Headers and texts: the apparent dearth of affordable and user-friendly software aimed specifically at header production. While this has long been a general criticism of SGML applications as a whole, the TEI can in no way be held to blame for this absence, as it was not part of the TEI remit to create software. However, it has contributed to the relatively slow uptake and implementation of the TEI Header as the predominant method of providing well-structured metadata to the electronic text community as a whole. Until this situation is adequately resolved, the tools on offer will tend to be either freeware products designed by people within the SGML community itself, or large and very expensive purpose-built SGML-aware products aimed at the commercial market.

Further reading:

The SGML/XML Web Page

Ebenezer's software suite for TEI

TEI home page

6.3 The Dublin Core Element Set and the Arts and Humanities Data Service

"The Dublin Core is a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of web resources, it has also attracted the attention of formal resource description communities such as museums and libraries"

Dublin Core Metadata home page

By the mid-1990s large-scale web users, document creators and information providers had recognized the pressing need to introduce some kind of workable cataloguing scheme for documenting resources on the web. The scheme needed to be accessible enough to be adopted and implemented by typical web content creators who had little or no formal cataloguing training. The set of metadata elements also needed to be simpler than those used in traditional library cataloguing systems but also needed to offer greater retrieval precision than the relatively crude indexing methods employed by existing search engines and web crawlers.

The Dublin Core Metadata Element Set grew out of a series of meetings and workshops consisting of experts from the library world, the networking and digital library research community, and other content specialists.

The basic objectives of the Dublin Core initiative included:

- to produce a core set of descriptive elements which would be capable of describing or identifying the majority of resources available on the internet. Unlike a traditional library, where the main focus is on cataloguing published textual materials, the Internet contains a vast range of material in a variety of formats, including non-textual material such as images and video, most of which has not been 'published' in any formal way.

- to make this scheme intelligible enough that it could be easily utilized by people without formal cataloguing training, while still retaining enough content to function effectively as a catalogue record.

- to encourage the adoption of the scheme on an international level by ensuring that it provided the best format for documenting digital objects on the web.

The Dublin Core element set provides a straightforward framework for documenting features of a work such as who created the work, what its content is and what languages it contains, where and from whom it is available and in what formats, and whether it is derived from a printed source. At a basic level the element set uses commonly understood terms and semantics which are intelligible to most disciplines and information systems communities. The descriptive terms were chosen to be generic enough to be understood by a document author, but can also be extended to provide full and precise cataloguing information. For example, textual authors, painters, photographers and writers of software programs can all be considered 'creators' in a broad sense.

In any implementation of the Dublin Core element set, all elements are optional and repeatable. Therefore if a work is the result of a collaboration between a number of contributors it is relatively easy to record the details of each one (name, contact details etc) as well as their specific contribution or role (author, editor, photographer, etc.) by simply repeating the appropriate element.

These basic details can be extended by the use of Dublin Core qualifiers. The Dublin Core initiative originally defined three different kinds of qualifier: type (or sub-element) to broadly refine the semantics of an element name, language to specify the language of an element value, and scheme to note the existence of an element value taken from an externally defined scheme or standard. Guidelines for implementing these qualifiers in HTML are also available. Work on integrating Dublin Core and the Resource Description Framework (RDF), however, revealed that these terms could be a source of confusion. Dublin Core qualifiers are now identified as either element qualifiers, which refine the semantics of a particular element, or value qualifiers, which provide contextual information about an element value. Take the Dublin Core date element, for example. Element qualifiers would allow the broad concept of date to be subdivided into things like 'date of creation' or 'date of last modification'. Value qualifiers might explain how a particular element value should be parsed. For example, a date element with a value qualifier of 'ISO 8601' indicates that the string '1999-01-01' should be parsed as the 1st of January 1999. Other value qualifiers might indicate that an element value is taken from a particular controlled vocabulary or scheme, for example to indicate the use of a subject term from an established scheme like the Library of Congress Subject Headings.
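The role of a value qualifier as a parsing instruction can be illustrated with a short Python sketch. The function name parse_date_value and the idea of passing the qualifier as a plain string are assumptions made for this example; only the 'ISO 8601' qualifier discussed above is handled, and anything else is rejected rather than guessed at.

```python
from datetime import date


def parse_date_value(value, value_qualifier):
    """Interpret a date element value according to its value qualifier
    (hypothetical helper; only 'ISO 8601' is supported here)."""
    if value_qualifier != "ISO 8601":
        raise ValueError("unsupported value qualifier: " + value_qualifier)
    # date.fromisoformat accepts the YYYY-MM-DD calendar form
    return date.fromisoformat(value)


parsed = parse_date_value("1999-01-01", "ISO 8601")
print(parsed.year, parsed.month, parsed.day)  # 1999 1 1
```

Without the qualifier a program has no reliable way of knowing whether a string such as '01/02/1999' means the 1st of February or the 2nd of January, which is precisely the ambiguity that value qualifiers are intended to remove.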

Implementing the Dublin Core

The Dublin Core element set was designed for documenting web resources and it is easily integrated into web pages using the HTML <META> tag, inserted between the <HEAD>...</HEAD> tags and before the <BODY> of the work. An Internet-Draft has been published that explains how this should be done. No specialist tools more sophisticated than an average word processor are required to produce the content of a Dublin Core record; however, a number of labour-saving devices are available, notably the DC-dot generator available from the UKOLN web site. DC-dot can automatically generate Dublin Core metadata for a web site and encode this in HTML <META> tags and other formats. The metadata produced can also be easily edited and extended further. The Nordic Metadata Project Template is an alternative way of creating simple Dublin Core metadata that can be embedded in HTML <META> tags.
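As a rough indication of what such a generator does, the following Python sketch renders a list of element and value pairs as <META> tags. It is not the DC-dot implementation: the function name dc_meta_tags is hypothetical, the 'DC.' prefix on the NAME attribute follows the convention used for embedding Dublin Core in HTML but should be checked against the current guidelines, and the names in the sample record are placeholders.

```python
from html import escape


def dc_meta_tags(record):
    """Render (element, value) pairs as HTML <META> tags, one per pair.

    Elements may be repeated; each repetition produces its own tag.
    """
    tags = []
    for element, value in record:
        name = "DC." + element.capitalize()  # e.g. 'title' -> 'DC.Title'
        # escape() protects quotes and ampersands inside attribute values
        tags.append('<META NAME="%s" CONTENT="%s">' % (escape(name), escape(value)))
    return "\n".join(tags)


# Placeholder record: the two contributors are invented names
sample_record = [
    ("title", "King Lear"),
    ("creator", "William Shakespeare"),
    ("contributor", "A. Transcriber"),
    ("contributor", "B. Editor"),
]

print(dc_meta_tags(sample_record))
```

The output is intended to be pasted between the <HEAD>...</HEAD> tags of the page it describes. Representing the record as a list of pairs, rather than as a dictionary, is a deliberate choice so that repeated elements are preserved in the order given.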

Conclusions and further reading

The Dublin Core element scheme offers enormous potential as a usable standard cataloguing procedure for digital resources on the web. The core set of elements is broad and encompassing enough to be of use to novice web authors and skilled cataloguers alike. However, its success will ultimately be dependent on its wide-scale adoption by the Internet community as a whole. It is also crucial that the rules of the scheme be implemented in an intelligent and systematic way. To fulfil this objective more has to be done to refine and stabilize the element set. The provision and use of simple Dublin Core generating tools, which demonstrate the benefits of including metadata, need to become more prevalent.

The Arts and Humanities Data Service (AHDS), in association with the UK Office for Library and Information Networking (UKOLN), has produced a publication which outlines in more detail the best practices involved in using Dublin Core, as well as giving many practical examples: "Discovering Online Resources across the Humanities: A practical implementation of the Dublin Core" (ISBN 0-9516856-4-3). This is also freely available from the AHDS web site.

A practical illustration of how the Dublin Core element set can be implemented in order to perform searches for individual items across disparate collections is the AHDS Gateway. The AHDS Gateway is, in reality, an integrated catalogue of the holdings of the five individual Service Providers which make up the AHDS. Although the Service Providers are separated geographically, each provides Dublin Core records describing its holdings, so that users can very simply search across the complete holdings of the AHDS from one single access point.

The Dublin Core Elements

The element descriptions which follow are based on the official definitions of the Dublin Core metadata element set.

Element Descriptions

1.Title

Label: TITLE

The name given to the resource by the CREATOR or PUBLISHER. Where possible standard authority files should be consulted when entering the content of this element. For example the Library of Congress or British Library title lists can be used, but always remember to indicate the source using the 'scheme' qualifier. If authorities are to be used, these would need to be indicated as a value qualifier.

2.Author or Creator

Label: CREATOR

The person or organization primarily responsible for creating the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources. Note that this element does not refer to the person who is responsible for digitizing a work; this belongs in the CONTRIBUTOR element. So in the case of a machine-readable version of King Lear held by the OTA, the CREATOR remains William Shakespeare, and not the person who transcribed it into digital form. Again, standard authority files should be consulted for the content of this element.

3.Subject and Keywords

Label: SUBJECT

The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemas is encouraged.

4.Description

Label: DESCRIPTION

A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.

5.Publisher

Label: PUBLISHER

The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

6.Other Contributor

Label: CONTRIBUTOR

A person or organization not specified in a CREATOR element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a CREATOR element (for example, editor, transcriber, and illustrator).

7.Date

Label: DATE

The date the resource was made available in its present form. Recommended best practice is an 8 digit number in the form YYYY-MM-DD as defined in a profile of ISO 8601. In this scheme, the date element 1994-11-05 corresponds to November 5, 1994. Many other schema are possible, but if used, they should be identified in an unambiguous manner.

8.Resource Type

Label: TYPE

The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. For the sake of interoperability, TYPE should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document.

9.Format

Label: FORMAT

The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource. For the sake of interoperability, FORMAT should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document.

10.Resource Identifier

Label: IDENTIFIER

String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element in the case of off-line resources.

11.Source

Label: SOURCE

A string or number used to uniquely identify the work from which this resource was derived, if applicable. For example, a PDF version of a novel might have a SOURCE element containing an ISBN number for the physical book from which the PDF version was derived.

12.Language

Label: LANGUAGE

Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with RFC 1766.

13.Relation

Label: RELATION

The relationship of this resource to other resources. The intent of this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. Formal specification of RELATION is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

14.Coverage

Label: COVERAGE

The spatial and/or temporal characteristics of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

15.Rights Management

Label: RIGHTS

A link to a copyright notice, to a rights-management statement, or to a service that would provide information about terms of access to the resource. Formal specification of RIGHTS is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.