Creating & Documenting Electronic Texts


Chapter 6 : The TEI Header

6.2 : TEI Header

The work and objectives of the Text Encoding Initiative (TEI), and the guidelines it produced for text encoding and interchange, have already been discussed in the previous chapter. In this section dealing with metadata, we will focus on how the TEI has approached the problems particular to the effective documentation of electronic texts. This section will look at the TEI header, specifically the version of the header provided by the TEI Lite DTD.

Unlike the Dublin Core element set, the TEI header is not designed specifically for describing and locating objects on the web, although it can be used for this purpose. The TEI header provides a mechanism for fully documenting an electronic text. It does not limit itself to documenting the text, but also provides a system for documenting its source, its encoding practices, and the process of its creation. The TEI header is therefore an essential source of information for users of the text, for software that has to process the metadata, and for cataloguers in libraries, museums, and archives. In contrast with the Dublin Core, whose inclusion in any document is voluntary, the presence of the TEI header is mandatory if the document is to be considered TEI conformant (i.e. it conforms to a proper SGML DTD).

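As with the full TEI Lite tag-set, the TEI header offers a number of optional elements (only one element, the <fileDesc>, is mandatory) for use in a structured way. These elements can be extended by the addition of attributes. The TEI header can therefore range from a very large and complex document to a simple, concise piece of metadata. The most basic valid TEI Lite header would look something like:

<teiHeader>
<fileDesc>
<titleStmt>
<title>[title of the electronic text]</title>
</titleStmt>
<publicationStmt>
<p>[details of publication or distribution]</p>
</publicationStmt>
<sourceDesc>
<p>[details of the source of the text]</p>
</sourceDesc>
</fileDesc>
</teiHeader>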

At its simplest a TEI Lite header requires no more than a description of the electronic file itself, a description which includes some kind of statement on what the text is called, what its publishing environment is, and if it has been derived or transcribed from another source.

A typical TEI header would, one hopes, contain more detailed information relating to a document. In general the header should be regarded as providing information analogous to that provided by the title page of a printed book, combined with the information usually found in an electronic readme file. As with the Dublin Core <META> tag, the TEI header appears at the beginning of a text (although it can be held separately from the document), between the SGML prolog (i.e. the SGML declaration and the DTD) and the front matter of the text itself:

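<!DOCTYPE tei.2 PUBLIC "-//TEI//DTD TEI Lite 1.6//EN">
<tei.2>
<teiHeader>
[header details go here]
</teiHeader>
<text>
[front matter and body of the text go here]
</text>
</tei.2>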


The metadata contained within the TEI header can also be utilized as an effective resource for the information management of texts. In the same way that an on-line library catalogue allows different search options and views of a collection, the metadata in the TEI header can be manipulated to present different access points into a collection of electronic texts. For example, rather than maintain a separate, static catalogue or database of its holdings, the OTA uses the metadata stored in the TEI headers to assist in the identification and retrieval of resources. In addition to being able to perform simple searches for the author or title of a work, users of the OTA catalogue can submit complex queries on a number of available options, such as searching for resources by language, genre, time period, and even by file format.
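The idea of using headers as a catalogue can be sketched in a few lines of Python. This is only a minimal illustration and not the OTA's actual system: the two sample headers, their values, the identifiers `text-a` and `text-b`, and the function names `catalogue_entry` and `search` are all assumptions made for the example, and the headers are written as well-formed XML so that the standard library parser can read them.

```python
import xml.etree.ElementTree as ET

# Two invented headers; every value below is a placeholder.
HEADER_A = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Play</title>
      <author>Author, First</author>
    </titleStmt>
    <publicationStmt><p>Distributed by an example archive.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language id="eng">English</language></langUsage>
  </profileDesc>
</teiHeader>
"""

HEADER_B = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Chronicle</title>
      <author>Author, Second</author>
    </titleStmt>
    <publicationStmt><p>Distributed by an example archive.</p></publicationStmt>
    <sourceDesc><p>Transcribed from a manuscript.</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language id="lat">Latin</language></langUsage>
  </profileDesc>
</teiHeader>
"""


def catalogue_entry(text_id, header_xml):
    """Pull a few searchable fields out of one header."""
    root = ET.fromstring(header_xml)
    return {
        "id": text_id,
        "title": root.findtext("fileDesc/titleStmt/title", default="").strip(),
        "author": root.findtext("fileDesc/titleStmt/author", default="").strip(),
        "languages": [
            lang.text.strip()
            for lang in root.findall("profileDesc/langUsage/language")
            if lang.text
        ],
    }


def search(entries, author=None, language=None):
    """Return the ids of entries matching every criterion that was given."""
    results = []
    for entry in entries:
        if author is not None and author.lower() not in entry["author"].lower():
            continue
        if language is not None and language not in entry["languages"]:
            continue
        results.append(entry["id"])
    return results


entries = [catalogue_entry("text-a", HEADER_A), catalogue_entry("text-b", HEADER_B)]
```

Because the entries are rebuilt from the headers each time, there is no separate catalogue to keep in step with the texts; calling `search(entries, language="Latin")` simply reads what the headers already record.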

In addition to supporting the dynamic construction of indexes and catalogues, the metadata contained within the TEI header can be used to create other metadata and catalogue records. TEI header metadata can be extracted and mapped onto other well-established resource cataloguing standards, such as library MARC records, or onto emerging standards such as the Dublin Core element set and the Resource Description Framework (RDF). This is a relatively simple task, since the TEI header was closely modelled on existing standards in library cataloguing. For example, the TEI Lite <author> tag within the <titleStmt> is analogous to the 100 MARC AUTHOR record field and also to the Dublin Core CREATOR element. There is no need, therefore, to maintain several different metadata formats when they can simply be filtered from one central information source.
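A mapping of this kind can be sketched as a small crosswalk table. The following Python is a minimal sketch under stated assumptions, not an official TEI-to-Dublin-Core mapping: the sample header, the `TEI_TO_DC` table and the function name `tei_header_to_dc` are invented for illustration, and the header is written as well-formed XML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Invented sample header; the values are placeholders.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>King Lear: an electronic edition</title>
      <author>Shakespeare, William</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Text Archive</publisher>
      <idno>example-0001</idno>
    </publicationStmt>
    <sourceDesc>
      <p>Transcribed from a printed edition.</p>
    </sourceDesc>
  </fileDesc>
</teiHeader>
"""

# Hypothetical crosswalk: path inside the header -> Dublin Core label.
TEI_TO_DC = [
    ("fileDesc/titleStmt/title", "TITLE"),
    ("fileDesc/titleStmt/author", "CREATOR"),
    ("fileDesc/publicationStmt/publisher", "PUBLISHER"),
    ("fileDesc/publicationStmt/idno", "IDENTIFIER"),
]


def tei_header_to_dc(header_xml):
    """Filter a Dublin Core style record out of one TEI header."""
    root = ET.fromstring(header_xml)
    record = {}
    for path, label in TEI_TO_DC:
        for element in root.findall(path):
            # Collapse internal whitespace in the element's text content.
            value = " ".join("".join(element.itertext()).split())
            if value:
                # Dublin Core elements are repeatable, so keep a list.
                record.setdefault(label, []).append(value)
    return record


dc_record = tei_header_to_dc(SAMPLE_HEADER)
```

The header remains the single source of information; the Dublin Core record is derived from it on demand and never edited by hand.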

[see: ]

The TEI Lite Header Tag Set

Although the TEI Lite header has only one required element (the <fileDesc>), it is recommended that all four of the principal elements which comprise the header be used. The TEI header provides scope to describe practically all of the textual and non-textual aspects of an electronic text, so it is always recommended that a header include as much information as possible. The following overview of the four main elements which make up the header is by no means exhaustive; a more comprehensive account with examples can be found in the Gentle Introduction to SGML [see: ]

The four recommended elements which go to make a <teiHeader> are:
<fileDesc>: the file description. This element contains a full bibliographic description of an electronic file.
<encodingDesc>: the encoding description. This element documents the relationship between an electronic text and the source(s) from which it was derived.
<profileDesc>: the profile description. This element provides a detailed description of the non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.
<revisionDesc>: the revision description. This element summarizes the revision history of a file.

The elements within the TEI header fall into three broad categories of content:

- Descriptions (containing the suffix Desc) can contain simple prose descriptions of the content of the element. These can also contain specific sub-elements.

- Statements (containing the suffix Stmt) indicate that the element groups together a number of specialized elements recording some structured information.

- Declarations (containing the suffix Decl) enclose information about specific encoding practices applied to the electronic text.
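The naming convention is regular enough to be checked mechanically. As a small illustration (the function name `header_category` is an assumption made for this sketch, not part of any TEI tool), the category of a header element can be read off its suffix:

```python
# Suffixes used in TEI header element names and the category each signals.
SUFFIX_CATEGORIES = {
    "Desc": "description",
    "Stmt": "statement",
    "Decl": "declaration",
}


def header_category(element_name):
    """Return the broad category of a header element, or None if its
    name carries none of the three suffixes."""
    for suffix, category in SUFFIX_CATEGORIES.items():
        if element_name.endswith(suffix):
            return category
    return None
```

For instance, `header_category("titleStmt")` gives `"statement"`, while an element such as `extent`, which carries no suffix, gives `None`.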

The File Description: <fileDesc>

The file description contains a full bibliographic description of the computer file itself. It should provide enough information to construct a meaningful bibliographic citation or library catalogue entry. The <fileDesc> contains three mandatory and four optional elements:

<titleStmt>: groups information relating to the title of the work and those responsible for its intellectual content. Details of any major funding or sponsoring bodies can also be recorded here. This element is mandatory.

<editionStmt>: groups together information relating to one edition of a text. This element may contain information on the edition or version of the electronic work being documented.

<extent>: simply records the size of the electronic text in a recognizable format, e.g. bytes, Mb, words, etc.

<publicationStmt>: records details of the publication or distribution of the electronic text, including a statement on its availability status (e.g. freely available, restricted, forbidden, etc.). This element is mandatory.

An <idno> is also included to provide a useful mechanism for identifying a bibliographic item by assigning it a unique identifier.

<seriesStmt>: groups together information about a series, if any, to which a publication belongs. Again an <idno> element is supplied to help with identifying the individual work.

<notesStmt>: groups together any notes providing information about a text additional to that recorded in other parts of the bibliographic description. This general element can be used in a variety of ways to record potentially significant details about the text and its features.

<sourceDesc>: groups together details of the source or sources from which the electronic edition was derived. This element may contain a simple prose description of the text, or more complex bibliographic elements may be employed to provide a structured bibliographic reference for the work. This element is mandatory.
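A quick check for the three mandatory elements described above can be sketched in Python. This is an informal aid and no substitute for validating against the DTD with an SGML parser; the function name `missing_mandatory` is an assumption made for the example, and the header is assumed to be well-formed XML with <teiHeader> as its root.

```python
import xml.etree.ElementTree as ET

# The three mandatory children of <fileDesc> named in the text above.
MANDATORY_IN_FILEDESC = ("titleStmt", "publicationStmt", "sourceDesc")


def missing_mandatory(header_xml):
    """List the mandatory header elements that are absent."""
    root = ET.fromstring(header_xml)
    file_desc = root.find("fileDesc")
    if file_desc is None:
        # Without a <fileDesc> nothing else can be checked.
        return ["fileDesc"]
    return [
        name for name in MANDATORY_IN_FILEDESC
        if file_desc.find(name) is None
    ]
```

An empty list means all three are present; anything else names what still needs to be supplied before the header can be considered complete.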

The Encoding Description: <encodingDesc>

<encodingDesc>: documents the relationship between an electronic text and the source or sources from which it was derived. The <encodingDesc> can contain a simple prose description detailing such features as the purpose(s) for which the work was encoded, as well as any other relevant information concerning the process by which it was assembled or collected. While there are no mandatory elements within the <encodingDesc>, those available are useful for documenting the rationale behind how and why certain elements have been implemented.

<projectDesc>: used to describe, in prose, the purpose for which the electronic text was encoded (for example, if a text forms part of a larger collection, or was created with a particular audience in mind).

<samplingDecl>: particularly useful in identifying the rationale behind the sampling procedure for a corpus.

<editorialDecl>: provides details of the editorial principles applied during the encoding of a text, for example it can record whether the text has been normalized or how quotations in a text have been handled.

<tagsDecl>: groups information on how the SGML tags have been used, and how often, within a text.

<refsDecl>: commonly used to identify which SGML elements contain identifying information, and whether this information is represented as attribute values or as content.

<classDecl>: defines which descriptive classification schemes (if any) have been used by other parts of the header.

The Profile Description: <profileDesc>

<profileDesc>: details the non-bibliographic aspects of a text, specifically the languages used in the text, the situation in which the text was produced, and the participants involved in the creation.

<creation>: groups information detailing the time and place of creation of a text.

<langUsage>: records the languages (including dialects, sub-languages, etc.) used in the text.

<textClass>: describes the nature or topic of the text in terms of a standard classification scheme. Included in this element is a useful <keywords> tag which can be used to identify the particular classification scheme used, and which keywords from this scheme were applied.

The Revision Description: <revisionDesc>

<revisionDesc>: provides a detailed system for recording changes made to the text. This element is of particular use in the administration of files, recording when changes were made to a text and by whom. The <revisionDesc> should be updated every time a significant alteration has been made to a text.
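Keeping the revision description up to date is easy to automate. The sketch below assumes that each alteration is recorded as a <change> element holding a <date>, a <respStmt> (with <name> and <resp>) and an <item>, and that the most recent change is listed first; this layout, the function name `record_change` and the sample values are assumptions for illustration, and should be checked against the DTD in use.

```python
import xml.etree.ElementTree as ET


def record_change(header, date, who, role, what):
    """Add one <change> entry to the header's <revisionDesc>,
    creating the <revisionDesc> if it does not exist yet."""
    revision = header.find("revisionDesc")
    if revision is None:
        revision = ET.SubElement(header, "revisionDesc")
    change = ET.Element("change")
    ET.SubElement(change, "date").text = date
    resp_stmt = ET.SubElement(change, "respStmt")
    ET.SubElement(resp_stmt, "name").text = who
    ET.SubElement(resp_stmt, "resp").text = role
    ET.SubElement(change, "item").text = what
    # Assumed convention: newest change goes at the top of the list.
    revision.insert(0, change)
    return change


# Usage with an invented, skeletal header.
header = ET.fromstring("<teiHeader><fileDesc/></teiHeader>")
record_change(header, "1999-01-10", "A. Editor", "ed.", "Corrected typing errors")
record_change(header, "1999-02-01", "A. Editor", "ed.", "Added profile description")
```

After the two calls the header holds a single <revisionDesc> with the February change listed above the January one.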

The TEI Header: Conclusions

The above overview hopefully demonstrates the comprehensive nature of the TEI header as a mechanism for documenting electronic texts. The emergence of the electronic text over the past decade has presented librarians and cataloguers with many new challenges. Existing library cataloguing procedures, while inadequate to properly document all the features of electronic texts, were used as a secure foundation onto which additional features directly relevant to the electronic text could be grafted. The TEI header has proved to be an invaluable tool for those concerned with documenting electronic resources; its supremacy in this field can be measured by the increasing number of electronic text centres, libraries, and archives who have adopted its framework. The Oxford Text Archive has found it indispensable as a means of managing its large collection of disparate electronic texts, not only as a mechanism for creating its searchable catalogue, but as a means of creating other forms of metadata which can communicate with other information systems.

Ironically, it is the same generality and flexibility offered by the TEI Guidelines (P3) on creating a header which has hindered progress towards one of the main goals of the TEI and the hopes of the electronic text community as a whole, namely the interoperability and interchangeability of metadata. Unlike the Dublin Core element set, which has a strictly defined set of rules governing its content, the TEI header has a set of guidelines which allow for widely divergent approaches to header creation. While this is not a major problem for individual texts, or texts within a single collection, the variant ways in which the guidelines are interpreted and put into practice make interoperability with other systems using TEI headers more difficult than first imagined. As with the Dublin Core element set, what is required is the wholesale adoption of a mutually acceptable code of practice which header creators could implement.

One final aspect of the TEI header is a cause of irritation to those creating and managing TEI headers and texts: the apparent dearth of affordable and user-friendly software aimed specifically at header production. While this has long been a general criticism of SGML applications as a whole, the TEI can in no way be held to blame for this absence, as it was not part of the TEI remit to create software. However, it has contributed to the relatively slow uptake and implementation of the TEI header as the predominant method of providing well-structured metadata to the electronic text community as a whole. Until this situation is adequately resolved, the tools on offer tend to be either freeware products designed by people within the SGML community itself, or large and very expensive purpose-built SGML-aware products aimed at the commercial market.

[see : ]
[see : ]


web sites and further reading:
TEI home page:

6.3 : Documentation and Metadata

6.3.1 The Dublin Core Element Set and the Arts and Humanities Data Service

"The Dublin Core is a 15-element metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of web resources, it has also attracted the attention of formal resource description communities such as museums and libraries" [Dublin Core Metadata home page -]

By the mid-1990s, large-scale web users, document creators, and information providers had recognized the pressing need to introduce some kind of basic cataloguing scheme for documenting resources on the web. The scheme needed to be accessible enough to be adopted and implemented by typical web content creators who had little or no formal cataloguing training. The set of metadata elements needed to be simpler than those used in traditional library cataloguing, while offering information systems greater precision than the crude indexing methods already employed by unreliable search engines and web crawlers.

The Dublin Core Metadata Element Set grew out of a series of meetings and workshops comprising experts from the library world, the networking and digital library research community, and content specialists. The basic objectives of the Dublin Core initiative included:

- to produce a core set of descriptive elements capable of describing or identifying the majority of resources available on the Internet. Unlike a traditional library, where the main focus is on cataloguing published textual materials, the Internet contains a vast range of material in a variety of formats, including non-textual material such as images and video, most of which has not been 'published' in any formal way.

- to make this scheme intelligible enough that it could be easily utilized by cataloguers but still retain enough content that it functioned effectively as a catalogue record.

- to encourage the adoption of the scheme on an international level by ensuring that it provided the best format for documenting digital objects on the web.

The Dublin Core element set provides a straightforward framework for documenting features of a work such as who created the work, what its content is and what languages it contains, where and from whom it is available and in what formats, and whether it is derived from a printed source. At a basic level the element set uses commonly understood terms and semantics which are intelligible to most disciplines and information systems communities. The descriptive terms were chosen to be generic enough to be understood by a document author, but can also be extended to provide full and precise cataloguing information. For example, textual authors, painters, photographers, and writers of software programs can all be considered 'creators' in a broad sense.

Two main principles apply when creating a Dublin Core record: all elements are optional, and all elements are repeatable. Therefore, if a work is the result of numerous contributors, it is simple to record the details of each (name, contact details, etc.) as well as their specific contribution (author, editor, photographer, etc.) by repeating the appropriate element. These basic details can be extended by the use of qualifiers such as scheme, type, and language on the elements. The scheme qualifier identifies a recognized coding or cataloguing scheme used in a Dublin Core element, for example where a document employs an established cataloguing scheme such as the Library of Congress subject headings. Its use provides a mechanism for introducing a degree of standardization and consistency to the format. The type qualifier refines more precisely the content of a single element. For example, the author element is often used several times for the same individual, and the type qualifier can be employed to differentiate details such as the author's postal address, email address, telephone number, etc. The language qualifier simply identifies the language of the element value.
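The two principles, together with the qualifiers, suggest a very simple data model: a record is just a list of elements, any of which may be absent or repeated. The Python below is a minimal sketch of that model; the class names `DCElement` and `DCRecord`, the attribute `type_` (so named to avoid clashing with a Python built-in) and all sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DCElement:
    label: str                      # e.g. "CREATOR"
    value: str
    scheme: Optional[str] = None    # recognized scheme the value follows
    type_: Optional[str] = None     # refinement of the element's content
    language: Optional[str] = None  # language of the element value


@dataclass
class DCRecord:
    # No element is required, so an empty record is a legitimate start.
    elements: List[DCElement] = field(default_factory=list)

    def add(self, label, value, **qualifiers):
        """Append an element; the same label may be added many times."""
        self.elements.append(DCElement(label, value, **qualifiers))

    def values(self, label, type_=None):
        """All values for a label carrying the given type qualifier."""
        return [
            e.value for e in self.elements
            if e.label == label and e.type_ == type_
        ]


record = DCRecord()
record.add("CREATOR", "Author, First")
record.add("CREATOR", "Author, Second")
record.add("CREATOR", "first.author@example.org", type_="email")
record.add("SUBJECT", "Example subject heading", scheme="LCSH")
```

Here `record.values("CREATOR")` returns the two unqualified names, while `record.values("CREATOR", type_="email")` returns only the address.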

Implementing the Dublin Core

The Dublin Core element set was designed for documenting web resources and is easily integrated into web pages using the HTML <META> tag, inserted between the <HEAD>...</HEAD> tags and before the <BODY> of the work. No specialist tools more sophisticated than an average word processor are required to produce the content of a Dublin Core record; however, a number of labour-saving devices are available, notably the DC-dot generator available from the UKOLN web site []. DC-dot will automatically generate the <META> tags for any web site, and these can be easily edited and extended further.
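Generating the tags by program is equally straightforward. The following Python is a minimal sketch and is not how DC-dot itself works: the function name `dc_meta_tags` is invented, and the `NAME="DC.Title"` naming pattern is one common convention that is assumed here rather than taken from this chapter.

```python
from html import escape


def dc_meta_tags(record):
    """Turn a mapping of Dublin Core labels to lists of values into
    HTML <META> tags, one tag per value."""
    lines = []
    for label, values in record.items():
        # "TITLE" becomes "DC.Title", "CREATOR" becomes "DC.Creator".
        name = "DC." + label.capitalize()
        for value in values:
            # Escape &, <, > and quotes so the attribute stays valid.
            lines.append(
                '<META NAME="%s" CONTENT="%s">' % (name, escape(value, quote=True))
            )
    return "\n".join(lines)


tags = dc_meta_tags({
    "TITLE": ["King Lear"],
    "CREATOR": ["Shakespeare, William"],
})
```

The resulting lines can be pasted between the <HEAD> tags of a page, or written there by a script when the page is built.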

Conclusions and further reading

The Dublin Core element scheme offers enormous potential as a usable standard cataloguing procedure for digital resources on the web. The core set of elements is broad and encompassing enough to be of use to novice web authors and skilled cataloguers alike. However, its success will ultimately depend on its wide-scale adoption by the Internet community as a whole. It is also crucial that the rules of the scheme be implemented in an intelligent and systematic way. To fulfil this objective, more has to be done to refine and stabilize the element set. The provision of simple Dublin Core generating tools, which demonstrate the benefits of including metadata, must become more prevalent.

The Arts and Humanities Data Service (AHDS), in association with the UK Office for Library and Information Networking (UKOLN), has produced a publication which outlines in more detail the best practices involved in using Dublin Core, as well as giving many practical examples: "Discovering Online Resources across the Humanities: A practical implementation of the Dublin Core" (ISBN 0-9516856-4-3). This is also freely available from the AHDS web site, []

A practical illustration of how the Dublin Core element set can be implemented in order to perform searches for individual items across disparate collections is the AHDS Gateway []. The AHDS Gateway is, in reality, an integrated catalogue of the holdings of the five individual Service Providers which make up the AHDS. Although the Service Providers are separated geographically, because each provides Dublin Core records describing its holdings, users can very simply search across the complete holdings of the AHDS from a single access point.

The Dublin Core Elements

This set of official definitions of the Dublin Core metadata element set can be found at:

Element Descriptions


1.Title

Label: TITLE

The name given to the resource by the CREATOR or PUBLISHER. Where possible standard authority files should be consulted when entering the content of this element. For example the Library of Congress or British Library title lists can be used, but always remember to indicate the source using the 'scheme' qualifier.

2.Author or Creator

Label: CREATOR

The person or organization primarily responsible for creating the intellectual content of the resource. For example, authors in the case of written documents, or artists, photographers, or illustrators in the case of visual resources. Note that this element does not refer to the person who is responsible for digitizing a work; this belongs in the CONTRIBUTOR element. So in the case of a machine-readable version of King Lear held by the OTA, the CREATOR remains William Shakespeare, and not the person who transcribed it into digital form. Again, standard authority files should be consulted for the content of this element.

3.Subject and Keywords

Label: SUBJECT

The topic of the resource. Typically, subject will be expressed as keywords or phrases that describe the subject or content of the resource. The use of controlled vocabularies and formal classification schemas is encouraged.



4.Description

Label: DESCRIPTION

A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources.



5.Publisher

Label: PUBLISHER

The entity responsible for making the resource available in its present form, such as a publishing house, a university department, or a corporate entity.

6.Other Contributor

Label: CONTRIBUTOR

A person or organization not specified in a CREATOR element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a CREATOR element (for example, editor, transcriber, and illustrator).


7.Date

Label: DATE

The date the resource was made available in its present form. Recommended best practice is an 8 digit number in the form YYYY-MM-DD, as defined in a profile of ISO 8601. In this scheme, the date element 1994-11-05 corresponds to November 5, 1994. Many other schemes are possible, but if used, they should be identified in an unambiguous manner.
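The recommended form is easy both to produce and to check. As a small Python illustration (the function names `dc_date` and `is_dc_date` are assumptions made for this sketch):

```python
import re
from datetime import date

# Exactly four digits, two digits, two digits, separated by hyphens.
_DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def dc_date(year, month, day):
    """Format a date in the recommended YYYY-MM-DD form."""
    return date(year, month, day).isoformat()


def is_dc_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    match = _DATE_PATTERN.fullmatch(value)
    if match is None:
        return False
    try:
        date(*(int(part) for part in match.groups()))
    except ValueError:
        # Right shape but not a real date, e.g. 30 February.
        return False
    return True
```

A check of this kind can be run over existing records to find dates entered in other schemes, which the guidance above asks to be identified unambiguously.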

8.Resource Type

Label: TYPE

The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. For the sake of interoperability, TYPE should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document. See for current thinking on the application of this element



9.Format

Label: FORMAT

The data format of the resource, used to identify the software and possibly hardware that might be needed to display or operate the resource. For the sake of interoperability, FORMAT should be selected from an enumerated list that is under development in the workshop series at the time of publication of this document.

10.Resource Identifier

Label: IDENTIFIER

A string or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names, would also be candidates for this element in the case of off-line resources.



11.Source

Label: SOURCE

A string or number used to uniquely identify the work from which this resource was derived, if applicable. For example, a PDF version of a novel might have a SOURCE element containing an ISBN for the physical book from which the PDF version was derived.



12.Language

Label: LANGUAGE

Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with RFC 1766. See:



13.Relation

Label: RELATION

The relationship of this resource to other resources. The intent of this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. Formal specification of RELATION is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.



14.Coverage

Label: COVERAGE

The spatial and/or temporal characteristics of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

15.Rights Management

Label: RIGHTS

A link to a copyright notice, to a rights-management statement, or to a service that would provide information about terms of access to the resource. Formal specification of RIGHTS is currently under development. Users and developers should understand that use of this element is currently considered to be experimental.

© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
Arts and Humanities Data Service 