Creating and Documenting Electronic Texts



AACR2
Anglo-American Cataloguing Rules (2nd ed., 1988 Revision). Rules used in the USA and UK which define the procedure for creating MARC records.
AHDS
The Arts and Humanities Data Service. Online:
ASCII
American Standard Code for Information Interchange, sometimes also referred to as 'plain text'. Essentially the basic character set, with minimal formatting (i.e. without changes in font, font size, use of italics etc.).
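As a rough illustration of the definition above, the following Python sketch checks whether a string stays within the 7-bit ASCII range and shows what happens to a character outside it. The function names `is_plain_ascii` and `to_ascii` and the sample strings are illustrative assumptions, not part of any standard.

```python
def is_plain_ascii(text: str) -> bool:
    """Return True if every character has a code point in the 7-bit ASCII range (0-127)."""
    return all(ord(ch) < 128 for ch in text)


def to_ascii(text: str) -> bytes:
    """Encode text as ASCII bytes, replacing any character outside the set with '?'."""
    return text.encode("ascii", errors="replace")


if __name__ == "__main__":
    print(is_plain_ascii("Plain text"))  # only basic Latin letters and a space
    print(is_plain_ascii("caf\u00e9"))   # contains e-acute, which ASCII cannot represent
    print(to_ascii("caf\u00e9"))         # the accented letter is replaced with '?'
```

Replacing unrepresentable characters is only one possible policy; a project that needs to preserve accented letters would normally choose a larger character set instead.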
Corpus (pl. Corpora)
Informally, a collection of data (e.g. whole texts or extracts, transcribed conversations etc.) selected and organised according to certain principles. For example, a literary corpus might consist of all the prose works of a particular author, while a linguistic corpus might consist of all the forms of Russian verbs or examples of conversations amongst British English dialect speakers.
DESIRE
Development of a European Service for Information on Research and Education. Online:
Digitization
The process by which a non-digital (i.e. analogue) source is rendered in machine-readable form. Most often used to describe the process of scanning a text or image using specialist hardware, to create machine-readable data which can be manipulated by another application (e.g. OCR or image processing software).
Document Analysis
The task of examining the source object (usually a non-electronic text) in order to understand the work being digitized and what the purpose and future of the project entail. Document analysis is all about definition: defining the document context, defining the document type, and defining the different document features and relationships. Document analysis should usually be the first step in any electronic text creation project, and it requires users to become intimately acquainted with the format, structure, and content of their source material.
DTD
Document Type Definition. Rules, determined by an application, that apply SGML or XML to the markup of documents of a particular type.
Dublin Core
A metadata element set intended to facilitate discovery of electronic resources.
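To make the idea of a metadata element set more concrete, here is a minimal Python sketch that writes a few Dublin Core elements as XML. It uses the namespace of the Dublin Core element set, version 1.1; the `metadata` wrapper element and the function name `dublin_core_record` are hypothetical choices made for this illustration, and the sample values are taken from this guide's own title and author statement.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields: dict) -> str:
    """Build a small XML fragment with one Dublin Core element per field."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element, not part of Dublin Core
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    record = dublin_core_record({
        "title": "Creating and Documenting Electronic Texts",
        "creator": "Alan Morrison, Michael Popham and Karen Wikander",
    })
    print(record)
```

In practice such elements are often embedded in another format (for example in the header of a web page) rather than stored as a standalone file; the sketch only shows the element names and values being paired.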
EAD
Encoded Archival Description Document Type Definition (EAD DTD). A non-proprietary encoding standard for machine-readable finding aids such as inventories, registers, indexes, and other documents created by archives, libraries, museums, and manuscript repositories to support the use of their holdings. Online:
GIF
Graphics Interchange Format. GIF files use an older format that is limited to 256 colours. Like TIFFs, GIFs use a lossless compression format, but they require less storage space. While they do not have the compression capabilities of JPEG, they are strong candidates for graphic art and line drawings. GIFs can also be made transparent, meaning that the background of the image can be rendered invisible, thereby allowing it to blend in with the background of the web page.
HTML
HyperText Markup Language is a non-proprietary format (based upon SGML) for publishing hypertext on the World Wide Web. It has appeared in four main versions (1.0, 2.0, 3.2, and 4.0), although the World Wide Web Consortium (W3C) recommends using HTML 4.0. Online:
JPEG
Joint Photographic Experts Group. JPEG files are the strongest format for web viewing, and for transfer through systems with space restrictions. JPEGs are popular with image creators not only for their compression capabilities but also for their quality. Whereas TIFF is a lossless format, JPEG is a lossy compression format. This means that as the file size is reduced the image loses bits of information, namely the information least likely to be noticed by the eye. The disadvantage of this format is precisely what makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost. The implication is that the entire image, or certain parts of it, cannot be enlarged without the missing detail becoming apparent. And the more work done to the image, requiring it to be re-saved, the more information is lost. As there is no way to retain all of the information scanned from the source, JPEGs are not recommended for archival storage. Nevertheless, in terms of viewing capabilities and storage size, JPEGs are the best image file format for online viewing.
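The difference between lossless and lossy compression described above can be illustrated with a toy Python sketch. It does not implement JPEG, which uses a far more elaborate scheme. It assumes a made-up sequence of 8-bit grey-scale samples, uses the standard `zlib` module for the lossless round trip, and uses simple quantisation as a stand-in for lossy compression. The function names are hypothetical.

```python
import zlib


def lossless_round_trip(samples: bytes) -> bytes:
    """Compress and decompress with zlib; the result is identical to the input."""
    return zlib.decompress(zlib.compress(samples))


def lossy_round_trip(samples: bytes, step: int = 16) -> bytes:
    """Round each 8-bit sample down to a multiple of `step`.

    The low-order detail is discarded and cannot be recovered afterwards.
    This is a stand-in for lossy compression, not the JPEG algorithm.
    """
    return bytes((value // step) * step for value in samples)


if __name__ == "__main__":
    original = bytes(range(0, 256, 5))  # made-up grey-scale samples
    print(lossless_round_trip(original) == original)  # every sample survives
    print(lossy_round_trip(original) == original)     # some detail has been discarded
```

The sketch shows only the basic property that matters for archiving: after a lossy step there is no operation that restores the original samples.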
MARC
MAchine Readable Cataloguing record. Bibliographic record used by libraries which can be processed by computers.
Markup (n.)
Text that is added to the data of a document in order to convey information about it. There are several kinds of markup, but the two most important are descriptive markup (often represented using markup tags such as <TITLE>, </H1> etc.), and processing instructions (i.e. the internal instructions required to change the appearance of a piece of data displayed on screen, start a new page when printing, indicate a change in font etc.)
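The distinction between markup and the data it describes can be sketched in a few lines of Python. The example below uses the standard library's `html.parser` module to separate tags from text in a small, made-up fragment; the class name `MarkupSplitter`, the function name `split_markup`, and the sample fragment are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Collect the tag names (markup) and the text (data) of a document separately."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag; tag names arrive in lower case.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only runs.
        if data.strip():
            self.text.append(data.strip())


def split_markup(document: str):
    """Return (list of tag names, list of text runs) for the given document."""
    parser = MarkupSplitter()
    parser.feed(document)
    parser.close()
    return parser.tags, parser.text


if __name__ == "__main__":
    tags, text = split_markup("<TITLE>Glossary</TITLE><H1>Markup</H1><P>To add markup.</P>")
    print(tags)
    print(text)
```

Here the tags say what each piece of text is (a title, a heading, a paragraph) without saying how it should look, which is the sense of descriptive markup given above.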
Mark up (vb.)
To add markup.
Metadata
Data about data. The additional information used to describe something for a particular purpose (although that may not preclude its use for multiple purposes). For example, the 'Dublin Core' describes a set of metadata intended to facilitate the discovery of electronic resources (see
OCR
Optical Character Recognition. OCR software attempts to recognise the characters on an image of a page of text, and output a version of that text in machine-readable form. Modern OCR software can be trained to recognise different fonts, and may use a dictionary to facilitate recognition of certain characters and words. OCR works best with clean, modern, well-printed text.
OTA
The Oxford Text Archive. Online:
PDF
Portable Document Format. The native proprietary file format of the Adobe® Acrobat® family of products, intended to enable users to exchange and view electronic documents easily and reliably, independent of the environment in which they were created. Online:
'Plain Text'
See ASCII (American Standard Code for Information Interchange).
PostScript
Adobe® PostScript® is a computer language that describes the appearance of a page, including elements such as text, graphics, and scanned images, to a printer or other output device. Online:
RDF
The Resource Description Framework. A foundation for processing metadata; it provides interoperability between applications that exchange machine-understandable information on the web.
ROADS
A set of software tools to enable the set-up and maintenance of web-based subject gateways. Online:
RTF
Rich Text Format. A proprietary file format developed by Microsoft that describes the format and style of a document (primarily for the purposes of interchange between different applications, most often common word-processors). Online:
SGML
The Standard Generalized Markup Language. An International Standard (ISO 8879) defining a language for document representation that formalises markup and frees it of system and processing dependencies. SGML is the language used to create DTDs. Online:
TEI
The Text Encoding Initiative is an international project which in May 1994 issued its Guidelines for the Encoding and Interchange of Machine-Readable Texts. These Guidelines provide SGML encoding conventions for describing the physical and logical structure of a large range of text types and features relevant for research in language technology, the humanities, and computational linguistics. A revised version of the Guidelines was released in 1999. Online:
TEI Lite
An SGML DTD which represents a simplified subset of the recommendations set out in the TEI's Guidelines for the Encoding and Interchange of Machine-Readable Texts. Online:
TeX and LaTeX
A popular typesetting language (TeX) and a set of macro extensions (LaTeX), the latter being designed to facilitate descriptive markup. Online:
TIFF
Tagged Image File Format. TIFF files are the most widely accepted format for archival image and master copy creation. TIFFs retain all of the scanned image data, allowing you to gather as much information as possible from the original. This is reflected in the one disadvantage of the TIFF image, its file size, but any type of compression is strongly advised against. Any project that plans to archive images or call them up for future modification should scan using this format.
UKOLN
UK Office for Library and Information Networking. A national focus of expertise in network information management, based at the University of Bath. Online:
Unicode
An industry profile of ISO 10646, the Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages. Online:
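As a small illustration of what a character coding system provides, the following Python sketch lists the code point, the standard character name, and the byte sequence for each character in a string. UTF-8 is used here as one common way of serialising code points to bytes; it is an assumption of this sketch and is not discussed in the definition above. The function name `describe` is hypothetical.

```python
import unicodedata


def describe(text: str):
    """Return (code point, character name, UTF-8 bytes) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), ch.encode("utf-8"))
        for ch in text
    ]


if __name__ == "__main__":
    # A basic Latin letter, an accented letter, and the euro sign.
    for code_point, name, encoded in describe("a\u00e9\u20ac"):
        print(code_point, name, encoded)
```

The basic Latin letter occupies a single byte in UTF-8, identical to its ASCII value, while the other two characters need two and three bytes respectively.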
XML
The Extensible Markup Language is a data format for structured document interchange on the Web. The current World Wide Web Consortium (W3C) Recommendation is XML 1.0, February 1998. Online:
© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
Arts and Humanities Data Service 