Creating & Documenting Electronic Texts

 

Chapter 4: Markup: The key to reusability

Subsection 4.1: What is markup?

Markup is most commonly defined as a form of text added to a document to transmit information about both the physical and electronic source. Don't be surprised if the term sounds familiar; it has been in use for centuries. It was first used in conjunction with the printing trade as a reference to the instructions inscribed onto copy so that the compositor would know how to prepare the typographical design of the document. As Philip Gaskell points out, "Many examples of printers' copy have survived from the hand-press period, some of them annotated with instructions concerning layout, italicization, capitalization, etc." (Gaskell, 41). This concept evolved slightly through the years but has remained entwined with the printing industry. As G.T. Tanselle writes in a 1981 article on scholarly editing, "one might. . .choose a particular text to mark up to reflect these editorial decisions, but that text would only be serving as a convenient basis for producing printer's copy. . ." (Tanselle, 64). There still seems to be some demarcation between the usage of the term in bibliography and in computing, but the boundary is really quite blurry. The leap from markup as a method of labelling instructions on printer's copy to markup as a language used to describe information in an electronic document being read by a computer is not so vast.

Therefore, when we think of markup there are really three differing types (two of which will be discussed below). The first is the markup that relates strictly to formatting instructions found on the physical text, as mentioned above. It is used for the creation of an emended version of the work and, with the exception of textual scholars, is rarely referred to again. The second is the proprietary markup found in electronic document encoding, which is tied to a specific software package or developer. This markup is concerned primarily with document formatting, describing what words should be in italics or centred, where the margins should be set, or where to place a bulleted list. There are a few things to note about this type of markup. The first is that being proprietary means that it is intimately tied to the software that created it. This does not pose a problem as long as the document remains with that software program and the creator recognizes that there is no guarantee the software will exist in the future. This is important, as proprietary software formats allow the user to say where and how they want the document formatted, but then insert their own markup language to accomplish this. When a user creates a document in Word or PDF they are unconsciously adding encoding with every keystroke. As anyone who has created a document in one software format and attempted to transfer it to another is aware, the encoding does not transfer -- and if for some reason a bit of it does, it rarely means the same thing. The third type of markup is non-proprietary: a generalized markup language. This language allows for a level of control not found in proprietary markup. More importantly, it offers cross-platform capabilities, ensuring that documents with this style of encoding will be readable many years down the line.

Subsection 4.2: Visual/presentational markup vs. Structural/descriptive markup

The discussion of visual/presentational markup vs. structural/descriptive markup picks right up from the concepts of proprietary and non-proprietary markup. As the name implies, presentational markup is concerned with the visual structure of a text. Depending upon what processing software is being used, the markup explains to the computer how the document should appear. So if the work should be seen in 12 point, Tahoma font, the software dictates a markup so that this happens. Presentational markup is concerned with structure only insofar as it relates to the visual aspect of the document. It does not care whether a heading is for a book, a chapter or a paragraph -- the only consideration is how that heading should look on the page. Most proprietary language formats tend to focus solely on presentational issues. To move into descriptive markup would require that the software provide the document creator with the ability to formulate their own tags with which to encode the structure and presentation of the work.

In other words, descriptive markup relates less to the visual strategy of the work and more to the reasons behind the structure. It allows the creator to encode the document with a markup that more clearly elucidates how the presentation, configuration, and content relate to the document as a whole. Once again, the beneficial effects of thorough document analysis can be seen. Having a holistic sense of the document, having the detailed listing of critical elements in the document, will exemplify how descriptive markup advances a project. In this case, a non-proprietary language will be the most beneficial, as it will allow the document creator to arrive at their own tagsets, providing a much needed level of control over the encoding development.

Subsection 4.2.1: PostScript and Portable Document Format (PDF)

In 1985, Adobe Systems created a programming language for printers called PostScript. In so doing, they produced a system that allowed computers to "talk" to their printers. This language describes for the printer the appearance of the page, incorporating elements like text, graphics, colour, and images, so that documents maintain their integrity through the transmission from computer to printer. PostScript printers have become industry standard with corporations, marketers, publishing companies, graphic designers, and more. Printers, slide recorders, imagesetters -- all these output devices utilise PostScript technology. Combine this with PostScript's multiple operating system capability and it becomes clear why Adobe calls PostScript "the world's standard printing and imaging technology" (http://www.abobe.co.uk/products/postscript/pscriptov.html). PostScript language can be found in most printers, Epson, IBM and Hewlett-Packard just to name a few, almost guaranteeing that the standard of printing can be found in both the home and office. Adobe provides a list of compatible products at http://www.adobe.com/proindex/postscript/oemlist.html.

Portable Document Format (PDF) was created by Adobe in 1993 to complement their PostScript language. PDF allows the user to view a document with a presentational integrity that almost resembles a scanned image of the source. This delivery of visually rich content is likely the most attractive use of PDF. The format is entirely concerned with keeping the document intact, and, to ensure this, allows any combination of text, graphics and images. It also has full, rich colour presentation and is therefore often used with corporate and marketing graphic arts materials. Another enticing feature is that when the user prints out a PDF file the hard copy output is an exact replication of the screen image -- this is also dependent upon the quality of the printer of course. PDF is also desirable for its delivery strengths. Not only does the document maintain its visual integrity, but it also can be compressed. This compression eases on-line and CD-ROM transmission and assists its archiving opportunities.

PDF files can be read through an Acrobat Reader application that is freely available for download via the web. This application is also capable of serving as a browser plug-in for online document viewing. Creating PDF files is a bit more complicated than the viewing procedure. To write a PDF document it is necessary to purchase Adobe software. PDFWriter allows the user to create the PDF document, and the more expensive Adobe Capture program will convert TIFF files into PDF formatted text versions. If the user would like the document to become more interactive, being able to annotate the document for example, then this functionality can be added with the additional purchase of Acrobat Exchange, which serves an editorial function. Exchange allows the user to annotate and edit the document and to search across documents, and it also has plug-ins that provide highlighting ability.

Taking into consideration the earlier discussion of visual vs. structural markup, it is clear how programs like PostScript and PDF fall into the category of a proprietary processing language that is concerned with presentational rather than descriptive markup. This does not imply that these languages should be avoided. On the contrary, if the only document concern is how it appears both on the screen and through the printer, then software of this nature is appropriate. However, if the document needs to cross platforms, or the project objectives require control over the encoding or document preservation, then these proprietary programs are not dependable.
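The distinction between presentational and descriptive markup can be made concrete with a small sketch. The Python example below is an illustration only: the tag names (chapter, title, para, emph) and the function names are invented for this purpose and do not come from any particular tagset. It encodes a passage once, descriptively, and then derives two different presentations from the same encoding.

```python
import xml.etree.ElementTree as ET
from html import escape

# Descriptive markup: the (hypothetical) tags say what each part IS,
# not how it should look.
SOURCE = """<chapter>
  <title>Markup: The key to reusability</title>
  <para>Markup has been in use for <emph>centuries</emph>.</para>
</chapter>"""


def render_inline(element, emph_open, emph_close, clean=str):
    """Flatten one element, wrapping any <emph> child in the given delimiters."""
    pieces = [clean(element.text or "")]
    for child in element:
        inner = clean("".join(child.itertext()))
        if child.tag == "emph":
            inner = emph_open + inner + emph_close
        pieces.append(inner)
        pieces.append(clean(child.tail or ""))
    return "".join(pieces)


def to_html(xml_text):
    """One presentation: titles become headings, emphasis becomes italics."""
    lines = []
    for child in ET.fromstring(xml_text):
        if child.tag == "title":
            lines.append("<h1>" + render_inline(child, "<i>", "</i>", escape) + "</h1>")
        elif child.tag == "para":
            lines.append("<p>" + render_inline(child, "<i>", "</i>", escape) + "</p>")
    return "\n".join(lines)


def to_plain_text(xml_text):
    """Another presentation: titles in capitals, emphasis marked with asterisks."""
    lines = []
    for child in ET.fromstring(xml_text):
        text = render_inline(child, "*", "*")
        lines.append(text.upper() if child.tag == "title" else text)
    return "\n\n".join(lines)


if __name__ == "__main__":
    print(to_html(SOURCE))
    print()
    print(to_plain_text(SOURCE))
```

Because the source records that a phrase is emphasized rather than that it is italic, the decision about appearance is deferred to each output routine. A purely presentational encoding would have fixed that decision at the moment of creation.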

Subsection 4.2.2: HTML 4.0

HyperText Markup Language (or HTML as it is commonly known) is a non-proprietary format markup system used for publishing hypertext on the World Wide Web. To date, it has appeared in four main versions (1.0, 2.0, 3.2, 4.0), with the World Wide Web Consortium (W3C) recommending 4.0 as the markup language of choice. As a result of its accessibility to most browsers and platforms, along with it being a relatively simple markup language to learn, HTML is by far the most popular web-publishing language. It allows users to create online text documents that include various forms of multimedia (such as images, sounds, and video clips), and then put these documents in an environment that allows for instant publication and retrieval.

There are many advantages to a markup language like HTML. As mentioned above, the primary benefit is that a document encoded with HTML can be viewed in almost any browser -- an extremely attractive option for a creator who wants their document viewed by an audience with varied systems. However, it is important to note that while the encoding can cross platforms, there are consistent differences in page appearance between browsers. While W3C recommends the usage of HTML 4.0, many of its features simply are not available to users with early versions of browsers. Unlike PDF, which is hyper-concerned with keeping the document and its format intact, HTML has no true sense of page structure and can neither save nor print files with any sense of precision.

Besides the benefit of a markup language that crosses platforms with ease, HTML attracts its many users for the simple manner with which it can be mastered. For users that don't want to take the time to learn the tagset the good news is that conversion-to-HTML tools are becoming more accessible and easier to use. And for those who don't want to take the time to learn how to use HTML-creation software, of which there are a limited quantity, they can sit down with any text creation program (Notepad for example) and author an HTML document. Then by using the "Open File" tool in a browser, the document can immediately be viewed. What this means for novice HTML authors is that they can sit down with a text creator and a browser and teach themselves a markup language in one session. And as David Seaman, Director of the Electronic Text Center at the University of Virginia, points out, this

"has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative Guidelines -- the premier tagging scheme for most humanities documents -- is not easily grasped. In contrast, the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a Web client) are a good introduction to some of the basic SGML concepts." (Seaman, David. Campus Publishing in Standardized Electronic Formats -- HTML and TEI. http://etext.lib.virginia.edu/articles/arl/dms-arl94.html).
This is a real value for the user. The notion of marking up a text is quite often an overwhelming concept. Most people don't realize that markup enters into their life every time they make a keystroke in a word processing program. So for the uninitiated, HTML provides a manageable stepping-stone into the world of more complex encoding. Once this limited tagset is mastered, many users find the jump into an extended markup language less intimidating -- and more liberating.
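To give a sense of how small the tagset needed for a first document is, the following Python sketch writes a minimal HTML 4.0 page to disk, after which it can be viewed with the browser's "Open File" tool. The page content, the file name notes.html and the helper write_page are all invented for illustration; typing the same markup into any plain-text editor would serve equally well.

```python
from pathlib import Path

# A minimal HTML 4.0 document: a document type declaration, a head with a
# title, and a body with one heading and one paragraph.
PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<html>
<head>
<title>Week 3 reading notes</title>
</head>
<body>
<h1>Week 3 reading notes</h1>
<p>These notes accompany the <em>assigned</em> reading.</p>
</body>
</html>
"""


def write_page(directory, filename="notes.html"):
    """Save the page as a plain-text file and return its path."""
    path = Path(directory) / filename
    path.write_text(PAGE, encoding="utf-8")
    return path


if __name__ == "__main__":
    print("Open this file in a browser:", write_page("."))
```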

Yet the advantage of easy authoring brings up one of the largest drawbacks of HTML, which is the fact that browsers are simply not that picky about the validity of the HTML. Unless something serious is missing from the encoded document, it will be successfully viewed through a Web client. The impact of this is that while HTML provides a convenient and universal markup language for a user, many of the documents floating out in cyberspace are permeated with invalid code. The focus then moves away from authoring documents that conform to a set of encoding guidelines and towards the creation of works that can be viewed in a browser. (Seaman, David. Campus Publishing in Standardized Electronic Formats -- HTML and TEI. http://etext.lib.virginia.edu/articles/arl/dms-arl94.html). This problem will gain more gravity with the increased usage of Extensible Markup Language, or XML as it is more commonly known. This markup language, which is being lauded as the "wave of the future," combines the visual benefits of HTML with the contextual benefits of SGML/TEI. However, while XML will have the universality of HTML, web clients will require a more stringent adherence to markup rules. The problem will arise with the conversion of HTML to XML. While documents that comply with the HTML rules for valid encoding will find the transition relatively simple, the documents that were constructed strictly with viewing in mind will require much clean-up prior to conversion.
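The gap between lenient HTML processing and strict XML processing can be sketched with Python's standard library. This is an illustration under stated assumptions rather than a model of any actual browser: html.parser stands in for a forgiving Web client, xml.etree.ElementTree stands in for an XML processor, and the sample markup and function names are invented.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Deliberately sloppy markup: <b> and <i> overlap, and <p> is never closed.
SLOPPY = "<html><body><p>Some <b>bold <i>text</b></i></body></html>"
# The same content with every element properly nested and closed.
WELL_FORMED = "<html><body><p>Some <b>bold <i>text</i></b></p></body></html>"


class TagCollector(HTMLParser):
    """Records every start tag it sees, without checking nesting."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def lenient_parse(markup):
    """Process markup the forgiving way and return the start tags found."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(lenient_parse(SLOPPY))         # the lenient parser raises no error
    print(strict_parse_ok(SLOPPY))       # the XML parser rejects the same text
    print(strict_parse_ok(WELL_FORMED))  # the cleaned-up version is accepted
```

The sloppy sample passes through the lenient parser without complaint, which mirrors how such a page would still display, while the XML parser refuses it outright. This is the kind of clean-up that documents written only with viewing in mind would need before conversion.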

This is not to say that HTML is not a useful tool for creating online documents. Similar to PostScript and PDF, the choice to use HTML should be document dependent. It is the perfect choice for static documents that will have a short shelf-life. If you're creating course pages or supplementary materials regarding specific readings that will not be necessary or available after the end of term, then HTML is an appropriate choice.

Subsection 4.2.3: User-definable descriptive markup

Subsection 4.3: Implications for long-term preservation and reuse

© 
The right of xxxx to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or part of any of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.
Arts and Humanities Data Service
 