*** THE HONG KONG SOUTH CHINA MORNING POST CORPUS ***

Compiled by Phil Benson (Hong Kong University)
with the assistance of Joseph Leung (South China Morning Post)

The Hong Kong South China Morning Post corpus consists of 2874 Hong Kong and China news reports originally published in the South China Morning Post, Hong Kong's leading English-language daily newspaper by circulation. The reports were published between February 1992 and March 1992. In total, the corpus contains more than 1 million running words.

The reports in the corpus are not a complete set of items for this period, and they are not listed in any special order in the files. The corpus has been produced solely as a large sample of text for linguistic analysis.

The text in the corpus has been prepared from original typesetting tapes, and has been modified only in order to make explicit certain textual features. The modifications are in the form of additional codes in angle, curly and double square brackets:

                  Filename
                  Identification number for each report
                  Date of publication
                  Page number
    {headline}    Start of a headline
    {byline}      Start of a byline
    {article}     Start of text of report
    {/article}    End of text of report
    {para}        Start of a paragraph
    [[break]]     Unreadable text

In South China Morning Post reporting style one sentence is usually equivalent to one paragraph. The {para} code can also be taken, therefore, as denoting the start of a sentence.

Double quotation marks are represented by two single quotation marks (``........''). The dash is represented by a single hyphen with a space on either side (xxx - xxx), and the hyphen which forms compound words is represented by a hyphen with no spaces (xxx-xxx). In some cases, there is a space following the hyphen in compound words (xxx- xxx). This has not been corrected.

Original graphic features of the text, such as line-breaks, page breaks, inter-textual headings, emboldening and large print sizes, are not represented in the corpus.

The Hong Kong South China Morning Post Corpus is made available solely for purposes of research and teaching, and on condition that the user signs and returns the `User Declaration' attached. Copyright of the texts contained in the corpus remains with the South China Morning Post.

1 December 1993

*** THE HONG KONG SOUTH CHINA MORNING POST CORPUS ***
*** INSTRUCTIONS FOR USE ***

The corpus is supplied on three diskettes. Diskette 1 contains a text file containing this documentation (SCMP.DOC), the archive software (ARC.EXE) and one archived file (SCMP1.ARC). Diskette 2 contains the archived file SCMP2.ARC. Diskette 3 contains the archived file SCMP3.ARC.

The three archived files will decompress to 40 text files (SCMP01.TXT - SCMP40.TXT) of approximately 200K each. These text files are in plain ASCII format. Files SCMP01.TXT - SCMP35.TXT contain Hong Kong news reports. Files SCMP36.TXT - SCMP40.TXT contain reports dealing with internal affairs in the People's Republic of China.
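
As an illustration only, the curly-bracket codes and the file numbering described above could be handled along the following lines in Python. This is a minimal sketch, not part of the corpus distribution. It assumes that the codes appear literally in the text exactly as listed, and that the headline and byline precede the article text. The function names and the sample report are invented for the example, and the angle-bracket codes are not handled.

```python
import re

# Assumption: the curly-bracket codes occur literally in the text as listed.
FIELD_CODE = re.compile(r"\{(headline|byline|article|/article|para)\}")


def parse_report(text):
    """Split one report into headline, byline and paragraphs (hypothetical helper)."""
    report = {"headline": "", "byline": "", "paragraphs": []}
    current = None  # the code governing the chunk of text being read
    # With one capturing group, re.split alternates: text, code, text, code, ...
    for i, part in enumerate(FIELD_CODE.split(text)):
        if i % 2 == 1:  # odd positions hold the captured code names
            current = part
            continue
        chunk = " ".join(part.split())  # collapse runs of whitespace
        if not chunk:
            continue
        if current in ("headline", "byline"):
            report[current] = chunk
        elif current in ("article", "para"):
            report["paragraphs"].append(chunk)
        # text before any code, or after {/article}, is ignored
    return report


def normalize_quotes(text):
    """Turn the corpus convention ``...'' into ordinary double quotation marks."""
    return text.replace("``", '"').replace("''", '"')


def region_for_file(filename):
    """Map SCMP01.TXT - SCMP35.TXT to Hong Kong and SCMP36.TXT - SCMP40.TXT to China."""
    match = re.fullmatch(r"SCMP(\d{2})\.TXT", filename.upper())
    if match is None:
        raise ValueError("not a corpus text file name: " + filename)
    number = int(match.group(1))
    if 1 <= number <= 35:
        return "Hong Kong"
    if 36 <= number <= 40:
        return "China"
    raise ValueError("file number out of range: " + filename)


# Invented sample text, not taken from the corpus.
sample = (
    "{headline} Sample headline {byline} By A Reporter "
    "{article} {para} ``First sentence,'' he said. "
    "{para} Second [[break]] sentence. {/article}"
)

report = parse_report(sample)
print(report["headline"])
for paragraph in report["paragraphs"]:
    print(normalize_quotes(paragraph))
print(region_for_file("SCMP36.TXT"))
```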

To decompress the files, copy the three files SCMP01.ARC, SCMP02.ARC and SCMP03.ARC together with the file ARC.EXE into a new directory on your hard disk. From the new directory enter the commands:

    arc e scmp01.arc
    arc e scmp02.arc
    arc e scmp03.arc

You can get more information on the syntax of `arc' commands by simply entering the command:

    arc

*** THE HONG KONG SOUTH CHINA MORNING POST CORPUS ***
*** USER DECLARATION ***

This declaration should be signed by any user of the corpus and returned in hard copy to:

    Phil Benson,
    English Centre,
    7/F K.K.Leung Building,
    University of Hong Kong,
    Pokfulam Road,
    Hong Kong.

Any requests for use of the materials in this corpus which go beyond the conditions stated in this declaration should be addressed to:

    The Editor,
    South China Morning Post,
    Morning Post Building,
    Tong Chong Street,
    Quarry Bay,
    Hong Kong.

______________________________________________________________

*** DECLARATION ***

Please sign the following, in the place indicated:

I undertake:-

1. To use the Hong Kong South China Morning Post Corpus for purposes of scholarly research and teaching only.

2. To acknowledge, in any published or unpublished work based in whole or in part on the corpus, the name of the corpus and the names of its compilers.

3. To inform the compiler of the existence of any such work.

4. To observe normal copyright regulations in citation of texts from the corpus. In the event of publication of extended extracts from the corpus for teaching or dictionary publication, to request permission directly from the South China Morning Post.

5. Not to redistribute the corpus to third parties without the consent of the compilers.

6. To reimburse any costs incurred in distributing the corpus.

Signature ............................................

Date .............................