YU-CORPUS (Serbo-Croatian text corpus)

This is a text corpus consisting of approximately 700 000 words of Serbo-Croatian. The texts are taken from modern (i.e. primarily post-World War II) Yugoslav fiction, and all Serbo-Croatian-speaking areas are represented: Serbia, Croatia, Montenegro, and Bosnia-Hercegovina.

The corpus was compiled by scanning books of fairly high printing quality (one of the parameters of text selection, I must admit). My equipment was a Macintosh computer with 4 MB of RAM and a (French) OCR program called AutoREAD.

Each file consists of prose work(s) by an author who can be identified by the file name. All text files are zipped and must therefore be transferred in binary mode and unzipped before use. The files are of approximately equal size, namely about 300 000 bytes or 50 000 words each.

When unzipped, the texts are plain 8-bit (extended ASCII) texts. They are all in the Latin alphabet - even when the book was printed in Cyrillic. I use the texts with Nota Bene's word processor and text base facilities, so the ASCII values of the special Serbo-Croatian characters follow the Nota Bene standard, which is:

    C w/hachek = ASCII 220        c w/hachek = ASCII 221
    C w/acute  = ASCII 223        c w/acute  = ASCII 222
    D w/stroke = ASCII 127        d w/stroke = ASCII 235
    S w/hachek = ASCII 156        s w/hachek = ASCII 157
    Z w/hachek = ASCII 241        z w/hachek = ASCII 242

The only manipulation I'm guilty of is the splitting of long paragraphs. I use paragraph markers as separators between entries, so when a paragraph becomes longer than half a screen page, I divide it into smaller parts by means of the combination ASCII-179 CR ASCII-179. In Nota Bene's text base system ASCII-179 is ignored (it is treated as a separation marker) but is displayed on-screen. Those who want to use these texts on other systems must be aware of this and replace the sequence ASCII-179 CR ASCII-179 by CR (i.e. carriage return) or by SPACE (which gives the original text).
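
For readers on other systems, the character codes and the paragraph-splitting convention described above can be turned into a small conversion routine. The following Python sketch is only an illustration, not part of the corpus distribution: the names `NOTA_BENE_TO_UNICODE`, `SPLIT_MARKER` and `convert` are hypothetical, the mapping to Unicode letters is my reading of the code list, and accepting CR LF or a bare LF in addition to CR between the two ASCII-179 bytes is an assumption about how the files may have been transferred.

```python
import re
import sys

# Nota Bene byte values for the special Serbo-Croatian characters,
# taken from the list in this README (Unicode targets are an assumption).
NOTA_BENE_TO_UNICODE = {
    220: "Č", 221: "č",   # C/c with hachek
    223: "Ć", 222: "ć",   # C/c with acute
    127: "Đ", 235: "đ",   # D/d with stroke
    156: "Š", 157: "š",   # S/s with hachek
    241: "Ž", 242: "ž",   # Z/z with hachek
}

# ASCII-179, a line break, ASCII-179. The README specifies CR; CR LF and
# bare LF are also accepted here in case line endings were converted.
SPLIT_MARKER = re.compile(rb"\xb3(?:\r\n|\r|\n)\xb3")


def convert(raw: bytes, joiner: bytes = b" ") -> str:
    """Rejoin artificially split paragraphs and decode to Unicode.

    joiner=b" " restores the original text; pass b"\r" to keep the
    split points as ordinary paragraph breaks instead.
    """
    joined = SPLIT_MARKER.sub(lambda _match: joiner, raw)
    # latin-1 maps every byte 0-255 to the code point of the same number,
    # so the table above can be applied with str.translate.
    return joined.decode("latin-1").translate(NOTA_BENE_TO_UNICODE)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python convert.py <unzipped input file> <UTF-8 output file>
    with open(sys.argv[1], "rb") as src:
        text = convert(src.read())
    with open(sys.argv[2], "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

Any byte above 127 that is not in the table is passed through as the Latin-1 character with the same number, so unexpected codes remain visible for inspection rather than being silently dropped.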

The file yu-index.txt describes the contents of the whole set of files, which are:

    bozovic.zip     isakov.zip      kapor.zip
    krleza.zip      lalic.zip       marinkov.zip
    mihailov.zip    nazor.zip       pavlicic.zip
    savic.zip       selimov.zip     tisma.zip
    novele1.zip     novele2.zip     antolog1.zip

I am still working on adding new files to the corpus, and I hope to be able to accomplish my first goal, 1 million words, within a year or so.

All texts have been proofread only once, so I cannot guarantee that there are no misspellings left. In my opinion, however, the texts are fully usable as they are. If you find any misprints, please let me know!

Have a good time!

Henning Moerk
Slavisk Institut
Aarhus Universitet
Ny Munkegade 116
8000 Aarhus C
tel: +45 86 13 65 55
fax: +45 86 19 21 55
e-mail: slavhenn@aau.dk