Digital language resources in Oxford

Digital resources for University of Oxford Users

Resources available online in Oxford
Resources available via CLARIN
Other corpus resources at the University of Oxford
Links to other resources available online

LITERARY resources available online in Oxford

Below is a list of some of the literary resources for which groups within Oxford have licences, and to which students and staff have access.

Text Creation Partnership: the EEBO, ECCO and Evans collections contain more than 60,000 texts (not all literary) produced by the Text Creation Partnership and available from the Oxford Text Archive (OTA). All of these resources are now available without authentication.

LINGUISTIC resources available online in Oxford

Below is a list of some of the linguistic resources for which groups within Oxford have licences, and to which students and staff have access.

British National Corpus:
- BNC website, with information about the corpus
- Download the corpus from the OTA
IVIE Corpus of English dialects
Oxford English Corpus: access available on request to Oxford researchers; please apply via the website
Sketch Engine: institutional access, funded by ELEXIS, has been available since 1 April 2018. Log in here with your University of Oxford single sign-on credentials.
Various corpora from the Oxford Text Archive
Literary and linguistic electronic resources on SOLO. For further information, contact Johanneke Sytsema.

The University of Oxford has Linguistic Data Consortium (LDC) licences for the years 1997, 2008, 2009, 2010, 2013 and 2015. Take a look at the LDC catalogue, and if there is something there that you are interested in and you don't see it in the list below, please get in touch with Martin Wynne.

Thanks are due to OUP, which paid for the 2009 licence in full for the University; to the Department of Computer Science, which paid for the 2010 and 2015 licences; and to the Phonetics Laboratory, which paid for the 1997 and 2013 licences.

The following resources have been downloaded from the LDC and are now available online from IT Services for Oxford users. For the full list of what is available, consult the LDC catalogue, or get in touch via ota at bodleian.ox.ac.uk. Please note that you are bound by the terms and conditions of the user agreements associated with each of these resources, which can be found on the LDC website.

LDC93S1 TIMIT Acoustic-Phonetic Continuous Speech Corpus (436 Mb)
LDC94S14A Air Traffic Control Complete (436 Mb)
LDC96L14 CELEX2 (71 Mb)
LDC96S37 CALLHOME Japanese Speech (1.5 Gb)
LDC96S53 CALLFRIEND Japanese (975 Mb)
LDC97S62 Switchboard-1 Release 2 (10.3 Gb)
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0 (1.1 Mb)
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0 (1.1 Mb)
LDC2008S09 CHAracterizing INdividual Speakers (CHAINS) (2.8 Gb)
LDC2008T18 New York Times Annotated Corpus (3.3 Gb)
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 (8.5 Mb)
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 (2.4 Mb)
LDC2009T10 Language Understanding Annotation Corpus (1.6 Mb)
LDC2009T12 2008 CoNLL Shared Task Data (16.1 Mb)
LDC2009T13 English Gigaword Fourth Edition: disk 1 (4 Gb) and disk 2 (4.5 Gb)
LDC2009T22 Arabic Newswire English Translation Collection (4 Mb)
LDC2009T23 FactBank 1.0 (4 Mb)
LDC2009T24 OntoNotes Release 3.0 (448 Mb)
LDC2009T30 Arabic Gigaword Fourth Edition (2.7 Gb)
LDC2010L01 LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1 (1.2 Mb)
LDC2013S06 LDC Spoken Language Sampler - Second Release (97 Mb)
LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation (77 kb)
LDC2013T19 OntoNotes Release 5.0 (932 Mb)
LDC2015T13 English News Text Treebank: Penn Treebank Revised (9.7 Mb)

Please visit the LDC website for more information about these resources, and to consult the relevant licence agreements. Note that these resources are for use by members of the University of Oxford, and you are not permitted to redistribute them.

If the files are too big for you to download over the web, get in touch via ota at bodleian.ox.ac.uk.

The following have also been downloaded by the Phonetics Laboratory and might be available by arrangement.

LDC94S13B CSR-II (WSJ1) Sennheiser
LDC96L17 CALLHOME Japanese Lexicon
LDC96T18 CALLHOME Japanese Transcripts
LDC94S13A CSR-II (WSJ1) Complete
LDC93S2 NTIMIT

Resources available via CLARIN

The UK is a member of the CLARIN European Research Infrastructure Consortium, which offers easy access to language data and tools for research in the humanities and social sciences. Up-to-date information on activities and resources can be found on the CLARIN website. The CLARIN-UK Consortium is coordinated from the University of Oxford.

Certain resources have restricted access but are now accessible to authenticated users from Oxford - see the following page:

CLARIN protected resources: these include resources in Czech, Danish, Dutch, English, German and Norwegian, and online interfaces to a number of other languages via the Corpuscle archive at the University of Bergen, among them Abkhazian, Bulgarian, Older Scots, Persian and Slovenian. In most cases, you need to log in to these sites by following the link to 'Log in via your institution' or 'EduGAIN', or simply 'Log in', and you will then be redirected to WebAuth.
Virtual Language Observatory is the gateway to a larger number of resources. The VLO is a resource discovery service aggregating records for resources held in most of the major archives world-wide.
CLARIN Resource Showcases is an online collection of training materials, case studies and expert contacts from the entire CLARIN network. It is aimed at researchers and students at all stages who work in the Digital Humanities and Social Sciences and are interested in analyzing language data and using the text processing tools available in the CLARIN infrastructure.
CLARIN-UK also makes available to users in Oxford a number of important resources, provided by the members of the CLARIN-UK Consortium.
Oral History & Technology: a new website which will feature tools for processing audio data, including speech synthesis and alignment.

Other corpus resources at the University of Oxford

There are further corpora, copies of which may be available in Oxford, but under a variety of different licensing and access arrangements (often on optical disk). Please get in touch to add to the list. For these resources, contact Martin Wynne unless otherwise stated.

BNC XML version, BNC Baby (sampler on one CD)
Corpus of Spoken Dutch
Corpus of Spoken Japanese
IPI-PAN corpus of Polish
COLT Corpus of London Teenagers' Speech
Gesprochenes Jiddisch Textzeugen einer Europäisch-jüdischen Kultur
ICAME corpus collection
East meets West: a compendium of multilingual resources (the TELRI CD, parallel aligned corpora in many European languages)

Further sources for help and advice:

Links to other resources available online

Sketch Engine, available through your Oxford single sign-on, thanks to the ELEXIS European lexicographic infrastructure, with support from the Bodleian Libraries
CQPweb at Lancaster University, with many corpora, mostly, but not all, English language (register with your Oxford email address to get maximum access rights)
Mark Davies' online corpora at Brigham Young University; also accessible via Databases A-Z thanks to a subscription from the Bodleian Libraries Electronic Resources Team on behalf of the University: BYU Corpora via Databases A-Z (link last checked 17th December 2019)
AntConc: free desktop software application for text and corpus analysis (Mac, Linux and Windows versions)
WordSmith Tools: desktop software application for text and corpus analysis (version 4 now free, Windows only)
