Subject Guides: Linguistics: Linguistic Corpora

Linguistic Data Consortium (LDC)

The Linguistic Data Consortium (LDC) is a key resource for discovering text and speech corpora. Its holdings include:

Hundreds of corpora of language data
Indexed collections of Arabic, Chinese and English newswire text
Millions of words of English telephone speech from the Switchboard and Fisher collections

To find and access a dataset, follow the instructions on the LDC page linked above. If you require additional assistance, please email: libldc@ualberta.ca.

BYU English Corpora

English Corpora

Select corpora available from BYU Corpora (English Corpora) include:

Corpus of Canadian English (Strathy)
Corpus of Contemporary American English (COCA)
Corpus of Historical American English (COHA)
Coronavirus Corpus
British National Corpus (BNC)

Other ways to access the BNC:
The full BNC (in XML) can be downloaded from the Oxford Text Archive
The corpus can be explored online via BNCWeb (hosted by Lancaster University); registration is required
Available on disc from the University of Alberta Library

Additional English language corpora

Dictionary of Old English Corpus

Multilingual Corpora

Archive of the Indigenous Languages of Latin America (AILLA)
The Language Archive

The Language Archive contains various types of materials, including: audio and video language corpus data from languages around the world; photographs, notes, experimental data, and other information needed to document and describe languages and how people use them; records of speech in everyday interactions in families and communities; naturalistic data from adult conversations in endangered and under-studied languages, and on linguistic phenomena; and experimental stimuli and data.
Rutgers Optimality Archive
Natural Language Software Registry
Open Language Archives Community (OLAC)

Linguistic Corpora

Linguistic corpora are collections of linguistic data, comprising speech and text databases, lexicons, text corpora and other metadata-enriched textual resources used for language and linguistic research.

Corpora are used in areas such as:

Publishing: Dictionaries, grammar books, teaching materials, thesauri. Publishers often cite their use of corpus facilities, so it is important to know how well their corpora are planned and constructed.
Linguistic Research: Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics, etc.
Artificial Intelligence: Data test bed for program development.
Natural Language Processing: Taggers, parsers, natural language understanding programs, spell-checking word lists, etc.
Language Teaching: Syllabus and materials design, classroom reference, independent learner research.
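
As a small illustration of corpus text serving as raw data for studying lexis, the sketch below counts word frequencies in a short passage using only the Python standard library. The sample sentence and the function name `word_frequencies` are hypothetical and do not come from any corpus listed in this guide; real corpora usually ship with their own tokenization and annotation, so treat the regular expression here as a rough assumption rather than a recommended tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return a Counter of lowercase word tokens found in text.

    Tokenization is deliberately naive (an assumption for this sketch):
    runs of ASCII letters, optionally followed by an apostrophe and
    more letters, e.g. "don't".
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


# Hypothetical sample text, standing in for a file from a corpus.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)

# Print the most frequent words with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

To try this on a downloaded corpus file, read the file's contents into a string and pass it to `word_frequencies`; for annotated formats such as XML, the markup would need to be stripped or parsed first.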