Linguistic Data Consortium (LDC)

Linguistic Data Consortium (LDC) is a key resource for discovering text corpora. To find and access a dataset, follow the instructions on the LDC page linked above. If you require additional assistance, please email:

BYU English Corpora

Select corpora available from BYU Corpora (English Corpora) include:

  • Corpus of Canadian English (Strathy) 
  • Corpus of Contemporary American English (COCA) 
  • The Corpus of Historical American English (COHA) 
  • Coronavirus Corpus 
  • British National Corpus (BNC) 

    Other ways to access the BNC:

  • Full BNC (in XML) can be downloaded from the Oxford Text Archive
  • Explore the corpus online via BNCWeb (hosted by Lancaster University); registration is required
  • Available on disc from the University of Alberta Library

Additional English language corpora

Multilingual Corpora

About the LDC

The Linguistic Data Consortium (LDC) is an open consortium of research organizations hosted by the University of Pennsylvania. It creates, collects, and distributes speech and text databases, lexicons, and other resources for research and development purposes.
The LDC Catalog includes:
  • Hundreds of corpora of language data
  • Indexed collections of Arabic, Chinese and English newswire text
  • Millions of words of English telephone speech from the Switchboard and Fisher collections
The University of Alberta is an institutional member of the Linguistic Data Consortium (LDC). All language corpora listed in the LDC Catalog are available on request.

Linguistic Corpora

Linguistic corpora are collections of linguistic data, comprising speech and text databases, lexicons, text corpora, and other metadata-enriched textual resources used for language and linguistic research.
Some uses of text corpora:
  • Publishing: Dictionaries, grammar books, teaching materials, thesauri. Publishers often cite their use of corpora, so it is important to know how well those corpora are planned and constructed.
  • Linguistic Research: Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics, etc.
  • Artificial Intelligence: Data test bed for program development.
  • Natural Language Processing: Taggers, parsers, natural language understanding programs, spell-checking word lists, etc.
  • Language Teaching: Syllabus and materials design, classroom reference, independent learner research.
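
As a small illustration of the lexical uses listed above (raw data for studying lexis, word lists for spell checking), the following Python sketch builds a word-frequency list from a plain-text sample. It is a minimal example under simplifying assumptions, not a method prescribed by any of the corpora named in this guide: the function name `word_frequencies`, the sample sentence, and the tokenization rule (runs of letters, optionally with internal apostrophes) are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in a plain-text sample."""
    # Simplifying assumption: a token is a run of ASCII letters,
    # optionally containing internal apostrophes (e.g. "didn't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(tokens)


# Hypothetical sample text; a real study would read corpus files instead.
sample = "The cat sat on the mat. The dog didn't sit."
freqs = word_frequencies(sample)

# Print the three most frequent word forms with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Real corpus work would normally rely on the tokenization and annotation supplied with the corpus itself, since a simple pattern like this one mishandles hyphenated words, numbers, and non-English text.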