Linguistic Data Consortium (LDC)

The Linguistic Data Consortium (LDC) is a key resource for discovering text corpora, and includes:
  • Hundreds of corpora of language data
  • Indexed collections of Arabic, Chinese and English newswire text
  • Millions of words of English telephone speech from the Switchboard and Fisher collections

To find and access a dataset, follow the instructions on the LDC page linked above. If you require additional assistance, please email:

BYU English Corpora

Select corpora available from BYU Corpora (English Corpora) include:

  • Corpus of Canadian English (Strathy) 
  • Corpus of Contemporary American English (COCA) 
  • The Corpus of Historical American English (COHA) 
  • Coronavirus Corpus 
  • British National Corpus (BNC) 

    Other ways to access the BNC:

  • Full BNC (in XML) can be downloaded from the Oxford Text Archive
  • Explore the corpus online via BNCWeb (hosted by Lancaster University), registration required
  • Available on disc from the University of Alberta Library

Additional English language corpora

Multilingual Corpora

Linguistic Corpora

Linguistic corpora are collections of linguistic data, comprising speech and text databases, lexicons, text corpora and other metadata-enriched textual resources used for language and linguistic research. Common uses include:

  • Publishing: Dictionaries, grammar books, teaching materials, thesauri. Publishers often cite their use of corpus facilities, so it is important to know how well their corpora are planned and constructed.
  • Linguistic Research: Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics, etc.
  • Artificial Intelligence: Data test bed for program development.
  • Natural Language Processing: Taggers, parsers, natural language understanding programs, spell checking word lists, etc.
  • Language Teaching: Syllabus and materials design, classroom reference, independent learner research.
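
As a rough illustration of the research and word-list uses above, the sketch below counts word frequencies in a short text and derives a simple word list. It is a minimal example only: the sample sentence, the function names `tokenize` and `word_frequencies`, and the frequency cutoff of 2 are all assumptions made for illustration, and real corpora such as the BNC or COCA come in their own formats with their own access terms.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for a real corpus file.
SAMPLE_TEXT = "The cat sat on the mat. The dog sat on the log."


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word token occurs in the text."""
    return Counter(tokenize(text))


freqs = word_frequencies(SAMPLE_TEXT)

# Words seen at least twice (an arbitrary cutoff), sorted alphabetically,
# as a crude word list of the kind a spell checker might start from.
word_list = sorted(word for word, count in freqs.items() if count >= 2)

print(freqs.most_common(3))
print(word_list)
```

The regular expression here is a deliberately naive tokenizer; research use would normally rely on the tokenization and annotation supplied with the corpus itself.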