Linguistics

Resources for using Corpora

BYU English Corpora

Select corpora available from BYU Corpora (English Corpora) include:

  • Corpus of Canadian English (Strathy) 
  • Corpus of Contemporary American English (COCA) 
  • Corpus of Historical American English (COHA)
  • Coronavirus Corpus 
  • British National Corpus (BNC) 

Other ways to access the BNC:

Additional English language corpora

Multilingual Corpora

Linguistic Data Consortium (LDC)

The Linguistic Data Consortium (LDC) is a key resource for discovering text corpora. To find and access a dataset, follow the instructions on the LDC page linked above. If you require additional assistance, please email: libldc@ualberta.ca.

About the LDC

The Linguistic Data Consortium (LDC) is an open consortium of research organizations hosted by the University of Pennsylvania. It creates, collects, and distributes speech and text databases, lexicons, and other resources for research and development purposes.
The LDC Catalog includes:
  • Hundreds of corpora of language data
  • Indexed collections of Arabic, Chinese and English newswire text
  • Millions of words of English telephone speech from the Switchboard and Fisher collections
The University of Alberta is an institutional member of the Linguistic Data Consortium (LDC). All language corpora listed in the LDC Catalog are available on request.