Other Text Corpora


Textual data comprise speech and text databases, lexicons, text corpora, and other textual resources enriched with metadata that are used for language and linguistic research. Some uses of text corpora are:

Publishing: Dictionaries, grammar books, teaching materials, usage guides, thesauri. Publishers increasingly cite their use of corpus facilities, so it is important to know how well their corpora are planned and constructed.

Linguistic Research: Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics, etc.

Artificial Intelligence: Data test bed for program development.

Natural Language Processing: Taggers, parsers, natural language understanding programs, spell-checking word lists, etc.

Language Teaching: Syllabus and materials design, classroom reference, independent learner research.

The University of Alberta Library subscribes to two main sources of textual data:

The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories hosted by the University of Pennsylvania. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The LDC's catalogue contains hundreds of corpora of language data. In addition, LDC Online contains an indexed collection of Arabic, Chinese and English newswire text; millions of words of English telephone speech from the Switchboard and Fisher collections; the American English Spoken Lexicon; and the full text of the Brown corpus. Using LDC Online, users can search textual data and play audio extracts for transcribed utterances in standard web browsers.

To access LDC corpora, search the LDC Catalog, and follow these instructions (NB: the University of Alberta is an institutional LDC member). Alternatively, contact Data Help, and we will provide you with a copy of your desired item. 

The British National Corpus (BNC) is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a broad cross-section of British English from the later part of the 20th century. The latest edition is the 2007 BNC XML Edition. The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (an automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists, etc.). Full classification, contextual, and bibliographic information is also included with each text in the form of a TEI-conformant header.
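As a rough illustration of what the part-of-speech encoding makes possible, the following Python sketch reads word and tag pairs from a small XML fragment. The fragment is a hypothetical example written for this guide, not text taken from the corpus, and it assumes that words appear as w elements and punctuation as c elements, each carrying its CLAWS tag in a c5 attribute. Check the documentation supplied with your copy of the corpus before relying on these names. The function name tagged_tokens is likewise only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written for illustration; not copied from the corpus.
# Assumption: <w> holds a word, <c> holds punctuation, and the "c5"
# attribute holds the CLAWS part-of-speech tag.
SAMPLE = """
<s n="1">
  <w c5="AT0" hw="the" pos="ART">The </w>
  <w c5="NN1" hw="corpus" pos="SUBST">corpus </w>
  <w c5="VBZ" hw="be" pos="VERB">is </w>
  <w c5="AJ0" hw="large" pos="ADJ">large</w>
  <c c5="PUN">.</c>
</s>
"""


def tagged_tokens(xml_text):
    """Return (token, tag) pairs in document order."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for element in root.iter():
        # Skip structural elements such as <s>; keep words and punctuation.
        if element.tag in ("w", "c"):
            text = (element.text or "").strip()
            tokens.append((text, element.get("c5")))
    return tokens


tokens = tagged_tokens(SAMPLE)
for token, tag in tokens:
    print(f"{token}\t{tag}")
```

The same loop could be pointed at a full corpus file with ET.parse, although for very large files an incremental parser such as ET.iterparse is likely to use less memory.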



Data Help
2-10 Cameron Library