Subject Guides: Data: Textual Data

Other Text Corpora

English Corpora

Select corpora available from BYU Corpora (English Corpora) include:

Corpus of Canadian English (Strathy)
Corpus of Contemporary American English (COCA)
The Corpus of Historical American English (COHA)
Coronavirus Corpus
British National Corpus (BNC)

Other ways to access the BNC:
Full BNC (in XML) can be downloaded from the Oxford Text Archive
Explore the corpus online via BNCWeb (hosted by Lancaster University), registration required
Available on disc from the University of Alberta Library

Additional English language corpora

Voices of the International Corpus of English (VOICE) - Canada
American National Corpus (ANC)
Manually Annotated Sub-Corpus (MASC)

Dictionary of Old English Corpus

TEXTUAL DATA

Textual data comprise speech and text databases, lexicons, text corpora, and other textual resources enriched with metadata that are used for language and linguistic research. Some uses of text corpora are:

Publishing: Dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers refer to the use they make of corpus facilities, so it is important to know how well their corpora are planned and constructed.

Linguistic Research: Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics, etc.

Artificial Intelligence: Data test bed for program development.

Natural Language Processing: Taggers, parsers, natural language understanding programs, spell-checking word lists, etc.

Language Teaching: Syllabus and materials design, classroom reference, independent learner research.

The University of Alberta Library subscribes to two main sources of textual data:

LINGUISTIC DATA CONSORTIUM (LDC)

The Linguistic Data Consortium (LDC) is a key source of textual data to which the University of Alberta Library subscribes. The LDC is an open consortium of universities, companies and government research laboratories, hosted by the University of Pennsylvania, that creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes.

The LDC Catalog includes:

* Hundreds of corpora of language data

* Indexed collections of Arabic, Chinese and English newswire text

* Millions of words of English telephone speech from the Switchboard and Fisher collections

The University of Alberta is an institutional member of the Linguistic Data Consortium (LDC). All language corpora listed in the LDC Catalog are available on request.

To access corpora in the LDC:

1. Look up the individual language corpora by typing ‘LDC’ and your language of interest in the Search the Library box on the UofA Libraries homepage.

2. When you find the result you want, click on the Request Form and fill it out with as many details as possible.

3. Once your request has been processed, you will be provided with a link to download the data if it is available online. If it is available only in a physical format (CD, hard drive, etc.), you will be notified when it is ready for pickup.

4. If you cannot find what you are looking for in the Library Catalogue, use the LDC Catalog, which provides advanced search options:

* Find the title you are looking for in the LDC Catalog

* Go to the Request Form page

* Fill out a Request Form with as much detail as possible

The LDC Request Form is monitored Monday through Friday during regular hours.

If you require additional assistance, please email: libldc@ualberta.ca

BRITISH NATIONAL CORPUS

The BNC is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century. The latest edition is the 2007 BNC XML Edition. The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (an automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists). Full classification, contextual, and bibliographic information is also included with each text in the form of a TEI-conformant header.
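As a rough illustration of what this encoding makes possible, the sketch below reads a small hand-written fragment modelled on the BNC XML markup and pulls out the title from the header together with each word's form, CLAWS C5 tag, and headword. The fragment is invented for this example and is not an excerpt from the corpus; the function names `tagged_sentences` and `c5_frequencies` are likewise assumptions made for illustration. Real BNC files are considerably richer, so treat this as a starting point rather than a complete reader.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment imitating BNC XML conventions (not real corpus data):
#   <s n="..."> marks a sentence, <w> a word, <c> a punctuation mark;
#   c5 = CLAWS C5 tag, hw = headword, pos = simplified word class.
SAMPLE = """<bncDoc>
  <teiHeader><fileDesc><titleStmt><title>Illustrative sample</title></titleStmt></fileDesc></teiHeader>
  <wtext type="OTHERPUB">
    <div>
      <head><s n="1"><w c5="NN1" hw="corpus" pos="SUBST">Corpus </w><w c5="NN2" hw="note" pos="SUBST">notes</w></s></head>
      <p><s n="2"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="library" pos="SUBST">library </w><w c5="VVZ" hw="hold" pos="VERB">holds </w><w c5="NN2" hw="text" pos="SUBST">texts</w><c c5="PUN">.</c></s></p>
    </div>
  </wtext>
</bncDoc>"""


def tagged_sentences(xml_text):
    """Return a list of (sentence number, [(word, C5 tag, headword), ...])."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # Only <w> elements are collected; punctuation (<c>) is skipped.
        tokens = [((w.text or "").strip(), w.get("c5"), w.get("hw"))
                  for w in s.iter("w")]
        sentences.append((s.get("n"), tokens))
    return sentences


def c5_frequencies(sentences):
    """Count how often each C5 tag occurs across all sentences."""
    return Counter(tag for _, tokens in sentences for _, tag, _ in tokens)


root = ET.fromstring(SAMPLE)
# The header carries the bibliographic description of the text.
title = root.findtext("teiHeader/fileDesc/titleStmt/title")

sentences = tagged_sentences(SAMPLE)
print(title)
for number, tokens in sentences:
    print(number, tokens)
print(c5_frequencies(sentences).most_common())
```

For a file downloaded from the Oxford Text Archive, the same functions could be applied to the file's contents after reading it into a string. For very large files, a streaming parser such as `ET.iterparse` may be a better fit.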

DATA HELP

Contact:

data@ualberta.ca
780-492-5212
2-10 Cameron Library