Collins COBUILD

Corpus Concordance Sampler

The Collins WordbanksOnline English corpus is composed of 56 million words of contemporary written and spoken text. To get a flavour of the type of linguistic data that a corpus like this can provide, you can type in some simple queries here and get a display of concordance lines from the corpus. The query syntax allows you to specify word combinations, wildcards, part-of-speech tags, and so on.


Type in your query:

Which sub-corpora should be searched?

British books, ephemera, radio, newspapers, magazines (26m words)
American books, ephemera and radio (9m words)
British transcribed speech (10m words)

To get sample concordances, press this button:

Note that output from this demo facility will be restricted to 40 lines of concordance. The lines to be displayed will be selected on an every-Nth basis.
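As a rough illustration of every-Nth selection, the sketch below thins a list of concordance hits down to at most 40 lines. The function name and the way the step N is chosen are assumptions made for the example; the page does not say how the demo computes N.

```python
import math


def sample_every_nth(lines, limit=40):
    """Keep at most `limit` items by taking every Nth item, in order."""
    if len(lines) <= limit:
        return list(lines)
    # Assumed rule: use the smallest step N that keeps the result within the limit.
    step = math.ceil(len(lines) / limit)
    return lines[::step]


hits = [f"hit {i}" for i in range(1000)]
sample = sample_every_nth(hits)
print(len(sample), sample[:3])
```

Because the step is rounded up, the sample can contain fewer lines than the limit, but it always spans the whole result set rather than just its beginning.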


Collocation Sampler

Type in your word:

Select a significance score to be calculated:

Mutual Information
T-score

To get collocations, press this button:

Note that output from this demo facility will be restricted to 100 collocates. These will be the statistically most significant ones according to the score you have selected.
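The page does not give the formulas behind the two scores. The sketch below uses the definitions commonly found in corpus linguistics, which may differ in detail from what the demo computes: the expected co-occurrence count is taken as f(node) x f(collocate) / corpus size (ignoring the size of the collocation window), Mutual Information as log2(observed / expected), and the T-score as (observed - expected) / sqrt(observed). All function and parameter names are hypothetical.

```python
import math


def expected_count(f_node, f_collocate, corpus_size):
    """Co-occurrences expected if the two words were independent (assumed model)."""
    return f_node * f_collocate / corpus_size


def mutual_information(observed, f_node, f_collocate, corpus_size):
    """log2 of observed over expected co-occurrence."""
    return math.log2(observed / expected_count(f_node, f_collocate, corpus_size))


def t_score(observed, f_node, f_collocate, corpus_size):
    """Difference between observed and expected, scaled by sqrt(observed)."""
    expected = expected_count(f_node, f_collocate, corpus_size)
    return (observed - expected) / math.sqrt(observed)


# Made-up counts, chosen only to keep the arithmetic easy to follow.
mi = mutual_information(16, 1000, 1000, 1_000_000)
t = t_score(16, 1000, 1000, 1_000_000)
print(mi, t)
```

Under these definitions, Mutual Information tends to rank rare but exclusive pairings highly, while the T-score tends to favour frequent pairings, which is one reason to compare both lists.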


Query Syntax

Overview

A query is made up of one or more terms concatenated with a + symbol. E.g. hell+hole would search for the word "hell" immediately followed by the word "hole".

A term is a simple alphabetic string, optionally ending in an asterisk (*) or an at-sign (@). Several such strings may be joined with vertical bars (|) to form a set, and a term may be followed by an oblique stroke (/) and a part-of-speech tag.

Word combinations

The plus may be followed by a number to indicate the maximum number of intervening words. E.g. dog+4bark will search for "dog" followed by "bark" with up to 4 words intervening.
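To make "up to 4 words intervening" concrete, here is a minimal sketch of the matching rule over a list of tokens. The function is hypothetical and simplifies freely: it compares tokens exactly (case-sensitive) and reports every qualifying pair, which the real search engine may not do.

```python
def find_within(tokens, first, second, max_gap=0):
    """Return (i, j) pairs where `first` at i is followed by `second` at j
    with at most `max_gap` words in between."""
    hits = []
    for i, token in enumerate(tokens):
        if token != first:
            continue
        # Positions i+1 .. i+max_gap+1 leave 0 .. max_gap intervening words.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


tokens = "the dog did not bark at all".split()
print(find_within(tokens, "dog", "bark", max_gap=4))  # two words intervene
print(find_within(tokens, "dog", "bark", max_gap=1))  # gap too small
```

With max_gap left at 0 the function behaves like a plain + between two terms: the second word must follow the first immediately.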

Inflected Forms

An at-sign (@) appended to a string of letters causes the software to expand the wordform preceding the @ symbol into a set of inflected forms. For example, the query blew@+away will search for the set of words "blow", "blows", "blowing", "blew" followed by the word "away".

Trailing wildcard

An asterisk appended to a string of letters indicates a wildcard match for all characters at the end of a word. Be careful with this feature: in a large corpus there are a surprising number of matching words for any given prefix string. Using cut* to get instances of "cut", "cuts" and "cutting" is probably a bad idea.
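The over-matching problem is easy to see on even a tiny word list. The sketch below treats a trailing asterisk as a plain prefix match; the function name and the vocabulary are made up for the illustration.

```python
def wildcard_matches(pattern, vocabulary):
    """Words matched by a pattern that may end in a trailing asterisk."""
    if not pattern.endswith("*"):
        return [word for word in vocabulary if word == pattern]
    prefix = pattern[:-1]
    return [word for word in vocabulary if word.startswith(prefix)]


vocabulary = ["cut", "cuts", "cutting", "cute", "cutlery", "cutlet", "cat"]
print(wildcard_matches("cut*", vocabulary))
# ['cut', 'cuts', 'cutting', 'cute', 'cutlery', 'cutlet']
```

Only three of the six matches are forms of "cut"; in a corpus of millions of words the proportion of unwanted matches is likely to be much worse.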

Word sets

Words (or wildcard words) can be strung together with vertical bars to match an explicit set of words. E.g. cut|cuts|cutting

Part-of-speech tags

The corpus has been tagged automatically with a statistical tagger. You can specify a search on word/TAG combinations by appending an oblique stroke and a part-of-speech tag. POS tags must be in uppercase. Here are some major POS tags:

NOUN    a macro tag: stands for any noun tag
VERB    a macro tag: stands for any verb tag
NN      common noun
NNS     noun plural
JJ      adjective
AT      definite and indefinite article
RB      adverb
VB      base-form verb
VBN     past participle verb
VBG     -ing form verb
VBD     past tense verb

Putting it all together

Word sets, wildcards and part-of-speech tags can be combined within a term. The vertical bar binds more tightly than the oblique stroke, so that fool|fools|fooling|fooled/VERB matches these four words when any of them occurs as a verb.
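The precedence rule can be sketched as a small parser that splits a query into terms. This is an illustration of the syntax as described on this page, not the software's actual parser; the Term class, its field names and parse_query are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Term:
    words: list          # alternatives; each may end in * or @
    tag: Optional[str]   # part-of-speech tag, if any
    max_gap: int         # intervening words allowed before this term


def parse_query(query):
    """Split a query such as 'dog+4bark' or 'fool|fools/VERB' into terms."""
    # Splitting on '+' plus optional digits keeps the gap numbers in the result.
    parts = re.split(r"\+(\d*)", query)
    gaps = [0] + [int(g) if g else 0 for g in parts[1::2]]
    terms = []
    for gap, text in zip(gaps, parts[0::2]):
        # Split off the tag first, so it applies to the whole word set.
        words, _, tag = text.partition("/")
        if tag and not tag.isupper():
            raise ValueError("POS tags must be in uppercase: " + tag)
        terms.append(Term(words.split("|"), tag or None, gap))
    return terms


print(parse_query("fool|fools|fooling|fooled/VERB"))
print(parse_query("dog+4bark"))
```

Splitting on the oblique stroke before splitting on the vertical bars is what makes the tag cover all four words in the first example rather than only "fooled".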
