Corpora in lexical studies Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com


Page 1: Corpora in lexical studies Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com.

Corpora in lexical studies

Corpus Linguistics

Richard Xiao

lancsxiaoz@googlemail.com


Aims of this session

• Lecture
  – Corpus-based lexicography
  – Collocation and colligation
• Lab session
  – Collocation using WST
  – Collocation using AntConc
  – Collocation and colligation in Xaira
  – Using the BNCweb to study collocation


Corpus revolution in lexicographic and lexical studies

• Lexicographic and lexical studies are the greatest beneficiaries of corpora

• Corpora have “revolutionised” dictionary making and reference publishing
  – It is now nearly unheard of for new dictionaries and new editions of old dictionaries published from the 1990s onwards not to claim to be based on corpus data


Why use corpora in dictionary making?

• Machine-readable corpora allow dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds

• Corpora allow dictionary makers to select entries based on frequency information

• Corpora can readily provide frequency information and collocation information for readers

• Textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) information encoded in corpora allows lexicographers to give a more accurate description of the usage of a lexical item
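
As a minimal sketch of the frequency information mentioned above, the following Python snippet builds a word frequency list from a tiny invented text. The function name `frequency_list`, the tokenising pattern and the sample sentences are assumptions made for illustration only; a real dictionary project would work from a large, properly tokenised corpus.

```python
from collections import Counter
import re

def frequency_list(text):
    """Return (word, count) pairs, most frequent first; ties sorted alphabetically."""
    # Very rough tokenisation: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy text invented for illustration; not taken from any real corpus.
sample = "The cat sat on the mat. The dog sat on the log."
freq = frequency_list(sample)

for word, count in freq[:3]:
    print(word, count)
```

Even in this toy text the grammatical word "the" heads the list, which is why frequency alone is usually combined with other evidence when selecting dictionary entries.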


Why use corpora in dictionary making?

• Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of polysemous words and homographs

• A “monitor corpus” allows lexicographers to track subtle changes in the meaning and usage of a lexical item so as to keep their dictionaries up-to-date

• Corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable because they may be biased


Five emphases

• Changes brought about by corpora to dictionaries and other reference books: five “emphases” (Hunston 2002)
  – an emphasis on frequency
  – an emphasis on collocation and phraseology
  – an emphasis on variation
  – an emphasis on lexis in grammar
  – an emphasis on authenticity


Top 1000 written / spoken words

Authentic examples


Corpus-based learner dictionaries

• First ‘fully corpus-based’ dictionary – Collins Cobuild English Dictionary (1987)

• Some corpus-based learner dictionaries
  – Longman Dictionary of Contemporary English (3rd edition)
  – Oxford Advanced Learner’s Dictionary (OALD, 5th edition)
  – Cambridge International Dictionary of English (1st edition)


Collocation

• Collocation is among the linguistic concepts which have benefited most from advances in corpus linguistics
• What is collocation?
  – strong tea, powerful car (Halliday 1976)
  – “collocations of a given word are statements of the habitual or customary places of that word…the company that words keep” (Firth 1968:181-2)
    • “One of the meanings of night is its collocability with dark” (Firth 1957:196)
  – “a frequent co-occurrence of two lexical items in the language” (Greenbaum 1974:82)
    • expel a school child vs. cashier an army officer
• “I propose to bring forward as a technical term, meaning by collocation, and apply the test of collocability” (Firth 1957: 194)


Meaning by collocation

• “There is frequently so high a degree of interdependence between lexemes which tend to occur in texts in collocation with one another that their potentiality for collocation is reasonably described as being part of their meaning” (Lyons 1977: 613)

• Complete description of the meaning of a word would have to include the other word or words that collocate with it

• “You shall know a word by the company it keeps!” (Firth 1968:179)

• Collocation is part of the word meaning


Two types of collocation

• Coherence collocation vs. neighbourhood (horizontal) collocation (Scott 1998)
  – Coherence collocation
    • Collocates associated with a word (e.g. letter – stamp, post office)
  – Neighbourhood collocation
    • Words which do actually co-occur with the word (letter – my, this, a, etc.)


Coherence collocation

• “A cover term for the cohesion that results from the co-occurrence of lexical items that are in some way or other typically associated with one another, because they tend to occur in similar environments.” (Halliday & Hasan 1976:287)
  – candle – flame – flicker
  – hair – comb – curl – wave
  – sky – sunshine – cloud – rain

• Difficult to measure using a statistical formula


Neighbourhood collocation

• Collocation in corpus linguistics
• Structure of collocation – collocation window
  – “We may use the term node to refer to an item whose collocations we are studying, and we may then define a span as the number of lexical items on each side of a node that we consider relevant to that node. Items in the environment set by the span we will call collocates.” (Sinclair 1966:415)
• Casual vs. significant collocation
  – Significant collocation: collocation that occurs more frequently than would be expected (in a statistical sense) on the basis of the individual items

• n.b. Neighbourhood (horizontal) collocations can include some coherence collocations
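
The node / span / collocate terminology above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `collocates`, the whitespace tokenisation and the invented sentence (modelled loosely on "on the stroke of") are not taken from any of the tools used in the lab session.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words found within `span` tokens on each side of every `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens to the left
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens to the right
        counts.update(left + right)
    return counts

# Invented example sentence, tokenised on whitespace.
tokens = ("on the stroke of midnight the clock struck "
          "and on the stroke of noon it stopped").split()
result = collocates(tokens, "stroke", span=2)

for word, count in result.most_common():
    print(word, count)
```

With a span of 2, only "on", "the", "of", "midnight" and "noon" fall inside the window; raw counts like these are the input to the significance measures discussed later.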


Intuition vs. collocation

• Greenbaum (1974): “people disagree on collocations” in introspection-based elicitation experiments
• Although “collocation can be observed informally” on the basis of intuitions, “it is more reliable to measure it statistically, and for this a corpus is essential” (Hunston 2002: 68)
• Intuition is often a poor guide to collocation
  – “because each of us has only a partial knowledge of the language, we have prejudices and preferences, our memory is weak, our imagination is powerful (so we can conceive of possible contexts for the most implausible utterances), and we tend to notice unusual words or structures but often overlook ordinary ones” (Krishnamurthy 2000: 32-33)

• Collocation can be measured on the basis of co-occurrence statistics (MI, z, t, LL etc) – more discussion to follow


Collocation is syntagmatic

famous boots. On the stroke of full time the Stoke
the lead on the stroke of half-time with a goal
Smith sin-binned on the stroke of half-time, added a
clinched their win on the stroke of lunch after resuming
chase by declaring on the stroke of lunch. <p> With a lead
expectant crowd, on the stroke of midday. The bird
hour began not upon the stroke of midnight but upon the
of midnight but upon the stroke of noon. There was,
booked in advance. On the stroke of seven, a gong summons
Promptly on the stroke of six 'clock, the chooks
from Edinburgh on the stroke of the Millennium.

Parole (Utterance): syntagmatic

Langue (Language system): paradigmatic


Collocation vs. colligation

• Collocation
  – Relationship between a lexical item and other lexical items
    • Relationship between words at the lexical level
    • E.g. very collocates with good
• Colligation
  – Relationship between a lexical item and a grammatical category
    • Relationship between words at the grammatical level
    • E.g. very colligates with ADJ
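
The contrast can be illustrated with a small hand-tagged example: counting the words to the right of "very" gives its collocates, while counting their part-of-speech tags gives its colligates. This is a hedged sketch; the function name `right_neighbours`, the tag labels and the tagged tokens are all invented for illustration and do not reflect the tagset of any particular corpus.

```python
from collections import Counter

def right_neighbours(tagged_tokens, node):
    """Count the words and the POS tags found immediately to the right of `node`."""
    words, tags = Counter(), Counter()
    for (word, _tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word == node:
            words[next_word] += 1   # lexical level: collocates
            tags[next_tag] += 1     # grammatical level: colligates
    return words, tags

# Invented (word, tag) pairs; the tag names are hypothetical.
tagged = [
    ("a", "DET"), ("very", "ADV"), ("good", "ADJ"), ("idea", "NOUN"),
    ("and", "CONJ"), ("very", "ADV"), ("good", "ADJ"), ("food", "NOUN"),
    ("served", "VERB"), ("very", "ADV"), ("quickly", "ADV"),
]
words, tags = right_neighbours(tagged, "very")

print(words.most_common())
print(tags.most_common())
```

In this toy data "good" is the most frequent collocate of "very" and ADJ its most frequent colligate.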


WST Collocate settings

Concord tab


WST collocates

Strength of relationship is displayed as 0.000 if it hasn't yet been computed


Strength of collocation relationship

A wordlist is required


Highlight and double click…


…to see the selected collocate


Collocates in AntConc


Collocation in Xaira


Colligation in Xaira


Exploring collocation with BNCweb

http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php


Search for “sweet”


Concordances of “sweet”

KWIC view


KWIC view
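
For readers without access to the tool, a KWIC (key word in context) display can be approximated as below. This is a minimal sketch, not how BNCweb produces its view: the function name `kwic`, the bracket notation around the node and the invented sentence are assumptions for illustration.

```python
def kwic(tokens, node, width=3):
    """Return one line per hit, with `width` tokens of context on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that the node words line up.
            lines.append(f"{left:>25} [{token}] {right}")
    return lines

# Invented sentence, tokenised on whitespace.
tokens = "the sweet smell of success and a sweet taste of honey".split()
lines = kwic(tokens, "sweet", width=3)

for line in lines:
    print(line)
```

Aligning the node in a fixed column is what makes recurrent patterns to its left and right easy to spot by eye.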


Dropdown menu: collocations


Collocation setting


Collocation database (default settings)


Adjusting settings


Noun collocates of “sweet”

Click on a word to see its collocation info


Collocation info of “sweet” + “smell”

Click on a number to see concordances of collocates at that position


Concordances of “smell” at R2


Collocation statistics


Rank by frequency

Frequent words crowd into the top of the collocate list: are they genuine collocates?


Rank by the t test

• Also focusing on frequent words?
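
One commonly cited form of the t score is (O - E) / sqrt(O), where O is the observed number of co-occurrences and E the number expected by chance. The sketch below assumes that form and a window-adjusted expected frequency; the exact variant used by a given tool may differ, and all counts are invented toy figures rather than real corpus data.

```python
import math

def expected(f_node, f_coll, corpus_size, span):
    """Co-occurrences expected by chance within a window of `span` tokens (assumed form)."""
    return f_node * f_coll * span / corpus_size

def t_score(observed, f_node, f_coll, corpus_size, span):
    """t score in the (O - E) / sqrt(O) form."""
    e = expected(f_node, f_coll, corpus_size, span)
    return (observed - e) / math.sqrt(observed)

# Invented counts: node occurs 1,000 times in a 1,000,000-token corpus, span of 8.
# A frequent collocate (25,000 occurrences, 400 co-occurrences) versus
# a rare collocate (5 occurrences, 4 co-occurrences).
t_frequent = t_score(400, 1000, 25_000, 1_000_000, 8)
t_rare = t_score(4, 1000, 5, 1_000_000, 8)

print(round(t_frequent, 2), round(t_rare, 2))
```

With these toy figures the frequent collocate receives the higher score, which is consistent with the question raised on the slide.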


Rank by MI

Infrequent words at the top of the list

How useful are they (especially to English learners)?
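
Why infrequent words rise to the top can be seen from the usual definition MI = log2(O / E): a handful of co-occurrences is enough when the expected value E is tiny. The following sketch assumes that definition and a window-adjusted E; the counts are invented for illustration and the function names are hypothetical.

```python
import math

def expected(f_node, f_coll, corpus_size, span):
    """Co-occurrences expected by chance within a window of `span` tokens (assumed form)."""
    return f_node * f_coll * span / corpus_size

def mutual_information(observed, f_node, f_coll, corpus_size, span):
    """MI as the base-2 logarithm of observed over expected co-occurrences."""
    return math.log2(observed / expected(f_node, f_coll, corpus_size, span))

# Invented counts: node occurs 1,000 times in a 1,000,000-token corpus, span of 8.
mi_frequent = mutual_information(400, 1000, 25_000, 1_000_000, 8)  # common collocate
mi_rare = mutual_information(4, 1000, 5, 1_000_000, 8)             # rare collocate

print(round(mi_frequent, 2), round(mi_rare, 2))
```

Here the rare word, seen only four times with the node, outranks the word that co-occurs 400 times, so a minimum frequency threshold is often applied alongside MI.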


Rank by the z score

Like MI, the z score also over-estimates infrequent items (e.g. nothings, afton, marjoram)
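
A simplified z score, (O - E) / sqrt(E), shows the same tendency: dividing by the square root of a very small expected value inflates the result. The sketch below assumes this simplified form (some definitions use a slightly different denominator) and uses invented counts.

```python
import math

def expected(f_node, f_coll, corpus_size, span):
    """Co-occurrences expected by chance within a window of `span` tokens (assumed form)."""
    return f_node * f_coll * span / corpus_size

def z_score(observed, f_node, f_coll, corpus_size, span):
    """Simplified z score: (O - E) / sqrt(E)."""
    e = expected(f_node, f_coll, corpus_size, span)
    return (observed - e) / math.sqrt(e)

# Invented counts: node occurs 1,000 times in a 1,000,000-token corpus, span of 8.
z_frequent = z_score(400, 1000, 25_000, 1_000_000, 8)  # common collocate
z_rare = z_score(4, 1000, 5, 1_000_000, 8)             # rare collocate

print(round(z_frequent, 2), round(z_rare, 2))
```

As with MI, the rare collocate comes out ahead of the frequent one in this toy comparison.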


Log-likelihood test
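
The log-likelihood statistic compares observed and expected counts across all four cells of a 2 x 2 contingency table: G2 = 2 * sum(O * ln(O / E)). The sketch below is a simplified illustration that builds the table directly from the node frequency, the collocate frequency and the co-occurrence count, ignoring the window size; that simplification, the function name and the counts are all assumptions.

```python
import math

def log_likelihood(o11, f_node, f_coll, corpus_size):
    """G2 for a 2 x 2 table built from co-occurrence and marginal counts."""
    observed = [
        o11,                                   # node with collocate
        f_node - o11,                          # node without collocate
        f_coll - o11,                          # collocate without node
        corpus_size - f_node - f_coll + o11,   # neither
    ]
    rows = [f_node, f_node, corpus_size - f_node, corpus_size - f_node]
    cols = [f_coll, corpus_size - f_coll, f_coll, corpus_size - f_coll]
    total = 0.0
    for o, r, c in zip(observed, rows, cols):
        if o > 0:                              # a zero cell contributes nothing
            e = r * c / corpus_size            # expected count under independence
            total += o * math.log(o / e)
    return 2 * total

# Invented counts in a 1,000-token toy corpus.
ll_independent = log_likelihood(5, 100, 50, 1000)   # observed equals expected
ll_associated = log_likelihood(20, 100, 50, 1000)   # four times the expected count

print(round(ll_independent, 2), round(ll_associated, 2))
```

A value above 3.84 corresponds to significance at the 0.05 level for one degree of freedom, which the second toy pair clearly exceeds.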


Rank by MI3
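
MI3 is usually defined as log2(O^3 / E): cubing the observed frequency gives more weight to pairs that co-occur often and so counteracts the bias of plain MI towards rare words. A minimal sketch under that definition, with invented counts and a window-adjusted expected frequency as assumptions:

```python
import math

def expected(f_node, f_coll, corpus_size, span):
    """Co-occurrences expected by chance within a window of `span` tokens (assumed form)."""
    return f_node * f_coll * span / corpus_size

def mi3(observed, f_node, f_coll, corpus_size, span):
    """Cubed variant of MI: log2(O**3 / E)."""
    return math.log2(observed ** 3 / expected(f_node, f_coll, corpus_size, span))

# Invented counts: node occurs 1,000 times in a 1,000,000-token corpus, span of 8.
mi3_frequent = mi3(400, 1000, 25_000, 1_000_000, 8)  # common collocate
mi3_rare = mi3(4, 1000, 5, 1_000_000, 8)             # rare collocate

print(round(mi3_frequent, 2), round(mi3_rare, 2))
```

With the same toy counts on which plain MI would favour the rare word, MI3 places the frequent collocate first.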


Rank by Dice coefficient
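
The Dice coefficient relates the co-occurrence count to the two individual frequencies: 2 * O / (f_node + f_collocate), giving a value between 0 and 1. The sketch below assumes this standard form; the function name and the counts are invented for illustration.

```python
def dice(observed, f_node, f_coll):
    """Dice coefficient: twice the co-occurrences over the summed frequencies."""
    return 2 * observed / (f_node + f_coll)

# Invented counts: node occurs 1,000 times.
dice_frequent = dice(400, 1000, 25_000)  # common collocate
dice_rare = dice(4, 1000, 5)             # rare collocate
dice_perfect = dice(10, 10, 10)          # two words that only ever occur together

print(round(dice_frequent, 4), round(dice_rare, 4), dice_perfect)
```

Unlike MI or the z score, the measure does not depend on corpus size, and it reaches 1 only when the two words never occur apart.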