Word Association Thesaurus as a Resource for Extending
Semantic Networks
Anna Sinopalnikova (1, 2), Pavel Smrz (1)
(1) Faculty of Informatics, Masaryk University, Botanicka 68a, 602 00 Brno, Czech Republic
(2) Saint-Petersburg State University, Universitetskaya 11, Saint-Petersburg, Russia
{anna, smrz}@fi.muni.cz
Overview
- Motivation
- Word Association and other notions of psycholinguistics
- WAT vs. Corpus
- Semantic Information from WAT: core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information
Types of Semantic Resources used in NLP
Corpora:
1. These are primary resources, presenting (more or less) ‘raw’ data on the language in use.
2. Information is given implicitly.
3. Special extraction procedures and tools are needed.
Dictionaries, thesauri, ontologies, taxonomies:
1. These are ‘derived’ resources, presenting explications of some internal knowledge. They are based on primary resources plus the researcher’s intuition.
2. Information is given explicitly.
Motivation
- There is still a need for an empirical basis for semantic network construction.
- Semantic Web initiatives.
- WATs are available for many languages.
- Nobody knows what they are good for or how to use them.
Word Association and other notions of psycholinguistics
- Word Association
- Word Association Test
- Word Association Norms (WAN)
- Word Association Thesaurus (WAT)
Example
Stimulus needle –> thread: 41, pin: 13, sharp: 6, sew: 5, cotton: 2, dressmaker: 1, fix: 1, prick: 1, sewing: 1, sow: 1, spring: 1, stitch: 1, etc.
WATs explored
- RAT: Russian WAT by Karaulov et al. (1994-1998): 8000 stimuli, 23000 words covered, 1000 subjects
- EAT: Edinburgh WAT by Kiss et al. (1972): 8400 stimuli, 54000 words covered, 1000 subjects
- Czech WAN (Novak et al., 1996): 150 stimuli, 4000 words covered, 250 subjects
Experience gained in projects:
- RussNet (a wordnet-like database for Russian linking lexical semantics with derivational morphology)
- Czech part of the BalkaNet project (a multilingual wordnet-like network for 5 Balkan languages and Czech)
WAT vs. Corpus
History: Church & Hanks, 1990; Wettler & Rapp, 1993; Willners, 2001
Corpora:
- Bokrjonok 3.0: balanced corpus for Russian (16 mln words)
- BNC: British National Corpus (112 mln words)
- CNC: Czech National Corpus (160 mln words) and its unbalanced version (630 mln words)
Research procedure: 5000 pairs (e.g. cheese – mouse, dark – alley) were extracted from each WAN in random order and then searched for in the corpora. The window span was fixed to -10; +10 words.
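The search step can be illustrated with a minimal sketch. It assumes the corpus is already tokenized (and, for inflected languages such as Russian or Czech, lemmatized) into a flat list of strings; the function names `cooccurs_within_window` and `missing_share` are hypothetical and are not taken from the original study.

```python
from typing import List, Sequence, Tuple


def cooccurs_within_window(tokens: Sequence[str], stimulus: str,
                           response: str, span: int = 10) -> bool:
    """True if `response` appears within `span` tokens of any `stimulus` occurrence."""
    tokens = list(tokens)
    for i, tok in enumerate(tokens):
        if tok != stimulus:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before
        right = tokens[i + 1:i + span + 1]     # up to `span` tokens after
        if response in left or response in right:
            return True
    return False


def missing_share(tokens: Sequence[str], pairs: List[Tuple[str, str]],
                  span: int = 10) -> Tuple[float, List[Tuple[str, str]]]:
    """Share of association pairs never found in the window, plus the missing pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    missing = [(s, r) for s, r in pairs
               if not cooccurs_within_window(tokens, s, r, span)]
    return len(missing) / len(pairs), missing


# Toy illustration only; a real run would stream a corpus such as the BNC.
toy_corpus = "the mouse ate the cheese in the dark alley".split()
toy_pairs = [("cheese", "mouse"), ("dark", "alley"), ("needle", "thread")]
share, absent = missing_share(toy_corpus, toy_pairs)
```

A linear scan like this would be slow on a corpus of hundreds of millions of words; a positional index per word is the usual alternative, at the cost of memory.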
WAN vs. Corpus: Russian
Quantitative analysis (Sinopalnikova, 2004):
- 64% of word associations do not occur in the corpus,
- 49% when unique associations (those with absolute frequency = 1) are excluded.
Qualitative analysis:
- a high proportion of syntagmatic associations are absent,
- for verbs this number was up to 84%.
WAN vs. Corpus: Russian (2)
Relation               %       Relation             %
PARADIGMATIC:          21,4    SYNTAGMATIC:         48,7
antonymy               1,5     Adj+N                7,9
cause                  1,6     N+Adj                4,8
co-hyponymy            4,9     V+Adv                9,1
has_subevent           0,8     V+N (agent)          3,5
hyponymy               2,5     V+N (instrument)     1,4
is_subevent            2,9     V+N (location)       1,5
meronymy               0,5     V+N (object)         8,3
synonymy               2,9     V+N (patient)        9,8
xpos_near_synonymy     3,6     V+V                  1,1
others                 0,2     others               1,3
DOMAIN                 13,4
OTHER                  16,5
WAN vs. Corpus: English
Quantitative analysis:
- 31% of word associations do not occur in the BNC
Qualitative analysis (%):
- PARADIGMATIC 57,1
- SYNTAGMATIC 8,4
- DOMAIN 21,7
- OTHER 12,8
WAN vs. Corpus: English (2)
- acquiring synonymy and hyponymy, e.g. sex – fornicate (archaic or humorous), ire (poetic) – anger, cowardly – yellow (slang)
- acquiring information about low-frequency words, e.g. perambulate (N_BNC = 3), fornicate (N_BNC = 6); cf. EAT: perambulate – walk: 30, pram: 17, baby: 9, push: 8, about: 1, dawdle: 1, move: 1, promenade: 1, slowly: 1, stroll: 1, through: 1, wander: 1, etc.
- acquiring domain relations; the absent portion of them was surprisingly large for a corpus such as the BNC, e.g. ink-pot – pen: 24, non-violence – peace: 29, offside – soccer: 2
WAN vs. Corpus: Czech
Quantitative analysis:
- 514 associations missing (10,28%)
Qualitative analysis:
- the proportion of syntagmatic and paradigmatic associations among them was similar to that for English
Extracting semantic information from WAT
Associations:
- by form: 10% (e.g. know – no, yellow – mellow)
- by meaning: 90% (e.g. needle – sew, yellow – sun)
Types of information: core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information
Core concepts
In WAT one can observe words that have an above-average number of direct links to other words.
- Russian: человек, мир, дом, жизнь, есть, думать, жить, идти, большой, хорошо, плохо, нет (не), новый, дерево etc. (295 words with more than 100 relations);
- English: man, sex, no (not), love, house; work, eat, think, go, live; good, old, small etc. (586 words with more than 100 relations);
- Czech: člověk, dům, strom; jíst, jít, myslet; moc, starý, velký, bílý, hezký etc.
These words determine the fundamental concepts of a particular language system, and thus should be incorporated into an ontology as its core components (e.g., SUMO upper concepts or EWN Base Concepts).
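One way such highly connected words might be picked out is sketched below. It assumes the WAT is available as a list of (stimulus, response) pairs and that a "relation" is a distinct direct link counted in either direction; both are assumptions made for illustration, and the names `link_counts` and `core_candidates` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def link_counts(associations: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of distinct words each word is directly linked to (undirected)."""
    neighbours: Dict[str, Set[str]] = defaultdict(set)
    for stimulus, response in associations:
        if stimulus == response:
            continue  # ignore self-associations
        neighbours[stimulus].add(response)
        neighbours[response].add(stimulus)
    return {word: len(linked) for word, linked in neighbours.items()}


def core_candidates(associations: Iterable[Tuple[str, str]],
                    threshold: int = 100) -> List[str]:
    """Words with more than `threshold` distinct links, most connected first."""
    counts = link_counts(associations)
    hubs = [word for word, n in counts.items() if n > threshold]
    return sorted(hubs, key=lambda word: (-counts[word], word))


# Toy data; with a real WAT the default threshold of 100 would apply.
toy_wat = [("needle", "man"), ("house", "man"), ("love", "man"),
           ("needle", "thread")]
toy_hubs = core_candidates(toy_wat, threshold=2)
```

The fixed threshold mirrors the "more than 100 relations" cut-off on the slide; a cut-off relative to the mean link count would be an equally plausible reading of "above-average".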
Semantic primitives
WAT could also provide a list of basic concepts associated with each separate word, thus revealing the semantics of a word (situation) as a list of semantic constituents, i.e. separate pieces of information.
Abstract words (verbs, adjectives or nouns denoting complex situations or emotional states) are difficult to decompose by means of logic and intuition.
E.g. depression could be reduced to:
- its constituents: sad 7, low 5, black 4, manic 4, sadness 3, bored 3, misery 2, tiredness 2, despair 1, gloom 1, grey 1, hopelessness 1, monotony 1, sick 1, mood 1, nerves 1, etc.,
- its probable causes: rain 3, guilt 1, pain 1, unemployment 1,
- its probable effects: suicide 1,
- its antipodes: elation 3, fun 1, happiness 1, etc.
Syntagmatic and paradigmatic relations
“Linguistic substitutes for reality”: WA reflect the order of events in reality, the way objects are organized in space, and the way human beings experience them.
Associations by contiguity: e.g. cry – baby may be treated as a manifestation of the syntagmatic relation between a verb and its subject, and take – hand as a ROLE_INSTRUMENT relation.
Generalization! E.g. drink – water, beer, milk, ale, Coca-cola, coffee, juice, etc. found in WAT should be generalized as a drink ROLE_OBJECT beverage relation and incorporated in the semantic network in that form.
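The generalization step could be sketched as finding the most specific hypernym shared by all responses. The hypernym table below is a hand-made toy (in practice it might come from a wordnet), and the function names are hypothetical; this is an illustration under those assumptions, not the procedure used in the projects.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Toy hypernym table (assumption): child -> parent.
TOY_HYPERNYMS: Dict[str, str] = {
    "water": "beverage", "milk": "beverage", "coffee": "beverage",
    "juice": "beverage", "ale": "beer", "beer": "alcoholic beverage",
    "alcoholic beverage": "beverage", "beverage": "liquid",
}


def hypernym_chain(word: str, hypernym_of: Dict[str, str]) -> List[str]:
    """Hypernyms of `word`, from most specific to most general."""
    chain, seen = [], {word}
    while word in hypernym_of:
        word = hypernym_of[word]
        if word in seen:  # guard against cycles in the table
            break
        seen.add(word)
        chain.append(word)
    return chain


def generalize(responses: Sequence[str],
               hypernym_of: Dict[str, str]) -> Optional[str]:
    """Most specific hypernym shared by every response, or None if there is none."""
    chains = [hypernym_chain(r, hypernym_of) for r in responses]
    if not chains:
        return None
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None


responses = ["water", "beer", "milk", "ale", "coffee", "juice"]
concept = generalize(responses, TOY_HYPERNYMS)
relation: Tuple[str, str, Optional[str]] = ("drink", "ROLE_OBJECT", concept)
```

Responses missing from the hypernym table (e.g. a brand name such as Coca-cola) have an empty chain and block generalization, so they would have to be mapped or filtered beforehand.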
Syntagmatic and paradigmatic relations (2)
The law of contiguity cannot explain all associations.
Law of similarity: e.g. inanimate – dead: 39 (SYNONYMY), seek – find: 56 (CAUSE relation), buy – sell: 56 (CONVERSIVE relation).
One of the main benefits of WAT: paradigmatic relations are given explicitly, as opposed to other sources of empirical data (e.g. text corpora).
Domain information
WAT explicitly present the way common words are grouped together according to the fragments of reality they describe. E.g., hospital –> nurse, doctor, pain, ill, injury, load…
Types of domain relations:
- name of domain (situation) – domain member, e.g. hospital – nurse: 8, finance – money: 61, football – player: 4, marriage – husband: 2;
- participant – participant, e.g. pepper – salt: 58, tamer – lion: 69, needle – thread: 41, mouse – cat: 22;
- participant – circumstance, e.g. umbrella – rain: 58, actor – stage: 23;
- participant – pointer to its function/role in the situation, e.g. larder – food: 58, envelope – letter: 60, actor – play: 15, etc.
Open question: should the types of domain relations be differentiated within the semantic network, or included as a uniform IS_ASSOCIATED_TO relation?
Conclusions
Advantages of using WAT in constructing a semantic network:
- Simplicity of data acquisition.
- Broad variety of semantic information to acquire.
- Empirical nature of the data extracted (as opposed to theoretical data, cf. conventional ontologies, taxonomies or classification schemes, which rely on the researcher’s introspection and intuition and hence lead to over- and under-estimation of the phenomena under consideration).
- Probabilistic nature of the data presented (the data reflect the relative rather than absolute relevance of semantic relations in each particular case).
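The relative relevance of links can be illustrated by normalizing raw response counts for one stimulus. The sketch below uses the needle responses quoted earlier and normalizes over the listed responses only (those hidden behind "etc." are unknown and omitted), so the resulting values are an approximation under that assumption; `relative_strengths` is a hypothetical helper name.

```python
from typing import Dict


def relative_strengths(counts: Dict[str, int]) -> Dict[str, float]:
    """Turn raw response counts for one stimulus into shares that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one response")
    return {response: n / total for response, n in counts.items()}


# Listed responses to the stimulus "needle" (the "etc." remainder is not included).
needle_counts = {
    "thread": 41, "pin": 13, "sharp": 6, "sew": 5, "cotton": 2,
    "dressmaker": 1, "fix": 1, "prick": 1, "sewing": 1, "sow": 1,
    "spring": 1, "stitch": 1,
}
needle_strengths = relative_strengths(needle_counts)
```

Such shares could serve as edge weights in the semantic network, which makes links for different stimuli comparable even when the number of respondents differs.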
Thank you...