VACNET: Extracting and analyzing non-trivial linguistic structures at scale
Matthew Brook O’Donnell, Nick C. Ellis, Ute Römer & Gin Corden
English Language Institute
The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining
April 22, 2011
Challenge of natural language for data mining
• Much work in NLP, IR and text classification relies upon frequency analysis of
  – single words
  – n-grams (contiguous word sequences of various lengths)
• These units are computationally trivial to retrieve: the Map-Reduce ‘Hello World’!
• Techniques tend to use a ‘bag of words’ approach, disregarding structure
• Frequency and statistical measures highlight distinctive items and document ‘aboutness’
• But this is a weak proxy for meaning, which remains somewhat elusive!
Typical NLP Pipeline

text → Sentence splitting → Word tokenization → POS tagging → Chunking/Parsing → Named-entity recognition → meaning???

Can linguistic theory help?... NLP tools:
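The first two pipeline stages above can be illustrated with a toy sketch (a hypothetical regex-based splitter and tokenizer for illustration only, not the project's actual tooling):

```python
import re

def split_sentences(text):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    # Naive word tokenizer: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "She talks about her book. He worries about money!"
for sent in split_sentences(text):
    print(tokenize(sent))
```

Real pipelines use trained models for each stage; the point is that each step adds structure that a bag-of-words view throws away.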
Challenge of natural language for data mining
Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous enough approach that relevant context can be taken into account.
Matthew Russell, Author
Can linguistic theory help?... What is relevant context?
Learning meaning in language
How are we able to learn what novel words mean?
① She moogels about her book
• each word contributes individual meaning
• verb meaning is central; yet verbs are highly polysemous
• the larger configuration of words carries meaning
• these we call CONSTRUCTIONS: V about n

moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC), words like: talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip. VACs are ‘recurrent patterns of linguistic elements that serve some well-defined linguistic function’ (Ellis 2003).
Collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using:
• the COBUILD Verb Grammar Patterns descriptions
• tools from computational and corpus linguistics
• techniques from data mining, machine learning and network analysis
The project has two components:
(1) a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs
(2) psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected.
VACNET
V about n – some examples
• He grumbled incessantly about the ‘disgusting’ provincial life we had to lead on the island
• You should try to think ahead about your financial situation
• He worried persistently about the poverty of his social life
• She would keep banging on about her son
• He wondered briefly about the effects of prolonged exposure to solar radiation
• The housekeeper left the room, muttering about ingratitude
• I do not want to carp about the work of the Committee
• ‘Any views expressed about Master Matthew?’
• There are several other valid justifications for teaching explicitly about language
• Those who gossip about him tend to meet with nasty accidents.
• TASK
  – retrieval of 700+ verb argument constructions from a 100-million-word corpus with minimal intervention, but a requirement for high precision and high recall
• Multidisciplinary TEAM
  – linguists, psychologists, information scientists
  – undergraduate/graduate student RAs, faculty
• TOOLS
  – dependency-parsed corpus in GraphML format
  – web-based precision analysis tool
  – processing pipeline
VACNET: Language engineering challenge
Architecture: Large scale extraction of constructions

[Architecture diagram, components: CORPUS (BNC, 100 million words); POS tagging & dependency parsing; COBUILD Verb Patterns construction descriptions; CouchDB document database; word sense disambiguation (WordNet, DISCO); statistical analysis of distributions; network analysis & visualization; web application]
Method: Collaborative semi-automatic extraction
1. DEFINE search graph
2. ENCODE in XML
3. CONVERT to Python code
4. SEARCH corpus and RECORD matches
5. ERROR CODE
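Step 4 amounts to subgraph matching over dependency parses. A minimal sketch, assuming a toy token/edge representation (the data layout and helper function are hypothetical, not VACNET's generated code):

```python
# Match the "V about n" search graph against a toy dependency parse.
# Tokens are (index, word, POS); edges are (head, dependent, relation).

def match_v_about_n(tokens, edges):
    """Return (verb, noun) pairs where verb --prep--> 'about' --pobj--> noun."""
    pos = {i: p for i, _, p in tokens}
    word = {i: w for i, w, _ in tokens}
    matches = []
    for head, dep, rel in edges:
        # Find a verb governing 'about' via a prep relation
        if rel == "prep" and word[dep] == "about" and pos[head].startswith("V"):
            for h2, d2, r2 in edges:
                # ... whose 'about' governs a noun via pobj
                if h2 == dep and r2 == "pobj" and pos[d2].startswith("N"):
                    matches.append((word[head], word[d2]))
    return matches

# Parse of "She worried about the poverty"
tokens = [(0, "She", "PRP"), (1, "worried", "VBD"), (2, "about", "IN"),
          (3, "the", "DT"), (4, "poverty", "NN")]
edges = [(1, 0, "nsubj"), (1, 2, "prep"), (2, 4, "pobj"), (4, 3, "det")]
print(match_v_about_n(tokens, edges))  # → [('worried', 'poverty')]
```

The real system compiles each XML-encoded search graph into code of this shape and runs it over the whole GraphML-parsed corpus.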
Precision analysis interface
Recall analysis
Results: V about n

Verb     VAC freq
talk     2232
think    1810
know     879
hear     349
worry    347
forget   322
write    299
ask      298
say      281
care     250
go       203
complain 192
speak    181
find     148
learn    143
be       124
feel     118
look     115
wonder   102
read     101
Types (list of different verbs occurring in VAC)
Frequency (Zipfian?) Contingency
(attraction of verb construction)
Semantics prototypicality of meaning & radial structure (Zipfian?)
Results: V about n

Verb       VAC freq  Corpus freq  Faithfulness
reminisce  12        98           0.1224
moon       5         51           0.098
talk       2232      24566        0.0909
brag       5         69           0.0725
carp       5         72           0.0694
worry      347       5027         0.069
generalize 15        244          0.0615
generalise 10        176          0.0568
enthuse    13        236          0.0551
complain   192       3947         0.0486
grumble    18        407          0.0442
rave       9         205          0.0439
fret       10        265          0.0377
fuss       9         246          0.0366
care       250       7064         0.0354
speculate  26        771          0.0337
gossip     9         270          0.0333
forget     322       10240        0.0314
enquire    38        1341         0.0283
prowl      5         179          0.0279
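The Faithfulness column is simply a verb's VAC frequency divided by its overall corpus frequency, which can be checked directly against the table:

```python
def faithfulness(vac_freq, corpus_freq):
    # Proportion of a verb's corpus occurrences that fall inside the VAC
    return vac_freq / corpus_freq

# Values from the table above
print(round(faithfulness(12, 98), 4))       # reminisce → 0.1224
print(round(faithfulness(2232, 24566), 4))  # talk → 0.0909
```

High-faithfulness verbs like reminisce occur rarely overall but mostly inside this construction, whereas talk is frequent everywhere.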
VAC             Types  Tokens  TTR    Lead verb    Token*Faith  MIcw
V about n       365    3519    10.37  talk         talk         brag
V across n      799    4889    16.34  come         spread       scud
V after n       1168   7528    15.52  look         look         lust
V among pl-n    417    1228    33.96  find         divide       nestle
V around n      761    3801    20.02  look         revolve      traipse
V as adj        235    1012    23.22  know         regard       class
V as n          1702   34383   4.95   know         act          masquerade
V at n          1302   9700    13.42  look         look         officiate
V between pl-n  669    3572    18.73  distinguish  distinguish  sandwich
V for n         2779   79894   3.48   look         wait         vie
V in n          2671   37766   7.07   find         result       couch
V into n        1873   46488   4.03   go           divide       delve
V like n        548    1972    27.79  look         look         glitter
V n n           663    9183    7.22   give         give         rename
V of n          1222   25155   4.86   think        consist      partake
V over n        1312   9269    14.15  go           preside      pore
V through n     842    4936    17.06  go           riffle       riffle
V to n          707    7823    9.04   go           listen       randomize
V towards n     190    732     25.96  move         bias         gravitate
V under n       1243   8514    14.6   come         come         wilt
V way prep      365    2896    12.6   make         wend         wend
V with n        1942   24932   7.79   deal         deal         pepper
Initial Findings

• The frequency distributions for the types occupying each VAC are Zipfian
• The most frequent verb for each VAC is much more frequent than the other members, taking the lion's share of the distribution
• The most frequent verb in each VAC is prototypical of that construction's functional interpretation, generic in its action semantics
• VACs are selective in their verb form family occupancy:
  – individual verbs select particular constructions
  – particular constructions select particular verbs
  – there is greater contingency between verb types and constructions
• VACs are coherent in their semantics.
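The Zipfian claim can be roughly checked against the twenty V about n frequencies listed earlier (only the head of the distribution, so this is an illustration, not the project's analysis):

```python
import math

# Verb frequencies for "V about n", copied from the results table
freqs = [2232, 1810, 879, 349, 347, 322, 299, 298, 281, 250,
         203, 192, 181, 148, 143, 124, 118, 115, 102, 101]

# A Zipfian distribution is roughly linear on a log-log plot of rank vs
# frequency; fit the slope by ordinary least squares
xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

# "Lion's share": proportion of these tokens taken by the top verb (talk)
share = freqs[0] / sum(freqs)
print(f"log-log slope: {slope:.2f}, top-verb share: {share:.1%}")
```

The slope comes out near the classic Zipf value of -1, and talk alone accounts for roughly a quarter of the tokens in this truncated list.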
What do speakers know about verbs in VACs?
s/he/it _____ about the …
Two experiments: 276 native speakers & 276 L1 German speakers of English, asked to fill the gap with the first word that comes to mind given the prompt.
But what about meaning?
• We want to quantify the semantic coherence or ‘clumpiness’ of the verbs extracted in the previous steps
  – {think, know, hear, worry, care,…} ABOUT
• Construction patterns are productive units in language and subject to polysemy, just like words. Can we separate meaning groups within verb distributions?
  – COMMUNICATION: {talk, write, ask, say, argue,…} ABOUT
  – COGNITION: {think, know, hear, worry, care,…} ABOUT
  – MOTION: {move, walk, run, fall, wander,…} ABOUT
• The semantic sources must not be based on localized distributional language analysis; use WordNet and Roget's:
  – Pedersen et al. (2004) WordNet similarity measures
  – Kennedy, A. (2009). The Open Roget's Project: electronic lexical knowledge base
Building a semantic network
• Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.) to create a network
  – nodes = lemma forms from the VAC/CEC distribution
  – edges = links between nodes for the top n similarity scores for a pair of verbs
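A minimal sketch of this construction step, with invented similarity scores standing in for the WordNet/Roget/DISCO measures:

```python
def build_network(sim, top_n):
    # Keep only the top_n most similar verb pairs as edges
    edges = sorted(sim, key=sim.get, reverse=True)[:top_n]
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

# Hypothetical pairwise similarity scores for verbs from V about n
sim = {
    ("talk", "speak"): 0.90, ("talk", "say"): 0.80, ("speak", "say"): 0.85,
    ("think", "know"): 0.75, ("think", "wonder"): 0.70, ("know", "wonder"): 0.60,
    ("talk", "think"): 0.20, ("say", "know"): 0.15,
}

net = build_network(sim, top_n=6)
print(net)
```

Thresholding at the top n scores drops the weak cross-cluster links, so the COMMUNICATION and COGNITION verbs fall into separate connected regions of the network.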
[Figure: community detection over the top 100 verbs in VAC V about n, showing COGNITION and COMMUNICATION clusters]
Semantic Networks• Exploring community detection algorithms
• Edge Betweenness (Girvan and Newman, 2002)
• Fast Greedy (Clauset, Newman and Moore, 2004)
• Label Propagation (Raghavan, Albert and Kumara, 2007)
• Leading Eigenvector (Newman, 2006)
• Spinglass (Reichardt and Bornholdt, 2006)
• Walktrap (Pons and Latapy, 2005)
• Louvain (Blondel, Guillaume, Lambiotte and Lefebvre, 2008)
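Of these, label propagation (Raghavan et al., 2007) is compact enough to sketch in plain Python; the toy verb graph and the deterministic tie-breaking are simplifications for illustration:

```python
def label_propagation(graph, max_iter=100):
    # Raghavan et al. (2007): every node repeatedly adopts the label most
    # frequent among its neighbours; communities emerge when labels
    # stabilise. Ties break by the lexicographically greatest label here,
    # so the sketch is reproducible (the original uses random order).
    labels = {n: n for n in graph}
    for _ in range(max_iter):
        changed = False
        for n in sorted(graph):
            counts = {}
            for nb in graph[n]:
                counts[labels[nb]] = counts.get(labels[nb], 0) + 1
            best = max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]
            if best != labels[n]:
                labels[n] = best
                changed = True
        if not changed:
            break
    return labels

# Two tight verb clusters (toy COMMUNICATION vs COGNITION data)
# joined by a single bridge edge say-know
graph = {
    "talk":   {"speak", "say", "chat"},
    "speak":  {"talk", "say", "chat"},
    "say":    {"talk", "speak", "chat", "know"},
    "chat":   {"talk", "speak", "say"},
    "think":  {"know", "wonder", "guess"},
    "know":   {"think", "wonder", "guess", "say"},
    "wonder": {"think", "know", "guess"},
    "guess":  {"think", "know", "wonder"},
}
communities = label_propagation(graph)
print(communities)
```

On this graph the labels converge to two communities split exactly at the bridge, mirroring the COMMUNICATION/COGNITION separation in the figure.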
VACNET Summary
• Challenge of natural language for data mining
• Project investigates usage of VACs at scale
  – constructions = meaning through patterns
  – IR challenge: retrieving non-trivial structures at scale
• Corpus analysis examines the distributions of verbs in VACs: frequency distribution, contingency, semantics
• Psycholinguistic experiments explore the psychological reality of VACs
• VACNET: a structured inventory, verb to construction and construction to verb, valuable for NLP and DM tasks
• Future explorations: train classifiers on our datasets; tackle ‘big data’ sets
Thank you!