IMS Universität Stuttgart
Off-line (and On-line) Text Analysis for Computational Lexicography
Hannah Kermes
Algorithmische Syntax, 21.12.2004
Motivation
• maintenance of consistency and completeness within lexica → computer-assisted methods
• lexical engineering → scalable lexicographic work process; processes reproducible on large amounts of text
• statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research
• full parsers are not robust enough
→ need for analysis tools that meet the specific needs of corpus linguistic studies
Information needed
• syntactic information
  • subcategorization patterns
• semantic information
  • selectional preferences, collocations
  • synonyms
  • multi-word units
  • lexical classes
• morphological information
  • case, number, gender
  • compounding and derivation
Requirements for the tool
• it has to work on unrestricted text
• shortcomings in the grammar should not lead to a complete failure to parse
• no manual checking should be required
• it should provide a clearly defined interface
• annotation should follow linguistic standards
Requirements for the annotation
• head lemma
• morpho-syntactic information
• lexical-semantic information
• structural and textual information
• hierarchical representation
A corpus linguistic approach
Hypothesis
The better and more detailed the off-line annotation, the better and faster the on-line extraction.
However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.
Three different dimensions
• type of grammar
  • symbolic grammar
  • probabilistic grammar
• type of grammar development
  • hand-written grammar
  • learning methods
• depth of analysis
  • analysis on token level only
  • full parsing
  • partial parsing
Classical chunk definition
• Abney 1991: "The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template"
• Abney 1996: "a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head"
Problems for extraction
• Kübler and Hinrichs (2001): "focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."
An example
1. [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ]
   with small, above the heads the apostles set flames
2. [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ]
   with small, above the heads the apostles set flames
   'with small flames set above the heads of the apostles'
Problems for extraction
• four NCs instead of only one NP
• AN-pair:
  + gesetzten + Flammen
  - kleine + Flammen
• NN-pair Köpfe + Apostel needs agreement information
• VN-pair setzen + Flammen needs information about the deverbal character of gesetzten
→ a more complex analysis is needed
→ PCs and NCs need to be combined
Simple solution
PP → PC (PC|NC)*
• theoretical motivation?
• rule covers this particular example; other examples might need additional rules
• rule is vague and largely underspecified → not very reliable
• internal structure is mainly left opaque
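The simple rule can be tried out as a regular expression over a sequence of chunk labels. The sketch below is only an illustration in Python, not the CQP grammar used in the talk; the function name and the space-separated label encoding are assumptions.

```python
import re

# Assumed encoding: one chunk label per chunk, joined by single spaces.
# The pattern mirrors the rule PP -> PC (PC|NC)*.
PP_RULE = re.compile(r"\bPC(?: (?:PC|NC))*\b")

def match_simple_pp(labels):
    """Return (start, end) index pairs of label runs covered by the rule."""
    text = " ".join(labels)
    spans = []
    for m in PP_RULE.finditer(text):
        start = text[:m.start()].count(" ")        # index of first label in the run
        end = start + m.group().count(" ") + 1     # exclusive end index
        spans.append((start, end))
    return spans

# mit kleinen | über die Köpfe | der Apostel | gesetzten Flammen
print(match_simple_pp(["PC", "PC", "NC", "NC"]))
```

The whole run is covered by one flat match, which shows the weakness named above: nothing is said about how the four chunks relate to each other inside the PP.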
Complex solution
1. NP → NC NCgen
2. PP → preposition NP
3. AP → PP adjective
4. NP → AP* noun
Complex solution
• solution for this particular example only
• large number of rules needed
• rules have to be repeated for every instance of a complex phrase
→ in order to support extraction, the classic chunk concept has to be extended
Conclusion
Chunking:
• flat non-recursive structures
• simple grammar
• robust and efficient
• non-ambiguous output
Full parsing:
• full hierarchical representation
• complex grammar
• not very robust
• ambiguous output
In between: YAC
A recursive chunker for unrestricted German text
• recursive chunker for unrestricted German text
• fully automatic analysis
• main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora
General aspects
• based on a symbolic regular expression grammar
• grammar rules written in CQP
• basis:
  • tokenization
  • PoS-tagging
  • lemmatization
  • agreement information
• tools: Tree Tagger, IMSLex
A typical chunker
• robust – works on unrestricted text
• works fully automatically
• does not provide full but partial analysis of text
• no highly ambiguous attachment decisions are made
YAC goes beyond
• extends the chunk definition of Abney
  1. recursive embedding
  2. post-head embedding
• provides additional information about annotated chunks
  1. head lemma
  2. agreement information
  3. lexical-semantic and structural properties
Extended chunk definition
A chunk is a continuous part of an intra-clausal constituent, including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.
Technical Framework
[Diagram components: corpus; Perl-Scripts; grammar rules; lexicon; rule application; annotation of results; post-processing]
Output formats
• CQP format, used for:
  • interactive grammar development
  • parsing
  • extraction
• an XML format, used for:
  • hierarchy building
  • extraction
  • data exchange
Advantages of the system
• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules
Linguistic coverage
• Adverbial phrases (AdvP)
  a) schön stark (beautifully strong)
  b) daher (from there); irgendwoher (from anywhere)
  c) heim (home); querfeldein (cross-country)
  d) innen (inside); überall (everywhere)
  e) "sehr bald" (very soon)
  f) jetzt (now); damals (at that time)
Linguistic coverage
• Adjectival phrases (AP)
  a) möglich (possible)
  b) schreiend lila (screamingly purple)
  c) rund zwei Meter hohe
     around two meter high
  d) über die Köpfe der Apostel gesetzten
     above the heads of the apostles set
     'set above the heads of the apostles'
Linguistic coverage
• Noun phrases (NP)
  a) Oktober (October); er (he)
  b) 4,9 Milliarden Euro
     4.9 billion Euros
  c) "Frankensteins Fluch"
     "Frankenstein's curse"
  d) kleine, über die Köpfe der Apostel gesetzten Flammen
     small, above the heads of the apostles set flames
     'small flames set above the heads of the apostles'
Linguistic coverage
• Prepositional phrases (PP)
  a) davon (thereof)
  b) zwischen Basel und St. Moritz
     between Basel and St. Moritz
  c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen
     with small, above the heads of the apostles set flames
     'with small flames set above the heads of the apostles'
Linguistic coverage
• Verbal complexes (VC)
  a) gemunkelt (rumored)
  b) muß gerechnet werden
     has counted to be
     'has to be counted'
  c) zu bekommen
     to get
  d) bekommen zu haben
     gotten to have
     'to have gotten'
Linguistic coverage
• Clauses (CL)
  a) … , daß selbst Ravel sich amüsiert hätte.
     … , that even Ravel himself enjoyed had.
     '… , that even Ravel would have enjoyed.'
  b) … , die man in der griechischen Tragödie findet.
     … , which one in the Greek tragedy finds.
     '… , which one finds in the Greek tragedy.'
Linguistic coverage
• Clauses (CL)
  a) … , Instrumente selbst zu bauen.
     … , instruments oneself to build.
     '… , to build instruments oneself.'
  b) … , um einen Kaffee zu trinken.
     … , in order a coffee to drink.
     '… , in order to drink a coffee.'
Feature annotation
• head lemma
• morpho-syntactic information
• lexical-semantic properties
Feature annotation
feature value: AdvP AP NP PP VC CL
lexical-semantic: X X X X X X
head lemma: X X X X X X
agreement info: X X X
verbal head lemma: X
Head lemma
• lemma attribute at the head position
• normally a single token
• multi-word proper nouns have a multi-token head lemma
• a separated verbal prefix is included in the head lemma of the VC: kommt … an → ankommen (arrive)
• head lemma of PP: preposition:noun
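The two composite head lemmas can be illustrated with a few lines of Python. This is a minimal sketch; the function names are hypothetical and the lemmas are passed in by hand rather than taken from a tagger.

```python
def pp_head_lemma(prep_lemma, noun_lemma):
    """Head lemma of a PP in the form preposition:noun."""
    return f"{prep_lemma}:{noun_lemma}"

def vc_head_lemma(verb_lemma, separated_prefix=None):
    """Head lemma of a VC; a separated verbal prefix is put back in front."""
    return (separated_prefix or "") + verb_lemma

print(vc_head_lemma("kommen", "an"))    # kommt ... an
print(pp_head_lemma("mit", "Flamme"))   # mit ... Flammen
```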
Morpho-syntactic information
• intersection of the morpho-syntactic information of relevant elements
• invariant elements are not considered
• no guessing involved to solve ambiguities
Agreement Information
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|
Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Gen:F:Pl:Def|Gen:F:Pl:Ind|Gen:F:Sg:Def|Gen:F:Sg:Ind|Gen:M:Pl:Def|Gen:M:Pl:Ind|Gen:M:Sg:Def|Gen:M:Sg:Ind|Gen:M:Sg:Nil|Gen:N:Pl:Def|Gen:N:Pl:Ind|Gen:N:Sg:Def|Gen:N:Sg:Ind|Gen:N:Sg:Nil|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>
Agreement Information
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|
Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>
Agreement Information
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|
Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>
Agreement Information
<np_agr |Akk:M:Sg:Def|>
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|
Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>
</np_agr>
→ np_agr: |Akk:M:Sg:Def|
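The agreement value of the NP "den vierten Platz" is the intersection of the agreement sets of its inflecting elements. A minimal Python sketch of that step follows; the function names are hypothetical, and the set for "vierten" is abbreviated to a few of the values shown on the slides.

```python
from functools import reduce

def parse_agr(field):
    """Split a '|'-delimited agreement field into a set of feature values."""
    return {value for value in field.split("|") if value}

def intersect_agr(*fields):
    """Intersect the agreement sets of all given elements; no guessing involved."""
    return reduce(set.intersection, (parse_agr(f) for f in fields))

den = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
# Abbreviated: the full set for 'vierten' on the slides is much longer.
vierten = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Dat:F:Pl:Def|Dat:M:Sg:Def|Nom:M:Pl:Def|"
platz = ("|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|"
         "Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|")

print(intersect_agr(den, vierten, platz))
```

Only Akk:M:Sg:Def survives all three sets, which matches the np_agr value shown on the last slide of the sequence.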
Lexical-semantic properties
• important for parsing as well as for extraction
• properties can be triggers for specific internal structures, functions, and usages
• properties inherent in the corpus
  • PoS-tags
    Johann Sebastian Bach
    NE NE NE
  • text markers
    "Wilhelm Meisters Lehrjahre"
    NE NN NN
Lexical-semantic properties
• properties determined by external knowledge sources (lexica, ontologies, word lists)
  • locality: hier (here); dort (there); Stuttgart
  • temporality: Jahr (year); damals (at that time)
  • derivation: gesetzten (set) → deverbal adjective
Lexical-semantic properties
• structural information
  • complex embeddings
    [AP [PP über die Köpfe der Apostel ] gesetzten ]
    above the heads of the apostles set
    'set above the heads of the apostles'
    [AP [NP der "Inkatha"-Partei ] angehörenden ]
    to the Inkatha-party belonging
    'belonging to the Inkatha-party'
Some properties of NPs
card cardinal noun
meas measure noun
ne named entity
quot NP in quotation marks
street street address
temp temporal noun
date date
pron pronominal NP
Other lexical-semantic properties
• VC with separated prefix: pref
  Er kommt an (he arrives)
• PP with contracted preposition and article: fus
  am Bahnhof (at the station)
• complex APs embedding PPs: pp
  über die Köpfe der Apostel gesetzten
  above the heads of the apostles set
  'set above the heads of the apostles'
• AP with deverbal adjectives: vder
Chunking process
[Diagram components: Corpus; First Level; Second Level; Third Level; Lexicon]
First level
• basic (non-recursive) chunks
• chunks with specific internal structure
  a) Ende September (end of September)
  b) Jahre später (years later)
  c) 21. Juli 2003
  d) Johann Sebastian Bach
• lexical information is introduced
  • within the rules themselves
  • within the Perl-scripts
Advantages
• specific rules do not interact with main parsing rules
• additional (e.g. domain-specific) rules can be included easily
• main parsing rules can be kept simple
• number of main parsing rules can be kept small
Second level
• main parsing level
• relatively simple and general rules
  a) AP → AdvP? (PP|NP)* AC
  b) NP → Determiner? Cardinal? AP* NC
  c) PP → Preposition (NP|AdvP)
• complex (recursive) structures are built in several iterations
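How three simple rules yield recursive structures over several iterations can be sketched in Python. This is not the CQP implementation: it is a small chart of labelled spans, the rule encoding and the names RULES, match and chunk are assumptions, and the sketch deliberately collects every span the rules license, including spurious ones.

```python
# Each rule: (result label, pattern); a pattern item is (allowed labels, quantifier).
RULES = [
    ("AP", [({"AdvP"}, "?"), ({"PP", "NP"}, "*"), ({"AC"}, "1")]),
    ("NP", [({"Det"}, "?"), ({"Card"}, "?"), ({"AP"}, "*"), ({"NC"}, "1")]),
    ("PP", [({"Prep"}, "1"), ({"NP", "AdvP"}, "1")]),
]

def match(pattern, i, pos, by_start):
    """Yield every end position reachable by matching pattern[i:] from pos."""
    if i == len(pattern):
        yield pos
        return
    labels, quant = pattern[i]
    if quant in ("?", "*"):
        yield from match(pattern, i + 1, pos, by_start)  # skip the optional item
    for label, end in by_start.get(pos, []):
        if label in labels:
            nxt = i if quant == "*" else i + 1           # '*' may repeat
            yield from match(pattern, nxt, end, by_start)

def chunk(base_edges, rules=RULES):
    """Apply all rules repeatedly until an iteration adds no new span."""
    edges = set(base_edges)                              # (label, start, end)
    changed = True
    while changed:
        changed = False
        by_start = {}
        for label, start, end in edges:
            by_start.setdefault(start, []).append((label, end))
        for lhs, pattern in rules:
            for start in by_start:
                for end in match(pattern, 0, start, by_start):
                    if end > start and (lhs, start, end) not in edges:
                        edges.add((lhs, start, end))
                        changed = True
    return edges

# eine(0) für(1) den(2) Anwender(3) verständliche(4) Sprache(5); labels assumed
base = [("Det", 0, 1), ("Prep", 1, 2), ("Det", 2, 3),
        ("NC", 3, 4), ("AC", 4, 5), ("NC", 5, 6)]
edges = chunk(base)
largest_np = max((e for e in edges if e[0] == "NP"), key=lambda e: e[2] - e[1])
print(largest_np)
```

The PP is found in one iteration, the AP embedding it in the next, and the NP embedding that AP only afterwards. Choosing among competing spans is a separate step, which the sketch approximates by taking the largest NP.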
Rule blocks
Complexity of phrases
• complexity of phrases is achieved by the embedding of complex structures rather than by complex rules
  a) [NP eine [AP verständliche ] Sprache ]
     an understandable language
  b) [NP eine [AP für den Anwender verständliche ] Sprache ]
     a for the user understandable language
     'a language understandable for the user'
Complexity of phrases
a) [PP auf [NP dem Giebel ] ]
   on top of the gable
b) [PP auf [NP dem westwärts gerichteten Giebel des heute im barocken Gewande erscheinenden Gotteshauses ] ]
   on top of the westwards pointed gable of the today in baroque garment appearing Lord's house
Third level
• chunks of related but different categories can be subsumed under one category
  • NPs with determiner (NP)
  • NPs without determiner (NCC)
  • base noun chunks (NC)
  → NP
• coordination of maximal chunks
• decisions are made which need full recursive chunks
  • adverbially and predicatively used adjectives can only be differentiated by the actual usage
    adverbially used AP → AdvP
Hierarchy building
• resulting structures of all parsing stages are collected and stored in XML-files
• after the parsing process collected structures are combined into a hierarchical structure
• only the largest instance of a structure (sharing the same head) is taken into account
Hierarchy building
a) [NP Faszination ]
   fascination
b) [NP gewisse Faszination des Schattens ]
   certain fascination of the shadow
c) [NP eine gewisse Faszination des Schattens ]
   a certain fascination of the shadow
d) [NP des Schattens ]
   of the shadow
e) [NP eine gewisse Faszination [NP des Schattens ] ]
   a certain fascination of the shadow
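The step from the collected structures a) to d) to the hierarchy in e) can be sketched in Python as follows. The tuple layout (label, start, end, head position) and the function name are assumptions, and the sketch presumes that the kept spans nest properly and never cross.

```python
def build_hierarchy(chunks, tokens):
    """Keep the largest chunk per (label, head) and render nested brackets."""
    best = {}
    for label, start, end, head in chunks:
        key = (label, head)
        if key not in best or end - start > best[key][2] - best[key][1]:
            best[key] = (label, start, end, head)
    # Outer chunks first: sort by start ascending, end descending.
    kept = sorted(best.values(), key=lambda c: (c[1], -c[2]))
    opens, closes = {}, {}
    for label, start, end, _ in kept:
        opens.setdefault(start, []).append(label)
        closes[end] = closes.get(end, 0) + 1
    out = []
    for i, token in enumerate(tokens):
        out.extend(f"[{label}" for label in opens.get(i, []))
        out.append(token)
        out.extend(["]"] * closes.get(i + 1, 0))
    return " ".join(out)

tokens = ["eine", "gewisse", "Faszination", "des", "Schattens"]
chunks = [
    ("NP", 2, 3, 2),  # a) Faszination
    ("NP", 1, 5, 2),  # b) gewisse Faszination des Schattens
    ("NP", 0, 5, 2),  # c) eine gewisse Faszination des Schattens
    ("NP", 3, 5, 4),  # d) des Schattens
]
print(build_hierarchy(chunks, tokens))
```

Structures a) to c) share the head Faszination, so only c) is kept; d) has a different head and is embedded.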
Evaluation on automatic PoS-tags
      all chunks            maximal chunks
      precision  recall     precision  recall
NP    89.93      91.67      89.43      91.68
PP    94.05      89.67      94.04      89.65
AP    84.24      89.25      83.67      89.59
VC    -          -          97.72      96.62
Evaluation on ideal PoS-tags
      all chunks            maximal chunks
      precision  recall     precision  recall
NP    96.36      96.51      95.55      96.47
PP    98.08      96.51      98.07      96.50
AP    96.39      97.50      96.12      97.45
VC    -          -          99.01      98.59
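The slides report precision and recall only. For a single-figure comparison of the two settings, one could combine them into the standard F1 score (the harmonic mean). The sketch below does this for the maximal-chunk columns; F1 is an addition here and is not reported in the talk.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) for maximal chunks, copied from the two tables.
AUTOMATIC_TAGS = {"NP": (89.43, 91.68), "PP": (94.04, 89.65),
                  "AP": (83.67, 89.59), "VC": (97.72, 96.62)}
IDEAL_TAGS = {"NP": (95.55, 96.47), "PP": (98.07, 96.50),
              "AP": (96.12, 97.45), "VC": (99.01, 98.59)}

for category in AUTOMATIC_TAGS:
    auto = f1(*AUTOMATIC_TAGS[category])
    ideal = f1(*IDEAL_TAGS[category])
    print(f"{category}: automatic {auto:.2f}, ideal {ideal:.2f}")
```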
Extraction
• Advantages of the system
• Goal
• Sample extraction
Advantages of the system
• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules
Goal
• provide a fine-grained syntactic classification of the extracted data at the level of
  • subcategorization
  • scrambling
• adjectives subcategorizing clauses
  • combinatory preferences with verbs
  • syntactic behavior
Target data
• predicative(-like) constructions
  Es war klar, daß ...
  It was clear, that ...
• ... with adverbial pronoun
  Er ist davon überzeugt, daß ...
  He is of it convinced, that ...
• ... with reflexive pronoun
  Es zeigt sich deutlich, daß ...
  It shows itself clear, that ...
Target data
• ... with infinitival clauses
  Es ist möglich, ihn zu besuchen.
  It is possible, him to visit.
• ... with clause in topicalized position
  Daß ..., ist klar.
  That ..., is clear.
  Ihn zu besuchen, ist möglich.
  Him to visit, is possible.
Sample query
adjective + verb + finite clause
VC AP CL
Sample query
adjective + verb + finite clause
VC APpred CLfin
Sample query
adjective + verb + finite clause
VC Adjuncts* APpred CLfin
Sample query
adjective + verb + finite clause
VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
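The final form of the query can be illustrated outside CQP as a regular expression over the sequence of annotated chunk labels. The Python sketch below assumes that a sentence has already been reduced to such a label sequence; the function name is hypothetical.

```python
import re

ADJUNCT = r"(?:AdvP|PP|NPtemp|CLrel)"
# VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin, over space-separated labels
QUERY = re.compile(rf"\bVC(?: {ADJUNCT})* APpred CLfin\b")

def find_adj_verb_clause(labels):
    """Return every label run matching the adjective + verb + finite clause query."""
    text = " ".join(labels)
    hits = []
    for m in QUERY.finditer(text):
        start = text[:m.start()].count(" ")
        length = len(m.group().split(" "))
        hits.append(labels[start:start + length])
    return hits

# Es war klar, daß ...  (label sequence assumed)
print(find_adj_verb_clause(["NP", "VC", "APpred", "CLfin"]))
# Same pattern with intervening adjuncts (invented label sequence)
print(find_adj_verb_clause(["NP", "VC", "AdvP", "PP", "APpred", "CLfin"]))
```

Restricting the material between verb and adjective to adjunct-like chunks keeps other complements, such as a plain NP, from being skipped silently.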
adjective + verb + finite clause
sein bleiben machen werden
fraglich 326 34 3
unklar 320 103
klar 225 41 30
offen 228 40
möglich 160 30 2
wichtig 180 2
deutlich 5 97 34
total 1500 177 168 75
Topicalized finite clause
adjective + verb + finite clause
CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred
adjective + verb + finite clause
fincl_ex fincl_top total
fraglich 91 335 426
unklar 13 413 426
klar 221 159 380
offen 19 266 285
möglich 207 4 211
wichtig 192 9 201
deutlich 139 22 161
adjective + verb + infinitival clause
sein fallen haben werden machen
bereit 431 4 6
schwer 162 221 108 33 26
möglich 532 40 35
schwierig 245 93 12
leicht 120 59 31 8 16
nötig 112 48 2 7
erforderlich 102 1 15
total 1708 280 195 183 111
low-frequency adjective + verb + infinitival clause
stehen bringen haben sein
frei 35 4
satt 19 10
fertig 24 1
low-frequency adjective + verb + clause
stehen bringen haben sein
frei 37 6
satt 27 11
fertig 26 1
adjective subcategorization
• APs with PP complements embedded in NPs
  Die [AP dafür erforderlichen] 300 000 Mark
  The for this needed 300 000 Marks
  'The 300 000 Marks needed for this'
  Der [AP auf Sport spezialisierte] Journalist
  The on sports specialised journalist
  'The journalist specialising in sports'
multiword units and abbreviations
• chunks/phrases in brackets or quotes
  • multiword units
    „Teenage Mutant Hero Turtle“
    (FC Italia Frankfurt)
  • abbreviations
    Deutscher Aktienindex (Dax)
    Stickstoffdioxyd (NO2)
Conclusion
• recursive chunking is a workable compromise between depth of analysis and robustness
• extracted data show correlation between
  • collocational preference
  • subcategorization frames
  • semantic classes of adjectives
  • to a certain extent distributional preferences