IMS Universität Stuttgart Off-line (and On-line) Text Analysis for Computational Lexicography Hannah Kermes Algorithmische Syntax 21.12.2004


IMS Universität Stuttgart

Off-line (and On-line) Text Analysis for Computational Lexicography

Hannah Kermes

Algorithmische Syntax, 21.12.2004


Motivation

• maintenance of consistency and completeness within lexica
  → computer-assisted methods
• lexical engineering
  → scalable lexicographic work process
  → processes reproducible on large amounts of text
• statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research
• full parsers are not robust enough
  → need for analysis tools that meet the specific needs of corpus linguistic studies


Information needed

• syntactic information
  • subcategorization patterns
• semantic information
  • selectional preferences, collocations
  • synonyms
  • multi-word units
  • lexical classes
• morphological information
  • case, number, gender
  • compounding and derivation


Requirements for the tool

• it has to work on unrestricted text
• shortcomings in the grammar should not lead to a complete failure to parse
• no manual checking should be required
• should provide a clearly defined interface
• annotation should follow linguistic standards


Requirements for the annotation

• head lemma
• morpho-syntactic information
• lexical-semantic information
• structural and textual information
• hierarchical representation


A corpus linguistic approach


Hypothesis

The better and more detailed the off-line annotation, the better and faster the on-line extraction.

However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.


Three different dimensions

• type of grammar
  • symbolic grammar
  • probabilistic grammar
• type of grammar development
  • hand-written grammar
  • learning methods
• depth of analysis
  • analysis on token level only
  • full parsing
  • partial parsing


Classical chunk definition

• Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template
• Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head


Problems for extraction

• Kübler and Hinrichs (2001): focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances.


An example

1. [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ]
   with small above the heads the apostles set flames

2. [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ]
   with small above the heads the apostles set flames

'with small flames set above the heads of the apostles'


Problems for extraction

• four NCs instead of only one NP
• AN-pair:
  + gesetzten + Flammen
  - kleine + Flammen
• NN-pair Köpfe + Apostel needs agreement information
• VN-pair setzen + Flammen needs information about the deverbal character of gesetzten

→ a more complex analysis is needed
→ PCs and NCs need to be combined


Simple solution

PP → PC (PC|NC)*

• theoretical motivation?
• rule covers this particular example, other examples might need additional rules
• rule is vague and largely underspecified
  → not very reliable
• internal structure is mainly left opaque
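
A minimal sketch of how a flat rule of this kind could be applied, assuming each sentence is available as a list of chunk labels. The function name `apply_flat_pp_rule` and the space-separated encoding are illustrative assumptions, not part of any existing tool. The whole run of chunks collapses into a single unstructured PP, which is why the internal structure stays opaque.

```python
import re

# Hypothetical encoding: every chunk label is followed by one space,
# so the rule PP -> PC (PC|NC)* becomes a regular expression over labels.
FLAT_PP_RULE = re.compile(r"\bPC (?:(?:PC|NC) )*")


def apply_flat_pp_rule(labels):
    """Collapse every PC (PC|NC)* run in a list of chunk labels into one PP."""
    encoded = "".join(label + " " for label in labels)
    return FLAT_PP_RULE.sub("PP ", encoded).split()


# Chunk labels of analysis 1 of the example: PC PC NC NC
flat_result = apply_flat_pp_rule(["PC", "PC", "NC", "NC"])
print(flat_result)
```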


Complex solution

1. NP → NC NCgen
2. PP → preposition NP
3. AP → PP adjective
4. NP → AP* noun


Complex solution

• solution for this particular example only
• large number of rules needed
• rules have to be repeated for every instance of a complex phrase

→ in order to support extractions, the classic chunk concept has to be extended


Conclusion

Chunking
• flat non-recursive structures
• simple grammar
• robust and efficient
• non-ambiguous output

Full Parsing
• full hierarchical representation
• complex grammar
• not very robust
• ambiguous output

YAC


A recursive chunker for unrestricted German text

• recursive chunker for unrestricted German text
• fully automatic analysis
• main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora


General aspects

• based on a symbolic regular expression grammar
• grammar rules written in CQP
• basis:
  • tokenization
  • PoS-tagging (Tree Tagger)
  • lemmatization (Tree Tagger)
  • agreement information (IMSLex)


A typical chunker

• robust: works on unrestricted text
• works fully automatically
• does not provide full but partial analysis of text
• no highly ambiguous attachment decisions are made


YAC goes beyond

• extends the chunk definition of Abney
  1. recursive embedding
  2. post-head embedding
• provides additional information about annotated chunks
  1. head lemma
  2. agreement information
  3. lexical-semantic and structural properties


Extended chunk definition

A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.


Technical Framework

[Diagram: corpus, Perl-Scripts, grammar rules, lexicon, rule application, post-processing, annotation of results]


Output formats

• CQP format, used for:
  • interactive grammar development
  • parsing
  • extraction
• an XML format, used for:
  • hierarchy building
  • extraction
  • data exchange


Advantages of the system

• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules


Linguistic coverage

• Adverbial phrases (AdvP)

a) schön stark (beautifully strong)

b) daher (from there); irgendwoher (from anywhere)

c) heim (home); querfeldein (cross-country)

d) innen (inside); überall (everywhere)

e) "sehr bald" (very soon)

f) jetzt (now); damals (at that time)


Linguistic coverage

• Adjectival phrases (AP)

a) möglich (possible)
b) schreiend lila (screamingly purple)
c) rund zwei Meter hohe
   around two meter high
d) über die Köpfe der Apostel gesetzten
   above the heads of the apostles set
   'set above the heads of the apostles'


Linguistic coverage

• Noun phrases (NP)

a) Oktober (October); er (he)
b) 4,9 Milliarden Euro
   4.9 billion Euros
c) "Frankensteins Fluch"
   "Frankenstein's curse"
d) kleine, über die Köpfe der Apostel gesetzten Flammen
   small, above the heads of the apostles set flames
   'small flames set above the heads of the apostles'


Linguistic coverage

• Prepositional phrases (PP)

a) davon (thereof)
b) zwischen Basel und St. Moritz
   between Basel and St. Moritz
c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen
   with small, above the heads of the apostles set flames
   'with small flames set above the heads of the apostles'


Linguistic coverage

• Verbal complexes (VC)

a) gemunkelt (rumored)
b) muß gerechnet werden
   has counted to be
   'has to be counted'
c) zu bekommen
   to get
d) bekommen zu haben
   gotten to have
   'to have gotten'


Linguistic coverage

• Clauses (CL)

a) … , daß selbst Ravel sich amüsiert hätte.

… , that even Ravel himself enjoyed had.

'… , that even Ravel would have enjoyed.'

b) … , die man in der griechischen Tragödie findet.

… , which one in the Greek tragedy finds.

'… , which one finds in the Greek tragedy.'


Linguistic coverage

• Clauses (CL)

a) … , Instrumente selbst zu bauen.

… , instruments oneself to build.

' … , to build instruments oneself.'

b) … , um einen Kaffee zu trinken.

… , in order a coffee to drink.

'… , in order to drink a coffee.'


Feature annotation

• head lemma
• morpho-syntactic information
• lexical-semantic properties


Feature annotation

feature value: AdvP AP NP PP VC CL
lexical-semantic: X X X X X X
head lemma: X X X X X X
agreement info: X X X
verbal head lemma: X


Head lemma

• lemma attribute at the head position
• normally a single token
• multi-word proper nouns have a multi-token head lemma
• a separated verbal prefix is included in the head lemma of the VC
  kommt … an → ankommen (arrive)
• head lemma of PP: preposition:noun
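
The two special cases above can be illustrated with a small sketch. The function names and the example lemmas passed in (for instance `Kopf` as the lemma of Köpfe) are assumptions made for illustration only; the slides specify just the resulting formats.

```python
def vc_head_lemma(verb_lemma, separated_prefix=""):
    """Head lemma of a verbal complex; a separated prefix is re-attached."""
    return separated_prefix + verb_lemma


def pp_head_lemma(preposition_lemma, noun_lemma):
    """Head lemma of a PP in the preposition:noun format."""
    return f"{preposition_lemma}:{noun_lemma}"


# kommt ... an: verb lemma "kommen" plus separated prefix "an"
print(vc_head_lemma("kommen", "an"))
# über die Köpfe: preposition "über", assumed noun lemma "Kopf"
print(pp_head_lemma("über", "Kopf"))
```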


Morpho-syntactic information

• intersection of the morpho-syntactic information of relevant elements
• invariant elements are not considered
• no guessing involved to solve ambiguities


Agreement Information

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|

<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Gen:F:Pl:Def|Gen:F:Pl:Ind|Gen:F:Sg:Def|Gen:F:Sg:Ind|Gen:M:Pl:Def|Gen:M:Pl:Ind|Gen:M:Sg:Def|Gen:M:Sg:Ind|Gen:M:Sg:Nil|Gen:N:Pl:Def|Gen:N:Pl:Ind|Gen:N:Sg:Def|Gen:N:Sg:Ind|Gen:N:Sg:Nil|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>


Agreement Information

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|

<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>


Agreement Information

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|

<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>


Agreement Information

<np_agr |Akk:M:Sg:Def|>

den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|

<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>

<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>

</np_agr>

→ resulting agreement value of the NP: |Akk:M:Sg:Def|
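
The agreement value of the NP "den vierten Platz" can be reproduced as a plain set intersection. This is a minimal sketch under the assumption that each element carries its readings as a `|`-delimited string, as on the slides; the function names are hypothetical and the values are copied from the last agreement slide.

```python
def readings(value):
    """Turn '|Akk:M:Sg:Def|Dat:F:Pl:Def|' into a set of readings."""
    return {reading for reading in value.split("|") if reading}


def np_agreement(*values):
    """Intersect the readings of all agreeing elements; nothing is guessed."""
    return set.intersection(*(readings(value) for value in values))


DEN = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
VIERTEN = (
    "|Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|Akk:M:Sg:Def|"
    "Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|"
    "Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|"
    "Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|"
    "Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|"
)
PLATZ = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|"

np_agr = np_agreement(DEN, VIERTEN, PLATZ)
print(np_agr)
```

If the intersection is empty, the elements do not agree, which is one way such a value could be used to reject a candidate NP.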


Lexical-semantic properties

• important for parsing as well as for extraction
• properties can be triggers for specific internal structures, functions, and usages
• properties inherent in the corpus
  • PoS-tags
    Johann Sebastian Bach
    NE NE NE
  • text markers
    "Wilhelm Meisters Lehrjahre"
    NE NN NN


Lexical-semantic properties

• properties determined by external knowledge sources (lexica, ontologies, word lists)
  • locality: hier (here); dort (there); Stuttgart
  • temporality: Jahr (year); damals (at that time)
  • derivation: gesetzten (set) → deverbal adjective


Lexical-semantic properties

• structural information
  • complex embeddings

[AP [PP über die Köpfe der Apostel ] gesetzten ]
above the heads of the apostles set
'set above the heads of the apostles'

[AP [NP der "Inkatha"-Partei ] angehörenden ]
to the Inkatha-party belonging
'belonging to the Inkatha-party'


Some properties of NPs

card cardinal noun

meas measure noun

ne named entity

quot NP in quotation marks

street street address

temp temporal noun

date date

pron pronominal NP


Other lexical-semantic properties

• VC with separated prefix: pref
  Er kommt an (he arrives)
• PP with contracted preposition and article: fus
  am Bahnhof (at the station)
• complex APs embedding PPs: pp
  über die Köpfe der Apostel gesetzten
  above the heads of the apostles set
  'set above the heads of the apostles'
• AP with deverbal adjectives: vder


Chunking process

[Diagram: three processing stages (First Level, Second Level, Third Level) operating on the corpus, with the lexicon as a further resource]


First level

• basic (non-recursive) chunks
• chunks with specific internal structure
  a) Ende September (end of September)
  b) Jahre später (years later)
  c) 21. Juli 2003
  d) Johann Sebastian Bach
• lexical information is introduced
  • within the rules themselves
  • within the Perl-scripts


Advantages

• specific rules do not interact with main parsing rules
• additional (e.g. domain-specific) rules can be included easily
• main parsing rules can be kept simple
• number of main parsing rules can be kept small


Second level

• main parsing level
• relatively simple and general rules
  a) AP → AdvP? (PP|NP)* AC
  b) NP → Determiner? Cardinal? AP* NC
  c) PP → Preposition (NP|AdvP)
• complex (recursive) structures are built in several iterations
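
How a handful of general rules can yield deeply embedded structures through repeated application can be sketched as follows. This is not the CQP grammar itself: the chart representation, the rule encoding, and the names `RULES`, `match`, and `parse` are assumptions chosen for illustration. Structures are added non-destructively as `(label, start, end, head)` edges, and all rules are re-applied until an iteration adds nothing new.

```python
# Hypothetical encoding of rules a) to c): parent -> [(child labels, quantifier)].
# The last element of each rule is its mandatory head.
RULES = {
    "AP": [({"AdvP"}, "?"), ({"PP", "NP"}, "*"), ({"AC"}, "1")],
    "NP": [({"Determiner"}, "?"), ({"Cardinal"}, "?"), ({"AP"}, "*"), ({"NC"}, "1")],
    "PP": [({"Preposition"}, "1"), ({"NP", "AdvP"}, "1")],
}


def match(chart, elements, pos, head=None):
    """Yield (end, head) for each way `elements` matches chart edges from pos."""
    if not elements:
        yield pos, head
        return
    (labels, quantifier), rest = elements[0], elements[1:]
    if quantifier in ("?", "*"):  # optional element: try skipping it
        yield from match(chart, rest, pos, head)
    for label, start, end, edge_head in chart:
        if start == pos and label in labels:
            following = elements if quantifier == "*" else rest
            yield from match(chart, following, end, edge_head)


def parse(tokens):
    """Apply all rules repeatedly until an iteration adds no new structure."""
    chart = {(label, i, i + 1, i) for i, label in enumerate(tokens)}
    while True:
        new_edges = set()
        for parent, elements in RULES.items():
            for pos in range(len(tokens)):
                for end, head in match(chart, elements, pos):
                    new_edges.add((parent, pos, end, head))
        if new_edges <= chart:
            return chart
        chart |= new_edges


# mit kleinen über die Köpfe der Apostel gesetzten Flammen (comma omitted)
TOKENS = ["Preposition", "AC", "Preposition", "Determiner", "NC",
          "Determiner", "NC", "AC", "NC"]
chart = parse(TOKENS)
print(("PP", 0, 9, 8) in chart)
```

The chart also contains smaller structures with the same head, such as the AP covering only "gesetzten"; choosing among them is left to a later selection step.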


Rule blocks


Complexity of phrases

• complexity of phrases is achieved by the embedding of complex structures rather than by complex rules

a) [NP eine [AP verständliche ] Sprache ]
   an understandable language
b) [NP eine [AP für den Anwender verständliche ] Sprache ]
   a for the user understandable language
   'a language understandable for the user'


Complexity of phrases

a) [PP auf [NP dem Giebel ] ]
   on top of the gable
b) [PP auf [NP dem westwärts gerichteten Giebel des heute im barocken Gewande erscheinenden Gotteshauses ] ]
   on top of the westwards pointed gable of the today in baroque garment appearing Lord's house


Third level

• chunks of related but different categories can be subsumed under one category
  • NPs with determiner (NP)
  • NPs without determiner (NCC)
  • base noun chunks (NC)
  → NP
• coordination of maximal chunks
• decisions are made which need full recursive chunks
  • adverbially and predicatively used adjectives can only be differentiated by the actual usage
    adverbially used AP → AdvP


Hierarchy building

• resulting structures of all parsing stages are collected and stored in XML-files

• after the parsing process, the collected structures are combined into a hierarchical structure

• only the largest instance of a structure (sharing the same head) is taken into account


Hierarchy building

a) [NP Faszination ]
   fascination
b) [NP gewisse Faszination des Schattens ]
   certain fascination of the shadow
c) [NP eine gewisse Faszination des Schattens ]
   a certain fascination of the shadow
d) [NP des Schattens ]
   of the shadow
e) [NP eine gewisse Faszination [NP des Schattens ] ]
   a certain fascination of the shadow
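
The step from the collected structures a) to d) to the hierarchy in e) can be sketched like this. The tuple layout `(label, start, end, head)` and the function name `build_hierarchy` are assumptions for illustration, and the sketch presumes that the selected structures nest properly without crossing. For each label and head only the largest span is kept, and the survivors are rendered as nested brackets.

```python
def build_hierarchy(tokens, structures):
    """Keep the largest structure per (label, head) and render nested brackets.

    structures: iterable of (label, start, end, head) with end exclusive.
    """
    largest = {}
    for label, start, end, head in structures:
        key = (label, head)
        if key not in largest or end - start > largest[key][1] - largest[key][0]:
            largest[key] = (start, end)

    opens = {i: [] for i in range(len(tokens) + 1)}
    closes = {i: 0 for i in range(len(tokens) + 1)}
    for (label, _), (start, end) in largest.items():
        opens[start].append((end, label))
        closes[end] += 1

    parts = []
    for i, token in enumerate(tokens):
        # Wider structures open first so that the brackets nest properly.
        for _, label in sorted(opens[i], reverse=True):
            parts.append("[" + label)
        parts.append(token)
        parts.extend("]" * closes[i + 1])
    return " ".join(parts)


TOKENS = ["eine", "gewisse", "Faszination", "des", "Schattens"]
COLLECTED = [
    ("NP", 2, 3, 2),  # a) Faszination
    ("NP", 1, 5, 2),  # b) gewisse Faszination des Schattens
    ("NP", 0, 5, 2),  # c) eine gewisse Faszination des Schattens
    ("NP", 3, 5, 4),  # d) des Schattens
]
hierarchy = build_hierarchy(TOKENS, COLLECTED)
print(hierarchy)
```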


Evaluation on automatic PoS-tags

      all chunks            maximal chunks
      precision  recall     precision  recall
NP    89.93      91.67      89.43      91.68
PP    94.05      89.67      94.04      89.65
AP    84.24      89.25      83.67      89.59
VC    -          -          97.72      96.62


Evaluation on ideal PoS-tags

      all chunks            maximal chunks
      precision  recall     precision  recall
NP    96.36      96.51      95.55      96.47
PP    98.08      96.51      98.07      96.50
AP    96.39      97.50      96.12      97.45
VC    -          -          99.01      98.59


Extraction

• Advantage of the system
• Goal
• Sample Extraction


Advantages of the system

• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules


Goal

• provide a fine-grained syntactic classification of the extracted data at the level of
  • subcategorization
  • scrambling
• adjectives subcategorizing clauses
  • combinatory preferences with verbs
  • syntactic behavior


Target data

• predicative(-like) constructions
  Es war klar, daß ...
  It was clear, that ...
• ... with adverbial pronoun
  Er ist davon überzeugt, daß ...
  He is of it convinced, that ...
• ... with reflexive pronoun
  Es zeigt sich deutlich, daß ...
  It shows itself clear, that ...


Target data

• ... with infinitival clauses
  Es ist möglich, ihn zu besuchen.
  It is possible, him to visit.
• ... with clause in topicalized position
  Daß ..., ist klar.
  That ..., is clear.
  Ihn zu besuchen, ist möglich.
  Him to visit, is possible.


Sample query

adjective + verb + finite clause

VC AP CL


Sample query

adjective + verb + finite clause

VC APpred CLfin


Sample query

adjective + verb + finite clause

VC Adjuncts* APpred CLfin


Sample query

adjective + verb + finite clause

VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
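
The query pattern can be emulated outside CQP as a regular expression over chunk labels. The sketch below is an illustration only: the `(label, head lemma)` representation of a sentence, the sample sentences, and the function name `extract_pairs` are assumptions, and no actual CQP syntax is shown. It counts adjective and verb head lemmas for every match, which is the kind of tally a frequency table of adjective and verb combinations needs.

```python
import re
from collections import Counter

# Hypothetical encoding: every chunk label is followed by one space.
QUERY = re.compile(r"\bVC (?:(?:AdvP|PP|NPtemp|CLrel) )*APpred CLfin ")


def extract_pairs(sentences):
    """Count (adjective, verb) head lemmas in chunk sequences matching QUERY."""
    counts = Counter()
    for chunks in sentences:
        encoded = "".join(label + " " for label, _ in chunks)
        for m in QUERY.finditer(encoded):
            # Map character offsets back to chunk positions by counting spaces.
            first = encoded.count(" ", 0, m.start())
            last = encoded.count(" ", 0, m.end())
            span = chunks[first:last]
            verb, adjective = span[0][1], span[-2][1]
            counts[(adjective, verb)] += 1
    return counts


# Invented sample input: (chunk label, head lemma) per chunk.
SENTENCES = [
    [("NP", "es"), ("VC", "sein"), ("APpred", "klar"), ("CLfin", "daß")],
    [("NP", "es"), ("VC", "bleiben"), ("AdvP", "weiterhin"),
     ("APpred", "unklar"), ("CLfin", "ob")],
    [("NP", "er"), ("VC", "sein"), ("APpred", "müde")],
]
pair_counts = extract_pairs(SENTENCES)
print(pair_counts)
```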


adjective + verb + finite clause

sein bleiben machen werden

fraglich 326 34 3

unklar 320 103

klar 225 41 30

offen 228 40

möglich 160 30 2

wichtig 180 2

deutlich 5 97 34

total 1500 177 168 75



Topicalized finite clause

adjective + verb + finite clause

CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred


adjective + verb + finite clause

            fincl_ex  fincl_top  total
fraglich    91        335        426
unklar      13        413        426
klar        221       159        380
offen       19        266        285
möglich     207       4          211
wichtig     192       9          201
deutlich    139       22         161
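
The distributional preference visible in this table can be made explicit with a little arithmetic. The sketch assumes that `fincl_top` counts topicalized finite clauses and `fincl_ex` the remaining (presumably extraposed) ones; that reading of the column names, and the helper name `topicalized_share`, are assumptions. The figures are copied from the table.

```python
# (fincl_ex, fincl_top, total) per adjective, as given in the table
ROWS = {
    "fraglich": (91, 335, 426),
    "unklar": (13, 413, 426),
    "klar": (221, 159, 380),
    "offen": (19, 266, 285),
    "möglich": (207, 4, 211),
    "wichtig": (192, 9, 201),
    "deutlich": (139, 22, 161),
}


def topicalized_share(fincl_ex, fincl_top):
    """Fraction of finite clauses that appear in topicalized position."""
    return fincl_top / (fincl_ex + fincl_top)


shares = {adj: topicalized_share(ex, top) for adj, (ex, top, _) in ROWS.items()}
for adj, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{adj:10s} {share:.2f}")
```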



adjective + verb + infinitival clause

sein fallen haben werden machen

bereit 431 4 6

schwer 162 221 108 33 26

möglich 532 40 35

schwierig 245 93 12

leicht 120 59 31 8 16

nötig 112 48 2 7

erforderlich 102 1 15

total 1708 280 195 183 111



low freq adj + verb + infin clause

stehen bringen haben sein

frei 35 4

satt 19 10

fertig 24 1


low freq adj + verb + clause

stehen bringen haben sein

frei 37 6

satt 27 11

fertig 26 1


adjective subcategorization

• APs with PP complements embedded in NPs

Die [AP dafür erforderlichen] 300 000 Mark
The for this needed 300 000 Marks
'The 300 000 Marks needed for this'

Der [AP auf Sport spezialisierte] Journalist
The on sports specialised journalist
'The journalist specialising in sports'


multiword units and abbreviations

• chunks/phrases in brackets or quotes
  • multiword units
    „Teenage Mutant Hero Turtle“
    (FC Italia Frankfurt)
  • abbreviations
    Deutscher Aktienindex (Dax)
    Stickstoffdioxyd (NO2)


Conclusion

• recursive chunking is a workable compromise between depth of analysis and robustness
• extracted data show correlation between
  • collocational preference
  • subcategorization frames
  • semantic classes of adjectives
  • to a certain extent distributional preferences