
Syntactic annotation in CGN: lessons learned and to be learned

Ineke Schuurman

Centre for Computational Linguistics

Katholieke Universiteit Leuven

Paris, 15 November 2011

This talk ...

• Why CGN: Spoken Dutch Corpus?
• At that time …
• Other layers
  – Orthographic transcription
  – PoS tagging
• Syntactic annotation
  – Dependencies and categories
• Spoken language
  – “standard” language, disfluencies
• LASSY/SoNaR: Written Dutch Corpus
• What to take into account when planning a ‘spoken treebank’


Why CGN?

Dutch Language Union
• Dutch/Flemish organization taking care of the common language
• 1997-8: report on the state of the art wrt Language & Speech Technology
• 1998: Spoken Dutch Corpus, 5 years, 2/3 Netherlands - 1/3 Flanders, balanced
  1000 hours, +/- 10M words, 1M with syntactic annotation
• Both research purposes and services (EU) / industry


At that time

This talk: focus on textual aspects!

--------------------------------------------------------

• No taggers or parsers that could be reused
• Existing grammars cover(ed) the northern variant of Dutch
• No ‘formal’ grammar

►start from scratch


Other layers

• Relevant for syntax:
  – Orthographic transcription
  – PoS tagging
• All layers in parallel, but per fragment layer A was finished before the start of layer B (except for errors)
• Reason: time
• But: this gave us the opportunity to express wishes/needs wrt other layers
• Example: handling of specific types of words


Transcription and PoS

An example:


Specific types of words

*v  words in another language (not 'adopted' in Dutch)
*a  not fully realized words (gaan probe instead of gaan proberen)
*x  words that could not be (fully) understood (also xxx, ggg)
*u  mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
*d  dialectal words

One or more words?
zo’n vs zo ‘n (such a): one token!
But hebde*d (lit. ‘have you’) realized as hebt*d de*d: two tokens
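Such markers are easy to handle mechanically. A minimal sketch (purely illustrative; the function name and interface are invented here, this is not the CGN tool chain):

```python
import re

# Hypothetical helper: split a CGN-style marker such as *a, *d, *u, *v or *x
# off a transcribed token.
MARKER_RE = re.compile(r"^(?P<form>.+?)\*(?P<marker>[avxud])$")

def split_marker(token: str):
    """Return (form, marker) for a marked token, (token, None) otherwise."""
    m = MARKER_RE.match(token)
    if m:
        return m.group("form"), m.group("marker")
    return token, None

# hebde*d realized as two tokens, each carrying its own *d marker:
print(split_marker("hebt*d"))   # ('hebt', 'd')
print(split_marker("de*d"))     # ('de', 'd')
print(split_marker("zo'n"))     # ("zo'n", None)
```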


Syntactic analysis: goal CGN

• Annotation in a theory-neutral format in order to be useful for as many people as possible
• Categories: NP, PP, …
• Functions/dependencies: subject, object1, …
• As automatic as possible:
  – Tool from the NEGRA corpus: Annotate
    – for German
    – same desiderata as CGN (contrary to the Dutch AMAZON parser)


Annotate

• Developed for the NEGRA project (Saarbrücken)
  – Oliver Plaehn, Thorsten Brants
• Semi-automatic annotation
  – Works with tagger and parser
  – Suggests structures
• Combined with Cascaded Markov Models (Brants)
  – Bootstrapping approach possible


Annotate screen


Annotate ‘correction’ format


Annotate export format


Principles of syntactic annotation

• Structures as flat as possible
• Only a new level when there is a new head
• No branching when just one node is involved
• No duplication of functions (1 SU, 1 OBJ1, …)
• In principle just non-branching heads
• Allowed:
  – multiple branching
  – crossing dependencies (see the sketch below)
• Input: simplified PoS
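To make these principles concrete, a tiny, purely illustrative way to hold such a flat graph in memory (field names and the example are mine, not the CGN/Annotate data model; multiple branching and crossing edges are unproblematic because children are simply listed by id):

```python
from dataclasses import dataclass, field

# Hypothetical, minimal graph representation: flat nodes with a category,
# edges labelled with a dependency function.
@dataclass
class Node:
    id: int
    cat: str = ""                                   # "NP", "SMAIN", ... (empty for leaves)
    word: str = ""                                  # the token, for leaf nodes
    children: list = field(default_factory=list)    # list of (function, child id)

# "Ik geef hem een boek": one flat SMAIN, at most one SU/OBJ1/OBJ2,
# a non-branching head, and a small NP only for "een boek".
nodes = {
    1: Node(1, word="ik"),
    2: Node(2, word="geef"),
    3: Node(3, word="hem"),
    5: Node(5, word="een"),
    6: Node(6, word="boek"),
    4: Node(4, cat="NP", children=[("det", 5), ("hd", 6)]),
    0: Node(0, cat="SMAIN",
            children=[("su", 1), ("hd", 2), ("obj2", 3), ("obj1", 4)]),
}
```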


Fewer PoS tags

Simplified PoS

• PoS: over 300 tags
  – Over 100 for pronouns
  – Not problematic at all, often unique token/tag combinations
• Not all details necessary for SA
• Example, full tagset:
  – T501a VNW(pers,pron,nomin,vol,1,ev)  ik (I)
  – T501o VNW(pers,pron,nomin,vol,3,ev,masc)  hij (he)
• Example, simplified tagset:
  – VNW1 VNW(pers,pron)  personal pronoun
  – In graph: both T501a and VNW1
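A rough sketch of what such a tag simplification amounts to (the mapping table and the cut-off are assumptions covering only the pronoun example above; the actual simplified tagset is defined in the CGN documentation):

```python
# Illustrative only: assume a simplified tag keeps the main category plus a
# fixed number of leading features.
FEATURES_KEPT = {"VNW": 2}          # hypothetical table: category -> features kept

def simplify(tag: str) -> str:
    cat, _, rest = tag.partition("(")
    feats = rest.rstrip(")").split(",") if rest else []
    keep = FEATURES_KEPT.get(cat, len(feats))
    return f"{cat}({','.join(feats[:keep])})" if feats else cat

print(simplify("VNW(pers,pron,nomin,vol,1,ev)"))       # VNW(pers,pron)
print(simplify("VNW(pers,pron,nomin,vol,3,ev,masc)"))  # VNW(pers,pron)
```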


Syntactic simplifications

Other simplifications

• Obj2 – indirect object (dative)
  meewerkend voorwerp (recipient):
  • Ik geef hem een boek / een boek aan hem (I give him a book)
  belanghebbend voorwerp (benefactive):
  • Ik koop hem een boek / een boek voor hem (I buy him a book)

• Bepaling van gesteldheid (~predicative complement)
  • hij verft de deur blauw (he paints the door blue)
  • Hij vindt het boek leuk (he does like the book)
  • Hij nam het boek lachend aan (laughing he accepted the book)


Results

Even then:

• Annotate did most NPs and PPs very well, but often failed for the more complex parts

• In some sense surprising as the results for German were much better.

However:
• In that case written language was involved.
• Training for spoken language is much harder!


Details CGN corpus

Balanced corpus:
• Types of documents (next slide)
• Speaker characteristics
  • Sex
  • Age
  • Geographic region
  • Socio-economic class
  • Level of education
• 2/3 Netherlands, 1/3 Belgium (Flanders)
• Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN)


Details CGN corpus

►Many types of documents
• Read-aloud written: literature read aloud (library for the blind)
• Written to be spoken:
  • News broadcasts
  • Lectures
• Spoken (spontaneous):
  • Interviews
  • Phone calls
  • Debates
  • Spontaneous conversations with x people (over lunch etc.)


Variation

To some extent there are differences in the written language, but much more in the spoken variants, esp. in spontaneous speech

• Separable verbs
  • NL: dat ze hem op wilde bellen (that she wanted to call him)
  • VL: dat ze hem wilde opbellen
• Other choice of auxiliaries
  • NL: Ze is het komen brengen (she came and brought it)
  • VL: Ze heeft het komen brengen
• Other words for the same concept, same words for different concepts
  • pompbak - gootsteen (sink), namiddag (afternoon - late afternoon)

Grammars/dictionaries: mostly the northern written variant


Disfluencies

Partially realized words

hilari*a instead of hilarisch (EN hilarious)

Analyzed as if realized

***

Ik doe West- en Oost-Vlaanderen

I’ll take care of West- and Oost-Vlaanderen

Short for: West-Vlaanderen en Oost-Vlaanderen

Analyzed in the completely regular way, as a conjunction (CONJ)


Disfluencies

When too little of a token is realized, such a token is ignored

awel genen TV meer en genen boe*a gene voetbal meer .

EN: So no more tv and no more football


Example of a disfluency (repetition)


Disfluencies

Mixed repetition/correction

Ze was bijna hileri*a hilari*a

She was almost hilarious

hileri*a is corrected to hilari*a; only the corrected form is included in the analysis

Die verd*a die vervl*a die krankzinnige hond

That damn*, that cursed*, that crazy dog

Only the last three words (die krankzinnige hond, ‘that crazy dog’) are included in the graph
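To show how the “only the corrected form counts” idea works out on these examples, a deliberately crude sketch (my own heuristic, for illustration only; in CGN this decision is made by the human annotators, and *a words that are never corrected are analyzed as if fully realized):

```python
# Purely illustrative heuristic, not the CGN guidelines.
def is_restart(prev: str, cur: str) -> bool:
    """Treat an identical repetition or an abandoned (*a) attempt as a restart."""
    return prev == cur or prev.endswith("*a")

def keep_for_analysis(tokens):
    """Keep only the last attempt in a run of repetitions/corrections."""
    kept = []
    for tok in tokens:
        while kept and is_restart(kept[-1], tok):
            kept.pop()              # the earlier attempt stays out of the graph
        kept.append(tok)
    return kept

print(keep_for_analysis(["ze", "was", "bijna", "hileri*a", "hilari*a"]))
# ['ze', 'was', 'bijna', 'hilari*a']
print(keep_for_analysis(["die", "verd*a", "die", "vervl*a", "die", "krankzinnige", "hond"]))
# ['die', 'krankzinnige', 'hond']
```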


Disfluencies

Wrong pronunciation

Dat is een serieus plobleem*u

Dat is een serieus probleem

That’s a serious problem

Analyzed as if the ‘correct’ word was involved


Words in foreign language

In spoken and written language:

Words in another language, and not found in a Dutch dictionary:

umbrella*v, plus*v de*v temps*v, à la carte
not: rendez-vous, cinema, cognac (in Dutch dictionaries)

• Single words: just like their Dutch counterpart
• Strings: only the ‘top’ label presented
• Sentences: not analyzed


Pro and con markings

Markings (*a, etc) have proven to be useful for PoS and SA.

But:

they should have been removed afterwards, i.e. all information should have been contained in the tags; the orthographic level should contain only orthography

Problem: other groups wanted them at the orthographic level, for speech recognition purposes

Solution: add a field without markings
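In code terms the solution is simply a second, marking-free field next to the marked orthography. A hedged sketch (field names and record format are made up here; only the idea comes from the slide):

```python
import re

# Strip CGN-style markings (*a, *d, *u, *v, *x) to derive a plain orthographic field.
MARKING = re.compile(r"\*[avxud]\b")

def strip_markings(orthography: str) -> str:
    return MARKING.sub("", orthography)

marked = "ze was bijna hileri*a hilari*a"
record = {"ort": marked, "ort_plain": strip_markings(marked)}
print(record["ort_plain"])   # ze was bijna hileri hilari
```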


Syntactic annotation

Lacking and superfluous words

There are no ‘ungrammatical’ sentences, all sentences are to be analyzed!

• Lacking elements: just accept it
• Superfluous elements: just accept it

BUT there are some exceptions:
• repetition
• ‘accidental’ sentences


Not analyzed parts

Sometimes parts of a ‘sentence’ are ‘ignored’:

• Repairs:
  Ik zie hem morg*a overmorgen
  (I’ll see him the day after tomorrow)
• Repetitions:
  Hij is in in vergadering
  (He has a meeting)

Or not connected:
• ‘Accidental’ sentences/units:
  Ik heb nooit ik ben lerares
  (I have never I am a teacher)
• Uh-insertion (hesitation marker):
  Ze heeft uh zeven dochters
  (She has seven daughters)


Examples

More of the same


Asyndetic conjunction


Discourse phenomena

Some examples of ‘discourse’ within a sentence


Accidental unit

‘Accidental’ unit, discourse

parts not connected


Syntactic annotation: sentence vs discourse


Atypical ‘sentences’

Often: discourse


Complicating factors

• No punctuation apart from full stop, question mark, ellipsis
• ‘Wrong order’ of sentences when more people are talking at the same time!

►Tricky wrt coreference, temporal reasoning etc.

Spelling: incorrect (but correct with another meaning)
• U zij de glorie (Thine be the glory)
• U zei de glorie (‘zei’ meaning ‘said’)
• Ik zal haar eraan houden (houden aan: to keep a promise)
• Ik zal haar er aanhouden (aanhouden: to arrest)

►Needed: context, recordings


Written corpus: Lassy/SoNaR

STEVIN programme (Flemish/Dutch - 2004-2011)

D-Coi / LASSY / (SoNaR)

1M words of written text with manually corrected SA, plus
1.500M words with automatic SA

ALPINO parser (Groningen)

Largely inspired by CGN, based on HPSG

Some differences:
• Mentioning of ‘hidden’ subjects, objects
  – Hij heeft een boek gekocht (he has bought a book)


Alpino

• Alpino grammar: HPSG-based
• ‘Constructional’ approach:
  – rich lexical representations
  – many detailed, construction-specific lexical rules (+/- 600)
• Grammar-based parsing very efficient, esp. when combined with specific rules
• Large lexicon (100,000+ entries, 200,000+ NEs)
  – Stored as a perfect hash finite automaton (Daciuk)
• Crucial: integrated tagger (=/= CGN tagger!)
• Left-corner parser


Alpino (as is) and CGN

Parsing the CGN corpus with Alpino:
• very bad results
• reason might be: it uses a ‘wrong’ grammar, an inadequate lexicon, etc.

As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy-format. There are, however, still differences in the way a few phenomena are handled.
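The practical pay-off of that translation is that one generic query works on both corpora. A minimal sketch (the file name is a placeholder; the @rel/@cat attribute names follow the Lassy/Alpino XML format, leaves carry @word instead of @cat):

```python
from lxml import etree

# Load one Lassy/Alpino-style dependency tree (placeholder file name).
tree = etree.parse("example.xml")

# The same XPath query now works on Lassy trees and on CGN trees translated
# into the Lassy format: e.g. all nodes functioning as subject.
for n in tree.xpath('//node[@rel="su"]'):
    print(n.get("cat") or n.get("word"))
```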


Lassy vs CGN

• Subjects/direct objects wrt infinitives and participles
• Partitives (one of them said …): in CGN a separate label PART, in Lassy a combination of HD and MOD
• LASSY: the head is always lexically anchored
• In LASSY an SBAR-complement always gets the VC label, in CGN either OBJ1 or VC
• …

The analyses are not fully identical, but 99% of them are!


Syntactic annotation: Lassy


Syntactic annotation: CGN


To be taken into account

In general:

• Take care of IPR
• Be prepared to consult other layers
• Use a flexible bug reporting system
• “Spoken language”: grammar/system should be very flexible
• Alignment may be very time consuming

Be aware that, as far as consistency is concerned, the really hard cases are not the most important ones, but rather those that the correctors do not realize are problematic (because in those cases they do not consult others).

GOOD LUCK!