Hindi and Urdu Treebank Project
• 400,000 Hindi words* and 200,000 Urdu words
• Multi-layered and multi-representational
  – Syntactic + Semantic annotation (‘layers’)
  – Dependency + Phrase structure (‘representation’)
• Hindi corpus consists of newswire text from ‘Amar Ujala’
• Create a linguistic resource for Hindi in the tradition of other Treebanks (Penn Treebank, Prague Dependency Treebank)
*~21,000 sentences
Outline
• 3 Representations
  – DS: Dependency Structure
  – PB: PropBank (lexical predicate-argument structure)
  – PS: Phrase Structure
• Mapping between DS and PB
• Linguistic phenomena: Causative verbs
Hindi Dependency Treebank
• The Hindi/Urdu Treebank is annotated using a dependency grammar framework
• Framework used is CPG (Computational Paninian Grammar)
  – Panini’s ‘karaka’ theory adapted for the annotation scheme
Dependency labels
[Figure: example dependency tree for ‘raam ne kal kaam kiyaa’: the verb kiyaa ‘did’ heads raam ne ‘Raam erg’ (k1), kaam ‘work’ (k2) and kal ‘yesterday’ (k7t).]
Example of a dependency tree. Labels denote relations between a modifier and a modified.
• Karakas are the relations between head and child nodes in the treebank
• Relations are depicted between word chunks and not individual tokens (see the sketch below)
  – E.g. a verb chunk can consist of a finite verb along with its auxiliaries
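The chunk-level tree above can be pictured as a small set of labelled head-dependent links. The sketch below is only an illustration (not the treebank's own storage format); the chunk strings and triple layout are ours:

    # Chunk-level dependency tree for 'raam ne kal kaam kiyaa' ('Ram did the
    # work yesterday'), as (dependent chunk, karaka label, head chunk) triples.
    tree = [
        ("raam ne", "k1",  "kiyaa"),   # karta: the doer
        ("kaam",    "k2",  "kiyaa"),   # karma: the thing done
        ("kal",     "k7t", "kiyaa"),   # temporal location
    ]
    # Each label holds between a whole chunk and the verb chunk, not between tokens.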
An Example
Example:
meraa badZaa bhaaii bahuta phala khaataa hai
‘My elder brother eats lots of fruits.’
An Example (Contd...)
Morph Analysis:
meraa <fs af= root=meraa, cat=pron, gend=any, num=sg, pers=1, case=o>
badZaa <fs af= root=badZaa, cat=adj, gend=m, , , >
bhaaii <fs af= root=bhaaii, cat=n, gend=m, num=sg, pers=3, case=d>
bahuta <fs af= root=bahuta, cat=adj, gend=any, , , >
phala <fs af= root=phala, cat=n, gend=m, num=any, pers=3, case=d>
khaataa <fs af= root=khaa, cat=v, gend=m, num=sg, pers=3, TAM=taa>
hai <fs af= root=hai, cat=v, gend=any, num=any, pers=3, >
An Example (Contd ..)
POS Tagging:
meraa_PRP baDzaa_JJ bhaaii_NN bahuta_QF
phala_NN khaataa_VM hai_VAUX
Chunking:
((meraa_PRP))_NP
((baDzaa_JJ bhaaii_NN))_NP
((bahuta_QF phala_NN))_NP
((khaataa_VM hai_VAUX))_VG
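As an illustration only (this is not the project's pipeline code), the bracketed chunk notation above can be read back into chunks and POS-tagged tokens with a few lines of Python; the function name and regex are ours:

    import re

    # Parse strings like "((meraa_PRP))_NP ..." into (chunk_tag, [(word, POS), ...]).
    CHUNK_RE = re.compile(r"\(\((.*?)\)\)_(\w+)")

    def parse_chunks(line):
        chunks = []
        for inside, chunk_tag in CHUNK_RE.findall(line):
            tokens = [tuple(tok.rsplit("_", 1)) for tok in inside.split()]
            chunks.append((chunk_tag, tokens))
        return chunks

    line = ("((meraa_PRP))_NP ((baDzaa_JJ bhaaii_NN))_NP "
            "((bahuta_QF phala_NN))_NP ((khaataa_VM hai_VAUX))_VG")
    print(parse_chunks(line))
    # [('NP', [('meraa', 'PRP')]), ('NP', [('baDzaa', 'JJ'), ('bhaaii', 'NN')]), ...]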
Selected dependency labels [TOTAL = 43 ]
k1      karta (similar to agent/doer)
k2      karma (similar to patient/theme)
k3      instrument
k4      beneficiary
k5      source
k7t     temporal location
k7p     spatial location
k1s     noun complement
k2p     destination
pk1, mk1, jk1   causer, mediator-causer, causee
rh      cause
rt      purpose
rsp     duration
adv     adverb (manner)
pof     part-of (complex predicates)
ccof    conjunction
fragof  fragment-of
Classification of labels in tagset
• Based on their syntactic and semantic behaviour, we find 6 categories of labels (Vaidya and Husain, 2010)
[Figure: the tagset labels grouped into six categories]
  Invariant syntactic labels: k1, k2
  Local semantic labels: k3, k4, k5, k2p, k1s
  Global semantic labels: k7p, k7t, rh, rt
  Mod labels: nmod
  ‘pof’-type labels: pof, fragof
  ‘ccof’-type labels: ccof
Invariant syntactic labels
• k1 ‘karta’ and k2 ‘karma’
• Invariant across syntactic alternations like voice
• E.g. aaj khuub mithai khaai gai
       today many sweets eaten go.pst
       ‘Today many sweets were eaten’
• The label for ‘sweets’ is k2, although ‘sweets’ is now the passivized subject
• This property allows for mapping with PropBank roles Arg0, Arg1
Local semantic labels
• Relation between verb and dependent is ‘local’
• These labels are “relevant to the verb meaning in question”
• E.g. Ram ne Mohan ko kahaani sunaai
       Ram erg Mohan acc story told.perf
       ‘Ram told Mohan a story’
• Mohan is a ‘k4’ (beneficiary), which is a local semantic label
• However, it is a label specific to certain verbs only, e.g. denaa ‘to give’, kahnaa ‘to say’
Local semantic labels
• Other local semantic labels include
  – k4a ‘anubhava karta’, experiencer, with verbs like mila ‘find’, dikha ‘see’, laga ‘feel’
  – k2p ‘goal’, with verbs like pahuMca ‘reach’, jaa ‘go’
• The interpretation of these labels is closely bound with the meaning of the verb
Global semantic labels
• These labels are relevant “across different verbs and verb meanings”
E.g. maine aaj pustak khariidi
     I-erg today book bought
     ‘I bought a book today’
• Here ‘aaj (today)’ has the label k7t or ‘time’ which does not change across different verb meanings
Global semantic labels
• More examples of global semantic labels are:
  – k7p ‘place’
  – rh ‘reason’
  – rsp ‘duration’
• Not tied to the meaning of a verb
[Figure: all dependency relations grouped by type]
Invariant syntactic relations:
  k1   karta (similar to agent/doer)
  k2   karma (similar to patient/theme)
Local semantic relations:
  k3   instrument
  k4   beneficiary
  k5   source
  k2p  destination
  k1s  noun complement
Global semantic relations:
  k7p  spatial location
  k7t  temporal location
  rh   cause
  rt   purpose
  rsp  duration
Modifier relations:
  nmod noun modification
pof-type relations:
  pof    part-of (complex predicates)
  fragof fragment-of
ccof-type relations:
  ccof conjunction
Dependency Structure
• The local vs. global distinction helps identify the core participants in the verb’s event
• Mapping to other frameworks that make distinctions between the core and non-core participants will be easier
• We examine such a mapping with the PropBank labels in the following section
Outline
– DS: Dependency Structure ✓
– PB: PropBank (lexical predicate-argument structure)
• Mapping between DS and PB
• Linguistic phenomena: Causative verbs
Proposition Bank
• A PropBank is a large annotated corpus of predicate-argument information
• A set of semantic roles is defined for each verb
• A syntactically parsed corpus is then tagged with verb-specific semantic role information
English PropBank
• English PropBank envisioned as the next level of Penn Treebank (Kingsbury & Palmer, 2003)
• Added a layer of predicate-argument information to the Penn Treebank
• Broad in its coverage, covering every instance of a verb and its semantic arguments in the corpus
English PropBank Annotation
• Two steps are involved in annotation
  – Choose a sense ID for the predicate
  – Annotate the arguments of that predicate with semantic roles
• This requires two components: frame files and the PropBank tagset
PropBank Frame files
• PropBank defines semantic roles on a verb-by-verb basis
• This is defined in a verb lexicon consisting of frame files
• Each predicate will have a set of roles associated with a distinct usage
• A polysemous predicate can have several rolesets within its frame file
An example
• John rings the bell
ring.01 Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for
An example
• John rings the bell
• Tall aspen trees ring the lake

ring.01 Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for

ring.02 To surround
  Arg1  Surrounding entity
  Arg2  Surrounded entity
An example
• [John] rings [the bell]            → ring.01
• [Tall aspen trees] ring [the lake] → ring.02

ring.01 Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for

ring.02 To surround
  Arg1  Surrounding entity
  Arg2  Surrounded entity
An example
• [John]_ARG0 rings [the bell]_ARG1            → ring.01
• [Tall aspen trees]_ARG1 ring [the lake]_ARG2 → ring.02

ring.01 Make sound of bell
  Arg0  Causer of ringing
  Arg1  Thing rung
  Arg2  Ring for

ring.02 To surround
  Arg1  Surrounding entity
  Arg2  Surrounded entity
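The English PropBank frame files are stored as XML; the simplified dictionary below is only a sketch of the information a frame file carries for ‘ring’, using the two rolesets shown above:

    # Illustrative only: each roleset ID pairs a verb sense with its numbered roles.
    ring_frame = {
        "ring.01": {"sense": "make sound of bell",
                    "roles": {"Arg0": "causer of ringing",
                              "Arg1": "thing rung",
                              "Arg2": "ring for"}},
        "ring.02": {"sense": "to surround",
                    "roles": {"Arg1": "surrounding entity",
                              "Arg2": "surrounded entity"}},
    }
    # Annotation then amounts to picking a roleset ID for the predicate and
    # labelling its arguments with that roleset's roles.
    print(ring_frame["ring.02"]["roles"])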
English PropBank Tagset
• Numbered arguments Arg0, Arg1, and so on until Arg4
• Modifiers with function tags e.g. ArgM-LOC (location) , ArgM-TMP (time), ArgM-PRP (purpose)
• Modifiers give additional information about when, where or how the event occurred
Using PropBank
• As a computational resource
  – Train semantic role labellers (Pradhan et al., 2005)
  – Question answering systems (with FrameNet)
  – Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005)
• For linguists, to study various phenomena related to predicate-argument structure
Developing Hindi PropBank
• Making a PropBank resource for a new language
  – Linguistic differences
    • Capturing relevant language-specific phenomena
  – Annotation practices
    • Maintain similar annotation practices
  – Consistency across PropBanks
Hindi PropBank
• PropBank annotation on dependency trees has some advantages
• Hindi Treebank uses a large set of dependency labels that have rich semantic information
[Figure: example dependency tree with PropBank labels: diyaa ‘gave’ heads raam ne ‘Raam erg’ (k1 / Arg0), aurat ko ‘woman dat’ (k4 / Arg2) and paise ‘money’ (k2 / Arg1).]
Annotating Hindi PropBank
• HPB consists of 26 labels including arguments and modifiers
  – Numbered arguments: an individual verb’s semantic arguments, e.g. Arg0-Arg3
  – Modifiers: not specific to the verb, labeled ArgM, e.g. ArgM-LOC, ArgM-PRP
• 44,546 individual verb tokens in Hindi PropBank, of which 40% are complex predicates
Annotating Hindi PropBank
Label   Description
ARG0    Agent
ARG1    Patient, theme, undergoer
ARG2    Beneficiary
ARG3    Instrument
Annotating Hindi PropBank
Numbered arguments
Label       Description
ARG0        Agent
ARG1        Patient, theme, undergoer
ARG2        Beneficiary
ARG3        Instrument
ARG2-ATR    attribute
ARG2-LOC    location
ARG2-GOL    goal
ARG2-SOU    source
Annotating Hindi PropBank
Numbered arguments
Label       Description
ARG0        Agent
ARG1        Patient, theme, undergoer
ARG2        Beneficiary
ARG3        Instrument
ARG2-ATR    attribute
ARG2-LOC    location
ARG2-GOL    goal
ARG2-SOU    source

Causative
ARGC        causer
ARGA        secondary causer
Annotating Hindi PropBank
Numbered arguments
Label                 Description
ARG0                  Agent
ARG1                  Patient, theme, undergoer
ARG2                  Beneficiary
ARG3                  Instrument
ARG2-ATR              attribute
ARG2-LOC              location
ARG2-GOL              goal
ARG2-SOU              source

Causative
ARGA                  causer
ARGA-MNS              Intermediate causer
ARG0-GOL, ARG0-MNS    causees

Complex predicate
ARGM-VLV              Verb-verb construction
ARGM-PRX              Noun-verb construction
Annotating Hindi PropBank
• Other modifier labels

Label       Description     Label       Description
ARGM-ADV    adverb          ARGM-CAU    cause
ARGM-DIR    direction       ARGM-DIS    discourse
ARGM-EXT    extent          ARGM-LOC    location
ARGM-MNR    manner          ARGM-MNS    means
ARGM-MOD    modal           ARGM-NEG    negation
ARGM-PRP    purpose         ARGM-TMP    time
Outline
– DS: Dependency Structure ✓
– PB: PropBank (lexical predicate-argument structure) ✓
• Mapping between DS and PB
• Linguistic phenomena: Causative verbs
Hindi Dependency Treebank
• The DS tagset has labels that are in some ways fairly similar to PB
  – Verb-specific labels: k1-k5
  – Verb modifier labels: k7p, k7t, rh, etc.
  – Non-dependency relations like pof, ccof for complex predicates and co-ordination
Dependency structure and PropBank
• In ‘Ram gave the woman money’, the dependents of give are k1 (primary doer), k2 (patient) and k4 (recipient)
• These correspond fairly neatly to Arg0, Arg1 and Arg2
• Dependency labels and, to some extent, the tree structure are helpful for deriving PropBank annotations (sketched below)
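A minimal sketch of that correspondence for the ‘give’ example, assuming a simple dictionary lookup (the real mapping is verb-specific and kept in the frame files):

    # Illustrative near one-to-one correspondence for 'Ram gave the woman money'.
    karaka_to_arg = {"k1": "Arg0", "k2": "Arg1", "k4": "Arg2"}

    dependents = [("raam ne", "k1"), ("aurat ko", "k4"), ("paise", "k2")]
    print([(chunk, karaka_to_arg[drel]) for chunk, drel in dependents])
    # [('raam ne', 'Arg0'), ('aurat ko', 'Arg2'), ('paise', 'Arg1')]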
Dependency structure and PropBank
• A mapping between the dependency karaka labels (HDT) and Hindi PropBank labels (HPB) is feasible
• Such a mapping would increase annotation speed, improve inter-annotator agreement and help in a full-fledged semantic role labeling task
Label comparison
• Using linguistic intuition, we can compare HDT labels with the numbered arguments in HPB
Label comparison
• Similarly, linguistic intuition gives us the mapping from HDT labels to HPB modifiers
Label comparison
• These mappings are included in the HPB frame files, for example for the verb A ‘to come’
• Mapping rules are provided only for numbered arguments
Roleset   Usage            Rule
A.01      to come (path)   k1 → Arg1; k2p → Arg2-GOL
A.03      to arrive        k1 → Arg0; k2p → Arg2-GOL; k5 → Arg2-SOU
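One way to picture the mapping rules stored in a frame file is as a per-roleset lookup table; the dictionary form below is our own illustration of the two rolesets above, not the frame files' actual format:

    # Hypothetical encoding of the mapping rules for the verb A 'to come'.
    frame_mapping_rules = {
        "A.01": {"k1": "Arg1", "k2p": "Arg2-GOL"},                    # to come (path)
        "A.03": {"k1": "Arg0", "k2p": "Arg2-GOL", "k5": "Arg2-SOU"},  # to arrive
    }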
Automatic mapping of HDT to HPB
• A rule-based, probabilistic system for automatic mapping
• We use two kinds of resources:
  – Annotated corpus [Treebank + PropBank]
    • 32,300 tokens, 2,005 predicates
  – Frame files with mapping rules
Argument classification
• We use two kinds of rules to carry out automatic mapping
  – Empirically derived rules
    • Using corpus statistics associated with dependency & PropBank labels
  – Linguistically motivated rules
    • Derived from linguistic intuition & captured in frame files
Linguistically motivated rules
• Helpful for predicates not seen in training data
• We use the mapping captured in the frame files
• Applied after empirically derived rules (see the back-off sketch below)
• Limitation: available for numbered arguments only
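A minimal sketch of the rule ordering just described, assuming two pre-built lookup tables (an empirical table keyed by feature tuples, and the frame-file rules keyed by roleset); the function and variable names are ours:

    def map_label(roleset, voice, drel, empirical_rules, frame_rules):
        # 1. Empirically derived rule for this (roleset, voice, drel) tuple, if any.
        pbrel = empirical_rules.get((roleset, voice, drel))
        # 2. Otherwise back off to the frame-file mapping (numbered arguments only).
        if pbrel is None:
            pbrel = frame_rules.get(roleset, {}).get(drel)
        return pbrel  # None if neither kind of rule applies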
Empirically derived rules
• These rules pick, for each feature tuple, the PropBank label (pbrel) with the highest estimated probability

  Rule(id, v, drel) = argmax_i P(pbrel_i | id, v, drel)
Empirically derived rules
• The feature tuple consists of
  – id: predicate lemma OR predicate ID
  – v: voice (passive or active, given in HDT)
  – drel: dependency label
• For example, the tuple (xe ‘to give’, active, k1)
Example of the rules

Features                  Count   PropBank labels
xe.01_active_k1 (give)    32      Arg0: 0.93, Arg1: 0.03, Arg2: 0.03
xe.01_active_k2           65      Arg1: 0.95, Arg2: 0.01, Arg0: 0.01
xe.01_active_k4           34      Arg2: 0.94, Arg0: 0.02

• Associate the probability of each PropBank label with a particular feature tuple
• We use only 3 features: roleset ID, voice, dependency label
• For the verb ‘give’, this yields the correct mapping for the Hindi labels (k1 → Arg0, k2 → Arg1, k4 → Arg2); see the learning sketch below
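Extracting the empirically derived rules amounts to counting (roleset, voice, dependency label, PropBank label) co-occurrences in the annotated corpus and keeping the most probable PropBank label per tuple. The sketch below is our reconstruction under that reading, not the authors' code:

    from collections import Counter, defaultdict

    def learn_rules(instances):
        """instances: iterable of (roleset, voice, drel, pbrel) from the corpus."""
        counts = defaultdict(Counter)
        for roleset, voice, drel, pbrel in instances:
            counts[(roleset, voice, drel)][pbrel] += 1
        rules = {}
        for tup, dist in counts.items():
            pbrel, n = dist.most_common(1)[0]             # argmax over PropBank labels
            rules[tup] = (pbrel, n / sum(dist.values()))  # label and its probability
        return rules

    # With counts like those in the table above, learn_rules would map the tuple
    # ('xe.01', 'active', 'k1') to ('Arg0', ~0.93).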
Evaluation
• Empirically derived rules

Label       Dist.    Precision   Recall   F1 score
ALL         100.00   90.59       47.92    62.69
Evaluation
• Empirically derived rules

Label       Dist.    Precision   Recall   F1 score
ALL         100.00   90.59       47.92    62.69
ARG0        17.50    95.83       67.27    79.05
ARG1        27.28    94.47       61.62    74.59
ARG2        3.42     81.48       37.93    51.76
Evaluation
• Empirically derived rules

Label       Dist.    Precision   Recall   F1 score
ALL         100.00   90.59       47.92    62.69
ARG0        17.50    95.83       67.27    79.05
ARG1        27.28    94.47       61.62    74.59
ARG2        3.42     81.48       37.93    51.76
ARG2-ATR    2.54     94.55       40.31    56.52
ARG2-GOL    1.61     64.29       21.95    32.73
ARG2-SOU    0.83     78.26       42.86    55.38
Evaluation
• Empirically derived rules

Label       Dist.    Precision   Recall   F1 score
ALL         100.00   90.59       47.92    62.69
ARG0        17.50    95.83       67.27    79.05
ARG1        27.28    94.47       61.62    74.59
ARG2        3.42     81.48       37.93    51.76
ARG2-ATR    2.54     94.55       40.31    56.52
ARG2-GOL    1.61     64.29       21.95    32.73
ARG2-SOU    0.83     78.26       42.86    55.38
ARGM-ADV    3.50     31.82       3.93     7.00
ARGM-CAU    1.44     50.00       5.48     9.88
ARGM-LOC    10.77    83.80       27.42    41.32
ARGM-TMP    7.01     74.63       14.04    23.64
Evaluation

                                  Precision   Recall   F1 score
Empirically derived rules         90.59       47.92    62.69
Linguistically motivated rules    89.80       55.28    68.44
Evaluation

                                  Precision   Recall   F1 score
Empirically derived rules         90.59       47.92    62.69
Linguistically motivated rules    89.80       55.28    68.44

Numbered argument accuracy
                                  Precision   Recall   F-score
Empirically derived rules         93.63       58.76    72.21
Linguistically motivated rules    91.87       72.36    80.96
Evaluation
• Linguistically motivated rules improve the recall with a slight drop in the precision
• With the most frequent PropBank labels, empirically derived rules perform well
• More data should improve the performance for modifier arguments (see the metric sketch below)
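For reference, the precision, recall and F1 figures above can be computed in the usual way; the sketch assumes gold and predicted labels are dictionaries keyed by argument instance, which is our own framing:

    def prf(gold, predicted):
        correct = sum(1 for k, lbl in predicted.items() if gold.get(k) == lbl)
        p = correct / len(predicted) if predicted else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f1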
Summary
• We demonstrate the similarities between Hindi dependency labels and PropBank semantic arguments
• Using three kinds of rules, we achieve a mapping with 93% confidence for almost half the data
• Initial experiments show that mapping reduces annotation time by 33%
Outline
• 3 Representations
  – DS: Dependency Structure ✔
  – PB: PropBank (lexical predicate-argument structure) ✔
• Mapping between DS and PB ✔
• Linguistic phenomena: Causative verbs
Linguistic issues: Causatives
• Syntactic/semantic phenomena (“what”)
  – Relative clauses
  – Causatives
  – Complex predicates, etc.
• Representational issues (“how”)
  – Dependency/phrase structure
• Lexical semantics
Causatives
• Direct causative: khaa + -aa → khilaa
• Indirect causative: khilaa + -vaa → khilvaa
• Adding the causative morpheme -aa means ‘to cause someone to do X’
• It is also possible to add another causative morpheme -vaa, ‘to cause A to cause B to do X’
Causatives
• Problem: Should the relation between causative verbs and their underlying forms be represented?
• Base form (khaa)
• Causativized forms (khilaa, khilvaa)
Causatives
• In the PropBank frame files, we decided to represent this relation
  – The same frame file with separate rolesets is used for the base and causative forms
  – Frame labels also represent this relation
Causatives
raam ne_Arg0   khaana_Arg1   khaayaa
Ram-erg        food          eat-perf
‘Ram ate the food’

Roleset id: KA.01 to eat
  Arg0  eater
  Arg1  the thing being eaten
Causatives
raam ne_Arg0   khaana_Arg1   khaayaa
Ram-erg        food          eat-perf
‘Ram ate the food’

mohan ne_ArgA   raam ko_Arg0-GOL   khaana_Arg1   khilaayaa
Mohan-erg       Ram-dat            food          eat-caus-perf
‘Mohan made Ram eat the food’

Roleset id: KA.01 to eat
  Arg0  eater
  Arg1  the thing being eaten

Roleset id: KilA.01 to feed
  ArgA      feeder
  Arg0-GOL  eater
  Arg1      the thing being eaten
Causatives
raam ne_Arg0   khaana_Arg1   khaayaa
Ram-erg        food          eat-perf
‘Ram ate the food’

mohan ne_ArgA   raam ko_Arg0-GOL   khaana_Arg1   khilaayaa
Mohan-erg       Ram-dat            food          eat-caus-perf
‘Mohan made Ram eat the food’

sita ne_ArgA   mohan se_ArgA-MNS   raam ko_Arg0-GOL   khaana_Arg1   khilvaayaa
Sita-erg       Mohan-instr         Ram-acc            food          eat-ind.caus-caus-perf
‘Sita, through Mohan, made Ram eat the food’

Roleset id: KA.01 to eat
  Arg0  eater
  Arg1  the thing being eaten

Roleset id: KilA.01 to feed
  ArgA      feeder
  Arg0-GOL  eater
  Arg1      the thing being eaten

Roleset id: KilvA.01 to cause to be fed
  ArgA      causer of feeding
  ArgA-MNS  feeder
  Arg0-GOL  eater
  Arg1      the thing eaten
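In dictionary form (our illustration only, not the frame files’ actual format), the base verb and its causatives above share one frame file with three related rolesets:

    khaa_frame = {
        "KA.01":    {"gloss": "to eat",
                     "roles": {"Arg0": "eater", "Arg1": "the thing eaten"}},
        "KilA.01":  {"gloss": "to feed",
                     "roles": {"ArgA": "feeder", "Arg0-GOL": "eater",
                               "Arg1": "the thing eaten"}},
        "KilvA.01": {"gloss": "to cause to be fed",
                     "roles": {"ArgA": "causer of feeding", "ArgA-MNS": "feeder",
                               "Arg0-GOL": "eater", "Arg1": "the thing eaten"}},
    }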
Causatives
raam ne_Arg0   ticket_Arg1   kharidaa
Ram-erg        ticket        bought-perf
‘Ram bought the ticket’

Roleset id: khariid.01 to buy
  Arg0  buyer
  Arg1  the thing being bought
Causatives
raam ne_Arg0   ticket_Arg1   kharidaa
Ram-erg        ticket        bought-perf
‘Ram bought the ticket’

mohan ne_ArgA   raam se_Arg0-MNS   ticket_Arg1   khariidvaaya
Mohan-erg       Ram-inst           ticket        buy-caus-perf
‘Mohan made Ram buy the ticket’

Roleset id: khariid.01 to buy
  Arg0  buyer
  Arg1  the thing being bought

Roleset id: kharidvaa.01 to cause to buy
  ArgA      causer of buying
  Arg0-MNS  causee
  Arg1      the thing being bought
Causatives
raam ne_Arg0   ticket_Arg1   kharidaa
Ram-erg        ticket        bought-perf
‘Ram bought the ticket’

mohan ne_ArgA   raam se_Arg0-MNS   ticket_Arg1   khariidvaaya
Mohan-erg       Ram-inst           ticket        buy-caus-perf
‘Mohan made Ram buy the ticket’

sita ne_ArgA   mohan dwara_ArgA-MNS   raam se_Arg0-MNS   ticket_Arg1   kharidvaaya
Sita-erg       Mohan-by               Ram-inst           ticket        buy-ind.caus-perf
‘Sita, through Mohan, made Ram buy the ticket’

Roleset id: khariid.01 to buy
  Arg0  buyer
  Arg1  the thing being bought

Roleset id: kharidvaa.01 to cause to buy
  ArgA      causer of buying
  Arg0-MNS  causee
  Arg1      the thing being bought
Representing causees
• The ARG0 label represents agents, but for certain causativized forms this is further split into:
  – ARG0-GOL: affected agent
  – ARG0-MNS: non-affected agent
  – ARGA: causer
• For any other intermediate causers: ARGA-MNS
Other linguistic issues
• Empty categories
• Representation of complex predicates
• Intransitive verbs: unaccusatives and unergatives
• Syntactic phenomena: small clauses, relative clauses, co-ordination