1 OntoNotes: A Unified Relational Semantic Representation Sameer Pradhan, Eduard Hovy, Mitchell...

Post on 19-Dec-2015

1

OntoNotes: A Unified Relational Semantic Representation

Sameer Pradhan, Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel

http://www.bbn.com/ontonotes

2

Outline

Multiple layers of annotation and modeling capture useful elements of text meaning at 90% inter-annotator agreement (ITA)
– Syntax
– Proposition
– Word sense
– Ontology
– Coreference
– Names

An integrated relational database representation
– Enforces consistency across the different annotations
– Supports integrated models that can combine evidence from different layers

Some practical issues
– Sensitivity to changes in layers
– Adding a new layer to the data

Few lessons learned

3

Problems with Multiple Layers of Annotation

Not previously available
– A number of these layers have not been available in significant quantity before:
  • Word Sense
  • Coreference

Not previously integrated
– Each layer encoded separately as individual files, requiring supporting documentation for interpretation

Not previously completely consistent
– Mismatches between Treebank and PropBank

Not previously user friendly
– Raw text format

4

Unified Representation

Provide a bare-bones representation independent of the individual layer’s semantics that can
– Efficiently capture intra- and inter-layer semantics
– Maintain component independence (facilitate collaboration)
– Provide mechanism for flexible integration (for an application)
– Integrate information at the required level of granularity
– Data storage as close as possible to an application backend
– Adaptable in face of incremental representational changes
– API extremely accessible (don’t need to be a hacker to use it)
– Ability to easily perform cross-layer queries
– Easily extensible
– Capable of maintaining version information, ideally at different possible levels
– …
– …

Relational Database + Object Oriented API

5

Relational Representation

Corpus

Trees

Coreference Names

Propositions

Senses

6

Example: Database Representation of Syntax

• Treebank tokens (stored in the Token table) provide the common base
• The Tree table stores the recursive tree nodes, each with its span
• Subsidiary tables define the sets of function tags, phrase types, etc.
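The three points above can be sketched with an in-memory SQLite database. This is a minimal illustration only: the table and column names (`token`, `tree`, `phrase_type`, `start_index`, `end_index`) are assumptions made for the example and are not taken from the actual OntoNotes schema.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative assumptions,
# not the real OntoNotes tables.
SCHEMA = """
CREATE TABLE phrase_type (id TEXT PRIMARY KEY);
CREATE TABLE token (
    id          TEXT PRIMARY KEY,
    sentence_id TEXT NOT NULL,
    token_index INTEGER NOT NULL,
    word        TEXT NOT NULL
);
CREATE TABLE tree (
    id          TEXT PRIMARY KEY,
    sentence_id TEXT NOT NULL,
    parent_id   TEXT REFERENCES tree(id),        -- NULL for the root node
    phrase_type TEXT REFERENCES phrase_type(id),
    start_index INTEGER NOT NULL,                -- first token covered (inclusive)
    end_index   INTEGER NOT NULL                 -- last token covered (inclusive)
);
"""

def build_example():
    """Load the two-word phrase 'central Europe' as tokens plus tree nodes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO phrase_type VALUES (?)",
                     [("NP",), ("JJ",), ("NNP",)])
    words = ["central", "Europe"]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?)",
                     [("s0:%d" % i, "s0", i, w) for i, w in enumerate(words)])
    conn.executemany("INSERT INTO tree VALUES (?, ?, ?, ?, ?, ?)",
                     [("n0", "s0", None, "NP", 0, 1),
                      ("n1", "s0", "n0", "JJ", 0, 0),
                      ("n2", "s0", "n0", "NNP", 1, 1)])
    return conn

def words_under(conn, node_id):
    """Return the words covered by a tree node, looked up through its token span."""
    sentence_id, start, end = conn.execute(
        "SELECT sentence_id, start_index, end_index FROM tree WHERE id = ?",
        (node_id,)).fetchone()
    rows = conn.execute(
        "SELECT word FROM token WHERE sentence_id = ? "
        "AND token_index BETWEEN ? AND ? ORDER BY token_index",
        (sentence_id, start, end))
    return [row[0] for row in rows]

conn = build_example()
print(words_under(conn, "n0"))
```

Because every node carries a token span, any other layer that stores token indices can be joined to the tree without parsing bracketed text.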

7

Object Oriented API

8

Using the API: Importing the modules

9

Using the API: Creating Skeleton Objects

10

Using the API: Creating Full-fledged Objects (I)

11

Using the API: Creating Full-fledged Objects (II)

12

Using the API: Writing to the database

13

Using the API: Reading from the Database

14

Data Loading Life-cycle

Database

15

OntoNotes Data: Current and Future

OntoNotes 1.0
       NW    BN    BC
Eng    300
Chi    250
Ara

OntoNotes 2.0
       NW    BN    BC
Eng    300   200
Chi    250   300
Ara    100

OntoNotes 3.0
       NW    BN    BC
Eng    500   200   200
Chi    250   300   150
Ara    200

16

Advantages of an Integrated Representation

Clean, consistent layers
– Resolve the inconsistencies and problems that this reveals

Well defined relationships
– Database schema defines the merged structure efficiently

Extract individual views
– Treebank, PropBank, etc.

SQL queries can extract examples based on multiple layers or define new views

Python Object-oriented API allows for programmatic access to tables and queries

17

Example of Database Query Function

What is the distribution of named entities that are ARG0s of the predicate "say"?

    from collections import defaultdict

    # a_proposition_bank, a_cursor, a_tree_document and a_tree_id are set up outside this excerpt
    ne_hash = defaultdict(int)  # named-entity type -> frequency

    for a_proposition in a_proposition_bank:
        # keep only propositions whose predicate is "say"
        if a_proposition.lemma == "say":
            arg_in_p_query = "select * from argument where proposition_id = '%s';" % (a_proposition.id)
            a_cursor.execute(arg_in_p_query)
            argument_rows = a_cursor.fetchall()

            for a_argument_row in argument_rows:
                a_argument_id = a_argument_row["id"]
                a_argument_type = a_argument_row["type"]

                # keep only the ARG0 arguments
                if a_argument_type == "ARG0":
                    n_in_arg_query = "select * from argument_node where argument_id = '%s';" % (a_argument_id)
                    a_cursor.execute(n_in_arg_query)
                    argument_node_rows = a_cursor.fetchall()

                    for a_argument_node_row in argument_node_rows:
                        a_node_id = a_argument_node_row["node_id"]

                        # named entities anchored on the argument node itself
                        a_ne_node_query = "select * from name_entity where subtree_id = '%s';" % (a_node_id)
                        a_cursor.execute(a_ne_node_query)
                        ne_rows = a_cursor.fetchall()

                        for a_ne_row in ne_rows:
                            ne_hash[a_ne_row["type"]] += 1

                        # named entities anchored on the subtrees of the argument node
                        a_tree = a_tree_document.get_tree(a_tree_id)
                        a_node = a_tree.get_subtree(a_node_id)

                        for a_child in a_node.subtrees():
                            a_ne_subtree_query = "select * from name_entity where subtree_id = '%s';" % (a_child.id)
                            a_cursor.execute(a_ne_subtree_query)
                            ne_subtree_rows = a_cursor.fetchall()

                            for a_ne_subtree_row in ne_subtree_rows:
                                ne_hash[a_ne_subtree_row["type"]] += 1

Name Entity     Frequency
Person          84
GPE             34
Organization    29
NORP            15
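The same kind of question can also be posed as a single cross-layer SQL join. The sketch below is a hedged illustration: the `argument`, `argument_node` and `name_entity` tables and their columns follow the names used in the queries on this slide, but the `proposition` table with a `lemma` column is an assumption, and the rows are toy data rather than OntoNotes content. Unlike the loop, it only counts names anchored directly on the argument node, not on its subtrees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE proposition   (id TEXT PRIMARY KEY, lemma TEXT);   -- assumed table
CREATE TABLE argument      (id TEXT PRIMARY KEY, proposition_id TEXT, type TEXT);
CREATE TABLE argument_node (argument_id TEXT, node_id TEXT);
CREATE TABLE name_entity   (id TEXT PRIMARY KEY, subtree_id TEXT, type TEXT);
""")

# Toy rows, invented purely to exercise the query.
conn.executemany("INSERT INTO proposition VALUES (?, ?)",
                 [("p1", "say"), ("p2", "say"), ("p3", "reduce")])
conn.executemany("INSERT INTO argument VALUES (?, ?, ?)",
                 [("a1", "p1", "ARG0"), ("a2", "p1", "ARG1"),
                  ("a3", "p2", "ARG0"), ("a4", "p3", "ARG0")])
conn.executemany("INSERT INTO argument_node VALUES (?, ?)",
                 [("a1", "n1"), ("a2", "n2"), ("a3", "n3"), ("a4", "n4")])
conn.executemany("INSERT INTO name_entity VALUES (?, ?, ?)",
                 [("e1", "n1", "PERSON"), ("e2", "n2", "GPE"),
                  ("e3", "n3", "PERSON"), ("e4", "n4", "ORG")])

# Proposition layer joined to the name layer through the shared tree node id.
QUERY = """
SELECT ne.type, COUNT(*) AS frequency
FROM proposition AS p
JOIN argument      AS a  ON a.proposition_id = p.id
JOIN argument_node AS an ON an.argument_id = a.id
JOIN name_entity   AS ne ON ne.subtree_id = an.node_id
WHERE p.lemma = ? AND a.type = ?
GROUP BY ne.type
ORDER BY frequency DESC, ne.type
"""

distribution = conn.execute(QUERY, ("say", "ARG0")).fetchall()
print(distribution)
```

Passing the lemma and argument type as query parameters, rather than formatting them into the SQL string, keeps the query reusable for other predicates and argument labels.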

18

Reconciling Treebank and PropBank

We found several mismatches between syntax and propositions
– Sometimes PropBank was right
– Sometimes Treebank was right

Guidelines modified to bring the two in line

Now each argument points to a single node in the tree
– Secondary connections are made using Treebank trace chains
– Almost no discontinuous arguments
– Non-trace connections are explicitly identified

This greater consistency will make it easier to train models that predict argument structure
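One way to picture the "single node" constraint is as a span check: an argument is consistent with the tree only if its token span is exactly the span of some tree node. The sketch below is an assumption-laden illustration, not the project's reconciliation tool. The node spans describe a small tree for "in central Europe", and the argument named `bad-arg` is invented to show a mismatch.

```python
# Token spans (inclusive) of the nodes in a small tree for "in central Europe".
NODE_SPANS = {
    "PP":  (0, 2),   # in central Europe
    "IN":  (0, 0),   # in
    "NP":  (1, 2),   # central Europe
    "JJ":  (1, 1),   # central
    "NNP": (2, 2),   # Europe
}

def find_mismatches(arguments, node_spans):
    """Return the names of arguments whose span matches no tree node."""
    spans = set(node_spans.values())
    return [name for name, span in arguments.items() if span not in spans]

arguments = {
    "ARGM-LOC": (0, 2),  # covers exactly the PP node
    "bad-arg":  (0, 1),  # "in central" is not a constituent (hypothetical)
}
print(find_mismatches(arguments, NODE_SPANS))
```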

19

Sensitivity to Changes – PropBank changes

[Figure: parse tree for "... major reductions and realignments of troops in central Europe – ..." (S, NP and PP nodes over the tags JJ NNS CC NNS IN NNS IN JJ NNP), with the PropBank labels ARG1, ARG2 and ARGM-LOC attached to tree nodes]

20

Sensitivity to Changes – Treebank changes

[Figure: parse tree for "... major reductions and realignments of troops in central Europe – ..." (S, NP and PP nodes over the tags JJ NNS CC NNS IN NNS IN JJ NNP)]

• If a node was deleted, remove the associated annotation
• If any node has a change in its children or parent node, update the associated annotation. Print the new PropBank
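The two rules above can be sketched as a small function that compares the old and new versions of a tree. The data structures are assumptions chosen for the example: a tree is a dictionary from node id to `(parent_id, children)`, and annotations map an annotation id to the node it points at.

```python
def sync_annotations(annotations, old_tree, new_tree):
    """Split annotations into kept, removed and needing-update after a tree edit.

    annotations: dict of annotation id -> node id
    old_tree, new_tree: dict of node id -> (parent id, tuple of child ids)
    """
    kept, removed, to_update = {}, [], []
    for annotation_id, node_id in annotations.items():
        if node_id not in new_tree:
            # Rule 1: the node was deleted, so drop the annotation.
            removed.append(annotation_id)
            continue
        kept[annotation_id] = node_id
        if old_tree.get(node_id) != new_tree[node_id]:
            # Rule 2: parent or children changed, so flag the annotation.
            to_update.append(annotation_id)
    return kept, removed, to_update

# Hypothetical edit: node n3 is deleted, which also changes n2's children.
old_tree = {"n0": (None, ("n1", "n2")), "n1": ("n0", ()),
            "n2": ("n0", ("n3",)), "n3": ("n2", ())}
new_tree = {"n0": (None, ("n1", "n2")), "n1": ("n0", ()),
            "n2": ("n0", ())}
annotations = {"arg1": "n1", "arg2": "n2", "argm": "n3"}

kept, removed, to_update = sync_annotations(annotations, old_tree, new_tree)
print(kept, removed, to_update)
```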

21

Adding a new layer

1. What information do you want to capture?

2. Define relationship with the required layer

3. Design tables

4. Superimpose on existing machinery with respect to the anchor

5. Create a class in the corpora package
   a. Define a few specific functions
      • Create object from original annotation (Text Reader)
      • Write object to database (DB Writer)
      • Create object from database (DB Reader)
      • Write database to original format (Text Writer)
      • Pretty print function (Pretty Printer)
   b. Write at least one alignment function at the level where the enrichment is required, or even multiple levels
      • Enrich Treebank/Document/…
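Step 5 can be sketched as a toy layer class with the five functions and one alignment function. Everything here is hypothetical: the class name `NameLayer`, the `name_span` table, and the "start end type" text format are invented for illustration and do not reflect the actual corpora package.

```python
import sqlite3

class NameLayer:
    """Toy annotation layer: name spans anchored to token indices."""

    def __init__(self, document_id, spans):
        self.document_id = document_id
        self.spans = list(spans)  # [(start_token, end_token, type), ...]

    @classmethod
    def from_text(cls, document_id, text):
        """Text Reader: one 'start end type' triple per line (made-up format)."""
        spans = []
        for line in text.splitlines():
            if line.strip():
                start, end, ne_type = line.split()
                spans.append((int(start), int(end), ne_type))
        return cls(document_id, spans)

    def write_to_db(self, conn):
        """DB Writer: store one row per span."""
        conn.execute("CREATE TABLE IF NOT EXISTS name_span (document_id TEXT, "
                     "start_token INTEGER, end_token INTEGER, type TEXT)")
        conn.executemany("INSERT INTO name_span VALUES (?, ?, ?, ?)",
                         [(self.document_id,) + span for span in self.spans])

    @classmethod
    def from_db(cls, conn, document_id):
        """DB Reader: rebuild the object from its rows."""
        rows = conn.execute(
            "SELECT start_token, end_token, type FROM name_span "
            "WHERE document_id = ? ORDER BY start_token", (document_id,))
        return cls(document_id, rows.fetchall())

    def to_text(self):
        """Text Writer: back to the original line format."""
        return "\n".join("%d %d %s" % span for span in self.spans)

    def pretty_print(self, tokens):
        """Pretty Printer: show each name type next to the words it covers."""
        return "\n".join("%-6s %s" % (t, " ".join(tokens[s:e + 1]))
                         for s, e, t in self.spans)

    def enrich(self, token_labels):
        """Alignment function: attach each name type to the tokens it covers."""
        for start, end, ne_type in self.spans:
            for index in range(start, end + 1):
                token_labels[index] = ne_type
        return token_labels

tokens = ["the", "Vienna", "talks", "at", "the", "Pentagon"]
layer = NameLayer.from_text("doc1", "1 1 GPE\n4 5 ORG\n")
conn = sqlite3.connect(":memory:")
layer.write_to_db(conn)
restored = NameLayer.from_db(conn, "doc1")
print(restored.pretty_print(tokens))
```

The round trip from text to database and back gives a simple self-check for a new layer: the written text should equal the text that was read.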

22

Few Errors Found

Missing co-indices in Trees (found during loading)
Invalid sense numbers (while checking against repository)
Multiple sense definitions (in the repository)
Validation errors in schemas
Dead pointers in ontology
Multiple coreference chain memberships
Missing/Invalid predicate/argument pointers
Invalid PB/TB merges
Filename/Content mismatches
Pinyin/Unicode inconsistencies
Varying sentence breaks
SLINK Errors
Inconsistent TB
Empty specifications in the merge process
Typos (found through Type Tables)
.. And, a few annotation Errors
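Several of these errors are the kind a loading-time consistency check can surface. As one hedged example, the sketch below flags mentions that belong to more than one coreference chain. The representation of a chain as a list of `(start, end)` token spans and the sample chains are assumptions for illustration only.

```python
from collections import defaultdict

def multiple_chain_memberships(chains):
    """Return the mentions that appear in more than one coreference chain.

    chains: dict of chain id -> list of (start_token, end_token) mentions
    """
    chains_by_mention = defaultdict(set)
    for chain_id, mentions in chains.items():
        for mention in mentions:
            chains_by_mention[mention].add(chain_id)
    return {mention: sorted(chain_ids)
            for mention, chain_ids in chains_by_mention.items()
            if len(chain_ids) > 1}

# Hypothetical chains: the mention (7, 7) was placed in both e0 and e2.
chains = {
    "e0": [(0, 1), (7, 7)],
    "e1": [(3, 4)],
    "e2": [(7, 7), (9, 10)],
}
print(multiple_chain_memberships(chains))
```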

23

Some Interesting Problems Addressed

Word sense annotation transferred from old Treebank to new Treebank

Coreference annotation transferred to new Treebank

Treebank/PropBank with or without NMLs reside in harmony

Various levels of data quality identified in the database

Varying styles of marking traces normalized

Language specific idiosyncrasies in inventories and frames normalized

Data generated for annotation
– Eventive nouns
– Coreference

24

Few Lessons Learned

Each layer should
– abide by a minimum dependency principle
– adhere to a well defined schema

Try to maintain consistency across representation of similar components

Use a centralized, version controlled repository

Need for single-point, push-button loading philosophy

25

Conclusion

Many annotation layers available, integrated using a relational schema

An extensible, relational/object oriented architecture available to the community

Easily Accessible
– Through Python API
– SQL queries

OntoNotes Release 2.0 available from LDC

unencumbered, open source!!

26

Backup

27

Syntax Layer

Identifies meaningful phrases in the text

Lays out the structure of how they are related

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .

[Figure (SYNTAX): parse tree for "... major reductions and realignments of troops in central Europe – ..." (S, NP and PP nodes over the tags JJ NNS CC NNS IN NNS IN JJ NNP)]

28

Propositional Structure

Tells who did what to whom

For both verbs and nouns

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .

[Figure: parse tree for "... major reductions and realignments of troops in central Europe – ..." (S, NP and PP nodes over the tags JJ NNS CC NNS IN NNS IN JJ NNP), with the labels ARG1, ARG2 and ARGM-LOC attached to tree nodes]

29

Predicate Frames

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .

reduction: reduce.01 – Make less

Argument                 Filler in the example
ARG0 – Agent             -
ARG1 – Thing falling     the troops
ARG2 – Amount fallen     major
ARG3 – Starting point    -
ARG4 – Ending point      -

Predicate frames define the meanings of the numbered arguments

30

Word Sense and Ontology

Meanings of nouns and verbs are specified using a catalog of possible senses

All the senses are annotatable at 90% ITA
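As a rough illustration of what a 90% figure means, the sketch below computes simple observed agreement between two annotators, that is, the fraction of instances on which they chose the same sense. Treating ITA as plain percentage agreement is an assumption here, since the slides do not define the measure, and the sense labels are invented.

```python
def observed_agreement(labels_a, labels_b):
    """Fraction of instances on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same instances")
    if not labels_a:
        raise ValueError("no instances to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical sense numbers chosen by two annotators for ten instances of one word.
annotator_1 = [2, 1, 2, 2, 1, 3, 2, 1, 1, 2]
annotator_2 = [2, 1, 2, 2, 1, 3, 2, 1, 4, 2]
print(observed_agreement(annotator_1, annotator_2))
```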

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .

Word Sense: aim
1. Point or direct object, weapon, at something ...
2. Wish, purpose or intend to achieve something
Sense selected in the example: 2. Wish, purpose or intend to achieve something

Word Sense: register
1. Enter into an official record
2. Be aware of, enter into someone’s consciousness
3. Indicate a measurement
4. Show in one’s face
Sense selected in the example: 1. Enter into an official record

Ontology links (currently being added) capture similarities between related senses of different words

31

Coreference

Identifies different mentions of the same entity within a document – especially links definite, referring noun phrases, and pronouns to their antecedents

Two types tagged – Identity and Attributive

Concerns about the pace of the Vienna talks -- which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe – also are being registered at the Pentagon .

[Figure: mentions linked to the entities e0, e1 and e2. Labels shown: "President Bush", "conventional arms talk", "Pentagon", "He", "Vienna talks – which are aimed at the destruction of some 100,000 weapons , as well as major reductions and realignments of troops in central Europe", "the Pentagon"]