Project overview


Supporting Annotation Layers for Natural Language Processing

Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti Hearst
CS & SIMS, UC Berkeley

We demonstrate a system for flexible querying of text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. We present the Layered Query Language (LQL) and its use on examples taken from the NLP literature.

Project support: NSF-DBI-0317510 & Genentech
Project url: http://biotext.berkeley.edu/lql

Poster panels: Project overview; Framework; Example: Chemical–Disease Interactions; Example: Protein-Protein Interactions; Key Contributions; Sample Output; Summary; Related Work

Example: Protein-Protein Interactions

Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.

The LQL Query

SELECT p1.content, verb.content, p2.content, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ]
    SELECT p1.content, verb.content, p2.content
  END_LQL
)
GROUP BY p1.content, verb.content, p2.content
ORDER BY cnt DESC

Sample Output

PROTEIN 1                 INTERACTION VERB   PROTEIN 2                  FREQUENCY
Ca2                       activates          protein kinase             312
Cln3                      activate           protein kinase             234
TAP                       binds              transcription factor       192
TNF                       activates          protein tyrosine kinase    133
serine/threonine kinase   binding            RhoA GTPase                132
Phospholamban             inhibits           ATPase                     114
PRL                       activated          transcription factor       108
Interleukin 2             activates          transcription factor       84
Prolactin                 activates          transcription factor       84
AMPA                      activated          protein kinase             78
Nerve growth factor       activates          protein kinase             78
LPS                       inhibited          MHC class II               75
Heat shock protein        Binding            p59                        72
EPO                       activated          STAT5                      63
EGF                       activated          PP2A                       60
cis                       binds              Sp1                        50
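The outer SQL part of the protein-protein query groups the (protein 1, verb, protein 2) triples returned by LQL and orders them by frequency. The Python sketch below imitates only that aggregation step on an in-memory list. It assumes that `content ~ "activate%"` behaves like a case-insensitive prefix match; the poster does not define the operator, and the function names `is_interaction_verb` and `count_triples` are hypothetical.

```python
from collections import Counter

# Verb stems taken from the poster's query; prefix matching is an assumption
# about what `content ~ "activate%"` means.
VERB_PREFIXES = ("activate", "inhibit", "bind")


def is_interaction_verb(token: str) -> bool:
    """Return True if the token starts with one of the verb stems."""
    return token.lower().startswith(VERB_PREFIXES)


def count_triples(triples):
    """Mimic GROUP BY p1, verb, p2 ... ORDER BY cnt DESC on a list of triples."""
    counts = Counter(t for t in triples if is_interaction_verb(t[1]))
    return counts.most_common()


if __name__ == "__main__":
    sample = [
        ("Ca2", "activates", "protein kinase"),
        ("Ca2", "activates", "protein kinase"),
        ("TAP", "binds", "transcription factor"),
        ("X", "regulates", "Y"),  # filtered out: verb does not match
    ]
    for triple, cnt in count_triples(sample):
        print(triple, cnt)
```

In the real system this grouping is done by the database, which is the point of embedding LQL inside a SQL statement.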

Annotations are stored independently of text in an RDBMS

Declarative query language for annotation retrieval

Indexing structure designed for efficient query processing

Layered Query Language for easy retrieval

Object-oriented API for annotations: insertion, deletion, and modification

Multiple overlapping layers (cannot be expressed in a single XML file)

Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text

Integration of multiple intersecting hierarchies (e.g., MeSH, UMLS, WordNet)

Specialized query language

Flexible results format

Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations

Layers of Annotations

Each annotation represents an interval spanning a sequence of characters
• absolute start and end positions

Each layer corresponds to a conceptually different kind of annotation

Layers can be
• Sequential
• Overlapping (e.g., two multiple-word concepts sharing a word)
• Hierarchical
  • spanning, when the intervals are nested as in a parse tree, or
  • ontologically, when the token itself is derived from a hierarchical ontology
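A minimal sketch of the interval view of annotations, assuming half-open character offsets (the poster only says positions are absolute). The `Annotation` class, the helper names, and the choice of example spans are illustrative and not taken from the system itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str
    start: int  # absolute offset of the first character
    end: int    # offset just past the last character (half-open: an assumption)


def annotate(text: str, phrase: str, layer: str) -> Annotation:
    """Build an annotation for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(layer, start, start + len(phrase))


def contains(outer: Annotation, inner: Annotation) -> bool:
    """Hierarchical (spanning) relation: inner lies entirely within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end


TEXT = ("Overexpression of Bcl-2 results in insufficient white blood "
        "cell death and activation of p53.")

np_chunk = annotate(TEXT, "insufficient white blood cell death", "shallow_parse")
white_blood_cell = annotate(TEXT, "white blood cell", "ontology")
cell_death = annotate(TEXT, "cell death", "ontology")

if __name__ == "__main__":
    # Two multi-word concepts sharing the word "cell": overlapping, not nested.
    print(overlaps(white_blood_cell, cell_death), contains(white_blood_cell, cell_death))
    # Both concepts are nested inside the noun-phrase chunk.
    print(contains(np_chunk, white_blood_cell), contains(np_chunk, cell_death))
```

Because each layer is only a set of intervals over the same character positions, overlapping and nested layers need no special handling beyond these two comparisons.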

Annotation Layers Example

Sentence: Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.

Word             Word ID   Part of Speech
Overexpression   185       NN
of               8         IN
Bcl-2            51112     NN
results          23017     VBZ
in               7         IN
insufficient     5874      JJ
white            2791      JJ
blood            8952      NN
cell             1263      NN
death            5632      NN
and              17        CC
activation       8252      NN
of               8         IN
p53              12523     NN

Shallow parse (in sentence order): NP PP NP VP PP NP NP PP NP
Ontology: D019254 D044465 D001769 D002477 D003643 D001773 D016923 D007962 D016158
Gene/protein (identifiers as extracted): 24224596 28102012043 39727642722

Full parse, sentence and section layers are not shown.

Indexing Architectures

[Figure: sample rows of the annotation index for PMID 3345, section b (body). Column headers in the figure: PMID, SECTION, LAYER ID, START CHAR POS, END CHAR POS, TAG TYPE, SENTENCE, SEQUENCE POS, WORD ID, FIRST WORD POS, LAST WORD POS. Column groups are labeled "Basic architecture" and "Added" for architectures 2, 3, 4 and 5.]

PMID   SECTION    LAYER ID      START CHAR POS   END CHAR POS   TAG TYPE   WORD ID
3345   b (body)   0 (word)      34               39             59571      59571
3345   b          0             41               48             55608      55608
3345   b          0             50               54             89985      89985
3345   b          1 (POS)       34               39             27 (NN)    59571
3345   b          1             41               48             53 (VB)    55608
3345   b          1             50               54             27         89985
3345   b          3 (s.parse)   34               39             31 (NP)
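As a rough sketch of how annotations stored as rows in an RDBMS can be queried by span, the Python example below loads the sample values from the index figure (PMID 3345) into an in-memory SQLite table and retrieves the annotations of one layer that fall inside an annotation of another layer. The table name, lowercase column names, index definition, and the `contained` helper are assumptions made for illustration; the poster does not give the actual schema or the database used.

```python
import sqlite3

# (pmid, section, layer_id, start_char_pos, end_char_pos, tag_type, word_id)
# Values follow the sample rows in the figure; layer 0 = word, 1 = POS,
# 3 = shallow parse.
ROWS = [
    (3345, "b", 0, 34, 39, 59571, 59571),
    (3345, "b", 0, 41, 48, 55608, 55608),
    (3345, "b", 0, 50, 54, 89985, 89985),
    (3345, "b", 1, 34, 39, 27, 59571),
    (3345, "b", 1, 41, 48, 53, 55608),
    (3345, "b", 1, 50, 54, 27, 89985),
    (3345, "b", 3, 34, 39, 31, None),
]


def build_db() -> sqlite3.Connection:
    """Create a hypothetical annotation table and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE annotation (
               pmid INTEGER, section TEXT, layer_id INTEGER,
               start_char_pos INTEGER, end_char_pos INTEGER,
               tag_type INTEGER, word_id INTEGER)"""
    )
    # An illustrative composite index; the real indexing structure may differ.
    conn.execute(
        """CREATE INDEX idx_span ON annotation
               (pmid, section, layer_id, start_char_pos, end_char_pos)"""
    )
    conn.executemany("INSERT INTO annotation VALUES (?,?,?,?,?,?,?)", ROWS)
    return conn


def contained(conn, outer_layer: int, inner_layer: int):
    """Return inner-layer annotations lying within any outer-layer annotation."""
    return conn.execute(
        """SELECT inner_a.start_char_pos, inner_a.end_char_pos, inner_a.tag_type
           FROM annotation AS outer_a
           JOIN annotation AS inner_a
             ON inner_a.pmid = outer_a.pmid
            AND inner_a.section = outer_a.section
            AND inner_a.start_char_pos >= outer_a.start_char_pos
            AND inner_a.end_char_pos <= outer_a.end_char_pos
           WHERE outer_a.layer_id = ? AND inner_a.layer_id = ?
           ORDER BY inner_a.start_char_pos""",
        (outer_layer, inner_layer),
    ).fetchall()


if __name__ == "__main__":
    conn = build_db()
    # POS annotations inside the shallow-parse NP spanning characters 34-39.
    print(contained(conn, outer_layer=3, inner_layer=1))
```

Columns such as FIRST WORD POS and LAST WORD POS in the later architectures would let the same containment test compare word positions instead of character offsets.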

Goal: extract the relation that statin (potentially) prevents coronary heart disease.

• MeSH C subtree contains diseases.
• MeSH supplementary concepts represent chemicals.

LQL query to find potentially useful sentences:

FROM
  [layer='sentence' { NO ORDER, ALLOW GAPS }
    [layer='shallow_parse' && tag_name='NP'
      [layer='chemicals'] AS chemical $
    ]
    [layer='shallow_parse' && tag_name='NP'
      [layer='MeSH' && tree_number BELOW "C"] AS disease $
    ]
  ] AS sent
SELECT chemical.content, disease.content, sent.content

This query extracts sentences containing two NPs in any order without overlaps (NO ORDER) and separated by any number of intervening elements (ALLOW GAPS). It requires one of the NPs to end with a chemical ($) and the other to end with a MeSH term from the C subtree (BELOW).
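The constraints in this query can be approximated over plain intervals. The sketch below is one possible reading, not the LQL implementation: it treats `$` as "the inner annotation ends where the enclosing NP ends", `BELOW` as a strict descendant test on dotted tree numbers, and NO ORDER as trying both NPs in either role. The `Span` type, the function names, and the tree number `C99.1` are hypothetical placeholders.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int
    end: int
    label: str = ""  # chemical name, or a tree number for MeSH terms


def ends_with(np: Span, inner: Span) -> bool:
    """Reading of `$`: inner lies inside np and shares its right boundary."""
    return np.start <= inner.start and inner.end == np.end


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def is_below(tree_number: str, ancestor: str) -> bool:
    """Reading of BELOW: strict descendant in a dotted tree-number hierarchy."""
    if tree_number == ancestor:
        return False
    if len(ancestor) == 1:  # top-level category letter such as "C"
        return tree_number.startswith(ancestor)
    return tree_number.startswith(ancestor + ".")


def find_pairs(nps: List[Span], chemicals: List[Span],
               diseases: List[Span]) -> List[Tuple[Span, Span]]:
    """Pair a chemical-final NP with a non-overlapping disease-final NP."""
    pairs = []
    for np1 in nps:
        for chem in chemicals:
            if not ends_with(np1, chem):
                continue
            for np2 in nps:  # any order, any gap, but no overlap with np1
                if overlaps(np1, np2):
                    continue
                for dis in diseases:
                    if ends_with(np2, dis) and is_below(dis.label, "C"):
                        pairs.append((chem, dis))
    return pairs


if __name__ == "__main__":
    nps = [Span(0, 6), Span(17, 39)]
    chemicals = [Span(0, 6, "statin")]
    diseases = [Span(22, 39, "C99.1")]  # placeholder tree number
    print(find_pairs(nps, chemicals, diseases))
```

Under this reading, an NP whose last word is not part of the disease term would not match, which is why the query is described as finding potentially useful sentences rather than all of them.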

A mechanism to effectively store and query layers of textual annotations.

Evaluated various structures for data storage and have arrived at an efficient and simple one.

Implemented a concise and powerful annotation query language (LQL).

Built a web interface.

Planning to release the software to the research community.

Related Work

Tree systems. Overview: see (Bird et al., 2005). Examples: TGrep2, TIGERSearch, LPath, CorpusSearch, GSearch, Linguist’s Search Engine, Netgraph, TIQL, VIQTORIA, etc.

Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined (Cassidy & Harrington, 2001).

NiteQL (the query language of MATE): highly expressive, allows querying of intersecting hierarchies; stored in XML (McKelvie et al., 2001).

TIQL: queries manipulate intervals of text, indicated by XML tags; supports set operations (Nenadic et al., 2002).

Annotation graphs: directed acyclic graph; nodes can have time stamps, constrained via paths to labeled parents and children (Bird and Liberman, 2001).

Based on benchmarking, we use Architecture 5.

“Adherence to statin prevents one coronary heart disease event for every 429 patients.”

1.4 million MEDLINE abstracts

10 million sentences annotated

320 million multi-layered annotations

70 GB database size.