Project overview
Framework
Example: Chemical–Disease Interactions
Example: Protein-Protein Interactions
Key Contributions
Supporting Annotation Layers for Natural Language Processing
Preslav Nakov
Ariel Schwartz
Brian Wolf
Marti Hearst
CS & SIMS
UC Berkeley
We demonstrate a system for flexible querying of text that has been annotated with the results of NLP processing. The system supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. We present the Layered Query Language (LQL) and its use on examples taken from the NLP literature.
Sample Output
Summary
Related Work
Project support: NSF-DBI-0317510 & Genentech
Project URL: http://biotext.berkeley.edu/
SELECT p1.content, verb.content, p2.content, COUNT(*) AS cnt
FROM (
  BEGIN_LQL
    [layer='sentence' { ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p1
      [layer='pos' && tag_name="verb" &&
        (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%")
      ] AS verb
      [layer='shallow_parse' && tag_name='NP'
        [layer='gene'] $
      ] AS p2
    ] SELECT p1.content, verb.content, p2.content
  END_LQL
) lql
GROUP BY p1.content, verb.content, p2.content
ORDER BY cnt DESC
Goal: Find all sentences that consist of a noun phrase containing a gene followed by a morphological variant of the verb “activate”, “inhibit”, or “bind”, followed by another NP containing a gene.
PROTEIN 1                INTERACTION VERB  PROTEIN 2                FREQUENCY
Ca2                      activates         protein kinase           312
Cln3                     activate          protein kinase           234
TAP                      binds             transcription factor     192
TNF                      activates         protein tyrosine kinase  133
serine/threonine kinase  binding           RhoA GTPase              132
Phospholamban            inhibits          ATPase                   114
PRL                      activated         transcription factor     108
Interleukin 2            activates         transcription factor     84
Prolactin                activates         transcription factor     84
AMPA                     activated         protein kinase           78
Nerve growth factor      activates         protein kinase           78
LPS                      inhibited         MHC class II             75
Heat shock protein       Binding           p59                      72
EPO                      activated         STAT5                    63
EGF                      activated         PP2A                     60
cis                      binds             Sp1                      50
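The matching and counting behind a table like this can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the LQL engine: each sentence is flattened into a hypothetical list of (kind, content, ends_with_gene) tuples, the `~ "activate%"` test is read as a prefix match, and the small corpus is invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Verb stems from the query; "~ 'activate%'" is assumed to mean a prefix match.
VERB_PREFIXES = ("activate", "inhibit", "bind")

def interactions(sentence):
    """Yield (np1, verb, np2) triples found in one sentence.

    `sentence` is a list of (kind, content, ends_with_gene) tuples in
    textual order. This flat encoding is an assumption of the sketch.
    """
    # combinations() keeps textual order and may skip elements,
    # which mimics an ordered match with ALLOW GAPS.
    for a, v, b in combinations(sentence, 3):
        if (a[0] == "NP" and a[2]
                and v[0] == "verb" and v[1].lower().startswith(VERB_PREFIXES)
                and b[0] == "NP" and b[2]):
            yield a[1], v[1], b[1]

# Hypothetical, hand-annotated sentences (not taken from the corpus).
corpus = [
    [("NP", "TNF", True), ("verb", "activates", False),
     ("NP", "protein tyrosine kinase", True)],
    [("NP", "TNF", True), ("adv", "strongly", False),
     ("verb", "activates", False), ("NP", "protein tyrosine kinase", True)],
    [("NP", "the cells", False), ("verb", "binds", False),
     ("NP", "Sp1", True)],  # first NP has no gene, so no match
]

# Equivalent of GROUP BY ... ORDER BY cnt DESC in the outer SQL.
counts = Counter(t for s in corpus for t in interactions(s))
ranked = counts.most_common()
print(ranked)
```

The first two sentences yield the same triple, one of them across an intervening adverb, so the triple is counted twice. The third sentence is rejected because its first NP does not end with a gene.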
The LQL Query
Annotations are stored independently of text in an RDBMS
Declarative query language for annotation retrieval
Indexing structure designed for efficient query processing
Layered Query Language for easy retrieval
Object-oriented API for annotations: insertion, deletion, and modification
Multiple overlapping layers (cannot be expressed in a single XML file)
Self-overlapping, parallel layers, allowing multiple syntactic parses of the same text
Integration of multiple intersecting hierarchies (e.g. MeSH, UMLS, WordNet)
Specialized query language
Flexible results format
Focused on scaling annotation-based queries to very large corpora (millions of documents) with many layers of annotations
Each annotation represents an interval spanning a sequence of characters
• absolute start and end positions
Each layer corresponds to a conceptually different kind of annotation
• Layers can be
  • Sequential
  • Overlapping (e.g., two multiple-word concepts sharing a word)
  • Hierarchical
    • spanning, when the intervals are nested as in a parse tree, or
    • ontologically, when the token itself is derived from a hierarchical ontology
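A minimal Python sketch of this interval model, using the sentence from the annotation layers example. The `Annotation` class, the helper names, and the concept labels are assumptions made for illustration; they are not part of the system's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    layer: str     # e.g. "shallow_parse" or "ontology"
    start: int     # absolute start character offset (inclusive)
    end: int       # absolute end character offset (exclusive)
    tag: str = ""  # layer-specific label

def contains(outer, inner):
    """Spanning hierarchy: inner's interval is nested inside outer's."""
    return outer.start <= inner.start and inner.end <= outer.end

def overlaps(a, b):
    """The two intervals share at least one character."""
    return a.start < b.end and b.start < a.end

text = ("Overexpression of Bcl-2 results in insufficient white blood "
        "cell death and activation of p53.")

def span(layer, phrase, tag=""):
    """Build an annotation over the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Annotation(layer, start, start + len(phrase), tag)

np_chunk = span("shallow_parse", "insufficient white blood cell death", "NP")
# Two multi-word concepts sharing the word "cell" (labels are hypothetical).
white_blood_cell = span("ontology", "white blood cell", "concept A")
cell_death = span("ontology", "cell death", "concept B")

print(contains(np_chunk, white_blood_cell))    # nested inside the NP
print(overlaps(white_blood_cell, cell_death))  # overlapping concepts
```

Both concepts are nested in the noun phrase, while neither contains the other, which is the overlapping case a single XML tree cannot express.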
Layers of Annotations
Annotation Layers Example
Indexing Architectures
Sentence: Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
Word (IDs): 185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523
Part of Speech: NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
Shallow Parse: NP PP NP VP PP NP NP PP NP
Ontology: D019254 D044465 D001769 D002477 D003643 D001773 D016923 D007962 D016158
Gene/protein: 24224596 28102012043 39727642722
Full parse, sentence and section layers are not shown.
[Figure: annotation table layouts for the indexing architectures. Columns in the first layout: PMID, SECTION, LAYER ID, START CHAR POS, END CHAR POS, TAG TYPE, SEQUENCE POS, SENTENCE, WORD ID. The second layout shows the same columns without END CHAR POS. Further columns: FIRST WORD POS, LAST WORD POS. Legend: basic architecture; columns added in architectures 2, 3, 4, and 5. The sample rows belong to PMID 3345, section b (body), with layer IDs 0 (word), 1 (POS), and 3 (s.parse).]
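To make the relational storage concrete, here is a hedged sketch using Python's built-in sqlite3 module. The column names follow the figure labels, but which columns belong to which architecture, the index definition, the gene layer ID, and all row values are assumptions chosen for illustration; this is not the system's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        pmid INTEGER, section TEXT, layer_id INTEGER,
        start_char_pos INTEGER, end_char_pos INTEGER,
        tag_type TEXT, sentence INTEGER,
        first_word_pos INTEGER, last_word_pos INTEGER
    )
""")
# Hypothetical index: look up a layer within a document and sentence.
conn.execute("""
    CREATE INDEX idx_annotation
    ON annotation (layer_id, pmid, sentence, first_word_pos)
""")

SHALLOW_PARSE, GENE = 3, 4  # layer IDs assumed for this sketch

conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        # pmid, section, layer, start, end, tag, sentence, first, last
        (3345, "b", SHALLOW_PARSE, 18, 23, "NP", 1, 3, 3),
        (3345, "b", SHALLOW_PARSE, 35, 70, "NP", 1, 6, 10),
        (3345, "b", GENE, 18, 23, "gene", 1, 3, 3),
    ],
)

# NP chunks that contain a gene annotation, using word-position containment.
np_with_gene = conn.execute("""
    SELECT np.start_char_pos, np.end_char_pos
    FROM annotation AS np
    JOIN annotation AS g
      ON g.pmid = np.pmid AND g.sentence = np.sentence
     AND g.first_word_pos >= np.first_word_pos
     AND g.last_word_pos <= np.last_word_pos
    WHERE np.layer_id = ? AND np.tag_type = 'NP' AND g.layer_id = ?
""", (SHALLOW_PARSE, GENE)).fetchall()
print(np_with_gene)
```

Storing word positions alongside character offsets lets containment between layers be tested with integer comparisons inside a single sentence, which is one plausible reason for adding such columns.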
FROM
  [layer='sentence' { NO ORDER, ALLOW GAPS }
    [layer='shallow_parse' && tag_name='NP'
      [layer='chemicals'] AS chemical $
    ]
    [layer='shallow_parse' && tag_name='NP'
      [layer='MeSH' && tree_number BELOW "C"] AS disease $
    ]
  ] AS sent
SELECT chemical.content, disease.content, sent.content
Goal: extract the relation that statin (potentially) prevents coronary heart disease.
MeSH C subtree contains diseases
MeSH supplementary concepts represent chemicals.
LQL query to find potentially useful sentences:
This query extracts sentences containing two NPs in any order without overlaps (NO ORDER) and separated by any number of intervening elements (ALLOW GAPS). It requires one of the NPs to end with a chemical ($), and the other to end with a MeSH term from the C subtree (BELOW).
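The semantics described here can be sketched in Python as follows. The dictionary fields, the `below` helper, the chunking of the example, and the tree number are assumptions for illustration; whether BELOW includes the root itself is also assumed, not confirmed by the source.

```python
from itertools import permutations

def below(tree_number, root):
    """True if a MeSH tree number lies in the subtree rooted at `root`.

    Assumes BELOW is inclusive of the root. A one-letter root such as "C"
    is a top-level category, so a plain prefix test is enough there.
    """
    if len(root) == 1:
        return tree_number.startswith(root)
    return tree_number == root or tree_number.startswith(root + ".")

def chemical_disease_pairs(nps):
    """Return (chemical NP, disease NP) pairs from one sentence's NP chunks.

    Each chunk records the layer of the annotation that ends it (the `$`
    constraint) in the hypothetical field "end_layer".
    """
    pairs = []
    # permutations() tries both orders (NO ORDER); the two chunks are
    # distinct, and anything may lie between them (ALLOW GAPS).
    for a, b in permutations(nps, 2):
        if (a["end_layer"] == "chemicals"
                and b["end_layer"] == "MeSH"
                and below(b["tree_number"], "C")):
            pairs.append((a["content"], b["content"]))
    return pairs

# Hypothetical chunking and annotations for illustration only.
sentence = [
    {"content": "Adherence", "end_layer": None, "tree_number": None},
    {"content": "statin", "end_layer": "chemicals", "tree_number": None},
    {"content": "coronary heart disease", "end_layer": "MeSH",
     "tree_number": "C14.280.647.250"},  # illustrative tree number
]

pairs = chemical_disease_pairs(sentence)
print(pairs)
```

Reversing the chunk order gives the same pair, which is the point of NO ORDER: the chemical may precede or follow the disease.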
A mechanism to effectively store and query layers of textual annotations.
Evaluated various structures for data storage and have arrived at an efficient and simple one.
Implemented a concise and powerful annotation query language (LQL).
Built a web interface
Planning to release the software to the research community.
Tree systems. Overview: see (Bird et al., 2005). Examples: TGrep2, TIGERSearch, LPath, CorpusSearch, GSearch, Linguist’s Search Engine, Netgraph, TIQL, VIQTORIA, etc.
Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined (Cassidy & Harrington, 2001).
NiteQL (the query language of MATE): highly expressive, allows querying of intersecting hierarchies; stored in XML (McKelvie et al., 2001).
TIQL: queries manipulate intervals of text, indicated by XML tags; supports set operations (Nenadic et al., 2002).
Annotation graphs: directed acyclic graph; nodes can have time stamps, constrained via paths to labeled parents and children (Bird and Liberman, 2001).
Based on benchmarking, we use Architecture 5.
“Adherence to statin prevents one coronary heart disease event for every 429 patients.”
1.4 million MEDLINE abstracts
10 million sentences annotated
320 million multi-layered annotations
70 GB database size.