Information Extraction from Literature
Yue Lu
BeeSpace Seminar, Oct 24, 2007
Overview
BeeSpace V4: a deeper semantic base than the current V3 system
  Entities and relations vs. mutual information
Four levels
  Level 1: Entity Recognition
  Level 2: Entity Association Mining
  Level 3: Relation Extraction
  Level 4: Inference and Hypothesis Generation
Overview
Level 1: Entity Recognition (detailed later)
Level 2: Entity Association Mining
  Suppose entities are properly tagged
  Utilize the co-occurrence patterns of entities to extract semantics
  e.g. a bee biologist may want to know which genes are important for foraging behavior
  Similar to the TREC Genomics 2007 task
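The co-occurrence idea above can be sketched in a few lines. This is only an illustration under assumptions: the input format (a list of `(entity_type, name)` pairs per sentence), the `BEHAVIOR` tag, and the names `geneA`/`geneB` are hypothetical, and raw sentence-level counts stand in for whatever association measure the system would actually use.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentences, type_a, type_b):
    """Count sentences in which a type_a entity and a type_b entity co-occur."""
    counts = Counter()
    for entities in sentences:
        # Use sets so a pair is counted at most once per sentence.
        a_names = {name for etype, name in entities if etype == type_a}
        b_names = {name for etype, name in entities if etype == type_b}
        for pair in product(a_names, b_names):
            counts[pair] += 1
    return counts


# Hypothetical tagged sentences (made-up data, not from the Bee collection).
tagged = [
    [("GENE", "geneA"), ("BEHAVIOR", "foraging")],
    [("GENE", "geneA"), ("GENE", "geneB"), ("BEHAVIOR", "foraging")],
    [("GENE", "geneB"), ("ANATOMY", "maxillary")],
]

counts = cooccurrence_counts(tagged, "GENE", "BEHAVIOR")
ranked = counts.most_common()  # gene/behavior pairs, most frequent first
```

Ranking genes by how often they share a sentence with "foraging" gives a first answer to the biologist's question; a normalized score would be needed to discount genes that are simply mentioned often.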
TREC Genomics 2007
e.g. “Which [PATHWAYS] are possibly involved in the disease ADPKD?”
Currently only retrieval techniques:
  Gene synonym expansion
  Conjunctive query interpretation
  User relevance feedback
Tagged entities would definitely help
Overview
Level 3: Relation Extraction
  Goal is to extract the relations between entities
  Generally requires entities to be properly tagged first
  Detailed later
Level 4: Inference and Hypothesis Generation
  Inference on the knowledge base
  Graph mining
Entity Recognition
Gene Example:
Although <GENE>mxp</GENE> and <GENE>Pb</GENE> display very similar expression patterns, <GENE>pb</GENE> null embryos develop normally
Entity Recognition
Anatomy Example:
In normal embryos, mxp is expressed in the <ANATOMY>maxillary</ANATOMY> and <ANATOMY>labial</ANATOMY> segments, whereas ectopic expression is observed in some GOF variants.
Entity Recognition
Biological process Example:
Amongst these are the Bicoid, the Nanos, and the terminal class gene products, some of which are oncoproteins involved in signal transduction for <BIOLOGICAL PROCESS>the formation of terminal structures in the embryo</BIOLOGICAL PROCESS>.
Entity Recognition
Pathways Example:
Several signal transduction pathways have been described in Drosophila, and this review explores the potential of oncogene studies using one of those pathways - <PATHWAY>the terminal class signal transduction pathway</PATHWAY> - to better understand the cellular mechanisms of proto-oncogenes that mediate cellular responses in vertebrates including humans
Entity Recognition
Protein family Example:
While non-arthropod orthologs have been found for many Drosophila eye developmental genes, this has not been the case for the glass (gl) gene, which encodes a <PROTEIN FAMILY>zinc finger transcription factor</PROTEIN FAMILY> required for photoreceptor cell specification, differentiation, and survival.
Entity Recognition
CRE (cis-regulatory elements) Example:
A synthetic, 23-bp <CRE>ecdysterone regulatory element (EcRE)</CRE>, derived from the upstream region of the Drosophila melanogaster hsp27 gene, was inserted adjacent to the herpes simplex virus thymidine kinase promoter fused to a bacterial gene for chloramphenicol acetyltransferase (CAT).
Entity Recognition
Phenotype
Definition: a set of observable physical characteristics of an individual organism
Example: Fog, dumpy
Entity Recognition
Class 1: Small Variation (Dictionary/Ontology)
  Organism, Anatomy, Biological Process, Pathway, Protein Family
Class 2: Medium Variation
  Gene, cis-Regulatory Element
Class 3: Large Variation
  Phenotype, Behavior
Entity Recognition
Generally can be defined as a classification problem
Boils down to feature definition
  Class 1: matching a word in the Dictionary/Ontology
  Class 2: prefix/suffix of the word, POS tags, …
  Class 3: ?
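To make "feature definition" concrete, here is a minimal sketch of a per-token feature function. The feature names, the 3-character prefix/suffix length, and the tiny lexicon are all assumptions for illustration. POS tags are left out because they would need an external tagger.

```python
def token_features(tokens, i, lexicon):
    """Build a feature dict for tokens[i]; lexicon is a set of lowercase names."""
    word = tokens[i]
    return {
        # Class 1 style evidence: exact match against a dictionary/ontology.
        "in_lexicon": word.lower() in lexicon,
        # Class 2 style evidence: surface shape of the word.
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "has_digit": any(c.isdigit() for c in word),
        "is_capitalized": word[:1].isupper(),
        # Neighboring words, with sentence-boundary placeholders.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",
    }


# Hypothetical anatomy lexicon and a shortened example sentence.
lexicon = {"maxillary", "labial"}
tokens = "mxp is expressed in the maxillary segments".split()
features = token_features(tokens, 5, lexicon)
```

A classifier for Class 2 entities would be trained on such dicts. For Class 1 the `in_lexicon` feature alone may already be enough.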
Entity Recognition
First focus on Class 1
  Relatively simple
  Class 2 and Class 3 need training examples
  Useful in entity association mining
  Useful in facilitating extraction of many interesting relations
Related work: Textpresso
Textpresso
Input: full-text C. elegans literature
Output: tagged XML format
Defined a Textpresso ontology
  First category is biological entities
  Manually curated a lexicon of names
Implemented with Perl regular expressions
We could reuse some of the regular expressions
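A lexicon-driven tagger of this kind can be sketched as follows. This is not Textpresso's code (which is in Perl). It is a Python illustration under assumptions: the helper name `build_tagger` and the two-word anatomy lexicon are hypothetical, and matching is plain case-insensitive whole-word lookup.

```python
import re


def build_tagger(lexicon, tag):
    """Return a function that wraps every lexicon match in <tag>...</tag>."""
    if not lexicon:
        raise ValueError("lexicon must not be empty")
    # Try longer names first so multi-word entries win over their substrings.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )

    def tag_text(text):
        # A replacement function keeps the matched text exactly as written.
        return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

    return tag_text


tag_anatomy = build_tagger({"maxillary", "labial"}, "ANATOMY")
tagged = tag_anatomy("mxp is expressed in the maxillary and labial segments")
```

One tagger per entity class, each built from its own resource list, would cover the Class 1 entities. Ambiguous names and overlapping matches across classes are not handled here.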
Entity Recognition
Resources:
  Organism: Entrez gene table, Textpresso, BeeSpace DB
  Anatomy: FlyBase
  Biological Process, Cellular Component, Molecular Function: Textpresso
  Pathway: KEGG
  Protein Family: PDB, NCBI
Relation Extraction
Expression Location
  the expression of a gene in some location (tissues, body parts)
Homology/Orthology
  one gene is homologous to another gene
Relation Extraction
Biological process
  one gene has some role in a biological process
Genetic/Physical/Regulatory Interaction
  one gene interacts with another gene in a certain fashion (3 types of relations)
  a simple case: Protein-Protein Interaction (PPI)
Relation Extraction
Generally can be defined as a classification problem, which requires training data
Domain adaptation?
An example of PPI
PPI
Problem definition:
  Gene/protein names are already tagged
  A known list of interaction words (133 words)
  Classify each tuple (p1, p2, interWord) within a single sentence
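The candidate tuples that the classifier has to label can be enumerated as in the sketch below. The assumptions are loud ones: proteins are given as token positions (so each name is one token), and the two interaction words shown are a hypothetical sample, not the actual 133-word list.

```python
from itertools import combinations


def candidate_tuples(tokens, protein_idx, interaction_words):
    """List every (p1, p2, interWord) candidate in one tokenized sentence."""
    inter_idx = [i for i, t in enumerate(tokens) if t.lower() in interaction_words]
    candidates = []
    # Every unordered pair of distinct protein mentions ...
    for p1, p2 in combinations(sorted(protein_idx), 2):
        # ... combined with every interaction word found in the sentence.
        for w in inter_idx:
            candidates.append((tokens[p1], tokens[p2], tokens[w]))
    return candidates


# Simplified sentence; protein positions are assumed to come from the tagger.
tokens = ["IgG", "was", "able", "to", "inhibit", "binding", "of", "IgE"]
interaction_words = {"inhibit", "binding"}  # hypothetical sample of the list
cands = candidate_tuples(tokens, [0, 7], interaction_words)
```

Each candidate then gets a binary label (interaction or not), which is why one sentence can yield several tuples.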
PPI
Methods
  Learning algorithm: Maximum Entropy
  Context features, e.g. lexical forms, POS tags, …
    Less dependent on domain
  “Extracting protein-protein interactions using simple contextual features training data”, BioNLP Workshop, HLT-NAACL 06
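As a rough illustration of "context features", the sketch below turns one (p1, p2, interWord) candidate into a feature dict that a maximum entropy (multinomial logistic regression) learner could consume. The particular features, their names, and the window size are assumptions, not the feature set of the cited paper. POS tags are omitted since they need an external tagger.

```python
def context_features(tokens, p1, p2, w, window=2):
    """Features for a candidate; p1, p2, w are distinct token positions."""
    lo, hi = min(p1, p2), max(p1, p2)
    if w < lo:
        position = "before"
    elif w > hi:
        position = "after"
    else:
        position = "between"
    feats = {
        "inter_word": tokens[w].lower(),   # lexical form of the interaction word
        "inter_position": position,        # where it sits relative to the pair
        "protein_distance": hi - lo,       # number of tokens separating the pair
    }
    # Lexical forms of the words around each protein mention.
    for name, idx in (("p1", p1), ("p2", p2)):
        for k in range(1, window + 1):
            left, right = idx - k, idx + k
            feats[f"{name}_left{k}"] = tokens[left].lower() if left >= 0 else "<S>"
            feats[f"{name}_right{k}"] = (
                tokens[right].lower() if right < len(tokens) else "</S>"
            )
    return feats


tokens = ["IgG", "was", "able", "to", "inhibit", "binding", "of", "IgE"]
feats = context_features(tokens, p1=0, p2=7, w=4)
```

Because these features look only at surface context rather than at domain vocabulary, they are the kind that could plausibly transfer to a different collection.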
PPI
Training/testing data: BioCreative
  1000 hand-labeled sentences, 3964 tuples
  5-fold cross-validation
Performance
  Average precision = 47.14624
  Average recall = 43.97337
  Average F1 = 45.35523
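The evaluation protocol can be sketched as below: split the data into 5 folds, score each fold, then average. The sketch assumes the split is made over the 3964 tuples in contiguous blocks, which the slides do not specify. The labels in the usage lines are made up. Since F1 is computed per fold and then averaged, the average F1 need not equal the harmonic mean of the average precision and average recall.

```python
def prf(gold, pred):
    """Precision, recall, and F1 for parallel lists of 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def average_scores(per_fold):
    """Average each of (precision, recall, F1) over the folds."""
    k = len(per_fold)
    return tuple(sum(scores[i] for scores in per_fold) / k for i in range(3))


folds = list(k_fold_indices(3964, 5))
fold_sizes = [len(test) for _, test in folds]

# Made-up labels for a single fold, only to show the scoring call.
example = prf(gold=[1, 1, 0, 0], pred=[1, 0, 1, 0])
```

In practice one would likely split by sentence rather than by tuple, so that tuples from the same sentence never land in both training and test folds.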
PPI
Training data: BioCreative
  1000 hand-labeled sentences, 3964 tuples
Testing data (different domain): Bee collection
Performance (judged by Moushumi)
  Total number of tuples extracted as PPI instances: 92
  Precision: 63%
PPI Misclassification examples
Type 1: No interaction
  Sentence: Pretreatment of platelet suspension with phospholipase A2 from N. naja atra or A. mellifera venom (50 µg/ml) inhibited platelet aggregation induced by sodium arachidonate or collagen, but not induced by thrombin or ionophore A-23187.
  False: (collagen, thrombin, induced)
  True: relation between protein and platelet aggregation; no PPI
PPI Misclassification examples
Type 2: Incorrect interaction word
  Sentence: IgG antibody was able to inhibit binding of IgE antibody in the PLA radioallergosorbent test (RAST) from 10-40% at a molar excess of 10- to 1000-fold.
  False: (IgG antibody, IgE antibody, binding)
  True: (IgG antibody, IgE antibody, inhibit)
PPI Misclassification examples
Type 3: Incorrect protein involved
  Sentence: AChE exhibits a butyrylcholinesterase (BuChE) activity that represents about 14% of AChE activity.
  False: (AChE, AChE, exhibits)
  True: (AChE, BuChE, exhibits)
PPI
Possible improvements
  Syntactic patterns: “Optimizing syntax patterns for discovering protein-protein interactions”, in Proc. ACM Symposium on Applied Computing (SAC), Bioinformatics Track
  Parse tree
  Dependency parsing
  …