University of Texas at Austin

Machine Learning Group

Integrating Co-occurrence Statistics with IE for Robust Retrieval of Protein Interactions from Medline

Razvan C. Bunescu, Raymond J. Mooney

Machine Learning Group, Department of Computer Sciences

University of Texas at Austin

{razvan, mooney}@cs.utexas.edu

Arun K. Ramani, Edward M. Marcotte

Institute for Cellular and Molecular Biology, Center for Computational Biology and Bioinformatics

University of Texas at Austin

{arun, marcotte}@icmb.utexas.edu

Introduction

• Two orthogonal approaches to mining binary relations from a collection of documents:

    – Information Extraction:

        • Relation Extraction from individual sentences;

        • Aggregation of the results over the entire collection.

    – Co-occurrence Statistics:

        • Compute (co-)occurrence counts over the entire corpus;

        • Use statistical tests to detect whether co-occurrence is due to chance.

• Aim: Combine the two approaches into an integrated extraction model.

Outline

• Introduction.

• Two approaches to relation extraction:

– Information Extraction.

– Co-occurrence Statistics.

• Integrated Model.

• Evaluation Corpus.

• Experimental Results.

• Future Work & Conclusion.

Information Extraction

• Most IE systems detect relations only between entities mentioned in the same sentence.

• The existence & type of the relationship is based on lexico-semantic cues inferred from the sentence context.

• Given a pair of entities, corpus-level results are assembled by combining the confidence scores that the IE system associates with each occurrence.

Example: "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."

Relation Extraction using a Subsequence Kernel

• Subsequences of words and POS tags are used as Implicit Features.

• Assumes the entities have already been annotated.

• An exponential penalty factor is used to down-weight longer word gaps.

• Generalization of the extraction system from [Blaschke et al., 2001].

• The system is trained to output a normalized confidence value for each extraction.

Example pattern: interaction of (3) PROT (3) with PROT

[Bunescu et al., 2005].
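The example pattern above can be read as a rule with bounded word gaps. The sketch below is only an illustration of that reading, not the subsequence kernel itself: it assumes that `(k)` means "skip at most k tokens", that protein mentions have already been replaced by the placeholder `PROT`, and it performs a hard match rather than the soft, penalty-weighted matching the kernel uses. The function names `compile_gap_pattern` and `matches` are hypothetical.

```python
import re

def compile_gap_pattern(pattern):
    """Compile a rule such as 'interaction of (3) PROT (3) with PROT'.

    Assumption: '(k)' means "skip at most k tokens"; every other item is a
    literal token. The text is matched as space-joined tokens followed by
    one trailing space, so each token ends with exactly one space.
    """
    parts = [r"(?:^| )"]  # a match must start at a token boundary
    for item in pattern.split():
        gap = re.fullmatch(r"\((\d+)\)", item)
        if gap:
            # Zero up to k arbitrary tokens.
            parts.append(r"(?:\S+ ){0,%s}" % gap.group(1))
        else:
            # A literal token followed by its separating space.
            parts.append(re.escape(item) + " ")
    return re.compile("".join(parts))

def matches(pattern, tokens):
    """Return True if the token list contains a match for the gap pattern."""
    text = " ".join(tokens) + " "
    return compile_gap_pattern(pattern).search(text) is not None

RULE = "interaction of (3) PROT (3) with PROT"
sentence = "the interaction of the human PROT protein with PROT was observed"
print(matches(RULE, sentence.split()))
```

A kernel-based extractor generalizes this idea by scoring every shared subsequence and discounting those with long gaps, instead of accepting or rejecting a sentence outright.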

Aggregating Corpus-Level Results

[Diagram] Sentences S1, S2, ..., Sn

→ Information Extraction → Confidences P(R(p1, p2) | S1), P(R(p1, p2) | S2), ..., P(R(p1, p2) | Sn)

→ Aggregation Γ → P(R(p1, p2) | C)

Aggregation Operators

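• Maximum: $\Gamma_{max} = \max_i P(R(p_1, p_2) \mid S_i)$

• Noisy-OR: $\Gamma_{nor} = 1 - \prod_i \left(1 - P(R(p_1, p_2) \mid S_i)\right)$

• Average: $\Gamma_{avg} = \frac{1}{n} \sum_i P(R(p_1, p_2) \mid S_i)$

• AND: $\Gamma_{and} = \prod_i P(R(p_1, p_2) \mid S_i)^{1/n}$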
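The four aggregation operators can be sketched directly from their definitions. This is a minimal illustration that assumes the sentence-level confidences for one protein pair are given as a non-empty list of floats in [0, 1]; the function names are hypothetical.

```python
import math

def agg_max(confidences):
    """Maximum: the single most confident sentence decides."""
    return max(confidences)

def agg_noisy_or(confidences):
    """Noisy-OR: one minus the probability that every sentence is a miss."""
    return 1.0 - math.prod(1.0 - p for p in confidences)

def agg_avg(confidences):
    """Average: arithmetic mean of the sentence-level confidences."""
    return sum(confidences) / len(confidences)

def agg_and(confidences):
    """AND: product of the confidences, each raised to 1/n (geometric mean)."""
    return math.prod(confidences) ** (1.0 / len(confidences))

confidences = [0.9, 0.5, 0.1]
for operator in (agg_max, agg_noisy_or, agg_avg, agg_and):
    print(operator.__name__, round(operator(confidences), 4))
```

Noisy-OR grows with every additional mention, while Maximum ignores all but the best sentence, so the two can rank frequently mentioned pairs very differently.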

Co-occurrence Statistics

• Compute (co-)occurrence counts for the two entities in the entire corpus.

• Based on these counts, detect if the co-occurrence of the two entities is due to chance, or to an underlying relationship.

• Can use various statistical measures:

    – Pointwise Mutual Information (PMI)

    – Chi-square Test (χ²)

    – Log-Likelihood Ratio (LLR)

Pointwise Mutual Information

• N: the total number of protein pairs co-occurring in the same sentence in the entire corpus.

• P(p1, p2) ≈ n12/N: the probability that p1 and p2 co-occur in the same sentence.

• P(p1, p) ≈ n1/N: the probability that p1 co-occurs with any protein in the same sentence.

• P(p2, p) ≈ n2/N: the probability that p2 co-occurs with any protein in the same sentence.

$$PMI(p_1, p_2) = \log \frac{P(p_1, p_2)}{P(p_1, p) \cdot P(p_2, p)} \approx \log \left( N \cdot \frac{n_{12}}{n_1 \cdot n_2} \right)$$

$$sPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2}$$

• The higher the sPMI(p1, p2) value, the less likely it is that p1 and p2 co-occurred by chance => they may be interacting.
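The counts behind PMI and sPMI can be gathered in one pass over the corpus. The sketch below assumes each sentence is given as a list of protein identifiers and that an unordered pair is counted at most once per sentence; both are assumptions, as are the function names. Because N is the same for every pair and the logarithm is monotonic, ranking pairs by sPMI gives the same order as ranking by the PMI approximation.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Return (pair_counts, protein_counts, total) for a list of sentences.

    pair_counts[(p1, p2)] plays the role of n12, protein_counts[p] the role
    of n1 or n2, and total the role of N.
    """
    pair_counts = Counter()
    protein_counts = Counter()
    for proteins in sentences:
        for p1, p2 in combinations(sorted(set(proteins)), 2):
            pair_counts[(p1, p2)] += 1
            protein_counts[p1] += 1
            protein_counts[p2] += 1
    total = sum(pair_counts.values())
    return pair_counts, protein_counts, total

def pmi(n12, n1, n2, total):
    """PMI approximated from counts: log(N * n12 / (n1 * n2))."""
    return math.log(total * n12 / (n1 * n2))

def spmi(n12, n1, n2):
    """Simplified PMI: n12 / (n1 * n2)."""
    return n12 / (n1 * n2)

sentences = [["A", "B"], ["A", "B", "C"], ["C", "D"]]
pairs, proteins, total = cooccurrence_counts(sentences)
for (p1, p2), n12 in sorted(pairs.items()):
    print(p1, p2, round(spmi(n12, proteins[p1], proteins[p2]), 4))
```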

Integrated Model

• [Local] The sentence-level Relation Extraction (SSK) uses information that is local to one occurrence of a pair of entities (p1,p2).

• [Global] The corpus-level Co-occurrence Statistics (PMI) are based on counting all occurrences of a pair of entities (p1,p2).

• [Local & Global] Achieve a more reliable extraction performance by combining the two orthogonal approaches into an integrated model.

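Integrated Model (cont'd)

• Rewrite sPMI as:

$$sPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2} = \frac{1}{n_1 \cdot n_2} \sum_{i=1}^{n_{12}} 1$$

• Instead of counting 1 for each co-occurrence, use the confidence output by the IE system => a weighted PMI:

$$wPMI(p_1, p_2) = \frac{1}{n_1 \cdot n_2} \sum_{i=1}^{n_{12}} P(R(p_1, p_2) \mid S_i) = \frac{n_{12}}{n_1 \cdot n_2} \cdot \Gamma_{avg}(\{P(R(p_1, p_2) \mid S_i)\})$$

• Can use any aggregation operator:

$$wPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2} \cdot \Gamma_{max}(\{P(R(p_1, p_2) \mid S_i)\})$$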
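The weighted PMI can be sketched as a single function that takes the sentence-level confidences for a pair and an aggregation operator. This is an illustration under the assumption that `confidences` holds one IE score per co-occurrence (so its length is n12) and that n1 and n2 were counted beforehand; the names `wpmi` and `average` are hypothetical.

```python
def average(values):
    """Arithmetic mean, standing in for the Average aggregation operator."""
    return sum(values) / len(values)

def wpmi(confidences, n1, n2, aggregate=average):
    """Weighted PMI: n12 / (n1 * n2) times the aggregated IE confidence."""
    n12 = len(confidences)
    return n12 / (n1 * n2) * aggregate(confidences)

confidences = [0.8, 0.4]  # IE scores for the two sentences mentioning the pair
print(wpmi(confidences, n1=3, n2=3))                 # Average operator
print(wpmi(confidences, n1=3, n2=3, aggregate=max))  # Maximum operator
```

With the Average operator and every confidence equal to 1, the value reduces to sPMI, which matches the rewriting step above.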

Evaluation Corpus

• An evaluation corpus needs to provide two types of information:

– The complete list of interactions mentioned in the corpus.

– Annotations of protein mentions, together with their gene identifiers.

• The corpus was compiled based on the HPRD [www.hprd.org] and NCBI [www.ncbi.nih.gov] databases:

– Every interaction is linked to a set of Medline articles that report the corresponding experiment.

– An interaction is specified as a tuple containing:

• The LocusLink (EntrezGene) identifiers of the proteins involved.

• The PubMed identifiers of the corresponding Medline articles.

Evaluation Corpus (cont'd)

<gene id="2318">
  <name>FLNC</name>
  <description>filamin C, gamma</description>
  <synonyms>
    <synonym>ABPA</synonym>
    <synonym>ABPL</synonym>
    <synonym>FLN2</synonym>
  </synonyms>
  <proteins>
    <protein>gamma filamin</protein>
    <protein>filamin 2</protein>
    <protein>filamin C, gamma</protein>
  </proteins>
</gene>

<gene id="58529">
  <name>MYOZ1</name>
  <description>myozenin 1</description>
  <synonyms> ... </synonyms>
  <proteins> FATZ … </proteins>
</gene>

Participant Genes (XML) (NCBI)

<interaction>
  <gene>2318</gene>
  <gene>58529</gene>
  <pubmed>10984498 11171996</pubmed>
</interaction>

Interactions (XML) (HPRD)

<PMID>10984498</PMID>
<AbstractText>
We found that this protein binds to three other Z-disc proteins; therefore we have named it FATZ, gamma-filamin, alpha-actinin and telethonin binding protein of the Z-disc.
</AbstractText>

Medline Abstracts (XML) (NCBI)
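Interaction records in the format shown above can be read with Python's standard XML parser. The sketch below is illustrative only: the enclosing `<interactions>` root element and the function name `load_interactions` are assumptions, since the slides show a single record and not the full file layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content: one record from the slide inside an assumed root.
INTERACTIONS_XML = """
<interactions>
  <interaction>
    <gene>2318</gene>
    <gene>58529</gene>
    <pubmed>10984498 11171996</pubmed>
  </interaction>
</interactions>
"""

def load_interactions(xml_text):
    """Return a list of (gene_ids, pubmed_ids) tuples, one per interaction."""
    root = ET.fromstring(xml_text)
    records = []
    for node in root.iter("interaction"):
        gene_ids = tuple(gene.text.strip() for gene in node.findall("gene"))
        pubmed_ids = node.findtext("pubmed", default="").split()
        records.append((gene_ids, pubmed_ids))
    return records

for gene_ids, pubmed_ids in load_interactions(INTERACTIONS_XML):
    print(gene_ids, pubmed_ids)
```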

Gene Name Annotation and Normalization

• NCBI provides a comprehensive dictionary of human genes, where each gene is specified by its unique identifier, and qualified with:

– an official name,

– a description,

– a list of synonyms,

– a list of protein names.

• All these names (including the description) are considered as referring to the same entity.

• Use a dictionary-based annotation, similar to [Cohen, 2005].

Gene Name Annotation and Normalization (cont'd)

• Each name is reduced to a normal form by:

1) Replacing dashes with spaces

2) Introducing spaces between letters and digits

3) Replacing Greek letters with their Latin counterparts

4) Substituting Roman numerals with Arabic numerals

5) Decapitalizing the first word (if capitalized).

• The names are further tokenized, and checked against a dictionary of 100K English nouns.

• Names associated with more than one gene identifier (i.e. ambiguous names) are ignored.

• The final gene name dictionary is implemented as a trie-like structure.
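The normalization steps and the removal of ambiguous names can be sketched as follows. Several details are assumptions made for illustration: the small `GREEK` and `ROMAN` tables, the choice to map spelled-out Greek letter names to a single Latin letter, the reading of "capitalized" as "first letter uppercase, rest lowercase", and the use of a plain dictionary where the slides mention a trie-like structure. The check against English nouns is not included.

```python
import re

# Illustrative tables only; a real system would need fuller coverage.
GREEK = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d",
         "α": "a", "β": "b", "γ": "g", "δ": "d"}
ROMAN = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5",
         "VI": "6", "VII": "7", "VIII": "8", "IX": "9", "X": "10"}

def normalize_name(name):
    """Apply the five normalization steps to one gene or protein name."""
    name = name.replace("-", " ")                                     # step 1
    name = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", name)  # step 2
    tokens = name.split()
    tokens = [GREEK.get(token.lower(), token) for token in tokens]    # step 3
    tokens = [ROMAN.get(token, token) for token in tokens]            # step 4
    if tokens and tokens[0][:1].isupper() and tokens[0][1:].islower():
        tokens[0] = tokens[0].lower()                                 # step 5
    return " ".join(tokens)

def build_dictionary(names_by_gene):
    """Map normalized names to gene ids, dropping names shared by several genes."""
    owners = {}
    for gene_id, names in names_by_gene.items():
        for name in names:
            owners.setdefault(normalize_name(name), set()).add(gene_id)
    return {name: next(iter(ids)) for name, ids in owners.items() if len(ids) == 1}

print(normalize_name("Filamin-2"))
print(normalize_name("Collagen type IV"))
```

One caveat of the token-level Roman numeral table is that a genuine single-letter token such as "V" or "X" would also be converted, so a production dictionary would need a more careful rule.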

Experimental Results

• Compare four methods on the task of interaction extraction:

    – Information Extraction:

        • [SSK.Max] Relation extraction with the subsequence kernel (SSK), followed by an aggregation of corpus-level results using Max.

    – Co-occurrence Statistics:

        • [PMI] Pointwise Mutual Information.

        • [HG] The HyperGeometric distribution method from [Ramani et al., 2005].

    – Integrated Model:

        • [PMI.SSK.Max] The combined model of PMI & SSK.

• Draw Precision vs. Recall graphs by ranking the extractions and keeping only the top N interactions, as N varies.
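The ranking-based evaluation can be sketched as follows. This is a minimal illustration assuming each method yields a list of `(pair, score)` tuples and that the gold standard is a set of pairs in the same canonical form; the function name and the toy data are hypothetical and are not results from the experiments.

```python
def precision_recall_points(scored_pairs, gold):
    """Return (N, precision, recall) for every cutoff N over the ranked list."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    points = []
    hits = 0
    for n, (pair, _score) in enumerate(ranked, start=1):
        if pair in gold:
            hits += 1
        points.append((n, hits / n, hits / len(gold)))
    return points

# Toy data for illustration only.
scored = [(("A", "B"), 0.9), (("A", "C"), 0.5), (("C", "D"), 0.7)]
gold = {("A", "B"), ("C", "D"), ("B", "D")}
for n, precision, recall in precision_recall_points(scored, gold):
    print(n, round(precision, 3), round(recall, 3))
```

Running the same function on the scores produced by each method gives one curve per method on a shared set of axes.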

Future Work

• Derive an evaluation corpus from a potentially more accurate database (Reactome).

• Investigate combining IE with other statistical tests (LLR, χ²).

• Design an IE method that is trained to do corpus-level extraction (as opposed to sentence-level extraction).

Conclusion

• Introduced an integrated model that combines two orthogonal approaches to corpus-level relation extraction:

    – Information Extraction (SSK).

    – Co-occurrence Statistics (PMI).

• Derived an evaluation corpus from the HPRD and NCBI databases.

• Experimental results show a more consistent performance across the PR curve.

Aggregating Corpus-Level Results

• Two entities p1 and p2 are mentioned in a corpus C of n sentences, C = {S1, …, Sn}.

• The IE system outputs a confidence value for each of the n occurrences:

$$P(R(p_1, p_2) \mid S_i) \in [0, 1]$$

• The corpus-level confidence value is computed using an aggregation operator Γ:

$$P(R(p_1, p_2) \mid C) = \Gamma(\{P(R(p_1, p_2) \mid S_i) \mid i = 1..n\})$$

Experimental Results

• Compare [PMI] and [HG] on the task of extracting interactions from the entire Medline.

• Use the shared protein function benchmark from [Ramani et al., 2005].

    – Calculate the extent to which interaction partners share functional annotations, as specified in the KEGG and GO databases.

    – Use a Log-Likelihood Ratio (LLR) scoring scheme to rank the interactions:

$$LLR = \ln \frac{P(D \mid I)}{P(D \mid {\sim}I)}$$

• Also plot the scores associated with the HPRD, BIND and Reactome databases.
