Natural Language Processing in Bioinformatics: Uncovering Semantic Relations Barbara Rosario Joint work with Marti Hearst SIMS, UC Berkeley
  • date post

    21-Dec-2015


Natural Language Processing in Bioinformatics:

Uncovering Semantic Relations

Barbara Rosario

Joint work with Marti Hearst

SIMS, UC Berkeley


Outline of Talk

Goal: Extract semantics from text

Information and relation extraction

Protein-protein interactions

Noun compounds


Text Mining

Text Mining is the discovery by computers of new, previously unknown information, via automatic extraction of information from text

Text Mining (cont.)

Text:

  Stress is associated with migraines
  Stress can lead to loss of magnesium
  Calcium channel blockers prevent some migraines
  Magnesium is a natural calcium channel blocker

1: Extract semantic entities from text

  Stress, Migraine, Magnesium, Calcium channel blockers

2: Classify relations between entities

  Associated with, Lead to loss, Prevent, Subtype-of (is a)

3: Do reasoning: find new correlations

4: Do reasoning: infer causality

  No prevention
  Deficiency of magnesium → migraine
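
The reasoning steps in this example can be sketched in a few lines of Python. This is only an illustration, not the system described in the talk: the triple encoding, the relation names and the two inference rules (`inherit_via_is_a`, `deficiency_hypotheses`) are assumptions made for the example.

```python
# The four statements from the example text, hand-encoded as
# (subject, relation, object) triples. The relation names are illustrative.
RELATIONS = [
    ("stress", "associated_with", "migraine"),
    ("stress", "leads_to_loss_of", "magnesium"),
    ("calcium channel blocker", "prevents", "migraine"),
    ("magnesium", "is_a", "calcium channel blocker"),
]


def inherit_via_is_a(relations):
    """Hypothetical rule: if X is_a Y and Y prevents Z, propose X prevents Z."""
    derived = []
    for x, r1, y in relations:
        if r1 != "is_a":
            continue
        for y2, r2, z in relations:
            if y2 == y and r2 == "prevents":
                derived.append((x, "prevents", z))
    return derived


def deficiency_hypotheses(relations):
    """Hypothetical rule: if S leads to loss of X and X prevents Z,
    propose that a deficiency of X may contribute to Z."""
    known = relations + inherit_via_is_a(relations)
    hypotheses = []
    for _s, r1, x in known:
        if r1 != "leads_to_loss_of":
            continue
        for x2, r2, z in known:
            if x2 == x and r2 == "prevents":
                hypotheses.append((f"deficiency of {x}", "may_contribute_to", z))
    return hypotheses


if __name__ == "__main__":
    print(inherit_via_is_a(RELATIONS))
    print(deficiency_hypotheses(RELATIONS))
```

The output is a candidate hypothesis to be checked by a domain expert, not an established causal claim.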

My research

Information extraction: find the entities in the text

  Stress, Migraine, Magnesium, Calcium channel blockers

Relation extraction: find the relations between the entities

  Associated with, Lead to loss, Prevent, Subtype-of (is a)

Information and relation extraction

Problems: Given biomedical text:

  Find all the treatments and all the diseases
  Find the relations that hold between them

Treatment / Disease: Cure? Prevent? Side Effect?

Hepatitis Examples

Cure
  These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.

Prevent
  A two-dose combined hepatitis A and B vaccine would facilitate immunization programs

Vague
  Effect of interferon on hepatitis B

Two tasks

Relationship extraction:
  Identify the several semantic relations that can occur between the entities disease and treatment in bioscience text

Information extraction (IE):
  Related problem: identify such entities

Outline of IE

  Data and semantic relations
  Quick intro to graphical models
  Models and results
  Features
  Conclusions

Data and Relations

MEDLINE, abstracts and titles

3662 sentences labeled
  Relevant: 1724
  Irrelevant: 1771 (e.g., “Patients were followed up for 6 months”)

2 types of entities: treatment and disease

7 relationships between these entities

The labeled data are available at http://biotext.berkeley.edu

Semantic Relationships

810: Cure
  Intravenous immune globulin for recurrent spontaneous abortion

616: Only Disease
  Social ties and susceptibility to the common cold

166: Only Treatment
  Fluticasone propionate is safe in recommended doses

63: Prevent
  Statins for prevention of stroke

36: Vague
  Phenylbutazone and leukemia

29: Side Effect
  Malignant mesodermal mixed tumor of the uterus following irradiation

4: Does NOT cure
  Evidence for double resistance to permethrin and malathion in head lice


Graphical Models

Unifying framework for developing Machine Learning algorithms

Graph theory plus probability theory

Widely used:
  Error correcting codes
  Systems diagnosis
  Computer vision
  Filtering (Kalman filters)
  Bioinformatics

(Quick intro to) Graphical Models

Nodes are random variables

Edges are annotated with conditional probabilities

Absence of an edge between nodes implies conditional independence

“Probabilistic database”

[Figure: example graph over nodes A, B, C, D]

Graphical Models

[Figure: directed graph over nodes A, B, C, D]

Define a joint probability distribution:

$P(X_1, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{Par}(X_i))$

$P(A, B, C, D) = P(A)\,P(D)\,P(B \mid A)\,P(C \mid A, D)$

Learning:
  Given data, estimate P(A), P(B|A), P(D), P(C | A, D)

Inference: compute conditional probabilities, e.g., P(A|B, D)
  Inference = Probabilistic queries. General inference algorithms (Junction Tree)
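
As a rough illustration of the factorization and of inference as a probabilistic query, the sketch below builds the joint P(A,B,C,D) = P(A)P(D)P(B|A)P(C|A,D) for binary variables and answers P(A|B,D) by brute-force enumeration. The probability values are invented for the example, and enumeration stands in for the junction tree algorithm, which is what a general-purpose implementation would use.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# The numbers are illustrative only; they do not come from the talk.
P_A = {True: 0.3, False: 0.7}
P_D = {True: 0.6, False: 0.4}
P_B_GIVEN_A = {True: 0.9, False: 0.2}  # P(B=True | A)
P_C_GIVEN_AD = {                        # P(C=True | A, D)
    (True, True): 0.95,
    (True, False): 0.5,
    (False, True): 0.4,
    (False, False): 0.05,
}


def bernoulli(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(a, b, c, d):
    """P(A,B,C,D) = P(A) P(D) P(B|A) P(C|A,D)."""
    return (
        P_A[a]
        * P_D[d]
        * bernoulli(P_B_GIVEN_A[a], b)
        * bernoulli(P_C_GIVEN_AD[(a, d)], c)
    )


def posterior_a(b, d):
    """P(A=True | B=b, D=d), summing out C by enumeration."""
    numerator = sum(joint(True, b, c, d) for c in (True, False))
    denominator = sum(
        joint(a, b, c, d) for a in (True, False) for c in (True, False)
    )
    return numerator / denominator


if __name__ == "__main__":
    total = sum(joint(*values) for values in product((True, False), repeat=4))
    print("sum over all assignments:", total)
    print("P(A=True | B=True, D=True):", posterior_a(True, True))
```

Enumeration is exponential in the number of variables, which is why dedicated inference algorithms matter for larger graphs.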

Naïve Bayes models

Simple graphical model

Xi depend on Y

Naïve Bayes assumption: all Xi are independent given Y

Currently used for text classification and spam detection

[Figure: node Y with children x1, x2, x3]
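
A minimal Naïve Bayes text classifier might look like the sketch below. It is a toy under stated assumptions: the four training titles are made up for illustration (only "statins for prevention of stroke" echoes an example from the talk), and add-one smoothing is used here for brevity rather than any smoothing method mentioned in the slides.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (list_of_words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(words, model):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y), add-one smoothed."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this example.
TOY = [
    ("statins for prevention of stroke".split(), "prevent"),
    ("vaccine for prevention of hepatitis".split(), "prevent"),
    ("antibiotics for treatment of infection".split(), "cure"),
    ("surgery for treatment of tumors".split(), "cure"),
]
MODEL = train_naive_bayes(TOY)

if __name__ == "__main__":
    print(classify("aspirin for prevention of migraine".split(), MODEL))
```

Because each word contributes an independent factor given the label, training reduces to counting.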

Dynamic Graphical Models

Graphical model composed of repeated segments

HMMs (Hidden Markov Models)
  POS tagging, speech recognition, IE

[Figure: HMM with POS tags t1, t2, ..., tN as hidden states and words w1, w2, ..., wN as observations]

HMMs

Joint probability distribution:

$P(t_1, \ldots, t_N, w_1, \ldots, w_N) = P(t_1)\,P(w_1 \mid t_1) \prod_{i=2}^{N} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$

Estimate P(t1), P(ti|ti-1), P(wi|ti) from labeled data

Inference: P(ti | w1, w2, …, wN)
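
The HMM joint probability and the tag posterior P(ti | w1, …, wN) can be made concrete with a tiny example. The two-tag model and all probabilities below are hypothetical, and the posterior is computed by enumerating every tag sequence, which is only feasible for toy inputs; a practical tagger would use the forward-backward algorithm instead.

```python
from itertools import product

TAGS = ["DET", "NOUN"]

# Hypothetical parameters; in practice they are estimated from labeled data.
P_INIT = {"DET": 0.8, "NOUN": 0.2}
P_TRANS = {
    "DET": {"DET": 0.1, "NOUN": 0.9},
    "NOUN": {"DET": 0.4, "NOUN": 0.6},
}
P_EMIT = {
    "DET": {"the": 0.9, "dog": 0.1},
    "NOUN": {"the": 0.1, "dog": 0.9},
}


def joint(tags, words):
    """P(t_1..t_N, w_1..w_N) = P(t_1) P(w_1|t_1) prod_i P(t_i|t_{i-1}) P(w_i|t_i)."""
    p = P_INIT[tags[0]] * P_EMIT[tags[0]][words[0]]
    for i in range(1, len(words)):
        p *= P_TRANS[tags[i - 1]][tags[i]] * P_EMIT[tags[i]][words[i]]
    return p


def tag_posterior(words, i):
    """P(t_i | w_1..w_N), by summing the joint over all tag sequences."""
    totals = {t: 0.0 for t in TAGS}
    for tags in product(TAGS, repeat=len(words)):
        totals[tags[i]] += joint(tags, words)
    z = sum(totals.values())
    return {t: v / z for t, v in totals.items()}


if __name__ == "__main__":
    sentence = ["the", "dog"]
    print(joint(("DET", "NOUN"), sentence))
    print(tag_posterior(sentence, 1))
```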

Graphical Models for IE

Different dependencies between the features and the relation nodes

[Figure: candidate models; dynamic: D1, D2, D3; static: S1, S2]

Graphical Model

Relation node:
  Semantic relation (cure, prevent, none, ...) expressed in the sentence
  The relation generates the state sequence and the observations

Role nodes:
  Markov sequence of states (roles)
  Role_t ∈ {treatment, disease, none}

Feature nodes (observed):
  word, POS, MeSH…
  Roles generate multiple observations

Inference: Find Relation and Roles given the features observed

Features

  Word
  Part of speech
  Phrase constituent
  Orthographic features: ‘is number’, ‘all letters are capitalized’, ‘first letter is capitalized’ …
  Semantic features (MeSH)

MeSH

MeSH Tree Structures

  1. Anatomy [A]
  2. Organisms [B]
  3. Diseases [C]
  4. Chemicals and Drugs [D]
  5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
  6. Psychiatry and Psychology [F]
  7. Biological Sciences [G]
  8. Physical Sciences [H]
  9. Anthropology, Education, Sociology and Social Phenomena [I]
  10. Technology and Food and Beverages [J]
  11. Humanities [K]
  12. Information Science [L]
  13. Persons [M]
  14. Health Care [N]
  15. Geographic Locations [Z]

MeSH (cont.)

1. Anatomy [A]
  Body Regions [A01] +
  Musculoskeletal System [A02]
  Digestive System [A03] +
  Respiratory System [A04] +
  Urogenital System [A05] +
  Endocrine System [A06] +
  Cardiovascular System [A07] +
  Nervous System [A08] +
  Sense Organs [A09] +
  Tissues [A10] +
  Cells [A11] +
  Fluids and Secretions [A12] +
  Animal Structures [A13] +
  Stomatognathic System [A14]
  (…..)

Body Regions [A01]
  Abdomen [A01.047]
    Groin [A01.047.365]
    Inguinal Canal [A01.047.412]
    Peritoneum [A01.047.596] +
    Umbilicus [A01.047.849]
  Axilla [A01.133]
  Back [A01.176] +
  Breast [A01.236] +
  Buttocks [A01.258]
  Extremities [A01.378] +
  Head [A01.456] +
  Neck [A01.598]
  (….)

Use of lexical Hierarchies in NLP

Big problem in NLP: few words occur a lot, most of them occur very rarely (Zipf’s law)

Difficult to do statistics

One solution: use lexical hierarchies
  Another example: WordNet

Statistics on classes of words instead of words

Mapping Words to MeSH Concepts

headache pain
  Full tree numbers:  C23.888.592.612.441    G11.561.796.444
  Second level:       C23.888 [Neurologic Manifestations]    G11.561 [Nervous System Physiology]
  Top level:          C23 [Pathological Conditions, Signs and Symptoms]    G11 [Musculoskeletal, Neural, and Ocular Physiology]

headache recurrence
  C23.888.592.612.441    C23.550.291.937

breast cancer cells
  A01.236    C04    A11
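
Generalizing a word to a coarser MeSH class amounts to truncating its tree number. The sketch below shows the idea; the function names and the small `WORD_TO_MESH` dictionary are assumptions for illustration (the tree numbers are the ones shown on the slide), and a real system would need a full MeSH lookup and a policy for words with several tree numbers.

```python
def truncate_mesh(tree_number, level):
    """Keep the first `level` dot-separated components of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:level])


# Word-to-descriptor mapping restricted to the examples on the slide.
WORD_TO_MESH = {
    "headache": "C23.888.592.612.441",
    "pain": "G11.561.796.444",
    "recurrence": "C23.550.291.937",
}


def generalize(words, level):
    """Map each word to its MeSH tree number truncated at `level`."""
    return [truncate_mesh(WORD_TO_MESH[w], level) for w in words]


if __name__ == "__main__":
    print(generalize(["headache", "pain"], 2))
    print(generalize(["headache", "pain"], 1))
```

Choosing the truncation level trades specificity against the amount of data available per class.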

Graphical Model

Joint probability distribution over relation, roles and features nodes

Parameters estimated with maximum likelihood and absolute discounting smoothing

$$P(\mathit{Rela}, \mathit{Role}_0, \ldots, \mathit{Role}_T, f_{10}, \ldots, f_{nT}) = P(\mathit{Rela})\, P(\mathit{Role}_0 \mid \mathit{Rela}) \prod_{t=1}^{T} P(\mathit{Role}_t \mid \mathit{Role}_{t-1}, \mathit{Rela}) \prod_{t=0}^{T} \prod_{j=1}^{n} P(f_{jt} \mid \mathit{Role}_t, \mathit{Rela})$$


Relation extraction

Results in terms of classification accuracy (with and without irrelevant sentences)

2 cases:
  Roles given
  Roles hidden (only features)

$$\widehat{\mathit{Rela}} = \arg\max_{\mathit{Rela}_k} P(\mathit{Rela}_k, \mathit{Role}_0, \ldots, \mathit{Role}_T, f_{10}, \ldots, f_{nT})$$

Relation classification: Results

Good results for a difficult task

One of the few systems to tackle several DIFFERENT relations between the same types of entities; thus differs from the problem statement of other work on relations

Accuracy:

Sentences       Input         Base.   GM D2
Only rel.       only feat.    46.7    72.6
                roles given           76.6
Rel. + irrel.   only feat.    50.6    74.9
                roles given           82.0

Role Extraction: Results

Junction tree algorithm

F-measure = (2*Prec*Recall)/(Prec + Recall)

(Related work extracting “diseases” and “genes” reports F-measure of 0.50)

Sentences       F-measure
Only rel.       0.73
Rel. + irrel.   0.71

Features impact: Role extraction

Most important features: 1) Word 2) MeSH

F-measure:

               Rel. + irrel.     Only rel.
All features   0.71              0.73
No word        0.61 (-14.1%)     0.66 (-9.6%)
No MeSH        0.65 (-8.4%)      0.69 (-5.5%)

Features impact: Relation classification

Most important features: Roles

Accuracy (rel. + irrel.):

All feat. + roles           82.0
All feat. – roles           74.9   -8.7%
All feat. + roles – Word    79.8   -2.8%
All feat. + roles – MeSH    84.6   +3.1%

Features impact: Relation classification

Most realistic case: Roles not known

Most important features: 1) Word 2) MeSH

Accuracy (rel. + irrel.):

All feat. – roles           74.9
All feat. – roles – Word    66.1   -11.8%
All feat. – roles – MeSH    72.5   -3.2%
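
The percentages in these feature-impact tables appear to be relative changes with respect to the all-features row; that reading is an assumption, since the slides do not define them. The sketch below recomputes the two values for the roles-hidden case from the rounded accuracies shown above. The recomputed figures agree with the slide to within about 0.1 percentage point, which would be consistent with the originals having been computed from unrounded scores.

```python
def relative_change(baseline, value):
    """Relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


if __name__ == "__main__":
    all_features = 74.9  # All feat. - roles
    for name, accuracy in [("no Word", 66.1), ("no MeSH", 72.5)]:
        print(name, relative_change(all_features, accuracy))
```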

Conclusions

Classification of subtle semantic relations in bioscience text

Graphical models for the simultaneous extraction of entities and relationships

Importance of MeSH, lexical hierarchy

Outline of Talk

Goal: Extract semantics from text

Information and relation extraction

Protein-protein interactions; using an existing database to gather labeled data


Protein-Protein interactions

One of the most important challenges in modern genomics, with many applications throughout biology

There are several protein-protein interaction databases (BIND, MINT,..), all manually curated


Protein-Protein interactions

Supervised systems require manually labeled data, while purely unsupervised systems have yet to be proven effective for these tasks.

Some other approaches: semi-supervised, active learning, co-training.

We propose the use of resources developed in the biomedical domain to address the problem of gathering labeled data for the task of classifying interactions between proteins

HIV-1, Protein Interaction Database

Documents interactions between HIV-1 proteins and
  host cell proteins
  other HIV-1 proteins
  disease associated with HIV/AIDS

2224 pairs of interacting proteins, 65 types

http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions

HIV-1, Protein Interaction Database

Protein 1    Protein 2    Paper ID                Interaction Type
Tat, p14     AKT3         11156964, 11994280..    activates
AIP1         Gag, Pr55    14519844,…              binds
Tat, p14     CDK2         9223324                 induces
Tat, p14     CDK2         7716549                 enhances
Tat, p14     CDK2         9525916                 downregulates
….


Most common interactions

Protein-Protein interactions

Idea: use this to “label data”

Protein 1    Protein 2    Interaction    Paper ID
Tat, p14     AKT3         activates      11156964

Extract from the paper all the sentences with Protein 1 and Protein 2

Label them with the interaction given in the database (here: activates)
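
The labeling idea can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function name `label_sentences`, and the placeholder sentences are all hypothetical, and whole-word matching on name variants is a simplification of whatever protein-name matching the actual pipeline used.

```python
import re


def mentions(sentence, names):
    """True if any name variant occurs in the sentence as a whole word."""
    return any(
        re.search(r"\b" + re.escape(name) + r"\b", sentence, re.IGNORECASE)
        for name in names
    )


def label_sentences(sentences, record):
    """Label every sentence that mentions both proteins with the
    interaction type stored in the database record."""
    return [
        (sentence, record["interaction"])
        for sentence in sentences
        if mentions(sentence, record["protein1"])
        and mentions(sentence, record["protein2"])
    ]


# Database row from the slide, in a hypothetical dictionary layout.
RECORD = {
    "protein1": ["Tat", "p14"],
    "protein2": ["AKT3"],
    "interaction": "activates",
    "paper_id": "11156964",
}

# Placeholder sentences standing in for the text of the paper.
SENTENCES = [
    "Placeholder sentence mentioning Tat together with AKT3.",
    "Placeholder sentence mentioning only AKT3.",
    "Placeholder sentence with no protein names.",
]

if __name__ == "__main__":
    for sentence, label in label_sentences(SENTENCES, RECORD):
        print(label, "|", sentence)
```

Labels obtained this way are noisy: a sentence can mention both proteins without expressing the recorded interaction.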

Protein-Protein interactions

Use citations

Find all the papers that cite the papers in the database

Protein 1    Protein 2    Interaction    Paper ID
Tat, p14     AKT3         activates      11156964

Citing papers: ID 9918876, ID 9971769

From the papers, extract the citation sentences; from these extract the sentences with Protein 1 and Protein 2

Label them (here: activates)

Examples of sentences

Papers:
  The interpretation of these results was slightly complicated by the fact that AIP-1/ALIX depletion by using siRNA likely had deleterious effects on cell viability , because a Western blot analysis showed slightly reduced Gag expression at later time points (fig. 5C ).

Citations:
  They also demonstrate that the GAG protein from membrane - containing viruses , such as HIV , binds to Alix / AIP1 , thereby recruiting the ESCRT machinery to allow budding of the virus from the cell surface (TARGET_CITATION; CITATION ) .


10 Interaction types

Protein-Protein interactions

Tasks: Given sentences from Paper ID, and/or citation sentences to ID:
  Predict the interaction type given in the HIV database for Paper ID
  Extract the proteins involved

10-way classification problem

Protein-Protein interactions

Models:
  Dynamic graphical model
  Naïve Bayes


Graphical Models

Evaluation

Evaluation at document level:
  All (sentences from papers + citations)
  Papers (only sentences from papers)
  Citations (only citation sentences)

“Trigger word” approach:
  List of keywords (e.g., for inhibits: “inhibitor”, “inhibition”, “inhibit”, etc.)
  If a keyword is present: assign the corresponding interaction
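
A trigger-word baseline of this kind could be sketched as below. Only the "inhibits" keyword list comes from the slide; the "binds" keywords, the substring matching, the majority vote over sentences and the arbitrary handling of ties are assumptions made to obtain a runnable example.

```python
from collections import Counter

TRIGGERS = {
    "inhibits": ["inhibitor", "inhibition", "inhibit"],  # from the slide
    "binds": ["binds", "binding", "bound"],              # hypothetical list
}


def trigger_word_label(sentences, triggers=TRIGGERS):
    """Assign the interaction whose keywords fire in the most sentences.

    Returns None when no keyword is found in any sentence.
    """
    hits = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for interaction, keywords in triggers.items():
            if any(keyword in lowered for keyword in keywords):
                hits[interaction] += 1
    return hits.most_common(1)[0][0] if hits else None


if __name__ == "__main__":
    document = [
        "Placeholder: protein X is an inhibitor of protein Y.",
        "Placeholder: binding of X to Y was observed.",
        "Placeholder: inhibition of Y by X.",
    ]
    print(trigger_word_label(document))
```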

Results

Accuracies on interaction classification (roles hidden)

Model                All     Papers   Citations
Markov Model         60.5    57.8     53.4
Naïve Bayes          58.1    57.8     55.7

Baselines
Most freq. inter.    21.8    11.1     26.1
TriggerW             20.1    24.4     20.4
TriggerW + BO        25.8    40.0     26.1


Results: confusion matrix

For All. Overall accuracy: 60.5%

Hiding the protein names

Replaced protein names with tokens PROT_NAME

  “Selective CXCR4 antagonism by Tat” becomes “Selective PROT_NAME antagonism by PROT_NAME”
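
Masking protein names with a placeholder token can be done with a single regular-expression substitution. The helper below is a sketch: its name and signature are hypothetical, and it assumes the list of protein names for the sentence is already known.

```python
import re


def hide_protein_names(sentence, protein_names, token="PROT_NAME"):
    """Replace every whole-word occurrence of a protein name with `token`."""
    if not protein_names:
        return sentence
    # Longest names first, so a long name is not partially replaced
    # through a shorter name that it contains.
    names = sorted(protein_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b"
    )
    return pattern.sub(token, sentence)


if __name__ == "__main__":
    print(hide_protein_names("Selective CXCR4 antagonism by Tat", ["CXCR4", "Tat"]))
```

Comparing accuracy with and without this masking indicates how much a classifier relies on the identity of the proteins rather than on the surrounding wording.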

Results with no protein names

Model           Papers            Citations
Markov Model    44.4 (-23.1%)     52.3 (-2.0%)
Naïve Bayes     46.7 (-19.2%)     53.4 (-4.1%)

Protein extraction

(Protein name tagging, role extraction)

The identification of all the proteins present in the sentence that are involved in the interaction

  These results suggest that Tat - induced phosphorylation of serine 5 by CDK9 might be important after transcription has reached the +36 position, at which time CDK7 has been released from the complex.

  Tat might regulate the phosphorylation of the RNA polymerase II carboxyl - terminal domain in pre - initiation complexes by activating CDK7

Protein extraction: results

             Recall   Precision   F-measure
All          0.74     0.85        0.79
Papers       0.56     0.83        0.67
Citations    0.75     0.84        0.79

No dictionary used
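
The F-measure column can be checked against the recall and precision columns using the formula given earlier in the talk, F = (2*Prec*Recall)/(Prec + Recall). The function name below is only illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rows = {
        "All": (0.85, 0.74),
        "Papers": (0.83, 0.56),
        "Citations": (0.84, 0.75),
    }
    for name, (precision, recall) in rows.items():
        print(name, round(f_measure(precision, recall), 2))
```

Rounded to two decimals, the three rows reproduce the reported values 0.79, 0.67 and 0.79.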


Conclusions of protein-protein interaction project

Encouraging results for the automatic classification of protein-protein interactions

Use of an existing database for gathering labeled data

Use of citations

Noun compounds (NCs)

Any sequence of nouns that itself functions as a noun
  asthma hospitalizations
  asthma hospitalization rates
  health care personnel hand wash

Technical text is rich with NCs
  Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment.

NCs: 3 computational tasks

Identification

Syntactic analysis (attachments)
  [Baseline [headache frequency]]
  [[Tension headache] patient]

Semantic analysis
  Headache treatment: treatment for headache
  Corticosteroid treatment: treatment that uses corticosteroid


Two approaches

Treat it as a classification problem (and use a machine learning algorithm)

Linguistically motivated: consider the “semantics” of the nouns which will determine the relations between them

Second approach

Linguistic Motivation

Head noun has argument structure

Meaning of the head noun determines what kinds of things can be done to it, what it is made of, what it is a part of…

Linguistic Motivation

Material + Cutlery: Made of
  steel knife, plastic fork, wooden spoon

Food + Cutlery: Used on
  meat knife, dessert spoon, salad fork

Profession + Cutlery: Used by
  chef's knife, butcher's knife

Linguistic Motivation

Hypothesis: A particular semantic relation holds between all 2-word NCs that can be categorized by a MeSH pair.

Use the classes of MeSH to identify semantic relations

Grouping the NCs

A02 C04 (Musculoskeletal System, Neoplasms)
  skull tumors, bone cysts, bone metastases, skull osteosarcoma…

B06 B06 (Plants, Plants)
  eucalyptus trees, apple fruits, rice grains, potato plants

A01 M01 (Body Regions, Persons)
  shoulder patient, eye physician, eye donor
  Too different: need to be more specific: go down the hierarchy

A01 M01.643 (Body Regions, Patients)
  shoulder patient

A01 M01.526 (Body Regions, Occupational Groups)
  eye physician, chest physicians

Classification Decisions + Relations

A02 C04: Location of Disease
B06 B06: Kind of Plants
C04 M01
  C04 M01.643: Person afflicted by Disease
  C04 M01.526: Person who treats Disease
A01 H01
  A01 H01.770
  A01 H01.671
    A01 H01.671.538
    A01 H01.671.868
A01 M01
  A01 M01.643: Person afflicted by Disease
  A01 M01.526: Specialist of
  A01 M01.898: Donor of
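
One way to apply such a table of classification decisions is to look up the pair of MeSH tree numbers of a two-word NC, backing off to shorter prefixes until a rule matches. The sketch below is an assumption about how the lookup could work, not the procedure from the talk: the rule table contains only the labeled pairs listed above, `M01.526.485` is a made-up descendant used for the demonstration, and the search order (modifier prefixes first, then head prefixes) is an arbitrary choice.

```python
# Labeled MeSH pairs from the slide; unlabeled intermediate pairs are omitted.
RULES = {
    ("A02", "C04"): "Location of Disease",
    ("B06", "B06"): "Kind of Plants",
    ("C04", "M01.643"): "Person afflicted by Disease",
    ("C04", "M01.526"): "Person who treats Disease",
    ("A01", "M01.643"): "Person afflicted by Disease",
    ("A01", "M01.526"): "Specialist of",
    ("A01", "M01.898"): "Donor of",
}


def prefixes(tree_number):
    """All prefixes of a MeSH tree number, most specific first."""
    parts = tree_number.split(".")
    return [".".join(parts[:k]) for k in range(len(parts), 0, -1)]


def classify_nc(modifier_code, head_code, rules=RULES):
    """Return the relation of the first matching (modifier, head) prefix pair,
    or None when the pair is not covered at any level."""
    for m in prefixes(modifier_code):
        for h in prefixes(head_code):
            if (m, h) in rules:
                return rules[(m, h)]
    return None


if __name__ == "__main__":
    # A01.456 is Head; M01.526.485 is a hypothetical occupational-group code.
    print(classify_nc("A01.456", "M01.526.485"))
    print(classify_nc("A01", "M01"))
```

A pair such as (A01, M01) returns None because, as the grouping shows, it is too general to carry a single relation.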

Evaluation

Accuracy:
  Anatomy: 91%
  Natural Science: 79%
  Neoplasm: 100%

Total Accuracy: 90.8%

Conclusion of NCs

Problem of assigning semantic relations to two-word technical NCs

Important problem: many NCs in technical text

Especially difficult because of the lack of syntactic clues

State-of-the-art results

One of very few working systems to tackle this task for NCs

Conclusion

Machine Learning methods for NLP tasks

Three lines of research in this area, state-of-the-art results:
  Information and relation extraction for “treatments” and “diseases”
  Protein-protein interactions
  (Noun compounds)


Thank you!

Barbara Rosario

SIMS, UC Berkeley


Future work

  Unsupervised, semi-supervised methods
  Reasoning (knowledge representation and inference procedures)
  Huge amount of textual data (Web)
  Connection between several databases and/or text collections for linking different pieces of information
  System architecture to support multiple layers of annotation on text
  Development of effective interface


Additional slides on IE

Related work

Several DIFFERENT relations between the same types of entities

Thus differs from the problem statement of other work on relations

Many find one relation which holds between two entities (many based on ACE)

Related work (cont.)

Agichtein and Gravano (2000), lexical patterns for location of

Zelenko et al. (2002) SVM for person affiliation and organization-location

Hasegawa et al. (ACL 2004) Person-Organization -> President “relation”

Craven (1999, 2001) HMM for subcellular-location and disorder-association

Doesn’t identify the actual relation

Related work: Bioscience

Many hand-built rules

Feldman et al. (2002)

Friedman et al. (2001)

Pustejovsky et al. (2002)

Saric et al. (2004)

Our D1

Thompson et al. (2003)

Frame classification and role labeling for FrameNet sentences

Target word must be observed

More relations and roles

Smoothing: absolute discounting

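Lower the probability of seen events by subtracting a constant from their count, starting from the ML estimate:

$$P_{ML}(e) = \frac{c(e)}{\sum_{e'} c(e')}$$

The remaining probability mass is divided evenly among the unseen events:

$$P_{ad}(e) = \begin{cases} P_{ML}(e) - \alpha & \text{if } P_{ML}(e) > 0 \\ \alpha \, \dfrac{c(\text{seen events})}{c(\text{unseen events})} & \text{if } P_{ML}(e) = 0 \end{cases}$$

where $\alpha$ is the discounting constant, and $c(\text{seen events})$ and $c(\text{unseen events})$ are the numbers of seen and unseen events.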
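As an illustration of absolute discounting as described on this slide, here is a minimal Python sketch. It assumes the constant (called `alpha` here, a name not taken from the slides) is subtracted from the ML probability of every seen event and that the freed mass is shared equally among the unseen events. The function name, the `vocabulary` argument, and the toy counts are hypothetical and not part of the original system.

```python
def absolute_discounting(counts, vocabulary, alpha):
    """Return smoothed probabilities for every event in `vocabulary`.

    counts     -- dict mapping event -> observed count (assumed input format)
    vocabulary -- all possible events, seen or unseen
    alpha      -- discounting constant (hypothetical name)
    """
    total = sum(counts.get(e, 0) for e in vocabulary)
    if total == 0:
        raise ValueError("no observed events to estimate from")

    seen = [e for e in vocabulary if counts.get(e, 0) > 0]
    unseen = [e for e in vocabulary if counts.get(e, 0) == 0]

    # Maximum-likelihood estimate for the seen events.
    p_ml = {e: counts[e] / total for e in seen}
    if not unseen:
        # Nothing to redistribute to: keep the ML estimate unchanged.
        return p_ml
    if any(p <= alpha for p in p_ml.values()):
        raise ValueError("alpha must be smaller than every seen ML probability")

    # Mass removed from seen events, shared equally among unseen events.
    unseen_share = alpha * len(seen) / len(unseen)
    smoothed = {e: p_ml[e] - alpha for e in seen}
    smoothed.update({e: unseen_share for e in unseen})
    return smoothed


if __name__ == "__main__":
    toy_counts = {"a": 3, "b": 1}  # illustrative data only
    print(absolute_discounting(toy_counts, ["a", "b", "c", "d"], alpha=0.1))
```

The smoothed values still sum to one, because the total mass removed from the seen events equals the total mass given to the unseen ones.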

F-measures for role extraction as a function of the smoothing factors

Relation accuracies as a function of the smoothing factors

Relation classification: Confusion Matrix

Computed for “rel + irrel.”, “only features”

Proteins: sentence-level evaluation

Total accuracy: 38.9% (49.4% without “interacts with”)

Learning with the hand-labeled sentences

Model               All    Papers   Citations

Markov Model        48.9   28.9     47.9

Naïve Bayes         47.1   33.3     53.4

Baselines

Most freq. inter.   36.3   34.4     37.6

TriggerW            30.5   18.9     38.3

TriggerW + BO       46.2   36.6     52.6

Learning with the hand-labeled sentences

Q & A system

Q: What are the treatments of cervical carcinoma?

A: Stage Ib and IIa cervical carcinoma can be cured by radical surgery or radiotherapy

Q & A system (cont.)

Q: What are the methods of administration of headache treatment?

A: intranasal migraine treatment

Evaluation

Mj: For each triple, and for each sentence of the triple, find the interaction that maximizes the posterior probability of the interaction given the features; then assign to all sentences of this triple the most frequent interaction among those predicted for the individual sentences.

Mj*: Same as Mj, except that if the predicted interaction is the generic “interacts with”, choose instead the next most frequent interaction (retain “interacts with” only if it is the only interaction predicted).
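The Mj and Mj* voting rules can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: it assumes each sentence of a triple comes with a dictionary of posterior probabilities over interactions, and the function name, the `skip_generic` flag, the tie-breaking rule (first label encountered wins), and the example labels are all hypothetical.

```python
from collections import Counter

GENERIC = "interacts with"  # the generic interaction named on the slide


def predict_mj(sentence_posteriors, skip_generic=False):
    """Majority vote over per-sentence predictions for one triple.

    sentence_posteriors -- list of dicts, one per sentence, mapping
                           interaction -> P(interaction | features)
    skip_generic        -- False gives Mj, True gives Mj*
    """
    # Step 1: pick the most probable interaction for each sentence.
    per_sentence = [max(p, key=p.get) for p in sentence_posteriors]

    # Step 2: rank the predicted interactions by how often they were chosen.
    # Ties keep first-seen order, which is an arbitrary assumption here.
    ranked = [label for label, _ in Counter(per_sentence).most_common()]

    # Mj*: fall back to the next most frequent interaction when the winner
    # is the generic one, unless it is the only interaction predicted.
    if skip_generic and ranked[0] == GENERIC and len(ranked) > 1:
        return ranked[1]
    return ranked[0]


if __name__ == "__main__":
    triple = [  # illustrative posteriors, not real data
        {"interacts with": 0.6, "binds": 0.4},
        {"interacts with": 0.5, "binds": 0.3, "degrades": 0.2},
        {"binds": 0.7, "interacts with": 0.3},
    ]
    print(predict_mj(triple))
    print(predict_mj(triple, skip_generic=True))
```

The returned label is then assigned to every sentence of the triple, as the slide describes.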

Evaluation (cont.)

Cf: Retain all the conditional probabilities (i.e., don't first choose an interaction per sentence), then for each triple choose the interaction that maximizes the sum of these probabilities over all the sentences of the triple.

Cf*: Same as Cf, but replacing the generic “interacts with” with the next most confident interaction.
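A comparable sketch for Cf and Cf* follows, under the same assumptions: one dictionary of conditional probabilities per sentence, with hypothetical function and parameter names and illustrative labels. Unlike the majority vote, no per-sentence decision is made; the probabilities are summed per interaction across the sentences of the triple.

```python
from collections import defaultdict

GENERIC = "interacts with"  # the generic interaction named on the slide


def predict_cf(sentence_posteriors, skip_generic=False):
    """Choose the interaction with the largest summed probability.

    sentence_posteriors -- list of dicts, one per sentence, mapping
                           interaction -> P(interaction | features)
    skip_generic        -- False gives Cf, True gives Cf*
    """
    # Sum the conditional probabilities of each interaction over sentences.
    totals = defaultdict(float)
    for posterior in sentence_posteriors:
        for label, prob in posterior.items():
            totals[label] += prob

    # Rank interactions from most to least confident.
    ranked = sorted(totals, key=totals.get, reverse=True)

    # Cf*: replace the generic interaction by the next most confident one,
    # when another candidate exists.
    if skip_generic and ranked[0] == GENERIC and len(ranked) > 1:
        return ranked[1]
    return ranked[0]


if __name__ == "__main__":
    triple = [  # illustrative posteriors, not real data
        {"interacts with": 0.6, "binds": 0.3, "degrades": 0.1},
        {"interacts with": 0.5, "binds": 0.2, "degrades": 0.3},
    ]
    print(predict_cf(triple))
    print(predict_cf(triple, skip_generic=True))
```

Summing keeps the information from low-confidence sentences, which a hard per-sentence choice throws away.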