Classifying Semantic Relations in Bioscience Texts
Barbara Rosario, Marti Hearst
SIMS, UC Berkeley, http://biotext.berkeley.edu
Supported by NSF DBI-0317510 and a gift from Genentech

Page 1:

Classifying Semantic Relations in Bioscience Texts

Barbara Rosario, Marti Hearst

SIMS, UC Berkeley, http://biotext.berkeley.edu

Supported by NSF DBI-0317510 and a gift from Genentech

Page 2:

Problem: which relations hold between 2 entities?

Treatment -> Disease: Cure? Prevent? Side Effect?

Page 3:

Hepatitis Examples

Cure: "These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135."

Prevent: "A two-dose combined hepatitis A and B vaccine would facilitate immunization programs."

Vague: "Effect of interferon on hepatitis B"

Page 4:

Two tasks

Relationship extraction: identify the several semantic relations that can occur between the entities disease and treatment in bioscience text.

Entity extraction (a related problem): identify such entities.

Page 5:

The Approach

Data: MEDLINE abstracts and titles

Graphical models: combine relation and entity extraction in one framework; both static and dynamic models

Simple discriminative approach: neural network

Lexical, syntactic and semantic features

Page 6:

Outline

Related work
Data and semantic relations
Features
Models and results
Conclusions

Page 7:

Several DIFFERENT Relations between the Same Types of Entities

Thus differs from the problem statement of other work on relations. Many find one relation which holds between two entities (many based on ACE):

Agichtein and Gravano (2000): lexical patterns for location-of
Zelenko et al. (2002): SVM for person-affiliation and organization-location
Hasegawa et al. (ACL 2004): Person-Organization -> "President" relation
Craven (1999, 2001): HMM for subcellular-location and disorder-association; doesn't identify the actual relation

Page 8:

Related work: Bioscience

Many hand-built rules: Feldman et al. (2002), Friedman et al. (2001), Pustejovsky et al. (2002), Saric et al. (this conference)

Page 9:

Data and Relations

MEDLINE, abstracts and titles; 3662 sentences labeled

Relevant: 1724; Irrelevant: 1771 (e.g., "Patients were followed up for 6 months")

2 types of entities, many instances: treatment and disease

7 relationships between these entities

The labeled data is available at http://biotext.berkeley.edu

Page 10:

Semantic Relationships

810 Cure: "Intravenous immune globulin for recurrent spontaneous abortion"

616 Only Disease: "Social ties and susceptibility to the common cold"

166 Only Treatment: "Fluticasone propionate is safe in recommended doses"

63 Prevent: "Statins for prevention of stroke"

Page 11:

Semantic Relationships

36 Vague: "Phenylbutazone and leukemia"

29 Side Effect: "Malignant mesodermal mixed tumor of the uterus following irradiation"

4 Does NOT cure: "Evidence for double resistance to permethrin and malathion in head lice"

Page 12:

Features

Word
Part of speech
Phrase constituent
Orthographic features: 'is number', 'all letters are capitalized', 'first letter is capitalized', ...
MeSH (semantic features): replace words, or sequences of words, with generalizations via MeSH categories, e.g., Peritoneum -> Abdomen
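The orthographic checks listed above could be sketched as a small helper; this is an illustrative Python function of ours, not the authors' code, and the feature names are our own labels.

```python
# Hypothetical sketch of the orthographic feature checks named on this slide:
# each token is mapped to a small set of surface properties.
def orthographic_features(token: str) -> dict:
    return {
        "is_number": token.replace(".", "", 1).isdigit(),
        "all_caps": token.isalpha() and token.isupper(),
        "init_cap": token[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_hyphen": "-" in token,
    }

print(orthographic_features("TJ-135"))
```

A token like "TJ-135" would thus fire the initial-capital, digit, and hyphen features at once.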

Page 13:

Features (cont.): MeSH

MeSH Tree Structures:
1. Anatomy [A]
2. Organisms [B]
3. Diseases [C]
4. Chemicals and Drugs [D]
5. Analytical, Diagnostic and Therapeutic Techniques and Equipment [E]
6. Psychiatry and Psychology [F]
7. Biological Sciences [G]
8. Physical Sciences [H]
9. Anthropology, Education, Sociology and Social Phenomena [I]
10. Technology and Food and Beverages [J]
11. Humanities [K]
12. Information Science [L]
13. Persons [M]
14. Health Care [N]
15. Geographic Locations [Z]

Page 14:

Features (cont.): MeSH

1. Anatomy [A]
   Body Regions [A01] +
   Musculoskeletal System [A02]
   Digestive System [A03] +
   Respiratory System [A04] +
   Urogenital System [A05] +
   Endocrine System [A06] +
   Cardiovascular System [A07] +
   Nervous System [A08] +
   Sense Organs [A09] +
   Tissues [A10] +
   Cells [A11] +
   Fluids and Secretions [A12] +
   Animal Structures [A13] +
   Stomatognathic System [A14] (...)

Body Regions [A01]
   Abdomen [A01.047]
      Groin [A01.047.365]
      Inguinal Canal [A01.047.412]
      Peritoneum [A01.047.596] +
      Umbilicus [A01.047.849]
   Axilla [A01.133]
   Back [A01.176] +
   Breast [A01.236] +
   Buttocks [A01.258]
   Extremities [A01.378] +
   Head [A01.456] +
   Neck [A01.598] (...)
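Because MeSH tree numbers are dotted codes, generalizing a term to its ancestor amounts to truncating the code. A minimal sketch, assuming the tree codes shown on this slide (the `generalize` helper and the tiny `MESH` table are ours, for illustration only):

```python
# Hypothetical sketch of MeSH-based generalization: truncating a tree number
# by one level replaces a term with its parent category.
MESH = {
    "A01": "Body Regions",
    "A01.047": "Abdomen",
    "A01.047.596": "Peritoneum",
}

def generalize(code: str, levels: int = 1) -> str:
    """Map a MeSH tree code to the name of its ancestor `levels` up."""
    parts = code.split(".")
    ancestor = ".".join(parts[: max(1, len(parts) - levels)])
    return MESH.get(ancestor, ancestor)

print(generalize("A01.047.596"))  # Peritoneum -> Abdomen
```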

Page 15:

Models

2 static generative models
3 dynamic generative models
1 discriminative model (neural networks)

Page 16:

Static Graphical Models

S1: observations dependent on Role but independent from Relation given roles

S2: observations dependent on both Relation and Role

(Diagrams of S1 and S2)

Page 17:

Dynamic Graphical Models

D1, D2: as in S1, S2

D3: only one observation per state is dependent on both the relation and the role

(Diagrams of D1, D2, D3)

Page 18:

Graphical Models

Relation node: semantic relation (cure, prevent, none, ...) expressed in the sentence

Page 19:

Graphical Models

Role nodes: 3 choices: treatment, disease, or none

Page 20:

Graphical Models

Feature nodes (observed): word, POS, MeSH…

Page 21:

Graphical Models

For Dynamic Model D1: joint probability distribution over relation, role, and feature nodes.

Parameters estimated with maximum likelihood and absolute discounting smoothing.

P(Rela, Role_0, ..., Role_T, f_10, ..., f_nT) = P(Rela) P(Role_0 | Rela) prod_{t=1..T} P(Role_t | Role_{t-1}, Rela) prod_{t=0..T} prod_{j=1..n} P(f_jt | Role_t)
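A toy numeric sketch of this factorization may help: the probability tables below are invented for illustration (the paper's parameters were estimated from the labeled data), and for simplicity each token carries a single feature list.

```python
# Toy sketch of the D1-style factorization:
#   P(Rela, Roles, f) = P(Rela) * P(Role_0 | START, Rela)
#                       * prod_t P(Role_t | Role_{t-1}, Rela)
#                       * prod_{t,j} P(f_jt | Role_t)
# All numbers are hypothetical.
import math

p_rela = {"cure": 0.6, "prevent": 0.4}
p_role = {  # P(Role_t | Role_{t-1}, Rela); "START" marks the initial state
    ("cure", "START", "TREAT"): 0.5, ("cure", "TREAT", "DIS"): 0.7,
    ("prevent", "START", "TREAT"): 0.5, ("prevent", "TREAT", "DIS"): 0.3,
}
p_feat = {("TREAT", "vaccine"): 0.2, ("DIS", "hepatitis"): 0.3}

def joint(rela, roles, feats):
    logp = math.log(p_rela[rela])
    prev = "START"
    for role, token_feats in zip(roles, feats):
        logp += math.log(p_role[(rela, prev, role)])
        for f in token_feats:  # one factor per feature of the token
            logp += math.log(p_feat[(role, f)])
        prev = role
    return math.exp(logp)

print(joint("cure", ["TREAT", "DIS"], [["vaccine"], ["hepatitis"]]))
```

With these toy tables, "cure" scores higher than "prevent" for the same role/feature sequence, which is exactly the comparison the argmax on the next slides performs.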

Page 22:

Our D1

Thompson et al. (2003): frame classification and role labeling for FrameNet sentences; target word must be observed.

Our model handles more relations and roles.

Page 23:

Neural Networks

Feed-forward network (MATLAB)
Training with conjugate gradient descent
One hidden layer (hyperbolic tangent function)
Logistic sigmoid function for the output layer representing the relationships
Same features
Discriminative approach
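The network shape described above can be sketched in plain Python. This is a forward pass with random, untrained weights only, purely to show the architecture (the authors trained in MATLAB with conjugate gradient descent; the sizes below are made up):

```python
# Sketch of a feed-forward net with one tanh hidden layer and logistic
# sigmoid outputs, one output unit per relationship. Weights are random,
# not trained; all sizes are hypothetical.
import math
import random

random.seed(0)
n_features, n_hidden, n_relations = 20, 8, 7

W1 = [[random.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_features)]
W2 = [[random.gauss(0, 1) for _ in range(n_relations)] for _ in range(n_hidden)]

def forward(x):
    # hidden layer: hyperbolic tangent activation
    h = [math.tanh(sum(x[i] * W1[i][j] for i in range(n_features)))
         for j in range(n_hidden)]
    # output layer: logistic sigmoid, one score per relationship
    return [1.0 / (1.0 + math.exp(-sum(h[j] * W2[j][k] for j in range(n_hidden))))
            for k in range(n_relations)]

x = [random.gauss(0, 1) for _ in range(n_features)]
scores = forward(x)
print(len(scores))
```

The predicted relation would simply be the output unit with the highest score.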

Page 24:

Relation extraction

Results in terms of classification accuracy (with and without irrelevant sentences)

2 cases: roles hidden; roles given

Graphical models: Rela^ = argmax_k P(Rela_k, Role_0, ..., Role_T, f_10, ..., f_nT)

NN: simple classification problem

Page 25:

Relation classification: Results

Neural Net always best

Page 26:

Relation classification: Results

With no smoothing, D1 best Graphical Model

Page 27:

Relation classification: Results

With Smoothing and No Roles, D2 best GM

Page 28:

Relation classification: Results

With Smoothing and Roles, D1 best GM

Page 29:

Relation classification: Results

Dynamic models always outperform Static

Page 30:

Relation classification: Confusion Matrix

Computed for the model D2, “rel + irrel.”, “only features”
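The matrix itself is a figure on this slide; the toy snippet below only shows how such a matrix is tallied from gold and predicted relation labels (the label sequences are invented, not the paper's data):

```python
# Generic sketch of confusion-matrix tallying: rows are gold relations,
# columns are predicted ones. Toy labels, not the paper's data.
from collections import Counter

gold = ["cure", "cure", "prevent", "vague", "cure"]
pred = ["cure", "prevent", "prevent", "cure", "cure"]

counts = Counter(zip(gold, pred))
labels = sorted(set(gold) | set(pred))
for g in labels:
    print(g, [counts[(g, p)] for p in labels])
```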

Page 31:

Role extraction

Results in terms of F-measure

Graphical models: junction tree algorithm (BNT); relation hidden and marginalized over

NN: couldn't run it (feature vectors too large)

(Graphical models can do role extraction and relationship classification simultaneously)
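The F-measure used here is the standard harmonic mean of precision and recall; the counts in the example are hypothetical, purely to show the arithmetic:

```python
# Standard F-measure over extracted roles: harmonic mean of precision
# and recall. The tp/fp/fn counts below are invented for illustration.
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(f_measure(tp=70, fp=30, fn=30), 2))  # precision = recall = 0.7
```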

Page 32:

Role Extraction: Results

F-measures: D1 best when no smoothing

Page 33:

Role Extraction: Results

F-measures: D2 best with smoothing, but doesn't boost scores as much as in relation classification

Page 34:

Features impact: Role Extraction

Most important features: 1) Word, 2) MeSH

F-measures (rel. + irrel.):
Models        D1      D2
All features  0.67    0.71
No word       0.58    0.61   (-13.4%, -14.1%)
No MeSH       0.63    0.65   (-5.9%, -8.4%)

Page 35:

Features impact: Relation classification

Most important features: roles

Accuracy (rel. + irrel.):
                          D1     D2     NN
All feat. + roles         91.6   82.0   96.9
All feat. - roles         68.9   74.9   79.6   (-24.7%, -8.7%, -17.8%)
All feat. + roles - Word  91.6   79.8   96.4   (0%, -2.8%, -0.5%)
All feat. + roles - MeSH  91.6   84.6   97.3   (0%, +3.1%, +0.4%)

Page 36:

Features impact: Relation classification

Most realistic case: roles not known
Most important features: 1) MeSH, 2) Word for D1 and NN (but vice versa for D2)

Accuracy (rel. + irrel.):
                          D1     D2     NN
All feat. - roles         68.9   74.9   79.6
All feat. - roles - Word  66.7   66.1   76.2   (-3.3%, -11.8%, -4.3%)
All feat. - roles - MeSH  62.7   72.5   74.1   (-9.1%, -3.2%, -6.9%)

Page 37:

Conclusions

Classification of subtle semantic relations in bioscience text
Discriminative model (neural network) achieves high classification accuracy
Graphical models for the simultaneous extraction of entities and relationships
Importance of lexical hierarchy

Future work:
A new collection of disease/treatment data
Different entities/relations
Unsupervised learning to discover relation types

Page 38:

Thank you!

Barbara Rosario, Marti Hearst

SIMS, UC Berkeley, http://biotext.berkeley.edu

Page 39:

Additional slides

Page 40:

Smoothing: absolute discounting

Lower the probability of seen events by subtracting a constant delta from their count; the ML estimate is

P_ML(e) = c(e) / sum_e' c(e')

The remaining probability mass is evenly divided among the unseen events:

P_ad(e) = (c(e) - delta) / sum_e' c(e')                                   if c(e) > 0
P_ad(e) = delta * (# seen events) / ((# unseen events) * sum_e' c(e'))    if c(e) = 0
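A minimal sketch of this smoothing scheme, assuming the reconstruction above (toy counts; the discount `delta` is a free smoothing parameter, not a value from the paper):

```python
# Absolute discounting sketch: each seen event loses `delta` counts, and the
# freed probability mass is split evenly over the unseen events of the
# vocabulary. Counts and delta below are hypothetical.
def absolute_discounting(counts, vocab, delta=0.5):
    total = sum(counts.values())
    seen = [e for e in vocab if counts.get(e, 0) > 0]
    unseen = [e for e in vocab if counts.get(e, 0) == 0]
    probs = {}
    for e in seen:
        probs[e] = (counts[e] - delta) / total
    leftover = delta * len(seen) / total  # mass removed from seen events
    for e in unseen:
        probs[e] = leftover / len(unseen)
    return probs

p = absolute_discounting({"a": 3, "b": 1}, vocab=["a", "b", "c", "d"])
print(p)  # probabilities sum to 1
```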

Page 41:

F-measures for role extraction as a function of the smoothing factor

Page 42:

Relation accuracies as a function of the smoothing factor

Page 43:

Role Extraction: Results

Static models better than Dynamic for ...

Note: no Neural Networks

Page 44:

Features impact: Role Extraction

Most important features: 1) Word, 2) MeSH

F-measures (rel. + irrel.):
Models        D1      D2      Average
All features  0.67    0.71
No word       0.58    0.61    (-13.4%, -14.1%, avg -13.7%)
No MeSH       0.63    0.65    (-5.9%, -8.4%, avg -7.2%)

Page 45:

Features impact: Role extraction

Most important features: 1) Word, 2) MeSH

F-measures (only rel.):
              D1      D2      Average
All features  0.72    0.73
No word       0.65    0.66    (-9.7%, -9.6%, avg -9.6%)
No MeSH       0.69    0.69    (-4.2%, -5.5%, avg -4.8%)

Page 46:

Features impact: Role extraction

Most important features: 1) Word, 2) MeSH

F-measures (only rel.):
              D1      D2
All features  0.72    0.73
No word       0.65    0.66    (-9.7%, -9.6%)
No MeSH       0.69    0.69    (-4.2%, -5.5%)

Page 47:

Features impact: Relation classification

Most important features: roles

Accuracy (rel. + irrel.):
                          D1     D2     NN     Avg.
All feat. + roles         91.6   82.0   96.9
All feat. - roles         68.9   74.9   79.6   (-24.7%, -8.7%, -17.8%, avg -17.1%)
All feat. + roles - Word  91.6   79.8   96.4   (0%, -2.8%, -0.5%, avg -1.1%)
All feat. + roles - MeSH  91.6   84.6   97.3   (0%, +3.1%, +0.4%, avg +1.1%)

Page 48:

Features impact: Relation classification

Most realistic case: roles not known
Most important features: 1) MeSH, 2) Word for D1 and NN (but vice versa for D2)

Accuracy (rel. + irrel.):
                          D1     D2     NN     Avg.
All feat. - roles         68.9   74.9   79.6
All feat. - roles - Word  66.7   66.1   76.2   (-3.3%, -11.8%, -4.3%, avg -6.4%)
All feat. - roles - MeSH  62.7   72.5   74.1   (-9.1%, -3.2%, -6.9%, avg -6.4%)