Source: people.csail.mit.edu/dsontag/courses/ml13/slides/lecture... · 2013-12-14

Applications of Machine Learning in Computational Biology

Narges Razavian

New York University

Slides thanks to James Galagan (Broad Institute), Su-In Lee (Univ. of Washington), Rainer Breitling (Univ. of Glasgow), Christopher M. Bishop (ECCV 2004)

Central Dogma of Biology

Examples of Challenges involved

Slide Credit: Manolis Kellis

Application : Decoding Sequences and Motif Discovery

Motif Discovery

GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT

TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT

GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA

AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC

GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC

CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC

TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC

GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC

GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG

ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG

TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG

TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG

GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC

ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT

GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG

GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC

GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT

ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG

GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC

GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT

CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

Sequence Annotation

Gene

Promoter Motif

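A Generative Model

Hidden states: Background (B) and Island (P)

Transition probabilities: B→B 0.85, B→P 0.15, P→P 0.75, P→B 0.25

Emission probabilities, Background: A: 0.25, T: 0.25, G: 0.25, C: 0.25

Emission probabilities, Island: A: 0.15, T: 0.13, G: 0.30, C: 0.42

Example sequence:

TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

[Figure: each position of the example sequence carries a hidden label, P or B]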

A Generative Model (cont.)

S: C A A A T G C G

[Figure: candidate label paths over S, each position labeled P or B]

Emission probabilities:

      P(S|P)   P(S|B)
A:    0.42     0.25
T:    0.30     0.25
G:    0.13     0.25
C:    0.15     0.25

Transition probabilities P(Li+1|Li):

      Bi+1   Pi+1
Bi    0.85   0.15
Pi    0.25   0.75
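
To make the generative model concrete, the sketch below scores one sequence together with one label path, multiplying a start probability, the transition probabilities P(Li+1|Li), and the emission probabilities P(S|P) and P(S|B) from the tables above. The uniform start distribution `START` and the function name `joint_log_prob` are assumptions made for illustration; the slides do not specify how the first label is chosen.

```python
import math

# Transition and emission tables copied from the slide.
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}
# Assumption: the slides give no start distribution, so use a uniform one.
START = {"B": 0.5, "P": 0.5}


def joint_log_prob(seq, labels):
    """Return log P(seq, labels) under the two-state HMM."""
    if len(seq) != len(labels) or not seq:
        raise ValueError("seq and labels must be non-empty and equally long")
    log_p = math.log(START[labels[0]]) + math.log(EMIT[labels[0]][seq[0]])
    for i in range(1, len(seq)):
        log_p += math.log(TRANS[labels[i - 1]][labels[i]])  # label transition
        log_p += math.log(EMIT[labels[i]][seq[i]])          # base emission
    return log_p


if __name__ == "__main__":
    s = "CAAATGCG"  # the sequence S from the slide
    for path in ("BBBBBBBB", "PPPPPPPP", "BPPPPBBB"):  # hypothetical label paths
        print(path, joint_log_prob(s, path))
```

Comparing the printed values shows how different label paths explain the same sequence with different probability, which is the quantity that decoding maximizes.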

Fundamental HMM Operations

Decoding
• Given an HMM and a sequence S
• Find a corresponding sequence of labels, L
• Computational biology: annotate pathogenicity islands on a new sequence

Evaluation
• Given an HMM and a sequence S
• Find P(S|HMM)
• Computational biology: score a particular sequence (not as useful for this model; will come back to this later)

Training
• Given an HMM without parameters and a set of sequences S
• Find the transition and emission probabilities that maximize P(S | params, HMM)
• Computational biology: learn a model for sequences composed of background DNA and pathogenicity islands
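
The decoding and evaluation operations can be sketched with the standard Viterbi and forward algorithms. The slides do not name the algorithms at this point, so treat this as one common way to carry out the two operations rather than the lecture's own code. The parameters are the two-state model from the earlier slide; the uniform `START` distribution is an assumption.

```python
import math

STATES = ("B", "P")
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}
START = {"B": 0.5, "P": 0.5}  # assumption: not given in the slides


def viterbi(seq):
    """Decoding: return the most probable label path for seq."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for x in seq[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            new_score[s] = (score[best] + math.log(TRANS[best][s])
                            + math.log(EMIT[s][x]))
            ptr[s] = best
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


def forward(seq):
    """Evaluation: return P(seq | HMM), summing over all label paths."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for x in seq[1:]:
        f = {s: EMIT[s][x] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())


if __name__ == "__main__":
    s = "CAAATGCG"
    print("labels:", viterbi(s))
    print("P(S|HMM):", forward(s))
```

For long sequences the forward recursion should be run in log space or with per-step scaling to avoid underflow. Training (for example with Baum-Welch) is omitted here to keep the sketch short.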

Application: Modeling Protein Families

Modeling Protein Families

• Given amino acid sequences from a protein family, how can we find other members?
  – Can search databases with each known member: not sensitive
  – More information is contained in the full set

• The HMM Profile Approach
  – Learn the statistical features of the protein family
  – Model these features with an HMM
  – Search for new members by scoring with the HMM

UBE2D2 FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK

UBE2D3 FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK

BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK

UBE2D1 FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK

UBE2E1 FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK

UBCH9 FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK

UBE2N LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT

AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT

UBCH10 FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT

CDC34 FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT

BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT

UBE2G1 FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET

UBE2B FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS

UBE2I FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ

E2EPF5 LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH

UBE2L1 FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ

UBE2L6 FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ

UBE2H LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN

UBC12 VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

Human Ubiquitin Conjugating Enzymes

Profile HMM

[Figure: profile HMM architecture. A chain of match states M1, ..., Mj, ..., MN runs from Start to End, with insert states I, I1, ..., Ij, ..., IN and delete states D1, ..., Dj, ..., DN alongside. Amino acid emission tables are shown for the states, and the sequences E2EPF5, UBE2L1, UBE2L6 and UBE2H are shown aligned to the model.]

Using Profile HMMs

Decoding
• Find the sequence of labels, L, that maximizes P(L|S, HMM)
• Computational biology: align a new sequence to a protein family

Evaluation
• Find P(S|HMM)
• Computational biology: score a sequence for membership in the family

Training
• Find the transition and emission probabilities that maximize P(S | params, HMM)
• Computational biology: discover and model family structure
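
Scoring a sequence for family membership can be illustrated with a deliberately reduced model: match states only, with no insert or delete states, which amounts to a position-specific scoring matrix rather than a full profile HMM. The sketch estimates per-column emission probabilities from an ungapped seven-column slice of the ubiquitin conjugating enzyme alignment shown earlier (rows UBE2D2, UBE2E1, UBE2N, CDC34, UBE2B) and scores windows by log-odds against a uniform background. The add-one pseudocount, the uniform background, and all function names are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Alignment columns 18-24 of five sequences from the slide (no gaps here).
ALIGNED = ["IYHPNIN", "IYHCNIN", "IYHPNVD", "MWHPNIY", "MFHPNVY"]


def fit_match_emissions(aligned, pseudocount=1.0):
    """Estimate one emission distribution per alignment column."""
    model = []
    for j in range(len(aligned[0])):
        counts = {a: pseudocount for a in AMINO_ACIDS}
        for seq in aligned:
            counts[seq[j]] += 1.0
        total = sum(counts.values())
        model.append({a: c / total for a, c in counts.items()})
    return model


def log_odds(model, window):
    """Log-odds of a window under the match states vs. a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    return sum(math.log(col[x] / background) for col, x in zip(model, window))


def best_window(model, seq):
    """Return (score, start) of the best-scoring window in seq."""
    n = len(model)
    return max((log_odds(model, seq[i:i + n]), i)
               for i in range(len(seq) - n + 1))


if __name__ == "__main__":
    model = fit_match_emissions(ALIGNED)
    print(best_window(model, "FPAEYPFKPPKITFKTKIYHPNIDEK"))  # UBE2L1 prefix
```

A full profile HMM adds insert and delete states so that sequences with gaps, such as the CDC34 row with its long insertion, can still be aligned and scored with the forward or Viterbi recursions.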

Application: Modeling Protein Dynamics

Background

• Proteins: molecular machines, composed of a sequence of amino acid subunits

• Protein functional analysis pipeline
  – Crystallize to get an X-ray snapshot
  – Molecular dynamics simulations
  – Learn a probabilistic model
  – Analyze and predict

Image: H. Khanlou et al., “Durable Efficacy and Continued Safety of Ibalizumab in Treatment-Experienced Patients”, Infectious Diseases Society of America (IDSA), October 2011

Modeling Protein Tertiary Structure

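10-Second Reminder: Probability Theory

• Sum rule: $p(X) = \sum_{Y} p(X, Y)$

• Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$

• From these we have Bayes’ theorem: $p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}$
  – with normalization $p(X) = \sum_{Y} p(X \mid Y)\, p(Y)$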

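10-Second Reminder (cont.): Decomposition

• Consider an arbitrary joint distribution $p(x_1, \ldots, x_K)$

• By successive application of the product rule: $p(x_1, \ldots, x_K) = p(x_K \mid x_1, \ldots, x_{K-1}) \cdots p(x_2 \mid x_1)\, p(x_1)$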

Directed Acyclic Graphs

• Joint distribution $p(x_1, \ldots, x_K) = \prod_{i=1}^{K} p(x_i \mid \mathrm{pa}_i)$, where $\mathrm{pa}_i$ denotes the parents of i

• No directed cycles

Undirected Graphs

• Provided $p(x) > 0$, the joint distribution is a product of non-negative functions over the cliques of the graph: $p(x) = \frac{1}{Z} \prod_{C} \psi_C(x_C)$, where $\psi_C(x_C)$ are the clique potentials and Z is a normalization constant

Undirected Graphical Models

• Pairwise Undirected graphical models (single and bivariate potentials only)

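Markov Random Field as a Factor Graph:

$P(X) = \frac{\prod_{i=1}^{n} f_i(X_i) \prod_{e_{ij} \in E} f_{ij}(X_i, X_j)}{\int \cdots \int \prod_{i=1}^{n} f_i(X_i) \prod_{e_{ij} \in E} f_{ij}(X_i, X_j)\, dX_1 \cdots dX_n}$

[Figure: factor graph over nodes X1, ..., Xn with single-node factors f1, ..., fn and pairwise factors f12, f13, f34, f4,n-1, f5,n-1, f5,n]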

Question:

• Each potential has some parameters. How do we estimate them from training data?

– Could do gradient ascent on the likelihood of the data (if we knew Z)

– Often an iterative process

• How do we compute Z?

– Belief propagation (next slides)

Message Passing

• Example

• Find the marginal for a particular node
  – for M-state nodes, brute-force cost is $O(M^L)$
  – exponential in the length L of the chain
  – but we can exploit the graphical structure (conditional independences)

Message Passing

• Joint distribution

• Exchange sums and products

Message Passing

• Express as product of messages

• Recursive evaluation of messages

• Find Z by normalizing
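
The recursion above can be sketched for a chain with discrete nodes. The potentials are passed as plain nested lists, and the function name `chain_marginal` and its argument layout are assumptions made for illustration. Each message costs M² operations, so the marginal and Z come out in time linear in the chain length instead of exponential.

```python
def chain_marginal(node_pot, edge_pot, k):
    """Marginal of node k and normalizer Z for a chain-structured model.

    node_pot[i][a]    : potential of node i in state a (L nodes, M states)
    edge_pot[i][a][b] : potential linking node i (state a) to node i+1 (state b)
    """
    num_nodes = len(node_pot)
    num_states = len(node_pot[0])
    states = range(num_states)

    # Forward message arriving at node k from the left end of the chain.
    alpha = [1.0] * num_states
    for i in range(k):
        alpha = [sum(alpha[a] * node_pot[i][a] * edge_pot[i][a][b]
                     for a in states)
                 for b in states]

    # Backward message arriving at node k from the right end of the chain.
    beta = [1.0] * num_states
    for i in range(num_nodes - 1, k, -1):
        beta = [sum(edge_pot[i - 1][a][b] * node_pot[i][b] * beta[b]
                    for b in states)
                for a in states]

    # Product of incoming messages and local potential, then normalize.
    unnormalized = [alpha[x] * node_pot[k][x] * beta[x] for x in states]
    z = sum(unnormalized)
    return [u / z for u in unnormalized], z


if __name__ == "__main__":
    # Hypothetical 3-node, 2-state chain that favors equal neighbors.
    node = [[1.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
    edge = [[[2.0, 1.0], [1.0, 2.0]], [[2.0, 1.0], [1.0, 2.0]]]
    for k in range(3):
        print(k, chain_marginal(node, edge, k))
```

Z is the same whichever node is used for normalization, which gives a quick sanity check on an implementation.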

Belief Propagation

• Extension to general tree-structured graphs

• At each node:
  – form the product of incoming messages and local evidence
  – marginalize to give the outgoing message
  – one message in each direction across every link

• No convergence guaranteed if there are loops!

Inference and Learning

• Data set

• Likelihood function (independent observations)

• Maximize (log) likelihood

Modeling Protein Tertiary Structure

• Optimize Pseudo-likelihood of training data, to estimate parameters
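
For reference, the standard definition of the pseudo-likelihood is written out below in the notation of the pairwise model ($f_i$, $f_{ij}$). The slides do not show the exact objective, so the symbols $\theta$ (the potential parameters), $N$ (the number of training samples), $X^{(m)}$ (the m-th sample) and $\mathcal{N}(i)$ (the neighbors of node i) are assumptions. Each factor conditions one variable on its neighbors, so the global normalizer Z never has to be computed; only one-dimensional integrals remain.

```latex
\log \mathrm{PL}(\theta)
  = \sum_{m=1}^{N} \sum_{i=1}^{n}
    \log P\bigl(X_i^{(m)} \mid X_{\mathcal{N}(i)}^{(m)}; \theta\bigr),
\qquad
P\bigl(X_i \mid X_{\mathcal{N}(i)}; \theta\bigr)
  = \frac{f_i(X_i) \prod_{j \in \mathcal{N}(i)} f_{ij}(X_i, X_j)}
         {\int f_i(X_i') \prod_{j \in \mathcal{N}(i)} f_{ij}(X_i', X_j)\, dX_i'}
```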

Application: Microarray Gene Expression Analysis

The dramatic consequences of gene regulation in biology

• Same genome

• Different tissues

• Different physiology

• Different proteome

• Different expression pattern

Anise swallowtail, Papilio zelicaon

cDNA microarray schema

From Duggan et al. Nature Genetics 21, 10 – 14 (1999)

color code for relative expression

Hierarchical clustering

• Combine most similar genes into agglomerative clusters, build tree of genes

• Do the same procedure along the second dimension to cluster samples

• Display as a heatmap
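
The gene-clustering step can be sketched in plain Python. The slide does not say which similarity measure or linkage rule is used, so the Pearson correlation distance and average linkage below are assumptions, as are the toy expression values and the function names. Profiles are assumed to be non-constant, since the correlation is undefined otherwise.

```python
import math


def pearson_distance(u, v):
    """1 - Pearson correlation between two expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return 1.0 - cov / (norm_u * norm_v)


def average_linkage(profiles):
    """Agglomerative clustering; returns the gene tree as nested tuples."""
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                members_i, members_j = clusters[i][1], clusters[j][1]
                # Average pairwise distance between the two clusters.
                d = sum(pearson_distance(profiles[a], profiles[b])
                        for a in members_i for b in members_j)
                d /= len(members_i) * len(members_j)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical expression values: rows are genes, columns are samples.
PROFILES = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}

if __name__ == "__main__":
    print(average_linkage(PROFILES))
```

Clustering the samples uses the same function on the transposed matrix, and ordering rows and columns by the two trees gives the layout of the heatmap.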

Hierarchical clustering results

Chi et al., PNAS | September 16, 2003 | vol. 100 | no. 19 | 10623-10628 “Endothelial cell diversity revealed by global expression profiling”

Personalized cancer treatment

• Drug sensitivity test: 160 drugs (Drug 2, Drug 3, ..., Drug i, ..., Drug 160), ~100 patients at UWMC

• RNA levels of 30,000 genes (g1, ..., g30,000) in cancer cells

• 30,000 features! (feature selection)

• Prior knowledge on drugs’ targets

• Publicly available RNA level data: >3000 patients

• Transfer learning, feature reconstruction

Other applications

• Predicting phenotype (symptoms) given:
  – RNA levels of genes
  – Protein levels of genes
  – Epigenetics (methylation)
  – A few histologic features
  – DNA sequence (…ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…)

• Predictive models can be:
  – Generative (e.g. Bayesian network)
  – Discriminative (e.g. regression, SVM, KNN)

Much more exciting research to come!