Applications of Machine Learning in Computational Biology
Narges Razavian
New York University
Slides thanks to James Galagan@Board Institute Su-In Lee@Univ of Washington Rainer Breitling@ Univ of Glasgow Christopher M. Bishop@ ECCV 2004
Central Dogma of Biology
Examples of Challenges involved
Slide Credit: Manolis Kellis
Application : Decoding Sequences and Motif Discovery
Motif Discovery GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT
TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT
GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA
AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC
GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC
CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC
TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC
GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC
GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG
ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG
TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG
TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG
GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC
ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT
GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG
GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC
GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT
ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG
GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC
GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT
CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT
TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT
GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA
AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC
GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC
CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC
TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC
GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC
GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG
ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG
TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG
TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG
GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC
ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT
GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG
GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC
GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT
ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG
GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC
GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT
CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
Sequence Annotation
Gene
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT
TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT
GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA
AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC
GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC
CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC
TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC
GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC
GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG
ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG
TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG
TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG
GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC
ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT
GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG
GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC
GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT
ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG
GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC
GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT
CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
Sequence Annotation
Gene
Promoter Motif
A Generative Model
Background Island
0.15
0.25
0.75 0.85
A: 0.25
T: 0.25
G: 0.25
C: 0.25
TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC
A: 0.15
T: 0.13
G: 0.30
C: 0.42
P
B B
P P
B
P P
B
P
B
P
B
P
B
P
B
A Generative Model(cont.)
P P
B B B
P P
C A A A T G C G S:
B B B
P P P
B B
A: 0.42
T: 0.30
G: 0.13
C: 0.15
A: 0.25
T: 0.25
G: 0.25
C: 0.25
P(S|P) P(S|B) P(Li+1|Li)
Bi+1 Pi+1
Bi 0.85 0.15
Pi 0.25 0.75
Fundamental HMM Operations
Decoding • Given an HMM and sequence S • Find a corresponding sequence of
labels, L
Evaluation • Given an HMM and sequence S • Find P(S|HMM)
Training • Given an HMM w/o parameters
and set of sequences S • Find transition and emission
probabilities the maximize P(S | params, HMM)
Computation Biology
Annotate pathogenicity islands on a new sequence
Score a particular sequence (not as useful for this model – will come back to this later)
Learn a model for sequence composed of background DNA and pathogenicity islands
Application: Modeling Protein Families
Modeling Protein Families
• Given amino acid sequences from a protein family, how can we find other members? – Can search databases with each known member – not sensitive
– More information is contained in full set
• The HMM Profile Approach – Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM
UBE2D2 FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3 FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1 FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1 FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9 FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10 FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34 FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1 FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5 LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1 FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6 FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12 VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
Human Ubiquitin Conjugating Enzymes
Profile HMM
Ij
Start M1 Mj MN End
Dj D1 DN
I I1 IN
A
C
D
E
F
G
H
I
K
L
M
N
O
P
Q
R
S
T
V
W
Y
A
C
D
E
F
G
H
I
K
L
M
N
O
P
Q
R
S
T
V
W
Y
A------------ D S A G -
E2EPF5 LG K D F PA S PP K G YF L T K I F H P N VGA N UBE2L1 F PA E Y P F K PP K I T F K T K I Y H P N I DE K UBE2L6 F PP E Y P F K PPMI K F TT K I Y H P N V DE N UBE2H LP D K Y P F K S P S IG F M N K I F H P N I DE A
- G E ICV N VL KR W T A E LGI RH Q VCLPVI A----------- E N W K PA T K T D Q
- G Q ICLPII SS A----------- E N W K PC T K T C Q S G T VCL D VI N -P----------- QT W T AL Y D L TN
Using Profile HMMs
Decoding Find sequence of labels, L,
that maximizes P(L|S, HMM)
Evaluation • Find P(S|HMM)
Training • Find transition and emission
probabilities the maximize P(S | params, HMM)
Computation Biology
Align a new sequence to a protein family
Score a sequence for membership in family
Discover and model family structure
Application: Modeling Protein Dynamics
Background • Proteins: Molecular machines, composed of a
sequences of Amino Acid sub-units
Background:
• Protein functional analysis pipeline
20
Crystallize to Get X-Ray Snapshot
Molecular Dynamics
Simulations
Learn Probabilistic
Model
Analyze and Predict
Image: H khanlou, et.al. “Durable Efficacy and Continued Safety of Ibalizumab in Treatment-Experienced Patients”, Infectious Diseases Society of America (IDSA) October 2011
Modeling Protein Tertiary Structure
10 second Reminder! Probability Theory
• Sum rule
• Product rule
• From these we have Bayes’ theorem – with normalization
10 second Reminder(cont.)! Decomposition
• Consider an arbitrary joint distribution
• By successive application of the product rule
Directed Acyclic Graphs • Joint distribution
where denotes the parents of i
No directed cycles
Undirected Graphs
• Provided then joint distribution is product of non-negative functions over the cliques of the graph where are the clique potentials, and Z is a normalization constant
Undirected Graphical Models
• Pairwise Undirected graphical models (single and bivariate potentials only)
n
n
i jieij jiijii
n
i jieij jiijii
dXdXXXfXf
XXfXfXP
GraphFactorAasFieldRandomMarkov
..),()(
),()(
)(
111
11
X2
X4
Xn-1
X5
X1
X3
Xn
f12
f12 f13
f34
f4n-1 f5n-1 f5n
f1 f2
f5
f3
fn-1 fn
f4
26
Question:
• Each potential has some parameters. How to estimate them from training data?
– Could do gradient descent on the likelihood of the data, (if we knew z)
– Often iterative process
• How to compute z?
– Belief propagation (next slides)
Message Passing
• Example
• Find marginal for a particular node – for M-state nodes, cost is – exponential in length of chain – but, we can exploit the graphical structure
(conditional independences)
Message Passing • Joint distribution
• Exchange sums and products
Message Passing
• Express as product of messages
• Recursive evaluation of messages
• Find Z by normalizing
Belief Propagation • Extension to general tree-structured graphs
• At each node: – form product of incoming messages and local evidence
– marginalize to give outgoing message
– one message in each direction across every link
• No convergence guaranteed if there are loops!
Inference and Learning • Data set
• Likelihood function (independent observations)
• Maximize (log) likelihood
Modeling Protein Tertiary Structure
• Optimize Pseudo-likelihood of training data, to estimate parameters
Application: Microarray Gene Expression Analysis
35
The dramatic consequences of gene regulation in biology
Same genome Different tissues
•Different physiology •Different proteome
•Different expression pattern
Anise swallowtail, Papilio zelicaon
36
cDNA microarray schema
From Duggan et al. Nature Genetics 21, 10 – 14 (1999)
color code for relative expression
37
Hierarchical clustering
• Combine most similar genes into agglomerative clusters, build tree of genes
• Do the same procedure along the second dimension to cluster samples
• Display as a heatmap
Hierarchical clustering results
Chi et al., PNAS | September 16, 2003 | vol. 100 | no. 19 | 10623-10628 “Endothelial cell diversity revealed by global expression profiling”
Personalized cancer treatment
160 d
rugs
Drug sensitivity test
~100 patients at UWMC
… g1
g2
g4
g5 g6
g3
e8
g11
g14 g15
g9
g16
g
g g30,000
g3
g7
g12 g13
g
g
g
g
g
g
g10 30,0
00 g
enes
RNA levels of genes in
cancer cells
Drug 3 Drug 2
Drug i
Drug 6
Drug 4
Drug 5
Drug 160
30,000 features! (feature selection)
Prior knowledge on drugs’ targets
Publicly available RNA level data
>3000 patients Transfer learning, Feature reconstruction
Other applications
• Predicting phenotype (symptoms) given:
– Predictive Models Can be:
• Generative (i.e. Bayesian Network)
• Discriminative (i.e. Regression, SVM, KNN)
RNA levels
of genes
Protein levels
of genes
Epigenetics
(Methylation)
A few histologic
features
…ACGTAGCTAGCTAGCTAGCTGATGCTAGCTACGTGCT…
DNA sequence
Many more exciting research to come!
Top Related