Post on 21-Jan-2016
Gene family classification using a semi-supervised learning method
Nan Song
Advisors: John Lafferty, Dannie Durand
Outline
• Introduction
  – A motivating application: genome annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
The Genome
The complete genetic material of an organism or species
Key genomic component: genes
A gene is a DNA subsequence: ACCCTTAGCTAGACCTTTAGGAGG...
Genes encode proteins, the building blocks of the cell
A protein is an amino acid sequence: V H L T P E...
Whole Genome Sequencing
413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria
In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes
Source: www.genomesonline.org
Gene prediction and annotation
International Human Genome Consortium, Nature 2001
• Known genes: 14,882
• Predicted genes: 16,896
• Total: 31,778
Gene annotation
• We are given a new genome sequence with predicted genes.
• A few genes are well studied.
• Identify other genes in the same family to predict function.
• Verify predictions experimentally.
Two contexts:
  – Individual scientist
  – High throughput
Outline
• Introduction
  – Molecular biology
  – A motivating application: genome annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
Evolutionarily related genes have related functions
[Figure: an ancestral gene (atgccaggactcccagtga…) undergoes two duplications, giving β-globin (adult, atgcgccgtctggcatgt…), γ-globin (fetal, atgcaaggagtcccagagc…), and ε-globin (embryonic, atgcgaggtctcccatgt…)]
Gene family classification is a powerful source of information for inferring evolutionary, functional, and structural properties of genes.
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
A graphical model of sequence relatedness
G = (V, E)
V: vertices xi, xj represent sequences (e.g., …atgcaaggagtcccagagcc…, …atgcgaggtctcccagtgtc…)
E: the weight of an edge is proportional to the similarity between the two sequences it connects.
Gene family classification
Goal: Given known genes, identify genes in the same family.
Biological scenario:
• small number of known genes
• large number of unknown genes
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion
Framework: binary classification
Determine which unlabeled genes belong to the family.
Machine learning scenario:
• small number of labeled data
  – genes known to be in the family
  – genes clearly not in the family
• large number of unlabeled data
Several challenging problems of gene family classification
Traditionally, similarity is represented by sequence comparison.
[Figure: an ancestral gene duplicates twice and the copies diverge. Mutations: atgcgccccccggcatgt… DNA shuffling: atgcgccgtctggcatgt…ggctcgta]
Several challenging problems of gene family classification
Families
  – may not form a clique
  – may not form a connected component
  – may have edges to sequences outside the family.
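The three graph properties listed above can be checked mechanically. The following is a minimal sketch, not the method used in the talk: the function names and the adjacency-dictionary representation are hypothetical, and connectivity is tested on the subgraph induced by the family, which is an assumption about how "connected" is meant here.

```python
from collections import deque


def family_structure(adj, family):
    """Classify a family as 'clique', 'connected', or 'not connected'.

    adj: dict mapping each vertex to the set of its neighbors (undirected graph).
    family: non-empty iterable of the vertices that belong to the family.
    Connectivity is checked inside the family only (an assumption of this sketch).
    """
    members = set(family)
    # Clique: every pair of distinct family members shares an edge.
    if all(v in adj.get(u, set()) for u in members for v in members if u != v):
        return "clique"
    # Connected: breadth-first search that only walks through family members.
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, set()) & members:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return "connected" if seen == members else "not connected"


def outside_edges(adj, family):
    """Return the edges (u, v) with u inside the family and v outside it."""
    members = set(family)
    return {(u, v) for u in members for v in adj.get(u, set()) if v not in members}
```

A family can be a clique and still have outside edges, so the two functions are meant to be used together.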
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
  – Semi-supervised learning algorithm
  – Supervised learning algorithm
• Empirical evaluation
• Conclusion
Gene family classification
Goal: Binary classification
Machine learning scenario:
• large number of unlabeled data
• small number of labeled data
Semi-supervised learning:
• Exploit information from both labeled and unlabeled data
• Performed well in many applications
Graphical semi-supervised learning (Binary classification)
Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003)
Notation:
• V: the whole data set
• L: labeled data set
• U: unlabeled data set
• Each vertex: (xi, yi) if labeled, or (xk, f(k)) if unlabeled
Input:
  – family members: (xi, yi = 1)
  – nonfamily members: (xj, yj = 0)
Output:
  – Assign a real value to every vertex in the graph
  – Find a cutoff to separate the two classes
G = (V, E); L: labeled data set; U: unlabeled data set
Assign real values to all vertices in the graph to minimize E(f):
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2 \quad \text{s.t. } f(i) = y_i \text{ for } x_i \in L$$
where the edge weight $W_{ij}$ is an exponential function of the sequence similarity $S_{ij}$.
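Minimizing E(f) with the labeled values held fixed implies that each unlabeled value equals the weighted average of its neighbors' values. The sketch below finds such an assignment by repeated averaging. It illustrates the idea under stated assumptions and is not the implementation used in the talk: the function name, the dense list-of-lists weight matrix, and the stopping rule are all hypothetical.

```python
def harmonic_scores(W, labels, n_iter=1000, tol=1e-12):
    """Iteratively assign f so each unlabeled vertex averages its neighbors.

    W: symmetric list-of-lists of nonnegative edge weights (assumed precomputed).
    labels: dict mapping a labeled vertex index to 1 (family) or 0 (nonfamily).
    Returns a list f with f[i] == labels[i] on labeled vertices.
    """
    n = len(W)
    # Labeled vertices keep their label; unlabeled ones start at 0.5.
    f = [float(labels.get(i, 0.5)) for i in range(n)]
    unlabeled = [i for i in range(n) if i not in labels]
    for _ in range(n_iter):
        max_change = 0.0
        for k in unlabeled:
            total = sum(W[k][j] for j in range(n) if j != k)
            if total == 0:
                continue  # isolated vertex: nothing to average over
            new = sum(W[k][j] * f[j] for j in range(n) if j != k) / total
            max_change = max(max_change, abs(new - f[k]))
            f[k] = new
        if max_change < tol:
            break
    return f
```

On a four-vertex chain with unit weights, labeling one end 1 and the other end 0 gives interior values of about 2/3 and 1/3, so the scores fall off with graph distance from the labeled family member.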
Graph-based semi-supervised learning
Animation of f(xk): http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html
[Figure: two example graphs, one labeled "Works well" and one labeled "Works well?"]
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
  – Semi-supervised learning
  – Supervised learning
• Empirical evaluation
• Conclusion
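Semi-supervised vs kernel-based supervised learning
• Semi-supervised learning:
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2 \quad \text{s.t. } f(i) = y_i \text{ for } x_i \in L$$
• Supervised learning:
$$E(f) = \sum_{i \in L} \sum_{j \in U} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2$$
where L is the labeled data set and U is the unlabeled data set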
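In the supervised objective each unlabeled vertex j interacts only with labeled vertices. With f(i) = y_i on the labeled set, setting the derivative with respect to f(j) to zero gives f(j) = Σ_i W_ij y_i / Σ_i W_ij, a weighted vote of the labeled neighbors. The sketch below computes that closed form. The function name and data layout are hypothetical, and returning None for a vertex with no labeled neighbor is a choice made for this illustration.

```python
def supervised_scores(W, labels):
    """Score each unlabeled vertex by the weighted average of labeled neighbors.

    W: symmetric list-of-lists of nonnegative edge weights (assumed precomputed).
    labels: dict mapping a labeled vertex index to 1 (family) or 0 (nonfamily).
    Returns a dict from unlabeled vertex index to its score, or None when the
    vertex has no edge to any labeled vertex (the objective is flat there).
    """
    scores = {}
    for j in range(len(W)):
        if j in labels:
            continue
        total = sum(W[i][j] for i in labels)
        if total > 0:
            scores[j] = sum(W[i][j] * y for i, y in labels.items()) / total
        else:
            scores[j] = None
    return scores
```

Unlike the semi-supervised objective, nothing propagates through edges between two unlabeled vertices, so a family member that is linked only to other unlabeled members receives no score from this formula.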
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
  – Methodology
  – Results
• Conclusion
Graph construction
G = (V,E)
V: all mouse sequences from SwissProt (n = 7439)
E: edge weights based on a newly designed sequence similarity measure S, with 0 < S(i, j) < 1
Methodology
• Graph construction
• Test set construction
• Experiments performed
• Basis for evaluation
Test set construction
18 well-studied protein families
  – Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins.
ACSL, ADAM, DVL, FGF, FOX, GATA, Kinase, Kinesin, Laminin, Myosin, Notch, PDE, SEMA, T-box, TNFR, TRAF, USP, WNT
Test set construction
• Retrieved all complete mouse sequences from SwissProt database (7,439)
• Identified sequences for each test family based on
– Nomenclature committee reports
– Structural properties
– Literature surveys
Family   Size
ACSL     5
ADAM     26
DVL      3
FGF      20
FOX      30
GATA     6
Kinase   293
Kinesin  18
Laminin  11
Myosin   12
Notch    4
PDE      15
SEMA     16
TNFR     24
TRAF     9
Tbox     6
USP      18
WNT      19
Methodology
• Graph construction
• Test set construction
• Experiments performed
• Basis for evaluation
Experiments performed
• Compare the semi-supervised with the supervised learning algorithm
• Tested parameters:
  – Scaling parameter, σ, in the kernel function
  – Number of Labeled Family members (LF)
  – Number of Labeled Nonfamily members (LN)
For each set of parameters, 20 tests were performed
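One way to organize the 20 tests per parameter setting is to draw a fresh labeled subset for each test. The sketch below assumes the labeled members are sampled uniformly at random without replacement, which the slides do not state. The function name, the seed argument, and the use of integer IDs are hypothetical.

```python
import random


def sample_labeled_sets(family, nonfamily, lf, ln, n_trials=20, seed=0):
    """Draw n_trials random labeled subsets for one (LF, LN) setting.

    family, nonfamily: lists of sequence identifiers.
    lf, ln: number of labeled family and labeled nonfamily members to draw.
    Random sampling is an assumption of this sketch, not a documented detail.
    """
    rng = random.Random(seed)  # fixed seed so the trials are reproducible
    trials = []
    for _ in range(n_trials):
        labeled_family = rng.sample(family, lf)
        labeled_nonfamily = rng.sample(nonfamily, ln)
        trials.append((labeled_family, labeled_nonfamily))
    return trials
```

Each trial would then be scored by both learning algorithms on the same labeled subset, which keeps the two methods comparable trial by trial.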
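Tested parameters (1)
Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100
[Figure: edge weight W (0 to 1) against similarity S (0 to 1), one curve per σ: 0.02, 0.05, 0.08, 0.1, 0.2, 0.5, 1, 10, 100]
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2$$
where $W_{ij}$ is an exponential function of the similarity $S_{ij}$, scaled by $\sigma$.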
Tested parameters (2)
• Labeled Family members (LF): 10-70% of family size
• Labeled Nonfamily members (LN): 100, 500, 1000 (about 1-10% of nonfamily size)
Database size: 7439
Family   Size  LF
ACSL     5     1, 3
ADAM     26    3, 5, 7, 9, 15
DVL      3     1
FGF      20    3, 5, 7, 11, 15
FOX      30    3, 9, 15
GATA     6     1, 3
Kinase   293   3, 7, 11, 15, 20, 50, 150
Kinesin  18    2, 6, 9
Laminin  11    1, 3, 5, 7
Myosin   12    2, 4, 6, 9
Notch    4     1, 2
PDE      15    2, 5, 7, 10
SEMA     16    2, 5, 8
TNFR     24    2, 4, 8, 12, 18
TRAF     9     1, 3
Tbox     6     2, 5
USP      18    2, 4, 6, 9, 13
WNT      19    2, 9
Methodology
• Graph construction
• Test set construction
• Experiments performed
• Basis for evaluation
Semi-supervised learning
Goal:
f(i) > f(j) when xi is a family member and xj is not.
Evaluation criteria:
• Visualization
• AUC score
• False negatives
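Visualization
Sort all unlabeled data by f(x).
[Figure: rank plot of f(x) against rank, with family members and nonfamily members marked]
AUC (Area Under ROC Curve)
[Figure: ROC curve, sensitivity against 1 - specificity]
$$\mathrm{AUC} = \frac{1}{n_a n_b} \sum_{i=1}^{n_a} \sum_{j=1}^{n_b} \mathbf{1}[a_i > b_j]$$
where $a_1, \dots, a_{n_a}$ are the f values of the family members and $b_1, \dots, b_{n_b}$ are the f values of the nonfamily members.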
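The AUC can be computed directly as the fraction of (family, nonfamily) pairs in which the family member receives the higher score. The sketch below does exactly that. The function name is hypothetical, and ties are counted as failures, which is an assumption of this sketch.

```python
def auc(family_scores, nonfamily_scores):
    """Fraction of (family, nonfamily) pairs ranked in the correct order.

    family_scores: f values of the unlabeled family members.
    nonfamily_scores: f values of the unlabeled nonfamily members.
    Ties count as incorrect here (an assumption of this sketch).
    """
    n_a, n_b = len(family_scores), len(nonfamily_scores)
    if n_a == 0 or n_b == 0:
        raise ValueError("both score lists must be non-empty")
    wins = sum(1 for a in family_scores for b in nonfamily_scores if a > b)
    return wins / (n_a * n_b)
```

The double loop is quadratic in the number of sequences. A sort-based computation gives the same value faster, but the direct form mirrors the definition.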
Advantages of rank plot
[Figure: example rank plot with AUC = 0.9382]
AUC scores do not reflect all the information we need:
• False negatives after the first false positive
• The number of missed sequences after the first false positive
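The count of family members that rank below the first nonfamily member can be read off the sorted score list. The sketch below is one possible reading of "false negatives after the first false positive". The function name is hypothetical, and the treatment of equal scores (left to the stable sort order) is an assumption.

```python
def missed_after_first_false_positive(scores, is_family):
    """Count family members ranked below the top-ranked nonfamily member.

    scores: f values of the unlabeled sequences.
    is_family: list of booleans of the same length, True for family members.
    """
    # Rank indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    seen_false_positive = False
    missed = 0
    for i in order:
        if not is_family[i]:
            seen_false_positive = True
        elif seen_false_positive:
            missed += 1  # a family member ranked after a nonfamily member
    return missed
```

A result of zero means every family member ranks above every nonfamily member, which matches a perfect rank plot.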
Outline
• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
  – Methodology
  – Results
• Conclusion
Several challenging problems of gene family classification
Families
  – may not form a clique
  – may not form a connected component
  – may have edges to sequences outside the family.
Edges to sequences outside the family are mainly a problem if they have strong edge weights.
Test families have different graph properties
Family   Size  Graph structure          Outside edges
ACSL     5     Clique                   W
DVL      3     Clique                   W
FOX      30    Clique                   W
GATA     6     Clique                   W
SEMA     16    Clique                   W
TRAF     9     Clique                   W
Tbox     6     Clique                   W
WNT      19    Clique                   W
Kinesin  18    Clique                   S
Laminin  12    Clique                   S
Myosin   12    Clique                   S
Notch    4     Clique                   S
ADAM     26    Connected, not a clique
FGF      20    Connected, not a clique
Kinase   293   Connected, not a clique
PDE      15    Not connected
TNFR     24    Not connected
USP      18    Not connected
W: Edges to sequences outside the family have weak edge weights
S: Edges to sequences outside the family have strong edge weights
Results
• Compare semi-supervised with supervised learning algorithm
• Tested parameters:
  – Scaling parameter, σ, in the kernel function
  – Number of Labeled Family members (LF)
  – Number of Labeled Nonfamily members (LN)
[Figure: average AUC (axis 0.9996 to 1) against σ (0.1, 0.2, 0.5, 1, 10) for Notch, LF = 1, LN = 1000]
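The effect of σ
[Figure: edge weight W (0 to 1) against raw similarity score s (0 to 1), one curve per σ: 0.02, 0.05, 0.08, 0.1, 0.2, 0.5, 1, 10, 100]
$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2$$
where $W_{ij}$ is an exponential function of the similarity $S_{ij}$, scaled by $\sigma$.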
Edges to sequences outside the family are mainly a problem if they have strong edge weights
[Figure: histograms of the number of edges against raw edge weight, for FOX and for Notch]
Case study: rank plots for semi-supervised learning in FOX
LF = 3, LN = 100, family size: 30
[Figure: rank plots at σ = 0.1, 1, 10, and 100]
Case study: rank plots for semi-supervised learning in Notch
Labeled family sequences: 1 (out of 4); labeled nonfamily sequences: 100 (out of 7435)
[Figure: rank plots at σ = 0.1, 1, and 10]
[Figure: average AUC (axis 0.9996 to 1) against σ (0.1, 0.2, 0.5, 1, 10). Panels: Notch, LF = 1, LN = 1000; FOX, LF = 3, LN = 1000]
Summary on σ
• For most families, performance is not very sensitive to σ.
• For almost all families that form a clique, there is at least one value of σ (usually many) for which both the semi-supervised and supervised learning algorithms achieve perfect classification performance.
Results
• Compare semi-supervised with supervised learning algorithm
• Tested parameters:
  – Scaling parameter, σ, in the kernel function
  – Number of Labeled Family members (LF)
  – Number of Labeled Nonfamily members (LN)
The connection among sequences in ADAM family
[Figure: histogram; x-axis: # of connected ADAM sequences; y-axis ticks: 0 to 16]
Tested parameters: number of labeled family members (LF), number of labeled nonfamily members (LN), and σ.
For each combination of LF and LN, take the σ that achieves the best average AUC score.
The impact of number of labeled family and nonfamily members on the performance
[Figure: ADAM. AUC (axis 0.992 to 1) against # labeled family seqs, LF (3, 5, 7, 9, 15). Curves: supervised, LN = 100; supervised, LN = 1000; semi-supervised, LN = 100; semi-supervised, LN = 1000]
A paired t-test was performed to detect the difference between the semi-supervised and supervised methods for a given set of parameters.
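A paired t-test compares the two methods trial by trial, using the differences between their AUC scores on the same labeled subsets. The sketch below computes only the t statistic and its degrees of freedom. The function name is hypothetical, and pairing the 20 tests of one parameter setting is an assumption about how the test was applied.

```python
import math


def paired_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for paired samples x and y.

    x, y: equal-length sequences, e.g. per-trial AUC scores of two methods
    evaluated on the same labeled subsets (pairing is assumed by this sketch).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1
```

The p-value follows from the t distribution with the returned degrees of freedom and can be obtained from a statistics table or library.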
Graph structure of ADAM
• Troublemaker: ADAMTS10 matches only 8 of the 26 sequences in the ADAM family.
• ADAMTS10 is often misclassified.
• ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.
[Figure: panels for the semi-supervised method and the supervised method]
Several challenging problems of gene family classification
• Sequences in the same family
  – do not form a clique
  – do not exist in the same connected component
• Sequences in different families
  – have significant matches
The connection among sequences in TNFR family
[Figure: histogram; x-axis: # of connected TNFR sequences (10 to 20). 20 TNFR sequences are in this connected component.]
The impact of number of labeled family and nonfamily members on the performance
[Figure: TNFR (family size 24). AUC (axis 0.92 to 1) against LF (2, 4, 8, 12, 18). Curves: supervised, LN = 100; supervised, LN = 1000; semi-supervised, LN = 100; semi-supervised, LN = 1000]
Summary for number of labeled family members (LF)
• The performance of both semi-supervised and supervised learning improves as LF increases for all families.
• In non-clique families, semi-supervised learning performs better than supervised learning when LF is small.
Rank plots for semi-supervised learning in TNFR
[Figure: rank plot, σ = 0.1, LF = 2, LN = 100]
AUC values do not reflect all the information that we need
The impact of number of labeled family and nonfamily members on the performance
[Figure: TNFR (family size 24). Number of missed TNFR sequences against LF (2, 4, 8, 12, 18). Curves: supervised, LN = 100; supervised, LN = 1000; semi-supervised, LN = 100; semi-supervised, LN = 1000]
Summary for number of labeled nonfamily members (LN)
• The performance of supervised learning improves as LN increases for all families.
• For semi-supervised learning, increasing LN is sometimes helpful and sometimes not.
Summary of results
Row groups: Clique, Connected. Column groups: Small LF (first four columns), Large LF (last four columns).
LN       100     1000    100     1000    100     1000    100     1000
Method   Super   Semi    Super   Semi    Super   Semi    Super   Semi
ACSL
DVL
FOX
GATA
Kinesin
Myosin
Laminin  0.9999  0.9999  0.9999  0.9999  1       1       1       1
Notch
SEMA
TRAF
Tbox
WNT
ADAM     0.9951  1       0.9989  1       1       1       1       1
FGF
Kinase   0.9549  0.9644  0.9745  0.9738  0.9656  0.9666  0.9804  0.9771
PDE      0.9181  0.9364  0.9644  0.9589  0.9612  0.9603  0.9850  0.9769
TNFR     0.9297  0.9420  0.9537  0.9526  0.9628  0.9671  0.9845  0.9866
USP      0.9792  0.9798  0.9907  0.9895  0.9827  0.9875  0.9900  0.9895
Insights - 1
• SSL is most effective for families that are not cliques but are connected.
• In the test set, 12/18 families are cliques and 3/18 are not connected.
• What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?
Insights - 2
• Performance evaluation measures should match the needs of the user.
– AUC scores penalize all FNs and FPs.
– For experimental biologists, top ranked predictions are of interest
– The number of FNs after the first false positive can reveal some information
Insights - 3
• The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small.
Acknowledgements
• John Lafferty
• Dannie Durand
• Jerry Zhu
Durand Lab
• Robbie Sedgewick
• Rose Hoberman
• Ben Vernot
• Narayanan Raghupathy
• Aiton Goldman
• Jacob Joseph
• Annette McLeod
• Maureen Stolzer
Thank You !