Gene family classification using a semi-supervised learning method

Nan Song
Advisors: John Lafferty, Dannie Durand

Outline

• Introduction
  – A motivating application: genome annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion

The Genome

The complete genetic material of an organism or species

Key genomic component: genes

ACCCTTAGCTAGACCTTTAGGAGG...

A gene is a DNA subsequence

Key genomic component: genes

Genes encode proteins, the building blocks of the cell

A protein is an amino acid sequence

V H L T P E...


413 whole genome sequences: 41 eukarya, 28 archaea, 344 bacteria

In progress: 1034 prokaryotic genomes, 629 eukaryotic genomes

www.genomesonline.org

Whole Genome Sequencing

• atgcaccttg

Gene prediction and annotation
International Human Genome Consortium, Nature 2001

Known genes: 14,882
Predicted genes: 16,896
Total: 31,778

Gene annotation

• We are given a new genome sequence with predicted genes.

• A few genes are well studied.

• Identify other genes in the same family to predict function.

• Verify predictions experimentally

Two contexts:
– Individual scientist
– High throughput

Outline

• Introduction
  – Molecular biology
  – A motivating application: genome annotation
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion

Evolutionarily related genes have related functions

[Figure: an ancestral gene (atgccaggactcccagtga…) gives rise by duplication to β-globin (atgcgccgtctggcatgt…, adult), γ-globin (atgcaaggagtcccagagc…, fetal), and ε-globin (atgcgaggtctcccatgt…, embryonic)]

Gene family classification is a powerful source of information for inferring evolutionary, functional and structural properties of genes.

Outline

• Introduction
• A graphical model of sequence relatedness
• Gene classification using machine learning
• Empirical evaluation
• Conclusion

A graphical model of sequence relatedness

G = (V, E)

V: vertices represent sequences (each vertex xi is a sequence such as …atgcaaggagtcccagagcc…)

E: the weight of an edge is proportional to the similarity between the two sequences it connects.
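A minimal sketch of how such a graph could be held in memory, assuming pairwise similarity scores are already available as (sequence, sequence, score) triples. The function name `build_graph`, the identifiers `x1` to `x4`, and the toy scores are hypothetical and only illustrate the data structure.

```python
from collections import defaultdict

def build_graph(scored_pairs):
    """Build an undirected weighted graph as a dict of dicts.

    scored_pairs: iterable of (seq_a, seq_b, similarity) triples.
    graph[a][b] holds the edge weight between sequences a and b.
    """
    graph = defaultdict(dict)
    for a, b, similarity in scored_pairs:
        if a == b:
            continue  # ignore self-matches
        graph[a][b] = similarity
        graph[b][a] = similarity  # edges are undirected
    return graph

# Hypothetical toy input: four sequences, three pairwise matches.
pairs = [("x1", "x2", 0.9), ("x2", "x3", 0.4), ("x3", "x4", 0.7)]
graph = build_graph(pairs)
```

Pairs that are never scored simply have no edge, which keeps the graph sparse when most sequences are unrelated.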


Gene family classification

Goal: Given known genes, identify genes in the same family.

Biological scenario:
• small number of known genes
• large number of unknown genes

Outline

• Introduction

• A graphical model of sequence relatedness

• Gene classification using machine learning

• Empirical evaluation

• Conclusion

Framework: binary classification

Determine which unlabeled genes belong to the family.

Machine learning scenario:
• small number of labeled data
  – genes known to be in family
  – genes clearly not in family
• large number of unlabeled data

Several challenging problems of gene family classification

Traditionally, similarity is represented by sequence comparison

[Figure: after duplication of an ancestral gene, the copies (atgcgccgtctggcatgt…, atgcaaggagtcccagagc…) diverge through mutations (atgcgccccccggcatgt…) and DNA shuffling (atgcgccgtctggcatgt…ggctcgta)]


Families

– do not form a clique
– do not form a connected component
– have edges to sequences outside the family.
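The three properties above can be checked directly on a weighted graph. The sketch below assumes the graph is stored as a dict of dicts (`graph[a][b]` is an edge weight) and treats "connected" as "all family members lie in one connected component of the whole graph"; the function names and the toy graph are hypothetical.

```python
from collections import deque
from itertools import combinations

def is_clique(graph, family):
    """True if every pair of family members is joined by an edge."""
    return all(b in graph.get(a, {}) for a, b in combinations(family, 2))

def in_same_component(graph, family):
    """True if all family members are reachable from one another in the graph."""
    family = set(family)
    if not family:
        return True
    start = next(iter(family))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the whole graph
        node = queue.popleft()
        for neighbor in graph.get(node, {}):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return family <= seen

def outside_edges(graph, family):
    """List (member, outsider, weight) for edges leaving the family."""
    family = set(family)
    return [(a, b, w) for a in family
            for b, w in graph.get(a, {}).items() if b not in family]

# Hypothetical toy graph: a-b-c form a chain, c also matches outsider x, d is isolated.
graph = {
    "a": {"b": 0.9},
    "b": {"a": 0.9, "c": 0.8},
    "c": {"b": 0.8, "x": 0.3},
    "x": {"c": 0.3},
    "d": {},
}
```

Here the family {a, b, c} is connected but not a clique (a and c share no edge) and has one edge to a sequence outside the family.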

Outline

• Introduction

• Gene classification using machine learning
  – Semi-supervised learning algorithm
  – Supervised learning algorithm

• Conclusion

Gene family classification

Goal: Binary classification

Machine learning scenario:
• large number of unlabeled data
• small number of labeled data

Semi-supervised learning:
• Exploit information from both labeled and unlabeled data

• Performed well in many applications

Graphical semi-supervised learning (Binary classification)

Notation:

• V: The whole data set

• L: Labeled data set

• U: unlabeled data set

• Each vertex: (xi,yi) or (xk, f(k))

Xiaojin Zhu, Zoubin Ghahramani, John Lafferty. The Twentieth International Conference on Machine Learning (ICML-2003)

[Figure: graph G = (V, E) with labeled vertices (xi, yi = 1) and (xj, yj = 0) and an unlabeled vertex (xk, f(k))]

• Input:
  – family members: (xi, yi = 1)
  – nonfamily members: (xj, yj = 0)

• Output:
  – Assign a real value to every vertex in the graph
  – Find a cutoff to separate the two classes

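Assign real values to all vertices in the graph so as to minimize E(f):

$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2 \quad \text{s.t. } f(i) = y_i \text{ for } x_i \in L,$$

where the edge weight $W_{ij}$ is an exponential function of the similarity between $x_i$ and $x_j$.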
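For a quadratic energy of this kind with the labeled values held fixed, the minimizer over the unlabeled vertices has the closed form given in the cited Zhu, Ghahramani and Lafferty paper: f_U = (D_UU - W_UU)^{-1} W_UL y_L, where D is the diagonal matrix of vertex degrees. The sketch below is one way to compute it with NumPy; the function name and the toy chain graph are hypothetical, and it assumes every unlabeled vertex is connected, directly or indirectly, to at least one labeled vertex so that the linear system is solvable.

```python
import numpy as np

def harmonic_solution(W, labeled_idx, y_labeled):
    """Minimize sum_ij W_ij (f(i) - f(j))^2 with f fixed to y on labeled vertices.

    W: symmetric (n, n) array of non-negative edge weights.
    labeled_idx: indices of labeled vertices; y_labeled: their labels (0 or 1).
    Returns f for all n vertices.
    """
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    y = np.asarray(y_labeled, dtype=float)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    laplacian = np.diag(W.sum(axis=1)) - W  # D - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]

    f = np.empty(n)
    f[labeled_idx] = y
    f[unlabeled_idx] = np.linalg.solve(L_uu, W_ul @ y)
    return f

# Hypothetical toy graph: a chain 0 - 1 - 2 - 3 with unit weights,
# vertex 0 labeled as family (1) and vertex 3 labeled as nonfamily (0).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
f = harmonic_solution(W, labeled_idx=[0, 3], y_labeled=[1, 0])
```

On the chain, the unlabeled vertices receive values that interpolate between the two labels (2/3 and 1/3), which shows how label information propagates through unlabeled vertices.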

Graph-based semi-supervised learning

http://www.cs.wisc.edu/~jerryzhu/research/ssl/animation.html

[Figure panels: "Works well"; "Works well?"]

Outline

• Introduction

• Gene classification using machine learning
  – Semi-supervised learning
  – Supervised learning

• Conclusion

Semi-supervised vs kernel-based supervised learning

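• Semi-supervised learning:

$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2 \quad \text{s.t. } f(i) = y_i \text{ for } x_i \in L$$

• Supervised learning:

$$E(f) = \sum_{i \in L} \sum_{j \in U} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2$$

where L is the labeled data set and U is the unlabeled data set.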
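When only labeled-to-unlabeled edges enter the energy and the labeled values are fixed to their labels, each unlabeled vertex can be optimized independently: setting the derivative with respect to f(j) to zero gives the weighted average f(j) = Σ_{i∈L} W_ij y_i / Σ_{i∈L} W_ij. The sketch below computes this under that reading of the supervised objective; the function name and the toy graph are hypothetical, and vertices with no labeled neighbor are left as NaN by choice.

```python
import numpy as np

def supervised_scores(W, labeled_idx, y_labeled):
    """Score each unlabeled vertex by the weighted mean label of its labeled neighbors."""
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    y = np.asarray(y_labeled, dtype=float)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)

    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]  # unlabeled-to-labeled weights only
    totals = W_ul.sum(axis=1)

    f = np.full(n, np.nan)  # NaN marks vertices with no labeled neighbor
    f[labeled_idx] = y
    has_edge = totals > 0
    f[unlabeled_idx[has_edge]] = (W_ul[has_edge] @ y) / totals[has_edge]
    return f

# Hypothetical toy graph: a chain 0 - 1 - 2 - 3 with unit weights,
# vertex 0 labeled as family (1) and vertex 3 labeled as nonfamily (0).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
f_supervised = supervised_scores(W, labeled_idx=[0, 3], y_labeled=[1, 0])
```

On this chain each unlabeled vertex simply copies the label of its single labeled neighbor; the edge between the two unlabeled vertices is ignored, which is the difference from the semi-supervised objective.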

Outline

• Introduction

• Empirical evaluation
  – Methodology
  – Results

• Conclusion

Graph construction

G = (V,E)

V: All mouse sequences from SwissProt (n = 7439)

E: based on a newly designed sequence similarity measure S(i, j), with

0 < S(i, j) < 1

Methodology

• Graph construction

• Test set construction

• Experiments performed

• Basis for evaluation

Test set construction

18 well-studied protein families
– Receptors, enzymes, transcription factors, motor proteins, structural proteins, and extracellular matrix proteins.

ACSL, ADAM, DVL, FGF, FOX, GATA, Kinase, Kinesin, Laminin, Myosin, Notch, PDE, SEMA, TNFR, TRAF, Tbox, USP, WNT

Test set construction

• Retrieved all complete mouse sequences from SwissProt database (7,439)

• Identified sequences for each test family based on

– Nomenclature committee reports

– Structural properties

– Literature surveys

Family    Size
ACSL      5
ADAM      26
DVL       3
FGF       20
FOX       30
GATA      6
Kinase    293
Kinesin   18
Laminin   11
Myosin    12
Notch     4
PDE       15
SEMA      16
TNFR      24
TRAF      9
Tbox      6
USP       18
WNT       19

Experiments performed

• Compare the semi-supervised with the supervised learning algorithm

• Tested parameters:
  – Scaling parameter, σ, in the kernel function
  – Number of Labeled Family members (LF)
  – Number of Labeled Nonfamily members (LN)

Tested parameters

number of Labeled Family members

number of Non-labeled Family members

For each set of parameters, 20 tests were performed
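A sketch of how the repeated tests for one parameter setting might be set up. The slides do not say how the labeled sequences were chosen for each test, so uniform random sampling without replacement is an assumption here, as are the function name, the seed, and the toy identifiers.

```python
import random

def sample_labeled_sets(family, nonfamily, n_family, n_nonfamily,
                        n_trials=20, seed=0):
    """Draw n_trials random (labeled family, labeled nonfamily) selections.

    ASSUMPTION: labeled examples are drawn uniformly at random without
    replacement; the source only states that 20 tests were performed.
    """
    rng = random.Random(seed)
    family = sorted(family)        # sort so results are reproducible
    nonfamily = sorted(nonfamily)
    return [(rng.sample(family, n_family), rng.sample(nonfamily, n_nonfamily))
            for _ in range(n_trials)]

# Hypothetical toy data: a family of 6 among 56 sequences, LF = 2, LN = 10.
family_ids = [f"fam{i}" for i in range(6)]
nonfamily_ids = [f"other{i}" for i in range(50)]
trials = sample_labeled_sets(family_ids, nonfamily_ids, n_family=2, n_nonfamily=10)
```

All sequences not drawn in a given trial would be treated as unlabeled for that trial.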

Tested parameters (1)

Tested σ values: 0.05, 0.1, 0.5, 1, 2, 10, 100

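[Figure: curves for σ = 100, 0.5, 0.2, 0.1, 0.08, 0.05; axis ticks 0, 0.2, 0.4, 0.6, 0.8, 1]

$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2,$$

where the edge weight $W_{ij}$ is an exponential function of the similarity $S_{ij}$, scaled by σ.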
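To see why the scaling parameter matters, it helps to tabulate edge weights for a few similarity scores. The exact kernel is not recoverable from the slide text, so the form W = exp((S - 1) / σ) used below is an assumption chosen only because it is exponential in S, increases with similarity, and stays in (0, 1]; the function name is hypothetical.

```python
import numpy as np

def edge_weight(similarity, sigma):
    """ASSUMED kernel for illustration only: W = exp((S - 1) / sigma).

    The true kernel used in the talk may differ; this form merely maps a
    similarity in [0, 1] to a weight in (0, 1] that grows with similarity.
    """
    return np.exp((np.asarray(similarity, dtype=float) - 1.0) / sigma)

scores = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
for sigma in (0.05, 0.1, 0.5, 1, 2, 10, 100):
    # Small sigma: only near-identical sequences keep a substantial weight.
    # Large sigma: all edges receive almost the same weight.
    print(sigma, np.round(edge_weight(scores, sigma), 3))
```

Under this assumed form, a small σ suppresses weak edges almost entirely, while a very large σ flattens the weights so that strong and weak matches are treated nearly alike.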

Tested parameters (2)

• Labeled Family members (LF): 10-70% of family size

• Labeled Nonfamily members (LN): 100, 500, 1000 (about 1-10% of nonfamily size)

Family    Size   LF values tested
ACSL      5      1, 3
ADAM      26     3, 5, 7, 9, 15
DVL       3      1
FGF       20     3, 5, 7, 11, 15
FOX       30     3, 9, 15
GATA      6      1, 3
Kinase    293    3, 7, 11, 15, 20, 50, 150
Kinesin   18     2, 6, 9
Laminin   11     1, 3, 5, 7
Myosin    12     2, 4, 6, 9
Notch     4      1, 2
PDE       15     2, 5, 7, 10
SEMA      16     2, 5, 8
TNFR      24     2, 4, 8, 12, 18
TRAF      9      1, 3
Tbox      6      2, 5
USP       18     2, 4, 6, 9, 13
WNT       19     2, 9

Database size: 7439

Semi-supervised learning

f(i) > f(j) when xi is a family member and xj is not.

Evaluation criteria:

• Visualization

• AUC score

• False negatives

Visualization

Sort all unlabeled data by f(x)

[Figure: family members and nonfamily members sorted by f(x); axis label: 1 - specificity]

AUC (Area Under ROC Curve)

Rank plot

Advantages of rank plot

AUC = 0.9382
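The AUC equals the probability that a randomly chosen family member receives a higher score than a randomly chosen nonfamily member, with ties counted as one half. A minimal sketch of that pairwise computation follows; the function name and the toy scores are hypothetical.

```python
def auc_score(family_scores, nonfamily_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (family, nonfamily) pairs in which the family
    member is ranked higher; ties contribute 0.5.
    """
    wins = 0.0
    for p in family_scores:
        for q in nonfamily_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(family_scores) * len(nonfamily_scores))

# Hypothetical f values for unlabeled sequences.
family_f = [0.9, 0.8, 0.3]
nonfamily_f = [0.5, 0.2, 0.1]
auc = auc_score(family_f, nonfamily_f)
```

In this toy case one family member (0.3) is ranked below one nonfamily member (0.5), so 8 of the 9 pairs are ordered correctly. The double loop is quadratic, which is fine for illustration; a rank-based formula would be preferable for thousands of sequences.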

AUC scores do not reflect all information we need

• False negatives after the first false positive

• The number of missed data after the first false positive
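One way to compute this measure is to walk down the ranked list and count the family members that appear below the highest-ranked nonfamily member. The sketch below does that; the function name and the toy data are hypothetical.

```python
def false_negatives_after_first_fp(scores, is_family):
    """Count family members ranked below the top-ranked nonfamily member.

    scores: f value for each unlabeled sequence.
    is_family: parallel list of booleans (True for true family members).
    """
    ranked = sorted(zip(scores, is_family), key=lambda pair: pair[0], reverse=True)
    seen_false_positive = False
    missed = 0
    for _, member in ranked:
        if not member:
            seen_false_positive = True  # first nonfamily sequence reached
        elif seen_false_positive:
            missed += 1                 # family member below a nonfamily one
    return missed

# Hypothetical ranking: family, family, nonfamily, family, nonfamily, nonfamily.
toy_scores = [0.9, 0.8, 0.5, 0.3, 0.2, 0.1]
toy_labels = [True, True, False, True, False, False]
missed = false_negatives_after_first_fp(toy_scores, toy_labels)
```

A biologist who stops reading the list at the first wrong prediction would miss exactly this many family members.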

Outline

• Introduction

• Empirical evaluation
  – Methodology
  – Results

• Conclusion

Families

– do not form a clique
– do not form a connected component
– have edges to sequences outside the family.

Edges to sequences outside the family are mainly a problem if they have strong edge weights

Test families have different graph properties

Family    Size   Clique   Connected / NOT Connected
ACSL      5      W
DVL       3      W
FOX       30     W
GATA      6      W
SEMA      16     W
TRAF      9      W
Tbox      6      W
WNT       19     W
Kinesin   18     S
Laminin   12     S
Myosin    12     S
Notch     4      S
ADAM      26              X
FGF       20              X
Kinase    293             X
PDE       15              X X
TNFR      24              X X
USP       18              X X

W: Edges to sequences outside the family have weak edge weights

S: Edges to sequences outside the family have strong edge weights

Results

Tested parameters

[Figure: Notch, LF = 1, LN = 1000; axis ticks σ = 0.1, 0.2, 0.5, 1, 10 and 0.9996, 0.9998]

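The effect of σ

[Figure: curves for σ = 100, 0.5, 0.2, 0.1, 0.08, 0.05 plotted against the raw similarity score (s), from 0 to 1]

$$E(f) = \sum_{i \in V} \sum_{j \in V} W_{ij}\,\bigl(f(i) - f(j)\bigr)^2,$$

where the edge weight $W_{ij}$ is an exponential function of the similarity $S_{ij}$, scaled by σ.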


Edges to sequences outside the family are mainly a problem if they have strong edge weights

[Figure panels: FOX; Notch. Axis label: raw edge weight]

Case study: Rank plots for semi-supervised learning in FOX

[Rank plot panels: σ = 0.1, σ = 1, σ = 10, σ = 100]

LF = 3, LN = 100, family size: 30

Case study: rank plots for semi-supervised learning in Notch

Labeled family seqs: 1 (out of 4); labeled nonfamily seqs: 100 (out of 7435)

[Rank plot panels: σ = 0.1, σ = 1, σ = 10, σ = 10]

[Figure panels: Notch, LF = 1, LN = 1000; FOX, LF = 3, LN = 1000. Axis ticks σ = 0.1, 0.2, 0.5, 1, 10 and 0.9996, 0.9998]

Summary on σ

• For most families, the performance is not very sensitive to σ.

• For almost all families that form a clique, there is at least one value of σ (usually many) such that both the semi-supervised and supervised learning algorithms have perfect classification performance.

Results

The connection among sequences in ADAM family

[Figure: # of connected ADAM sequences]

Tested parameters

By taking the maximum

achieve the best average AUC score

The impact of number of labeled family and nonfamily members on the performance

[Figure: performance against the number of labeled family seqs, LF = 3, 5, 7, 9, 15; curves: Supervised, LN = 100; Supervised, LN = 1000; Semi-supervised, LN = 100]

A paired t-test was performed to detect differences between the semi-supervised and supervised methods for each set of parameters.
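A paired t-test compares the two methods on the same test selections by looking at the per-test differences. A minimal sketch of the test statistic follows; the AUC values are made up for illustration and are not results from the talk, and the function name is hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the pairwise differences.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(n)), n - 1

# HYPOTHETICAL AUC scores from four paired tests (same labeled sets for both methods).
semi_auc = [0.96, 0.95, 0.97, 0.94]
supervised_auc = [0.95, 0.93, 0.96, 0.92]
t_stat, dof = paired_t_statistic(semi_auc, supervised_auc)
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value.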

Graph structure of ADAM

• Troublemaker: ADAMTS10 matches only 8 out of 26 sequences in the ADAM family.

• ADAMTS10 is often misclassified.

• ADAMTS10 is implicated in a genetic disease that causes impaired vision and heart defects.

Semi-supervised method

Supervised method

• Sequences in the same family
  – do not form a clique
  – do not exist in the same connected component

• Sequences in different families
  – have significant matches

The connection among sequences in TNFR family

[Figure: # of connected TNFR sequences, axis ticks 10 to 20; 20 TNFR in this connected component]

TNFR (family size 24)

[Figure: performance against the number of labeled family seqs, LF = 2, 4, 8, 12, 18; curves: Semi, LN = 100; Semi, LN = 1000; Supervised, LN = 100; Supervised, LN = 1000]

The impact of the number of labeled family and nonfamily members on the performance

Summary for Number of labeled family members

• The performance of both semi-supervised and supervised learning improves as LF increases for all families.

• In non-clique families, semi-supervised learning performs better than supervised when LF is small.

Rank plots for semi-supervised learning in TNFR

σ = 0.1, LF = 2, LN = 100

AUC values do not reflect all information that we need


Summary for Number of labeled non-family members (LN)

• The performance of supervised learning improves as LN increases for all families.

• For semi-supervised learning, a larger LN is sometimes helpful and sometimes not.

Summary of results

Clique

Connected

            Small LF                           Large LF
LN          100     1000    100     1000       100     1000    100     1000
Method      Super   Semi    Super   Semi       Super   Semi    Super   Semi

ACSL
DVL
FOX
GATA
Kinesin
Myosin
Laminin     0.9999  0.9999  0.9999  0.9999     1       1       1       1
Notch
SEMA
TRAF
Tbox
WNT
ADAM        0.9951  1       0.9989  1          1       1       1       1
FGF
Kinase      0.9549  0.9644  0.9745  0.9738     0.9656  0.9666  0.9804  0.9771
PDE         0.9181  0.9364  0.9644  0.9589     0.9612  0.9603  0.9850  0.9769
TNFR        0.9297  0.9420  0.9537  0.9526     0.9628  0.9671  0.9845  0.9866
USP         0.9792  0.9798  0.9907  0.9895     0.9827  0.9875  0.9900  0.9895

Insights - 1

• SSL is most effective for families that are not cliques but are connected.

• In the test set, 12/18 families are cliques and 3/18 are not connected.

• What fraction of protein families are cliques? Is the large number of cliques in the test set due to sample bias?

Insights - 2

• Performance evaluation measures should match the needs of the user.

– AUC scores penalize all FNs and FPs.

– For experimental biologists, top ranked predictions are of interest

– The number of FNs after the first false positive can reveal some information

Insights - 3

• The semi-supervised learning algorithm provides an appealing visualization tool for identifying family members, especially when the number of known family members is small.

Acknowledgements

• John Lafferty

• Dannie Durand

• Jerry Zhu

Durand Lab

• Robbie Sedgewick

• Rose Hoberman

• Ben Vernot

• Narayanan Raghupathy

• Aiton Goldman

• Jacob Joseph

• Annette McLeod

• Maureen Stolzer

Thank You !
