LSM3241: Bioinformatics and Biocomputing Lecture 3: Machine learning method for protein function...

Post on 12-Jan-2016


LSM3241: Bioinformatics and Biocomputing

Lecture 3: Machine learning method for protein function prediction

Prof. Chen Yu Zong

Tel: 6516-6877
Email: csccyz@nus.edu.sg
http://bidd.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore

Protein Function and Functional Family

Proteins of similar functional characteristics can be grouped into a family.


Functional Classification of Proteins by SVM

• A protein is classified as either belonging (+) or not belonging (-) to a functional family.

• By screening the protein against the SVMs of all families, its function can be identified (example: SVMProt).

[Figure: a protein is screened by the Family-1 SVM (-), the Family-2 SVM (-), and the Family-3 SVM (+); the protein belongs to Family-3.]
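The screening idea above can be sketched in a few lines of Python. This is only an illustration, not SVMProt's code: `linear_scorer`, `screen_families`, the weights, and the toy feature vector are all hypothetical, and a simple linear decision function stands in for each trained family SVM.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a trained per-family SVM: maps a feature vector
# to a decision score (> 0 means "belongs", otherwise "does not belong").
FamilyScorer = Callable[[Sequence[float]], float]


def linear_scorer(weights: Sequence[float], bias: float) -> FamilyScorer:
    """Build a linear decision function f(x) = w . x + b."""
    def score(x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return score


def screen_families(x: Sequence[float],
                    family_svms: Dict[str, FamilyScorer]) -> List[str]:
    """Return every family whose binary classifier labels x as a member (+)."""
    return [name for name, svm in family_svms.items() if svm(x) > 0]


# Made-up weights, chosen only so that the toy protein is rejected by
# Family-1 and Family-2 and accepted by Family-3, as in the slide's diagram.
family_svms = {
    "Family-1": linear_scorer([1.0, -1.0, 0.0], -0.5),
    "Family-2": linear_scorer([-1.0, 0.0, 1.0], -1.5),
    "Family-3": linear_scorer([0.0, 1.0, 1.0], -1.0),
}

protein = [0.0, 1.0, 1.0]  # illustrative feature vector, not real data
print(screen_families(protein, family_svms))  # ['Family-3']
```

Because every family has its own binary classifier, a protein may be assigned to several families or to none.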

What is SVM?
• Support vector machines (SVM) are a machine learning method (learning by examples, statistical learning) that classifies objects into one of two classes.

Advantages of SVM:
• Tolerates diversity among class members (no "racial discrimination").
• Uses sequence-derived physico-chemical features as the basis for classification.
• Suitable for functional classification of novel proteins (distantly-related proteins, homologous proteins of different functions).

Machine Learning Method

Inductive learning: example-based learning

[Figure: positive examples and negative examples, each described by descriptors.]

Feature vectors:

A = (1, 1, 1)
B = (0, 1, 1)
C = (1, 1, 1)
D = (0, 1, 1)
E = (0, 0, 0)
F = (1, 0, 1)

[Figure: the descriptors of each positive and negative example are encoded as a feature vector.]

SVM Method

Feature vectors in input space: A = (1, 1, 1), B = (0, 1, 1), C = (1, 1, 1), D = (0, 1, 1), E = (0, 0, 0), F = (1, 0, 1)

[Figure: the feature vectors plotted as points in a three-dimensional input space with axes X, Y, and Z.]

[Figure: protein family members and nonmembers with the border in the input space; after projection to a higher dimensional space, a new border separates members from nonmembers.]
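A toy example may help show why projecting to a higher dimensional space is useful. The sketch below uses made-up one-dimensional data (not from the lecture) in which no single cut point separates members from nonmembers, and an assumed mapping x -> (x, x^2) after which a straight border does.

```python
# Toy 1-D data (illustrative only): members sit on both sides of the nonmembers.
members = [-3.0, -2.0, 2.0, 3.0]   # "family members" (+)
nonmembers = [-1.0, 0.0, 1.0]      # "nonmembers" (-)


def separable_by_threshold(pos, neg):
    """True if a single cut point on the line puts pos and neg on opposite sides."""
    return max(pos) < min(neg) or max(neg) < min(pos)


def project(x):
    """Assumed mapping of a 1-D point into a 2-D space: x -> (x, x^2)."""
    return (x, x * x)


def linear_border(z, w=(0.0, 1.0), b=-2.5):
    """Linear decision function w . z + b in the projected space."""
    return w[0] * z[0] + w[1] * z[1] + b


print(separable_by_threshold(members, nonmembers))             # False
print(all(linear_border(project(x)) > 0 for x in members))     # True
print(all(linear_border(project(x)) < 0 for x in nonmembers))  # True
```

The border x^2 = 2.5 is a straight line in the projected space but corresponds to two cut points in the original space.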

[Figure: the new border between protein family members and nonmembers, with the support vectors marked.]
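The role of support vectors can be illustrated with a small numeric check. The sketch below assumes a hand-picked toy data set and an assumed border (the line x = 1); nothing here is taken from the lecture figures. The support vectors are the examples closest to the border, and they alone fix its position.

```python
import math

# Toy 2-D data (illustrative only): +1 = family member, -1 = nonmember.
data = [((2.0, 0.0), +1), ((3.0, 1.0), +1), ((0.0, 0.0), -1), ((-1.0, 1.0), -1)]

# Assumed border w . x + b = 0 (the vertical line x = 1), scaled so that the
# closest examples have functional margin y * (w . x + b) equal to 1.
w, b = (1.0, 0.0), -1.0


def functional_margin(x, y):
    """y * (w . x + b): positive when x lies on the correct side of the border."""
    return y * (w[0] * x[0] + w[1] * x[1] + b)


margins = {x: functional_margin(x, y) for x, y in data}
smallest = min(margins.values())

# Support vectors: the examples that lie exactly on the margin.
support_vectors = [x for x, m in margins.items() if math.isclose(m, smallest)]

# Geometric width of the gap between the two classes: 2 / ||w||.
margin_width = 2.0 / math.hypot(*w)

print(support_vectors)  # [(2.0, 0.0), (0.0, 0.0)]
print(margin_width)     # 2.0
```

Moving either of the other two points farther from the border would not change it, which is why only the support vectors matter.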


The border line is nonlinear.

Non-linear transformation: use of a kernel function.
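To make the kernel idea concrete, the sketch below checks that a kernel function gives the same value as a dot product in a higher dimensional space without ever constructing that space. The degree-2 polynomial kernel and the feature map `phi` are assumptions chosen for illustration (the slide does not name a kernel); the vectors A, B, and F are the feature vectors from the earlier slide.

```python
from itertools import product


def dot(u, v):
    """Ordinary dot product in the input space."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y)^2 (an assumed example kernel)."""
    return dot(x, y) ** 2


def phi(x):
    """Explicit map to the 9-D space of all pairwise products x_i * x_j."""
    return [xi * xj for xi, xj in product(x, repeat=2)]


A, B, F = (1, 1, 1), (0, 1, 1), (1, 0, 1)

for u, v in [(A, B), (A, F), (B, F)]:
    # Both numbers agree: the kernel equals the dot product after mapping.
    print(poly2_kernel(u, v), dot(phi(u), phi(v)))
```

The kernel needs only a 3-term dot product, whereas the explicit map works with 9 coordinates; the saving grows quickly with dimension.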


SVM for Classification of Proteins

How to represent a protein?

• Each sequence is represented by a specific feature vector assembled from encoded representations of tabulated residue properties:
– amino acid composition
– hydrophobicity
– normalized van der Waals volume
– polarity
– polarizability
– charge
– surface tension
– secondary structure
– solvent accessibility

• Three descriptors, composition (C), transition (T), and distribution (D), are used to describe the global composition of each of these properties.

Nucleic Acids Res., 31: 3692-3697
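The three descriptors can be sketched for a single property as follows. This is a minimal illustration under stated assumptions, not the SVMProt implementation: the three-class hydrophobicity grouping (polar, neutral, hydrophobic) is the one commonly used with these descriptors, the rounding rule in `distribution` is one possible convention, and all function names are hypothetical. Only the 20 standard residues are handled.

```python
import math
from itertools import combinations

# Assumed three-class grouping of residues by hydrophobicity.
GROUPS = {"polar": "RKEDQN", "neutral": "GASTPHY", "hydrophobic": "CLVIMFW"}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def encode(sequence):
    """Replace each residue by the name of its property class."""
    return [CLASS_OF[aa] for aa in sequence.upper()]


def composition(classes):
    """C: fraction of residues falling in each class."""
    return [classes.count(g) / len(classes) for g in GROUPS]


def transition(classes):
    """T: fraction of adjacent residue pairs that switch between two classes."""
    pairs = list(zip(classes, classes[1:]))
    if not pairs:  # a single-residue sequence has no adjacent pairs
        return [0.0] * 3
    return [
        sum(1 for p in pairs if set(p) == {g1, g2}) / len(pairs)
        for g1, g2 in combinations(GROUPS, 2)
    ]


def distribution(classes):
    """D: relative positions of the first, 25%, 50%, 75%, and last residue of each class."""
    n = len(classes)
    values = []
    for g in GROUPS:
        positions = [i + 1 for i, c in enumerate(classes) if c == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                values.append(0.0)
                continue
            k = max(1, math.ceil(frac * len(positions)))  # assumed rounding rule
            values.append(positions[k - 1] / n)
    return values


def ctd(sequence):
    """Concatenate C (3 values), T (3 values), and D (15 values) for one property."""
    classes = encode(sequence)
    return composition(classes) + transition(classes) + distribution(classes)


print(len(ctd("RKLLGA")))  # 21
```

For the toy peptide `RKLLGA`, each class holds two of the six residues, and there is one polar-to-hydrophobic and one hydrophobic-to-neutral switch among the five adjacent pairs.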

From protein sequence to feature vector:

(C_amino acid composition, T_amino acid composition, D_amino acid composition, C_hydrophobicity, T_hydrophobicity, D_hydrophobicity, … )
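Assembling the feature vector amounts to concatenating the per-property descriptor blocks in a fixed order. The sketch below shows this for composition values only, using the 20-residue amino acid composition plus assumed three-class groupings for hydrophobicity and charge; the groupings and function names are illustrative assumptions, and the transition and distribution blocks are left out for brevity.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed three-class groupings per property (illustrative, not from the slide).
PROPERTY_GROUPS = {
    "hydrophobicity": ("RKEDQN", "GASTPHY", "CLVIMFW"),  # polar, neutral, hydrophobic
    "charge": ("KR", "ANCQGHILMFPSTWYV", "DE"),          # positive, neutral, negative
}


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def class_composition(seq, groups):
    """Fraction of residues in each class of one property."""
    return [sum(seq.count(aa) for aa in g) / len(seq) for g in groups]


def feature_vector(sequence):
    """Concatenate the descriptor blocks of all properties in a fixed order."""
    seq = sequence.upper()
    vector = amino_acid_composition(seq)
    for groups in PROPERTY_GROUPS.values():
        vector += class_composition(seq, groups)
    return vector


vec = feature_vector("RKLLGA")
print(len(vec))  # 26 = 20 + 3 + 3
```

Every protein, whatever its length, is mapped to a vector of the same size, which is what lets one SVM compare sequences of different lengths.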

Protein function prediction software SVMProt

Useful for functional prediction of novel proteins, distantly-related proteins, and homologous proteins of different functions.

[Figure: SVMProt workflow. Your protein sequence is input either through the internet or on a local machine (Options 1 and 2) to a computer loaded with SVMProt. The sequence is sent to the classifier, a support vector machine for every protein functional family, and the identified functional families serve as protein functional indications.]

Nucl. Acids Res. 31, 3692-3697 (2003)

http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi

Your protein sequence: which functional families does your protein belong to?

Protein families covered: 46 enzyme families, 3 receptor families, 4 transporter and channel families, 6 DNA- and RNA-binding families, 8 structural families, and 2 regulator/factor families.

SVMProt web version: http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi

[Figure: SVMProt input page, with labels marking where to check the covered protein families, where to input the sequence, and where to check the input format.]

[Figure: SVMProt output page, with labels marking the prediction score and the probability of correct prediction.]

Summary of Today’s Lecture

• Machine learning method for protein function prediction.

• Use of SVMProt for probing protein function.