
Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks

Yetian Chen

2008-12-12

2008 Nobel Prize in Chemistry

Roger Tsien, Osamu Shimomura, Martin Chalfie

Green Fluorescent Protein (GFP)

Use GFP to track a protein in living cells

The cellular localization information of a protein is embedded in its amino acid sequence

PKKKRKV: Nuclear Localization Signal

VALLAL: transmembrane segment

Challenge: predict the cellular localization site of a protein from its amino acid sequence

Extracting cellular localization information from protein sequence

mcg: McGeoch's method for signal sequence recognition.

gvh: von Heijne's method for signal sequence recognition.

alm: Score of the ALOM membrane spanning region prediction program.

mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.

erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.

pox: Peroxisomal targeting signal in the C-terminus.

vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.

nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.

Problem Statement & Datasets

Protein Name   mcg   gvh   lip   chg   aac   alm1  alm2  Location
EMRB_ECOLI     0.71  0.52  0.48  0.50  0.64  1.00  0.99  cp
ATKC_ECOLI     0.85  0.53  0.48  0.50  0.53  0.52  0.35  im
SNFRB_ECOLI    0.63  0.49  0.48  0.50  0.54  0.76  0.79  im

Dataset 1: 336 proteins from E.coli (Prokaryote Kingdom)
http://archive.ics.uci.edu/ml/datasets/Ecoli

Dataset 2: 1484 proteins from yeast (Eukaryote Kingdom)
http://archive.ics.uci.edu/ml/datasets/Yeast
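
For completeness, a minimal loading sketch in Python: the row layout matches the table above (protein name, numeric attribute scores, localization site), but the exact raw-file path under machine-learning-databases is an assumption about the UCI archive's layout, and the function name is mine.

import urllib.request

# Load a UCI data file (assumed raw-file path; whitespace-separated rows:
# protein name, numeric attribute scores, localization site).
ECOLI_URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data"

def load_uci(url):
    rows = []
    with urllib.request.urlopen(url) as f:
        for line in f:
            parts = line.decode().split()
            if parts:                                  # skip blank lines
                name, label = parts[0], parts[-1]
                feats = [float(v) for v in parts[1:-1]]
                rows.append((name, feats, label))
    return rows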

Implementation of AI algorithms

Decision Tree

> C5

Neural Network

> Single-layer feed-forward NN: Perceptrons

> Multilayer feed-forward NN: one hidden layer

Implementation of Decision Tree: C5

Preprocessing of Dataset

> If the attribute is continuous, divide its range into 5 equal-width bins: tiny, small, medium, large, huge. Then discretize the data points into these bins (see the sketch below).

> If a feature value is missing ("?"), replace the "?" with tiny.
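
A minimal sketch of this discretization in Python; the bin labels follow the slide, the equal-width cut points are computed per attribute, and the function and constant names are mine.

# Equal-width binning of one attribute's values; "?" marks a missing value.
BINS = ["tiny", "small", "medium", "large", "huge"]

def discretize(values):
    nums = [v for v in values if v != "?"]
    lo, hi = min(nums), max(nums)
    width = (hi - lo) / len(BINS) or 1.0          # guard against a constant column
    labels = []
    for v in values:
        if v == "?":
            labels.append("tiny")                 # missing value -> tiny
        else:
            idx = min(int((v - lo) / width), len(BINS) - 1)
            labels.append(BINS[idx])
    return labels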

Generating training set and test set

> Randomly split the dataset into a training set (70%) and a test set (30%), as in the sketch below.
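
A corresponding split sketch (function name and seed handling are mine):

import random

def split_70_30(dataset, seed=None):
    data = list(dataset)
    random.Random(seed).shuffle(data)    # new random split per run
    cut = int(0.7 * len(data))
    return data[:cut], data[cut:]        # 70% training, 30% test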

Learning the Decision Tree

> using the decision-tree learning algorithm in Chapter 18.3 of the textbook (see the sketch below)
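
A compact sketch of that textbook-style learner (information-gain splitting over the discretized attributes), assuming each example is a (features_dict, label) pair; the tree representation and helper names are mine.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attr):
    base = entropy([y for _, y in examples])
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attr], []).append(y)
    remainder = sum(len(s) / len(examples) * entropy(s) for s in by_value.values())
    return base - remainder

def learn_tree(examples, attrs, default=None):
    labels = [y for _, y in examples]
    if not examples:
        return default                                # no examples left
    if len(set(labels)) == 1:
        return labels[0]                              # pure node
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # out of attributes
    best = max(attrs, key=lambda a: information_gain(examples, a))
    majority = Counter(labels).most_common(1)[0][0]
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attrs if a != best]
        branches[value] = learn_tree(subset, rest, majority)
    return (best, branches)                           # internal node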

Testing

Implementation of Neural Networks: Structure of Perceptrons and Two-layer NN

[Figure: network structures, Perceptrons (left) and Two-layer NN (right). The input nodes take the attribute values (Att 1, Att 2, Att 3, Att 4, ...) of a protein, e.g. a row of the E.coli table above; there is one output node per localization class (cp, imS, im, ...). The desired output is one-hot: 1 at the true class and 0 elsewhere (e.g. cp → 1, imS → 0, im → 0).]

Implementation of Perceptrons & Two-layer NN: Algorithms

Output of output node j, where g is the sigmoid function:

$O_j = g\left(\sum_{i=0}^{n} w[i][j]\, x_i(e)\right)$

Prediction: the class whose output node is maximal, $r = \arg\max_j O_j$.

Error at output node j:

$Err_j = y_j(e) - g\left(\sum_{i=0}^{n} w[i][j]\, x_i(e)\right)$

Weight update, using $g'(in) = g(in)\,(1 - g(in))$ for the sigmoid:

$w[i][j] \leftarrow w[i][j] + Err_j \cdot g\left(\sum_{i=0}^{n} w[i][j]\, x_i(e)\right)\left(1 - g\left(\sum_{i=0}^{n} w[i][j]\, x_i(e)\right)\right) \cdot x_i(e)$

Function PERCEPTRONS-LEARNING(examples, network)
    set correct = 0
    initialize the weight matrix w[i][j] with random numbers in [-0.5, 0.5]
    while (correct < threshold)    // threshold = 0.0, 0.1, 0.2, ..., 1.0
        for each e in examples do
            calculate the output O_j of each output node    // g() is the sigmoid function
            prediction = r such that O_r = max_j O_j
            if r != y(e) then
                for each output node j
                    compute Err_j
                    for i = 0, ..., n: apply the weight update above
                endfor
            endif
        endfor
    endwhile
    return w[i][j]
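
A runnable sketch of the pseudocode above in Python/numpy; it assumes a bias input x_0 = 1 in each row of X and one-hot targets Y, and, following the slide's update rule, uses no explicit learning rate. The epoch cap is my addition so the loop always terminates.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptrons_learning(X, Y, threshold=0.7, max_epochs=1000, seed=0):
    """X: (m, n) inputs with a bias column of 1s; Y: (m, k) one-hot targets."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-0.5, 0.5, size=(X.shape[1], Y.shape[1]))    # w[i][j]
    for _ in range(max_epochs):
        correct = 0
        for x, y in zip(X, Y):
            out = sigmoid(x @ W)                  # O_j = g(sum_i w[i][j] x_i(e))
            if np.argmax(out) == np.argmax(y):    # prediction r = argmax_j O_j
                correct += 1
            else:
                err = y - out                     # Err_j
                W += np.outer(x, err * out * (1.0 - out))   # g'(in) = g(1 - g)
        if correct / len(X) >= threshold:         # stop at the accuracy threshold
            break
    return W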

Function TWO-LAYER-NN-LEARNING(examples, network)

> uses the BACK-PROP-LEARNING algorithm in Chapter 20.5 of the textbook (a sketch follows)
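
In the same spirit, a minimal back-propagation sketch for the one-hidden-layer network (Python/numpy). This is standard stochastic backprop with sigmoid units rather than a line-by-line copy of the textbook's figure; the learning rate alpha and the epoch cap are my assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_nn_learning(X, Y, hidden=5, alpha=0.1, threshold=0.55,
                          max_epochs=5000, seed=0):
    """X: (m, n) inputs with a bias column of 1s; Y: (m, k) one-hot targets."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(X.shape[1], hidden))    # input -> hidden
    W2 = rng.uniform(-0.5, 0.5, size=(hidden, Y.shape[1]))    # hidden -> output
    for _ in range(max_epochs):
        correct = 0
        for x, y in zip(X, Y):
            h = sigmoid(x @ W1)                           # hidden activations
            out = sigmoid(h @ W2)                         # output activations
            if np.argmax(out) == np.argmax(y):
                correct += 1
            delta_out = (y - out) * out * (1.0 - out)     # output-layer deltas
            delta_hid = h * (1.0 - h) * (W2 @ delta_out)  # back-propagated deltas
            W2 += alpha * np.outer(h, delta_out)
            W1 += alpha * np.outer(x, delta_hid)
        if correct / len(X) >= threshold:                 # same stopping rule
            break
    return W1, W2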

Results

Accuracy comparison

Dataset   Decision Tree   Perceptrons                    Two-layer NN (5 hidden nodes)   Majority
E.coli    68.04±5.03%     66.76±6.34% (threshold=0.7)    65.68±6.09% (threshold=0.7)     45.05%
Yeast     46.63±2.55%     50.41±2.74% (threshold=0.5)    50.28±2.23% (threshold=0.55)    28.82%

• The Decision Tree statistics are averages over 100 runs.

• The Perceptron and Two-layer NN statistics are averages over 50 runs.

• The threshold is the termination condition (training-set accuracy) for training the neural networks.

Conclusions

The two datasets are not linearly separable.

For the E.coli dataset, the decision tree, perceptrons, and two-layer NN achieve similar accuracy.

For the yeast dataset, perceptrons and the two-layer NN achieve slightly better accuracy than the decision tree.

All three AI algorithms achieve much better accuracy than the simple majority baseline.

Future work

> Probabilistic model
> Bayesian network
> K-Nearest Neighbor
> …

A protein localization sites prediction scheme

[Scheme: attribute scores (mcg, gvh, alm, mit, erl, pox, vac, nuc) → Classifiers → prediction]

Such a scheme could guide experimental design and biological research, saving much labor and time!