Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks...
Predicting the Cellular Localization Sites of Proteins Using Decision Tree and Neural Networks
Yetian Chen
2008-12-12
2008 Nobel Prize in Chemistry
Roger Tsien, Osamu Shimomura, Martin Chalfie
Green Fluorescent Protein (GFP)
Use GFP to track a protein in living cells
The cellular localization information of a protein is embedded in its amino acid sequence
PKKKRKV: Nuclear Localization Signal
VALLAL: transmembrane segment
Challenge: predict the cellular localization sites from the amino acid sequence of a protein
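The two example signals above are short substrings, so locating them is a plain string search. A minimal sketch, assuming exact matching only; the function name `find_motif` and the test sequence are hypothetical and are not taken from the slides:

```python
def find_motif(sequence: str, motif: str) -> list:
    """Return every 0-based start position where motif occurs in sequence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Continue searching one position further so overlapping hits are found.
        start = sequence.find(motif, start + 1)
    return positions


# Hypothetical sequence fragment containing the nuclear localization signal.
example_sequence = "MAPKKKRKVEDGS"
print(find_motif(example_sequence, "PKKKRKV"))  # [2]
print(find_motif(example_sequence, "VALLAL"))   # []
```

Real signals are rarely exact matches, which is why the features on the next slide are scores from dedicated recognition methods rather than simple substring tests.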
Extracting cellular localization information from protein sequence
mcg: McGeoch's method for signal sequence recognition.
gvh: von Heijne's method for signal sequence recognition.
alm: Score of the ALOM membrane spanning region prediction program.
mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.
erl: Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.
pox: Peroxisomal targeting signal in the C-terminus.
vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.
nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.
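Of the attributes listed, erl is the only one described fully enough to compute directly: a binary flag for the presence of the "HDEL" substring. A minimal sketch under that reading; the function name `erl_feature` and the sequences are hypothetical:

```python
def erl_feature(sequence: str, motif: str = "HDEL") -> int:
    """Binary attribute: 1 if the motif occurs anywhere in the sequence, else 0."""
    return int(motif in sequence.upper())


# Hypothetical sequences, used only to exercise the function.
print(erl_feature("MKTAYIAKQRHDEL"))  # 1
print(erl_feature("MKTAYIAKQR"))      # 0
```

The other attributes are scores produced by external methods (McGeoch, von Heijne, ALOM, discriminant analysis), so they are taken as given inputs rather than recomputed here.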
Problem Statement & Datasets
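| Protein Name | mcg | gvh | lip | chg | aac | alm1 | alm2 | Location |
|---|---|---|---|---|---|---|---|---|
| EMRB_ECOLI | 0.71 | 0.52 | 0.48 | 0.50 | 0.64 | 1.00 | 0.99 | cp |
| ATKC_ECOLI | 0.85 | 0.53 | 0.48 | 0.50 | 0.53 | 0.52 | 0.35 | imS |
| NFRB_ECOLI | 0.63 | 0.49 | 0.48 | 0.50 | 0.54 | 0.76 | 0.79 | im |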
Dataset 1: 336 proteins from E. coli (Prokaryote Kingdom): http://archive.ics.uci.edu/ml/datasets/Ecoli
Dataset 2: 1484 proteins from yeast (Eukaryote Kingdom): http://archive.ics.uci.edu/ml/datasets/Yeast
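Each record in the sample shown on this slide is a protein name, a run of numeric attributes, and a location label. A minimal parsing sketch, assuming whitespace-separated fields in that order; the `Protein` type and `parse_row` are hypothetical names, and the three rows are the ones shown on the slide:

```python
from typing import List, NamedTuple


class Protein(NamedTuple):
    name: str
    features: List[float]
    location: str


def parse_row(line: str) -> Protein:
    """Split one record into name, numeric attributes, and class label."""
    fields = line.split()
    return Protein(fields[0], [float(v) for v in fields[1:-1]], fields[-1])


SAMPLE = """\
EMRB_ECOLI 0.71 0.52 0.48 0.50 0.64 1.00 0.99 cp
ATKC_ECOLI 0.85 0.53 0.48 0.50 0.53 0.52 0.35 imS
NFRB_ECOLI 0.63 0.49 0.48 0.50 0.54 0.76 0.79 im
"""

proteins = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
for p in proteins:
    print(p.name, len(p.features), p.location)
```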
Implementation of AI algorithms
Decision Tree
> C5
Neural Network
> Single-layer feed-forward NN: Perceptrons
> Multilayer feed-forward NN: one hidden layer
Implementation of Decision Tree: C5
Preprocessing of Dataset
> If a feature is linear and continuous, divide its range into 5 equal-width bins (tiny, small, medium, large, huge), then discretize each value into its bin.
> If a feature value is missing (?), replace ? with tiny.
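The preprocessing rule can be sketched as follows. This is an illustration under assumptions, not the author's code: the bin range is taken from the observed minimum and maximum of the feature, the maximum is assigned to the last bin, and `discretize` is a hypothetical name.

```python
BIN_LABELS = ["tiny", "small", "medium", "large", "huge"]


def discretize(values):
    """Map a column of floats (or "?" for missing) to five equal-width bin labels."""
    known = [v for v in values if v != "?"]
    low, high = min(known), max(known)
    width = (high - low) / len(BIN_LABELS)
    labels = []
    for v in values:
        if v == "?" or width == 0:
            # Missing values become "tiny"; a constant column also collapses to one bin.
            labels.append(BIN_LABELS[0])
        else:
            # Clamp so that the maximum value falls in the last bin.
            index = min(int((v - low) / width), len(BIN_LABELS) - 1)
            labels.append(BIN_LABELS[index])
    return labels


print(discretize([0.0, 0.25, 0.5, 0.75, 1.0, "?"]))
# ['tiny', 'small', 'medium', 'large', 'huge', 'tiny']
```

Mapping missing values to tiny follows the slide; it does conflate "unknown" with "very low", which is worth keeping in mind when reading the learned tree.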
Generating training set and test set
> Randomly split the data set into a training set (70%) and a test set (30%).
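A random 70/30 split can be sketched with the standard library. The rounding of the cut point and the `seed` parameter are assumptions added for reproducibility; the slide only gives the proportions.

```python
import random


def train_test_split(examples, train_fraction=0.7, seed=None):
    """Shuffle a copy of the examples and cut it into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 336 is the size of the E. coli dataset; the integers stand in for proteins.
train, test = train_test_split(range(336), seed=0)
print(len(train), len(test))  # 235 101
```

Repeating the split with different seeds gives the independent runs over which accuracy can be averaged.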
Learning the Decision Tree
> Use the decision tree learning algorithm in Chapter 18.3 of the textbook.
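The slide does not reproduce the textbook algorithm. Assuming it chooses the attribute to split on by information gain, as ID3-style learners do, the selection step might look like the sketch below. The example data and the names `entropy` and `information_gain` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(examples, attribute):
    """Entropy reduction from splitting (features, label) pairs on one attribute."""
    groups = {}
    for features, label in examples:
        groups.setdefault(features[attribute], []).append(label)
    remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
    return entropy([label for _, label in examples]) - remainder


# Hypothetical discretized examples: mcg separates the classes, gvh does not.
examples = [
    ({"mcg": "tiny", "gvh": "small"}, "cp"),
    ({"mcg": "tiny", "gvh": "huge"}, "cp"),
    ({"mcg": "huge", "gvh": "small"}, "im"),
    ({"mcg": "huge", "gvh": "huge"}, "im"),
]
best = max(["mcg", "gvh"], key=lambda a: information_gain(examples, a))
print(best)  # mcg
```

The tree is then grown by applying the same selection recursively to each subset of examples.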
Testing
Implementation of Neural Networks
Structure of Perceptrons and two-layer NN

| Protein Name | mcg | gvh | lip | chg | aac | alm1 | alm2 | Location |
|---|---|---|---|---|---|---|---|---|
| EMRB_ECOLI | 0.71 | 0.52 | 0.48 | 0.50 | 0.64 | 1.00 | 0.99 | cp |
| ATKC_ECOLI | 0.85 | 0.53 | 0.48 | 0.50 | 0.53 | 0.52 | 0.35 | imS |
| NFRB_ECOLI | 0.63 | 0.49 | 0.48 | 0.50 | 0.54 | 0.76 | 0.79 | im |

[Figure: two network diagrams, labeled "Perceptrons" and "Two-layer NN". In each, the inputs are Att 1 to Att 4 and the output nodes are cp, imS, and im; the desired output shown is (1, 0, 0).]
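The desired output of 1, 0, 0 over the output nodes cp, imS, im reads as a one-hot encoding of the location label: the node for the true class is 1 and all others are 0. A minimal sketch of that encoding; `one_hot` is a hypothetical helper name:

```python
def one_hot(location, classes):
    """Desired output vector: 1 for the true class, 0 for every other output node."""
    return [1 if c == location else 0 for c in classes]


classes = ["cp", "imS", "im"]
print(one_hot("cp", classes))   # [1, 0, 0]
print(one_hot("imS", classes))  # [0, 1, 0]
```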
Implementation of Perceptrons & Two-layer NN: Algorithms

Function PERCEPTRONS-LEARNING(examples, network)
    initially set correct = 0
    initialize the weight matrix w[i][j] with random numbers within [-0.5, 0.5]
    while (correct < threshold)    // threshold = 0.0, 0.1, 0.2, …, 1.0
        for each e in examples do
            calculate the output of each output node j:    // g() is the sigmoid function
                $O_j = g\left(\sum_{k=0}^{n} w[k][j]\, x_k[e]\right)$
            prediction = r such that $O_r = \max_j (O_j)$
            if r != y(e)
                for each output node j
                    for i = 0, …, n
                        $w[i][j] \leftarrow w[i][j] + \alpha \cdot Err_j \cdot x_i[e] \cdot g\left(\sum_{k=0}^{n} w[k][j]\, x_k[e]\right) \cdot \left(1 - g\left(\sum_{k=0}^{n} w[k][j]\, x_k[e]\right)\right)$
                    endfor
                endfor
            endif
        endfor
        update correct    // fraction of examples predicted correctly in this pass
    endwhile
    return w[i][j]

where $Err_j = y_j(e) - g\left(\sum_{k=0}^{n} w[k][j]\, x_k[e]\right)$ and $\alpha$ is the learning rate.

2-layer NN(examples, network)
    Uses BACK-PROP-LEARNING from Chapter 20.5 of the textbook.
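The PERCEPTRONS-LEARNING pseudocode above can be turned into a short runnable sketch. Several details are assumptions rather than the author's settings: the learning rate `alpha`, the `max_epochs` cap (added so the loop always terminates), updating `correct` as the fraction of examples predicted correctly during each pass, and the convention that `x[0]` is a constant bias input of 1.0. The toy data is hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def node_outputs(w, x):
    """O_j = g(sum_k w[k][j] * x[k]) for every output node j."""
    return [sigmoid(sum(w[k][j] * x[k] for k in range(len(x))))
            for j in range(len(w[0]))]


def predict(w, x):
    """Index r of the output node with the largest output O_r."""
    o = node_outputs(w, x)
    return max(range(len(o)), key=o.__getitem__)


def perceptrons_learning(examples, n_classes, threshold,
                         alpha=0.1, max_epochs=1000, seed=0):
    """examples: list of (x, y) with x[0] == 1.0 (bias) and y a class index."""
    rng = random.Random(seed)
    n_inputs = len(examples[0][0])
    w = [[rng.uniform(-0.5, 0.5) for _ in range(n_classes)]
         for _ in range(n_inputs)]
    correct, epochs = 0.0, 0
    while correct < threshold and epochs < max_epochs:
        hits = 0
        for x, y in examples:
            o = node_outputs(w, x)
            r = max(range(n_classes), key=o.__getitem__)
            if r == y:
                hits += 1
                continue
            # Misclassified: update every weight with the sigmoid delta rule.
            for j in range(n_classes):
                err = (1.0 if j == y else 0.0) - o[j]
                slope = o[j] * (1.0 - o[j])  # g(in) * (1 - g(in))
                for i in range(n_inputs):
                    w[i][j] += alpha * err * x[i] * slope
        correct = hits / len(examples)
        epochs += 1
    return w


# Hypothetical toy problem: one attribute plus the bias input, two classes.
toy = [([1.0, 0.0], 0), ([1.0, 1.0], 1)]
weights = perceptrons_learning(toy, n_classes=2, threshold=1.0)
print([predict(weights, x) for x, _ in toy])
```

With `threshold=0.0` the loop body never runs and the random initial weights are returned unchanged, which matches the pseudocode's `correct < threshold` test.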
Results
Accuracy comparison
| Dataset | Decision Tree | Perceptrons | Two-layer NN (hidden nodes: 5) | Majority |
|---|---|---|---|---|
| E. coli | 68.04±5.03% | 66.76±6.34% (Threshold=0.7) | 65.68±6.09% (Threshold=0.7) | 45.05% |
| Yeast | 46.63±2.55% | 50.41±2.74% (Threshold=0.5) | 50.28±2.23% (Threshold=0.55) | 28.82% |
•The statistics for Decision Tree are averages over 100 runs
•The statistics for Perceptrons and Two-layer NN are averages over 50 runs
•Threshold is the termination condition for training the neural networks
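The "mean ± spread over runs" figures and the majority baseline can be computed as sketched below. Two points are assumptions: that the ± values are sample standard deviations, and that the majority class is chosen on the training set and scored on the test set. The labels and accuracies in the example are made up for illustration and are not the reported results.

```python
import statistics
from collections import Counter


def majority_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for label in test_labels if label == majority) / len(test_labels)


def summarize(accuracies):
    """Mean and sample standard deviation of per-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Made-up labels and per-run accuracies, for illustration only.
baseline = majority_accuracy(["cp", "cp", "im"], ["cp", "im", "im", "cp"])
mean_acc, std_acc = summarize([0.6, 0.7, 0.8])
print(f"majority baseline: {baseline:.2%}")
print(f"accuracy: {mean_acc:.2%} ± {std_acc:.2%}")
```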
Conclusions
The two datasets are linearly inseparable.
For the E. coli dataset, DT, Perceptrons, and Two-layer NN achieve similar accuracy (65.68% to 68.04%).
For the yeast dataset, Perceptrons and Two-layer NN achieve slightly better accuracy (50.41% and 50.28%) than DT (46.63%).
All three AI algorithms are much more accurate than the simple majority algorithm (45.05% for E. coli, 28.82% for yeast).
Future work
Probabilistic model
Bayesian network
K-Nearest Neighbor
……
A protein localization site prediction scheme
Features (mcg, gvh, alm, mit, erl, pox, vac, nuc) → Classifiers → Prediction
Guides experimental design and biological research, saving much labor and time!