2. Motivation
Label unlabeled DNA sequences with a model built from the labeled DNA
sequences, and gain experience with real-world machine learning problems.
3. Approaches
- K-mer based: fixed-length K-mer, K-mer with mismatches, K-mer using regular expressions
- PWM based: MEME and MAST
- Combined model: unites both models
4. K-mer Approach Based on Regular Expression
Motivation - 2-mers appear most often in the sequences, so the emphasis is on 2-mers.
Strategy -
- For any two 2-mers X and Y, generate the regular expressions X(.*)Y and Y(.*)X.
- Use these regular expressions as candidate attributes.
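The strategy above can be sketched in a few lines of Python. This is only an illustration, not the scripts used in the project: the function names are hypothetical, the alphabet is assumed to be A, C, G, T, and "any two 2-mers" is read here as every pair of distinct 2-mers.

```python
import re
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def two_mers():
    """Return all 16 possible 2-mers over the alphabet."""
    return ["".join(p) for p in product(ALPHABET, repeat=2)]


def candidate_patterns(kmers):
    """For every pair of distinct k-mers X, Y build X(.*)Y and Y(.*)X."""
    patterns = []
    for x, y in combinations(kmers, 2):
        patterns.append(f"{x}(.*){y}")
        patterns.append(f"{y}(.*){x}")
    return patterns


def feature_vector(sequence, patterns):
    """1 if the pattern occurs anywhere in the sequence, else 0."""
    return [1 if re.search(p, sequence) else 0 for p in patterns]


patterns = candidate_patterns(two_mers())
vector = feature_vector("ACGGTTAC", patterns)
```

Each position of `vector` is one candidate attribute; a later selection step would decide which of them are kept.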
5. Classifier Selection
Fig : Nine classifiers applied on the TF data set
Algorithms are numbered as follows - (1) Logistic (2) SMO (3) NaiveBayes
(4) BayesianLogisticRegression (5) KStar (6) Bagging (7) LogitBoost
(8) RandomForest (9) J48
Summary -
* 9 classifiers are applied on 10 data sets; 3 of them are shown
* choosing a single best classifier is not a trivial task
* the same classifier behaves differently on different data sets
6. Change in Accuracy due to Different Classifiers
Fig : The performance of different types of classifiers (Logistic, J48,
RandomForest, NaiveBayes) on the TF_3 data set
Fig : The performance of different types of classifiers (Logistic, J48,
RandomForest, NaiveBayes) on the TF_5 data set
Summary -
* the choice of classifier has a large effect on accuracy
* classifiers have to be chosen carefully
7. Change in Accuracy due to Different K-mer Length
Fig : The performance of different K-mer lengths (4-mer, 5-mer, 6-mer) on
the TF_3 data set
Summary -
* K-mer length also affects accuracy
* finding the single best length is not trivial
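For readers unfamiliar with fixed-length K-mer attributes, the sketch below shows one common way to turn sequences into count vectors for a given length k. It is an assumption about how such features are typically built, not the project's actual code; the function names and the two toy sequences are made up for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping window of length k in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_feature_table(sequences, k):
    """Return (sorted k-mer vocabulary, one row of counts per sequence)."""
    counts = [kmer_counts(s, k) for s in sequences]
    vocabulary = sorted(set().union(*counts)) if counts else []
    rows = [[c[m] for m in vocabulary] for c in counts]
    return vocabulary, rows


# Compare how large the attribute space becomes for each K-mer length.
for k in (4, 5, 6):
    vocab, rows = kmer_feature_table(["ACGTACGT", "TTACGTAA"], k)
    print(k, len(vocab))
```

Each row can then be written out as one instance for the classifier, with the K-mer length treated as a parameter to tune per data set.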
8. Attribute Space Selection
Fig : The performance of different numbers of selected k-mers on the TF_4
data set
Summary -
* the number of attributes considered also affects accuracy
* accuracy increases as more attributes are considered, but beyond a
saturation point it decreases
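The slides do not say how the k-mers were ranked before the top ones were kept. As a hedged illustration only, the sketch below ranks k-mers by how differently they occur in positive and negative sequences and keeps the first n; the scoring rule, the function names, and the toy data are all assumptions. Both input lists are assumed to be non-empty.

```python
from collections import Counter


def kmers_in(sequence, k):
    """Set of distinct k-mers that occur in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rank_kmers(positives, negatives, k):
    """Rank k-mers by |fraction of positives - fraction of negatives| containing them."""
    pos_df, neg_df = Counter(), Counter()
    for s in positives:
        pos_df.update(kmers_in(s, k))
    for s in negatives:
        neg_df.update(kmers_in(s, k))

    def score(kmer):
        return abs(pos_df[kmer] / len(positives) - neg_df[kmer] / len(negatives))

    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(set(pos_df) | set(neg_df), key=lambda m: (-score(m), m))


def top_attributes(positives, negatives, k, n):
    """Keep only the n highest-ranked k-mers as attributes."""
    return rank_kmers(positives, negatives, k)[:n]


selected = top_attributes(["AAAA", "AAAC"], ["CCCC", "CCCA"], k=2, n=2)
```

Sweeping `n` and re-evaluating the classifier at each value is one way to locate the saturation point mentioned in the summary.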
9. PWM based Analysis on Accuracy (TF_1 data set)
Fig : J48, minW 6 - maxW 15, no. of sites 10
Fig : J48, minW 6 - maxW 15, no. of motifs 5
Summary -
* accuracy increases with more motifs at a fixed no. of sites
* accuracy increases with more sites at a fixed no. of motifs
* what happens when we increase both?
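One way to explore the question of varying both parameters is to run MEME over a grid of motif counts and site counts. The sketch below only builds the command lines and does not run them. It assumes MEME is invoked from the command line with its `-dna`, `-oc`, `-nmotifs`, `-nsites`, `-minw` and `-maxw` options (worth checking against the installed version); the file name, output directory names, and grid values are hypothetical.

```python
from itertools import product


def meme_command(fasta_path, out_dir, n_motifs, n_sites, min_w=6, max_w=15):
    """Build (but do not run) a MEME command line for one parameter setting."""
    return [
        "meme", fasta_path, "-dna", "-oc", out_dir,
        "-nmotifs", str(n_motifs), "-nsites", str(n_sites),
        "-minw", str(min_w), "-maxw", str(max_w),
    ]


def parameter_grid(motif_counts, site_counts):
    """Every (no. of motifs, no. of sites) combination to try."""
    return list(product(motif_counts, site_counts))


# Hypothetical grid: the values below are placeholders, not the ones used in the slides.
commands = [
    meme_command("TF_1.fasta", f"meme_m{m}_s{s}", m, s)
    for m, s in parameter_grid([5, 10], [10, 20])
]
```

Each command could then be passed to `subprocess.run`, with the resulting motifs scored by MAST and fed to the classifier to fill in the accuracy for that grid cell.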
10. PWM based Analysis
Fig : Accuracy varies with no. of motifs and no. of sites
* 1st bar shows the no. of sites
* 2nd bar shows the no. of motifs
* 3rd bar shows the accuracy
* the point is that accuracy decreases when we increase both the no. of
motifs and the no. of sites
11. Extra Work for TF_20
Fig : Flow diagram of building the new model for TF_20 (diagram labels:
K-mer, PWM, sequences identified by both models, sequences labeled
differently, biased 2-mer model, newly identified sequences, the new model
for TF_20)
Summary -
* we have done some extra work for TF_20
12. AUC based on the Feedback (bonus model)
Fig : AUC of 10 data sets based on the last submission
* accuracy improved over the first submission
* PWM did not give good results
13. Participation
Member | Background Study | Working with Tools | Working with Models | Parameter Tuning | Automation
Badri Sampath | DNA, RNA, protein, motif | AlignAce, MEME, MAST | PWM | K-mer | Arff writer, MAST output writer
Iffat Sharmin Chowdhury | Protein, motif, transcription | Weka, AlignAce, ScanAce | K-mer | PWM | Script for FASTA, Weka
Prosunjit Biswas | DNA, transcription | MEME, MAST | K-mer | PWM | Script for RE, for new K-mer model
Tahmina Ahmed | MEME, MAST, PWM | MEME, MAST, Weka | PWM | K-mer | Script for MEME, MAST