An MLE-based clustering method on DNA barcode
Ching-Ray Yu
Statistics Department
Rutgers University, USA
chingray@stat.rutgers.edu
07/07/2006
Outline
Problem & Motivation
Statistical modeling
Preliminary result
Future work
Problem & Motivation
Species clustering is a central problem in the DNA barcode project.
MEGA3 by S. Kumar, K. Tamura, and M. Nei (2004) has had limited success when applied to the real data set.
A good method for assigning new biological specimens to the proper species is therefore important. This is also one of the goals of the DNA barcode initiative.
Statistical modeling
Statistical model.
Assumption:
Nucleotides $X_{ij} \in \{A, T, C, G\}$ are i.i.d. random variables which are multinomial distributed with parameter $\theta_j = (\theta_j^{A}, \theta_j^{T}, \theta_j^{C}, \theta_j^{G})$, where $i$ stands for the $i$-th DNA (COI) sequence and $j$ indicates the $j$-th position.
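Under this assumption the likelihood factorizes over sequences and positions, which is why the MLE used in the clustering method has a closed form. A short sketch, writing $\theta_j^{a}$ for the probability of nucleotide $a$ at position $j$, $m$ for the number of sequences, $n$ for the sequence length, and $n_j^{a}$ for the number of sequences showing nucleotide $a$ at position $j$ (the symbols $m$ and $n_j^{a}$ are notation introduced here, not taken from the slides):

```latex
L(\theta)
  = \prod_{i=1}^{m} \prod_{j=1}^{n} \theta_j^{X_{ij}}
  = \prod_{j=1}^{n} \prod_{a \in \{A,T,C,G\}} \left(\theta_j^{a}\right)^{n_j^{a}},
\qquad
\hat{\theta}_j^{a} = \frac{n_j^{a}}{m}.
```

Maximizing each position separately under the constraint $\sum_a \theta_j^{a} = 1$ gives the relative frequency of each nucleotide at that position. Fitting the same model within one species $k$ would use only the sequences of that species.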
MLE-based clustering method
Maximum likelihood estimator (MLE)
Suppose $\hat{\theta}_{jk} = (\hat{\theta}_{jk}^{A}, \hat{\theta}_{jk}^{T}, \hat{\theta}_{jk}^{C}, \hat{\theta}_{jk}^{G})$ are the MLE of $\theta_{jk}$ in species $k$.
The clustering rule: A new unidentified specimen $Y = (Y_1, \ldots, Y_n)$, where $Y_i \in \{A, T, C, G\}$, is assigned to species $k$ if
$$k = \arg\max_{r} \prod_{i=1}^{n} \hat{\theta}_{ir}^{y_i}$$
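The rule above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function names, the toy sequences, and the `pseudocount` argument are assumptions introduced here. With `pseudocount=0.0` the estimates are the plain relative-frequency MLEs; a positive value is a smoothing choice (not in the slides) that avoids zero probabilities for nucleotides never observed at a position. Sequences are assumed to be aligned, of equal length, and to contain only A, T, C, G.

```python
import math
from collections import defaultdict

ALPHABET = "ATCG"


def fit_species_frequencies(sequences, labels, pseudocount=0.0):
    """Estimate theta[species][position][nucleotide] as relative frequencies."""
    length = len(sequences[0])
    # One count table per species: a list of per-position nucleotide counts.
    counts = defaultdict(
        lambda: [dict.fromkeys(ALPHABET, 0) for _ in range(length)]
    )
    for seq, species in zip(sequences, labels):
        for pos, base in enumerate(seq):
            counts[species][pos][base] += 1

    theta = {}
    for species, per_position in counts.items():
        theta[species] = []
        for base_counts in per_position:
            total = sum(base_counts.values()) + pseudocount * len(ALPHABET)
            theta[species].append(
                {b: (c + pseudocount) / total for b, c in base_counts.items()}
            )
    return theta


def log_likelihood(seq, species_theta):
    """Log of the product over positions of the estimated probabilities."""
    total = 0.0
    for pos, base in enumerate(seq):
        p = species_theta[pos][base]
        if p == 0.0:
            return -math.inf  # nucleotide never seen at this position
        total += math.log(p)
    return total


def classify(seq, theta):
    """Assign seq to the species with the largest (log-)likelihood."""
    return max(theta, key=lambda species: log_likelihood(seq, theta[species]))


# Toy data (hypothetical, for illustration only).
train = ["ATCG", "ATCA", "GTCG", "GTTG"]
labels = ["sp1", "sp1", "sp2", "sp2"]
theta = fit_species_frequencies(train, labels, pseudocount=0.5)
print(classify("ATCG", theta))
```

Summing log-probabilities gives the same argmax as the product in the rule and avoids numerical underflow on long sequences. If every species has likelihood zero for a specimen, `max` simply returns the first species, so a real analysis would need an explicit policy for that case.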
Preliminary result
The training datasets provided by DIMACS contain 150 species with 1623 COI sequences.
To estimate the misclassification rate of the MLE-based clustering method, I randomly selected 200 COI sequences as a test set. The MLE of the parameters was estimated from the remaining 1423 COI sequences.
The misclassification rate is about 16%.
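The evaluation described here is a single random hold-out split. A minimal sketch of that procedure is given below; the function name, the `fit`/`predict` callables, and the toy data are assumptions made for illustration, and any classifier with this interface could be plugged in.

```python
import random


def holdout_misclassification_rate(sequences, labels, fit, predict, n_test, seed=0):
    """Hold out n_test random sequences, fit on the rest, return the error rate."""
    if not 0 < n_test < len(sequences):
        raise ValueError("n_test must be between 1 and len(sequences) - 1")
    rng = random.Random(seed)
    indices = list(range(len(sequences)))
    rng.shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Parameters are estimated from the training part only.
    model = fit(
        [sequences[i] for i in train_idx],
        [labels[i] for i in train_idx],
    )
    errors = sum(predict(sequences[i], model) != labels[i] for i in test_idx)
    return errors / n_test


# Toy check with a lookup-table "classifier" (purely illustrative).
toy_sequences = ["AAAA"] * 5 + ["GGGG"] * 5
toy_labels = ["sp1"] * 5 + ["sp2"] * 5
toy_rate = holdout_misclassification_rate(
    toy_sequences,
    toy_labels,
    fit=lambda seqs, labs: dict(zip(seqs, labs)),
    predict=lambda seq, model: model.get(seq),
    n_test=4,
)
```

With the figures on the slide, `n_test` would be 200, leaving 1423 sequences for estimation; a rate of about 16% then corresponds to roughly 32 misclassified test sequences. A single split gives a noisy estimate, and a species whose sequences all land in the test set cannot be predicted correctly, so repeating the split with different seeds may be worth considering.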
Future work
The misclassification rate could be improved if more complicated models are considered. For example:
1. Gaps in the COI sequences.
2. The Kimura two-parameter (K2P) model (1980), which allows different evolutionary rates for transitions and transversions.
3. Synonymous and non-synonymous substitutions.
Remark: Non-synonymous substitutions change the encoded amino acid; synonymous substitutions do not.
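As background for item 2, the standard K2P distance between two aligned sequences is $d = -\tfrac{1}{2}\ln\big((1 - 2P - Q)\sqrt{1 - 2Q}\big)$, where $P$ and $Q$ are the proportions of sites showing transitional and transversional differences. The slides do not say how K2P would be combined with the MLE rule, so the sketch below only illustrates the distance itself. The function name and the decision to skip sites containing gaps or ambiguous characters are assumptions made here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq_a, seq_b):
        # Skip gaps and ambiguous characters (an assumption of this sketch).
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for the K2P formula")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For two identical sequences the distance is zero, and it grows faster than the raw proportion of differing sites because the formula corrects for multiple substitutions at the same site.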
Proposed deliverables
If the MLE-based clustering method works well on barcode sequences, it can be written up as a paper, with R code made available to public users of barcode data.
Acknowledgement
NSF grant
Data Analysis Working Group
DIMACS at Rutgers
My advisor: Dr. Javier Cabrera
References
M. Kimura (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
Sudhir Kumar, Koichiro Tamura and Masatoshi Nei (2004) MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Briefings in Bioinformatics 5(2): 150-163.