20131001 lab meeting
![Page 1: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/1.jpg)
Error correction for next generation sequencing
Wu Chihua (Gigi)
Matsuyama Lab, M2
Bioinformatics Group
October 1st, 2013
Tuesday, November 5, 2013
![Page 2: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/2.jpg)
Agenda
• Background
• Existing research
• Toy experiment
• Future work
• References
![Page 3: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/3.jpg)
Background
why & what
![Page 4: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/4.jpg)
DNA Sequencing
Angelina Jolie tested for one gene, what about the other 20,000?
![Page 5: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/5.jpg)
20,000 genes
1 full genome sequence
![Page 6: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/6.jpg)
Genome
An organism's complete set of DNA
![Page 7: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/7.jpg)
Chromosome = DNA + protein
![Page 8: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/8.jpg)
Chromosome = DNA + protein
DNA bases: A, T, C, G
base pair (bp)
![Page 9: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/9.jpg)
Chromosome and gene
Gene: a region of a chromosome that controls a hereditary characteristic
20,000+ genes
![Page 10: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/10.jpg)
Human gene: average 3,000 bp; largest 2,400,000 bp
Human genome: 3 billion bp
Human DNA (per chromosome): 50 ~ 250 Mbp
![Page 11: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/11.jpg)
Next Generation Sequencing
high throughput & cheaper
output short reads
![Page 12: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/12.jpg)
Elaine R. Mardis. A decade’s perspective on DNA sequencing technology. Figure 1.
![Page 13: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/13.jpg)
wikipedia. http://en.wikipedia.org/wiki/DNA_sequencing#cite_note-quail2012-37
![Page 14: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/14.jpg)
![Page 15: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/15.jpg)
Error Correction
Highly accurate sequenced reads will likely lead to higher-quality results.
![Page 16: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/16.jpg)
Existing Research
![Page 17: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/17.jpg)
![Page 18: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/18.jpg)
Possible directions
To handle large genomes and larger datasets.
To handle insertion and deletion errors.
To correct hybrid datasets from multiple next generation platforms.
To develop error correction methods for datasets in population studies.
![Page 19: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/19.jpg)
Toy experiment
![Page 20: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/20.jpg)
1. Input: short reads
2. Find similar pairs of reads with SlideSort
3. Vote on each position using the paired reads
4. Decide the new base
5. Correct the erroneous bases
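The steps of this pipeline can be sketched roughly as follows. This is only an illustration, not the code used in the experiment: a brute-force Hamming-distance search stands in for SlideSort, reads are assumed to have equal length with substitution errors only, and the function names (`similar_pairs`, `correct_reads`) and the strict-majority voting rule are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(reads: list[str], d: int) -> list[tuple[int, int]]:
    """Stand-in for SlideSort: brute-force search for pairs within distance d."""
    return [
        (i, j)
        for i, j in combinations(range(len(reads)), 2)
        if len(reads[i]) == len(reads[j]) and hamming(reads[i], reads[j]) <= d
    ]


def correct_reads(reads: list[str], d: int) -> list[str]:
    """Correct each read by a per-position vote among its similar reads."""
    neighbors = {i: [i] for i in range(len(reads))}  # each read votes for itself
    for i, j in similar_pairs(reads, d):
        neighbors[i].append(j)
        neighbors[j].append(i)

    corrected = []
    for i, read in enumerate(reads):
        new_bases = []
        for pos, base in enumerate(read):
            votes = Counter(reads[k][pos] for k in neighbors[i])
            winner, count = votes.most_common(1)[0]
            # Replace the base only when the winner has a strict majority.
            new_bases.append(winner if count * 2 > len(neighbors[i]) else base)
        corrected.append("".join(new_bases))
    return corrected


# Toy input: the third read carries one substitution error (A -> T).
reads = ["ATGCATAATG", "ATGCATAATG", "ATGCTTAATG", "ATGCATAATG"]
print(correct_reads(reads, d=1))
```

With a strict-majority rule, a read that has no similar partners is left unchanged, which avoids "correcting" reads on the basis of a single vote.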
![Page 21: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/21.jpg)
SlideSort
• All pairs similarity search (APSS) for sequence datasets.
• APSS: find all similar pairs in a dataset.
• Performance of SlideSort
  • 10 minutes for 10 million reads.
  • 2~3 GB for 10 million reads.
• Complexity of SlideSort
  • Time: O(N + α)
  • Equivalence classes are found in O(N).
  • α is the number of neighbor pairs.
![Page 22: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/22.jpg)
SlideSort
Input
• A set of short reads
• Distance threshold d
Output
• Alignments and distances of all similar pairs.
[Figure: example reads such as ATGCATAATGCTCA and ATGCATAATGCTTA, with the output alignments labeled by edit distance (ed = 1, ed = 2).]
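As a reminder of what the "ed" labels measure, here is a small sketch of the standard dynamic-programming edit distance (Levenshtein distance). It is not SlideSort's implementation; the function name `edit_distance` is an assumption for this example, and the two reads are taken from the figure.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # match or substitute
                )
            )
        prev = curr
    return prev[-1]


# The two reads differ by a single substitution (C -> T), so the distance is 1.
print(edit_distance("ATGCATAATGCTCA", "ATGCATAATGCTTA"))
```

A pair would be reported as similar when this value is at most the threshold d.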
![Page 23: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/23.jpg)
ACGC...
ATGC...
AAGT...
Naive approach: O(N^2)
How to reduce computational cost?
*Animation by Prof. Shimizu
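To make the O(N^2) cost concrete, the naive approach can be sketched as below. The sketch uses Hamming distance instead of edit distance to stay short, treats the four-base prefixes from the slide as complete toy reads, and uses a hypothetical function name (`naive_apss`); all of these are assumptions for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_apss(reads: list[str], d: int) -> tuple[list[tuple[int, int]], int]:
    """Compare every pair of reads; return similar pairs and the comparison count."""
    pairs = []
    comparisons = 0
    for i, j in combinations(range(len(reads)), 2):
        comparisons += 1  # one distance computation per pair: N * (N - 1) / 2 in total
        if hamming(reads[i], reads[j]) <= d:
            pairs.append((i, j))
    return pairs, comparisons


pairs, comparisons = naive_apss(["ACGC", "ATGC", "AAGT"], d=1)
print(pairs, comparisons)
```

For N reads the loop runs N(N-1)/2 times, so 10 million reads would need roughly 5 x 10^13 distance computations, which is why a filtering step is needed.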
![Page 25: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/25.jpg)
ATGC...
AAGT...
Basic strategy:
1. Filtering stage: find subsets sharing common substring(s).
2. Pair-wise comparison stage: compare all pairs within each subset.
[Animation frames: the example reads (ACGC..., ATGC..., AAGT..., and a second ATGC...) are rearranged step by step, ending in the order AAGT..., ACGC..., ATGC..., ATGC...]
*Animation by Prof. Shimizu
![Page 37: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/37.jpg)
SlideSort
S1 and S2 are each decomposed into m blocks.
If the edit distance between S1 and S2 is at most d, there exist at least (m - d) common blocks between S1 and S2, at similar positions.
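The block argument can be checked on a small example. The sketch below covers only the substitution-only case, where matching blocks sit at exactly the same position; with insertions or deletions the blocks shift, which is why the slide says "similar positions". The helper names (`split_blocks`, `common_blocks`) and the choice m = 4 are assumptions for illustration.

```python
def split_blocks(read: str, m: int) -> list[str]:
    """Split a read into m contiguous blocks of near-equal length."""
    base, extra = divmod(len(read), m)
    blocks, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        blocks.append(read[start:end])
        start = end
    return blocks


def common_blocks(s1: str, s2: str, m: int) -> int:
    """Count blocks that are identical at the same block index."""
    return sum(b1 == b2 for b1, b2 in zip(split_blocks(s1, m), split_blocks(s2, m)))


s1 = "ATGCATAATGCTCA"
s2 = "ATGCATAATGCTTA"  # one substitution away from s1, so d = 1
m = 4
print(split_blocks(s1, m))
print(split_blocks(s2, m))
print(common_blocks(s1, s2, m))  # at least m - d = 3 blocks must match
```

Each error can damage at most one block, so d errors leave at least m - d blocks untouched; this is the property the filtering stage relies on.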
![Page 38: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/38.jpg)
SlideSort
• First step:
  • Quickly finds a subset of short reads that share (m - d) common blocks (k-mers).
• Second step:
  • Calculates the edit distance between all pairs included in the subset (equivalence class).
  • Outputs pairs whose edit distance is at most d, along with their alignments and scores.
[Figure: reads S1 to S6; S1, S2, and S5 are grouped into an equivalence class (shared block ATGC...).]
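The two steps can be imitated with a simplified filter-then-verify sketch. It does not reproduce SlideSort's actual algorithm: it assumes equal-length reads and substitution errors only (Hamming distance), enumerates block positions by brute force, and uses hypothetical names (`split_blocks`, `filter_then_verify`) and made-up toy reads.

```python
from collections import defaultdict
from itertools import combinations


def split_blocks(read: str, m: int) -> list[str]:
    """Split a read into m contiguous blocks of near-equal length."""
    base, extra = divmod(len(read), m)
    blocks, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        blocks.append(read[start:end])
        start = end
    return blocks


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def filter_then_verify(reads: list[str], m: int, d: int) -> set[tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose Hamming distance is at most d."""
    blocks = [split_blocks(r, m) for r in reads]
    result = set()
    # Step 1 (filtering): for every choice of (m - d) block positions, reads whose
    # blocks agree at those positions fall into the same equivalence class.
    for positions in combinations(range(m), m - d):
        classes = defaultdict(list)
        for idx, read_blocks in enumerate(blocks):
            classes[tuple(read_blocks[p] for p in positions)].append(idx)
        # Step 2 (pairwise comparison): verify pairs only inside each class.
        for members in classes.values():
            for i, j in combinations(members, 2):
                if hamming(reads[i], reads[j]) <= d:
                    result.add((i, j))
    return result


reads = ["ATGCATAATGCTCA", "ATGCATAATGCTTA", "AAGTCGGAAGGTCG", "ATGCTTAATGCTCA"]
print(sorted(filter_then_verify(reads, m=4, d=1)))
```

Only reads that land in the same class are compared, so dissimilar reads such as the third one are never paired with the others in the verification step.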
![Page 39: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/39.jpg)
Toy Experiment
Data: test.fasta
Simulator: Stampy (an open-source tool that can simulate short-read errors)
Number of sequences: 5
Max_seq_length: 51
Min_seq_length: 51
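Summary statistics like the ones above can be computed with a few lines of Python. This is a generic sketch, not the script used for the experiment; the function names (`parse_fasta`, `summarize`) and the two-record example text are hypothetical and do not reproduce test.fasta.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # sequences may span several lines
    return records


def summarize(records: dict[str, str]) -> dict[str, int]:
    """Report the number of sequences and their maximum and minimum lengths."""
    lengths = [len(seq) for seq in records.values()]
    return {
        "num_sequences": len(lengths),
        "max_seq_length": max(lengths, default=0),
        "min_seq_length": min(lengths, default=0),
    }


# Hypothetical two-record input; a real run would read the text from a file.
example = ">read_0\nATGCATAATGCTCA\n>read_1\nATGCATAATGCTTA\n"
print(summarize(parse_fasta(example)))
```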
![Page 40: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/40.jpg)
Toy Experiment
seq 0 1 2 3 4
◉ 1 1
△ 1 1
✖ 1
![Page 41: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/41.jpg)
Discussion
• Not sure whether the test data generated by Stampy is adequate.
• The dataset is far too small.
![Page 42: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/42.jpg)
Future work
• Proper, bigger dataset.
• Select datasets from real experiments in online databases instead of simulations.
• Try a Bayesian model.
![Page 43: 20131001 lab meeting](https://reader036.fdocuments.us/reader036/viewer/2022081400/554f3e48b4c905471e8b4bd8/html5/thumbnails/43.jpg)
References
• Elaine R. Mardis. A decade’s perspective on DNA sequencing technology.
• Michael L. Metzker. Sequencing technologies — the next generation.
• Xiao Yang, Sriram P. Chockalingam, Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics (2013) 14 (1): 56-66.
• Kana Shimizu, Koji Tsuda. SlideSort: all pairs similarity search for short reads. Bioinformatics (2011) 27 (4): 464-470.
• Next Generation Sequencing (NGS) Market [Platforms (Illumina HiSeq, MiSeq, Life Technologies Ion Proton/PGM, 454 Roche), Bioinformatics (RNA-Seq, ChIP-Seq), (Pyrosequencing, SBS, SMRT), (Diagnostics, Personalized Medicine)] - Global Forecast to 2017.