Dynamic mappers of NGS reads - Institut Gaspard...
Dynamic mappers of NGS reads
Karel Břinda (LIGM, Université Paris-Est)
Valentina Boeva (Institut Curie)
Gregory Kucherov (LIGM, Université Paris-Est)
Introduction
Read mapping is a bottleneck in NGS data processing (e.g., for variant calling)
A lot of effort is constantly invested in the development of new mappers
None of them supports dynamic updates of the reference during the mapping
Idea: update the reference during the mapping
Only a few papers on this topic exist
◦ J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013.
◦ A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013.
◦ C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.
Mapping – from static to dynamic
1. Static mapping
◦ Classical mappers, no updates
2. Iterative referencing
◦ Standard mappers are used; mapping is followed by variant calling, repeated over many iterations
3. Dynamic mapping
◦ The mapper dynamically updates its index according to the reads already mapped
1) Static mapping (standard mappers)
[Diagram: READS 1, 2, …, n enter the MAPPER (static mapper, reference index); read mapping is done in 1 iteration; OUTPUT: SAM/BAM file]
2) Iterative referencing (Ghanayim & Geiger, 2013)
[Diagram: in every iteration, all READS 1, 2, …, n enter the MAPPER (static mapper, reference index, statistics); each iteration consists of read mapping, pileup and consensus, and an update of the reference; OUTPUT: SAM/BAM file]
3) Dynamic mapping (no existing mapper until now)
[Diagram: READS 1, 2, …, n enter the MAPPER (dynamic mapper, reference index, statistics), one iteration per read; each iteration consists of read mapping and an update of the reference; OUTPUT: SAM/BAM file]
Estimating the usefulness
                        Memory requirements   Speed   Quality of alignment
Iterative referencing   +                     --      ++
Dynamic mapping         --                    +       +
Static mapping          +                     ++      -
Difficulties – dynamic data structures
Two basic types of mappers:
◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM, etc.)
◦ Hash-table based (e.g., SHRiMP 2, SToRM, etc.)
Data structures must be dynamic
◦ Dynamic versions are difficult to build
◦ More memory is needed
◦ Cache behaviour gets worse (=> significant decrease in speed)
Dynamic FM-index – already studied:
◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows–Wheeler transform. Theoretical Computer Science 410(43), 2009.
◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8(2), 2010.
◦ Implementation: http://dfmi.sourceforge.net/
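To make the cost of a dynamic index concrete for the hash-table case, here is a minimal sketch of a k-mer hash index that is patched after a single SNP instead of being rebuilt. It is an illustration only, not code from SHRiMP 2, SToRM or the dynamic FM-index cited above; the function names `build_index` and `apply_snp` are hypothetical. A SNP at position `pos` invalidates at most k overlapping k-mers, so the update is local, but the index has to store mutable sets, which is where the extra memory and the poorer cache locality come from.

```python
from collections import defaultdict

def build_index(ref, k):
    """Map every k-mer of ref (str or list of chars) to its start positions."""
    index = defaultdict(set)
    for i in range(len(ref) - k + 1):
        index["".join(ref[i:i + k])].add(i)
    return index

def apply_snp(ref, index, pos, base, k):
    """Substitute ref[pos] by base and patch only the affected k-mers.

    ref must be a mutable list of characters; it is modified in place.
    """
    lo = max(0, pos - k + 1)          # first k-mer start overlapping pos
    hi = min(pos, len(ref) - k)       # last k-mer start overlapping pos
    for i in range(lo, hi + 1):       # remove the outdated k-mers
        kmer = "".join(ref[i:i + k])
        index[kmer].discard(i)
        if not index[kmer]:
            del index[kmer]
    ref[pos] = base
    for i in range(lo, hi + 1):       # insert the k-mers of the new sequence
        index["".join(ref[i:i + k])].add(i)

ref = list("ACGTACGT")
index = build_index(ref, 3)
apply_snp(ref, index, 4, "G", 3)      # ACGTACGT -> ACGTGCGT
rebuilt = build_index(ref, 3)         # reference result: full rebuild
```

After the update, `index` holds the same entries as `rebuilt`, while only three k-mer positions were touched.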
Difficulties – statistics and reference
To make updates, it is necessary to keep simplified pileups (nucleotide counts in an alignment column).
It is difficult to deal with insertions.
The coordinates of already mapped reads can change during the mapping.
◦ Possible solution: a padded reference with many initial placeholders (‘*’ character), followed by small final post-processing corrections of the SAM file.
Example (memory needed for statistics for a single nucleotide)
‘A’ counter   ‘C’ counter   ‘G’ counter   ‘T’ counter   DEL counter   Sum
3 bits        3 bits        3 bits        3 bits        3 bits        15 bits
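The 15-bit layout above can be sketched as five 3-bit fields packed into one integer. The field order and the overflow rule are assumptions: the slides do not say what happens when a counter reaches its maximum of 7, so this sketch halves all five counters before incrementing, which roughly preserves their ratios. The names `get`, `add` and `consensus` are hypothetical.

```python
SYMBOLS = "ACGT-"          # '-' stands for the DEL counter (assumed field order)
BITS = 3
MAX = (1 << BITS) - 1      # largest value of a 3-bit counter: 7

def get(stats, sym):
    """Read one 3-bit counter from the packed 15-bit value."""
    return (stats >> (SYMBOLS.index(sym) * BITS)) & MAX

def add(stats, sym):
    """Return the packed value after counting one more observation of sym."""
    counts = [get(stats, s) for s in SYMBOLS]
    k = SYMBOLS.index(sym)
    if counts[k] == MAX:
        # Assumption: on overflow, halve every counter of this column.
        counts = [c >> 1 for c in counts]
    counts[k] += 1
    packed = 0
    for i, c in enumerate(counts):
        packed |= c << (i * BITS)
    return packed

def consensus(stats):
    """Most frequent symbol of the column (first in SYMBOLS order on ties)."""
    return max(SYMBOLS, key=lambda s: get(stats, s))

stats = 0
for sym in "AACAAAAA":     # seven 'A' and one 'C' observed in this column
    stats = add(stats, sym)
before = (get(stats, "A"), get(stats, "C"))
stats = add(stats, "A")    # the 'A' counter is full, so all counters are halved first
after = (get(stats, "A"), get(stats, "C"))
```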
Example (padded reference, an insertion at pos. 14)
1   3   5   7   9   11  13  15  17  19
C * * A * * G * * C * * G C * C * * A * …
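The padded reference from this example can be reproduced with a short sketch. It assumes two placeholders after every base, as the example suggests, and the original sequence CAGCGCA read off the non-placeholder positions; the helper names are hypothetical. Because an insertion only fills a placeholder, the padded coordinates of all other bases, and therefore of already mapped reads, stay unchanged.

```python
PAD = "*"

def pad_reference(ref, slots=2):
    """Follow every base of ref with `slots` placeholder characters."""
    return [ch for base in ref for ch in base + PAD * slots]

def padded_pos(i, slots=2):
    """0-based padded index of the i-th (0-based) original base."""
    return i * (slots + 1)

def insert_after(padded, i, base, slots=2):
    """Insert `base` after original base i by filling its next free placeholder.

    Returns the 0-based padded index that was used.
    """
    start = padded_pos(i, slots) + 1
    for j in range(start, start + slots):
        if padded[j] == PAD:
            padded[j] = base
            return j
    raise ValueError("no free placeholder left after this base")

def unpadded(padded):
    """Final sequence with all remaining placeholders removed."""
    return "".join(ch for ch in padded if ch != PAD)

padded = pad_reference("CAGCGCA")
j = insert_after(padded, 4, "C")   # insertion after the G at padded position 13
print(" ".join(padded))            # the padded reference of the example
print(j + 1)                       # 1-based padded position of the insertion
print(unpadded(padded))
```

The final post-processing of the SAM file would then translate padded coordinates back by counting the non-placeholder characters to the left of each position.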
Difficulties – remapping, unmapping
When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped
Possible solutions:
◦ Ignore it
◦ Iterate over the set of reads several times and keep only the last reported alignment for each read
Reference: ...AAAAATATATATATCGATCTGC...CC _
Reads:
1: ATCTATATATCG
2: CCGATCTGC
3: CCCGATCTG
4: ATCCCGATC
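The second option, keeping only the last reported alignment of each read, amounts to letting later records overwrite earlier ones. A minimal sketch, assuming alignment records arrive as hypothetical `(read_id, position)` pairs in the order they were reported, with `None` standing for an unmapped read:

```python
def last_alignments(records):
    """Keep, for every read, only the alignment reported last."""
    final = {}
    for read_id, position in records:
        final[read_id] = position      # a later pass overwrites an earlier one
    return final

# Two passes over three reads; positions are made up for illustration.
records = [
    ("read1", 100), ("read2", 250), ("read3", None),   # first pass
    ("read1", 100), ("read2", 245), ("read3", 610),    # second pass, updated reference
]
final = last_alignments(records)
```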
Dynamic mapping
[Diagram: same scheme as in 3) Dynamic mapping; READS 1, 2, …, n enter the MAPPER (dynamic mapper, reference index, statistics), one iteration per read, with read mapping and an update of the reference; OUTPUT: SAM/BAM file]
Simulation (ideal approach)
[Diagram: a static MAPPER (reference index, statistics) is run on growing sets of READS: read 1, then reads 1 and 2, …, finally reads 1, 2, …, n; each iteration consists of read mapping, pileup and consensus, and an update of the reference; OUTPUT: SAM/BAM file]
Simulation (feasible approach: 1/d iterations)
[Diagram: a static MAPPER (reference index, statistics) is run on sets of READS growing by d reads per iteration: d reads, then 2 × d reads, then 3 × d reads, …; each iteration consists of read mapping, pileup and consensus, and an update of the reference; OUTPUT: SAM/BAM file]
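The batch scheme can be sketched as a loop around any static mapper. Everything below is a toy stand-in under stated assumptions: `hamming_map` replaces a real mapper such as BWA with an ungapped best-Hamming-distance search, the consensus is a plain majority vote with a hypothetical `min_depth` cut-off, and only substitutions are handled. It is not the pipeline described in these slides, only an illustration of its control flow as the diagram suggests it.

```python
from collections import Counter

def hamming_map(read, ref):
    """Toy static mapper: leftmost ungapped position with the fewest mismatches.

    Assumes the read is not longer than the reference.
    """
    best_pos, best_dist = None, len(read) + 1
    for pos in range(len(ref) - len(read) + 1):
        dist = sum(a != b for a, b in zip(read, ref[pos:pos + len(read)]))
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos

def simulate(ref, reads, d, min_depth=2):
    """Map the first d, 2d, 3d, ... reads, updating the reference after each pass."""
    ref = list(ref)
    alignments = {}
    for end in range(d, len(reads) + d, d):
        current = "".join(ref)                    # reference used in this iteration
        pileup = [Counter() for _ in ref]         # nucleotide counts per column
        for idx, read in enumerate(reads[:end]):
            pos = hamming_map(read, current)
            alignments[idx] = pos                 # the last reported alignment wins
            for offset, base in enumerate(read):
                pileup[pos + offset][base] += 1
        for i, column in enumerate(pileup):       # consensus update, SNPs only
            if sum(column.values()) >= min_depth:
                ref[i] = column.most_common(1)[0][0]
    return "".join(ref), alignments

reference = "GATTACAGGCTTAACGT"
donor = "GATTACAGTCTTAACGT"                       # one SNP at 0-based position 8
reads = [donor[i:i + 8] for i in (4, 5, 6, 7)]
new_ref, alignments = simulate(reference, reads, d=2)
```

In this toy run the first batch of two reads already corrects the SNP, so the second pass maps all four reads to the updated reference without mismatches.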
Our pipeline
Goals:
◦ Simulating a dynamic mapper using existing static mappers
◦ Estimating the usefulness of dynamic mapping
◦ Making general statements about its benefit
Implementation:
◦ A set of scripts (BASH, Python) and programs (C++)
◦ It uses standard bioinformatics software (the SAMtools suite, etc.) and mappers (any mapper can be incorporated)
◦ Updates are made by our own simple variant caller (simulating the real capabilities of a mapper)
◦ Currently only SNP updates (no indels) and single-end reads are supported
Comparison of mappers
Typical approach:
1. Taking several mappers as black-boxes.
2. Simulating reads.
3. Mapping by the selected mappers.
4. Applying the same threshold on mapping qualities for all reads.
5. Comparing.
…it is not very useful.
[Figure: comparison of mappers with threshold 20 on mapping qualities. Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997]
It is important to consider all thresholds on mapping qualities!
Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997
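Considering all thresholds means computing, for every mapping-quality cut-off, how many reads remain and what fraction of them is wrongly mapped. A minimal sketch, assuming each mapped read is given as a hypothetical `(mapq, is_correct)` pair, where `is_correct` would be derived from the ground truth of the read simulator:

```python
def threshold_curve(alignments):
    """For each MAPQ threshold t, report (t, reads with MAPQ >= t, fraction wrong)."""
    thresholds = sorted({mapq for mapq, _ in alignments}, reverse=True)
    curve = []
    for t in thresholds:
        kept = [ok for mapq, ok in alignments if mapq >= t]
        wrong = kept.count(False)
        curve.append((t, len(kept), wrong / len(kept)))
    return curve

# Made-up values for illustration only.
alignments = [(60, True), (60, True), (30, True), (30, False), (0, False)]
curve = threshold_curve(alignments)
for t, n_mapped, frac_wrong in curve:
    print(t, n_mapped, frac_wrong)
```

Plotting the number of kept reads against the fraction of wrong ones for all thresholds gives one curve per mapper, so a comparison no longer depends on a single cut-off such as 20.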
LAVEnder
A new evaluation software for comparing alignments (C++, Python)
It creates interactive HTML reports for a set of BAM files
Support of:
◦ DWGsim read simulator (will be extended)
◦ Single-end reads
Availability
◦ Currently a private repository on GitHub
◦ In case of interest, don’t hesitate to contact me at [email protected]
Example of a comparison
• Human chromosome 21
• Sequencing error rate: 0.04
• Mutation rate: 0.10
• Single-end reads
• Simulated by DWGsim
• Aligned by BWA-MEM
[Figure axes: fraction of wrongly mapped reads in mapped reads; part of all reads in %]
Setup
Mappers: BWA-ALN, BWA-MEM
Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21
Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM
Sequencing error rate: 0.01
Read length: 100
Read simulator: DWGsim
Evaluator: LAVEnder
All experiments: Rate of seq. errors: 0.01, Read length: 100, Average coverage: 10
Panels shown for each experiment: mapping of all reads without any updates; dynamic mapping; iterative referencing
[Figure: BWA-ALN, Borrelia crocidurae, rate of mutations: 0.01]
[Figure: BWA-ALN, human chromosome 21, rate of mutations: 0.01]
[Figure: BWA-ALN, Borrelia crocidurae, rate of mutations: 0.03]
[Figure: BWA-ALN, human chromosome 21, rate of mutations: 0.03]
[Figure: BWA-ALN, Borrelia crocidurae, rate of mutations: 0.05]
[Figure: BWA-ALN, human chromosome 21, rate of mutations: 0.05]
[Figure: BWA-MEM, Borrelia crocidurae, rate of mutations: 0.15]
[Figure: BWA-MEM, human chromosome 21, rate of mutations: 0.15]
Conclusion
We have shown:
◦ For cases with a small number of mutations between genomes, static mapping suffices (e.g., 1%+1%, BWA-ALN)
◦ For cases with a high number of mutations, mapping is much improved when dynamic mapping is employed (e.g., 15%+1%, BWA-MEM)
Real situations: regions with low rates of mutations as well as highly mutated regions (e.g., hot spot regions)
◦ If we are also interested in these regions, dynamic mapping/iterative referencing would provide a substantial improvement (especially for, e.g., variant calling)
Side products of our work:
◦ LAVEnder – a new evaluator of alignments