Dynamic mappers of NGS reads

Karel Břinda (LIGM, Université Paris-Est), Valentina Boeva (Institut Curie), Gregory Kucherov (LIGM, Université Paris-Est)


Introduction

Read mapping is a bottleneck in NGS data processing (e.g., for variant calling)

A lot of effort is constantly invested in the development of new mappers

None of them supports dynamic updates of the reference during the mapping

Idea: update the reference during the mapping

Only a few papers on this topic exist:

◦ J. Pritt. Efficiently Improving the Reference Genome for DNA Read Alignment. Seminar work, Harvard University, 2013.

◦ A. Ghanayim and D. Geiger. Iterative referencing for improving the interpretation of DNA sequence data. Technical report, Technion, Israel, 2013.

◦ C. S. Iliopoulos et al. An algorithm for mapping short reads to a dynamically changing genomic sequence. Journal of Discrete Algorithms 10, 2012.

Mapping – from static to dynamic

1. Static mapping

◦ Classical mappers, no updates

2. Iterative referencing

◦ A standard mapper is used; mapping is followed by variant calling, and this is repeated over many iterations

3. Dynamic mapping

◦ The mapper dynamically updates its index according to the reads already mapped

1) Static mapping (standard mappers)

[Diagram: the reads 1, 2, …, n are mapped by a static mapper against a fixed reference (index) in a single iteration of read mapping; the output is a SAM/BAM file.]

2) Iterative referencing (Ghanayim&Geiger, 2013)

[Diagram: in every iteration, all reads 1, 2, …, n are mapped by a static mapper against the current reference (index); statistics are then collected (pileup, consensus) and used to update the reference before the next iteration. The output is a SAM/BAM file.]

3) Dynamic mapping (no existing mapper until now)

[Diagram: a dynamic mapper processes the reads 1, 2, …, n one per iteration; each read is mapped against the current reference (index), and the statistics and the reference are updated before the next read. The output is a SAM/BAM file.]

Estimating the usefulness

                        Memory requirements   Speed   Quality of alignment
Iterative referencing   +                     --      ++
Dynamic mapping         --                    +       +
Static mapping          +                     ++      -

Dynamic mappers

Difficulties – dynamic data structures

Two basic types of mappers:

◦ FM-index based (e.g., BWA-ALN, BWA-SW, BWA-MEM, GEM, etc.)

◦ Hash-table based (e.g., SHRiMP 2, SToRM, etc.)

Data structures must be dynamic:

◦ Difficult to make dynamic versions

◦ More memory needed

◦ Worse cache optimization (=> significant decrease in speed)
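To make the idea of a dynamic index concrete, here is a minimal sketch of a hash-table k-mer index that can absorb a single-base substitution without being rebuilt. The class `DynamicKmerIndex` and its methods are hypothetical names introduced for illustration and are not taken from any of the mappers listed above. Only substitutions are handled; insertions and deletions would shift all later positions.

```python
from collections import defaultdict


class DynamicKmerIndex:
    """Toy hash-table index over k-mers that supports single-base substitutions."""

    def __init__(self, reference: str, k: int):
        self.k = k
        self.reference = list(reference)
        self.table = defaultdict(set)  # k-mer -> set of start positions
        for start in range(len(reference) - k + 1):
            self.table[self._kmer(start)].add(start)

    def _kmer(self, start: int) -> str:
        return "".join(self.reference[start:start + self.k])

    def _affected_starts(self, position: int):
        # Start positions of the (at most k) k-mers that overlap `position`.
        first = max(0, position - self.k + 1)
        last = min(position, len(self.reference) - self.k)
        return range(first, last + 1)

    def substitute(self, position: int, base: str) -> None:
        """Change one reference base and re-index only the overlapping k-mers."""
        for start in self._affected_starts(position):
            kmer = self._kmer(start)
            self.table[kmer].discard(start)
            if not self.table[kmer]:
                del self.table[kmer]
        self.reference[position] = base
        for start in self._affected_starts(position):
            self.table[self._kmer(start)].add(start)

    def lookup(self, kmer: str):
        return sorted(self.table.get(kmer, ()))
```

An update touches at most k table entries, but each entry is a separately allocated set, which hints at the memory and cache costs mentioned above.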

Dynamic FM-index – already studied:

◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. A four-stage algorithm for updating a Burrows–Wheeler transform. Theoretical Computer Science 410(43), 2009.

◦ M. Salson, T. Lecroq, M. Léonard, and L. Mouchard. Dynamic extended suffix arrays. Journal of Discrete Algorithms 8(2), 2010.

◦ Implementation: http://dfmi.sourceforge.net/

Difficulties – statistics and reference

To make updates, it is necessary to keep simplified pileups (nucleotide counts for each alignment column).

It is difficult to deal with insertions.

The coordinates of already mapped reads can change during the mapping.

◦ Possible solution: a padded reference with many initial placeholders (‘*’ characters), followed by small final post-processing corrections of the SAM file.

Example (memory needed for statistics for a single nucleotide):

‘A’ counter   ‘C’ counter   ‘G’ counter   ‘T’ counter   DEL counter   Sum
3 bits        3 bits        3 bits        3 bits        3 bits        15 bits
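The 15-bit layout can be sketched as five 3-bit fields packed into one integer. The field order (A, C, G, T, DEL from the lowest bits) and the overflow policy (halve all counters when one of them is full) are assumptions made for this illustration; the slides do not specify either.

```python
# Hypothetical layout: five 3-bit counters in one 15-bit integer,
# ordered A, C, G, T, DEL starting from the lowest bits.
FIELDS = ("A", "C", "G", "T", "DEL")
BITS = 3
MAX_COUNT = (1 << BITS) - 1  # a 3-bit counter holds 0..7


def get_count(packed: int, symbol: str) -> int:
    shift = FIELDS.index(symbol) * BITS
    return (packed >> shift) & MAX_COUNT


def unpack(packed: int) -> dict:
    return {symbol: get_count(packed, symbol) for symbol in FIELDS}


def pack(counts: dict) -> int:
    packed = 0
    for i, symbol in enumerate(FIELDS):
        packed |= (counts[symbol] & MAX_COUNT) << (i * BITS)
    return packed


def increment(packed: int, symbol: str) -> int:
    counts = unpack(packed)
    if counts[symbol] == MAX_COUNT:
        # Assumed overflow policy: halve every counter so the
        # proportions are roughly preserved and nothing wraps around.
        counts = {s: c // 2 for s, c in counts.items()}
    counts[symbol] += 1
    return pack(counts)
```

Starting from 0 and observing A, A, C gives the packed value 10 (A = 2 in bits 0 to 2, C = 1 in bits 3 to 5).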

Example (padded reference, an insertion at pos. 14):

Position:   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Reference:  C  *  *  A  *  *  G  *  *  C  *  *  G  C  *  C  *  *  A  * …
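A minimal sketch of the padded-reference idea follows. It assumes two placeholder slots after every base, which is the pattern visible in the example; the function names are hypothetical. Reads keep their padded coordinates while the reference changes, and `unpadded_position` is the kind of conversion a final SAM post-processing step would need.

```python
PLACEHOLDER = "*"


def pad_reference(reference: str, pad: int = 2) -> list:
    """Follow every base with `pad` placeholder slots (pad=2 is an assumption)."""
    padded = []
    for base in reference:
        padded.append(base)
        padded.extend(PLACEHOLDER * pad)
    return padded


def insert_base(padded: list, position: int, base: str) -> None:
    """Write an inserted base into a free slot at a 1-based padded position."""
    if padded[position - 1] != PLACEHOLDER:
        raise ValueError("slot already occupied")
    padded[position - 1] = base


def unpadded_position(padded: list, position: int) -> int:
    """Translate a 1-based padded position into a 1-based final coordinate
    by counting the occupied slots up to and including that position."""
    return sum(1 for symbol in padded[:position] if symbol != PLACEHOLDER)
```

With the unpadded sequence CAGCGCA (inferred from the example), `pad_reference("CAGCGCA")` followed by `insert_base(padded, 14, "C")` gives `C * * A * * G * * C * * G C * C * * A *` for the first 20 slots. The base at padded position 16 moves from final coordinate 6 to 7, while its padded coordinate is unchanged.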

Difficulties – remapping, unmapping

When the reference sequence changes too much, some of the already mapped reads should be remapped or unmapped.

Possible solutions:

◦ Ignore it

◦ Iterate over the set of reads several times and keep only the last reported alignment for each read

Reference:

...AAAAATATATATATCGATCTGC...CC _

Reads:

1: ATCTATATATCG

2: CCGATCTGC

3: CCCGATCTG

4: ATCCCGATC
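The second solution, keeping only the last reported alignment for each read, could look as follows in a post-processing step. The record format (read name, position, with `None` for an unmapped read) is an assumption made for this sketch.

```python
def last_alignments(reported):
    """Keep only the last reported alignment for each read.

    `reported` is an iterable of (read_name, position) pairs in the order
    in which they were emitted over all passes; position None means unmapped.
    """
    final = {}
    for read_name, position in reported:
        final[read_name] = position  # later passes overwrite earlier ones
    return final
```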

Simulating dynamic mapping

Dynamic mapping

[Diagram: the scheme to be simulated. A dynamic mapper processes the reads 1, 2, …, n one per iteration, mapping each read and then updating the statistics and the reference (index); the output is a SAM/BAM file.]

Simulation (ideal approach)

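[Diagram: a static mapper is run once per read. Iteration 1 maps read 1, iteration 2 maps reads 1 and 2, and so on up to reads 1, 2, …, n. After each iteration the statistics are recomputed (pileup, consensus) and the reference (index) is updated. The output is a SAM/BAM file.]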

Simulation (feasible approach: 1/d iterations)

[Diagram: a static mapper is run once per batch of d reads. Iteration 1 maps the first d reads, iteration 2 the first 2d reads, iteration 3 the first 3d reads, and so on. After each iteration the statistics are recomputed (pileup, consensus) and the reference (index) is updated. The output is a SAM/BAM file.]
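The batch scheme can be sketched with a toy static mapper. Everything below is illustrative: `map_read` is a naive ungapped best-match search standing in for a real mapper, and the consensus rule (strict majority with a minimum depth of 2) is an assumption, not the variant caller used in the pipeline. Only substitutions are handled.

```python
from collections import Counter


def map_read(reference: str, read: str) -> int:
    """Toy static mapper: the ungapped position with the fewest mismatches."""
    def mismatches(pos):
        window = reference[pos:pos + len(read)]
        return sum(a != b for a, b in zip(window, read))
    return min(range(len(reference) - len(read) + 1), key=mismatches)


def call_consensus(reference: str, reads, positions, min_depth: int = 2) -> str:
    """Pile up the mapped reads and take the majority base in each column."""
    pileup = [Counter() for _ in reference]
    for read, pos in zip(reads, positions):
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    updated = list(reference)
    for i, column in enumerate(pileup):
        if column:
            base, count = column.most_common(1)[0]
            # Assumed rule: enough depth and a strict majority.
            if count >= min_depth and 2 * count > sum(column.values()):
                updated[i] = base
    return "".join(updated)


def simulate_dynamic_mapping(reference: str, reads, d: int):
    """Map growing prefixes of the read set (d, 2d, 3d, ... reads),
    updating the reference after every iteration."""
    positions = []
    for end in range(d, len(reads) + d, d):
        batch = reads[:end]
        positions = [map_read(reference, read) for read in batch]
        reference = call_consensus(reference, batch, positions)
    return reference, positions
```

The function returns the final reference together with the positions from the last iteration, which correspond to the alignments that would be written to the SAM/BAM file.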

Our pipeline

Goals:

◦ Simulating a dynamic mapper using existing static mappers

◦ Estimating the usefulness of dynamic mapping

◦ Making general statements about its benefit

Implementation:

◦ A set of several scripts (BASH, Python) and programs (C++)

◦ It uses standard bioinformatics software (SAMtools suite, etc.) and mappers (any mapper can be incorporated)

◦ Updates are made by our own simple variant caller (simulating the real capabilities of a mapper)

◦ Currently only SNP updates (no indels) and single-end reads are supported

Comparing mappers and alignments

Comparison of mappers

Typical approach:

1. Taking several mappers as black-boxes.

2. Simulating reads.

3. Mapping by the selected mappers.

4. Applying the same threshold on mapping qualities for all reads.

5. Comparing.

…it is not very useful.


Source: Heng Li: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM, arXiv:1303.3997

Threshold 20 (on mapping qualities)


It is important to consider all thresholds on mapping qualities!
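A sketch of what considering all thresholds can mean in practice: for every mapping-quality threshold, compute the share of all reads that pass it and the fraction of those that are wrongly mapped. The input format (one `(mapq, is_correct)` pair per mapped read) and the function name are assumptions for illustration; how an actual evaluator computes these values is not described in the slides.

```python
def threshold_curve(alignments, total_reads):
    """Evaluate every mapping-quality threshold instead of a single one.

    `alignments` holds one (mapq, is_correct) pair per mapped read.
    Returns a list of (threshold, percent_of_all_reads, wrong_fraction),
    from the strictest threshold to the most permissive one.
    """
    curve = []
    thresholds = sorted({mapq for mapq, _ in alignments}, reverse=True)
    for threshold in thresholds:
        kept = [ok for mapq, ok in alignments if mapq >= threshold]
        wrong = sum(1 for ok in kept if not ok)
        percent = 100.0 * len(kept) / total_reads
        curve.append((threshold, percent, wrong / len(kept)))
    return curve
```

Plotting the wrong fraction against the percentage of reads for two mappers gives two curves that can be compared over their whole range, rather than at one arbitrary cut-off such as 20.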


LAVEnder

A new evaluation software for comparing alignments (C++, Python)

It creates interactive HTML reports for a set of BAM files

Support of:

◦ DWGsim read simulator (will be extended)

◦ Single-end reads

Availability:

◦ Currently a private repository on GitHub

◦ In case of interest, don’t hesitate to contact me at [email protected]

Example of a comparison

• Human chromosome 21

• Sequencing error rate: 0.04

• Mutation rate: 0.10

• Single-end reads

• Simulated by DWGsim

• Aligned by BWA-MEM

[Plot axes: "Fraction of wrongly mapped reads in mapped reads" and "Part of all reads in %"]

EXPERIMENTS

Setup

Mappers: BWA-ALN, BWA-MEM

Reference genomes: a bacterium (Borrelia crocidurae), human chromosome 21

Mutation rates: 0.01 – 0.05 for BWA-ALN, 0.15 for BWA-MEM

Sequencing error rate: 0.01

Read length: 100

Read simulator: DWGsim

Evaluator: LAVEnder

[Plots: for each configuration below, three panels were shown: mapping of all reads without any updates, dynamic mapping, and iterative referencing. In all configurations the rate of sequencing errors is 0.01, the read length is 100 and the average coverage is 10.]

◦ BWA-ALN, Borrelia crocidurae, rate of mutations 0.01

◦ BWA-ALN, human chromosome 21, rate of mutations 0.01

◦ BWA-ALN, Borrelia crocidurae, rate of mutations 0.03

◦ BWA-ALN, human chromosome 21, rate of mutations 0.03

◦ BWA-ALN, Borrelia crocidurae, rate of mutations 0.05

◦ BWA-ALN, human chromosome 21, rate of mutations 0.05

◦ BWA-MEM, Borrelia crocidurae, rate of mutations 0.15

◦ BWA-MEM, human chromosome 21, rate of mutations 0.15

Conclusion

We have shown:

◦ For cases with a small number of mutations between genomes, static mapping suffices (e.g., 1%+1%, BWA-ALN)

◦ For cases with a high number of mutations, mapping is much improved when dynamic mapping is employed (e.g., 15%+1%, BWA-MEM)

Real situations: regions with low mutation rates as well as highly mutated regions (e.g., hot-spot regions)

◦ If we are also interested in these regions, dynamic mapping/iterative referencing would provide a great improvement (especially for, e.g., variant calling)

Side products of our work:

◦ LAVEnder – a new evaluator of alignments

Thank you for your attention!

Gregory Kucherov, Valentina Boeva