

RNF: a method and tools to evaluate NGS read mappers

Karel Břinda, Valentina Boeva, Gregory Kucherov

Introduction

Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. The sensitivity and precision of the mapping tool can critically affect the accuracy of the produced results.

Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers.

In the absence of a standard for encoding read origins, every evaluation tool has had to be made explicitly compatible with the simulator used to generate the reads.

To overcome this obstacle, we have created a format, RNF (Read Naming Format), and an associated software package, RnfTools.

RNF

Description: Read Naming Format, a generic format for assigning read names with encoded information about original positions.

Specification: http://karel-brinda.github.io/rnf-spec/

RnfTools

Description: An associated software package of RNF-compatible programs, based on Snakemake [2]. All external programs it employs are installed automatically when they are needed.

Components:

i) MIShmash: a pipeline that applies one of the popular read simulators (DwgSim, Art, Mason, CuReSim, etc.) and transforms the generated reads into the RNF format.

ii) LAVEnder: a tool for evaluating read mappers using RNF reads.

Source code and documentation:
http://github.com/karel-brinda/rnftools
http://rnftools.rtfd.org

Prerequisites:
– Unix-like system (Linux, OSX, etc.)
– Python 3.2+

Installation using Pip:
> pip install rnftools

Installation using Easy Install:
> easy_install rnftools

References

[1] K. Břinda, V. Boeva, G. Kucherov. RNF: a general framework to evaluate NGS read mappers. arXiv:1504.00556 [q-bio.GN], 2015.

[2] J. Köster and S. Rahmann. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19): 2520–2522, 2012.

Read Naming Format

sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)__[paired_end],C:[100=,42=1X47=]

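Parts of the name, separated by double underscores:

– Prefix: sim
– Read tuple ID: 0043fd1
– Segments of reads: (3,13,F,027871,027970),(3,13,R,029171,029270)
– Suffix (with comments and extensions): [paired_end],C:[100=,42=1X47=]

Fields of a segment, in order, shown for (3,13,F,027871,027970):

– Genome ID: 3
– Chromosome ID: 13
– Direction: F
– Leftmost coordinate: 027871
– Rightmost coordinate: 027970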
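The structure of a name can be illustrated with a minimal parsing sketch. It assumes that the four parts are separated by double underscores and that the prefix and read tuple ID contain no double underscore themselves; the names `Segment`, `RnfName` and `parse_rnf_name` are hypothetical and are not part of RnfTools.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    genome_id: int
    chromosome_id: int
    direction: str  # "F", "R" or "N", as in the examples on this poster
    left: int       # leftmost coordinate
    right: int      # rightmost coordinate


class RnfName(NamedTuple):
    prefix: str
    read_tuple_id: str
    segments: List[Segment]
    suffix: str


# One segment: (genome ID, chromosome ID, direction, leftmost, rightmost)
_SEGMENT_RE = re.compile(r"\((\d+),(\d+),([FRN]),(\d+),(\d+)\)")


def parse_rnf_name(name: str) -> RnfName:
    """Split a read name into prefix, read tuple ID, segments and suffix."""
    # Assumption: only the first three "__" separators are structural,
    # so the suffix may itself contain underscores (e.g. [paired_end]).
    prefix, read_tuple_id, segments_part, suffix = name.split("__", 3)
    segments = [
        Segment(int(g), int(c), d, int(left), int(right))
        for g, c, d, left, right in _SEGMENT_RE.findall(segments_part)
    ]
    return RnfName(prefix, read_tuple_id, segments, suffix)


parsed = parse_rnf_name(
    "sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)"
    "__[paired_end],C:[100=,42=1X47=]"
)
print(parsed.segments)
```

The read tuple ID is kept as a string so that leading zeros and its original notation are preserved.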

Example of simulated read tuples

Coor 12345678901234-5678901234567890123456789

Source 1 - reference genome

chr 1 ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC

chr 2 ttcttctggaa-gaccttctcctcctgcaaataaa

Source 2 - generator of random sequences

READS:

r001 ATG-TAGATA ->

r002/1 TTAGATAACGA ->

r002/2 <- TCAG-CGGG

r003/1 tgcaaataa ->

r003/2 gaa-gacc-t ->

r004 ATAGCT............TCAG ->

r005 GTAGG ->

<- agacctt

<- TCGACACG

r006 ATATCACATCATTAGACACTA

Their corresponding RNF names

read tuple  LRN                                                            SRN
r001        sim__1__(1,1,F,01,10)__[single_end]                            #1
r002        sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired_end]              #2
r003        sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate_pair]               #3
r004        sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]                   #4
r005        sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]  #5
r006        rnd__6__(2,0,N,00,00)__[random]                                #6
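As a rough illustration of how such names are composed, the sketch below assembles a name from its parts. The function `format_rnf_name` and its `coor_width` parameter are hypothetical; zero-padding all coordinates to a common width is an assumption inferred from the examples above, not a statement about how MIShmash builds names.

```python
def format_rnf_name(prefix, read_tuple_id, segments, suffix, coor_width=2):
    """Compose a read name from a prefix, an ID, segments and a suffix.

    Each segment is a tuple (genome ID, chromosome ID, direction,
    leftmost coordinate, rightmost coordinate). Coordinates are
    zero-padded to `coor_width` digits (assumed convention).
    """
    formatted = [
        "({},{},{},{},{})".format(
            genome_id,
            chromosome_id,
            direction,
            str(left).zfill(coor_width),
            str(right).zfill(coor_width),
        )
        for genome_id, chromosome_id, direction, left, right in segments
    ]
    return "__".join([prefix, str(read_tuple_id), ",".join(formatted), suffix])


# Read tuples r002 and r006 from the example
name_r002 = format_rnf_name(
    "sim", 2, [(1, 1, "F", 4, 14), (1, 1, "R", 31, 39)], "[paired_end]"
)
name_r006 = format_rnf_name("rnd", 6, [(2, 0, "N", 0, 0)], "[random]")
print(name_r002)
print(name_r006)
```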

LRN: Long read name.

SRN: Short read name. SRNs are used only if an LRN exceeds 255 characters (the maximum allowed read name length in SAM). In that case, an SRN-LRN correspondence file must be created.
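The fallback from LRNs to SRNs can be sketched as follows. This is only an illustration under stated assumptions: the SRN is written as "#" followed by the read tuple ID, as in the table above; the function `assign_read_names` is hypothetical; and the correspondence is returned as an in-memory dictionary because the layout of the correspondence file is not described here.

```python
MAX_NAME_LENGTH = 255  # limit stated on this poster for read names in SAM


def assign_read_names(read_tuples):
    """Pick the name to write for each read tuple.

    `read_tuples` is an iterable of (read tuple ID, LRN) pairs.
    Returns the list of names to use and an SRN -> LRN dictionary
    covering only those tuples whose LRN was too long.
    """
    names = []
    correspondence = {}
    for read_tuple_id, lrn in read_tuples:
        if len(lrn) > MAX_NAME_LENGTH:
            srn = "#{}".format(read_tuple_id)  # assumed SRN notation
            correspondence[srn] = lrn
            names.append(srn)
        else:
            names.append(lrn)
    return names, correspondence


# A short name is kept; an artificially long one falls back to its SRN.
short_lrn = "sim__1__(1,1,F,01,10)__[single_end]"
long_lrn = "sim__2__" + ",".join(["(1,1,F,01,10)"] * 30) + "__[chimeric]"
names, correspondence = assign_read_names([(1, short_lrn), (2, long_lrn)])
print(names[1], len(correspondence))
```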

Evaluation of read mappers using RNF-compatible programs

[Diagram. Read simulation: Genome 1, Genome 2, ..., Genome n (FASTA) -> Read simulator (RNF encoding) -> Reads (FASTQ). Mapping: Reads -> Mapper -> Alignment (BAM). Mapper evaluation: Alignment -> Mapper evaluation tool (RNF decoding) -> Report (TXT/HTML).]

RnfTools – example of usage

Steps:

1. Simulation of reads. 200,000 reads were simulated by DwgSim using MIShmash:
– 100,000 reads from a human genome (HG38),
– 100,000 reads from a mouse genome (MM10).

2. Mapping. All reads were mapped to HG38 by i) Yara, ii) BWA-MEM, iii) BWA-SW, and iv) Bowtie2.

3. Evaluation. The obtained BAM files were evaluated using LAVEnder.

Figures:
– Comparison of the mappers with respect to correctly mapped reads.
– Detailed graph for Yara.
– Detailed graph for BWA-MEM.
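The two quantities plotted in the comparison figure follow directly from read counts. The sketch below computes them from the definitions on the axis labels; the function `mapping_metrics` and the counts in the usage example are hypothetical and are not results from this experiment or output of LAVEnder.

```python
def mapping_metrics(correctly_mapped, wrongly_mapped, should_be_mapped):
    """Return (fraction of correctly mapped reads, FDR in mapping).

    fraction = #correctly mapped reads / #reads which should be mapped
    FDR      = #wrongly mapped reads / #mapped reads
    """
    mapped = correctly_mapped + wrongly_mapped
    fraction_correct = (
        correctly_mapped / should_be_mapped if should_be_mapped else 0.0
    )
    fdr = wrongly_mapped / mapped if mapped else 0.0
    return fraction_correct, fdr


# Made-up counts, for illustration only
fraction_correct, fdr = mapping_metrics(
    correctly_mapped=90, wrongly_mapped=10, should_be_mapped=100
)
print(fraction_correct, fdr)
```

Returning 0.0 for an empty denominator is a design choice of this sketch; raising an error would be equally reasonable.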

[Figure: Comparison of the mappers (BWA-MEM, BWA-SW, Bowtie2, YARA). Title: Correctly mapped reads in all reads which should be mapped. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), 10^-4 to 10^0. y-axis: #correctly mapped reads / #reads which should be mapped, 50 % to 100 %.]

[Figure: Detailed graph for BWA-MEM. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), 10^-2 to 10^-1. y-axis: Part of all reads (%), 0 % to 100 %. Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]

[Figure: Detailed graph for YARA. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), 10^-2 to 10^-1. y-axis: Part of all reads (%), 0 % to 100 %. Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]