
Rnf: a method and tools to evaluate Ngs read mappers

Karel Břinda, Valentina Boeva, Gregory Kucherov

Introduction

Aligning reads to a reference sequence is a fundamental step in numerous bioinformatics pipelines. The sensitivity and precision of the mapping tool can critically affect the accuracy of the produced results.

Read simulators combined with alignment evaluation tools provide the most straightforward way to evaluate and compare mappers.

In the absence of standards for encoding read origins, every evaluation tool has had to be made explicitly compatible with the simulator used to generate the reads.

To overcome this obstacle, we have created the Rnf format (Read Naming Format) and an associated software package, RnfTools.

Rnf

Description: Read Naming Format, a generic format for assigning read names with encoded information about original positions.

Specification: http://karel-brinda.github.io/rnf-spec/

RnfTools

Description: An associated software package of Rnf-compatible programs, based on Snakemake [2]. All employed external programs are installed automatically when they are needed.

Components:

i) MIShmash: Pipeline applying one of the popular read simulating tools (among DwgSim, Art, Mason, CuReSim, etc.) and transforming the generated reads into the Rnf format.

ii) LAVEnder: Tool for the evaluation of read mappers using Rnf reads.

Source code and documentation:
http://github.com/karel-brinda/rnftools
http://rnftools.rtfd.org

Prerequisites:
– Unix-like system (Linux, OSX, etc.)
– Python 3.2+

Installation using Pip:
> pip install rnftools

Installation using Easy Install:
> easy_install rnftools

References

[1] K. Břinda, V. Boeva, G. Kucherov. RNF: a general framework to evaluate NGS read mappers. arXiv:1504.00556 [q-bio.GN], 2015.

[2] J. Köster and S. Rahmann. Snakemake – a scalable bioinformatics workflow engine. Bioinformatics 28(19): 2520–2522, 2012.

Read Naming Format

sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)__[paired_end],C:[100=,42=1X47=]

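Parts of the name above, separated by double underscores:

Prefix: sim
Read tuple ID: 0043fd1
Segments of reads: (3,13,F,027871,027970),(3,13,R,029171,029270)
Suffix (with comments and extensions): [paired_end],C:[100=,42=1X47=]

Each segment has the form (Genome ID, Chromosome ID, Direction, Leftmost coordinate, Rightmost coordinate).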
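The structure of a read name can be illustrated with a short parser. This is a minimal sketch, not code from RnfTools: the names `Segment`, `RnfName` and `parse_rnf_name` are hypothetical, and it assumes that the four parts are separated by double underscores and that the prefix and the read tuple ID contain no double underscore themselves.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    genome_id: int
    chr_id: int
    direction: str  # "F", "R", or "N" as seen in the examples
    left: int       # leftmost coordinate
    right: int      # rightmost coordinate


class RnfName(NamedTuple):
    prefix: str
    read_tuple_id: str
    segments: List[Segment]
    suffix: str


# One segment: (genome, chromosome, direction, left, right)
SEGMENT_RE = re.compile(r"\((\d+),(\d+),([FRN]),(\d+),(\d+)\)")


def parse_rnf_name(name: str) -> RnfName:
    """Split a read name into prefix, read tuple ID, segments and suffix."""
    parts = name.split("__", 3)
    if len(parts) != 4:
        raise ValueError("expected 4 parts separated by '__': %r" % name)
    prefix, read_tuple_id, segments_part, suffix = parts
    segments = [
        Segment(int(g), int(c), d, int(left), int(right))
        for g, c, d, left, right in SEGMENT_RE.findall(segments_part)
    ]
    if not segments:
        raise ValueError("no segments found in %r" % name)
    return RnfName(prefix, read_tuple_id, segments, suffix)


example = ("sim__0043fd1__(3,13,F,027871,027970),(3,13,R,029171,029270)"
           "__[paired_end],C:[100=,42=1X47=]")
parsed = parse_rnf_name(example)
```

The read tuple ID is kept as a string here, so no assumption is made about its numeric base; coordinates are converted to integers, which drops the zero padding.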

Example of simulated read tuples

Coor    12345678901234-5678901234567890123456789

Source 1 - reference genome
chr 1   ATGTTAGATAA-GATAGCTGTGCTAGTAGGCAGTCAGCCC
chr 2   ttcttctggaa-gaccttctcctcctgcaaataaa

Source 2 - generator of random sequences

READS:
r001    ATG-TAGATA ->
r002/1  TTAGATAACGA ->
r002/2  <- TCAG-CGGG
r003/1  tgcaaataa ->
r003/2  gaa-gacc-t ->
r004    ATAGCT............TCAG ->
r005    GTAGG ->
        <- agacctt
        <- TCGACACG
r006    ATATCACATCATTAGACACTA

Their corresponding Rnf names

read tuple  LRN                                                               SRN
r001        sim__1__(1,1,F,01,10)__[single_end]                               #1
r002        sim__2__(1,1,F,04,14),(1,1,R,31,39)__[paired_end]                 #2
r003        sim__3__(1,2,F,09,17),(1,2,F,25,33)__[mate_pair]                  #3
r004        sim__4__(1,1,F,15,36)__[spliced],C:[6=12N4=]                      #4
r005        sim__5__(1,1,R,15,22),(1,1,F,25,29),(1,2,R,05,11)__[chimeric]     #5
r006        rnd__6__(2,0,N,00,00)__[random]                                   #6
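The names listed above can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: `format_rnf_name` is a hypothetical helper, not part of RnfTools, and the zero-padding width of the coordinates (`coor_width`) is a free parameter chosen here to match the two-digit coordinates of the example.

```python
def format_rnf_name(prefix, read_tuple_id, segments, suffix, coor_width=2):
    """Compose a read name from its four parts.

    segments: iterable of (genome_id, chr_id, direction, left, right).
    coor_width: zero-padding width for coordinates (an assumption; the
    examples above use a uniform width within one data set).
    """
    segment_strings = []
    for genome_id, chr_id, direction, left, right in segments:
        segment_strings.append("({},{},{},{},{})".format(
            genome_id,
            chr_id,
            direction,
            str(left).zfill(coor_width),
            str(right).zfill(coor_width),
        ))
    return "__".join(
        [prefix, str(read_tuple_id), ",".join(segment_strings), suffix]
    )


# Read tuples r002 and r006 from the example
name_r002 = format_rnf_name(
    "sim", 2, [(1, 1, "F", 4, 14), (1, 1, "R", 31, 39)], "[paired_end]"
)
name_r006 = format_rnf_name("rnd", 6, [(2, 0, "N", 0, 0)], "[random]")
```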

LRN: Long read name.

SRN: Short read name. SRNs are used only if an LRN exceeds 255 characters (the maximum allowed read name length in Sam). In that case, an SRN-LRN correspondence file must be created.
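The fallback from LRN to SRN can be sketched as follows. This is only an illustration: `choose_read_name` is a hypothetical helper, the SRN form `#<ID>` is taken from the example names above, and a plain dictionary stands in for the SRN-LRN correspondence file, whose layout is not described here.

```python
MAX_NAME_LENGTH = 255  # limit quoted in the text above


def choose_read_name(lrn, read_tuple_id, correspondence):
    """Return the LRN if it fits, otherwise an SRN.

    When an SRN is used, the SRN -> LRN pair is recorded in
    `correspondence` (a dict standing in for the correspondence file).
    """
    if len(lrn) <= MAX_NAME_LENGTH:
        return lrn
    srn = "#{}".format(read_tuple_id)
    correspondence[srn] = lrn
    return srn


correspondence = {}
short_lrn = "sim__1__(1,1,F,01,10)__[single_end]"
# An artificially long name: many segments push it over the limit
long_lrn = "sim__7__" + ",".join(["(1,1,F,01,10)"] * 30) + "__[chimeric]"

name_short = choose_read_name(short_lrn, 1, correspondence)
name_long = choose_read_name(long_lrn, 7, correspondence)
```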

Evaluation of read mappers using Rnf-compatible programs

Read simulation: Genome 1, Genome 2, ..., Genome n (FASTA) → Read simulator (RNF encoding) → Reads (FASTQ)

Mapper evaluation: Reads (FASTQ) → Mapper → Alignment (BAM) → Mapper evaluation tool (RNF decoding) → Report (TXT/HTML)

RnfTools – example of usage

Steps:

1. Simulation of reads. 200,000 reads were simulated by DwgSim using MIShmash:
– 100,000 reads from a human genome (HG38),
– 100,000 reads from a mouse genome (MM10).

2. Mapping. All reads were mapped to HG38 by i) Yara, ii) Bwa-Mem, iii) Bwa-Sw, and iv) Bowtie2.

3. Evaluation. The obtained Bam files were evaluated using LAVEnder.

Figures:
– Comparison of the mappers with respect to correctly mapped reads.
– Detailed graph for Yara.
– Detailed graph for Bwa-Mem.

[Figure: Correctly mapped reads in all reads which should be mapped. Curves: BWA-MEM, BWA-SW, Bowtie2, YARA. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), ticks from 10^-4 to 10^0. y-axis: #correctly mapped reads / #reads which should be mapped, from 50 % to 100 %.]

[Figure: BWA-MEM. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), ticks 10^-2 and 10^-1. y-axis: Part of all reads (%), from 0 % to 100 %. Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]

[Figure: YARA. x-axis: FDR in mapping (#wrongly mapped reads / #mapped reads), ticks 10^-2 and 10^-1. y-axis: Part of all reads (%), from 0 % to 100 %. Categories: Unmapped correctly; Unmapped incorrectly; Thresholded correctly; Thresholded incorrectly; Multimapped; Mapped, should be unmapped; Mapped to wrong position; Mapped correctly.]
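The two ratios named on the figure axes can be written down directly. This is a minimal sketch, not LAVEnder code: the function names are hypothetical, the counts are supplied by the caller, and returning 0.0 for an empty denominator is a convention chosen here for illustration only.

```python
def mapping_fdr(n_wrongly_mapped, n_mapped):
    """FDR in mapping: #wrongly mapped reads / #mapped reads."""
    if n_mapped == 0:
        return 0.0  # convention chosen for this sketch
    return n_wrongly_mapped / n_mapped


def correctly_mapped_ratio(n_correctly_mapped, n_should_be_mapped):
    """#correctly mapped reads / #reads which should be mapped."""
    if n_should_be_mapped == 0:
        return 0.0  # convention chosen for this sketch
    return n_correctly_mapped / n_should_be_mapped


# Made-up counts, for illustration only (not results from the poster)
fdr = mapping_fdr(n_wrongly_mapped=10, n_mapped=1000)
ratio = correctly_mapped_ratio(n_correctly_mapped=90, n_should_be_mapped=100)
```

Which of the listed categories count as wrongly mapped is decided by the evaluation tool; the sketch leaves that decision to the caller.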