Data analysis methods for next-generation sequencing technologies

Gabor T. Marth
Boston College Biology Department

Epigenomics & Sequencing Meeting
July 14-15, 2008, Boston, MA

T1. Roche / 454 FLX system

• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size

T2. Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation

T3. AB / SOLiD system

[Figure: 2-base encoding table; labels: A C G T, 2nd Base]

• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
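The 2-base encoding can be illustrated with a minimal sketch. It assumes the commonly described scheme in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). The function names are hypothetical, and this is not the vendor's software.

```python
# Illustrative sketch of 2-base (color-space) encoding; all names are hypothetical.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_colors(seq):
    """Return one color (0-3) for each pair of adjacent bases in seq."""
    codes = [BASE_CODE[base] for base in seq]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def color_differences(ref, read):
    """Indices at which the color strings of two equal-length sequences differ."""
    ref_colors = encode_colors(ref)
    read_colors = encode_colors(read)
    return [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]

reference = "ACGTAC"
snp_read = "ACGAAC"  # single base substitution at index 3 (T -> A)

print(encode_colors(reference))
print(color_differences(reference, snp_read))
```

Under this assumption a true single-base substitution changes two adjacent colors, whereas an isolated single color difference is more likely a measurement error. This is the property that can be used for error reduction.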

T4. Helicos / Heliscope system

• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing

A1. Variation discovery: SNPs and short-INDELs

1. sequence alignment

2. dealing with non-unique mapping

3. looking for allelic differences
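Step 3 can be sketched as simple allele counting over aligned reads. This is a deliberately naive illustration, not a production SNP caller: it assumes ungapped alignments, ignores base qualities, and uses hypothetical thresholds (`min_depth`, `min_alt_fraction`).

```python
from collections import Counter

def call_candidate_snps(reference, aligned_reads, min_depth=3, min_alt_fraction=0.3):
    """Flag positions where a non-reference allele is well supported.

    aligned_reads: list of (start, sequence) with 0-based, ungapped alignments.
    Returns a list of (position, ref_base, alt_base, alt_count, depth).
    """
    # Build a per-position count of observed bases (a minimal "pileup").
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        alt_counts = {b: n for b, n in counts.items() if b != reference[pos]}
        if not alt_counts:
            continue
        alt_base, alt_count = max(alt_counts.items(), key=lambda kv: kv[1])
        if alt_count / depth >= min_alt_fraction:
            calls.append((pos, reference[pos], alt_base, alt_count, depth))
    return calls

reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "TCGTAC")]
print(call_candidate_snps(reference, reads))
```

In the toy data, three of the four reads covering position 4 carry T where the reference has A, so that position is reported as a candidate.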

A2. Structural variation detection

• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

• copy number (for amplifications, deletions) from depth of read coverage
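Both signals can be sketched in a few lines. The classification rules, the expected orientation (+/-), and the span thresholds below are illustrative assumptions, not values from any particular pipeline.

```python
from statistics import median

def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  min_span=100, max_span=600):
    """Classify a read pair by its map locations (hypothetical rules)."""
    if chrom1 != chrom2:
        return "translocation candidate"
    if strand1 == strand2:
        # Ends of a normal pair are expected on opposite strands.
        return "inversion candidate"
    span = abs(pos2 - pos1)
    if span > max_span:
        # Ends map farther apart than the library allows.
        return "deletion candidate"
    if span < min_span:
        # Ends map closer together than the library allows.
        return "insertion candidate"
    return "concordant"

def depth_ratios(window_depths):
    """Read depth of each window relative to the median window depth."""
    typical = median(window_depths)
    return [depth / typical for depth in window_depths]

print(classify_pair("chr1", 1000, "+", "chr1", 5000, "-"))
print(depth_ratios([20, 20, 40, 10, 20]))
```

A depth ratio near 2 would suggest an amplification and a ratio near 0.5 a heterozygous deletion, assuming coverage is otherwise even.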

A3. Identification of protein-bound DNA

genome sequence

aligned reads

Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)

Transcription factor binding sites (Robertson et al. Nature Methods, 2007)

A4. Novel transcript discovery (genes)

Mortazavi et al. Nature Methods

A5. Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006

A6. Expression profiling by tag counting

[Figure: aligned reads over two genes]

Jones-Rhoads et al. PLoS Genetics, 2007
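Tag counting reduces to counting aligned reads per gene. A minimal sketch, assuming each read is represented only by its start coordinate and each gene by a half-open interval; the per-kilobase normalization is an optional addition and not something the slide specifies.

```python
def count_tags(genes, read_starts):
    """Count reads whose start falls inside each gene interval.

    genes: dict mapping gene name -> (start, end), half-open coordinates.
    read_starts: list of aligned read start positions.
    """
    counts = {name: 0 for name in genes}
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

def tags_per_kb(counts, genes):
    """Optional length normalization: tags per 1000 bp of gene."""
    return {name: counts[name] * 1000 / (genes[name][1] - genes[name][0])
            for name in genes}

genes = {"geneA": (100, 1100), "geneB": (2000, 2500)}
read_starts = [150, 300, 900, 1500, 2100, 2499, 2500]

counts = count_tags(genes, read_starts)
print(counts)
print(tags_per_kb(counts, genes))
```

The nested loop is fine for a toy example; real data volumes would call for an interval index.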

A7. De novo organismal genome sequencing

assembled sequence contigs

short reads

longer reads

read pairs

Lander et al. Nature 2001

C1. Read length

[Figure: read length by technology; axis: read length [bp], 0-300; bar labels: ~200-450 (var), 25-40 (fixed), 25-35 (fixed), 20-35 (var)]

When does read length matter?

• short reads often sufficient where the entire read length can be used for mapping:

SNPs, short-INDELs, SVs
CHIP-SEQ
short RNA discovery
counting (mRNA, miRNA)

• longer reads are needed where one must use parts of reads for mapping:

de novo sequencing

novel transcript discovery

aacttagacttacagacttacatacgta

Known exon 1 Known exon 2

accgattactatacta

C2. Read error rate

• error rate dictates the stringency of the read mapper

• error rate typically 0.4 - 1%

• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
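A rough calculation shows why the mapper must tolerate mismatches. The sketch below assumes independent errors at a uniform per-base rate, which is a simplification; the function name and the 35 bp read length are illustrative.

```python
from math import comb

def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read contains at most k sequencing errors."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate)**(read_length - i)
        for i in range(k + 1)
    )

# Fraction of 35 bp reads that a mapper allowing k mismatches could place,
# at a 1% per-base error rate and ignoring true sequence variation.
for k in (0, 1, 2):
    print(k, round(prob_at_most_k_errors(35, 0.01, k), 4))
```

Allowing more mismatches recovers more of the error-containing reads, at the cost of more candidate locations per read and therefore fewer unique placements.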

[Figure: axis "Number of mismatches allowed" (0, 1, 2)]

Error rate grows with each cycle

[Figure: axis "Position on Read" (0-40); error rate in percent]

• this phenomenon limits useful read length

Substitutions vs. INDEL errors

C3. Representational biases / library complexity

fragmentation biases

amplification biases

sequencing biases

sequencing

low/no representation

high representation

Dispersal of read coverage

• this affects variation discovery (deeper starting read coverage is needed)
• it should have a major impact on counting applications

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated onto every clonal copy
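One common mitigation, not described on the slide, is to collapse reads that look like clonal copies before allele calling. The sketch below assumes that reads sharing chromosome, start position, and strand are copies of one fragment; the function name is hypothetical.

```python
def collapse_clonal_reads(aligned_reads):
    """Keep the first read seen for each (chrom, start, strand) combination.

    aligned_reads: iterable of (chrom, start, strand, sequence) tuples.
    """
    seen = set()
    kept = []
    for read in aligned_reads:
        key = read[:3]  # chrom, start, strand
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

reads = [
    ("chr1", 100, "+", "ACGTTC"),
    ("chr1", 100, "+", "ACGTTC"),  # presumed clonal copy
    ("chr1", 100, "-", "GAACGT"),  # same start, opposite strand: kept
    ("chr1", 104, "+", "TCGGAT"),
]
print(len(collapse_clonal_reads(reads)))
```

At very deep coverage this heuristic would also discard independent fragments that happen to start at the same position, so it trades some depth for fewer false positive allele calls.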

C4. Paired-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
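The rescue case can be sketched as follows. The function name and the fragment length bounds are illustrative assumptions; strand checks are omitted for brevity.

```python
def rescue_pair(unique_end_pos, candidate_positions,
                min_fragment=100, max_fragment=600):
    """Pick the map location of a non-uniquely mapping end using its mate.

    unique_end_pos: position of the uniquely mapped end.
    candidate_positions: all map locations of the other end (same chromosome).
    Returns the single candidate consistent with the fragment length
    constraint, or None if zero or several candidates are consistent.
    """
    consistent = [
        pos for pos in candidate_positions
        if min_fragment <= abs(pos - unique_end_pos) <= max_fragment
    ]
    return consistent[0] if len(consistent) == 1 else None

print(rescue_pair(10_000, [500, 10_350, 88_000]))
print(rescue_pair(10_000, [10_200, 10_400]))
```

The second call returns None because two candidates satisfy the constraint, so the end remains ambiguous.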

Technologies / properties / applications

Technology                Roche/454             Illumina/Solexa   AB/SOLiD

Read properties
Read length               200-450bp             20-50bp           25-50bp
Error rate                <0.5%                 <1.0%             <0.5%
Dominant error type       INDEL                 SUB               SUB
Quality values available  yes                   yes               not really
Paired-end separation     < 10kb (3kb optimal)  100 - 600bp       500bp - 10kb (3kb optimal)

Applications (columns: Roche/454, Illumina/Solexa, AB/SOLiD)

SNP discovery ● ● ○

short-INDEL discovery ● ○

SV discovery ○ ○ ●

CHIP-SEQ ○ ● ●

small RNA/gene discovery ○ ● ●

mRNA Xcript discovery ● ○ ○

Expression profiling ○ ● ●

De novo sequencing ● ? ?

Resequencing-based SNP discovery

(i) base calling

(ii) micro-repeat analysis

(iii) read mapping (pair-wise alignment to genome reference)

(iv) read assembly

(v) SNP calling

(vi) SNP validation

(vii) data viewing, hypothesis generation

The “toolbox”

• base callers

• microrepeat finders

• read mappers

• SNP callers

• structural variation callers

• assembly viewers

Reference guided read mapping

Reference-sequence guided mapping:

…you get the pieces…

…AND they give you the cover on the box

Some pieces are more unique than others

MOSAIK: an anchored aligner / assembler

Step 1. initial short-hash scan for possible read locations

Step 2. evaluation of candidate locations with SW method

Michael Stromberg
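The two steps can be sketched generically as a k-mer hash scan followed by Smith-Waterman scoring of each candidate window. This is not MOSAIK's actual implementation: the function names, the hash size `k`, the window padding, and the scoring values are all illustrative assumptions.

```python
from collections import defaultdict

def build_hash_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_starts(read, index, k):
    """Step 1: reference start positions implied by each k-mer hit of the read."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            starts.add(hit - offset)
    return sorted(starts)

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a against b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def map_read(read, reference, index, k, pad=2):
    """Step 2: score each candidate window; return (best_score, start) or None."""
    scored = []
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        scored.append((smith_waterman_score(read, reference[lo:hi]), start))
    return max(scored) if scored else None

reference = "TTGACCGTAGGCTAACGTTAGC"
index = build_hash_index(reference, k=4)
print(map_read("CGTAGGCTAAC", reference, index, k=4))
print(map_read("CGTAGGATAAC", reference, index, k=4))  # one mismatch
```

The hash scan keeps the expensive dynamic-programming step confined to a few short windows, and the padded window leaves room for gapped alignments.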

Non-unique mapping, gapped alignments

1. Non-unique read mapping: optionally either only report uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)

2. Gapped alignments: allow for mapping reads with insertion or deletion sequencing errors, and reads with bona fide INDEL alleles

Read types aligned, paired-end read strategy

3. Aligns and co-assembles customary read types:
• ABI/capillary
• Illumina/Solexa
• AB/SOLiD
• Roche/454
• Helicos/Heliscope

[Figure: co-assembled reads; track labels: ABI/capillary, 454 FLX, 454 GS20, Illumina]

4. Paired-end read alignments
