NGS techniques and data

NGS techniques and data relevant for metagenomics analyses

Lex Nederbragt

Norwegian Sequencing Center & Centre for Ecological and Evolutionary Synthesis, University of Oslo

The sequence revolution

Stratton et al Nature 458, 719-724

Norwegian Sequencing Center

www.sequencing.uio.no

This talk

• Technologies
  – 454
  – Illumina

• Topics
  – How does it work
  – What do you get
  – Quality check
  – Filtering

How does it work: 454

Library preparation

Shotgun library: starting from DNA sample

Amplicon library: starting from PCR product

Library preparation

Shotgun library

Fragmentation

Addition of adaptors

Amplicon library

Multiplexing

Amplicon library

Shotgun: tag in the adaptors
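
Once tagged samples have been sequenced together, the reads have to be sorted back by tag. A minimal sketch of that step, assuming the tag sits at the very start of each read; the function name, sample names and four-base tags below are made up for illustration and are not real MID sequences:

```python
def demultiplex(reads, tags):
    """Assign each read to the sample whose tag it starts with, and strip the tag.

    reads: list of sequence strings
    tags:  dict mapping sample name -> tag sequence (hypothetical values)
    """
    by_sample = {sample: [] for sample in tags}
    unassigned = []
    for read in reads:
        for sample, tag in tags.items():
            if read.startswith(tag):
                by_sample[sample].append(read[len(tag):])
                break
        else:
            # no tag matched exactly (e.g. a sequencing error in the tag)
            unassigned.append(read)
    return by_sample, unassigned


example_tags = {"sample1": "ACGT", "sample2": "TGCA"}  # hypothetical tags
example_reads = ["ACGTGGGG", "TGCAAAAA", "CCCCCCCC"]
by_sample, unassigned = demultiplex(example_reads, example_tags)
print(by_sample, unassigned)
```

Only exact matches are accepted here; real tools may tolerate mismatches in the tag.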

Amplification

Plate loading

Multiplexing

Flickr.com

2 lanes

4 lanes

8 lanes

16 lanes

Sequencing

PPi: pyrophosphate

Basecalling

Read length

500 bases

Coming soon

Single end

• Default single end sequencing
• Special protocols for mate-pairs

How does it work: Illumina

Library preparation

Multiplexing: same as for 454

Bridge amplification

Metzker 2010 Nat Rev Genet.11(1):31-46

Bridge amplification

Multiplexing

Flowcell: 8 lanes

Sequencing

Reversible terminators

Basecalling

Read length

454 GS FLX Titanium Illumina HiSeq

500 bases

Paired-end

• Default paired-end sequencing
  – single end also possible

150– 600 bases

What do you get?

454 Throughput

• GS FLX Titanium per-run output:
  – Up to 1.5 million single-end reads
  – Up to 600 megabases (Mb, million bases)
  – Less for amplicons

Illumina throughput (HiSeq 2000)

• Variable read length
  – 50, 100 (soon 150) bases
  – single or paired-end

• Per-run output:
  – Up to 1 billion (10^9) single-end reads
  – Up to 2 billion paired-end reads
  – Up to 200 gigabases (Gb, billion bases)
  – Soon: 3 times more reads and bases
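
The read and base counts are linked by simple arithmetic: bases per run = number of reads x read length. A small sketch using the maximum HiSeq 2000 figures from this slide (the function name is only for illustration):

```python
def bases_per_run(n_reads, read_length):
    """Total number of bases: reads times bases per read."""
    return n_reads * read_length


# HiSeq 2000, paired-end: up to 2 billion reads of 100 bases each
hiseq_bases = bases_per_run(2_000_000_000, 100)
print(hiseq_bases / 1e9, "Gb")  # prints 200.0 Gb
```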

What do you get? Errors!

http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html

Error profiles

454 GS FLX Titanium Illumina Genome Analyzer II

454 specific

3 G's? 4 G's?

Illumina specific

• Substitutions
  – e.g. A→G

• Underrepresentation of AT and GC rich regions

Solving errors

• Oversampling

Oversampling: 454

AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG

AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG

Undercall in two reads

Overcall in four reads

Consensus
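
The idea behind oversampling can be sketched as a per-column majority vote over the aligned reads. This is only an illustration, assuming the reads are already aligned with '-' for gaps; assemblers and mappers use more elaborate models. The five reads are the first five of the alignment on this slide:

```python
from collections import Counter


def consensus(aligned_reads):
    """Return the most common character in each alignment column."""
    columns = zip(*aligned_reads)  # assumes all reads have the same aligned length
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


reads = [
    "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG",
    "AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG",  # undercall
    "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG",  # overcall
    "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG",
    "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG",
]
print(consensus(reads))
```

With enough reads covering a position, an occasional under- or overcall is outvoted.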

Solving errors

• Oversampling
• 454 amplicons: AmpliconNoise
  – this course
• Illumina GC-bias: PCR conditions
  – Aird et al. Genome Biology 2011, 12:R18

Duplicate reads

• Illumina: PCR step in library prep
• 454: two beads in one microreactor
  – emulsion PCR

Chimeras

Haas B J et al. Genome Res. 2011;21:494-504

Chimeras

• 454 FLX Titanium
  – chimera rate of up to 20%

• >70% of sequences representing particular genera

Haas B J et al. Genome Res. 2011;21:494-504

Chimeras: solutions

• ChimeraSlayer
  – AmpliconNoise
• ChimeraCheck
  – Mothur

• See Haas et al. 2011 Genome Res. 21:494-504

What do you get? Bytes!

Filesizes

• 454
  – Up to 2 Gbytes per lane (sff)
  – two lanes
• HiSeq
  – Up to 20 Gbytes per lane (fastq)
  – eight lanes

Datafiles 454

• sff file (Standard Flowgram Format)
  – binary
• fasta & qual
  – text

454: sff file (text format)

>F7K88GK01BMPI0
Run Prefix: R_2009_12_18_15_27_42_
Region #: 1
XY Location: 0551_2346

Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/

Read Header Len: 32
Name Length: 14
# of Bases: 500
Clip Qual Left: 15
Clip Qual Right: 490
Clip Adap Left: 0
Clip Adap Right: 0

Flowgram: 1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05
Flow Indexes: 1 3 6 8 10 13 15 18 20 22 23 26 27 28 31 31 34 35 37 37 37 40 43 45 47 47 47 50 53 53 53 55 58 60 63 66 67 67 67 67 70 71 71 74 74 76 79 82 83 86 86 88 88 91 93 96 97...
Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCA...
Quality Scores: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 40 40 40 39 39 39 40 34 34 34 40 40 40 40 39 26 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...
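
How the flowgram relates to the bases can be sketched by rounding each flow signal to a whole number of incorporated nucleotides. This is a simplified illustration, not the actual 454 basecaller; it assumes the nucleotides are flowed in the repeating order T, A, C, G. The values are the 34 flow signals shown above:

```python
FLOW_ORDER = "TACG"  # assumed repeating nucleotide flow order


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Round each flow signal to a homopolymer length and emit that many bases."""
    bases = []
    for i, signal in enumerate(flowgram):
        n = int(signal + 0.5)  # nearest whole number of incorporated bases
        bases.append(flow_order[i % len(flow_order)] * n)
    return "".join(bases)


flowgram = [
    1.03, 0.00, 1.01, 0.02, 0.00, 0.96, 0.00, 1.00, 0.00, 1.04, 0.00, 0.00,
    0.97, 0.00, 0.96, 0.02, 0.00, 1.04, 0.01, 1.04, 0.00, 0.97, 0.96, 0.02,
    0.00, 1.00, 0.95, 1.04, 0.00, 0.00, 2.04, 0.02, 0.03, 1.05,
]
print(call_bases(flowgram))  # prints TCAGATCAGACACGCCA
```

The result matches the start of the Bases line (tcagatcagacacgCCA), and the signal of 2.04 in flow 31 explains why index 31 appears twice in the Flow Indexes. A signal halfway between two integers is where the "3 G's? 4 G's?" ambiguity arises.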

454: fasta and qual files

Fasta:
>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAATTGTCCCTTTGACATAACGACTAAAGGAGTCAACAGATTTTCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACGCTATT...

Qual:
>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
40 40 39 39 39 40 40 40 40 40 40 40 40 38 31 26 26 16 16 16 20 20 14 14 14 14 27 33 32 35 36 33 36 35 36 38 35 20 20 21 24 24 22 36 39 40 38 38 38 40 40 40 40 40 40 37 37 37 33 33 29 36 38 38 38 38 38 38 38 35 20 21 21 21 31 36 37 40 40 35 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...

Sanger-style Phred scores

Phred 40: chance of being wrong: 1:10^4.0 = 1:10000

Phred 35: chance of being wrong: 1:10^3.5 = 1:3162
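
These two ratios follow from the Phred definition, Q = -10 log10(P), so P = 10^(-Q/10). A short sketch of the conversion (the function name is only for illustration):

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


for q in (40, 35):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)}")
```

This prints 1 in 10000 for Q40 and 1 in 3162 for Q35, the two values on the slide.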

Illumina: fastq file

@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
CCAACATAGCTGGATGCCAACATAGCTGGATTGTTATAGCTGGTTTGCTTTTCTAACTCGCTGGAAGTTTATAAGCATTCCTACTATTTCATAGTATTAC
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
BBbfYcbV^BV`cQffaBZfB_fdfUYaa]`adcbfef\acfd^cad^fOabRceb`beSbdfaad_e^^dbeedTbd`V\ecdfffYBddb^fa\d\de

Quality score as characters: Phred score = ASCII value - 33 (Sanger format)

'B' is ASCII 66, so Phred 33
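
Decoding a quality string is a one-liner once the offset is known. A minimal sketch, with the offset as a parameter: 33 is the Sanger convention stated here, while some older Illumina pipelines use 64 (see the FASTQ variants described by Cock et al.). Which offset applies to a given file has to be checked; the slide does not say which pipeline version produced the example record.

```python
def decode_qualities(quality_string, offset=33):
    """Convert FASTQ quality characters to Phred scores for a given ASCII offset."""
    return [ord(character) - offset for character in quality_string]


print(decode_qualities("BBbfY"))      # Sanger offset 33
print(decode_qualities("BBbfY", 64))  # offset 64, used by some older Illumina pipelines
```

With offset 33 the first character 'B' decodes to 33; with offset 64 the same character decodes to 2.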

Matching pair in the other file:
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/2
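
Whether two reads belong together can be checked from their names. A small sketch, assuming the naming convention shown here, where the two mates differ only in the trailing /1 and /2 (the function name is made up for illustration):

```python
def is_mate_pair(name1, name2):
    """True if the names are identical apart from the /1 and /2 suffixes."""
    return (
        name1.endswith("/1")
        and name2.endswith("/2")
        and name1[:-2] == name2[:-2]
    )


forward = "@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1"
reverse = "@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/2"
print(is_mate_pair(forward, reverse))  # prints True
```

Filtering one file without the other breaks this pairing, so paired files are best filtered together.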

FastQ formats

Cock PJ et al 2009

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Nucleic Acids Res. 2010 Apr;38(6):1767-71.

http://en.wikipedia.org/wiki/Fastq

Quality control

Quality Control

• 454 (and others): Prinseq
• Illumina (and others): FastQC, fastQA, etc.

Prinseq

• http://edwards.sdsu.edu/prinseq_beta
• Web-based and stand-alone
• Upload
  – fasta file
  – qual file (optional)

Prinseq: read length

Prinseq: quality per position

Prinseq: quality values

Prinseq: duplicate reads

Prinseq: adaptors

No tag

Barcode (Roche 'MID')

Transcriptome library adaptor

Prinseq: contamination

The dinucleotide odds ratios*

Principal component analysis (PCA)

*dinucleotide frequencies normalized for the base composition
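
One common way to normalize dinucleotide frequencies for base composition is the odds ratio rho_XY = f_XY / (f_X * f_Y). The sketch below computes it for a single sequence; it is an illustration of the formula only, and Prinseq's own calculation may differ in detail (for example in how it treats the two strands). The example sequence is the start of the fasta read shown earlier.

```python
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {dinucleotide: f_XY / (f_X * f_Y)} for the dinucleotides present in seq."""
    seq = seq.upper()
    mono_counts = Counter(seq)
    di_counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = len(seq)
    n_di = len(seq) - 1
    ratios = {}
    for pair, count in di_counts.items():
        f_xy = count / n_di
        f_x = mono_counts[pair[0]] / n_mono
        f_y = mono_counts[pair[1]] / n_mono
        ratios[pair] = f_xy / (f_x * f_y)
    return ratios


ratios = dinucleotide_odds_ratios("AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAATTGTCCC")
for pair in sorted(ratios):
    print(pair, round(ratios[pair], 2))
```

A ratio near 1 means the dinucleotide occurs about as often as expected from the base composition; values well above or below 1 indicate over- or underrepresentation.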

FastQC

• http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

• Stand-alone
• GUI (Java based)
• Upload
  – fasta file
  – qual file (optional)

FastQC: quality per position

FastQC: quality values

FastQC: nucleotide composition

FastQC: GC distribution
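
The data behind a GC distribution plot can be sketched as the GC percentage of each read, binned per whole percent. This is an illustration of the idea, not FastQC's own code; the function names are made up.

```python
from collections import Counter


def gc_percent(seq):
    """Percentage of G and C bases in one read (0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def gc_histogram(reads):
    """Count reads per whole-percent GC bin."""
    return Counter(int(gc_percent(read)) for read in reads)


print(gc_histogram(["ATGC", "GGCC", "AATT", "ATGC"]))
```

A distribution with a second peak or a long shoulder can point to contamination or to the GC bias mentioned under the Illumina-specific errors.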

FastQC: duplicated reads

Filtering/trimming

• Adaptor removal
  – especially Illumina
• Duplicate removal
• Filtering for low quality bases
  – or stretches of them
  – reads with 'N's
• E.g.
  – FASTX-Toolkit
  – Prinseq
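
The filtering steps listed above can be sketched in a few lines. This is a toy illustration, not how FASTX-Toolkit or Prinseq are implemented; the thresholds (min_q, min_len) are arbitrary assumptions, and adaptor removal is left out.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Remove bases from the 3' end until one has quality >= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def filter_reads(reads, min_q=20, min_len=30):
    """reads: iterable of (name, seq, quals). Yields trimmed, unique reads without N."""
    seen = set()
    for name, seq, quals in reads:
        seq, quals = trim_low_quality_tail(seq, quals, min_q)
        if len(seq) < min_len or "N" in seq.upper():
            continue  # too short after trimming, or contains an ambiguous base
        if seq in seen:
            continue  # exact duplicate of a read seen earlier
        seen.add(seq)
        yield name, seq, quals


toy_reads = [
    ("r1", "ACGTACGT", [30, 30, 30, 30, 30, 30, 10, 10]),  # low-quality tail
    ("r2", "ACGTAC", [30] * 6),                             # duplicate of trimmed r1
    ("r3", "ACNTACGT", [30] * 8),                           # contains N
    ("r4", "AC", [30, 30]),                                 # too short
]
kept = list(filter_reads(toy_reads, min_q=20, min_len=4))
print(kept)
```

Only r1 survives, trimmed to ACGTAC. For paired-end data the same decision has to be applied to both mates.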

Other technologies

• Life Technologies
  – SOLiD
  – Ion Torrent
  – not much used for metagenomics
• Pacific Biosciences
  – PacBio RS
  – large potential

Pacific Biosciences

Zero Mode Waveguides

Pacific Biosciences

Videos

http://www.qiagen.com/media/player.aspx?movie=Pyrosequencing

http://www.youtube.com/watch?v=HtuUFUnYB9Y
