NGS techniques and data

Post on 10-May-2015


A talk I gave for the 2011 metagenomics course at the Biological Dept., Univ. of Oslo, April 2011


 NGS techniques and data relevant for metagenomics analyses

Lex Nederbragt
Norwegian Sequencing Center &
Centre for Ecological and Evolutionary Synthesis
University of Oslo

The sequence revolution

Stratton et al Nature 458, 719-724

Norwegian Sequencing Center

www.sequencing.uio.no

This talk

• Technologies
– 454
– Illumina

• Topics
– How does it work
– What do you get
– Quality check
– Filtering

How does it work: 454

Library preparation

Shotgun library Amplicon library

Starting from DNA sample Starting from PCR product

Library preparation

Shotgun library

Fragmentation

Addition of adaptors

[Diagram: adaptor A at the forward (Fw) end, adaptor B at the reverse (Rv) end]

Amplicon library

Multiplexing

[Diagram: adaptor A at the forward (Fw) end, adaptor B at the reverse (Rv) end]

Amplicon library

[Diagram labels: A, Fw, Tag]

Shotgun: tag in the adaptors

Amplification

Plate loading

Multiplexing

Flickr.com

2 lanes

4 lanes

8 lanes

16 lanes

Sequencing

PPi: pyrophosphate

Basecalling

Read length

500 bases

Coming soon

Single end

• Default single end sequencing
• Special protocols for mate-pairs

How does it work: Illumina

Library preparation

Multiplexing: same as for 454

Bridge amplification

Metzker 2010 Nat Rev Genet.11(1):31-46

Multiplexing

Flowcell: 8 lanes

Sequencing

Metzker 2010 Nat Rev Genet.11(1):31-46

Reversible terminators

Basecalling

Metzker 2010 Nat Rev Genet.11(1):31-46

Read length

454 GS FLX Titanium Illumina HiSeq

500 bases

Paired-end

• Default paired-end sequencing
– single end also possible

150–600 bases

What do you get?

454 Throughput

• GS FLX Titanium per-run output:
– Up to 1.5 million single-end reads
– Up to 600 megabases (Mb, million bases)
– Less for amplicons

Illumina throughput (HiSeq 2000)

• Variable length
– 50, 100 (soon 150) bases
– single or paired-end

• Per-run output:
– Up to 1 billion (10^9) single-end reads
– Up to 2 billion paired-end reads
– Up to 200 gigabases (Gb, billion bases)
– Soon: 3 times more reads and bases
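As a rough sanity check on the throughput figures above, total yield is simply the number of reads multiplied by the read length. The sketch below uses the read counts and the 100-base read length quoted on the slides; the mean 454 read length is not stated there and is only derived here from the two 454 numbers, so treat it as an estimate.

```python
def total_bases(n_reads, read_length):
    """Total sequenced bases for a run: reads x bases per read."""
    return n_reads * read_length

# HiSeq 2000: 2 billion paired-end reads of 100 bases each
illumina_gb = total_bases(2_000_000_000, 100) / 1e9   # 200.0 gigabases

# 454 GS FLX Titanium: 600 Mb spread over 1.5 million reads
# (derived mean read length, not a number given on the slide)
mean_454_read_length = 600e6 / 1.5e6                  # 400.0 bases
```

Both results agree with the slide: 200 Gb for the HiSeq run, and an average 454 read somewhat shorter than the 500-base maximum.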

What do you get? Errors!

http://www.it.bton.ac.uk/staff/je/java/jewl/tutorial/tutorial.html

Error profiles

454 GS FLX Titanium Illumina Genome Analyzer II

454 specific

3 G's? 4 G's?

Illumina specific

• Substitutions
– e.g. A→G

• Underrepresentation of AT and GC rich regions

Solving errors

• Oversampling

Oversampling: 454

AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG

AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG

Undercall in two reads

Overcall in three reads

Consensus
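The oversampling idea above can be sketched as a column-wise majority vote over the aligned reads: a homopolymer undercall or overcall in a minority of reads is simply outvoted. This is only a minimal illustration, not how an assembler or AmpliconNoise works; the function name `consensus` is made up here, and the three read variants are copied from the alignment above with the counts as they appear in this transcript.

```python
from collections import Counter

def consensus(aligned_reads):
    """Column-wise majority vote over equal-length aligned reads ('-' is a gap)."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]   # most frequent character in this column
        for column in zip(*aligned_reads)
    )

# The three variants seen in the alignment above
plain = "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG"
under = "AGAAAGTCAGCGGCAAATT-GGTTTTAGACGAA-TTGTCCCTTTGACATAACGACTAAAGG"  # TT instead of TTT
over  = "AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAAATTGTCCCTTTGACATAACGACTAAAGG"  # AAA instead of AA

reads = [plain] * 5 + [under] * 2 + [over] * 4
result = consensus(reads)
```

Here `result` equals the consensus line shown on the slide. Ties are resolved in favour of whichever character was seen first in a column, so at low coverage the vote is not reliable.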

Solving errors

• Oversampling
• 454 amplicons: AmpliconNoise
– this course

• Illumina GC-bias: PCR conditions
– Aird et al. Genome Biology 2011, 12:R18

Duplicate reads

• Illumina: PCR step in library prep
• 454: two beads in one microreactor
– emulsion PCR

Chimeras

Haas B J et al. Genome Res. 2011;21:494-504

Chimeras

• 454 FLX Titanium– chimera rate of up to 20% 

• >70% of sequences representing particular genera 

Haas B J et al. Genome Res. 2011;21:494-504

Chimeras: solutions

• ChimeraSlayer
– AmpliconNoise

• ChimeraCheck
– Mothur

• See Haas et al. 2011 Genome Res. 21:494-504

What do you get? Bytes!

Filesizes

• 454
– Up to 2 Gbytes per lane (sff)
– two lanes

• HiSeq
– Up to 20 Gbytes per lane (fastq)
– eight lanes

Datafiles 454

• sff file (standard flowgram format)
– binary

• fasta & qual
– text

454: sff file (text format)

>F7K88GK01BMPI0
Run Prefix: R_2009_12_18_15_27_42_
Region #: 1
XY Location: 0551_2346

Run Name: R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname
Analysis Name: D_2009_12_19_01_11_43_XX_fullProcessing
Full Path: /data/R_2009_12_18_15_27_42_FLX########_Administrator_yourrunname/D_2009_12_19_01_11_43_XX_fullProcessing/

Read Header Len: 32
Name Length: 14
# of Bases: 500
Clip Qual Left: 15
Clip Qual Right: 490
Clip Adap Left: 0
Clip Adap Right: 0

Flowgram: 1.03 0.00 1.01 0.02 0.00 0.96 0.00 1.00 0.00 1.04 0.00 0.00 0.97 0.00 0.96 0.02 0.00 1.04 0.01 1.04 0.00 0.97 0.96 0.02 0.00 1.00 0.95 1.04 0.00 0.00 2.04 0.02 0.03 1.05
Flow Indexes: 1 3 6 8 10 13 15 18 20 22 23 26 27 28 31 31 34 35 37 37 37 40 43 45 47 47 47 50 53 53 53 55 58 60 63 66 67 67 67 67 70 71 71 74 74 76 79 82 83 86 86 88 88 91 93 96 97...
Bases: tcagatcagacacgCCACTTTGCTCCCATTTCAGCACCCCACCAAGCACAAGGCTGTCATCCCAATTGGACGGACAGATATGAGGTTAGCATTGGAAACCAATTCAGTCCCTAATTATTCACGACTGAACCCAGCGACAATTGGACATGGATTCATTTTTCAACTTGATTTGTTGTTGTAAAAGCA...
Quality Scores: 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 38 38 38 40 40 40 39 39 39 40 34 34 34 40 40 40 40 39 26 26 26 26 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...
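To see how the flowgram relates to the bases, here is a minimal sketch of naive basecalling: each flow value is rounded to the nearest integer, and that many copies of the nucleotide flowed in that cycle are emitted. The cyclic flow order T, A, C, G is an assumption not stated on the slide (it does reproduce the start of the "Bases" line above), and `call_bases` is an illustrative name, not the vendor's basecaller.

```python
FLOW_ORDER = "TACG"  # assumption: nucleotides are flowed cyclically in this order

def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Round each flow signal to a homopolymer length and emit that many bases."""
    bases = []
    for i, signal in enumerate(flowgram):
        nucleotide = flow_order[i % len(flow_order)]
        homopolymer_length = int(signal + 0.5)   # round to the nearest integer
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)

# The 34 flow values shown in the sff text above
flowgram = [
    1.03, 0.00, 1.01, 0.02, 0.00, 0.96, 0.00, 1.00, 0.00, 1.04,
    0.00, 0.00, 0.97, 0.00, 0.96, 0.02, 0.00, 1.04, 0.01, 1.04,
    0.00, 0.97, 0.96, 0.02, 0.00, 1.00, 0.95, 1.04, 0.00, 0.00,
    2.04, 0.02, 0.03, 1.05,
]
called = call_bases(flowgram)   # "TCAGATCAGACACGCCA"
```

The result matches the first 17 bases of the read (`tcagatcagacacgCCA`), and the flows that produce bases are the ones listed under "Flow Indexes" (flow 31 appears twice because its signal of 2.04 yields two C's). A signal close to 3.5 is where the "3 G's? 4 G's?" ambiguity comes from.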

454: fasta and qual files

Fasta:
>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
AGAAAGTCAGCGGCAAATTTGGTTTTAGACGAATTGTCCCTTTGACATAACGACTAAAGGAGTCAACAGATTTTCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACGCTATT...

Qual:
>FTJD6BE02HHD3W length=409 xy=2951_1562 region=2 run=R_2009_04_01_11_28_49_
40 40 39 39 39 40 40 40 40 40 40 40 40 38 31 26 26 16 16 16 20 20 14 14 14 14 27 33 32 35 36 33 36 35 36 38 35 20 20 21 24 24 22 36 39 40 38 38 38 40 40 40 40 40 40 37 37 37 33 33 29 36 38 38 38 38 38 38 38 35 20 21 21 21 31 36 37 40 40 35 37 37 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40...

Sanger-style Phred scores

chance of being wrong: 1:10^4.0 = 1:10000

chance of being wrong: 1:10^3.5 = 1:3162
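The two figures above correspond to Phred scores of 40 and 35: a Phred score Q means an error probability of 10^(-Q/10). A small sketch of that conversion follows; the function names are illustrative only.

```python
def error_probability(phred):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred / 10)

def odds_against_error(phred):
    """The N in '1 in N chance of being wrong'."""
    return 10 ** (phred / 10)

q40 = odds_against_error(40)   # about 10000
q35 = odds_against_error(35)   # about 3162
```

Every 10 Phred points is a tenfold change in error rate, so the dips to 14 and 16 in the qual line above are calls that are wrong roughly 1 time in 25 to 40.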

Illumina: fastq file

@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
CCAACATAGCTGGATGCCAACATAGCTGGATTGTTATAGCTGGTTTGCTTTTCTAACTCGCTGGAAGTTTATAAGCATTCCTACTATTTCATAGTATTAC
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1
BBbfYcbV^BV`cQffaBZfB_fdfUYaa]`adcbfef\acfd^cad^fOabRceb`beSbdfaad_e^^dbeedTbd`V\ecdfffYBddb^fa\d\de

Quality score as characters: Phred score = ASCII value - 33
'B' is ASCII 66 → Phred 33
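The character-to-score rule above is a one-liner in code. In this sketch the offset is a parameter, because the Solexa/Illumina FASTQ variants described by Cock et al. use an offset of 64 rather than the Sanger offset of 33; the function name is illustrative.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]

sanger = decode_quality("B")                  # [33], as on the slide
older_illumina = decode_quality("Bf", 64)     # [2, 38] under the offset-64 variant
```

A quick check of which variant a file uses: with offset 33, characters such as 'f' (ASCII 102) would decode to Phred 69, which is implausibly high, so a file full of lowercase letters is more likely to be an offset-64 file.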

Matching pair in the other file:
+@PCUS-319-EAS487_0004_FC:6:1:1351:952#0/2
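With this naming scheme the two reads of a pair differ only in the trailing /1 or /2, so checking that two files are still in sync (for example after filtering) comes down to comparing names. A minimal sketch, assuming names end in /1 or /2 as in the example above; the helper names are made up for illustration.

```python
def mate_name(read_name):
    """Return the name of the other read in the pair (/1 <-> /2)."""
    if read_name.endswith("/1"):
        return read_name[:-1] + "2"
    if read_name.endswith("/2"):
        return read_name[:-1] + "1"
    raise ValueError("read name does not end in /1 or /2: " + read_name)

def are_mates(name_a, name_b):
    """True if the two read names belong to the same pair."""
    return mate_name(name_a) == name_b

forward = "PCUS-319-EAS487_0004_FC:6:1:1351:952#0/1"
reverse = mate_name(forward)
```

Walking both files in parallel and calling `are_mates` on each pair of names is a cheap way to detect that a filtering step dropped a read from only one of the files.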

FastQ formats

Cock PJ et al 2009

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Nucleic Acids Res. 2010 Apr;38(6):1767-71. 

and

http://en.wikipedia.org/wiki/Fastq

Quality control

Quality Control

• 454 (and others): Prinseq
• Illumina (and others): fastQC, fastQA, etc.

Prinseq

• http://edwards.sdsu.edu/prinseq_beta
• Web-based and stand-alone
• Upload
– fasta file
– qual file (optional)

Prinseq: read length

Prinseq: quality per position

Prinseq: quality values

Prinseq: duplicate reads

Prinseq: adaptors

No tag

Barcode (Roche 'MID')

Transcriptome library adaptor

Prinseq: contamination

The dinucleotide odds ratios*

Principal component analysis (PCA)

*dinucleotide frequencies normalized for the base composition

FastQC

• http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

• Stand-alone
• GUI (Java based)
• Upload
– fasta file
– qual file (optional)

FastQC: quality per position

FastQC: quality values 

FastQC: nucleotide composition 

FastQC: GC distribution 

FastQC: duplicated reads 

Filtering/trimming

• Adaptor removal
– especially Illumina

• Duplicate removal
• Filtering for low quality bases
– or stretches of them
– reads with 'N's

• E.g.
– fastX toolkit
– prinseq
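The filtering steps listed above can be sketched in a few lines. This is not how the fastX toolkit or prinseq are implemented; it is only an illustration of the ideas, and the thresholds (`min_q=20`, `min_length=50`), the function names, and the choice to define duplicates as exact sequence matches are all assumptions. Adaptor removal is left out.

```python
def trim_low_quality_tail(seq, quals, min_q=20):
    """Cut bases off the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def filter_reads(reads, min_q=20, min_length=50):
    """Yield (name, seq, quals) for reads that survive all filters.

    reads: iterable of (name, sequence, list of Phred scores).
    """
    seen = set()
    for name, seq, quals in reads:
        if "N" in seq.upper():               # drop reads with ambiguous bases
            continue
        seq, quals = trim_low_quality_tail(seq, quals, min_q)
        if len(seq) < min_length:            # drop reads that became too short
            continue
        if seq in seen:                      # drop exact duplicates
            continue
        seen.add(seq)
        yield name, seq, quals

example = [
    ("r1", "ACGTACGT", [40, 40, 40, 40, 40, 40, 10, 5]),   # low-quality tail
    ("r2", "ACNTACGT", [40] * 8),                          # contains an N
    ("r3", "ACGTAC", [40] * 6),                            # duplicate of trimmed r1
    ("r4", "AC", [40, 40]),                                # too short
]
kept = list(filter_reads(example, min_q=20, min_length=4))
```

Only `r1` survives here, trimmed to `ACGTAC`. Note that duplicates are detected after trimming, which is a design choice; for paired-end data both mates would have to be kept or dropped together.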

Other technologies

• Life Technologies
– SOLiD
– ionTorrent
– not much used for metagenomics

• Pacific Biosciences
– PacBio RS
– large potential

Pacific Biosciences

Metzker 2010 Nat Rev Genet.11(1):31-46

Zero Mode Waveguides
