2013 pag-equine-workshop

C. Titus Brown Asst Prof, CSE and Microbiology; BEACON NSF STC Michigan State University [email protected] Next-Gen Sequencing: 4 years in the trenches

Transcript of 2013 pag-equine-workshop

Page 1: 2013 pag-equine-workshop

C. Titus Brown

Asst Prof, CSE and Microbiology; BEACON NSF STC

Michigan State University

[email protected]

Next-Gen Sequencing: 4 years in the trenches

Page 2: 2013 pag-equine-workshop

These slides are available online.

“titus brown slideshare”

You can also e-mail me: [email protected]

Also note that these are my opinions and observations, culled from personal experience, online material, and reading. I’m happy to cite/explain further upon request, but:

Your Mileage May Vary

Page 3: 2013 pag-equine-workshop

Things I won’t talk about

Don’t work on/with/have anything useful to say about:

Exome sequencing

Ancient DNA

ChIP-seq (protein-DNA interactions)

Work on but you’re probably not interested in:

Metagenomics (sequencing uncultured microbial communities)

Bioinformatics data structures and algorithms

Page 4: 2013 pag-equine-workshop

Overview

Shotgun sequencing basics

Things everyone wants to know: how much $$...

Various current problems & challenges

Technology, now and future

Some papers and projects worth looking at; & our own experiences

Page 5: 2013 pag-equine-workshop
Page 6: 2013 pag-equine-workshop
Page 7: 2013 pag-equine-workshop

Two specific concepts:

First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)

Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in linear density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.

These two concepts underlie the recent stunning increases in sequencing capacity.
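The second concept is simple arithmetic, sketched here in Python. The function name and the example factors are illustrative assumptions, not figures from any instrument: packing features more tightly by some factor along each axis of a 2-D surface multiplies the number of features, and so the number of sequences, by the square of that factor.

```python
def relative_read_count(linear_density_factor: float) -> float:
    """Fold change in sequences per run when features on a 2-D surface
    are packed `linear_density_factor` times closer along each axis."""
    if linear_density_factor <= 0:
        raise ValueError("density factor must be positive")
    # Both dimensions tighten, so the feature count scales with the square.
    return linear_density_factor ** 2


if __name__ == "__main__":
    for factor in (1.5, 2, 3):  # hypothetical improvements
        gain = relative_read_count(factor)
        print(f"{factor}x denser per axis -> {gain:g}x sequences")
```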

Page 8: 2013 pag-equine-workshop
Page 9: 2013 pag-equine-workshop
Page 10: 2013 pag-equine-workshop
Page 11: 2013 pag-equine-workshop
Page 12: 2013 pag-equine-workshop

What are current costs for Illumina?

Approximate costs from MSU sequencing center, a few months ago, including labor:

RNAseq: $200 prep / sample

Single-ended 1x50 -- $1100/lane -- 100-150 million reads

Paired-end 2x100 -- $2500/lane -- 200-300 million reads (/ 2)

Barcoding samples, etc., gets complicated.

Discuss biology, etc. with a sequencing geek before going forward!
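As a rough planning aid, the lane prices above can be turned into a per-sample estimate. This is a minimal sketch that assumes reads are split evenly across barcoded samples and ignores any barcoding overhead. The $200 prep, $1100 lane, and 100 million reads are the slide's single-end figures (low end); the 25 million reads per sample is a hypothetical target chosen only for illustration.

```python
def rnaseq_cost_per_sample(lane_cost, reads_per_lane, reads_per_sample, prep_cost):
    """Return (cost per sample, samples per lane), assuming an even split
    of one lane's reads across barcoded samples."""
    samples_per_lane = reads_per_lane // reads_per_sample
    if samples_per_lane < 1:
        raise ValueError("one lane does not yield enough reads for one sample")
    # Each sample pays its own prep plus an equal share of the lane.
    cost = prep_cost + lane_cost / samples_per_lane
    return cost, samples_per_lane


if __name__ == "__main__":
    cost, n = rnaseq_cost_per_sample(
        lane_cost=1100,               # 1x50 lane price from the slide
        reads_per_lane=100_000_000,   # low end of the 100-150 million range
        reads_per_sample=25_000_000,  # hypothetical target, not from the slide
        prep_cost=200,                # RNAseq prep per sample from the slide
    )
    print(f"{n} samples per lane, about ${cost:.0f} per sample")
```

Real multiplexed runs rarely split evenly, so treat the result as a lower bound on cost per sample.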

Page 13: 2013 pag-equine-workshop

What does this data really give you??

With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100m reads/sample is probably “enough” (see: http://blog.fejes.ca/?p=607 for a good discussion).

With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth.

De novo assembly of complex vertebrate genomes is not casual:

Cheap short-read sequencing does not yet deliver good long-range contiguity; repeats, heterozygosity get in the way.

Assembly & scaffolding process itself is still evolving.
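A depth recommendation such as 20x converts into a read count through depth = (number of reads x read length) / genome size. The sketch below applies that relationship; the 2.5 Gb genome size is a hypothetical round number for a large vertebrate genome, not a value from the slides.

```python
import math


def fragments_for_depth(genome_size_bp, target_depth, read_length_bp, paired=True):
    """Number of sequenced fragments (read pairs if `paired`) needed to
    reach `target_depth` average coverage of a genome."""
    bases_needed = genome_size_bp * target_depth
    bases_per_fragment = read_length_bp * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_fragment)


if __name__ == "__main__":
    pairs = fragments_for_depth(
        genome_size_bp=2_500_000_000,  # hypothetical genome size
        target_depth=20,               # depth recommended on the slide
        read_length_bp=100,            # 2x100 paired-end reads
    )
    print(f"{pairs / 1e6:.0f} million read pairs for 20x")
```

This counts raw bases only; reads lost to quality filtering or failed mapping would push the requirement higher.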

Page 14: 2013 pag-equine-workshop

Why so much data?

Why do we need 10-20x coverage (resequencing) or 50-100m reads (mRNAseq) with Illumina?

Two (linked) reasons:

Shotgun sequencing is random

Counting/sampling variation

Page 15: 2013 pag-equine-workshop

1. Useful minimum coverage depends on high average coverage
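One way to see why the minimum depends on the average: if reads land at random, per-base depth is approximately Poisson distributed around the mean coverage. The sketch below uses that idealized Poisson model (an assumption; real libraries tend to be more uneven) to compute the fraction of bases that reach a chosen minimum depth. The minimum of 10x is an arbitrary illustration.

```python
import math


def fraction_at_least(mean_coverage, min_depth):
    """P(depth >= min_depth) for a base, under a Poisson model with the
    given mean coverage."""
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** i / math.factorial(i)
        for i in range(min_depth)
    )
    return 1.0 - below


if __name__ == "__main__":
    for mean in (5, 10, 20, 30):
        frac = fraction_at_least(mean, 10)
        print(f"mean {mean}x: {frac:.3f} of bases covered at >= 10x")
```

Under this model, an average equal to the desired minimum leaves a large share of bases short of it, which is why the average has to sit well above the minimum.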

Page 16: 2013 pag-equine-workshop

2. mRNAseq quantitation – must overcome sampling variation
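The sampling problem can be illustrated with pure counting statistics. Assuming read counts for a transcript are Poisson distributed (a simplification that leaves out biological and library-prep variation), the coefficient of variation of a count with mean n is 1/sqrt(n), so rare transcripts need deep sequencing before their counts stabilize. The transcript fraction below is a hypothetical example.

```python
import math


def expected_count(total_reads, transcript_fraction):
    """Expected number of reads from a transcript that makes up
    `transcript_fraction` of the sequenced molecules."""
    return total_reads * transcript_fraction


def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson count."""
    if mean_count <= 0:
        raise ValueError("mean count must be positive")
    return 1.0 / math.sqrt(mean_count)


if __name__ == "__main__":
    fraction = 1e-6  # hypothetical: one read per million comes from this transcript
    for total in (10_000_000, 50_000_000, 100_000_000):
        n = expected_count(total, fraction)
        print(f"{total / 1e6:.0f}M reads: ~{n:.0f} reads expected, CV ~{poisson_cv(n):.2f}")
```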

Page 17: 2013 pag-equine-workshop

Coverage conclusions

More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample).

Your desired coverage numbers should be driven by sensitivity considerations.

Page 18: 2013 pag-equine-workshop

Problems and challenges

Systematic bias in sequencing and software.

Genome assembly: scaffolding and sensitivity

Gene references

mRNAseq isoform construction

Page 19: 2013 pag-equine-workshop

Resequencing: bias and error

Calling SNPs by mapping -- U. Colorado

http://genomics-course.jasondk.org/?p=395

Page 20: 2013 pag-equine-workshop

Both sequencing and bioinformatics yield many low-frequency artifacts!

“Obvious” things like misalignments to paralogous/repeat sequences.

Indels are handled badly by current tools (up to 60% false positive rate?!)

Oxidation of DNA during library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets.

=> With any data set, especially big ones, there will be both random and systematic error and bias.

http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/

Page 21: 2013 pag-equine-workshop

Suggestion: Cortex variant caller

Iqbal et al., Nat Genet. 2012, pmid 22231483

Page 22: 2013 pag-equine-workshop

Genome assembly: scaffolding & sensitivity

Everyone wants two things from a genome assembly --

Long/correct scaffolds

See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon

Complete genome content

Page 23: 2013 pag-equine-workshop

Sequence data

Reads

[Figure: original DNA, fragments, sequenced ends]

http://www.cbcb.umd.edu/research/assembly_primer.shtml

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 24: 2013 pag-equine-workshop

Contigs

Building contigs

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 25: 2013 pag-equine-workshop

Scaffolds

Ordered, oriented contigs

[Figure labels: contigs, mate pairs, gap size estimate, scaffold, contig, gap]

http://dx.doi.org/10.6084/m9.figshare.100940

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 26: 2013 pag-equine-workshop

Longer reads!

Long reads can span repeats and heterozygous regions

[Figure labels: Repeat copy 1, Repeat copy 2; Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4]

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 27: 2013 pag-equine-workshop

Cod: PacBio results

Mapping to the published genome

[Figure labels: 11.4 kbp subread, 10.6 kbp subread, 10.9 kbp subread]

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 28: 2013 pag-equine-workshop

Sensitivity – does your genome include everything?

Generally not!

For example, the chick genome is missing a substantial number of genes from microchromosomes:

723 genes from HSA19q missing from chicken galGal4.

ESTs and RNAseq transcripts for many or most.

Page 29: 2013 pag-equine-workshop

Approach - Digital normalization

(a computational version of library normalization)

Digital normalization “smooths out” coverage from different loci, and can “recover” low coverage regions for assembly.
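The slides do not spell out the algorithm, so the following is a toy sketch of the idea as it is commonly described: stream through the reads, estimate each read's coverage as the median abundance of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. The function names, the exact dictionary counter, and the default k and cutoff are illustrative assumptions; this version also ignores reverse complements and is not memory-efficient.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below `cutoff`.

    Coverage is estimated as the median count of the read's k-mers
    among the reads kept so far.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


if __name__ == "__main__":
    common = "ACGTTGCAAG"  # stands in for a highly covered locus
    rare = "TTTTGGGGCC"    # stands in for a poorly covered locus
    kept = digital_normalization([common] * 10 + [rare], k=4, cutoff=3)
    print(f"kept {kept.count(common)} of 10 common reads, "
          f"{kept.count(rare)} of 1 rare read")
```

In the toy run the redundant reads from the deep locus are mostly discarded while the single read from the shallow locus is retained, which is the "smoothing" effect described above.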

Page 30: 2013 pag-equine-workshop

Applying diginorm to increase sensitivity

Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours.

Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq.

Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), Jerry Dodgson (MSU) proposing to apply PacBio and normalization to improve chick genome; should be generalizable approach.

Page 31: 2013 pag-equine-workshop

Mapping => mRNAseq quantitation

Reference transcriptome required.

Page 32: 2013 pag-equine-workshop

Existing chick gene models lack exons, isoforms

*This gene contains at least 4 isoforms.

Our data

Models

Likit Preeyanon

Page 33: 2013 pag-equine-workshop

(Exon detection is pretty good.)

Likit Preeyanon

Page 34: 2013 pag-equine-workshop

Gene Modeler Pipeline (“gimme”?)

Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate.

Construct gene models

Remove redundant sequences

Predict strands and ORFs

Likit Preeyanon

Page 35: 2013 pag-equine-workshop

Some thoughts on bioinfo

Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use.

In NGS, online information (seqanswers, biostar, Twitter) is generally far more up to date than publications.

Page 36: 2013 pag-equine-workshop

Technology – where next?

Most slides taken from Lex Nederbragt:

http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond

Page 37: 2013 pag-equine-workshop

High-throughput sequencing

Phase 1: more is better

2005 GS20: 200 000 reads, 100 bp, 0.02 Gb/run

2011 GS FLX+: 1.2 million reads, 750 bp, 0.7 Gb/run

2006 GA: 28 million reads, 25 bp, 0.7 Gb/run

2011 HiSeq 2000: 3 billion reads, 2x100 bp, 600 Gb/run

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 38: 2013 pag-equine-workshop

High-throughput sequencing

Phase 2: smaller is better

GS Junior from Roche/454: 0.04 Gb/run, 400 bp reads (vs. GS FLX+: 0.7 Gb/run, 700 bp reads)

MiSeq from Illumina: 4.5 Gb/run, 2x150 bp reads (vs. HiSeq 2000: 600 Gb/run, 2x100 bp reads)

PGM from Ion Torrent/Life Technologies: 0.01, 0.1 or 1 Gb/run, 100 or 200 bp reads

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 39: 2013 pag-equine-workshop

High-throughput sequencing

Why benchtop sequencing instruments?

Affordable price per instrument

Small projects

Diagnostics

Fast turn around time

http://pennystockalerts.com/ http://www.highqualitylinkbuildingservice.com/ http://www.vetlearn.com/ http://vanillajava.blogspot.com

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 40: 2013 pag-equine-workshop

Which instrument to choose?

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 41: 2013 pag-equine-workshop

High-throughput sequencing

Phase 3: single-molecule

C2 (current) chemistry:

Average read length 2500 bp

36 000 reads

90 Mb per ‘run’

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 42: 2013 pag-equine-workshop

High-throughput sequencing

Technology

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 43: 2013 pag-equine-workshop

Need to combine Illumina + PacBio still.

[Figure: alignments of at least 1 kb to cod published assembly, raw reads vs. error-corrected reads. Labels: 2.7x; 23x; 24 cpus, 4.5 days, 100 Gb RAM; P_errorCorrection pipeline; 93% of reads recovered]

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Page 44: 2013 pag-equine-workshop

My perspective on tech:

Illumina HiSeq + benchtop sequencers (MiSeq) currently most reliable for data generation: data in hand, decent quality.

PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).

Page 45: 2013 pag-equine-workshop

Two final pieces of advice

Should you work with genome centers? Maybe.

Genome centers are good at large, well funded projects.

Their default pipelines are reliable but not always cutting edge.

“Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give.

They also have their own schedules and incentives.

Where should you go for contract sequencing?

I get asked this a lot!

My best recommendation is UC Davis.

“Cheaper” is not always “better”; data quality can vary immensely.

Page 46: 2013 pag-equine-workshop

Advertisement: next-gen sequence course

June 10-June 20, Kellogg Biological Station; < $500

Hands on exposure to data, analysis tools.

http://bioinformatics.msu.edu/ngs-summer-course-2013

Page 47: 2013 pag-equine-workshop

Acknowledgements

I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab

Hans Cheng is primary collaborator on chick work

USDA funded our technology development.

Lex Nederbragt for his slides :)