2013 pag-equine-workshop

C. Titus Brown

Asst Prof, CSE and Microbiology; BEACON NSF STC

Michigan State University

ctb@msu.edu

Next-Gen Sequencing: 4 years in the trenches

These slides are available online.

“titus brown slideshare”

You can also e-mail me: ctb@msu.edu

Also note that these are my opinions and observations, culled from personal experience, online material, and reading. I’m happy to cite/explain further upon request, but:

Your Mileage May Vary

Things I won’t talk about

Don’t work on/with/have anything useful to say about:
Exome sequencing
Ancient DNA
ChIP-seq (protein-DNA interactions)

Work on but you’re probably not interested in:
Metagenomics (sequencing uncultured microbial communities)
Bioinformatics data structures and algorithms

Overview

Shotgun sequencing basics

Things everyone wants to know: how much $$...

Various current problems & challenges

Technology, now and future

Some papers and projects worth looking at; & our own experiences

Two specific concepts:

First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)

Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.

These two concepts underlie the recent stunning increases in sequencing capacity.
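The second concept can be illustrated with a little arithmetic. This is only a sketch: the function name, the square-grid layout, and the area and spacing values are hypothetical and not taken from any real instrument.

```python
def features_on_surface(area_mm2, pitch_um):
    """Approximate number of wells/spots on a square grid.

    area_mm2: usable surface area (hypothetical value)
    pitch_um: centre-to-centre spacing between features (hypothetical value)
    """
    pitch_mm = pitch_um / 1000.0
    # One feature occupies roughly pitch x pitch of surface.
    return area_mm2 / (pitch_mm ** 2)


if __name__ == "__main__":
    coarse = features_on_surface(area_mm2=100.0, pitch_um=2.0)
    fine = features_on_surface(area_mm2=100.0, pitch_um=1.0)
    # Halving the spacing (a 2x gain in linear density) gives ~4x the features.
    print(fine / coarse)
```

Any linear improvement in spacing or imaging resolution shows up squared in the number of sequences per run.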

What are current costs for Illumina?

Approximate costs from MSU sequencing center, a few months ago, including labor:

RNAseq: $200 prep / sample
Single-ended 1x50: $1100/lane, 100-150 million reads
Paired-end 2x100: $2500/lane, 200-300 million reads (/ 2)

Barcoding samples, etc., gets complicated.
Discuss biology, etc. with a sequencing geek before going forward!
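For rough budgeting, the lane prices above convert to a cost per million reads and a number of barcoded samples per lane. The sketch below is simple arithmetic on the slide's numbers. The function names and the 50-million-reads-per-sample target are assumptions for illustration, and it ignores barcoding overhead and uneven pooling.

```python
def cost_per_million_reads(lane_cost_usd, reads_millions_range):
    """Return (best case, worst case) cost in USD per million reads."""
    low_yield, high_yield = reads_millions_range
    return lane_cost_usd / high_yield, lane_cost_usd / low_yield


def samples_per_lane(lane_reads_millions, reads_per_sample_millions):
    """How many samples fit in one lane at a target depth (idealized)."""
    return int(lane_reads_millions // reads_per_sample_millions)


if __name__ == "__main__":
    # Lane prices and yields as quoted on the slide.
    print(cost_per_million_reads(1100, (100, 150)))  # single-ended 1x50
    print(cost_per_million_reads(2500, (200, 300)))  # paired-end 2x100
    # Hypothetical target: 50 million reads per sample.
    print(samples_per_lane(100, 50), samples_per_lane(150, 50))
```

The per-sample prep cost ($200) is added on top of whatever share of a lane each sample uses.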

What does this data really give you??

With RNAseq, you can do de novo (genome- and gene-annotation-independent) gene & isoform discovery and quantification; 50-100m reads/sample is probably “enough” (see: http://blog.fejes.ca/?p=607 for a good discussion)

With genome resequencing, you can do variant analysis/discovery; I recommend 20x depth.

De novo assembly of complex vertebrate genomes is not casual:
Cheap short-read sequencing does not yet deliver good long-range contiguity; repeats and heterozygosity get in the way.
The assembly & scaffolding process itself is still evolving.

Why so much data?

Why do we need 10-20x coverage (resequencing) or 50-100m reads (mRNAseq) with Illumina?

Two (linked) reasons:
Shotgun sequencing is random
Counting/sampling variation

1. Useful minimum coverage depends on high average coverage

2. mRNAseq quantitation – must overcome sampling variation
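Both points can be sketched with a Poisson model. This assumes reads land uniformly at random, which is an idealization; real data has additional bias and is usually more variable. The function names and example numbers are illustrative only.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson-distributed count with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def fraction_covered_at_least(min_depth, mean_coverage):
    """Expected fraction of bases sequenced at least `min_depth` times."""
    below = sum(poisson_pmf(k, mean_coverage) for k in range(min_depth))
    return 1.0 - below


def count_cv(expected_reads):
    """Coefficient of variation (sd / mean) of a Poisson read count."""
    return 1.0 / math.sqrt(expected_reads)


if __name__ == "__main__":
    # Point 1: the low-coverage tail shrinks quickly as average depth rises.
    for mean_coverage in (5, 10, 20):
        frac = fraction_covered_at_least(5, mean_coverage)
        print(f"mean {mean_coverage}x: {frac:.4f} of bases at >= 5x")
    # Point 2: a hypothetical transcript contributing 1 read per million
    # gets ~N reads from N million total reads; noise falls as 1/sqrt(N).
    for total_millions in (10, 50, 100):
        print(f"{total_millions}M reads: CV ~ {count_cv(total_millions):.3f}")
```

Under this model the minimum useful depth at any given base is only reached reliably when the average depth is several times higher, and rare transcripts need deep sampling before their counts stabilize.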

Coverage conclusions

More coverage rarely hurts (you can always discard data, but it is harder/more $$ to get more data from an old sample)

Your desired coverage numbers should be driven by sensitivity considerations.

Problems and challenges

Systematic bias in sequencing and software.

Genome assembly: scaffolding and sensitivity

Gene references

mRNAseq isoform construction

Resequencing: bias and error

Calling SNPs by mapping -- U. Colorado

http://genomics-course.jasondk.org/?p=395

Both sequencing and bioinformatics yield many low-frequency artifacts!
“Obvious” things like misalignments to paralogous/repeat sequences.
Indels are handled badly by current tools (up to 60% false positive rate?!)
Oxidation of DNA during the library prep step (acoustic shearing) generated 8-oxoguanine “lesions” responsible for artifacts involving C>A/G>T triplets.

=> With any data set, especially big ones, there will be both random and systematic error and bias.

http://pathogenomics.bham.ac.uk/blog/2013/01/sequencing-data-i-want-the-truth-you-cant-handle-the-truth/

Suggestion: Cortex variant caller

Iqbal et al., Nat Genet. 2012, pmid 22231483

Genome assembly: scaffolding & sensitivity

Everyone wants two things from a genome assembly --

Long/correct scaffolds

See http://www.slideshare.net/flxlex/a-different-kettle-of-fish-entirely-bioinformatic-challenges-and-solutions-for-whole-de-novo-genome-assembly-of-atlantic-cod-and-atlantic-salmon

Complete genome content

Sequence data

[Figure labels: reads; original DNA; fragments; sequenced ends]

http://www.cbcb.umd.edu/research/assembly_primer.shtml

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Contigs

Building contigs

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Scaffolds

Ordered, oriented contigs

[Figure labels: contigs; mate pairs; gap size estimate; scaffold]

http://dx.doi.org/10.6084/m9.figshare.100940

Longer reads!

Long reads can span repeats and heterozygous regions

[Figure labels: repeat copy 1; repeat copy 2; contig 1; polymorphic contig 2; polymorphic contig 3; contig 4]

Cod: PacBio results

Mapping to the published genome

[Figure labels: 11.4 kbp subread; 10.6 kbp subread; 10.9 kbp subread]

Sensitivity – does your genome include everything?

Generally not!

For example, the chick genome is missing a substantial number of genes from microchromosomes:
723 genes from HSA19q missing from chicken galGal4.
ESTs and RNAseq transcripts for many or

Approach - Digital normalization

(a computational version of library normalization)

Digital normalization “smooths out” coverage from different loci, and can “recover” low coverage regions for assembly.
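A minimal sketch of the idea behind digital normalization: stream through the reads, estimate each read's coverage from the median count of its k-mers seen so far, and keep the read only if that estimate is still below a cutoff. This toy version uses an exact in-memory Counter and is an illustration only, not the production implementation. The function names, `k=5`, `cutoff=3`, and the example reads are all assumptions chosen to keep the example small.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=5, cutoff=3):
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


if __name__ == "__main__":
    # Hypothetical input: one locus sampled 10 times, another sampled once.
    reads = ["ACGTTGCAAG"] * 10 + ["GGATCCTTAA"]
    print(digital_normalization(reads))
```

High-coverage loci are capped near the cutoff while reads from low-coverage loci are all retained, which is what evens out coverage before assembly.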

Applying diginorm to increase sensitivity

Reassembled chick genome from 70x Illumina -> normalized reads in ~24 hours.

Contig assembly contained partial or complete matches to 70% of previously unmappable transcripts assembled from chick mRNAseq

Together with Wes Warren (WUSTL), Hans Cheng (USDA ADOL), Jerry Dodgson (MSU) proposing to apply PacBio and normalization to improve chick genome; should be generalizable approach.

Mapping => mRNAseq quantitation

Reference transcriptome required.

Existing chick gene models lack exons, isoforms

*This gene contains at least 4 isoforms.

Our data

Models

Likit Preeyanon

(Exon detection is pretty good.)

Likit Preeyanon

Gene Modeler Pipeline (“gimme”?)

Merge transcripts together based on transcript mapping to genome; can include existing gene predictions, iterate.

Construct gene models
Remove redundant sequences
Predict strands and ORFs

Likit Preeyanon

Some thoughts on bioinfo

Software is evolving very fast. Don’t worry about using the latest, but keep an eye on possible artifacts/problems with what you do use.

In NGS, online information (seqanswers, biostar, Twitter) is generally far more up to date than publications.

Technology – where next?

Most slides taken from Lex Nederbragt:

http://www.slideshare.net/flxlex/updated-new-high-throughput-sequencing-technologies-at-the-norwegian-sequencing-centre-and-beyond

High-throughput sequencing

Phase 1: more is better

Year  Instrument  Reads        Read length  Output
2005  GS20        200 000      100 bp       0.02 Gb/run
2011  GS FLX+     1.2 million  750 bp       0.7 Gb/run
2006  GA          28 million   25 bp        0.7 Gb/run
2011  HiSeq 2000  3 billion    2x100 bp     600 Gb/run
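The Gb/run figures on this slide are roughly reads per run times bases per read. The sketch below checks that with the slide's own numbers. The function name is made up, and treating the HiSeq's "3 billion reads" as 3 billion read pairs of 2x100 bp is my assumption, since that is what reproduces the quoted 600 Gb.

```python
def yield_gb(n_reads, bases_per_read):
    """Nominal sequence yield per run, in gigabases (Gb)."""
    return n_reads * bases_per_read / 1e9


# (instrument, reads per run, bases per read), as quoted on the slide.
RUNS = [
    ("2005 GS20", 200_000, 100),
    ("2011 GS FLX+", 1_200_000, 750),
    ("2006 GA", 28_000_000, 25),
    # Assumption: 3 billion read pairs, each 2 x 100 = 200 bases.
    ("2011 HiSeq 2000", 3_000_000_000, 200),
]

if __name__ == "__main__":
    for name, n_reads, bases in RUNS:
        print(f"{name}: {yield_gb(n_reads, bases):g} Gb/run (nominal)")
```

For the GS FLX+ the nominal product is 0.9 Gb, higher than the 0.7 Gb/run quoted. That would fit if 750 bp is closer to a maximum than an average read length, though that is a guess on my part.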

High-throughput sequencing

Phase 2: smaller is better

Benchtop instrument                      Output                  Read length
GS Junior from Roche/454                 0.04 Gb/run             400 bp
MiSeq from Illumina                      4.5 Gb/run              2x150 bp
PGM from Ion Torrent/Life Technologies   0.01, 0.1 or 1 Gb/run   100 or 200 bp

Full-size instruments, for comparison: 0.7 Gb/run, 700 bp reads; 600 Gb/run, 2x100 bp reads

High-throughput sequencing

Why benchtop sequencing instruments?

Affordable price per instrument
Small projects
Diagnostics
Fast turn around time


Which instrument to choose?

High-throughput sequencing

Phase 3: single-molecule

C2 (current) chemistry:
Average read length 2500 bp
36 000 reads
90 Mb per ‘run’

High-throughput sequencing

Technology

Need to combine Illumina + PacBio still.

24 CPUs, 4.5 days, 100 GB RAM

Alignments of at least 1kb to cod published assembly

Raw reads

P_errorCorrection pipeline from

93% of reads recovered

My perspective on tech:

Illumina HiSeq + benchtop sequencers (MiSeq) currently most reliable for data generation: data in hand, decent quality.

PacBio data is an excellent add-on for situations where long reads are needed (to bridge repeats or het regions).

Two final pieces of advice

Should you work with genome centers? Maybe.

Genome centers are good at large, well funded projects.

Their default pipelines are reliable but not always cutting edge.

“Weird” problems (high heterozygosity, or complex repeats) may require more attention than they can give.

They also have their own schedules and incentives.

Where should you go for contract sequencing?

I get asked this a lot!
My best recommendation is UC Davis.
“Cheaper” is not always “better”; data quality can vary immensely.

Advertisement: next-gen sequence course

June 10-June 20, Kellogg Biological Station; < $500

Hands on exposure to data, analysis tools.

http://bioinformatics.msu.edu/ngs-summer-course-2013

Acknowledgements

I showed work from Likit Preeyanon and Alexis Black Pyrkosz, in my lab

Hans Cheng is primary collaborator on chick work

USDA funded our technology development.

Lex Nederbragt for his slides :)
