BEACON 101: Sequencing tech

BEACON 101: Making use of new sequencing technologies. C. Titus Brown, [email protected], Comp Sci & Micro, Michigan State University

Transcript of BEACON 101: Sequencing tech

Page 1: BEACON 101: Sequencing tech

BEACON 101: Making use of new sequencing technologies

C. Titus Brown

[email protected]

Comp Sci & Micro

Michigan State University

Page 2: BEACON 101: Sequencing tech

Outline

1. “Next”-generation sequencing.

2. Dealing with the data – our research.

3. What are they teaching kids these days, anyway?

Page 3: BEACON 101: Sequencing tech

But first, some background…

S The kinds of technology I’ll be talking about are being used by many BEACON groups, and will probably be used by many more within the next few years.

S Sequencing advances are (IMO) one of the most stunning technological breakthroughs in biology in the last 20 years.

S As a mid-level BEACON bureaucrat (TG leader! Course instructor!) I’m interested in:

S Enabling interesting science.

S Finding fun new problems to tackle.

S Developing a training & education plan so that we produce tech-savvy students and junior faculty.

Page 4: BEACON 101: Sequencing tech

In particular…

S At the last BEACON Congress, we had a “bioinformatics sandbox” session.

S Only MSU folk could attend (short notice!)

S About 8 labs, all using next-gen sequencing…

S …and 2 labs, working on methods for analyzing data. (Hi!)

S I know there are more people out there, on both sides of the equation. Who are you??

Page 5: BEACON 101: Sequencing tech

OK, Back to…

I. Sequencing!

S Sequencing of DNA and RNA.

S Single genomes

S Transcriptomes

S Natural populations (tags)

S Environmental samples/microbial populations (metagenomics)

S Cheap and massively scalable sequencing of DNA and RNA.

Page 6: BEACON 101: Sequencing tech

Sequencing technology

S Major, dramatic changes in our ability to sequence DNA and RNA quickly and cheaply.

S Majority of deployed techniques depend on (variations of) a single trick: “polony” sequencing. No cloning.

S Single-molecule sequencing coming along fast, but not yet ready for prime time.

Page 7: BEACON 101: Sequencing tech
Page 8: BEACON 101: Sequencing tech
Page 9: BEACON 101: Sequencing tech

Two specific concepts:

S First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)

S Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in linear density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
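A quick way to see the second point, reading "density" as features per unit length along each axis: on a 2-D surface the feature count goes as the square of that number. The function name and the dimensions below are illustrative assumptions, not figures from any real instrument.

```python
def features_on_surface(side_um: float, pitch_um: float) -> int:
    """Count features on a square surface, assuming a regular grid.

    side_um  -- edge length of the (hypothetical) square surface, in microns
    pitch_um -- centre-to-centre spacing between features, in microns
    """
    per_side = int(side_um // pitch_um)
    return per_side * per_side


base = features_on_surface(10_000, 2.0)    # 5,000 features along each edge
denser = features_on_surface(10_000, 1.0)  # halve the spacing
print(base, denser, denser / base)
```

Halving the spacing doubles the count along each edge, so the total goes up fourfold.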

Page 10: BEACON 101: Sequencing tech

Novel genome sequencing

Page 11: BEACON 101: Sequencing tech
Page 12: BEACON 101: Sequencing tech
Page 13: BEACON 101: Sequencing tech
Page 14: BEACON 101: Sequencing tech
Page 15: BEACON 101: Sequencing tech

Some numbers

S For under $1,000 per sample, the Illumina HiSeq machine will generate:

S 100,000,000 reads

S Each of length ~100 bp

S In under a week.

S x 16 samples/run.

S That’s 160 Gb of sequence, or just over 50x human genome…
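The arithmetic behind these numbers, as a small sketch. The read count, read length and samples per run come from the slide; the human genome size of roughly 3.1 Gb is an assumption added here to reproduce the "just over 50x" figure.

```python
READS_PER_SAMPLE = 100_000_000
READ_LENGTH_BP = 100
SAMPLES_PER_RUN = 16
HUMAN_GENOME_BP = 3.1e9  # assumption: approximate haploid human genome size

bases_per_sample = READS_PER_SAMPLE * READ_LENGTH_BP
bases_per_run = bases_per_sample * SAMPLES_PER_RUN
fold_human = bases_per_run / HUMAN_GENOME_BP

print(f"{bases_per_sample / 1e9:.0f} Gb per sample")
print(f"{bases_per_run / 1e9:.0f} Gb per run")
print(f"{fold_human:.1f}x human genome")
```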

Page 16: BEACON 101: Sequencing tech

How do you choose a sequencing approach?

S Choose one:

S Long reads (low sampling, but easier to work with)

S Deep random sampling (quantitative sequencing, quite sensitive)

S The answer will depend on what exactly you want to do. Generally I prefer the shorter reads.

S Find someone who pays obsessive attention to this stuff. (Hi!)

Page 17: BEACON 101: Sequencing tech

Data analysis!

S In general, it now takes longer to analyze the data than it does to generate the data.

S That is, suppose you already know exactly what to do and simply want to run your analysis.

S By and large, you can generate a large enough amount of data in one week that you cannot run the analysis of it in the following week.

S …this is steadily shifting towards the “more data” side, too.

S (This is really a paradigm shift for many areas of biology.)

Page 18: BEACON 101: Sequencing tech

Your basic data file.

>895:5:1:1276:16683/1
GTCGCTTTGCGATGTTTGTCGGGTGCATCTTTTGGGAACAGCAAGTTTTGGAATGATCCCTGCACTTTCAT
CGGAACACC
>895:5:1:1558:16140/2
CCGTTCCAGAGATATGACCCGTTTTAATGAACGCTGCCAGTTGACAAATTATTTTCCAAAATTAGCAATTGCG
TGGGTTCTTTTCCATCTAAACAGCTTCTGGGCTTTATGCTG
>895:5:1:1581:10052/1
TTACAGACGTCGTTCTAACTAATTTGTGACGAAAATTGCCCACAATTATGACTATATGTGGAATTTTG
>895:5:1:1824:4518/2
CCAAATTAGTTAGAATGACGTTTGTAACCGTATTCCGGTGCAACTTTGTGAATAATTTCTAACTGTAAAAATTT
TTGGCAAAACCAAGTTTGCCGGCCGCAACCGCAAC
>895:5:1:1945:14960/1
CTGATTTTGCAATGTTACTGACATGGGTATGCCAGTTGTGATTATTGGCGACTGCAACTCCCAACAATGATA
CTGTTTACTTTTGTGTGAATGAACATTTATTCATCCTTGGGT
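A file like this is FASTA: a ">" header line followed by one or more lines of sequence. Below is a minimal sketch of a reader, assuming well-formed input. The function name `read_fasta` is made up for illustration, and the sample embeds two of the records shown on the slide.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, joining sequences wrapped over several lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


SAMPLE = """\
>895:5:1:1276:16683/1
GTCGCTTTGCGATGTTTGTCGGGTGCATCTTTTGGGAACAGCAAGTTTTGGAATGATCCCTGCACTTTCAT
CGGAACACC
>895:5:1:1581:10052/1
TTACAGACGTCGTTCTAACTAATTTGTGACGAAAATTGCCCACAATTATGACTATATGTGGAATTTTG
"""

records = list(read_fasta(SAMPLE.splitlines()))
for name, seq in records:
    print(name, len(seq))
```

The /1 and /2 suffixes in the read names conventionally mark the two reads of a pair.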

Page 19: BEACON 101: Sequencing tech

What now?

Page 20: BEACON 101: Sequencing tech

Mapping

S Many fast & efficient computational solutions exist.

S You have to figure out how to choose parameters to maximize sensitivity/specificity, and when to validate.

U. Colorado

http://genomics-course.jasondk.org/?p=395
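To make the parameter question concrete, here is a deliberately naive mapper that slides a read along a reference and reports every position within a mismatch limit. It is a toy sketch, not how production mappers work (they use indexes and are far faster). The reference, the read and the name `map_read` are all invented for illustration.

```python
from typing import List, Tuple


def map_read(reference: str, read: str, max_mismatches: int) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


REFERENCE = "ACGTTGCATGCCGATAGCTA"  # made-up sequence
READ = "GCTTGC"                     # one base off from REFERENCE[5:11]

for limit in (0, 1, 2):
    print(limit, map_read(REFERENCE, READ, limit))
```

With no mismatches allowed the read is lost; allowing one recovers it at a single position; allowing two also brings in additional, presumably spurious, placements. That is the sensitivity/specificity trade-off in miniature.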

Page 21: BEACON 101: Sequencing tech

Whole genome shotgun sequencing & assembly

Randomly fragment & sequence from DNA; reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Page 22: BEACON 101: Sequencing tech

Data analysis challenges

S Choosing a software suite/pipeline/analysis approach.

S Scaling the chosen approach to the volume of data (2-200x what it was designed for).

S Efficiently running software.

S Integrating analysis results and extracting desired information.

S Understanding what you’ve done in sufficient detail to design & perform requisite computational controls.

Page 23: BEACON 101: Sequencing tech

Data analysis challenges, cont’d

S The rate of change is itself accelerating:

S New tools, approaches every month.

S More data, data types, chemistries every month.

S Increasing commercialization (so getting an honest answer from the companies is basically impossible)

S But… opportunities are great! Jump on in!

Page 24: BEACON 101: Sequencing tech

What does the future hold?

“Prediction is very difficult, especially about the future.”

-- Niels Bohr

S More, cheaper sequencing: plan for a world where you can sequence anything you sample, to any depth you want, for arbitrarily small amounts of money. Seriously.

S Solutions to the majority of the scaling issues in data analysis (but not the scientific issues…)

Page 25: BEACON 101: Sequencing tech

Questions?

Page 26: BEACON 101: Sequencing tech

II. Our research

“Making sense of sequence”

“Surfing the data tsunami”

There are a number of fascinating challenges at the intersection of genomics and the rest of biology; they require appropriate (ab)use of computational techniques, applied to data sets from interesting critters and/or experimental setups.

(Evolution turns out to be especially interesting in this regard.)

Page 27: BEACON 101: Sequencing tech

Frontiers in sequencing new stuff…

S There are many, many interesting critters for which we have essentially no genomic or transcriptomic information.

S Next-gen sequencing has now made these organisms accessible to investigation.

S But dealing with organisms for which there is no reference genome is … challenging.

Page 28: BEACON 101: Sequencing tech

Whole genome shotgun sequencing & assembly

Randomly fragment & sequence from DNA; reassemble computationally.

UMD assembly primer (cbcb.umd.edu)

Page 29: BEACON 101: Sequencing tech

A brief intro to shotgun assembly

It was the best of times, it was the wor

, it was the worst of times, it was the

isdom, it was the age of foolishness

mes, it was the age of wisdom, it was th

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

…but for 2 bn+ fragments.

Not subdivisible; not easy to distribute; memory intensive.
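The toy example above can be run as a greedy overlap assembly: repeatedly merge the two fragments with the longest suffix/prefix overlap. This is a minimal sketch of that one strategy, not the method any particular assembler uses. The function names and the `min_len` threshold are assumptions for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Merge the best-overlapping pair until no overlaps remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left to merge
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


FRAGMENTS = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
EXPECTED = ("It was the best of times, it was the worst of times, "
            "it was the age of wisdom, it was the age of foolishness")

contigs = greedy_assemble(FRAGMENTS)
print(contigs)
```

Every round compares all pairs of fragments, which is fine for four fragments and hopeless for billions. The repeated phrase "it was the" also produces shorter, misleading overlaps between fragments that are not adjacent; here the longest overlaps happen to be the right ones.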

Page 30: BEACON 101: Sequencing tech

Repeats do cause problems:

Assemble based on word overlaps:
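One way to see why repeats hurt when assembling on word overlaps: record, for every word of length k, which characters can follow it. A word that sits at the end of a repeat has more than one possible continuation. This is a small sketch in the spirit of k-mer (de Bruijn-style) assembly; the function name and the choice of k = 4 are assumptions for illustration.

```python
from collections import defaultdict


def word_successors(text: str, k: int):
    """Map each k-length word to the set of characters seen right after it."""
    successors = defaultdict(set)
    for i in range(len(text) - k):
        successors[text[i:i + k]].add(text[i + k])
    return successors


TEXT = "it was the best of times, it was the worst of times"
branches = {word: nxt
            for word, nxt in word_successors(TEXT, 4).items()
            if len(nxt) > 1}
print(branches)
```

After the word "the " the text continues with either "b" or "w", and nothing in the word itself says which path is correct.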

Page 31: BEACON 101: Sequencing tech

Project I: metagenomics

Wild microbes!

S Millions or billions of microbial species inhabit soil: “1m species in a single gram” (Gans et al., 2005)

S These microbes mediate important geobiological processes (e.g. nitrogen reduction)

S Ecology & evolution of these habitats??

Page 32: BEACON 101: Sequencing tech

The Great Plate Count Anomaly

S Fewer than 1% of microbes can be cultivated in the lab (vs. direct observation).

S Difficult or impossible to cultivate in the lab

S Unknown physiological requirements

S Commensalism and microbial consortia

S How can we study them?

Page 33: BEACON 101: Sequencing tech
Page 34: BEACON 101: Sequencing tech

SAMPLING LOCATIONS

Page 35: BEACON 101: Sequencing tech

Sampling strategy per site

Reference soil

Soil cores: 1 inch diameter, 4 inches deep

Total: 8 Reference metagenomes + 64 spatially separated cores (pyrotag sequencing)

[Figure: nested sampling grid with scale labels 10 m, 1 m, and 1 cm]

Page 36: BEACON 101: Sequencing tech

Great Prairie sequencing summary

[Figure: bar chart of basepairs of sequencing (Gbp), y-axis 0 to 350, for GAII and HiSeq]

200x human genome…!

> 10x more challenging (total diversity)

Page 37: BEACON 101: Sequencing tech

Subdividing reads by connection

“Partitioning” => assembly on multiple computers
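A rough sketch of the partitioning idea: treat two reads as connected if they share a k-mer, then group reads into connected components that can be assembled separately. This toy version uses union-find and keeps every k-mer in a dictionary. It ignores reverse complements and sequencing errors, and the function name is an assumption for illustration.

```python
def partition_reads(reads, k):
    """Group read indices into components connected by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                ra, rb = find(idx), find(first_seen[kmer])
                if ra != rb:
                    parent[ra] = rb
            else:
                first_seen[kmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


READS = ["ACGTAC", "GTACGG", "TTTTAA", "TTAACC"]  # made-up reads
partitions = partition_reads(READS, 4)
print(partitions)
```

Reads 0 and 1 share GTAC and reads 2 and 3 share TTAA, so the four reads fall into two groups, each of which could go to a different machine.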

Page 38: BEACON 101: Sequencing tech

Project II: transcriptomics

Developmental change in non-model ascidians, the Molgula

Page 39: BEACON 101: Sequencing tech

Molgula questions

S What happened to the downstream tail gene network in the tailless ascidian?

S What are the genomic adaptations that made the Molgulidae particularly susceptible to tail loss? (e.g. Manx/bobcat)

S How does tail loss actually work, functionally?

S Heterochrony of metamorphosis?

Page 40: BEACON 101: Sequencing tech

Preliminary round of sequencing (Illumina 76 bp x 2, ~250 bp insert size)

Sample name              Total reads after trim+filter   Loci (total genes, > 500)   Total incl. splice variants (> 500)
M. oculata (gastrula)    35,252,607                      13,172                      16,269
Hybrid (gastrula)        38,690,601                      14,148                      24,209
M. occulta (gastrula)    22,548,831                       8,046                      10,802
M. oculata (neurula)     38,030,938                      10,365                      11,043
Hybrid (neurula)         38,699,913                      14,400                      29,189
M. oculata (tailbud)     38,073,640                      14,204                      17,835
Hybrid (tailbud)         34,307,907                      15,399                      26,594

Page 41: BEACON 101: Sequencing tech

We can count by allele!

Sample name              M. oculata manx   M. occulta manx   M. oculata bobcat   M. occulta bobcat
M. oculata (gastrula)    9                 0                 11                  0
Hybrid (gastrula)        0                 28                0                   255
M. occulta (gastrula)    0                 41                0                   76
M. oculata (neurula)     5                 0                 4                   1
Hybrid (neurula)         0                 38                0                   223
M. oculata (tailbud)     6                 0                 6                   0
Hybrid (tailbud)         0                 8                 0                   121
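Once reads are counted per allele, a simple summary is the fraction of reads for each gene that come from one species' allele. The sketch below does this for the counts in the table; the helper names are made up for illustration, and no normalization or statistics are attempted.

```python
# (oculata manx, occulta manx, oculata bobcat, occulta bobcat), from the table
COUNTS = {
    "M. oculata (gastrula)": (9, 0, 11, 0),
    "Hybrid (gastrula)": (0, 28, 0, 255),
    "M. occulta (gastrula)": (0, 41, 0, 76),
    "M. oculata (neurula)": (5, 0, 4, 1),
    "Hybrid (neurula)": (0, 38, 0, 223),
    "M. oculata (tailbud)": (6, 0, 6, 0),
    "Hybrid (tailbud)": (0, 8, 0, 121),
}


def occulta_fraction(oculata: int, occulta: int):
    """Fraction of allele-assigned reads from the M. occulta allele (None if no reads)."""
    total = oculata + occulta
    return occulta / total if total else None


def fmt(value) -> str:
    return "n/a" if value is None else f"{value:.2f}"


for sample, (manx_ocu, manx_occ, bob_ocu, bob_occ) in COUNTS.items():
    print(f"{sample:24s} manx {fmt(occulta_fraction(manx_ocu, manx_occ))}"
          f"  bobcat {fmt(occulta_fraction(bob_ocu, bob_occ))}")
```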

Page 42: BEACON 101: Sequencing tech

Molgula/emerging story

S Looks like notochord/tail cells are being specified, but cell movement isn’t happening.

S May be failure in convergence/extension?

S Computational leads => experimental validation.

Page 43: BEACON 101: Sequencing tech

Research goals

“Better science through superior computation”

Enable interesting biology downstream of sequence analysis.

Also, provide tools to others.

Page 44: BEACON 101: Sequencing tech

Questions?

Page 45: BEACON 101: Sequencing tech

III. (Graduate) education!

S Biology is fast becoming data-intensive.

S This requires expertise that is not traditionally part of many biologists’ training.

S More generally, “computational science” in biology is really at least three different things:

S Data analysis (data => hypothesis discovery/validation)

S Modeling/simulation (ecology models, protein structure, etc.)

S Instantiation of biological system (e.g. evolution).

S I’m avoiding theory and (non-digital) experiment, which are yet separate skills…

Page 46: BEACON 101: Sequencing tech

…and worse…

S Increasingly, biological understanding relies on computational analysis and inference.

S Computational intuition and informed skepticism (a.k.a. “scientific method”…) aren’t taught to biologists.

Page 47: BEACON 101: Sequencing tech

…and worst.

S All of this rests on a “bedrock” foundation of

S Badly written or inflexible software that’s difficult to run or install.

S Scripts written quickly and without reflection or testing.

S Ineffective computer use.

S …and a general lack of regard for reproducibility and replication.

Page 48: BEACON 101: Sequencing tech

Cultural problems?

S Physics, in particular, has a history of computation, and a robust computational culture… but not bio so much.

“Many undergrads got into biology because they were interested in science, but didn’t like the math required for physics and chemistry. I have bad news for them…”

-- me

S Bad news? Computation is increasingly important in bio.

S Good news? Computation != math.

S Better news: BEACON is enriched for grads, postdocs, and faculty that live at this interface. It’s a good crowd.

Page 49: BEACON 101: Sequencing tech

So what do we do?

BEACON course: “Computational Science for (Evolutionary) Biologists”, v2.0 (alpha)

1. Teach programming for computational scientists.

2. Teach computational science strategies/thinking.

3. Touch on reproducibility, RCR, and data management.

4. Keep it interesting enough that people don’t “check out”.

5. …try to figure out remote interaction: currently teaching across MSU (15), UT Austin (3), UW Seattle (2), and U Idaho (3).

Page 50: BEACON 101: Sequencing tech

What is class like?

Tuesdays: programming HW due; discussion of computational stuff.

Thursdays: reading HW due; group presentation; discussion.

Groups split between MSU & (other); in-group teleconf (iPads and FaceTime), whiteboard (Jot!)

(Yes, we bought 16 iPads for the course. BEACON now owns 16 iPads.)

Page 51: BEACON 101: Sequencing tech

(Jot! Demo)

Page 52: BEACON 101: Sequencing tech

The course is still a work in progress

S You can ask your local students, too, but –

S In-class interaction is possible, but still hard.

S Group dynamics! Now across 1000s of miles!

S Not everyone is great at technology multitasking (although kids these days…)

S That whole “mixed background” thing is extra challenging. BEACON students are so diverse that you can’t rely on all of them really knowing anything specific. But they all know so much individually that you risk boring them.

Sigh.

Page 53: BEACON 101: Sequencing tech

Educational future

S Increasingly, BEACON graduate students cannot be placed into easy categories (Michelle Vogel, Tasneem Pierce).

S Can we really split these people into “bio” and “compu” folk? No… nor should we want to, necessarily; whole point of BEACON!

S Can we make the courses more distributed to take advantage of remote faculty expertise?

S Last year: no way in heck.

S This year, the tech is working better. Plus, iPads!

S Your opinions are welcome, especially if they involve less work for me.

S Note: options for faculty & postdocs, too: summer course w/ Dworkin.

Page 54: BEACON 101: Sequencing tech

Concluding thoughts

S Sequencing is awesome, and presents fantastic opportunities. Very, very exciting world!!!

S Taking advantage of it currently requires expertise that’s hard to teach and hard to learn. That shouldn’t stop us!

S BEACON could be argued into helping:

S Workshops, courses, training, etc.

S I will be visiting UT Austin & U Idaho (Oct 17-21), and UW Seattle (Nov 14-18) and would love to chat about this stuff.