02 dna sequencing v2 - Department of Computer Science · 7/20/2011  · DNA sequencing Since ~2010...


DNA Sequencing

Ben Langmead

Department of Computer Science

Please sign guestbook (www.langmead-lab.org/teaching-materials) to tell me briefly how you are using the slides. For original Keynote files, email me ([email protected]).


DNA

Hunter, Lawrence. "Life and its molecules: A brief introduction." AI Magazine 25.1 (2004): 9.


Genomics technology

Sanger DNA sequencing: 1977-1990s
DNA microarrays: since mid-1990s
2nd-generation DNA sequencing: since ~2007
3rd-generation & single-molecule DNA sequencing: since ~2010

Fred Sanger 1918-2013

“Chain termination” sequencing


Sanger sequencing

Sanger sequencing: 1977-1990s

The first practical method, invented by Fred Sanger in 1977. Initially used to sequence shorter genomes, e.g. viral genomes tens of thousands of bases long.

Not-so-high-throughput Sanger sequencing

Fred Sanger in episode 3 of PBS documentary “DNA”


Sanger sequencing

From "DNA" documentary, episode 3



Sequencing

No sequencing technology yet invented can read much more than 10,000 nucleotides at a time with reasonable cost, throughput and accuracy.

Instead, there’s a vigorous race to see whose sequencer can read “short” fragments of DNA (hundreds of nucleotides long) with the best cost, throughput and accuracy.

Decoding DNA With Semiconductors. By NICHOLAS WADE. Published: July 20, 2011

Source: nytimes.com

Cost of Gene Sequencing Falls, Raising Hopes for Medical Advances. By JOHN MARKOFF. Published: March 7, 2012

Company Unveils DNA Sequencing Device Meant to Be Portable, Disposable and Cheap. By ANDREW POLLACK. Published: February 17, 2012


Sequencing

Since 2005, many DNA sequencing instruments have been described and released. They are based on a few different principles:

Synthesis / ligation
SMRT cell
Nanopore

Pictures: http://www.illumina.com/systems/miseq/technology.ilmn, http://www.genengnews.com/gen-articles/third-generation-sequencing-debuts/3257/

Sequencing by synthesis (“massively parallel sequencing”) provides the greatest throughput and is the most prevalent today.


DNA: double helix

A T

G C

TCACACTGAGCGTGCTG

http://ghr.nlm.nih.gov/handbook/basics/dna


CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG

Your genome

GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG

Reads
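To make the relationship between the genome and its reads concrete, here is a minimal Python sketch that locates each of the four reads above in the genome string by exact matching. The function and variable names are illustrative only, and exact matching is a simplification: real reads can contain sequencing errors.

```python
# Genome and reads copied from the slide above.
genome = 'CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG'
reads = ['GTATGCACGCGATAG', 'TATGTCGCAGTATCT',
         'CACCCTATGTCGCAG', 'GAGACGCTGGAGCCG']

def find_offsets(genome, reads):
    """Return the leftmost offset of each read in genome (-1 if absent)."""
    return [genome.find(read) for read in reads]

offsets = find_offsets(genome, reads)
for read, off in zip(reads, offsets):
    # Indent each read by its offset so it lines up under the genome.
    print(' ' * off + read)
```

Printing the genome first and then these indented reads reproduces the picture on the slide: each read is a short substring of the much longer genome.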


CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG

GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC

Reads

Your genome


CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG

GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
TGTCTTTGATTCCTG CGCGATAGCATTGCG GCATTGCGAGACGCT CCTATGTCGCAGTAT

Reads

Your genome


CGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTG

GTATGCACGCGATAG TATGTCGCAGTATCT CACCCTATGTCGCAG GAGACGCTGGAGCCG
TAGCATTGCGAGACG GGTATGCACGCGATA TGGAGCCGGAGCACC CGCTGGAGCCGGAGC
TGTCTTTGATTCCTG CGCGATAGCATTGCG GCATTGCGAGACGCT CCTATGTCGCAGTAT
GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG
GTATGCACGCGATAG ACCTACGTTCAATAT TATTTATCGCACCTA CCACTCACGGGAGCT
GCGAGACGCTGGAGC CTATCACCCTATTAA CTGTCTTTGATTCCT ACTCACGGGAGCTCT
CCTACGTTCAATATT GCACCTACGTTCAAT GTCTGGGGGGTATGC AGCCGGAGCACCCTA
GACGCTGGAGCCGGA GCACCCTATGTCGCA GTATCTGTCTTTGAT CCTCATCCTATTATT
TATCGCACCTACGTT CAATATTCGATCATG GATCACAGGTCTATC ACCCTATTAACCACT
CACGGGAGCTCTCCA TGCATTTGGTATTTT CGTCTGGGGGGTATG CACGCGATAGCATTG

Reads

Your genome


Reads

100 nt

100,000,000 nt

Your genome


Reads

100 nt

Your genome

100,000,000 nt

?


A T

G C

Double stranded DNA (lego version)

Double stranded DNA (double helix)


CCATAG

GGTATC

A T

G C
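The pairing rule on this slide (A with T, G with C) can be written as a small lookup table. The sketch below is illustrative: `COMPLEMENT` and `complement` are names chosen here, not taken from the slides, and strand direction is ignored for simplicity.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return ''.join(COMPLEMENT[base] for base in strand)

print(complement('CCATAG'))  # GGTATC, the partner strand shown above
```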

Single stranded templates

DNA polymerase

[Animation frames: the double-stranded molecule CCATAG / GGTATC separates into two single-stranded templates. DNA polymerase then adds complementary bases to each template, one base per frame, until both templates are double stranded again.]


Input DNA

CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT
CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT

Cut into snippets

CCATAGTA TATCTCGG CTCTAGGCCCTC ATTTTTT
CCA TAGTATAT CTCGGCTCTAGGCCCTCA TTTTTT
CCATAGTAT ATCTCGGCTCTAG GCCCTCA TTTTTT
CCATAG TATATCT CGGCTCTAGGCCCT CATTTTTT

Deposit on slide

C C A T A G

More details: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008 Nov 6;456(7218):53-9
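The “cut into snippets” step can be mimicked with a short simulation. This is only a sketch under assumed behavior: the slides do not say how cut positions are chosen, so uniformly random cut points and the helper name `cut_into_snippets` are assumptions made for illustration.

```python
import random

def cut_into_snippets(dna, n_cuts, rng):
    """Cut dna at n_cuts distinct random interior positions (assumed model)."""
    cut_points = sorted(rng.sample(range(1, len(dna)), n_cuts))
    bounds = [0] + cut_points + [len(dna)]
    return [dna[i:j] for i, j in zip(bounds, bounds[1:])]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
input_dna = 'CCATAGTATATCTCGGCTCTAGGCCCTCATTTTTT'
molecules = [input_dna] * 4  # four identical copies, as on the slide
for molecule in molecules:
    print(' '.join(cut_into_snippets(molecule, 3, rng)))
```

Because each copy is cut at different positions, the resulting snippets overlap one another when laid against the original sequence.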


Template (billions of them!)

Slide


DNA polymerase

A, C, G, T (“terminator” nucleotides)


(snap)


Remove terminators


DNA polymerase

A, C, G, T

Repeat!

(snap)

(snap)

(snap)

(snap)

(snap)

Sequencing by synthesis

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6


Sequencing by synthesis

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6

complement complement complement complement complement complement

G A T A C C

C C A T A G


Sequencing by synthesis

Actual Illumina HiSeq 3000 image

http://dnatech.genomecenter.ucdavis.edu/2015/05/07/first-hiseq-3000-data-download/


Sequencing by synthesis

Billions of templates on a slide

Massively parallel: photograph captures all templates simultaneously

Terminators are “speed bumps,” keeping reactions in sync
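The three points above can be illustrated with a toy simulation. It is a sketch under simplifying assumptions, not a model of a real instrument: the templates are hypothetical, every template is at least as long as the number of cycles, no terminator ever fails, and strand direction is ignored.

```python
COMPLEMENT = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}

def sequence_by_synthesis(templates, n_cycles):
    """Simulate n_cycles synchronized cycles over all templates.

    In each cycle every template gains exactly one terminated base
    (the complement of its next template base), and one "photograph"
    records the newly added base for every template at once.
    """
    photographs = []
    for cycle in range(n_cycles):
        snapshot = [COMPLEMENT[t[cycle]] for t in templates]
        photographs.append(snapshot)
    # Reading one template's position across all photographs gives its read.
    return [''.join(bases) for bases in zip(*photographs)]

templates = ['CCATAG', 'TATCTC', 'GGCTCT']  # hypothetical templates
reads = sequence_by_synthesis(templates, 6)
print(reads)
```

Each photograph holds one base per template, so the number of cycles sets the read length while the number of templates sets the number of reads.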


Cluster of clones


Unterminated

Ahead of schedule


Q = -10 ∙ log10(p)

Q: base quality
p: probability that base call is incorrect

Q = 10 → 1 in 10 chance call is incorrect
Q = 20 → 1 in 100
Q = 30 → 1 in 1,000
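The relationship between Q and p runs in both directions. A minimal sketch follows; the function names `p_to_q` and `q_to_p` are chosen here for illustration and do not come from the slides.

```python
import math

def p_to_q(p):
    """Convert error probability p into base quality Q = -10 * log10(p)."""
    return -10.0 * math.log10(p)

def q_to_p(q):
    """Convert base quality Q back into error probability p = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

for q in (10, 20, 30):
    print(q, q_to_p(q))
```

Every increase of 10 in Q corresponds to a tenfold drop in the probability that the call is wrong.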


Call: orange (C)

Estimate p, probability incorrect:

p = non-orange light / total light = 3 green / 9 total = 1/3

Q = -10 log10(1/3) ≈ 4.77
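The same estimate can be written as a short function. This is a sketch of the simplified estimate on this slide only, not of how a real base caller works. The dictionary of light counts and the name `estimate_quality` are hypothetical, and the function assumes at least some light comes from a non-called color, since p = 0 would make the logarithm undefined.

```python
import math

def estimate_quality(light, called):
    """Estimate p as non-called light / total light, then convert to Q."""
    total = sum(light.values())
    p = (total - light[called]) / total
    return p, -10.0 * math.log10(p)

# Hypothetical light counts per color: 6 orange and 3 green, 9 in total.
light = {'orange': 6, 'green': 3}
p, q = estimate_quality(light, 'orange')
print(round(p, 3), round(q, 2))
```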


A read in FASTQ format

@ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1
ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT
+
?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G

Line 1: Name
Line 2: Sequence
Line 3: (ignore)
Line 4: Base qualities


FASTQ

Read 1: Name, Sequence, (placeholder), Base qualities
Read 2: Name, Sequence, (placeholder), Base qualities
Read 3: Name, Sequence, (placeholder), Base qualities
Read 4: Name, Sequence, (placeholder), Base qualities
Read 5: Name, Sequence, (placeholder), Base qualities
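Because every read occupies four consecutive lines, a FASTQ file can be parsed four lines at a time. The sketch below assumes exactly four lines per record and no blank lines between records; `parse_fastq` is an illustrative name, and the example record is the one shown earlier in the deck.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, qualities) for each 4-line FASTQ record."""
    while True:
        name = handle.readline().rstrip()
        if not name:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # placeholder line, ignored
        qual = handle.readline().rstrip()
        yield name[1:], seq, qual  # drop the leading '@' from the name

example = io.StringIO(
    '@ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1\n'
    'ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAAT\n'
    '+\n'
    '?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:G\n')
records = list(parse_fastq(example))
for name, seq, qual in records:
    print(name, len(seq), len(qual))
```

The sequence and quality strings have the same length, since each base has exactly one quality character.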


Base qualities

Bases and qualities line up:

AGCTCTGGTGACCCATGGGCAGCTGCTAGGGA
||||||||||||||||||||||||||||||||
HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH

Base quality is ASCII-encoded version of Q = -10 log10 p


ASCII


Usual ASCII encoding is “Phred+33”: take Q, rounded to integer, add 33, convert to character.

def QtoPhred33(Q):
    """ Turn Q into Phred+33 ASCII-encoded quality """
    # chr converts integer to character according to ASCII table
    return chr(int(round(Q)) + 33)

def phred33ToQ(qual):
    """ Turn Phred+33 ASCII-encoded quality into Q """
    # ord converts character to integer according to ASCII table
    return ord(qual) - 33

Base qualities
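Putting the pieces together, a whole quality string can be decoded into Q values and error probabilities. This is a minimal sketch assuming Phred+33 encoding. The helper names are illustrative, and the quality string is the one from the “Bases and qualities line up” example.

```python
def phred33_to_q(qual_char):
    """Turn one Phred+33 ASCII-encoded quality character into Q."""
    return ord(qual_char) - 33

def decode_qualities(qual_string):
    """Return (Q, p) for every character of a Phred+33 quality string."""
    qs = [phred33_to_q(c) for c in qual_string]
    return [(q, 10.0 ** (-q / 10.0)) for q in qs]

quals = 'HHHHHHHHHHHHHHHGCGC5FEFFFGHHHHHH'
decoded = decode_qualities(quals)
# Report the least confident base call in the read.
worst_q, worst_p = min(decoded)
print(worst_q, worst_p)
```

Here the lowest-quality character is `5`, which decodes to Q = 20, i.e. a 1 in 100 chance that the call is incorrect.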