DNA Assembly: Sanger Reads
Arthur Gruber, Instituto de Ciências Biomédicas, Universidade de São Paulo

Page 1: DNA Assembly Sanger Reads Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP.


Page 2:

Why assemble a genome?

• Current DNA sequencing methods generate reads of 500-700 bp, a resolution limit of electrophoresis
• Whole genomes or large clones need to be fragmented into a clone library
• Short fragments are randomly sequenced (shotgun approach), and the reads are assembled to form the final consensus sequence

Page 3:

Shotgun Sequencing I: random phase

[Figure labels: BAC clone (100-200 kb); sheared DNA (1.0-2.0 kb); sequencing templates; random reads. Modified from BCM-HGSC]

Page 4:

Shotgun Sequencing II: assembly

[Figure labels: consensus sequence; gap; low base quality; single-stranded region; mis-assembly (inverted). Modified from BCM-HGSC]

Page 5:

Shotgun Sequencing III: finishing

[Figure labels: consensus sequence; gap; low base quality; single-stranded region; mis-assembly (inverted). Modified from BCM-HGSC]

Page 6:

Shotgun Sequencing III: finishing

[Figure labels: consensus sequence; gap; single-stranded region; mis-assembly (inverted). Modified from BCM-HGSC]

Page 7:

Shotgun Sequencing III: finishing

[Figure labels: consensus sequence; gap; mis-assembly (inverted). Modified from BCM-HGSC]

Page 8:

Shotgun Sequencing III: finishing

[Figure labels: consensus sequence; gap. Modified from BCM-HGSC]

Page 9:

Shotgun Sequencing III: finishing

[Figure label: consensus. Modified from BCM-HGSC]

High accuracy sequence: < 1 error/10,000 bases

Page 10:

How to deal with the enormous number of reads generated by high-throughput DNA sequencers?

Sanger Institute

Page 11:

Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers

Page 12:

Sanger Institute - Hinxton - UK
ABI3730 DNA sequencers

Page 13:

Sanger Institute - Hinxton - UK
ABI3700 DNA sequencers

Page 14:

Sanger Institute - Hinxton - UK
Colony-picking robots

Page 15:

Sanger Institute - Hinxton - UK
Colony-picking robot

Page 16:

Sanger Institute - Hinxton - UK
Plasmid miniprep robots

Page 17:

Sanger Institute - Hinxton - UK
Plasmid miniprep room

Page 18:

Sanger Institute - Hinxton - UK
Thermocycler room

Page 19:

Exponential growth of sequence generation

Page 20:

Exponential growth of sequence generation

Page 21:

Exponential growth of sequence generation

Page 22:

Exponential growth of sequence generation

Page 23:

Exponential growth of sequence generation

• Genetic Sequence Data Bank - October 15, 2012
• NCBI-GenBank Flat File Release 192.0
• Distribution Release Notes:
• 157,889,737 loci
• 145,430,961,262 bases
• from 157,889,737 reported sequences

Page 24:

Phred/Phrap/Consed Package

Phred/Phrap/Consed is a package distributed worldwide for:

a. Trace file (chromatogram) reading;
b. Quality (confidence) assignment to each individual base;
c. Vector and repeat sequence identification and masking;
d. Sequence assembly and error probability assignment to the consensus sequence;
e. Assembly viewing and editing;
f. Automatic finishing.

Page 25:

Phred/Phrap/Consed Pipeline

Input: chromatogram files
Quality (confidence) value assignment: Phred; output: phd files (*.phd)
Conversion, phd to fasta: phd2fasta.pl; output: nucleotide sequences (seq.fasta) and quality values (seq.fasta.qual)
Vector screening and masking: Cross_Match (local alignment program) x vector.seq; output: screened/masked file (seq.fasta.screen) and quality values (seq.fasta.screen.qual)
Assembly: Phrap; output: assembled contigs (seq.fasta.screen.contigs) and assembly file (seq.fasta.screen.ace#)
Assembly viewing/editing: Consed
Finishing: Autofinish + manual finishing

Directories: chromat_dir, phd_dir, edit_dir

Page 26:

Phred: Genome Research 8: 175-185, 1998

Page 27:

Phred: Genome Research 8: 186-194, 1998

Page 28:

Phred

Phred is a program that performs several tasks:

a. Reads trace files: compatible with most file formats, including SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR.
b. Calls bases: assigns a base to each identified peak, with a lower error rate than the standard base-calling programs.

Page 29:

Phred

c. Assigns quality values to the bases: a "Phred value" based on an error rate estimated for each individual base.
d. Creates output files: base calls and quality values are written to output files.

Page 30:

Trace File

High quality read:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Page 31:

Trace File

Good quality read:
- no ambiguities (Ns)
- some noise (notice baseline)
- peaks very well spaced

Page 32:

Trace File

Poor quality read:
- some ambiguities (Ns)
- bad noise (notice baseline)
- overlapping peaks
- can be caused by a bad quality template, a bad matrix, or a low signal-to-noise ratio

Page 33:

Trace File

Poor quality read:
- many ambiguities (Ns)
- noise
- caused by homopolymeric region/polymerase slippage

Page 34:

Trace File

Sudden drop artifact:
- good quality region is followed by a sudden drop of signal
- caused by secondary structure

Page 35:

Trace File

High quality region:
- no ambiguities (Ns)
- no noise
- peaks very well spaced

Page 36:

Trace File

Medium quality region:
- some ambiguities (Ns)
- no noise
- peaks very well spaced
- some homopolymeric stretches are not well resolved

Page 37:

Trace File

Poor quality region, due to diffusion effects and a decrease in the relative mass difference between the sequence products:
- overlapping peaks, peaks not evenly spaced
- low resolution
- low confidence in base assignment

Page 38:

Phred analysis steps

a) Predicts idealized (expected) peaks (amplitudes), based effectively on the best region of the trace
b) Identifies observed peaks
c) Compares observed and expected peaks (divides the peaks into matched and unmatched)
d) Analyzes unmatched peaks for any peak that could be called but was not called in step c

Modified from Evan Eichler, Ph.D

Page 39:

Phred value formula

$q = -10 \times \log_{10}(p)$

where
q - quality value
p - estimated error probability for a base call

Examples:

$q = 20$ means $p = 10^{-2}$ (1 error in 100 bases)
$q = 40$ means $p = 10^{-4}$ (1 error in 10,000 bases)
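The formula above can be checked numerically. The following is a minimal Python sketch; the function names are illustrative and are not part of Phred.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Return q = -10 * log10(p) for an estimated error probability p."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Invert the formula: p = 10 ** (-q / 10)."""
    return 10.0 ** (-quality / 10.0)


if __name__ == "__main__":
    # The two examples from the slide: q = 20 and q = 40.
    for q in (20, 40):
        print(q, error_probability(q))
```

Because the scale is logarithmic, each step of 10 in q corresponds to a tenfold drop in the estimated error probability.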

Page 40:

The structure of a phd file

BEGIN_SEQUENCE 01EBV10201A02.g

BEGIN_COMMENT

CHROMAT_FILE: EBV10201A02.g
ABI_THUMBPRINT:
PHRED_VERSION: 0.990722.g
CALL_METHOD: phred
QUALITY_LEVELS: 99
TIME: Thu May 24 00:18:58 2001
TRACE_ARRAY_MIN_INDEX: 0
TRACE_ARRAY_MAX_INDEX: 12153
TRIM:
CHEM: term
DYE: big

END_COMMENT

BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
(...)
t 24 2221
a 24 2232
a 22 2245
a 27 2261
g 25 2272
c 19 2286
c 12 2302
t 19 2314
g 12 2324
g 15 2331
g 19 2346
g 23 2363
t 33 2378
g 36 2390
c 44 2404
c 44 2419
t 39 2433
a 39 2446
a 34 2460
t 35 2470
g 34 2482
(...)
t 16 8191
g 19 8200
t 13 8211
c 13 8229
g 4 8241
n 4 8253
c 4 8263
t 10 8276
t 9 8286
c 12 8301
t 16 8313
c 12 8329
c 12 8336
c 15 8343
t 19 8356
c 9 8371
g 13 8386
g 14 8397
a 7 8417
g 9 8427
g 4 8445
(...)
t 6 11908
a 6 11921
g 6 11927
t 6 11947
c 6 11953
a 6 11964
g 6 11981
c 4 11994
n 4 12015
c 4 12037
n 4 12044
n 4 12058
n 4 12071
n 4 12085
n 4 12098
n 4 12111
n 4 12124
c 4 12144
n 4 12151
END_DNA

END_SEQUENCE
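Each line between BEGIN_DNA and END_DNA in the phd file shown above holds a base call, its quality value, and its position in the trace. The following Python sketch reads those three columns. It is a minimal illustration: the function name is hypothetical, and it handles only the fields visible on this slide.

```python
from typing import List, Tuple


def parse_phd_dna(text: str) -> List[Tuple[str, int, int]]:
    """Return (base, quality, trace_position) for each line of the DNA block."""
    records = []
    in_dna = False
    for line in text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            in_dna = False
        elif in_dna and line:
            base, quality, position = line.split()
            records.append((base, int(quality), int(position)))
    return records


# The first four base calls from the slide, wrapped in a minimal phd skeleton.
example = """BEGIN_SEQUENCE 01EBV10201A02.g
BEGIN_DNA
t 8 5
c 13 17
a 19 26
c 19 32
END_DNA
END_SEQUENCE
"""

records = parse_phd_dna(example)
for base, quality, position in records:
    print(base, quality, position)
```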

Page 41:

Page 42:

c 57 1778
t 57 1792
g 57 1805
a 57 1820
t 57 1828
g 57 1841
t 57 1853
g 57 1867
a 68 1880
c 68 1889
a 68 1902
g 68 1915
c 68 1927
t 68 1941
c 68 1954
t 68 1967
c 68 1979
a 68 1991
c 68 2000
t 68 2014
c 57 2028
t 57 2040
a 57 2053
g 57 2063
a 41 2079
g 57 2087
g 57 2100
c 57 2112
t 59 2125
g 54 2138
t 57 2149
t 57 2162
g 57 2176
c 57 2186
a 57 2199
g 57 2212
a 57 2228
g 57 2237
g 57 2250
t 57 2263
c 57 2274
c 57 2287
g 57 2302
c 57 2311
g 57 2326
a 57 2341
t 57 2350
t 57 2364
c 68 2375
c 68 2388
t 68 2400
t 68 2414
g 68 2427
c 68 2439
a 68 2451
g 68 2462
c 68 2474
t 68 2488
g 68 2501
c 68 2511
a 68 2523
t 68 2535
a 68 2548
c 68 2559
t 68 2572
a 68 2584
c 68 2596
a 68 2609

Page 43:

Page 44:

t 28 6526
c 31 6539
g 32 6552
t 35 6562
a 35 6574
t 39 6585
g 47 6597
c 43 6608
c 41 6621
c 32 6632
c 31 6645
a 37 6655
c 21 6664
c 18 6678
a 9 6688
g 9 6708
g 9 6712
g 9 6721
a 18 6734
g 37 6745
a 36 6758
t 37 6767
t 37 6779
c 32 6792
g 22 6804
g 20 6816
a 23 6829
c 23 6837
c 24 6852
g 22 6863
g 22 6875
a 25 6889
c 25 6897
a 24 6908
g 31 6919
t 34 6932
a 37 6941
a 37 6952
t 41 6964
c 39 6976
g 39 6988
a 28 6997
a 21 7008
t 15 7017
t 15 7027
c 12 7034
c 13 7049
c 14 7062
g 32 7078
c 20 7090
g 18 7101
g 10 7112
c 9 7121
c 9 7137
g 9 7149
c 9 7156
c 9 7171
a 18 7182
t 25 7192
g 37 7204
g 39 7214
c 36 7228
g 36 7238
g 31 7249
c 22 7262
c 22 7276
g 22 7288
g 20 7296
g 20 7311
a 19 7324
g 21 7333
c 15 7344
a 16 7353
t 15 7366

Page 45:

Page 46:

g 25 7377
c 22 7389
g 26 7402
a 16 7414
c 24 7423
g 15 7437
t 28 7450
c 19 7459
g 19 7475
g 19 7484
g 16 7491
c 19 7506
c 19 7520
c 32 7530
a 34 7540
a 37 7552
t 31 7562
t 26 7575
c 27 7586
g 27 7599
c 23 7607
c 26 7620
c 26 7631
t 30 7642
a 30 7653
t 15 7663
a 12 7674
g 11 7687
t 12 7698
g 12 7708
a 26 7720
g 21 7730
t 34 7743
c 34 7755
g 37 7766
t 37 7777
a 32 7787
t 16 7797
t 10 7809
a 8 7817
c 8 7828
a 8 7847
t 22 7860
t 19 7872
c 30 7881
a 37 7889
c 37 7900
t 25 7912
g 24 7923
g 22 7935
c 13 7942
c 13 7953
g 10 7963
c 12 7979
g 8 7988
t 8 8002
t 8 8019
t 12 8023
t 8 8034
t 6 8050
t 6 8061
a 6 8066
c 8 8086
a 6 8092
t 6 8107
a 7 8117
a 8 8126
a 8 8131
g 8 8145
g 8 8153

Page 47:

Phred/Phrap/Consed Pipeline

[Pipeline diagram repeated from Page 25]

Page 48:

Conversion of phd files into FASTA files: the phd2fasta script

Features:

- Phred creates single-sequence files (phd files) containing the sequence itself plus the quality assignments
- The input for the cross_match and phrap programs is a multiple-sequence file in FASTA format
- A Perl script named phd2fasta converts the phd files into two multiple-sequence FASTA-format files, containing the sequence information and the base-call quality information, respectively
- The phredPhrap script automatically executes phd2fasta before running cross_match and phrap!
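To make the conversion concrete, here is a Python sketch of the idea: one read's base calls and quality values are written as a FASTA record and a matching quality record. This is not the phd2fasta script itself; the function name, the header layout, and the line width are assumptions made for illustration.

```python
from typing import List, Tuple


def to_fasta_and_qual(name: str,
                      records: List[Tuple[str, int]],
                      width: int = 50) -> Tuple[str, str]:
    """Build a FASTA record and a quality record from (base, quality) pairs."""
    bases = "".join(base for base, _quality in records)
    quals = [str(quality) for _base, quality in records]
    fasta_lines = [f">{name}"]
    qual_lines = [f">{name}"]
    for start in range(0, len(records), width):
        fasta_lines.append(bases[start:start + width])
        qual_lines.append(" ".join(quals[start:start + width]))
    return "\n".join(fasta_lines) + "\n", "\n".join(qual_lines) + "\n"


# Hypothetical read name; the base calls are the first four from the phd slide.
fasta, qual = to_fasta_and_qual(
    "read1", [("t", 8), ("c", 13), ("a", 19), ("c", 19)]
)
print(fasta, end="")
print(qual, end="")
```

Concatenating such records for every read would give the two multiple-sequence files that cross_match and phrap expect as input.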

Page 49:

Phred/Phrap/Consed Pipeline

[Pipeline diagram repeated from Page 25]

Page 50:

Vector screening

Features:

This step removes or screens out vector sequences before running phrap

Program:

Cross_match: a program for rapid sequence comparison and database search, based on an efficient implementation of the Smith-Waterman-Gotoh algorithm.

Command:

cross_match seq_file1 [seq_file2 ...] [-option value] [-option value]

- seq_file is a file containing sequences in FASTA format
- all sequences in seq_file1 (query) are compared to sequences in seq_file2 (subject)
- matches meeting relevant criteria are written to the standard output
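The local alignment idea behind cross_match can be sketched in a few lines of Python. This is the basic Smith-Waterman recurrence with a linear gap penalty, not the Gotoh affine-gap variant and not cross_match's banded, optimized implementation. The scoring values are arbitrary assumptions, not cross_match defaults.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 1,
                         mismatch: int = -2,
                         gap: int = -3) -> int:
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment
                previous[j - 1] + step, # extend with a match or mismatch
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# A read and a vector fragment sharing the 4-base stretch "ACGT".
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Flooring every cell at zero is what makes the alignment local: a shared stretch scores well even when the flanking sequence is unrelated, which suits finding vector fragments inside reads.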

Page 51:

Vector screening

Example:

cross_match seqfile.fasta vector.seq -minmatch 10 -minscore 20 -screen > screen.out

where:
- 'seqfile.fasta' is a file containing multiple reads in FASTA format
- 'vector.seq' is a file containing the vector sequences
- '-minmatch' and '-minscore' are parameters for pairwise alignment
- '-screen' creates a file named seqfile.fasta.screen containing vector-masked versions of the original sequences. Any region matching any part of a vector sequence is replaced by Xs.
- 'screen.out' contains a list of the matches found
- the '.screen' file is the input for phrap
- if a '.qual' file was created (i.e. seqfile.fasta.qual), it has to be renamed to seqfile.fasta.screen.qual. The phredPhrap script automatically performs this step!
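The masking step can be illustrated with a small Python sketch: given a read and the coordinates of its vector matches, every matched base is replaced by X. The function name and the 0-based, end-exclusive coordinate convention are assumptions for this example. cross_match performs the matching and masking itself.

```python
from typing import Iterable, Tuple


def mask_regions(sequence: str, regions: Iterable[Tuple[int, int]]) -> str:
    """Replace each (start, end) region of the sequence with X characters.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    masked = list(sequence)
    for start, end in regions:
        for index in range(start, end):
            masked[index] = "X"
    return "".join(masked)


# A hypothetical read whose first four bases match the vector.
print(mask_regions("ACGTACGTAC", [(0, 4)]))
```

Masking keeps the read length and coordinates unchanged, so the quality file still lines up base for base with the screened sequence.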

Page 52:

Phred/Phrap/Consed Pipeline

[Pipeline diagram repeated from Page 25]

Page 53:

Phrap - Phragment Assembly Program or... Phil's Revised Assembly Program

Phrap is a program for assembling shotgun DNA sequence data

Command:

phrap seq_file1 [seq_file2 ...] [-option value] [-option value]

- seq_file is a file containing multiple sequences in FASTA format
- the current version only handles a single sequence file
- all the sequences in the seq_file are compared to each other

Page 54:

Phrap

Key Features:

a. Uses the entire read content: no need for trimming.
b. User-supplied (e.g. Repbase) + internally computed data: better accuracy of assembly in the presence of repeats.
c. The contig sequence is a mosaic of the highest-quality parts of the reads: it's not a consensus!
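The difference between a majority consensus and a quality-driven choice can be shown with a toy Python example. It picks, for each aligned column, either the most frequent base or the base with the highest quality value. This is only an illustration of the contrast. Phrap builds its mosaic from stretches of reads, not column by column, and the data below are invented.

```python
from collections import Counter
from typing import List, Tuple

Column = List[Tuple[str, int]]  # (base, quality) for every read covering a position


def majority_consensus(columns: List[Column]) -> str:
    """Pick the most frequent base in each column, ignoring quality."""
    return "".join(
        Counter(base for base, _quality in column).most_common(1)[0][0]
        for column in columns
    )


def highest_quality_pick(columns: List[Column]) -> str:
    """Pick the base call with the highest quality value in each column."""
    return "".join(max(column, key=lambda call: call[1])[0] for column in columns)


# Invented alignment: in the second column two low-quality reads say "c",
# while one high-quality read says "g".
columns = [
    [("a", 40), ("a", 35), ("a", 12)],
    [("c", 8), ("c", 9), ("g", 45)],
    [("t", 30), ("t", 28)],
]

print(majority_consensus(columns))
print(highest_quality_pick(columns))
```

In the second column the two approaches disagree, which is the case where a quality-aware choice matters.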

Page 55:

Phrap

Key Features:

e. Handles very large datasets: hundreds of thousands of reads are easily manipulated.
f. Generates output files: these contain some important data and enable visualization by other programs

AG-ICB-USPAG-ICB-USP

Page 56: DNA Assembly Sanger Reads Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP.

Phrap output files

• *.contigs - FASTA file containing the contigs
- Contigs with more than one read
- Singletons (single reads with a match to some other contig but that couldn't be merged consistently with it)

• *.singlets - FASTA file of the singlet reads
- Reads with no match to any other read

• *.ace - allows for viewing the assembly using Consed
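As a rough sketch, the contigs recorded in an ace file can be listed with a few lines of Python. It assumes the contig header layout "CO name bases reads base_segments U|C" used by the Consed ACE format; the sample text and the function name summarize_ace are made up for the illustration, and a real file holds many more record types that are simply skipped here.

```python
from typing import Dict, List


def summarize_ace(ace_text: str) -> List[Dict[str, object]]:
    """Collect name, length and read count from every contig (CO) header line."""
    contigs = []
    for line in ace_text.splitlines():
        fields = line.split()
        # Assumed header layout: CO <name> <bases> <reads> <base segments> <U|C>
        if len(fields) == 6 and fields[0] == "CO":
            contigs.append({"name": fields[1],
                            "bases": int(fields[2]),
                            "reads": int(fields[3])})
    return contigs


# Invented miniature example, reduced to the header lines and contig sequences.
sample = """AS 2 5

CO Contig1 12 3 1 U
ACGTACGTACGT

CO Contig2 8 2 1 U
TTGACCAA
"""
summary = summarize_ace(sample)
for contig in summary:
    print(contig["name"], contig["bases"], contig["reads"])
```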


Phred/Phrap/Consed Pipeline

Directories: chromat_dir, phd_dir, edit_dir

Steps, from input to finishing:

- Input: chromatogram files
- Quality (confidence) values assignment: Phred
  phd files - *.phd
- Conversion, phd to fasta: phd2fasta.pl
  nucleotide sequences - seq.fasta
  quality values - seq.fasta.qual
- Vector screening and masking: Cross_Match (local alignment program) x vector.seq
  screened/masked file - seq.fasta.screen
  quality values - seq.fasta.screen.qual
- Assembly: Phrap
  assembled contigs - seq.fasta.screen.contigs
  assembly file - seq.fasta.screen.ace#
- Assembly viewing/editing: Consed
- Finishing: Autofinish and manual finishing
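The file names in the pipeline follow a simple suffix pattern, which the sketch below spells out. The function name pipeline_files and the dictionary keys are hypothetical; only the suffixes come from the slide, and the numeric ending of the ace file (shown as # on the slide) is left out.

```python
from typing import Dict


def pipeline_files(fasta_name: str = "seq.fasta") -> Dict[str, str]:
    """Derive the file names each pipeline step reads or writes from the FASTA name."""
    screen = fasta_name + ".screen"              # written by the vector screening step
    return {
        "sequences": fasta_name,                 # phd2fasta output
        "qualities": fasta_name + ".qual",
        "screened": screen,
        "screened_qualities": screen + ".qual",
        "contigs": screen + ".contigs",          # Phrap output
        "ace_prefix": screen + ".ace",           # a version number follows on disk
    }


files = pipeline_files()
for role, name in files.items():
    print(role, name)
```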


Consed Genome Research 8: 195-202, 1998


Consed

Consed is a program for viewing and editing assemblies produced by Phrap

Key Features:

a. Assembly viewer - allows for visualization of contigs, assembly (aligned reads), quality values of reads and final sequence.

b. Trace file viewer - single and multiple trace files can be visualized, allowing for comparison of a given sequence in several reads.


Consed

Consed is a program for viewing and editing assemblies produced by Phrap

Key Features:

c. Navigation - identify and list regions which are below a given quality threshold, contain high quality discrepancies, single-strand coverage, etc.

d. Autofinish - automatic set of functions for: gap closure, improvement of sequence quality, determination of relative orientation of contigs, identification of regions covered by a single read or by reads of a single strand. The program automatically performs primer picking and chooses the templates.
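The idea behind listing regions below a quality threshold can be sketched as follows. This is not Consed's code, only a simple scan over a list of per-base quality values; the function name, the sample values and the threshold of 20 are assumptions chosen for the example.

```python
from typing import List, Tuple


def low_quality_regions(qualities: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) runs where quality < threshold."""
    regions = []
    start = None
    for pos, q in enumerate(qualities, start=1):
        if q < threshold:
            if start is None:
                start = pos                      # a low-quality run opens here
        elif start is not None:
            regions.append((start, pos - 1))     # the run closed at the previous base
            start = None
    if start is not None:                        # run reaching the end of the sequence
        regions.append((start, len(qualities)))
    return regions


quals = [45, 50, 12, 9, 18, 40, 60, 15, 55]
print(low_quality_regions(quals, threshold=20))
```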



Autofinish Genome Research 11: 614-625, 2001


Autofinish

Features:

- Autofinish is part of the Consed package.

- It automatically chooses finishing reads in order to finish a project.

- The "finished" status is defined by the user according to pre-defined parameters.


Autofinish

Autofinish allows the user to:

- Figure out how contigs are ordered and oriented

- Close gaps

- Improve the error rate

- Cover every base by reads from at least 2 different subclones
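The last criterion can be checked with a short script. The sketch below is an illustration only, not how Autofinish does it: read placements are given as invented (subclone, start, end) tuples, and the function name undercovered_positions is hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (subclone name, 1-based start, 1-based end) of each read placed on the contig
ReadPlacement = Tuple[str, int, int]


def undercovered_positions(reads: List[ReadPlacement], contig_length: int,
                           min_subclones: int = 2) -> List[int]:
    """List contig positions covered by fewer than min_subclones distinct subclones."""
    subclones_at: Dict[int, Set[str]] = {pos: set() for pos in range(1, contig_length + 1)}
    for subclone, start, end in reads:
        for pos in range(max(start, 1), min(end, contig_length) + 1):
            subclones_at[pos].add(subclone)      # two reads of one subclone count once
    return [pos for pos in range(1, contig_length + 1)
            if len(subclones_at[pos]) < min_subclones]


reads = [
    ("subcloneA", 1, 6),    # forward read
    ("subcloneA", 4, 9),    # reverse read of the same subclone
    ("subcloneB", 3, 7),
    ("subcloneC", 7, 10),
]
print(undercovered_positions(reads, contig_length=10))
```

Positions reported by such a check are the ones for which additional reads from another subclone would be requested.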


Autofinish

Autofinish will suggest any of the following types of reads:

- Forward universal primer terminator reads

- Reverse universal primer terminator reads

- Custom primer reads with subclone template

- Custom primer reads with whole clone template

- Minilibraries

- PCR


Autofinish

Finishing procedure (cycle):

Shotgun reads -> Autofinish suggests reads -> Make reads in lab -> Assemble new reads with existing reads -> Autofinish suggests reads (repeat)


How to get the programs

Supported platforms:

- Solaris
- Linux computers (i686, i386, EM64T, AMD64)
- Mac (OS X)

Note: there are commercial versions of Phred/Phrap for the DOS/Windows platform (no Consed version so far)

Internet site:

http://www.phrap.org/phredphrapconsed.html - academic version


Contacts

To obtain the programs, or for questions, bug reports and suggestions:

- Phrap/Cross_match/Swat - Phil Green - [email protected] (u.washington.edu)

- Phred - Brent Ewing - [email protected] (u.washington.edu)

- Consed - David Gordon - [email protected] (genome.washington.edu)


Preparing sequence trace data for analysis or assembly - pregap4

• Graphical user interface
• Prepare trace data
• Automation
• Trace format conversion
• Quality analysis
• Vector clipping
• Contaminant screening
• Repeat searching

The Staden Package
Medical Research Council - Laboratory of Molecular Biology (MRC-LMB) - UK
(no longer supported by the original team; now open source)


The Staden Package

Assembly program - Gap4 and Gap5

• Assembly
• Contig joining
• Assembly checking
• Repeat searching
• Experiment suggestion
• Read pair analysis
• Contig editing
• Graphical views of contigs
• Database

Note: ace files produced by a special version of Phrap can be viewed by Gap4


The Staden Package

Supported platforms:
- Sun Solaris
- Compaq Tru64 UNIX (Alpha)
- SGI Irix
- Linux
- MS Windows (Win9x, NT, 2000)

Availability:
- http://staden.sourceforge.net/


CAP3 - Sequence Assembly Program Genome Research 9: 868-877, 1999


CAP3 - Sequence Assembly Program

Characteristics:

- Makes use of quality values - qual files produced by Phred can be used by CAP3

- Produces an ace file compatible with Consed

- Can also be used in Gap4 (Staden Package)

- Program available at http://seq.cs.iastate.edu/
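For readers who have not opened a qual file: it is laid out like a FASTA file, with a header line per read followed by space-separated integer quality values. The sketch below parses such text and converts Phred-scaled values to error probabilities using Q = -10 log10(P). The read names and values are invented, and the helper names are hypothetical.

```python
from typing import Dict, List


def read_qual(text: str) -> Dict[str, List[int]]:
    """Parse FASTA-like quality text: '>name' headers followed by integers."""
    records: Dict[str, List[int]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]           # keep only the read name
            records[name] = []
        elif name is not None:
            records[name].extend(int(tok) for tok in line.split())
    return records


def error_probability(q: int) -> float:
    """Phred scale: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


qual_text = """>read1.b1 some description
10 20 30
40
>read2.g1
20 20
"""
quals = read_qual(qual_text)
expected_errors = sum(error_probability(q) for q in quals["read1.b1"])
print(quals)
print(round(expected_errors, 4))   # expected number of wrong bases in read1.b1
```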


Finishing Problems

Finishing can be a boring and difficult task due to:

DNA sequencing problems

a. High GC content - genomes with a high GC content are more prone to generate artifacts such as compressions, sudden drops and bad quality regions. Try to use Dye Primer instead of Dye Terminator, change chemistry, add DMSO, increase annealing temperature, use deaza-dGTP instead of dGTP, etc.

b. Palindromic regions - lead to strong secondary structures causing sudden drops. Try to use deaza-dGTP instead of dGTP, or amplify the problematic region by PCR and sequence the product.

c. Homopolymeric regions - can reduce DNA synthesis efficiency for some chemistries. Try to use Dye Primer instead of Dye Terminator, or change chemistry (dRhodamine instead of BigDye).
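The three sequence features listed above can be spotted in a draft sequence before ordering new reads. The following is a minimal sketch with invented helper names and an arbitrary run length cutoff of 8 bases; it flags candidates only and says nothing about whether a given chemistry will actually fail there.

```python
import re
from typing import List, Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def homopolymer_runs(seq: str, min_length: int = 8) -> List[Tuple[int, int, str]]:
    """Return (1-based start, length, base) for each single-base run >= min_length."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [(m.start() + 1, len(m.group(0)), m.group(1))
            for m in pattern.finditer(seq.upper())]


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def is_palindromic(seq: str) -> bool:
    """A DNA palindrome reads the same as its reverse complement."""
    return seq.upper() == reverse_complement(seq)


seq = "GGCCGCGC" + "A" * 10 + "TTCG"
print(round(gc_fraction(seq), 3))
print(homopolymer_runs(seq))
print(is_palindromic("GAATTC"), is_palindromic("GAATTA"))
```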


Finishing Problems

Finishing can be a boring and difficult task due to:

DNA assembly problems

a. High repeat content - highly repeated elements reduce accuracy of DNA assembly. Identify the repeat unit, screen it with Cross_Match or RepeatMasker and mask it. Try to assemble again and add the repetitive region only at the end. Map the repetitive region using restriction enzymes to estimate its size and number of repeat units.

b. High AT content - some highly biased genomes (e.g. Plasmodium falciparum; plastid genomes) can pose a problem for assembly programs. Very difficult to solve. Try to determine a restriction map and associate mapping with DNA sequencing data.
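What masking does to a read can be shown with a toy example. The sketch below replaces exact copies of a repeat unit by X; this is a deliberate simplification, since Cross_Match and RepeatMasker find repeats by alignment and tolerate mismatches. The function name mask_repeat and the sample read are invented for the illustration.

```python
def mask_repeat(seq: str, repeat_unit: str, mask_char: str = "X") -> str:
    """Replace every exact occurrence of repeat_unit (overlaps included) by mask_char."""
    if not repeat_unit:
        return seq
    seq_upper, unit = seq.upper(), repeat_unit.upper()
    masked = list(seq)
    start = seq_upper.find(unit)
    while start != -1:
        for i in range(start, start + len(unit)):
            masked[i] = mask_char
        start = seq_upper.find(unit, start + 1)   # step by 1 to catch overlapping copies
    return "".join(masked)


read = "ACG" + "TTAGGG" * 3 + "CATG"
print(mask_repeat(read, "TTAGGG"))
```

Masked bases are ignored when overlaps are searched, so the unique flanks (here ACG and CATG) are what remains available to place the read.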
