Page 1

Genome Sequencing: Technology and Strategies

Chuong Huynh

NIH/NLM/NCBI

huynh@ncbi.nlm.nih.gov

Acknowledgement: Daniel Lawson (Sanger Institute) and Jane Carlton (TIGR)

Page 2

Bioinformatics Flow Chart

1a. Sequencing

1b. Analysis of nucleic acid sequence

2. Analysis of protein sequence

3. Molecular structure prediction

4. Molecular interaction

5. Metabolic and regulatory networks

6. Gene & protein expression data

7. Drug screening: ab initio drug design OR drug compound screening in a database of molecules

8. Genetic variability

Page 3

How to sequence a genome

• development of sequencing strategy and source of funding

• procurement of DNA and initial library construction

• test sequencing

• large-scale random sequencing of small (2-3 kb), medium (10 kb) and large (>50 kb) libraries

• analysis of raw sequence data by BLAST, RepeatFinder, etc.

• release of genome data onto sequencing center website

• at 8-10X coverage, random sequencing stops

• closure of sequence gaps and physical gaps

• comparison to physical map

• gene model prediction

• final gene model annotation

• release of data to GenBank and publication

Page 4

Full shotgun sequencing

Workflow diagram; labels as extracted, in workflow order:

• Genomic DNA (Marker1, Marker2)

• large insert library (20 - 500 kb)

• Minimal tiling path

• shotgun library: small (2-3 kb) and medium (10 kb)

• Sequencing (8-10X)

• Assembly (contig, scaffold)

• Gap closure

• gene prediction, annotation and analysis
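Note on "minimal tiling path": the slide does not say how the path is chosen. A minimal sketch, assuming each large-insert clone has already been placed on a physical map as a (name, start, end) interval, is the classic greedy interval cover below. The function name, the clone names and the coordinates are hypothetical and only illustrate the idea; real projects derive clone positions from fingerprint or STS maps.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a small set of overlapping clones that covers [region_start, region_end).

    clones: list of (name, start, end) tuples, end exclusive (hypothetical map units).
    Greedy rule: among clones starting at or before the current position,
    take the one that extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    pos = region_start
    i = 0
    while pos < region_end:
        best = None
        # Consider every clone that starts at or before the position reached so far.
        while i < len(ordered) and ordered[i][1] <= pos:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= pos:
            raise ValueError(f"gap in clone coverage at position {pos}")
        path.append(best[0])
        pos = best[2]
    return path


if __name__ == "__main__":
    # Hypothetical clone map, coordinates in kb.
    clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 150),
              ("D", 170, 300), ("E", 250, 320)]
    print(minimal_tiling_path(clones, 0, 300))  # ['A', 'B', 'D']
```

Clone C is skipped because B starts early enough and reaches further; a ValueError signals a physical gap that no clone spans.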

Page 5

Partial shotgun sequencing

Workflow diagram; labels as extracted, in workflow order:

• Genomic DNA

• shotgun library: small (2-3 kb) and medium (10 kb)

• Sequencing (5X)

• Assembly (contig, scaffold)

• Analysis

Page 6

Genome sequencing terms

Raw sequence: unassembled sequence reads produced from sequencing of inserts from individual recombinant clones of a genomic DNA library.

Finished sequence: complete sequence of a genome with no gaps and an accuracy of > 99.9%.

Genome coverage: average number of times a nucleotide is represented by a high-quality base in random raw sequence.

Full shotgun coverage: genome coverage in random raw sequence required to produce finished sequence, usually 8-10 fold ('8-10X').

Partial shotgun coverage: typically 3-6X random coverage of a genome, which produces sequence data of sufficient quality to enable gene identification but which is not sufficient to produce a finished genome sequence.

Paired reads: sequence reads determined from both ends of a cloned insert in a recombinant clone.

Contig: contiguous DNA sequence produced from joining overlapping raw sequence reads.

Singleton: single sequence read that cannot be joined ('assembled') into a contig.

Scaffold: a group of ordered and orientated contigs known to be physically linked to each other by paired read information.

EST: expressed sequence tag generated by sequencing one end of a recombinant clone from a cDNA library. ESTs are single-pass reads and therefore prone to contain sequence errors.

GSS: genome survey sequence generated by sequencing one end of a recombinant clone from a genomic DNA library. The genomic DNA library can in some instances be enriched for the presence of coding regions, for example through use of mung bean nuclease digestion of genomic DNA prior to cloning.

SNP: single nucleotide polymorphism.

ORF: open reading frame, stretches of codons in the same reading frame uninterrupted by STOP codons and calculated from a six-frame translation of DNA sequence.
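Worked example for "genome coverage": coverage is commonly computed as (number of reads x read length) / genome size. The sketch below applies that formula and adds the standard Poisson (Lander-Waterman) approximation that a given base is missed with probability e^-c at coverage c. The 30 Mb genome size and the 500 bp average read length are illustrative assumptions, not figures taken from a specific project.

```python
import math


def fold_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average number of times each base is sequenced: N * L / G."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Smallest number of reads whose total length reaches the target coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


def expected_uncovered_fraction(coverage):
    """Poisson approximation: fraction of bases hit by no read at all."""
    return math.exp(-coverage)


if __name__ == "__main__":
    genome = 30_000_000  # assumed genome size (bp)
    read_len = 500       # assumed average high-quality read length (bp)
    for c in (5, 8, 10):
        n = reads_needed(genome, c, read_len)
        missed = expected_uncovered_fraction(c) * genome
        print(f"{c}X: {n} reads, ~{missed:.0f} bases expected uncovered")
```

Under this simple model, partial (5X) coverage still leaves a substantial number of bases unsampled, which is consistent with the distinction the glossary draws between partial and full shotgun coverage.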

Page 7

Jan 2003

Page 8

NCBI Trace Archive, Sep 23, 2003

Page 9

Large-scale genome projects

Navigation sidebar (repeated on later slides): Libraries, Sequencing, Release, Assembly, Annotation, Closure, Strategy

• Sequencing DNA molecules in the Mb size range

• All strategies employ the same underlying principle: random shotgun sequencing

Page 10

Shotgun sequencing workflow (diagram): Genomic DNA -> Shearing/Sonication -> Subclone and Sequence -> Shotgun reads -> Assembly -> Contigs -> Finishing (finishing reads) -> Complete sequence

Page 11

Strategies for sequencing

• How big can you go?

• Large-insert clones

• cosmids 30-40 kb

• BACs/PACs 50 - 100 kb

• Whole chromosomes

• Whole genomes

Page 12

Genome size and sequencing strategies

Figure: genomes placed on a log scale of genome size (log Mb, axis 0 to 4) against the sequencing strategy used.

Genomes shown: E. coli (4 Mb), S. cerevisiae (14 Mb), P. falciparum (30 Mb), C. elegans (100 Mb), D. melanogaster (170 Mb), H. sapiens (3000 Mb)

Strategies shown: Clone-by-clone; Whole Chromosome Shotgun (WCS); Whole Genome Shotgun (WGS); Whole Genome Shotgun (WGS) with clone 'skims'

Page 13

Shotgun sequencing workflow (diagram, same as Page 10): Genomic DNA -> Shearing/Sonication -> Subclone and Sequence -> Shotgun reads -> Assembly -> Contigs -> Finishing (finishing reads) -> Complete sequence

Page 14

Strategies for sequencing

• Size and GC composition of genome

• Volume of data

• Ease of cloning

• Ease of sequencing

• Genome complexity

• dispersed repetitive sequence

• telomeres & centromeres

• Politics/Funding

Page 15

Strategies: Clone by Clone

• Simple (0.5 - 2 K reads)

• Few problems with repeats

• Relatively simple informatics

• Scalability

• Quality of physical map

• Fingerprint / STS maps

• End sequencing

Page 16

Strategies: Whole Chromosome shotgun (WCS)

• Requires chromosome isolation

• Moderate complexity (10’s K reads)

• Problems with repeats

• Complex informatics

• Inefficient in isolation

• Quality of physical map (want good physical map)

• Skims of mapped clones

Page 17

Strategies: Whole Genome shotgun (WGS)

• Moderate to High complexity (10-100’s K reads)

• Massive Problems with repeats

• Complex informatics

• Quality of physical map

• Fingerprint map

• STS markers

• End-sequences

• Skims of mapped clones

Page 18

Sequencing my genome

Annotation

Finishing

Production

Politics

TIME MONEY

Page 19

What do you get?

• Sequence

• incomplete complete

• First-pass annotation

• Gene discovery

• Full annotation

• A starting point for research

DATA!!, DATA !!, and more DATA!!

Page 20

Genome annotation is central to functional genomics

Gene Knockout

Expression Microarray

RNAi phenotypes

ORFeome based functional genomics

Page 21

Where is the problem?

Most genomes will be sequenced and can be sequenced; few problems are unsolvable.

The problem lies in understanding what you have:

• gene prediction

• annotation

Page 22

Sequencing

• Library construction

• Colony picking (random)

• DNA preparation (isolate DNA)

• Sequencing reactions

• Electrophoresis

• Tracking/Base calling

Page 23

Libraries

• Essentially sub-cloning

• Generation of small insert libraries in a well characterised vector.

• Ease of propagation

• Ease of DNA purification

• e.g. pUC18, M13

Page 24

Libraries - testing

• Simple concepts

• Insert/Vector ratio (Blue/White ratio)

• Real data

• Insert size

• Sequence ….

• Simple analysis

Page 25

Sequence generation

• Pick colonies into growth medium

• Template preparation (DNA isolation)

• Sequence reactions

• Standard terminator chemistry

• pUC libraries sequenced with forward and reverse primers

• Tracking and noise

Page 26

Sequence generation

• Electrophoresis of products

• Old style - slab gels, 32 > 64 > 96 lanes

• New style - capillary gels, 96 lanes

• Transfer of gel image to UNIX

• Sequencing machines use a slave Mac/PC

• Move data to centralised storage area for processing

Page 27

Gel image processing

• Light-to-Dye estimation

• Lane tracking

• Lane editing

• Trace extraction

• Trace standardisation

• Mobility correction

• Background subtraction

Page 28

Pre-processing

• Base calling using Phred

• modifies SCF file format

• Quality clipping from Phred

• Vector clipping

• Sequencing vector

• Cloning vector

• Screen for contaminants

• Feature mark up (repeats/transposons)
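Note on Phred quality values and quality clipping: a Phred score Q encodes the estimated base-call error probability P as Q = -10 * log10(P), so Q20 means 1 error in 100 and Q40 means 1 error in 10,000. The sketch below shows that conversion together with a deliberately simple clipping rule (keep the longest run of bases at or above a threshold). The clipping rule and the threshold of 20 are assumptions made for illustration; they are not the trimming algorithm that Phred itself applies.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def quality_clip(quals, min_q=20):
    """Return (start, end) of the longest run of bases with quality >= min_q.

    The interval is half-open, so read[start:end] is the retained sequence.
    Returns (0, 0) when no base reaches the threshold.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


if __name__ == "__main__":
    read = "NNACGTANC"                       # hypothetical base calls
    quals = [5, 8, 25, 30, 40, 35, 10, 22, 6]  # hypothetical quality values
    start, end = quality_clip(quals)
    print(read[start:end])  # ACGT
```

Vector clipping follows the same pattern (compute a retained interval per read), except that the interval comes from matching the read against known vector sequence rather than from quality values.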

Page 29

Finishing

• Assembly: Process of assembling raw single-pass reads into contiguous consensus sequence (Phred/Phrap)

• Closure: Process of ordering and merging consensus sequences into a single contiguous sequence

• Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb
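Illustration of the assembly step: the toy sketch below joins reads by repeatedly merging the pair with the longest exact suffix-prefix overlap, which is enough to show how reads become contigs and why some reads stay singletons. It is an assumption-laden teaching example, not how Phrap works: it ignores quality values, sequencing errors, reverse-complement reads, contained reads and repeats. The reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by best exact overlap; return the remaining sequences."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    reads = ["ATGCGT", "CGTACC", "ACCGGA", "TTTTTT"]  # invented reads
    print(greedy_assemble(reads))  # ['TTTTTT', 'ATGCGTACCGGA']
```

Here three reads collapse into one contig and "TTTTTT" remains a singleton. Closure then corresponds to ordering and joining whatever contigs are left, using information the overlaps alone cannot provide.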

Page 30

Genome Assembly

• Pre-assembly (assembly algorithm)

• Assembly

• Automated appraisal

• Manual review

Page 31

Pre-Assembly

• Convert to CAF format

• flatfile text format

• choice of assembler

• choice of post-assembly modules

• choice of assembly editor

www.sanger.ac.uk/Software/CAF

Page 32

Assembly

• Assemble using Phrap

• Read fasta & quality scores from CAF file

• Merge existing Phrap .ace file (previous assembly) as necessary

• Adjust clipping (where vector, quality start)

Page 33

Assembly appraisal

• auto-edit

• removes 70% of read discrepancies in the sequence assembly (highlights misassemblies); the remainder are reviewed manually

• Remove cloning vector

• Mark up sequence features (for finisher)

• “Finish” Program (or Program “AutoFinish”)

• Identify low-quality regions

• Cover using ‘re-runs’ and ‘long-runs’

• Compare with current databases

• plate contamination

Page 34

Manual Assembly appraisal

• Use a sequence editor (GAP/consed)

• Tools to identify Internal joins

• Tools to identify and import data from overlapping projects

• Tools to check failed or mis-assembled reads for inclusion in project

Page 35

Manual editing

• Sanger uses 100% edit strategy

• Where additional data is required:

• Check clipping

• Additional sequencing

• Template / Primer / Chemistry

• Assemble new data into project

• GAP4 Auto-assemble

• Repeat whole process

Page 36

Manual Quality Checks

• Force annotation tag consistency

• All unedited data is re-assembled using Phrap

• All high-quality discrepancies are reviewed

• Confirm restriction digest (clones)

• Check for inverted repeats

• Manually check:

• Areas of high-density edits

• Areas with no supporting unedited data

• Areas of low read coverage (need to confirm)

Page 37

Gap closure

• Read pairs

• PCR reactions (long-range / combinatorial)

• Small-insert libraries

• Transposon-insertion libraries

Page 38

Gap closure - contig ordering

• Read pair consistency

• STS mapping

• Physical mapping

• Genetic mapping

• Optical mapping

• Large-insert clone

• skims

• end-sequencing

Page 39

Annotation

• DNA features (repeats/similarities)

• Gene finding

• Peptide features

• Initial role assignment

• Others: regulatory regions

Page 40

Annotation of eukaryotic genomes

Figure (gene expression pathway): Genomic DNA -> transcription -> Unprocessed RNA -> RNA processing -> Mature mRNA (poly-A tail, AAAAAAA) -> translation -> Nascent polypeptide -> folding -> Active enzyme -> Function (Reactant A -> Product B)

Annotation approaches marked on the figure: ab initio gene prediction; Comparative gene prediction; Functional identification

Other figure label: Gm3

Page 41

Genome analysis overview: C. elegans

Page 42

DNA features

• Similarity features

• mapping repeats

• simple tandem and inverted

• repeat families

• mapping DNA similarities

• EST/mRNAs in eukaryotes

• Duplications,

• RNAs

• mapping peptide similarities

• protein similarities

Page 43

Gene finding

• ORF finding (simple but messy)

• ab initio prediction

• Measures of codon bias

• Simple statistical frequencies

• Comparative prediction

• Using similarity data

• Using cross-species similarities
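Illustration of ORF finding: the sketch below scans all six reading frames (three per strand) and reports stretches that begin with ATG and run to the first in-frame stop codon. Requiring an ATG start and the min_codons cut-off are simplifying assumptions made here; the glossary definition only requires codons uninterrupted by stop codons. The test sequence is invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """List (strand, frame, start, end, orf) for ATG...stop ORFs in six frames.

    Coordinates are 0-based, half-open, and relative to the scanned strand.
    min_codons counts codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC"):
        print(orf)  # ('+', 2, 2, 17, 'ATGAAATTTGGGTAA')
```

This is the "simple but messy" part: on real genomic sequence the scan returns many spurious short ORFs and cannot see introns, which is why codon-bias statistics and similarity data are layered on top.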

Page 44

Peptide features

• Peptide features

• low-complexity regions

• trans-membrane regions

• structural information (coiled-coil)

• Similarities and alignments

• Protein families (InterPro/COGS)

Page 45

Initial role assignment

• Simple attempt to describe the functional identity of a peptide

• Uses data from:

• peptide similarities

• protein families

• Vital for data mining

• Large number of predicted genes remain hypothetical or unknown

Page 46

Other regulatory features

• Ribosomal binding sites

• Promoter regions

Page 47

Data Release

• DNA release

• Unfinished

• Finished

• Nucleotide databases

• GENBANK/EMBL/DDBJ

• Peptide databases

• SWISSPROT/TREMBL/GENPEPT

• Others

Page 48

Real World Example: Malaria Genome Project

If time permits.

Page 49

Sequencing the Plasmodium genomes

Four species of malaria infect man: Plasmodium falciparum, P. vivax, P. malariae, P. ovale

Four species of malaria infect rodents: P. yoelii, P. berghei, P. chabaudi, P. vinckei

Page 50

Plasmodium falciparum

• ~30 million base pairs (Mb)

• 80% (A+T)

• 14 chromosomes

• DNA "unstable" in E. coli

• No large insert DNA clones suitable for sequencing

• Too large for whole genome shotgun ('96)

• Whole chromosome shotgun strategy was selected

Figure (chromosome separation); labels as extracted. Chromosome labels: 1, 2, 3, 4, 5-9, 10, 11, 12, 13-14. Sizes shown: 0.8 Mb, 1.0 Mb, 1.2 Mb, 1.4 Mb, 1.6 - 1.8 Mb, 2.1 Mb, 2.3 Mb, 2.4 Mb, 3.2 Mb, 3.4 Mb.
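Note on base composition: the (A+T) percentage quoted for P. falciparum is a simple count over the sequence. A minimal sketch of that calculation follows; ignoring ambiguous bases such as N is a choice made here for illustration, and the example sequences are invented.

```python
def base_composition(seq):
    """Return percent A+T and percent G+C, ignoring characters other than A, C, G, T."""
    counted = [b for b in seq.upper() if b in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    at = sum(b in "AT" for b in counted)
    total = len(counted)
    return {"AT": 100 * at / total, "GC": 100 * (total - at) / total}


if __name__ == "__main__":
    print(base_composition("ATATATATGC"))  # {'AT': 80.0, 'GC': 20.0}
```

A composition this skewed matters in practice because it affects how stable the DNA is when cloned in E. coli and how easy it is to sequence, both of which are listed on the strategy slides.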

Page 51

Page 52

Comparison of genome features

Feature                                   P. y. yoelii    P. falciparum
Size (Mb)                                 23.1            22.9
No. chroms                                14              14
Coverage (fold)                           5               14.5
No. gaps                                  5,812           93
(G+C) content (%)                         22.6            19.4
No. genes                                 5,878           5,268
Mean gene length (bp)                     1,298           2,283
Gene density (bp/gene)                    2,566           4,338
Genes with introns (%)                    54.2            53.9
Genes with ESTs (%)                       48.9            49.1
Genes with proteomic data (%)             18.2            51.8
Exons: mean no./gene                      2.0             2.4
Exons: (G+C) content (%)                  24.8            23.7
Introns: (G+C) content                    21.1            13.5
Intergenic sequences: (G+C) content       20.7            13.6
RNAs: no. tRNAs                           39              43
RNAs: no. 5S rRNAs                        3               3
RNAs: no. rRNA units                      4               7

Page 53

P. falciparum genome status

Chr             Size (bp)     No. gaps   Fold coverage
1               643,293       0          13.3
2 (TIGR)        947,102       0          11.1
3               1,060,087     0          10.9
4               1,204,112     0          16.8
5               1,343,552     0          15.1
6               1,377,956     8          16.8
7               1,350,452     14         15.8
8               1,323,195     24         16.2
9               1,541,723     0          17.9
10 (TIGR)       1,694,445     4          15.6
11 (TIGR)       2,035,250     3          11.3
12 (Stanford)   2,271,477     0          16.3
13              2,747,327     37         17.2
14 (TIGR)       3,291,006     3          9.2
0               22,788        0          ND
Total           22,853,764    93         14.5

Page 54

Eukaryotic annotation - TIGR

Figure (annotation pipeline diagram); labels as extracted:

• EGC

• Annotation Station/Manatee

• DDS/DPS

• Annotation DB

• Project DB

• Functional assignments: BLAST, PFAM/TIGRFAM, SignalP/TMHMM

• Gene models: gene finders; alignments of genomic sequence to proteins and ESTs

Page 55

PFB0680w

Page 56

Page 57

The P. falciparum genome

Feature                    P. falciparum   S. pombe     S. cerevisiae   D. discoideum   A. thaliana
Size (bp)                  22,853,764      12,462,637   12,495,682      8,100,000       115,409,949
(G+C) content (%)          19.4            36.0         38.3            22.2            34.9
No. of genes               5,268           4,929        5,770           2,799           25,498
Mean gene length* (bp)     2,283           1,426        1,424           1,626           1,310
Gene density†              4,338           2,528        2,088           2,600           4,526
Percent coding             52.6            57.5         70.5            56.3            28.8
Genes with introns (%)     53.9            43           5.0             68              79
No. tRNA genes             43              174          ND              73              ND
No. 5S rRNA genes          3               30           ND              NA              ND
No. rRNA units             7               200-400      ND              NA              700-800

*excluding introns; †bp per gene
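Worked check of two derived rows for the P. falciparum column: dividing genome size by gene count reproduces the gene density, and multiplying gene count by mean gene length (excluding introns) and dividing by genome size reproduces the percent coding. That the slide's values were computed exactly this way is an assumption; the other columns may rest on different counts or rounding and are not claimed to reproduce.

```python
def gene_density(genome_size_bp, n_genes):
    """Average number of base pairs per gene."""
    return genome_size_bp / n_genes


def percent_coding(n_genes, mean_gene_length_bp, genome_size_bp):
    """Percentage of the genome occupied by coding sequence."""
    return 100 * n_genes * mean_gene_length_bp / genome_size_bp


if __name__ == "__main__":
    # P. falciparum values as given in the table.
    size_bp, n_genes, mean_len_bp = 22_853_764, 5_268, 2_283
    print(round(gene_density(size_bp, n_genes)))                    # 4338
    print(round(percent_coding(n_genes, mean_len_bp, size_bp), 1))  # 52.6
```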

Page 58

Distribution of gene lengths

Figure: histogram of number of genes (y-axis, 0 to 3000) by gene length in bp, excluding introns (x-axis bins: < 300, 300-999, 1000-1999, 2000-2999, 3000-3999, >4000), for P. falciparum, S. pombe and S. cerevisiae. Annotations on the figure: 15.5%; 3.0-3.6%.

Page 59

The P. falciparum proteome

Feature                         Number    Per cent
Total predicted proteins        5,268
Hypothetical proteins           3,208     60.9
InterPro matches                2,650     52.8
PFAM matches                    1,746     33.1
Gene Ontology™
  Process                       1,301     24.7
  Function                      1,244     23.6
  Component                     2,412     45.8
Targeted to apicoplast          551       10.4
Targeted to mitochondrion       246       4.7
Structural features
  Transmembrane domain(s)       1,631     31.0
  Signal peptide                544       10.3
  Signal anchor                 367       7.0
  Non-secretory proteins        4,357     82.7

Page 60

52% of predicted gene products detected by proteomics

Florens et al. Nature 419:520-526

Page 61

Metabolism and transport

• Analysis based on similarity searches with sequences of known enzymes

• 14% (733) of genes encode enzymes; lower than in bacterial genomes (25-33%)

• Either enzymes are more difficult to identify, due to the AT-rich genome and the evolutionary distance between P.f. and other sequenced organisms

• Or P.f. has a smaller proportion of its genome devoted to enzymes (reduced metabolic potential)

Page 62

Figure: overview of predicted metabolism and transport in P. falciparum (labels as extracted).

Pathways shown: Glycolysis; Pentose Phosphate Pathway; Shikimic Acid Pathway; Folate Biosynthesis; Purines and Pyrimidines (purine salvage, pyrimidine synthesis); Glycerolipid Metabolism; Haem Biosynthesis; Fatty Acid Biosynthesis and fatty acid elongation; DOXP Pathway; Tricarboxylic acid cycle; Amino Compounds (including the methionine salvage pathway); haemoglobin digestion to peptides and amino acids, with haemozoin formation.

Compartments shown: APICOPLAST; MITOCHONDRION; FOOD VACUOLE.

Transporters shown: F, V, & P-type ATPases; ABC transporters; mitochondrial/plastid carriers; channels and carriers for ions, sugars, nucleosides/bases, carboxylates, water/glycerol, metabolites and drugs.

Drugs marked on the figure: Pyrimethamine, Cycloguanil, Sulfonamides, Triclosan, Thiolactomycin, Fosmidomycin, Atovaquone, Chloroquine, Artemisinin, Quinine, protease inhibitors. Several pathways are also marked "NOVEL INHIBITORS".

Page 63

Analysis of transporters in P. falciparum

Page 64

Organization of multi-gene families in P. falciparum

Page 65

Page 66

P. falciparum Genome Summary

Feature                                              Value                    Comments
Genome size                                          24 million base pairs    1% of the human genome
Number of chromosomes                                14                       Human: 23 pairs
Number of gaps                                       93 (0-37 per chr)        Genome >98% complete
(A+T) content                                        ~80.6%                   Most (A+T) rich genome sequenced to date
Number of genes                                      ~5,300                   Yeast: 5,770; Human: ~35,000
Proteins of unknown function                         60%                      More than other genomes
Possible surface proteins                            ~900                     Test for use in vaccines
Gene products detected by proteomics                 52%                      See Florens et al.; see Lasonder et al.
Genes conserved in rodent malaria P. yoelii yoelii   60%                      See Carlton et al.

Page 67

Extra Slides

Page 68