Introduction to Genomics and the Tree of Life Chapter 13.

Page 1: Introduction to Genomics and the Tree of Life Chapter 13.

Introduction to Genomics and the Tree of Life

Chapter 13

Page 2: Introduction to Genomics and the Tree of Life Chapter 13.

Extra-Reading

• Next generation sequencer: What can next-generation sequencers do for genetics/genomics research?

• Compar_genomics: What can we learn from comparative genomics?

Page 3: Introduction to Genomics and the Tree of Life Chapter 13.

Outline of today’s lecture

Introduction: 5 perspectives, history of life

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 4: Introduction to Genomics and the Tree of Life Chapter 13.

Five approaches to genomics

As we survey the tree of life, consider these perspectives:

Approach I: cataloguing genomic information
Genome size; number of chromosomes; GC content; isochores; number of genes; repetitive DNA; unique features of each genome

Approach II: cataloguing comparative genomic information
Orthologs and paralogs; COGs; lateral gene transfer

Approach III: function; biological principles; evolution
How genome size is regulated; polyploidization; birth and death of genes; neutral theory of evolution; positive and negative selection; speciation

Approach IV: Human disease relevance

Approach V: Bioinformatics aspects
Algorithms, databases, websites

Page 519

Page 5: Introduction to Genomics and the Tree of Life Chapter 13.

Introduction

Lessons learned from comparative genomics
• What have we learned about genes by comparing genomic sequences?
• What have we learned about regulation?
• About 5% of the human genome is under purifying selection
• Positively regulated regions
• Mechanisms and history of mammalian evolution
• Nonuniformity of neutral evolutionary rates within species
• Nonuniformity of evolution along the branches of phylogeny

Learning more from existing data
• Choice of species
• Choice of tools

Future of comparative genomics

Page 6: Introduction to Genomics and the Tree of Life Chapter 13.

Levels of analysis in genomics

level        topics                  databases
DNA          genes, chromosomes      GenBank
RNA          ESTs, ncRNA             UniGene, GEO
protein      ORFs, composition       UniProt
complexes    binary, multimeric      BIND
pathways                             COGs, KEGG
organelles
organs
individuals  variation and disease   HapMap
species      speciation              TaxBrowser; SGD
genus                                JAX mouse
phylum                               FishBase
kingdom                              TOL

Page 7: Introduction to Genomics and the Tree of Life Chapter 13.

Definitions of terms

Genomics is the study of genomes (the DNA comprising an organism) using the tools of bioinformatics.

Bioinformatics is the study of proteins, genes, and genomes using computer algorithms and databases.

Systematics is the scientific study of the kinds and diversity of organisms and of any and all relationships among them.

Classification is the ordering of organisms into groups on the basis of their relationships. The relationships may be evolutionary (phylogenetic) or may refer to similarities of phenotype (phenetic).

Taxonomy is the theory and practice of classifying organisms.

Page 8: Introduction to Genomics and the Tree of Life Chapter 13.

Fig. 13.1, Page 521

Pace (2001) described a tree of life based on small subunit rRNA sequences.

This tree shows the three main branches described by Woese and colleagues.

Page 9: Introduction to Genomics and the Tree of Life Chapter 13.

Historically, trees were generated primarily using characters provided by morphological data. Molecular sequence data are now commonly used, including sequences (such as small-subunit RNAs) that are highly conserved.

Visit the European Small Subunit Ribosomal RNA database for 20,000 SSU rRNA sequences.

Molecular sequences as basis of trees

Page 523

Page 10: Introduction to Genomics and the Tree of Life Chapter 13.

http://www.zo.utexas.edu/faculty/antisense/Download.html

Tree of life from David Hillis’ lab (based on ~3000 rRNAs)

[Figure labels: animals, plants, fungi, protists, bacteria, archaea, and a "you are here" marker.]

Page 11: Introduction to Genomics and the Tree of Life Chapter 13.

Page 12: Introduction to Genomics and the Tree of Life Chapter 13.

Ribosomal RNA Database

Ribosomal Database Project: http://rdp.cme.msu.edu/index.jsp

Santos, S. R. and Ochman, H. Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environmental Microbiology. 2004 Jul; 6(7):754-9.

► Download fusA (translation elongation factor 2 [EF-2])
► Obtain DNA in the FASTA format
► Align by ClustalW in MEGA
► Create a neighbor-joining tree
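The last two steps (alignment, then a neighbor-joining tree) rest on a matrix of pairwise distances between the aligned sequences. The following is a minimal Python sketch of that intermediate step, assuming the sequences are already aligned to equal length. The function names and the toy alignment are illustrative only; this is not MEGA's implementation, which offers its own (often model-corrected) distance measures.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(records):
    """All-against-all p-distances, keyed by (name_a, name_b)."""
    names = list(records)
    return {(x, y): p_distance(records[x], records[y]) for x in names for y in names}


# Hypothetical toy alignment (not real fusA data).
TOY_ALIGNMENT = """>seqA
ATGGCTAAAG
>seqB
ATGGCTAAGG
>seqC
ATGACT-AGG
"""

if __name__ == "__main__":
    matrix = distance_matrix(parse_fasta(TOY_ALIGNMENT))
    for pair, dist in sorted(matrix.items()):
        print(pair, round(dist, 3))
```

A neighbor-joining program takes a matrix like this as input and repeatedly joins the pair of taxa that minimizes total branch length.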

Page 524

Page 13: Introduction to Genomics and the Tree of Life Chapter 13.
Page 14: Introduction to Genomics and the Tree of Life Chapter 13.

European Small Subunit Ribosomal RNA database (http://www.psb.ugent.be/rRNA/ssu/)

Page 15: Introduction to Genomics and the Tree of Life Chapter 13.

[Figure: neighbor-joining tree of ~150 fusA (GTPase) DNA sequences from sequenced bacterial genomes. Scale bar: 0.05. Labeled groups: Rickettsia, Treponema, Mycobacterium, Aquifex aeolicus, Yersinia pestis, Clostridium, Mycoplasma, Bac. anthracis.]

Neighbor-joining tree of ~150 fusA (GTPase) DNA sequences

Page 16: Introduction to Genomics and the Tree of Life Chapter 13.

History of life on earth

4.55 BYA: formation of Earth (violent 100 MY period)
4.4-3.8 BYA: last ocean-evaporating impacts
3.9 BYA: oldest dated rocks
3.8 BYA: sun brightened to 70% of today’s luminosity

Ammonia, methane, or carbon dioxide atmosphere.
Earliest life: RNA, protein

Source: Schopf J.W. (ed.), Life’s Origins (U. Calif. Press, 2002)

Page 521

Page 17: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 1000 to 0 million years ago (MYA), spanning the Proterozoic and Phanerozoic eons. Events marked: deuterostome/protostome and echinoderm/chordate divergences, Cambrian explosion, land plants, insects, end of the Age of Reptiles.]

Page 522

Page 18: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 100 to 0 million years ago (MYA). Events marked: mass extinction; dinosaurs extinct; mammalian radiation; human/chimp divergence.]

Page 522

Page 19: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 10 to 0 million years ago (MYA). Events marked: Homo sapiens/chimp divergence; Australopithecus (Lucy); earliest stone tools; emergence of Homo erectus.]

Page 522

Page 20: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 1,000,000 to 0 years ago. Events marked: Homo erectus emerges in Africa; mitochondrial Eve.]

Page 523

Page 21: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 100,000 to 0 years ago. Events marked: emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear.]

Page 523

Page 22: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 10,000 to 0 years ago. Events marked: “Ice Man” from the Alps; earliest pyramids; Aristotle.]

Page 523

Page 23: Introduction to Genomics and the Tree of Life Chapter 13.

[Timeline figure: 1,000 to 0 years ago. Events marked: algebra; Gutenberg; calculus; Darwin, Mendel.]

Page 523

Page 24: Introduction to Genomics and the Tree of Life Chapter 13.

We will next summarize the major achievements in genome sequencing projects from a chronological perspective.

Chronology of genome sequencing projects

Page 525

Page 25: Introduction to Genomics and the Tree of Life Chapter 13.

1976: first viral genome
Fiers et al. sequence bacteriophage MS2 (3,569 base pairs, accession NC_001417).

1977: Sanger et al. sequence bacteriophage φX174. This virus is 5,386 base pairs (encoding 11 genes). See accessions J02482; NC_001422.

Chronology of genome sequencing projects

Page 527

Page 26: Introduction to Genomics and the Tree of Life Chapter 13.

1981: Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today (10/09), over 1800 mitochondrial genomes sequenced

1986: Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)

Chronology of genome sequencing projects

Page 527

Page 27: Introduction to Genomics and the Tree of Life Chapter 13.

mitochondrion

chloroplast

Lack mitochondria (?)

Page 28: Introduction to Genomics and the Tree of Life Chapter 13.

http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html

Entrez Genomes organelle resource at NCBI

Page 29: Introduction to Genomics and the Tree of Life Chapter 13.

There are >2100 eukaryotic organelles (10/09)

Page 30: Introduction to Genomics and the Tree of Life Chapter 13.

http://megasun.bch.umontreal.ca/gobase/

GOBASE: resource for organelle genomes

Page 31: Introduction to Genomics and the Tree of Life Chapter 13.

http://www-lecb.ncifcrf.gov/mitoDat/

MitoDat: resource for organelle genomes

“This database is dedicated to the nuclear genes specifying the enzymes, structural proteins, and other proteins, many still not identified, involved in mitochondrial biogenesis and function. MitoDat highlights predominantly human nuclear-encoded mitochondrial proteins.”

Not updated recently.

Page 32: Introduction to Genomics and the Tree of Life Chapter 13.

http://www.mitomap.org/

MitoMap: resource for organelle genomes

Page 33: Introduction to Genomics and the Tree of Life Chapter 13.

It is possible to map mutations in human mitochondrial DNA that are responsible for disease

Page 34: Introduction to Genomics and the Tree of Life Chapter 13.

1995: first genome of a free-living organism, the bacterium Haemophilus influenzae

Chronology of genome sequencing projects

Page 530

Page 35: Introduction to Genomics and the Tree of Life Chapter 13.

1996: first eukaryotic genome

The complete genome sequence of the budding yeast Saccharomyces cerevisiae was reported. We will describe this genome soon.

Also in 1996, TIGR reported the sequence of the first archaeal genome, Methanococcus jannaschii.

Chronology of genome sequencing projects

Page 532

Page 36: Introduction to Genomics and the Tree of Life Chapter 13.

1997: More bacteria and archaea
Escherichia coli: 4.6 megabases, 4200 proteins (38% of unknown function)

1998: first multicellular organism
Nematode Caenorhabditis elegans: 97 Mb; 19,000 genes.

1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)

Chronology of genome sequencing projects

Page 532

Page 37: Introduction to Genomics and the Tree of Life Chapter 13.

1999: Human chromosome 22 sequenced

Page 38: Introduction to Genomics and the Tree of Life Chapter 13.

2000: Fruit fly Drosophila melanogaster (13,000 genes)

Plant Arabidopsis thaliana

Human chromosome 21

2001: draft sequence of the human genome (public consortium and Celera Genomics)

Chronology of genome sequencing projects

Page 534

Page 39: Introduction to Genomics and the Tree of Life Chapter 13.
Page 40: Introduction to Genomics and the Tree of Life Chapter 13.

2000

Page 41: Introduction to Genomics and the Tree of Life Chapter 13.
Page 42: Introduction to Genomics and the Tree of Life Chapter 13.
Page 43: Introduction to Genomics and the Tree of Life Chapter 13.
Page 44: Introduction to Genomics and the Tree of Life Chapter 13.

• Selection of genomes for sequencing

• Sequence one individual genome, or several?

• How big are genomes?

• Genome sequencing centers

• Sequencing genomes: strategies

• When has a genome been fully sequenced?

• Repository for genome sequence data

• Genome annotation

Overview of genome analysis

Page 537

Page 45: Introduction to Genomics and the Tree of Life Chapter 13.

Applications of Genome Sequencing

Purpose              Template                  Example
De novo sequencing   Genome sequencing         Sequencing >1000 influenza genomes
                     Ancient DNA               Extinct Neanderthal genome
                     Metagenomics              Human gut
Resequencing         Whole genomes             Individual humans
                     Genomic regions           Assessment of genomic rearrangements or disease-associated regions
                     Somatic mutations         Sequencing mutations in cancer
Transcriptome        Full-length transcripts   Defining regulated messenger RNA transcripts
                     Serial Analysis of Gene Expression (SAGE)
                     Noncoding RNAs            Identifying and quantifying microRNAs in samples
Epigenetics          Methylation changes       Measuring methylation changes in cancer

Table 13.15, p. 538

Page 46: Introduction to Genomics and the Tree of Life Chapter 13.

Fig. 13.8, p. 539

Overview of genome analysis

Page 47: Introduction to Genomics and the Tree of Life Chapter 13.

Criteria include:

• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture

Criteria for selecting genomes for sequencing

Page 538

Page 48: Introduction to Genomics and the Tree of Life Chapter 13.

Criteria include:

• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture

Recent projects: Chicken; Chimpanzee; Cow; Dog; Fungi (many); Honey bee; Sea urchin; Rhesus macaque

Page 540

Criteria for selecting genomes for sequencing

Page 49: Introduction to Genomics and the Tree of Life Chapter 13.

Selection of genomes for sequencing is based on specific criteria.

For an overview, see a series of white papers posted on the National Human Genome Research Institute (NHGRI) website: http://www.genome.gov/10002154

For a description of NHGRI selection criteria, visit: http://www.genome.gov/10001495

Selection criteria

Page 540

Page 50: Introduction to Genomics and the Tree of Life Chapter 13.

Sequence one individual genome, or several?

Try one…

-- Each genome center may study one chromosome from an organism

-- It is necessary to measure polymorphisms (e.g. SNPs) in large populations

For viruses, thousands of isolates may be sequenced.

For the human genome, cost is the impediment.

Page 540

Criteria for selecting genomes for sequencing

Page 51: Introduction to Genomics and the Tree of Life Chapter 13.

How big are genomes?

Viral genomes: 1 kb to 350 kb (Mimivirus: 1181 kb)

Bacterial genomes: 0.5 Mb to 13 Mb

Eukaryotic genomes: 8 Mb to 686 Gb (human: ~3 Gb)

Diversity of genome sizes

Page 540

Page 52: Introduction to Genomics and the Tree of Life Chapter 13.

[Figure: ranges of genome sizes in nucleotide base pairs, on a log scale from 10^4 to 10^11, for viruses, plasmids, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals.]

The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.

The human genome is thought to contain ~30,000-40,000 genes.

http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt

Page 53: Introduction to Genomics and the Tree of Life Chapter 13.

Genus, species          Subgroup       Size (Mb)  #chr  Common name
Macropus eugenii        Mammals        3800       8     tammar wallaby
Oryctolagus cuniculus   Mammals        3500       22    rabbit
Cavia porcellus         Mammals        3400       31    guinea pig
Pan troglodytes         Mammals        3100       24    chimpanzee
Homo sapiens            Mammals        3038       23    human
Bos taurus              Mammals        3000       30    cow
Dasypus novemcinctus    Mammals        3000       32    nine-banded armadillo
Loxodonta africana      Mammals        3000       28    African savanna elephant
Sorex araneus           Mammals        3000             European shrew
Rattus norvegicus       Mammals        2750       21    rat
Canis familiaris        Mammals        2400       39    dog
Zea mays                Land Plants    2365       10    corn
Aplysia californica     Other Animals  1800       17    California sea hare
Danio rerio             Fishes         1700       25    zebrafish
Gallus gallus           Birds          1200       40    chicken
Triphysaria versicolor  Land Plants    1200             plant parasite

16 eukaryotic genome projects > 1000 megabases

Page 54: Introduction to Genomics and the Tree of Life Chapter 13.

Ancient DNA projects

Special challenges:

• Ancient DNA is degraded by nucleases
• The majority of DNA in samples derives from unrelated organisms such as bacteria that invaded after death
• The majority of DNA in samples is contaminated by human DNA
• Determination of authenticity requires special controls, and analysis of multiple independent extracts

Page 542

Page 55: Introduction to Genomics and the Tree of Life Chapter 13.

Metagenomics projects

Two broad areas:

• Environmental (ecological) e.g. hot spring, ocean, sludge, soil

• Organismal e.g. human gut, feces, lung

Page 543

Page 56: Introduction to Genomics and the Tree of Life Chapter 13.

Outline of today’s lecture

Introduction: 5 perspectives, history of life: time lines

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 57: Introduction to Genomics and the Tree of Life Chapter 13.

Page 58: Introduction to Genomics and the Tree of Life Chapter 13.

Twenty genome sequencing centers contributed to the public sequencing of the human genome.

Many of these are listed at the Entrez genomes site. (Or see Table 19.3, page 803.)

Overview of genome analysis

Page 548

Page 59: Introduction to Genomics and the Tree of Life Chapter 13.

Whole genome shotgun sequencing (Celera)

Hierarchical shotgun sequencing (public consortium)

Two approaches to genome sequencing

Page 60: Introduction to Genomics and the Tree of Life Chapter 13.

Whole Genome Shotgun (from the NCBI website)

An approach used to decode an organism's genome by shredding it into smaller fragments of DNA which can be sequenced individually. The sequences of these fragments are then ordered, based on overlaps in the genetic code, and finally reassembled into the complete sequence. The 'whole genome shotgun' (WGS) method is applied to the entire genome all at once, while the 'hierarchical shotgun' method is applied to large, overlapping DNA fragments of known location in the genome.
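The idea of ordering fragments by their overlaps and reassembling them can be illustrated with a toy greedy assembler. This is a minimal Python sketch under strong simplifying assumptions (error-free reads, a single strand, no repeats); the function names and the example fragments are hypothetical, and production assemblers use considerably more elaborate methods.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(fragments, min_overlap=3):
    """Repeatedly merge the pair of fragments sharing the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no overlaps left: the result stays as several contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Hypothetical fragments drawn from the toy sequence ATGCGTACGTTAGC.
    reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
    print(greedy_assemble(reads))
```

Greedy merging of this kind can misassemble repetitive sequence, which is one reason knowing the chromosomal location of large fragments is useful.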

Page 548

Two approaches to genome sequencing

Page 61: Introduction to Genomics and the Tree of Life Chapter 13.

Human genome project: strategies

Whole genome shotgun sequencing (Celera)

-- given the computational capacity, this approach is far faster than hierarchical shotgun sequencing
-- the approach was validated using Drosophila

Page 62: Introduction to Genomics and the Tree of Life Chapter 13.

Hierarchical shotgun method
Assemble contigs from various chromosomes, then sequence and assemble them. A contig is a set of overlapping clones or sequences from which a sequence can be obtained. The sequence may be draft or finished.

A contig is thus a chromosome map showing the locations of those regions of a chromosome where contiguous DNA segments overlap. Contig maps are important because they provide the ability to study a complete, and often large segment of the genome by examining a series of overlapping clones which then provide an unbroken succession of information about that region.

Two approaches to genome sequencing

Page 548

Page 63: Introduction to Genomics and the Tree of Life Chapter 13.

Hierarchical shotgun sequencing (public consortium)

-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers

Two approaches to genome sequencing

Page 64: Introduction to Genomics and the Tree of Life Chapter 13.

Source: IHGSC (2001)

Page 65: Introduction to Genomics and the Tree of Life Chapter 13.

Fig. 19.8, Page 804
Source: IHGSC (2001)

Sequenced-clone contigs are merged to form scaffolds of known order and orientation

Page 66: Introduction to Genomics and the Tree of Life Chapter 13.

A typical goal is to obtain five- to ten-fold coverage.

Finished sequence: a clone insert is contiguously sequenced to a high quality standard (error rate of 0.01%). There are usually no gaps in the sequence.

Draft sequence: clone sequences may contain several regions separated by gaps. The true order and orientation of the pieces may not be known.

When has a genome been fully sequenced?

Page 549

Page 67: Introduction to Genomics and the Tree of Life Chapter 13.

When has a genome been fully sequenced?

Fold coverage   % sequenced
0.25            22
0.5             39
0.75            53
1               63
2               87.5
3               95
4               98.2
5               99.4
6               99.75
7               99.91
8               99.97
9               99.99
10              99.995
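These percentages closely follow the Poisson (Lander-Waterman) expectation, in which the fraction of bases covered by at least one read at c-fold coverage is 1 - e^(-c). The Python sketch below computes that expectation under the idealized assumption that reads land uniformly at random; the function name is illustrative.

```python
import math


def fraction_sequenced(fold_coverage):
    """Expected fraction of the genome covered at least once (Poisson model)."""
    if fold_coverage < 0:
        raise ValueError("fold coverage must be non-negative")
    return 1.0 - math.exp(-fold_coverage)


if __name__ == "__main__":
    for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 10):
        print(f"{c:>5}x  {100 * fraction_sequenced(c):.3f}%")
```

Under this model, 2-fold coverage gives about 86.5%, slightly below the 87.5 listed; the other entries agree to within roughly 0.1 percentage point.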

When has a genome been fully sequenced?

Page 551

Page 68: Introduction to Genomics and the Tree of Life Chapter 13.

Raw data from many genome sequencing projects are stored at the trace archive at NCBI or EBI

(main NCBI page, bottom right).

Also visit: http://trace.ensembl.org/

As of October 2008, the Trace Archive had ~2b traces.

As of October 2009 it has ~2,108,000,000 traces.

Trace repository for genome sequence data

Page 552

Page 69: Introduction to Genomics and the Tree of Life Chapter 13.

Fig. 13.12, Page 553

Page 70: Introduction to Genomics and the Tree of Life Chapter 13.

http://www.jgi.doe.gov/education/

http://www.youtube.com/watch?v=RLsb0pMx_oU&feature=channel_page

A Howard Hughes Medical Institute (HHMI) video production describing the Whole Genome Shotgun Sequencing process at the JGI. This video is viewable on YouTube in three parts: Part 1 (chapters 1-5), Part 2 (chapters 6-8), Part 3 (chapters 9-14).

Page 71: Introduction to Genomics and the Tree of Life Chapter 13.

Role of comparative genomics

Phylogenetic footprinting

Phylogenetic shadowing

Population shadowing

Page 552

Page 72: Introduction to Genomics and the Tree of Life Chapter 13.

Fig. 13.13, Page 554

Page 73: Introduction to Genomics and the Tree of Life Chapter 13.

Outline of today’s lecture

Introduction: 5 perspectives, history of life: time lines

Genome-sequencing projects: chronology

Genome analysis: criteria, resequencing, metagenomics

DNA sequencing technologies: Sanger, 454, Solexa

Process of genome sequencing: centers, repositories

Genome annotation: features, prokaryotes, eukaryotes

Page 74: Introduction to Genomics and the Tree of Life Chapter 13.

Fig. 13.14, Page 555

Page 75: Introduction to Genomics and the Tree of Life Chapter 13.

Information content in genomic DNA includes:

-- nucleotide composition (GC content)

-- repetitive DNA elements

-- protein-coding genes, other genes
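Nucleotide composition is the simplest of these features to compute. The following Python sketch calculates overall GC content and GC content in sliding windows, the kind of profile used to look for regional variation such as isochores. The function names, window size, and example sequence are illustrative assumptions, not part of any particular annotation pipeline.

```python
def gc_content(sequence):
    """Percent G+C among unambiguous bases (A, C, G, T); other symbols are ignored."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / total


def windowed_gc(sequence, window, step):
    """GC content in sliding windows; returns a list of (start, percent) pairs."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    profile = []
    for start in range(0, len(sequence) - window + 1, step):
        profile.append((start, gc_content(sequence[start:start + window])))
    return profile


if __name__ == "__main__":
    toy = "ATATATATGCGCGCGC"  # hypothetical sequence: AT-rich half, GC-rich half
    print(gc_content(toy))
    print(windowed_gc(toy, window=8, step=8))
```

Note that a window consisting only of ambiguous bases (for example a run of N) raises an error in this sketch; a real analysis would decide how to treat such windows.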

Genome annotation

Page 555

Page 76: Introduction to Genomics and the Tree of Life Chapter 13.

[Figure: histograms of the number of species in each GC class against GC content (%, axis from 20 to 80), shown separately for vertebrates, invertebrates, plants, and bacteria.]

GC content varies across genomes

Fig. 13.15, Page 556

Page 77: Introduction to Genomics and the Tree of Life Chapter 13.

Gene prediction tools
• http://bioinformatics.ca/links_directory/?subcategory_id=39
• http://www.geneprediction.org/

Common tools

GenScan: http://genes.mit.edu/GENSCAN.html

HMMgene: http://www.cbs.dtu.dk/services/HMMgene/

Microbial: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

Fungal: http://www.cbcb.umd.edu/software/GlimmerHMM/