Goals of the Human Genome Project (1990 ~) Map and sequence the 3,000 Mb human genome

Genomics Databases and Bioinformatics Applications

Wailap Victor Ng

Institute of Biotechnology in Medicine

Institute of Bioinformatics

Dept Biotechnology and Lab Science in Medicine

National Yang Ming University

[email protected] 22, 2005

Goals of the Human Genome Project (1990 ~)

• Map and sequence the 3,000 Mb human genome

• Map and sequence the genomes of model organisms

- The bacterium E. coli (4.6 Mb)

- The yeast S. cerevisiae (12 Mb)

- The roundworm C. elegans (100 Mb)

- The fruit fly D. melanogaster (180 Mb)

- The mouse M. musculus (3,000 Mb)

• Collect and distribute data

• Study the ethical, legal, and social implications of genetic research

• Train researchers

• Develop technologies

• Transfer technology to the private sector

http://www.genome.gov/Pages/EducationKit/online.htm


What is the total number of human genes? (Science, 2000)

Milestones of Genome Projects

1995 Haemophilus influenzae (1.83 Mb; 1,742 genes)
1996 Saccharomyces cerevisiae (12 Mb; 6,000 genes)
1998 Caenorhabditis elegans (97 Mb; 19,000 genes)
2000 Arabidopsis thaliana (115/125 Mb; 25,000 genes)
2000 Drosophila melanogaster (~120 Mb; 13,600 genes)
2001 Homo sapiens (90%; 2,900 Mb; ~30k genes)
2002 Mus musculus (96%; 2,500 Mb; ~30k genes)
2002 Oryza sativa L. ssp. indica (92%; 466 Mb; 46-56k genes)
2002 Fugu rubripes (95%; 365 Mb; 33,609 genes)
2004 H. sapiens (99% euchromatin; 2,850 Mb; 20,000-25,000 genes)

Homo sapiens

Number of cells: ~1 x 10^14

Number of genes encoded by the genome: 20,000 – 25,000

Number of Expressed genes per cell type: 10,000-15,000

Genome (25-30K genes, human) → Transcriptome (mRNAs) → Proteome (Proteins)

Complexity: alternative splicing; post-translational modifications

Genome Sequencing Strategies

• Top-down approach

- Clone large genomic DNA fragments into a special vector, e.g. BAC (bacterial artificial chromosome)

- Create an ordered array of BAC clones

- Carry out full-length BAC clone sequencing

- Assemble the BAC insert sequences

- Identify the next BAC for full length sequencing (Hybridization method or searching BAC end sequence library)

• Bottom-up approach

- Whole genome shotgun sequencing

Top-down genome sequencing method

Method I. Systematic sequencing of ordered clones

- Construct shotgun genomic library in YAC (yeast artificial chromosome) or BAC vector

- Use the YAC or BAC clone DNAs to construct smaller insert shotgun cosmid DNA library (~45 kb inserts)

- Multiple Complete Digest (MCD) mapping of cosmid DNAs → ordered cosmid clone library

- Choose the minimal overlap set of cosmid DNA to construct shotgun libraries in M13 or plasmid vector → DNA sequencing → Assembly

Genomic DNA → Large insert library in BAC/YAC → Medium insert library in cosmid → Ordered cosmid library → Small insert library in plasmid


Flow chart of wet bench procedures for YAC → cosmid and BAC → cosmid MCD mapping. The main difference is that, while BAC DNA can readily be purified from bacterial chromosomal DNA, there is no good preparative method to separate YAC DNA from yeast chromosomal DNA. In the YAC case, the few percent of the cosmids that are derived from the YAC are identified by a hybridization-based colony-screening protocol. With BAC-derived cosmids, this step is unnecessary because the mapping software can readily eliminate the small number of cosmids that do not originate from the BAC.

Proc Natl Acad Sci U S A. 94: 5225 (1997)

(YAC DNA)

Multiple Complete Digest Mapping


Schematic representation of MCD mapping process.

(a) Gel image.

(b) List of fragment sizes for each enzyme domain in each clone. Lanes labeled with a number identify the clone as c01 or c02. Lanes labeled with the letter M identify size markers.

(c) Three single-enzyme maps are independently constructed (Right). Synchronization across enzyme domains results in a composite map (Left). Long tick marks indicate boundaries between ordered groups of fragments; short tick marks demarcate unordered fragments within a group, arbitrarily drawn in order of decreasing size.


Gray scale image of a typical mapping gel post-stained with SYBR Green I. There are five marker lanes, at positions 1, 8, 15, 22, and 29. Two clones, each independently digested with EcoRI, HindIII, and NsiI (and loaded in that order), are placed between every pair of marker lanes.



Representative MCD map from chromosome 7


Top-down genome sequencing method

Method II. BAC by BAC sequencing

• Choose BAC clone seeds

• Construct BAC shotgun library in plasmid vector

• Sequence the shotgun plasmid DNAs

• Assemble the shotgun reads

• Look for adjacent BAC clones for sequencing

- By colony array hybridization or

- BAC end sequence library

Genomic DNA → Large insert library in BAC → Small insert shotgun library in plasmid → DNA sequencing and assembly → Identify neighboring BAC clones for sequencing

BAC colony array hybridization assay

Genomic DNA

Large insert library in BAC

Small insert shotgun library in plasmid vector

E. coli transformants

Array E. coli on nylon membrane and grow cells on agar plate (25 x 25 cm²)

Lyse E. coli colonies on nylon membrane

Fix the DNA onto nylon membrane

Hybridize with PCR amplified BAC end probes

Autoradiogram

BAC clone genomic DNA insert (sequenced)

PCR-1 PCR-2

BAC colony array hybridization

Restriction fingerprinting


http://www2.carthage.edu/~pfaffle/hgp/Ventor1.gif

How many reads are needed to determine a genome sequence?

Usually ~8X coverage of each base pair

# reads = ( 8 x genome_size ) / (av._read_length)

e.g. Haloarcula marismortui (4,274,315 bp)

# reads = (8 x 4,274,315 bp) / (550 bp) = 62,172 sequencing reactions
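
The read-count estimate above can be reproduced with a short calculation. The following is a minimal sketch in Python; the function name `reads_needed` and its parameter names are illustrative assumptions, not part of the original slides. It rounds up to a whole number of reads.

```python
import math

def reads_needed(genome_size_bp, avg_read_length_bp, coverage=8):
    """Estimate the number of sequencing reads for a target fold coverage.

    Implements: # reads = (coverage x genome_size) / average_read_length,
    rounded up to a whole read. All names here are illustrative.
    """
    if genome_size_bp <= 0 or avg_read_length_bp <= 0 or coverage <= 0:
        raise ValueError("all inputs must be positive")
    return math.ceil(coverage * genome_size_bp / avg_read_length_bp)

# Haloarcula marismortui example from the slide: 4,274,315 bp, 550 bp reads
print(reads_needed(4_274_315, 550))  # 62172
```

Changing `coverage` or the average read length shows how strongly the number of sequencing reactions depends on both.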

Principle of Sanger Dideoxy DNA sequencing

http://genetics.nbii.gov/basic2.html

USB

Simple one step fluorescent dye-terminator DNA cycle sequencing

-ddCTP, -ddATP, -ddGTP, -ddTTP

DNA, Primer, Taq DNA Pol

Reaction buffer

Thermocycling

2-propanol precipitation

DNA analyzer

Applied Biosystems Capillary DNA Sequencer

ABI 3730 xl

GATCAGGGTTACATGCTACGGCTTCACACGTCGACCCATATTAC...................

Electropherogram (chromatogram)


> vtrace HM023_0188.y1_096.ab1


phred

Function – base calling and quality assignment

chromat files (input) → phd files (output)

Example of phd file

q value: numbers in middle column

q = -10 log10(P)

q, quality value

P, estimated error rate

q20: 1 error in 100 bases (P = 0.01)

q40: 1 error in 10,000 bases (P = 0.0001)
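
The relation between the quality value q and the estimated error rate P (q = -10 log10 P) can be checked numerically. This is a minimal sketch in Python; the function names `error_probability` and `quality_value` are illustrative assumptions and are not part of phred itself.

```python
import math

def error_probability(q):
    """Convert a quality value q into an estimated error rate P = 10^(-q/10)."""
    return 10 ** (-q / 10)

def quality_value(p):
    """Convert an estimated error rate P into a quality value q = -10 log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("P must be in the interval (0, 1]")
    return -10 * math.log10(p)

# Print the two reference points from the slide (q20 and q40)
for q in (20, 40):
    p = error_probability(q)
    print(f"q{q}: P = {p:g}, about 1 error in {round(1 / p):,} bases")
```

Because the scale is logarithmic, every additional 10 quality units corresponds to a tenfold lower estimated error rate.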

Sequence Assembly Software

phredPhrap (Phil Green)
cap3 (Xiaoqiu Huang)
TIGR Assembler (TIGR)
ATLAS (BCM)
SPS phrap (Geospiza)
Genome Assembler (Paracel)
Celera Assembler (Celera)
BGI Assembler (BGI)

Basic Functional Genomic Analysis

• Gene Prediction (P: Prokaryotes; E: Eukaryotes)

- Glimmer (P)

- GeneMark (PE)

- Genscan (E)

- X-grail (E)

- Fgenes (E)

- est2genome (E; EST driven prediction)

* others (http://www.cs.jhu.edu/~salzberg/appendixa.html#Gene_finders)

• Gene Functional Analysis

- BLAST searches

- Motif analysis

- Structure prediction and homology searches


Sources of genomics databases and bioinformatics applications

• Public Data Banks - NCBI, EMBL-EBI, and DDBJ

• Genome Centers

- DOE Joint Genome Institute

- Baylor College of Med. Human Genome Sequencing Center

- The Wellcome Trust Sanger Institute

- Washington Univ. School of Med. Genome Sequencing Center

- Whitehead Institute/MIT Center for Genome Research

- Others (www.ornl.gov/sci/techresources/Human_Genome/research/centers.shtml)


NCBI Genome Resources

[Screenshots: Human Genome Resources; NCBI Map Viewer; Full-length cDNA]

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj

http://www.ncbi.nlm.nih.gov/

Animal genome projects

Completed (3)

Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster

Draft assembly (17)

Canis familiaris (dog), Bos taurus (cattle), Gallus gallus (chicken), Monodelphis domestica, Mus musculus (mouse), Rattus norvegicus (rat), Aedes aegypti (mosquito), Anopheles gambiae (mosquito), Apis mellifera (honey bee), Bombyx mori (silk worm), Caenorhabditis briggsae, Caenorhabditis remanei, Ciona intestinalis, Ciona savignyi, Drosophila pseudoobscura, Drosophila yakuba, Takifugu rubripes, Tetraodon nigroviridis

In progress (139)



Plant genome projects

Completed (2)

Arabidopsis thaliana, Oryza sativa (rice)

Draft assembly (0)

None listed

In progress (52)


Fungus genome projects

Completed (9)

Candida glabrata, Filobasidiella neoformans, Debaryomyces hansenii, Encephalitozoon cuniculi, @Eremothecium gossypii, #Kluyveromyces lactis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Yarrowia lipolytica

Draft assembly (18)

Aspergillus terreus (lovastatin), Candida albicans, Coccidioides immitis, Coprinopsis cinerea, Emericella nidulans, Gibberella zeae, Kluyveromyces waltii, Magnaporthe grisea, Naumovia castellii, Neurospora crassa, Phanerochaete chrysosporium, Pichia angusta, Saccharomyces bayanus, Saccharomyces kluyveri, Saccharomyces kudriavzevii, Saccharomyces mikatae, Saccharomyces paradoxus, Ustilago maydis

In progress (27)


Protist genome projects

Completed (3)

Cyanidioschyzon merolae strain 10D (unicellular red alga), Entamoeba histolytica (amoeba), Leishmania major

Draft assembly (8)

Cryptosporidium hominis, Cryptosporidium parvum, Dictyostelium discoideum, Giardia intestinalis, Plasmodium berghei, Plasmodium chabaudi, Plasmodium yoelii, Thalassiosira pseudonana

In progress (38)


Microbial genome projects

                                  November 2003   November 2004   March 2005
Archaea (completed)               16              20              21
Bacteria (completed)              127             177             204
Total completed                   143             197             225
Total number of listed projects   -               -               562



Viral genome projects


Genome Browsers

Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)

Ensembl Genome Browser (http://www.ensembl.org/)

[Vega Genome Browser (http://vega.sanger.ac.uk/Homo_sapiens/)]

Generic Genome Browser (http://www.gmod.org/index.php)

VISTA Genome Browser (http://pipeline.lbl.gov/cgi-bin/gateway2)

[Screenshots: Genome Browser, including the Psa gene]

Gene Sorter

Gene Sorter - This program displays a sorted table of genes that are related to one another. The relationship can be one of several types, including protein-level homology, similarity of gene expression profiles, or genomic proximity.

psa


[Screenshots: Proteome Browser; PSA and Brca1 examples]


Ensembl Genome Browser


A workbench for analysis of large-scale genomic sequence data, with strong emphasis on the production of enriched graphical representation of the analysed data. The GESTALT Workbench (GEnomic Sequence Total Analysis and Lookup Tool) can execute a variety of external analysis programmes (e.g. for gene recognition) as well as internal analyses (e.g. compositional complexity analysis); the resulting analysis output files are stored in an internal database. Integrating the analysis results, a Gestalt* is created for each sequence.

http://bioinformatics.weizmann.ac.il/GESTALT

http://bioinformatics.weizmann.ac.il/GESTALT/

Example of a large-scale GESTALT map of the Familial Mediterranean Fever region on human chromosome 16 (GenBank locus HSAJ03147).
