Goals of the Human Genome Project (1990 ~) Map and sequence the 3,000 Mb human genome
Genomics Databases and Bioinformatics Applications
Wailap Victor Ng
Institute of Biotechnology in Medicine
Institute of Bioinformatics
Dept Biotechnology and Lab Science in Medicine
National Yang Ming University
[email protected] 22, 2005
Goals of the Human Genome Project (1990 ~)
• Map and sequence the 3,000 Mb human genome
• Map and sequence the genomes of model organisms
- The bacterium E. coli (4.6 Mb)
- The yeast S. cerevisiae (12 Mb)
- The roundworm C. elegans (100 Mb)
- The fruit fly D. melanogaster (180 Mb)
- The mouse M. musculus (3,000 Mb)
• Collect and distribute data
• Study the ethical, legal, and social implications of genetic research
• Train researchers
• Develop technologies
• Transfer technology to the private sector
http://www.genome.gov/Pages/EducationKit/online.htm
What is the total number of human genes? (Science, 2000)
Milestones of Genome Projects
1995 Haemophilus influenzae (1.83 Mb; 1,742 genes)
1996 Saccharomyces cerevisiae (12 Mb; 6,000 genes)
1998 Caenorhabditis elegans (97 Mb; 19,000 genes)
2000 Arabidopsis thaliana (115/125 Mb; 25,000 genes)
2000 Drosophila melanogaster (~120 Mb; 13,600 genes)
2001 Homo sapiens (90%; 2,900 Mb; ~30K genes)
2002 Mus musculus (96%; 2,500 Mb; ~30K genes)
2002 Oryza sativa L. ssp. indica (92%; 466 Mb; 46-56K genes)
2002 Fugu rubripes (95%; 365 Mb; 33,609 genes)
2004 H. sapiens (99% euchromatin; 2,850 Mb; 20,000-25,000 genes)
Homo sapiens
Number of cells: ~1 x 10^14
Number of genes encoded by the genome: 20,000 – 25,000
Number of Expressed genes per cell type: 10,000-15,000
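Genome → Transcriptome (mRNAs) → Proteome (Proteins)
- Genome: 25-30K genes (Human)
- Alternative splicing diversifies the transcriptome
- Post-translational modifications diversify the proteome
- Complexity increases from genome to proteome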
Genome Sequencing Strategies
• Top-down approach
- Clone large genomic DNA fragments into a special vector, e.g. BAC (bacterial artificial chromosome)
- Create an ordered array of BAC clones
- Carry out full-length BAC clone sequencing
- Assemble the BAC insert sequences
- Identify the next BAC for full length sequencing (Hybridization method or searching BAC end sequence library)
• Bottom-up approach
- Whole genome shotgun sequencing
Top-down genome sequencing method
Method I. Systematic sequencing of ordered clones
- Construct shotgun genomic library in YAC (yeast artificial chromosome) or BAC vector
- Use the YAC or BAC clone DNAs to construct a smaller-insert shotgun cosmid DNA library (~45 kb inserts)
- Multiple Complete Digest (MCD) mapping of cosmid DNAs → ordered cosmid clone library
- Choose the minimal overlap set of cosmid DNAs to construct shotgun libraries in M13 or plasmid vector → DNA sequencing → Assembly
Genomic DNA → Large insert library in BAC/YAC → Medium insert library in cosmid → Ordered cosmid library → Small insert library in plasmid
Multiple Complete Digest Mapping
Flow chart of wet bench procedures for YAC → cosmid and BAC → cosmid MCD mapping. The main difference is that, while BAC DNA can readily be purified from bacterial chromosomal DNA, there is no good preparative method to separate YAC DNA from yeast chromosomal DNA. In the YAC case, the few percent of the cosmids that are derived from the YAC are identified by a hybridization-based colony-screening protocol. With BAC-derived cosmids, this step is unnecessary because the mapping software can readily eliminate the small number of cosmids that do not originate from the BAC.
Proc Natl Acad Sci U S A. 94: 5225 (1997)
Schematic representation of MCD mapping process.
(a) Gel image.
(b) List of fragment sizes for each enzyme domain in each clone. Lanes labeled with a number identify the clone as c01 or c02. Lanes labeled with the letter M identify size markers.
(c) Three single-enzyme maps are independently constructed (Right). Synchronization across enzyme domains results in a composite map (Left). Long tick marks indicate boundaries between ordered groups of fragments; short tick marks demarcate unordered fragments within a group, arbitrarily drawn in order of decreasing size.
Proc Natl Acad Sci U S A. 94: 5225 (1997)
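As a rough illustration of how the per-enzyme fragment lists in panel (b) can suggest that two clones overlap, the sketch below counts fragments of matching size between two clones, enzyme by enzyme. It is a toy comparison, not the mapping software described in the cited paper; the fragment sizes, the 2% size tolerance, and the function name `count_shared_fragments` are all hypothetical.

```python
def count_shared_fragments(sizes_a, sizes_b, tolerance=0.02):
    """Count fragments whose sizes agree within a relative tolerance.

    Both lists are sorted and walked with two pointers, so each fragment
    is paired at most once. The tolerance stands in for gel sizing error
    and is an assumed value.
    """
    a = sorted(sizes_a)
    b = sorted(sizes_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance * max(a[i], b[j]):
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


# Hypothetical fragment sizes (bp) for two clones, one list per enzyme domain.
clone_c01 = {
    "EcoRI": [1200, 3400, 5600, 8100],
    "HindIII": [900, 2500, 7000],
    "NsiI": [1500, 4100, 6200],
}
clone_c02 = {
    "EcoRI": [3410, 5590, 9800],
    "HindIII": [2480, 7050, 11000],
    "NsiI": [4100, 6150, 12500],
}

shared_by_enzyme = {
    enzyme: count_shared_fragments(clone_c01[enzyme], clone_c02[enzyme])
    for enzyme in clone_c01
}
print(shared_by_enzyme)
```

Clones that share fragments in all three enzyme domains are plausible neighbors; building the ordered map itself requires considerably more than this pairwise count.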
Grayscale image of a typical mapping gel poststained with SYBR Green I. There are five marker lanes, at positions 1, 8, 15, 22, and 29. Two clones, each independently digested with EcoRI, HindIII, and NsiI (and loaded in that order), are placed between every pair of marker lanes.
Proc Natl Acad Sci U S A. 94: 5225 (1997)
Representative MCD map from chromosome 7
Proc Natl Acad Sci U S A. 94: 5225 (1997)
Top-down genome sequencing method
Method II. BAC by BAC sequencing
• Choose BAC clone seeds
• Construct BAC shotgun library in plasmid vector
• Sequence the shotgun plasmid DNAs
• Assemble the shotgun reads
• Look for adjacent BAC clones for sequencing
- By colony array hybridization or
- BAC end sequence library
Genomic DNA → Large insert library in BAC → Small insert shotgun library in plasmid → DNA sequencing and assembly → Identify neighboring BAC clones for sequencing
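The "BAC end sequence library" route for finding a neighboring clone can be sketched as a lookup: take the assembled insert of the BAC just sequenced and check which clones in the library have an end sequence that occurs in it, on either strand. This is a minimal sketch under simplifying assumptions: it uses exact substring matching where a real search would use an alignment tool, and the sequences, clone names, and function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_overlapping_clones(sequenced_insert, bac_ends):
    """List (clone, strand, position) for end sequences found in the insert.

    bac_ends maps a clone name to one end sequence. Each end is searched
    as given ("+") and as its reverse complement ("-").
    """
    hits = []
    for clone, end_seq in bac_ends.items():
        for strand, query in (("+", end_seq), ("-", reverse_complement(end_seq))):
            position = sequenced_insert.find(query)
            if position != -1:
                hits.append((clone, strand, position))
    return hits


# Hypothetical data: a short "assembled insert" and three BAC end sequences.
sequenced_insert = "ATGCGTACGTTAGCCGATAGGCTTAACGGATCCA"
bac_ends = {
    "bac_A": "GCCGATAG",  # present on the forward strand
    "bac_B": "TCCGTTAA",  # present as its reverse complement
    "bac_C": "TTTTTTTT",  # not present
}

hits = find_overlapping_clones(sequenced_insert, bac_ends)
print(hits)
```

A clone whose end lands near the edge of the sequenced insert extends furthest into unsequenced territory, which makes it a natural candidate for the next round.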
BAC colony array hybridization assay
Genomic DNA → Large insert library in BAC → Small insert shotgun library in plasmid vector
- E. coli transformants
- Array E. coli on nylon membrane and grow cells on agar plate (25 x 25 cm²)
- Lyse E. coli colonies on nylon membrane
- Fix the DNA onto nylon membrane
- Hybridize with PCR-amplified BAC end probes
- Autoradiogram
BAC clone genomic DNA insert (sequenced): PCR-1, PCR-2
BAC colony array hybridization
Restriction fingerprinting
How many reads are needed to determine a genome sequence?
Usually ~8X coverage of each base pair
# reads = (8 x genome_size) / (av._read_length)
e.g. Haloarcula marismortui (4,274,315 bp)
# reads = (8 x 4,274,315 bp) / (550 bp) ≈ 62,172 sequencing reactions
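The read-count estimate above can be reproduced in a few lines. This is a minimal sketch of the slide's formula; the function name `reads_needed` and the choice to round up to a whole read are assumptions made for illustration.

```python
import math


def reads_needed(genome_size_bp, avg_read_length_bp, coverage=8.0):
    """Number of reads for a target fold coverage, rounded up.

    Implements: reads = coverage x genome_size / average_read_length.
    """
    if genome_size_bp <= 0 or avg_read_length_bp <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(coverage * genome_size_bp / avg_read_length_bp)


# Haloarcula marismortui example from the slide: 4,274,315 bp, 550 bp reads, 8X.
print(reads_needed(4_274_315, 550))  # 62172
```

Changing `coverage` or `avg_read_length_bp` shows how strongly the number of sequencing reactions depends on read length: halving the read length doubles the reads required.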
Principle of Sanger Dideoxy DNA sequencing
http://genetics.nbii.gov/basic2.html
USB
Simple one step fluorescent dye-terminator DNA cycle sequencing
Reaction mix: DNA, Primer, Taq DNA Pol, Reaction buffer, dye-labeled ddCTP, ddATP, ddGTP, ddTTP
→ Thermocycling → 2-propanol precipitation → DNA analyzer
Applied Biosystems Capillary DNA Sequencer
ABI 3730 xl
GATCAGGGTTACATGCTACGGCTTCACACGTCGACCCATATTAC...................
Electropherogram (chromatogram)
> vtrace HM023_0188.y1_096.ab1
phred
Function – base calling and quality assignment
chromat files (input) → phd files (output)
Example of phd file
q value: numbers in middle column
q = -10 log10(P)
q, quality value
P, estimated error rate
q20: 1 error in 100 bases (P = 0.01)
q40: 1 error in 10,000 bases (P = 0.0001)
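The relation between quality value and error rate is easy to check numerically. The sketch below converts in both directions and sums per-base error probabilities to estimate the expected number of miscalled bases in a read; the function names and the example quality list are hypothetical and not part of phred itself.

```python
import math


def error_probability(q):
    """Estimated error rate P for a quality value q, from q = -10 log10(P)."""
    return 10 ** (-q / 10)


def quality_value(p):
    """Quality value q for an estimated error rate P (0 < P <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("P must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(qualities):
    """Expected number of miscalled bases: the sum of per-base error rates."""
    return sum(error_probability(q) for q in qualities)


print(error_probability(20))          # about 0.01, i.e. 1 error in 100 bases
print(quality_value(0.0001))          # about 40
print(expected_errors([20, 20, 40]))  # about 0.0201
```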
Sequence Assembly Software
- phredPhrap (Phil Green)
- cap3 (Xiaoqiu Huang)
- TIGR Assembler (TIGR)
- ATLAS (BCM)
- SPS phrap (Geospiza)
- Genome Assembler (Paracel)
- Celera Assembler (Celera)
- BGI Assembler (BGI)
Basic Functional Genomic Analysis
• Gene Prediction (P: Prokaryotes; E: Eukaryotes)
- Glimmer (P)
- GeneMark (PE)
- Genscan (E)
- X-grail (E)
- Fgenes (E)
- est2genome (E; EST driven prediction)
* others (http://www.cs.jhu.edu/~salzberg/appendixa.html#Gene_finders)
• Gene Functional Analysis
- Blast searches
- Motif analysis
- Structure prediction and homology searches
Sources of genomics databases and bioinformatics applications
• Public Data Banks - NCBI, EMBL-EBI, and DDBJ
• Genome Centers
- DOE Joint Genome Institute
- Baylor College of Med. Human Genome Sequencing Center
- The Wellcome Trust Sanger Institute
- Washington Univ. School of Med. Genome Sequencing Center
- Whitehead Institute/MIT Center for Genome Research
- Others (www.ornl.gov/sci/techresources/Human_Genome/research/centers.shtml)
NCBI Genome Resources
Human Genome Resources (screenshot slides)
- NCBI Map Viewer
- Full-length cDNA
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=genomeprj
Animal genome projects
Completed (3)
Homo sapiens, Caenorhabditis elegans, Drosophila melanogaster
Draft assembly (17)
Canis familiaris (dog), Bos taurus (cattle), Gallus gallus (chicken), Monodelphis domestica, Mus musculus (mouse), Rattus norvegicus (rat), Aedes aegypti (mosquito), Anopheles gambiae (mosquito), Apis mellifera (honey bee), Bombyx mori (silk worm), Caenorhabditis briggsae, Caenorhabditis remanei, Ciona intestinalis, Ciona savignyi, Drosophila pseudoobscura, Drosophila yakuba, Takifugu rubripes, Tetraodon nigroviridis
In progress (139)
Plant genome projects
Completed (2)
Arabidopsis thaliana, Oryza sativa (rice)
Draft assembly (0)
None listed
In progress (52)
Fungus genome projects
Completed (9)
Candida glabrata, Filobasidiella neoformans, Debaryomyces hansenii, Encephalitozoon cuniculi @, Eremothecium gossypii #, Kluyveromyces lactis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Yarrowia lipolytica
Draft assembly (18)
Aspergillus terreus (lovastatin), Candida albicans, Coccidioides immitis, Coprinopsis cinerea, Emericella nidulans, Gibberella zeae, Kluyveromyces waltii, Magnaporthe grisea, Naumovia castellii, Neurospora crassa, Phanerochaete chrysosporium, Pichia angusta, Saccharomyces bayanus, Saccharomyces kluyveri, Saccharomyces kudriavzevii, Saccharomyces mikatae, Saccharomyces paradoxus, Ustilago maydis
In progress (27)
Protist genome projects
Completed (3)
Cyanidioschyzon merolae strain 10D (unicellular red alga), Entamoeba histolytica (amoeba), Leishmania major
Draft assembly (8)
Cryptosporidium hominis, Cryptosporidium parvum, Dictyostelium discoideum, Giardia intestinalis, Plasmodium berghei, Plasmodium chabaudi, Plasmodium yoelii, Thalassiosira pseudonana
In progress (38)
Microbial genome projects

                                  November 2003   November 2004   March 2005
Archaea (completed)               16              20              21
Bacteria (completed)              127             177             204
Total completed                   143             197             225
Total number of listed projects   -               -               562
Genome Browsers
- UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
- Ensembl Genome Browser (http://www.ensembl.org/)
- [Vega Genome Browser (http://vega.sanger.ac.uk/Homo_sapiens/)]
- Generic Genome Browser (http://www.gmod.org/index.php)
- VISTA Genome Browser (http://pipeline.lbl.gov/cgi-bin/gateway2)
Genome Browser (screenshot slides; example: Psa gene)
Gene Sorter
Gene Sorter - This program displays a sorted table of genes that are related to one another. The relationship can be one of several types, including protein-level homology, similarity of gene expression profiles, or genomic proximity.
Example: psa
8080
Pro
teom
e B
row
ser
PSA
8181
Pro
teom
e B
row
ser
PSA continued
8282
Pro
teom
e B
row
ser
Brca1
8383
Pro
teom
e B
row
ser
Brca1 continued
8484
Pro
teom
e B
row
ser
Brca1 continued
8585
Pro
teom
e B
row
ser
Brca1 continued
8686
Ensembl Genome Browser
A workbench for analysis of large-scale genomic sequence data, with strong emphasis on the production of enriched graphical representation of the analysed data. The GESTALT Workbench (GEnomic Sequence Total Analysis and Lookup Tool) can execute a variety of external analysis programmes (e.g. for gene recognition) as well as internal analyses (e.g. compositional complexity analysis); the resulting analysis output files are stored in an internal database. Integrating the analysis results, a Gestalt* is created for each sequence.
http://bioinformatics.weizmann.ac.il/GESTALT
http://bioinformatics.weizmann.ac.il/GESTALT/
Example of a large-scale GESTALT map of the Familial Mediterranean Fever region on human chromosome 16 (GenBank locus HSAJ03147).