The Human Genome, impact in the biomedical domain Sonia ABDELHAK, PhD Molecular Investigation of...
-
Upload
trinity-evans -
Category
Documents
-
view
217 -
download
3
Transcript of The Human Genome, impact in the biomedical domain Sonia ABDELHAK, PhD Molecular Investigation of...
The Human Genome, impact in the biomedical domain
Sonia ABDELHAK, PhDMolecular Investigation of Genetic Orphan Disorders
Institut Pasteur de Tunis
Human Genome Project
• Historical context.
• Goals of the HGP.
• Strategy.
• Results.
• Impact on Biomedical domain.
• Discussion.
« Finished » sequence April 1953-April 2003
February 2001
Brief history of HGP1984 to 1986 – first proposed at US DOE meetings 1988 – endorsed by US National Research Council(Funded by NIH and US DOE $3 billion set aside)1990 – Human Genome Project started (NHGRI)Later – UK, France, Japan, Germany, China1998. Celera announces a 3-year plan to complete
the project years earlyFirst draft published in Science and Nature in
February, 2001Finished Human Genome sequence published in
Nature 2003.
Challenges• Genome Attributes
– Size– Polymorphism– Repeats (Smaller repeats are technically difficult to sequence,
some sequences are repeated all over the genome: How can these be placed?).
• Available Technology– 600 bp per “read”(Sequencing works by extension from a primer/
gel electrophoresis. Limited by resolution of gel).– Error (~1 error per 600. Sequencing multiple times decreases
error; same error unlikely in multiple reads. 10x Coverage = error rate ~1/10,000).
– Relies on cloning (Some regions are difficult to clone Heterochromatin; some sequences rearrange or are deleted when cloned)
Goals of HGP
• Create a genetic and physical map of the 24 human chromosomes (22 autosomes, X & Y)
• Identify the entire set of genes & map them all to their chromosomes
• Determine the nucleotide sequence of the estimated 3 billion base pairs
• Analyze genetic variation among humans• Map and sequence the genomes of model
organisms
Model organisms
• Bacteria (E. coli, influenza, several others)
• Yeast (Saccharomyces cerevisiae)
• Plant (Arabidopsis thaliana)
• Roundworm (Caenorhabditis elegans)
• Fruit fly (Drosophila melanogaster)
• Mouse (Mus musculus)
Goals of HGP (II)
• Develop new laboratory and computing technologies to make all this possible
• Disseminate genome information
• Consider ethical, legal, and social issues associated with this research
Time-line large scale genomic analysis
Identification de Polymorphismes de type microsatellites par analyse de séquence:
tggtggcagaaatcattgtctgaaaagtaattgttttacttttattcttttcgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgcatgtgccagatttcttgtttgaaaggcaatgagcttcatccaagtatcaa
IL-12p35AC F
IL-12p35AC R
atttcaggtgtgagccactgtgcctggccagaactttttcaatgaatattcaagataattgtatacacattttatatatatatatatatatacacacacacacacacacacatatgtatacacacattatatatataatccatgttatatacatctctacattatatatatccactatatatattttacttatacatatagattttatttttatgaactaggatcaaattgta
IL-12p40AC F
IL-12p40AC R
78.57%
69.23%
174170166
1 2 3 4 5
EST Division: Expressed Sequence Tags
80-100,000 RNA gene products
nucleus80-100,000
genes
80-100,000 uniquecDNA clones in library
- isolate unique clones - sequence once from each end
TAGTCA
CGTACT
sequence1
sequence2
clone xyz
make cDNA library
ESTsdbEST http://www.ncbi.nlm.nih.gov/dbEST/
>IMAGE:275615 3', mRNA sequenceNNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE:275615 5' mRNA sequenceGACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
A
A G C T A T
A G C T A
A G C T
A G C
A G
ElectrophorèseGel plat / capillaire
A G C T A T
Analyse automatique
dépot détection
Chimie de séquençageDye Terminator (6)
amorce
T C G A T AADN
TaqA G C T A T ...
réaction deséquence
Two Competing Strategies for Human Genome
• (Hierarchical shotgun) [Public human genome project]
• Whole-genome Shotgun [Celera project]
Sequencing
BAC: Bacterial Artificial Chromosome clone
Contig: joined overlapping collection of sequences or clones.
Whole-genome shotgun sequencing
Private company Celera used to sequence whole human genome
• Whole genome randomly sheared three times– Plasmid library constructed
with ~ 2kb inserts– Plasmid library with ~10 kb
inserts– BAC library with ~ 200 kb
inserts• Computer program assembles
sequences into chromosomes• No physical map construction• Only one BAC library• Reduces problems of repeat
sequences
Vérification de la qualité de séquence
Elimination des séquences contaminantesBlastn contre des banques de vecteurs, de bactéries, levures,…
Assemblage, Phred, Phrap, Consed
Identification des séquences potentiellement codantesComparaison avec les banques de données,
Logiciels de prédictions d’exons.
Différentes étapes d’analyse de séquence
A G C T A T
GenBankGenBank
DDBJDDBJ
EMBLEMBL
EMBLEMBL
Entrez
SRS
getentry
NIGNIGCIB EBI
NCBI
NIHNIH
•Submissions•Updates •Submissions
•Updates
•Submissions•Updates
HTG Division: High Throughput Genome RecordsHTG Division: High Throughput Genome Records
40,000 to > 350,000 bp
phase 1
phase 2
phase 3
HTG
HTG
PRI
Acc = AC008701 gi = 6601005
Acc = AC008701 gi = 6671909
Acc = AC008701 gi = 7328720
2.88 Gbp
2,851,330,913
Gene prediction
• Easy for procaryotes (single cell) – one gene, one protein
• More difficult for eukaryotes (multicell) – one gene, many proteins
• Very difficult for Human – short exons separated by non-coding long introns
Gene recognition
• Coding region and non-coding region have different sequence profiles – coding region is “protected” from mutation and
is less random
• Gene recognition by sequence alignment• Gene prediction by Hidden Markov Model
trained by set of known genes• Many genes are homologs – similar in
vastly different organisms
Two predictions disagree
John B. Hogenesch, et alCell, Vol. 106, 413–415August 24, 2001
“…predicted transcripts collectively contain partial matches to nearly all knowngenes, but the novel genes predicted by both groups are largely non-overlapping.”
Human genome content The Human Genome
Total length 3000 Mb~ 40,000 genes (coding seq)
Gene sequences < 5% Exons ~ 1.5% (coding) Introns ~ 3.5% (noncoding)
Intergenic regions (junk) > 95%
Repeats > 50%
Global properties
• Pericentromeric and subtelomeric regions of chromosomes filled with large recent transposable elements
• Marked decline in the overall activity of transposable elements or transposons
• Male mutation rate about twice female – most mutation occurs in males
• Recombination rates much higher in distal regions of chromosomes and on shorter chromosome arms– > one crossover per chromosome arm in each
meiosis
Fig 17 transposables
Classes of transposable elements. LINE, long interspersed element. SINE short interspersed element.
Total 45%
Interspersed repeats: fixed transposable elements copied to non-homologous regions.
Fig 21
Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster, which contains a collection of genes with complex, interrelated regulation.
Genes are sometimes protected from repeats
Important features of Human proteome
• 30,000–40,000 protein-coding genes• Proteome (full set of proteins) more complex than
those of invertebrates.– pre-existing components arranged into a richer
architectures.
• Hundreds of genes seem to come from horizontal transfer from bacteria questionable
• Dozens of genes seem to come from transposable elements.
Noncoding RNA genes
• Transfer RNAs (tRNAs) – adaptors that translate triplet code of RNA into amino acid sequence of proteins
• Ribosomal RNAs (rRNAs) – components of ribosome
• Small nucleolar RNAs (snoRNAs) – RNA processing and base modification in nucleolus
• Small nuclear RNAs (sncRNAs) - spliceosomes
Human races have similar genes
• Genome sequence centers have sequenced significant portions of at least three races
• Range of polymorphisms within a race can be much greater than the range of differences between any two individuals of different race
• Very few genes are race specific
Genome Sizes (MegaBases)
0
100000
200000
300000
400000
500000
600000
Fly Fugu Human Wheat Amoeba
Size
0
500
1000
1500
2000
2500
3000
3500
E.coli Yeast Worm Fly Fugu Human
Size
Fig 35a
Size distributions of exons in Human, Worm and Fly. Human have shorter exons.
Fig 35cSize distributions of intons in Human, Worm and Fly. Human have longer introns.
• Complexity of proteome increase from yeast to humans– More genes– Shuffling, increase, or decrease of functional
modules– Alternative RNA splicing – humans exhibit
significantly more– Chemical modification of proteins is higher in
humans
Combinatorial strategies
• At DNA level – T-cell receptor genes are encoded by a multiplicity of gene segments
• At RNA level – splicing of exons in different orders
Fig. 10.21
Yeast
• 70 human genes are known to repair mutations in yeast
•Nearly all we know about cell cycle and cancer comes from studies of yeast
•Advantages:
•fewer genes (6000)
•few introns
• 31% of yeast genes give same products as human homologues
Drosophila
• nearly all we know of how mutations affect gene function come from Drosophila studies
•We share 50% of their genes
•61% of genes mutated in 289 human diseases are found in fruit flies
•68% of genes associated with cancers are found in fruit flies
•Knockout mutants
•Homeobox genes
C. elegans
• 959 cells in the nervous system
• 131 of those programmed for apoptosis
• apoptosis involved in several human genetic neurological disorders
•Alzheimers
•Huntingtons
•Parkinsons
Mouse
• known as “mini” humans
•Very similar physiological systems
•Share 90% of their genes
Questions Remain about the Human Genome
– Difficult to precisely estimate number of genes at this time
• Small genes are hard to identify
• Some genes are rarely expressed and do not have normal codon usage patterns – thus hard to detect
Impact of HG on Biomedical domain
Applications to medicine and biology
• Disease genes– human genomic sequence in public databases
allows rapid identification of disease genes in silico
• Drug targets– pharmaceutical industry has depended upon a
limited set of drug targets to develop new therapies
– now can find new target in silico
• Basic biology– basic physiology, cell biology…
Hérédité liée au chromosome X
Hérédité autosomique dominante
Hérédité autosomique récessive
A1A1A1A2
A1A1
A1A1A2A2
A1A2 A1A2Mm Mm
MmMM mm mm
mm
Les mutations ponctuelles
Création de codon stopCAG GlnTAG
Disease
Function/Protein
Gene
Chromosomal localisation
Disease
Function/Protein
Gene
Chromosomal localisation
Positional cloning of genes
... CCT GAG GAG ... ... CCT GTG GAG ...
... Pro Glu Glu ... ... Pro Val Glu ...
normal muté
anomalie cytogénétique
Cartographie génétique-localisation chromosomique-localisation fine
Cartographie physiqueet
Isolement de clones spécifiques
Isolement de gène (s)
Recherche de mutations
Etude fonctionnelle
Recherche de familles-détermination du phénotype-collecte d'ADN
1 to 10 years!
1 2 3 4 5 6 7 8 9-1 1' 10 1112
1314
15 16
II III IV V VI VII VIII IX X XI XIVXIII
XVXII
a)
b)
11083 9480 4405 10910
c)
-I I I'
EYA1 gene structure
Bronchio-Oto-Renal Syndrome
... CCT GAG GAG ... ... CCT GTG GAG ...
... Pro Glu Glu ... ... Pro Val Glu ...
normal muté
anomalie cytogénétique
Cartographie génétique-localisation chromosomique-localisation fine
Cartographie physiqueet
Isolement de clones spécifiques
Isolement de gène (s)
Recherche de mutations
Etude fonctionnelle
Recherche de familles-détermination du phénotype-collecte d'ADN
.... From in vivo to in vitro to in silico
Problème de pénétrance
Famille EBDD-I
IV
V
III
I
II
2
74 4
3
33m7
33M103
3m7
33M10
33m6
33M10
33m6
33M8
33m7
33M8
Sous le mode dominant
33M7
33M8
33M8
33M7
22M11
33M8
33M10
33M8
33M7
33M10
22M11
44M5
52M9
33M
33m7
Maladie à pénétrance incomplète et expressivité variable
Individu 1
G1 Malade
Individu 2
G1 Sain??
Environnement?
G1/1 G1/2
Epissage alternatifNon Sens mRNA decayMécanisme de régulation post-transcriptionnelle
G2 G3
Gènes modificateurs
Environemental factors Genetic factors
Complex /common disorders: multifactoriel
Hem
ophi
liaFam
ilial
Col
on o
r
Breas
t Can
cer
Alz
heim
er’s
Ast
hma
Skin
Can
cer
Mot
or V
ehic
le
Acc
iden
t
Car
diov
ascu
lar
Dise
ase
Complex Diseases : Genes & Environment
Environmental Effect
Genetic Component
Schi
zoph
reni
a
Cys
tic F
ibro
sis
Stro
ke
Type 2
Dia
bete
s
Lung
Can
cer
Bipol
ar D
isord
er
Improve the understanding of disease etiology and mechanism
Early disease risk assessment
Discover new drug targets
Disease prevention
population or ethnic group variability
The potential benefits of identifying genes/variations involved in disease
Predisposition
Targeted screening
Prevention
Diagnosis
Therapy
Predictive medicine
Pharmacogenomics:The Promise of Personalized Medicine
CR
ED
IT:
JOE S
UTLI
FF.
SC
IEN
CE,
20
01
O GOD!
Acknowledgement: the following presentation has been prepared on the basis of
• Internet resources.
• International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
• Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
• International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome., Nature 431: 931-945 (2004).
Thank you