The Human Genome, impact in the biomedical domain Sonia ABDELHAK, PhD Molecular Investigation of...

The Human Genome, impact in the biomedical domain

Sonia ABDELHAK, PhD

Molecular Investigation of Genetic Orphan Disorders

Institut Pasteur de Tunis

Human Genome Project

• Historical context.

• Goals of the HGP.

• Strategy.

• Results.

• Impact on Biomedical domain.

• Discussion.

« Finished » sequence April 1953-April 2003

February 2001

Brief history of HGP

• 1984 to 1986 – first proposed at US DOE meetings

• 1988 – endorsed by US National Research Council (funded by NIH and US DOE; $3 billion set aside)

• 1990 – Human Genome Project started (NHGRI)

• Later – UK, France, Japan, Germany, China join

• 1998 – Celera announces a 3-year plan to complete the project years early

• February 2001 – first draft published in Science and Nature

• April 2003 – finished human genome sequence announced (analysis published in Nature, 2004)

Challenges

• Genome attributes
– Size
– Polymorphism
– Repeats (smaller repeats are technically difficult to sequence; some sequences are repeated all over the genome: how can these be placed?)

• Available technology
– 600 bp per “read” (sequencing works by extension from a primer followed by gel electrophoresis; read length is limited by the resolution of the gel)
– Error (~1 error per 600 bases; sequencing multiple times decreases error, since the same error is unlikely in multiple reads; 10x coverage gives an error rate of ~1/10,000)
– Relies on cloning (some regions, such as heterochromatin, are difficult to clone; some sequences rearrange or are deleted when cloned)
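The read length and coverage figures above translate into a very large number of sequencing reads. A rough back-of-the-envelope sketch, assuming a genome of 3 billion bp (the estimate quoted in the goals) and ignoring failed reads and cloning bias; the function name `reads_needed` is illustrative only:

```python
import math

def reads_needed(genome_bp: int, read_bp: int, coverage: int) -> int:
    """Reads required so that total sequenced bases = coverage x genome size."""
    return math.ceil(genome_bp * coverage / read_bp)

GENOME_BP = 3_000_000_000  # estimated genome size (assumption taken from the slides)
READ_BP = 600              # bases per read, as quoted above

for cov in (1, 5, 10):
    print(f"{cov:>2}x coverage: {reads_needed(GENOME_BP, READ_BP, cov):,} reads")
```

Under these assumptions, 10x coverage already requires tens of millions of reads, which is why automation and computing were explicit goals of the project.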

Goals of HGP

• Create a genetic and physical map of the 24 human chromosomes (22 autosomes, X & Y)

• Identify the entire set of genes & map them all to their chromosomes

• Determine the nucleotide sequence of the estimated 3 billion base pairs

• Analyze genetic variation among humans

• Map and sequence the genomes of model organisms

Model organisms

• Bacteria (E. coli, Haemophilus influenzae, several others)

• Yeast (Saccharomyces cerevisiae)

• Plant (Arabidopsis thaliana)

• Roundworm (Caenorhabditis elegans)

• Fruit fly (Drosophila melanogaster)

• Mouse (Mus musculus)

Goals of HGP (II)

• Develop new laboratory and computing technologies to make all this possible

• Disseminate genome information

• Consider ethical, legal, and social issues associated with this research

Time-line large scale genomic analysis

Identification of microsatellite-type polymorphisms by sequence analysis:

tggtggcagaaatcattgtctgaaaagtaattgttttacttttattcttttcgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgcatgtgccagatttcttgtttgaaaggcaatgagcttcatccaagtatcaa

IL-12p35AC F

IL-12p35AC R

atttcaggtgtgagccactgtgcctggccagaactttttcaatgaatattcaagataattgtatacacattttatatatatatatatatatacacacacacacacacacacatatgtatacacacattatatatataatccatgttatatacatctctacattatatatatccactatatatattttacttatacatatagattttatttttatgaactaggatcaaattgta

IL-12p40AC F

IL-12p40AC R

[Figure: gel genotyping of the two markers. Values shown: 78.57% and 69.23%; allele sizes 174, 170, 166; lanes 1 to 5.]
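The (GT)n and (CA)n runs visible in the two sequences above can be located automatically. A minimal sketch, assuming a simple regular-expression scan for dinucleotide repeats; the function name and the threshold of 8 repeat units are illustrative choices, not the method used for these markers:

```python
import re

def find_dinucleotide_repeats(seq: str, min_units: int = 8):
    """Return (start, unit, n_units) for each 2-bp motif repeated >= min_units times."""
    seq = seq.lower()
    # ([acgt]{2}) captures a 2-bp unit; \1{n,} requires at least n further copies
    pattern = re.compile(r"([acgt]{2})\1{%d,}" % (min_units - 1))
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        if unit[0] == unit[1]:
            continue  # skip homopolymer runs such as 'aaaa...'
        hits.append((m.start(), unit, len(m.group(0)) // 2))
    return hits

# Toy sequence modelled on the first marker: flanking sequence around a (GT)n run
toy = "ttcttttc" + "gt" * 12 + "gcatgtgcc"
for start, unit, n in find_dinucleotide_repeats(toy):
    print(f"({unit.upper()}){n} at position {start}")
```

Primers such as the F/R pairs named above are then placed in the unique sequence on either side of the repeat, so that alleles differing in repeat number give PCR products of different sizes.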

EST Division: Expressed Sequence Tags

[Diagram: nucleus with 80–100,000 genes → 80–100,000 RNA gene products → make cDNA library → 80–100,000 unique cDNA clones in library → isolate unique clones and sequence once from each end (sequence 1 and sequence 2 of clone xyz).]

ESTs

dbEST http://www.ncbi.nlm.nih.gov/dbEST/

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

Sequencing chemistry: Dye Terminator

[Figure: the sequencing reaction (primer, template DNA T C G A T A, Taq polymerase) produces terminated fragments A, AG, AGC, AGCT, AGCTA, AGCTAT. The fragments are loaded and separated by electrophoresis (slab gel or capillary), detected, and analysed automatically.]

Two Competing Strategies for Human Genome Sequencing

• Hierarchical shotgun [Public human genome project]

• Whole-genome shotgun [Celera project]

BAC: Bacterial Artificial Chromosome clone

Contig: joined overlapping collection of sequences or clones.

Whole-genome shotgun sequencing

Used by the private company Celera to sequence the whole human genome

• Whole genome randomly sheared three times
– Plasmid library constructed with ~2 kb inserts
– Plasmid library with ~10 kb inserts
– BAC library with ~200 kb inserts

• Computer program assembles sequences into chromosomes

• No physical map construction

• Only one BAC library

• Reduces problems of repeat sequences
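The assembly step, in which a program joins overlapping reads into contigs, can be illustrated with a toy example. This is a sketch of a generic greedy overlap-merge, not the Celera assembler: real assemblers also use read-pair distances from the 2 kb, 10 kb and BAC libraries and must cope with sequencing errors and repeats. All names and the minimum overlap of 3 bases are illustrative assumptions.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: keep separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Three overlapping toy reads drawn from the sequence ATGGCGTACGTTAGC
print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]))
```

A repeated sequence longer than a read gives the same overlap at several genomic positions, which is why repeats are the main difficulty for this kind of approach.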

Steps of sequence analysis

• Sequence quality check

• Removal of contaminating sequences: blastn against databases of vector, bacterial, yeast, … sequences

• Assembly: Phred, Phrap, Consed

• Identification of potentially coding sequences: comparison with databases, exon-prediction software

[Diagram: the three nucleotide sequence databases. GenBank (NCBI, NIH; queried through Entrez), EMBL (EBI; queried through SRS) and DDBJ (CIB, NIG; queried through getentry). Each database receives submissions and updates.]

HTG Division: High Throughput Genome Records

40,000 to > 350,000 bp

• phase 1 (HTG): Acc = AC008701, gi = 6601005

• phase 2 (HTG): Acc = AC008701, gi = 6671909

• phase 3 (PRI): Acc = AC008701, gi = 7328720

2.88 Gbp

2,851,330,913

Gene prediction

• Easy for prokaryotes (single cell) – one gene, one protein

• More difficult for eukaryotes (multicell) – one gene, many proteins

• Very difficult for Human – short exons separated by non-coding long introns
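Why the prokaryotic case is "easy" can be shown with a naive open reading frame (ORF) scan: an uninterrupted gene runs from a start codon to the first in-frame stop codon. A minimal sketch, assuming the forward strand only and ATG as the sole start codon; the function name and the length threshold are illustrative. The same scan fails on human genes, where short exons are split by long introns and no single uninterrupted frame exists.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs

# Toy sequence: ATG AAA CCC GGG TAG embedded in frame 2
print(find_orfs("CCATGAAACCCGGGTAGCC"))
```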

Gene recognition

• Coding and non-coding regions have different sequence profiles – coding regions are “protected” from mutation and are less random

• Gene recognition by sequence alignment

• Gene prediction by Hidden Markov Model trained on a set of known genes

• Many genes are homologs – similar in vastly different organisms

Two predictions disagree

John B. Hogenesch et al., Cell, Vol. 106, 413–415, August 24, 2001

“…predicted transcripts collectively contain partial matches to nearly all known genes, but the novel genes predicted by both groups are largely non-overlapping.”

Human genome content

The Human Genome

• Total length 3000 Mb

• ~40,000 genes (coding seq)

• Gene sequences < 5%
– Exons ~1.5% (coding)
– Introns ~3.5% (noncoding)

• Intergenic regions (junk) > 95%

• Repeats > 50%
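The percentages above are easier to picture as absolute amounts of DNA. A small arithmetic sketch using only the slide's figures (3000 Mb, ~40,000 genes, ~1.5% exons, ~3.5% introns, > 50% repeats); the function name is illustrative and the results are order-of-magnitude estimates, since the inputs are approximate.

```python
GENOME_MB = 3000     # total length quoted on the slide
N_GENES = 40_000     # approximate gene count quoted on the slide

def megabases(percent: float, genome_mb: float = GENOME_MB) -> float:
    """Convert a percentage of the genome into megabases."""
    return percent * genome_mb / 100

exon_mb = megabases(1.5)
intron_mb = megabases(3.5)
repeat_mb = megabases(50)   # lower bound, since repeats are "> 50%"

print(f"Exons:   ~{exon_mb:.0f} Mb")
print(f"Introns: ~{intron_mb:.0f} Mb")
print(f"Repeats: >{repeat_mb:.0f} Mb")
print(f"Average coding sequence per gene: ~{exon_mb * 1e6 / N_GENES:.0f} bp")
```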

Global properties

• Pericentromeric and subtelomeric regions of chromosomes filled with large recent transposable elements

• Marked decline in the overall activity of transposable elements or transposons

• Male mutation rate about twice female – most mutation occurs in males

• Recombination rates much higher in distal regions of chromosomes and on shorter chromosome arms
– at least one crossover per chromosome arm in each meiosis

Fig 17 transposables

Classes of transposable elements. LINE, long interspersed element. SINE, short interspersed element.

Total 45%

Interspersed repeats: fixed transposable elements copied to non-homologous regions.

Fig 21

Two regions of about 1 Mb on chromosomes 2 and 22. Red bars, interspersed repeats; blue bars, exons of known genes. Note the deficit of repeats in the HoxD cluster, which contains a collection of genes with complex, interrelated regulation.

Genes are sometimes protected from repeats

Important features of Human proteome

• 30,000–40,000 protein-coding genes

• Proteome (full set of proteins) more complex than those of invertebrates
– pre-existing components arranged into a richer collection of architectures

• Hundreds of genes seem to come from horizontal transfer from bacteria (questionable)

• Dozens of genes seem to come from transposable elements.

Noncoding RNA genes

• Transfer RNAs (tRNAs) – adaptors that translate triplet code of RNA into amino acid sequence of proteins

• Ribosomal RNAs (rRNAs) – components of ribosome

• Small nucleolar RNAs (snoRNAs) – RNA processing and base modification in nucleolus

• Small nuclear RNAs (snRNAs) – components of spliceosomes

Human races have similar genes

• Genome sequence centers have sequenced significant portions of at least three races

• Range of polymorphisms within a race can be much greater than the range of differences between any two individuals of different race

• Very few genes are race specific

Genome Sizes (MegaBases)

[Bar charts of genome size. Chart 1 (axis 0 to 600,000 Mb): Fly, Fugu, Human, Wheat, Amoeba. Chart 2 (axis 0 to 3,500 Mb): E. coli, Yeast, Worm, Fly, Fugu, Human.]

Fig 35a

Size distributions of exons in Human, Worm and Fly. Humans have shorter exons.

Fig 35c

Size distributions of introns in Human, Worm and Fly. Humans have longer introns.

• Complexity of the proteome increases from yeast to humans
– More genes
– Shuffling, increase, or decrease of functional modules
– Alternative RNA splicing – humans exhibit significantly more
– Chemical modification of proteins is higher in humans

Combinatorial strategies

• At DNA level – T-cell receptor genes are encoded by a multiplicity of gene segments

• At RNA level – splicing of exons in different combinations

Fig. 10.21

Yeast

• 70 human genes are known to repair mutations in yeast

• Nearly all we know about cell cycle and cancer comes from studies of yeast

• Advantages:
– fewer genes (6000)
– few introns

• 31% of yeast genes give same products as human homologues

Drosophila

• Nearly all we know of how mutations affect gene function comes from Drosophila studies

• We share 50% of their genes

• 61% of genes mutated in 289 human diseases are found in fruit flies

• 68% of genes associated with cancers are found in fruit flies

• Knockout mutants

• Homeobox genes

C. elegans

• 959 somatic cells in the adult hermaphrodite (302 of them neurons)

• 131 additional cells are eliminated by programmed cell death (apoptosis) during development

• Apoptosis is involved in several human genetic neurological disorders
– Alzheimer's
– Huntington's
– Parkinson's

Mouse

• known as “mini” humans

•Very similar physiological systems

•Share 90% of their genes

Questions Remain about the Human Genome

– Difficult to precisely estimate number of genes at this time

• Small genes are hard to identify

• Some genes are rarely expressed and do not have normal codon usage patterns – thus hard to detect

Impact of HG on Biomedical domain

Applications to medicine and biology

• Disease genes
– human genomic sequence in public databases allows rapid identification of disease genes in silico

• Drug targets
– pharmaceutical industry has depended upon a limited set of drug targets to develop new therapies
– now can find new targets in silico

• Basic biology
– basic physiology, cell biology…

X-linked inheritance

Autosomal dominant inheritance

Autosomal recessive inheritance

[Pedigree figures with genotypes: A1A1, A1A2, A2A2; MM, Mm, mm.]

Point mutations

Creation of a stop codon: CAG (Gln) → TAG (stop)
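The effect of a single-base change on the encoded protein can be checked by translating the codon before and after the mutation. A minimal sketch, assuming the standard genetic code and one-letter amino acid symbols (Q = Gln, E = Glu, V = Val, P = Pro, * = stop); the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate a DNA string codon by codon (length must be a multiple of 3)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def classify(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change as silent, nonsense or missense."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon.upper()], CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"

print(classify("CAG", "TAG"))   # the Gln -> stop change shown above
print(classify("GAG", "GTG"))   # a Glu -> Val change
print(translate("CCTGAGGAG"), "->", translate("CCTGTGGAG"))
```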

[Diagram: two routes linking Disease, Function/Protein, Gene and Chromosomal localisation.]

Positional cloning of genes

... CCT GAG GAG ... (normal) → ... CCT GTG GAG ... (mutated)

... Pro Glu Glu ... (normal) → ... Pro Val Glu ... (mutated)

Steps of positional cloning:

• Family recruitment: phenotype determination, DNA collection

• Cytogenetic abnormality

• Genetic mapping: chromosomal localisation, fine mapping

• Physical mapping and isolation of specific clones

• Gene isolation

• Mutation screening

• Functional study

1 to 10 years!

EYA1 gene structure

Branchio-Oto-Renal Syndrome

[Figure, panels a) to c): exon/intron structure of EYA1. Labels include exons -1, 1, 1' and 2 to 16, introns -I, I, I' and II to XV, and the values 11083, 9480, 4405, 10910.]


.... From in vivo to in vitro to in silico

Penetrance problem

Family EBDD-I

[Pedigree figure: generations I to V with marker haplotypes, analysed under the dominant model.]

Disease with incomplete penetrance and variable expressivity

[Diagram: Individual 1 carries G1 and is affected; Individual 2 carries G1 and is healthy (??). Possible explanations: environment?; alleles G1/1 versus G1/2; alternative splicing, nonsense-mediated mRNA decay and other post-transcriptional regulation mechanisms; modifier genes (G2, G3).]

Environmental factors, genetic factors

Complex/common disorders: multifactorial

Complex Diseases: Genes & Environment

[Chart: diseases placed according to Genetic Component versus Environmental Effect. Labels: Hemophilia, Familial Colon or Breast Cancer, Alzheimer's, Asthma, Skin Cancer, Motor Vehicle Accident, Cardiovascular Disease, Schizophrenia, Cystic Fibrosis, Stroke, Type 2 Diabetes, Lung Cancer, Bipolar Disorder.]

The potential benefits of identifying genes/variations involved in disease

• Improve the understanding of disease etiology and mechanism

• Early disease risk assessment

• Discover new drug targets

• Disease prevention

• Population or ethnic group variability

Predisposition

Targeted screening

Prevention

Diagnosis

Therapy

Predictive medicine

Pharmacogenomics: The Promise of Personalized Medicine

[Cartoon (text: “O GOD!”). Credit: Joe Sutliff, Science, 2001.]

Acknowledgement: the following presentation has been prepared on the basis of

• Internet resources.

• International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

• Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).

• International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945 (2004).

Thank you
