Goals of the International Human Genome Sequencing Consortium


  • Goals of the International Human Genome Sequencing Consortium

    • Completeness: no mapping gaps, no sequencing gaps

    • Accuracy: error rate < 10^-4, based on a minimum of 3 reads (at least 1 on each strand)

  • Is the human genome sequence complete?

  • Sequencing gaps: June 00/Apr 01

    [Figure: sequencing gaps; horizontal axis 1 to 24, vertical axis 0 to 25000]

  • Sequencing gaps: June 00/Dec 01

    [Figure: sequencing gaps; horizontal axis 1 to 24, vertical axis 0 to 25000]

  • Sequencing gaps: June 00/Nov 02

    [Figure: sequencing gaps; horizontal axis 1 to 24, vertical axis 0 to 25000]

  • Sequencing gaps: June 00/Apr 03

    [Figure: sequencing gaps; horizontal axis 1 to 24, vertical axis 0 to 25000]

  • Assembling the entire sequence: The Golden Path (NCBI, Build 34) (Jul 2003)

    • Total estimated size (bp): 3 070 000 000

    • Total euchromatin estimated size (bp): 2 864 000 000

    • Size of non-overlapping assembly (bp): 2 843 433 602

    • Euchromatic fraction sequenced (%): 99

    • Number of cloning gaps: 281
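As a quick check of the Build 34 figures above, the euchromatic fraction sequenced can be recomputed by dividing the size of the non-overlapping assembly by the estimated euchromatin size. The sketch below is only an arithmetic illustration; the variable names are ours, not from the slides.

```python
# Figures from the Build 34 slide (base pairs).
euchromatin_estimated_size = 2_864_000_000
assembly_size = 2_843_433_602

# Fraction of the estimated euchromatin covered by the non-overlapping assembly.
euchromatic_fraction = assembly_size / euchromatin_estimated_size

# Euchromatic sequence not yet present in the assembly.
missing_bp = euchromatin_estimated_size - assembly_size

print(f"euchromatic fraction sequenced: {euchromatic_fraction:.1%}")
print(f"euchromatin still missing: {missing_bp:,} bp")
```

The ratio rounds to the 99% quoted on the slide.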

  • Is the human genome sequence accurate?

  • Sequence Accuracy

    Base miscalls and small indels were determined by resequencing a sample of BACs from each sequencing center (J. Schmutz, Stanford Human Genome Center)

    Error rate between 10^-4 and 10^-5 per base pair

  • Sequence Accuracy

    Large indels were identified by mapping end-sequenced fragments of known size onto the human genome assembly
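The idea behind this check can be sketched as follows: if both end reads of a fragment of known size are placed on the assembly, the distance between them should match the fragment size, and a large discrepancy suggests sequence that is extra or missing in the assembly. This is a minimal illustration, not the procedure used by the consortium; the `MappedFragment` class, the 20% tolerance and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MappedFragment:
    """An end-sequenced fragment whose two end reads were placed on the assembly."""
    name: str
    expected_size: int  # known fragment size in bp (hypothetical input)
    left_pos: int       # assembly coordinate of the left end read
    right_pos: int      # assembly coordinate of the right end read


def flag_large_indels(fragments, tolerance=0.2):
    """Return (name, kind, difference) for fragments whose mapped span differs
    from the expected size by more than `tolerance` (a fraction of that size)."""
    flagged = []
    for frag in fragments:
        mapped_span = frag.right_pos - frag.left_pos
        difference = mapped_span - frag.expected_size
        if abs(difference) > tolerance * frag.expected_size:
            # Span longer than the fragment: the assembly holds extra sequence.
            # Span shorter than the fragment: the assembly lacks sequence.
            kind = "insertion in assembly" if difference > 0 else "deletion in assembly"
            flagged.append((frag.name, kind, difference))
    return flagged


# Made-up example: three 40 kb fragments placed on an assembly.
fragments = [
    MappedFragment("f1", 40_000, 100_000, 140_500),  # consistent
    MappedFragment("f2", 40_000, 300_000, 365_000),  # span too long
    MappedFragment("f3", 40_000, 500_000, 522_000),  # span too short
]
for row in flag_large_indels(fragments):
    print(row)
```

In practice a discrepancy would only be trusted when several independent fragments spanning the same region agree.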

  • Is the Human Genome Sequence well annotated?

  • DNA sequence features of biological relevance

    Genes

    • coding parts

    • other transcribed segments (5’ and 3’ UTRs)

    • regulatory regions

    Other features

    • structural (centromeres, telomeres...)

    • unknown functions

    Sequence variants

  • Biological features are identified using

    biological data (cDNAs, RNAs, proteins)

    ab initio analyses

    comparative analyses

    combination of these methods

  • Genes

  • Ensembl Current Release (Jan 2004) (based on NCBI, Build 34)

    • Ensembl gene predictions: 23 531

    • Genscan gene predictions: 65 010

    • Ensembl gene exons: 225 897

    • Ensembl gene transcripts: 31 609

  • NCBI statistics (July 2003) (based on Build 33)

    RNA genes: 185

    Protein coding genes

    • Function known or inferred: 14858

    • Function unknown: 6383

    Models predicted ab initio: 4288

    Models predicted ab initio with EST support: 9848

    Other models with EST support: 3431

    Other models with mRNA support: 3289

    Known and predicted pseudogenes: 6241

    TOTAL features: 48523

    TOTAL protein or possibly protein coding: 42282

  • RefSeq Statistics (14 Jan 04)

    Review status: Validated, Inferred, Provisional, Predicted, Reviewed, Models

    RNAs: 1214, 5, 23885, 8720, 9393, 31935

    Proteins: 1210, 5, 23767, 8720, 10393, 31882

  • Despite the availability of a nearly complete sequence (99%) of the human genome, the gene inventory is not yet complete

  • Numerous annotated gene models remain fragmentary

  • [Figure: gene model from 5’ end to 3’ end, with CpG island]

  • Annotated gene models frequently lack:

    - the 5’ end

    - alternative exons

    - alternative splicing sites

    - alternative starts or polyadenylation sites

  • Some neglected genome features

  • Small open reading frames (smORFs)

    • encoding fewer than 100 amino acids

    • phylogenetically conserved

    • single or multiple exon genes

    • other standard gene features
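To make the size criterion concrete, here is a minimal sketch of a scan for candidate small ORFs. It is deliberately simplified: it only looks at the forward strand, only handles single-exon ORFs, and does not test phylogenetic conservation, which the slide lists as a defining property. The function name and the `min_codons` floor (used to skip trivially short hits) are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_small_orfs(sequence, max_codons=100, min_codons=10):
    """Return (start, end, n_amino_acids) for forward-strand ORFs that begin
    with ATG, end at the first in-frame stop codon, and encode fewer than
    `max_codons` amino acids. Coordinates are 0-based, end-exclusive."""
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the last stop in this frame
            elif start is not None and codon in STOP_CODONS:
                n_amino_acids = (i - start) // 3  # stop codon not counted
                if min_codons <= n_amino_acids < max_codons:
                    orfs.append((start, i + 3, n_amino_acids))
                start = None
    return orfs


# Toy sequence: ATG followed by 12 alanine codons and a stop codon.
toy = "CC" + "ATG" + "GCT" * 12 + "TAA" + "GG"
print(find_small_orfs(toy))
```

Candidates produced this way would still need the other evidence named on the slide (conservation, standard gene features) before being considered genes.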

  • MicroRNA genes

    • ~22-nucleotide non-coding RNAs

    • control expression of other genes at the post-transcriptional level

    • derive from a phylogenetically conserved stem-loop precursor with characteristic features

    • 200-250 miRNA genes in the human genome

  • Other non-coding transcripts

    • spliced, polyadenylated and cytoplasmic

    • expressed at low level

    • poorly conserved between human and mouse

    • display some tissue specificity

    • may represent 20 to 30% of the genome

    • can they be considered as products of genes?

  • How can we identify more genomic features ?

  • More ‘full length’ cDNAs

    • Combining ab initio predictions with biological data (RT PCR, SAGE)

    • Global genome comparisons

  • Comparative Genomics

    The power of sequence comparison has long been known

    Sequences that have a biological function show a higher degree of sequence similarity than average

    However, a higher degree of sequence similarity is not proof of biological function

  • Comparative Genomics

    To better understand the respective roles of mutation and selection, the main forces acting on genome evolution, analyses cannot be restricted to coding sequences

    Hence the use of a conservation score that can be applied to any type of genomic DNA sequence

    $$S = S(R) = \frac{\rho - \mu}{\sqrt{\mu(1-\mu)/n}}$$

    n: number of sites within the window that are aligned

    ρ: fraction of aligned sites that are identical

    µ: average fraction of sites that are identical in aligned ancestral repeats in the surrounding region
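The score can be computed directly from the three quantities defined above. The sketch below assumes the denominator is the binomial standard deviation, sqrt(µ(1-µ)/n), so that S measures how many standard deviations the observed identity ρ lies above the local neutral expectation µ. The function name and the example numbers are illustrative only.

```python
import math


def conservation_score(n_aligned, n_identical, mu):
    """Conservation score S = (rho - mu) / sqrt(mu * (1 - mu) / n).

    n_aligned:   number of aligned sites in the window (n)
    n_identical: number of aligned sites that are identical
    mu:          average identity in aligned ancestral repeats nearby
    """
    if n_aligned <= 0:
        raise ValueError("the window must contain at least one aligned site")
    if not 0 < mu < 1:
        raise ValueError("mu must lie strictly between 0 and 1")
    rho = n_identical / n_aligned
    return (rho - mu) / math.sqrt(mu * (1 - mu) / n_aligned)


# Made-up window: 100 aligned sites, 80 identical, neutral identity 0.5.
print(conservation_score(100, 80, 0.5))
# A window that matches the neutral expectation scores zero.
print(conservation_score(100, 50, 0.5))
```

Because the score is normalized by the window size and by the local neutral rate, it can be compared across coding and non-coding regions.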

    Non-coding sequences

  • coding exons

    5’ UTR

    200 bp upstream transcript start

    known regulatory regions

    introns

    3’ UTR

    200 bp downstream transcript end

    CpG islands

  • Comparative Genomics

    Identification of regions of biological relevance cannot rely only on comparisons of the human/mouse pair

    A substantial fraction of sequences under selection show conservation scores in the range of sequences evolving neutrally

    Use of genome sequences from additional species can overcome this limitation

  • Vertebrate Genome Sequencing Projects

    Human (>9X coverage, 99% complete)

    Mouse (7X coverage)

    Rat (6X coverage)

    Zebrafish (5X coverage)

    Pufferfish Tetraodon (7X coverage)

    Pufferfish Fugu (6X coverage)

  • Vertebrate Genome Sequencing Projects: Data in Trace Archive

    Species                 Sequence reads (million reads)
    Mouse                   78.9
    Rat                     39.6
    Chimpanzee              27.5
    Lemur                   0.5
    Dog                     35
    Cat                     0.6
    Bovine                  0.7
    Pig                     0.9
    Chicken                 11.7
    Xenopus                 5.9
    Zebrafish               13.4
    Pufferfish Tetraodon    3.0
    Pufferfish Fugu         2.0

  • The Tetraodon genome project

    A tool for vertebrate comparative analysis

  • Global assembly statistics

                           Number   N50 length   Size, gaps      Size, gaps      Longest   Percentage of the
                                    (Kb)         included (Mb)   excluded (Mb)   (Kb)      genome, gaps included
    All contigs            49,609   16           312.4           312.4           258       89.3
    Mapped contigs         16,083   26           197.7           197.7           258       56.5
    All scaffolds          25,773   731          342.4           312.4           7,612     97.8
    Mapped scaffolds       1,338    1,382        218.2           197.7           7,612     62.4
    All ultracontigs       128      1,382        274.0           247.0           11,977    78.2
    Mapped ultracontigs    39       7,601        218.3           197.7           11,977    62.4
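For readers unfamiliar with the N50 length reported in these statistics: it is the length L such that sequences of length L or longer together contain at least half of the total assembled bases. A minimal sketch, using made-up contig lengths rather than the Tetraodon data:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths: the largest length L
    such that sequences of length >= L hold at least half of the total."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (Kb); the total is 300, so half is 150.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 80 + 70 reaches 150, so the N50 is 70
```

This is why the N50 rises sharply from contigs to scaffolds to ultracontigs: linking sequences into longer units concentrates the assembly in fewer, larger pieces.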

  • (Tetraodon/human) ecores in annotated gene models, not overlapping exons

    [Figure: gene model from 5’ end to 3’ end]

    6180 ecores in Ensembl gene models but not matching an exon

    5789 ecores in RefSeq-based gene models but not matching an exon

  • (Tetraodon/human) ecores outside annotated gene models

    [Figure: gene model from 5’ end to 3’ end]

    19300 (Tetraodon/human) ecores outside Ensembl gene models

  • Vega gene annotations are generated by manual curation of computer-based models:

    Chromosome 6: HAVANA group, Wellcome Trust Sanger Institute

    Chromosome 7: Hillier et al., University of Washington Genome Centre

    Chromosome 13: HAVANA group, Wellcome Trust Sanger Institute

    Chromosome 14: Genoscope, CNRS

    Chromosome 20: HAVANA group, Wellcome Trust Sanger Institute

    Chromosome 22: Collins et al., Wellcome Trust Sanger Institute

  • Association of CpG islands and gene models (CpG island < 2 kb from 5’ end)

    Model      CpG island (%)
    Vega       65
    Refseq     60
    Ensembl    50
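A percentage of this kind can be obtained by counting, for each set of gene models, how many 5’ ends lie within 2 kb of a CpG island. The sketch below shows one way to do that count; it ignores chromosome and strand for brevity, and the function name, inputs and coordinates are hypothetical rather than taken from the analysis behind the slide.

```python
def fraction_with_cpg_island(five_prime_ends, cpg_islands, max_distance=2000):
    """Fraction of gene models whose 5' end lies less than `max_distance` bp
    from a CpG island. `cpg_islands` is a list of (start, end) intervals on
    the same coordinate system as `five_prime_ends`."""
    if not five_prime_ends:
        raise ValueError("at least one gene model is required")

    def distance(position, island):
        start, end = island
        if start <= position <= end:
            return 0  # the 5' end falls inside the island
        return min(abs(position - start), abs(position - end))

    associated = sum(
        1
        for position in five_prime_ends
        if any(distance(position, island) < max_distance for island in cpg_islands)
    )
    return associated / len(five_prime_ends)


# Made-up coordinates: two of the four 5' ends are near an island.
ends = [1_000, 10_000, 50_000, 90_000]
islands = [(500, 1_500), (11_500, 12_500), (60_000, 61_000)]
print(fraction_with_cpg_island(ends, islands))
```

A gene model truncated at its 5’ end places its annotated start away from the true promoter, which lowers this fraction; that is one reading of why the manually curated set scores highest.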

  • Table S10. Exofish analysis of five finished human chromosomes.

    Chromosome   Genes (known + putative)   Pseudogenes   Ecores in annotations   Ecores out of annotations
    Chr. 6       1558                       634           95.3 %                  4.7 %
    Chr. 13      622                        279           94.1 %                  5.9 %
    Chr. 14      860                        587           95.7 %                  4.3 %
    Chr. 20      650                        175           95.9 %                  4.1 %
    Chr. 22      587                        227           94.0 %                  6.0 %
    Total        4277                       1902          95.1 %                  4.9 %

  • (Tetraodon/human) ecores outside annotated gene models

    [Figure: chart with categories and shares, in the order listed]

    Categories: extending Ensembl models, in new Vega models, in pseudogenes, elsewhere

    Shares: 32%, 24%, 36%, 8%

  • Comparative Genomics applied to whole genomes

    (1) a way to monitor the degree of completion of genome annotation

    (2) a method to refine existing annotated gene models (extensions, additional internal exons)

    (3) a resource for novel candidate gene models

    (4) a method to identify non-transcribed and non-coding features

  • Contributors

    Tetraodon genomics: H. Roest Crollius, A. Bernot, L. Bouneau, C. Dasilva, C. Fischer, S. Nicaud, JL Petit, Z. Skalli

    Sequencing: P. Wincker

    FISH: C. Ozouf-Costaz (Museum National d’Histoire Naturelle)

    Informatics: O. Jaillon, J.M. Aury, V. Castelli, C. Dossat, M. Levy, E. Pelletier, C. Scarpelli, W. Saurin, V. Schächter

    WIBR/MIT: N. Stange Thomann, S. Dodge, M. Zody, R. Santos, C. Nusbaum, B. Birren, E. Lander

    cDNA: B. Segurens, M. Salanoubat, M. Katinka