Göteborg 01/12/2006

EnsEMBL and the process of genebuild

Julio Fernández Banet ([email protected])
Wellcome Trust Sanger Institute

EnsEMBL Group (Genebuild Team)
01-Dec-2006

Overview

• What is Ensembl?
  – Ensembl project
  – Open source
  – The genome browser
• Genebuild
  – Automatic vs. Manual annotation
  – Traditional Genebuild
  – Special cases

Ensembl - Project

• Joint project
  – EMBL – European Bioinformatics Institute (EBI)
  – Wellcome Trust Sanger Institute
• Produce accurate, automatic genome annotation
• Integrate external (distributed) biological data
• Focused on selected eukaryotic genomes
• Presentation of the analysis to all via the Web at http://www.ensembl.org
• Open distribution of the analysis to the community
• Development of open, collaborative software (databases and APIs)

Species in Ensembl v41 (figure)

APIs

• Used to retrieve data from and to store data in Ensembl databases.
• Ensembl Perl API:
  – Written in object-oriented Perl,
  – Foundation for the Ensembl Pipeline and the Ensembl Web interface.
• Ensembl Java API:
  – Written in Java, but similar in layout to the Perl API,
  – Foundation for Apollo,
  – Not supported; development has stopped.

Open source, open standards

• Object model
  – A standard interface makes it easy for others to build custom applications on top of Ensembl data
• Open discussion of design ([email protected])
• Most major pharma and many academics are represented on the mailing list, and code is being actively developed externally
• Ensembl locally (free for all)
  – Both industry & academia
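The object-model idea can be illustrated with a small sketch. The classes below are hypothetical Python stand-ins, not the Ensembl Perl or Java API; only the gene → transcript → exon containment is taken from the Ensembl data model, and all identifiers and coordinates are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exon:
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class Transcript:
    stable_id: str  # hypothetical identifier
    exons: List[Exon] = field(default_factory=list)

    def spliced_length(self) -> int:
        # Sum of exon lengths, i.e. the length after introns are removed.
        return sum(exon.length() for exon in self.exons)


@dataclass
class Gene:
    stable_id: str  # hypothetical identifier
    transcripts: List[Transcript] = field(default_factory=list)

    def longest_transcript(self) -> Transcript:
        return max(self.transcripts, key=lambda t: t.spliced_length())


gene = Gene("GENE1", [
    Transcript("TRANS1", [Exon(100, 199), Exon(300, 349)]),
    Transcript("TRANS2", [Exon(100, 199), Exon(300, 349), Exon(500, 599)]),
])
```

A custom application then only needs the shared interface (for example `gene.longest_transcript()`), not knowledge of how the data are stored.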

Ensembl – Open source

Genome browser

Exploring genomes

• Browse genes in genomic context
• Display features in and around a particular gene
• Explore larger chromosome regions
• Search and retrieve information on a gene- and genome-scale
• Investigate genome organisation
• Compare genomes

Ensembl ContigView

ContigView - Close-up

[Screenshot labels: manual annotation via Vega; Ensembl predictions; Ensembl EST-based predictions]

ContigView close-up

[Screenshot labels: transcripts in red & black (Ensembl predictions), blue (Vega) & gold (HAVANA); pop-up menu]

Gene Level View

• GeneView
  – Orthologs
  – Paralogs
  – Transcript
  – Gene structure
• TransView
  – Transcript structure
  – Transcript sequence
  – Similarity matches
• ProtView
  – Protein sequence
  – Peptide stats
  – Domains

GeneSNPView

MultiContigView

Much More

• Review documentation:
  – About Ensembl (publications, software licence)
  – Helpdesk (training courses and online workshops)
  – Ensembl Data
  – Software (API documentation)
• Archive (old release information)
• Data Mining (BioMart)
• Export and download data
• Display your own data (DAS)

Automatic Genome Analysis and Annotation (GeneBuild)

Automatic vs Manual

Automatic Annotation
• Quick whole-genome analysis: ~ weeks
• Consistent annotation
• Can use unfinished sequence or a shotgun assembly
• No polyA sites or signals, pseudogenes
• Predicts ~70% of loci

Manual Annotation
• Extremely slow: ~3 months for Chr 6
• Most rules have an exception
• Flexible, can deal with inconsistencies
• Needs finished sequence
• Consults publications as well as databases

Traditional Genebuild

[Flow diagram. Labels: Human proteins and Other proteins → Genewise → Genewise genes; Human cDNAs → Exonerate → Aligned cDNAs; Genewise genes with UTRs; Supported genscans (optional); Genebuilder → Preliminary gene set; cDNA genes; GeneCombiner → Core Ensembl genes; Pseudogenes → Final set + pseudogenes; Human ESTs → Aligned ESTs → ClusterMerge → Ensembl EST genes]

Genewise

• Predicts gene structure from protein-to-genome alignments.
• High specificity, low sensitivity
• Targetted (species-specific proteins).
• Similarity (related-species proteins).
• High computational cost

Using miniseqs

[Diagram labels: Genome; BLAST exon “seeds”; “MiniSeq”; Genewise; Remapped Genewise]
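A minimal sketch of the miniseq idea, assuming that a miniseq is built by padding and joining the genomic regions around the BLAST exon seeds, and that positions predicted on the miniseq are remapped to genome coordinates afterwards. The function names, the half-open coordinate convention and the padding value are illustrative assumptions, not Ensembl code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) genomic coordinates


def merge_padded(seeds: List[Interval], padding: int, genome_length: int) -> List[Interval]:
    """Pad each seed on both sides and merge padded regions that touch or overlap."""
    padded = sorted((max(0, s - padding), min(genome_length, e + padding)) for s, e in seeds)
    merged: List[Interval] = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_miniseq(genome: str, seeds: List[Interval], padding: int = 5):
    """Return the concatenated miniseq and the genomic blocks it was built from."""
    blocks = merge_padded(seeds, padding, len(genome))
    return "".join(genome[s:e] for s, e in blocks), blocks


def remap(pos: int, blocks: List[Interval]) -> int:
    """Convert a miniseq position back to a genomic position."""
    offset = 0
    for start, end in blocks:
        length = end - start
        if pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise ValueError("position outside miniseq")


# Toy genome with two "exons" (CCCC and GGGG) separated by a long intron.
genome = "A" * 20 + "CCCC" + "A" * 30 + "GGGG" + "A" * 20
miniseq, blocks = build_miniseq(genome, [(20, 24), (54, 58)], padding=2)
```

The expensive aligner then only sees the short miniseq instead of the full genomic region, which is the point of the approach.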

Exonerate

Alignment of species specific cDNAs and ESTs with genome.

Good gene models

Faster than genewise for DNA alignments.

UTR addition

[Diagram labels: Human cDNAs; Gene (genewise model)]

Use cDNA alignments to extend genewise models and add UTRs
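A rough sketch of UTR addition under simplifying assumptions: exons are half-open (start, end) intervals on the forward strand, and a cDNA is accepted if it overlaps both terminal exons of the protein-based model. The function names are hypothetical, and this sketch does not check that the internal splice structure of the cDNA matches the model, which a real implementation would need to do.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # half-open [start, end), forward strand


def overlaps(a: Exon, b: Exon) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def add_utrs(model: List[Exon], cdna: List[Exon]) -> List[Exon]:
    """Extend the terminal exons of a coding model using an aligned cDNA."""
    model, cdna = sorted(model), sorted(cdna)
    first, last = model[0], model[-1]
    five = next((c for c in cdna if overlaps(c, first)), None)
    three = next((c for c in reversed(cdna) if overlaps(c, last)), None)
    if five is None or three is None:
        return model  # cDNA does not support both ends: leave model unchanged
    extended = list(model)
    extended[0] = (min(five[0], first[0]), extended[0][1])
    extended[-1] = (extended[-1][0], max(three[1], last[1]))
    # cDNA exons lying entirely outside the extended model become UTR exons.
    upstream = [c for c in cdna if c[1] <= extended[0][0]]
    downstream = [c for c in cdna if c[0] >= extended[-1][1]]
    return upstream + extended + downstream


with_utrs = add_utrs([(100, 200), (300, 400)], [(10, 40), (80, 200), (300, 450)])
```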

Genebuilder

Remove redundancy, collapse similar things together
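One simple way to picture "collapsing similar things together" is sketched below. It assumes that two multi-exon transcripts are redundant when they share exactly the same intron chain, and it keeps the one with the longest genomic span. This criterion and the function names are assumptions made for illustration; the slide does not say which rule Genebuilder applies.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # half-open [start, end)


def intron_chain(exons: List[Exon]) -> Tuple[Tuple[int, int], ...]:
    """Introns implied by consecutive exons, as (donor, acceptor) pairs."""
    ordered = sorted(exons)
    return tuple((ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1))


def span(exons: List[Exon]) -> int:
    return max(end for _, end in exons) - min(start for start, _ in exons)


def collapse_redundant(transcripts: List[List[Exon]]) -> List[List[Exon]]:
    best: Dict[tuple, List[Exon]] = {}
    for exons in transcripts:
        chain = intron_chain(exons)
        # Single-exon transcripts have no introns, so key them by the exon itself.
        key = chain if chain else ("single-exon", tuple(sorted(exons)))
        if key not in best or span(exons) > span(best[key]):
            best[key] = exons
    return list(best.values())


collapsed = collapse_redundant([
    [(100, 200), (300, 400)],
    [(150, 200), (300, 380)],  # same introns as the first, shorter span
    [(100, 200), (310, 400)],  # different acceptor site: a distinct transcript
])
```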

Genecombiner

Fill gaps with cDNA genes

[Diagram label: Preliminary geneset]

Pseudogenes

• Finds frameshift introns, removes repeats, retro-transposed genes.
• Gene tests:
  – If every transcript is pseudo, throw away all but the longest transcript and set it to pseudogene
  – If transcripts are mixed, throw away all pseudo transcripts, keep the rest and call the gene real
  – BLAST single-exon genes against a database of the multi-exon genes. Span of retro gene < 3x of real gene: label as pseudo
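The first two gene tests above can be sketched as a small decision function. The `Transcript` record and the `is_pseudo` flag are hypothetical: how a transcript is judged to be pseudo (frameshift introns, repeats) is outside this sketch, and the retro-transposed gene test is left out because the slide does not state its threshold unambiguously.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    name: str        # hypothetical identifier
    length: int      # transcript length in bases
    is_pseudo: bool  # result of upstream per-transcript tests (assumed given)


def classify_gene(transcripts: List[Transcript]) -> Tuple[str, List[Transcript]]:
    """Return the gene label and the transcripts that are kept."""
    if all(t.is_pseudo for t in transcripts):
        # All pseudo: keep only the longest transcript, label gene as pseudogene.
        longest = max(transcripts, key=lambda t: t.length)
        return "pseudogene", [longest]
    # Mixed (or all real): drop pseudo transcripts, keep the rest, call it real.
    return "real", [t for t in transcripts if not t.is_pseudo]


all_pseudo = [Transcript("t1", 900, True), Transcript("t2", 1500, True)]
mixed = [Transcript("t3", 900, True), Transcript("t4", 700, False)]
```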

Pseudogenes

• Eliminate retro-transposed (processed) pseudogenes

Query: 3   SRLLLNNGAKMPILGLGTWKSPPGQVTEAVKVAIDVGYRHIDCAHVYQNENEVGVAIQEK 62
           S ++LNNG K  +LGLGTWKSPPGQV EAVKVAI+  YRHIDC+HV+QN++      QE+
Sbjct: 2   SHIMLNNGTKTDMLGLGTWKSPPGQVAEAVKVAINTVYRHIDCSHVHQNKD------QEQ 55

Query: 63  LREQVVKREELFIVSKLWCTYHEKGLVKGACQKTLSDLKLDYLDLYLIHWPTGFKPGKEF 122
           L+EQVV+RE LFI+SK W   H K LV+G+C+K LS L+LDYLDL+LIHWPTG  PGKEF
Sbjct: 56  LKEQVVRREWLFIISKPWGICHRKCLVRGSCRKVLSGLELDYLDLHLIHWPTGCHPGKEF 115

Query: 123 FPLDESGNVVPSDTNILDTWAAMEELVDEGLVKAIGISNFNHLQVEMILNKPGLKYKPAV 182
             LDESG +                   +GLVKA GISNF HLQ E  LNK GLK
Sbjct: 116 SFLDESGLI-------------------QGLVKAAGISNF-HLQAERTLNKSGLKLSATG 155

Query: 183 NQIECHPYLTQEKLIQYCQSKGIVVTAYSPLGSPDRPWAKPEDPSLLEDPRIKAIAAKHN 242
                   LTQE LIQY QSK   VTAYSPLGSPDRP AKPEDPSLLEDPRIK IAAKHN
Sbjct: 156 RS------LTQENLIQYYQSKA-AVTAYSPLGSPDRPRAKPEDPSLLEDPRIKVIAAKHN 208

Query: 243 KTTAQVLIRFPMQRNLVVIPKSVTPERIAENFKVFDFELSSQDMTTLLSYNRN 295
           + T+QVL+    QRNLVV P SVT +RIAENFKVFDFELSSQDMT+LLS NRN
Sbjct: 209 E-TSQVLMWLLTQRNLVVTPTSVTLDRIAENFKVFDFELSSQDMTSLLSCNRN 260

Chromosome 18:8,535,635-8,536,757

Chromosome 7:133,584,367-133,601,097

ncRNA (tRNAs)

• Low sequence identity
• Families share conserved secondary structure
• Pipeline Rfam scan
• Run BLAST and Infernal at the genomic level

ncRNAs (miRNA)

• Highly conserved across species
• Identified using BLAST of genomic sequence vs miRBase precursors

Traditional Genebuild (Conclusions)

• Developed originally for human
• Exploits rich human-specific resources
• Protein, cDNA based
• Compute was a Really Big Issue in the past
• As we moved beyond “just” building human and mouse, scaling became a big issue.
• Increased build automation crucial - genebuild was pipelined after being a set of dodgy scripts for far too long.

Genebuild Issues

Data availability
Targetted build most useful in mouse, rat, human
Similarity build more important in other species

Structural Issues
Zebrafish: many similar genes near each other; genome from different haplotypes
C. briggsae: very dense genome; short introns
Mosquito: many single-exon genes; genes within genes

Solution: Configuration Files provide flexibility

Genebuild Issues

Gene level
• Proteins from very distant organisms can skew similarity build
• Spindly exons
• Non-consensus splice sites
• Targetted protein fragments masking similarity build

Problem: Spindly genes

No Miniseqs -> Use Fullseqs:

• Compute expensive

• Reduces gene merging

Problem: Non-consensus splice sites

Non-consensus splice sites common

Excess of alternative transcripts

Problem: Targetted protein fragments

More improvements

Homology Build:

Used to rescue fragmented genes
Compara homology pipeline: human, mouse, rat, dog and chicken.
Exonerate used to align orthologs

Incremental Build:

1. Targeted genes, Similarity genes, Homology genes
2. EST genes
3. Genes from the previous Build

Low Coverage Genebuild

• Genomes come in lots of scaffolds (cow 3x had 800K contigs in 450K scaffolds).

• None of our traditional approaches are much help. Applying the normal gene-build with low coverage cutoffs offers at best a set of gene fragments.

• Approach: use WGA (whole-genome alignment) to infer gene-scaffold assemblies
• New method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)

• Projection of genes from reference species.

Low Coverage Genebuild: From start to finish

(Starting point: a core database with repeat features)

(1) Raw alignment generation - BLASTZ
(2) Alignment chaining - Jim Kent’s axtChain
(3) Best-in-genome alignment filtering (“net”)
(4) Gene-scaffold inference and annotation projection
(5) Merging and extension of gene-scaffold
(6) Loading gene-scaffold assembly and annotation
(7) Run rest of analysis on revised top-level sequence regions

Method overview

[Diagram labels: Human Chr.; Human gene; Low coverage scaffold; Superscaffold; Projection; Human; Cow; NNNN]

Projection (Build over gaps)

[Diagram labels: Human; Cow; NNNNNNNNNNNNNNNNNNN]

Projection (Filtering)

• Filter out transcripts with less than 50% identity or less than 50% coverage
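The filtering rule can be written down directly. In this sketch the `ProjectedTranscript` record and its field names are hypothetical; only the two 50% thresholds come from the slide, and how identity and coverage are measured is assumed to be decided upstream.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProjectedTranscript:
    name: str        # hypothetical identifier
    identity: float  # percent identity to the source transcript
    coverage: float  # percent of the source transcript that was projected


def passes_filter(t: ProjectedTranscript,
                  min_identity: float = 50.0,
                  min_coverage: float = 50.0) -> bool:
    """Keep a transcript only if it reaches both thresholds."""
    return t.identity >= min_identity and t.coverage >= min_coverage


def filter_projections(transcripts: List[ProjectedTranscript]) -> List[ProjectedTranscript]:
    return [t for t in transcripts if passes_filter(t)]


candidates = [
    ProjectedTranscript("good", 85.0, 92.0),
    ProjectedTranscript("low_identity", 49.9, 90.0),
    ProjectedTranscript("low_coverage", 80.0, 30.0),
    ProjectedTranscript("borderline", 50.0, 50.0),
]
kept = filter_projections(candidates)
```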

[Diagram labels: Human; Cow; NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN]

Other cases (Chimp genebuild)

• Little chimp protein, cDNA and EST information.
• 6X high-coverage genome
• Based on low-coverage projection code.
• Takes advantage of human-chimp genome similarity.
• Human has the best annotated and studied genome.

Alternative Genebuild

[Flow diagram. Labels: Human transcripts → Projection → Projected transcripts; Human transcripts → Exonerate → Exonerated transcripts; Transcript merge; Human + Chimp cDNAs → Exonerate → Aligned cDNAs; transcripts with UTRs; Genebuilder → Core Ensembl gene set; Pseudogenes → Final set + pseudogenes]

Transcript Merge

• Preference for projected transcripts over exonerate.

[Diagram labels: Human; Projection; Exonerate; NNNNNNNNNNNNNNNNNNN]

Transcript merge

• Exonerate gene models selected where no projection was obtained.

• No human - chimp alignment for the region where the gene resides in human.
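The merge preference described above can be sketched as follows, assuming both sets of models are keyed by the human transcript they were derived from. The dictionary layout, the function name and the placeholder identifiers are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]
Model = List[Exon]


def merge_transcripts(projected: Dict[str, Model],
                      exonerated: Dict[str, Model]) -> Dict[str, Tuple[str, Model]]:
    """For each human transcript, prefer the projected model; fall back to
    the exonerate model only where no projection was obtained."""
    merged: Dict[str, Tuple[str, Model]] = {}
    for human_id in sorted(set(projected) | set(exonerated)):
        if human_id in projected:
            merged[human_id] = ("projection", projected[human_id])
        else:
            merged[human_id] = ("exonerate", exonerated[human_id])
    return merged


projected = {"humanT1": [(100, 200), (300, 400)]}
exonerated = {
    "humanT1": [(110, 200), (300, 390)],  # ignored: a projection exists
    "humanT2": [(1000, 1200)],            # used: no projection for this one
}
merged = merge_transcripts(projected, exonerated)
```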

Other cases (EST Genebuild)

• Ciona intestinalis and savignyi - few protein sequences but lots of ESTs and cDNAs
• Genebuilder not making best use of this resource
• TranscriptCoalescer developed

EST Genebuild

• TranscriptCoalescer tests conserved intron boundaries to join clusters together conservatively.
• Use ab initio data to confirm gene structures

[Screenshot labels: EST alignments; TranscriptCoalescer gene (included in Feb release); Genebuild found no genes in this area.]

Summary

• Ensembl: Open source genome browser and annotation tool.
• Automatic genome annotation
  – Traditional genebuild
  – Low Coverage genebuild
  – Alternative Genebuild
  – EST-based Genebuild

Ensembl Teams

Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Zebrafish Annotation: Kerstin Howe, Mario Caccamo, Ian Sealy
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Fernández Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerk Vogel, Kevin Howe, Felix Kokocinski, Simon White
Comparative Genomics: Abel Ureta-Vidal, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Javier Herrero Sánchez, Albert Vilella
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA), Matt Wood
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Distributed Annotation System (DAS): Andreas Kähäri, Eugene Kulesha
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl

Thank You!