Göteborg 01/12/2006
EnsEMBL and the process of genebuild
Julio Fernández Banet ([email protected]), Wellcome Trust Sanger Institute
EnsEMBL Group (Genebuild Team), 01 Dec 2006
Overview
• What is Ensembl?
  – Ensembl project
  – Open source
  – The genome browser
• Genebuild
  – Automatic vs. manual annotation
  – Traditional Genebuild
  – Special cases
Ensembl - Project
• Joint project
  – EMBL – European Bioinformatics Institute (EBI)
  – Wellcome Trust Sanger Institute
• Produce accurate, automatic genome annotation
• Integrate external (distributed) biological data
• Focused on selected eukaryotic genomes
• Presentation of the analysis to all via the Web at http://www.ensembl.org
• Open distribution of the analysis to the community
• Development of open, collaborative software (databases and APIs)
Species in Ensembl v41
APIs
• Used to retrieve data from and to store data in Ensembl databases.
• Ensembl Perl API:
  – Written in object-oriented Perl
  – Foundation for the Ensembl Pipeline and the Ensembl Web interface
• Ensembl Java API:
  – Written in Java, but similar in layout to the Perl API
  – Foundation for Apollo
  – No longer supported; development has stopped
Open source, open standards
• Object model
  – Standard interface makes it easy for others to build custom applications on top of Ensembl data
• Open discussion of design ([email protected])
• Most major pharma and many academics are represented on the mailing list, and code is being actively developed externally
• Ensembl locally (free for all)
  – Both industry & academia
Exploring genomes
• Browse genes in genomic context
• Display features in and around a particular gene
• Explore larger chromosome regions
• Search and retrieve information on a gene- and genome-scale
• Investigate genome organisation
• Compare genomes
Ensembl ContigView
ContigView - Close-up
Track labels: manual annotation via Vega; Ensembl predictions; Ensembl EST-based predictions
Transcripts: red & black (Ensembl predictions), blue (Vega) & gold (HAVANA)
Pop-up menu
Gene Level View
• GeneView
  – Orthologs
  – Paralogs
  – Transcript
  – Gene structure
• TransView
  – Transcript structure
  – Transcript sequence
  – Similarity matches
• ProtView
  – Protein sequence
  – Peptide stats
  – Domains
Much More
• Review documentation:
  – About Ensembl (publications, software licence)
  – Helpdesk (training courses and online workshops)
  – Ensembl Data
  – Software (API documentation)
• Archive (old release information)
• Data mining (BioMart)
• Export and download data
• Display your own data (DAS)
Automatic Genome Analysis and Annotation (GeneBuild)
Automatic vs Manual
Automatic annotation
• Quick whole-genome analysis (~ weeks)
• Consistent annotation
• Can use unfinished sequence or shotgun assembly
• No polyA sites or signals, or pseudogenes
• Predicts ~ 70% of loci
Manual annotation
• Extremely slow (~ 3 months for Chr 6)
• Most rules have exceptions
• Flexible, can deal with inconsistencies
• Needs finished sequence
• Consults publications as well as databases
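Traditional Genebuild
[Flowchart; box labels listed in pipeline order]
• Human proteins and other proteins → Genewise → Genewise genes
• Human cDNAs → Exonerate → aligned cDNAs
• Genewise genes + aligned cDNAs → Genewise genes with UTRs
• Genewise genes with UTRs + supported genscans (optional) → Genebuilder → preliminary gene set
• Aligned cDNAs → ClusterMerge → cDNA genes
• Preliminary gene set + cDNA genes → GeneCombiner → core Ensembl genes
• Core Ensembl genes → Pseudogenes → final set + pseudogenes
• Human ESTs → aligned ESTs → Ensembl EST genes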
Genewise
• Predicts gene structure from protein-to-genome alignments.
• High specificity, low sensitivity
• Targeted (species-specific proteins)
• Similarity (proteins from related species)
• High computational cost
Using miniseqs
[Figure: BLAST exon “seeds” on the genome are assembled into a “MiniSeq”; Genewise is run on the MiniSeq and the Genewise result is remapped to the genome.]
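The miniseq idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Ensembl pipeline code: the function names `build_miniseq` and `mini_to_genome`, the `pad` parameter and the 0-based half-open coordinates are all hypothetical choices made for the example.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # 0-based, half-open genomic interval (assumption)


def build_miniseq(genome: str, seeds: List[Block], pad: int = 5):
    """Concatenate padded seed regions into one short sequence.

    Returns the miniseq and the genomic blocks it was built from.
    """
    padded = sorted((max(0, s - pad), min(len(genome), e + pad)) for s, e in seeds)
    blocks: List[Block] = []
    for start, end in padded:
        if blocks and start <= blocks[-1][1]:
            # Overlapping or touching padded seeds are merged into one block.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            blocks.append((start, end))
    miniseq = "".join(genome[s:e] for s, e in blocks)
    return miniseq, blocks


def mini_to_genome(pos: int, blocks: List[Block]) -> int:
    """Remap a miniseq position back to its genomic position."""
    offset = 0
    for start, end in blocks:
        length = end - start
        if pos < offset + length:
            return start + (pos - offset)
        offset += length
    raise IndexError("position lies outside the miniseq")


# Toy genome with two "exons" (CCCC and GGGG) separated by a long "intron".
genome = "A" * 10 + "CCCC" + "A" * 20 + "GGGG" + "A" * 10
miniseq, blocks = build_miniseq(genome, [(10, 14), (34, 38)], pad=2)
print(miniseq, blocks, mini_to_genome(10, blocks))
```

The alignment program then only sees the short miniseq instead of the full genomic region, which is where the saving in compute comes from; every reported coordinate has to go through the remapping step afterwards.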
Exonerate
Alignment of species-specific cDNAs and ESTs with the genome.
Good gene models.
Faster than Genewise for DNA alignments.
UTR addition
[Figure: human cDNAs aligned over a gene (Genewise model)]
Use cDNA alignments to extend Genewise models and add UTRs
Genebuilder
Remove redundancy, collapse similar things together
Genecombiner
Fill gaps with cDNA genes
Preliminary geneset
Pseudogenes
• Finds frameshift introns, removes repeats, retro-transposed genes.
• Gene tests:
  – If every transcript is pseudo, throw away all but the longest transcript and set it to pseudogene
  – If transcripts are mixed, throw away all pseudo transcripts, keep the rest, and call the gene real
  – BLAST single-exon genes against a database of the multi-exon genes. Span of retro gene < 3x of real gene: label as pseudo
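The first two gene tests amount to a small decision rule, sketched here in Python. This is a minimal illustration, not the Ensembl pseudogene module: the `Transcript` class, its `is_pseudo` flag and the function name `classify_gene` are hypothetical, and the third (BLAST-based retro gene) test is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Transcript:
    name: str
    length: int
    is_pseudo: bool  # hypothetical flag: True if the transcript looks like a pseudo transcript


def classify_gene(transcripts: List[Transcript]) -> Tuple[str, List[Transcript]]:
    """Return (label, kept transcripts) for one gene."""
    if not transcripts:
        raise ValueError("a gene needs at least one transcript")
    if all(t.is_pseudo for t in transcripts):
        # Every transcript is pseudo: keep only the longest, label the gene a pseudogene.
        longest = max(transcripts, key=lambda t: t.length)
        return "pseudogene", [longest]
    # Mixed: drop the pseudo transcripts, keep the rest, call the gene real.
    return "real", [t for t in transcripts if not t.is_pseudo]


all_pseudo = [Transcript("t1", 900, True), Transcript("t2", 1500, True)]
mixed = [Transcript("t3", 900, True), Transcript("t4", 1200, False)]
print(classify_gene(all_pseudo))
print(classify_gene(mixed))
```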
Pseudogenes
• Eliminate retro-transposed (processed) pseudogenes
Query: 3   SRLLLNNGAKMPILGLGTWKSPPGQVTEAVKVAIDVGYRHIDCAHVYQNENEVGVAIQEK 62
           S ++LNNG K  +LGLGTWKSPPGQV EAVKVAI+  YRHIDC+HV+QN++      QE+
Sbjct: 2   SHIMLNNGTKTDMLGLGTWKSPPGQVAEAVKVAINTVYRHIDCSHVHQNKD------QEQ 55
Query: 63  LREQVVKREELFIVSKLWCTYHEKGLVKGACQKTLSDLKLDYLDLYLIHWPTGFKPGKEF 122
           L+EQVV+RE LFI+SK W   H K LV+G+C+K LS L+LDYLDL+LIHWPTG  PGKEF
Sbjct: 56  LKEQVVRREWLFIISKPWGICHRKCLVRGSCRKVLSGLELDYLDLHLIHWPTGCHPGKEF 115
Query: 123 FPLDESGNVVPSDTNILDTWAAMEELVDEGLVKAIGISNFNHLQVEMILNKPGLKYKPAV 182
             LDESG +                   +GLVKA GISNF HLQ E  LNK GLK
Sbjct: 116 SFLDESGLI-------------------QGLVKAAGISNF-HLQAERTLNKSGLKLSATG 155
Query: 183 NQIECHPYLTQEKLIQYCQSKGIVVTAYSPLGSPDRPWAKPEDPSLLEDPRIKAIAAKHN 242
                   LTQE LIQY QSK   VTAYSPLGSPDRP AKPEDPSLLEDPRIK IAAKHN
Sbjct: 156 RS------LTQENLIQYYQSKA-AVTAYSPLGSPDRPRAKPEDPSLLEDPRIKVIAAKHN 208
Query: 243 KTTAQVLIRFPMQRNLVVIPKSVTPERIAENFKVFDFELSSQDMTTLLSYNRN 295
           + T+QVL+    QRNLVV P SVT +RIAENFKVFDFELSSQDMT+LLS NRN
Sbjct: 209 E-TSQVLMWLLTQRNLVVTPTSVTLDRIAENFKVFDFELSSQDMTSLLSCNRN 260
Chromosome 18:8,535,635-8,536,757
Chromosome 7:133,584,367-133,601,097
ncRNA (tRNAs)
• Low sequence identity
• Families share conserved secondary structure
• Pipeline Rfam scan
• Run BLAST and Infernal at the genomic level
ncRNAs (miRNA)
• Highly conserved across species
• Identified using BLAST of genomic sequence vs miRBase precursors
Traditional Genebuild (Conclusions)
• Developed originally for human
• Exploits rich human-specific resources
• Protein and cDNA based
• Compute was a Really Big Issue in the past
• As we moved beyond “just” building human and mouse, scaling became a big issue.
• Increased build automation was crucial: genebuild was pipelined after being a set of dodgy scripts for far too long.
Genebuild Issues
Data availability
• Targeted build most useful in mouse, rat, human
• Similarity build more important in other species
Structural issues
• Zebrafish: many similar genes near each other; genome from different haplotypes
• C. briggsae: very dense genome; short introns
• Mosquito: many single-exon genes; genes within genes
Solution: configuration files provide flexibility
Genebuild Issues
Gene level
• Proteins from very distant organisms can skew the similarity build
• Spindly exons
• Non-consensus splice sites
• Targeted protein fragments masking the similarity build
Problem: Spindly genes
No Miniseqs -> Use Fullseqs:
• Compute expensive
• Reduces gene merging
Problem: Non-consensus splice sites
Non-consensus splice sites common
Excess of alternative transcripts
Problem: Targeted protein fragments
More improvements
Homology Build:
• Used to rescue fragmented genes
• Compara homology pipeline: human, mouse, rat, dog and chicken
• Exonerate used to align orthologs
Incremental Build:
1. Targeted genes, Similarity genes, Homology genes
2. EST genes
3. Genes from the previous build
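One way to read the incremental build is as a priority list: a gene from a lower tier is only taken where nothing from an earlier tier already sits. The Python sketch below illustrates that reading. The overlap rule, the `(seq_region, start, end)` tuples and the function names are assumptions made for the example and are not taken from the Ensembl code.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int]  # (seq_region, start, end), inclusive coordinates (assumption)


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes share a sequence region and at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def incremental_build(tiers: List[List[Gene]]) -> List[Gene]:
    """Accept genes tier by tier, skipping any gene that overlaps one already accepted.

    Note that in this sketch a gene can also be skipped because of an
    earlier gene from its own tier.
    """
    accepted: List[Gene] = []
    for tier in tiers:
        for gene in tier:
            if not any(overlaps(gene, kept) for kept in accepted):
                accepted.append(gene)
    return accepted


protein_based = [("chr1", 100, 500)]                       # tier 1
est_genes = [("chr1", 400, 900), ("chr1", 2000, 2600)]     # tier 2
previous_build = [("chr1", 2500, 3000), ("chr2", 10, 90)]  # tier 3
print(incremental_build([protein_based, est_genes, previous_build]))
```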
Low Coverage Genebuild
Low Coverage Genebuild
• Genomes come in lots of scaffolds (cow 3x had 800K contigs in 450K scaffolds).
• None of our traditional approaches are much help. Applying the normal gene-build with low coverage cut-offs offers at best a set of gene fragments.
• Approach: use WGA (whole-genome alignment) to infer gene-scaffold assemblies
• New method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
• Projection of genes from reference species.
Low Coverage Genebuild: from start to finish
(Starting point: a core database with repeat features)
(1) Raw alignment generation - BLASTZ
(2) Alignment chaining - Jim Kent’s axtChain
(3) Best-in-genome alignment filtering (“net”)
(4) Gene-scaffold inference and annotation projection
(5) Merging and extension of gene-scaffold
(6) Loading gene-scaffold assembly and annotation
(7) Run rest of analysis on revised top-level sequence regions
Method overview
[Figure: low-coverage cow scaffolds, separated by gaps (NNNN), are aligned to a human chromosome and joined into a superscaffold; the human gene is then projected onto it.]
Projection (Build over gaps)
[Figure: human gene projected onto cow sequence that contains a gap (NNNN...)]
Projection (Filtering)
• Filter out transcripts with less than 50% identity or less than 50% coverage
[Figure: human-to-cow projections, including one across a gap (NNNN...)]
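The filtering rule is a simple pair of thresholds. A minimal Python sketch follows, assuming each projected transcript carries percent identity and percent coverage values; the dictionary keys and the function name `filter_projected` are hypothetical, and only the two 50% cut-offs come from the slide.

```python
from typing import Dict, List


def filter_projected(
    transcripts: List[Dict],
    min_identity: float = 50.0,
    min_coverage: float = 50.0,
) -> List[Dict]:
    """Keep transcripts that reach both the identity and the coverage cut-off."""
    return [
        t
        for t in transcripts
        if t["identity"] >= min_identity and t["coverage"] >= min_coverage
    ]


projected = [
    {"name": "p1", "identity": 92.0, "coverage": 88.0},  # kept
    {"name": "p2", "identity": 45.0, "coverage": 95.0},  # identity too low
    {"name": "p3", "identity": 80.0, "coverage": 30.0},  # coverage too low, e.g. mostly gap
]
print([t["name"] for t in filter_projected(projected)])
```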
Other cases (Chimp genebuild)
• Little chimp protein, cDNA and EST information.
• 6X high-coverage genome
• Based on the low-coverage projection code.
• Takes advantage of human-chimp genome similarity.
• Human has the best annotated and studied genome.
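Alternative Genebuild
[Flowchart; box labels listed in pipeline order]
• Human transcripts → Projection → projected transcripts
• Human transcripts → Exonerate → exonerated transcripts
• Projected transcripts + exonerated transcripts → Transcript merge
• Human + chimp cDNAs → Exonerate → aligned cDNAs
• Merged transcripts + aligned cDNAs → transcripts with UTRs
• Transcripts with UTRs → Genebuilder → core Ensembl gene set
• Core Ensembl gene set → Pseudogenes → final set + pseudogenes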
Transcript Merge
• Preference for projected transcripts over Exonerate.
[Figure: projection and Exonerate models for a human gene across a gap (NNNN...)]
Transcript merge
• Exonerate gene models selected where no projection was obtained.
• No human-chimp alignment for the region where the gene resides in human.
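The merge preference can be written as a fallback rule. The Python sketch below is an illustration only: it assumes that both model sets are keyed by the human gene they derive from, which is a simplification chosen for the example, and the function name `merge_transcripts` is hypothetical.

```python
from typing import Dict, Tuple


def merge_transcripts(
    projected: Dict[str, str],
    exonerated: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """For each human gene, prefer the projected model; fall back to Exonerate."""
    merged: Dict[str, Tuple[str, str]] = {}
    for gene_id in sorted(set(projected) | set(exonerated)):
        if gene_id in projected:
            merged[gene_id] = ("projection", projected[gene_id])
        else:
            # No projection, e.g. no human-chimp alignment for this region.
            merged[gene_id] = ("exonerate", exonerated[gene_id])
    return merged


projected = {"geneA": "projA", "geneB": "projB"}
exonerated = {"geneB": "exoB", "geneC": "exoC"}
print(merge_transcripts(projected, exonerated))
```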
Other cases (EST Genebuild)
• Ciona intestinalis and Ciona savignyi: few protein sequences but lots of ESTs and cDNAs
• Genebuilder was not making best use of this resource
• TranscriptCoalescer developed
EST Genebuild
• TranscriptCoalescer tests conserved intron boundaries to join clusters together conservatively.
• Use ab initio data to confirm gene structures
[Figure: EST alignments and the resulting TranscriptCoalescer gene in an area where Genebuild found no genes. Included in Feb release.]
Summary
• Ensembl: open source genome browser and annotation tool.
• Automatic genome annotation
  – Traditional genebuild
  – Low coverage genebuild
  – Alternative genebuild
  – EST-based genebuild
Ensembl Teams
Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino
VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
Zebrafish Annotation: Kerstin Howe, Mario Caccamo, Ian Sealy
Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Fernández Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerk Vogel, Kevin Howe, Felix Kokocinski, Simon White
Comparative Genomics: Abel Ureta-Vidal, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Javier Herrero Sánchez, Albert Vilella
Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA), Matt Wood
Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
Distributed Annotation System (DAS): Andreas Kähäri, Eugene Kulesha
BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haldar
Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl