EBI is an Outstation of the European Molecular Biology Laboratory.
Gautier Koscielny
VectorBase Meeting
08 February 2012, EBI
VectorBase Text Search Engine
Gautier Koscielny - VectorBase Search Engine, Wednesday, 8 February 2012
History of text search
• Up to 2009:
  • Notre Dame University maintained the main site text search
  • At the time, there was no text search module available in the version of Ensembl installed.
• In 2010:
  • The Ensembl installation was updated to reflect the latest Ensembl Genomes installation.
  • Text search technology available
  • At the time, Ensembl search was based on the EB-EYE indices
Challenges in 2010
• How to integrate the new Lucene EB-EYE indices in the main site?
• Multiple VectorBase sources to index (expression, community annotations, etc.)
• Relied on goodwill from external services to update the EB-EYE indices from VectorBase core databases
• Relied on an XML dump of the core database
  • Time-consuming task
  • Difficult to index new datatypes or resources
Requirements
• Framework to generate indices at any time
  • Can reflect new community annotations (CAP)
  • Ontology information
  • New datasources: literature
• Search to serve Lucene indices from different providers:
  • Gene annotation, x-refs, comparative genomics data (EBI)
  • Microarray and gene expression data (Imperial)
  • CAP (Notre Dame)
• Indexing must be fast, easy to use and maintain
• Search can be plugged into different tools:
  • Main VectorBase website
  • Ensembl genome browser
Architecture
[Architecture diagram. Recoverable labels: Data sources (Ensembl, FuncGen, CAP); Lucene indices (three index files; EBI, Imperial, Notre Dame); VectorBase Search Service Layer; SOAP; Clients.]
What is being searched?
• Genomic information (Ensembl databases)
  • Gene models
  • Variation
  • Probes
  • Orthologs
• Expression data (Imperial)
• CAP
• Ontologies (idomal, miro, anatomy)
• Population genomics (Imperial)
Generating Ensembl indices at the EBI
• Based on a direct connection to the database(s)
• Uses a configuration file containing the description of objects and their types
  • Database connection (staging-1, …)
  • Database type (core, funcgen, variation)
  • Genome (aedes_aegypti)
  • Homologies
• Each object in the configuration file is represented by a Java class
• The configuration loader automatically creates an instance of each type using the class loader.
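The loading step described above can be sketched in plain Java. This is a minimal illustration only: the class names (`ConfigLoaderSketch`, `DatabaseConnection`, `Genome`), the `ConfigObject` interface and the one-line-per-object configuration format are all assumptions made for the example, not the actual VectorBase code or file format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and config format are assumptions, not the VectorBase code.
class ConfigLoaderSketch {

    /** Common supertype for every object that can appear in the configuration. */
    interface ConfigObject {
        void setValue(String value);
        String describe();
    }

    public static class DatabaseConnection implements ConfigObject {
        private String value;
        public void setValue(String value) { this.value = value; }
        public String describe() { return "connection=" + value; }
    }

    public static class Genome implements ConfigObject {
        private String value;
        public void setValue(String value) { this.value = value; }
        public String describe() { return "genome=" + value; }
    }

    /** Each line is "<SimpleClassName> <value>"; the class is resolved by name at run time. */
    static List<ConfigObject> load(List<String> lines) {
        List<ConfigObject> objects = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            // Nested classes have binary names of the form Outer$Inner.
            String className = ConfigLoaderSketch.class.getName() + "$" + parts[0];
            try {
                Class<?> type = Class.forName(className);
                ConfigObject object = (ConfigObject) type.getDeclaredConstructor().newInstance();
                object.setValue(parts.length > 1 ? parts[1] : "");
                objects.add(object);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Unknown configuration type: " + parts[0], e);
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        List<String> config = List.of("DatabaseConnection staging-1", "Genome aedes_aegypti");
        for (ConfigObject object : load(config)) {
            System.out.println(object.describe());
        }
    }
}
```

Resolving classes by name means a new object type only needs a new class and a new configuration line; the loader itself does not change.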
Example of configuration file
Procedure (for Ensembl indices)
Databases: core, funcgen, variation, compara
1. If compara is defined, get all homologies
2. For each genome in turn:
   3. Get all gene, transcript, exons, proteins, xrefs information from core
   4. Get all reporters from funcgen and their mapping to gene models
   5. Get all variations and relation to gene models
   6. Associate all existing homologies to the genes
   7. Create a Lucene Document for all genes
8. The indices are copied to Notre Dame University
9. Tomcat instance is restarted
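Steps 1 to 7 can be sketched as a loop over genomes and genes. The `Source` interface below is a hypothetical stand-in for the four databases, the field names are invented for the example, and a "document" is modelled as a plain map rather than a Lucene `Document`. The sketch also looks data up gene by gene for brevity, whereas the procedure above fetches each data type in bulk per genome.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of steps 1 to 7; copying indices and restarting Tomcat are not modelled.
class IndexingProcedureSketch {

    /** Stand-in for the Ensembl databases; real code would query them directly. */
    interface Source {
        Map<String, List<String>> homologies();   // compara: gene id -> homolog ids
        List<String> genes(String genome);        // core
        List<String> reporters(String geneId);    // funcgen
        List<String> variations(String geneId);   // variation
    }

    /** Builds one flat document (field name -> values) per gene. */
    static List<Map<String, List<String>>> buildDocuments(
            Source source, List<String> genomes, boolean comparaDefined) {
        // Step 1: homologies are loaded once, up front, only if compara is configured.
        Map<String, List<String>> homologies = new LinkedHashMap<>();
        if (comparaDefined) {
            homologies.putAll(source.homologies());
        }
        List<Map<String, List<String>>> documents = new ArrayList<>();
        for (String genome : genomes) {                         // step 2
            for (String geneId : source.genes(genome)) {        // step 3
                Map<String, List<String>> doc = new LinkedHashMap<>();
                doc.put("id", List.of(geneId));
                doc.put("species", List.of(genome));
                doc.put("funcgen_xrefs", source.reporters(geneId));      // step 4
                doc.put("variation_xrefs", source.variations(geneId));   // step 5
                doc.put("homologs", homologies.getOrDefault(geneId, List.of())); // step 6
                documents.add(doc);                             // step 7
            }
        }
        return documents;
    }

    /** Tiny canned data set, for demonstration only. */
    static Source demoSource() {
        return new Source() {
            public Map<String, List<String>> homologies() {
                return Map.of("GENE1", List.of("HOMOLOG_A"));
            }
            public List<String> genes(String genome) { return List.of("GENE1", "GENE2"); }
            public List<String> reporters(String geneId) {
                List<String> probes = new ArrayList<>();
                if (geneId.equals("GENE1")) probes.add("PROBE_1");
                return probes;
            }
            public List<String> variations(String geneId) { return new ArrayList<>(); }
        };
    }

    public static void main(String[] args) {
        for (Map<String, List<String>> doc
                : buildDocuments(demoSource(), List.of("aedes_aegypti"), true)) {
            System.out.println(doc);
        }
    }
}
```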
Ensembl object mapping in Java
• Ensembl concepts are mapped to equivalent Java data access objects (DAO)
• All Ensembl concepts are stored in memory and removed when a Lucene Document is created
[Class diagram. Recoverable labels: EnsemblFeature; Gene (extends, contains); Transcripts, translations, exons; Homology (extends); Xref (contains).]
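One possible reading of this mapping, sketched in Java. The relationships are assumptions inferred from the diagram labels (Gene and Homology extending EnsemblFeature, Gene containing transcripts, xrefs and homologs), and all field names are invented for the example. The `toDocumentText` method illustrates the second bullet: once a gene has been flattened into its document, the child objects are released.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the hierarchy is an assumption based on the diagram labels.
class EnsemblModelSketch {

    /** Base type: anything with a stable identifier. */
    static class EnsemblFeature {
        final String stableId;
        EnsemblFeature(String stableId) { this.stableId = stableId; }
    }

    static class Xref {
        final String database;
        final String accession;
        Xref(String database, String accession) {
            this.database = database;
            this.accession = accession;
        }
    }

    static class Homology extends EnsemblFeature {
        final String species;
        Homology(String stableId, String species) {
            super(stableId);
            this.species = species;
        }
    }

    static class Gene extends EnsemblFeature {
        final List<String> transcriptIds = new ArrayList<>();
        final List<Xref> xrefs = new ArrayList<>();
        final List<Homology> homologs = new ArrayList<>();
        Gene(String stableId) { super(stableId); }
    }

    /** Flattens a gene into searchable text, then clears its child collections. */
    static String toDocumentText(Gene gene) {
        StringBuilder text = new StringBuilder(gene.stableId);
        for (String id : gene.transcriptIds) {
            text.append(' ').append(id);
        }
        for (Xref xref : gene.xrefs) {
            text.append(' ').append(xref.database).append(':').append(xref.accession);
        }
        for (Homology homolog : gene.homologs) {
            text.append(' ').append(homolog.stableId);
        }
        // The in-memory objects are no longer needed once the document exists.
        gene.transcriptIds.clear();
        gene.xrefs.clear();
        gene.homologs.clear();
        return text.toString();
    }

    public static void main(String[] args) {
        Gene gene = new Gene("G1");
        gene.transcriptIds.add("T1");
        gene.xrefs.add(new Xref("DB", "ACC1"));
        gene.homologs.add(new Homology("H1", "other_species"));
        System.out.println(toDocumentText(gene));
    }
}
```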
Creating a Lucene document
• A document is the unit of indexing: a container of fields
• Each document defines one or several fields
• The framework creates a document per gene
• Each field can store its value (or not)
• Each field can be indexed (or not)
• Stored text can be compressed.
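The difference between a stored field and an indexed field can be shown with a toy model in plain Java. This is not the Lucene API: `FieldFlagsSketch` and its `Field` class are hypothetical stand-ins that only mimic the two flags, and compression is not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model of "stored" versus "indexed"; a stand-in for illustration, not Lucene itself.
class FieldFlagsSketch {

    static class Field {
        final String name;
        final String value;
        final boolean stored;   // value can be read back from a hit
        final boolean indexed;  // value can be searched
        Field(String name, String value, boolean stored, boolean indexed) {
            this.name = name;
            this.value = value;
            this.stored = stored;
            this.indexed = indexed;
        }
    }

    /** Inverted index: lower-cased token -> ids of the documents containing it. */
    private final Map<String, List<Integer>> postings = new HashMap<>();
    /** Stored values, one map per document, in insertion order. */
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    int addDocument(List<Field> fields) {
        int docId = storedFields.size();
        Map<String, String> stored = new LinkedHashMap<>();
        for (Field field : fields) {
            if (field.stored) {
                stored.put(field.name, field.value);
            }
            if (field.indexed) {
                for (String token : field.value.toLowerCase(Locale.ROOT).split("\\W+")) {
                    if (token.isEmpty()) continue;
                    List<Integer> ids = postings.computeIfAbsent(token, k -> new ArrayList<>());
                    // Avoid listing the same document twice for a repeated token.
                    if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                        ids.add(docId);
                    }
                }
            }
        }
        storedFields.add(stored);
        return docId;
    }

    List<Integer> search(String token) {
        return postings.getOrDefault(token.toLowerCase(Locale.ROOT), List.of());
    }

    Map<String, String> stored(int docId) {
        return storedFields.get(docId);
    }

    public static void main(String[] args) {
        FieldFlagsSketch index = new FieldFlagsSketch();
        index.addDocument(List.of(
                new Field("id", "GENE1", true, false),                       // stored only
                new Field("description", "odorant receptor", false, true))); // indexed only
        System.out.println(index.search("odorant")); // finds the document
        System.out.println(index.search("gene1"));   // no hit: the id was not indexed
        System.out.println(index.stored(0));         // only the id can be read back
    }
}
```

A field that is indexed but not stored keeps the index small while remaining searchable; a field that is stored but not indexed is available for display only.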
Gene Document
• Fields:
  • Gene id, name, description
  • Coordinates: seq region name, start, end
  • Species, feature type (gene), source (biotype), genomic unit
  • Transcript count, transcript stable ids
  • Exon count, exon stable ids
  • Peptide count, peptide stable ids, domains
  • Core xrefs
  • Variation xrefs (if available)
  • Funcgen xrefs (if available)
  • Compara homologs (if available)
CAP indices
• A GFF parser extracts gene and transcript models.
• Name, description, submitter, chromosome location are indexed.
• Very fast
• Could be updated overnight if required.
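A minimal sketch of such a parser, assuming GFF3-style input (nine tab-separated columns, with `key=value` attributes in the last column). The slides do not say which GFF version or attribute names CAP uses, so the `submitter` attribute, the `mRNA` feature type for transcripts and the `seqid:start-end` location format are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch assuming GFF3-style lines; attribute names are illustrative.
class GffLineSketch {

    /** Returns the fields to index for a gene or mRNA line, or null for any other line. */
    static Map<String, String> parse(String line) {
        if (line.startsWith("#")) {
            return null; // comment or directive
        }
        String[] cols = line.split("\t");
        if (cols.length != 9) {
            return null; // not a complete feature line
        }
        String type = cols[2];
        if (!type.equals("gene") && !type.equals("mRNA")) {
            return null; // only gene and transcript models are indexed
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("type", type);
        // Columns 1, 4 and 5 are the sequence id, start and end.
        fields.put("location", cols[0] + ":" + cols[3] + "-" + cols[4]);
        for (String pair : cols[8].split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "2L\tCAP\tgene\t100\t900\t.\t+\t.\tID=g1;Name=example_gene;submitter=jdoe";
        System.out.println(parse(line));
    }
}
```

Because each line is handled independently and nothing is queried from a database, a parser of this shape runs in a single pass over the file, which fits the "very fast" remark above.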
Expression data/Population genomics
• Constructed by Bob McCallum (Imperial)
Ontologies
• Ontology terms are indexed.
• An OBO parser extracts each term in turn.
• Accession, name, description are parsed by default
• Extra fields are parsed depending on the completeness of each term.
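A minimal sketch of term extraction from OBO flat-file text, where each term is a `[Term]` stanza of `tag: value` lines. The class name and the choice to join repeated tags with "; " are assumptions for the example; the term shown in `main` is made up, not taken from a real ontology.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: collects every tag of each [Term] stanza as a candidate field.
class OboTermSketch {

    /** Returns one tag -> value map per [Term] stanza; repeated tags are joined with "; ". */
    static List<Map<String, String>> parse(String obo) {
        List<Map<String, String>> terms = new ArrayList<>();
        Map<String, String> current = null;
        for (String raw : obo.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("[")) {
                // A new stanza starts; only [Term] stanzas are collected.
                if (line.equals("[Term]")) {
                    current = new LinkedHashMap<>();
                    terms.add(current);
                } else {
                    current = null;
                }
            } else if (current != null && !line.isEmpty() && !line.startsWith("!")) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    String tag = line.substring(0, colon);
                    String value = line.substring(colon + 1).trim();
                    current.merge(tag, value, (a, b) -> a + "; " + b);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String obo = "format-version: 1.2\n\n"
                + "[Term]\n"
                + "id: EX:0000001\n"
                + "name: example term\n"
                + "def: \"An invented definition.\" []\n";
        for (Map<String, String> term : parse(obo)) {
            System.out.println(term);
        }
    }
}
```

Terms that carry extra tags simply yield larger maps, which mirrors the point that the fields indexed depend on how complete each term is.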
15
16
Gautier Koscielny - VectorBase Search EngineWednesday, 8 February 2012
SOAP interface
• 2 procedures: getNbOfResults, getResults (see wiki)
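The slides give only the two procedure names, so the sketch below guesses at their shape: a count call and a paged fetch. The parameter lists, the return types and the in-memory implementation are all hypothetical; the real signatures are the ones on the wiki.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: only the two method names come from the slides.
class SearchServiceSketch {

    /** Assumed signatures, for illustration only. */
    interface SearchService {
        int getNbOfResults(String query);
        List<String> getResults(String query, int start, int size);
    }

    /** In-memory stand-in that matches by case-insensitive substring. */
    static class InMemorySearchService implements SearchService {
        private final List<String> entries;

        InMemorySearchService(List<String> entries) {
            this.entries = entries;
        }

        private List<String> matches(String query) {
            String needle = query.toLowerCase(Locale.ROOT);
            List<String> hits = new ArrayList<>();
            for (String entry : entries) {
                if (entry.toLowerCase(Locale.ROOT).contains(needle)) {
                    hits.add(entry);
                }
            }
            return hits;
        }

        public int getNbOfResults(String query) {
            return matches(query).size();
        }

        public List<String> getResults(String query, int start, int size) {
            List<String> hits = matches(query);
            // Clamp the requested page to the available hits.
            int from = Math.min(Math.max(start, 0), hits.size());
            int to = Math.min(from + Math.max(size, 0), hits.size());
            return new ArrayList<>(hits.subList(from, to));
        }
    }

    public static void main(String[] args) {
        SearchService service = new InMemorySearchService(
                List.of("odorant receptor 1", "odorant receptor 2", "heat shock protein"));
        System.out.println(service.getNbOfResults("odorant"));
        System.out.println(service.getResults("odorant", 0, 1));
    }
}
```

Separating the count from the fetch lets a client show totals immediately and request result pages only when needed.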
To do list
• Front-end:
  • All domains should be queried to produce an ‘Entrez’-like page.
  • So, search all by default and display count per domain
  • Could be a very simple result page (see next slide for mock-up)
• Updates:
  • We could update some of the domains more frequently
  • CAP is a good candidate.
• Other technologies:
  • Other technologies can be used
  • Auto-completion
  • SOLR
Result page
Genome (1693)
Expression (3693)
Ontology (70)
Population (30)
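A page of this shape could be produced by sending the same query to every domain and printing one count per domain. The sketch below is a hypothetical illustration: each domain is represented by a function from query to hit count, and the counts in `main` are canned values copied from the mock-up, not real search results.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch of an 'Entrez'-like summary: one count per search domain.
class ResultSummarySketch {

    /** Runs the query against every domain and renders one "Domain (count)" line each. */
    static String summarize(String query, Map<String, ToIntFunction<String>> domains) {
        StringBuilder page = new StringBuilder();
        for (Map.Entry<String, ToIntFunction<String>> domain : domains.entrySet()) {
            int count = domain.getValue().applyAsInt(query);
            page.append(domain.getKey()).append(" (").append(count).append(")\n");
        }
        return page.toString();
    }

    public static void main(String[] args) {
        // Canned counts taken from the mock-up; a real client would call the search service.
        Map<String, ToIntFunction<String>> domains = new LinkedHashMap<>();
        domains.put("Genome", q -> 1693);
        domains.put("Expression", q -> 3693);
        domains.put("Ontology", q -> 70);
        domains.put("Population", q -> 30);
        System.out.print(summarize("example query", domains));
    }
}
```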