EB-eye Back End
1. The new EBI search engine: EB-eye Backend
An overview of what is under the hood
Franck Valentin, External Services group
Bioinformatics masters' students Open Day
2. Summary
- What is available
- Parsing
- Indexing challenge
- Software behind EB-eye
- Acknowledgements
3. What is the data available?
- ArrayExpress, Ligand, Interpro, ...
- > 20 domains
- > 130M entries
- > 550 Gb of data
4. What is the data available: formats
[Figure: sample records from ArrayExpress, Ligand and Interpro, showing field layouts such as "ID: .. PARENT ID: .. RANK: .." and repeated "ID ... AC ... DT ..." lines]
5. What is the data available: sizes
[Figure: file sizes per database (only the Interpro label survives): 43M, 4.2G, 1G, 8.4G, 81M; 57Gb in >500 files; 374Gb in >600 files]
6. Points to take into consideration
- Our World
  - A variety of file formats
  - A large amount of data
  - A variety of file sizes
  - Data formats are changing
- Our Quest
  - Index the data as fast as possible
  - Add and configure a new domain easily
  - Detect errors in the data
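
The tension between "a variety of file sizes" and "index the data as fast as possible" can be illustrated with a small scheduling sketch. It is not taken from the EB-eye code base: the function name `partition_by_size`, the file names and the sizes are all hypothetical, and the greedy "largest file first, lightest worker next" rule is simply one common way to balance uneven work across a fixed number of workers.

```python
import heapq


def partition_by_size(files, workers):
    """Split (name, size) pairs into `workers` groups of roughly equal total size.

    Greedy rule: take files from largest to smallest and always give the
    next file to the group whose running total is currently the smallest.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Min-heap of (running_total, group_index): the lightest group pops first.
    heap = [(0, i) for i in range(workers)]
    groups = [[] for _ in range(workers)]
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return groups


# Hypothetical input: names and sizes (in MB) are made up for illustration.
files = [
    ("embl_001.dat", 900),
    ("embl_002.dat", 700),
    ("medline_01.xml", 120),
    ("medline_02.xml", 110),
    ("taxonomy.dat", 81),
    ("go.xml", 27),
]
for worker, names in enumerate(partition_by_size(files, 3)):
    print(worker, names)
```

With this rule the two large files each occupy one worker while the small files share the third, so no worker sits idle behind a single oversized input.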
7. Parsing and indexing different formats
[Diagram: flat files, XML files, a database (Db) and XML dump files are read by grammar-driven parsers, which feed an indexer built on the Lucene API]
- Parser (ANTLR), flat files: EMBL grammar, Taxonomy grammar, Uniprot grammar, ...
- Parser (ANTXR), XML files: Medline grammar, Interpro grammar, Dump file grammar, ...
- Indexer (Lucene API): Uniprot Index, Embl Index, Taxonomy Index
Flat files (sample EMBL entry):
    ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
    XX
    AC   AF030562;
    XX
    DT   04-DEC-1997 (Rel. 53, Created)
    DT   03-MAR-2000 (Rel. 62, Last updated, Version 2)
    XX
    DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
    DE   OPW-03, sequence tagged site.
    XX
    KW   STS.
    XX
    OS   Fusarium venenatum
    OC   Eukaryota; Fungi; Ascomycota; Pezizomycotina; Sordariomycetes;
    OC   Hypocreomycetidae; Hypocreales; mitosporic Hypocreales; Fusarium.
    XX
    RN   [1]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   "Species-specific primers resolve members of the section Fusarium.
    RT   Taxonomic status of the edible 'Quorn' fungus re-evaluated";
    RL   Fungal Genet. Biol. 0:0-0(1997).
    XX
    RN   [2]
    RP   1-852
    RA   Yoder W.T., Christianson L.M.;
    RT   ;
    RL   Submitted (21-OCT-1997) to the EMBL/GenBank/DDBJ databases.
    RL   Microbiology, Novo Nordisk Biotech, Inc., 1445 Drew Ave., Davis, CA 95616,
    RL   USA
    XX
    FH   Key             Location/Qualifiers
    FH
    FT   source          1..852
    FT                   /organism="Fusarium venenatum"
    FT                   /strain="ATCC20334"
    . . .
Fields highlighted in the entry: ID (ID), AC (AC), creation date / modification date (DT), description (DE), organism species (OS), organism classes (OC), references (RN to RL).
XML files (sample values; markup lost in extraction):
    14216186 1965 02 01 1996 12 01 20070301 0009-8981 10 1964Jul Clinica chimica acta; international journal of clinical chemistry Clin. Chim. Acta . . .
Fields highlighted in the record: ID, creation date, modification date, issn, volume, name.
Dump file (XML; sample values, markup lost in extraction):
    IntAct.Experiment / Experimental procedures that allowed to ... / 1.0 / 2007-Feb-16 / 5697
8. Divide and Conquer the Indexing
- Uniprot (>4M entries): 2 files, ~9.4G
- Embl (>83M entries): >600 files, ~375G
- Medline (>16M entries): >500 files, ~57G
- Taxonomy (>0.37M entries): 1 file, ~81M
- GO (>0.23M entries): 1 file, ~27M
- Others (ArrayExpress, Ensembl, Intact, ...)
[Diagram: the sources (XML, Db, XML dump) are indexed separately, with four blocks labelled "8 cpu", producing one index per domain: Uniprot, Embl, Taxonomy, Medline, GO, ArrayExpress, Ensembl and Intact]
9. Libraries
- Indexing
  - Lucene ( http://lucene.apache.org )
  - ANTLR ( http://www.antlr.org/ )
  - ANTXR ( http://javadude.com/tools/antxr/index.html )
  - JGroups ( http://www.jgroups.org )
- Web
  - Tomcat ( http://tomcat.apache.org/ )
  - Spring Framework ( http://www.springframework.org )
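
To make the roles of the indexing libraries concrete, here is a toy version of the parse-then-index pipeline from slide 7. It is only a sketch under stated assumptions: the hand-written `parse_entries` stands in for an ANTLR-generated parser, the dictionary-based `build_index` stands in for a Lucene index, and neither function reflects the real APIs of those libraries or the actual EB-eye code. The sample text is an abridged copy of the EMBL entry shown on the slide.

```python
import re
from collections import defaultdict

# Abridged EMBL-style entry: two-letter line code, three spaces, then the value.
SAMPLE = """\
ID   AF030562; SV 1; linear; genomic DNA; STS; FUN; 852 BP.
XX
AC   AF030562;
XX
DE   Fusarium venenatum clone VEN-A RAPD band generated using Operon primer
DE   OPW-03, sequence tagged site.
XX
OS   Fusarium venenatum
//
"""


def parse_entries(text):
    """Yield one dict per entry, mapping each line code to its joined text."""
    entry = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # "//" terminates an entry
            if entry:
                yield {code: " ".join(parts) for code, parts in entry.items()}
            entry = defaultdict(list)
            continue
        code, value = line[:2], line[5:].strip()
        if code != "XX" and value:  # XX lines are empty separators
            entry[code].append(value)
    if entry:  # tolerate a missing final terminator
        yield {code: " ".join(parts) for code, parts in entry.items()}


def build_index(entries):
    """Toy inverted index: lower-cased token -> set of entry identifiers."""
    index = defaultdict(set)
    for fields in entries:
        if "ID" not in fields:
            # Minimal form of "detect errors in the data".
            raise ValueError("entry without an ID line")
        entry_id = fields["ID"].split(";")[0]
        for text in fields.values():
            for token in re.findall(r"[a-z0-9-]+", text.lower()):
                index[token].add(entry_id)
    return index


index = build_index(parse_entries(SAMPLE))
print(sorted(index["fusarium"]))
```

Separating the parser from the indexer is what lets a new format be supported by adding a grammar alone, which matches the goal of adding and configuring a new domain easily.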
10. Acknowledgements
- Rodrigo Lopez
- Janet Thornton and Graham Cameron
- EMBL, EBI Industry Support Programme, European Patent Office and the EU.
- All the data providers
- Our colleagues in the External Services group
- The System Group