Exploration of multidimensional biomedical data in PubChem
Lianyi Han
National Center for Biotechnology Information
Advances science and health by providing access to biomedical and genomic information.
Literature
• PubMed
• PMC
• PubMed Health
• …
Sequences
• Proteins
• Genes & Expression
• Genome & Maps
• …
Chemicals & Bioassays
• PubChem Databases
• BioSystems
• …
Software & tools
• BLAST
• Structure Search
• Entrez/Eutils
Structure & Domains
• Structure
• CDD
• …
PubChem
Provides information on the biological activities of small molecules and beyond
• Substance
• Compound
• Bioactivities
• Literature (link)
• Target
• Patent
• Pathways
23 million citations
The Challenge
• Variety: heterogeneous documents with many-to-many relationships
• Volume
  – 200M+ bioactivity data points
  – 40M+ compounds
  – 600K+ bioassays
  – 20K+ pathways
  – 9K targets
• Velocity: query wide quickly, query deep quickly, facet search quickly
Answers
The Direction
Velocity
Variety
Volume
Existing Search Systems
• ASN.1, XML schema
• RDBMS (SQL)
• In-house NoSQL search engine
• Specialized search engine
• Homebrewed messaging system
• Queue systems
A new search system
• Features?
• Scalability?
• Accessibility?
• Maintenance?
• Reusability?
• Extensibility?
• Cost-effective?
Archive Analysis
The feature requirements for the new search system
• Full-text search
• Highlighting
• Faceting
• Molecular formula search
• 2D similarity search
• Molecule superstructure/substructure search
• Joins and cascading joins, to search wide and deep
• Transfer search results effectively across services
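To make the 2D similarity requirement concrete, the sketch below scores fingerprints with the Tanimoto coefficient, a common measure for comparing structural fingerprints. The slides do not say which metric or fingerprint encoding the system uses, so the bit-position arrays and the names `tanimoto` and `similaritySearch` are assumptions for illustration only.

```javascript
// Hypothetical sketch: each fingerprint is modeled as an array of "on" bit
// positions. Real 2D fingerprints are fixed-length bit vectors; this encoding
// is an assumption chosen to keep the example short.
function tanimoto(fpA, fpB) {
  const a = new Set(fpA);
  const b = new Set(fpB);
  let shared = 0;
  for (const bit of a) {
    if (b.has(bit)) shared += 1;
  }
  const union = a.size + b.size - shared;
  // Two empty fingerprints have nothing to compare; treat them as dissimilar.
  return union === 0 ? 0 : shared / union;
}

// Score every candidate against the query, keep those at or above the
// threshold, and return them best match first.
function similaritySearch(queryFp, candidates, threshold) {
  return candidates
    .map((c) => ({ id: c.id, score: tanimoto(queryFp, c.fp) }))
    .filter((hit) => hit.score >= threshold)
    .sort((x, y) => y.score - x.score);
}

const hits = similaritySearch(
  [1, 4, 7, 9],
  [
    { id: "A", fp: [1, 4, 7, 9] },
    { id: "B", fp: [1, 4, 8] },
    { id: "C", fp: [2, 3] },
  ],
  0.4
);
console.log(hits);
```

A linear scan like this only shows the scoring step; searching tens of millions of compounds would need an index or pre-filtering on top of it.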
We can make the feature set complete in SOLR!
• Full-text search (SOLR)
• Highlighting (SOLR)
• Faceting (SOLR)
• Molecular formula search (implement MF search in SOLR)
• 2D similarity search (implement 2D fingerprint search in SOLR)
• Molecule superstructure/substructure search (SOLR-5244)
• Joins and cascading joins, to search wide and deep (SOLR-4787)
• Transfer search results effectively across services (SOLR-4787, SOLR-5244)
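The full-text, highlighting, faceting, and join items above correspond to standard Solr request parameters. The following sketch only assembles a request URL and sends nothing. The core URL and all field names (`target_name`, `assay_type`, `description`, `cid`, `activity_outcome`) are hypothetical. The join uses Solr's stock `{!join}` query parser, not the SOLR-4787 or SOLR-5244 patches named on the slide.

```javascript
// Hypothetical core URL and field names. Only the parameter names
// (q, fq, rows, wt, facet, facet.field, hl, hl.fl) are standard Solr.
const SOLR_SELECT_URL = "http://localhost:8983/solr/compound/select";

function buildSolrQuery({ text, facetFields = [], highlightFields = [], join = null, rows = 10 }) {
  const params = new URLSearchParams();
  params.set("q", text);
  params.set("rows", String(rows));
  params.set("wt", "json");

  if (facetFields.length > 0) {
    params.set("facet", "true");
    // One facet.field parameter per field to count values for.
    for (const field of facetFields) params.append("facet.field", field);
  }
  if (highlightFields.length > 0) {
    params.set("hl", "true");
    params.set("hl.fl", highlightFields.join(","));
  }
  if (join) {
    // Stock join parser: keep documents whose `to` field matches the `from`
    // field of the documents selected by the inner query.
    params.append("fq", `{!join from=${join.from} to=${join.to}}${join.query}`);
  }
  return `${SOLR_SELECT_URL}?${params.toString()}`;
}

const url = buildSolrQuery({
  text: "aspirin",
  facetFields: ["target_name", "assay_type"],
  highlightFields: ["description"],
  join: { from: "cid", to: "cid", query: "activity_outcome:active" },
});
console.log(url);
```

Putting the join in a filter query (`fq`) keeps it out of relevance scoring, which is one reasonable choice when the join only narrows the result set.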
Architecture
UI/UX
Web API
RDBMS   NoSQL (SOLR)   Specialized Search Backend
Caching/List handling
The Backend
• Backend components (SOLR + SQL + specialized search engine)
  – Configuration
  – Importing pipeline
    • Dumping & importing (SGE farm)
    • DIH (JDBC)
  – Replication
  – Warm-up
• Web API
  – Encapsulates the backend implementation
  – Load balancing and throttling
  – Generic data model for heterogeneous documents
  – Query language
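The slides list throttling as a Web API responsibility without describing the mechanism. One common approach is a token bucket, sketched below under that assumption. The class name, capacity, and refill rate are illustrative and not taken from the PubChem implementation.

```javascript
// Hypothetical token-bucket throttle; the slides do not specify the mechanism.
class TokenBucket {
  constructor(capacity, refillPerSecond, nowMs = 0) {
    this.capacity = capacity;               // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained request rate
    this.tokens = capacity;                 // start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true if a request may proceed at time nowMs, false if throttled.
  tryAcquire(nowMs) {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    // Add the tokens earned since the last call, capped at capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Timestamps are passed in explicitly so the behavior is deterministic;
// a server would pass Date.now() instead.
const bucket = new TokenBucket(2, 1); // burst of 2, then 1 request per second
const decisions = [0, 0, 0, 1000].map((t) => bucket.tryAcquire(t));
console.log(decisions);
```

In practice a service would keep one bucket per client or API key, so that a single heavy user cannot exhaust capacity for everyone else.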
The Frontend
• Easier to develop or expand based on modern web technologies
  – One backend, multiple frontends
  – One data model, multiple presentations
• UI/UX design
  – MVC
  – Reusability
  – Mobile browser friendly
  – Interactivity & accessibility
The Frontends
• PubChem widgets (beta)
  – Reusable UI components
• PubChem new search (beta)
  – A new search system that delivers multiple search features
Briefly on UI architecture
• PubChem widgets as an example
PubChem widgets
ExtJS components
Data model/store
Web API
backend
Controller
Demo: PubChem widget
• http://jsfiddle.net/Gtbg7/
PubChem.widget.CreateGridTable({
  gridtabletype: 'pcassay',
  cid: 2244,          // PubChem compound ID (2244 is aspirin)
  renderTo: 'table',  // page element the grid is rendered into
  width: "90%",
  height: 400
});
More PubChem widgets
Demo: PubChem Search
• https://pubchem.ncbi.nlm.nih.gov/search/
Desktop Mobile
Faceting
Molecular Formula Search
Super/sub Structure Search
Full-text Search
Brief Summary on PubChem Search Demo
Thanks
• Yu Bo
• Renata Geer
• Asta Gindulyte
• Siqian He
• Paul Thiessen
• Jiyao Wang
• Jeff Zhang
• Steve Bryant
• Lewis Geer
• Evan Bolton
• Yanli Wang
• NCBI IEB and IRB
This research was supported [in part] by the Intramural Research Program of the NIH, National Library of Medicine.
Questions
About this talk: [email protected]
PubChem: https://www.facebook.com/pubchem
NCBI: https://www.facebook.com/ncbi.nlm