Exploration of multidimensional biomedical data in PubChem, presented by Lianyi Han at Solr Exchange DC

Exploration of multidimensional biomedical data in PubChem

Lianyi Han

National Center for Biotechnology Information
Advances science and health by providing access to biomedical and genomic information.

Literature
• PubMed
• PMC
• PubMed Health
• …

Sequences
• Proteins
• Genes & Expression
• Genome & Maps
• …

Chemicals & Bioassays
• PubChem Databases
• BioSystems
• …

Software & Tools
• BLAST
• Structure Search
• Entrez/Eutils

Structure & Domains
• Structure
• CDD
• …

PubChem
Provides information on the biological activities of small molecules and beyond

• Substance
• Compound
• Bioactivities
• Literature (link)
• Target
• Patent
• Pathways

23 million citations

The Challenge

• Variety: heterogeneous documents with many-to-many relationships

• Volume: 200M+ bioactivity data, 40M+ compounds, 600K+ bioassays, 20K+ pathways, 9K targets

• Velocity: query wide quickly, query deep quickly, facet search quickly

Answers

The Direction

Velocity

Variety

Volume

Existing Search Systems
• ASN.1, XML schema
• RDBMS (SQL)
• In-house NoSQL search engine
• Specialized search engine
• Homebrewed messaging system
• Queue systems

A new search system
• Features?
• Scalability?
• Accessibility?
• Maintenance?
• Reusability?
• Extensibility?
• Cost-effective?

Archive Analysis

The feature requirements for the new search system

• Full-text search
• Highlighting
• Faceting
• Molecular formula search
• 2D similarity search
• Molecular superstructure/substructure search
• Joins and cascading joins, to search wide and deep
• Transfer search results effectively across services
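
As an illustration of the 2D similarity requirement: a common way to compare two molecules is to take the Tanimoto coefficient of their binary structural fingerprints. The slides do not state which fingerprint or metric the Solr implementation uses, so the sketch below is only an assumption-laden example; the function name `tanimoto` and the toy bit positions are hypothetical.

```javascript
// Hypothetical helper: Tanimoto coefficient between two binary fingerprints,
// each given as an array of the positions of its "on" bits.
function tanimoto(bitsA, bitsB) {
  const a = new Set(bitsA);
  const b = new Set(bitsB);
  let shared = 0;
  for (const bit of a) {
    if (b.has(bit)) shared += 1;
  }
  const union = a.size + b.size - shared;
  // Convention chosen for this sketch: two empty fingerprints score 0.
  return union === 0 ? 0 : shared / union;
}

// Toy fingerprints (made-up bit positions, not real PubChem fingerprints).
const queryBits = [1, 4, 7, 9];
const candidateBits = [1, 4, 8, 9, 12];
console.log(tanimoto(queryBits, candidateBits));
```

A similarity search would then keep only the candidates whose score meets a chosen threshold.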

We can make the feature set complete in Solr!

• Full-text search (Solr)
• Highlighting (Solr)
• Faceting (Solr)
• Molecular formula search (implement MF search in Solr)
• 2D similarity search (implement 2D fingerprint search in Solr)
• Molecular superstructure/substructure search (SOLR-5244)
• Joins and cascading joins, to search wide and deep (SOLR-4787)
• Transfer search results effectively across services (SOLR-4787, SOLR-5244)
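
To make the stock Solr features in this list concrete, here is a minimal sketch of a request that combines full-text search, highlighting, faceting, and a join filter. It uses Solr's standard request parameters (`q`, `hl`, `hl.fl`, `facet`, `facet.field`, `fq`) and the built-in `{!join}` query parser as a stand-in; it does not reproduce the SOLR-4787 or SOLR-5244 patches. The host, core name, and every field name are hypothetical.

```javascript
// Build the query string for a Solr /select request.
// Field names, core name, and host below are hypothetical placeholders.
function buildSolrQuery({ text, facetFields = [], highlightField, join }) {
  const params = new URLSearchParams();
  params.set('q', text);   // full-text query
  params.set('wt', 'json');
  if (highlightField) {
    params.set('hl', 'true');            // enable highlighting
    params.set('hl.fl', highlightField); // field to highlight
  }
  if (facetFields.length > 0) {
    params.set('facet', 'true');         // enable faceting
    for (const field of facetFields) params.append('facet.field', field);
  }
  if (join) {
    // Built-in join query parser, applied as a filter query.
    params.append('fq', `{!join from=${join.from} to=${join.to}}${join.query}`);
  }
  return params.toString();
}

const qs = buildSolrQuery({
  text: 'description:aspirin',
  facetFields: ['assay_type', 'target_name'],
  highlightField: 'description',
  join: { from: 'cid', to: 'cid', query: 'activity:active' },
});
console.log(`http://localhost:8983/solr/compound/select?${qs}`);
```

Keeping the join in `fq` rather than `q` is a design choice in this sketch: the join restricts the result set without affecting relevance scoring of the full-text match.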

Architecture

UI/UX

Web API

RDBMS, NoSQL (Solr), Specialized Search Backend

Caching/List handling

The Backend

• Backend Components (Solr + SQL + specialized search engine)
  – Configuration
  – Importing pipeline
    • Dumping & Importing (SGE Farm)
    • DIH (JDBC)
  – Replication
  – Warm up

• Web API
  – Encapsulate the backend implementation
  – Load balancing and throttling
  – Generic data model for heterogeneous documents
  – Query language

The Frontend

• Easier to develop or expand based on modern web technologies
  – One backend, multiple frontends
  – One data model, multiple presentations

• UI/UX design
  – MVC
  – Reusability
  – Mobile browser friendly
  – Interactivity & Accessibility

The Frontends

• PubChem widgets (beta)
  – Reusable UI components

• PubChem new search (beta)
  – A new search system that delivers multiple search features

Briefly on UI architecture
• PubChem widgets as an example

PubChem widgets

ExtJS components

Data model/store

Web API

backend

Controller

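Demo: PubChem widget
• http://jsfiddle.net/Gtbg7/

// Render a grid table of the bioassays for compound CID 2244
// into the page element with id "table".
PubChem.widget.CreateGridTable({
    gridtabletype: 'pcassay',
    cid: 2244,
    renderTo: 'table',
    width: "90%",
    height: 400
});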

More PubChem widgets

Demo: PubChem Search
• https://pubchem.ncbi.nlm.nih.gov/search/

Desktop Mobile

Brief Summary on PubChem Search Demo

• Faceting
• Molecular Formula Search
• Super/sub Structure Search
• Full-text Search

Thanks

• Yu Bo
• Renata Geer
• Asta Gindulyte
• Siqian He
• Paul Thiessen
• Jiyao Wang
• Jeff Zhang

• Steve Bryant
• Lewis Geer
• Evan Bolton
• Yanli Wang
• NCBI IEB and IRB

This research was supported [in part] by the Intramural Research Program of the NIH, National Library of Medicine.

Questions

About this talk: [email protected]
PubChem: https://www.facebook.com/pubchem
NCBI: https://www.facebook.com/ncbi.nlm