Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT
![Page 1: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/1.jpg)
Enabling Discoveries at High Throughput Small molecule and RNAi HTS at the NCTT
Rajarshi Guha, NIH Center for Translational Therapeutics
May 3, 2011
![Page 2: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/2.jpg)
Outline
• Informatics for small molecule & RNAi screening
• HCA & automated decision making
– Pretty pictures can lead to more efficient screens
• Large scale cheminformatics
– We can do it, but do we need to?
![Page 3: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/3.jpg)
NIH Chemical Genomics Center
• Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
– NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
– Post-doc program
• Mission
– MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
– Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
– New paradigms & applications of HTS for chemical biology / chemical genomics
• All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide
![Page 4: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/4.jpg)
Project Diversity
(Figure panels: (A) Disease areas, (B) Target types, (C) Detection methods)
![Page 5: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/5.jpg)
Assay formats & detection methods in HTS
Assay formats
• ligand binding
– competition binding
• enzymatic activity
– biochemical
– cellular
• ion or ligand transport
– ion-sensitive dyes
– membrane potential dyes
• protein-protein interactions
– biochemical
– cellular
• cellular signal transduction
– reporter gene
– second messenger
• phenotypic
– protein redistribution
– cell viability
– etc.
Detection modes
• luminescence
– chemiluminescence
– bioluminescence
– BRET
– ALPHA
• fluorescence
– FI
– FRET
– TRF
– TR-FRET
– FP
– FCS
– FLT
• absorbance
• radioactivity
– SPA
![Page 6: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/6.jpg)
Detector Systems: “Reading the assay”
• ViewLux – Multimodal CCD-based imager
• Abs., Luminescence, Fluorescence
• Envision – PMT-based reader
• ALPHA
• Acumen Explorer – Laser Scanning Imager
• “static” cell cytometry
• Hamamatsu FDS 7000 Series – rapid kinetics
• INCell1000 – Subcellular imaging
![Page 7: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/7.jpg)
qHTS: High Throughput Dose Response
• 1536-well plates, inter-plate dilution series; assay volumes 2–5 μL
• Assay concentration ranges over 4 logs (high: ~100 μM)
• Automated concentration-response data collection, ~1 CRC/sec
• Informatics pipeline: automated curve fitting and classification; 300K samples
(Figure panels A, B and C)
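As a rough illustration of what a concentration-response curve over a dilution series looks like numerically, the sketch below evaluates a four-parameter Hill equation, a model commonly used for such fits. The slide does not say which model or fitting procedure the pipeline uses, so the model choice, the class name `QhtsCurveSketch`, and the parameter names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not the NCGC curve-fitting code.
class QhtsCurveSketch {

    // Four-parameter Hill equation: response at concentration c.
    // bottom/top are the asymptotes, ac50 the half-maximal concentration,
    // slope the Hill coefficient (all names are assumptions for this sketch).
    static double hill(double c, double bottom, double top, double ac50, double slope) {
        return bottom + (top - bottom) / (1.0 + Math.pow(ac50 / c, slope));
    }

    // Concentrations of a dilution series, highest concentration first.
    static List<Double> dilutionSeries(double high, double foldDilution, int points) {
        List<Double> series = new ArrayList<>();
        double c = high;
        for (int i = 0; i < points; i++) {
            series.add(c);
            c /= foldDilution;
        }
        return series;
    }

    public static void main(String[] args) {
        // Five 10-fold dilutions starting at 100 (e.g. μM) span four logs.
        for (double c : dilutionSeries(100.0, 10.0, 5)) {
            System.out.println(c + " -> " + hill(c, 0.0, 100.0, 1.0, 1.0));
        }
    }
}
```

At a concentration equal to the AC50 the response sits halfway between the two asymptotes, which is a quick sanity check on any implementation.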
![Page 8: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/8.jpg)
Informatics Activities
• High throughput curve fitting
• Data integration, automated cherry picking
• SAR algorithms
– QSAR modeling
– Fragment based analysis
– Activity cliffs
• Tools
– standardizer, tautomers, fragment activity browser, kinome browser and more
• RNAi hit selection, OTE analysis
• High content analysis
![Page 9: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/9.jpg)
Kinome Navigator
• Browse kinase panel data
• Currently focused on the Abbott dataset
• View
– Fragments
– Target pairs
– Kinome overlay
http://tripod.nih.gov
![Page 10: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/10.jpg)
Fragment Browser
• View activities on a fragment-wise basis
• Compare activity distributions by fragment
• Currently based around ChEMBL assays, but users can browse their own compounds & activities
http://tripod.nih.gov
![Page 11: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/11.jpg)
Structure Activity Landscapes
• Rugged gorges or rolling hills?
– Small structural changes associated with large activity changes represent steep slopes in the landscape
– But traditionally, QSAR assumes gentle slopes
– We can characterize the landscape using SALI (the Structure Activity Landscape Index)
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535
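For readers unfamiliar with SALI: in Guha & Van Drie (2008) it is defined for a pair of molecules as the absolute activity difference divided by one minus their structural similarity. The sketch below computes it with a Tanimoto similarity over bit-set fingerprints. The class name, the fingerprint representation, and the handling of identical fingerprints are assumptions for illustration, not the published implementation.

```java
import java.util.BitSet;

// Illustrative only: a pairwise SALI calculation under stated assumptions.
class SaliSketch {

    // Tanimoto similarity between two binary fingerprints.
    static double tanimoto(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        BitSet or = (BitSet) a.clone();
        or.or(b);
        if (or.cardinality() == 0) {
            return 1.0; // two empty fingerprints are treated as identical
        }
        return (double) and.cardinality() / or.cardinality();
    }

    // SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)).
    // When sim == 1 the value is unbounded; this sketch returns +infinity
    // if the activities differ and NaN if they are equal (an assumption).
    static double sali(double activityI, double activityJ, double similarity) {
        double diff = Math.abs(activityI - activityJ);
        if (similarity >= 1.0) {
            return diff > 0 ? Double.POSITIVE_INFINITY : Double.NaN;
        }
        return diff / (1.0 - similarity);
    }
}
```

Large values flag activity cliffs: structurally similar pairs with very different activities.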
![Page 12: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/12.jpg)
What Can We Do With SALI?
• SALI characterizes cliffs & non-cliffs
• For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape
• Models try to encode this landscape
• Use the landscape to guide descriptor or model selection
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
![Page 13: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/13.jpg)
Predicting the Landscape
• Rather than predicting activity directly, we can try to predict the SAR landscape
• This implies that we attempt to predict cliffs directly
– Observations are now pairs of molecules
Scheiber et al., Statistical Analysis and Data Mining, 2009, 2, 115-122
Original pIC50 RMSE = 0.97
SALI, AbsDiff RMSE = 1.10
SALI, GeoMean RMSE = 1.04
![Page 14: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/14.jpg)
Data Integration
• It’s nice to simplify data, but we can still be faced with a multitude of data types
• We want to explore these data in a linked fashion
• How we explore and what we explore is generally influenced by the task at hand
• At some point, we want to make inferences over all the data
![Page 15: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/15.jpg)
Data Integration
User’s Network
Network of Public Data
Content: Drugs, Compounds, Scaffolds, Assays, Genes, Targets, Pathways, Diseases, Clinical Trials, Documents
Links: manually curated, derived from algorithms
![Page 16: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/16.jpg)
Record View of an Assay
![Page 17: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/17.jpg)
Access Disease Hierarchy & Network
![Page 18: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/18.jpg)
Articles, Patents, Drug Labels, …
![Page 19: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/19.jpg)
NPC Browser
http://tripod.nih.gov/npc/
![Page 20: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/20.jpg)
Going Beyond Exploration?
• Simply being able to explore data in an integrated manner is useful as an idea generator
• Can we integrate heterogeneous data types & sources to get a systems level view?
– Current research problem in genomics and systems biology
– Some attempts have been made to merge chemical data with other data types
Young, D.W. et al., Nat. Chem. Biol., 2008, 4, 59-68
![Page 21: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/21.jpg)
RNAi Facility Mission
• Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
• Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency, reliability, and costs
Range of Assays
Pathway (reporter assays, e.g. luciferase, β-lactamase)
Complex Phenotypes (high-content imaging, cell cycle, translocation, etc.)
Simple Phenotypes (viability, cytotoxicity, oxidative stress, etc.)
![Page 22: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/22.jpg)
RNAi Effectors
RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.
![Page 23: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/23.jpg)
Issues Using RNAi Effectors
• RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.
• RNAi effectors induce off-target effects!
![Page 24: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/24.jpg)
Examples of Current Projects
• Protein Quality Control
• DNA Re-replication
• Base Excision Repair
• DNA Damage – ELG1 stabilization
• Antioxidant Response
• Hypoxia
• TNFα Response
• Interferon Response
• iPS to RPE
• Poxvirus
• Respiratory Viruses
• Lysosomal Storage Disorders
• Parkinson’s – Mitochondrial Quality Control
• Ewing’s Sarcoma
• Drug Modifiers, Pancreatic Cancer
• Drug Modifiers, TOP1 Clinical Agents
• Immunotoxin-Mediated Cell Death
![Page 25: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/25.jpg)
User Accessible Tools
![Page 26: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/26.jpg)
RNAi Libraries
• Qiagen Human Druggable Genome Library: > 7,000 genes, 4 unique siRNAs per gene
• Kinome Libraries: purchased from a number of vendors
• Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library
• Ambion Human Genome-Wide Library: 21,585 genes, 3 unique siRNAs per gene
• Dharmacon Human Duet Genome-Wide siRNA Libraries: 18,236 genes, siRNA pools
• Ambion Mouse Genome-Wide Library: 17,582 genes, 3 unique siRNAs per gene
• Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.
• Considerations are being made for additional species and shRNA resources.
![Page 27: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/27.jpg)
Druggable Genome Screening Campaign
• Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).
Pseudo-colored Blue/Green Ratio (Normalized to plate Median)
• 85 genes were selected for follow-up through a variety of threshold-based selection schemes.
• 27 genes were validated as confident hits using siRNAs from multiple vendors.
(Figure: bar chart titled “Percent Reduction in NF-kB Signal”; y-axis Average Inhibition (%), 0 to 100; bars for TNFα Receptor, IKKα, RELA and NEMO; Qiagen siRNAs vs. Ambion siRNAs)
Significant enrichment for core NF-kB components
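The slide mentions normalizing the readout to the plate median and picking hits with threshold-based schemes, without giving details. The sketch below shows one simple way this could look: divide each well by the plate median and keep wells whose normalized value falls below a cutoff. The class name, the division-based normalization, and the cutoff value are assumptions, not the selection scheme actually used in the campaign.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: median normalization plus a single threshold.
class PlateNormalizationSketch {

    // Median of the raw well values on one plate.
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("empty plate");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Express each well relative to the plate median (1.0 = typical well).
    static double[] normalizeToMedian(double[] values) {
        double m = median(values);
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] / m;
        }
        return out;
    }

    // Indices of wells whose normalized signal is below the cutoff.
    static List<Integer> wellsBelow(double[] normalized, double threshold) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < normalized.length; i++) {
            if (normalized[i] < threshold) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

In practice several thresholds or per-gene aggregation over the individual siRNAs would likely be combined, as the phrase "a variety of threshold-based selection schemes" suggests.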
![Page 28: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/28.jpg)
Druggable Genome Screening Campaign
(Figure: schematic of the proteasome showing the 20S proteasome (α1-7 and β1-7 rings) and the 19S regulator particles (RPT and RPN subunits); Murata et al., Nature Reviews Mol. Cell Biol.)
(Figure: bar chart titled “Percent Reduction in NF-kB Signal”; y-axis Average Inhibition (%), 0 to 100; x-axis PSM gene (A1, A2, A3, A4, A5, A6, A7, B2, B3, B4, C4, C5, D2, D7, D14), grouped by PSM protein class (α core 20S, β core 20S, RPT 19S, RPN 19S); Qiagen vs. Ambion siRNAs)
Significant enrichment for proteins that form the 26S proteasome
An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway.
![Page 29: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/29.jpg)
Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.
Seed Sequence Analysis
![Page 30: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/30.jpg)
Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate the NF-kB reporter, adding to the likelihood of this being an off-target effect.
Seed Sequence Analysis
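The two seed-analysis slides compare an active siRNA against other siRNAs in the library that share its seed. A minimal sketch of that grouping step is shown below, assuming the seed is taken as nucleotides 2 to 8 of the guide strand (a common convention; the slides do not state which definition was used). The class name, the input map of guide sequence to activity, and the use of a plain mean are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: group siRNAs by seed and average their activity.
class SeedAnalysisSketch {

    // Seed = positions 2-8 (1-based) of the guide strand, i.e. a 7-mer.
    static String seed(String guide) {
        if (guide.length() < 8) {
            throw new IllegalArgumentException("guide too short: " + guide);
        }
        return guide.substring(1, 8);
    }

    // Mean activity of all siRNAs sharing each seed.
    static Map<String, Double> meanActivityBySeed(Map<String, Double> activityByGuide) {
        Map<String, List<Double>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : activityByGuide.entrySet()) {
            grouped.computeIfAbsent(seed(e.getKey()), k -> new ArrayList<>()).add(e.getValue());
        }
        Map<String, Double> means = new LinkedHashMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            double sum = 0.0;
            for (double v : e.getValue()) {
                sum += v;
            }
            means.put(e.getKey(), sum / e.getValue().size());
        }
        return means;
    }
}
```

If siRNAs that share a seed but target unrelated genes show similar activity, the effect is more likely seed-driven (off-target); if only the siRNAs against one gene are active, an on-target effect is more plausible.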
![Page 31: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/31.jpg)
RNAi & Small Molecule Screens
Goal: Develop systems level view of small molecule activity
• Reuse pre-existing MLI data
• Develop new annotated libraries
TACGGGAACTACCATAATTTA
CAGCATGAGTACTACAGGCCA
• Run parallel RNAi screen
What targets mediate activity of siRNA and compound?
Pathway elucidation, identification of interactions
Target ID and validation
Link RNAi-generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology
![Page 32: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/32.jpg)
Matching Phenotypes
(Figure panels: RNAi, Small Molecule)
![Page 33: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/33.jpg)
Merging Screening Technologies
High throughput screening
• Lead identification
• Single (few) read outs
• High-throughput
• Moderate data volumes
High content screening
• Phenotypic profiling
• Multiple parameters
• Moderate throughput
• Very large data volumes

• We’d like to combine the technologies, to obtain rich high-resolution data at high speed
• Is this feasible? What are the trade-offs?
![Page 34: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/34.jpg)
Merging Screening Technologies
• A simple solution is to run HTS & HCS as separate primary & secondary screens
• Alternatively – Wells to Cells
– Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
– Selective imaging of interesting wells
![Page 35: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/35.jpg)
Wells to Cells Workflow
• Sequential qHTS using laser scanning cytometry followed by high-res microscopy
• Unit of work is a plate series
• The same aliquot is analyzed by both techniques
• A message based system
• The key is deciding which wells go through the workflow
![Page 36: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/36.jpg)
Wells to Cells Assays
• Cell cycle, cell translocation, DNA rereplication
• All assays run against LOPAC1280
• Consistency between cytometry & microscopy is measured by the R² between log AC50’s
– Cell cycle, 0.94 – 0.96
– Cell translocation, 0.66 – 0.94
– DNA rereplication, still in progress
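The consistency measure above can be sketched as the squared Pearson correlation between the log AC50 values obtained from the two platforms for the same compounds. Whether the reported R² was computed exactly this way is an assumption; the class and method names are illustrative.

```java
// Illustrative only: R^2 between paired log10 AC50 values.
class Ac50ConsistencySketch {

    // log10-transform an array of AC50 values (same units throughout).
    static double[] log10(double[] ac50) {
        double[] out = new double[ac50.length];
        for (int i = 0; i < ac50.length; i++) {
            out[i] = Math.log10(ac50[i]);
        }
        return out;
    }

    // Squared Pearson correlation of two equally long arrays.
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return (sxy * sxy) / (sxx * syy);
    }
}
```

A call such as `rSquared(log10(cytometryAc50), log10(microscopyAc50))` (hypothetical array names) would give one number per assay; a constant fold-shift between the two platforms still yields a value near 1, since only the linear relationship on the log scale matters.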
![Page 37: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/37.jpg)
Cell Translocation Example Hits
![Page 38: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/38.jpg)
Informatics Platform
• Advanced correction and normalization methods
• Sophisticated curve fitting algorithm
• Good performance, allows parallelization of the entire workflow
InCell Layout File
![Page 39: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/39.jpg)
Why Messaging?
• A messaging architecture allows for significant flexibility
– Persistent, can be kept for process tracking, reporting
– Asynchronous, allows individual components of the workflow to proceed at their own pace
– Modular, new components can be introduced at any time without redesigning the whole workflow
• We employ Oracle AQ, but any message queue can be employed
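The asynchronous, modular behaviour described above can be illustrated with an in-memory queue standing in for the message broker. This is not Oracle AQ and not the NCGC workflow code: the class name, the message contents (plain plate identifiers), and the stop sentinel are hypothetical, and an in-process queue offers none of the persistence a real message queue provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative only: producer and consumer decoupled by a queue.
class PlateMessagingSketch {

    // Sentinel message telling the consumer to stop (hypothetical convention).
    private static final String STOP = "__STOP__";

    // Publishes one message per plate and returns what the consumer handled.
    static List<String> run(List<String> plateIds) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> imaged = new ArrayList<>();

        // Consumer: stands in for the imaging step, working at its own pace.
        Thread consumer = new Thread(() -> {
            try {
                for (String msg = queue.take(); !msg.equals(STOP); msg = queue.take()) {
                    imaged.add("imaged:" + msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer: stands in for the cytometry step announcing finished plates.
        try {
            for (String id : plateIds) {
                queue.put(id);
            }
            queue.put(STOP);
            consumer.join(); // join makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while publishing", e);
        }
        return imaged;
    }
}
```

Adding a new workflow component then amounts to attaching another consumer to the queue, without touching the producer.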
![Page 40: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/40.jpg)
Handling Multiple Platforms
• Current examples employ InCell hardware
• We also use Molecular Devices hardware
• As a result we have two orthogonal image stores / databases
• Need to integrate them
– Support seamless data browsing across multiple screens irrespective of imaging platform used
– Support analytics external to vendor code
![Page 41: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/41.jpg)
A Unified Interface
• A client sees a single, simple interface to screening image data
• Transparently extract image data via the MetaXpress database or via custom code
• Currently the interface addresses image serving
• Unified metadata interface in the works
http://host/rest/protocol/plate/well/image
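One way to picture the unified interface is a small facade that parses a path of the form shown above and hands the request to whichever backend holds the images for that protocol. The sketch below is an assumption about the shape of such a layer, not the actual service: the class name, the `ImageSource` interface, and the idea that backends are selected per protocol are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: one client-facing entry point, several image backends.
class UnifiedImageInterfaceSketch {

    // A backend that knows how to find an image (e.g. a vendor database
    // or custom code reading files); hypothetical interface.
    interface ImageSource {
        String locate(String plate, String well, String image);
    }

    private final Map<String, ImageSource> sourcesByProtocol = new HashMap<>();

    void register(String protocol, ImageSource source) {
        sourcesByProtocol.put(protocol, source);
    }

    // Resolves a path of the form /rest/{protocol}/{plate}/{well}/{image}.
    String resolve(String path) {
        // A leading slash yields an empty first element, hence six parts.
        String[] parts = path.split("/");
        if (parts.length != 6 || !parts[1].equals("rest")) {
            throw new IllegalArgumentException("unexpected path: " + path);
        }
        ImageSource source = sourcesByProtocol.get(parts[2]);
        if (source == null) {
            throw new IllegalArgumentException("unknown protocol: " + parts[2]);
        }
        return source.locate(parts[3], parts[4], parts[5]);
    }
}
```

The client only ever sees `resolve`; which imaging platform produced the data stays hidden behind the registered sources.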
![Page 42: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/42.jpg)
Trade-offs & Opportunities
• Automation reduces the ability to handle unforeseen errors
– Dispense errors and other plate problems
– Well selection based on curve classes may need to be modified on the fly
• Well selection does not consider SAR
– Wells are selected independently of each other
– If we could model SAR on the fly (or from validation screens), we’d select multiple wells, to obtain positive and negative results
![Page 43: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/43.jpg)
Cloud Computing & Cheminformatics
• Cloud computing is a hot topic
• A number of examples of computational chemistry / cheminformatics on the cloud
– MolPlex, hBar, Numerate, Wingu, Sciligence, Pfizer
• Many examples use the cloud for remote storage or remote (hosted) computations
• But providers such as Amazon allow us to run distributed computing applications on the cloud
![Page 44: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/44.jpg)
Map/Reduce
• Map/Reduce is a programming model for efficient distributed computing
• M/R made “famous” by Google, but the idea has been around for a long time
• It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
– Streaming through data, reducing seeks
– Pipelining
Owen O’Malley, http://bit.ly/ecHPvB
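The Input | Map | Shuffle & Sort | Reduce | Output stages can be mimicked in a single process to make the data flow concrete. The sketch below uses only the Java standard library and is not Hadoop code; the class and method names are made up for illustration, and it counts words the way the canonical example does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative only: the three M/R stages simulated in memory.
class MapReduceSketch {

    // Map: emit (token, 1) for every whitespace-separated token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle & sort: group the emitted values by key, keys in sorted order.
    static SortedMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Full pipeline over a list of input lines.
    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffleAndSort(emitted).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; the logic per stage is the same.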
![Page 45: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/45.jpg)
Map/Reduce
Owen O’Malley, http://bit.ly/ecHPvB
![Page 46: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/46.jpg)
Hadoop & Cheminformatics
• Hadoop is an Open Source implementation of the map/reduce paradigm
• Hadoop is a framework for scalable, distributed computing
– Hadoop, HDFS, Hive, Pig
• Importantly, you can play with all this on your laptop and just copy files to the big cluster when you’re ready for production
![Page 47: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/47.jpg)
Why Hadoop?
• Simple way to make use of large clusters without MPI etc
• AWS supports Hadoop, so easy to scale up to 100’s or 1000’s of cores
• Great for Java code, but non-Java code can also make use of Hadoop
• M/R can be applied to a lot of problems, but one of the simplest is to use it as a “chunker”
![Page 48: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/48.jpg)
Cheminformatics in Parallel
• Many cheminformatics problems are data parallel
– Chunk the data and apply the same technique over each chunk
• This makes many problems amenable to M/R
– Substructure / pharmacophore search
– Descriptor calculations, virtual screening
– Model development (?)
• In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
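The chunk-and-apply pattern can be shown without a cluster: split the input into chunks and run the same serial function over each chunk on a worker. The sketch below uses a local thread pool as a stand-in for cluster nodes; the class name and the per-item function (here simply the length of a SMILES string) are placeholders for a real cheminformatics calculation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative only: data-parallel processing of chunks on local threads.
class ChunkedProcessingSketch {

    // Split a list into consecutive chunks of at most `size` items.
    static <T> List<List<T>> chunk(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            chunks.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return chunks;
    }

    // Apply the same serial function to every item, one task per chunk.
    // Results come back in the original input order.
    static <T, R> List<R> processInParallel(List<T> items, int chunkSize, int threads,
                                            Function<T, R> perItem) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> c : chunk(items, chunkSize)) {
                futures.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T item : c) {
                        out.add(perItem.apply(item)); // the code itself is non-parallel
                    }
                    return out;
                }));
            }
            List<R> results = new ArrayList<>();
            for (Future<List<R>> f : futures) {
                results.addAll(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("chunk processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With Hadoop the chunking and scheduling are handled by the framework, so only the per-item (or per-record) logic has to be written.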
![Page 49: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/49.jpg)
Cheminformatics in Parallel
See http://blog.rguha.net/?tag=hadoop for examples & code
![Page 50: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/50.jpg)
Substructure Searching
• Substructure searching is a trivial extension of atom counting
• If a structure matches, emit (name,1)
• Otherwise (name,0)
• Reducer simply outputs tuples of the form (name,1)
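    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.openscience.cdk.CDKConstants;
    import org.openscience.cdk.exception.CDKException;
    import org.openscience.cdk.interfaces.IAtomContainer;

    public class SubSearch {
        …
        // sp, sqt, one and zero are used below but declared in the part of the class elided above
        public static class MoleculeMapper extends
                Mapper<Object, Text, Text, IntWritable> {

            private Text matches = new Text();
            private String pattern;

            // Read the SMARTS pattern from the job configuration
            public void setup(Context context) {
                pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
            }

            // One SMILES per input line: emit (name,1) on a match, (name,0) otherwise
            public void map(Object key, Text value, Context context) throws
                    IOException, InterruptedException {
                try {
                    IAtomContainer molecule = sp.parseSmiles(value.toString());
                    sqt.setSmarts(pattern);
                    boolean matched = sqt.matches(molecule);
                    matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                    if (matched) context.write(matches, one);
                    else context.write(matches, zero);
                } catch (CDKException e) {
                    e.printStackTrace();
                }
            }
        }

        public static class SMARTSMatchReducer extends
                Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            // Pass through only the (name,1) tuples
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                for (IntWritable val : values) {
                    if (val.compareTo(one) == 0) {
                        result.set(1);
                        context.write(key, result);
                    }
                }
            }
        }
        …
    }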
![Page 51: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/51.jpg)
Running on AWS
• All the code was debugged on my laptop with relatively small files
• To test the scalability, I shifted everything to AWS
– Pharmacophore search: 136K structures, single conformer, 560MB
– Created a single JAR file with CDK & application code
– Uploaded data files to S3
• Total cost of experiments was ~ $10
![Page 52: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/52.jpg)
But I Don’t Want to Write Programs
• All these examples require us to write full-fledged Java classes
• An easier way is to use Pig & Pig Latin, a platform and query language built on top of Hadoop
• Lets us write SQL-like queries that make use of Hadoop underneath
• Flexible due to user defined functions (UDFs)
– UDFs encapsulate the cheminformatics
![Page 53: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/53.jpg)
Cheminformatics & Pig
• Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
• The complexity is now hidden in the UDF
• Many toolkit functions could be wrapped as UDFs, allowing flexible queries with much simpler code
• See http://blog.rguha.net/?p=748 for the code

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';
![Page 54: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/54.jpg)
Latency
• Hadoop is suited for batch processing
• Significant network I/O involved in distributing data to compute nodes
• Not good for
– Random ad hoc processing of small subsets
– Small volume data
– Real time (low latency) work
• But latency issues can be addressed somewhat by HBase, Hive and other technologies
![Page 55: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/55.jpg)
More than Chunking?
• But all the examples so far could have been done via PBS/Condor or any other job scheduler
– (With Hadoop we don’t have to worry about explicit chunking of the input data)
• But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
– Predictive modeling?
– Graph algorithms?
![Page 56: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/56.jpg)
More than Chunking?
• Both predictive & graph algorithms are increasingly supported in Hadoop
– Mahout for M/L algorithms on massive datasets
– Cloud9 for graph algorithms
• A number of bioinformatics applications make use of M/R at the algorithmic level
• They are all big applications
– Crossbow aligns 3 billion paired/unpaired reads
• Cheminformatics datasets are not very big
![Page 57: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/57.jpg)
Summary
• HTS data is an ample playground for interesting analytics; multiple data types make it more fun
• A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
• Hadoop and M/R provide great opportunities for handling large data in a flexible manner
• But can cheminformatics really make use of it?
![Page 58: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/58.jpg)
Acknowledgments
Informatics
• Ajit Jadhav
• Trung Nguyen
• Noel Southall
• Ruili Huang
• Min Shen
• Hongmao Sun
• Xin Hu
• Tongan Zhao
RNAi & Small Molecule
• Scott Martin
• Pinar Tuzmen
• Yu-Chi Chen
• Carleen Klump
• Craig Thomas
• Jim Inglese
• Ron Johnson
• Sam Michael
• Jennifer Wichterman
![Page 59: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/59.jpg)
![Page 60: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/60.jpg)
Counting Atoms
• The canonical Hadoop program is to count the frequency of words in a text file
– Mapper reads a line, outputs a tuple (word, 1)
– Reducer will receive tuples, keyed on word
• Summing up the 1’s gives us the frequency of word
• By default, Hadoop works on a line-by-line basis
• For cheminformatics problems, SMILES files satisfy this requirement: one line, one molecule
![Page 61: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/61.jpg)
Counting Atoms

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.openscience.cdk.DefaultChemObjectBuilder;
    import org.openscience.cdk.exception.InvalidSmilesException;
    import org.openscience.cdk.interfaces.IAtom;
    import org.openscience.cdk.interfaces.IAtomContainer;
    import org.openscience.cdk.smiles.SmilesParser;

    public class HeavyAtomCount {
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // One SMILES per input line: emit (symbol,1) for every atom
            public void map(Object key, Text value, Context context) throws
                    IOException, InterruptedException {
                try {
                    IAtomContainer molecule = sp.parseSmiles(value.toString());
                    for (IAtom atom : molecule.atoms()) {
                        word.set(atom.getSymbol());
                        context.write(word, one);
                    }
                } catch (InvalidSmilesException e) {
                    // do nothing for now
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable,
                Text, IntWritable> {
            private IntWritable result = new IntWritable();

            // Sum the 1's received for each element symbol
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
        …
    }

• Uses the CDK to parse SMILES
• For each molecule, loop over atoms
– Emit (symbol,1)
• Reducer simply sums the 1’s for each symbol
![Page 62: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/62.jpg)
Multiline Records
• Lots of cheminformatics applications require 3D
– SMILES won’t do. Need to support SDF
• We implement a custom RecordReader to process SD files
• We’re now ready to tackle pretty much most cheminformatics tasks
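The core of any SD-file reader is recognizing record boundaries: each record in an SD file ends with a line containing `$$$$`. The sketch below shows that splitting logic on an in-memory string using only the standard library. It is not the custom Hadoop RecordReader mentioned above (which also has to deal with input splits that begin mid-record); the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split SD-file text into one string per record.
class SdfRecordSketch {

    static List<String> splitRecords(String sdfText) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : sdfText.split("\\r?\\n", -1)) {
            current.append(line).append('\n');
            // "$$$$" on its own line terminates a record; keep it in the record.
            if (line.trim().equals("$$$$")) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep a trailing record that lacks the terminator, if it has content.
        if (!current.toString().trim().isEmpty()) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Each returned string is a complete multi-line record that can be handed to a toolkit's MOL/SD parser, playing the role that a single line plays for SMILES input.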
![Page 63: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT](https://reader033.fdocuments.us/reader033/viewer/2022052822/554e9a46b4c90526358b5354/html5/thumbnails/63.jpg)
Why Hadoop?
• Java and C++ APIs
– In Java use Objects, while in C++ bytes
• Each task can process data sets larger than RAM
• Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
• Locality optimizations
– M/R queries HDFS for locations of input data
– Map tasks are scheduled close to the inputs when possible
Owen O’Malley, http://bit.ly/ecHPvB