![Page 1: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/1.jpg)
Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows
Michela Taufer
Department of Electrical Engineering and Computer Science
The University of Tennessee Knoxville
![Page 2: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/2.jpg)
Acknowledgements
T. Johnston, B. Zhang, T. Estrada, A. Liwo, S. Crivelli, M. Cuendet, E. Deelman, R. da Silva, H. Weinstein, A. Razavi, T. Do, S. Thomas, A. Plante, B. Mulligan
Sponsors: [logos]
![Page 3: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/3.jpg)
Trends in Next-Generation Systems
Source: Lucy Nowell (DOE)
[Figure labels: 5x, 12x, 64x, 13x]
Rising Importance of Ensembles
Source: https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification
[Figure: System Scale (FLOPS), from Terascale through Petascale to Exascale, plotted against Number of Jobs, from Few to Many; systems shown: Purple (2005), Sequoia (2013), Sierra (2018)]
Widening I/O Gap
![Page 4: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/4.jpg)
Classical Molecular Dynamics Simulations
By Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/
• An MD simulation comprises hundreds of thousands of MD jobs
• Each job performs hundreds of thousands of MD steps
![Page 5: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/5.jpg)
Classical Molecular Dynamics Simulations
Forces on single atoms → Acceleration → Velocity → Position
• An MD step computes forces on single atoms (e.g., bond, angle, dihedral, nonbonded)
• Forces are added to compute acceleration
• Acceleration is used to update velocities
• Velocities are used to update the atom positions
• Every n steps, all atom positions are stored → 3D snapshot or frame
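The force → acceleration → velocity → position chain above can be sketched in a few lines of Python. This is a minimal illustration and not the integrator of any particular MD code: the harmonic force field, the simple Euler-style update, and the names `harmonic_forces` and `run_md` are all assumptions made for the example.

```python
def harmonic_forces(positions, k=1.0):
    """Toy force field (assumption): each atom is pulled toward the origin."""
    return [[-k * x for x in pos] for pos in positions]


def run_md(positions, velocities, masses, dt, n_steps, stride,
           force_fn=harmonic_forces):
    """Advance the system n_steps (in place) and store a frame every `stride` steps."""
    frames = []
    for step in range(1, n_steps + 1):
        forces = force_fn(positions)                  # forces on single atoms
        for i, (force, mass) in enumerate(zip(forces, masses)):
            accel = [f / mass for f in force]         # force -> acceleration
            velocities[i] = [v + a * dt               # acceleration -> velocity
                             for v, a in zip(velocities[i], accel)]
            positions[i] = [x + v * dt                # velocity -> position
                            for x, v in zip(positions[i], velocities[i])]
        if step % stride == 0:                        # every n steps: store a frame
            frames.append([list(p) for p in positions])
    return frames
```

Running 100 steps with `stride=25` returns four frames, each a copy of all atom positions at that step.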
![Page 6: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/6.jpg)
Analyzing MD Frames: Present and Future
[Figure: two panels, Present and Future. Each shows MD code running on compute nodes (CN) with memory, an analysis component, and a parallel file system; one panel also includes local storage.]
![Page 7: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/7.jpg)
In Situ and In Transit Analysis
[Figure: simulation and analysis placed on the cores of a single node with shared memory, and on separate nodes (Node 1, Node 2, Node 3) joined by a network interconnect]
Examples of tools:
• DataSpaces (Rutgers U.)
• DataStager (Georgia Tech)
In situ and in transit analysis requires rethinking data algorithms
![Page 8: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/8.jpg)
Building a Closed-loop Workflow
[Figure: closed-loop workflow spanning Data Generation, Data Analytics, and Data Storage. Components: MD code (e.g., GROMACS) with Plumed, running n-stride simulation steps; Retriever; in-memory staging area (DataSpaces); Ingestor; A4MD analytics (A4MD analytics algorithms, ML-inferred algorithms; collective variables); burst buffer; parallel file system (e.g., Lustre). Arrows are labeled Dataflow, Controlflow, and Data Feedback.]
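The closed loop named in this slide (run n-stride simulation steps, stage the frame in memory, compute a collective variable, feed the result back) can be sketched as follows. This is a toy, single-process illustration: the `deque` only stands in for an in-memory staging area, and `simulate`, `analyze`, `closed_loop`, and the stopping rule are hypothetical names and choices, not the A4MD, Plumed, or DataSpaces APIs.

```python
from collections import deque


def simulate(n_stride, state):
    """Stand-in for the MD code: advance n_stride steps and emit one frame."""
    for _ in range(n_stride):
        state["x"] += state["v"]
    state["step"] += n_stride
    return {"step": state["step"], "x": state["x"]}


def analyze(frame):
    """Stand-in for in situ analytics: one collective variable per frame."""
    return abs(frame["x"])


def closed_loop(n_stride=10, max_steps=1000, target=5.0):
    state = {"step": 0, "x": 0.0, "v": 0.125}
    staging = deque()                 # stand-in for the in-memory staging area
    history = []
    while state["step"] < max_steps:
        staging.append(simulate(n_stride, state))   # data generation
        frame = staging.popleft()                   # analytics picks up the frame
        cv = analyze(frame)                         # data analytics
        history.append((frame["step"], cv))
        if cv >= target:                            # data feedback: stop the run
            break
    return history
```

The point of the sketch is the control structure: frames never reach a file system, and the analytics result decides whether the simulation continues.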
![Page 9: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/9.jpg)
Building a Closed-loop Workflow
[Figure: the closed-loop workflow diagram from the previous slide, repeated]
![Page 10: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/10.jpg)
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking
• Protein folding and rare events
• Protein variants expressed from yeast or bacteria and protein engineering
![Page 11: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/11.jpg)
![Page 12: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/12.jpg)
A4MD: Protein-Ligand Docking
T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)
![Page 13: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/13.jpg)
From 3D Atomic Structures to 3D Points
[Figure: docked ligands in the protein pocket, with panels labeled x-y axis, x-z axis, and y-z axis]
Metadata: each 3D point represents the position of one docked ligand in the protein pocket
![Page 14: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/14.jpg)
Search for Dense Spaces: Octree Clustering
![Page 15: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/15.jpg)
Search for Dense Spaces: Octree Clustering
Octree nodes
Metadata: ligand conformations
![Page 16: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/16.jpg)
Search for Dense Spaces: Octree Clustering
Near-native ligand structures (RMSD <= 2 Å)
Deepest, densest octant found by octree clustering
Search: linear in complexity using Mimir, a MapReduce-over-MPI framework
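The search for the deepest, densest octant can be sketched serially as below. This is a simplified illustration rather than the published MapReduce implementation: it assumes the 3D points are normalized to the cube [-1, 1]^3, it greedily follows only the most populated child octant, and the names `deepest_dense_octant`, `min_points`, and `max_depth` are hypothetical.

```python
def deepest_dense_octant(points, min_points=4, max_depth=10):
    """Descend an octree, always into the most populated octant, and stop
    when that octant would hold fewer than min_points points."""
    center, half = (0.0, 0.0, 0.0), 1.0     # root cube [-1, 1]^3 (assumption)
    members, depth = list(points), 0
    while depth < max_depth and members:
        octants = {}
        for p in members:
            # One bit per axis: which side of the cube center the point lies on.
            key = tuple(int(p[d] >= center[d]) for d in range(3))
            octants.setdefault(key, []).append(p)
        key, densest = max(octants.items(), key=lambda kv: len(kv[1]))
        if len(densest) < min_points:
            break
        half /= 2.0
        center = tuple(c + (half if bit else -half)
                       for c, bit in zip(center, key))
        members, depth = densest, depth + 1
    return center, half, members, depth
```

Each level touches every remaining point once, so the work per level is linear in the number of points; a parallel version would distribute the per-point octant assignment and the per-octant counting.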
![Page 17: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/17.jpg)
Case Study: Sampled Conformations - Ligand 1k1l
[Figure: 3D scatter plot of the sampled conformations; all three axes range from -1 to 1]
T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)
![Page 18: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/18.jpg)
Case Study: Sampled Conformations - Ligand 1k1l
Near-native ligand structures (RMSD <= 2 Å)
Deepest, densest octant found by octree clustering
![Page 19: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/19.jpg)
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking
• Protein folding and rare events
• Protein variants expressed from yeast or bacteria and protein engineering
![Page 20: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/20.jpg)
A4MD: Rare Events in MD Simulations
[Figure: two panels labeled Transformations and Movements]
![Page 21: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/21.jpg)
A4MD: Rare Events in MD Simulations
• We want to capture what is going on in each frame without:
  • Disrupting the simulation (e.g., stealing CPU and memory on the node)
  • Moving all the frames to a central file system and analyzing them once the simulation is over
  • Comparing each frame with past frames of the same job
  • Comparing each frame with frames of other jobs
Frames (or snapshots) of an MD trajectory:
[Figure: Frame 55, Frame 60, Frame 65, Frame 70, Frame 75, Frame 80]
![Page 22: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/22.jpg)
From 3D Atomic Structure to a Single Eigenvalue
Drop all but the backbone atoms (Cα atoms)
[Figure labels: Cβ_i, Cα_j]
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
![Page 23: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/23.jpg)
From 3D Atomic Structure to a Single Eigenvalue
Measure the distance d between Cα_j and Cβ_i
Build a bipartite distance matrix (indexed by i and j) by comparing two substructures
Compute the largest eigenvalue λmax
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
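The reduction from two substructures to a single number can be sketched as follows. One reading of "bipartite distance matrix" is assumed here: the symmetric block matrix [[0, D], [Dᵀ, 0]], where D holds the pairwise distances between the atoms of the two substructures. Under that assumption, λmax equals the largest singular value of D, which the sketch finds by power iteration on DᵀD. The function names and the iteration settings are illustrative only.

```python
import math


def distance_matrix(sub_a, sub_b):
    """D[i][j] = Euclidean distance between atom i of sub_a and atom j of sub_b."""
    return [[math.dist(p, q) for q in sub_b] for p in sub_a]


def largest_eigenvalue(d, iters=500, tol=1e-12):
    """Largest eigenvalue of [[0, D], [D^T, 0]], i.e. the largest singular
    value of D, computed by power iteration on D^T D."""
    n_rows, n_cols = len(d), len(d[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols            # unit start vector
    sigma = 0.0
    for _ in range(iters):
        u = [sum(d[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        new_sigma = math.sqrt(sum(x * x for x in u))  # ||D v|| -> sigma_max
        w = [sum(d[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                               # D is the zero matrix
            return 0.0
        v = [x / norm for x in w]
        if abs(new_sigma - sigma) < tol:
            return new_sigma
        sigma = new_sigma
    return sigma
```

Power iteration is applied to DᵀD rather than to the block matrix itself because the block matrix has eigenvalue pairs ±σ of equal magnitude, on which plain power iteration does not settle.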
![Page 24: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/24.jpg)
Case Study: Capturing Movement of α-helices
Capture movement of structures (α-helices) with respect to each other
[Figure: snapshots labeled 1330, 1360, and 1390]
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
![Page 25: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/25.jpg)
Case Study: Capturing Movement of α-helices
Monitor largest eigenvalue of entire protein
![Page 26: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/26.jpg)
Case Study: Capturing Movement of α-helices
Something is changing
Monitor largest eigenvalue of entire protein
![Page 27: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/27.jpg)
Case Study: Capturing Movement of α-helices
Individual α-helices (Helix 1, Helix 2, and Helix 3) appear stable
Monitor largest eigenvalue of single α-helices
![Page 28: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/28.jpg)
Case Study: Capturing Movement of α-helices
Monitor largest eigenvalue of bipartite distance matrix
First and second α-helices appear stable; third helix moves
![Page 29: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/29.jpg)
Case Study: Capturing Movement of α-helices
[Figure: snapshots labeled 1330, 1360, and 1390]
Analysis: linear in complexity using local metadata (eigenvalues) with DataSpaces
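Because each frame is reduced to one eigenvalue, a change such as the moving third helix can be watched for with constant work and memory per frame. The sketch below is one possible detection rule, not the one used in the cited work: it keeps a running mean and variance (Welford's update) of the eigenvalues seen so far and flags a frame whose eigenvalue deviates by more than `threshold` standard deviations. The class name and both parameters are assumptions.

```python
import math


class EigenvalueMonitor:
    """Streaming check on per-frame eigenvalues using only local metadata."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold   # allowed deviation, in standard deviations
        self.warmup = warmup         # frames used to build the baseline
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                # running sum of squared deviations

    def update(self, value):
        """Return True if this eigenvalue departs from the stable baseline."""
        flagged = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            flagged = abs(value - self.mean) > self.threshold * max(std, 1e-12)
        if not flagged:              # only stable frames extend the baseline
            self.n += 1
            delta = value - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (value - self.mean)
        return flagged
```

Feeding the monitor one eigenvalue per frame costs O(1) per update, so the total analysis cost grows linearly with the number of frames and no past frame needs to be kept.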
![Page 30: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/30.jpg)
Analytics for Molecular Dynamics
• Drug design and protein-ligand docking
• Protein folding and rare events
• Protein variants expressed from yeast or bacteria and protein engineering
![Page 31: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/31.jpg)
A4MD: Proteins with Similar Functions
Key principle: proteins with similar sequences have similar functions
• Measure millions of protein variants expressed from yeast or bacteria
• Structure proteins to produce desired properties (protein engineering)
![Page 32: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/32.jpg)
Protein Representations
[Figure panels: 3D Cartesian representation; Surface representation; Multi-fold representation]
![Page 33: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/33.jpg)
From Multi-fold Representation to Image Encoding
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
![Page 34: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/34.jpg)
From Multi-fold Representation to Image Encoding
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
![Page 35: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/35.jpg)
From Multi-fold Representation to Image Encoding
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
![Page 36: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/36.jpg)
Case Study: High-Throughput Protein Analysis
convolutional neural network
• 62,991 proteins from the Protein Data Bank
• Eight biological processes from the biological process taxonomy in RCSB-PDB
Proteins as 3D tens
T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.
Google’s Inception-v3, Gem-Net
![Page 37: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/37.jpg)
Challenges and Opportunity
A workflow that integrates both simulations and analytics must have these key properties:
• Efficiency: Optimize workflows' performance and power usage associated with data movement and analytics
• Generality: Build workflows that support different types of analytics across different MD applications
• Non-invasive: Capture data from MD simulations without rewriting legacy codes or simulation scripts
• Portability: Execute combined simulations and analytics across different platforms and with heterogeneous resources
• Scalability: (Re)design ML algorithms for knowledge discovery at scale
![Page 38: Algorithms for In Situ Data Analytics of Next Generation ... · Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows Michela Taufer Department of](https://reader034.fdocuments.us/reader034/viewer/2022052015/602df9f2f3c75a64f1210301/html5/thumbnails/38.jpg)