
A View from ORNL: Scientific Data Research Opportunities in the Big Data Age

Scott Klasky∗†‡, Matthew Wolf∗, Mark Ainsworth∗§, Chuck Atkins∗∗, Jong Choi∗, Greg Eisenhauer‡, Berk Geveci∗∗, William Godoy∗, Mark Kim∗, James Kress∗, Tahsin Kurc∗‖, Qing Liu∗¶, Jeremy Logan†, Arthur B. Maccabe∗, Kshitij Mehta∗, George Ostrouchov∗†, Manish Parashar††, Norbert Podhorszki∗, David Pugmire∗†, Eric Suchyta∗, Lipeng Wan∗, Ruonan Wang∗

∗Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
†The University of Tennessee, Knoxville, TN, USA
‡School of Computer Science, Georgia Institute of Technology, Atlanta, GA 30332, USA
§Division of Applied Mathematics, Brown University, Providence, RI 02912, USA
¶Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
‖Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA
∗∗Kitware Inc., 28 Corporate Drive, Clifton Park, New York 12065, USA
††Computer Science Department, Rutgers University, New Brunswick, NJ, USA

Abstract—One of the core issues across computer and computational science today is adapting to, managing, and learning from the influx of “Big Data”. In the commercial space, this problem has led to a huge investment in new technologies and capabilities that are well adapted to the sorts of human-generated logs, videos, texts, and other large-data artifacts being processed, and has resulted in an explosion of useful platforms and languages (Hadoop, Spark, Pandas, etc.). However, translating this work from the enterprise space to the computational science and HPC community has proven somewhat difficult, in part because of fundamental differences in the type and scale of the data and the timescales surrounding its generation and use. We describe a twelve-year research and development plan centered on the concept of making Input/Output (I/O) intelligent for users in the scientific community, whether they are accessing scalable storage or performing in situ workflow tasks. Much of our work is based on our experience with the Adaptable I/O System (ADIOS 1.X) and our next-generation version of the software, ADIOS 2.X [1].

I. INTRODUCTION

As the HPC community moves broadly towards exascale capabilities, and as new science facilities like the Square Kilometer Array (SKA) move towards larger data capture rates (and subsequently larger data transport, processing, and storage demands), there is a renewed focus on what is needed to enable intelligent and scalable I/O – not merely as the transfer of large numbers of bytes, but as a vehicle for managing and expressing complex queries and data requirements in order to extract the most possible from large science data sets. When even a single run of a current science code can generate hundreds of petabytes of data, the classic HPC approach of writing it all to disk and then inundating a postdoc to analyze it with processing scripts after the fact simply does not work.

The basic nature of the problem for HPC applications will surely be familiar to anyone who has paid attention to the commodity Big Data space. The key observation for scientific datasets, however, goes back to one of the original discussions of what constitutes Big Data [2] – the “3 V’s” of Volume, Velocity, and Variety. Where most systems in the enterprise/internet space focus on large Volume, scientific data is complex (high Variety) and high rate (Velocity) in addition to being merely large. The design choices for how to manage such I/O therefore need to be able to respond to that difference.

Conversely, traditional HPC I/O solutions tend to fall short because of their expectation that I/O is strictly a matter of moving bytes from one level to another. The enormous complexity of current and next-generation HPC hardware, where data must be retrieved from and stored to layers of non-volatile memory, burst buffers, campaign storage, parallel file systems, object stores, and/or cold storage like tape, means that user code must either take explicit control of all of those placement and transfer options, or must blindly trust a third-party tool to do the optimizations for it. Adding to this is the software complexity generated by the rise of new simulation and analysis models that do not depend on single monolithic implementations. In situ processing and analysis of data, multi-physics code coupling where each piece is written by a different team, and ensemble-based execution models all add complexity to the traditional notion of I/O, blurring the lines between messaging, storage, and database lookup techniques.

In our work over the last decade, based on research and development going back even further than that, we have developed an approach for Intelligent I/O that we believe serves as a better vehicle for moving the community forward into the next generation of data-rich environments. At the core of this is a reorganization of I/O that recognizes that our traditional usage mirrors very closely a set of abstractions developed in the streaming computing community over many years, namely the Publish/Subscribe (Pub/Sub) programming model. In a blind Pub/Sub system, publishers do not know who the subscribers are and vice versa; they merely coordinate through a common name space to share what it is they want to share.

However, in order to support the emerging analytics, processing, and storage use cases, we believe that this by itself is too restricted a model. In this era, data cannot be considered passive, directly falling through a chute connecting publishers and subscribers. There must be a service-oriented architecture that connects them; actors must be involved to touch, manage, maintain, and abstract the data and to support in situ analysis, visualization, etc. These sets of actions must be managed and orchestrated across the wide array of resources in a way that enables not just imperative connections (“Output A must go to Input B”) but also new models of learning and intelligence in the system (“Make this data persistent, but watch what I’ve been doing to other data sets and pre-process this data based on that”).

Concretely, we have been developing an expanded definition of the Publish/Subscribe paradigm that we believe enables both high performance on current hardware and a much richer future environment that can leverage developments in cloud data analytics, deep learning, and the Internet of Things, as well as high performance and computational science research. To the initial two roles of publisher and subscriber, we add new programmatic roles – manager, clerk, consultant, and resource manager. All of these are defined in ways that enable them to both scale up and scale out; for instance, the manager role could be implemented as a single global master process, or it could be implemented as a distributed, peer-to-peer control system. As we will detail further in the sections below, a critical distinction is a rigid separation between roles that operate in the data plane versus the control plane, and the nature of the communications that occur between them.
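To make the division of roles concrete, the following sketch outlines one possible set of Python interfaces for the actors described above; the class and method names are hypothetical illustrations of the abstraction, not the ADIOS API.

```python
from abc import ABC, abstractmethod

# Data-plane roles: they see the bytes.
class Publisher(ABC):
    @abstractmethod
    def advertise(self, stream_name, metadata):
        """Announce that new data is available, without pushing it anywhere yet."""

class Subscriber(ABC):
    @abstractmethod
    def register(self, stream_name, requirements):
        """Declare interest in a stream, possibly with accuracy/latency requirements."""

class Clerk(ABC):
    @abstractmethod
    def transform(self, data, directive):
        """Apply compression, reduction, or placement as directed by the Manager."""

# Control-plane roles: they see metadata, costs, and policies, not the bulk data.
class Consultant(ABC):
    @abstractmethod
    def estimate_costs(self, options, observed_metadata):
        """Return predicted cost for each candidate action."""

class ResourceManager(ABC):
    @abstractmethod
    def allocate(self, task, constraints):
        """Launch or place a task on available resources."""

class Manager(ABC):
    @abstractmethod
    def decide(self, policies, cost_estimates):
        """Weigh policies against costs and emit directives for Clerks."""
```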

As we look towards future systems with extreme heterogeneity [3], we describe here our model for intelligent I/O based on developments with the Adaptable I/O System (ADIOS) [1] and other related frameworks such as DataSpaces [4] and EVPath [5]. This work has been carried out in close collaboration with science teams across many areas, with data sources ranging from experimental and observational facilities to simulations both large and small.

Figure 1: Intelligent I/O Middleware Requirements

1) The I/O system must take into account that a simulation/workflow may consist of multiple coupled applications as well as monolithic applications.

2) The system must take into account heterogeneous systems, a complex memory hierarchy, and increasing data volumes.

3) It must provide an infrastructure for dynamic management of data and workflow in an intelligent way, supported by learning.

4) It must provide an intelligible way to express user intent, along with an innovative way to represent constraints and policies.

As we will describe in the sections below, we believe that a suitable extension of the existing Publish/Subscribe messaging abstraction allows for a unification of high performance I/O with the sorts of advanced I/O services that are required to support intelligence across the HPC, cloud, and IoT scenarios we described above. In Figure 1, we summarize our key requirements for this future system; in the following section we describe the motivation behind these requirements. In the subsequent sections, we describe the architecture design and its component elements. Through this extended abstraction for I/O, we believe that a wide array of implementations, from purely library-based to complex, distributed service frameworks, can be constructed to address the I/O needs of the data-intensive applications of the future.

II. MOTIVATION

In this section, we further explain our motivations for reorienting the dominant I/O paradigm, exemplifying some of the types of issues that will need to be addressed in the years to come. There are many piecemeal approaches to some of the challenges that we describe here and in the sections that follow, but it would be preferable to establish a more comprehensive solution that offers a richer environment for application programmers, computer scientists, and data scientists alike. Technology changes, analysis and visualization support, and the science applications themselves all play roles in shaping the motivation.

A. Technology Motivations

Future supercomputers will have highly heterogeneous architectures. One machine may have nodes with a combination of regular CPUs, GPUs, and FPGAs, along with a more complex memory hierarchy, while the next machine will have a completely different mixture. Furthermore, systems will include on-node and/or off-node SSDs or non-volatile RAM, in addition to campaign storage and an underlying parallel file system. Data will move across a vast spectrum of components from the point of being produced to the point of arriving at persistent storage. Because the ratio of FLOPS to ultimate output bandwidth continues to grow, it is also true that during this process data will have to undergo transformations and various analyses while it is in motion. The rise (or return) of in situ processing as a paradigm has been driven by this recognition.

Complex workflows running on complex systems give rise to some fundamental challenges with regards to data management and user productivity. Modern science applications are no longer only monolithic, single-author MPI applications; they may also be workflows consisting of multiple application codes coupled with each other. Each of these codes produces data that is compressed and analyzed at runtime to derive information that drives the workflow. This data is visualized live, with an emphasis on avoiding expensive post-processing. While it is possible to build systems that can be tuned dynamically by a human in the loop, intelligent systems with the capability to automatically tune workflows and drive them according to data events observed at runtime will lead the way in the design of modern computing infrastructure.

B. Visualization and Analysis

In almost any scientific scenario, data products need to be visualized to check for problems and to gather insight. Analysis is also critical, as the raw data are never the final answer. In HPC environments, these analysis and visualization tasks tend to be data driven, and are subject to load imbalance, both on-node and off-node. In order to be effective, the tasks must be flexible enough to respond to these differing adaptations and behave as good citizens in this complex landscape. These constraints of good citizenship fall into three major categories: resource utilization, the timeliness of results, and the accuracy of results. (1) Visualization tasks must have the flexibility to operate under imposed resource limitations. Efforts like VTK-m [6] are providing abstractions of heterogeneous architectures that enable portable implementations of algorithms. (2) Operating under strict time constraints requires a fundamental understanding of algorithm performance. To support this, good performance models (e.g., [7], [8]) are needed that allow for scheduling of visualization tasks within the context of an entire scientific campaign. (3) An axis orthogonal to resources and time is the acceptable error bound for analysis and visualization results [9]. The accuracy of results is related to the quality of the data (reduced data vs. non-reduced data), as well as the algorithms employed in the analysis and visualization tasks. Data accuracy is highest at the top of the storage hierarchy, and generally becomes lower (temporally, spatially, or both) as data moves down the storage hierarchy. Acceptable error bounds will dictate when and where analysis and visualization tasks should be performed.

Because it is very unlikely that scientists will know exactly what they need before running their experiments, analysis of large-scale scientific datasets always requires post-processing of the data. Diagnostic quantities are saved at a given frequency for use in subsequent (offline) analysis and/or visualization. The calculations that compute these diagnostics are actually in situ analysis, and ideally as much of the additional offline analysis/visualization as possible would be pulled online too. The overhead of doing so consists of two factors: the I/O time and the time to perform the visualization/analysis processing. The latter should be sufficiently faster than the rest of the application during that interval, so it does not lag the application. Ideally, a scientist could declare the maximum tolerable overhead and the minimum required output frequency, and the I/O framework would then choose the actual output rate based on the total overhead relative to the application. Memory requirements depend on the analysis context [10] or on the availability of streaming versions of the analysis algorithms. Further optimizations can be made if the framework can adjust resource balancing, e.g., dedicating more processors to a parallel analysis code. This requires a system that is capable of making reallocation decisions autonomously, and that is empowered to launch and reallocate applications based on those decisions.
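As a minimal sketch of the kind of decision the framework could make on the scientist's behalf, the following Python fragment picks an output interval from a declared overhead budget and minimum frequency; the variable names and the simple cost model (measured per-output cost amortized over the interval) are our own illustrative assumptions.

```python
def choose_output_interval(max_overhead_frac, min_interval_steps, max_interval_steps,
                           measured_output_cost_s, measured_step_time_s):
    """Return the smallest output interval (in simulation steps) whose amortized
    overhead stays within the user's budget, never exceeding max_interval_steps."""
    for interval in range(min_interval_steps, max_interval_steps + 1):
        # Amortized cost of one in situ output/analysis pass per `interval` steps.
        overhead = measured_output_cost_s / (interval * measured_step_time_s)
        if overhead <= max_overhead_frac:
            return interval
    return max_interval_steps  # Budget cannot be met; output as rarely as allowed.

# Example: outputs cost 4 s, a step takes 1 s, and the user tolerates 5% overhead
# but insists on output at least every 200 steps.
interval = choose_output_interval(0.05, 1, 200, 4.0, 1.0)  # -> 80 steps
```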

C. Science Application Examples

As concrete illustrations of the types of jobs that scientists run, we consider two specific HPC application examples that illustrate forthcoming challenges in several of the areas that we believe can be better addressed. They are composed of fairly sophisticated workflows, with components that are not currently as well supported as would be ideal on future systems.

Fig. 2. Workflow for adjoint tomography. The pre- and post-processing steps include human interactions.

1) Computational Simulation – Global Earth Tomography Model: Using seismic data generated by earthquakes as rays, one can create a detailed 3-D picture of Earth’s interior. Currently, the team in [11] is working on imaging the entire globe from the surface to the core-mantle boundary – a depth of 1,800 miles. They use one thousand earthquake events, each recorded at thousands of seismic stations all over the globe, in an iterative process that takes years to reach a satisfactorily fine tomographic model.

An iteration of the adjoint tomography workflow (see Figure 2) consists mainly of two large computation steps (forward and adjoint simulations) and two tedious processing steps (pre-processing and post-processing) that involve many manual tasks, small jobs, and ad hoc operations. In the pre-processing step, the scientists use scripting tools to clean up and prepare the data between the computational steps, while in post-processing they smooth the data before creating an updated Earth model and then evaluate it to decide if a new iteration should be executed. Today, the computational steps take less than two days; however, an iteration takes about a month because of the heavy human involvement.

If the workflow system could learn about the manual steps that are regularly taken by the scientists and then execute those steps automatically, it could speed up the processing steps and future iterations. In general, the same approach would help any scientist looking for clues to data of interest in their simulations’ output. Everyone tends to settle on some practice that becomes used regularly when first looking at new data. By automatically producing results that are identified as coming from steps regularly taken by the scientist, and that are not too costly, we could accelerate the knowledge discovery process.

Fig. 3. Workflow for coupling XGC and GENE in the WDM project, along with in situ data reduction and particle data analytics to track particles affected by or responsible for numerical instabilities.

This ability to learn user habits requires the ability to collect provenance, not just at the workflow level, but at the complete activity level of the user on a system. We need to collect provenance whenever anything happens to any data item produced in the workflow. Provenance can be collected as part of the data in an I/O framework for self-describing data. At each reading and writing of data, the relationship between input and output can be recorded automatically by the I/O framework in any tool or script.

Given such a rich set of metadata reflecting the way that the user interacts with data over time, there is a need for an entity in the intelligent I/O system capable of extracting the user's habits in a meaningful way, as well as an entity that can act on this understanding to improve the user’s experience by, for instance, suggesting a workflow composition for a new simulation run, or performing additional visualization steps that have been useful in the past and can be done at negligible cost. Such autonomous features, though potentially quite useful, would have to be introduced gradually and carefully to avoid stigmatization.

2) Near-real-time decision making for fusion experimental data: Another class of applications that impacts the design of future data management systems deals with the need to process high volumes of data in near-real time (NRT). The Whole Device Modeling project in the DOE Exascale Computing Program targets the first ever high-fidelity full-tokamak simulation framework by self-consistently coupling two individually developed applications: the Gyrokinetic Plasma Turbulence Code (GENE [12], [13], [14]), a continuum code that has been designed for the core of the tokamak, and the X-point Gyrokinetic Code (XGC1 [15], [16]), a particle-in-cell code that has been targeted to study the outer edge.

Figure 3 shows the coupled XGC1/GENE workflow. GENE and XGC1 are run simultaneously, sharing multiple quantities (e.g., the plasma distribution function) back and forth in an overlap region between the core and the edge. Both applications generate data that must be stored, in addition to diagnostic outputs, grid quantities, and reduced representations that are used for scientific analysis, run quality assessment, and visualization. This data is written out with a higher frequency than checkpoint-restart files and will often need to persist for months after the simulation. The exchange between XGC and GENE is frequent, possibly needing to occur as often as every time step (approximately one second), and it must be fast enough not to significantly impact performance. When one code is waiting for data from the other before it can continue, NRT feedback would inform the application to pursue auxiliary computation, reduction, analysis, or visualization, instead of idling. Quality/performance monitoring and associated provenance information will form the crux of the NRT feedback layer. Additionally, NRT decisions become necessary as more codes are coupled together to understand new physical phenomena that could never be studied before.

In the future, fusion scientists will include many more codes coupled together, adding boundary physics, magnetohydrodynamics, radio frequency heating, energetic particles, etc. As the understanding of the coupled physics progresses, a mechanism to steer and automatically manage the internal functioning of the simulations themselves will become important. A control system with a service-oriented architecture that provides such NRT functionality, using various aspects of the data as the steering mechanism, will form the basis for driving such research.

III. ARCHITECTURE FOR INTELLIGENT I/O MIDDLEWARE

As we noted previously, there is great synergy between our designs for high performance I/O systems and the publish/subscribe abstraction. Publish/subscribe has many implementations across a range of technical spaces, as the abstraction is useful for managing telecom updates, real-time business or government intelligence operations, and even something as ubiquitous as daily news updates. Implementations like System S from IBM [17], Tibco’s FTL [18], and Amazon’s Simple Notification Service [19] all offer key features for the commercial space, and open source tools like ZeroMQ [20] and Apache’s Kafka [21] offer messaging services that can operate in pub/sub as well as queue-based modes. EVPath from Georgia Tech [22], [5] and Meteor from Rutgers [23] are examples stemming from academic research.

Building from our experience with these tools (among others), we have developed an I/O abstraction for the ADIOS framework that appears very much like POSIX I/O to the user, but with enough of a tweak that it allows equally well for fully online or mixed at-rest/in-motion data retrieval. Here, we build upon the model in a way that allows for the expression of user intent on high performance I/O streams, which opens opportunities for a variety of active management and data processing tools. These active management and data analysis/transformation requirements stretch the existing channelized and brokered models for publish/subscribe.


Fig. 4. A system architecture showcasing the proposed design pattern for high performance I/O middleware.

As depicted in Figure 4, we propose extending the publish/subscribe metaphor even further to include several new actors, or roles, capable of providing functionality that covers both current and future needs for I/O systems. This future Intelligent I/O platform will help address the challenges posed by increases in the scale and heterogeneity of future systems, while meeting the requirements of key applications. Beyond the familiar Publisher and Subscriber actors, the pattern includes four others: Manager, Clerk, Consultant, and Resource Manager. Importantly, to preserve high performance there needs to be a separation between actions that occur in the control plane and those that occur in the data plane. However, this separation must also include the ability to delegate control decisions down to the lowest level when short, high-throughput control decisions are needed. Thus data moves through the system abstraction from Publisher to Subscriber as before, but now a Clerk is able to act on the data to apply compression, reduction, or data structure transformation, and/or to control the placement of that data in the storage hierarchy. In the control plane, a Manager directs the data management activities, deciding among different options for processing and directing data. The Manager is guided by one or more Consultants that provide predictive information about the costs of various options, driven by direct observation and modeling of system and user behaviors. A Resource Manager is available to help bring new tasks online as needed to address dynamic requirements. For all of these roles it is important to remember that, because of our focus on high performance, we consider all of these elements to be parallel themselves.

The decisions of the Manager are therefore tied both to specific subscription requests and to the input of a complex set of costed decision trees provided by various types of Consultants. Interchanges between Consultants and the Manager, the Manager and the Resource Manager, and the Manager and Clerks all must use their own protocols for specifying policy requests. Policies may be declared by users, set by system administrators, or determined by the Manager at runtime. These policies will express different requirements with different priorities, and it will be up to the Manager to weigh conflicting policies and send final marching orders to the Clerk to be carried out. For example, a Clerk may receive a specific threshold on data size above which the data must be compressed. The Clerk would then assume the delegated control to enforce this particular policy in a tight control loop. This threshold would be determined by the Manager by consulting cost models provided by one or more Consultants. Although many runtime systems have aspects of this type of functionality (e.g., frame management in streaming video), a runtime to support intelligent I/O requires that the policy infrastructure be both flexible and extensible to allow for customization of new user- and facility-facing policies as the system evolves.
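As a sketch of the kind of delegated control loop described above, the following Python fragment shows a Clerk enforcing a Manager-supplied size threshold on each published data block; the directive format, the choice of zlib as the reduction method, and the block/stream representation are hypothetical stand-ins, not part of ADIOS.

```python
import zlib

def run_clerk_loop(incoming_blocks, directive):
    """Enforce a delegated policy locally: compress any block whose size exceeds
    the threshold chosen by the Manager, without a round-trip to the control plane."""
    threshold = directive["compress_above_bytes"]   # decided once by the Manager
    for block in incoming_blocks:                   # data-plane loop, runs at line rate
        payload = block["data"]
        if len(payload) > threshold:
            block["data"] = zlib.compress(payload)  # stand-in for a real reduction method
            block["metadata"]["compression"] = "zlib"
        yield block                                 # hand off toward the Subscriber

# Example directive a Manager might push down to this Clerk.
directive = {"compress_above_bytes": 1 << 20}       # 1 MiB threshold
blocks = [{"data": b"x" * (2 << 20), "metadata": {}}]
processed = list(run_clerk_loop(blocks, directive)) # the 2 MiB block gets compressed
```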

To make this more concrete, we want to enable a user’s subscription request to be able to provide a policy that includes cost function terms like the following, borrowing from the terminology in [24]:

δ(x) =
    0     if resolution > 9
    0.1   if resolution = 9
    0.2   if 6 ≤ resolution < 9
    1     if resolution < 6

This function corresponds to saying that a data resolution of 10 or greater is always good, but a resolution of 9 (in whatever localized scale) would carry a 10% penalty. Resolutions between 6 and 9 would carry a 20% penalty, and anything below that would be a 100% penalty. This policy choice, along with the request to minimize delivery time, gives the Manager the information needed to calculate simplified decision trees that can be deployed into a Clerk for implementation. Note that there are explicit metadata references (the resolution value) as well as implicit performance metadata (throughput and interconnect or storage retrieval times) in each of these policy components. Thus each of the system components is responsible for writing appropriate metadata when data is introduced, altered, or accessed. Metadata must be kept alongside related data in the data plane, since it must be leveraged by the system to make appropriate decisions in the data plane. As we have seen, some portions of the metadata are critical in the control plane as well, so that the system can make intelligent decisions and can also learn and adapt when encountering situations similar to those seen before.
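A minimal Python rendering of this penalty term, and of how a Manager might combine it with a delivery-time estimate, is sketched below; the combination rule (penalty-weighted delivery time) is our own illustrative assumption rather than the formulation of [24].

```python
def resolution_penalty(resolution):
    """The piecewise penalty delta(x) from the text."""
    if resolution > 9:
        return 0.0
    if resolution == 9:
        return 0.1
    if 6 <= resolution < 9:
        return 0.2
    return 1.0

def pick_option(options):
    """Choose among candidate data products, each with a resolution and a
    predicted delivery time (e.g., from a Consultant's cost model)."""
    def cost(opt):
        # Illustrative combination: penalize slow delivery and low resolution.
        return opt["delivery_time_s"] * (1.0 + resolution_penalty(opt["resolution"]))
    return min(options, key=cost)

# Example: a full-resolution copy on the parallel file system vs. a reduced copy
# already staged in node-local memory.
options = [
    {"name": "full_on_pfs",    "resolution": 10, "delivery_time_s": 12.0},
    {"name": "reduced_staged", "resolution": 8,  "delivery_time_s": 2.0},
]
best = pick_option(options)   # -> reduced_staged (2.0 * 1.2 = 2.4 < 12.0)
```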

Since these more explicit components of policy exchange and metadata management are key to this broader abstraction, we explore each of them in more depth in the remainder of this section. Section IV then describes each of the roles in turn, providing details of their specifications through examples of past and current work in the space, as well as a view of how the abstraction enables a more intelligent high performance I/O capability in the future, as motivated in Section II.

A. Policies

Intelligent systems exhibit two main properties: 1) the ability to learn from available information, and 2) the ability to make decisions according to constraints set forth by users and by system requirements. A primary challenge that software makers face with both of these aspects is developing an effective means of communication between scientists and the underlying systems, as well as between components of a system. Enabling self-tuning of science workflows is a challenge, as different science teams have different workflows and constraints that a system has to learn and incorporate. For example, decisions that a human might take by observing artifacts at runtime (e.g., whether a crack has occurred in the simulated material) need to be transformed into programmatic decision points by 1) describing the artifact, and 2) describing the decision. As programming languages and libraries natively offer limited support for such expressibility, novel methods need to be invented to represent policy statements.

In this broad view, policies describe any control-plane workflows and events that lead to actions. They can be relatively simple, such as halting the data workflow when a certain data type is encountered; or more complex, such as using lossy compression methods to reduce output size while also dynamically spawning online analysis and testing to ensure that there is continuity of important data quality features (like streamers in fusion plasma simulations) in the reduced form.

A policy must be able to describe and include the following aspects of the workflow and data.

1) It must efficiently describe detailed cause-and-effect relationships between data and workflows: “Take this action when this event in the data is observed”.

2) It must incorporate quality-of-service requirements for data and workflows: “Compress the data with lossy transformation techniques at level 7 if it takes less than 2 ms, else set it to level 4”.

3) It must build cost functions for different actions that are a combination of accuracy, performance statistics, and past workflow states.

In its general form, the policy language used for this abstraction represents a way of making concrete the decision tree that must be followed in order to actively manage the I/O streams. The nature of those decision trees varies – the communications between Consultants and a Manager are more open (A, B, and C are all possible, but here are the costs), while those between the Manager and the Resource Manager, or the Manager and a Clerk, are more constrained (use this threshold to decide between A and B – we will never use C). This is key to our vision of the control plane, as it means that the focus is on controlling the distributed system by sending functions that allow for delegated, localized control decisions, rather than requiring constant feedback from a centralized service.
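The following sketch shows one hypothetical way such a policy could be written down as plain data and then compiled by a Manager into the constrained form handed to a Clerk; the schema and field names are illustrative assumptions, not a defined ADIOS policy language.

```python
# A user- or admin-declared policy: an event/condition, an action, and a QoS bound.
policy = {
    "trigger": {"variable": "electric_field", "event": "new_step_available"},
    "action": {"kind": "lossy_compress", "level": 7},
    "qos": {"max_action_time_ms": 2, "fallback": {"kind": "lossy_compress", "level": 4}},
    "priority": 10,
}

def compile_for_clerk(policy, consultant_estimate_ms):
    """Manager-side step: fold the Consultant's cost estimate into a fixed directive,
    so the Clerk can act in a tight loop without consulting anyone."""
    if consultant_estimate_ms <= policy["qos"]["max_action_time_ms"]:
        return policy["action"]
    return policy["qos"]["fallback"]

directive = compile_for_clerk(policy, consultant_estimate_ms=3.5)  # -> level-4 fallback
```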

B. Metadata Management

Even in the absence of constant centralized feedback, it is clear that some control decisions cannot be made purely locally. There must be a flow of metadata across the system that enables timely and correct processing of data. In addition, rich metadata should be stored and indexed for later use by (1) the I/O subsystem, to learn user intentions and data processing patterns and improve its prediction and decision-making accuracy during the execution of workflows; (2) the system developers and administrators, to study and understand performance problems and tune the system; and (3) scientists, to inspect, interpret, and, if necessary, debug data results and organize datasets for future scientific use.

We will need to extend today’s I/O subsystems in order to capture, organize, and provide access to large volumes of complex metadata. This metadata not only describes a scientific analysis campaign and its data transformations but also includes performance metadata (e.g., how long it took to move and transform a data subset). We build upon concepts from the database and web communities, such as property graph models, semantic graphs [25], and graph databases [26], to provide support for such linked metadata.
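As a small illustration of what linked, property-graph-style metadata might look like for a single transformed data subset, consider the following sketch; the node and edge labels are hypothetical and no particular graph database is assumed.

```python
# Nodes carry properties; edges link data items to the operations and runs
# that produced them, so both scientists and the I/O system can query lineage.
nodes = {
    "run:xgc_0421":      {"type": "simulation_run", "code": "XGC1", "steps": 5000},
    "var:dpot_step_300": {"type": "variable", "name": "dpot", "step": 300,
                          "resolution": 10, "size_bytes": 8_589_934_592},
    "var:dpot_step_300_reduced": {"type": "variable", "name": "dpot", "step": 300,
                                  "resolution": 8, "size_bytes": 429_496_729,
                                  "compressor": "lossy", "max_rel_error": 1e-3},
    "op:reduce_300":     {"type": "transformation", "clerk": "clerk-17",
                          "wall_time_s": 0.42},   # performance metadata lives here too
}
edges = [
    ("run:xgc_0421", "produced", "var:dpot_step_300"),
    ("op:reduce_300", "consumed", "var:dpot_step_300"),
    ("op:reduce_300", "produced", "var:dpot_step_300_reduced"),
]

# A lineage query: which original variable does the reduced copy derive from?
producing_ops = [src for (src, rel, dst) in edges
                 if rel == "produced" and dst == "var:dpot_step_300_reduced"]
originals = [dst for (src, rel, dst) in edges
             if rel == "consumed" and src in producing_ops]   # -> ["var:dpot_step_300"]
full_resolution = nodes[originals[0]]["resolution"]           # -> 10
```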

Efficient mechanisms for integrating such concepts into I/O subsystems and structured, self-describing file formats are needed. As the complexity of data analysis workflows and the sizes of datasets continue to increase, support for linked metadata will need to deal with very high rates of metadata insertions and updates and very large volumes of metadata entries. NoSQL database technologies have been developed in commercial environments [27], [28] to address big data management challenges, but these systems are generally designed to scale horizontally across relatively homogeneous sets of compute and storage nodes. Metadata management solutions on next-generation supercomputers will need to deal with extreme heterogeneity in order to scale vertically as well as horizontally. In addition, metadata will have to be managed at multiple scales and resolutions throughout the system to enable both queries for near-real-time decision making by the I/O subsystem during workflow execution and queries by scientists who explore and compare large subsets of data for inspection and debugging of analysis results.

IV. SYSTEM ACTORS

This section provides more detailed descriptions of each of the actors in the proposed system: the Publishers/Subscribers, Clerk, Consultant, Resource Manager, and Manager. In brief, we envision an extension of the publish/subscribe metaphor to include a Clerk that sits between the Publisher and Subscriber and mediates or orchestrates data streams in a dynamic fashion. Data can be adaptively changed over time depending on objectives defined by the users or the Manager. The Consultant adds system and performance feedback to help inform the Manager’s decisions, and the Resource Manager enables complex workflows to best utilize the platform’s resources, given the many competing requirements.

A. Publisher/Subscriber

The reader is likely familiar with the roles of Publisher (Data Producer) and Subscriber (Data Consumer). In today’s systems, these interactions are typically active, with names like read and write, or put and get. In our extended publish/subscribe pattern, these roles perform interactions that are passive. A Publisher would not write data directly, but would advertise the availability of data, leaving the decision of whether and how to act on that data up to the Manager. And a Subscriber would register to receive data, but might need to be flexible about the exact precision and resolution of that data, as the system could send reduced data in some situations. A more dramatic change is that a Subscriber may not be executed until resources become available, so analysis codes would have to be provided in some standard way that allows the system to control them, such as through containerization.

A concrete example calling for passive publish/subscribe flexibility is simulation checkpoint/restart. Large scale applications must create checkpoints regularly, because system failures are expected over time. The failure rate depends on the given system, its current stability, and the scale of the application, with some week- or even month-long periods being more error-prone. Nevertheless, the frequency of checkpointing is usually a manually edited input parameter to the application code that the user is expected to fix. Accounting for worst-case scenarios is a typical strategy, but targeting the worst case all the time can incur significant overhead; checkpoint outputs are usually large and expensive to write frequently. With some guidance provided by the I/O framework, applications could instead postpone creating new checkpoints as long as possible in an automated way. Perhaps the simplest approach would be to report the availability of new data at every iteration, as if a checkpoint were going to be written, but allow the system to save data only as infrequently as possible given the expected failure rate for the current environment.
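One way the framework could turn an observed failure rate into a checkpoint decision is the classic Young/Daly approximation for the optimal checkpoint interval, sketched below; using it inside the advertise-every-step model described above is our own illustrative assumption.

```python
import math

def young_daly_interval_s(checkpoint_cost_s, mtbf_s):
    """First-order optimal time between checkpoints: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

def should_checkpoint(now_s, last_checkpoint_s, checkpoint_cost_s, mtbf_s):
    """Called each time the application advertises a new data step; the system,
    not the application, decides whether this step actually gets persisted."""
    return (now_s - last_checkpoint_s) >= young_daly_interval_s(checkpoint_cost_s, mtbf_s)

# Example: checkpoints cost 120 s to write and the machine's observed MTBF is 24 h.
interval = young_daly_interval_s(120.0, 24 * 3600.0)   # ~4,550 s, i.e. about 76 minutes
```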

1) Current Research: ADIOS, based on the publish/subscribe model, has demonstrated excellent performance for traditional parallel I/O [1]. To further support publisher/subscriber enhancements, ADIOS has been moving toward a more passive approach. The ADIOS read API allows the user to register (or “schedule”) read operations to be applied to an incoming data step, and then block until the data becomes available. On the write side, ADIOS allows individual write operations to be buffered, and the data may subsequently be written to file in a number of different ways using different write methods. For instance, the MPI_AGGREGATE method enhances performance for highly parallel writes by assigning each writing process to a group and designating one process from each group (the aggregator) to perform all of the filesystem writes for the group. In contrast, use of the POSIX method allows the same application to forego this aggregation, resulting in every process writing its data independently. The selection of a particular write method can have a significant effect on write performance, though currently the user must make these kinds of choices based on experience, trial and error, or guesswork. Control over this and other similar functionality will need to be extended to allow external control by the Clerk based on the Manager’s decisions.
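The schedule-then-perform reading pattern described above can be illustrated with the Python sketch below; the stream, step, and request objects here are a hypothetical mock meant to show the control flow, not the actual ADIOS read API.

```python
class _MockRequest:
    def __init__(self, name):
        self.name, self.data = name, None

class _MockStep:
    """Tiny in-memory stand-in for a data step, so the control flow below runs."""
    def __init__(self, arrays):
        self._arrays, self._pending = arrays, []
    def schedule_read(self, name):
        req = _MockRequest(name)        # deferred: nothing is moved yet
        self._pending.append(req)
        return req
    def perform_reads(self):
        for req in self._pending:       # batch execution of all scheduled reads
            req.data = self._arrays[req.name]
        self._pending = []

class _MockStream:
    def __init__(self, steps):
        self._steps = steps
    def steps(self):
        return iter(self._steps)        # in a real system this blocks for new steps

def analyze_stream(stream, variables_of_interest):
    """Register ('schedule') reads for each arriving step, then let the I/O layer
    satisfy them in one batch - the deferred pattern described in the text."""
    for step in stream.steps():
        requests = [step.schedule_read(name) for name in variables_of_interest]
        step.perform_reads()
        for req in requests:
            print(f"{req.name}: {len(req.data)} elements received")

analyze_stream(_MockStream([_MockStep({"dpot": [0.1, 0.2, 0.3]})]), ["dpot"])
```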

2) Future Vision: We envision that the view of data for publishers and subscribers will change dramatically. Data will no longer be simple streams of bytes exchanged in a static manner. Instead, it will be viewed as dynamic streams managed by a set of complex policies at runtime. For example, the precision of data can be adaptively changed and refactored over time, in ways similar to [29]. The data’s locality can then be determined at runtime, as we have begun studying in the Sirius project [30]. Correspondingly, publishers and subscribers should be able to express such dynamism and need to understand and process data at runtime. In that regard, developing generic algorithms or methods to find multi-level/multi-resolution data representations will be valuable. Advanced subsetting and aggregation of data will be needed to support more iterative and stream-oriented analysis methods.

B. Clerk

At increasing scale and complexity, HPC applications will need to become more flexible and autonomous in producing and consuming data. To help keep the focus of Publishers and Subscribers on their own computational goals, the Clerk will be responsible for any and all in situ data services required between the Publisher and Subscriber. This extends beyond predefined data conversion or static intermediate data storage to include actively performing services such as indexing and querying, format conversion and translation, or even changing data or its associated precision level. Managing such in situ services at run time will necessitate dynamic routing of data streams. In all of this, the Clerk must be transparent and trustworthy. This means that its impact on the veracity of the data delivered from the Publisher to the Subscriber must be disclosed (via metadata) to the Subscriber. Preferably, the nature of any change to the data should have an intuitive interpretation for the scientist, or else the system will struggle to gain acceptance in the scientific community.

The Clerk will be autonomous within the constraints given to it by the Manager in order to provide services in heterogeneous computing environments with dynamic factors. For example, the Manager (with input from the Consultant and Resource Manager) may alter compression ratio decisions depending on data quality at runtime, system usage, or disk availability, and then dispatch the Clerk to execute the reduction accordingly. A similar applicable example is the one from Section II-B, where we highlighted how analysis and visualization services need to be deployed in a manner that respects time and accuracy requirements given the system constraints.

1) Current Research: We have been incorporating some aspects of the Clerk into ADIOS. For instance, ADIOS has a transformation layer that allows compression services to be applied to data being written to disk. We have used this to compress data in multi-scale physics coupling experiments where two concurrent applications exchange data while running, as we demonstrated at SC 2017. Another example is the Sirius project [30]. Based on ADIOS, the project aims at developing a transparent layer that adaptively decomposes data on the user's behalf and stores it in different levels of a deep storage hierarchy to achieve optimal data placement for reading and writing [29]. This transformation capability not only allows data to be modified by the I/O system, but also adjusts the accompanying self-descriptive metadata to reflect how the transformed data relates to the original data. This is necessary because these data transformations can result in data with byte organization and dimensionality that differ from the data written by the application, and the metadata is used by the Subscriber to make sense of the reduced data.

SENSEI [31], a project to develop infrastructure for in situ analysis, is also noteworthy. SENSEI enables users to write an in situ method once and deploy it with any number of in situ infrastructures to perform online and ad hoc functions, providing interoperability between concurrent processes or applications.

2) Future Vision: The Clerk makes decisions that affect the speed of data delivery as well as the veracity of the delivered data, within parameters given to it by the Manager. The speed is largely independent of the science being done, leaving many opportunities to research alternatives and extract optimized solutions that drive an autonomous Clerk. The veracity is more dependent on the science being done, and requires transparency and intuitive interpretation.

Optimizing the agility of the Clerk is an intriguing research direction. Plugins are responsible for communicating their relative internal performance data through the shared metadata infrastructure. When sufficient data or understanding of a given parameter space is available, it becomes possible to specify and solve a local optimization problem, leading to performance improvement. This means that a given Clerk service needs to be exercised in a wide variety of possible states at runtime to collect sufficient data. This variety may mean different hardware configurations, different compression and transport mechanisms, and different levels of veracity, all guided by an experimental design to get the necessary performance information with respect to the parameters and their interactions. A carefully designed plug-in mechanism will allow self-tuning of Clerk operations.
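A minimal sketch of such local self-tuning is shown below: a Clerk records the measured throughput and error of each plugin it has exercised and then picks the fastest one that still meets the veracity bound handed down in its directive. The plugin names and the record format are hypothetical.

```python
from collections import defaultdict

class SelfTuningClerk:
    """Keep running performance observations per plugin and choose among them locally."""
    def __init__(self):
        self.observations = defaultdict(list)   # plugin -> [(throughput_MBps, max_rel_error)]

    def record(self, plugin, throughput_mbps, max_rel_error):
        self.observations[plugin].append((throughput_mbps, max_rel_error))

    def choose_plugin(self, error_bound):
        """Fastest plugin (by mean observed throughput) whose worst observed error
        stays within the bound delegated by the Manager."""
        candidates = []
        for plugin, obs in self.observations.items():
            mean_tp = sum(t for t, _ in obs) / len(obs)
            worst_err = max(e for _, e in obs)
            if worst_err <= error_bound:
                candidates.append((mean_tp, plugin))
        return max(candidates)[1] if candidates else None

clerk = SelfTuningClerk()
clerk.record("lossless_gzip", 450.0, 0.0)
clerk.record("lossy_level7", 2100.0, 1e-3)
clerk.record("lossy_level9", 3800.0, 5e-2)
best = clerk.choose_plugin(error_bound=1e-2)   # -> "lossy_level7"
```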

Generic, information-theoretic methods for generating multi-resolution forms of data, enabling later time-versus-quality trade-offs, should be further explored for future data refactoring; some initial work on mathematically robust approaches is already ongoing [32]. With well-understood mathematical properties, such methods are well suited for understanding the impacts of error levels at different resolutions. Schemas can be extended not only to include multiple resolutions but also to enable seamless access through popular analysis packages such as Pandas.

C. Consultant

Efficient and intelligent scientific data management at extreme scale requires a deep understanding not only of computing and storage systems’ performance characteristics, but also of scientific applications’ usage patterns [33]. Thus, functionalities such as performance prediction, provenance learning, etc., are critical for scientific data management systems attempting to make advanced performance- or usage-influenced optimizations. In current systems, these functionalities are either missing or implemented in an ad hoc manner. Our vision includes a novel component, called the Consultant, that assembles these functionalities and provides consultation services regarding data refactoring, placement, and movement. Namely, the Consultant’s major responsibilities include:

1) modeling and predicting the performance of computing and storage systems;

2) analyzing and understanding the provenance of scientific workflows;

3) providing guidance which can potentially make scientific data management more efficient by leveraging the results of performance prediction and provenance learning.

These responsibilities make the Consultant a unique component in scientific data management systems, and they also lead to several open research challenges and opportunities.

1) Current Research: A variety of methodologies and models have been proposed to study and understand performance statistics collected from HPC systems [34], [35], [36], [37], [38]. Based on the observed properties of I/O traces collected on Titan, [39] built a hidden Markov model to characterize and predict the I/O performance of the Lustre file system. Machine learning techniques were leveraged in [40] to build a decision-tree-based I/O prediction model using long-term I/O traces collected at LLNL. The performance of other components in the data management system, such as main memory [41], non-volatile memory [42], [43], MPI communication layers [44], etc., has also been widely studied.
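In the same spirit as the trace-driven models cited above, the sketch below fits a deliberately simple linear model (fixed latency plus a bandwidth term) to past write observations and uses it to predict the cost of the next write; it is meant only to show the shape of a Consultant's prediction service, not any of the published models.

```python
def fit_linear_io_model(observations):
    """Least-squares fit of time = a + b * bytes over (bytes, seconds) observations."""
    n = len(observations)
    sx = sum(x for x, _ in observations)
    sy = sum(y for _, y in observations)
    sxx = sum(x * x for x, _ in observations)
    sxy = sum(x * y for x, y in observations)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # effective inverse bandwidth
    a = (sy - b * sx) / n                            # effective fixed latency
    return a, b

def predict_write_time_s(model, nbytes):
    a, b = model
    return a + b * nbytes

# Example: past writes of 1, 2, and 4 GiB took 2.1, 3.9, and 7.6 seconds.
GiB = 1 << 30
model = fit_linear_io_model([(1 * GiB, 2.1), (2 * GiB, 3.9), (4 * GiB, 7.6)])
estimate = predict_write_time_s(model, 8 * GiB)   # rough extrapolation for an 8 GiB write
```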

The concept of leveraging provenance data has also been studied. For example, the Kepler scientific workflow system [45] allows users to record provenance information at runtime, which can be queried, analyzed, and visualized to gain a deeper understanding of how certain results were obtained as the workflow was executed. In [46], Deelman et al. propose an approach to remove redundant workflow activities based on the availability of intermediate provenance data produced by previous executions. Heuristics for detecting task execution failures and re-executing failed tasks based on real-time provenance data are introduced in [47]. The potential of applying data mining techniques to provenance data to predict future workflow execution performance and optimize computing resource allocation is discussed in [48].

2) Future Vision: Though the references cited in the previous section have made significant progress, we see several opportunities for innovation related to the Consultant. Due to limited computing resources, the system might not be able to collect as much performance and provenance data as needed when the overhead of collection is too high. The Consultant should still be able to provide reasonable-quality services even in these scenarios. It is possible to leverage a priori understanding of systems and applications to reduce the amount of performance and provenance data required by the Consultant. An ideal Consultant also needs to be fast enough to satisfy the timing requirements of scientific applications, which can be quite stringent. For example, near-real-time decisions need to be made when processing and analyzing experimental data collected by some scientific instruments. This means the Consultant must be able to provide guidance within a very short amount of time. Moreover, since the performance and provenance data could change rapidly during runtime, the Consultant also needs to quickly adapt to these changes. Furthermore, in order to provide guidance that can potentially make scientific data management more efficient, the Consultant must have the capability to automatically find patterns and learn features from the performance and provenance data that has been collected. There are opportunities to leverage state-of-the-art analysis, modeling, and learning techniques to enhance the Consultant’s predictive capabilities.

D. Resource Manager

A core requirement of our proposed framework will be the ability to dynamically tune workflows through intelligent resource management. Although this is closely related to the Manager role, we distinguish the two because the Resource Manager is responsible for the actuation of control-plane decisions, as opposed to the Manager, which determines the choices to be made. The Resource Manager is also most tightly coupled to the many existing technologies for resource management, including compute resources (CPU and GPU availability through batch schedulers and libraries like CUDA), memory and I/O pipeline components (system memory, high-bandwidth memory, burst buffers), and information about the parallel file system and system status (through RAS and performance measurement tools). Depending on SLAs and runtime decisions, the Resource Manager allocates the appropriate resources to each part of the specified workflow, both for data-plane and control-plane components.

Intelligent resource management is especially critical in scenarios subject to load imbalance. Section II-B highlights how HPC analysis and visualization often fall into this category, whether on-node or off-node. The resources made available to perform these tasks can have a dramatic effect on the time required to complete them.

Based on runtime feedback through its interface with the Manager, the Resource Manager may opt to utilize GPUs to run an application, or it may co-locate simulation and analysis processes on compute nodes to reduce communication or I/O overhead on a congested system. The Consultant maintains information about different optimization techniques pertinent to a system and a workflow’s characteristics. The Resource Manager utilizes this information to tune a workflow dynamically.
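The co-location decision described above can be sketched as a simple rule the Resource Manager might apply using Consultant-provided estimates; the estimate names and the congestion penalty are illustrative assumptions.

```python
def place_analysis_task(estimates, interconnect_congested):
    """Decide where to run an analysis component given Consultant estimates of the
    cost (seconds per step) of each placement option."""
    options = {
        "colocated_cpu":  estimates["colocated_cpu_s"],   # share nodes with the simulation
        "colocated_gpu":  estimates["colocated_gpu_s"],   # use idle GPUs on the same nodes
        "dedicated_node": estimates["dedicated_node_s"]   # separate nodes, data moved off-node
                          + estimates["transfer_s"],
    }
    if interconnect_congested:
        # Penalize the off-node option when the system is congested.
        options["dedicated_node"] *= 2.0
    return min(options, key=options.get)

choice = place_analysis_task(
    {"colocated_cpu_s": 1.8, "colocated_gpu_s": 0.6,
     "dedicated_node_s": 0.4, "transfer_s": 0.5},
    interconnect_congested=True)   # -> "colocated_gpu" (0.6 vs 1.8 vs 0.9*2 = 1.8)
```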

1) Current Research: Resource management when multiple components share a node, such as when computationally expensive Clerks must share space with a simulation, becomes very complex. In work such as Goldrush [49], we investigate ways to exploit slack cycles even in highly scalable HPC systems to complete complex in situ workflows. Landrush [50] similarly looks at how to time-share slack cycles from GPUs, while GPU Share [51] looks at how to partition the streaming processors in the GPU for concurrent execution of components.

Network-focused task mapping has been an active research area in parallel and distributed computing [52], [53], [54]. The goal is to find an optimal layout of the processes of an application onto a given network topology. As part of our exploration of this resource-manager space, we have tested a graph-theoretic task-mapping approach called Task Graph Embedding (TGE) [55] for large-scale mapping of simulation and in situ workflow components.
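To make the mapping problem concrete, the toy heuristic below greedily places each task on the node that minimizes the weighted distance to its already-placed neighbors; it is an illustration of the problem setting only, not the TGE algorithm:

```python
# Illustrative greedy task mapping (not TGE): place each task on the node
# that minimizes weighted hop distance to its already-placed neighbors.
import itertools

def greedy_map(comm, hops, nodes):
    """comm[(a, b)]: communication volume between tasks a and b.
    hops[(n, m)]: network distance between nodes n and m (0 if n == m).
    nodes: iterable of node ids, one slot per task for simplicity."""
    placement, free = {}, list(nodes)
    tasks = sorted({t for pair in comm for t in pair})
    for t in tasks:
        def cost(n):
            total = 0
            for (a, b), vol in comm.items():
                other = b if a == t else a if b == t else None
                if other in placement:
                    total += vol * hops[(n, placement[other])]
            return total
        best = min(free, key=cost)
        placement[t] = best
        free.remove(best)
    return placement

comm = {("sim", "viz"): 10, ("sim", "reduce"): 2}
hops = {(n, m): 0 if n == m else 1 for n, m in itertools.product("ABC", repeat=2)}
print(greedy_map(comm, hops, ["A", "B", "C"]))
```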

2) Future Vision: The Resource Manager will play an important role in providing an abstraction to expose advanced HPC system capabilities (accelerators, memory hierarchies), as well as to leverage resources across multiple sites, including locations in the cloud, as a means for analyzing data as it is being generated. The Resource Manager will have sophisticated capabilities to efficiently schedule available resources amongst the components of a workflow. Since a primary task of the Resource Manager is to manage resources to meet user constraints, this will involve developing smart algorithms that leverage historical information to schedule resources amongst codes and anticipated workflow components.
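One simple way historical information could enter such scheduling decisions is sketched below; the record format and the sizing policy are assumptions for illustration:

```python
# Hypothetical sketch: use records of past runs to size a component's
# allocation against a deadline. Data and policy are illustrative only.
history = {  # component -> list of (cores_used, runtime_s) from past runs
    "adjoint-preproc": [(128, 610), (128, 590), (256, 330)],
}

def suggest_cores(component, deadline_s):
    """Pick the smallest historical core count whose runtime met the deadline."""
    runs = sorted(history.get(component, []))
    for cores, runtime in runs:
        if runtime <= deadline_s:
            return cores
    return runs[-1][0] if runs else None  # fall back to the largest known size

print(suggest_cores("adjoint-preproc", deadline_s=400))
```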

E. Manager

Given the potential complexity of the new roles and capabilities of this new approach, we also include a Manager that is responsible for balancing, correcting, provisioning, and generally orchestrating the complex and sometimes contradictory constraints imposed by users and the hardware. For a Manager to be effective, it will need to generate low-level policy directives for the Clerk to carry out that satisfy the hard constraints imposed by incoming high-level policies (e.g., "Produce visualizations A, B, and C for some published simulation steps so that the latest visualization is no more than 10 seconds behind"). At the same time, it will have to consider which soft constraints (e.g., "If possible, produce visualization D using currently allocated resources") can be met. This means that the management role, be it distributed or centralized, will need timely information about the capabilities and constraints of the resources as well as the requested data access policies of the publishers and subscribers.
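As a minimal sketch (the encoding below is an illustrative assumption, not a defined policy language), hard and soft constraints could be carried as declarative records that the Manager evaluates before issuing directives to the Clerk:

```python
# Hypothetical policy records: hard constraints must hold, soft ones are
# satisfied opportunistically. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class Constraint:
    description: str
    hard: bool
    check: callable  # predicate over the current workflow state (a dict)

policies = [
    Constraint("viz A/B/C lag <= 10 s", hard=True,
               check=lambda s: s["viz_lag_s"] <= 10),
    Constraint("produce viz D if resources allow", hard=False,
               check=lambda s: s["idle_cores"] >= 64),
]

def plan(state):
    """Return directives for the Clerk; fail if a hard constraint is violated."""
    directives = []
    for p in policies:
        if p.check(state):
            directives.append(f"enable: {p.description}")
        elif p.hard:
            raise RuntimeError(f"hard constraint violated: {p.description}")
    return directives

print(plan({"viz_lag_s": 7, "idle_cores": 32}))
```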

1) Current Research: Many of today's scientific workflows are static, which allows suitably crafted batch submission scripts to serve as a one-time-only management implementation. Optimizations, when performed, are mostly accomplished through iterative hand-tuning of these static workflows. Even for workflows that are more dynamic, the way that the Manager and Resource Manager roles have been intertwined makes it difficult to discuss them separately. However, there have been some successful experiments in providing more flexible management orchestrators that fit our model.

Building on work with the ADIOS infrastructure, the previously mentioned Goldrush and Landrush had components of online management in addition to their resource-management complexities. Digging into more of the issues of policy languages and automated enforcement of dynamic controls, the SODA project [56] built a reference shared control-plane information bus and policy protocol. Elsewhere, in the ActiveSpaces project [57], we explore the movement of code to data, as well as data to code, when using DataSpaces as an in-memory HPC staging service.

As part of the ECP CODAR project [58], we are developing a set of tools to dynamically tune workflows. The Savanna library enables composing complex workflows consisting of multiple simulation and analysis components, and provides an interface to co-locate simulation and analysis processes to study the effects of such node partitioning on I/O and system performance.

Fig. 5. Chronological sequence of actions explaining the communication between publishers, subscribers, and the system actors for the Earth Tomography workflow. The actors are abbreviated as P = Publisher, S = Subscriber, M = Manager, C = Consultant, R = Resource Manager, and Cl = Clerk. The sequence of actions is:
1. Policy communication between publisher, subscriber, and manager.
2. Begin forward modeling; pre-load model file (SSD).
3. Launch online data processing analysis (subscriber).
4. Monitor performance; begin data compression.
5. Pre-processing: compute adjoint sources.
6. Backward modeling: prefetch wavefields into SSD.
7. Analysis process to optimize I/O.
8. Post-processing.

2) Future Vision: The great potential for intelligent I/O, with the ability both to apply learning techniques for automated improvements and to support higher-level user specification of intent, comes together in future innovations in the Manager role. There are a host of technical innovations that can be expected in policy expression and domain-specific languages, in distributed and parallel management processes, and in supervised and unsupervised learning for I/O performance tuning. The core of the Manager's role should build upon research from the fields of logic and artificial intelligence concerning computer-aided planning, decision theory, automated theorem proving, optimization with constraints, and so on.

V. PUTTING IT ALL TOGETHER: EARTH TOMOGRAPHY

Let us revisit the Earth tomography workflow described in Section II-C1 and envision how the new model's roles enable advanced application opportunities in the exascale (or post-exascale) era. Figure 5 describes the chronological set of events that occur in the system in terms of the communication between the various actors; a minimal orchestration sketch follows the list below.

1) First, the main components in the simulation, visualization, and forward and backward modeling phases state policy requirements with the system to establish Service Level Agreements (SLAs).

2) As the forward modeling phase begins, we assume that the Consultant has information from previous simulations that it can mine to tune the workflow. The Consultant and the Manager communicate to prefetch the model file into a fast memory module such as an SSD.

3) Next, parts of the post-processing of the forward modeling phase are executed as an in situ analysis application. Based on historical information, the Consultant provides parameters to the Manager, which in turn directs the Resource Manager to spawn the analysis component.

4) During the forward modeling phase, the Consultant mines live performance information to tune the workflow. As per the SLA, it may advise that the data stream be compressed with a lossless compression technique.

5) When the forward modeling phase concludes, post-processing that is currently done manually is performed automatically. The system mines historical information to set up the post-processing workflow.

6) The backward modeling phase begins. The Consultant advises the Manager to prefetch forward wavefields into a fast storage module (SSD).

7) Depending on the runtime I/O performance, the Consultant advises the Manager to optimize I/O. The Manager launches a staging routine through the Resource Manager to write image output asynchronously.

8) After the backward modeling completes, a post-processing workflow is launched with optimal parameters determined from historical information.
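The sketch below strings these steps together in code; the actor classes and method names are hypothetical stand-ins for the roles described above, not an existing interface:

```python
# Hypothetical orchestration of the Earth tomography sequence in Fig. 5.
# Classes and methods are illustrative stand-ins for the described roles.
class Manager:
    def __init__(self, consultant, resource_mgr):
        self.consultant, self.rm = consultant, resource_mgr

    def run(self, sla):
        self.rm.prefetch("model_file", tier="ssd")            # step 2
        self.rm.spawn("forward_postproc", in_situ=True)       # step 3
        if self.consultant.advise("forward", sla) == "compress":
            self.rm.enable("lossless_compression")            # step 4
        self.rm.spawn("adjoint_preproc")                      # step 5
        self.rm.prefetch("forward_wavefields", tier="ssd")    # step 6
        if self.consultant.advise("backward", sla) == "stage":
            self.rm.spawn("async_staging")                    # step 7
        self.rm.spawn("final_postproc")                       # step 8

class Consultant:
    def advise(self, phase, sla):
        # In practice this would consult models built from historical and
        # live performance data; here it is a fixed placeholder.
        return "compress" if phase == "forward" else "stage"

class ResourceManager:
    def prefetch(self, name, tier): print(f"prefetch {name} -> {tier}")
    def spawn(self, name, **kw):    print(f"spawn {name} {kw}")
    def enable(self, feature):      print(f"enable {feature}")

Manager(Consultant(), ResourceManager()).run(sla={"max_viz_lag_s": 10})
```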

VI. CONCLUSIONS

Increasing hardware complexity and new application requirements are driving the need for Intelligent I/O capabilities at the intersection of HPC and Big Data. Through current work, we have developed an extended abstraction for the publish/subscribe approach that enables both messaging and mass-storage I/O, as well as future innovation through modeling and learning. These extensions are organized around distinct roles in either the data- or control-plane that interact through policies that describe intentions and constraints, and through metadata that describes what has been or should be done. This extended Publish/Subscribe abstraction, with its Clerks, Consultants, Managers, and Resource Managers, forms the basis for an Intelligent I/O System capable of managing the data needs of the next generation of scientific applications.

ACKNOWLEDGMENT

Without the continued support from the Department of Energy's Office of Advanced Scientific Computing Research, the projects upon which this future vision rests would not be possible. Additionally, support from the DOE computing facilities at Oak Ridge and NERSC, as well as from the National Science Foundation, was also critical.

REFERENCES

[1] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, M. Parashar, N. Samatova, K. Schwan, A. Shoshani, M. Wolf, K. Wu, and W. Yu, “Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks,” Concurrency and Computation: Practice and Experience, vol. 26, no. 7, pp. 1453–1473, May 2014. [Online]. Available: http://doi.wiley.com/10.1002/cpe.3125

[2] D. Laney, “3D data management: Controlling data volume, velocity and variety,” META Group Research Note, vol. 6, no. 70, 2001.

[3] J. Vetter et al., “Report of the DOE workshop on extreme heterogeneity,” Department of Energy, Tech. Rep., in preparation.

[4] C. Docan, M. Parashar, and S. Klasky, “Dataspaces: An interaction and coordination framework for coupled simulation workflows,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC ’10. New York, NY, USA: ACM, 2010, pp. 25–36. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851481

[5] G. Eisenhauer, M. Wolf, H. Abbasi, and K. Schwan, “Event-based systems: opportunities and challenges at exascale,” in Proceedings of the Third ACM International Conference on Distributed Event-Based Systems. ACM, 2009, p. 2.

[6] K. Moreland, C. Sewell, W. Usher, L. t. Lo, J. Meredith, D. Pugmire, J. Kress, H. Schroots, K. L. Ma, H. Childs, M. Larsen, C. M. Chen, R. Maynard, and B. Geveci, “VTK-m: Accelerating the visualization toolkit for massively threaded architectures,” IEEE Computer Graphics and Applications, vol. 36, no. 3, pp. 48–58, May 2016.

[7] M. Larsen, C. Harrison, J. Kress, D. Pugmire, J. S. Meredith, and H. Childs, “Performance modeling of in situ rendering,” in SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2016, pp. 276–287.

[8] J. Kress, S. Klasky, D. Pugmire, and H. Childs, “Visualization and Analysis Requirements for In Situ Processing for a Large-Scale Fusion Simulation Code,” in Proceedings of the Second Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV), held in conjunction with SC16, Salt Lake City, UT, Nov. 2016.

[9] J. Kress, R. M. Churchill, S. Klasky, M. Kim, H. Childs, and D. Pugmire, “Preparing for In Situ Processing on Upcoming Leading-edge Supercomputers,” Supercomputing Frontiers and Innovations, vol. 3, no. 4, pp. 49–65, Dec. 2016.

[10] G. Ostrouchov and N. F. Samatova, “High end computing for full-context analysis and visualization: when the experiment is done,” White paper accepted by the High End Computing Revitalization Task Force (HECRTF), Washington, DC, June 16–18, 2003. [Online]. Available: https://www.researchgate.net/publication/259870252

[11] E. Bozdag, D. Peter, M. Lefebvre, D. Komatitsch, J. Tromp, J. Hill, N. Podhorszki, and D. Pugmire, “Global adjoint tomography: first-generation model,” Geophysical Journal International, vol. 207, pp. 1739–1766, November 2016. [Online]. Available: https://doi.org/10.1093/gji/ggw356

[12] W. Dorland, F. Jenko, M. Kotschenreuther, and B. Rogers, “Electron temperature gradient turbulence,” Physical Review Letters, vol. 85, no. 26, p. 5579, 2000.

[13] T. Gorler, X. Lapillonne, S. Brunner, T. Dannert, F. Jenko, F. Merz, and D. Told, “The global version of the gyrokinetic turbulence code gene,” Journal of Computational Physics, vol. 230, no. 18, pp. 7053–7071, 2011.

[14] F. Jenko, D. Told, T. Gorler, J. Citrin, A. B. Navarro, C. Bourdelle, S. Brunner, G. Conway, T. Dannert, H. Doerk et al., “Global and local gyrokinetic simulations of high-performance discharges in view of iter,” Nuclear Fusion, vol. 53, no. 7, p. 073003, 2013.

[15] C. Chang, S. Ku, P. Diamond, Z. Lin, S. Parker, T. Hahm, and N. Samatova, “Compressed ion temperature gradient turbulence in diverted tokamak edge,” Physics of Plasmas, vol. 16, no. 5, p. 056108, 2009.

[16] R. Hager and C. Chang, “Gyrokinetic neoclassical study of the bootstrap current in the tokamak edge pedestal with fully non-linear coulomb collisions,” Physics of Plasmas, vol. 23, no. 4, p. 042503, 2016.

[17] “Stream Computing Platforms, Applications, and Analytics,” https://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534, 2018, [Online; accessed 22-February-2018].

[18] “TIBCO FTL,” https://www.tibco.com/products/tibco-ftl, 2018, [Online; accessed 22-February-2018].

[19] “Amazon Simple Notification Service,” https://aws.amazon.com/sns/, 2018, [Online; accessed 22-February-2018].

[20] “Distributed Messaging - zeromq,” http://zeromq.org/, 2018, [Online; accessed 22-February-2018].

[21] “Apache Kafka,” https://kafka.apache.org/, 2018, [Online; accessed 22-February-2018].

[22] “The EVPath library,” https://www.cc.gatech.edu/systems/projects/EVPath/, 2018, [Online; accessed 22-February-2018].

[23] N. Jiang, A. Quiroz, C. Schmidt, and M. Parashar, “Meteor: A middleware infrastructure for content-based decoupled interactions in pervasive grid environments,” Concurr. Comput.: Pract. Exper., vol. 20, no. 12, pp. 1455–1484, Aug. 2008. [Online]. Available: http://dx.doi.org/10.1002/cpe.v20:12

[24] M. Wolf, H. Abbasi, B. Collins, D. Spain, and K. Schwan, “Service augmentation for high end interactive data services,” in Cluster Computing, IEEE International Conference on, vol. 0, Sep. 2005, pp. 1–11.

[25] J. J. Miller, “Graph database applications and concepts with neo4j,” in Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA, vol. 2324, 2013, p. 36.

[26] I. Robinson, J. Webber, and E. Eifrem, Graph Databases. O’Reilly Media, Inc., 2013.

[27] J. Han, E. Haihong, G. Le, and J. Du, “Survey on nosql database,” in Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on. IEEE, 2011, pp. 363–366.

[28] C. P. Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: A survey on big data,” Information Sciences, vol. 275, pp. 314–347, 2014.

[29] T. Lu, E. Suchyta, J. Choi, N. Podhorszki, S. Klasky, Q. Liu, D. Pugmire, M. Wolf, and M. Ainsworth, “Canopus: enabling extreme-scale data analytics on big hpc storage via progressive refactoring,” in 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17). USENIX Association, 2017.

[30] S. Klasky, H. Abbasi, M. Ainsworth, J. Choi, M. Curry, T. Kurc, Q. Liu, J. Lofstead, C. Maltzahn, M. Parashar et al., “Exascale storage systems the sirius way,” in Journal of Physics: Conference Series, vol. 759, no. 1. IOP Publishing, 2016, p. 012095.

[31] “SENSEI In Situ,” https://sensei-insitu.org/, 2018, [Online; accessed 22-February-2018].

[32] M. Ainsworth, O. Tugluk, B. Whitney, and S. Klasky, “Multilevel Techniques for Compression and Reduction of Scientific Data–The Multivariate Case,” under submission.

[33] S. Klasky, E. Suchyta, M. Ainsworth, Q. Liu, B. Whitney, M. Wolf, J. Y. Choi, I. Foster, M. Kim, J. Logan, K. Mehta, T. Munson, G. Ostrouchov, M. Parashar, N. Podhorszki, D. Pugmire, and L. Wan, “Exacution: Enhancing Scientific Data Management for Exascale,” in IEEE 37th International Conference on Distributed Computing Systems, ser. ICDCS ’17, 2017, pp. 1927–1937.

[34] S. Lang, P. Carns, R. Latham, R. Ross, K. Harms, and W. Allcock, “I/O Performance Challenges at Leadership Scale,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’09, 2009, pp. 40:1–40:12.

[35] Z. Zhao, D. Petesch, D. Knaak, and T. Declerck, “I/O Performance on Cray XC30,” in Proceedings of the Cray User Group Conference, ser. CUG ’14, 2014.

[36] L. Wan, M. Wolf, F. Wang, J. Youl Choi, G. Ostrouchov, and S. Klasky, “Comprehensive Measurement and Analysis of the User-Perceived I/O Performance in a Production Leadership-Class Storage System,” in IEEE 37th International Conference on Distributed Computing Systems, ser. ICDCS ’17, 2017, pp. 1022–1031.

[37] H. Luu, M. Winslett, W. Gropp, R. Ross, P. Carns, K. Harms, M. Prabhat, S. Byna, and Y. Yao, “A Multiplatform Study of I/O Behavior on Petascale Supercomputers,” in Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, ser. HPDC ’15, 2015, pp. 33–44.

[38] P. H. Carns, R. Latham, R. B. Ross, K. Iskra, S. Lang, and K. Riley, “24/7 Characterization of Petascale I/O Workloads,” in Proceedings of the First Workshop on Interfaces and Abstractions for Scientific Data Storage, 2009.

[39] L. Wan, M. Wolf, F. Wang, J. Youl Choi, G. Ostrouchov, and S. Klasky, “Analysis and Modeling of the End-to-End I/O Performance in OLCF’s Titan Supercomputer,” in IEEE 37th International Conference on Distributed Computing Systems, ser. ICDCS ’17, 2017, pp. 1022–1031.

[40] R. McKenna, S. Herbein, A. Moody, T. Gamblin, and M. Taufer, “Machine Learning Predictions of Runtime and IO Traffic on High-end Clusters,” in 2016 IEEE International Conference on Cluster Computing, ser. CLUSTER ’16, 2016, pp. 255–258.

[41] J. D. McCalpin, “Stream: Sustainable memory bandwidth in high performance computers,” University of Virginia, Charlottesville, Virginia, Tech. Rep., 1991–2007, a continually updated technical report. [Online]. Available: http://www.cs.virginia.edu/stream/

[42] L. Wan, Z. Lu, Q. Cao, F. Wang, S. Oral, and B. Settlemyer, “SSD-Optimized Workload Placement with Adaptive Learning and Classification in HPC Environments,” in 30th International Conference on Massive Storage Systems and Technology, ser. MSST ’14, 2014.

[43] L. Wan, Q. Cao, F. Wang, and S. Oral, “Optimizing Checkpoint Data Placement with Guaranteed Burst Buffer Endurance in Large-Scale Hierarchical Storage Systems,” Journal of Parallel and Distributed Computing, vol. 100, pp. 16–29, 2017.

[44] J. Logan, J. Y. Choi, M. Wolf, G. Ostrouchov, L. Wan, N. Podhorszki, W. Godoy, E. Lohrmann, G. Eisenhauer, C. Wood, K. Huck, and S. Klasky, “Extending Skel to Support the Development and Optimization of Next Generation I/O Systems,” in 2017 IEEE International Conference on Cluster Computing, ser. CLUSTER ’17, 2017, pp. 563–571.

[45] B. Ludascher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao, “Scientific workflow management and the kepler system: Research articles,” Concurr. Comput.: Pract. Exper., vol. 18, no. 10, pp. 1039–1065, Aug. 2006. [Online]. Available: http://dx.doi.org/10.1002/cpe.v18:10

[46] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz, “Pegasus: A Framework for Mapping Complex Scientific Workflows Onto Distributed Systems,” Sci. Program., vol. 13, no. 3, pp. 219–237, Jul. 2005.

[47] F. Costa, D. d. Oliveira, K. Ocana, E. Ogasawara, J. Dias, and M. Mattoso, “Handling Failures in Parallel Scientific Workflows Using Clouds,” in 2012 SC Companion: High Performance Computing, Networking, Storage and Analysis, ser. SCC ’12, 2012, pp. 129–139.

[48] J. Wang, D. Crawl, S. Purawat, M. Nguyen, and I. Altintas, “Big Data Provenance: Challenges, State of the Art and Opportunities,” in 2015 IEEE International Conference on Big Data, ser. Big Data ’15, 2015, pp. 2509–2516.

[49] F. Zheng, H. Yu, C. Hantas, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, and S. Klasky, “Goldrush: resource efficient in situ scientific data analytics using fine-grained interference aware execution,” in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 2013, p. 78.

[50] A. Goswami, Y. Tian, K. Schwan, F. Zheng, J. Young, M. Wolf, G. Eisenhauer, and S. Klasky, “Landrush: Rethinking in-situ analysis for gpgpu workflows,” in Cluster, Cloud and Grid Computing (CCGrid), 2016 16th IEEE/ACM International Symposium on. IEEE, 2016, pp. 32–41.

[51] A. Goswami, J. Young, K. Schwan, N. Farooqui, A. Gavrilovska, M. Wolf, and G. Eisenhauer, “Gpushare: Fair-sharing middleware for gpu clouds,” in Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 2016, pp. 1769–1776.

[52] T. Agarwal, A. Sharma, and L. V. Kale, “Topology-aware task mapping for reducing communication contention on large parallel machines,” in Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. IEEE, 2006, 10 pp.

[53] H. Yu, I.-H. Chung, and J. Moreira, “Topology mapping for blue gene/l supercomputer,” in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ser. SC ’06. New York, NY, USA: ACM, 2006. [Online]. Available: http://doi.acm.org/10.1145/1188455.1188576

[54] T. Hoefler and M. Snir, “Generic topology mapping strategies for large-scale parallel architectures,” in Proceedings of the International Conference on Supercomputing, ser. ICS ’11. New York, NY, USA: ACM, 2011, pp. 75–84. [Online]. Available: http://doi.acm.org/10.1145/1995896.1995909

[55] J. Y. Choi, J. Logan, M. Wolf, G. Ostrouchov, T. Kurc, Q. Liu, N. Podhorszki, S. Klasky, M. Romanus, Q. Sun et al., “Tge: Machine learning based task graph embedding for large-scale topology mapping,” in Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 2017, pp. 587–591.

[56] J. Dayal, J. Lofstead, G. Eisenhauer, K. Schwan, M. Wolf, H. Abbasi, and S. Klasky, “Soda: Science-driven orchestration of data analytics,” in e-Science (e-Science), 2015 IEEE 11th International Conference on. IEEE, 2015, pp. 475–484.

[57] C. Docan, M. Parashar, J. Cummings, and S. Klasky, “Moving the code to the data - dynamic code deployment using activespaces,” in 2011 IEEE International Parallel Distributed Processing Symposium, May 2011, pp. 758–769.

[58] I. Foster, M. Ainsworth, B. Allen, J. Bessac, F. Cappello, J. Y. Choi, E. Constantinescu, P. E. Davis, S. Di, W. Di et al., “Computing just what you need: online data analysis and reduction at extreme scales,” in European Conference on Parallel Processing. Springer, 2017, pp. 3–19.