Structured Streams: Data Services for Petascale Science Environments

Patrick Widener Matthew Wolf Hasan Abbasi Matthew Barrick Jay Lofstead Jack Pulikottil Greg Eisenhauer Ada Gavrilovska Scott Klasky Ron Oldfield

Patrick G. Bridges Arthur B. Maccabe Karsten Schwan

Abstract

The challenge of meeting the I/O needs of petascale applications is exacerbated by an emerging class of data-intensive HPC applications that requires annotation, reorganization, or even conversion of their data as it moves between HPC computations and data end users or producers. For instance, data visualization can present data at different levels of detail. Further, actions on data are often dynamic, as new end-user requirements introduce data manipulations unforeseen by original application developers. These factors are driving a rich set of requirements for future petascale I/O systems: (1) high levels of performance and, therefore, flexibility in how data is extracted from petascale codes; (2) the need to support 'on demand' data annotation – metadata creation and management – outside application codes; (3) support for concurrent use of data by multiple applications, like visualization and storage, including associated consistency management and scheduling; and (4) the ability to flexibly access and reorganize physical data storage.

We introduce an end-to-end approach to meeting these requirements: Structured Streams, streams of structured data with which methods for data management can be associated whenever and wherever needed. These methods can execute synchronously or asynchronously with data extraction and streaming, they can run on the petascale machine or on associated machines (such as storage or visualization engines), and they can implement arbitrary data annotations, reorganizations, or conversions. The Structured Streaming Data System (SSDS) enables high-performance data movement or manipulation between the compute and service nodes of the petascale machine and between/on service nodes and ancillary machines; it enables the metadata creation and management associated with these movements through specification instead of application coding; and it ensures data consistency in the presence of anticipated or unanticipated data consumers. Two key abstractions implemented in SSDS, I/O graphs and Metabots, provide developers with high-level tools for structuring data movement as dynamically composed topologies. A lightweight storage system avoids traditional sources of I/O overhead while enforcing protected access to data.

This paper describes the SSDS architecture, motivating its design decisions and intended application uses. The utility of the I/O graph and Metabot abstractions is illustrated with examples from existing HPC codes executing on Linux Infiniband clusters and Cray XT3 supercomputers. Performance claims are supported with experiments benchmarking the underlying software layers of SSDS, as well as application-specific usage scenarios.

1 Introduction

Large-scale HPC applications face daunting I/O challenges. This is especially true for complex coupled HPC codes like those in climate or seismic modeling, and also for emerging classes of data-intensive HPC applications. Problems arise not only from large data volumes but also from the need to perform activities such as data staging, reorganization, or transformation [24]. Coupled simulations, for instance, may require data staging and conversion, as in multi-scale materials modeling [7], or data remeshing or changes in data layout [17, 14]. Emerging data-intensive applications have additional requirements, such as those derived from their continuous monitoring [23]. Their online monitoring and the visualization of monitoring data require data filtering and conversion in addition to the basic constraints of low-overhead, flexible extraction of said data [35]. Similar needs exist for data-intensive applications in the sensor domain, where sensor data interpretation requires actions like data cleaning or remeshing [22].

Addressing these challenges presents technical difficulties including:


• scaling to large data volumes and large numbers of I/O clients (i.e., compute nodes), given limited I/O resources (i.e., a limited number of nodes in I/O partitions),

• avoiding excessive overheads on compute nodes (e.g., I/O buffers and compute cycles used for I/O),

• balancing bandwidth utilization across the system, as mismatches will slow down the computational engines, either through blocking or through over-provisioning in the I/O subsystem, and

• offering additional functionality in I/O, including on-demand data annotation, filtering, or similar metadata-centric I/O actions.

Structured Streams, and the Structured Streaming Data System (SSDS) that implements them, are a new approach to petascale I/O that encompasses a number of new I/O techniques aimed at addressing the technical issues listed above:

Data taps are flexible mechanisms for extracting data from or injecting data into HPC computations; efficiency is gained by making it easy to vary I/O overheads and costs in terms of buffer usage and CPU cycles spent on I/O, and by controlling I/O volumes and frequency.

Structured data exchanges between all stream participants make it possible to enrich I/O by annotating or changing data, both synchronously and asynchronously with data movement.

I/O graphs explicitly represent an application's I/O tasks as configurable topologies of the nodes and links used for moving and operating on data. I/O graphs start with lightweight data taps on computational nodes, traverse arbitrary additional task nodes on the petascale machine (including compute and I/O nodes, as desired), and end on storage or visualization engines. Using I/O graphs, developers can flexibly and dynamically partition I/O tasks and execute them concurrently, across petascale machines and the ancillary engines supporting their use. Enhanced techniques dynamically manage I/O graph execution, including their scheduling and the I/O costs imposed on petascale applications.

Metabots are tools for specifying and implementing operations on the data moved by I/O graphs. Metabot specifications include the nature of the operations (annotation, organization, or modification of data to meet dynamic end-user needs) as well as implementation and interaction details (such as application synchrony, data consistency requirements, or metabot scheduling).

Lightweight storage separates fast-path data movements from machine to disk from metadata-based operations like file consistency, while preserving access control. Metabots operating on disk-resident data are one method for asynchronously (outside the data fast path) determining the file properties of data extracted from high-performance machines.

SSDS is being implemented for leadership-class machines residing at sites like Oak Ridge National Laboratory. Its realization for Cray XT3 and XT4 machines runs data taps on the compute nodes, using Cray's Catamount kernel, and it executes full-featured I/O graphs utilizing nodes, metabots, and lightweight storage both on the Cray XT3/XT4 I/O nodes and on secondary service machines. SSDS also runs on Linux-based clusters using Infiniband RDMA transports in place of the Sandia Portals [3] communication construct. Enhanced techniques for automatically managing I/O graph cost vs. performance have not yet been realized, but measurements shown in this paper demonstrate the basic performance properties of SSDS mechanisms and the cost/performance tradeoffs made possible by their use.

In the remainder of this paper, Section 2 describes the basic I/O structure of the HPC systems targeted by SSDS. Section 3 follows with a description of the Structured Streams abstraction, how it addresses the emerging I/O challenges in the systems described in Section 2, and the implementation of structured streams in SSDS. Section 4 presents experimental results from our prototype SSDS implementation, illustrating how the basic SSDS abstractions that implement structured streams provide a powerful and flexible I/O system for addressing emerging HPC I/O challenges. Section 5 then describes related work, and Section 6 presents conclusions and directions for future research.

2 Background

Figure 1 depicts a representative machine structure for handling the data-intensive applications targeted by our work, derived from our interactions with HPC vendors and with scientists at Sandia, Los Alamos, and Oak Ridge National Laboratories. These architectures have four principal components: (1) a dedicated storage engine with limited attached computational facilities for data mining; (2) a large-scale MPP with compute nodes for application computation and I/O staging nodes for buffering and data reorganization; (3) miscellaneous local compute facilities such as visualization systems; and (4) remote data sources and sinks such as high-bandwidth sensor arrays and remote collaboration sites.

The storage engine provides long-term storage, its primary client being the local MPP system. Nodes in the MPP are connected by a high-performance interconnect such as the Cray SeaStar or 4x Infiniband with effective data rates of 40 GB/sec or more, while this MPP is connected to other local machines, including the data archive, using a high-speed commodity interconnect such as 10 GigE. Because these interconnects have lower peak and average bandwidths than the MPP's high-performance interconnect, some application nodes inside the MPP (i.e., service or I/O nodes) are typically dedicated to impedance matching by buffering, staging, and reordering data as it flows into and out of the MPP.

[Figure 1. System hardware architecture: a local MPP (compute and I/O nodes) connected to a storage engine, visualization resources, remote clients, and remote sensors.]

Systems with remote access to the storage engine have lower bandwidths, even when using high-end networks like TeraGrid or DOE's UltraScience network, but they can still produce and consume large amounts of data. Consider, for example, data display or visualization clusters [16], sensor networks, satellites or other telemetry sources, or ground-based sites such as the Long Wavelength Array of radio telescopes. For all such cases, the supercomputer may be either the source (checkpoints or processed simulation results) or sink (analysis of data across data sets) for the large data sets resident in the pool of storage.

3 Structured Streams

3.1 Using Structured Streams

To address the I/O challenges in systems and applications such as those described in Section 2, we propose to manage all data movement as structured streams. Conceptually, a structured stream is a sequence of annotations, augmentations, and transformations of typed, structured data; the stream describes how data is conceptually transformed as it moves from data source to sink, along with desired performance, consistency, and quality-of-service requirements for the stream.

An SSDS-based application describes its data flows in terms of structured streams. Data generation on an MPP cluster or retrieval from a storage system, for example, may be expressed as one or more structured streams, each of which performs data manipulation according to downstream application requirements. A structured streaming data system (SSDS), then, maps these streams onto runtime resources. In addition to meeting the application's I/O needs, this mapping can be constructed with MPP utilization in mind or to meet overhead constraints.

A structured stream is a dynamic, extensible entity; incorporation of new application components into a structured stream is implemented as their attachment to any node in the graph. Such new components can attach as pure sinks (as simple graph endpoints, termed data taps) or include specific additional functionality in more complex subgraphs. A resulting strength of the structured stream model is that data generators need not be modified to accommodate new consumers.
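To make the attachment model concrete, the following minimal C sketch shows a producer publishing typed events to a stream while a consumer attaches as a sink without any change to the producer's code. The ss_* types and functions are hypothetical illustrations, not the actual SSDS interfaces.

/* Hypothetical sketch: an application declares a structured stream, a new
 * consumer attaches as a sink, and the producer publishes typed events.
 * All ss_* names are illustrative; they are not the real SSDS interfaces. */
#include <stdio.h>

typedef struct { double x, y, z; int type; } atom;        /* typed event payload */
typedef void (*ss_sink)(const atom *events, int count);   /* consumer callback   */

typedef struct {
    const char *name;
    ss_sink     sinks[4];   /* attached consumers (graph endpoints / data taps) */
    int         nsinks;
} ss_stream;

static void ss_attach_sink(ss_stream *s, ss_sink sink) {
    if (s->nsinks < 4) s->sinks[s->nsinks++] = sink;       /* generator unchanged */
}

static void ss_publish(ss_stream *s, const atom *events, int count) {
    for (int i = 0; i < s->nsinks; i++) s->sinks[i](events, count);
}

static void viz_sink(const atom *events, int count) {      /* e.g., visualization */
    printf("viz received %d atoms, first at (%g,%g,%g)\n",
           count, events[0].x, events[0].y, events[0].z);
}

int main(void) {
    ss_stream out = { "md_particles", {0}, 0 };
    ss_attach_sink(&out, viz_sink);            /* new consumer, no producer change */
    atom a[2] = { {0.0, 1.0, 2.0, 26}, {3.0, 4.0, 5.0, 28} };
    ss_publish(&out, a, 2);                    /* producer-side publish            */
    return 0;
}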

3.2 Example Structured Streams

Structured streams carry data of known structure, but they also offer rich facilities for the runtime creation of additional metadata. For example, metadata may be created to support direct naming, reference by relation, and reference by arbitrary content. Application-centric metadata like the structure or layout of data events can be stated explicitly as part of the extended I/O graph interface offered by SSDS (i.e., data taps), or it may be derived from application programs, compilers, or I/O marshaling interfaces.

Consider the I/O tasks associated with the online visualization of data in multi-scale modeling codes. Here, even within a single domain like materials design, different simulation components will use different data representations (due to differences in length or time scales); they will have multiple ways of storing and archiving data; and they will use different programs for analyzing, storing, and displaying data. For example, computational chemistry aspects can be approached either from the viewpoint of chemical physics or physical chemistry. Although researchers from both sides may agree on the techniques and base representations, there can be fundamental differences in the types of statistics which these two groups may gather. The physicists may be interested in wavefunction isosurfaces and the HOMO/LUMO gap, while the chemists may be more interested in types of bonds and estimated relative energies. More simply, some techniques require the data structures to be reported in wave space, while others require real space. Similar issues arise for many other HPC applications.

A brief example demonstrates how application-desired structures and representations of data can be specified within the I/O graphs used to realize structured streams. Consider, for example, the output of the Warp molecular dynamics code [26], a tool developed by Steve Plimpton. It and its descendants have been used by numerous scientists for exploring materials physics problems in chemistry, physics, and mechanical and materials engineering. The output is composed of arrays of three real coordinates, representing the x, y, and z values for an individual atom's location, coupled with additional attributes for the run, including the type of atom, velocities of the atoms and/or temperatures of the ensemble, total energies, and so on. Part of the identification of the atomic type is coupled to the representation of the force interaction between atoms within the code – iron atoms interact differently with other iron atoms than they do with nickel. Using output data from one code to serve as input for another involves not only capturing the simple positions, but also the appropriate changes in classifications and indexing that the new code may require. I/O graphs use structured data representations to simply record the ways in which such data is laid out in memory (and/or in file blocks) and then maintain and use the translations to other representations. The current implementation of SSDS uses translations that are created manually, although at a higher level than the Perl or Python scripting that is standard practice in the field. In ongoing work related (but not directly linked) to the SSDS effort, we are producing automated tools for creating and maintaining these translations.
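As an illustration of such a layout translation, the sketch below converts a Warp-style atom record (position, type, velocities) to a hypothetical second representation with a different type indexing. The structs and the type map are invented for illustration and do not reflect the actual formats of Warp or of any particular consumer code.

/* Illustrative sketch: record the layout of Warp-style atom output (x, y, z
 * plus attributes) and translate it to a second code's representation that
 * uses a different atom-type indexing. The structs and the type map are
 * hypothetical examples, not the actual formats of either code. */
#include <stdio.h>

typedef struct { double x, y, z; int type; double vx, vy, vz; } warp_atom;
typedef struct { float pos[3]; int species; } other_atom;

/* Hypothetical re-indexing: Warp type ids -> destination code species ids. */
static int remap_type(int warp_type) {
    static const int type_map[] = { 0, 2, 1, 3 };
    return (warp_type >= 0 && warp_type < 4) ? type_map[warp_type] : -1;
}

static void translate(const warp_atom *in, other_atom *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i].pos[0]  = (float)in[i].x;
        out[i].pos[1]  = (float)in[i].y;
        out[i].pos[2]  = (float)in[i].z;
        out[i].species = remap_type(in[i].type);   /* classification change */
    }
}

int main(void) {
    warp_atom  in[1] = { { 1.0, 2.0, 3.0, 1, 0.1, 0.2, 0.3 } };
    other_atom out[1];
    translate(in, out, 1);
    printf("species %d at (%g,%g,%g)\n", out[0].species,
           out[0].pos[0], out[0].pos[1], out[0].pos[2]);
    return 0;
}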

3.3 Realizing Structured Streams

The concept of a structured stream describes only how data is transformed. To actually realize structured streams, we have decomposed their specification and implementation into two concrete abstractions implemented in the Structured Streaming Data System (SSDS) that describe when and where data is transformed: I/O graphs and metabots. Based on the application, the performance and consistency needs of a structured stream, and available system resources, a structured stream is specified as a series of in-band, synchronous data movements and transformations across a graph of hosts (an I/O graph) and some number of asynchronous, out-of-band data annotations and transformations (metabots), resulting in a complete description of how, where, and when data will move through a system.

The mapping of a structured stream to an I/O graph and a set of metabots in SSDS can change as application developers desire or as run-time conditions dictate. For instance, data format conversion may be performed strictly in an I/O graph for a limited number of consumers, but broken out (according to data availability deadlines) into a combination of I/O graph and metabot actions as the number of consumers grows. This ability to customize structured streams, and their potential for coordination and scheduling of both in-band and out-of-band data activity, makes them a powerful tool for application composition.

3.3.1 I/O graphs

I/O graphs are the data-movement and lightweight transformation engines of structured streams. The nodes of an I/O graph exchange data with each other over the network, receiving, routing, and forwarding as appropriate. I/O graphs are also responsible for annotating data and/or for executing the data manipulations required to filter data or, more generally, 'make data right' for recipients, and to carry out data staging, buffering, or similar actions performed on the structured data events traversing them.

Stated more precisely, in an I/O graph, each operation uses one or more specific input data form(s) to determine onward routing and produce potentially different output data form(s). We identify three types of such operations (a brief sketch follows the list below):

• inspection, in which an input data form and its contents are read-only and used as input to a yes/no routing, forwarding, or buffering decision for the data;

• annotation, in which the input data form is not changed, but the data itself might be modified before the routing determination is made; and

• morphing, in which the input data form (and its contents) is changed to a different data form before the routing determination is made.
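The following minimal sketch illustrates the three operation types on a hypothetical structured event; the event layout, threshold, and function names are illustrative only.

/* Minimal sketch of the three I/O graph operation types on a structured
 * event; the event layout and thresholds are hypothetical. */
#include <stdio.h>
#include <string.h>

typedef struct { double temperature; char source[32]; double data[4]; } event;
typedef struct { float data[4]; } small_event;

/* Inspection: read-only test that drives a yes/no routing decision. */
static int inspect_forward(const event *e) { return e->temperature > 300.0; }

/* Annotation: same data form, but contents may be augmented before routing. */
static void annotate(event *e, const char *origin) {
    strncpy(e->source, origin, sizeof e->source - 1);
    e->source[sizeof e->source - 1] = '\0';
}

/* Morphing: the data form itself changes (here, doubles become floats). */
static void morph(const event *e, small_event *s) {
    for (int i = 0; i < 4; i++) s->data[i] = (float)e->data[i];
}

int main(void) {
    event e = { 350.0, "", { 1.0, 2.0, 3.0, 4.0 } };
    annotate(&e, "node-17");
    if (inspect_forward(&e)) {
        small_event s;
        morph(&e, &s);
        printf("forwarding event from %s, first value %g\n", e.source, s.data[0]);
    }
    return 0;
}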

Regardless of which operations are being performed, I/O graph functionality is dynamically mapped onto MPP compute and service nodes, and onto the nodes of the storage engine and of ancillary machines. The SSDS design also anticipates the development of QoS- and resource-aware methods for data movement [5] and for SSDS graph provisioning [18, 30] to assist in the deployment or evolution of I/O graphs.

While the I/O graphs shown and evaluated in this paper were explicitly created by developers, higher-level methods can be used to construct I/O graphs. For example, consider a data conversion used in the interface of a high-performance molecular modeling code to a visualization infrastructure. This data conversion might involve a 50% reduction in the height and width of a 2-dimensional array of, say, a bitmap image (25% of the original data size). Precise descriptions of these conversions can be a basis for runtime generation of binary codes with proper looping and other control constructs to apply the operation to the range of elements specified. Our previous work has used XML specifications from which automated tools can generate the binary codes that implement the necessary data extraction and transport actions across I/O graph nodes.
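A sketch of the kind of conversion operator such a specification might expand into is shown below: a 50% reduction in height and width of a 2-dimensional array (25% of the original data), here implemented by averaging 2x2 blocks of an 8-bit image. The dimensions and averaging rule are assumptions for illustration.

/* Illustrative sketch of the conversion described above: a 50% reduction in
 * height and width of a 2-D array (25% of the original data), here by
 * averaging each 2x2 block of an 8-bit image. Dimensions are examples only. */
#include <stdio.h>
#include <stdint.h>

#define W 4
#define H 4

static void downsample(uint8_t in[H][W], uint8_t out[H / 2][W / 2]) {
    for (int r = 0; r < H / 2; r++)
        for (int c = 0; c < W / 2; c++) {
            int sum = in[2 * r][2 * c]     + in[2 * r][2 * c + 1]
                    + in[2 * r + 1][2 * c] + in[2 * r + 1][2 * c + 1];
            out[r][c] = (uint8_t)(sum / 4);   /* average of the 2x2 block */
        }
}

int main(void) {
    uint8_t img[H][W] = { { 10, 20, 30, 40 },
                          { 10, 20, 30, 40 },
                          { 50, 60, 70, 80 },
                          { 50, 60, 70, 80 } };
    uint8_t small[H / 2][W / 2];
    downsample(img, small);
    printf("reduced[0][0] = %d, reduced[1][1] = %d\n", small[0][0], small[1][1]);
    return 0;
}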


3.3.2 Metabots

Metadata in MPP Applications. The particulars of stream structure may not be readily available from MPP applications. One reason is that current applications using traditional file systems are forced by those systems to generate metadata in-line with computational results, reducing effective computational I/O bandwidth. As a result, minimal metadata is available on storage devices for use by other applications, and the metadata that is present is frequently only useful to the application that generated it. Additionally, such applications are extremely sensitive to the integrity of their metadata, ruling out modifications which might benefit other clients. Recovering metadata at a later time for use by other applications can reduce to inspection of source code in order to determine data structure.

As noted earlier, this situation is caused by the difference between MPP internal bandwidth and I/O bandwidth. One approach is to recover I/O bandwidth by removing metadata operations and other semantic operations from parallel file systems. However, semantics such as file abstractions, name mapping, and namespaces are commonly relied upon by MPP applications, so removing such metadata permanently is not an option. In fact, we consider the generation, management, and use of these and other types of metadata to be of increasing importance in the development of future high-performance applications. Our approach, therefore, is not to require all metadata management to be in the fast "data path" of these applications, but instead to move as much metadata-related processing as appropriate out of this path. The metabots described below realize this goal.

Metabots and I/O graphs. Metabots provide a specification-based means of introducing asynchronous data manipulation into structured streams. A basic I/O graph emphasizes metadata generation as data is captured and moved (e.g., data source identification). By default, generation is performed "in-band", a process analogous to current HPC codes that implicitly create metadata during data storage. Metabots make it possible to move these tasks "out-of-band", that is, outside the fast path of data transfer. Specifically, metabots can coordinate and execute these tasks at times less likely to affect applications, such as after a data set has been written to disk. In particular, metabots provide additional functionality for metadata creation and data organizations unanticipated by application developers but required by particular end-user needs.

Colloquially, metabots are metadata agents whose run-time representations crawl storage containers, generating metadata in user-defined ways. For example, in many scientific applications, data may need to be analyzed for both spatial and temporal features. In examining the flux of ions through a particular region of space in a fusion simulation, the raw data (organized by time slice) may need to be transformed so that it is organized as a time series of data within a specific spatial bounding box. This can involve both computationally intensive (e.g., bounding box detection) and I/O-bound phases (e.g., appending data fragments to the time series). The metabot run-time abstraction allows these to occur out-of-band or in-band, as appropriate.
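The sketch below illustrates the kind of reorganization such a metabot might perform out-of-band: scanning time-slice records and appending those that fall inside a spatial bounding box to a time series. The record layout and box limits are hypothetical.

/* Sketch of the reorganization a metabot might perform out-of-band: scan
 * time-slice records and append those falling inside a spatial bounding box
 * to a per-box time series. Record layout and box limits are hypothetical. */
#include <stdio.h>

typedef struct { int timestep; double x, y, z; double flux; } sample;
typedef struct { double lo[3], hi[3]; } bbox;

static int inside(const bbox *b, const sample *s) {
    return s->x >= b->lo[0] && s->x <= b->hi[0] &&
           s->y >= b->lo[1] && s->y <= b->hi[1] &&
           s->z >= b->lo[2] && s->z <= b->hi[2];
}

/* Append matching samples to the time series; returns the number appended. */
static int build_series(const sample *slices, int n, const bbox *b,
                        sample *series, int max) {
    int out = 0;
    for (int i = 0; i < n && out < max; i++)
        if (inside(b, &slices[i])) series[out++] = slices[i];   /* I/O-bound append */
    return out;
}

int main(void) {
    sample slices[3] = { {0, 0.1, 0.1, 0.1, 1.0},
                         {1, 5.0, 5.0, 5.0, 2.0},
                         {2, 0.2, 0.2, 0.2, 3.0} };
    bbox   b = { {0, 0, 0}, {1, 1, 1} };
    sample series[3];
    int    n = build_series(slices, 3, &b, series, 3);
    if (n > 0)
        printf("%d samples in box; last flux %g at step %d\n",
               n, series[n - 1].flux, series[n - 1].timestep);
    return 0;
}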

Metabots differ from traditional workflow concepts in two ways: (1) they are tightly coupled to the synchronous data transmission and need to be flexibly reconfigurable based on what the run-time has or has not completed, and (2) they are confined to specific metadata and data-organizational tasks. As such, the streaming data and metabot run-times could be integrated in future work to serve as a self-tuning actor within a more general workflow system like Kepler [20].

3.4 A Software Architecture for Petascale Data Movement

The data manipulation and transmission mechanism of SSDS leverages extensive prior work with high-performance data movement. Key technologies realized in that research and leveraged for this effort include: (1) efficient representations of meta-information about data structure and layout, enabling (2) high-performance and 'structure-aware' manipulations on SSDS data 'in flight', carried out by dynamically deployed binary codes and using higher-level tools with which such manipulations can be specified, termed XChange [1]; (3) a dynamic overlay optimized for efficient data movement, where data fast-path actions are strongly separated from the control actions necessary to build, configure, and maintain the overlay [30]; and (4) a lightweight object storage facility (LWFS [21]) that provides flexible, high-performance data storage while preserving access controls on data.

LWFS implements back-end metadata and data storage. PBIO is an efficient binary runtime representation of metadata [10]. High-performance data paths are realized with the EVPath data movement and manipulation infrastructure [9], and the XChange tool's purpose is to provide I/O graph mapping and management support [30]. Knowledge about the structure and layout of data is integrated into the base layer of I/O graphs. Selected data manipulations can then be directly integrated into I/O graph actions, in order to move only the data that is currently required and (if necessary and possible) to manipulate data to avoid unnecessary data copying due to send/receive data mismatches. The role of CM (Connection Manager) is to manage the communication interfaces of I/O graphs. Also part of EVPath and CM are the control methods and interfaces needed to configure I/O graphs, including deploying the operations that operate on data, linking graph nodes, deleting them, and so on. Potential overheads from high-level control semantics will not affect data fast-path performance, and alternative realizations of such semantics become possible. These will be important enablers for integrating actions like file system consistency or conflict management with I/O graphs, for instance. Finally, the metadata used in I/O graphs can be provided by applications, but it can also be derived automatically, by the metabots described in Section 3.3.2.

[Figure 2. SSDS software architecture, comprising ECho, a pub/sub manager, XChange, LWFS, an I/O graph manager, metabots, CM transports, CM, EVPath, and PBIO.]

I/O graph nodes run as daemon processes on the MPP's service nodes, on the storage engine, and on secondary machines like visualization servers or remote sensor machines. In addition, selected I/O graph nodes may run on the MPP's compute engines, to provide to such applications the extended I/O interfaces offered by SSDS. For all I/O graph nodes, operator deployment can use runtime binary code generation techniques to optimize data manipulation for current application needs and platform conditions. Additional control processes not shown in the figure run tasks like optimization of code generation across multiple I/O graph nodes and machines, and I/O graph mapping and management actions.

Structured streams do not replace standard back-end storage (i.e., file systems) or transports (i.e., network subsystems). Instead, they extend and enhance the I/O functionality seen by high-performance applications. The structured stream model of I/O inherently utilizes existing (and future) high-performance storage systems and data storage models, but offers a data-driven rather than connection- or file-driven interface for the HPC developer. In particular, developers are provided with enhanced I/O system interfaces and tools to define data formats and the operations to be performed on data 'in flight' between simulation components and I/O actions.

As an example, the current implementation of SSDS leverages existing file systems (ext3 and Lustre) and protocols (Portals and IB RDMA). Since the abstraction presented to the programmer is inherently asynchronous and data-driven, however, the run-time can perform data object optimizations (like message aggregation or data validation) more efficiently than the corresponding operation on a file object.

In contrast, the successful paradigm of MPI I/O [32], particularly when coupled with a parallel file system, heavily leverages the file nature of the data target and utilizes the transport infrastructure as efficiently as possible within that model. However, that inherently means the underlying file system concepts of consistency, global naming, and access patterns will be enforced at the higher level as well.

By adopting a model that allows for the embedding of computations within the transport overlay, it is possible to delay execution of, or entirely eliminate, those elements of the file object which the application does not immediately require. If a particular algorithm does not require consistency (as is true of some highly fault-tolerant algorithms), then it is not necessary to enforce it from the application perspective. Similarly, if there is an application-specific concept of consistency (such as validating a checkpoint file before allowing it to overwrite the previous checkpoint), that could be enforced as well, in addition to the more application-driven specifications mentioned earlier.
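As a concrete illustration of such an application-specific consistency rule, the sketch below validates a newly written checkpoint before allowing it to replace the previous one; the checksum scheme and file names are assumptions for illustration, not part of SSDS.

/* Sketch of the application-specific consistency rule mentioned above:
 * validate a newly written checkpoint before letting it replace the previous
 * one. The checksum scheme and file names are illustrative only. */
#include <stdio.h>

/* Hypothetical validation: recompute a checksum and compare to the expected value. */
static int checkpoint_valid(const char *path, unsigned long expected) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    unsigned long sum = 0;
    int c;
    while ((c = fgetc(f)) != EOF) sum = sum * 31 + (unsigned char)c;
    fclose(f);
    return sum == expected;
}

static int commit_checkpoint(const char *tmp_path, const char *ckpt_path,
                             unsigned long expected) {
    if (!checkpoint_valid(tmp_path, expected))
        return -1;                       /* keep the previous checkpoint intact */
    return rename(tmp_path, ckpt_path);  /* atomic replace on POSIX filesystems */
}

int main(void) {
    /* Usage sketch: a writer produced "ckpt.tmp" and its expected checksum. */
    if (commit_checkpoint("ckpt.tmp", "ckpt.dat", 0UL) != 0)
        fprintf(stderr, "new checkpoint rejected; previous one retained\n");
    return 0;
}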

3.4.1 Implementation

Datatap. The datatap is implemented as a request-read service designed for the multiple-orders-of-magnitude difference between the memory available on the I/O and service nodes and that of the compute partition. We assume the existence of a large number of compute nodes producing data (we refer to them as datatap clients) and a smaller number of I/O nodes receiving the data (we refer to them as datatap servers). The datatap client issues a data-available request to the datatap server, encodes the data for transmission, and registers this buffer with the transport for remote read. For very large data sizes, the cost of encoding data can be significant, but it will be dwarfed by the actual cost of the data transfer [12, 11, 4]. On receipt of the request, the datatap server issues a read call. Due to the limited amount of memory available on the datatap server, the server only issues a read call if there is memory available to complete it. The datatap server issues multiple read requests to reduce the request latency as much as possible. The datatap server is performance-bound by the available memory, which restricts the number of concurrent requests and the request service latency. The datatap server acts as a data feed into the I/O graph overlay. The I/O graph can replicate the functionality of writing the output to a file (see Section 4.4), or it can be used to perform "in-flight" data transformations (see Section 4.4).

We currently have two implementations of the datatap, using Infiniband user-level verbs and the Sandia Portals interface. We needed the multiple implementations in order to support both our local Linux clusters and the Cray XT3 platform. The two implementations have a common design and hence common performance, except in one regard: the Infiniband user-level verbs do not provide a reliable datagram (RD) transport, which increases the time spent issuing a data-available request (see Figure 8).

I/O graph Implementation. The actual implementation of I/O graph data transport and processing is accomplished via a middleware package designed to facilitate the dynamic composition of overlay networks for message passing. The principal abstraction in this infrastructure is 'stones' (as in 'stepping stones'), which are linked together to compose a data path. Message flows between stones can be both intra- and inter-process, with inter-process flows being managed by special output stones. The taxonomy of stone types is relatively broad, but includes: terminal stones, which implement data sinks; filter stones, which can optionally discard data; transform stones, which modify data; and split stones, which implement data-based routing decisions and may redirect incoming data to one or more other stones for further processing.
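A minimal sketch of composing a data path from the stone types just described appears below; the stone-like function pointers and the run_path routine are illustrative stand-ins and do not correspond to the real middleware API.

/* Hypothetical sketch of composing a data path from the stone types named
 * above (terminal, filter, transform, split); these types and calls are
 * illustrative stand-ins, not the real overlay middleware API. */
#include <stdio.h>

typedef struct { double value; } event;

typedef int  (*filter_fn)(const event *);          /* nonzero = keep          */
typedef void (*transform_fn)(event *);             /* modify data in place    */
typedef void (*terminal_fn)(const event *);        /* data sink               */

static int  drop_small(const event *e)  { return e->value >= 1.0; }
static void scale(event *e)             { e->value *= 0.5; }
static void store(const event *e)       { printf("store %.2f\n", e->value); }
static void visualize(const event *e)   { printf("viz   %.2f\n", e->value); }

/* Split stone behavior: route a copy of each surviving event to both sinks. */
static void run_path(event *events, int n, filter_fn f, transform_fn t,
                     terminal_fn sink_a, terminal_fn sink_b) {
    for (int i = 0; i < n; i++) {
        if (!f(&events[i])) continue;   /* filter stone may discard data     */
        t(&events[i]);                  /* transform stone modifies the data */
        sink_a(&events[i]);             /* split stone fans out to sinks     */
        sink_b(&events[i]);
    }
}

int main(void) {
    event ev[3] = { {0.5}, {2.0}, {4.0} };
    run_path(ev, 3, drop_small, scale, store, visualize);
    return 0;
}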

I/O graphs are highly efficient because the underlying transport mechanism performs only minimal encoding on the sending side and uses dynamic code generation to perform highly efficient decoding on the receiving side. The functions that filter and transform data are represented in a small C-like language. These functions can be transported between nodes in source form, but when placed in the overlay, we use dynamic code generation to create native versions of these functions. This dynamic code generation capability is based on a lower-level package that provides for dynamic generation of a virtual RISC instruction set. Above that level, we provide a lexer, parser, semanticizer, and code generator, making the equivalent of a just-in-time compiler for the small language. As such, the system generates native machine code directly into the application's memory without reference to an external compiler. Because we do not rely upon a virtual machine or other sandboxing technique, these filters can run at roughly the speed of unoptimized native code and can be generated considerably faster than forking an external compiler.

4 Experimental Evaluation

4.1 Overview

To evaluate the effectiveness of our prototype SSDS implementation, we conducted a variety of experiments to understand its performance characteristics. In particular, we conducted experiments using a combination of HPC-oriented I/O benchmarks that test individual portions of SSDS, including I/O graphs, the datatap, and metabots, and a prototype full-system SSDS application benchmark using a modified version of the GTC [23] HPC code.

As an experimental testbed, we utilized a cluster of 53 dual-processor 3.2 GHz Intel EM64T nodes, each with 6 GB of memory, running Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-42.0.3.ELsmp. Nodes were connected by a non-blocking 4x Infiniband interconnect using the IB TCP/IP implementation. I/O services in the cluster are provided by dedicated cluster nodes containing 73 GB 10k RPM Seagate ST373207LC Ultra SCSI 320 disks. The underlying SSDS I/O service was provided by a prototype implementation of the Sandia Lightweight File System (LWFS) [21] communicating via the user-level Portals TCP/IP reference implementation [3]. Note that the Portals reference implementation, unlike the native Portals implementation on Cray SeaStar-based systems, is a largely unoptimized communication subsystem. Because this communication infrastructure currently constrains the absolute performance of SSDS on this platform, our experiments focus on relative comparisons instead of absolute performance numbers.

4.2 Metabot Evaluation

Overview. To understand the potential performance benefits of metabots in SSDS, we ran several metadata manipulation benchmarks with various SSDS configurations, using different setups of a benchmark based on the LLNL fdtree benchmark [13], which we shall call fdtree-prime. The benchmark creates a file hierarchy parametrized by the depth of the hierarchy, the number of directories at each depth, and the number and size of files in each directory. In particular, we focused on comparing the benchmark's performance with metadata manipulation performed in-band against metadata manipulation moved out-of-band to a metabot.

In situations where applications create large file hierarchies which are accessed at a later point in time, in-band file creation can incur significant metadata manipulation overhead. Since the application programmer generally knows the structure of this hierarchy, including file names, numbers, and sizes, it may frequently be possible to move namespace metadata manipulations out-of-band. We have implemented this optimization using SSDS metabots, where the application can write directly to the resulting storage targets without in-band metadata manipulation costs to application execution. Subsequently, a metabot creates the needed filename-to-object mappings out-of-band.

[Figure 3. Out-of-band metadata reconstruction: time in seconds for the naming, raw, and reconstruct phases, (a) scaling with the number of files (up to 65,536) and (b) scaling with directory depth (1 to 5 levels).]

Setup. To understand the potential performance benefits of metabots, we built a metabot that performs specified file hierarchy metadata manipulations according to a specification after the actual raw I/O writes done by fdtree-prime. We then compared the time taken to run fdtree-prime setups with metadata manipulation in-band and metadata manipulation out-of-band, as well as the amount of time needed for out-of-band metadata construction. For these tests, we used two different fdtree-prime setups: one which created an increasing number of 4 KB files in a single directory, and one that created the same-sized files in an increasingly deep directory hierarchy in which each directory contained 100 files and 2 subdirectories. Experiments were run on 4 nodes of the cluster described in Section 4.1. One node executed the benchmark itself, while three nodes ran the LWFS authorization, naming, and storage servers.

Results. In our first experiment, we see that raw write performance gets increasingly better, relative to inline metadata creation, as the number of files increases. Even with a flat directory structure, at 65,536 files, the write performance with inline metadata creation is 70% slower than a raw write. In the second experiment, the performance gain is even more apparent: with a depth of 5 levels and 2 directories per level, the write performance with inline metadata creation is 9.7 times slower than a raw write. In both cases, the metadata construction metabot takes about the same time as the inline-metadata benchmark.

Analysis. While the above results demonstrate how moving metadata creation out-of-band can significantly increase the performance of in-band activity, the sum of raw-write time and construction-metabot time is still greater than that of inline metadata creation. This is because the construction metabot has to read from a raw object stored on the LWFS storage server and carry out the same operations as the inline-metadata benchmark. In the current LWFS API, the creation of a file is accompanied by the creation of a new LWFS object containing the data for the new file; hence the construction metabot has to repeat the workload of the inline-metadata benchmark. A more efficient implementation of the API would allow the construction metabot to avoid file data copies and new object creations and simply create filesystem metadata for the object that was created during the raw-write benchmark. Performance could also have been better had the construction metabot been deployed on the storage server rather than on a remote node, as was done for this series of experiments. Another issue to be addressed is that we do not have metrics for comparing write performance with other parallel file system implementations; this was primarily due to constraints of platform availability. However, LWFS performance characteristics are comparable to Lustre [21].
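The sketch below illustrates the out-of-band naming pass performed by a construction metabot of this kind: raw objects are written first, and filename-to-object mappings are created later from the known hierarchy specification. The object-id scheme and the create_name_mapping call are hypothetical stand-ins, not the LWFS interfaces.

/* Sketch of the out-of-band naming pass performed by the construction
 * metabot: raw objects are written first, and filename-to-object mappings
 * are created later from the known hierarchy specification. The object-id
 * scheme and naming call are hypothetical stand-ins for the LWFS API. */
#include <stdio.h>

/* Stand-in for creating a namespace entry that points at an existing object. */
static void create_name_mapping(const char *path, long object_id) {
    printf("map %-32s -> object %ld\n", path, object_id);
}

/* Walk the specified hierarchy (depth levels, ndirs subdirectories and
 * nfiles files per directory) and name the objects written earlier. */
static long name_tree(const char *prefix, int depth, int ndirs, int nfiles,
                      long next_id) {
    char path[512];
    for (int f = 0; f < nfiles; f++) {
        snprintf(path, sizeof path, "%s/file%d", prefix, f);
        create_name_mapping(path, next_id++);
    }
    if (depth > 0)
        for (int d = 0; d < ndirs; d++) {
            snprintf(path, sizeof path, "%s/dir%d", prefix, d);
            next_id = name_tree(path, depth - 1, ndirs, nfiles, next_id);
        }
    return next_id;
}

int main(void) {
    /* Example: depth 2, 2 subdirectories and 3 files per directory. */
    long total = name_tree("/bench", 2, 2, 3, 0);
    printf("named %ld objects out-of-band\n", total);
    return 0;
}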

4.3 DataTap Evaluation

Overview. The datatap serves as a lightweight, low-overhead extraction system. As such, the datatap replaces the remote filesystem services offered by the I/O nodes of large MPP machines. The datatap is designed to require a minimum level of synchronicity in extracting the data, thus allowing a large overlap between the application's kernel and the data movement.

The adverse performance impact of extracting data from an application can be broken down into two parts: the non-asynchronous parts of the data extraction protocol (i.e., the time for the remote node to accept the data transfer request) and the blocking wait for the transfer to complete (e.g., if the transfer time is not fully overlapped by computation) both have an impact on the total computation time of the application. To reduce this overhead, we designed the datatap in SSDS to have minimal blocking overhead.

Setup. We have implemented two versions of the datatap: (1) using the low-level Infiniband verbs layer and (2) using the Sandia Portals interface. The Portals interface is optimized for the Cray XT3 environment, but it also offers a TCP/IP module. However, the performance of the Portals-over-TCP/IP datatap is orders of magnitude worse than either Infiniband or Portals on the Cray XT3.

Results. We tested the Infiniband datatap on our local Linux cluster described above. The Portals datatap was tested offsite on a Cray XT3 at Oak Ridge National Laboratory. The results demonstrate the feasibility, scalability, and limitations of inserting datataps in the I/O fast path.
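The following sketch illustrates the request-read exchange whose blocking portions are measured below; the announce/read functions and the memory budget are illustrative stand-ins for the actual Portals and Infiniband implementations.

/* Conceptual sketch of the datatap request-read exchange described in
 * Section 3.4.1; the functions are stand-ins, not real Portals or IB verbs. */
#include <stdio.h>
#include <string.h>

#define SERVER_MEM (64 * 1024)     /* illustrative server-side buffer budget */

typedef struct { const char *buf; size_t len; } registered_buffer;

/* Client side: encode the data, register it for remote read, send a
 * data-available request, then continue computing. */
static registered_buffer client_announce(const char *data, size_t len) {
    registered_buffer rb = { data, len };    /* stands in for memory registration */
    printf("client: data available, %zu bytes\n", len);
    return rb;
}

/* Server side: only issue the read when memory is available to complete it. */
static size_t server_maybe_read(const registered_buffer *rb, size_t mem_free,
                                char *dst) {
    if (rb->len > mem_free) {
        printf("server: deferring read, %zu bytes requested\n", rb->len);
        return 0;                            /* request queued for later service */
    }
    memcpy(dst, rb->buf, rb->len);           /* stands in for the RDMA read/get  */
    return rb->len;
}

int main(void) {
    static char payload[1024] = "encoded simulation output";
    static char server_buf[SERVER_MEM];
    registered_buffer rb = client_announce(payload, sizeof payload);
    size_t got = server_maybe_read(&rb, SERVER_MEM, server_buf);
    printf("server: read %zu bytes\n", got);
    return 0;
}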

[Figure 4. Bandwidth during data consumer read (server read data bandwidth, MB/s, vs. number of processors, for 4 MB to 256 MB data sizes): (a) IB-RDMA datatap on the Linux cluster; (b) Portals datatap on the Cray XT3. This bandwidth is less than the maximum available for higher numbers of processors because of multiple overlapping reads.]

First we consider the bandwidth observed for an RDMA read (or Portals "get"). The results are shown in Figure 4. The maximum bandwidth is available when the data size is large and the number of processors is small. This is due to the increasing number of concurrent read requests as the number of processors increases. For both the Infiniband and the Portals versions, the read bandwidth reaches a minimum value for a specific data size. This occurs when the maximum amount of allocatable memory is reached, forcing the datatap data consumer to schedule requests. The most significant aspect of this metric is that as the number of processors increases, the maximum number of outstanding requests reaches a plateau. Increasing the amount of memory available to the consumer (or server) will result in a higher level of concurrency.

Evaluation. Next we look at the time taken to complete a data transfer, shown in Figure 6. The Portals version shows a much higher degree of variability, but the overall pattern is the same for both Infiniband and Portals. For larger numbers of processors, the time to complete a transfer increases proportionally. This increase is caused by the increasing number of outstanding requests on the consumer as the total transferred data grows beyond the maximum amount of memory allocated to the consumer. For small data sizes (the 4 MB transfer, for example), the time to completion stays almost constant, because all the nodes can be serviced simultaneously. The higher latency in the Infiniband version is due to the lack of a connectionless reliable transport. We had to implement a lightweight retransmission layer to address this issue, which invariably limits the performance and scalability observed in our Infiniband-based experiments. Another notable feature is that for the Portals datatap the time to complete for large data sizes is almost the same. This is because the Cray SeaStar is a high-bandwidth but also high-latency interface. Once the total data size to be transferred increases beyond the total available memory, the performance becomes bottlenecked by the latency.

[Figure 5. Bandwidth observed by a single client (client observed egress bandwidth, MB/s, vs. number of processors, for 4 MB to 256 MB data sizes): (a) IB-RDMA datatap on the Linux cluster; (b) Portals datatap on the Cray XT3.]

[Figure 6. Total time to complete a single data transfer (seconds vs. number of processors, for 4 MB to 256 MB data sizes): (a) IB-RDMA datatap on the Linux cluster; (b) Portals datatap on the Cray XT3.]

The design of the datatap is such that only a limited number of data producers can be serviced immediately. This is due to the large imbalance between the combined memory of the compute partition and the combined memory of the I/O partition. Hence, an important performance metric is the average service latency, i.e., the time taken before a transfer request is serviced by the datatap server. The request service latency also determines how long the computing application will wait on the blocking send. Figure 7 shows the latency with an increasing number of clients. The latency increase is almost directly proportional to the number of nodes, as the datatap server becomes memory-bound once the number of processors increases beyond a specific limit. The shape of the graphs is different for Infiniband and Portals, but the conclusion is the same: request service latency can be improved by allocating more memory to the datatap servers (for example, by increasing the number of servers). The impact of the memory bottleneck on the aggregate bandwidth observed by the datatap server is shown in Figure 9. The results demonstrate that the aggregate bandwidth increases with an increasing number of nodes, but reaches a maximum when the server becomes memory-limited. Our ongoing work focuses on understanding how best to schedule outstanding service requests so as to minimize the impact of these memory mismatches on the ongoing computation.

Time spent issuing a data transfer request is the most significant overhead on the performance of the application. This is because the application only blocks when issuing the send and when waiting for the completion of the data transfer; the actual data transfer is overlapped with the computation of the application. Unfortunately, the Infiniband version of the datatap blocks for a significant period of time (up to 2 seconds for 64 nodes and a transfer of 256 MB/node; see Figure 8(a)). This performance bottleneck is caused by the lack of a reliable connectionless transport layer in the current OpenFabrics distribution. Thus, as the request service latency increases, the time taken to complete a send also increases. We are currently looking at ways to bypass this bottleneck. In contrast, the Portals datatap has very low latency, and the latency stays almost constant for an increasing number of nodes. The bulk of the time is spent in marshaling the data, and we believe that this can also be optimized further. This demonstrates the feasibility of the datatap approach for environments with efficiently implemented transport layers.

(a) Number of ions = 582410

    Run Parameters    Time for 100 iterations (s)
    GTC/No output     213.002
    GTC/LUSTRE        231.86
    GTC/Datatap       219.65

(b) Number of ions = 1164820

    Run Parameters    Time for 100 iterations (s)
    GTC/No output     422.33
    GTC/LUSTRE        460.90
    GTC/Datatap       434.53

Table 1. Comparison of GTC run times on the ORNL Cray XT3 development machine for two input sizes using different data output mechanisms

4.4 Application-level Structured Stream Demonstration

Overview. The power of the structured stream abstraction is that it provides a ready interface for programmers to overlap computation with I/O in the high-performance environment. In order to demonstrate both the capability of the interface and its performance, we have chosen to implement a structured stream to replace the bulk of the output data from the Gyrokinetic Turbulence Code (GTC) [23]. GTC is a particle-in-cell code for simulating fusion within tokamaks, and it is able to scale to multiple thousands of processors.

In its default I/O pattern, the dominant cost is from each processor writing out its local array of particles into a separate file. This corresponds to writing out something close to 10% of the memory footprint of the code, with the write frequency chosen so as to keep the average overhead of I/O within a reasonable percentage of total execution time. As part of the standard process of accumulating and interpreting this data, these individual files are then aggregated and parsed into time series, spatially bounded regions, etc., as per the needs of the subsequent annotation pipeline.

Run-time comparisons. For this experiment, we replace the aforementioned bulk write with a structured stream publish event. We ran GTC with two sets of input parameters, with 582,410 ions and 1,164,820 ions, and compared the run times for three different configurations. In Table 1, GTC/No output is the GTC configuration with no data output, GTC/LUSTRE outputs data to a per-process file on the Lustre filesystem, and GTC/Datatap uses SSDS's lightweight datatap functionality for data output. We compare the application run time on the Cray XT3 development cluster at Oak Ridge National Laboratory. We observe a significant reduction in the overhead caused by the data output (from about 8% on Lustre to about 3% using the datatap). This decrease in overhead is also observed when we double the data size (by increasing the number of ions).

[Figure 7. Average latency in request servicing (average observed latency for request completion, seconds, vs. number of processors, for 4 MB to 256 MB data sizes): (a) IB-RDMA datatap on the Linux cluster; (b) Portals datatap on the Cray XT3.]

[Figure 8. Time to issue a data transfer request (seconds vs. number of processors, for 4 MB to 256 MB data sizes): (a) IB-RDMA datatap on the Linux cluster; (b) Portals datatap on the Cray XT3.]

I/O graph evaluation. The structured stream is configured with a simple I/O graph: datataps are placed in each of the GTC processes, feeding out asynchronously to an I/O node. From the I/O node, each of the messages is forwarded to a graph node where the data is partitioned into different bounding boxes, and then copies of both the whole data and the multiple smaller partitioned data sets are forwarded on to the storage nodes. In the first implementation, once the data is received by the datatap server, we filter based on the bounding box and then transfer the data for visualization. The time taken to perform the bounding box computation is 2.29 s and the time to transfer the filtered data is 0.037 s. In the second implementation, we transfer the data first and run the bounding box filter after the data transfer. The time taken for the bounding box filter is the same (2.29 s), but the time taken to transfer the data increases to 0.297 s. In the first implementation, the total time taken to transfer the data and run the bounding box filter is lower, but the computation is performed on the datatap server, which has a higher impact on datatap performance and results in higher request service latency. In the second implementation, the computation is performed on a remote node, thereby reducing the impact on the datatap.

5 Related Work

A number of different systems (among them NASD [15], Panasas [25], PVFS [19], and Lustre [6]) provide high-performance parallel file systems. Unlike these systems, SSDS provides a more general framework for manipulating data moving to and from storage. In particular, the higher-level semantic metadata information available in Structured Streams allows it to make better-informed scheduling, staging, and buffering decisions. Each of these systems could, however, be used as an underlying storage system for SSDS, in a way similar to how SSDS currently uses LWFS [21].

[Figure 9. Aggregate bandwidth observed by data consumer. (a) IB-RDMA data tap on Linux cluster; (b) Portals data tap on Cray XT3. Each panel plots the server-observed ingress bandwidth (MB/s) against the number of processors, for payload sizes of 4MB, 32MB, 64MB, 128MB, and 256MB.]


Previous work with Active Disks [29] is somewhat similar in spirit to SSDS: it provides a way for executable code to be hosted very close to the drive in order to enhance performance for data analysis operations. Its main limitation is that it relies on manipulating data stored on the drive. Our approach instead pulls that functionality into the I/O graph and metabots. With I/O graphs, SSDS can manipulate the data before it reaches storage, while metabots provide functionality similar to Active Disks but with explicit consistency management for interaction with SSDS-generated I/O graphs.

Scientific workflow management systems like Kepler [20], Pegasus [8], Condor/G [31], and others [36] are also closely related to the general Structured Streams/SSDS approach described in this paper. Similarly, the SRB [28] project is developing a Data Grid Language (DGL) [33] to describe how to route data within a grid environment, coupled with transformations on the metadata associated with file data. Unlike the system described in this paper, these systems focus on coarse-grained application scheduling, metadata manipulation as opposed to file data manipulation, and wide-area Grids as opposed to tightly-coupled HPC systems. In addition, current workflow systems are tightly coupled to synchronous data transmission and need to be reconfigured based on what the run-time has or has not completed. Our approach, in contrast, focuses on fine-grained scheduling, buffering, and staging issues in large-scale HPC systems, and allows the data itself to be annotated and manipulated both synchronously and asynchronously, all while still meeting application and system performance and consistency constraints. Because of the usefulness of such workflow systems, we are currently examining how the streaming data and metabot run-times could be integrated, in future work, to serve as a self-tuning actor within a more general workflow system like Kepler [20].

Our approach also shares some research goals with DataCutter [2, 34], which delivers client-specific data visualizations to multiple end points using data virtualization abstractions. DataCutter, unlike our system, requires end users to write custom data filters in C++ and limits automatically generated filters to a flat SQL-style approach for data selection. Our approach, in contrast, uses comparatively richer descriptions for both filter and transformation operations, which provides SSDS with more optimization opportunities. In addition, DataCutter has no analogue to the asynchronous data manipulation provided by metabots.

Finally, previous research with ECho [11] and Infopipes [27] provides the ability to dynamically install application-specific stream filters and transformers. Neither of these systems, however, addresses the more general scheduling, buffering, and asynchronous data manipulation problems that we seek to address using Structured Streams.

6 Conclusions and Future Work

The structured stream abstraction presented here is a new way of approaching I/O for petascale machines and beyond. Through microbenchmarks and integration with a production HPC code, we have shown that it is possible to achieve high performance while also providing an extensible capability within the I/O system. The layering of asynchronous lightweight datataps with a high-performance data transport and computation system gives SSDS a flexible and efficient mechanism for building an online data manipulation overlay. This enables us to address the needs of next-generation leadership applications.

Additionally, the integration of offline metadata annotation facilities provides a mechanism for shifting the embedded computation between online and offline execution. This allows the system to address run-time quality-of-service trade-offs, such as the size of memory buffers on I/O nodes versus computational demands versus ingress/egress bandwidth.
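As a rough illustration of this kind of trade-off, the sketch below shows a hypothetical placement policy that chooses between online (I/O graph) and offline (metabot) execution of an annotation step based on I/O-node load and buffer availability; the types, names, and thresholds are assumptions for exposition only, not part of SSDS.

    /* Illustrative sketch only: a hypothetical placement policy, not an
     * actual SSDS interface. */
    #include <stddef.h>

    typedef enum { PLACE_ONLINE_IOGRAPH, PLACE_OFFLINE_METABOT } placement_t;

    typedef struct {
        size_t io_node_buffer_free;   /* bytes of staging buffer available  */
        double io_node_cpu_load;      /* 0.0 - 1.0 utilization on I/O node  */
        double egress_bandwidth_mbs;  /* available bandwidth toward storage */
    } qos_state_t;

    /* Run an annotation step online (in the I/O graph) only when the I/O
     * node has spare cycles and buffer space; otherwise defer it to a
     * metabot that operates on the stored data offline. */
    placement_t choose_placement(const qos_state_t *s, size_t step_bytes)
    {
        if (s->io_node_cpu_load < 0.5 && s->io_node_buffer_free > step_bytes)
            return PLACE_ONLINE_IOGRAPH;
        return PLACE_OFFLINE_METABOT;
    }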

As next steps, enriching the specification and scheduling capabilities for both online and offline data manipulation will further improve runtime performance. As a further extension, we intend to investigate autonomic techniques for making trade-offs relevant to application-level notions of data utility. On the metabot side, the specific issue of data consistency models and their verification will be a major driver. For datatap and I/O graph development, we will work to further enrich the model of embedding computation into the overlay to better exploit concurrency in transport and processing. Exploiting concurrency wherever possible, including in the I/O system, will be key to the widespread deployment and adoption of petascale applications.

References

[1] H. Abbasi, M. Wolf, K. Schwan, G. Eisenhauer, and A. Hilton. Xchange: Coupling parallel applications in a dynamic environment. In Proc. IEEE International Conference on Cluster Computing, 2004.

[2] M. Beynon, R. Ferreira, T. M. Kurc, A. Sussman, and J. H. Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, pages 119–134, 2000.


[3] R. Brightwell, T. Hudson, R. Riesen, and A. B. Maccabe. The Portals 3.0 message passing interface. Technical Report SAND99-2959, Sandia National Laboratories, December 1999.

[4] F. E. Bustamante, G. Eisenhauer, K. Schwan, and P. Widener. Efficient wire formats for high performance computing. In Proc. Supercomputing 2000 (SC 2000), Dallas, Texas, November 2000.

[5] Z. Cai, V. Kumar, and K. Schwan. Self-regulating data streams for predictable high performance across dynamic network overlays. In Proc. 15th IEEE International Symposium on High Performance Distributed Computing (HPDC 2006), Paris, France, June 2006.

[6] Lustre: A scalable, high-performance file system. Cluster File Systems Inc. white paper, version 1.0, November 2002. http://www.lustre.org/docs/whitepaper.pdf.

[7] J. Clayton and D. McDowell. A multiscale multiplicative decomposition for elastoplasticity of polycrystals. International Journal of Plasticity, 19(9):1401–1444, 2003.

[8] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, and S. Koranda. Mapping abstract complex workflows onto grid environments. J. Grid Comput., 1(1):25–39, 2003.

[9] G. Eisenhauer. The EVPath library. http://www.cc.gatech.edu/systems/projects/EVPath.

[10] G. Eisenhauer. Portable binary input/output. http://www.cc.gatech.edu/systems/projects/PBIO.

[11] G. Eisenhauer, F. Bustamante, and K. Schwan. Event services for high performance computing. In Proceedings of High Performance Distributed Computing (HPDC-2000), 2000.

[12] G. Eisenhauer and L. K. Daley. Fast heterogeneous binary data interchange. In Proceedings of the Heterogeneous Computing Workshop (HCW 2000), May 3-5, 2000. http://www.cc.gatech.edu/systems/papers/Eisenhauer00FHB.pdf.

[13] fdtree. http://www.llnl.gov/icc/lc/siop/downloads/download.html. Last visited: April 16, 2007.

[14] C. Forum. MxN parallel data redistribution @ ORNL. http://www.csm.ornl.gov/cca/mxn/, January 2004.

[15] G. A. Gibson, D. P. Nagle, K. Amiri, F. W. Chang, E. Feinberg, H. G. C. Lee, B. Ozceri, E. Riedel, and D. Rochberg. A case for network-attached secure disks. Technical Report CMU-CS-96-142, Carnegie Mellon University, June 1996.

[16] O. V. T. Group. Exploratory visualization environment for research in science and technology (EVEREST). http://www.csm.ornl.gov/viz/.

[17] X. Jiao and M. T. Heath. Common-refinement-based data transfer between nonmatching meshes in multiphysics simulations. International Journal for Numerical Methods in Engineering, 61(14):2402–2427, December 2004.

[18] V. Kumar, B. F. Cooper, Z. Cai, G. Eisenhauer, and K. Schwan. Resource-aware distributed stream management using dynamic overlays. In Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (ICDCS-2005), 2005.

[19] R. Latham, N. Miller, R. Ross, and P. Carns. A next-generation parallel file system for Linux clusters. LinuxWorld, 2(1), January 2004.

[20] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E. A. Lee, J. Tao, and Y. Zhao. Scientific workflow management and the Kepler system: Research articles. Concurr. Comput.: Pract. Exper., 18(10):1039–1065, 2006.

[21] R. A. Oldfield, A. B. Maccabe, S. Arunagiri, T. Kordenbrock, R. Riesen, L. Ward, and P. Widener. Lightweight I/O for scientific applications. In Proc. 2006 IEEE Conference on Cluster Computing, Barcelona, Spain, September 2006.

[22] R. A. Oldfield, D. E. Womble, and C. C. Ober. Efficient parallel I/O in seismic imaging. The International Journal of High Performance Computing Applications, 12(3):333–344, Fall 1998.

[23] L. Oliker, J. Carter, M. Wehner, A. Canning, S. Ethier, A. Mirin, G. Bala, D. Parks, P. Worley, S. Kitawaki, and Y. Tsuda. Leading computational methods on scalar and vector HEC platforms. In Proceedings of SuperComputing 2005, 2005.

[24] I. W. G. on High End Computing. HEC-IWG file systems and I/O R&D workshop. http://www.nitrd.gov/subcommittee/hec/workshop/20050816_storage/.

[25] Object-based storage architecture: Defining a new generation of storage systems built on distributed, intelligent storage devices. Panasas Inc. white paper, version 1.0, October 2003. http://www.panasas.com/docs/.

[26] S. Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, 1995. http://lammps.sandia.gov/index.html.

[27] C. Pu, K. Schwan, and J. Walpole. Infosphere project: System support for information flow applications. SIGMOD Record, 30(1):25–34, 2001.

[28] A. Rajasekar, M. Wan, R. Moore, W. Schroeder, G. Kremenek, A. Jagatheesan, C. Cowart, B. Zhu, S.-Y. Chen, and R. Olschanowsky. Storage Resource Broker: managing distributed data in a Grid. Computer Society of India Journal, Special Issue on SAN, 33(4):42–54, October 2003.

[29] E. Riedel, C. Faloutsos, G. A. Gibson, and D. Nagle. Active disks for large-scale data processing. IEEE Computer, 34(6):68–74, June 2001.

[30] K. Schwan, B. F. Cooper, G. Eisenhauer, A. Gavrilovska, M. Wolf, H. Abbasi, S. Agarwala, Z. Cai, V. Kumar, J. Lofstead, M. Mansour, B. Seshasayee, and P. Widener. Autoflow: Autonomic information flows for critical information systems. In M. Parashar and S. Hariri, editors, Autonomic Computing: Concepts, Infrastructure and Applications. CRC Press, 2006.

[31] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience, 17(2-4):323–356, 2005.

[32] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. Technical Report ANL/MCS-P723-0898, Mathematics and Computer Science Division, Argonne National Laboratory, August 1998.


[33] J. Weinberg, A. Jagatheesan, A. Ding, M. Faerman, and Y. Hu. Gridflow description, query, and execution at SCEC using the SDSC Matrix. In HPDC '04: Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing (HPDC'04), pages 262–263, Washington, DC, USA, 2004. IEEE Computer Society.

[34] L. Weng, G. Agrawal, U. Catalyurek, T. Kurc, S. Narayanan, and J. Saltz. An approach for automatic data virtualization. In HPDC, pages 24–33, 2004.

[35] M. Wolf, Z. Cai, W. Huang, and K. Schwan. Smartpointers: Personalized scientific data portals in your hand. In Proceedings of SuperComputing 2002, November 2002. http://www.sc-2002.org/paperspdfs/pap.pap304.pdf.

[36] J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. SIGMOD Rec., 34(3):44–49, 2005.
