On Mixing High-Speed Updates and In-Memory Queries
A Big-Data Architecture for Real-time Analytics

Tao Zhong, Kshitij A Doshi, Xi Tang, Ting Lou, Zhongyan Lu, Hong Li
Software and Services Group, Intel

{tao.t.zhong, kshitij.a.doshi, xi.tang, ting.lou, zhongyan.lu, hong.li}@intel.com

Abstract— Up-to-date business intelligence has become a critical differentiator for the modern, data-driven, highly engaged enterprise. It requires rapid integration of new information on a continuous basis for subsequent analyses. ETL-based and traditionally batch-processing oriented methods of absorbing changes into a relational database schema take time, and are therefore incompatible with the very low-latency demands of real-time analytics. Instead, in-memory clustered stores that employ tunable consistency mechanisms are becoming attractive, since they dispense with the need to transform and transit data between storage layouts and tiers.

When data is updated infrequently, in-memory approaches such as RDD transformations in Spark can suffice, but as updates become frequent, such in-memory approaches need to be extended to support dynamic datasets. This paper describes a few key additional requirements that result from having to support in-memory processing of data while updates proceed concurrently. The paper describes the Real-time Analytics Foundation (RAF), an architecture designed to meet these new requirements. Performance of an early implementation of RAF is also described: for an unaudited TPC-H derived workload, RAF shows a node-to-node scaling ratio of 88% at 8 nodes, and for a query equivalent to Q6 in the TPC-H set, RAF shows a 9x improvement over Hive-Hadoop. The paper also describes two RAF-based solutions that are being put together by two independent software vendors in China.

Index Terms— Big Data, Real-time, Low-latency, Analytics, Resilient Distributed Datasets, CRUD, Clustering.

I. INTRODUCTION

Operational analytics enables a company or an organization to analyze data from its day-to-day operations, including up-to-the-minute transactions, so that it may act upon findings instantly. It is gaining momentum as institutions and individuals become pervasively data driven in all spheres of life. In addition to capitalizing upon well-catalogued historical knowledge, establishments are turning towards just-in-time analysis of information in motion that may be just seconds old, and not yet well categorized or linked to other data. A few examples of the value of operational analytics follow:

• GPS-based navigation equipped with a static information base of road networks is a great modern convenience. Add instantaneous analysis of traffic conditions, and it can guide motorists away from or out of traffic logjams, reducing wasted energy, accidents, delays and emergencies.

• A credit card company may use instantaneous correlation between a user’s transactions in order to detect and intercept suspicious transactions – such as a transaction that breaks a pattern, or one that issues from a merchant too far from the location of a recent transaction.

• A metropolitan or regional power grid may process millions of sensor readings every second to modulate power generation, perform load balancing, direct repair actions, and take policy enforcement steps. In addition to triggering immediate reactive actions, the data may also be used in long-range capacity planning.

An essential feature in the above examples is the need to integrate new transactions into analysis results within a very short time—sometimes as short as a few tens of milliseconds. A second feature is the need to complete analysis queries quickly. Producing answers swiftly requires parallel execution of queries and agile sharing of information across parallel tasks.

To keep pace with the growth of processing demand and that in the volume of content to be analyzed, organizations have increasingly embraced scale-out clusters (Hadoop[1], HBase[2], MPP databases[9], etc.) where more machines can be added readily in order to boost throughput and capacity. High latencies can make these clusters less suitable for operational analytics: transforming data between block format and in-memory formats, paging, and sharing via a distributed file system such as HDFS[4] all prolong analysis.

To slash latencies by orders of magnitude, Spark[16], HP Vertica[14], and several other initiatives [10],[15],[7] propose maintaining data in memory (instead of demand paging it from disk), noting that rising capacities and dropping DRAM costs have made in-memory solutions practical in recent years. This paper identifies additional capabilities that are needed to blend in-memory data processing between transactional and analytic activities into a seamless whole. The resilient distributed datasets (RDDs) of Spark[16] make in-memory solutions less failure prone. In an architecture that we call the Real-time Analytics Foundation (RAF), we propose enhancing the RDD approach so that resiliency is blended with a few additional characteristics, listed next:

• Efficient allocation and control of memory resources
• Resilient update of information at much finer resolution
• Flexible and highly efficient concurrency control
• Replication and partitioning of data transparent to clients
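The lineage idea referenced above (that an RDD lost to a failure can be rebuilt by replaying the well-formed operations that produced it) can be sketched in a few lines of C++, the implementation language this paper favors. Everything below (the `LineageDataset` class, `drop`, `materialize`) is a hypothetical illustration of the general technique, not RAF's or Spark's actual API:

```cpp
#include <functional>
#include <utility>
#include <vector>

// Illustrative sketch only: a dataset that remembers the deterministic
// transformation that produced it, so a lost in-memory copy can be
// recomputed from its precursor rather than restored from a replica.
template <typename T>
class LineageDataset {
 public:
  using Transform = std::function<std::vector<T>(const std::vector<T>&)>;

  // A base dataset has no precursor; it is assumed to be recoverable
  // from stable storage by other means.
  explicit LineageDataset(std::vector<T> base)
      : data_(std::move(base)), parent_(nullptr) {}

  // A derived dataset records its precursor and the transformation.
  LineageDataset(const LineageDataset* parent, Transform t)
      : data_(t(parent->data_)), parent_(parent), transform_(std::move(t)) {}

  // Simulate losing the in-memory copy (e.g., a node restart).
  void drop() { data_.clear(); lost_ = true; }

  // Recompute the dataset from its lineage instead of from a replica.
  // (A fuller version would recurse if the precursor was also lost.)
  const std::vector<T>& materialize() {
    if (lost_ && parent_ != nullptr) {
      data_ = transform_(parent_->data_);
      lost_ = false;
    }
    return data_;
  }

 private:
  std::vector<T> data_;
  const LineageDataset* parent_;
  Transform transform_;
  bool lost_ = false;
};
```

A derived dataset (say, a filter of a base dataset) thus survives the loss of its in-memory copy without any extra replication, at the cost of recomputation.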

Architecturally, RAF elevates memory across an entire cluster to a first-class storage entity and defines high-level mechanisms by which applications on RAF can orchestrate distributed actions upon objects stored in cluster memory. By merging distributed memory and distributed computing, RAF delivers a platform on which applications designed for optimized in-memory execution can be distributed easily. Services that expose REST interfaces, Create-Retrieve-Update-Delete (CRUD) interfaces, SQL-like interfaces, or other object access methods for items in their in-memory formats can be created readily on top of RAF. To promote responsible and transparent use of memory, RAF opts, as a general rule, to use programming languages such as C and C++ over mixed-language environments in which garbage collection is opaque.
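As a small illustration of the kind of CRUD interface mentioned above, the following C++ sketch gives writers copy-on-write updates while readers hold a stable snapshot. The `CrudStore` class and its single-threaded snapshot scheme are illustrative assumptions, one common way to provide such an interface, and not RAF's implementation:

```cpp
#include <map>
#include <memory>
#include <string>

// Sketch of a CRUD service over an in-memory store, written in C++ so that
// allocation and lifetime are explicit (RAII) rather than garbage-collected.
// Illustration only; a concurrent version would need atomic publication.
class CrudStore {
 public:
  using Map = std::map<std::string, std::string>;

  // Create or Update: copy-on-write keeps existing readers' snapshots valid.
  void put(const std::string& key, const std::string& value) {
    auto next = std::make_shared<Map>(*current_);  // writer's private copy
    (*next)[key] = value;
    current_ = next;                               // publish the new version
  }

  // Delete.
  void erase(const std::string& key) {
    auto next = std::make_shared<Map>(*current_);
    next->erase(key);
    current_ = next;
  }

  // Retrieve via an immutable snapshot: the shared_ptr keeps this version
  // alive for the reader even after writers publish newer versions.
  std::shared_ptr<const Map> snapshot() const { return current_; }

 private:
  std::shared_ptr<Map> current_ = std::make_shared<Map>();
};
```

Note how explicit ownership pays off here: each superseded version is freed deterministically as soon as its last reader releases it, with no collector pause.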

The paper is organized as follows. Section 2 sets out the motivation, explaining the requirements and how current solutions compare in delivering low-latency analytics. Section 3 then describes the overall architecture and the implementation highlights that are the distinctive aspects of the solution. Performance of the current implementation is presented in section 4, which also describes two end-user scenarios in which this approach is applied to meet unique demands. Section 5 describes continuing and future work, and section 6 concludes.

2013 IEEE International Conference on Big Data
978-1-4799-1293-3/13/$31.00 ©2013 IEEE

II. MOTIVATION AND BACKGROUND

Data has a lot of value when mined. By analyzing an end-user’s recent browsing history, an artificial intelligence system such as Google Now can automatically determine viewing preferences and push selections to him or her without explicit requests being made. A manufacturer or a retailer can extract and correlate consumer information streaming in from shopping outlets and use it in order to group and price products for a mix of high turnover and profitability. Smart deployment of police and ambulance services based on an analysis of traffic patterns can save lives during emergencies.

This understanding, that data has value, has led to the practice of mining it at the point of generation, for example, by subjecting it to streaming analytics. SQLstream[13] in effect runs queries on streams rather than on previously captured and ETL’ed data. However, as data continues to compound at brisk rates, institutions need to grapple with two broad demands: (a) accumulating, processing, synopsizing and utilizing information in a timely manner, and (b) storing the refined data resiliently, and keeping it accessible at high speed. Recognizing that large-scale parallelism is critical and that the storage and computational capacities of single-node multi-core systems quickly limit a solution, recent years have seen the emergence of softer-consistency, non-relational, cluster-based approaches for data processing, described collectively as Big Data. The term Big Data itself is elastic and serves well as a description of the scale or volume of these solutions, but does not define a constraining principle for organizing storage or orchestrating processing. It is therefore useful to outline a few requirements for low-latency and high-throughput analytics on datasets that are also subject to non-trivial volumes of updates.

1. In-memory structures and storage: As the complexity of queries increases, it becomes necessary not just to keep larger amounts of data in memory to avoid disk latencies, but also to use data processing techniques that avoid having to marshal pointer-linked and deeply nested structures between one software module and another, particularly when the modules are sharing data on the same node.

2. Resiliency: At the same time, it is obviously critical that data that must survive abrupt restarts of individual nodes is replicated across nodes and between storage tiers in a timely fashion.

3. Sharing data through memory: Passing data by reference should be supported efficiently so that needless copying can be avoided. A module that produces data should be able to collaborate with a module that consumes data without the overhead of data copying, marshaling, and superfluous allocation or deallocation. Furthermore, it is important that modules that form a chain of processing are able to operate on data flowing from a previous stage of processing into a next stage while the output of the previous stage is in memory.

4. Uniform interaction with storage: Some systems designed for in-memory analytics split transactional object management from that for the analytics subsystem. Most systems mandate differing data storage and processing formats. Due to the resulting fragmentation of formats, developers and end users face the danger of being locked into specific approaches that later fail to meet shifting demands. At the same time, it is necessary to grapple with the resilience questions that arise with distribution, and with naming and storing data that is not limited to a single “native” format. In a nutshell, solutions need to be interoperable and scalable across multiple hardware platforms and software environments. The resulting impetus for OpenStack[17] has identified the need to move away from centralized databases and towards distributed object storage with object-granular capabilities for locating, replicating, and migrating contents (see figure 1). To meet these objectives in concert with requirement #1, objects stored in memory should be interoperable, relocatable, and replicable in the same way as objects placed in a block storage medium.

5. Minimizing memory recycling: Garbage-collected language runtimes (such as Java), popular in many Big Data implementations, liberate programmers from having to manage memory explicitly. However, the lack of direct control over memory deallocation can lead to the performance overhead of garbage collection and cause problems such as long stalls during a stop-the-world collection. More explicit memory management, such as that in C and C++, is desirable from the standpoint of ensuring high performance.

6. Efficient integration of CRUD: In an operational analytics solution, it is important to be able to make small updates on a continuous basis, rather than to have to buffer up many changes and risk a long-latency and resource-intensive merging of those changes into the source from time to time. We borrow the acronym CRUD (Create/Retrieve/Update/Delete) from [3] to refer succinctly to two capabilities: (a) writers perform their updates atomically at a granularity consistent with an application’s data model, and (b) readers have a stable, consistent view of the dataset. To minimize overhead and maximize concurrency, it is essential to support efficient CRUD operations on both memory and disk versions of a dataset that is simultaneously accessed by analytics queries. In effect, without burdening the application logic, the datastore should provide a logically versioned view in which writers update contents at a granularity finer than that of a full dataset.

7. Synchronizing efficiently: Having to wait for milliseconds in order to achieve serialized execution across a cluster negates much of the benefit of in-memory processing. If synchronization is lock-based, then it requires distributed deadlock detection and guarding against node failures. While partitioning of data across nodes can mitigate distributed synchronization, local synchronization within a node can also be a serious performance inhibitor. Given the need for large-scale concurrency—within and across nodes, it is essential to choose processing approaches that are implicitly data parallel.

8. Searching efficiently: Many NoSQL solutions, particularly the key-value stores available today, sort records for efficient searches based on record keys. Efficient searching also needs to be available across non-key attributes.

Figure 1: reproduced from [17]

Many a distributed data processing alternative has emerged in the past decade to displace some previous relational database based solution. Several in-memory non-relational systems such as [6][10][16] have similarly emerged, either as alternatives or as complementary approaches to in-memory relational databases. Let us briefly note the strengths, and the areas of improvement, among four in-memory data solutions that are very popular today—Spark[16], Oracle Coherence[6], Redis[10], and Memcached[5].

Spark introduces an innovative distributed computing model that is particularly well suited to large-scale batch processing. A complex analysis task is decomposed into a sequence of transformations, scheduled such that datasets generated by one transformation are used in another transformation, with rarely a need to force the datasets out of memory onto disks. Spark is, however, dataset oriented in that it is not well suited to operating on a single record based on the record’s key, or to weaving small-granularity updates into transformations. Redis is very popular due to its performance and ANSI C implementation, rich data types and consistency guarantees. It lacks a distributed framework for assembling complex, parallel computations, and a single-threaded model constrains its aggregation performance. Oracle Coherence is a mature commercial data grid solution with high availability, high scalability, a rich data processing model, and support for transactions. That it is implemented in Java exposes it to garbage collection inefficiencies, and its query capability is constrained to filtering and to computing aggregations such as MAX, MIN and several others. Memcached, due to its stability and simplicity, is widely used for object caching; but that simplicity also precludes creating complex enterprise software – which limits the use of Memcached primarily to a backend assist for accelerating web servers. The next section discusses how to build upon the strengths, and address some of the soft spots, of the current popular solutions in order to better address the needs outlined in this section.

III. ARCHITECTURAL CONCEPTS AND IMPLEMENTATION HIGHLIGHTS

The framework targets common user scenarios in which complex queries need to be conducted at very low latencies. Information upon which the queries operate may be available on some storage medium or generated dynamically on the fly as a result of ongoing transactional activities. In this section we translate the eight requirements articulated in section 2 into five design elements: (a) C and C++ based programming for efficient sharing of data through memory, (b) resilient storing of new content, (c) efficient concurrency, (d) processing information in motion, and (e) fast, general, ad-hoc searches.

Transactional and analytic applications run as services, and end-user clients interact with these services using interfaces that abstract away the details of how objects are distributed, cached, stored, replicated, partitioned, and so on, from the clients. For these applications, RAF provides a distributed computing environment. The distributed computing environment is integrated with a memory-centric, distributed storage system; that is, one application can pass a data handle to another in order to share data in memory with the latter, without first having to serialize or deserialize data for storing to file systems or transmitting via network connections.

To outline the RAF framework, it is useful to discuss four concepts: Delegate, Filter, RDD, and Transformations. We do this with respect to figure 2. In figure 2, new data or updates to existing data occur as a result of clients contacting a Storage Service. Typically the updates will happen first in memory, and, sometimes, the updates may become propagated to non-volatile media.

• An RDD is an acronym for a Resilient Distributed Dataset, introduced in [16]. RAF uses the RDD construct for reasons identical to those in [16], viz., resilient storing of information in the memory of one or more machines, together with assurance that in case of failure of one or more machines, an RDD can be reconstructed from precursors by repeating the well-formed operations that produced it.

• A Transformation is an operation applied to one or more RDDs in order to generate a new dataset or a result that may be transient. The concept of transformations is similar to that in [18]; and RAF transformations include such transformations as join, map, union, etc.

• A Filter is a particular type of transformation. A filter, as in [18], produces, out of an input dataset, a resulting dataset whose contents satisfy a specified condition.

• In order to facilitate efficient sharing of both volatile and non-volatile storage between applications that need to make updates and applications that need to read the data, we create a level of indirection to the storage. This indirection is obtained through a set of wrapper functions that ensure a stable view of content to read-only consumers, even as content mutates. This is described further in §3.1.

Thus in figure 2, raw information – whether it is preloaded into in-memory storage from disks or dynamically generated by data-producing clients – is first made stable through the use of Delegate modules, and then filtered in order to create initial RDDs (such as RDD1 and RDD2 in the figure). Chains of Transformations then produce the desired analytics results. To reiterate, data sharing is through memory as opposed to a distributed file system such as HDFS. This is supplemented by a flexible execution model in which applications interact with other applications synchronously (i.e., using RPC), or asynchronously via queued messages. High-performance analytics applications can remain loosely coupled (and therefore efficient) by using message queues.

Figure 2: RAF Framework and Concepts

A. Efficient Storage Sharing using DELEGATE

A shared CRUD[3] data source (§2) furnishes different

Page 4: On Mixing High-Speed Updates and In-Memory Queriesprof.ict.ac.cn/bpoe2013/downloads/papers/S7212_5873.pdf · deployment of police and ambulance services based on an analysis of traffic

information to queries at different pointintroduces an impedance mismatch for softwon RDDs since a typical storage interface CRUD actions does not have the same breathat an RDD possesses. This difficulty is reminserting a bridge module named DELEGATDELEGATE is to create a version of thparticular time, and present it as a memoBeneath the DELEGATE module, CRUDproceed directly against the mutable datastooperations proceed against the RDD that is the view produced by DELEGATE, someadded or deleted without affecting the analys

For efficiency, the DELEGATE module ewrite through pointer indirection. DELprovides C/C++ wrappers for objects in persistent or volatile, and using these wrcreating operation can share memory efficwith concurrent transactional operations thacomposition of a dataset. While embracinRDDs from [16], RAF deviates in implemfrom [16] in order to carry out explicit sharing of RDDs and facilitating coorDELEGATE modules, ensuring that dataefficiently accessible from CPUs for analysisB. Memory-centric Storage Operation

For driving very low latency analytics appmaintain data in memory, and share access tmachines with large memory sizes havcommon in recent years, accumulating anlarge amounts of data has become practicasofter and application managed consistency ACID semantics in the data layer itself, applications to spread data among a clustermachines to further expand solution sizes. control which data needs to be flushed medium and when; and this is achieved in Rdisk or file system plugins. 3.2.1 Reliability

It is necessary to address the concern that memory) can compromise reliability and avais recoverable by design. Thus it is the data san initial RDD is obtained (via Delegate, as that needs to be recoverable even if the committing changes to a non-volatile copy oas to a disk is non-synchronous. This is achie

Applications run on a cluster and can quknow how many nodes are present in theapplication can write an update to more thanthe cluster size drops to 1, can record changemedium. A storage structure (such as a filuniformly sized partitions with records dpartitions through hashing of record keytypically copied to at least one other back storage layer maintains the mapping betweand the additional nodes where that partitionmaintained. • When a node N is about to leave a cluster,

cluster abruptly, the storage layer discoveand backup partitions are on N. For eacpartitions, it selects one out of its backmarks it as primary; and creates a new bayet another node in the cluster. For any baN, RAF creates a new instance of the basome other node.

ts in time. This ware that operates for implementing

adth of operations moved in RAF by

TE. The purpose of he datastore at a ry resident RDD.

D operations can re. While analysis created on top of

e entities may be sis. employs copy-on-LEGATE module storage --whether rappers, an RDD ciently but safely at may change the ng the concept of entation of RDDs instantiation and

rdination through a is directly and s computations.

plications on RAF to it efficiently. As ve become more nd operating upon al. With a shift to --away from strict it is possible for

r of large memory Applications also to a non-volatile RAF by means of

volatility (of main ailability. An RDD source from which described in §3.1) primary mode of

of the source, such eved as follows. uery the system to e cluster. Thus an n one node; and if es to a non-volatile le) is divided into distributed among ys. A partition is

up node, and the een each partition n’s backup copy is

, or if it leaves the ers which primary ch of N’s primary kup partitions and ackup partition on ackup partition on ackup partition on

• If a new node X joins the cluster, then RAF rebalances partitions across all nodes, including X, in the background.

Updates to a partition are automatically streamed to the partition's replicas through a network interface, without burdening the application logic. Applications can and do control the replica count, as well as whether replication of a write is required to be synchronous.

C. Data and Storage Types

In this subsection we briefly describe application visible aspects of data layout and distribution. §3.3.1 explains how RAF implements data types, and §3.3.2, on storage types, describes two concepts that relate to how data is distributed across nodes.

3.3.1 Structured Data

Structured data is expressed by a two-part definition that is illustrated in Figure 3. On the right is a protocol buffer [8] that shows two types: CustomerKey and Customer. As described in [8], this structure is declared in a file which is compiled in order to yield data accessor methods that are very efficient in parsing the necessary fields from a message stream. In addition, a metadata structure, shown to the left in Figure 3 and also in JSON format, defines tuples by describing four aspects: (1) the names (analogous to names of relations in a database), (2) the key attributes, (3) the non-key (or "value") attributes, and (4) a store type, which we describe a little further in the next subsection. In implementing this type of extensible structure, RAF supports both memory and disk based, flexible organization of structured data, with efficient serialization [8]. In particular, the responsibility for locating the data, whether it is in memory, on disk, local, or remote, is offloaded from applications and absorbed into the RAF platform modules. Applications supply the metadata and protobuf definitions in a file, and the RAF framework automatically creates the internal structures and serialization necessary for transmitting or receiving data across node boundaries.

Figure 3: Type and Store Metadata

While the use of Protobuf as an efficient exchange format among distributed entities is well understood, its use in RAF is particularly advantageous because real-time analytics is built on the proposition of propagating updates resulting from CRUD actions as quickly and efficiently as possible. This is best seen by considering row- and column-based alternatives for database records: in row-based structures, updates would require extensive parsing, while column-based structures are not well suited to high rates of updates.

3.3.2 Storage Types

RAF provides two attributes for datastores: replicated store and partitioned store. A datastore may be single or partitioned, and it may be optionally replicated. When a datastore is small, and is updated rarely but read frequently, it is advantageous to keep all of it as one single extent (preferably also in memory) at one home location, but make one or more replicas of it available on other nodes. This is captured as case (a) in Figure 4. If, on the other hand, it is written frequently enough (case (c) in Figure 4), then it makes sense not to replicate widely the single extent that hosts it; for high reliability, at least one copy of it would be advisable on a remote node as backup. If a datastore is large in size (case (b) in Figure 4), then it is divided into multiple extents or partitions, and each partition is given at least one backup copy on some node other than its home location.

Figure 4: Attributes of Storage

When there are backup copies, or if there are replicas of a partition, a master copy of any given partition is kept on some node. That node is the one which storage operations ("CRUD" operations) contact, and the master copy reflects the changes from an update operation into the remote copies. This is done in accordance with an application guided consistency philosophy. An application can specify to RAF whether it wants write-through updates or wants remote copies updated using a write-behind approach. Under write-through, the update operation returns control to the application once data is updated in all the copies; under write-behind, the operation returns as soon as it completes locally on the master copy, and the changes are reflected to other nodes asynchronously.

3.3.3 Storage Service Interface

For CRUD operations, RAF furnishes both a command line interface and a C++ language application programming interface. As described in §3.3.1, protocol buffers are used for accessing data, which makes it easy to implement programming interfaces in other languages. The following very simple grammar is used to write command line scripts:

COMMAND   :== SVC_NAME OP_NAME PARAM EXA_PARAM* ';'
SVC_NAME  :== ID
OP_NAME   :== ID
PARAM     :== JavaScript Object Notation (JSON)
EXA_PARAM :== ID '=' PARAM
ID        :== [a-zA-Z0-9_-\.]+

A command of this form can, for example, create a new customer "Tom5" in a business directory in a storage dataset named "Customer".
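Because the grammar is so small, a command parser for it can be sketched in a few lines. The Python below is purely illustrative (RAF's own interfaces are the command line and C++ API mentioned above, and their implementation is not shown in the paper); it splits a command into its service name, operation name, JSON parameter, and keyword parameters:

```python
import json
import re

def parse_command(text):
    """Parse one command of the form:
       COMMAND :== SVC_NAME OP_NAME PARAM EXA_PARAM* ';'
    Returns (service, operation, param, extras), where extras maps
    keyword names (EXA_PARAM IDs) to their JSON values."""
    text = text.strip()
    if not text.endswith(';'):
        raise ValueError("command must end with ';'")
    text = text[:-1].strip()

    m = re.match(r'([A-Za-z0-9_.-]+)\s+([A-Za-z0-9_.-]+)\s+', text)
    if not m:
        raise ValueError("expected SVC_NAME and OP_NAME")
    svc, op = m.group(1), m.group(2)
    rest = text[m.end():]

    def take_json(s):
        # Scan one balanced {...} object from the front of s.
        depth, in_str = 0, False
        for i, ch in enumerate(s):
            if ch == '"' and (i == 0 or s[i - 1] != '\\'):
                in_str = not in_str
            elif not in_str:
                if ch == '{':
                    depth += 1
                elif ch == '}':
                    depth -= 1
                    if depth == 0:
                        return json.loads(s[:i + 1]), s[i + 1:].lstrip()
        raise ValueError("unbalanced JSON object")

    param, rest = take_json(rest)
    extras = {}
    while rest:
        key, _, rest = rest.partition('=')
        value, rest = take_json(rest.lstrip())
        extras[key.strip()] = value
    return svc, op, param, extras
```

Applied to one of the rdd.service commands from §3.4.1, `parse_command('rdd.service create {"a":1} transform={"op":"EQ"} ;')` yields the service `rdd.service`, the operation `create`, the JSON parameter, and a `transform` keyword parameter.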

D. Distributed Execution of Analytics Tasks

As described earlier in this section, RAF reuses the concept of resilient distributed datasets (RDDs) from [16] in order to capitalize on memory resident data for low-latency analytics. With the introduction of the Delegate (§3.1), RAF also weaves in CRUD capable stores as source datasets from which RDDs can be derived, just as [16] shows how datasets such as HDFS files can be used for deriving RDDs. Successive stages of transformation, described by a directed acyclic graph (DAG) as in [16], yield a memory-based, low latency computation plan for analyzing data in mutable/CRUD storage to produce query results. Dividing very large datasets into multiple partitions, each of which is backed up by a copy, allows this approach to scale and to remain resilient in spite of moving from a read-only to a read-write storage model.

RAF adopts the SEDA [12] model to achieve highly efficient, composable parallelism. Efficiency results from event-based dispatching of run-to-completion methods, which minimizes context switching for synchronization. Memory-based storage is particularly synergistic, since it drives down the probability of wait-states that might otherwise result from having to page data into memory from disk-based storage. Each node acts as a monitor over the data partitions local to it, and each operation that would operate on a data partition is queued. With increasing core counts driving up the number of computing elements within each node, and with larger and larger node count clusters becoming commonplace, this architectural approach is well positioned to match low-latency in-memory execution with a non-blocking, data flow form of execution.

3.4.1 Analytics Tasks Interface

The storage service interface described in §3.3.3 is complemented by a service interface through which clients can compose and execute a DAG of operations to orchestrate analytics tasks. A command can use the keyword in_rdd to designate each source operand in a filter/transform operation, and the keyword out_rdd to name the result of the operation. An RDD may be described as an in_rdd in one operation and as the out_rdd of some other operation, which makes it possible for an analytics client to construct a pipeline of operations through which designated flows of data occur.

The syntax of an analytics command differs in minor respects from that of a storage service command. The grammar for an analytics task, which we refer to as an RDD service operation, is the same as that described in §3.3.3, except that the JSON content is different for each OP_NAME (i.e., by operation). Let us use an example to illustrate. Following is a command which takes as input an RDD named customer, applies a filtering operation to extract the subset of customers who are in the city of Beijing, and names the resulting RDD bj_customer:

// --- extract the dataset "customers from Beijing" ---
rdd.service create {"transform_type":"filter",
  "in_rdd":[{"rdd_name":"customer"}],
  "out_rdd":{"rdd_name":"bj_customer"}}
transform={"op":"EQ", "sub_exp":[{"op":"FIELD", "param":"c_city"},
  {"op":"CONST", "param":"Beijing"}]}
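The per-node arrangement described above for SEDA-style execution, in which each node acts as a monitor over its local partitions and queues each operation for run-to-completion dispatch, can be sketched as follows; the class and method names are illustrative assumptions, not RAF's actual API:

```python
from collections import deque

class PartitionMonitor:
    """Illustrative sketch of the SEDA-style arrangement: operations
    against a local partition are enqueued as events and then run to
    completion one at a time, with no blocking or lock contention."""
    def __init__(self, partition_data):
        self.data = partition_data   # the partition's records
        self.queue = deque()         # pending operations (events)

    def submit(self, op):
        # Enqueue an event instead of locking the partition.
        self.queue.append(op)

    def drain(self):
        # Event loop: dispatch each queued operation to completion.
        results = []
        while self.queue:
            op = self.queue.popleft()
            results.append(op(self.data))
        return results

# One monitor per local partition on a node.
monitor = PartitionMonitor([{"c_city": "Beijing", "sales": 10},
                            {"c_city": "Shanghai", "sales": 4}])
monitor.submit(lambda d: sum(r["sales"] for r in d))                  # an action
monitor.submit(lambda d: [r for r in d if r["c_city"] == "Beijing"])  # a filter
totals = monitor.drain()
```

Because each operation runs to completion against data already in memory, the queue replaces fine-grained synchronization, which is the source of the efficiency claimed above.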



The following example further shows (a) how DAGs can be specified step by step, and (b) how independent tasks that happen to share common intermediate datasets can reuse, instead of regenerating, the shared datasets. Let us say we have the following two independent queries: (1) compute the average sales per customer for customers that are from either Beijing or Shanghai; (2) for customers from Beijing, identify the subset of customers who have purchased more than once, and then, for that subset, calculate the average sales. The high level depiction is as below:

Computation Flow X: Sales Reports, Customers -> Customers from Beijing; Customers from Shanghai -> Customers from Beijing or Shanghai -> Average sales per customer

Computation Flow Y: Sales Reports, Customers -> Customers from Beijing -> Repeat customers from Beijing -> Average sales per customer

Both flows take as input a sales reports file, and progressively compute the RDDs to arrive at two different targets: the first flow is aimed at taking all customers who are from either Beijing or Shanghai and averaging the sales in that slice of customers, while the second flow aims to calculate the average sales per repeat customer in Beijing. In RAF, the above computation could be organized as follows:

Common Precursor Action (across X and Y, above): Customers -> Customers from Beijing; Customers from Shanghai

Computation Flow X only: Customers from Beijing; Customers from Shanghai -> Customers from Beijing or Shanghai -> Average sales per customer

Computation Flow Y only: Customers from Beijing -> Repeat customers from Beijing -> Average sales per customer

In order to specify the above compositions flexibly, applications would proceed to specify the following three computations to an in-memory RDD service, as follows:

a
// --- extract the dataset "customers from Beijing" ---
rdd.service create {"transform_type":"filter",
  "in_rdd":[{"rdd_name":"customer"}],
  "out_rdd":{"rdd_name":"bj_customer"}}
transform={"op":"EQ", "sub_exp":[{"op":"FIELD", "param":"c_city"},
  {"op":"CONST", "param":"Beijing"}]}

b
// --- extract the dataset "customers from Shanghai" ---
rdd.service create {"transform_type":"filter",
  "in_rdd":[{"rdd_name":"customer"}],
  "out_rdd":{"rdd_name":"sh_customer"}}
transform={"op":"EQ", "sub_exp":[{"op":"FIELD", "param":"c_city"},
  {"op":"CONST", "param":"Shanghai"}]}

c
// --- union Beijing and Shanghai customers ---
rdd.service create {"transform_type":"union",
  "in_rdd":[{"rdd_name":"bj_customer"}, {"rdd_name":"sh_customer"}],
  "out_rdd":{"rdd_name":"bj_sh_customer"}}

d
// --- compute avg. sales across Beijing & Shanghai customers ---
rdd.service action {"action_type":"average", "rdd_name":"bj_sh_customer"}
action = {"expr":{"op":"FIELD", "param":"sales_amount"}}

e
// --- compute average sales for repeat customers from Beijing ---
rdd.service action {"action_type":"average", "rdd_name":"bj_customer"}
action = {"expr":{"op":"FIELD", "param":"sales_amount"}}
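To make the a-e pipeline concrete, the same filter, union, and average composition can be traced in plain Python over a tiny customer list (the field names match the scripts above; the records and values are invented for illustration):

```python
# Illustrative data: customer tuples with the fields used by tasks a-e.
customers = [
    {"c_name": "Tom5", "c_city": "Beijing",  "sales_amount": 120.0},
    {"c_name": "Li",   "c_city": "Shanghai", "sales_amount":  80.0},
    {"c_name": "Wang", "c_city": "Tianjin",  "sales_amount":  50.0},
]

# a, b: filter transforms (EQ comparing the FIELD c_city to a CONST)
bj_customer = [c for c in customers if c["c_city"] == "Beijing"]
sh_customer = [c for c in customers if c["c_city"] == "Shanghai"]

# c: union transform
bj_sh_customer = bj_customer + sh_customer

# d: average action over the FIELD sales_amount
avg_sales = (sum(c["sales_amount"] for c in bj_sh_customer)
             / len(bj_sh_customer))
```

Each assignment corresponds to one out_rdd in the scripts; in RAF the same steps would run partition by partition across the cluster rather than over one local list.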

All the above script commands follow the grammar in §3.4.1. The name for the analytics service is rdd.service. The keywords create and action respectively denote whether rdd.service is being asked to create an RDD (tasks a, b, and c) or to compute a result (tasks d, e) by operating on one or more existing RDDs. When there is only one RDD to operate on in the case of an action (as in tasks d, e), it is permissible to drop the in_rdd keyword described in §3.4.1. RDDs to operate on are identified by the keyword rdd_name in both types of operations. The use of the rdd_name keyword to name an RDD explicitly in this way allows an application to avoid creating redundant RDDs. For example, suppose one process A executes the actions shown in computation flow X; that process would create the RDDs bj_customer and sh_customer. Another process B that executes the actions in computation flow Y would end up repeating the computation of bj_customer for its query. With the RDDs explicitly identified by the rdd_name keyword, flow Y can avoid recreating bj_customer if A has already begun creating it, and vice versa.

Figure 5: DAG of Computation Flows X and Y
Figure 6: Partition-by-partition creation of derivative RDDs
Figure 7: Node-local computation over corresponding partitions
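The reuse behavior just described, where flow Y avoids recreating bj_customer because the RDD is identified by an explicit rdd_name, amounts to memoizing RDD creation by name. A minimal sketch, with hypothetical class and attribute names, follows:

```python
class RddRegistry:
    """Sketch of name-based RDD reuse: a create request for an rdd_name
    that already exists returns the existing dataset instead of
    recomputing it (names here are illustrative, not RAF's API)."""
    def __init__(self):
        self._rdds = {}
        self.creations = 0   # how many times a transform actually ran

    def create(self, rdd_name, transform):
        if rdd_name not in self._rdds:
            self._rdds[rdd_name] = transform()
            self.creations += 1
        return self._rdds[rdd_name]

registry = RddRegistry()
customer = [{"c_city": "Beijing"}, {"c_city": "Shanghai"}]

# Flow X and flow Y both ask for the same named RDD:
bj1 = registry.create("bj_customer",
                      lambda: [c for c in customer if c["c_city"] == "Beijing"])
bj2 = registry.create("bj_customer",
                      lambda: [c for c in customer if c["c_city"] == "Beijing"])
```

The second create call returns the dataset produced by the first, so the filter transform runs only once even though two independent flows requested it.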



We can represent the overall flow of both computations by the DAG in Figure 5. Let us use this diagram to discuss how the execution would occur using RAF. In the DAG of Figure 5, each box is an RDD. Solid lines represent RDD creation/transformation steps, and dashed lines represent action steps. Thus, bj_customer and sh_customer are created by a filter transformer on the customer dataset, bj_sh_customer is created by a union transformer, and bj_customer_2 is the result of filtering for repeat customers in the bj_customer dataset. Averaging actions on bj_customer_2 and bj_sh_customer respectively yield the two results of the two queries. As noted above, the RDD bj_customer is used in both queries, and its creation is shared (not repeated) in the execution of the DAG.

Figures 6 and 7 show how partitions (§3.3.2) work. Suppose customers were a very large dataset that is partitioned across three nodes, shown respectively by blue rectangles numbered 1, 2, 3. The number of partitions, incidentally, is controllable by the user. Then, as shown in Figure 6, the creation of the derivative RDDs bj_customer and sh_customer can also proceed partition by partition, as would the creation of the union RDD bj_sh_customer. Thus, data decomposition in source RDDs is easily carried over into destination RDDs. This allows, as shown in Figure 7, keeping computations node local by maintaining corresponding partitions in the same node, where the semantics of an operation allow. One counter example where such an optimization may not be applicable is a join that requires cross-node communication. However, this is made more efficient in RAF by replicating RDDs that are rarely updated (§3.3.2) but which may be read many times in operations such as joins. Finally, reliability in such operations is obtained by virtue of RDDs being recomputable, as described in [16]. In RAF, the maintenance of backup partitions further improves the availability of RDDs.

IV. RESULTS

In this section we share the performance readout for RAF, using unit tests that demonstrate outstanding transactional and decision support performance. We do this in section 4.1. Then, in section 4.2, we outline two RAF-based applications from established commercial software vendors and describe their performance, to portray how straightforward it is to create realistic operational analytics solutions atop RAF.

A. Unit Testing

We measured RAF on two fronts: how well update operations scale, which represents memory-centric storage operations performance, and how long it takes to complete a query, which shows distributed analytics performance. Table 1 has configuration data for the update test. For update operations, we borrow the SSB schema from the Star Schema Benchmark [19] and utilize the SSB-DBGEN tool to generate a data file. The update test then parses the data file and inserts all the records into RAF. The performance metric is TPS: the total record count divided by the total execution time in seconds.

Figure 8 shows a consistent increase in the throughput of this test as it distributes over multiple nodes, with an aggregate scaling of 7.0x for an 8 node cluster consisting of Intel® Xeon® E5-2680 processors. In this test, for the 1 node case (with asynchronous backup copying to other nodes) we obtained an average CPU utilization of about 80%. Despite it being an update-only test (and therefore network intensive), the percentage of time spent in the operating system was only 7%, which is reflective of the fast data capture, distribution, and storage capability of RAF.

Server: Intel® Xeon® E5-2680 (8C, 2.7GHz), 128GB DDR3-1333, 10Gb NIC
Network: 10Gbps Switch
Op. Sys.: Red Hat Enterprise Linux 6.2, 64 bit

Table 1: Configuration
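The metrics used in this subsection reduce to simple arithmetic; the sketch below shows how TPS and the node-to-node scaling ratio quoted in the abstract are computed (the record count and run time are invented for illustration; the 7.0x figure is the measured aggregate scaling from Figure 8):

```python
# TPS: total records inserted divided by total execution time in seconds.
records = 1_000_000    # illustrative record count, not a measured value
elapsed_s = 50.0       # illustrative run time, not a measured value
tps = records / elapsed_s

# Scaling efficiency: aggregate speedup divided by node count.
speedup_8_nodes = 7.0              # measured aggregate scaling (Figure 8)
efficiency = speedup_8_nodes / 8   # 0.875, i.e. the ~88% quoted in the abstract
```
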

The second test uses a query that is logically identical to query 6 of the industry standard TPC-H benchmark, rewritten for execution on top of RAF and for execution on top of Hive. Both the RAF test and the Hive test are performed on an 8 node Intel® Xeon® E5-2680 cluster, and the dataset size used is 120G. As shown in Figure 9, RAF achieves nearly 9X the performance of Hive on the 1st iteration. We do not compare the 2nd and subsequent iterations of this test with Hive because, since RAF explicitly uses memory as storage, its response time for this query drops to 150ms, but Hive cannot benefit in like manner. These unit testing results show the advantage of the in-memory distributed processing oriented design of RAF, to be substantiated further by more sophisticated workloads. Currently we are building more TPC-H derived queries, and a high concurrency operational analytics workload.

Figure 8: Scalability Testing Result

Figure 9: Latency relative to Hive/HDFS

B. Solution-level Implementation and Testing

In sections 4.2.1 and 4.2.2 we describe two solutions being developed by software vendors, and initial solution performance. They each require storing a large volume of new data at the same time that the information must be analyzed in real time. Both sections describe the problem being solved and the performance obtained by the solution in current testing.

4.2.1 Telecommunications Subscriber Management

RAF is used by an ISV (independent software vendor) whose customers are telecommunications service providers. The customers have hundreds of millions of subscribers. A "Unified Service Platform" (USP) provided by the ISV conducts subscriber transactions, many of which may be self-service transactions. The typical scenario comprises a subscriber pushing fresh credits into his/her mobile services account, prepaying or paying on demand for a broad spectrum of products. A subscriber belongs to one province company of a telecommunications service provider, but he/she can use prepaid cards issued by province companies that are different from the one to which he/she belongs. USP performs transactions, routes requests and responses back and forth among the participating IT systems of province companies, logs transaction histories, and carries out desired analytics against stored data or in-progress transactions.

For one specific customer (a telecommunications provider), there are more than 300 million subscribers, and USP transactions have a volume of about 10 million per day, with a peak volume of 1000 transactions/second. The customer has requested the ISV to furnish USP with two types of business analytics: drill-downs of different categories of transactions according to subscribers and services consumed, and real-time detection and flagging of suspicious transactions. One particular type of suspicious transaction occurs when someone's account is hacked or a phishing attack compromises a subscriber's secret card number; it is detected by monitoring transactions for evidence of abnormal levels of activity within a short time period. Previously, USP was implemented on top of a commercial RDBMS, and on a 40 core Intel® Xeon® server (E7-4870) platform it yielded an average transaction time of 3s, which could not meet the deployment goal. Now, with RAF, on a 5 node Intel® Xeon® E5-2680 cluster (16 cores/node), the data resulting from credit card activity is funneled into in-memory data stores for inline analysis of the transactions. As a result, transaction times have been reduced ten-fold, to 300 ms. The new solution has also created significant headroom for running complex ad-hoc queries that obtain detailed drill-downs, and we are in the process of building a systematic solution level workload to benchmark the ad-hoc query performance.

4.2.2 Safe City Solution

The second solution under development targets license plate crime in China. To avoid steep expenses like road tax, mandatory insurance, etc., unscrupulous drivers can (and do) create fake license plates. One ISV we are working with addresses this problem by capturing images of vehicles, automatically tying license plate numbers to vehicle descriptions, and then detecting duplications using offline, batch analysis. Since it is hard for the police to be effective with long latency results, the ISV is shifting to using RAF to create a real-time detection system. Each vehicle captured by a distributed sensor feed is tracked by funneling data into an in-memory RAF cluster, which can also perform the necessary checks for duplicated license plate numbers and for discrepancies between registration information and sensor data about a vehicle's color and make. Recent travel history can also reveal whether a license plate is suspicious because the vehicle it is on has traveled too far too soon. A variety of common searches and aggregations also become possible on the data feeds that have recently arrived from the sensors. In current testing, a 5 node Intel® Xeon® E5-2680 cluster handles data feeds from 600+ video cameras in less than a second, exceeding the ISV's requirement.

V. CONTINUING AND FUTURE WORK

Motivated by the high degree of familiarity that many developers have with database interfaces, we are incrementally introducing SQL-92/JDBC/ODBC like interfaces on top of RAF. A number of optimizations are also being added; these include: (1) application requested indexing, to accelerate searches; (2) blending in column-store capabilities where appropriate (for example, for rarely-written data); and (3) compression, in order to reduce data transported between nodes.

VI. CONCLUSION

This paper presents RAF, an architectural approach that meshes memory-centric non-relational query processing for low latency analytics with memory-centric update processing, in order to accommodate high volumes of updates and to surface those updates for inclusion into analytics. This blending is kept efficient by merging memory management between the two spheres of operation, using a transparent versioning capability that we term Delegate, which participates as a special type of content transformer in a hierarchy of RDD transformations. In RAF, protocol buffers are used to obtain data abstraction and efficient conveyance among applications, providing applications with a high degree of independence in location, representation, and transmission of data. The use of protocol buffers is particularly valuable as it removes the need for producers and consumers to coordinate explicitly in partitioning, tiering, caching, or replication of information that may be arriving at high velocities. These improvements for distributing a memory-based storage service are combined with a message-queue based execution partitioning mechanism in which each node acts as a local monitor over its data partitions. A light-weight but expressive interface makes it easy for RAF services to map the various transformations that need to occur in the course of executing a query into the data flows that need to occur among the distributed execution agents through message queues. This hides all of the plumbing from the services that use RAF as a single seamless platform for modifying, as well as querying, data in the pooled memory and aggregate computing capacity of a cluster of machines. Using unit tests, we show high cluster scaling capability for transactions and an order of magnitude latency improvement for query processing. The paper also describes two real-world usage scenarios in which RAF is being used to create high throughput operational analytics solutions. With rapidly increasing memory capacities and large, OpenStack based clusters, RAF provides a collection of architectural techniques for rapid integration of real-time analytics solutions.

REFERENCES

[1] Apache Hadoop: http://hadoop.apache.org/
[2] Apache HBase: http://hbase.apache.org
[3] H. Kilov, "From semantic to object-oriented data modeling", Proc. 1st Intl. Conf. on Systems Integration, 1990.
[4] K. Shvachko, H. Kuang, S. Radia, R. Chansler, "The Hadoop Distributed File System", Proc. MSST 2010, May 2010.
[5] Memcached: http://www.memcached.org/
[6] Oracle Coherence: http://www.oracle.com/technetwork/middleware/coherence/
[7] H. Plattner, A. Zeier, In-Memory Data Management.
[8] Protobuf: http://code.google.com/p/protobuf/
[9] B. Ramesh, T. Kraus, T. Walter, "Optimization of SQL queries involving aggregate expressions using a plurality of local and global aggregation operations", U.S. Patent 5884299.
[10] Redis: http://www.redis.io/
[11] SAP HANA: http://www.saphana.com/
[12] SEDA: http://www.eecs.harvard.edu/~mdw/proj/seda/
[13] SQLStream: http://www.sqlstream.com/
[14] Vertica: http://www.vertica.com/
[15] VoltDB: http://www.voltdb.com
[16] M. Zaharia, et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", NSDI 2012.
[17] B. Piatt, "OpenStack Overview", http://www.omg.org/news/meetings/tc/ca-10/special-events/pdf/5-3_Piatt.pdf
[18] Spark Programming Guide: http://spark-project.org/docs/latest/scala-programming-guide.html
[19] Star Schema Benchmark: www.cs.umb.edu/~poneil/StarSchemaB.PDF
