Victoria Livschitz, CEO, Grid Dynamics ([email protected])
September 17th, 2008
Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for Cloud User Group
A word about Grid Dynamics
Who we are: global leader in scalability engineering
Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
Value proposition: fusion of innovation with best practices
Focused on the "physics", "economics" and "engineering" of extreme scale
Founded in 2006, 30 people and growing, HQ in Silicon Valley
Services: technology consulting; application & systems architecture, design, development
Customers:
Users of scalable applications: eBay, Bank of America, web start-ups
Makers of scalable middleware: GigaSpaces, Sun, Microsoft
Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS
Why am I speaking here tonight?
We do scalability engineering for a living
Cloud computing is new, very exciting and terribly over-hyped; there is not a lot of solid data on performance, scalability, usability, stability…
Many of our customers are early adopters or enablers; their pains, discoveries and lessons are worth sharing
The practitioner perspective: we recently completed 3 benchmark projects that we can make public, and the results are presented here tonight
Exploring Scalability through Benchmarking
Benchmark 1: Test scalability of EC2 on the simplest map-reduce problem
  Cloud: public commercial cloud (EC2)  Vendor: Amazon  Middleware: GridGain  Application: Monte-Carlo
Benchmark 2: Test scalability of data-driven HPC applications, similar to those used in practice
  Cloud: public commercial cloud (EC2)  Vendor: Amazon  Middleware: GigaSpaces  Application: Risk Management
Benchmark 3: Explore performance implications of data "in the cloud" vs. "outside the cloud"
  Cloud: incubator compute cloud for academic use (CompFin)  Vendor: Microsoft  Middleware: Windows HPC Server, Velocity  Application: Data-intensive Analytics
Benchmark #1: Scalability of Simple Map/Reduce Application on EC2
Basic Scalability of Simple Map/Reduce
Goal: Establish upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
Why Monte-Carlo: simple, widely-used, perfectly scalable problem
Why EC2: most popular public cloud
Why GridGain: simple, open-source map-reduce middleware
Intended claims:
EC2 scales linearly as a grid execution platform
GridGain scales linearly as map-reduce middleware
Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies
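To make the workload concrete, here is a minimal sketch of the kind of embarrassingly parallel Monte-Carlo computation the benchmark runs (estimating pi by random sampling). The split/reduce structure mirrors map-reduce middleware such as GridGain, but the class and method names are illustrative, not GridGain's actual API.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal map/reduce Monte-Carlo sketch: estimates pi by sampling.
public class MonteCarloPi {

    // "Map" step: each grid node would run this with its own slice.
    static long countHits(long iterations, long seed) {
        Random rnd = new Random(seed);
        long hits = 0;
        for (long i = 0; i < iterations; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
        }
        return hits;
    }

    // "Reduce" step: aggregate the partial counts from all nodes.
    static double reduce(List<Long> partials, long itersPerNode) {
        long total = 0;
        for (long h : partials) total += h;
        return 4.0 * total / ((double) itersPerNode * partials.size());
    }

    public static void main(String[] args) {
        long itersPerNode = 5000; // matches the scaling script later on
        int nodes = 8;            // stand-ins for remote grid nodes
        List<Long> partials = new ArrayList<Long>();
        for (int n = 0; n < nodes; n++)
            partials.add(countHits(itersPerNode, n)); // remote call in a real grid
        System.out.println("pi ~= " + reduce(partials, itersPerNode));
    }
}

Because the map tasks share no state and the reduce step is trivial, doubling the nodes and the total iterations together should keep completion time flat, which is exactly what the benchmark measures.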
Other Goals
Understand "process bottlenecks" of the EC2 platform: changes to the programming, deployment and management model; ease of use; security; metering and payment
Identify scalability bottlenecks at any level in the stack: EC2, GridGain, glueware
Robustness, stability, predictability
Architecture
[Diagram] A Grid Console on the corporate intranet controls grid operation; an HTTP server holds the grid configuration and task repository. Inside the Amazon EC2 cloud, a head node running an OpenMQ server performs discovery and task assignment over JMS and manages the worker nodes and tasks; worker nodes handle job execution and JMS message processing, with spare EC2 instances held as spare capacity.
Technology stack: EC2, GridGain, Typica, OpenMQ
Performance Methodology & Results
Same algorithm exercised on a wide range of nodes: 2, 4, 8, 16, …, 256, 512 (limited by Amazon's permission for 550 nodes)
Simultaneously double the amount of computation and the number of nodes (see the scaling script below)
Measure completion time
Repeat several times to get statistical averages
Conclusions:
Total degradation from 13 to 16 seconds, or 20%
Discarding the first 8 nodes, near-perfect scaling up to 128
Slight degradation from 128 to 256 (3%) and from 256 to 512 (7%)
=> Proves the point of near-linear scalability end-to-end
Simple scaling script
// Double both the node count and the total work at each step, so the
// per-node load stays constant and ideal completion time stays flat.
var itersPerNode = 5000;
var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
for (var i = 0; i < cnode.length; i++) {
    var n = cnode[i];
    grid.growEC2Grid(n, true);       // provision EC2 instances up to n nodes
    grid.waitForGridInstances(n);    // block until all n nodes have joined
    runTask(itersPerNode * n, n, 3); // total iterations, node count, 3 runs for averaging
}
Observations
Deployment considerations
Start-up for the whole grid in different configurations is 0.5-3 min
Two-step deployment process: first, bring up one EC2 node as a controller; next, use that controller on the inside to coordinate bootstrapping of the rest
Some EC2 nodes don't finish bootstrapping successfully: on average 0.5% of nodes come up in an incomplete state, and the nature of the problem is not clear. If exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation (see the sketch below).
IP address deadlock issue: the IP addresses of the nodes are needed to start and configure the grid, but they are not available until the grid is up and configured. Bootstrapping must be carefully choreographed, passing IPs as parameters into the controlling scripts.
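A minimal sketch of the "start, cull, replace" approach just described. The Ec2Client interface is a hypothetical stand-in for whatever EC2 client library is in use (the benchmark used Typica); its method names are illustrative, not a real API.

import java.util.ArrayList;
import java.util.List;

public class CullAndReplace {

    // Hypothetical thin EC2 client; not Typica's real interface.
    interface Ec2Client {
        List<String> startInstances(int count);        // returns new instance ids
        boolean bootstrapCompleted(String instanceId); // application-level health check
        void terminate(String instanceId);
    }

    // Start n nodes, kill the ones that came up incomplete (~0.5% in our
    // runs), and top the pool back up before starting the computation.
    static List<String> healthyPool(Ec2Client ec2, int n) throws InterruptedException {
        List<String> pool = new ArrayList<String>(ec2.startInstances(n));
        while (true) {
            Thread.sleep(30000); // give freshly started nodes time to boot
            List<String> sick = new ArrayList<String>();
            for (String id : pool)
                if (!ec2.bootstrapCompleted(id)) sick.add(id);
            if (sick.isEmpty()) return pool; // exact requested capacity reached
            for (String id : sick) {
                ec2.terminate(id);
                pool.remove(id);
            }
            pool.addAll(ec2.startInstances(sick.size()));
            // A production script would bound the retries and alert,
            // rather than loop forever if replacements keep coming up sick.
        }
    }
}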
Observations
Monitoring considerations
Connecting to each node from the outside is possible, but not efficient; instead, check heartbeats from internal management nodes
Local scripts must be stored on S3 or passed back before termination
Programming model considerations
EC2 does not support IP multicast, so we switched to JMS instead; luckily, GridGain supports multiple discovery protocols
Typica: a 3rd-party connectivity library that uses the EC2 query interface. An undocumented limit on URL length is hit with 100s of nodes; Amazon simply drops requests with over-long URLs without specifying the error, so debugging was hard. Workaround: we rewrote parts of our framework to enquire about individual running nodes (sketched below). It works, but is less efficient.
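The workaround, sketched against a stand-in for Typica's EC2 query interface: query instances one at a time, so each request URL stays below the undocumented length limit. The QueryApi interface and method name here are illustrative, not Typica's actual signatures.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PerNodeQuery {

    // Stand-in for the EC2 query interface as exposed by a library like
    // Typica; each call turns into one HTTP request whose URL grows with
    // the number of instance ids passed in.
    interface QueryApi {
        List<String> describeInstances(List<String> instanceIds);
    }

    // Instead of one describeInstances(all 512 ids) call, whose URL blows
    // past the limit, issue one short call per instance: O(n) requests
    // instead of O(1), but every URL fits.
    static List<String> describeAll(QueryApi api, List<String> ids) {
        List<String> out = new ArrayList<String>();
        for (String id : ids)
            out.addAll(api.describeInstances(Collections.singletonList(id)));
        return out;
    }
}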
Observations
Metering and payment
Amazon sets a limit on concurrent VMs; we eventually received approval for 550 VMs after some due diligence from Amazon
Amazon charges by full or partial VM-hours
Sometimes, short usage of VMs is not metered; it is not clear why. One hypothesis: metering "sweeps" happen every so often
Be careful with usage bills for testing: a test may need to be run multiple times, so beware of rogue scripts, test everything on smaller configurations first, and scale gradually, or you will miss the bottlenecks
Achieving scalability
Software breaks at scale, including the glueware
Barrier #1 was hit at 100 nodes because of ActiveMQ scalability. Correction: swapped ActiveMQ for OpenMQ. Comment: some users report better ActiveMQ scalability with 5.x
Barrier #2 was hit at 300 nodes because of the Typica URL length limit. Correction: changed our use of the API
Security considerations: EC2 credentials are passed to the head node, so 3rd-party GridGain tasks can access them. This sounds like a potential vulnerability
What have we learned?
EC2 is ready for production use for large-scale stateless computations
Price/performance: strong linear scaling curve
GridGain showed itself very well: scale, stability, ease of use, pluggability; a solid open-source choice of map-reduce middleware
Some level of effort is required to "port" a grid system to EC2: deployment, monitoring, programming model, metering, security
What's next? Can we go higher than 512? What is the behavior of more complex applications?
Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2
Data-driven Risk Management on EC2
Goal: Investigate scalability of a prototypical risk management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
Why risk management: a class of problems widely used in financial services
Why GigaSpaces: leading middleware platform for compute & data grids
Intended claims:
EC2 scales linearly for data-driven HPC applications
GigaSpaces scales well as both compute and data grid middleware
Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies
Architecture
[Diagram] The user manages the grid with ec2-gdc-tools through a Grid Console. Inside the Amazon EC2 grid, a Service Grid manager oversees a compute grid and a data grid. The master writes tasks into the data grid and waits for results; workers take tasks, perform calculations, and write the results back.
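A minimal sketch of the master-worker flow in the diagram above. The Space interface is a simplified stand-in for a GigaSpaces-style write/take API (matching semantics are simplified to match-by-type here), and Task/Result are illustrative classes, not the benchmark's real ones.

import java.io.Serializable;

public class MasterWorker {

    // Simplified tuple-space stand-in for the data grid.
    interface Space {
        void write(Object entry);
        <T> T take(Class<T> type); // blocks until an entry of that type appears
    }

    static class Task implements Serializable { long scenarios; Task(long s) { scenarios = s; } }
    static class Result implements Serializable { double value; Result(double v) { value = v; } }

    // Master: writes tasks into the data grid, then waits for results.
    static double master(Space space, int tasks, long scenariosPerTask) {
        for (int i = 0; i < tasks; i++) space.write(new Task(scenariosPerTask));
        double sum = 0;
        for (int i = 0; i < tasks; i++) sum += space.take(Result.class).value;
        return sum;
    }

    // Worker loop: take a task, compute, write the result back.
    static void workerLoop(Space space) {
        while (true) {
            Task t = space.take(Task.class);
            space.write(new Result(simulate(t.scenarios)));
        }
    }

    static double simulate(long scenarios) {
        return scenarios; // placeholder for the real Monte-Carlo pricing
    }
}

Because workers pull tasks from the space rather than having work pushed to them, adding EC2 nodes simply adds consumers, until the data grid itself becomes the bottleneck.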
Performance methodology & results
Same algorithm exercised on a wide range of nodes: 16, 32, 128, 256, 512 (still limited by Amazon's permission for 550)
Constant size of the data grid (4 large EC2 nodes)
Double the nodes with a constant amount of work; measure completion time (striving for linear time reduction)
Conclusions:
Near-perfect scaling from 16 to 256 nodes
28% degradation from 256 to 512, as the data cache becomes a bottleneck
What have we learned?
EC2 is ready for production use for classes of large-scale data-driven HPC applications common to risk management
GigaSpaces showed itself very well: the compute/data grid scales well in the master-worker pattern
Some level of effort is required to "port" a grid system to EC2: deployment, monitoring, programming model, metering, security. Bootstrapping this system is far more complex than GridGain's; for more details, contact me offline
What's next? How does the data grid scale? What about more complex applications? What's the scalability of a co-located compute/data grid configuration?
Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications
Data-intensive Analytics on MS cloud
Goal: Investigate performance improvements from data "in the cloud" vs. "outside the cloud" for complex data-intensive analytical applications, in the context of the HPC CompFin++ Labs environment, using Velocity
What is CompFin++ Labs: an MS-funded "incubator" compute cloud for exploration of modern compute & data challenges at massive scale
What is Velocity: Microsoft's new in-memory data grid middleware, still in CTP1
The model: computes correlation between stock prices over time. The algorithms use a significant amount of data which can be cached; the maximum cache hit ratio for the model is around 90%.
Intended claims: measure the impact of data "closeness" to the computation on the cloud
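For reference, the model's core computation, correlation between two price series, is compact. This is a generic Pearson correlation in Java, an illustration rather than the benchmark's actual code (which, given the Windows HPC Server / Velocity stack, is presumably .NET-based and runs many such computations over cached tick data).

// Pearson correlation between two equally long price series.
public class Correlation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n; // n * covariance
        double vx = sxx - sx * sx / n;  // n * variance of x
        double vy = syy - sy * sy / n;  // n * variance of y
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] a = {10, 11, 12, 13};
        double[] b = {20, 22, 24, 26};
        System.out.println(pearson(a, b)); // prints 1.0: perfectly correlated
    }
}

The interesting part for the benchmark is not the formula but where x and y live: fetched from outside the cloud on every call, served from a distributed cache, or served from a local near cache.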
Architecture: CompFin [diagram]
Architecture: Anticipated Bottlenecks [diagram]
Architecture: CompFin + Velocity [diagram]
Benchmarked configurations
Same analytical model with complex queries, in four configurations:
Perfect linear scale curve (baseline)
Original CompFin
Distributed cache (original CompFin + Velocity distributed cache for financial data)
Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing; see the pattern sketch below)
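The "local cache" configuration layers a near cache over the distributed cache. The sketch below shows the general pattern (check an in-process map before going to the remote cache, and only then the backing store); it is not Velocity's API, which is .NET-based, and invalidation is omitted for brevity.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic near-cache pattern: local map -> distributed cache -> store.
public class NearCache<K, V> {

    interface RemoteCache<K, V> { V get(K key); void put(K key, V value); }
    interface Store<K, V>       { V load(K key); }

    private final Map<K, V> local = new ConcurrentHashMap<K, V>();
    private final RemoteCache<K, V> remote;
    private final Store<K, V> store;

    NearCache(RemoteCache<K, V> remote, Store<K, V> store) {
        this.remote = remote;
        this.store = store;
    }

    V get(K key) {
        V v = local.get(key);    // 1. in-process hit: no network at all
        if (v != null) return v;
        v = remote.get(key);     // 2. distributed cache hit: one network hop
        if (v == null) {
            v = store.load(key); // 3. miss: load from the backing store
            remote.put(key, v);
        }
        local.put(key, v);       // warm the near cache for the next call
        return v;
    }
}

Data-aware routing is what makes this effective: requests for the same data land on the same node, so its local map stays hot, which is also why the local-cache configuration took load off the struggling distributed cache.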
Test methodology: 3 ways of measuring scalability were used
Test 1: fixed amount of computation, increasing amount of data
Test 2: fixed amount of data, increasing amount of computation
Test 3: proportional increase of computation and nodes
"Node" = 1 core; "data unit" = 32 million records, or 512 megabytes of tick data (16 bytes per record)

            Test 1          Test 2          Test 3
Run #       1    2    3     4    5    6     7    8    9
Nodes       8   32   32    32   32   32    64  128  200
Data units  1    1    1     6   12   12    24   48   69
Performance results [charts]
Conclusions
Data "in the cloud" definitely matters! Performance improvements of up to 31x over "outside the cloud"
Velocity's distributed cache has some scalability challenges: failure on a 50-node cluster with 200 concurrent clients. Good news: it's a very young product and MS is actively improving it
Compute-data affinity matters too! Significant performance gain of the local cache over the distributed cache; the local cache resolved the distributed cache's scalability issue by reducing its load
Final Remarks
Clouds are proving themselves out: early adopters are there already, and the rest of the real world will join soon
There are still significant adoption challenges: technology immaturity; lack of real data, best practices and robust design patterns; the "fitting" of application middleware to cloud platforms is just starting
Amazon is the leading commercial cloud provider, but is not the only game in town: companies are building public, private, dedicated and special-purpose clouds
Victoria Livschitz ([email protected])
Thank You!