Victoria Livschitz, CEO, Grid Dynamics ([email protected])
September 17th, 2008
Using Grid Technologies on the Cloud for High Scalability
A Practitioner Report for Cloud User Group
A word about Grid Dynamics
Who we are: global leader in scalability engineering
Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence
Value proposition: fusion of innovation with best practices
Focused on the "physics", "economics" and "engineering" of extreme scale
Founded in 2006, 30 people and growing, HQ in Silicon Valley
Services: technology consulting; application & systems architecture, design, development
Customers:
Users of scalable applications: eBay, Bank of America, web start-ups
Makers of scalable middleware: GigaSpaces, Sun, Microsoft
Partners: GridGain, GigaSpaces, Terracotta, DataSynapse, Sun, MS
Why am I speaking here tonight?
We do scalability engineering for a living
Cloud computing is new, very exciting and terribly over-hyped; there is not a lot of solid data on performance, scalability, usability, stability…
Many of our customers are early adopters or enablers; their pains, discoveries and lessons are worth sharing
The practitioner perspective: we recently completed 3 benchmark projects that we can make public, and the results are presented here tonight
Exploring Scalability through Benchmarking
Benchmark 1: Test scalability of EC2 on the simplest map-reduce problem
  Cloud: public commercial cloud (EC2)  Vendor: Amazon  Middleware: GridGain  Application: Monte-Carlo
Benchmark 2: Test scalability of data-driven HPC applications, similar to those used in practice
  Cloud: public commercial cloud (EC2)  Vendor: Amazon  Middleware: GigaSpaces  Application: Risk Management
Benchmark 3: Explore performance implications of data "in the cloud" vs. "outside the cloud"
  Cloud: incubator compute cloud for academic use (CompFin)  Vendor: Microsoft  Middleware: Windows HPC Server, Velocity  Application: Data-intensive Analytics
Benchmark #1: Scalability of Simple Map/Reduce Application on EC2
Basic Scalability of Simple Map/Reduce
Goal: Establish upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain
Why Monte-Carlo: simple, widely-used, perfectly scalable problem
Why EC2: most popular public cloud
Why GridGain: simple, open-source map-reduce middleware
Intended claims:
EC2 scales linearly as a grid execution platform
GridGain scales linearly as map-reduce middleware
Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies
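To make the workload concrete, here is a minimal sketch of the kind of embarrassingly parallel Monte-Carlo computation the benchmark runs (estimating pi by random sampling). The split/reduce structure mirrors map-reduce middleware such as GridGain, but the class and method names are illustrative, not GridGain's actual API.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal map/reduce Monte-Carlo sketch: estimates pi by sampling.
public class MonteCarloPi {

    // "Map" step: each grid node would run this with its own slice.
    static long countHits(long iterations, long seed) {
        Random rnd = new Random(seed);
        long hits = 0;
        for (long i = 0; i < iterations; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
        }
        return hits;
    }

    // "Reduce" step: aggregate the partial counts from all nodes.
    static double reduce(List<Long> partials, long itersPerNode) {
        long total = 0;
        for (long h : partials) total += h;
        return 4.0 * total / ((double) itersPerNode * partials.size());
    }

    public static void main(String[] args) {
        long itersPerNode = 5000; // matches the scaling script later on
        int nodes = 8;            // stand-ins for remote grid nodes
        List<Long> partials = new ArrayList<Long>();
        for (int n = 0; n < nodes; n++)
            partials.add(countHits(itersPerNode, n)); // remote call in a real grid
        System.out.println("pi ~= " + reduce(partials, itersPerNode));
    }
}

Because the map tasks share no state and the reduce step is trivial, doubling the nodes and the total iterations together should keep completion time flat, which is exactly what the benchmark measures.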
Other Goals
Understand "process bottlenecks" of the EC2 platform: changes to the programming, deployment and management model; ease of use; security; metering and payment
Identify scalability bottlenecks at any level in the stack: EC2, GridGain, glueware
Robustness, stability, predictability
Architecture
[Diagram] A Grid Console on the corporate intranet controls grid operation; an HTTP server holds the grid configuration and task repository. Inside the Amazon EC2 cloud, a head node running an OpenMQ server performs discovery and task assignment over JMS and manages the worker nodes and tasks; worker nodes handle job execution and JMS message processing, with spare EC2 instances held as spare capacity.
Technology stack: EC2, GridGain, Typica, OpenMQ
Performance Methodology & Results
Same algorithm exercised on a wide range of nodes: 2, 4, 8, 16, …, 256, 512 (limited by Amazon's permission for 550 nodes)
Simultaneously double the amount of computation and the number of nodes (see the scaling script below)
Measure completion time
Repeat several times to get statistical averages
Conclusions:
Total degradation from 13 to 16 seconds, or 20%
Discarding the first 8 nodes, near-perfect scaling up to 128
Slight degradation from 128 to 256 (3%) and from 256 to 512 (7%)
=> Proves the point of near-linear scalability end-to-end
Simple scaling script
// Double both the node count and the total work at each step, so the
// per-node load stays constant and ideal completion time stays flat.
var itersPerNode = 5000;
var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512];
for (var i = 0; i < cnode.length; i++) {
    var n = cnode[i];
    grid.growEC2Grid(n, true);       // provision EC2 instances up to n nodes
    grid.waitForGridInstances(n);    // block until all n nodes have joined
    runTask(itersPerNode * n, n, 3); // total iterations, node count, 3 runs for averaging
}
Observations
Deployment considerations
Start-up for the whole grid in different configurations is 0.5-3 min
Two-step deployment process: first, bring up one EC2 node as a controller; next, use that controller on the inside to coordinate bootstrapping of the rest
Some EC2 nodes don't finish bootstrapping successfully: on average 0.5% of nodes come up in an incomplete state, and the nature of the problem is not clear. If exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting the computation (see the sketch below).
IP address deadlock issue: the IP addresses of the nodes are needed to start and configure the grid, but they are not available until the grid is up and configured. Bootstrapping must be carefully choreographed, passing IPs as parameters into the controlling scripts.
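A minimal sketch of the "start, cull, replace" approach just described. The Ec2Client interface is a hypothetical stand-in for whatever EC2 client library is in use (the benchmark used Typica); its method names are illustrative, not a real API.

import java.util.ArrayList;
import java.util.List;

public class CullAndReplace {

    // Hypothetical thin EC2 client; not Typica's real interface.
    interface Ec2Client {
        List<String> startInstances(int count);        // returns new instance ids
        boolean bootstrapCompleted(String instanceId); // application-level health check
        void terminate(String instanceId);
    }

    // Start n nodes, kill the ones that came up incomplete (~0.5% in our
    // runs), and top the pool back up before starting the computation.
    static List<String> healthyPool(Ec2Client ec2, int n) throws InterruptedException {
        List<String> pool = new ArrayList<String>(ec2.startInstances(n));
        while (true) {
            Thread.sleep(30000); // give freshly started nodes time to boot
            List<String> sick = new ArrayList<String>();
            for (String id : pool)
                if (!ec2.bootstrapCompleted(id)) sick.add(id);
            if (sick.isEmpty()) return pool; // exact requested capacity reached
            for (String id : sick) {
                ec2.terminate(id);
                pool.remove(id);
            }
            pool.addAll(ec2.startInstances(sick.size()));
            // A production script would bound the retries and alert,
            // rather than loop forever if replacements keep coming up sick.
        }
    }
}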
Observations
Monitoring considerations
Connecting to each node from the outside is possible, but not efficient; instead, check heartbeats from internal management nodes
Local scripts must be stored on S3 or passed back before termination
Programming model considerations
EC2 does not support IP multicast, so we switched to JMS instead; luckily, GridGain supports multiple discovery protocols
Typica: a 3rd-party connectivity library that uses the EC2 query interface. An undocumented limit on URL length is hit with 100s of nodes; Amazon simply drops requests with over-long URLs without specifying the error, so debugging was hard. Workaround: we rewrote parts of our framework to enquire about individual running nodes (sketched below). It works, but is less efficient.
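The workaround, sketched against a stand-in for Typica's EC2 query interface: query instances one at a time, so each request URL stays below the undocumented length limit. The QueryApi interface and method name here are illustrative, not Typica's actual signatures.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PerNodeQuery {

    // Stand-in for the EC2 query interface as exposed by a library like
    // Typica; each call turns into one HTTP request whose URL grows with
    // the number of instance ids passed in.
    interface QueryApi {
        List<String> describeInstances(List<String> instanceIds);
    }

    // Instead of one describeInstances(all 512 ids) call, whose URL blows
    // past the limit, issue one short call per instance: O(n) requests
    // instead of O(1), but every URL fits.
    static List<String> describeAll(QueryApi api, List<String> ids) {
        List<String> out = new ArrayList<String>();
        for (String id : ids)
            out.addAll(api.describeInstances(Collections.singletonList(id)));
        return out;
    }
}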
Observations
Metering and payment
Amazon sets a limit on concurrent VMs; we eventually received approval for 550 VMs after some due diligence from Amazon
Amazon charges by full or partial VM-hours
Sometimes, short usage of VMs is not metered; it is not clear why. One hypothesis: metering "sweeps" happen every so often
Be careful with usage bills for testing: a test may need to be run multiple times, so beware of rogue scripts, test everything on smaller configurations first, and scale gradually, or you will miss the bottlenecks
Achieving scalability
Software breaks at scale, including the glueware
Barrier #1 was hit at 100 nodes because of ActiveMQ scalability. Correction: swapped ActiveMQ for OpenMQ. Comment: some users report better ActiveMQ scalability with 5.x
Barrier #2 was hit at 300 nodes because of the Typica URL length limit. Correction: changed our use of the API
Security considerations: EC2 credentials are passed to the head node, so 3rd-party GridGain tasks can access them. This sounds like a potential vulnerability
What have we learned?
EC2 is ready for production use for large-scale stateless computations
Price/performance: strong linear scaling curve
GridGain showed itself very well: scale, stability, ease of use, pluggability; a solid open-source choice of map-reduce middleware
Some level of effort is required to "port" a grid system to EC2: deployment, monitoring, programming model, metering, security
What's next? Can we go higher than 512? What is the behavior of more complex applications?
Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2
Data-driven Risk Management on EC2
Goal: Investigate scalability of a prototypical risk management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces
Why risk management: a class of problems widely used in financial services
Why GigaSpaces: leading middleware platform for compute & data grids
Intended claims:
EC2 scales linearly for data-driven HPC applications
GigaSpaces scales well as both compute and data grid middleware
Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies
Architecture
[Diagram] The user manages the grid with ec2-gdc-tools through a Grid Console. Inside the Amazon EC2 grid, a Service Grid manager oversees a compute grid and a data grid. The master writes tasks into the data grid and waits for results; workers take tasks, perform calculations, and write the results back.
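A minimal sketch of the master-worker flow in the diagram above. The Space interface is a simplified stand-in for a GigaSpaces-style write/take API (matching semantics are simplified to match-by-type here), and Task/Result are illustrative classes, not the benchmark's real ones.

import java.io.Serializable;

public class MasterWorker {

    // Simplified tuple-space stand-in for the data grid.
    interface Space {
        void write(Object entry);
        <T> T take(Class<T> type); // blocks until an entry of that type appears
    }

    static class Task implements Serializable { long scenarios; Task(long s) { scenarios = s; } }
    static class Result implements Serializable { double value; Result(double v) { value = v; } }

    // Master: writes tasks into the data grid, then waits for results.
    static double master(Space space, int tasks, long scenariosPerTask) {
        for (int i = 0; i < tasks; i++) space.write(new Task(scenariosPerTask));
        double sum = 0;
        for (int i = 0; i < tasks; i++) sum += space.take(Result.class).value;
        return sum;
    }

    // Worker loop: take a task, compute, write the result back.
    static void workerLoop(Space space) {
        while (true) {
            Task t = space.take(Task.class);
            space.write(new Result(simulate(t.scenarios)));
        }
    }

    static double simulate(long scenarios) {
        return scenarios; // placeholder for the real Monte-Carlo pricing
    }
}

Because workers pull tasks from the space rather than having work pushed to them, adding EC2 nodes simply adds consumers, until the data grid itself becomes the bottleneck.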
Performance methodology & results
Same algorithm exercised on a wide range of nodes: 16, 32, 128, 256, 512 (still limited by Amazon's permission for 550)
Constant size of the data grid (4 large EC2 nodes)
Double the nodes with a constant amount of work; measure completion time (striving for linear time reduction)
Conclusions:
Near-perfect scaling from 16 to 256 nodes
28% degradation from 256 to 512, as the data cache becomes a bottleneck
What have we learned?
EC2 is ready for production use for classes of large-scale data-driven HPC applications common to risk management
GigaSpaces showed itself very well: the compute/data grid scales well in the master-worker pattern
Some level of effort is required to "port" a grid system to EC2: deployment, monitoring, programming model, metering, security. Bootstrapping this system is far more complex than GridGain's; for more details, contact me offline
What's next? How does the data grid scale? What about more complex applications? What's the scalability of a co-located compute/data grid configuration?
Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications
Data-intensive Analytics on MS cloud
Goal: Investigate performance improvements from data "in the cloud" vs. "outside the cloud" for complex data-intensive analytical applications, in the context of the HPC CompFin++ Labs environment, using Velocity
What is CompFin++ Labs: an MS-funded "incubator" compute cloud for exploration of modern compute & data challenges at massive scale
What is Velocity: Microsoft's new in-memory data grid middleware, still in CTP1
The model: computes correlation between stock prices over time. The algorithms use a significant amount of data which can be cached; the maximum cache hit ratio for the model is around 90%.
Intended claims: measure the impact of data "closeness" to the computation on the cloud
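For reference, the model's core computation, correlation between two price series, is compact. This is a generic Pearson correlation in Java, an illustration rather than the benchmark's actual code (which, given the Windows HPC Server / Velocity stack, is presumably .NET-based and runs many such computations over cached tick data).

// Pearson correlation between two equally long price series.
public class Correlation {

    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i];
            sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n; // n * covariance
        double vx = sxx - sx * sx / n;  // n * variance of x
        double vy = syy - sy * sy / n;  // n * variance of y
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] a = {10, 11, 12, 13};
        double[] b = {20, 22, 24, 26};
        System.out.println(pearson(a, b)); // prints 1.0: perfectly correlated
    }
}

The interesting part for the benchmark is not the formula but where x and y live: fetched from outside the cloud on every call, served from a distributed cache, or served from a local near cache.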
Architecture: CompFin [diagram]
Architecture: Anticipated Bottlenecks [diagram]
Architecture: CompFin + Velocity [diagram]
Benchmarked configurations
Same analytical model with complex queries, in four configurations:
Perfect linear scale curve (baseline)
Original CompFin
Distributed cache (original CompFin + Velocity distributed cache for financial data)
Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing; see the pattern sketch below)
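The "local cache" configuration layers a near cache over the distributed cache. The sketch below shows the general pattern (check an in-process map before going to the remote cache, and only then the backing store); it is not Velocity's API, which is .NET-based, and invalidation is omitted for brevity.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Generic near-cache pattern: local map -> distributed cache -> store.
public class NearCache<K, V> {

    interface RemoteCache<K, V> { V get(K key); void put(K key, V value); }
    interface Store<K, V>       { V load(K key); }

    private final Map<K, V> local = new ConcurrentHashMap<K, V>();
    private final RemoteCache<K, V> remote;
    private final Store<K, V> store;

    NearCache(RemoteCache<K, V> remote, Store<K, V> store) {
        this.remote = remote;
        this.store = store;
    }

    V get(K key) {
        V v = local.get(key);    // 1. in-process hit: no network at all
        if (v != null) return v;
        v = remote.get(key);     // 2. distributed cache hit: one network hop
        if (v == null) {
            v = store.load(key); // 3. miss: load from the backing store
            remote.put(key, v);
        }
        local.put(key, v);       // warm the near cache for the next call
        return v;
    }
}

Data-aware routing is what makes this effective: requests for the same data land on the same node, so its local map stays hot, which is also why the local-cache configuration took load off the struggling distributed cache.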
Test methodology: 3 ways of measuring scalability were used
Test 1: fixed amount of computation, increasing amount of data
Test 2: fixed amount of data, increasing amount of computation
Test 3: proportional increase of computation and nodes
"Node" = 1 core; "data unit" = 32 million records, or 512 megabytes of tick data (16 bytes per record)

            Test 1          Test 2          Test 3
Run #       1    2    3     4    5    6     7    8    9
Nodes       8   32   32    32   32   32    64  128  200
Data units  1    1    1     6   12   12    24   48   69
Performance results [charts]
Conclusions
Data "in the cloud" definitely matters! Performance improvements of up to 31x over "outside the cloud"
Velocity's distributed cache has some scalability challenges: failure on a 50-node cluster with 200 concurrent clients. Good news: it's a very young product and MS is actively improving it
Compute-data affinity matters too! Significant performance gain of the local cache over the distributed cache; the local cache resolved the distributed cache's scalability issue by reducing its load
Final Remarks
Clouds are proving themselves out: early adopters are there already, and the rest of the real world will join soon
There are still significant adoption challenges: technology immaturity; lack of real data, best practices and robust design patterns; the "fitting" of application middleware to cloud platforms is just starting
Amazon is the leading commercial cloud provider, but is not the only game in town: companies are building public, private, dedicated and special-purpose clouds
Victoria Livschitz ([email protected])
Thank You!