Evaluating Condor for Enterprise Use: A UBS Case Study
April 26, 2006
Gregg Cooke, IT Technical Council
GENERALLY ACCESSIBLE
Overview
Context: Why UBS Uses Grids
Tests: What Did We Look At?
Results: Strengths & Limitations
SECTION 1
The Context: Grids in an Investment Bank
Grids at UBS
What do we mean by “grid”?
Specifically, when we say “grid” we mean a computational cluster
– Condor fits the definition closely
Other terminology:
Condor term        UBS term
Pool               Grid
Job cluster        Job
Job                Task
Virtual machine    Engine or Node
Central Manager    Broker or Manager
Grids at UBS
Why do we use grids?
Complex, long-running calculations include:
– Monte Carlo simulations of risk exposure
– Black-Scholes option valuations on portfolios of stock options
– Valuation of complicated “exotic” financial instruments
Speed of computation directly correlates to volume of sales
Accuracy of risk exposure calculation directly correlates to reserve cash
Calculations constructed by quantitative analysts (“quants”)
– Write code that’s easy to change, not code that’s particularly efficient or parallelized
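As an illustration of the kind of calculation listed above, here is a minimal Python sketch of a Black-Scholes valuation of a European call, plus a helper that splits a portfolio into independent chunks. The function names and the chunking helper are assumptions made for illustration, not UBS code; each chunk would correspond to one task in UBS terms.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       vol: float, years: float) -> float:
    """Black-Scholes price of a European call on a non-dividend-paying stock."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (
        vol * math.sqrt(years))
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def split_into_tasks(portfolio: list, task_size: int) -> list:
    """Split a portfolio into independent chunks (hypothetical helper).

    Each option is priced independently, so every chunk can run on a
    different engine without talking to the others.
    """
    return [portfolio[i:i + task_size]
            for i in range(0, len(portfolio), task_size)]
```

Because no chunk depends on another, this workload is of the embarrassingly parallel kind that a grid handles well.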
Current Grid Environment at UBS
How do we build & run our grids?
10 separate production grids totaling 3000+ engines
– All separate grids…some 60-engine, some 2000-engine
– 1 million tasks per day
Wide variety of platforms, languages, architectures
– C/C++, C#, Java on Windows or Linux
– Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow
– Rarely any greenfield development
Dedicated deployment & operations teams (“GSD”)
– Straddle the development / operations worlds
– Focused on meeting business SLAs
– Strong drivers of what grid platform we use
Typical UBS Grid Environment
[Diagram: Trader Desktop, Manager, and Engine-1 through Engine-N, connected by flows labeled job specification, job status, task assignments, and task input / task results]
Quants
• write the calculations
• part of the business
GSD
• makes app meet SLAs
• faces off with business
Dev
• builds & tests the application
• uses quant code, partners with GSD
SECTION 2
The Tests: Function, not Performance
How to Test Condor?
Feasibility Study: is Condor suitable for use within our enterprise?
No performance tests…instead:
Determine the functional limits of Condor
Determine how Condor integrates with existing enterprise systems
Port one or more projects to use Condor and measure:
– Porting effort
– Opportunities for new functionality (and cost of lost functionality)
– Operational impact
The Tests
We tested the following aspects of Condor:
Scheduling capabilities
– Various combinations of Requirements, Rank, Start, Suspend, etc. rules
Administrative capabilities
– Features of command line tools, common admin practices
Interaction model
– Integrating Condor with an app: APIs, SOAP interface, command line interface
Robustness and resilience
– Failover options, long-term stability, task retry, realtime reconfiguration, etc.
Usability
– Impact to the user when a Condor engine is installed on their desktop
And…scheduling latency…
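To make the scheduling rules concrete: Requirements and Rank can be set per job in a Condor submit description, while Start and Suspend are machine-side policy expressions set in the Condor configuration (not shown here). The sketch below is a Python helper that assembles a submit description as text. The helper name, the executable, and the file names are hypothetical; the submit commands (`universe`, `executable`, `requirements`, `rank`, `queue`) and the `$(Process)` macro are standard Condor syntax.

```python
def build_submit_description(executable: str, arguments: str,
                             requirements: str, rank: str,
                             task_count: int) -> str:
    """Return the text of a Condor submit description (hypothetical helper).

    `queue N` creates one Condor job cluster holding N jobs, which in
    UBS terms is one job made up of N tasks.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # Requirements: a machine must satisfy this expression to match.
        f"requirements = {requirements}",
        # Rank: among matching machines, prefer those with higher values.
        f"rank = {rank}",
        # $(Process) expands to 0..N-1, giving each task its own files.
        "output = task.$(Process).out",
        "error = task.$(Process).err",
        "log = job.log",
        f"queue {task_count}",
    ]
    return "\n".join(lines) + "\n"


# Example: 100 tasks, Linux machines with at least 512 MB, fastest first.
description = build_submit_description(
    executable="price_options",          # hypothetical program name
    arguments="portfolio.csv",           # hypothetical input file
    requirements='(OpSys == "LINUX") && (Memory >= 512)',
    rank="KFlops",
    task_count=100,
)
```

Writing `description` to a file and passing that file to `condor_submit` would submit the job; the expressions shown are only examples of what a policy might look like.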
Scheduling Latency
Definition: the interval between the initial request and when the first engine starts working on your task
Applications may be designed with a given scheduling latency in mind
– We can control how long our code takes…we cannot control the scheduling latency
– Redevelopment is often a major undertaking
We were expecting a very short (100 msec) deterministic scheduling latency
– Condor’s is much longer (1 min or more) and nondeterministic
– Condor does have an alternative (COD, Computing on Demand) but it changes the expected behavior of the grid
Impact on testing: new set of questions!
– “Does Condor’s scheduling latency present a problem for our applications?”
– “Do we have applications that were not developed with assumptions about the scheduling latency?”
– “Are there other aspects of Condor’s performance that offset the scheduling latency concerns?”
– “Can we measure the performance of our applications on Condor without regard to scheduling latency?”
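A rough way to see why the latency matters is to compute what share of a task's wall-clock time is spent waiting to be scheduled. The Python sketch below uses the two latencies quoted on this slide (100 msec expected, 1 min observed); the task durations are hypothetical values chosen for illustration, and the model ignores everything except latency and run time.

```python
def latency_share(latency_s: float, task_s: float) -> float:
    """Fraction of wall-clock time spent waiting for the first engine."""
    return latency_s / (latency_s + task_s)


EXPECTED_LATENCY_S = 0.1   # 100 msec, the latency UBS expected
OBSERVED_LATENCY_S = 60.0  # 1 min, the lower end of what was observed

# Hypothetical task durations in seconds: a short and a longer task.
for task_s in (10.0, 210.0):
    expected = latency_share(EXPECTED_LATENCY_S, task_s)
    observed = latency_share(OBSERVED_LATENCY_S, task_s)
    print(f"task {task_s:6.1f}s: expected {expected:.1%}, observed {observed:.1%}")
```

Under this simple model a short task spends most of its elapsed time waiting when the latency is a minute, while the same latency is a much smaller share of a long-running batch task.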
SECTION 3
The Result: Condor as a Functional Benchmark
What We Love About Condor
Too many to list…here are the top four:
Incredibly powerful expression-based scheduling policy
No-impact desktop cycle scavenging
Easy reconfiguration
Anything that can be run from a command line can be a task
But, Condor has limits too…
What Condor Needs to Better Support UBS
We found issues in four key areas:
Administrative interface
Code deployment
Scheduling latency
Job submission APIs
Important: remember that these conclusions are only relevant to UBS!
This is only what we found, based on our context…your mileage may vary
Administration Interface
Our conclusions:
What we expected:
– A nice GUI admin console similar to others our operations personnel are familiar with
What we found:
– A rich command-line administration interface, but no GUI
Our conclusion:
– At UBS, Condor will not be used by operations teams that cannot accept a command-line admin interface
– These are usually Windows teams…Unix teams don’t seem to have as much bias
What this means for the Condor community:
– A GUI admin console will make Condor more acceptable to enterprise users
– Web-based is best
– Doesn’t have to be fancy…just needs to be point & click (and stable, of course)
– Work being done at Indiana University on a Condor portal is a start
Code Deployment
Our conclusions:
What we expected:
– Automatic task code deployment done once and refreshed automatically when the grid system senses a change in a central repository
What we found:
– Automatic task code deployment every time a job is submitted
Our conclusion:
– At UBS, Condor causes problems for applications with huge (15Mb+) task code and short tasks, because the network transmission time impacts job completion time
What this means for the Condor community:
– To make Condor more acceptable to enterprise users, task code should be cached at the engines and refreshed only when it changes
– Fortunately, this is being worked on by the Condor Project!
– We’ve watched commercial grid vendors implement this…it is not an easy feature!
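The caching behavior UBS expected can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Condor or any commercial grid implements it: the class name, the `fetch` callback, and the use of a SHA-256 digest to detect changes are all hypothetical choices.

```python
import hashlib
from typing import Callable, Optional


def digest_of(code: bytes) -> str:
    """Content digest used to detect that the task code has changed."""
    return hashlib.sha256(code).hexdigest()


class EngineCodeCache:
    """Per-engine cache that transfers task code only when it changes."""

    def __init__(self) -> None:
        self._code: Optional[bytes] = None
        self._digest: Optional[str] = None
        self.transfers = 0  # how many times code crossed the network

    def get(self, current_digest: str, fetch: Callable[[], bytes]) -> bytes:
        """Return the task code, fetching it only if the cache is stale.

        `current_digest` is what the central repository advertises;
        `fetch` performs the (expensive) network transfer.
        """
        if self._code is None or self._digest != current_digest:
            self._code = fetch()
            self._digest = digest_of(self._code)
            self.transfers += 1
        return self._code


# Three jobs against unchanged code cause one transfer, not three.
code_v1 = b"pricing-library-v1"
cache = EngineCodeCache()
for _ in range(3):
    cache.get(digest_of(code_v1), lambda: code_v1)
```

With deployment on every submission, the same three jobs would pay the transfer cost three times, which is what hurts short tasks with large task code.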
Scheduling Latency
Our conclusions:
What we expected:
– Negligibly small latency that’s deterministic enough for us to predict job completion times
What we found:
– Latencies that depend on configuration settings and complexity of classads
Our conclusion:
– At UBS, Condor cannot be used for tasks that require less than 3.5 min to complete or where the total job completion time must be easily predictable
– However, even though our highest-value applications require short deterministic scheduling latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency
Application Programmer’s Interface
Our conclusions:
What we expected:
– Nice, well-designed APIs for all our favorite languages
What we found:
– A command line interface and a maturing SOAP interface
Our conclusion:
– Once the SOAP interface matures, UBS programmers will be more amenable to using Condor
What this means for the Condor community:
– Full-speed ahead on the SOAP interface!
– Make sure all of the functionality available in the command-line interface is available in the SOAP interface
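Without a language API, an application has to drive the command line interface itself. The Python sketch below shows one way that might look: it runs `condor_submit` and pulls the cluster number out of the confirmation text. The wrapper functions are hypothetical, and the parsing assumes the confirmation line has the form "N job(s) submitted to cluster M."; check this against the Condor version in use.

```python
import re
import subprocess
from typing import Callable


def parse_cluster_id(submit_output: str) -> int:
    """Extract the cluster number from condor_submit's confirmation text.

    Assumes a line of the form "N job(s) submitted to cluster M.".
    """
    match = re.search(r"submitted to cluster (\d+)", submit_output)
    if match is None:
        raise ValueError("no cluster id found in condor_submit output")
    return int(match.group(1))


def run_condor_submit(submit_file: str) -> str:
    """Run condor_submit on a submit description file, return its stdout."""
    completed = subprocess.run(
        ["condor_submit", submit_file],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def submit_job(submit_file: str,
               runner: Callable[[str], str] = run_condor_submit) -> int:
    """Submit a job and return its cluster id.

    `runner` is injectable so the parsing can be exercised without a
    Condor installation.
    """
    return parse_cluster_id(runner(submit_file))
```

Scraping text output like this is fragile, which is one reason a complete programmatic interface such as the SOAP interface is attractive to application developers.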
Condor at UBS
We will continue to use Condor for:
Teaching new teams how to grid their applications
– Condor is an excellent exploration and learning environment
– Has already accelerated at least one team
A functional benchmark for all things grid
– Condor is a crucible where new and innovative grid ideas get tried and refined
– Many of these features will prove valuable for commercial vendors to embrace
  – Check-pointing & task migration
  – Expression-based scheduling policy
  – User-centric cycle scavenging
Non-critical batch-oriented applications with standalone or SOAP-enabled service code, with operations teams that don’t mind a command line administration interface
– There are lots and lots of non-critical batch-oriented apps with standalone services
– There are not a lot of operations teams that will tolerate a command line interface…