
Evaluating Condor for Enterprise Use: A UBS Case Study

April 26, 2006

Gregg Cooke, IT Technical Council

GENERALLY ACCESSIBLE


Overview

Context: Why UBS Uses Grids

Tests: What Did We Look At?

Results: Strengths & Limitations


SECTION 1

The Context: Grids in an Investment Bank


Grids at UBS

What do we mean by “grid”?

Specifically, when we say “grid” we mean a computational cluster
– Condor fits the definition closely

Other terminology:

Condor term        UBS term
Pool               Grid
Job cluster        Job
Job                Task
Virtual machine    Engine or Node
Central Manager    Broker or Manager


Grids at UBS

Why do we use grids?

Complex, long-running calculations include:
– Monte Carlo simulations of risk exposure
– Black-Scholes option valuations on portfolios of stock options
– Valuation of complicated “exotic” financial instruments

Speed of computation directly correlates to volume of sales

Accuracy of risk exposure calculation directly correlates to reserve cash

Calculations constructed by quantitative analysts (“quants”)
– Write code that’s easy to change, not code that’s particularly efficient or parallelized


Current Grid Environment at UBS

How do we build & run our grids?

10 separate production grids totaling 3000+ engines
– All separate grids…some 60-engine, some 2000-engine
– 1 million tasks per day

Wide variety of platforms, languages, architectures
– C/C++, C#, Java on Windows or Linux
– Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow
– Rarely any greenfield development

Dedicated deployment & operations teams (“GSD”)
– Straddle the development / operations worlds
– Focused on meeting business SLAs
– Strong drivers of what grid platform we use


Typical UBS Grid Environment

[Diagram: Trader Desktop, Manager, Engine-1 … Engine-N; flow labels: Job specification, Task input, Task results, Task assignments, Job status]

Quants
• write the calculations
• part of the business

GSD
• makes app meet SLAs
• faces off with business

Dev
• builds & tests the application
• uses quant code, partners with GSD


SECTION 2

The Tests: Function, not Performance


How to Test Condor?

Feasibility Study: is Condor suitable for use within our enterprise?

No performance tests…instead:

Determine the functional limits of Condor

Determine how Condor integrates with existing enterprise systems

Port one or more projects to use Condor and measure:
– Porting effort
– Opportunities for new functionality (and cost of lost functionality)
– Operational impact


The Tests

We tested the following aspects of Condor:

Scheduling capabilities
– Various combinations of Requirements, Rank, Start, Suspend, etc. rules

Administrative capabilities
– Features of command line tools, common admin practices

Interaction model
– Integrating Condor with an app: APIs, SOAP interface, command line interface

Robustness and resilience
– Failover options, long-term stability, task retry, realtime reconfiguration, etc.

Usability
– Impact on the user when a Condor engine is installed on their desktop

And…scheduling latency…
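To make the Requirements and Rank rules concrete, here is a minimal Python sketch of the idea behind them: Requirements acts as a boolean filter over candidate engines, and Rank is a numeric preference where higher is better. This is a toy illustration, not Condor's matchmaker. The machine attributes, the `match` function, and the engine names are all hypothetical, and machine-side policy such as Start and Suspend is not modeled.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical, simplified description of an engine's attributes.
Machine = Dict[str, Any]


def match(
    machines: List[Machine],
    requirements: Callable[[Machine], bool],
    rank: Callable[[Machine], float],
) -> Optional[str]:
    """Return the name of the preferred engine, or None if none qualifies."""
    # Requirements: keep only engines for which the expression is true.
    candidates = [m for m in machines if requirements(m)]
    if not candidates:
        return None
    # Rank: among the survivors, prefer the highest numeric value.
    return max(candidates, key=rank)["name"]


# Made-up engines for illustration only.
machines = [
    {"name": "engine-1", "os": "LINUX", "memory_mb": 2048},
    {"name": "engine-2", "os": "WINDOWS", "memory_mb": 4096},
    {"name": "engine-3", "os": "LINUX", "memory_mb": 4096},
]

needs_linux = lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 1024
prefer_memory = lambda m: m["memory_mb"]

best = match(machines, needs_linux, prefer_memory)
print(best)  # engine-3
```

Testing "various combinations" of rules then amounts to varying the two expressions and checking which engine, if any, is selected.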


Scheduling Latency

Definition: the interval between the initial request and when the first engine starts working on your task

Applications may be designed with a given scheduling latency in mind
– We can control how long our code takes…we cannot control the scheduling latency
– Redevelopment is often a major undertaking

We were expecting a very short (100 ms) deterministic scheduling latency
– Condor’s is much longer (1 min or more) and nondeterministic
– Condor does have an alternative (COD, Computing on Demand) but it changes the expected behavior of the grid

Impact on testing: new set of questions!
– “Does Condor’s scheduling latency present a problem for our applications?”
– “Do we have applications that were not developed with assumptions about the scheduling latency?”
– “Are there other aspects of Condor’s performance that offset the scheduling latency concerns?”
– “Can we measure the performance of our applications on Condor without regard to scheduling latency?”
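The definition of scheduling latency can be written down directly. The sketch below is one way to compute it from timestamps: the earliest engine start time minus the request time. The function name and the sample timestamps are assumptions made for illustration; they are not measurements from the UBS tests.

```python
from datetime import datetime, timedelta
from typing import List


def scheduling_latency(requested_at: datetime,
                       task_started_at: List[datetime]) -> timedelta:
    """Interval from the initial request to the first engine starting work."""
    if not task_started_at:
        raise ValueError("no task has started yet")
    # Only the earliest start matters for the definition used here.
    return min(task_started_at) - requested_at


# Made-up timestamps: three engines pick up tasks 75 s, 62 s, and 90 s
# after the request was made.
requested = datetime(2006, 4, 26, 9, 0, 0)
starts = [requested + timedelta(seconds=s) for s in (75, 62, 90)]

latency = scheduling_latency(requested, starts)
print(latency.total_seconds())  # 62.0
```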


SECTION 3

The Result: Condor as a Functional Benchmark


What We Love About Condor

Too many to list…here are the top four:

Incredibly powerful expression-based scheduling policy

No-impact desktop cycle scavenging

Easy reconfiguration

Anything that can be run from a command line can be a task

But, Condor has limits too…


What Condor Needs to Better Support UBS

We found issues in four key areas:

Administrative interface

Code deployment

Scheduling latency

Job submission APIs

Important: remember that these conclusions are only relevant to UBS!

This is only what we found, based on our context…your mileage may vary


Administration Interface

Our conclusions:

What we expected:
– A nice GUI admin console similar to others our operations personnel are familiar with

What we found:
– A rich command-line administration interface, but no GUI

Our conclusion:
– At UBS, Condor will not be used by operations teams that cannot accept a command-line admin interface
– These are usually Windows teams…Unix teams don’t seem to have as much bias

What this means for the Condor community:
– A GUI admin console will make Condor more acceptable to enterprise users
– Web-based is best
– Doesn’t have to be fancy…just needs to be point & click (and stable, of course)
– Work being done at Indiana University on a Condor portal is a start


Code Deployment

Our conclusions:

What we expected:
– Automatic task code deployment done once and refreshed automatically when the grid system senses a change in a central repository

What we found:
– Automatic task code deployment every time a job is submitted

Our conclusion:
– At UBS, Condor causes problems for applications with huge (15Mb+) task code and short tasks because the network transmission time impacts job completion time

What this means for the Condor community:
– To make Condor more acceptable to enterprise users, task code should be cached at the engines and only refreshed when it changes
– Fortunately, this is being worked on by the Condor Project!
– We’ve watched commercial grid vendors implement this…it is not an easy feature!
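The "cache at the engines, refresh only on change" behavior we were hoping for can be sketched as a content-addressed cache. This is a minimal illustration under stated assumptions, not how Condor or any commercial grid product implements it: `TaskCodeCache`, `digest_of`, and the `fetch` callback are hypothetical names, and a real engine would also need eviction, integrity checks, and concurrency control.

```python
import hashlib
from typing import Callable, Dict


def digest_of(code: bytes) -> str:
    """Identify a version of the task code by its SHA-256 digest."""
    return hashlib.sha256(code).hexdigest()


class TaskCodeCache:
    """Engine-side cache: transfer task code only when its digest is new."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.transfers = 0  # counts simulated network transfers

    def get(self, digest: str, fetch: Callable[[], bytes]) -> bytes:
        if digest not in self._store:
            # Cache miss: the code is new or has changed, so fetch it once.
            self._store[digest] = fetch()
            self.transfers += 1
        return self._store[digest]


cache = TaskCodeCache()
v1 = b"task code, version 1"
v2 = b"task code, version 2"

# Three jobs with unchanged code, then one job after the code changes.
for code in (v1, v1, v1, v2):
    cache.get(digest_of(code), lambda code=code: code)

print(cache.transfers)  # 2
```

With deployment on every submission, the same four jobs would have cost four transfers instead of two.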


Scheduling Latency

Our conclusions:

What we expected:
– Negligibly small latency that’s deterministic enough for us to predict job completion times

What we found:
– Latencies that depend on configuration settings and complexity of classads

Our conclusion:
– At UBS, Condor cannot be used for tasks that require less than 3.5 min to complete or where the total job completion time must be easily predictable
– However, even though our highest-value applications require short deterministic scheduling latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency
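The conclusion on this slide works as a simple screening rule for candidate applications. The sketch below encodes it: an application passes only if its tasks run for at least 3.5 minutes and it does not need easily predictable completion times. The `App` class and the sample applications are hypothetical; only the 3.5 min threshold comes from the slide.

```python
from dataclasses import dataclass

MIN_TASK_MINUTES = 3.5  # threshold stated in the conclusion


@dataclass
class App:
    name: str
    typical_task_minutes: float
    needs_predictable_completion: bool


def latency_tolerant(app: App) -> bool:
    """True if neither exclusion from the conclusion applies to the app."""
    long_enough = app.typical_task_minutes >= MIN_TASK_MINUTES
    return long_enough and not app.needs_predictable_completion


# Made-up applications for illustration only.
apps = [
    App("intraday-pricing", 0.5, True),
    App("overnight-risk-batch", 20.0, False),
    App("long-but-deadline-bound", 10.0, True),
]

print([a.name for a in apps if latency_tolerant(a)])
# ['overnight-risk-batch']
```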


Application Programmer’s Interface

Our conclusions:

What we expected:
– Nice, well-designed APIs for all our favorite languages

What we found:
– A command line interface and a maturing SOAP interface

Our conclusion:
– Once the SOAP interface matures, UBS programmers will be more amenable to using Condor

What this means for the Condor community:
– Full-speed ahead on the SOAP interface!
– Make sure all of the functionality available in the command-line interface is available in the SOAP interface
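Until a language-level API is available, an application typically integrates through the command line. As a rough sketch of what that looks like, the Python below builds a submit description file and hands it to `condor_submit`. The helper names, file names, and example executable are assumptions; the submit commands used (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `queue`) are standard, but this is an illustration, not a recommended integration layer.

```python
import subprocess
import tempfile


def build_submit_description(executable: str, arguments: str,
                             task_count: int) -> str:
    """Return the text of a submit description for task_count tasks."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        # $(Process) expands to the task's index within the job cluster.
        "output = task.$(Process).out",
        "error = task.$(Process).err",
        "log = job.log",
        f"queue {task_count}",
    ]
    return "\n".join(lines) + "\n"


def submit(description: str) -> str:
    """Write the description to a temp file and run condor_submit on it.

    Requires a working Condor installation; not called in this sketch.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(description)
        path = f.name
    result = subprocess.run(["condor_submit", path],
                            capture_output=True, text=True, check=True)
    return result.stdout


description = build_submit_description("/bin/echo", "hello", 3)
print(description)
```

Wrapping a CLI this way means parsing text output and exit codes, which is part of why a complete SOAP interface is attractive to application programmers.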


Condor at UBS

We will continue to use Condor for:

Teaching new teams how to grid their applications
– Condor is an excellent exploration and learning environment
– Has already accelerated at least one team

A functional benchmark for all things grid
– Condor is a crucible where new and innovative grid ideas get tried and refined
– Many of these features will prove valuable for commercial vendors to embrace
  – Check-pointing & task migration
  – Expression-based scheduling policy
  – User-centric cycle scavenging

Non-critical batch-oriented applications with standalone or SOAP-enabled service code, with operations teams that don’t mind a command line administration interface
– There are lots and lots of non-critical batch-oriented apps with standalone services
– There are not a lot of operations teams that will tolerate a command line interface…