June 2002
Capacity Planning for the Newer Workloads
Linwood Merritt
Capital One Services, Inc.
Disclaimer
• These generic issues are addressed by this presentation:
– Vendor capacity ratings
– e-Commerce
– Continuous availability
– Data warehousing
– Growth rates
• This presentation contains no specific business-related information.
Introduction: Environment
• Capital One
– 5th largest card issuer in the United States
– Added to the S&P 500 in 1998
– Fortune 500 company (#260)
– Managed loans at $48.6 billion as of Q1 2002
– Accounts at 46.6 million as of Q1 2002
– Fortune 100 “Best Places to Work in America”
– CIO 100 Award “Master of the Customer Connection”
– Information Week “Innovation 100” Award Winner
– ComputerWorld “Top 100 places to work in IT”
Outline of Approach
• Understand behavior and issues around workloads, hardware, and data
• Create projections and build recommendations.
• Report the findings.
Outline of Presentation
• Discussion of workload types and capacity projection approaches
• Overall summary of issues and approaches
• Examples
What Workloads?
• E-Commerce
• Relational database systems
• Mainframe-class UNIX
• Multiple platforms
• New characteristics
e-Commerce Workloads: Direct to Client (business-to-business)
• Access
– Internet
– Leased line
• Services
– Point of Care / Point of Sale
– Value-added analysis
e-Commerce Workloads: Direct to Customer
• Access
– Internet
– Dial-in
• Services
– Marketing
– Account query
e-Commerce Workloads: How to Predict
• Take business projections of volumes or users (include fudge factor)
• Estimate transaction volumes and CPU/transaction
• Convert to normalized unit such as MIPS
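The three prediction steps above can be sketched in a few lines of Python. This is only an illustration: the function name, the parameter names, and every input value (users, transactions per user, CPU seconds per transaction, MIPS per processor, and the 20% fudge factor) are hypothetical and do not come from the presentation.

```python
def estimate_mips(users, tx_per_user_per_hour, cpu_sec_per_tx,
                  mips_per_cpu, fudge_factor=1.2):
    """Turn a business volume projection into a normalized MIPS demand.

    All inputs are illustrative assumptions, not measured values.
    """
    # Step 1: take the business projection and pad it with a fudge factor.
    padded_users = users * fudge_factor
    # Step 2: estimate the transaction rate and the CPU cost per transaction.
    tx_per_second = padded_users * tx_per_user_per_hour / 3600.0
    # CPU-seconds consumed per elapsed second = number of busy processors.
    busy_cpus = tx_per_second * cpu_sec_per_tx
    # Step 3: convert busy processors to a normalized unit (MIPS).
    return busy_cpus * mips_per_cpu


if __name__ == "__main__":
    demand = estimate_mips(users=10_000, tx_per_user_per_hour=12,
                           cpu_sec_per_tx=0.05, mips_per_cpu=200)
    print(f"Estimated peak-hour demand: {demand:.0f} MIPS")
```

Keeping the fudge factor as an explicit parameter makes it easy to show the business both the padded and the unpadded estimate.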
Relational Databases
• Sub-second (OLTP), decision support / data mining
• Distributed gateways
• Database machines
• Redundant data with extracts
• How to predict: estimate a factor over current database demand or take usage estimates
Mainframe-Class UNIX
• Types: Mainframe USS or Linux, future UNIX vendor offerings
• Candidate applications
– Web server
– Vendor-ported applications
– User-ported / new applications
• How to predict:
– Estimate by timeframe
– Add factor to growth rates
Multiple Platforms
• Mainframe: plan like existing applications (#users, transactions * CPU/transaction, application look-alikes, sizing tools)
• Distributed: use vendor sizing, modeling tools, existing applications
• Network: use network simulation tools, rules-of-thumb, bandwidth calculations
New Characteristics
• External users
• Continuous availability
• New user interfaces
• Cross-platform
External Users
• Drive need for continuous availability
• Different access patterns (e.g., doctor’s office vs. call center)
• Service level measurement: harder to install an agent on external workstations
Continuous Availability
• Driven by external users
• 24x7 schedule
– Application redesign
– Data Sharing: CPU overhead
– Coupling Facility
– Expansion of “prime shift”
• 99.999% “up time”
– Redundancy, overhead
– Availability reporting
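To make the 99.999% target concrete, the short Python sketch below converts an availability percentage into an annual downtime budget and, for availability reporting, converts logged outage minutes back into a percentage. The function names are hypothetical, and a 365-day year with a full 24x7 schedule is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_pct, scheduled_minutes=MINUTES_PER_YEAR):
    """Downtime budget implied by an availability target."""
    return scheduled_minutes * (1 - availability_pct / 100.0)


def measured_availability_pct(outage_minutes, scheduled_minutes=MINUTES_PER_YEAR):
    """Availability percentage to report, given total logged outage minutes."""
    return 100.0 * (1 - outage_minutes / scheduled_minutes)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% allows about {budget:.1f} minutes of downtime per year")
```

At five nines the budget is a little over five minutes a year, which is why the slide pairs the target with redundancy and its overhead.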
User Interfaces
• TCP/IP - no “definite response” (end-to-end response time measurement)
• Multiple internal transactions per “mouse click”
• Response time measurement:
– Agent on workstations
– Scripting from “robots”
Cross Platform Applications
• Only unified view: simulation package
• Each platform (“silo”) can be analyzed separately.
• Different application development groups
• May be able to cross-validate user numbers
Types of Implementation (1)
• Standalone / “shrink-wrap”
• Layered onto legacy applications
– New mainframe application code
– GUI front-end
– Browser
– Middle-tier (Unix or NT)
– MQSeries - can add middle-tier and new mainframe applications
Types of Implementation (2)
• Legacy extracts
• Re-engineered legacy applications
– Convergence of business rules / applications
– Re-usable components
– Redundant access
– Salvage investment, fix Band-Aids
– Simplify logic, reduce platform complexity
What Are We Analyzing? (Mainframe)
• MIPS - growth, latent demand, software cost
• Memory - track and watch 2 GB limit on central storage (goes away with 64-bit)
• I/O - channels, gigabytes of disk, tape
• Coupling Facility - Parallel Sysplex, Shared Data, continuous availability
• Vendor upgrade paths
• New partitions
What Are We Analyzing? (Distributed)
• Number and types of platforms
• CPU, memory, disk space
• Bandwidth
• Location of applications / processes
• Platform limitations (CPU, memory)
• Software pricing considerations
• Porting opportunities
Measurement of New Workloads
• Summarize by platform:
– Workload rules (process or user names)
– Processes by descending CPU%
• Resources: CPU, memory, disk space, Coupling Facility, network traffic
• Growth:
– Resources/user/application
– Number of users + application changes
Distributed Approach
• Consider tiers of service (not currently at Capital One)
• Address service level measurement issue
• Implement reporting
• Add to Capacity Plan
• “Silo” vs. “Application”
Tiers of Service: “Platinum”
• Most expensive
• Modeling product
• Install in one server for each major application, use collection product for other servers
Tiers of Service: “Gold”
• Collection product
• Capacity planning with Rules of Thumb
Tiers of Service: “Brass”
• Least expensive (man-hours only)
• “Native”
– Unix scripts
– NT PerfMon
Service Level Measurement
• API call at workstation - “Application Response Measurement” (ARM) or Windows 2000 trace API calls
• Agents: software tracing of Windows API calls - can be installed in a subset of end-user base (sampling)
• Scripting (“robots”)
• Stop watch sampling and logging
Distributed Reporting
Add to Capacity Plan
Scope of Analysis
• Silos
– Look at each hardware/application environment independently.
• Applications
– Look at each application as a whole.
– Application instrumentation
– Inference: put platform silos together.
Analyzing the Data: Growth Rates
• General list of business plans
• List of technical scenarios
• Timeline
• Estimate median and maximum likely MIPS/CPU/users/business units
• Derive scenario growth rates
Analyzing the Data: Additional Resources
• Parallel Sysplex (Coupling Facility): important for continuous availability, level set functionality
• Disk / channels / tape: disk megabytes, channel maximum, tape connectivity
• Communications connectivity: new partitions for availability
• Memory: 2 GB constraint, 64-bit
Growth
• “Baseline” growth
• “Scenario” growth
• Independent events (merger/acquisition, potential major project)
Example 1: Mainframe Upgrade
• Task force, led by Capacity Planner
• Driven by expiring three-year lease (CPU replacement, three-year planning horizon)
• “Vendor parade” - presentations and dialogues
– Upgrade paths
– Technology / service differences
– References / site visits
– Capacity sizing: MIPS charts, LSPR / sizing tools
Mainframe Upgrade Deliverables
• Document
– Business drivers and technical scenarios
– Growth forecasts
– Vendor options and growth paths
– Coupling Facility / Parallel Sysplex
• Evaluation
– Difference thresholds: MIPS claims, price/MIPS, ICF
– Differentiators
Business and Technical
Business Drivers
• Cost management
• External business
• Improved data access
• Business expansion
Technical Scenarios
• Consolidation of distributed servers
• Continuous availability
• Significant external business
• Data Warehousing
• Acquisition/merger
Projections
• Make educated guess by timeframe for each scenario
• Add to “baseline” growth
• Convert to growth rate
• Use both “baseline” and “scenario” growth
• Compare maximum scenario growth to maximum for platform family
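The projection steps above can be sketched as follows. The starting MIPS, the 25% baseline rate, the per-period scenario increments, and the platform-family maximum are all made-up numbers for illustration; only the procedure (add scenario guesses to baseline, convert to a growth rate, compare the peak to the platform maximum) follows the slide.

```python
def baseline_series(start, annual_rate, periods):
    """Compound the starting demand at the baseline growth rate."""
    return [start * (1 + annual_rate) ** p for p in range(periods + 1)]


def implied_growth_rate(first, last, periods):
    """Compound growth rate that takes `first` to `last` over `periods`."""
    return (last / first) ** (1.0 / periods) - 1


if __name__ == "__main__":
    # Hypothetical inputs: 1000 MIPS today, 25% baseline growth, 3 periods.
    baseline = baseline_series(1000.0, 0.25, 3)
    # Educated guesses of extra MIPS per period for the scenarios (assumed).
    scenario_adds = [0.0, 50.0, 120.0, 300.0]
    scenario = [b + s for b, s in zip(baseline, scenario_adds)]

    base_rate = implied_growth_rate(baseline[0], baseline[-1], 3)
    scen_rate = implied_growth_rate(scenario[0], scenario[-1], 3)
    print(f"Baseline growth rate: {base_rate:.1%}")
    print(f"Scenario growth rate: {scen_rate:.1%}")

    platform_max_mips = 2500.0  # assumed top of the platform family
    print("Scenario peak fits platform family:",
          max(scenario) <= platform_max_mips)
```

Reporting both rates brackets the forecast: the baseline rate is the lower bound and the scenario rate is the upper bound.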
Impact Analysis
Wk1 Wk2 Wk3 Wk4 Total @85%' Memory ESCON Ind1 Other Ind Total w /ind @85%' Memory ESCON
Period1 2.6 772.3 6.0 112 86.5 341.6 6.0 112
Period2 50.0 7.3 913.1 1074.2 7.1 123 90.8 382.5 1386.4 1631.1 10.8 186
Period3 54.8 20.4 1027.9 1209.3 8.0 134 95.1 428.8 1551.8 1825.6 12.1 203
Period4 135.0 31.0 1234.0 1451.8 9.6 147 99.8 484.1 1817.9 2138.7 14.1 217
Period5 147.9 47.0 200 1594.9 1876.3 12.4 161 104.7 547.65 2247.2 2643.8 17.5 227
Period6 162.0 50 71.2 236.6 1877.8 2209.2 14.6 177 109.8 624.1 2611.7 3072.6 20.3 246
Period7 177.5 54.8 108.0 280.0 2161.2 2542.6 16.8 194 115.1 712.95 2989.3 3516.8 23.2 268
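The “@85%” columns in the table above appear to be the corresponding totals divided by 0.85, that is, the capacity needed for the projected demand to run at 85% utilization, and the “Total w/ind” column appears to be the total plus the two independent-event columns. The sketch below checks that reading against the Period2 row; the function name is hypothetical and the interpretation is an inference from the numbers, not something the slide states.

```python
def capacity_at_target(demand_mips, target_utilization=0.85):
    """Installed capacity needed so that demand runs at the target utilization."""
    return demand_mips / target_utilization


if __name__ == "__main__":
    # Period2 values copied from the table.
    total = 913.1
    ind1, other_ind = 90.8, 382.5

    total_with_ind = total + ind1 + other_ind
    print(round(capacity_at_target(total), 1))           # table shows 1074.2
    print(round(total_with_ind, 1))                      # table shows 1386.4
    print(round(capacity_at_target(total_with_ind), 1))  # table shows 1631.1
```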
Scenario Timeline
Period1
Period2
Period3
Period4
Period5
Period6
Period7
First mainframe Wk1 Application
24x7 operation
First Parallel Sysplex exploitation
Initial muck exploitation with 250 Users
(Potential acquisition)
New DB2 functionality exploitation
Full Data Sharing exploitation (IMS, CICS, DB2)
Full subsystem redundancy (IMS, CICS, DB2)
Major Project A with 100 users, 150% CAGR
64-bit OS/390
Vendor Upgrade Paths: Detail
• Use logarithms: Start * CAGR^x = Threshold, so
x years = log(Threshold/Start) / log(CAGR)
Model     MIPS   MSU   +40%/Yr   +25%/Yr
GS2068E    952   160   Aug-00    Sep-00
GS2074E   1013   171   Oct-00    Dec-00
GS2084E   1141   193   Apr-01    Jul-01
GS2094E   1260   213   Sep-01    Dec-01
GS2104E   1378   234   Nov-01    May-02
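The logarithm formula can be tried out with a few lines of Python. In this sketch the growth is passed as a rate (0.40 for +40%/yr) and converted to a factor inside the function, which assumes that CAGR in the slide's formula means the growth factor (1.40); the function name is hypothetical. The example uses the first two rows of the table, 952 and 1013 MIPS.

```python
import math


def years_to_threshold(start, threshold, annual_rate):
    """Solve start * (1 + annual_rate)**x = threshold for x (in years)."""
    return math.log(threshold / start) / math.log(1 + annual_rate)


if __name__ == "__main__":
    # Time for demand to grow from one model's MIPS rating to the next.
    for rate in (0.40, 0.25):
        years = years_to_threshold(952, 1013, rate)
        print(f"+{rate:.0%}/yr: {years:.2f} years ({years * 12:.1f} months)")
```

The results, roughly two months at +40%/yr and roughly three months at +25%/yr, match the spacing of the first two rows in the date columns.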
Vendor Upgrade Paths: Summary
Period    Start MIPS  Model     MIPS  MSU   Model       MIPS  MSU   Model     MIPS  MSU
Period 1   909        Tri4009    978  171   GS2068E      952  160   9672-X77   984  169
Period 2   988        Tri5009   1195  210   GS2074E     1013  171   9672-X87  1089  188
Period 3  1074        -            -    -   GS2084E     1141  193   9672-X97  1186  205
Period 4  1140        Tri6009   1401  247   GS2094E     1260  213   9672-XX7  1277  221
Period 5  1209        -            -    -   GS2104E     1378  234   9672-XY7  1362  235
Period 6  1325        Tri7009   1597  282   GS2114E     1498  255   9672-XZ7  1441  248
Period 7  1452        Tri8009   1784  315   GS2128E     1650  283   Freeway   1784  315
Period 8  1650        Tri9009   1962  346   GS2154E     1894  325   -         1962  346
Period 9  1876        TriA009   2130  374   2100Series  2130  374   -         2130  374
Period A  2036        TriB009   2290  403   -           2290  403   -         2290  403
Period B  2209        TriC009   2441  427   -           2441  427   -         2441  427
Period C  2370        TriD009   2584  451   -           2584  451   -         2584  451
Period D  2543        TriE009   2720  475   -           2720  475   -         2720  475
Upgrade Document
Example 2: UNIX Modeling
• Modeling product installed on MQSeries server
• Application running with a known number of users
• Projected rollout schedule used to drive model
• Mainframe side: CICS application, IMS load
UNIX Platform Workloads
• Two primary workloads:
– MQSeries userids (mqm*) - memory intensive
– Messaging application processes (MDA*) - “CPU intensive”
Workload Modeling Methodology
• MQSeries - Calculate relative workload intensity, enter model ratio.
• Messaging application processes - Keep constant until application is removed from platform (“design loop” - always uses 1 CPU). Must adjust across CPU upgrade to continue using 1 CPU.
Track Across Upgrade
CPU Upgrade
Model Spreadsheet
Timeframe    #Users  #Msging  Ratio MQ  Ratio MDA  %CPU  Resp  Memory
Baseline        100      100      1.00       1.00  63.5  1        1.4
+1st Event      180      100      1.80       1.00  65.5  1.04     1.7
+2nd Event      212      100      2.12       1.00  66.4  1.06     2
+3rd Event      362      100      3.62       1.00  70.3  1.17     2.5
+4th Event      512        0      5.12       0.00  24.0  0.66     3.1
+5th Event     1012        0     10.12       0.00  37.0  0.71     5.3
Model Presentation
Timeframe: April 2000
#Users: 180, 100
Ratios: 1.27, 1.00
Config: F50/02, 2GB
Comment: Add Event1 Users
Validation - Tracking Users (on mainframe)
//ECLUSRS  EXEC SASV8,REGION=0M
//ECLD1    DD DSN=XYZ.PRD.A.AAAPRD.I.VOLFIL,DISP=SHR
//ECLDPDB  DD DSN=CAPLAN.PRD.ECLDPDB,DISP=OLD
//SYSIN    DD *,DLM=@@
  /* Read the configuration record that carries the user counts */
data ecld1;
  format date date.;
  format dt datetime.;
  INFILE ECLD1 MISSOVER;
  INPUT @1 RECNUM $CHAR5. @6 RECTYPE $CHAR8. @14 USERCT $CHAR5. @19 USERMAX $CHAR5.;
  if recnum =: '99999' and rectype =: 'TCSCONFG';
  /* Stamp the sample with the current date and hour */
  dt = datetime();
  date = datepart(dt);
  hour = hour(dt);
  /* Apply the sample to the ecldpdb.users data set, keyed by date and hour */
data ecldpdb.users;
  update ecldpdb.users ecld1;
  by date hour;
proc print;
  title 'Ecloud1 Users';
Example 3: Server Replacement
• Project: replace “old” NT servers
• Application: Imaging servers
• Capacity sizing data:
– Rules-of-thumb analysis by vendor, using projected claims/minute and processor clock speeds
– Benchmark information
Server Replacement Process
• Multiple servers: each server is a workload, must be sized separately.
• Enumerate and measure servers.
• Apply growth rates and determine processing power requirements for the replacements.
• Research available configurations and order appropriate server configurations.
• Track CPU utilization across the upgrades.
• Update relative capacity specs for next upgrade.
Server Sizing
• Find (or derive) benchmark capacity ratings for starting and replacement configurations.
• Apply an estimate of current CPU utilization, a growth percentage, and a “peak/average” and performance buffer (+100% for this study).
• Output: estimated percentages of a standard configuration. The number of estimated CPUs needed (23) came very close to the vendor’s original number of 24.
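A minimal Python sketch of the sizing arithmetic described above follows. It multiplies a server's benchmark rating by its CPU utilization, a growth factor, and the buffer, then expresses the result as a fraction of a standard replacement configuration. The growth factor (1.5), the standard configuration's rating (25,000) and CPU count (4), and the available CPU counts (1, 2, 4) are hypothetical placeholders, as are the function names; only the 5158 rating, 30% utilization, and 200% buffer are taken from the study.

```python
def needed_rating(benchmark, cpu_util, growth, buffer):
    """Benchmark capacity the replacement must deliver for this server."""
    return benchmark * cpu_util * growth * buffer


def size_replacement(needed, standard_rating, standard_cpus,
                     config_sizes=(1, 2, 4)):
    """Return (fraction of the standard config, CPUs to order).

    The CPU count is rounded up to the next available configuration size.
    """
    fraction = needed / standard_rating
    raw_cpus = fraction * standard_cpus
    for size in sorted(config_sizes):
        if size >= raw_cpus:
            return fraction, size
    raise ValueError("workload exceeds the largest available configuration")


if __name__ == "__main__":
    # 200% buffer = +100% for peak/average and performance headroom.
    need = needed_rating(benchmark=5158, cpu_util=0.30, growth=1.5, buffer=2.0)
    fraction, cpus = size_replacement(need, standard_rating=25_000,
                                      standard_cpus=4)
    print(f"Needed rating {need:.0f}, {fraction:.0%} of standard, order {cpus} CPU(s)")
```

Summing the CPU counts over all servers gives the total to compare against the vendor's estimate.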
Sizing Spreadsheet
Servers
Name        Processor            #CPUs  Memory  TPC-C    CPU%  Growth  Buffer  Needed  % of 6400  #CPUs
NTServer1   PentiumPro 200 MHz   2      128M    5158     30%   xxx%    200%    5261    21%        1
NTServer2   PentiumPro 200 MHz   2      128M    5158     30%   xxx%    200%    5261    21%        1
NTServer3   PentiumPro 200 MHz   2      128M    5158     30%   xxx%    200%    5261    21%        1
NTServer4   PentiumPro 200 MHz   2      128M    5158     30%   xxx%    200%    5261    21%        1
NTServer6   PentiumPro 200 MHz   2      128M    5158     30%   xxx%    200%    5261    21%        1
NTServer7   PentiumPro 200 MHz   2      128M    5158     15%   xxx%    200%    2631    10%        1
NTServer8   PentiumPro 200 MHz   2      128M    5158     15%   xxx%    200%    2631    10%        1
NTServer9   PentiumPro 200 MHz   2      128M    5158     15%   xxx%    200%    2631    10%        1
NTServer10  PentiumPro 200 MHz   2      128M    5158     15%   xxx%    200%    2631    10%        1
NTServer11  PentiumPro 200 MHz   2      128M    5158     20%   xxx%    200%    3507    14%        1
NTServer12  PentiumPro 200 MHz   2      128M    5158     20%   xxx%    200%    3507    14%        1
NTServer13  Pentium III 500 MHz  1      256M    6859.1   20%   xxx%    200%    4664    19%        1
NTServer14  Pentium III 500 MHz  2      256M    12895.1  30%   xxx%    200%    13153   52%        4
NTServer15  Pentium III 500 MHz  2      256M    12895.1  30%   xxx%    200%    13153   52%        4
NTServer16  PentiumPro 200 MHz   2      128M    5158     50%   xxx%    200%    8769    35%        2
NTServer17  Pentium III 500 MHz  1      128M    6859.1   20%   xxx%    200%    4664    19%        1
Total #CPUs                                                                                       23
Example 4: Hundreds of Servers
• Data capture
• Reporting
• Business drivers
Data Capture
• Time-based scheduling product
• Script-based data “pull”
• Issue: data loss, time to find and rebuild
• Potential fixes:– Product– Data “push” from servers
Data Reporting, Analysis
• Color-based “health index” (Concord NetHealth metric).
• Statistical Analysis (over two standard deviations from mean)
• Thumbnail drilldown graphs
• Automatic generation of HTML
• “Treemap” graphs
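The statistical test in the list above (flag anything more than two standard deviations from the mean) can be sketched per server with Python's standard library. The utilization history and the function name are hypothetical; the presentation does not say which statistic or history window was actually used.

```python
from statistics import mean, stdev


def is_exception(history, latest, k=2.0):
    """True when `latest` is more than k sample standard deviations
    away from the mean of the server's own history."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(latest - mu) > k * sigma


if __name__ == "__main__":
    # Hypothetical daily peak CPU% for one server (mean 40, std dev 2).
    cpu_history = [40, 42, 38, 41, 39, 40, 43, 37]
    for today in (43, 45):
        status = "EXCEPTION" if is_exception(cpu_history, today) else "normal"
        print(f"CPU {today}%: {status}")
```

Comparing each server to its own history, rather than to a fixed threshold, keeps the exception list short when there are hundreds of servers with different normal levels.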
Health Index *
* Concord NetHealth metric
Statistical Process Control
Thumbnail HTML
Automatic Generation of HTML
• Driven by “matrix”
– Originally spreadsheet
– Converted to relational database
– Ultimate capacity planning solution: information by server, application, platform, business driver
• SAS code - builds web pages and hyperlinks
Treemap
Paper by Ben Shneiderman, University of Maryland, http://www.cs.umd.edu/hcil/treemaps
Business Drivers
• Capacity Councils - business units responsible for capacity planning of “demand” side
• Capacity Planners - build projections based on business drivers and historical trending
Business Driver Based Forecasts
(Diagram: Server, Applications, Business Drivers, Projections)
Regression Analysis
Input = CPU and Business Drivers (Widgets, Gadgets, Customers) by month
Output = Coefficients (f1, f2, f3)
By month (input = Widgets, Gadgets, Customers):
projection = Widgets*f1 + Gadgets*f2 + Customers*f3;
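One way to obtain the coefficients is an ordinary least-squares fit with no intercept, matching the form of the projection formula. The pure-Python sketch below solves the normal equations with a small Gauss-Jordan routine; the monthly driver values and CPU figures are synthetic, the function names are hypothetical, and the presentation does not say which regression procedure was actually used.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("drivers are collinear; coefficients are not unique")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_coefficients(drivers, cpu):
    """Least-squares coefficients for cpu ~ sum(f_i * driver_i), no intercept."""
    k = len(drivers[0])
    xtx = [[sum(row[i] * row[j] for row in drivers) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(drivers, cpu)) for i in range(k)]
    return solve_linear(xtx, xty)


def project(coeffs, driver_row):
    """projection = Widgets*f1 + Gadgets*f2 + Customers*f3"""
    return sum(f * x for f, x in zip(coeffs, driver_row))


if __name__ == "__main__":
    # Synthetic monthly data: (widgets, gadgets, customers), in thousands.
    drivers = [(10, 2.0, 50), (11, 2.1, 52), (12, 2.5, 53),
               (12.5, 2.4, 56), (14, 3.0, 57)]
    cpu = [80.0, 84.5, 89.5, 93.0, 100.0]  # synthetic CPU demand by month
    f1, f2, f3 = fit_coefficients(drivers, cpu)
    print(f"f1={f1:.3f} f2={f2:.3f} f3={f3:.3f}")
    print(f"Projected CPU: {project((f1, f2, f3), (15, 3.2, 60)):.1f}")
```

With real data the business drivers often move together, so it is worth checking the fit against held-back months before trusting the projection.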
Graphical Output
Widgets Gadgets Customers
Enterprise “Capacity at a Glance”
The following table summarizes data processing resources that are candidates for capacity reporting and planning. Note that some "capacity" fields are blank. These fields will be filled in as more information is gathered during the ongoing capacity planni
Use | Type | #Footprints | 4Q2000 Capacity | Units | 4Q2001 Capacity | 2001+ Growth | Contact
Mainframe "Legacy" Applications (claims, membership, decision support, etc.) | OS/390, Amdahl GS2068E, 6 physical processors + 1 Integrated Coupling Facility processor | 1 | 966 | MIPS | 1200 | 25% | Merritt
Mainframe Storage | Memory | - | 7.00 | Gigabytes | - | 25% | Blow
Mainframe Channels | Channels | - | 52 parallel, 124 ESCON | Channels | - | 25% | Blow
Mainframe DASD | Disk | - | 3.90 | Terabytes | 4.6 | 30% | Smith
Mainframe tape silos | STK Silo | - | 4.00 | Silos | - | 25% | Jones
Imaging Servers | NT | 35 | - | - | - | 25% | Larry
File Servers | NT | 23 | - | - | - | 40% | Moe
Application Servers | NT | 38 | - | - | - | 40% | Curley
Database servers | NT | 4 | - | - | - | 40% | Curley
Infrastructure servers | NT | 24 | - | - | - | 5% | Shemp
Print servers | NT | 30 | - | - | - | 5% | Shemp
Summary: Issues
• Access patterns and schedules
• Platforms (more types and numbers)
• Resources (what to track)
• Levels of capacity management
• Reporting of utilization and service levels, for large numbers of platforms
• Higher availability (redundancy, reporting)
• Deriving and reporting projections
Summary: Deriving Projections
• Basic capacity planning:
– Growth rates
– Upgrade thresholds
• Aggressive estimate of “scenario” demand
• Bracket growth:
– Lower end: “baseline”
– Upper end: “scenarios”
Summary: Types of Projections
• Number of transactions
• Number of users
• Number of platforms
• Application sizing input
• Application complexity
• Fraction of an existing workload
• Growth rate
Summary: Capacity Planning
• Projections based on application and platform
• Levels of capacity planning service
• Report on all enterprise resources
• Organize data with “matrix” database