Bottlenecks exposed


Page 1: Bottlenecks exposed

Bottlenecks Exposed: The Most Frequently Found Performance Problems – and How to Nail Them!

Dan Downing, VP Testing Services

MENTORA
Atlanta • Boston • DC • San Jose
404.250.6515 • www.mentora.com

Bottlenecks Exposed – Title Slide
Web Application
Copyright Mentora 2001

Page 2: Bottlenecks exposed

Objectives

• Identify common website performance bottlenecks:
  – Source (what component they occur on)
  – Symptom (how you know there’s a problem)
  – Causes (what creates the problem)
  – Measurements (how to nail it)
  – Cures (how to make it go away)
• Illustrate with examples of B2C, B2B, B2E cases

Audience: Performance Engineer, Load Testing Expert, with intermediate experience

Page 3: Bottlenecks exposed

Terms & Concepts

• Application Performance Testing: A repeatable methodology for volume-simulation of real-world applications in a customer’s environment to yield performance results that can be implemented to deliver efficient utilization of computing resources.

• Scalability: The demonstrated ability (or lack thereof) of a system (or component) to yield the same response time of a business process irrespective of the magnitude of the load applied to the system.

• Bottleneck: A hardware component, process, or software element of the system-under-test that causes performance degradation and low scalability under load.

• Resource Utilization: The quantification of a shared computing resource being consumed by an application process or component.

• Symptom: The outwardly visible but unquantifiable effect of a performance bottleneck.

• Cause: The specific and measurable factor yielding one or more symptoms.

• Cure: The specific action applied to the Cause that will measurably improve the visible symptom.

• Measurement: A numeric value of a performance-affecting factor that can be quantified by a monitoring tool and related to a specific component of the system-under-test.
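The Scalability definition above can be turned into a simple check. A minimal Python sketch, assuming mean response times have been measured at several load levels; the function names, the 25% tolerance, and the sample numbers are illustrative assumptions, not values from the presentation:

```python
def scalability_report(response_times, tolerance=0.25):
    """response_times maps concurrent-user load -> mean response time (seconds).

    Returns (load, response_time, ratio_to_baseline, within_tolerance) rows,
    where the baseline is the response time at the lowest measured load.
    """
    loads = sorted(response_times)
    baseline = response_times[loads[0]]
    rows = []
    for load in loads:
        ratio = response_times[load] / baseline
        rows.append((load, response_times[load], ratio, ratio <= 1.0 + tolerance))
    return rows


def first_degrading_load(response_times, tolerance=0.25):
    """Lowest load whose response time exceeds the baseline by more than tolerance."""
    for load, _, _, within in scalability_report(response_times, tolerance):
        if not within:
            return load
    return None


if __name__ == "__main__":
    # Hypothetical measurements, not data from the presentation.
    samples = {10: 1.0, 50: 1.1, 100: 1.2, 200: 2.4}
    for row in scalability_report(samples):
        print(row)
    print("first degrading load:", first_degrading_load(samples))
```

A perfectly scalable process would keep every ratio near 1.0; the load at which the ratio first leaves the tolerance band is where bottleneck hunting starts.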

Page 4: Bottlenecks exposed

Symptoms

• “It’s Too Slow”
  – As perceived from slow browser response by functional testers
  – As measured by poor scalability during first low-load test
  – As experienced (too late!) by low productivity by real production users
• “It’s broken”
  – Page ‘never returns’ after button press
  – Web server errors (404, 500…)
  – Application error messages in application logs

Symptoms are usually very unspecific!

Page 5: Bottlenecks exposed

3-Tier Environment

• Network
  – Firewall, load balancer, routers, network interface cards, cabling between all components
• Web Server Tier
  – One or more (usually many) low capacity computers that receive, route, and display results of http requests from visitors’ browsers
• Application Server Tier
  – One or more (often 2) medium-high capacity computers that receive, apply business logic to, and return to the web server the results of the http request
• Database Server Tier
  – One or more (usually one with redundant stand-by) high capacity computers that operate database software, and access the database (often on large disk arrays) for servicing user data requests

[Diagram: Web Server (Sun E220), App Server (Sun E420), DB Server (Sun E4500, Oracle)]

Page 6: Bottlenecks exposed

Performance Bottleneck Sources

[Diagram: Network, Web Server, App Server, DB Server]

What in your experience* do you find as the relative distribution of bottlenecks?

How often?    Ntwk    Web Srvr
>30%          30%     16%
21-30%        16%     12%
11-20%        25%     40%
<10%          27%     29%

How often?    App Srvr    DB Srvr
>60%          7%          9%
41-60%        21%         29%
21-40%        48%         32%
11-20%        11%         21%
<10%          11%         7%

* Poll results of 56 Mercury Conference ’01 attendees of intermediate to advanced experience.

Page 7: Bottlenecks exposed

Performance Bottleneck Sources

In my experience, it’s the application! (~80% of the time)

Component     My estimate    Highest poll range (% of respondents)
Network       8%             >30% (30%)
Web Server    12%            11-20% (40%)
App Server    35%            21-40% (48%)
DB Server     45%            21-40% (32%)

Most of the application code resides here…

- % distribution is a SWAG based on experience testing dozens of apps

Highest ranges from poll (shown in color on the slide)

Page 8: Bottlenecks exposed

Database (Simple) Anatomy

[Diagram: the App Server (e.g. Sun 420), through its DB Connection Pool, sends SQL to the DB Server (e.g. Sun 4500 quad cpu 2 GB memory) and receives Data back. DB Server components shown: Client Comm Buffer, Query Parser, Query Optimizer, Query Plan, Query Executor, Shared Memory, Metadata cache, Data Cache, Write Buffer. Storage shown: Disk Array (e.g. Sun A10000) holding Data and Log.]

Page 9: Bottlenecks exposed

Key DB Server Measurements

Measurement: Impact/Range

• Server CPU: Shows raw horsepower consumption on the server; should average 70-80%; else add cpus!

• Server Memory: Memory available should stay constant and average below 70-80%; else add memory.

• Server Page Faults/s.: Should be low and constant, else yields virtual memory disk IO, which indicates insufficient memory allocated to DB processes.

• Cache Hit Ratio: Should be hi (90-95% range); else data cache is sized too low and there is too much physical IO.

• Deadlocks: Should be zero at target loads; if not, indicates a transaction model design problem.

• Table scan blocks/sec: Should be low for normal transactions (can be high for reporting functions); else indicates that indexes are missing or poorly designed.

• Parse-to-execute ratio: Should be low (<20%); else could indicate under-sized query cache, old/no optimizer statistics, or flawed query model in app server function.

• Transactions/second: A general indicator of db load handling, and should be compared run-to-run.

• SQL*Net bytes rcvd/sent from/to client: A measure of the data-intensiveness of queries; read bytes should be <50% of sent bytes, else indicates complex application queries should become stored procedures.

• Open cursors: A measure of the number of open client queries; should be low, or could be an indicator of inefficient query model.

• Physical reads/writes: Correlates with cache-hit ratio; should decrease run-to-run as cache is tuned.

• Server I/O: Should be balanced across all drives, else indicates ‘db hot spot’ on large, hi-access tables, which need to be striped across multiple drives; avg 20% below disk IO saturation level.

• DB Memory: Should be ~80% of available user memory on Server, and should average < 75%; else, add!
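Several of the ranges above reduce to simple arithmetic on raw counters. A minimal Python sketch, assuming the common definition of cache hit ratio as 1 - physical reads / logical reads (the slide gives no formula) and using the slide's 90% and 20% limits; the function and parameter names are hypothetical, and the counters would come from whatever monitor is in use:

```python
def cache_hit_ratio(logical_reads, physical_reads):
    """Fraction of reads served from cache (assumed definition, see lead-in)."""
    if logical_reads <= 0:
        raise ValueError("logical_reads must be positive")
    return 1.0 - physical_reads / logical_reads


def parse_to_execute_ratio(parses, executes):
    """Parsed statements divided by executed statements."""
    if executes <= 0:
        raise ValueError("executes must be positive")
    return parses / executes


def check_db_counters(logical_reads, physical_reads, parses, executes, deadlocks):
    """Return a list of findings for counters outside the slide's ranges."""
    findings = []
    if cache_hit_ratio(logical_reads, physical_reads) < 0.90:
        findings.append("cache hit ratio below 90%: data cache may be too small")
    if parse_to_execute_ratio(parses, executes) >= 0.20:
        findings.append("parse-to-execute ratio at or above 20%: check query cache, statistics, query model")
    if deadlocks > 0:
        findings.append("deadlocks non-zero: review transaction model")
    return findings


if __name__ == "__main__":
    # Hypothetical counter values for illustration only.
    for finding in check_db_counters(100_000, 30_000, 500, 1_000, 2):
        print(finding)
```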

Page 10: Bottlenecks exposed

DB Server Causes & Cures

Cause: Inefficient SQL statement
Measurement: Slow page (>10 sec) which ties to a specific function, thus an SQL query; hi db cpu | IO
Cure: Analyze query plan, optimize query

Cause: Inefficient SQL query model
Measurement: Many slow pages; hi 'bytes recvd' by db server; low db cpu; or: many slow queries
Cure: Convert client SQL to stored procedures | optimize slow q’s

Cause: Inefficient DB configuration
Measurement: Low correlation btw DB and Server resource utilization; unbalanced I/O
Cure: Reconfigure DB (add memory, write processes, threads, …)

Cause: Overuse of row-at-a-time processing
Measurement: Hi open cursors; hi bytes sent from client
Cure: Tune query prepares in App server / code

Cause: Missing/ineffective indexes
Measurement: High table scan blocks; slow function
Cure: Find/add/fix table indexes

Cause: Query plan cache too small
Measurement: Hi parsed-to-executed queries ratio
Cure: Raise size of query plan cache

Cause: Inefficient concurrency model
Measurement: Hi blocked transactions, high table locks
Cure: Review/fix transaction logic; modify DB locking strategy

Cause: Data cache too small
Measurement: Low cache-hit ratio, hi physical reads
Cure: Increase cache size

Cause: Out-of-date statistics
Measurement: High table scan blocks; many slow functions
Cure: Rerun optimizer statistics

Cause: Deadlocks
Measurement: Deadlocks non-zero / errors in error log
Cure: Fix application transaction code

Cause: Other
Measurement: Inefficient access method; too many DB connections; small comm buffers;…
Cure: Pinpoint and correct!

Page 11: Bottlenecks exposed

Database Server Causes

• Inefficient SQL statement: 24%
• Inefficient SQL query model: 17%
• Inefficient DB configuration: 14%
• Hi row-at-a-time logic: 12%
• Missing indexes: 9%
• Inefficient concurrency model: 7%
• Query cache too small: 7%
• Data cache too small: 5%
• Other: 5%

~60% of the time it’s bad SQL or bad indexes!

Page 12: Bottlenecks exposed

Example: B2B Supply Chain Management

• Symptom:
  – Transactions that return list data running very slowly; they don’t scale
• Measurement: (using LR Oracle Monitor)
  – Hi table scan blocks
  – Low index fast full scans
• Cure:
  – Add additional indexes
  – Design indexes so queries can be resolved with index table columns w/o accessing base table
  – Enable fast scan Oracle parameter

[Diagram: Web Server Sun E220 (Apache), App Server Sun E420 (WebLogic), DB Server Sun E420 (Oracle)]
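The "resolve the query from index columns without touching the base table" cure can be illustrated outside Oracle. A minimal sketch using SQLite as a stand-in (chosen only because it ships with Python; the system on the slide was Oracle, and the table, column, and index names here are hypothetical). It prints the query plan before and after adding an index that contains every column the query needs:

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's EXPLAIN QUERY PLAN detail text for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(str(row[-1]) for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE order_lines ("
        "id INTEGER PRIMARY KEY, supplier TEXT, part TEXT, qty INTEGER)"
    )
    conn.executemany(
        "INSERT INTO order_lines (supplier, part, qty) VALUES (?, ?, ?)",
        [(f"S{i % 50}", f"P{i}", i % 7) for i in range(5000)],
    )
    sql = "SELECT part, qty FROM order_lines WHERE supplier = ?"

    # Without an index the list query has to scan the whole table.
    before = query_plan(conn, sql, ("S7",))

    # The index holds the filter column plus every selected column,
    # so the query can be answered from the index alone.
    conn.execute(
        "CREATE INDEX idx_supplier_part_qty ON order_lines (supplier, part, qty)"
    )
    after = query_plan(conn, sql, ("S7",))
    conn.close()
    return before, after


if __name__ == "__main__":
    plan_before, plan_after = demo()
    print("before:", plan_before)
    print("after: ", plan_after)
```

The exact plan wording varies by SQLite version, but the first plan reports a table scan and the second reports use of a covering index, which is the same idea as designing Oracle indexes so the base table is never read.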

Page 13: Bottlenecks exposed


LR Oracle Monitor

Table scan blocks average = 12

Index fast full scans = 0

Page 14: Bottlenecks exposed

App Server (Simple) Anatomy

[Diagram: the Web Server passes Client Requests to the App Server (e.g. usually two; Sun 420 dual cpu 1GB memory) and receives html pages back; the App Server sends SQL to the DB Server and receives Data back. App Server components shown: Connection Mgr, Communic. Mgr, Presentation Manager, Presentation Logic, Business Logic, Object Cache, Security Mgr, Transaction Mgr, Messaging Mgr, DB Conn. Mgr.]

Page 15: Bottlenecks exposed

Key App Server Measurements

Measurement: Impact/Range

• Server CPU: Shows raw horsepower consumption on the server; should average 70-80%; else add cpus!

• Server Page Faults/s.: Should be low and constant, else yields virtual memory disk IO, which indicates insufficient memory allocated to App Server processes.

• Cache Hit Ratios: Should be hi (90% range); else data/object caches are sized too low and there is too much physical IO.

• App Server memory: Memory should rise as active sessions grow, should shrink in garbage collection cycle, and should stabilize at target load at 70% average, else possible memory leak or add memory.

• Active/Total Sessions: Should rise as load increases, stabilize at target load, approximate vendor target/instance; else, decrease inactive session keep-alive time.

• SSL transactions/sec: Should be a relatively low ratio vs. non-secure transactions (<15%?); else, eating up cpu, bw.

• Requests/second: A general indicator of app server load as evidenced by web server request volume, and should be compared run-to-run and track with load applied.

• Active/Total DB Pool Connections: Active sessions should rise with load, and stabilize at less than Total; if it does not stabilize, indicates insufficient processing power to keep up with DB; if it maxes out, too few connections.

• Server Memory: Memory should track App Server memory, should stabilize at target load at 70% average, else possible memory leak or add memory.

• Application log: Should contain low/no error messages, low warnings; else indicates application problems.

• Load balancing: Should see all app server instances doing a similar amount of work; else indicates load balancing problem.
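The Active/Total DB Pool Connections rule above has three outcomes, which can be sketched as a small classifier. This is an illustrative Python sketch only: the function name, the trailing-window size, and the tolerance are assumptions, and the samples are presumed to be active-connection counts read from the app server console during a run at target load:

```python
def classify_pool(active_samples, total, tail=5, tolerance=1):
    """Classify a DB connection pool from sampled active-connection counts.

    "maxed out"        active reached total (too few connections)
    "not stabilizing"  active is still climbing over the last `tail` samples
    "stable"           active levelled off below total
    """
    if not active_samples:
        raise ValueError("need at least one sample")
    if max(active_samples) >= total:
        return "maxed out"
    window = active_samples[-tail:]
    if len(window) >= 2 and window[-1] - window[0] > tolerance:
        return "not stabilizing"
    return "stable"


if __name__ == "__main__":
    # Hypothetical samples for a pool of 30 connections.
    print(classify_pool([5, 10, 15, 18, 19, 19, 20, 19, 19], total=30))
    print(classify_pool([2, 4, 6, 8, 10, 12, 14], total=30))
    print(classify_pool([5, 10, 15, 20, 25, 30, 30], total=30))
```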

Page 16: Bottlenecks exposed

App Server Metrics & Cures

Cause: Memory leak
Measurement: Memory utilization rises steadily, doesn't recover
Cure: Find and fix memory-faulty application code

Cause: Inefficient garbage collection
Measurement: Spikes in transaction times
Cure: Tune app server load balancing

Cause: Sub-optimal session model
Measurement: Steadily rising active sessions
Cure: Tune session keep-alive setting

Cause: Poorly configured App Server
Measurement: Low correlation btw App and HW resource utilization; overall poor performance
Cure: Validate proper JVM-to-app server match; increase data & object caches; add HW memory

Cause: Insufficient hardware resources
Measurement: Hi cpu, memory, I/O utilization
Cure: Add cpus, memory; decrease no. of App server instances

Cause: Poorly configured DB connection pool
Measurement: Steadily rising active connections, hi cpu utilization
Cure: Raise DB connections; lower no. of App Server instances

Cause: Inefficiently coded transaction
Measurement: Slow specific business function
Cure: Pinpoint & diagnose longest running business processes

Cause: Inefficient security model
Measurement: Hi calls on port 7002
Cure: Review/relax app security

Cause: Inefficient object access method
Measurement: Slow object creation
Cure: Change object access method

Cause: Other
Measurement: Low OS resources; erratic transaction performance
Cure: Pinpoint and correct!

Page 17: Bottlenecks exposed

App Server Causes

• Memory leak: 15%
• Inefficient garbage collection: 12%
• Sub-optimal session model: 12%
• Poorly configured App Server: 12%
• Inefficiently coded transaction: 11%
• Insufficient hardware resources: 10%
• Poorly configured DB connection pool: 9%
• Inefficient object access method: 5%
• Inefficient DB access architecture: 4%
• Other: 10%

60% of the time: object caching, SQL, db connection pool; 20% of the time: inefficient application server

Page 18: Bottlenecks exposed

Example: B2C Large Retail Web Store

• Symptom:
  – App server memory leak
• Measurement:
  – Steadily increasing, non-recovering memory usage in Dynamo console
  – Memory exhausted and app server dies over 8 hour run
• Solution:
  – Test individual functions
  – Isolate errant function not releasing memory
  – Fix code!
  – Re-test to validate fix (longevity test)

[Diagram: Web Server Sun E420 (Apache), App Server Sun E420 (ATG Dynamo), DB Server Sun E4500 (Oracle)]
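"Steadily increasing, non-recovering" memory can be checked numerically on samples exported from a console. A minimal Python sketch, under the assumption that a healthy app server shows a garbage-collection sawtooth whose low points stay flat, while a leak shows low points that keep rising; the function names, the slope threshold, and the sample series are all hypothetical:

```python
def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def troughs(values):
    """Local minima of the series, taken as the low points after each GC."""
    return [
        values[i]
        for i in range(1, len(values) - 1)
        if values[i] < values[i - 1] and values[i] <= values[i + 1]
    ]


def looks_like_leak(values, min_slope=1.0):
    """True if memory low points (or the whole series) trend upward."""
    lows = troughs(values)
    if len(lows) < 3:
        # No sawtooth visible: fall back to the overall trend.
        return slope_per_sample(values) > min_slope
    return slope_per_sample(lows) > min_slope


if __name__ == "__main__":
    # Hypothetical memory samples in MB.
    leaking = [100, 140, 110, 150, 120, 160, 130, 170, 140]
    healthy = [100, 140, 100, 141, 101, 140, 100, 142, 100]
    print("leaking:", looks_like_leak(leaking))
    print("healthy:", looks_like_leak(healthy))
```

Running the same check per function under test is one way to support the "isolate errant function" step, and re-running it over a longevity test supports the "validate fix" step.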

Page 19: Bottlenecks exposed

Web Server Metrics & Cures

Cause: Insufficient hw capacity
Measurement: Hi cpu, memory, I/O; timeout errors
Cure: Add cpus, memory; add web servers; distribute content; add specialized servers (images, streaming media…)

Cause: Poorly configured server
Measurement: Hi I/O, hi memory utilization, low throughput
Cure: Tune web server configuration

Cause: Unbalanced load across servers
Measurement: Uneven utilization across web servers
Cure: Review/revise load balancing policies

Cause: Hi SSL transactions
Measurement: Memory utilization >70%, low throughput; hi port 443 calls
Cure: Review/relax secure transaction model

Cause: Other
Measurement: Low OS resource utilization, overall poor throughput
Cure: Diagnose App, DB servers

Cause: Inefficient transaction design
Measurement: Hi ip connections per active session
Cure: Reduce keep-alive time; correct transaction design

Cause: Broken links
Measurement: Broken link errors
Cure: Diagnose / fix application

Cause: Security too tight
Measurement: Hi firewall-to-web server traffic
Cure: Direct firewall and user traffic to different ports

Page 20: Bottlenecks exposed

Web Server Causes

• Insufficient hw capacity: 18%
• Poorly configured server: 15%
• Unbalanced load across servers: 15%
• Hi SSL transactions: 13%
• Other: 12%
• Inefficient transaction design: 11%
• Broken links: 8%
• Security too tight: 8%

Major contributor: Secure transactions; often: load balancing; sometimes: high-resource specialized functions (external links, email, chat)

Page 21: Bottlenecks exposed

Example: B2E Collaborating Communities

• Symptom:
  – Slow overall performance
  – DB server low activity
• Measurement:
  – Web/App server resources maxed out
  – Non-scalable transaction times
• Solution:
  – Short-term: Move “Chat” function to dedicated server
  – Long-term: Re-architect system in Java, separate Web and App tiers, introduce dedicated server for chat and email functions

[Diagram: Cisco Load Director, Web/App Server Dell 1550 (IIS/Visual Basic), DB Server Dell 2450 (SQL Server)]

Page 22: Bottlenecks exposed

Network Metrics & Cures

Cause: Poor network architecture
Measurement: Hi latency values in network delay monitor; low throughput
Cure: Review/tune configuration of NICs, routers, other devices

Cause: Other
Measurement: ???
Cure: ???

Cause: Poorly configured/insufficient network interface cards
Measurement: Low throughput btw servers
Cure: Tune NIC buffers; add 2nd NIC for failover heartbeat

Cause: Security too tight
Measurement: High traffic btw firewall & servers
Cure: Loosen security policies; redesign application security

Cause: Insufficient overall bandwidth
Measurement: Low, maxed throughput; high collision rate
Cure: Get hoster to raise bw ceiling; increase system bw; add NICs for failover functions

Cause: Load balancing ineffective
Measurement: Uneven load at web servers
Cure: Revise load balancing policy

Page 23: Bottlenecks exposed

Network Causes

• Load balancing ineffective: 22%
• Poor network architecture: 20%
• Other: 20%
• Security too tight: 15%
• Insufficient overall bandwidth: 13%
• Poorly configured/insufficient NICs: 10%

No single major cause; often the problem is load balancing, security, or network architecture.

Page 24: Bottlenecks exposed

Example: B2C On-line Printing Services

• Symptom:
  – Low transaction performance scalability under load
  – High latency across load balancer
• Measurement:
  – Unbalanced load on web server tier
• Solution:
  – Replace load balancer (bad hardware)
  – Change load balancer policies from IP-based to server-load based

[Diagram: Cisco Load Director, Web Server Sun E420, App Server Sun E420, DB Server Sun E4500 (Oracle)]
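Why an IP-based policy can leave the web tier unbalanced is easy to see in a toy simulation. A minimal Python sketch, assuming most requests arrive from one source address (for example, many users behind a single proxy, or a load generator with few IPs) and that every request costs the same; both policies here are simplified stand-ins, not the actual algorithms of any load balancer product:

```python
import ipaddress


def assign_by_ip(client_ips, n_servers):
    """IP-based policy: the client address alone picks the server."""
    load = [0] * n_servers
    for ip in client_ips:
        load[int(ipaddress.ip_address(ip)) % n_servers] += 1
    return load


def assign_by_least_load(client_ips, n_servers):
    """Server-load-based policy: each request goes to the least busy server."""
    load = [0] * n_servers
    for _ in client_ips:
        target = load.index(min(load))
        load[target] += 1
    return load


if __name__ == "__main__":
    # Hypothetical traffic: 90 requests from one address, 5 each from two others.
    traffic = ["10.0.0.1"] * 90 + ["10.0.0.2"] * 5 + ["10.0.0.3"] * 5
    print("IP-based:         ", assign_by_ip(traffic, 3))
    print("server-load based:", assign_by_least_load(traffic, 3))
```

With the IP-based policy one server receives 90 of the 100 requests; the load-based policy spreads them almost evenly, which is the change described in the solution above.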

Page 25: Bottlenecks exposed

Monitoring Tools

• LoadRunner
  – Transaction performance monitor
  – Server resource monitor
  – Oracle, SQL Server, selected app servers monitors
  – Network delay monitor
• Database performance monitoring tools
  – Quest Oracle Instance Monitor, Embarcadero, BMC DB Patrol
• App Server System Console (from app server vendor)
• Java object monitoring tools
  – JProbe, Performasure (Sitraka)
• Network Analyzer (aka network sniffer)
• Operating system utilities
  – Unix top, sar, vmstat, iostat
  – 2000/NT Perfmon

Page 26: Bottlenecks exposed


Tool Example:WebLogic Console

Page 27: Bottlenecks exposed

Lessons Learned

1. 80% of the time it is the application or system software, not the infrastructure!

2. Make friends with your app server, db server, and hardware monitoring tools!

3. Application architect, DBA, and App Server experts are indispensable and must be involved during load tests!

4. Arrive armed with the Top 10 Things to check for each component!

5. Identify the measurements you need to be able to make.

6. A Systems Engineer with networking, firewall, and load balancer expertise is very handy!

Page 28: Bottlenecks exposed


Questions?

[email protected]