#IDUG

Matt Huras, IBM DB2 for Linux, UNIX, and Windows

DB2 pureScale Best Practices


Agenda

● DB2 pureScale Technology Overview

● Initial Configuration Best Practices

– Cluster Topology, Network, Storage

– Instance and Database Configuration

– Client and Workload Balancing Configuration

● Performance Tuning Best Practices

– Compute

– Network

– Storage

– Workload

● Emerging Best Practices


Technology Overview

[Diagram: clients see a single database view; members share access to the database on shared storage, each writing its own log; cluster services (CS) run on every host; primary and secondary CFs connect to the members over the cluster interconnect.]

● DB2 engine runs on several host computers
  • Co-operate with each other to provide coherent access to the database from any member

● Data sharing architecture
  • Shared access to database
  • Members write to their own logs
  • Logs accessible from another host (used during recovery)

● Cluster caching facility (CF)
  • Efficient global locking and buffer management
  • Synchronous duplexing to secondary ensures availability

● Low latency, high speed interconnect
  • Special optimizations provide significant advantages on RDMA-capable interconnects (eg. InfiniBand)

● Clients connect anywhere, see single database
  • Clients connect into any member
  • Automatic load balancing and client reroute may change the underlying physical member to which a client is connected

● Integrated cluster services
  • Failure detection, recovery automation, cluster file system
  • In partnership with STG (GPFS, RSCT) and Tivoli (SA MP)

● Leverages IBM’s System z Sysplex design concepts


Overall System Topology

• Use at least 2 physical machines for production deployments
  • At least 2 members on separate machines

• Primary and secondary CF on separate machines

• Single-machine pureScale instances are not recommended for production deployments

  • However, they can be used for development and/or QA systems

  • eg. define multiple logical members on a single machine, to match production topology in a QA system

• Isolate CFs and members using one of the following methods, in order of preference:

1. Separate physical machines

2. Separate partitions (eg. AIX LPARs)

3. Separate CPUs

• On Linux, rely on pureScale default behavior

• If a member and CF are co-located, DB2 will automatically bind them to separate CPUs

• On AIX, see bullet 2 (i.e. use LPARs !)

• If, for some reason you cannot use LPARs

- use DB2_RESOURCE_POLICY to tell DB2 to automatically bind members to a set of CPUs
- set CF_NUM_WORKERS to limit the CF to the remaining CPUs


• Use these approximate ratios as rules-of-thumb:

  • For write-heavy workloads, both primary and secondary CF should have at least 1/6th the combined CPU count of all members

  • For predominantly read workloads, both primary and secondary CF may suffice with 1/12th the combined CPU count of all members
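As a quick sanity check, the rules-of-thumb above can be computed directly. A minimal sketch, assuming an example member CPU count (the value 48 is illustrative, not from the slides):

```shell
# Rule-of-thumb CF sizing: CF CPUs >= member CPUs / 6 (write-heavy)
# or >= member CPUs / 12 (predominantly read), rounded up.
member_cpus=48                                 # assumed combined logical CPUs of all members
write_heavy_cf=$(( (member_cpus + 5) / 6 ))    # at least 1/6th, rounded up
read_mostly_cf=$(( (member_cpus + 11) / 12 ))  # at least 1/12th, rounded up
echo "write-heavy CF CPUs: $write_heavy_cf"
echo "read-mostly CF CPUs: $read_mostly_cf"
```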

• Ensure each CF worker thread gets its own dedicated logical CPU

• CF response time is critical to overall system performance
  • CF workers constantly look for new requests from members
  • Each worker needs its own logical CPU to deliver optimal response time

• Use dedicated (not shared) LPARs on AIX for the CF

• Use at least 1 physical core for each CF
  • eg. 4 SMT threads (aka logical CPUs) on Power 7

• Leave some CPU resource for pureScale’s housekeeping & recovery threads
  • CF threads are in a busy loop looking for new requests
  • Using the default of CF_NUM_WORKERS=AUTO usually takes care of this for you

OK, But How Should I Divide Up CPU Resources ?


Example Configurations : Example 1

Workload : OLTP workload with typical R/W ratio (20% transactions write)

Target Cluster : Ten machines, each with 4 cores, 8 logical CPUs

Recommendation ?

• Target member to primary CF CPU ratio ~8:1

• Action:
  • Dedicate 8 full machines to 8 members
  • Dedicate 2 full machines to the CFs
  • Leave CF_NUM_WORKERS=AUTO

• Notes:
  • When a CF is the only member/CF on a machine, the CF_NUM_WORKERS=AUTO setting will not assign all CPUs to workers
  • It will usually leave 1 CPU unassigned for use by pureScale’s recovery automation threads
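The arithmetic behind the ~8:1 target in this example can be checked directly (a sketch using the slide's cluster figures):

```shell
logical_cpus_per_machine=8                  # 4 cores, 8 logical CPUs per machine
member_machines=8
member_cpus=$((member_machines * logical_cpus_per_machine))  # 64 logical CPUs across members
cf_cpus=$logical_cpus_per_machine                            # one full machine per CF
ratio=$((member_cpus / cf_cpus))                             # member:CF CPU ratio
echo "member:CF CPU ratio = ${ratio}:1"
```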

[Diagram: ten machines; eight run one member each, one runs the primary CF (CAp) and one the secondary CF (CAs).]

Example Configurations : Example 2

[Diagram: two machines, each running one member plus one CF (CFp on one machine, CFs on the other).]

Workload : Moderately heavy write ratio (eg. 30% transactions write)

Target Cluster : Two quad-core x86 machines
- each core has 2 logical CPUs (8 logical CPUs per machine)
- bare metal

Recommendation ?

• Target member to primary CF CPU ratio ~6:1

• Action:
  • Define a member and CF on each machine
  • Leave CF_NUM_WORKERS=AUTO

• Notes:
  • When a CF & member are co-located on Linux, setting CF_NUM_WORKERS=AUTO usually results in a ~80:20 member:CF split, ie: member: 6 CPUs, CF: 2 CPUs
  • Also, DB2 will automatically bind member and CF to appropriate cores!
  • A CF needs at least 1 worker for every port (this example assumes each CF is using <=2 ports)
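The ~80:20 member:CF split mentioned in the notes works out as follows (an illustrative calculation, not DB2's exact internal algorithm):

```shell
logical_cpus=8                              # quad-core x86, 2 logical CPUs per core
member_cpus=$((logical_cpus * 80 / 100))    # ~80% to the member: 6 CPUs (3 cores)
cf_cpus=$((logical_cpus - member_cpus))     # remainder to the CF: 2 CPUs (1 core)
echo "member=$member_cpus CF=$cf_cpus"
```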


Example Configurations : Example 3

Workload : Moderately heavy write ratio (eg. 30% transactions write)

Target Cluster : Four 16-core Power 7 machines
- each core has 4 logical CPUs (64 logical CPUs/machine)
- use AIX LPARs

Recommendation ?

• Target member to primary CF CPU ratio ~6:1

• Action:
  • Dedicate 2 full machines to 2 members
  • On the other 2 machines define:
    • one 8-core LPAR for a member
    • one dedicated 8-core LPAR for a CF
  • Set CF_NUM_WORKERS=28

• Notes:
  • CF_NUM_WORKERS=AUTO leaves just 1 logical CPU for recovery threads (i.e. in this case, it results in 31 workers). This is just 25% of 1 Power 7 core, and may not be sufficient.
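The reasoning behind CF_NUM_WORKERS=28 for the 8-core CF LPAR, expressed as arithmetic (a sketch of the slide's numbers):

```shell
cores=8; smt=4
logical_cpus=$((cores * smt))                    # 32 logical CPUs in the CF LPAR
auto_workers=$((logical_cpus - 1))               # AUTO: 31 workers, leaving 1 CPU (25% of a core)
explicit_workers=28
spare_cpus=$((logical_cpus - explicit_workers))  # 4 CPUs = 1 full Power 7 core for recovery threads
echo "AUTO workers: $auto_workers; spare CPUs with 28 workers: $spare_cpus"
```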

[Diagram: four machines; two dedicated entirely to members, and two split into an 8-core/32-CPU LPAR for a member and a dedicated 8-core/32-CPU LPAR for a CF (CFp on one machine, CFs on the other), with CF_NUM_WORKERS=28.]

Network Configuration : RDMA Network

● Use at least 2 switches
  – eg. connect each CF/member to each of 2 switches
  – Avoids a single point of failure

● Use 2 (or more) RDMA adapters on each CF and member

– Up to 4 per CF; up to 2 per member
– Allows a CF/member to survive a single adapter failure
– Provides additional bandwidth/IOPs

– Note: 2nd port on the same adapter may not provide significant additional bandwidth/IOPs

• If a member/CF must use a single adapter, utilize both of its ports to maximize availability

– Eg. If a member uses just 1 adapter, utilize both of its ports to connect the member to 2 switches

pureScale can be configured with either InfiniBand or RDMA over Converged Ethernet (aka RoCE). See “Installation prerequisites for DB2 pureScale Feature” for the current list of supported adapters.


Storage Selection

• DB2 pureScale supports all SAN storage, with 3 categories of support

• Category 1: Tested storage which supports fast I/O fencing

• Category 2: Tested storage without fast I/O fencing

• Category 3: All other types of shared storage

• Use Category 1 storage to get fastest member recovery

• Uses fast IO fencing (aka ‘SCSI-3 Persistent Reserve’) to quickly (within seconds) isolate shared storage from a failing member

• Allows recovery to begin with assurance that a ‘comatose’ (non-responsive) member will not come back to life and start writing to data or logs

• Without fast IO fencing, a lease expiry fencing mechanism is used that generally takes several 10s of seconds to ensure a non-responsive member will not write to data or logs

• Includes IBM DS3000, DS5000, DS8000, V7000, EMC Symmetrix, Hitachi, Netapp

• See this link for the latest information regarding storage support

• http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.ibm.db2.luw.qb.server.doc%2Fdoc%2Fc0059360.html


What is IO Fencing ? Why is it Necessary ?

[Diagram: two members share data and per-member logs; member 1 fails with a commit and a rollback in flight, illustrating why a failed member must be fenced from the shared storage before recovery proceeds.]

The Importance of Category 1 Storage

[Diagram: with Category 1 storage, a failed member is hardware-fenced and recovery begins in seconds; with other storage, recovery must wait for a lease to expire and takes roughly a minute.]

Current Category 1 List (as of Oct 2013)

See the "Shared Storage Considerations" section of the Information Center for the latest information:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/topic/com.ibm.db2.luw.qb.server.doc/doc/c0059360.html

Storage Selection (continued)

• Use fast storage for logs

● Like any database, pureScale needs adequate IO bandwidth to keep response times low when the system is under heavy load

● In addition, clustered databases may need to flush their logs to disk more often than a single-node database, so log I/O performance is even more important

● Solid-state disks (SSDs) can be very useful in minimizing IO times

– A relatively small SSD investment can make a big difference in a log-bound system where the storage write cache can't keep up

– Ensure appropriate redundancy (eg. RAID, and/or DB2 log mirroring)

[Diagram: three members issuing UPDATE statements against different keys, each flushing pages and its own log buffer to its log, with the CF coordinating.]

Cluster File System Configuration

– Place logs and tablespaces on separate filesystems

– General recommendation: one file system for all members’ logs and another for all tablespaces

– Consider additional data file systems for very large databases or to enable multi-temperature storage on different storage groups

– Note: each additional file system adds a small incremental cost during member recovery (~1-2 seconds)

– Use the db2cluster command to create additional file systems

db2cluster -cfs -create …

– Rely on pureScale’s default file system parameter settings

– db2cluster uses defaults appropriate for pureScale (eg. Direct I/O, block size)

– Add capacity at the cluster file system level:

db2cluster -cfs -add …

– Use LUNs with the same characteristics where possible to simplify maintenance

– ie. same size and performance characteristic

– To rebalance after adding LUNs over new storage

db2cluster -cfs -rebalance …
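Taken together, the file system lifecycle sketched above might look like the following. The file system and disk names are hypothetical, and the option spellings should be checked against the db2cluster command documentation for your release:

```shell
# Create a dedicated file system for logs (file system and disk names are illustrative)
db2cluster -cfs -create -filesystem db2logfs -disk /dev/hdisk10

# Later, grow it by adding a LUN with the same size/performance characteristics
db2cluster -cfs -add -filesystem db2logfs -disk /dev/hdisk11

# Rebalance existing data over the newly added storage
db2cluster -cfs -rebalance -filesystem db2logfs
```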


New Instance (aka ‘DBM’) Configuration Parameters

CF_NUM_WORKERS (default: AUTO, based on number of CPUs): number of threads in a CF that can process GBP, GLM and other requests
RSTRT_LIGHT_MEM (default: AUTO, typically ~5%): % of INSTANCE_MEMORY to reserve for automated recoveries of other members
CF_NUM_CONNS (default: AUTO, grows as needed): number of connections allowed at a CF (from each member)
CF_MEM_SZ (default: AUTO, typically uses between 70-90% of host memory): overall memory to use for a CF
CF_DIAGPATH (default: NULL, use DIAGPATH): path where CF-related diagnostic messages will be stored
CF_DIAGLEVEL (default: 2): severity of diagnostic messages logged

• Rely on defaults for pureScale instance configuration

• Ensure DIAGPATH & CF_DIAGPATH are set to local paths (ie. not shared)

• Some customers have experienced up to a 3 second reduction in member recovery time

• Due to less contention when writing diagnostics during recovery

• pureScale configures local diagnostic paths by default starting in 10.1 (not 9.8)

db2 update dbm cfg using DIAGPATH /local_path

db2 update dbm cfg using CF_DIAGPATH /local_path


New Database Configuration Parameters

• Start with defaults for pureScale DB configuration
  • Usually results in an appropriate 80:15:5 ratio between GBP:LOCK:SCA

• Customization may be appropriate if …
  • … you are running more than 1 database

• If all similar in priority/size/workload, do the following to give equal CF memory to each

db2 update dbm cfg using NUM_DB <# of your DBs>

db2set DB2_DATABASE_CF_MEMORY -1

• Otherwise, explicitly configure CF_DB_MEM_SZ for each

db2 update db cfg for <db1> using CF_DB_MEM_SZ <nnn>

db2 update db cfg for <db2> using CF_DB_MEM_SZ <mmm>

• … you want to reduce the window during which simultaneous primary and secondary CF failure could cause a group crash recovery to occur:

db2 update db cfg for <db1> using CF_CATCHUP_TRGT 5

CF_GBP_SZ (default: AUTO): memory used for the GBP
CF_LOCK_SZ (default: AUTO): memory used for the GLM
CF_SCA_SZ (default: AUTO): memory to use for the System Communication Area
CF_DB_MEM_SZ (default: AUTO): total amount of CF memory for a given database (includes the previous 3 areas)
CF_CATCHUP_TRGT (default: 15 minutes): target time for a newly started secondary CF to enter peer state


CF Memory Configuration Details

● CF memory is dominated by the Group Bufferpool (GBP)

– The GBP only stores modified pages, so the higher the read ratio, the less memory required by the CF

– The GBP is always allocated in 4K pages, regardless of the bufferpool page size(s) at the members

● Size CF memory so that the GBP gets ~35-40% of the sum of all members’ bufferpool sizes

• Example:
  – LBP size in each of 4 members : 10 M 4KB pages = 40 GB each, 160 GB total
  – A reasonable CF_GBP_SZ is ~15 M 4K pages = 60 GB total

• For higher read workloads (e.g. 85-95% SELECT), the required size decreases since there are fewer modified pages in the system
  – Consider 25% a minimum, even for very read-heavy workloads

• For 2 member clusters, the % can range higher to 40-50%
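The sizing rule can be expressed as a small calculation, here using the example figures above (4 members, 40 GB of local bufferpool each):

```shell
members=4
lbp_gb_per_member=40                          # 10 M 4KB pages per member
total_lbp_gb=$((members * lbp_gb_per_member)) # 160 GB of local bufferpools in total
gbp_low=$((total_lbp_gb * 35 / 100))          # 35% lower bound
gbp_high=$((total_lbp_gb * 40 / 100))         # 40% upper bound
echo "target CF_GBP_SZ: ${gbp_low}-${gbp_high} GB"
```

The slide's ~60 GB recommendation falls inside this 35-40% band.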


Automatic CF Memory Sizing : 1 Active Database

● Total CF memory allocation is controlled by

DBM config parameter CF_MEM_SZ

● Default AUTOMATIC settings provide reasonable initial calculations (but no self tuning)

• CF_MEM_SZ set to 70-90% of physical memory

• CF_DB_MEM_SZ defaults to CF_MEM_SZ (for a single DB)

• CF_SCA_SZ = 5-20% of CF_DB_MEM_SZ
  ● Metadata space for table control blocks, etc.

• CF_LOCK_SZ = 15% of CF_DB_MEM_SZ

• CF_GBP_SZ = remainder of CF_DB_MEM_SZ

[Diagram: CF_MEM_SZ (instance) contains CF_DB_MEM_SZ (DB 1), which is divided into CF_GBP_SZ, CF_LOCK_SZ and CF_SCA_SZ.]

Example CF Memory Sizing : 1 Active Database

● 4 Members, each with two bufferpools
  ● 8 M pages (for 4K page size)
  ● 6 M pages (for 8K page size)

● How big should the GBP be ?

● A Solution
  ● Total LBP size is 4*(32 GB + 48 GB) = 320 GB
  ● Target GBP size : 35-40% = 112-128 GB
  ● Give the CF partition/machine ~196 GB
  ● Rely on AUTO configuration
    ● CF_MEM_SZ ~ 80% ~ 160 GB
    ● CF_GBP_SZ ~ 80% ~ 128 GB
    ● Other memory areas given appropriate defaults
  ● Or, use explicit settings


• If you have multiple databases, all of similar importance and workload, you can use the DB2_DATABASE_CF_MEMORY registry variable to evenly divide CF memory across them

  • Ensures the first database to activate doesn't consume all CF memory

● If DB2_DATABASE_CF_MEMORY=-1:
  CF_DB_MEM_SZ = CF_MEM_SZ / NUM_DB

● If DB2_DATABASE_CF_MEMORY=33:
  CF_DB_MEM_SZ = (33/100) * CF_MEM_SZ
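The two formulas can be illustrated numerically. A sketch, assuming CF_MEM_SZ of 160 GB and NUM_DB=4 (both assumed values, not from the slides):

```shell
cf_mem_gb=160; num_db=4
even_split_gb=$((cf_mem_gb / num_db))    # DB2_DATABASE_CF_MEMORY=-1: equal share per database
pct=33
explicit_gb=$((cf_mem_gb * pct / 100))   # DB2_DATABASE_CF_MEMORY=33: 33% of CF memory
echo "even split: ${even_split_gb} GB per DB; 33% share: ${explicit_gb} GB"
```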

CF Memory : Multiple Active Databases

[Diagram: CF_MEM_SZ (instance) divided into three CF_DB_MEM_SZ regions (DB 1, DB 2, DB 3), each with its own CF_GBP_SZ, CF_LOCK_SZ and CF_SCA_SZ.]

Client Configuration : WLB, Affinity or Subsetting ?

• Use default (“all member”) WLB for single workloads

• When consolidating multiple workloads or databases in a single pureScale cluster:
  ● Before 10.5, use affinity routing
  ● With 10.5 and beyond, use member subsetting

[Diagram: with default WLB, a single workload is balanced over all members; with affinity routing, each workload is pinned to specific members and there is no workload balancing; with member subsets, each workload is independently balanced over a defined subset of members.]

Configuring WLB (For “All Members” or Subsets)

• Defaults are reasonable for many deployments

• The most common customizations are:

1. Enabling transaction level WLB (vs connection level)

2. Keep alive timeout

3. Enabling the alternate server list for first connection processing

• Following slides show how to customize these for the CLI/ODBC client

• Subsequent slide maps to corresponding parameters for Java


Customize CLI/ODBC WLB with db2dsdriver.cfg

<databases>
  <database name="SAMPLE" host="myhost1.net1.com" port="50001">
    <parameter name="KeepAliveTimeout" value="10"/>
    <WLB>
      <parameter name="enableWLB" value="true"/>
      <parameter name="maxRefreshInterval" value="30"/>
    </WLB>
    <ACR>
      <parameter name="enableAcr" value="true"/>
      <parameter name="enableSeamlessAcr" value="true"/>
      <parameter name="enableAlternateServerListFirstConnect" value="true"/>
      <alternate_server_list>
        <server name="m2" hostname="myhost2.net1.com" port="50001"/>
        <server name="m3" hostname="myhost3.net1.com" port="50001"/>
        <server name="m4" hostname="myhost4.net1.com" port="50001"/>
      </alternate_server_list>
    </ACR>
  </database>
</databases>


1) Enabling Transaction Level WLB

In db2dsdriver.cfg, enable transaction level WLB via the WLB section:

<WLB>
  <parameter name="enableWLB" value="true"/>
  <parameter name="maxRefreshInterval" value="30"/>
</WLB>

Physical Connection Pool Managed by Client Driver

• The connections established by applications are termed logical
• Transparent to the application, the DB2 driver maintains a number of physical connections to members
• Each logical connection is mapped to a single physical connection
• This mapping is done transparently to the application and may change on connection or transaction boundaries
• In this example, the application has two threads, each with one logical connection
  • Thread 1’s logical connection is currently mapped to a physical connection to member 1
  • Thread 2’s logical connection is currently mapped to a physical connection to member 2
  • The remaining 7 physical connections are not currently in use
• The maximum size of the physical connection pool is unlimited by default. A finite limit can be set via:

<WLB>
  <parameter name="maxTransports" value="100"/>
</WLB>

Connection Level WLB Enabled by Default

• The client driver transparently routes logical connection requests to members with fewer logical connections than their relative priority calls for
• In this example, the relative member priorities (M1: P=20, M2: P=20, M3: P=60) indicate M1, M2 and M3 should have 20%, 20% and 60% of the logical connections, respectively
• M1 has too many connections, M2 has too few, and M3 has the right amount
• So, when thread 2 disconnects and then reconnects to the database, the client driver will use a physical connection to M2

Note: P ranges from 0-100 and is based on CPU load, paging rates and other factors. Larger values mean more cycles are available and more work should be routed to the member.

Transaction Level WLB

• The DB2 client driver transparently routes transactions to members with fewer logical connections than their relative priority calls for
• In this example, the priorities (M1: P=30, M2: P=30, M3: P=90) indicate M1, M2 and M3 should have 20%, 20% and 60% of the logical connections, respectively
• M1 has too many connections, M2 has too few, and M3 has the right amount
• So, when thread 2 starts a new transaction, the client driver may use a physical connection to M2
• Enabled via:

<parameter name="enableWLB" value="true"/>

Connection or Transaction Level ?

• Transaction level WLB is a good general purpose approach

• If in doubt, start with transaction level WLB

• However, watch out for application usage patterns that maintain state across transactions; these will revert back to connection level WLB:

• Sequences
  • Use db2set DB2_ALLOW_WLB_WITH_SEQUENCES=YES to allow WLB with sequences
    • Confirms applications don’t get a previous sequence value in a transaction before getting the next value
    • This db2set is commonly used

• Declared Global Temporary Tables
• Cursors with hold
• Full list:
  http://publib.boulder.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.ibm.db2.luw.qb.server.doc%2Fdoc%2Fr0056430.html


Connection or Transaction Level ? (continued)

• Connection level WLB may have advantages in certain scenarios:

• Minimize reconnect overhead for tiny transactions

• However, watch out for:

• Long-lived connections (balancing may rarely/never occur)

• Notes:
  • Both transaction level and connection level WLB work best when each client process/JVM has multiple connections, particularly with pre-10.1 clients
  • This is because balancing is performed independently by each client process/JVM
    • Decisions made by 1 client process/JVM may not be coordinated with decisions of another
  • The 10.1 C-client and 10.5 JCC client have enhancements for transaction level WLB to better handle the single transaction per process/JVM case


2) KeepAliveTimeout

In db2dsdriver.cfg, set KeepAliveTimeout at the database level:

<database name="SAMPLE" host="myhost1.net1.com" port="50001">
  <parameter name="KeepAliveTimeout" value="10"/>
  ...
</database>

What is KeepAliveTimeout ?

• A value that limits how long it may take before a client (or server) detects the abnormal termination of the server (or client)

• If not set, the system default may be used (typically ~2 hours)
  • This means that in the event of a DB2 member failure, a DB2 client awaiting a reply to an SQL statement may not notice the failure for up to 2 hours
  • The host-level setting can be changed, but that change will affect all TCP/IP connections on the host

• Note: 10.1 and beyond clients use a default of 15 seconds

• Use a value that supports your failover objective
  • 5-10 seconds is often a reasonable value: <parameter name="KeepAliveTimeout" value="10"/>

[Diagram: a SELECT is in flight when member 1 fails; with KeepAliveTimeout=10, a socket-lost error is returned to the DB2 driver within 10 seconds, ACR transparently reconnects to member 2, the SELECT is re-driven, and the result set is returned to the application.]

3) Enabling Alternate Members on First Connect

In db2dsdriver.cfg, enable the option and list the alternate servers in the ACR section:

<ACR>
  <parameter name="enableAlternateServerListFirstConnect" value="true"/>
  <alternate_server_list>
    <server name="m2" hostname="myhost2.net1.com" port="50001"/>
    <server name="m3" hostname="myhost3.net1.com" port="50001"/>
    <server name="m4" hostname="myhost4.net1.com" port="50001"/>
  </alternate_server_list>
</ACR>

Huh ? Alternate Members on First Connect ?

• After a successful connection to any member of a pureScale cluster, the client will receive a complete list of all members and their IP information

• From this point on, the client can transparently perform workload balacing and reroute connections from failed members to other members, and

• However, if a client’s initial connection targets a member that is not currently active, the attempt will fail

• The client is unable to transparently try other members, because it does not have their IP information

• Use enableAlternateServerListFirstConnect to give the client all members’ IP information up front

• With this, even if the target member of the first client connect attempt is not available, the client will be able to retry against other members

BP


#IDUG

Notes on Configuring CLI/ODBC vs Java/JCC

• Connection-level WLB — db2dsdriver.cfg: connectionLevelLoadBalancing; Java/JCC: not available

• Transaction-level WLB — db2dsdriver.cfg: enableWLB; Java/JCC: enableSysplexWLB

• Automatic Client Reroute — db2dsdriver.cfg: enableAcr; Java/JCC: no explicit switch, can enable/disable via presence/absence of alternate server information

• Seamless ACR — db2dsdriver.cfg: enableSeamlessAcr; Java/JCC: enableSeamlessFailover

• Maximum age of member weight information before it is refreshed — db2dsdriver.cfg: maxRefreshInterval; Java/JCC: db2.jcc.maxRefreshInterval

• Maximum number of underlying physical connections — db2dsdriver.cfg: maxTransports; Java/JCC: db2.jcc.maxTransportObjects

• Client-member affinitization — db2dsdriver.cfg: client_affinity_defined, affinity_list; Java/JCC: client_affinity_defined and affinity_list in db2dsdriver.cfg, with db2.jcc.dsdriverConfigFile used to specify the db2dsdriver.cfg file

• Try other members if the initial connect to a specific member fails — db2dsdriver.cfg: enableAlternateServerListFirstConnect; Java/JCC with JNDI: db2.jcc.clientRerouteServerListJNDIName, db2.jcc.DB2ClientRerouteServerList; without JNDI: db2.jcc.clientRerouteAlternateServerName, db2.jcc.clientRerouteAlternatePortNumber

• Rapid detection of failed server — db2dsdriver.cfg: keepAliveTimeout; Java/JCC: see next slide

• ACR connection retry and failback behavior — db2dsdriver.cfg: maxAcrRetries, acrRetryInterval, affinityFailbackInterval; Java/JCC: maxRetriesForClientReroute, retryIntervalForClientReroute, affinityFailbackInterval


#IDUG

Setting KeepAliveTimeout for Java Clients

• Connection URL example:

jdbc:db2://coralpib19a.torolab.ibm.com:56733/eComHQ:enableSysplexWLB=true;keepAliveTimeOut=10;

• Java application code:

String url = "jdbc:db2://coralpib19a.torolab.ibm.com:56733/eComHQ";

Properties properties = new Properties();

properties.put("user", "yourID");

properties.put("password", "yourPassword");

properties.put("enableSysplexWLB", "true");

properties.put("keepAliveTimeOut", "10");

Connection con = DriverManager.getConnection( url, properties );

http://public.dhe.ibm.com/software/dw/data/dm-1206purescaleenablement/wlb.pdf

White Paper on Client Configuration and WLB

Tip


#IDUG

Agenda

● DB2 pureScale Technology Overview

● Initial Configuration Best Practices

– Cluster Topology, Network, Storage

– Client and Workload Balancing Configuration

– Instance & Database Configuration

● Performance Tuning Best Practices

– Compute

– Network

– Storage

– Workload

• Emerging Best Practices


#IDUG

Is the CF Healthy ?

[Diagram: four members, each with a local bufferpool (LBP), LockList, and dbheap, connected to the CF, which hosts the group bufferpool (GBP), global LockList, and SCA. Key health questions: CF memory? CF CPU? Network?]

Read Steve Rees' "DB2 Performance and Monitoring Best Practices" on www.ibm.com/developerworks

BP


© 2012 IBM Corporation

#IDUG

Accounting for pureScale Bufferpool Operations

An agent's data-page read falls into one of four cases; each increments the metrics shown:

Page found where?                  Metrics affected (+1 each)
---------------------------------  --------------------------------------------------
Found in LBP                       pool_data_l_reads, pool_data_lbp_pages_found
Invalid in LBP, found in GBP       pool_data_l_reads, pool_data_lbp_pages_found,
                                   pool_data_gbp_l_reads, pool_data_gbp_invalid_pages
Not in LBP, found in GBP           pool_data_l_reads, pool_data_gbp_l_reads
Not in LBP or GBP, found on disk   pool_data_l_reads, pool_data_gbp_l_reads,
                                   pool_data_gbp_p_reads, pool_data_p_reads



#IDUG

Calculating Bufferpool Hit Ratios

● Overall (and non-pureScale) hit ratio

(POOL_DATA_L_READS - POOL_DATA_P_READS) / POOL_DATA_L_READS

– Primary indicator of BP quality
– Great values: 95% for index, 90% for data
– Good values: 80-90% for index, 75-85% for data

● LBP hit ratio

(POOL_DATA_LBP_PAGES_FOUND - POOL_ASYNC_DATA_LBP_PAGES_FOUND) / POOL_DATA_L_READS

– Usually a less significant indicator
  • Excludes GBP hits
  • Includes invalid pages

● GBP hit ratio

(POOL_DATA_GBP_L_READS - POOL_DATA_GBP_P_READS) / POOL_DATA_GBP_L_READS

– Typically lower than the overall hit ratio (GBP currently only caches updated pages)
– More reads generally decreases the GBP hit ratio

● Calculate via MON_GET_BUFFERPOOL()
– Can be issued at any member
– Can retrieve data for any (or all) members
– e.g. to calculate the GBP hit ratio, request data for all members by specifying member=-2

SELECT POOL_DATA_GBP_L_READS,
       POOL_DATA_GBP_P_READS
FROM TABLE ( MON_GET_BUFFERPOOL('', -2) )
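The ratios above can also be computed directly in one query. A sketch, using the MON_GET_BUFFERPOOL column names shown on this slide and summing over all members (member argument -2); NULLIF guards against division by zero:

```sql
-- Sketch: overall and GBP data-page hit ratios per bufferpool,
-- aggregated across all members.
SELECT VARCHAR(BP_NAME, 20) AS BP_NAME,
       DECIMAL(100.0 * (SUM(POOL_DATA_L_READS) - SUM(POOL_DATA_P_READS))
               / NULLIF(SUM(POOL_DATA_L_READS), 0), 5, 2) AS OVERALL_HIT_PCT,
       DECIMAL(100.0 * (SUM(POOL_DATA_GBP_L_READS) - SUM(POOL_DATA_GBP_P_READS))
               / NULLIF(SUM(POOL_DATA_GBP_L_READS), 0), 5, 2) AS GBP_HIT_PCT
FROM TABLE(MON_GET_BUFFERPOOL('', -2))
GROUP BY BP_NAME
```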


#IDUG

Bufferpool Tuning

• See Steve Rees’ “DB2 Performance and Monitoring Best Practices” on www.ibm.com/developerworks

• One important indicator : Group Bufferpool Full conditions

select sum(num_gbp_full) from table(mon_get_group_bufferpool(-2))

• Occur when there are no free locations in the GBP for incoming pages from the members

• Cause an internal 'stall' condition during which dirty pages are written synchronously to create more space

  • This is fully transparent to applications

  • Similar to "dirty steal" in single-node DB2

• Counted at the members, but specific to the GBP, hence the above query sums across all members

• Can often also indicate storage that is too slow for the workload
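To see which member is driving the condition, the same table function can be broken out per member. A sketch, using the MEMBER and NUM_GBP_FULL columns of MON_GET_GROUP_BUFFERPOOL:

```sql
-- Sketch: GBP-full counts as reported by each member; a steadily
-- climbing value suggests the GBP is undersized or castout I/O is slow.
SELECT MEMBER, NUM_GBP_FULL
FROM TABLE(MON_GET_GROUP_BUFFERPOOL(-2))
ORDER BY NUM_GBP_FULL DESC
```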


#IDUG

• pureScale page locks are physical locks, indicating which member currently 'owns' the page. Picture the following:

• Member A acquires a page P, modifies a row on it, and continues with its transaction. 'A' holds an exclusive page lock on page P until 'A' commits

• Member B wants to modify a different row on the same page P. What now?

  • 'B' doesn't have to wait until 'A' commits and frees the page lock

  • The CF will negotiate the page back from 'A' in the middle of 'A's transaction, on 'B's behalf. This is called a reclaim

  • Provides far better concurrency and performance than waiting for the page lock until the holder commits

pureScale Page Negotiation (or 'reclaims')

[Diagram: Member A holds page P with an exclusive entry in the CF's global lock manager (GLM); the CF negotiates ('reclaims') P back from A on B's behalf mid-transaction.]


#IDUG

Monitoring Page Reclaims

● Page reclaims help eliminate page lock waits, but they're not cheap

● Excessive reclaims can cause contention – low CPU usage, reduced throughput, etc.

● MON_GET_PAGE_ACCESS_INFO gives very useful reclaim stats

Is 12,641 reclaims excessive? Maybe – it depends how long they accumulated. Rule of thumb: more than 1 reclaim per 10 transactions is worth looking into.
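A sketch of such a query, ranking objects by exclusive-page reclaims across all members. Column names such as PAGE_RECLAIMS_X and RECLAIM_WAIT_TIME follow the MON_GET_PAGE_ACCESS_INFO documentation; verify them against your DB2 level:

```sql
-- Sketch: top 10 objects by exclusive page reclaims, all members (-2).
SELECT VARCHAR(TABSCHEMA, 12) AS SCHEMA,
       VARCHAR(TABNAME, 24) AS NAME,
       VARCHAR(OBJTYPE, 6) AS TYPE,
       SUM(PAGE_RECLAIMS_X) AS RECLAIMS_X,
       SUM(RECLAIM_WAIT_TIME) AS RECLAIM_WAIT_MS
FROM TABLE(MON_GET_PAGE_ACCESS_INFO(NULL, NULL, -2))
GROUP BY TABSCHEMA, TABNAME, OBJTYPE
ORDER BY RECLAIMS_X DESC
FETCH FIRST 10 ROWS ONLY
```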


#IDUG

Reducing Page Reclaims : Page Size and Extent Size

• Reduce page size if possible

  • On 10.5, row size can exceed page size if EXTENDED_ROW_SZ=ENABLE

• "Small but hot" tables with frequent updates may benefit from increased PCTFREE

  • Spreads rows over more pages

  • Increases overall space consumption – 50% PCTFREE can double object size, but halve contention

  • Note: PCTFREE only takes effect on LOAD and REORG

• Larger extent sizes tend to perform better than small ones

  • Some operations require CF communication and other processing each time a new extent is created

  • Larger extents mean fewer CF messages

  • The default 32-page extent size usually works well
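A minimal sketch of both tips (table and tablespace names are hypothetical; remember PCTFREE only takes effect at the next LOAD or REORG):

```sql
-- Spread a small, hot, frequently updated table over more pages.
ALTER TABLE hot_counters PCTFREE 50;
REORG TABLE hot_counters;

-- Keep the default 32-page extent size; smaller extents mean more
-- CF messages as new extents are created.
CREATE TABLESPACE ts_orders EXTENTSIZE 32
```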

Tip

Tip

BP


#IDUG

Consider a CURRENT MEMBER Default Column

● Case 1: intensive inserts of successively incrementing/decrementing numeric values, timestamps, etc.

• Can cause significant demand for the page at the high/low end of the index, as the page getting all the new keys gets reclaimed between members

• Consider adding a hidden CURRENT MEMBER leading column – so each member tends to insert into a different page

alter table orders add column curmem smallint

default current member implicitly hidden;

create index seqindex on orders (curmem, ordnum);

DB2 10 multi-range index scan makes this unconventional index work…

[Diagram: with Orders(ordnum), members M1-M3 all insert keys 0, 1, 2 … 567 into the same high-key page; with Orders(curmem, ordnum), each member inserts into its own key range (M1,0 … M1,567; M2,0 … M2,566; M3,0 … M3,565).]


#IDUG

Consider a CURRENT MEMBER Default Column

● Case 2: low-cardinality indexes – e.g. GENDER, STATE, etc.

• Here, there can be significant demand for the pages where new RIDs are added for the (relatively few) distinct keys

• Consider transparently increasing the cardinality (and separating new key values by member) by adding a trailing CURRENT MEMBER column to the index

alter table customer add column curmem smallint

default current member implicitly hidden;

create index genderidx on customer (gender, curmem);

[Diagram: with index Gender, members M1-M3 all append new RIDs under the same few key values (F, M); with the index on (gender, curmem), each member appends under its own distinct (gender, member) key.]


#IDUG

Other Application / Schema Design Considerations

SEQUENCEs and IDENTITY columns should use a large cache and avoid the ORDER keyword

• Obtaining new batches of numbers requires CF communication and a log flush in pureScale

• Larger cache size (100 or more – best to tune) means fewer refills & better performance

CREATE SEQUENCE MY_SEQ START WITH 1 INCREMENT BY 1

NO MAXVALUE NO CYCLE CACHE 200
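An existing sequence can be adjusted in place; a sketch using the sequence name from the slide:

```sql
-- Raise the cache so each member refills its batch of values less often,
-- and ensure ORDER is off (ordered assignment defeats per-member caching).
ALTER SEQUENCE MY_SEQ CACHE 200 NO ORDER
```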

BP


#IDUG

CF CPU Capacity : Experiences

SELECT VARCHAR(NAME,20) AS ATTRIBUTE,

VARCHAR(VALUE,25) AS VALUE,

VARCHAR(UNIT,8) AS UNIT

FROM SYSIBMADM.ENV_CF_SYS_RESOURCES

ATTRIBUTE            VALUE      UNIT
-------------------- ---------- --------
HOST_NAME            coralm215  -
MEMORY_TOTAL         64435      MB
MEMORY_FREE          31425      MB
MEMORY_SWAP_TOTAL    4102       MB
MEMORY_SWAP_FREE     4102       MB
VIRTUAL_MEM_TOTAL    68538      MB
VIRTUAL_MEM_FREE     35528      MB
CPU_USAGE_TOTAL      93         PERCENT

HOST_NAME            coralm216  -
MEMORY_TOTAL         64435      MB
MEMORY_FREE          31424      MB
MEMORY_SWAP_TOTAL    4102       MB
MEMORY_SWAP_FREE     4102       MB
VIRTUAL_MEM_TOTAL    68538      MB
VIRTUAL_MEM_FREE     35527      MB
CPU_USAGE_TOTAL      93         PERCENT

16 record(s) selected.

• vmstat and other CPU monitoring tools typically show the CF at 100% busy – even when the cluster is idle

• Use ENV_CF_SYS_RESOURCES to get more accurate memory and CPU utilization

• Response time to requests from members may degrade as sustained CF CPU utilization climbs above 80-90%

• Allocating additional CPU cores to the CF may be required

Tip


#IDUG

� Infiniband is not infinite…

� Typical ratio is 1 CF RDMA adapter per 6-8 CF cores

� Main symptoms of interconnect bottleneck• High CF response time

• Increased member CPU time

• Poor cluster throughput with CPU capacity remaining on CF

● Look for an avg CF_WAIT_TIME of < 200 µs

  • CF_WAITS : ~ number of times a member makes a CF request (mostly dependent on the workload rather than the tuning)

  • CF_WAIT_TIME : time accumulated by members waiting on a CF response

  • Note: CF_WAIT_TIME does NOT include reclaim time or lock wait time

These metrics are available at the statement level in MON_GET_PKG_CACHE_STMT, or at the workload level in MON_GET_WORKLOAD, etc. (more useful for overall tuning)

Detecting an Interconnect Bottleneck

BP

Tip
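A sketch of the average-wait calculation from MON_GET_WORKLOAD, summed over all members. This assumes CF_WAIT_TIME is reported in milliseconds, as DB2 wait-time monitor elements generally are, so the result is converted to microseconds for comparison against the rule of thumb:

```sql
-- Average CF round-trip time per request, in microseconds;
-- compare against the ~200 µs rule of thumb.
SELECT SUM(CF_WAITS) AS CF_WAITS,
       SUM(CF_WAIT_TIME) AS CF_WAIT_TIME_MS,
       DECIMAL(1000.0 * SUM(CF_WAIT_TIME)
               / NULLIF(SUM(CF_WAITS), 0), 12, 1) AS AVG_CF_WAIT_US
FROM TABLE(MON_GET_WORKLOAD(NULL, -2))
```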


#IDUG

� Situation: very busy pureScale cluster running SAP workload

� CF with two Infiniband HCAs

● CF_WAIT_TIME / CF_WAITS gives a rough idea of average interconnect network time per CF call

• Important – this is an average over all CF calls

• Look for an average less than 200 µs

• Best way to judge good or bad numbers – look for a change from what's normal for your system

● Observed average per-call CF_WAIT_TIME with 2 CF HCAs – 630 µs

• This is very high – even a very busy system should be less than 200 µs

• CF CPU utilization was about 75% – high, but not so high as to cause this major slowdown

Interconnect Bottleneck Example

Tip


#IDUG

Add Another CF HCA

● And good things happened!

Metric                                  2 CF HCAs   3 CF HCAs
--------------------------------------  ----------  ----------
Average CF_WAIT_TIME                    630 µs      145 µs
Activity time of key INSERT statement   15.6 ms     4.2 ms
Activity wait time of key INSERT        8 ms        1.5 ms

[Diagram: members Mbr1-Mbr4 connected to CFpri and CFsec, before and after adding a third HCA to each CF.]


#IDUG

● MON_GET_CF_WAIT_TIME() gives round-trip counts and times by message type

[Diagram: a member issues LOCK, WRITE, and READ round-trips to the CF.]

CF_CMD_NAME                REQUESTS     WAIT_TIME
-------------------------  -----------  ----------
SetLockState               107787498    6223065328
WriteAndRegisterMultiple   4137160      2363217374
ReadAndRegister            57732390     4227970323

Other Ways to Drill Down on Interconnect Traffic
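A sketch of the drill-down query. The REQUESTS and WAIT_TIME column names are taken from the slide's output; confirm them against the MON_GET_CF_WAIT_TIME documentation for your release:

```sql
-- Round trips and total wait per CF command type, worst first.
SELECT VARCHAR(CF_CMD_NAME, 30) AS CF_CMD_NAME,
       REQUESTS,
       WAIT_TIME
FROM TABLE(MON_GET_CF_WAIT_TIME(-2))
ORDER BY WAIT_TIME DESC
```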


#IDUG

Other Ways to Drill Down on Interconnect Traffic

● Typical MON_GET_CF_WAIT_TIME() desirable upper bounds:

CF_CMD_NAME                AVG WAIT_TIME (µs)
-------------------------  ------------------
SetLockState               30-40
WriteAndRegisterMultiple   500
ReadAndRegister            100

● MON_GET_CF_CMD() gives command processing time on the CF, without network time:

CF_CMD_NAME                REQUESTS     CMD_TIME
-------------------------  -----------  ----------
SetLockState               107787498    3552982001
WriteAndRegisterMultiple   4137160      994550123
ReadAndRegister            57732390     2799436932
CrossInvalidate            3992011      59880165

• CrossInvalidate CMD times > 20 µs can indicate a network bottleneck

• CrossInvalidate processing has the least CF CPU overhead, and so can often be used as a direct indicator of network health

Tip


#IDUG

The Role of Cross Invalidation in pureScale

Cross Invalidation (aka Silent Invalidation)

● CF receives a new version of a page

● CF uses RDMA to turn off the 'valid' bit associated with (now) stale copies of this page in any member's local bufferpool

● Since this is done with RDMA, it requires no CPU cycles on the other members

● No interrupt or other message processing is required

● Cross Invalidation is key to efficient scaling as the cluster grows

[Diagram: the CF (GBP, GLM, SCA) receives a new page image from one member's buffer manager and uses RDMA to invalidate stale copies of the page in the other members' local bufferpools.]


#IDUG

Agenda

● DB2 pureScale Technology Overview

● Initial Configuration Best Practices

– Cluster Topology, Network, Storage

– Client and Workload Balancing Configuration

– Instance & Database Configuration

● Performance Tuning Best Practices

– Compute

– Network

– Storage

– Workload

• Emerging Best Practices


#IDUG

New Recovery Time Controls in 10.5

• page_age_trgt_mcr : target for the age of the oldest updated page in the LBP not yet reflected in the GBP (pureScale) or persistent storage (non-pureScale)

  • DB configuration parameter; default 120 seconds

  • Comments:

    – Can approximate member recovery time when you have very large batch update transactions

    – If you do not have large batch update transactions, this parameter has little effect, due to the pureScale 'Force-at-Commit' policy, which requires all transactions to send all updated pages to the CF's GBP before the transaction can commit, and limits member recovery time to ~20 sec

  • If your workload has very large batch update transactions and you want to avoid the exposure of a longer member recovery period, consider a smaller value than the 120-second default

• page_age_trgt_gcr : target for the age of the oldest updated page in the LBP not yet reflected on persistent storage

  • DB configuration parameter; default 240 seconds

  • Comments:

    – Can approximate group crash recovery time

    – Recall: group crash recovery only occurs in rare simultaneous-failure conditions (e.g. both CFs fail at the same time)

  • Consider a higher value if experiencing high I/O due to castout

  • Consider a lower value if significantly lower recovery times for a cluster-wide outage (e.g. concurrent primary and secondary CF failure) are required

BP

BP

Set SOFTMAX=0 to usethese on an upgraded db


Tip
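Enabling the new controls on an upgraded database, issued from the DB2 CLP (the database name is hypothetical; the values shown are the defaults discussed above):

```sql
UPDATE DB CFG FOR mydb USING SOFTMAX 0;
UPDATE DB CFG FOR mydb USING PAGE_AGE_TRGT_MCR 120;
UPDATE DB CFG FOR mydb USING PAGE_AGE_TRGT_GCR 240
```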


#IDUG

Using Random Indexes (in DB2 10.5)

• Intensive inserts of successively incrementing/decrementing key values can cause significant demand for pages at the high/low end of the index

  • The 'top' page getting all the new keys gets reclaimed between members

• The new RANDOM keyword on index create causes keys to be hashed on insert

  • Use this to address hot high-key indexes that are used for unique/primary constraints, or point-access queries

  • Other indexes with contention points should still use "current member" indexes

create index ordindex on orders (ordnum random);

[Diagram: with Orders(ordnum), demand concentrates on the last page of the ascending index; with Orders(ordnum random), RANDOM spreads increasing keys evenly over the index, avoiding hotspots.]

BP


#IDUG

Enabling Sufficient Capacity during Maintenance

• pureScale supports completely seamless online system maintenance

• With online "rolling maintenance", members/CFs are maintained one at a time:

  1. Drain the member of work

  2. Temporarily remove the member/CF from the cluster, and perform maintenance

  3. Add it back to the cluster and restart

  4. Repeat with the next member/CF

• Unlike other methods, this requires no forcing of connections, error codes, or quiesce waits

• DB2 10.5 adds support for online rolling DB2 fixpack updates

• Possible strategies for ensuring sufficient capacity during online maintenance:

  • Schedule maintenance during off-peak hours

    • For example, if you have 25% less workload during a 2-hour window Sunday a.m., define a 4-member cluster (each with the same CPU power), so that shutting down 1 member removes only 25% of capacity

  • Maintain full capacity by dynamically adding CPUs to members

    • Via dynamic LPARs on AIX, for example

Tip


#IDUG

Summary

• We’ve covered the key pureScale best practices from

– initial system configuration, to

– system tuning

– … including emerging best practices from the latest 10.5 release

• Look for this throughout the presentation for a summary of the key best practices

BP