OpenVMS Distributed Lock Manager Performance


Page 1: OpenVMS Distributed Lock Manager Performance

OpenVMS Distributed Lock Manager Performance, Session ES-09-U

Keith Parris, HPQ

Page 2: OpenVMS Distributed Lock Manager Performance

Background

VMS system managers have traditionally looked at performance in 3 areas: CPU, Memory, and I/O

But in VMS clusters, what may appear to be an I/O bottleneck can actually be a lock-related issue

Page 3: OpenVMS Distributed Lock Manager Performance

Overview

VMS keeps some lock activity data that no existing performance management tools look at

Locking statistics and lock-related symptoms can provide valuable clues in detecting disk, adapter, or interconnect saturation problems

Page 4: OpenVMS Distributed Lock Manager Performance

Overview

The VMS Lock Manager does an excellent job under a wide variety of conditions of optimizing locking activity and minimizing overhead, but:
In clusters with identical nodes running the same applications, remastering can sometimes happen too often
In extremely large clusters, nodes can “gang up” on lock master nodes and overload them
Locking activity can contribute to:
CPU 0 saturation in Interrupt State
Spinlock contention (Multi-Processor Synchronization time)

We’ll look at methods of detecting, and solutions to, these types of problems

Page 5: OpenVMS Distributed Lock Manager Performance

Topics

Available monitoring tools for the Lock Manager

How to map VMS symbolic lock resource names to real physical entities

Lock request latencies
How to measure lock rates

Page 6: OpenVMS Distributed Lock Manager Performance

Topics

Lock mastership, and why one might care about it

Dynamic lock remastering
How to detect and prevent lock mastership thrashing
How to find the lock master node for a given resource tree
How to force lock mastership of a given resource tree to a specific node

Page 7: OpenVMS Distributed Lock Manager Performance

Topics

Lock queues, their causes, and how to detect them

Examples of problem locking scenarios
How to measure pent-up remastering demand

Page 8: OpenVMS Distributed Lock Manager Performance

Monitoring tools

MONITOR utility:
MONITOR LOCK
MONITOR DLOCK
MONITOR RLOCK (in VMS 7.3 and above; not 7.2-2)
MONITOR CLUSTER
MONITOR SCS

SHOW CLUSTER /CONTINUOUS
DECamds / Availability Manager
DECps (Computer Associates’ Unicenter Performance Management for OpenVMS, earlier Advise/IT)

Page 9: OpenVMS Distributed Lock Manager Performance

Monitoring tools

ANALYZE/SYSTEM
New SHOW LOCK qualifiers for VMS 7.2 and above:
/WAITING displays only the waiting lock requests (those blocked by other locks)
/SUMMARY displays summary data and performance counters

New SHOW RESOURCE qualifier for VMS 7.2 and above:
/CONTENTION displays resources which are under contention

Page 10: OpenVMS Distributed Lock Manager Performance

Monitoring tools

ANALYZE/SYSTEM
New SDA extension LCK for lock tracing in VMS 7.2-2 and above:

SDA> LCK              ! Shows help text with command summary

Can display various additional lock manager statistics:

SDA> LCK STATISTIC    ! Shows lock manager statistics

Can show the busiest resource trees by lock activity rate:

SDA> LCK SHOW ACTIVE  ! Shows lock activity

Can trace lock requests:

SDA> LCK LOAD         ! Load the debug execlet
SDA> LCK START TRACE  ! Start tracing lock requests
SDA> LCK STOP TRACE   ! Stop tracing
SDA> LCK SHOW TRACE   ! Display contents of trace buffer

Can even trigger remaster operations:

SDA> LCK REMASTER     ! Trigger a remaster operation

Page 11: OpenVMS Distributed Lock Manager Performance

Mapping symbolic lock resource names to real entities

Techniques for mapping resource names to lock types
Common prefixes:
SYS$ for the VMS executive
F11B$ for the XQP (file system)
RMS$ for Record Management Services
See Appendix H in the Alpha V1.5 IDSM, or Appendix A in the Alpha V7.0 version

Page 12: OpenVMS Distributed Lock Manager Performance

Resource names

Example: XQP File Serialization Lock
Resource name format is “F11B$s” {Lock Basis}
Parent lock is the Volume Allocation Lock: “F11B$v” {Lock Volume Name}
Calculate the File ID from the Lock Basis
The Lock Basis is the RVN and File Number from the File ID (ignoring the Sequence Number), packed into 1 longword
Identify the disk volume from the parent resource name
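
Below is a minimal sketch in C of this decoding (it is not one of the presentation’s example programs). Assumptions: the file number, including the NMX extension byte, sits in the low 24 bits of the Lock Basis with the RVN in the high byte, using the little-endian byte order visible in the deck’s hex dumps; verify against the IDSM appendix cited above. The sample bytes correspond to the [328,*,0] example that appears later in the deck.

/* Sketch: decode an XQP File Serialization Lock resource name,
 * "F11B$s" followed by a longword Lock Basis. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Resource name bytes as displayed by SDA or a lock-activity report */
    unsigned char resnam[] = { 'F','1','1','B','$','s', 0x48, 0x01, 0x00, 0x00 };

    if (memcmp(resnam, "F11B$s", 6) != 0) return 1;

    unsigned int basis = resnam[6] | (resnam[7] << 8) |
                         (resnam[8] << 16) | ((unsigned)resnam[9] << 24);
    unsigned int file_number = basis & 0x00FFFFFF;   /* FID_NUM plus NMX byte */
    unsigned int rvn         = basis >> 24;          /* Relative Volume Number */

    /* The sequence number is not in the lock basis; it must come from INDEXF.SYS */
    printf("File ID [%u,*,%u] on the volume named in the parent F11B$v lock\n",
           file_number, rvn);
    return 0;
}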

Page 13: OpenVMS Distributed Lock Manager Performance

Resource names

Identifying the file from the File ID
Look at file headers in the Index File to get the filespec:
Can use the DUMP utility to display the file header (from the Index File):

$ DUMP /HEADER /IDENTIFIER=(file_id) /BLOCK=COUNT=0 disk:[000000]INDEXF.SYS

Follow directory backlinks to determine the directory path
See example procedure FILE_ID_TO_NAME.COM
(or use the LIB$FID_TO_NAME routine to do all this, if the sequence number can be obtained)

Page 14: OpenVMS Distributed Lock Manager Performance

Resource names

Example: RMS lock tree for an RMS indexed file
Resource name format is “RMS$” {File ID} {Flags byte} {Lock Volume Name}
Identify the filespec using the File ID
The flags byte indicates a shared or private disk mount
Pick up the disk volume name (this is the label as of the time the disk was mounted)
Sub-locks are used for buckets and records within the file
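
A minimal C sketch of parsing an RMS$ root resource name (an illustration, not the LCKACT/FILE_ID_TO_NAME tools referenced in this deck). It assumes the byte layout visible in the Lock Activity examples later in the deck: file number and sequence number as little-endian words, then RVN, NMX, the flags byte, and the volume label.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Raw resource name bytes, as displayed by SDA or a lock-activity report */
    unsigned char r[] = { 'R','M','S','$',
                          0x46,0x00,      /* file number 70          */
                          0x71,0x4C,      /* sequence number 19569   */
                          0x00,           /* relative volume number  */
                          0x00,           /* file number extension (NMX) */
                          0x02,           /* flags byte: shared/private mount */
                          'S','S','1',' ',' ',' ',' ',' ',' ',' ',' ',' ' };

    if (memcmp(r, "RMS$", 4) != 0) return 1;

    unsigned num = r[4] | (r[5] << 8);
    unsigned seq = r[6] | (r[7] << 8);
    unsigned rvn = r[8], nmx = r[9], flags = r[10];

    printf("RMS lock tree for file [%u,%u,%u] on volume %.12s (flags %02X)\n",
           num + (nmx << 16), seq, rvn, (char *)&r[11], flags);
    return 0;
}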

Page 15: OpenVMS Distributed Lock Manager Performance

Internal Structure of an RMS Indexed File

{Diagram: the Root Index Bucket at the top of the structure, with Level 1 Index Buckets beneath it, Level 2 Index Buckets beneath those, and Data Buckets at the bottom.}

Page 16: OpenVMS Distributed Lock Manager Performance

RMS Data Bucket Contents

{Diagram: a Data Bucket containing multiple Data Records.}

Page 17: OpenVMS Distributed Lock Manager Performance

RMS Indexed File: Bucket and Record Locks

These are sub-locks of the RMS File Lock; you have to look at the parent lock to identify the file
Bucket lock: 4 bytes: VBN of the first block of the bucket
Record lock: 8 bytes (6 on VAX): Record File Address (RFA) of the record
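
A companion sketch for decoding the sub-lock names described above. Assumptions: little-endian byte order, and on Alpha the 6-byte RFA (VBN longword plus record ID word) padded out to 8 bytes; the sample values are hypothetical.

#include <stdio.h>

/* Bucket lock: 4 bytes = VBN of the first block of the bucket */
static unsigned bucket_vbn(const unsigned char n[4])
{
    return n[0] | (n[1] << 8) | (n[2] << 16) | ((unsigned)n[3] << 24);
}

/* Record lock: Record File Address = VBN (longword) + record ID (word) */
static void record_rfa(const unsigned char n[8], unsigned *vbn, unsigned *id)
{
    *vbn = n[0] | (n[1] << 8) | (n[2] << 16) | ((unsigned)n[3] << 24);
    *id  = n[4] | (n[5] << 8);
}

int main(void)
{
    unsigned char bkt[4] = { 0x19, 0x00, 0x00, 0x00 };               /* hypothetical */
    unsigned char rec[8] = { 0x19, 0x00, 0x00, 0x00, 0x03, 0x00, 0, 0 };
    unsigned vbn, id;

    printf("Bucket lock: VBN %u\n", bucket_vbn(bkt));
    record_rfa(rec, &vbn, &id);
    printf("Record lock: RFA (%u,%u)\n", vbn, id);
    return 0;
}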

Page 18: OpenVMS Distributed Lock Manager Performance

Locks and File I/O

Lock requests and data transfers for a typical RMS indexed file I/O (prior to 7.2-1H1):
1) Lock & get the root index bucket
2) Lock & get index buckets for any additional index levels
3) Lock & get the data bucket containing the record
4) Lock the record
5) For writes: write the data bucket containing the record
Note: Most data reads may be avoided thanks to the RMS global buffer cache

Page 19: OpenVMS Distributed Lock Manager Performance

Locks and File I/O

Since all indexed I/Os access Root Index Bucket, contention on lock for Root Index Bucket of hot file can be a bottleneck

Lookup by Record File Address (RFA) avoids index lookup on 2nd and subsequent accesses to a record

Page 20: OpenVMS Distributed Lock Manager Performance

Lock Request Latencies

Latency depends on several things:
Directory lookup needed or not
Local or remote directory node
$ENQ or $DEQ operation
Local or remote lock master
If remote, the type of interconnect

Page 21: OpenVMS Distributed Lock Manager Performance

Directory Lookups

This is how VMS finds out which node is the lock master
Only needed for the 1st lock request on a particular resource tree on a given node; the Resource Block (RSB) remembers the master node’s CSID
Basic conceptual algorithm: hash the resource name and index into the lock directory vector, which has been created based on LOCKDIRWT values
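
The directory-vector idea can be illustrated with a toy model in C. This is conceptual only: the hash function, vector sizing, and node names below are invented for illustration and do not reflect VMS’s actual algorithm.

#include <stdio.h>
#include <string.h>

#define MAX_SLOTS 64

struct node { const char *name; int lockdirwt; };

int main(void)
{
    /* Hypothetical cluster: a LOCKDIRWT of 0 keeps a node out of the vector */
    struct node nodes[] = { {"XYZB12", 2}, {"XYZB13", 1}, {"XYZB14", 0} };
    const char *vector[MAX_SLOTS];
    int slots = 0;

    /* Nodes appear in the directory vector in proportion to their LOCKDIRWT */
    for (unsigned i = 0; i < sizeof nodes / sizeof nodes[0]; i++)
        for (int w = 0; w < nodes[i].lockdirwt && slots < MAX_SLOTS; w++)
            vector[slots++] = nodes[i].name;

    /* Toy hash of the resource name (a stand-in for the real algorithm) */
    const char *resnam = "RMS$...example...";
    unsigned hash = 5381;
    for (const char *p = resnam; *p; p++)
        hash = hash * 33 + (unsigned char)*p;

    printf("Directory node for %s: %s\n", resnam, vector[hash % slots]);
    return 0;
}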

Page 22: OpenVMS Distributed Lock Manager Performance

Lock Request Latencies

Local requests are fastest
Remote requests are significantly slower:
The code path is ~20 times longer
The interconnect also contributes latency
Total latency can be up to 2 orders of magnitude higher than for local requests

Page 23: OpenVMS Distributed Lock Manager Performance

Lock Request Latency

Client process on the same node: 4-6 microseconds

{Diagram: client process and lock master on the same node.}

Page 24: OpenVMS Distributed Lock Manager Performance

Lock Request Latency

Client across a CI star coupler: 440 microseconds

{Diagram: client node and lock master node connected through a CI star coupler, with shared storage.}

Page 25: OpenVMS Distributed Lock Manager Performance

Lock Request Latencies

{Bar chart: lock request latency by interconnect, in microseconds}

Local node            4
Galaxy SMCI          94
MC 2                120
Gigabit Ethernet    230
FDDI GS-FDDI-GS     270
FDDI GS-ATM-GS      285
DSSI                333
CI                  440

Page 26: OpenVMS Distributed Lock Manager Performance

How to measure lock rates

VMS keeps counters of lock activity for each resource tree, but not for each of the sub-resources
So you can see the lock rate for an RMS indexed file, for example, but not for individual buckets or records within that file
The SDA extension LCK can trace all lock requests if needed

Page 27: OpenVMS Distributed Lock Manager Performance

Identifying busiest lock trees in the cluster with a program

Measure lock rates based on RSB data:
Follow the chain of root RSBs from the LCK$GQ_RRSFL listhead via RSB$Q_RRSFL links
Root RSBs contain counters:
RSB$W_OACT: Old activity field (average lock rate per 8-second interval); divide by 8 to get the per-second average
RSB$W_NACT: New activity (locks so far within the current 8-second interval); a transient value, so not as useful
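
As a trivial worked example of the OACT conversion (the counter value below is hypothetical, chosen to produce a rate like those in the Lock Activity examples that follow):

#include <stdio.h>

int main(void)
{
    unsigned short oact = 51640;                 /* hypothetical RSB$W_OACT value */
    printf("~%u lock requests per second\n", oact / 8);   /* prints ~6455 */
    return 0;
}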

Page 28: OpenVMS Distributed Lock Manager Performance

Identifying busiest lock trees in the cluster with a program

Look for non-zero OACT values:
Gather the resource name, master node CSID, and old-activity field
Do this on each node
Summarize the data across the cluster
See example procedure LOCK_ACTV.COM and program LCKACT.MAR
Or, for VMS 7.2-2 and above:
SDA> LCK SHOW ACTIVE
Note: Per-node data, not a cluster-wide summary

Page 29: OpenVMS Distributed Lock Manager Performance

Lock Activity Program Example

0000002020202020202020203153530200004C71004624534D52 RMS$F.qL...SS1 ...
RMS lock tree for file [70,19569,0] on volume SS1
File specification: DISK$SS1:[DATA8]PDATA.IDX;1
Total: 11523
*XYZB12  6455
XYZB11    746
XYZB14    611
XYZB15    602
XYZB23    564
XYZB13    540
XYZB19    532
XYZB16    523
XYZB20    415
XYZB22    284
XYZB18    127
XYZB21    125

* Lock Master Node for the resource

{This is a fairly hot file. Here the lock master node is optimal.}

Page 30: OpenVMS Distributed Lock Manager Performance

Lock Activity Program Example

0000002020202032454C494653595302000000D3000C24534D52 RMS$.......SYSFILE2 ...
RMS lock tree for file [12,211,0] on volume SYSFILE2
File specification: DISK$SYSFILE2:[SYSFILE2]SYSUAF.DAT;5
Total: 184
XYZB16     75
XYZB20     48
XYZB23     41
XYZB21     16
XYZB19      2
*XYZB15     1
XYZB13      1
XYZB14      0
XYZB12      0

{This reflects user logins, process creations, password changes, and such. Note the poor lock master node selection here (XYZB16 would be optimal).}

Page 31: OpenVMS Distributed Lock Manager Performance

Example: Application (re)opens file frequently

Symptom: High lock rate on File Access Arbitration Lock for application data file

Cause: BASIC program re-executing OPEN command for a file; BASIC dutifully closes and then re-opens file

Fix: Modify BASIC program to execute OPEN statement only once at image startup time

Page 32: OpenVMS Distributed Lock Manager Performance

Lock Activity Program Example

00000016202020202020202031505041612442313146 F11B$aAPP1 ....
Files-11 File Access Arbitration lock for file [22,*,0] on volume APP1
File specification: DISK$APP1:[DATA]XDATA.IDX;1
Total: 50
*XYZB15    8
XYZB21     7
XYZB16     7
XYZB19     6
XYZB20     6
XYZB23     6
XYZB18     5
XYZB13     3
XYZB12     1
XYZB22     1
XYZB14     1

{This shows that the application is apparently opening (or re-opening) this particular file 50 times per second.}

Page 33: OpenVMS Distributed Lock Manager Performance

Lock Mastership (Resource Mastership) concept

One lock master node is selected by VMS for a given resource tree at a given time

Different resource trees may have different lock master nodes

Page 34: OpenVMS Distributed Lock Manager Performance

Lock Mastership (Resource Mastership) concept

Lock master remembers all locks on a given resource tree for the entire cluster

Each node holding locks also remembers the locks it is holding on resources, to allow recovery if lock master node dies

Page 35: OpenVMS Distributed Lock Manager Performance

Lock Mastership

The lock mastership node may change for various reasons:
Lock master node goes down -- a new master must be elected
VMS may move lock mastership to a “better” node for performance reasons:
A LOCKDIRWT imbalance is found, or
Activity-based Dynamic Lock Remastering, or
The lock master node no longer has interest

Page 36: OpenVMS Distributed Lock Manager Performance

Lock Remastering

Circumstances under which remastering occurs, and does not:
LOCKDIRWT values: VMS tends to remaster to a node with a higher LOCKDIRWT value, never to a node with a lower LOCKDIRWT
Shifting initiated based on activity counters in the root RSB; a non-zero PE1 parameter can prevent movement or place a threshold on lock tree size
Shift if the existing lock master loses interest

Page 37: OpenVMS Distributed Lock Manager Performance

Lock Remastering

VMS rules for the dynamic remastering decision based on activity levels (assuming equal LOCKDIRWT values):

1) Must meet a general threshold of 80 lock requests so far (LCK$GL_SYS_THRSH)

2) The new potential master node must have at least 10 more requests per second than the current master (LCK$GL_ACT_THRSH)

Page 38: OpenVMS Distributed Lock Manager Performance

Lock Remastering

VMS rules for dynamic remastering (continued):

3) The estimated cost to move (based on the size of the lock tree) must be less than the estimated savings (based on the lock rate), except that if the new master meets criterion (2) for 3 consecutive 8-second intervals, the cost is ignored

4) No more than 5 remastering operations can be going on at once on a node (LCK$GL_RM_QUOTA)

Page 39: OpenVMS Distributed Lock Manager Performance

Lock Remastering

VMS rules for dynamic remastering (continued):

5) If PE1 on the current master has a negative value, remastering trees off the node is disabled

6) If PE1 has a positive, non-zero value on the current master, the tree must be smaller than PE1 in size or it will not be remastered
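
The rules on the last three slides can be summarized as a single predicate. The C below is a conceptual model of the documented behavior, not VMS source; everything except the cited LCK$GL_* thresholds, LCK$GL_RM_QUOTA, and the RSB$B_SAME_CNT idea is invented for illustration, and the numbers in main are hypothetical.

#include <stdbool.h>
#include <stdio.h>

struct tree_state {
    unsigned total_requests;      /* requests so far on this tree                 */
    unsigned current_rate;        /* current master's requests per second         */
    unsigned candidate_rate;      /* candidate master's requests per second       */
    unsigned same_count;          /* consecutive 8-second intervals the candidate
                                     has met rule 2 (cf. RSB$B_SAME_CNT)          */
    unsigned tree_size;           /* number of locks in the tree                  */
    unsigned est_cost, est_savings;   /* hypothetical units                       */
    unsigned active_remasters;    /* remaster operations in progress on the node  */
    int      pe1;                 /* PE1 value on the current master              */
};

static bool should_remaster(const struct tree_state *t)
{
    if (t->total_requests < 80) return false;                   /* rule 1 */
    if (t->candidate_rate < t->current_rate + 10) return false; /* rule 2 */
    if (t->same_count < 3 && t->est_cost >= t->est_savings)     /* rule 3 */
        return false;
    if (t->active_remasters >= 5) return false;                 /* rule 4 */
    if (t->pe1 < 0) return false;                               /* rule 5 */
    if (t->pe1 > 0 && t->tree_size >= (unsigned)t->pe1)         /* rule 6 */
        return false;
    return true;
}

int main(void)
{
    struct tree_state t = { 11523, 30, 6455, 0, 15000, 1000, 6455, 0, 0 };
    printf("remaster? %s\n", should_remaster(&t) ? "yes" : "no");
    return 0;
}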

Page 40: OpenVMS Distributed Lock Manager Performance

Lock Remastering

Implications of the dynamic remastering rules:
LOCKDIRWT values must be equal for lock activity levels to control the choice of lock master node
PE1 can be used to control movement of lock trees OFF of a node, but not ONTO a node
The RSB stores the lock activity counts, so even high activity counts can be lost if the last lock is DEQueued on a given node and the RSB thus gets deallocated

Page 41: OpenVMS Distributed Lock Manager Performance

Lock Remastering

Implications of the dynamic remastering rules:
With two or more large CPUs of equal size running the same application, lock mastership “thrashing” is not uncommon: 10 more lock requests per second is not much of a difference when you may be doing 100s or 1,000s of lock requests per second
Whichever new node becomes lock master may then see its own lock rate slow somewhat due to the remote lock request workload

Page 42: OpenVMS Distributed Lock Manager Performance

Lock Remastering

Lock mastership thrashing results in user-visible delays
Lock operations on a tree are stalled during a remaster operation
Locks and Resources were sent over at 1 per SCS message, so remastering large lock trees could take a long time
e.g. 10 to 50 seconds for a 15K-lock tree, prior to 7.2-2
An improvement in VMS 7.2-2 and above gives a very significant performance gain by using 64-Kbyte block data transfers instead of sending 1 SCS message per RSB or LKB

Page 43: OpenVMS Distributed Lock Manager Performance

How to Detect Lock Mastership Thrashing

Detection of remastering activity:
MONITOR RLOCK in 7.3 and above (not 7.2-2)
SDA> SHOW LOCK/SUMMARY in 7.2 and above
Change of mastership node for a given resource
Check message counters under SDA:

SDA> EXAMINE PMS$GL_RM_RBLD_SENT
SDA> EXAMINE PMS$GL_RM_RBLD_RCVD

Counts which increase suddenly by a large amount indicate remastering of large tree(s)
SENT: off of this node; RCVD: onto this node
See example procedures WATCH_RBLD.COM and RBLD.COM

Page 44: OpenVMS Distributed Lock Manager Performance

How to Prevent Lock Mastership Thrashing

Unbalanced node power
Unequal workloads
Unequal values of LOCKDIRWT
Non-zero values of PE1

Page 45: OpenVMS Distributed Lock Manager Performance

How to find the lock master node for a given resource tree

1) Take out a Null lock on the root resource using $ENQ; VMS does the directory lookup and finds out the master node

2) Use $GETLKI to identify the current lock master node’s CSID and the lock count
If the local node is the lock master, and the lock count is 1 (i.e. only our NL lock), there’s no interest in the resource now

Page 46: OpenVMS Distributed Lock Manager Performance

How to find the lock master node for a given resource tree

3) $DEQ to release the lock

4) Use $GETSYI to translate the CSID to an SCS nodename

See example procedure FINDMASTER_FILE.COM and program FINDMASTER.MAR, which can find the lock master node for RMS file resource trees
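
A minimal C sketch of steps 1-4 (this is not FINDMASTER.MAR). Assumptions: the LKI$_MSTCSID and SYI$_NODENAME item codes, the LCK$M_SYSTEM flag with SYSLCK privilege for a system-namespace resource, default 32-bit pointer size, and a hypothetical resource name; check the exact $GETLKI item codes for your VMS version.

#include <starlet.h>
#include <descrip.h>
#include <lckdef.h>
#include <lkidef.h>
#include <syidef.h>
#include <ssdef.h>
#include <stdio.h>

struct lksb { unsigned short status, reserved; unsigned int lkid; };
struct item { unsigned short buflen, itmcod; void *bufadr; unsigned short *retlen; };

int main(void)
{
    /* Hypothetical root resource name; a real one comes from SDA or LCKACT output */
    $DESCRIPTOR(resnam, "RMS$....example-root-resource....");
    struct lksb lksb;
    unsigned int st, mst_csid = 0;
    char nodename[16] = "";
    unsigned short nodelen = 0;

    struct { struct item i; unsigned int end; } lki = {
        { sizeof mst_csid, LKI$_MSTCSID, &mst_csid, 0 }, 0 };
    struct { struct item i; unsigned int end; } syi = {
        { sizeof nodename - 1, SYI$_NODENAME, nodename, &nodelen }, 0 };

    /* 1) Null lock on the root resource (system namespace; needs SYSLCK) */
    st = sys$enqw(0, LCK$K_NLMODE, &lksb, LCK$M_SYSTEM,
                  &resnam, 0, 0, 0, 0, 0, 0, 0);
    if (!(st & 1)) return st;

    /* 2) Ask which node masters this resource */
    st = sys$getlkiw(0, &lksb.lkid, &lki, 0, 0, 0, 0);

    /* 3) Release the Null lock */
    sys$deq(lksb.lkid, 0, 0, 0);

    /* 4) Translate the CSID into an SCS node name */
    if (st & 1)
        st = sys$getsyiw(0, &mst_csid, 0, &syi, 0, 0, 0);

    if (st & 1)
        printf("Lock master: %.*s (CSID %08X)\n", nodelen, nodename, mst_csid);
    return st;
}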

Page 47: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership

Lock Remastering is a good thing:
It maximizes the number of lock requests which are local (and thus fastest) by trying to move lock mastership of a tree to the node with the most activity on that tree

So why would you want to wrest control of lock mastership away from VMS?
To spread the lock mastership workload more evenly across nodes, to help avoid saturation of any single lock master node
To provide the best performance for a specific job by guaranteeing local locking for its files

Page 48: OpenVMS Distributed Lock Manager Performance

How to force lock mastership of a resource tree to a specific node

3 ways to induce VMS to move a lock tree:
1) Generate a lot of I/Os
For example, run several copies of a program that rapidly accesses the file
2) Generate a lot of lock requests, without the associated I/O operations
3) Generate the effect of a lot of lock requests without actually doing them, by modifying VMS’s data structures

Page 49: OpenVMS Distributed Lock Manager Performance

How to force lock mastership of a resource tree to a specific node

We’ll examine:
1) A method using documented features, and thus fully supported
2) A method modifying VMS data structures

Page 50: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership Using Supported Methods

To move a lock tree to a particular node (non-invasive method), assuming PE1 is non-zero on all nodes to start with:

1) Set PE1 to 0 on the existing lock master node to allow dynamic lock remastering of the tree off that node

2) Set PE1 to a negative value (or a small positive value) on the target node to prevent the lock tree from moving off of it afterward

Page 51: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership Using Supported Methods

3) On the target node, take out a Null lock on the root resource

4) Take out a sub-lock of the parent Null lock, and then repeatedly convert it between Null and some other mode
Check periodically to see if the tree has moved yet (using $GETLKI)

5) Once the tree has moved, free the locks

6) Set PE1 back to its original value on the former master node
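
A minimal C sketch of steps 3-5 on the target node (this is not LOTSALOX.MAR). Assumptions: a system-namespace root resource requiring SYSLCK, a hypothetical sub-resource name, and conversion between NL and PW modes; in practice you would interleave the $GETLKI check shown in the earlier sketch instead of running a fixed-count loop.

#include <starlet.h>
#include <descrip.h>
#include <lckdef.h>
#include <ssdef.h>

struct lksb { unsigned short status, reserved; unsigned int lkid; };

int main(void)
{
    $DESCRIPTOR(root, "RMS$....example-root-resource....");  /* hypothetical */
    $DESCRIPTOR(sub,  "DUMMY_SUBLOCK");                      /* hypothetical */
    struct lksb rootlksb, sublksb;
    unsigned int st;

    /* 3) Null lock on the root resource, in the system-wide namespace */
    st = sys$enqw(0, LCK$K_NLMODE, &rootlksb, LCK$M_SYSTEM,
                  &root, 0, 0, 0, 0, 0, 0, 0);
    if (!(st & 1)) return st;

    /* 4) Sub-lock of that parent, then convert it repeatedly NL <-> PW */
    st = sys$enqw(0, LCK$K_NLMODE, &sublksb, 0,
                  &sub, rootlksb.lkid, 0, 0, 0, 0, 0, 0);
    if (!(st & 1)) return st;

    for (int i = 0; i < 100000; i++) {   /* check $GETLKI periodically here */
        sys$enqw(0, LCK$K_PWMODE, &sublksb, LCK$M_CONVERT,
                 0, 0, 0, 0, 0, 0, 0, 0);
        sys$enqw(0, LCK$K_NLMODE, &sublksb, LCK$M_CONVERT,
                 0, 0, 0, 0, 0, 0, 0, 0);
    }

    /* 5) Once the tree has moved, free the locks (sub-lock first) */
    sys$deq(sublksb.lkid, 0, 0, 0);
    sys$deq(rootlksb.lkid, 0, 0, 0);
    return SS$_NORMAL;
}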

Page 52: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership Using Supported Methods

Pros:
Uses only supported interfaces to VMS

Cons:
Generates significant load on the existing lock master, from which you may have been trying to off-load work. In some cases, the node may thus be saturated and unable to initiate lock remastering
Programs running locally on the existing lock master can generate so many requests that the tree won’t move, because you can’t generate nearly as many lock requests remotely

See example program LOTSALOX.MAR

Page 53: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

Goal: Reproduce effect of lots of lock requests without the overhead of the lock requests actually occurring

General Method: Modify activity-related counts and remastering-related fields and flags in root RSB to persuade VMS to remaster the resource tree

Page 54: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

1) Run the program on the node which is presently the lock master

2) Use $GETSYI to get the CSID of the desired target node, given its nodename

3) Lock down code and data

4) $CMKRNL, raise IPL, grab the LCKMGR spinlock

Page 55: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

5) Starting at LCK$GQ_RRSFL listhead, follow chain of root RSBs via RSB$Q_RRSFL links

6) Search for root RSB with matching resource name, access mode, and group (0=System)

Page 56: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

7) Set up to trigger the remaster operation:
Set RSB$L_RM_CSID to the target node’s CSID
Set RSB$B_LSTCSID_IDX to the low byte of the target node’s CSID
Set RSB$B_SAME_CNT to 3 or more so remastering occurs regardless of cost

Page 57: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

Zero our activity counts RSB$W_OACT and RSB$W_NACT so the local lock rate seems low
Set the new-master activity count RSB$W_NMACT to the maximum possible (hex FFFF) to simulate tons of locking activity
Set the RSB$M_RM_PEND flag in the RSB$L_STATUS field to indicate that a remaster operation is now pending

8) Release the LCKMGR spinlock, lower IPL, and let VMS do its job
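
Step 7 restated as C assignments against a stand-in structure. This is schematic only: the real RSB layout comes from VMS’s private $RSBDEF definitions, the struct and bit mask below are placeholders, and REMASTER.MAR does this in kernel mode at elevated IPL while holding the LCKMGR spinlock.

/* Stand-in structure; NOT the real RSB layout */
typedef struct stand_in_rsb {
    unsigned int   rsb_l_status;
    unsigned int   rsb_l_rm_csid;
    unsigned short rsb_w_oact, rsb_w_nact, rsb_w_nmact;
    unsigned char  rsb_b_lstcsid_idx, rsb_b_same_cnt;
} rsb_t;

#define RSB_M_RM_PEND 0x1   /* placeholder for the real RSB$M_RM_PEND bit */

static void trigger_remaster(rsb_t *rsb, unsigned int target_csid)
{
    rsb->rsb_l_rm_csid     = target_csid;        /* desired new master node   */
    rsb->rsb_b_lstcsid_idx = target_csid & 0xFF; /* low byte of the CSID      */
    rsb->rsb_b_same_cnt    = 3;                  /* so the move cost is ignored */
    rsb->rsb_w_oact        = 0;                  /* local activity looks low  */
    rsb->rsb_w_nact        = 0;
    rsb->rsb_w_nmact       = 0xFFFF;             /* new-master activity looks huge */
    rsb->rsb_l_status     |= RSB_M_RM_PEND;      /* remaster is now pending   */
}

int main(void)
{
    rsb_t fake = {0};
    trigger_remaster(&fake, 0x00010031);  /* CSID value borrowed from the LCKRM example */
    return 0;
}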

Page 58: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

Problem (for all methods): Once PE1 is set to zero to allow the desired lock tree to migrate, other lock trees may also migrate, unwanted

Solution: To prevent this, in all other resource trees mastered on this node:
Clear the RM_PEND flag in L_STATUS if set, and
Set W_OACT and W_NACT to the maximum (hex FFFF)
Zero W_NMACT, L_RM_CSID, B_LSTCSID_IDX, and B_SAME_CNT

Page 59: OpenVMS Distributed Lock Manager Performance

Controlling Lock Mastership By Modifying VMS Data Structures

Pros:
Does the job reliably
Can avoid other resource trees “escaping”

Cons:
High-IPL code presents some level of risk of crashing a system

See example program REMASTER.MAR
One might instead use (in 7.2-2 & above):
SDA> LCK REMASTER

Page 60: OpenVMS Distributed Lock Manager Performance

Causes of lock queues

Program bug (e.g. not freeing a record lock)
I/O or interconnect saturation
“Deadman” locks

Page 61: OpenVMS Distributed Lock Manager Performance

How to detect lock queues

Using DECamds / Availability Manager
Using SDA
Using other methods

Page 62: OpenVMS Distributed Lock Manager Performance

Lock contention & DECamds

DECamds can identify lock contention if a lock blocks others for 15 seconds

AMDS$LOCK_LOG.LOG file in AMDS$SYSTEM: contains a log of occurrences of suspected contention

Resource name decoding techniques shown earlier can sometimes be used to identify the file involved

Deadman locks can be filtered out

Page 63: OpenVMS Distributed Lock Manager Performance

Detecting Lock Queues with ANALYZE/SYSTEM (SDA)

A new qualifier was added to the SHOW RESOURCE command in SDA for 7.2 and above:
SHOW RESOURCE/CONTENTION shows blocking and blocked lock requests

A new qualifier was added to the SHOW LOCK command in SDA for 7.2 and above:
SHOW LOCK/WAITING displays blocked lock requests (but then you must determine what’s blocking them)

Page 64: OpenVMS Distributed Lock Manager Performance

Detecting Lock Queues with a program

Traverse lock database starting with LCK$GQ_RRSFL listhead and following chain of root RSBs via RSB$Q_RRSFL links

Within each resource tree, follow RSB$Q_SRSFL chain to examine all sub-resources, recursively

Page 65: OpenVMS Distributed Lock Manager Performance

Detecting Lock Queues with a program

Check the Wait Queue (RSB$Q_WTQFL and RSB$Q_WTQBL)
Check the Convert Queue (RSB$Q_CVTQFL and RSB$Q_CVTQBL)

If queues are found, display:
Queue length(s)
Resource name
Resource names for all parent locks, up to the root lock

See example DCL procedure LCKQUE.COM and program LCKQUE.MAR
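
A user-mode alternative sketch for spot-checking one known resource (this is not the kernel-mode RSB traversal that LCKQUE.MAR performs). Assumptions: the LKI$_WAITCOUNT and LKI$_CVTCOUNT $GETLKI item codes, a system-namespace resource with SYSLCK privilege, and a hypothetical resource name; confirm the item codes against the System Services documentation for your version.

#include <starlet.h>
#include <descrip.h>
#include <lckdef.h>
#include <lkidef.h>
#include <ssdef.h>
#include <stdio.h>

struct lksb { unsigned short status, reserved; unsigned int lkid; };
struct item { unsigned short buflen, itmcod; void *bufadr; unsigned short *retlen; };

int main(void)
{
    $DESCRIPTOR(resnam, "F11B$s....example....");   /* hypothetical resource */
    struct lksb lksb;
    unsigned int waitcnt = 0, cvtcnt = 0, st;

    struct { struct item i[2]; unsigned int end; } itmlst = {
        { { sizeof waitcnt, LKI$_WAITCOUNT, &waitcnt, 0 },
          { sizeof cvtcnt,  LKI$_CVTCOUNT,  &cvtcnt,  0 } }, 0 };

    /* An NL lock makes the resource visible to $GETLKI without blocking anyone */
    st = sys$enqw(0, LCK$K_NLMODE, &lksb, LCK$M_SYSTEM,
                  &resnam, 0, 0, 0, 0, 0, 0, 0);
    if (!(st & 1)) return st;

    st = sys$getlkiw(0, &lksb.lkid, &itmlst, 0, 0, 0, 0);
    sys$deq(lksb.lkid, 0, 0, 0);

    printf("Wait queue: %u, Convert queue: %u\n", waitcnt, cvtcnt);
    return st;
}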

Page 66: OpenVMS Distributed Lock Manager Performance

Example: Directory File Grows Large

Symptom: High queue length on the file serialization lock for a .DIR file

Cause: The directory file has grown to over 127 blocks (VMS version 7.1-2 or earlier; 7.2 and later are much less sensitive to this problem)

Fix: Delete or rename files out of the directory

Page 67: OpenVMS Distributed Lock Manager Performance

Lock Queue Program Example

Here are examples where a directory file got very large under 7.1-2:

'F11B$vAPP2 ' 202020202020202032505041762442313146
Files-11 Volume Allocation lock for volume APP2
'F11B$sH...' 00000148732442313146
Files-11 File Serialization lock for file [328,*,0] on volume APP2
File specification: DISK$APP2:[]DATA.DIR;1
Convert queue: 0, Wait queue: 95

'F11B$vLOGFILE ' 2020202020454C4946474F4C762442313146
Files-11 Volume Allocation lock for volume LOGFILE
'F11B$s....' 00000A2E732442313146
Files-11 File Serialization lock for file [2606,*,0] on volume LOGFILE
File specification: DISK$LOGFILE:[000000]LOGS.DIR;1
Convert queue: 0, Wait queue: 3891

Page 68: OpenVMS Distributed Lock Manager Performance

Example: Fragmented File Header

Symptom: High queue length on File Serialization Lock for application data file

Cause: CONVERTs onto disk without sufficient contiguous space resulted in highly-fragmented files, increasing I/O load on disk array. File was so fragmented it had 3 extension file headers

Fix: Defragment disk, or do an /IMAGE Backup/Restore

Page 69: OpenVMS Distributed Lock Manager Performance

Lock Queue Program Example

Here's an example of the result of reorganizing RMS indexed files with $CONVERTs over a weekend without enough contiguous free space available, causing a lot of file fragmentation and dramatically increasing the I/O load on a RAID array on the next busy day (we had to fix this with a backup/restore cycle soon after). The file shown here had gotten so fragmented as to have 3 extension file headers. The lock we're queueing on here is the file serialization lock for this RMS indexed file:

'F11B$s....' 0000000E732442313146
Files-11 File Serialization lock for file [14,*,0] on volume THDATA
File specification: DISK$THDATA:[TH]OT.IDX;1
Convert queue: 0, Wait queue: 28

Page 70: OpenVMS Distributed Lock Manager Performance

Future Directions for this Investigation Work

Concern: Locking down remastering with PE1 (to avoid lock mastership thrashing) can result in sub-optimal lock master node selections over time

Page 71: OpenVMS Distributed Lock Manager Performance

Future Directions for this Investigation Work

Possible ways of mitigating the side-effects of preventing remastering using PE1:
Adjust the PE1 value as high as you can without producing noticeable delays
Upgrade to 7.2-2 or above for more-efficient remastering
Set PE1 to 0 for short periods, periodically
Raise the fixed threshold values in the VMS data cells LCK$GL_SYS_THRSH and particularly LCK$GL_ACT_THRSH
More-invasive automatic monitoring and control of remastering activity
Enhancements to VMS itself

Page 72: OpenVMS Distributed Lock Manager Performance

How to measure pent-up remastering demand

While PE1 is set to prevent remastering, sub-optimal lock mastership may result: VMS will “want” to move some lock trees but cannot

See example procedure LCKRM.COM and program LCKRM.MAR, which measure pent-up remastering demand

Page 73: OpenVMS Distributed Lock Manager Performance

How to measure pent-up remastering demand

LCKRM example:

Time: 16:19

----- XYZB12: -----

'RMS$..I....SS1 ...' 000000202020202020202020315353020000084900B424534D52
RMS lock tree for file [180,2121,0] on volume SS1
File specification: DISK$SS1:[PDATA]PDATA.IDX;1
Pent-up demand for remaster operation is pending to node XYZB18 (CSID 00010031)
Last CSID Index: 34, Same-count: 0
Average lock rates: Local 44, Remote 512
Status bits: RM_PEND

Page 74: OpenVMS Distributed Lock Manager Performance

Interrupt-state/stack saturation

Too much lock mastership workload can saturate primary CPU on a node

See with MONITOR MODES/CPU=0/ALL

Page 75: OpenVMS Distributed Lock Manager Performance

Interrupt-state/stack saturation

FAST_PATH:
Can shift interrupt-state workload off the primary CPU in SMP systems
An IO_PREFER_CPUS value of an even number disables CPU 0 use
Consider limiting interrupts to a subset of non-primary CPUs
FAST_PATH for CI since 7.0
FAST_PATH for MC: “never”
FAST_PATH for SCSI and FC is in 7.3 and above
FAST_PATH for LANs (e.g. FDDI & Ethernet) slated for 7.3-1
Even with FAST_PATH enabled, CPU 0 still receives the device interrupt, but hands it off immediately via an inter-processor interrupt
7.3-1 is slated to allow FAST_PATH interrupts to bypass CPU 0 entirely and go directly to a non-primary CPU

Page 76: OpenVMS Distributed Lock Manager Performance

Dedicated-CPU Lock Manager

With 7.2-2 and above, you can choose to dedicate a CPU to lock management work. This may help reduce MP_SYNC time.

LCKMGR_MODE parameter:
0 = Disabled
>1 = Enable if at least this many CPUs are running

LCKMGR_CPUID parameter specifies which CPU to dedicate to the LCKMGR_SERVER process

Page 77: OpenVMS Distributed Lock Manager Performance

Example programs

Programs referenced herein may be found:
On the VMS Freeware V5 CD, under directories [KP_LOCKTOOLS] or [KP_CLUSTERTOOLS]
Or on the web at:
http://www.openvms.compaq.com/freeware/freeware50/kp_clustertools/
http://www.openvms.compaq.com/freeware/freeware50/kp_locktools/

New additions & corrections may be found at:
http://encompasserve.org/~parris/

Page 78: OpenVMS Distributed Lock Manager Performance

Example programs

Copies of this presentation (and others) may be found at: http://www.geocities.com/keithparris/

Page 79: OpenVMS Distributed Lock Manager Performance

Questions?

Page 80: OpenVMS Distributed Lock Manager Performance

Speaker Contact Info:

Keith Parris
E-mail: [email protected] or [email protected] or [email protected]
Web: http://encompasserve.org/~parris/ and http://www.geocities.com/keithparris/