
Monitoring Java On The Mainframe

Speaker name: Dave Swift
Speaker company: IBM
Date of presentation: 01/11/2016
Session OA

Purpose of this Presentation

• This deck is intended to provide a brief overview of the IBM OMEGAMON for JVM product
• Sections within this presentation are:
  – Why monitor JVMs on z/OS
  – Introducing IBM OMEGAMON for JVM
  – Resource monitoring of z/OS Connect Enterprise Edition
  – Example scenarios
  – Further Information
• Items flagged on the slides as new were introduced in OMEGAMON for JVM V5.4.0

Why Monitor JVMs?

• Clear understanding of how many and what JVMs are running within an LPAR
• Need to address tuning issues in the operations environment
• Highlight degradation of performance over time
• Leaks leading to inefficient or excessive Garbage Collections
• Diagnose potential OutOfMemory conditions
• Heap issues and native memory leaks
• Contention for shared resources
• Poor performance caused by multiple threads waiting on resources
• Sub-optimal CPU utilisation
• Is work running on general CPU or specialty engines?
• Operating within the correct environment
• Are there subsystems/applications running with insecure JVM levels or settings?
• Are the correct application versions deployed?

Introducing OMEGAMON for JVM on z/OS

• Brand new OMEGAMON monitoring agent focused on assisting z/OS system administrators, operators and SMEs to identify problems, resolve them quicker and optimize performance
• Lightweight overhead compared to other offerings
  – 90% of data collected is through the Health Center API
• Ability to view all JVMs side-by-side. No disconnect when switching between JVMs
• Collects data on any online JVM on z/OS
  – Subsystems: CICS, IMS, DB2, WAS, z/OS Connect, ODM
  – Standalone batch and USS Java applications
  – Can identify and distinguish Liberty JVM servers
• Data presented on both the OMEGAMON enhanced 3270 UI and the Tivoli Enterprise Portal
• Reports on Garbage Collection, Active Threads, Lock Utilization, JVM Environment, CPU Utilization, Native Memory
• Detailed report on z/OS Connect EE resources for hybrid cloud monitoring
• Provides the standard OMEGAMON features:
  – Look back in time with historical data collection
  – Be alerted to abnormal conditions through defined event generation (Situations)
  – Easy to configure and deploy using PARMGEN

[Diagram: the OMEGAMON JVM Agent monitoring JVMs across z/OS Connect, Liberty, Batch/USS, CICS TS, WAS on z/OS, DB2 on z/OS, IMS and ODM on z/OS]

z/OS Connect EE Resource Monitoring

[Diagram: OMEGAMON for JVM monitoring a z/OS Connect Enterprise Edition instance fronting DB2, IMS, CICS and WAS]

Identify Service/API performance issues within z/OS Connect EE instances faster and avoid bottlenecks.

Data Provided:

• Highest JVM Statistics
• JVM Environment: JVM Command Line, System Variables, Env Variables, JVM Parameters, Classpath, Boot Classpath
• Thread Details: Thread State, Contending Object, Stack Trace
• Lock Details: GET Count, Average Hold Time, Slow Gets, Recursive Acquires, Lock Utilization %
• Garbage Collection Statistics: Nursery GC Details, Global GC Details, % Time Paused, Heap Allocation
• CPU Statistics: General CPU, Specialty Processor (IFA) CPU, Specialty Processor Work on General CPU
• Native Memory: LE Heap Details, z/OS Extended Region Detail, Java Native Memory
• z/OS Connect EE Request Metrics: Average Response Time, Slowest Services

Scenario: Visibility of all JVMs

"How much Java are we running? We need to see all JVMs that are currently online."

• JVMs can be found all over the environment. Can you be clear about what is online? Are there JVMs online that are unplanned?
• Starting the JVM Monitor will seek out and find all JVMs on an LPAR, regardless of subsystem type, whether they have been configured for full monitoring or not.
• The agent will capture the jobname, ASID, subsystem type and basic details of the JVM.

[Diagram: the OMEGAMON JVM Agent discovering online CICS TS, WAS, DB2 and IMS JVMs across the LPAR]

Scenario: Visibility of all JVMs

For a JVM to be fully monitored, it must be instrumented to allow OMEGAMON to collect data. If not, we can still determine online JVMs and their subsystem type; these are reported on the second subpanel here. A user can then determine if they want to instrument that JVM for full monitoring.

Scenario: Visibility of all JVMs

Equivalent Tivoli Enterprise Portal screen showing JVMs currently being fully monitored and those detected as being online but not monitored by the JVM agent.

Scenario: Visibility of all JVMs

• To enable full monitoring of a JVM it must be instrumented to allow the OMEGAMON agent to interact with the JVM and issue requests via the Health Center API.
• Typical configuration is a minor change to the JVM startup parameters:

-Xhealthcenter:level=inprocess
-javaagent:/omegamon/uss/install/dir/kan/bin/IBM/kjj.jar

• OMEGAMON code will collect JVM environment information, capture JVM events (for example GCs) and push the details to the OMEGAMON JVM agent.

[Diagram: instrumented CICS TS, WAS, DB2 and IMS JVMs pushing data to the OMEGAMON JVM Agent on the LPAR]
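Once a JVM has been restarted with these options, a quick sanity check from inside the JVM is to read back its startup arguments via the standard java.lang.management API. The sketch below is purely illustrative and only looks for the two option strings shown above; it is not part of the OMEGAMON product.

    import java.lang.management.ManagementFactory;
    import java.util.List;

    public class CheckStartupOptions {
        public static void main(String[] args) {
            // Arguments the JVM was started with (-X options, -javaagent, and so on)
            List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();

            boolean healthCenter = jvmArgs.stream()
                    .anyMatch(a -> a.startsWith("-Xhealthcenter"));
            boolean omegamonAgent = jvmArgs.stream()
                    .anyMatch(a -> a.startsWith("-javaagent") && a.contains("kjj.jar"));

            System.out.println("Health Center enabled : " + healthCenter);
            System.out.println("OMEGAMON agent loaded : " + omegamonAgent);
        }
    }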

Scenario: Optimizing Garbage Collection

"Performance of the JVM is poor. What could be causing this?"

• "Performance of the JVM is poor. Could Garbage Collection be a cause?"
• Performance of the Garbage Collector has improved significantly in recent releases of Java; however, poor performance can still occur due to:
  – Insufficient heap allocation
  – Poorly written applications
• The symptoms of such problems might be:
  – An excessive number of GC events occurring within a given period of time
  – High heap occupancy even after a GC
  – Long pause times when a GC event is occurring
  – System GC events occurring
• The Garbage Collection Details workspaces provide insight into the performance of the JVM GC, allowing the operator to confirm (or dismiss) the JVM as a bottleneck in performance throughput.

Scenario: Optimizing Garbage Collection

The Highest JVM Statistics subpanel shows the poorest performing statistics in key GC metrics. If a threshold is exceeded (for example GC Rate per Minute), zoom into the JVM that potentially has an issue.

Scenario: Optimizing Garbage Collection

GC Details can point out key values that may indicate a problem. A rolling 5-minute interval is used to scale values. Does the Occupancy look OK? Is the average heap size fine?
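The GC symptoms above (GC rate, pause time, heap occupancy after collection) correspond to counters that any JVM also exposes through the standard java.lang.management API. The following is a minimal sketch of sampling GC events per minute and the percentage of time paused; it is only an illustration of the metrics, not how the OMEGAMON agent gathers them (the agent uses the Health Center API, as noted earlier).

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.List;

    public class GcRateSampler {
        public static void main(String[] args) throws InterruptedException {
            List<GarbageCollectorMXBean> gcs = ManagementFactory.getGarbageCollectorMXBeans();

            // First snapshot of cumulative GC counts and accumulated pause time in milliseconds.
            long count0 = 0, time0 = 0;
            for (GarbageCollectorMXBean gc : gcs) {
                count0 += Math.max(0, gc.getCollectionCount());
                time0 += Math.max(0, gc.getCollectionTime());
            }

            Thread.sleep(60_000); // sample over one minute

            long count1 = 0, time1 = 0;
            for (GarbageCollectorMXBean gc : gcs) {
                count1 += Math.max(0, gc.getCollectionCount());
                time1 += Math.max(0, gc.getCollectionTime());
            }

            // An excessive GC rate or a high percentage of time paused are two of the
            // symptoms listed on the slide.
            long events = count1 - count0;
            long pausedMs = time1 - time0;
            System.out.printf("GC events in the last minute: %d%n", events);
            System.out.printf("Time paused in GC: %d ms (%.1f%% of the interval)%n",
                    pausedMs, pausedMs / 600.0);
        }
    }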

Scenario: Out of Memory Conditions

"The address space periodically abends. Can we see what caused the issue?"

• The java.lang.OutOfMemoryError is a severe condition which often occurs with little warning and usually brings down the JVM. The error occurs when the system runs out of memory – either Java heap space or native memory.
• OMEGAMON for JVM constantly monitors the proportion of the maximum Java heap size that is still allocated after garbage collection. If that value exceeds 80%, a situation is triggered which can take actions such as alerting operations staff or application SMEs. If the condition escalates, the application can be restarted in an orderly fashion before it crashes or impacts end-users.
• The Native Memory analysis provides details which can help identify constraints in native memory: either the address space is over-committed or an application has a memory leak. By analyzing metrics such as Language Environment Heap utilization and Extended Region Free %, OMEGAMON can avert major outages due to native memory.
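The 80% rule above is a Situation on the heap-occupancy-after-GC metric. As a purely illustrative sketch, the same quantity can be read inside any JVM via MemoryPoolMXBean.getCollectionUsage(); the 0.80 threshold below simply mirrors the figure quoted on the slide and is not taken from product code.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;
    import java.lang.management.MemoryUsage;

    public class HeapAfterGcCheck {
        // Threshold mirroring the 80%-of-maximum-heap rule described on the slide.
        private static final double THRESHOLD = 0.80;

        public static void main(String[] args) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() != MemoryType.HEAP) {
                    continue; // only Java heap pools are of interest here
                }
                // Usage measured immediately after the most recent garbage collection;
                // may be null if the pool does not support collection usage.
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (afterGc == null || afterGc.getMax() <= 0) {
                    continue;
                }
                double ratio = (double) afterGc.getUsed() / afterGc.getMax();
                System.out.printf("%-25s used-after-GC=%dMB max=%dMB (%.0f%%)%n",
                        pool.getName(), afterGc.getUsed() >> 20, afterGc.getMax() >> 20, 100 * ratio);
                if (ratio > THRESHOLD) {
                    // The product would raise a Situation here; this sketch just logs a warning.
                    System.out.println("  WARNING: heap occupancy after GC exceeds 80% in " + pool.getName());
                }
            }
        }
    }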

Scenario: Out Of Memory Condition

Select a Job Name using the action menu and select option 'N' for 'Native Memory'

Scenario: Out Of Memory Condition

If the Extended Region Free % falls below 10 and continues to fall, it is an indication of a native memory leak. If the value falls below 5, then the JVM may need to be shut down and restarted.

Scenario: Identify Possible Memory Leak

"Can we be alerted to memory issues before they cause an abend?"

A snapshot of data is taken at regular intervals to allow viewing of system status at a specified point in the past.

Scenario: Identify Possible Memory Leak

A creeping rise in the heap occupancy after a GC has been performed is a sign of a possible memory leak. Left unaddressed, it could lead to an Out Of Memory Error, a JVM abend and a core dump.

Scenario: Identifying Locks and Thread Blocks

"Our applications are performing poorly. Can we see what might be the cause?"

• If it is not a GC issue, perhaps threads are being blocked for an excessive period of time, or locks within the JVM are being held for long periods, causing applications to wait for the monitor to yield.
• If high values are found here, the application owner (if applicable) can be alerted, or adjustments to the JVM environment could be made.

Scenario: Identifying Locks and Thread Blocks

Thread Statistics drills down to all active threads, making BLOCKED threads easy to spot.

NEW in V5.4.0 – also shows Thread CPU to spot loops!

Scenario: Identifying Locks and Thread Blocks

The Lock Statistics shows which monitor objects were used as locks most often and how long they were held for.
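The thread state, contending object and lock-owner information that these panels surface are the same kind of data the standard ThreadMXBean exposes. The sketch below is only an illustration of spotting BLOCKED threads from inside a JVM, not a description of how the OMEGAMON agent collects its thread and lock statistics.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class BlockedThreadReport {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();

            // Dump all live threads, including the monitors and synchronizers they hold.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                if (info.getThreadState() != Thread.State.BLOCKED) {
                    continue; // report only threads waiting to enter a monitor
                }
                System.out.printf("BLOCKED  %-30s waiting on %s held by %s (blocked %d times)%n",
                        info.getThreadName(),
                        info.getLockName(),       // the contending monitor object
                        info.getLockOwnerName(),  // the thread currently holding it
                        info.getBlockedCount());
                StackTraceElement[] stack = info.getStackTrace();
                if (stack.length > 0) {
                    // The top frame shows where the thread is trying to acquire the lock.
                    System.out.println("         at " + stack[0]);
                }
            }
        }
    }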

Scenario: Identify Environment Issues

"We need to ensure the Java levels being used are up to date."

• We are able to deep-dive into JVM environment details to view information like the classpath, system properties and the version of Java being used.
• We can also define a situation to check a setting and alert us to a problem – in this case, if a 'bad' Java version is being used.
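For illustration only, the same kind of version and environment check can be expressed in plain Java using standard system properties; the 'bad' version prefixes below are hypothetical placeholders, and the real check in this scenario is a Situation on the JVM Version attribute, not application code.

    public class JavaLevelCheck {
        // Hypothetical examples of Java levels considered out of date.
        private static final String[] BAD_PREFIXES = {"1.6", "1.7"};

        public static void main(String[] args) {
            String version = System.getProperty("java.version");         // e.g. "1.8.0"
            String runtime = System.getProperty("java.runtime.version"); // fuller build identifier
            String classpath = System.getProperty("java.class.path");

            System.out.println("java.version         : " + version);
            System.out.println("java.runtime.version : " + runtime);
            System.out.println("classpath            : " + classpath);

            for (String bad : BAD_PREFIXES) {
                if (version != null && version.startsWith(bad)) {
                    System.out.println("WARNING: JVM is running an out-of-date Java level: " + version);
                }
            }
        }
    }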

Scenario: Identify Environment Issues

In the TEP Situation Editor we create a new Situation to check against the JVM's Version attribute. If this condition is ever met, a Warning alert will be raised.

Scenario: Identify Environment Issues

Once the situation is tripped, you can analyze the current conditions, identify the offending job and take appropriate action.

Scenario: Identify Environment Issues

The Situation Status Tree in the enhanced 3270 UI will show if there is a JVM online with the offending Java level. A user could then take appropriate action.

Scenario: Slow API Response Time

"Reports are coming back that application request response time into z/OS is poor. Can we identify the affected services?"

• It is important to be alerted to poor response times for services/APIs you are making available to consumers, potentially externally, to satisfy application performance and manage varying workloads before application owners raise complaints.
• The z/OS Connect Summary workspace displays all the z/OS Connect Services that have executed in the last 5 minutes. This workspace helps identify slow requests and, in conjunction with the Garbage Collection or Threads workspaces, specific causes of the symptom can be found.
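The workspace measures response times inside the z/OS Connect EE server itself; purely as an illustration of the metric, the sketch below times a single request from the consumer side. The host name, port and API path are hypothetical placeholders, not values from this presentation.

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ApiResponseTimer {
        public static void main(String[] args) throws IOException {
            // Hypothetical z/OS Connect EE API endpoint; substitute a real service URL.
            URL url = new URL("https://zosconnect.example.com:9443/catalogManager/items");

            long start = System.nanoTime();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/json");

            int status = conn.getResponseCode(); // blocks until the response headers arrive

            // Drain the JSON body (error stream for 4xx/5xx responses) to measure its length.
            long bytes = 0;
            InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (body != null) {
                try (InputStream in = body) {
                    byte[] buf = new byte[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        bytes += n;
                    }
                }
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            // A response length of 0 can indicate the request hit an error, as a later slide notes.
            System.out.printf("HTTP %d in %d ms, response length %d bytes%n", status, elapsedMs, bytes);
        }
    }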

Scenario: Slow API Response Time

Identify the z/OS Connect Job by looking at the Application field. Select the Job using option 'Z'.

Sort the rows by 'Avg Response Time'. Identify and select the service name with the highest Avg Response Time. Selecting option 'S' will display more detailed information about a particular request.

Scenario: Slowest z/OS Connect Services

"Response time of services through a z/OS Connect EE instance is slow. Can we investigate requests to deduce the issue?"

• There may be certain properties that will point us to a problem around the reported service performance issue. Maybe there is something specific about the slowest requests, the client connected, or the payload being submitted.
• The z/OS Connect Slowest Requests workspace displays the five worst performing requests over the last 5 minutes for a particular z/OS Connect service. It can be used to provide diagnostic information about a specific request, which can help determine why a particular request performed poorly.

Scenario: Slowest z/OS Connect Services

Identify the desired Service Name you want more details for and select it with option 'S'

Scenario: Slowest z/OS Connect Services

Here you can view the five slowest requests, their request ID, method, response time, etc.

Requests with a larger length may take longer, as they are sent as JSON, which might have overhead depending on the subsystem being called.

In addition, if the Response Length is 0, there is no JSON response and the request may have encountered an error, which could also cause a slow response time.

The OMEGAMON Portfolio

[Diagram: the OMEGAMON portfolio – Service Management Suite on z/OS, OMEGAMON Performance Management Suite on z/OS and OMEGAMON z/OS Management Suite – with products including Service Management Unite, NetView for z/OS, System Automation for z/OS, Tivoli Asset Discovery, OMEGAMON on z/OS, OMEGAMON for JVM, OMEGAMON for CICS, OMEGAMON for IMS, OMEGAMON for DB2 PE, OMEGAMON for Messaging, OMEGAMON for Storage, OMEGAMON Mainframe Networks, OMEGAMON Dashboard Edition and ITCAM for Application Diagnostics]

More Information / References

• OMEGAMON Product Home
  – Overview and product information for all OMEGAMON products
  – www.ibm.com/OMEGAMON
• Service Management Connect
  – Blogs, forums, articles and best practice videos for IBM z Systems monitoring
  – www.ibm.com/developerworks/servicemanagement/z
  – Examples:
    • Introducing OMEGAMON Monitoring for JVM
    • Using OMEGAMON to Diagnose Slow JVMs Through Thread Data
    • OMEGAMON JVM monitoring for z/OS Locking Data
• OMEGAMON Monitoring for JVM Technote
  – Summary of latest fixes, known issues and updates
  – www.ibm.biz/OMEGJVMTechnote

Contacts

• Offering Management
  – Nathan Brice nbrice@uk.ibm.com
  – Chris Walker crwalker@us.ibm.com
• Release Management
  – Jeff Summers summerje@us.ibm.com
  – Dan Kitay dkitay@us.ibm.com
• Marketing Enablement
  – John Knutson knutson@uk.ibm.com
• Sales Enablement
  – Giulio Peri giulio_peri@it.ibm.com
  – Diego Bessone dbessone@us.ibm.com

Video Overview

• Short 4-minute overview introducing the OMEGAMON Monitoring Feature for JVM
  – YouTube direct link: https://youtu.be/QcqnD_B3xsg
  – Service Management Connect blog with the video embedded: www.ibm.biz/OMEGJVMVideoBlog

Session feedback

"This is the last slide in the deck."

• Please submit your feedback at http://conferences.gse.org.uk/2016/feedback/nn
• Session is nn