CPS Troubleshooting Guide, Release 10.1.0 First Published: September 02, 2016 Last Modified: September 02, 2016 Americas Headquarters Cisco Systems, Inc. 170 West Tasman Drive San Jose, CA 95134-1706 USA http://www.cisco.com Tel: 408 526-4000 800 553-NETS (6387) Fax: 408 527-0883


    THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE. ALL STATEMENTS, INFORMATION, AND RECOMMENDATIONS IN THIS MANUAL ARE BELIEVED TO BE ACCURATE BUT ARE PRESENTED WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. USERS MUST TAKE FULL RESPONSIBILITY FOR THEIR APPLICATION OF ANY PRODUCTS.

    THE SOFTWARE LICENSE AND LIMITED WARRANTY FOR THE ACCOMPANYING PRODUCT ARE SET FORTH IN THE INFORMATION PACKET THAT SHIPPED WITH THE PRODUCT AND ARE INCORPORATED HEREIN BY THIS REFERENCE. IF YOU ARE UNABLE TO LOCATE THE SOFTWARE LICENSE OR LIMITED WARRANTY, CONTACT YOUR CISCO REPRESENTATIVE FOR A COPY.

    The Cisco implementation of TCP header compression is an adaptation of a program developed by the University of California, Berkeley (UCB) as part of UCB's public domain version of the UNIX operating system. All rights reserved. Copyright © 1981, Regents of the University of California.

    NOTWITHSTANDING ANY OTHER WARRANTY HEREIN, ALL DOCUMENT FILES AND SOFTWARE OF THESE SUPPLIERS ARE PROVIDED "AS IS" WITH ALL FAULTS. CISCO AND THE ABOVE-NAMED SUPPLIERS DISCLAIM ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING, WITHOUT LIMITATION, THOSE OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OR ARISING FROM A COURSE OF DEALING, USAGE, OR TRADE PRACTICE.

    IN NO EVENT SHALL CISCO OR ITS SUPPLIERS BE LIABLE FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, OR INCIDENTAL DAMAGES, INCLUDING, WITHOUT LIMITATION, LOST PROFITS OR LOSS OR DAMAGE TO DATA ARISING OUT OF THE USE OR INABILITY TO USE THIS MANUAL, EVEN IF CISCO OR ITS SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

    Any Internet Protocol (IP) addresses and phone numbers used in this document are not intended to be actual addresses and phone numbers. Any examples, command display output, network topology diagrams, and other figures included in the document are shown for illustrative purposes only. Any use of actual IP addresses or phone numbers in illustrative content is unintentional and coincidental.

    Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: http://www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R)

    © 2016 Cisco Systems, Inc. All rights reserved.

    Contents

    Preface
    About this guide
    Audience
    Additional Support
    Version Control Software
    Conventions (all documentation)
    Obtaining Documentation and Submitting a Service Request

    Chapter 1: Troubleshooting CPS
    General Troubleshooting
    Gathering Information
    Basic Troubleshooting
    Trace Support Commands
    trace.sh
    trace_id.sh
    Periodic Monitoring
    RADIUS Troubleshooting
    E2E Call Flow Troubleshooting
    Diameter Error Codes and Scenarios
    LDAP Error Codes
    Rare Troubleshooting Scenarios
    Recovery using Remove/Add members Option
    Remove Failed Members
    Add Failed Members
    Maintenance Window Procedures
    Prior to Any Maintenance
    Change Request Procedure
    Software Upgrades
    Application Restarts
    VM Restarts
    Hardware Restarts
    Planned Outages
    Non-maintenance Window Procedures
    Common Troubleshooting Tasks
    Kill All Cisco Processes From the Command Line as Root
    Low or Out of Disk Space
    df Command
    du Command
    Diameter Issues
    High CPU Usage Issue
    JVM Crash
    High Memory Usage/Out of Memory Error
    Issues with Output displayed on Grafana
    Enable Debug Logs
    Install SAR Tool
    Installation
    Disabling
    Frequently Encountered Scenarios
    Subscriber not Mapped on SCE
    CPS Server Will Not Start and Nothing is in the Log
    Server returned HTTP Response Code: 401 for URL
    com.broadhop.exception.BroadhopException Unable to Find System Configuration for System
    Log Files Display the Wrong Time but the Linux Time is Correct
    JMX Management Beans are not Deployed
    Unable to Access Binding Information
    Error Processing Package, Reference Data Does Not Exist for NAS IP...
    REST Web Service Queries Returns an Empty XML Response for an Existing User
    Error in Datastore: "err" : "E11000 Duplicate Key Error Index
    Error Processing Request: Unknown Action
    Memcached Server is in Error
    Firewall Error: Log shows Host Not Reachable, or Connection Refused
    Unknown Error in Logging: License Manager
    Ecore File is Not Generated:
    Logging Does Not Appear to be Working
    Cannot Connect to Server Using JMX: No Such Object in Table
    File System Check (FSCK) Errors
    CPS: 27717 Mongo Stuck in STARTUP2 after sessionMgr01/2 Reboot
    SR: 628099455 System Failure Errors in Control Center
    Multi-user Policy Builder Errors
    Policy Reporting Configuration not getting updated post CPS Upgrade
    CPS Memory Usage
    Errors while Installing HA Setup
    Enable/disable Debit Compression
    Diameter proxy error in diagnostics.sh output
    Not able to Publish the Policy in Policy Builder
    CPS not sending SNMP traps to External NMS server
    Diameter Peer Connectivity is Down
    Policy Builder Loses Repositories
    Not able to access IPv6 Gx port from PCEF/GGSN
    Bring up sessionmgr VM from RECOVERY state to SECONDARY state
    ZeroMQ Connection Established between Policy Director and other site Policy Server
    Troubleshooting CPS upgrade from existing 7.0
    Diagnose Diameter No Response for Peer Message
    Not able to access Policy Builder
    Graphs in Grafana are lost when time on VMs are changed
    Systems is not enabled for Plugin Configuration
    Publishing is not Enabled
    Collecting MongoDB Information for Troubleshooting
    Added Check to Switch to Unknown Service if Subscriber is deleted Mid Session
    Could not Build Indexes for Table
    Error Submitting Message to Policy Director (lb) during Longevity
    Mismatch between Statistics Count and Session Count
    Disk Statistics not Populated in Grafana after CPS Upgrade
    Re-create Session Shards
    Session Switches from Known to Unknown in CCR-U Request
    Intermittent BSON Object Size Error in createsub with Mongo v3.2.1
    No Traps Generated When Number of Sessions Exceeds the Limit
    RAR Message not Received
    No Response to Diameter Request
    Admin Database shows Problem in Connecting to the Server
    Locale MAC Error
    Sessions Stored in a Single Shard
    Licensing not Throwing Traps or Diagnostic Errors upon Breach
    Corosync Process Taking lot of Time to Unload and is Stuck
    Issue related to Firewall
    CPS Setup cannot Handle High TPS
    Troubleshoot ANDSF
    Policy Builder Scenarios
    Not Able to See DM Configuration Tab in Policy Builder after Installation
    Diagnostic.sh throws Errors after Restart
    Not Getting GCM Notifications in Logs
    Session is not created for iPhone and Android Users
    Check for service Use Case Templates for GCM, APNS, General, and default Services
    Control Center Scenarios
    Subscriber Session not getting Created and Getting Exception Error (401)
    SSID Credentials are Wrongly Passed in Policy
    DM Tree Lookups Fail and Exception in consolidated-qns.log
    Data Populated in MongoDB ANDSF Collection, but values are not shown in Control Center
    Not able to see the Mobile Configuration Certificate sub screen in Control Center
    Control Center session timeout frequently and not able to login from another browser
    Geo-location is not read Properly in Control Center
    ANDSF Server Scenarios
    API Error Codes
    General Errors
    Problem Accessing ua/soap Getting Jetty Related Error
    Check if Blank Policy is Retrieved in SyncML Response
    Policy Engine didn't Return a Management Response
    Notification Errors
    GCM Notification
    APNS Notification
    SNMP Traps and Key Performance Indicators (KPIs)
    Full (HA) Setup
    All-in-one (AIO) Setup
    Testing Traps Generated by CPS
    Component Notifications
    Application Notifications
    SNMP System and Application KPI Values
    SNMP System KPIs
    Application KPI Values
    FAQs
    Reference Document

    Chapter 2: Check Subscriber Access
    Checking Access
    Testing Subscriber Access with 00.testAccessRequest.sh
    Testing Subscriber Access with soapUI
    Testing for ISG Functionality and Connectivity with test aaa Scripts

    Chapter 3: TCP Dumps
    About TCP Dumps
    TCPDUMP Command
    Options
    Specific Traffic Types
    Capture RADIUS Traffic
    Capture SNMP Traffic
    Other Ports

    Chapter 4: Call Flows
    One-Click Call Flow
    User/Password Login Call Flow
    Data-limited Voucher Call Flow
    Time-limited Voucher Call Flow
    EAP-TTLS Call Flow
    Service Selection Call Flow
    MAC TAL Call Flow
    Tiered Services Call Flow
    Diameter Call Flows
    Receive and Queuing of Diameter Message at Policy Director
    Request Processing at PCRF (Policy Server)
    Rules Call Flow
    Response Creation and Sending
    RAR Call Flow

    Chapter 5: Logging
    Overview
    CPS Logs
    Application/Script Produces Logs: Deploy Logs
    Application/Script Produces Logs: policy server
    Application/Script Produces Logs: policy server pb
    Application/Script Produces Logs: mongo
    Application/Script Produces Logs: httpd
    Application/Script Produces Logs: license manager
    Application/Script Produces Logs: svn
    Application/Script Produces Logs: auditd
    Application/Script Produces Logs: graphite
    Application/Script Produces Logs: kernel
    Basic Troubleshooting Using CPS Logs
    Logging Level and Effective Logging Level
    Consolidated Application Logging
    Enable Debug Logs
    Enable Unified API Request and Response Logging
    Rsyslog Log Processing
    Rsyslog Overview
    Rsyslog-proxy
    Configuration for HA Environments
    Configuration for AIO
    Enable Consolidated Syslog Output to Files on OAM VMs
    Configuration of Logback.xml
    Basic Troubleshooting Using ANDSF Logs
    Debugging Common Errors using Logging Techniques of ANDSF
    Debugging Common Call Flow Scenarios for ANDSF using Logging Patterns
    Generic Call Flow For Android
    Generic Call Flow For Apple
    GCM Notification
    APNS Notification
    Notification for Revalidation Timer

    Preface

    • About this guide

    • Audience

    • Additional Support

    • Version Control Software

    • Conventions (all documentation)

    • Obtaining Documentation and Submitting a Service Request

    About this guide

    This guide describes how to troubleshoot Cisco Policy Suite.

    Audience

    This guide is best used by these readers:

    • Network administrators

    • Network engineers

    • Network operators

    • System administrators

    This document assumes a general understanding of network architecture, configuration, and operations.

    Additional Support

    For further documentation and support:

    • Contact your Cisco Systems, Inc. technical representative.

    • Call the Cisco Systems, Inc. technical support number.

    • Write to Cisco Systems, Inc. at [email protected].

    • Refer to the support matrix at http://www.support.cisco.com and to other documents related to Cisco Policy Suite.

    Version Control Software

    Cisco Policy Builder uses version control software to manage its various data repositories. The default installed version control software is Subversion, which is provided in your installation package.

    Conventions (all documentation)

    This document uses the following conventions. Each convention is followed by its indication.

    bold font
        Commands and keywords and user-entered text appear in bold font.

    italic font
        Document titles, new or emphasized terms, and arguments for which you supply values are in italic font.

    [ ]
        Elements in square brackets are optional.

    {x | y | z }
        Required alternative keywords are grouped in braces and separated by vertical bars.

    [ x | y | z ]
        Optional alternative keywords are grouped in brackets and separated by vertical bars.

    string
        A nonquoted set of characters. Do not use quotation marks around the string or the string will include the quotation marks.

    courier font
        Terminal sessions and information the system displays appear in courier font.

    < >
        Nonprinting characters such as passwords are in angle brackets.

    [ ]
        Default responses to system prompts are in square brackets.

    !, #
        An exclamation point (!) or a pound sign (#) at the beginning of a line of code indicates a comment line.

    Note: Means reader take note. Notes contain helpful suggestions or references to material not covered in the manual.

    Caution: Means reader be careful. In this situation, you might perform an action that could result in equipment damage or loss of data.

    Warning: IMPORTANT SAFETY INSTRUCTIONS.

    Means danger. You are in a situation that could cause bodily injury. Before you work on any equipment, be aware of the hazards involved with electrical circuitry and be familiar with standard practices for preventing accidents. Use the statement number provided at the end of each warning to locate its translation in the translated safety warnings that accompanied this device.

    SAVE THESE INSTRUCTIONS

    Warning: Provided for additional information and to comply with regulatory and customer requirements.

    Obtaining Documentation and Submitting a Service Request

    For information on obtaining documentation, using the Cisco Bug Search Tool (BST), submitting a service request, and gathering additional information, see What's New in Cisco Product Documentation (http://www.cisco.com/c/en/us/td/docs/general/whatsnew/whatsnew.html).

    To receive new and revised Cisco technical content directly to your desktop, you can subscribe to the What's New in Cisco Product Documentation RSS feed (http://www.cisco.com/assets/cdc_content_elements/rss/whats_new/whatsnew_rss_feed.xml). RSS feeds are a free service.

    Chapter 1: Troubleshooting CPS

    • General Troubleshooting

    • Diameter Error Codes and Scenarios

    • LDAP Error Codes

    • Maintenance Window Procedures

    • Non-maintenance Window Procedures

    • Common Troubleshooting Tasks

    • Frequently Encountered Scenarios

    • Troubleshoot ANDSF

    • SNMP Traps and Key Performance Indicators (KPIs)

    General Troubleshooting

    • Find out if your problem is related to CPS or another part of your network.

    • Gather information that facilitates the support call.

    • Are there specific SNMP traps being reported that can help you isolate the issue?

    Gathering Information

    Determine the Impact of the Issue

    • Is the issue affecting subscriber experience?

    • Is the issue affecting billing?

    • Is the issue affecting all subscribers?

    • Is the issue affecting all subscribers on a specific service?

    • Is there anything else common to the issue?

    • Have there been any changes performed on the CPS system or any other systems?

    • Has there been an increase in subscribers?

    • Initially, categorize the issue to determine the level of support needed.

    Basic Troubleshooting

    Capture the following details in most error cases:

    Step 1 Output of the following commands:

        diagnostics.sh

        about.sh

    Step 2 Collect all the logs:

    • The archive created at /var/log/broadhop on pcrfclient01 and pcrfclient02 includes the consolidated policy server (qns) logs. Make sure that the consolidated logs cover the time when the issue happened.

    • SSH to all available policy server (qns) and load balancer (lb) VMs and capture the following logs:

        /var/log/broadhop/qns-*.log

        /var/log/broadhop/qns-*.log.gz

        /var/log/broadhop/service-qns-*.log

        /var/log/broadhop/service-qns-*.log.gz

    • SSH to all the available sessionmgr VMs and capture the following mongoDB logs:

        /var/log/mongodb-*.log

        /var/log/mongodb-*.log.gz

    • SSH to all available VMs and capture the following logs:

        /var/log/messages*

    Step 3 CPS configuration details present at /etc/broadhop.

    Step 4 SVN repository

    To export the SVN repository, go to /etc/broadhop/qns.conf and copy the URL specified against com.broadhop.config.url.

    For example,

        -Dcom.broadhop.config.url=http://pcrfclient01/repos/run

    Run the following command to export the SVN repository:

        svn export

    Step 5 Top command on all available VMs to display the top CPU processes on the system:

        top -b -n 30

    Step 6 Output of top_qps.sh from the pcrfclient01 VM, collected over a period of 10-15 minutes at an interval of 5 seconds:

        top_qps.sh 5

    Step 7 Output of the following command on the load balancer (lb) VMs that have the issue:

        netstat -plan

    Step 8 Output of the following command on all VMs:

        service iptables status

    Step 9 Details mentioned in Periodic Monitoring.

    Step 10 Steps to reproduce the issue.
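    Step 4 asks you to copy the repository URL out of qns.conf by hand. As a minimal sketch, assuming qns.conf carries the option in the same -Dcom.broadhop.config.url=... form as the example above, the lookup could be scripted as follows. The function name config_url is hypothetical and is not part of CPS.

```shell
#!/bin/sh
# config_url FILE
# Print the value of the -Dcom.broadhop.config.url option found in FILE.
# Hypothetical helper for illustration only; it is not shipped with CPS.
# Only the first match is printed, and the value is assumed to end at the
# first whitespace character or double quote.
config_url() {
  sed -n 's/.*-Dcom\.broadhop\.config\.url=\([^[:space:]"]*\).*/\1/p' "$1" | head -n 1
}
```

    On a CPS node the result could then be passed to the export command, for example svn export "$(config_url /etc/broadhop/qns.conf)" followed by a destination directory of your choice.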

    Trace Support Commands

    This section covers the following two commands:

    • trace.sh

    • trace_id.sh

    trace.sh

    trace.sh usage:

        /var/qps/bin/control/trace.sh -i -d sessionmgr01:27719/policy_trace

        /var/qps/bin/control/trace.sh -x -d sessionmgr01:27719/policy_trace

        /var/qps/bin/control/trace.sh -a -d sessionmgr01:27719/policy_trace

        /var/qps/bin/control/trace.sh -e -d sessionmgr01:27719/policy_trace

    This script starts a selective trace and outputs it to standard out.

    • Specific Audit Id Tracing

        $0 -i

    • Dump All Traces for Specific Audit Id

        $0 -x

    • Trace All

        $0 -a

    • Trace All Errors

        $0 -e

    trace_id.sh

    trace_id.sh usage:

        /var/qps/bin/control/trace_ids.sh -i -d sessionmgr01:27719/policy_trace

        /var/qps/bin/control/trace_ids.sh -r -d sessionmgr01:27719/policy_trace

        /var/qps/bin/control/trace_ids.sh -x -d sessionmgr01:27719/policy_trace

        /var/qps/bin/control/trace_ids.sh -l -d sessionmgr01:27719/policy_trace

    This script starts a selective trace and outputs it to standard out.

    • Add Specific Audit Id Tracing

        $0 -i

    • Remove Trace for Specific Audit Id

        $0 -r

    • Remove Trace for All Ids

        $0 -x

    • List All Ids under Trace

        $0 -l

    Periodic Monitoring

    • Run the following command on pcrfclient01 and verify that all the processes are reported as Running.

      For CPS 6.1.0 and lower releases:

        /opt/broadhop/control/statusall.sh

      For CPS 7.0.0 and higher releases:

        /var/qps/bin/control/statusall.sh

        Program 'cpu_load_trap'
          status                            Waiting
          monitoring status                 Waiting

        Process 'collectd'
          status                            Running
          monitoring status                 Monitored
          uptime                            42d 17h 23m

        Process 'auditrpms.sh'
          status                            Running
          monitoring status                 Monitored
          uptime                            28d 20h 26m

        System 'qns01'
          status                            Running
          monitoring status                 Monitored

        The Monit daemon 5.5 uptime: 21d 10h 26m

        Process 'snmpd'
          status                            Running
          monitoring status                 Monitored
          uptime                            21d 10h 26m

        Process 'qns-1'
          status                            Running
          monitoring status                 Monitored
          uptime                            6d 17h 9m

    • Run the /var/qps/bin/diag/diagnostics.sh command on pcrfclient01 and verify that no errors or failures are reported in the output.

        /var/qps/bin/diag/diagnostics.sh
        CPS Diagnostics HA Multi-Node Environment
        ---------------------------
        Ping check for all VMs...Hosts that are not 'pingable' are added to the IGNORED_HOSTS variable...[PASS]
        Checking basic ports for all VMs...[PASS]
        Checking qns passwordless logins for all VMs...[PASS]
        Checking disk space for all VMs...[PASS]
        Checking swap space for all VMs...[PASS]
        Checking for clock skew for all VMs...[PASS]
        Checking CPS diagnostics...
        Retrieving diagnostics from qns01:9045...[PASS]
        Retrieving diagnostics from qns02:9045...[PASS]
        Retrieving diagnostics from qns03:9045...[PASS]
        Retrieving diagnostics from qns04:9045...[PASS]
        Retrieving diagnostics from pcrfclient01:9045...[PASS]
        Retrieving diagnostics from pcrfclient02:9045...[PASS]
        Checking svn sync status between pcrfclient01 & 02...
        svn is not sync between pcrfclient01 & pcrfclient02...[FAIL]
        Corrective Action(s): Run ssh pcrfclient01 /var/qps/bin/support/recover_svn_sync.sh
        Checking HAProxy statistics and ports...

    • Perform the following actions to verify that the status of every VM is reported as UP and healthy and that no alarms are generated for any VM.

      ◦ Log in to the VMware console.

      ◦ Verify the VM statistics, graphs and alarms through the console.

    • Verify if any trap is generated by CPS.

        cd /var/log/snmp

        tailf trap

    • Verify if any error is reported in CPS logs.

        cd /var/log/broadhop

        grep -i error consolidated-qns.log

        grep -i error consolidated-engine.log

    • Monitor the following KPIs on Grafana for any abnormal behavior:

      ◦ CPU usage of all instances

        CPU Idle on Active Load Balancer (LB)

        CPU Idle on Standby Load Balancer (LB)

        CPU Idle on Policy Server (QNS) VMs

        CPU Idle on sessionmgr VMs: Session database, Balance, Reporting, SPR

        CPU Idle on OAM (pcrfclient) VM

      ◦ Memory usage of all instances

        Memory Free on Active Load Balancer (LB)

        Memory Free on Standby Load Balancer (LB)

        Memory Free on Policy Server (QNS) VMs

        Memory Free on sessionmgr VMs

        Memory Free on OAM (pcrfclient) VMs

      ◦ Free disk space on all instances

        Disk Space Free on Active Load Balancer (LB)

        Disk Space Free on Standby Load Balancer (LB)

        Disk Space Free on Policy Server (QNS) VMs

        Disk Space Free on sessionmgr VMs: Session database, Balance, Reporting, SPR

        Disk Space Free on OAM (pcrfclient) VMs

      ◦ Diameter messages load: CCR-I, CCR-U, CCR-T, AAR, RAR, STR, ASR

      ◦ Diameter messages response time: CCR-I, CCR-U, CCR-T, AAR, RAR, STR, ASR

    • Errors for diameter messages.

      Run the following command on pcrfclient01:

        tailcons | grep diameter | grep -i error

    • Response time for sessionmgr insert/update/delete/query.

      ◦ Average read, write, and total time per second:

        mongotop --host sessionmgr* --port port_number

      ◦ For requests taking more than 100 ms, SSH to the sessionmgr VMs and run:

        tailf /var/log/mongodb-.log

      Note: By default, the above command displays requests taking more than 100 ms, unless the mongod process has been started with the --slowms XYZ parameter, where XYZ is the threshold in milliseconds chosen by the user.

    • Garbage collection.

      Check the service-qns-*.log from all policy server (QNS), load balancer (lb) and PCRF VMs. In the logs, look for "GC" or "FULL GC".

    • Session count.

      Run the following command on pcrfclient01:

        session_cache_ops.sh --count

    • Run the following command on pcrfclient01 and verify that the response time is under the expected value and that no errors are reported.

        /opt/broadhop/qns-1/control/top_qps.sh

    • Use the following command to check mongoDB statistics on queries/inserts/updates/deletes for all CPS databases (and on all primary and secondary databases) and verify if there are any abnormalities (for example, a high number of inserts/updates/deletes considering the TPS, or a large number of queries going to the other site).

        mongostat --host --port

      For example,

        mongostat --host sessionmgr01 --port 27717

    • Use the following command for all CPS databases and verify if any high usage is reported in the output. The session database is used here as an example:

        mongotop --host --port

      For example,

        mongotop --host sessionmgr01 --port 27717

    • Verify EDRs are getting generated by checking the count of entries in the CDR database.

    • Verify EDRs are getting replicated by checking the count of entries in the MySQL database.

    • Determine the most recently inserted CDR record in the MySQL database and compare the insert time with the time the CDR was generated. The time difference should be within 2 minutes; a larger difference signifies a lag in replication.

    • Count of CCR-I/CCR-U/CCR-T/RAR messages from/to GW.

    • Count of failed CCR-I/CCR-U/CCR-T/RAR messages from/to GW. If the GW has the capability, capture details at error code level.

      Run the following command on pcrfclient01:

        cd /var/broadhop/stats

        grep "Gx_CCR-" bulk-*.csv

    • Response time of CCR-I/CCR-U/CCR-T messages at GW.

    • Count of sessions in PCRF and count of sessions in GW. There could be some mismatch between the counts due to the time gap between determining the session count from CPS and from the GW. If the count difference is high, it could indicate stale sessions on PCRF or GW.

    • Count of AAR/RAR/STR/ASR messages from/to Application Function.

    • Count of failed AAR/RAR/STR/ASR messages from/to Application Function. If the Application Function has the capability, capture details at error code level.

      Run the following command on pcrfclient01:

        cd /var/broadhop/stats

        grep "Rx_AAR-" bulk-*.csv

    • Response time of AAR/RAR/STR/ASR messages at Application Function.

    • Count of sessions in PCRF and count of sessions in Application Function. There could be some mismatch between the counts due to the time gap between determining the session count from CPS and from the Application Function. If the count difference is high, it could indicate stale sessions on PCRF or Application Function.

      Count of sessions in PCRF:

        session_cache_ops.sh -count
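    The first two checks in this section (statusall.sh and diagnostics.sh) produce long output in which a single bad entry is easy to miss. The following is a minimal sketch of two filters, assuming the output keeps the layout shown in the samples above (a Process 'name' line followed by a status line, and [FAIL] markers at the end of failed checks). The function names not_running and failed_checks are hypothetical and are not part of CPS.

```shell
#!/bin/sh
# not_running: read statusall.sh (monit) output on stdin and print every
# "Process" entry whose status is anything other than Running.
# Hypothetical helper; assumes the layout shown in the sample output above.
not_running() {
  awk -v q="'" '
    # Remember the kind and name of the entry currently being read.
    /^[ \t]*(Process|Program|System|File)[ \t]/ {
      kind = $1; name = $2; gsub(q, "", name); next
    }
    # "status ..." lines only; "monitoring status ..." starts with another word.
    $1 == "status" && kind == "Process" {
      st = $0
      sub(/^[ \t]*status[ \t]+/, "", st)
      sub(/[ \t]+$/, "", st)
      if (st != "Running") print name ": " st
    }
  '
}

# failed_checks: read diagnostics.sh output on stdin and print the lines
# marked [FAIL] together with any "Corrective Action(s)" lines.
failed_checks() {
  grep -E '\[FAIL\]|^[[:space:]]*Corrective Action'
}
```

    For example, /var/qps/bin/control/statusall.sh | not_running prints nothing when every process is Running, and /var/qps/bin/diag/diagnostics.sh | failed_checks prints only the failed checks and their suggested corrective actions.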

    RADIUS Troubleshooting

    • Test service definition requests from a PEP such as ISG to the CPS by running the following command:

        test aaa group radius L4REDIRECT_SERVICE password legacy

      Repeat this command for PBHK_SERVICE and OPENGARDEN_SERVICE.

    • Listen for RADIUS traffic from the PEP by logging in to lb01 and lb02 and running the following command:

        tcpdump -i any port 1812 -s 0 -vvv

    Test general subscriber access with the procedures in Check Subscriber Access.

    E2E Call Flow Troubleshooting

    • On an All-in-One deployment, run the following command:

        tcpdump -i -s 0 -vv

      ◦ Append -w /tmp/callflow.pcap to capture the output to a Wireshark file.

    • Open the file in Wireshark and filter on HTTP or RADIUS to assist in debugging the call flow.

    • In a distributed model, you need to run tcpdump on individual VMs:

      ◦ Load balancers on ports 1812, 1813, 1700, 8080 and 3868

    Correct call flows are shown in Call Flows.
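    The load balancer ports listed above can be combined into one capture filter instead of running a separate capture per port. The following is a minimal sketch that only builds and prints the command; the function name port_filter is hypothetical, and the use of -i any is an assumption taken from the RADIUS example earlier in this chapter.

```shell
#!/bin/sh
# port_filter PORT...
# Build a tcpdump filter expression that matches any of the given ports.
# Hypothetical helper for illustration; it does not run tcpdump itself.
port_filter() {
  filter=""
  for p in "$@"; do
    if [ -z "$filter" ]; then
      filter="port $p"
    else
      filter="$filter or port $p"
    fi
  done
  printf '%s\n' "$filter"
}

# Ports named in this section for the load balancers.
LB_FILTER=$(port_filter 1812 1813 1700 8080 3868)

# Print the resulting command so it can be reviewed before it is run.
printf 'tcpdump -i any -s 0 -vv %s\n' "$LB_FILTER"
```

    Printing the command rather than executing it keeps the sketch safe to run anywhere; append -w /tmp/callflow.pcap as described above when a capture file is needed.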

    Diameter Error Codes and Scenarios

    The following table describes some common Diameter error codes and scenarios:

    Table 1: Common Diameter Error Codes and Scenarios

    Code | Name | CPS Scenarios
    2001 | DIAMETER_SUCCESS | Everything went well and the request was processed successfully.
    3002 | DIAMETER_UNABLE_TO_DELIVER | Message cannot be delivered, either because no host within the realm supporting the required application was available to process the request or because the Destination-Host AVP was given without the associated Destination-Realm AVP.
    3004 | DIAMETER_TOO_BUSY | Message got discarded by the overload handling mechanism. Note: CPS 7.5 adds the option to silently discard instead of sending DIAMETER_TOO_BUSY, as discarding is often a better way to have the other node back off instead of immediately resending the request in an overload scenario.
    3007 | DIAMETER_APPLICATION_UNSUPPORTED | A request was sent for an application that is not supported.
    3010 | DIAMETER_UNKNOWN_PEER | A CER was received from an unknown peer.
    4141 | DIAMETER_PCC_BEARER_EVENT | For some reason a PCC rule cannot be enforced or modified successfully in a network initiated procedure. The reason is provided in the Event Trigger AVP value.
    4241 | DIAMETER_ERROR_NO_AVAILABLE_POLICY_COUNTERS | Error used by the OCS to indicate to the PCRF that the OCS has no available policy counters for the subscriber.
    5002 | DIAMETER_UNKNOWN_SESSION_ID | The request contained an unknown Session-Id.
    5003 | DIAMETER_AUTHORIZATION_REJECTED | A request was received for which the user could not be authorized. No session is created, for various reasons. For example, this error could occur if the service requested is not permitted to the user.
    5010 | DIAMETER_NO_COMMON_APPLICATION | A CER message was received and there are no common applications supported between the peers.
    5012 | DIAMETER_UNABLE_TO_COMPLY | Message rejected because something else went wrong and there is no specific reason.
    5030 | DIAMETER_USER_UNKNOWN | Subscriber not found in SPR.
    5141 | DIAMETER_ERROR_TRIGGER_EVENT | The set of bearer/session information sent in a CCR, originated because a trigger event was met, is incoherent with the previous set of bearer/session information for the same bearer/session.
    5142 | DIAMETER_PCC_RULE_EVENT | For some reason the PCC rules cannot be installed/activated. The reason is provided in the Event Trigger AVP value.
    5143 | DIAMETER_ERROR_BEARER_NOT_AUTHORIZED | Emergency service related. Used when the PCRF cannot authorize an IP-CAN bearer upon the reception of an IP-CAN bearer authorization request coming from the PCEF.
    5144 | DIAMETER_ERROR_TRAFFIC_MAPPING_INFO_REJECTED | Emergency service related. Used when the PCRF does not accept one or more of the traffic mapping filters.
    5570 | DIAMETER_ERROR_UNKNOWN_POLICY_COUNTERS | Error used by the OCS to indicate to the PCRF that the OCS does not recognize one or more Policy Counters specified in the request, when the OCS is configured to reject the request provided with unknown policy counter identifier(s).
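
When reading a capture, the first digit of a Diameter Result-Code already narrows down the kind of failure. The sketch below groups codes by thousands, following the result code classes defined in the Diameter base protocol (RFC 6733). The function name `classify_result_code` and the small name lookup are illustrative assumptions, not part of CPS.

```python
# Result-Code classes from the Diameter base protocol (RFC 6733).
DIAMETER_CLASSES = {
    1: "informational",
    2: "success",
    3: "protocol error",
    4: "transient failure",
    5: "permanent failure",
}

# A few names taken from Table 1 of this guide.
KNOWN_NAMES = {
    2001: "DIAMETER_SUCCESS",
    3004: "DIAMETER_TOO_BUSY",
    5002: "DIAMETER_UNKNOWN_SESSION_ID",
    5030: "DIAMETER_USER_UNKNOWN",
}


def classify_result_code(code: int) -> str:
    """Map a Result-Code to its class using the thousands digit."""
    return DIAMETER_CLASSES.get(code // 1000, "unknown")


def describe(code: int) -> str:
    """Build a one-line summary for logs or tickets."""
    name = KNOWN_NAMES.get(code, "UNLISTED")
    return f"{code} {name}: {classify_result_code(code)}"


if __name__ == "__main__":
    for code in (2001, 3004, 4141, 5030):
        print(describe(code))
```

A protocol error (3xxx) usually points at routing or peer problems, while a permanent failure (5xxx) means the same request should not simply be retried.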

    LDAP Error Codes

    The following table describes LDAP error codes:

    Table 2: LDAP Error Codes

    Code | Name | Definition | Flags (Counts as Timeout, Triggers Retry, Sent To Policy Server, Terminate Connection, Not Applicable to Search)
    0 | SUCCESS | The result code (0) that will be used to indicate a successful operation. | Y
    1 | OPERATIONS_ERROR | The result code (1) that will be used to indicate that an operation was requested out of sequence. | YY
    2 | PROTOCOL_ERROR | The result code (2) that will be used to indicate that the client sent a malformed request. | YY
    3 | TIME_LIMIT_EXCEEDED | The result code (3) that will be used to indicate that the server was unable to complete processing on the request in the allotted time limit. | YY
    4 | SIZE_LIMIT_EXCEEDED | The result code (4) that will be used to indicate that the server found more matching entries than the configured request size limit. | Y
    5 | COMPARE_FALSE | The result code (5) that will be used if a requested compare assertion does not match the target entry. | Y

    6 | COMPARE_TRUE | The result code (6) that will be used if a requested compare assertion matched the target entry. | Y
    7 | AUTH_METHOD_NOT_SUPPORTED | The result code (7) that will be used if the client requested a form of authentication that is not supported by the server. | Y
    8 | STRONG_AUTH_REQUIRED | The result code (8) that will be used if the client requested an operation that requires a strong authentication mechanism. | Y
    10 | REFERRAL | The result code (10) that will be used if the server sends a referral to the client to refer to data in another location. | Y
    11 | ADMIN_LIMIT_EXCEEDED | The result code (11) that will be used if a server administrative limit has been exceeded. | Y
    12 | UNAVAILABLE_CRITICAL_EXTENSION | The integer value (12) for the "UNAVAILABLE_CRITICAL_EXTENSION" result code. | Y

    13 | CONFIDENTIALITY_REQUIRED | The result code (13) that will be used if the server requires a secure communication mechanism for the requested operation. | Y
    14 | SASL_BIND_IN_PROGRESS | The result code (14) that will be returned from the server after SASL bind stages in which more processing is required. | Y
    16 | NO_SUCH_ATTRIBUTE | The result code (16) that will be used if the client referenced an attribute that does not exist in the target entry. | Y
    17 | UNDEFINED_ATTRIBUTE_TYPE | The result code (17) that will be used if the client referenced an attribute that is not defined in the server schema. | Y
    18 | INAPPROPRIATE_MATCHING | The result code (18) that will be used if the client attempted to use an attribute in a search filter in a manner not supported by the matching rules associated with that attribute. | Y
    19 | CONSTRAINT_VIOLATION | The result code (19) that will be used if the requested operation would violate some constraint defined in the server. | Y

    20 | ATTRIBUTE_OR_VALUE_EXISTS | The result code (20) that will be used if the client attempts to modify an entry in a way that would create a duplicate value, or create multiple values for a single-valued attribute. | Y
    21 | INVALID_ATTRIBUTE_SYNTAX | The result code (21) that will be used if the client attempts to perform an operation that would create an attribute value that violates the syntax for that attribute. | Y
    32 | NO_SUCH_OBJECT | The result code (32) that will be used if the client targeted an entry that does not exist. | Y
    33 | ALIAS_PROBLEM | The result code (33) that will be used if the client targeted an entry that is an alias. | Y
    34 | INVALID_DN_SYNTAX | The result code (34) that will be used if the client provided an invalid DN. | Y
    36 | ALIAS_DEREFERENCING_PROBLEM | The result code (36) that will be used if a problem is encountered while the server is attempting to dereference an alias. | Y

    48 | INAPPROPRIATE_AUTHENTICATION | The result code (48) that will be used if the client attempts to perform a type of authentication that is not supported for the target user. | Y
    49 | INVALID_CREDENTIALS | The result code (49) that will be used if the client provided invalid credentials while trying to authenticate. | Y
    50 | INSUFFICIENT_ACCESS_RIGHTS | The result code (50) that will be used if the client does not have permission to perform the requested operation. | Y
    51 | BUSY | The result code (51) that will be used if the server is too busy to process the requested operation. | YY
    52 | UNAVAILABLE | The result code (52) that will be used if the server is unavailable. | YY
    53 | UNWILLING_TO_PERFORM | The result code (53) that will be used if the server is not willing to perform the requested operation. | YY
    54 | LOOP-DETECT | The result code (54) that will be used if the server detects a chaining or alias loop. | Y

    60 | SORT_CONTROL_MISSING | The result code (60) that will be used if the client sends a virtual list view control without a server-side sort control. | Y
    61 | OFFSET_RANGE_ERROR | The result code (61) that will be used if the client provides a virtual list view control with a target offset that is out of range for the available data set. | Y
    64 | NAMING_VIOLATION | The result code (64) that will be used if the client request violates a naming constraint (e.g., a name form or DIT structure rule) defined in the server. | Y
    65 | OBJECT_CLASS_VIOLATION | The result code (65) that will be used if the client request violates an object class constraint (e.g., an undefined object class, a disallowed attribute, or a missing required attribute) defined in the server. | Y
    66 | NOT_ALLOWED_ON_NONLEAF | The result code (66) that will be used if the requested operation is not allowed to be performed on non-leaf entries. | Y

    67 | NOT_ALLOWED_ON_RDN | The result code (67) that will be used if the requested operation would alter the RDN of the entry but the operation was not a modify DN request. | Y
    68 | ENTRY_ALREADY_EXISTS | The result code (68) that will be used if the requested operation would create a conflict with an entry that already exists in the server. | Y
    69 | OBJECT_CLASS_MODS_PROHIBITED | The result code (69) that will be used if the requested operation would alter the set of object classes defined in the entry in a disallowed manner. | Y
    71 | AFFECTS_MULTIPLE_DSAS | The result code (71) that will be used if the requested operation would impact entries in multiple data sources. | Y
    76 | VIRTUAL_LIST_VIEW_ERROR | The result code (76) that will be used if an error occurred while performing processing associated with the virtual list view control. | Y
    80 | OTHER | The result code (80) that will be used if none of the other result codes are appropriate. | YY

    81 | SERVER_DOWN | The client-side result code (81) that will be used if an established connection to the server is lost. | YY
    82 | LOCAL_ERROR | The client-side result code (82) that will be used if a generic client-side error occurs during processing. | YY
    83 | ENCODING_ERROR | The client-side result code (83) that will be used if an error occurs while encoding a request. | YY
    84 | DECODING_ERROR | The client-side result code (84) that will be used if an error occurs while decoding a response. | YY
    85 | TIMEOUT | The client-side result code (85) that will be used if a client timeout occurs while waiting for a response from the server. | YYY
    86 | AUTH_UNKNOWN | The client-side result code (86) that will be used if the client attempts to use an unknown authentication type. | Y
    87 | FILTER_ERROR | The client-side result code (87) that will be used if an error occurs while attempting to encode a search filter. | Y

    88 | USER_CANCELED | The client-side result code (88) that will be used if the end user canceled the operation in progress. | Y
    89 | PARAM_ERROR | The client-side result code (89) that will be used if there is a problem with the parameters provided for a request. | Y
    90 | NO_MEMORY | The client-side result code (90) that will be used if the client does not have sufficient memory to perform the requested operation. | YY
    91 | CONNECT_ERROR | The client-side result code (91) that will be used if an error occurs while attempting to connect to a target server. | YY
    92 | NOT_SUPPORTED | The client-side result code (92) that will be used if the requested operation is not supported. | Y
    93 | CONTROL_NOT_FOUND | The client-side result code (93) that will be used if the response from the server did not include an expected control. | Y
    94 | NO_RESULTS_RETURNED | The client-side result code (94) that will be used if the server did not send any results. | Y

    95 | MORE_RESULTS_TO_RETURN | The client-side result code (95) that will be used if there are still more results to return. | Y
    96 | CLIENT_LOOP | The client-side result code (96) that will be used if the client detects a loop while attempting to follow referrals. | Y
    97 | REFERRAL_LIMIT_EXCEEDED | The client-side result code (97) that will be used if the client encountered too many referrals in the course of processing an operation. | Y
    118 | CANCELED | The result code (118) that will be used if the operation was canceled. | Y
    119 | NO_SUCH_OPERATION | The result code (119) that will be used if the client attempts to cancel an operation that doesn't exist in the server. | Y
    120 | TOO_LATE | The result code (120) that will be used if the client attempts to cancel an operation too late in the processing for that operation. | Y
    121 | CANNOT_CANCEL | The result code (121) that will be used if the client attempts to cancel an operation that cannot be canceled. | Y

    122 | ASSERTION_FAILED | The result code (122) that will be used if the requested operation included the LDAP assertion control but the assertion did not match the target entry. | Y
    123 | AUTHORIZATION_DENIED | The result code (123) that will be used if the client is denied the ability to use the proxied authorization control. | Y
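
The table describes codes 81 through 97 as client-side result codes, which is a useful first split during triage: those codes are raised by the LDAP client itself rather than returned by the directory server. The following sketch encodes only that distinction. The helper names are hypothetical, and the range is read from the definitions in Table 2 rather than from any CPS API.

```python
# Codes whose definitions in Table 2 begin with "The client-side result code".
CLIENT_SIDE_CODES = range(81, 98)  # 81..97 inclusive


def is_client_side(code: int) -> bool:
    """True when the table marks the code as generated by the LDAP client."""
    return code in CLIENT_SIDE_CODES


def triage_hint(code: int, name: str) -> str:
    """Return a short hint on where to start looking."""
    if is_client_side(code):
        return f"{name} ({code}): client-side, check connectivity and client settings"
    return f"{name} ({code}): not client-side, check the LDAP server response"


if __name__ == "__main__":
    print(triage_hint(85, "TIMEOUT"))
    print(triage_hint(49, "INVALID_CREDENTIALS"))
```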

    Rare Troubleshooting Scenarios

    Recovery using Remove/Add members Option

    When the Arbiter blade and a sessionmgr blade go down, no primary sessionmgr node is left to serve requests coming from CPS VMs (classic HA setup: 1 arbiter, 2 sessionmgrs). As a result, the system becomes unstable.

    The safe way to recover is to bring the down blades back to a working state. If that is not possible, the only way to keep the setup working is to remove the failed members of the replica-set from the mongo configuration. The sessionmgr node that is still up and running then becomes primary. The failed members must be added back to the replica-set once they come online.

    The following sections describe how to remove failed members from a mongo replica-set and how to add them back once they are online.

    Note: Execute the steps in the following sections exactly as described.

    Note: Perform the following steps only when just one sessionmgr is UP but is in secondary mode and cannot become primary on its own, and the down blades (holding the arbiter and primary sessionmgr VMs) cannot be brought back to operational mode.

    Remove Failed Members

    Use this option when one or more members are not running and are treated as failed members. The script removes all such failed members from the replica-set.

    Step 1 Log in to pcrfclient01/02.
    Step 2 Execute the diagnostics script to find out which replica-set or component has failed and needs to be removed.

    diagnostics.sh --get_replica_status

    Step 3 Execute build_set.sh with the options below to remove the failed member/s from the replica-set. This operation removes all failed members across the site.
    cd /var/qps/bin/support/mongo/

    For session database:

    ./build_set.sh --session --remove-failed-members

    For SPR database:

    ./build_set.sh --spr --remove-failed-members

    For balance database:

    ./build_set.sh --balance --remove-failed-members

    For report database:

    ./build_set.sh --report --remove-failed-members

    Step 4 Execute the diagnostics script again to verify that the particular member is removed.
    diagnostics.sh --get_replica_status

    Note: If the status is not shown properly by the above command, log in to the mongo port on sessionmgr and check the replica status.

    Figure 1: Replica Status

    Add Failed Members

    Step 1 Log in to pcrfclient01/02.
    Step 2 Once the failed members are back online, they can be added back to the replica-set.
    Step 3 Execute the diagnostics script to find out which replica-set member is not in the configuration or has failed.

    diagnostics.sh --get_replica_status

    Note: If the status is not shown properly by the above command, log in to the mongo port on sessionmgr and check the replica status.

    Figure 2: Replica Status

    cd /var/qps/bin/support/mongo

    For session database:

    ./build_set.sh --session --add-members

    For SPR database:

    ./build_set.sh --spr --add-members

    For balance database:

    ./build_set.sh --balance --add-members

    For report database:

    ./build_set.sh --report --add-members
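
The remove and add procedures use the same script with one database flag and one action flag. As a minimal sketch, assuming a hypothetical helper named `build_set_command`, the valid combinations can be generated and validated before anything is run. The flags are the ones listed in the steps above, and the function only returns the argument list.

```python
from typing import List

# Database flags and action flags exactly as they appear in the procedures.
DATABASES = ("session", "spr", "balance", "report")
ACTIONS = {
    "remove": "--remove-failed-members",
    "add": "--add-members",
}


def build_set_command(database: str, action: str) -> List[str]:
    """Return the build_set.sh argument list for one database and action."""
    if database not in DATABASES:
        raise ValueError(f"unknown database: {database}")
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return ["./build_set.sh", f"--{database}", ACTIONS[action]]


if __name__ == "__main__":
    # Print the commands to run from /var/qps/bin/support/mongo.
    for db in DATABASES:
        print(" ".join(build_set_command(db, "add")))
```

Rejecting unknown names up front avoids passing a mistyped flag to the script during a recovery.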


    Maintenance Window Procedures

    The usual tasks for a maintenance window might include these:

    Prior to Any Maintenance

    Back up all relevant information to an offline resource. For more information on backup, see the Cisco Policy Suite Backup and Restore Guide.

    • Data - Back up all database information. This includes Cisco MsBM and Cisco Unified SuM.

    Note: Sessions can be backed up as well.

    • Configurations - Back up all configuration information. This includes SVN (from PCRF Client) and the /etc/broadhop directory from all PCRFs.

    • Logs - Back up all logs for comparison with the upgrade. This is not required but will be helpful if there are any issues.

    Change Request Procedure

    • Have proper sign-off for any change request. Cisco and all customer teams must sign off.

    • Make sure the proposed procedures are well defined.

    • Make sure the rollback procedures are correct and available.

    Software Upgrades

    • Determine whether the software upgrade will cause an outage and requires a maintenance window.

    • Typically, software upgrades can be done on one node at a time, which minimizes or eliminates any outage.

    • Most of the time an upgrade requires a restart of the application. Most applications can be started in less than 1 minute.

    Application Restarts

    Application restarts are component independent. These are the components:

    • PCRF/PCRF Client

    • Load Balancer/IO Manager

    • sessionMgr

    IO Manager, PCRF, PCRF Client

    • IO Managers and PCRF give up their resources and allow the failovers to take over. They can be stopped directly with service qns restart.

    • PCRF Client is a GUI application and can be restarted at any point. If SVN is restarted, the PCRF applications continue to run but throw errors saying that they cannot check for new configurations. This will not impact the environment.

    • sessionMgr is deployed as active-standby and is used by the policy server to maintain the subscriber session state information.

    • Load Balancers distribute the load for RADIUS, Web Services, MySQL, LDAP and SVN. Two load balancers are deployed for each Cisco Policy Suite in active/passive mode.

    VM Restarts

    • Linux must be shut down normally for VM restarts.

    • All VMs are Linux.

    • The preferred methods are init 0 or shutdown -h.

    • Failure to use the Linux OS shutdown can result in VM corruption and problems restarting the VM and applications.

    • A VM restart is typically done to increase resources to the VM (disk, memory, CPU).

    Hardware Restarts

    • Hardware restarts should be rare.

    • When a hardware restart is needed, VMs must be shut down first.

    • When all VMs are stopped, shut down the hardware with either the ESXi console or as a power off.

    Planned Outages

    • Planned outages are similar to hardware restarts.

    • VMs need to be shut down first; hardware can then be stopped.

    • When hardware is started, the typical hardware starting order is:

    ◦Start the servers with PCRFClient01, LB01 and SessionMgr01 first.

    ◦Start all other servers in any order after that.

    Non-maintenance Window Procedures

    Tasks you can perform outside a maintenance window, that is, at any time, are these:

    • Data archiving or warehousing

    • Log removal

    Common Troubleshooting Tasks

    This section describes frequently used troubleshooting tasks you might use before calling support or as directed by support.

    Kill All Cisco Processes From the Command Line as Root

    Depending on the Linux version, one or both of these ps commands are applicable. Remove the portion '| xargs kill -9' if you want to test out the command.

    These commands do the following:

    • print out all processes (ps) then

    • search (grep) for all processes that do not contain the word grep or mysql then

    • use sed to remove all the remaining text except for the PID value and then

    • send that PID to kill -9.

    [root@lab ~]# ps -A
    PID TTY TIME CMD
    1 ? 00:00:00 init
    2 ? 00:00:01 migration/0
    3 ? 00:00:00 ksoftirqd/0
    4 ? 00:00:01 migration/1
    5 ? 00:00:00 ksoftirqd/1
    6 ? 00:18:49 events/0
    7 ? 00:00:00 events/1
    8 ? 00:00:00 khelper
    49 ? 00:00:00 kthread
    54 ? 00:00:00 kblockd/0
    55 ? 00:00:00 kblockd/1
    56 ? 00:00:00 kacpid
    217 ? 00:00:00 cqueue/0
    218 ? 00:00:00 cqueue/1
    221 ? 00:00:00 khubd
    223 ? 00:00:00 kseriod
    299 ? 00:00:00 khungtaskd
    300 ? 00:00:00 pdflush
    301 ? 00:01:09 pdflush
    302 ? 00:00:01 kswapd0
    303 ? 00:00:00 aio/0
    304 ? 00:00:00 aio/1
    510 ? 00:00:00 kpsmoused
    554 ? 00:00:00 mpt_poll_0
    555 ? 00:00:00 mpt/0
    556 ? 00:00:00 scsi_eh_0
    560 ? 00:00:00 ata/0
    561 ? 00:00:00 ata/1
    562 ? 00:00:00 ata_aux
    569 ? 00:00:00 kstriped
    582 ? 00:00:00 ksnapd
    597 ? 00:04:19 kjournald
    623 ? 00:00:00 kauditd
    656 ? 00:00:00 udevd
    2168 ? 00:00:00 kmpathd/0
    2169 ? 00:00:00 kmpathd/1
    2171 ? 00:00:00 kmpath_handlerd
    2194 ? 00:00:00 kjournald
    2664 ? 00:00:00 vmmemctl
    2795 ? 00:02:38 vmtoolsd
    2877 ? 00:00:00 iscsi_eh
    2920 ? 00:00:00 cnic_wq
    2924 ? 00:00:00 bnx2i_thread/0
    2925 ? 00:00:00 bnx2i_thread/1
    2939 ? 00:00:00 ib_addr
    2949 ? 00:00:00 ib_mcast
    2950 ? 00:00:00 ib_inform
    2951 ? 00:00:00 local_sa
    2954 ? 00:00:00 iw_cm_wq
    2958 ? 00:00:00 ib_cm/0
    2959 ? 00:00:00 ib_cm/1
    2963 ? 00:00:00 rdma_cm
    2984 ? 00:00:00 iscsiuio
    3784 ? 00:06:49 snmpd
    3799 ? 00:00:00 snmptrapd
    3814 ? 00:00:21 memcached
    3836 ? 00:00:00 sshd
    3857 ? 00:00:00 ntpd
    3870 ? 00:00:00 mysqld_safe
    3925 ? 00:00:19 mysqld
    3977 ? 00:00:01 gpm
    3992 ? 00:00:01 httpd
    4006 ? 00:03:18 collectd
    4058 ? 00:00:00 crond
    4077 ? 00:00:42 pcrfclient_avai
    4079 ? 00:00:34 qns_availabilit
    4081 ? 00:00:17 database_availa
    4082 ? 00:00:13 server_availabi
    4098 ? 00:00:00 atd
    4169 ? 00:00:03 avahi-daemon
    4170 ? 00:00:00 avahi-daemon
    4302 ? 2-15:01:05 java
    4324 ? 00:21:27 java
    4375 ? 00:20:32 mongod
    4380 tty1 00:00:00 mingetty
    4381 tty2 00:00:00 mingetty
    4382 tty3 00:00:00 mingetty
    4383 tty4 00:00:00 mingetty
    4384 tty5 00:00:00 mingetty
    4393 tty6 00:00:00 mingetty
    4395 ? 00:00:00 gdm-binary
    4433 ? 00:00:00 gdm-binary
    4435 ? 00:00:02 gdm-rh-security
    4436 tty7 00:05:28 Xorg
    5022 ? 00:00:00 ntpd
    14425 ? 00:00:00 sshd
    14487 pts/0 00:00:00 bash
    14823 ? 00:00:00 sleep
    14837 ? 00:00:00 sleep
    14854 ? 00:00:00 sleep
    15014 ? 00:00:00 sleep
    15019 pts/0 00:00:00 ps
    25203 ? 00:00:00 gnome-vfs-daemo
    25316 ? 00:00:00 pam_timestamp_c
    28836 ? 00:00:06 httpd
    28837 ? 00:00:06 httpd
    28838 ? 00:00:06 httpd
    28839 ? 00:00:06 httpd
    28840 ? 00:00:06 httpd
    28841 ? 00:00:06 httpd
    28842 ? 00:00:06 httpd
    28843 ? 00:00:06 httpd
    [root@lab ~]#
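
The selection logic described above (list processes, drop anything containing grep or mysql, keep only the PID) can be rehearsed safely on captured `ps -A` output before any kill is issued. This is a minimal sketch under stated assumptions: `select_pids` is a hypothetical helper, the optional `include` filter is an addition for narrowing the selection, and nothing in it sends a signal.

```python
from typing import List, Sequence


def select_pids(ps_output: str,
                excluded: Sequence[str] = ("grep", "mysql"),
                include: str = "") -> List[int]:
    """Return PIDs from `ps -A` output whose command passes the filters."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header row, shell prompts and blank lines.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        command = fields[3]
        if any(word in command for word in excluded):
            continue
        # `include` is an assumption: an optional substring to narrow the match.
        if include and include not in command:
            continue
        pids.append(int(fields[0]))
    return pids


if __name__ == "__main__":
    sample = """PID TTY TIME CMD
3925 ? 00:00:19 mysqld
4302 ? 2-15:01:05 java
4324 ? 00:21:27 java
15019 pts/0 00:00:00 ps"""
    print(select_pids(sample, include="java"))
```

Reviewing the returned list first plays the same role as removing '| xargs kill -9' from the shell pipeline.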

    Low or Out of Disk Space

    To determine the disk space used, use these Linux disk usage and disk free commands:

    • du

    • df

    df Command

    df

    For example:
    [root@lab home]# df -h
    Filesystem Size Used Avail Use% Mounted on
    /dev/cciss/c0d0p5 56G 27G 26G 51% /
    /dev/cciss/c0d0p1 99M 12M 83M 12% /boot
    tmpfs 2.0G 0 2.0G 0% /dev/shm
    none 2.0G 0 2.0G 0% /dev/shm
    /dev/cciss/c0d0p2 5.8G 4.0G 1.6G 73% /home

    As shown above, the /home directory is using the largest share of its allocated space (73%).
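
Spotting the fullest filesystem by eye works for five rows, but the same check can be scripted for regular monitoring. The sketch below parses `df -h` output and reports mount points at or above a threshold. The function name `filesystems_over` and the 70 percent threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple


def filesystems_over(df_output: str, threshold_pct: int) -> List[Tuple[str, int]]:
    """Return (mount point, use%) pairs at or above the threshold."""
    result = []
    for line in df_output.splitlines():
        fields = line.split()
        # Data rows have six columns; the fifth is Use% such as "73%".
        if len(fields) != 6:
            continue
        pct = fields[4].rstrip("%")
        if not pct.isdigit():
            continue
        if int(pct) >= threshold_pct:
            result.append((fields[5], int(pct)))
    return result


if __name__ == "__main__":
    sample = """Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p5 56G 27G 26G 51% /
/dev/cciss/c0d0p1 99M 12M 83M 12% /boot
tmpfs 2.0G 0 2.0G 0% /dev/shm
/dev/cciss/c0d0p2 5.8G 4.0G 1.6G 73% /home"""
    print(filesystems_over(sample, 70))
```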

    du Command

    The /home directory typically holds /home/admin, but in some cases there is also /home/qns or /home/remote. You can check each of them.

    du

    For example:
    [root@lab home]# du -hs
    160M .
    [root@lab home]# du -hs *
    1.3M qns
    158M remote
    36K testuser

    The du command shows where the space is being used. By default, the du command by itself reports usage for the directory specified and all subdirectories below it.

    Note: By deleting any directories, you remove the ability to roll back if for some reason an update is not working correctly. Only delete those updates to which you would probably never roll back, perhaps those 6 months old and older.

    Diameter Issues

    The following details need to be captured for Diameter issues:

    • Details of service associated with subscribers in failure case.

    • Pcaps capturing calls having issue.

    • If the issue is that no response is received, the pcap should be captured both at CPS and at the peer.

    • Subscriber trace information can be captured using the following process:

    ◦To add the subscriber that needs to be traced:
    /var/qps/bin/control/trace_ids.sh -i <subscriber-id> -d sessionmgr01:<port>/policy_trace

    cd /var/qps/bin/control

    ◦Run the following command to obtain subscriber information:
    /var/qps/bin/control/trace.sh -i <subscriber-id> -d sessionmgr01:<port>/policy_trace

    If CPS receives a request message for the same subscriber, the trace result is displayed.

    Note: The port number can be found in the “Trace DB Database” configuration in Cluster-1. If the Trace Database is not configured, then by default “Admin Db Configuration” will pick up the trace database.

    High CPU Usage Issue

    • Capture thread details and jstack output as follows:

    ◦From the top output, see if a java process is taking high CPU.

    ◦Capture the output of the following command:
    ps -C java -L -o pcpu,cpu,nice,state,cputime,pid,tid | sort > tid.log

    ◦Capture the output of the following command, where <pid> is the pid of the process causing high CPU (as per the top output):

    If the java process is running as the root user:

    jstack <pid> > jstack.log

    If the java process is running as the policy server (qns) user:

    sudo -u qns jstack <pid> > jstack.log

    If the above commands report an error because the process is hung or not responding, use the -F option after jstack.

    ◦Capture another jstack output as above, but with the additional -l option.
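
To connect the two captures, note that jstack prints each thread's native id as a hexadecimal `nid=0x...` value, while ps reports the TID in decimal. The sketch below, which assumes the seven-column layout produced by the ps command above and uses a hypothetical helper name `busiest_threads`, ranks threads by CPU and converts each TID to the form to search for in jstack.log.

```python
from typing import List, Tuple


def busiest_threads(tid_log: str, top: int = 3) -> List[Tuple[float, int, str]]:
    """Return (pcpu, tid, nid_hex) for the threads using the most CPU."""
    rows = []
    for line in tid_log.splitlines():
        fields = line.split()
        # Expected columns: pcpu cpu nice state cputime pid tid
        if len(fields) != 7:
            continue
        try:
            pcpu, tid = float(fields[0]), int(fields[6])
        except ValueError:
            continue  # header row or malformed line
        # jstack shows the native thread id in hex, for example nid=0x10e1.
        rows.append((pcpu, tid, hex(tid)))
    rows.sort(reverse=True)
    return rows[:top]


if __name__ == "__main__":
    sample = """%CPU CPU NI S TIME PID TID
0.1 - 0 S 00:00:01 4302 4302
85.0 - 0 R 00:10:00 4302 4321"""
    for pcpu, tid, nid in busiest_threads(sample):
        print(f"tid={tid} nid={nid} pcpu={pcpu}")
```

Searching jstack.log for the printed `nid` value leads directly to the stack of the thread that is consuming CPU.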

    JVM Crash

    The JVM generates a fatal error log file that contains the state of the process at the time of the fatal error. By default, the file is named hs_err_pid<pid>.log and is generated in the working directory from which the corresponding java process was started (that is, the working directory of the user when the user started the policy server (qns) process). If the working directory is not known, search the system for files named hs_err_pid*.log and look at the file whose timestamp matches the time of the error.

    High Memory Usage/Out of Memory Error

    • The JVM can generate a heap dump in case of an out of memory error. By default, CPS is not configured to generate a heap dump. To generate one, add the following parameters to the /etc/broadhop/jvm.conf file for the different CPS instances present:
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/tmp

    Note that heap dump generation may fail if the limit for core is not set correctly. The limit can be set in the file /etc/security/limits.conf for the root and policy server (qns) users.

    • If no dump is generated but memory usage is high and grows for some time followed by a reduction in usage (possibly due to garbage collection), the heap dump can be generated explicitly by running the following command:

    ◦If the java process is running as user root:
    jmap -dump:format=b,file=<file> <pid>

    ◦If the java process is running as the policy server (qns) user:
    sudo -u qns jmap -dump:format=b,file=<file> <pid>

    Note:
    • Capture this during off-peak hours. In addition, the nice utility can be used to reduce the priority of the process so that it does not impact other running processes.
    • Create an archive of the dump for transfer and make sure to delete the dump/archive after transfer.

    • Use the following procedure to log Garbage Collection:

    ◦Log in to the VM instance where GC (Garbage Collection) logging needs to be enabled.

    ◦Run the following commands:
    cd /opt/broadhop/qns-1/bin/
    chmod +x jmxterm.sh
    ./jmxterm.sh
    > open <host>:<port>
    > bean com.sun.management:type=HotSpotDiagnostic
    > run setVMOption PrintGC true
    > run setVMOption PrintGCDateStamps true
    > run setVMOption PrintGCDetails true
    > exit

    ◦Revert the changes once the required GC logs are collected.

    Issues with Output displayed on Grafana

    In case of a Grafana issue, the whisper database output is required:

    whisper-fetch --pretty /var/lib/carbon/whisper/cisco/quantum/qps/hosts/*

    For example,

    whisper-fetch --pretty /var/lib/carbon/whisper/cisco/quantum/qps/dc1-pcrfclient02/load/midterm.wsp

    Enable Debug Logs

    By default, Cisco recommends keeping the log level at WARN or ERROR. Sometimes more detailed logging is needed for analysis. In that case, change the log level based on Cisco's recommendation on a case-by-case basis.

    The following are the top-level loggers whose log level the user may need to change on a case-by-case basis. These loggers must be defined in the /etc/broadhop/logback.xml file.

    To make sure that all changes are controlled from one VM, sync all changes made in the Cluster Manager to all other VMs.

    SSHUSER_PREFERROOT=true copytoall.sh

    For example,

    SSHUSER_PREFERROOT=true copytoall.sh /etc/broadhop/logback.xml /etc/broadhop/logback.xml

    • For Diameter issues: com.broadhop.diameter2

    • For CDR/EDR issues: com.broadhop.policyintel

    • For Custom Reference Data issues: com.broadhop.custrefdata

    • For Notifications issues: com.broadhop.notifications

    • For Session Manager Cache issues: com.broadhop.policy.mdb.cache

    • For Control Center issues: com.broadhop.controlcenter

    • For Fault Management issues: com.broadhop.faultmanagement

    • For LDAP issues: com.broadhop.ldap

    • For SPR issues: com.broadhop.spr

    • For Unified API issues: com.broadhop.unifiedapi

    • For audit issues: com.broadhop.audit

    • For policy related issues: com.broadhop.policy

    • For any CPS logs issues for which the log level is not overridden by other loggers: com.broadhop

    Note: For consolidated logs, make sure that the configuration specified in Control Center is correct to forward logs to the OAM (pcrfclient) VMs.
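
A logger entry in logback.xml is a `logger` element with `name` and `level` attributes. As a hedged sketch, the helper below (the name `logger_element` is hypothetical) renders such an element for one of the loggers listed above and rejects levels that logback does not define. It produces text only and does not edit /etc/broadhop/logback.xml.

```python
import xml.etree.ElementTree as ET

# Levels defined by logback.
VALID_LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR", "ALL", "OFF"}


def logger_element(name: str, level: str) -> str:
    """Render a logback <logger> element for the given logger name."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    elem = ET.Element("logger", {"name": name, "level": level})
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    # Example: raise Diameter logging to DEBUG for a troubleshooting session.
    print(logger_element("com.broadhop.diameter2", "debug"))
```

Remember to return the level to WARN or ERROR once the analysis is complete.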

    Install SAR Tool

    You can install the SAR tool to capture system issues.

    Download SAR Tool

    ftp://fr2.rpmfind.net/linux/centos/6.7/os/x86_64/Packages/sysstat-9.0.4-27.el6.x86_64.rpm

    This package provides the SAR and iostat commands for Linux. SAR and iostat enable system monitoring of disk, network and other IO activity.

    Installation

    Step 1 Move the sysstat-9.0.4-27.el6.x86_64.rpm package to /var/tmp on pcrfclient01 and pcrfclient02.

    Step 2 SSH to pcrfclient01.

    Step 3 Run the following command to install the sysstat package:

    rpm -ivh /var/tmp/sysstat-9.0.4-27.el6.x86_64.rpm

    Step 4 Change the SAR cron job so that SAR statistics are collected every minute. To do this, open /etc/cron.d/sysstat in an editor and change the following line to remove “/10”:

    */10 * * * * root /usr/lib64/sa/sa1 1 1

    So that it looks like this:

    * * * * * root /usr/lib64/sa/sa1 1 1

    Step 5 To verify that SAR is running and logging, inspect the /var/log/sa directory and verify that the 'sa' log is created. It takes one minute after you make the change in Step 4 for this log to be created.

    Step 6 Repeat Step 2 to Step 5 for pcrfclient02.
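    The cron edit in Step 4 can also be scripted instead of made by hand. The following is a sketch only: set_sar_every_minute is a hypothetical helper name, and it assumes the cron line reads exactly as shown in Step 4. It writes through a temporary file rather than leaving a backup copy in /etc/cron.d, where cron might pick the copy up as a second job file.

```shell
# set_sar_every_minute FILE: change the "*/10 ..." sa1 entry to run every minute.
# Hypothetical helper; only a line matching the Step 4 text exactly is rewritten.
set_sar_every_minute() {
    cron_file="$1"
    tmp=$(mktemp) || return 1
    # Replace the leading "*/10" with "*" and keep the rest of the line unchanged.
    sed 's|^\*/10 \(\* \* \* \* root /usr/lib64/sa/sa1 1 1\)$|* \1|' "$cron_file" > "$tmp" \
        && cat "$tmp" > "$cron_file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (as root, on pcrfclient01 and then pcrfclient02):
#   set_sar_every_minute /etc/cron.d/sysstat
#   ls -l /var/log/sa    # an 'sa' file should appear after about a minute
```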

    Disabling

    It is recommended to have SAR installed on the system, because it can be used for troubleshooting many issues. If you do not want to use it, follow these steps:

    Step 1 To disable the SAR tool, open /etc/cron.d/sysstat in an editor and change the following line:

    * * * * * root /usr/lib64/sa/sa1 1 1

    So that it looks like this:

    #* * * * * root /usr/lib64/sa/sa1 1 1

    Step 2 To verify that SAR is no longer gathering statistics, check the /var/log/sa directory and verify that the timestamp on the 'sa' log is not updating.
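    The disabling edit can be scripted in the same spirit. This is a sketch under assumptions: disable_sar_cron is a hypothetical helper name, and it comments out any active sa1 line that begins with an asterisk, which covers both the per-minute and the original */10 form.

```shell
# disable_sar_cron FILE: comment out the active sa1 entry in a sysstat cron file.
# Hypothetical helper; lines that are already commented out are left untouched.
disable_sar_cron() {
    cron_file="$1"
    tmp=$(mktemp) || return 1
    # "&" stands for the whole matched line, so the line is kept and prefixed with "#".
    sed 's|^\*.* root /usr/lib64/sa/sa1 1 1$|#&|' "$cron_file" > "$tmp" \
        && cat "$tmp" > "$cron_file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (as root):
#   disable_sar_cron /etc/cron.d/sysstat
```

    Running it twice is harmless, because a line that already starts with # no longer matches.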

    Frequently Encountered Scenarios

    This section lists issues that have already been diagnosed and solved.

    • Subscriber not Mapped on SCE

    • CPS Server Will Not Start and Nothing is in the Log

    • Server returned HTTP Response Code: 401 for URL

    • com.broadhop.exception.BroadhopException Unable to Find System Configuration for System

    • Log Files Display the Wrong Time but the Linux Time is Correct

    • JMX Management Beans are not Deployed

    • Unable to Access Binding Information

    • Error Processing Package, Reference Data Does Not Exist for NAS IP...

    • REST Web Service Queries Returns an Empty XML Response for an Existing User

    • Error in Datastore: "err" : "E11000 Duplicate Key Error Index

    • Error Processing Request: Unknown Action

    • Memcached Server is in Error

    • Firewall Error: Log shows Host Not Reachable, or Connection Refused

    • Unknown Error in Logging: License Manager

    • Ecore File is Not Generated:

    • Logging Does Not Appear to be Working

    • Cannot Connect to Server Using JMX: No Such Object in Table

    • File System Check (FSCK) Errors

    • CPS: 27717 Mongo Stuck in STARTUP2 after sessionMgr01/2 Reboot

    • SR: 628099455 System Failure Errors in Control Center

    • Multi-user Policy Builder Errors

    • Policy Reporting Configuration not getting updated post CPS Upgrade

    • CPS Memory Usage

    • Errors while Installing HA Setup

    • Enable/disable Debit Compression

    • Diameter proxy error in diagnostics.sh output

    • Not able to Publish the Policy in Policy Builder

    • CPS not sending SNMP traps to External NMS server

    • Diameter Peer Connectivity is Down

    • Policy Builder Loses Repositories

    • Not able to access IPv6 Gx port from PCEF/GGSN

    • Bring up sessionmgr VM from RECOVERY state to SECONDARY state

    • ZeroMQ Connection Established between Policy Director and other site Policy Server

    • Troubleshooting CPS upgrade from existing 7.0

    • Diagnose Diameter No Response for Peer Message

    • Not able to access Policy Builder

    • Graphs in Grafana are lost when time on VMs are changed

    • Systems is not enabled for Plugin Configuration

    • Publishing is not Enabled

    • Collecting MongoDB Information for Troubleshooting

    • Added Check to Switch to Unknown Service if Subscriber is deleted Mid Session

    • Could not Build Indexes for Table

    • Error Submitting Message to Policy Director (lb) during Longevity

    • Mismatch between Statistics Count and Session Count

    • Disk Statistics not Populated in Grafana after CPS Upgrade

    • Re-create Session Shards

    • Session Switches from Known to Unknown in CCR-U Request

    • Intermittent BSON Object Size Error in createsub with Mongo v3.2.1

    • No Traps Generated When Number of Sessions Exceeds the Limit

    • RAR Message not Received

    • No Response to Diameter Request

    • Admin Database shows Problem in Connecting to the Server

    • Locale MAC Error

    • Sessions Stored in a Single Shard

    • Licensing not Throwing Traps or Diagnostic Errors upon Breach

    • Corosync Process Taking lot of Time to Unload and is Stuck

    • Issue related to Firewall

    • CPS Setup cannot Handle High TPS

    Subscriber not Mapped on SCE

    This issue was causing the subscriber to get no mapping on the SCE.

    Step 1 Run the following grep to collect all instances of this message (over 1000 in this case) into a text file:

    grep "No member in system" policy.log* > no_member_found.txt

    This grep resulted in a file with these lines:

    policy.log:2009-07-17 11:00:21,201 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162818
    policy.log:2009-07-17 11:02:06,108 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for D02625
    policy.log.1:2009-07-17 09:25:29,036 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for D162346
    policy.log.1:2009-07-17 09:27:28,718 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162365
    policy.log.1:2009-07-17 09:27:37,193 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162365
    policy.log.1:2009-07-17 09:27:42,257 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162365
    policy.log.1:2009-07-17 09:38:09,010 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d02116
    policy.log.1:2009-07-17 09:38:12,618 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for D163647
    policy.log.1:2009-07-17 09:40:42,751 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d102096

    Step 2 Use the following awk script to generate a new file that contains only the user names. The script prints the 10th field:

    awk '{print $10}' no_member_found.txt > no_member_found_usernames_with_dupes.txt

    Step 3 Run the following command to remove duplicates:

    sort no_member_found_usernames_with_dupes.txt | uniq > uniq_sorted_no_member_found_usernames.txt

    This resulted in a file with usernames only:

    D00059

    D00077

    D001088

    D00112

    d001313

    D00145

    D001452

    d00156

    D00186

    d00198

    D00200

    d00224
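    The three steps above can be collapsed into one pipeline. This is a sketch rather than the procedure from the guide: list_unmapped_users is a hypothetical helper, grep -h is used to drop the file name prefix, and the last field ($NF) is printed instead of the 10th so that the result does not depend on how many fields precede the message.

```shell
# list_unmapped_users FILE...: print each user name reported as
# "No member in system for <user>" once, in sorted order.
# Hypothetical helper; the arguments are the policy log files to scan.
list_unmapped_users() {
    # -h suppresses the "policy.log:" prefix that grep adds for multiple files.
    grep -h "No member in system" "$@" | awk '{print $NF}' | sort -u
}

# Usage:
#   list_unmapped_users policy.log* > uniq_sorted_no_member_found_usernames.txt
```

    sort -u gives the same result as sort piped to uniq. User names that differ only in case (d162365 and D162365) are still listed separately.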

    CPS Server Will Not Start and Nothing is in the Log

    If the CPS server does not start (or starts and immediately crashes) and no errors appear in /var/log/broadhop/qns.log to explain why, check the following:

    1 Check /var/log/broadhop/service-qns-1.log

    2 Check /etc/broadhop/servers

    • There should be an entry in this file for the current host name (type 'hostname' in the console window to find the local hostname).

    • There must be a directory with config files that corresponds to the hostname entry. That is, if the servers file has svn01=controlcenter, there must be a /etc/broadhop/controlcenter directory.

    3 Attempt to start the server directly from the command line and look for errors.

    • Type: /opt/broadhop/qns/bin/qns.sh

    • The server should start up successfully and the command line should not return. If the command prompt returns, the server did not start successfully.

    • Look for any errors displayed in the console output.

    4 Look for OSGi Errors

    • Look in /opt/broadhop/qns/configuration for a log file. If any exist, examine them for error messages.
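    The servers file check in item 2 can be sketched as a small shell function. The helper name check_servers_entry is hypothetical, and the sketch assumes the servers file consists of hostname=directory lines, as the svn01=controlcenter example suggests.

```shell
# check_servers_entry SERVERS_FILE HOST BASE_DIR
# Hypothetical helper: verifies that HOST has an entry in SERVERS_FILE and that
# the directory named by that entry exists under BASE_DIR.
check_servers_entry() {
    servers_file="$1"; host="$2"; base_dir="$3"
    # Find the "host=dirname" line and print the part after "=".
    dir_name=$(awk -F= -v h="$host" '$1 == h { print $2; exit }' "$servers_file")
    if [ -z "$dir_name" ]; then
        echo "no entry for $host in $servers_file"
        return 1
    fi
    if [ ! -d "$base_dir/$dir_name" ]; then
        echo "missing directory $base_dir/$dir_name"
        return 1
    fi
    echo "ok: $host -> $base_dir/$dir_name"
}

# Usage:
#   check_servers_entry /etc/broadhop/servers "$(hostname)" /etc/broadhop
```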

    Server returned HTTP Response Code: 401 for URL

    A 401 type error means you're not logging in to SVN with proper credentials.

    The server won't start and the following appears in the log:

    2010-12-10 01:05:26,668 [SpringOsgiExtenderThread-8] ERROR c.b.runtime.impl.RuntimeLoader - There was an error initializing reference data!
    java.io.IOException: Server returned HTTP response code: 401 for URL: http://lbvip01/repos/run/config.properties
    sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313) ~[na:1.6.0_20]
    org.springframework.core.io.UrlResource.getInputStream(UrlResource.java:124) ~[org.springframework.core_3.0.0.REL

    To fix this error:

    • Edit /etc/broadhop/qns.conf

    • Ensure that the hostnames in the configuration URL and the repository credentials match:

    -Dcom.broadhop.config.url=http://lbvip01/repos/run/
    -Dcom.broadhop.repository.credentials=broadhop/broadhop@lbvip01
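    The hostname comparison can be scripted as a quick check. This is a sketch only: config_hosts_match is a hypothetical helper, and it assumes the two options appear in qns.conf in the form shown above (a URL with a scheme, and credentials ending in @host).

```shell
# config_hosts_match CONF: compare the host in -Dcom.broadhop.config.url with
# the host in -Dcom.broadhop.repository.credentials.
# Hypothetical helper; returns 0 only when both hosts are found and identical.
config_hosts_match() {
    conf="$1"
    # Host part of ...config.url=scheme://HOST/...
    url_host=$(sed -n 's|.*-Dcom\.broadhop\.config\.url=[a-z]*://\([^/:" ]*\).*|\1|p' "$conf" | head -n 1)
    # Host part of ...repository.credentials=user/password@HOST
    cred_host=$(sed -n 's|.*-Dcom\.broadhop\.repository\.credentials=[^@]*@\([^" ]*\).*|\1|p' "$conf" | head -n 1)
    echo "config.url host: $url_host, credentials host: $cred_host"
    [ -n "$url_host" ] && [ "$url_host" = "$cred_host" ]
}

# Usage:
#   config_hosts_match /etc/broadhop/qns.conf || echo "hostnames differ"
```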

    com.broadhop.exception.BroadhopException Unable to Find System Configuration for System

    Symptoms: the server won't stay started and the log displays this:

    com.broadhop.exception.BroadhopException: Unable to find system configuration for system:

    The system (and cluster name) that is set up in your Quantum Policy Builder must match the one specified in /etc/broadhop/qns.conf. Either add or change the system via the Quantum Policy Builder interface and then publish, or update the system/cluster name in /etc/broadhop/qns.conf:

    -Dcom.broadhop.run.systemId=poc-system
    -Dcom.broadhop.run.clusterId=cluster-1

    Log Files Display the Wrong Time but the Linux Time is Correct

    If log files or other dates show the wrong time zone even though the Linux time is set to the proper time zone, the time zone that the JVM reads is most likely incorrect.

    Step 1 In /etc/sysconfig, run the command cat clock to see this output:

    ZONE="America/Denver"
    UTC=false
    ARC=false

    Step 2 Change the ZONE line to the time zone you want. For instance, to change the JVM time zone to Singapore time, change it to:

    ZONE="Asia/Singapore"
    UTC=false
    ARC=false

    The value for ZONE is driven by the directories in /usr/share/zoneinfo.
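    Because a ZONE value has to correspond to a file under /usr/share/zoneinfo, a candidate value can be checked before it is written to /etc/sysconfig/clock. A minimal sketch, in which zone_is_valid is a hypothetical helper and the second argument exists only so that another directory can be substituted:

```shell
# zone_is_valid ZONE [ZONEINFO_DIR]: succeed if ZONE names a zone file.
# Hypothetical helper; ZONEINFO_DIR defaults to /usr/share/zoneinfo.
zone_is_valid() {
    zone="$1"; zoneinfo_dir="${2:-/usr/share/zoneinfo}"
    # A valid zone such as Asia/Singapore is a regular file below the zoneinfo directory.
    [ -n "$zone" ] && [ -f "$zoneinfo_dir/$zone" ]
}

# Usage:
#   zone_is_valid "Asia/Singapore" && echo "zone exists"
```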

    JMX Management Beans are not Deployed

    Step 1 Restart the CPS Server. The JMX Beans sometimes are not deployed when features are installed or updated.

    Step 2 Run ps -ef | grep java and look for: ‘-javaagent:/opt/broadhop/qns/bin/jmxagent.jar’. If this is absent, you have an old build and need to update.

    Step 3 If you have an old build, see the Operations guide for instructions on updating.
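    The check in Step 2 can be scripted so that it gives a yes or no answer. This is a sketch: has_jmx_agent is a hypothetical helper that reads a process listing on standard input. In the usage line, ps -C java is used in place of ps -ef | grep java so that the search process itself cannot produce a false match.

```shell
# has_jmx_agent: succeed if any line on stdin contains the jmxagent.jar option.
# Hypothetical helper; feed it a process listing.
has_jmx_agent() {
    # -F treats the pattern as a fixed string; "--" is needed because it starts with "-".
    grep -qF -- '-javaagent:/opt/broadhop/qns/bin/jmxagent.jar'
}

# Usage:
#   if ps -C java -o args= | has_jmx_agent; then
#       echo "JMX agent is loaded"
#   else
#       echo "JMX agent not found: likely an old build"
#   fi
```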

    Unable to Access Binding Information

    Make sure the binding has been compiled. This error is typically caused by a bad build.

    Attempt to upgrade to a newer build.

    If you're on a released build, try restarting. A bug has been seen that causes web service problems after an update.

    2010-10-19 12:05:00,194 [pool-4-thread-1] ERROR c.b.d.impl.DiagnosticController - Diagnostic failed. A problem exists with the system --> Common Services: Feature com.broadhop.ws.service is unabled to start. Error: Error creating bean with name 'org.springframework.web.servlet.mvc.annotation.DefaultAnnotationHandlerMapping#0' defined in URL [bundleentry://27.fwk15830670/META-INF/spring/bundle-ws-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'subscriberEndpoint' defined in URL [bundleentry://27.fwk15830670/META-INF/spring/bundle-ws-context.xml]: Cannot resolve reference to bean 'jibxMarshaller' while setting bean property 'marshaller'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'jibxMarshaller' defined in URL [bundleentry://27.fwk15830670/META-INF/spring/bundle-ws-context.xml]: Invocation of init method failed; nested exception is org.jibx.runtime.JiBXException: Unable to access binding information for class com.broadhop.ws.impl.messages.RemoveSubscriberProfileRequest

    Error Processing Package, Reference Data Does Not Exist for NAS IP...

    2010-10-19 13:25:53,481 [pool-11-thread-1] ERROR c.b.u.t.udp.UdpMessageListener - Error processing packet {}
    com.broadhop.exception.BroadhopException: Radius reference data does not exist for NAS IP 192.168.180.74 or 10.0.0.52
    at com.broadhop.radius.impl.RadiusReferenceData.getRadiusDevice(RadiusReferenceData.java:111) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
    at com.broadhop.radius.impl.RadiusReferenceData.getSharedSecret(RadiusReferenceData.java:130) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
    at com.broadhop.radius.impl.RadiusMessageListener.getSharedSecret(RadiusMessageListener.java:247) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
    at com.broadhop.radius.impl.RadiusMessageListener.processPacket(RadiusMessageListener.java:86) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
    at com.broadhop.utilities.transports.udp.UdpMessageListener$1.run(UdpMessageListener.java:192) ~[com.broadhop.utility_5.1.1.r019218.jar:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [na:1.6.0_21]
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) [na:1.6.0_21]
    at java.util.concurrent.FutureTask.run(Unknown Source) [na:1.6.0_21]
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) [na:1.6.0_21]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.6.0_21]
    at java.lang.Thread.run(Unknown Source) [na:1.6.0_21]

    Ensure that this NAS IP has been set up in Cisco Policy Builder under Reference Data->Policy Enforcement Points. If you use an ISG, add it to the ISG Pools folder. Otherwise, add it to the RADIUS Device Pools folder. The IPs that matter are in the 'Devices' table on the ISG Pool object itself.

    REST Web Service Queries Returns an Empty XML Response for an Existing User

    For example:

    Because web service data can be returned in several ways, the BroadHop Web Service Blueprint doesn't return any XML by default. To fix this issue, configure the 'Default Web Service Query Response' blueprint under the 'BroadHop Web Services' Blueprint.

    Error in Datastore: "err" : "E11000 Duplicate Key Error Index

    Note: The procedure below removes ALL sessions.

    Typically, duplicate keys like this happen when initially configuring policies and switching primary keys. In a production scenario, you may not want to remove all sessions.

    Step 1 SSH into sessionmgr01.

    Step 2 Open the SessionMgr CLI:

    /usr/bin/mongo --port 27717

    The /usr/bin/mongo prompt indicates whether this member of the mongo replica set is primary or secondary.

    Step 3 Enter the following commands on the MongoDB CLI:

    use session_cache;

    db.session.remove({});

    Step 4 If this returns a 'not master' error, log in to sessionmgr02 and repeat the steps.

    Error Processing Request: Unknown Action

    com.broadhop.policy.impl.RulesPolicyService - Error processing policy request: Unknown action: com.broadhop.pop3auth.actions.IPOP3AuthRequest and RemoteActions are disabled.

    If you see an error of the type above, the implementation class it is looking for is not available on the server. This can be caused by:

    • The component needed is not installed on the server.

    • Ensure that the pop3auth service is installed in your server.

    • Look for exceptions in the logs when starting up.

    • Try restarting the service bundle (the pop3auth service in this case) using the OSGi console, and then look at the logs.

    Memcached Server is in Error

    ERROR c.b.d.impl.DiagnosticController - Diagnostic failed.
    A problem exists with the system --> Common Services:
    2:Memcached server is in error

    Step 1 Log on to the server where the policy server (qns) is running.

    Step 2 Telnet to the memcache server's IP and port 11211 (for example, telnet lbvip01 11211).

    You can figure out which memcache server CPS is pointing to in Cisco Policy Builder. Look at: Reference Data > Systems > System Name > Cluster Name.

    1 If you cannot telnet to the port, make sure memcache is running:

    • Log on to the server where memcache is running and run service memcached status:

    [root@sessionmgr01 ~]# service memcached status
    memcached is stopped

    • If the service is stopped, start it:

    [root@sessionmgr01 ~]# service memcached start
    Starting a new distributed memory caching (memcached) process for 11211:

    2 Make sure firewall configuration is OK:

    To check if this is the problem, just stop the firewall.

    /etc/init.d/iptables stop

    If it is the problem, add an exception in /etc/sysconfig/iptables. Look at other entries in the file for an example.

    After adding an exception, restart iptables: /etc/init.d/iptables restart.
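    Where telnet is not installed, the reachability test in Step 2 can be approximated with the /dev/tcp redirection built into bash. This is a swapped-in technique, not the telnet procedure itself. port_open is a hypothetical helper, the 3-second limit is an arbitrary choice, and the sketch assumes bash and the coreutils timeout command are available.

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened within 3 seconds.
# Hypothetical helper; uses bash's /dev/tcp/HOST/PORT pseudo-device.
port_open() {
    host="$1"; port="$2"
    # HOST and PORT are passed as $0 and $1 of the inner bash command.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage:
#   port_open lbvip01 11211 && echo "memcached port reachable" || echo "cannot connect"
```

    A failure here does not say whether memcached is stopped or the firewall is blocking the port, so the two checks above still apply.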

    Firewall Error: Log shows Host Not Reachable, or Connection Refused

    In an HA environment, if connection refused errors appear, stop the firewall by executing

    service iptables stop

    to see if the problem is related to the iptables firewall.

    Unknown Error in Logging: License Manager

    2010-12-12 18:51:32,258 [pool-4-thread-1] ERROR c.b.licensing.impl.LicenseManager - Unknown error in logging
    java.lang.NullPointerException: null
    at com.broadhop.licensing.impl.LicenseManager.checkFeatures(LicenseManager.java:311) ~[na:na]

    This issue may occur if no license has been assigned yet.

    Option 1: If this is for development or Proof Of Concept deployments, you can turn on developer mode. This effectively gives you 100 users but is not for use in production.

    1 Log in to CPS.

    2 Add the following to the /etc/broadhop/qns.conf file:

    -Dcom.broadhop.developer.mode=true

    3 Restart CPS

    Option 2: Generate a real license. Have your Cisco technical representative send you the Technical Article Tool com.broadhop.licensing.service - Creating a CPS License.

    Option 3: If there is a license error in the logs, check the MAC address of the VM and compare it with the MAC address in the license file in /etc/broadhop/license/.

    Ecore File is Not Generated:

    (Example shown is RADIUS feature)

    2010-12-12 18:39:34,075 [SpringOsgiExtenderThread-8] ERROR c.b.runtime.impl.RuntimeLoader - Unable to load class: com.broadhop.refdata.radius.RadiusPackage. Ecore file is not generated http://lbvip01/repos/run/com.broadhop.radius.ecore

    A feature (RADIUS) has been installed in Cisco Policy Builder but is not installed on the server. Or the features file being accessed is not the one where features have been placed.

    1 Check if the feature is installed on your server by running /var/qps/bin/diag/list_installed_features.sh.

    2 If the feature is installed, you are probably pointing to (or publishing to) the wrong repository. Check where you are publishing to in Policy Builder and check what URL you are pulling from in the /etc/broadhop/qns.conf file.

    3 If the feature is not installed, you may be pointing to a different features file than you expect. Do this:

    a Log in to the CPS server and find the name of the policy server (qns) you are on.

    b Type: hostname

    c Check the /etc/broadhop/servers file.

    Whatever is listed next to the hostname you are using should also have a directory in the /etc/broadhop directory. That is the directory in which you should change the features file. By default, this maps qns01 to policy director (iomanager). Change it to 'pcrf'.

    Logging Does Not Appear to be Working

    Step 1 Run the JMX Command:

    /opt/broadhop/qns/bin/jmxcmd.sh ch.qos.logback.classic:Name=default,Type=ch.qos.logback.classic.jmx.JMXConfigurator Statuses

    or

    Step 2 Access that bean using JMX Term or JConsole to view the status of the Logback Appenders. To access JMX Term, follow these steps:

    1 Execute the following script: /opt/broadhop/qns-1/bin/jmxterm.sh

    2 If the user does not have permission to execute the script, change the permission using the following command:

    chmod 777 /opt/broadhop/qns-1/bin/jmxterm.sh

    3 Execute the script again: /opt/broadhop/qns-1/bin/jmxterm.sh

    4 Once the script is executed, the JMX terminal opens.

    5 Execute the following command to open a connection:

    $>open qns01:9045

    6 List all beans using the following command:

    $>beans
    #domain = JMImplementation:
    JMImplementation:type=MBeanServerDelegate
    #domain = ch.qos.logback.classic:
    ch.qos.logback.classic:Name=default,Type=ch.qos.logback.classic.jmx.JMXConfigurator
    #domain = com.broadhop.action:
    com.broadhop.action:name=AddSubscriberService,type=histogram
    com.broadhop.action:name=AddSubscriberService,type=service
    com.broadhop.action:name=GetSessionAction,type=histogram
    com.broadhop.action:name=GetSessionAction,type=service
    com.broadhop.action:name=GetSubscriberActionImpl,type=histogram
    com.broadhop.action:name=GetSubscriberActionImpl,type=service
    com.broadhop.action:name=LockSessionAction,type=histogram
    com.broadhop.action:name=LockSessionAction,type=service
    com.broadhop.action:name=LogMessage,type=histogram
    com.broadhop.action:name=LogMessage,type=service
    com.broadhop.action:name=OCSLoadBalanceState,type=histogram
    com.broadhop.action:name=OCSLoadBalanceState,type=service
    java.nio:name=mapped,type=BufferPool
    #domain = java.util.logging:
    java.util.logging:type=Logging

    Cannot Connect to Server Using JMX: No Such Object in Table

    This is most likely because the server's name is not set up in the hosts file with its proper IP address.

    In /etc/hosts the hostname (e.g. qns01) SHOULD NOT be aliased to 127.0.0.1 or localhost.

    If improperly aliased JMX tells the server it'