CPS Troubleshooting Guide, Release 10.1.0
First Published: September 02, 2016
Last Modified: September 02, 2016
Americas Headquarters
Cisco Systems, Inc.
170 West Tasman Drive
San Jose, CA 95134-1706
USA
http://www.cisco.com
Tel: 408 526-4000
800 553-NETS (6387)
Fax: 408 527-0883
THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE. ALL STATEMENTS, INFORMATION, AND RECOMMENDATIONS IN THIS MANUAL ARE BELIEVED TO BE ACCURATE BUT ARE PRESENTED WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. USERS MUST TAKE FULL RESPONSIBILITY FOR THEIR APPLICATION OF ANY PRODUCTS.
THE SOFTWARE LICENSE AND LIMITED WARRANTY FOR THE ACCOMPANYING PRODUCT ARE SET FORTH IN THE INFORMATION PACKET THAT SHIPPED WITH THE PRODUCT AND ARE INCORPORATED HEREIN BY THIS REFERENCE. IF YOU ARE UNABLE TO LOCATE THE SOFTWARE LICENSE OR LIMITED WARRANTY, CONTACT YOUR CISCO REPRESENTATIVE FOR A COPY.
The Cisco implementation of TCP header compression is an adaptation of a program developed by the University of California, Berkeley (UCB) as part of UCB's public domain version of the UNIX operating system. All rights reserved. Copyright © 1981, Regents of the University of California.
NOTWITHSTANDING ANY OTHER WARRANTY HEREIN, ALL DOCUMENT FILES AND SOFTWARE OF THESE SUPPLIERS ARE PROVIDED "AS IS" WITH ALL FAULTS. CISCO AND THE ABOVE-NAMED SUPPLIERS DISCLAIM ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING, WITHOUT LIMITATION, THOSE OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OR ARISING FROM A COURSE OF DEALING, USAGE, OR TRADE PRACTICE.
IN NO EVENT SHALL CISCO OR ITS SUPPLIERS BE LIABLE FOR ANY INDIRECT, SPECIAL, CONSEQUENTIAL, OR INCIDENTAL DAMAGES, INCLUDING, WITHOUT LIMITATION, LOST PROFITS OR LOSS OR DAMAGE TO DATA ARISING OUT OF THE USE OR INABILITY TO USE THIS MANUAL, EVEN IF CISCO OR ITS SUPPLIERS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Any Internet Protocol (IP) addresses and phone numbers used in this document are not intended to be actual addresses and phone numbers. Any examples, command display output, network topology diagrams, and other figures included in the document are shown for illustrative purposes only. Any use of actual IP addresses or phone numbers in illustrative content is unintentional and coincidental.
Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: http://www.cisco.com/go/trademarks. Third-party trademarks mentioned are the property of their respective owners. The use of the word partner does not imply a partnership relationship between Cisco and any other company. (1110R)
© 2016 Cisco Systems, Inc. All rights reserved.
CONTENTS
Preface xi
About this guide xi
Audience xi
Additional Support xi
Version Control Software xii
Conventions (all documentation) xii
Obtaining Documentation and Submitting a Service Request xiii
CHAPTER 1 Troubleshooting CPS 1
General Troubleshooting 1
Gathering Information 1
Basic Troubleshooting 2
Trace Support Commands 3
trace.sh 3
trace_id.sh 4
Periodic Monitoring 4
RADIUS Troubleshooting 8
E2E Call Flow Troubleshooting 8
Diameter Error Codes and Scenarios 8
LDAP Error Codes 11
Rare Troubleshooting Scenarios 21
Recovery using Remove/Add members Option 21
Remove Failed Members 22
Add Failed Members 23
Maintenance Window Procedures 25
Prior to Any Maintenance 25
Change Request Procedure 25
Software Upgrades 25
Application Restarts 25
VM Restarts 26
Hardware Restarts 26
Planned Outages 26
Non-maintenance Window Procedures 27
Common Troubleshooting Tasks 27
Kill All Cisco Processes From the Command Line as Root 27
Low or Out of Disk Space 29
df Command 29
du Command 29
Diameter Issues 29
High CPU Usage Issue 30
JVM Crash 30
High Memory Usage/Out of Memory Error 31
Issues with Output displayed on Grafana 31
Enable Debug Logs 32
Install SAR Tool 32
Installation 33
Disabling 33
Frequently Encountered Scenarios 34
Subscriber not Mapped on SCE 36
CPS Server Will Not Start and Nothing is in the Log 37
Server returned HTTP Response Code: 401 for URL 37
com.broadhop.exception.BroadhopException Unable to Find System Configuration for System 38
Log Files Display the Wrong Time but the Linux Time is Correct 38
JMX Management Beans are not Deployed 39
Unable to Access Binding Information 39
Error Processing Package, Reference Data Does Not Exist for NAS IP... 39
REST Web Service Queries Returns an Empty XML Response for an Existing User 40
Error in Datastore: "err" : "E11000 Duplicate Key Error Index 40
Error Processing Request: Unknown Action 41
Memcached Server is in Error 41
Firewall Error: Log shows Host Not Reachable, or Connection Refused 42
Unknown Error in Logging: License Manager 42
Ecore File is Not Generated: 43
Logging Does Not Appear to be Working 43
Cannot Connect to Server Using JMX: No Such Object in Table 44
File System Check (FSCK) Errors 45
CPS: 27717 Mongo Stuck in STARTUP2 after sessionMgr01/2 Reboot 47
SR: 628099455 System Failure Errors in Control Center 48
Multi-user Policy Builder Errors 52
Policy Reporting Configuration not getting updated post CPS Upgrade 54
CPS Memory Usage 55
Errors while Installing HA Setup 56
Enable/disable Debit Compression 57
Diameter proxy error in diagnostics.sh output 58
Not able to Publish the Policy in Policy Builder 59
CPS not sending SNMP traps to External NMS server 59
Diameter Peer Connectivity is Down 60
Policy Builder Loses Repositories 60
Not able to access IPv6 Gx port from PCEF/GGSN 61
Bring up sessionmgr VM from RECOVERY state to SECONDARY state 61
ZeroMQ Connection Established between Policy Director and other site Policy Server 61
Troubleshooting CPS upgrade from existing 7.0 63
Diagnose Diameter No Response for Peer Message 63
Not able to access Policy Builder 70
Graphs in Grafana are lost when time on VMs are changed 72
Systems is not enabled for Plugin Configuration 72
Publishing is not Enabled 72
Collecting MongoDB Information for Troubleshooting 73
Added Check to Switch to Unknown Service if Subscriber is deleted Mid Session 74
Could not Build Indexes for Table 77
Error Submitting Message to Policy Director (lb) during Longevity 77
Mismatch between Statistics Count and Session Count 78
Disk Statistics not Populated in Grafana after CPS Upgrade 79
Re-create Session Shards 80
Session Switches from Known to Unknown in CCR-U Request 81
Intermittent BSON Object Size Error in createsub with Mongo v3.2.1 82
No Traps Generated When Number of Sessions Exceeds the Limit 83
RAR Message not Received 83
No Response to Diameter Request 84
Admin Database shows Problem in Connecting to the Server 85
Locale MAC Error 87
Sessions Stored in a Single Shard 87
Licensing not Throwing Traps or Diagnostic Errors upon Breach 88
Corosync Process Taking lot of Time to Unload and is Stuck 89
Issue related to Firewall 89
CPS Setup cannot Handle High TPS 90
Troubleshoot ANDSF 91
Policy Builder Scenarios 91
Not Able to See DM Configuration Tab in Policy Builder after Installation 91
Diagnostic.sh throws Errors after Restart 92
Not Getting GCM Notifications in Logs 93
Session is not created for iPhone and Android Users 94
Check for service Use Case Templates for GCM, APNS, General, and default Services 94
Control Center Scenarios 95
Subscriber Session not getting Created and Getting Exception Error (401) 95
SSID Credentials are Wrongly Passed in Policy 96
DM Tree Lookups Fail and Exception in consolidated-qns.log 96
Data Populated in MongoDB ANDSF Collection, but values are not shown in Control Center 97
Not able to see the Mobile Configuration Certificate sub screen in Control Center 97
Control Center session timeout frequently and not able to login from another browser 97
Geo-location is not read Properly in Control Center 97
ANDSF Server Scenarios 98
API Error Codes 98
General Errors 99
Problem Accessing ua/soap Getting Jetty Related Error 99
Check if Blank Policy is Retrieved in SyncML Response 99
Policy Engine didn't Return a Management Response 100
Notification Errors 100
GCM Notification 100
APNS Notification 102
SNMP Traps and Key Performance Indicators (KPIs) 104
Full (HA) Setup 104
All-in-one (AIO) Setup 105
Testing Traps Generated by CPS 105
Component Notifications 106
Application Notifications 109
SNMP System and Application KPI Values 122
SNMP System KPIs 123
Application KPI Values 124
FAQs 126
Reference Document 129
CHAPTER 2 Check Subscriber Access 131
Checking Access 131
Testing Subscriber Access with 00.testAccessRequest.sh 131
Testing Subscriber Access with soapUI 132
Testing for ISG Functionality and Connectivity with test aaa Scripts 138
CHAPTER 3 TCP Dumps 139
About TCP Dumps 139
TCPDUMP Command 139
Options 139
Specific Traffic Types 140
Capture RADIUS Traffic 140
Capture SNMP Traffic 140
Other Ports 141
CHAPTER 4 Call Flows 143
One-Click Call Flow 144
User/Password Login Call Flow 145
Data-limited Voucher Call Flow 146
Time-limited Voucher Call Flow 147
EAP-TTLS Call Flow 148
Service Selection Call Flow 149
MAC TAL Call Flow 150
Tiered Services Call Flow 151
Diameter Call Flows 151
Receive and Queuing of Diameter Message at Policy Director 152
Request Processing at PCRF (Policy Server) 152
Rules Call Flow 153
Response Creation and Sending 154
RAR Call Flow 154
CHAPTER 5 Logging 155
Overview 155
CPS Logs 156
Application/Script Produces Logs: Deploy Logs 157
Application/Script Produces Logs: policy server 157
Application/Script Produces Logs: policy server pb 158
Application/Script Produces Logs: mongo 159
Application/Script Produces Logs: httpd 159
Application/Script Produces Logs: license manager 160
Application/Script Produces Logs: svn 160
Application/Script Produces Logs: auditd 160
Application/Script Produces Logs: graphite 161
Application/Script Produces Logs: kernel 162
Basic Troubleshooting Using CPS Logs 162
Logging Level and Effective Logging Level 162
Consolidated Application Logging 165
Enable Debug Logs 166
Enable Unified API Request and Response Logging 167
Rsyslog Log Processing 167
Rsyslog Overview 167
Rsyslog-proxy 168
Configuration for HA Environments 168
Configuration for AIO 169
Enable Consolidated Syslog Output to Files on OAM VMs 170
Configuration of Logback.xml 171
Basic Troubleshooting Using ANDSF Logs 171
Debugging Common Errors using Logging Techniques of ANDSF 171
Debugging Common Call Flow Scenarios for ANDSF using Logging Patterns 172
Generic Call Flow For Android 172
Generic Call Flow For Apple 173
GCM Notification 175
APNS Notification 176
Notification for Revalidation Timer 177
Preface
• About this guide, page xi
• Audience, page xi
• Additional Support, page xi
• Version Control Software, page xii
• Conventions (all documentation), page xii
• Obtaining Documentation and Submitting a Service Request, page xiii
About this guide
This guide describes how to troubleshoot Cisco Policy Suite.
Audience
This guide is best used by these readers:
• Network administrators
• Network engineers
• Network operators
• System administrators
This document assumes a general understanding of network architecture, configuration, and operations.
Additional Support
For further documentation and support:
• Contact your Cisco Systems, Inc. technical representative.
• Call the Cisco Systems, Inc. technical support number.
• Write to Cisco Systems, Inc. at [email protected].
• Refer to the support matrix at http://www.support.cisco.com and to other documents related to Cisco Policy Suite.
Version Control Software
Cisco Policy Builder uses version control software to manage its various data repositories. The default installed version control software is Subversion, which is provided in your installation package.
Conventions (all documentation)
This document uses the following conventions.
Conventions: Indication
bold font: Commands and keywords and user-entered text appear in bold font.
italic font: Document titles, new or emphasized terms, and arguments for which you supply values are in italic font.
[ ]: Elements in square brackets are optional.
{x | y | z }: Required alternative keywords are grouped in braces and separated by vertical bars.
[ x | y | z ]: Optional alternative keywords are grouped in brackets and separated by vertical bars.
string: A nonquoted set of characters. Do not use quotation marks around the string or the string will include the quotation marks.
courier font: Terminal sessions and information the system displays appear in courier font.
< >: Nonprinting characters such as passwords are in angle brackets.
[ ]: Default responses to system prompts are in square brackets.
!, #: An exclamation point (!) or a pound sign (#) at the beginning of a line of code indicates a comment line.
Note: Means reader take note. Notes contain helpful suggestions or references to material not covered in the manual.
Caution: Means reader be careful. In this situation, you might perform an action that could result in equipment damage or loss of data.
Warning: IMPORTANT SAFETY INSTRUCTIONS. Means danger. You are in a situation that could cause bodily injury. Before you work on any equipment, be aware of the hazards involved with electrical circuitry and be familiar with standard practices for preventing accidents. Use the statement number provided at the end of each warning to locate its translation in the translated safety warnings that accompanied this device. SAVE THESE INSTRUCTIONS. Provided for additional information and to comply with regulatory and customer requirements.
Obtaining Documentation and Submitting a Service Request
For information on obtaining documentation, using the Cisco Bug Search Tool (BST), submitting a service request, and gathering additional information, see What's New in Cisco Product Documentation (http://www.cisco.com/c/en/us/td/docs/general/whatsnew/whatsnew.html).
To receive new and revised Cisco technical content directly to your desktop, you can subscribe to the What's New in Cisco Product Documentation RSS feed (http://www.cisco.com/assets/cdc_content_elements/rss/whats_new/whatsnew_rss_feed.xml). RSS feeds are a free service.
CHAPTER 1
Troubleshooting CPS
• General Troubleshooting, page 1
• Diameter Error Codes and Scenarios, page 8
• LDAP Error Codes, page 11
• Maintenance Window Procedures, page 25
• Non-maintenance Window Procedures, page 27
• Common Troubleshooting Tasks, page 27
• Frequently Encountered Scenarios, page 34
• Troubleshoot ANDSF, page 91
• SNMP Traps and Key Performance Indicators (KPIs), page 104
General Troubleshooting
• Find out if your problem is related to CPS or another part of your network.
• Gather information that facilitates the support call.
• Are there specific SNMP traps being reported that can help you isolate the issue?
Gathering Information
Determine the Impact of the Issue
• Is the issue affecting subscriber experience?
• Is the issue affecting billing?
• Is the issue affecting all subscribers?
• Is the issue affecting all subscribers on a specific service?
• Is there anything else common to the issue?
• Have there been any changes performed on the CPS system or any other systems?
• Has there been an increase in subscribers?
• Initially, categorize the issue to determine the level of support needed.
Basic Troubleshooting
Capture the following details in most error cases:
Step 1 Output of the following commands:
diagnostics.sh
about.sh
Step 2 Collect all the logs:
• The archive created at /var/log/broadhop on pcrfclient01 and pcrfclient02 includes the consolidated policy server (qns) logs. Make sure that the consolidated logs cover the time when the issue happened.
• SSH to all available policy server (qns) and load balancer (lb) VMs and capture the following logs:
/var/log/broadhop/qns-*.log
/var/log/broadhop/qns-*.log.gz
/var/log/broadhop/service-qns-*.log
/var/log/broadhop/service-qns-*.log.gz
• SSH to all the available sessionmgr VMs and capture the following mongoDB logs:
/var/log/mongodb-*.log
/var/log/mongodb-*.log.gz
• SSH to all available VMs and capture the following logs:
/var/log/messages*
Step 3 CPS configuration details present at /etc/broadhop.
Step 4 SVN repository
To export the SVN repository, go to /etc/broadhop/qns.conf and copy the URL specified against com.broadhop.config.url.
For example,
-Dcom.broadhop.config.url=http://pcrfclient01/repos/run
Run the following command to export the SVN repository:
svn export
Step 5 Output of the top command on all available VMs, which displays the top CPU processes on the system:
top -b -n 30
Step 6 Output of the top_qps.sh command from the pcrfclient01 VM, with an output period of 10-15 min and an interval of 5 sec:
top_qps.sh 5
Step 7 Output of the following command on the load balancer (lb) VMs having the issue:
netstat -plan
Step 8 Output of the following command on all VMs:
service iptables status
Step 9 Details mentioned in Periodic Monitoring.
Step 10 Steps to reproduce the issue.
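Parts of Step 2 and Step 4 lend themselves to scripting. The following is a minimal sketch only, not a CPS-provided tool: the function names svn_config_url and collect_qns_logs are hypothetical, and the sketch assumes the qns log file patterns and the com.broadhop.config.url setting are laid out as described in the steps above.

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with CPS); names are assumptions.

# Print the URL configured against com.broadhop.config.url in a
# qns.conf-style file (Step 4).
svn_config_url() {
    conf_file=$1
    sed -n 's/.*-Dcom\.broadhop\.config\.url=\([^ "]*\).*/\1/p' "$conf_file" | head -n 1
}

# Archive the qns log files named in Step 2 from one log directory.
# Pass an absolute path for the archive, because the function changes
# into the log directory before calling tar.
collect_qns_logs() {
    log_dir=$1
    out_tar=$2
    (
        cd "$log_dir" || exit 1
        set --
        # Keep only the patterns that actually match a file.
        for f in qns-*.log qns-*.log.gz service-qns-*.log service-qns-*.log.gz; do
            [ -e "$f" ] && set -- "$@" "$f"
        done
        [ "$#" -gt 0 ] || exit 1
        tar -czf "$out_tar" "$@"
    )
}
```

On a policy server VM this might be called as collect_qns_logs /var/log/broadhop /tmp/qns-logs.tar.gz, and svn_config_url /etc/broadhop/qns.conf prints the repository URL to pass to svn export. The /tmp destination is an assumption; choose a location with enough free disk space.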
Trace Support Commands
This section covers the following two commands:
• trace.sh
• trace_id.sh
trace.sh
trace.sh usage:
/var/qps/bin/control/trace.sh -i -d sessionmgr01:27719/policy_trace
/var/qps/bin/control/trace.sh -x -d sessionmgr01:27719/policy_trace
/var/qps/bin/control/trace.sh -a -d sessionmgr01:27719/policy_trace
/var/qps/bin/control/trace.sh -e -d sessionmgr01:27719/policy_trace
This script starts a selective trace and outputs it to standard out.
• Specific Audit Id Tracing
$0 -i
• Dump All Traces for Specific Audit Id
$0 -x
• Trace All.
$0 -a
• Trace All Errors.
$0 -e
trace_id.sh
trace_id.sh usage:
/var/qps/bin/control/trace_ids.sh -i -d sessionmgr01:27719/policy_trace
/var/qps/bin/control/trace_ids.sh -r -d sessionmgr01:27719/policy_trace
/var/qps/bin/control/trace_ids.sh -x -d sessionmgr01:27719/policy_trace
/var/qps/bin/control/trace_ids.sh -l -d sessionmgr01:27719/policy_trace
This script starts a selective trace and outputs it to standard out.
• Add Specific Audit Id Tracing
$0 -i
• Remove Trace for Specific Audit Id
$0 -r
• Remove Trace for All Ids
$0 -x
• List All Ids under Trace
$0 -l
Periodic Monitoring
• Run the following command on pcrfclient01 and verify that all the processes are reported as Running.
For CPS 6.1.0 and lower releases:
/opt/broadhop/control/statusall.sh
For CPS 7.0.0 and higher releases:
/var/qps/bin/control/statusall.sh
Program 'cpu_load_trap'
  status Waiting
  monitoring status Waiting
Process 'collectd'
  status Running
  monitoring status Monitored
  uptime 42d 17h 23m
Process 'auditrpms.sh'
  status Running
  monitoring status Monitored
  uptime 28d 20h 26m
System 'qns01'
  status Running
  monitoring status Monitored
The Monit daemon 5.5 uptime: 21d 10h 26m
Process 'snmpd'
  status Running
  monitoring status Monitored
  uptime 21d 10h 26m
Process 'qns-1'
  status Running
  monitoring status Monitored
  uptime 6d 17h 9m
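Scanning this output by eye becomes tedious on systems with many VMs. The following is a minimal sketch of a filter, assuming the output keeps the shape shown above (an entry line starting with Process, Program, or System, followed by a status line). The function name list_not_running is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter (not shipped with CPS): reads statusall.sh-style
# output on stdin and prints every entry whose "status" line is not
# "Running".
list_not_running() {
    awk '
        /^(Process|Program|System) / { name = $0; next }
        $1 == "status" && $2 != "Running" {
            $1 = ""; sub(/^ +/, "")
            print name ": " $0
        }
    '
}
```

For example, /var/qps/bin/control/statusall.sh | list_not_running prints one line per entry that needs attention. In the sample above the Program entry cpu_load_trap reports Waiting, so the filter lists it; whether that state is expected for Program entries is left to the operator's judgment.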
• Run the /var/qps/bin/diag/diagnostics.sh command on pcrfclient01 and verify that no errors/failures are reported in the output.
/var/qps/bin/diag/diagnostics.sh
CPS Diagnostics HA Multi-Node Environment
---------------------------
Ping check for all VMs...
Hosts that are not 'pingable' are added to the IGNORED_HOSTS variable...[PASS]
Checking basic ports for all VMs...[PASS]
Checking qns passwordless logins for all VMs...[PASS]
Checking disk space for all VMs...[PASS]
Checking swap space for all VMs...[PASS]
Checking for clock skew for all VMs...[PASS]
Checking CPS diagnostics...
Retrieving diagnostics from qns01:9045...[PASS]
Retrieving diagnostics from qns02:9045...[PASS]
Retrieving diagnostics from qns03:9045...[PASS]
Retrieving diagnostics from qns04:9045...[PASS]
Retrieving diagnostics from pcrfclient01:9045...[PASS]
Retrieving diagnostics from pcrfclient02:9045...[PASS]
Checking svn sync status between pcrfclient01 & 02...
svn is not sync between pcrfclient01 & pcrfclient02...[FAIL]
Corrective Action(s): Run ssh pcrfclient01 /var/qps/bin/support/recover_svn_sync.sh
Checking HAProxy statistics and ports...
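When this check runs unattended, only the failures and their suggested fixes are of interest. The following is a minimal sketch, assuming failures are marked with [FAIL] and fixes start with "Corrective Action(s):" as in the sample above. The function name diag_failures is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CPS): prints the [FAIL] lines and
# any "Corrective Action(s):" lines from diagnostics.sh-style output, and
# returns a non-zero status if at least one [FAIL] line was seen.
diag_failures() {
    awk '
        /\[FAIL\]/ { print; failed = 1; next }
        /^Corrective Action\(s\):/ { print }
        END { exit (failed ? 1 : 0) }
    '
}
```

A periodic job could run /var/qps/bin/diag/diagnostics.sh | diag_failures and alert only when the return status is non-zero.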
• Perform the following actions to verify that the status of all VMs is reported as UP and healthy and that no alarms are generated for any VMs.
◦ Log in to the VMware console.
◦ Verify the VM statistics, graphs, and alarms through the console.
• Verify if any trap is generated by CPS.
cd /var/log/snmp
tailf trap
• Verify if any error is reported in CPS logs.
cd /var/log/broadhop
grep -i error consolidated-qns.log
grep -i error consolidated-engine.log
• Monitor the following KPIs on Grafana for any abnormal behavior:
◦ CPU usage of all instances
CPU Idle on Active Load Balancer (LB)
CPU Idle on Standby Load Balancer (LB)
CPU Idle on Policy Server (QNS) VMs
CPU Idle on sessionmgr VMs: Session database, Balance, Reporting, SPR
CPU Idle on OAM (pcrfclient) VM
◦ Memory usage of all instances
Memory Free on Active Load Balancer (LB)
Memory Free on Standby Load Balancer (LB)
Memory Free on Policy Server (QNS) VMs
Memory Free on sessionmgr VMs
Memory Free on OAM (pcrfclient) VMs
◦ Free disk space on all instances
Disk Space Free on Active Load Balancer (LB)
Disk Space Free on Standby Load Balancer (LB)
Disk Space Free on Policy Server (QNS) VMs
Disk Space Free on sessionmgr VMs: Session database, Balance, Reporting, SPR
Disk Space Free on OAM (pcrfclient) VMs
◦ Diameter messages load: CCR-I, CCR-U, CCR-T, AAR, RAR, STR, ASR
◦ Diameter messages response time: CCR-I, CCR-U, CCR-T, AAR, RAR, STR, ASR
• Errors for diameter messages.
Run the following command on pcrfclient01:
tailcons | grep diameter | grep -i error
• Response time for sessionmgr insert/update/delete/query.
◦ Average read, write, and total time per sec:
mongotop --host sessionmgr* --port port_number
◦ For requests taking more than 100 ms:
SSH to sessionmgr VMs:
tailf /var/log/mongodb-.log
Note: The above commands display requests taking more than 100 ms by default, unless the mongod process has been configured with the --slowms XYZ parameter, where XYZ is the threshold in milliseconds desired by the user.
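To look only at operations above a stricter threshold without reconfiguring mongod, the log can be filtered on the duration that closes each operation line. The following is a minimal sketch, assuming each logged operation ends with its duration in the form <n>ms. The function name slow_ops is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter (not shipped with CPS): prints log lines whose
# trailing "<n>ms" duration is at least the given threshold in
# milliseconds (default 100).
slow_ops() {
    awk -v min="${1:-100}" '
        $NF ~ /^[0-9]+ms$/ {
            ms = $NF; sub(/ms$/, "", ms)
            if (ms + 0 >= min + 0) print
        }
    '
}
```

For example, piping the tailed mongoDB log into slow_ops 500 shows only operations that took 500 ms or longer.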
• Garbage collection.
Check the service-qns-*.log from all policy server (QNS), load balancer (lb) and PCRF VMs.
In the logs look for “GC” or “FULL GC”.
• Session count.
Run the following command on pcrfclient01:
session_cache_ops.sh --count
• Run the following command on pcrfclient01 and verify that the response time is under the expected value and that no errors are reported.
/opt/broadhop/qns-1/control/top_qps.sh
• Use the following command to check mongoDB statistics on queries/inserts/updates/deletes for all CPS databases (and on all primary and secondary databases) and verify if there are any abnormalities (for example, a high number of inserts/updates/deletes considering TPS, or a large number of queries going to the other site).
mongostat --host --port
For example,
mongostat --host sessionmgr01 --port 27717
• Use the following command for all CPS databases and verify if there is any high usage reported in the output. Here the session database is used as an example:
mongotop --host --port
For example,
mongotop --host sessionmgr01 --port 27717
• Verify that EDRs are getting generated by checking the count of entries in the CDR database.
• Verify that EDRs are getting replicated by checking the count of entries in the MySQL database.
• Determine the most recently inserted CDR record in the MySQL database and compare the insert time with the time the CDR was generated. The time difference should be within 2 min; a larger difference signifies lag in replication.
• Count of CCR-I/CCR-U/CCR-T/RAR messages from/to GW.
• Count of failed CCR-I/CCR-U/CCR-T/RAR messages from/to GW. If GW has the capability, capture details at error code level.
Run the following command on pcrfclient01:
cd /var/broadhop/stats
grep "Gx_CCR-" bulk-*.csv
• Response time of CCR-I/CCR-U/CCR-T messages at GW.
• Count of session in PCRF and count of session in GW. There could be some mismatch between the counts due to the time gap between determining the session count from CPS and from GW. If the count difference is high, it could indicate stale sessions on PCRF or GW.
• Count of AAR/RAR/STR/ASR messages from/to Application Function.
• Count of failed AAR/RAR/STR/ASR messages from/to Application Function. If Application Function has the capability, capture details at error code level.
Run the following command on pcrfclient01:
cd /var/broadhop/stats
grep "Rx_AAR-" bulk-*.csv
• Response time of AAR/RAR/STR/ASR messages at Application Function.
• Count of session in PCRF and count of session in Application Function. There could be some mismatch between the counts due to the time gap between determining the session count from CPS and from Application Function. If the count difference is high, it could indicate stale sessions on PCRF or Application Function.
Count of session in PCRF:
session_cache_ops.sh -count
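The guide does not state how large a count difference counts as "high". The following is a minimal sketch of the comparison, in which the 5 percent default is purely an assumed threshold and the function name session_mismatch is hypothetical. It takes the two counts as plain numbers, however they were obtained.

```shell
#!/bin/sh
# Hypothetical check (not shipped with CPS): compares two session counts
# and reports the difference as a percentage of the larger one.
# The 5 percent default is an assumed threshold, not a value from this guide.
session_mismatch() {
    pcrf=$1; peer=$2; max_pct=${3:-5}
    awk -v a="$pcrf" -v b="$peer" -v t="$max_pct" 'BEGIN {
        d = a - b; if (d < 0) d = -d
        m = (a > b) ? a : b
        pct = (m == 0) ? 0 : 100 * d / m
        printf "diff=%d pct=%.1f\n", d, pct
        exit ((pct > t) ? 1 : 0)
    }'
}
```

A non-zero return status means the difference exceeds the chosen threshold and stale sessions are worth investigating. Because the two counts are never sampled at the same instant, a small difference is expected.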
RADIUS Troubleshooting
• Test service definition requests from a PEP such as ISG to the CPS by running the following command:
test aaa group radius L4REDIRECT_SERVICE password legacy
Repeat this command for PBHK_SERVICE and OPENGARDEN_SERVICE.
• Listen for RADIUS traffic from the PEP by logging in to lb01 and lb02 and running the following command:
tcpdump -i any port 1812 -s 0 -vvv
Test general subscriber access with the procedures in Check Subscriber Access.
E2E Call Flow Troubleshooting
• On an All-in-One deployment, run the following command:
tcpdump -i -s 0 -vv
◦ Append -w /tmp/callflow.pcap to capture the output to a Wireshark file.
• Open the file in Wireshark and filter on HTTP or RADIUS to assist in debugging the call flow.
• In a distributed model, you need to run tcpdump on individual VMs:
◦ Load balancers on ports 1812, 1813, 1700, 8080, and 3868
Correct call flows are shown in Call Flows.
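Capturing several ports in one tcpdump run requires joining them into a single filter expression. The following is a minimal sketch of how that expression could be built. The function name port_filter is hypothetical, and the choice of the any interface in the usage note mirrors the RADIUS example rather than a stated requirement.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CPS): builds a tcpdump filter
# expression of the form "port A or port B ..." from a list of ports.
port_filter() {
    expr_str=""
    for p in "$@"; do
        if [ -z "$expr_str" ]; then
            expr_str="port $p"
        else
            expr_str="$expr_str or port $p"
        fi
    done
    printf '%s\n' "$expr_str"
}
```

On a load balancer VM this could be used as tcpdump -i any -s 0 -w /tmp/callflow.pcap $(port_filter 1812 1813 1700 8080 3868), leaving the command substitution unquoted so that tcpdump receives the expression as separate words.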
Diameter Error Codes and Scenarios
The following table describes some common diameter error codes and scenarios:
Table 1: Common Diameter Error Codes and Scenarios
Code | Name | CPS Scenarios
2001 | DIAMETER_SUCCESS | Everything went well and the request was processed successfully.
3002 | DIAMETER_UNABLE_TO_DELIVER | Message cannot be delivered, either because no host within the realm supporting the required application was available to process the request or because the Destination-Host AVP was given without the associated Destination-Realm AVP.
3004 | DIAMETER_TOO_BUSY | Message got discarded by the overload handling mechanism. Note: CPS 7.5 adds the option to silently discard instead of sending DIAMETER_TOO_BUSY, as discarding is often a better way to have the other node back off instead of immediately resending the request in an overload scenario.
3007 | DIAMETER_APPLICATION_UNSUPPORTED | A request was sent for an application that is not supported.
3010 | DIAMETER_UNKNOWN_PEER | A CER was received from an unknown peer.
4141 | DIAMETER_PCC_BEARER_EVENT | When for some reason a PCC rule cannot be enforced or modified successfully in a network initiated procedure. The reason is provided in the Event Trigger AVP value.
4241 | DIAMETER_ERROR_NO_AVAILABLE_POLICY_COUNTERS | Error used by the OCS to indicate to the PCRF that the OCS has no available policy counters for the subscriber.
5002 | DIAMETER_UNKNOWN_SESSION_ID | The request contained an unknown Session-Id.
5003 | DIAMETER_AUTHORIZATION_REJECTED | A request was received for which the user could not be authorized. No session created due to various reasons. For example, this error could occur if the service requested is not permitted to the user.
5010 | DIAMETER_NO_COMMON_APPLICATION | When a CER message is received, and there are no common applications supported between the peers.
5012 | DIAMETER_UNABLE_TO_COMPLY | Message rejected because something else went wrong and there is no specific reason.
5030 | DIAMETER_USER_UNKNOWN | Subscriber not found in SPR.
5141 | DIAMETER_ERROR_TRIGGER_EVENT | When the set of bearer/session information sent in a CCR originated due to a trigger event being met is incoherent with the previous set of bearer/session information for the same bearer/session.
5142 | DIAMETER_PCC_RULE_EVENT | When for some reason the PCC rules cannot be installed/activated. The reason is provided in the Event Trigger AVP value.
5143 | DIAMETER_ERROR_BEARER_NOT_AUTHORIZED | Emergency service related - Used when the PCRF cannot authorize an IP-CAN bearer upon the reception of an IP-CAN bearer authorization request coming from the PCEF.
5144 | DIAMETER_ERROR_TRAFFIC_MAPPING_INFO_REJECTED | Emergency service related - Used when the PCRF does not accept one or more of the traffic mapping filters.
5570 | DIAMETER_ERROR_UNKNOWN_POLICY_COUNTERS | Error used by the OCS to indicate to the PCRF that the OCS does not recognize one or more Policy Counters specified in the request, when the OCS is configured to reject the request provided with unknown policy counter identifier(s).
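When reading result codes out of logs or packet captures, a small lookup saves repeated trips to Table 1. The following is a minimal sketch built from that table. The function name diameter_code is hypothetical, and for codes the table does not list it falls back to the result-code classes of the Diameter base protocol (1xxx informational, 2xxx success, 3xxx protocol error, 4xxx transient failure, 5xxx permanent failure).

```shell
#!/bin/sh
# Hypothetical lookup (not shipped with CPS), built from Table 1.
# Codes not listed there fall back to the Diameter base protocol
# result-code class.
diameter_code() {
    case $1 in
        2001) echo "DIAMETER_SUCCESS" ;;
        3002) echo "DIAMETER_UNABLE_TO_DELIVER" ;;
        3004) echo "DIAMETER_TOO_BUSY" ;;
        3007) echo "DIAMETER_APPLICATION_UNSUPPORTED" ;;
        3010) echo "DIAMETER_UNKNOWN_PEER" ;;
        4141) echo "DIAMETER_PCC_BEARER_EVENT" ;;
        4241) echo "DIAMETER_ERROR_NO_AVAILABLE_POLICY_COUNTERS" ;;
        5002) echo "DIAMETER_UNKNOWN_SESSION_ID" ;;
        5003) echo "DIAMETER_AUTHORIZATION_REJECTED" ;;
        5010) echo "DIAMETER_NO_COMMON_APPLICATION" ;;
        5012) echo "DIAMETER_UNABLE_TO_COMPLY" ;;
        5030) echo "DIAMETER_USER_UNKNOWN" ;;
        5141) echo "DIAMETER_ERROR_TRIGGER_EVENT" ;;
        5142) echo "DIAMETER_PCC_RULE_EVENT" ;;
        5143) echo "DIAMETER_ERROR_BEARER_NOT_AUTHORIZED" ;;
        5144) echo "DIAMETER_ERROR_TRAFFIC_MAPPING_INFO_REJECTED" ;;
        5570) echo "DIAMETER_ERROR_UNKNOWN_POLICY_COUNTERS" ;;
        1???) echo "informational" ;;
        2???) echo "success" ;;
        3???) echo "protocol error" ;;
        4???) echo "transient failure" ;;
        5???) echo "permanent failure" ;;
        *) echo "unknown"; return 1 ;;
    esac
}
```

For example, diameter_code 5030 prints DIAMETER_USER_UNKNOWN, which points at a subscriber missing from SPR.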
LDAP Error CodesThe following table describes LDAP error codes:
Table 2: LDAP Error Codes
NotApplicableto Search
TerminateConnection
Sent ToPolicyServer
TriggersRetry
CountsasTimeout
DefinitionName
YThe result code (0)that will be used toindicate a successfuloperation
SUCCESS0
YYThe result code (1)that will be used toindicate that anoperation wasrequested out ofsequence.
OPERATIONS_ERROR
1
YYThe result code (2)that will be used toindicate that theclient sent amalformed request.
PROTOCOL_ERROR2
YYThe result code (3)that will be used toindicate that theserver was unable tocomplete processingon the request in theallotted time limit.
TIME_LIMIT_EXCEEDED
3
YThe result code (4)that will be used toindicate that theserver found morematching entries thanthe configuredrequest size limit.
SIZE_LIMIT_EXCEEDED
4
YThe result code (5)that will be used if arequested compareassertion does notmatch the targetentry.
COMPARE_FALSE5
Y The result code (6) that will be used if a requested compare assertion matched the target entry.
COMPARE_TRUE 6
Y The result code (7) that will be used if the client requested a form of authentication that is not supported by the server.
AUTH_METHOD_NOT_SUPPORTED 7
Y The result code (8) that will be used if the client requested an operation that requires a strong authentication mechanism.
STRONG_AUTH_REQUIRED 8
Y The result code (10) that will be used if the server sends a referral to the client to refer to data in another location.
REFERRAL 10
Y The result code (11) that will be used if a server administrative limit has been exceeded.
ADMIN_LIMIT_EXCEEDED 11
Y The integer value (12) for the "UNAVAILABLE_CRITICAL_EXTENSION" result code.
UNAVAILABLE_CRITICAL_EXTENSION 12
Y The result code (13) that will be used if the server requires a secure communication mechanism for the requested operation.
CONFIDENTIALITY_REQUIRED 13
Y The result code (14) that will be returned from the server after SASL bind stages in which more processing is required.
SASL_BIND_IN_PROGRESS 14
Y The result code (16) that will be used if the client referenced an attribute that does not exist in the target entry.
NO_SUCH_ATTRIBUTE 16
Y The result code (17) that will be used if the client referenced an attribute that is not defined in the server schema.
UNDEFINED_ATTRIBUTE_TYPE 17
Y The result code (18) that will be used if the client attempted to use an attribute in a search filter in a manner not supported by the matching rules associated with that attribute.
INAPPROPRIATE_MATCHING 18
Y The result code (19) that will be used if the requested operation would violate some constraint defined in the server.
CONSTRAINT_VIOLATION 19
Y The result code (20) that will be used if the client attempts to modify an entry in a way that would create a duplicate value, or create multiple values for a single-valued attribute.
ATTRIBUTE_OR_VALUE_EXISTS 20
Y The result code (21) that will be used if the client attempts to perform an operation that would create an attribute value that violates the syntax for that attribute.
INVALID_ATTRIBUTE_SYNTAX 21
Y The result code (32) that will be used if the client targeted an entry that does not exist.
NO_SUCH_OBJECT 32
Y The result code (33) that will be used if the client targeted an entry that is an alias.
ALIAS_PROBLEM 33
Y The result code (34) that will be used if the client provided an invalid DN.
INVALID_DN_SYNTAX 34
Y The result code (36) that will be used if a problem is encountered while the server is attempting to dereference an alias.
ALIAS_DEREFERENCING_PROBLEM 36
Y The result code (48) that will be used if the client attempts to perform a type of authentication that is not supported for the target user.
INAPPROPRIATE_AUTHENTICATION 48
Y The result code (49) that will be used if the client provided invalid credentials while trying to authenticate.
INVALID_CREDENTIALS 49
Y The result code (50) that will be used if the client does not have permission to perform the requested operation.
INSUFFICIENT_ACCESS_RIGHTS 50
YY The result code (51) that will be used if the server is too busy to process the requested operation.
BUSY 51
YY The result code (52) that will be used if the server is unavailable.
UNAVAILABLE 52
YY The result code (53) that will be used if the server is not willing to perform the requested operation.
UNWILLING_TO_PERFORM 53
Y The result code (54) that will be used if the server detects a chaining or alias loop.
LOOP-DETECT 54
Y The result code (60) that will be used if the client sends a virtual list view control without a server-side sort control.
SORT_CONTROL_MISSING 60
Y The result code (61) that will be used if the client provides a virtual list view control with a target offset that is out of range for the available data set.
OFFSET_RANGE_ERROR 61
Y The result code (64) that will be used if the client request violates a naming constraint (e.g., a name form or DIT structure rule) defined in the server.
NAMING_VIOLATION 64
Y The result code (65) that will be used if the client request violates an object class constraint (e.g., an undefined object class, a disallowed attribute, or a missing required attribute) defined in the server.
OBJECT_CLASS_VIOLATION 65
Y The result code (66) that will be used if the requested operation is not allowed to be performed on non-leaf entries.
NOT_ALLOWED_ON_NONLEAF 66
Y The result code (67) that will be used if the requested operation would alter the RDN of the entry but the operation was not a modify DN request.
NOT_ALLOWED_ON_RDN 67
Y The result code (68) that will be used if the requested operation would create a conflict with an entry that already exists in the server.
ENTRY_ALREADY_EXISTS 68
Y The result code (69) that will be used if the requested operation would alter the set of object classes defined in the entry in a disallowed manner.
OBJECT_CLASS_MODS_PROHIBITED 69
Y The result code (71) that will be used if the requested operation would impact entries in multiple data sources.
AFFECTS_MULTIPLE_DSAS 71
Y The result code (76) that will be used if an error occurred while performing processing associated with the virtual list view control.
VIRTUAL_LIST_VIEW_ERROR 76
YY The result code (80) that will be used if none of the other result codes are appropriate.
OTHER 80
YY The client-side result code (81) that will be used if an established connection to the server is lost.
SERVER_DOWN 81
YY The client-side result code (82) that will be used if a generic client-side error occurs during processing.
LOCAL_ERROR 82
YY The client-side result code (83) that will be used if an error occurs while encoding a request.
ENCODING_ERROR 83
YY The client-side result code (84) that will be used if an error occurs while decoding a response.
DECODING_ERROR 84
YYY The client-side result code (85) that will be used if a client timeout occurs while waiting for a response from the server.
TIMEOUT 85
Y The client-side result code (86) that will be used if the client attempts to use an unknown authentication type.
AUTH_UNKNOWN 86
Y The client-side result code (87) that will be used if an error occurs while attempting to encode a search filter.
FILTER_ERROR 87
Y The client-side result code (88) that will be used if the end user canceled the operation in progress.
USER_CANCELED 88
Y The client-side result code (89) that will be used if there is a problem with the parameters provided for a request.
PARAM_ERROR 89
YY The client-side result code (90) that will be used if the client does not have sufficient memory to perform the requested operation.
NO_MEMORY 90
YY The client-side result code (91) that will be used if an error occurs while attempting to connect to a target server.
CONNECT_ERROR 91
Y The client-side result code (92) that will be used if the requested operation is not supported.
NOT_SUPPORTED 92
Y The client-side result code (93) that will be used if the response from the server did not include an expected control.
CONTROL_NOT_FOUND 93
Y The client-side result code (94) that will be used if the server did not send any results.
NO_RESULTS_RETURNED 94
Y The client-side result code (95) that will be used if there are still more results to return.
MORE_RESULTS_TO_RETURN 95
Y The client-side result code (96) that will be used if the client detects a loop while attempting to follow referrals.
CLIENT_LOOP 96
Y The client-side result code (97) that will be used if the client encountered too many referrals in the course of processing an operation.
REFERRAL_LIMIT_EXCEEDED 97
Y The result code (118) that will be used if the operation was canceled.
CANCELED 118
Y The result code (119) that will be used if the client attempts to cancel an operation that does not exist in the server.
NO_SUCH_OPERATION 119
Y The result code (120) that will be used if the client attempts to cancel an operation too late in the processing for that operation.
TOO_LATE 120
Y The result code (121) that will be used if the client attempts to cancel an operation that cannot be canceled.
CANNOT_CANCEL 121
Y The result code (122) that will be used if the requested operation included the LDAP assertion control but the assertion did not match the target entry.
ASSERTION_FAILED 122
Y The result code (123) that will be used if the client is denied the ability to use the proxied authorization control.
AUTHORIZATION_DENIED 123
Rare Troubleshooting Scenarios
Recovery using Remove/Add members Option
When the Arbiter blade and a sessionmgr blade go down, there is no primary sessionmgr node to serve requests coming from CPS VMs (classic HA setup: 1 arbiter, 2 sessionmgrs). As a result, the system becomes unstable.
A safe way to recover from the issue is to bring the down blades back to a working state. If that is not possible, the only way to keep the setup working is to remove the failed members of the replica-set from mongo-config. The sessionmgr node that is up and running then becomes primary. The failed members must be added back to the replica-set once they come online.
The following sections describe how to remove failed members from the mongo replica-set and how to add them back to the replica-set once they are online.
Note: The steps mentioned in the following sections should be executed exactly as written.
Note: Perform the following steps only when a single sessionmgr is UP but is in secondary mode and cannot become primary on its own, and bringing the down blades (holding the arbiter and primary sessionmgr VMs) back to operational mode is not possible.
Remove Failed Members
This option is usually used when one or more members are not running and are treated as failed members. The script removes all such failed members from the replica-set.
Step 1 Log in to pcrfclient01/02.
Step 2 Execute the diagnostics script to identify which replica-set or component has failed and needs to be removed.
diagnostics.sh --get_replica_status
Step 3 Execute build_set.sh with the following options to remove failed members from the replica-set. This operation removes all failed members across the site.
cd /var/qps/bin/support/mongo/
For session database:
./build_set.sh --session --remove-failed-members
For SPR database:
./build_set.sh --spr --remove-failed-members
For balance database:
./build_set.sh --balance --remove-failed-members
For report database:
./build_set.sh --report --remove-failed-members
Step 4 Execute the diagnostics script again to verify that the member has been removed.
diagnostics.sh --get_replica_status
Note: If the status is not displayed properly by the above command, log in to the mongo port on sessionmgr and check the replica status.
Figure 1: Replica Status
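The per-database commands in Step 3 can be wrapped in a small loop. The following is a minimal sketch, not part of the CPS tooling: the function name remove_failed_members and the DRY_RUN variable are hypothetical, and the script assumes build_set.sh lives in the directory shown in Step 3. By default it only prints the commands it would run.

```shell
# Hypothetical helper (not shipped with CPS): prints, or with DRY_RUN=0 runs,
# the Step 3 command for each database name given as an argument.
BUILD_SET_DIR="/var/qps/bin/support/mongo"   # path taken from Step 3

remove_failed_members() {
    for db in "$@"; do
        # Accept only the four database options listed in this procedure.
        case "$db" in
            session|spr|balance|report) ;;
            *) echo "unknown database: $db" >&2; return 1 ;;
        esac
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (default): show the command without executing it.
            echo "./build_set.sh --$db --remove-failed-members"
        else
            (cd "$BUILD_SET_DIR" && ./build_set.sh "--$db" --remove-failed-members) || return 1
        fi
    done
}
```

Calling remove_failed_members session spr prints the two commands for review. Setting DRY_RUN=0 executes them, after which the diagnostics script in Step 4 should still be run to confirm the result.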
Add Failed Members
Step 1 Log in to pcrfclient01/02.
Step 2 Once the failed members are back online, they can be added back to the replica-set.
Step 3 Execute the diagnostics script to identify which replica-set member is missing from the configuration or has failed.
diagnostics.sh --get_replica_status
If the status is not displayed properly by the above command, log in to the mongo port on sessionmgr and check the replica status.
Figure 2: Replica Status
cd /var/qps/bin/support/mongo
For session database:
./build_set.sh --session --add-members
For SPR database:
./build_set.sh --spr --add-members
For balance database:
./build_set.sh --balance --add-members
For report database:
./build_set.sh --report --add-members
Maintenance Window Procedures
The usual tasks for a maintenance window might include these:
Prior to Any Maintenance
Back up all relevant information to an offline resource. For more information on backup, see the Cisco Policy Suite Backup and Restore Guide.
• Data - Back up all database information. This includes Cisco MsBM and Cisco Unified SuM.
Note: Sessions can be backed up as well.
• Configurations - Back up all configuration information. This includes SVN (from PCRF Client) and the /etc/broadhop directory from all PCRFs.
• Logs - Back up all logs for comparison to the upgrade. This is not required but will be helpful if there are any issues.
Change Request Procedure
• Have proper sign-off for any change request. Cisco and all customer teams must sign off.
• Make sure the proposed procedures are well defined.
• Make sure the rollback procedures are correct and available.
Software Upgrades
• Determine if the software upgrade will cause an outage and requires a maintenance window.
• Typically, software upgrades can be done on one node at a time, which minimizes or eliminates any outage.
• Most of the time an upgrade requires a restart of the application. Most applications can be started in less than 1 minute.
Application Restarts
Application restarts are component independent. These are the components:
• PCRF/PCRF Client
• Load Balancer/IO Manager
• sessionMgr
IO Manager PCRF PCRF Client
• IO Managers and PCRF give up their resources and allow the failovers to take over. They can be stopped directly with service qns restart
• PCRF Client is a GUI application and can be restarted at any point. If SVN is restarted, the PCRF applications continue to run but throw errors saying that they cannot check for new configurations. This will not impact the environment.
• sessionMgr is deployed as active-standby and is used by the policy server to maintain the subscriber session state information.
• Load Balancers distribute the load for RADIUS, Web Services, MySQL, LDAP, and SVN. Two load balancers are deployed for each Cisco Policy Suite in active/passive mode.
VM Restarts
• Linux must be shut down normally for VM restarts.
• All VMs are Linux.
• The preferred methods are init 0 or shutdown -h
• Failure to use the Linux OS shutdown can result in VM corruption and problems restarting the VM and applications.
• VM restart is typically done to increase resources to the VM (disk, memory, CPU).
Hardware Restarts
• Hardware restarts should be rare.
• When a hardware restart is needed, VMs must be shut down first.
• When all VMs are stopped, shut down the hardware with either the ESXi console or as a power off.
Planned Outages
• Planned outages are similar to hardware restarts.
• VMs need to be shut down; hardware can then be stopped.
• When hardware is started, the typical hardware starting order is:
◦ Start the servers with PCRFClient01, LB01, and SessionMgr01 first.
◦ Start all other servers in any order after that.
Non-maintenance Window Procedures
Tasks you can perform outside a maintenance window, that is, at any time, are these:
• Data archiving or warehousing
• Log removal
Common Troubleshooting Tasks
This section describes frequently used troubleshooting tasks you might use before calling support or as directed by support.
Kill All Cisco Processes From the Command Line as Root
Depending on the Linux version, one or both of these ps commands are applicable. Remove the portion '| xargs kill -9' if you want to test out the command.
These commands do the following:
• print out all processes (ps) then
• search (grep) for all processes that do not contain the word grep or mysql then
• use sed to remove all the remaining text except for the PID value and then
• send that PID to kill -9.
[root@lab ~]# ps -A
PID TTY TIME CMD
1 ? 00:00:00 init
2 ? 00:00:01 migration/0
3 ? 00:00:00 ksoftirqd/0
4 ? 00:00:01 migration/1
5 ? 00:00:00 ksoftirqd/1
6 ? 00:18:49 events/0
7 ? 00:00:00 events/1
8 ? 00:00:00 khelper
49 ? 00:00:00 kthread
54 ? 00:00:00 kblockd/0
55 ? 00:00:00 kblockd/1
56 ? 00:00:00 kacpid
217 ? 00:00:00 cqueue/0
218 ? 00:00:00 cqueue/1
221 ? 00:00:00 khubd
223 ? 00:00:00 kseriod
299 ? 00:00:00 khungtaskd
300 ? 00:00:00 pdflush
301 ? 00:01:09 pdflush
302 ? 00:00:01 kswapd0
303 ? 00:00:00 aio/0
304 ? 00:00:00 aio/1
510 ? 00:00:00 kpsmoused
554 ? 00:00:00 mpt_poll_0
555 ? 00:00:00 mpt/0
556 ? 00:00:00 scsi_eh_0
560 ? 00:00:00 ata/0
561 ? 00:00:00 ata/1
562 ? 00:00:00 ata_aux
569 ? 00:00:00 kstriped
582 ? 00:00:00 ksnapd
597 ? 00:04:19 kjournald
623 ? 00:00:00 kauditd
656 ? 00:00:00 udevd
2168 ? 00:00:00 kmpathd/0
2169 ? 00:00:00 kmpathd/1
2171 ? 00:00:00 kmpath_handlerd
2194 ? 00:00:00 kjournald
2664 ? 00:00:00 vmmemctl
2795 ? 00:02:38 vmtoolsd
2877 ? 00:00:00 iscsi_eh
2920 ? 00:00:00 cnic_wq
2924 ? 00:00:00 bnx2i_thread/0
2925 ? 00:00:00 bnx2i_thread/1
2939 ? 00:00:00 ib_addr
2949 ? 00:00:00 ib_mcast
2950 ? 00:00:00 ib_inform
2951 ? 00:00:00 local_sa
2954 ? 00:00:00 iw_cm_wq
2958 ? 00:00:00 ib_cm/0
2959 ? 00:00:00 ib_cm/1
2963 ? 00:00:00 rdma_cm
2984 ? 00:00:00 iscsiuio
3784 ? 00:06:49 snmpd
3799 ? 00:00:00 snmptrapd
3814 ? 00:00:21 memcached
3836 ? 00:00:00 sshd
3857 ? 00:00:00 ntpd
3870 ? 00:00:00 mysqld_safe
3925 ? 00:00:19 mysqld
3977 ? 00:00:01 gpm
3992 ? 00:00:01 httpd
4006 ? 00:03:18 collectd
4058 ? 00:00:00 crond
4077 ? 00:00:42 pcrfclient_avai
4079 ? 00:00:34 qns_availabilit
4081 ? 00:00:17 database_availa
4082 ? 00:00:13 server_availabi
4098 ? 00:00:00 atd
4169 ? 00:00:03 avahi-daemon
4170 ? 00:00:00 avahi-daemon
4302 ? 2-15:01:05 java
4324 ? 00:21:27 java
4375 ? 00:20:32 mongod
4380 tty1 00:00:00 mingetty
4381 tty2 00:00:00 mingetty
4382 tty3 00:00:00 mingetty
4383 tty4 00:00:00 mingetty
4384 tty5 00:00:00 mingetty
4393 tty6 00:00:00 mingetty
4395 ? 00:00:00 gdm-binary
4433 ? 00:00:00 gdm-binary
4435 ? 00:00:02 gdm-rh-security
4436 tty7 00:05:28 Xorg
5022 ? 00:00:00 ntpd
14425 ? 00:00:00 sshd
14487 pts/0 00:00:00 bash
14823 ? 00:00:00 sleep
14837 ? 00:00:00 sleep
14854 ? 00:00:00 sleep
15014 ? 00:00:00 sleep
15019 pts/0 00:00:00 ps
25203 ? 00:00:00 gnome-vfs-daemo
25316 ? 00:00:00 pam_timestamp_c
28836 ? 00:00:06 httpd
28837 ? 00:00:06 httpd
28838 ? 00:00:06 httpd
28839 ? 00:00:06 httpd
28840 ? 00:00:06 httpd
28841 ? 00:00:06 httpd
28842 ? 00:00:06 httpd
28843 ? 00:00:06 httpd
[root@lab ~]#
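The pipeline described in the bullets (list processes, filter out grep and mysql, strip everything except the PID) can be sketched as a preview-only function. This is an illustrative sketch under assumptions, not the exact command from this guide: the function name pids_matching is hypothetical, and the search pattern (java in the usage note) is an assumption about which processes are of interest.

```shell
# Hypothetical helper: reads `ps -A` style lines on stdin and prints only the
# PIDs of processes whose line matches the pattern in $1. Lines containing
# "grep" or "mysql" are skipped, as described in the bullets above.
pids_matching() {
    grep -- "$1" \
        | grep -v -e grep -e mysql \
        | sed -n 's/^ *\([0-9][0-9]*\) .*/\1/p'   # keep only the leading PID
}
```

For example, ps -A | pids_matching java only prints the PIDs, so the list can be reviewed first. Once the list is confirmed, the output can be piped to xargs -r kill -9 (the -r option is a GNU xargs extension that skips the kill when the list is empty).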
Low or Out of Disk Space
To determine the disk space used, use these Linux disk usage and disk free commands:
• du
• df
df Command
df
For example:
home# df -h
[root@lab home]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p5 56G 27G 26G 51% /
/dev/cciss/c0d0p1 99M 12M 83M 12% /boot
tmpfs 2.0G 0 2.0G 0% /dev/shm
none 2.0G 0 2.0G 0% /dev/shm
/dev/cciss/c0d0p2 5.8G 4.0G 1.6G 73% /home
As shown above, the /home directory is using the most of its allocated space (73%).
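Spotting the fullest file system by eye works for a short listing. For longer listings, a small filter can flag mount points above a chosen threshold. The sketch below is illustrative only: the function name over_threshold and the 70 percent limit in the usage note are assumptions, and it expects the POSIX output format of df -P (one line per file system, usage in column 5, mount point in column 6). Mount points containing spaces are not handled.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount point
# and Use% of every file system at or above the percentage given in $1.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {          # skip the header line
        use = $5
        sub(/%/, "", use)                # "73%" -> "73"
        if (use + 0 >= limit) print $6, $5
    }'
}
```

A typical call would be df -P | over_threshold 70, which lists only the file systems that are at least 70 percent full.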
du Command
The /home directory is typically for /home/admin, but in some cases there is also /home/qns or /home/remote. You can check both.
du
For example:
home# du -hs
[root@lab home]# du -hs
160M .
[root@lab home]# du -hs *
1.3M qns
158M remote
36K testuser
The du command shows where the space is being used. By default, the du command by itself reports disk usage for the directory specified and all subdirectories below it.
Note: By deleting any directories you remove the ability to roll back if for some reason an update is not working correctly. Only delete those updates to which you would probably never roll back, perhaps those 6 months old and older.
Diameter Issues
The following details need to be captured for diameter issues:
• Details of service associated with subscribers in failure case.
• Pcaps capturing calls having issue.
• If the issue is with no response, the pcap should be captured both at CPS and the peer.
• Subscriber trace information can be captured using the following process:
◦ To add the subscriber that needs to be traced:
/var/qps/bin/control/trace_ids.sh -i -d sessionmgr01:/policy_trace
cd /var/qps/bin/control
◦ Run the following command to obtain subscriber information:
/var/qps/bin/control/trace.sh -i -d sessionmgr01/policy_trace
If CPS receives a request message for the same subscriber, the trace result is displayed.
Note: The port number can be found in the “Trace Database” configuration in Cluster-1. If Trace Database is not configured, then by default “Admin Db Configuration” will pick up the trace database.
High CPU Usage Issue
• Thread details and jstack output. These can be captured as follows:
◦ From the top output, see if the java process is taking high CPU.
◦ Capture output of the following command:
ps -C java -L -o pcpu,cpu,nice,state,cputime,pid,tid | sort > tid.log
◦ Capture output of the following command, where <pid> is the pid of the process causing high CPU (as per top output):
If the java process is running as the root user:
jstack <pid> > jstack.log
If the java process is running as the policy server (qns) user:
sudo -u qns "jstack <pid>" > jstack.log
If running the above commands reports an error for a process that is hung or not responding, use the -F option after jstack.
Capture another jstack output as above but with an additional -l option.
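When correlating the two captures, note that ps prints thread IDs (tid) in decimal, while jstack identifies native threads as nid=0x... in hexadecimal. A minimal sketch of the conversion follows. The function name tid_to_nid is hypothetical, and the thread ID in the usage note is only an example value.

```shell
# Hypothetical helper: converts a decimal thread ID from tid.log into the
# "nid=0x..." form that appears in jstack thread dumps.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}
```

For example, grep -- "$(tid_to_nid 4302) " jstack.log searches the dump for the stack of thread 4302. The trailing space inside the quotes avoids matching longer hexadecimal IDs that merely start with the same digits.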
JVM Crash
The JVM generates a fatal error log file that contains the state of the process at the time of the fatal error. By default, the file name has the format hs_err_pid<pid>.log and it is generated in the working directory from which the corresponding java process was started (that is, the working directory of the user when the user started the policy server (qns) process). If the working directory is not known, search the system for files named hs_err_pid*.log and look at the file whose timestamp matches the time of the error.
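The search described above can be sketched with find. This is an illustrative helper, not a CPS script: the function name find_hs_err_logs is hypothetical, and searching from / by default is an assumption that may be slow on large file systems, so a narrower starting directory can be passed as the first argument.

```shell
# Hypothetical helper: lists JVM fatal error logs under the directory in $1
# (default /) in long format, so each file's timestamp can be compared with
# the time of the crash.
find_hs_err_logs() {
    find "${1:-/}" -type f -name 'hs_err_pid*.log' -exec ls -l {} + 2>/dev/null
}
```

For example, find_hs_err_logs /opt restricts the search to one directory tree.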
High Memory Usage/Out of Memory Error
• The JVM can generate a heap dump in case of an out of memory error. By default, CPS is not configured to generate a heap dump. To generate a heap dump, add the following parameters to the /etc/broadhop/jvm.conf file for the different CPS instances present.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp
Note that heap dump generation may fail if the limit for core is not set correctly. The limit can be set in the file /etc/security/limits.conf for the root and policy server (qns) users.
• If no dump is generated but memory usage is high and grows for some time, followed by a reduction in usage (possibly due to garbage collection), the heap dump can be explicitly generated by running the following command:
• If the java process is running as user root:
jmap -dump:format=b,file=<filename> <pid>
• If the java process is running as the policy server (qns) user:
sudo -u qns jmap -dump:format=b,file=<filename> <pid>
Note • Capture this during off-peak hours. In addition, the nice utility can be used to reduce the priority of the process so that it does not impact other running processes.
• Create an archive of the dump for transfer, and make sure to delete the dump/archive after transfer.
• Use the following procedure to log Garbage Collection:
• Log in to the VM instance where GC (Garbage Collection) logging needs to be enabled.
• Run the following commands:
cd /opt/broadhop/qns-1/bin/
chmod +x jmxterm.sh
./jmxterm.sh
> open <host>:<port>
> bean com.sun.management:type=HotSpotDiagnostic
> run setVMOption PrintGC true
> run setVMOption PrintGCDateStamps true
> run setVMOption PrintGCDetails true
> exit
• Revert the changes once the required GC logs are collected.
Issues with Output displayed on Grafana
In case of a Grafana issue, the whisper database output is required:
whisper-fetch --pretty /var/lib/carbon/whisper/cisco/quantum/qps/hosts/*
For example,
whisper-fetch --pretty /var/lib/carbon/whisper/cisco/quantum/qps/dc1-pcrfclient02/load/midterm.wsp
Enable Debug Logs
By default, Cisco recommends keeping the log level at WARN or ERROR. Sometimes, for analysis, more detailed logging is needed. In that case, change the log level based on Cisco's recommendation on a case-by-case basis.
The following are the top-level loggers for which the log level may need to be changed on a case-by-case basis. These loggers must be defined in the /etc/broadhop/logback.xml file.
To make sure that all changes are controlled from one VM, sync all changes made in the Cluster Manager to all other VMs.
SSHUSER_PREFERROOT=true copytoall.sh
For example,
SSHUSER_PREFERROOT=true copytoall.sh /etc/broadhop/logback.xml /etc/broadhop/logback.xml
• For Diameter issues: com.broadhop.diameter2
• For CDR/EDR issues: com.broadhop.policyintel
• For Custom Reference Data issues: com.broadhop.custrefdata
• For Notifications issues: com.broadhop.notifications
• For Session Manager Cache issues: com.broadhop.policy.mdb.cache
• For Control Center issues: com.broadhop.controlcenter
• For Fault Management issues: com.broadhop.faultmanagement
• For LDAP issues: com.broadhop.ldap
• For SPR issues: com.broadhop.spr
• For Unified API issues: com.broadhop.unifiedapi
• For audit issues: com.broadhop.audit
• For policy related issues: com.broadhop.policy
• For any CPS logs issues for which the log level is not overridden by other loggers: com.broadhop
Note: For consolidated logs, make sure that the configuration specified in Control Center is correct to forward logs to OAM (pcrfclient) VMs.
Install SAR Tool
You can install the SAR tool to capture system issues.
Download SAR Tool
ftp://fr2.rpmfind.net/linux/centos/6.7/os/x86_64/Packages/sysstat-9.0.4-27.el6.x86_64.rpm
This package provides the SAR and iostat commands for Linux. SAR and iostat enable system monitoring of disk, network, and other IO activity.
Installation
Step 1 Move the sysstat-9.0.4-27.el6.x86_64.rpm package to /var/tmp on pcrfclient01 and pcrfclient02.
Step 2 SSH to pcrfclient01.
Step 3 Run the following command to install the sysstat package:
rpm -ivh /var/tmp/sysstat-9.0.4-27.el6.x86_64.rpm
Step 4 Change the SAR cron job so that SAR statistics are collected every minute. To do this, open /etc/cron.d/sysstat in an editor and change the following line to remove “/10”:
*/10 * * * * root /usr/lib64/sa/sa1 1 1
So that it looks like this:
* * * * * root /usr/lib64/sa/sa1 1 1
Step 5 To verify that SAR is running and logging, inspect the /var/log/sa directory and verify that the 'sa' log is created. It takes one minute after you make the change in Step 4 for this log to be created.
Step 6 Repeat Step 2 to Step 5 for pcrfclient02.
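The edit in Step 4 can also be made non-interactively, which helps when repeating it on pcrfclient02. The following is a sketch under assumptions: the function name set_sar_every_minute is hypothetical, and the sed pattern assumes the cron line matches the text shown in Step 4 exactly, including single spaces between fields. A backup of the original file is kept with a .bak suffix.

```shell
# Hypothetical helper: rewrites the "*/10" sysstat cron entry so that sa1 runs
# every minute. $1 is the cron file (this guide uses /etc/cron.d/sysstat).
set_sar_every_minute() {
    # Match the exact line from Step 4 and replace the leading "*/10" with "*".
    sed -i.bak 's|^\*/10 \(\* \* \* \* root /usr/lib64/sa/sa1 1 1\)$|* \1|' "$1"
}
```

Running set_sar_every_minute /etc/cron.d/sysstat as root applies the change. If the line in the file is formatted differently, nothing is modified and the file should be edited by hand as described in Step 4.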
Disabling
It is recommended to have SAR installed on the system. It can be used for troubleshooting many issues. In case you do not want to have it installed, use the following steps:
Step 1 To disable the SAR tool, open /etc/cron.d/sysstat in an editor and change the following line:
* * * * * root /usr/lib64/sa/sa1 1 1
So that it looks like this:
#* * * * * root /usr/lib64/sa/sa1 1 1
Step 2 To verify that SAR is no longer gathering statistics, check the /var/log/sa directory and verify that the timestamp on the 'sa' log is not updating.
Frequently Encountered Scenarios
This section lists the following issues that have already been diagnosed and solved.
• Subscriber not Mapped on SCE
• CPS Server Will Not Start and Nothing is in the Log
• Server returned HTTP Response Code: 401 for URL
• com.broadhop.exception.BroadhopException Unable to Find System Configuration for System
• Log Files Display the Wrong Time but the Linux Time is Correct
• JMX Management Beans are not Deployed
• Unable to Access Binding Information
• Error Processing Package, Reference Data Does Not Exist for NAS IP...
• REST Web Service Queries Returns an Empty XML Response for an Existing User
• Error in Datastore: "err" : "E11000 Duplicate Key Error Index
• Error Processing Request: Unknown Action
• Memcached Server is in Error
• Firewall Error: Log shows Host Not Reachable, or Connection Refused
• Unknown Error in Logging: License Manager
• Ecore File is Not Generated:
• Logging Does Not Appear to be Working
• Cannot Connect to Server Using JMX: No Such Object in Table
• File System Check (FSCK) Errors
• CPS: 27717 Mongo Stuck in STARTUP2 after sessionMgr01/2 Reboot
• SR: 628099455 System Failure Errors in Control Center
• Multi-user Policy Builder Errors
• Policy Reporting Configuration not getting updated post CPS Upgrade
• CPS Memory Usage
• Errors while Installing HA Setup
• Enable/disable Debit Compression
• Diameter proxy error in diagnostics.sh output
• Not able to Publish the Policy in Policy Builder
• CPS not sending SNMP traps to External NMS server
• Diameter Peer Connectivity is Down
• Policy Builder Loses Repositories
• Not able to access IPv6 Gx port from PCEF/GGSN
• Bring up sessionmgr VM from RECOVERY state to SECONDARY state
• ZeroMQ Connection Established between Policy Director and other site Policy Server
• Troubleshooting CPS upgrade from existing 7.0
• Diagnose Diameter No Response for Peer Message
• Not able to access Policy Builder
• Graphs in Grafana are lost when time on VMs are changed
• Systems is not enabled for Plugin Configuration
• Publishing is not Enabled
• Collecting MongoDB Information for Troubleshooting
• Added Check to Switch to Unknown Service if Subscriber is deleted Mid Session
• Could not Build Indexes for Table
• Error Submitting Message to Policy Director (lb) during Longevity
• Mismatch between Statistics Count and Session Count
• Disk Statistics not Populated in Grafana after CPS Upgrade
• Re-create Session Shards
• Session Switches from Known to Unknown in CCR-U Request
• Intermittent BSON Object Size Error in createsub with Mongo v3.2.1
• No Traps Generated When Number of Sessions Exceeds the Limit
• RAR Message not Received
• No Response to Diameter Request
• Admin Database shows Problem in Connecting to the Server
• Locale MAC Error
• Sessions Stored in a Single Shard
• Licensing not Throwing Traps or Diagnostic Errors upon Breach
• Corosync Process Taking lot of Time to Unload and is Stuck
• Issue related to Firewall
• CPS Setup cannot Handle High TPS
Subscriber not Mapped on SCE
This issue caused the subscriber to get no mapping on the SCE.
Step 1 Run the following grep to collect every instance of this message (more than 1000 in this case) into a text file:
grep "No member in system" policy.log* > no_member_found.txt
This grep resulted in a file with these lines:
policy.log:2009-07-17 11:00:21,201 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162818
policy.log:2009-07-17 11:02:06,108 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for D02625
policy.log.1:2009-07-17 09:25:29,036 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for D162346
policy.log.1:2009-07-17 09:27:28,718 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162365
policy.log.1:2009-07-17 09:27:37,193 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162365
policy.log.1:2009-07-17 09:27:42,257 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d162365
policy.log.1:2009-07-17 09:38:09,010 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d02116
policy.log.1:2009-07-17 09:38:12,618 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for D163647
policy.log.1:2009-07-17 09:40:42,751 INFO wikiimport:com.broadhop.sme.business.network.accounting.NetworkAccountingUtil No member in system for d102096
Step 2 Use the following awk command to generate a new file that contains only the user name. The command prints the 10th whitespace-separated field of each line:
awk '{print $10}' no_member_found.txt > no_member_found_usernames_with_dupes.txt
Step 3 Run the following command to remove duplicates:
sort no_member_found_usernames_with_dupes.txt | uniq > uniq_sorted_no_member_found_usernames.txt
This resulted in a file with usernames only:
D00059
D00077
D001088
D00112
d001313
D00145
D001452
d00156
D00186
d00198
D00200
d00224
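The three steps above can also be collapsed into a single pipeline. The following is a minimal sketch rather than part of the original procedure: the function name no_member_usernames is hypothetical, and it assumes the user name is always the last field of a matching line, as in the sample lines shown earlier.

```shell
# Hypothetical helper: print the unique user names found in all
# "No member in system" lines of the given log files.
# grep -h drops the "file:" prefix that grep adds when searching several files.
# awk prints the last field, assumed here to be the user name.
# sort -u sorts and removes exact duplicates (LC_ALL=C gives a stable order).
no_member_usernames() {
    grep -h "No member in system" "$@" |
        awk '{ print $NF }' |
        LC_ALL=C sort -u
}
```

A typical call would be no_member_usernames policy.log* > uniq_sorted_no_member_found_usernames.txt. The comparison is case-sensitive, so d162365 and D162365 are reported as two separate names, just as with sort | uniq.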
CPS Server Will Not Start and Nothing is in the Log
If the CPS server does not start (or starts and immediately crashes) and no errors appear in /var/log/broadhop/qns.log to explain why, check the following list:
1 Check /var/log/broadhop/service-qns-1.log
2 Check /etc/broadhop/servers
• There should be an entry in this file for the current host name. (Type 'hostname' in the console window to find the local hostname.)
• There must be a directory that corresponds to the hostname entry and contains the config files. That is, if the servers file has svn01=controlcenter, there must be a /etc/broadhop/controlcenter directory.
3 Attempt to start the server directly from the command line and look for errors.
• Type: /opt/broadhop/qns/bin/qns.sh
• The server should start up successfully and the command line should not return. If the command prompt returns, the server did not start successfully.
• Look for any errors displayed in the console output.
4 Look for OSGi errors
• Look in /opt/broadhop/qns/configuration for a log file. If any exist, examine them for error messages.
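The check in item 2 can be scripted. The sketch below is illustrative only: the function name check_servers_entry and its return codes are hypothetical, and it assumes the servers file holds one hostname=directory pair per line, as in the svn01=controlcenter example.

```shell
# Hypothetical helper: confirm that a host has an entry in the servers file
# and that the directory named by that entry exists.
# Usage: check_servers_entry HOST [SERVERS_FILE] [CONF_ROOT]
check_servers_entry() {
    host=$1
    servers_file=${2:-/etc/broadhop/servers}
    conf_root=${3:-/etc/broadhop}

    # Value of the first "host=value" line whose key equals the host name
    dir_name=$(awk -F= -v h="$host" '$1 == h { print $2; exit }' "$servers_file")

    if [ -z "$dir_name" ]; then
        echo "no entry for $host in $servers_file"
        return 1
    fi
    if [ ! -d "$conf_root/$dir_name" ]; then
        echo "missing directory $conf_root/$dir_name"
        return 2
    fi
    echo "ok: $host -> $conf_root/$dir_name"
}
```

On the server itself, check_servers_entry "$(hostname)" runs the check against the default paths.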
Server returned HTTP Response Code: 401 for URL
A 401 type error means you're not logging in to SVN with proper credentials.
The server won't start and the following appears in the log:
2010-12-10 01:05:26,668 [SpringOsgiExtenderThread-8] ERROR c.b.runtime.impl.RuntimeLoader - There was an error initializing reference data!
java.io.IOException: Server returned HTTP response code: 401 for URL: http://lbvip01/repos/run/config.properties
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313) ~[na:1.6.0_20]
org.springframework.core.io.UrlResource.getInputStream(UrlResource.java:124) ~[org.springframework.core_3.0.0.REL
To fix this error:
• Edit /etc/broadhop/qns.conf
• Ensure that the hostname in the configuration URL matches the hostname in the repository credentials:
-Dcom.broadhop.config.url=http://lbvip01/repos/run/
-Dcom.broadhop.repository.credentials=broadhop/broadhop@lbvip01
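The hostname comparison can be automated with a short script. This is a sketch under stated assumptions, not a CPS tool: the function name config_hosts_match is hypothetical, and it assumes both options appear in qns.conf in the form shown above (a URL with a scheme, and credentials ending in @host).

```shell
# Hypothetical helper: compare the host in -Dcom.broadhop.config.url with the
# host in -Dcom.broadhop.repository.credentials.
# Returns 0 only when both hosts are found and identical.
# Usage: config_hosts_match [QNS_CONF]
config_hosts_match() {
    conf=${1:-/etc/broadhop/qns.conf}

    # Host part of ...config.url=scheme://HOST/...
    url_host=$(sed -n 's|.*-Dcom\.broadhop\.config\.url=[a-z]*://\([^/:" ]*\).*|\1|p' "$conf" | head -n 1)
    # Host part of ...repository.credentials=user/password@HOST
    cred_host=$(sed -n 's|.*-Dcom\.broadhop\.repository\.credentials=[^@]*@\([^/:" ]*\).*|\1|p' "$conf" | head -n 1)

    echo "config.url host: ${url_host:-<not found>}"
    echo "credentials host: ${cred_host:-<not found>}"
    [ -n "$url_host" ] && [ "$url_host" = "$cred_host" ]
}
```

For example, config_hosts_match || echo "hostnames differ" prints both hosts and flags a mismatch.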
com.broadhop.exception.BroadhopException Unable to Find System Configuration for System
Symptoms: The server won't stay started and the log displays this:
com.broadhop.exception.BroadhopException: Unable to find system configuration for system:
The system (and cluster name) that is set up in Quantum Policy Builder must match the one specified in /etc/broadhop/qns.conf. Either add or change it through the Quantum Policy Builder interface and then publish, or update the system and cluster name in /etc/broadhop/qns.conf:
-Dcom.broadhop.run.systemId=poc-system
-Dcom.broadhop.run.clusterId=cluster-1
Log Files Display the Wrong Time but the Linux Time is Correct
If log files or other dates show the incorrect time zone even though the Linux time is set to the proper time zone, the time zone that the JVM reads is most likely incorrect.
Step 1 In /etc/sysconfig, run the command cat clock to see this output:
ZONE="America/Denver"
UTC=false
ARC=false
Step 2 Change the ZONE line to the time zone you want. For instance, to change the JVM time zone to Singapore time, change it to:
ZONE="Asia/Singapore"
UTC=false
ARC=false
The value for ZONE must match a path under /usr/share/zoneinfo. For example, Asia/Singapore corresponds to /usr/share/zoneinfo/Asia/Singapore.
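A quick way to catch a mistyped zone name is to check it against the zoneinfo directory before restarting. The following is a minimal sketch: the function name check_clock_zone and its messages are hypothetical, and it assumes the clock file contains a single ZONE="..." line as shown above.

```shell
# Hypothetical helper: read ZONE from the clock file and confirm that a file
# with that name exists under the zoneinfo directory.
# Usage: check_clock_zone [CLOCK_FILE] [ZONEINFO_DIR]
check_clock_zone() {
    clock_file=${1:-/etc/sysconfig/clock}
    zoneinfo_dir=${2:-/usr/share/zoneinfo}

    # Extract the value of the ZONE line, with or without double quotes
    zone=$(sed -n 's/^ZONE="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$clock_file" | head -n 1)

    if [ -n "$zone" ] && [ -f "$zoneinfo_dir/$zone" ]; then
        echo "ZONE=$zone is valid"
    else
        echo "ZONE=$zone not found under $zoneinfo_dir"
        return 1
    fi
}
```

Running check_clock_zone with no arguments checks /etc/sysconfig/clock against /usr/share/zoneinfo.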
JMX Management Beans are not Deployed
Step 1 Restart the CPS Server. The JMX Beans sometimes are not deployed when features are installed or updated.
Step 2 Run ps -ef | grep java and look for '-javaagent:/opt/broadhop/qns/bin/jmxagent.jar'. If this is absent, you have an old build and need to update.
Step 3 If you have an old build, see the Operations guide for instructions on updating.
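The check in Step 2 can be wrapped in a small function so that it returns a clear yes or no. This is only a sketch: the function name has_jmx_agent is hypothetical, and it simply searches ps -ef style output on standard input for the agent path named above.

```shell
# Hypothetical helper: succeed if the process listing on stdin contains the
# JMX agent option. The escaped dot keeps the grep process itself, if it
# shows up in the listing, from matching its own pattern.
has_jmx_agent() {
    grep -q -e '-javaagent:/opt/broadhop/qns/bin/jmxagent\.jar'
}
```

For example: ps -ef | has_jmx_agent && echo "JMX agent loaded" || echo "JMX agent missing".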
Unable to Access Binding Information
Make sure the binding has been compiled. This error is typically caused by a bad build.
Attempt to upgrade to a newer build.
If you're on a released build, try restarting. There has been a bug that causes web service problems after an update.
2010-10-19 12:05:00,194 [pool-4-thread-1] ERROR c.b.d.impl.DiagnosticController - Diagnostic failed. A problem exists with the system --> Common Services: Feature com.broadhop.ws.service is unabled to start. Error: Error creating bean with name 'org.springframework.web.servlet.mvc.annotation.DefaultAnnotationHandlerMapping#0' defined in URL [bundleentry://27.fwk15830670/META-INF/spring/bundle-ws-context.xml]: Initialization of bean failed; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'subscriberEndpoint' defined in URL [bundleentry://27.fwk15830670/META-INF/spring/bundle-ws-context.xml]: Cannot resolve reference to bean 'jibxMarshaller' while setting bean property 'marshaller'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'jibxMarshaller' defined in URL [bundleentry://27.fwk15830670/META-INF/spring/bundle-ws-context.xml]: Invocation of init method failed; nested exception is org.jibx.runtime.JiBXException: Unable to access binding information for class com.broadhop.ws.impl.messages.RemoveSubscriberProfileRequest
Error Processing Package, Reference Data Does Not Exist for NAS IP...
2010-10-19 13:25:53,481 [pool-11-thread-1] ERROR c.b.u.t.udp.UdpMessageListener - Error processing packet {}
com.broadhop.exception.BroadhopException: Radius reference data does not exist for NAS IP 192.168.180.74 or 10.0.0.52
at com.broadhop.radius.impl.RadiusReferenceData.getRadiusDevice(RadiusReferenceData.java:111) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
at com.broadhop.radius.impl.RadiusReferenceData.getSharedSecret(RadiusReferenceData.java:130) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
at com.broadhop.radius.impl.RadiusMessageListener.getSharedSecret(RadiusMessageListener.java:247) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
at com.broadhop.radius.impl.RadiusMessageListener.processPacket(RadiusMessageListener.java:86) ~[com.broadhop.radius.service_1.0.0.release.jar:na]
at com.broadhop.utilities.transports.udp.UdpMessageListener$1.run(UdpMessageListener.java:192) ~[com.broadhop.utility_5.1.1.r019218.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [na:1.6.0_21]
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) [na:1.6.0_21]
at java.util.concurrent.FutureTask.run(Unknown Source) [na:1.6.0_21]
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) [na:1.6.0_21]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.6.0_21]
at java.lang.Thread.run(Unknown Source) [na:1.6.0_21]
Ensure that this NAS IP has been set up in Cisco Policy Builder under Reference Data > Policy Enforcement Points. If you use an ISG, add it to the ISG Pools folder. Otherwise, add it to the RADIUS Device Pools folder. The IPs that matter are in the 'Devices' table on the ISG Pool object itself.
REST Web Service Queries Returns an Empty XML Response for an Existing User
For example:
Because web service data may need to be returned in multiple ways, the BroadHop Web Service Blueprint doesn't return any XML by default. To fix this issue, configure the 'Default Web Service Query Response' blueprint under the 'BroadHop Web Services' Blueprint.
Error in Datastore: "err" : "E11000 Duplicate Key Error Index
Note: The procedure below removes ALL sessions.
Typically, duplicate keys like this happen when initially configuring policies and switching primary keys. In a production scenario, you may not want to remove all sessions.
Step 1 ssh into sessionmgr01
Step 2 Open the SessionMgr CLI:
/usr/bin/mongo --port 27717
The mongo shell prompt indicates whether this member of the mongo replica set is primary or secondary.
Step 3 Enter the following commands on the MongoDB CLI:
use session_cache;
db.session.remove({});
Step 4 If you get a 'not master' error, log in to sessionmgr02 and repeat the steps.
Error Processing Request: Unknown Action
com.broadhop.policy.impl.RulesPolicyService - Error processing policy request: Unknown action: com.broadhop.pop3auth.actions.IPOP3AuthRequest and RemoteActions are disabled.
If you see an error of the type above, it means that the implementation class it's looking for is not available on the server. This can be caused by:
• The component needed is not installed on the server.
• Ensure that the pop3auth service is installed in your server.
• Look for exceptions in the logs when starting up.
• Try restarting the service bundle (pop3auth service in this case) using the OSGi console and looking at the logs.
Memcached Server is in Error
ERROR c.b.d.impl.DiagnosticController - Diagnostic failed. A problem exists with the system --> Common Services: 2:Memcached server is in error
Step 1 Log on to the server where the policy server (qns) is running.
Step 2 Telnet to the memcache server's IP and port 11211 (for example, telnet lbvip01 11211).
You can find out which memcache server CPS is pointing to in Cisco Policy Builder. Look at: Reference Data > Systems > System Name > Cluster Name.
1 If you cannot telnet to the port, do this:
Make sure memcache is running:
• Log on to the server where memcache is running and run service memcached status:
[root@sessionmgr01 ~]# service memcached status
memcached is stopped
• If the service is stopped, start it:
[root@sessionmgr01 ~]# service memcached start
Starting a new distributed memory caching (memcached) process for 11211:
2 Make sure the firewall configuration is OK:
To check whether this is the problem, temporarily stop the firewall:
/etc/init.d/iptables stop
If it is the problem, add an exception in /etc/sysconfig/iptables. Look at other entries in the file for an example.
After adding an exception, restart iptables: /etc/init.d/iptables restart
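Before stopping the firewall, it can help to see whether an exception for the port already exists. The sketch below is not a CPS tool: the function name port_allowed is hypothetical, and it assumes the rules file uses the usual iptables-save layout, for example -A INPUT -m state --state NEW -m tcp -p tcp --dport 11211 -j ACCEPT.

```shell
# Hypothetical helper: succeed if the rules file has an uncommented ACCEPT
# rule whose destination port is exactly PORT.
# Usage: port_allowed PORT [RULES_FILE]
port_allowed() {
    port=$1
    rules=${2:-/etc/sysconfig/iptables}

    # Drop comment lines, keep rules for this exact port, then look for ACCEPT
    grep -v '^[[:space:]]*#' "$rules" |
        grep -E -e "--dport ${port}( |\$)" |
        grep -q -e '-j ACCEPT'
}
```

For example: port_allowed 11211 || echo "no ACCEPT rule for memcached port 11211". The check only reads the file; it does not show which rules are currently loaded in the kernel.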
Firewall Error: Log shows Host Not Reachable, or Connection Refused
In an HA environment, if connection refused errors appear, stop the firewall by executing
service iptables stop
to see whether the problem is related to the iptables firewall.
Unknown Error in Logging: License Manager
2010-12-12 18:51:32,258 [pool-4-thread-1] ERROR c.b.licensing.impl.LicenseManager - Unknown error in logging
java.lang.NullPointerException: null
at com.broadhop.licensing.impl.LicenseManager.checkFeatures(LicenseManager.java:311) ~[na:na]
This issue may occur if no license has been assigned yet.
Option 1: If this is for development or Proof Of Concept deployments, you can turn on developer mode. This effectively gives you 100 users but is not for use in production.
1 Log in to CPS.
2 Add the following to the /etc/broadhop/qns.conf file:
-Dcom.broadhop.developer.mode=true
3 Restart CPS.
Option 2: Generate a real license. Have your Cisco technical representative send you the Technical Article Tool com.broadhop.licensing.service - Creating a CPS License.
Option 3: If there is a license error in the logs, check the MAC address of the VM and compare it with the MAC address in the license file in /etc/broadhop/license/.
Ecore File is Not Generated:
(Example shown is RADIUS feature)
2010-12-12 18:39:34,075 [SpringOsgiExtenderThread-8] ERROR c.b.runtime.impl.RuntimeLoader - Unable to load class: com.broadhop.refdata.radius.RadiusPackage. Ecore file is not generated http://lbvip01/repos/run/com.broadhop.radius.ecore com.broadhop.radius.ecore
A feature (RADIUS) has been installed in Cisco Policy Builder but is not installed on the server, or the features file being accessed is not the one where the features have been placed.
1 Check whether the feature is installed on your server by running /var/qps/bin/diag/list_installed_features.sh.
2 If the feature is installed, you are probably pointing to (or publishing to) the wrong repository. Check where you are publishing to in Policy Builder and what URL you are pulling from in the /etc/broadhop/qns.conf file.
3 If the feature is not installed, you may be pointing to a different features file than you expect. Do this:
a Log in to the CPS server and find the name of the policy server (qns) you are on.
b Type: hostname
c Check the /etc/broadhop/servers file.
Whatever is listed next to the hostname you are using should also have a directory in the /etc/broadhop directory. That directory is where you should change the features file. By default, qns01 is mapped to policy director (iomanager). Change it to 'pcrf'.
Logging Does Not Appear to be Working
Step 1 Run the JMX command:
/opt/broadhop/qns/bin/jmxcmd.sh ch.qos.logback.classic:Name=default,Type=ch.qos.logback.classic.jmx.JMXConfigurator Statuses
or
Step 2 Access that bean using JMX Term or JConsole to view the status of the Logback Appenders. To access JMX Term, follow these steps:
1 Execute the following script: /opt/broadhop/qns-1/bin/jmxterm.sh
2 If the user does not have permission to execute the command, change the permission using the following command:
chmod 777 /opt/broadhop/qns-1/bin/jmxterm.sh
3 Execute the script again: /opt/broadhop/qns-1/bin/jmxterm.sh
4 Once the command is executed, the JMX terminal opens.
5 Execute the following command to open a connection:
$>open qns01:9045
6 All beans can be listed using the following command:
$>beans
#domain = JMImplementation:
JMImplementation:type=MBeanServerDelegate
#domain = ch.qos.logback.classic:
ch.qos.logback.classic:Name=default,Type=ch.qos.logback.classic.jmx.JMXConfigurator
#domain = com.broadhop.action:
com.broadhop.action:name=AddSubscriberService,type=histogram
com.broadhop.action:name=AddSubscriberService,type=service
com.broadhop.action:name=GetSessionAction,type=histogram
com.broadhop.action:name=GetSessionAction,type=service
com.broadhop.action:name=GetSubscriberActionImpl,type=histogram
com.broadhop.action:name=GetSubscriberActionImpl,type=service
com.broadhop.action:name=LockSessionAction,type=histogram
com.broadhop.action:name=LockSessionAction,type=service
com.broadhop.action:name=LogMessage,type=histogram
com.broadhop.action:name=LogMessage,type=service
com.broadhop.action:name=OCSLoadBalanceState,type=histogram
com.broadhop.action:name=OCSLoadBalanceState,type=service
java.nio:name=mapped,type=BufferPool
#domain = java.util.logging:
java.util.logging:type=Logging
Cannot Connect to Server Using JMX: No Such Object in Table
This is likely because the server's name is not set up in the hosts file with its proper IP address.
In /etc/hosts the hostname (e.g. qns01) SHOULD NOT be aliased to 127.0.0.1 or localhost.
If improperly aliased JMX tells the server it'