Troubleshooting and Health Checks - Arista and Health Checks If you encounter an issue when using...

Configuration Guide CloudVision version 201720 317

Chapter 21

Troubleshooting and Health ChecksIf you encounter an issue when using CloudVision appliance check to see if there are troubleshootingsteps for the issue

bull ldquoTroubleshootingrdquo on page 318

bull ldquoSystem Recoveryrdquo on page 321

bull ldquoHealth Checksrdquo on page 323

bull ldquoResource Checksrdquo on page 324

318 Configuration Guide CloudVision version 201720

Troubleshooting Chapter 21 Troubleshooting and Health Checks

211 TroubleshootingThe following table lists the troubleshooting procedures for known issues

Issue Potential Cause Solution

HBase Master and Tomcat showas NOT RUNNING under thefollowing conditions

bull At the end of a shell-basedinstallation or an ISO-basedinstallation

bull After running cvpi status all

The input NTP and DNS serversare not reachable

Check to see if the NTP and DNS servers specifiedduring the installation are reachable

Fix reachability of NTP and DNS servers and rebootthe CVP VM from the console using the sudo init 6command

If after the reboot the problem persists delete the VMand then re-install it

The Hbase is corrupted Try creating configlets in a test container (withoutdevices) Check to see if they are created

CVP behavior seems affectedimmediately after a rebootfollowing an unplanned powerfailure or an unclean shutdown

Using cvpi staus all showshbase still not running after 15minutes

There are multiple potential causes See ldquoCVP Behavior Change Following Powercyclerdquoon page 319 for details on the potential causes and for troubleshooting steps

After upgrading CVP RunTimeexceptions occur on basicoperations (for example addingdevices to the inventory)

Your browser has cached itemsfor the previous version of CVPthat are not valid for the newversion of CVP

Clear your browserrsquos cache cookies and hosted appdata Then refresh the browser and try again

After installing CVP the cvpi start all command fails withmessages about invalid DNSnames

The CVP host names specifiedare not fully qualified domainnames (FQDN)

A re-installation is required

bull Shell-based FQDNs have to be entered whenprompted for CVP host names

bull ISO-based The CVP host names specified in thecvpyaml file must be FQDNs

CVP redirects you to a URL thatyou do not have access rights toview The URL and message are

bull URL httpltyour cvpgtwebunAuthorised

bull Message ldquoYou do not havesufficient privileges to accessthe specified URL Pleasecontact your administratorrdquo

If you access CVP using httpsand a self-signed certificate thecertificate may have expired butis still cached by your browser

Clear your browserrsquos cache cookies hosted app dataand content licenses

Using cvpi staus all showsjust the cvp-frontend and orcvp-backend components as NOTRUNNING or FAILED

On the primary node execute the following commands cvpi watchdog off cvpi stop cvp and then cvpi start cvp cvpi watchdog on (which may take 5-10minutes to execute) If that doesnt result in all services showing as running see ldquoSystemRecoveryrdquo on page 321 to resolve the issue

Installation process ends withsome CVP services failing to start

Using the cvpi status all command some CVP services have the status of NOTRUNNING

On the primary node execute the following commands cvpi watchdog off cvpi stop all and then cvpi start all cvpi watchdog on (which may take 5-10minutes to execute) If that doesnt result in all services showing as running see ldquoSystemRecoveryrdquo on page 321 to resolve the issue

In a multi-node cluster Zookeeperand Hazelcast exceptions occur

There may be issues withnetwork connectivity qualitybetween nodes

Check both the connectivity between nodes (usingping) as well as the quality of the connectivity betweennodes (for example using ping -f)

Ensure the network connectivity and 100 pass rate ofthat connectivity

Chapter 21 Troubleshooting and Health Checks Troubleshooting


2111 CVP Behavior Change Following Powercycle

You may encounter an unexpected change in the behavior of CVP immediately after a reboot that wasperformed following an unplanned power failure or an unclean shutdown of CVP

For information on the potential causes and details on the troubleshooting steps see

bull ldquoPotential Causesrdquo

bull ldquoConfirming the Causerdquo

bull ldquoTroubleshooting Procedurerdquo on page 320

21111 Potential Causes

The potential causes for CVP behavior changes in this situation include

bull Lease recovery on WAL file fails after power cycleSee httpsissuesapacheorgjirabrowseHDFS-7342 for details

bull Lease on WAL file cannot be released because blocks are replicated in Hadoop

bull Combination of the previous 2 items

21112 Confirming the Cause

The objective of this task is to confirm that the cause is a WAL file lease recovery failure afterpowercycle or a failure to release the WAL file due to blocks being replicated in Hadoop Confirmingthe cause is a simple process that involves reviewing thecvpihbaselogshbase-cvp-master-ltfqdngtlog file

To confirm the cause complete the following steps

Step 1 Open the following log file on primary or secondary node

cvpihbaselogshbase-cvp-master-ltfqdngtlog

Step 2 Go to the last exception in the log (it should be near the end of the log and should have beengenerated within the last 3 minutes of logging activity captured in the log)

Cannot login to CVP

+

System is not synchronizing withntp servers

Nodes are not synchornizing withthe ntpserver and that has leadto a clock skew between thenodes which is more thanallowed by CVP components

Run ntpstat on all nodes Output from all nodesmust say

synchronised to NTP server () at hellip

1 Run service ntpd restart

2 Then wait a few seconds

3 Check ntpstat

4 If time is still not synchronized run

service ntpd stop ntpdate lthostname or IP of an ntpservergt service ntpd start

5 Check ntpstat again

IO slowness issues The disk IO throughput is at anunhealthy level (too low)

Use the cvpi resources command to find outwhether the disk IO throughput is at a healthy level orunhealthy level The disk IO throughput reported inthe command output is measured by the VirtualMachine (See ldquoRunning Health Checksrdquo on page 323for an example of the output of the cvpi resources command)




Step 3 Make sure that the exception in the log file is the same as the exception shown in this table

21113 Troubleshooting Procedure

This procedure provides the troubleshooting steps for situations that meet the conditions specified inthe table above

Pre-requisites

Make sure that you have confirmed the cause (see ldquoConfirming the Causerdquo on page 319)

Complete the following steps to resolve the issue

Step 1 Use the cvpi watchdog off command to disable watchdog

Step 2 Wait 15 minutes for Hadoop to finish replicating blocks

Step 3 Use the cvpi start hbase command to start hbase

Step 4 Use the cvpi status hbase command to verify that hbase is running

Step 5 Do one of the following

bull If hbase is running use the cvpi watchdog on command to re-enable watchdog and thenwait for services to come up

bull If hbase is not running go to system recovery to resolve the issue (see ldquoSystem Recoveryrdquoon page 321)

Related topics




Exception found in log

orgapachehadoopipcRemoteException(orgapachehadoophdfsprotocolAlreadyBeingCreatedException) DIR NameSysteminternalReleaseLease Failed to release lease for file hbaseMasterProcWALsstate-00000000000000000ltnumbergtlogCommitted blocks are waiting to be minimally replicated Try again later

Chapter 21 Troubleshooting and Health Checks System Recovery


212 System RecoverySystem recovery should be used only when the CVP cluster has become unusable and other stepssuch as performing a cvpi watchdog off cvpi stop all and then cvpi start all cvpi watchdog on have failed For example situations in which regardless of restarts a cvpi status allcontinues to show some components as having a status of UNHEALTHY or NOT RUNNING

If a GUI-based backup has been saved while the system was healthy it is possible to redeploy the CVPcluster restore the backup and be at the same state within CVP as when the backup was takenCreating backups on a regular basis is recommended and described in ldquoCreating a Backuprdquo onpage 299

There are two ways to completely recover a CVP cluster

bull ldquoVM Redeploymentrdquo

bull ldquoCVP Re-Install without VM Redeploymentrdquo

Note A good backup is required to proceed with either of these system recoveries

2121 VM Redeployment

Complete these steps

Step 1 Delete all the CVP VMs

Step 2 Redeploy the VMs using the procedures in

Step 3 Issue a ldquocvpi status allrdquo command to ensure all components are running

Step 4 Login to the CVP GUI as lsquocvpadmincvpadminrsquo to set the cvpadmin password

Step 5 From the Backup amp Restore tab on the Setting page restore from the backup using theprocedures in ldquoImporting a Backuprdquo on page 301 and ldquoRestoring Datardquo on page 302

2122 CVP Re-Install without VM Redeployment


Step 1 Run lsquocvpReInstall from the Linux shell of the primary node This may take 15 minutes tocomplete[rootcvp99 ~] cvpReInstall0Log directory is tmpcvpReinstall_17_02_23_01_59_48Existing cvpicvp-configyaml will be backed up herehelliphellipComplete

CVP configuration not backed up please use cvpShell to setup the cluster

CVP Re-install complete you can now configure the cluster


System Recovery Chapter 21 Troubleshooting and Health Checks

Step 2 Re-configure using the procedure in ldquoShell-based Configurationrdquo on page 97 Log into theLinux shell of each node as lsquocvpadminrsquo or lsquosu cvpadminrsquo

Step 3 Issue a cvpi status all command to ensure all components are running



Related topics




Chapter 21 Troubleshooting and Health Checks Health Checks


213 Health ChecksThe following table lists the different types of CVP health checks you can run including the steps touse to run each check and the expected result for each check

2131 Running Health Checks

Run the cvpi resources command to execute a health check on disk bandwidth The output of thecommand indicates whether the disk bandwidth is at a healthy level or unhealthy level The thresholdfor healthy disk bandwith is 20MBS

The possible health statuses are

bull Healthy - Disk bandwidth above 20MBs

bull Unhealthy - Disk bandwidth at or below 20MBs

The output is color coded to make it easy to interpret the output Green indicates a healthy leveland red indicates an unhealthy level (see the example below)

Component Steps to Use Expected Result

Network connectivity ping -f across all nodes No packet loss network is healthy

HBase echo list | cvpihbasebinhbase shell |grep -A 2 row(

Prints an array of tables in Hbase created by CVPHbase and the underlying infrastructure works

All daemons running on allnodes bypass cvpi status all

On all nodes

su - cvp -c ldquocvpijdkbinjpsrdquo

On primary and secondary nodes 9 processesincluding jps

bull 3149 HMasterbull 2931 NameNodebull 2797 QuorumPeerMainbull 12113 Bootstrapbull 3040 DFSZKFailoverControllerbull 2828 JournalNodebull 11840 HRegionServerbull 12332 Jpsbull 2824 DataNode

On tertiary 6 processes

bull 2434 JournalNodebull 4256 HRegionServerbull 2396 QuorumPeerMainbull 2432 DataNodebull 4546 Jpsbull 8243 Bootstrap

Check time is in syncbetween nodes

On all nodes run ldquodate +srdquo UTC time should be within a few seconds of each other(typically less than one second) Up to 10 seconds isallowable


Use the cvpi resources command to find outwhether the disk IO throughput is at a healthy level orunhealthy level The disk IO throughput reported inthe command output is measured by the VirtualMachine

See ldquoRunning Health Checksrdquo on page 323 for anexample of the output of the cvpi resources command


Resource Checks Chapter 21 Troubleshooting and Health Checks

Example

This example shows output of the cvpi resources command In this example the disk bandwidthstatus is healthy (above the 20MBs threshold)

Figure 21-1 Example output of cvpi resources command

Related topics

bull ldquoResource Checksrdquo



214 Resource ChecksCloudVision Portal (CVP) enables you to run resource checks on CVP node VMs You can run checksto determine the current data disk size of VMs that you have upgraded to CVP version 201720 andto determine the current memory allocation for each CVP node VM

Performing these resource checks is important to ensure that the CVP node VMs in your deploymenthave the recommended data disk size and memory allocation for using the Telemetry feature If theresource checks show that the CVP node VM data disk size or memory allocation (RAM) are below therecommended levels you can increase the data disk size and memory allocation

These procedures provide detailed instructions on how to perform the resource checks and if neededhow to increase the CVP node VM data disk size and CVP node VM memory allocation

bull ldquoRunning CVP node VM Resource Checksrdquo

bull ldquoIncreasing Disk Size of VMs Upgraded to CVP Version 201720rdquo on page 325

bull ldquoIncreasing CVP Node VM Memory Allocationrdquo on page 327

2141 Running CVP node VM Resource Checks

CloudVision Portal (CVP) enables you to quickly and easily check the current resources of the primarysecondary and tertiary nodes of a cluster by running a single command The command you use is thecvpi resources command

Use this command to check the following CVP node VM resources

bull Memory allocation

bull Data disk size (storage capacity)

bull Disk throughput (in MB per second)

bull Number of CPUs

Complete the following steps to run the CVP node VM resource check

Step 1 Login to one of the CVP nodes as root

Chapter 21 Troubleshooting and Health Checks Resource Checks


Step 2 Execute the cvpi resources command

The output shows the current resources for each CVP node VM (see Figure 21-2)

bull If the total size of sdb1 (or vdb1) is approximately 120G or less you can increase the disksize to 1TB (see ldquoIncreasing Disk Size of VMs Upgraded to CVP Version 201720rdquo)

bull If the memory allocation is the default of 16GB you can increase the RAM memoryallocation (see ldquoIncreasing CVP Node VM Memory Allocationrdquo)

Figure 21-2 Using the cvpi resource command to run CVP node VM resource checks

2142 Increasing Disk Size of VMs Upgraded to CVP Version 201720

If you already upgraded any CVP node VMs running an older version of CVP to version 201720 youmay need to increase the size of the data disk of the VMs so that the data disks have the 1TB diskimage that is used on current CVP node VMs

CVP node VM data disks that you upgraded to version 201720 may still have the original disk image(120GB data image) because the standard upgrade procedure did not upgrade the data disk imageThe standard upgrade procedure updated only the root disk which contains the Centos image alongwith rpms for CVPI CVP and Telemetry

Note It is recommended that each CVP node have 1TB of disk space reserved for enabling CVP TelemetryIf the CVP nodes in your current environment do not have the recommended reserved disk space of1TB complete the procedure below for increasing the disk size of CVP node VMs

Pre-requisites

Before you begin the procedure make sure that you

bull Have upgraded to version 201720 You cannot increase the data disk size until you havecompleted the upgrade to version 201720 (see ldquoUpgrading CloudVision Portal (CVP)rdquo onpage 304)

bull Have performed the resource check to verify that the CVP node VMs have the data disk size imageof previous CVP versions (approximately 120GB or less) See ldquoRunning CVP node VM ResourceChecksrdquo on page 324

bull Make sure that you perform a GUI-based backup of the CVP system and copy the backup to a safelocation (a location off of the CVP node VMs) The CVP GUI enables you to create a backup youcan use to restore CVP data (see ldquoUsing the GUI to Backup and Restore Datardquo on page 298)



Procedure

Complete the following steps to increase the data disk size

Step 1 Turn off cvpi service by executing the systemctl stop cvpi command on all nodes in thecluster (For a single-node installation run this command on the node)

Step 2 Run the cvpi -v=3 stop all on the primary node

Step 3 Perform a graceful power-off of all VMs

Note You do not need to unregister and re-register VMs from vSphere Client or undefine and redefine VMsfrom kvm hypervisor

Step 4 Do the following to increase the size of the data disk to 1TB using the hypervisor

bull ESX Using vSphere client do the following (see Figure 21-3 for an example)a Select the Virtual Hardware tab and then select hard disk 2b Change the setting from 120GB to 1TBc Click OK

bull KVM Use the qemu-img resize command to resize the data disk from 120GB to 1TB Besure to select disk2qcow2

Figure 21-3 Using vSphere to increase data disk size

Step 5 Power on all CVP node VMs and wait for all services to start

Step 6 Use the cvpi status all command to verify that all the cvpi services are running



Step 7 Run the cvpitoolsdiskResizepy command on the primary node (Do not run thiscommand on the secondary and tertiary nodes)

Step 8 Run the df -h data command on all nodes to verify that the data is increased toapproximately 1TB

Step 9 Wait for all services to start

Step 10 Use the cvpi -v=3 status all command to verify the status of services

Step 11 Use the systemctl status cvpi to ensure that cvpi service is running

Related topics

bull ldquoIncreasing CVP Node VM Memory Allocationrdquo

bull ldquoRunning CVP node VM Resource Checksrdquo on page 324

2143 Increasing CVP Node VM Memory Allocation

If the CVP Open Virtual Appliance (OVA) template currently specifies the default of 16GB of memoryallocated for the CVP node VMs in the CVP cluster you need to increase the RAM to ensure that theCVP node VMs have adequate memory allocated for using the Telemetry feature

Note It is recommended that CVP node VMs have 32GB of RAM allocated for deployments in whichTelemetry is enabled

You can perform a rolling modification to increase the RAM allocation of every node in the cluster Ifyou want to keep the service up and available while you are performing the rolling modification makesure that you perform the procedure on only one CVP node VM at a time

Once you have completed the procedure on a node you repeat the procedure on another node in thecluster You must complete the procedure once for every node in the cluster

Pre-requisites


bull Have performed the resource check to verify that the CVP node VMs have the default RAMmemory allocation of 16GB (see ldquoRunning CVP node VM Resource Checksrdquo on page 324)

bull Make sure that you perform a GUI-based backup of the CVP system and copy the backup to a safelocation (a location off of the CVP node VMs) The CVP GUI enables you to create a backup youcan use to restore CVP data (see ldquoRunning CVP node VM Resource Checksrdquo on page 324)

Procedure

Complete the following steps to increase the RAM memory allocation of the CVP node VMs

Step 1 Login to a CVP node of the cluster as cvp user



Step 2 Using the cvpi status cvp shell command make sure that all nodes in the cluster areoperational

Step 3 Using vSphere client shutdown one CVP node VM by selecting the node in the left pane andthen click the Shut down the virtual machine option



Step 4 On the CVP node VM increase the memory allocation to 32GB by right-clicking the node iconand then choose Edit Settings

The Virtual Machine Properties dialog appears

Step 5 Do the following to increase the memory allocation for the CVP node VM

bull Using the Memory Size option click the up arrow to increase the size to 32GB

bull Click the OK button

The memory allocation for the CVP node VM is changed to 32GB The page refreshesshowing options to power on the VM or continue making edits to the VM properties



Step 6 Click the Power on the virtual machine option

Step 7 Wait for the cluster to reform

Step 8 Once the cluster is reformed repeat step 1 through step 7 one node at a time on each of theremaining CVP node VMs in the cluster

Related topics






211 TroubleshootingThe following table lists the troubleshooting procedures for known issues


HBase Master and Tomcat showas NOT RUNNING under thefollowing conditions

bull At the end of a shell-basedinstallation or an ISO-basedinstallation

bull After running cvpi status all

The input NTP and DNS serversare not reachable

Check to see if the NTP and DNS servers specifiedduring the installation are reachable

Fix reachability of NTP and DNS servers and rebootthe CVP VM from the console using the sudo init 6command

If after the reboot the problem persists delete the VMand then re-install it

The Hbase is corrupted Try creating configlets in a test container (withoutdevices) Check to see if they are created

CVP behavior seems affectedimmediately after a rebootfollowing an unplanned powerfailure or an unclean shutdown

Using cvpi staus all showshbase still not running after 15minutes

There are multiple potential causes See ldquoCVP Behavior Change Following Powercyclerdquoon page 319 for details on the potential causes and for troubleshooting steps

After upgrading CVP RunTimeexceptions occur on basicoperations (for example addingdevices to the inventory)

Your browser has cached itemsfor the previous version of CVPthat are not valid for the newversion of CVP

Clear your browserrsquos cache cookies and hosted appdata Then refresh the browser and try again

After installing CVP the cvpi start all command fails withmessages about invalid DNSnames

The CVP host names specifiedare not fully qualified domainnames (FQDN)

A re-installation is required

bull Shell-based FQDNs have to be entered whenprompted for CVP host names

bull ISO-based The CVP host names specified in thecvpyaml file must be FQDNs

CVP redirects you to a URL thatyou do not have access rights toview The URL and message are

bull URL httpltyour cvpgtwebunAuthorised

bull Message ldquoYou do not havesufficient privileges to accessthe specified URL Pleasecontact your administratorrdquo

If you access CVP using httpsand a self-signed certificate thecertificate may have expired butis still cached by your browser

Clear your browserrsquos cache cookies hosted app dataand content licenses

Using cvpi staus all showsjust the cvp-frontend and orcvp-backend components as NOTRUNNING or FAILED

On the primary node execute the following commands cvpi watchdog off cvpi stop cvp and then cvpi start cvp cvpi watchdog on (which may take 5-10minutes to execute) If that doesnt result in all services showing as running see ldquoSystemRecoveryrdquo on page 321 to resolve the issue

Installation process ends withsome CVP services failing to start

Using the cvpi status all command some CVP services have the status of NOTRUNNING

On the primary node execute the following commands cvpi watchdog off cvpi stop all and then cvpi start all cvpi watchdog on (which may take 5-10minutes to execute) If that doesnt result in all services showing as running see ldquoSystemRecoveryrdquo on page 321 to resolve the issue

In a multi-node cluster Zookeeperand Hazelcast exceptions occur

There may be issues withnetwork connectivity qualitybetween nodes

Check both the connectivity between nodes (usingping) as well as the quality of the connectivity betweennodes (for example using ping -f)

Ensure the network connectivity and 100 pass rate ofthat connectivity




















Cannot login to CVP

+







3 Check ntpstat












Pre-requisites










Related topics
































Related topics


















On all nodes













Example



Related topics
















bull Number of CPUs














Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics























Cannot login to CVP

+







3 Check ntpstat












Pre-requisites










Related topics
































Related topics


















On all nodes













Example



Related topics
















bull Number of CPUs














Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics









Pre-requisites










Related topics
































Related topics


















On all nodes













Example



Related topics
















bull Number of CPUs














Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics






























Related topics


















On all nodes













Example



Related topics
















bull Number of CPUs














Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics


















On all nodes













Example



Related topics
















bull Number of CPUs














Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics






Example



Related topics
















bull Number of CPUs














Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics















Pre-requisites







Procedure



















Related topics








Pre-requisites




Procedure




















Related topics






Procedure



















Related topics








Pre-requisites




Procedure




















Related topics











Related topics








Pre-requisites




Procedure




















Related topics




Troubleshooting and Health Checks - Arista and Health Checks If you encounter an issue when using...

Documents

Transcript of Troubleshooting and Health Checks - Arista and Health Checks If you encounter an issue when using...