VMworld 2013: Troubleshooting at Cox Communications with VMware vCenter Log Insight and vCenter...

Post on 15-Jul-2015

Troubleshooting at Cox Communications with VMware vCenter Log Insight and vCenter Operations Management Suite

Chris Nakagaki, Cox Communications

Jason Davis, Cox Communications

Himanshu Kumar Singh, VMware

VCM5034

#VCM5034

Press Start

Player 1!

x3

World vCOPs

Agenda

Background

Why vCOPs and Log Insight?

vCOPs

Capacity Planning Demo

Custom Dashboarding Demo

HeatMaps Demo

Log Insight – What is it? How did Cox use it?

Storage Deeper Dive Demo

VM Backup Failures Demo

Q&A

How to Play

Background

Cox Communications, Inc. (Atlanta)

100+ Hosts, 3000+ VMs

2800+GHz Compute Capacity

13.5 TB Memory Capacity

200TB SAN Storage

Chris Nakagaki

vExpert 2011, 2012, 2013

10 years @ Cox Communications

Started w/ ESX 2.5

@zsoldier

Jason Davis

15 years Windows Experience

12 years @ Cox Communications

Started w/ ESX 2.0

Credits?

Why vCOPs and Log Insight?

!?

Dynamic Thresholds (vCOPs)

Easy Deployment (vCOPs/Log Insight)

Capacity Planning (vCOPs)

Cloud Suite Cost Savings (vCOPs)

Log Aggregation (Log Insight)

Pretty Pictures (vCOPs)

Because we like to have a strong upper and lower body.

vCOPs – Is there capacity?

1UP!

Network switch maintenance

Multiple hosted production VMs potentially affected

Can we place affected hosts in maintenance mode and maintain uptime?

Conclusion:

Yes, there is capacity

Network maintenance can proceed

Demonstrated:

vCOPs Capacity Planning Tool

The bottleneck is disk space, nothing else

VMs can continue to run
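The question on this slide ("can we place affected hosts in maintenance mode and maintain uptime?") comes down to simple arithmetic: does the load of every VM still fit on the hosts that stay up? The sketch below is only an illustration of that arithmetic, not how the vCOPs Capacity Planning Tool computes it. The `Host` fields, the `can_enter_maintenance` name, and the 10% headroom are assumptions, and the check covers only aggregate CPU and memory (it ignores disk space and per-VM placement).

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_ghz: float       # total CPU capacity of the host
    mem_gb: float        # total memory capacity of the host
    cpu_used_ghz: float  # CPU demand of the VMs currently on the host
    mem_used_gb: float   # memory demand of the VMs currently on the host


def can_enter_maintenance(hosts, evacuate, headroom=0.10):
    """Return True if the hosts left running can absorb the whole cluster load.

    hosts:    list of Host objects (hypothetical inventory data)
    evacuate: set of host names to place in maintenance mode
    headroom: fraction of remaining capacity to keep free (assumed value)
    """
    remaining = [h for h in hosts if h.name not in evacuate]
    if not remaining:
        return False  # nothing left to run the VMs on

    usable = 1.0 - headroom
    cpu_capacity = sum(h.cpu_ghz for h in remaining) * usable
    mem_capacity = sum(h.mem_gb for h in remaining) * usable

    # All VMs keep running, so total demand is summed over every host.
    cpu_demand = sum(h.cpu_used_ghz for h in hosts)
    mem_demand = sum(h.mem_used_gb for h in hosts)

    return cpu_demand <= cpu_capacity and mem_demand <= mem_capacity


# Made-up example cluster: four identical hosts, each half loaded.
example_hosts = [Host(f"esx{i}", 28.0, 128.0, 14.0, 64.0) for i in range(4)]
print(can_enter_maintenance(example_hosts, {"esx0"}))
```

A real decision would also account for storage, admission control, and affinity rules, which this sketch leaves out.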

vCOPs - How do we monitor streaming servers?

Sim Infrastructure

Live streaming event w/ CEO and CTO

Monitor VMs associated w/ streaming service live!

Key Metrics?

CPU

Memory

Network

Conclusion:

vCOPs custom dashboarding is useful!

We demonstrated:

Grouping all streaming VMs as an application object

Creating a custom dashboard

Focused on 3 Key Metrics

Health Tree to show who’s being lazy

vCOPs – Why are VMs slow?

POW!

Receiving reports that VMs are performing slowly.

No immediately discernible pattern

vCOPs to the rescue!

Determined that one array had severe latency.

Now questions arise around VMware NMP (Native Multipathing Plug-in)

To Log Insight for deeper analysis…

Why vCOPs AND Log Insight?

Chicken Legs

Log Insight – What is it?

Continue?

5 4 3 2 1 0

We Interrupt This Program

To Bring You An Important Message…

Introducing: VMware vCenter Log Insight

Himanshu Singh

Senior Product Marketing Manager

Enterprise and Cloud Management, VMware

Problem: Operate and Troubleshoot a Complex System

VMware Logs

OS and App Logs

Physical Infrastructure Logs

VMware’s Approach to Log Management

1. Extend Analytics to Log Data

• With vC Ops, VMware introduced an analytics-based operations management solution for structured data (metrics, KPIs, events, alerts)

• Log Insight extends our analytics-based approach to logs and unstructured, machine-generated data

2. Easy to Use and Accessible

• Existing solutions are highly specialized and often too expensive

• Log Insight has an intuitive, easy-to-use interface

• A predictable pricing model with an unlimited amount of log data makes it accessible to all

3. Optimized for VMware Environments

• Log Insight comes with built-in knowledge and native support for vSphere

• Integration with vC Ops maximizes ROI and value, providing a complete cloud operations management solution

VMware Cloud Ops Mgmt = Log Insight and vCenter Operations

Cloud Operations Management

• vCenter Log Insight and vCenter Operations complement each other

• Delivers best-of-breed capabilities for performance, capacity, and configuration management

• Tight integration enables seamless transition from monitoring to troubleshooting

• Log Insight and vC Ops together provide a complete solution for Cloud Operations Management

Key vCenter Log Insight Use Cases

IT Operations

• Troubleshooting and Root Cause Analysis

Having observed a problem (e.g. slowness), troubleshoot it and identify the part of the stack that is responsible (e.g. network delay vs. storage)

Follow the trail from vC Ops to logs to get to the root cause of an observed problem

• Monitoring Using Logs

Monitor metrics and events (performance & change) that are visible only in logs

Collect all the data in one place without the need for custom parsing or transformation of data

Security and Compliance

• Security Forensics

• Comprehensive Audit (who, when) / Compliance Reporting

Business Transaction Monitoring

• Collect and correlate transaction logs with infrastructure performance

Integration with vCenter Operations

Automated correlation of performance and log data

Announcing: Log Insight Content Pack Marketplace

Extend vCenter Log Insight with Content Packs from:

And more…

https://solutionexchange.vmware.com/store/loginsight

And Now…

Back To Regularly Scheduled Programming…

Player 2!

x3

World Log Insight

Log Insight – Was Round Robin causing issues?

Were paths being marked dead?

Were the paths remaining dead?

Did the paths come back when expected?

LET’S SEE ….

Leeroy Jenkins!

Conclusion:

No, round robin was not causing issues!

We Demonstrated:

Paths were marked DEAD.

Paths remained DEAD.

Paths came back ON when expected.
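The three questions asked of the logs (were paths marked dead, did they stay dead, did they come back when expected) can be answered by pairing each "dead" event with the next "on" event for the same path. The sketch below assumes the path events have already been extracted into `(timestamp, path, state)` tuples; that tuple shape, the state strings, and the function names are hypothetical, and no actual ESXi log format or Log Insight query is implied.

```python
from datetime import datetime


def path_outages(events):
    """Pair each path's "dead" event with its next "on" event.

    events: iterable of (timestamp, path, state), state is "dead" or "on"
            (a hypothetical, already-parsed representation of the log lines)
    Returns (outages, still_dead):
      outages:    list of (path, went_dead_at, came_back_at)
      still_dead: dict mapping path -> went_dead_at for paths that never returned
    """
    dead_since = {}
    outages = []
    for ts, path, state in sorted(events):
        if state == "dead":
            # Keep the first "dead" timestamp if the state is logged repeatedly.
            dead_since.setdefault(path, ts)
        elif state == "on" and path in dead_since:
            outages.append((path, dead_since.pop(path), ts))
    return outages, dead_since


def outside_window(outages, start, end):
    """Return the outages that began before or ended after the expected window."""
    return [o for o in outages if o[1] < start or o[2] > end]


# Made-up example: one path drops and recovers inside a maintenance window.
example_events = [
    (datetime(2013, 8, 1, 2, 5), "pathA", "dead"),
    (datetime(2013, 8, 1, 2, 40), "pathA", "on"),
]
example_outages, example_still_dead = path_outages(example_events)
```

An empty `still_dead` together with an empty result from `outside_window` corresponds to the conclusion on this slide: paths went dead, stayed dead during the work, and came back when expected.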

Leeroy Jenkins!

Log Insight – What’s causing VM backup failures?

NetBackup has snapshot errors (status code 156).

Symantec HOWTO70949 article states there are multiple possible causes.

Which is the most probable cause?

Does VMware have correlating logs?

LET’S SEE …

Paku-Man?

Conclusion:

The most probable cause is an inability to create VM snapshots due to timeouts.

We Demonstrated:

Correlating errors in VMware logs stating: “The guest OS has reported an error during quiescing.”

VMware KB 1018194 provides additional troubleshooting steps:

Reboot the VM

Reduce I/O

Etc ….
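The correlation shown in this demo (backup failures on one side, quiescing errors in the VMware logs on the other) amounts to matching events for the same VM within a short time window. The sketch below is a rough illustration of that idea outside Log Insight. Only the quoted error text comes from the slide; the tuple layouts, the `correlate` name, and the 15-minute window are assumptions.

```python
from datetime import datetime, timedelta

# Error text quoted on the slide.
QUIESCE_TEXT = "The guest OS has reported an error during quiescing."


def correlate(failures, log_entries, window=timedelta(minutes=15)):
    """Split backup failures by whether a quiescing error was logged nearby.

    failures:    list of (timestamp, vm_name) for failed backups (hypothetical shape)
    log_entries: list of (timestamp, vm_name, message) (hypothetical shape)
    window:      how close in time a log entry must be to count (assumed value)
    Returns (matched, unmatched) lists of failures.
    """
    matched, unmatched = [], []
    for fail_ts, vm in failures:
        hit = any(
            log_vm == vm
            and QUIESCE_TEXT in message
            and abs(log_ts - fail_ts) <= window
            for log_ts, log_vm, message in log_entries
        )
        (matched if hit else unmatched).append((fail_ts, vm))
    return matched, unmatched


# Made-up example: one failure has a matching quiescing error, one does not.
example_failures = [
    (datetime(2013, 8, 1, 1, 0), "vm-app01"),
    (datetime(2013, 8, 1, 1, 30), "vm-db02"),
]
example_logs = [
    (datetime(2013, 8, 1, 0, 58), "vm-app01", QUIESCE_TEXT),
]
example_matched, example_unmatched = correlate(example_failures, example_logs)
```

If most failures land in `matched`, that supports quiescing problems as the most probable cause among those the backup vendor lists.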

Paku-Man?

Q&A

42

Other VMware Activities Related to This Session

HOL:

HOL-SDC-1301

Applied Cloud Operations

Group Discussions:

VCM1002-GD, VCM1004-GD

Cloud Operations with Hicham Mourad or Sam McBride

Breakout Session – repeat by demand:

VCM4528 – Thursday, 2 pm Moscone West, room 3001

Tips and Tricks with vCenter Log Insight

Follow us:

@VMLogInsight and get 5 free licenses

Hang with us:

Booth 2020 – Cloud Management Lounge

VCM5034

THANK YOU
