Storage Troubleshooting with VC Ops 5


Page 1: Storage Troubleshooting with VC Ops 5

© 2010 VMware Inc. All rights reserved

Confidential

Storage Troubleshooting with VC Ops 5

Page 2: Storage Troubleshooting with VC Ops 5

Things we want to know in performance

The storage team has requested greater visibility

• Joint troubleshooting, capacity planning, performance monitoring.

• Is there any storage bottleneck? If yes, where?

• You need to know both the Big Picture and the details

Your needs:

• Be able to quickly tell the overall workload

• Be able to quickly tell which VMs are generating the big IOPS.

• Be able to tell the total IOPS generated by all VMs, and see a chart showing whether there is a spike.

You want to know the 3 dimensions

• IOPS: Read, Write, Read/Write Ratio, Total IOPS

• Latency: Read, Write, Total

• Throughput
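To make the needs above concrete, here is a minimal sketch of how total IOPS, the read/write ratio and the "top talker" VMs could be derived from per-VM read and write IOPS. The VM names and numbers are made up for illustration, and the `VmIops` class is a hypothetical helper, not something provided by vSphere or VC Ops.

```python
from dataclasses import dataclass


@dataclass
class VmIops:
    name: str     # hypothetical VM name
    read: float   # read IOPS
    write: float  # write IOPS

    @property
    def total(self) -> float:
        # Total IOPS = reads + writes
        return self.read + self.write

    @property
    def read_write_ratio(self) -> float:
        # Reads per write; infinite if the VM issued no writes
        return self.read / self.write if self.write else float("inf")


def top_talkers(vms, n=3):
    """Return the n VMs generating the most total IOPS."""
    return sorted(vms, key=lambda vm: vm.total, reverse=True)[:n]


def total_iops(vms):
    """Overall workload: total IOPS summed across all VMs."""
    return sum(vm.total for vm in vms)


# Made-up sample values, for illustration only
sample = [VmIops("vm-a", 120, 40), VmIops("vm-b", 900, 300), VmIops("vm-c", 10, 5)]

print([vm.name for vm in top_talkers(sample, n=2)])  # ['vm-b', 'vm-a']
print(total_iops(sample))                            # 1375
print(sample[1].read_write_ratio)                    # 3.0
```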

This is Level 300 material. I’m assuming you’re hands-on with both vSphere 5 and VC Ops 5.

This is based on vSphere 5.0.1 and VC Ops 5.0.1

Please read speaker notes.

Page 3: Storage Troubleshooting with VC Ops 5

The challenges

Your environment

Production Site:

• 500 server VMs, 3000 desktop VMs

• 2 vCenters, 80 ESXi hosts, 10 clusters, 60 datastores, 6 RDMs

• 50 physical servers (mostly UNIX)

• You use VMFS on FC and NFS on 10 GE

• 2 storage arrays: 1 high end, 1 midrange

DR Site:

• Let’s not talk about this. Production is complex enough already!

Page 4: Storage Troubleshooting with VC Ops 5

Storage counters: ESXi host

Datastore

Disk

Storage Adapter or Storage Path

Page 5: Storage Troubleshooting with VC Ops 5

ESXi: Adapter, Device and Path

1 adapter can access many Devices (LUNs). 1 Device can be accessed via many paths.

1 path can only access 1 Device.
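The adapter, device and path relationships above can be sketched as a small lookup structure. This is only an illustration: the LUN names are hypothetical, and the path names merely follow the usual ESXi runtime-name pattern (vmhbaN:C#:T#:L#).

```python
from collections import defaultdict

# Each path belongs to exactly one adapter and reaches exactly one device.
# All values below are made up for illustration.
paths = {
    "vmhba2:C0:T0:L1": ("vmhba2", "LUN-1"),
    "vmhba2:C0:T1:L1": ("vmhba2", "LUN-1"),
    "vmhba3:C0:T0:L1": ("vmhba3", "LUN-1"),
    "vmhba2:C0:T0:L2": ("vmhba2", "LUN-2"),
    "vmhba3:C0:T0:L2": ("vmhba3", "LUN-2"),
}

devices_by_adapter = defaultdict(set)  # 1 adapter -> many devices
paths_by_device = defaultdict(set)     # 1 device  -> many paths

for path, (adapter, device) in paths.items():
    devices_by_adapter[adapter].add(device)
    paths_by_device[device].add(path)

print(sorted(devices_by_adapter["vmhba2"]))  # ['LUN-1', 'LUN-2']
print(len(paths_by_device["LUN-1"]))         # 3
```

Because the dictionary is keyed by path, the "1 path can only access 1 Device" rule holds by construction.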

Page 6: Storage Troubleshooting with VC Ops 5

ESXi: Disk

Page 7: Storage Troubleshooting with VC Ops 5

ESXi: Adapter, Device and Path

[Diagram of the ESXi 5.0 storage stack. Labels: Datastore (VMFS, NFS); RDM; Disk; Storage Path; Storage Adapter 1 and Storage Adapter 2 (vmhba2, vmhba3); vmnic.]

Page 8: Storage Troubleshooting with VC Ops 5

Storage counters: VM

[Diagram of the storage counter groups for a VM. Labels: VM; Drive 1, Drive 2, Drive 3; Virtual Disk (VMDK, RDM); vDisk; scsi0:0, scsi0:2; Datastore (VMFS, NFS); RDM; Disk.]

Page 9: Storage Troubleshooting with VC Ops 5

VC Ops has 4 groups of Storage metrics for a VM

Which counters do you take? There are so many of them. Say you want Write Latency. Which one do you take: Virtual Disk, Datastore, Disk, or Storage?

I’ll try to answer in the next few slides.

If you want to know now, the counters marked with a black arrow are the ones I think we should use.

? Not sure what this is

IOPS counters

Other counters

Latency counters

Throughput counters

Why only at Disk level?

? Not sure what this is

These don’t exist in vCenter. RDM?

Don’t use

Page 10: Storage Troubleshooting with VC Ops 5

VM: Storage

Page 11: Storage Troubleshooting with VC Ops 5

Comparing VC Ops with vCenter

Datastore shows the metric for this VM only, not for every VM in that datastore. Datastore figures will be higher if your VM has a snapshot.

Disk = the physical LUN backing the datastore. If the datastore has no extents, then Disk = Datastore.

Where does the Storage counter come from, as there is no Storage group in vCenter? vCenter only has Datastore, Disk and Virtual Disk, as shown in this screenshot. If you know, let me know.

Page 12: Storage Troubleshooting with VC Ops 5

VC Ops has 2 groups of Storage metrics for a Datastore

I’m not sure of the difference between Max Observed and Highest Observed.

Which counters do you take? There are so many of them. Say you want Write Latency. Which one do you take: Virtual Disk, Datastore, Disk, or Storage?

I’ll try to answer in the next few slides.

IOPS counters

Other counters

Latency counters

Throughput counters

VMFS datastore NFS datastore

Page 13: Storage Troubleshooting with VC Ops 5

VC Ops has 4 groups of Storage metrics for an ESXi host

Which counters do you take? There are so many of them. Say you want Write Latency. Which one do you take: Virtual Disk, Datastore, Disk, or Storage?

I’ll try to answer in the next few slides.

IOPS counters

Other counters

Latency counters

Throughput counters

Page 14: Storage Troubleshooting with VC Ops 5

VC Ops: Storage metrics from Cluster up to World

Notice that the group is called Disk, not Storage. I was hoping for Storage as it is more intuitive.

For IOPS or Throughput, it is the sum of all components (e.g. all VMs in that vCenter).

For Latency, I’m not sure if it is an average or the max. If it is a max, that would be an awesome Super Metric!

IOPS counters

Other counters

Latency counters

Throughput counters

Cluster Datacenter World vCenter
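Why the average-versus-max question for Latency matters can be seen in a small sketch. The per-VM values below are hypothetical, and this is not how VC Ops is known to roll up its counters; it only contrasts the possible aggregations.

```python
from statistics import mean

# Hypothetical per-VM samples for one collection interval
vm_iops = {"vm-a": 150, "vm-b": 1200, "vm-c": 30}
vm_latency_ms = {"vm-a": 2.0, "vm-b": 3.0, "vm-c": 40.0}

# IOPS (like Throughput) is additive, so a parent object can simply sum it
cluster_iops = sum(vm_iops.values())

# Latency is not additive: an average hides the one slow VM, a max exposes it
cluster_latency_avg = mean(vm_latency_ms.values())
cluster_latency_max = max(vm_latency_ms.values())

print(cluster_iops)         # 1380
print(cluster_latency_avg)  # 15.0
print(cluster_latency_max)  # 40.0
```

With these numbers the average looks tolerable while one VM is suffering 40 ms, which is why a max-based rollup would be the more useful one for troubleshooting.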

Page 15: Storage Troubleshooting with VC Ops 5

Storage counters at VC level

Page 16: Storage Troubleshooting with VC Ops 5

Storage counters at World level

Page 17: Storage Troubleshooting with VC Ops 5

Part 1: IOPS

Page 18: Storage Troubleshooting with VC Ops 5

Page 19: Storage Troubleshooting with VC Ops 5

Same data, but on 1 chart

Page 20: Storage Troubleshooting with VC Ops 5

Page 21: Storage Troubleshooting with VC Ops 5

vCenter: performance chart

This is the object name. In this case, this is a VM and its name is vCenter5

This one tells us that it is the Datastore group, and it is showing Past day data (last 24 hours)

Page 22: Storage Troubleshooting with VC Ops 5

Same VM & timeline, but from the Disk counter.

Page 23: Storage Troubleshooting with VC Ops 5

vCenter Ops might aggregate differently than vCenter

Same info, but this time from vCenter Ops. They are similar, but not identical. Is this because of the way VC Ops aggregates?

Read peaks at 245 in vCenter vs 217 in VC Ops. That is around 11% lower in VC Ops (put the other way, vCenter is around 13% higher).

Write peaks at 137 vs 135. This is close enough.
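The gap between the two tools depends on which value is used as the baseline. A quick check with the peak values quoted above, using a hypothetical helper `pct_lower`: measured against the vCenter figure, VC Ops reads about 11.4% lower; a figure of roughly 13% appears only when the VC Ops value is the baseline.

```python
def pct_lower(reference: float, observed: float) -> float:
    """How much lower `observed` is than `reference`, as a percentage of `reference`."""
    return (reference - observed) / reference * 100


# Peak read IOPS: vCenter (reference) vs VC Ops (observed)
print(round(pct_lower(245, 217), 1))  # 11.4

# Peak write IOPS: vCenter vs VC Ops
print(round(pct_lower(137, 135), 1))  # 1.5

# Same read gap, expressed relative to the VC Ops value instead
print(round((245 - 217) / 217 * 100, 1))  # 12.9
```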

Page 24: Storage Troubleshooting with VC Ops 5

IOPS: Snapshot causes real IOPS penalty

This is from the Virtual Disk counters.

173 reads at Virtual Disk translate into 245 reads at Datastore. This is about 40% more.

70 writes at Virtual Disk translate into 137 writes at Datastore. This is almost 200% of the original, i.e. nearly double!

So a snapshot can cause much higher IOPS.
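The snapshot penalty can be expressed as an amplification factor: Datastore IOPS divided by Virtual Disk IOPS. The sketch below uses the figures from this slide; the function name `amplification` is just an illustrative helper. Note that a factor of about 1.96 means the Datastore sees almost 200% of the Virtual Disk writes, which is roughly 96% more.

```python
def amplification(virtual_disk_iops: float, datastore_iops: float):
    """Return (ratio, extra_pct): Datastore IOPS relative to Virtual Disk IOPS."""
    ratio = datastore_iops / virtual_disk_iops
    extra_pct = (ratio - 1) * 100
    return ratio, extra_pct


reads = amplification(173, 245)
writes = amplification(70, 137)

print(f"reads:  x{reads[0]:.2f} (+{reads[1]:.0f}%)")    # reads:  x1.42 (+42%)
print(f"writes: x{writes[0]:.2f} (+{writes[1]:.0f}%)")  # writes: x1.96 (+96%)
```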

Page 25: Storage Troubleshooting with VC Ops 5

Again, the same gap remains between vCenter and VC Ops.

Page 26: Storage Troubleshooting with VC Ops 5

Page 27: Storage Troubleshooting with VC Ops 5

IOPS: Conclusion

Use the Datastore counter for vmdk

• The Virtual Disk counter is useful if you are comparing with actual IOPS issued at Guest OS level. It will be too low if you have a snapshot.

• The Storage counter = Virtual Disk

• The Disk counter is useful if you are discussing with the Storage team, who are showing you LUN by LUN metrics. Disk = LUN.

• It is not useful if your datastore spans multiple LUNs due to Extent.

• In most cases, Disk = Datastore as you should avoid Extent.

Use the Disk counter for RDM

VC Ops counter may differ from vCenter

• If the number looks strange, check with vCenter.

• Sometimes the data in vCenter itself is wrong.

• Check a few VMs, not just 1.

Page 28: Storage Troubleshooting with VC Ops 5

Part 2: Latency

Page 29: Storage Troubleshooting with VC Ops 5

VM level: Total Latency

Page 30: Storage Troubleshooting with VC Ops 5

VM Level: Read Latency

Page 31: Storage Troubleshooting with VC Ops 5

Page 32: Storage Troubleshooting with VC Ops 5

Page 33: Storage Troubleshooting with VC Ops 5

Page 34: Storage Troubleshooting with VC Ops 5

Avoid the counter “Datastore | Highest Latency”

Page 35: Storage Troubleshooting with VC Ops 5

Page 36: Storage Troubleshooting with VC Ops 5

Page 37: Storage Troubleshooting with VC Ops 5

Data at VC Ops

Page 38: Storage Troubleshooting with VC Ops 5

Total Latency ≠ Read Latency + Write Latency

Page 39: Storage Troubleshooting with VC Ops 5

View at Datastore level

Page 40: Storage Troubleshooting with VC Ops 5

Latency: Conclusion

Use the Datastore counter for vmdk

• The Virtual Disk counter is useful if you are comparing with actual IOPS issued at Guest OS level. It will be too low if you have a snapshot.

• The Storage counter = Virtual Disk

• The Disk counter is useful if you are discussing with the Storage team, who are showing you LUN by LUN metrics. Disk = LUN.

• It is not useful if your datastore spans multiple LUNs due to Extent.

• In most cases, Disk = Datastore as you should avoid Extent.

Use the Disk or Virtual Disk counter for RDM

VC Ops counter may differ from vCenter

• If the number looks strange, check with vCenter.

• Sometimes the data in vCenter itself is wrong.

• Check a few VMs, not just 1.

Page 41: Storage Troubleshooting with VC Ops 5

Latency: Conclusion

Do not use the Total Latency counter

• When creating a super metric, manually add the Read and the Write.

Use the Datastore counter for vmdk

Use the Disk counter for RDM

VC Ops counter may differ from vCenter

• If the number looks strange, check with vCenter.

• Sometimes the data in vCenter itself is wrong.

• Check a few VMs, not just 1.
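The advice to add Read and Write manually, rather than trusting Total Latency, can be sketched as follows. All sample values are hypothetical, and `reported_total_ms` is only a stand-in for whatever the built-in Total Latency counter returns.

```python
# Hypothetical latency samples (ms) for one VM over three intervals
read_ms = [2.0, 3.5, 12.0]
write_ms = [1.0, 4.0, 20.0]
reported_total_ms = [1.5, 3.0, 9.0]  # stand-in for the Total Latency counter

# Manual total: add Read and Write for each interval
manual_total_ms = [r + w for r, w in zip(read_ms, write_ms)]

# Gap between the manual sum and the reported total
gap_ms = [m - t for m, t in zip(manual_total_ms, reported_total_ms)]

print(manual_total_ms)  # [3.0, 7.5, 32.0]
print(gap_ms)           # [1.5, 4.5, 23.0]
```

Charting the gap for a few VMs is one way to confirm, in your own environment, that Total Latency does not equal Read plus Write.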

Page 42: Storage Troubleshooting with VC Ops 5

Part 3: Throughput

Page 43: Storage Troubleshooting with VC Ops 5

Throughput counters for VM

Page 44: Storage Troubleshooting with VC Ops 5

Throughput counters for VM

Page 45: Storage Troubleshooting with VC Ops 5

Same VM, vastly different data

Page 46: Storage Troubleshooting with VC Ops 5

Page 47: Storage Troubleshooting with VC Ops 5

Throughput: Conclusion

Use the Datastore counter for vmdk

• The Virtual Disk counter is useful if you are comparing with actual IOPS issued at Guest OS level. It will be too low if you have a snapshot.

• The Storage counter = Virtual Disk

• The Disk counter is useful if you are discussing with the Storage team, who are showing you LUN by LUN metrics. Disk = LUN.

• It is not useful if your datastore spans multiple LUNs due to Extent.

• In most cases, Disk = Datastore as you should avoid Extent.

Be careful with the Disk counters, as they can report large numbers

• vCenter: Disk | Disk Throughput usage

• VC Ops: Disk | IO Usage capacity

VC Ops counter may differ from vCenter

• If the number looks strange, check with vCenter.

• Sometimes the data in vCenter itself is wrong.

• Check a few VMs, not just 1.

Page 48: Storage Troubleshooting with VC Ops 5

Part 4: Other Interesting Metrics

Page 49: Storage Troubleshooting with VC Ops 5

Built-in Super Metric?

The 3 charts below show the summary at World level

• The actual world is on the right. It has 5 vCenters

Page 50: Storage Troubleshooting with VC Ops 5

Other interesting metrics

Page 51: Storage Troubleshooting with VC Ops 5

Page 52: Storage Troubleshooting with VC Ops 5

vCenter “equivalent” dashboard

Page 53: Storage Troubleshooting with VC Ops 5

Capacity

You have 1000 VMs on 50 datastores.

max([$This:M180/14,$This:M1978/$This:M1977*0.8])

which translates to:

max([This Resource: summary|total_number_vms/14, This Resource: capacity|used_space/This Resource: capacity|total_capacity*0.8])

This means: show me all datastores where either the number of attached VMs is more than 14 or the space left is less than 20%. The result is above 1 when either condition holds, assuming the second term is read as used_space/(total_capacity*0.8).

You can imagine how great it can look on a heatmap
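The logic of this super metric can be sketched outside VC Ops as follows. This is an illustration under stated assumptions, not the product's evaluation rules: the function name, datastore names and numbers are hypothetical, and the space term is written as used / (total × 0.8) so that a score above 1 matches the stated intent of "less than 20% free". If the expression were instead evaluated strictly left to right as (used / total) × 0.8, the space term could never exceed 1, so it is worth checking how the formula is parsed in your environment.

```python
def capacity_score(vm_count, used_space, total_capacity,
                   max_vms=14, max_used_fraction=0.8):
    """Return the larger of the two normalized terms; above 1 means 'flag it'."""
    vm_term = vm_count / max_vms
    space_term = used_space / (total_capacity * max_used_fraction)
    return max(vm_term, space_term)


# Hypothetical datastores: (number of VMs, used space, total capacity)
datastores = {
    "ds-01": (10, 500, 1000),  # within both limits
    "ds-02": (20, 300, 1000),  # more than 14 VMs
    "ds-03": (5, 900, 1000),   # less than 20% space left
}

flagged = [name for name, args in datastores.items() if capacity_score(*args) > 1]
print(flagged)  # ['ds-02', 'ds-03']
```

Because both terms are normalized so that 1.0 means "at the limit", the same score can drive a single heatmap color scale.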