CLARiiON Performance Analysis


  • Analysis and Report prepared by Ankita Maharana

    Customer: Freescale Semiconductor Inc

    Service Request Number: 66459368

    Array ID: APM00070402012

    EMC2 Storage Performance Analysis CLARiiON


  • EMC2 page|1

    Navisphere Archive file analysis

    The following slides are based on performance information contained in the NAR file(s) covering the following period:

    Start time (GMT): 10/10/2014 18:20:33 (DST not included)
    End time (GMT): 10/15/2014 02:10:33 (DST not included)

    The times on the graphs are in GMT (UTC). This needs to be taken into account when comparing graphs to the local time zone in which they were recorded.

    Total number of polls: 1240

    This value is extremely important as it puts the figures shown in the report into perspective. This NAR file had an archiving interval of 5 minutes, so the points on the charts (polls) are averages over this period. A detailed report of the performance of each LUN would need to be done by the consultants in EMC Technical Solutions / Professional Services, but we have done a general analysis for this array to show where problems are occurring.
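    As a quick sanity check (not part of the original report), the poll count can be reconciled with the capture window and the GMT timestamps converted to a local zone; the 5-minute interval comes from the NAR above, while the America/Chicago zone is only an assumed example.

    ```python
    from datetime import datetime, timedelta, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # Capture window taken from the NAR header above (times are GMT/UTC).
    start = datetime(2014, 10, 10, 18, 20, 33, tzinfo=timezone.utc)
    end   = datetime(2014, 10, 15, 2, 10, 33, tzinfo=timezone.utc)
    interval = timedelta(minutes=5)                  # NAR archiving interval

    expected_polls = (end - start) // interval + 1   # both endpoints are polled
    print(end - start, expected_polls)               # 4 days, 7:50:00 -> 1247 polls
    # The NAR reports 1240 polls, so a handful of intervals appear to be missing.

    # Converting a GMT timestamp to a local zone (America/Chicago is only an example).
    print(start.astimezone(ZoneInfo("America/Chicago")))
    ```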

  • EMC2 page|2

    Problem Description

    Performance Analysis

    Configuration Analysis

    The SP Collects have been checked and there are no significant faults.

    The CX3 series FLARE needs to be upgraded to 3.26.80.5.032 (NAS code permitting) to fix a number of stability issues. See the following ETA / EMC KB Articles: emc253359 and emc263053, and the latest release notes (available on support.emc.com) for a full list of the issues that have been fixed. From 30 June 2013, the only supported FLARE version for the CX4 series will be Release 30.

  • EMC2 page|3

    Best Practice

    SP Dirty Pages % should remain between the high and low watermarks (the defaults are 80% and 60% respectively)

    LUN Utilization < 70% at peak times and around 50% on average

    LUN Response Times < 20ms

    LUN Queue Length (or the Average Busy Queue Length) < 10 IOs

    LUN Forced Flushes near 0

    Disks should ideally stay below their throughput saturation points:

      o 3500 IOPS for Enterprise Flash Drives (EFD)
      o 180 IOPS for 15k rpm FC or SAS drives
      o 140 IOPS for 10k rpm FC or SAS drives
      o 90 IOPS for 7.2k rpm NL-SAS drives
      o 80 IOPS for 7.2k rpm SATA2 drives
      o 60 IOPS for 7.2k rpm SATA drives
      o For very large block sequential I/O with multiple streams/threads, each spinning hard disk drive can provide up to 30 MB/s of bandwidth.

    The figures above are rules of thumb for drive performance, not absolute maximums. The IOPS values are based on a random small-block (8 KB) I/O size and will decrease as the I/O size increases.
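    As an illustration only, the rules of thumb above could be captured in a small script that flags metrics breaching them; the metric names and sample values below are hypothetical, not taken from this NAR.

    ```python
    # Hypothetical helper: flag metrics that breach the rule-of-thumb limits above.
    BEST_PRACTICE = {
        "lun_utilization_pct": 70,     # peak; around 50% on average
        "lun_response_time_ms": 20,
        "lun_queue_length": 10,
        "disk_iops_15k_fc_sas": 180,
        "disk_iops_10k_fc_sas": 140,
        "disk_iops_7k2_sata": 60,
    }

    def breaches(metrics: dict) -> list[str]:
        """Return the thresholds exceeded in one polling interval."""
        return [
            f"{name}: {value} exceeds {BEST_PRACTICE[name]}"
            for name, value in metrics.items()
            if name in BEST_PRACTICE and value > BEST_PRACTICE[name]
        ]

    print(breaches({"lun_response_time_ms": 35, "lun_queue_length": 4}))
    # ['lun_response_time_ms: 35 exceeds 20']
    ```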

    Block front-end application workload is translated into a different back-end disk workload based on the RAID type in use:

      o For reads (no impact of RAID type):
        1 application read I/O = 1 back-end read I/O

      o For small block random writes:
        RAID 1/0 - 1 application write I/O = 2 back-end write I/O
        RAID 5 - 1 application write I/O = 4 back-end disk I/O (2 read I/O + 2 write I/O)
        RAID 6 - 1 application write I/O = 6 back-end disk I/O (3 read I/O + 3 write I/O)

      o For large block sequential writes (doing full-stripe writes):
        RAID 1/0 - 1 application write I/O = 2 back-end write I/O
        RAID 5 - 1 application write I/O = 1 + 1/N back-end disk I/O (where N is the number of drives in the RAID Group)
        RAID 6 - 1 application write I/O = 1 + 2/N back-end disk I/O (where N is the number of drives in the RAID Group)
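    A minimal sketch of the translation above, for sizing purposes; the workload mix and RAID Group size in the example are hypothetical.

    ```python
    # Back-end disk IOPS implied by a front-end read/write mix (rules listed above).
    def backend_iops(read_iops, write_iops, raid, n_drives=5, full_stripe=False):
        write_penalty = {
            ("1/0", False): 2,                 # small random writes
            ("5",   False): 4,
            ("6",   False): 6,
            ("1/0", True): 2,                  # large sequential, full-stripe writes
            ("5",   True): 1 + 1 / n_drives,
            ("6",   True): 1 + 2 / n_drives,
        }[(raid, full_stripe)]
        return read_iops + write_iops * write_penalty   # reads are always 1:1

    # Example: 1000 read + 500 write IOPS of small random I/O on a 4+1 RAID 5 group.
    print(backend_iops(1000, 500, "5"))   # 3000 back-end IOPS
    # Dividing by the per-drive figures above (e.g. 180 IOPS for a 15k FC drive)
    # suggests roughly 17 such drives to stay under the saturation point.
    ```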

  • EMC2 page|4

    Storage Terms Used:

    Average Seek Distance: a measure of the distance the disk heads move, expressed in GB. The higher the number, the greater the head movement, which may indicate a lot of random I/O and is bad for response times. High seek distances, combined with heavy I/O, are an indication of drive contention.

    Throughput: a measurement of data being exchanged, in the number of Inputs/Outputs per Second (IOPS).

    Bandwidth: a measure of the data being moved, in MB/s.

    Response time: the time interval between the issuing of a request and the fulfilment of the request. This is a value calculated from IOPS and Queue Length; under 20 ms is considered good. One caution: with low IOPS (under 50) and a low Queue Length (under 2-3), Response Times can be artificially high.
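    One way to picture that calculation is Little's law (response time roughly equals queue length divided by throughput); this is a simplification, not necessarily Analyzer's exact formula.

    ```python
    # Little's law approximation: response time = queue length / throughput.
    def response_time_ms(queue_length, iops):
        return queue_length / iops * 1000 if iops else float("inf")

    print(response_time_ms(10, 2000))   # 5 ms  -> healthy on a heavily used LUN
    print(response_time_ms(2, 50))      # 40 ms -> looks "high" on a nearly idle LUN,
                                        #          the artificial case cautioned above
    ```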

    Utilization: a measure of how busy the SP, LUNs, disks or RAID Groups are. By itself, this is not an indication of a problem, unless it is SP utilization and both SPs are more than 60% utilized, which would lead to serious performance problems if one SP fails.

    EMC KB Article: these are knowledgebase articles which can be found on the EMC support website (see Bibliography below).

    Queue Length: a measure of how many I/Os are backing up in the queue; under 10 is a good number.

    Dirty Pages: a measure of the amount of writes to a LUN that are held in the write cache (and not yet written to disk).

    Force Flushing: this occurs when the write cache is full and cannot take any more I/Os until it flushes (de-stages) the writes to the disks. This causes response times and queue lengths to go up and affects other LUNs on the array. Try to avoid this. See EMC KB Article 10256.

  • EMC2 page|5

    Bibliography

    KB or Knowledge Base Articles (also known as EMC KB Articles or Primus Articles) are referenced in this performance analysis document and can be accessed via EMC Support (https://support.emc.com/search/). Articles often have two numbers due to the migration to the new Knowledgebase; the older number has the prefix emc, e.g. emc297178. It should be possible to search for either number on EMC Support and find the same article. These articles can be accessed on the EMC Support page (https://support.emc.com/search/) by selecting the Knowledgebase option in Scope by Resource and entering the article reference number, e.g. emc207795 or 10256.

    Refer to the EMC support site for Performance and Availability Best Practice for details on configuring the VNX / CLARiiON for optimum performance.

    General troubleshooting documents can also be found on the EMC support product pages (https://support.emc.com/products).

  • EMC2 page|6

    SP Heat Map

    This details the overloaded areas / components within this array.

    In Pools, the load moves rapidly between slices, so look for heavily loaded tiers rather than the drives themselves.

    There is a reporting error where some drives mistakenly report 100% utilization and this can be cleared by following EMC KB Article 12357.

    This information can be monitored in Unisphere Analyzer by selecting Performance Detail (or Performance Summary) for the drives in each Pool.

    Figure: Performance Survey of the array component utilization levels

  • EMC2 page|7

    SP Utilization

    The Unisphere Analyzer equivalent of this graph would be Performance Detail SP Utilization (%).

    An average utilization of 50% is acceptable, but response times increase exponentially as utilization approaches 100%.
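    To illustrate why the curve steepens near 100%, here is a generic single-server queuing approximation (not the CLARiiON's actual model) with a hypothetical 5 ms service time:

    ```python
    # Simple M/M/1-style illustration: response time ~ service time / (1 - utilization).
    service_time_ms = 5.0   # hypothetical time to service one request when idle

    for util in (0.5, 0.7, 0.9, 0.95, 0.99):
        print(f"utilization {util:4.0%}: ~{service_time_ms / (1 - util):6.1f} ms")
    # 50% -> 10 ms, 90% -> 50 ms, 99% -> 500 ms: the curve climbs sharply past ~70%.
    ```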

    Another important consideration is availability. The recommended SP utilization level is around 50% because, in the event of an SP reboot / failure, only one SP will be managing the entire array.

    See EMC KB Article 11592 for methods to reduce SP utilization.

    Figure: Graph of Performance SP Utilization (%)

  • EMC2 page|8

    Write Cache Usage

    The Unisphere Analyzer equivalent of this graph is contained in the Performance Overview.

    There are times at which the write cache on one or both SPs is 100% full, which will cause the SP's write cache to force flush.

    This will lead to high response times for all LUNs on that SP at those times.

    Forced flushing occurs when the write cache reaches 100% full (see EMC KB Article 10256).

    Figure: Graph of write cache full percentage for each SP (%) and forced flushing
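    A toy model of the mechanism (illustrative only, with made-up cache size and rates; not CLARiiON internals): watermark flushing starts at the high watermark, and forced flushing begins once incoming writes outpace de-staging and the cache hits 100%.

    ```python
    CACHE_MB, HIGH_WM, LOW_WM = 1000, 0.80, 0.60   # hypothetical cache and watermarks

    def forced_flush_seconds(write_mb_s, destage_mb_s, seconds=300):
        cache, flushing, forced = 0.0, False, 0
        for _ in range(seconds):
            cache += write_mb_s                      # host writes land in cache
            if cache / CACHE_MB >= HIGH_WM:
                flushing = True                      # watermark flushing kicks in
            if flushing:
                cache = max(cache - destage_mb_s, 0)
                if cache / CACHE_MB <= LOW_WM:
                    flushing = False
            if cache > CACHE_MB:                     # cache full: writes must wait
                forced += 1
                cache = CACHE_MB
        return forced

    print(forced_flush_seconds(40, 60))   # 0  -> de-staging keeps up, no force flush
    print(forced_flush_seconds(40, 30))   # >0 -> sustained forced flushing
    ```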

  • EMC2 page|9

    SP Port Bandwidth

    The I/O is not evenly balanced across the front-end ports.

    Load balancing software, such as PowerPath, will also help to make sure the load is spread evenly across all available front-end ports.

    Figure: Graph of SP Front-end Port Bandwidth (MB/s)

  • EMC2 page|10

    LUNs with highest throughput per SP (SP Usage)

    These LUNs have the highest throughput per SP, although not necessarily the highest bandwidth.

    The second line on each graph (y axis on the right-hand side) shows the cumulative percentage of the SP Utilization that is accounted for by these LUNs.

    Figure: Graph of Peak Throughput per LUN (based on highest 5% of polled intervals)

  • EMC2 page|11

    RAID Group performance overview

    These graphs show the twenty busiest RAID Groups in descending order of performance load.

    Multiple graphs are included to show the throughput (I/O per second), bandwidth (MB/s), utilization and response time.

    Figures: Performance summary graphs for the standard RAID Groups (not Private RAID Groups for Pools and FAST Cache)

  • EMC2 page|12

  • EMC2 page|13

  • EMC2 page|14

    LUN performance overview: Total Throughput

    Figure: Performance summary graph of the twenty busiest LUNs from both standard RAID Groups and Virtual Provisioning Pools

  • EMC2 page|15

  • EMC2 page|16

    LUN performance overview: Total Bandwidth

    Figure: Total bandwidth graph of the twenty busiest LUNs from both standard RAID Groups and Virtual Provisioning Pools

  • EMC2 page|17

    LUN performance overview: Response Time

    Figure: Response Time summary graph of the twenty busiest LUNs from both standard RAID Groups and Virtual Provisioning Pools

  • EMC2 page|18

    Conclusions

    There are too many heavily loaded LUNs here to do an in-depth analysis of each of them within a break/fix Service Request.

    The NAR file only covers a short period, so only performance problems which occurred during this time will be apparent in this analysis.

    Improper load balancing: currently SP A owns 7 LUNs and SP B owns 31 LUNs.

    At times there is a large amount of forced flushing, which will cause high write response times for LUNs on all RAID Groups / Pools.

    Adding an Analyzer enabler to this array would make it possible to track these issues from Unisphere.

    Recommendations

    Upgrade FLARE to 3.26.80.5.032.

    Use Analyzer (Block Statistics in Unisphere) to generate live graphs, so these issues can be managed when they occur. To monitor the write cache and FAST cache levels, refer to EMC KB Articles 12351 and 15606 respectively.

    IMPORTANT NOTE:

    Detailed advice on LUN configuration and reconfiguration is outside the scope of what can be dealt with by EMC Technical Support, who deal with break/fix issues. EMC Professional Services would be able to provide a more in-depth performance review and provide detailed recommendations if necessary. In this case I will contact your local Account Service Representative (ASR) and copy your EMC District Service Manager, to let them know that this particular issue may require additional assistance. If you require help with resolving the configuration issues, then Professional Services should be engaged.