EMC Data Domain de-Duplication 2011

Wikibon is a professional community solving technology and business problems through an open source sharing of free advisory knowledge.

EMC Data Domain De-duplication Performance Introduction - Wikibon
4/24/2012 http://wikibon.org/wiki/v/EMC_Data_Domain_De-duplication_2011



Last Update: Jan 25, 2011 | 03:28 | Viewed 6511 times | Community Rating: 5 | Originating Author: David Floyer


Contents

1 Data Domain De-duplication Performance Introduction
  1.1 High-End De-duplication Performance Improvements
  1.2 Technology and Functionality Components of Performance
  1.3 Impact of DDBoost
  1.4 Performance Comparison with Other De-duplication Technologies
  1.5 Performance Discussion
  1.6 Performance Conclusions

2 EMC Archiver makes its Debut
  2.1 Introduction to DD Archiver
  2.2 Archiving and Big Data
  2.3 DD Archiver Business Case

3 Overall Conclusions

Data Domain De-duplication Performance Introduction

On January 18th 2011, EMC refreshed the complete Data Domain product family. The greatest improvement came at the top end, where the performance of the EMC Global De-duplication Array (GDA) was doubled to a maximum inline de-duplication rate of 26 terabytes/hour. In addition, the range was filled out with systems reaching down to 0.5 terabytes/hour, as shown in Table 1. Wikibon compiled Table 1 from data on EMC.com and from EMC product announcement presentations.


Table 1 - EMC Inline De-duplication Performance Comparisons. Sources: EMC.com downloads 1/18/2011 and EMC Product Presentations 1/18/2011

High-End De-duplication Performance Improvements

The Data Domain GDA takes two nodes and creates a single image that shares the same directory, with the processing load shared across the two nodes. The advantage of this setup is that backup job streams do not have to be dedicated to a specific system. If a problem occurs in one of two separate backup streams and that stream has to be rerun, the backup window is extended by 100%. When the extra workload is shared across a two-node GDA, the backup window is extended by only 50%. In general, global systems smooth out the peaks and valleys of multiple single nodes and are more efficient.

The negative impact of global de-duplication is the same as that of any multi-processor architecture: interference between the two nodes creates a performance overhead, and the greater the sharing and updating of data, the greater the overhead. In the first generation of GDA with DD880 nodes, the EMC GDA achieved a 2 Node/1 Node ratio of 1.45, which is honest but not a good figure. With the new generation GDA using DD890 nodes, the ratio is a much more respectable 1.79 (data from the last column of Table 1 above). The improved ratio adds 23% of performance (1.79/1.45) and opens up the possibility of extending the number of nodes in a Data Domain GDA in the future.
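The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the rerun model assumes two equal-sized backup streams with the rerun work split evenly across the nodes, which the discussion above implies but does not state.

```python
def scaling_gain(old_ratio: float, new_ratio: float) -> float:
    """Relative gain from improving the 2-node/1-node throughput ratio."""
    return new_ratio / old_ratio - 1.0


def window_extension(nodes_sharing_rerun: int) -> float:
    """Fractional backup-window extension when one stream must be rerun.

    Assumption (not from the source data): each node normally handles one
    stream of equal size, and the rerun work is split evenly across the
    nodes that are able to share it.
    """
    return 1.0 / nodes_sharing_rerun


# Ratios quoted in the text: first-generation GDA (DD880) vs. new GDA (DD890).
gain = scaling_gain(old_ratio=1.45, new_ratio=1.79)
print(f"Scaling gain from 1.45 to 1.79: {gain:.0%}")

# Separate systems: the rerun lands on one node. GDA: it is shared by two.
print(f"Separate nodes, window extension: {window_extension(1):.0%}")
print(f"Two-node GDA, window extension:   {window_extension(2):.0%}")
```

Under the same even-split assumption, a GDA with more nodes would shrink the extension further, which is one reason the improved scaling ratio matters for future node counts.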

Technology and Functionality Components of Performance

Table 2 was derived by Wikibon from an in-depth analysis of Table 1. It breaks out the contribution of different technologies and functionality to the improvement in performance, from the DD880 with no DDBoost or 10GE connection all the way through to the GDA with DD890 nodes and all the functionality and technology improvements. The net result is a nearly fivefold (487%) improvement in performance in the last 12 months, a truly impressive record. Table 2 shows the components and their contribution. The contributions are multiplicative; for example, the Original DD880 GDA improvement is 50%, so if that component is present the base performance is multiplied by 1.5. If the DD880 to DD890 GDA improvement is present, the performance figure is multiplied by an additional 1.23. This table can be used to produce approximate estimates of mixes of technologies and functionalities.

Table 2 - EMC Data Domain Functionality and Technology Improvement Contribution Breakout. Sources: Table 1 and Wikibon analysis, 2011

The processor technology updates led to a 50% improvement in the DD890 nodes compared with the previous generation DD880 nodes. The improvement from using 10 Gigabit Ethernet over the previous 1 Gigabit Ethernet is 11%.
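As a rough illustration of how the multiplicative contributions combine, the sketch below multiplies the factors quoted in the text, including the DDBoost figure of about 63% from the Impact of DDBoost section. The dictionary keys and the function name are hypothetical labels rather than EMC or Wikibon terminology. Because the quoted percentages are rounded, the full product lands close to, but not exactly on, the 487% figure.

```python
from math import prod

# Improvement factors as quoted in the text (rounded percentages).
# The key names are illustrative labels only.
COMPONENTS = {
    "original_dd880_gda": 1.50,   # 50%
    "dd880_to_dd890_gda": 1.23,   # 23%
    "processor_update": 1.50,     # 50%
    "ten_gig_ethernet": 1.11,     # 11%
    "ddboost": 1.63,              # about 63%
}


def combined_multiplier(present):
    """Multiply the factors of every component present in a configuration."""
    return prod(COMPONENTS[name] for name in present)


# All components present: GDA with DD890 nodes, 10GE, and DDBoost.
all_in = combined_multiplier(COMPONENTS)

# A single DD890 node with 10GE and DDBoost, relative to the base DD880.
single_node = combined_multiplier(
    ["processor_update", "ten_gig_ethernet", "ddboost"]
)

print(f"All components: x{all_in:.2f}")
print(f"Single DD890 with 10GE and DDBoost: x{single_node:.2f}")
```

Estimates for other mixes follow the same pattern: list only the components that apply to the environment and multiply their factors.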

Impact of DDBoost

Not all the components are available in all environments. Data centers that use NetBackup in an OST Ethernet environment have the greatest performance potential, as DDBoost can improve Data Domain throughput by about 63%. It should be remembered that moving work from the Data Domain environment to the backup server carries an overhead, both in cost and in a potential increase in elapsed time. Figure 1 below shows the topology and workflow for DDBoost in conjunction with NetWorker.

Figure 1 - Data Domain Topology and Workflow for NetWorker with DDBoost.


Source: Extracted from an EMC GoogleDoc downloaded 1/18/2011 from https://docs.google.com/viewer?url=http://www.datadomain.com/pdf/h7504-dd-boost-networker-so.pdf&pli=1

The GDA with DDBoost is also supported, with benchmarks, in a NetBackup environment. EMC has announced support for the EMC NetWorker products from EMC’s Legato acquisition, but no benchmark results are available at the moment. Wikibon will update the chart when they become available.

Performance Comparison with Other De-duplication Technologies

Table 3 below takes the highest performing EMC products and compares them against other available inline and post-process de-duplication systems. The EMC results in this set of benchmarks reflect the maximum performance of NetBackup.

Table 3 - EMC & non-EMC De-duplication Performance Comparisons. Source: EMC data from Table 1. Non-EMC data from Wikibon Data De-duplication Performance Tables. Format and Metrics derived from an original table created by Curtis Preston (BackupCentral), downloaded 1/18/2011 from http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/348-tar.

The non-EMC data in Table 3 reflects the vendor claims and is not normalized against any assumptions of workload, backup package, or environment. The data is sorted by daily backup capacity, which is derived from the inline backup speed (TB/hr) times 24 for inline solutions, or from the post-process de-duplication speed times 24 for post-process solutions. Although the ingest rate of a post-process system is significantly faster than its de-duplication speed and can decrease the backup window, the de-duplication speed determines the amount that can be processed in a given period of time, before the next backup cycle is due. This metric does not reflect the total backup performance across different workload types. For example, a TSM incrementals-forever environment, where the amount of data presented to the Data Domain server is much lower (and the de-duplication impact much lower), could lead to very different estimates of overall backup performance. However, the chart does serve as a first-level approximation of the performance range of different systems.
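The daily backup capacity metric used for sorting can be written out as a small calculation. The sketch below is illustrative: the function name is hypothetical, the 26 TB/hr figure is the GDA maximum quoted earlier in this article, and the post-process numbers are invented placeholders, not data from Table 3.

```python
HOURS_PER_DAY = 24


def daily_backup_capacity_tb(dedup_tb_per_hour: float) -> float:
    """Daily capacity in TB: the de-duplication rate sustained for 24 hours.

    For inline systems the de-duplication rate is the inline backup speed.
    For post-process systems it is the post-process de-duplication speed,
    not the (faster) ingest rate.
    """
    return dedup_tb_per_hour * HOURS_PER_DAY


# Inline example: the GDA maximum of 26 TB/hr quoted in the introduction.
gda_daily = daily_backup_capacity_tb(26.0)

# Hypothetical post-process system (placeholder numbers for illustration):
# fast ingest, slower de-duplication. Only the latter sets daily capacity.
ingest_tb_per_hour = 10.0
post_process_tb_per_hour = 4.0
post_daily = daily_backup_capacity_tb(post_process_tb_per_hour)

print(f"GDA daily capacity: {gda_daily:.0f} TB/day")
print(f"Hypothetical post-process daily capacity: {post_daily:.0f} TB/day")
```

The hypothetical ingest rate is deliberately unused in the capacity calculation: it shortens the backup window but does not raise the amount that can be de-duplicated before the next cycle.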

Performance Discussion

The interesting conclusion from this chart is that the highest performance systems are now inline systems, a complete turnabout from a few years ago, when inline systems were simpler to run but slower. The fastest system is the NEC HydraStor, which, although it occupies 11 frames, has a truly impressive benchmark result four times greater than that of the closest competitor. Large organizations with aggressive RPO requirements should consider this product, which is new to the U.S. marketplace. The next two products are very close: the EMC GDA using DD890 nodes and the Symantec NetBackup 5000. Fourth in the list is the EMC single-node DD890. The first post-process system is 10 times slower than the NEC inline system.

Also missing in action is the inline IBM ProtecTIER system, which used to be a front-runner in large installations. IBM will need an extremely aggressive technology refresh for ProtecTIER to reclaim its previous place.

Performance Conclusions

EMC has executed extremely well in both market penetration and the introduction of major improvements in de-duplication performance. It has a very strong portfolio of products which perform well and cover the complete market (with the possible exception of a very high-end machine).

De-duplication for backup has now been completely accepted as standard practice by the industry. There is now a strong drive for de-duplication of primary storage, which NetApp was the first to introduce with its ASIS product. EMC and many other vendors have announced primary de-duplication systems in their storage arrays.

The only potential cloud in the Data Domain sky is the emergence of more modern backup software models that take space-efficient consistent snaps and copy the changes to the disk-based backup software. The backup software is designed to allow very flexible and fast recovery of (say) an email account in Exchange or particular files for a user. De-duplication is performed on a continuous basis against much smaller amounts of data than in traditional backup systems. Replication is built in, and very good RPO and RTO service levels can be achieved with much less implementation effort than in traditional systems. EMC/Data Domain will need to guard against the temptation to milk the cash cow of traditional backup systems while failing to recognize, and be ready with, backup and de-duplication products that exploit these new approaches.

EMC Archiver makes its Debut


Introduction to DD Archiver

When EMC bought Data Domain in 2009, one of the most attractive features of the company was the lack of friction in implementing a Data Domain solution. The backup software was still the same, and the disks were still seen as a tape library. The only difference was an extra step inserted inline into the process, and the output of that step was de-duplicated to disk. This meant that the recovery data was available online and recovery was much faster than from any tape library. Installations could hold (say) 90 days or more of data from backups and recover the data within minutes instead of hours.

What happens after this time is a little less sophisticated. The data is re-hydrated and moved back out to tape libraries, either in-house or off-site. The most common archiving implementation is to keep these backup tapes as the company archive, with a compliance tick placed in the box.

As part of this announcement, EMC has introduced the Data Domain Archiver. Instead of re-hydrating the data to tape, the data is migrated down to an archiver that keeps the data de-duplicated and includes the de-duplication metadata. The DD Archiver has the minimum of controller power (similar to a DD860) and the maximum of storage space (raw capacity 768 terabytes). Because the data is still de-duplicated, the amount of disk storage is minimized. When it is full, the archiver is designed to be locked down with retention locks and encryption. The box could then be shut down; EMC is also hinting at spin-down techniques.

Archiving and Big Data

There are a number of potential use cases for this technology. The simplest is to use this as a lower-cost migration tier and hold the backup data for longer, say one year. For data centers that are likely to need access to this data for operational or compliance reasons, keeping the data longer on a DD archiver will make sense.

The more interesting question is: Can a single archive copy be used for multiple purposes? For example, can the backup archive also be used as an email archive? It would be great to be able to shout "Eureka" and start implementing that email archiving or technical drawing archive by pointing at the backup archive. Cross-functional archiving for free!???

Wikibon has recently written extensively about archiving, and concluded that the current model, a combination of data center archiving and point solutions for specific departments, is broken. Wikibon concluded that the business will need to define the archive requirements around the positive ability to exploit "big data" and to drive improved business productivity and effectiveness, rather than around the traditional fear of being sued or failing to be in compliance. By definition, this will be a cross-functional exercise that will need to look at ways of capturing metadata early in the data record's history with minimal impact on end users, and to define metadata models that will allow ease of use and easy extensibility. Software will be selected on an industry basis. This software could possibly use hardware like the DD Archiver, but the ISVs will choose the technology that best fits the needs of the application, or even construct their own appliances. Wikibon believes that an IT-led initiative to use a particular technology as a foundation for organization archiving would not be a wise use of resources.

DD Archiver Business Case

The business case for the DD Archiver should focus on the hidden costs of migrating the data to tape, keeping track of the data, checking that the data can still be read, migrating very long-term data to new media, deleting the data, and very occasionally having to restore it. The DD Archiver could decrease costs internally or could be part of an external archiving service. Normally, the older the data, the less value it has and the less likely it is to be touched. The business case should focus on the benefits to the IT department. IT executives should be clear that using backup records as a long-term archive is at best a stop-gap measure that is much more expensive and less effective than a business-driven archive solution, and at worst could be a liability.

Overall Conclusions

The EMC Data Domain announcement is very strong, with a doubling of performance at the high end, the introduction of a broad range of offerings covering the entire market, the announcement of support for the IBM i-series, and an archiving product that could help to keep backup records longer. Data Domain is now the de facto standard for de-duplication in enterprise data centers.

Action Item: Data Domain is now the standard against which backup data de-duplication solutions will be judged. Senior IT executives will need to keep abreast of Data Domain products and directions and should include them in most RFPs for backup de-duplication. At the same time, IT executives should be pushing EMC to provide de-duplication products and services for more modern backup topologies, as well as the array functionality to run them efficiently.

Footnotes: Updated 1/24/2011 to include Sepaton S2100-ES2 Series 1910/2910, announced 1/24/2011

Categories: Archiving, Data Domain, De-duplication, EMC, Professional alerts

Contributors: Bert Latamore, Wikibon Daemon, Wikibon
