Documentum Disaster Recovery Implementation

EMC Proven Professional Knowledge Sharing 2010

Documentum Disaster Recovery Implementation

Narsingarao Miriyala
EMC Corporation
[email protected]

Table of Contents

Introduction
Audience
Benefits and Costs
Documentum Repository Overview
Disaster Recovery Fundamentals
Data Replication
    Symmetrix Remote Data Facility (SRDF)
    MirrorView Replication
    IP Replicator
Documentum DR Models
    Metadata and Content on EMC Symmetrix
    Metadata on Symmetrix and Content on CLARiiON
    Metadata and Content on CLARiiON
Restore From Snaps

Table of Figures

Figure 1 (Documentum Repository)
Figure 2 (Content Object Relation Model)
Figure 3 (Documentum Transaction)
Figure 4 (Documentum Orphan Objects)
Figure 5 (Orphan Content)
Figure 6 (Orphan Repository Objects)
Figure 7 (Metadata & Content on Symmetrix)
Figure 8 (Metadata on Symmetrix & Content on CLARiiON)
Figure 9 (Content and Metadata on CLARiiON)
Figure 10 (Content and Data Snaps)

Disclaimer: The views, processes, or methodologies published in this compilation are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.

Introduction

This article presents a Disaster Recovery (DR) strategy for successfully recovering the repository at a DR site when content and metadata are replicated synchronously or asynchronously to that site. It also explains various models for deploying DR solutions using EMC storage and replication software. This article does not address the following topics: Documentum software configurations on the primary and DR sites, recovering your system from backups, or FAST Index data.

Audience

This article is for customers, partners, and consultants who are considering deployment of a DR solution. I assume that you are familiar with the Documentum repository, EMC storage products, and replication software.

Benefits and Costs

When architecting a DR solution, one needs to consider the costs and benefits of each available model. Some have higher monetary costs but bring flexibility and attractive benefits. Others are cheaper upfront but carry significant technical limitations and restrictions. Typically, every project has a Service Level Agreement (SLA) that defines the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RPO and RTO are inversely proportional to the cost of the DR solution. In other words, the smaller the gap between the data at the source and target sites, the higher the cost of the DR solution.

Documentum Repository Overview

Before we talk about the recovery process, let’s discuss a few basics of the Documentum Repository. The Repository consists of two main components: a relational database and a file system, as shown in Figure 1 (Documentum Repository). All the metadata is stored in the database, and the associated content files are stored in the file system. Both components must be synchronized to successfully recover the system. Many customers store the content files on a Network File System (NFS), which provides high availability (HA).

Figure 1 (Documentum Repository) 

Every content file on storage has an associated persistent content object in the database. This persistent object is associated with one or more sysobjects, as shown in Figure 2 (Content Object Relation Model). The content object holds metadata about a document, such as its format, size, and the location of the file in storage. The associated sysobject holds system-related metadata such as object name, owner, and creation and modification dates. Please refer to the Documentum Object Reference PDF for a complete list of the metadata attributes for content objects and sysobjects.

Figure 2 (Content Object Relation Model) 

Let’s look at how Documentum processes a content transaction, as illustrated in Figure 3 (Documentum Transaction). When a user imports a document into the repository, the Content Server starts a database transaction, enters a new row in the database tables, saves the document into the file system, and commits the transaction in the database. Note that the database transaction is committed only after the content has been saved to the file system. If the file system is full or unavailable for any reason, the transaction is rolled back and an error message appears.
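
The ordering described above can be illustrated with a minimal Python sketch. It uses an in-memory SQLite database and a temporary directory as stand-ins for the repository database and the file store. The table name `sysobject` and the functions `import_document` and `count_rows` are hypothetical names chosen for illustration; they are not Documentum APIs, and the real Content Server logic is more involved.

```python
import os
import sqlite3
import tempfile


def import_document(conn, store_dir, name, data):
    """Insert metadata, save the file, then commit; roll back if the save fails."""
    path = os.path.join(store_dir, name)
    try:
        # Step 1: new metadata row inside an open (uncommitted) transaction.
        conn.execute("INSERT INTO sysobject (name, path) VALUES (?, ?)", (name, path))
        # Step 2: write the content file to the file store.
        with open(path, "wb") as fh:
            fh.write(data)
        # Step 3: commit only after the content is on the file system.
        conn.commit()
        return True
    except OSError:
        # File store full or unavailable: discard the uncommitted metadata row.
        conn.rollback()
        return False


def count_rows(conn):
    """Number of committed or pending metadata rows visible on this connection."""
    return conn.execute("SELECT COUNT(*) FROM sysobject").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sysobject (name TEXT, path TEXT)")
conn.commit()

store = tempfile.mkdtemp()
ok = import_document(conn, store, "report.txt", b"hello")
# A non-existent directory simulates an unavailable file store.
failed = import_document(conn, os.path.join(store, "missing_dir"), "x.txt", b"data")
print(ok, failed, count_rows(conn))
```

Because the commit happens last, a failed file write never leaves a metadata row that points at missing content, which is the property the DR design in this article relies on.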

Figure 3 (Documentum Transaction) 

When you delete a document using Webtop or any other interface in a regular system, the system deletes the sysobject but does not delete the content object or the actual file in the file system. The content object and content file are marked as orphans, as shown in Figure 4 (Documentum Orphan Objects). If you are using Trusted Content Services (TCS) and enable the digital shredding option on the file store, the system deletes the sysobject, the content object, and the actual file in a single transaction. This option is usually used in compliance environments.

Figure 4 (Documentum Orphan Objects) 

Documentum provides two utilities to delete orphan files on the file system and in the

database:

• dmfilescan: This utility scans for orphan content files on storage and generates scripts to delete that content from the file system.

• dmclean: This utility scans the repository and finds all orphan content objects, ACLs, annotations, etc.

The Administrator can run the scripts to clean up the orphan files and objects.

Disaster Recovery Fundamentals

Content files on DR storage should always be ahead of the metadata in the DR database. This is the guiding principle for successful repository recovery at a DR site. It ensures that every sysobject in the repository has an associated content file in the file system (storage). Because content is ahead of metadata, you will find content files on the file system that no sysobject in the database references, as shown in Figure 5 (Orphan Content). Remove this orphan content using the “dmfilescan” utility.

Figure 5 (Orphan Content) 

On the other hand, a sysobject in the database may reference a content file that doesn’t exist on the file system, as shown in Figure 6 (Orphan Repository Objects). This is considered a corrupt repository: a repository is corrupt when objects in the database are ahead of the content in the file system. The only option is to recover the repository from this corruption by rolling forward the database transactions to the time that the last content file was saved to the file system.

Figure 6 (Orphan Repository Objects) 
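
The two situations (harmless orphan content versus a corrupt repository) can be told apart with a simple set comparison. The following Python sketch assumes you already have two lists: the file paths referenced by the database and the file paths actually present on DR storage. The function name `classify_repository` and the sample paths are hypothetical; this is an illustration of the principle, not a replacement for the Documentum utilities.

```python
def classify_repository(db_referenced_paths, files_on_storage):
    """Compare database references with stored files.

    orphan_content:  files on storage that no database object references
                     (content ahead of metadata; safe to clean up).
    missing_content: files the database references that are not on storage
                     (metadata ahead of content; the repository is corrupt).
    """
    referenced = set(db_referenced_paths)
    stored = set(files_on_storage)
    missing = sorted(referenced - stored)
    return {
        "orphan_content": sorted(stored - referenced),
        "missing_content": missing,
        "recoverable": not missing,
    }


# Content ahead of metadata: one extra file, nothing missing.
healthy = classify_repository(["a.doc", "b.doc"], ["a.doc", "b.doc", "c.doc"])
# Metadata ahead of content: the database references a file that never arrived.
corrupt = classify_repository(["a.doc", "b.doc", "d.doc"], ["a.doc", "b.doc"])
print(healthy)
print(corrupt)
```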

The question is how to make sure that the content is always ahead of the Database at

the DR site. You can achieve this by using various replication mechanisms, storage

consistency groups, and data and content snaps.

Data Replication

EMC supports various replication mechanisms between EMC storage devices. This article covers a few of them: SRDF, MirrorView, and IP Replicator.

Symmetrix Remote Data Facility (SRDF)

SRDF replication software is used for data replication between EMC Symmetrix® storage devices. SRDF supports multiple modes. For Documentum replication, two modes are of primary interest: SRDF Synchronous and SRDF Asynchronous.

• Synchronous mode (SRDF/S) – Provides real-time mirroring of data between the primary and DR Symmetrix systems. Data is written simultaneously to the cache of both systems in real time before the application I/O is completed, thus ensuring the highest possible data availability. Data must be successfully stored in both the primary and DR Symmetrix systems before a positive acknowledgement is sent to the primary host. This mode is used primarily over metropolitan area networks, where the distance between the two data centers is limited.

• Asynchronous mode (SRDF/A) – Provides a consistent point-in-time image on the DR site that is only slightly behind the source system. This mode allows replication over unlimited distance, with minimal to no effect on the performance of the local production repository. SRDF/A delivers high-performance, extended-distance replication.

MirrorView Replication

MirrorView replication software is used to replicate data between EMC CLARiiON® storage systems. This software supports the storage-managed propagation of changes made to data stored in a LUN on one CLARiiON system to a corresponding LUN on the DR CLARiiON system. The propagation of changes from a LUN on one storage system (the primary LUN) to a corresponding LUN on another system (the DR LUN) can be done in either synchronous or asynchronous mode.

In synchronous mode, a write that changes the content of the primary LUN is not

acknowledged to the server host as successfully complete until the change is

successfully propagated to the DR storage system. In asynchronous mode, changes to the primary LUN(s) are not immediately propagated. Instead, they are tracked and periodically propagated from the primary LUN(s) to the DR LUN(s) as a batch.
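
The difference between the two modes can be shown with a toy Python model. The class `MirroredLun` and its methods are hypothetical names used only to illustrate the acknowledgement semantics described above; they do not correspond to any MirrorView or SRDF interface.

```python
class MirroredLun:
    """Toy model of a primary LUN mirrored to a DR LUN."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}   # block number -> data on the primary LUN
        self.dr = {}        # block number -> data on the DR LUN
        self.pending = {}   # changes tracked but not yet sent (async mode only)

    def write(self, block, data):
        """Apply a host write and return once it may be acknowledged."""
        self.primary[block] = data
        if self.synchronous:
            # Synchronous: the change reaches the DR LUN before the ack.
            self.dr[block] = data
        else:
            # Asynchronous: only track the change; the DR LUN lags behind.
            self.pending[block] = data
        return "ack"

    def propagate_batch(self):
        """Send all tracked changes to the DR LUN; return how many were sent."""
        sent = len(self.pending)
        self.dr.update(self.pending)
        self.pending.clear()
        return sent


sync_lun = MirroredLun(synchronous=True)
sync_lun.write(0, "a")

async_lun = MirroredLun(synchronous=False)
async_lun.write(0, "a")
async_lun.write(1, "b")
lag_before = dict(async_lun.dr)          # DR copy is still empty here
sent = async_lun.propagate_batch()       # periodic batch brings DR up to date
```

In the asynchronous case, whatever sits in `pending` at the moment of a disaster is the data that would be lost, which is why the batch interval feeds directly into the RPO.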

IP Replicator

This replication software is used for data replication on EMC Celerra® systems. IP Replicator is a useful and easy-to-use feature that protects physical NAS and iSCSI environments. It uses a snapshot-based approach to provide efficient data replication over Internet Protocol (IP) networks. Replication is asynchronous, and customers can set granular recovery point objectives (RPOs) for each of the objects being replicated. This allows you to meet business compliance service levels, especially when scaling a large NAS infrastructure.

Documentum DR Models

In this section, I will discuss different DR models in which metadata and content are replicated synchronously or asynchronously to a DR site. Recovery point snaps are created for metadata and content to allow a successful recovery.

All the models have advantages and disadvantages. Implementation costs vary from DMX/SRDF solutions to Celerra/IP replication solutions. Each model has to be reviewed and validated against your business needs before making a final decision. The data volume size and the distance between the two sites are only two of the factors you must consider before selecting a particular solution.

We will make the following assumptions to design these models:

• Oracle Database for storing metadata

• Distance between the primary and DR sites is within the metropolitan WAN area (100 miles or less)

• File Systems are NFS mounted to content servers

• RPO of 0 to 20 minutes

The following models are discussed in depth below:

1. Content and metadata on EMC Symmetrix storage systems at the primary and DR sites, with replication using SRDF/S

2. Content and metadata on EMC CLARiiON storage systems at the primary and DR sites, with replication using EMC RecoverPoint

3. Content on EMC CLARiiON and metadata on EMC Symmetrix storage systems, with replication using IP Replicator for CLARiiON and SRDF/S for Symmetrix

Metadata and Content on EMC Symmetrix

Oracle data and content are stored on a Symmetrix system at the primary and DR sites. Oracle is connected to storage by SAN, and content file systems are mounted onto the Content Servers using a NAS gateway, as shown in Figure 7 (Metadata & Content on Symmetrix). Both the content and Oracle data are replicated to the DR site using SRDF synchronous mode. You should create a single consistency group for the Oracle and content data volumes.

Because data and content are always in sync, the RPO should be zero and the RTO should be minimal (< 30 minutes) when recovering the system during DR failover. For additional protection against data corruption, you can create checkpoint snaps of data and content on the DR site at selected intervals, for example every 10 minutes. You can keep 3 to 5 rolling snaps on the target side.
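
The rolling retention described above can be sketched in a few lines of Python. The function name `rolling_snaps` and the use of plain minute offsets as snap identifiers are assumptions made for illustration; actual snaps are created and expired by the storage software.

```python
from collections import deque


def rolling_snaps(snap_times, keep=5):
    """Return the snaps still retained after taking every snap in order."""
    retained = deque(maxlen=keep)  # the oldest snap is dropped automatically
    for taken_at in snap_times:
        retained.append(taken_at)
    return list(retained)


# Snaps taken every 10 minutes for 80 minutes (minutes since replication began).
retained = rolling_snaps(range(0, 90, 10), keep=5)
print(retained)  # [40, 50, 60, 70, 80]
```

With five snaps at 10-minute intervals, the oldest retained checkpoint is 40 minutes older than the newest one, which bounds how far back you can go to get behind a corruption event.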

The data on the DR side is dark; you cannot mount the data for Read/write while the

primary device is up and running with replication turned ON. In a disaster, follow these

steps to recover the system at the DR site:

• Stop replicating the data

• Make the metadata and content volumes on the DR side primary

• Mount the data and content

• Start the Oracle Database

• Start the Content Server

• Run Documentum dmfilescan and dmclean jobs

• Execute the scripts generated from the above jobs

• Run Documentum consistency check job, validate the results

Figure 7 (Metadata & Content on Symmetrix) 

Metadata on Symmetrix and Content on CLARiiON

Oracle data is stored on Symmetrix DMX® and content is stored on CLARiiON storage at the primary and DR sites. Oracle is connected to storage by SAN, and content file systems are mounted onto the Content Servers using the NAS gateway, as shown in Figure 8 (Metadata on Symmetrix & Content on CLARiiON).

Metadata is replicated to the DR site using SRDF/S replication mode. You should create a single consistency group for the metadata volumes on the DMX array. On the target side, the system should be configured to take snaps of the replicated Oracle data every 10 minutes. You should keep 3 to 5 rolling snap checkpoints on the target site.

Content is replicated from the NFS server (EMC Celerra) to the DR NFS server using asynchronous IP Replicator. Create a consistency group for all the content file systems to be replicated to the DR site. The replication time interval is based on factors such as RPO and the amount of content generated in 10 minutes. If you have an RPO of 20 minutes and you replicate the content at 10-minute intervals, you need enough bandwidth between the two sites to replicate the content generated in 10 minutes within 10 minutes. My recommendation is to work with an EMC NAS SME to design and configure IP Replicator for content replication.
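
The bandwidth reasoning above can be made concrete with a short calculation. The Python sketch below is a rough estimate under stated assumptions: the 6 GB per interval figure is a made-up example, protocol overhead and compression are ignored, and the worst-case lag is modeled simply as one replication interval plus one transfer time. The function names are hypothetical.

```python
def required_bandwidth_mbps(content_bytes, window_minutes):
    """Sustained bandwidth (megabits per second) needed to move
    content_bytes within window_minutes, ignoring protocol overhead."""
    bits = content_bytes * 8
    return bits / (window_minutes * 60) / 1_000_000


def worst_case_lag_minutes(interval_minutes, transfer_minutes):
    """A change made just after a cycle starts waits one interval for the
    next cycle and then one transfer time before it is safe at the DR site."""
    return interval_minutes + transfer_minutes


# Example: 6 GB of new content every 10 minutes, replicated within 10 minutes.
mbps = required_bandwidth_mbps(6 * 10**9, 10)
lag = worst_case_lag_minutes(10, 10)
print(mbps, lag)  # 80.0 20
```

Under these assumptions, a 10-minute interval with a 10-minute transfer window gives a worst-case lag of 20 minutes, which is consistent with the 20-minute RPO used in this article.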

In a real disaster, follow these steps to recover the repository at the DR site:

• Stop replicating the data and content from the primary to the DR site

• Check the last content replicated time to the DR site

• Select the database replicated snap which is older than the last content replicated time

• Mount the data on DMX, and content on CLARiiON

• Mount content file systems from NAS to content servers

• Start the Oracle Database

• Check for any inconsistencies on the database

• Start the Content Server

• Run Documentum dmfilescan and dmclean jobs to clean orphan content and objects in repository

• Execute the scripts generated from the above jobs

• Run Documentum consistency check job, validate the results
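
The snap selection step in the procedure above can be sketched as follows. This Python example assumes you have the timestamps of the database snaps on the DR side and the time of the last completed content replication; the function name `select_database_snap` and the sample times are hypothetical.

```python
from datetime import datetime


def select_database_snap(db_snap_times, last_content_replicated):
    """Return the newest database snap taken strictly before the last
    content replication time, or None if no snap qualifies."""
    older = [t for t in db_snap_times if t < last_content_replicated]
    return max(older) if older else None


# Database snaps every 10 minutes from 09:00; content last replicated at 09:25.
db_snaps = [datetime(2010, 1, 1, 9, m) for m in (0, 10, 20, 30)]
last_content = datetime(2010, 1, 1, 9, 25)
chosen = select_database_snap(db_snaps, last_content)
print(chosen)  # 2010-01-01 09:20:00
```

Choosing the newest qualifying snap keeps content ahead of the database while minimizing the amount of metadata that is rolled back.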

Figure 8 (Metadata on Symmetrix & Content on CLARiiON) 

Metadata and Content on CLARiiON

Oracle data and content are stored on CLARiiON storage at the primary and DR sites. The metadata and content file systems are NFS mounted (EMC Celerra) to the Oracle and Content Servers respectively, as shown in Figure 9 (Content and Metadata on CLARiiON).

Both metadata and content are replicated from the NFS server (EMC Celerra) to the DR NFS server using asynchronous IP Replicator. Replication time intervals are based on factors such as RPO and the amount of content generated in 10 minutes. With an RPO of 20 minutes and 10-minute replication intervals, you need enough bandwidth between the two sites to replicate the content in 10 minutes.

Create rolling data snaps on the target side every 10 minutes for Oracle data. Work with

an EMC NAS SME to design and configure the IP Replicator for content replication.

Follow these steps in a real disaster to recover the repository at the DR site:

• Stop replicating the data and content from the primary to the DR site

• Check the last content replicated time to the DR site

• Select the database replicated snap that is older than the last content replicated time

• Mount the data and content on the CLARiiON

• Mount metadata & content file systems from NAS to Oracle and content servers respectively at the DR site

• Start the Oracle Database

• Check for any inconsistencies on the database

• Start the Content Server

• Run Documentum dmfilescan and dmclean jobs to clean orphan content and objects in repository

• Execute the scripts generated from the above jobs

• Run Documentum consistency check job, validate the results

Figure 9 (Content and Metadata on CLARiiON) 

Restore From Snaps

As discussed in earlier sections, the repository must be restored with content ahead of the database, so your content snap should be more recent than the data snap. As shown in Figure 10 (Content and Data Snaps) below, the database and content snaps are taken 5 minutes apart, with 10-minute intervals between successive snaps. The Snap Restore Compatibility table lists each database snap and the content snaps that can be used with it for a restore. The last column gives the maximum estimated amount of data lost. This time may vary if the content is replicated asynchronously or the replication mechanism is broken.

Figure 10 (Content and Data Snaps) 

Snap Restore Compatibility

Database Snap    Compatible Content Snaps    Max Estimated Data Loss
D5               C5                          5 Minutes
D4               C4,C5                       15 Minutes
D3               C3,C4,C5                    25 Minutes
D2               C2,C3,C4,C5                 35 Minutes
D1               C1,C2,C3,C4,C5              45 Minutes
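
The compatibility table follows a regular pattern that can be reproduced with a short Python sketch. The rule used here is inferred from the table values rather than stated by any EMC tool: database snap Di is compatible with content snaps Ci and newer, and the estimated loss grows by one snap interval for each step back from the newest database snap. The function name `snap_compatibility` and its parameters are hypothetical.

```python
def snap_compatibility(n_snaps=5, interval_min=10, offset_min=5):
    """Build (database snap, compatible content snaps, max loss in minutes)
    rows, newest database snap first."""
    rows = []
    for i in range(n_snaps, 0, -1):
        # Content must be at least as recent as the database snap.
        compatible = [f"C{j}" for j in range(i, n_snaps + 1)]
        # Newest snap loses offset_min; each older snap adds one interval.
        max_loss = offset_min + interval_min * (n_snaps - i)
        rows.append((f"D{i}", compatible, max_loss))
    return rows


table = snap_compatibility()
for db_snap, content_snaps, loss in table:
    print(db_snap, ",".join(content_snaps), f"{loss} Minutes")
```

Changing `interval_min` or `offset_min` shows how a different snap schedule would shift the estimated data loss for each restore choice.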