HathiTrust is a Solution

The Foundations of a Disaster Recovery Plan for the Shared Digital Repository

This report serves as recommendations made by Michael J. Shallcross, 2009 Digital Preservation Intern, University of Michigan School of Information

Executive Summary

This report seeks to establish the framework of a Disaster Recovery Plan for the HathiTrust Digital Library. While professional best practices and institutional needs have provided a clear mandate for HathiTrust’s Disaster Recovery Program, common parlance has often obscured two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ is actually comprised of a suite of documents which detail a range of issues, from crisis communications and the continuity of administrative activities to the restoration of hardware and data. Second, there is no conclusion to the planning process; it is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance.

The primary goal of the present document is to provide a foundation on which future planning efforts may build. To that end, it examines the strategies by which HathiTrust has anticipated and mitigated the risks posed by ten common scenarios which could precipitate a disaster:

o Hardware failure and data loss
o Network configuration errors
o External attacks
o Format obsolescence
o Core utility or building failure
o Software failure
o Operator error
o Physical security breach
o Media degradation
o Manmade as well as natural disasters.

As this list reveals, a disaster within the digital repository refers not merely to data loss, the destruction of equipment, or damage to its environment, but to any event which has the potential to cause an extended service outage. For each scenario, the report discusses possible threats, summarizes the potential severity of related events, and then details solutions HathiTrust has enacted through direct quotations from the HathiTrust Website and TRAC self-assessment, Service Level Agreements, and literature from service providers and vendors. Attached appendices provide relevant information and include contacts for important HathiTrust resources, an annotated guide to Disaster Recovery Planning references, and an overview of key steps in the Disaster Recovery Planning process.

The concluding section of the report provides recommendations and action items for HathiTrust as it proceeds with its Disaster Recovery Initiative. These are divided into Short (0-6 mos.), Intermediate (6-12 mos.) and Long-Term (12+ mos.) objectives and are arranged in a suggested order of accomplishment.

o Short-term goals include:
    - Describing the nature and extent of HathiTrust’s insurance coverage
    - Testing and validation of current tape backup procedures
    - Improved physical and intellectual control over system hardware
    - Establishment, distribution, and maintenance of phone trees
    - Increased documentation of institutional knowledge
    - Identification of Disaster Recovery measures in place at the Indianapolis site.

o Intermediate-term objectives focus on:
    - Creation of a Disaster Recovery Planning Committee
    - Initiation of the data collection and analysis essential to the creation of recovery strategies (This section provides a high level breakdown of various tasks and includes the coordination of activities between the Ann Arbor and Indianapolis sites as well as with service providers and vendors.)

o Long-term action items deal with:
    - Completion and implementation of the suite of Disaster Recovery documents
    - Initiation of staff training and tests of organizational compliance.
    - Storage of an additional copy of backup tapes at a remote third location
    - Investigation of an alternate hot site in Ann Arbor in the event a disaster renders the MACC unusable
    - Consideration of a third instance of the repository
    - Avoidance of vendor lock-in if a key supplier should go out of business.

This report demonstrates that various risk management strategies, design elements, operating procedures, and support contracts have endowed HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a disaster. The establishment of the Indianapolis mirror site, the performance of nightly tape backups to a remote location, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. Unfortunately, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.

Acknowledgements

The author would like to thank Shannon Zachary for her encouragement and guidance; Cory Snavely and Jeremy York for their generous expenditure of time, energy, and knowledge; and Nancy McGovern and Lance Stuchell for access to their outstanding Disaster Recovery Planning resources. The following individuals have also been invaluable sources of advice, support, and information: John Wilkin, Bob Campe, Cyndi Mesa, Ann Thomas, John Weise, Larry Wentzel, Lara Unger-Syrigos, Bill Hall, Emily Campbell, Sebastien Korner, Jessica Feeman, Phil Farber, Chris Powell, Cameron Hanover, Stephen Hipkiss, Tim Prettyman, Rene Gobeyn, and Krystal Hall. Thanks also to Dr. Elizabeth Yakel, Magia Krause, and Veronica and Cora Fambrough. The work in this report was made possible by an IMLS Grant.

Table of Contents

• Executive Summary
• Acknowledgements
• Introduction
    o Goals for HathiTrust’s Disaster Recovery Program
    o The Mandate for Disaster Recovery Planning in Digital Preservation
    o Disaster Preparedness in the Design and Operation of HathiTrust
    o Essential HathiTrust Business Functions
• HathiTrust’s Disaster Recovery Strategies
    o Basic Requirements for Disaster Recovery
    o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites
    o Disaster Recovery Strategy #2: Nightly Automated Tape Backups
• Scenario 1: Hardware Failure or Obsolescence and Data Loss
    o Review: Risks Involving Hardware Failure or Obsolescence and Data Loss
    o HathiTrust’s Solutions for Hardware Failure and Data Loss
    o Redundant Components and Single Points of Failure in the HathiTrust Infrastructure
    o Key Features of HathiTrust’s Isilon IQ Clustered Storage
    o Hardware Support and Service
    o Equipment Tracking
    o Hardware Replacement Schedule
    o Timeline for Emergency Replacement of HathiTrust Infrastructure
    o HathiTrust and Insurance Coverage at the University of Michigan
• Scenario 2: Network Configuration Errors
    o Review: Risks Involving Network Configuration Errors
    o HathiTrust’s Solutions for Network Configuration Errors
    o Extent of ITCom Support
    o ITCom Responsibilities
    o ITCom Services in Response to Outages or Degradation Impacting the Network
    o HathiTrust Responsibilities
• Scenario 3: Network Security and External Attacks
    o Review: Risks Involving Network Security and External Attacks
    o HathiTrust’s Solutions for Network Security
• Scenario 4: Format Obsolescence
    o Review: Risks Involving Format Obsolescence
    o HathiTrust’s Solutions for Format Obsolescence
    o Selection of File Formats
    o Format Migration Policies and Activities
• Scenario 5: Core Utility and/or Building Failure
    o Review: Risks Involving Core Utility or Building Failure
    o HathiTrust’s Solutions for Utility or Building Failure
    o General Maintenance and Repairs in University of Michigan Facilities
    o The Michigan Academic Computing Center (MACC)
    o Arbor Lakes Data Facility (ALDF)
• Scenario 6: Software Failure or Obsolescence
    o Review: Risks Involving Software Failure or Obsolescence
    o HathiTrust’s Solutions for Software Issues
• Scenario 7: Operator Error
    o Review: Risks Involving Operator Error
    o HathiTrust’s Solutions for Operator Error
    o Ingest
    o Archival Storage
    o Dissemination
    o Data Management
• Scenario 8: Physical Security Breach
    o Review: Risks Involving a Physical Security Breach
    o HathiTrust’s Solutions for Physical Security
    o Security at the MACC
    o Security at the ALDF
• Scenario 9: Natural or Manmade Disaster
    o Review: Risks Involving a Natural or Manmade Disaster
    o HathiTrust’s Solutions for Natural or Manmade Catastrophic Events
    o Basic Disaster Recovery Strategies
• Scenario 10: Media Failure or Obsolescence
    o Review: Risks Involving Media Failure or Obsolescence
    o HathiTrust’s Solutions for Media Failure
    o Remaining Vulnerabilities
• Conclusions and Action Items
    o Conclusions
    o Short-Term Action Items
    o Intermediate-Term Action Items
    o Long-Term Action Items
• APPENDIX A: Contact Information for Important HathiTrust Resources
• APPENDIX B: HathiTrust Outages from March 2008 through April 2009
• APPENDIX C: Washtenaw County Hazard Ranking List
• APPENDIX D: Annotated Guide to Disaster Recovery Planning References
• APPENDIX E: Overview of the Disaster Recovery Planning Process
• APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)
• APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard Service Agreement (2006)
• APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)
• APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)

**Appendices F–I are embedded PDF files.**

2009-08-24

Introduction

In the realm of print libraries, a disaster is a fairly unambiguous event: it is a fire, a broken pipe, an infestation of pests; in short, anything which threatens the continued use and existence of texts or the environment in which they are stored. This basic definition may also be applied to the digital library, in which a disaster refers not merely to the loss of content or corruption of data, the destruction of equipment or damage to its environment, but to any event which has the potential to cause an extended service outage. This last part proves to be the greatest difference between the print and digital worlds because there are a great many threats which can leave data intact but incapacitate the primary functions of a digital library. The daily operation of an institution such as HathiTrust involves the anticipation and resolution of a variety of problems (crashed servers, software bugs, networking errors, etc.) which only rise to the level of a ‘disaster’ when they exceed the capacity of normal operating procedures and/or the maximum allowable outage periods. Disaster Recovery Planning thus prompts us to develop robust strategies to mitigate and limit the effects of common problems and at the same time forces us to think the unthinkable. Nevertheless, confronting worst-case scenarios is a vital activity; the belief that an event will never happen simply because it has never happened is an invitation to the very disaster we seek to avoid. Herein lies a conundrum, in that the creation of detailed plans for every eventuality is nearly impossible and also impractical, since the results of such an endeavor would be needlessly complex as well as expensive. At its basis, then, Disaster Recovery Planning demands an astute assessment of risk so that we may weigh the costs of preparations and solutions against the costs of a potential event.

So where to begin? When the subject of Disaster Recovery Planning arises, common parlance often obscures two prominent features of such initiatives. First, a ‘Disaster Recovery Plan’ is actually comprised of a suite of documents which detail a variety of related issues, from crisis communications and the continuity of administrative activities to the recovery of hardware and data and the restoration of core functions. Second, there is no conclusion to the planning process or a point at which a plan is ‘done’; there is instead a continuous cycle of observation, analysis, solution design, implementation, training, testing, and maintenance. The essential first step is therefore a thorough knowledge of the organization, its goals, and its mandate for a Disaster Recovery Program so that later efforts can focus on the articulation of policies and the development of solutions. As a preliminary step in this effort, this report looks to establish a basic foundation from which future planning efforts may grow.

• Goals for HathiTrust’s Disaster Recovery Program

While a more formal statement of HathiTrust’s goals and requirements for its Disaster Recovery Program must be elucidated, the repository’s mission statement provides a good indication of its main objective in the formation of a Disaster Recovery Plan. As part of its aim to “contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge,” HathiTrust seeks “to help preserve these important human records by creating reliable and accessible electronic representations.”1 This statement clearly joins the twin imperatives of preservation and access with an additional requirement: reliability. The development and implementation of a Disaster Recovery Plan will ensure that digital objects will retain their authenticity and integrity over the long term and that partner libraries and designated users may rely on HathiTrust services (or their timely resumption) and content in the face of catastrophic events.

1 HathiTrust. “Mission & Goals” (2009) retrieved from http://www.hathitrust.org/mission_goals on 8 July 2009.

• The Mandate for Disaster Recovery Planning in Digital Preservation

HathiTrust’s mandate for a comprehensive and proactive Disaster Recovery Plan stems from a number of significant sources, among which we may include its mission and goals. The “Institutional Data Resource Management Policy” (2008) of the University of Michigan’s Standard Practice Guide also provides an impetus for the creation of a Disaster Recovery Program. While not necessarily inclusive of the Michigan Digitization Project materials stored in HathiTrust, this document underscores how important it is that data resources “be safeguarded [and] protected” and “contingency plans […] be developed and implemented.”2 In its discussion of the latter point, the policy specifies that:

    Disaster Recovery/Business Continuity plans and other methods of responding to an emergency or other occurrences of damage to systems containing institutional data […] will be developed, implemented, and maintained. These contingency plans shall include, but are not limited to, data backup, Disaster Recovery, and emergency mode operations procedures. These plans will also address testing of and revision to disaster recovery/business continuity procedures and a criticality analysis.3

While data backup procedures and a host of risk management practices are already an integral part of HathiTrust’s operation, the repository now looks to formalize the other strategies suggested by the “Institutional Data Resource Management Policy.” Beyond the example laid out by this document, HathiTrust’s mandate for Disaster Recovery derives from the professional literature detailing best practices in the field of digital preservation. The Reference Model for an Open Archival Information System identifies Disaster Recovery as an essential component of its “Archival Storage” function and highlights the importance of such plans in achieving the goal of long-term preservation of a digital archive’s holdings. As outlined in the OAIS document, “the Disaster Recovery function provides a mechanism for duplicating the digital contents of the archive collection and storing the duplicate in a physically separate facility.”4 HathiTrust has successfully met this requirement by performing nightly tape backups and establishing a mirror site at Indiana University in Indianapolis. The Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) is even more explicit in its requirement that repositories document their policies and procedures with “suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).”5 Professional best practices as well as internal needs and goals thus provide the mandate which underlies HathiTrust’s development of a formal Disaster Recovery Plan.

• Disaster Preparedness in the Design and Operation of HathiTrust

One of the primary goals of HathiTrust is to provide “transparency in all of its operations, including its work to comply with digital preservation standards and review processes.”6 Nowhere is this commitment more clear than in its efforts to anticipate and mitigate risks which could threaten the contents and functions of the Shared Digital Repository. As a first step in addressing the disaster preparedness requirement in section C3.4 of the TRAC Criteria and Checklist,7 this document serves two purposes. First, it provides an overview of the policies, procedures, resources and contracts that enable HathiTrust to address the challenges and threats endemic to the field of digital preservation. Material is therefore cited directly from the HathiTrust Website (http://www.hathitrust.org), the most recent version of HathiTrust’s review of its compliance with the minimum required elements of the TRAC Criteria and Checklist,8 and relevant literature provided by key vendors and service providers.9 Second, this report examines HathiTrust’s current level of disaster preparedness and defines current and forthcoming efforts in its development of a dynamic and proactive Disaster Recovery Program. Per the recommendations of the TRAC Criteria and Checklist, this document records the measures and precautions already in place with regard to “specific types of disasters” that could befall HathiTrust. These events include hardware failure, data loss, network configuration errors, external attacks, core utility failure, format obsolescence, software failure, physical security breach, and manmade as well as natural disasters. While a formal, written plan detailing individual roles and responsibilities in the repository’s response to each of these scenarios is still forthcoming, the evidence gathered in this report reveals that crucial elements of a Disaster Recovery Plan are already in place within HathiTrust.10

2 University of Michigan. “Institutional Data Resource Management Policy” (2008) Standard Practice Guide, retrieved from http://spg.umich.edu/ on 8 July 2009.
3 Ibid.
4 Consultative Committee for Space Data Systems. Reference Model for an Open Archival Information System (2002) p. 4-8.
5 OCLC and CRL. “Section C3.4” Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
6 HathiTrust. “Accountability” (2009) retrieved from http://www.hathitrust.org/accountability on 25 June 2009.

• Essential HathiTrust Business Functions

As the development of the Disaster Recovery Plan proceeds, it is important to bear in mind that its goal is not merely the restoration of hardware and data but also the recovery and continuity of essential repository functions. The following list represents core functions that need to be addressed by HathiTrust’s Disaster Recovery Plan and as such should not be considered a comprehensive representation of the repository’s functions. By directing planning efforts toward specific functions (rather than the organization’s activities as a whole), HathiTrust may prioritize and focus its recovery responses and resources to ensure that the most essential functions go back online first. Subsequent discussion of Disaster Recovery strategies and risk management solutions in this report is presented under the assumption that the continuity of these functions is a primary objective. The prioritization of these functions remains to be determined by an appropriate authority.11

7 “Repository has suitable written disaster preparedness and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s). The repository must have a written plan with some approval process for what happens in specific types of disaster (fire, flood, system compromise, etc.) and for who has responsibility for actions. The level of detail in a disaster plan and the specific risks addressed need to be appropriate to the repository’s location and service expectations. Fire is an almost universal concern, but earthquakes may not require specific planning at all locations. The disaster plan must, however, deal with unspecified situations that would have specific consequences, such as lack of access to a building.” OCLC and CRL. Trustworthy Repositories Audit & Certification: Criteria and Checklist (2007) p. 49.
8 HathiTrust Digital Library Review of Compliance with Trustworthy Repositories Audit & Certification: Criteria and Checklist Minimum Required Elements, revised May 20, 2009. Available at http://hathitrust.org/documents/trac.pdf
9 Contact information for relevant University of Michigan departments and service providers as well as for external vendors may be found in Appendix A.
10 A list of resources related to disaster recovery and the planning process may be found in Appendix D (Annotated Guide to Disaster Recovery Planning References).
11 This list of essential HathiTrust business functions was developed in conjunction with Jeremy York.

o Ingest
    - Ingest digital objects (SIPs) via GRIN, the Google Return Interface (or a modified ingest portal for local content)
    - Validate ingested content with GROOVE, the Google Return Object-Oriented Validation Environment (or a modified version for localized ingest)

o Archival Storage
    - Preserve indefinitely digital objects and metadata (AIPs) in the Shared Digital Repository (includes ensuring the integrity and authenticity of materials). This function addresses the needs of partner libraries as well as individual users.
    - Record changes to and actions on items while they are in the repository
    - Maintain a persistent object address for items within the repository

o Dissemination
    - Provide access to digital objects for users
    - Allow for text searches through a variety of fields
    - Enable large scale full-text searches
    - Permit the creation of public and private content collections
    - Disseminate digital objects (DIPs) to users (via the page-turner access system and data API)
    - Distribute datasets and HathiTrust APIs to developers
    - Research and develop additional applications and resources for HathiTrust

o Administration
    - Provide transparent and up-to-date information to users and the general public via http://www.hathitrust.org/
    - Communicate information and coordinate activities amongst partner libraries and HathiTrust boards and committees.

o Data Management
    - Update and manage the Rights and GeoIP databases
    - Build and maintain Collection Builder and Large Scale Search Solr indexes
    - Determine appropriate user access to texts via database queries
    - Sync content with the Indianapolis site and back up content to tape

HathiTrust’s Disaster Recovery Strategies

• Basic Requirements for Disaster Recovery

Roy Tennant has identified three requisite components of a digital Disaster Recovery Plan: (1) the use of an effective data protection system (i.e. RAID), (2) redundant power and environmental systems, and (3) regular backup of information to tape and, ideally, to a remote mirrored site.12 HathiTrust has incorporated all these elements into its design and operation. Its Isilon IQ storage cluster provides a high degree of data redundancy with its N+3 parity protection; the Michigan Academic Computing Center provides fully redundant power and environmental systems for HathiTrust infrastructure; and nightly tape backups and the replication of data to a fully operational mirror site located at Indiana University in Indianapolis with the same levels of power and environmental conditioning provide multiple copies as well as geographic distribution of content.

o “HathiTrust is intended to provide persistent and high availability storage for deposited files. In order to facilitate this, the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage will be located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate Ann Arbor facility). Each of these storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel. Each separate storage system is also equipped with mechanisms to provide mirrored management and access functionality, and employ 100% data redundancy in an effort to prevent data loss.”13

Details on parity protection and the HathiTrust server environment are available below (see Scenario 1 and Scenario 5, respectively).
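The practical meaning of N+3 parity protection can be illustrated with a small sketch. It assumes that "N+3" means a cluster absorbs up to three simultaneous drive or node failures and goes offline on the fourth, which matches the report's statement under Scenario 1 that the loss of four drives or nodes results in the loss of that storage instance. The class and function names are hypothetical and are not part of any Isilon tool.

```python
from dataclasses import dataclass

# Assumption: "N+3" means up to three simultaneous drive or node
# failures are tolerated; a fourth takes the cluster offline.
PARITY_TOLERANCE = 3


@dataclass
class ClusterStatus:
    """Hypothetical snapshot of one storage cluster."""
    failed_units: int  # drives or nodes currently failed

    @property
    def online(self) -> bool:
        return self.failed_units <= PARITY_TOLERANCE

    @property
    def remaining_margin(self) -> int:
        # Further failures that can be absorbed before going offline
        return max(PARITY_TOLERANCE - self.failed_units, 0)


def describe(status: ClusterStatus) -> str:
    """Summarize the cluster state in the report's severity terms."""
    if not status.online:
        return "offline: tolerance exceeded"
    if status.remaining_margin == 0:
        return "online: no redundancy left"
    return f"online: {status.remaining_margin} more failure(s) tolerated"
```

For instance, `describe(ClusterStatus(failed_units=1))` reports two further tolerable failures, while four failed units is reported as offline, at which point all traffic would have to be served by the other site.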

12 Tennant, Roy. “Digital Libraries: Coping with Disasters.” Library Journal, 15 November 2001. Retrieved from http://www.libraryjournal.com/article/CA180529.html on 13 July 2009.
13 HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.

• Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and Indianapolis Sites

HathiTrust’s first line of defense in the event of a disaster is its hot mirror site in Indianapolis. While ingest of material is restricted to the Ann Arbor location, both sites possess two web servers, a MySQL database server, and an Isilon IQ storage cluster (currently composed of 21 ‘nodes,’ servers composed of Central Processing Units as well as storage). During normal operations, this arrangement allows HathiTrust to balance a high volume of web traffic across both sites such that individual user requests may be handled by either site in a transparent manner. Should the tolerances for failure be exceeded at a site (as in a disaster situation), the failover capability built into the HathiTrust architecture enables the remaining site to provide access to the designated community without noticeable service disruptions. As noted in the May 2009 HathiTrust Update, with the full operation of both locations, “We are now ensuring that users do not feel the effects of single-site outages, such as routine maintenance, by taking advantage of site redundancy.”14 However, because ingest takes place only in Ann Arbor, the loss of key components there would inhibit the repository’s ability to acquire new content.
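The load-balancing and failover behavior described above can be sketched in a few lines. This is only an illustration of the general idea (alternate requests between healthy sites, and send everything to the surviving site when one is down); the report does not describe the actual mechanism, and the site labels, the `make_router` helper, and the health-check callable are all hypothetical.

```python
from itertools import cycle

# Illustrative labels for the two sites named in the report
SITES = ["ann_arbor", "indianapolis"]


def make_router(is_healthy):
    """Return a function that picks a site for each incoming request.

    is_healthy: callable taking a site label and returning True or False
    (a stand-in for whatever health check a real deployment would use).
    """
    rotation = cycle(SITES)

    def route():
        # Try each site at most once per request, in round-robin order.
        for _ in range(len(SITES)):
            site = next(rotation)
            if is_healthy(site):
                return site
        raise RuntimeError("no site available")

    return route
```

With both sites healthy, successive calls to `route()` alternate between them; once one site is marked unhealthy, every request goes to the remaining site, which is the user-transparent outcome the report describes.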

HathiTrust utilizes Isilon Systems’ SyncIQ Application Software to synchronize data at the Indianapolis site with newly ingested or updated material from the Ann Arbor site. The sync to Indianapolis runs on 24 separate subsets of the data, with one subset starting every 2 hours, except on Sundays. In other words, subset 1 runs at midnight on Monday, subset 2 runs at 2 a.m., and so on. The maximum time for data to be replicated from Ann Arbor to Indianapolis would therefore be three days plus the run time of the sync process (which tends to take less than three hours).15

o “SyncIQ is an asynchronous replication application that fully leverages the unique architecture of Isilon IQ storage to efficiently copy data from a primary cluster to one located at a secondary location.”16

o “All nodes [… in both the source and target Isilon IQ clusters] concurrently send and receive data during replication jobs in real time, without impacting users reading and writing to the system.”17

o “A robust wizard-driven web-based interface is fully integrated into [… Isilon’s proprietary] OneFS management tool to control all the functionality, including scheduling, policy settings, monitoring and logging of data transferred and bandwidth utilization.”18

o “Only files that have changed will be replicated to the target clusters. This will optimize transfer times and minimize bandwidth used.”19

o “In the event the secondary system is not available due to a system or network interruption, the replication job will be able to roll back and restart at the last successful copy operation.”20

o “Upon a critical failure or loss of network connection, an alert will be sent to all recipients configured to receive critical alerts.”21
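The three-day maximum replication lag stated for the sync schedule can be checked with a short calculation. The sketch below rests on an interpretation that the report does not spell out: the 24 subsets start two hours apart, so one full pass takes 48 hours, and passes begin at midnight on Monday, Wednesday, and Friday, leaving Sunday idle. Under that assumption the longest wait between two runs of the same subset is the gap across the weekend.

```python
from datetime import datetime, timedelta

SUBSETS = 24
STAGGER_HOURS = 2
# Assumption: a full pass takes 24 * 2 = 48 hours, and passes start at
# midnight on Monday, Wednesday, and Friday, so nothing starts on Sunday.
PASS_START_DAYS = (0, 2, 4)  # day offsets from Monday


def weekly_runs(subset: int, monday: datetime) -> list[datetime]:
    """Start times for one subset (1-based) in the week beginning `monday`."""
    offset = timedelta(hours=(subset - 1) * STAGGER_HOURS)
    return [monday + timedelta(days=d) + offset for d in PASS_START_DAYS]


def max_gap_hours(subset: int) -> float:
    """Longest wait between consecutive runs of a subset, in hours."""
    monday = datetime(2009, 7, 13)  # a Monday; any Monday gives the same gaps
    runs = weekly_runs(subset, monday) + weekly_runs(subset, monday + timedelta(weeks=1))
    gaps = [(later - earlier).total_seconds() / 3600
            for earlier, later in zip(runs, runs[1:])]
    return max(gaps)
```

Under these assumptions `max_gap_hours` returns 72 hours for every subset, which is consistent with the "three days plus the run time of the sync process" figure quoted above.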

14 HathiTrust. “Update on May 2009 Activities” (2009) retrieved from http://www.hathitrust.org/updates_may2009 on 2 July 2009.
15 Snavely, Cory (Head, UM Library IT Core Services). Personal email on 13 July 2009.
16 “Backup and Recovery With Isilon IQ Clustered Storage,” 2007 p. 11
17 Ibid.
18 Ibid.
19 Ibid.
20 Ibid.
21 Ibid.
22 Please refer to Appendix F (TSM Backup Service Standard Service Level Agreement).

• Disaster Recovery Strategy #2: Nightly Automated Tape Backups

HathiTrust’s ability to recover from a disaster is also ensured by the nightly automated tape backups performed by the Tivoli Storage Manager (TSM) client application installed on the ingest servers connected to the HathiTrust storage cluster and managed by Michigan’s ITCS TSM Group. The TSM Backup Service Standard Service Level Agreement22 outlines the obligations and responsibilities of both the service provider and HathiTrust:

o “The progressive incremental methodology used by Tivoli Storage Manager only backs up new or changed versions of files, thereby greatly reducing data redundancy, network bandwidth and storage pool consumption as compared to traditional methodologies based on periodic full backups.”23

o “ITCS is responsible for all of the central server hardware, tape hardware, networking hardware, and related components. ITCS is also responsible for hardware maintenance as well as software maintenance, administration, and security audits on the central (non-client) TSM servers.” (TSM Backup Service SLA, sec. 4.1)

o “ITCS provides 7x24 on-call monitoring and support, and strives to keep the servers up in production at all times. The target up-time is 99.9% of the time. The TSM hardware design is modular and should allow us to take pieces out of service without affecting customers. Whenever possible, system maintenance will be performed during standard weekend maintenance windows as defined by ITCS.” (sec. 4.2)

o “In an emergency, [email protected] (this will go to the on-call staff’s pager in real time).” (sec. 4.6)

o “ITCS is responsible for physical security. Machine access audits, OS security, and network security on the TSM server end are also the responsibility of ITCS.” (sec. 4.9)

o “The service […] includes data compression, data encryptions, and data replication.” (sec. 1.0)

o “ITCS will maintain at least two TSM sites and will mirror data between the sites to provide redundancy in the event of a disaster. Currently those sites are the Arbor Lakes Data Facility (ALDF) at 4251 Plymouth Rd. and the Michigan Academic Computing Center (MACC) located at 1000 Oakbrook Dr.” (sec. 4.10)

o “Both facilities are secure, climate controlled sites designed and built for high available production services.”24

o “In the event of a customer disaster with large-scale (a full server or more) data loss, ITCS will work with the customer to optimize the restore time to best of our ability. We will only be able to devote resources to the extent that other customers are not affected. Restoring large file servers (multiple Terabytes) can take several days. If customers want to minimize this amount of time to restore, we can purchase additional resources for this purpose. Contact us directly, and we’ll work out a scenario with costing information. In the event of a MAJOR campus outage affecting a large number of customers, ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11)

o “Disaster Recovery planning is the responsibility of the customer unit.” (sec. 5.8)

Having established the main Disaster Recovery strategies employed by HathiTrust, we may now proceed to investigate the means by which it anticipates and mitigates the most common threats facing digital repositories.
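For planning purposes, the 99.9% up-time target quoted from section 4.2 of the TSM SLA can be translated into an approximate downtime allowance. The sketch below is plain arithmetic; the SLA excerpt does not state the period over which up-time is measured, so the yearly and 30-day periods used here are assumptions, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # non-leap year


def downtime_budget_hours(target_uptime: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted in a period at the given up-time fraction."""
    if not 0.0 <= target_uptime <= 1.0:
        raise ValueError("target_uptime must be a fraction between 0 and 1")
    return period_hours * (1.0 - target_uptime)


# Assumed measurement periods; the SLA excerpt does not specify one.
yearly = downtime_budget_hours(0.999)             # about 8.76 hours per year
monthly = downtime_budget_hours(0.999, 30 * 24)   # about 0.72 hours per 30 days
```

A 99.9% target therefore leaves room for roughly nine hours of backup-service outage per year, which is worth weighing against the nightly backup window when estimating how stale the most recent tape copy could be.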

23 IBM. “IBM Tivoli Storage Manager: Features and Benefits” (2009) retrieved from http://www-01.ibm.com/software/tivoli/products/storage-mgr/features.html?S_CMP=rnav on 16 June 2009.
24 Information Technology Central Services at the University of Michigan. “Frequently Asked Questions about the TSM Backup Service” (2009) retrieved from http://www.itcs.umich.edu/tsm/questions.php on 16 June 2009.

Scenario 1: Hardware Failure or Obsolescence and Data Loss

• Review: Risks Involving Hardware Failure or Obsolescence and Data Loss

The following table highlights the various events which pose a risk to the hardware and data of HathiTrust. These threats may stem from flaws or malfunctions in the equipment itself or as a result of external events that include physical security breaches and natural or manmade disasters. The arrangement of these potential risks reflects the relative severity of their respective consequences.

• HathiTrust’s Solutions for Hardware Failure and Data Loss

The threats faced by HathiTrust’s hardware (and associated applications as well as the data stored therein) are comprised of the failure of redundant features, failure that exceeds components’ tolerance for redundancy, and single points of failure. While the failure of redundant components may happen more frequently (i.e., the loss of an individual drive within the Isilon IQ cluster), such losses do not have a large impact on the repository; events which compromise single points of failure will have much greater consequences for the continuity of HathiTrust operations. At the same time, while a component may have redundancy on one level (for example, there are five servers dedicated to ingest), that component simultaneously may be considered at a higher level to be a single point of failure (i.e., because the ingest servers are housed in a single chassis, the entire unit is vulnerable to an event such as a fire). This duality highlights the need for vigilance and foresight in managing the repository’s infrastructure.

Because HathiTrust relies heavily upon hardware to fulfill its mission and deliver services to its designated community of users, the selection of equipment and development of system architecture

Severity EventHighimpact Lossatasinglepointoffailure

• Anadditionalfailurepasttoleranceswhenonlyonesiteisoperational• Serviceisunavailableandcannotberestoreduntilcomponentisrepaired/restored

ModerateImpact Failureofacomponentpastredundancytolerance• Systemnolongerhasredundancy:additionallossorfailureofcomponentswill

resultinlossofsystem.Thisisaparticularproblemifonesiteisalreadydown.• Lossofdbserver(homeofRightsdb)orofbothWebserversatasitewillrender

thatlocationinaccessible• LossoffourdrivesornodesineitherIsilonstorageclusterwillresultinthelossof

thatinstance.Theclusterwillbeofflineandunabletohandlereadorwriterequests;alltrafficwouldhavetobehandledbytheremainingsite.

• LossofUMArborLakessitewouldpreventperformanceoftapebackups.• LossofUMMACCsitewoulddepriveIUsiteofdataredundancy• Lossofingestserverswouldpreventnewcontentfromenteringrepository

LowImpact Failureofredundantsystemcomponents• Includesredundantcomponentswithineachsiteaswellasgeneralredundancy

betweentheIUandUMsiteso HTinfrastructurehasbeendesignedtoavoidsinglepointsoffailureandto

ensuredataandequipmentredundancyo Servicecontinuesinanuninterruptedandtransparentmanner

Page 15: HathiTrust is a Solution · Disaster Recovery Strategies p. 5 o Basic Requirements for Disaster Recovery p. 5 o Disaster Recovery Strategy #1: Redundancy between the Ann Arbor and

2009‐08‐24 9

hasaimedatminimizingthedangersposedbysinglepointsoffailurethroughtheintroductionofstrategicredundancies.ThebasicmeansforavoidingthedisastrouseffectsofhardwarefailureordatalosshavebeentheestablishmentoftheIndianapolismirrorsiteandthenightlybackupofcontenttotape.(Formoredetail,pleaserefertotheprecedingsection).Whilethesestrategiesaccountforextraordinaryevents,HathiTrust’sserverreplacementscheduleallowstherepositorytoanticipatetheresultsofnormalequipmentuseanddepreciation.Stepstosafeguardthelong‐termfunctionalityofHathiTrusthavethereforebeencomplementedbyaconsiderationofbestpracticesfordisasterpreparedness.

• Redundant Components and Single Points of Failure in the HathiTrust Infrastructure

The following sections provide a general outline of HathiTrust’s redundant components and single points of failure. Given the complexity of the repository’s infrastructure, unknown or unanticipated scenarios may exist; future Disaster Recovery Planning will thus involve a periodic review of key features and vulnerabilities.

o Site Redundancy: The establishment of the mirror site in Indiana provides HathiTrust with a fully redundant operation. Because both instances provide full access to content in addition to other repository functions, users will not experience a loss or degradation of service in the event that service is lost from one site. Key exceptions to HathiTrust’s site redundancy are noted below.

o Redundant Components at Each Site: The following components provide each site with a tolerance under which limited failures will not disrupt major HathiTrust functions and user services.

  ▪ Web servers: each site has two servers so that if one fails, the other may continue to handle traffic. These also host the GeoIP database.

  ▪ Isilon IQ clusters: the current configuration of 21 nodes features N+3 parity protection; this data redundancy permits the simultaneous failure of 3 drives on separate nodes or the loss of three entire nodes without service degradation.

  ▪ Ingest servers: the Ann Arbor site possesses five servers so that ingest may continue (albeit at a slower rate) in the event of any failures.

  ▪ Large Scale Search (LSS) Solr index: currently housed on the web servers, but will soon be maintained on five new servers in Ann Arbor.

o Single Points of Failure:25 These are components of a system which, if lost, will prevent the entire system from functioning. Even those components with wholly redundant peer devices (such as the web or ingest servers) may be considered single points of failure if they have exceeded their capacity to sustain losses (i.e., if one web server at a site has already been lost).

  ▪ Single Points of Failure at the Component Level: Because only one of these components exists at each HathiTrust site, a loss will result in system failure.

    • MySQL database server: houses the rights database, ingest tracking database, and the Collection Builder Solr index

    • Server network switches

    • Outbound network switches

25 Content in this section is courtesy of Cory Snavely (personal email from 13 July 2009).

  ▪ Single Points of Failure at the System Level: While any given component may have various degrees of internal redundancy (such as multiple power supplies or multiple drives) it might still fail as a whole and thus result in the loss of a particular instance of HathiTrust. The following are components located at each site which, while possessed of internal redundancies, are still subject to complete loss (as in the event of a fire) and may thus render a site inoperable.

    • Isilon IQ storage cluster: the entire cluster could be lost in a large-scale event. Additionally, the loss of a fourth drive or node will exceed the cluster’s failure tolerance and result in a service disruption.

    • Web servers: should one fail, the remaining server will be a single point of failure.

    • Blade server chassis: since web, ingest, and database servers are housed in one chassis, the entire unit could potentially fail.

    • LSS index: in the near future, the servers in Ann Arbor will be the sole instance of the Large Scale Search index.

    • Mirlyn database and Mirlyn2 Solr index26: these are currently key components of the UM Library infrastructure; should these be unavailable, access to and use of HathiTrust will be compromised.

• Key Features of HathiTrust’s Isilon IQ Clustered Storage

The Isilon IQ storage cluster stores and provides digital objects for HathiTrust’s partner libraries and members of its designated community. The cluster provides a high degree of inherent redundancy, which gives both HathiTrust sites considerable tolerance for the failure of various parts of the storage units. As one example, Isilon’s proprietary OneFS operating system permits the individual storage nodes (the individual servers that are the building blocks of the cluster) to function as ‘coherent peers’ so that any one node ‘knows’ everything contained on the other units in the cluster.

o “Isilon's OneFS operating system […] intelligently stripes data across all nodes in a cluster to create a single, shared pool of storage.”27

o “Because all files are striped across multiple nodes within a cluster, no single node stores 100% of a file; if a node fails, all other nodes in the cluster can deliver 100% of the files within that cluster.”28

o “A distributed clustered architecture by definition is highly available since each node is a coherent peer to the other. If any node or component fails, the data is still accessible through any other node, and there is no single point of failure as the file system state is maintained across the entire cluster.”29

26 Mirlyn is the name of the University of Michigan’s current Online Public Access Catalog, which is supported by the Aleph integrated library system. Mirlyn2 is a beta version of UM’s recently implemented next generation catalog, based on the VuFind platform, which will become the main library catalog on August 3, 2009.
27 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 17 June 2009.
28 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7. “In computer data storage, data striping is the technique of segmenting logically sequential data, such as a single file, so that segments can be assigned to multiple physical devices. […] if one drive fails and the system crashes, the data can be restored by using the other drives in the array.” (http://en.wikipedia.org/wiki/Data_striping, retrieved on 16 August, 2009).
29 Isilon Systems. “Breaking the Bottleneck: Solving the Storage Challenges of Next Generation Data Centers” (2008) p. 8


HathiTrust’s Isilon IQ clusters ensure a high degree of data redundancy with their N+3 parity protection. N+3 provides triple simultaneous failure protection so that up to three drives on separate Isilon IQ nodes, or three entire nodes, can fail at the same time and all data will still be fully available.

o “Traditional RAID-5 parity protection results in data loss if multiple components fail prior to the completion of a rebuild. FlexProtect, in contrast, automatically distributes all data and error correction information across the entire Isilon cluster and with its robust error correction techniques efficiently and reliably ensures that all data remains intact and fully accessible even in the unlikely event of simultaneous component failures.”30

o “Each file is striped across multiple nodes within a cluster, with [three] parity stripes for each data block.”31
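The N+3 tolerance described above can be restated as a simple threshold check. The following is only an illustrative sketch: the function names are hypothetical, and counting every concurrent failure (a drive on a separate node or a whole node) as one unit against a shared limit of three is a simplifying assumption, not a description of how OneFS tracks failures internally.

```python
# Parity level reported for the HathiTrust clusters (N+3).
FAILURE_TOLERANCE = 3


def data_available(concurrent_failures, tolerance=FAILURE_TOLERANCE):
    """Return True while concurrent failures stay within the parity tolerance.

    Assumption: each failed drive (on a separate node) or failed node
    counts as one unit against the same tolerance.
    """
    if concurrent_failures < 0:
        raise ValueError("failure count cannot be negative")
    return concurrent_failures <= tolerance


def remaining_margin(concurrent_failures, tolerance=FAILURE_TOLERANCE):
    """Number of further failures the cluster could absorb before data loss."""
    return max(tolerance - concurrent_failures, 0)


if __name__ == "__main__":
    for failures in range(5):
        print(failures, data_available(failures), remaining_margin(failures))
```

A margin of zero corresponds to the situation the report warns about: the cluster is still serving data, but it has effectively become a single point of failure until repairs are complete.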

The file system may also perform a Dynamic Sector Repair (DSR) at the time of any file writing. If it encounters a bad disk sector, the file system will use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block elsewhere on the drive. The bad sector will be remapped by the drive so that it is never used again and the write operation will be completed.

The Isilon “restriper” is a meta-process/infrastructure that has four primary phases to help manage and protect data in the event that components of the cluster sustain a partial failure or malfunction. The processes run as background operations and do not require system downtime.32, 33

o FlexProtect repairs data (i.e., in the event of a drive loss) using parity.

  ▪ “Isilon OneFS with FlexProtect can boast the industry leading Mean Time to Data Loss (MTTDL) for petabyte clusters.”34

  ▪ “FlexProtect introduces state-of-the-art functionality, which rebuilds failed disks in a fraction of the time, harnesses free storage space across the entire cluster to further insure against data loss, and proactively monitors and preemptively migrates data off of at-risk components.”35

o AutoBalance “rebalances the data in a cluster according to business rules, in real time, non-disruptively.”36

  ▪ “As soon as the [new or repaired] node is turned on and network cables are connected, AutoBalance immediately begins to migrate content from the existing storage nodes to the newly added node across the cluster interconnect back-end switch, re-balancing all of the content across all nodes in the cluster and maximizing utilization.”37

30 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 30 June 2009.
31 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 7
32 Isilon X-Series Specifications (product brochure)
33 Information on the Isilon restriper comes from a personal email sent by Kip Cranford of Isilon Systems, Inc. on 1 June 2009.
34 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 4
35 Isilon Systems, Inc. “Isilon IQ OneFS Operating System” (2009) retrieved from http://www.isilon.com/products/OneFS.php on 15 June 2009.
36 McFarland, Anne. “Isilon Accelerates Delivery of Digital Content” The Clipper Group Navigator (2003).
37 Isilon Systems. “The Clustered Storage Revolution” (2008) p. 13


o Collect cleans up orphaned nodes and data blocks to prevent fragmentation of data.

o MediaScan verifies disk sectors.

  ▪ The function of MediaScan is to scan every block in the file system looking for bad disk sectors. If it encounters a bad sector, it will perform a Dynamic Sector Repair (DSR) and use parity information elsewhere in the system to rebuild the necessary information and rewrite a new block somewhere else on the drive.

  ▪ MediaScan periodically reviews data blocks and disk sectors that may not have been accessed, from a file level, in months or years and thereby helps to keep the drives as healthy as possible.

o As of the OneFS 5.0 release, all file system metadata can be checked by the IntegrityScan restriper phase. This process will allow HathiTrust to completely check file data and metadata via associated checksums.

Other instances of inherent redundancy include non-volatile RAM, a fully journaled file system, and software applications that manage client connections in the event of a node’s failure.

o “OneFS is a fully-journaled file system with large amounts of battery-backed non-volatile random access memory (NVRAM) within each node, which ensures the integrity of the file system in the event of unexpected failures during any write operation.”38

o “The Isilon SmartConnect module […ensures] that when a node failure occurs, all in-flight reads and writes are handed off to another node in the cluster to finish its operation without any user or application interruption. […] If a node is brought down for any reason, including a failure, the virtual IP addresses on the clients will seamlessly fail over across all other nodes in the cluster. When the offline node is brought back online, SmartConnect automatically fails back and rebalances the NFS clients across the entire cluster to ensure maximum storage and performance utilization.”39

• Hardware Support and Service

HathiTrust equipment is covered by support and service agreements with its various vendors (Sun Microsystems, Dell, CDW-G, etc.). A good example of one such agreement is the “Platinum” support provided by Isilon Systems, which includes:

o Extended 24x7x365 Telephone & Online Hardware and Software Support
o 24x7 Proactive Monitoring & Alerts – Email Home (for Hardware and Software)
o Return Parts to Factory for Repair and 4-hour Replacement Parts Delivery
o SupportIQ (Enhanced Serviceability Diagnostics) and System Event Tracking
o On-site Troubleshooting
o Isilon Hardware Installation
o Software Product Documentation, Release Notes, and access to Product Technical Notes
o Remote Diagnosis (Provided User Grants Access)
o Maintenance & Patch Releases
o Minor and Major Upgrade Releases (Includes Performance Improvements, New Features, Serviceability Improvements).40

38 Isilon Systems. “Uncompromising Reliability through Clustered Storage: Delivering Highly Available Clustered Storage Systems” (2008) p. 9
39 Isilon Systems. “Data Protection for Isilon Scale-Out NAS” (2009) p. 6

• Equipment Tracking

LIT Core Services (CS) maintains an inventory of servers on a wiki page accessible to its staff. Details include each server’s name, location, online and retire dates, upgrades, notes on storage, and its primary service. Additional information is provided related to specifications, support contracts, and key contact information. The CS server inventory is currently out of date.

• Hardware Replacement Schedule

o “HathiTrust replaces storage regularly, approximately every 3-4 years or as the usable life of storage equipment dictates” (HT TRAC C1.7)

o “HathiTrust staff upgrade hardware on a regular basis (i.e., every three or four years), and to help detect more rapid growth in demands, the web server and storage infrastructures have their own performance monitoring that indicate overload conditions.” (HT TRAC C1.10)

• Timeline for Emergency Replacement of HathiTrust Infrastructure

Should a serious event require the replacement of part (or all) of the HathiTrust technical infrastructure, the following timeline provides a general estimate of the time required to order, ship, and install new equipment. A cursory review of the time necessary for HathiTrust to recover from a major disaster at the main Ann Arbor or Indianapolis data centers suggests that a large event could idle an instance of the repository for at least a month and a half. In addition to the servers and switches mentioned above, critical components include four 30A power distribution units (PDUs) per rack and four racks per data center as of this writing.

o Submission of Purchase Orders:

  ▪ For orders under $5,000, the M-Pathways application allows the University Library’s business manager to send purchase orders directly to vendors.

  ▪ For orders over $5,000, Procurement Services normally takes one to two business days to approve the purchase, but the process may take up to a week if questions arise or additional purchase information is needed.

o Delivery of Equipment:

  ▪ Products the vendor has in stock and available for immediate shipment take 1-3 days to be delivered.

  ▪ Items that need to be configured (such as servers) usually take 1-2 weeks.

  ▪ Isilon storage will take 3 weeks to be delivered in a worst case scenario.

o Installation:

  ▪ 3 days FTE for Isilon IQ cluster in addition to the time required for other servers, switches, PDUs and rack units.

40 Isilon Systems. “Support Advantage Offerings” (2009) retrieved from http://www.isilon.com/support/?page=plans on 30 June 2009.


o Data Restoration: about 0.5 TB/hour (15 days, as of June 2009)41

  ▪ While HT has about 110 TB of data in its storage, the backup tapes maintained by the TSM Group contain roughly 176 TB of information due to the data encryption used to protect the intellectual rights of the material (as of 06/2009).

  ▪ The length of time required for a ‘bare-metal restoration’ will be influenced by tape mounts, network speed, restoring to the NFS shares, decryption, et cetera.

  ▪ If the library/HT were to purchase an additional tape drive (at roughly $20,000), the process could be sped up, perhaps to about 1 TB/hour.

  ▪ In the event of a large-scale disaster in which multiple campus units require extensive data restoration, the TSM Backup Service SLA states that “ITCS management will work with customers to determine how to prioritize customer restores.” (sec. 4.11) This determination will reflect the University of Michigan’s organizational priorities42:

    • Priority 1: Health and safety of faculty, staff, students, hospital patients, contractors, renters, and any other people on University premises.
    • Priority 2: Delivery of health care and hospital patient services
    • Priority 3: Continuation and maintenance of research specimens, animals, biomedical specimens, research archives.
    • Priority 4: Delivery of teaching/learning processes and services
    • Priority 5: Security and preservation of University facilities/equipment.
    • Priority 6: Maintenance of community/University partnerships.

o Fractional restores would, for the most part, run at comparable speeds unless there was a need to restore a large number of random files, in which case there would be a decrease in speed due to tape seek and mount times.

o Delays in recovery could be increased dramatically if the MACC data center or its infrastructure has sustained damage and needs repair.
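The figures in this timeline can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the steps are assumed to run strictly one after another, "up to a week" is taken as 7 calendar days, and time for servers, switches, PDUs and racks other than the Isilon cluster is left out, so the total is a lower bound rather than a forecast.

```python
import math

TAPE_DATA_TB = 176                # size of the tape copy as of 06/2009
CURRENT_RATE_TB_PER_HOUR = 0.5    # reported restore rate
UPGRADED_RATE_TB_PER_HOUR = 1.0   # estimated rate with an additional tape drive


def restore_days(data_tb, rate_tb_per_hour):
    """Days of continuous restoring needed at a constant rate."""
    return data_tb / rate_tb_per_hour / 24


def worst_case_outage_days(rate_tb_per_hour=CURRENT_RATE_TB_PER_HOUR):
    """Sum the worst-case durations listed in the timeline (assumed sequential)."""
    steps = {
        "purchase approval": 7,     # "up to a week" for orders over $5,000
        "Isilon delivery": 21,      # 3 weeks in a worst case scenario
        "Isilon installation": 3,   # 3 days FTE
        "data restoration": math.ceil(restore_days(TAPE_DATA_TB, rate_tb_per_hour)),
    }
    return sum(steps.values())


if __name__ == "__main__":
    print(round(restore_days(TAPE_DATA_TB, CURRENT_RATE_TB_PER_HOUR), 1))   # 14.7
    print(round(restore_days(TAPE_DATA_TB, UPGRADED_RATE_TB_PER_HOUR), 1))  # 7.3
    print(worst_case_outage_days())                                         # 46
```

Under these assumptions the restore alone takes just under 15 days at 0.5 TB/hour, and the sequential worst case comes to 46 days, which is consistent with the report's estimate of at least a month and a half.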

• HathiTrust and Insurance Coverage at the University of Michigan

The Office of Financial Operations reviews and adds financial assets greater than $5,000 to the asset management system of the University of Michigan. The Property Control Office is then responsible for tagging financial assets with unique University of Michigan identifiers and tracking them. Risk Management Services administers the University’s property insurance and will provide the reimbursement of replacement costs for items self-insured by Michigan. As of July 2009, the nature and extent of the University of Michigan’s insurance coverage for HathiTrust hardware remained under review. The main contact with Risk Management Services in this matter has been Cyndi Mesa, Head of UM Library Finance.

41 Hanover, Cameron (ITCS TSM Group Storage Engineer). Personal email on 23 June 2009.
42 University of Michigan Administrative Information Services. “Emergency Management, Business Continuity, and Disaster Recovery Planning” (2007) retrieved from http://www.mais.umich.edu/projects/drbc_methodology.html on 6 July 2009.


Scenario 2: Network Configuration Errors

• Review: Risks Involving Network Configuration Errors

The following table summarizes the risks facing HathiTrust as the result of network configuration errors. Consideration is given to network connections within UM data centers as well as at UM’s Hatcher Graduate Library (site of key administrative and development activities). The arrangement of these events reflects the relative severity of their respective consequences.

• HathiTrust’s Solutions for Network Configuration Errors

HathiTrust’s continued access to the Internet via the UMnet Backbone is essential for its continued provision of service. The repository receives network infrastructure maintenance through UM’s ITCS/ITCom; with its robust disaster planning in addition to the lessons learned from the Midwest blackout of 2003, ITCom guarantees continued network access in all but the most catastrophic scenarios. In the event of a widespread power outage, HathiTrust would be able to maintain access to the UMnet Backbone since data centers are equipped with redundant power supplies and the Hatcher Graduate Library is currently categorized as a priority recipient of power from the university. ITCS also has 17 generators which can be used to maintain power to network switches in the event of a blackout. The responsibilities and obligations of both parties are outlined in the Customer Network Infrastructure Maintenance Service Agreement.43

• Extent of ITCom Support

o “ITCom agrees to provide the Unit Network Infrastructure Maintenance to include data switches, routers, access points, hubs, uninterruptible power supplies (UPS’s), firewalls, and other identified and agreed upon components.” (ITCS sec. 1.0)

43 Please refer to Appendix G (ITCS/ITCom Customer Network Infrastructure Maintenance Service Agreement).

Severity / Event

High Impact:
  • Loss of server network switch or outbound network switch
  • Loss of access to UMnet Backbone

Moderate Impact:
  • Extended loss of power at Hatcher Library could lead to loss of local servers and disruption of administrative and operational activities.

Low Impact:
  • Loss of power that threatens ability to connect to Local Area Network (LAN)/Backbone
    o The library remains (for now) a priority recipient of electricity from the UM power plant
    o Campus data centers have UPSs and redundant backup power
  • Failure of local/server-side connections
    o Should problems arise with connections to individual nodes, the clustered architecture of the Isilon system will allow read/write requests to be handled by alternate nodes.
    o If connections fail at one HT site, traffic can be handled by remaining site.


• ITCom Responsibilities

o “Provide and maintain the necessary materials and electronic components to operate the Unit Network Infrastructure.” (sec. 5.2)

o “Provide configuration and Network Infrastructure Administration support necessary to repair and maintain the Unit Network Infrastructure hardware and software covered by this agreement.” (sec. 5.3)

o “Monitor 24 hours/day and 365 days/year (24x365), supported protocols to the backbone interface of the Units network up to and including the extension to the first hub or switch.” (sec. 5.6)

o “Monitor 24 hours/day and 365 days/year (24x365), network interfaces on uninterruptible power supplies (UPS) that support the Unit network switches. Provide notification in the event that a UPS is activated, (input power is lost or degraded and system switches to battery power), deactivated, (input power is restored), or unreachable. Provide notification to the Unit Network Administrator when batteries degrade to the point of needing replacement.” (sec. 5.7)

o “Provide maintenance on the station cabling as installed by ITCom, or an approved U-M vendor which met ITCom installation specifications.” (sec. 5.8)

o “Provide Preventative Maintenance (clean & vacuum) on each Customer Unit switch covered in this agreement yearly.” (sec. 5.9)

• ITCom Services in Response to Outages or Degradation Impacting the Network

o “A response within 30 minutes of the ITCom NOC notification or the Unit’s call, to provide information to the Unit on specific steps that have been/will be taken to resolve the problem.” (sec. 7.2.1)

o “An on-site visit, if necessary, within two (2) hours of the response (i.e., the maximum on-site response time will be two and a half (2 1/2) hours). An update will be provided to the Unit Network Administrator if on site and a best guess ETR will be provided based on available facts. ITCom will continue to provide the Unit with updates every two hours during an outage.” (sec. 7.2.1)

o “If an outage is identified within the agreement service hours ITCom will resolve the outage even if the repair time extends beyond the service agreement hours.” (sec. 7.2.1) (Repairs outside of the agreement hours result in additional labor expenses.)

o Conduct monitoring via SNMP polling at one-minute intervals. (sec. 7.2.1)
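The response windows quoted from section 7.2.1 translate directly into deadlines that staff could track during an outage. This is a minimal sketch under stated assumptions: the function and constant names are hypothetical, and the clock is assumed to start at the moment of the NOC notification or the Unit's call.

```python
from datetime import datetime, timedelta

RESPONSE_WINDOW = timedelta(minutes=30)  # initial response (sec. 7.2.1)
ONSITE_WINDOW = timedelta(hours=2)       # on-site visit, counted from the response


def outage_deadlines(notified_at):
    """Return the latest response and on-site times allowed by the agreement."""
    respond_by = notified_at + RESPONSE_WINDOW
    onsite_by = respond_by + ONSITE_WINDOW
    return {"respond_by": respond_by, "onsite_by": onsite_by}


if __name__ == "__main__":
    deadlines = outage_deadlines(datetime(2009, 8, 24, 9, 0))
    for label, moment in deadlines.items():
        print(label, moment.isoformat())
```

Adding the two windows reproduces the agreement's stated maximum on-site response time of two and a half hours.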

• HathiTrust Responsibilities

ITCom’s responsibilities end at the first network switch; from there to its servers, HathiTrust is responsible for maintaining network connectivity and security. The repository uses Internet2 for communication and synchronization between the Ann Arbor and Indianapolis sites. Each Isilon node has dual 10 Gb InfiniBand ports for internal (i.e., intra-cluster) communication and dual 1 Gb Ethernet ports for external communication.

Scenario 3: Network Security and External Attacks


• Review: Risks Involving Network Security and External Attacks

The following table gives a general overview of the basic threat an external attack or network security breach poses to HathiTrust; entries are arranged by severity. The list, however, is not exhaustive and no attempt has been made to publicize potential vulnerabilities.

• HathiTrust’s Solutions for Network Security

Malicious activity against HathiTrust could involve unauthorized access to a system or data, denial of service, or unauthorized changes to the system, software, or data. As an academic entity, the repository is seen as less of a target for such actions than commercial or governmental targets; despite this perceived lower risk, HathiTrust has not been lulled into a false sense of security. The repository takes seriously the potential for violations of its network and operating system security and therefore has instituted a program of periodic software updates in addition to the maintenance of an ITCom-supported firewall, authentication-required access, and other measures (such as throttling software to deter denial of service attacks). Because content is currently accepted from trusted sources (namely, Google and legacy digital collections from HathiTrust partners) the GROOVE process does not include a virus detection phase. As digital objects are ingested from a greater number of sources, additional security measures should be considered.

o “HathiTrust staff apply security updates to the operating system and to networking devices as soon as they become available in order to minimize system vulnerability. As with new software releases, security updates are tested in a development environment before being released to production. Software packages that present a lower security risk and that have a greater potential to affect application behavior (web servers, language interpreters, etc.) are generally installed, configured and tested manually to allow for greater control in managing updates. Software updates are not applied automatically; moreover, updates that present a potential for having an impact on system behavior are applied and tested first in the development environment. If no impacts are seen, HathiTrust staff apply these updates in production after a testing period of at least one week.” (HT TRAC C1.10)

Severity / Events

High Impact:
  • Unauthorized access to HathiTrust content leads to the infringement of copyrights.
  • Loss of data or functionality for an extended period of time as a result of malicious activity.

Moderate Impact:
  • HathiTrust services are temporarily unavailable as a result of malicious activity.

Low Impact:
  • The delivery of HathiTrust services slows as the result of malicious activity.
  • A security weakness exists within the system but remains unexploited.


Scenario 4: Format Obsolescence

• Review: Risks Involving Format Obsolescence

The following table outlines the threats posed by format obsolescence and arranges them according to their potential severity.

Severity / Events

High Impact:
  • Applications and hardware are no longer able to read or display digital objects.
  • Errors in translating and reading files are not understood or acknowledged by repository users.

Moderate Impact:
  • Problems with the translation of file formats result in DIPs that do not faithfully reflect the original digital objects.

Low Impact:
  • Formats and associated applications change but retain compatibility with older versions of the file formats.

• HathiTrust’s Solutions for Format Obsolescence

An awareness and acknowledgement of the dangers of format obsolescence has led HathiTrust to implement proactive policies and procedures to ensure long-term access to the repository’s content. The repository only accepts specific formats that meet rigorous specifications and, through the prior experience of University of Michigan personnel, has developed protocols for the successful migration of content from one format to another. In addressing the threat of format obsolescence, the preservation of the integrity and authenticity of deposited content has been an overarching concern.

• Selection of File Formats

o “HathiTrust is committed to preserving the intellectual content and in many cases the exact appearance and layout of materials digitized for deposit. HathiTrust stores and preserves metadata detailing the sequence of files for the digital object. HathiTrust has extensive specifications on file formats, preservation metadata, and quality control methods, included in the University of Michigan digitization specifications, dated May 1, 2007.”44 (HT TRAC B1.1)

o “HathiTrust currently ingests only documented acceptable preservation formats, including TIFF ITU G4 files stored at 600 dpi, JPEG or JPEG2000 files stored at several resolutions ranging from 200 dpi to 400 dpi, and XML files with an accompanying DTD (typically METS). HathiTrust supports these formats because of their broad acceptance as preservation formats and because the formats are documented, open and standards-based, giving HathiTrust an effective means to migrate its contents to successive preservation formats over time, as necessary. The Repository Administrators have undertaken such transformations in the past; moreover, HathiTrust offers end-user services that routinely transform digital objects stored in HathiTrust to “presentation” formats using many of the widely available software tools associated with HathiTrust’s preservation formats. HathiTrust gives attention to data integrity (e.g., through checksum validation) as part of format choice and migration.”45

o “Each format conforms to a well-documented and registered standard (e.g., ITU TIFF and JPEG2000) and, where possible, is also non-proprietary (e.g., XML).” (HT TRAC B4.2)

44 Specifications are available at http://www.lib.umich.edu/lit/dlps/dcs/UMichDigitizationSpecifications20070501.pdf
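The format policy quoted in this section can be pictured as a simple acceptance check at ingest time. What follows is a hedged sketch, not HathiTrust's ingest code: the table and function names are hypothetical, the format labels are invented for the example, and treating the stated resolutions as inclusive bounds is an assumption.

```python
# Hypothetical lookup built from the formats named in the quoted policy.
# Each entry maps a format label to its (minimum, maximum) resolution in dpi.
ACCEPTED_IMAGE_FORMATS = {
    "tiff-itu-g4": (600, 600),
    "jpeg": (200, 400),
    "jpeg2000": (200, 400),
}


def is_acceptable(fmt, dpi=None, has_dtd=False):
    """Return True if a file matches one of the documented preservation formats."""
    fmt = fmt.lower()
    if fmt == "xml":
        # XML is accepted only with an accompanying DTD.
        return has_dtd
    bounds = ACCEPTED_IMAGE_FORMATS.get(fmt)
    if bounds is None or dpi is None:
        return False
    low, high = bounds
    return low <= dpi <= high


if __name__ == "__main__":
    print(is_acceptable("tiff-itu-g4", 600))
    print(is_acceptable("jpeg", 150))
    print(is_acceptable("xml", has_dtd=True))
```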

• Format Migration Policies and Activities

o “HathiTrust is committed to migrating the formats of materials created according to [its] specifications as technology, standards, and best practices in the digital library community change.” (HT TRAC B1.1)

o “HathiTrust staff members conduct migrations from one storage medium to another using tools that validate checksums internally. (Digital objects are stored both online and on tape, and the online storage system conducts regular scans to detect and correct data integrity problems.) A total file count is done following a large data transfer, and regularly scheduled integrity checks follow.” (HT TRAC C1.7)

o “[HathiTrust] has migrated large SGML-encoded collections to XML, and Latin-1 character encodings to UTF-8 Unicode. Our success in migrating from older formats to newer formats demonstrates our commitment to our collections and our ability to keep materials in our repository viable. All migrations are documented in change logs.” (HT TRAC B4.2)
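The two checks mentioned in these quotations, checksum validation and a total file count after a large transfer, can be illustrated with a short script. This is a sketch under assumptions rather than the tooling HathiTrust uses: the report does not name a checksum algorithm, so SHA-256 is chosen here only for illustration, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(source_dir, target_dir):
    """Compare file counts and checksums; return a list of problems found."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    src_files = sorted(p.relative_to(source_dir)
                       for p in source_dir.rglob("*") if p.is_file())
    dst_files = sorted(p.relative_to(target_dir)
                       for p in target_dir.rglob("*") if p.is_file())
    problems = []
    if len(src_files) != len(dst_files):
        problems.append(
            f"file count differs: {len(src_files)} source vs {len(dst_files)} target"
        )
    for rel in src_files:
        target = target_dir / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(source_dir / rel) != sha256_of(target):
            problems.append(f"checksum mismatch: {rel}")
    return problems
```

An empty list from `verify_transfer` means every source file has a byte-identical counterpart at the target and that the counts agree; any entries in the list identify files that would need to be copied again or investigated.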

45 HathiTrust. “Preservation” (2009) retrieved from http://www.hathitrust.org/preservation on 16 June 2009.


Scenario 5: Core Utility and/or Building Failure

• Review: Risks Involving Core Utility or Building Failure

The following table summarizes the dangers a utility or building failure poses to HathiTrust and ranks events by their potential severity.

Severity / Events

High Impact:
  • Extensive structural damage renders the MACC (or key elements of its infrastructure) unusable and necessitates the establishment of a hot site to recover and continue operations.
  • Additional failure past tolerance in backup cooling or power infrastructure

Moderate Impact:
  • Failure of backup power past redundancy tolerance (failure of 2 generators)
    o Data center coordinator may initiate load shed and shut down half of the MACC (but library racks will remain operational)
  • Structural damage renders facility temporarily unsafe and/or unusable.

Low Impact:
  • Loss of power
  • Loss of environmental control units within redundancy

• HathiTrust’s Solutions for Utility or Building Failure

The continued delivery of HathiTrust’s services depends upon the maintenance of power, environmental control, and security in its server environment at the Michigan Academic Computing Center (MACC) and other locations that host components of the repository. In this respect, HathiTrust is heavily reliant upon the infrastructure of the MACC as well as that of the Arbor Lakes Data Facility, home to one instance of the TSM Group’s backup tape library. Both locations provide closely monitored and highly redundant environments that help ensure that HathiTrust’s infrastructure remains secure and operable. At the same time, administrative and data management functions critical to the development and maintenance of the repository take place in the University of Michigan’s Hatcher Graduate Library. The service and cooperation of Michigan’s Plant Operations Division are therefore critical for the continued access to and use of this structure in the operation of HathiTrust.

• General Maintenance and Repairs in University of Michigan Facilities

Facilities and maintenance issues on the University of Michigan campus are reported to the Plant Operations Division, the Department of Public Safety (DPS), and Occupational Safety and Environmental Health (OSEH) in addition to the impacted facility’s manager. Repair work is coordinated by the University Library facilities manager in conjunction with administrators and workers from Plant Operations.

• The Michigan Academic Computing Center (MACC)

The MACC hosts many of the key components of Michigan’s University Library system as well as the technical infrastructure of HathiTrust. The University of Michigan does not own the building in which the data center is located but instead operates the MACC in conjunction with the Michigan Information Technology Center (MITC) Foundation and other partners. The MACC Server Hosting Service Level Agreement46 lists the responsibilities of the data center as well as the repository; of particular significance are the MACC’s agreements to:

o “Provide a controlled physical environment to support servers [with] room average temperature of between 65 and 75 degrees and 35-50% relative humidity [and] monitored environmentals (temperature, humidity, smoke, water, electrical.” (sec. 4.1)

o “Provide adequate, conditioned, 60-cycle electrical service with adequate backup electrical capacity to support circuits, service, and outlets [and also to] provide Uninterruptible Power Supply (UPS) and generator backup” (sec. 4.2)

o “Provide 7x24 telephone contact for emergencies and for emergency access to facility.” (sec. 4.4)
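The environmental limits quoted from sec. 4.1 can be expressed as a simple range check, sketched below. The `Reading` type and `sla_violations` function are hypothetical names introduced for illustration, and the temperature unit is assumed to be Fahrenheit because the agreement says only "degrees"; this is not a description of the MACC's actual monitoring system.

```python
from dataclasses import dataclass

# Limits taken from the quoted SLA text (sec. 4.1).
TEMP_RANGE_F = (65.0, 75.0)        # unit assumed to be Fahrenheit
HUMIDITY_RANGE_PCT = (35.0, 50.0)  # relative humidity, percent


@dataclass(frozen=True)
class Reading:
    """A room-average measurement (hypothetical structure)."""
    temperature_f: float
    relative_humidity_pct: float


def sla_violations(reading: Reading) -> list[str]:
    """List which SLA limits, if any, the reading falls outside."""
    violations = []
    low, high = TEMP_RANGE_F
    if not low <= reading.temperature_f <= high:
        violations.append(
            f"temperature {reading.temperature_f} outside {low}-{high}"
        )
    low, high = HUMIDITY_RANGE_PCT
    if not low <= reading.relative_humidity_pct <= high:
        violations.append(
            f"humidity {reading.relative_humidity_pct}% outside {low}-{high}%"
        )
    return violations
```

A reading of 70 degrees and 40% humidity yields an empty list, while 80 degrees and 30% humidity yields two violations.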

In addition to features such as redundant electrical and environmental systems, the MACC maintains a full-time coordinator and staff who provide 24x7 responses to failures or malfunctions in the server environment. Alerts prompted by issues with the environmental systems or power are sent to the University of Michigan Network Operations Center (NOC) during non-business hours.

o Overview:
    “The MACC's redundancy is designed to ensure the safety and security of the data housed within. It consists of:
        • A dual power path from the property line to the power distribution units
        • Diesel powered generators for electrical backup
        • Flywheels (not batteries) to provide power while the generators come on
        • State-of-the-art generators and flywheels for backup power
        • Three extra computer room air conditioners
        • Two extra dry coolers
        • Glycol loop for cooling with two parallel pathways with crossover valves at regular intervals.”[47]

    “A state-of-the-art monitoring system keeps track of 1,700 different parameters and automatically notifies staff of any irregularity.”[48]

o Environmental Controls and Monitoring
    “The MACC has 18 Computer Room Air Conditioning units (CRACs). At any given time, only 15 are necessary to maintain the required temperature and humidity. [Thus, the computer room has N5+1 redundancy in its cooling ability.] It also is equipped with a number of portable coolers to address specific cooling needs. The heat from the room is transferred to an under-floor glycol loop that releases the heat to the outdoors.”[49]

[46] Please refer to Appendix H (MACC Server Hosting Service Level Agreement).
[47] Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 16 June 2009.
[48] --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[49] --. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 16 June 2009.


    “The layout of the facility allows the front on the computer racks to be facing the cold aisles. These aisles have perforated floor tiles through which the cool air is pumped directly to the computers located there. Heat is discharged from the backs of the computers, which creates the hot aisles. This alternating arrangement facilitates the cooling process, as the hot air produced by the computers can be siphoned off before it mingles too much with the cooler air of the facility.”[50]

    “Two separate smoke detection and fire alarm systems protect the MACC. One is for the building; the other is for the MACC itself. The two systems work together to activate alarm systems and notify the fire department and key personnel. In the event of an actual fire, the fire-suppression system pipes will not fill with water unless there is a pressure drop caused by melting of one or more of the sprinkler heads.”[51]

o Backup Power
    “Three generators, each roughly the size of a rail car, provide backup power. Only two of the three are required to run the facility in the event of a power outage.”[52]

    “The MACC uses environmentally responsible flywheels instead of batteries for power backup while the generators come online. The combination of generators and flywheels provides the facility with a fully redundant uninterruptible power system (UPS).”[53]

    The MACC has a contract with the UM Plant Operations Division for the delivery of diesel fuel for its generators in the event of an extended blackout.[54]

    In the event that a backup generator is disabled, the MACC coordinator will initiate load shed, in which one half of the MACC will be shut down so that the other half (and requisite environmental systems) may continue to operate. The HathiTrust and UM Library racks are among those which will retain power should this response prove necessary.[55]
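The redundancy figures quoted above (18 CRAC units of which 15 are needed, and three generators of which two are needed) reduce to simple arithmetic: the number of spare units is the number installed minus the number required. The sketch below works through that calculation; the `RedundantSystem` class and its status labels are hypothetical and only restate the quoted numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundantSystem:
    name: str
    installed: int  # units present in the facility
    required: int   # units needed to carry the full load

    @property
    def spares(self) -> int:
        """Units that can fail before capacity falls short."""
        return self.installed - self.required

    def status(self, failed: int) -> str:
        """Classify the system after `failed` units have gone offline."""
        if not 0 <= failed <= self.installed:
            raise ValueError("failed must be between 0 and installed")
        working = self.installed - failed
        if working >= self.required:
            return "within redundancy"
        return "past redundancy tolerance"


# Figures as quoted from the MACC web pages.
cooling = RedundantSystem("CRAC units", installed=18, required=15)
generators = RedundantSystem("generators", installed=3, required=2)
```

With these figures, `cooling.spares` is 3 and `generators.spares` is 1, so the loss of a second generator is the point at which backup power moves past its redundancy tolerance.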

• Arbor Lakes Data Facility (ALDF)

The ALDF houses the TSM Group’s infrastructure and one instance of the backup tape library that forms an integral part of HathiTrust’s Disaster Recovery strategy. As the home of critical components of the UMnet Backbone, the ALDF provides a safe and secure location for one set of the repository’s backup tapes. In the interest of security, this report will omit further information on the exact nature of the facility’s power and environmental systems.

[50] Ibid.
[51] Ibid.
[52] --. “Michigan Academic Computing Center” (2009) retrieved from http://macc.umich.edu/index.php on 16 June 2009.
[53] Ibid.
[54] Gobeyn, Rene (MACC Data Center Coordinator). Personal interview on 23 June 2009.
[55] Ibid.


Scenario 6: Software Failure or Obsolescence

• Review: Risks Involving Software Failure or Obsolescence

The following table details various risks inherent to software failure or obsolescence and ranks them according to their severity.

• HathiTrust’s Solutions for Software Issues

The development and use of HathiTrust’s tools and resources depend on highly functional software applications. Repository policies have therefore been crafted to ensure that these applications are thoroughly tested and regularly updated to minimize the threat of service outages as a result of software failure or obsolescence. HathiTrust furthermore employs open source applications that are well-supported and enjoy widespread use and development within the digital library community.

o “Changes in software releases of all components of the system (from ingest to access) are developed and tested in an isolated “development” environment to prepare for release to production. When ready for release, developers record the changes made and increment version numbers of system components as appropriate using a version control system. New versions of software are released using automated mechanisms (in order to prevent manual errors). Major changes and upgrades in hardware architecture are recorded in monthly reports of unit activity, and thus are traceable to that level of detail.” (HT TRAC C1.8).

o “Additionally, subsets of production data are available in the development environment to allow developers to ensure proper system behavior before releasing changes to production.” (HT TRAC C1.9)

o “In order to design, build and modify software for the designated end-user community, HathiTrust conducts an active usability program and seeks input from the Strategic Advisory Board of HathiTrust. Similarly, with regard to software development in support of the archiving needs of the Participating Libraries, HathiTrust focuses on the development of highly functional ingest and validation mechanisms. HathiTrust also seeks and responds to guidance from the Strategic Advisory Board with regard to archiving services.” (HT TRAC C2.2)

Severity    Events
High Impact
    • Software bug escapes detection in development environment and results in crash of application.
Moderate Impact
    • Software bug escapes detection in development environment and prevents full access to digital objects.
    • Improper version of software is introduced to system (could have a greater or lesser impact depending on results of error and repository’s ability to detect it).
Low Impact
    • Software bug escapes detection in development environment and prevents full use of system capabilities (i.e., rotation of images or additional functionality).


Scenario 7: Operator Error

• Review: Risks Involving Operator Error

The following table summarizes risks to HathiTrust posed by operator error; events are ranked according to their potential severity.

• HathiTrust’s Solutions for Operator Error

In any human enterprise, occasional operator error is unavoidable; HathiTrust strives to ensure that any such events are detected and resolved in a timely fashion.[56] To help avoid occurrences and mitigate their potential impact, HathiTrust has automated many procedures and also relies upon application assertions, which can notify administrators when processes are not operating correctly. Even if an error is introduced to the file system and then backed up, the TSM client saves up to seven versions of a file for up to six months so that an earlier version can be retrieved.
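The retention rule in the preceding paragraph (up to seven versions of a file, kept for up to six months) determines whether a clean copy is still retrievable after an error has been backed up. The sketch below models that rule in simplified form. It is not the TSM client's actual algorithm or configuration: the function names are hypothetical, and six months is approximated as 183 days.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_VERSIONS = 7               # "up to seven versions"
MAX_AGE = timedelta(days=183)  # "up to six months" (approximation)


def retrievable_versions(
    backup_times: list[datetime], now: datetime
) -> list[datetime]:
    """Backup versions still available, newest first."""
    newest_first = sorted(backup_times, reverse=True)[:MAX_VERSIONS]
    return [t for t in newest_first if now - t <= MAX_AGE]


def latest_clean_version(
    backup_times: list[datetime], error_time: datetime, now: datetime
) -> Optional[datetime]:
    """Newest retrievable version taken before the error was introduced."""
    for taken in retrievable_versions(backup_times, now):
        if taken < error_time:
            return taken
    return None  # every clean version has already aged out
```

Under this model, a file that changes and is backed up nightly leaves roughly a seven-day window in which an operator error can be noticed and an earlier version restored.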

• Ingest: The Google Return (Object-Oriented) Validation Environment (GROOVE) process is entirely automated to avoid the introduction of operator error to the process; steps include:
    o Identification of material for ingest
    o Decryption and unzipping of files
    o Format verification and validation with JHOVE
    o Luhn barcode and MD5 checksum validation
    o Creation of HathiTrust METS documents
    o Establishment of HathiTrust handles (persistent URLs)
    o Extension of the pairtree file directory (as new material enters the system)

• Archival Storage: Files stored within the repository are not accessed directly or manipulated by staff so that neither the zipped image and OCR files nor the METS document may be accidentally altered or deleted.

• Dissemination: The page-turner application references the stored image and then creates a .png (for TIFFs) or .jpg (for JPEG 2000s) file for display to the viewer.

• Data Management: “New versions of software are released using automated mechanisms (in order to prevent manual errors).” (HT TRAC C1.8)

[56] Please refer to Appendix B (HathiTrust Outages from March 2008 through April 2009).

Severity    Events
High Impact
    • Operator error results in the irreparable loss of data or damage to equipment.
    • Operator error results in loss of key repository functions (ingest, storage, dissemination, etc.) for an extended period of time.
Moderate Impact
    • Operator error remains undetected and causes persistent problems in the system but has no long-term consequences.
Low Impact
    • Operator error is detected by normal procedures or via an activity log and can be readily corrected.


Scenario 8: Physical Security Breach

• Review: Risks Involving a Physical Security Breach

Maintaining the physical security of the HathiTrust infrastructure is yet another crucial element in the repository’s efforts to manage risks and thereby lessen the chance that a disaster-type event occurs. Risks involve the damage and destruction of equipment and could even extend to unauthorized system access. Multiple levels of security exist at both the Michigan Academic Computing Center (MACC) and the Arbor Lakes Data Facility (ALDF) to protect HathiTrust from acts of vandalism, destruction or malicious tampering. Details on the potential impacts of a physical security breach are covered in “Scenario 1: Hardware Failure” and “Scenario 3: Network Security.”

• HathiTrust’s Solutions for Physical Security
    o “Each of [the HathiTrust] storage or tape instances is physically secure (e.g., in a locked cage in a machine room) and only accessible to specified personnel.”[57]

• Security at the MACC

The MACC Server Hosting SLA states the data center staff will:

o “Provide services necessary to maintain a safe, secure, and orderly environment for all tenants of the MACC.” (sec. 4.7)

o “Provide access control via HiD card and biometric readers for those listed on the Tenant Staff Authorized for Access list.” (sec. 4.5)

The MACC Web site and the Michigan Academic Computing Center Operating Agreement[58] provide additional details concerning the resources and procedures that help protect HathiTrust’s equipment at the MACC. The MACC Data Center Coordinator personally oversees the enforcement of security protocols and conducts regular audits of security logs and, when necessary, reviews surveillance video footage.

o Security Systems
    “State-of-the-art security devices such as iris scanners, cameras, closed circuit television and on-call staff keep the data and machines housed in the MACC safe.”[59]

    “Access to the data center will be by two-factor authentication (access card and iris scan) or escorted, supervised access. Access to the building will be by access card.” (MACC OA, sec. 5.3.1)

    “Cameras throughout the corridor, security trap, and facility will be monitored and maintained by the Data Center Coordinator.” (sec. 5.2.1)

o Security Procedures
    “The Operations Advisory Committee will establish procedures for granting access cards to the facility to those whose jobs require hands-on access to systems. All requests for access cards will be vetted and approved by the Operations Advisory Committee at their next meeting.” (sec. 5.3.2)

    “Everyone on the access list for the data center will be required to attend a training session before working in the data center and sign an access agreement stating policies they must observe while in the data center.” (sec. 5.3.8)

[57] HathiTrust. “Technology” (2009) retrieved from http://www.hathitrust.org/technology on 15 June 2009.
[58] Please refer to Appendix I (Michigan Academic Computing Center Operating Agreement).
[59] Michigan Academic Computing Center. “Vital Statistics” (2009) retrieved from http://macc.umich.edu/about/vital-statistics.php on 17 June 2009.

• Security at the ALDF

As noted in the TSM Backup Service SLA, the University of Michigan’s ITCS “is responsible for physical security” at the ALDF. (sec. 4.9) While this document will not detail specific features of the ALDF’s operation, multiple levels of security and oversight are employed.


Scenario 9: Natural or Manmade Disaster

• Review: Risks Involving a Natural or Manmade Disaster

The following table details the risks to HathiTrust posed by a natural or manmade disaster; events are ranked by order of their severity. Due to possible overlap between this scenario and Scenario 1 (Hardware Failure), readers are encouraged to consult that earlier section.

• HathiTrust’s Solutions for Natural or Manmade Catastrophic Events

The University of Michigan Ann Arbor Campus Emergency Procedures (revised January 2008) has set procedures to address building evacuations (in the event of fire), tornadoes, severe weather, flooding, chemical/biological/radioactive spills, as well as bomb threats, civil disturbances, and acts of violence or terrorism.[60] In all cases, staff will follow the directions of Public Safety and not re-enter buildings or resume work “until advised to do so by DPS or OSEH or someone from on-site incident command.”

In the event of a severe natural or manmade disaster, the repair and restoration of the physical locations of HathiTrust infrastructure would need to be coordinated between the repository and the appropriate facility managers. Such activity would rely upon the disaster recovery plans in place at the MITC Building (home of the MACC) and University of Michigan (which includes the Hatcher Graduate Library and the ALDF). It must be noted that an event which causes significant damage to an important structure or to a building’s infrastructure could result in the loss of an instance of the repository for an extended period of time. In such a case, HathiTrust would need to set up an alternate hot site until structural restoration is complete (or a new facility has been found).

[60] Please see Appendix C (Washtenaw County Hazard Ranking List).

Severity    Events
High Impact
    • Widespread damage to a data center and/or its infrastructure that forces an instance of the repository to find a new hot site with sufficient power supply, environmental controls, and security.
    • Damage to work areas forces staff to relocate to a new center of operations.
    • Extensive loss or damage to hardware requires large-scale replacement.
    • With the extended loss of one site, HathiTrust loses redundancy (and possibly some functionality: i.e. the ability to ingest new material in Ann Arbor) and thus a central component of its disaster recovery and backup plans.
    • An act of violence or terrorism occurs at or near HathiTrust facilities.
Moderate Impact
    • An event results in an extended outage at one site that exceeds the recovery time objective.
    • Hardware sustains some damage and site is able to continue operation in a reduced capacity.
    • An actual or threatened act of violence or terrorism forces the temporary evacuation or quarantine of HathiTrust facilities.
Low Impact
    • Local conditions result in a temporary outage at a HathiTrust site.


• Basic Disaster Recovery Strategies

In the immediate aftermath of a large-scale manmade or natural disaster, the repository’s immediate recovery will be enabled by its basic system architecture:

o “the initiative’s technology concentrates on creating a minimum of two synchronized versions of high-availability clustered storage with wide geographic separation (the first two instances of storage are located in Ann Arbor, MI and Indianapolis, IN), as well as an encrypted tape backup (written to and stored in a separate facility outside of Ann Arbor).”[61]

The establishment of the mirror site in Indianapolis and the retention of multiple backup tapes at two locations in Ann Arbor ensure that a serious event at either location will not impede the continued functioning of the repository at the other. Consideration must be given as to how data at the Indianapolis site will be backed up and how key repository functions (such as ingest) will proceed if the Ann Arbor instance is off-line for an extended period of time. Likewise, a long-term outage at the IU location would require HathiTrust to establish a third site for data backup (i.e., a location where additional copies of backup tapes could be stored).

[61] HathiTrust. “Technology” retrieved from http://www.hathitrust.org/technology on 15 June 2009.


Scenario 10: Media Failure or Obsolescence

• Review: Risks Involving Media Failure or Obsolescence

The following table summarizes risks to HathiTrust posed by the failure of the media used for its data backups. While the risks from this are limited (both copies of the tape backups would have to be impacted for data to be unavailable), the issue should nonetheless be addressed with regular test restorations and/or inspections of the media.

• HathiTrust’s Solutions for Media Failure

Given the nature of HathiTrust’s storage system, this scenario is only a concern with regard to the digital magnetic tapes used by the TSM Group for backups.

o Two tape copies of all backup data are made and these are stored in separate climate-controlled conditions in tape libraries at the MACC and the ALDF.

o Content is transferred to new tape during data defragmentation (which occurs when existing tapes are 80% full).

o If a degraded or otherwise ‘bad’ section of tape is detected during a backup procedure, that tape is immediately marked as “read only.”
    Data is thenceforth written to a different tape; existing data on the bad tape will be copied to properly functioning media.
    If data cannot be reclaimed from bad tape, the TSM Group would contact HathiTrust so that the backup of content can be properly completed.
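The tape-handling rules above (defragmentation at 80% full, and marking a tape read-only once a bad section is found) can be summarized in a small model. The sketch below is illustrative only: the `Tape` class, its field names, and `handle_bad_section` are hypothetical and do not correspond to Tivoli Storage Manager commands or settings.

```python
from dataclasses import dataclass

DEFRAG_THRESHOLD = 0.80  # "when existing tapes are 80% full"


@dataclass
class Tape:
    label: str
    capacity_gb: float
    used_gb: float = 0.0
    read_only: bool = False

    @property
    def fill_ratio(self) -> float:
        return self.used_gb / self.capacity_gb

    def due_for_defragmentation(self) -> bool:
        """True once the tape has reached the 80% threshold."""
        return self.fill_ratio >= DEFRAG_THRESHOLD


def handle_bad_section(bad: Tape, pool: list[Tape]) -> Tape:
    """Mark the damaged tape read-only and pick another tape for writes."""
    bad.read_only = True
    for tape in pool:
        if tape is not bad and not tape.read_only:
            return tape
    # Corresponds to the case where the TSM Group would need to escalate.
    raise RuntimeError("no writable tape available")
```

Copying the existing data off the read-only tape, and contacting HathiTrust if that copy fails, would follow as separate steps and are not modeled here.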

• Remaining Vulnerabilities

There is some reason for concern in this area because the TSM Group does not have a regular program to monitor its media for physical degradation or impairment after data defragmentation. While the tapes are reported to be highly dependable, problems such as “sticky shed” (the hydrolysis of the tape’s binder) could become an issue with older tapes. A regular program of tape validation or test restorations would provide an opportunity to check on the physical condition and data integrity of the tapes. Likewise, the creation of a schedule for the replacement of older tapes could avoid future problems with media degradation.

Severity    Events
High Impact
    • Physical degradation (i.e. in tape binder, substrate, or magnetic content) affects both copies of older backup tapes.
Moderate Impact
    • Because backup tapes are not regularly tested or audited, the physical substrate of tapes may degrade over time.
Low Impact
    • Bad tape is detected during a tape backup.


Conclusions and Action Items

• Conclusions

As this report demonstrates, a variety of risk management strategies in addition to design elements, operating procedures, and service and support contracts endow HathiTrust with the ability to preserve its digital content and continue essential repository functions in the event of a range of disasters. The establishment of the Indianapolis mirror site, the performance of nightly tape backups, and the redundant power and environmental systems of the MACC reflect professional best practices and will enable HathiTrust to weather a wide range of foreseeable events. As it is, disasters often result from the unknown and the unexpected; while the aforementioned strategies are crucial components of a Disaster Recovery Plan, they must be supplemented with additional policies and procedures to ensure that, come what may, HathiTrust will be able to carry on as both an organization and a dedicated service provider.

In the effort to secure HathiTrust’s long-term continuity, the present document stands merely as a preliminary step in the establishment of a legitimate Disaster Recovery Plan. The data on HathiTrust’s policies, procedures, and contracts consolidated herein should facilitate the data collection requisite to the initial phases of the planning process, but the core activities of formulating technical and administrative response strategies and delegating roles and responsibilities remain to be undertaken. The following section outlines recommendations and action items derived from research into the repository as well as from discussions with Cory Snavely and other HathiTrust staff members. Items have been separated into an approximate timeline of activity ranging from Short Term through Long Term and the arrangement within each category represents a suggested (but by no means definitive) order of accomplishment. For a more detailed explanation of action items related explicitly to Disaster Recovery Planning, please refer to the overview of the planning process in Appendix E or consult Appendix D for a list of more comprehensive guides and resources. (NB: * denotes an ongoing activity.)

• Short Term Action Items (0-6 months)
    a. Resolve the nature and extent of the insurance coverage for HathiTrust equipment.
    b. Arrange with TSM Group administrators to periodically perform a volume audit of backup tapes to ensure data integrity.
    c. Institute periodic test restores with TSM Group to ensure that the process will run smoothly in the event of a disaster.
    d. Discuss the creation of a long-term replacement schedule for backup tapes with the TSM Group to avoid the possibility of media degradation.
    e. Improve control over system components
        i. Update the hardware inventory to include all important system components; document models, serial numbers, UM IDs, associated software and version number, date of purchase, original cost, as well as vendor contact information and product support contracts.*


        ii. Establish a software inventory to document necessary applications in the event of hardware loss; should include purpose, acquisition date, cost, license number, and version number.*
        iii. Create a map identifying where components are in the MACC and within individual racks.*
        iv. Review and assess points of failure as well as the adequacy of redundant components.*
    f. Establish phone trees
        i. Include key contacts for different types of disaster
        ii. Prioritize phone trees to target individuals who
            1. Make decisions
            2. Have vital information
            3. Can offer assistance in resolving situations
        iii. Distribute information and explain protocols to all relevant staff*
        iv. Develop a regular maintenance/update schedule (once every 4-6 months)*
    g. Thoroughly document and make available (as needed) important institutional knowledge so that HathiTrust may continue to function in the event of the extended absence or loss of key staff.*
    h. Identify disaster preparedness and disaster recovery measures in place at Indianapolis.
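The hardware and software inventories recommended under item e list the fields each record should carry. One possible shape for such records is sketched below; the class names, field names, and the `incomplete_fields` helper are hypothetical and are offered only to show how the listed fields might be captured and checked for gaps.

```python
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class SoftwareRecord:
    """Fields listed in item e.ii."""
    name: str
    purpose: str
    acquisition_date: date
    cost: float
    license_number: str
    version: str


@dataclass
class HardwareRecord:
    """Fields listed in item e.i, plus a location for the map in e.iii."""
    model: str
    serial_number: str
    um_id: str
    purchase_date: date
    original_cost: float
    vendor_contact: str
    support_contract: str
    rack_location: str
    software: list[SoftwareRecord] = field(default_factory=list)


def incomplete_fields(record) -> list[str]:
    """Names of fields left blank, to flag records needing follow-up."""
    return [
        name for name, value in asdict(record).items()
        if value is None or value == ""
    ]
```

Running `incomplete_fields` over every record during the periodic inventory update would give a short list of entries that still lack, for example, a support contract reference.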

• Intermediate Term (6-12 months)
    a. Form a Disaster Recovery Planning Committee to research and develop plans and to oversee their implementation.
    b. Communicate and coordinate planning activities between Ann Arbor and Indianapolis.*
        i. Consider the formation of sub-committees for localized research and development of plans and an executive committee to oversee the implementation and management of plans.
    c. Draft a Disaster Recovery Planning policy statement to define the mandate, responsibilities, and objectives for the plan.
    d. Initiate the data collection and analysis phase of the planning process.
        i. Identify core repository functions and associated hardware and infrastructure elements.
        ii. Determine the potential impact from the loss of those functions.
        iii. Define the levels of functionality required for partial as well as full recovery. Establish what level is needed for HT to fulfill its mission and the needs of its users.
        iv. Define HathiTrust’s Recovery Time Objective (RTO: the maximum allowable outage period for services) and Recovery Point Objective (RPO: the point in time to which data stores must be returned following a disaster).
        v. Determine the availability of resources in the event of a disaster and establish the repository’s prioritization with major service providers and vendors (i.e., TSM Group, ITCom, Isilon, etc.).


    e. Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.*
    f. Develop recovery strategies to bring core functions back online as soon as possible within a set cost range.
        i. Establish a logical progression in the restoration of services and associated components.
        ii. Identify the resources required for these efforts.
        iii. Consider alternative solutions, including partial (vs. full) recovery.
    g. Communicate planning goals and efforts to key contacts from service providers and vendors to better coordinate recovery efforts.*
    h. Initiate the production of core Disaster Recovery documents (see Appendix E for more information). The following list is not exhaustive; data collection and analysis will help determine if all or other plans (i.e., a web continuity plan) are needed.
        i. Business Continuity Plan: details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption.
        ii. Continuity of Operations Plan: focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.
        iii. IT Contingency Plan: addresses explicitly the disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.
        iv. Crisis Communications Plan: establishes procedures for internal and external communications during and after an emergency.
        v. Cyber-Incident Response Plan: defines the procedures for responding to cyber attacks against the HathiTrust IT system.
        vi. Occupant Emergency Plan: defines response procedures for staff in the event of a situation that poses a potential threat to the health and safety of HathiTrust personnel or their environment. (This requirement is addressed by University of Michigan Building Emergency Action Plans.)
        vii. Disaster Recovery Plan: brings together guidance and procedures from the other plans to enable the restoration of core information systems, applications, and services. This plan defines roles and responsibilities within Disaster Response Teams.
        viii. Disaster Recovery Training Plan: establishes the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.
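Item d.iv above asks for a Recovery Time Objective (RTO) and a Recovery Point Objective (RPO). Once those values are chosen, checking an incident against them is a matter of comparing two time spans, as the sketch below illustrates. The function names are hypothetical, and the report does not set RTO or RPO values; the RPO is treated here as the maximum acceptable age of the most recent recoverable copy, which is one common reading of the definition given.

```python
from datetime import datetime, timedelta


def meets_rto(
    outage_start: datetime, service_restored: datetime, rto: timedelta
) -> bool:
    """True if service came back within the allowable outage period."""
    return service_restored - outage_start <= rto


def meets_rpo(
    last_recoverable_copy: datetime, disaster_time: datetime, rpo: timedelta
) -> bool:
    """True if the newest recoverable copy is recent enough.

    Data written after `last_recoverable_copy` is assumed lost, so the
    gap between that copy and the disaster is the data-loss window.
    """
    return disaster_time - last_recoverable_copy <= rpo
```

For example, with nightly tape backups the data-loss window can approach 24 hours, so any RPO shorter than that would depend on the synchronized mirror site rather than on tape alone.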

• Long Term (12+ months)
    a. Complete and implement Disaster Recovery Plans.
        i. Distribute physical copies of the plans as needed and include at least one copy in an off-site location.
        ii. Integrate elements of response strategies into system architecture to facilitate their deployment in the event of a disaster.*


    b. Disaster Recovery Committee should monitor changes in best practices and technology, update plans, and oversee organizational readiness.*
        i. Initiate staff training so that individuals are familiar with Disaster Recovery procedures and communication protocols.*
        ii. Institute regular tests of disaster preparedness with simulated disasters involving different components of HathiTrust operations.*
        iii. Establish a schedule for maintenance and revisions to the Disaster Recovery documents.*
        iv. Coordinate Disaster Recovery Plan implementation, training, and review with Indianapolis.*

    c. Store an additional copy of backup tapes at a third site to reduce exposure and limit the chance that a widespread event in Ann Arbor could impact both local copies.
    d. Explore the possibility of establishing a third site for HathiTrust’s digital objects to reduce exposure and address concerns over the relative geographical proximity of Indianapolis and Ann Arbor.

    e. Determine the feasibility of moving operations to a “hot” site in Ann Arbor should a disaster render the MACC unusable.
        i. Identify suitable sites and consider making preliminary arrangements.
        ii. Identify and price out equipment/infrastructure necessary to continue operations.
    f. Plan for integration of new system components should the sudden collapse of Isilon leave HathiTrust without service/support.
    g. Consider an increase to system security measures as content becomes accepted from a wider range of sources and as HathiTrust becomes a higher-profile organization.


APPENDIX A: Contact Information for Important HathiTrust Resources

Indiana University Mirror Site

• Andrew Poland (Staff, Information Technology Services)
    o [email protected] (317) 274-0746

• Troy Dean Williams (Vice President for Information Technology, IU at Bloomington)
    o [email protected] (812) 856-5323

University of Michigan

Michigan Academic Computing Center (MACC): Houses much of the technical infrastructure of the University Library’s digital resources.

• Rene Gobeyn (MACC Data Center Coordinator)
    o [email protected] (734) 936-2654

• ITCom UMNOC (Network Operations Center)
    o [email protected] (734) 647-8888

ITCS-ITCom: Responsible for maintaining network connections to the UMnet Backbone and Internet; ITCom provides maintenance and support services for hardware and software.

• Mike Brower (Senior Project Manager, UM Libraries)
    o [email protected] (734) 936-9736

• Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
    o [email protected] (734) 647-3214

• ITCom UMNOC (Network Operations Center)
    o [email protected] (734) 647-8888

Tivoli Storage Manager Group: Responsible for nightly automated tape backups of storage servers.

• Andrew Inman (Service Manager)
    o [email protected] (734) 615-6286

• Cameron Hanover (Storage Engineer)
    o [email protected] (734) 764-7019

• General Support: [email protected]

• Emergency contact: [email protected]
    o Message will go to on-call staff’s pager in real time

• [email protected]

Arbor Lakes Data Facility: Houses one instance of the TSM backup tape library.

• ITCom UMNOC (Network Operations Center)
    o [email protected] (734) 615-4209

• KenPritchard(ALDFfacilitymanager)o [email protected] (734)615‐2812

Procurement Services: Approves departmental purchases over $5,000; buyers also work as intermediaries with vendors.

• Steve Worden (UM Hardware Purchasing Specialist)
  o [email protected] (734) 645-8972

• Shelly Eauclaire (Senior Buyer, Purchasing Services)
  o [email protected] (734) 615-8767

• Ian Pepper (UM Dell Computers Contract Administrator)
  o [email protected] (734) 647-4981

• Jeff Rabbitt (Alternate Dell Contract Administrator)
  o [email protected] (734) 644-9232

Property Control: Responsible for tracking and tagging the university’s assets.

• Mary Ellen Lyon (Business Operation Manager)
  o [email protected] (734) 647-3351 (t, th)
  o (734) 763-1197 (m, w, f)

Office of Financial Analysis:

• David Storey (Inventory Coordinator): Delivers UM property tags to equipment at the MACC.
  o [email protected] (734) 647-4264

Risk Management Services: Provides insurance coverage of university assets.

• Kathleen Rychlinski (Assistant Director, Risk Management Services)
  o [email protected] (734) 763-1587

Non-University Contact Information

Isilon Systems

• Jim Ramberg (Regional Territory Manager)
  o [email protected] Desk: (847) 330-6399
  o Cell: (630) 561-2463

Sun Microsystems

• Christine Sluman (Service Sales Rep - Education)
  o [email protected] (303) 557-3660, ext. 60519
  o (303) 949-1567 (Cell)

• Larry Zimmerman (Michigan Account Manager - Sales)
  o [email protected] (248) 880-3756

CDW-G

• University of Michigan Account Team
  o [email protected]

• Hansen Chennikkra (Account Manager)
  o [email protected] (866) 339-3639

• Adam Sullivan (Account Manager)
  o [email protected] (866) 339-4118

Dell Computers

• Brian Ullestad (Higher Education Account Manager)
  o [email protected] 1-800-274-7799 ext. 7249522


APPENDIX B: HathiTrust Outages from March 2008 through April 2009 [62]

• April 2009: HathiTrust experienced reduced performance from 11:00 pm EDT on Thursday, April 23 to 8:22 am EDT on Friday, April 24 due to a database problem at one of the sites and from 5:30 pm to 9:00 pm EDT on Thursday, April 30 due to unintended consequences from a networking configuration change.

• March 2009: HathiTrust was unavailable on Tuesday, March 3 from 7:00-8:00 am EST and on Thursday, March 5 from 7:00-7:45 am EST for operating system and database software upgrades.

• February 2009: On Sunday, February 22 at 8:40 am EST, a power surge resulting from electrical system maintenance caused HathiTrust database and web servers to go offline. Staff learned of the problem at approximately 6:00 pm EST, and service was restored by 6:30 pm EST.

• January 2009: A brief outage is scheduled in January for a storage system software upgrade.

• December 2008: On Friday, December 19 at 7:30 am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:40 am EST.

• November 2008: On Tuesday, November 4 at 7:30 am EST, HathiTrust was down briefly to apply security updates to a database server. Service was restored at 7:45 am EST.

• October 2008: No outages reported.

• September 2008: On Thursday, September 18 at approximately 9:30 am EDT, HathiTrust became inaccessible due to a software problem on a storage system; the problem was related to our work with data synchronization. Support was contacted and the problem was resolved at 10:45 am EDT.

• August 2008: On Tuesday, August 26 at approximately 9:00 am EDT, a database server was brought down to move to Indianapolis. Prior to shutting this server down, we did not update a manual failover configuration, causing volumes to be inaccessible to some users. The problem was resolved at 11:15 am EDT.

• July 2008: Service was unavailable on Thursday, July 31 from 7:00-7:30 am EDT for a storage system software upgrade.

• June 2008: No outages reported.

• May 2008: No outages reported.

• April 2008: No outages reported.

• March 2008: No outages reported.

[62] HathiTrust. “Updates” from http://www.hathitrust.org/updates, retrieved on 16 June 2009.
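The outage log in Appendix B can be turned into rough downtime figures. The following is a minimal sketch, not part of the original report: the start and end times are transcribed from the entries that state both, the January 2009 entry is omitted because it gives no times, and the April 2009 entries are periods of reduced performance rather than full outages. The short cause labels are paraphrases added for illustration.

```python
from datetime import datetime

# (start, end, cause) transcribed from the Appendix B entries that give both
# a start and an end time. Times are local (EST/EDT); each interval stays
# within one zone, so naive datetimes are sufficient for this sketch.
OUTAGES = [
    ("2009-04-23 23:00", "2009-04-24 08:22", "database problem (reduced performance)"),
    ("2009-04-30 17:30", "2009-04-30 21:00", "network configuration change (reduced performance)"),
    ("2009-03-03 07:00", "2009-03-03 08:00", "OS and database upgrades"),
    ("2009-03-05 07:00", "2009-03-05 07:45", "OS and database upgrades"),
    ("2009-02-22 08:40", "2009-02-22 18:30", "power surge"),
    ("2008-12-19 07:30", "2008-12-19 07:40", "security updates"),
    ("2008-11-04 07:30", "2008-11-04 07:45", "security updates"),
    ("2008-09-18 09:30", "2008-09-18 10:45", "storage system software problem"),
    ("2008-08-26 09:00", "2008-08-26 11:15", "manual failover configuration not updated"),
    ("2008-07-31 07:00", "2008-07-31 07:30", "storage system software upgrade"),
]

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# One (minutes, cause) pair per logged event.
durations = [(minutes_between(s, e), cause) for s, e, cause in OUTAGES]
total_minutes = sum(m for m, _ in durations)
longest = max(durations)  # compares by minutes first

if __name__ == "__main__":
    print(f"Logged disruption: {total_minutes} minutes across {len(durations)} events")
    print(f"Longest event: {longest[1]} ({longest[0]} minutes)")
```

Figures of this kind are one possible input to the data collection step described in Appendix E, where allowable outage times are set.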


APPENDIX C: Washtenaw County Hazard Ranking List

The following list ranks a variety of natural and manmade events within Washtenaw County, Michigan, based upon their frequency of occurrence and the extent of their potential impact (on the county’s population).

Rank  Hazard                                                              Frequency            Population Impacted
1     Convective weather (severe winds, lightning, tornados, hailstorms)  Once or more/yr.     250,000
2     Hazardous materials incidents: transportation                       Once or more/yr.     2,000
3     Hazardous materials incidents: fixed site                           Once or more/yr.     10,000
4     Severe winter weather hazards (ice/sleet/snowstorms)                Once or more/yr.     250,000
5     Infrastructure failures                                             Once every 5 yrs.    30,000
6     Transportation accidents: air and land                              Once or more/yr.     100
7     Extreme temperatures                                                Once every 5 yrs.    10,000
8     Flood hazards: riverine/urban flooding                              Once every 10 yrs.   2,000
9     Nuclear attack                                                      Has not occurred     250,000
10    Petroleum and natural gas pipeline accidents                        Once every 10 yrs.   1,000
11    Fire hazards: wildfires                                             Once or more/yr.     0

Source: Washtenaw County Hazard Mitigation Plan (available online at http://www.ewashtenaw.org/government/departments/planning_environment/planning/planning/hazard_html)
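The two columns of the table can be combined into a single rough figure for comparison. The sketch below is an illustration only and is not the county’s ranking method: it assumes each frequency label can be read as a return period in years and computes population impacted per year. The resulting order differs from the county’s ranks, which suggests the county weighed factors beyond these two columns.

```python
# Rows transcribed from the Appendix C table:
# (county rank, hazard, frequency label, population impacted)
HAZARDS = [
    (1, "Convective weather", "Once or more/yr.", 250_000),
    (2, "Hazardous materials incidents: transportation", "Once or more/yr.", 2_000),
    (3, "Hazardous materials incidents: fixed site", "Once or more/yr.", 10_000),
    (4, "Severe winter weather hazards", "Once or more/yr.", 250_000),
    (5, "Infrastructure failures", "Once every 5 yrs.", 30_000),
    (6, "Transportation accidents: air and land", "Once or more/yr.", 100),
    (7, "Extreme temperatures", "Once every 5 yrs.", 10_000),
    (8, "Flood hazards: riverine/urban flooding", "Once every 10 yrs.", 2_000),
    (9, "Nuclear attack", "Has not occurred", 250_000),
    (10, "Petroleum and natural gas pipeline accidents", "Once every 10 yrs.", 1_000),
    (11, "Fire hazards: wildfires", "Once or more/yr.", 0),
]

# Assumption: read each frequency label as a return period in years.
# "Once or more/yr." is treated as exactly once per year (a lower bound).
RETURN_PERIOD_YEARS = {
    "Once or more/yr.": 1,
    "Once every 5 yrs.": 5,
    "Once every 10 yrs.": 10,
    "Has not occurred": None,
}


def annual_exposure(frequency: str, population: int) -> float:
    """Population impacted per year under the assumed return periods."""
    period = RETURN_PERIOD_YEARS[frequency]
    return 0.0 if period is None else population / period


# Stable sort: hazards with equal exposure keep their county order.
by_exposure = sorted(
    HAZARDS, key=lambda h: annual_exposure(h[2], h[3]), reverse=True
)

if __name__ == "__main__":
    for rank, hazard, freq, pop in by_exposure:
        print(f"{annual_exposure(freq, pop):>10.0f}  (county rank {rank:>2})  {hazard}")
```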


APPENDIX D: Annotated Guide to Disaster Recovery Planning References

The topic of disaster recovery planning for the print and analog resources of libraries has been widely dealt with in professional literature, but comparatively little information exists concerning the development and implementation of plans for the digital content of cultural institutions. The following bibliography details resources which provide guidance, examples, and explanations of the objectives and strategies for digital Disaster Recovery Plans. It consists primarily of material compiled by Lance Stuchell (ICPSR Intern) and Nancy McGovern (ICPSR Digital Preservation Officer) and is included here with their permission.

University of Michigan Resources

• University of Michigan Administrative Information Services (MAIS): Emergency Management, Business Continuity, and Disaster Recovery Planning.
  o http://www.mais.umich.edu/projects/drbc_methodology.html
  o This site broadly outlines the need for and functions of Emergency Management, Business Continuity, and Disaster Recovery Planning at UM. It also contains templates designed to help units plan, test, and audit disaster and continuity programs.

• Provost and Executive Vice President for Academic Affairs: Standard Practice Guide: Institutional Data Resource Management Policy
  o http://spg.umich.edu/
  o This policy defines institutional data resources as University assets and makes recommendations on identifying, preserving, and providing access to these assets. The digital resources of the library may be identified as such, based upon their use by departments across the university.

• ICPSR Disaster Planning Resources:
  o Digital Preservation Officer Nancy McGovern is part of a Disaster Recovery initiative at ICPSR, and over the past several years her team (including Lance Stuchell) has produced a variety of documents and templates to help other institutions work through the planning process.
  o Documents are available upon request and should be posted in the near future (as of July 2009) to the ICPSR Web site (http://icpsr.umich.edu/).

• Disaster Recovery Experts:
  o Rene Gobeyn (MACC Data Center Coordinator)
     ▪ Managed and coordinated Disaster Recovery for U.S. military data centers
     ▪ [email protected]
  o Krystal Hall (Disaster Recovery Planner, ITCS/ITCom Operations)
     ▪ Helped develop current ITCS Disaster Recovery plans
     ▪ [email protected]


External Resources

• General Guide to Disaster Planning
  o Contingency Planning Guide for Information Technology Systems: Recommendations of the National Institute of Standards and Technology, NIST Special Publication 800-34, June 2002.
     ▪ http://csrc.nist.gov/publications/nistpubs/800-34/sp800-34.pdf
     ▪ An indispensable resource which was used heavily by ICPSR in its Disaster Recovery planning. It covers everything from initial data collection and policy formation to the structure of disaster response teams and the articulation of recovery strategies.

• Examples and Tools for the Documentation Outlined by NIST Guide:
  o Full Disaster Recovery Plan:
     ▪ United States Department of Agriculture Disaster Recovery and Business Resumption Plans
     ▪ http://www.ocio.usda.gov/directives/doc/DM3570-001.htm
  o Business Continuity Plan (BCP):
     ▪ MAIS: Emergency Management, Business Continuity, and Disaster Recovery Planning
     ▪ http://www.mais.umich.edu/projects/drbc_templates.html
     ▪ This site provides several resources that deal with continuity planning.
  o Continuity of Operations Programs (COOP):
     ▪ FEMA: Continuity of Operations (COOP) Programs
        • http://www.fema.gov/government/coop/index.shtm
        • Contains a lot of useful information on government policy, templates, and training resources to assist in the creation of a COOP.
     ▪ Ready.gov: Continuity of Operations Planning
        • http://www.ready.gov/business/plan/planning.html
        • Guidelines for composing a business COOP, including what outside actors should be involved in the planning process.
     ▪ The Florida Department of Health: Continuity of Operations Plan for Information Technology
        • http://www.naphit.org/global/library/basement_docs/FL_DisasterRecovery_template.doc
        • Lengthy (40 pages) and detailed COOP template written for an IT environment.
     ▪ Florida Atlantic University Libraries: Continuity of Operations Plan
        • http://www.staff.library.fau.edu/policies/coop-2007.pdf
        • A detailed working COOP, which includes reactions to specific disaster scenarios.
  o IT Contingency Plan:
     ▪ See the USDA Disaster Recovery Plan for an example of an IT Contingency Plan.
  o Cyber Incident Response Plan:
     ▪ Multi-State Information Sharing and Analysis Center Cyber Incident Response Guide
        • http://www.msisac.org/localgov/documents/FINALIncidentResponseGuide.pdf
        • The guide provides a step-by-step process for responding to incidents and developing an incident response team. It may also serve as a template for drafting a Cyber-Incident Response Policy and Plan.

  o Crisis Communication Plan:
     ▪ Ready.gov: Write a Crisis Communication Plan
        • http://www.ready.gov/business/talk/crisisplan.html
        • This site provides guidelines for composing a business disaster communication plan and includes suggestions for the plan’s Web presence.
     ▪ NC State University: Crisis Communication Plan
        • http://www.ncsu.edu/emergency-information/crisisplan.php
        • This is the policy and plan for the University as a whole. While much of this policy deals with communication at a high level, useful sections detail vital contacts within the organization (including who to contact first), and how to manage external communications.
     ▪ Other thorough university policies and plans include the LSU: Crisis Communication Plan and the Missouri S&T: Crisis Communication Plan.
     ▪ Heritage Microfilm Flood Update Email
        • This email was sent in response to the June 2008 flooding that occurred in the Midwest.
        • It updates clients on the outage of NewspaperArchive.com which resulted from a flood-induced widespread power failure. It is an excellent example of an external crisis communication to users.

  o Disaster Recovery Plans (DRP):
     ▪ The University of Iowa: IT Services Disaster Recovery Plan
        • http://cio.uiowa.edu/ITplanning/Plans/ITSdisasterPrep.shtml
        • This policy details the data collection and assessment which informs the UI plan and also includes emergency procedures, response strategies, and a crisis communication plan.
     ▪ University of Arkansas: Computing Services Disaster Recovery Plan
        • http://www.uark.edu/staff/drp/
        • A complete and thorough plan that outlines the initiation of emergency and recovery procedures, and addresses how the plan will be maintained.
     ▪ Adams State College (CO): Information Technology Disaster Recovery Plan
        • http://www.adams.edu/administration/computing/dr-plan100206.pdf
        • This plan has a thorough section on risk assessment.
     ▪ Digital Preservation Europe Repository Planning Checklist and Guidance
        • http://www.digitalpreservationeurope.eu/platter.pdf
        • Designed for use with the Planning Tool for Trusted Electronic Repositories (PLATTER), this document outlines considerations for a Disaster Recovery Strategic Objective Plan (SOP) and places them in context with other repository plans.
  o Occupant Emergency Plan (OEP):
     ▪ This requirement is addressed by University of Michigan Building Emergency Action Plans (EAP).
        • http://www.umich.edu/~oseh/guideep.pdf
  o Disaster Recovery Training Guides:
     ▪ dPlan.org
        • Provides useful information on training and an online form that would be useful in assigning trainers and monitoring the training process.
     ▪ CalPreservation.org: Disaster Plan Exercise
        • http://calpreservation.org/disasters/exercise.html
        • Provides roles and teaching points for a role-play training exercise that focuses on a disaster in a library.

• Policy Planning Tools:
  o Association of Public Treasurers of the United States and Canada: Disaster Policy Certification Guidelines
     ▪ www.aptusc.org/includes/getpdf.php?f=Disaster_Policy.pdf
     ▪ This planning document and template for disaster management policies provides outlines and example language on several facets of a strong policy, including the possible loss of a building, the replacement of computer resources, and testing and training for the disaster plan. It also outlines the need to identify possible threats to assets.

• Examples of Disaster Planning Policies:
  o Arkansas Secretary of State: Disaster Planning Policy
     ▪ http://www.sos.arkansas.gov/elections/elections_pdfs/register/oct_reg/016.14.01-020.pdf
     ▪ This policy outlines areas of responsibility between departments and units, and includes training, communication, and recovery plan updates.
  o Washington State Department of Information Services: Disaster Recovery and Business Resumption Planning Policy
     ▪ http://isb.wa.gov/policies/portfolio/500p.doc
     ▪ This document illustrates policy formation for an IT Disaster Recovery Plan. It provides guidelines for Disaster Recovery Planning as well as maintenance, testing, and training involved with the recovery plan.
  o Florida State University: Information Technology Disaster Recovery and Data Backup Policy
     ▪ http://oti.fsu.edu/oti_pdf/Information%20Technology%20Disaster%20Recovery%20and%20Data%20Backup%20Policy.pdf
     ▪ This document includes policy for data backup as well as Disaster Recovery. Part of the policy includes a definition of Best Practice Disaster Recovery Procedures, as well as an outline of the university’s own IT recovery planning and implementation procedures.

• Example of a Relevant Disaster Planning Program:
  o OCLC Digital Archive Preservation Policy and Supporting Documentation
     ▪ http://www.oclc.org/support/documentation/digitalarchive/preservationpolicy.pdf
     ▪ This document has a clear articulation of OCLC's disaster policy, along with an outline of disaster prevention and recovery procedures and a time-frame for the restoration of services in the event of a disaster.
     ▪ The policy includes a good definition of a disaster prevention and recovery plan: “A set of responses based on sound principles and endorsed by senior management, which can be activated by trained staff with the goal of preventing or reducing the severity of the impact of disasters and incidents.”
     ▪ OCLC embeds its disaster plan within its overall preservation policy, stating: “The goal of disaster prevention is to safeguard the data (content and metadata) in the Digital Archive and to safeguard the Digital Archive’s software and systems. For disaster prevention and recovery, all data (content and metadata) is considered of equal value.”

• Designing a Disaster Planning Program:
  o Michigan State University: Step by Step Guide to Disaster Recovery Planning
     ▪ http://www.drp.msu.edu/Documentation/StepbyStepGuide.htm
     ▪ This program breaks down the disaster planning process into steps, and provides information relevant to individual units within a university setting. The MSU Disaster Recovery Planning Homepage (http://www.drp.msu.edu/) also offers a variety of resources.
  o Minnesota State Archives: Disaster Preparedness
     ▪ http://www.mnhs.org/preserve/records/docs_pdfs/disaster_000.pdf
     ▪ This document is a detailed guide to the disaster planning process. While mostly dealing with paper records, the document clearly identifies different roles and responsibilities for members of the planning and recovery team.
  o Cisco Systems: Disaster Recovery Best Practices White Paper
     ▪ http://www.cisco.com/warp/public/63/disrec.pdf
     ▪ The paper outlines Disaster Recovery using the framework of the above resources, but tailors it to an IT point of view. It has useful information on how to prepare and recover both hardware and software assets.
  o AT&T: Key Elements to an Effective Business Continuity Plan
     ▪ http://www.business.att.com/content/article/Key_to_Effective_BC_Plan.pdf
     ▪ A short paper that summarizes business continuity planning in the private sector.

• General Information
  o Federal Emergency Management Agency: Emergency Management Guide for Business & Industry
     ▪ http://www.fema.gov/business/guide/index.shtm
     ▪ A practical guide with step-by-step advice on creating a Disaster Recovery program. Includes information on the formation of a planning committee, organizational analysis, and details on specific hazards.
  o Special Libraries Association Information Portal: Disaster Planning and Recovery
     ▪ http://www.sla.org/content/resources/infoportals/disaster.cfm
     ▪ An exhaustive list of resources, this page includes articles on digital disaster recovery strategies as well as information on planning, examples of plans, and links to a wide range of resources in the public and private sector.

Written Resources:

• Wellheiser, Johanna and Jude Scott. An Ounce of Prevention: Integrated Disaster Planning for Archives, Libraries, and Record Centres. Lanham, MD: Scarecrow Press, 2002.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004233950&local_base=AA_PUB

• Cox, Richard J. Flowers After the Funeral: Reflections on the Post-9/11 Digital Age. Lanham, MD: Scarecrow Press, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004341258&local_base=AA_PUB

• Matthews, Graham and John Feather, eds. Disaster Management for Libraries and Archives. Burlington, VT: Ashgate, 2003.
  o http://mirlyn.lib.umich.edu/F/?func=direct&doc_number=004354795&local_base=AA_PUB


APPENDIX E: Overview of the Disaster Recovery Planning Process

Various resources agree that there is no one way to go about initiating a Disaster Recovery program or drafting a DR plan. An organization must proceed according to its functions and resources as well as the needs of its designated community of users. The following discussion draws heavily upon the ICPSR Disaster Planning Policy Framework (written by Nancy McGovern and Lance Stuchell) and the Contingency Planning Guide for Information Technology Systems published by NIST (2002). As such, it represents a consolidation and simplification of information presented in more depth elsewhere. A list of planning resources (with link information to full texts) is available in Appendix D.

• Basic Precepts of Disaster Recovery Planning

1) Disaster Recovery Planning is a continuous activity that involves monitoring internal conditions as well as evolutions in technology and threats; responding to new developments that arise; revising plans so that they remain relevant and effective; training staff according to plans; and testing organizational readiness.

   a. There is no single document which contains “the plan”; rather, a Disaster Recovery Plan consists of a suite of documents that require a regular schedule of testing and revision to be effective.

   b. There is no point at which a Disaster Recovery Plan is “finished.”

2) Disaster Recovery Planning needs to be an organization-wide activity.

   a. Disaster recovery must be one of the basic functions of HathiTrust.

   b. An effective plan needs full administrative support.

   c. Policies and procedures must complement and conform to disaster response plans established by the university, city, and Department of Homeland Security.

3) Disaster recovery cannot be limited to the hardware and software components or data collections of HathiTrust; planning must also account for the impact of human emergencies on the repository’s operations.

• Essential Steps in Disaster Recovery Planning

1) Establish a Disaster Recovery Planning Committee.

   a. This group will research and develop the plan and help with its implementation as well as monitor the training, testing, and revising of plans to ensure organizational compliance and readiness.

   b. The committee should involve individuals representing the various mission-critical units within the library (from administration to Core Services to the Digital Preservation Librarian) who will participate in the development of policy and recovery planning.

   c. It is essential that the committee involve individuals with the authority to support and enforce recommendations.

   d. The committee’s activities should initiate the formation of a Disaster Response Program.

2) Draft a Disaster Recovery Planning Policy Statement.


   a. Enables the organization, and others, to understand the scope and nature of the Disaster Recovery Plan.

   b. Establishes the organizational framework and responsibilities for the planning process.

   c. Key policy elements (as detailed in the NIST report):

      i. Roles and responsibilities within the organization with regard to planning

      ii. Mandate for Disaster Recovery as well as any statutory or regulatory requirements

      iii. Scope as it applies to the type(s) of platform(s) and organizational functions subject to Disaster Recovery Planning

      iv. Resource requirements for the Disaster Recovery program

      v. Training requirements

      vi. Exercise and testing schedules (at least one major annual test)

      vii. Plan maintenance schedule (elements should be reviewed annually)

      viii. Frequency of backups and storage of backup media.
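The schedule elements of the policy statement (an annual major test and an annual review of plan elements) lend themselves to a simple automated reminder. The sketch below is illustrative only: the 365-day interval is one reading of "annual", and the activity names and dates are hypothetical placeholders rather than records from HathiTrust.

```python
from datetime import date, timedelta

# Assumption: "annual" is interpreted as no more than 365 days between events.
ANNUAL = timedelta(days=365)


def is_overdue(last_done: date, today: date, interval: timedelta = ANNUAL) -> bool:
    """Return True when more than `interval` has passed since `last_done`."""
    return today - last_done > interval


def overdue_items(last_done_by_item: dict, today: date) -> list:
    """Return the names of scheduled activities that are past due, sorted."""
    return sorted(
        name for name, last in last_done_by_item.items() if is_overdue(last, today)
    )


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    schedule = {
        "major recovery test": date(2008, 8, 1),
        "plan maintenance review": date(2009, 1, 15),
    }
    print(overdue_items(schedule, today=date(2009, 8, 24)))
```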

3) Conduct Data Collection and Analysis (i.e., “Business Impact Analysis”)

   a. Determine critical functions and identify specific system resources required to perform them. Minimum requirements for functionality should be established.

   b. Determine risks and vulnerabilities facing the repository’s systems and infrastructure.

   c. Identify and coordinate with internal and external points of contact to determine how they depend on or support the repository and its functions; consider how one failure might cascade into others.

      i. Identify resources that are crucial to HathiTrust (e.g., Mirlyn)

      ii. Determine the allowable outage/disruption time for these resources

   d. Develop recovery priorities; balance the cost of inoperability against the cost of recovery

      i. Determine HathiTrust’s position within the priorities of the university as well as with its major service providers and vendors (e.g., TSM Group, ITCom, Isilon) to better understand how that prioritization will impact recovery efforts.

      ii. Establish the most crucial functions which must be restored first.

      iii. Determine HathiTrust’s Recovery Time Objective (RTO, i.e., the maximum allowable outage period) and Recovery Point Objective (RPO, i.e., the point in time to which data files must be restored after a disaster).

      iv. Review potential resources (financial, personnel, etc.) within HathiTrust as well as those available via contracts, service providers, and product support. This step should involve the clarification of HathiTrust’s position within the university’s as well as key service providers’ and vendors’ priorities.
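Once an RTO and RPO have been chosen, checking an incident against them is simple arithmetic. The following is a minimal sketch under stated assumptions: the RPO is treated as the maximum tolerable age of the most recent recoverable copy of the data, and the 4-hour RTO, 24-hour RPO, and 2:00 am backup time are hypothetical values chosen for illustration. The outage times are those of the February 2009 power surge listed in Appendix B.

```python
from datetime import datetime, timedelta


def meets_rto(outage_start: datetime, service_restored: datetime, rto: timedelta) -> bool:
    """True when service came back within the maximum allowable outage period."""
    return service_restored - outage_start <= rto


def meets_rpo(last_good_backup: datetime, disaster_time: datetime, rpo: timedelta) -> bool:
    """True when the newest recoverable copy is no older than the RPO allows.

    Assumption: the RPO is expressed as a maximum data-loss window.
    """
    return disaster_time - last_good_backup <= rpo


if __name__ == "__main__":
    # Outage times from the February 2009 entry in Appendix B.
    start = datetime(2009, 2, 22, 8, 40)
    restored = datetime(2009, 2, 22, 18, 30)

    # Hypothetical objectives and backup time, for illustration only.
    rto = timedelta(hours=4)
    rpo = timedelta(hours=24)
    last_backup = datetime(2009, 2, 22, 2, 0)

    print("Outage length:", restored - start)
    print("RTO met:", meets_rto(start, restored, rto))
    print("RPO met:", meets_rpo(last_backup, start, rpo))
```

Under these hypothetical objectives the outage of 9 hours 50 minutes would miss the RTO while the nightly backup would satisfy the RPO, which illustrates why the two objectives need to be set and tested separately.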

4) Address risks uncovered in the data collection phase and institute preventative controls as needed to anticipate and mitigate those risks.

5) Develop recovery strategies that respond to the potential impacts and maximum allowable outage times established in the data collection phase. Efforts should focus on solutions that are cost-effective and technically viable.

   a. Strategies should be designed to bring core functions back online as soon as possible within an established cost range.

   b. Recovery efforts must be prioritized according to the nature of core functions as well as logical order of procedures.

   c. Alternative solutions should be considered based upon cost, availability of resources, outage times, levels of functionality (partial vs. full), and ability to integrate methods with existing infrastructure.

   d. Determine the practicality of partial (vs. full) recovery in order to bring services back online in a timely and cost-effective manner.

   e. Recovery strategies and resources should be incorporated (as possible) into the repository’s system architecture so that in the event of a disaster, the response may proceed in an efficient and straightforward manner.

6) Formalize and record collected data and recovery strategies in Disaster Recovery Documents. In the process of producing this wide range of documents, an organization is forced to consider and document policies and procedures related to a variety of key administrative and technical issues. The decision of which plans to include (and which to exclude) must be determined based upon a review of HathiTrust’s needs and objectives. Additional documents (a Web continuity plan, for instance) may be necessary based upon data collection and analysis.

   a. Business Continuity Plan

      i. Business continuity is the ability of a business to continue its operations with minimal disruption or downtime in the event of natural or manmade disasters.

      ii. Such planning allows an organization to ensure its survival by considering potential business interruptions and establishing appropriate, cost-effective responses.

      iii. The Business Continuity Plan details HathiTrust’s core functions and the priorities for re-establishing each in the event of a disruption. It should address key administrative and support functions as well as those which directly involve the repository’s designated community.

      iv. The plan should thoroughly document the nature of key functions, interdependences, the impact of their loss, and alternative means to ensure their continuation in the event of a disaster. MAIS offers a useful Business Continuity planning template at http://www.mais.umich.edu/projects/drbc_templates.html.

   b. Continuity of Operations Plan (COOP)

      i. The COOP focuses on restoring an organization’s (usually a headquarters element) essential functions at an alternate site and performing those functions for up to 30 days before returning to normal operations.


      ii. This plan may include the Business Continuity Plan and Disaster Recovery Plan as appendices.

   c. IT Contingency Plan

      i. The IT Contingency Plan addresses disaster planning for computers, servers, and elements of the technical infrastructure that support key applications and functions.

      ii. It should account for the following:

         1. Document hardware and software

         2. Develop an emergency contact list

         3. Back up and store all data files off-site

         4. Proactively monitor equipment and data

         5. Install and update antivirus software on both computers and servers

         6. Develop recovery scenarios

         7. Communicate and monitor the plan

      iii. The plan allows HathiTrust to formalize and document procedures and policies already in place and details the repository’s adherence to these goals.
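One common way to "proactively monitor" data is to record a checksum for each file and later confirm that the stored files still match. The sketch below shows that general technique with Python's standard library; it is not a description of how HathiTrust or its service providers monitor data, and the function names and manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its current checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(manifest: dict, root: Path) -> list:
    """Return relative paths that are missing or no longer match the manifest."""
    changed = []
    for rel, expected in manifest.items():
        candidate = root / rel
        if not candidate.is_file() or sha256_of(candidate) != expected:
            changed.append(rel)
    return sorted(changed)
```

A manifest built at backup time and re-checked on a schedule would flag silent corruption or accidental deletion before a restore is needed.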

   d. Crisis Communications Plan

      i. Communication is a vitally important aspect of Disaster Recovery Planning and an organization’s actual response in a disaster.

      ii. The Crisis Communications Plan establishes procedures for internal and external communications during and after an emergency.

      iii. The different phases of crisis communication encompass the initial notification of an event, damage assessment, and plan activation as well as status reports (as needed) and the eventual completion of recovery efforts.

      iv. Activation of the communications plan must be the responsibility of a specific individual.

      v. The Disaster Response Team coordinates with the Crisis Communication Team to ensure that information provided about an emergency is clear, concise, and consistent.

   e. Cyber-Incident Response Plan

      i. This plan defines the procedures for responding to cyber attacks against the HathiTrust IT system.

      ii. It provides a formal framework for identifying, mitigating, and recovering from malicious computer incidents, such as unauthorized access to a system or data, denial of service, or unauthorized changes to system hardware, software, or data.


   f. Occupant Emergency Plan

      i. The Occupant Emergency Plan defines response procedures for library staff in the event of a situation that poses a potential threat to the health and safety of personnel, the environment, or HathiTrust property.

      ii. HathiTrust may utilize the framework provided by UM Building Emergency Action Plans for this element.

   g. Disaster Recovery Plan

      i. The primary focus of the Disaster Recovery Plan is the restoration of core information systems, applications, and services.

      ii. The plan brings together guidance and procedures from the other plans (e.g., Business Continuity Plan, IT Contingency Plan, Crisis Communications Plan) pertaining to emergencies that result in interruptions of service that exceed acceptable downtimes, as defined in the BCP.

      iii. The plan should detail established recovery strategies for specific disaster situations as well as the teams involved in their execution.

      iv. Personnel should be chosen to staff disaster response teams based on their skills and knowledge. Ideally, teams would be staffed with the personnel responsible for the same or similar operation under normal conditions. Team members should also be familiar with the goals and procedures of other teams to facilitate inter-team coordination. Each team is led by a team leader (with a suitable alternate) who directs overall team operations, acts as the team’s representative to management, and liaises with other team leaders. Disaster Response cannot be individual-specific or overly reliant on specific people. Teams must assign each role at least one alternate in the event that core people are unavailable at the time of a disaster.

      v. NIST suggests that a capable strategy will require some or all of the following functional groups. For HathiTrust, many of these are already in place in the form of University of Michigan units and service providers.

         1. An authoritative role for overall decision-making responsibility

         2. Senior Management Official

         3. Management Team

         4. Damage Assessment Team

         5. Operating System Administration Team

         6. Systems Software Team

         7. Server Recovery Team (e.g., client server, Web server)

         8. LAN/WAN Recovery Team

         9. Database Recovery Team

         10. Network Operations Recovery Team

         11. Application Recovery Team(s)


         12. Telecommunications Team

         13. Hardware Salvage Team

         14. Alternate Site Recovery Coordination Team

         15. Original Site Restoration/Salvage Coordination Team

         16. Test Team

         17. Administrative Support Team

         18. Transportation and Relocation Team

         19. Media Relations Team

         20. Legal Affairs Team

         21. Physical/Personnel Security Team

         22. Procurement Team (equipment and supplies)
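The staffing requirement for response teams (every role has at least one alternate, and no role depends on a single person) can be checked mechanically against a team roster. The sketch below is a hypothetical illustration: the `Role` structure and the staff names are invented for the example, while the team names come from the NIST list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Role:
    """One role on a disaster response team (hypothetical structure)."""
    team: str
    title: str
    primary: str
    alternates: List[str] = field(default_factory=list)


def roles_missing_alternates(roles: List[Role]) -> List[str]:
    """Return 'team: title' for every role lacking a distinct alternate.

    An alternate who is the same person as the primary does not count.
    """
    return [
        f"{r.team}: {r.title}"
        for r in roles
        if not any(alt != r.primary for alt in r.alternates)
    ]


if __name__ == "__main__":
    # Placeholder names, for illustration only.
    roster = [
        Role("Database Recovery Team", "Team leader", "Person A", ["Person B"]),
        Role("Server Recovery Team", "Team leader", "Person C"),
        Role("Damage Assessment Team", "Team leader", "Person D", ["Person D"]),
    ]
    for gap in roles_missing_alternates(roster):
        print("No alternate assigned for", gap)
```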

   h. Disaster Recovery Training Plan

      i. This plan will establish the situations and procedures to be covered by HathiTrust’s Disaster Recovery training.

      ii. The contents of the plan should reflect the range of responsibilities held by administrators, department heads, and staff within HathiTrust.

      iii. The plan should accommodate Disaster Recovery Planning Committee members as well as those of the Disaster Response Team. For the latter, it should identify key roles and responsibilities in recovery efforts.

      iv. The plan should allow in-house training to be supplemented by external opportunities.

      v. Regularly scheduled emergency drills should also be included to test the readiness of staff and the appropriateness of response procedures.

7) Implement elements developed in the planning process. Procedures and policies related to communication, technological solutions, etc. must be incorporated into HathiTrust’s overall design and operation so that Disaster Recovery becomes a critical organizational function.

8) Institute a regular program of training and testing to be sure that staff understand and accept policies and procedures and to ensure that HathiTrust is prepared for a disaster.

9) Conduct regular review and maintenance of Disaster Recovery documents to respond to changes in personnel, organizational structure or functions, and evolutions in technology and/or threats.

• Main Phases in a Disaster Response:

1) Notification/Activation: This phase covers the initial actions once a situation has been detected or is threatened. It includes damage assessment and the implementation of an appropriate response strategy.

   a. Proper diagnosis and communication (both internal and external) of a disaster are essential.


   b. The nature of individual events will determine who needs to be involved (e.g., facilities management, core services).

2) Recovery: This phase focuses on the return to a pre-established level of functionality (plans should detail partial as well as full recoveries).

   a. Response teams implement recovery strategies and adhere to procedures and protocols outlined in Disaster Recovery Documents.

3) Reconstitution: After recovery efforts are complete, normal operations must be restored. This may involve the reconstruction of facilities and/or infrastructure as well as the testing of restored elements to ensure their full functionality.
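The three response phases form a fixed sequence, which can be modeled as a small state machine so that status reports and logs always name the current phase. This is a minimal sketch: the `NORMAL` state and the function name are additions for illustration and do not appear in the NIST guide or in this report.

```python
from enum import Enum


class Phase(Enum):
    NORMAL = "Normal operations"  # added for illustration; not a response phase
    NOTIFICATION_ACTIVATION = "Notification/Activation"
    RECOVERY = "Recovery"
    RECONSTITUTION = "Reconstitution"


# Each phase has exactly one successor; reconstitution ends with a
# return to normal operations.
NEXT_PHASE = {
    Phase.NORMAL: Phase.NOTIFICATION_ACTIVATION,
    Phase.NOTIFICATION_ACTIVATION: Phase.RECOVERY,
    Phase.RECOVERY: Phase.RECONSTITUTION,
    Phase.RECONSTITUTION: Phase.NORMAL,
}


def advance(current: Phase) -> Phase:
    """Return the phase that follows `current` in the response sequence."""
    return NEXT_PHASE[current]


if __name__ == "__main__":
    phase = Phase.NORMAL
    for _ in range(4):
        phase = advance(phase)
        print(phase.value)
```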

APPENDIX F: TSM Backup Service Standard Service Level Agreement (2008)

APPENDIX G: ITCS/ITCom Customer Network Infrastructure Maintenance Standard SA (2006)

APPENDIX H: MACC Server Hosting Service Level Agreement (Draft, 2009)

APPENDIX I: Michigan Academic Computing Center Operating Agreement (2006)