Supporting the scientific data lifecycle · 2016-01-07 · • Horizon 2020 project starting April...

Supporting the Scientific Data Lifecycle | ISGC 2015, Taipei | Patrick Fuhrmann | 17 March 2015

Patrick Fuhrmann, on behalf of the project team


Page 1: Supporting the scientific data lifecycle

Patrick Fuhrmann

On behalf of the project team

Page 2: Content

• How are software features selected?
• How are software features funded?
• Hardening new features.
• Exploring new communities.
• Responding to new technologies (HW and SW).
• Something about INDIGO-DataCloud.
• Essentially a random walk focusing on things I thought might be interesting.

Page 3: Some words on why and when dCache does what it does

Page 4: How are software features selected?

• Scientific communities believe that Open Source software grows on trees.
• Consequently, they are not willing to contribute to the development and software management at all.
• They assume that complaints are very valuable contributions.
• The next consequence is that Open Source teams mainly implement software features that are required by the labs where the core team members are hosted.
• In order to explore new communities and satisfy their software requirements, Open Source projects need external money.

Page 5: How are new features funded?

• This is where “National” and “European” projects come into play.
• For dCache, this:
  – was EMI
  – is the German national LSDMA project
  – and will be INDIGO-DataCloud
• The drawback: They tell you what they want to see in your code.

Page 6: Funded features are not necessarily those you need?

• However, dCache has some invariant objectives:
  – The master plan (last slide of this presentation).
  – Be up to date on new technologies, either software or hardware.
  – Attract new communities, as their specific requirements, if they can be fulfilled, make dCache even better.
• It can be a bit tricky to tune the funding projects exactly into the direction of our objectives.
• So, let’s see how dCache managed/manages that…

Page 7: Funding influences dCache development topics

[Timeline figure]
• 2010 to 2013: Standardization; NFS 4.1/pNFS; HTTP/WebDAV; contributing to the Dynamic Federation.
• Between the two periods: Deploying new technologies into production and exploring new communities.
• 2015 to 2018 (INDIGO DataCloud): Data Life Cycle; Multi-Tier Storage; Quality of Service; Migration; Archiving; AAI.

Page 8:

From 2013 to now, we slowed down development between two very demanding development projects, EMI and INDIGO-DataCloud, to:

• Deploy newly implemented technologies into production.
• Explore new communities and learn about their needs.

Page 9: Deploying NFS into production

• CMS Grid Infrastructure @ DESY
• The DESY-Cloud
• Fermilab (various Intensity Frontier experiments)
• And the issues

Page 10: New production systems based on dCache NFS

[Diagram: access paths into the dCache backend storage layer]
• NFS 4.1/pNFS: direct low-latency access for worker nodes and HPC.
• Wide area: FTS, Globus (Online).
• Sync & Share: laptops, mobile devices.

See Paul’s presentation on Thursday.

Page 11: CMS Tier II @ DESY

• Slowly migrating CMS Grid worker nodes to NFS 4.1 data access.
• Good experience as long as the network is stable.

Page 12: Job Efficiency (NFS vs. dCap)

[Plot. Axes: Execution Time (hours) and Job Efficiency (CPU / Wall Time). Series: NFS 4.1/pNFS and dCap.]
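The efficiency metric on this plot is CPU time divided by wall time. A minimal sketch of that ratio follows; the function name and the sample numbers are illustrative assumptions, not measurements from the talk.

```python
def job_efficiency(cpu_seconds: float, wall_seconds: float) -> float:
    """Return CPU time divided by wall-clock time.

    For a single-core job, a value close to 1.0 means the CPU rarely
    had to wait, for example on data access.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return cpu_seconds / wall_seconds


# Hypothetical job: 54 minutes of CPU time within a 60 minute run.
example = job_efficiency(cpu_seconds=54 * 60, wall_seconds=60 * 60)
print(f"{example:.2f}")
```

A job that spends more time waiting for storage shows a lower ratio, which is why the metric is a natural way to compare two access protocols.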

Page 13: As with all new specs, there are issues

• Network problems cause the system to behave unpredictably.
• Data servers behind firewalls.
• Weak clients on VMs.
• Specification violation
  – Infinite state recovery with the Linux kernel

Page 14: Exploring new communities

Page 15: Exploring new communities

• Jülich-Aachen Research Association, JADE
  – “Supercomputing and Modeling for the Human Brain (SMHB)”, associated with the European Human Brain Project (plenary by KH Meier)
• MoSGrid
  – Scientific gateway for molecular simulation.
• VAVID
  – Data gateway for analyzing wind energy infrastructures.

Page 16: JADE

[Figure with the labels Aachen and Jülich]

Page 17: Projects in HPC

HPC jobs on supercomputer

HPC jobs get access to dCache storage.

Page 18:

With the start of INDIGO-DataCloud, its money and a larger team (8+3), we can continue to explore new horizons. (Back to development mode.)

Page 19: Responding to new technologies

• New disk technologies
  – Open Ethernet Disks (HGST)
• New object-store back-ends
  – CEPH
• New European projects (INDIGO DC)
  – Focusing on data Quality of Service and
  – Data Lifecycle Management

Page 20: HGST Open Ethernet Disks

• Small ARM CPU with Ethernet piggybacked on a regular disk.
• Spec:
  – Any Linux (Debian on demo)
  – CPU: 32-bit ARM, 512 Level 2
  – 2 GB DRAM DDR-3 memory
    • 1792 MB available
  – Block storage driver as SCSI sda
  – Ethernet network driver as eth0

Page 21: HGST Open Ethernet Disks (cont.)

• The additional CPU is not used by the disk itself and can run an arbitrary customer OS.
• The disk is seen as a regular block device.
• Not yet on the market.
• dCache got 5 disks and we are evaluating running pool nodes on the disk itself.
• See talk on Thursday.

Page 22: Response to CEPH

• CEPH complements dCache perfectly.
  – Simplifies operating dCache disks.
  – dCache already accesses data as an object store anyway.
• dCache is evaluating a ‘two-step approach’.
  – Each pool sees its own object space in CEPH.
  – All pools have access to the entire space, which is a slight change of dCache pool semantics.
• Would merge CEPH and dCache advantages:
  – Multi-tier (tape, disk, SSD)
  – Multi-protocol support for a common namespace.
    • All protocols see the same namespace.
  – All the dCache AAI features
    • Support for X509, Kerberos, username/password
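The difference between the two steps of the approach can be sketched with a toy in-memory object store. Everything here (`ObjectStore`, `Pool`, the `shared` flag, the `"shared"` namespace label) is a hypothetical illustration of the pool semantics described on the slide; it is not the dCache or CEPH API.

```python
from typing import Optional


class ObjectStore:
    """Toy stand-in for an object store: objects are keyed by (namespace, key)."""

    def __init__(self) -> None:
        self._objects: dict[tuple[str, str], bytes] = {}

    def put(self, namespace: str, key: str, data: bytes) -> None:
        self._objects[(namespace, key)] = data

    def get(self, namespace: str, key: str) -> Optional[bytes]:
        return self._objects.get((namespace, key))


class Pool:
    """Hypothetical pool that stores its files as objects in the store."""

    def __init__(self, name: str, store: ObjectStore, shared: bool) -> None:
        self.store = store
        # Step 1: every pool uses a private object space named after itself.
        # Step 2: all pools use one common object space.
        self.namespace = "shared" if shared else name

    def write(self, file_id: str, data: bytes) -> None:
        self.store.put(self.namespace, file_id, data)

    def read(self, file_id: str) -> Optional[bytes]:
        return self.store.get(self.namespace, file_id)


def visible_from_other_pool(shared: bool) -> bool:
    """Write through pool A and report whether pool B can read the file."""
    store = ObjectStore()
    pool_a = Pool("pool-a", store, shared)
    pool_b = Pool("pool-b", store, shared)
    pool_a.write("file-1", b"payload")
    return pool_b.read("file-1") is not None
```

With private object spaces a file is only reachable through the pool that wrote it; with a shared space any pool can serve it, which is the change in pool semantics the slide refers to.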

Page 23: INDIGO-DataCloud Cheat-Sheet

• Horizon 2020 project starting April or May
• Budget: 11.1 million euros (800,000 for dCache)
• 26 partners
• Duration: 30 months
• The project aims for an Open Source data and computing platform targeted at scientific communities, deployable on multiple hardware, and provisioned over private and public e-infrastructures.

See Ludek’s presentation on Wednesday.

Page 24: INDIGO in a nutshell

1. Self-service, on-demand
2. Access through the network
3. Resource pooling
4. Elasticity (with infinite resources)
5. Pay as you go

In the end, applications rule.

Stolen from Davide Salomoni (Project Director)

Page 25: dCache involvement in INDIGO

• dCache is mostly involved in WP4, which is about Virtual Infrastructures (IaaS).
• For storage systems like dCache, this essentially means SDS (Software-Defined Storage), which according to Wikipedia is:
  – Software-defined storage (SDS) is an evolving concept for computer data storage software to manage policy-based provisioning and management of data storage independent of hardware.

Page 26: SDS according to dCache

• User/PaaS-defined “Quality of Service” management
  – User/PaaS-defined “Access Latency”
    • SSD or tape, depending on application requirements.
  – User/PaaS-defined “Data Protection”
    • On one disk, two disks or three tapes, depending on how precious your data is.
  – User/PaaS-defined “Data Migration Policies”
    • Like Amazon Glacier versus S3
• Automatic storage-tier migration
  – Based on access profile
• All this wouldn’t be needed if SSDs were cheap and 100% reliable.
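As a rough illustration of policy-based provisioning, the sketch below maps a requested access latency and data protection level to storage media, and picks a tier from an access count. All names, levels and the threshold are assumptions made for this example; the media choices only mirror the examples given on the slide (SSD or tape; one disk, two disks or three tapes), and none of this is a dCache interface.

```python
from dataclasses import dataclass
from enum import Enum


class AccessLatency(Enum):
    LOW = "low"    # application needs data quickly
    HIGH = "high"  # application can wait


class DataProtection(Enum):
    BASIC = "basic"
    ENHANCED = "enhanced"
    PRECIOUS = "precious"


@dataclass(frozen=True)
class StoragePlan:
    access_medium: str      # where the working copy lives
    protection_medium: str  # where the protected copies live
    copies: int             # how many protected copies are kept


# Hypothetical mappings, following the examples on the slide.
MEDIUM_FOR_LATENCY = {AccessLatency.LOW: "ssd", AccessLatency.HIGH: "tape"}
COPIES_FOR_PROTECTION = {
    DataProtection.BASIC: ("disk", 1),
    DataProtection.ENHANCED: ("disk", 2),
    DataProtection.PRECIOUS: ("tape", 3),
}


def plan_storage(latency: AccessLatency, protection: DataProtection) -> StoragePlan:
    """Translate a user/PaaS request into a concrete storage plan."""
    medium, copies = COPIES_FOR_PROTECTION[protection]
    return StoragePlan(MEDIUM_FOR_LATENCY[latency], medium, copies)


def target_tier(reads_last_month: int, hot_threshold: int = 10) -> str:
    """Pick a tier from a simple access profile (the threshold is made up)."""
    if reads_last_month >= hot_threshold:
        return "ssd"
    if reads_last_month > 0:
        return "disk"
    return "tape"
```

For example, `plan_storage(AccessLatency.LOW, DataProtection.PRECIOUS)` yields a working copy on SSD plus three protected copies on tape, and `target_tier(0)` sends data that is never read to tape.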

Page 27: dCache is well prepared

Historically, dCache supports multi-tier storage and the corresponding transitions.

[Diagram: a virtual file-system layer, accessed through NFS/pNFS, GridFTP, HTTP, WebDAV and xRootd/dCap, on top of three media tiers (SSDs; spinning disks; tape, Blu-ray…), with automatic and manual media transitions.]

Page 28: Recently added

We optimized the ‘small file’ problem with disk <-> tape transitions.

[Diagram labels: Tape System, Containers]
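The slide gives no detail on how the optimization works. Assuming it bundles many small files into larger containers before they are written to tape, the grouping step could look like the greedy sketch below. The function name, the capacity parameter and the file names are hypothetical.

```python
def pack_into_containers(
    file_sizes: dict[str, int], container_capacity: int
) -> list[list[str]]:
    """Group files, in the given order, into containers of limited size.

    A new container is started whenever the next file would not fit
    into the current one.
    """
    containers: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, size in file_sizes.items():
        if size > container_capacity:
            raise ValueError(f"{name} is larger than a container")
        if current and used + size > container_capacity:
            containers.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        containers.append(current)
    return containers


# Hypothetical sizes in MB, with containers of at most 100 MB.
groups = pack_into_containers({"a": 40, "b": 50, "c": 30}, container_capacity=100)
print(groups)
```

Writing a few large containers instead of many tiny files lets a tape drive stream rather than repeatedly stop and reposition, which is the usual motivation for this kind of bundling.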

Page 29: What’s missing

• Mainly a common agreement (standard) on how to trigger transitions. (Protocol, API??)
• We have some experience with SRM; however, it seems not to be suitable for this purpose.
• Another candidate is CDMI (SNIA), which is an industry standard.
• Migration policies are already discussed, documented and implemented within RDA (Practical Policy Working Group).
• Details will only be available after the INDIGO kickoff meeting at the end of April ’15.

Page 30: Summary

Magically, up to now, at the right moment, there was always an EU or national project funding dCache for exactly those features or activities dCache was planning to do anyway, and with that they helped us follow our master plan:

The support of the complete scientific big data life cycle management.

Page 31: Scientific Data Lifecycle

• High-speed data ingest
• Fast analysis: NFS 4.1/pNFS
• Wide area transfers (Globus Online, FTS) by GridFTP
• Visualization & sharing by WebDAV, OwnCloud

Page 32: Don’t forget

Upcoming dCache Workshop

Page 33: The END

Further reading: www.dCache.org
