Pegasus WMS - Information Processing Society of Japan, sighpc.ipsj.or.jp/HPCAsia2018/poster/post102s2-file2.pdf
• Pegasus is a system for mapping and executing abstract application workflows over a range of execution environments.
• The same abstract workflow can, at different times, be mapped to different execution environments such as XSEDE, OSG, commercial and academic clouds, campus grids, and clusters.
• Pegasus can easily scale both the size of the workflow and the resources that the workflow is distributed over. Pegasus runs workflows ranging from just a few computational tasks up to 1 million.
• Workflows often consume tens of thousands of hours of computation and involve the transfer of many terabytes of data.
• Workflows have a DAG model.
• A node in the DAG is started only when all of its parent nodes have successfully finished.
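The DAG rule above (a node starts only once every parent has finished successfully) can be illustrated with a minimal sketch. This is not Pegasus code: `run_dag`, `parents`, and `run_task` are hypothetical names chosen for the example, and the scheduling shown is a plain queue-based topological traversal.

```python
from collections import deque


def run_dag(parents, run_task):
    """Run tasks so that each one starts only after all its parents succeeded.

    parents:  dict mapping every task name to the list of its parent tasks
              (assumption: every task, including roots, appears as a key).
    run_task: callable taking a task name and returning True on success.
    Returns the list of tasks that finished successfully, in execution order.
    """
    children = {task: [] for task in parents}
    pending = {task: len(ps) for task, ps in parents.items()}
    for task, ps in parents.items():
        for parent in ps:
            children[parent].append(task)

    # Tasks with no parents are ready immediately.
    ready = deque(task for task in parents if pending[task] == 0)
    finished = []
    while ready:
        task = ready.popleft()
        if not run_task(task):
            continue  # a failed task never releases its children
        finished.append(task)
        for child in children[task]:
            pending[child] -= 1
            if pending[child] == 0:  # last parent just finished
                ready.append(child)
    return finished


# Hypothetical diamond-shaped workflow: a -> (b, c) -> d
diamond = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(run_dag(diamond, lambda task: True))       # ['a', 'b', 'c', 'd']
print(run_dag(diamond, lambda task: task != "c"))  # ['a', 'b']
```

In the second call task `c` fails, so `d` is never started even though its other parent `b` succeeded.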
Event-Based Triggering and Management of Scientific Workflow Ensembles
Suraj Pandey1, Karan Vahi2, Rafael Ferreira da Silva2, Ewa Deelman2, Ming Jiang3, Albert Chu3, Cyrus Harrison3, Henri Casanova1
1University of Hawaii, 2USC Information Sciences Institute, 3Lawrence Livermore National Laboratory
https://github.com/pegasus-isi/pegasus-llnl
Project is supported by the Department of Energy under contract DE-AC52-07NA27344 and by LLNL under project 16-ERD-036.
Pegasus is currently funded by the National Science Foundation (NSF) under the OAC SI2-SSI grant 1664162.
Pegasus WMS System Architecture
Experimental Setup at LLNL Catalyst Cluster
[Figure: Pegasus WMS system architecture. Diagram labels: Users; APIs; Interfaces; Pegasus Dashboard; Other workflow composition tools; Submit Host; Pegasus WMS (Mapper, Engine, Scheduler, Monitoring & Provenance); Workflow DB; Logs; Notifications; Job Queue (j1, j2, …, jn); Clouds; Cloudware (OpenStack, Eucalyptus, Nimbus); Compute (Amazon EC2, Google Cloud, RackSpace, Chameleon); Storage (Amazon S3, Google Cloud Storage, OpenStack Storage); Distributed Resources (Campus Clusters, Local Clusters, Open Science Grid, XSEDE); Middleware (HTCondor, GRAM, PBS, LSF, SGE); GridFTP, HTTP, FTP, SRM, IRODS, SCP.]
Problems mapping certain computations to DAG workflows
• With the push towards extreme-scale computing, it is possible to run traditional HPC simulation codes simultaneously on tens of thousands of cores.
• Generated data often needs to be periodically analyzed using Big Data analytics frameworks.
• Integrating Big Data analytics with HPC simulations is a major challenge for the current generation of scientific workflow management systems.
• Need an ability to automatically spawn and manage the analysis workflows as the long-running simulation workflow executes.
Ensemble Manager
Solution
• Pegasus has an Ensemble Manager service that allows users to submit a collection of workflows called an ensemble.
• We extended the Ensemble Manager to support event triggers that can trigger the addition of new workflows to an existing ensemble.
• We support the following types of triggers:
a) File-based event triggers: a file gets modified
b) Directory-based event triggers: files appear in a directory
• Initially, the ensemble has a single workflow consisting of the long-running HPC simulation workflow.
• The HPC simulation workflow periodically generates output data in a directory that is tracked by the Ensemble Manager.
• A new analysis workflow is launched automatically as the output data is detected.
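The two trigger types listed above can be sketched as small polling classes. This is an illustrative sketch only and does not reproduce the Ensemble Manager's actual implementation or API: `DirectoryTrigger`, `FileTrigger`, `poll`, and the callback parameters are hypothetical names, and polling is an assumed detection mechanism.

```python
from pathlib import Path


class DirectoryTrigger:
    """Fires a callback when files that were not seen before appear in a directory."""

    def __init__(self, directory, on_new_files):
        self.directory = Path(directory)
        self.on_new_files = on_new_files  # e.g. a function that adds a workflow
        self.seen = set()

    def poll(self):
        current = {p.name for p in self.directory.iterdir() if p.is_file()}
        new = sorted(current - self.seen)
        if new:
            self.seen.update(new)
            self.on_new_files(new)
        return new


class FileTrigger:
    """Fires a callback when a tracked file's modification time or size changes."""

    def __init__(self, path, on_change):
        self.path = Path(path)
        self.on_change = on_change
        self.last = self._stamp()

    def _stamp(self):
        if not self.path.exists():
            return None
        st = self.path.stat()
        return (st.st_mtime_ns, st.st_size)

    def poll(self):
        stamp = self._stamp()
        changed = stamp is not None and stamp != self.last
        self.last = stamp
        if changed:
            self.on_change(self.path)
        return changed
```

A caller would invoke `poll()` at a fixed interval; in the setting described above, the callback would be the place where a new analysis workflow is generated and added to the ensemble.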
• Reliably and repeatedly test that the implemented solution works.
• Tested the implementation on the LLNL Catalyst Cluster (150 teraFLOP/s system with 324 nodes, each with 128 GB of DRAM and 800 GB of non-volatile memory).
Experimental Setup
1. On Catalyst, a Magpie SLURM job is submitted that:
- Determines which nodes will be “master” nodes, “slave” nodes, or other types of nodes.
- Sets up, configures, and starts the appropriate Big Data daemons to run on the allocated nodes. In our setup, we used the Magpie Spark template to set up a dynamic Spark cluster.
- Reasonably optimizes the configuration for the given cluster hardware that it is being run on.
Magpie then executes a user-specified script to give control back to the user.
2. The user script sets up Pegasus WMS and starts the Ensemble Manager.
3. The Ensemble Manager submits the HPC simulation workflow consisting of the LULESH application.
- Every 10 simulation cycles, LULESH writes outputs to a directory on the shared filesystem.
- This directory is tracked by the Ensemble Manager as part of the event trigger specified.
- The Ensemble Manager invokes a script that generates the Big Data analytics workflow on the newly generated datasets.
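The producer side of step 3 (output written every 10 simulation cycles into a tracked directory) can be mimicked with a small stand-in. This sketch does not run LULESH: `run_simulation`, the `cycle_NNNNNN.dat` file naming, and the placeholder file contents are assumptions made for illustration; only the 10-cycle interval comes from the setup described above.

```python
from pathlib import Path


def run_simulation(total_cycles, out_dir, interval=10):
    """Stand-in for the simulation: write one output file every `interval` cycles.

    Returns the list of files written, in the order they were produced.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for cycle in range(1, total_cycles + 1):
        # A real simulation would advance its physical state here.
        if cycle % interval == 0:
            path = out_dir / f"cycle_{cycle:06d}.dat"
            path.write_text(f"placeholder output for cycle {cycle}\n")
            written.append(path)
    return written
```

Pointing a directory-based trigger at `out_dir` would then yield one analysis workflow per batch of newly written files, which is the coupling the experimental setup exercises.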
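[Figure: experimental setup flowchart. Panels: HPC Simulation Workflow; Ensemble Manager; Big Data Analytics Workflow. Magpie Setup: Submit Slurm Job, Run Magpie Daemons, Run Magpie User Script. Pegasus WMS Setup: Launch Pegasus & Condor, Launch Ensemble Manager. Ensemble Manager steps: Read Event Configuration, Trigger Event?, Generate and submit analytics workflow. HPC simulation steps: LULESH MPI Job, Generate Trigger File. Analytics steps: Copy Data into HDFS, Run Spark Analytics.]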
Workflow Ensemble Execution Timeline