Toward Enabling Reproducibility for Data- Intensive ...vcs/talks/ParCo2019-STODDEN.pdf · Extending...

Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform. Victoria Stodden, Associate Professor, School of Information Sciences, University of Illinois at Urbana-Champaign. ParCo Symposium, Reproducibility in Data-Intensive Computing, Prague, CZ, September 10, 2019.


Page 1: Toward Enabling Reproducibility for Data-Intensive Research using the Whole Tale Platform

Victoria Stodden
Associate Professor, School of Information Sciences
University of Illinois at Urbana-Champaign

ParCo Symposium
Reproducibility in Data-Intensive Computing
Prague, CZ
September 10, 2019

Page 2: Agenda

1. Infrastructure Contributions: The Whole Tale Project

2. Extending Whole Tale to Enable “Tales at Scale”

3. Infrastructure Challenges

Page 3: Parsing Reproducibility

● Empirical Reproducibility:
  ○ traditional empirical experiments, e.g. at the bench/lab

● Statistical Reproducibility:
  ○ statistical methodology used permits generalizability of data inferences

● Computational Reproducibility:
  ○ transparency of computational steps that produce scientific findings

V. Stodden. (2013). Resolving Irreproducibility in Empirical and Computational Research. IMS Bulletin.

Page 4: Whole Tale: Merging Science & Cyberinfrastructure Pathways

Page 5: Whole Tale Collaboration (PI Team)

● U Illinois (NCSA): Bertram Ludäscher, Victoria Stodden, Matt Turk
  ○ overall lead (co-operative agreement)
  ○ reproducibility; provenance; open source software development; outreach

● U Chicago (Globus): Kyle Chard
  ○ data transfer & storage; compute; infrastructure

● UC Santa Barbara (NCEAS): Matt Jones
  ○ (meta-)data publishing; provenance; repositories

● U Texas, Austin (TACC): Niall Gaffney
  ○ compute; HTC; “big tale”; Science Gateways

● U Notre Dame (CRC): Jarek Nabrzyski
  ○ UX design; UI design

Page 6: Simplifying Computational Reproducibility in Whole Tale

● Researchers can easily package and share tales:
  ○ Data, Code, and Compute Environment
    ■ including narrative and workflow information, including inputs, outputs, and intermediates
  ○ to re-create the computational results from a scientific study
  ○ achieving computational reproducibility
  ○ thus “setting the default to reproducible.”

● Also empowers users to verify and extend results with different data, methods, and environments.

V. Stodden, D. H. Bailey, J. Borwein, R. J. LeVeque, W. Rider, and W. Stein. Setting the Default to Reproducible: Reproducibility in Computational and Experimental Mathematics, ICERM Workshop 2013.

Page 7: Whole Tale: What’s in a name…

A Double Entendre:
  ○ Whole tale: captures the end-to-end scientific discovery story, including computational aspects
  ○ Long tail: includes all computational research, e.g. bespoke or small-scale research

Addresses Problems scientists face:
  ○ Reproducibility (and reuse) challenges in computational & data-enabled research (e.g. data + code access, dependency hell, …)

Whole Tale Approach:
  ○ directly respond to community needs and requirements

Page 8: The Need for a Platform for Reproducible Research

● Enable researchers to (easily) manage the complete conduct of a computational experiment and permit its exposure as a publishable “Tale”

● Address the two trends simultaneously:
  ○ improved transparency so researchers can run much more ambitious computational experiments.
  ○ and better computational experiment infrastructure will allow researchers to be more transparent.

D. Donoho and V. Stodden. (2015). Reproducible Research in the Mathematical Sciences. The Princeton Companion to Applied Mathematics, Ed. N. J. Higham.

Page 9: So what is Whole Tale?

● A web-based, open source platform for reproducible research for the creation, publication, and execution of tales: executable research objects that capture data, code, and details of the computing environment used to produce research findings

● Driven by Community Engagement:
  ○ Working groups, internships, collaborations, etc.

● Enhances Education & Training:
  ○ Training for reproducibility; use of Whole Tale in the classroom

Page 10: WT Software Development

● Open-Source Development Model
  ○ across 5 collaborative sites
● All source is open:
  ○ https://github.com/whole-tale/
● Developers are expected to follow the:
  ○ Developer's guide
● Open communication via:
  ○ weekly calls with public meeting notes
● Software releases follow a:
  ○ Development plan

[Figure labels: Development; Workshops & Working Groups]

Page 11: What exactly is (in) a Tale?

● Tale = executable research object, i.e.
  ○ data (references)
  ○ + code (computational methods)
  ○ + narrative (traditional science story)
  ○ + compute environment (e.g. RStudio, Jupyter)

● Captured in a standards-based tale format complete with metadata

[Diagram labels: Code/Narrative; Compute environment; Data]
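The four components listed on this slide can be pictured as a small record type. The sketch below is only an illustration under stated assumptions: the class name `Tale`, its field names, the JSON layout, and the placeholder identifier are all hypothetical and are not the actual standards-based tale format used by Whole Tale.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Tale:
    """Hypothetical in-memory model of a tale; NOT the real Whole Tale schema."""
    title: str
    data_refs: List[str] = field(default_factory=list)   # references to data, not copies
    code_files: List[str] = field(default_factory=list)  # computational methods
    narrative: str = ""                                   # the traditional science story
    environment: str = "Jupyter"                          # e.g. RStudio or Jupyter

    def to_manifest(self) -> str:
        """Serialize all fields as JSON with sorted keys."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


def from_manifest(text: str) -> Tale:
    """Rebuild a Tale from the JSON produced by to_manifest."""
    return Tale(**json.loads(text))


example = Tale(
    title="Example analysis",
    data_refs=["doi:10.0000/placeholder"],  # placeholder identifier, not a real DOI
    code_files=["analysis.ipynb"],
    narrative="README.md",
    environment="Jupyter",
)
```

Keeping data as references rather than embedded copies mirrors the "data (references)" bullet: the manifest stays small even when the referenced datasets are large.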

Page 12: Browse Existing Tales…

Page 13: …Compose New Tales…

Page 14: …Run & Interact with Tales

Page 15: …Use Tale Metadata

Page 16: …Integrate Data Repos with Whole Tale!

● Enables turnkey exploratory data analysis on existing published datasets

● DataONE and Dataverse networks cover >90 major research repositories!

Page 17: Accelerating Reproducible Open Science

[Diagram labels: Input Data; Research Question, Analysis, Output; Data, Narrative, Published Article; Verify / Reproduce / Re-use; Accelerate]

Page 18: Whole Tale Platform Overview

[Diagram labels: Research & Quantitative Computational Environments; External Data Sources; Code + Narrative; Create tale; Analyze data; Publish Tale]

● Authenticate using your institutional identity
● Access commonly-used computational environments
● Easily customize your environment
● Reference and access externally registered data

● Create or upload your data and code
● Add metadata (including provenance information)
● Submit code, data, and environment to archival repository
● Get a persistent identifier
● Share for verification and re-use

Page 19: Whose problems are we addressing?

● Researchers, scientists, others may be
  ○ creators of tales, e.g. share your findings in a tale
  ○ reviewers of articles can review tales, e.g. reproduce new scientific claims
  ○ (re-)users of tales, e.g. build upon progress of others

Page 20: Extending WT to Data-Intensive Research

● Motivating scenario: The Renaissance Simulations Laboratory (RSL) provides access to over 70 TB of raw data and derived data products. RSL exposes data available on systems at the San Diego Supercomputer Center via Jupyter web-based interactive environments.

● Relevant features:
  1. the RS data is large, impractical to transfer, and requires large-scale resources to analyze.
  2. the research community leverages Jupyter interactive environments for both exploratory and primary analytical work, with some analysis requiring batch compute resources.
  3. the community is interested in sharing resulting research artifacts (e.g., code, derived data) for both re-execution and re-use.

Page 21: Extending WT to Data-Intensive Research

Tale frontend and HPC workloads on WT deployment cluster: users can launch local HPC jobs using standard system calls.

Tale frontend on single HPC compute node: the Tale frontend (Jupyter/RStudio notebooks) runs on a compute node in an HPC cluster and launches independent HPC jobs using standard system calls.
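Launching a job "using standard system calls" from a notebook-style frontend can be sketched with Python's standard `subprocess` module. This is a minimal illustration, assuming the frontend is a Python process; the helper name `launch_job` and the stand-in workload are hypothetical, and the slides do not say how Whole Tale itself does this.

```python
import subprocess
import sys
from typing import List


def launch_job(command: List[str], timeout_s: float = 60.0) -> str:
    """Run a command as a child process and return its standard output.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode bytes to str
        check=True,           # turn a non-zero exit status into an exception
        timeout=timeout_s,
    )
    return completed.stdout


# Stand-in workload: a real Tale would invoke a simulation or analysis binary here.
output = launch_job([sys.executable, "-c", "print(6 * 7)"])
```

Because the call blocks until the child exits, long-running work started this way ties up the frontend process, which is one reason the later configurations hand jobs to a queuing system instead.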

Page 22: Extending WT to Data-Intensive Research

Tale frontend on HPC compute node with local LRM (cluster queuing system) access: allows submission of HPC jobs to the queuing system of the cluster.

Tale frontend on HPC compute nodes with MPI: launch the Tale frontend as an MPI job. The cluster LRM (queuing system) allocates the number of nodes requested at the submission of the Tale frontend job and sets the appropriate MPI environment. The Tale frontend would run on the lead node allocated to the MPI job by the LRM and would launch MPI subjobs on the nodes allocated to the MPI job.
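To make the LRM-based configurations concrete, the sketch below renders a batch script that requests a number of nodes and starts a parallel command. It assumes Slurm as the LRM, which the slides do not specify; the function name `render_batch_script`, the job name, and the `./analysis` executable are hypothetical.

```python
def render_batch_script(job_name: str, nodes: int, tasks_per_node: int,
                        walltime: str, command: str) -> str:
    """Return the text of a Slurm batch script that runs `command` under srun."""
    if nodes < 1 or tasks_per_node < 1:
        raise ValueError("nodes and tasks_per_node must be positive")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",                      # nodes the LRM should allocate
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={walltime}",                    # wall-clock limit, HH:MM:SS
        "",
        "# srun launches the tasks across the allocated nodes",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


script = render_batch_script(
    job_name="tale-subjob",  # hypothetical job name
    nodes=4,
    tasks_per_node=8,
    walltime="01:00:00",
    command="./analysis",    # hypothetical MPI executable
)
```

On a Slurm cluster, a script like this would be written to a file and handed to `sbatch`; other queuing systems use different directives, so the rendering step is the part that would change.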

Page 23: Extending WT to Data-Intensive Research

Tale frontend on WT cluster with remote LRM access: Tale frontends run alongside WT services, but HPC jobs can be submitted to remote clusters via the middleware.

Decoupled Tale frontend with LRM remote access: Tale frontends run on various resources and HPC jobs can run on any resources supported by the middleware. Users could bypass the limitations present in the default resources provided by the WT infrastructure, e.g. a user with cloud access could request that a Tale be run on cloud resources under the user’s account.
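The decoupling described here, where the frontend names a resource and the middleware decides where the job goes, can be sketched as a small routing layer. Everything in this example is hypothetical: the slides do not name the middleware or its interface, and `JobBackend`, `RecordingBackend`, `Middleware`, and the resource names are invented purely for illustration.

```python
from typing import Dict, List, Protocol


class JobBackend(Protocol):
    """Anything that can accept a job and hand back an identifier."""
    def submit(self, command: List[str]) -> str: ...


class RecordingBackend:
    """Toy backend that only records submissions; stands in for a real cluster or cloud."""
    def __init__(self, name: str) -> None:
        self.name = name
        self.submitted: List[List[str]] = []

    def submit(self, command: List[str]) -> str:
        self.submitted.append(command)
        return f"{self.name}-{len(self.submitted)}"


class Middleware:
    """Routes a job to whichever registered resource the caller names."""
    def __init__(self) -> None:
        self._backends: Dict[str, JobBackend] = {}

    def register(self, resource: str, backend: JobBackend) -> None:
        self._backends[resource] = backend

    def submit(self, resource: str, command: List[str]) -> str:
        if resource not in self._backends:
            raise KeyError(f"no backend registered for {resource!r}")
        return self._backends[resource].submit(command)


middleware = Middleware()
middleware.register("wt-default", RecordingBackend("wt"))
middleware.register("user-cloud", RecordingBackend("cloud"))  # e.g. a user's own cloud account
job_id = middleware.submit("user-cloud", ["./analysis", "--input", "run42"])
```

The frontend code only ever calls `middleware.submit`, so adding a new resource means registering another backend rather than changing the Tale.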

Page 24: Challenges to Extending WT

● The need to maintain responsiveness of Tale frontends

● Dependence on Middleware: Scalability and Longevity

● Managing HPC network restrictions
  ○ Tale frontends require incoming network connections in order to expose their user interface. Consequently, a general solution involving Tale frontends on compute nodes requires some form of proxying of connections from the Whole Tale cluster to HPC cluster compute nodes. Restrictions on incoming network connections are likely a result of local security policies, and therefore proxying, even if authenticated, may be seen as an unwelcome circumvention of such policies.

● Containerization and HPC workloads, e.g. a dependence on specific hardware, which can affect the ability for the code to be re-run if the specific hardware becomes unavailable

● Data access and quasi-locality: if Tale frontends and/or HPC workloads run on HPC resources on which copies of data are already available, the WT implementation would be inefficient, since each file would be transferred once to WT resources and once for each Tale frontend instance that accesses the file
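The data-locality cost in the last bullet is simple arithmetic: a file read by k frontend instances is moved 1 + k times. The sketch below works through that count; the function names and the example figures (a 2 TB file, 3 frontends) are illustrative assumptions, not measurements from the talk.

```python
def transfers_per_file(frontend_instances: int) -> int:
    """Copies made of one file: one to WT resources plus one per frontend that reads it."""
    if frontend_instances < 0:
        raise ValueError("frontend_instances must be non-negative")
    return 1 + frontend_instances


def bytes_moved(file_size_bytes: int, frontend_instances: int) -> int:
    """Total bytes transferred for one file under the copy-based scheme."""
    return file_size_bytes * transfers_per_file(frontend_instances)


TB = 10**12  # decimal terabyte

# Illustrative numbers only: a 2 TB file read by 3 frontend instances.
moved = bytes_moved(2 * TB, 3)
```

With these numbers the scheme moves 8 TB to serve 2 TB of data, whereas a frontend running next to an existing copy would need no transfer at all.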


Page 25: Conclusion

Whole Tale offers potential for enabling reproducibility for Data-Intensive computing, but is not without challenges requiring innovation in the software architecture and infrastructure implementation.

However, reproducibility is now recognized as a pressing issue of which computational infrastructure is one key part.

Infrastructure supporting transparency and reproducibility will be used not out of hygiene or as a best practice, but because it enables increasingly ambitious computational research.