Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research – FINAL REPORT
Lucie C. Burgess, David Crotty, David de Roure, Jeremy Gibbons, Carole Goble, Paolo Missier, Richard Mortier, Thomas E. Nichols, Richard O’Beirne
Dickson Poon China Centre, St. Hugh’s College, University of Oxford, 6th and 7th April 2016
21 July 2016

ALAN TURING INSTITUTE SYMPOSIUM ON REPRODUCIBILITY FOR DATA-INTENSIVE RESEARCH – FINAL REPORT

AUTHORS

Lucie C. Burgess¹, David Crotty², David de Roure¹, Jeremy Gibbons¹, Carole Goble³, Paolo Missier⁴, Richard Mortier⁵, Thomas E. Nichols⁶, Richard O’Beirne²

1. University of Oxford
2. Oxford University Press
3. University of Manchester
4. Newcastle University
5. University of Cambridge
6. University of Warwick

TABLE OF CONTENTS

1. Symposium overview
1.1 Objectives
1.2 Citing this report
1.3 Organising committee
1.4 Format
1.5 Delegates
1.6 Funding
2. Executive summary
2.1 What is reproducibility and why is it important?
2.2 Summary of recommendations
3. Data provenance to support reproducibility
4. Computational models and simulations
5. Reproducibility for real-time big data
6. Publication of data-intensive research
7. Novel architectures and infrastructure to support reproducibility
8. References
Annex A – Symposium programme and biographies for the organising committee and speakers
Annex B – Delegate list

ACKNOWLEDGEMENTS

The Organising Committee would like to thank the speakers, session chairs and delegates to the symposium, and in particular all those who wrote notes during the sessions and provided feedback on the report.

This report is available online at:
https://osf.io/bcef5/ (report and supplementary materials)
https://dx.doi.org/10.6084/m9.figshare.3487382 (report only)


1. Symposium Overview

1.1 Objectives

The Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research was held on 6th-7th April 2016 at the University of Oxford. It was organised by senior academics, publishers and library professionals representing the Alan Turing Institute (ATI) joint venture partners (the universities of Cambridge, Edinburgh, Oxford, UCL and Warwick), the University of Manchester, Newcastle University and the British Library. The key aim of the symposium was to address the challenges around reproducibility of data-intensive research in science, social science and the humanities. This report presents an overview of the discussions and makes some recommendations for the ATI to take forwards.

As the UK’s leading data science institute, the ATI has a key role in supporting and promoting the reproducibility of data-intensive research, from three perspectives. Firstly, reproducibility is a data science research area in its own right, requiring the development of novel analytical techniques, computational methods, technical architectures and other foundational research. Secondly, reproducibility is a technical-socio-cultural issue, requiring the implementation of new workflows, research working practices and policies in an increasingly open and transparent environment, enabled through supporting infrastructure. One area of focus of the workshop was the ATI’s own outputs, which will include algorithms, computations, data and code, and the techniques that the ATI should use to ensure these are reproducible, citable and re-usable. Thirdly, the ATI has an important role to play as an advocate for reproducibility. A major goal of the symposium was to encourage researchers to coalesce around the topic, to exchange their expertise, and to maximise participation from the developing ATI community. The symposium was intended to help envision and articulate these objectives at a time when the ATI is in the early stages of its formation.

This final report is a key outcome of the symposium. It is intended to inform the data science research programme; to inform researcher practices, for example in the practical application of reproducibility tools and techniques, structures for professional development, credit and reward; to enable the development of key policies, for example in data sharing, re-use and intellectual property; and to inform the development of the ATI’s data and compute infrastructure. Although primarily intended for the ATI and its partners, we hope this report will be useful to the wider UK and international data science community and a broad spectrum of stakeholders in government, policy and industry.

1.2 Citing this report

This report, the symposium programme, speaker biographies and delegate list, slides and video recordings from the presentations and talks are publicly available online at the Open Science Framework https://osf.io/bcef5/ and through Figshare.

1.3 Organising Committee

The symposium organisers were (in alphabetical order, primary institutional affiliation shown):

Lucie Burgess, Associate Director for Digital Libraries, University of Oxford, http://orcid.org/0000-0001-6601-7196
Dr David Crotty, Editorial Director, Journals Policy, Oxford University Press, http://orcid.org/0000-0002-8610-6740
Prof David de Roure, Professor of e-Research, Director of the Oxford e-Research Centre, University of Oxford, http://orcid.org/0000-0001-9074-3016
Dr Adam Farquhar, Head of Digital Scholarship, British Library, http://orcid.org/0000-0001-5331-6592
Prof Jeremy Gibbons, Professor of Computing, University of Oxford, http://orcid.org/0000-0002-8426-9917
Prof Carole Goble CBE, Professor of Computer Science, University of Manchester, http://orcid.org/0000-0003-1219-2137
Dr Paolo Missier, Reader, School of Computing Science, Newcastle University, http://orcid.org/0000-0002-0978-2446
Dr Richard Mortier, Lecturer, Computer Laboratory, University of Cambridge, https://orcid.org/0000-0001-5205-5992


Prof Thomas E. Nichols, Professor, Department of Statistics and Warwick Manufacturing Group, University of Warwick, http://orcid.org/0000-0002-4516-5103
Richard O’Beirne, Digital and Journals Strategy Manager, Oxford University Press, http://orcid.org/0000-0001-7398-1653

1.4 Format

The symposium format was an interactive workshop, hosted by the University of Oxford and held at the Dickson Poon China Centre, St. Hugh’s College. The full programme is attached at Annex A. The symposium opened with a keynote presentation by Professor Carole Goble CBE, University of Manchester, who gave a definition of reproducibility in the context of computational data analytics. Professor Jared Tanner, University of Oxford, gave an introduction to the mission and objectives of the ATI. Members of the Organising Committee chaired five workshop sessions on the following topics, in which the research challenges, downstream impacts and ATI priorities were discussed:

• Session 1: Data provenance to support reproducibility
• Session 2: Computational models and simulations
• Session 3: Reproducibility for real-time big data
• Session 4: Publication of data-intensive research
• Session 5: Novel architectures and infrastructures to support reproducibility.

Each workshop session opened with talks from expert speakers followed by three parallel breakout groups, each with a separate discussion topic and expert chair. Data science is inherently inter-disciplinary, and the participation of non-domain-experts led by an expert in a specific domain brought a wide range of views to the discussion. Delegates contributed 16 lightning talks on a broad range of topics related to reproducibility, such as ‘Provenance in neuroimaging with NIDM-results’; ‘Reproducible model development with the Cardiac Electrophysiology Web Lab’; ‘Data sharing stories from Scientific Data (Nature Publishing Group)’; and ‘The Skye project: bridging theory and practice for scientific data curation’. All symposium talks (slides and video recordings) are available at the links in section 1.2.

1.5 Delegates

The symposium convened an invited inter-disciplinary group of researchers who employ data-intensive computational methods in their research, from many areas of computer science including provenance, programme verification, statistics, mathematics, psychology, bioinformatics, behavioural social science, web science, climatology, musicology, history, and linguistics. Stakeholders from key institutions such as the Digital Curation Centre and the Software Sustainability Institute, and from publishers and data repositories such as Elsevier, GigaScience and F1000Research also attended. Although most delegates were recognised international experts in their fields, we also welcomed a small cohort of early career researchers from the Turing Institute joint venture partners. The diversity in requirements and perspectives amongst these stakeholders was intended to crystallise a broad range of research challenges and to maximise downstream impact across sectors. A delegate list is provided at Annex B.

1.6 Funding

We are pleased to acknowledge funding from the Alan Turing Institute, without which the event would not have been possible, and generous sponsorship by Oxford University Press.


2. Executive Summary

2.1 What is reproducibility and why is it important?

Professor Goble, in her keynote to the symposium on the “R* Brouhaha”, offered a definition of reproducibility and its importance to the ATI. Reproducibility can be defined as the conclusion of one study confirmed independently in another [1], and has been described as the cornerstone of a cumulative science [2]. However, it is important to note that the language and conceptual framework of reproducibility are not standardised across the disciplines [3].

Numerous papers have been published on the topic of reproducibility in recent years. New tools and technologies, increased computing power, massive amounts of data, the open availability of large public databases, interdisciplinary approaches, and the complexity of the questions being asked are complicating replication efforts. “As full replication of studies on independently collected data is often not feasible, there has recently been a call for reproducible research as an attainable minimum standard for assessing the value of scientific claims. This requires that papers in experimental science describe the results and provide a sufficiently clear protocol to allow successful repetition and extension of analyses based on original data” [2], [4]. It has been suggested that the scientific community needs to develop a culture of reproducibility for computational science [4].

The challenge of reproducibility in data-intensive research is neatly characterised in a recent report from the January 2016 Dagstuhl seminar on the topic of Reproducibility of data-oriented experiments in e-science [5]: “In many subfields of computer science, experiments play an important role. Besides theoretic properties of algorithms or methods, their effectiveness and performance often can only be validated via experimentation. In most of these cases, the experimental results depend on the input data, settings for input parameters, and potentially on characteristics of the computational environment where the experiments were designed and run. Unfortunately, most computational experiments are specified only informally in papers, where experimental results are briefly described in figure captions; the code that produced the results is seldom available.”

Reproducibility is fundamental to public trust in science, scientific discourse and transparency in the use of public funds. As well as being a moral obligation, for researchers, “good habits of reproducibility may actually turn out to be a time-saver in the longer run” [2]. An open question is the extent to which reproducibility may be important for data scientists outside the domain of scientific research, for example those computing analyses for actionable business intelligence or data journalism (see, for example [6]). As the work of the ATI will span the full realm of data science, we suggest that the broader context should be kept in mind.

The symposium did not attempt to re-create every aspect of the reproducibility debate but instead focused on five key areas, which we believe are important for reproducibility in interdisciplinary data-intensive research: Data provenance to support reproducibility; Computational models and simulations; Reproducibility for real-time big data; Publication of data-intensive research; Novel architectures and infrastructures to support reproducibility. The conclusions and recommendations of our discussions are summarised below.

2.2 Summary of recommendations for the Alan Turing Institute

As the UK’s leading data science institute, the ATI has a powerful role to play in supporting reproducibility. There are three key ways in which this can be achieved. Firstly, through funding and conducting world-leading research into technical (computational and infrastructural), social and cultural aspects of reproducibility, an area of data science research in its own right; secondly, through implementing practical mechanisms that support reproducibility during the pursuit of interdisciplinary data science research, which are likewise technical, social and cultural; and thirdly, through acting as an advocate, exemplar and champion of reproducible research, engaging with the community and developing partnerships with existing institutions in this area.


2.2.1 Key research questions related to reproducibility in data-intensive research

The symposium identified a number of key areas for further research, which participants believe are within the remit of the ATI and are described further in the corresponding sections of this report:

• Assessment of existing system-level, low-overhead, automated and scalable provenance capture systems and the evaluation of their role in supporting reproducibility (see, for example [7]);
• Further development of the PROV W3C standard data model and ontology in different domains (see, for example D-PROV [8]);
• Development of domain-specific languages (DSLs) in data-intensive domains (see, for example, Project Skye: a programming language bridging theory and practice for scientific data curation [9]), codifying the interfaces between generic data science implementations/algorithms and domain-specific knowledge; and in particular development of DSLs for massive-scale simulations e.g. climate models, urban traffic flow models;
• Development of automated methods for capturing the verification of statistical inference over large real-time datasets;
• Integrating machine learning approaches with statistical methods in order to improve calibration or reduce bias;
• Establishment of research methodologies that can be applied to large-scale, real-time, automated systems;
• Postulating and evaluating mechanisms of transitive credit for re-use of data and software beyond current citation mechanisms;
• Development and deployment of novel technical architectures to support reproducibility, such as Unikernels¹ (see, for example [10]);
• Further research into containers and wrappers (such as Research Objects² and Docker³ [11]) for encapsulating data and methods, and evaluation of barriers and enablers to their use.

2.2.2 Practical mechanisms to support reproducibility at the ATI

Software engineering best practices. Recent years have seen dramatic advances in computing infrastructure that support reproducibility. The ATI has the opportunity to establish a world-leading infrastructure for reproducible research, embedding end-to-end reproducibility in all the research it conducts in support of its data science objectives. Central to this vision is the adoption of software engineering best practices using modern tools and workflows, including:

• Systematic code management and sharing (e.g., Git and GitHub),
• Continuous integration and testing (e.g., Travis CI⁴),
• Virtualisation and portability by means of appropriate development and deployment environments (e.g., Docker, Unikernels).

Reproducibility can be further supported through integration of preservation and citation services within the data processing infrastructure, e.g. by exposing programmatic APIs into repositories for research outputs, data and code.
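As an illustration of the continuous integration and testing practice listed above, the following is a minimal sketch of a check that a CI job could run on every commit. It assumes a toy analysis driven by a seeded random generator; the function names (run_analysis, fingerprint, check_reproducible) are hypothetical and do not come from the symposium or from any particular tool.

```python
import hashlib
import json
import random
import statistics


def run_analysis(seed: int, n: int = 1000) -> dict:
    """Toy stand-in for an analysis: summary statistics of seeded random draws."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return {
        "seed": seed,
        "n": n,
        "mean": round(statistics.fmean(sample), 12),
        "stdev": round(statistics.stdev(sample), 12),
    }


def fingerprint(result: dict) -> str:
    """Stable SHA-256 digest of a result, computed over canonical JSON."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_reproducible(seed: int) -> bool:
    """Run the analysis twice with the same seed and compare fingerprints."""
    return fingerprint(run_analysis(seed)) == fingerprint(run_analysis(seed))


if __name__ == "__main__":
    print("reproducible:", check_reproducible(42))
```

In practice the stored fingerprint of a published result could be compared against a fresh run, so that an unintended change in code or environment shows up as a failing test rather than as a silent change in the results.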

Exemplary data governance. As an international leader in data science, the ATI has an opportunity to act as an exemplar, and in that context it will be critical for the ATI to embed social, cultural, and technical mechanisms to support reproducibility. Regarding data governance, these include the establishment and implementation of:

• Policies for data and code sharing, dissemination and openness;
• Policies for transparent intellectual property management, particularly concerning the ownership and rights to benefit from data and code, with the default being openly-licensed;
• Mechanisms of credit and reward for both direct and transitive contributions, such as artifact evaluation.

¹ http://unikernel.org
² http://www.researchobject.org
³ https://www.docker.com
⁴ https://travis-ci.org


In particular we recommend that the ATI should ensure that research data and software are viewed as important scholarly outputs, and the ATI should reflect this ethos in its grant-giving guidelines, funding decisions, researcher selection and promotion decisions. We suggest that the ATI should develop a ‘five year plan for reproducibility’, setting out the steps it intends to take, publishing measurable targets, and reviewing its progress.

The ATI has a responsibility to build trust in its research, and taking active steps to conduct reproducible research is one of the ways through which this aim can be achieved. An understanding of the provenance of data, a consideration of bias, the limits and impact of consent, ethics and a discussion of limitations around data sharing are necessary to ensure trust. We are confident that, as a leader in data science funded by both public and private funds, the ATI will embrace exemplary data governance to ensure that its research both meets the terms and conditions specified by data providers and is for the public good.

Professional development and outreach. The ATI has a unique opportunity to mediate between computer scientists and domain experts across a wide range of domains in developing and applying reproducible methods and tools, due to the wide range of use-cases and researchers working across many different disciplines underpinned or enabled by data science. We recommend that the ATI incentivises community participation through inviting and publicising use-cases which demonstrate the utility and benefits of reproducible research. Recognition of the human-in-the-loop aspects of reproducibility is key. There is an opportunity for the ATI to foster collaboration amongst researchers using established methods with those using new methods, for example those working with data from real-time systems. There may be a need for training of data scientists in social science methodologies, and vice-versa, to improve the quality of conclusions from data analytics into real-time datasets.

In particular there is an urgent need to invest in researcher skills in support of reproducibility, providing training in data curation, citation and software sustainability. Although this may not be the remit of the ATI specifically, the ATI has an opportunity to work with organisations such as the Digital Curation Centre⁵, the Digital Preservation Coalition⁶, the Software Sustainability Institute⁷ and the Research Data Alliance⁸, who support skills development through professional training and communities of interest.

2.2.3 Advocacy, engagement and partnerships

Furthermore, the ATI can play an important role as an advocate for reproducibility in research institutions and universities, in government and industry, for example working with government to embed data management and software sustainability into the UK’s science and innovation strategy. We suggest that the ATI should convene a group of eminent academics, practitioners, publishers and other stakeholders to develop a policy on enhancing the quality, transparency, accountability and communication of research, embracing reproducible methods and tools.

There was considerable interest from data scientists attending the symposium in developing partnerships with the ATI. There are strong data science communities across the UK and internationally in addition to the universities which are ATI joint venture partners. For example, from the UK, the universities of Bristol, King’s College London, Manchester, Nottingham, Newcastle and Southampton were represented at the symposium. The data science community also extends to organisations such as the Digital Curation Centre and the Software Sustainability Institute and others, to advocates of open data in government such as the National Archives and to the publishing community, which is adapting its business models to support reproducible methods. We suggest that the ATI should build diverse networks across these communities in order to fully leverage its investment.

⁵ http://www.dcc.ac.uk - Digital Curation Centre website [accessed 28-May-2016]
⁶ http://www.dpconline.org - Digital Preservation Coalition website [accessed 28-May-2016]
⁷ http://www.software.ac.uk - Software Sustainability Institute website [accessed 28-May-2016]
⁸ https://rd-alliance.org - Research Data Alliance website [accessed 28-May-2016]


3. Data Provenance to Support Reproducibility

Dr Paolo Missier, Newcastle University; Prof Thomas Nichols, University of Warwick; Prof Dorothy Bishop, University of Oxford; Prof Luc Moreau, University of Southampton

3.1 Overview

The drive for greater reproducibility requires the capability to track and record every aspect of experimental design, data acquisition, pre-processing, analysis and results generation. The PROV data model [12] was published as a W3C standard in 2013 to help enable this goal. Complete provenance would then allow an independent investigator to understand exactly what was done to the data at each step, and attempt to reproduce the result with either the original data, ideally shared and open, or a new set of data. The ability to capture provenance is also important to support the exploration of alternative experimental designs, by making it possible to reason about, and explain, differences in outcomes produced by different versions of an experiment.

Realising this potential has been difficult, however. The goal of the session was to understand the reality of provenance management practices with respect to reproducibility. The session began with three presentations, which explored provenance issues in the neurosciences; gave a case study of open data and provenance in psychology research; and summarised the provenance standards and provided an overview of experimental tools that leverage those standards to facilitate reproducibility. Following the talks, three breakout groups explored specific issues:

• motivations, challenges and limitations to exploiting provenance, particularly for open data,
• integration between provenance and other tools, and
• automated reasoning using provenance.
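To make the idea of tracking "what was done to the data at each step" concrete, here is a minimal sketch in Python of a provenance record that borrows two relation names from the PROV data model, used and wasGeneratedBy. It is an illustration only: the ProvRecord class, its methods and the example step names are hypothetical, and the sketch does not implement the PROV standard or any existing PROV library.

```python
from dataclasses import dataclass, field


@dataclass
class ProvRecord:
    """Tiny in-memory store of PROV-style statements (illustrative, not a PROV library)."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    used: list = field(default_factory=list)              # (activity, entity) pairs
    was_generated_by: list = field(default_factory=list)  # (entity, activity) pairs

    def record_step(self, activity, inputs, outputs):
        """Record one processing step with the entities it read and wrote."""
        self.activities.add(activity)
        for entity in inputs:
            self.entities.add(entity)
            self.used.append((activity, entity))
        for entity in outputs:
            self.entities.add(entity)
            self.was_generated_by.append((entity, activity))

    def lineage(self, entity):
        """Return every entity that `entity` was transitively derived from."""
        ancestors = set()
        frontier = [entity]
        while frontier:
            current = frontier.pop()
            for output, activity in self.was_generated_by:
                if output != current:
                    continue
                for act, source in self.used:
                    if act == activity and source not in ancestors:
                        ancestors.add(source)
                        frontier.append(source)
        return ancestors


# Hypothetical two-step pipeline.
prov = ProvRecord()
prov.record_step("preprocess", ["raw_scans"], ["clean_scans"])
prov.record_step("analyse", ["clean_scans", "design_matrix"], ["stat_map"])
print(sorted(prov.lineage("stat_map")))
# prints ['clean_scans', 'design_matrix', 'raw_scans']
```

Even this small structure supports the kind of question raised in the overview: given a result, which inputs and intermediate products contributed to it, and therefore what would an independent investigator need in order to repeat the analysis.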

3.2 Recommendations

3.2.1 Provenance is pivotal to reproducibility of data-intensive research (see [5] for a general consideration, and [13] for a case-study in neuro-imaging, discussed by participants). There are many research challenges associated with provenance, a diverse area of data science, in which the ATI can play a key role as a world-leading research institution.

3.2.2 In particular, we recommend that the ATI should investigate automated and low-overhead provenance collection systems, and evaluate their use in supporting reproducibility, with the understanding that those are going to deliver fine-grained provenance which will require higher level abstraction. We suggest that the ATI should select and deploy at least one such provenance collection system as a pilot, but at a scale significant enough to include a variety of its data science activities within 6 months, to embed provenance across its research activities and create a useful dataset.

3.2.3 The PROV data model [12] was adopted as a standard by the World Wide Web Consortium in 2013 and since then further data models have been adapted from the standard (see, for example [8], [14]). The application and adaptation of PROV to different use cases and domains is a key area for further research.

3.2.3 There exist a number of social and cultural issues around the use of provenance to support reproducibility in which the ATI can play a powerful role as an advocate. In particular, establishing the appropriate research infrastructure to support reproducibility, both technical (in terms of data and compute infrastructure, as mentioned above) and socio-cultural (through establishing and implementing appropriate data-sharing policies and reward mechanisms) will be critical.


3.2.4 Practices to encourage reproducibility and re-use need to be incorporated into the research process at the beginning, and this could be supported by the ATI through implementation of appropriate data and code-sharing policies, and through appropriate training of its research fellows and doctoral students in metadata management and data curation.

3.3 Key discussion points

3.3.1 Reproducibility and re-use of underlying data and software are key to doing better research, improving research dissemination and providing transparency for the use of public funding in research. Reproducibility requires a richness of the descriptions of the steps leading to the production of the data and derived results. Provenance plays a key role in recording repeatable experiments and scientific workflows (through, for example, recording acquisition parameters, instrument settings and other metadata) and is also pivotal where science is observation-based and thus not inherently reproducible, e.g. archaeological excavations, large-scale physics experiments or longitudinal observational data, where analysis can be repeated, but the experiment cannot. There are many motivations, challenges, and limitations in the recording and exploitation of provenance [13].

3.3.2 One such key challenge is the balance between cost and benefits associated with collecting provenance for a specific system or even a single application. One provocative starting question was whether it was worth collecting provenance at all and whether there were “success stories” that could act as exemplars for the value proposition of provenance management. Provenance is costly for humans to collect and a compelling case for this cost can be hard to make. However, automated tools can improve the economics: see, for example, use of provenance in the DataONE project [8], a large-scale federated repository for observational earth data; and the use of provenance in neuroimaging using the NeuroImaging Data Model [15].

3.3.3 A better understanding of data requires tracking of its context, captured through provenance information. Automated collection and storage can result in low-level detail that needs abstraction to create semantically meaningful information, but that can be very useful. The Fabric for Reproducible Computing (FRESCO)⁹ Observed Provenance User Space (OPUS) project at the University of Cambridge [16] provides an exemplar for system-level provenance capture components that operate unobtrusively, transparently and with low overhead, providing a low barrier to entry for making applications provenance-aware. The captured provenance is not PROV-compliant but is clearly a useful starting point. Other examples are Harvard’s Provenance-Aware Storage System, PASS [17], and SPADE, an open-source software infrastructure for provenance data collection and management [18].

3.3.4 Provenance collection systems have the advantage of being domain-agnostic but are not necessarily immediately useful to users, as they require higher-level abstraction. This is a key area for further research. We recommend that the ATI should fund research into mechanisms to bridge the abstraction gap in order to make automatically collected provenance information understandable and exploitable at scale. Ideally, the research would both analyse and use the data collected and propose feasible methodologies for others to capture provenance data, with the aim of more easily answering complex research questions at the scientists’ level (for example, can provenance be used to explain the different levels of confidence returned by two repetitions of a method testing the same hypothesis under slightly different configurations of the experiment?).

3.3.5 An interesting area for future research is in the use of provenance to support automated reasoning; to investigate machine learning/inference techniques for mapping low-level patterns to high-level abstractions, as well as mechanisms for helping users exploit large provenance traces in a way that is intuitive and highly customisable, but removing the need for sophisticated programming. This research will require large-scale provenance datasets to be made available, which is a challenge in itself.
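The gap between automatically captured, low-level provenance and the higher-level abstraction that users need (points 3.3.3 to 3.3.5) can be illustrated with a small sketch. It assumes capture at the level of Python function calls rather than the system level used by OPUS, PASS or SPADE, and it relies on a hand-assigned stage label instead of any inference technique; the names traced, summarise and EVENT_LOG are hypothetical.

```python
import functools
import hashlib
import json

EVENT_LOG = []  # low-level events, appended automatically by the decorator


def _digest(value) -> str:
    """Short content hash of any JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def traced(stage):
    """Decorator: log each call with hashed inputs and output, tagged with a stage label."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            EVENT_LOG.append({
                "stage": stage,
                "function": func.__name__,
                "inputs": _digest([args, kwargs]),
                "output": _digest(result),
            })
            return result
        return wrapper
    return decorate


@traced("cleaning")
def drop_missing(values):
    return [v for v in values if v is not None]


@traced("cleaning")
def clip(values, lo, hi):
    return [min(max(v, lo), hi) for v in values]


@traced("analysis")
def mean(values):
    return sum(values) / len(values)


def summarise(events):
    """Abstract the fine-grained log into a list of functions per stage."""
    summary = {}
    for event in events:
        summary.setdefault(event["stage"], []).append(event["function"])
    return summary


result = mean(clip(drop_missing([1, None, 50, 3]), 0, 10))
print(summarise(EVENT_LOG))
# prints {'cleaning': ['drop_missing', 'clip'], 'analysis': ['mean']}
```

The analyst writes no provenance code beyond the decorator, which is the "low barrier to entry" property, while the summary step stands in for the much harder research problem of mapping low-level traces to meaningful abstractions automatically.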

⁹ https://www.cl.cam.ac.uk/research/dtg/fresco/, accessed 28 May 2016.


3.3.6 Research data, particularly open data, is dynamic in nature; there are different levels of aggregations, granularity and versions. Although provenance is useful to be able to track findings and conclusions from the original experimental data to the published dataset, there is a need for encapsulation, wrapping data and methods together. Frameworks such as Research Objects [19] have gone some way to making this encapsulation a reality, but need investment into ease of use to become more widely adopted in the research community.

3.3.7 There is a need for harmonized metadata standards within disciplines and between disciplines to record provenance and enable reproducibility. The bioinformatics community has been prescient in tackling these issues, through large-scale community efforts like the ELIXIR¹⁰ project in bioscience, but this is not always affordable for community-curated databases and datasets, or for projects with small research grants.

3.3.8 Recording provenance information, and cleaning, enriching, curating and sharing data can be a significant overhead. One participant commented: “Scientists are not inherently excited about database/annotation ‘stuff’, so why is it worth for them doing more than the minimum?” Another participant commented: “What is the art of the possible? It's not about what's right or what's best. It's about what you can actually get done!” Key to overcoming this challenge is participation and buy-in from research funders, appropriate policy interventions by research institutions, and, most importantly, evidence that provenance management generates measurable added value.

3.3.9 To incentivise community buy-in and participation we need demonstrator/exemplar projects that showcase the benefits in implementing provenance and sharing data; greater rewards for researchers who implement reproducible research methods; a deeper recognition of, and support for, publishing datasets and methods in data papers and for sharing of data and software; and support for data citation as a form of attribution and evidence of shared data to be part of the scholarly record.

3.3.10 A culture of change needs to be driven by funders. Following the example of biomedical funders requiring a statistician, every data science project should have the support of a research data manager and informatics expert. These individuals can lead efforts to incorporate techniques supporting reproducibility, including provenance management, from the inception of a research project.

3.3.11 However, there is a major training and skills gap between what the provenance research community knows how to do in principle, and what research support professionals are taught and able to support in practice. Although this may not be the remit of the ATI specifically, the ATI has an opportunity to work with organisations such as the Digital Curation Centre or the Research Data Alliance who support skills development through professional training and communities of interest.

¹⁰ https://www.elixir-europe.org/ - a distributed infrastructure for life science information. [Accessed: 28-May-2016].

4. Computational models and simulations

Prof Jeremy Gibbons, University of Oxford; Dr Nicola Botta, Potsdam Institute for Climate Impact Research; Prof Patrik Jansson, Chalmers University of Technology, Sweden; Dr Camil Demetrescu, Sapienza University of Rome

4.1 Overview

The focus of this session was two-fold; firstly, participants considered computing techniques employed to yield correct computational simulations of abstract models (such as differential equations in economics), including Domain-Specific Languages (see [20], [21] for a general introduction to DSLs) and code verification techniques. We have the ability to simulate the evolution of coupled earth system models over thousands of years, create synthetic populations of millions of agents and analyse networks of material and energy flows on very fine scales. However, interpreting and communicating results in a rigorous, unequivocal way is challenging, in spite of the growth of available computing power. DSLs can play a crucial role in helping to close the gap between scientific computing and rigorous scientific advice. DSLs were discussed in the more general context of reproducibility, rigour and repeatability in software and systems research [22], [23], [24].

Secondly, the workshop considered artifact evaluation in computer science, a process used to validate that the code supplied with an academic paper gives the same results as those reported in the paper [25]. Participants discussed an evaluation process for supplementary materials (software, data etc.) that complement conference publications in computer science; the process has been widely tested in several mainstream computing conferences since 2011 [26], [27]. We considered the motivations, implementation and outcome of artefact evaluation, its application to other domains and specifically to the ATI.

4.2 Recommendations

4.2.1 We suggest that the ATI has an opportunity to mediate between computer scientists and domain experts across a wide range of domains in developing or applying DSLs. A key challenge for the ATI is in developing or adapting DSLs for application in specific disciplines (see, for example [28] and [29] which survey the use of DSLs in machine learning and robotics respectively), and across a range of disciplines. The ATI has a unique position to enable development and use of DSLs due to the wide range of use-cases and researchers working across many different disciplines underpinned or enabled by data science. For example, it might be fruitful to develop DSLs to better understand the interaction between biological, ecological and environmental systems.

4.2.2 In terms of reproducibility, DSLs have the potential to enable improved specification of hypotheses and research methodologies in order to reproduce an experiment or re-use data in interdisciplinary research. The ATI could play a leadership role in supporting the development of DSLs aimed at answering research questions in specific domains or across domains. Furthermore, the ATI could enable reproducibility through the research of the ATI community, designing use cases and developing best practices drawing on deep expertise in programming language design and implementation across the UK.

4.2.3 Data scientists and domain scientists have complementary skill sets, and identifying and codifying abstraction layers to help them communicate requires a third skill set, namely programming language design and domain specific language implementation. Applying these skills requires time-consuming collaboration to develop the much-needed understanding of both sides. The ATI could promote reproducibility by encouraging "DSL phenomenology", that is, embedding researchers with programming language design expertise as part of teams involving data scientists and application scientists to help understand and codify the interfaces between generic data science algorithms/implementations and domain-specific knowledge. This could, for example, take the form of short-term visits to the ATI by programming language researchers or students interested in DSL development, to work on such projects.

4.2.4 We suggest that artifact evaluations be addressed at workshops and conferences organised by the ATI, co-operating with publishers towards a better integration of traditional scholarly publications and the corresponding artefacts that were used in their preparation, and working with existing organisations in this space such as DataCite and the Software Sustainability Institute.

4.3 Key discussion points

4.3.1 ‘Domain Specific Languages’ (DSLs) are custom-designed programming languages applied to model a particular domain. They allow a domain expert to specify a complex model with enough precision for it to be translated to code or software while minimising errors. Most participants in the workshop session were familiar with examples of DSLs such as Excel, R, SQL and database query languages, languages for ontologies, and mark-up in XML or HTML, although as most participants were non-DSL experts they were unfamiliar with the term DSL.

4.3.2 DSLs allow computer scientists and data scientists to ‘speak the same language’ as domain experts, for example in machine learning using big data [28], robotic systems [29], systems biology or quantitative finance. DSLs can reduce the distance between the abstractions of the domain specialist and the general-purpose constructs of the coder. For example, MLFi (modelling language for finance) is a programming language that enables description of complex financial contracts such as derivatives [30], and uses OCaml (a functional programming language) as its primary implementation. As a counterexample, many non-DSLs such as Java use code libraries that are effectively ‘black-box’ implementations to the domain scientist and their use may impact on the results. There is no clear boundary between DSLs and tools or software implementations for specific applications or domains.
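To make the idea of a DSL concrete, the following is a minimal sketch of a small embedded language for describing financial contracts, written in Python. It is only loosely inspired by the combinator style of contract languages such as MLFi [30] and does not reproduce MLFi's syntax or semantics. The names `Zero`, `One`, `Scale`, `Both` and `payout` are hypothetical, and the interpreter deliberately ignores time and discounting.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Zero:
    """A contract with no rights or obligations."""


@dataclass(frozen=True)
class One:
    """Receive one unit of the named currency."""
    currency: str


@dataclass(frozen=True)
class Scale:
    """Multiply every payment of the inner contract by a constant factor."""
    factor: float
    contract: "Contract"


@dataclass(frozen=True)
class Both:
    """Hold two contracts at the same time."""
    left: "Contract"
    right: "Contract"


Contract = Union[Zero, One, Scale, Both]


def payout(contract: Contract, currency: str) -> float:
    """Total amount received in `currency` (hypothetical interpreter, no time or discounting)."""
    if isinstance(contract, Zero):
        return 0.0
    if isinstance(contract, One):
        return 1.0 if contract.currency == currency else 0.0
    if isinstance(contract, Scale):
        return contract.factor * payout(contract.contract, currency)
    if isinstance(contract, Both):
        return payout(contract.left, currency) + payout(contract.right, currency)
    raise TypeError(f"not a contract: {contract!r}")


# The domain expert writes the description; the interpreter is kept separate.
portfolio = Both(Scale(100.0, One("GBP")), Scale(250.0, One("USD")))
gbp_total = payout(portfolio, "GBP")
```

The contract description is plain data, so the same `portfolio` value could be checked, printed or valued by different interpreters without the domain expert touching the implementation.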

4.3.3 Custom-designed programming/modelling languages for particular research domains have immense potential benefits, such as verification, portability, optimization and improvements in communication between data scientists/computer scientists and domain experts.

4.3.4 Successful programming languages ‘live’ much longer than hardware. For example, FORTRAN was introduced almost 60 years ago and is still in wide use whereas the computers on which it was originally used have long since become museum pieces; C, C++ and Java have enjoyed considerable longevity. DSLs to support the work of the ATI (e.g. for statistical modelling, distributed computation, etc.) should be designed with longevity in mind, rather than as a reaction to short-term needs, and should be independent of architecture. The factors that underpin the success of practical DSLs, and the problems caused by some of their design flaws, need to be better understood.

4.3.5 There is a tension between the degree of specification needed to reproduce an experiment or methodology, and the agility required for successful research. In the absence of a specification, reproducibility is difficult or impossible, but the requirement to formally document a specification can act as a barrier to agile research methods.

4.3.6 Many researchers are unsure about what to document in a research environment which is dynamic and rapidly changing. Electronic lab notebooks are being used successfully in some areas. There is a need for high-level, descriptive languages which can be used to implement methodological specifications across different disciplines. In neurosciences, for example, domain-specific languages such as R and Matlab meet the needs of the community, but are less useful in other domains.

4.3.7 The inter-disciplinary nature of data science requires accessible high-level languages which can be used to synthesise computational models for non-experts. For example, the Office for National Statistics is taking steps to share and describe the software used to develop its data in order to implement principles of open science and support reproducibility. In computational biology, documenting experiments and methodologies mostly relies on XML schemas, rather than formal specification languages that could be more adaptive and flexible. In the social sciences, there is a need for languages that can describe narratives and reasoning.

4.3.8 The workshop also discussed the use of DSLs in code verification and analysis applied to large-scale datasets and simulations. In massive-scale simulations (e.g. climate modelling, urban traffic flow models) validation of computations is challenging, because there is often no straightforward test or method of verification by which correct results can be identified. In those circumstances, it is all the more important that code is a faithful implementation of the abstract model and that the DSL prevents as many avoidable errors as possible.

4.3.9 Many scholarly articles in computational sciences report claims based on experiments with software and data artefacts. Unfortunately, the key role of these artefacts in validating and disseminating research is often overlooked. An artifact evaluation process was recently adopted at some major computer science conferences, recognising artifacts as first-class citizens in creating and disseminating research in computing. The evaluation process is fully open and transparent.

4.3.10 The group discussed this process as a step towards meeting the reproducibility challenge. We argued that software and data play a major role in many fields, hence the idea could be successfully exported beyond computer science. Furthermore, in addition to conferences, a variant of the process could be applied to journals. While sustainable artefact evaluation processes exist, currently many publishers do not offer effective means to archive, index and search artefacts, or to make them citable using the appropriate metadata. Collective pressure needs to be applied for this to change, working with publishers, research organisations, funders and other key stakeholders.

4.3.11 While a helpful step, artifact evaluation alone is not enough for reproducibility, because artifacts are evaluated in a contemporaneous technological context that may not be available indefinitely in the future. Artifact evaluation should be used as one of the key tools available in the reproducibility toolbox.
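The central check in artifact evaluation, that re-running the supplied code gives the results reported in the paper, might be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure used by any particular conference: the function names `compare_results` and `environment_record`, the tolerance, and the numbers are all hypothetical. The environment record reflects the point made in 4.3.11 that an evaluation is tied to the technological context in which it was run.

```python
import math
import platform
import sys


def compare_results(reported, reproduced, rel_tol=1e-6):
    """Return the names of results that are missing or differ beyond the tolerance."""
    mismatches = []
    for name, expected in reported.items():
        actual = reproduced.get(name)
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches.append(name)
    return sorted(mismatches)


def environment_record():
    """Capture the technological context in which the artifact was evaluated."""
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
    }


# Placeholder numbers for illustration only; they are not taken from any paper.
reported = {"accuracy": 0.912, "runtime_s": 14.0}
reproduced = {"accuracy": 0.912, "runtime_s": 15.5}

mismatches = compare_results(reported, reproduced)
record = environment_record()
```

In practice an evaluator would probably treat quantities such as runtime more leniently than scientific results; a single relative tolerance is used here only to keep the sketch short.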

5. Reproducibility for Real-Time Big Data

Professor David de Roure, University of Oxford; Dr Suzy Moat, University of Warwick; Dr Eric Meyer, University of Oxford

5.1 Overview

Today we embrace the methodological challenges of big and real-time data, arising from science but also exemplified by new and emerging forms of data such as social media, which provides a new lens onto society, demands study in its own right, and is itself a research tool. The Internet of Things, deployed in our cities, cars, homes and bodies, brings yet more data, machine-to-machine. During this session we discussed reproducibility in the context of real-time big data and new forms of digital scholarship, characterised by machines and people operating together at scale. Widespread adoption of new technologies leads to massive data generation, while at the same time we have crowd-scale personal engagement with the data and its analysis, such as in citizen science projects. This democratisation and empowerment leads to entirely new social processes, and new challenges and opportunities for reproducibility in working with these new forms of data. We considered how we might reproduce data science research using social media analytics, which examines new social processes at the scale of the population and in real time.

The discussion was contextualised with examples of research projects which analyse data from everyday use of the Internet [31], [32], and ask whether sources such as Google, Wikipedia and Flickr can be used to measure and even predict human behaviour in the real world. We looked ahead to our increasingly automated future, asking whether it is meaningful to automate reproducibility, and if and how we should keep the human in the loop.

5.2 Recommendations

5.2.1 The ATI should take bias in large-scale online datasets very seriously, and consider both automated and human means of adjusting for bias in order to make data more meaningful. Some participants in the symposium had participated in ATI scientific scoping workshops on ethics of data science and there was consensus that the ethical issues related to bias are very important and must be factored into research conclusions.

5.2.2 There is an opportunity for the ATI to foster collaboration between researchers using established methods and those working with new methods, and in particular between social scientists and data scientists, in order to understand better new and emerging forms of data, for example created through the Internet of Things. There may be a need for specific training of data scientists in social science methodologies, and vice versa, to improve the quality of conclusions drawn from data analytics on real-time big datasets.

5.2.3 There are interesting opportunities for the ATI to develop automated methods for capturing the verification of statistical inference over large real-time datasets, or to bring together machine learning approaches with statistical methods in order to improve calibration or reduce bias.

5.2.4 Reproducibility is both more important and more challenging in social data science using repurposed data, proprietary data, closed commercial data or personal data. The ATI has a responsibility to build trust in its research, and reproducibility is one of the ways through which this aim can be achieved. Provenance of data, a consideration of bias and consent, ethics and a discussion of limitations around data sharing are necessary to ensure trust. As a leader in data science funded by both public and private funds, the ATI should embrace exemplary data governance to ensure that research meets the terms and conditions specified by data providers, but is also for the public good.

5.3 Key discussion points

5.3.1 “In the wild” methodologies, real-time working and increasing automation all mean we are working with/interfering with a system that is itself adapting (and interfering with us). Examples of automation today include bots and the targeted digital advertising ecosystem. We need to establish research methodologies that cope with this, including dynamic calibration of our models in real time. We should embrace rather than reject the challenging of existing assumptions.

5.3.2 The data available to social science research has changed irrevocably in recent years. Research is needed into automated methods that can more easily provide quality assurance, but there is also a need for a human in the loop. Issues affecting data quality and therefore scientific conclusions have always existed, such as choosing the right variables, survey participants and methods which perturb the data, as well as selective publication of data or over-reliance on statistics such as p-values. However, these issues become increasingly challenging with the increasing size and complexity of datasets.

5.3.3 Dr Moat gave the example of the Google Flu Trends algorithm, which over-predicted an outbreak of seasonal flu, due to press reports which may have triggered online searches relating to flu by people who were not ill [33]. If we try to use online sources as more up-to-date/economical substitutes for official data, we need to be very careful about the populations we make inferences about. Real-time datasets often cannot be matched with a calibration sample, because the calibration sample does not exist.

5.3.4 Much social science data is inherently different to scientific data or data generated through simulations or computational models: repurposed data (e.g. Twitter, Wikipedia), ‘data exhaust’ (e.g. from mobile phones or loyalty cards), proprietary commercial data, data that comes with non-sharing or non-disclosure agreements, or data limited in distribution due to differing national legal frameworks (e.g. government or personal data). There are interesting issues around what reproducibility means in these contexts, such as a potential conflict between ethics and reproducibility: for example, is research of poor quality if the data is collected within an appropriate ethical framework, but then the results cannot be reproduced due to the constraints around access to the data? Ideally, privileged access to closed datasets should be as transparent as possible, with full provenance information provided in the absence of wider access to the underlying data, such that others with the same permissions could replicate the research.

5.3.5 For example, participants suggested that some datasets being used by policy makers in government might not be as well-documented as they should be. In the same way that code verification can find bugs, both automated and human methods can be utilised to establish the appropriate provenance of data, and similar methods should be developed for capturing the verification of statistical inference over large datasets. As a leader in the field, the ATI has a responsibility to establish appropriate benchmarks and standards through its own research, or point to existing well-established benchmarks (see, for example [34]).

5.3.6 The application of real-time data in data science research requires social scientists and computer scientists to work closely together, for the reasons given above. Participants questioned whether this is an area in which we need specific training, so that the inherent bias in online real-time datasets or ethical issues (for example, related to consent) can be understood and the impact evaluated.

5.3.7 There may be opportunities to bring together machine learning approaches with statistical methods in order to improve calibration or reduce bias. An example was given of prediction of the outcome of the 2012 US presidential elections and 2014 Scottish Referendum using data gathered from Xbox users [35], which of course would produce a biased sample but is one of multiple sources.
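One standard statistical way to adjust a biased sample of the kind described in 5.3.7 is post-stratification: estimate the quantity of interest within each demographic group, then reweight the groups by their known share of the target population. The Python sketch below illustrates the idea only; it is not claimed to be the method used in [35], the function name `poststratify` is hypothetical, and all the numbers are made up for illustration.

```python
def poststratify(sample_counts, sample_support, population_share):
    """Reweight per-group support rates by known population shares."""
    estimate = 0.0
    for group, share in population_share.items():
        n = sample_counts[group]
        if n == 0:
            # A group with no respondents cannot be reweighted.
            raise ValueError(f"no respondents in group {group!r}")
        estimate += share * (sample_support[group] / n)
    return estimate


# Made-up illustrative numbers: the sample over-represents younger respondents.
sample_counts = {"under_30": 800, "30_plus": 200}
sample_support = {"under_30": 480, "30_plus": 80}  # respondents backing option A
population_share = {"under_30": 0.25, "30_plus": 0.75}

raw = sum(sample_support.values()) / sum(sample_counts.values())
adjusted = poststratify(sample_counts, sample_support, population_share)
```

With these invented figures the raw sample suggests 56% support, while the reweighted estimate is about 45%. The adjustment depends on trustworthy population shares, which is exactly the calibration information that 5.3.3 notes is often missing for real-time datasets.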

5.3.8 Inter-disciplinarity is key to data science and its reproducibility, and there are good examples of data-oriented multidisciplinary research programmes such as the Trans-Atlantic Platform for the Social Sciences and Humanities Digging into Data challenge [36]. While we were drilling down into digital social research, we equally could focus on digital humanities. Also, we need more critical thinking: some communities see the benefit of constructive feedback, but elsewhere people are uncritical, and the scholarly ecosystem seems overly generous in publishing results uncritically, and not publishing negative results. This quality control issue, and the need for critical thinking, extends beyond the research itself into its application in the lifecycle of innovation, as new data science drives innovation in the marketplace.

5.3.9 Participants suggested that computer scientists have a responsibility to inform researchers using online data as to how heavily phenomena are influenced by user interface design. There is an incorrect assumption in some research that online behaviour is the same as human behaviour, or that online datasets do not incorporate bias. This is a discussion to have with social science and humanities and may be a blind spot.

5.3.10 There are more philosophical questions arising about increasing automation, for example to process data at scale and at speed, together with the use of machine learning in data analytics. As we automate the conduct of research, to what extent are we irrevocably embedding our current methods into the knowledge infrastructure, and how are they to be challenged?

5.3.11 There is a need to clarify what is meant by ‘best practice’ in relation to reproducibility, particularly for real-time datasets, and for the social sciences and humanities. In the sciences it may be enough to appropriately cite data and software, and store a Dockerfile or script in order to record an experimental methodology, but real-time online datasets are mixed-mode, involving both people and machines. Repeating or re-using crowd-sourced analyses, for example, may simply be impossible.
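For the simpler case mentioned in 5.3.11, where citing data and software and storing a script may be enough, a run could be recorded with a small manifest of input checksums and software versions. The Python sketch below is one possible shape for such a record, not an established standard; the function names, the manifest fields and the file names `sample.csv` and `analysis.py` are hypothetical.

```python
import hashlib
import json
import platform
import sys
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_files, script):
    """Describe the inputs and software environment of one analysis run."""
    return {
        "script": str(script),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data": {str(p): sha256_of(p) for p in data_files},
    }


# Demonstration on a temporary file standing in for a real dataset.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"id,value\n1,3.5\n")
    manifest = build_manifest([sample], script="analysis.py")
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

A manifest like this records what was analysed and with which software, but it cannot capture the human side of mixed-mode, crowd-sourced datasets, which is the limitation the paragraph above points to.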

6. Publication of Data-Intensive Research

Professor Carole Goble, University of Manchester; Dr David Crotty, Oxford University Press; Richard O’Beirne, Oxford University Press; Simon Hodson, CoData; Dr Laurie Goodman, GigaScience; Neil Chue Hong, Software Sustainability Institute

6.1 Overview

The idea of data being a publishable entity has become a prominent concept in researcher and publisher conversations. This session explored the role and impact of the publication of data intensive research as the methods and outputs of research change. Participants discussed the role of data and software in achieving reproducibility, and the links to research institution and funder policies, the publishing ecosystem, and researcher behaviours and skills. Laurie Goodman, Editor-in-Chief of the respected data journal GigaScience¹¹, presented a publisher’s perspective of data citation: lessons learned, issues needing more education and practical steps needed to encourage data re-use in research. Neil Chue Hong, Director of the Software Sustainability Institute, reminded us that, whilst open data is moving us forward, we risk being stalled by our software. Participants discussed ways to overcome the barriers to the availability and accessibility of data, software and other outputs, which are fundamental to good research, and made a number of recommendations for research policies and practices.

6.2 Recommendations

6.2.1. As an international leader in data science, the ATI has an opportunity to provide an exemplar, to ensure that data outputs and software are viewed as important scholarly outputs, and to reflect this ethos in its grant guidelines, funding decisions, researcher selection and promotion decisions. The ATI should develop a ‘five year plan for reproducibility’, setting out the steps it intends to take, publishing measurable targets, and reviewing its progress.

6.2.2. The ATI has an important role to play as an advocate of reproducibility in universities, government and industry, embedding data management and software sustainability into the UK’s science and innovation strategy. The ATI should pursue partnerships with the many organisations already working in this sphere, such as the Software Sustainability Institute, in order to support and embed their work in the data science community.

6.2.3. The ATI should fund and pursue research into mechanisms of credit for re-use of data and software over and above citation, a complex computational problem when using datasets ‘in the wild’ on the web.

6.2.4. The ATI should develop a clear policy on research data and software curation and publication, including a policy on the ownership of intellectual property rights in data and software, and support its implementation through appropriate training for its researchers and support staff.

6.2.5. The ATI should convene a group of eminent academics, working with the national academies, to write a policy document on reproducibility with respect to the role of data and software publication to enhance the quality, transparency, accountability and communication of research, following on from the Royal Society’s “Science as an Open Enterprise” report [37].
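Recommendation 6.2.3 refers to credit mechanisms beyond citation, and the discussion points name transitive credit as one candidate. One way to read that idea is sketched below in Python: every product (paper, dataset, software) declares fractional credit for its contributors, some of which are themselves products, and credit is passed down the chain until it reaches people. This is an illustrative interpretation rather than a specification; the function name, the credit maps and the contributor names are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Distribute `weight` over the people behind `product`, following dependencies."""
    if totals is None:
        totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product (e.g. a dataset): pass credit on.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps; the shares of each product sum to 1.
credit_maps = {
    "paper": {"author_a": 0.5, "author_b": 0.25, "dataset": 0.25},
    "dataset": {"curator_c": 0.5, "author_a": 0.5},
}

credit = transitive_credit("paper", credit_maps)
```

Here the dataset curator receives a share of the paper's credit without being an author of the paper, which is the kind of recognition for data and software work that citation counts alone do not provide.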

¹¹ http://gigascience.biomedcentral.com

6.3 Key discussion points

6.3.1. Action by UK research funders to improve data publication is necessary but currently insufficient. Beyond compliance with funder policies, we need realistic incentives for academics to publish their data, code and software, above and beyond the current Research Excellence Framework. One participant noted, “If I have my annual review coming up, the only thing of importance is number of publications; not open access or other altruistic measures”.

6.3.2. Although publication of data and code is growing in adoption, levels are still low. There is an opportunity for the ATI to drive change by setting ambitious goals within the data science community. For example, the ATI could accelerate the recognition of the role of data and software in research through establishing an ambitious five-year plan in support of reproducibility with measurable objectives. In 5 years, what steps would need to be taken to ensure that, perhaps, 75% of articles published by ATI researchers cite the associated dataset? What might be a realistic target for citation of software or complete Research Objects?

6.3.3. There was consensus that to drive change, we need improved forms of credit for researchers resulting from data and code publication. Data citation is one form of credit; tracking transitive credit from research objects might be another. This is a data science research area in its own right and could form part of the ATI’s research agenda.

6.3.4. In parallel to the conversations around ethics, the ATI should engage its researchers in a conversation around why reproducibility matters, and ensure that researchers in data science see reproducibility as critical to improving the quality of their research.

6.3.5. We use the term “software” to mean everything from a small fragment of code to a suite of programs. There are two orthogonal issues related to software sustainability: ‘runnability’ (making sure that software is resilient to future changes in technology and will continue to compute) and ‘readability’ (ensuring that the software used is well-documented and its role in the research can be interpreted and understood, as highlighted in Carole Goble’s keynote). The ATI has a role as a beacon for best practice in software “readability” for data science research, drawing on, adapting, supporting and promoting existing best practice to researchers.

6.3.6. The ATI should produce guidelines (drawing on and adapting existing best practice) as to how code should be written and reviewed to improve the “communication” of data-intensive research done by the ATI. This should be backed by practical support within the ATI, e.g. code review groups, using the in-house Research Software Engineer team to provide advice, and checklists to complete before submitting research, including ensuring that information about the software used is captured. We suggest that the ATI should form a partnership with the Software Sustainability Institute, which is undertaking similar work for RCUK and ELIXIR EU Research Infrastructure.

6.3.7. It has been discussed in many other fora that technology improvements, such as the development and adoption of well-documented open APIs or improvements in database queries, are necessary to make data and code sharing, annotation, submission and data curation much easier (see, for example [38]). Organisations like DataCite, the Digital Curation Centre, Jisc, the Software Sustainability Institute and Force11 are doing excellent work in these areas. The ATI should support the advocacy work of these institutions and there is an opportunity to work with them to support technology enhancements for reproducibility in data-intensive research. There are also fora that are discipline specific; where appropriate these should be engaged with.

6.3.8. Modern web practices implicitly demand a ‘loss of control’ through the use of distributed, ‘loosely coupled’ services, perhaps owned by many different stakeholders. For example, a researcher’s pre-print may be published on arXiv, her data in Dryad, her code in GitHub, peer review comments in Publons, her methods described in Protocols.io and only a minimal narrative held by the research institution or publisher. Services like ORCID¹² (open identifiers which disambiguate authors and improve interoperability between data and code repositories) enable a transfer of control back to the researcher by holding all their personal data in a single store through which they can authenticate and authorize the use of third party web services. The ATI should encourage or mandate the use of ORCIDs by its researchers, following on from the example set by funders like the Wellcome Trust in support of reproducible data-intensive research. The practices to cross-link across stakeholders to give a metadata-led, integrated view (developed by the Research Object group) should be explored.

6.3.9. As an example of advocacy work, there is an opportunity for the ATI to work with publishers and learned societies to improve practices in support of reproducibility: for example, ATI researchers could leverage their relationships with journals via editorial boards to widen adoption of data and software citation best practices, provide case studies of data-intensive research that has successfully used open methods, or provide analysis of citation/reuse data to demonstrate the increase in credit from reproducibility. For example, the ACM (the scientific and educational computing society) is driving the development of best practice and policy in reproducibility through its Task Force on Data, Software and Reproducibility in Publication [39].

6.3.10. There are other important practical steps that the ATI can take to encourage data and software publication. There is much existing work in the community on data and software citation which the ATI can draw on and implement. For example, the ATI should adopt the Force11 joint declaration of data citation principles and the forthcoming software citation principles, and foster engagement with this organisation. The ATI should make a clear policy statement around research data curation and software publication activities that underpin its research, building on the data management policies of its university partners. Some work has already been started in this area by the joint venture partner university libraries, but needs driving forwards.

6.3.11. In particular, the ATI should establish a clear policy on ownership of intellectual property rights in published data and unpublished research data, as a lack of clarity in the rights in data can be a barrier to sharing and re-use.

6.3.12. We suggest that the ATI should form an expert data management team to support researchers in data deposit, management and, crucially, software curation, providing exemplars for good practice. Both the data management team and policies must interoperate with research data curators and policies of partner institutions and work collaboratively to raise standards across the wider data science community. Working with institutions like the Digital Curation Centre, the ATI should offer options for professional development in data and software curation in areas where researchers find they need support. There is much excellent work in the community from which to draw; for example, the free online training provided by Edina at the University of Edinburgh¹³. We suggest that such skills should be considered part of the full range of capabilities and competencies required to be a successful, high-quality data scientist.

6.3.13. The Share Initiative¹⁴, led by the Center for Open Science and the Association of Research Libraries in the US, may be a good model to emulate, with its strong nucleus of key stakeholders that can effect change. There may be an opportunity for the ATI to lead a similar initiative in the UK, for example convening academics in data science from its member institutions alongside the British Computer Society, the ACM, major funders such as the EPSRC and major publishers to demand higher standards for openness and credit within their own field. This could be a more tractable problem and lead the way for others to follow.

¹²
¹³ http://datalib.edina.ac.uk/mantra/
¹⁴ http://www.share-research.org

7. Novel architectures and infrastructure to support reproducibility

Dr Richard Mortier, University of Cambridge; Dr Adam Farquhar, British Library; Dr Kenji Takeda, Microsoft Research

7.1 Overview

Recent years have seen dramatic advances in computing infrastructure that support reproducibility. Virtual machines, cloud computing and container technology such as Docker all provide means to capture and replicate the software environment in which code must run, to varying degrees of fidelity. This session explored how these infrastructure technologies are being used and extended, with case studies from Microsoft Research, covering use of its Azure platform and benchmark datasets including the Microsoft Academic Graph [40]; and Unikernels [10], developed at the University of Cambridge, a means to create entirely self-contained runnable software images by linking application code with the necessary platform libraries at build time. Participants considered current developments, future technical requirements and research questions arising from reproducibility in data-intensive research. Practical issues such as incentives and business models were also discussed.

7.2 Recommendations

7.2.1 The ATI should ensure that end-to-end reproducibility is embedded in all the research that it carries out and supports, so as to become an exemplar of best practice.

7.2.2 The ATI should measure the cost, impact and worth of reproducible research, both within the research it carries out and, through surveys and other instruments, among its partners and the data science community, so as to capture a broader baseline. This baseline should feed into an understanding of how best to support reproducibility in the wider scientific community, and the resourcing implications for doing so.

7.2.3 The ATI should ensure that software development carried out under its aegis adheres to best software engineering practice using modern tools and workflows, including appropriate code management and sharing (e.g., Git and GitHub), continuous testing and integration (e.g., Travis CI), and management and sharing of development and deployment environments (e.g., Docker images and the Docker Hub).

7.2.4 To facilitate this, we recommend that the ATI supports a limited number of development and deployment configurations to assist in alleviating the versioning problem. Having a small number of standard images should reduce the maintenance and administrative burden of ATI infrastructure, make it easier to reproduce application behaviour during development, make it easier to map from development to deployment, and greatly reduce the overheads in sharing code both between existing developers and when on-boarding new developers. For example, using the Docker platform, the ATI might develop its own images based on existing standard images: https://hub.docker.com/_/r-base/ provides a standard, official environment for applications using the R data processing language, and https://hub.docker.com/_/debian/ provides a standard, official environment for the Debian Linux distribution for more general development. These ATI images could then be encouraged (or perhaps mandated) for all development work carried out within the ATI, and could provide a useful (even commercial) platform for other data scientists, nationally and internationally.

7.2.5 The ATI should also stay abreast of and support further investigation into more research-oriented techniques and their applicability to reproducibility and data science generally. For example, could something like the OPUS toolchain [41] be used to collect provenance data for all data science carried out by the ATI, from the start? Could the increased use of modern language toolchains by unikernels simplify the process of mapping pre-existing codes onto emerging hardware architectures such as massively multi-core and distributed memory machines, providing greater scalability and future proofing?


7.2.6 Any processes the ATI puts in place for acquiring computational resources should be at least as straightforward to exercise as those provided by current cloud service offerings such as Microsoft Azure or Amazon AWS.

7.2.7 Legal (open source licensing and other intellectual property issues particularly), ethical, and privacy issues all impact reproducibility in data science, and the ATI must thus be cognisant of them. There are important inter-relationships with other social science interests already expressed within the ATI through, e.g., the scoping workshop on the Ethics of Big Data, and the Responsible Innovation and Human-Data Interaction symposium.

7.3 Key discussion points

7.3.1 Reproducibility is a complex matter that places many design constraints on applications. For example: a researcher may want to re-run their own developed software in the future; or a researcher may want to apply another researcher’s software using their own data. The technical discussion focused primarily on the repeatability aspects of reproducibility, though it was noted that this question itself may require research: is bit-wise repeatability always necessary or desirable, or might repeatability at different layers be of use (e.g., when simulating using random seeds)?
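The distinction between bit-wise repeatability and repeatability at a coarser layer can be illustrated with a minimal Python sketch. It is not drawn from the symposium: the toy random-walk `simulate`, the `fingerprint` helper and the tolerance value are all assumptions chosen for illustration, and identical seeds are only expected to give identical streams within one Python version and platform.

```python
import hashlib
import random
import struct


def simulate(seed, steps=1000):
    """Toy random-walk simulation (hypothetical); `seed` fixes the random stream."""
    rng = random.Random(seed)  # private generator, so no hidden global state
    position = 0.0
    trajectory = []
    for _ in range(steps):
        position += rng.gauss(0.0, 1.0)
        trajectory.append(position)
    return trajectory


def fingerprint(values):
    """SHA-256 over the exact IEEE-754 bytes of each value (bit-wise identity)."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(struct.pack("<d", value))
    return digest.hexdigest()


def close_at_summary_level(a, b, tolerance):
    """Coarser check: do two runs agree on their mean within `tolerance`?"""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return abs(mean_a - mean_b) <= tolerance


run_a = simulate(seed=42)
run_b = simulate(seed=42)  # same seed: expected to repeat bit for bit
run_c = simulate(seed=43)  # different seed: a different sample path

bitwise_repeatable = fingerprint(run_a) == fingerprint(run_b)
print("same seed, bit-wise identical:", bitwise_repeatable)
print("different seed, bit-wise identical:",
      fingerprint(run_a) == fingerprint(run_c))
```

Recording the seed alongside the results is what makes the first check possible. Where bit-wise identity is too strict, for instance after a change of compiler or hardware, a summary-level check such as `close_at_summary_level` may be the more useful definition of "the same result".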

7.3.2 Of particular interest are technologies such as Docker containers and unikernels, which make code dependencies explicit. Originally leveraging “Linux Container” technology but now expanding to cover “Windows Server Containers” and “Windows Hyper-V Containers”, Docker is a platform that makes it straightforward to create container images by explicitly specifying environmental dependencies, to deploy and manage those images individually and in clusters (Docker Swarm), and to share those images with others (Docker Hub). A number of event participants already made extensive use of Docker to create repeatable, sharable environments in which to run their code: its use appears to be widespread and increasing in many scientific communities.

7.3.3 Techniques that support reproducibility through computer system design and software engineering practice include: unit testing and test-driven development, where tests for subcomponents are developed in parallel with or even before those components, to ensure that regressions do not occur during future development; continuous integration, where suites of tests (often unit tests) are run on every “commit” or “check-in” of code; and continuous deployment, where a running service is updated frequently rather than waiting weeks or months for releases (e.g., Facebook is reported to update its site twice a day). Coupled with effective source code management workflows using tools such as Git, these techniques ensure developers are aware when key functionality has been changed. Through use of services such as Travis CI with Docker, developers can also automatically ensure that code executes as expected in multiple configurations of environment.
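As an illustration of the unit testing described above, the following is a minimal sketch using Python's standard `unittest` module. The `normalise` function and its tests are hypothetical, invented here to show how a small test suite pins down behaviour so that a later regression is detected.

```python
import unittest


def normalise(values):
    """Scale a list of numbers so that they sum to 1.0 (hypothetical example)."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalise values that sum to zero")
    return [v / total for v in values]


class NormaliseTests(unittest.TestCase):
    def test_output_sums_to_one(self):
        self.assertAlmostEqual(sum(normalise([1, 2, 3, 4])), 1.0)

    def test_known_values(self):
        # A fixed input/output pair guards against silent changes in behaviour.
        self.assertEqual(normalise([1, 1, 2]), [0.25, 0.25, 0.5])

    def test_zero_total_is_rejected(self):
        with self.assertRaises(ValueError):
            normalise([0, 0])


def run_tests():
    """Run the suite programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a continuous integration setting, a service would typically run such a suite on every commit, for example by invoking `python -m unittest`, and report a failure if any test no longer passes.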

7.3.4 Related to containers but currently more research-oriented are unikernels, which leverage modern languages and their toolchains (that is, the sequence of tools that are applied to convert high-level source code into an executing application; typically this might include compilers, linkers, linters, optimisers, libraries and runtimes) to remove unused components while building code. This results in the build output capturing the minimal set of dependencies enabling it to run in the target environment (e.g., the application might run as a standard UNIX process during development, but might be retargeted to boot directly on the Xen hypervisor for more efficient and repeatable deployment). Other related tools, such as OPUS from the University of Cambridge, attempt to detect which libraries application code depends on to run, during the development and build processes.
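The general idea of detecting what code depends on can be sketched in a few lines of Python. This is an illustrative analogue only and makes no claim about how OPUS or unikernel toolchains work: it statically parses a piece of Python source with the standard `ast` module and lists the top-level modules it imports. The function name and the example source are assumptions.

```python
import ast


def imported_modules(source):
    """Return the sorted top-level module names imported by `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports, which refer to the project itself.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return sorted(names)


EXAMPLE_SOURCE = (
    "import os.path\n"
    "import json as j\n"
    "from collections import OrderedDict\n"
    "from . import sibling\n"
)

print(imported_modules(EXAMPLE_SOURCE))
```

A static scan of this kind misses dependencies that are loaded dynamically or that sit below the language level (shared libraries, system tools), which is one reason tools that observe a program as it builds or runs can capture more.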

7.3.5 Code developed in data-driven research is likely to cross multiple workflows, operating systems and software environments/applications. Therefore, we may need to plan for the deployment of an ensemble. One participant drew an analogy with making the job of a future ‘software archaeologist’ (footnote 15) easier, through employing different strategies to implement reproducibility: “Level 0: archive all past major versions of the operating system, to create an environment where code can potentially be rerun; Level 1: explicitly document dependencies; Level 2: maintain your software publicly as a package that can be redeployed”.
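The "Level 1" strategy of explicitly documenting dependencies can be sketched for a Python project as follows. This is a minimal illustration under stated assumptions, not a recommended ATI tool: the function name `environment_manifest` and the layout of the manifest are invented here, and only packages visible to the running interpreter are recorded.

```python
import json
import platform
from importlib import metadata


def environment_manifest():
    """Collect interpreter, platform and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # ignore installations with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Write the manifest as JSON so it can be archived next to the results.
    print(json.dumps(environment_manifest(), indent=2))
```

Storing such a manifest alongside each set of results gives a later reader a concrete starting point for rebuilding the environment, although it does not capture operating-system libraries or hardware.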

7.3.6 The costs and effectiveness of different reproducibility approaches were questioned and discussed. The lack of good baseline data was noted. The agility of current cloud services to provide the resources needed for research, quickly, flexibly according to need and at reasonable cost was praised.

7.3.7 The problem of handling proprietary or otherwise unavailable software or hardware was discussed, raising cases where Docker containers are not presently a good fit. In such cases the British Library (and others) have had some success using full virtualisation approaches. Comments were made that the ATI may wish to discourage such practices where possible, as use of standard toolsets is more likely to enable reproducibility.

15 In his novel A Fire Upon the Deep (1992), the computer scientist and science-fiction writer Vernor Vinge wrote of a time when programmer-archaeologists maintained the fabric of civilization by diving into and modifying legacy code that ran the systems on which society depended.


8. References

[1] B. R. Jasny, G. Chin, L. Chong, and S. Vignieri, ‘Again, and Again, and Again…’, Science, vol. 334, no. December, p. 2011, 2011. DOI: 10.1126/science.334.6060.1225

[2] G. K. Sandve, A. Nekrutenko, J. Taylor, E. Hovig, ‘Ten Simple Rules for Reproducible Computational Research’, PLoS Comput. Biol., vol. 9, no. 10, p. e1003285, Oct. 2013. DOI: 10.1371/journal.pcbi.1003285

[3] S. N. Goodman, D. Fanelli, J. P. A. Ioannidis, ‘What does research reproducibility mean?’, Sci. Transl. Med., vol. 8, no. 341, p. 341ps12, Jun. 2016. DOI: 10.1126/scitranslmed.aaf5027

[4] R. D. Peng, ‘Reproducible Research in Computational Science’, Science, vol. 334, no. 6060, pp. 1226–1227, 2011. DOI: 10.1126/science.1213847

[5] J. Freire, N. Fuhr and A. Rauber, ‘Reproducibility of data-oriented experiments in e-science’, in Dagstuhl Reports, 6(1), 2016. Available: http://drops.dagstuhl.de/opus/institut_dagrep.php?fakultaet=07

[6] M. Broussard, ‘Big Data in Practice: Enabling computational journalism through code-sharing and reproducible research methods’, Digit. Journal., vol. 4, no. 2, pp. 266–279, 2016. DOI: 10.1080/21670811.2015.1074863

[7] M. Stamatogiannakis, H. Kazmi, H. Sharif, R. Vermeulen, A. Gehani, H. Bos, and P. Groth, ‘Trade-Offs in Automatic Provenance Capture’, Springer International Publishing, 2016, pp. 29–41. DOI: 10.1007/978-3-319-40593-3_3

[8] P. Missier, S. Dey, K. Belhajjame, V. Cuevas-Vicenttín, and B. Ludäscher, ‘D-PROV: Extending the PROV Provenance Model with Workflow Structure’, Proc. 5th USENIX Work. Theory Pract. Proven., pp. 9:1–9:7, 2013. Available: http://dl.acm.org/citation.cfm?id=2482949.2482961

[9] J. Cheney, ‘Project Skye: A programming language bridging theory and practice for scientific data curation’. [Online]. Available: http://homepages.inf.ed.ac.uk/jcheney/group/skye.html#project. [Accessed: 28-May-2016].

[10] A. Madhavapeddy, R. Mortier, C. Rotsos, D. Scott, B. Singh, T. Gazagnaire, S. Smith, S. Hand, J. Crowcroft, ‘Unikernels - library operating systems for the cloud’, in Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems - ASPLOS ’13, 2013, vol. 48, no. 4, p. 461. DOI: 10.1145/2451116.2451167

[11] C. Boettiger, ‘An introduction to Docker for reproducible research’, ACM SIGOPS Oper. Syst. Rev., vol. 49, no. 1, pp. 71–79, 2015. DOI: 10.1145/2723872.2723882

[12] L. Moreau, P. Missier, K. Belhajjame, R. B’Far, J. Cheney, S. Coppens, S. Cresswell, Y. Gil, P. Groth, G. Klyne, T. Lebo, J. McCusker, S. Miles, J. Myers, S. Sahoo, C. Tilmes, ‘PROV-DM: The PROV Data Model’, W3C Recommendation REC-prov-dm-20130430, World Wide Web Consortium, April 2013. Available: http://www.w3.org/TR/prov-dm/

[13] P. Missier, ‘The lifecycle of provenance metadata and its associated challenges and opportunities’, Building Trust In Financial Information - Perspectives on the Frontiers of Provenance, Springer, 2016. [Online]. Pre-print available: http://bibbase.org/network/publication/missier-thelifecycleofprovenancemetadataanditsassociatedchallengesandopportunities-2016. [Accessed: 28-May-2016].

[14] Y. Cao, C. Jones, V. Cuevas-Vicenttín, M. B. Jones, B. Ludäscher, T. McPhillips, P. Missier, C. Schwalm, P. Slaughter, D. Vieglais, L. Walker, and Y. Wei, ‘DataONE: A Data Federation with Provenance Support’, in Provenance and Annotation of Data and Processes: 6th International Provenance and Annotation Workshop, IPAW 2016, McLean, VA, USA, June 7-8, 2016, Proceedings, M. Mattoso and B. Glavic, Eds. Cham: Springer International Publishing, 2016, pp. 230–234. DOI: 10.1007/978-3-319-40593-3_28


[15] D. B. Keator, K. Helmer, J. Steffener, J. A. Turner, T. G. M. Van Erp, S. Gadde, N. Ashish, G. A. Burns, and B. N. Nichols, ‘Towards structured sharing of raw and derived neuroimaging data across existing resources’, NeuroImage, vol. 82, pp. 647–661, 2013. DOI: 10.1016/j.neuroimage.2013.05.094

[16] N. Balakrishnan, T. Bytheway, L. Carata, O. R. A. Chick, J. Snee, S. Akoush, R. Sohan, M. Seltzer, A. Hopper, ‘Recent Advances in Computer Architecture: The Opportunities and Challenges for Provenance’, 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15), Edinburgh, Scotland, USENIX Association, Jul. 2015. [Online]. Available: https://www.usenix.org/system/files/tapp15-balakrishnan.pdf. [Accessed: 28-May-2016].

[17] D. A. Holland, M. I. Seltzer, U. Braun, and K. K. Muniswamy-Reddy, ‘PASSing the provenance challenge’, Concurr. Comput. Pract. Exp., vol. 20, no. 5, pp. 531–540, 2008. DOI: 10.1002/cpe.1227

[18] A. Gehani and D. Tariq, ‘SPADE: Support for Provenance Auditing in Distributed Environments’, Proc. ACM/IFIP/USENIX 13th Int. Middlew. Conf., pp. 101–120, 2012. DOI: 10.1007/978-3-642-35170-9_6

[19] S. Bechhofer, D. De Roure, M. Gamble, C. Goble, and I. Buchan, ‘Research Objects: Towards Exchange and Reuse of Digital Knowledge’, Nat. Preced., Jul. 2010. DOI: 10.1038/npre.2010.4626.1

[20] P. Hudak, ‘Building domain-specific embedded languages’, ACM Comput. Surv., vol. 28, no. 4es, p. 196–es, Dec. 1996. DOI: 10.1145/242224.242477

[21] W. Taha, ‘Plenary talk III Domain-specific languages’, in 2008 International Conference on Computer Engineering & Systems, 2008, pp. xxiii–xxviii. DOI: 10.1109/ICCES.2008.4772953

[22] J. Vitek and T. Kalibera, ‘Repeatability, reproducibility, and rigor in systems research’, in Proceedings of the ninth ACM international conference on Embedded software - EMSOFT ’11, 2011, p. 33. DOI: 10.1145/2038642.2038650

[23] S. Krishnamurthi and J. Vitek, ‘The real software crisis: repeatability as a core value’, Commun. ACM, vol. 58, no. 3, pp. 34–36, Feb. 2015. DOI: 10.1145/2658987

[24] J. Gibbons, ‘Science relies on computer modelling - so what happens when it goes wrong?’, The Conversation. [Online]. Available: https://theconversation.com/science-relies-on-computer-modelling-so-what-happens-when-it-goes-wrong-56859 [Accessed: 28-May-2016].

[25] B. R. Childers, G. Fursin, S. Krishnamurthi, and A. Zeller, ‘Artifact Evaluation for Publications: Dagstuhl Perspectives Workshop’, 15452, 2015. DOI: 10.4230/DagRep.5.11.29

[26] S. Krishnamurthi, ‘Artifact evaluation for software conferences’, ACM SIGPLAN Not., vol. 48, no. 4S, p. 17, Jul. 2013. DOI: 10.1145/2502508.2502518

[27] A. Bergel and L. Bettini, ‘Artifact evaluation (summary)’, in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2013, 2013, p. 24. DOI: 10.1145/2491411.2508114

[28] I. Portugal, P. Alencar, and D. Cowan, ‘A Survey on Domain-Specific Languages for Machine Learning in Big Data’, Feb. 2016. arXiv preprint arXiv:1602.07637 (2016)

[29] C. Schlegel, U. P. Schultz, S. Stinckwich, and S. Wrede, ‘Proceedings of the Sixth International Workshop on Domain-Specific Languages and Models for Robotic Systems (DSLRob 2015)’, Jan. 2016. arXiv:1601.00877 (2016)

[30] S. P. Jones, J.-M. Eber, and J. Seward, ‘Composing Contracts: An Adventure in Financial Engineering’, ACM SIGPLAN Not., vol. 35, no. 9, pp. 280–292, 2000. DOI: 10.1145/357766.351267


[31] H. S. Moat, C. Curme, A. Avakian, D. Y. Kenett, H. E. Stanley, and T. Preis, ‘Quantifying Wikipedia Usage Patterns Before Stock Market Moves’, Sci. Rep., vol. 3, May 2013. DOI: 10.1038/srep01801

[32] T. Preis, H. S. Moat, and H. E. Stanley, ‘Quantifying Trading Behavior in Financial Markets Using Google Trends’, Sci. Rep., vol. 3, Apr. 2013. DOI: 10.1038/srep01684

[33] D. Butler, ‘When Google got flu wrong’, Nature, vol. 494, pp. 155–156, 2013. DOI: 10.1038/494155a

[34] ‘Data science at Microsoft Research’. [Online]. Available: http://research.microsoft.com/en-us/projects/data-science-initiative/#datasets. [Accessed: 28-May-2016].

[35] W. Wang, D. Rothschild, S. Goel, and A. Gelman, ‘Forecasting elections with non-representative polls’, Int. J. Forecast., vol. 31, pp. 980–991, 2015. DOI: 10.1016/j.ijforecast.2014.06.001

[36] ‘Trans-Atlantic Platform Social Sciences and Humanities - Digging into Data Challenge’. [Online]. Available: http://www.transatlanticplatform.com/2016/02/29/trans-atlantic-platform-announces-the-2016-t-ap-digging-into-data-challenge/. [Accessed: 28-May-2016].

[37] G. Boulton, P. Campbell, B. Collins, P. Elias, W. Hall, L. Graeme, O. O’Neill, M. Rawlins, J. Thornton, P. Vallance, and M. Walport, ‘Science as an open enterprise’, The Royal Society, Jun. 2012, pp. 1–104. [Online]. Available: https://royalsociety.org/~/media/policy/projects/sape/2012-06-20-saoe.pdf

[38] P. Buneman, S. Davidson, and J. Frew, ‘Why data citation is a computational problem’, 2016. Pre-print, to appear, Communications of the ACM (2016). [Online]. Available: http://frew.eri.ucsb.edu/private/preprints/bdf-cacm-data-citation.pdf

[39] ‘ACM Task Force on Data, Software and Reproducibility in Publication - web page’. [Online]. Available: https://www.acm.org/publications/task-force-on-data-software-and-reproducibility. [Accessed: 07-Jun-2016].

[40] A. Sinha, Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. (Paul) Hsu, and K. Wang, ‘An Overview of Microsoft Academic Service (MAS) and Applications’, in Proceedings of the 24th International Conference on World Wide Web - WWW ’15 Companion, 2015, pp. 243–246. DOI: 10.1145/2740908.2742839

[41] N. Balakrishnan, T. Bytheway, R. Sohan, and A. Hopper, ‘OPUS: A Lightweight System for Observational Provenance in User Space’, presented at the 5th USENIX Workshop on the Theory and Practice of Provenance, 2013. [Online]. Available: https://www.usenix.org/conference/tapp13/technical-sessions/presentation/balakrishnan