LG-71-17-0094-17 Indiana University Data To Insight Center

Abstract

Over the last decade, society has seen a nearly exponential increase in the volume of digital content. Researchers and educators in response see the potential that Big Data techniques bring to computational exploration of cultural and scholarly digital collections for organizing, accessing, and analyzing content. Libraries have long made a mission of provisioning access services to digital content to enrich and improve the lives of all Americans; however, when digital collections have access restrictions, provisioning services becomes a challenge.

We respond to this challenge with the Data Capsule service, developed in the HathiTrust Research Center, that enables remote access to restricted digital data in the HathiTrust Digital Library. Data Capsule is architected to be modular and uses application programming interfaces (APIs) for communication; this best practice in systems design, plus the proposed effort in packaging, will allow for faster integration into a new environment and ready contributions by third parties.

In this project, we intend to partner with 8 academic libraries across the country in a multi-method research project that draws from human computer interaction and experimental computer science to:

• Understand current library needs and practices in provisioning library services for computational access to special collections having constraints due to sensitivity or restrictions

• Extend the Data Capsule service to broader needs of provisioning for analytical access to restricted collections across a range of collections and uses

• Study extensions of Data Capsule to cloud computing environments for broader uses

• Identify gaps in skills needed for librarians to enable secure data analytics and provide resources that can address those gaps.

This project proposal, responsive to the IMLS National Leadership Grants for Libraries program, is planned as a 2-year effort. If funded it will be carried out under the encompassing framework of Participatory Design and involve funded partners at Indiana University, University of Illinois, University of California at Berkeley, and University of Virginia; and engaged partners at Indiana University, Lafayette College, MIT, Rutgers University, Swarthmore College, and UCLA.

In response to reviewer feedback, we increased the number of library partners in the project from 3-5 to 8, and introduced a two-tiered partner model. Level 1 partners (2) receive direct funding through the grant. Level 2 partners (6) receive travel funds built into the Indiana University grant to participate in a regional community-building event. The change resulted in an increase in the overall budget of about 15% from the pre-proposal.

Sustainability is planned through utilizing an existing operational service, growing its adopter community (libraries), and extending it to broader collections and use cases. The service itself is grounded in the HathiTrust Research Center, which continues to support and endorse the Data Capsule service as its primary service for computational analysis on the nearly 15 million volumes of the HathiTrust Digital Library. HTRC deeply welcomes this initiative to involve more partners as users and sustainers of the software codebase.


Data Capsule Appliance for Research Analysis of Restricted and Sensitive Data in Academic Libraries

1. Statement of National Need

Over the last decade, society has seen a nearly exponential increase in the volume of digital content [1]. The new content is coming into existence on the cultural side through massive digitization efforts [2] or because content is increasingly born digital. Libraries have long made a mission of provisioning access services to digital content to enrich and improve the lives of all Americans [3]. When digitized collections (of letters, government papers, video clips, institutional records, annotated volumes) have access restrictions, however, provisioning services becomes a challenge. Collections can have access restrictions for a number of reasons: a set of papers that have not been properly accessioned; a collection of videos with mixed in-copyright and public domain content; material donated by a prominent researcher that contains sensitive information from ethnographic studies on aboriginal peoples. The data-side push for new services to meet the challenge of restricted and sensitive collections is being met with a corollary end user pull, as researchers and educators discover the potential that Big Data techniques bring to the humanities [4] and other areas, and begin to envision opportunities in their own research spheres for computationally exploring both small and large collections of materials to organize, access, and analyze content.

Traditional types of library services often inadequately address end user needs when a collection of materials is restricted or deemed to contain sensitive data. Secure data enclave pilots allow researchers to work with this unique type of data [5]–[9]. Yet such enclaves often are limited to analysis of microdata through common statistical packages, making them less suited for other uses: there are hundreds of different computational content mining tools (the text analysis portal TAPoR, for example, lists 493 of them [10]). Additionally, enclaves are frequently custom-built for a collection, or a small set of centrally located collections, making this solution not so easily portable to new institutions or collections.

Drawing on the most pressing themes of trust, access, infrastructure, and skills in providing data services [11], the overarching goal of this project is manifold: understand current library needs and practices in provisioning services for computational access to special collections, extend an existing service to enable intuitive and yet secure computational access to restricted data in libraries, and identify gaps in skills needed for librarians to enable secure data analytics and provide resources that can address those gaps. We aim to build upon a service that has been developed in the HathiTrust Research Center (HTRC) that enables end users to remotely access the HathiTrust Digital Library for computational use. We propose, as part of this grant, to package the service as an appliance so that it can be easily installed in a library technological environment, and extend the service to satisfy scenarios of different collections and end user needs driven by our library partners. The service is called Data Capsule [12], [13], and it derives from theoretical work on a concept called “storage capsules” [14]. Through a grant from the Alfred P. Sloan Foundation (2011-2015) the author of storage capsules, Atul Prakash, along with Plale and McDonald (the latter two are leads on this proposal), developed the storage capsule concept into the working Data Capsule service, which became available in HTRC in 2015. The service in HTRC utilizes a tool called the Workset [15], which maintains an end user’s context.


Building on the earlier work of the Data Capsule (DC) service, we propose to extend and evaluate the system under the encompassing framework of Participatory Design with library partners from eight libraries across the country who have committed to serving as either Level 1 testing partners, eager to engage in hands-on evaluation, or Level 2 partners, ready to participate in discussions and studies. We propose, through this Participatory Design framework, to extend the service to:

• Be packaged as an appliance that can be run and managed locally at partner institutions

• Generalize the Data Capsule service to connect to broader types of restricted collections

• Deliver extensions to the Data Capsule service and Workset model that reflect partner needs obtained through intense partner engagement

• Deliver a design of Data Capsule that utilizes high performance and cloud computing resources and accommodates both large-scale needs of partners and partners with lighter technology resources available to them

As the Data Capsule service is architected using principles of well-defined APIs and software component modularity, it is highly suited to extension and generalization for broader use.

The conceptual framework guiding the architecture of Data Capsule (DC) in its current form can be explained in the context of fair use. Legal judgments of fair use have repeatedly returned to two key analytical questions [16]: First, “did the use ‘transform’ the material taken from the copyrighted work by using it for a broadly beneficial purpose different from that of the original or did it just repeat the work for the same intent and value as the original?” And second, “Was the material taken appropriate in kind and amount, considering the nature of the copyrighted work and of the use?” In DC, the transforming work is carried out by an end user within a Capsule that they have at their disposal for use for an extended period of weeks to months. The service then enforces both questions as follows:

• Use is appropriate: the DC service assesses appropriateness of the content exported from Capsule:
  o Unintentional exportation such as through malware is stopped
  o Intentional exportation is reviewed through manual (or in future automatic) results review

• Amount of data used is appropriate: the amount of data used in creation of all exported data products is below a threshold of appropriateness of use

• Data types: the type of data used in the creation of new content is allowable for the need

• Intent is reasonable and identity is proven: through structures of policy and institutional infrastructure

• When a Capsule is used for analytical purposes, acceptable activities include but are not limited to a) image analysis and text extraction, b) textual analysis and information extraction, c) linguistic analysis, d) automated translation and language translation, and e) indexing and search.

Data Capsule thus enables transformative use of restricted and sensitive collections through a service that will be packaged as an appliance, will have options for hooking to a new collection with relative ease, and will provide the needed assurances that the actions allowable by the service will protect the collection.
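The enforcement checks listed above can be pictured as a gate applied to each export request. The following Python sketch is illustrative only: the `ExportPolicy` and `ExportRequest` types, their field names, and the byte-based threshold are assumptions made for this example and are not taken from the Data Capsule codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportPolicy:
    """Hypothetical per-collection policy; every field is an assumption."""
    max_source_bytes: int            # threshold on the amount of data used
    allowed_data_types: frozenset    # data types permitted for this need


@dataclass(frozen=True)
class ExportRequest:
    """Hypothetical description of one result a user wants to export."""
    identity_verified: bool          # established by institutional infrastructure
    source_bytes_used: int           # restricted data that went into the result
    data_types_used: frozenset
    reviewer_approved: bool          # outcome of the manual results review


def refusal_reasons(request: ExportRequest, policy: ExportPolicy) -> list:
    """Return the reasons an export is refused; an empty list means release."""
    reasons = []
    if not request.identity_verified:
        reasons.append("identity not verified")
    # The amount used must be strictly below the threshold.
    if request.source_bytes_used >= policy.max_source_bytes:
        reasons.append("amount of data used is not below the threshold")
    # Every data type used must be in the allowed set.
    if not request.data_types_used <= policy.allowed_data_types:
        reasons.append("data type not allowed")
    if not request.reviewer_approved:
        reasons.append("results review not passed")
    return reasons
```

Returning the full list of reasons, rather than stopping at the first failure, is a design choice of this sketch; it would let a reviewer see every policy issue for a request at once.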


2. Project Design

The project is structured to bring together three distinct complementary bodies of expertise: human computer interaction expertise in community engagement, participatory design, and social-technical interactions (Kouper); computer science and technology expertise in data-driven architectures, data models, and trust (Plale, McDonald, and Downie); and library partners with expertise in technology services for special collections (Mitchell and Unsworth). The multidisciplinary team is critical to bring about a project of this nature.

The library partnership is designed at two levels. Level 1 Testing Partners identify a collection and an end-user need, and work with the Data Capsule team to implement a proof-of-concept demonstration for the collection. Level 1 Testing Partners also participate in the assessment, user study, and participatory activities. They include the libraries of University of California Berkeley and University of Virginia. Level 2 Partners engage in the assessment and user study, and contribute to participatory activities. Level 2 partners include the libraries of Lafayette College, Indiana University, MIT, Rutgers, Swarthmore, and UCLA.

2.1 Goals, methods, assumptions, and risks

The broad goal of this project will be accomplished through synergistic and mutually reinforcing activity in its two major foci of expertise: in participatory, design-oriented partner engagement and in software architecture and evaluation. The nature of the project is iterative within and between the two foci of expertise: “explore, approximate, and refine” [17].

Research methodologies: The project will employ research methodologies from two domains: human-computer interaction, to accomplish the goals of assessment, partner engagement, and evaluation; and experimental computer science, to advance the Data Capsule design and Workset. This multi-method approach to research is increasingly important in successful technology adoption: active all-stakeholder engagement at the early stages ensures a good fit on the human capital side, and the experimental computer science ensures a good fit on the technological side. The methodologies of each are described in more detail below.

Project risks: Low library partner participation is a potential project risk. We addressed this risk during development of the full proposal by devoting substantially more resources to the library partners. We increased the number of library partners in the project from 3-5 to 8, and introduced the two-tiered partner model. Level 1 partners (2) receive funding through a subcontract that they use for engagement of technical or collections expertise. We additionally built funding into the Indiana University budget to fund travel for Level 2 partners (6) to participate in a regional community-building event. The change resulted in an increase in the overall budget of about 15% from the pre-proposal. We thought this action a necessary risk mitigation strategy. Our project has already built into it a program for constant support and interaction with the library partners on both levels to ensure the highest possible participation.

Assumptions: Our project has several assumptions, all of which we think are reasonable expectations in the environments of major academic libraries, though further study will be carried out for less well-equipped libraries. Data Capsule is an environment (a set of software services plus policies) that utilizes a cluster of computers located within a secure network. The code base is modular and utilizes Application Programming Interfaces (APIs) for extensibility and interoperability. A Data Capsule Controller runs on one of the cluster nodes. From there, it allocates to an end user a Capsule -- a virtual computer (virtual machine) that runs on one of the other nodes in the cluster. The Data Capsule service implementation will be extended in this project to utilize the operating system library, Libvirt¹, which allows DC to configure an end user Capsule for secure access. Our implementation plan thus assumes i) Level 1 testing partners support the existence of a library such as Libvirt running on their testing servers, ii) programmatic access to a collection is available through an API, and iii) there exists a trusted service in the library environment through which user authentication can be carried out.
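The division of labor described above, a Controller on one cluster node handing out Capsules that run on the other nodes, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Node` and `Controller` classes, the capsule identifier format, and the least-loaded placement rule are hypothetical and do not describe the actual Data Capsule implementation or any Libvirt call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One machine in the secure cluster (illustrative model only)."""
    name: str
    capsules: list = field(default_factory=list)


class Controller:
    """Toy controller: runs on one node, places Capsules on the others."""

    def __init__(self, controller_node: Node, cluster: list):
        self.node = controller_node
        # Capsules never share the node that hosts the controller.
        self.hosts = [n for n in cluster if n is not controller_node]
        if not self.hosts:
            raise ValueError("cluster needs at least one Capsule host")

    def allocate(self, user: str) -> str:
        # Least-loaded placement is an assumption of this sketch.
        host = min(self.hosts, key=lambda n: len(n.capsules))
        capsule_id = f"{user}@{host.name}#{len(host.capsules) + 1}"
        host.capsules.append(capsule_id)
        return capsule_id
```

In a real deployment the allocation step would be where the virtual machine is created and configured on the chosen host; the sketch only records the placement decision.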

Research Framework 1: The framework of Participatory Design (PD) informs the research questions and methodologies of the human-computer interaction research. A theoretical framework and a set of practices, PD explores conditions for deep user engagement in the design and implementation of computer-based systems at work [18]. User empowerment and democratic decision-making are crucial for successful PD, as one of the main assumptions is that technology is being designed to facilitate skilled work and enhance rather than completely replace human labor [19]. Libraries recognize the need to engage their end users in the design of library spaces and technologies [20], [21]. We raise the question of how librarians themselves can be involved in co-design of tools that use and enhance their skillsets while, at the same time, enabling library end users.

Research Framework 2: Experimental computer science as a discipline and methodology forms the framework for assessing and advancing the technological aspects of the project. Through iterative design and prototyping, we reflect user needs in the software development process. Through carefully controlled comparative evaluation studies that are designed to include performance evaluation, we accurately assess different technological tradeoffs. These studies, which are of a quality so as to be published in archival venues, contribute to the diffusion of the project results more broadly through libraries and through time.

Data Capsule is an environment that utilizes a cluster of computers located within a secure network. Capsules have two modes of running. The first is an open mode, during which a user can upload tools, data, and software of their choice. During open mode, access to the restricted or sensitive data is blocked. In the second mode, a closed mode, all access to the Internet is blocked, and the channels to the restricted data are opened. This is where the tools that need to work with the sensitive data can be started up. Upon completion of a task, the user stores the results they wish to export to a special directory, where they are queued for manual review, and, upon successful review, the user is sent a URL from which download can occur.
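The two running modes and the review queue described above amount to a small state machine in which Internet access and restricted-data access are never available at the same time. A minimal Python sketch follows; the class and method names and the `example.org` download URL are placeholders invented for illustration, not part of the Data Capsule service.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    OPEN = "open"      # user may upload tools and data; restricted data blocked
    CLOSED = "closed"  # Internet blocked; channels to restricted data opened


@dataclass
class Capsule:
    """Toy model of one end-user Capsule (names are illustrative)."""
    mode: Mode = Mode.OPEN
    review_queue: list = field(default_factory=list)

    @property
    def internet_allowed(self) -> bool:
        return self.mode is Mode.OPEN

    @property
    def restricted_data_allowed(self) -> bool:
        return self.mode is Mode.CLOSED

    def switch(self, mode: Mode) -> None:
        self.mode = mode

    def queue_for_export(self, result_name: str) -> None:
        # Results are only queued; nothing leaves without a review.
        self.review_queue.append(result_name)


def release_reviewed(capsule: Capsule, approve) -> list:
    """Apply the reviewer's decision and return download URLs for approved
    results. The URL scheme is a placeholder for this sketch."""
    approved = [name for name in capsule.review_queue if approve(name)]
    capsule.review_queue.clear()
    return [f"https://example.org/results/{name}" for name in approved]
```

Because both access flags are derived from the single `mode` field, the sketch cannot represent a state in which the Internet and the restricted data are reachable together.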

The existing Data Capsule system will be migrated to utilize the Libvirt virtualization toolkit. The Data Capsule Controller is delivered as either a virtual machine image or multiple Docker containers, together with a set of configuration files for partners to customize for their particular environment. The Data Capsule Controller expects two communication endpoints from the partner site: APIs and a corresponding SDK/toolkit that can securely access the data collection to be used from capsules; and a trusted user authentication/authorization information relay to the Data Capsule Controller. Libvirt daemons are required to be running on all Data Capsule hosting servers. The Data Capsule Controller will provide RESTful APIs and a basic administration dashboard for partner sites to build customized front-end user interfaces. A separate database is needed to store the status of Capsules and their activities, as well as user computation results for the whole system. The Data Capsule Controller is expected to be lightweight enough to run as a single VM. The Docker container approach could provide further flexibility in packaging components and lower system resource consumption, though it may be more complicated to deploy [22], [23].

¹ The virtualization API: https://libvirt.org; runs on Linux, Windows, OS X, FreeBSD

One of the important tools in the Data Capsule environment is the Workset. As restricted collections cannot be moved outside of their secure storage and processing environment, users need a mechanism to save a persistent context of their sources that holds information about the state of their activities. HTRC uses the notion of the Workset - a machine-actionable personal research collection described using the Resource Description Framework (RDF) that consists of references to digital objects (e.g., volumes, pages, and so on) and metadata [18]. The Workset model combines pointers to, and metadata about, the generated resources and its selection procedures as well as metadata about bibliographic resources that went into its creation. It provides context and continuity through the research lifecycle, from its conception and creation to archiving, citation, and use by other researchers.
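To make the idea of a Workset as RDF more concrete, the sketch below serializes a tiny collection as N-Triples using only the Python standard library. It is an assumption-laden illustration: the `http://example.org/` namespace, the `Workset` class IRI, and the volume identifiers are placeholders, and the choice of Dublin Core terms (`title`, `creator`, `hasPart`) is made for this example rather than taken from the HTRC Workset vocabulary.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT = "http://purl.org/dc/terms/"   # Dublin Core terms namespace
EX = "http://example.org/"          # placeholder, not the HTRC vocabulary


def iri(value: str) -> str:
    """Wrap a string as an N-Triples IRI reference."""
    return f"<{value}>"


def literal(text: str) -> str:
    """Wrap a string as an N-Triples literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def workset_triples(workset_id, title, creator, volume_ids):
    """Describe a workset: its type, title, creator, and member volumes."""
    subject = iri(EX + "workset/" + workset_id)
    lines = [
        f"{subject} {iri(RDF_TYPE)} {iri(EX + 'vocab#Workset')} .",
        f"{subject} {iri(DCT + 'title')} {literal(title)} .",
        f"{subject} {iri(DCT + 'creator')} {literal(creator)} .",
    ]
    # Members are references to digital objects, not copies of their content.
    for volume_id in volume_ids:
        member = iri(EX + "volume/" + volume_id)
        lines.append(f"{subject} {iri(DCT + 'hasPart')} {member} .")
    return lines
```

The point of the sketch is that the Workset holds only pointers and metadata, so it can persist and be cited or shared while the restricted content itself stays inside its secure environment.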

The research questions/issues that we propose to investigate are:

● What are the uses of restricted collections in the context of delivering computational analytical services? How do collection providers and users construct their needs of transformative uses of the collection?

● How do collection-specific services, policies and uses affect the design of DC, and how can DC appliance fit within the library and its technological and organizational models? How do differently positioned actors within an organization influence that?

● Quantify the performance implications of certain design tradeoffs in extending and generalizing the Data Capsule system to meet the needs of a broad set of library uses and environments.

  ○ Include in the study an assessment of tradeoffs when considering libraries with less well equipped technical infrastructures

● Evaluate the tradeoffs for extending the Data Capsule system to allow user Capsules to utilize high-performance compute resources inside or external to an institution, and run large analysis tasks.

● Evaluate the different models for Workset use in the Capsule for different use and collection needs.

2.2 Specific activities

Element 1: Assessment

Work with partners to map out collection specifics and the contexts of their use; prioritize needs in co-design and implementation; organize events to bring participants together as a community. Employ parallel theoretical reflection and continuous exchange of knowledge.


Tasks

• Research team interviews partners to gather information about collections and the context of their uses, identifies collection-specific characteristics as well as work practices that may impact development and implementation of DC. Access restrictions, storage, security, and analytical needs as well as the relationships between collection users, stewards, and technical support will be included. User needs as seen by librarians or taken from previous feedback of actual users (e.g., types of data analysis, tools used) will also be identified.

• Examine policies and other factors that affect the use of restricted data and DC. Collect and analyze documents that govern access and use of the restricted collections.

• Organize community-building events possibly co-located with regional HTRC UnCamp events to increase participation; organize regular information-sharing sessions.

Outcomes

• Effective coordination, sharing, and networking with all partners

• Taxonomic knowledge about restricted collections and their policies and contexts of use

• Emerging sense of community

• Community building meetings

Element 2: Partner Engagement

Engage the technical team, Level 1 testing, and Level 2 partners in close cooperation. Level 1 testing partners each have an installation of Data Capsule on an experimental set of machines of their choice.

Tasks

• Technical team and Level 1 testing partners engage in mutual exchange about collection constraints, infrastructure constraints, technology options, and solutions for prototype demonstrations with partner collections. Carry out continuous installation, evaluation, and feedback cycles to refine.

• Engage library partners in Participatory Design. Participatory activities and evaluation of appliance, which will include demo of Data Capsule prototype and Workset reflecting co-designed functionality; installation of Data Capsule at Level 1 partners; continuous install of extensions at Level 1 partners, evaluation of improvements for all partners.

• Visit workplaces of Level 1 and 2 partners for purposes of information exchange, assessment and learning.

Outcomes


• Shared knowledge and understanding

• Participant-influenced design of technologies

• Better fit of technology to needs

• Lowered barriers to adoption for partners

Element 3: Data Capsule

Extend existing Data Capsule service to enable intuitive and yet secure computational access to restricted data in libraries. Evaluate extensions through demos, prototyped functionality, and evaluative studies.

Tasks

• Design and develop architecture for packaging Data Capsule as an appliance

• Extend the Data Capsule system’s architecture to
  i) Enforce proper access of restricted and sensitive collections,
  ii) Support access to multiple collections having diverse formats and types,
  iii) Support range of use models needed by partners. Implement selective changes in form of prototype demo for feedback.

• Design evaluative study of DC as capable of utilizing high performance or cloud computing resources to serve institutions with various resources including less equipped institutions. Carry out performance experiments to evaluate different design tradeoffs

Outcomes

• Extended codebase of Data Capsule packaged as an appliance with support for new collection types and use cases. Codebase released with appropriate user and developer documentation.

• Published proof of concept study of how Data Capsule can be scaled to use large-scale compute resources at an institution or at a cloud provider such as Amazon Web Services

• Published study of design tradeoffs in enhancements to support new use cases and access modes to restricted and sensitive collections

Element 4: Workset

Evaluate Worksets within the context of the project’s new uses and users to improve the utility and impact of Worksets in the scholarly research process.

Tasks

• Participate in assessment and participatory activities to gather information about the applicability of the current Workset model to specific collections.

• Design and carry out study that evaluates tradeoffs to extension of Workset model to accommodate the new uses of Data Capsules for computational access to restricted and sensitive collections.

• Bring Workset to state to participate in demos showcasing new Data Capsule functionality

• Actively engage library partners in exploring how best to educate users on optimal practices for Workset use and reuse.

Outcomes

• Educational materials for a researcher’s best utilization of the Workset notion in the distant analysis that this project enables

• Publishable study of design tradeoffs for extending Workset to additional collections and uses

2.3 Project management

The project will be led by Beth A. Plale with direct oversight and responsibility for project success. Dr. Kouper and Robert McDonald will serve as co-Directors. The leadership team including J. Stephen Downie at University of Illinois will meet weekly, and be joined once a month by the Level 1 Testing Library partners. Decision making is carried out through consensus building with the final decision resting with the project director.

Dr. Plale also brings technical expertise, and in this capacity will work closely with Dr. Yu (Marie) Ma, Dev/Ops manager of HathiTrust Research Center, to ensure that the technical staff members are tasked appropriately for the project needs and timelines. Dr. Inna Kouper will lead the project assessment and community building activities, using Participatory Design methods and carried out in collaboration with partner libraries. Robert H. McDonald will coordinate the partner libraries. Level 1 partner libraries will supervise prototyping and testing of digital collections. J. Stephen Downie will coordinate expertise on the Workset.

Bi-weekly video conferencing meetings carried out for community building will be held using the Zoom.us conferencing system that IU provides free to its research groups. Technical communication with Level 1 (and Level 2 as interested) partners, which tends to be frequent and short during joint efforts, will utilize a Slack.com channel. Stakeholder interactions will be via regular teleconferences and phone calls. User studies will be conducted online using screen-sharing and recording tools such as Zoom in addition to in-person visits.

Issues raised by library partners that need the immediate attention of the Data Capsule and Workset technical team can be submitted through the HathiTrust Research Center service desk, built on the Atlassian Jira Service Desk and bug tracking system. Software development and project management computers, grants management staff, and office space needed for the effort at Indiana University are provided by the Data To Insight Center. The other funded universities will provide similar resources needed for accomplishing tasks. We will utilize computer resources such as Amazon Web Services as needed for testing.


As this is a research grant, evaluation and performance measurements are built into the outcomes. That is, published results are amongst the planned outcomes. The findings from assessment and Participatory Design will be shared and discussed with developer and librarian teams during regular meetings. Ongoing feedback will be incorporated into the findings.

2.4 Project dissemination and sustainability

Recommendations from this project can be adopted in diverse library settings; the surveys and community building efforts can bring together many stakeholders in data, including researchers, librarians, university administrators, and funding agencies. Results of the project will be disseminated through multiple professional, academic, and social media channels.

Community building is a key part of the project. Community building user meetings from this project will be considered to become part of the regular HTRC UnCamps -- hybrid conference-workshop events already a part of HTRC’s community engagement plan. Changes to the Data Capsule codebase undertaken during this project will be committed back to a new project branch of the existing Data Capsule code repository (https://github.com/htrc/HTRC-DataCapsules). As an intended outcome of the Participatory Design framework of this project, library partners, especially Level 1 partners, will be actively contributing to the code branch by the end of the project. This will create a broader community around the codebase, thus giving a strong foundation for its sustainability. The changes to the Data Capsules system, including the Workset, are anticipated to also benefit the instance running in the HathiTrust Research Center, creating another pillar in the foundation of sustainability for the framework.

3. National Impact

The proposed project will have national impact through i) provision of a portable solution for accessing restricted and sensitive collections, ii) fostering a community and increased collaboration around the technical, organizational, and policy challenges of providing computational access to restricted collections, and iii) amplifying project outcomes through the connection to HathiTrust Consortium and its hundreds of member libraries. Our portable solution, once in shareable form, can be reused by other libraries around the country, where experts can improve the code and documentation as well as digital curation activities, and work with their users to develop new requirements and materials to use restricted digital collections in research and teaching. An emerging community will become part of the larger HathiTrust community and will continue stimulating libraries and research and non-profit organizations to join forces in further development and mutual learning and support. A strong sense of contribution and collaboration around community-sustained software will help to have a long-lasting impact.

Addressed needs: Through its development and participatory activities, this project will broaden access to digital collections that exist in libraries, including papers, letters, video-materials and many others. It will not only establish a community dedicated to working on solutions for restricted collections, but also develop a strong foundation for motivating and engaging future generations of library experts in developing innovative software and services. Project outcomes will address the library needs of providing scalable tools for working with digital collections, while respecting privacy, copyright, and confidentiality restrictions, and contribute to building the National Digital Platform as a distributed set of software applications and professional expertise that provide library content and services to all users in the US [24].

In addition to providing a strong prototype, we will help train librarians and professionals involved in developing technology via support from and collaborations with our technical team and via targeted community events. We will support communities of practice and strengthen libraries as partners in addressing the research and scholarship needs of computational research.

Resulting products: This project will result in the tangible products of extensions to the existing codebase for Data Capsule, guidelines and educational materials, and publications. The intangible product is community buy-in towards adoption and community involvement in ongoing contributions to the DC codebase. The tangible products enable proliferation of experience and facts beyond the immediate library partners, leading to increased adoption. Publications, for instance, are a tangible outcome that facilitates trust in technology and human work. Research is grounding for assessments of use.

Sustaining the benefit: The sustainability of the benefits of the proposed activity extends well beyond the period of funding. It is an important point that this activity will vault an existing and successful service into broader use through study and extension, and will do so in a way that builds its adopters (libraries) into the process thus growing the sustaining community through the grant duration.

Growing adopters and a sustaining community around the software codebase can take time, likely more time than the short grant duration. This risk is mitigated because the service itself is grounded in the HathiTrust Research Center, which stands behind the Data Capsule service as its primary service for computational analysis on the nearly 15 million volumes of the HathiTrust Digital Library. HTRC deeply welcomes this initiative to involve more partners. As an expected outcome of this project is to have partners outside the HTRC technical team making contributions to the code base, the HTRC commits to incorporating those changes back to the main branch of the Data Capsule codebase and to using the extensions in future releases of Data Capsule for its own and broader use.


Schedule of Completion

2017: Apr-Jun, Jul-Sep, Oct-Dec
2018: Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec
2019: Jan-Mar, Apr-Jun

Task

Award - May 2017

Task/element I: Assessment
Preparation for assessment
Assessment of collections, policies and contexts of use
Preparation for community building events
Community building events
Carry out publishable analyses of collected assessment and participatory design data
Support stakeholder/community interactions
Conduct online user studies
Publish training materials
Publish results

Task/element II: Partner engagement and evaluation
Plan DC install
First install in test environment
Partner campus visits
Guided hands on experience and cross institution learning
Co-design and evaluation of appliance
Demo DC and workset reflecting participatory design functionality
Continuous install, evaluation of improvements
Integrate project developments into DC codebase and release

Task/element III: Data capsule development
Design for appliance architecture
Development: code changes to package as appliance
Using feedback from assessment, refine design plans
Carry out publishable study that evaluates different design tradeoffs
Design evaluative study for DC as thin client to HPC resources
Carry out development study of DC as thin client
Evaluate and integrate changes in main DC branch
Publish results

Page 13: LG-71-17-0094-17 Indiana University Data To Insight Center · LG-71-17-0094 Indiana University Data To insight Center Narrative, p. 3 2. Project Design The project is structured to

LG-71-17-0094IndianaUniversityDataToinsightCenter

Scheduleofcompletion,p.2

Developandreleaseuseranddeveloperguides

Task/elementIV:Worksetstudyanddevelopment

Developstudyofworksetinthissetting

Conductstudyofworkset

Usingfeedbackfromassessment,refinedesignplans

Carryoutpublishablestudythatevaluatesdifferentdesigntradeoffs

Evaluateandintegratechangesinmainworkset/worksetbuilderbranch

Publishresults


OMB Control #: 3137-0092, Expiration Date: 7/31/2018 IMLS-CLR-F-0032 Digital product, pp. 1 of6


DIGITAL PRODUCT FORM

Introduction

The Institute of Museum and Library Services (IMLS) is committed to expanding public access to federally funded digital products (i.e., digital content, resources, assets, software, and datasets). The products you create with IMLS funding require careful stewardship to protect and enhance their value, and they should be freely and readily available for use and re-use by libraries, archives, museums, and the public. However, applying these principles to the development and management of digital products can be challenging. Because technology is dynamic and because we do not want to inhibit innovation, we do not want to prescribe set standards and practices that could become quickly outdated. Instead, we ask that you answer questions that address specific aspects of creating and managing digital products. Like all components of your IMLS application, your answers will be used by IMLS staff and by expert peer reviewers to evaluate your application, and they will be important in determining whether your project will be funded.

PART I: Intellectual Property Rights and Permissions

A.1 What will be the intellectual property status of the digital products (content, resources, assets, software, or datasets) you intend to create? Who will hold the copyright(s)? How will you explain property rights and permissions to potential users (for example, by assigning a non-restrictive license such as BSD, GNU, MIT, or Creative Commons to the product)? Explain and justify your licensing selections.

The formal products produced as outcomes of our proposed effort are software, training materials, user and developer documentation, and studies. We also anticipate intermediate products in the form of datasets derived from testing the connections to restricted and sensitive collections. The formal materials and software products resulting from this effort will be licensed using open and free licensing, e.g., Creative Commons and Apache 2.0-style licenses, following the best practice established by the HathiTrust Research Center (HTRC). Intermediate products emerging as a result of testing and experimentation will be discarded by the end of the project life. Operational use of a Data Capsule service at a partner institution is not anticipated over the course of the project. Should it occur, or should HTRC’s operational Data Capsule service be used for training, the data products emerging from end-user use of a Capsule will follow the HTRC policy of not imposing licensing restrictions on the products, provided that the Data Capsule service the end user is using is fully operational and the data products pass the review process (run by HTRC). If these conditions are not met, the data products are considered intermediate products and will be destroyed by the end of the project life.

A.2 What ownership rights will your organization assert over the new digital products and what conditions will you impose on access and use? Explain and justify any terms of access and conditions of use and detail how you will notify potential users about relevant terms or conditions.

Software products developed in this project will be openly shared and accessible via an open software repository (GitHub). As to access to the Data Capsule service, during the course of the project there will be test instances of the Data Capsule service running at the Level 1 library testing partner institutions, and an operational instance running at Indiana University as part of HTRC. We anticipate that the test instances of the Data Capsule service will have no end-user uses during the course of the project, as they will be under development. Training will be carried out on the operational HTRC instance of the Data Capsule service.

A.3 If you will create any products that may involve privacy concerns, require obtaining permissions or rights, or raise any cultural sensitivities, describe the issues and how you plan to address them.

As part of this project, we will be conducting interviews and taking notes during ethnographic observations. The data collected via interactions with human subjects will be stored securely and accessed by project investigators only. Such data will be shared only after appropriate anonymization or with explicit consent from participants. Additionally, restricted collections that will be used during testing in computational analysis in Data Capsules may raise copyright, privacy or other concerns. These concerns will be addressed through policy discussions with library partners; these discussions may be guided by HTRC’s policy developed to address similar concerns.

Part II: Projects Creating or Collecting Digital Content, Resources, or Assets

A. Creating or Collecting New Digital Content, Resources, or Assets

A.1 Describe the digital content, resources, or assets you will create or collect, the quantities of each type, and format you will use.

In the course of this project the following digital content will be created:

1. Extensions to Data Capsule service. The extensions will start from the existing HTRC codebase, which is organized in approx. 50 modules. It is expected that modifications will touch 10-20% of the code for partner customization.
2. Enhancements to the Workset model. This resource is an Ontology that can be expressed in RDF and/or XML formats. Enhancements will comprise about 10% of the resource.
3. Interview recordings and transcripts and field notes. See Part IV Datasets for more details.
4. Online manuals and training materials. Installation, testing and use of Data Capsule will be documented in online manuals and training materials, which will be openly accessible via the web.
5. Publications and presentations. Findings from the project will be disseminated via journals, conferences, and other venues. PDF documents and slides will be openly shared with the community, unless publishing restrictions apply.

A.2 List the equipment, software, and supplies that you will use to create the content, resources, or assets, or the name of the service provider that will perform the work.

The project activity will be to develop software extensions to existing codebases and conduct human-computer interaction studies. Activity does not extend to the creation of digital collections. We intend to use computers at Indiana University, University of Illinois, University of Virginia, UC Berkeley, and UCLA for testing and development. We expect Level 1 partners to have test servers available on which we will install the software (Data Capsule).

A.3 List all the digital file formats (e.g., XML, TIFF, MPEG) you plan to use, along with the relevant information about the appropriate quality standards (e.g., resolution, sampling rate, or pixel dimensions).

Software will exist in development formats, predominantly Java files, Python scripts, and XML configuration files. Partner libraries that use the operational Data Capsule service at HTRC for analyzing their restricted collections may have derived products in other formats that are appropriate in their respective user disciplines, such as tabular files or images. Quality standards for those derived products as well as quality challenges will be discussed during participatory design activities. Software quality will be monitored and evaluated by using "fitness for purpose" and structural analysis techniques.

B. Workflow and Asset Maintenance/Preservation

B.1 Describe your quality control plan (i.e., how you will monitor and evaluate your workflow and products).

For details on software quality control, see Part III.

The assessment is carried out by a PhD research faculty member who is highly trained in carrying out quality processes. Dr. Kouper has a strong record of publication-quality research in this area. Software development will use HTRC’s software development processes, including oversight by a DevOps Manager, help desk, and bug tracking. Studies of Data Capsule and Workset will be under the supervision of Plale and Downie, both full professors and accomplished scholars in this type of work.


B.2 Describe your plan for preserving and maintaining digital assets during and after the award period of performance. Your plan may address storage systems, shared repositories, technical documentation, migration planning, and commitment of organizational funding for these purposes. Please note: You may charge the federal award before closeout for the costs of publication or sharing of research results if the costs are not incurred during the period of performance of the federal award (see 2 C.F.R. § 200.461).

Software products will be shared, preserved and maintained using the open software repository GitHub. Technical documentation will be stored on GitHub as well as on the open HTRC wiki pages. We will encourage the HathiTrust community and the emerging Data Capsule community to further contribute to curation and preservation of the software. Products of research (publications, datasets, and presentations) will be preserved in the Indiana University institutional repository IUScholarWorks, which will serve as an additional preservation layer to traditional publication venues.

C. Metadata

C.1 Describe how you will produce any and all technical, descriptive, administrative, or preservation metadata. Specify which standards you will use for the metadata structure (e.g., MARC, Dublin Core, Encoded Archival Description, PBCore, PREMIS) and metadata content (e.g., thesauri).

README files and user and developer guides are the forms of documentation used to preserve software metadata. For datasets we will use Dublin Core to record descriptive, administrative, and preservation metadata.
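As an illustration of what a Dublin Core description of a dataset could look like, the following is a minimal sketch in Python. The field values, the wrapping `metadata` element, and the function name `dublin_core_record` are hypothetical and chosen only for this example; the proposal does not specify a serialization.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element -> value(s) as simple XML.

    The wrapping <metadata> element is an assumption made for this sketch.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for element, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values for a de-identified interview dataset.
record = dublin_core_record({
    "title": "Interview transcripts (de-identified)",
    "creator": "Data To Insight Center",
    "type": "Dataset",
    "format": "text/plain",
    "rights": "Shared after anonymization or with participant consent",
})
print(record)
```

A record like this could be stored next to the data files so that it migrates together with them.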

C.2 Explain your strategy for preserving and maintaining metadata created or collected during and after the award period of performance.

Metadata will be maintained as part of the software and data maintenance, i.e., it will be stored and migrated along with the digital products.

C.3 Explain what metadata sharing and/or other strategies you will use to facilitate widespread discovery and use of the digital content, resources, or assets created during your project (e.g., an API [Application Programming Interface], contributions to a digital platform, or other ways you might enable batch queries and retrieval of metadata).

As the project is not concerned with creating a digital collection, we will rely on other larger resources for widespread discovery and use, including HathiTrust Research Center networks, academic publishing databases, and software and institutional repositories.

D. Access and Use

D.1 Describe how you will make the digital content, resources, or assets available to the public. Include details such as the delivery strategy (e.g., openly available online, available to specified audiences) and underlying hardware/software platforms and infrastructure (e.g., specific digital repository software or leased services, accessibility via standard web browsers, requirements for special software tools in order to use the content).

Software and study products will be openly available online, unless the latter are restricted by the publishers.

D.2 Provide the name(s) and URL(s) (Uniform Resource Locator) for any examples of previous digital content, resources, or assets your organization has created.

The Data to Insight Center has its own group repository on GitHub where all software products are made available to the public: https://github.com/Data-to-Insight-Center. Most recent examples include Data MatchMaker (https://github.com/Data-to-Insight-Center/Data-MatchMaker) and PRAGMA Data (https://github.com/Data-to-Insight-Center/PRAGMA-Data-Repository).


Additionally, D2I contributions to HTRC code are made available via a separate repository, https://github.com/htrc, where the existing Data Capsule codebase can be found: https://github.com/htrc/HTRC-DataCapsules.

Part III. Projects Developing Software

A. General Information

A.1 Describe the software you intend to create, including a summary of the major functions it will perform and the intended primary audience(s) it will serve.

To accomplish the goals of this project, we will extend the Data Capsules service codebase. HTRC Data Capsule works by giving a researcher a virtual machine (VM) that runs within the HTRC domain. The researcher can configure the VM as they would their own desktop with their own tools. After they are done, the VM switches into a “secure mode”, where network and other data channels are restricted in exchange for access to the data being protected. Currently, Data Capsule works only with the HathiTrust Digital Library and within HTRC architecture. We will generalize the architecture to work with other collections and evaluate design, secure access and scalability options to work in specific library environments.
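The trade-off described above (open channels while configuring, protected data only after switching to secure mode) can be pictured as a small state model. The sketch below is purely conceptual and is not the Data Capsule implementation; the names `Mode`, `Capsule`, and `CONFIGURATION` are hypothetical labels introduced for illustration.

```python
from enum import Enum


class Mode(Enum):
    CONFIGURATION = "configuration"  # hypothetical name for the pre-secure state
    SECURE = "secure"


class Capsule:
    """Conceptual model of a capsule VM: either network or protected data."""

    def __init__(self):
        # A new capsule starts open so the researcher can install tools.
        self.mode = Mode.CONFIGURATION

    def network_allowed(self):
        # Network and other data channels are open only while configuring.
        return self.mode is Mode.CONFIGURATION

    def protected_data_accessible(self):
        # Protected data becomes readable only once channels are restricted.
        return self.mode is Mode.SECURE

    def enter_secure_mode(self):
        self.mode = Mode.SECURE


capsule = Capsule()
print(capsule.network_allowed(), capsule.protected_data_accessible())
capsule.enter_secure_mode()
print(capsule.network_allowed(), capsule.protected_data_accessible())
```

The point of the model is that the two permissions are never granted at the same time.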

A.2 List other existing software that wholly or partially performs the same functions, and explain how the software you intend to create is different, and justify why those differences are significant and necessary.

Comparable conceptual frameworks that intend to perform similar functions include Data Enclaves and Storage Capsules. Data Enclaves rely on customized virtualization software and a pre-defined set of tools to enable access. To the best of our knowledge, no working software exists that addresses the need to perform computational analysis on documents and resources using a researcher-defined set of tools. As the need for computational research on restricted collections using a large variety of tools grows, the development of such software is undoubtedly significant and necessary.

B. Technical Information

B.1 List the programming languages, platforms, software, or other applications you will use to create your software and explain why you chose them.

Data Capsule software is in Java, Python, and shell scripts.

B.2 Describe how the software you intend to create will extend or interoperate with relevant existing software.

The software extends the Data Capsule service.

B.3 Describe any underlying additional software or system dependencies necessary to run the software you intend to create.

Data Capsule uses open source virtualization infrastructure (QEMU and KVM), which needs to be installed for the capsule to work.

MySQL relational database system is used to store capsule metadata and results.

Data Capsule is provided for Ubuntu (Linux) environment.
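A partner preparing a test server might want a quick check that these dependencies are visible on the host. The following is a rough sketch only: the binary names `qemu-system-x86_64` and `mysql` and the device path `/dev/kvm` are assumptions about a typical Linux host, not requirements taken from the Data Capsule documentation, and the function name `check_host` is hypothetical.

```python
import os
import shutil


def check_host(which=shutil.which, exists=os.path.exists):
    """Report which assumed prerequisites are visible on this host.

    The lookups are injectable so the check can be exercised without
    the software actually being installed.
    """
    return {
        "qemu": which("qemu-system-x86_64") is not None,  # assumed binary name
        "kvm": exists("/dev/kvm"),                        # assumed device path
        "mysql_client": which("mysql") is not None,       # assumed binary name
    }


if __name__ == "__main__":
    for name, present in check_host().items():
        print(f"{name}: {'found' if present else 'missing'}")
```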

B.4 Describe the processes you will use for development, documentation, and for maintaining and updating documentation for users of the software.

The code will be forked in the GitHub repository, creating a new branch. Contributing developers will use their own environments to write code and then commit the code back to GitHub. We will use HTRC documentation and bug-tracking services (Atlassian Confluence and Jira) for maintaining and updating documentation for users of the software. At the end of the project online manuals will also be written.


B.5 Provide the name(s) and URL(s) for examples of any previous software your organization has created.

The Data to Insight Center has its own group repository on GitHub where all software products are made available to the public: https://github.com/Data-to-Insight-Center. Most recent examples include Data MatchMaker (https://github.com/Data-to-Insight-Center/Data-MatchMaker) and PRAGMA Data (https://github.com/Data-to-Insight-Center/PRAGMA-Data-Repository).

Additionally, D2I contributions to HTRC code are made available via a separate repository, https://github.com/htrc, where the existing Data Capsule codebase can be found: https://github.com/htrc/HTRC-DataCapsules.

C. Access and Use

C.1 We expect applicants seeking federal funds for software to develop and release these products under open-source licenses to maximize access and promote reuse. What ownership rights will your organization assert over the software you intend to create, and what conditions will you impose on its access and use? Identify and explain the license under which you will release source code for the software you develop (e.g., BSD, GNU, or MIT software licenses). Explain and justify any prohibitive terms or conditions of use or access and detail how you will notify potential users about relevant terms and conditions.

We will use the Apache 2.0 license to release Data Capsule. The license allows users to reproduce and distribute copies of the software and its derivatives with or without modifications. The license is applied by adding its notice to the header of each software file (see https://www.apache.org/licenses/LICENSE-2.0 for a copy of the license).
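Adding the notice to file headers can be automated. The sketch below prepends the standard Apache 2.0 boilerplate notice to a source string unless it is already present. The function name `with_license_header` and the copyright owner shown in the usage line are placeholders, not details from the proposal.

```python
# Standard Apache 2.0 boilerplate notice; year and owner are filled in per file.
NOTICE = """Copyright {year} {owner}

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License."""


def with_license_header(source, year, owner, prefix="#"):
    """Return source with the notice prepended as comment lines.

    The source is returned unchanged if it already carries the notice.
    """
    if "Licensed under the Apache License, Version 2.0" in source:
        return source
    lines = NOTICE.format(year=year, owner=owner).splitlines()
    header = "\n".join(f"{prefix} {line}".rstrip() for line in lines)
    return header + "\n\n" + source


# "Example Owner" is a placeholder copyright holder.
print(with_license_header("print('hello')\n", 2017, "Example Owner"))
```

For Java files the same function could be called with `prefix="//"`.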

C.2 Describe how you will make the software and source code available to the public and/or its intended users.

The source code extensions to the Data Capsule will be made available via GitHub (https://github.com/htrc) as a branch separate from the primary branch.

C.3 Identify where you will deposit the source code for the software you intend to develop:

Name of publicly accessible source code repository: GitHub

URL: https://github.com/htrc

Part IV: Projects Creating Datasets

A.1 Identify the type of data you plan to collect or generate, and the purpose or intended use to which you expect it to be put. Describe the method(s) you will use and the approximate dates or intervals at which you will collect or generate it.

Data will be collected via phone interviews and ethnographic observations, which involve note-taking, recording, and photographs. Phone interviews will be conducted at the beginning of the project. Follow-up interviews and additional recordings of conversations and note-taking will take place throughout the project as the need to document participant interactions arises.

A.2 Does the proposed data collection or research activity require approval by any internal review panel or institutional review board (IRB)? If so, has the proposed research activity been approved? If not, what is your plan for securing approval?

Data collection involves human subjects and requires IRB approval. An IRB application will be prepared and submitted when/if the project is approved for funding.

A.3 Will you collect any personally identifiable information (PII), confidential information (e.g., trade secrets), or proprietary information? If so, detail the specific steps you will take to protect such information while you prepare the data files for public release (e.g., data anonymization, data suppression PII, or synthetic data).

Participants can be identified in phone interviews, notes, and recordings. Personally identifiable information will be stored securely and only the PI and co-PIs will have access to it. Before public release of the dataset all PII will be removed (participants will be assigned coded numbers and any information that may identify them individually will be obscured in the interviews, notes, and transcripts).
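The coded-number step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the code format (`P001`, `P002`, ...) and the function names are hypothetical, and replacing known names covers only direct identifiers, so transcripts would still need manual review for indirect ones.

```python
import re


def assign_codes(names):
    """Map each participant name to a coded ID, in order of first appearance."""
    codes = {}
    for name in names:
        if name not in codes:
            codes[name] = f"P{len(codes) + 1:03d}"
    return codes


def redact(text, codes):
    """Replace every known participant name in text with its coded ID."""
    # Longer names first, so a full name is replaced before any shorter overlap.
    for name, code in sorted(codes.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(rf"\b{re.escape(name)}\b", code, text)
    return text


# Fictional names used only for the example.
codes = assign_codes(["Alice Smith", "Bob Lee", "Alice Smith"])
print(redact("Alice Smith spoke with Bob Lee.", codes))
```

The name-to-code mapping itself is PII and would be kept with the securely stored material, not with the released dataset.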

A.4 If you will collect additional documentation, such as consent agreements, along with the data, describe plans for preserving the documentation and ensuring that its relationship to the collected data is maintained.

Participants will be provided with informed consent forms, which they will sign. The forms will be stored securely and separately and the relationship to the collected data will be maintained via a study ID that will be recorded in the informed consent forms and in the data files.

A.5 What methods will you use to collect or generate the data? Provide details about any technical requirements or dependencies that would be necessary for understanding, retrieving, displaying, or processing the dataset(s).

The data will be collected via interviews and observations and will consist of text files, audio and video files, and photographs. Common word processing software and multimedia players may be used to display the data. Processed data may consist of additional spreadsheets and visualizations, which will be stored in non-proprietary formats (e.g., CSV or PNG).

A.6 What documentation (e.g., data documentation, codebooks) will you capture or create along with the dataset(s)? Where will the documentation be stored and in what format(s)? How will you permanently associate and manage the documentation with the dataset(s) it describes?

Codebooks will be created as part of the analysis of qualitative data (e.g., in the thematic coding procedures codes will be developed in the inductive manner, after close iterative reading of the interviews). Codes, their descriptions and other documentation that describes when and where the interviews and observations took place will be stored in text formats along with the data. The documentation will be associated with the datasets through consistent file naming and through identifiers that refer to each data collection effort separately.
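One way to realize consistent file naming is to lead every file name with the identifier of its data collection effort, so the matching codebook can be derived from any data file. The naming pattern, the identifier `DC01`, and the function names below are hypothetical; the proposal does not prescribe a scheme.

```python
from pathlib import PurePosixPath


def data_filename(effort_id, kind, sequence, ext):
    """Build a name such as DC01_interview_003.txt (pattern is an assumption)."""
    return f"{effort_id}_{kind}_{sequence:03d}.{ext}"


def codebook_for(filename):
    """Derive the codebook name from the effort identifier that leads the name."""
    effort_id = PurePosixPath(filename).name.split("_", 1)[0]
    return f"{effort_id}_codebook.txt"


name = data_filename("DC01", "interview", 3, "txt")
print(name, "->", codebook_for(name))
```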

A.7 What is your plan for archiving, managing, and disseminating data after the completion of the award-funded project?

The data will be managed and archived using the Scholarly Data Archive (backed-up storage for long-term archiving) and institutional Google Drive at Indiana University (for active work with data). Folders with appropriate permissions for data, processing scripts, IRB documentation, and publications will be created. For dissemination, we will use the IUScholarWorks repository and one of the publicly available repositories, such as Figshare or Mendeley.

A.8 Identify where you will deposit the dataset(s):

Name of repository: IUScholarWorks; Figshare; Mendeley Data

URL: scholarworks.iu.edu/dspace/; figshare.com; data.mendeley.com

A.9 When and how frequently will you review this data management plan? How will the implementation be monitored?

PIs will monitor the implementation of this data management plan. The plan will be reviewed every 6 months and adjusted according to the amounts and types of data generated.