
J Grid Computing (2018) 16:381–408 https://doi.org/10.1007/s10723-018-9453-3

INDIGO-DataCloud: a Platform to Facilitate Seamless Access to E-Infrastructures

D. Salomoni · I. Campos · L. Gaido · J. Marco de Lucas · P. Solagna · J. Gomes · L. Matyska · P. Fuhrman · M. Hardt · G. Donvito · L. Dutka · M. Plociennik · R. Barbera · I. Blanquer · A. Ceccanti · E. Cetinic · M. David · C. Duma · A. López-García · G. Moltó · P. Orviz · Z. Sustr · M. Viljoen · F. Aguilar · L. Alves · M. Antonacci · L. A. Antonelli · S. Bagnasco · A. M. J. J. Bonvin · R. Bruno · Y. Chen · A. Costa · D. Davidovic · B. Ertl · M. Fargetta · S. Fiore · S. Gallozzi · Z. Kurkcuoglu · L. Lloret · J. Martins · A. Nuzzo · P. Nassisi · C. Palazzo · J. Pina · E. Sciacca · D. Spiga · M. Tangaro · M. Urbaniak · S. Vallero · B. Wegh · V. Zaccolo · F. Zambelli · T. Zok

Received: 25 November 2017 / Accepted: 19 July 2018 / Published online: 7 August 2018
© The Author(s) 2018

Abstract This paper describes the achievements of the H2020 project INDIGO-DataCloud. The project has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of scientific communities, either locally or through enhanced interfaces. The middleware developed makes it possible to federate hybrid resources and to easily write, port and run scientific applications in the cloud. In particular, we have extended existing PaaS (Platform as a Service) solutions, allowing public and private e-infrastructures, including those provided by EGI, EUDAT, and Helix Nebula, to integrate their existing services and make them available through AAI services compliant with

D. Salomoni · A. Ceccanti · C. Duma
INFN - CNAF, Bologna, Italy

I. Campos (✉) · J. Marco de Lucas · A. López-García · P. Orviz · F. Aguilar · L. Lloret
IFCA, Consejo Superior de Investigaciones Científicas-CSIC, Santander, Spain
e-mail: [email protected]

L. Gaido · S. Bagnasco · S. Vallero · V. Zaccolo
INFN - Torino, Torino, Italy

G. Donvito · M. Antonacci
INFN - Bari, Bari, Italy

GEANT interfederation policies, thus guaranteeing transparency and trust in the provisioning of such services. Our middleware facilitates the execution of applications using containers on Cloud and Grid based infrastructures, as well as on HPC clusters. Our developments are freely downloadable as open source components, and are already being integrated into many scientific applications.

Keywords Cloud computing · Platform as a service · Containers · Software management · Advanced user interfaces · Authorization and authentication

P. Fuhrman
Deutsches Elektronen Synchrotron (DESY), Hamburg, Germany

I. Blanquer · G. Moltó
Institute of Instrumentation for Molecular Imaging - Universitat Politècnica de València, Valencia, Spain

M. Plociennik · M. Urbaniak · T. Zok
PSNC IBCh PAS, Poznań, Poland

M. Hardt · B. Ertl · B. Wegh
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany


1 Introduction

INDIGO-DataCloud was a European project that started in April 2015 with the purpose of developing a modular architecture and software components to improve how scientific work is supported at the edge of computing services development. Its main goal has been to deliver a Cloud platform addressing the specific needs of scientists in a wide spectrum of disciplines, engaging public institutions and private companies. It aimed at being as inclusive as possible, developing open source software exploiting existing solutions, adopting and enhancing state of the art technologies, and connecting with other initiatives and with leading commercial providers.

J. Gomes · M. David · L. Alves · J. Martins · J. Pina
Laboratory of Instrumentation and Experimental Particle Physics (LIP), Lisbon, Portugal

S. Fiore · A. Nuzzo · P. Nassisi · C. Palazzo
Fondazione Centro Euro-Mediterraneo sui Cambiamenti Climatici, Lecce, Italy

R. Barbera · R. Bruno · M. Fargetta
INFN - Catania, Catania, Italy

D. Spiga
INFN - Perugia, Perugia, Italy

L. Dutka
Cyfronet AGH, Krakow, Poland

P. Solagna · M. Viljoen · Y. Chen
EGI Foundation, Amsterdam, Netherlands

L. Matyska · Z. Sustr
CESNET, Prague, Czech Republic

A. M. J. J. Bonvin · Z. Kurkcuoglu
University of Utrecht, Utrecht, The Netherlands

L. A. Antonelli · A. Costa · S. Gallozzi · E. Sciacca
Istituto Nazionale di Astrofisica, Rome, Italy

E. Cetinic · D. Davidovic
Ruđer Bošković Institute, Zagreb, Croatia

M. Tangaro · F. Zambelli
Consiglio Nazionale delle Ricerche, Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari, Bari, Italy

F. Zambelli
Department of Biosciences, University of Milano, Milan, Italy

R. Barbera
Department of Physics and Astronomy, University of Catania, Catania, Italy

Since its inception, the project roadmap has been user community driven. Its main focus was on closing the existing technology gaps that hindered an optimal exploitation of Cloud technologies by scientific users. In order to do so, user requirements from several multidisciplinary scientific communities were collected and systematized into specific technical requirements. This process was carried out across the entire lifetime of the project, which allowed the update of existing requirements as well as the insertion of new ones, thus driving the project architecture definition and the technological developments.

The project also focused on delivering production-quality software, and thus defined procedures and quality metrics, which were followed by, and automatically checked for, all the INDIGO components. A comprehensive process to package and issue the INDIGO software was also defined. As an outcome, INDIGO delivered two main software releases (the first in August 2016, the second in April 2017), each followed by several minor updates. The latest release consists of about 40 open modular components, 50 Docker containers, and 170 software packages, all supporting up-to-date open operating systems. This result was accomplished by reusing and extending open source software and, whenever applicable, contributing code to upstream projects.

The paper is structured as follows. Section 2 contains a description of how the user requirements were collected and consolidated. From there, the INDIGO architecture is elaborated from the lower Infrastructure as a Service layer (Section 3), moving towards the Platform layer (Section 4), and arriving at the user interfaces (Section 5). The overall software development process is described in Section 6. Section 7 contains a summary of some usage patterns on how to leverage the INDIGO solutions to develop, deploy and support applications in a Cloud framework. The conclusions are laid out in Section 8. The list of upstream contributed software can be found in the Appendix.

1.1 Context and State of the Art

From the collection of user community requests, and its consolidation into technical requirements (see Section 2), we identified a number of technology gaps that today hinder an optimal scientific exploitation of heterogeneous e-infrastructures.


In this Section we elaborate on the general strategy to address those requirements, linking our developments with previous and existing work. The specific enhancements and developments are further elaborated in the corresponding sections.

Lack of proper federated identity support across several e-Infrastructures is a key issue from the researchers' perspective. The provision of effective distributed authentication and authorization in heterogeneous platforms is fundamental to support access to distributed infrastructures. Several efforts have been made in this context [1–5], but they were focused on specific infrastructures and services. Although some of these approaches have been used in production in specific e-Infrastructures [6], they are difficult to implement in a broader environment.

In parallel to the development of INDIGO-DataCloud, the Authentication and Authorisation for Research and Collaboration project (AARC) defined the AARC Blueprint Architecture [7]. This document describes a set of interoperable architecture building blocks for designing and implementing access management solutions for international research collaborations. Following the AARC recommendations we have developed several key components related to identity and access management, providing a framework compliant with the proposed blueprint architecture, as will be described in Section 3.3.

Facilitating the transparent execution of user applications across different computing infrastructures is also a key issue [8]. Advanced users nowadays have at their disposal tools to implement applications in Clouds provisioned in Infrastructure as a Service (IaaS) mode. Examples of such solutions are virtual appliances and contextualization [9] or container technologies [10].

The situation for non-Cloud resources in scientific facilities is completely different. Here we refer for instance to local clusters, Grid infrastructures and HPC systems. Such infrastructures are typically shared among many users with different requirements, so it is managerially and technically impossible to offer tailored environments to all of them. As a consequence, scientific users often need to follow a troublesome process to package and execute their applications.

To address this problem we have applied the technology of Docker containers [11] to facilitate application execution in multiuser environments. As a result we have provided a flexible user-level solution that gives autonomy to users in shared computing facilities [12]. Section 3.1 contains a thorough discussion of the strategy and outcomes.

Adoption of true Platform as a Service (PaaS) Cloud solutions is a common problem for scientific communities. The roots of this problem are, on the one hand, the non-interoperability of the interfaces [13, 14] and, on the other, the lack of true orchestration mechanisms across federated heterogeneous infrastructures. Both barriers made it difficult for users to adopt hybrid Cloud solutions.

In Section 4 we describe our approach and how we have tackled this problem by leveraging the OCCI [15–17] and TOSCA [18] open standards. In this regard we have not only supported those standards at the corresponding architectural levels [19, 20], but also made important contributions to both the standards' specifications and implementations. INDIGO has contributed to the networking parts of the OCCI standard, as well as to the improvement of the TOSCA support in the upstream OpenStack components: the Heat Translator and the TOSCA parser [21]. Our solution makes the execution of dynamic workflows [22–24] possible in a more consistent way across hybrid Clouds [25].

In this interoperability context, hybrid Cloud deployments, although possible [14, 26, 27], were complicated from a practical point of view and therefore user adoption has been hindered. By adopting INDIGO solutions users can now express their requirements and deploy them as applications over those hybrid infrastructures [28].

Linked with the previous statements, another outstanding gap was the lack of advanced scheduling features in Cloud environments [29]. Common cloud usage scenarios, being industry driven, do not take into account the unique requirements of scientific applications [30], leading to an inefficient utilization of the resources or to a non-optimal user experience.

Developments in this area can be found in the literature [31–34], where it becomes evident that there are many challenges to be addressed. Within INDIGO we focused (see Section 3.2) on the efficient sharing of resources among users following fair-share approaches (limiting the amount of resources that can be consumed by a user group), proper quota partitioning across different computing frameworks (like HPC and Cloud resources), and new Cloud computing execution models (like preemptible instances), as these are aspects that affect both users and resource providers.


Regarding storage support, INDIGO has made substantial contributions to storage-related entities and standardization bodies, such as the Research Data Alliance (RDA), where INDIGO has been highly involved in the Quality of Service, Data Life Cycle and Data Management Plans working group (now renamed to Storage Service Definitions). Moreover, INDIGO has also contributed this work to the SNIA CDMI standard, providing several extensions that have been included in SNIA reference implementations and documents.

2 Analysis of Requirements Coming from Research Communities

In order to guide our developments we performed an analysis of a number of use cases originating in several flagship research communities, in particular from the areas of High Energy Physics, Environmental modelling, Bioinformatics, Astrophysics and Social sciences. See Table 1 for the full list.

The deployment of customized computing back-ends, such as batch queues, including automatic elasticity, is among the features most demanded by researchers. The automation of the deployment of user-specific software in VMs or containers is also at the top of their wish list. Such automation is a must when it comes to simplifying the execution of applications in heterogeneous infrastructures. For similar reasons, highly specialized applications also require support for hardware accelerators and specialized hardware such as Infiniband, multicore systems, or GP/GPUs.

Often user communities are asking for terminal access to resources, workflow management and data handling, in a way that such access is linked

Table 1 Research Communities and use cases analyzed to extract general requirements

Research community: Application/Use case

LIFEWATCH (Biodiversity): Monitoring and Modelling Algae Bloom in a Water Reservoir. Support of hydrodynamic and water quality modelling, including data input-output management and visualization.

INSTRUCT (Bioinformatics): Molecular dynamics simulations. Support of molecular dynamics simulations of macromolecules that need specific hardware (GP/GPUs), using a pipeline of software that combines protocols that automate the steps for setup and execution of these simulations.

CTA (Astronomical data): Astronomical Data Archives. Data analysis and management using different tools such as data discovery, comparison, cross matching, data mining and also workflows. The use case could be described as follows: data production, data reduction, data quality, data handling and workflows, data publication and data link to articles.

Climate modelling: Intermodel comparison of data analysis for different climate models using the ENES platform (European Network for Earth System modelling).

EuroBioImaging (Bioinformatics): Medical Imaging Biobanks. The virtual Biobank integrates medical images from different sources and formats. This case study includes all the steps needed to manage the images, like analysis, storage, and pre- and post-processing. Privacy is a constraint to take into account for user management.

ELIXIR (Bioinformatics): Galaxy as a Cloud service. Deployment of a Galaxy instance that should support all the software/steps needed by the pipeline over, for example, a virtual cluster or cloud instances.

DARIAH (Social sciences): Transparent access to data catalogues and on-demand data management features.

Mastercode (HEP pheno): Complex combination of codes including legacy parts to perform combined analysis of data coming from particle detectors, astrophysics experiments, and dark matter detectors. Installation of these codes is in general very complex in multi-user farms. Providing a container based solution would simplify installation across infrastructures.

Lattice QCD (HEP): Lattice simulations run on large HPC facilities using low latency interconnects, producing large amounts of output. Accessing such facilities in Cloud mode would require implementing MPI parallel processing capabilities.


to a common Authorization and Authentication Infrastructure.

In order to generalize the requirements, we have extracted two generic usage scenarios which can support a wide range of applications in these areas. The first generic use case is computing oriented, while the second is data analysis oriented. For full details regarding the user communities and their detailed usage patterns, we refer to the user requirements deliverable of the project, publicly available in [35].

2.1 Computing Portal Service

The first generic user scenario is a computing portal service. In this scenario, computing applications are stored by the application developers in repositories as downloadable images (in the form of VMs or containers). Such images can be accessed by users via a portal, and require a back-end for execution; in the most common situation this is typically a batch queue. Support for parallel processing using containers is a requirement that comes up from the users as well.

The application consists of two main parts: the portal / Scientific Gateway and the processing worker nodes. The number of nodes available for computing should increase (scale out) and decrease (scale in) according to the workload. The system should also be able to do Cloud-bursting to external infrastructures when the workload demands it. Furthermore, users should be able to access and reference data, and also to provide their local data for the runs. A solution along these lines is shown in Fig. 1.
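The scale-out/scale-in behaviour described above can be sketched as a simple sizing policy. The function below is purely illustrative: its name, thresholds and per-node capacity are assumptions for the sketch, not part of any INDIGO component.

```python
# Toy auto-scaling policy for an elastic batch-queue back-end
# (function name, defaults and limits are illustrative assumptions).

def desired_nodes(queued_jobs, running_jobs,
                  jobs_per_node=4, min_nodes=1, max_nodes=10):
    """Return how many worker nodes the virtual cluster should have."""
    # Size the pool so all queued and running jobs fit (ceiling division),
    # clamped between the minimum and a cloud-burst ceiling.
    needed = -(-(queued_jobs + running_jobs) // jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))

assert desired_nodes(30, 8) == 10  # queue grows: scale out (capped at 10)
assert desired_nodes(0, 2) == 1    # workload drains: scale back in
```

A real deployment would run such a policy periodically and reconcile the actual node count against the desired one, bursting to an external provider once the local maximum is reached.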

A solution along these lines has been requested in the user scenarios coming from ELIXIR, WeNMR, INSTRUCT, DARIAH, Climate Change and LIFEWATCH.

2.2 Data Analysis Service

A second generic use case is described by scientific communities that have a coordinated set of data repositories and software services to access, process and inspect them.

Processing is typically interactive, requiring access to a console deployed on the data premises. The application consists of a console / Scientific Gateway that interacts with the data. In Fig. 2 we show a schematic view of such a use case. Examples of such consoles include R, Python or Ophidia. This can be a complementary scenario to the previous one, and it could also expose programmatic services.

The communities related to INSTRUCT, CTA, Climate Change, LIFEWATCH and Lattice QCD have requested related features.

Fig. 1 User community computing portal service


Fig. 2 Data analysis service

3 Developing for the Infrastructure as a Service (IaaS) Layer

INDIGO-DataCloud has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of modern scientific communities, either locally or through enhanced interfaces enabling those infrastructures to become part of a federation or to connect to federated platform layers (PaaS).

In this section we describe the highlights of this development work, which was needed to properly interface with the resource centers. This work has focused on virtualizing local compute, storage and networking resources (IaaS) and on providing those resources in a standardized, reliable and performant way to remote customers or to higher level federated services, building virtualized site-independent platforms.

The IaaS resources are provided by large resource centers, typically engaged in well-established European e-infrastructures. The e-infrastructure management bodies, or the resource centers themselves, will select the components they operate, and INDIGO will have limited influence on that process.

Therefore, INDIGO has concentrated on a selection of the most prominent existing components and has further developed the appropriate interfaces to high-level services based on standards. We have also developed new components where we felt that the required functionality was missing entirely.

The contribution of INDIGO to enhancing the flexibility of access to resources in Cloud and HPC infrastructures will be of paramount importance to enable transparent execution of applications across systems, promoting the development of the future European Open Science Cloud (EOSC) ecosystem [36].

As we describe below, new components are provided, or already existing components are improved, in the areas of computing, storage, networking and Authorization and Authentication Infrastructure (AAI). For almost all components we succeeded in committing modifications to the corresponding upstream software providers, and thereby significantly contributed to the sustainability model of the software.

3.1 Supporting Linux Containers

It is unquestionable that Docker is the most widely adopted Linux container technology. Therefore, in order to facilitate application delivery across multiple computational platforms, INDIGO has provided support for container execution, both interactively and through batch systems, in cloud and conventional clusters. This was achieved by developing new tools and extending existing ones.


The key middleware developed for this purpose is udocker [12].1 The novelty of udocker consists in enabling users to pull and execute Docker containers [11] without using or requiring the installation of the Docker software. By using udocker it is possible to encapsulate applications in Docker containers and execute them in batch or interactive systems where Docker is unavailable. It provides several different execution engines based on PRoot [37], runC [38] and Fakechroot [39]. None of the udocker engines requires root privileges for installation or execution, making them adequate for deployment and use by end-users without system administrator intervention. In addition, the PRoot and Fakechroot engines execute containers via pathname translation and therefore do not require the use of Linux namespaces.

Since udocker never requires privileges and executes as an unprivileged user, many of the security concerns associated with the Docker software are avoided. Udocker also supports GPGPU and MPI applications, making it adequate for executing containers in batch systems and HPC clusters. The udocker software suite is meant to be easily deployed by end-users: it only requires the download and execution of a Python script to quickly set up udocker within the user home directory. Udocker empowers end-users to execute Docker containers regardless of the Linux host environment.
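The pathname-translation approach used by the PRoot and Fakechroot engines can be illustrated with a toy function: every path a containerized process asks for is rewritten onto the extracted image tree in the user's home directory, with no namespaces or privileges involved. The function name and the directory layout shown are illustrative assumptions, not udocker's actual internals.

```python
import os.path

# Toy illustration of pathname translation (illustrative only; not the
# actual PRoot/Fakechroot implementation, which intercepts system calls).

def translate(container_path, root):
    """Map an absolute path seen inside the container onto the extracted
    image tree under `root`, as a translating engine does for every
    file-system access it intercepts."""
    # Normalise first so '..' sequences cannot escape the container root.
    normalized = os.path.normpath("/" + container_path)
    return os.path.join(root, normalized.lstrip("/"))

root = "/home/user/ROOT"  # hypothetical extracted image tree
assert translate("/etc/hosts", root) == root + "/etc/hosts"
assert translate("/../../etc/passwd", root) == root + "/etc/passwd"
```

The second assertion shows why normalisation matters: without it, a crafted path with `..` components could reach outside the container tree.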

Since its first release in June 2016 udocker has expanded quickly in the open source community. It has been adopted by a number of projects as a drop-in replacement for Docker, among them openmole, bioconda, the Common Workflow Language (cwl) and SCAR - Serverless Container-aware ARchitectures [40].

As an example, udocker is being used with great success to execute code produced by the MasterCode collaboration [41]. The MasterCode collaboration is concerned with the investigation of Supersymmetric models that go beyond the current status of the Standard Model of particle physics. It involves teams from CERN, DESY, Fermilab, SLAC, CSIC, INFN, NIKHEF, Imperial College London, King's College London, the Universities of Amsterdam, Antwerpen, Bristol, Minnesota and ETH Zurich. Examples and documentation can be found at https://github.com/indigo-dc/udocker.

1 https://github.com/indigo-dc/udocker

INDIGO has also developed bdocker,2 which provides a front-end to execute the Docker software in batch systems under restrictions and limits configurable by the system administrator (resource consumption, access to host directories, list of container images, etc.). It has been implemented for the SoGE (Son of Grid Engine) batch system but can be extended to other batch systems. This integrates the benefits of application delivery provided by Docker with the scheduling policies of the batch system. While udocker can be deployed directly by the end-user, bdocker is installed and configured by the system administrator to provide the users with the ability to run limited execution environments provided by Docker. Finally, ONEDock3 was developed to introduce support for the execution of containers as if they were Virtual Machines in OpenNebula-based on-premises Clouds, by supporting Docker as a hypervisor in this Cloud Management Platform. With this approach, applications and their execution environment packaged as Docker images can be instantiated on-demand through OpenNebula and provide SSH-based access for multiple users, with the advantage for the administrators of reduced overhead (such as low memory footprint) when compared to Virtual Machines.

3.2 Development of Advanced Scheduling Technologies

The end goal of this activity in INDIGO is to improve the performance of cloud management platforms by designing and implementing novel scheduling mechanisms and policies at the IaaS level. Enabling advanced scheduling policies to optimize the usage of the data center will clearly improve the response to the users.

To this end, cloud schedulers need to include support for postponing low priority workloads (by killing, preempting or stopping running containers or VMs) in order to allocate higher priority requests.

3.2.1 Preemptible Instances

INDIGO pushed the state of the art in scheduling technologies by implementing preemptible instances on

2 https://github.com/indigo-dc/bdocker
3 https://github.com/indigo-dc/onedock


top of the OpenStack [42] cloud management framework: opie.4 OpenStack preemptible instances (opie) is the materialisation of the preemptible instances extension, serving as a reference implementation.

Preemptible instances differ from regular ones in that they are subject to being terminated by an incoming request for provisioning of a normal instance. If bidding is in place, this special type of instance could also be stopped by a higher priority preemptible instance (higher bid). Not all applications are suitable for preemptible execution; only fault-tolerant ones can withstand this type of execution. On the other hand, these are highly affordable VMs that allow providers to optimize the usage of their available computing resources (i.e. backfilling).

The opie package provides a set of pluggable extensions for OpenStack Compute (nova) making it possible to execute preemptible instances using a modified filter scheduler. This solution has gained great interest from the scientific community and commercial partners, and is under discussion to be introduced in the upstream OpenStack scheduler.
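The backfilling idea behind preemption can be sketched as follows: when a normal request arrives and the free capacity is insufficient, the scheduler terminates just enough preemptible instances to make room. The function name, the largest-first victim ordering and the data shapes below are illustrative assumptions, not opie's actual API.

```python
# Toy preemption pass for a backfilling scheduler (illustrative only).

def select_victims(free_cores, request_cores, preemptible):
    """Pick preemptible instances (name -> cores) to terminate so that
    an incoming normal request of `request_cores` cores fits."""
    if request_cores <= free_cores:
        return []  # enough capacity is already free: nothing to preempt
    victims = []
    # Terminate the largest instances first to keep the victim count low.
    for name, cores in sorted(preemptible.items(),
                              key=lambda kv: kv[1], reverse=True):
        victims.append(name)
        free_cores += cores
        if request_cores <= free_cores:
            return victims
    raise RuntimeError("request cannot be satisfied even with preemption")

running = {"spot-a": 4, "spot-b": 2, "spot-c": 8}
assert select_victims(2, 9, running) == ["spot-c"]  # one 8-core kill suffices
assert select_victims(6, 4, running) == []          # fits in free capacity
```

A production scheduler would of course weigh further factors (bids, instance age, host placement); the sketch only shows the core make-room decision.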

3.2.2 Implementing Advanced Scheduling in OpenStack and OpenNebula

In IaaS private clouds the computing and storage resources are statically partitioned among projects. A user is typically a member of one project, and each project has its own fixed quota of resources defined by the cloud administrator. A user request is rejected if the project quota has already been reached, even if unused resources allocated to other projects would be available.

This rigid resource allocation model strongly limits the global efficiency of the data centres, which aim to fully utilize their resources for optimizing costs. In traditional computing clusters the utilization efficiency is maximized through the use of a batch system with sophisticated scheduling algorithms plugged in. However, this feature is missing in the most popular cloud middlewares. In the course of INDIGO we have developed support for advanced scheduling policies, such as intelligent job allocation based on fair-share algorithms, for both the OpenStack and OpenNebula cloud frameworks.

4https://github.com/indigo-dc/opie

• Synergy5 is an advanced service, interoperable with the OpenStack components, which implements a new resource provisioning model based on pluggable scheduling algorithms. It maximizes resource usage while guaranteeing a fair distribution of resources among users and groups.

The service also provides a persistent queuing mechanism for handling user requests that exceed the current overall resource capacity. These requests are processed, according to a priority defined by the scheduling algorithm, when the required resources become available.

• The scheduling capabilities of OpenNebula have been enhanced with the development of one-FaSS (FairShare Scheduler for OpenNebula).6 In OpenNebula the stock scheduler is first-in-first-out (FIFO). One-FaSS grants fair access to dynamic resources, prioritizing tasks according to an initial weight and to historical resource usage.
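The essence of fair-share prioritization can be illustrated with a toy calculation (our own sketch, not the actual Synergy or one-FaSS algorithm): a group's effective priority starts from its assigned weight and is penalized by its fraction of historical resource consumption, so that lighter consumers are scheduled first.

```python
# Toy fair-share priority (illustrative sketch, not Synergy/one-FaSS code):
# combine an initial weight with a penalty for historical usage.

def fair_share_priority(weight, usage, total_usage):
    """Higher priority for groups that consumed less than their fair share."""
    actual = usage / total_usage if total_usage else 0.0
    return weight / (1.0 + actual)  # penalize historical over-consumption

# Two groups with equal 50% shares; group A consumed 80% of past usage.
prio_a = fair_share_priority(0.5, 80, 100)
prio_b = fair_share_priority(0.5, 20, 100)
assert prio_b > prio_a  # the lighter consumer is dispatched first
```

Real fair-share schedulers additionally decay old usage over time and combine per-user and per-group shares hierarchically; this sketch only shows the core ordering idea.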

The project has also developed tools to facilitate the management of hybrid data centres, that is, centres where both batch-based and cloud-based services are provided. Physical computing resources can play both roles in a mutually exclusive way. The Partition Director7 takes care of commuting the role of one or more physical machines from Worker Node (member of the batch system cluster) to Compute Node (member of a cloud instance) and vice versa.

The current release only works with the IBM Platform LSF batch system (version 7.0x or higher) and OpenStack cloud manager instances (Kilo or newer). The main functionalities are switching the role of selected physical machines from the LSF cluster to the OpenStack one and vice versa, and managing the intermediate transition states to ensure consistency.

3.3 Development of Authorization and Authentication Infrastructures

INDIGO has provided the necessary components to offer a commonly agreed Authentication and Authorization Infrastructure (AAI). The INDIGO IAM8

5https://github.com/indigo-dc/synergy-service
6https://github.com/indigo-dc/one-fass
7https://github.com/indigo-dc/dynpart
8https://github.com/indigo-dc/iam


(Identity and Access Management) service provides user identity and policy information to services so that consistent authorization decisions can be enforced across distributed services.

IAM has a big impact on the end-user experience. It provides a layer where identities, enrollment, group membership, other attributes and authorization policies on distributed resources can be managed in a homogeneous way, building on the federated authentication mechanisms supported by the INDIGO AAI.

The INDIGO AAI solution pioneers the use of OpenID Connect (OIDC) in the SP-IdP proxy. INDIGO has contributed to the upstream components whenever needed to enable OpenID Connect (namely in OpenStack Keystone and Apache Libcloud).

INDIGO-DataCloud provides a flexible Authentication and Authorization Infrastructure (AAI) whose main components are depicted in Fig. 3.

In order to perform authentication and authorization in a consistent way, services rely on the information provided by the central IAM service.

The Login Service component implements brokered authentication: users can authenticate with any of the supported mechanisms (SAML, OpenID Connect, X.509 certificates, local username/password). Identity and authorization information is then exposed to services via standard OAuth/OpenID Connect protocols. This approach simplifies integration with off-the-shelf components and does not burden every service in the infrastructure with the complexity of handling multiple credential types. It also has a big impact on the end-user experience, as it allows users to centralize the management of their credentials and provides a consistent login experience across all services in the system.
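From a relying service's point of view, brokered authentication means that any bearer token, whatever credential the user originally presented, can be resolved into identity attributes through the standard OIDC userinfo endpoint. The sketch below illustrates this with the Python standard library; the IAM base URL is a placeholder, and real deployments would discover endpoints from the provider's OIDC discovery document.

```python
# Sketch of resolving an OAuth bearer token into identity attributes via
# the standard OIDC userinfo endpoint. The URL below is an assumed
# placeholder, not a real INDIGO deployment.
import json
import urllib.request

IAM_USERINFO = "https://iam.example.org/userinfo"  # assumed deployment URL

def build_userinfo_request(access_token):
    """Build the OIDC userinfo request carrying the bearer token."""
    return urllib.request.Request(
        IAM_USERINFO,
        headers={"Authorization": "Bearer " + access_token},
    )

def resolve_user(access_token):
    """Call the userinfo endpoint; returns claims such as sub and groups."""
    with urllib.request.urlopen(build_userinfo_request(access_token)) as resp:
        return json.load(resp)
```

Because every service performs this same lookup against the central IAM, authorization decisions stay consistent regardless of which credential type the user logged in with.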

The Group Membership Service component provides the tools to manage registration and enrollment flows for the collaboration, organize users into groups and manage the user account life cycle.

The Authorization Service component, based on the Argus Authorization Service [43], provides the ability to define fine-grained policies for the collaboration, leveraging the flexibility of a XACML policy engine [44]. This component also provides a policy composition and distribution mechanism that is used to ensure consistent authorization across the distributed infrastructure.

Fig. 3 Architecture of the INDIGO identity and access management service


The Provisioning Service component exposes a SCIM API [45] to provision information about the collaboration users to relying services. This mechanism is useful, for instance, to manage the lifecycle of resources at a site (e.g., local UNIX accounts) depending on the lifecycle of IAM user account information. As an example, accounts could be provisioned automatically across the infrastructure and configured with the SSH public keys registered in the central IAM at user registration time, and disabled when the user membership in the IAM expires or is suspended due to a security incident.
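For illustration, a minimal SCIM 2.0 user resource of the kind such a provisioning mechanism could push to a site might look as follows. The core attribute names follow RFC 7643; the SSH-key extension attribute and all values are hypothetical, invented for this sketch.

```python
# Minimal SCIM 2.0 user resource (core attributes per RFC 7643).
# The SSH-key extension URN and all values are illustrative only.
import json

def scim_user(username, email, ssh_key, active=True):
    return {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": username,
        "active": active,  # flipped to False on expiry or suspension
        "emails": [{"value": email, "primary": True}],
        # SSH keys are not a core SCIM attribute; shown as a hypothetical
        # extension of the kind a deployment might define.
        "urn:example:params:scim:schemas:extension:ssh:2.0:User": {
            "sshPublicKey": ssh_key
        },
    }

payload = json.dumps(scim_user("jdoe", "jdoe@example.org", "ssh-rsa AAAA..."))
```

A site agent consuming such payloads can create the local UNIX account on first sight of the user and disable it whenever `active` becomes false.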

Finally, to integrate services that do not speak OpenID Connect natively, IAM integrates with WaTTS, the INDIGO Token Translation Service.9

WaTTS can translate identity and authorization information about a user, provided as OpenID Connect tokens, into various credential types. This allows the provision of services that do not normally support federated identities to federated users.

IAM is used internally by the PaaS layer to handle the authorization of each user to the services, as well as group membership and role management for each user. Users may present themselves with any of the three supported token types (X.509, OpenID Connect, SAML) and are able to access resources that support any of them.

3.4 Virtual Networks

The INDIGO PaaS layer has been developed to exploit a wide range of cloud management frameworks (e.g. OpenStack, OpenNebula, Google Compute Engine, Microsoft Azure, Amazon EC2) and to combine resources provided by these frameworks to enable the deployment of complex distributed applications. Each cloud management framework may exhibit a different native API, and often these APIs can be configured in different ways. This heterogeneity constitutes a challenge when transparent instantiation of cloud resources across multiple frameworks is required. The INDIGO PaaS layer supports both the common native APIs and the Open Cloud Computing Interface10 (OCCI). The OCCI specification is a standard from the Open Grid Forum11 that provides

9https://github.com/indigo-dc/tts
10http://occi-wg.org/
11https://www.ogf.org

a flexible and extensible API to access and manage cloud resources. Although OCCI provides a convenient uniform API, its support for managing the network environment is limited with regard to the setup of public/private network accessibility. Depending on the actual cloud management framework being used, the target cloud may need manual network configuration prior to the use of OCCI. To address these problems, INDIGO has defined and implemented an OCCI network extension that allows the network environment to be properly set up via the OCCI API regardless of the underlying cloud management framework.

For OpenNebula [46] sites the solution consists of a Network Orchestrator Wrapper (NOW)12 and a corresponding backend in the rOCCI-server.13 NOW enforces site-wide policy and network configuration by making sure that only LANs designated by site administrators are made available to users, and that users cannot reuse LANs assigned to others while they remain reserved. NOW has been released with INDIGO, and the backend has been contributed to the upstream rOCCI-server distribution.

For OpenStack, the OpenStack OCCI Interface (OOI) has been extended with support for advanced networking functions provided by OpenStack's Neutron component, such as router, network and subnet setup. The contribution was accepted upstream and is distributed with the OOI implementation.

The networking features of the OCCI gateway for Amazon's EC2 API were adjusted to make sure that the model of setting up and using local virtual networks is in accordance with the model used in the other cloud management frameworks.

In addition, a Virtual Router was implemented, allowing networks to span cloud sites, potentially geographically distant, so that a custom networking environment can be set up even if the resources are allocated in different cloud sites. The virtual router is a virtual machine that can be started via OCCI and makes use of OpenVPN14 to implement network tunnels. Virtual Routers can be instantiated by the PaaS layer to orchestrate the interconnection of virtual machines across cloud providers.

12https://github.com/indigo-dc/now
13https://github.com/the-rocci-project/rOCCI-server
14https://openvpn.net


4 Architecture of the Platform as a Service

Generally speaking, a Platform as a Service (PaaS) is a software suite which is able to receive programmatic resource requests from end users and to execute these requests by provisioning resources on some e-infrastructure. There are already many examples in the industrial sector, where open source PaaS solutions (e.g. OpenShift [47] or Cloud Foundry [48]) are being deployed to support the work of companies in different sectors.

Supporting scientific users is in general more complex than supporting commercial activities, because of the heterogeneous nature of the infrastructures at the IaaS level (i.e. the resource centres) and the inherent complexity of scientific work requirements. The key point is to find the right agreement to unify interfaces between the PaaS and IaaS levels.

The Infrastructure Manager [49] (IM) has been used to address the IaaS level. The IM is able to deploy complex and customized virtual infrastructures on IaaS cloud deployments (such as AWS, OpenStack, etc.). It automates the deployment, configuration, software installation, monitoring and update of the virtual infrastructure on multiple cloud back-ends.

In the framework of INDIGO the IM has extended its capabilities. In particular, in the PaaS it is used by the Orchestrator (see below) to provision and configure the virtual infrastructure required to support the scientific applications involved in the project.15

In order to better adapt to the wide range of use cases provided by the user communities, we decided to take a different approach from many of the most widely used PaaS solutions: ours is based on the concept of orchestrating complex clusters of services and on the possibility of automating the actions needed to implement the use cases. This approach proved very successful, as it made it possible to support legacy applications as well and did not depend on the language in which the application is written.

Figure 4 shows the general interaction between the IaaS and PaaS layers. The Orchestrator provides the entry point to the PaaS layer, with its ability to decide the most appropriate site on which to deploy a certain application architecture described in TOSCA templates. INDIGO-DataCloud fosters local-site orchestration and, therefore, depending on

15https://github.com/indigo-dc/im

the underlying cloud management framework of the cloud site, the TOSCA template is translated into the specific orchestration component of OpenStack (Heat) or delegated to the Infrastructure Manager (IM) for execution on OpenNebula-based cloud sites. Since both virtual machines and containers can be provisioned on the underlying cloud site, the Virtual Machine Images available at each site are registered in the Information System, and container images are pre-staged to the cloud sites to reduce deployment times.

INDIGO has provided a working PaaS layer orchestrating heterogeneous computing and storage resources. Using the PaaS Orchestrator together with the IM and TOSCA templates, end users are able to exploit computational resources without knowledge of the IaaS details. In the following we describe the main technologies employed to build the PaaS.

4.1 PaaS Layer and Microservices Architecture

The PaaS layer should be able to hide complexity and federate resources for both computing and storage. To this end we have applied current technologies based on lightweight containers and related virtualization developments using microservices.

Kubernetes [50], an open source platform to orchestrate and manage Docker containers, is used to coordinate the microservices in the PaaS layer. Kubernetes is extremely useful for the monitoring and scaling of services, and for ensuring their reliability. The PaaS manages the needed microservices using Kubernetes, for example to select the right endpoint for the deployment of applications or services. The Kubernetes solution is used in the PaaS layer as provided by the community.

The microservices that compose the PaaS layer are very heterogeneous in terms of development: some of them were developed ad hoc, others were already available and used as they are, and a few others have been significantly modified in order to implement new features within INDIGO.

The language in which the INDIGO PaaS receives end-user requests is TOSCA [18]. TOSCA stands for Topology and Orchestration Specification for Cloud Applications. It is an OASIS specification for the interoperable description of application and infrastructure cloud services, the relationships between parts of these services, and their operational behaviour.


Fig. 4 Interaction between the IaaS and PaaS layers

TOSCA has been selected as the language for describing applications due to the wide adoption of this standard, and because it can be used as the orchestration language for both OpenNebula (through the IM) and OpenStack (through Heat).

The released INDIGO PaaS layer (see Fig. 5) is able to provide automatic distribution of applications and/or services over a hybrid and heterogeneous set of IaaS infrastructures, on both private and public clouds.

The PaaS layer is able to accept the description of a complex set, or cluster, of services/applications by means of TOSCA templates, and provides the needed brokering features in order to find the best-fitting resources. During this process, the PaaS layer is also able to evaluate data distribution, so that the resources requested by the users are chosen according to their closeness to the storage services hosting the data required by those applications/services.
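To give a flavour of what such a description looks like, the sketch below builds a minimal TOSCA topology as a Python dictionary (deployments submit these as YAML documents). The node type and property names follow the TOSCA Simple Profile; real INDIGO templates additionally use project-specific custom node types, which are not shown here.

```python
# Minimal TOSCA topology built as a Python dict for illustration.
# Type/property names follow the TOSCA Simple Profile in YAML.
minimal_template = {
    "tosca_definitions_version": "tosca_simple_yaml_1_0",
    "topology_template": {
        "node_templates": {
            "my_server": {
                "type": "tosca.nodes.Compute",
                "capabilities": {
                    # host capability: hardware requirements of the node
                    "host": {
                        "properties": {"num_cpus": 2, "mem_size": "4 GB"}
                    }
                },
            }
        }
    },
}
```

The same structure, serialized as YAML, is what the Orchestrator receives and hands over to Heat or the IM, depending on the target site.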

4.1.1 The Orchestrator Engine

The INDIGO PaaS Orchestrator16 is a core component of the PaaS layer: it orchestrates the provisioning of virtualized compute and storage resources on Cloud Management Frameworks (such as OpenStack and OpenNebula) and on Mesos clusters.

It receives deployment requests, expressed through templates written in TOSCA, and deploys them on the best available cloud site. In order to select the best site, the Orchestrator implements a complex workflow: it gathers information about the SLAs signed by the providers and monitoring data about the availability of the compute and storage resources. Finally, the Orchestrator asks the Cloud Provider Ranker17 to provide a ranked list of the best cloud

16https://github.com/indigo-dc/orchestrator
17https://github.com/indigo-dc/cloudproviderranker


Fig. 5 Architecture of the INDIGO platform as a service layer

providers according to an algorithm described below. The Orchestrator's mission is therefore to coordinate the deployment process over the IaaS platforms. See Fig. 6 for an overview of the Orchestrator architecture.

The Orchestrator builds on developments already carried out in other publicly funded projects, such as the Italian PON projects PRISMA [51] and Open City Platform [52]. During the INDIGO project, this component has been extended and enhanced to support the specific microservices building the INDIGO PaaS layer. It delegates the actual deployment of resources to the IM, OpenStack Heat or Mesos frameworks, based on the TOSCA templates and the ranked list of sites.

A very innovative component is the Cloud Provider Ranker. This is a standalone REST web service which ranks cloud providers on the basis of rules defined per user/group/use case, with the aim of fully decoupling the ranking logic from the business logic of the Orchestrator.

It allows the consumers of the service (one or more orchestrators) to specify preferences on cloud providers. If preferences have been specified for some providers, they take absolute priority over any other provider information (such as monitoring data). When preferences are not specified, the rank of each provider is calculated, by default, as the sum of the SLA ranks and a combination of monitoring data, conveniently normalized with weights specified in the Ranker configuration file. Moreover, the ranking algorithm can be customized to specific needs.
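A simplified sketch of this ranking rule (our own illustration, not the actual Drools rule set, with invented site names and weights) could look like the following, where explicit preferences always outrank any computed score:

```python
# Toy version of the described ranking rule: explicit preferences take
# absolute priority; otherwise rank = SLA rank + weighted monitoring data.

def rank(provider, weights, preferences=()):
    """Return a sortable score; preferred providers always come first."""
    if provider["name"] in preferences:
        # earlier position in the preference list = better;
        # the leading 1 puts preferred providers ahead of all scored ones
        return (1, -preferences.index(provider["name"]))
    score = provider["sla_rank"] + sum(
        weights[m] * provider["monitoring"][m] for m in weights
    )
    return (0, score)

providers = [
    {"name": "site-a", "sla_rank": 0.6,
     "monitoring": {"availability": 0.99, "free_cpus": 0.3}},
    {"name": "site-b", "sla_rank": 0.8,
     "monitoring": {"availability": 0.95, "free_cpus": 0.7}},
]
weights = {"availability": 0.5, "free_cpus": 0.5}
best_first = sorted(providers, key=lambda p: rank(p, weights), reverse=True)
```

With these invented numbers site-b wins on the computed score, but passing `preferences=("site-a",)` would place site-a first regardless of monitoring data, mirroring the absolute-priority rule described above.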

This is a completely new service, fully implemented within the INDIGO project; it is based on an open source tool, Drools,18 in order to reduce the needed development effort and to simplify long-term support.

In summary, one of the main achievements of INDIGO in this respect is the implementation of TOSCA templates on IaaS frameworks that do not support TOSCA natively, such as standard OpenStack, OpenNebula, or public clouds (e.g. Microsoft Azure, AWS, OTC, ...), using the Infrastructure Manager.

18https://www.drools.org


Fig. 6 Architecture of the orchestrator within the PaaS layer

In particular, using the PaaS Orchestrator and TOSCA templates, the end user can exploit computational resources without knowledge of the IaaS details: indeed, the TOSCA standard language ensures that the same template can be used to describe a virtual cluster on different cloud sites; the Infrastructure Manager then implements the TOSCA runtime for contacting the different cloud sites through their native APIs. The provisioning and configuration of the IaaS resources is therefore accomplished in a way that is completely transparent to the end user. The same approach is also used for submitting dockerized applications and services to a Mesos cluster (and its frameworks Marathon and Chronos): the user can describe the request in a TOSCA template, and the Orchestrator provides the TOSCA runtime for contacting the Mesos master node, submitting the request and monitoring its status on behalf of the user, as detailed in the next section.

4.1.2 High-Level Geographical Application/Service Deployment

INDIGO has developed the tools and services to provide a solution for orchestrating Docker containers for both applications (job-like execution) and long-running services. Mesos/Marathon/Chronos19 is used to manage the deployment of services and applications (MSA service).

From a resource perspective, Mesos is a cluster management tool: it pools several resource centres to be centrally managed as a single unit. From an application perspective, Mesos is a scheduler: it dispatches workloads to consume pooled resources, scaling up to thousands of nodes.

Mesos is fault-tolerant, as it is possible to replicate the master process. The INDIGO PaaS also uses Marathon and Chronos. Marathon is used to deploy, monitor and scale long-running services, ensuring that they are always up and running. Chronos is used to run user applications (jobs), taking care of fetching input data, handling dependencies among jobs and rescheduling failed jobs.

The submission of jobs therefore follows an approach very similar to a batch system, exploiting resources wherever they are, without the user even knowing the details. This includes deployment on multiple IaaS infrastructures, both private and public, hiding the complexity of the distributed resources from the end users.
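For illustration, a dockerized job of this kind could be described with a Chronos-style job definition such as the one sketched below. All field values are invented for this example, and the exact schema should be taken from the Chronos documentation rather than from this sketch; the point is only that the Orchestrator can build and submit such a JSON document on the user's behalf.

```python
# Illustrative Chronos-style job definition (values are invented examples).
# The Orchestrator would POST such a document to the Chronos REST API.
import json

job = {
    "name": "process-dataset",
    "command": "python /opt/analysis/run.py --input /data/in",
    "schedule": "R1//PT0S",  # example: run once, as soon as possible
    "container": {"type": "DOCKER", "image": "example/analysis:latest"},
    "cpus": 1.0,
    "mem": 2048,
    "retries": 2,            # reschedule failed runs a limited number of times
}
payload = json.dumps(job)
```

The batch-like semantics described above (dependency handling, rescheduling of failures) are then provided by Chronos itself, not by the submitting client.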

19https://github.com/indigo-dc/mesos-cluster


Using the CLUES20 (Cluster Energy Saving) service it is also possible to auto-scale (up and down), depending on the load, several types of clusters: Mesos, SLURM, PBS, HTCondor, etc. CLUES is an elasticity manager system for HPC clusters and cloud infrastructures that features the ability to power on/deploy working nodes as needed (depending on the job workload of the cluster) and to power off/terminate them when they are no longer needed.
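The elasticity rule can be caricatured as follows (a toy sketch of the idea, not CLUES code): power nodes on when queued work exceeds free capacity, and power them off when too many sit idle.

```python
# Toy elasticity rule in the spirit of CLUES (illustrative, not its code):
# scale up when the queue outgrows free slots, scale down idle nodes.

def scale_decision(queued_jobs, free_slots, idle_nodes, idle_limit=2):
    """Return an (action, count) pair for the cluster manager."""
    if queued_jobs > free_slots:
        return ("power_on", queued_jobs - free_slots)   # deploy more nodes
    if idle_nodes > idle_limit:
        return ("power_off", idle_nodes - idle_limit)   # save energy
    return ("noop", 0)

print(scale_decision(10, 4, 0))  # ('power_on', 6)
```

A real elasticity manager additionally debounces these decisions over time to avoid oscillating between power-on and power-off.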

The sustainability of our PaaS layer relies heavily on the fact that we have used open source frameworks whenever available.

4.1.3 Data Management Services

The goal of the data management developments in INDIGO has been to push forward the state of the art in unified data access over heterogeneous infrastructures. Among the features demanded by the use cases are high-performance data access, migration and replica management. Supporting such features requires, at the user level, a flexible security framework based on tokens and Access Control Lists (ACLs).

Data management services developed in INDIGO are based on three open source components: Onedata [53], DynaFed [54] and FTS3 [55].

INDIGO has invested a substantial effort in the development of Onedata,21 a global data management system that aims to provide easy access to distributed storage resources. Its main goal is to support a wide range of use cases, from personal data management to data-intensive scientific computations.

Federation in Onedata can be achieved by establishing a distributed provider registry, where various infrastructures can set up their own provider registries and build trust relationships between these instances, allowing users from various platforms to share their data transparently.

Onedata provides an easy-to-use Graphical User Interface for managing storage Spaces, with customizable access control rights on entire data sets or single files for particular users or groups.

The INDIGO PaaS Orchestrator integrates a plugin for interacting with the Onedata services, providing

20https://github.com/indigo-dc/clues-indigo
21https://github.com/indigo-dc/onedata

advanced capabilities of data-location-aware scheduling. Combining the information about the distribution of the compute resources and the data providers with the data requirements specified by the user, the Orchestrator is able to schedule processing jobs to the computing centre nearest to the data. A prototype based on Onedata has been implemented and demonstrated for some use cases, e.g. the LifeWatch AlgaeBloom (see Table 1).
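A minimal sketch of such data-location-aware placement (our own illustration with invented site names, not the Orchestrator plugin itself) is to prefer the candidate site holding the largest fraction of the requested files:

```python
# Illustrative data-locality rule: run the job where most of its input
# data already resides. Site names and file ids are invented.

def nearest_site(sites, required_files):
    """Pick the site holding the largest fraction of the needed files."""
    needed = set(required_files)

    def locality(site):
        return len(needed & set(site["replicas"])) / len(needed)

    return max(sites, key=locality)["name"]

sites = [
    {"name": "site-a", "replicas": ["f1"]},
    {"name": "site-b", "replicas": ["f1", "f2", "f3"]},
]
print(nearest_site(sites, ["f1", "f2"]))  # site-b holds both files
```

In the real system this locality signal is one input among several (SLAs, monitoring data) that the ranking workflow combines before a site is chosen.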

Using Onedata it is possible to integrate already available storage services into the INDIGO platform, exploiting data stored in external infrastructures. This is the case for the data stored in WLCG by the CMS experiment at the LHC, where the INDIGO PaaS provides the services needed to deal with authentication and authorization (Token Translation).

5 Interfacing with the Users

Users typically do not access the PaaS core components directly. Instead, they often use Graphical User Interfaces or simpler APIs. A user authenticated on the INDIGO platform can access and customize a rich set of TOSCA-compliant templates through a GUI-based portlet.

The INDIGO repository provides a catalogue of pre-configured TOSCA templates to be used for the deployment of a wide range of applications and services, customizable with different requirements of scalability, reliability and performance. In these templates a user can choose between two generic scenarios:

1. Deploy a customized virtual infrastructure starting from a TOSCA template that has been imported, or built from scratch: the user will be able to access the deployed customized virtual infrastructure and run, administer and manage applications running on it.

2. Deploy a service/application whose lifecycle will be directly managed by the PaaS platform: in this case the user is returned a list of endpoints to access the deployed services.

APIs for accessing the INDIGO PaaS layer are available; they allow an easy integration of the PaaS features inside Portals, Desktop Applications


and Mobile Apps. The final release of the INDIGO-DataCloud software includes a large set of components to facilitate the development of science gateways and desktop/mobile applications, big data analytics and scientific workflows. The components directly related to the end-user interfaces are integrated in the INDIGO FutureGateway (FG) framework. The FG can be used to build powerful, customized, easy-to-use science gateway environments and front-ends on top of the INDIGO-DataCloud PaaS layer, integrated with the data management services. The FG provides capabilities including:

• The FG API server, used to integrate third-party science gateways, and the FG Liferay Portal, containing base portlets for the authentication, authorization and administration of running applications and deployments;

• Customizable Application Portlets, for user-friendly specification of the parameters used by TOSCA templates;

• A workflow monitoring portlet, used for monitoring task execution via the integrated workflow systems described below;

• An Open Mobile Toolkit, as well as application templates for Android and iOS, simplifying the creation of mobile apps making use of the FG API Server;

• Support for scientific workflows, where the INDIGO components:

– Provide dynamic scalable services in a Workflows-as-a-Service model;

– Implement modules and components enabling the usage of the PaaS layer (via the FG API Server) for the main scientific workflow engines deployed by user communities (such as Kepler, Ophidia, Taverna, Pegasus);

– Support a two-level (coarse- and fine-grained) workflow orchestration, essential for complex, distributed experiments involving (among others) parallel data analysis tasks on large volumes of scientific data.

• Key extensions to the Ophidia big data analytics framework (which allows processing, transforming and manipulating array-based data in scientific contexts), providing many new functionalities, including a set of new operators for importing heterogeneous data formats (e.g. SAC, FITS), a new OpenID Connect interface and new workflow interface extensions.

• Enhancements to the jSAGA library through a "Resource Management API" complementing the standard Job/Data Management API. This allows acquiring and managing resources (compute, storage, network) and enables the wrapping of underlying technologies (cloud, pilot jobs, grid, etc.) by means of a single API, supporting asynchronous mode (tasks), timeout management, notification (metrics) and security context forwarding.

• Command-line clients for the PaaS layer, providing an easy way for users to interact with the Orchestrator or with WaTTS:

– Orchent22: a command-line application to manage deployments and their resources on the INDIGO-DataCloud Orchestrator;

– Wattson23: a command-line client for the INDIGO Token Translation Service.

INDIGO has provided the tools for a simple and effective end-user experience, both for software developers and for researchers running the applications. Figure 7 shows how an application is launched using the mobile platform developed in the project for the climate change application ENES.

The ENES end user starts the app and authenticates and authorizes using the IAM service. The app requests access to the user's id and e-mail, as well as offline access, which is required to refresh the existing tokens during the mobile app's lifecycle. After the user grants the permissions and logs in, the user sees the list of scheduled analyses, if available.

The mobile app is also a handy interface for scheduling new tasks. The submission form requires the user to select the model, scenario, frequency, percentile and nodes to run the analysis. After the user provides the necessary inputs, the app sends the request to the FutureGateway server using its API. The user is then able to monitor the status of the analysis. When the task is done, the user can view and download the results as PNG files illustrating the predicted climate changes on

22https://github.com/indigo-dc/orchent23https://github.com/indigo-dc/wattson


Fig. 7 Task submission and visualization of results on the mobile platform for ENES

Earth. Finally, the user can remove useless or aborted tasks.

6 Software Lifecycle Management

The software lifecycle process in INDIGO has been supported by a continuous improvement process encompassing software quality assurance (SQA), software release and maintenance, the deployment of pilot infrastructures for software integration and testing and, lastly, the exploitation activities and support services. Figure 8 depicts the interdependencies between the different processes, together with the services involved at each stage. Appendix B describes the tools and services that were required for the implementation of the software lifecycle process.

The quality requirements [56] that drive the software lifecycle process define the minimum set of criteria that the software developed in INDIGO has to comply with. The requirements are checked for each change in the codebase; thus the production version of a given software component is permanently in a workable status, protected from incoming changes that do not adhere to the SQA criteria. The continuous evaluation of the SQA requirements is only possible with the aid of automation, achieved in INDIGO through the progressive adoption of DevOps practices.

In the next sections we describe the DevOps approaches adopted and the upstream contributions included in the official distributions of external open source projects.

6.1 DevOps Approach in INDIGO

Progressive levels of automation were adopted throughout the different phases of the INDIGO-DataCloud software development and delivery processes. This evolution was intentionally marked by the commitment to DevOps principles [57]. Starting with a continuous integration (CI) approach, achieved already in the early stages of the project, the second part of the project was devoted to establishing the next natural step in the DevOps practices: continuous delivery (CD).


Fig. 8 Software lifecycle, release, maintenance and exploitation interdependencies

6.1.1 Services for Continuous Integration and SQA

The INDIGO-DataCloud CI process is schematically shown in Fig. 9 and explained below. The process, in its different steps, reflects some of the main achievements of the software integration team.

• New features are developed independently from the production version in feature branches. In order to test and review a new change, a pull request (PR) is created in GitHub. The PR creation marks the start of the automated validation process through the execution of the SQA jobs in the CI infrastructure (Jenkins).

• The SQA jobs check the adherence of the code to a style standard and calculate unit and functional test coverage. Other checks executed at this stage include security static analysis and metrics gathering.

Fig. 9 Continuous Integration workflow followed by new feature additions in the production codebase


• The results of the several SQA jobs automatically executed in Jenkins are notified back to GitHub, updating the PR with the exit status and the links to the output logs.

• On successful completion of the SQA tests, code review is the last step before the source code is merged into the production version. The GitHub PR provides a place for discussion, open to collaboration, where the developers and/or external experts analyze the results of the SQA jobs and discuss any relevant aspect of the change (internal to the code or in terms of its goals or applicability).

• Once peer-reviewed, the change is merged and becomes ready for integration and later release.

6.1.2 Continuous Delivery

Continuous delivery adds, on top of the CI approach described above, the seamless manufacturing of software packages ready to be deployed into production.

In the INDIGO-DataCloud scenario, the adoption of continuous delivery translates into the definition of pipelines. A pipeline is a serial execution of tasks that encompasses, in the first place, the SQA jobs (CI phase) and adds, as the second part (CD phase), the building and deployment testing of the software packages created. The pipeline only succeeds if each task runs to completion; otherwise the process is stopped and marked as a build failure.

6.1.3 DevOps Adoption from User Communities

The experience gathered throughout the project with regard to the adoption of different DevOps practices is not only useful and suitable for the software related to the core services in the INDIGO-DataCloud solution, but is also applicable to the development and distribution of the applications coming from the user communities.

The novelty introduced, showcased in Appendix C, is the validation of the user application by comparing the execution results with a set of reference outputs. This pipeline implementation thus goes a step further with respect to the former DevOps approaches, as the application execution is tested before the new version is released.

6.2 INDIGO Upstream Software Contributions

The INDIGO software solution encompasses not only products implemented from scratch within the project but also external services adopted from open source initiatives. This latter set of products was actively developed to enhance their functionality to match the INDIGO project's objectives, while at the same time aiming for the changes to be accepted as upstream contributions. Thus, multiple contributions developed by INDIGO have been pushed and accepted in the official distributions of major open source projects such as OpenStack, OpenNebula and OpenID Connect. Appendix A lists the software projects and products contributed to by INDIGO-DataCloud.

7 Examples of Implementation Towards Research Communities

In what follows we provide some basic information that may be useful for promoting the use of INDIGO solutions among research communities. Based on the described architecture, we introduce the basic ideas on how to develop, deploy and support applications in the Cloud framework, exploiting the different service layers, and present generic examples that may make the use of INDIGO solutions easier.

7.1 Understanding the Services of the Cloud Computing Framework

Figure 10 describes how an application can be built using a service-oriented architecture in the Cloud, based on INDIGO solutions.

This layered scheme includes different elements, managed by different actors, that must be minimally understood in order to design, develop, test, deploy and put an application into production.

The lowest layer, Infrastructure as a Service (IaaS), provides access to the basic resources that the application will use: computing, storage, network, etc. These resources are physically located at a site, typically a computing centre, either in a research centre or at a cloud provider (for example, commercial cloud services), and are handled by the system managers at


Fig. 10 Service composition in the Service Oriented Architecture

those sites, who install an IaaS solution compatible with the INDIGO software stack (e.g. OpenStack, OpenNebula, Google Compute Engine, Microsoft Azure, Amazon EC2).

By accessing through a web interface such as Horizon for OpenStack, a user who is granted access to a pool of resources at a site can launch a virtual machine, for example a server with 2 cores, 4 GB RAM, 100 GB of storage and a certain Linux flavour installed.

The user can then access this machine in console mode, using for instance the ssh protocol. Once logged in, the user can execute a simple script, install a web server, etc. When the work is finished the machine can be stopped by the user, thus freeing the resources. This is a very basic mode of accessing Cloud services, which shares analogies with the usual access to classical computing services, such as any remote server or cluster.

A different way to interact with IaaS services is to use the existing APIs to manage the resources using the web services protocol. Such invocation of services can be made from any program or application, for example from a Python script, or even through a web interface.

Many applications require the setup, launching and interconnection of several (IaaS) services implemented in different virtual machines and managed

under a single control, as a platform. The Platform as a Service (PaaS) layer enables this orchestration of IaaS services which, in the case of a federated Cloud, might even be located at different sites.

For example, an Apache web server may be launched at site 1 with the purpose of displaying the output of a simulation running on a cluster launched on demand at site 2. The Apache web server may be better supported with a pool of resources using another cloud-oriented solution such as Marathon/Mesos. Launching an application in this context requires identifying the available resources and launching them via the IaaS services. INDIGO supports the TOSCA standard to prepare a template that can be used to automate this selection and orchestration of services.
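A minimal sketch of what such a template looks like, assuming the TOSCA Simple Profile in YAML; the node names and sizes are illustrative and do not come from one of the project's actual templates:

```yaml
tosca_definitions_version: tosca_simple_yaml_1_0

topology_template:
  node_templates:
    web_server:                    # e.g. the Apache front-end at site 1
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties:
            num_cpus: 2
            mem_size: 4 GB
    worker_cluster:                # e.g. the on-demand cluster at site 2
      type: tosca.nodes.Compute
      capabilities:
        host:
          properties:
            num_cpus: 8
            mem_size: 16 GB
```

The orchestration layer matches these declarative requirements against the available IaaS resources, so the same template can be deployed unchanged at different sites.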

7.2 Building and Executing Applications Using INDIGO Solutions

In what follows we present several simple examples of basic, but generic, applications exploiting INDIGO solutions.

7.2.1 Executing Containers on HPC Systems

The first generic example shows how to build an application encapsulated as a container and how to execute


it on an HPC system. This basic example of using INDIGO solutions is shown in Fig. 11. A user can create a container using a conventional Dockerfile, which describes the steps required to create the Docker image. The process can be fully automated using GitHub and Docker Hub, in such a way that a change in the Dockerfile immediately triggers a rebuild of the application container.

INDIGO provides the udocker tool to enable the execution of application containers in batch systems. The end user can download the udocker Python script from GitHub or send it with the batch job. When executed for the first time, it sets itself up in the user's home directory. udocker provides a Docker-like command line interface with which the user can pull, import or load Docker containers and then execute them using a chroot-like environment. The software within the container must not require privileges during execution, as it will be executed under the user that invokes udocker.

This is also a good solution for research communities that want to migrate towards a cloud-based framework using containers, while continuing to exploit resources such as grid-enabled clusters or even supercomputers.

udocker is used by the case studies on structural biology (PowerFit and DisVis) exploiting grid resources, on phenomenology in particle physics, and recently for Lattice QCD on supercomputers. The TRUFA genomic pipeline also exploits this solution, and it is being extended to similar applications in the area that require the integration of legacy libraries.

7.2.2 Executing Containers on the Cloud

The second example shows how to build an application encapsulated as a container and launch it in the Cloud, from a web interface, using the INDIGO solutions FutureGateway and the PaaS Orchestrator.

This second example, compared to the first one, shows the evolution required to move an application to the Cloud arena: the application must be encapsulated into a container, as before, but to launch this container the cloud resources must be allocated, and the user must authenticate and be granted access. If different services are required, they must be orchestrated.

The way to express these requirements, using an open standard, is a TOSCA template. FutureGateway offers a user-friendly web-based interface to customize the TOSCA template, authenticate the user, select the container to be executed, interact with the Orchestrator to allocate the required cloud resources and launch the application. As in the first example, the container can be created using a Dockerfile. Automation in this step can be achieved using GitHub and Docker Hub.

Using the FutureGateway portal or the command line tool Orchent, the user can submit a TOSCA template to the Orchestrator, which in turn will request and allocate the resources at the IaaS level by asking the Infrastructure Manager to do so.

The user may wish to connect to the container that has been launched via the Orchestrator using the ssh command. Once in the container it is possible to mount

Fig. 11 Basic execution of containers using INDIGO solutions


remote data repositories or stage the output data using Onedata, available at the IaaS layer.

7.3 Building Advanced Applications Using INDIGO Solutions

In this section we present several simplified schemes corresponding to different applications already implemented, with the idea that they can more easily be used as a guide to configure new applications.

The key, as stated before, is the composition of the template, written in the TOSCA language. The template should specify the image of the application to be used, as a container, using Docker technology. We also need to specify the resources (CPUs, storage, memory, network ports) required to support the execution. The parameters required to configure the INDIGO services used, for example Onedata, Mesos/Marathon or other additional cloud services, need to be specified as well.

Examples of TOSCA templates can be found at https://github.com/indigo-dc/tosca-templates. FutureGateway offers a friendly way to handle the TOSCA templates to launch the applications.

7.3.1 Deployment of a Digital Repository

A first example is the deployment, as a SaaS solution, of a digital repository. The scheme is presented in Fig. 12 below. The specific template for this application is available for reuse in the GitHub repository of the project.

The repository manager, controlling the application, uses the FutureGateway to configure the application, based on the ZENODO software, which can be automatically scaled up to ensure high availability using Cloud resources as needed, while also enabling the authentication and authorization mechanism for their research community, DARIAH, based on the INDIGO solution IAM.

All these details are transparent to the final user, who accesses the repository directly through its web interface and benefits from the enhanced scalability and availability.

7.3.2 Launching a Virtual Elastic Cluster for Data Intensive Applications

A second example is the launch of a Virtual Elastic Cluster to support a data intensive system. The scheme is presented in Fig. 13 below.

The specific template for this advanced application is available for reuse in the GitHub repository of the project.

Galaxy is an open source, web-based platform for data intensive biomedical research. This application deploys a Galaxy instance provider platform, allowing each virtual instance to be fully customized through a user-friendly web interface, ready to be used by life scientists and bioinformaticians.

The front-end that will be in charge of managing the cluster elasticity can use a specified LRMS workload manager (selected among Torque, SGE, Slurm and Condor).

Fig. 12 Deployment of a digital repository using INDIGO solutions


Fig. 13 Launching of a Virtual Elastic Cluster using INDIGO solutions

All these details are transparent to the final user, the researcher, who accesses the Galaxy instance directly through its web interface and benefits from the enhanced scalability and availability.

This complex template includes the configuration of the distributed storage based on Onedata, the use of encrypted files via LUKS, the deployment of elastic clusters using another INDIGO solution, CLUES, and the integration of the authentication and authorization mechanism, very relevant for this application area, using IAM.

8 Conclusions

Thanks to the new common solutions developed by the INDIGO project, teams of first-line researchers in Europe are using public and private Cloud resources to obtain new results in physics, biology, astronomy, medicine, the humanities and other disciplines.

INDIGO-developed solutions have, for instance, enabled new advances in understanding how the basic blocks of matter (quarks) interact, using supercomputers; how new molecules involved in life work, using GPUs; or how complex new repositories to preserve and consult digital heritage can be easily built. The variety of the requirements coming

from these diverse user communities proves that the modular INDIGO platform, consisting of several state-of-the-art, production-level services, is flexible and general enough to be applied to all of them with the same ease of use and efficiency.

These services allow the federation of hybrid resources and make it easy to write, port and run scientific applications in the cloud. They are all freely downloadable as open source components, and are already being integrated into many scientific applications, namely:

• High-energy physics: the creation of complex clusters deployed on several Cloud infrastructures is automated, in order to perform simulation and analysis of physics data for large experiments.

• Lifewatch: parameters from a water quality model in a reservoir are calibrated using automated multiple simulations.

• Digital libraries: multiple libraries can easily access a cloud environment under central coordination while uploading and managing their own collections of digital objects. This allows them to consistently keep control of their collections and to certify their quality.

• Elixir: Galaxy, a tool often used in many life science research environments, is automatically configured, deployed on the Cloud and used to process data through a user-friendly interface.


• Theoretical HEP physics: the MasterCode software, used in theoretical physics, adopts INDIGO tools to run applications on Grids, Clouds and HPC systems with an efficient, simple-to-use, consistent interface.

• In DARIAH, a pan-European social and technical infrastructure for arts and humanities, the deployment of a self-managed, auto-scalable Zenodo-based repository in the cloud is automated.

• Climate change: distributed, parallel data analysis in the context of the Earth System Grid Federation (ESGF) infrastructure is performed through software deployed on HPC and cloud environments in Europe and the US.

• Image analysis: in the context of EuroBioImaging, a distributed infrastructure for microscopy, molecular and medical imaging, INDIGO components are used to perform automatic and scalable analysis of bone density.

• Astronomical data archives: big data consisting of images collected by telescopes are automatically distributed and accessed via INDIGO tools.

The same solutions are also being explored by industry, to provide innovative services to EU companies: for example, modelling water reservoirs integrating satellite information, improving security in cyberspace, or assisting doctors in diagnostics through medical images. INDIGO solutions are also being intensively tested in other projects, such as HelixNebula ScienceCloud.

INDIGO services are fundamental for the implementation of the EOSC. In particular, many INDIGO components are included in the unified service catalogue provided by the project EOSC-hub [58], which will put in place the basic layout for the European Open Science Cloud. Two additional Horizon 2020 projects were also approved (DEEP Hybrid DataCloud and eXtreme DataCloud) that will continue to develop and enhance INDIGO components.

The outcomes of INDIGO-DataCloud will persist, and also be extended, after the end of the project in the framework of the INDIGO Software Collaboration agreement. This Collaboration shall be continued without financial support from the European Union. It is open to new initiatives and partners willing to contribute, extend or maintain the INDIGO-DataCloud software components.

Acknowledgments INDIGO-DataCloud has been funded by the European Commission H2020 research and innovation programme under grant agreement RIA 653549.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix A: Contribution to Open Source Software Projects

Here follows the list of software developed in the framework of INDIGO-DataCloud that has been contributed upstream to the Open Source community.

• OpenStack (https://www.openstack.org)

  – Changes/contributions already merged upstream:

    * Nova Docker
    * Heat Translator (INDIGO-DataCloud is 3rd overall contributor and core developer)
    * TOSCA parser (INDIGO-DataCloud is 2nd overall contributor and core developer)
    * OpenID Connect CLI support
    * OOI: OCCI implementation for OpenStack

  – Changes/contributions under discussion to be merged upstream:

    * Preemptible instances support (extensions)

• OpenNebula

  – Changes/contributions already merged upstream:

    * ONEDock

• Changes/contributions already merged upstream for:

  – Infrastructure Manager (http://www.grycap.upv.es/im/index.php)
  – CLUES (http://www.grycap.upv.es/clues/eng/index.php)
  – Onedata (https://onedata.org)
  – Apache Libcloud (https://github.com/apache/libcloud)


  – Kepler Workflow Manager (https://kepler-project.org/)
  – TOSCA adaptor for JSAGA (http://software.in2p3.fr/jsaga/dev/)
  – CDMI and QoS extensions for dCache (https://www.dcache.org)
  – Workflow interface extensions for Ophidia (http://ophidia.cmcc.it)
  – OpenID Connect Java implementation for dCache (https://www.dcache.org)
  – MitreID (https://mitreid.org/) and OpenID Connect (http://openid.net/connect/) libraries
  – FutureGateway (https://www.catania-science-gateways.it/)

Appendix B: Tools and Services Involved in the Software Lifecycle

Figure 14 showcases the tools and services used for the development and distribution of the INDIGO-DataCloud software:

• Project management service using openproject.org: it provides tools such as an issue tracker, wiki, a placeholder for documents and a project management timeline.

• Source code is publicly available, hosted externally in GitHub repositories, thus increasing the visibility and simplifying the path to exploitation beyond the project lifetime. The INDIGO-DataCloud software is released under the Apache 2.0 software license [59].

• Continuous Integration service using Jenkins: service to automate the building, testing and packaging, where applicable. Testing includes style compliance and estimation of the unit and functional test coverage of the software components.

• Artifact repositories for RedHat and Debian packages [60] and virtual (Docker) images [61].

• Code review service using GitHub: source code review is an integral part of the SQA, as it is the last step in the change verification process. This service facilitates the code review process, recording the comments and allowing the reviewer to verify the candidate change before it is merged into the production version.

• Issue tracking using GitHub Issues: service to track issues, new features and bugs of the INDIGO-DataCloud software components.

• Release notes, installation and configuration guides, and user and development manuals are made available on GitBook [62].

• Code metrics services using Grimoire: to collect and visualize several metrics about the software components.

Fig. 14 Tools and services used to support the software lifecycle process


• Integration infrastructure: this infrastructure is composed of computing resources to directly support the CI service.

• Testing infrastructure: this infrastructure aims to provide a stable environment where users can preview the software and services developed by INDIGO-DataCloud, prior to their public release.

• Preview infrastructure: where the released artifacts are deployed and made available for testing and validation by the use cases.

Appendix C: DevOps Adoption from User Communities

The DisVis [63] and PowerFit [64] applications were integrated into a CI/CD pipeline as described above. As can be seen in Fig. 15, with this pipeline in place the application developers were provided both with a means to validate the source code before merging and with the creation of a new versioned Docker image, automatically available in the INDIGO-DataCloud catalogue for applications, i.e. DockerHub's indigodatacloudapps repository.

Once the application is deployed as a Docker container, and subsequently uploaded to the indigodatacloudapps repository, it is instantiated in a new container to be validated. The application is then executed and the results compared with a set of reference outputs. Thus this pipeline implementation goes a step further by testing the application execution for the latest available Docker image in the catalogue.

Fig. 15 DisVis development workflow using a CI/CD approach


References

1. García, A.L., Castillo, E.F.-d., Puel, M.: Identity federation with VOMS in cloud infrastructures. In: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, pp. 42–48 (2013)

2. Chadwick, D.W., Siu, K., Lee, C., Fouillat, Y., Germonville, D.: Adding federated identity management to OpenStack. Journal of Grid Computing 12(1), 3–27 (2014)

3. Craig, A.L.: A design space review for general federation management using Keystone. In: Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing, pp. 720–725. IEEE Computer Society (2014)

4. Pustchi, N., Krishnan, R., Sandhu, R.: Authorization federation in IaaS multi cloud. In: Proceedings of the 3rd International Workshop on Security in Cloud Computing, pp. 63–71. ACM (2015)

5. Lee, C.A., Desai, N., Brethorst, A.: A Keystone-based virtual organization management system. In: 2014 IEEE 6th International Conference on Cloud Computing Technology and Science (CloudCom), pp. 727–730. IEEE (2014)

6. Castillo, E.F.-d., Scardaci, D., García, A.L.: The EGI Federated Cloud e-Infrastructure. Procedia Computer Science 68, 196–205 (2015)

7. AARC project: AARC Blueprint Architecture, see https://aarc-project.eu/architecture. Technical report (2016)

8. Oesterle, F., Ostermann, S., Prodan, R., Mayr, G.J.: Experiences with distributed computing for meteorological applications: grid computing and cloud computing. Geosci. Model Dev. 8(7), 2067–2078 (2015)

9. Plasencia, I.C., Castillo, E.F.-d., Heinemeyer, S., García, A.L., Pahlen, F., Borges, G.: Phenomenology tools on cloud infrastructures using OpenStack. The European Physical Journal C 73(4), 2375 (2013)

10. Boettiger, C.: An introduction to Docker for reproducible research. ACM SIGOPS Operating Systems Review 49(1), 71–79 (2015)

11. Docker: http://www.docker.com (2013)

12. Gomes, J., Campos, I., Bagnaschi, E., David, M., Alves, L., Martins, J., Pina, J., Alvaro, L.-G., Orviz, P.: Enabling rootless Linux containers in multi-user environments: the udocker tool. Computer Physics Communications. https://doi.org/10.1016/j.cpc.2018.05.021 (2018)

13. Zhang, Z., Chuan, W., Cheung, D.W.L.: A survey on cloud interoperability: taxonomies, standards, and practice. SIGMETRICS Perform. Eval. Rev. 40(4), 13–22 (2013)

14. Lorido-Botran, T., Miguel-Alonso, J., Lozano, J.A.: A review of auto-scaling techniques for elastic applications in cloud environments. Journal of Grid Computing 12(4), 559–592 (2014)

15. Nyren, R., Metsch, T., Edmonds, A., Papaspyrou, A.: Open Cloud Computing Interface – Core. Technical report, Open Grid Forum (2010)

16. Metsch, T., Edmonds, A.: Open Cloud Computing Interface – Infrastructure. Technical report, Open Grid Forum (2010)

17. Metsch, T., Edmonds, A.: Open Cloud Computing Interface – RESTful HTTP Rendering. Technical report, Open Grid Forum (2011)

18. Lipton, P. (CA Technologies), Moser, S. (IBM), Palma, D. (Vnomic), Spatzier, T. (IBM): Topology and Orchestration Specification for Cloud Applications. Technical report, OASIS Standard (2013)

19. Teckelmann, R., Reich, C., Sulistio, A.: Mapping of cloud standards to the taxonomy of interoperability in IaaS. In: Proceedings - 2011 3rd IEEE International Conference on Cloud Computing Technology and Science, CloudCom 2011, pp. 522–526 (2011)

20. García, A.L., Castillo, E.F.-d., Fernández, P.O.: Standards for enabling heterogeneous IaaS cloud federations. Computer Standards & Interfaces 47, 19–23 (2016)

21. Caballer, M., Zala, S., García, A.L., Moltó, G., Fernández, P.O., Velten, M.: Orchestrating complex application architectures in heterogeneous clouds. Journal of Grid Computing 16(1), 3–18 (2018)

22. Hardt, M., Jejkal, T., Plasencia, I.C., Castillo, E.F.-d., Jackson, A., Weiland, M., Palak, B., Plociennik, M., Nielsson, D.: Transparent access to scientific and commercial clouds from the Kepler workflow engine. Computing and Informatics 31(1), 119 (2012)

23. Fakhfakh, F., Kacem, H.H., Kacem, A.H.: Workflow scheduling in cloud computing: a survey. In: IEEE 18th International Enterprise Distributed Object Computing Conference Workshops and Demonstrations (EDOCW), 2014, Vol. 71, pp. 372–378. Springer, New York (2014)

24. Stockton, D.B., Santamaria, F.: Automating NEURON simulation deployment in cloud resources. Neuroinformatics 15(1), 51–70 (2017)

25. Plociennik, M., Fiore, S., Donvito, G., Owsiak, M., Fargetta, M., Barbera, R., Bruno, R., Giorgio, E., Williams, D.N., Aloisio, G.: Two-level dynamic workflow orchestration in the INDIGO DataCloud for large-scale, climate change data analytics experiments. Procedia Computer Science 80, 722–733 (2016)

26. Moreno-Vozmediano, R., Montero, R.S., Llorente, I.M.: Multicloud deployment of computing clusters for loosely coupled MTC applications. IEEE Transactions on Parallel and Distributed Systems 22(6), 924–930 (2011)

27. Katsaros, G., Menzel, M., Lenk, A.: Cloud service orchestration with TOSCA, Chef and OpenStack. In: IC2E (2014)

28. García, A.L., Zangrando, L., Sgaravatto, M., Llorens, V., Vallero, S., Zaccolo, V., Bagnasco, S., Taneja, S., Dal Pra, S., Salomoni, D., Donvito, G.: Improved cloud resource allocation: how INDIGO-DataCloud is overcoming the current limitations in cloud schedulers. J. Phys. Conf. Ser. 898(9), 92010 (2017)

29. Singh, S., Chana, I.: A survey on resource scheduling in cloud computing: issues and challenges. Journal of Grid Computing, pp. 1–48 (2016)

30. García, A.L., Castillo, E.F.-d., Fernández, P.O., Plasencia, I.C., de Lucas, J.M.: Resource provisioning in Science Clouds: requirements and challenges. Software: Practice and Experience 48(3), 486–498 (2018)


31. Chauhan, M.A., Babar, M.A., Benatallah, B.: Architectingcloud-enabled systems: a systematic survey of challengesand solutions. Software - Practice and Experience 47(4),599–644 (2017)

32. Somasundaram, T.S., Govindarajan, K.: CLOUDRBA Framework for scheduling and managing High-Performance Computing (HPC) applications in sciencecloud. Futur. Gener. Comput. Syst. 34, 47–65 (2014)

33. Sotomayor, B., Keahey, K., Foster, I.: Overhead Matters: AModel for Virtual Resource Management. In: Proceedingsof the 2nd International Workshop on Virtualization Tech-nology in Distributed Computing SE - VTDC ’06, p. 5.IEEE Computer Society, Washington (2006)

34. SS, S.S., Shyam, G.K., Shyam, G.K.: Resource manage-ment for Infrastructure as a Service (IaaS) in cloud com-puting SS Manvi A survey. J. Netw. Comput. Appl. 41,424–440 (2014)

35. INDIGO-DataCloud consortium: Initial requirements fromresearch communities - d2.1, see https://www.indigo-datacloud.eu/documents/initial-requirements-research-communities-d21. Technical report (2015)

36. European Open Science Cloud: https://ec.europa.eu/research/openscience (2015)

37. PRoot: https://proot-me.github.io/ (2014)

38. runc: https://github.com/opencontainers/runc (2016)

39. fakechroot: https://github.com/dex4er/fakechroot (2015)

40. Perez, A., Molto, G., Caballer, M., Calatrava, A.: Serverless computing for container-based architectures. Future Generation Computer Systems (2018)

41. de Vries, K.J.: Global fits of supersymmetric models after LHC run 1. PhD thesis, Imperial College London (2015)

42. OpenStack: https://www.openstack.org/ (2015)

43. See http://argus-documentation.readthedocs.io/en/stable/argus_introduction.html (2017)

44. See https://en.wikipedia.org/wiki/xacml (2013)

45. See http://www.simplecloud.info (2014)

46. OpenNebula: http://opennebula.org/ (2018)

47. RedHat OpenShift: https://www.openshift.com (2011)

48. The Cloud Foundry Foundation: https://www.cloudfoundry.org/ (2015)

49. Caballer, M., Blanquer, I., Molto, G., de Alfonso, C.: Dynamic management of virtual infrastructures. Journal of Grid Computing 13(1), 53–70 (2015)

50. See http://www.infoq.com/articles/scaling-docker-with-kubernetes (2014)

51. PRISMA project: http://www.ponsmartcities-prisma.it/ (2010)

52. OpenCity platform: http://www.opencityplatform.eu (2014)

53. Onedata: https://onedata.org/ (2018)

54. DynaFed: http://lcgdm.web.cern.ch/dynafed-dynamic-federation-project (2011)

55. FTS3: https://svnweb.cern.ch/trac/fts3 (2011)

56. Fernandez, P.O., García, A.L., Duma, D.C., Donvito, G., David, M., Gomes, J.: A set of common software quality assurance baseline criteria for research projects, see http://hdl.handle.net/10261/160086. Technical report

57. Hüttermann, M.: DevOps for Developers. Apress (2012)

58. EOSC-Hub: "Integrating and managing services for the European Open Science Cloud". Funded by H2020 research and innovation programme under grant agreement No. 777536. See http://eosc-hub.eu (2018)

59. Apache License: https://www.apache.org/licenses/LICENSE-2.0 (2004)

60. INDIGO Package Repo: http://repo.indigo-datacloud.eu/ (2017)

61. INDIGO DockerHub: https://hub.docker.com/u/indigodatacloud/ (2015)

62. INDIGO GitBook: https://indigo-dc.gitbooks.io/indigo-datacloud-releases (2017)

63. Van Zundert, G.C., Bonvin, A.M.: DisVis: quantifying and visualizing the accessible interaction space of distance-restrained biomolecular complexes. Bioinformatics 31(19), 3222–3224 (2015)

64. Van Zundert, G.C., Bonvin, A.M.: Fast and sensitive rigid-body fitting into cryo-EM density maps with PowerFit. AIMS Biophys. 2(0273), 73–87 (2015)