Supporting Research through "Desktop as a Service" models of e-infrastructure access
David Wallom
Overview
• What is e-infrastructure?
• Challenges of a changing community
• Availability of uniform environments for analysis
• Cloud computing and 'as a Service'
What is e-Infrastructure?
The integration of digitally-based technology, resources, facilities, and services combined with people and organizational structures needed to support modern, collaborative research (and teaching).
1. Data and Storage
2. Hardware (Compute)
3. Software (and Algorithms)
4. Networks
5. Security and Authentication
6. People (Collaboration, Skills, Capacity)
7. The Digital Library
RCUK e-Infrastructure Working Group
Changes in Earth Observation
13 Tools or Frameworks presented so far @ EO Open Science ‘16
What are the issues?
• Complex, large-volume data from multiple sources; access policy adherence
• Congested hardware, unsuitable usage mechanisms, changing platforms
• Software diversely applied across a community, with significant sustainability questions
• Limited training at undergraduate level → large amount of training at later career stages
• Disconnect between academic and industrial community members & the tooling used
A complex, geographically and organisationally distributed community, with rapidly evolving challenges, using diverse tooling…
Bioinformatics software challenges
• An onslaught of new challenges for bioinformatics:
– projects that used to require teams of 500 are now accessible to small (<5) teams
– but biology curricula (and hence biologists) still lack computational skills
– thus biologists are overwhelmed by large amounts of data
– furthermore, the data types are young, so the software is young, thus:
• software may be badly built (by biologists with no formal software development training or experience)
• software needs to be frequently updated (bug fixes, algorithmic improvements to sensitivity/specificity, support for new data types)
How do we distribute complex community software in a way that minimises both the time taken to learn it and the support requirements?
Bio-Linux: a scalable solution
• Comprehensive, free bioinformatics workstation based on Ubuntu Linux and Debian Med
• 11 years & 8 major releases
• >8000 users from 1600 locations
• 200+ bioinformatics packages, including big integrative tools: QIIME, Galaxy Server, PredictProtein, EMBOSS, ...
• Incorporates all software across deployment modes: dual boot, Linux live, local servers, cloud
Docker, simplifying the portability of applications and services
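As a minimal sketch of how Docker supports this portability, a community tool can be captured in a short Dockerfile so that the identical environment runs on a laptop, a local server, or a cloud VM. The example below packages EMBOSS from the Ubuntu archives; the image name and layout are illustrative, not taken from the talk.

```dockerfile
# Hypothetical example: freezing a bioinformatics tool (EMBOSS, from the
# Ubuntu package archives) into a portable image.
FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends emboss && \
    rm -rf /var/lib/apt/lists/*
# Analyses read and write data through a directory mounted from the host.
VOLUME /data
WORKDIR /data
CMD ["bash"]
```

Built once with `docker build -t emboss-demo .`, the same image can then be run anywhere Docker is available, e.g. `docker run -it -v "$PWD":/data emboss-demo`.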
How do we simplify access to resources on which we can run applications on demand, in an environment with which we are familiar?
Where do we do work at the moment?
JASMIN Cloud Architecture
[Diagram: the JASMIN internal network hosts the managed cloud (PaaS, SaaS), including the JASMIN Analysis Platform and science analysis VMs with direct file system access to Panasas storage and direct access to the Lotus batch compute cluster, behind a firewall. An external network inside JASMIN hosts the unmanaged cloud (IaaS, PaaS, SaaS), behind firewall + NAT, reached via standard remote access protocols (ftp, http, ...) and the JASMIN cloud management interfaces, with an appliance catalogue per tenancy. Example tenancies: Project1-org (science analysis VMs); optirad-org (IPython JupyterHub VM, IPython slave VMs and a file server VM, with an IPython Notebook VM accessing the cluster through IPython.parallel); eos-cloud-org (science analysis VMs, a file server VM and the EOSCloud fat node, providing the EOSCloud Desktop as a Service with dynamic RAM boost).]
Thanks to Phil Kershaw
OTHER PUBLIC CLOUD PROVIDERS ARE AVAILABLE
Why Cloud?
• Data sets can be too big, or too restricted, to easily move – move the compute to the data
– Researchers' work patterns are maintained
• More efficient use of shared resources
• Central maintenance of infrastructure
• Central management of data-sharing agreements is possible
• Lower barrier to entry (compared to traditional HPC and Grid)
• Open questions: what type of cloud? What role for traditional HPC?
TRAINING IS KEY TO MAKING INFORMED CHOICES
Bringing together the ideal platform with a uniform software tooling distribution mechanism
EOS Cloud
• A tenancy in the JASMIN Unmanaged Cloud (& QMUL RCC)
• Reusing JASMIN web interfaces and user management to provide a custom IaaS software platform
• Each user receives two VMs:
– Bio-Linux
– Ubuntu Docker hosting environment
• Users have total responsibility for the instantiated system
• Accessible through standard remote desktop tools
• But:
– utilising a single scale of resources would be a waste
– can we scale users' virtual services to take account of demand?
Boosting Resource Capabilities
• Users' VMs operate in the native 'Standard' state
– Enough capability to access stored data
– Configure applications and workflows
– Free
• A user may boost a running VM to increased capability
– Enough to run analysis applications on a useful timescale
– Credit consumption only for boosted instances
• Reference datasets available to users through shared storage
| Name      | # Cores | Memory (GB) | Cost (credits/hour) |
|-----------|---------|-------------|---------------------|
| Standard  | 1       | 16          | 0                   |
| Standard+ | 2       | 40          | 1                   |
| Big       | 8       | 140         | 4                   |
| Max       | 16      | 500         | 8                   |
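The credit model above is simple enough to sketch directly. The tier names, core counts, memory sizes and prices come from the table; the function itself is illustrative, not part of the EOS Cloud platform.

```python
# Sketch of the boost credit model implied by the tier table
# (tiers and prices from the slide; the code is illustrative).
TIERS = {
    "Standard":  {"cores": 1,  "memory_gb": 16,  "credits_per_hour": 0},
    "Standard+": {"cores": 2,  "memory_gb": 40,  "credits_per_hour": 1},
    "Big":       {"cores": 8,  "memory_gb": 140, "credits_per_hour": 4},
    "Max":       {"cores": 16, "memory_gb": 500, "credits_per_hour": 8},
}

def boost_cost(tier: str, hours: float) -> float:
    """Credits consumed by running a VM at `tier` for `hours`.

    Only boosted tiers consume credits; 'Standard' is free.
    """
    return TIERS[tier]["credits_per_hour"] * hours

# A day of configuration work at 'Standard' costs nothing...
assert boost_cost("Standard", 8) == 0
# ...while a 3-hour analysis run boosted to 'Big' costs 12 credits.
assert boost_cost("Big", 3) == 12
```

This matches the usage pattern described above: users configure workflows for free at 'Standard', then pay credits only while boosted for the actual analysis run.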
oSwitch: one-line access to other operating systems
• Docker applications, though portable, can feel extremely alien in their usability
• With oSwitch, in contrast, things feel (largely) unchanged:
– Current working directory is maintained.
– User name, uid and gid are maintained.
– Login shell (bash/zsh/fish) is maintained.
– Home directory is maintained (thus all .dotfiles and config files are maintained).
– Read/write permissions are maintained.
– Paths are maintained whenever possible. Thus volumes (external drives, NAS, USB) mounted on the host are available in the container at the same path.
https://github.com/yeban/oswitch
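A sketch of what a session might look like, given the properties listed above (user name, prompt formatting and image tag are illustrative; see the linked repository for actual usage):

```console
$ pwd
/home/alice/project
$ oswitch ubuntu:14.04
alice@ubuntu-14.04:project$ pwd        # same directory inside the container
/home/alice/project
alice@ubuntu-14.04:project$ exit       # back to the host shell
$ pwd
/home/alice/project
```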
Pilot Users
• Citizen data collection project
– Physical samples sent for sequencing to assess microbial diversity
– ~200 sites
• Creating compute pipelines and containers for each OSD in silico analysis
– HPC, Cloud (IaaS & PaaS)
• Portable
– Run the same analysis on different laptops/grids/clouds
• Repeatable/Reproducible
– Same input gives same output, given that the reference databases did not change
• Preservation
– All analysis tools and dependencies are in one image
– Images are simple tar.gz
– Preserving Docker and base images is preserving all analysis
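The preservation point can be made concrete with standard Docker commands (the image name `bio-linux-tools` is hypothetical): `docker save` streams an image as a tar archive, which compresses to the "simple tar.gz" mentioned above, and `docker load` restores it later.

```console
$ docker save bio-linux-tools | gzip > bio-linux-tools.tar.gz   # archive the image
$ gunzip -c bio-linux-tools.tar.gz | docker load                # restore it later
```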
Desktop as a Service for research
• Gives researchers an environment they are confident in, by changing the infrastructure around them
• Provides location-independent persistence of the research environment
• Supports further key usage models such as teaching or online learning
• Gives tool developers a deployment mechanism that already has community visibility
Conclusions
• Abstract the underpinning e-infrastructure services from the users
– In many cases users are not interested in what type of tin they're running on!
– Something run on one resource should be movable to others through the use of standards etc.
• Cloud is (obviously) an enabler for research
– Allowing flexibility in infrastructure hitherto not possible
– User control rather than provider control
– Higher-level services more easily composed and made accessible through marketplaces
• Creating an 'EnvLinux' distribution would create a community-wide software suite across all stakeholder groups
– Support a single point of distribution for all community-relevant software
– Create a simple deposit mechanism to allow new tools and services to join easily
– Support activities, such as teaching, that do not generally sit well with rapidly changing activities such as research software development
Thank you & questions?