Supporting Research through "Desktop as a Service" models of e-infrastructure access
David Wallom
Overview
• What is e-infrastructure?
• Challenges of a changing community
• Availability of uniform environments for analysis
• Cloud computing and 'as a Service'
What is e-Infrastructure?
The integration of digitally-based technology, resources, facilities, and services combined with people and organizational structures needed to support modern, collaborative research (and teaching).
1. Data and Storage
2. Hardware (Compute)
3. Software (and Algorithms)
4. Networks
5. Security and Authentication
6. People (Collaboration, Skills, Capacity)
7. The Digital Library
RCUK e-Infrastructure Working Group
Changes in Earth Observation
13 Tools or Frameworks presented so far @ EO Open Science ‘16
What are the issues?
• Complex, large-volume data from multiple sources; access policy adherence
• Congested hardware, unsuitable usage mechanisms, changing platforms
• Software diversely applied across a community, with significant sustainability questions
• Limited training at undergraduate level → large amount of training at later career stages
• Disconnect between academic and industrial community members & the tooling used
A complex, geographically and organisationally distributed community, with rapidly evolving challenges, using diverse tooling…
Bioinformatics software challenges
• An onslaught of new challenges for bioinformatics:
– projects that used to require teams of 500 are now accessible to small (<5) teams
– but biology curricula (and hence biologists) still lack computational skills
– thus biologists are overwhelmed by large amounts of data
– furthermore, the data types are young, so the software is young, thus:
• software may be badly built (by biologists with no formal software development training or experience)
• software needs to be frequently updated (bug fixes, algorithmic improvements to sensitivity/specificity, support for new data types)
How do we distribute complex community software in a way that minimises both the time taken to learn it and the support requirements?
Bio-Linux: a scalable solution
• Comprehensive, free bioinformatics workstation based on Ubuntu Linux and Debian Med
• 11 years & 8 major releases
• >8000 users from 1600 locations
• 200+ bioinformatics packages, including big integrative tools: QIIME, Galaxy Server, PredictProtein, EMBOSS, ...
• Incorporates all software across deployment modes: dual boot, Linux live, local servers, cloud
Docker, simplifying the portability of applications and services
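As a minimal sketch of how Docker supports this portability, a community tool can be captured in a short Dockerfile so that the identical environment runs on a laptop, a local server, or a cloud VM. The example below packages EMBOSS from the Ubuntu archives; the image name and layout are illustrative, not taken from the talk.

```dockerfile
# Hypothetical example: freezing a bioinformatics tool (EMBOSS, from the
# Ubuntu package archives) into a portable image.
FROM ubuntu:14.04
RUN apt-get update && \
    apt-get install -y --no-install-recommends emboss && \
    rm -rf /var/lib/apt/lists/*
# Analyses read and write data through a directory mounted from the host.
VOLUME /data
WORKDIR /data
CMD ["bash"]
```

Built once with `docker build -t emboss-demo .`, the same image can then be run anywhere Docker is available, e.g. `docker run -it -v "$PWD":/data emboss-demo`.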
How do we simplify access to resources on which we can run applications on demand, in an environment with which we are familiar?
Where do we do work at the moment?
JASMIN Cloud Architecture
[Diagram: the JASMIN internal network hosts the managed cloud (PaaS, SaaS), including the JASMIN Analysis Platform and science analysis VMs with direct file system access to Panasas storage and direct access to the Lotus batch compute cluster, behind a firewall. An external network inside JASMIN hosts the unmanaged cloud (IaaS, PaaS, SaaS), behind firewall + NAT, reached via standard remote access protocols (ftp, http, ...) and the JASMIN cloud management interfaces, with an appliance catalogue per tenancy. Example tenancies: Project1-org (science analysis VMs); optirad-org (IPython JupyterHub VM, IPython slave VMs and a file server VM, with an IPython Notebook VM accessing the cluster through IPython.parallel); eos-cloud-org (science analysis VMs, a file server VM and the EOSCloud fat node, providing the EOSCloud Desktop as a Service with dynamic RAM boost).]
Thanks to Phil Kershaw
OTHER PUBLIC CLOUD PROVIDERS ARE AVAILABLE
Why Cloud?
• Data sets can be too big, or too restricted, to easily move – move the compute to the data
– Researchers' work patterns are maintained
• More efficient use of shared resources
• Central maintenance of infrastructure
• Central management of data-sharing agreements is possible
• Lower barrier to entry (compared to traditional HPC and Grid)
• Open questions: what type of cloud? What role for traditional HPC?
TRAINING IS KEY TO MAKING INFORMED CHOICES
Bringing together the ideal platform with a uniform software tooling distribution mechanism
EOS Cloud
• A tenancy in the JASMIN Unmanaged Cloud (& QMUL RCC)
• Reusing JASMIN web interfaces and user management to provide a custom IaaS software platform
• Each user receives two VMs:
– Bio-Linux
– Ubuntu Docker hosting environment
• Users have total responsibility for the instantiated system
• Accessible through standard remote desktop tools
• But:
– utilising a single scale of resources would be a waste
– can we scale users' virtual services to take account of demand?
Boosting Resource Capabilities
• Users' VMs operate in the native 'Standard' state
– Enough capability to access stored data
– Configure applications and workflows
– Free
• A user may boost a running VM to increased capability
– Enough to run analysis applications on a useful timescale
– Credit consumption only for boosted instances
• Reference datasets available to users through shared storage
| Name      | # Cores | Memory (GB) | Cost (credits/hour) |
|-----------|---------|-------------|---------------------|
| Standard  | 1       | 16          | 0                   |
| Standard+ | 2       | 40          | 1                   |
| Big       | 8       | 140         | 4                   |
| Max       | 16      | 500         | 8                   |
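The credit model above is simple enough to sketch directly. The tier names, core counts, memory sizes and prices come from the table; the function itself is illustrative, not part of the EOS Cloud platform.

```python
# Sketch of the boost credit model implied by the tier table
# (tiers and prices from the slide; the code is illustrative).
TIERS = {
    "Standard":  {"cores": 1,  "memory_gb": 16,  "credits_per_hour": 0},
    "Standard+": {"cores": 2,  "memory_gb": 40,  "credits_per_hour": 1},
    "Big":       {"cores": 8,  "memory_gb": 140, "credits_per_hour": 4},
    "Max":       {"cores": 16, "memory_gb": 500, "credits_per_hour": 8},
}

def boost_cost(tier: str, hours: float) -> float:
    """Credits consumed by running a VM at `tier` for `hours`.

    Only boosted tiers consume credits; 'Standard' is free.
    """
    return TIERS[tier]["credits_per_hour"] * hours

# A day of configuration work at 'Standard' costs nothing...
assert boost_cost("Standard", 8) == 0
# ...while a 3-hour analysis run boosted to 'Big' costs 12 credits.
assert boost_cost("Big", 3) == 12
```

This matches the usage pattern described above: users configure workflows for free at 'Standard', then pay credits only while boosted for the actual analysis run.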
oSwitch: one-line access to other operating systems
• Docker applications, though portable, can feel extremely alien in their usability
• With oSwitch, in contrast, things feel (largely) unchanged:
– Current working directory is maintained.
– User name, uid and gid are maintained.
– Login shell (bash/zsh/fish) is maintained.
– Home directory is maintained (thus all .dotfiles and config files are maintained).
– Read/write permissions are maintained.
– Paths are maintained whenever possible. Thus volumes (external drives, NAS, USB) mounted on the host are available in the container at the same path.
https://github.com/yeban/oswitch
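A sketch of what a session might look like, given the properties listed above (user name, prompt formatting and image tag are illustrative; see the linked repository for actual usage):

```console
$ pwd
/home/alice/project
$ oswitch ubuntu:14.04
alice@ubuntu-14.04:project$ pwd        # same directory inside the container
/home/alice/project
alice@ubuntu-14.04:project$ exit       # back to the host shell
$ pwd
/home/alice/project
```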
Pilot Users
• Citizen data collection project
– Physical samples sent for sequencing to assess microbial diversity
– ~200 sites
• Creating compute pipelines and containers for each OSD in silico analysis
– HPC, Cloud (IaaS & PaaS)
• Portable
– Run the same analysis on different laptops/grids/clouds
• Repeatable/Reproducible
– Same input gives same output, given that the reference databases did not change
• Preservation
– All analysis tools and dependencies are in one image
– Images are simple tar.gz
– Preserving Docker and base images is preserving all analysis
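The preservation point can be made concrete with standard Docker commands (the image name `bio-linux-tools` is hypothetical): `docker save` streams an image as a tar archive, which compresses to the "simple tar.gz" mentioned above, and `docker load` restores it later.

```console
$ docker save bio-linux-tools | gzip > bio-linux-tools.tar.gz   # archive the image
$ gunzip -c bio-linux-tools.tar.gz | docker load                # restore it later
```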
Desktop as a Service for research
• Gives researchers an environment they are confident in, by changing the infrastructure around them
• Provides location-independent persistence of the research environment
• Supports further key usage models such as teaching or online learning
• Gives tool developers a deployment mechanism that already has community visibility
Conclusions
• Abstract the underpinning e-infrastructure services from the users
– In many cases users are not interested in what type of tin they're running on!
– Something run on one resource should be movable to others through the use of standards etc.
• Cloud is (obviously) an enabler for research
– Allowing flexibility in infrastructure hitherto not possible
– User control rather than provider control
– Higher-level services more easily composed and made accessible through marketplaces
• Creating an 'EnvLinux' distribution would create a community-wide software suite across all stakeholder groups
– Support a single point of distribution for all community-relevant software
– Create a simple deposit mechanism to allow new tools and services to join easily
– Support activities, such as teaching, that do not generally sit well with rapidly changing activities such as research software development
Thank you & questions?