e-infrastructural needs to support informatics

e-Infrastructural needs to support informatics* David Wallom


What is e-Infrastructure?

The integration of digitally-based technology, resources, facilities, and services combined with people and organizational structures needed to support modern, collaborative research (and teaching).

1. Data and Storage
2. Software (and Algorithms)
3. Hardware (Compute)
4. Networks
5. Security and authentication
6. People (Collaboration, Skills, Capacity)
7. The Digital Library

Bioinformatics software challenges

• This brings an onslaught of new challenges for bioinformatics:
– projects that used to require teams of 500 are now accessible to small teams
– but biology curricula (i.e. biologists) still lack computational skills
– thus biologists are overwhelmed by large amounts of data
– furthermore, data types are young, so software is young, thus
• software may be badly built (by biologists with no formal software development training or experience)
• software needs to be frequently updated (bug fixes, algorithmic improvements in sensitivity/specificity, support for new data types)

changes everything for biology

ARCHER

• UK National Supercomputing Service
• Replacement for HECToR
• LINPACK = 1.359 Pflop/s
• EPSRC is the managing partner on behalf of RCUK
• NERC is the other partner research council
• Cray XC30 hardware
• Nodes based on 2× Intel Ivy Bridge 12-core processors
• 64 GB (or 128 GB) memory per node
• 3008 nodes in total (72192 cores)
• Linked by Cray Aries interconnect (dragonfly topology)
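As a quick sanity check on the node figures above, the sketch below simply multiplies them out. The variable names are illustrative, and the memory total assumes every node is counted at the base 64 GB configuration (the slide does not say how many nodes have 128 GB), so it should be read as a lower bound.

```python
# Figures quoted on the slide for the ARCHER Cray XC30.
NODES = 3008
SOCKETS_PER_NODE = 2       # 2x Intel Ivy Bridge processors per node
CORES_PER_SOCKET = 12
BASE_MEMORY_GB = 64        # assumption: every node counted at the base 64 GB

cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
total_cores = NODES * cores_per_node

# Lower bound on aggregate memory, since some nodes carry 128 GB instead.
min_total_memory_gb = NODES * BASE_MEMORY_GB

print(f"{cores_per_node} cores per node, {total_cores} cores in total")
print(f"at least {min_total_memory_gb} GB of memory across the machine")
```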

[Figure: JASMIN Cloud Architecture. Labelled components: External Network inside JASMIN; JASMIN Internal Network; Managed Cloud (PaaS, SaaS); Unmanaged Cloud (IaaS, PaaS, SaaS); Panasas storage; Lotus Batch Compute; JASMIN Analysis Platform; Science Analysis VMs; JASMIN Cloud Management Interfaces; Appliance Catalogue; Firewall and Firewall + NAT; standard remote access protocols (ftp, http, …); direct file system access; direct access to batch processing cluster. Tenancies shown: Project1-org, optirad-org, eos-cloud-org. Other VM labels: IPython JupyterHub VM; IPython Slave VM; File Server VM; IPython Notebook VM with access to the cluster through IPython.parallel; EOSCloud File Server VM; EOSCloud Fat Node; EOSCloud Desktop as a Service with dynamic RAM boost.]

Thanks to Phil Kershaw

OTHER PUBLIC CLOUD PROVIDERS ARE AVAILABLE

Bio-Linux: A scalable solution

• Comprehensive, free bioinformatics workstation based on Ubuntu Linux and Debian Med
• 11 years & 8 major releases
• Around 8000 users from 1600 locations
• 200+ bioinformatics packages, including big integrative tools: QIIME, Galaxy Server, PredictProtein, EMBOSS, ...
• Incorporates all software

Dual Boot, Linux Live, Local Servers, Cloud

Docker, simplifying the portability of applications and services

EOS Cloud

• A tenancy in the JASMIN Unmanaged Cloud (& QMUL RCC)
• Reusing JASMIN web interfaces and user management to provide a custom IaaS software platform
• Each user receives two VMs
– Bio-Linux
– Ubuntu Docker hosting environment
• Users have total responsibility for the instantiated system
• Accessible through standard remote desktop tools
• Scalability limited by support available

Why Cloud?

• Data sets can be too big or restricted to easily move
– move the compute to the data
– researcher work patterns are maintained
• Tools such as Bio-Linux, Docker etc. are community enablers
• More efficient use of shared resources
• Central maintenance of infrastructure
• Central management of data sharing agreements possible
• Lower barrier to entry (compared to traditional HPC and Grid)
• What type of cloud?
• What role for traditional HPC?
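To make the "too big to easily move" point concrete, here is a rough back-of-the-envelope sketch. The data set sizes and link speeds are made-up illustrative values, not figures from the talk, and the hypothetical `transfer_days` helper ignores protocol overhead, contention and storage speed unless an `efficiency` factor is supplied.

```python
def transfer_days(size_tb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move size_tb terabytes over a link of link_gbit_per_s gigabits/s.

    efficiency is the fraction of the nominal link rate actually achieved
    (1.0 means the full nominal rate, which is optimistic).
    """
    size_bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = size_bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86400                              # seconds per day


# Illustrative scenarios only: (data set size in TB, link speed in Gbit/s).
for size_tb, link in [(1, 1), (100, 1), (100, 10)]:
    print(f"{size_tb} TB over {link} Gbit/s: {transfer_days(size_tb, link):.2f} days")
```

Even at the full nominal rate, the larger hypothetical data set takes days to copy, which is the practical argument for running the analysis where the data already sits.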

TRAINING IS KEY TO MAKING INFORMED CHOICES

EOS Cloud next?

• Expand currently available resource beyond current limitations?

• Create deployable machine image for other cloud marketplaces

• EOS/institutional badging to give users confidence in quality

Pilot Users

• CEH Bioinformaticians using the EOS Cloud to study patterns in microbial biodiversity

• Genomic and transcriptomic data from fish toxicogenomics studies at Exeter


Pilot Users

• Creating compute pipelines and containers for each OSD in silico analysis
– HPC, Cloud (IaaS & PaaS)
– Portable
• Run same analysis on different laptops/grids/clouds
– Repeatable/Reproducible
• Same input gives same output given that reference databases did not change
– Preservation
• All analysis tools and dependencies are in one image
• Images are simple tar.gz
• Preserving Docker and base images is preserving all analysis
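The "same input gives same output" property can be checked mechanically by fingerprinting results. The sketch below is only an illustration of that idea: `run_analysis` is a hypothetical stand-in for a containerised analysis step (it is not part of the OSD pipelines), and the sequences and reference database are toy values.

```python
import hashlib
import json


def run_analysis(sequences, reference_db):
    """Hypothetical analysis step: count reads that exactly match each reference."""
    return {name: sum(seq == ref for seq in sequences)
            for name, ref in reference_db.items()}


def fingerprint(result):
    """SHA-256 digest of a result, serialised with sorted keys so ordering cannot matter."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_reproducible(sequences, reference_db, runs=2):
    """True if repeated runs on the same inputs give byte-identical results."""
    digests = {fingerprint(run_analysis(sequences, reference_db)) for _ in range(runs)}
    return len(digests) == 1


reads = ["ACGT", "TTGA", "ACGT"]
db_v1 = {"refA": "ACGT", "refB": "TTGA"}
db_v2 = {"refA": "ACGT", "refB": "GGGG"}   # a changed reference database

print(is_reproducible(reads, db_v1))
print(fingerprint(run_analysis(reads, db_v1)) == fingerprint(run_analysis(reads, db_v2)))
```

Storing such a digest alongside the preserved image and the reference database version gives a simple way to confirm later that a rerun produced the same result.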

Software Sustainability Institute

www.software.ac.uk

The Software Sustainability Institute

A national facility for cultivating better, more sustainable, research software to enable world-class research

• Software reaches boundaries in its development cycle that prevent improvement, growth and adoption
• Providing the expertise and services needed to negotiate to the next stage
• Developing the policy and tools to support the community developing and using research software

Supported by RCUK

[Figure: Software Sustainability Institute activities. Labels: Software, Policy, Training, Community, Communication, Consultancy, Research, Courses, Workshops, Fellowship, Advice, Guides, Campaigns, Website & blog.]

• Consultancy: 41 projects, 92 evaluations
• 4 surgeries
• 33 UK SWC workshops
• 1000+ learners
• 50,000 readers
• 41 domain ambassadors
• 20+ workshops organised
• 740 researchers
• 50,000 grants analysed
• 150+ contributed articles
• 19,000 unique visitors per month
• 272 RSEs engaged
• 1700 signatures
• 13 issues highlighted


The end of the beginning, not the beginning of the end!

• A holistic approach is required with all parts of e-infrastructure supported from the Hard to Soft to Wet!

• Good start-up investments need continuity to ensure impact
– Certain tools are foundations upon which large swathes of the community depend
– Putting tools next to immovable data ensures value!
• Integrating with larger activities ensures benefits of scaling
– you can’t steer something you’re not involved with…
• Abstract underpinning e-infrastructure services from the users, as they’re not interested!
– Something run on one resource should be movable to others through the use of standards etc!
– I have ignored the institutional resources…

****WARNING**** Institute for Environmental Analytics Summer School on e-infrastructure for the environment, 19th – 22nd Sept ’16, Oxford.