Cloud Technical Challenges
Uploaded by: guy-coates
Category: Technology
Transcript of Cloud Technical Challenges
![Page 2: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/2.jpg)
Outline
Background
Cloud Experiences
Barriers
Future Directions
![Page 3: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/3.jpg)
The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus, Cambridge, UK.
Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.
All data is made publicly available.
• Websites, ftp, direct database access, programmatic APIs.
![Page 4: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/4.jpg)
Lost in the clouds...
![Page 5: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/5.jpg)
Victory!
![Page 6: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/6.jpg)
Our Cloud Experiences
![Page 7: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/7.jpg)
Hype Cycle
Awesome!
Just works...
![Page 8: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/8.jpg)
Ensembl
Ensembl is a system for genome annotation.
Data visualisation / mining web services.
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute pipeline (HPTC workload).
• Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
• Software is Open Source (Apache license).
• Data is free for download.
We have web services and HPTC workloads running on IaaS.
![Page 9: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/9.jpg)
Why Cloud?
Web services
• Was hosted in a single datacentre at the Genome Campus, UK.
• 1 datacentre = single point of failure.
• Access slow if you were not in western Europe.
Cloud application
• Build worldwide network of mirrors on IaaS.
HPC
• People want to run the Ensembl HPC pipeline on their own data.
• Requires a skilled bioinformatician to get the software running, and access to an HPC cluster.
Cloud application
• Build HPC SaaS.
• Users deploy ready-to-run Ensembl code on AWS; it self-assembles into an HPC cluster and analyses their data.
![Page 10: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/10.jpg)
Hype Cycle
Web services / Some HPC
![Page 11: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/11.jpg)
That was easy...
![Page 12: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/12.jpg)
Hype cycle
Sequencing informatics
![Page 13: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/13.jpg)
DNA sequencing
![Page 14: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/14.jpg)
Economic Trends:
Cost of sequencing halves every 12 months.
• cf. Moore's Law
The Human Genome Project:
• 13 years.
• 23 labs.
• $500 million.
A human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 10,000s of genomes.
Trend will continue:
• Generation 3 sequencers are on their way.
• $500 genome is probable within 5 years.
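As a rough check on the trend above, the sketch below assumes the cost halves exactly every 12 months, starting from the $10,000 figure. The function names are illustrative and not from the talk.

```python
import math


def projected_cost(start_cost: float, years: float,
                   halving_time_years: float = 1.0) -> float:
    """Cost after `years`, assuming it halves every `halving_time_years`."""
    return start_cost * 0.5 ** (years / halving_time_years)


def years_to_reach(start_cost: float, target_cost: float,
                   halving_time_years: float = 1.0) -> float:
    """Years until the cost falls to `target_cost` under the same assumption."""
    return halving_time_years * math.log2(start_cost / target_cost)


if __name__ == "__main__":
    # Starting point taken from the slide: $10,000 per genome today.
    print(f"Cost after 5 years: ${projected_cost(10_000, 5):.2f}")
    print(f"Years to reach $500: {years_to_reach(10_000, 500):.1f}")
```

Under that idealised assumption the $500 mark is reached a little over four years out, which is consistent with "within 5 years".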
![Page 15: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/15.jpg)
The scary graph
Peak Yearly capillary sequencing: 30 Gbase
Current weekly sequencing: 3000 Gbase
![Page 16: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/16.jpg)
[Figure: Disk Storage in terabytes (0 to 6000) by year, 1994 to 2009]
Managing Growth
We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.
Gigabase of sequence ≠ gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.
Moore's law will not save us.
• Transistor / disk density: Td = 18 months
• Sequencing cost: Td = 12 months
• Sequencing output: Td = 3-6 months
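To make these figures concrete, here is a minimal sketch that converts gigabases to bytes using the 16 bytes per base quoted above, and compares growth under the different doubling times. It assumes decimal units (1 Gbase = 1e9 bases, 1 TB = 1e12 bytes) and treats the 10x intermediate factor as applying to the sequence data. The helper names are illustrative, and the 3000 Gbase weekly figure comes from the earlier "scary graph" slide.

```python
BYTES_PER_BASE = 16        # from the slide
INTERMEDIATE_FACTOR = 10   # from the slide: intermediates need ~10x the raw data


def sequence_storage_bytes(gigabases: float) -> float:
    """Bytes needed to hold `gigabases` of sequence data (decimal units)."""
    return gigabases * 1e9 * BYTES_PER_BASE


def growth_factor(years: float, doubling_time_months: float) -> float:
    """How many times a quantity multiplies over `years` at a given doubling time."""
    return 2 ** (years * 12 / doubling_time_months)


if __name__ == "__main__":
    weekly = sequence_storage_bytes(3000)
    print(f"Sequence data per week: {weekly / 1e12:.0f} TB")
    print(f"With intermediates:     {weekly * INTERMEDIATE_FACTOR / 1e12:.0f} TB")
    for label, td in [("disk density", 18), ("sequencing cost", 12),
                      ("sequencing output (slow)", 6),
                      ("sequencing output (fast)", 3)]:
        print(f"{label}: x{growth_factor(3, td):.0f} over 3 years")
```

Over three years a quantity with an 18 month doubling time grows 4x, while one with a 6 month doubling time grows 64x, which is the gap the slide points at.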
![Page 17: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/17.jpg)
What do you need to do sequencing?
Sample prep
Sequencer
analysis software
LIMS System / Data Tracking
Data repository
External repository
HPC Resource
Integrated compute
![Page 18: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/18.jpg)
What IT do you need to do sequencing?
Sample prep
Sequencer
analysis software
LIMS System / Data Tracking
Data repository
External repository
HPC Resource
Integrated compute
Part covered in the grant
![Page 19: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/19.jpg)
This is really hard...
We have a whole division of HPC specialists, LIMS developers and bioinformaticians.
What about smaller labs with 1 or 2 sequencers?
![Page 20: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/20.jpg)
...and then change it.
Sequencing informatics is massively fluid.
• New chemistry.
• More sequencing machines.
• New analysis software.
Constant cycle of development and deployment.
![Page 21: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/21.jpg)
How can cloud help?
![Page 22: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/22.jpg)
What can we put on the Cloud?
Sample prep
Sequencer
analysis software
LIMS System / Data Tracking
Data repository
External repository
HPC Resource
Integrated compute
![Page 23: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/23.jpg)
Does it Cloud?
How do we decide what to cloud?
Rule of thumb borrowed from HPC.
• Small data / high CPU workloads work better in distributed environments.
IO Bound / Large data
CPU Bound / small data
![Page 24: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/24.jpg)
Sequencing Data
( Raw data (TB) )
Alignments (200 GB)
Sequence + quality data (500 GB)
Variation data (1GB)
Individual features (3MB)
Structured data(databases)
Unstructured data(flat files)
Data size per Genome
Tracking / LIMs (100s Kbytes)
![Page 25: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/25.jpg)
Sequencing Data
( Raw data (TB) )
Alignments (200 GB)
Sequence + quality data (500 GB)
Variation data (1GB)
Individual features (3MB)
Structured data(databases)
Unstructured data(flat files)
Data size per Genome
Cloud Friendly
Cloud Unfriendly
Tracking / LIMs (100s Kbytes)
![Page 26: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/26.jpg)
Can we Cloudify Sequencing?
Sample prep
Sequencer
analysis software
Data repository
External repository
HPC Resource
Integrated compute
LIMS System / Data Tracking
![Page 27: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/27.jpg)
What are the blockers?
HPC infrastructure is now available in the cloud.
• Good enough for 95% of sequencing.
Doing big data is hard:
1. You have to get the data there first.
2. You may not be allowed to put the data there.
![Page 28: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/28.jpg)
Moving data is hard
Tools:
• Standard tools (FTP, ssh/rsync) are not suited to wide-area networks.
• WAN tools: gridFTP / FDT / Aspera.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link):
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1 TB to Dublin.
• 23 hours to move 1 TB to East coast.
What speed should we get?
• Once we leave JANET (UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
Do you have fast enough disks at each end to keep the network full?
Why not just ship disks?
• Logistical nightmare.
• Format issues, corruption, slow.
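The transfer times quoted above follow directly from the measured rates. The sketch below reproduces them, assuming decimal units (1 TB = 1e12 bytes, 1 Mbyte = 1e6 bytes) and a sustained rate with no protocol overhead. The function names are illustrative.

```python
TB = 1e12   # bytes, decimal units assumed
MB = 1e6    # bytes, decimal units assumed

# Measured rates from the slide, in bytes per second.
RATES = {
    "EC2 Dublin": 25 * MB,
    "EC2 East coast": 12 * MB,
}

SITE_LINK_BITS_PER_S = 2e9  # the 2 Gbit/s site link from the slide


def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to move `size_bytes` at a sustained rate."""
    return size_bytes / rate_bytes_per_s / 3600


def link_utilisation(rate_bytes_per_s: float, link_bits_per_s: float) -> float:
    """Fraction of the link capacity actually used by the transfer."""
    return rate_bytes_per_s * 8 / link_bits_per_s


if __name__ == "__main__":
    for destination, rate in RATES.items():
        hours = transfer_hours(1 * TB, rate)
        used = link_utilisation(rate, SITE_LINK_BITS_PER_S)
        print(f"{destination}: {hours:.1f} h per TB, {used:.1%} of the site link")
```

Even the faster Dublin route uses only about a tenth of the 2 Gbit/s site link, which is why the question of what speed to expect matters.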
![Page 29: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/29.jpg)
Networking
How do we improve data transfers across the public internet?
• CERN approach: don't.
• Dedicated networking has been put in between CERN and the T1 centres, who get all of the CERN data.
Can it work for cloud?
• Buy dedicated bandwidth to a provider.
• Ties you in.
• Should they pay?
We need good connectivity to everywhere.
![Page 30: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/30.jpg)
Data Security
![Page 31: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/31.jpg)
Are you allowed to put data on the cloud?
Default policy:
“Our data is confidential/important/critical to our business. We must keep our data on our computers.”
![Page 32: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/32.jpg)
What does “My System” mean?
Purchased computer in my data centre
Leased computer in my data centre
Purchased computer in a co-lo facility
Traditionally outsourced IT service
IaaS on a cloud provider
SaaS on a cloud provider
My System Not my system
Root / Admin Access?
Encrypted/ Non encrypted?
VPN / inside or outside firewall?
Legal / IP agreement in place?
![Page 33: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/33.jpg)
How confidential is the data?
Publicly available genome data
Anonymised datasets (e.g. individual genomes with no identifiers)
Trade Secret / Patentable data
Low Risk High Risk
Personally identifiable datasets
![Page 34: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/34.jpg)
Reasons to be optimistic:
Most (all?) data security issues can be dealt with.
• But the devil is in the details.
• Data can be put on the cloud, if care is taken.
It is probably more secure there than in your own data centre.
• Can you match AWS data availability guarantees?
Are cloud providers different from any other organisation you outsource to?
![Page 35: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/35.jpg)
Outstanding Issues
Audit and compliance:
• If you need IP agreements above your provider's standard T&Cs, how do you push them through?
Geographical boundaries mean little in the cloud.
• Data can be replicated across national boundaries without the end user being aware.
Moving personally identifiable data outside of the EU is potentially problematic.
• (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
• More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).
![Page 36: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/36.jpg)
Private Cloud to the rescue?
Sequencing increasingly takes place in large consortiums.
• E.g. International Cancer Genome Consortium (http://www.icgc.org)
Can we do private clouds within the consortium?
![Page 37: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/37.jpg)
Traditional Collaboration
[Diagram labels: Sequencing Centre + DCC; Sequencing centre (several); IT (several)]
![Page 38: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/38.jpg)
Cloud Collaborations
[Diagram labels: Sequencing Centre; Sequencing centre (several); Private Cloud IaaS / SaaS]
![Page 39: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/39.jpg)
Private Cloud
Advantages:
• LIMS / analysis software easily shared with consortium.
• Small organisations leverage expertise of big IT organisations.
• Academia tends to be linked by fast research networks.
• Moving data is easier.
• Consortium will be signed up to data-access agreements.
• Simplifies data governance.
Problems:
• Big change in funding model.
• Are big centres set up to provide private cloud services?
• Selling services is hard if you are a charity.
• Can we do it as well as the big internet companies?
![Page 40: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/40.jpg)
Cloud data archives
![Page 41: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/41.jpg)
Dark Archives
Storing data in an archive is not particularly useful.
• You need to be able to access the data and do something useful with it.
Data in current archives is “dark”.
• You can put/get data, but cannot compute across it.
• Is data in an inaccessible archive really useful?
![Page 42: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/42.jpg)
Example problem:
“We want to run our pipeline across 100 TB of data currently in EGA/SRA.”
We will need to de-stage the data to Sanger, and then run the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
![Page 43: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/43.jpg)
Cloud / Computable archives
Move the compute to the data.
• Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.
Federated between centres.
• Grid software built on top of cloud components.
• Avoids scaling problems inherent in putting everything in one place.
[Diagram labels: CPU (several); Data; VM]
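The "move the compute to the data" idea can be sketched as a placement decision: run the workload wherever the fewest bytes have to cross the network. This is a minimal illustration only, not the scheduler of any real archive. The `Site` class, the site names and the 10 GB VM image size are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A hypothetical centre with compute attached to the datasets it holds."""
    name: str
    datasets: set = field(default_factory=set)


def bytes_to_move(site, needed, sizes, vm_image_bytes):
    """Bytes shipped to run the workload at `site`: the VM image plus any
    needed dataset that the site does not already hold."""
    missing = sum(sizes[d] for d in needed if d not in site.datasets)
    return vm_image_bytes + missing


def pick_site(sites, needed, sizes, vm_image_bytes):
    """Choose the site that minimises the bytes moved over the network."""
    return min(sites,
               key=lambda s: bytes_to_move(s, needed, sizes, vm_image_bytes))


if __name__ == "__main__":
    TB, GB = 1e12, 1e9
    sizes = {"archive_reads": 100 * TB}   # the 100 TB from the example problem
    vm_image = 10 * GB                    # assumed VM image size, illustrative
    sites = [Site("local cluster"), Site("archive", {"archive_reads"})]
    best = pick_site(sites, ["archive_reads"], sizes, vm_image)
    moved = bytes_to_move(best, ["archive_reads"], sizes, vm_image)
    print(f"Run at: {best.name}, moving {moved / GB:.0f} GB")
```

With these assumed sizes, sending the VM to the archive moves gigabytes rather than the 100 TB needed to de-stage the data first.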
![Page 44: Cloud Technical Challenges](https://reader033.fdocuments.us/reader033/viewer/2022051323/5481bb65b07959600c8b45e6/html5/thumbnails/44.jpg)
Acknowledgements
Sanger
• Phil Butcher
• James Beal
• Pete Clapham
• Simon Kelley
• Gen-Tao Chiang
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
EBI
• Glenn Proctor
• Steve Keenan