What Are Science Clouds?

What is So Special About Science Clouds and Why Does It Matter? November 17, 2013. Robert L. Grossman, University of Chicago, Open Data Group, Open Cloud Consortium

description

This is a talk I gave at Data Cloud 2013 on November 17, 2013, titled "What is So Special About Science Clouds and Why Does It Matter?"

Transcript of What Are Science Clouds?

Page 1: What Are Science Clouds?

What is So Special About Science Clouds and Why Does It Matter?

November 17, 2013

Robert L. Grossman, University of Chicago, Open Data Group

Open Cloud Consortium

Page 2: What Are Science Clouds?

Part 1. Clouds

Page 3: What Are Science Clouds?

In 2011, after several years and 15 drafts, NIST developed a definition of a cloud that is now the standard definition.

Page 4: What Are Science Clouds?

Essential Characteristics of a Cloud

1. Self Service
2. Scale

Page 5: What Are Science Clouds?

Self Service

Page 6: What Are Science Clouds?

Scale

Page 7: What Are Science Clouds?

Cloud Deployment Models

• Public Clouds – Vendors offering cloud services, such as Amazon.

• Private Clouds – Run internally by a company or organization, such as the University of Chicago.

• Community Clouds – Run by a community or organizations (either formally or informally), such as the Open Cloud Consortium.

Page 8: What Are Science Clouds?

How do you measure compute capacity for science clouds?

TB? PB? EB? 100’s? 1,000’s? 10,000’s?

Page 9: What Are Science Clouds?

Another way: think of science clouds as large if you measure them in MW, as in "Facebook’s Prineville Data Center is 30 MW."

opencompute.org

Page 10: What Are Science Clouds?

What about automatic provisioning and infrastructure management?

Page 11: What Are Science Clouds?

This is not a cloud.

Page 12: What Are Science Clouds?

This is a cloud.

Page 13: What Are Science Clouds?

Commercial Cloud Service Provider (CSP)

15 MW Data Center

• 100,000 servers
• 1 PB DRAM
• 100’s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network
• ~1 Tbps egress bandwidth
• 25 operators for 15 MW Commercial Cloud

Page 14: What Are Science Clouds?

Rack / Container Test: The addition of racks / containers of cores and disks is automated and does not require changing the software stack, but afterwards the capacity of the system has increased.

Requirement of a cloud computing infrastructure

Page 15: What Are Science Clouds?

• At good cloud service providers, development and operations are integrated (devops).

• SRE/Devops are considered key personnel.

• For many organizations, system administrators are just performing a service.

• It’s considered a good practice to outsource the service to the lowest cost provider.

Page 16: What Are Science Clouds?

Latency is Difficult

Page 17: What Are Science Clouds?

Essential Characteristics of a Cloud

1. Self Service
2. Scale
3. Infrastructure management and automation
4. Focus on devops

Page 18: What Are Science Clouds?

Part 2. Science Clouds

Page 19: What Are Science Clouds?

Some Examples of the Sizes of Datasets Produced by Instruments

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000’s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html

N.B. This is just the data produced by the instrument itself. The analysis of this data produces significantly more data.
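As a rough check on the figures quoted for this slide, the sizes can be converted to common units. This is only a sketch: it assumes decimal (SI) units and, for LSST, an upper bound of 365 observing nights per year, neither of which is stated in the talk.

```python
# Decimal (SI) storage units in bytes. ASSUMPTION: the talk does not say
# whether decimal or binary units are meant.
GB, TB, PB = 10**9, 10**12, 10**15

# LHC: "more than 15 million Gigabytes of data each year".
lhc_pb_per_year = 15_000_000 * GB / PB

# LSST: "over 15 terabytes of raw astronomical data each night
# (30 terabytes processed)".
# ASSUMPTION: 365 observing nights per year, which is an upper bound.
NIGHTS_PER_YEAR = 365
lsst_raw_pb_per_year = 15 * TB * NIGHTS_PER_YEAR / PB
lsst_processed_pb_per_year = 30 * TB * NIGHTS_PER_YEAR / PB

# Genomics: at 0.5 TB per genome, this many genomes fill one petabyte.
genomes_per_pb = PB / (0.5 * TB)

print(f"LHC:            {lhc_pb_per_year:.2f} PB/year")
print(f"LSST raw:       {lsst_raw_pb_per_year:.2f} PB/year")
print(f"LSST processed: {lsst_processed_pb_per_year:.2f} PB/year")
print(f"Genomes per PB: {genomes_per_pb:.0f}")
```

Under these assumptions the LHC quote matches the 15 PB/year entry directly, while the LSST entry depends on which nightly figure and how many nights are counted.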

Page 20: What Are Science Clouds?

Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services

Page 21: What Are Science Clouds?

What are some of the important differences between commercial and research-focused Sci CSPs?

Page 22: What Are Science Clouds?

Amazon Web Services (AWS)?

• Scale
• Simplicity of a credit card
• Wide variety of offerings.

vs.

Community clouds, science clouds, etc.

• Lower cost (at medium & large scale)
• Some data too important to be stored exclusively in commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security model

It is essential that community science clouds interoperate with public clouds.

Page 23: What Are Science Clouds?

Science Clouds

POV
  Science Clouds: Democratize access to data. Integrate data to make discoveries. Long term archive.
  Commercial Clouds: As long as you pay the bill; as long as the business model holds.

Data & Storage
  Science Clouds: In addition, data intensive computing & HP storage
  Commercial Clouds: Internet style scale out and object-based storage

Flows
  Science Clouds: Large & small data flows
  Commercial Clouds: Lots of small web flows

Accounting
  Science Clouds: Essential
  Commercial Clouds: Essential

Lock in
  Science Clouds: Moving environment between CSPs essential
  Commercial Clouds: Lock in is good

Interop
  Science Clouds: Critical, but difficult
  Commercial Clouds: Customers will drive to some degree

Page 24: What Are Science Clouds?

Essential Services for a Science CSP

• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services

Page 25: What Are Science Clouds?

Datascope – Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services

Cloud Service Operations Center (CSOC)

Page 26: What Are Science Clouds?

Part 3. Open Science Data Cloud

Page 27: What Are Science Clouds?

[Figure: number of projects (10’s, 100’s, 1000’s) versus data size (small, medium to large, very large). Project types: individual scientists & small projects; community based science via Science as a Service; very large projects. Infrastructure types: public infrastructure; shared community infrastructure; dedicated infrastructure.]

Page 28: What Are Science Clouds?

The long tail of data science

A few large data science projects.

Many smaller data science projects.

Page 29: What Are Science Clouds?

Commercial Cloud Service Provider (CSP)

15 MW Data Center

• 100,000 servers
• 1 PB DRAM
• 100’s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network
• ~1 Tbps egress bandwidth
• 25 operators for 15 MW Commercial Cloud

Page 30: What Are Science Clouds?

Open Science Data Cloud

• Cores & Disks (OpenStack, GlusterFS & Hadoop)
• Infrastructure automation & management (Yates)
• Compliance & security (OCM)
• Accounting & billing (Salesforce.com)
• Customer Facing Portal (Tukey)
• Data center network
• ~10-100 Gbps bandwidth
• 6 engineers to operate 0.5 MW Science Cloud

Science Cloud SW & Services

• Virtual Machine (VM) containing common applications & pipelines
• Tukey (OSDC portal & middleware v0.2)
• Yates (infrastructure automation and management v0.1)
• UDR / UDT for high performance data transport
• Interoperate with other clouds (upcoming) and proprietary systems (such as Globus Online).
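The staffing figures on the two provider slides (25 operators for a 15 MW commercial cloud, 6 engineers for a 0.5 MW science cloud) can be compared as megawatts operated per person. The sketch below does only that arithmetic; the function name is illustrative, and the ratio says nothing about how the work itself differs between the two kinds of provider.

```python
def mw_per_operator(megawatts: float, operators: int) -> float:
    """Return the data center power (MW) operated per staff member."""
    if operators <= 0:
        raise ValueError("operators must be positive")
    return megawatts / operators


# Figures quoted in the talk.
commercial = mw_per_operator(15.0, 25)  # commercial CSP
science = mw_per_operator(0.5, 6)       # Open Science Data Cloud

# How many times more power each commercial operator covers.
ratio = commercial / science

print(f"Commercial CSP: {commercial:.3f} MW per operator")
print(f"Science cloud:  {science:.3f} MW per engineer")
print(f"Ratio:          {ratio:.1f}x")
```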

Page 31: What Are Science Clouds?

The Open Science Data Cloud (OSDC) is a production 5 PB*, 7500 core, wide area 10G cloud.

*10 PB raw storage.

www.opensciencedatacloud.org

Page 32: What Are Science Clouds?

www.opencloudconsortium.org

• U.S. based not-for-profit corporation.

• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.

• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud.

• Manages cloud computing testbeds: Open Cloud Testbed.

Page 33: What Are Science Clouds?

www.opencloudconsortium.org

• Companies: Cisco, Yahoo!, Infoblox, …

• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, LLNL, University of Illinois at Chicago, …

• Federal agencies and labs: NASA, LLNL, …

• International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …

Page 34: What Are Science Clouds?

Science Cloud

• Earth sciences
• Biological sciences
• Social sciences
• Digital humanities
• ACL, groups, etc.

Biomedical Cloud

• Designed to hold Protected Health Information (PHI), e.g. genomic data, electronic medical records, etc. (HIPAA, FISMA)

Page 35: What Are Science Clouds?

What You Get with the OSDC

• Login with your university credentials via InCommon

• Launch virtual machines, virtual clusters, access to large Hadoop clusters, etc.

• Access PB+ of open and protected data

• Manage files, collections of files, collections of collections

• Manage users, groups of users

• Manage accounts, sub-accounts

• Efficient transfer of large data (UDT, UDR)

Page 36: What Are Science Clouds?

Our Point of View

• We want to develop as little technology and software as possible – we want others to develop software and technology.

• We focus on providing researchers the ability to compute over large and very large datasets.

• We need open source solutions.

• We can interoperate with proprietary solutions.

• We are working to make interoperation with AWS seamless.

• Run lights out over multiple data centers connected with 10G (soon 100G) networks.

Page 37: What Are Science Clouds?

OSDC Cloud Services Operations Center (CSOC)

• The OSDC operates a Cloud Services Operations Center (or CSOC).

• It is a CSOC focused on supporting Science Clouds for researchers.

Page 38: What Are Science Clouds?

OSDC Racks

• How quickly can we set up a rack?

• How efficiently can we operate a rack? (racks/admin)

• How few changes do our software stack and operations require when we add new racks?

2013 OSDC rack design

• 1 PB / rack
• 1150 cores / rack
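Because capacity is added a rack at a time, total capacity under the 2013 rack design is a simple multiple of the rack count. The sketch below illustrates that with a hypothetical rack count; the class and function names are not from the OSDC software, and real deployments may mix rack generations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackDesign:
    """Per-rack capacity of one rack design."""
    petabytes: float
    cores: int


# 2013 OSDC rack design as quoted on the slide.
OSDC_2013_RACK = RackDesign(petabytes=1.0, cores=1150)


def capacity(design: RackDesign, racks: int) -> tuple[float, int]:
    """Return (storage in PB, core count) for a number of identical racks."""
    if racks < 0:
        raise ValueError("racks must be non-negative")
    return design.petabytes * racks, design.cores * racks


# HYPOTHETICAL example: four racks of the 2013 design.
storage_pb, cores = capacity(OSDC_2013_RACK, 4)
print(f"4 racks: {storage_pb:.0f} PB, {cores} cores")
```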

Page 39: What Are Science Clouds?

Tukey

• Tukey (based in part on Horizon).

• We have factored out digital ID service, file sharing, and transport from the Bionimbus and Matsu Projects.

Page 40: What Are Science Clouds?

Yates

• Automated installation of the OSDC software stack on a rack of computers.

• Based upon Chef

• Version 0.1

Page 41: What Are Science Clouds?

UDR

• UDT is a high performance network transport protocol

• UDR = rsync + UDT

• It is easy for an average systems administrator to keep 100’s of TB of distributed data synchronized.

• We are using it to distribute c. 1 PB from the OSDC
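To give a sense of why transport efficiency matters at this scale, the sketch below estimates how long moving 1 PB takes at the 10G and 100G link speeds mentioned in the talk. It assumes decimal units and an ideal link by default; the `efficiency` parameter is a hypothetical knob for achieved throughput, not a measured UDR or UDT figure.

```python
SECONDS_PER_DAY = 86_400
PB = 10**15    # bytes, decimal units (assumption)
GBPS = 10**9   # bits per second


def transfer_days(size_bytes: float, link_bps: float, efficiency: float = 1.0) -> float:
    """Estimate transfer time in days.

    efficiency is the fraction of the link rate actually achieved
    (1.0 means a perfectly utilized link, which is an idealization).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bps * efficiency)
    return seconds / SECONDS_PER_DAY


days_10g = transfer_days(1 * PB, 10 * GBPS)
days_100g = transfer_days(1 * PB, 100 * GBPS)

print(f"1 PB over 10 Gbps:  {days_10g:.2f} days at full line rate")
print(f"1 PB over 100 Gbps: {days_100g:.2f} days at full line rate")
```

Any shortfall in achieved throughput stretches these times proportionally, for example halving the efficiency doubles the transfer time.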

Page 42: What Are Science Clouds?

Bionimbus Protected Data Cloud

Page 43: What Are Science Clouds?

Analyzing Data From The Cancer Genome Atlas (TCGA)

Current Practice

1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate secure compliant computing environment to manage 10 – 100+ TB of data.
3. Get environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CG-Hub (takes days to weeks).
6. Begin analysis.

With Protected Data Cloud (PDC)

1. Apply to dbGaP for access to data.
2. Use your existing NIH grant eRA credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.

Page 44: What Are Science Clouds?

OCC Project Matsu Clouds to Support Earth Science


matsu.opensciencedatacloud.org  

Page 45: What Are Science Clouds?

Biomedical Community Cloud

Example: Open Cloud Consortium’s Biomedical Commons Cloud (BCC)

[Diagram labels: Cloud for Public Data; Cloud for Controlled Genomic Data; Cloud for EMR, PHI, data; Medical Research Center A; Medical Research Center B; Medical Research Center C; Hospital D; Company E]

Page 46: What Are Science Clouds?

Part 4. Cloud Condos

Page 47: What Are Science Clouds?

Cyber Condo Model

• Research institutions today have access to high performance networks – 10G & 100G.

• They couldn’t afford access to these networks from commercial providers.

• Over a decade ago, they got together to buy and light fiber.

• This changed how we do scientific research.

Page 48: What Are Science Clouds?

Cloud Condos

• The Open Cloud Consortium’s Burnham Facility (in planning) is a Cloud Condo model.

• This infrastructure provides a sustainable home for large commons of research data (and an infrastructure to compute over it).

• Please join us.

Page 49: What Are Science Clouds?

Some Data Commons Guidelines for the Next Five Years

• There is a societal benefit when research data is available in data commons operated by a NFP (vs sold exclusively as data products by commercial entities or only offered for download by the USG).

• Large data commons providers should peer.

• Data commons providers should develop standards for interoperating.

• Standards should not be developed ahead of open source reference implementations.

• We need a period of experimentation as we develop the best technology and practices.

• The details are hard (consent, publication, IDs, open vs controlled access, sustainability, etc.)

Page 50: What Are Science Clouds?

Working with the OSDC - CSP

• If you have a cloud, please interoperate it with the OSDC.

• Work with us to design and prototype standards so that Science Clouds and Science Data Commons can interoperate.
  – Data synchronization between two clouds
  – APIs to access data
  – RESTful queries
  – Scattering queries, gathering the results
  – Coordinated analysis
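The "scattering queries, gathering the results" item can be pictured with a small sketch. Everything here is hypothetical: the in-memory catalogs stand in for the data APIs of two interoperating clouds, and the function names are illustrative, not part of any OSDC interface.

```python
from concurrent.futures import ThreadPoolExecutor

# HYPOTHETICAL in-memory catalogs standing in for the data APIs of two clouds.
CATALOGS = {
    "cloud_a": [{"id": "a1", "size_tb": 0.5}, {"id": "a2", "size_tb": 12.0}],
    "cloud_b": [{"id": "b1", "size_tb": 3.0}, {"id": "b2", "size_tb": 0.1}],
}


def query_cloud(cloud: str, min_size_tb: float) -> list[dict]:
    """Run the query against one cloud and tag each hit with its origin."""
    return [
        {**record, "cloud": cloud}
        for record in CATALOGS[cloud]
        if record["size_tb"] >= min_size_tb
    ]


def scatter_gather(min_size_tb: float) -> list[dict]:
    """Send the same query to every cloud concurrently and merge the results."""
    clouds = sorted(CATALOGS)
    with ThreadPoolExecutor(max_workers=len(clouds)) as pool:
        # Scatter: submit one query per cloud.
        futures = [pool.submit(query_cloud, cloud, min_size_tb) for cloud in clouds]
        # Gather: collect partial results in a fixed cloud order.
        return [record for future in futures for record in future.result()]


results = scatter_gather(1.0)
for record in results:
    print(record["cloud"], record["id"], record["size_tb"])
```

In a real setting each `query_cloud` call would be a network request to another provider, which is where agreed APIs and query standards become necessary.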

Page 51: What Are Science Clouds?

OSDC Software Ecosystem

[Diagram labels: AWS; Globus Online; CSP A; Medical Research Center B; Hospital D; University E; Startup F; Startup G; Bionimbus; OpenStack; Hadoop; Tukey R; UDT; GlusterFS]

Page 52: What Are Science Clouds?

Working with the OSDC - Researchers

• Apply for an account and make a discovery
• Add data to the OSDC
• Add your software to the OSDC
• Suggest someone else’s data to add
• Suggest someone else’s software to add

Page 53: What Are Science Clouds?

Data Commons

[Diagram labels: EO1; TCGA; CSP A; Medical Research Center B; Hospital D; University E; Startup F; Startup G; urban sciences data; 1000 Genomes; EMR; census; social sciences data; earth cube data; Bookworm]

Page 54: What Are Science Clouds?

Questions?

Page 55: What Are Science Clouds?

Thank You!

Page 56: What Are Science Clouds?

For more information

• @bobgrossman
• You can find more information on my blog: rgrossman.com.
• You can find more of my talks on: slideshare.net/rgrossman

Center for Research Informatics

Page 57: What Are Science Clouds?

Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:

• The Bionimbus Protected Data Cloud is supported in part by NIH/NCI through NIH/SAIC Contract 13XS021 / HHSN261200800001E.

• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.

• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.

• The OSDC is supported by a 5-year (2010-2016) PIRE award (OISE – 1129076) to train scientists to use the OSDC and to further develop the underlying technology.

• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.

• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, NIH or other funders of this research.

The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].

Page 58: What Are Science Clouds?

Please join us!

www.opensciencedatacloud.org

www.opencloudconsortium.org