Using the Open Science Data Cloud for Data Science Research Robert Grossman University of Chicago Open Cloud Consortium June 17, 2013


Page 1: Using&the&Open&Science&DataCloud&& …ciara.fiu.edu/osdc-overview-13-v3.pdfUsing&the&Open&Science&DataCloud&& for&DataScience&Research& RobertGrossman& University&of&Chicago& OpenCloudConsorum

Using the Open Science Data Cloud for Data Science Research

Robert Grossman
University of Chicago

Open Cloud Consortium
June 17, 2013


Data: 1 PB of OSDC data across several disciplines
+ Instrument: 3000 cores / 5 PB OSDC science cloud
+ Team: you and your colleagues
+ correlation algorithms

Discoveries


Part 1: What Instrument Do We Use to Make Big Data Discoveries?

How do we build a “datascope”?


What is big data?

TB? PB? EB?

W? KW? MW?


An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
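A minimal sketch of this definition, assuming an idealized workload with no coordination overhead: if each rack scans only its own data, the runtime stays constant as racks and data grow together. The per-rack capacity and throughput figures below are hypothetical and do not come from the talk.

```python
def runtime_hours(total_data_tb: float, racks: int, tb_per_rack_hour: float) -> float:
    """Time to scan total_data_tb when the racks work in parallel on equal shares."""
    if racks <= 0:
        raise ValueError("racks must be positive")
    return total_data_tb / (racks * tb_per_rack_hour)


# Hypothetical figures: each rack holds 100 TB and scans 10 TB per hour.
TB_PER_RACK = 100.0
TB_PER_RACK_HOUR = 10.0

for racks in (1, 2, 4, 8):
    data_tb = racks * TB_PER_RACK  # data grows with the number of racks
    hours = runtime_hours(data_tb, racks, TB_PER_RACK_HOUR)
    print(f"{racks} rack(s), {data_tb:.0f} TB: {hours:.1f} hours")
```

In this idealized model every configuration takes the same time. A real system would add shuffle and coordination costs, which is what the "big-data scalable" test is meant to expose.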


Commercial Cloud Service Provider (CSP)
15 MW Data Center

• 100,000 servers
• 1 PB DRAM
• 100’s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network
• ~1 Tbps egress bandwidth

25 operators for 15 MW Commercial Cloud


OSDC’s vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.


Some Examples of Big Data Science

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000’s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
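The yearly rates in the table can be related to the footnoted figures with a little unit arithmetic. The sketch below assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and, as a deliberate upper bound that is not stated in the source, 365 observing nights per year for LSST.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10^6 GB
TB_PER_PB = 1_000      # decimal units: 1 PB = 10^3 TB


def lhc_pb_per_year(gb_per_year: float = 15_000_000) -> float:
    """Convert the quoted '15 million Gigabytes' per year into PB per year."""
    return gb_per_year / GB_PER_PB


def lsst_pb_per_year(tb_per_night: float, nights_per_year: int = 365) -> float:
    """Yearly volume from a nightly rate; nights_per_year is an assumed upper bound."""
    return tb_per_night * nights_per_year / TB_PER_PB


print(f"LHC: {lhc_pb_per_year():.1f} PB/year")
print(f"LSST raw: {lsst_pb_per_year(15):.2f} PB/year")
print(f"LSST processed: {lsst_pb_per_year(30):.2f} PB/year")
```

Under these assumptions the LHC figure matches the table exactly, and the processed LSST rate comes out at the same order of magnitude as the 12 PB/year entry.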


One large instrument

Many smaller instruments


Part 2: What Is a Cloud and Why Do We Care?


There Are Two Essential Characteristics of a Cloud

1. Self service
2. Scale

• Clouds enable you to compute over large amounts of data without the necessity of first downloading the data.
• Clouds can be designed to be secure and compliant.


Self  Service  


Scale  


Types of Clouds

• Public Clouds
  – Amazon
• Private Clouds
  – Run internally by universities or companies
• Community Clouds
  – Run by organizations (either formally or informally), such as the Open Cloud Consortium


Amazon Web Services (AWS)?

• Scale
• Simplicity of a credit card
• Wide variety of offerings

vs.

Community clouds, science clouds, etc.

• Lower cost (at medium scale)
• Data too important for commercial cloud
• Computing over scientific data is a core competency
• Can support any required governance / security

OCC supports AWS interop and bursting when permissible.


Science Clouds

POV
• NFP Science Clouds: Democratize access to data. Integrate data to make discoveries. Long term archive.
• Commercial Clouds: As long as you pay the bill; as long as the business model holds.

Data & Storage
• NFP Science Clouds: Data intensive computing & HP storage
• Commercial Clouds: Internet style scale out and object-based storage

Flows
• NFP Science Clouds: Large & small data flows
• Commercial Clouds: Lots of small web flows

Streams
• NFP Science Clouds: Streaming processing required
• Commercial Clouds: NA

Accounting
• NFP Science Clouds: Essential
• Commercial Clouds: Essential

Lock in
• NFP Science Clouds: Moving environment between CSPs essential
• Commercial Clouds: Lock in is good

Interop
• NFP Science Clouds: Critical, but difficult
• Commercial Clouds: Customers will drive to some degree


Essential Services for a Science CSP

• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services


Datascope – Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services


Cloud Services Operations Centers (CSOC)

• The OSDC operates a Cloud Services Operations Center (CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center (NOC).
• Both are an important part of cyber infrastructure for big data science.


Datascope – Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services

Cloud Service Operations Center (CSOC)


Part 3: Data Science


Foundations of data science

General and discipline specific software applications and tools

Models and algorithms

Establish best practices, strategies for data science in general and discipline specific data science in particular

Analytic infrastructure

Data


What are the foundations for data science?


Theory to Big Data Spectrum

No data → Small data → Medium data → Big data

Mathematical theorems → Traditional statistical modeling → (Semi-)Automating statistical modeling → Simple counts and statistics over big data

GB → TB → PB

OSDC Datascope: 0.5-2.0 MW


Part 4: The Open Science Data Cloud

www.opensciencedatacloud.org


2013 Open Science Data Cloud (IaaS)

• 5 PB 2013 (OpenStack & GlusterFS)
• Infrastructure automation & management (Yates)
• Compliance & security (OpenFISMA)
• Accounting & billing (Salesforce.com)
• Customer Facing Portal (Tukey)
• Data center network
• ~10-100 Gbps bandwidth

5 engineers to operate 0.5 MW Science Cloud

Science Cloud SW & Services

• Virtual Machine (VM) containing common applications & pipelines
• Tukey (OSDC portal & middleware v0.3)
• Yates (infrastructure automation and management v0.1)


Tukey

• Tukey is based in part on Horizon.
• We have factored out the digital ID service, file sharing, and transport from Bionimbus and Matsu.


Yates

• Automated installation of the OSDC software stack on a rack of computers.
• Based upon Chef
• Version 0.1


UDR

• UDT is a high performance network transport protocol.
• UDR = rsync + UDT
• UDR makes it easy for an average systems administrator to keep 100’s of TB of distributed data synchronized.
• We are using it to distribute c. 1 PB from the OSDC.
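Since UDR wraps rsync, a synchronization job looks like an ordinary rsync command with `udr` in front. The sketch below only builds and prints such a command line. It assumes the `udr rsync <rsync options> <source> <destination>` invocation form described in the UDR project's README, and the host name and paths are hypothetical.

```python
import shlex


def udr_rsync_command(source: str, destination: str, rsync_flags: str = "-av") -> list[str]:
    """Build an argv list of the assumed form: udr rsync <flags> <source> <destination>."""
    return ["udr", "rsync", *shlex.split(rsync_flags), source, destination]


# Hypothetical mirror host and paths, for illustration only.
cmd = udr_rsync_command("/data/public/", "mirror.example.org:/data/public/")
print(shlex.join(cmd))
```

Building the argument list separately from running it makes the command easy to inspect or log before handing it to something like `subprocess.run`.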


Open Science Data Cloud Services

• Digital ID services
• Data sharing services
• Data transport services (UDR)
• What other core services are essential?
• Of course, working groups and applications always add their own services
• These core services will hopefully make the OSDC attractive as a platform (PaaS) for scientific discovery.


www.opencloudconsortium.org

• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Commons Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.

 


OCC Members & Partners

• Companies: Cisco, Yahoo!, Intel, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA
• International Partners: Univ. Edinburgh, AIST (Japan), Univ. Amsterdam, …
• Partners: National Lambda Rail


Third party open source software
+ Open source software developed by the OCC (Tukey, Yates) and open standards
+ Data center
+ Data with permissions
+ Authorization of users' access to data
+ Policies, procedures, controls, etc.
+ Governance, legal agreements
+ Sustainability model


Part 5: OSDC Data


OSDC Public Data Sets

• Over 800 TB of open access data in the OSDC
• Earth sciences data
• Biological sciences data
• Social sciences data
• Digital humanities


Part 6: OSDC Working Groups

Just look around you


Matsu Working Group: Clouds to Support Earth Science


matsu.opensciencedatacloud.org  


Matsu Architecture

• Hadoop HDFS
• Matsu Web Map Tile Service (WMTS)
• Matsu MR-based Tiling Service
• NoSQL Database
• Images at different zoom layers suitable for OGC Web Mapping Server
• Level 0, Level 1 and Level 2 images
• MapReduce used to process Level n to Level n+1 data and to partition images for different zoom levels
• Analytic Services: NoSQL-based Analytic Services, Streaming Analytic Services, MR-based Analytic Services
• Storage for WMS tiles and derived data products
• Presentation Services
• Web Coverage Processing Service (WCPS)
• Workflow Services


Hadoop-Based Re-Analysis

Zoom Level 1: 4 images
Zoom Level 2: 16 images
Zoom Level 3: 64 images
Zoom Level 4: 256 images
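The image counts above are consistent with a tile pyramid in which each zoom level splits every image into four. A minimal sketch of that relationship follows. Extending the pattern beyond the four levels listed is an assumption, and the function name is illustrative only.

```python
def tiles_at_zoom(level: int) -> int:
    """Number of images at a zoom level when each level quadruples the previous one."""
    if level < 0:
        raise ValueError("zoom level must be non-negative")
    return 4 ** level


for level in range(1, 5):
    print(f"Zoom Level {level}: {tiles_at_zoom(level)} images")
```

Because the count grows by a factor of four per level, regenerating deep zoom levels is the kind of embarrassingly parallel work that a MapReduce tiling job handles well.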


Bionimbus Working Group

bionimbus.opensciencedatacloud.org (biological data)


Bionimbus  Protected  Data  Cloud  


Analyzing Data From The Cancer Genome Atlas (TCGA)

Current Practice

1. Apply to dbGaP for access to data.
2. Hire staff, set up and operate a secure, compliant computing environment to manage 10 – 100+ TB of data.
3. Get environment approved by your research center.
4. Set up analysis pipelines.
5. Download data from CGHub (takes days to weeks).
6. Begin analysis.

With Protected Data Cloud (PDC)

1. Apply to dbGaP for access to data.
2. Use your eRA Commons credentials to log in to the PDC, select the data that you want to analyze, and the pipelines that you want to use.
3. Begin analysis.


One Million Genomes

• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
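The back-of-the-envelope figures above can be reproduced in a few lines. This sketch assumes decimal units (1 PB = 1,000 TB, 1 EB = 1,000 PB). The 10x compression ratio is not stated in the talk; it is the ratio implied by going from 1000 PB to about 100 PB.

```python
TB_PER_PB = 1_000
PB_PER_EB = 1_000


def cohort_estimate(genomes: int = 1_000_000,
                    tb_per_genome: float = 1.0,
                    compression_ratio: float = 10.0,  # implied by the slide, not stated
                    usd_per_genome: float = 1_000.0) -> dict:
    """Storage and sequencing-cost estimate for a cohort of sequenced genomes."""
    raw_pb = genomes * tb_per_genome / TB_PER_PB
    return {
        "raw_pb": raw_pb,
        "raw_eb": raw_pb / PB_PER_EB,
        "compressed_pb": raw_pb / compression_ratio,
        "sequencing_cost_usd": genomes * usd_per_genome,
    }


for name, value in cohort_estimate().items():
    print(f"{name}: {value:,.0f}")
```

Changing `tb_per_genome` or `compression_ratio` shows how sensitive the storage requirement is to those two inputs.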


Big data driven discovery on 1,000,000 genomes and 1 EB of data.

• Genomic-driven diagnosis
• Improved understanding of genomic science
• Genomic-driven drug development
• Precision diagnosis and treatment. Preventive health care.


Biomedical Commons Cloud (BCC) Working Group

Example: Open Cloud Consortium’s Biomedical Commons Cloud (BCC)

• Cloud for Public Data
• Cloud for Controlled Genomic Data
• Cloud for EMR, PHI, data

• Medical Research Center A
• Medical Research Center B
• Medical Research Center C
• Hospital D


Resource: Open Science Data Cloud (OSDC)
• Who uses: Pan science data for researchers
• Who operates: Open Cloud Consortium (OCC) supported by University OCC members

Resource: Biomedical Commons Clouds (BCC)
• Who uses: (International) biomedical researchers
• Who operates: OCC Biomedical Commons Cloud Working Group supported by OCC University members

Resource: Bionimbus Protected Data Cloud
• Who uses: Genomics researchers
• Who operates: University of Chicago supported by the OCC


OpenFlow-Enabled Hadoop WG

• When running Hadoop, some map and reduce jobs take significantly longer than others.
• These are stragglers and can significantly slow down a MapReduce computation.
• Stragglers are common (dirty secret about Hadoop)
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.
• We have a testbed for a wide area version of this project.
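To see why a single straggler matters, consider a simplified model in which a map or reduce phase finishes only when its slowest task finishes. The sketch below uses made-up task durations and does not model OpenFlow or bandwidth allocation. It only illustrates the slowdown.

```python
import statistics


def phase_time(task_seconds: list[float]) -> float:
    """A phase completes when its slowest task completes (simplified model)."""
    return max(task_seconds)


def straggler_slowdown(task_seconds: list[float]) -> float:
    """Ratio of the slowest task to the median task; 1.0 means no straggler."""
    return max(task_seconds) / statistics.median(task_seconds)


# Hypothetical phase: 99 tasks take 60 s each and one straggler takes 300 s.
tasks = [60.0] * 99 + [300.0]
print(f"phase time: {phase_time(tasks):.0f} s")
print(f"slowdown vs. median task: {straggler_slowdown(tasks):.1f}x")
```

In this toy case one slow task out of a hundred makes the whole phase five times longer than a typical task, which is the motivation for giving stragglers extra bandwidth.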


OSDC PIRE Project

We select OSDC PIRE Fellows (US citizens or permanent residents):

• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.

Nominate your favorite scientist as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)


Part 7: Key Questions for This Workshop


• Question 1. How can we add partner sites at other locations that extend the OSDC? In particular, how can we extend the OSDC to sites around the world? How can the OSDC interoperate with other science clouds?

• Question 2. What data can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?

• Question 3. How can we build a plugin structure so that Tukey can be extended by other users and by other communities?

• Question 4. What tools and applications can we add to the OSDC to facilitate data intensive cross-disciplinary discoveries?

• Question 5. How can we better integrate digital IDs and file sharing services into the OSDC?

• Question 6. What are 3-5 grand challenge questions that leverage the OSDC?


Questions


Robert Grossman is a faculty member at the University of Chicago. He is the Chief Research Informatics Officer for the Biological Sciences Division, a Faculty Member and Senior Fellow at the Computation Institute and the Institute for Genomics and Systems Biology, and a Professor of Medicine in the Section of Genetic Medicine. His research group focuses on big data, biomedical informatics, data science, cloud computing, and related areas. He is also the Founder and a Partner of Open Data Group, which has been building predictive models over big data for companies for over ten years. He recently wrote a book for the general reader that discusses big data (among other topics) called The Structure of Digital Computing: From Mainframes to Big Data, which can be purchased from Amazon. He blogs occasionally about big data at rgrossman.com.