Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

31
Adapting federated cyberinfrastructure for shared data collection facilities in structural biology GlobusWorld 2012 Ian StokesRees [email protected] @ijstokes Harvard Medical School

description

Early stage experimental data in structural biology is generally unmaintained and inaccessible to the public. It is increasingly believed that this data, which forms the basis for each macromolecular structure discovered by this field, must be archived and, in due course, published. Furthermore, the widespread use of shared scientific facilities such as synchrotron beamlines complicates the issue of data storage, access and movement, as does the increase of remote users. This work describes a prototype system that adapts existing federated cyberinfrastructure technology and techniques to significantly improve the operational environment for users and administrators of synchrotron data collection facilities used in structural biology. This is achieved through software from the Virtual Data Toolkit and Globus, bringing together federated users and facilities from the Stanford Synchrotron Radiation Lightsource, the Advanced Photon Source, the Open Science Grid, the SBGrid Consortium and Harvard Medical School. The performance and experience with the prototype provide a model for data management at shared scientific facilities. Stokes-Rees, Ian, Ian Levesque, Frank V. Murphy, Wei Yang, Ashley Deacon, and Piotr Sliz. “Adapting Federated Cyberinfrastructure for Shared Data Collection Facilities in Structural Biology.” Journal of Synchrotron Radiation 19, no. 3 (April 6, 2012). http://scripts.iucr.org/cgi-bin/paper?S0909049512009776.

Transcript of Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Page 1: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Adapting  federated  cyberinfrastructure  for  shared  data  collection  

facilities  in  structural  biology

GlobusWorld  2012

Ian  Stokes-­‐Rees  [email protected]  @ijstokesHarvard  Medical  School

Page 2: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

J.  Synchrotron  Rad.  (2012)  19doi:10.1107/S0909049512009776

Online  6  April  2012

Page 3: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

[email protected] Overview - Ian Stokes-Rees

Structural  Biology:Study  of  Protein  Structure  and  Function

1mm

10nm

400m

• Shared  scientiRic  data  collection  facility• Data  intensive  (10-­‐100  GB/day)

Page 4: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

[email protected] Overview - Ian Stokes-Rees

Cryo  Electron  Microscopy

• Previously,  1-­10,000  images,  managed  by  hand• Now,  robotic  systems  collect  millions  of  hi-­res  images• estimate  250,000  CPU-­hours  to  reconstruct  model• 20-­50  TB  of  data  in  most  extreme  case

Page 5: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

[email protected] Overview - Ian Stokes-Rees

Molecular  Dynamics  Simulations

1  fs  time  step1ns  snapshot1  us  simulation1e6  steps1000  frames10  MB  /  frame10  GB  /  sim20  CPU-­years3  months  (wall-­clock)

Page 6: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

[email protected] Overview - Ian Stokes-Rees

SBGrid  Consortium

Rice UniversityE. NikonowiczY. ShamooY.J. Tao

CalTechP. BjorkmanW. ClemonsG. JensenD. Rees

StanfordA. BrungerK. GarciaT. Jardetzky

UCSFJJ MirandaY. Cheng

UC DavisH. Stahlberg

UCSDT. NakagawaH. Viadiu

WesternUM. Swairjo

U. WashingtonT. Gonen

Washington U. School of Med.T. EllenbergerD. Fremont

VanderbiltCenter for Structural Biology

Rosalind FranklinD. Harrison

A. LeschzinerK. MillerA. RaoT. RapoportM. SamsoP. SlizT. SpringerG. VerdineG. WagnerL. WalenskyS.WalkerT.WalzJ. WangS. Wong

N. Beglova S. BlacklowB. ChenJ. ChouJ. ClardyM. EckB. FurieR. GaudetM. GrantS.C. Harrison J. HogleD. JeruzalmiD. KahneT. Kirchhausen

Harvard and Affiliates

NE-CATR. OswaldC. ParrishH. Sondermann

R. CerioneB. CraneS. EalickM. JinA. Ke

Cornell U.

Brandeis U.N. Grigorieff

Tufts U.K. Heldwein

UMass MedicalW. Royer

NIHM. Mayer

U. MarylandE. Toth

K. ReinischJ. SchlessingerF. SigworthF. Zhou

T. BoggonD. BraddockY. HaE. Lolis

Yale U.

C. SandersB. SpillerM. Stone

M. Waterman

W. ChazinB. EichmanM. EgliB. LacyM. Ohi

Columbia U.Q. Fan

Rockefeller U.R. MacKinnon

Thomas JeffersonJ. Williams

Not Pictured: University of Toronto: L. Howell, E. Pai, F. Sicheri; NHRI (Taiwan): G. Liou; Trinity College, Dublin: Amir Khan

Page 7: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

[email protected] Overview - Ian Stokes-Rees

Boston  Life  Sciences  Hub

• Biomedical  researchers• Government  agencies• Life  sciences• Universities• Hospitals

Tufts

University

School of

Medicin

e

Page 8: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Data  Access

Page 9: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Globus  Online:  High  Performance  Reliable  3rd  Party  File  Transfer

portal

cluster

desktop laptop

lab fileserver

data collectionfacility

GUMSDN  to  user  mapping

VOMSVO  membership

CertiRicate  Authorityroot  of  trust

Globus  OnlineRile  transfer  service

Page 10: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

datacollec(on

6+monthstaging+storage

10+TB+per+group+permanent+archive

/stage/sliz

/stage/murphy

/data/murphy

/data/deacon

Sliz+lab

~andy~sue

~sco?

NEBCAT+beamline+at+APS

SBGrid'Science'PortalGlobus'Online

Harvard "Sco?"+fromthe+Sliz+lab

VOMS

/data/sliz

/embarg/2010/

/embarg/2011/

/public/2009/

10+PBpublic+archive

general+public

WWW

Tier'1 Tier'2 Tier'3

Local  accounts  withinlab  infrastructure

Shared  (lab  level)  accounts  at  facility

User  can  directly  access  lab  or  facility  data  from  laptop

Public  access  available  to  archived  data  through  web  interface

Embargo  policy  to  hold  deposited  data  for  agreed  time

Tiered  storage

VO  management

Page 11: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Architecture✦ SBGrid

• manages  all  user  account  creation  and  credential  mgmt• hosts  MyProxy,  VOMS,  GridFTP,  and  user  interfaces

✦ Facility• knows  about  lab  groups

• e.g.  “Harrison”,  “Sliz”

• delegates  knowledge  of  group  membership  to  SBGrid  VOMS• facility  can  poll  VOMS  for  list  of  current  members

• uses  X.509  for  user  identiRication• deploys  GridFTP  server

✦ Lab  group• designates  group  manager  that  adds/removes  individuals• deploys  GridFTP  server  or  Globus  Connect  client

✦ Individual• username/password  to  access  facility  and  lab  storage• Globus  Connect  for  personal  GridFTP  server  to  laptop• Globus  Online  web  interface  to  “drive”  transfers

Page 12: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology
Page 13: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Objective

✦ Easy  to  use  high  performance  data  mgmt  environment

✦ Fast  and  reliable  Rile  transfer• facility-­‐to-­‐lab,  facility-­‐to-­‐individual,  lab-­‐to-­‐individual

✦ Reduced  administrative  overhead✦ Better  data  curation✦ Release  data  to  public  after  embargo  period

Page 14: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Challenges

✦ Access  control• visibility• policies

✦ Provenance• data  origin• history

✦ Meta-­‐data• attributes• searching

Page 15: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Grid  Computing  Today

Page 16: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Capabilities

✦ Server-­‐to-­‐server  interaction✦ Federated  aggregation  of  CPU  power✦ Predictable  data  patterns✦ Command-­‐line  access  from  specialized  servers✦ Web-­‐based  access  through  “Science  Gateways”• pre-­‐deRined  computing  workRlows• e.g.  SBGrid  Science  Portal

Page 17: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Weaknesses

✦ Federated  user  identity  system• X.509  digital  certiRicates  and  associated  apparatus

✦ Ease  of  use  for  end  users✦ Data  management  tools✦ Support  for  collaborations

Page 18: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Opportunities

✦ “Last  mile”  challenge• to  the  desktop• to  the  laptop

✦ UniRied  identity  management• centralized  set  of  credentials  for  each  person

✦ Empower  collaborations  to  self-­‐manage✦ Shift  of  focus  from  “compute”  to  “data”• for  users• for  facilities  where  data  is  the  main  challenge

Page 19: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

User  Credentials

Page 20: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

✦ Analogy  to  a  passport:• Application  form• Sponsor’s  attestation• Consular  services

• veriRication  of  application,  sponsor,  and  accompanying  identiRication  and  eligibility  documents

• Passport  issuing  ofRice

✦ Portable,  digital  passport• Rixed  and  secure  user  identiRiers

• name,  email,  home  institution• signed  by  widely  trusted  issuer• time  limited• ISO  standard

X.509  Digital  CertiRicates

Page 21: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Addressing  CertiRicate  ProblemsU1U1U1

!"#$"%&'%()*"+',"!&'

-.'.)"*&'/.' 012*%2!' 3%"!'

!"&$!*'&!4,5(*)'*$67"!''

*289:'4)"*&%'

!";("<'!"#$"%&'

;"!(9:'$%"!'"=()(7(=(&:'

,2*>!6'"=()(7(=(&:'

411!2;"',"!&'

*289:'4;4(=47(=(&:'

)"*"!4&"',"!&'5":'14(!'

%()*',"!&'

!"&!(";"',"!&'

"?12!&'%()*"+',"!&'5":'14(!'

U2a

U1

S1

R1

R2

time

Page 22: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

!"#$%&'(#!")*# *+,(-,.# /-0.#

(,123#4%&'(#

5,(6.&#07'8'9'7':3#

4++.,;0#&0&90.-<'+=#8.,>+-=#4(%#.,70-#

(,123#

;0.'23#>-0.#07'8'9'7':3#

+.0-0(:#50.:#:,#.0?>0-:#&0&90.-<'+#93#@A#

4%%#@A#:,#!")*#

.0?>0-:#!")*#$B#

.0:>.(#!")*#$B# 4%%#$B#:,#+.,C3#50.:#

.0?>0-:#!"#8.,>+-#4(%#.,70-# U2b

S2

V1

V2

time

VO  (Group)  Membership  Registration

Page 23: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

!"#$%&'()' *+",-"#' .-/#'

0/#123'/&14151&1$3'

6",7#8'/&14151&1$3'

4/,/#%$/'6/#$'9/3'+%1#'

-/$'#/$#1/0%&'-/#1%&',:85/#'

;<=*'

#/>:/-$'+"#$%&'%66":,$'

0/#176%?",'/8%1&'-/,$'

/8%1&'0/#17/@'

AB)!'

;<')@81,'

,"?23'%4/,$-'

6#/%$/'&"6%&'%66$'

-14,'6/#$'

!"#$%&'(%)*+%',,#""%-#.#$'/#.%$#"*!$,#"%

-/$';<'#14C$-'

,"?23'%0%1&%51&1$3'

+"#$%&'&"41,'

#/>:/-$'-14,/@'6/#$'

#/$:#,'$#%691,4',:85/#'

%++#"0/'6/#$'

%66":,$'#/%@3',"?76%?",'

#/>:/-$'-14,/@'6/#?76%$/'

#/$:#,'-14,/@'6/#?76%$/' D'E'+%1#'-14,/@'6/#$'1,$"'!F(*GDH'7&/'H'E'6#/%$/'&"6%&'+#"I3'6/#$'

=3!#"I3'

#/41-$/#'+#"I3'6/#$'J1$C'=3!#"I3'

!"#$%&'(%)*+%',,#""%+#0%1*$/'2%

U1

U2*

S1*A1a

A1btime

Page 24: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Process  and  Design  Improvements✦ Single  web-­‐form  application• includes  e-­‐mail  veriRicationn

✦ Centralized  and  connected  credential  management• FreeIPA  LDAP  -­‐  user  directory  and  credential  store• VOMS  -­‐  lab,  institution,  and  collaboration  afRiliations• MyProxy  -­‐  X.509  credential  store

✦ Overlap  administrative  roles• system  admin• registration  agent  for  certiRicate  authority  (approve  X.509  

request)• VO  administrator  to  register  group  afRiliations

✦ Automation

Page 25: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

“Last  Mile”  andEase  of  Use

Page 26: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Grid  computing  at  your  desk✦ Platforms• Windows• OS  X

✦ Web  interfaces• SBGrid  Science  Portal

✦ One-­‐stop-­‐shop  for  account  management• Centralized  services• Automated  processes• Administrator  intervention• Support

Page 27: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

SBGridSciencePortal

laptop

desktop lab fileserver

facility fileserver

Ryan,  a  postdoc  in  the  Frank  Lab  at  Columbia

Access  NRAMM  facilities  securely  and  transfer  data  back  to  home  institute

/data/columbia/frank

/nfs/data/rsmith

/Users/Ryan

Ryan  applies  for  an  account  at  the  SBGrid  Science  Portal

automated  X.509application

automated  Globus  Online  application

veriRication  of  lab  membership

request  accessto  NRAMMfacility

using  credential  held  by  SBGrid

check  SBGrid  for  Ryan’s  group  membership

in  Frank  Lab,  so  grant  access  to  Riles

use  Globus  Online  to  managetransfer  from  NRAMM  back  to  lab

initiate  transfer  at  NRAMM

transfer  data  to  lab

notify  user  of  completion

Page 28: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Architecture  Diagrams

Page 29: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Service  Architecture

DOEGrids CA@Lawrence Berkley Labs

UC San Diego

Apache

GridSite

Django

Sage Math

R-Studio

SBGrid Science Portal @ Harvard Medical School

MyProxy@NCSA, UIUC

Gratia Acct'ing@FermiLab

FreeIPA

LDAP

VOMS

GUMS

GACL

ID mgmt

glideinWMS factory Open Science Grid

fileserver

SQLDB

scp

GridFTP

data

SRM

WebDAV

cluster

Condor

Cycle Server

VDT

Globus

computation

data

computations

interfaces

User

shell CLI

GUMSGUMSGridFTP +

Hadoop

GlobusOnline@Argonne

glideinWMS

Monitoring@Indiana

Ganglia

Nagios

monitoring

RSV

pacct

Page 30: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

[email protected] Overview - Ian Stokes-Rees

Page 31: Adapting federated cyberinfrastructure for shared data collection facilities in structural biology

Acknowledgements✦ Piotr  Sliz✦ SBGrid  Science  Portal• Daniel  O’Donovan,  Meghan  Porter-­‐Mahoney

✦ SBGrid  System  Administrators• Ian  Levesque,  Peter  Doherty,  Steve  Jahl

✦ Facility  Collaborators• Frank  Murphy  (NE-­‐CAT/APS)• Ashley  Deacon  (JCSG/SLAC)

✦ Globus  Online  Team• Steve  Tueke,  Ian  Foster,  Rachana  

Ananthakrishnan,  Raj  Kettimuthu  

✦ Ruth  Pordes• Director  of  OSG,  for  championing  SBGrid