DataONE_cobb_hubbub2012_20120924_v05

49
DataONE: An interoperable data repositories case study John W. Cobb R&D Staff and DataONE Leadership Team Member Oak Ridge Na;onal Laboratory HUBbub 2012 , the HUBzero conference Indianapolis, IN 24 September 2012

description

An interoperable data repositories case study: DataONE presented at HUBbub 2012 confence

Transcript of DataONE_cobb_hubbub2012_20120924_v05

Page 1: DataONE_cobb_hubbub2012_20120924_v05

DataONE:  An  interoperable  data  repositories  case  study    

John  W.  Cobb    R&D  Staff  and  DataONE  Leadership  Team  Member  Oak  Ridge  Na;onal  Laboratory    HUBbub  2012  ,  the  HUBzero  conference    Indianapolis,  IN  24  September  2012    

Page 2: DataONE_cobb_hubbub2012_20120924_v05

2  

•  Authorship:  This  talk  represents  work  of  the  en;re  DataONE  extended  team.  

•  It  especially  draws  upon  slide  material  from  •  Bill  Michener,  UNM  

(esp.  recent  DataONE  AHM  Sept.  18,  2012)  

•  Amber  Budden  –  DataONE  Ass.  Dir.  For  CE  

•  DataONE  is  an  NSF  supported  project  (OCI-­‐0830944)  

Acknowledgment:  

Page 3: DataONE_cobb_hubbub2012_20120924_v05

3  

•  A  personal  view    (apologies  for  a  possibly  mis-­‐informed  speaker)  

•  HUB-­‐roots  (history  and  pre-­‐history)  •  PUNCH:    web  portal  for  running  tools    

(DOI:  10.1109/40.846308)  •  -­‐>  NanoHUB:  Applica;on  orchestra;on  environment  •  +  RAPPTURE:  Rapid  Applica;on  por;ng  and  development  •  +  Framelesss  VNC  windows  –    

seamless  hosted  environment  on  clients!  •  +  Rich  collabora;ve  environment  and  rich  user  experience  !!  (“wishlist”)  •  Repurpose:  Hubzero  -­‐>  hubs  explode    

(ex.  NEESHub  a  cri;cal  advantage  for  largest  research  award  in  Purdue  history)  •  Now  (and  recent  past)  turn  to  Hub+Data  Integra;on.  Some  successes  

already  •  Opportunity:  Richer  interac;ons  between  HUB’s  and  mul;ple  data  

repositories  •  Perhaps  for  example:  Enable  mul;-­‐project  collabora;on  within  PURR?  •  Or:  Integrate  NEES  DB’s  with  SCEC  simula;ons  and  IRIS  waveforms?  

Hubs  and  data  repositories  

Page 4: DataONE_cobb_hubbub2012_20120924_v05

4  

•  HUB  +  Database  exists  •  HUB  +  external  data  repository  access  use  case.  •  But  …..  What  if?      •  Access  mul;ple  (possibly  external)  repositories  from  within  a  HUB  

environment?  •  Access  mul;ple  external  repositories  with  similar  data?  Say  aggregate  all  data  

from  state  hydrologists?  C.f.  driNET  hip://drinet.hubzero.org  •  Integrate  disparate  data  sets  for  new  and  novel  analysis.  

Recall    Noshir  Contractor’s  comments  this  morning:  teaming  and    interdisciplinary  work  has  increased  impact  (Wuchty,  Jones,  Uzzi)  

•  Enable  reproducible  analysis  and  synthesis  via  a  automated  workflow  to  create  synthe;c  data  products  

•  Programma;c  access  •  More  integra;on  (more  than  just  raw  search  terms  a  la  Google)  •  …  •  What  do  you  want  to  discover  today?  (to  paraphrase  Microsol)  

Mul;ple  data  repository  access?  

Page 5: DataONE_cobb_hubbub2012_20120924_v05

5  

•  DataONE  is  a  project  to  address  these  issues  

•  Build  (assemble/aggregate)    data  repository  interoperability  

•  Advance  state  of  the  prac;ce  data  lifecycle  management  •  Planning  •  Deposi;on  •  Metadata  genera;on  •  Seman;c  integra;on  •  Workflow  and  provenance  •  Analysis  •  Synthesis  

•  Focus  on  a  broad  science  area  •  Deploy  a  working  CI  and  grow  it  •  DataONE  –  Data  Observa;on  

Network  Earth  

DataONE  mo;va;on  

Plan  

Collect  

Assure  

Describe  

Preserve  

Discover  

Integrate  

Analyze  

Page 6: DataONE_cobb_hubbub2012_20120924_v05

6  

Page 7: DataONE_cobb_hubbub2012_20120924_v05

7  

Pressing  issues  for  the  digital  data  lifecycle  

Plan  

Collect  

Assure  

Describe  

Preserve  

Discover  

Integrate  

Analyze  

Page 8: DataONE_cobb_hubbub2012_20120924_v05

8  

Decreasin

g  Spa;

al  Coverage  

Increasin

g  Process  K

nowledge  

Adapted  from  CENR-­‐OSTP  

Remote  sensing  

Intensive  science  sites    and  experiments  

Extensive  science  sites  

Volunteer  &    educa;on  networks  

Mul;ple  data  sources  –  mutually  reinforcing  

8  

Page 9: DataONE_cobb_hubbub2012_20120924_v05

9  

Scaiered  data  sources  “finding  the  needle  in  the  haystack”  

Data  are  massively  dispersed  •  Ecological  field  sta;ons  and  research  centers  (100s)  •  Natural  history  museums  and  biocollec;on  facili;es  (100s)  •  Agency  data  collec;ons  (100s  to  1000s)  •  Individual  scien;sts  (1000s  to  10,000s  to  100,000s)  

Page 10: DataONE_cobb_hubbub2012_20120924_v05

10  

Data  Preserva;on  and  Planning  

✔ ?  

Page 11: DataONE_cobb_hubbub2012_20120924_v05

11  

Preserva;on:  Poor  data  prac;ce  “data  entropy”  

Inform

a?on

 Con

tent  

Time  

Time  of  publica?on  

Specific  details  

General  details  

Accident  

Re?rement  or    career  change  

Death  

(Michener  et  al.  1997)  

Page 12: DataONE_cobb_hubbub2012_20120924_v05

12  

 Preserva;on:  Data  longevity  

Study Resource Type Resource Half-life

Rumsey (2002) Legal Citations 1.4 years

Harter and Kim (1996) Scholarly Article Citations 1.5 years

Koehler (1999 and 2002) Random Web Pages 2.0 years

Spinellis (2003) Computer Science Citations

4.0 years

Markwell and Brooks (2002)

Biological Science Education Resources

4.6 years

Nelson and Allen (2002) Digital Library Object 24.5 years

Koehler,  W.  (2004)  Informa(on  Research  9(2):  174.  

Page 13: DataONE_cobb_hubbub2012_20120924_v05

13  

The Long Tail of Orphan Data Vo

lum

e

Rank frequency of datatype

Specialized repositories (e.g. GenBank, PDB)

Orphan data

(B. Heidorn)

“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray

13  

The  Ultra-­‐violet  divergence  

The  Infrared  Catastrophe  

Page 14: DataONE_cobb_hubbub2012_20120924_v05

14  

Data  deluge  and  interoperability  “the  flood  of  increasingly  heterogeneous  data”  

Data  are  heterogeneous  •  Syntax  

•  (format)  •  Schema  

•  (model)  •  Seman;cs  

•  (meaning)  

Jones  et  al.  2007  

Page 15: DataONE_cobb_hubbub2012_20120924_v05

15  

Metadata  universe  (mul;-­‐verse)  

Source:  Jenn  Riley,  Indiana  U.  Digital  Librarian  hip://www.dlib.indiana.edu/~jenlrile/metadatamap/    Via  John  Kunze,  Cal.  Dig.  Lib    

•  There  are  a  mul;tude  of  metadata  standards  •  Discipline  and  sub-­‐discipline  specific  •  Each  with  different  terms  and  context  

Page 16: DataONE_cobb_hubbub2012_20120924_v05

16  

Each  dot  is  its  own  standard  !  

“…billions  and  billions  of  worlds  …”  –  Carl  Sagan  

Page 17: DataONE_cobb_hubbub2012_20120924_v05

17  

•  Hard-­‐core  cyberinfrastructure  (CI)  •  CI  Member  Node  (MN)  data  

repositories  •  Coordina;ng  Node  (CN)  global  

metadata  repo’s  •  Simple,  but  powerful  REST  API/SPI  for  

universal  access  •  Inves;gator  toolkit  (ITK)  solware  tools  

to  allow  access  to  the  data  repository  collec;ve  via  familiar  access  idioms  

•  Cultural  and  wetware  issues  •  Educa;onal  Materials  •  Best  prac;ces  •  Workshops  and  tutorials  •  Surveys  and  assessments  •  Scien;st,  policymaker,  ci;zen  

engagement  •  Collabora;on,  governance,  and  

sustainability  

DataONE    CI  architectural  Elements  

Member Nodes

Service Interfaces

Bridge to non-DataONE Member Node services

Data Repository

Coordinating Nodes

Object Store Index

Coordination LayerIdentifiers

Preservation

Catalog

Monitor

Service InterfacesResolution Discovery

Replication Registration

Investigator Toolkit

Client LibrariesJava Python Command Line

Web Interface Data ManagementAnalysis, Visualization

hip://mule1.dataone.org/ArchitectureDocs-­‐current/      

Page 18: DataONE_cobb_hubbub2012_20120924_v05

18  

A  User’s  View  

Page 19: DataONE_cobb_hubbub2012_20120924_v05

19  

Key  Cyberinfrastructure  Elements  

• Unique  iden;fiers  •  Search  and  deliver  • Replica;on  •  Federated  iden;ty  

Usable  by  People  and  their  Agents  

Page 20: DataONE_cobb_hubbub2012_20120924_v05

20  

Suppor;ng  the  data  lifecycle  

UCSB  Node  

UNM  Node  

ORC  Node  

1.  Deposi;on/acquisi;on/ingest  2.  Cura;on  and  metadata  management  3.  Protec;on,  including  privacy  4.  Discovery,  access,  use,  and  dissemina;on  5.  Interoperability,  standards,  and  integra;on  6.  Evalua;on,  analysis,  and  visualiza;on"

The  data  lifecycle    

}

Page 21: DataONE_cobb_hubbub2012_20120924_v05

21  

Three  major  components  for  a  flexible,  scalable,  sustainable  network  

Member  Nodes  •  diverse  ins;tu;ons  •  serve  local  community  •  provide  resources  for  managing  their  data  

•  retain  copies  of  data  

Coordina?ng  Nodes  •  retain  complete  metadata  catalog    

•  indexing  for  search  •  network-­‐wide  services  •  ensure  content  availability  (preserva;on)      

•  replica;on  services  

Inves?gator  Toolkit  

DataONE  Supports  Data  Preserva;on  

Page 22: DataONE_cobb_hubbub2012_20120924_v05

22  

•  Enables  integra;on  of  mul;ple  geographically  diverse  and  metadata  diverse  repositories  

•  Presents  collec;ve  search  results  across  mul;ple  repository  

•  Provides  a  unified  API/SPI  for  search  and  programma;c  interface  hip://mule1.dataone.org/ArchitectureDocs-­‐current/      

•  DataONE  content  has  unique  iden;fiers  (DOI’s)  for  referencable/citable  data  objects  

•  Supports  both  large  datasets  and  the  long-­‐tails  

DataONE  sa;sfies  arch  requirements  

Page 23: DataONE_cobb_hubbub2012_20120924_v05

23  

•  Enables  new  analysis  and  synthesis  efforts  by  integra;ng  tasks  across  repositories    

•  Provides  means  for  data  replica;on  and  basis  for  repositories  to  build  “data  wills”  or  “data  trust”  plans    

•  Provides  a  plavorm  to  develop  advanced  interoperable  workflow  tools  and  seman;c  integra;on  tools  

 

DataONE  spurs  innova;on  

Page 24: DataONE_cobb_hubbub2012_20120924_v05

24  

DataONE:  current  state/recent  progress  

Page 25: DataONE_cobb_hubbub2012_20120924_v05

25  

DataONE:  Suppor;ng  Scien;fic  Data  Preserva;on,  Discovery,  and  Innova;on    

Current  Member  Nodes:                          Coming  Soon:        Current  Tools:  

       Tools  Coming  Soon:     Queensland  University  of  Technology  

Page 26: DataONE_cobb_hubbub2012_20120924_v05

26  

Data  Management  Planning  Tool  

hips://dmp.cdlib.org/    

Page 27: DataONE_cobb_hubbub2012_20120924_v05

27  

Page 28: DataONE_cobb_hubbub2012_20120924_v05

28  

Page 29: DataONE_cobb_hubbub2012_20120924_v05

29  

Plans  per  template  (as  of  June  2012)  

hip://dmptool.org                    

339  

287  

197  

159  133   133   124  

101  71   65   60  

46   37   36   34  17   15   6  

0  

50  

100  

150  

200  

250  

300  

350  

400  

Approxim

ate  nu

mbe

r  of  p

lans  per  te

mplate  

Templates  of  greatest  interest  to  the  DataONE  community  in  red;  

     2,302  unique  users  to  date  

Page 30: DataONE_cobb_hubbub2012_20120924_v05

30  

✔ Check  for  best  prac;ces  ✔ Create  metadata  ✔ Connect  to  ONEShare  

Data  &    Metadata  (EML)  

Page 31: DataONE_cobb_hubbub2012_20120924_v05

31  

2.  Data  Discovery  

Page 32: DataONE_cobb_hubbub2012_20120924_v05

32  

The  DataONE  Federa;on  

Page 33: DataONE_cobb_hubbub2012_20120924_v05

33  

NASA  collectors   DAAC  Users      (UWG)  

DataONE  Users  

ORNL  DAAC      as  a  DataONE  Member  Node    

Inves?gator  Toolkit  

33  

Page 34: DataONE_cobb_hubbub2012_20120924_v05

34  

hips://cn.dataone.org/onemercury/    

Page 35: DataONE_cobb_hubbub2012_20120924_v05

35  

Page 36: DataONE_cobb_hubbub2012_20120924_v05

36  

Page 37: DataONE_cobb_hubbub2012_20120924_v05

37  

Kepler

DMP-Tool

Inves;gator  Toolkit  Support    

Plan  

Collect  

Assure  

Describe  

Preserve  

Discover  

Integrate  

Analyze  

Page 38: DataONE_cobb_hubbub2012_20120924_v05

38  

Spa;o-­‐Temporal  Exploratory  Model  iden;fies  factors  affec;ng  paierns  of  migra;on  

 

Diverse  bird  observa;ons  and  environmental  data  from  300,00  loca;ons  in  the  US  integrated  and  analyzed  using  High  Performance  Compu;ng  Resources  

Land  Cover  

Meteorology  

MODIS  –  Remote  sensing  data  

•  Examine  paierns  of  migra;on    

•  Infer  how  climate  change  may  affect  bird  migra;on  

Model  results  

Occurrence  of  Indigo  Bun?ng  (2008)  

Jan   Sep   Dec  Jun  Apr  

Explora;on,  Visualiza;on,  and  Analysis  

38  

Page 39: DataONE_cobb_hubbub2012_20120924_v05

39  

Public  Par?cipa?on  in  Scien?fic  Research  Conference:  4-­‐5  August  2012  in  Portland,  Oregon  USA  prior  to  Ecological  Society  of  America  mee;ng  (6-­‐10  Aug.):  hip://www.birds.cornell.edu/citscitoolkit/conference/2012  

39  

Page 40: DataONE_cobb_hubbub2012_20120924_v05

40  

Year  1   Year  2   Year  3   Year  4   Year  5  

   Scien;sts:  BL  

User  Assessments  

Scien;sts:  FU  

Librarians:  BL   Librarians:  FU  

Policy  Makers:  BL   Policy  Makers:  FU  

Educators:  BL   Educators:  FU  

Library  Policies:  BL   Library  Policies:  FU  

Page 41: DataONE_cobb_hubbub2012_20120924_v05

41  

12 21 26 95 95 96 97

266

676

DIF DwC DC EML FGDC Open GIS

ISO My Lab none

Metadata  language  

What  standard  do  you  currently  use?  

Page 42: DataONE_cobb_hubbub2012_20120924_v05

42  

41%  

76%  

78%  

81%  

0%   20%   40%   60%   80%   100%  

Willing  to  place  all  of  my  data  into  a  central  data  repository  with  no  

restric;ons  

Appropriate  to  create  new  datasets  from  shared  data  

Willing  to  place  at  least  some  of  my  data  into  a  central  data  repository  

with  no  restric;ons  

Willing  to  share  data  across  a  broad  group  of  researchers  

Many  are  interested  in  sharing  data  

Percent  agree  

Page 43: DataONE_cobb_hubbub2012_20120924_v05

43  

User  Matrix  

Dat

a

Ser

vice

Inve

stig

ator

To

olK

it

Dat

a M

anag

emen

t P

lann

ing

Bes

t P

ract

ices

Tool

s D

atab

ase

Trai

ning

Cur

ricul

a

Scientist

Data Librarians

Ecological Modeler

Resource Manager

Page 44: DataONE_cobb_hubbub2012_20120924_v05

44  

Community  Engagement  

Page 45: DataONE_cobb_hubbub2012_20120924_v05

45  

Best  Prac;ces  and  Solware  Tools  

Page 46: DataONE_cobb_hubbub2012_20120924_v05

46  

Page 47: DataONE_cobb_hubbub2012_20120924_v05

47  

•  Member  node  growth  •  Number  of  member  nodes  •  Increase  the  number  and  size  of  data  sets  •  Sustainably  

•  In  terms  of  resource  needs  form  MN’s  •  In  terms  of  resource  demands  on  DataONE  

•  New  Inves;gator  toolkit  tools  (strategically)  •  An  increasing  number  of  science  use  cases  with  

more  breakthrough  science  •  Also,  re-­‐purposing  DataONE  CI  outside  of  Bio/Eco/

Env  areas  in  strategic  collabora;ve  partnerships  

DataONE:  Next  steps  

Page 48: DataONE_cobb_hubbub2012_20120924_v05

48  

Ack:  DataONE  Team  and  Sponsors  

• Bertram  Ludaescher  

• Deborah  McGuinness  

• Jeff  Horsburgh  

• Robert  Sandusky  

• Peter  Honeyman  

• Carole  Goble  

• Cliff  Duke  

• Donald  Hobern  

• Ewa  Deelman  • Amber  Budden,  Roger  Dahl,  Rebecca  Koskela,    Bill  Michener,  Robert  Nahf,  Skye  Roseboom,  Mark  Servilla  

• Patricia  Cruse,  John  Kunze  

•   Dave  Vieglais    

• Paul  Allen,  Rick  Bonney,  Steve  Kelling  

• Stephanie  Hampton,  Chris  Jones,  Mai  Jones,  Ben  Leinfelder,  Andrew  Pippin  

• Suzie  Allard,  Nick  Dexter,  Kimberly  Douglass,  Carol  Tenopir,  Robert  Waltz,  Bruce  Wilson  

• John  Cobb,  Bob  Cook,  Ranjeet  Devarakonda,  Giri  Palanismy,  Line  Pouchard    

•  Sky  Bristol,  Mike  Frame,  Richard  Huffine,  Viv  Hutchison,  Jeff  Moriseie,  Jake  Weltzin,  Lisa  Zolly  

• David  DeRoure  

• Ryan  Scherle,  Todd  Vision  

LEON LEVY FOUNDATION

• Randy  Butler  

Page 49: DataONE_cobb_hubbub2012_20120924_v05

49  

Ques;ons?    

Contact  Points    

John  W.  Cobb,  Ph.D.  Oak  Ridge  

 John  W.  Cobb,  Ph.D.  

Oak  Ridge  Na;onal  Lab  [email protected]  865.576.5439  

 hip://www.dataone.org/      hip://docs.dataone.org