FACIT Tools For Distributed Collections

A breakout session at the NDIIPP Partners meeting, July 09, 2008. Terry Moore, University of Tennessee; Scott Smith/Justin Mathena, National Geospatial Digital Archive (UCSB); Santiago de Ledesma, ACCRE, Vanderbilt.


Transcript of FACIT Tools For Distributed Collections

Page 1: FACIT Tools For  Distributed Collections

FACIT Tools For Distributed Collections

A breakout session at the NDIIPP Partners meeting

July 09, 2008

Terry Moore, University of Tennessee

Scott Smith/Justin Mathena, National Geospatial Digital Archive (UCSB)

Santiago de Ledesma, ACCRE, Vanderbilt

Page 2: FACIT Tools For  Distributed Collections

Overview

FACIT project

•Basic idea

•Application context: NGDA

FACIT Technology

•Logistical Networking “inside”

•LoDN

•L-Store

FACIT technology and the problem of long-term preservation of bits

Page 3: FACIT Tools For  Distributed Collections

Our Sponsors

Page 4: FACIT Tools For  Distributed Collections

What is FACIT

FACIT – Federated Archive Cyberinfrastructure Testbed

Goal of FACIT: Create a testbed to experiment with a different approach to federated resource sharing for access and preservation

FACIT partners:

•National Geospatial Digital Archive (NGDA: UCSB and Stanford). The NGDA is an NDIIPP partner

•Logistical Networking (UTK) – network storage tech

•REDDnet (Vanderbilt) – NSF funded infrastructure using LN for data intensive collaboration

Page 5: FACIT Tools For  Distributed Collections

NGDA Overview

National Geospatial Digital Archive

Focus: long-term archiving (100 year problem)

Emphasis on geospatial data

Policy level archive - not architecture specific

Based on 20+ years of experience @ UCSB

Page 6: FACIT Tools For  Distributed Collections

Pertinent Details

Preservation through Simplicity

Key component of architecture is the Data Model: all other parts considered disposable

Objects maintain self-descriptive 'manifests'

Archive organization and object structure both based on file systems

Data Model allows easy tie-in to L.N.
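
The manifest itself appears only as an image on the next slide, so its real format is not recoverable from this transcript. As a rough illustration of a self-descriptive, file-system-based object, the sketch below builds a manifest for a directory. The field names (`object`, `files`, `path`, `bytes`, `sha256`) and the example directory layout are assumptions made for this sketch, not the NGDA manifest format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(object_dir: Path) -> dict:
    """Describe every file under object_dir by relative path, size, and SHA-256."""
    entries = []
    for path in sorted(p for p in object_dir.rglob("*") if p.is_file()):
        entries.append({
            "path": path.relative_to(object_dir).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return {"object": object_dir.name, "files": entries}


# Build a tiny, hypothetical object in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    obj = Path(tmp) / "example_object"
    (obj / "data").mkdir(parents=True)
    (obj / "data" / "tile.txt").write_text("geospatial payload")
    (obj / "README.txt").write_text("hypothetical object")
    manifest = build_manifest(obj)
    print(json.dumps(manifest, indent=2))
```

Because the manifest is derived only from the files in the directory, any system that can read a file tree can rebuild or verify it, which is one way to read "all other parts considered disposable".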

Page 7: FACIT Tools For  Distributed Collections

Detail View: A Manifest

Page 8: FACIT Tools For  Distributed Collections

Detail View: An Object

Page 9: FACIT Tools For  Distributed Collections
Page 10: FACIT Tools For  Distributed Collections
Page 11: FACIT Tools For  Distributed Collections
Page 12: FACIT Tools For  Distributed Collections
Page 13: FACIT Tools For  Distributed Collections

NGDA and Logistical Networking

Logistical Networking as an abstracted storage layer

•Increase download speeds

Logistical Networking used as a Tool for custodianship

•Logistical Networking as content transfer solution

•Logistical Networking as a backup for at-risk data (temporary stewardship)

Page 14: FACIT Tools For  Distributed Collections
Page 15: FACIT Tools For  Distributed Collections

The Future of NGDA and LN

Initial trials have met with mixed success

•Lots of mixed size objects in test sets (30,000)

•Upload set data size of ~1TB

•“Moderate” Download Speeds to LC over WAN

•Roughly 1 day per TB download

The near future

•Middleware to bridge search client & LN cloud

•Adjustments to handle mixed data sets
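
To put "roughly 1 day per TB" in more familiar units, the arithmetic below converts it to a sustained rate. It assumes a decimal terabyte (10^12 bytes) and a full 24 hours of continuous transfer; the slide states neither, so treat the result as an order-of-magnitude figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def sustained_rate(bytes_moved: float, seconds: float) -> tuple:
    """Return (megabytes per second, megabits per second)."""
    bytes_per_second = bytes_moved / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


# Assumption: 1 TB = 1e12 bytes, moved in exactly one day.
mb_s, mbit_s = sustained_rate(1e12, SECONDS_PER_DAY)
print(f"{mb_s:.1f} MB/s, {mbit_s:.1f} Mbit/s")  # 11.6 MB/s, 92.6 Mbit/s
```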

Page 16: FACIT Tools For  Distributed Collections

Basic elements of the LN stack

Highly generic, “best effort” protocol for using storage

•Generic -> doesn’t restrict applications

•“best effort”-> low burden on providers

•Easy to port and deploy

Metadata container for bit-level structure

•Modeled on Unix inode

•bit-level structure, control keys, …

•XML encoded
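
The metadata container described above can be pictured as a list of byte-range mappings, each pointing at an allocation on some depot. The following is a minimal sketch under stated assumptions: the class names, the XML element and attribute names, and the depot host names are all hypothetical and do not reproduce the real exNode schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Mapping:
    offset: int    # logical byte offset within the file
    length: int    # number of bytes covered by this allocation
    depot: str     # host holding the allocation (hypothetical names)
    read_key: str  # opaque control key needed to read the allocation


@dataclass
class ExNode:
    name: str
    mappings: list = field(default_factory=list)

    def replicas_for(self, offset: int) -> list:
        """Return every mapping whose byte range contains the offset."""
        return [m for m in self.mappings
                if m.offset <= offset < m.offset + m.length]

    def to_xml(self) -> str:
        """Serialize to XML; element and attribute names are illustrative only."""
        root = ET.Element("exnode", name=self.name)
        for m in self.mappings:
            ET.SubElement(root, "mapping", offset=str(m.offset),
                          length=str(m.length), depot=m.depot,
                          read_key=m.read_key)
        return ET.tostring(root, encoding="unicode")


node = ExNode("file_a", [
    Mapping(0, 100, "depot.tennessee.example", "k1"),
    Mapping(100, 200, "depot.vanderbilt.example", "k2"),
    Mapping(0, 300, "depot.ucsb.example", "k3"),  # a full-length replica
])
print(node.to_xml())
print([m.depot for m in node.replicas_for(150)])
```

Like an inode, the structure says nothing about what the bytes mean; it only records where they are and which keys grant access to them.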

Page 17: FACIT Tools For  Distributed Collections

Sample exNodes

[Figure: sample exNodes A, B, C; offsets 0, 100, 200, 300; REDDnet depots on the network at Tennessee, Vanderbilt, UCSB, Stanford]

Crossing administrative domains, sharing resources

Partial exNode encoding

Page 18: FACIT Tools For  Distributed Collections

New federation members?

• Add new depots

• Rewrite the exNodes

• Copy the data

[Figure: LoC; REDDnet depots; network; Tennessee, Vanderbilt, UCSB, Stanford]
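
The three steps for admitting a new federation member can be simulated in a few lines. This is a toy model, assuming depots are in-memory dictionaries and an exNode is a list of extents with replica lists; the depot names, keys, and function names are hypothetical, and the real tools perform these steps over the network.

```python
# Simulated depots: depot name -> {allocation key: bytes}. Names are hypothetical.
depots = {
    "tennessee": {"a0": b"first half, "},
    "vanderbilt": {"a1": b"second half"},
}

# A minimal exNode: ordered extents, each with a list of (depot, key) replicas.
exnode = [
    {"replicas": [("tennessee", "a0")]},
    {"replicas": [("vanderbilt", "a1")]},
]


def add_member(depots: dict, exnode: list, new_depot: str) -> None:
    """Add a depot, copy every extent to it, and record the new replicas."""
    depots.setdefault(new_depot, {})  # add the new depot
    for index, extent in enumerate(exnode):
        source_depot, source_key = extent["replicas"][0]
        new_key = f"{new_depot}-{index}"
        # Copy the data first, so the exNode never points at a missing copy.
        depots[new_depot][new_key] = depots[source_depot][source_key]
        # Then rewrite the exNode to include the new replica.
        extent["replicas"].append((new_depot, new_key))


def read_from(depots: dict, exnode: list, depot: str) -> bytes:
    """Reassemble the file using only the replicas held by one depot."""
    parts = []
    for extent in exnode:
        key = next(k for d, k in extent["replicas"] if d == depot)
        parts.append(depots[depot][key])
    return b"".join(parts)


add_member(depots, exnode, "loc")
print(read_from(depots, exnode, "loc"))  # b'first half, second half'
```

The original depots and their replicas are untouched, so existing readers keep working while the new member gains a complete copy.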

Page 19: FACIT Tools For  Distributed Collections

Basic elements of the LN stack

Highly generic, “best effort” protocol for using storage

•Generic -> doesn’t restrict applications

•“best effort”-> low burden on providers

•Easy to port and deploy

Metadata container for bit-level structure

•Modeled on Unix inode

•bit-level structure, control keys, …

•XML encoded

L-Store

LoDN

Page 20: FACIT Tools For  Distributed Collections

LoDN - Network File Manager

• Store files into the Logistical Network using Java upload/download tools.

• Manages exNode maintenance and replication

• Provides account (single user or group) as well as “world” permissions.

Page 21: FACIT Tools For  Distributed Collections

Accessing distributed collection with LN

Page 22: FACIT Tools For  Distributed Collections

What is L-Store?

Goal of L-Store: Use LN to provide a generic, high performance, wide area capable, storage virtualization service

Provides a file system interface to (globally) distributed IBP depots (e.g. currently uses WebDAV and CIFS)

Flexible role based AuthZ (work in progress)

Page 23: FACIT Tools For  Distributed Collections

L-Store and Logistical Networking

L-Store adds a name space on top of the exnode layer

Allows for LN operations on the name space.

LN’s parallelism for high performance and reliability, e.g. parallel transfers to improve performance (3GB/s during SC06 demo)

[Figure labels: 3 GB/s; 30 Mins]
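
How parallelism serves both performance and reliability can be sketched with simulated depots: extents are fetched concurrently, and each fetch falls back to another replica if the first is unavailable. This is an illustration only; the depot names, keys, and functions are hypothetical, and real transfers would go over the network rather than read from dictionaries.

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated depots: a missing allocation raises KeyError. Names are hypothetical.
DEPOTS = {
    "depot-a": {"x0": b"Logistical "},
    "depot-b": {"x0": b"Logistical ", "x1": b"Networking"},
    "depot-c": {},  # has lost its copy of x1
}

# Each extent lists its replicas in preference order.
EXTENTS = [
    [("depot-a", "x0"), ("depot-b", "x0")],
    [("depot-c", "x1"), ("depot-b", "x1")],
]


def fetch_extent(replicas):
    """Try each replica in turn and return the first one that can be read."""
    for depot, key in replicas:
        try:
            return DEPOTS[depot][key]
        except KeyError:
            continue  # replica missing: fall through to the next copy
    raise IOError("no readable replica for extent")


def download(extents, workers=4):
    """Fetch all extents concurrently and reassemble them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, whatever order fetches finish in.
        return b"".join(pool.map(fetch_extent, extents))


print(download(EXTENTS))  # b'Logistical Networking'
```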

Page 24: FACIT Tools For  Distributed Collections

L-Store scalability

L-Store uses a Distributed Hash Table to store all its “structural” metadata (i.e. metadata about how the bits are stored)

DHTs provide a highly scalable way of storing metadata.

Metadata and data can be scaled independently.
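
The slides do not say which DHT design L-Store uses. As a generic illustration of why a DHT scales, the sketch below uses consistent hashing, one common DHT placement scheme: each metadata key is owned by the next node point on a hash ring, so adding a node moves only the keys that land on the new node. The node names and key paths are hypothetical.

```python
import bisect
import hashlib


def _position(value: str) -> int:
    """Map a string to a point on the hash ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes  # points per node, to even out the load
        self._ring = []        # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def owner(self, key: str) -> str:
        """Return the node responsible for a metadata key."""
        index = bisect.bisect_right(self._ring, (_position(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["meta-1", "meta-2", "meta-3"])
keys = [f"/ngda/object-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.add("meta-4")  # scale the metadata tier without touching the data depots
moved = [k for k in keys if ring.owner(k) != before[k]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that changes owner moves to the newly added node, and the rest stay where they were.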

Page 25: FACIT Tools For  Distributed Collections

Storage Management

Nevoa Networks (Brazilian company based on LN) provides management of remote/distributed storage via StorCore

•Provides resource discovery for L-Store.

•Allows depots to be grouped into logical units.

•It can create dynamic logical units based on queries.
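
The idea of a dynamic logical unit defined by a query can be shown generically. The sketch below is not the StorCore interface, which this transcript does not describe; the `Depot` fields, host names, and `logical_unit` function are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Depot:
    host: str     # hypothetical host name
    site: str
    free_gb: int


DEPOTS = [
    Depot("depot1.utk.example", "Tennessee", 500),
    Depot("depot1.vanderbilt.example", "Vanderbilt", 1200),
    Depot("depot1.ucsb.example", "UCSB", 80),
]


def logical_unit(depots, query):
    """Dynamic logical unit: membership is re-evaluated on every call."""
    return [d.host for d in depots if query(d)]


def has_room(depot):
    return depot.free_gb >= 100


print(logical_unit(DEPOTS, has_room))

# A depot discovered later joins the unit automatically if it matches the query.
DEPOTS.append(Depot("depot1.stanford.example", "Stanford", 900))
print(logical_unit(DEPOTS, has_room))
```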

Page 26: FACIT Tools For  Distributed Collections

L-Store and FACIT

FACIT drives L-Store development:

•L-Sync: An rsync-like tool that uses L-Store as intermediate storage.

•Extended metadata attributes.

•A flexible policy framework.

Page 27: FACIT Tools For  Distributed Collections

Questions about

NGDA? Logistical Networking? LoDN? L-Store?

Page 28: FACIT Tools For  Distributed Collections

Discussion: Preservation’s storage problem

Long-term preservation is a relay: Repeated migrations across storage media/systems, archive systems, institutions

Begin with the bits:

•Storage technology changes every 3-5 yrs

•During some periods of time data will be in “steady state”

•But during a century, there will be 20-30 handoffs!

How can we create a “handoff process” that can be sustained for a century or more? Can we create a “technical” process or will a social process have to do?

Complicating factor: We’re drowning in data

[Figure: timeline with tick labels 0, 5, 10, 15, 20, 90, 95, 100]
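
The handoff count follows directly from the stated change interval. The short calculation below assumes one handoff per technology change over a 100-year horizon; the function name is illustrative.

```python
def handoffs(horizon_years: int, years_between_changes: int) -> int:
    """Whole number of technology changes that fit inside the horizon."""
    return horizon_years // years_between_changes


for interval in (3, 4, 5):
    print(f"every {interval} years -> {handoffs(100, interval)} handoffs per century")
# every 3 years -> 33 handoffs per century
# every 4 years -> 25 handoffs per century
# every 5 years -> 20 handoffs per century
```

A 3-5 year interval therefore gives 20 to 33 handoffs, which the slide rounds to 20-30.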

Page 29: FACIT Tools For  Distributed Collections

Framing The Issue Globally

• World data expected to total 2 zettabytes by 2011 (IDC Whitepaper)

“As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.”

Page 30: FACIT Tools For  Distributed Collections

What does experience show?

SDSC’s archive shows exponential growth w/ a consistent doubling period of ~15 months
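
A 15-month doubling period is easier to judge as a yearly or ten-year multiplier. The calculation below assumes the doubling period stays constant, which the slide reports for the observed data but which is only an extrapolation beyond it.

```python
def growth_factor(months: float, doubling_months: float = 15.0) -> float:
    """Multiplier after `months` of exponential growth with the given doubling period."""
    return 2 ** (months / doubling_months)


print(f"{growth_factor(12):.2f}x per year")     # 1.74x per year
print(f"{growth_factor(120):.0f}x per decade")  # 256x per decade
```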

Page 31: FACIT Tools For  Distributed Collections

If preservation is a relay, then …

The key preservation problem at the bit layer is …

•Choice 1: steady state data storage

•Choice 2: copying data to different systems

•Impression: De facto choice is #1

When you have to “hand-off” data, is it sufficient to have

•Choice 1: A social solution

•Choice 2: A technical solution

•Impression: De facto choice is #1

Contention: Neither of these de facto choices is adequate