FACIT Tools For Distributed Collections
A breakout session at the NDIIPP Partners meeting
July 09, 2008
Terry Moore, University of Tennessee
Scott Smith/Justin Mathena, National Geospatial Digital Archive (UCSB)
Santiago de Ledesma, ACCRE, Vanderbilt
Overview
• FACIT project: basic idea
• Application context: NGDA
• FACIT technology: Logistical Networking “inside”, LoDN, L-Store
• FACIT technology and the problem of long-term preservation of bits
Our Sponsors
What is FACIT
FACIT: Federated Archive Cyberinfrastructure Testbed
Goal of FACIT: Create a testbed to experiment with a different approach to federated resource sharing for access and preservation
FACIT partners:
• National Geospatial Digital Archive (NGDA: UCSB and Stanford). The NGDA is an NDIIPP partner
• Logistical Networking (UTK): network storage tech
• REDDnet (Vanderbilt): NSF-funded infrastructure using LN for data-intensive collaboration
NGDA Overview
National Geospatial Digital Archive
• Focus: long-term archiving (100 year problem)
• Emphasis on geospatial data
• Policy level archive - not architecture specific
• Based on 20+ years of experience @ UCSB
Pertinent Details
• Preservation through Simplicity
• Key component of architecture is the Data Model: all other parts considered disposable
• Objects maintain self-descriptive 'manifests'
• Archive organization and object structure both based on file systems
• Data Model allows easy tie-in to L.N.
Detail View: A Manifest
Detail View: An Object
NGDA and Logistical Networking
• Logistical Networking as an abstracted storage layer
• Increase download speeds
• Logistical Networking used as a tool for custodianship
• Logistical Networking as content transfer solution
• Logistical Networking as a backup for at-risk data (temporary stewardship)
The Future of NGDA and LN
Initial trials have met with mixed success
• Lots of mixed size objects in test sets (30,000)
• Upload set data size of ~1TB
• “Moderate” download speeds to LC over WAN
• Roughly 1 day per TB download
The near future
• Middleware to bridge search client & LN cloud
• Adjustments to handle mixed data sets
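To put the "roughly 1 day per TB" figure in more familiar units, the short calculation below converts it to a sustained transfer rate. It assumes decimal units (1 TB = 10^12 bytes) and a full 24-hour day of continuous transfer; both are assumptions, not details from the trial.

```python
# Convert "roughly 1 day per TB" into a sustained transfer rate.
# Assumptions (not from the slides): 1 TB = 10**12 bytes, transfer runs 24 h.
TB_BYTES = 10**12
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(bytes_moved: int, seconds: float) -> tuple[float, float]:
    """Return (megabytes per second, megabits per second)."""
    bytes_per_second = bytes_moved / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e6


mb_per_s, mbit_per_s = sustained_rate(TB_BYTES, SECONDS_PER_DAY)
print(f"{mb_per_s:.1f} MB/s, {mbit_per_s:.1f} Mbit/s")
```

Under these assumptions the rate comes out a little under 100 Mbit/s, which is consistent with the slide's description of the WAN speeds as "moderate".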
Basic elements of the LN stack
Highly generic, “best effort” protocol for using storage
• Generic -> doesn’t restrict applications
• “best effort” -> low burden on providers
• Easy to port and deploy
Metadata container for bit-level structure
• Modeled on Unix inode
• bit-level structure, control keys, …
• XML encoded
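The metadata container described above can be pictured as a list of mappings from byte ranges of a file to storage allocations on depots, much as an inode maps a file to disk blocks. The sketch below is a minimal illustration of that idea only: the class, the element and attribute names, the depot host names, and the key strings are all invented for the example and are not the real exNode schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Mapping:
    offset: int    # logical byte offset within the file
    length: int    # number of bytes covered by this allocation
    depot: str     # hypothetical depot host name
    read_key: str  # placeholder standing in for a control key


def to_xml(filename: str, mappings: list[Mapping]) -> str:
    """Serialize the mappings to XML (tag names are illustrative only)."""
    root = ET.Element("exnode", name=filename)
    for m in mappings:
        ET.SubElement(
            root, "mapping",
            offset=str(m.offset), length=str(m.length),
            depot=m.depot, read_key=m.read_key,
        )
    return ET.tostring(root, encoding="unicode")


def covering(mappings: list[Mapping], offset: int) -> list[Mapping]:
    """Return every allocation that holds the byte at `offset`."""
    return [m for m in mappings if m.offset <= offset < m.offset + m.length]


# Two overlapping allocations: bytes 100-199 are held on both depots.
mappings = [
    Mapping(0, 200, "depot-a.example.org", "key-1"),
    Mapping(100, 200, "depot-b.example.org", "key-2"),
]
xml_text = to_xml("map.tif", mappings)
replicas = covering(mappings, 150)
print(xml_text)
print(len(replicas), "allocations hold byte 150")
```

Because overlapping ranges are allowed, replication falls out of the structure: a reader can pick any allocation that covers the bytes it needs.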
Sample exNodes
[Figure: sample exNodes A, B, C; offset axis 0, 100, 200, 300; REDDnet depots on the network at Tennessee, Vanderbilt, UCSB, Stanford]
Crossing administrative domains, sharing resources
Partial exNode encoding
New federation members?
• Add new depots
• Rewrite the exNodes
• Copy the data
[Figure: REDDnet depots on the network at Tennessee, Vanderbilt, UCSB, Stanford, with LoC added]
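The three steps for bringing in a new federation member (add depots, copy the data, rewrite the exNodes) can be sketched with in-memory stand-ins. Here depots are plain dictionaries and an exNode is a list of mappings; the `Mapping` class, the `add_replica` function, and the key format are hypothetical and do not reflect the actual LN tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    offset: int  # logical byte offset within the file
    length: int  # bytes covered
    depot: str   # depot name (in-memory stand-in)
    key: str     # handle for the allocation on that depot


def add_replica(exnode, depots, new_depot):
    """Copy each distinct byte range to new_depot; return a rewritten exNode."""
    depots.setdefault(new_depot, {})          # step 1: add the new depot
    rewritten = list(exnode)
    seen = set()
    for m in exnode:
        extent = (m.offset, m.length)
        if extent in seen:                    # range already copied
            continue
        seen.add(extent)
        new_key = f"{new_depot}:{m.offset}"
        depots[new_depot][new_key] = depots[m.depot][m.key]  # step 2: copy data
        rewritten.append(Mapping(m.offset, m.length, new_depot, new_key))
    return rewritten                          # step 3: rewritten exNode


depots = {
    "vanderbilt": {"v0": b"a" * 100},
    "ucsb": {"u0": b"b" * 100},
}
exnode = [Mapping(0, 100, "vanderbilt", "v0"), Mapping(100, 100, "ucsb", "u0")]
exnode = add_replica(exnode, depots, "loc")
print(len(exnode), "mappings after adding the new depot")
```

The existing mappings are left in place, so readers holding the old exNode keep working while the new member gains a full copy.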
Basic elements of the LN stack
Highly generic, “best effort” protocol for using storage
• Generic -> doesn’t restrict applications
• “best effort” -> low burden on providers
• Easy to port and deploy
Metadata container for bit-level structure
• Modeled on Unix inode
• bit-level structure, control keys, …
• XML encoded
L-Store
LoDN
LoDN - Network File Manager
• Store files into the Logistical Network using Java upload/download tools.
• Manages exNode maintenance and replication.
• Provides account (single user or group) as well as “world” permissions.
Accessing distributed collection with LN
What is L-Store?
Goal of L-Store: Use LN to provide a generic, high performance, wide area capable, storage virtualization service
• Provides a file system interface to (globally) distributed IBP depots (e.g. currently uses WebDAV and CIFS)
• Flexible role based AuthZ (work in progress)
L-Store and Logistical Networking
• L-Store adds a name space on top of the exNode layer
• Allows for LN operations on the name space.
• LN’s parallelism for high performance and reliability, e.g. parallel transfers to improve performance (3 GB/s during SC06 demo)
[Figure: SC06 demo transfer rate; labels: 3 GB/s, 30 Mins]
L-Store scalability
• L-Store uses a Distributed Hash Table (DHT) to store all its “structural” metadata (i.e. metadata about how the bits are stored)
• DHTs provide a highly scalable way of storing metadata.
• Metadata and data can be scaled independently.
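One common way a DHT spreads keys over nodes is consistent hashing, sketched below. The slides do not say which scheme L-Store uses, so this is a generic illustration; the `HashRing` class, the node names, and the key paths are assumptions made up for the example.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Generic consistent-hash ring; not L-Store's actual implementation."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points to even out the load.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # A key belongs to the first point clockwise from its hash.
        i = bisect.bisect_right(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"/ngda/object-{i}" for i in range(1000)]
before = HashRing(["meta-1", "meta-2", "meta-3"])
after = HashRing(["meta-1", "meta-2", "meta-3", "meta-4"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding one node")
```

Adding a metadata node relocates only the keys that the new node takes over, which is the property that lets the metadata tier grow without reshuffling everything.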
Storage Management
Nevoa Networks (Brazilian company based on LN) provides management of remote/distributed storage via StorCore
• Provides resource discovery for L-Store.
• Allows depots to be grouped to form logical units.
• It can create dynamic logical units based on queries.
L-Store and FACIT
FACIT drives L-Store development:
• L-Sync: An rsync-like tool that uses L-Store as intermediate storage.
• Extended metadata attributes.
• A flexible policy framework.
Questions about
NGDA? Logistical Networking? LoDN? L-Store?
Discussion: Preservation’s storage problem
• Long-term preservation is a relay: repeated migrations across storage media/systems, archive systems, institutions
• Begin with the bits: storage technology changes every 3-5 yrs
• During some periods of time data will be in “steady state”
• But during a century, there will be 20-30 handoffs!
• How can we create a “handoff process” that can be sustained for a century or more? Can we create a “technical” process or will a social process have to do?
• Complicating factor: We’re drowning in data
[Figure: timeline in years; axis ticks 0, 5, 10, 15, 20, 90, 95, 100]
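The handoff count follows directly from the refresh interval: a century divided by a 3-5 year technology cycle. The sketch below works that out and then, purely as an illustration, shows how a small per-handoff risk compounds over many handoffs. The 99% per-handoff success rate is a made-up figure, not a measurement from the talk.

```python
def handoffs(span_years: float, refresh_years: float) -> float:
    """Number of migrations needed over span_years at a given refresh interval."""
    return span_years / refresh_years


def survival(per_handoff_success: float, n: int) -> float:
    """Chance that n independent handoffs all succeed (illustrative model)."""
    return per_handoff_success ** n


low = handoffs(100, 5)   # refresh every 5 years
high = handoffs(100, 3)  # refresh every 3 years
print(f"{low:.0f} to {high:.0f} handoffs per century")

# Hypothetical 99% success per handoff, assumed independent.
for n in (20, 30):
    print(f"{n} handoffs at 99% each: {survival(0.99, n):.2f} overall")
```

A 5-year cycle gives 20 handoffs and a 3-year cycle gives about 33, in line with the slide's rounded "20-30". Under the hypothetical 99% figure, the chance of an unbroken chain falls to roughly 0.82 after 20 handoffs and 0.74 after 30.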
Framing The Issue Globally
• World data expected to total 2 zettabytes by 2011 (IDC Whitepaper)
“As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.”
What does experience show?
SDSC’s archive shows exponential growth w/ a consistent doubling period of ~15 months
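A doubling period of about 15 months translates into a growth factor of 2^(t/15) after t months. The short calculation below evaluates that for one year, five years, and ten years; the function name is illustrative, and the figures assume the 15-month doubling period holds steadily.

```python
def growth_factor(months: float, doubling_months: float = 15) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_months)


annual = growth_factor(12)      # one year
five_year = growth_factor(60)   # one 5-year storage technology cycle
decade = growth_factor(120)     # ten years

print(f"1 year: x{annual:.2f}, 5 years: x{five_year:.0f}, 10 years: x{decade:.0f}")
```

If the trend held, an archive would grow by about 74% a year, 16-fold over a 5-year technology cycle, and 256-fold over a decade, so each migration would have to move far more data than the one before it.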
If preservation is a relay, then …
The key preservation problem at the bit layer is …
• Choice 1: steady state data storage
• Choice 2: copying data to different systems
Impression: De facto choice is #1
When you have to “hand off” data, is it sufficient to have
• Choice 1: A social solution
• Choice 2: A technical solution
Impression: De facto choice is #1
Contention: Neither of these de facto choices is adequate