Date posted: 22-Dec-2015
Stanford Archival Repository Project
Brian Cooper
Arturo Crespo
Hector Garcia-Molina
Department of Computer Science
Stanford University
Data does not live forever
Much data is stored digitally (perhaps exclusively)
– Text
– Multimedia (images, sound, etc.)
– Scientific data
But digital storage is currently unreliable
– Magnetic tapes decay, break or lose magnetism
– Disks crash
– Buildings burn down
– Users delete data (accidentally or maliciously)
Data does not live forever
Digital information already lost:
– Early NASA records
– U.S. Census Information
– Toxic waste records
Decay time for common media:
– Magnetic Tapes: 10-20 years
– CD-ROM: 5-50 years
– Hard Drive: 3-5 years
Digital archiving
Digital archivists need:
– A reliable system to store digital data for long periods without losing it
– Convenient tools to add new data and manage data already archived
– Methods for finding the “best” configuration
» Most reliable
» Most cost-effective
» Etc.
Archival Repository Project
Goal: Reliably archive digital information for long periods of time (decades or centuries)
– Focus on “preserving bits”
– Preserving meaning: future work
Strategy
– Replicate objects
– Automatically detect and correct errors
Our project
– Stanford Archival Vault (SAV): reliably archives data
– InfoMonitor: automatically adds newly created data to the archive
– ArchSim: a simulation tool to model archives
Architecture
[Diagram: Users, Filesystem, InfoMonitor and a local SAV Archive with its archived data, connected over the Internet to a remote SAV Archive with its archived data]
SAV architecture
[Diagram: layered SAV architecture. Labels: “Core” SAV components (Object Store, Reliability Layer, Remote SAV Sites); Upper layers; Application/user level (User Interface, Data Creation/Import)]
Write-once repository
Deletions/modifications disallowed
– Any object deleted or modified must have been corrupted, and is replaced
Challenges
– Constructing structures of objects
» Object references constrained to point from new to old objects
– Representing modifications
» Archive new version of objects = version chain
– Finding objects
» Indexes
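The write-once rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SAV's actual design: the class name `WriteOnceStore`, the JSON record layout, and the use of a SHA-256 hash as the object signature are all hypothetical choices made for the example.

```python
import hashlib
import json


class WriteOnceStore:
    """Toy write-once object store (illustrative only; not SAV's design)."""

    def __init__(self):
        self._objects = {}  # signature -> serialized object bytes

    def put(self, payload, refs=(), prev_version=None):
        """Archive a new object and return its signature.

        Every reference must name an object that is already archived, so
        references can only point from new objects to old ones.
        """
        all_refs = list(refs) + ([prev_version] if prev_version else [])
        for ref in all_refs:
            if ref not in self._objects:
                raise KeyError(f"reference to unknown object: {ref}")
        record = json.dumps(
            {"payload": payload, "refs": list(refs), "prev": prev_version},
            sort_keys=True,
        ).encode("utf-8")
        # Assumption: the signature is a hash of the object's content.
        signature = hashlib.sha256(record).hexdigest()
        # Archiving identical content twice is a no-op; nothing is overwritten.
        self._objects.setdefault(signature, record)
        return signature

    def get(self, signature):
        return json.loads(self._objects[signature])

    def version_chain(self, signature):
        """Follow 'prev' links from the newest version back to the oldest."""
        chain = []
        while signature is not None:
            chain.append(signature)
            signature = self.get(signature)["prev"]
        return chain


store = WriteOnceStore()
v1 = store.put("draft of report")
v2 = store.put("final report", prev_version=v1)
chain = store.version_chain(v2)  # newest version first
```

The store has no delete or update method, so a modification is represented only by archiving a new version that points back at the old one.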
Write once repository: Indexes
Key to performance
– Locate an object quickly using its signature, “Who points to me?” problem, etc.
Disposable indexes
– Can be rebuilt at any time from SAV objects
“Bookmarks” used to find collections of objects using indexed name
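One way to picture a disposable index is as a pure function of the archived objects: if the index is lost, it is simply recomputed. The sketch below assumes a hypothetical in-memory representation in which each object carries an optional well-known `name` and a list of `refs`; neither field name comes from the slides.

```python
from collections import defaultdict


def build_indexes(objects):
    """Rebuild all indexes from scratch by scanning the archived objects.

    `objects` maps signature -> {"name": optional str, "refs": [signatures]}.
    """
    pointed_to_by = defaultdict(set)  # answers "who points to me?"
    bookmarks = {}                    # well-known name -> signature
    for signature, obj in objects.items():
        for ref in obj["refs"]:
            pointed_to_by[ref].add(signature)
        if obj.get("name"):
            bookmarks[obj["name"]] = signature
    return dict(pointed_to_by), bookmarks


objects = {
    "sig-a": {"name": None, "refs": []},
    "sig-b": {"name": None, "refs": ["sig-a"]},
    "sig-c": {"name": "tech-reports", "refs": ["sig-a", "sig-b"]},
}
pointed_to_by, bookmarks = build_indexes(objects)
```

Because nothing in the indexes is authoritative, they never need to be replicated or repaired; only the objects do.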
Write once repository: Indexes
[Diagram: a bookmark (with well-known name) pointing to objects in SAV]
Replication: Site networks
Sites form “replication agreements”
– Agree to replicate data
– Specify data to replicate in agreement
» May be a subset of all of the data in the archive
– Periodically connect and compare data, looking for errors
[Diagrams: a strongly connected site network and a weakly connected site network]
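The periodic "connect and compare" step might look roughly like the following. This is a sketch under assumptions, not the SAV protocol: both sites are modeled as plain dictionaries, and each object is assumed to have a checksum recorded when it was archived (CRC-32 here, via Python's `zlib.crc32`). Since the repository is write-once, any copy that is missing or no longer matches its checksum is treated as corrupted and replaced from the other site.

```python
import zlib


def checksum(data):
    return zlib.crc32(data)


def audit_and_repair(local, remote, expected):
    """Compare two replicas of an agreed set and repair from the good copy.

    `local` and `remote` map object name -> bytes; `expected` maps
    object name -> checksum recorded when the object was archived.
    Returns the names that could not be repaired from either side.
    """
    unrecoverable = []
    for name, good_sum in expected.items():
        local_ok = name in local and checksum(local[name]) == good_sum
        remote_ok = name in remote and checksum(remote[name]) == good_sum
        if local_ok and not remote_ok:
            remote[name] = local[name]
        elif remote_ok and not local_ok:
            local[name] = remote[name]
        elif not local_ok and not remote_ok:
            unrecoverable.append(name)
    return unrecoverable


site_a = {"doc1": b"alpha", "doc2": b"beta"}
site_b = {"doc1": b"alpha", "doc2": b"b3ta"}   # doc2 is corrupted at site B
expected = {name: checksum(data) for name, data in site_a.items()}
del site_a["doc1"]                             # doc1 was deleted at site A
lost = audit_and_repair(site_a, site_b, expected)
```

An object is lost only when every site holding it has a bad copy at the same time, which is why the number of sites and the frequency of comparison both matter.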
Replication: Data sets
SAV replicates different data sets separately
– E.g., web pages under agreement A, Usenet articles under agreement B
– “Replication sets” should grow without human intervention
– Traverse link structure to find objects in set
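Finding the members of a replication set by traversing the link structure amounts to a graph reachability computation. A minimal sketch follows, assuming links are available as a dictionary from object name to the names it references; the object names and the function `replication_set` are hypothetical.

```python
from collections import deque


def replication_set(links, start):
    """Return every object reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in links.get(current, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


links = {
    "web-root": ["page-1", "page-2"],
    "page-2": ["page-1", "image-7"],
    "usenet-root": ["article-1"],
}
web_set = replication_set(links, "web-root")
```

Because membership is computed rather than listed by hand, objects linked into the structure later are picked up the next time the traversal runs.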
Replication: Data sets
[Diagram: a traversal starting from one object and following links through the objects in SAV]
User interface
Object store performance
[Chart: time (seconds, 0 to 3000) versus repository size (gigabytes, 0 to 12); series: read from disk, compute CRC, build indexes]
Reliability layer performance
[Chart: time (seconds, 0 to 14) versus repository size (gigabytes, 0 to 12); series: compute set, compare sets, transfer set over network]
The InfoMonitor
Goal
– Create a convenient, transparent mechanism for getting data from existing stores into the archive
Architecture
[Diagram: Users, Filesystem, InfoMonitor, SAV Archive]
Detecting new data
Must find and archive new data
– Filesystem will not signal data writes
– Users should not have to explicitly “check in” data
Scanning
– Quick scan: detect changes using timestamps
– Slow scan: detect changes using file contents
Filtering
– Automatically decide what to archive
– Use filtering rules
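The two scan modes and the filtering step can be sketched as follows. This is an illustration, not the InfoMonitor implementation: the function names, the use of file modification times for the quick scan, SHA-256 digests for the slow scan, and glob patterns as filtering rules are all assumptions made for the example.

```python
import fnmatch
import hashlib
import os


def quick_scan(root, last_mtimes):
    """Return paths whose modification time differs from the last scan."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime_ns
            if last_mtimes.get(path) != mtime:
                changed.append(path)
            last_mtimes[path] = mtime
    return sorted(changed)


def slow_scan(root, last_digests):
    """Return paths whose contents differ from the last scan."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            if last_digests.get(path) != digest:
                changed.append(path)
            last_digests[path] = digest
    return sorted(changed)


def apply_filters(paths, include_patterns):
    """Keep only paths whose file name matches one of the glob patterns."""
    return [
        p for p in paths
        if any(fnmatch.fnmatch(os.path.basename(p), pat)
               for pat in include_patterns)
    ]
```

The quick scan only reads file metadata, so it is cheap but trusts timestamps; the slow scan reads every byte, so it catches changes that leave the timestamp untouched at a much higher cost.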
User interface
InfoMonitor performance
[Chart: total time (seconds, 0 to 16000) versus data migrated (megabytes, 0 to 3000); series: initial load, slow scan, quick scan]
Designing Archival Repositories
Designer needs to answer questions like:
– What is the minimum number of copies of a document needed to ensure its preservation?
– Which is more cost-efficient: storing the information on one expensive disk with a low failure rate, or on two inexpensive disks with high failure rates?
– Are two sites enough to guarantee preservation?
– How often should we scan the repositories for errors?
– What’s the MTTF of this design?
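Questions like these can be explored with a small simulation. The sketch below is not ArchSim and makes strong simplifying assumptions that are not in the slides: each copy sits on a disk with exponentially distributed lifetime, a scan at fixed intervals instantly rebuilds failed copies from any survivor, and data is lost only when every copy fails within the same interval. Site failures, aging and repair time are ignored, and the parameter values are chosen purely for illustration.

```python
import math
import random


def analytic_mttf(copies, disk_mttf_years, scan_interval_years):
    """Expected years until every copy fails within one scan interval."""
    p_fail = 1.0 - math.exp(-scan_interval_years / disk_mttf_years)
    return scan_interval_years / (p_fail ** copies)


def simulated_mttf(copies, disk_mttf_years, scan_interval_years,
                   trials=2000, seed=0):
    """Estimate the same quantity by Monte Carlo simulation."""
    rng = random.Random(seed)
    p_fail = 1.0 - math.exp(-scan_interval_years / disk_mttf_years)
    total_years = 0.0
    for _ in range(trials):
        years = 0.0
        while True:
            years += scan_interval_years
            failures = sum(rng.random() < p_fail for _ in range(copies))
            if failures == copies:
                break  # no good copy left to repair from: data is lost
            # Otherwise the scan rebuilds failed copies from a survivor.
        total_years += years
    return total_years / trials


interval = 60 / 365  # scan every 60 days (illustrative value)
one_copy = analytic_mttf(1, 3.0, interval)
two_copies = analytic_mttf(2, 3.0, interval)
estimate = simulated_mttf(2, 3.0, interval)
```

Even in this toy model, the number of copies, the disk failure rate and the scan interval interact multiplicatively, which is why a design tool is more useful than rules of thumb.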
Contributions
A comprehensive model for an Archival Repository
A powerful simulation tool, ArchSim, for evaluating Archival Repositories and the available strategies
A detailed case study for a hypothetical TR Repository operated between Stanford and MIT
How important is having good disks?
[Chart: MTTF of the MIT/Stanford Archival System with r_sto of 60 days. x-axis: f_sto (years), at 3, 5, 10 and 20; y-axis: MTTF of system (years), 0 to 1600]
Preventive maintenance
[Chart: Preventive Maintenance and Aging. x-axis: start of aging (years), at 1, 3, 5, 10 and Never; y-axis: MTTF (years), 0 to 80; one series per preventive maintenance period (years): 1, 3, 5, 10, Never]
Current and future work
New models for replication agreements and “data trading”
Archiving the World Wide Web
Modeling cost
Managing “meaning”
Security
Alternative object naming schemes
Other “upper layers,” e.g. user access, metadata, etc.
Conclusion
Digital librarians need tools to preserve data
Our project addresses this need
– Reliable storage: SAV
– Convenient access: InfoMonitor
– Finding the best configuration: ArchSim
More work must be done to refine these models
– More automation
– More flexibility
– Answer a wider range of design questions
For more information
http://www-db.stanford.edu/archivalrep
Brian Cooper: [email protected]
Arturo Crespo: [email protected]
Hector Garcia-Molina: [email protected]