An Introduction to Repositories
Thornton Staples
Director of Community Strategy and Alliances
Director of the Fedora Project
Creating a digital library is not a process of moving the traditional
library online.
Increasingly, it’s more about the care and feeding of the web!
Creating digital surrogates of paper collections is only the beginning
• Surrogate collections are an important step!
• Collecting born-digital materials is rapidly coming upon us
• Simple institutional repository approaches are good but only scratch the surface
• Complex scholarly and scientific projects are the biggest challenge
[Diagram: The Rossetti Archive, with nodes labeled Text, Artwork, Work, and Images]
Repositories are designed to be flexible and adaptable
• Relational databases are too rigid
• Need to be able to add new content types and media easily
• Need to be able to handle arbitrary complexity in relatively simple ways
• Above all, it all needs to be durable over a very long time!
[Diagram: applications (Preservation and Archiving, Scholars Workbench, Institutional Repository, Data Curation Solutions) sit on The Repository (content abstraction), which sits on storage (RAID Arrays, Tape Libraries, Cloud Storage)]
Repositories are the foundation for many applications
• A set of abstractions that can be used to represent different kinds of data
• Manages the actual content beneath the surface
• Negotiates the connection between access and storage
• Designed to make data “durable” over the long term
Access is the core purpose of a repository
• Searching is important but it is not the only thing
• Finding is the point of searching!
• The point of finding is very often to use the resource that you have found, for analysis or reuse
• New digital resources that reuse found objects depend on continuing access for validity
Any unit of content may have more than one context
• Within one collection
– An architectural image may be related to more than one building
• Across collections
– Special collections images may be art objects
• Across repositories
– Born-digital publications will almost always cross institutional boundaries
Authenticity and fidelity
• What is an authoritative digital surrogate of a real object?
• When is a copy of an original surrogate exact?
• A born-digital object has no original to compare against
• Digital “fingerprints” must be captured and managed as metadata
• When formats change, objects will not have all the same technical characteristics…
Making complex digital information “durable” is a very hard problem
• Durability implies that digital content is directly in use and sustained long-term
• A history of the changes to the encoding and state of content must be reliably provided
• A meaningful context for any unit of content may be one of many and must be sustained
• Replication appears to be our best friend and the cloud looks like an answer
Management is the core function of a repository
• Repositories are designed to keep everything as stable as possible while providing flexible access
• Managing things such that when they aren’t changing they are reliably the same
• Accounting for migration for technical reasons
• Disaster preparedness (lots of copies!)
• Must respect legal and policy issues
Repository abstractions provide a durability framework for managing.
• Content is “unitized” as information objects that combine data, metadata, policies, relationships and the history of the object.
• Complex digital resources are formally defined graphs of related objects.
• The public view of the content is presented as virtual data components.
A data object is one unit of content
[Diagram: a data object with a Persistent ID; Reserved Datastreams (DC, RELS-EXT, AUDIT, POLICY); and Custom Datastreams 1, 2, … n (any type, any number)]
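The object model described here can be sketched in a few lines. This is only an illustration under stated assumptions, not Fedora's actual implementation: the class names `Datastream` and `InfoObject`, their fields, and the PIDs are hypothetical, chosen to mirror the slide's list of data, metadata, policies, relationships, and history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Datastream:
    """One component of an object: a piece of data or metadata."""
    ds_id: str       # e.g. "DC", "RELS-EXT", or a custom name
    mime_type: str
    content: bytes


@dataclass
class InfoObject:
    """Hypothetical information object bundling everything about one unit of content."""
    pid: str                                           # persistent identifier
    datastreams: dict = field(default_factory=dict)    # ds_id -> Datastream
    relationships: list = field(default_factory=list)  # (predicate, related PID)
    policy: dict = field(default_factory=dict)         # e.g. {"read": "public"}
    history: list = field(default_factory=list)        # audit trail entries

    def add_datastream(self, ds: Datastream) -> None:
        self.datastreams[ds.ds_id] = ds
        self._log(f"added datastream {ds.ds_id}")

    def relate(self, predicate: str, other_pid: str) -> None:
        self.relationships.append((predicate, other_pid))
        self._log(f"added relationship {predicate} -> {other_pid}")

    def _log(self, event: str) -> None:
        # Every change is recorded with a UTC timestamp.
        self.history.append((datetime.now(timezone.utc).isoformat(), event))


obj = InfoObject(pid="demo:1", policy={"read": "public"})
obj.add_datastream(Datastream("DC", "text/xml", b"<dc>...</dc>"))
obj.relate("isMemberOf", "demo:collection")
```

Keeping the history inside the object, rather than in a separate log, is one way to make each unit of content self-describing when it is copied or migrated.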
Files are stored on disk and managed directly
• Versioning is necessary
• Checksums for each file provide assurance that the file has not changed
• Can be managed by the repository or as remote files
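A minimal sketch of the versioning and checksum points above. The slide does not name a checksum algorithm; SHA-256 is an assumption here, and `VersionedFile` is a hypothetical helper that keeps content in memory rather than on disk.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a hex digest used as the file's checksum (SHA-256 assumed)."""
    return hashlib.sha256(data).hexdigest()


class VersionedFile:
    """Keeps every version of a file together with the checksum recorded at write time."""

    def __init__(self):
        self.versions = []  # list of (content, checksum), oldest first

    def add_version(self, data: bytes) -> str:
        digest = fingerprint(data)
        self.versions.append((data, digest))
        return digest

    def verify(self, index: int = -1) -> bool:
        """Recompute the checksum and compare it with the recorded one."""
        data, recorded = self.versions[index]
        return fingerprint(data) == recorded


f = VersionedFile()
f.add_version(b"first draft")
f.add_version(b"second draft")
intact = f.verify()

# Simulate silent corruption of the stored bytes of version 0.
f.versions[0] = (b"first draFt", f.versions[0][1])
corruption_detected = not f.verify(0)
```

Older versions stay in place when a new one is added, so a failed check on one version does not affect the others.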
Virtual datastreams provide the access abstraction
• Can be simply retrieving a stored component
• Views of the content can be derived on demand, for different formats and resolutions
• Other data products can be derived on demand; e.g., tiles from a JPEG2000 file
• By providing an abstract view of the content you break the dependence on the stored files
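One way to picture the access abstraction: callers ask for a named view of an object, and the repository decides whether to return a stored component or derive something on demand. The sketch below uses a plain metadata record instead of images to stay self-contained; the names `STORED`, `VIEWS`, and `get_view`, and the sample record, are hypothetical.

```python
from typing import Callable, Dict

# Stored component: one metadata record per PID (placeholder values).
STORED: Dict[str, dict] = {
    "demo:1": {"title": "Example Title", "creator": "Author, Example", "date": "1900"},
}


def raw_view(record: dict) -> dict:
    """Simply retrieve the stored component."""
    return dict(record)


def citation_view(record: dict) -> str:
    """Derive a citation string on demand; nothing like it is stored."""
    return f'{record["creator"]}. {record["title"]}. {record["date"]}.'


def brief_view(record: dict) -> str:
    """Derive a short label on demand."""
    return record["title"]


# Callers depend on view names, not on how or where the content is stored.
VIEWS: Dict[str, Callable[[dict], object]] = {
    "raw": raw_view,
    "citation": citation_view,
    "brief": brief_view,
}


def get_view(pid: str, view: str):
    return VIEWS[view](STORED[pid])


citation = get_view("demo:1", "citation")
```

If the stored format changes later, only the view functions need to change; anything that calls `get_view` keeps working.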
[Diagram: an object identified by its PID. Content Management: systemMeta, MODS file, JPEG2000 file. Content Access: Thumb, Screen, Master, Custom Size, Dublin Core, MODS, Citation]
Descriptive metadata is about the content of the resource
• Indexed for searching
• Also used for rendering user experiences
• Some standards in use:
– Dublin Core: general
– MODS: bibliographic
– VRA Core: cultural heritage
– FGDC: GIS datasets
– DDI: social science datasets
Administrative metadata is more about the encoding and use
• Metadata about the object generally, like checksums
• Technical metadata about the specifics of the encoding of each format
• Event metadata, about what happens to an object over its lifetime; audit trails
• Policy metadata, like access restrictions and credit lines
Relationships Among Objects
• Describes adjacency relationships among objects, among units of content
• Can be done by explicitly listing IDs in XML, using METS for example
• Or using RDF:
PID – typeOfRelationship – relatedObjectPID
• Can be used to assemble complex resources and aggregations of objects
• Explicit and implicit aggregations
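The PID – typeOfRelationship – relatedObjectPID pattern can be sketched with plain tuples, without an RDF library. The predicate names `isPartOf` and `isMemberOf`, the PIDs, and the helper functions below are illustrative assumptions; the point is only to show how a complex resource can be assembled by following relationships.

```python
from collections import defaultdict

# Each triple is (subject PID, type of relationship, related object PID).
triples = [
    ("demo:page1", "isPartOf", "demo:text1"),
    ("demo:page2", "isPartOf", "demo:text1"),
    ("demo:text1", "isMemberOf", "demo:collection"),
    ("demo:text2", "isMemberOf", "demo:collection"),
]


def build_index(triples):
    """Map (predicate, related PID) to the subjects that point at it."""
    index = defaultdict(list)
    for subject, predicate, related in triples:
        index[(predicate, related)].append(subject)
    return index


def aggregate(pid, index, predicates=("isPartOf", "isMemberOf")):
    """Collect every object reachable under pid via the given predicates."""
    found = []
    seen = {pid}
    stack = [pid]
    while stack:
        current = stack.pop()
        for predicate in predicates:
            for child in index.get((predicate, current), []):
                if child not in seen:
                    seen.add(child)
                    found.append(child)
                    stack.append(child)
    return sorted(found)


index = build_index(triples)
members = aggregate("demo:collection", index)
# members == ["demo:page1", "demo:page2", "demo:text1", "demo:text2"]
```

Here the collection never lists its members explicitly; the aggregation is implied by the relationships asserted on each member.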
Text Collections
[Diagram: the Modern English Collection, with nodes labeled Texts and Page Images]
Establishing and Enforcing Policies
• Policies must be established for the entire life-cycle of the information
– Ownership and workflow policies
– Access and use policies
– Policies associated with sustaining (or not!)
• Policies must be expressed for end users
• Policies must also be expressed for machine access
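To illustrate the last two bullets, the same policy can be held once as data and then both enforced by code and rendered as text for people. This is a sketch under assumptions: the `Rule` structure, the role names, and the deny-by-default behavior are hypothetical and not taken from any particular repository system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    action: str   # e.g. "read", "modify"
    role: str     # e.g. "public", "staff"
    allow: bool


def is_permitted(rules, action, roles):
    """Machine-facing check: first matching rule wins; no match means deny."""
    for rule in rules:
        if rule.action == action and rule.role in roles:
            return rule.allow
    return False


def describe(rules):
    """Human-facing rendering of the same rules."""
    return [f'{"May" if r.allow else "May not"} {r.action}: {r.role}' for r in rules]


policy = [
    Rule("read", "public", True),
    Rule("modify", "staff", True),
]

public_can_read = is_permitted(policy, "read", {"public"})
public_can_modify = is_permitted(policy, "modify", {"public"})
summary = describe(policy)
```

Deriving both forms from one source keeps the statement shown to end users from drifting away from what the machine actually enforces.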
Indexing
• In a repository there is no “catalog”; the repository is the catalog
• Many indexes can be created for many reasons
• Either metadata or full content, or both
• Ontology-based indexes are rapidly becoming more feasible
• Keeping indexes updated is the trick
Indexing as a harvesting service
[Diagram: the Fedora Repository Service publishes events via simple JMS; services listen and consume events or other messages. Labeled components: GSearch, OAI, Ingest, Blacklight, More…]
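The idea that the repository publishes events while indexing services listen and consume them can be sketched with a small in-process publish/subscribe bus. This is not JMS and not the API of GSearch or Blacklight; `EventBus`, `Repository`, `TitleIndex`, and the event names are hypothetical stand-ins that only show how an index can stay current without the repository knowing about it.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process stand-in for a messaging service."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self._subscribers[event_type]:
            handler(payload)


class Repository:
    """Stores objects and publishes an event for every change."""

    def __init__(self, bus):
        self.bus = bus
        self.objects = {}

    def ingest(self, pid, metadata):
        self.objects[pid] = metadata
        self.bus.publish("ingest", {"pid": pid, "metadata": metadata})

    def purge(self, pid):
        del self.objects[pid]
        self.bus.publish("purge", {"pid": pid})


class TitleIndex:
    """An indexing service that listens for events and updates itself."""

    def __init__(self, bus):
        self.entries = {}  # lowercase word -> set of PIDs
        bus.subscribe("ingest", self.on_ingest)
        bus.subscribe("purge", self.on_purge)

    def on_ingest(self, event):
        for word in event["metadata"]["title"].lower().split():
            self.entries.setdefault(word, set()).add(event["pid"])

    def on_purge(self, event):
        for pids in self.entries.values():
            pids.discard(event["pid"])

    def search(self, word):
        return sorted(self.entries.get(word.lower(), set()))


bus = EventBus()
repo = Repository(bus)
index = TitleIndex(bus)

repo.ingest("demo:1", {"title": "Modern English Texts"})
repo.ingest("demo:2", {"title": "English Page Images"})
hits_before = index.search("english")   # ["demo:1", "demo:2"]
repo.purge("demo:1")
hits_after = index.search("english")    # ["demo:2"]
```

Because the repository only publishes events, further indexes can be added by subscribing new listeners, with no change to the repository itself. The sketch does not handle re-ingesting an existing PID with a changed title, which a real index would need to.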