APARSEN
Metadata for preservation, curation and interoperability
Workshop on Research Metadata in Context
7-8 Sept 2010, Nijmegen
David Giaretta
APA and STFC
Digital Preservation
• Ensure that digitally encoded information is understandable and usable over the long term
– Long term could start at just a few years
• Easy to make claims
– Difficult to provide proof
• Reference Model for Open Archival Information System (ISO 14721)
– The basic standard for work in digital preservation
– Defines terminology and compliance criteria
Definitions (OAIS)
• Long Term Preservation: The act of maintaining information, Independently Understandable by a Designated Community, and with evidence supporting its Authenticity, over the Long Term.
• Long Term: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.
Not just BIT preservation
Not just rendering
Information not just DATA or Documents
Authenticity
Basic concept
• Digital preservation had been dominated by libraries and (state) archives
• However there was a focus there on “rendered objects” and
• Tendency to think data is an “easy” add-on
HOWEVER
• Need to deal with DATA – processed to new things, not just rendered
• Need to follow OAIS – finer grained view
• Need to test and prove that things work
“metadata”
“CASPAR banned the use of the term metadata unless absolutely necessary”
Data…
Level 2 GOME Satellite instrument data
Contains numbers – need meaning
...to process to this
...or this
...through complex processing schemes
Just Format?
sfqsftfoubujpo jogpsnbujpo svmft
You have a file
JHOVE tells you it is WORD version 7
..with some extra information..
representation information rules
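
The garbled line above ("sfqsftfoubujpo jogpsnbujpo svmft") appears to be this phrase with every letter moved one place forward in the alphabet. The Python sketch below tests that reading. The function name `shift_letters` is made up for illustration, and the shift of one is an observation about the slide text, not something the talk states. The point it illustrates: the characters alone are not usable until the rule for interpreting them (the Representation Information) is known.

```python
def shift_letters(text: str, shift: int) -> str:
    """Shift each lowercase ASCII letter by `shift` places, wrapping around."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)  # leave spaces and other characters alone
    return "".join(out)


encoded = "sfqsftfoubujpo jogpsnbujpo svmft"
# Assumed rule: each letter was shifted forward by one, so shift back by one.
print(shift_letters(encoded, -1))
```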
Format Registries – useful but not enough: formats can be used for multiple purposes e.g. audio files used to store configuration parameters
Examples (cont)
• “504b0304140000000800f696….”
• “This is a ZIP file which contains Word files, each of which contains an encoded message which needs the key ‘!D$G^AJU*KI’ to decode it using encryption method SHA7”
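
A format identifier can at least recognise the container in the first example from its leading bytes. The sketch below checks the bytes shown on the slide against the ZIP local file header signature (`PK\x03\x04`). Only the visible prefix is used, and `looks_like_zip` is a name invented for illustration. This recovers the container format only: that the contents are Word files, and the key and method needed to decode them, are Representation Information that no signature check can supply.

```python
ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature that starts a ZIP local file header


def looks_like_zip(leading_bytes: bytes) -> bool:
    """Return True if the bytes begin with the ZIP local file header signature."""
    return leading_bytes.startswith(ZIP_LOCAL_HEADER)


# Only the prefix visible on the slide; the rest of the file is elided there.
prefix = bytes.fromhex("504b0304140000000800f696")
print(looks_like_zip(prefix))
```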
Examples (cont)
• LaTeX file containing an EPS (Encapsulated PostScript) version of an image
• Web page containing Java Applet generating random numbers
• SWISS-PROT data
• Foreign Language emails
XML enough? – can stare at this and probably understand it
<family> <father>John</father> <mother>Mary</mother> <son>Paul</son></family>
..but what about this?
<VOTABLE version="1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1" xmlns="http://www.ivoa.net/xml/VOTable/v1.1"><RESOURCE><TABLE name="6dfgs_E7_subset" nrows="875"><PARAM arraysize="*" datatype="char" name="Original Source"
value="http://www-wfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz"><DESCRIPTION>URL of data file used to create this table.</DESCRIPTION></PARAM><PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo
usage."/><FIELD arraysize="15" datatype="char" name="TARGET"><DESCRIPTION>Target name</DESCRIPTION></FIELD><FIELD arraysize="11" datatype="char" name="DEC" unit="DMS"><DATA><FITS><STREAM encoding='base64'>U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBmb3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAgICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAvIE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
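
The `STREAM` element is base64 text, so even after the XML is parsed a further decoding step is needed before the FITS header inside it becomes visible. The sketch below uses Python's standard `base64` module on the first twelve characters of the stream shown above. What the decoded keyword means still has to come from the FITS standard.

```python
import base64

# First 12 characters of the base64 stream in the VOTable fragment above.
stream_start = "U0lNUExFICA9"
decoded = base64.b64decode(stream_start)
print(decoded)  # b'SIMPLE  ='
```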
Performance Viewer: side-by-side comparison and validation of the transformation. From left to right: 3D visualization in Ogre3D, 3D model of the stage including the virtual dancer in VRML.
Figure 8 Some aspects of acousmatic production
[Figure: classification of digital objects along three axes: Rendered / Non-Rendered, Static / Dynamic, Simple / Complex]
Information Model & Representation Information
The Information Model is key
Recursion ends at KNOWLEDGEBASE of the DESIGNATED COMMUNITY
(this knowledge will change over time and region)
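[Diagram: OAIS Information Model]
• Information Object = Data Object interpreted using Representation Information (1+)
• Data Object is a Physical Object or a Digital Object
• Digital Object is made up of Bit Sequences (1+)
• Representation Information is itself interpreted using further Representation Information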
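
The recursion in the Information Model can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and function names are invented here, and the dependency chain in the example (FITS standard needing the PDF standard, which needs English) is hypothetical, not taken from the OAIS text.

```python
from dataclasses import dataclass, field


@dataclass
class RepresentationInformation:
    name: str
    # RepInfo is itself information, so it may need further RepInfo.
    interpreted_using: list["RepresentationInformation"] = field(default_factory=list)


@dataclass
class InformationObject:
    data_object: bytes  # stands in for a Digital Object's bit sequence
    interpreted_using: list[RepresentationInformation]


def repinfo_to_supply(info: InformationObject, knowledge_base: set[str]) -> list[str]:
    """Names of the RepInfo an archive would have to supply for this community."""
    needed: list[str] = []

    def visit(rep: RepresentationInformation) -> None:
        if rep.name in knowledge_base or rep.name in needed:
            return  # recursion ends at what the community already knows
        needed.append(rep.name)
        for dep in rep.interpreted_using:
            visit(dep)

    for rep in info.interpreted_using:
        visit(rep)
    return needed


# Hypothetical dependency chain, for illustration only.
english = RepresentationInformation("ENGLISH LANGUAGE")
pdf_std = RepresentationInformation("PDF STANDARD", [english])
fits_std = RepresentationInformation("FITS STANDARD", [pdf_std])
fits_file = InformationObject(b"...", [fits_std])

print(repinfo_to_supply(fits_file, {"ENGLISH LANGUAGE", "PDF STANDARD"}))
print(repinfo_to_supply(fits_file, {"ENGLISH LANGUAGE"}))
```

A community with a smaller knowledge base gets a longer list, which is why the Designated Community has to be defined before the archive can say how much Representation Information is enough.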
Representation Information Network
Modules and Dependencies: defining the Designated Community
README.txt
TEXT EDITOR
ENGLISH LANGUAGE
WINDOWS XP
FITS FILE
FITS STANDARD
PDF STANDARD
FITS JAVA s/w
JAVA VM
PDF s/w
FITS DICTIONARY
DICTIONARY SPECIFICATION
UNICODE SPECIFICATION
XML SPECIFICATION
MULTIMEDIA PERFORMANCE DATA
C3D DirectX MAX/MSP
3D motion data files
3D scene data files
motion to music mapping strategy
FITS FILE
FITS DICTIONARY
FITS STANDARD
PDF SOFTWARE
JAVA VM
PDF STANDARD
FITS JAVA SOFTWARE
DICTIONARY SPECIFICATION
XML SPECIFICATION
UNICODE SPECIFICATION
DDL DESCRIPTION
DDL DEFINITION
DDL SOFTWARE
If we can run this then we can run the Java software to extract the numbers
If we cannot run this then we can use an emulator or use its RepInfo to re-create a Java VM
If we cannot run the Java Virtual Machine then we use this source code to re-write in another programming language such as C
If we can run this then we can use this in a generic application to extract the numbers
If we cannot run the DDL software then we can look at the DDL definition and write some software to extract the numbers
In principle we could use this, plus the Dictionaries in order to understand the keywords in order to extract the numbers
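
The callouts above describe a chain of fallbacks for getting the numbers out of the FITS file. A rough Python sketch of that chain follows. Which diagram node each "this" refers to is not recoverable from the text, so the pairing of callouts with nodes below is an assumed reading, and names such as `JAVA VM REPINFO` and `FITS JAVA SOURCE CODE` are hypothetical labels, not nodes from the diagram.

```python
# Each entry: (what we do, what must still be available to do it).
# The requirement sets are assumptions made for illustration.
STRATEGIES = [
    ("run the FITS Java software on a Java VM",
     {"JAVA VM", "FITS JAVA SOFTWARE"}),
    ("re-create a Java VM from its RepInfo or an emulator, then run the Java software",
     {"JAVA VM REPINFO", "FITS JAVA SOFTWARE"}),
    ("re-write the Java source code in another language such as C",
     {"FITS JAVA SOURCE CODE"}),
    ("use the DDL description in a generic application",
     {"DDL SOFTWARE", "DDL DESCRIPTION"}),
    ("write new software from the DDL definition",
     {"DDL DEFINITION", "DDL DESCRIPTION"}),
    ("read the FITS standard and the dictionaries and extract the numbers by hand",
     {"FITS STANDARD", "FITS DICTIONARY"}),
]


def choose_strategy(available):
    """Return the first strategy whose requirements are all still available."""
    for description, needs in STRATEGIES:
        if needs <= available:  # subset test
            return description
    return None  # nothing usable is left: the information is lost


print(choose_strategy({"JAVA VM", "FITS JAVA SOFTWARE", "FITS STANDARD"}))
print(choose_strategy({"FITS STANDARD", "FITS DICTIONARY"}))
```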
Virtualisation
• 2-D array: Height, Width, Bits per Pixel
• 2-D image: Height, Width, Bits per Pixel, Co-ordinate system, Time
• 2-D astronomical image: Height, Width, Bits per Pixel, Astronomical co-ordinate system, Time – EPOCH, Bandpass

• General Table: Number of columns, Names of columns, Number of rows, Value in cell at any row, column
• Time series: Number of columns, Names of columns, Number of rows, Value in cell at any row, column, Time corresponding to any row
• Science data table: Number of columns, Names of columns, Number of rows, Value in cell at any row, column, Type of column value, Column “metadata”, Table “metadata”
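
The image side of this layering can be sketched as a small class hierarchy: each more specific type keeps everything the general one offers and adds properties, so a generic tool written for the general type still works on the specific one. The class names, field names and example values below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Array2D:
    height: int
    width: int
    bits_per_pixel: int


@dataclass
class Image2D(Array2D):
    coordinate_system: str
    time: str


@dataclass
class AstronomicalImage2D(Image2D):
    epoch: str
    bandpass: str


def size_in_bytes(array: Array2D) -> int:
    """Generic tool: needs only the 2-D array view of the object."""
    return array.height * array.width * array.bits_per_pixel // 8


# Hypothetical values, for illustration only.
img = AstronomicalImage2D(
    height=100, width=200, bits_per_pixel=16,
    coordinate_system="equatorial", time="2010-09-07",
    epoch="J2000", bandpass="V",
)
print(size_in_bytes(img))  # 40000
```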
[Diagram: tree with a Root node and Nodes 1 to 9 beneath it]
• Get the Root
• Get the number of children for a node
• Get child number “i”
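
The three tree operations are enough to visit every node, which is the point of the virtualisation: a tool written against them does not need to know how the tree is stored. A minimal Python sketch, with class and method names invented here and a tree shape that is hypothetical, since the diagram layout did not survive:

```python
class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])


class Tree:
    """Exposes only the three operations listed on the slide."""

    def __init__(self, root):
        self._root = root

    def get_root(self):
        return self._root

    def child_count(self, node):
        return len(node.children)

    def get_child(self, node, i):
        return node.children[i]


def walk(tree):
    """Depth-first list of node names, using only the three operations."""
    names = []

    def visit(node):
        names.append(node.name)
        for i in range(tree.child_count(node)):
            visit(tree.get_child(node, i))

    visit(tree.get_root())
    return names


# Hypothetical shape, for illustration only.
tree = Tree(Node("Root node", [
    Node("Node 1", [Node("Node 3"), Node("Node 4")]),
    Node("Node 2"),
]))
print(walk(tree))
```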
Image
• Cultural Heritage Image
• Artistic Image
• Astronomical Image
– Optical Astronomical Image
– X-ray Astronomical Image
• Earth Observation Image
Archival Information Package
• Content Information, further described by Preservation Description Information
• The package is described by a Package Description, which is derived from the package
• The package is delimited by Packaging Information, which identifies its contents

Preservation Description Information
• Fixity Information
• Provenance Information
• Reference Information
• Context Information
• Access Rights Information
[Diagram: the Archival Information Package in more detail. Surviving labels: Content Information (further described by), Package Description (described by, derived from), Packaging Information (delimited by), Data Object, Physical Object, Digital Object, Bit, Representation Information (interpreted using, adds meaning to), Structure, Reference, Other, Provenance, Context, Fixity, Access Rights]
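
As a rough illustration of how the parts of the package fit together, the sketch below models Content Information plus the five kinds of Preservation Description Information and checks the Fixity Information against the content. SHA-256 is an assumption chosen for the example (OAIS does not prescribe a particular fixity mechanism), and all class names, field names and values are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PreservationDescriptionInformation:
    fixity_sha256: str   # Fixity Information (assumed here to be a SHA-256 digest)
    provenance: str      # Provenance Information
    reference: str       # Reference Information
    context: str         # Context Information
    access_rights: str   # Access Rights Information


@dataclass
class ArchivalInformationPackage:
    content: bytes       # stands in for the Content Information's Data Object
    pdi: PreservationDescriptionInformation


def fixity_ok(aip: ArchivalInformationPackage) -> bool:
    """True if the stored digest still matches the content bits."""
    return hashlib.sha256(aip.content).hexdigest() == aip.pdi.fixity_sha256


# Hypothetical example values.
bits = b"example bits"
aip = ArchivalInformationPackage(
    content=bits,
    pdi=PreservationDescriptionInformation(
        fixity_sha256=hashlib.sha256(bits).hexdigest(),
        provenance="ingested from instrument archive",
        reference="example-id-0001",
        context="part of an example collection",
        access_rights="public",
    ),
)
print(fixity_ok(aip))

aip.content = b"example bitz"  # simulate corruption
print(fixity_ok(aip))
```

A passing fixity check shows only that the bits are unchanged. Whether they can still be understood depends on the Representation Information.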
USE DATA
• Use application to find data in Repository
• Create DIP with enough RepInfo for the user (via DC profile)
• Obtain more RepInfo from Registry if necessary
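
The three steps can be sketched as one function. The repository and registry are modelled as plain dictionaries, and every name and value below is hypothetical. This is not the CASPAR API, only an illustration of where the Designated Community (DC) profile enters.

```python
def build_dip(data_id, required_repinfo, dc_profile, repository, registry):
    """Assemble a DIP: the data plus the RepInfo the user's community lacks."""
    repinfo = {}
    for name in required_repinfo:
        if name in dc_profile:
            continue  # already part of the Designated Community's knowledge
        if name in repository["repinfo"]:
            repinfo[name] = repository["repinfo"][name]
        else:
            # Fall back to the registry; raises KeyError if nobody holds it.
            repinfo[name] = registry[name]
    return {"data": repository["data"][data_id], "repinfo": repinfo}


# Hypothetical holdings, for illustration only.
repository = {
    "data": {"gome-l2": b"..."},
    "repinfo": {"FITS STANDARD": "held locally"},
}
registry = {"FITS DICTIONARY": "held by the registry"}

dip = build_dip(
    "gome-l2",
    ["FITS STANDARD", "FITS DICTIONARY", "ENGLISH LANGUAGE"],
    dc_profile={"ENGLISH LANGUAGE"},
    repository=repository,
    registry=registry,
)
print(sorted(dip["repinfo"]))
```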
DRM
Cost sharing
Preservable infrastructure
APARSEN
Integration (1000)
• 1100: Common Vision
• 1200: Staff and experience exchange
• 1300: Common standards
• 1400: Common testing environments
• 1500: Internal W/S & symposia
• 1600: Common tools, software repository and market place
Technical (2000)
• 2100: Preservation Services
• 2200: Identifiers & citability
• 2300: Storage solutions
• 2400: Authenticity & Provenance
• 2500: Interoperability & intelligibility
• 2600: Annotation, Reputation & data quality
• 2700: Scalability
Economic/Legal (3000)
• 3100: Digital Rights & access management
• 3200: Cost/benefit data collection and modelling
• 3300: Peer Review & 3rd party Certification
• 3400: Brokerage services
• 3500: Data policies and governance
• 3600: Business cases
Spreading excellence (4000)
• 4100: External W/S & symposia
• 4200: Formal qualifications
• 4300: Training courses
• 4400: Awareness raising
• 4500: Liaison with other stakeholders
• 4600: International liaison
Management (5000)
• 5100: Financial management
• 5200: Technical co-ord.
• 5300: Evaluate impact of the Network of Excellence
Trust
• Certification of repositories
• Reputation and trustability of datasets, publications and people
• Authenticity
Sustainability
• Business cases
• Preservation
• Cost/benefit analysis
• Transfer of custody – who to hand over to and what to hand over
• Storage solutions
Usability
• Intelligibility
• Use by common tools
• Cross domain usability
• Interoperability
Access
• Identification of datasets, publications, people
• Rights and responsibilities
• Policies and governance
FUTURE
• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
• Access and use restrictions may not be respected in the future
• Loss of ability to identify the location of data
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
• The ones we trust to look after the digital holdings may let us down
Links
• CASPAR – http://www.casparpreserves.eu
• CASPAR Source code – http://sourceforge.net/projects/digitalpreserve/
• OAIS Reference Model – http://public.ccsds.org/publications/archive/650x0b1.pdf
– and the updated draft is available from http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx
• CASPAR Validation report – http://www.casparpreserves.eu/Members/cclrc/Deliverables/caspar-validation-evaluation-report/at_download/file
• PARSE.Insight – www.parse-insight.eu
• Alliance for Permanent Access – www.alliancepermanentaccess.eu
• Digital Curation Centre – www.dcc.ac.uk
END