Preservation Metadata Extraction and Collection : Tools and Techniques

Preservation Metadata Extraction and Collection : Tools and Techniques Mat Black National Library of New Zealand Te Puna Matauranga o Aotearoa


Page 1: Preservation Metadata  Extraction and Collection : Tools and Techniques

Preservation Metadata Extraction and Collection :

Tools and Techniques

Mat Black

National Library of New Zealand

Te Puna Matauranga o Aotearoa

Page 2: Preservation Metadata  Extraction and Collection : Tools and Techniques

How to get what you need to keep what you’ve got

Page 3: Preservation Metadata  Extraction and Collection : Tools and Techniques

The stack

• Fixity generation

• Virus checking

• Format identification

• Format validation

• Environmental metadata collection

• Format specific metadata extraction

Page 4: Preservation Metadata  Extraction and Collection : Tools and Techniques

Fixity

“Get it early and get it right”

• Common fixity types:

– Hashing algorithms (MD5, SHA1)

– Digital signatures

– File size?

• Use multiple fixity algorithms.

• Find out the legal implications.
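As an illustration of generating several fixity values in one pass, here is a minimal Python sketch. The function name `compute_fixity` and the shape of the returned dictionary are assumptions made for this example, not part of the original talk. It reads the file once and feeds each chunk to every requested algorithm, recording the file size alongside the digests.

```python
import hashlib


def compute_fixity(path, algorithms=("md5", "sha1"), chunk_size=1024 * 1024):
    """Return the file size and one hex digest per algorithm, reading the file once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            size += len(chunk)
            for hasher in hashers.values():
                hasher.update(chunk)
    digests = {name: hasher.hexdigest() for name, hasher in hashers.items()}
    return {"size": size, "digests": digests}
```

Any algorithm name accepted by `hashlib.new` can be passed in, so further algorithms can be added later without changing the function.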

Page 5: Preservation Metadata  Extraction and Collection : Tools and Techniques

Fixity values for what?

• File

• Bitstream

• Compound (all the files in an object)

• Metadata

• The whole lot (files, filename & metadata)
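One possible way to produce a compound value is to hash a sorted manifest of per-file digests and file names. The sketch below is an assumption about how this could be done, not the method used by any particular repository; `compound_fixity` and its input format (a mapping of file name to bytes) are hypothetical.

```python
import hashlib


def compound_fixity(files, algorithm="sha1"):
    """files: mapping of file name -> bytes. Returns one digest for the whole object."""
    outer = hashlib.new(algorithm)
    # Sort by name so the result does not depend on the order files were supplied in.
    for name in sorted(files):
        inner = hashlib.new(algorithm, files[name]).hexdigest()
        # Include the name so that renaming a file changes the compound value.
        outer.update(f"{inner}  {name}\n".encode("utf-8"))
    return outer.hexdigest()
```

Because the file name is part of each manifest line, this variant covers files and filenames together; metadata could be added as a further entry in the mapping.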

Page 6: Preservation Metadata  Extraction and Collection : Tools and Techniques

Virus checking

• Virus check datetime

• Results, including false positives and any warnings (Word macros etc.)

• The virus checker name and version

• The virus pattern file name and version

• The virus engine name and version
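The items above could be captured in a single record per scan. The following is a sketch only: the class name `VirusCheckEvent` and every field name are hypothetical, chosen to mirror the bullets, and do not come from any specific metadata schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class VirusCheckEvent:
    """One virus check, with enough detail to interpret the result later."""
    checked_at: datetime
    outcome: str  # hypothetical vocabulary, e.g. "clean", "infected", "warning"
    scanner_name: str
    scanner_version: str
    pattern_file_name: str
    pattern_file_version: str
    engine_name: str
    engine_version: str
    warnings: list = field(default_factory=list)  # e.g. macro warnings
    false_positive: bool = False


# Placeholder values for illustration only; they do not describe a real product.
event = VirusCheckEvent(
    checked_at=datetime.now(timezone.utc),
    outcome="warning",
    scanner_name="ExampleScanner",
    scanner_version="1.0",
    pattern_file_name="example.dat",
    pattern_file_version="2024-01-01",
    engine_name="ExampleEngine",
    engine_version="1.0",
    warnings=["document contains macros"],
)
record = asdict(event)  # plain dict, ready to serialise and store
```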

Page 7: Preservation Metadata  Extraction and Collection : Tools and Techniques

Format identification

File / Bitstream / Complex

• Methods of format identification

– File name or extension

– File type/creator codes (old Macs)

– Magic numbers

– Brute force file parsing (try each candidate parser in turn and catch the failures)

http://en.wikipedia.org/wiki/File_format
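To show how magic numbers differ from extension-based identification, here is a small Python sketch. The signature table is a tiny hand-picked sample of well-known leading byte sequences, and `identify_by_magic` is a hypothetical helper; a real tool would use a maintained signature registry rather than a list like this.

```python
# A few well-known leading byte signatures (a sample, not a complete registry).
MAGIC_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"RIFF", "RIFF container"),
]


def identify_by_magic(data):
    """Return a format label for the leading bytes, or None if nothing matches."""
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    return None


# A file named "photo.jpg" whose content starts with RIFF is not a JPEG,
# whatever its extension claims.
claimed_extension = "jpg"
detected = identify_by_magic(b"RIFF\x24\x00\x00\x00WAVEfmt ")
```

Note that a container signature such as RIFF or ZIP only identifies the wrapper; what is inside still has to be identified separately.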

Page 8: Preservation Metadata  Extraction and Collection : Tools and Techniques

A sound file opened in an image viewer

Page 9: Preservation Metadata  Extraction and Collection : Tools and Techniques

What file format is this?

And the winner is…

Page 10: Preservation Metadata  Extraction and Collection : Tools and Techniques

Subzero by Pain Receptor

They describe their music as sounding like

“falling down the stairs carrying leeches and bottles”

http://www.myspace.com/painreceptor

Page 11: Preservation Metadata  Extraction and Collection : Tools and Techniques

Sub format identification

• Embedded bitstreams

– XML Base64 encoded octet streams

– Microsoft Structured Storage

• Archives

– ZIP, TAR, ARC

• Encapsulation/Container formats

– OGG, AVI, MIME

• Codecs

– DV, DivX, Indeo, Cinepak, MS MPEG-4
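For archives, sub-format identification means looking at each member rather than only the outer file. The sketch below uses Python's standard `zipfile` module to collect the leading bytes of every member of a ZIP archive, which could then be passed to whatever identification step is in use. The helper name `member_headers` is an assumption for this example.

```python
import io
import zipfile


def member_headers(archive_bytes, header_length=8):
    """Return {member name: first bytes} for each file inside a ZIP archive."""
    headers = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories have no content to identify
            with archive.open(info) as member:
                headers[info.filename] = member.read(header_length)
    return headers
```

Members that are themselves archives or containers would need the same treatment applied again, one level down.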

Page 12: Preservation Metadata  Extraction and Collection : Tools and Techniques

Available tools

• File extensions (google it)

• Magic utilities (google it)

• JHOVE http://hul.harvard.edu/jhove/

• DROID http://www.nationalarchives.gov.uk/pronom/

• Build your own! (Java, Perl, C#, C++)

– If you have a fixed format list

– If you use a proprietary format

Page 13: Preservation Metadata  Extraction and Collection : Tools and Techniques

Format validation

• Types of validation

– Pattern comparison

– Parsing

– Rendering

Page 14: Preservation Metadata  Extraction and Collection : Tools and Techniques

Available tools

• JHOVE

• NLNZ Extract tool (sort of)

• The application used to create the file

• Anything that opens a file and can throw an error.

– Parsing tools

• E.g. XML parsers, XML Schema, Perl modules, Java classes.

– Rendering tools

• E.g. LibTIFF, ImageMagick, Microsoft Office (wrapped), OpenOffice, Perl modules, Java classes, etc.
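"Anything that opens a file and can throw an error" can be sketched with an XML parser. The example below is a minimal illustration using Python's standard `xml.etree.ElementTree`; the function name `validate_xml` is hypothetical. It checks well-formedness only, which is weaker than validating against a schema.

```python
import xml.etree.ElementTree as ET


def validate_xml(data):
    """Return (True, None) if data parses as XML, else (False, error message)."""
    try:
        ET.fromstring(data)
    except ET.ParseError as error:
        # Keep the parser's message: it is useful validation metadata in itself.
        return False, str(error)
    return True, None
```

The same try-and-catch shape applies when wrapping a rendering tool: attempt the operation, and record both the outcome and any error text.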

Page 15: Preservation Metadata  Extraction and Collection : Tools and Techniques

Things to keep in mind

• Test it till it breaks.

• Define your requirements, break them, then define them again. (Repeat if required.)

• Not all tools are created equal.

• Not all tools obey the rules.

• Some rules are made to be broken.

Page 16: Preservation Metadata  Extraction and Collection : Tools and Techniques

Environmental metadata

• Consider the native environment of your content.

• Is there metadata that you need that only exists in a digital object's native environment?

• Structure and relationships.

• File system attributes
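File system attributes are one kind of environmental metadata that is easy to lose once content is copied elsewhere. Here is a minimal sketch of capturing some of them with Python's standard library before the file leaves its original location. The function name and the choice of fields are assumptions; which attributes matter depends on the source environment.

```python
import os
import stat
from datetime import datetime, timezone


def file_system_attributes(path):
    """Collect basic file system attributes while the file is still in place."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "original_path": os.path.abspath(path),
        "size_bytes": info.st_size,
        "modified": modified.isoformat(),
        "permissions": stat.filemode(info.st_mode),  # e.g. "-rw-r--r--"
        "read_only": not os.access(path, os.W_OK),
    }
```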

Page 17: Preservation Metadata  Extraction and Collection : Tools and Techniques

Format specific metadata extraction

aka format characterisation

• Available metadata will vary depending on the format.

– You will probably need format specific schemas.

• The types of metadata that can be extracted:

– Preservation

– Descriptive

– Structural

– Administrative

– Rights

– Technical
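As one example of format specific technical metadata, the sketch below reads the header of a WAV file with Python's standard `wave` module. The function name and the output field names are assumptions made for illustration; they are not taken from any of the tools named in this talk, and other formats would expose entirely different fields.

```python
import io
import wave


def wav_technical_metadata(data):
    """Extract basic technical metadata from the bytes of a PCM WAV file."""
    with wave.open(io.BytesIO(data), "rb") as audio:
        frames = audio.getnframes()
        rate = audio.getframerate()
        return {
            "channels": audio.getnchannels(),
            "sample_width_bytes": audio.getsampwidth(),
            "sample_rate_hz": rate,
            "frame_count": frames,
            "duration_seconds": frames / rate if rate else 0.0,
        }
```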

Page 18: Preservation Metadata  Extraction and Collection : Tools and Techniques

The big question….

“Why would I extract the metadata now and store it in a database if I can just come back and extract it again later when I need it?”

Page 19: Preservation Metadata  Extraction and Collection : Tools and Techniques

Available tools

• NLNZ Metadata Extract Tool

– http://www.natlib.govt.nz/en/whatsnew/4initiatives.html#extraction

• JHOVE

– http://hul.harvard.edu/jhove/

• Anything you can wrap

– LibTIFF, ImageMagick, Perl modules, Java classes, etc.

• Build your own!

– And make sure you open source it

Page 20: Preservation Metadata  Extraction and Collection : Tools and Techniques

What tools should I use?

• Use as many tools as you need to.

• Keep the workflow configurable

– Preferably by content or format type.

– Allow for multiple tools to be used.

– Allow for new tools to be added later.

• Compare metadata from multiple tools.
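A configurable, multi-tool workflow might look something like the following sketch. Everything here is hypothetical: the registry is a plain dictionary mapping a format name to a list of (tool name, callable) pairs, so tools can be added later by editing configuration rather than code, and `compare_results` reports only the fields on which the tools disagree.

```python
def run_extractors(data, format_name, registry):
    """Run every tool registered for a format; return results keyed by tool name."""
    return {name: tool(data) for name, tool in registry.get(format_name, [])}


def compare_results(results):
    """Return {field: {tool: value}} for each field where the tools disagree."""
    fields = set()
    for metadata in results.values():
        fields.update(metadata)
    conflicts = {}
    for field_name in sorted(fields):
        values = {tool: metadata.get(field_name) for tool, metadata in results.items()}
        # Compare by repr so that unhashable values (lists, dicts) are handled too.
        if len({repr(value) for value in values.values()}) > 1:
            conflicts[field_name] = values
    return conflicts


# Stand-in tools for illustration; real entries would wrap actual extractors.
registry = {
    "wav": [
        ("tool_a", lambda data: {"channels": 2, "sample_rate_hz": 44100}),
        ("tool_b", lambda data: {"channels": 2, "sample_rate_hz": 48000}),
    ],
}
results = run_extractors(b"", "wav", registry)
conflicts = compare_results(results)
```

A field reported by only one tool also shows up as a conflict (the other tool's value is `None`), which is usually worth a look in its own right.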

Page 21: Preservation Metadata  Extraction and Collection : Tools and Techniques

The workflow

1. Fixity generation

2. Virus checking

3. Format identification

4. Format validation

5. Environmental metadata extraction

6. Format specific metadata extraction

7. Store in repository

Page 22: Preservation Metadata  Extraction and Collection : Tools and Techniques

Paranoid workflow

1. Fixity generation

2. Virus checking

3. Fixity check

4. Format identification

5. Fixity check

6. Format validation

7. Fixity check

8. Environmental metadata extraction

9. Fixity check

10. Format specific metadata extraction

11. Fixity check

12. Virus check

13. Store in repository

14. Fixity check

15. Virus check

16. Fixity check
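The pattern of re-checking fixity after every step can be sketched as a small driver loop. This is an illustrative assumption about how such a workflow could be wired together, not a description of any existing system; `paranoid_run`, `file_digest` and `FixityError` are hypothetical names, and each step is any callable that takes a file path.

```python
import hashlib


class FixityError(Exception):
    """Raised when a file's digest no longer matches the value taken at the start."""


def file_digest(path, algorithm="sha1"):
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def paranoid_run(path, steps, algorithm="sha1"):
    """Run (name, step) pairs in order, checking fixity after each one."""
    expected = file_digest(path, algorithm)  # fixity generation comes first
    results = {}
    for name, step in steps:
        results[name] = step(path)
        if file_digest(path, algorithm) != expected:
            raise FixityError(f"content changed during step: {name}")
    return expected, results
```

The cost is one extra read of the file per step, which is the trade-off the "paranoid" label acknowledges.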

Page 23: Preservation Metadata  Extraction and Collection : Tools and Techniques

Paranoid access flow.

• Retrieve content from repository

• Fixity check

• Virus check

• Send content to consumer

Page 24: Preservation Metadata  Extraction and Collection : Tools and Techniques

Global Digital Format Registry

• Format identification components

• Format validation components

• Metadata extraction components

• Format registry

• At-risk content alerts

• http://hul.harvard.edu/gdfr/

Page 25: Preservation Metadata  Extraction and Collection : Tools and Techniques

Questions?