Analysing the Impact of File Formats on Data Integrity Volker Heydegger University of Cologne...
Analysing the Impact of File Formats on Data Integrity
Volker Heydegger
University of Cologne
Archiving 2008
Bern, 23rd – 27th June 2008
Overview
Volker Heydegger | Archiving 2008 | 25th June 2008 | Bern, Switzerland
• Introduction
• File format data and information loss
  - What happens if data is corrupted in files?
  - Categories of file format data
• Measuring information loss
  - Robustness indicators
  - Study results for different file formats
Background
• EU-funded project “Planets”
  - characterisation of file format content
  - www.planets-project.eu
• University of Cologne, Computer Science for the Humanities (Historisch-Kulturwissenschaftliche Informationsverarbeitung (HKI))
  - Planets partner
  - www.hki.uni-koeln.de
Context
• Long-term preservation of digital information
• Which file format to choose?
• Criteria, e.g.:
  - Open standard
  - Spread of usage
  - Hardware/software dependencies
  - Authenticity
  - …
  - Robustness
Robustness ::= error resilience of file formats against bit-stream corruption
Issues / Research topics
• Is there any correlation between file format and data integrity?
• If so, are there any differences among file formats concerning the degree of robustness?
• Which file-format-based factors are responsible for varying degrees of robustness?
• How can we improve the robustness of file formats?
Benefits
• Digital preservation: decision support for choosing a file format for long-term preservation
• Contribution to file format research
• Improvement of existing file formats
• Design of future file formats
File Format Data and Information Loss
What is a “file format” in our context?
• A set of rules constituting the logical organisation of data
• A set of rules indicating how to interpret data
• The set of rules is laid down in the file format specification
• File format data ::= binary data, formatted according to the rules of a file format
What happens if data is corrupted in files?
Test image: TIFF, greyscale, 32x32 pixels, 8 bits per pixel
[Figure: hex dump of the first 224 bytes of the test file; byte value shown: FF]
Plain information loss: 1 byte of data == 1 pixel
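The effect can be reproduced with a minimal Python sketch. It assumes an uncompressed 8-bit greyscale payload in which each byte stores exactly one pixel; the gradient pixel data, the offset 100 and the helper `count_changed_pixels` are hypothetical choices for illustration, not taken from the study.

```python
# Sketch: one corrupted payload byte in an uncompressed 8-bit greyscale image
# changes exactly one pixel ("plain" information loss).
WIDTH, HEIGHT = 32, 32  # test image size from the slide


def count_changed_pixels(original: bytes, corrupted: bytes) -> int:
    """Number of pixel positions whose 8-bit value differs."""
    return sum(a != b for a, b in zip(original, corrupted))


# Hypothetical pixel data: a simple gradient, one byte per pixel.
pixels = bytes((x * 8 + y) % 256 for y in range(HEIGHT) for x in range(WIDTH))

# Overwrite a single byte (offset 100 is an arbitrary choice) with 0xFF,
# falling back to 0x00 if the byte already holds 0xFF.
corrupted = bytearray(pixels)
corrupted[100] = 0xFF if pixels[100] != 0xFF else 0x00

changed = count_changed_pixels(pixels, bytes(corrupted))
print(changed, "of", WIDTH * HEIGHT, "pixels changed")
```

The damage stays local because no other byte depends on the corrupted one.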
[Figure: part of the TIFF Image File Directory; tag: PhotometricInterpretation; byte value shown: 00]
Conditional information loss: 1 bit changed == 100% of the information changed
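A hedged Python sketch of this case: in TIFF, PhotometricInterpretation 0 means WhiteIsZero and 1 means BlackIsZero. The sketch assumes the test file originally stored the value 1 and that the corruption turned it into 0; the pixel data and the helper `displayed_intensity` are hypothetical.

```python
# TIFF PhotometricInterpretation values for bilevel/greyscale images.
WHITE_IS_ZERO = 0
BLACK_IS_ZERO = 1


def displayed_intensity(sample: int, photometric: int) -> int:
    """Map a stored 8-bit sample to a display intensity (0 = black, 255 = white)."""
    if photometric == BLACK_IS_ZERO:
        return sample
    if photometric == WHITE_IS_ZERO:
        return 255 - sample
    raise ValueError("only greyscale interpretations are handled in this sketch")


# Hypothetical 32x32 payload that covers every possible 8-bit value.
pixels = bytes(range(256)) * 4

before = [displayed_intensity(p, BLACK_IS_ZERO) for p in pixels]
# Assumption: flipping the lowest bit of the tag value turns 1 into 0.
after = [displayed_intensity(p, BLACK_IS_ZERO ^ 1) for p in pixels]

changed = sum(a != b for a, b in zip(before, after))
percent_changed = 100 * changed / len(pixels)
print(percent_changed)
```

No payload byte was touched, yet every pixel is rendered differently, because 255 - v never equals v for an integer v.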
Categories of File Format Data
• Technical data (data for processing):
  - Image width: 277
  - Image length: 339
  - Compression: uncompressed
• “Payload” data (basic data of usage):
  - Pixel data, starting at byte offset 0x008
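To make the two categories concrete, the following Python sketch builds a minimal little-endian, uncompressed, single-strip greyscale TIFF in memory and splits it back into technical data (the IFD fields) and payload (the pixel bytes). The helpers `build_tiff` and `split_tiff` and the all-zero pixel data are hypothetical; the sketch only handles single-valued SHORT and LONG fields and is not a general TIFF reader.

```python
import struct

SHORT, LONG = 3, 4  # TIFF field types used in this sketch
TAG_NAMES = {256: "ImageWidth", 257: "ImageLength", 258: "BitsPerSample",
             259: "Compression", 262: "PhotometricInterpretation",
             273: "StripOffsets", 278: "RowsPerStrip", 279: "StripByteCounts"}


def build_tiff(width: int, height: int, pixels: bytes) -> bytes:
    """Build a minimal uncompressed 8-bit greyscale TIFF with one strip."""
    pixel_offset = 8  # payload directly after the 8-byte header
    pad = b"\x00" * ((pixel_offset + len(pixels)) % 2)  # IFD on a word boundary
    ifd_offset = pixel_offset + len(pixels) + len(pad)
    fields = [(256, SHORT, width), (257, SHORT, height), (258, SHORT, 8),
              (259, SHORT, 1),   # Compression: 1 = uncompressed
              (262, SHORT, 1),   # PhotometricInterpretation: 1 = BlackIsZero
              (273, LONG, pixel_offset), (278, SHORT, height),
              (279, LONG, len(pixels))]
    entries = b"".join(
        struct.pack("<HHIHH", tag, typ, 1, value, 0) if typ == SHORT
        else struct.pack("<HHII", tag, typ, 1, value)
        for tag, typ, value in fields)
    header = struct.pack("<2sHI", b"II", 42, ifd_offset)
    ifd = struct.pack("<H", len(fields)) + entries + struct.pack("<I", 0)
    return header + pixels + pad + ifd


def split_tiff(data: bytes):
    """Return (technical data as a dict, payload bytes) of such a file."""
    byte_order, magic, ifd_offset = struct.unpack_from("<2sHI", data, 0)
    if byte_order != b"II" or magic != 42:
        raise ValueError("not a little-endian TIFF")
    (count,) = struct.unpack_from("<H", data, ifd_offset)
    technical = {}
    for i in range(count):
        tag, typ, _n, raw = struct.unpack_from("<HHI4s", data, ifd_offset + 2 + 12 * i)
        (value,) = struct.unpack_from("<H" if typ == SHORT else "<I", raw)
        technical[TAG_NAMES.get(tag, tag)] = value
    start = technical["StripOffsets"]
    return technical, data[start:start + technical["StripByteCounts"]]


pixels = bytes(32 * 32)  # hypothetical all-black 32x32 image
technical, payload = split_tiff(build_tiff(32, 32, pixels))
print(technical)
print(len(payload), "payload bytes")
```

In this layout the technical data occupies only a small fraction of the file, but every byte of payload can only be interpreted through it.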
Robustness Indicators
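(1) RB = Σ Δ(b0, b1) / m
where
i. b0 is the basic data of usage (payload data) before being corrupted,
ii. b1 is the basic data of usage after being corrupted,
iii. Δ(b0, b1) is the amount of basic data of usage changed by one corruption procedure,
iv. m is the number of corruption procedures; the sum runs over all m procedures.
RB indicates the average information loss per corruption procedure.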
Example
A file X has 2000 bytes of payload data. Suppose the file is corrupted 3 times and the number of bytes changed in each corruption procedure is:
1. Δ(b0, b1) = 200 bytes
2. Δ(b0, b1) = 150 bytes
3. Δ(b0, b1) = 250 bytes
The average information loss for file X based on 3 corruption procedures is then
RB = (200 + 150 + 250) / 3 = 600 / 3 = 200
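The same arithmetic as a short Python sketch; the function name `robustness_rb` is a hypothetical label for the indicator RB from formula (1).

```python
def robustness_rb(deltas):
    """Average amount of changed payload data over all corruption procedures."""
    if not deltas:
        raise ValueError("at least one corruption procedure is required")
    return sum(deltas) / len(deltas)


# Changed bytes per corruption procedure, taken from the example for file X.
rb = robustness_rb([200, 150, 250])
print(rb)  # 200.0
```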
RB in relation to the total amount of payload data:
(2) RBt = RB / n
where n is the total amount of basic data of usage (payload data).
(3) RBt = RB / n * 100 (RBt expressed as a percentage)
Interpretation: RBt = 0 %: maximum robustness (minimum information loss)
Example (continued)
(2) RBt= 200 / 2000 = 0.1
(3) RBt= 200 / 2000 * 100 = 10 (%)
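Formulas (2) and (3) can be sketched in Python as follows; the function names are hypothetical, and the inputs are the values from the running example (RB = 200, n = 2000).

```python
def rbt_fraction(rb: float, n: int) -> float:
    """Formula (2): average information loss relative to the payload size n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return rb / n


def rbt_percent(rb: float, n: int) -> float:
    """Formula (3): the same ratio expressed as a percentage."""
    if n <= 0:
        raise ValueError("n must be positive")
    return 100 * rb / n


fraction = rbt_fraction(200, 2000)
percent = rbt_percent(200, 2000)
print(fraction, percent)
```

A value of 0 % would mean that no payload data changed in any corruption procedure.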
Study on Robustness for various file formats: Example Results
• TIF
  - uncompressed
  - LZW
  - JPEG (2 different compression levels)
  - ZIP
• PNG (filtered, unfiltered)
• JPEG2000 (lossless, lossy)
• BMP (uncompressed)
Method
- simulation of file corruption: every file is corrupted up to 3000 times (3000 corruption procedures)
- applying 3-5 different corruption ratios: less than 0.01%, 0.01%, 0.1%, 1.0%, more than 1.0%
- compressed payload data is decompressed
- the original payload data is compared with the corrupted payload data
- the robustness indicator values are computed
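The method can be sketched in Python under several assumptions that the slides do not state: corruption overwrites randomly chosen bytes with different values, the whole stored byte stream is the corruption target, raw deflate from `zlib` stands in for a compressed format, and a stream that cannot be decoded counts as a total loss. All function names and the gradient payload are hypothetical, and the run count is reduced to 200 to keep the example fast.

```python
import random
import zlib


def corrupt(data: bytes, ratio: float, rng: random.Random) -> bytes:
    """Overwrite a share `ratio` of the bytes (at least one) with different values."""
    k = max(1, round(len(data) * ratio))
    out = bytearray(data)
    for pos in rng.sample(range(len(data)), k):
        out[pos] ^= rng.randrange(1, 256)  # XOR with a non-zero value changes the byte
    return bytes(out)


def delta(b0: bytes, b1: bytes) -> int:
    """Count payload bytes that differ; bytes missing from b1 count as changed."""
    return len(b0) - sum(x == y for x, y in zip(b0, b1))


def encode_deflate(payload: bytes) -> bytes:
    """Compress to a raw deflate stream (no header, no checksum)."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    return compressor.compress(payload) + compressor.flush()


def decode_deflate(stored: bytes, n: int) -> bytes:
    """Decode at most n bytes; assumption: an undecodable stream yields nothing."""
    try:
        return zlib.decompressobj(-15).decompress(stored, n)
    except zlib.error:
        return b""


def simulate_rbt(payload, encode, decode, ratio, m, seed=0):
    """RBt in percent after m corruption procedures at the given ratio."""
    rng = random.Random(seed)
    stored = encode(payload)
    total = sum(delta(payload, decode(corrupt(stored, ratio, rng))) for _ in range(m))
    return 100 * (total / m) / len(payload)


payload = bytes((x + y) % 256 for y in range(64) for x in range(64))
rbt_raw = simulate_rbt(payload, lambda p: p, lambda s: s, 0.01, m=200)
rbt_deflate = simulate_rbt(payload, encode_deflate,
                           lambda s: decode_deflate(s, len(payload)), 0.01, m=200)
print(f"uncompressed: {rbt_raw:.2f} %  deflate: {rbt_deflate:.2f} %")
```

For the uncompressed payload the loss equals the number of overwritten bytes, while for the compressed stream a single overwritten byte may affect much of the decoded payload; the exact figures depend on the assumptions above and are not results of the study.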
Example: JP2-formatted image, corruption of 1 byte, “bad case”
Example: JP2-formatted image, corruption of 1 byte, “good case”
Example: JP2-formatted image, corruption of 1 byte, “good case” with visualised differences in pixel data
Thank you very much!