Quality assurance for document image collections in digital preservation


description

Reinhold Huber-Mörk, AIT Austrian Institute of Technology, gave a presentation on ‘Quality assurance for document image collections in digital preservation’ at the Acivs conference in Brno, Czech Republic in September 2012. Acivs is short for Advanced Concepts for Intelligent Vision Systems and focuses on techniques for building adaptive, intelligent, safe and secure imaging systems.

Transcript of Quality assurance for document image collections in digital preservation

Page 1: Quality assurance for document image collections in digital preservation

Quality assurance for document image collections in digital preservation

Reinhold Huber-Mörk (1) & Alexander Schindler (1, 2)

(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology

(2) Department of Software Technology and Interactive Systems, Vienna University of Technology

Page 2

Overview

Digital preservation

Quality assurance in digital image preservation workflows

Keypoint based approach for image content comparison

Spatially distinctive keypoints

Document image preprocessing

Structural similarity assessment

Results on real-world data sets

2 12.09.2013

Page 3

Digital preservation

"Set of processes, activities and management of digital information over time to ensure its long term accessibility" (Source: Wikipedia)

Physical damage of digital/digitized content, e.g. "bit rot" related to some storage media

Digital obsolescence of hardware/software, e.g. vanishing file formats

Content modification in preservation, e.g. error injection during file format conversion, digital manipulation, reacquisition, ...

Images provided by historical newspaper collection / The British Library

Page 4

Quality assurance in digital image preservation workflows

Automated preservation workflows are common in large digitization projects (e.g. museum collections, Google books PPPs, ...).

Automated quality assurance is needed to ensure file format consistency, detect duplicates, and verify that quality and content are preserved.

SCAPE FP7

Page 5

Keypoint based approach for document comparison

Local features are detected and described by the standard LoG/SIFT approach

Scaling, rotation, cropping and additional/missing content are handled

An affine transformation is sufficient (usually no perspective distortion, bending, etc.)
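To illustrate why an affine model is convenient here: it has only six parameters, so three non-collinear point correspondences determine it exactly. The following is a minimal sketch in plain Python, not the authors' implementation. The names `solve_affine` and `apply_affine` are hypothetical, and a real pipeline would estimate the transformation from many keypoint matches rather than exactly three.

```python
def solve_affine(src, dst):
    """Fit x' = a*x + b*y + tx and y' = c*x + d*y + ty to three point pairs.

    src, dst: sequences of three (x, y) tuples.
    Returns (a, b, tx, c, d, ty). Hypothetical helper, for illustration only.
    """
    (x1, y1), (x2, y2), (x3, y3) = src
    # Determinant of the system matrix [[x1, y1, 1], [x2, y2, 1], [x3, y3, 1]].
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear")

    def solve_row(v1, v2, v3):
        # Cramer's rule: solve p*x_i + q*y_i + r = v_i for i = 1, 2, 3.
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2)
             - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r

    a, b, tx = solve_row(*(pt[0] for pt in dst))
    c, d, ty = solve_row(*(pt[1] for pt in dst))
    return a, b, tx, c, d, ty


def apply_affine(params, point):
    """Map a single (x, y) point through the affine parameters."""
    a, b, tx, c, d, ty = params
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)
```

For example, `solve_affine([(0, 0), (1, 0), (0, 1)], [(3, 4), (5, 4), (3, 6)])` recovers a uniform scaling by 2 combined with a translation by (3, 4).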

Images provided by historical newspaper collection / The British Library

Page 6

Spatially distinctive keypoints (SDKs) (1)

High-resolution document scans contain a large number of keypoints (e.g. ~50,000 keypoints on ~5000x3000 pixel images)

Matching of descriptors results in high computational complexity

Changing the detector edge/peak thresholds often results in a spatially uneven distribution of keypoints

One solution is dense/regular spatial sampling of keypoints

Another solution is adaptive non-maximal suppression (Brown et al., 2005)

Our solution is to enforce keypoint selection at positions locally adjacent to spatially uniformly distributed interest regions

Page 7

Spatially distinctive keypoints (SDKs) (2)

Interest regions are distributed over the image using a regular grid

Keypoints with highest saliency are selected from each interest region (Harris & Stephens corner strength is used as saliency measure)
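The grid-based selection described on this slide can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes each interest region is one cell of a regular grid, that keypoints are given as `(x, y, strength)` tuples with a precomputed corner strength, and that the hypothetical parameter `per_region` controls how many keypoints are kept per cell.

```python
from collections import defaultdict


def select_sdks(keypoints, width, height, rows, cols, per_region=1):
    """Keep the strongest keypoints in each cell of a rows x cols grid.

    keypoints: iterable of (x, y, strength) tuples, where strength stands in
    for a saliency measure such as corner strength (assumed precomputed).
    Returns the selected keypoints ordered by grid cell.
    """
    cells = defaultdict(list)
    for x, y, strength in keypoints:
        # Map the position to a grid cell; clamp points on the far border.
        col = min(int(x * cols / width), cols - 1)
        row = min(int(y * rows / height), rows - 1)
        cells[(row, col)].append((x, y, strength))

    selected = []
    for cell in sorted(cells):
        ranked = sorted(cells[cell], key=lambda kp: kp[2], reverse=True)
        selected.extend(ranked[:per_region])
    return selected
```

Because at most `per_region` keypoints survive per cell, the number of descriptors to match is bounded by `rows * cols * per_region` regardless of image resolution, and the survivors are spread over the whole page.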

Images provided by International Dunhuang Project / The British Library

Page 8

Evaluation SDK (1)

Mean SSIM vs. number of SDKs (evaluated on 1560 Dunhuang image pairs)

Figure panels: #SDKs = 64, 256, 512, 1024, 2048, and all keypoints

Page 9

Evaluation SDK (2)

Dependency of mean SSIM on the number of SDKs (evaluated on 1560 Dunhuang image pairs)

Page 10

Robust symmetric matching

RANSAC constrained by affine transformation

Only significant matches are accepted, based on the distance ratio of the best and second-best match

One-to-one matching of descriptors is enforced; ambiguous matches are ignored
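The ratio test and the one-to-one constraint can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' implementation: descriptors are plain tuples of floats compared by Euclidean distance, the function names are hypothetical, and the ratio threshold of 0.8 is an assumed value since the slide does not state one. The RANSAC step is not shown.

```python
import math


def ratio_matches(desc_a, desc_b, max_ratio=0.8):
    """Match each descriptor in desc_a to its nearest neighbour in desc_b.

    A match is kept only if the best distance is clearly smaller than the
    second-best distance (ratio test). max_ratio=0.8 is an assumed value.
    Returns a dict: index in desc_a -> index in desc_b.
    """
    matches = {}
    for i, da in enumerate(desc_a):
        ranked = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(ranked) >= 2 and ranked[0][0] < max_ratio * ranked[1][0]:
            matches[i] = ranked[0][1]
    return matches


def symmetric_matches(desc_a, desc_b, max_ratio=0.8):
    """Keep only matches that agree in both directions (one-to-one)."""
    forward = ratio_matches(desc_a, desc_b, max_ratio)
    backward = ratio_matches(desc_b, desc_a, max_ratio)
    return sorted((i, j) for i, j in forward.items() if backward.get(j) == i)
```

A descriptor that lies roughly halfway between two candidates fails the ratio test, and a candidate claimed from only one direction fails the symmetry check, so both kinds of ambiguous match are dropped before any geometric verification.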

Images provided by International Dunhuang Project / The British Library

Page 11

Image preprocessing (1)

Content in (historical) book collections is characterized by a mixture of text, graphical art, empty pages & other artefacts

E.g. consider a sample from the Dunhuang manuscripts

Images provided by International Dunhuang Project / The British Library

Page 12

Image preprocessing (2)

Locally adaptive histogram equalization to enhance paper structure while preserving text structure

Contrast limited adaptive histogram equalization (CLAHE, Pizer et al., 1987), where grid/tile spacing ~ character size (e.g. 40x50 pixels)
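The "contrast limited" part of CLAHE can be illustrated for a single tile. The sketch below is a simplification under stated assumptions, not the authors' code: it builds the gray-level lookup table for one tile from a clipped histogram with uniform redistribution of the excess, and it omits the interpolation between neighbouring tiles that full CLAHE performs. The name `clipped_equalization_lut` and the `clip_limit` value in the example are hypothetical.

```python
def clipped_equalization_lut(tile, clip_limit, levels=256):
    """Build a gray-level lookup table for one tile from a clipped histogram.

    tile: flat iterable of integer gray values in [0, levels).
    clip_limit: maximum count allowed per histogram bin (assumed parameter).
    Full CLAHE would additionally interpolate between the lookup tables of
    neighbouring tiles; that step is omitted here.
    """
    hist = [0] * levels
    for value in tile:
        hist[value] += 1

    # Clip each bin and collect the excess counts.
    excess = 0
    for g in range(levels):
        if hist[g] > clip_limit:
            excess += hist[g] - clip_limit
            hist[g] = clip_limit

    # Redistribute the excess uniformly over all bins.
    share = excess / levels
    hist = [h + share for h in hist]

    # The cumulative distribution of the clipped histogram is the mapping.
    total = sum(hist)
    lut, cumulative = [], 0.0
    for h in hist:
        cumulative += h
        lut.append(round((levels - 1) * cumulative / total))
    return lut
```

For a nearly uniform tile such as blank paper, plain equalization would stretch a single gray level across the full range; clipping limits the slope of the mapping, so paper texture is enhanced without amplifying noise excessively. With a tile size close to the character size (the slide suggests about 40x50 pixels), each tile covers roughly one glyph plus its background.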

Images provided by International Dunhuang Project / The British Library

Page 13

Image preprocessing (3)

Figure panels: tile centers, original, global histogram equalization, CLAHE

Images provided by International Dunhuang Project / The British Library

Page 14

Structural similarity (1)

MSE, PSNR, etc. are not well suited for content comparison -> perceptual image quality assessment

Non-blind/full-reference image quality assessment

The mean structural similarity index (SSIM, Wang et al., 2004) compares two images based on luminance, contrast and structure terms.

Mean SSIM is evaluated for the overlapping region of image pairs -> registration

To lower the influence of misregistration, the local minimum of the mean SSIM between the images in the pair is evaluated
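For reference, the SSIM of two aligned patches can be computed as in the sketch below. It uses the common simplified form of the index with the constants K1 = 0.01 and K2 = 0.03 from Wang et al.; the unweighted window statistics and the function names are simplifying assumptions (the original formulation uses Gaussian-weighted sliding windows), so this is an illustration rather than the evaluation code used for the results.

```python
def ssim(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM of two equally sized patches given as flat sequences of gray values.

    Combines luminance, contrast and structure terms in the usual simplified
    form. Window statistics are unweighted here (an assumption).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("patches must have equal length of at least 2")
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / (n - 1)
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / (n - 1)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def mean_ssim(patch_pairs):
    """Average SSIM over a list of (patch_x, patch_y) pairs."""
    values = [ssim(px, py) for px, py in patch_pairs]
    return sum(values) / len(values)
```

Identical patches score 1, while a patch compared with its contrast-inverted counterpart scores below 0 because the covariance term becomes negative.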

Page 15

Structural similarity (2)

Registered and overlaid images (low SSIM: black, high SSIM: white)

Images provided by International Dunhuang Project / The British Library

Page 16

Results - International Dunhuang Project data (1)

Pairs not matching: mean SSIM = 0, 8 pairs

Pairs with low structural similarity: mean SSIM < 0.67, 78 pairs

Pairs with high structural similarity: mean SSIM > 0.67 (p=5 quantile), 1482 pairs

Total number: 1560 pairs
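Note that 78 of 1560 pairs is exactly 5%, which is consistent with reading "p=5 quantile" as the 5th percentile of the mean SSIM scores; that reading is an assumption. Under it, a threshold such as 0.67 could be derived from the data as in the following sketch, where `flag_lowest_percent` is a hypothetical helper and not part of the presented system.

```python
def flag_lowest_percent(mean_ssim_scores, percent=5):
    """Flag the lowest-scoring `percent` of image pairs for human inspection.

    Returns (threshold, flagged_indices). The threshold is the smallest score
    that is NOT flagged, or None if every pair is flagged.
    Assumes "p=5 quantile" means the 5th percentile of the scores.
    """
    n = len(mean_ssim_scores)
    count = (n * percent) // 100  # integer arithmetic avoids float rounding
    order = sorted(range(n), key=lambda i: mean_ssim_scores[i])
    flagged = sorted(order[:count])
    threshold = mean_ssim_scores[order[count]] if count < n else None
    return threshold, flagged
```

Choosing the threshold as a quantile fixes the share of pairs sent to manual inspection in advance, which makes the human workload predictable for large collections.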

Page 17

Results - International Dunhuang Project data (2)

Pairs with high mean SSIM are not subject to human verification

Images provided by International Dunhuang Project / The British Library

Page 18

Results - International Dunhuang Project data (3)

Pairs with medium mean SSIM are possibly subject to human verification

Images provided by International Dunhuang Project / The British Library

Page 19

Results - International Dunhuang Project data (4)

Pairs with low mean SSIM are subject to human verification

Images provided by International Dunhuang Project / The British Library

Page 20

Results - Google books redownload workflow (1)

Rate = 1: content is identical

Nr.  Book/barcode name               #Pairs  Rate of matches
1    +Z13641740X_31525197396364410   546     0.9982
2    +Z13722110X_31525197396362478   18      1.0000
3    +Z136400800_31525197396361993   269     0.9888
4    +Z136408008_31525197396361942   291     0.9897
5    +Z136408409_31525197396362038   1       1.0000
6    +Z136409104_31525197396361681   182     0.9670
7    +Z136411408_31525197396362266   219     0.9954
8    +Z136415001_31525197396363522   360     0.9861
9    +Z136419900_31525197396360634   219     0.9954
10   +Z136428500_31525197396360351   249     0.9799
11   +Z136436004_31525197396361129   273     0.9853
12   +Z137116108_31525197396265632   589     0.9949
13   +Z137117708_31525197396287838   651     0.9969
14   +Z137118403_31525197396265776   505     0.9822
15   +Z137120100_31525197396265914   1231    0.9992
16   +Z137150402_31525197396389590   2       1.0000
17   +Z137219001_31525197396361518   664     0.9774
18   +Z150800609_31525197396361025   212     0.9858
19   +Z152471307_31525197396311214   443     0.9910
20   +Z152472403_31525197396313828   460     0.9913
21   +Z152472701_31525197396315698   859     0.9953

Page 21

Results - Google books redownload workflow (2)

Figure panels: pairs not matching; pairs with low similarity (or low overlap); pairs with high similarity

Images provided by Google books collection / Austrian National Library

Page 22

Conclusion and outlook

Keypoint based approach for quality assurance in digital book preservation

Combination of keypoints approach with perceptual similarity evaluation

Recently: combination with bag of keypoints approach for duplicate detection and collection comparison

Currently: evaluation at Austrian National Library (Google books collection) and British Library (historical newspaper collection)

Future: integration on SCAPE platform for scalable distributed computing
