Quality assurance for document image collections in digital preservation
Reinhold Huber-Mörk (1) & Alexander Schindler (1,2)
(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
(2) Department of Software Technology and Interactive Systems, Vienna University of Technology
Overview
Digital preservation
Quality assurance in digital image preservation workflows
Keypoint based approach for image content comparison
Spatially distinctive keypoints
Document image preprocessing
Structural similarity assessment
Results on real-world data sets
2 12.09.2013
Digital preservation
"Set of processes, activities and management of digital information over time to ensure its long term accessibility" (Source: Wikipedia)
Physical damage of digital/digitized content, e.g. "bit rot" related to some storage media
Digital obsolescence of hardware/software, e.g. vanishing file formats
Content modification in preservation, e.g. error injection during file format conversion, digital manipulation, reacquisition, ...
Images provided by historical newspaper collection / The British Library
Quality assurance in digital image preservation workflows
Automated preservation workflows are common in large digitization projects (e.g. museum collections, Google Books PPPs, ...).
Automated quality assurance ensures file format consistency, detects duplicates, and checks that quality and content are preserved.
SCAPE FP7
Keypoint based approach for document comparison
Local features are detected & described using the standard LoG/SIFT approach
Scaling, rotation, cropping and additional/missing content are handled
An affine transformation is sufficient (usually no perspective distortion, bending, etc.)
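As an illustration of why an affine model covers scaling, rotation and translation, the following minimal sketch fits the six affine parameters exactly to three point correspondences. The function names `solve_affine` and `apply_affine` and the sample coordinates are hypothetical and not taken from the slides; in practice the correspondences would come from matched keypoints.

```python
def solve_affine(src, dst):
    """Fit x' = a*x + b*y + tx, y' = c*x + d*y + ty to three point pairs."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source points are (nearly) collinear")

    def solve(v1, v2, v3):
        # Cramer's rule for the 3x3 system [x_i, y_i, 1] . [p, q, r] = v_i
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r

    a, b, tx = solve(*(pt[0] for pt in dst))  # row producing x'
    c, d, ty = solve(*(pt[1] for pt in dst))  # row producing y'
    return a, b, tx, c, d, ty


def apply_affine(params, point):
    """Map a single (x, y) point with the fitted parameters."""
    a, b, tx, c, d, ty = params
    x, y = point
    return a * x + b * y + tx, c * x + d * y + ty


# Hypothetical example: scale by 2, rotate by 90 degrees, translate by (10, 5).
src = [(0, 0), (1, 0), (0, 1)]
dst = [(10, 5), (10, 7), (8, 5)]
params = solve_affine(src, dst)
```

Three non-collinear correspondences determine the transformation exactly; with more (noisy) matches a least-squares or RANSAC fit would be used instead.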
Spatially distinctive keypoints (SDKs) (1)
High-resolution document scans contain a large number of keypoints (e.g. ~50,000 keypoints on ~5000x3000 pixel images)
Matching this many descriptors results in high computational complexity
Changing the detector edge/peak thresholds often results in a spatially uneven distribution of keypoints
One solution is dense/regular spatial sampling of keypoints
Another solution is adaptive non-maximal suppression (Brown et al., 2005)
Our solution is to enforce keypoint selection at positions locally adjacent to spatially uniformly distributed interest regions
Spatially distinctive keypoints (SDKs) (2)
Interest regions are distributed over the image using a regular grid
The keypoints with the highest saliency are selected from each interest region (Harris & Stephens corner strength is used as the saliency measure)
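The grid-based selection can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes each interest region is simply one cell of the regular grid, that a saliency value (e.g. a corner strength) is already attached to every keypoint, and the name `select_sdks` and its parameters are hypothetical.

```python
def select_sdks(keypoints, width, height, grid_cols, grid_rows, per_cell=1):
    """Keep at most `per_cell` most salient keypoints in every grid cell.

    keypoints: list of (x, y, saliency) tuples with 0 <= x < width, 0 <= y < height.
    """
    cells = {}
    for kp in keypoints:
        x, y, _ = kp
        # Index of the grid cell that contains the keypoint
        col = min(int(x * grid_cols / width), grid_cols - 1)
        row = min(int(y * grid_rows / height), grid_rows - 1)
        cells.setdefault((row, col), []).append(kp)

    selected = []
    for cell in sorted(cells):  # row-major order, for a reproducible result
        strongest = sorted(cells[cell], key=lambda k: k[2], reverse=True)
        selected.extend(strongest[:per_cell])
    return selected


# Hypothetical keypoints on a 200x200 image, 2x2 grid
keypoints = [(10, 10, 0.5), (20, 15, 0.9), (150, 10, 0.2),
             (160, 180, 0.7), (170, 190, 0.1)]
sdks = select_sdks(keypoints, 200, 200, grid_cols=2, grid_rows=2)
```

The number of selected keypoints is bounded by `grid_cols * grid_rows * per_cell`, which caps the matching cost regardless of how many keypoints the detector returns.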
Images provided by International Dunhuang Project / The British Library
Evaluation SDK (1)
Mean SSIM vs. number of SDKs (evaluated on 1560 Dunhuang image pairs)
Panels: #SDKs=64, #SDKs=256, #SDKs=512, #SDKs=1024, #SDKs=2048, all keypoints
Evaluation SDK (2)
Dependency of mean SSIM on the number of SDKs (evaluated on 1560 Dunhuang image pairs)
Robust symmetric matching
RANSAC constrained by an affine transformation
Only significant matches are accepted, based on the distance ratio between the best and second-best match
One-to-one matching of descriptors is enforced; ambiguous matches are ignored
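The ratio test and the one-to-one constraint can be sketched as below. This is a brute-force illustration under assumptions: descriptors are plain tuples compared by Euclidean distance, the ratio threshold of 0.8 is an assumed value (the slides do not state one), and the function names are hypothetical. The RANSAC step is not shown.

```python
import math


def ratio_matches(desc_a, desc_b, max_ratio=0.8):
    """Map index in desc_a -> index in desc_b for matches passing the ratio test."""
    matches = {}
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the ratio test needs a second-best candidate
        (best, j), (second, _) = dists[0], dists[1]
        # Accept only if the best match is clearly better than the runner-up
        if best < max_ratio * second:
            matches[i] = j
    return matches


def symmetric_matches(desc_a, desc_b, max_ratio=0.8):
    """Keep (i, j) only if i -> j and j -> i both pass the ratio test."""
    a_to_b = ratio_matches(desc_a, desc_b, max_ratio)
    b_to_a = ratio_matches(desc_b, desc_a, max_ratio)
    return sorted((i, j) for i, j in a_to_b.items() if b_to_a.get(j) == i)


# Hypothetical 2-D "descriptors" (real SIFT descriptors have 128 dimensions)
desc_a = [(0, 0), (10, 10), (5, 5)]
desc_b = [(0.1, 0), (10, 10.2), (20, 20)]
pairs = symmetric_matches(desc_a, desc_b)
```

In this example the third descriptor of `desc_a` is rejected as ambiguous, and the third descriptor of `desc_b` is dropped because its match is not reciprocated, so each descriptor takes part in at most one match.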
Image preprocessing (1)
Content in (historical) book collections is characterized by a mixture of text, graphical art, empty pages & other artefacts
E.g. consider a sample from the Dunhuang manuscripts
Image preprocessing (2)
Locally adaptive histogram equalization to enhance paper structure while preserving text structure
Contrast limited adaptive histogram equalization (CLAHE, Pizer et al., 1987), where grid/tile spacing ~ character size (e.g. 40x50 pixels)
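The "contrast limited" part of CLAHE can be illustrated for a single tile. The sketch below only builds the gray-level lookup table of one tile from a clipped histogram; full CLAHE additionally interpolates between the lookup tables of neighbouring tiles, which is omitted here. The function name, the clip limit and the tiny 4-level example are assumptions made for illustration.

```python
def clipped_equalization_map(tile, clip_limit, levels=256):
    """Lookup table (old gray level -> new gray level) for one tile."""
    hist = [0] * levels
    for row in tile:
        for value in row:
            hist[value] += 1

    # Clip every bin at clip_limit and collect the excess counts
    excess = 0
    for g in range(levels):
        if hist[g] > clip_limit:
            excess += hist[g] - clip_limit
            hist[g] = clip_limit

    # Redistribute the excess evenly; leftover counts go to the lowest bins
    share, remainder = divmod(excess, levels)
    for g in range(levels):
        hist[g] += share + (1 if g < remainder else 0)

    # The cumulative histogram, scaled to the gray range, is the mapping
    total = sum(hist)
    lut, running = [], 0
    for g in range(levels):
        running += hist[g]
        lut.append((levels - 1) * running // total)
    return lut


# Hypothetical tile with 4 gray levels, dominated by level 0 (e.g. paper background)
tile = [[0, 0, 0, 0], [0, 0, 1, 3]]
lut_clipped = clipped_equalization_map(tile, clip_limit=3, levels=4)
lut_plain = clipped_equalization_map(tile, clip_limit=100, levels=4)
```

Clipping limits how strongly a dominant gray level (such as the background) is stretched: here level 0 maps to 1 with clipping and to 2 without.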
Image preprocessing (3)
Panels: tile centers, original, global histogram equalization, CLAHE
Structural similarity (1)
MSE, PSNR, etc. are not well suited for content comparison -> perceptual image quality assessment
Non-blind/full-reference image quality assessment
The mean structural similarity index (SSIM, Wang et al., 2004) compares two images based on luminance, contrast and structure terms.
Mean SSIM is evaluated for the overlapping region of an image pair -> requires registration
To lower the influence of misregistration, the local minimum of the mean SSIM between the images in the pair is evaluated
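A minimal sketch of the mean SSIM computation for two registered grayscale images is given below. It uses the combined luminance/contrast/structure formula with the constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2 from Wang et al. (2004), but simplifies the windowing: non-overlapping uniform windows replace the Gaussian-weighted sliding window, and the window size of 8 is an assumption. Registration and the handling of misregistration are not shown.

```python
def ssim_window(a, b, dynamic_range=255.0):
    """SSIM of two equally sized windows given as flat lists of gray values."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((v - mu_a) ** 2 for v in a) / n
    var_b = sum((v - mu_b) ** 2 for v in b) / n
    cov = sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))


def mean_ssim(img_a, img_b, win=8):
    """Average SSIM over non-overlapping win x win windows (images as lists of rows)."""
    height, width = len(img_a), len(img_a[0])
    scores = []
    for top in range(0, height - win + 1, win):
        for left in range(0, width - win + 1, win):
            wa = [img_a[y][x] for y in range(top, top + win)
                  for x in range(left, left + win)]
            wb = [img_b[y][x] for y in range(top, top + win)
                  for x in range(left, left + win)]
            scores.append(ssim_window(wa, wb))
    if not scores:
        raise ValueError("images are smaller than one window")
    return sum(scores) / len(scores)


# Hypothetical 16x16 test pattern and its inverted copy
image = [[(x * 16 + y * 8) % 256 for x in range(16)] for y in range(16)]
inverted = [[255 - v for v in row] for row in image]
same = mean_ssim(image, image)
different = mean_ssim(image, inverted)
```

Identical images give a mean SSIM of 1, while the inverted copy scores much lower because the covariance between the windows becomes negative.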
Structural similarity (2)
Registered and overlaid images (low SSIM: black, high SSIM: white)
Results - International Dunhuang Project data (1)
Category                                 Criterion                          #Pairs
Pairs not matching                       Mean SSIM = 0                      8
Pairs with low structural similarity     Mean SSIM < 0.67                   78
Pairs with high structural similarity    Mean SSIM > 0.67 (p=5 quantile)    1482
Total number                                                                1560
(78 + 1482 = 1560, so the 78 pairs below 0.67 apparently include the 8 non-matching pairs)
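The triage implied by the counts above can be sketched as follows. Reading "p=5 quantile" as the 5% quantile of the mean SSIM values is an assumption, as are the function names, the category labels and the choice to count a score exactly at the threshold as high similarity.

```python
def quantile_threshold(scores, p=0.05):
    """Score with roughly a fraction p of all scores below it."""
    ordered = sorted(scores)
    k = int(p * len(ordered))
    return ordered[min(k, len(ordered) - 1)]


def triage(score, threshold):
    """Assign an image pair to one of the three result categories."""
    if score == 0:
        return "not matching"
    if score < threshold:
        return "low similarity"
    return "high similarity"  # scores equal to the threshold land here


# Hypothetical mean SSIM values for 20 image pairs
scores = [0.0, 0.40] + [0.70 + 0.01 * i for i in range(18)]
threshold = quantile_threshold(scores, p=0.05)
labels = [triage(s, 0.67) for s in scores]
```

With the threshold of 0.67 from the table, only pairs labelled "not matching" or "low similarity" would be passed on to human verification.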
Results - International Dunhuang Project data (2)
Pairs with high mean SSIM are not subject to human verification
Results - International Dunhuang Project data (3)
Pairs with medium mean SSIM are possibly subject to human verification
Results - International Dunhuang Project data (4)
Pairs with low mean SSIM are subject to human verification
Results - Google books redownload workflow (1)
Rate of matches = 1: content is identical
Nr.  Book/barcode name                #Pairs  Rate of matches
1    +Z13641740X_31525197396364410    546     0.9982
2    +Z13722110X_31525197396362478    18      1.0000
3    +Z136400800_31525197396361993    269     0.9888
4    +Z136408008_31525197396361942    291     0.9897
5    +Z136408409_31525197396362038    1       1.0000
6    +Z136409104_31525197396361681    182     0.9670
7    +Z136411408_31525197396362266    219     0.9954
8    +Z136415001_31525197396363522    360     0.9861
9    +Z136419900_31525197396360634    219     0.9954
10   +Z136428500_31525197396360351    249     0.9799
11   +Z136436004_31525197396361129    273     0.9853
12   +Z137116108_31525197396265632    589     0.9949
13   +Z137117708_31525197396287838    651     0.9969
14   +Z137118403_31525197396265776    505     0.9822
15   +Z137120100_31525197396265914    1231    0.9992
16   +Z137150402_31525197396389590    2       1.0000
17   +Z137219001_31525197396361518    664     0.9774
18   +Z150800609_31525197396361025    212     0.9858
19   +Z152471307_31525197396311214    443     0.9910
20   +Z152472403_31525197396313828    460     0.9913
21   +Z152472701_31525197396315698    859     0.9953
Results - Google books redownload workflow (2)
Panels: pairs not matching, pairs with low similarity (or low overlap), pairs with high similarity
Images provided by Google books collection / Austrian National Library
Conclusion and outlook
Keypoint-based approach for quality assurance in digital book preservation
Combination of the keypoint approach with perceptual similarity evaluation
Recently: combination with a bag-of-keypoints approach for duplicate detection and collection comparison
Currently: evaluation at the Austrian National Library (Google Books collection) and the British Library (historical newspaper collection)
Future: integration into the SCAPE platform for scalable distributed computing
AIT Austrian Institute of Technology - your ingenious partner