IMPACT Final Conference - Stefan Pletschacher
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Evaluation Tools
Stefan Pletschacher
Overview
Stefan Pletschacher - Evaluation Tools, IMPACT Conference, London, 24.10.2011 2
Digitisation Workflows
Performance Evaluation
Ground Truth
Evaluation Tools
– Segmentation and Layout
– OCR Text
Interpretation of Results
Digitisation Workflow
① Scanning
② Image enhancement
– Page splitting
– Border removal
– Dewarping (page curl, arbitrary warping)
– Noise removal
– Binarisation
③ Layout analysis
– Segmentation of regions, lines, words and characters
– Region classification
– Logical layout analysis
④ OCR
⑤ Post-processing
Evaluation
• Individual Processing Steps
• Complex Workflows
Performance Evaluation Overview
[Diagram components: Image Repository, GT Tools, Ground Truth, DIA / OCR Software, Results, Evaluation Tools, Evaluation Metrics, Evaluation Scenarios, Evaluation Results]
Compatibility through one common format (PAGE)
IMPACT Image Repository
Central management of Metadata, Images and Ground Truth
Datasets
Total number of images: 667,120
Institutional Datasets (10 libraries): 602,313 images
Demonstrator Sets: 56,141 images
Ground Truth in PAGE format: 36,498 approved instances, still growing
Working Sets (Showcases, Typewritten Set, Dewarping Set, Challenge Sets etc.)
Usage statistics (since 6/10/2010)
– 5,153,347 thumbs browsed
– 810,001 images accessed (724,946 full quality images, 22,676 direct access calls)
Tools for Ground Truth Production
Aletheia
– Page border, print space
– Layout regions (incl. metadata)
– Text lines, words and glyphs
– Unicode text at all levels
– Reading order, layers etc.
FineReader Engine Exporter (Preproduction)
GT Validator
Ground Truthing Historical Documents
Complex Reading Order (Groups of ordered and unordered elements)
Full Unicode Support (incl. special characters for historical documents)
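A reading order made of ordered and unordered groups can be thought of as a small tree. The sketch below is only an illustration of that idea: the tuple-based structure, the region IDs and the function name `linearisations` are assumptions made for this example and are not taken from the PAGE schema. It lists every reading sequence that such a tree would permit.

```python
from itertools import permutations, product

def linearisations(node):
    """Yield every reading sequence permitted by a nested group structure.

    A node is either a region ID (str) or a pair (kind, children), where
    kind is "ordered" or "unordered". This representation is hypothetical.
    """
    if isinstance(node, str):              # leaf: a single region ID
        yield (node,)
        return
    kind, children = node
    # Ordered groups keep their child order; unordered groups allow any order.
    arrangements = [tuple(children)] if kind == "ordered" else permutations(children)
    for arrangement in arrangements:
        options = [list(linearisations(child)) for child in arrangement]
        for parts in product(*options):
            yield tuple(region for part in parts for region in part)

# A heading, then two adverts in any order, then the article body.
tree = ("ordered", ["heading", ("unordered", ["ad1", "ad2"]), "article"])
orders = set(linearisations(tree))
for order in sorted(orders):
    print(" > ".join(order))
```

Enumerating all sequences is only practical for small groups; an evaluation tool would more likely check a given sequence against the tree directly.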
Ground Truth – Image Enhancement
Deskew
Binarisation
Border Removal
Dewarping
The PAGE Format Framework
Page Analysis and Ground-Truth Elements
Representation of Processing Results / Ground Truth
Two-level architecture:
– root structure
– task specific sub-formats
Separate XML Schema definitions
Format identification via Namespaces
Mapping of
– dependencies
– processing chains
– alternative processing steps
Linking via IDs
http://schema.primaresearch.org/PAGE/
[Diagram: one PAGE root (XML) document and three PAGE gts (XML) documents]
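Two of the ideas on this slide, format identification via namespaces and linking via IDs, can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the namespace URIs, element names and attribute names below are invented for the example and do not reproduce the actual PAGE schemas, which are published at the URL given on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs mapped to the sub-format they would identify.
KNOWN_FORMATS = {
    "http://example.org/page/layout": "layout ground truth",
    "http://example.org/page/dewarping": "dewarping ground truth",
}

# Hypothetical document: regions carry IDs, the reading order refers to them.
DOCUMENT = """<Layout xmlns="http://example.org/page/layout">
  <Region id="r1" type="paragraph"/>
  <Region id="r2" type="caption"/>
  <ReadingOrder>
    <Ref regionRef="r2"/>
    <Ref regionRef="r1"/>
  </ReadingOrder>
</Layout>"""

def namespace_of(element):
    """Return the namespace URI of an element, or "" if it has none."""
    tag = element.tag                      # ElementTree stores tags as "{uri}local"
    return tag[1:tag.index("}")] if tag.startswith("{") else ""

root = ET.fromstring(DOCUMENT)
ns = namespace_of(root)
kind = KNOWN_FORMATS.get(ns, "unknown format")

# Resolve the ID links: look up each referenced region by its id attribute.
regions = {r.get("id"): r.get("type") for r in root.iter(f"{{{ns}}}Region")}
order = [regions[ref.get("regionRef")] for ref in root.iter(f"{{{ns}}}Ref")]

print(kind)    # which sub-format the file claims to be
print(order)   # region types in the referenced order
```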
Evaluation Tools
Segmentation and Layout
OCR Text
Deskewing
Dewarping
Border Removal
Binarisation
Double Page Splitting
Segmentation and Layout
[Diagram: Ground Truth regions compared with Result regions by their Overlap]
Error types
– Miss / Partial Miss
– Split
– Misclassification
– Merge
– False Detection
Differentiation of errors based on reading order
– allowable
– non-allowable
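The error types above can be derived from the overlaps between ground-truth regions and result regions. The following is a minimal sketch, assuming axis-aligned rectangles instead of arbitrary region outlines and assuming that result regions do not overlap each other. The `Region` class, the function names and the sample coordinates are made up for illustration, and the allowable / non-allowable distinction based on reading order is not modelled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    region_id: str
    kind: str          # e.g. "paragraph", "caption", "image"
    box: tuple         # (left, top, right, bottom), a simplifying assumption

def area(box):
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)

def overlap(a, b):
    """Area of the intersection of two boxes (0 if they do not intersect)."""
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))

def find_errors(ground_truth, result):
    errors = []
    for gt in ground_truth:
        hits = [r for r in result if overlap(gt.box, r.box) > 0]
        if not hits:
            errors.append(("miss", gt.region_id))
            continue
        if sum(overlap(gt.box, r.box) for r in hits) < area(gt.box):
            errors.append(("partial miss", gt.region_id))
        if len(hits) > 1:
            errors.append(("split", gt.region_id))
        if any(r.kind != gt.kind for r in hits):
            errors.append(("misclassification", gt.region_id))
    for r in result:
        hits = [gt for gt in ground_truth if overlap(gt.box, r.box) > 0]
        if not hits:
            errors.append(("false detection", r.region_id))
        elif len(hits) > 1:
            errors.append(("merge", r.region_id))
    return errors

ground_truth = [
    Region("h", "header", (0, 0, 100, 10)),
    Region("p1", "paragraph", (0, 20, 100, 40)),
    Region("p2", "paragraph", (0, 50, 100, 70)),
    Region("c", "caption", (0, 80, 100, 90)),
    Region("img", "image", (0, 100, 100, 120)),
]
result = [
    Region("r1", "header", (0, 0, 100, 10)),        # correct
    Region("r2", "paragraph", (0, 20, 100, 70)),    # merges p1 and p2
    Region("r3", "paragraph", (0, 80, 50, 90)),     # left half of the caption
    Region("r4", "paragraph", (50, 80, 100, 90)),   # right half of the caption
    Region("r5", "image", (0, 95, 100, 100)),       # touches nothing in the ground truth
]
errors = find_errors(ground_truth, result)
for error_type, region_id in errors:
    print(f"{error_type}: {region_id}")
```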
Example – Ground Truth
[Figure: example page with ground-truth regions labelled Header, Paragraph, Paragraph, Image, Caption and Page]
Example – Layout Analysis Result
[Figure: layout analysis result for the same page, with regions labelled Header, Paragraph, Paragraph, Image, Image and Image]
Evaluation
[Figure: Ground Truth compared with the Layout Analysis Result, annotated with the errors found: Miss, Partial Miss, Merge, Split and Misclassification (region labels shown: Paragraph, Caption)]
OCR Text
Comparison of Ground Truth and OCR output based on encoded text (ASCII, Unicode)
Character accuracy
– Distance measure: minimum number of edit operations (insertions, deletions, substitutions)
– Per character class (lower case, upper case, whitespace characters, numbers, symbols, ...)
Word accuracy
– Correctly recognised words vs. total word count
– Stop words and non-stop words
Rejected and suspicious characters
Substitution errors (higher penalty)
Correction effort
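The distance measure named under character accuracy is the classic edit (Levenshtein) distance. The sketch below computes it with unit costs and turns it into an accuracy figure by normalising against the length of the ground truth. That normalisation and the function names are assumptions for illustration; the slide also mentions a higher penalty for substitutions, which this version does not apply. The two strings are the example pair from this slide.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def character_accuracy(ground_truth, ocr_output):
    """1 - errors / ground-truth length, clamped at 0 (an assumed normalisation)."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    return max(0.0, 1.0 - edit_distance(ground_truth, ocr_output) / len(ground_truth))

distance = edit_distance("OCR is cool", "OOR is cod")
accuracy = character_accuracy("OCR is cool", "OOR is cod")
print(distance, round(accuracy, 3))
```

For this pair the distance is 3 (C to O, one o deleted, l to d), so the accuracy is 8/11.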
Example: ground truth "OCR is cool", OCR output "OOR is cod"
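Word accuracy for the same example can be sketched as follows. The matching here is a simplification chosen for brevity: words are compared as multisets, ignoring their position, and the stop-word list is a hypothetical placeholder rather than the list used by the evaluation tools.

```python
from collections import Counter

STOP_WORDS = {"is", "the", "a", "of", "and"}   # hypothetical stop-word list

def is_stop_word(word):
    return word.lower() in STOP_WORDS

def word_accuracy(ground_truth, ocr_output, keep=lambda word: True):
    """Correctly recognised words divided by the ground-truth word count.

    Only words accepted by `keep` are counted. Returns None if the
    ground truth contains no such words.
    """
    gt_words = Counter(w for w in ground_truth.split() if keep(w))
    ocr_words = Counter(w for w in ocr_output.split() if keep(w))
    total = sum(gt_words.values())
    if total == 0:
        return None
    matched = sum((gt_words & ocr_words).values())   # multiset intersection
    return matched / total

gt_text, ocr_text = "OCR is cool", "OOR is cod"
overall = word_accuracy(gt_text, ocr_text)
stop_only = word_accuracy(gt_text, ocr_text, keep=is_stop_word)
non_stop = word_accuracy(gt_text, ocr_text, keep=lambda w: not is_stop_word(w))
print(overall, stop_only, non_stop)
```

Only "is" survives recognition, so one of three words is correct overall, while none of the non-stop words is. Separating the two groups matters because non-stop words tend to be the ones users search for.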
Interpretation of Results
Metrics
– Measurements of conditions
– Types and number of errors
Scenarios
– Application context
– Error weights
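[Diagram: individual errors (M1, M2, M3, S1, S2, ...) of each type (Miss, Misclassification, Merge, Split, False Detection) feed into per-type rates (Merge Rate, Split Rate, ...) and an overall Error Rate]
Overall success/error rates are based on
– weighted individual results
– type and size of affected regions
– allowable vs. non-allowable errors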
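One way to picture how scenario weights, region size and the allowable / non-allowable distinction could combine into a single figure is sketched below. The formula, the discount factor for allowable errors, the weight values and all names are assumptions made for this illustration; the slides do not give the actual formula used by the evaluation tools.

```python
def overall_error_rate(errors, weights, total_area, allowable_factor=0.5):
    """Weighted, area-based error rate in [0, 1] (hypothetical formula).

    errors: list of dicts with keys "type", "area" and "allowable".
    weights: error weight per error type for the chosen scenario.
    """
    penalty = 0.0
    for error in errors:
        weight = weights[error["type"]]
        if error["allowable"]:
            weight *= allowable_factor      # allowable errors count less
        penalty += weight * error["area"]
    return min(1.0, penalty / total_area)

# Three detected errors on a page with 10,000 units of region area.
errors = [
    {"type": "merge", "area": 2000, "allowable": True},
    {"type": "split", "area": 1000, "allowable": False},
    {"type": "miss", "area": 500, "allowable": False},
]

# Two hypothetical scenarios that weight the same errors differently.
general_scenario = {"merge": 0.5, "split": 0.5, "miss": 1.0}
text_scenario = {"merge": 1.0, "split": 0.5, "miss": 1.0}

general_rate = overall_error_rate(errors, general_scenario, total_area=10000)
text_rate = overall_error_rate(errors, text_scenario, total_area=10000)
print(general_rate, text_rate)
```

The same set of errors yields a different overall rate per scenario, which is the point of separating the measured errors from the application-specific weights.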
For more information visit:
PRImA: http://www.primaresearch.org
IMPACT: http://www.impact-project.eu