IMPACT Final Conference - Stefan Pletschacher

Post on 21-Nov-2014

Stefan Pletschacher from the University of Salford - IMPACT Evaluation Tools, ground truth and datasets

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Evaluation Tools

Stefan Pletschacher

Overview

Stefan Pletschacher - Evaluation Tools, IMPACT Conference, London, 24.10.2011

• Digitisation Workflows
• Performance Evaluation
• Ground Truth
• Evaluation Tools
– Segmentation and Layout
– OCR Text
• Interpretation of Results

Digitisation Workflow

① Scanning
② Image enhancement
– Page splitting
– Border removal
– Dewarping (page curl, arbitrary warping)
– Noise removal
– Binarisation
③ Layout analysis
– Segmentation of regions, lines, words and characters
– Region classification
– Logical layout analysis
④ OCR
⑤ Post-processing

Evaluation
• Individual Processing Steps
• Complex Workflows

Performance Evaluation Overview

[Diagram with the components: Image Repository, GT Tools, Ground Truth, DIA / OCR Software, Results, Evaluation Tools, Evaluation Metrics, Evaluation Scenarios, Evaluation Results]

Compatibility through one common format (PAGE)

IMPACT Image Repository

Central management of metadata, images and ground truth

Datasets

• Total number of images: 667,120
• Institutional Datasets (10 libraries): 602,313 images
• Demonstrator Sets: 56,141 images
• Ground Truth in PAGE format: 36,498 approved instances, still growing

Working Sets (Showcases, Typewritten Set, Dewarping Set, Challenge Sets etc.)

Usage statistics (since 6/10/2010)
– 5,153,347 thumbnails browsed
– 810,001 images accessed (724,946 full quality images, 22,676 direct access calls)

Tools for Ground Truth Production

Aletheia
– Page border, print space
– Layout regions (incl. metadata)
– Text lines, words and glyphs
– Unicode text at all levels
– Reading order, layers etc.

FineReader Engine Exporter (Preproduction)

GT Validator

Ground Truthing Historical Documents

Complex Reading Order (Groups of ordered and unordered elements)

Full Unicode Support (incl. special characters for historical documents)

Ground Truth – Image Enhancement

• Deskew
• Binarisation
• Border Removal
• Dewarping

The PAGE Format Framework

Page Analysis and Ground-Truth Elements (PAGE)

• Two-level architecture:
– root structure
– task specific sub-formats
• Separate XML Schema definitions
• Format identification via namespaces
• Mapping of
– dependencies
– processing chains
– alternative processing steps
• Linking via IDs

http://schema.primaresearch.org/PAGE/

[Diagram: Representation of Processing Results / Ground Truth, with one PAGE root (XML) file linked to several PAGE gts (XML) files]
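As a rough illustration of "format identification via namespaces" and "linking via IDs", the sketch below parses a small, hand-written XML fragment modelled on a PAGE page-content file. The element names, attribute names and the namespace string are written from memory as assumptions and are not validated against the published schema; the helper `q` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment modelled on PAGE page content (an assumption, not
# validated against the schema). Regions are deliberately stored out of order.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2010-03-19">
  <Page imageFilename="example.png" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r2" type="paragraph">
      <TextEquiv><Unicode>Body text</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r1" type="heading">
      <TextEquiv><Unicode>Title</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""

root = ET.fromstring(SAMPLE)

# ElementTree stores tags as "{namespace}localname", so the namespace of the
# root element identifies which (sub-)format and version the file uses.
namespace = root.tag[1:].split("}")[0]


def q(tag: str) -> str:
    """Qualify a local tag name with the namespace found on the root."""
    return f"{{{namespace}}}{tag}"


# Index regions by ID so that references elsewhere in the file can be resolved.
regions = {region.get("id"): region for region in root.iter(q("TextRegion"))}

# Follow the reading-order references (linking via IDs) in index order.
refs = sorted(root.iter(q("RegionRefIndexed")), key=lambda r: int(r.get("index")))
ordered_text = [
    regions[ref.get("regionRef")].findtext(f"{q('TextEquiv')}/{q('Unicode')}")
    for ref in refs
]

print(namespace)
print(ordered_text)
```

Reading the namespace from the root element, rather than hard-coding it, lets the same code handle files written against different schema versions, provided the element names it relies on are unchanged.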

Evaluation Tools

• Segmentation and Layout
• OCR Text
• Deskewing
• Dewarping
• Border Removal
• Binarisation
• Double Page Splitting

Segmentation and Layout

• Ground truth vs. result: overlap
• Error types
– Miss / partial miss
– Split
– Misclassification
– Merge
– False detection
• Differentiation of errors based on reading order
– allowable
– non-allowable
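The error types above can be illustrated with a minimal overlap check between ground-truth and result regions. This is a simplified sketch, not the evaluation tool's actual method. It assumes axis-aligned rectangles instead of polygons, treats any non-zero overlap as a correspondence, assumes result regions do not overlap each other, and ignores the reading-order distinction between allowable and non-allowable errors. The names `Region`, `overlap_area` and `classify_errors` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    kind: str  # e.g. "paragraph", "image", "caption"
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def area(self) -> int:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def overlap_area(a: Region, b: Region) -> int:
    """Area of the intersection of two rectangles (0 if they do not overlap)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return width * height if width > 0 and height > 0 else 0


def classify_errors(ground_truth, results):
    """Return (error type, region name) pairs derived from region overlaps."""
    errors = []
    for gt in ground_truth:
        hits = [r for r in results if overlap_area(gt, r) > 0]
        if not hits:
            errors.append(("miss", gt.name))
            continue
        covered = sum(overlap_area(gt, r) for r in hits)
        if covered < gt.area:
            errors.append(("partial miss", gt.name))
        if len(hits) > 1:
            errors.append(("split", gt.name))
        if any(r.kind != gt.kind for r in hits):
            errors.append(("misclassification", gt.name))
    for r in results:
        hits = [gt for gt in ground_truth if overlap_area(gt, r) > 0]
        if not hits:
            errors.append(("false detection", r.name))
        elif len(hits) > 1:
            errors.append(("merge", r.name))
    return errors


ground_truth = [
    Region("header", "header", 0, 0, 100, 10),
    Region("para1", "paragraph", 0, 20, 100, 50),
    Region("para2", "paragraph", 0, 60, 100, 90),
    Region("image", "image", 0, 100, 100, 150),
    Region("caption", "caption", 0, 155, 100, 165),
]
results = [
    Region("r_header", "header", 0, 0, 100, 10),
    Region("r_para", "paragraph", 0, 20, 100, 90),    # spans both paragraphs
    Region("r_img1", "image", 0, 100, 50, 150),       # left half of the image
    Region("r_img2", "image", 50, 100, 100, 150),     # right half of the image
    Region("r_noise", "image", 200, 0, 210, 10),      # matches nothing
]

errors = classify_errors(ground_truth, results)
for kind, name in errors:
    print(f"{kind}: {name}")
```

In this invented layout, the two paragraphs are merged, the image is split, the caption is missed and one spurious region is reported as a false detection.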

Example – Ground Truth

[Figure: ground-truth page with regions labelled Header, Paragraph, Paragraph, Image, Caption and Page]

Example – Layout Analysis Result

[Figure: layout analysis result with regions labelled Header, Paragraph, Paragraph, Image, Image and Image]

Evaluation

[Figure: Ground Truth compared with the Layout Analysis Result; marked errors: Miss, Partial Miss, Merge, Split and Misclassification (Paragraph vs. Caption)]

OCR Text

• Comparison of Ground Truth and OCR output based on encoded text (ASCII, Unicode)
• Character accuracy
– Distance measure: minimum number of edit operations (insertions, deletions, substitutions)
– Per character class (lower case, upper case, whitespace characters, numbers, symbols, ...)
• Word accuracy
– Correctly recognised words vs. total word count
– Stop words and non-stop words
• Rejected and suspicious characters
• Substitution errors (higher penalty)
• Correction effort

Example: "OCR is cool" vs. "OOR is cod"
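The character and word measures can be sketched on the two strings from the slide, taking "OCR is cool" as ground truth and "OOR is cod" as OCR output (that assignment is an assumption). The edit distance is the standard Levenshtein distance with unit costs. The normalisation by ground-truth length and the bag-of-words matching used for word accuracy are assumptions, as are the function names; the evaluation tool itself may weight substitutions more heavily, as noted above.

```python
from collections import Counter


def edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of insertions, deletions and substitutions (unit costs)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """(reference length - edit distance) / reference length, floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, (len(reference) - errors) / len(reference))


def word_accuracy(reference: str, hypothesis: str, stop_words=frozenset()) -> float:
    """Share of reference words also found in the hypothesis (bag of words).

    Words listed in stop_words are excluded from both sides, so the measure
    can be reported separately for non-stop words.
    """
    ref_words = [w for w in reference.split() if w.lower() not in stop_words]
    hyp_words = [w for w in hypothesis.split() if w.lower() not in stop_words]
    if not ref_words:
        return 1.0 if not hyp_words else 0.0
    matched = sum((Counter(ref_words) & Counter(hyp_words)).values())
    return matched / len(ref_words)


ground_truth = "OCR is cool"
ocr_output = "OOR is cod"

print(edit_distance(ground_truth, ocr_output))                      # 3
print(round(character_accuracy(ground_truth, ocr_output), 3))       # 0.727
print(round(word_accuracy(ground_truth, ocr_output), 3))            # 0.333
print(word_accuracy(ground_truth, ocr_output, stop_words={"is"}))   # 0.0
```

Three edits separate the strings (C→O, one deleted "o", l→d), so 8 of 11 characters count as correct. Only the stop word "is" survives at word level, which is why the non-stop-word accuracy drops to zero.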

Interpretation of Results

• Metrics
– Measurements of conditions
– Types and number of errors
• Scenarios
– Application context
– Error weights

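[Diagram: individual errors by type (Miss, Misclass., Merge, Split, False detect.), e.g. M1, M2, M3 and S1, S2, ..., are combined into per-type rates (Merge Rate, Split Rate, ...) and then into an overall Error Rate]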

Overall success/error rates are based on
– weighted individual results
– type and size of affected regions
– allowable vs. non-allowable errors
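One way to combine these three factors is shown below as a minimal sketch. The formula is an assumption, not the one used by the evaluation tools: weighted affected area divided by total ground-truth area, capped at 1. The weights, the reduction factor for allowable errors and the names `LayoutError` and `error_rate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutError:
    kind: str                # "miss", "merge", "split", ...
    affected_area: int       # size of the affected region, in pixels
    allowable: bool = False  # e.g. a merge that keeps the reading order intact


def error_rate(errors, total_area, weights, allowable_factor=0.5):
    """Weighted affected area relative to the total ground-truth area."""
    if total_area <= 0:
        raise ValueError("total_area must be positive")
    weighted = 0.0
    for error in errors:
        weight = weights.get(error.kind, 1.0)  # scenario-specific error weight
        if error.allowable:
            weight *= allowable_factor         # allowable errors count less
        weighted += weight * error.affected_area
    return min(1.0, weighted / total_area)


errors = [
    LayoutError("merge", affected_area=2000, allowable=True),
    LayoutError("miss", affected_area=1000),
]

# Two hypothetical scenarios that differ only in how much a merge matters.
lenient_scenario = {"miss": 1.0, "merge": 0.5, "split": 0.5}
strict_scenario = {"miss": 1.0, "merge": 1.0, "split": 1.0}

lenient_rate = error_rate(errors, total_area=10000, weights=lenient_scenario)
strict_rate = error_rate(errors, total_area=10000, weights=strict_scenario)
print(f"lenient scenario: error rate {lenient_rate:.2f}, success rate {1 - lenient_rate:.2f}")
print(f"strict scenario:  error rate {strict_rate:.2f}, success rate {1 - strict_rate:.2f}")
```

The same set of errors yields different overall rates depending on the scenario, which is the point of separating metrics (what went wrong) from scenarios (how much it matters in a given application).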

For more information visit:

PRImA: http://www.primaresearch.org

IMPACT: http://www.impact-project.eu
