IMPACT Final Conference - NCSR - Wordspotting

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. IMPACT Final Conference 2011, 24-25 October 2011, London, UK. IMPACT Tools Developed by NCSR. B. Gatos, Computational Intelligence Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research (NCSR) "Demokritos", GR-153 10 Agia Paraskevi, Athens, Greece.


Page 1: IMPACT Final Conference - NCSR - Wordspotting

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

24-25 October 2011, London, UK

IMPACT Tools Developed by NCSR

IMPACT Final Conference 2011

B. Gatos
Computational Intelligence Laboratory
Institute of Informatics and Telecommunications
National Center for Scientific Research (NCSR) "Demokritos"
GR-153 10 Agia Paraskevi, Athens, Greece

Page 2: IMPACT Final Conference - NCSR - Wordspotting


• Develop an alternative technique for historical document indexing that spots words directly on document images, avoiding the conventional OCR procedure

• Provide three methods for word spotting:

  - Selecting the query from a predefined list of keywords

  - Query by example

  - Free text query

• Incorporate the whole word spotting functionality in a GUI tool

IMPACT Tools Developed by NCSR - IMPACT Final Conference 2011, 24-25 October, London, UK

Page 3: IMPACT Final Conference - NCSR - Wordspotting


The main operational parts of the Word Spotting application are:

• Page segmentation

• Feature extraction

• Marking character templates

• Word matching

• User feedback

• Query selection by example

• Free text synthetic query creation

• Searching

• User access control

[Figure: block diagram of the Word Spotting application. Document pages processing: document corpus, word segmentation, segmented words, features extraction. Query module: predefined keyword (from the keywords list), by example, or free text; synthetic keywords are built from alphabet characters and user-defined character templates, followed by query word features extraction. Similarity measurement and ranking of results, refined by the user's feedback, yield the word instances in documents.]


Page 4: IMPACT Final Conference - NCSR - Wordspotting


Main steps (1/2)

Document pages

• Select pages from the document corpus

• Apply word segmentation to the pages

• Apply feature extraction to all segmented words

Query

• Define the list of keywords

• Select the query keyword from the list

• Mark the character templates

• Create a synthetic query image

• Apply feature extraction to the query image
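The word segmentation step can be illustrated with a projection-based split of a single text line. This is a minimal sketch, not the NCSR implementation: the binary image layout (rows of 0/1 with 1 meaning ink), the function names and the `min_gap` parameter are all assumptions made for illustration.

```python
def vertical_projection(line_img):
    """Count ink pixels (1s) in every column of a binary text-line image."""
    return [sum(col) for col in zip(*line_img)]


def split_words(line_img, min_gap=2):
    """Return (start, end) column ranges of words, with end exclusive.

    A run of at least `min_gap` empty columns is treated as a gap between
    words; shorter runs are treated as spacing between characters.
    """
    profile = vertical_projection(line_img)
    words, start, gap = [], None, 0
    for x, ink in enumerate(profile):
        if ink:
            if start is None:
                start = x          # first ink column of a new word
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:     # gap wide enough: close the current word
                words.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:          # word running up to the right edge
        words.append((start, len(profile) - gap))
    return words
```

In practice the gap threshold would be tuned per book, since inter-word spacing varies between typefaces and print quality.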


Page 5: IMPACT Final Conference - NCSR - Wordspotting


Main steps (2/2)

Matching and User feedback

• Word matching

• User feedback

• Selecting the final results


Page 6: IMPACT Final Conference - NCSR - Wordspotting


Marking character templates

• Applied directly on a text image

• Character baseline adjustment

• Performed once for all: the templates can be reused for entire books or collections with similar text characteristics
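To show why the baseline adjustment matters, here is a minimal sketch of how a synthetic query image could be composed from marked character templates. The template format (a bitmap plus a `descent`, the number of rows below the baseline), the function name and the `spacing` parameter are hypothetical; the slides do not specify how the tool stores templates.

```python
def synthesize_word(text, templates, spacing=1):
    """Concatenate character templates so that their baselines coincide.

    `templates` maps a character to (bitmap, descent): bitmap is a list of
    rows of 0/1 and descent is the number of rows below the baseline.
    """
    glyphs = [templates[ch] for ch in text]
    ascent = max(len(bmp) - d for bmp, d in glyphs)   # rows on or above baseline
    descent = max(d for _, d in glyphs)               # rows below baseline
    height = ascent + descent
    rows = [[] for _ in range(height)]
    for i, (bmp, d) in enumerate(glyphs):
        width = len(bmp[0])
        top = ascent - (len(bmp) - d)                 # blank rows above this glyph
        for y in range(height):
            src = y - top
            rows[y].extend(bmp[src] if 0 <= src < len(bmp) else [0] * width)
            if i < len(glyphs) - 1:
                rows[y].extend([0] * spacing)         # blank columns between glyphs
    return rows
```

Without the per-character descent, a letter such as "p" would sit on the same bottom edge as "a", and the synthetic word would no longer resemble the printed one.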


Page 7: IMPACT Final Conference - NCSR - Wordspotting


Feature extraction & word matching

• Describe each word (synthetic or real) by a set of features

• Normalize

• Match by checking similarity based on features
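The normalize-then-match idea can be sketched as follows. The normalization used here (crop to the ink bounding box, then nearest-neighbour resize to a fixed grid) and the Euclidean distance are common choices assumed for illustration; the slides do not state which normalization or similarity measure the tool uses, and the raw pixels stand in for real features.

```python
import math


def crop_to_ink(img):
    """Crop a binary image (rows of 0/1) to the bounding box of its ink."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x, col in enumerate(zip(*img)) if any(col)]
    if not rows:
        return [[0]]                       # blank image: single empty pixel
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


def resize_nearest(img, height, width):
    """Nearest-neighbour resize to a fixed height x width grid."""
    h, w = len(img), len(img[0])
    return [[img[y * h // height][x * w // width] for x in range(width)]
            for y in range(height)]


def normalize(img, height=8, width=16):
    """Make word images comparable regardless of their original size."""
    return resize_nearest(crop_to_ink(img), height, width)


def euclidean(a, b):
    """Distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def word_distance(img_a, img_b):
    """Smaller is more similar; pixels of the normalized images act as features."""
    fa = [v for row in normalize(img_a) for v in row]
    fb = [v for row in normalize(img_b) for v in row]
    return euclidean(fa, fb)
```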


Page 8: IMPACT Final Conference - NCSR - Wordspotting


Feature extraction & word matching

• Features based on word profile projections

• Features based on zones

• Hybrid features by projections and zones

[Figure: word images on x-y axes illustrating the three feature types; the labels "Upper Boundary", "Lower Boundary" and the line y=yt are marked]
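The three feature families named on this slide can be sketched with common textbook definitions. These definitions are assumptions: zone features are taken as ink density per grid cell, profile features as the normalized distance from the top and bottom edges to the first ink pixel in each column, and the hybrid vector as their concatenation. The exact formulation used by the tool is given in the publications listed in this deck, not here.

```python
def zone_features(img, zone_rows=2, zone_cols=4):
    """Ink density in each cell of a zone_rows x zone_cols grid."""
    h, w = len(img), len(img[0])
    feats = []
    for zr in range(zone_rows):
        y0, y1 = zr * h // zone_rows, (zr + 1) * h // zone_rows
        for zc in range(zone_cols):
            x0, x1 = zc * w // zone_cols, (zc + 1) * w // zone_cols
            ink = sum(sum(row[x0:x1]) for row in img[y0:y1])
            area = (y1 - y0) * (x1 - x0)
            feats.append(ink / area if area else 0.0)
    return feats


def profile_features(img):
    """Upper and lower word profiles, one value per column, scaled to [0, 1]."""
    h = len(img)
    upper, lower = [], []
    for col in zip(*img):
        ys = [y for y, v in enumerate(col) if v]
        upper.append(ys[0] / h if ys else 1.0)             # distance from top
        lower.append((h - 1 - ys[-1]) / h if ys else 1.0)  # distance from bottom
    return upper + lower


def hybrid_features(img, zone_rows=2, zone_cols=4):
    """Concatenate zone and profile descriptors into one vector."""
    return zone_features(img, zone_rows, zone_cols) + profile_features(img)
```

Because the profile part has one value per column, word images would need to be normalized to a common width before two hybrid vectors can be compared.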


Page 9: IMPACT Final Conference - NCSR - Wordspotting


Feature extraction & word matching

• Features by centers of mass
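The transcript preserves only the name of this feature type, not how it is computed. One plausible reading, shown purely as an assumption, is to split the word image into vertical strips and record the center of mass of the ink in each strip. The function names, the strip count and the neutral value used for empty strips are all hypothetical.

```python
def center_of_mass(img):
    """Return the (x, y) centroid of ink pixels, or None for a blank image."""
    total = sx = sy = 0
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v:
                total += 1
                sx += x
                sy += y
    if total == 0:
        return None
    return sx / total, sy / total


def com_features(img, strips=4):
    """Normalized (x, y) centroid of each vertical strip, concatenated."""
    h, w = len(img), len(img[0])
    feats = []
    for s in range(strips):
        x0, x1 = s * w // strips, (s + 1) * w // strips
        strip = [row[x0:x1] for row in img]
        com = center_of_mass(strip)
        if com is None:
            feats.extend([0.5, 0.5])       # neutral value for an empty strip
        else:
            feats.extend([com[0] / (x1 - x0), com[1] / h])
    return feats
```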


Page 10: IMPACT Final Conference - NCSR - Wordspotting


User feedback example

[Figure: (a) Synthetic query word. (b) Initial ranking of segmented words; the highlighted words are the correct words selected by the user. (c) Ranking after the user's feedback.]
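The effect shown in panels (b) and (c) can be sketched with a simple relevance-feedback rule: after the user confirms some correct words, every candidate is scored by its distance to the nearest reference, where the references are the synthetic query plus the confirmed real instances. This rule is an assumption chosen for illustration; the slides do not say how the tool combines the feedback.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def rank_words(query_feat, word_feats):
    """Initial ranking: word ids sorted by distance to the query."""
    return sorted(word_feats, key=lambda wid: dist(word_feats[wid], query_feat))


def rerank_with_feedback(query_feat, word_feats, confirmed_ids):
    """Re-rank using the query and the user-confirmed words as references."""
    references = [query_feat] + [word_feats[wid] for wid in confirmed_ids]

    def score(wid):
        return min(dist(word_feats[wid], ref) for ref in references)

    return sorted(word_feats, key=score)
```

A confirmed real instance is usually closer to the other real instances than a synthetic query is, which is why instances that ranked low at first can move up after feedback.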


Page 11: IMPACT Final Conference - NCSR - Wordspotting


Searching

• Allows the user to search the image corpus for instances of query keywords that have already undergone the user feedback process.

• The user selects one of the processed keywords and the application shows all the instances of this keyword in the images of the corpus.

• The user can navigate through the results at the instance level (showing one instance at a time) or at the page level (showing all instances on a page).
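The two navigation levels described above map naturally onto a small index of confirmed hits. The class below is a hypothetical data structure for illustration only; the names `SpottingIndex`, `add`, `instances` and `by_page`, and the (page, bounding box) representation of a hit, are assumptions rather than the tool's actual storage format.

```python
from collections import defaultdict


class SpottingIndex:
    """Stores confirmed keyword instances as (page number, bounding box)."""

    def __init__(self):
        self._hits = defaultdict(list)

    def add(self, keyword, page, box):
        """Record one confirmed instance of `keyword` on `page`."""
        self._hits[keyword].append((page, box))

    def instances(self, keyword):
        """Instance-level view: every hit, ordered by page and position."""
        return sorted(self._hits.get(keyword, []))

    def by_page(self, keyword):
        """Page-level view: all hit boxes grouped per page."""
        pages = defaultdict(list)
        for page, box in self.instances(keyword):
            pages[page].append(box)
        return dict(pages)
```

A viewer could step through `instances(keyword)` one item at a time, or render each entry of `by_page(keyword)` as a page with all of its boxes highlighted.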


Page 12: IMPACT Final Conference - NCSR - Wordspotting


A. L. Kesidis, E. Galiotou, B. Gatos and I. Pratikakis, "A word spotting framework for historical machine-printed documents", International Journal on Document Analysis and Recognition, DOI: 10.1007/s10032-010-0134-4, pp. 1-14, 2010.

A. L. Kesidis, E. Galiotou, B. Gatos, A. Lampropoulos, I. Pratikakis, I. Manolessou and A. Ralli, "Accessing the content of Greek historical documents", 3rd Workshop on Analytics for Noisy Unstructured Text Data (AND'09), pp. 55-62, Barcelona, Spain, July 2009.


Page 13: IMPACT Final Conference - NCSR - Wordspotting


Main steps

Document pages (applied once)

Query

• Select (by cropping) a query word image from a page

• Apply feature extraction to the query image

Matching

• Match the query features to the features of all segmented words

• Rank the segmented words by similarity

• Return the most similar segmented words

• No user feedback!
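The query and matching steps on this slide can be sketched end to end: crop a word from a page, describe it, and return the nearest segmented words with no feedback loop. The descriptor used here (ink density sampled at `n` columns) is a deliberately simple stand-in, and all names and the box convention are assumptions for illustration.

```python
import math


def crop(page, box):
    """Cut the (left, top, right, bottom) box out of a page; right/bottom exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in page[top:bottom]]


def descriptor(img, n=4):
    """Stand-in features: ink density sampled at n evenly spaced columns."""
    h, w = len(img), len(img[0])
    cols = list(zip(*img))
    return [sum(cols[i * w // n]) / h for i in range(n)]


def query_by_example(page, query_box, word_boxes, top_k=3):
    """Rank segmented word boxes by similarity to the cropped query word."""
    q = descriptor(crop(page, query_box))
    scored = []
    for box in word_boxes:
        f = descriptor(crop(page, box))
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, f)))
        scored.append((d, box))
    scored.sort()                          # nearest first
    return [box for _, box in scored[:top_k]]
```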


Page 14: IMPACT Final Conference - NCSR - Wordspotting


Main steps

Document pages (applied once)

Query

• Type the query text

• Construct a synthetic query image using letter templates (provided by an administrator)

• Apply feature extraction to the query image

Matching

• Match the query features to the features of all segmented words

• Rank the segmented words by similarity

• Return the most similar segmented words

• No user feedback!


Page 15: IMPACT Final Conference - NCSR - Wordspotting


Two levels: Guest and Administrator

• Guest: Search, Query by Example, Free Text Query

• Administrator: Search, Word Spotting, Query by Example, Free Text Query, User management + settings
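A two-level scheme like this one is often implemented as a role-to-permission table. The sketch below assumes that the three guest check marks correspond to Search, Query by Example and Free Text Query; that mapping is an inference from the rest of the deck, where word spotting with user feedback is an administrator task, and the identifiers are hypothetical.

```python
# Assumed mapping of the slide's check marks to actions (see the note above).
PERMISSIONS = {
    "guest": {"search", "query_by_example", "free_text_query"},
    "administrator": {
        "search",
        "word_spotting",
        "query_by_example",
        "free_text_query",
        "user_management",
    },
}


def can(role, action):
    """Return True when `role` is allowed to perform `action`."""
    return action in PERMISSIONS.get(role, set())


def require(role, action):
    """Raise PermissionError when the role lacks the permission."""
    if not can(role, action):
        raise PermissionError(f"{role!r} may not perform {action!r}")
```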


Page 16: IMPACT Final Conference - NCSR - Wordspotting


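Tasks per query method (Query by Keyword, Query by Example, Free Text)

OFFLINE PREPARATION – ADMINISTRATIVE TASKS

• Page segmentation and features extraction: Admin (Query by Keyword, Query by Example, Free Text)

• Keywords definition: Admin (Query by Keyword)

• Letter templates definition: Admin (Query by Keyword, Free Text)

• Word Spotting by User's feedback: Admin (Query by Keyword)

ONLINE USAGE

• Searching: All Users (Query by Keyword, Query by Example, Free Text)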


Page 17: IMPACT Final Conference - NCSR - Wordspotting


Document corpus

• French book (153 pages, 47836 words): Le Dernier fils de France, ou le Duc de Normandie, fils de Louis XVI et de Marie-Antoinette, par A., 1838

• German book (126 pages, 24596 words): Aufschlüsse zur Magie aus geprüften Erfahrungen über verborgene philosophische Wissenschaften und verdeckte Geheimnisse der Natur, 1788

Segmentation

• Projections

• RLSA

• USAL1 (Connected components)

• USAL2 (Projections)

Features

• Hybrid (Projections+Zones)

• Center of Masses

Keywords

• 5 keywords per book

• Overall 80 experiments

• Each experiment performed without user feedback and with 1, 2, and 3 user-selected words
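The figure of 80 experiments is consistent with a full grid over the options listed on this slide: 2 books × 4 segmentation methods × 2 feature types × 5 keywords per book. That reading is an assumption, since the slide gives only the total, and the keyword labels below are placeholders because the actual keywords are not listed.

```python
from itertools import product

BOOKS = ["French", "German"]
SEGMENTATION = ["Projections", "RLSA", "USAL1", "USAL2"]
FEATURES = ["Hybrid", "Center of Masses"]
KEYWORDS_PER_BOOK = 5
FEEDBACK_LEVELS = [0, 1, 2, 3]   # number of user-selected words


def experiment_grid():
    """Enumerate one experiment per (book, segmentation, features, keyword)."""
    grid = []
    for book in BOOKS:
        keywords = [f"{book}-kw{i + 1}" for i in range(KEYWORDS_PER_BOOK)]  # placeholders
        grid.extend(product([book], SEGMENTATION, FEATURES, keywords))
    return grid


experiments = experiment_grid()
runs = [(exp, level) for exp in experiments for level in FEEDBACK_LEVELS]
print(len(experiments), len(runs))
```

Under this reading each experiment is evaluated at four feedback levels, which gives 320 individual runs.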


Page 18: IMPACT Final Conference - NCSR - Wordspotting


Feature extraction

• Hybrid features gave better results and are faster than Center of Masses features

[Figure: Average precision vs. recall diagrams of word spotting in relation to feature extraction methods for (a) Book A and (b) Book B.]
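Precision vs. recall diagrams of this kind are typically built from ranked result lists. The sketch below computes the precision and recall at every rank for one keyword, reads off the interpolated precision at fixed recall levels, and averages those values over keywords. Whether the authors used interpolation or a different averaging scheme is not stated on the slide, so treat this as one standard way to obtain such a curve.

```python
def precision_recall_points(ranked, relevant):
    """Return (recall, precision) after each position of a ranked list."""
    relevant = set(relevant)
    if not relevant:
        return []
    points, hits = [], 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


def precision_at_recall(points, level):
    """Interpolated precision: best precision at any recall >= level."""
    candidates = [p for r, p in points if r >= level]
    return max(candidates) if candidates else 0.0


def mean_curve(all_points, levels):
    """Average the interpolated precision over several keywords."""
    return [
        sum(precision_at_recall(points, level) for points in all_points) / len(all_points)
        for level in levels
    ]
```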


Page 19: IMPACT Final Conference - NCSR - Wordspotting


User feedback

• User feedback improved the results

[Figure: Average precision vs. recall diagrams of word spotting in relation to the user's feedback involvement for (a) Book A and (b) Book B.]


Page 20: IMPACT Final Conference - NCSR - Wordspotting


Segmentation issues

Query by Example + Free Text Query

• In Query by Example the performance is similar to User Feedback when one relevant instance is selected

• In both methods the results depend on the similarity threshold
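The dependence on the similarity threshold can be made concrete with a toy example: a strict threshold returns few, mostly correct words, while a loose one returns more of the true instances along with more false matches. The distances and labels below are invented for illustration and are not results from the experiments.

```python
def spot(distances, threshold):
    """Return word ids whose distance to the query is at most `threshold`."""
    ordered = sorted(distances.items(), key=lambda kv: kv[1])
    return [wid for wid, d in ordered if d <= threshold]


def precision_recall(returned, relevant):
    """Precision and recall of a returned set against the true instances."""
    relevant = set(relevant)
    hits = sum(1 for wid in returned if wid in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented distances: "a", "b", "c" are true instances, "x" is a false match.
distances = {"a": 0.1, "b": 0.3, "x": 0.4, "c": 0.9}
relevant = {"a", "b", "c"}
strict = precision_recall(spot(distances, 0.35), relevant)
loose = precision_recall(spot(distances, 1.0), relevant)
```

Here the strict threshold trades recall for precision and the loose one does the opposite, which is the trade-off a user of the tool tunes when choosing the threshold.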

