IMPACT Final Conference - NCSR - Wordspotting
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
24-25 October 2011, London, UK
IMPACT Tools Developed by NCSR
IMPACT Final Conference 2011
B. Gatos
Computational Intelligence Laboratory
Institute of Informatics and Telecommunications
National Center for Scientific Research (NCSR) "Demokritos"
GR-153 10 Agia Paraskevi, Athens, Greece
• Develop an alternative technique for historical document indexing, based on spotting words directly on document images and avoiding the conventional OCR procedure
• Provide three methods for word spotting:
  • Selecting the query from a predefined list of keywords
  • Query by example
  • Free text query
• Incorporate the whole word spotting functionality in a GUI tool
IMPACT Tools Developed by NCSR - IMPACT Final Conference 2011, 24-25 October, London, UK
The main operational parts of the Word Spotting application are:
• Page segmentation
• Feature extraction
• Marking character templates
• Word matching
• User feedback
• Query selection by example
• Free text synthetic query creation
• Searching
• User access control
[Figure: block diagram of the word spotting workflow. Document pages processing: document corpus, word segmentation, segmented words, features extraction, features. Query module: predefined keyword (from the keywords list), by example, or free text; synthetic keyword built from alphabet characters and user-defined character templates; query word features extraction. Remaining blocks: similarity measurement, ranking results, user's feedback, word instances in documents.]
Main steps (1/2)
Document pages
• Select pages from the documents corpus
• Apply word segmentation to the pages
• Apply feature extraction to all segmented words
Query
• Define the list of keywords
• Select the query keyword from the list
• Mark the character templates
• Create a synthetic query image
• Apply feature extraction to the query image
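As a rough illustration of the word segmentation step, the sketch below splits one binarized text line into words using its vertical projection profile. The function name `segment_words`, the `min_gap` parameter, and the list-of-lists image format are assumptions made for this example; the slides do not describe how the tool implements segmentation.

```python
def segment_words(line, min_gap=2):
    """Split a binary text line (rows of 0/1, 1 = ink) into word column spans.

    A run of at least `min_gap` empty columns is treated as a word gap
    (hypothetical rule for this sketch).
    Returns a list of (start_col, end_col) pairs, end exclusive.
    """
    width = len(line[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in line) for x in range(width)]

    words = []
    start = None  # first column of the current word, if any
    gap = 0       # length of the current run of empty columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the word just before the run of empty columns.
                words.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a word that reaches the end of the line, trimming trailing gap.
        words.append((start, width - gap))
    return words
```

Narrow gaps (shorter than `min_gap`) are kept inside a word, so the threshold would need tuning to the spacing of a given book.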
Main steps (2/2)
Matching and User feedback
• Word matching
• User feedback
• Selecting the final results
Marking character templates
• Applied directly on a text image
• Character baseline adjustment
• Performed “once-for-all” and can be used for entire books or collections with similar text characteristics
Feature extraction & word matching
• Describe each word (synthetic or real) by a set of features
• Normalize
• Match by checking similarity based on features
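The describe, normalize, and match steps can be sketched as follows. The slides do not say what is normalized or which similarity measure is used, so unit-length scaling and Euclidean distance are assumptions chosen for illustration, as are the names `normalize` and `rank_words`.

```python
import math

def normalize(features):
    """Scale a feature vector to unit Euclidean length (zero vectors unchanged)."""
    norm = math.sqrt(sum(v * v for v in features))
    return [v / norm for v in features] if norm > 0 else list(features)

def rank_words(query, words):
    """Return (word_id, distance) pairs sorted from most to least similar.

    `words` maps a word id to its feature vector; every vector must have
    the same length as `query`. Smaller distance means higher similarity.
    """
    q = normalize(query)
    scored = [(wid, math.dist(q, normalize(f))) for wid, f in words.items()]
    return sorted(scored, key=lambda pair: pair[1])
```

For example, `rank_words([1, 0], {"a": [2, 0], "b": [0, 3]})` places `"a"` first, because after normalization it coincides with the query.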
Feature extraction & word matching
• Features based on word profile projections
• Features based on zones
• Hybrid features by projections and zones
[Figures: word images plotted on x and y axes, showing the upper boundary, the lower boundary, and the line y=yt]
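To make the three feature families more concrete, here is a minimal sketch on a binary word image stored as rows of 0/1 values. The exact definitions used by the tool are not given in the slides; the upper and lower profile distances, the grid of ink densities, and the plain concatenation used for the hybrid vector are all assumptions of this example.

```python
def projection_features(img):
    """Upper and lower word profiles.

    For each column: the distance from the top edge (resp. bottom edge) to
    the nearest ink pixel, or the image height if the column is empty.
    """
    h, w = len(img), len(img[0])
    upper, lower = [], []
    for x in range(w):
        rows = [y for y in range(h) if img[y][x]]
        upper.append(rows[0] if rows else h)
        lower.append(h - 1 - rows[-1] if rows else h)
    return upper + lower

def zone_features(img, rows=2, cols=2):
    """Ink density in each cell of a rows x cols grid laid over the image."""
    h, w = len(img), len(img[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            area = (y1 - y0) * (x1 - x0)
            ink = sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1))
            feats.append(ink / area if area else 0.0)
    return feats

def hybrid_features(img, rows=2, cols=2):
    """Zone densities followed by projection profiles (assumed combination)."""
    return zone_features(img, rows, cols) + projection_features(img)
```

In practice the word images would first be scaled to a common size so that vectors from different words have the same length.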
Feature extraction & word matching
• Features by centers of masses
User feedback example
[Figure: (a) synthetic query word; (b) initial ranking of segmented words, where the highlighted words denote correct words selected by the user; (c) ranking after the user's feedback.]
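One simple way to use the selected words is to treat them as additional queries and re-rank every segmented word by its distance to the closest reference. This is only a sketch of the general idea; the slides do not state how the tool incorporates feedback, and `rerank_with_feedback` and its arguments are hypothetical names.

```python
import math

def rerank_with_feedback(query, words, selected_ids):
    """Re-rank words by their distance to the closest reference vector.

    The references are the query features plus the features of every word
    the user marked as correct. `words` maps word id -> feature vector.
    """
    references = [query] + [words[wid] for wid in selected_ids]
    scored = [
        (wid, min(math.dist(feats, ref) for ref in references))
        for wid, feats in words.items()
    ]
    return [wid for wid, _ in sorted(scored, key=lambda pair: pair[1])]
```

With an empty `selected_ids` list the function reduces to the initial ranking against the synthetic query alone, so real instances that differ from the synthetic image can only move up once the user has selected at least one of them.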
Searching
• Allow the user to search the image corpus for instances of query keywords that have already undergone the user feedback process.
• The user selects one of the processed keywords and the application shows all the instances of this keyword in the images of the corpus.
• The user can navigate through the results at the instance level (showing one instance at a time) or at the page level (showing all instances on a page).
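The two navigation levels can be pictured as two views over the same list of stored instances. In the sketch below, the `(page_number, bounding_box)` tuple layout and the function name `group_by_page` are assumptions for illustration, not the tool's data model.

```python
from collections import defaultdict

def group_by_page(instances):
    """Group keyword instances by page for page-level browsing.

    `instances` is a list of (page_number, bounding_box) tuples. The result
    maps each page number to the boxes found on it, with pages in
    ascending order.
    """
    pages = defaultdict(list)
    for page, box in instances:
        pages[page].append(box)
    return dict(sorted(pages.items()))
```

Instance-level navigation iterates over the flat `instances` list directly, while page-level navigation iterates over the grouped dictionary and highlights every box on the current page.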
A. L. Kesidis, E. Galiotou, B. Gatos and I. Pratikakis, "A word spotting framework for historical machine-printed documents", International Journal on Document Analysis and Recognition, DOI: 10.1007/s10032-010-0134-4, pp. 1-14, 2010.
A. L. Kesidis, E. Galiotou, B. Gatos, A. Lampropoulos, I. Pratikakis, I. Manolessou and A. Ralli, "Accessing the content of Greek historical documents", 3rd Workshop on Analytics for Noisy Unstructured Text Data (AND'09), pp. 55-62, Barcelona, Spain, July 2009.
Query by Example: main steps
Document pages (applied once)
Query
• Select (by cropping) a query word image from a page
• Apply feature extraction to the query image
Matching
• Match the query features to the features of all segmented words
• Rank the segmented words by similarity
• Return the most similar segmented words
• No user feedback!
Free Text Query: main steps
Document pages (applied once)
Query
• Type the query text
• Construct a synthetic query image using letter templates (provided by an administrator)
• Apply feature extraction to the query image
Matching
• Match the query features to the features of all segmented words
• Rank the segmented words by similarity
• Return the most similar segmented words
• No user feedback!
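Constructing a synthetic query image from letter templates can be sketched as placing the template images side by side. The template format (binary images of equal height, already baseline-aligned), the `spacing` parameter, and the name `synthesize_query` are assumptions of this example rather than details from the slides.

```python
def synthesize_query(text, templates, spacing=1):
    """Build a binary word image by placing letter templates side by side.

    `templates` maps a character to a binary image (rows of 0/1). All
    templates are assumed to share the same height. `spacing` blank
    columns are inserted between consecutive letters.
    """
    missing = [ch for ch in text if ch not in templates]
    if missing:
        raise KeyError(f"no template for: {missing}")
    if not text:
        return []
    height = len(templates[text[0]])
    rows = [[] for _ in range(height)]
    for i, ch in enumerate(text):
        glyph = templates[ch]
        for y in range(height):
            if i > 0:
                rows[y].extend([0] * spacing)  # blank columns between letters
            rows[y].extend(glyph[y])
    return rows
```

The resulting image can then go through the same feature extraction as a cropped word, which is what lets typed text be matched against segmented word images without OCR.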
Two levels: Guest and Administrator

Level          Search  Word Spotting  Query by Example  Free Text Query  User management + settings
Guest          √       -              √                 √                -
Administrator  √       √              √                 √                √
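A two-level scheme like this one can be expressed as a small permission table. The flattened slide only shows that Guest has three of the five permissions; which three is an assumption here, inferred from the task table on the following slide (word spotting with user feedback and user management are administrator tasks). The action names are hypothetical.

```python
# Assumed mapping of access levels to permitted actions (see note above).
PERMISSIONS = {
    "guest": {"search", "query_by_example", "free_text_query"},
    "administrator": {
        "search",
        "word_spotting",
        "query_by_example",
        "free_text_query",
        "user_management",
    },
}

def is_allowed(level, action):
    """Return True if the given access level may perform the action."""
    return action in PERMISSIONS.get(level, set())
```

Unknown levels get an empty permission set, so anything not explicitly granted is denied.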
Task                                        Query by Keyword   Query by Example   Free Text
OFFLINE PREPARATION (ADMINISTRATIVE TASKS)
Page segmentation and features extraction   Admin              Admin              Admin
Keywords definition                         Admin              -                  -
Letter templates definition                 Admin              -                  Admin
Word Spotting by User's feedback            Admin              -                  -
ONLINE USAGE
Searching                                   All Users          All Users          All Users
Document corpus
• French book (153 pages, 47836 words): Le Dernier fils de France, ou le Duc de Normandie, fils de Louis XVI et de Marie-Antoinette, par A., 1838
• German book (126 pages, 24596 words): Aufschlüsse zur Magie aus geprüften Erfahrungen über verborgene philosophische Wissenschaften und verdeckte Geheimnisse der Natur, 1788
Segmentation
• Projections
• RLSA
• USAL1 (Connected components)
• USAL2 (Projections)
Features
• Hybrid (Projections + Zones)
• Center of Masses
Keywords
• 5 keywords per book
Experiments
• Overall 80 experiments
• Each experiment performed without user feedback and with 1, 2, and 3 user-selected words
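The stated total of 80 experiments is consistent with a full grid over the listed options (2 books, 4 segmentation methods, 2 feature sets, 5 keywords per book). The slides do not spell out this breakdown, so the enumeration below is one plausible reading rather than a confirmed description of the setup.

```python
from itertools import product

books = ["French", "German"]
segmentations = ["Projections", "RLSA", "USAL1", "USAL2"]
features = ["Hybrid", "Center of Masses"]
keywords_per_book = 5

# One experiment per (book, segmentation, feature set, keyword) combination.
experiments = list(
    product(books, segmentations, features, range(keywords_per_book))
)

# Each experiment is evaluated with 0 (no feedback), 1, 2, and 3 selected words.
feedback_settings = [0, 1, 2, 3]
evaluations = len(experiments) * len(feedback_settings)
```

Under this reading there are 80 experiments and 320 individual evaluations once the four feedback settings are counted.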
Feature extraction
• Hybrid features gave better results and are faster to compute than Center of Masses
[Figure: average precision vs. recall diagrams of word spotting in relation to feature extraction methods for (a) Book A and (b) Book B.]
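For readers unfamiliar with the evaluation, precision and recall along a ranked result list can be computed as in the sketch below. This is the standard information-retrieval definition; how the curves on the slide were averaged across keywords is not stated, and the function names are chosen for this example.

```python
def precision_recall_curve(ranked_ids, relevant_ids):
    """Return a (recall, precision) pair after each position of a ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return []
    points, hits = [], 0
    for k, wid in enumerate(ranked_ids, start=1):
        if wid in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks where a relevant word appears."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for k, wid in enumerate(ranked_ids, start=1):
        if wid in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0
```

Here `ranked_ids` is the list of segmented words returned for a keyword, best match first, and `relevant_ids` holds the ground-truth instances of that keyword.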
User feedback
• User feedback improved the results
[Figure: average precision vs. recall diagrams of word spotting in relation to the user's feedback involvement for (a) Book A and (b) Book B.]
Segmentation issues
Query by Example + Free Text Query
• In Query by Example, the performance is similar to that of User Feedback when one relevant instance is selected
• In both methods, the results depend on the similarity threshold