A Search Engine for Historical Manuscript Images Toni M. Rath, R. Manmatha and Victor Lavrenko...
A Search Engine for Historical Manuscript
Images
Toni M. Rath, R. Manmatha and Victor Lavrenko
Center for Intelligent Information Retrieval, University of Massachusetts
SIGIR 2004
Introduction
The first known automatic retrieval system for handwritten historical manuscripts
The obvious approach to this problem is to use handwriting recognition, but its error rate is in excess of 50%
This is a system that searches handwritten manuscripts using text queries, without recognition
Introduction
Treat the problem as an image annotation or cross-lingual retrieval problem, in which text words are used to query word images
Learn a statistical relevance model by training on a set of pages that has been transcribed word for word
Two models: Probabilistic Annotation Model and Direct Retrieval Model
Related Work
Obvious approach: handwriting recognition + text search engine
Image annotation:
Duygulu – translation model
Blei – latent Dirichlet allocation model
Jeon – cross-media relevance model (CMRM)
Lavrenko – cross-lingual relevance model
Related Work (2)
Handwriting recognition + text search engine
Advantage: can be used for every English word
Disadvantage: well-known segmentation errors
Image Annotation (Convert)
Related Work (3)
The differences between image annotation and their model:
Uses shape features instead of color and texture features
Does not use clustering or blobs
Learns the relation between features and English text, instead of between blobs and English text
System Overview
Probabilistic Annotation Model
1. Train the relations between features and English words
2. Each word image in the testing set is annotated with every term in the annotation vocabulary and a corresponding probability
3. The results of step 2 are stored in an inverted list for quick access, so typical query times are less than one second
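The inverted list in step 3 can be pictured with a minimal Python sketch. The function names, the image ids, and the toy probabilities are illustrative assumptions, not taken from the slides; the only idea carried over is that each vocabulary term maps to a precomputed list of (word image, probability) pairs, so a query becomes a lookup.

```python
from collections import defaultdict

def build_inverted_index(annotations):
    """Map each term to its (image_id, probability) postings, best first.

    annotations: {image_id: {term: probability}} as produced by an
    annotation step (toy values here, not real system output).
    """
    index = defaultdict(list)
    for image_id, term_probs in annotations.items():
        for term, prob in term_probs.items():
            index[term].append((image_id, prob))
    # Sort once at build time so a query only needs a slice.
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)

def search(index, term, top_n=10):
    """Return the top_n postings for a query term (empty if unseen)."""
    return index.get(term, [])[:top_n]

# Hypothetical annotations for three word images.
annotations = {
    "img_001": {"washington": 0.70, "fort": 0.20, "1755": 0.10},
    "img_002": {"washington": 0.15, "fort": 0.80, "1755": 0.05},
    "img_003": {"washington": 0.40, "fort": 0.10, "1755": 0.50},
}
index = build_inverted_index(annotations)
top_hits = search(index, "washington", top_n=2)
```

Because sorting happens when the index is built, the work at query time is a dictionary lookup plus a slice, which is consistent with the sub-second query times mentioned above.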
System Overview
Direct Retrieval
1. Train the relations between features and English words
2. Use the query to estimate a distribution over the feature vocabulary that one would expect to observe jointly with the query
3. Compare this distribution with the feature-vocabulary distribution of each word image using Kullback-Leibler divergence; in this way all word images in the testing set can be ranked at query time
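A minimal sketch of the ranking in step 3, assuming distributions are plain lists over the same feature vocabulary and that the divergence is taken from the query distribution to each image distribution (the slides do not state the direction). The names and toy numbers are illustrative.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps to
    avoid division by zero (an assumption of this sketch).
    """
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def rank_by_kl(query_dist, image_dists):
    """Return image ids ordered from smallest to largest divergence."""
    scored = [(kl_divergence(query_dist, dist), image_id)
              for image_id, dist in image_dists.items()]
    scored.sort()
    return [image_id for _, image_id in scored]

# Toy three-term feature vocabulary.
query_dist = [0.5, 0.3, 0.2]
image_dists = {
    "a": [0.5, 0.3, 0.2],   # identical to the query distribution
    "b": [0.2, 0.3, 0.5],
    "c": [0.4, 0.4, 0.2],
}
ranking = rank_by_kl(query_dist, image_dists)
```

A smaller divergence means the image's feature distribution is closer to what the query is expected to produce, so it ranks higher.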
Word Image Representation
Simple shape features: such as width and height. A total of 5 such features are used
Fourier coefficients of profile features: detailed descriptions of a word's shape can be obtained with profile features, such as the upper and lower profiles (see the picture); each profile has 7 features, so one word image yields 3x7=21 profile features
In total, each word image has a 21+5=26-dimensional continuous-space feature vector
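As a rough illustration of a profile feature, the sketch below computes an upper profile from a binary word image and keeps 7 low-order Fourier terms. The slides do not say which 7 coefficients are used, so the split into 4 cosine and 3 sine terms is an assumption of this sketch, as are the function names, the handling of empty columns, and the toy image.

```python
import cmath

def upper_profile(image):
    """Per column, the row index of the first ink pixel from the top.

    image: list of rows of 0/1 values (1 = ink). Columns without ink
    get the image height (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    return [next((row for row in range(height) if image[row][col]), height)
            for col in range(width)]

def low_order_dft_features(series, n_cos=4, n_sin=3):
    """Return n_cos real parts (k = 0..n_cos-1) and n_sin imaginary
    parts (k = 1..n_sin) of the normalized DFT of the series."""
    n = len(series)
    def coeff(k):
        return sum(series[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                   for t in range(n)) / n
    cos_part = [coeff(k).real for k in range(n_cos)]
    sin_part = [coeff(k).imag for k in range(1, n_sin + 1)]
    return cos_part + sin_part

# Toy 3x4 binary "word image".
image = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 1, 1],
]
profile = upper_profile(image)              # [2, 1, 0, 2]
features = low_order_dft_features(profile)  # 7 values for this profile
```

Repeating this for three profiles would give the 21 profile features; keeping only low-order terms gives a fixed-length description regardless of word width.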
Word Image Representation
Divide the range of observed values in each feature dimension into 10 bins of equal size, and associate a unique feature vocabulary term with each bin
Repeat the process, this time with 9 bins
Each word image then has 2x26=52 features
The feature vocabulary contains (10+9)x26=494 features
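The counts above can be checked with a small discretization sketch. The term-id layout (all 10-bin terms first, then all 9-bin terms) and the helper names are assumptions made for illustration; only the bin counts and dimensions come from the slides.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value, clamped to range."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def discretize(vector, ranges, bin_counts=(10, 9)):
    """Turn one continuous feature vector into vocabulary term ids.

    ranges: one (min, max) pair of observed values per dimension.
    Each pass over the dimensions uses a different bin count, so every
    dimension contributes one term per pass.
    """
    terms = []
    offset = 0
    for n_bins in bin_counts:
        for dim, value in enumerate(vector):
            lo, hi = ranges[dim]
            terms.append(offset + dim * n_bins + bin_index(value, lo, hi, n_bins))
        offset += n_bins * len(vector)
    return terms

def vocabulary_size(n_dims, bin_counts=(10, 9)):
    """Total number of distinct feature terms."""
    return sum(bin_counts) * n_dims

# A hypothetical 26-dimensional vector with every value in [0, 1].
vector = [0.5] * 26
ranges = [(0.0, 1.0)] * 26
terms = discretize(vector, ranges)
```

Here `len(terms)` is 52 and `vocabulary_size(26)` is 494, matching the 2x26 and (10+9)x26 figures.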
Model Formulation
Probabilistic Annotation Model
w: an English word
f: a feature word
k: number of features per word image (= 52)
I: a word image in the training set
i: position in the training set
|T|: number of words in the training set
Model Formulation
V: vocabulary of the training set
Smoothing
\delta(x \in \{w_i, f_{i,1}, \dots, f_{i,k}\}) = 1 if x \in \{w_i, f_{i,1}, \dots, f_{i,k}\}, else 0
x: w or f
\lambda: smoothing parameter, between 0 and 1
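The slide equations themselves did not survive in this transcript, so the following is only a sketch of how an annotation model with these ingredients could be computed. It assumes a standard linear-interpolation smoothing between the indicator for one training image and the average indicator over the whole training set, and a joint probability averaged over training images; the exact form and all names are assumptions, not the authors' published equations.

```python
def smoothed_prob(x, image_terms, training, lam=0.5):
    """Smoothed P(x | I_i), assumed linear-interpolation form.

    image_terms: set holding the word and feature terms of one training
    image. training: list of such sets. lam plays the role of lambda.
    """
    size = len(image_terms)  # 1 word + k feature terms
    in_image = 1.0 if x in image_terms else 0.0
    background = sum(1.0 for terms in training if x in terms) / len(training)
    return lam * in_image / size + (1.0 - lam) * background / size

def joint_prob(word, features, training, lam=0.5):
    """P(w, f_1..f_k) as an average over training images (assumed form)."""
    total = 0.0
    for terms in training:
        p = smoothed_prob(word, terms, training, lam)
        for f in features:
            p *= smoothed_prob(f, terms, training, lam)
        total += p
    return total / len(training)

def annotate(features, vocabulary, training, lam=0.5):
    """Normalize the joint over the vocabulary V to get P(w | f_1..f_k)."""
    joint = {w: joint_prob(w, features, training, lam) for w in vocabulary}
    norm = sum(joint.values())
    return {w: p / norm for w, p in joint.items()}

# Toy training set: each image is its word plus two feature terms.
training = [
    {"fort", "f1", "f2"},
    {"fort", "f1", "f3"},
    {"camp", "f4", "f5"},
]
vocabulary = ["fort", "camp"]
result = annotate(["f1", "f2"], vocabulary, training)
```

In this toy run the unseen word image with features f1 and f2 is annotated with a higher probability for "fort" than for "camp", because those features co-occur with "fort" in the training set.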
Model Formulation
Direct Retrieval
Q: query word
W: word image
Use (6) to estimate P(Q|W) and P(f|W)
Reordering
A word image can be represented by a vector of 494 entries, 52 of which are 1's
Convert the retrieved images and the training images for the given query to that form
Reordering is performed using the average dot product of the retrieved images and the training images for the given query
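One reading of the reordering step, sketched in Python: each retrieved image is scored by its average dot product with the training images of the query word, and the result list is sorted by that score. The function names and the 6-entry toy vocabulary are assumptions for illustration; the real vectors would have 494 entries.

```python
def to_binary_vector(term_ids, vocab_size=494):
    """Binary vector with a 1 at each feature term id of a word image."""
    vec = [0] * vocab_size
    for term_id in term_ids:
        vec[term_id] = 1
    return vec

def dot(a, b):
    """Dot product; for binary vectors, the number of shared terms."""
    return sum(x * y for x, y in zip(a, b))

def reorder(retrieved, training_examples):
    """Sort retrieved image ids by average dot product, highest first.

    retrieved: {image_id: binary vector}
    training_examples: binary vectors of training images for the query.
    """
    def avg_dot(vec):
        return sum(dot(vec, t) for t in training_examples) / len(training_examples)
    return sorted(retrieved, key=lambda image_id: avg_dot(retrieved[image_id]),
                  reverse=True)

# Toy example with a 6-term vocabulary instead of 494.
training_examples = [to_binary_vector([0, 2, 4], 6),
                     to_binary_vector([0, 2, 5], 6)]
retrieved = {
    "a": to_binary_vector([1, 3, 5], 6),
    "b": to_binary_vector([0, 2, 4], 6),
    "c": to_binary_vector([0, 3, 5], 6),
}
reordered = reorder(retrieved, training_examples)
```

For binary vectors the dot product counts shared feature terms, so images that share more discretized features with known examples of the query word move toward the top.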
Data Collection
George Washington collection at the Library of Congress
Contains 150000 pages
Images were digitized from microfilm at 300 dpi, 8-bit grayscale
From these pages:
Training set: 100 pages (24665 words, vocabulary of 3087)
Testing set: 987 pages (234754 words)
Experimental Eval. - queries
Mixture of proper names, places, nouns, and numbers in the form of a year
Query words occur reasonably frequently in the training set
It is possible that some of the query words do not occur in the test set
Eval. – Word Image Retrieval
A number of words are incorrectly segmented
The direct retrieval model did not retrieve any instances of "deserter" and "disobedience", while the probabilistic annotation model found one instance of "disobedience"
The low turnout may be caused either by insufficient training or by the lack of relevant images in the testing collection
Eval. – Page Image Retrieval
For one-word queries the performance is quite good, even higher than in single-word retrieval without reordering.
For two-word queries the results seem low, but the authors believe that a more thorough evaluation with ground-truth data would yield better results.
Conclusions
Results show that retrieval can be done even while recognition of handwriting remains a challenging task
Adapting statistical relevance models produces good results, but much remains to be done. Better models are needed
Large datasets can be handled either by using a cluster of processors, or by improving the efficiency of both the feature processing and retrieval model stages