
Document Image Retrieval through Word Shape Coding

Shijian Lu, Member, IEEE, Linlin Li, and Chew Lim Tan, Senior Member, IEEE

Abstract—This paper presents a document retrieval technique that is capable of searching document images without optical character recognition (OCR). The proposed technique retrieves document images by a new word shape coding scheme, which captures the document content through annotating each word image by a word shape code. In particular, we annotate word images by using a set of topological shape features including character ascenders/descenders, character holes, and character water reservoirs. With the annotated word shape codes, document images can be retrieved by either query keywords or a query document image. Experimental results show that the proposed document image retrieval technique is fast, efficient, and tolerant to various types of document degradation.

Index Terms—Document image retrieval, document image analysis, word shape coding.


1 INTRODUCTION

With the proliferation of digital libraries and the promise of paperless offices, an increasing number of document images of different qualities are being scanned and archived. Under the traditional retrieval scenario, scanned document images first need to be converted to ASCII text through optical character recognition (OCR) [12]. However, for a huge number of document images archived in digital libraries, the OCR of all of them for the retrieval purpose is wasteful and has been proven prohibitively expensive, particularly considering the arduous post-OCR correction process. In addition, compared with structured representation of documents via OCR, image-based representation of documents is often more intuitive and more flexible because it preserves the physical document layout and nontext components (such as embedded graphics) much better [20]. Under such circumstances, a fast and efficient document image retrieval technique will facilitate the location of the imaged text information, or at least significantly narrow the archived document images down to those of interest.

There is, therefore, a recent trend toward content-based document image retrieval techniques that do not go through the OCR process. A large number of content-based image retrieval techniques [16] have been reported. For the retrieval of document images, the earlier works were often based on character shape coding, which annotates character images by a set of predefined codes. For example, Nakayama annotates character images by seven codes and then uses them for content word detection [7] and document image categorization [6]. Similarly, Spitz et al. take a character shape coding approach for language identification [3], word spotting [8], and document image retrieval [9]. In [11], Tan et al. also propose a character shape coding scheme that annotates character images based on the vertical component cut. In addition, a number of image matching techniques [19], [18] have also been reported for word image spotting. The major limitation of the above character shape coding techniques lies in their sensitivity to character segmentation error. For document images of low quality, the accuracy of the resultant character shape codes is often severely degraded by the character segmentation error resulting from various types of document degradation.

To overcome the limitation of character shape coding, we have proposed a number of word shape coding schemes which treat each word image as a single component and so are much more tolerant to character segmentation error. In our earlier work [21], the vertical bar pattern is used for word shape coding and document image retrieval. In [4], we code word images by using character extremum points, and the resultant word shape codes are then used for language identification. Later, the number of horizontal word cuts is incorporated in [5] and then used for multilingual document image retrieval. In addition, we reported a keyword spotting technique in [10] where each word image is annotated by a primitive string.

This paper presents a new word image annotation technique and its applications to document image retrieval by either query keywords or a query document image. We annotate word images by a set of topological character shape features including character ascenders/descenders, character holes, and character water reservoirs illustrated in Figs. 1b, 1c, and 1d. Compared with the coding schemes reported in our earlier works [4], [5], [21], the word annotation technique presented in this paper has the following advantages: First, it is much faster because it does not require the time-consuming connected component labeling. Second, the character shape features in use are more tolerant to the document skew and the variations in text fonts and text styles. Third and most importantly, its collision rate is much lower because of the distinguishability of the three character shape features in use.

The rest of this paper is organized as follows: Section 2 describes the proposed word image annotation scheme. The proposed document image retrieval techniques are then presented in Section 3. Section 4 then presents and discusses experimental results. Finally, some concluding remarks are drawn in Section 5.

2 WORD IMAGE ANNOTATION

This section presents the proposed word image annotation technique. In particular, we divide this section into three subsections which deal with the document image preprocessing, the word shape feature extraction, and the word image representation, respectively.

2.1 Document Image Preprocessing

Archived document images often suffer from various types of document degradation such as impulse noise and low contrast. Therefore, document images need to be preprocessed so as to extract the character shape features in use properly. In the proposed technique, document images are first smoothed to suppress noise by a simple mean filter within a 3 × 3 window.
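The smoothing step can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows of numbers; the function name `mean_filter_3x3` and the border handling (averaging only over the neighbours that exist) are illustrative assumptions, since the text does not specify them.

```python
def mean_filter_3x3(image):
    """Smooth a grayscale image (list of rows) with a 3 x 3 mean filter.

    Assumption: border pixels are averaged over the neighbours that exist;
    the paper does not state how borders are treated.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    smoothed = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            count = 0
            # Visit the 3 x 3 window centred on (x, y), clipped to the image.
            for ny in range(max(0, y - 1), min(height, y + 2)):
                for nx in range(max(0, x - 1), min(width, x + 2)):
                    total += image[ny][nx]
                    count += 1
            row.append(total / count)
        smoothed.append(row)
    return smoothed
```

A single bright pixel is spread over its window, which is how isolated impulse noise is attenuated before binarization.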

The filtered document images are then binarized. A large number of document binarization techniques [2] have been reported, and we directly make use of Otsu’s global method [1]. After that, words and text lines are located through the analysis of the horizontal and vertical document projection profiles illustrated in Fig. 2. For Latin-based document images, the horizontal projection profile normally shows two peaks at the x line and base line of the text. Besides, due to the blanks between adjacent words within the same text line, some zero-height segments of significant length can also be detected from the vertical projection profile. Word and text line images can thus be located based on the peaks and the zero-height segments of the horizontal and vertical projection profiles, respectively.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 11, NOVEMBER 2008

S. Lu is with the Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), 21 Heng Mui Keng Terrace, Singapore, 119613. E-mail: [email protected].

L. Li and C.L. Tan are with the Department of Computer Science, School of Computing, National University of Singapore, 3 Science Drive 2, Singapore 117543. E-mail: {lilinlin, tancl}@comp.nus.edu.sg.

Manuscript received 12 Sept. 2007; revised 5 Feb. 2008; accepted 24 Mar. 2008; published online 10 Apr. 2008. Recommended for acceptance by J.Z. Wang, D. Geman, J. Luo, and R.M. Gray. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMISI-2007-09-0579. Digital Object Identifier no. 10.1109/TPAMI.2008.89.

0162-8828/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.
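The binarization and projection-profile steps of Section 2.1 can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the helper names (`otsu_threshold`, `nonzero_runs`, `locate_words`) and the `min_word_gap` parameter are hypothetical, the binary image is assumed to use 1 for text pixels and 0 for background, and the detection of the x line and base line peaks is left out.

```python
def otsu_threshold(gray_values):
    """Return the 8-bit threshold t that maximises between-class variance.

    Values <= t form one class and values > t the other (Otsu's method).
    """
    hist = [0] * 256
    for v in gray_values:
        hist[v] += 1
    total = len(gray_values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def nonzero_runs(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of non-zero stretches.

    Stretches separated by fewer than min_gap zeros are merged.
    """
    runs, start, zeros = [], None, 0
    for i, v in enumerate(profile):
        if v > 0:
            if start is None:
                start = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= min_gap:
                runs.append((start, i - zeros + 1))
                start, zeros = None, 0
    if start is not None:
        runs.append((start, len(profile) - zeros))
    return runs


def locate_words(binary, min_word_gap=3):
    """Return word boxes (left, top, right, bottom), right/bottom exclusive.

    binary: list of rows with 1 = text pixel and 0 = background (assumption).
    """
    boxes = []
    row_profile = [sum(row) for row in binary]       # horizontal projection
    for top, bottom in nonzero_runs(row_profile):    # one run per text line
        col_profile = [sum(binary[y][x] for y in range(top, bottom))
                       for x in range(len(binary[0]))]  # vertical projection
        for left, right in nonzero_runs(col_profile, min_word_gap):
            boxes.append((left, top, right, bottom))
    return boxes
```

The `min_word_gap` value plays the role of the "zero-height segments of significant length": shorter gaps are treated as spaces between characters and longer ones as spaces between words.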

2.2 Word Shape Feature Extraction

This section presents the extraction of the three character shape features in use, namely, character ascenders/descenders, character holes, and character reservoirs. Among them, character ascenders and descenders can be simply located based on the observation that they lie above the x line and below the base line of the text, respectively. Character holes and character reservoirs can then be detected through the analysis of character white runs described below.

Scanning vertically (or horizontally) from top to bottom (or from left to right), a character white run can be located by a beginning pixel BP and an ending pixel EP corresponding to “01” and “10” illustrated in Fig. 3 (“1” and “0” denote white background pixels and gray foreground pixels in Fig. 3). As we only need leftward and rightward reservoirs (to be discussed in the next section), we scan word images vertically column by column. Clearly, two vertical white runs from the two adjacent scanning columns are connected if they satisfy the following constraint:

$$ BP_c < EP_a \quad \text{and} \quad EP_c > BP_a, \tag{1} $$

where $[BP_c, EP_c]$ and $[BP_a, EP_a]$ refer to the BP and EP of the white runs detected in the current and adjacent scanning columns. Consequently, a set of connected vertical white runs forms a white run component whose centroid can be estimated as follows:

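$$
\begin{cases}
C_x = \dfrac{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})\, BP_{i,x}}{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})} \\[2ex]
C_y = \dfrac{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})\,(EP_{i,y} + BP_{i,y} + 1)/2}{\sum_{i=1}^{N_r} (EP_{i,y} - BP_{i,y})},
\end{cases}
\tag{2}
$$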

where the denominator gives the number of pixels (component size) within the white run component under study. The numerator instead gives the sum of the x and y coordinates of pixels within the white run component. Parameter $N_r$ refers to the number of white runs within the white run component under study.
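The white run analysis above can be made concrete with a short sketch. It rests on one reading of the definitions, stated here as an assumption: BP is taken as the row of the text pixel in a “01” pair and EP as the row of the white pixel in a “10” pair, so a run covers rows BP+1 to EP and has EP − BP pixels, which is consistent with the weights in (2). The function names and the union-find grouping are illustrative choices, not the authors' code.

```python
def vertical_white_runs(image, x):
    """Return (BP, EP) row pairs for the white runs in column x.

    image: list of rows with 1 = white background, 0 = text (as in Fig. 3).
    Only runs bounded by text above and below are reported (assumption).
    """
    runs, bp = [], None
    for y in range(len(image) - 1):
        upper, lower = image[y][x], image[y + 1][x]
        if upper == 0 and lower == 1:        # "01": a white run begins
            bp = y
        elif upper == 1 and lower == 0 and bp is not None:  # "10": it ends
            runs.append((bp, y))
            bp = None
    return runs


def runs_connected(bp_c, ep_c, bp_a, ep_a):
    """Constraint (1): the runs of two adjacent columns share a row."""
    return bp_c < ep_a and ep_c > bp_a


def white_run_components(image):
    """Group white runs into components; each run is stored as (x, BP, EP)."""
    runs = [(x, bp, ep)
            for x in range(len(image[0]))
            for bp, ep in vertical_white_runs(image, x)]
    parent = list(range(len(runs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (xi, bpi, epi) in enumerate(runs):
        for j in range(i):
            xj, bpj, epj = runs[j]
            if xi - xj == 1 and runs_connected(bpi, epi, bpj, epj):
                parent[find(i)] = find(j)

    groups = {}
    for i, run in enumerate(runs):
        groups.setdefault(find(i), []).append(run)
    return list(groups.values())


def centroid(component):
    """Equation (2): centroid (Cx, Cy) weighted by run length."""
    size = sum(ep - bp for _, bp, ep in component)
    cx = sum((ep - bp) * x for x, bp, ep in component) / size
    cy = sum((ep - bp) * (ep + bp + 1) / 2 for _, bp, ep in component) / size
    return cx, cy
```

The pairwise loop is quadratic in the number of runs, which is acceptable for a single word image but is only meant to keep the sketch short.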

Character holes and character reservoirs can be detected based on the openness and closeness of the detected white run components shown in Figs. 1c and 1d. Generally, a white run component is closed if all neighboring pixels on the left of the first and on the right of the last constituent white run are text pixels. On the contrary, a white run component is open if some neighboring pixels on the left of the first or on the right of the last constituent white run are background pixels. Therefore, a leftward and rightward closed white run component results in a character hole (such as the hole of character “o”). At the same time, a leftward (or rightward) open and rightward (or leftward) closed white run component results in a leftward (or rightward) character reservoir (such as the leftward reservoir of character “a”).
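The open/closed test can be sketched as below. The sketch assumes a component given as a list of runs `(x, BP, EP)` covering rows BP+1 to EP of column x, an image with 1 = background and 0 = text, and that the image border counts as open; these conventions and the function names are illustrative assumptions. If several runs share the leftmost or rightmost column, only one of them is checked, which is a simplification.

```python
def side_is_closed(image, run, dx):
    """True if every pixel next to the run (dx = -1 left, +1 right) is text."""
    x, bp, ep = run
    nx = x + dx
    if nx < 0 or nx >= len(image[0]):
        return False  # assumption: the image border counts as open
    return all(image[y][nx] == 0 for y in range(bp + 1, ep + 1))


def classify_component(image, component):
    """Label a white run component as a hole or a reservoir."""
    first = min(component, key=lambda run: run[0])  # leftmost run
    last = max(component, key=lambda run: run[0])   # rightmost run
    left_closed = side_is_closed(image, first, -1)
    right_closed = side_is_closed(image, last, +1)
    if left_closed and right_closed:
        return "hole"
    if right_closed:
        return "leftward reservoir"   # open on the left only
    if left_closed:
        return "rightward reservoir"  # open on the right only
    return "open on both sides"
```

Under these conventions a ring of text pixels yields a hole, while a shape like “c”, closed on the left and open on the right, yields a rightward reservoir.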

It should be noted that, due to document degradation, there normally exist a large number of tiny concavities along the character stroke boundary. As a result, a large number of character reservoirs of a small depth will be detected by the above vertical scanning process. These small reservoirs are not desired, but they can be identified based on their depth ($N_r$ in (2)) relative to the x height (the distance between the x line and base line of the text shown in Fig. 2). Generally, the relative depth of the undesired reservoirs is much smaller than that of the desired ones. Our experiments show that a relative depth threshold of 0.2 is capable of identifying character reservoirs of a small depth adequately.
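As a small worked example of the depth test, the sketch below compares the number of white runs in a reservoir with the x height. Whether the comparison at exactly 0.2 is inclusive is not stated in the text, so the `>=` here is an assumption, as is the function name.

```python
def is_significant_reservoir(num_runs, x_height, threshold=0.2):
    """Keep a reservoir only if its relative depth reaches the threshold.

    num_runs: number of white runs in the component (N_r in (2)), that is,
    the number of columns the reservoir spans.
    x_height: distance in pixels between the x line and the base line.
    """
    return num_runs / x_height >= threshold
```

For a text line with an x height of 20 pixels, a reservoir spanning 3 columns has a relative depth of 0.15 and is discarded, while one spanning 5 columns (0.25) is kept.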

2.3 Word Image Representation

Each word image can, thus, be annotated by a sequence of character codes for the three types of character shape features. However, not every character (i.e., “mnruvw” in Table 1) has a code and some characters (such as “hlIJL” in Table 1) may share a code with another, while other characters (such as p, b, x in Table 1) may be represented by more than one code. The idea here is to represent a word as a linear sequence of codes rather than representing each and every character in a word. To deal with character segmentation error, we particularly annotate word images by using five shape features including character ascenders/descenders, character holes, and leftward and rightward character reservoirs. We do not use upward and downward reservoirs based on two observations: First, most character segmentation error is due to the touching of two or more adjacent characters at either the x line or base line position but seldom at both the x line and base line positions. Second, a typical touching at the x line or base line position introduces an upward or downward reservoir, which seldom affects leftward or rightward reservoirs.

The five shape features in use are annotated by two types of codes according to their vertical alignment. Particularly, the first type is used when the five shape features have no vertically aligned shape features (such as the hole of “o” and the rightward reservoir of “c”). In this case, the five shape features (i.e., character ascenders/descenders, character holes, and leftward and rightward character reservoirs from left to right) are annotated by “l,” “n,” “o,” “u,” and “c,” respectively. The second type is used when the five shape features have vertically aligned features (such as “e,” whose hole lies right above its rightward reservoir). In this case, the shape feature together with its vertical alignments usually determines a Roman letter uniquely. Under such circumstances, we annotate the shape feature together with its vertical alignments by the uniquely determined Roman letter. It should be noted that we annotate character descenders and leftward reservoirs by “n” and “u” for the first type of codes because both “n” and “u” have no desired shape features and so will not contribute any shape codes.

Fig. 1. The three topological character shape features in use: (a) the sample word image “shape,” (b) character ascenders and descenders, (c) character holes, and (d) character water reservoirs.

Fig. 2. The detection of word and text line images through the analysis of the horizontal and vertical projection profiles and the illustration of the x line and base line of text.

Fig. 3. The illustration of the beginning pixel and ending pixel of horizontal and vertical white runs.

Table 1 shows the proposed coding scheme where 52 Roman letters and numbers 0-9 are annotated by 35 codes. For example, character “b” is annotated by “lo” (the first type of code), indicating a character hole (o) directly on the right of a character ascender (l). Character “a” is coded by itself (the second type of code) because a leftward reservoir right above a character hole uniquely indicates an entity of character “a.” Based on the coding scheme in Table 1, the word image “shape” in Fig. 1a can be represented by a code sequence “slanoe” where “s,” “l,” “a,” “no,” and “e” are converted from the five spelling characters, respectively. It should be noted that character “g” in Table 1 may have two holes with one lying below the base line (for serif “g”) or a single hole lying above a leftward reservoir (for sans serif “g”). However, both feature patterns uniquely indicate the entity of the character “g.”

The proposed word shape coding scheme is tolerant to character segmentation error. For example, though characters “ab” frequently touch at the base line position, they can still be properly annotated by “alo.” As another example, characters “rt” touching at the x line position can be properly annotated as “lc” as well. In addition, though some text fonts such as serif fonts may produce a number of leftward and rightward reservoirs, the depth of a reservoir produced by a serif is normally much smaller than that of the real reservoirs (such as the rightward reservoir of “c”). Therefore, the reservoirs produced by serifs can be simply detected based on their depth relative to the x height as described in Section 2.2.
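The transliteration from a spelling to a word shape code can be sketched with a deliberately partial mapping. Table 1 itself is not reproduced in this text, so the dictionary below contains only entries that the prose states (“shape” gives “slanoe”, “b” gives “lo”, and “mnruvw” have no code) or directly implies (“o” and “c” from the first type of code); every other character is intentionally left out rather than guessed. The names `PARTIAL_CODES` and `word_shape_code` are hypothetical.

```python
# Partial character-to-code mapping, limited to what the text states or
# directly implies. It is NOT the full Table 1.
PARTIAL_CODES = {
    "s": "s", "h": "l", "a": "a", "p": "no", "e": "e",  # "shape" -> "slanoe"
    "b": "lo",                     # ascender followed by a hole
    "o": "o",                      # a single hole (inferred from the text)
    "c": "c",                      # a rightward reservoir (inferred)
    "m": "", "n": "", "r": "", "u": "", "v": "", "w": "",  # no code
}


def word_shape_code(word, codes=PARTIAL_CODES):
    """Concatenate the per-character codes of a word.

    Raises KeyError for characters missing from the partial table instead
    of inventing a code for them.
    """
    parts = []
    for ch in word:
        if ch not in codes:
            raise KeyError(f"no code known for {ch!r} in this partial table")
        parts.append(codes[ch])
    return "".join(parts)
```

Because characters such as “n” and “r” contribute nothing, different spellings can map to the same code, which is the collision effect evaluated in Section 4.1.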

3 DOCUMENT IMAGE RETRIEVAL

Based on the word shape coding scheme described above, the content of document images can be captured by the converted word shape codes. Similar to most content-based image retrieval, document images can then be retrieved by either query keywords or a query document image based on their content similarity.

3.1 Retrieval by Query Keywords

Similar to a Google search, which retrieves Web pages containing the query keywords, our document image retrieval works by matching the codes transliterated from the query keywords against those converted from words within the archived document images. In particular, we define it as a retrieval success if a document image containing any of the query keywords is retrieved. In addition, we define it as a retrieval failure if a document image containing query keywords is not retrieved or a document image containing no query keywords is retrieved. Acting as a prescreening procedure, such retrieval by query keywords significantly narrows the archived document images down to those containing the query keywords, though it may not locate the relevant document images accurately.

For text images, such retrieval by query keywords can be simply adapted for keyword spotting. For the keyword spotting purpose, the word position needs to be determined. In addition, the page number needs to be determined as well because query keywords may appear multiple times at different pages. To locate the query keywords properly, we format each word image W with a unique spelling as a word record as follows:

$$ WR = \left[\, WSC \;\; \langle p_1 \; blx_1 \; bly_1 \; w_1 \; h_1 \rangle \;\cdots\; \langle p_i \; blx_i \; bly_i \; w_i \; h_i \rangle \;\cdots \,\right], \tag{3} $$

where WSC denotes the indexing word shape code converted from W. Terms $p_i$, $blx_i$, $bly_i$, $w_i$, and $h_i$, $i = 1 \cdots n$, specify the page number, the position ($blx_i$ and $bly_i$ give the x and y coordinates of the bottom-left corner of the word), and the size ($w_i$ and $h_i$ refer to the word width and height) of the ith occurrence of W, respectively. In our implemented system, all word records are stored within a table where each record is indexed by the corresponding word shape code. Word images can, thus, be located if their indexing word shape codes match those transliterated from the query keywords.
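The word record in (3) maps naturally onto a small keyed table. The sketch below is one possible layout, not the authors' system: the class names `Occurrence`, `WordRecord`, and `WordIndex` are hypothetical, and words with different spellings that happen to share a code end up in the same record, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Occurrence:
    """One occurrence <p, blx, bly, w, h> of a word."""
    page: int
    blx: int     # x coordinate of the bottom-left corner
    bly: int     # y coordinate of the bottom-left corner
    width: int
    height: int


@dataclass
class WordRecord:
    """Word record WR: a word shape code plus all of its occurrences."""
    wsc: str
    occurrences: list = field(default_factory=list)


class WordIndex:
    """Table of word records indexed by word shape code."""

    def __init__(self):
        self._records = {}

    def add(self, wsc, occurrence):
        # Create the record on first sight of the code, then append.
        record = self._records.setdefault(wsc, WordRecord(wsc))
        record.occurrences.append(occurrence)

    def lookup(self, query_code):
        """Return the occurrences whose indexing code matches the query."""
        record = self._records.get(query_code)
        return list(record.occurrences) if record else []
```

A query keyword is first transliterated into its word shape code, and the returned occurrences give the page and bounding box of each candidate match.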

3.2 Retrieval by a Query Document Image

Similar to content-based retrieval of images from an image database, archived document images can also be retrieved by a query document image according to the content similarity based on our proposed word shape coding. To evaluate the document similarity, we first convert a document image into a document vector. Particularly, each document vector element is composed of two components including a word shape component and a word frequency component:

$$ D = \left[ (WSC_1 : WON_1), \ldots, (WSC_N : WON_N) \right], \tag{4} $$

where N is the number of unique words within the document image under study. $WSC_i$ and $WON_i$ denote the word shape and word frequency components, respectively.

TABLE 1. Codes of 52 Roman Letters and Numbers 0-9 by Using the Three Proposed Shape Features

The document vector construction process can be summarized as follows: Given a word shape code converted from a word within the document image under study, the corresponding document vector is searched for the element with the same word shape code component. If such an element exists, the word frequency component of that document vector element is increased by one. Otherwise, a new document vector element is created and the corresponding word shape and word frequency components are initialized with the converted word shape code and one, respectively. The conversion process terminates when all words within the document image under study have been converted and examined as described above. Finally, to compensate for the variable document length, the frequency component of the converted document vector elements is normalized by dividing by the number of words within the document image under study.

The similarity between two document images can, thus, be evaluated based on the frequency component of their document vectors. In particular, the similarity between the two document vectors $DV_1$ and $DV_2$ can be evaluated by using the cosine measure as follows:

$$ \mathrm{sim}(DV_1, DV_2) = \frac{\sum_{i=1}^{V} DVF_{1,i} \cdot DVF_{2,i}}{\sqrt{\sum_{i=1}^{V} (DVF_{1,i})^2 \cdot \sum_{i=1}^{V} (DVF_{2,i})^2}}, \tag{5} $$

where V defines the vocabulary size, which is equal to the number of unique word shape codes within $DV_1$ and $DV_2$. $DVF_{1,i}$ and $DVF_{2,i}$ specify the word frequency information. In particular, if the word shape code under study finds a match within $DV_1$ and $DV_2$, $DVF_{1,i}$ and $DVF_{2,i}$ are determined as the corresponding word frequency components (WON component). Otherwise, both are simply set at zero.

It should be noted that documents normally contain a large number of stop words which greatly affect the document similarity because they frequently dominate the direction of the converted document vectors. Therefore, stop words must be removed from the converted document vectors before the document similarity evaluation. In our proposed technique, we simply utilize the stop words provided by the Cross-Language Evaluation Forum (CLEF) [13]. In particular, all listed stop words are first transliterated into a stop word template according to the coding scheme in Table 1. The converted document vectors are then updated by removing elements that share the word shape component with the constructed stop word template.
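The steps of Section 3.2 (vector construction, stop word removal, and the cosine measure in (5)) can be sketched as follows. This is a minimal illustration under stated assumptions: a document is given as a list of word shape codes, the stop word template is a set of codes, and the function names are hypothetical. The sketch uses the standard cosine measure, in which a code missing from one vector contributes zero for that vector only; the text's remark that "both are simply set at zero" may describe a slightly different treatment, so this is an interpretation rather than the authors' exact formula.

```python
import math
from collections import Counter


def document_vector(word_codes, stop_codes=frozenset()):
    """Map each word shape code to its normalized frequency.

    Frequencies are divided by the total number of words in the document,
    and elements whose code appears in the stop word template are removed.
    """
    total = len(word_codes)
    if total == 0:
        return {}
    counts = Counter(word_codes)
    return {code: n / total
            for code, n in counts.items()
            if code not in stop_codes}


def cosine_similarity(dv1, dv2):
    """Cosine measure over the union vocabulary of two document vectors."""
    vocabulary = set(dv1) | set(dv2)
    dot = sum(dv1.get(code, 0.0) * dv2.get(code, 0.0) for code in vocabulary)
    norm = math.sqrt(sum(v * v for v in dv1.values())
                     * sum(v * v for v in dv2.values()))
    return dot / norm if norm else 0.0
```

Since the cosine measure is invariant to scaling, the length normalization does not change the similarity value itself; it mainly keeps the stored frequencies comparable across documents of different lengths.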

4 EXPERIMENTAL RESULTS

This section evaluates the performance of the proposed word image annotation and document image retrieval techniques. Throughout the experiments, we use 252 text documents selected from the Reuters-21578 collection [15], where each group of 63 documents deals with one specific topic.

4.1 Coding Performance

The proposed document retrieval techniques depend heavily on the performance of the proposed word shape coding scheme. To retrieve a document image properly, the collision rate (frequency of words that have different spellings but share the same word shape code) of the word shape coding scheme should be as low as possible. In addition, the coding scheme should be tolerant to various types of document degradation. In our experiments, we particularly compare our word shape coding scheme with Spitz’s [3] and our earlier coding schemes that use character extremum points [5] and a vertical bar pattern [21], respectively.

We test the coding collision rate by using a dictionary that is composed of 57,000 English words. First, the 57,000 English words are transliterated into word shape codes according to our proposed word shape coding scheme and the other three. The coding collision rates are then calculated, and the results are shown in Table 2. As Table 2 shows, our proposed word shape coding scheme significantly outperforms the other three in terms of the coding collision rate. Such experimental results can be explained by the fact that our coding scheme annotates 26 lowercase Roman letters by 18 codes, while the other three comparison schemes annotate 26 lowercase Roman letters by 6 [3], 9 [21], and 13 [5] codes, respectively.
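A collision rate of this kind can be computed as sketched below. The exact definition behind Table 2 is not given in this text, so the sketch assumes one plausible reading: the fraction of distinct spellings whose code is shared with at least one other spelling. The function name and the `encode` callback are hypothetical, and any coding scheme can be passed in.

```python
from collections import defaultdict


def word_collision_rate(words, encode):
    """Fraction of distinct words that share their code with another word.

    words: iterable of spellings; encode: function from spelling to code.
    """
    by_code = defaultdict(set)
    for word in words:
        by_code[encode(word)].add(word)
    unique_count = sum(len(group) for group in by_code.values())
    if unique_count == 0:
        return 0.0
    colliding = sum(len(group) for group in by_code.values() if len(group) > 1)
    return colliding / unique_count
```

With a toy encoder that simply drops the characters “mnruvw” and keeps all others unchanged, the words “on” and “or” collide while “cat” and “car” do not, giving a rate of 0.5 over these four words.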

The coding robustness is then tested by the 252 text documents

described above. For each text document, five test document

images are first created including: 1) a synthetic image created by

Photoshop, 2-3) two noisy images by adding impulse noise

ðnoise level ¼ 0:05Þ and Gaussian noise ð� ¼ 0:08Þ to the synthetic

image, and 4-5) two real images scanned at 600 dots per inch (dpi)

and 300 dpi, respectively. Therefore, five sets of test document

images are created where each set is composed of 252 document

images. After that, words within the five sets of document images

are converted into word shape codes by using the four word image

shape coding schemes. Table 3 shows the coding accuracy under

various types of document degradation.

As Table 3 shows, Spitz’s character shape coding scheme is the most accurate when the document image is synthetic. However, for document images scanned at a low resolution, the accuracy of Spitz’s coding scheme drops severely because of the dramatic increase of character segmentation error. In addition, compared with our earlier word shape coding schemes [5], [4], [21], the word shape coding scheme presented in this paper is more tolerant to noise. Furthermore, the proposed coding scheme is fast. It is 5-8 times faster than our earlier coding schemes [5], [21] and up to 15 times faster than OCR (evaluated by Omnipage [14]). The speed advantage can be explained by the fact that our word shape coding scheme needs neither time-consuming connected component labeling nor complicated postprocessing.
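The two synthetic degradations used to build the noisy test sets can be approximated as follows. This is only a sketch under stated assumptions: images are taken to be grayscale arrays with intensities in [0, 1], the impulse noise is modeled as salt-and-pepper noise affecting a fraction `noise_level` of the pixels, and the Gaussian parameter 0.08 is read as a standard deviation. The function names are hypothetical, and the actual noise model used in the experiments may differ.

```python
import random


def add_impulse_noise(image, noise_level=0.05, rng=None):
    """Set roughly a fraction noise_level of the pixels to black or white."""
    rng = rng or random.Random()
    noisy = [row[:] for row in image]  # copy so the input stays untouched
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < noise_level:
                row[x] = 0.0 if rng.random() < 0.5 else 1.0
    return noisy


def add_gaussian_noise(image, sigma=0.08, rng=None):
    """Add zero-mean Gaussian noise and clip the result back to [0, 1]."""
    rng = rng or random.Random()
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


rng = random.Random(0)
clean = [[1.0] * 64 for _ in range(64)]  # a blank white page as a stand-in
impulse = add_impulse_noise(clean, 0.05, rng)
gaussian = add_gaussian_noise(clean, 0.08, rng)
```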

1916 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 11, NOVEMBER 2008

TABLE 2. Collision Rates of the Four Coding Schemes that Are Evaluated Based on 57,000 English Words (WCR: Word Collision Rate)


4.2 Retrieval by Query Keywords

The performance of the retrieval by query keywords is then evaluated. First, 137 frequent words are selected from the 252 text documents as query keywords. The retrieval is then conducted over the five sets of test document images described above. In our experiments, the retrieval performance is evaluated by precision (P), recall (R), and the F1 rating [17] defined as follows:

$$P = \frac{\text{No. of correctly searched words}}{\text{No. of all searched words}}, \qquad R = \frac{\text{No. of correctly searched words}}{\text{No. of all correct words}}, \qquad F_1 = \frac{2RP}{R + P}, \tag{6}$$

where the retrieval precision (P) and recall (R) are each averaged over all occurrences of the 137 selected keywords.
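The measures in (6) reduce to a few lines of arithmetic. The sketch below is a minimal illustration; the function name, the argument names, and the example counts are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(num_correct_retrieved, num_retrieved, num_relevant):
    """Compute P, R, and F1 as defined in (6), guarding against empty sets."""
    precision = num_correct_retrieved / num_retrieved if num_retrieved else 0.0
    recall = num_correct_retrieved / num_relevant if num_relevant else 0.0
    denom = precision + recall
    f1 = 2 * recall * precision / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 50 word images returned for a keyword, 45 of them
# correct, while the keyword actually occurs 60 times in the collection.
p, r, f1 = precision_recall_f1(
    num_correct_retrieved=45, num_retrieved=50, num_relevant=60
)
```

Because F1 is the harmonic mean of P and R, it stays low whenever either of the two is low, which makes it a convenient single figure for comparing coding schemes.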

Table 4 shows the experimental results, where the retrieval precisions and recalls are evaluated based on the number of word images searched by using the 137 query keywords. As Table 4 shows, our proposed word shape coding scheme consistently outperforms the other three in terms of retrieval precision, recall, and F1. These results are consistent with the coding performance described in Section 4.1.

4.3 Retrieval by a Query Document Image

The retrieval by a query document image is also evaluated based on the five sets of document images described in Section 4. Instead of designing retrieval experiments, we just evaluate the similarity between document images of the same and different topics. This is based on the belief that document images can be ranked properly if their topic similarity can be gauged properly. In addition, the similarity between the 252 ASCII text documents is also evaluated to verify the performance of the proposed document image retrieval technique.

In our experiments, the five sets of test images are first converted into document vectors. The similarity among them is then evaluated as described in Section 3.2. In particular, the similarity between documents of the same topic (315 images created from the 63 text documents of one specific topic described in Section 4.1) is evaluated as follows:

$$\mathrm{Sim} = \frac{2}{M(M-1)} \sum_{i=1}^{M} \sum_{j=1}^{M} \mathrm{sim}(DV_i, DV_j), \quad \forall i, j : j > i, \tag{7}$$

where $M$ is the number of the document images of the same topic. $DV_i$ and $DV_j$ denote the document vectors (stop words removed) of two document images of the same topic under study. The function $\mathrm{sim}(\cdot)$ refers to the cosine similarity defined in (5). The similarity between document images of two different topics (630 document images with each 315 created from 63 text documents that deal with one specific topic) is evaluated as follows:

$$\mathrm{Sim} = \frac{1}{M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \mathrm{sim}(DV_i, DV_j), \tag{8}$$

where $DV_i$ and $DV_j$ denote the document vectors of two different topics instead.
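The two topic-level similarities can be sketched as below. Several points are assumptions rather than details given in this section: document vectors are represented as dictionaries mapping a word shape code to its frequency, the cosine similarity of (5) is taken to be the standard cosine between such vectors, (7) is read as the average over the M(M-1)/2 distinct pairs with j > i, and all function names and example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity between two sparse frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_topic_similarity(vectors):
    """Average similarity over all distinct pairs (j > i), as in (7)."""
    m = len(vectors)
    if m < 2:
        raise ValueError("at least two document vectors are required")
    total = sum(
        cosine_similarity(vectors[i], vectors[j])
        for i in range(m)
        for j in range(i + 1, m)
    )
    return 2.0 * total / (m * (m - 1))


def cross_topic_similarity(vectors_a, vectors_b):
    """Average similarity over all cross-topic pairs; equals (8) when both
    topics contain the same number M of document vectors."""
    total = sum(cosine_similarity(a, b) for a in vectors_a for b in vectors_b)
    return total / (len(vectors_a) * len(vectors_b))


# Hypothetical document vectors keyed by word shape code.
topic_a = [{"AxD": 2, "xxA": 1}, {"AxD": 1, "xxA": 1}]
topic_b = [{"AAx": 3}, {"AAx": 1, "xxA": 1}]
same = same_topic_similarity(topic_a)
cross = cross_topic_similarity(topic_a, topic_b)
```

In this toy example the same-topic average is higher than the cross-topic average, which is the kind of separation that makes ranking by document-vector similarity meaningful.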

The upper part of Table 5 shows the similarity between document images of the same and different topics. Clearly, the topic similarity between documents of the same topic is much larger than that between documents of different topics. Archived document images can, therefore, be ranked based on the similarity between their document vectors and the query document vector. In addition, the similarity among the 252 text documents is also evaluated where document vectors are constructed by using the ASCII text [12]. The lower part of Table 5 shows the evaluated document similarity. As Table 5 shows, the topic similarities evaluated by the proposed technique are close to those evaluated over the ASCII text, indicating that the proposed technique captures the document topics properly. In addition, it also indicates that the proposed document retrieval technique is comparable to OCR + Search, whose performance should be more or less (depending on OCR error) lower than that directly evaluated over ASCII text.

5 CONCLUSION

This paper has reported a document image retrieval technique that searches document images by either query keywords or a query document image. A novel word image annotation technique is presented which captures the document content by converting each word image into a word shape code. In particular, we convert word images by using a set of topological character shape features including character ascenders/descenders, character holes, and character water reservoirs. Experimental results show that the proposed word image annotation technique is fast, robust, and capable of retrieving imaged documents effectively.

ACKNOWLEDGMENTS

This research is supported by the Agency for Science, Technology, and Research (A*STAR), Singapore, under Grant 0421010085.

REFERENCES


TABLE 4. Performance of the Retrieval by Query Keywords Evaluated by Precision, Recall, and F1 Parameter

TABLE 3. Accuracy of the Spitz’s Character Shape Coding Scheme [3], Our Earlier Two Word Shape Coding Schemes [5], [21], and the Word Shape Coding Scheme Presented in This Paper


[1] N. Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.

[2] O.D. Trier and T. Taxt, “Evaluation of Binarization Methods for Document Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 3, pp. 312-315, Mar. 1995.

[3] A.L. Spitz, “Determination of Script and Language Content of Document Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 235-245, Mar. 1997.

[4] S. Lu and C.L. Tan, “Script and Language Identification in Noisy and Degraded Document Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 1, pp. 14-24, Jan. 2008.

[5] S. Lu and C.L. Tan, “Retrieval of Machine-Printed Latin Documents through Word Shape Coding,” Pattern Recognition, vol. 41, no. 5, pp. 1816-1826, 2008.

[6] T. Nakayama, “Content-Oriented Categorization of Document Images,” Proc. Int’l Conf. Computational Linguistics (COLING ’96), pp. 818-823, 1996.

[7] T. Nakayama, “Modeling Content Identification from Document Images,” Proc. Fourth Conf. Applied Natural Language Processing (ANLP ’94), pp. 22-27, 1994.

[8] A.L. Spitz, “Using Character Shape Codes for Word Spotting in Document Images,” Shape, Structure and Pattern Recognition, pp. 382-389. World Scientific, 1995.

[9] A.F. Smeaton and A.L. Spitz, “Using Character Shape Coding for Information Retrieval,” Proc. Fourth Int’l Conf. Document Analysis and Recognition (ICDAR ’97), pp. 974-978, 1997.

[10] Y. Lu and C.L. Tan, “Information Retrieval in Document Image Databases,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1398-1410, Nov. 2004.

[11] C.L. Tan, W. Huang, Z. Yu, and Y. Xu, “Image Document Text Retrieval without OCR,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 6, pp. 838-844, June 2002.

[12] G. Salton, Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[13] http://www.unine.ch/info/clef/, 2008.

[14] http://www.nuance.com/omnipage/, 2008.

[15] http://kdd.ics.uci.edu/databases/reuters21578, 2008.

[16] M. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-Based Multimedia Information Retrieval: State-of-the-Art and Challenges,” ACM Trans. Multimedia Computing, Comm., and Applications, vol. 2, no. 1, pp. 1-19, 2006.

[17] Y. Yang and X. Liu, “A Re-Examination of Text Categorization Methods,” Proc. 22nd Ann. Int’l ACM Conf. Research and Development in Information Retrieval (SIGIR ’99), pp. 42-49, 1999.

[18] S. Khoubyari and J.J. Hull, “Keyword Location in Noisy Document Image,” Proc. Second Ann. Symp. Document Analysis and Information Retrieval (SDAIR ’93), pp. 217-231, 1993.

[19] F.R. Chen, D.S. Bloomberg, and L.D. Wilcox, “Spotting Phrases in Lines of Imaged Text,” Proc. SPIE Conf. Document Recognition II, pp. 256-269, 1995.

[20] T.M. Breuel, “The Future of Document Imaging in the Era of Electronic Documents,” Proc. Int’l Workshop Document Analysis, pp. 275-296, 2005.

[21] C.L. Tan, W. Huang, Z. Yu, and Y. Xu, “Text Retrieval from Document Images Based on Word Shape Analysis,” Applied Intelligence, vol. 18, no. 3, pp. 257-270, 2003.


TABLE 5. Similarities between Documents of the Same and Different Topics