Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate...

19
Pattern Recognition 35 (2002) 875–893 www.elsevier.com/locate/patcog Segmentation of touching and fused Devanagari characters Veena Bansal, R.M.K. Sinha Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur 208016, India Received 9 December 1997; accepted 10 April 2001 Abstract Devanagari script is a two dimensional composition of symbols. It is highly cumbersome to treat each composite character as a separate atomic symbol because such combinations are very large in number. This paper presents a two pass algorithm for the segmentation and decomposition of Devanagari composite characters= symbols into their constituent symbols. The proposed algorithm extensively uses structural properties of the script. In the rst pass, words are segmented into easily separable characters= composite characters. Statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second pass, the hypothesized composite characters are further segmented. A recognition rate of 85 percent has been achieved on the segmented conjuncts. The algorithm is designed to segment a pair of touching characters. ? 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Devanagari script; Character=text recognition; Prototype construction; Character fusion; Character fragmentation; Character segmentation=decomposition 1. Introduction Segmentation of a document into lines and words, and of words into individual characters and symbols consti- tute an important task in the optical reading of texts. Presently, most recognition errors are due to character segmentation errors [1]. Very often, adjacent characters are touching, and may exist in an overlapped eld [1]. Therefore, it is a complex task to segment a given word correctly into its character components. In this paper, we have considered the problem of seg- menting printed text in Devanagari. Devanagari is the script for Hindi (the ocial language of India), Sanskrit, Marathi and Nepali languages, among others. It is used by more than 400 million people across the globe. De- vanagari is a derivative of ancient Brahmi, the mother of all Indian scripts. It is a logical script composition of Corresponding author. Tel.: +91-512-597174; fax: +91- 512-590280. E-mail address: [email protected] (R.M.K. Sinha). symbols in two dimensions. The two dimensional compo- sition has to be decomposed into symbols that are mean- ingful in the script. 2. Background There has always been a dilemma in a segmentation task whether to segment rst and then classify, or clas- sify while segmenting. Casey and Lecolinet [2] while re- viewing strategies for segmentation have argued that the segmentation is interdependent with the local decisions of shape similarity and with global contextual acceptabil- ity. They conclude that the strategies for segmentation can be classied into three pure elemental strategies as follows: 1. The classical approach where segments are identied based on character-like features; 2. Recognition-based segmentation, in which a search is made for image components that match with the char- acter classes in the alphabet; and 0031-3203/02/$22.00 ? 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII:S0031-3203(01)00081-4

Transcript of Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate...

Page 1: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

Pattern Recognition 35 (2002) 875–893www.elsevier.com/locate/patcog

Segmentation of touching and fused Devanagari characters

Veena Bansal, R.M.K. Sinha ∗

Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur 208016, India

Received 9 December 1997; accepted 10 April 2001

Abstract

Devanagari script is a two dimensional composition of symbols. It is highly cumbersome to treat each compositecharacter as a separate atomic symbol because such combinations are very large in number. This paper presents atwo pass algorithm for the segmentation and decomposition of Devanagari composite characters=symbols into theirconstituent symbols. The proposed algorithm extensively uses structural properties of the script. In the 1rst pass, wordsare segmented into easily separable characters=composite characters. Statistical information about the height and widthof each separated box is used to hypothesize whether a character box is composite. In the second pass, the hypothesizedcomposite characters are further segmented. A recognition rate of 85 percent has been achieved on the segmentedconjuncts. The algorithm is designed to segment a pair of touching characters. ? 2002 Pattern Recognition Society.Published by Elsevier Science Ltd. All rights reserved.

Keywords: Devanagari script; Character=text recognition; Prototype construction; Character fusion; Character fragmentation;Character segmentation=decomposition

1. Introduction

Segmentation of a document into lines and words, andof words into individual characters and symbols consti-tute an important task in the optical reading of texts.Presently, most recognition errors are due to charactersegmentation errors [1]. Very often, adjacent charactersare touching, and may exist in an overlapped 1eld [1].Therefore, it is a complex task to segment a given wordcorrectly into its character components.In this paper, we have considered the problem of seg-

menting printed text in Devanagari. Devanagari is thescript for Hindi (the o<cial language of India), Sanskrit,Marathi and Nepali languages, among others. It is usedby more than 400 million people across the globe. De-vanagari is a derivative of ancient Brahmi, the motherof all Indian scripts. It is a logical script composition of

∗ Corresponding author. Tel.: +91-512-597174; fax: +91-512-590280.E-mail address: [email protected] (R.M.K. Sinha).

symbols in two dimensions. The two dimensional compo-sition has to be decomposed into symbols that are mean-ingful in the script.

2. Background

There has always been a dilemma in a segmentationtask whether to segment 1rst and then classify, or clas-sify while segmenting. Casey and Lecolinet [2] while re-viewing strategies for segmentation have argued that thesegmentation is interdependent with the local decisionsof shape similarity and with global contextual acceptabil-ity. They conclude that the strategies for segmentationcan be classi1ed into three pure elemental strategies asfollows:

1. The classical approach where segments are identi1edbased on character-like features;

2. Recognition-based segmentation, in which a search ismade for image components that match with the char-acter classes in the alphabet; and

0031-3203/02/$22.00 ? 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.PII: S0031-3203(01)00081-4

Page 2: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

876 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

3. A holistic approach that attempts to recognize the wordas a whole.

Our approach to Devanagari script segmentation is ahybrid of the classical and recognition-based segment-ation.Many algorithms have been proposed for segmenta-

tion of touching characters [1,3–6] for the Roman script.All of these algorithms generate multiple segmentationpoints.Most of these methods have been investigated for the

Roman script. The OCR work on Indian scripts [7–11]have not considered the problem of touching and fusedcharacter segmentation. Although algorithms for prelim-inary segmentation and isolation of shadow charactershave been developed earlier [8] and similar techniqueshave been used by Choudhary et al. [12–14], no at-tempt has been made so far in isolating the touching andfused composite characters of Devanagari script. Thiswork represents the 1rst of its kind, to the best of ourknowledge, in this direction.In Section 3, an example illustration of the segmen-

tation process is presented. This is followed by a briefdescription of Devanagari script features that are relevantfrom an OCR viewpoint in Section 4.This is followed by a presentation of the details of the

preliminary segmentation process in Section 5. In Sec-tion 6, the algorithm for isolation of touching charac-ters is presented. Isolation of shadow character pair andlower modi1er symbols are discussed in Sections 7 and8, respectively. In Section 9, we present results of ourinvestigation. This is followed by some conclusions thatcan be drawn from this work. The conclusions are givenin Section 10.

3. An illustration of Devanagari segmentation process

Let us introduce Devanagari script segmentation pro-cess with the help of the word that is pronouncedas ‘unmulan’. The word image is shown in Fig. 1(a). Thesubimages after ideal segmentation of the example wordis shown in Fig. 1(b). The most dominating horizontalline in the word is referred to as the header line that gluesthe characters of the word together. Let us remove itfrom the word. The resulting image is shown in Fig. 1(c).Now, we extract the subimages that are separated fromtheir neighbours by vertical white space. These subim-ages are shown in Fig. 1(d). These subimages may con-tain more than one connected component. For instance,the second subimage has two connected components.The second subimage of Fig. 1(d) that is replicated inFig. 1(e) has a lower modi1er. We now separate thelower modi1er and extract the vertically separate subim-ages. The resulting subimages are shown in Fig. 1(f).However, the subimage shown in Fig. 1(g) contains two

consonants that are touching each other. Such a conso-nant pair is referred to as conjunct, touching charactersor fused characters. There are over 100 conjuncts cur-rently in use in Devanagari script. An optical characterrecognizer (OCR) can treat conjuncts as a unit or segmentit to yield the constituent consonants. Since conjunct isa logical composition of its constituent consonants, itmay not be logical to treat it as a unit. We segment allconjunct images into their constituent consonant images.The segmentation of the subimage of Fig. 1(g) is shownin Fig. 1(h).The segmentation process that extracts constituent

symbols images from a word for Devanagari script thusneeds to perform the following tasks:

1. Remove the header line.2. Extract subimages that are vertically separate from

their neighbours. These subimages may contain morethan one connected component.

3. Select the subimages that need further segmentationdue to the presence of lower modi1ers. A selectioncriterion is required for selecting these.Wedescribe our algorithms for tasks 1–3 in Section 5.

4. Separate the lower modi1ers from the subimagesselected in step 3.We describe our algorithms for task 4 in Section 8.

5. Select the subimages that contain conjuncts or shadowcharacters. A selection criterion is required for select-ing these.We describe our algorithms for task 5 in Section 5.

6. Segment the conjunct subimages selected in step 5 intoconstituent consonant subimages.We describe our algorithms for task 6 in Section 6.

7. Segment the subimages of shadow characters selectedin step 5 into constituent character subimages.We describe our algorithms for task 7 in Section 7.

One may note that the nature of segmentation requiredfor Devanagari script is diJerent from that of Romanscript. In Devanagari script, about 5 percent of the char-acters are conjuncts. The percentage of the consonantsthat have lower modi1ers is also about 5 percent. In Ro-man script, ink spread causes merging=touching of char-acters. Whereas, every Devanagari document, includingthe ones with absolutely no ink spread, will have about10 percent touching characters.Our segmentation process consists of two distinct

stages. In the 1rst stage, a preliminary segmentationis performed that executes line and word segmentationfollowed by tasks 1 and 2. This process leads to theisolation of subimages corresponding to characters thatare vertically separate from their neighbours. Thesesubimages may contain more than one connected com-ponent. Statistical information about height and widthof the subimages obtained by the preliminary segmen-

Page 3: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 877

Fig. 1. Image of a Devanagari word and its segmentation (please refer to Section 3 for the discussion).

tation is used to mark the subimages that need furthersegmentation, i.e. tasks 3 and 5. Some of the markedcharacter boxes may contain characters under shadow ofeach other. A character is said to be under the shadowof another character when the characters do not physi-cally touch each other but it is not possible to separatethem merely by drawing a vertical line. Some of theimage boxes contain lower modi1ers and some maycontain conjuncts. All of these image boxes are seg-mented in the second stage, i.e. tasks 4 and 6 of thelist. In our present OCR system, the second stage isinvoked only after the classi1cation process has triedto classify the image boxes and has failed to clas-sify them into any of the known classes. This type ofstrategy is referred to as loose coupling between seg-mentation and classi1cation [2]. Our approach reliesheavily on the structural properties of the individualcharacters and on observation about the nature of thecharacters that touch and fuse, i.e. 10 percent of thecharacters that are normally encountered in printedDevanagari text.

4. Some relevant features of Devanagari script fromOCR viewpoint

Devanagari script has about 11 vowels and 33 con-sonants. The vowels and the consonants are shown inFigs. 2(a) and (d), respectively. One may note the signif-icant diJerence in the dimensions of these characters. InEnglish as well as in Hindi, the vowels are used in twoways:

1. They are used to produce their own sounds. For in-stance, in word insurance in English, the letter i isused for producing its own sound.The vowels shown in Fig. 2(a) are used for this pur-pose in Devanagari.

2. They are used to modify the sound of a consonant. Forinstance, in word his in English, the letter i is used formodifying the sound of preceding consonant h.

However, instead of juxtaposing the vowel to modify thesound of the preceding consonant as in case of English,

Page 4: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

878 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 2. Characters and symbols of Devanagari script.

a diJerent mechanism is used in Devanagari. There is asymbol corresponding to each vowel that we refer to asmodi1er symbol. The modi1ers symbols correspondingto the vowels are shown in Fig. 2(b). In order to modifythe sound of a consonant, we attach an appropriate mod-i1er in an appropriate manner to the consonant. In En-glish, one can write more than one vowel following a con-sonant, i.e. oo; ie; au; ee. In Devanagari, only one vowelmodi1er can be attached to a consonant at any time. Thevowel set and the corresponding modi1er set is richer inDevanagari. It contains single vowels and correspondingmodi1ers for producing sounds corresponding to Englishvowel sequences ea; oo; ie; au; ee, etc. Each modi1er hasbeen attached to the 1rst consonant of the script (seeFig. 2(c)). A visual inspection of Fig. 2(c) reveals thatsome of the modi1er symbols are placed next to the con-sonant (core modi1ers), some above (top modi1ers) andsome are placed below (lower modi1ers) the consonant.Some of the modi1ers contain a core modi1er and a top

modi1er, the core modi1er is placed before or next tothe consonant; the top modi1er is placed above the coremodi1er.In Devanagari, a consonant is pronounced as a

sequence of its own sound and the implied sound cor-responding to vowel . Whenever we want to write aconsonant that is deprived of the implied vowel sound,we use the pure form of the consonant. Devanagariscript has a pure form for most of the consonants. Thepure form of each consonant is shown in Fig. 2(e).For the consonants that do not have a pure forms, thesame eJect is achieved by attaching the special sym-bol (called halant) below the consonant. Whenever,the last character of a word is a consonant without anyvowel sound, the script rules require that we write theconsonant and attach halant even if the pure form ofthe consonant exists. A consonant in pure form alwaystouches the next character, yielding conjuncts, touch-ing characters, or fused characters. Fig. 2(f) shows the

Page 5: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 879

conjuncts formed by writing pure form consonants fol-lowed by consonant . We can use almost any consonantin place of and write over 100 conjuncts.

4.1. Composition of characters and symbols forwriting words

A horizontal line is drawn on top of all characters of aword that is referred to as the header line or shirorekha.It is convenient to visualize a Devanagari word in termsof three strips: a core strip, a top strip and a bottomstrip. The core and top strips are separated by the headerline. Fig. 3 shows the image of a word that contains 1vecharacters, two lower modi1ers and a top modi1er. Thethree strips and the header line have been marked. Nocorresponding feature separates the lower strip from thecore strip. The top strip has top modi1ers and the bottomstrip has lower modi1ers whereas the core strip has thecharacters and core modi1er ‘ ’. If no consonant of a wordhas a top modi1er, the top strip will be empty. Similarly,if no character of a word has a lower modi1er, the bottomstrip will be empty. It is possible that either of the bottomor top strips may be present or both may be present.A few sentences written in Hindi language in Devana-

giri script are presented in Fig. 4. Sample text lines showthe placing of the header lines and modi1ers. Some ofthe conjuncts are also used in this text.

4.2. Script composition grammar

It is evident from the above discussion that word for-mation in Devanagari (as in other Indian scripts) fol-lows a de1nite script composition rule for which there isno counterpart in an English text. A simpli1ed Devana-gari script composition grammar in Backus-Naur Form(BNF) is summarized in Fig. 5.

Fig. 3. Three strips of a Devanagari word.

Fig. 4. Sample Hindi text written in Devanagari script.

5. Preliminary segmentation of words

Before we describe the preliminary segmentation ofwords, let us 1rst de1ne the following two operators:

De%nition 1 (Vertical projection). For a binary imageof size H ∗ W where H is the height of the image andW is the width of the image, the vertical projection hasbeen de1ned by [1] as

VP(k); k =1; 2; : : : ; W:

This operation counts the total number of black pixels ineach vertical column.

De%nition 2 (Horizontal projection). For a binaryimage of size H ∗W where H is the height of the imageandW is the width of the image, the horizontal projectionis de1ned as

HP(j); j=1; 2; : : : ; H:

This operation counts the total number of black pixels ineach horizontal row.

The preliminary segmentation consists of the follow-ing 1ve steps:

Step i: Locate the text lines. A text line is separatedfrom the previous and following text lines by white space.The line segmentation is based on horizontal histogramsof the document. Those rows, for which HP[j] is zero;j=1; 2; : : : ; H ; serve as delimiters between successivetext lines.Step ii: Locate the words. The segmentation of the

text line into words is based on the vertical projection ofthe text line. A vertical histogram of the text line is madeand white space are used as word delimiter.Step iii: Locate the header line. After extracting the

subimages corresponding to words for a text line, we lo-cate the position of the header line of each word. Coor-dinates of the top-left corner are (0,0) and bottom-rightcorner are (W;H) where H is the height and W is thewidth of the word image box. We compute the horizontalprojection of the word image box. The row containingmaximum number of black pixels is considered to be theheader line. Let this position be denoted by hLinePos.Fig. 6(a) shows image of a word and Fig. 6(b) shows

Page 6: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

880 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 5. A simpli1ed Devanagari script composition grammar inBNF.

its horizontal projection. The row that corresponds to theheader line has been marked as hLinePos.Step iv: Separate character=symbol boxes of the im-

age below the header line. To do this, we make verticalprojection of the image starting from the hLinePos tothe bottom row of the word image box. The columnsthat have no black pixels are treated as boundariesfor extracting image boxes corresponding to characters.Fig. 6(c) shows the vertical projection. The columnscorresponding to white space between successive char-acters have been marked. The extracted subimages havebeen shown in Fig. 6(d).Step v: Separate symbols of the top strip. To do this,

we compute the vertical projection of the image start-ing from the top row of the image to the hLinePos. Thecolumns that have no black pixels are used as delim-iters for extracting top modi1er symbol boxes. Fig. 6(e)shows the vertical projection. The extracted subimageshave been shown in Fig. 6(f).

Our selection criteria for marking the subimages thatneed further segmentation due to the possible presence oflower modi1ers is based on statistical analysis of heightof the subimages extracted from the word image belowthe header line of the entire text line obtained after pre-liminary segmentation. We check the height of all theimage boxes corresponding to characters extracted froma text line and denote the maximum height as Max-CharHt. The image boxes are divided into three bins.The 1rst bin contains those image boxes whose heightis 80 percent or more of the MaxCharHt. The secondbin contains the image boxes whose height is less than80 percent of the MaxCharHt but more than 64 per-cent of MaxCharHt. The remaining image boxes go inthe third bin. The number of image boxes in each binis counted. The bin that contains maximum image boxesbecomes the bin that contains image boxes that need nofurther segmentation. We 1nd the average height of animage box in this bin and denote it as threshold characterheight. All image boxes that have height more than thethreshold character height are marked as image boxesthat may need further segmentation due to the presenceof lower modi1ers. These 1gures of 80 percent and 64percent correspond to the ratio of heights of three zonesin Devanagari.

In the same manner, we check the width of all theimage boxes corresponding to characters extracted froma text line and denote the maximum width as Max-CharWd. The image boxes are divided into three bins.The 1rst bin contains those image boxes whose width is80 percent or more of theMaxCharWd. The second bincontains the image boxes whose width is less than 80 per-cent ofMaxCharWd but more than 64 percent ofMax-CharWd. The remaining image boxes go in the third bin.The number of image boxes in each bin is counted. Thebin that contains maximum image boxes becomes the binthat contains image boxes that need no further segmen-tation. We 1nd the average width of an image box in thisbin and denote it as threshold character width. All imageboxes that have width more than the threshold charac-ter width are marked as image boxes that may need fur-ther segmentation as they may contain shadow charactersor conjuncts. The preliminary segmentation and thresh-old computation constitutes pass-I of the segmentationprocess.

6. Segmentation of conjunct/touching characters

An image box marked for further segmentation dueto its width may contain either a conjunct or a shadowcharacter pair. An outer boundary traversal is initiatedfrom the top-left most black pixel of the image. Duringthe traversal, the right most pixel visited will not be onthe right boundary of the image box if the image boxcontains a shadow character pair. If the image contains aconjunct, the right most pixel visited will be on the rightboundary of the image box. We discuss the shadow char-acter pair segmentation in Section 7. In this section, wediscuss segmentation of a conjunct image into subimagescorresponding to its constituent characters.A conjunct is segmented vertically at an appropriate

column to yield the constituent characters. In order tolocate this column, we make use of two structural prop-erties of Devanagari characters that are explained in nexttwo subsections.

6.1. Vertical bar and its location

Devanagari characters can be divided into threegroups based on the presence and position of verticalbars, namely: no bar characters, end bar charactersand middle bar characters. Some of the characters be-longing to each of these classes are shown in Fig. 7. Avertical bar does not occur at the left end of a character.The position of the vertical bar is the left most columnwhere number of black pixels is 80 percent or more ofthe character height. We divide the character image inthree equal vertical zone and compute vertical projectionfor each zone. The column that contains desired numberof pixels corresponds to the vertical bar column.

Page 7: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 881

Fig. 6. A Devanagari word image and its preliminary segmentation: (a) word image; (b) horizontal projection of the word image;(c) vertical projection of the word image below the header line; (d) image units extracted; (e) vertical projection of the word imageabove the header line; (f) image units extracted from the top strip.

Fig. 7. Three classes of core characters based on the vertical bar feature.

6.2. Continuity of the horizontal projection

We explain this property with the help of Fig. 8. Fig.8(a) shows images of three Devanagari characters—amiddle bar character, a no bar character, and an end barcharacter. The horizontal projections of the character im-ages are shown in Fig. 8(b). We delete the vertical barfrom each of the character image, if it is present, and the

image on the right of the vertical bar. The resultant char-acter images are shown in Fig. 8(c) and the correspond-ing horizontal projections are shown in Fig. 8(d). Notethat every row in the horizontal projections in each casehas at least one black pixel. Please recall that each time,the image is enclosed by a minimum inscribing rectan-gular box. The horizontal projection is made using onlythe pixels in this box. Let us now chop a few columns

Page 8: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

882 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 8. Continuity of horizontal projection of Devanagari characters.

Page 9: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 883

Fig. 9. Locating column C1: (a) an example conjunct image that has been labeled to show its boundaries, inscribing box,CONJ HEIGHT, CONJ LEFT ONETHIRD, CONJ MID and vertical bar column; (b) the conjunct image after vertical bar has beenremoved; (c) (i–iv): the image inscribed between the columns C1 and CONJ RIGHT, corresponding CHP(k) that has been labeledby R1, R2 and R3.

from the left of the character images of Fig. 8(c). Theresultant images are shown in Fig. 8(e) and the corre-sponding horizontal projections are shown in Fig. 8(f).Note that some of the rows in the horizontal projectionsof Fig. 8(e) are now without any black pixels. By in-specting the horizontal projection of the character image,

from which the vertical bar and the image on the rightof the vertical bar has been removed, we can 1nd out ifthe character image has been chopped on the left side.In other words, when we approach the segmentation col-umn of the conjunct from its right in steps of one col-umn, we can 1nd out if the left boundary of the second

Page 10: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

884 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 10. The image of the conjunct and column C2: (a) image inscribed by CONJ LEFT and CONJ LEFT ONETHIRD is used forcomputing the collapsed horizonal projection for identifying the class; (b) horizontal collapsed projection of the image in the box;(c) the column C2; (d) the column C1.

constituent character has been reached. Instead of mak-ing the horizontal projection, we make a collapsed hor-izontal projection of the image and check for its conti-nuity, both of that we de1ne next. We also de1ne penwidth.

De%nition 3 (Collapsed horizontal projection). Thecollapsed horizontal projection is de1ned as

CHP(k)=

1 if there exists an i; i=1; : : : ; W

such that Image[k; i]= 1;

0 otherwise:

H and W are the height and width of the image, respec-tively, and k =1; 2; : : : ; H . CHP(k) denotes the presenceof at least one black pixel in the horizontal row k. Thisoperator is cheaper than the horizontal projection opera-tor since it stops scanning the image the moment a blackpixel is hit.

De%nition 4 (Continuity of the collapsed horizontalprojection). We scan the collapsed horizontal projectionCHP(k) in the sequence CHP(1); CHP(2); : : : ; CHP(H).Let the position of 1rst row for which CHP(k) is 1be R1. We continue the scan and 1nd the position of

Page 11: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 885

Fig. 11. Algorithm for locating the columns C1, C2 and the segmentation column C.

the subsequent row for which CHP(k) is 0. Let this po-sition be R2. We continue the scan and 1nd the positionof the subsequent row for which CHP(k) is 1. Let thisposition be R3. We will always 1nd R1. For R2 and R3the following three possibilities exist:

1. R2 and R3 both do not exist;2. R2 and R3 both exist; or3. R2 exists but R3 does not exist.

The collapsed horizontal projection {CHP(k); k =1;2; : : : ; H} has no discontinuity if R2 and R3 both do notexist or R2 exists but R3 does not exist. The diJerenceR2 − R1 is referred to as the height of the collapsedhorizontal projection.

De%nition 5 (Pen width). In Devanagari characters, thepen width is same as the thickness of the header line.

The thickness of the header line is obtained during thepreliminary segmentation process from the horizontalprojection of the words (see Fig. 6).

6.3. Locating the segmentation column

Now, we present our algorithm for locating the seg-mentation column. We locate a segmentation column,say C1, by examining the right part of the conjunct im-age in step 1 explained below. We locate another seg-mentation column, say C2, by examining the left partof the image in step 2 explained below. Then columnsC1 and C2 are co-related to yield the 1nal segmentationcolumn C.

Step 1 (Locate segmentation column C1). Letus say, the conjunct image has the left and theright co-ordinates—CONJ TOP and CONJ BOTTOM.

Page 12: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

886 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 11. (continued)

We compute height of the conjunct image whichis (CONJ BOTTOM-CONJ TOP), referred to asCONJ HEIGHT. Recall that the coordinates of upper-left corner are (0; 0). We denote left one third andmiddle column positions of the conjunct image asCONJ LEFT ONETHIRD and CONJ MID respec-tively. Fig. 9(a) shows image of the conjunct. We havemarked CONJ TOP, CONJ BOTTOM, CONJ RIGHTand CONJ LEFT. We have also marked columns corre-sponding to CONJ LEFT ONETHIRD and CONJ MID.These steps have been presented in the form of analgorithm in Fig. 11.Let us now delete the vertical bar from the right end

of the conjunct image. The image after deleting thevertical bar is shown in Fig. 9(b). Let us refer to theright boundary of this image as CONJ RIGHT. The

column that is one pen width to the left of CONJ RIGHTis our candidate segmentation column. Let this columnbe C1. We have also marked the column C1 in Fig. 9(b).We inscribe the image between C1 and CONJ RIGHTand compute the collapsed horizontal projection. Let thiscollapsed horizontal projection be called CHP(k). Thesesteps are presented as an algorithm in Fig. 11. If CHP(k)has no discontinuity and the height of CHP(k) is morethan 1=3rd oJ CONJ HEIGHT, C1 is our segmentationcolumn. Otherwise, we shift C1 to the left by one col-umn and recompute CHP(k). After a few iterations,CHP(k) will have desired properties and we wouldhave found our segmentation column C1. Fig. 9(c)(i)–(iv) depict the iteration process. In each 1gure, we haveshown the image box under consideration, correspond-ing CHP(k) as well as R1, R2 and R3. The CHP(k)

Page 13: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 887

Fig. 12. Two characters in shadow and their segmentation: (a) the image of characters in shadow; left most pixel is also marked;(b) the box inscribing the image of 1rst character; (c) left most black pixel outside the 1rst image box; (d) the box inscribing thesecond character; the overlapping region has also been marked; (e) the subimages for the constituent characters after cleaning theoverlapping region.

shown in Fig. 9(c)(iv) has no discontinuity and itsheight is more than 1=3rd of CONJ HEIGHT. Therefore,the process is terminated and no more iterations areperformed.Step 2 (Locate segmentation column C2). We put

pure consonants in two classes based on their height:

(i) Height of the pure consonant is 80 percent or lessthan the corresponding cons: Characters of this classare: . This class is re-ferred to as class H1.

(ii) Height of the pure consonant remains unchanged:Characters of this class are: . This class ofcharacters is referred to as class H2.

Classes H1 and H2 as outlined above are used for iden-tifying the column C2. First, we inscribe the image be-tween CONJ LEFT and CONJ LEFT ONETHIRD andcompute collapsed horizontal projection of the imagein the box. Let this collapsed horizontal projection beCHP2(k).

If the height of the CHP2(k) is more than 80 per-cent of CONJ HEIGHT, the 1rst constituent characterof the conjunct belongs to class H2. Otherwise, it be-longs to class H1. The image of our example conjunctis shown in Fig. 10(a). We have marked columnsCONJ LEFT and CONJ LEFT ONETHIRD. The col-lapsed horizontal projection, referred to as CHP2(k),of the image inscribed by columns CONJ LEFT andCONJ LEFT ONETHIRD is shown in Fig. 10(b).The height of CHP2(k) is less than 80 percent ofCONJ HEIGHT. Therefore, the subimage on the leftside belongs to class H1. After identifying the classof the left subimage of the conjunct, we set C2 toCONJ LEFT ONETHIRD. We compute the height ofthe image inscribed by CONJ LEFT and C2. If theheight is less than 1=3rd of CONJ HEIGHT for classH1 characters, we move C2 to the right in steps of onecolumn provided the pixel strength of the new columnC2 is not more than the pixel strength of the presentcolumn C2. We have marked column C2 in Fig. 10(c)for our example conjunct. The height of the image is

Page 14: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

888 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 13. Lower modi1er segmentation: (a) the image of a character containing a lower modi1er; (b) horizontal projection of theselected image area; (c) the subimages for the core character and the lower modi1er.

more than 1=3rd of CONJ HEIGHT and we stop theprocess.For class H2 characters, we inscribe the image be-

tween CONJ LEFT and CONJ LEFT ONETHIRD. Wethen compute the vertical projection of the image in thisbox and locate the right most column with maximumnumber of black pixels. Let this column be C2.We move C2 in steps of one column to the right as

long as the pixel strength of the next column is notmore than the pixel strength of the present column.C2 is not allowed to go beyond CONJ MID. Thesesteps have been presented in form of an algorithm inFig. 11.Relating C1 and C2: We now make an attempt to

relate C1 and C2. While comparing C1 and C2, thefollowing three possibilities are encountered:

Fig. 14. Composite characters not invoked for segmentation.

(a) C1 is less than C2: Both the segmentationcolumns are ignored and no segmentation is done.This situation occurs when a character does not requiresegmentation and has been wrongly marked for furthersegmentation.

Page 15: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 889

Table 1Performance of the system for Font I

Total number Overall Number of Touching char. Touching chars.of chars. recognition touching chars. recognition substitution

Doc I 2229 2001 186 174 12Doc II 1785 1605 84 69 15Doc III 1500 1263 90 81 9Doc IV 1419 1230 63 60 3Doc V 1926 1653 99 84 15Doc VI 2517 2259 117 90 27Doc VII 2310 2019 141 114 27Doc VIII 2679 2232 162 126 36Doc IX 2412 2136 144 111 33Doc X 2079 1764 86 72 14Overall 20,856 18,162 1172 981 191

87.08% 5.61% 83.70% 17.29%

Table 2Performance of the system for Font II

Total number Overall Number of Touching char. Touching chars.of chars. recognition touching chars. recognition substitution

Doc I 2386 2122 98 78 20Doc II 1740 1466 72 64 8Doc III 2258 1930 128 112 16Doc IV 2052 1834 90 78 12Doc V 2038 1758 120 98 22Doc VI 2136 1854 110 86 24Doc VII 1820 1538 88 78 10Doc VIII 1926 1857 96 84 12Overall 16,356 14,359 802 678 124

87.79% 4.90% 84.53% 15.46%

(b) C1 and C2 are almost same: If C1 and C2 arethe same or the diJerence between them is less than thepen width, the segmentation column C1 is moved to theleft by the diJerence. Column C1 is our segmentationcolumn C.(c) The di:erence between C1 and C2 is more than

the pen width: This situation occurs when the imagehas more than two characters, we can probably extractthe 1rst constituent character using C1 and then attemptto segment the remaining image. However, this remainsto be done in future.

7. Segmentation of shadow character

Two adjacent characters are in shadow of each otherif they do not touch each other and cannot be sepa-rated by a single vertical line due to the overlapping re-gion. In order to extract the subimages corresponding toeach character, we locate the left most black pixel in the

image, say P1. Fig. 12(a) shows image of two charactersthat are in shadow. We have also marked the left mostblack pixel in the image. Next, we do an outer bound-ary traversal starting at pixel P1. The extreme left, right,top and bottom image points visited during the traversalform the inscribing box for the subimage of the 1rst char-acter of the pair. Fig. 12(b) shows the inscribing box forthe 1rst character. It is clear from the Fig. 12(b) that asmall part of the image of the second character overlapswith the inscribing box for the 1rst character. In orderto extract the second subimage, another outer boundarytraversal is initiated from a pixel, say P2 that is outsidethe 1rst character image box. Fig. 12(c) shows the pixelP2 that is the left most black pixel outside the box in-scribing the 1rst character subimage. The outer boundarytraversal initiated from P2 yields the inscribing box forthe second character subimage. Fig. 12(d) shows the boxinscribing the second character. During this traversal, theextreme left, right, top and bottom points that are visitedin the territory of 1rst image box form the overlap region.

Page 16: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

890 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 15. The input text.

The overlap region is cleaned from each image box toobtain proper subimages. Fig. 12(f) shows the extractedsubimages.

8. Segmentation of lower modi%ers

In Section 5, we marked some of the image boxesfor further segmentation suspecting the presence oflower modi1ers based on the threshold characterheight. Devanagari script has characters of varyingheight. Character has almost same height as thatis character with the lower modi1er . Therefore, acharacter suspected to have a lower modi1ers may notactually contain a lower modi1er. Therefore, we needto carefully examine the image box for the presenceof a lower modi1er. When a lower modi1er is placedbelow a core character, one of the three possible join-ing patterns is formed. These three joining patternsare:

(i) Weak joining: A lower modi1er below a middleor end bar character forms a weak join with thecore character. Some of the non-bar character thathave a small bar in the lower region also form aweak joining point with the lower modi1er. Someexamples are .

(ii) Thick joining: There are some characters that touchthe lower end of the core strip at more than onepoint. A lower modi1er placed below such characterforms a thick join with the core character, such as

.(iii) Gap between the character and the modi<er: Some-

times a modi1er does not touch the core character atall. The lower modi1er called ‘nukta’ (representedas a dot(:) in the bottom) usually does not touch thecore character. Some examples are: , etc.

The horizontal segmentation row for separating a lowermodi1er is below the core strip. Height of the core

strip is estimated to be threshold character height (seeSection 5). We explore the rows below the core stripfor a possible segmentation row. If we 1nd a row thatcontains no black pixels, we use this row to separate thelower modi1er. Otherwise, the row with minimum num-ber of black pixels below the core strip is located. Thepixels below this row are checked whether they havesu<cient height and width to qualify for a lower mod-i1er symbol by making a horizontal projection of theselected image. The height of a lower modi1er is usually1=5th or more of the character height. In case, enoughpixels are present, we use the located row for segment-ing the lower modi1er. However, in case of a thickjoining pattern, some adjustment is made based on thepro1le information of the joining region. An image thathas beenmarked for further segmentation has been shownin Fig. 13(a). We have shown the inscribing boxes forthe subimages corresponding to the core character andthe lower modi1er. The threshold character height hasalso been marked. The subimage inscribed by the boxshown in Fig. 13(b) is used for making horizontal pro-jection. The row that contains minimum number of blackpixels has also been marked. We extract the subimagesshown in Fig. 13(c) using this row. These subimagescorrespond to the core character and the lower modi1er,respectively.The character is a long character and has a high

frequency of occurrence in text. The image below thatthreshold character height resembles the lower modi1ershalant very closely. Our lower modi1er segmentation al-gorithm removes lower part of and puts it in a lowermodi1er box. The chopped character is essentially a newcharacter and it is treated like a new character. However,this lower modi1er does not make a syntactically validcharacter with character . Since the trainer and recog-nizer both use the same segmentation algorithm, char-acter is always chopped. The chopped character isadded to the set of known characters. The compositiongrammar is augmented to compose the chopped andlower modi1er halant into character [9,15].

9. Results

The structural segmentation algorithm has been testedon 18 document pages from two diJerent magazines. Thenumber of conjuncts in the test documents is about 5percent. Out of all the conjunct and composite charac-ters, about 90 percent conjuncts have been marked forfurther segmentation based on the output of the clas-si1cation process (see Refs. [16,17]) and the thresholdwidth and height of the characters. During the testing, itwas observed that sometimes a composite character wassubstituted by another Devanagari character. This hap-pened when both constituent characters were relativelythin. The composite character and get mapped to

Page 17: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 891

Fig. 16. The image after preliminary segmentation (lines 1–5); lower modi1ers and conjuncts have not yet been segmented.

Fig. 17. The image after lower modi1er separation (lines 1–5); conjuncts have not yet been segmented.

character . Some of the composite characters that werenot selected for segmentation are shown in Fig. 14. Therecognition of the constituent characters obtained aftersegmentation of touching character is about 88 percent

that matches with the overall performance of the system.There are approximately 12 percent substitution errors.The results of touching character segmentation are sum-marized in Tables 1 and 2. However, the algorithm is

Page 18: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

892 V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893

Fig. 18. The image after conjunct segmentation (lines 1–5).

Fig. 19. The output of the classi1cation process; the word hasbeen underlined if the true word is the second or third choice;incorrect words are shown in { }.

capable of segmenting all the conjuncts of Fig. 14 whensuggested to do so.A sample text page and the text after recognition is

shown in Fig. 15. The segmented image after preliminarysegmentation and lower modi1er segmentation are shownin Figs. 16 and 17, respectively. The image after conjunctsegmentation and the OCR output after post-processing[18] are shown in Figs. 18 and 19.

10. Conclusion

In this paper, we have presented a complete methodfor segmentation of text printed in Devanagari script,

a script used by more than 400 million people on theglobe.A preliminary segmentation process extracts the

header line and delineates the upper-strip from the rest.This yields vertically separated character boxes, thatmay be conjuncts, touching characters, characters withlower modi1er attached to it, shadow characters or acombination of these. In pass-I of the segmentationprocess, statistical information about these boxes is col-lected. In pass-II, this statistical information id usedto select the boxes on which further segmentation isattempted. Besides separating the lower strip from thecore strip the constituent character boxes in the conjunctare identi1ed in the core strip. We extensively use thestructural properties of the characters in the segmen-tation process. The presence and relative location ofvertical line, and the nature of constituent pure conso-nant form in the conjunct, provide valuable clues aboutsegmentation boundaries. Our segmentation approach isa hybrid approach, wherein we try to recognize the partsof the conjunct that form part of a character class. Aperformance of 85 percent correct recognition has beenachieved.Devanagari script is a derivative of the ancient Brahmi

script that is also the mother of all other Indian scripts.The methodology outlined here are equally applicableto many other scripts like Bangla, Asamiya, Gujarati andGurumukhi that are similar in structure.

Acknowledgements

A part of this work was supported by Department ofElectronics, Government of India.

Page 19: Segmentation of touching and fused Devanagari characters · 2015-08-14 · character as a separate atomic symbol because such combinations are very large in number. ... most recognition

V. Bansal, R.M.K. Sinha / Pattern Recognition 35 (2002) 875–893 893

References

[1] Su Liang, M. Shridhar, M. Ahmad, Segmentation oftouching characters in printed document recognition,Pattern Recognition 27 (1994) 825–840.

[2] R.G. Casey, E. Lecolinet, A survey of methods andstrategies in character segmentation, IEEE Trans. PatternAnal. Mach. Intell. 18 (1996) 690–706.

[3] R.G. Casey, G. Nagy, Recursive segmentation and classi-1cation of composite character patterns, Proceedingsof the Sixth International Conference Pattern Recognition,Munich, Germany, 1982, pp. 1023–1026.

[4] S. Kahan, T. Pavlidis, H.S. Baird, On the recognition ofprinted characters of any font and size, IEEE Trans. PatternAnal. Mach. Intell. 9 (1987) 274–287.

[5] Shuichi Tsujimoto, Haruo Asada, Solving ambiguitiesin segmenting touching characters, in: H.S. Baird et al.(Eds.), Structured Document Image Analysis, Springer,New York, USA, 1992.

[6] J. Rocha, T. Pavlidis, A shape analysis model withapplications to a character recognition system, IEEETrans. PAMI 16 (4) (1994) 393–403.

[7] S.S. Marwah, S.K. Mullick, R.M.K. Sinha, Recognition ofDevanagari characters using a hierarchical binary decisiontree classi1er, IEEE International Conference on Systems,Man and Cybernetics, October 1994.

[8] R.M.K. Sinha, H.N. Mahabala, Machine recognition ofDevanagari script, IEEE International Conference onSystems, Man and Cybernetics, 1979, pp. 435–441.

[9] R.M.K. Sinha, Rule based contexual post-processing forDevanagari text recognition, Pattern Recognition 20 (5)(1987) 475–485.

[10] I.K. Sethi, B. Chatterjee, Machine recognition ofhandprinted Devanagari numerals, J. Inst. Electron.Telecommun. Eng. India 22 (1976) 532–535.

[11] J.C. Sant, S.K. Mullick, Handwritten Devanagari scriptrecognition using CTNNSE algorithm, InternationalConference on Application of Information Technology inSouth Asian Language, February 1994.

[12] B.B. Chaudhuri, U. Pal, Skew angle detection of digitizedIndian script documents, IEEE Trans. Pattern Anal. Mach.Intell. 19 (2) (1997) 182–186.

[13] B.B. Chaudhuri, U. Pal, A complete printed Bangla OCRsystem, Pattern Recognition 31 (5) (1997) 531–549.

[14] B.B. Chaudhuri, U. Pal, An OCR system to read twoIndian language scripts: Bangla and Devanagari, Procee-dings of 4th International Conference on DocumentAnalysis and Recognition, 1997, pp. 1011–1015.

[15] R.M.K. Sinha, Visual text recognition through contexualprocessing, Pattern Recognition 21 (1988) 463–479.

[16] R.M.K. Sinha, V. Bansal, On Devanagari documentprocessing, IEEE International Conference on Systems,Man and Cybernetics, Vancouver, Canada, 1995.

[17] R.M.K. Sinha, V. Bansal, On automating trainer forconstruction of prototypes for Devanagari text recogni-tion, Technical Report, TRCS-95-232, I.I.T. Kanpur,India, 1995.

[18] V. Bansal, R.M.K. Sinha, Partitioning and searchingdictionary for correction of optically-read Devanagaricharacter strings, Proceedings of the InternationalConference on Document Analysis and Recognition(ICDAR-99), Bangalore, India, 1999, pp. 653–656.

About the Author—VEENA BANSAL received her Bachelors degree from REC, University of Allahabad in 1984 and Mastersdegree from Univ. of Connecticut in 1987, both in Computer Science. She spent one year at Univ. of Virginia. Later, she was on thefaculty of H.B.T.I, Kanpur, Computer Science department for next two years, then worked as senior project associate at IIT, Kanpurfor another two years. She received her Ph.D. in January 1999 from Indian Institute of Technology, Kanpur. She is with departmentof Industrial and Management Engineering at IIT, Kanpur. Her interests include Pattern Recognition and Document Processing.

About the Author—R. MAHESH K. SINHA is a Professor of Computer Science & Engineering and Electrical Engineering atIndian Institute of Technology where he has served for the past 25 years. He has been a Visiting Professor at Michigan StateUniversity, Wayne State University, INRS at Montreal, and at Asian Institute of Technology, Bangkok. His research interests lie inthe areas of Applied AI, Document Processing, OCR, Natural Language Processing, Machine Translation and Computer Architecture.He is the originator of multi-lingual GIST technology, ANNGLABHARTI for translation from English to Indian Languages, andexample-based ANUBHARTI approach for machine aided translation. His early work on Devanagari OCR has been ice-breaking.He has been intensely involved in design of Indian scripts Standard Code for Information Interchange (ISCII), phonetic keyboarding,transliteration and several other aspects of Indian Language and Scripts processing. He was Guest-Editor of JIETE special issueon this subject. He has organized and been on organizing committees of several national and international conferences and chairedseveral sessions. He was Chairman of UNESCO Second Regional Workshop on Computer Processing of Asian Languages. He is anAssociate of ORBICOM Network of UNESCO Chairs in Communication. He is listed in Men of Achievement and been recipientof outstanding achievement award. He is a Senior member of IEEE and Fellow of IETE.