11 2006, SriLanka New Segmentation Algorithm...

First International Conference on Industrial and Information Systems, ICIIS 2006, 8 - 11 August 2006, Sri Lanka

New Segmentation Algorithm for OfflineHandwritten Connected Character Segmentation

U.K.S. Jayarathnal, G.E.M.D.C. Bandara21. Department ofStatistics and Computer Science, Faculty ofScience, University ofPeradeniya, Peradeniya, Sri Lannka.2. Department ofProduction Engineering, Faculty ofEngineering, University ofPeradeniya, Peradeniya, Sri Lannka.

Email:sampath e d dn.ac.lk

Abstract - An approach for the Segmentation of offlinehandwritten connected two-digit strings is presented inthis paper. Very often even in a printed text, adjacentcharacters tend to touch or connect. This makes it aproblem in performing proper character isolation,hence difficult in segmenting the digit strings in orderto recognize its individual characters. We, in our studyhave developed an algorithm, which provides a solutionbased on the analysis of the foreground pixeldistribution to segment, connected digit string pairs.

In the segmentation stage, the junction basedsplitting technique decides complete segments of theconnected digit strings. Use of fuzzy characteristicvalues at the merging of the complete segments isolatesthe major segments (Merged complete segments) fromthe minor segments. In this work, it was found that theunwanted connection resides within the minorsegments of the connected character skeleton. At thecharacter isolation stage, all major segments arecombined with each of minor segments to generate setof different connection sketches. In order to recognizeindividual characters of the connection string, thesesketches are formulated into a new input characterimage, which then can be used as an input for acharacter recognition system.

Keywords: Skeletonization, Fuzzy Rules, Connecteddigit Strings, Binarization, Segmentation

I INTRODUCTION

Alphabetic character segmentation and recognition is abasic need in incorporating brainpower to computers.Machine intelligence involves several aspects amongwhich optical recognition is a tool, which can be integratedto text recognition and speech recognition.

Offline handwritten character reading by computerprogram is a complex undertaking. It is essential toseparate a given character string correctly into thesequence of characters. Any failure or error in thesegmentation step directly produces a negative effect onrecognition. The difficulty of handwritten charactersegmentation comes from the great variety ofhandwritings. In the case of hand-printed scripts,segmentation is a relatively simple task. In the case of

overlapped scripts, broken characters, connectedcharacters, loosely configured characters, and mixedscripts, segmentation is difficult. Overlapped, broken,connected and loosely configured characters are majorcauses of segmentation errors [2]. Very often even inprinted text, adjacent characters tend to touch or connect.

The Segmentation and Recognition of Connectedcharacters, is a key problem in the development of OCR/ICR systems. Many methods have been proposed in therecent years. These methods include, segmentation of theconnected handwritten two digit strings based on thethinning of background regions [3], comparison withtouching character images synthesized from two singlecharacter images [4], a neural network based approach todeal with various types of touching observed frequently innumeral strings [5], statistical and structural analysiscombined with peak-and-valley functions to segmentmachine-printed addresses [6]. Performances of thestatistical methods depend on the amount of data used inthe definition of the statistical model parameters, and arenot flexible enough to adapt new handwriting constraints.Once the networks are trained, the advantages of neuralnetworks are automatic learning and quick classification.Their drawbacks are the requirement of long processingtime and a proper dataset for learning. The backgroundthinning based approach first locates several feature pointson the background skeleton of a connected digit image.Possible segmentation paths are then constructed bymatching these feature points. Fuzzified decision rules,which are generated from training samples, are used torank segmentation paths. The advantage of the work of [3]is the possibility of segmenting single and double touchedconnection in the character recognition dilemma. Butwhen considering the connected character isolation, inorder to recognize a connected character input image, israther difficult to achieve with the background skeletonanalysis. The need of external comparison with thetouching characters [4] is difficult to extend in to variousdifferent handwritten styles.

II METHODOLOGY

When considering existing systems, there are some majorfeatures defined in [3] which, a connection of adjacenthandwritten digits can be categorized into (Fig. 1,)

1-4244-0322- 7/06/$20. 00 C2006 IEEE 540


(a) Two strokes from two adjacent digits touching end to end.(b) The end of a stroke of the left/right digit touching the side

of a stroke of the right/left digit.(c) Overlapping of two vertically oriented strokes from two

adjacent digits.(d) Stroked connection between adjacent two-digit strings.

Fig. 1. Different connection styles

The present technique can be used for the detection andthe segmentation of type (b) and (c) style connections,while this paper focuses on the performance of thealgorithm in unwanted connection segments (d) in theconnected character segmentation dilemma. In addition,while it is assumed in this paper that each connectingcharacter image consist of only two component characters,the present technique can be easily extended to deal withthe connected character image which consist of three ormore characters. The block diagram of the proposedsystem is depicted in Fig. 2,

F; Preprocess]ng1 2 3 4 5 7

r m- (- )r )( )(

8 Merging of Segment across Junctions ]

9 Segnmented Character Sketch Isoat i

10 Formulatenewin utimage

Isotated IndividuaLt Characters for Recognition

Fig. 2. Block diagram of the proposed system

This paper will cover the proposed system by presentingthe preprocessing of the connected character image,namely, the binarization & skeletonizaiton, isolation ofconnected characters, Junction based segmentation, insections A, B and C respectively. Merging of thesegments across junctions, Segmented Character Sketch

isolation and the Formulation of the new Input characterimage (isolated individual characters) are described insection H. In this work, the same methodology proposed in[8] for isolated offline handwritten character recognitionwas used to identify each of the individual character in thenew input character image. Section IV outlines someexperimental results. The paper concludes with conclusionsand future work outlines.

A Binarization & Skeletonization

Any scanned digital image is represented as a collection ofpixels having intensities within the range of 0% to 100%.Prior to processing the image for skeletonization, the imageshould undergo a binarization process. In this work,binarization process proposed by the work in [8] is used forthe preprocessing of the connected character image. In thebinarization process, if the intensity of a particular pixel isless than a particular threshold value, it is set to black (0)and if the intensity value is greater than or equal to thethreshold, it is set to white (255). After binarization, everypixel in the image was represented as either black or white.In this work, the threshold value was taken as 200.The properly binarized image was taken in to account forthe skeletonization process, which dealt with getting thesingle order pixel skeletons of the connected characterpattern within the input image. With this work, theforeground pixel based implementation [8] of the"Improved parallel thinning algorithm" is used to obtainthe connected character skeleton of the input image.

B Isolation ofConnected Character Skeleton intoCorrelation Area

The skeletonized character image was then processed forindividual and connected character isolation. In this work,a simple method is proposed to isolate the connectedcharacters skeletons into correlation area. For theindividual characters; the method proposed in [1] is used.

Definition 1: A correlation area is a rectangular area inthe image, which contains connected character skeleton.Fig. 3, depicts how a correlation area can be represented.

(Xmin,Ymin) x

.-_ m SCorrelation Area. ma'

.S \ (Xm]Ym|

yFig. 3. The representation ofthe correlation area of a connected two-digit

string

In the proposed method, firstly, the rows of the characterskeletons are used to isolate, assuming that the maximum

1-4244-0322- 7/06/$20. 00 c2006 IEE

I I

541


distance between two rows of connected characterskeleton is 50 pixels. Then the character skeletons in eachrow are proposed to isolate assuming that the maximumdistance between two separate connected characterskeletons in a single row is 200 pixels. This processisolates the connected character skeletons into theircorrelation area.

C Junction based segmentation

The most challenging task associated with this work wasthe segmentation of connected characters skeleton into aset of meaningful segments for the calculation of fuzzycharacteristics. Once a skeleton is properly segmented,each of the resulted segments can be used to calculate a setof fuzzy features (MHP, MVP, etc.) as described in [1].

In order to calculate these fuzzy features moreaccurately, each resulted segment should be a meaningfulsegment. That is, each resulted segment should be ameaningful straight line or a meaningful arc as describedin [8], but not an arbitrary segment.

When considering the connected charactersegmentation, it is almost impossible to segment connectedcharacter skeleton in to a set of meaningful segments. Thisis due to the possibility of having different connectionsbetween two-digit character strings. Therefore thesegmentation stage of the proposed work consists of twophases namely, the initial segmentation phase and the totalsegmentation phase. In the initial segmentation, the inputcharacter image undergoes Junction Point identificationand Junction Based Segmentation.

Definition 2: A Junction Point is a pixel point in theCorrelation area, having three or more neighboring pixels.It is assumed here that the skeleton is in one pixel thickness.

After the identification of the all junction points (Fig. 4,) inthe correlation area, each is used to segment he connectedcharacter skeleton in to initial segments.

Junction Points

Fig. 4. Junction points of the correlation area

In this research it was found that, the initial segment wouldnot produce complete segmentation due to the non-Junctionpoint connection between segment skeletons. As anexample, the handwritten character skeleton 'Z' (in Fig. 5,)would not be segmented into a "negative slanted" and two"Horizontal Lines" because of the non-Junction pointconnection in between.

No JucIo Pin

Fig. 5. Initial Segmentation of the correlation areaTo compensate for this, a separate segmentation algorithmis used with the rule based segmentation approachproposed in [7] (Fig. 6,).

Fig. 6. Complete segmentation in to meaningful segments

D The algorithm

Two types of major data structures are used in thisalgorithm, namely, the Point, which holds the X and Ycoordinate values of a pixel point and the Segment, whichis a Point array. The input for this connected charactersegmentation algorithm is the correlation area of aconnected character skeleton. In order to get the expectedsegmentation results, the input connected characterskeleton should be in one pixel thickness and it should notcontain any spurious branches. This section outlines themain routine of the algorithm. Apart from that somedefinitions proposed in [1] are applied here. Thesedefinitions will be used through out this paper to describethe algorithm in detail.

Definition 3: A starter point is a pixel point on thecharacter skeleton with which the traversal through theskeleton could be started. Starter points are twofold;major starter points and minor starter points

Definition 4: A major starter point is a starter point,which is identified before starting the traversal through theskeleton. The identification of major starter points isdescribed in Section 5.

Definition 5: A minor starter point is a starter point,which is identified during the traversal through theskeleton.

The main routing of the connected characterssegmentation algorithm (hereafter refereed to as ccsalgorithm) can be described as follows;

Function ccs_algo (correlation-area) returnstotal-Segments

Beginmajor_starters = empty //a queue ofPoint, major starter points.junction_points = empty // a queue ofPoint, junction points.partial-Segment = empty //an array ofSegmentinit_Segments = empty //an array ofSegment, allpartial segscomplete_Segments = empty //an array ofSegmentstotal_Segments = empty Ilan array ofSegments

major-starters = find all major starter points (correlation_area)junction_points = find all junction points (correlation_area)for each of the initial major start points in major starters dopartial-Segments = init_Split(major_starter_point,

1-4244-0322- 7/06/$20. 00 C2006 IEE 542


junction_points)add all the segments in the partial-Segments to init_Segmentsend for.

noice removal(init_Segments)complete_Segments= complete_Split(init Segments)

total-Segments = get_Segment_naming(complete_Segments)return total_SegmentsEnd

E Identification ofStarter points and Junction Points

The algorithm starts with finding all the major starterpoints and Junction Points in the given correlation area.In order to find the major starter points, two techniquescan be used [8]. In both of these techniques, the pixels inthe given skeleton area are processed row-wise.

F Traversal through the correlation area

Definition 5: The traversal direction is the direction fromthe current pixel to the next pixel to be visited during thetraversal.Definition 6: An end point is a pixel point in thecorrelation area, in which there is no neighboring pixel tovisit next

After finding all the major starter points, and the junctionpoints, the algorithm starts traversing (hereafter referred toas initial splitting) through the connected characterskeleton, starting from the major starter point of the list ofmajor starter points. In this initial splitting, the segmentsare identified in the traversal path based on the junctionpoint list. Moreover, the minor starter points are alsoidentified at each junction of the skeleton and they arequeued to a different queue, which is hereafter referred toas minor starters. Once the traversal reaches a junctionpoint, or an end point, which is a pixel point with noneighboring pixel to visit next, the focus is shifted to theidentified minor starter points in the minor starters queue.Then the algorithm starts traversing the unvisited paths ofthe skeleton by starting with each minor starter point in theminor starters queue. In these traversals, the algorithmalso segments the path that is being visited up to junctionpoint or an endpoint into initial segments.

The process mentioned above is continued withall the unvisited major starter points in the major startersqueue, until all the unvisited paths in the correlation areaare visited. The risk of visiting the same pixel (hence thesame path) more than once during each splitting traversalis eliminated by memorizing all the visited pixel pointsand only visiting the unvisited pixels in the later traversals.Therefore it is guaranteed that a path in the skeleton isvisited only once and hence the same segment is notidentified twice.The initial splitter routine is as follows.Function init_split(major_starter_point) returns segments

Begincurrent_segment = empty//Segmentpoints, current segmentcurrent_point = empty //Point, refers to the current pixel point

current direction = empty //String, current traversal direction.next_point = empty //Point, nextpixelpoint to be visitedsegments = empty //segments array, identified segmentsminor-starters = empty //Point queue, minor starter points

while (there are more points in the minor starters queue ORcurrent_point is nonempty) do

if(current_point =empty ) thencurrent_point = minor starters.deque(initialize the current_segmentif(current_point is unvisited) thencurrent segment.add(current_point)make the current_point as visitedend if end if

if(unvisited eight adjacent neighbors of current_point exist) thenneighbors =get all unvisited adjacent neighbors of

current-pointif(current_point =junction_point)

minor_starters.enque(neighbors) //neighbors at the junctionsegment.add(current segment) //segmentation at thejunction

current_point= emptyelse next_point = choose any neighbor of the current_point

current segment.add(next_point)mark next_point as visitedcurrent direction = get the current traversal directioncurrent_point = next-pointend ifend if

else // ifthere are no unvisited neighbors to visitsegment.add(current segment)current_point= emptyend ifend while

return segmentsEnd

The initial splitter process mentioned above iscontinued with all the unvisited major starter points in themajor starters queue, and the final output of thesegmentation is used as the input to the complete splittingprocess, which produces complete segmentation of theconnected character skeleton. The complete splitterprocess is used to analyze each initial segment to spliteach of the candidate segments in to complete charactersegments.The complete splitter routine is as follows.Function complete_split(init_segments) returns all_segmentsBegin

all-segments = empty H Segment queue, all identified segmentscurrent-segment = empty H selected initial segmenttmp_segment = empty H segment to hold next 5 pointscurrent_point = empty H Point, refers to the current pixel pointnext_point = empty H Point, next pixel point to be visitedstart_point = empty H Point, first point of initial segmenten_point = empty H Point, last point of a selected initial segmentdirection = empty H String, to hold the traversal direction.

current_point= start_pointcurrent segment.add(current_point)end_point=get end point of the selected initial segment

while (no termination occur) doif(current_segment >1 ) then

1-4244-0322- 7/06/$20. 00 c2006 IEE 543


next_point = get next segment point of the currentpoint(init_segment)

if(does neighbour in the same direction) thencurrent_segment.add(next_point)current_point=next_pointif(current_point is the end point)terminate condition occursall_segments.add(current_segment)

end ifelse H neighbour is not in the same direction

//I.e. the traversal direction changes.tmp_segment = get next 5 pixels in the path.

if(IsAbruptChange(current segment, tmp_segment)) thenall_segment.add(current segment)current-segment = tmp_segmentif(current_point is the end point)terminate condition occursall_segments.add(current_segment)end if

elseH the traversal can continue with the same segment.

add all the points in the tmp_segment to current-segmentif(current_point is the end point)

terminate condition occursall_segments.add(current_segment)

end if end if end ifelsenext_point = get next segment point of the currentpoint(init_segment)current_segment.add(nextpoint)current direction = get the current traversal directioncurrent_point=nextpointend ifend while

return all-segmentsEnd

G Getting Next 5 Pixel Points in the Segment

The intention of getting next 5 pixel points into a separatedata structure refereed to as tmp segment, is to find thenew written direction. The segmentation decision is basedon the abrupt change in the written direction [8].

Definition 7: The written direction is the direction of aparticular sequence of pixels to which they were written.

According to the complete segmentation algorithm, as longas the current traversal direction remains unchanged (if itcan find an unvisited neighboring pixel in the currenttraversal direction), the algorithm considers that the path,which is being visited, belongs to the same segment. If thecurrent traversal direction changes, then the algorithmgoes and checks for an abrupt change in written direction.In this work, calculation of written direction of a sequenceof pixels is implemented as proposed in [8]. Furthermore,the experimental rule base proposed in [7] is used tosegment handwritten uppercase English characters in tomeaningful segments. In addition, the current traversaldirection in [8] is used to find out the treaversal direction

from the current pixel point to the next point in theconnected character skeleton.

H Merging ofSegments acrossjunctionsAfter the segmentation process and noise removal [1], eachsegment can be considered as totally spitted segments ofthe connected character skeleton. With respect to the workproposed in [1], [7] and [8], for the recognition, each oftheisolated character should undergo proper segmentation, inorder to obtain meaningful segments as the output.

\ /

UJnnecessary Segmlentation atthe u nction poins

Fig. 7. Unnecessary segmentation at ajunction point

When the consideration is only focused on isolatedcharacters, it is rather easy to do the segmentation in tomeaningful skeletons as described in the work of [8]. In thisresearch, not only the individual character segments, butalso the connection between each individual character hasto be considered for the segmentation in order to get thesegments out from the connected character input. This maylead to unnecessary segmentation across certain connectionas described in the Fig. 7. Therefore, Merging of eachsegment across junction is proposed, to avoid unnecessarysegmentation and to reconstruct meaningful segments.

In this research, each of the segments in the completesegment list was named with respect to its participatingjunction point number(s).

Definition 8: The junction point number is a uniqueinteger assigned to the identified junction points of thecorrelation area. As an example, correlation area with 6junction points, each point can be named by the integersstarting from 1 to 6 in the same order in which each pointwas identified.As shown in the Fig. 8, in this method each of the segmentobjects can be composed in to a first junction number,second junction number and a pixel point segment. TheJunction number (first or second) depends on the directionof segmentation. In this work, the first coordinate point of asegment is considered as the first junction number and lastcoordinates point as the second junction number. Anysegment with a corner point, which is not ajunction point isassigned zero as the junction number at that corner.

Segment 0 - 4

3 ai424

Segnent 4 - 5

Fig. 8. Junction based naming process

Ex: Segment (first Junction no - second junction no)

1-4244-0322- 7/06/$20. 00 c2006 IEE 544


Segment (3 - 0) splitter.

In the process of Merging, each segment in the totalsegment is categorized in accordance with the relatedjunction number. As an example, junction point number 3is used to create a category, which consists of all segments,which have a junction point number 3 at first or secondjunction number. This method is used to create groups ofsegments with respect to all the junction point numbers inthe correlation area except the points, which have numberzero.In this work, each of the group is separately examined toselect possible merging at each junction point number. In aparticular junction point number, each segment is analyzedwith all the other participating segments within the group.At the analyzing time, each pairs of segments within thegroup, are used to merge in order to get merged segments.Each identified segment was then said to possess a certainnumber of characteristics, for each a fuzzy value wascalculated, using the methods described in [1].

According to the calculated fuzzy values of aparticular segment, an experimental rule base is used toextract meaningful segment within a particular group.Each identified merged segments is mapped and replacedwith a segment, if the segment is a participant in themerging is recorded in other groups as well (if one or bothof the participating segments in the merged segment).

All successfully merged meaningful segments arerecorded in a group called major segment group, and all theothers not merged are traced in to a separate structure calledminor segment group. In this method, it was found that theminor segment group could isolate an unnecessaryconnection between individual characters. In this work,each of the segments in the major segment group is used toformulate new input character image by selecting segmentsfrom the minor segment group. This method is used tocreate input character image with several correlation areas.For the recognition, the work proposed in [1], is used toidentify each individual character by sending eachcorrelation area to the isolated character identificationprocess. In this work, the correlation area, which onlyproduces successful two characters out put, is considered asthe separated characters output of the original connectedcharacter input.

III EXPERIMENTAL RESULTS

The algorithm was tested with the connected characterskeletons obtained by the skeletonization process, ofhandwritten upper case English characters. Fig. 9, Showsthe complete segmentations of the connected characterskeleton after applying the initial splitter and the complete

Fig. 9. Total meaningful segments from merging

At the character isolation stage, all merged segments(major segments) are combined with each of minorsegments to generate set of different connection sketches(Fig. 10,). These sketches are used to formulate new inputcharacter image to an individual character recognitionsystem, in order to recognize characters of the connectionstring.

I .Fig. IO. Connection sketches for the recognition process

Fig. 11. Final output from the recognition system

IV CONCLUSIONS & RECOMMENDATIONS

In the work presented above, it was found that the proposedalgorithm for the connected character separation was anextremely reliable and relatively simple method forgenerating the fuzzy description of connected characterpatterns, once the characters were properly segmented. Thesegmentation algorithm developed for this work wascapable of separating the connections in the uppercaseEnglish handwritten characters into meaningful segmentsaccurately. The next step in the development ofthis systemwill be a connected character separation algorithm inlowercase handwritten English letters and numeric.

The recognition rate was found to be 100%, when thesame characters that were used for the training, fed to thesystem for identification with a connection in between.Therefore, it can be concluded that the proposed method is100% accurate for machine printed characters, which haveunwanted connection. It was observed that it was veryflexible and simple to implement, compared to otheravailable methods for offline handwritten connectedcharacter recognition.

Although the method used in this work does not dealwith the two stroke connections from two adjacent digitstouching end to end, the system can still be used for theseparation of any handwritten connected characters withsingle or double connections. As a future development itcan also be suggested a development of an identification

1-4244-0322- 7/06/$20. 00 c2006 IEE 545


method, which is capable of separating the abovementioned connections. The whole method suggested inthis paper can be applicable to any handwritten or machinewritten offline character recognition system for anycharacter set in any language. The main requirementswould be, a different connected character separationalgorithm for different character sets and a proper rulebase for the meaningful segmentation.

REFERENCES

[1] K. B. M. R. Batuwita, G.E.M.D.C. Bandara (2005), "Anonline adaptable fuzzy system for offline handwrittenCharacter recognition", Proceedings of 11th WorldCongress of International Fuzzy Systems Association (IFSA2005), Beijing, China, 2005, Springer-Tsinghua, Vol. II,p.1185-1190

[2] LI Guo-hong, SHI Peng-fei (2003), "An approach to offlinehandwritten Chinese character recognition based onsegment evaluation of adaptive duration ", Institute ofImage Processing and Pattern Recognition, ShanghaiJiaotong University, Shanghai, China.

[3] Zongkang Lu, Zheru Chi, Wan- Chi Siu, Pengfei Shi, "ABackground thinning based approach for separating andrecognizing connected handwritten digit strings",Department of Electronic Engineering , the Hong KongPolytechnic University, Hong Kong, Institute of PatternRecognition & Image Processing, Shanghai JiaotongUniversity, China

[4] Akihiro Nomura, Kazuyuki Michishita, Seiichi Uchida, andMasakazu Suzuki, "Detection and Segmentation of TouchingCharacters in Mathematical Expressions", Graduate Schoolof Mathematics, Faculty of Information Science andElectrical Engineering, Kyushu University, 6-10-1 Hakozaki,Higashi-ku, Fukuoka-shi, 812-8581 Japan.

[5] Daekeun Y. , Gyeonghwan K., "An approach for locatingsegmentation points of handwritten digit strings using aneural network ", Dept. of Electronic Engineering SogangUniversity, CPO Box 1142, Seoul 100-611 Korea.

[6] Yi Lu, Beverly Haist, Laurel Harmon, Jhon Trendkle, RobertVogt., "An Accurate and Efficient System for SegmentingMachine-Printed Text ", Environmental Research Instituteof Michigan, Ann Arbor, MI.

[7]. K. B. M. R. Batuwita, G.E.M.D.C. Bandara (2005), "AnImproved Segmentation Algorithm for Individual OfflineHandwritten Character Segmentation", Proceedings of theInternational Conference on Computational Intelligence forModelling, Control and Automation (CIMCA'2005),November 28-30, 2005, Vienna, Austria.

[8]. K. B. M. R. Batuwita, G.E.M.D.C. Bandara (2005), "NewSegmentation Algorithm for Individual Offline HandwrittenCharacter Segmentation", Proceedings of the 2ndInternational Conference on Fuzzy Systems and KnowledgeDiscovery (FSKD'2005), August, 28-30, Changsha, China.

1-4244-0322- 7/06/$20. 00 c2006 IEE 546

11 2006, SriLanka New Segmentation Algorithm...

Documents

Transcript of 11 2006, SriLanka New Segmentation Algorithm...