Thesis on Telugu OCR


description

The document contains the details of the optical character recognition of Telugu script.

Transcript of Thesis on Telugu OCR

  • Optical Character Recognition system for printed Telugu text

    MTech Project Report

    Submitted in partial fulfillment of the requirements for the degree of

    Master of Technology

    by

    Udaya Kumar Ambati

    Roll No: 09305073

    under the guidance of

    Prof. M. R. Bhujade

    Department of Computer Science and Engineering

    Indian Institute of Technology, Bombay

    April 2010

  • Acknowledgements

    I would sincerely like to thank my guide, Prof. M. R. Bhujade, for his motivating support throughout the semester and for the consistent direction that he has given my work. I would also like to thank everyone who helped me throughout my work.

  • Abstract

    Telugu is a language spoken by more than 66 million people of South India. Not much work

    has been reported on the development of optical character recognition (OCR) systems for Telugu

    text. Therefore, it is an area of current research. Some characters in Telugu are made up of

    more than one connected symbol. Compound characters are written by associating modifiers

    with consonants, resulting in a huge number of possible combinations, running into hundreds

    of thousands. A compound character may contain one or more connected symbols. Therefore,

    systems developed for documents of other scripts, like Roman, cannot be used directly for the

    Telugu language.

    This project aims at developing a complete Optical Character Recognition system for printed

    Telugu text. The system segments the document image into lines and words. The features of

    each character are extracted. The extracted features are passed to a Support Vector Machine

    where the characters are classified using a supervised learning algorithm.

  • Contents

    1 Introduction 1

    2 Structure of Telugu text and Segmentation issues[5] 3

    2.1 Characteristics of Telugu script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    2.2 Segmentation issues in OCR of Telugu script . . . . . . . . . . . . . . . . . . . . 6

    3 Preprocessing phase 8

    3.1 Thresholding and noise removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    3.1.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    3.2 Skew detection and correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3.2.1 Skew angle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3.2.2 Image rotation transformation . . . . . . . . . . . . . . . . . . . . . . . . 11

    3.3 Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    3.4 Line Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    3.5 Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.6 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.7 Pattern classification [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    3.7.1 SVM Classifier:[2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    4 Implementation 22

    5 Results 24

    6 Conclusion and Future work 29


  • List of Figures

    2.1 Harshapriya and Godavari fonts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    2.2 Vowels, their associated modifiers (Matras) and their phonetic English representation 4

    2.3 Consonants and their associated modifiers (Matras) and their phonetic English

    representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2.4 Various combinations forming compound characters . . . . . . . . . . . . . . . . 6

    3.1 Original Text lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    3.2 Smoothed Text lines with Histogram . . . . . . . . . . . . . . . . . . . . . . . . . 13

    3.3 Highest peak and vertical line drawn at the middle of highest peak . . . . . . . . 13

    3.4 middle line detection for considering small length text . . . . . . . . . . . . . . . 14

    3.5 (a).Initial segmentation line through the white pixels of horizontal histogram (b).

    Result after considering only the candidate lines from original histogram. . . . . 14

    3.6 Output for word segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    5.1 Home page of the tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    5.2 Displaying the original image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    5.3 Bounding Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    5.4 Line Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    5.5 Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


  • Chapter 1

    Introduction

    During the past few decades, substantial research efforts have been devoted to optical char-

    acter recognition (OCR) [7, 6]. The objective of OCR is the automatic reading of optically sensed document text material, translating human-readable characters into machine-readable codes.

    Research in OCR is popular for its various potential applications in banks, post offices and

    defence organizations. Other applications involve reading aids for the blind, library automation,

    language processing and multimedia design.

    Commercial OCR packages are already available for languages like English. Considerable work

    has also been done for languages like Japanese and Chinese [7]. Recently, work has been done

    in the development of OCR systems for Indian languages. This includes work on recognition of

    Devanagari, Bengali, Kannada and Tamil characters.

    The Indian subcontinent has more than 18 constitutionally recognized languages with several

    scripts but commercial products in Optical Character Recognition(OCR) are very few. Telugu

    is one of the oldest and most popular languages of India. Historically, Telugu has evolved from

    the ancient Brahmi script. It also used features of the Dravidian (Pali) language for script

    generation. In the process of evolution, this script was carved with needles on palm leaves, and

    so, it favored rounded letter shapes. Work on Telugu character recognition is not substantial.

    Motivation In spite of Telugu being the third most widely used language in India, there are only a few OCR systems for Telugu script. This gave us the motivation to approach the problem. Further motivation to develop a Telugu-language OCR is the ongoing digitization of thousands of printed books in Indian languages by both the private and public sectors. For efficient access to these scanned documents, an OCR system specific to printed Telugu text is urgently needed.

    Scope of the report The first section of the report explains the structure of Telugu characters and the associated segmentation issues. The second section explains the algorithm used for noise removal and binarization. The next section explains an efficient algorithm that segments the given scanned document into lines and words. The last section explains the concept of Support Vector Machines (SVMs), the method of feature extraction for Telugu letters, and their classification using an SVM.

    Most document analysis systems can be visualized as consisting of two steps: the pre-processor and the recognizer. In preprocessing, the raw image obtained by scanning a page of text is converted to a form acceptable to the recognizer by extracting individually recognizable characters. The pre-processed image of the character is then processed to obtain meaningful elements, called features; recognition is completed by searching a database of stored feature vectors of all possible Telugu characters for the one that matches the feature vector of the character to be recognized.

    In Indian scripts, one or more vowel and consonant modifiers are attached to the consonant

    forms in a variety of combinations forming compound characters. The total number of possible

    compound characters is of the order of hundreds of thousands. Therefore, the question "What constitutes a character?" assumes many new dimensions for Indian languages. Is a modifier an

    independent character or not? Does being treated as an independent character depend on the

    way it is written, i.e. whether it is written touching the character it is to modify or separated

    from it? A more detailed discussion of these issues for Telugu script is provided in Sect. 2.

    In this project, an approach has been presented for Telugu.


  • Chapter 2

    Structure of Telugu text and

    Segmentation issues [5]

    2.1 Characteristics of Telugu script

    Telugu is a syllabic language: words are written exactly as they are pronounced, which avoids confusion and spelling problems. In that sense, it is a WYSIWYG (what you see is what you get) script. This form of script is considered by linguists to be the most scientific. The Telugu script consists of 18 vowels, 36 consonants and two dual symbols. Of

    the vowels, sixteen are in common usage. Fig 2.1 lists some of the vowels in Harshapriya and

    Godavari fonts.

    All vowels and consonants, along with their modifiers and phonetic equivalent symbols, are

    listed in Fig 2.2 and Fig 2.3, respectively. Compound characters in Telugu follow some phonetic

    [5]

    Figure 2.1: Harshapriya and Godavari fonts


  • [5]

    Figure 2.2: Vowels, their associated modifiers (Matras) and their phonetic English representation

    sequences that can be represented in grammatical form, as shown in Fig 2.4. Base consonants

    are vowel-suppressed consonants. These are typically used when words of other languages are

    written in Telugu. The third combination, i.e. of a base consonant and a vowel, is an extremely

    important and often used combination in Telugu script. As there are 38 (36 + 2 dual symbols) base consonants and 16 vowels, logically, 608 (38 × 16 = 608) combinations are possible.

    The combinations from the fourth to the seventh combinations are categorized under conjunct

    formation. Telugu has a special feature of providing a unique symbol of dependent form for each

    of the consonants. In all conjunct formations, the first consonant appears in its actual form.

    The dependent vowel sign and the second (third) consonant act as dependent consonants in the

    formation of the complete character. Combinations from the fourth to the seventh combinations generate a large number of conjuncts in Telugu script. The fourth combination logically generates 23,104 (38 × 38 × 16) different compound characters. This is an important combination. The fifth combination is similar to the fourth combination. The second and the third consonants act

    as the dependent consonants. Logically 746,496 different compound characters are possible in

    this combination, but their frequency of appearance in the text is less when compared to the

    previous combination. In the sixth and seventh combinations, 1,296 combinations and 46,656

    combinations, respectively, are logically possible.

    The sixth and seventh combinations are used when words from other languages are written in

    Telugu script. In these combinations, the vowel is omitted. The first consonant appears as a

    base consonant and the other consonants act as dependent consonants.
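    The combination counts quoted above follow directly from the script's inventory and can be checked with a couple of lines of arithmetic (assuming 38 base consonants = 36 consonants + 2 dual symbols, and the 16 vowels in common usage):

```python
# Sanity check of the combination counts quoted above.
base_consonants = 36 + 2  # 36 consonants plus 2 dual symbols
vowels = 16               # vowels in common usage

assert base_consonants * vowels == 608        # third combination: C + V
assert base_consonants**2 * vowels == 23_104  # fourth combination: C + C + V
```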


  • [5]

    Figure 2.3: Consonants and their associated modifiers (Matras) and their phonetic English

    representation


  • [5]

    Figure 2.4: Various combinations forming compound characters

    2.2 Segmentation issues in OCR of Telugu script

    A connected region in an image of Telugu text may be:

    1. A part of a character or a compound character

    2. A character

    3. A compound character

    This complicates the segmentation issues. The areas occupied by individual characters in a

    line of text are not in a horizontal line, unlike in English text, and in some cases, the area of

    a single complex character formation can be equal to the sum of the areas of two individual

    characters. The segmentation algorithm has to take these factors into consideration. The basic

    question to be answered in segmentation is: What are the symbols that will be isolated during

    segmentation and provided to the recognizer for completing the OCR?


  • The first approach is to treat all types of conjuncts, together with the base consonants, as

    units for the purpose of segmentation and further recognition. This is not preferable for a

    number of reasons. The first reason is that the sheer number of possibilities has been shown

    to be enormous. The second reason is that, in compound characters like KRAI, we have to

    identify all the three parts, i.e. those below and on the left, as being together in the same compound

    character, although they are not connected in the image. This is, in general, difficult because the

    association information is difficult to generate until the recognition process is at least partially

    completed, and the reason we are segmenting is to perform this recognition process. This is the

    catch-22 situation referred to earlier, and, therefore, treating all types of conjuncts together is not

    possible. The second alternative is to attempt to isolate the base consonants, vowel modifiers,

    etc. This is difficult and leads to unmanageable complications at the segmentation stage where

    the symbols are yet to be recognized. This is primarily because the symbols are full of curves

    and their separation is not clear. However, this is a popular approach for Indian scripts like

    Devanagari and Bangla [3].


  • Chapter 3

    Preprocessing phase

    3.1 Thresholding and noise removal

    The task of thresholding is to extract the foreground from the background. Generally an OCR system expects text printed against a clean background. Usually a simple global binarization technique is adopted, which does not handle well text printed against shaded or textured backgrounds and/or embedded in images.

    In this project, a simple yet effective algorithm is proposed for document image binarization

    and cleanup. It is especially robust for extracting text from images.

    There are basically two classes of binarization techniques: global and adaptive. Global methods

    binarize the entire image using a single threshold. For example, a typical OCR system separates

    text from background by global thresholding [12, 8]. A simple way to automatically select a global threshold is to pick the value at the valley of the intensity histogram of the image, assuming that

    there are two peaks in the histogram, one corresponding to the foreground and the other to the

    background. Methods have also been proposed to facilitate more robust valley picking.

    There are problems with the global thresholding paradigm. First, due to noise and poor

    contrast, many documents do not have well differentiated foreground and background intensi-

    ties. Second, the bimodal histogram assumption is not always valid in the case of complicated

    documents such as photographs and advertisements. Third, the foreground peak is often overshadowed by other peaks, which makes the valley detection difficult or impossible. Some research

    has been carried out to overcome these problems. For example, weighted histograms [1] are used to balance the size difference between the foreground and background, and/or to convert the valley-finding into maximum peak detection. Minimum-error thresholding models the foreground and background intensity distributions as Gaussian distributions, and the threshold is selected to minimize the classification error. Otsu [9] models the intensity histogram as a probability distribution, and the threshold is chosen to maximize the separability of the resultant background and foreground classes. Similarly, entropy measures have been used to select the threshold which maximizes the sum of the background and foreground entropies.

    In contrast, adaptive algorithms compute a threshold for each pixel based on information

    extracted from its neighborhood. For images in which the intensity ranges of foreground objects

    and backgrounds entangle, different thresholds must be used for different regions.
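    As an illustration of the global paradigm, Otsu's method mentioned above can be sketched in a few lines. This is a minimal NumPy sketch, not the project's code:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's global threshold: treat the intensity histogram as a
    probability distribution and pick the threshold that maximizes the
    between-class variance (i.e. class separability)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist.astype(float) / hist.sum()
    mu_total = np.dot(np.arange(256), p)  # overall mean intensity
    best_t, best_var = 0, -1.0
    w0, mu0_sum = 0.0, 0.0
    for t in range(256):
        w0 += p[t]            # weight of class 0 (intensities <= t)
        mu0_sum += t * p[t]
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = mu0_sum / w0
        mu1 = (mu_total - mu0_sum) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

    On a clearly bimodal image the returned threshold falls between the two intensity modes, which is exactly the separability criterion described in the text.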

    3.1.1 The Algorithm

    The algorithm proposed by Wu and Manmatha [11] works under the assumption that the text in the input image, or in a region of the input image, has more or less the same intensity value. The unique feature of this algorithm is that it works well even if the text is printed against a shaded or hatched background.

    The following are the steps in the algorithm:

    1. Smooth the input text chip.

    2. Compute the intensity histogram of the smoothed chip.

    3. Smooth the histogram using a low-pass filter.

    4. Pick a threshold at the first valley counted from the left side of the histogram.

    5. Binarize the smoothed text chip using the threshold.

    A low-pass Gaussian filter is used to smooth the text chip in step 1. The smoothing operation affects the background more than the text because the text is normally of lower frequency than

    the shading. Thus it cleans up the background.


  • The histogram generated by step 2 is often jagged, hence it needs to be smoothed to allow

    the valley to be detected. Again a Gaussian filter is used for this purpose.

    Text is normally the darkest item in the detected chips. Therefore, a threshold is picked

    at the first valley closest to the darkest side of the histogram. To extract text against a darker background, the threshold at the last valley is picked instead.
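    The five steps above can be sketched as follows. This is a minimal NumPy illustration of the Wu–Manmatha-style procedure, not the project's implementation; the sigma values and the simple local-minimum valley test are assumptions of this sketch:

```python
import numpy as np

def _gauss_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def _smooth(a, sigma, axis=-1):
    # 1-D Gaussian convolution applied along one axis.
    k = _gauss_kernel(sigma)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, a)

def binarize_text_chip(gray, smooth_sigma=1.0, hist_sigma=2.0):
    """Binarize a grayscale text chip by picking the first valley of its
    smoothed intensity histogram (dark text on a lighter background)."""
    # Step 1: low-pass filter the chip (separable Gaussian) to clean shading.
    smoothed = _smooth(_smooth(gray.astype(float), smooth_sigma, axis=0),
                       smooth_sigma, axis=1)
    # Step 2: intensity histogram of the smoothed chip.
    hist, _ = np.histogram(smoothed, bins=256, range=(0, 256))
    # Step 3: smooth the jagged histogram so the valley can be detected.
    hist = _smooth(hist.astype(float), hist_sigma)
    # Step 4: pick the threshold at the first valley from the dark (left) side.
    threshold = 128  # fallback if no clear valley exists
    for t in range(1, 255):
        if hist[t] < hist[t - 1] and hist[t] <= hist[t + 1]:
            threshold = t
            break
    # Step 5: binarize -- pixels darker than the threshold become foreground.
    return (smoothed < threshold).astype(np.uint8)
```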

    3.2 Skew detection and correction

    Skew estimation of a document refers to the process of finding the angle of inclination of the document with respect to the horizontal axis, which is often introduced during document scanning. For any ensuing document image processing task (such as page layout analysis, OCR, document retrieval, etc.) to yield accurate results, the skew angle must be detected and corrected beforehand. The algorithms for skew estimation can mainly be classified as those based on (i) projection profiles (PP), (ii) nearest neighbors (NN), (iii) the Hough transform (HT) and (iv) cross-correlation. We used a variation of the Hough transform method [4] to detect skew in our project.

    3.2.1 Skew angle Detection

    The skew angle detection process used in this project can be divided into three steps:

    detection point determination

    coarse skew angle estimation

    Hough transformation.

    First, the skew image is vertically separated into several blocks, each block consisting of one hundred rows. Then the locations of the detection points in each block are recorded to estimate the coarse skew angle θe. The coarse skew angle is estimated by selecting the angle which possesses the most detection points. Finally, the accurate skew angle is determined by choosing the peak in the Hough plane within the small range [θe − 3°, θe + 3°]. A detailed description of the three steps to detect the skew angle follows.


  • Step 1. Detection point (DP) determination First of all, the input image is vertically

    divided into several blocks. According to our empirical study, 100 rows are chosen as the size of

    each block. A detection point is defined as the left-most black pixel in each block. Each divided

    block is scanned from left to right and then from top to bottom to find the detection point. If

    the scanned pixel is not a background pixel, it is declared as a detection point. Following the

    above procedure, we can find all detection points embedded in the input image. These detection

    points are then fed into Step 2 for the estimation of the coarse skew angle.

    Step 2. Coarse skew angle estimation In this step, the coarse skew angle θe is determined by selecting the majority of the local skew angles which are generated from the detection points. Before the majority selection procedure, the local skew angle θi has to be calculated first. Consider two detection points DPi−1(xi−1, yi−1) and DPi(xi, yi) in two consecutive divided blocks Bi−1 and Bi. The local skew angle θi is defined as

    θi = tan⁻¹(Δyi / Δxi) = tan⁻¹((yi − yi−1) / (xi − xi−1))    (3.1)

    Here, the value Δyi/Δxi is adopted to represent the local skew angle θi, to avoid the computational burden of the tan⁻¹ function. The coarse skew angle θe is then assigned as the majority of the local skew angles.
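    Steps 1 and 2 can be sketched as below, assuming a binary page image with 1 for text pixels; rounding each local angle to a whole degree before the majority vote is an assumption of this sketch:

```python
import math
from collections import Counter
import numpy as np

def coarse_skew_angle(binary, block_rows=100):
    """Estimate the coarse skew angle (in degrees) from the left-most black
    pixel ("detection point") of each horizontal block of the page image."""
    points = []
    for top in range(0, binary.shape[0], block_rows):
        block = binary[top:top + block_rows]
        ys, xs = np.nonzero(block)
        if len(xs) == 0:
            continue  # empty block: no detection point
        j = np.argmin(xs)  # left-most black pixel in this block
        points.append((xs[j], top + ys[j]))
    # Local skew angle between detection points of consecutive blocks,
    # rounded to whole degrees so a majority vote is meaningful.
    angles = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 != x0:
            angles.append(round(math.degrees(math.atan2(y1 - y0, x1 - x0))))
    if not angles:
        return 0
    return Counter(angles).most_common(1)[0][0]
```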

    Step 3. Hough transformation Following the previous two steps, the search range of the skew angle in the Hough plane is reduced from [−90°, 90°] to [θe − 3°, θe + 3°]. Last, the left-most pixel Pi(xi, yi) in each row of the x–y plane is transformed to the Hough plane by making use of the following equation:

    ρi = xi cos θ + yi sin θ    (3.2)

    where θ is located in the range [θe − 3°, θe + 3°]. The skew angle of the input document can thereby be determined by selecting the angle with the largest value in the transformed Hough plane.

    3.2.2 Image rotation transformation

    In this section, a skew image is corrected to generate a non-skew image by rotating it over the skew angle θ obtained in Section 3.2.1. The rotation transformation is a mapping


  • function f(x, y) which maps the coordinates of pixels in the original image to those in the output

    image. However, some pixel values in the output image which correspond to the pixels in the

    original image cannot be defined via the mapping function f because the range and domain

    defined in image processing are integers. In the program implementation, we can devise an inverse function f⁻¹ to define all output pixel values from the original image. Each pixel value in the output image can thereby be determined from the value in the original image via the inverse function f⁻¹.

    Geometrically, the value of pixel P′(x′, y′) in the output image can be determined from that of the corresponding pixel P(x, y) in the original image. The location of pixel P can be obtained from the location of pixel P′ via the following function f⁻¹:

    (x, y) = f⁻¹(x′, y′) = (x′ cos θ + y′ sin θ, −x′ sin θ + y′ cos θ)    (3.3)
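    The inverse-mapping idea of equation (3.3) can be sketched as follows; rotating about the image centre and nearest-neighbour rounding are assumptions of this illustration, not details stated in the report:

```python
import math

def rotate_image(img, theta_deg, bg=0):
    """Correct skew by inverse mapping: each output pixel (xp, yp) looks up
    its source pixel via f^-1, so no output pixel is left undefined.
    `img` is a list of lists (rows) of pixel values."""
    h, w = len(img), len(img[0])
    t = math.radians(theta_deg)
    cy, cx = h / 2.0, w / 2.0  # rotate about the image centre
    out = [[bg] * w for _ in range(h)]
    for yp in range(h):
        for xp in range(w):
            dx, dy = xp - cx, yp - cy
            # inverse rotation of the output coordinates into the input image
            x = dx * math.cos(t) + dy * math.sin(t) + cx
            y = -dx * math.sin(t) + dy * math.cos(t) + cy
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:
                out[yp][xp] = img[yi][xi]
    return out
```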

    3.3 Connected Components

    The connected components are computed for the whole document using a recursive labeling

    algorithm. The algorithm works by first negating the whole image: each black pixel is replaced by −1 and each white pixel by 0. Each pixel in this image is then checked for being a text pixel. If a pixel is a text pixel, a search function that takes the pixel and its coordinates examines its neighbors. This function recursively visits the black pixels that are part of the same component and labels them; the scan then continues until it reaches a new component.
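    A sketch of the labeling procedure is given below. It replaces the recursion described above with an explicit stack (an implementation choice of this sketch, since deep recursion can overflow on large components), but visits pixels in the same way:

```python
def label_components(image):
    """Label 4-connected components of a binary image (list of lists,
    1 = text pixel). Returns the label map and the component count."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and labels[r][c] == 0:
                current += 1  # a new, unlabeled component starts here
                stack = [(r, c)]
                while stack:  # flood-fill this component
                    y, x = stack.pop()
                    if (0 <= y < rows and 0 <= x < cols
                            and image[y][x] == 1 and labels[y][x] == 0):
                        labels[y][x] = current
                        stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return labels, current
```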

    3.4 Line Segmentation

    There are several steps in the line segmentation method proposed by Priyanka and Srikanth [10]; they are systematically described below.

    Step 1: Run length smearing A smoothing algorithm is applied to the text of a document page. In this step we use the run length smearing technique [12] to increase the strength of the histogram. Here we consider a consecutive run of white pixels between two black pixels and compute the length of that white run. If the length of the white run is less than five times the stroke width, we fill the white run with black. Fig 3.1 shows two original text lines and Fig 3.2 shows the smoothed text lines with the horizontal histogram corresponding to each text line.
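    Run length smearing of one pixel row can be sketched as below; `max_gap` stands for the five-times-the-stroke-width threshold mentioned above:

```python
def smear_runs(row, max_gap):
    """Horizontal run-length smearing: white runs (0s) shorter than
    `max_gap` that lie between two black pixels (1s) are filled black."""
    out = list(row)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1  # advance to the end of this white run
            # fill only interior gaps bounded by black pixels on both sides
            if 0 < i and j < n and (j - i) < max_gap:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out
```

    Leading and trailing white runs are left untouched, since they are page margin rather than inter-character gaps.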

    [10]

    Figure 3.1: Original Text lines

    [10]

    Figure 3.2: Smoothed Text lines with Histogram

    Step 2: Recursive procedure to get middle lines for segmentation From the histogram of every line of the smoothed document page, we consider the highest peak of the projection profile. We then find the middle point of the length of the highest peak and draw a vertical line from top to bottom at that middle point, as shown in Fig 3.3.

    [10]

    Figure 3.3: Highest peak and vertical line drawn at the middle of highest peak

    This step continues by finding the middle line of each peak of the histogram. Along the vertical line (the line that passes through the middle point of the highest peak) we find the middle points of the peaks, and we draw horizontal lines based on these middle points of the histogram widths. In some cases not every peak of the histogram crosses this vertical line. For these cases we find the distances between middle lines and compute the average of these distances. If the distance between two middle lines is greater than twice the average value, then we assume that the region contains


  • [10]

    Figure 3.4: Middle line detection for considering small-length text

    one or more text lines, and we need recursive segmentation for that region. After getting that region (the region between two middle lines of peaks), we apply the same procedure to find the vertical line through the middle of the highest peak and the middle lines of that particular region. This procedure runs recursively until we find the middle lines of the particular image, as shown in Fig 3.4.

    Step 3: Finding candidate lines In this step, from the starting point of the first histogram, we vertically scan the region between the first middle line and the second middle line of the histogram until we get the first two white pixels. We consider these two white pixels as minimum points: the line where we get the first white pixel is the first minimum, and the line where we get the second white pixel is the second minimum. Now we calculate the vertical distances from the first middle line to the first minimum point and from the first middle line to the second minimum point. Of these two distances we take the maximum, and the minimum point with the maximum vertical distance becomes the separator between the two consecutive middle lines. In this way we find all line separators between consecutive middle lines, as shown in Fig 3.5. If we considered only the point with the minimum number of black pixels in the histogram as the separator line, we would get many errors.

    [10]

    Figure 3.5: (a) Initial segmentation line through the white pixels of the horizontal histogram. (b) Result after considering only the candidate lines from the original histogram.


  • 3.5 Word Segmentation

    In the word segmentation method, a text line is taken as input. After a text line is segmented, it is scanned vertically. If in one vertical scan two or fewer black pixels are encountered, the scan is denoted by 0; otherwise the scan is denoted by the number of black pixels. In this way a vertical projection profile is constructed. Now, if in the profile there exists a run of at least k1 consecutive 0s, then the midpoint of that run is considered the boundary of a word. The value of k1 is taken as 1/3 of the text line height. Word segmentation results for a Telugu text line are shown in Fig 3.6.

    [10]

    Figure 3.6: Output for word segmentation
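    The procedure above can be sketched as follows, assuming the text line is given as a binary 2-D array (1 = black); returning the boundary x-coordinates, and ignoring zero runs at the margins, are choices of this sketch:

```python
def segment_words(line_img, line_height):
    """Find word boundaries in a binary text-line image (list of rows,
    1 = black): a vertical scan with two or fewer black pixels counts as 0,
    and a run of at least k1 = height/3 zero scans marks a word boundary
    at its midpoint."""
    width = len(line_img[0])
    # Vertical projection profile with near-empty scans forced to 0.
    profile = []
    for x in range(width):
        col = sum(row[x] for row in line_img)
        profile.append(0 if col <= 2 else col)
    k1 = max(1, line_height // 3)
    boundaries = []
    x = 0
    while x < width:
        if profile[x] == 0:
            start = x
            while x < width and profile[x] == 0:
                x += 1
            # interior zero runs of length >= k1 separate two words
            if x - start >= k1 and start > 0 and x < width:
                boundaries.append((start + x) // 2)  # midpoint of the run
        else:
            x += 1
    return boundaries
```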

    3.6 Feature Extraction

    Feature Extraction [5]: The output of the normalization phase gives a normalized image of size N × N. Real-valued directional features are calculated for each normalized image of size N × N. These are based on the percentage of pixels in each direction range within each partition. An adaptive gradient magnitude threshold rt is computed over the whole character image gradient map. This threshold is needed to filter out spurious responses of the Sobel operator used to find the gradients. The threshold value rt is computed as

    rt = (Σi,j r(i, j)) / (D1 · D2)

    where r(i, j) is the gradient magnitude at pixel (i, j) and D1 × D2 is the size of the gradient map. Thresholding is performed to nullify the pixels whose gradient magnitude values are below the computed threshold.

    The feature vector is extracted based on the direction of the gradient at each pixel. We divided the whole character image into M × N partitions; in our project we selected M = N = 8. The directions of the gradient are quantized into K values, so each pixel now has a gradient direction value from 1 to K. The percentage of pixels in each partition with direction quantized to k is calculated; thus each partition gives us K such values, and in total we have an M × N × K-dimensional feature vector for each character image. We chose the value K = 12. In our project we have a 192-dimensional feature vector for each normalized character image.

    The steps to extract the feature vector are as follows. For each connected component:

    Obtain the bounding box of the connected component, eliminating the blank surrounding space.

    Calculate the gradient magnitude and direction at each pixel.

    Calculate the adaptive threshold of the gradient magnitude and perform thresholding to obtain the new gradient direction at each pixel.

    Partition the adaptive gradient direction map and extract the complete feature vector.
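    The steps above can be sketched as one NumPy routine. The explicit-shift Sobel stencils, using the mean gradient magnitude as the adaptive threshold, and quantizing directions over [0, 2π) are assumptions of this sketch, not details fixed by the report:

```python
import numpy as np

def directional_features(img, M=8, N=8, K=12):
    """Gradient-direction features for a normalized character image
    (2-D float array): Sobel gradients, a mean-magnitude threshold,
    K quantized directions, and per-partition direction percentages."""
    # Sobel gradients via explicit shifts on an edge-padded copy.
    p = np.pad(img.astype(float), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    mag = np.hypot(gx, gy)
    # Adaptive threshold: mean gradient magnitude over the whole image.
    rt = mag.mean()
    valid = mag > rt
    # Quantize direction [0, 2*pi) into K bins 0..K-1.
    direction = (np.arctan2(gy, gx) + 2 * np.pi) % (2 * np.pi)
    q = np.minimum((direction / (2 * np.pi / K)).astype(int), K - 1)
    h, w = img.shape
    feats = []
    for i in range(M):
        for j in range(N):
            cell = (slice(i * h // M, (i + 1) * h // M),
                    slice(j * w // N, (j + 1) * w // N))
            v = valid[cell]
            total = v.sum()
            for k in range(K):
                frac = ((q[cell] == k) & v).sum() / total if total else 0.0
                feats.append(frac)
    return np.array(feats)  # length M * N * K
```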

    3.7 Pattern classification [2]

    The feature vector extracted from the normalized image has to be assigned a label using a pattern classifier [2]. There are many methods for designing pattern classifiers, such as the Bayes classifier based on density estimation, neural networks, linear discriminant functions, nearest neighbor classification based on prototypes, etc. In this system we have used the Support Vector Machine (SVM) classifier. SVMs represent a pattern classification method which grew out of recent work in statistical learning theory. The solution offered by the SVM methodology for the two-class pattern recognition problem is theoretically elegant, computationally efficient, and is often found to give better performance by way of improved generalization. In the next subsection we provide a brief overview of SVMs.

    3.7.1 SVM Classifier [2]

    The SVM classifier is a two-class classifier based on the use of discriminant functions. A discriminant function represents a surface which separates the patterns so that the patterns from the two classes lie on opposite sides of the surface. The SVM is essentially a separating surface which is optimal according to a criterion explained below.

    Consider a two-class problem where the class labels are denoted by +1 and −1. Given a set of labeled (training) patterns {(xi, yi)}, yi ∈ {−1, +1}, a separating hyperplane is represented by (w, b), where w ≠ 0, such that

    wᵀxi + b > 0 for i : yi = +1;
    wᵀxi + b < 0 for i : yi = −1.    (3.4)

    Here, wᵀxi denotes the inner product between the two vectors, and g(x) = wᵀx + b is the linear discriminant function.

    In general, the set may not be linearly separable. In such a case one can employ the generalized linear discriminant function defined by

    g(x) = wᵀφ(x) + b

    where φ(·) is a mapping of the input patterns into a higher-dimensional space.

  • Let zi = (xi) Thus now we have a training sample (zi, yi) to learn a separating hyperplane

    in

Maximizing the margin of separation leads, through Lagrange multipliers alpha_i >= 0 (one per training pattern), to a weight vector of the form w = sum_i alpha_i y_i z_i. Let S = {i : alpha_i > 0}. Since alpha_i = 0 for i not in S, we can rewrite this as

    w = sum_{i in S} alpha_i y_i z_i.        (3.11)

The patterns {z_i : alpha_i > 0} are called the support vectors. From (3.11) it is clear that w is a linear combination of the support vectors, and hence the name SVM for the classifier. The support vectors are those patterns which are closest to the hyperplane and are sufficient to completely define the optimal hyperplane. Hence these patterns can be considered the most important training examples.

To learn the SVM, all we need are the optimal Lagrange multipliers of the margin-maximization problem. These can be found efficiently by solving its dual, which is the optimization problem: find alpha_i, i = 1, ..., l, to

    Maximize:  sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j z_i^T z_j,
    Subject to: alpha_i >= 0, i = 1, 2, ..., l;  sum_{i=1}^{l} alpha_i y_i = 0.        (3.12)

By solving this problem we obtain the alpha_i, and from these we get w and b. It may be noted that the dual (3.12) is a quadratic optimization problem of dimension l (recall that l is the number of training patterns), with one equality constraint and nonnegativity constraints on the variables. This is so irrespective of how complicated the mapping phi is. Once the SVM is obtained, the classification of any new feature vector x is based on the sign of (recall that z = phi(x))

    f(x) = phi(x)^T w + b = sum_{i in S} alpha_i y_i phi(x_i)^T phi(x) + b,        (3.13)

where we have used (3.11). Thus, both while solving the optimization problem (3.12) and while classifying a new pattern, the training pattern vectors x_i enter the picture only through inner products phi(x_i)^T phi(x_j). This is also the only way phi enters the picture. Suppose, then, that we have a function K such that K(x_i, x_j) = phi(x_i)^T phi(x_j); such a kernel function lets us compute these inner products without ever evaluating phi explicitly. Some popular kernels are listed in Table 3.1.

Table 3.1: Some popular kernels for SVMs.

    Type of kernel      K(x_i, x_j)                            Comments
    Polynomial kernel   (x_i^T x_j + 1)^p                      Power p is specified a priori by the user
    Gaussian kernel     exp(-||x_i - x_j||^2 / (2 sigma^2))    The width sigma^2, common to all the kernels, is specified a priori
    Perceptron kernel   tanh(beta_0 x_i^T x_j + beta_1)        Mercer's condition is satisfied only for certain values of beta_0 and beta_1

Given any symmetric function K, Mercer's theorem gives the conditions under which it corresponds to an inner product in some feature space and can therefore be used as a kernel.
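The kernels of Table 3.1 can be sketched directly; the parameter names (p, sigma^2, beta_0, beta_1) follow the table, and the default values here are illustrative only:

```python
import math

def polynomial_kernel(x, y, p=2):
    """(x.y + 1)^p -- the power p is chosen a priori by the user."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** p

def gaussian_kernel(x, y, sigma2=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)) -- one width for all kernels."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma2))

def perceptron_kernel(x, y, beta0=1.0, beta1=0.0):
    """tanh(beta0 x.y + beta1) -- Mercer's condition holds only for
    certain values of beta0 and beta1."""
    return math.tanh(beta0 * sum(a * b for a, b in zip(x, y)) + beta1)
```

Each function consumes two feature vectors and returns the scalar K(x_i, x_j) that replaces the inner product z_i^T z_j in the dual.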

When the two classes overlap even in the feature space, slack variables xi_i are introduced to permit some violations of separability. With this, we can change the optimization problem to

    Minimize:  (1/2)||w||^2 + C sum_{i=1}^{l} xi_i,        (3.14)
    Subject to: 1 - y_i(z_i^T w + b) - xi_i <= 0, i = 1, ..., l;  xi_i >= 0, i = 1, ..., l.        (3.15)

Here the xi_i can be thought of as penalties for violating the separability constraints; they are now also variables over which the optimization is performed. The constant C is a user-specified parameter of the algorithm, and as C tends to infinity we recover the original problem. It turns out that the dual of this problem is the same as (3.12), except that the nonnegativity constraint on alpha_i is replaced by 0 <= alpha_i <= C. The optimal values of the new variables xi_i are irrelevant to the final SVM solution.

To sum up, the SVM method for learning two-class classifiers is as follows. We choose a kernel function and some value for the constant C in (3.14). Then we solve the dual, which is the same as (3.12) except that the variables alpha_i also have an upper bound, namely C (noting that K(x_i, x_j) is used in place of z_i^T z_j in (3.12)). Once we solve this problem, all we need to store are the nonzero alpha_i and the corresponding x_i (the support vectors). Using these, given any new feature vector x, we can calculate the output of the SVM, f(x), through (3.13). The classification of x is +1 if the output of the SVM is positive; otherwise it is -1.
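Assuming the nonzero alpha_i, the labels y_i, the support vectors x_i, and the bias b have already been obtained from the dual, the stored classifier of (3.13) amounts to the following sketch (a linear kernel is shown only as a stand-in for whichever kernel was chosen):

```python
def linear_kernel(x, y):
    """Plain inner product; stands in for any kernel K(x_i, x)."""
    return sum(a * b for a, b in zip(x, y))

def svm_output(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """f(x) = sum over support vectors of alpha_i y_i K(x_i, x) + b,
    as in (3.13)."""
    return b + sum(a * y * kernel(sv, x)
                   for sv, a, y in zip(support_vectors, alphas, labels))

def svm_classify(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """Sign of the SVM output: +1 if positive, otherwise -1."""
    return 1 if svm_output(x, support_vectors, alphas, labels, b, kernel) > 0 else -1
```

Only the support vectors and their multipliers need to be stored; all other training patterns drop out of the sum.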

SVM classifier for OCR We have used SVM classifiers for labeling each segment of a word. As explained earlier, we train a number of two-class classifiers (SVMs), each one distinguishing one class from all the others; thus each of our class labels has an associated SVM. A test example is assigned the label of the class whose SVM gives the largest positive output. If no SVM gives a positive output, the example is rejected. The output of an SVM measures the distance of the example from its separating hyperplane in the feature space; hence, the higher the (positive) output for a given pattern, the higher the confidence in classifying it.
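The one-vs-rest labelling rule described above can be sketched as follows; the mapping name class_svms and the use of None to signal rejection are illustrative choices, not the tool's actual API:

```python
def label_segment(feature_vec, class_svms):
    """class_svms maps a class label to that class's SVM output
    function f(x).  Returns the label whose SVM gives the largest
    positive output, or None (reject) if no output is positive."""
    best_label, best_out = None, 0.0
    for label, f in class_svms.items():
        out = f(feature_vec)
        if out > best_out:  # only strictly positive outputs can win
            best_label, best_out = label, out
    return best_label
```

Because best_out starts at 0.0, a segment on which every SVM is negative falls through to rejection, matching the rule in the text.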


Chapter 4

Implementation

Developing an OCR for printed Telugu text consists of two stages: pre-processing and recognition. In the pre-processing phase, thresholding and noise removal are implemented using the algorithm specified in Section 3.1.1. Skew detection and removal are implemented using a variant of the Hough transform.

The OCR begins by taking the document image as input. The image is first converted to grayscale, and the grayscale image is then binarized using the method described in the Thresholding section. Connected components in the whole document, together with their bounding boxes, are found using a two-pass algorithm. These connected components are then used to segment the document into lines. Line segmentation takes the array of connected components as a parameter and returns the top and bottom row numbers of each line with respect to the image coordinate system.
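A minimal sketch of the two-pass connected-component step mentioned above, assuming 4-connectivity and a binary image given as a list of 0/1 rows (the report does not specify these details, so the actual implementation may differ):

```python
def connected_components(img):
    """Two-pass 4-connected labelling of a binary image.  Returns a
    dict mapping each final label to its bounding box
    (top, left, bottom, right)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    nxt = 1
    for r in range(h):                     # pass 1: provisional labels
        for c in range(w):
            if not img[r][c]:
                continue
            nbrs = [labels[r-1][c] if r else 0, labels[r][c-1] if c else 0]
            nbrs = [l for l in nbrs if l]
            if not nbrs:
                labels[r][c] = nxt
                parent[nxt] = nxt
                nxt += 1
            else:
                m = min(nbrs)
                labels[r][c] = m
                for l in nbrs:             # record label equivalences
                    union(l, m)
    boxes = {}
    for r in range(h):                     # pass 2: resolve + bound
        for c in range(w):
            if labels[r][c]:
                root = find(labels[r][c])
                t, le, b, ri = boxes.get(root, (r, c, r, c))
                boxes[root] = (min(t, r), min(le, c), max(b, r), max(ri, c))
    return boxes
```

The bounding boxes returned here are exactly what the line- and word-segmentation steps consume.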

Each text line is given as input to the word segmentation phase, which segments the line into words and returns the left and right column numbers of each word. The connected components belonging to each word are then grouped.

Each component is normalized into a 48×48 image, which is given as input to the feature extraction function. This function takes an image and returns a feature vector of 192 dimensions using the Sobel operator and the adaptive gradient threshold. The feature vector is then given as input to the SVM classifier, which is trained using the SVM training phase

described in later sections.
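The 48×48 normalization step above can be sketched as a nearest-neighbour rescale; this interpolation choice is an assumption, since the report does not state which method the tool uses:

```python
def normalize(img, size=48):
    """Rescale a binary component image (list of 0/1 rows) to
    size x size using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]
```

The integer index r * h // size picks, for each output pixel, the nearest source pixel, so the sketch works for both enlarging and shrinking a component.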

All these functions are implemented using the Java Advanced Imaging (JAI) package from Sun Microsystems, in the NetBeans 6.8 IDE. LibSVM is the package used for training and applying the SVM classifier.


Chapter 5

Results

Figure 5.1: Home page of the tool

Figure 5.2: Displaying the original image

Figure 5.3: Bounding Connected Components

Figure 5.4: Line Segmentation

Figure 5.5: Word Segmentation

Chapter 6

Conclusion and Future work

Conclusion The main aim of this project is to develop an Optical Character Recognition system for printed Telugu text. Telugu script has a complex structure, with thousands of combinations of vowels, consonants and consonant modifiers; hence detection and recognition of basic symbols helps to reduce the number of classes. This project develops a tool that takes a document image as input and displays the Unicode of each character. This Unicode can then be used to render the corresponding Telugu text.

Future work Recognition accuracy can be further increased by post-processing that makes use of the associations between basic symbols: for example, it is known that some modifiers occur very frequently with certain characters, while other combinations occur very rarely. The feature vector can further be used for recognizing handwritten Telugu script, and the final output of the proposed system can be used for text-to-speech conversion.


Bibliography

[1] Histogram modification for threshold selection. IEEE Transactions on Systems, Man and Cybernetics, 9(1):38–52, January 1979.

[2] T. V. Ashwin and P. S. Sastry. A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana, 27:35–58, 2002.

[3] B. B. Chaudhuri and U. Pal. A complete printed Bangla OCR system. Pattern Recognition, 31(5):531–549, 1998.

[4] Huei-Fen Jiang, Chin-Chuan Han, and Kuo-Chin Fan. A fast approach to the detection and correction of skew documents. Pattern Recognition Letters, 18(7):675–686, 1997.

[5] C. Vasantha Lakshmi and C. Patvardhan. An optical character recognition system for printed Telugu text. Pattern Analysis and Applications, 7:190–204, 2004. doi:10.1007/s10044-004-0217-2.

[6] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029–1058, July 1992.

[7] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):38–62, January 2000.

[8] L. O'Gorman. Binarization and multithresholding of document images using connectivity. CVGIP: Graphical Models and Image Processing, 56(6):494–506, 1994.

[9] N. Otsu. A threshold selection method from grey-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66, January 1979.

[10] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda. Line and word segmentation approach for printed documents. IJCA, Special Issue on RTIPPR, (1):30–36, 2010. Published by Foundation of Computer Science.

[11] Victor Wu and R. Manmatha. Document image clean-up and binarization. In Proc. SPIE Symposium on Electronic Imaging, pages 263–273, 1998.

[12] Hong Yan. Skew correction of document images using interline cross-correlation. CVGIP: Graphical Models and Image Processing, 55(6):538–543, 1993.
