Optical Character Recognition system for printed Telugu text
MTech Project Report
Submitted in partial fulfillment of the requirements for the degree of
Master of Technology
by
Udaya Kumar Ambati
Roll No : 09305073
under the guidance of
Prof.M.R.Bhujade
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
April 2010
-
Acknowledgements
I would sincerely like to thank my guide, Prof. M.R. Bhujade, for his motivating support
throughout the semester and the consistent direction that he has given my work. I would also
like to thank everyone who helped me throughout my work.
-
Abstract
Telugu is a language spoken by more than 66 million people of South India. Not much work
has been reported on the development of optical character recognition (OCR) systems for Telugu
text. Therefore, it is an area of current research. Some characters in Telugu are made up of
more than one connected symbol. Compound characters are written by associating modifiers
with consonants, resulting in a huge number of possible combinations, running into hundreds
of thousands. A compound character may contain one or more connected symbols. Therefore,
systems developed for documents of other scripts, like Roman, cannot be used directly for the
Telugu language.
This project aims at developing a complete Optical Character Recognition system for printed
Telugu text. The system segments the document image into lines and words. The features of
each character are extracted. The extracted features are passed to a Support Vector Machine,
where the characters are classified using a supervised learning algorithm.
-
Contents
1 Introduction 1
2 Structure of Telugu text and Segmentation issues[5] 3
2.1 Characteristics of Telugu script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Segmentation issues in OCR of Telugu script . . . . . . . . . . . . . . . . . . . . 6
3 Preprocessing phase 8
3.1 Thresholding and noise removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Skew detection and correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.1 Skew angle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.2.2 Image rotation transformation . . . . . . . . . . . . . . . . . . . . . . . . 11
3.3 Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.4 Line Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5 Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.6 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.7 Pattern classification [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.7.1 SVM Classifier:[2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Implementation 22
5 Results 24
6 Conclusion and Future work 29
i
-
List of Figures
2.1 Harshapriya and Godavari fonts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Vowels, their associated modifiers (Matras) and their phonetic English representation 4
2.3 Consonants and their associated modifiers (Matras) and their phonetic English
representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Various combinations forming compound characters . . . . . . . . . . . . . . . . 6
3.1 Original Text lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Smoothed Text lines with Histogram . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Highest peak and vertical line drawn at the middle of highest peak . . . . . . . . 13
3.4 middle line detection for considering small length text . . . . . . . . . . . . . . . 14
3.5 (a).Initial segmentation line through the white pixels of horizontal histogram (b).
Result after considering only the candidate lines from original histogram. . . . . 14
3.6 Output for word segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
5.1 Home page of the tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
5.2 Displaying the original image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3 Bounding Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4 Line Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.5 Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
ii
-
Chapter 1
Introduction
During the past few decades, substantial research efforts have been devoted to optical character
recognition (OCR) [7, 6]. The objective of OCR is the automatic reading of optically sensed
document text materials to translate human-readable characters into machine-readable codes.
Research in OCR is popular for its various potential applications in banks, post offices and
defence organizations. Other applications include reading aids for the blind, library automation,
language processing and multimedia design.
Commercial OCR packages are already available for languages like English. Considerable work
has also been done for languages like Japanese and Chinese [7]. Recently, work has been done
in the development of OCR systems for Indian languages. This includes work on recognition of
Devanagari characters, Bengali characters, Kannada characters and Tamil characters.
The Indian subcontinent has more than 18 constitutionally recognized languages with several
scripts but commercial products in Optical Character Recognition(OCR) are very few. Telugu
is one of the oldest and most popular languages of India. Historically, Telugu has evolved from
the ancient Brahmi script. It also used features of the Dravidian (Pali) language for script
generation. In the process of evolution, this script was carved with needles on palm leaves, and
so, it favored rounded letter shapes. Work on Telugu character recognition is not substantial.
Motivation In spite of Telugu being the third most widely used language in India, there are only
a few OCR systems for Telugu script. This motivated us to approach the problem. Further
1
-
motivation to develop a Telugu OCR comes from the digitization of thousands of printed books
in Indian languages by both the private and public sectors. For efficient access to these scanned
documents, an OCR system specific to printed Telugu text is urgently needed.
Scope of the report The first section of the report explains the structure of Telugu characters
and its segmentation issues. The second section explains the algorithm used for noise removal
and binarization. The next section explains an efficient algorithm that segments the given
scanned document into lines and words. The last section explains the concept of Support
Vector Machines (SVM), the method of feature extraction for Telugu letters and their
classification using SVM.
Most document analysis systems can be visualized as consisting of two steps: the pre-processor
and the recognizer. In preprocessing, the raw image obtained by scanning a page of text is
converted to a form acceptable to the recognizer by extracting individually recognizable
characters. The pre-processed image of the character is processed to obtain meaningful elements,
called features; recognition is completed by searching a database of stored feature vectors of all
possible Telugu characters for a feature vector that matches the feature vector of the character
to be recognized.
In Indian scripts, one or more vowel and consonant modifiers are attached to the consonant
forms in a variety of combinations, forming compound characters. The total number of possible
compound characters is of the order of hundreds of thousands. Therefore, the question "What
constitutes a character?" assumes many new dimensions for Indian languages. Is a modifier an
independent character or not? Does being treated as an independent character depend on the
way it is written, i.e. whether it is written touching the character it modifies or separated
from it? A more detailed discussion of these issues for Telugu script is provided in Sect. 2.
In this project, an approach has been presented for Telugu.
2
-
Chapter 2
Structure of Telugu text and
Segmentation issues[5]
2.1 Characteristics of Telugu script
Telugu is a syllabic language: words are written as they are pronounced, which avoids confusion
and spelling problems. In that sense, it is a WYSIWYG (what you see is what you get) script.
This form of script is considered to be the most scientific by linguists. The Telugu script consists
of 18 vowels, 36 consonants and two dual symbols. Of the vowels, sixteen are in common usage.
Fig 2.1 lists some of the vowels in the Harshapriya and Godavari fonts.
All vowels and consonants, along with their modifiers and phonetic equivalent symbols, are
listed in Fig 2.2 and Fig 2.3, respectively. Compound characters in Telugu follow some phonetic
[5]
Figure 2.1: Harshapriya and Godavari fonts
3
-
[5]
Figure 2.2: Vowels, their associated modifiers (Matras) and their phonetic English representation
sequences that can be represented in grammatical form, as shown in Fig 2.4. Base consonants
are vowel-suppressed consonants. These are typically used when words of other languages are
written in Telugu. The third combination, i.e. of a base consonant and a vowel, is an extremely
important and often used combination in Telugu script. As there are 38 (36 + 2 dual symbols)
base consonants and 16 vowels, logically, 608 (38 x 16 = 608) combinations are possible.
The combinations from the fourth to the seventh are categorized under conjunct formation.
Telugu has a special feature of providing a unique symbol of dependent form for each of the
consonants. In all conjunct formations, the first consonant appears in its actual form. The
dependent vowel sign and the second (third) consonant act as dependent consonants in the
formation of the complete character. Combinations from the fourth to the seventh generate
a large number of conjuncts in Telugu script. The fourth combination logically generates
23,104 (38 x 38 x 16) different compound characters. This is an important combination. The
fifth combination is similar to the fourth. The second and the third consonants act as the
dependent consonants. Logically, 746,496 different compound characters are possible in this
combination, but their frequency of appearance in text is lower than that of the previous
combination. In the sixth and seventh combinations, 1,296 and 46,656 combinations,
respectively, are logically possible.
The sixth and seventh combinations are used when words from other languages are written in
Telugu script. In these combinations, the vowel is omitted. The first consonant appears as a
base consonant and the other consonants act as dependent consonants.
4
-
[5]
Figure 2.3: Consonants and their associated modifiers (Matras) and their phonetic English
representation
5
-
[5]
Figure 2.4: Various combinations forming compound characters
2.2 Segmentation issues in OCR of Telugu script
A connected region in an image of Telugu text may be:
1. A part of a character or a compound character
2. A character
3. A compound character
This complicates the segmentation issues. The areas occupied by individual characters in a
line of text are not in a horizontal line, unlike in English text, and in some cases, the area of
a single complex character formation can be equal to the sum of the areas of two individual
characters. The segmentation algorithm has to take these factors into consideration. The basic
question to be answered in segmentation is: What are the symbols that will be isolated during
segmentation and provided to the recognizer for completing the OCR?
6
-
The first approach is to treat all types of conjuncts, together with the base consonants, as
units for the purpose of segmentation and further recognition. This is not preferable for a
number of reasons. The first reason is that the sheer number of possibilities has been shown
to be enormous. The second reason is that, in compound characters like KRAI, we have to
identify all the three parts, i.e. below and on the left, as being together in the same compound
character, although they are not connected in the image. This is, in general, difficult because the
association information is difficult to generate until the recognition process is at least partially
completed, and the reason we are segmenting is to perform this recognition. This is the
catch-22 situation referred to earlier, and, therefore, treating all types of conjuncts together is
not possible. The second alternative is to attempt to isolate the base consonants, vowel modifiers,
etc. This is difficult and leads to unmanageable complications at the segmentation stage where
the symbols are yet to be recognized. This is primarily because the symbols are full of curves
and their separation is not clear. However, this is a popular approach for Indian scripts like
Devanagari and Bangla [3].
7
-
Chapter 3
Preprocessing phase
3.1 Thresholding and noise removal
The task of thresholding is to extract the foreground from the background. Generally, an OCR
system expects text printed against a clean background. Usually a simple global binarization
technique is adopted, which does not handle well text printed against shaded or textured
backgrounds and/or embedded in images.
In this project, a simple yet effective algorithm is proposed for document image binarization
and cleanup. It is especially robust for extracting text from images.
There are basically two classes of binarization techniques: global and adaptive. Global methods
binarize the entire image using a single threshold. For example, a typical OCR system separates
text from background by global thresholding [12, 8]. A simple way to automatically select a
global threshold is to pick the value at the valley of the intensity histogram of the image,
assuming that there are two peaks in the histogram, one corresponding to the foreground and
the other to the background. Methods have also been proposed to facilitate more robust valley
picking.
There are problems with the global thresholding paradigm. First, due to noise and poor
contrast, many documents do not have well differentiated foreground and background
intensities. Second, the bimodal histogram assumption is not always valid in the case of
complicated documents such as photographs and advertisements. Third, the foreground peak
is often overshadowed by other peaks, which makes valley detection difficult or impossible. Some research
has been carried out to overcome these problems. For example, weighted histograms [1] are used
to balance the size difference between the foreground and background, and/or to convert the
valley-finding into maximum-peak detection. Minimum-error thresholding models the foreground
and background intensity distributions as Gaussian distributions, and the threshold is selected to
minimize the classification error. Otsu [9] models the intensity histogram as a probability
distribution, and the threshold is chosen to maximize the separability of the resultant background
and foreground classes. Similarly, entropy measures have been used to select the threshold which
maximizes the sum of background and foreground entropies.
In contrast, adaptive algorithms compute a threshold for each pixel based on information
extracted from its neighborhood. For images in which the intensity ranges of foreground objects
and backgrounds entangle, different thresholds must be used for different regions.
3.1.1 The Algorithm
The algorithm proposed by Wu and Manmatha [11] works under the assumption that the text
in the input image, or a region of the input image, has more or less the same intensity value.
However, the unique feature of this algorithm is that it works well even if the text is printed
against a shaded or hatched background.
The following are the steps in the algorithm:
1. smooth the input text chip.
2. compute the intensity histogram of the smoothed chip.
3. smooth histogram using a low-pass filter.
4. pick a threshold at the first valley counted from the left side of the histogram.
5. binarize the smoothed text chip using the threshold.
A low-pass Gaussian filter is used to smooth the text chip in step 1. The smoothing operation
affects the background more than the text, because text is normally of lower frequency than
the shading. Thus it cleans up the background.
9
-
The histogram generated by step 2 is often jagged, hence it needs to be smoothed to allow
the valley to be detected. Again a Gaussian filter is used for this purpose.
Text is normally the darkest item in the detected chips. Therefore, a threshold is picked
at the first valley closest to the darkest side of the histogram. To extract text against darker
background, a threshold at the last valley is picked instead.
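The five steps can be sketched in Python with NumPy (a minimal illustration; the function names and the Gaussian parameters are our own choices, not taken from the thesis):

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Discrete 1-D Gaussian low-pass kernel, normalized to sum to 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def first_valley(hist):
    """Climb the first (dark-side) peak, then descend to the first valley."""
    i = 1
    while i < len(hist) - 1 and hist[i] <= hist[i + 1]:
        i += 1                      # climb the first peak
    while i < len(hist) - 1 and hist[i] >= hist[i + 1]:
        i += 1                      # descend into the valley
    return i

def binarize_chip(gray):
    """Steps 1-5 on a grayscale text chip (0 = dark text, 255 = background)."""
    k = gaussian_kernel1d(1.0, 2)
    # Step 1: smooth the chip (separable low-pass filtering).
    sm = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1,
                             gray.astype(float))
    sm = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, sm)
    # Step 2: intensity histogram of the smoothed chip.
    hist, _ = np.histogram(sm, bins=256, range=(0, 256))
    # Step 3: smooth the (often jagged) histogram with a low-pass filter.
    hist = np.convolve(hist.astype(float), gaussian_kernel1d(3.0, 9), mode='same')
    # Step 4: threshold at the first valley counted from the dark side.
    t = first_valley(hist)
    # Step 5: binarize -- pixels darker than the threshold become text (1).
    return (sm < t).astype(np.uint8)
```

For text against a darker background, the last valley would be picked instead, as noted above.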
3.2 Skew detection and correction
Skew estimation of a document refers to the process of finding the angle of inclination of the
document with respect to the horizontal axis, which is often introduced during document
scanning. For any ensuing document image processing tasks (such as page layout analysis,
OCR, document retrieval, etc.) to yield accurate results, the skew angle must be detected and
corrected beforehand. The algorithms for skew estimation can mainly be classified as those based
on (i) projection profile (PP), (ii) nearest neighbor (NN), (iii) Hough transform (HT) and
(iv) cross-correlation. We used a variation of the Hough transform method [4] to detect skew
in our project.
3.2.1 Skew angle Detection
The skew angle detection process used in this project can be divided into three steps:
1. detection point determination
2. coarse skew angle estimation
3. Hough transformation
First, the skew image is vertically separated into several blocks, each block consisting of one
hundred rows. Then the locations of the detection points in each block are recorded to estimate
the coarse skew angle θe. The coarse skew angle can be estimated by selecting the angle which
possesses the most detection points. Finally, the accurate skew angle can be determined by
choosing the peak in the Hough plane within the small range [θe - 3°, θe + 3°]. A detailed
description of the three steps to detect the skew angle follows.
10
-
Step 1. Detection point (DP) determination First of all, the input image is vertically
divided into several blocks. According to our empirical study, 100 rows are chosen as the size of
each block. A detection point is defined as the left-most black pixel in each block. Each divided
block is scanned from left to right and then from top to bottom to find the detection point. If
the scanned pixel is not a background pixel, it is declared as a detection point. Following the
above procedure, we can find all detection points embedded in the input image. These detection
points are then fed into Step 2 for the estimation of the coarse skew angle.
Step 2. Coarse skew angle estimation In this step, the coarse skew angle θe is determined
by selecting the majority of the local skew angles which are generated from the detection points.
Before the majority selection procedure, the local skew angle θi has to be calculated first.
Consider two detection points DPi-1(xi-1, yi-1) and DPi(xi, yi) in two consecutive divided
blocks Bi-1 and Bi. The local skew angle θi is defined as

θi = tan⁻¹(Δyi / Δxi) = tan⁻¹((yi - yi-1) / (xi - xi-1))    (3.1)

Here, the value Δyi/Δxi is adopted to represent the local skew angle θi to avoid the
computational burden of the tan⁻¹ function. The coarse skew angle θe is then assigned as the
majority of the local skew angles.
Hough transformation Following the previous two steps, the search range of the skew angle
in the Hough plane is reduced from [-90°, 90°] to [θe - 3°, θe + 3°]. Last, the left-most pixel
Pi(xi, yi) in each row of the x-y plane is transformed to the Hough plane by making use of the
following equation:

ρi = xi cos θ + yi sin θ    (3.2)

where θ is located in the range [θe - 3°, θe + 3°]. The skew angle of the input document can
thereby be determined by selecting the angle with the largest value in the transformed Hough
plane.
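As an illustration, the detection-point scan and the Hough refinement of equation 3.2 can be sketched as follows. The function names are ours, and the sign of the recovered angle depends on the image axis convention (here y grows downward):

```python
import math
from collections import Counter
import numpy as np

def detection_points(img, block_rows=100):
    """Left-most black pixel in each horizontal block (img: 1 = black)."""
    pts = []
    for top in range(0, img.shape[0], block_rows):
        ys, xs = np.nonzero(img[top:top + block_rows])
        if len(xs):
            j = int(np.argmin(xs))
            pts.append((top + int(ys[j]), int(xs[j])))   # (row, col)
    return pts

def hough_skew(points, theta_c, half_range=3.0, step=0.1):
    """Search [theta_c - 3, theta_c + 3] degrees for the angle whose Hough
    accumulator has the tallest peak: rho = x cos(theta) + y sin(theta)."""
    best, best_count = theta_c, -1
    for theta in np.arange(theta_c - half_range, theta_c + half_range + step, step):
        t = math.radians(theta)
        rhos = Counter(round(x * math.cos(t) + y * math.sin(t))
                       for y, x in points)
        count = max(rhos.values())          # height of the tallest Hough peak
        if count > best_count:
            best, best_count = float(theta), count
    return best
```

In the full pipeline, the left-most pixel of every row would be fed in, with theta_c taken from the coarse majority estimate of Step 2.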
3.2.2 Image rotation transformation
In this section, a skew image is corrected to generate a non-skew image by rotating it over the
skew angle θ obtained in Section 3.2.1. The rotation transformation is a mapping
function f(x, y) which maps the coordinates of pixels in the original image to those in the output
image. However, some pixel values in the output image which correspond to pixels in the
original image cannot be defined via the mapping function f, because the range and domain
defined in image processing are integers. In the program implementation, we can devise an
inverse function f⁻¹ to define all output pixel values from the original image. Each pixel value
in the output image can thereby be determined from the value in the original image via the
inverse function f⁻¹.
Geometrically, the value of pixel P′(x′, y′) in the output image can be determined from that
of the corresponding pixel P(x, y) in the original image. The location of pixel P can be obtained
from the location of pixel P′ via the following function f⁻¹:

(x, y) = (x′, y′) [ cos θ  -sin θ
                    sin θ   cos θ ] = (x′ cos θ + y′ sin θ, -x′ sin θ + y′ cos θ)    (3.3)
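Equation 3.3 translates directly into an inverse-mapping rotation. The sketch below rotates about the image origin with nearest-neighbour sampling (a real system would typically rotate about the page centre and may interpolate):

```python
import math
import numpy as np

def rotate_inverse(img, theta_deg):
    """Rotate img by theta using the inverse mapping f^-1 of equation 3.3:
    every output pixel (x', y') samples the source pixel at
    (x, y) = (x' cos t + y' sin t, -x' sin t + y' cos t)."""
    t = math.radians(theta_deg)
    h, w = img.shape
    out = np.zeros_like(img)
    for yp in range(h):
        for xp in range(w):
            x = round(xp * math.cos(t) + yp * math.sin(t))   # source column
            y = round(-xp * math.sin(t) + yp * math.cos(t))  # source row
            if 0 <= x < w and 0 <= y < h:
                out[yp, xp] = img[y, x]
    return out
```

Because every output pixel is defined from the original image, no holes appear, which is the point of using f⁻¹ rather than the forward map f.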
3.3 Connected Components
The connected components are computed for the whole document using a recursive labeling
algorithm. The algorithm works by first negating the whole image: each black pixel is replaced
by -1 and each white pixel by 0. Each pixel in this image is then checked for being a text pixel.
If a pixel is a text pixel, a search function takes that pixel and its coordinates and examines
its neighbors. This function recursively searches the black pixels that are part of the component
and labels them. Scanning then continues until it reaches a new component.
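A sketch of the labeling pass follows; we use an explicit stack in place of the recursion described above, to avoid deep call chains on large components (0 = white, 1 = black):

```python
def label_components(img):
    """Label 8-connected black-pixel components.
    Returns (label map, number of components); 0 marks background."""
    h, w = len(img), len(img[0])
    # Negate the image: each black pixel becomes -1 (unlabeled), white becomes 0.
    labels = [[-1 if img[y][x] else 0 for x in range(w)] for y in range(h)]
    n = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] == -1:              # unlabeled text pixel found
                n += 1
                stack = [(y, x)]
                while stack:                    # flood-fill this component
                    cy, cx = stack.pop()
                    if 0 <= cy < h and 0 <= cx < w and labels[cy][cx] == -1:
                        labels[cy][cx] = n
                        for dy in (-1, 0, 1):   # push the 8-neighbourhood
                            for dx in (-1, 0, 1):
                                stack.append((cy + dy, cx + dx))
    return labels, n
```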
3.4 Line Segmentation
There are several steps in the line segmentation method proposed by Priyanka and Srikanth[10]
that are systematically described below.
Step 1: Run length smearing A smoothing algorithm is applied to the text of a document
page. In this step we use the run length smearing technique [12] to increase the strength of the
histogram. Here we consider a consecutive run of white pixels between two black pixels and
compute the length of that white run. If the length of the white run is less than five times
the stroke width, the white run is filled with black. Figure 3.1 shows two original text lines and
Figure 3.2 shows the smoothed text lines with the horizontal histogram corresponding to the
text lines.
[10]
Figure 3.1: Original Text lines
[10]
Figure 3.2: Smoothed Text lines with Histogram
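The smearing rule of Step 1 can be sketched per row as follows (the function name is ours; here max_gap would be five times the estimated stroke width, and 1 = black):

```python
def smear_row(row, max_gap):
    """Run-length smearing: white runs strictly shorter than max_gap
    that lie between two black pixels are filled with black."""
    out = list(row)
    last_black = None
    for i, v in enumerate(row):
        if v:
            gap = i - last_black - 1 if last_black is not None else None
            if gap is not None and 0 < gap < max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1              # fill the short white run
            last_black = i
    return out
```

Leading and trailing white runs are left untouched, since they are not bounded by black pixels on both sides.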
Step 2: Recursive procedure to get middle lines for segmentation Taking the histogram of
every line from the smoothed document page, we consider the highest peak of the projection
profile. After that we find the middle point of the length of the highest peak, and then we
draw a vertical line from top to bottom at that middle point, as shown in Fig 3.3.
[10]
Figure 3.3: Highest peak and vertical line drawn at the middle of highest peak
This step continues by finding the middle lines of each peak of the histogram. At the vertical
line (the line passing vertically through the middle point of the highest peak) we find the middle
points of the peaks. We draw horizontal lines based on these middle points of the width of the
histogram. In some cases not all peaks of the histogram cross this vertical line. For these cases
we find the distances between the middle lines and compute the average of these distances. If
the distance between two middle lines is greater than two times the average value, then we
assume that region contains
13
-
[10]
Figure 3.4: middle line detection for considering small length text
one or more text lines, and we need recursive segmentation for that region. After getting that
region (the region between two middle lines of peaks), we apply the same procedure to find the
vertical line through the middle of the highest peak and the middle lines of that particular
region. This procedure runs recursively until we find the middle lines of the particular image,
as shown in Fig 3.4.
Step 3: Finding candidate lines In this step, from the starting point of the first histogram we
vertically scan the region between the first middle line and the second middle line of the
histogram until we get the first two white pixels. We consider these two white pixels as minimum
points. The line where we get the first white pixel is considered the first minimum. Similarly,
the line where we get the second white pixel is considered the second minimum. Now we
calculate the vertical distances from the first middle line to the first minimum point and from
the first middle line to the second minimum point. From these two distances, we take the
maximum. The minimum point at the maximum vertical distance serves as the separator
between the two consecutive middle lines. In this way we find all line separators between
consecutive middle lines, as shown in Fig 3.5. If we considered only the point of minimum black
pixels in the histogram as the separator line, we would get many errors.
[10]
Figure 3.5: (a).Initial segmentation line through the white pixels of horizontal histogram (b).
Result after considering only the candidate lines from original histogram.
14
-
3.5 Word Segmentation
In word segmentation method, a text line has taken as an input. After a text line is segmented,
it is scanned vertically. If in one vertical scan two or less black pixels are encountered then
the scan is denoted by 0, else the scan is denoted by the number of black pixels. In this way
a vertical projection profile is constructed. Now, if in the profile there exist a run of at least
k1 consecutive 0s then the midpoint of that run is considered as the boundary of a word. The
value of k1 is taken as 1/3 of the text line height. Word segmentation results of a Telugu text
line are shown in Fig.
[10]
Figure 3.6: Output for word segmentation
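The rule above can be sketched as follows (the names are ours; line_img is a list of rows with 1 = black, and k1 would be one third of the line height):

```python
def word_boundaries(line_img, k1):
    """Word boundaries of one text line from its vertical projection profile.
    Columns with two or fewer black pixels count as 0; the midpoint of every
    run of at least k1 consecutive zero-columns is reported as a boundary."""
    h, w = len(line_img), len(line_img[0])
    profile = [sum(line_img[y][x] for y in range(h)) for x in range(w)]
    profile = [0 if p <= 2 else p for p in profile]
    boundaries, run_start = [], None
    for x, p in enumerate(profile + [1]):       # sentinel closes a trailing run
        if p == 0 and run_start is None:
            run_start = x
        elif p != 0 and run_start is not None:
            if x - run_start >= k1:
                boundaries.append((run_start + x - 1) // 2)
            run_start = None
    return boundaries
```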
3.6 Feature Extraction
Feature Extraction [5]: The output of the normalization phase gives a normalized image of size
N x N. Real-valued directional features are calculated for each normalized image. These are
based on the percentage of pixels in each direction range within each partition. An adaptive
gradient magnitude threshold rt is computed over the whole character image gradient map.
This threshold is needed to filter out spurious responses of the Sobel operator used to find the
gradients. The threshold value rt is computed as

rt = ( Σi,j r(i, j) ) / (D1 D2)

where r(i, j) is the gradient magnitude at pixel (i, j) and D1 x D2 is the size of the gradient
map. Thresholding is performed to nullify the pixels whose gradient magnitude values are below
the computed threshold.
The feature vector is extracted based on the direction of the gradient at each pixel. We divide
the whole character image into M x N partitions. In our project we selected M = N = 8. The
directions of the gradient are quantized into K values. Thus each pixel can now have a gradient
direction value from 1 to K. The percentage of pixels in each partition with direction quantized
to k is calculated. Thus each partition gives us K such values. We have a total M x N x K-dimensional
15
-
feature vector for each character image. We chose the value K = 12. In our project we have a
total 192-dimensional feature vector for each normalized character image.
The steps to extract the feature vector are, for each connected component:
1. Obtain the bounding box of the connected component, eliminating the blank surrounding space.
2. Calculate the gradient magnitude and direction at each pixel.
3. Calculate the adaptive threshold of the gradient magnitude and perform thresholding to obtain the new gradient direction at each pixel.
4. Partition the adaptive gradient direction map and extract the complete feature vector.
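The steps above can be sketched as follows. This is an illustrative sketch, not the thesis's implementation: np.gradient stands in for the Sobel operator, the threshold is taken as the mean gradient magnitude, and the default partition counts M = N = 4 (which give a 4 * 4 * 12 = 192-dimensional vector) are our own choice:

```python
import math
import numpy as np

def directional_features(img, M=4, N=4, K=12):
    """Percentage of above-threshold gradient directions per partition."""
    gy, gx = np.gradient(img.astype(float))  # stand-in for the Sobel operator
    mag = np.hypot(gx, gy)
    rt = mag.mean()                          # adaptive magnitude threshold
    h, w = img.shape
    feat = np.zeros((M, N, K))
    for y in range(h):
        for x in range(w):
            if mag[y, x] <= rt:              # suppress spurious responses
                continue
            ang = math.atan2(gy[y, x], gx[y, x]) % (2 * math.pi)
            k = min(int(ang / (2 * math.pi) * K), K - 1)   # quantized direction
            feat[y * M // h, x * N // w, k] += 1
    counts = feat.sum(axis=2, keepdims=True)  # counted pixels per partition
    feat = np.divide(feat, counts, out=np.zeros_like(feat), where=counts > 0)
    return feat.ravel()
```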
3.7 Pattern classification [2]
The feature vector extracted from the normalized image has to be assigned a label using a
pattern classifier [2]. There are many methods for designing pattern classifiers, such as the
Bayes classifier based on density estimation, neural networks, linear discriminant functions,
nearest neighbor classification based on prototypes, etc. In this system we have used the Support
Vector Machine (SVM) classifier. SVMs represent a pattern classification method which grew
out of recent work in statistical learning theory. The solution offered by the SVM methodology
for the two-class pattern recognition problem is theoretically elegant, computationally efficient
and often found to give better performance by way of improved generalization. In the next
subsection we provide a brief overview of SVMs.
3.7.1 SVM Classifier [2]
An SVM classifier is a two-class classifier based on the use of discriminant functions. A
discriminant function represents a surface which separates the patterns so that the patterns from the two
16
-
classes lie on the opposite sides of the surface. The SVM is essentially a separating surface which
is optimal according to a criterion as explained below.
Consider a two-class problem where the class labels are denoted by +1 and -1. Given a set
of labeled (training) patterns {(xi, yi)}, yi ∈ {-1, +1}, a hyper-plane represented by (w, b)
separates the two classes if

w^T xi + b > 0  for i : yi = +1;
w^T xi + b < 0  for i : yi = -1.    (3.4)

Here, w^T xi denotes the inner product between the two vectors, and g(x) = w^T x + b is the
linear discriminant function.
In general, the set may not be linearly separable. In such a case one can employ the
generalized linear discriminant function defined by

g(x) = w^T φ(x) + b, where:
-
Let zi = φ(xi). Thus we now have a training sample (zi, yi) to learn a separating hyperplane
in
-
Now it is clear that αi = 0 if i ∉ S. Hence we can rewrite (4.6) as

w = Σ_{i∈S} αi yi zi    (3.11)

The set of patterns {zi : i ∈ S, αi > 0} are called the support vectors. From (3.11), it is clear
that w is a linear combination of the support vectors, and hence the name SVM for the classifier.
The support vectors are those patterns which are closest to the hyper-plane and are sufficient
to completely define the optimal hyper-plane. Hence these patterns can be considered to be the
most important training examples.
To learn the SVM, all we need are the optimal Lagrange multipliers corresponding to the
problem given by (4.4) and (4.5). This can be done efficiently by solving its dual, which is the
optimization problem: find αi, i = 1, ..., l, to

Maximize:   Σi αi - (1/2) Σi,j αi αj yi yj zi^T zj

Subject to: αi ≥ 0, i = 1, 2, ..., l;   Σ_{i=1}^{l} αi yi = 0.    (3.12)
By solving this problem we obtain the αi, and using these we get w and b. It may be noted
that the dual given by (3.12) is a quadratic optimization problem of dimension l (recall that l
is the number of training patterns) with one equality constraint and non-negativity constraints
on the variables. This is so irrespective of how complicated the function φ is. Once the SVM
is obtained, the classification of any new feature vector x is based on the sign of (recall that
z = φ(x))

f(x) = φ(x)^T w + b = Σ_{i∈S} αi yi φ(xi)^T φ(x) + b    (3.13)

where we have used (3.11). Thus, both while solving the optimization problem (given by (3.12))
and while classifying a new pattern, the only way the training pattern vectors xi come into the
picture is as inner products φ(xi)^T φ(xj). This is also the only way φ enters into the picture.
Suppose we have a function K:
-
Table 3.1: Some popular kernels for SVMs.

Polynomial kernel:  K(xi, xj) = (xi^T xj + 1)^p. The power p is specified a priori by the user.
Gaussian kernel:    K(xi, xj) = exp(-||xi - xj||^2 / (2σ^2)). The width σ^2, common to all the
kernels, is specified a priori.
Perceptron kernel:  K(xi, xj) = tanh(β0 xi^T xj + β1). Mercer's condition is satisfied only for
certain values of β0 and β1.
Given any symmetric function K :
-
this, we can change the optimization problem to

Minimize:   (1/2)||w||^2 + C Σ_{i=1}^{l} ξi    (3.14)

Subject to: 1 - yi(zi^T w + b) - ξi ≤ 0, i = 1, ..., l;   ξi ≥ 0, i = 1, ..., l.    (3.15)

Here the ξi can be thought of as penalties for violating the separability constraints. These are
now also variables over which the optimization is to be performed. The constant C is a
user-specified parameter of the algorithm, and as C → ∞ we get the old problem. It turns out
that the dual of this problem is the same as (3.12), except that the non-negativity constraint
on αi is replaced by 0 ≤ αi ≤ C. The optimal values of the new variables ξi are irrelevant to
the final SVM solution.
To sum up, the SVM method for learning two-class classifiers is as follows. We choose a kernel
function and some value for the constant C in (3.14). We then solve the dual, which is the same
as (3.12) except that the variables \alpha_i also have an upper bound, namely C. (It may be
noted that here we use K(x_i, x_j) in place of z_i^t z_j in (3.12).) Once we solve this problem,
all we need to store are the non-zero \alpha_i^* and the corresponding x_i (which are the
support vectors). Using these, given any new feature vector x, we can calculate the output of
the SVM, namely f(x), through (3.13). The classification of x is +1 if the output of the SVM
is positive; otherwise it is -1.
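The decision rule in (3.13), with the kernel substituted for the inner product, can be sketched
as follows. The support vectors, multipliers and bias below are illustrative values, not ones
learned from real data.

```python
import math

def gaussian_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_output(x, support_vectors, alphas, labels, b, kernel=gaussian_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b

def classify(x, support_vectors, alphas, labels, b):
    """Class is +1 if f(x) is positive, -1 otherwise."""
    return 1 if svm_output(x, support_vectors, alphas, labels, b) > 0 else -1

# Toy example: two support vectors of opposite class.
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
b = 0.0
print(classify((0.1, 0.1), svs, alphas, labels, b))   # near the +1 support vector
print(classify((1.9, 2.1), svs, alphas, labels, b))   # near the -1 support vector
```

Only the support vectors and their multipliers need to be stored, which is exactly the economy
the summary above points out.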
SVM classifier for OCR. We have used SVM classifiers for labeling each segment of a word. As
explained earlier, we have trained a number of two-class classifiers (SVMs), each one
distinguishing one class from all the others, so each of our class labels has an associated
SVM. A test example is assigned the label of the class whose SVM gives the largest positive
output. If no SVM gives a positive output, the example is rejected. The output of an SVM is a
measure of the distance of the example from the separating hyper-plane in the feature space;
hence, the higher the (positive) output for a given pattern, the higher the confidence in
classifying that pattern.
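The one-versus-rest decision rule described above can be sketched as below; the per-class
outputs are assumed to come from already-trained SVMs, and the class labels are illustrative.

```python
def classify_one_vs_rest(outputs):
    """Given a dict mapping class label -> SVM output f(x), return the
    label with the largest positive output, or None (reject) if no SVM
    produces a positive output."""
    best_label, best_output = None, 0.0
    for label, f in outputs.items():
        if f > best_output:
            best_label, best_output = label, f
    return best_label

# Example: outputs of three per-class SVMs for one test segment.
print(classify_one_vs_rest({"ka": 1.7, "kha": -0.4, "ga": 0.2}))  # "ka"
print(classify_one_vs_rest({"ka": -1.0, "kha": -0.5}))            # None (rejected)
```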
-
Chapter 4
Implementation
Developing an OCR system for printed Telugu text consists of two stages: pre-processing and
recognition. In the pre-processing phase, thresholding and noise removal are implemented using
the algorithm specified in Section 3.1.1, and skew detection and removal are implemented using
a variant of the Hough transform.
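The thesis does not reproduce the skew-estimation code here; the following is a much-simplified,
illustrative sketch of a Hough-style search over candidate angles, assuming each connected
component has been reduced to a single reference point. The scoring heuristic, angle range and
bin size are assumptions, not the system's actual parameters.

```python
import math
from collections import Counter

def estimate_skew(points, max_angle=5.0, step=0.5, bin_size=3.0):
    """Search candidate skew angles in [-max_angle, max_angle] degrees.
    For each angle, project every point perpendicular to that angle and
    quantize the projections into bins; when the candidate matches the
    true skew, points of the same text row collapse into the same bin,
    so the sum of squared bin counts peaks at the true angle."""
    best_angle, best_score = 0.0, -1.0
    a = -max_angle
    while a <= max_angle + 1e-9:
        rad = math.radians(a)
        bins = Counter(round((y * math.cos(rad) - x * math.sin(rad)) / bin_size)
                       for x, y in points)
        score = sum(c * c for c in bins.values())
        if score > best_score:
            best_angle, best_score = a, score
        a += step
    return best_angle

# Synthetic input: reference points of two text rows skewed by 2 degrees.
t = math.tan(math.radians(2.0))
pts = [(x, x * t + c) for c in (0.0, 50.0) for x in range(0, 200, 10)]
print(estimate_skew(pts))  # close to 2 degrees
```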
The OCR pipeline starts by taking the document image as input. The image is first converted
into a grayscale image, which is then converted to a binary image using the method described in
the thresholding section. Connected components in the whole document, together with their
bounding boxes, are found using a two-pass algorithm. These connected components are then used
to segment the whole document into lines. Line segmentation takes the array of connected
components as a parameter and returns the top and bottom row numbers of each line with respect
to the image coordinate system.
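A minimal sketch of this line-grouping step, assuming each connected component is represented
only by its bounding box (the two-pass labeling itself is omitted):

```python
def segment_lines(boxes):
    """Group connected-component bounding boxes (top, bottom, left, right)
    into text lines; two boxes belong to the same line if their row ranges
    overlap. Returns one (top_row, bottom_row) pair per line, sorted
    top to bottom in image coordinates."""
    lines = []
    for top, bottom, _, _ in sorted(boxes):
        for i, (ltop, lbot) in enumerate(lines):
            if top <= lbot and bottom >= ltop:        # vertical overlap
                lines[i] = (min(ltop, top), max(lbot, bottom))
                break
        else:
            lines.append((top, bottom))
    return sorted(lines)

# Two components on one line, one component on a lower line.
print(segment_lines([(10, 40, 0, 20), (12, 38, 25, 50), (60, 90, 0, 30)]))
# -> [(10, 40), (60, 90)]
```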
Each text line is given as input to the word segmentation phase, which segments the line into
words and returns the left and right column numbers of each word. The connected components
belonging to each word are then grouped together.
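The gap-based word split can be sketched as follows; the gap threshold is an assumed
illustrative value, not the one used in the actual system.

```python
def segment_words(boxes, min_gap=8):
    """Split a line's component bounding boxes (left_col, right_col) into
    words wherever the horizontal gap between consecutive components
    exceeds min_gap pixels. Returns one (left_col, right_col) per word."""
    words = []
    for left, right in sorted(boxes):
        if words and left - words[-1][1] <= min_gap:
            words[-1] = (words[-1][0], max(words[-1][1], right))  # same word
        else:
            words.append((left, right))                            # new word
    return words

# Two touching components followed by a distant one.
print(segment_words([(0, 15), (18, 30), (50, 70)]))
# -> [(0, 30), (50, 70)]
```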
Each component is normalized into a 48x48 image, which is given as input to the feature
extraction function. This function takes the image and returns a feature vector of 192
dimensions using the Sobel operator and an adaptive gradient threshold. The feature vector is
then given as input to the SVM classifier, which is trained in the SVM training phase described
in later sections.
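The Sobel-based extraction can be sketched as below. The exact zoning behind the 192 dimensions
is not specified above, so this sketch assumes one plausible decomposition, a 4x4 grid of zones
with 12 gradient-direction bins each (4 x 4 x 12 = 192); a fixed magnitude threshold also stands
in for the adaptive one.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_direction_features(img, zones=4, bins=12, threshold=1.0):
    """Accumulate a zones*zones*bins-dimensional gradient-direction
    histogram from a square image (list of lists of 0/1 pixels).
    Gradients with magnitude below the threshold are discarded."""
    n = len(img)
    feats = [0.0] * (zones * zones * bins)
    for r in range(1, n - 1):
        for c in range(1, n - 1):
            gx = sum(SOBEL_X[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            mag = math.hypot(gx, gy)
            if mag < threshold:
                continue
            zone = (r * zones // n) * zones + (c * zones // n)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = min(int(angle / (2 * math.pi) * bins), bins - 1)
            feats[zone * bins + b] += mag
    return feats

# A 48x48 image containing a single vertical stroke.
img = [[1 if 22 <= c <= 25 else 0 for c in range(48)] for r in range(48)]
features = sobel_direction_features(img)
print(len(features))  # 192
```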
All these functions are implemented using the Java Advanced Imaging (JAI) package from Sun
Microsystems in the NetBeans 6.8 IDE. LibSVM is the package used for training and applying the
SVM classifier.
-
Chapter 5
Results
Figure 5.1: Home page of the tool
-
Figure 5.2: Displaying the original image
-
Figure 5.3: Bounding Connected Components
-
Figure 5.4: Line Segmentation
-
Figure 5.5: Word Segmentation
-
Chapter 6
Conclusion and Future work
Conclusion. The main aim of this project is to develop an Optical Character Recognition system
for printed Telugu text. Telugu script has a complex structure, with thousands of combinations
of vowels, consonants and consonant modifiers; hence, detecting and recognizing basic symbols
helps in reducing the number of classes. This project develops a tool that takes a document
image as input and displays the Unicode of each character. This Unicode output can be further
used to display the corresponding Telugu text.
Future work. The recognition accuracy can be further increased by post-processing that makes
use of the associations between basic symbols. For example, it is known that some modifiers
occur very frequently with certain characters, while others occur very infrequently. The
feature vector can also be used for recognizing handwritten Telugu script, and the final output
of the proposed system can be used further for text-to-speech conversion.
-
Bibliography
[1] Histogram modification for threshold selection. IEEE Transactions on Systems, Man and Cybernetics, 9(1):38-52, January 1979.
[2] T. V. Ashwin and P. S. Sastry. A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana, 27:35-58, 2002.
[3] B. B. Chaudhuri and U. Pal. A complete printed Bangla OCR system. Pattern Recognition, 31(5):531-549, 1998.
[4] Huei-Fen Jiang, Chin-Chuan Han, and Kuo-Chin Fan. A fast approach to the detection and correction of skew documents. Pattern Recognition Letters, 18(7):675-686, 1997.
[5] C. Vasantha Lakshmi and C. Patvardhan. An optical character recognition system for printed Telugu text. Pattern Analysis and Applications, 7:190-204, 2004. doi:10.1007/s10044-004-0217-2.
[6] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029-1058, July 1992.
[7] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):38-62, January 2000.
[8] L. O'Gorman. Binarization and multithresholding of document images using connectivity. CVGIP: Graphical Models and Image Processing, 56(6):494-506, 1994.
[9] N. Otsu. A threshold selection method from grey-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62-66, January 1979.
[10] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda. Line and word segmentation approach for printed documents. IJCA, Special Issue on RTIPPR, (1):30-36, 2010. Published by Foundation of Computer Science.
[11] Victor Wu and R. Manmatha. Document image clean-up and binarization. In Proc. SPIE Symposium on Electronic Imaging, pages 263-273, 1998.
[12] Hong Yan. Skew correction of document images using interline cross-correlation. CVGIP: Graphical Models and Image Processing, 55(6):538-543, 1993.