Thesis on Telugu OCR


description

The document contains the details of the optical character recognition of Telugu script.

Transcript of Thesis on Telugu OCR

  • Optical Character Recognition system for printed Telugu text

    MTech Project Report

    Submitted in partial fulfillment of the requirements for the degree of

    Master of Technology

    by

    Udaya Kumar Ambati

    Roll No: 09305073

    under the guidance of

    Prof. M. R. Bhujade

    Department of Computer Science and Engineering

    Indian Institute of Technology, Bombay

    April 2010

  • Acknowledgements

    I would sincerely like to thank my guide, Prof. M. R. Bhujade, for his motivating support throughout the semester and for the consistent direction that he has given my work. I would also like to thank everyone who helped me throughout my work.

  • Abstract

    Telugu is a language spoken by more than 66 million people of South India. Not much work

    has been reported on the development of optical character recognition (OCR) systems for Telugu

    text. Therefore, it is an area of current research. Some characters in Telugu are made up of

    more than one connected symbol. Compound characters are written by associating modifiers

    with consonants, resulting in a huge number of possible combinations, running into hundreds

    of thousands. A compound character may contain one or more connected symbols. Therefore,

    systems developed for documents of other scripts, like Roman, cannot be used directly for the

    Telugu language.

    This project aims at developing a complete Optical Character Recognition system for printed

    Telugu text. The system segments the document image into lines and words. The features of

    each character are extracted. The extracted features are passed to a Support Vector Machine

    where the characters are classified using a supervised learning algorithm.

  • Contents

    1 Introduction 1

    2 Structure of Telugu text and Segmentation issues[5] 3

    2.1 Characteristics of Telugu script . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    2.2 Segmentation issues in OCR of Telugu script . . . . . . . . . . . . . . . . . . . . 6

    3 Preprocessing phase 8

    3.1 Thresholding and noise removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    3.1.1 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    3.2 Skew detection and correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3.2.1 Skew angle Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    3.2.2 Image rotation transformation . . . . . . . . . . . . . . . . . . . . . . . . 11

    3.3 Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    3.4 Line Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    3.5 Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.6 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.7 Pattern classification [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    3.7.1 SVM Classifier:[2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    4 Implementation 22

    5 Results 24

    6 Conclusion and Future work 29


  • List of Figures

    2.1 Harshapriya and Godavari fonts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

    2.2 Vowels, their associated modifiers (Matras) and their phonetic English representation 4

    2.3 Consonants and their associated modifiers (Matras) and their phonetic English

    representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2.4 Various combinations forming compound characters . . . . . . . . . . . . . . . . 6

    3.1 Original Text lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

    3.2 Smoothed Text lines with Histogram . . . . . . . . . . . . . . . . . . . . . . . . . 13

    3.3 Highest peak and vertical line drawn at the middle of highest peak . . . . . . . . 13

    3.4 middle line detection for considering small length text . . . . . . . . . . . . . . . 14

    3.5 (a).Initial segmentation line through the white pixels of horizontal histogram (b).

    Result after considering only the candidate lines from original histogram. . . . . 14

    3.6 Output for word segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    5.1 Home page of the tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    5.2 Displaying the original image . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    5.3 Bounding Connected Components . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    5.4 Line Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    5.5 Word Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


  • Chapter 1

    Introduction

    During the past few decades, substantial research efforts have been devoted to optical char-

    acter recognition (OCR) [7, 6]. The objective of OCR is the automatic reading of optically sensed document text material, translating human-readable characters into machine-readable codes.

    Research in OCR is popular for its various potential applications in banks, post offices and

    defence organizations. Other applications involve reading aids for the blind, library automation,

    language processing and multimedia design.

    Commercial OCR packages are already available for languages like English. Considerable work

    has also been done for languages like Japanese and Chinese [7]. Recently, work has been done

    in the development of OCR systems for Indian languages. This includes work on recognition of

    Devanagari, Bengali, Kannada and Tamil characters.

    The Indian subcontinent has more than 18 constitutionally recognized languages with several

    scripts but commercial products in Optical Character Recognition(OCR) are very few. Telugu

    is one of the oldest and most popular languages of India. Historically, Telugu has evolved from

    the ancient Brahmi script. It also used features of the Dravidian (Pali) language for script

    generation. In the process of evolution, this script was carved with needles on palm leaves, and

    so, it favored rounded letter shapes. Work on Telugu character recognition is not substantial.

    Motivation In spite of Telugu being the third most widely used language in India, there are only a few OCR systems for Telugu script. This gave us the motivation to approach the problem. Further motivation to develop a Telugu-language OCR is the ongoing digitization of thousands of printed books in Indian languages by both the private and public sectors. For efficient access to these scanned documents, an OCR system specific to printed Telugu text is urgently needed.

    Scope of the report The first section of the report explains the structure of Telugu characters and the associated segmentation issues. The second section explains the algorithm used for noise removal and binarization. The next section explains an efficient algorithm that segments the given scanned document into lines and words. The last section explains the concept of Support Vector Machines (SVMs), the method of feature extraction for Telugu letters, and their classification using an SVM.

    Most document analysis systems can be visualized as consisting of two steps: the pre-processor and the recognizer. In preprocessing, the raw image obtained by scanning a page of text is converted to a form acceptable to the recognizer by extracting individually recognizable characters. The pre-processed image of the character is then processed to obtain meaningful elements, called features; recognition is completed by searching a database of stored feature vectors of all possible Telugu characters for the one that matches the feature vector of the character to be recognized.

    In Indian scripts, one or more vowel and consonant modifiers are attached to the consonant

    forms in a variety of combinations forming compound characters. The total number of possible

    compound characters is of the order of hundreds of thousands. Therefore, the question "What constitutes a character?" assumes many new dimensions for Indian languages. Is a modifier an

    independent character or not? Does being treated as an independent character depend on the

    way it is written, i.e. whether it is written touching the character it is to modify or separated

    from it? A more detailed discussion of these issues for Telugu script is provided in Sect. 2.

    In this project, an approach has been presented for Telugu.


  • Chapter 2

    Structure of Telugu text and

    Segmentation issues [5]

    2.1 Characteristics of Telugu script

    Telugu is a syllabic language: words are written exactly as they are pronounced, which avoids confusion and spelling problems. In that sense, it is a WYSIWYG (what you see is what you get) script. This form of script is considered by linguists to be the most scientific. The Telugu script consists of 18 vowels, 36 consonants and two dual symbols. Of

    the vowels, sixteen are in common usage. Fig 2.1 lists some of the vowels in Harshapriya and

    Godavari fonts.

    All vowels and consonants, along with their modifiers and phonetic equivalent symbols, are

    listed in Fig 2.2 and Fig 2.3, respectively. Compound characters in Telugu follow some phonetic

    [5]

    Figure 2.1: Harshapriya and Godavari fonts


  • [5]

    Figure 2.2: Vowels, their associated modifiers (Matras) and their phonetic English representation

    sequences that can be represented in grammatical form, as shown in Fig 2.4. Base consonants

    are vowel-suppressed consonants. These are typically used when words of other languages are

    written in Telugu. The third combination, i.e. of a base consonant and a vowel, is an extremely

    important and often used combination in Telugu script. As there are 38 (36 + 2 dual symbols) base consonants and 16 vowels, logically, 608 (38 × 16 = 608) combinations are possible.

    The combinations from the fourth to the seventh combinations are categorized under conjunct

    formation. Telugu has a special feature of providing a unique symbol of dependent form for each

    of the consonants. In all conjunct formations, the first consonant appears in its actual form.

    The dependent vowel sign and the second (third) consonant act as dependent consonants in the

    formation of the complete character. Combinations from the fourth to the seventh combinations generate a large number of conjuncts in Telugu script. The fourth combination logically generates 23,104 (38 × 38 × 16) different compound characters. This is an important combination. The fifth combination is similar to the fourth combination. The second and the third consonants act

    as the dependent consonants. Logically 746,496 different compound characters are possible in

    this combination, but their frequency of appearance in the text is less when compared to the

    previous combination. In the sixth and seventh combinations, 1,296 combinations and 46,656

    combinations, respectively, are logically possible.

    The sixth and seventh combinations are used when words from other languages are written in

    Telugu script. In these combinations, the vowel is omitted. The first consonant appears as a

    base consonant and the other consonants act as dependent consonants.
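    The combination counts quoted above follow directly from the script's inventory and can be checked with a couple of lines of arithmetic (assuming 38 base consonants = 36 consonants + 2 dual symbols, and the 16 vowels in common usage):

```python
# Sanity check of the combination counts quoted above.
base_consonants = 36 + 2  # 36 consonants plus 2 dual symbols
vowels = 16               # vowels in common usage

assert base_consonants * vowels == 608        # third combination: C + V
assert base_consonants**2 * vowels == 23_104  # fourth combination: C + C + V
```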


  • [5]

    Figure 2.3: Consonants and their associated modifiers (Matras) and their phonetic English

    representation


  • [5]

    Figure 2.4: Various combinations forming compound characters

    2.2 Segmentation issues in OCR of Telugu script

    A connected region in an image of Telugu text may be:

    1. A part of a character or a compound character

    2. A character

    3. A compound character

    This complicates the segmentation issues. The areas occupied by individual characters in a

    line of text are not in a horizontal line, unlike in English text, and in some cases, the area of

    a single complex character formation can be equal to the sum of the areas of two individual

    characters. The segmentation algorithm has to take these factors into consideration. The basic

    question to be answered in segmentation is: What are the symbols that will be isolated during

    segmentation and provided to the recognizer for completing the OCR?


  • The first approach is to treat all types of conjuncts, together with the base consonants, as

    units for the purpose of segmentation and further recognition. This is not preferable for a

    number of reasons. The first reason is that the sheer number of possibilities has been shown

    to be enormous. The second reason is that, in compound characters like KRAI, we have to

    identify all the three parts, i.e. those below and on the left, as being together in the same compound

    character, although they are not connected in the image. This is, in general, difficult because the

    association information is difficult to generate until the recognition process is at least partially

    completed, and the reason we are segmenting is to perform this recognition process. This is the

    catch-22 situation referred to earlier, and, therefore, treating all types of conjuncts together is not

    possible. The second alternative is to attempt to isolate the base consonants, vowel modifiers,

    etc. This is difficult and leads to unmanageable complications at the segmentation stage where

    the symbols are yet to be recognized. This is primarily because the symbols are full of curves

    and their separation is not clear. However, this is a popular approach for Indian scripts like

    Devanagari and Bangla [3].


  • Chapter 3

    Preprocessing phase

    3.1 Thresholding and noise removal

    The task of thresholding is to extract the foreground from the background. Generally an OCR system expects text printed against a clean background. Usually a simple global binarization technique is adopted, which does not handle well text printed against shaded or textured backgrounds and/or embedded in images.

    In this project, a simple yet effective algorithm is proposed for document image binarization

    and cleanup. It is especially robust for extracting text from images.

    There are basically two classes of binarization techniques: global and adaptive. Global methods

    binarize the entire image using a single threshold. For example, a typical OCR system separates

    text from background by global thresholding [12, 8]. A simple way to automatically select a global threshold is to pick the value at the valley of the intensity histogram of the image, assuming that

    there are two peaks in the histogram, one corresponding to the foreground and the other to the

    background. Methods have also been proposed to facilitate more robust valley picking.

    There are problems with the global thresholding paradigm. First, due to noise and poor

    contrast, many documents do not have well differentiated foreground and background intensi-

    ties. Second, the bimodal histogram assumption is not always valid in the case of complicated

    documents such as photographs and advertisements. Third, the foreground peak is often overshadowed by other peaks, which makes the valley detection difficult or impossible. Some research

    has been carried out to overcome these problems. For example, weighted histograms [1] are used to balance the size difference between the foreground and background, and/or to convert the valley-finding into maximum peak detection. Minimum-error thresholding models the foreground and background intensity distributions as Gaussian distributions, and the threshold is selected to minimize the classification error. Otsu [9] models the intensity histogram as a probability distribution, and the threshold is chosen to maximize the separability of the resultant background and foreground classes. Similarly, entropy measures have been used to select the threshold which maximizes the sum of the background and foreground entropies.

    In contrast, adaptive algorithms compute a threshold for each pixel based on information

    extracted from its neighborhood. For images in which the intensity ranges of foreground objects

    and backgrounds entangle, different thresholds must be used for different regions.
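    As an illustration of the global paradigm, Otsu's method mentioned above can be sketched in a few lines. This is a minimal NumPy sketch, not the project's code:

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's global threshold: treat the intensity histogram as a
    probability distribution and pick the threshold that maximizes the
    between-class variance (i.e. class separability)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist.astype(float) / hist.sum()
    mu_total = np.dot(np.arange(256), p)  # overall mean intensity
    best_t, best_var = 0, -1.0
    w0, mu0_sum = 0.0, 0.0
    for t in range(256):
        w0 += p[t]            # weight of class 0 (intensities <= t)
        mu0_sum += t * p[t]
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue
        mu0 = mu0_sum / w0
        mu1 = (mu_total - mu0_sum) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

    On a clearly bimodal image the returned threshold falls between the two intensity modes, which is exactly the separability criterion described in the text.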

    3.1.1 The Algorithm

    The algorithm proposed by Wu and Manmatha [11] works under the assumption that the text in the input image, or in a region of the input image, has more or less the same intensity value. The unique feature of this algorithm is that it works well even if the text is printed against a shaded or hatched background.

    The following are the steps in the algorithm:

    1. Smooth the input text chip.

    2. Compute the intensity histogram of the smoothed chip.

    3. Smooth the histogram using a low-pass filter.

    4. Pick a threshold at the first valley counted from the left side of the histogram.

    5. Binarize the smoothed text chip using the threshold.

    A low-pass Gaussian filter is used to smooth the text chip in step 1. The smoothing operation affects the background more than the text because the text is normally of lower frequency than

    the shading. Thus it cleans up the background.


  • The histogram generated by step 2 is often jagged, hence it needs to be smoothed to allow

    the valley to be detected. Again a Gaussian filter is used for this purpose.

    Text is normally the darkest item in the detected chips. Therefore, a threshold is picked

    at the first valley closest to the darkest side of the histogram. To extract text against a darker background, the threshold at the last valley is picked instead.
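    The five steps above can be sketched as follows. This is a minimal NumPy illustration of the Wu–Manmatha-style procedure, not the project's implementation; the sigma values and the simple local-minimum valley test are assumptions of this sketch:

```python
import numpy as np

def _gauss_kernel(sigma):
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def _smooth(a, sigma, axis=-1):
    # 1-D Gaussian convolution applied along one axis.
    k = _gauss_kernel(sigma)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, a)

def binarize_text_chip(gray, smooth_sigma=1.0, hist_sigma=2.0):
    """Binarize a grayscale text chip by picking the first valley of its
    smoothed intensity histogram (dark text on a lighter background)."""
    # Step 1: low-pass filter the chip (separable Gaussian) to clean shading.
    smoothed = _smooth(_smooth(gray.astype(float), smooth_sigma, axis=0),
                       smooth_sigma, axis=1)
    # Step 2: intensity histogram of the smoothed chip.
    hist, _ = np.histogram(smoothed, bins=256, range=(0, 256))
    # Step 3: smooth the jagged histogram so the valley can be detected.
    hist = _smooth(hist.astype(float), hist_sigma)
    # Step 4: pick the threshold at the first valley from the dark (left) side.
    threshold = 128  # fallback if no clear valley exists
    for t in range(1, 255):
        if hist[t] < hist[t - 1] and hist[t] <= hist[t + 1]:
            threshold = t
            break
    # Step 5: binarize -- pixels darker than the threshold become foreground.
    return (smoothed < threshold).astype(np.uint8)
```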

    3.2 Skew detection and correction

    Skew estimation of a document refers to the process of finding the angle of inclination of the document with respect to the horizontal axis, which is often introduced during document scanning. For any ensuing document image processing task (such as page layout analysis, OCR, document retrieval, etc.) to yield accurate results, the skew angle must be detected and corrected beforehand. The algorithms for skew estimation can mainly be classified as those based on (i) projection profiles (PP), (ii) nearest neighbors (NN), (iii) the Hough transform (HT) and (iv) cross-correlation. We used a variation of the Hough transform method [4] to detect skew in our project.

    3.2.1 Skew angle Detection

    The skew angle detection process used in this project can be divided into three steps:

    detection point determination

    coarse skew angle estimation

    Hough transformation.

    First, the skew image is vertically separated into several blocks, each block consisting of one hundred rows. Then the locations of the detection points in each block are recorded to estimate the coarse skew angle θe. The coarse skew angle is estimated by selecting the angle which possesses the most detection points. Finally, the accurate skew angle is determined by choosing the peak in the Hough plane within the small range [θe − 3°, θe + 3°]. A detailed description of the three steps to detect the skew angle follows.


  • Step 1. Detection point (DP) determination First of all, the input image is vertically

    divided into several blocks. According to our empirical study, 100 rows are chosen as the size of

    each block. A detection point is defined as the left-most black pixel in each block. Each divided

    block is scanned from left to right and then from top to bottom to find the detection point. If

    the scanned pixel is not a background pixel, it is declared as a detection point. Following the

    above procedure, we can find all detection points embedded in the input image. These detection

    points are then fed into Step 2 for the estimation of the coarse skew angle.

    Step 2. Coarse skew angle estimation In this step, the coarse skew angle θe is determined by selecting the majority of the local skew angles which are generated from the detection points. Before the majority selection procedure, the local skew angle θi has to be calculated first. Consider two detection points DPi−1(xi−1, yi−1) and DPi(xi, yi) in two consecutive divided blocks Bi−1 and Bi. The local skew angle θi is defined as

    θi = tan⁻¹(Δyi / Δxi) = tan⁻¹((yi − yi−1) / (xi − xi−1))    (3.1)

    Here, the value Δyi/Δxi is adopted to represent the local skew angle θi, to avoid the computational burden of the tan⁻¹ function. The coarse skew angle θe is then assigned as the majority of the local skew angles.
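    Steps 1 and 2 can be sketched as below, assuming a binary page image with 1 for text pixels; rounding each local angle to a whole degree before the majority vote is an assumption of this sketch:

```python
import math
from collections import Counter
import numpy as np

def coarse_skew_angle(binary, block_rows=100):
    """Estimate the coarse skew angle (in degrees) from the left-most black
    pixel ("detection point") of each horizontal block of the page image."""
    points = []
    for top in range(0, binary.shape[0], block_rows):
        block = binary[top:top + block_rows]
        ys, xs = np.nonzero(block)
        if len(xs) == 0:
            continue  # empty block: no detection point
        j = np.argmin(xs)  # left-most black pixel in this block
        points.append((xs[j], top + ys[j]))
    # Local skew angle between detection points of consecutive blocks,
    # rounded to whole degrees so a majority vote is meaningful.
    angles = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 != x0:
            angles.append(round(math.degrees(math.atan2(y1 - y0, x1 - x0))))
    if not angles:
        return 0
    return Counter(angles).most_common(1)[0][0]
```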

    Step 3. Hough transformation Following the previous two steps, the search range of the skew angle in the Hough plane is reduced from [−90°, 90°] to [θe − 3°, θe + 3°]. Last, the left-most pixel Pi(xi, yi) in each row of the x–y plane is transformed to the Hough plane by making use of the following equation:

    ρi = xi cos θ + yi sin θ    (3.2)

    where θ is located in the range [θe − 3°, θe + 3°]. The skew angle of the input document can thereby be determined by selecting the angle with the largest value in the transformed Hough plane.

    3.2.2 Image rotation transformation

    In this section, a skew image is corrected to generate a non-skew image by rotating it over the skew angle θ obtained in Section 3.2.1. The rotation transformation is a mapping


  • function f(x, y) which maps the coordinates of pixels in the original image to those in the output

    image. However, some pixel values in the output image which correspond to the pixels in the

    original image cannot be defined via the mapping function f because the range and domain

    defined in image processing are integers. In the program implementation, we can devise an inverse function f⁻¹ to define all output pixel values from the original image. Each pixel value in the output image can thereby be determined from the value in the original image via the inverse function f⁻¹.

    Geometrically, the value of pixel P′(x′, y′) in the output image can be determined from that of the corresponding pixel P(x, y) in the original image. The location of pixel P can be obtained from the location of pixel P′ via the following function f⁻¹:

    (x, y) = f⁻¹(x′, y′) = (x′ cos θ + y′ sin θ, −x′ sin θ + y′ cos θ)    (3.3)
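    The inverse-mapping idea of equation (3.3) can be sketched as follows; rotating about the image centre and nearest-neighbour rounding are assumptions of this illustration, not details stated in the report:

```python
import math

def rotate_image(img, theta_deg, bg=0):
    """Correct skew by inverse mapping: each output pixel (xp, yp) looks up
    its source pixel via f^-1, so no output pixel is left undefined.
    `img` is a list of lists (rows) of pixel values."""
    h, w = len(img), len(img[0])
    t = math.radians(theta_deg)
    cy, cx = h / 2.0, w / 2.0  # rotate about the image centre
    out = [[bg] * w for _ in range(h)]
    for yp in range(h):
        for xp in range(w):
            dx, dy = xp - cx, yp - cy
            # inverse rotation of the output coordinates into the input image
            x = dx * math.cos(t) + dy * math.sin(t) + cx
            y = -dx * math.sin(t) + dy * math.cos(t) + cy
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:
                out[yp][xp] = img[yi][xi]
    return out
```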

    3.3 Connected Components

    The connected components are computed for the whole document using a recursive labeling

    algorithm. The algorithm works by first negating the whole image: each black pixel is replaced by −1 and each white pixel by 0. Each pixel in this image is then checked for being a text pixel. If a pixel is a text pixel, a search function that takes the pixel and its coordinates examines its neighbors. This function recursively visits the black pixels that are part of the same component and labels them; the scan then continues until it reaches a new component.
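    A sketch of the labeling procedure is given below. It replaces the recursion described above with an explicit stack (an implementation choice of this sketch, since deep recursion can overflow on large components), but visits pixels in the same way:

```python
def label_components(image):
    """Label 4-connected components of a binary image (list of lists,
    1 = text pixel). Returns the label map and the component count."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    current = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and labels[r][c] == 0:
                current += 1  # a new, unlabeled component starts here
                stack = [(r, c)]
                while stack:  # flood-fill this component
                    y, x = stack.pop()
                    if (0 <= y < rows and 0 <= x < cols
                            and image[y][x] == 1 and labels[y][x] == 0):
                        labels[y][x] = current
                        stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return labels, current
```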

    3.4 Line Segmentation

    There are several steps in the line segmentation method proposed by Priyanka and Srikanth [10]; they are systematically described below.

    Step 1: Run length smearing A smoothing algorithm is applied to the text of a document page. In this step we use the run length smearing technique [12] to increase the strength of the histogram. Here we consider a consecutive run of white pixels between two black pixels and compute the length of that white run. If the length of the white run is less than five times the stroke width, we fill the white run with black. Fig 3.1 shows two original text lines and Fig 3.2 shows the smoothed text lines with the horizontal histogram corresponding to each text line.
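    Run length smearing of one pixel row can be sketched as below; `max_gap` stands for the five-times-the-stroke-width threshold mentioned above:

```python
def smear_runs(row, max_gap):
    """Horizontal run-length smearing: white runs (0s) shorter than
    `max_gap` that lie between two black pixels (1s) are filled black."""
    out = list(row)
    n = len(out)
    i = 0
    while i < n:
        if out[i] == 0:
            j = i
            while j < n and out[j] == 0:
                j += 1  # advance to the end of this white run
            # fill only interior gaps bounded by black pixels on both sides
            if 0 < i and j < n and (j - i) < max_gap:
                for k in range(i, j):
                    out[k] = 1
            i = j
        else:
            i += 1
    return out
```

    Leading and trailing white runs are left untouched, since they are page margin rather than inter-character gaps.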

    [10]

    Figure 3.1: Original Text lines

    [10]

    Figure 3.2: Smoothed Text lines with Histogram

    Step 2: Recursive procedure to get middle lines for segmentation From the histogram of every line of the smoothed document page, we consider the highest peak of the projection profile. We then find the middle point of the length of the highest peak and draw a vertical line from top to bottom at that middle point, as shown in Fig 3.3.

    [10]

    Figure 3.3: Highest peak and vertical line drawn at the middle of highest peak

    This step continues by finding the middle line of each peak of the histogram. Along the vertical line (the line that passes through the middle point of the highest peak) we find the middle points of the peaks, and we draw horizontal lines based on these middle points of the histogram widths. In some cases not every peak of the histogram crosses this vertical line. For these cases we find the distances between middle lines and compute the average of these distances. If the distance between two middle lines is greater than twice the average value, then we assume that the region contains


  • [10]

    Figure 3.4: Middle line detection for considering small-length text

    one or more text lines, and we need recursive segmentation for that region. After getting that region (the region between two middle lines of peaks), we apply the same procedure to find the vertical line through the middle of the highest peak and the middle lines of that particular region. This procedure runs recursively until we find the middle lines of the particular image, as shown in Fig 3.4.

    Step 3: Finding candidate lines In this step, from the starting point of the first histogram, we vertically scan the region between the first middle line and the second middle line of the histogram until we get the first two white pixels. We consider these two white pixels as minimum points: the line where we get the first white pixel is the first minimum, and the line where we get the second white pixel is the second minimum. Now we calculate the vertical distances from the first middle line to the first minimum point and from the first middle line to the second minimum point. Of these two distances we take the maximum, and the minimum point with the maximum vertical distance becomes the separator between the two consecutive middle lines. In this way we find all line separators between consecutive middle lines, as shown in Fig 3.5. If we considered only the point with the minimum number of black pixels in the histogram as the separator line, we would get many errors.

    [10]

    Figure 3.5: (a) Initial segmentation line through the white pixels of the horizontal histogram. (b) Result after considering only the candidate lines from the original histogram.


  • 3.5 Word Segmentation

    In the word segmentation method, a text line is taken as input. After a text line is segmented, it is scanned vertically. If in one vertical scan two or fewer black pixels are encountered, the scan is denoted by 0; otherwise the scan is denoted by the number of black pixels. In this way a vertical projection profile is constructed. Now, if in the profile there exists a run of at least k1 consecutive 0s, then the midpoint of that run is considered the boundary of a word. The value of k1 is taken as 1/3 of the text line height. Word segmentation results for a Telugu text line are shown in Fig 3.6.

    [10]

    Figure 3.6: Output for word segmentation
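    The procedure above can be sketched as follows, assuming the text line is given as a binary 2-D array (1 = black); returning the boundary x-coordinates, and ignoring zero runs at the margins, are choices of this sketch:

```python
def segment_words(line_img, line_height):
    """Find word boundaries in a binary text-line image (list of rows,
    1 = black): a vertical scan with two or fewer black pixels counts as 0,
    and a run of at least k1 = height/3 zero scans marks a word boundary
    at its midpoint."""
    width = len(line_img[0])
    # Vertical projection profile with near-empty scans forced to 0.
    profile = []
    for x in range(width):
        col = sum(row[x] for row in line_img)
        profile.append(0 if col <= 2 else col)
    k1 = max(1, line_height // 3)
    boundaries = []
    x = 0
    while x < width:
        if profile[x] == 0:
            start = x
            while x < width and profile[x] == 0:
                x += 1
            # interior zero runs of length >= k1 separate two words
            if x - start >= k1 and start > 0 and x < width:
                boundaries.append((start + x) // 2)  # midpoint of the run
        else:
            x += 1
    return boundaries
```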

    3.6 Feature Extraction

    Feature Extraction [5]: The output of the normalization phase gives a normalized image of size N × N. Real-valued directional features are calculated for each normalized image of size N × N. These are based on the percentage of pixels in each direction range within each partition. An adaptive gradient magnitude threshold rt is computed over the whole character image gradient map. This threshold is needed to filter out spurious responses of the Sobel operator used to find the gradients. The threshold value rt is computed as

    rt = (Σi,j r(i, j)) / (D1 · D2)

    where r(i, j) is the gradient magnitude at pixel (i, j) and D1 × D2 is the size of the gradient map. Thresholding is performed to nullify the pixels whose gradient magnitude values are below the computed threshold.

    The feature vector is extracted based on the direction of the gradient at each pixel. We divided the whole character image into M × N partitions; in our project we selected M = N = 8. The directions of the gradient are quantized into K values, so each pixel now has a gradient direction value from 1 to K. The percentage of pixels in each partition with direction quantized to k is calculated; thus each partition gives us K such values, and in total we have an M × N × K-dimensional feature vector for each character image. We chose the value K = 12. In our project we have a 192-dimensional feature vector for each normalized character image.

    The steps to extract the feature vector are as follows. For each connected component:

    Obtain the bounding box of the connected component, eliminating the blank surrounding space.

    Calculate the gradient magnitude and direction at each pixel.

    Calculate the adaptive threshold of the gradient magnitude and perform thresholding to obtain the new gradient direction at each pixel.

    Partition the adaptive gradient direction map and extract the complete feature vector.
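    The steps above can be sketched as one NumPy routine. The explicit-shift Sobel stencils, using the mean gradient magnitude as the adaptive threshold, and quantizing directions over [0, 2π) are assumptions of this sketch, not details fixed by the report:

```python
import numpy as np

def directional_features(img, M=8, N=8, K=12):
    """Gradient-direction features for a normalized character image
    (2-D float array): Sobel gradients, a mean-magnitude threshold,
    K quantized directions, and per-partition direction percentages."""
    # Sobel gradients via explicit shifts on an edge-padded copy.
    p = np.pad(img.astype(float), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    mag = np.hypot(gx, gy)
    # Adaptive threshold: mean gradient magnitude over the whole image.
    rt = mag.mean()
    valid = mag > rt
    # Quantize direction [0, 2*pi) into K bins 0..K-1.
    direction = (np.arctan2(gy, gx) + 2 * np.pi) % (2 * np.pi)
    q = np.minimum((direction / (2 * np.pi / K)).astype(int), K - 1)
    h, w = img.shape
    feats = []
    for i in range(M):
        for j in range(N):
            cell = (slice(i * h // M, (i + 1) * h // M),
                    slice(j * w // N, (j + 1) * w // N))
            v = valid[cell]
            total = v.sum()
            for k in range(K):
                frac = ((q[cell] == k) & v).sum() / total if total else 0.0
                feats.append(frac)
    return np.array(feats)  # length M * N * K
```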

    3.7 Pattern classification [2]

    The feature vector extracted from the normalized image has to be assigned a label using a pattern classifier [2]. There are many methods for designing pattern classifiers, such as the Bayes classifier based on density estimation, neural networks, linear discriminant functions, nearest neighbor classification based on prototypes, etc. In this system we have used the Support Vector Machine (SVM) classifier. SVMs represent a pattern classification method which grew out of recent work in statistical learning theory. The solution offered by the SVM methodology for the two-class pattern recognition problem is theoretically elegant, computationally efficient, and is often found to give better performance by way of improved generalization. In the next subsection we provide a brief overview of SVMs.

    3.7.1 SVM Classifier [2]

    The SVM classifier is a two-class classifier based on the use of discriminant functions. A discriminant function represents a surface which separates the patterns so that the patterns from the two classes lie on opposite sides of the surface. The SVM is essentially a separating surface which is optimal according to a criterion explained below.

    Consider a two-class problem where the class labels are denoted by +1 and −1. Given a set of labeled (training) patterns {(xi, yi)}, yi ∈ {−1, +1}, a separating hyperplane is represented by (w, b), where w ≠ 0, such that

    wᵀxi + b > 0 for i : yi = +1;
    wᵀxi + b < 0 for i : yi = −1.    (3.4)

    Here, wᵀxi denotes the inner product between the two vectors, and g(x) = wᵀx + b is the linear discriminant function.

    In general, the set may not be linearly separable. In such a case one can employ the generalized linear discriminant function defined by

    g(x) = wᵀφ(x) + b

    where φ(·) is a mapping of the input patterns into a higher-dimensional space.

  • Let zi = (xi) Thus now we have a training sample (zi, yi) to learn a separating hyperplane

    in

Maximizing the margin of separation leads, through Lagrange multipliers alpha_i >= 0 (one per training pattern), to a weight vector of the form w = sum_i alpha_i y_i z_i. Let S = {i : alpha_i > 0}. Since alpha_i = 0 for i not in S, we can rewrite this as

    w = sum_{i in S} alpha_i y_i z_i.        (3.11)

The patterns {z_i : alpha_i > 0} are called the support vectors. From (3.11) it is clear that w is a linear combination of the support vectors, and hence the name SVM for the classifier. The support vectors are those patterns which are closest to the hyperplane and are sufficient to completely define the optimal hyperplane. Hence these patterns can be considered the most important training examples.

To learn the SVM, all we need are the optimal Lagrange multipliers of the margin-maximization problem. These can be found efficiently by solving its dual, which is the optimization problem: find alpha_i, i = 1, ..., l, to

    Maximize:  sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j z_i^T z_j,
    Subject to: alpha_i >= 0, i = 1, 2, ..., l;  sum_{i=1}^{l} alpha_i y_i = 0.        (3.12)

By solving this problem we obtain the alpha_i, and from these we get w and b. It may be noted that the dual (3.12) is a quadratic optimization problem of dimension l (recall that l is the number of training patterns), with one equality constraint and nonnegativity constraints on the variables. This is so irrespective of how complicated the mapping phi is. Once the SVM is obtained, the classification of any new feature vector x is based on the sign of (recall that z = phi(x))

    f(x) = phi(x)^T w + b = sum_{i in S} alpha_i y_i phi(x_i)^T phi(x) + b,        (3.13)

where we have used (3.11). Thus, both while solving the optimization problem (3.12) and while classifying a new pattern, the training pattern vectors x_i enter the picture only through inner products phi(x_i)^T phi(x_j). This is also the only way phi enters the picture. Suppose, then, that we have a function K such that K(x_i, x_j) = phi(x_i)^T phi(x_j); such a kernel function lets us compute these inner products without ever evaluating phi explicitly. Some popular kernels are listed in Table 3.1.

Table 3.1: Some popular kernels for SVMs.

    Type of kernel      K(x_i, x_j)                            Comments
    Polynomial kernel   (x_i^T x_j + 1)^p                      Power p is specified a priori by the user
    Gaussian kernel     exp(-||x_i - x_j||^2 / (2 sigma^2))    The width sigma^2, common to all the kernels, is specified a priori
    Perceptron kernel   tanh(beta_0 x_i^T x_j + beta_1)        Mercer's condition is satisfied only for certain values of beta_0 and beta_1

Given any symmetric function K, Mercer's theorem gives the conditions under which it corresponds to an inner product in some feature space and can therefore be used as a kernel.
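The kernels of Table 3.1 can be sketched directly; the parameter names (p, sigma^2, beta_0, beta_1) follow the table, and the default values here are illustrative only:

```python
import math

def polynomial_kernel(x, y, p=2):
    """(x.y + 1)^p -- the power p is chosen a priori by the user."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** p

def gaussian_kernel(x, y, sigma2=1.0):
    """exp(-||x - y||^2 / (2 sigma^2)) -- one width for all kernels."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma2))

def perceptron_kernel(x, y, beta0=1.0, beta1=0.0):
    """tanh(beta0 x.y + beta1) -- Mercer's condition holds only for
    certain values of beta0 and beta1."""
    return math.tanh(beta0 * sum(a * b for a, b in zip(x, y)) + beta1)
```

Each function consumes two feature vectors and returns the scalar K(x_i, x_j) that replaces the inner product z_i^T z_j in the dual.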

When the two classes overlap even in the feature space, slack variables xi_i are introduced to permit some violations of separability. With this, we can change the optimization problem to

    Minimize:  (1/2)||w||^2 + C sum_{i=1}^{l} xi_i,        (3.14)
    Subject to: 1 - y_i(z_i^T w + b) - xi_i <= 0, i = 1, ..., l;  xi_i >= 0, i = 1, ..., l.        (3.15)

Here the xi_i can be thought of as penalties for violating the separability constraints; they are now also variables over which the optimization is performed. The constant C is a user-specified parameter of the algorithm, and as C tends to infinity we recover the original problem. It turns out that the dual of this problem is the same as (3.12), except that the nonnegativity constraint on alpha_i is replaced by 0 <= alpha_i <= C. The optimal values of the new variables xi_i are irrelevant to the final SVM solution.

To sum up, the SVM method for learning two-class classifiers is as follows. We choose a kernel function and some value for the constant C in (3.14). Then we solve the dual, which is the same as (3.12) except that the variables alpha_i also have an upper bound, namely C (noting that K(x_i, x_j) is used in place of z_i^T z_j in (3.12)). Once we solve this problem, all we need to store are the nonzero alpha_i and the corresponding x_i (the support vectors). Using these, given any new feature vector x, we can calculate the output of the SVM, f(x), through (3.13). The classification of x is +1 if the output of the SVM is positive; otherwise it is -1.
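Assuming the nonzero alpha_i, the labels y_i, the support vectors x_i, and the bias b have already been obtained from the dual, the stored classifier of (3.13) amounts to the following sketch (a linear kernel is shown only as a stand-in for whichever kernel was chosen):

```python
def linear_kernel(x, y):
    """Plain inner product; stands in for any kernel K(x_i, x)."""
    return sum(a * b for a, b in zip(x, y))

def svm_output(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """f(x) = sum over support vectors of alpha_i y_i K(x_i, x) + b,
    as in (3.13)."""
    return b + sum(a * y * kernel(sv, x)
                   for sv, a, y in zip(support_vectors, alphas, labels))

def svm_classify(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """Sign of the SVM output: +1 if positive, otherwise -1."""
    return 1 if svm_output(x, support_vectors, alphas, labels, b, kernel) > 0 else -1
```

Only the support vectors and their multipliers need to be stored; all other training patterns drop out of the sum.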

SVM classifier for OCR We have used SVM classifiers for labeling each segment of a word. As explained earlier, we train a number of two-class classifiers (SVMs), each one distinguishing one class from all the others; thus each of our class labels has an associated SVM. A test example is assigned the label of the class whose SVM gives the largest positive output. If no SVM gives a positive output, the example is rejected. The output of an SVM measures the distance of the example from its separating hyperplane in the feature space; hence, the higher the (positive) output for a given pattern, the higher the confidence in classifying it.
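The one-vs-rest labelling rule described above can be sketched as follows; the mapping name class_svms and the use of None to signal rejection are illustrative choices, not the tool's actual API:

```python
def label_segment(feature_vec, class_svms):
    """class_svms maps a class label to that class's SVM output
    function f(x).  Returns the label whose SVM gives the largest
    positive output, or None (reject) if no output is positive."""
    best_label, best_out = None, 0.0
    for label, f in class_svms.items():
        out = f(feature_vec)
        if out > best_out:  # only strictly positive outputs can win
            best_label, best_out = label, out
    return best_label
```

Because best_out starts at 0.0, a segment on which every SVM is negative falls through to rejection, matching the rule in the text.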


Chapter 4

Implementation

Developing an OCR for printed Telugu text consists of two stages: pre-processing and recognition. In the pre-processing phase, thresholding and noise removal are implemented using the algorithm specified in Section 3.1.1. Skew detection and removal are implemented using a variant of the Hough transform.

The OCR begins by taking the document image as input. The image is first converted to grayscale, and the grayscale image is then binarized using the method described in the Thresholding section. Connected components in the whole document, together with their bounding boxes, are found using a two-pass algorithm. These connected components are then used to segment the document into lines. Line segmentation takes the array of connected components as a parameter and returns the top and bottom row numbers of each line with respect to the image coordinate system.
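A minimal sketch of the two-pass connected-component step mentioned above, assuming 4-connectivity and a binary image given as a list of 0/1 rows (the report does not specify these details, so the actual implementation may differ):

```python
def connected_components(img):
    """Two-pass 4-connected labelling of a binary image.  Returns a
    dict mapping each final label to its bounding box
    (top, left, bottom, right)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = {}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    nxt = 1
    for r in range(h):                     # pass 1: provisional labels
        for c in range(w):
            if not img[r][c]:
                continue
            nbrs = [labels[r-1][c] if r else 0, labels[r][c-1] if c else 0]
            nbrs = [l for l in nbrs if l]
            if not nbrs:
                labels[r][c] = nxt
                parent[nxt] = nxt
                nxt += 1
            else:
                m = min(nbrs)
                labels[r][c] = m
                for l in nbrs:             # record label equivalences
                    union(l, m)
    boxes = {}
    for r in range(h):                     # pass 2: resolve + bound
        for c in range(w):
            if labels[r][c]:
                root = find(labels[r][c])
                t, le, b, ri = boxes.get(root, (r, c, r, c))
                boxes[root] = (min(t, r), min(le, c), max(b, r), max(ri, c))
    return boxes
```

The bounding boxes returned here are exactly what the line- and word-segmentation steps consume.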

Each text line is given as input to the word segmentation phase, which segments the line into words and returns the left and right column numbers of each word. The connected components belonging to each word are then grouped.

Each component is normalized into a 48×48 image, which is given as input to the feature extraction function. This function takes an image and returns a feature vector of 192 dimensions using the Sobel operator and the adaptive gradient threshold. The feature vector is then given as input to the SVM classifier, which is trained using the SVM training phase

described in later sections.
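The 48×48 normalization step above can be sketched as a nearest-neighbour rescale; this interpolation choice is an assumption, since the report does not state which method the tool uses:

```python
def normalize(img, size=48):
    """Rescale a binary component image (list of 0/1 rows) to
    size x size using nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]
```

The integer index r * h // size picks, for each output pixel, the nearest source pixel, so the sketch works for both enlarging and shrinking a component.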

All these functions are implemented using the Java Advanced Imaging (JAI) package from Sun Microsystems, in the NetBeans 6.8 IDE. LibSVM is the package used for training and applying the SVM classifier.


Chapter 5

Results

Figure 5.1: Home page of the tool

Figure 5.2: Displaying the original image

Figure 5.3: Bounding Connected Components

Figure 5.4: Line Segmentation

Figure 5.5: Word Segmentation

Chapter 6

Conclusion and Future work

Conclusion The main aim of this project is to develop an Optical Character Recognition system for printed Telugu text. Telugu script has a complex structure, with thousands of combinations of vowels, consonants and consonant modifiers; hence detection and recognition of basic symbols helps to reduce the number of classes. This project develops a tool that takes a document image as input and displays the Unicode of each character. This Unicode can then be used to render the corresponding Telugu text.

Future work Recognition accuracy can be further increased by post-processing that makes use of the associations between basic symbols: for example, it is known that some modifiers occur very frequently with certain characters, while other combinations occur very rarely. The feature vector can further be used for recognizing handwritten Telugu script, and the final output of the proposed system can be used for text-to-speech conversion.


Bibliography

[1] Histogram modification for threshold selection. IEEE Transactions on Systems, Man and Cybernetics, 9(1):38–52, January 1979.

[2] T. V. Ashwin and P. S. Sastry. A font and size-independent OCR system for printed Kannada documents using support vector machines. Sadhana, 27:35–58, 2002.

[3] B. B. Chaudhuri and U. Pal. A complete printed Bangla OCR system. Pattern Recognition, 31(5):531–549, 1998.

[4] Huei-Fen Jiang, Chin-Chuan Han, and Kuo-Chin Fan. A fast approach to the detection and correction of skew documents. Pattern Recognition Letters, 18(7):675–686, 1997.

[5] C. Vasantha Lakshmi and C. Patvardhan. An optical character recognition system for printed Telugu text. Pattern Analysis and Applications, 7:190–204, 2004. doi:10.1007/s10044-004-0217-2.

[6] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Proceedings of the IEEE, 80(7):1029–1058, July 1992.

[7] G. Nagy. Twenty years of document image analysis in PAMI. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):38–62, January 2000.

[8] L. O'Gorman. Binarization and multithresholding of document images using connectivity. CVGIP: Graphical Models and Image Processing, 56(6):494–506, 1994.

[9] N. Otsu. A threshold selection method from grey-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66, January 1979.

[10] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda. Line and word segmentation approach for printed documents. IJCA, Special Issue on RTIPPR, (1):30–36, 2010. Published by Foundation of Computer Science.

[11] Victor Wu and R. Manmatha. Document image clean-up and binarization. In Proc. SPIE Symposium on Electronic Imaging, pages 263–273, 1998.

[12] Hong Yan. Skew correction of document images using interline cross-correlation. CVGIP: Graphical Models and Image Processing, 55(6):538–543, 1993.
