Chapter 1 - Shodhgangashodhganga.inflibnet.ac.in/bitstream/10603/27115/4/04...script identification...
Chapter 1
Introduction
---------------------------------------------------------------------------------------------------------------------------
Analysis of document images for information extraction has become prominent
in the recent past. A wide variety of information that was conventionally stored on
paper is now being converted into electronic form for better storage and intelligent
processing. This requires processing the documents with digital image processing
methods. To develop a successful multi-lingual Optical Character Recognition
(OCR) system, separation or identification of the different scripts is an essential step. A
document whose script has been recognized can then be submitted to the respective OCR system for
character/numeral recognition. In this chapter, a brief overview of document image
analysis, OCR and script recognition is presented. Literature related to Indic script
identification is reviewed. Further, properties of the major Indic scripts are
described.
---------------------------------------------------------------------------------------------------------------------------
One interesting and challenging field of research in pattern recognition is Optical
Character Recognition (OCR). To develop a successful multi-lingual OCR system,
separation or identification of the different scripts is an essential step. In a multi-lingual
country like India, a script identification system facilitates the OCR system. India
has more than 22 official languages, and 12 different scripts [8] are used to write these
languages. A systematic, staged approach can be used for script identification in
documents, after which the document with its recognized script is fed to the OCR system for
character/numeral recognition.
1.1 Document Image Analysis
Document image analysis is the process that performs the overall
interpretation of document images. It refers to the algorithms and techniques that are
applied to images of documents to obtain a computer-readable description from pixel
data. It recognizes the text and graphics components in document images and
extracts the intended information from them. It also supports OCR by organizing the
document and applying outside knowledge in interpreting it. It combines image
processing, document formatting, script identification, and character recognition
in order to deal with a particular application. Thus, document image
analysis deals with the global issues involved in the recognition of written script in
images.
Two categories of document image analysis can be defined: text processing
and graphical processing.
Text processing deals with the textual components of a document image. Its tasks
are:
- Determining the skew (any tilt at which the document may have been scanned)
- Finding columns, paragraphs, text lines and words, and recognizing the text by OCR.
Graphical processing deals with the non-textual elements such as tables, lines,
images, symbols, delimiters and company logos.
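The skew-determination task listed above is often approached with projection profiles. The following is a minimal Python sketch of that general idea, not the skew correction method of this thesis. It assumes the page has already been binarized and that the ink pixels are available as (x, y) coordinates; the function names, the toy data and the integer-degree search range are illustrative assumptions.

```python
import math

def profile_sharpness(points, angle_deg):
    """Sum of squared row counts after de-rotating the ink pixels by angle_deg.

    The value is largest when the text lines are aligned with the image rows,
    because the ink then falls into few, heavily populated rows.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        row = round(y * cos_t - x * sin_t)  # row index after de-rotation
        rows[row] = rows.get(row, 0) + 1
    return sum(count * count for count in rows.values())

def estimate_skew(points, candidates):
    """Return the candidate angle (in degrees) with the sharpest row profile."""
    return max(candidates, key=lambda angle: profile_sharpness(points, angle))

def synthetic_lines(angle_deg, baselines=(20, 60, 100), width=200):
    """Toy data: straight 'text lines' tilted by angle_deg (y grows downwards)."""
    slope = math.tan(math.radians(angle_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]

# Search integer angles between -5 and +5 degrees on a page tilted by 3 degrees.
skew = estimate_skew(synthetic_lines(3), candidates=range(-5, 6))
```

A real page would need a finer angle grid and some noise handling; the exhaustive search is kept here only because it makes the principle easy to see.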
1.2 Optical Character Recognition
A well-known document image analysis product is Optical Character
Recognition (OCR) software, which recognizes characters in a scanned document. OCR is a
field of research in pattern recognition, artificial intelligence and computer vision.
Powerful OCR software saves considerable time and effort when creating,
processing and repurposing documents. The technology is used in a broad
range of applications. Typical applications are handwritten character recognition,
processing of textual web images, and information extraction from digital libraries.
Large digital archives are currently available; however, their full value can be
realized only by accessing the information that is embedded in the digital images.
The problem of character recognition can be divided along two major
lines: (i) typewritten versus handwritten and (ii) offline versus online recognition.
A typewritten OCR system recognizes text that has been typed
and then scanned prior to the recognition process.
The field of handwriting recognition is divided into the sub-fields of on-line
and off-line recognition. In on-line recognition, special devices are used to track the
movement of the pen and record temporal information, so recognition is based on the
interpretation of the dynamic handwriting motion. This technology is used mostly for
handwriting analysis on tablet PCs, PDAs and mobile phones, among others. In off-line
recognition, an image of the handwritten text is scanned and recorded. In general, off-line
recognition is considered the more difficult task because of the lack of temporal
information. It is possible to construct the image of the handwriting from the recorded
movement of the pen; however, it is not possible to reconstruct the movement of the
pen from the image alone.
Generally, handwritten character recognition refers to the process of
recognizing static handwriting, usually focusing on the shape of the character against
its background. This process is carried out offline, with the source being
constant. The system attempts to recognize a character that has been written by
a human. This is usually a more difficult task for the following reasons:
- Complexity in pre-processing
- Complexity in feature extraction and classification
- Sensitivity of the scheme to variation in the handwritten text of a document
- Characters in the document have descenders and ascenders
- Variation in the shapes of characters written by different writers
- Similarity between some symbols of different scripts
1.3 Script Identification
The goal of a handwriting recognition system is to process handwritten data
electronically with the same or nearly the same accuracy as humans. By performing this
process with computers, a large amount of data can be transcribed at high speed. An
integrated approach to the design of OCRs for all Indian scripts has great benefits. It
is necessary to identify the different script forms before running an individual OCR
system. In a country like India, script identification is a must for a multilingual OCR
system. It acts as a pre-processor to the OCR system, identifying the script type of the
document, so that the specific OCR tool can be selected, as illustrated in Fig. 1.1. In a
multi-script environment, a bank of OCR systems, one for each script, is
expected to be available. The characters in an input document can then be recognized
reliably by selecting the appropriate OCR system from the OCR bank.
FIGURE 1.1 Stages of document processing in a multi-script environment
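The selection step in Fig. 1.1 can be pictured as a lookup into the OCR bank. The sketch below is only an illustration in Python: the script identifier and the OCR engines are hypothetical stand-ins (simple callables), and none of the names come from an actual OCR library or from the thesis.

```python
def recognize_document(document, identify_script, ocr_bank):
    """Identify the script first, then run the matching OCR engine from the bank."""
    script = identify_script(document)
    if script not in ocr_bank:
        raise KeyError(f"no OCR engine registered for script {script!r}")
    return script, ocr_bank[script](document)

# Hypothetical stand-ins, for illustration only.
def toy_identifier(document):
    # A real identifier would analyse the image; here the answer is stored in the input.
    return document["script_hint"]

toy_bank = {
    "Kannada": lambda doc: f"Kannada OCR output for {doc['name']}",
    "Roman": lambda doc: f"Roman OCR output for {doc['name']}",
}

result = recognize_document(
    {"name": "page-1", "script_hint": "Roman"}, toy_identifier, toy_bank
)
```

Keeping the bank as a plain mapping means a new script can be supported by registering one more engine, without touching the dispatch logic.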
Many documents in the Indian environment are multi-script in nature. A
document containing text in more than one script is called a multi-script
document. Most people use more than one script for communication. Many
Indian documents contain two scripts, namely, the state’s official script (local script)
and English. In certain cases, a document may contain three scripts, for example, the
state’s official script (local), Devanagari (national) and Latin (English). An automatic
script identification technique is useful to identify the script type of a particular
word/line in a multi-script document, segment out its characters and feed them to the
appropriate script-specific OCR for recognition. Fig. 1.2 shows several examples of
multi-script documents.
FIGURE 1.2: Examples of multi-script document images: (a) Malayalam and English
(b) Kannada and English (c) Tamil and English (d) Oriya and English
Recognition of scripts from document images is at the heart of any document
image understanding system. Typically in a multi-script document, different
paragraphs, text blocks, text lines or words in a page are written in different scripts
(Fig. 1.2). The structure of the script and the writing style pose challenges for script
type recognition. The script recognition system operates in the following phases, as shown
in Fig. 1.3:
1. Pre-processing (noise removal, enhancement, skew detection, segmentation)
2. Feature extraction
3. Script recognition (In Indian context, Kannada, English, Devanagari, Tamil,
Telugu, Gujarati, Punjabi, Oriya, Bengali, Malayalam, and Urdu).
FIGURE 1.3 Stages of script recognition
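The three phases listed above can be chained as in the following toy Python sketch. It is only meant to show how the stages connect: fixed-threshold binarization, horizontal zone densities and nearest-prototype matching are simple placeholders chosen for illustration, not the techniques proposed in this thesis, and the labels "top_heavy" and "bottom_heavy" are hypothetical rather than real script classes.

```python
def binarize(gray, threshold=128):
    """Pre-processing: map a grayscale image (0-255) to 1 = ink, 0 = background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

def zone_densities(binary, zones=3):
    """Feature extraction: fraction of ink pixels in each horizontal band."""
    height = len(binary)
    features = []
    for z in range(zones):
        band = binary[z * height // zones:(z + 1) * height // zones]
        area = sum(len(row) for row in band)
        features.append(sum(map(sum, band)) / area if area else 0.0)
    return features

def nearest_prototype(features, prototypes):
    """Script recognition: label of the closest prototype vector."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, prototypes[label]))
    return min(prototypes, key=squared_distance)

def identify_script(gray, prototypes):
    """Run the three phases in order."""
    return nearest_prototype(zone_densities(binarize(gray)), prototypes)

# Toy input: a 6 x 4 image whose top third is dark.
toy_image = [[0] * 4, [0] * 4] + [[255] * 4 for _ in range(4)]
toy_prototypes = {"top_heavy": [0.9, 0.1, 0.1], "bottom_heavy": [0.1, 0.1, 0.9]}
label = identify_script(toy_image, toy_prototypes)
```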
Documents written in Indian scripts present great challenges to an OCR
designer due to the large number of letters in the alphabet, the sophisticated ways in
which they combine, and the complicated graphemes that result. The problem is
compounded by the unstructured manner in which popular fonts are designed.
Further, handwritten script recognition for Indic scripts is still in its infancy compared
to non-Indic scripts such as Latin, Chinese, Japanese, and Korean, and is worthy of
serious investigation.
1.4 Script Recognition - Literature Review
Script is defined as the graphic form of the writing system used to write
statements expressible in language. A script class refers to a particular style of writing
and the set of characters used in it. Languages throughout the world are typeset in
many different scripts. A script may be used by only one language or may be shared
by many languages, sometimes with slight variations from one language to another. In
India, there are many documents written in regional scripts. For example, due to the
policy of state governments in India, official transactions are carried out in the regional
language, while English is used for communication with other states.
Significant work on identifying scripts in multilingual documents has been carried out
by various researchers. Existing script
identification techniques mainly depend on features extracted from document
images at block, line or word level. Block level script identification identifies the
script of a given document from among documents in various scripts. In line based
script identification, a document image can contain more than one script, but each
line must be in a single script. Word level script identification allows the
document to contain more than one script, and the script of every word is identified.
The script recognition methods available in the literature at block level, line level and
word level, respectively, are reviewed below.
Quite a few publications are found in the literature on differentiating
Indian scripts at block level. Peake and Tan [1] have proposed a method for automatic
script and language identification from document images using multiple channel
(Gabor) filters and gray level co-occurrence matrices for seven scripts: Chinese,
English, Greek, Korean, Malayalam, Persian and Russian. Tan [2] has developed a
rotation invariant texture feature extraction method for automatic script identification
for six scripts: Chinese, Greek, English, Russian, Persian and Malayalam. Judith [3]
has proposed a method for script and language identification of Arabic, Chinese,
Cyrillic, Devanagari, Japanese and Roman using connected component features. To
discriminate between printed text lines in Arabic and English, three techniques are
presented in [4]. First, an approach based on detecting the peaks in the horizontal
projection profile is considered. Second, an approach based on the moments of
the profiles, using neural networks for classification, is presented. Finally, an approach
based on classifying run length histograms using neural networks is described.
Dhandra et al. [5] have proposed a script identification method at block level that
extracts the features in two stages. In the first stage, morphological erosion and
opening by reconstruction are carried out on a document image in the horizontal, vertical,
left diagonal and right diagonal directions. In the second stage, the average pixel
distribution is found in these directions. The classification is done using a nearest
neighbor classifier. The experiments are performed on Kannada, Urdu, English, and Devanagari scripts
using a block size of 128 x 128 pixels. Multilingual document recognition
technology and its application in China, which is useful for building a multilingual
digital library, are reported in [6]. The key technologies include statistical character
recognition, structural analysis for similar character discrimination, character
segmentation, script identification and post-processing. A hierarchical blind script
identifier for 11 different Indian scripts is reported in [7]. The various nodes of the
hierarchical tree use different feature-classifier combinations, with Gabor and
Discrete Cosine Transform features evaluated using nearest neighbor,
linear discriminant and support vector machine classifiers.
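Several of the block level methods above classify a feature vector with a nearest neighbor rule. A minimal Python sketch of that rule is given below, generalized to k neighbors. The four-dimensional feature vectors and the labels "script_A" and "script_B" are invented for illustration; they merely stand in for, say, average pixel distributions in four directions and are not data from any cited work.

```python
import math
from collections import Counter

def knn_classify(query, training_set, k=3):
    """Label a feature vector by majority vote among its k nearest training samples."""
    by_distance = sorted(training_set, key=lambda sample: math.dist(query, sample[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (feature vector, script label).
training = [
    ([0.80, 0.20, 0.30, 0.30], "script_A"),
    ([0.70, 0.30, 0.20, 0.30], "script_A"),
    ([0.75, 0.25, 0.30, 0.20], "script_A"),
    ([0.20, 0.80, 0.30, 0.30], "script_B"),
    ([0.30, 0.70, 0.20, 0.30], "script_B"),
    ([0.25, 0.75, 0.30, 0.20], "script_B"),
]

predicted = knn_classify([0.72, 0.28, 0.25, 0.30], training, k=3)
```

With k=1 the function reduces to the plain nearest neighbor classifier.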
More methods are available in the literature for script recognition at line
level from printed documents than from handwritten documents. Twelve Indian
scripts have been explored to develop an automatic script recognizer at text line level
in [8, 10]. The script recognizer classifies using the characteristics and
shape based features of the script. Devanagari was discriminated through the headline
feature, and structural shapes were designed to discriminate English from the other
Indian scripts. Further, the work has been extended using water reservoirs to
accommodate more scripts rather than triplets. In [9], an automated technique for the
identification of printed Roman, Chinese, Arabic, Devanagari and Bangla text lines
from a single document is presented. An automatic scheme to identify text lines of
different Indian scripts from a printed document is attempted in [11]. Features based
on the water reservoir principle, contour tracing, profiles, etc. are employed to identify the
scripts. In [12], a system is presented for Oriya and Roman scripts in printed
documents at line level. Classification is done through horizontal projection profiles of
the pixel intensities in different zones, along with the line height and the number of
characters present in the line. In [13], texture is used as a tool for determining the script of a
handwritten document image, based on the observation that text has a distinct visual
texture, to classify the scripts English, Devanagari and Urdu. Handwritten
blocks and lines are used, and 13 spatial spread features are extracted using morphological
filters to form the feature set. In [14], a model to identify the script type of a
trilingual document printed in Kannada, Hindi and English scripts is proposed. The
distinct characteristic features of these scripts are studied from the nature
of the top and bottom profiles, and the model is trained to learn the distinct
features of each script.
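The headline feature and the horizontal projection profile mentioned above can be illustrated with a small Python sketch. It is a simplified heuristic written for this illustration, not the procedure of [8, 10] or [12]: it assumes a binarized word image (1 = ink) and the threshold `min_fill` is an arbitrary assumption.

```python
def has_headline(binary_word, min_fill=0.8):
    """True if some row in the upper half of the word image is almost entirely ink."""
    height = len(binary_word)
    width = len(binary_word[0])
    upper_rows = binary_word[: max(1, height // 2)]
    profile = [sum(row) for row in upper_rows]  # horizontal projection profile
    return max(profile) >= min_fill * width

# Toy word images: the first has a continuous bar along the top of the strokes.
with_bar = [
    [0] * 8,
    [1] * 8,
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0] * 8,
]
without_bar = [row[:] for row in with_bar]
without_bar[1] = [0, 1, 0, 0, 1, 0, 1, 0]

flags = (has_headline(with_bar), has_headline(without_bar))
```

Handwritten headlines are rarely straight or unbroken, so a practical detector would need a tolerance that this sketch does not attempt.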
A brief review of work proposed in the literature at word level follows. Chain
code based representation and manipulation of handwritten images is reported in
[15]. A survey of offline cursive script word recognition is presented in [16]. The
survey classifies methods into three categories: segmentation-free methods,
segmentation-based methods and the perception-oriented approach. Most of this survey focuses on
the algorithms that were proposed to realize the recognition phase. Two
different approaches have been proposed in [17] for script identification at the word
level from a bilingual document containing Roman and Tamil scripts. In the first
approach, words are divided into three distinct spatial zones. The spatial spread of a
word in the upper and lower zones, together with the character density, is used to identify
the script. The second approach analyses the directional energy distribution of a word
using Gabor filters with suitable frequencies and orientations. Text-word level script
identification from a document containing English, Devanagari and Telugu text is
reported in [18]. In [19], a method for identification and separation of text words of
Kannada, Devanagari, and Roman scripts using discriminating features is presented.
In [20], using a piece-wise projection method, the destination address block (DAB) is
segmented into lines, and then words are extracted. Using the water reservoir concept,
the busy-zone of the word is computed. Finally, using matra and water reservoir concept based
features, Bangla/Devanagari and English scripts are identified word-wise. A system
for word-wise handwritten script identification for Indian postal automation is
reported in [21]. A knowledge based approach to determine the postal code is proposed in
[22]. In [23], a method based on morphological opening by reconstruction
of an image in different directions, together with regional descriptors, is proposed for
script identification at word level. The method is based on the observation that every text has a distinct
visual appearance. In [24], a script identification algorithm is analyzed which takes into account
the fact that the script changes at the word level in most Indian bilingual or
multilingual printed documents. A Gabor function based multichannel
directional filtering approach for both text area separation and script identification at
the word level is reported in [25]. In [26], the effectiveness of Gabor and discrete cosine
transform (DCT) features for word level multi-script identification has been
independently evaluated using nearest neighbor, linear discriminant and support
vector machine (SVM) classifiers. In [27], distinct features of each script are used to
identify Kannada, English and Devanagari using a voting technique. The method
proposed in [28] automatically separates the scripts of handwritten words in a
document written in Bengali or Devanagari mixed with Roman script.
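Several of the word level methods reviewed above rely on the directional energy of Gabor filter responses. The following pure-Python sketch shows what such a feature can look like. It is an illustration under assumed parameters (wavelength 4, sigma 2, a 9 x 9 kernel, four orientations, an isotropic envelope), not the filter bank of any cited paper. Here the orientation is the direction along which the carrier oscillates, so the 0 degree filter responds most strongly to vertical strokes.

```python
import math

def gabor_kernel(theta_deg, wavelength=4.0, sigma=2.0, half_size=4):
    """Real, zero-mean Gabor kernel whose carrier oscillates along direction theta."""
    theta = math.radians(theta_deg)
    kernel = []
    for dy in range(-half_size, half_size + 1):
        row = []
        for dx in range(-half_size, half_size + 1):
            along = dx * math.cos(theta) + dy * math.sin(theta)
            envelope = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * along / wavelength))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (len(kernel) ** 2)
    return [[value - mean for value in row] for row in kernel]

def gabor_energy(image, kernel):
    """Mean squared filter response over the positions where the kernel fits."""
    size = len(kernel)
    half = size // 2
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(half, height - half):
        for x in range(half, width - half):
            response = sum(
                kernel[j][i] * image[y + j - half][x + i - half]
                for j in range(size)
                for i in range(size)
            )
            total += response * response
            count += 1
    return total / count

def directional_energies(image, orientations=(0, 45, 90, 135)):
    """Feature vector: one energy value per filter orientation."""
    return [gabor_energy(image, gabor_kernel(theta)) for theta in orientations]

# Toy image: vertical strokes, 2 pixels wide, repeating every 4 pixels.
stripes = [[1 if x % 4 < 2 else 0 for x in range(24)] for _ in range(24)]
energies = directional_energies(stripes)
```

For the striped toy image the energy concentrates in the 0 degree channel, which is the kind of contrast between orientations that a classifier can exploit.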
Background information about past research on both global
and local approaches to script identification in document images
is reported in [29]. Both approaches can perform script identification in document
images at document, line and word level. Gopal Datt Joshi et al. [30] have proposed a
hierarchical classification scheme which uses features consistent with human
perception for script identification from Indian documents.
1.5 Introduction to Major Indian Scripts and Languages
India is a multilingual country. It has 22 official languages, which include
Assamese, Bengali, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri,
Malayalam, Marathi, Nepali, and Oriya. Further, not all Indian languages have
their own unique script. Some of them use the same or a similar script. For example,
languages such as Hindi, Marathi, Rajasthani, Sanskrit and Nepali are written using
the Devanagari script; Assamese and Bengali are written using the Bengali
script; Urdu and Kashmiri are written using the Urdu script; and Telugu and Kannada use
similar scripts. In all, twelve different Indic scripts are used to write these 22
languages: Roman, Bengali, Devanagari, Gurumukhi,
Gujarati, Kashmiri, Malayalam, Oriya, Tamil, Kannada, Telugu and Urdu. With the
exception of the Urdu script, which is of Perso-Arabic origin, they have evolved from
a single source, the phonographic Brahmi script, first documented extensively in the
edicts of Emperor Asoka in the third century BC. They are defined as “syllabic
alphabets” or abugidas in that the unit of encoding is a syllable of speech; however,
the corresponding orthographic units show distinctive internal structure and a
constituent set of graphemes [32]. A word in these scripts is written as a sequence of
these orthographic syllabic units, referred to as characters.
Figure 1.4: Twelve Indian scripts: Roman, Devanagari, Bangla, Gujarati, Kannada, Kashmiri,
Malayalam, Oriya, Gurumukhi, Tamil, Telugu, and Urdu
Apart from numerals, vowels, and consonants, there are compound characters
in most of the Indian regional scripts. Compound characters are formed by combining
two or more consonants, and they are more complex in shape than the basic consonants.
Further, a vowel following a consonant may take a modified shape and be placed on
the left, right, top, or bottom of the consonant, depending on the vowel. Such
characters are called modified characters. A brief description of the languages using
the Latin, Devanagari, Gujarati, Gurumukhi, Telugu, Kannada, Tamil, Malayalam,
Bengali and Oriya scripts, respectively, which are considered in our study, is presented
below. All these scripts are written from left to right.
i) English: English is the most common auxiliary language, widely used on almost all
the continents of the world. In the last couple of centuries it has virtually attained the
status of a universal language. In many Asian countries, such as India and Malaysia,
English is accepted and used as a means of communication among the people. In a
multilingual country like India, where more than 22 official state languages and
hundreds of local dialects are in use, English acts as a binding force among
countrymen. The Indian parliament has also recognized English as an official
language in addition to Hindi, which is considered as the National language. The
modern English alphabet is a Latin-based alphabet consisting of 26 letters, each with
upper and lower case forms. In addition, there are some special symbols and
numerals. English script is also termed a bicameral script (a script using two separate
cases). The letters A, E, I, O, U are considered vowel letters; the remaining letters are
considered consonant letters (Fig. 1.5). Capital letters are A, B, C, etc.; lower case
letters are a, b, c, etc. The letters of the English alphabet contain mostly vertical and
slant strokes.
FIGURE 1.5 English Alphabets (panels: vowels, upper case; consonants, upper case)
ii) Hindi: An Indo-Aryan language of North India, having equal status with English
as an official language throughout India. It is one of several languages spoken in
different parts of the sub-continent, with about 487 million speakers. Hindi is written
in the Devanagari script. The script is phonetic, so that Hindi, unlike English, is
pronounced as it is written. The Devanagari alphabet descended from the Brahmi script
sometime around the 11th century AD. It was originally developed to write Sanskrit
but was later adapted to write many other languages. The writing system is an
alphasyllabary (abugida). The script has 12 vowels and 34 consonants (Fig. 1.6). Consonant
letters carry an inherent vowel, which can be altered or muted by means of diacritics
or matras. Vowels can be written as independent letters, or by using a variety of
diacritical marks which are written above, below, before or after the consonant they
belong to. This feature is common to most of the alphabets of South and South East
Asia. When consonants occur together in clusters, special conjunct letters are used.
The Devanagari script is used to write Bhojpuri, Marathi, Mundari, Nepali,
Pali, Sanskrit, Sindhi and many other languages, including Hindi. Devanagari is recognizable by
a distinctive horizontal line running along the tops of the letters that links them
together.
FIGURE 1.6 Hindi Vowels and Consonants (panels: vowels and vowel diacritics; consonants)
iii) Gujarati: The Gujarati script is one of the modern scripts of India and was derived
from the Devanagari script during the 16th century CE. The major difference between
Gujarati and Devanagari is the lack of the top horizontal bar in Gujarati. Otherwise
the two scripts are fairly similar. Gujarati is a syllabic alphabet in which all
consonants have an inherent vowel. Vowels can be written as independent letters, or
by using a variety of diacritical marks which are written above, below, before or after
the consonant they belong to. The Gujarati character set provides 14 vowels and 34
consonants (plus 2 compound characters, ksha and gna), as shown in Fig. 1.7.
FIGURE 1.7 Gujarati Vowels and Consonants (panels: vowels and vowel diacritics; consonants)
iv) Punjabi: Punjabi is an Indo-Aryan language spoken by about 105 million people,
mainly in West Punjab in Pakistan and in East Punjab in India. Punjabi descended
from the Shauraseni language of medieval northern India and became a distinct
language during the 11th century. The Gurumukhi (Punjabi) alphabet was devised
during the 16th century and is modeled on the Landa alphabet. It is a syllabic
alphabet in which all consonants have an inherent vowel. Diacritics, which can appear
above, below, before or after the consonant they belong to, are used to change the
inherent vowel. Modern Gurumukhi has forty-one consonants, nine vowel symbols,
two symbols for nasal sounds, and one symbol which duplicates the sound of any
consonant. In addition, four conjuncts are used (Fig. 1.8).
FIGURE 1.8 Punjabi Vowels and Consonants (panels: vowels and vowel diacritics; consonants)
v) Telugu: A Dravidian language spoken by about 75 million people, mainly in the
southern Indian state of Andhra Pradesh, where it is the official language. It is also
spoken in neighbouring states such as Karnataka, Tamil Nadu, Orissa, Maharashtra
and Chhattisgarh. The origins of the Telugu alphabet can be traced to the Brahmi
alphabet of ancient India, which developed into an alphabet used for both Telugu and
Kannada, which in turn split into two separate alphabets between the 12th and 15th
centuries AD. The writing system is a syllabic alphabet in which all consonants have an
inherent vowel. Diacritics, which can appear above, below, before or after the
consonant they belong to, are used to change the inherent vowel. Written text consists of
sequences of simple and/or complex characters. The alphabet consists of 60
symbols, of which 16 are vowels, 3 are vowel modifiers, and 41 are consonants, as shown
in Fig. 1.9.
FIGURE 1.9 Telugu Vowels, Vowel diacritics and Consonants (panels: vowels and vowel diacritics; consonants)
vi) Kannada: The official language of the southern Indian state of Karnataka.
Kannada is a Dravidian language spoken by about 44 million people in the Indian
states of Karnataka, Andhra Pradesh, Tamil Nadu and Maharashtra. The earliest
inscriptional records in Kannada are from the 6th century. The Kannada script is closely
akin to the Telugu script in origin. Under the influence of Christian missionary
organizations, the Kannada and Telugu scripts were standardized at the beginning of the
19th century. The writing system is an alphasyllabary in which all consonants have an
inherent vowel. Other vowels are indicated with diacritics, which can appear above,
below, before or after the consonants. Kannada has 16 vowels and 34 consonants.
There are about 250 basic, modified and compound character shapes in Kannada
(Fig. 1.10).
FIGURE 1.10 Kannada Vowels and Consonants (panels: vowels; consonants)
vii) Tamil: A Dravidian language spoken by around 52 million people in India, Sri
Lanka, Malaysia, Vietnam, Singapore, Canada, the USA, the UK and Australia. It is the
first language of the Indian state of Tamil Nadu, and is spoken by a significant
minority of people (2 million) in north-eastern Sri Lanka. The earlier Tamil
inscriptions were written in the Brahmi, Grantha and vaTTezuttu scripts. The Tamil script
is partially “alphabetic” and partially syllable-based (Fig. 1.11). The writing system of
Tamil is a syllabic alphabet. There are twelve vowels and eighteen consonants.
The consonants are made up of six surds, their corresponding six sonants, and six
medials. Combinations of consonants with vowels give rise to new symbols or result
in modified symbols.
FIGURE 1.11 Tamil Vowels and Consonants (panels: vowels and vowel diacritics; consonants)
viii) Malayalam: Malayalam belongs to the southern group of Dravidian languages,
along with Tamil, Kota, Kodagu and Kannada. It has a high affinity with Tamil. In
the early thirteenth century the Malayalam script developed from a script known as
vattezhuthu (round writing), a descendant of the Brahmi script. It is a syllabic
alphabet in which all consonants have an inherent vowel. Diacritics, which can appear
above, below, before or after the consonant they belong to, are used to change the
inherent vowel. The modern Malayalam alphabet has 13 vowel letters, 36 consonant
letters, and a few other symbols, as shown in Fig. 1.12.
FIGURE 1.12 Malayalam Vowels and Consonants (panels: vowels; consonants)
ix) Bengali: The Bengali (also called Bangla) script is used for writing the Bengali
language, spoken mostly in Bangladesh and India. The Bengali alphabet is
derived from the Brahmi alphabet. It is also closely related to the Devanagari
alphabet, from which it started to diverge in the 11th century AD. The Bengali script
has a total of 11 vowel graphemes. All of these are used in both Bengali and
Assamese, the two main languages using the script. The script is also used for a number of
other Indian languages, including Sylheti and, with one or two modifications,
Assamese. Bengali writing shares some similarities with the Dravidian-language
scripts, particularly in the shapes of some vowel letters, but it is generally more
similar to the Aryan-language scripts, in particular Devanagari.
Thirty-five consonant letters and eleven independent vowel letters
are used in this script (Fig. 1.13). Each vowel letter also has a diacritic form which
combines with a consonant to modify the inherent vowel.
FIGURE 1.13 Bengali Vowels and Consonants (panels: vowels and vowel diacritics; consonants)
x) Odiya (Oriya): The spoken languages Oriya, Bengali and Assamese have a
common mother language, Prakrit (or Pali), which diversified into three branches in
Eastern India: Magadhi, Maitheli and Sudrusa. Magadhi became the modern Oriya,
Maitheli the modern Bengali and Sudrusa the modern Assamese language. The Oriya
script is derived from the ancient Brahmi script through various transformations. The
Oriya alphabet is complex, consisting of 268 symbols (13 vowels, 36
consonants, 10 digits and 210 conjuncts). Fig. 1.14 shows the vowels, vowel diacritics
and consonants of Oriya.
FIGURE 1.14 Oriya vowels, vowel diacritics and Consonants (panels: vowels and vowel diacritics; consonants)
xi) Urdu: The Urdu alphabet is the right-to-left alphabet used for the Urdu language.
It is a modification of the Persian alphabet, which is itself a derivative of the Arabic
alphabet. With 38 letters and no distinct letter cases, the Urdu alphabet is typically
written in the calligraphic Nasta'liq script.
FIGURE 1.15 The Urdu alphabet, with names in the Devanagari and Roman alphabets
1.6 Motivation and Problem Definition
Automatic script identification is crucial to meet the growing demand for electronic
processing of volumes of documents written in different scripts. Script identification
from handwritten documents is a challenging task due to the large variation in
handwriting as compared to printed documents. Many documents in India,
handwritten or machine printed, contain two or more scripts. Further,
documents consisting of a regional script and Latin script occur
more frequently than other combinations. From the literature survey, it is evident that
handwritten script recognition is at an early stage [3, 13, 16, 20, 21, 22, 23, 28],
whereas most of the reported studies address script
recognition for printed documents [4, 5, 7, 8, 9, 10, 11, 12, 14, 17, 19, 22, 24, 25, 26,
27, 30, 31]. This motivated us to work in this area and design algorithms for script
recognition from handwritten documents. Based on the work carried out in this area,
it was proposed to design efficient algorithms to identify the script type at the level of
block/line/word, with the observation that in multi-script documents a specific script
may appear at the level of block/line/word in the document. Ten major Indian scripts,
including Roman (Latin) script, are considered in the proposed work.
1.7 Organization of the Thesis
The thesis is organized into seven chapters.
In Chapter 1, a brief description of Document Image Analysis and OCR
systems is presented. The importance of handwritten script identification is also
described. Methods and techniques available in the literature are presented. Different
types of scripts and languages in the Indian context are discussed.
Chapter 2 presents details regarding the collection of handwritten script
documents from various sources. As no standard database for handwritten script
identification for Indian scripts is available, we have created a large dataset for
carrying out experiments with the methods proposed in the thesis. A novel method for
skew correction of the scanned document images is presented. Denoising is performed,
and binary images of the blocks, lines and words of the handwritten
document images are extracted to create the database. The proposed skew correction
technique is tested on various printed and handwritten script documents.
In Chapter 3, methods used for extracting the features of various Indian scripts
are described. In many cases, the most discriminative information is hidden in the
frequency content of the signal rather than in the spatial domain. So for feature
extraction, Gabor, DCT and wavelet features are considered. Also, a brief description of the
classifiers used for recognizing the script of the block, line, and word is presented in
this chapter.
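To make the idea of frequency-domain features concrete, the following Python sketch computes an orthonormal 2-D DCT-II of a small block and keeps its low-frequency coefficients. It is a generic textbook formulation written for illustration. The block size, the number of coefficients kept and the function names are assumptions, not the feature definitions used in Chapter 3.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt((1.0 if k == 0 else 2.0) / n)

    def basis(k, t):
        return math.cos((2 * t + 1) * k * math.pi / (2 * n))

    return [
        [
            alpha(u) * alpha(v) * sum(
                block[y][x] * basis(u, y) * basis(v, x)
                for y in range(n)
                for x in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]

def low_frequency_features(block, keep=2):
    """Feature vector made of the keep x keep lowest-frequency DCT coefficients."""
    coeffs = dct2(block)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

# A flat block has only a DC term; a horizontal edge adds vertical-frequency energy.
flat = [[1.0] * 4 for _ in range(4)]
edge = [[1.0] * 4, [1.0] * 4, [0.0] * 4, [0.0] * 4]
flat_features = low_frequency_features(flat)
edge_coeffs = dct2(edge)
```

In this indexing the first coefficient index follows the rows, so a horizontal edge shows up in `edge_coeffs[1][0]` while `edge_coeffs[0][1]` stays at zero.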
A novel method for recognizing the script at block level is presented in
Chapter 4. Block level script identification identifies the script of a given document
from among documents in various scripts. Blocks of size 512 x 512 pixels are input to
the proposed system for script recognition. Features based on Fourier, DCT and
wavelet transforms are extracted to maximize the distinction between English, Devanagari and
local official language scripts. The classification is done using a k-nearest neighbour
(k-NN) classifier. The results clearly show that the combined features that constitute
DCT and wavelets yielded better results for recognition of the script.
In Chapter 5, handwritten script identification methods at line level and
at the level of a portion of a line are presented. A document image can contain more than
one script, but each line must be in a single script. Gabor filter banks are used for
feature extraction from a line and from a portion of a line. The portion of the line may contain
two or more words. The script classification task for a portion of the line is simpler
and faster than the analysis of the entire line extracted from the
handwritten document. Experiments are performed for identification of the script type of
eight Indian scripts including English for bi-script documents. Gabor combined with
DCT and Gabor combined with wavelets are proposed for tri-script documents. The
classification is done using k-NN and SVM classifiers. At line level, features are
extracted by using Gabor combined with DCT and Gabor combined with wavelets.
The results obtained by applying DCT/wavelets to the Gabor convolved
images are promising compared to those from the Gabor convolved images alone.
Script identification at word level is proposed in Chapter 6. The method
presented in Chapter 5 for script identification at line level of the document image is
used for word level identification. To increase the accuracy of the script recognition, a
neural network classifier with ranking of features is considered. Experiments are
carried out on nine different scripts. It is observed that the performance of the neural
network classifier is better than that of k-NN and SVM. The proposed method is
tested on text-word databases consisting of words with more than two characters, two
characters, and one character, respectively. Encouraging results are obtained.
Conclusions and future work are presented in Chapter 7. In conclusion, the
methods proposed for identification of scripts from handwritten documents are
summarized. Limitations of the proposed methods and future extensions are also
discussed.