
A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script

PRAKASH CHOUDHARY, National Institute of Technology Manipur, Computer Science and Engineering
NEETA NAIN, National Institute of Technology Jaipur, Computer Science and Engineering

This article introduces a large handwritten text document image corpus dataset for Urdu script named CALAM (Cursive And Language Adaptive Methodologies). The database contains unconstrained handwritten sentences along with their structural annotations for the offline handwritten text images with their XML representation. Urdu is the fourth most frequently used language in the world, but due to its complex cursive writing script and low resources, it is still a thrust area for document image analysis. Here, a unified approach is applied in the development of an Urdu corpus by collecting printed texts, handwritten texts, and demographic information of writers on a single form. CALAM contains 1,200 handwritten text images, 3,043 lines, 46,664 words, and 101,181 ligatures. For capturing maximum variance among the words and handwritten styles, data collection is distributed among six categories and 14 subcategories. Handwritten forms were filled out by 725 different writers belonging to different geographical regions, ages, and genders with diverse educational backgrounds. A structure has been designed to annotate handwritten Urdu script images at line, word, and ligature levels with an XML standard to provide a ground truth of each image at different levels of annotation. This corpus would be very useful for linguistic research in benchmarking and providing a testbed for evaluation of handwritten text recognition techniques for Urdu script, signature verification, writer identification, digital forensics, classification of printed and handwritten text, categorization of texts as per use, and so on. The experimental results of some recently developed handwritten text line segmentation techniques experimented on the proposed dataset are also presented in the article for asserting its viability and usability.

CCS Concepts: • Computing methodologies → Image processing; • Applied computing → Document management and text processing

Additional Key Words and Phrases: Urdu handwritten text, annotation, OCR algorithms benchmarking, corpus

ACM Reference Format:
Prakash Choudhary and Neeta Nain. 2016. A four-tier annotated urdu handwritten text image dataset for multidisciplinary research on urdu script. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15, 4, Article 26 (May 2016), 23 pages.
DOI: http://dx.doi.org/10.1145/2857053

1. INTRODUCTION

Over the past few years, many advances have been made in the field of handwritten text recognition. Linguistic resources such as annotated corpora play a significant role and are the most in-demand platform for computational linguistic research. A machine-readable corpus has more capability to explore and identify all

Authors’ addresses: P. Choudhary, Department of Computer Science and Engineering, NIT Manipur, Imphal-795001 India; email: [email protected]; N. Nain, Department of Computer Science and Engineering, MNIT Jaipur, Rajasthan-302017 India; email: [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].
© 2016 ACM 2375-4699/2016/05-ART26 $15.00
DOI: http://dx.doi.org/10.1145/2857053

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 26, Publication date: May 2016.


Table I. Some of the Most Widely Used Online Handwritten Databases

Database | Scripts | Contents | Content Size
CASIA [Liu et al. 2011] | Chinese | Text pages and lines, characters, and symbols | 5,090 text pages of 52,220 lines, 3.9 million characters, and 171 symbols; 10,000 images
TAUT [Nakagawa et al. 1997] | | Isolated characters |
IAM-on [Indermühle et al. 2010] | English | Text documents having words and strokes | 941 pages with 7,616 lines; 68,841 words and 355,097 strokes
Nakagawa and Matsumoto [2004] | Japanese | Characters | 3 million patterns
Nethravathi et al. [2010] | Tamil, Kannada | Isolated words | 100,000 words in each script
OHASD [Elanwar et al. 2010] | Arabic | Paragraphs | 154 paragraphs of 3,800 words and 19,400 characters
Kumar [2010] | Devanagari | Characters | 1,800 character samples

features of natural language, including the characteristics of the desired texts such as lexical, textual, semantic, and syntactic attributes. Corpus linguistics is an approach for investigating the diversity of a language using a large collection of real-life natural language text samples. This approach has long been used in a number of research areas, such as the study of a language and its writing styles and the development and benchmarking of various OCR algorithms.

Researchers have developed various datasets/databases of different standards based on requirements, such as databases of isolated digits/characters, text lines/words, or paragraphs, to evaluate the performance of various OCR algorithms. In the field of handwritten text document image analysis, many highly cited algorithms exist in the literature, such as those of Alaei et al. [2011b], Gatos et al. [2009], Stamatopoulos et al. [2013], Margner and El Abed [2009], Gatos et al. [2010], Likforman-Sulem et al. [2007], Li et al. [2008], Louloudis et al. [2009], and Yin and Liu [2009]; these works have used standard databases, and have shown the need for them, to train and test their methods.

Two axes of database development have been identified for handwritten text recognition systems based on the input mode: online and offline. Researchers have developed online handwritten databases [Liu et al. 2011; Guyon et al. 1994; Viard-Gaudin et al. 1999; Kumar 2010; Indermühle et al. 2010; Nakagawa et al. 1997; Nethravathi et al. 2010]. Bhaskarabhatla and Madhvanath [2004] collected handwritten data for online handwriting recognition in different Indic scripts. A brief overview of the online handwritten databases for English, Chinese, Japanese, and Indic scripts is shown in Table I.

Guyon et al. [1994] designed a platform for data exchange and recognizer benchmarking. This format includes various online hand-printed and cursive alphabets including Latin and Chinese, signatures, and pen gestures. The database CASIA (Institute of Automation of Chinese Academy of Sciences) built by NLPR (National Laboratory of Pattern Recognition) [Liu et al. 2011] introduces both online and offline Chinese handwriting databases, containing samples of isolated characters and handwritten texts. The dataset has 3.9 million isolated character samples produced by 1,020 writers using an Anoto pen on paper to obtain both online trajectory data and offline images.

The database TAUT [Nakagawa et al. 1997] is another online handwritten database made of 10,000 character patterns by selecting the 1,227 most frequently appearing character categories from a sequence of newspaper sentences. The dataset IRESTE [Viard-Gaudin et al. 1999] is a dual handwriting database; it has 4,086


isolated digits, 10,685 isolated lowercase letters, 10,679 isolated uppercase letters, and 410 EURO signs. It also contains 31,346 isolated words (28,657 French and 2,689 English words).

Kumar [2010] introduced a database for Devanagari script, composed of 1,800 samples from 36 character classes obtained from 25 native writers. Each writer was asked to provide two samples per class. Nethravathi et al. [2010] developed an online handwritten database of 200,000 words for two Indic scripts, Tamil and Kannada, by collecting 100,000 words for each script from 600 users to capture the variations in writing style. Bhaskarabhatla and Madhvanath [2004] collected handwritten data for online handwriting recognition in different official Indic scripts.

The second category of handwritten datasets is the offline handwritten database, where standard databases of isolated characters/digits or sentences have been developed and proposed in the literature. Some of the most widely used handwritten databases for scripts such as French, English, Korean, Chinese, and Indic scripts are IRESTE, CEDAR, IAM, CMATER, NIST, PE92, IFN/ENIT, KHTD, FHT, HIT-MW, and PBOK. A brief overview of these databases is shown in Table II.

NIST [Wilkinson 1992], MNIST [Deng 2012], IRESTE [Viard-Gaudin et al. 1999], IAM [Marti and Bunke 2002], RIMES [Grosicki et al. 2006], and CEDAR [Hull 1994] are the most frequently used standard English text databases.

The NIST database is composed of handwritten characters/digits and running English texts. The data samples were extracted from 2,100 filled forms. The MNIST database is a large database of handwritten digits extracted from the NIST database. The IAM database is a collection of 1,539 handwritten text pages. Besides text pages, it also contains images of text lines and words with ground-truth labels.

The CEDAR [Hull 1994] database is a collection of digital images of city names, state names, and zip codes from postal addresses. Images have been segmented from the addresses by a semiautomatic process. It has been widely used for evaluating a large number of OCR techniques in ICDAR and ICFHR.

The databases IRESTE [Viard-Gaudin et al. 1999] and CMATER [Sarkar et al. 2012] were developed for more than one language simultaneously. IRESTE is a dual handwriting database of English and French scripts. It includes images of isolated digits and letters and 410 EURO signs. The database also contains 31,346 isolated words (French: 28,657 and English: 2,689). The CMATER database contains 150 handwritten document pages, among which 100 pages were written purely in Bangla script and the remaining 50 pages were written in Bangla text mixed with English words.

The database PE92 [Kim et al. 1993] is a collection of handwritten Korean character images for which the authors collected 100 sets of the 2,350 KS (Korean Script) handwritten Korean character images. The first 70 sets were generated by more than 500 different writers, and the same person wrote each of the remaining 30 sets.

A Chinese handwriting database named HIT-MW [Su et al. 2007] was developed from 853 handwritten forms, which were produced under unconstrained conditions without preprinted character boxes. ETL9 [Saito et al. 1985] is a set of hand-printed characters in JIS Chinese with its analysis in Japanese.

The corpus in Sutat and Methasate [2004] is a Thai script handwritten character corpus, developed by the NECTEC (National Electronics and Computer Technology Center). The corpus consists of more than 44,000 images of online and offline handwritten characters, including isolated, touching, and cursive handwritten characters.

For the Arabic language, most of the available databases are collections of isolated characters/digits or words, while the database AHTD [Mahmoud et al. 2011] is a collection of handwritten text pages. The IFN/ENIT [Pechwitz et al. 2002] database was developed for training and testing of Arabic handwriting recognition systems. It contains more than 2,200 binary images of handwritten forms written by 411 writers. A


Table II. Some of the Most Widely Used Offline Handwritten Text Databases

Database | Scripts | Contents | Content Size
PE92 [Kim et al. 1993] | Korean | Characters | 235,000 characters
HIT-MW [Su et al. 2007] | Chinese | Sentences and characters | 853 forms having 8,664 lines and 186,444 characters
ETL-9 [Saito et al. 1985] | Japanese | Characters | 607,200 characters
RIMES [Grosicki et al. 2006] | English | Mail samples | 12,723 mail samples
IAM [Marti and Bunke 2002] | English | Running English text pages | 5,685 paragraphs of 13,353 lines and 115,320 words
CEDAR [Hull 1994] | English | Words of state, city, and postal name | 5,000 city names, 5,000 state names, 10,000 postal zip codes, and 50,000 characters
IFN/ENIT [Pechwitz et al. 2002] | Arabic | Text pages | 2,200 binary images; 26,00 isolated words of 411 writers
Al ISRA [Kharma et al. 1999] | Arabic | Paragraphs, words, digits, and signatures | 500 paragraphs, 37,000 words, 10,000 digits, and 2,500 signatures
AHTD [Mahmoud et al. 2011] | Arabic | Text forms | 1,000 forms written by 300 writers
FHT [Ziaratban et al. 2009] | Farsi | Handwritten forms | 1,000 filled forms
IFN/FARSI [Mozaffari et al. 2008] | Farsi | City names | 7,271 images of 1,080 Iranian city names
HaFT [Safabaksh et al. 2013] | Farsi | Text pages | 1,800 forms
Khosravi and Kabir [2007] | Farsi | Digits | 102,352 digits extracted from 1,200 registration forms
CENPARMI [Haghighi et al. 2009] | Farsi | Words, dates, digits, and letters | 432,357 isolated samples of dates, digits, letters, and words written by 400 writers
KHTD [Alaei et al. 2011a] | Indic scripts | Kannada text pages, lines, words | 204 text pages of 4,298 lines and 26,115 words by 51 writers
Bhattacharya and Chaudhuri [2009] | Devanagari and Bangla | Numeral samples | 22,556 Devanagari and 23,392 Bangla
CMATER [Sarkar et al. 2012] | Bangla and English | Text pages | 150 document pages: 100 pages of Bangla and 50 pages of English-Bangla mix
PBOK [Alaei et al. 2012] | Indo-Persian scripts | Text pages of Bangla, Oriya, Kannada, and Persian | 707 text pages of four scripts, 12,565 text lines, and 104,541 words
CENPARMI [Sagheer et al. 2009] | Urdu | Isolated digits and words | 44 isolated characters and 57 Urdu words
CENIP-UCCP [Raza et al. 2012] | Urdu | Sentences | 400 text pages


ground-truth file for each word in the database has been compiled. This file contains information about the word, such as the position of the word baseline, and information on the individual characters used in the word.

Al-Ma’adeed et al. [2002] introduced AHDB, a database for offline Arabic handwriting recognition, together with an analysis of the database and its associated preprocessing operations. The database contains sample images of Arabic words and free handwriting text pages. Alamri et al. [2008] introduced a database of isolated Indian digits, numerical strings, Arabic isolated letters, and a collection of 70 Arabic words. It also includes a free-format sample for Arabic dates.

Al-Ohali et al. [2003] developed an Arabic cheques database for research in the recognition of handwritten Arabic cheques. It is composed of real-life Arabic legal amounts, Arabic subwords, courtesy amounts, Indian digits, and Arabic cheques.

Kharma et al. [1999] describe the methodology for the development of Al ISRA, a comprehensive database including handwritten Arabic words, numbers, and signatures. AHTD [Mahmoud et al. 2011] is a database for offline Arabic handwritten text recognition. The database is composed of images of the handwritten text at various resolutions, and it also provides ground-truth metainformation for written text at the page, paragraph, and line levels.

For the Farsi language, a few databases exist. FHT [Ziaratban et al. 2009] is an unconstrained Farsi handwritten text database of 1,000 forms with contributions from 250 participants in different age groups and with varied education levels. These characteristics of the database make it suitable for many OCR applications. Khosravi and Kabir [2007] introduced a very large dataset of handwritten Farsi digits. The database includes binary images of digits extracted from about 12,000 registration forms of two types, filled out by BSc and senior high school students.

A new large-scale multipurpose CENPARMI Farsi handwritten dataset [Haghighi et al. 2009] consists of 432,357 images of dates, words, isolated letters, isolated digits, numeral strings, special symbols, and documents. The forms were collected from 400 native Farsi writers. The IfN/Farsi [Mozaffari et al. 2008] database consists of 7,271 binary images of Iranian province/city names. The HaFT [Safabaksh et al. 2013] database contains 1,800 grayscale images of unconstrained texts.

Corpus development for Indian scripts was initiated in 1991. To date, very few datasets are available for Indian scripts. Some of the notable works are as follows: The Kannada handwritten text database (KHTD) [Alaei et al. 2011a] is an unconstrained dataset, containing 204 handwritten documents of four different categories written by 51 native speakers of Kannada. The total numbers of text lines and words in the dataset are 4,298 and 26,115, respectively.

Bhattacharya and Chaudhuri [2009] developed a mixed numeral handwritten database of Indian scripts. The database includes isolated handwritten numeral samples of real-life situations for Devanagari and Bangla scripts. CMATER [Sarkar et al. 2012] is a database of unconstrained Bangla-English mixed script handwritten document images. The database contains 150 handwritten document pages, among which 100 pages are written purely in Bangla script and the remaining 50 pages are written in Bangla text mixed with English words.

The standard database PBOK [Alaei et al. 2012] of four different scripts includes text pages of three Indic scripts, Kannada, Bangla, and Oriya. The Kannada part of the database has 228 text pages of four different domains written by 57 writers. The Kannada section contains a total of 4,850 handwritten text lines, 29,306 words, and 213,147 characters. It also contains 199 and 140 handwritten text pages of Bangla and Oriya, respectively. The database provides pixel- and content-based ground truthing for all the text pages. This database contains text pages written from both directions, and most of the samples have either overlapping or touching text lines.


For the Urdu language, only two handwritten databases exist so far. CENPARMI [Sagheer et al. 2009] is an Urdu offline handwriting database, which includes isolated digits, numeral strings with/without decimal points, five special symbols, 44 isolated characters, 57 finance-related words, and a collection of Urdu dates in different formats. The other available offline Urdu handwritten database is CENIP-UCCP [Raza et al. 2012], which is an unconstrained offline sentence database composed of 400 digitized forms produced by 200 different writers. The database has been labeled/marked up at the text-line level only.

From the literature review, it can be summarized that there exists a sufficient number of standard databases for scripts like English, Chinese, and Japanese, while very few standard databases are available for Arabic and Farsi. Compared to these languages, very little attention has been given to the Urdu language. The Urdu handwritten database developed by CENPARMI [Sagheer et al. 2009] focuses only on isolated characters and digits and some selected words. Only CENIP-UCCP [Raza et al. 2012] includes 400 images of handwritten sentences.

Urdu script is more complicated and elaborate than Arabic and Persian. The main reason that Urdu script has received less attention and has developed more slowly in the OCR field is the lack of a standard database for Urdu script. The availability of resources for data collection is much lower for Urdu than for scripts like Persian and Arabic. It is difficult to use Urdu script in automation, as a single character entry needs two to three keystroke combinations. To bypass this data entry step, we need to develop machine vision systems for automatically converting handwritten Urdu characters into their transcribed counterparts. To develop such intelligent systems, we need a large corpus to train the system for recognizing handwritten Urdu characters. These issues motivated us to develop an Urdu handwritten text database, which is a much-needed platform for training, testing, and benchmarking of handwritten text recognition systems for the Urdu script.

This article describes the detailed methodology of developing an annotated corpus, CALAM, in a scientific way, including a large volume of unconstrained handwritten text images in Urdu script and their corresponding transcribed texts in a Unicode text file or in an XML file format. The corpus consists of 1,200 handwritten images written by 725 writers belonging to different geographical regions. The number of handwritten text lines varies from two to six in a form. The number of words varies from 20 to 80 in a text form/image. The text page also includes the demographic information of the writer, such as name, age, gender, education, address, and signature. The selection of texts is distributed within six categories and 14 subcategories to achieve the maximum variation in the words. The corpus is designed to support a wide range of computational linguistic research, such as identifying writing styles and grammatical information and developing machine-readable platforms. The corpus consists of an aligned transcription for each image, line by line, phrase by phrase, or word by word. The corpus is completely marked up for content information to support content detection and evaluation of systems like linguistic handwriting recognition, signature verification, and writer identification. The database was used for benchmarking handwritten text recognition algorithms by generating an XML file of annotated handwritten text images. Experimental results in the form of quantitative analysis of four handwritten text-line segmentation techniques are also reported.
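The hierarchical markup described above can be pictured with a short sketch. It nests image, line, word, and ligature levels and serializes them to XML. All element and attribute names, the file name, and the writer fields are hypothetical placeholders chosen for illustration; they are not the CALAM schema.

```python
import xml.etree.ElementTree as ET


def build_annotation(image_name, writer, lines):
    """Build a tree: image -> line -> word -> ligature.

    Element and attribute names are hypothetical, not the CALAM schema.
    `lines` is a list of lines; each line is a list of words; each word
    is a list of ligature strings.
    """
    root = ET.Element("image", name=image_name)
    ET.SubElement(root, "writer", **writer)
    for line_no, words in enumerate(lines, start=1):
        line_el = ET.SubElement(root, "line", id=str(line_no))
        for word_no, ligatures in enumerate(words, start=1):
            word_el = ET.SubElement(line_el, "word", id=str(word_no))
            # The word transcription is the concatenation of its ligatures.
            word_el.set("text", "".join(ligatures))
            for lig in ligatures:
                ET.SubElement(word_el, "ligature").text = lig
    return root


# One text line holding two words, each already split into ligatures.
sample_lines = [[
    ["\u0627", "\u0631", "\u062F", "\u0648"],
    ["\u067E\u0627", "\u06A9\u0633\u062A\u0627", "\u0646"],
]]
root = build_annotation("form_0001.png", {"age": "25", "gender": "F"}, sample_lines)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Keeping the transcription at every level lets the same file serve line, word, and ligature experiments without re-deriving one level from another.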

The article first introduces the experimental setup for the collection and distribution of data in a systematic manner, and then reports the process of information fetching and feeding in both the handwritten text image and its corresponding XML file. The article is organized as follows: Section 2 describes the characteristics of Urdu script. Section 3 introduces the process of data collection and gives an overview of the statistics of the database. Section 4 describes the functionality and annotation of the scanned


Fig. 1. A set of the 38 most commonly used alphabets and digits of Urdu script.

handwritten image in a hierarchical manner with the generation of an XML file for ground truth. Section 5 compares the structure with existing document annotation tools. Section 6 provides experimental evidence in terms of text-line segmentation and the distribution of word frequencies in the proposed corpus. Finally, conclusions are presented in Section 7.

2. CHARACTERISTICS OF URDU SCRIPT

Urdu script belongs to the Indo-Aryan family of scripts and has been historically associated with India since the time of the Mughal Empire. The present shape of Urdu script is significantly influenced by languages like Persian, Arabic, Turkish, Punjabi, and other indigenous languages of the Indian subcontinent. It is the national language of Pakistan and is one of the 22 scheduled languages in the Constitution of India. India has a large number of native Urdu speakers in five of its states: Andhra Pradesh, Jammu and Kashmir, Bihar, Uttar Pradesh, and New Delhi. Urdu is the official language of Jammu and Kashmir state, and recently Urdu was also approved as the second official language of Uttar Pradesh. The population of Hindi-Urdu speakers is the fourth-largest community in the world after Mandarin Chinese, English, and Spanish. According to Government of India 2001 census data [Census 2001], more than 50 million people in India speak Urdu as their native language.

The Urdu script is written from right to left and is an extension of the Persian alphabet, which is itself an extension of the Arabic alphabet. The Urdu alphabet set contains 38 characters and 10 digits, as shown in Figure 1. Urdu is associated with the Nastaleeq style of Persian calligraphy, whereas Arabic is written in the Naskh style. As shown in Figure 1, the “diamond shape” on the top of characters indicates the characters extended for Urdu from Persian. In Unicode, Arabic and its associated languages like Urdu, Punjabi, and Sindhi have been allocated 1,200 code points in the ranges 0600h-06FFh and FB50h-FEFFh.
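The figure of 1,200 code points follows directly from the two ranges quoted above, as the following minimal sketch checks. The helper names are illustrative only; the letter U+0679 (tteh) is used as an example of a character that Urdu adds to the basic Arabic set.

```python
import unicodedata

# Ranges quoted in the text, as inclusive (first, last) code points.
ARABIC_RANGES = [(0x0600, 0x06FF), (0xFB50, 0xFEFF)]


def range_size(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1


def in_arabic_ranges(ch):
    """True if the character falls inside one of the quoted ranges."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in ARABIC_RANGES)


total = sum(range_size(first, last) for first, last in ARABIC_RANGES)
print(total)  # 256 + 944 = 1200

tteh = "\u0679"
print(unicodedata.name(tteh), unicodedata.bidirectional(tteh))
print(in_arabic_ranges(tteh), in_arabic_ranges("A"))
```

The bidirectional class reported for the letter is `AL` (Arabic letter), which is how Unicode encodes the right-to-left behavior. Note that the second range as quoted also spans a few non-Arabic blocks between U+FE00 and U+FE6F, so 1,200 is an upper bound on the Arabic-script allocation in these ranges rather than an exact count.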

When writing, individual characters are joined together according to rules for every consecutive pair of characters in order to form groups of characters called


Table III. A Sample of Valid Urdu Ligatures Formed by a Combination of Two to Eight Characters

Table IV. Differences Between Very Similar-Looking Letters Using the Dots

Table V. Differences Between Very Similar-Looking Letters Using the Diacritic

ligatures. A word consists of one or more ligatures written next to each other. Ligatures in Urdu are composed of one or more characters; Table III shows examples of seven valid ligatures formed by combining two to eight Urdu characters. Urdu characters typically take different shapes according to their placement within a ligature. Both the meaning and shape of the characters change depending on their position (beginning, middle, or end). The problem is further aggravated by the cursive nature of the script. Thus, the shape assumed by a character in a word is context sensitive and is decided by its placement.
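As a rough illustration of how a word breaks into ligatures, the sketch below splits a Unicode string after every letter that connects only to the preceding letter. The letter set is an assumed, non-exhaustive list, diacritics are not handled, and this is not the segmentation procedure used to build the corpus. The second part uses the Unicode character database to list the four positional shapes of the letter be.

```python
import unicodedata

# Letters that connect only to the preceding letter, so a ligature ends
# after them. Assumed, non-exhaustive list: alef, alef madda, dal, ddal,
# zal, re, rre, ze, zhe, wao, bari ye.
RIGHT_JOINING = set(
    "\u0627\u0622\u062F\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06D2"
)


def split_ligatures(word):
    """Split a word (base letters only) into its ligatures."""
    ligatures, current = [], ""
    for ch in word:
        current += ch
        if ch in RIGHT_JOINING:
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures


pakistan = "\u067E\u0627\u06A9\u0633\u062A\u0627\u0646"  # peh alef kaf seen teh alef noon
urdu = "\u0627\u0631\u062F\u0648"                        # alef re dal wao
print(len(split_ligatures(pakistan)))  # 3
print(len(split_ligatures(urdu)))      # 4

# Positional shapes of be (U+0628) in Arabic Presentation Forms-B.
for cp in range(0xFE8F, 0xFE93):
    print(f"U+{cp:04X}", unicodedata.decomposition(chr(cp)))
```

The loop over U+FE8F to U+FE92 prints the isolated, final, initial, and medial forms, all of which decompose to the same base letter; this is the context sensitivity that a recognizer has to undo.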

Furthermore, the use of dots and diacritics in writing complicates the recognition process. Dots play a significant role in the Urdu alphabet; a single dot can make a big difference. The placement of a dot can change one letter into a different letter. For example, as shown in Table IV, the letter [be] shares its basic shape with three other letters, [pe], [te], and [se], with only the dots differentiating them.

One of the challenges for Urdu OCR is to characterize the differences between these very similar-looking letters. Table IV shows the differences between such letters that depend on dots, and Table V shows the differences that depend on diacritics.


3. DATA COLLECTION AND DISTRIBUTION

The process of design and development of an Urdu corpus starts with raw data collection and ends with appropriate tagging and labeling of the collected texts in the database. In our methodology, we used a higher-level (sentence)-based approach, which combines different units of writing in a single trial, rather than collecting lists of isolated characters, digits, and words. The data were collected mostly from news channels such as BBC Urdu and ETV Urdu, Urdu blogs, historical documents, and textbooks. In some categories, such as history, architecture, and biography, the printed documents were entered manually as Unicode text because Urdu Unicode text was not available for some words that were used earlier and are no longer in use. To capture as many words as possible, the data collection covers a long time period, from 1901 to the present, across six categories.

To be representative of all the phenomena of a language, the corpus contains a large variety of text samples. The domain of the corpus is divided into six categories, which are further divided into 14 subcategories, to capture the maximum variance in word collection and to make the corpus balanced. Although there are no fixed criteria for a balanced corpus, the criteria we have chosen are the topic (category) of the selected text and the time span of data collection. The advantage of a balanced corpus is that texts are selected in such a way that searching becomes more efficient than in an imbalanced corpus. It also provides additional facilities, such as classification of texts according to research requirements, filtering of texts, and statistical analysis of data by age, gender, educational qualification, region, and category.

The categories and their corresponding subcategories, with the keywords used to denote them during data collection, are as follows:

(1) History - H
    (a) Indian History - IH
    (b) World History - WH

(2) Literature - L
    (a) Poetry/Religion - PR
    (b) Ghazals/Shayari - GS
    (c) Biography - BI

(3) Science - S
    (a) Medical - ME
    (b) Physics - PH
    (c) Chemistry - CH

(4) News - N
    (a) International - IN
    (b) National - NA
    (c) Sports - SP

(5) Architecture - A
    (a) Rural Architecture - RA
    (b) Urban Architecture - UA

(6) Politics - P
    (a) Central Government - CG
    (b) State Government - SG

3.1. Design of Handwritten Form

Fig. 2. A sample filled-in form.

The form layout has been designed to collect a large amount of significant information on a single form and to make the corpus useful in multidisciplinary research areas of document image analysis, such as writer identification, signature verification, segmentation of printed and handwritten text, evaluation of OCR algorithms and technology, and training of a system for automatic data entry. The A4-size form is split into four parts, as shown in Figure 2. The parts are separated from each other by horizontal lines, which simplifies the segmentation of the machine-printed text, the handwritten text, and the demographic information of the writer. The parts are organized as follows:

(1) Part 1: This part of the form comprises the title for a language in the database and a unique identification number (UID). For example, Form 1 of the Urdu language, Indian History subcategory will have the UID URD-H-IH-001. The UID of the corresponding form is automatically generated or updated once a language and category/subcategory are selected.

(2) Part 2: This part of the form consists of two to four lines of printed text of around 20 to 80 words, collected from various sources.

(3) Part 3: The third part of the form is left blank so that the writers can replicate the printed text in their own handwriting, as shown in Figure 2.

(4) Part 4: The fourth part of the form is optional and collects demographic information about the writer, which aids in training a system for automatic data entry. The following information is collected: name, age/gender, education, address, rural/urban, date the form was filled out, and signature.

Fig. 3. Distribution of domain-wise data collection of handwritten forms.
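The UID scheme described in Part 1 of the form can be sketched in Python as follows. This is a minimal illustration rather than the tool's actual implementation: the `CATEGORIES` table simply transcribes the keywords listed in Section 3, while the function name `make_form_uid` and its validation rules are assumptions made for this example.

```python
# Category keywords mapped to their subcategory keywords (from Section 3).
CATEGORIES = {
    "H": ["IH", "WH"],          # History
    "L": ["PR", "GS", "BI"],    # Literature
    "S": ["ME", "PH", "CH"],    # Science
    "N": ["IN", "NA", "SP"],    # News
    "A": ["RA", "UA"],          # Architecture
    "P": ["CG", "SG"],          # Politics
}


def make_form_uid(category, subcategory, form_no, language="URD"):
    """Build a form UID such as URD-H-IH-001.

    The three-digit zero padding follows the examples in the text; the
    checks below are assumptions made for this sketch.
    """
    if category not in CATEGORIES:
        raise ValueError(f"unknown category keyword: {category}")
    if subcategory not in CATEGORIES[category]:
        raise ValueError(f"{subcategory} is not a subcategory of {category}")
    if form_no < 1:
        raise ValueError("form numbers start at 1")
    return f"{language}-{category}-{subcategory}-{form_no:03d}"


if __name__ == "__main__":
    print(make_form_uid("H", "IH", 1))  # URD-H-IH-001
```

Rejecting a subcategory that does not belong to the selected category mirrors the idea that the UID is generated only after a valid language and category/subcategory pair has been chosen.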

The filled-out forms were scanned at a resolution of 300 dpi in grayscale. Each form was scanned in full, including the printed text, handwritten text, and demographic information, and the corresponding transcription of the scanned image was stored in a Unicode UTF-8 text file.

3.2. Statistics of the Database

The database contains 1,200 handwritten text forms, filled out by 725 writers from different age groups and with different educational qualifications. Text pages were written by both males and females; 65% of the writers were males and 35% were females. Information about name, age, and address was collected on each page. Seventy-five percent of the 725 writers were younger than 26 years, and 79% were graduate students. Each writer was asked to write forms in an unconstrained environment in his or her natural handwriting with different pen styles and inks.

To capture the maximum variance in data collection, the domain of data collection is divided into six categories and 14 subcategories. The statistics of the data collection according to the categories are shown in Figure 3.

The database contains 3,403 Urdu handwritten text lines, 46,664 Urdu words, and 101,181 Urdu ligatures. On average, each filled-out handwritten text page comprises 2.84 text lines, 38.89 words, and 84.31 ligatures. The database also contains 33,162 unique words, which are 71.07% of the total words present in the database. Besides this, the database contains 2,353 Urdu printed text lines.
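The per-page averages and the unique-word share follow directly from the totals above. The short Python sketch below reproduces the arithmetic; the constant and function names are chosen here for illustration only.

```python
# Totals reported for the database.
FORMS = 1200
LINES = 3403
WORDS = 46664
LIGATURES = 101181
UNIQUE_WORDS = 33162


def per_form(total, forms=FORMS):
    """Average count of an item per handwritten form."""
    return total / forms


def unique_word_share(unique=UNIQUE_WORDS, total=WORDS):
    """Unique words as a percentage of all words."""
    return 100.0 * unique / total


if __name__ == "__main__":
    print(f"lines per form:     {per_form(LINES):.4f}")
    print(f"words per form:     {per_form(WORDS):.4f}")
    print(f"ligatures per form: {per_form(LIGATURES):.4f}")
    print(f"unique word share:  {unique_word_share():.4f}%")
```

The line and word averages round to the reported 2.84 and 38.89, and the unique-word share rounds to 71.07%. The ligature average evaluates to 84.3175, which matches the reported 84.31 when truncated to two decimals.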

The domain-wise distribution of lines, words, and ligatures in the database is shown in Figure 4. The database contains ligatures of one to six characters, and the distribution of the ligatures with various character combinations is shown in Figure 5. Statistics of the demographic distribution of the dataset are tabulated in Table VI.


Fig. 4. Domain-wise data distribution of lines, words, and ligatures.

Fig. 5. Distribution of number of ligatures with one to six characters in the form.

Table VI. Statistics of the Demographic Distribution of the Dataset

Disparate demographic information   Writer names   City names   Date formats   Signatures
Total number                        667            432          346            725


4. CORPUS MARKUP AND ANNOTATION

The availability of a large annotated ground-truth database is a significant advancement for handwritten text recognition techniques. Corpus annotation makes the corpus usable in broad areas of computational linguistic research by associating it with additional information and providing support for machine learning. Corpus annotation also plays a significant role in the automatic evaluation of segmentation and recognition results.

Annotation is a time-consuming and error-prone task, so it requires the utmost care, as highlighted in the literature on online Indic script annotation [Kumar et al. 2006; Belhe et al. 2009; Jawahar et al. 2009; Bhaskarabhatla et al. 2004]. Messaoud and Abed [2010] designed a structure to annotate handwritten historical documents, while Alaei et al. [2011a, 2012] annotated offline handwritten document databases. We have developed a structure (CALAM) that richly annotates a large volume of offline handwritten text documents in a systematic way, reduces annotation time, and handles data validation.

Apart from pure text-page annotation, CALAM provides some additional linguistic features, such as aligned transcription for segmented lines and words. In addition to handwritten text-page annotation, the database also stores, in Unicode and at the form level, the demographic information of the writer of the text page. The information includes the writer’s name, education, age/gender, address, and geographical information.

The annotation of a handwritten text form was done in the standard Unicode encoding (UTF-8) for two reasons: (1) to ensure compatibility with non-Urdu operating systems and character sets and (2) to make the corpus independent of language and operating system and compatible with other Unicode-based corpus access tools.

The next section describes the step-by-step process of designing the corpus after the scanned handwritten text forms are generated, the four levels of annotation, and the generation of a standard XML file for ground truth, as shown in Figure 6.

4.1. Structural Mapping and Auto-Indexing

The structural mapping makes it easy to create the corpus and to navigate through the stored information of handwritten images, segmented lines, words, and components in the database, along with a broad view of the input data and the Unicode transcription. It also supports insertion, modification, and searching of data, giving direct access to the needed attributes and their annotated information.

The unique ID (UID) configuration of each handwritten form is as follows:

(1) The file name is the concatenation of the language (2 bits), category (3 bits), subcategory (3 bits), and form number (8 bits, xxxxxxxx). The index structure is shown in Figure 7.

(2) The index of the form ID is 16 bits, so the maximum total number of forms is 2^16 = 65,536.

(3) There can be a maximum of eight categories, and hence 2,048 forms in each category, and a maximum of eight subcategories per category, and hence 256 forms in each subcategory.

(4) The structure reserves 2 bits for the language so that it can be extended to support other languages.
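The capacities quoted in the list above follow from the field widths. The Python sketch below packs the four fields into one 16-bit integer and unpacks them again. It is only an illustration under stated assumptions: the field order (language in the most significant bits) and the use of small integer codes for language, category, and subcategory are choices made here, not details given in the text.

```python
# Field widths in bits: language, category, subcategory, form number.
FIELD_WIDTHS = (2, 3, 3, 8)


def pack_index(language, category, subcategory, form_no):
    """Pack four integer field codes into a single 16-bit index."""
    index = 0
    for value, width in zip((language, category, subcategory, form_no), FIELD_WIDTHS):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        index = (index << width) | value
    return index


def unpack_index(index):
    """Recover (language, category, subcategory, form_no) from an index."""
    values = []
    for width in reversed(FIELD_WIDTHS):
        values.append(index & ((1 << width) - 1))
        index >>= width
    return tuple(reversed(values))


if __name__ == "__main__":
    print(1 << sum(FIELD_WIDTHS))       # total index space: 65536
    print(1 << (3 + 8))                 # forms per category: 2048
    print(1 << 8)                       # forms per subcategory: 256
    print(unpack_index(pack_index(0, 1, 2, 5)))
```

With 3 bits for the subcategory and 8 bits for the form number, each category spans 2^11 = 2,048 indices and each subcategory spans 2^8 = 256, in agreement with items (2) and (3).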

Fig. 6. Hierarchical process flow of the corpus development and its annotation.

To achieve automatic consistency checking throughout the database, all handwritten text images stored in the database receive the unique ID generated during auto-indexing. The UID of each uploaded image is automatically indexed according to the selected language, category, and subcategory, as shown in Figure 7. When a new form is inserted, the user selects the script language and category of the handwritten text form, and the ID field is filled in accordingly. For example, for the Urdu script and the Literature category, the UID of a form will be URD-L-GS-005, as shown in Figure 8. Automatic indexing also applies to the UIDs of the segmented lines and words of a handwritten image, each of which extends the form UID with a “-” separator: the line UID is generated automatically from the image ID, and the word UID is generated automatically from the line UID. For example, the first image of the Literature category and Poetry/Religion subcategory of the database is named URD-L-PR-001. The images are stored in PNG format, so the first image file of the database is named URD-L-PR-001.PNG.
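The derivation of line and word UIDs from the form UID can be sketched as below. The text specifies only that each level extends its parent UID with a “-” symbol; the two-digit zero-padded counter and the function names `child_uid`, `parent_uid`, and `image_filename` are hypothetical choices made for this illustration.

```python
def child_uid(parent, number, width=2):
    """Extend a parent UID with '-' and a zero-padded counter (assumed format)."""
    if number < 1:
        raise ValueError("counters start at 1")
    return f"{parent}-{number:0{width}d}"


def parent_uid(uid):
    """Drop the last '-' separated component to recover the parent UID."""
    return uid.rsplit("-", 1)[0]


def image_filename(uid, extension="PNG"):
    """File name under which the image with this UID is stored."""
    return f"{uid}.{extension}"


if __name__ == "__main__":
    form = "URD-L-PR-001"
    line = child_uid(form, 1)   # first line of the form
    word = child_uid(line, 3)   # third word of that line
    print(line, word, image_filename(form))
```

Because every child UID contains its parent UID as a prefix, a consistency check between a word, its line, and its form reduces to a simple string comparison.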


Fig. 7. Automatic unique id generation of a handwritten text form.

Fig. 8. A sample structural mark up of a handwritten text image.

4.2. Ground Truth and Validation

The structure maps the exact location of handwritten text in the corresponding scanned images at the line, word, and ligature levels. These text region coordinates are indexed in the database as well as in the XML file, which is useful for proper benchmarking of segmentation techniques for handwritten text recognition. Selected segmented images of lines and words are stored in a separate folder, while all the manually entered ground-truth transcription data of images, lines, and words are written to the respective fields in the database. A bounding box is displayed over the selected text region for better visibility, so that one can follow the path of the image components. A mapping is maintained between the window screen and the viewport: when the cursor points at the unique ID of a line, word, or ligature, a rectangular bounding box appears on the corresponding image region in the viewport. A sample of the structural markup of a handwritten image is shown in Figure 8.


Table VII. Metainformation of an XML Formatted File

Level  Specification            Metainformation
1      Demographic information  Writer’s description; image unique ID; printed text; date of creation
2      Handwritten image        Writer information; number of lines; transcription of handwritten text; pixel coordinates of image; line unique ID
3      Segmented lines          Pixel coordinates for bounding box; transcription text; number of words; word unique ID
4      Segmented words          Pixel coordinates for bounding box; transcription text; synonyms, antonyms; number of ligatures
5      Ligatures                Pixel coordinates for bounding box

Displaying the image and the corresponding information in the same viewport helps in validating the context information and in visually reviewing the annotated data. As a result, we create a database using this structure in which all information is stored in the database, and the images of text pages, segmented lines, and words are stored separately in the system in PNG format, each named by its corresponding UID.

Validation checks are crucial for maintaining the integrity of any database structure and also help ensure that the system operates on clean, correct, and useful data. In our corpus they are implemented through auto-indexing and cross-indexing routines that apply validation and data normalization rules. In a nutshell, data needs to be validated at the stage/level where it is most likely to be erroneous. The types of data validation applied are form-level validation, search criteria validation, field-level validation, and range validation for every field.

4.3. An XML Representation

An Extensible Markup Language (XML)-based set of rules was used for encoding documents in a format that is both human readable and machine readable, as XML provides a standard hierarchical representation that is well suited to document analysis tasks. XML is the most commonly used file format for the ground-truth annotation of a corpus. CALAM can create an XML representation, based on the data entry description, for each handwritten text form in the database. The user can select an image to generate the corresponding XML file and then download or directly view it.

The heart of CALAM is the image database, which includes 1,200 scanned images of handwritten sentences. Each image is accompanied by a rich XML metainformation file that is encoded at five levels of hierarchical metainformation, as shown in Table VII. Each XML file contains a hierarchical record that categorizes the handwritten text image data into levels such as lines, words, and components and describes their specification. The XML schema also encapsulates writers’ demographic information, such as name, age, education, and address, as further elements of the corpus. The standard CES (Corpus Encoding Standard) [Ide 1998b], under the guidelines of the TEI (Text Encoding Initiative) [Stührenberg 2012], is used for electronic data encoding and for the metainformation of the XML file.

Table VIII. Comparative Analysis of CALAM

Structure      Input Type            Annotation Level           Output                                    Applications
PixLabeler     English               Image                      Text, XML                                 Labeling
GTLC           Chinese               Lines, words, characters   XML                                       Annotation
APTI           Arabic                Image                      XML                                       Transcription
Truthing tool  English               Image                      XML                                       Retrieval of text
MAST           Camera image          Printed text in image      Unicode, XML                              Annotation of text
LabelMe        Scene image           Object in image            XML                                       Object detection
CALAM          Handwritten document  Lines, words, components   Image, Unicode, auto-generated XML file   Design and annotate corpus; OCR algorithm benchmarking; Unicode transcription; statistical analysis of corpus data on various terms; NLP applications

As a result, the structure generates an XML file for each text page, including the information on the lines, words, and ligatures of that page, based on the data entries. The XML file contains the same information as was provided during data entry, organized in a five-level hierarchy and named after the UID. For example, the XML file generated for the handwritten form image “URD-U-UA-001.png” is “URD-U-UA-001.xml.”
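A five-level record of this kind could be generated along the following lines. This is a hedged sketch using Python's standard `xml.etree.ElementTree` module; the element and attribute names (`form`, `writer`, `image`, `line`, `word`, `ligature`, `x`, `y`, `width`, `height`), the UID suffixes, and all coordinate values are placeholders invented for the example and do not reproduce the actual CALAM schema.

```python
import xml.etree.ElementTree as ET


def bbox(x, y, width, height):
    """Return bounding-box values as strings, as XML attributes require."""
    return {"x": str(x), "y": str(y), "width": str(width), "height": str(height)}


def build_form_xml(form_uid):
    """Build a five-level record: writer, image, line, word, ligature."""
    form = ET.Element("form", {"uid": form_uid})

    # Level 1: demographic information (placeholder value).
    writer = ET.SubElement(form, "writer")
    ET.SubElement(writer, "education").text = "graduate"

    # Level 2: the handwritten image and its line count.
    image = ET.SubElement(form, "image", {"file": f"{form_uid}.png", "lines": "1"})

    # Level 3: one segmented line with bounding box and transcription.
    line_uid = f"{form_uid}-01"
    line = ET.SubElement(image, "line", {"uid": line_uid, **bbox(120, 900, 2200, 140)})
    ET.SubElement(line, "transcription").text = "اردو"

    # Level 4: one segmented word inside the line.
    word = ET.SubElement(line, "word", {"uid": f"{line_uid}-01", **bbox(1900, 910, 400, 120)})
    ET.SubElement(word, "transcription").text = "اردو"

    # Level 5: one ligature inside the word (bounding box only).
    ET.SubElement(word, "ligature", bbox(2100, 910, 200, 120))
    return form


if __name__ == "__main__":
    root = build_form_xml("URD-L-PR-001")
    print(ET.tostring(root, encoding="unicode"))
```

Nesting ligatures inside words, words inside lines, and lines inside the image keeps each bounding box next to the transcription it belongs to, which is what a segmentation benchmark needs to read back.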

5. COMPARATIVE CHARACTERISTICS OF THE PROPOSED DATABASE

Table VIII compares the proposed corpus structure CALAM with existing structures for handwritten text image corpora: PixLabeler [Saund et al. 2009], GTLC [Yin et al. 2009], Truthing Tool [Elliman and Sherkat 2001], APTI [Slimane et al. 2009], MAST [Kasar et al. 2011], and LabelMe [Russell et al. 2008].

Table VIII shows the functionality of the existing annotation tools. PixLabeler and Truthing Tool provide a way to annotate English-language documents. APTI and GTLC are available for offline handwritten document annotation in Arabic and Chinese scripts, respectively. APTI annotates handwritten images but does not annotate lines and words. MAST and LabelMe were designed for annotation of camera-based images. LabelMe provides the functionality of object recognition in a scene image, and MAST can be used for annotation of printed text in multiscript scene images.

Compared to the previous structures, CALAM displays the handwritten Urdu text image and the transcription of that image on the same screen in a collaborative context. CALAM offers a simple way to annotate and collect a large volume of information for Urdu script, such as digits, paragraphs, lines, words, machine-printed text, and handwritten text, on the same platform. CALAM automatically generates an XML file of annotated metainformation that serves as the ground truth of the image (bounding box coordinates of lines, words, and ligatures) for benchmarking and evaluating OCR techniques such as segmentation and handwritten text recognition. All structural markup is done with pixel-level precision.


Table IX. Quantitative Analysis of Four Techniques Used for Testing the Proposed Dataset for Text-Line Segmentation Benchmarking

Techniques               Number of Test Images   Average Accuracy
Godara et al. [2014]     400                     91.2%
Khanduja et al. [2013]   400                     90.6%
Panwar and Nain [2014]   400                     93.08%
Alaei et al. [2011b]     400                     94.12%

The corpus structure can be used for different classification criteria as required in multidisciplinary research, such as searching, filtering, statistical analysis of data, and the study of data distribution in terms of sex, name, education, region, domain, and other parameters.

6. EXPERIMENTATIONS AND RESULTS

To support our claim that the proposed dataset is applicable as an Urdu linguistic resource, we ran several handwritten text segmentation algorithms on it and applied Zipf’s test [Piantadosi 2014] to observe the behavior of the word frequency distribution.

6.1. Text-Line Segmentation Results

To help other researchers evaluate and compare their text-line segmentation/recognition results on the proposed dataset, we tested four text-line segmentation algorithms on the CALAM dataset. Each technique was tested on 400 images taken from the proposed CALAM Urdu handwritten dataset [Choudhary et al. 2015]. We selected 200 and 100 images from the News and Politics categories, respectively; the remaining 100 images consist of the first 25 images from each of the other four categories.

(1) We tested the technique proposed by Alaei et al. [2011b] for segmenting handwritten text documents into individual text lines. The average accuracy of this algorithm, as defined by Equation (1), is 94.12%:

\[ \mathrm{Accuracy} = \frac{\mathrm{T.P.} + \mathrm{T.N.}}{\mathrm{T.P.} + \mathrm{T.N.} + \mathrm{F.P.} + \mathrm{F.N.}}, \qquad (1) \]

where T.P., T.N., F.P., and F.N. denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

(2) The second technique tested was proposed by Godara et al. [2014] for handwritten Urdu script segmentation using the smearing method for line segmentation. The average accuracy achieved by the algorithm is 91.2%.

(3) The third technique used in our experiments was proposed by Khanduja et al. [2013]. The average accuracy achieved is 90.6%, where 400 images were used for testing.

(4) Panwar et al. [2013] and Panwar and Nain [2014] proposed a line segmentation technique based on the Connectivity Strength Parameter (CSF). The average accuracy achieved is 93.08%. Table IX summarizes the complete test results.
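The accuracy measure of Equation (1) used for all four techniques can be computed as in the following Python sketch. The function name `accuracy` and the counts in the usage line are illustrative assumptions; they are not results from the experiments reported here.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy as in Equation (1): correct decisions over all decisions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one count must be non-zero")
    return (tp + tn) / total


if __name__ == "__main__":
    # Made-up counts for one image, used only to exercise the formula.
    print(f"{100 * accuracy(tp=90, tn=4, fp=3, fn=3):.2f}%")
```

An average accuracy such as those in Table IX would then be the mean of this value over the 400 test images, assuming the measure is evaluated per image.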

6.2. Word Frequency Distribution Using Zipf’s Rule

Fig. 9. Zipf curve of the words’ distribution of the proposed corpus CALAM.

We also applied Zipf’s test [Piantadosi 2014] to the dataset to check that it conforms to this universal principle of natural language. In 1949, Zipf proposed a rule for analyzing the distribution and behavior of words in a corpus, which is significant in statistical linguistic analysis. According to Piantadosi [2014], every natural language follows Zipf’s rule for the frequency distribution of words. Zipf’s rule states that if f is the frequency of a word in a corpus and r is the rank of the word, then the frequency of a word in a large corpus of natural language is inversely proportional to its rank, as shown in Equation (2):

\[ f \propto \frac{1}{r}. \qquad (2) \]

In other words, if the words of the corpus are arranged in descending order of frequency ($w_1, w_2, \ldots, w_n$), then the second word $w_2$ occurs roughly half as often as the first word $w_1$, the third word $w_3$ occurs roughly one-third as often as the first word, and so on.

From this, it can be concluded that when the rank r of a word (rank one being the most frequent) is multiplied by its frequency f (how many times the word occurs in the text), the product C remains approximately the same for each word, as shown in Equation (3):

\[ w_{f_i} = \frac{C}{w_{r_i}}. \qquad (3) \]

From Equation (3), we can derive a generalization of this rule stating that the frequency of words decreases very rapidly with rank. This can also be written as Equation (4):

\[ w_{f_i} = C\,(w_{r_i})^{k}. \qquad (4) \]

By taking the log of Equation (4), we get Equation (5):

\[ \log(w_{f_i}) = \log C + k \log(w_{r_i}), \qquad (5) \]

where k = −1 and C is a constant. A graph of log(f) against log(r) for the frequencies and ranks of a corpus must therefore be linear with a slope of −1. Figure 9 shows the Zipf curve for the words of the proposed Urdu corpus. The resulting log(f) versus log(r) curve confirms that the proposed corpus follows Zipf’s rule for the frequency distribution of words.
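The linearity check implied by Equation (5) can be sketched in Python as follows: rank the words by frequency, take logarithms, and fit a straight line by ordinary least squares. The function names and the synthetic data are assumptions introduced for this illustration; the sketch does not reproduce the authors' analysis or the curve in Figure 9.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, rank 1 being the most frequent word."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_fit(rank_freq):
    """Least-squares fit of log(f) = log C + k log(r); returns (k, log C)."""
    if len(rank_freq) < 2:
        raise ValueError("need at least two ranks to fit a line")
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Synthetic frequencies that obey f = C / r exactly, with C = 1200.
    ideal = [(r, 1200 / r) for r in range(1, 51)]
    k, log_c = loglog_fit(ideal)
    print(f"k = {k:.3f}, C = {math.exp(log_c):.1f}")
```

For a real corpus, `rank_frequency` would be applied to the list of transcribed words, and a fitted slope close to −1 would indicate agreement with Zipf's rule.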


7. CONCLUSION

In this article, we have presented CALAM, an Urdu handwritten text image corpus, along with its annotation structure with pixel-level precision. The uniformity of the structure provides an appropriate way to annotate handwritten text images. The balancing at the data collection stage lets researchers control the proportion of values according to different usages of the corpus. We described an XML-based handwritten text image corpus and an annotation methodology that has the potential to provide researchers, on a single platform, with the facilities needed for document image processing research, such as writer identification; signature verification; segmentation/recognition of text pages at line, word, and ligature levels; and separation of handwritten and printed texts. The database would be helpful in the design of an automatic intelligent system for direct processing of massive handwritten forms collected for census data.

The corpus can also be widely used for language transcription and transliteration applications, acting as an information exchange center. To date, only two datasets are available for handwritten Urdu script. The aim of this work is to build a resource that provides ground-truth annotation for handwritten text images. We propose releasing the dataset as an open resource on cloud storage, free for academic use, with usage permission granted on request.

REFERENCES

S. Al-Ma’adeed, D. Elliman, and C. A. Higgins. 2002. A data base for Arabic handwritten text recognition research. In Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition. 485–489.

Y. Al-Ohali, M. Cheriet, and C. Suen. 2003. Databases for recognition of handwritten Arabic cheques. Pattern Recognition 36, 1 (2003), 111–121.

A. Alaei, P. Nagabhushan, and U. Pal. 2011a. A benchmark Kannada handwritten document dataset and its segmentation. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’11). 141–145.

A. Alaei, U. Pal, and P. Nagabhushan. 2011b. A new scheme for unconstrained handwritten text-line segmentation. Pattern Recognition 44, 4 (April 2011), 917–928.

A. Alaei, U. Pal, and P. Nagabhushan. 2012. Dataset and ground truth for handwritten text in four different scripts. International Journal of Pattern Recognition and Artificial Intelligence 26, 04 (2012), 1–25.

H. Alamri, J. Sadri, C. Y. Suen, and N. Nobile. 2008. A novel comprehensive database for Arabic offline handwriting recognition. In Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR’08). 664–669.

S. Belhe, S. Chakravarthy, and A. G. Ramakrishnan. 2009. XML standard for indic online handwritten database. In Proceedings of the International Workshop on Multilingual OCR (MOCR’09). ACM, New York, NY, USA, Article 19, 4 pages.

A. S. Bhaskarabhatla, S. Madhvanath, M. N. S. S. K. Pavan Kumar, A. Balasubramanian, and C. V. Jawahar. 2004. Representation and annotation of online handwritten data. In Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition (IWFHR-9’04). 136–141.

S. Bhaskarabhatla and S. Madhvanath. 2004. Experiences in collection of handwriting data for online handwriting recognition in indic scripts. In Proceedings of the 4th International Conference Linguistic Resources and Evaluation (LREC’04). 2223–2226.

U. Bhattacharya and B. Chaudhuri. 2009. Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 3 (March 2009), 444–457.

Census. 2001. (2001). http://www.censusindia.gov.in/2001-common/censusdataonline.html.

P. Choudhary, N. Nain, and M. Ahmed. 2015. A unified approach for development of Urdu corpus for OCR and demographic purpose. In Proceedings of the 7th International Conference on Machine Vision (ICMV’15), Vol. 9445. 1–5.

L. Deng. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29, 6 (Nov. 2012), 141–142.


R. I. M. Elanwar, M. A. Rashwan, and S. A. Mashali. 2010. OHASD: The first online Arabic sentence database handwritten on tablet PC. In Proceedings of the World Academy of Science, Engineering and Technology (WASET’10), International Conference on Signal and Image Processing (ICSIP’10) 4, 12 (2010), 585–590.

D. Elliman and N. Sherkat. 2001. A truthing tool for generating a database of cursive words. In Proceedings of the 6th International Conference on Document Analysis and Recognition, 2001. 1255–1262.

B. Gatos, N. Stamatopoulos, and G. Louloudis. 2009. ICDAR 2009 handwriting segmentation contest. In Proceedings of 10th International Conference on Document Analysis and Recognition (ICDAR’09). 1393–1397.

B. Gatos, N. Stamatopoulos, and G. Louloudis. 2010. ICFHR 2010 handwriting segmentation contest. In Proceedings of 2010 International Conference on Frontiers in Handwriting Recognition (ICFHR’10). 737–742.

S. Godara, N. Nain, and M. Ahamed. 2014. Handwritten Urdu script segmentation using hybrid approach. In Proceedings of the DAR 2014 Satellite Workshop of ICVGIP 2014 on Document Analysis and Recognition, 2014.

E. Grosicki, M. Carré, E. Augustin, and F. Prêteux. 2006. La campagne d’évaluation RIMES pour la reconnaissance de courriers manuscrits. In Actes 9ème Colloque International Francophone sur l’Écrit et le Document (CIFED’06). Fribourg, Suisse, 61–66.

I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. 1994. UNIPEN project of online data exchange and recognizer benchmarks. In Proceedings of the 12th IAPR International Conference on Computer Vision and Image Processing, Vol. 2. 29–33.

P. J. Haghighi, N. Nobile, C. L. He, and C. Y. Suen. 2009. A new large-scale multi-purpose handwritten Farsi database. In Proceedings of the 6th International Conference on Image Analysis and Recognition (ICIAR’09). Springer-Verlag, Berlin, 278–286.

J. J. Hull. 1994. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 5 (May 1994), 550–554.

N. Ide. 1998b. Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In Proceedings of the 1st International Language Resources and Evaluation Conference. 463–470.

E. Indermühle, M. Liwicki, and H. Bunke. 2010. IAMonDo-database: An online handwritten document database with non-uniform contents. In Proceedings of the International Workshop on Document Analysis Systems. 97–104.

C. V. Jawahar, A. Balasubramanian, M. Meshesha, and A. M. Namboodiri. 2009. Retrieval of online handwriting by synthesis and matching. Pattern Recognition 42, 7 (2009), 1445–1457.

T. Kasar, D. Kumar, M. N. Anil Prasad, D. Girish, and A. G. Ramakrishnan. 2011. MAST: Multi-script annotation toolkit for scenic text. In Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data. ACM, New York, NY, Article 14, 8 pages.

D. Khanduja, N. Nain, and S. Panwar. 2013. A hybrid feature extraction algorithm for Devanagari script. ACM Transactions on Asian Low-Resource Language Information Processing 15, 1, 105–111.

N. Kharma, M. Ahmed, and R. Ward. 1999. A new comprehensive database of handwritten Arabic words, numbers, and signatures used for OCR testing. In Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, Vol. 2. 766–768.

H. Khosravi and E. Kabir. 2007. Introducing a very large dataset of handwritten Farsi digits and a study on their varieties. Pattern Recognition Letters 28, 10 (2007), 1133–1141.

D. H. Kim, Y. S. Hwang, S. T. Park, E. J. Kim, S. H. Paek, and S. Y. Bang. 1993. Handwritten Korean character image database PE92. In Proceedings of the 2nd International Conference on Document Analysis and Recognition. 470–473.

A. Kumar, A. Balasubramanian, A. Namboodiri, and C. V. Jawahar. 2006. Model-based annotation of online handwritten datasets. In Proceedings of the International Workshop on Frontiers in Handwriting Recognition (IWFHR’06). Université de Rennes, La Baule, Centre de Congrès Atlantia, France.

S. Kumar. 2010. An analysis of irregularities in Devanagari script writing: A machine recognition perspective. International Journal of Computer Science Engineering 2, 2 (2010), 274–279.

Y. Li, Y. Zheng, D. Doermann, S. Jaeger, and Yi Li. 2008. Script-independent text line segmentation in freestyle handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 8 (Aug. 2008), 1313–1329.

L. Likforman-Sulem, A. Zahour, and B. Taconet. 2007. Text line segmentation of historical documents: A survey. International Journal of Document Analysis and Recognition (IJDAR) 9, 2–4 (2007), 123–138.

C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang. 2011. CASIA online and offline Chinese handwriting databases. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’11), 2011. 37–41.

G. Louloudis, B. Gatos, I. Pratikakis, and C. Halatsis. 2009. Text line and word segmentation of handwritten documents. Pattern Recognition 42, 12 (2009), 3169–3183. New Frontiers in Handwriting Recognition.

S. A. Mahmoud, I. Ahmad, M. Alshayeb, and W. G. Al-Khatib. 2011. A database for offline Arabic handwritten text recognition. In Image Analysis and Recognition, Mohamed Kamel and Aurélio Campilho (Eds.). Lecture Notes in Computer Science, Vol. 6754. Springer, Berlin, 397–406.

V. Märgner and H. El Abed. 2009. ICDAR 2009 Arabic handwriting recognition competition. In Proceedings of 10th International Conference on Document Analysis and Recognition, 2009. 1383–1387.

U.-V. Marti and H. Bunke. 2002. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition 5, 1 (2002), 39–46.

I. B. Messaoud and H. E. Abed. 2010. Automatic annotation for handwritten historical documents using Markov models. In Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR’10). 381–386.

S. Mozaffari, H. El Abed, V. Märgner, K. Faez, and A. Amirshahi. 2008. IfN/Farsi-database: A database of Farsi handwritten city names. In Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR’08). 397–402.

M. Nakagawa, T. Higashiyama, Y. Yamanaka, S. Sawada, L. Higashigawa, and K. Akiyama. 1997. Online handwritten character pattern database sampled in a sequence of sentences without any writing instructions. In Proceedings of the 4th International Conference on Document Analysis and Recognition, 1997, Vol. 1. 376–381.

M. Nakagawa and K. Matsumoto. 2004. Collection of online handwritten Japanese character pattern databases and their analyses. Document Analysis and Recognition 7, 1 (2004), 69–81.

B. Nethravathi, C. P. Archana, K. Shashikiran, A. G. Ramakrishnan, and V. Kumar. 2010. Creation of a huge annotated database for Tamil and Kannada OHR. In Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR’10). 415–420.

S. Panwar and N. Nain. 2014. A novel segmentation methodology for cursive handwritten documents. IETE Journal of Research 60, 6 (2014), 432–439.

S. Panwar, N. Nain, S. Saxena, and P. C. Gupta. 2013. Language adaptive methodology for handwritten text line segmentation. In Computer Analysis of Images and Patterns, Richard Wilson, Edwin Hancock, Adrian Bors, and William Smith (Eds.). Lecture Notes in Computer Science, Vol. 8047. Springer, Berlin, 344–351.

M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, and H. Amiri. 2002. IFN/ENIT - database of handwritten Arabic words. In Francophone International Conference on Writing and Document (CIFED’02). Hammamet, Tunisia, 129–136.

S. T. Piantadosi. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21, 5 (2014), 1112–1130.

A. Raza, I. Siddiqi, A. Abidi, and F. Arif. 2012. An unconstrained benchmark Urdu handwritten sentence database with automatic line segmentation. In Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition (ICFHR’12). IEEE Computer Society, Washington, DC, 491–496.

B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision 77, 1–3 (2008), 157–173.

R. Safabaksh, A. R. Ghanbarian, and G. Ghiasi. 2013. HaFT: A handwritten Farsi text database. In Proceedings of the 8th Iranian Conference on Machine Vision and Image Processing (MVIP’13). 89–94.

M. W. Sagheer, C.-L. He, N. Nobile, and C. Suen. 2009. A new large Urdu database for offline handwriting recognition. In Proceedings of International Conference on Image Analysis and Processing (ICIAP’09), Pasquale Foggia, Carlo Sansone, and Mario Vento (Eds.). Lecture Notes in Computer Science, Vol. 5716. Springer, Berlin, 538–546.

T. Saito, H. Yamada, and K. Yamamoto. 1985. On the data base ETL9 of handprinted characters in JIS Chinese characters and its analysis (in Japanese). Transactions of the IECE Japan J68-D(4) (1985), 757–764.

R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri, and D. K. Basu. 2012. CMATERdb1: A database of unconstrained handwritten Bangla and Bangla-English mixed script document image. International Journal of Document Analysis and Recognition 15, 1 (March 2012), 71–83.

E. Saund, J. Lin, and P. Sarkar. 2009. PixLabeler: User interface for pixel-level labeling of elements in document images. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). IEEE Computer Society, Washington, DC, 646–650.

F. Slimane, R. Ingold, S. Kanoun, A. M. Alimi, and J. Hennebert. 2009. A new Arabic printed text image database and evaluation protocols. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). 946–950.

N. Stamatopoulos, B. Gatos, G. Louloudis, U. Pal, and A. Alaei. 2013. ICDAR 2013 handwriting segmentation contest. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR’13). 1402–1406.

M. Stührenberg. 2012. The TEI and current standards for structuring linguistic data. Journal of the Text Encoding Initiative 3 (Nov. 2012), 1–14.

T. Su, T. Zhang, and D. Guan. 2007. Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text. International Journal of Document Analysis and Recognition (IJDAR) 10, 1 (2007), 27–38.

S. Sutat and L. Methasate. 2004. Thai handwritten character corpus. IEEE International Symposium on Communications and Information Technology 1 (Oct 2004), 486–491.

C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter. 1999. The IRESTE on/off (IRONOFF) dual handwriting database. In Proceedings of the 5th International Conference on Document Analysis and Recognition, 1999 (ICDAR’99). 455–458.

R. Wilkinson. 1992. The first census optical character recognition systems. In The U.S. Bureau of Census and the National Institute of Standards and Technology (Tech. Rep. NISTIR 4912, National Institute of Standards and Technology). Gaithersburg, MD, 1–372.

F. Yin and C.-L. Liu. 2009. Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognition 42, 12 (2009), 3146–3157. New Frontiers in Handwriting Recognition.

F. Yin, Q.-F. Wang, and C.-L. Liu. 2009. A tool for ground-truthing text lines and characters in offline handwritten Chinese documents. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). 951–955.

M. Ziaratban, K. Faez, and F. Bagheri. 2009. FHT: An unconstraint Farsi handwritten text database. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). IEEE Computer Society, Washington, DC, 281–285.

Received January 2015; revised December 2015; accepted December 2015
