Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A...

20
adhan¯ a Vol. 27, Part 1, February 2002, pp. 3–22. © Printed in India Document image analysis: A primer RANGACHAR KASTURI 1 , LAWRENCE O’GORMAN 2 and VENU GOVINDARAJU 3 1 Department of Computer Science & Engineering, The Pennsylvania State University, University Park, PA 16802, USA 2 Avaya Labs, Room 1B04, 233 Mt. Airy Road, Basking Ridge, NJ 07920, USA 3 CEDAR, State University of New York at Buffalo, Amherst, NY 14228, USA e-mail: [email protected]; [email protected]; [email protected] Abstract. Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In this paper we briefly describe various components of a document analysis system. Many of these basic building blocks are found in most document analysis systems, irrespective of the particular domain or language to which they are applied. We hope that this paper will help the reader by providing the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue. Keywords. OCR; feature analysis; document processing; graphics recognition; character recognition; layout analysis. 1. Introduction The objective of document image analysis is to recognize the text and graphics com- ponents in images of documents, and to extract the intended information as a human would. Two categories of document image analysis can be defined (see figure 1). Textual processing deals with the text components of a document image. Some tasks here are: determining the skew (any tilt at which the document may have been scanned into the computer), finding columns, paragraphs, text lines, and words, and finally recognizing the text (and possibly its attributes such as size, font etc.) by optical character recognition (OCR). Graphics processing deals with the non-textual line and symbol components that make up line diagrams, delimiting straight lines between text sections, company logos etc. Pictures are a third major component of documents, but except for recognizing their location on a page, further analysis of these is usually the task of other image processing and machine vision techniques. After application of these text and graphics analysis techniques, the sev- eral megabytes of initial data are culled to yield a much more concise semantic description of the document. 3

Transcript of Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A...

Page 1: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Sadhana Vol. 27,Part 1, February2002, pp.3–22.© Printedin India

Document image analysis: A primer

RANGACHAR KASTURI1, LAWRENCEO’GORMAN2 andVENUGOVINDARAJU3

1 Departmentof ComputerScience& Engineering,ThePennsylvaniaStateUniversity, UniversityPark,PA 16802,USA2AvayaLabs,Room1B04,233Mt. Airy Road,BaskingRidge,NJ07920,USA3CEDAR, StateUniversityof New York atBuffalo,Amherst,NY 14228,USAe-mail:[email protected];[email protected];[email protected]

Abstract. Documentimageanalysisrefersto algorithmsandtechniquesthatareappliedto imagesof documentsto obtaina computer-readabledescriptionfrompixeldata.A well-knowndocumentimageanalysisproductis theOpticalCharacterRecognition(OCR) software that recognizescharactersin a scanneddocument.OCRmakesit possiblefor theuserto edit or searchthedocument’s contents.Inthispaperwebriefly describevariouscomponentsof adocumentanalysissystem.Many of thesebasicbuilding blocksarefoundin mostdocumentanalysissystems,irrespective of the particulardomainor languageto which they areapplied.Wehopethat this paperwill help the readerby providing the backgroundnecessaryto understandthe detaileddescriptionsof specifictechniquespresentedin otherpapersin this issue.

Keywords. OCR;featureanalysis;documentprocessing;graphicsrecognition;characterrecognition;layoutanalysis.

1. Introduction

The objective of documentimage analysisis to recognizethe text and graphicscom-ponentsin imagesof documents,and to extract the intendedinformation as a humanwould. Two categories of document image analysis can be defined (see figure 1).Textual processingdealswith the text componentsof a documentimage.Sometaskshereare:determiningtheskew (any tilt at which thedocumentmayhave beenscannedinto thecomputer),finding columns,paragraphs,text lines,andwords,andfinally recognizingthetext (andpossiblyits attributessuchassize,font etc.)by opticalcharacterrecognition(OCR).Graphicsprocessingdealswith the non-textual line andsymbolcomponentsthat make upline diagrams,delimiting straightlines betweentext sections,company logosetc.Picturesarea third major componentof documents,but except for recognizingtheir locationon apage,further analysisof theseis usually the taskof other imageprocessingandmachinevision techniques.After applicationof thesetext andgraphicsanalysistechniques,thesev-eralmegabytesof initial dataareculled to yield a muchmoreconcisesemanticdescriptionof thedocument.

3

Page 2: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

4 Rangachar Kasturi et al

Figure 1. A hierarchy of documentprocessingsubareaslisting the typesof documentcomponentsdealtwithin eachsubarea.(Reproducedwith permissionfrom O’Gorman& Kasturi1997.)

Considerthreespecificexamplesof theneedfor documentanalysispresentedhere.

(1) Typical documentsin today’s office arecomputer-generated,but even so, inevitably bydifferentcomputersandsoftwaresuchthateventheirelectronicformatsareincompatible.Someincludeboth formattedtext and tablesaswell ashandwrittenentries.Therearedifferentsizes,from a businesscardto a largeengineeringdrawing. Documentanalysissystemsrecognizetypesof documents,enabletheextractionof their functionalparts,andtranslatefrom onecomputergeneratedformatto another.

(2) Automatedmail-sortingmachinesto performsortingandaddressrecognitionhave beenusedfor severaldecades,but thereis theneedto processmoremail, morequickly, andmoreaccurately.

(3) In atraditionallibrary, lossof material,misfiling, limited numbersof eachcopy, andevendegradationof materialsarecommonproblems,andmaybeimprovedby documentanal-ysis techniques.All theseexamplesserve asapplicationsripe for thepotentialsolutionsof documentimageanalysis.

Documentanalysissystemswill becomeincreasinglymoreevident in theform of every-daydocumentsystems.For instance,OCRsystemswill bemorewidely usedto store,search,andexcerptfrom paper-baseddocuments.Page-layoutanalysistechniqueswill recognizeaparticularform, or pageformat andallow its duplication.Diagramswill be enteredfrompicturesor by hand,and logically edited.Pen-basedcomputerswill translatehandwrittenentriesinto electronicdocuments.Archivesof paperdocumentsin librariesandengineer-ing companieswill beelectronicallyconvertedfor moreefficient storageandinstantdeliv-ery to a homeor office computer. Thoughit will be increasinglythe casethat documentsareproducedandresideon a computer, the fact that therearevery many differentsystemsand protocols,and also the fact that paperis a very comfortablemediumfor us to dealwith, ensuresthat paperdocumentswill be with us to somedegreefor many decadestocome.Thedifferencewill bethatthey will finally beintegratedinto ourcomputerizedworld.

Page 3: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 5

Figure 2. A typical sequenceof stepsfor documentanalysis,alongwith examplesof intermediateandfinal resultsandthedatasize.(Reproducedwith permissionfrom O’Gorman& Kasturi1997.)

Figure 2 illustratesa commonsequenceof stepsin documentimageanalysis.After datacapture,the imageundergoespixel-level processingandfeatureanalysisandthentext andgraphicsaretreatedseparatelyfor the recognitionof each.We describethesestepsbrieflyin the following sections;the readeris referredto the book,Documentimage analysis, fordetails(O’Gorman& Kasturi1997).We concludethis paperby consideringthechallengesin analysingmultilingualdocumentswhich is particularlyimportantin thecontext of Indianlanguagedocumentanalysis.

2. Data capture

Data in a paper documentare usually capturedby optical scanningand stored in afile of picture elements,called pixels, that are sampledin a grid pattern throughoutthe document.Thesepixels may have values:OFF (0) or ON (1) for binary images,0–255 for gray-scaleimages,and 3 channelsof 0–255 colour values for colour images.At a typical sampling resolution of 120 pixels per centimetre,a 20 x 30 cm pagewould yield an image of 2400x3600 pixels. When the document is on a differentmedium such as microfilm, palm leaves, or fabric, photographicmethods are often

Page 4: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

6 Rangachar Kasturi et al

used to capture images. In any case, it is important to understandthat the imageof the documentcontains only raw data that must be further analysedto glean theinformation.

3. Pixel-level processing

The next step in documentanalysisis to perform processingon the capturedimage toprepareit for further analysis.Suchprocessingincludes:Thresholdingto reducea gray-scaleor colour imageto a binary image,reductionof noiseto reduceextraneousdata,seg-mentationto separatevariouscomponentsin the image,and,finally, thinning or boundarydetectionto enableeasiersubsequentdetectionof pertinentfeaturesandobjectsof interest.After suchprocessing,dataareoftenrepresentedin compactform suchaschain-codesandvectors.This pixel-level processing(alsocalledpreprocessingandlow-level processinginotherliterature)is thesubjectof this section.

3.1 Binarization

For gray-scaleimageswith information that is inherentlybinary suchas text or graphics,binarizationisusuallyperformedfirst.Theobjectiveof binarizationis toautomaticallychoosea thresholdthat separatesthe foregroundandbackgroundinformation.Selectionof a goodthresholdis oftenatrial anderrorprocess(seefigure3).Thisbecomesparticularlydifficult incaseswherethecontrastbetweentext pixelsandbackgroundis low (for example,text printedon a gray background),when text strokes are very thin resultingin backgroundbleedinginto text pixels during digitization, or when the pageis not uniformly illuminatedduringdatacapture.Many methodshave beendevelopedfor addressingtheseproblemsincludingthosethat model the backgroundandforegroundpixels assamplesdrawn from statisticaldistributionsandmethodsbasedon spatiallyvarying(adaptive) thresholds.Whetherglobalor adaptive thresholdingmethodsareusedfor binarization,onecanseldomexpectperfectresults.Dependingonthequalityof theoriginal,theremaybegapsin lines,raggededgesonregionboundaries,andextraneouspixel regionsof ON andOFFvalues.Thisfact,thatprocessedresultsarenotperfect,is generallytruewith otherdocumentprocessingmethods,andindeedimageprocessingin general.Therecommendedprocedureis to processaswell aspossibleateachstepof processing,but to deferdecisionsthatdonothaveto bemadeuntil later, to avoidmakingirreparableerrors.In latersteps,thereis moreinformationasaresultof processingupto thatpoint,andthisprovidesgreatercontext andhigherlevel descriptionsto aid in makingcorrectdecisions,andultimatelyrecognition.Somegoodreferencesfor thresholdingmethodsareReddietal (1984),Tsai(1985),Sahooetal (1988),O’Gorman(1994),andTrier & Taxt(1995).

3.2 Noisereduction

Document image noise is due to many sourcesincluding degradation due to aging,photocopying, or during datacapture.Imageandsignalprocessingmethodsareappliedtoreducenoise.After binarization,documentimagesareusuallyfilteredto reducenoise.Salt-and-peppernoise(alsocalledimpulseandspecklenoise,or just dirt) is a prevalentartifactin poorerquality documentimages(suchaspoorly thresholdedfaxesor poorly photocopiedpages).Thisappearsasisolatedpixelsor pixel regionsof ON noisein OFFbackgroundsor OFF

noise(holes)within ON regions,andasroughedgeson characterandgraphicscomponents.

Page 5: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 7

Figure 3. Imagebinarization.(a) Histogramof original gray-scaleimage.Horizontalaxis showsmarkingsfor thresholdvaluesof imagesbelow. The lower peakis for the white backgroundpixels,andtheupperpeakis for theblack foregroundpixels.Imagebinarizedwith: (b) too low a thresholdvalue,(c) a goodthresholdvalue,and(d) too high a thresholdvalue.(Reproducedwith permissionfrom O’Gorman& Kasturi1997.)

Theprocessof reducingthis is called“filling”. Themostimportantreasonto reducenoiseis that extraneousfeaturesotherwisecausesubsequenterrorsin recognition.The objectivein the designof a filter to reducenoiseis that it remove asmuchof the noiseaspossiblewhile retainingall of thesignal.Morphological(Serra1982;Haralicket al 1987;Haralick& Shapino1992)methodsarefrequentlyusedfor noisereduction.Thebasicmorphologicaloperationsareerosionanddilation. Erosionis the reductionin sizeof ON-regions.This ismostsimply accomplishedby peelingoff a single-pixel layer from the outerboundaryofall ON regions on eacherosionstep.Dilation is the oppositeprocess,wheresingle-pixel,ON-valuedlayersareaddedto boundariesto increasetheir size.Theseoperationsareusuallycombinedandappliediteratively to erodeanddilate many layers.Oneof thesecombinedoperationsiscalledopening,whereoneormoreiterationsof erosionarefollowedby thesamenumberof iterationsof dilation. Theresultof openingis thatboundariescanbesmoothed,

Page 6: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

8 Rangachar Kasturi et al

narrow isthmusesbroken, andsmall noiseregionseliminated.The morphologicaldual ofopeningis closing.This combinesoneor moreiterationsof dilation followed by the samenumberof iterationsof erosion.The resultof closingis that boundariescanbe smoothed,narrow gapsjoined, andsmall noiseholesfilled. For documents,morespecificfilters canbedesignedto take advantageof theknown characteristicsof thetext andgraphicscompo-nents.In particular, we desireto maintainsharpnessin thesedocumentcomponents,not toroundcornersandshortenlengths,assomenoisereductionfilters do. For thesimplestcaseof single-pixel islands,holes,andprotrusions,thesecanbe foundby passinga 3 x 3 sizedwindow over the imagethatmatchesthesepatterns(Shih& Kasturi 1988),thenfilled. Fornoiselargerthanone-pixel, thekFill filter canbeused(O’Gorman1992).

3.3 Segmentation

Segmentationoccurson two levels.On thefirst level, if thedocumentcontainsbothtext andgraphics,theseareseparatedfor subsequentprocessingby differentmethods(Wong 1982;Fletcher1988;Jain1992).Onthesecondlevel,segmentationis performedontext by locatingcolumns,paragraphs,words,andcharacters;andongraphics,segmentationusuallyincludesseparatingsymbolandline components.For instance,in a pagecontainingtext andsomeillustrationssimilar to thepagesof this journal,text andgraphicsarefirst separated.Thenthetext is separatedinto its componentsdown to individualcharacters.Thegraphicsis separatedinto its componentssuchasrectangles,circles,connectinglines,symbolsetc.After thisstepanimageis typically brokendown into its basiccomponentssuchasanindividual characteror agraphicalelement.

3.4 Thinningandregiondetection

Thinningis animageprocessingoperationin whichbinaryvaluedimageregionsarereducedto linesthatapproximatethecentrelines,orskeletons,of theregions.Thepurposeof thinningis to reducetheimagecomponentsto their essentialinformationsothatfurtheranalysisandrecognitionarefacilitated.For instance,alinedrawingcanbehandwrittenwith differentpensgiving differentstroke thicknesses,but the informationpresentedis the same.In figure 4,someimagesare shown whosecontentscan be analysedwell due to thinning, and theirthinning resultsarealsoshown here.Note shouldbe madethat thinning is alsoreferredtoasskeletonizingandcore-linedetectionin theliterature.We will usetheterm“thinning” todescribetheprocedure,andthinnedline, or skeleton,to describetheresults.A relatedtermis the “medial axis”. This is thesetof pointsof a region in which eachpoint is equidistantto its two closestpointson theboundary. Themedialaxisis oftendescribedastheidealthatthinningapproaches.However, sincethemedialaxisis definedonly for continuousspace,itcanonly beapproximatedby practicalthinningtechniquesthatoperateon a sampledimagein discretespace.Several thinning algorithmshave beendescribedby Arcelli & SannitidiBaja(1985,1993),andSannitidi Baja(1994)andanalgorithmthatthinsby severalpixelsineachpassis describedby O’Gorman(1990).Reviews of thinningmethodshave beengivenby Lametal (1992)andLam& Suen(1995).

Note thata circle or a squarethat is completelyfilled with blackpixels resultsideally ina singlepixel at thecentreof theshape,irrespective of thesizeof theoriginal.Clearly, it ismoreuseful to detectthe boundaryof suchobjects.In general,for large blob-like objects,region-boundarydetectionis preferredandfor objectsmadeup of long connectedstrokes,thinning is the methodof choice.Thinning is commonlyusedin the preprocessingstageof suchdocumentanalysisapplicationsasdiagramunderstandingandmapprocessing.For

Page 7: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 9

Figure 4. Original imageson left andthinnedimageresultson right. (a) Theletter“m”. (b) A linediagram.(c) A fingerprintimage.(Reproducedwith permissionfrom O’Gorman& Kasturi1997.)

recognitionof large graphicalobjectswith filled regions which are often found in logos,region boundarydetectionis useful.But for small regionssuchasthosewhich correspondto individual characters,neitherthinningnorboundarydetectionis performedandtheentirepixel arrayrepresentingtheregion is forwardedto thesubsequentstageof analysis.

3.5 Chaincodingandvectorization

Whenobjectsaredescribedby their skeletonsor contours,they canbe representedmoreefficiently than simply by ON and OFF valued pixels in a raster image. One commonway to do this is by chain coding (Freeman1974), wherethe ON pixels are representedas sequencesof connectedneighboursalong lines and curves. Instead of storing theabsolutelocation of each ON pixel, the direction from its previously coded neighbouris stored.A neighbouris any of the adjacentpixels in the 3 x 3 pixel neighbourhoodaround that centre pixel (see figure 5). There are two advantagesof coding by direc-tion versus absolutecoordinatelocation. One is in storageefficiency. For commonly

Figure 5. Fora3x 3pixel regionwith centrepixeldenotedasX, figureshowscodesfor chaindirectionsfrom centrepixel to eachof eightneighbours:0 (east),1 (north-east),2 (north),3 (northwest),etc.(Reproducedwith permissionfrom O’Gorman&Kasturi1997.)

Page 8: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

10 Rangachar Kasturi et al

sized images larger than 256x256, the coordinatesof an ON-valued pixel are usuallyrepresentedas two 16-bit words; in contrast,for chain coding with eight possibledirec-tions from a pixel, eachON-valued pixel can be stored in a byte, or even packed intothree bits. A more important advantagein this context is that, since chain coding con-tains information on connectednesswithin its code,this can facilitate further processingsuch as smoothingof continuouscurves, and approximationof smoothedstraight linesby vectors.

After pixel-levelprocessing,theraw imagedataisconvertedtoahigherlevelof abstraction;viz., regionsrepresentingindividualcharacters,chaincodesorvectorsrepresentingcurveandstraightline segments,andboundariesrepresentinglargesolidobjects.

4. Feature-level analysis

After pixel-level processinghaspreparedthe documentimage, intermediatefeaturesarefoundfrom theimageto aid in thefinal stepof recognition.At thefeaturelevel, thinnedandchain-codeddatais analysedto detectstraightlines,curves,andsignificantpointsalongthecurves.This is a moreinformative representationthat is alsocloserto how humanswoulddescribethediagram– aslinesandcurvesratherthanasON andOFFpoints.Curvedlinesareoftenapproximatedby polygonalization.Critical pointssuchascornersandpointsof highcurvaturearedeterminedto assistin subsequentanalysisfor shaperecognition.For regionscorrespondingto individual charactersor graphicalsymbols,local featuressuchasaspectratio, compactness(ratio of areato squareof perimeter),asymmetry, black pixel density,contoursmoothness,numberof loops,numberof linecrossingsandlineendsetc.arecomputedfor input to objectrecognitionstage.

4.1 Lineandcurvefitting

A simplewayto fit astraightline to asequenceof pointsis to just to specifytheendpointsasthestraightlineendpoints.However, thismayresultin someportionsof thecurvehaving largeerrorwith respectto thefit. A popularway to achieve thelowestaverageerrorfor all pointson thecurve is to performa least-squaresfit of a line to thepointson thecurve. Especiallyfor machine-drawn documentssuchas engineeringdrawings, circular curve featuresareprevalent.Thesefeaturesaredescribedby theradiusof thecurve, thecentrelocationof thecurve, andthe two transitionlocationswherethe curve eitherendsor smoothlymakesthetransitionto oneor two straightlinesaroundit. Onesequenceof proceduresin circularcurvedetectionbeginsby finding thetransitionpointsof thecurve, that is the locationsalongtheline wherea straightline makesa transitioninto thecurve,andthenbecomesa straightlineagain.Next, a decisionis madeon whetherthefeaturebetweenthestraightlinesis actuallya corneror a curve. Finally, the centreof the curve is determined.If a higherorderfit isdesired,splinescanbeusedtoperformpiecewisepolynomialinterpolationsamongdatapoints(Pavlidis 1982;Medioni& Yasumoto1987).B-splinesarepiecewisepolynomialcurvesthatarespecifiedby a guiding.Giventwo endpointsanda curve representedby its polygonalfit(describednext), a B-splinecanbe determinedthat performsa closebut smoothfit to thepolygonalapproximation.TheB-splinehasseveralpropertiesthatmake it a popularsplinechoice.A differentapproachfor line andcurvefitting is by theHoughtransform(Illingworth& Kittler 1988).This approachis usefulwhenthe objective is to find lines or curvesthatfit groupsof individual pointswhich arenot necessarilyconnectedin the imageplane.Forexample,pixelsbelongingto differentline segmentsof adashedline aregroupedtogetherbytheHoughtransform(Lai & Kasturi1991).

Page 9: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 11

4.2 Polygonalization

Polygonalapproximationis onecommonapproachto obtainingfeaturesfrom curves.Theobjective is to approximateagivencurvewith connectedstraightlinessuchthattheresultiscloseto theoriginal,but thatthedescriptionis moresuccinct.Theusercandirectthedegreeof approximationby specifyingsomemeasureof maximumerrorfrom theoriginal. In gen-eral,whenthespecifiedmaximumerror is smaller, a greaternumberof linesis requiredfortheapproximation.Theeffectivenessof onepolygonalizationmethodcomparedto anothercanbe measuredin the numberof lines requiredto producea comparableapproximation,and also by the computationtime requiredto obtain the result.The iterative endpointfitalgorithm(Ramer1972) is a popularpolygonalizationmethodwhoseerror measureis thedistancebetweentheoriginal curve andthepolygonalapproximation.Onepotentialdraw-backof polygonalapproximationis that theresultis not usuallyunique,that is, anapprox-imation of the samecurve that begins at a differentpoint on the curve will perhapsyielddifferentresults.The lack of a uniqueresultwill presenta problemif the resultsof polyg-onalizationare to be usedfor matching.Besidespolygonalapproximationmethods,therearehigherordercurve-andspline-fittingmethodsthatachieve closerfits. Thesearecompu-tationallymoreexpensive thanmostpolygonalizationmethods,andcanbemoredifficult toapply.

4.3 Critical pointdetection

Theconceptof “critical points”or“dominantpoints”derivesfromtheobservationthathumansrecognizeshapein large part by curvaturemaximain the shapeoutline. The objective inimageanalysisis to locatethesecritical pointsandrepresenttheshapemoresuccinctlybypiecewise linear segmentsbetweencritical points,or to performshaperecognitionon thebasisof thecriticalpoints.Criticalpointdetectionisespeciallyimportantfor graphicsanalysisof man-madedrawings. Sincecornersandcurveshave intendedlocations,it is importantthat any analysispreciselylocatesthese.It is the objective of critical point detectionto dothis – asopposedto polygonalapproximation,wherethe objective is only a visually closeapproximation.Oneapproachfor critical pointdetectionbeginswith curvatureestimation.Apopularfamily of methodsfor curvatureestimationis calledthek-curvaturesapproach(alsothedifferenceof slopes,or DOS,approach)(Freeman& Davies1977;O’Gorman1988).Forthesemethods,curvatureis measuredastheangulardifferencebetweentheslopesof two linesegmentsfit to thedataaroundeachcurvepoint.Curvatureis measuredfor all pointsalongaline, andplottedon a curvatureplot. For straightportionsof thecurve, thecurvatureis low.For corners,thereis a peakof high curvaturethat is proportionalto the cornerangle.Forcurves,thereis acurvatureplateauwhoseheightis proportionalto thesharpnessof thecurve(that is, thecurvatureis inverselyproportionalto theradiusof curvature),andthelengthoftheplateauis proportionalto thelengthof thecurve.To locatethesefeatures,thecurvatureplot is thresholdedto find all curvaturepointsabove a chosenthresholdvalue;thesepointscorrespondto features.Thecorneris thenparameterizedby its location,thecurveby its radiusof curvatureandbounds,andthebeginningandendtransitionpointsfromstraightlinesaroundthecurve into thecurve.Theusermustchooseonemethodparameter, thelengthof thelinesegments,to befit alongthedatato determinethecurvature.Thereis a tradeoff in thechoiceof this length.It shouldbeaslong aspossibleto smoothout effectsdueto noise,but not solong soasto alsoaverageout features.Thatis, thelengthshouldbechosenastheminimumarclengthbetweencritical points.Thepaperby (Wu & Wang1993)is a goodreferenceoncriticalpointdetection.Theapproachin thispaperbeginswith polygonalapproximation,then

Page 10: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

12 Rangachar Kasturi et al

refinesthecornersto coincidebetterwith truecorners.This paperalsousesk-curvatureinthepreprocessingstage,soit providesagooddescriptionof thismethodandits useaswell.

5. Text document analysis

Thereare two main typesof analysisthat are appliedto text in documents.One is opti-cal characterrecognition(OCR) to derive the meaningof the charactersand words fromtheir bit-mappedimages,andthe otheris page-layoutanalysisto determinethe formattingof the text, andfrom that to derive meaningassociatedwith the positionalandfunctionalblocks(titles, subtitles,bodiesof text, footnotesetc) in which the text is located.Depend-ing on thearrangementof thesetext blocks,a pageof text maybea title pageof a paper, atableof contentsof a journal,a businessform, or the faceof a mail piece.OCR andpagelayoutanalysismaybeperformedseparately, or the resultsfrom oneanalysismaybeusedto aid or correctthe other. OCR methodsareusuallydistinguishedasbeingapplicableforeithermachine-printedor handwrittencharacterrecognition.Layoutanalysistechniquesareappliedto formatted,machine-printedpages,anda type of layout analysis,forms recogni-tion, is appliedto machine-printedor handwrittentext occurringwithin delineatedblockson a printedform. In somecasesit is necessaryto correcttheskew of thedocumentwhichis typically a resultof improperpaperfeedinginto thescanner. Skew estimationandlayoutanalysisarediscussedbriefly in thissection.Generalapproachesto OCRarepresentedin thenext section.

5.1 Skew estimation

A text line is a groupof characters,symbols,andwordsthat areadjacent,relatively closeto eachother, and throughwhich a straightline canbe drawn (usuallywith horizontalorverticalorientation).Thedominantorientationof thetext linesin adocumentpagedeterminesthe skew angleof that page.A documentoriginally haszeroskew, wherehorizontallyorvertically printedtext linesareparallelto the respective edgesof thepaper, however whena pageis manuallyscannedor photocopied,non-zeroskew maybe introduced.SincesuchanalysisstepsasOCR andpagelayout analysismostoften dependon an input pagewithzeroskew, it is importantto performskew estimationandcorrectionbeforethesesteps.Also,sinceareaderexpectsapagedisplayedonacomputerscreento beuprightin normalreadingorientation,skew correctionis normally donebeforedisplayingscannedpages.A popularmethodfor skew detectionemploystheprojectionprofile.A projectionprofile is ahistogramof thenumberof ON pixel valuesaccumulatedalongparallelsamplelinestakenthroughthedocument.Theprofile maybeat any angle,but often it is takenhorizontallyalongrows orvertically alongcolumns,andthesearecalledthehorizontalandverticalprojectionprofilesrespectively. For a documentwhosetext lines spanhorizontally, the horizontalprojectionprofile haspeakswhosewidths areequalto thecharacterheightandvalleys whosewidthsareequalto thebetween-linespacing.For multi-columndocuments,theverticalprojectionprofile hasa plateaufor eachcolumn, separatedby valleys for the between-columnandmargin spacing.Themoststraightforwarduseof theprojectionprofile for skew detectionisto computeit at a numberof anglescloseto theexpectedorientation(Postl1986).For eachangle,ameasureis madeof thevariationin thebin heightsalongtheprofile,andtheonewiththemaximumvariationgivestheskew angle.At thecorrectskew angle,sincescanlinesarealignedto text lines,theprojectionprofilehasmaximumheightpeaksfor text andvalleysforbetween-linespacing.Modificationsandimprovementscanbemadeto thisgeneraltechnique

Page 11: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 13

to iteratemorequickly to thecorrectskew angleandto determineit moreaccurately. Severalothermethodsfor skew correctionhavealsobeenproposed(Baird1987;Akiyama& Hagita1990;Srihari& Govindaraju1989;Hashizume1986).

Figure 6. The original documentpageis shown with resultsfrom structuralandfunctionallayoutanalysis.Thestructurallayout resultsshow blocksthataresegmentedon thebasisof spacingin theoriginal.Thelabelingin thefunctionallayoutresultsismadewith theknowledgeof theformattingrulesof the particularjournal (IEEE Trans.Pattern Anal. Machine Intell.). (Reproducedwith permissionfrom O’Gorman& Kasturi1997.)

Page 12: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

14 Rangachar Kasturi et al

5.2 Layoutanalysis

After skew detection,theimageis usuallyrotatedto zeroskew angle,andthenlayoutanal-ysisis performed.Structurallayoutanalysis(alsocalledphysicalandgeometriclayoutanal-ysis in the literature) is performedto obtain a physical segmentationof groupsof docu-mentcomponents.Dependingon the documentformat, segmentationcanbe performedtoisolatewords,text lines,andstructuralblocks(groupsof text linessuchasseparatedpara-graphsor table of contentsentries).Functionallayout analysis(Dengelet al 1992) (alsocalledsyntacticandlogical layout analysisin the literature)usesdomain-dependentinfor-mationconsistingof layout rulesof a particularpageto performlabelingof the structuralblocksgiving someindicationof the function of the block. (This functional labelingmayalso entail splitting or merging of structuralblocks.) An example of the result of func-tional labelingfor thefirst pageof a technicalarticlewould indicatethe title, authorblock,abstract,keywords, paragraphsof the text body, etc. Seefigure 6 for an exampleof theresultsof structuralanalysisandfunctionallabelingon a documentimage.Structurallayoutanalysiscanbe performedin top-down (Pavlidis & Zhon 1991)or bottom-up(O’Gorman1993) fashion.When it is done top-down, a pageis segmentedfrom large componentsto smallersub-components,for example,the pagemay be split into oneor morecolumnblocksof text, theneachcolumnsplit into paragraphblocks,theneachparagraphsplit intotext lines,etc.For thebottom-upapproach,connectedcomponentsaremergedinto charac-ters,thenwords,thentext lines,etc.Alternately, top-down andbottom-upanalysesmaybecombined.

6. Optical character recognition

Optical CharacterRecognition(OCR) lies at the coreof the disciplineof patternrecogni-tion wherethe objective is to interpreta sequenceof characterstaken from an alphabet.Charactersof the alphabetareusually rich in shape.In fact, the characterscanbe subjectto many variationsin termsof fontsandhandwritingstyles.Despitethesevariations,thereis perhapsa basicabstractionof theshapesthat identifiesany of their instantiations.Devel-opingcomputeralgorithmsto identify thecharactersof thealphabetis theprincipal taskofOCR.Thechallengeto theresearchcommunityis thefollowing – while humanscanrecog-nizeneatlyhandwrittencharacterswith 100%accuracy, thereis noOCRthatcanmatchthatperformance.

OCRdifficulty canincreaseon severalcounts.Increasein fonts,sizeof thealphabetset,unconstrainedhandwriting,touchingof adjacentcharacters,brokenstrokesdueto poorbina-rization,noiseetc.all contributeto thedifficulty. figure7showsasampleof 0’sand6’sthatareeasilyconfusedby a handwrittendigit recognizer. Therearemany applicationsthat requiretherecognitionof unconstrainedhandwriting.A wordcanbeeitherpurelynumericasin thecaseof a Zip code,or purelyalphabeticasin thecaseof US stateabbreviationsor mixedasin thenumberof anapartment(e.g.,1A).

Thetaskbecomesparticularlychallengingwhenadjacentcharactersin acharacterstringaretouchingasshown in figure8.Unlikepurelyalphabeticstringswherejoiningof thecharactersis naturalandtakesplaceby meansof ligatures,the joining of numeralsin a numericwordandthe upper-casecharactersin an abbreviation areaccidental.Therearevariouswaysinwhich two digitscantouch.Someof thecategorieslendthemselvesto naturalsegmentation,whereasfor someaholisticapproachis theonly optionavailable.

Page 13: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 15

Figure 7. Handwrittendigitsareeasilyconfused.

6.1 Methodology

OCRalgorithmshavetwo essentialcomponents:(i) Featureextraction,and(ii) classification.Ultimately, the OCR processassignsa characterimageto a classby usinga classificationalgorithmbasedon the featuresextractedandthe relationshipsamongthe features.Sincemembersof a characterclassare equivalent or similar in as much as they sharedefiningattributes,the measurementof similarity, either explicitly or implicitly, is central to anyclassifier. Thereis oftena third componentof contextual processingto correctOCRerrorsusingcontextual knowledge.Wewill describethethreecomponentsbriefly.

6.1a Feature extraction: Featureextraction is concernedwith recovering the definingattributesobscuredby imperfectmeasurements.To representa characterclass,eithera pro-totype (an ideal form on which all memberpatternsare based,the class“essence")or asetof samplesmustbe known. The featureselectionprocessattemptsto recover the pat-tern attributescharacteristicof eachclass.Global features,suchasthe numberof holesin

Figure 8. Thereis no simplesplitting line that willsegmentthetwotouchingdigitswithoutleavingartifactson theseparateddigits.

Page 14: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

16 Rangachar Kasturi et al

Figure 9. Charactersareseparatedinto categoriesby partioningthefeaturespace.

the character, the numberof concavities in its outercontour, andthe relative protrusionofcharacterextremities,andlocal features,suchasthe relative positionsof line-endings,linecrossovers,andcornersarecommonlyused.The classificationstageidentifieseachinputcharacterimageby consideringthedetectedfeatures.

6.1b Classification: In the statisticalclassificationapproaches,characterimagepatternsarerepresentedby pointsin amultidimensionalfeaturespace. Eachcomponentof thefeaturespaceis a measurementor featurevalue,which is a randomvariablereflectingthe inherentvariability within andbetweenclasses.A classifierpartitionsthe featurespaceinto regionsassociatedwith eachclass,labelingan observed patternaccordingto the classregion intowhich it falls.

An exampleof featurespacepartitioning usedto classify a set of 50 charactersinto 5differentclasses{C,E,T,X,Y} is illustratedin figure 9. The featurespaceis basedon twofeatures,thepercentageof blackpixelsbelongingto thevertical (SV ) andhorizontalstroke(SH ). Strokesareclassifiedashorizontalandvertical.Pixelscanbepartof oneorbothstrokes,henceSV + SH canbe greaterthan100%.CharactersE andT areexpectedto have largevaluesof SV andSH ; characterC isexpectedtohavebothSV andSH atabout50%;charactersX andY areexpectedto have smallvaluesof SH . Thedecisionfunctionsfor eachclassaremeasuresof distancefrom theclasses,andthedottedlinesareequidistantfrom theclassesthey separate.

Templatematching is a naturalapproachto classificationand is one of the most com-monly used.Individual pixels are directly usedas features.A similarity measureinsteadof a distancemeasureis defined.One can count the numberof agreements(black pix-els in test pattern matching black pixels in templateand white pixels in test patternmatchingwhite pixels in template).The templateclass that has the maximum numberof agreementscan be chosenas the classof the test pattern.Suchan approachis called

Page 15: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 17

Figure 10. Structuralfeaturesof stroke, concavity, hole, crosspoints& endpointscanbeusedasthedimensionsof afeaturespacetoclassifycharacters;thelocationsof thesefeaturesfor ‘A’ areillustrated.

the maximum correlation approach.Alternatively, one could count the number of dis-agreements(black pixel in test pattern where it is white in templateand vice versa).The class with the minimum number of disagreementscan be chosenas the class ofthe test pattern.Such an approachis called the minimum error approach.The agree-mentsanddisagreementscanbe weightedin orderto derive a suitablesimilarity measure.Template matching is effective when the variations within a class are due to “addi-tive noise” only and test patternsare free of noise due to rotation, shearing,warping,or occlusion.

The K nearestneighbour(K-NN) rule is a well-known decisionrule usedextensivelyin patternclassificationproblems.The misclassificationrateof theK-NN rule approachesthe optimal Bayeserror rateasymptoticallyasK increases(Fukunaga & Hostetler1975].TheK-NN rule is particularlyeffective, whenprobability distributionsof the featurevari-able are not known. Selectionof templatesis an importantpart of the nearestneighbour(1-NN) rule (Hart 1968). Templatesare selectedso that classificationsobtainedusingany proper subsetof the initial templateset lead to gradualdegradationin recognitionaccuracy.

Although many problemsare successfullydealt with using the statistical approach,it is often more appropriateto representpatternsexplicitly in terms of the structureor arrangementof componentsor primitive elementstaken as the defining attribute ofthe pattern. The structural approachto OCR representscharacterpattern images in

Page 16: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

18 Rangachar Kasturi et al

termsof primitives and relationsamongprimitives in order to describepatternstructureexplicitly.

Whenaskedto describeanalphanumericcharacter, peoplearemostlikely to usestructuralfeatures(figure10).For example,anupper-case‘A’ hastwo straightlines(strokes)meetingwith asharppoint(endpoint)atthetop,andathird linecrossingthetwoatapproximatelytheirmidpoint (crosspoints),creatinga gap in theupperpart (hole).Thebasisof any structuraltechniqueis therepresentationof thepatternwith a setof featureprimitivesthatareabletodescribeall encounteredpatternsanddiscriminatebetweenthem.

6.1c Contextual processing: Thesetechniquesutilize knowledge at the word level tocorrecterrorscommittedby OCR. Thesemethodsuseinformationaboutothercharactersthat have beenrecognizedin a word as well as knowledge about the text in which theword occursto carry out the task.Typically, the knowledgeaboutthe text takes the formof a lexicon (a list of words that occur in the text). For example,a characterrecognizermay not be able to reliably distinguishbetweenan u and a v in the secondposition ofqXeen. A contextual postprocessingtechniquewould determinethat u is correctsince itis very unlikely that qveenwould occur in the English languagedictionary. Furthermore,onecould useknowledgeof the Englishlanguageby which a q is almostalwaysfollowedby anu.

Therehasbeensystematicstudy of OCR performancefor English (Rice et al 1992).A completereview of OCR productsfor machineprinted documentswith benchmarkshasbeenpublishedby the University of Nevada(Nartker et al 1994). On good quality,machine-printedtext data the correct recognition rate rangedfrom 99.13% to 99.77%.For poorqualitymachine-printedtext data,thecorrectrecognitionraterangedfrom 89.34%to 97.01%.Thedrop in recognitionperformanceon poorquality datawasprimarily duetobrokenstrokesin charactersandtouchingof adjacentcharacters.More recently, OCRin thepresenceof graphicsin complex documentshasalsobeenreported(Sawaki & Hagita1998).Wilsonetal (1996)havepublishedacomprehensivereportontheuseandevaluationof OCRtechnologyfor form-basedapplications.

There is abundant literature describingmethodsfor OCR. OCR is perhapsthe mostresearchedareaof the patternrecognitiondiscipline.While researchin OCR for printedRomanscripthasreacheda point of diminishingreturns,OCRfor handwritingandfor non-Romanscriptscontinuesto bea very active field. Thereaderis referredto theproceedingsof conferencesandworkshopssuchastheInternationalConferencesonDocumentAnalysisandRecognitionandtheInternationalWorkshoponFrontiersin HandwritingRecognition.

7. Analysis of graphical documents

Graphicsrecognitionand interpretationis an importanttopic in documentimageanalysissincegraphicselementspervadetextual material,with diagramsillustratingconceptsin thetext, company logosheadingbusinessletters,andlinesseparatingfieldsin tablesandsectionsof text. Thegraphicscomponentsthatwe dealwith arethebinary-valuedentitiesthatoccuralongwith text andpicturesin documents.We alsoconsiderspecialapplicationdomainsinwhich graphicalcomponentsdominatethedocument;theseincludesymbolsin theformsoflinesandregionsonengineeringdiagrams,maps,businesscharts,fingerprints,musicalscoresetc.Theobjectiveis toobtaininformationtosemanticallydescribethecontentswithin imagesof documentpages.

Page 17: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 19

Documentimageanalysiscanbe importantwhenthe original documentis producedbycomputeraswell. Anyonewho hasdealtwith transportandconversionof computerfilesknows thatcompatibilitycanrarelybetakenfor granted.Becauseof themany differentlan-guages,proprietarysystems,andchangingversionsof CAD andtext formattingpackagesthatareused,incompatibilityis especiallytruein thisarea.Becausetheformatteddocument– that viewed by humans– is semanticallythe sameindependentof the languageof pro-duction,this form is a “protocol-lessprotocol”. If a documentsystemcantranslatebetweendifferentmachine-drawn formats,thenext objectiveis to translatefromhand-drawngraphics.This is analogousto handwritingrecognitionandtext recognitionin OCR.Whenmachinescananalysecomplex hand-drawn diagramsaccuratelyandquickly, thegraphicsrecognitionproblemwill besolved,but thereis still muchopportunityfor researchbeforethis goalwillbereached.

A commonsequenceof stepstaken for documentimageanalysisof graphicsinterpreta-tion is similar to that for text. Preprocessing,segmentation,andfeatureextractionmethodssuchasthosedescribedin earliersectionsarefirst applied.An initial segmentationstepthatis generallyappliedto a mixed text/graphicsimageis that of text andgraphicsseparation.An algorithmspecificallydesignedfor separatingtext componentsin graphicsregionsirre-spectiveof theirorientationis describedby Fletcher(1988).Thisis aHoughtransform-basedtechniquethatusestheheuristicthattext componentsarecollinear. Oncetext is segmented,typical featuresextractedfrom a graphicsimageinclude straight lines, curves,and filledregions.After featureextraction,patternrecognitiontechniquesareapplied,bothstructuralpatternrecognitionmethodsto determinethe similarity of an extractedfeatureto a knownfeatureusinggeometricandstatisticalmeans,andsyntacticpatternrecognitiontechniquesto accomplishthis sametaskusingrules(a grammar)on context andsequenceof features.After thismid-level processing,thesefeaturesareassembledinto entitieswith somemeaning– or semantics– thatis dependentuponthedomainof theparticularapplication.Techniquesusedfor this include patternmatching,hypothesisand verification,and knowledge-basedmethods.Thesemanticinterpretationof a graphicselementmaybedifferentdependingondomain;for instancea line maybea roadon a map,or anelectricalconnectionof a circuitdiagram.

Most commercialOCR systemsrecognizelong border and table lines as being dif-ferent from characters,so no attemptto recognizethem as charactersis made.Graphicsanalysissystemsfor engineeringdrawings must discriminatebetweentext and graph-ics (mainly lines). This is usually accomplishedvery well except for some confusionwhen charactersadjoin lines, causing them to be interpreted as graphics; or whenthere are small, isolated graphicssymbols that are interpretedas characters.Segmen-tation and analysis of colour-composite multi-layer maps, recognition of the three-dimensional object representedby its orthographic projections in a mechanicalpartdrawing, and constructionof a 3-D virtual walk-throughfrom an architecturaldrawingare someexamplesof challengespresentedto the graphicsimage analysisresearchers.Clearly, muchdomain-dependentknowledgeis appliedin essentiallyall graphicsanalysissystems.

It is beyond the scopeof this paperto describethe approachesfor graphicaldocumentanalysisin detail.In additiontomany journalpapers,theinterestedreaderisreferredtoseveralworkshopsdedicatedto this topic(GREC19,95,97,99)aswell aspapersin theInternationalConferenceson DocumentAnalysisandRecognition,andtheInternationalConferencesonPatternRecognition.

Page 18: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

20 Rangachar Kasturi et al

8. Document analysis in a multilingual context

In this paperwe have presentedanoverview of themethodsappliedto documentimagestoextractmeaningfulsemanticinformationfrom bit-mappedimagesof documents.Althoughwe did not make any explicit statement,the methodsdescribedso far implicitly assumedthat the documentscontaina single language.In general,pixel-level processingand fea-ture analysismethodsare essentiallyindependentof the particular languagepresentin adocument.However, particular featuresextractedfrom the image and the methodsthatare applied for discriminationamongvariouscharactersare clearly language-dependent.Whenadocumentcontainsseverallanguageswithin apageit is necessaryto first segmentitinto regionscontainingasinglelanguagein eachregion.

8.1 Multilingual documentsegmentation

An interestingproblemin multilingual documentanalysisis that of segmentingthe docu-mentimagesinto regionsthatcontainasinglelanguageandidentifying thelanguagein eachregion. Spitz (1997)presenteda methodthat performssucha classificationin documentscontainingseveralHan-basedscripts(Chinese,Japanese,andKorean)and23Latin-basedlan-guagessuchasEnglish,German,andVietnamese.His methodmadestronguseof language-dependentfeaturessuchasthedistributionof opticaldensityandthelocationof concavitiesinpixel data.

9. OCR for Indian languages

Proceedingsof the Fifth InternationalConferenceon DocumentAnalysisandRecognition(ICDAR 1999)hasseveralpapersdealingwith OCRfor Devanagari (suchasKarnik 1999).OCRfor Indianlanguagesin generalis moredifficult thanfor Europeanlanguages(Murthy1998)becauseof the large numberof vowels, consonants,andconjuncts(combinationofvowelsandconsonants).Further, mostscriptsspreadoverseveralzones.Segmentationhastodealwith thepositioningof theconjunctsandhalf syllables.Thesefactorscoupledwith theinflectionalandagglutinativenatureof IndianlanguagesmaketheOCRtaskquitechallenging.Languagemodelsandcomputationallinguisticsasit pertainsto Indianlanguagesis anareaof recentresearch.Bharati et al (1998) describemachinetranslationsystemsand lexicalresourcesfor Indian languagesincludingSanskrit.The issuesin parsingandunderstandingSanskritarediscussedin detailby Ramanujam(1999).

10. Conclusions

We presenteda brief summaryof basicbuilding blocksthat comprisea documentanalysissystem.We encouragethereaderto refer to citedpapersfor moredetaileddescriptions.Wehopethat this introductionwill help the readerby providing the backgroundnecessarytounderstandthecontentsof thepapersthatfollow in this specialissue.

References

Arcelli C, Sannitidi BajaG 1985A width-independentfastthinningalgorithm.IEEETrans.PatternAnal.MachineIntell. PAMI-7: 463–474

Page 19: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

Documentimageanalysis:A primer 21

Arcelli C, Sannitidi BajaG 1993Euclideanskeletonvia center-of-maximal-discextraction.ImageVisionComput.11:163–173

Akiyama T, Hagita N 1990 Automatedentry systemfor printed documents.Pattern Recogn. 23:1141–1154

BairdH S1987Theskew angleof printeddocuments.Proceedingsof theConferenceof theSocietyofPhotographicScientistsandEngineersonHybrid ImagingSystems(Springfield,VA: Soc.Photogr.Sci.Eng.)pp14-21

BharatiA, Chaitanya V, Sangal R 1998Computationallinguisticsin India: An overview. TechnicalReport,IndianInstituteof InformationTechnologies,Hyderabad

DengelA, BleisingerR, Hoch R, Fein F, HonesF 1992From paperto office documentstandardrepresentation.IEEEComput.25:63–67

FletcherA, Kasturi R 1988A robust algorithmfor text string separationfrom mixed text /graphicsimages.IEEETrans.PatternAnal.MachineIntell. PAMI-10: 910–918

FreemanH 1974Computerprocessingof line drawing images.Comput.Surv. 6: 57–98FreemanH, Davis L 1977A corner-finding algorithmfor chain-codedcurves.IEEE Trans.Comput.

C-26:297–303FukunagaK, HostetlerL D 1975K-nearest-neighbourBayes-riskestimation.IEEETrans.Inf. Theor.

21:285-293GarrisM D, Dimmick D L 1996Form designfor high accuracy opticalcharacterrecognition.IEEE

Trans.PatternAnal.MachineIntel. PAMI-18: 653–656GREC1995,97,99SelectedpapersfromtheInternationalWorkshopsonGraphicsRecognition1995,

1997,and1999.Lecture Notesin ComputerScienceseries(SpringerVerlag)vols. 1072(1996),1389(1998),1941(2000)

HaralickR M, ShapiroL G 1992Computerandrobotvision(Reading,MA: Addison-Wesley)HaralickR M, Sternberg SR, ZhuangX 1987Imageanalysisusingmathematicalmorphology. IEEE

Trans.PatternAnal.MachineIntell. PAMI-9: 532–550HashizumeA, YehPS,RosenfeldA 1986A methodof detectingtheorientationof alignedcomponents.

PatternRecogn.Lett.4: 125–132HartP E 1968Thecondensednearestneighbourrule. IEEETrans.Inf. Theor. 14:515–516ICDAR 19995thInt.Conf. onDocumentAnalysisandRecognition(LosAlamitos,CA: IEEEComput.

Soc.)IllingworthJ,Kittler J1988A survey of theHoughtransform.Comput.GraphicsImageProcess.44:

87–116KarnikRP1999IdentifyingDevnagari characters.Proc.Int. Conf. onDocumentAnalysisandRecog-

nition (LosAlamitos,CA: IEEEComput.Soc.)pp.669–672JainA K, BhattacharjeeS K 1992Text segmentationusingGaborfilters for automaticdocument

processing.MachineVisionAppl.J. 5: 169–184Lai C P, KasturiR 1991Detectionof dashedlinesin engineeringdrawingsandmaps.Proc.First Int.

Conf. onDocumentAnalysisandRecognition, St.Malo, France,pp.507–515Lam L, LeeS-W, SuenC Y 1992Thinningmethodologies- A comprehensive survey. IEEE Trans.

PatternAnal.MachineIntell. PAMI-14: 869–885LamL, SuenC Y 1995An evaluationof parallelthinningalgorithmsfor characterrecognition.IEEE

Trans.PatternRecogn.MachineIntell. 17:914–919Medioni G, YasumotoY 1987 Cornerdetectionand curve representationusing cubic B-splines.

Comput.Vision,Graphics,ImageProcess.29:267–278Murthy B K, DeshpandeW R 1998Opticalcharacterrecognition(OCR)for Indianlanguages.Proc.

Int. Conf. onComput.Vision,Graphics,Vision, ImageProcess.ICVGIP, New DelhiNartkerT A, RiceSV, KanaiJ1994OCRAccuracy. UNLV’sSecondAnnualTest.TechnicalJournal

INFORM, Universityof Nevada,LasVegasO’GormanL 1988Curvilinearfeaturedetectionfrom curvatureestimation.9th Int. Conferenceon

PatternRecognition, Rome,Italy, pp1116–1119O’GormanL 1990k x k Thinning.Comput.Vision,Graphics,ImageProcess.51:195–215

Page 20: Document image analysis: A primer - CUBS - Home...Document image analysis: A primer 5 Figure 2. A typical sequence of steps for document analysis, along with examples of intermediate

22 Rangachar Kasturi et al

O’GormanL 1992Imageanddocumentprocessingtechniquesfor theright pageselectroniclibrarysystem.Int. Conf. PatternRecognition (ICPR), TheNetherlands,pp260–263

O’GormanL 1993Thedocumentspectrumfor structuralpagelayoutanalysis.IEEE Trans.PatternAnal.MachineIntelli. PAMI-15: 1162–73

O’Gorman L 1994 Binarization and multi-thresholdingof documentimagesusing connectivity.CVGIP:GraphicalModelsImageProcess.56:494–506

O’GormanL, Kasturi R 1997Documentimageanalysis.IEEE ComputerSocietyPressExecutiveBriefingSeries, LosAlamitos,CA

Pavlidis T 1982Algorithmsfor graphicsandimageprocessing(Rockville,MD: Comput.Sci.Press)Pavlidis T,ZhouJ1991Pagesegmentationbywhitestreams.Proc.1stInt.Conf.onDocumentAnalysis

andRecognition (ICDAR), St.Malo, France,pp945–953PostlW 1986Detectionof linearobliquestructuresandskew scanin digitizeddocuments.Proc.8th

Int. Conf. onPatternRecognition (ICPR), Paris,France,pp687–689RamanujanP1999Developmentof ageneral-purposeSanskritparser, M Scthesis,Dept.of Computer

Science& Automation,IndianInstituteof Science,BangaloreRamerU E 1972An iterative procedurefor the polygonalapproximationof planecurvesComput.

GraphicsImageProcess.1: 244–256ReddiSS,RudinSF, KeshavanH R 1984An optimalmultiple thresholdschemefor imagesegmen-

tation.IEEETrans.Syst.ManCybern.SMC-14:661–665Rice S V, Kanai J, Nartker T A 1992A reporton the accuracy of OCR devices.TechnicalReport,

InformationScienceResearchInstituteof Nevada,LasVegasSawaki M, HagitaK 1998Text-line extractionandcharacterrecognitionof documentheadlineswith

graphicaldesignusing complimentarysimilarity measure.IEEE Trans.Pattern Anal. MachineIntell. PAMI-20: 1103–1109

SahooP K, SoltaniS, WongA K C, ChenY C 1988A survey of thresholdingtechniques.Comput.Vision,Graphics,ImageProcess.41:233–260

Sannitidi BajaG1994Well-shaped,stableandreversibleskeletonsfromthe(3,4)-distancetransform.VisualCommun.ImageRepresentation5: 107–115

SerraJ 1982Imageanalysisandmathematicalmorphology (London:AcademicPress)ShihC-C, KasturiR 1988Generationof a line-descriptionfile for graphicsrecognition.Proc.SPIE

Conf. onApplicationsof Artificial Intelligence937:568–575SpitzL 1997Determinationof theScriptandLanguageContentof DocumentImages.IEEE Trans.

PatternAnaly. MachineIntell. PAMI-19: 235–245SrihariS N, GovindarajuV 1989Analysisof textual imagesusingtheHoughTransform.Machine

VisionAppl.2: 141–153Trier O D, TaxtT 1995Evaluationof binarizationmethodsfor documentimagesIEEETrans.Pattern

Anal.MachineIntell. PAMI-17: 312–315TsaiW-H 1985Moment-preservingthresholding:A new approach.Comput.Vision,Grapics,Image

Process.29:377-393WilsonC L, GeistJ,GarrisM D, ChellapaR 1996Design,integration,andevaluationof form-based

handprintandOCR systems.TechnicalReport,NISTIR5932,National Instituteof Standards&Technology, US; downloadfrom http://www.itl.nist.gov/iad/894.03/pubs.html

WongK Y, Casey R G, WahlF M 1982Documentanalysissystem.IBM J. Res.Dev. 6: 647–656Wu W-Y, WangM-J J1993Detectingthedominantpointsby thecurvature-basedpolygonalapprox-

imation.CVGIP:GraphicalModelsImageProcess.55:79–88