XDOC DATA FORMAT

download XDOC DATA FORMAT

of 58

Transcript of XDOC DATA FORMAT

  • 8/7/2019 XDOC DATA FORMAT

    1/58

    Copyright 19952000 ScanSoft, Inc.All Rights Reserved.

    XDOCDATA FORMATTechnical Specification

    Version 4.0May 1999

  • 8/7/2019 XDOC DATA FORMAT

    2/58

    Copyright Copyright 19952000 by Scan Soft, In c. All right s reser ved. No part of th ispublicat ion ma y be tr an smitted, tr an scribed, reproduced, stored in an y retrievalsystem or tra nslated int o an y lan guage or compu ter lan guage in any form or by anymeans, mechanical, electronic, magnetic, optical, chemical, manual, or otherwise,without t he pr ior wr itten consent of ScanSoft, In c., 9 Cent ennial Dr ive, Peabody,Massachuset ts 01960. Printed in th e United Sta tes of America.

    The softwar e described in t his book is furnished un der license an d ma y be used orcopied only in a ccordan ce with th e ter ms of such license.

    Disclaimer ScanSoft, In c. provides this p ublication as is without warr an ty of any kin d, eitherexpress or implied, including but not limited to the implied warr an ties of merchan tability or fitness for a par ticular pur pose. Some sta tes or jurisdictions donot allow disclaimer of express or implied war ra nt ies in certa in tr an sactions;ther efore, this statem ent m ay not apply to you. ScanSoft reser ves the right to reviseth is publicat ion a nd t o make chan ges from time t o time in t he cont ent h ereof withoutobligation of Scan Soft t o notify an y person of such r evision or cha nges.

    Credits TextBridge is a registered trademar k and ScanWorX is a trademar k of ScanS oft, In c..

    Other terms u sed in th is manua l may bethe tr ademarks of other m anu facturers.

    Writers: Daniel S. Conn ellyBeth P addock Rebecca H ar vey

    S c a n S o ft I n c .9 Centenn ial DrivePeabody, Massachuset ts 01960Main: 978-977-2000SDK Techn ical Su pport : 978-977-2174Em ail: api_supp ort@scans oft.com

    XDOC Da ta Fo rm a t : Tech n ica l Sp ec i f i ca t ionPa rt Num ber 00-08137-00May 1999

  • 8/7/2019 XDOC DATA FORMAT

    3/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification ii i

    Table of Contents

    Table of Contents iii

    Preface vAbout This Docume nt ...........................................................................................v

    Documen ta tion Conve nt ions ...............................................................................vi

    Rela ted Pu blicat ions ............................................................................................vi

    Cu st omer Su ppor t ................................................................................................vi

    Overview 1

    XDOC Text Markups 12.1 Modifiers a nd Ope ra nd s ...............................................................................2

    2.2 Tra ns act ions ..................................................................................................3

    2.2.1 Text lin e tra n sa ctions ..........................................................................4

    2.2.2 P age t ra ns act ions .................................................................................62.2.3 Docume nt t ra n sa ctions ........................................................................7

    2.3 Text Tr an sa ction Syn ta x ..............................................................................7

    XDOC File Structure 13.1 Pa ge Seque nce...............................................................................................1

    3.2 Text Da ta .......................................................................................................2

    Page Image Analysis 1

    4.1 Un its of Measu re an d Coordina te Tran sform at ions ...................................14.2 Pa ge Segm en ta tion .......................................................................................3

    4.2.1 Topology ................................................................................................5

    4.2.2 Lexical order .........................................................................................7

    4.2.3 Im age zones ..........................................................................................7

    4.3 Rulin gs........................................................................................................... 7

    4.4 Algebra ic Tr an sforma tion s...........................................................................9

    4.4.1 Convert page coordina tes to ima ge coordina tes .................................9

  • 8/7/2019 XDOC DATA FORMAT

    4/58

    iv XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    4.4.2 Conver t skew informa tion to degr ees ...............................................11

    4.4.3 Convert original image coordin at es to deskew ed coordin at es.........11

    4.5 Fon ts ............................................................................................................13

    Transformations 1A.1 Text Line s......................................................................................................1

    A.2 Zones ..............................................................................................................3

    A.3 Ru lings...........................................................................................................4

    Sample XDOC Text 1B.1 Sa mp le Pa ge Ima ge ......................................................................................1

    B.2 Sa mp le XDOC Text .......................................................................................2

    Character Set 1

    Index 1

  • 8/7/2019 XDOC DATA FORMAT

    5/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification v

    Preface

    This Technical S pecification describes the ScanSoft Inc. XDOC outputforma t. XDOC provides richn ess of informa tion th at can be converted toan other output format . It capt ur es the str uctur e of docum ents as pr ocessedby ScanS ofts docum ent recognition pr oducts, including S c a n S o ft S D K forWindows platforms.

    XDOC dat a format is one of several t ext outpu t format options available fromth e ScanSoft SDK, and other Scan soft pr oducts. You can write a pplicat ions t ointerpr et XDOC Text for use with a word processor, desktop pu blishingsystem, or an y other a pplicat ions r equiring input from pa per or image files.

    About This Document

    Th e T echn ical S pecification describes the str uctur e and synta x of the XDOCforma t. The informa tion is aimed at systems developers a nd a pplicationsprogramm ers who are familiar with character format s and mark ups, andimage file formats, a s well as th eir par ticular platform or OS environment.The specificat ion is organ ized as follows:

    Chapt er 1, Overview, describes th e hierar chy and relationsh ips of th eXDOC data forma t.

    Chapt er 2, XDOC Text Marku ps, describes the stru cture and synta x of XDOC text markups.

    Chapt er 3, XDOC File Stru cture, provides informa tion about thestru cture of and page sequen ce in a n XDOC Text file.

    Chapt er 4, Pa ge Image Analysis, discusses how ScanSofts documentrecognition software initially perceives page image layout, divides it intomean ingful ar eas, an d identifies font s.

    Appendix A, Tran sformations, furth er describes the tran sformat ionsdescribed in Chapt er 4.

    Appendix B, Sample XDOC Text, consists of a page image and itscorr espondin g XDOC Text file.

  • 8/7/2019 XDOC DATA FORMAT

    6/58

    vi XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Documentation Conventions

    Throughout this T echn ical Sp ecification, you refers to th e a pplicationprogramm er or system developer, an d the user refers to an application enduser.

    As described in Ta ble P1, the T echn ical S pecification uses certa in graphicalelements an d formatt ing to emph asize informa tion and den ote meaning inthe text.

    Ta ble P 1. Documenta tion Conventions

    Co n ve n t ion De scr ip t ion

    bo ld Int roduces a n ew term, or the first u se of an import an tterm in a chapt er; highlight s fun ction nam e.

    italic Denotes titles of manuals or books. Also used to denotegeneric representa tions of ent ries in examples.

    monospace Denotes code examples a nd file n ames. (quotes) Denotes t i t les of chapters and sections in this man ual.

    Introduces t i p s th at p rovide useful inform ation about aprocedura l step or system function.

    Note Introduces important information about the currentsubject.

    Related Publications

    Consult th e Overview for the ScanSoft SDK for detailed information aboutwriting docum ent recognition applicat ions. Refer t o the appr opriate u ser sguide and inst allation guide for any other ScanSoft product tha t pr oducesXDOC forma t

    Customer Support

    If you should experience problems working with XDOC that you cannotresolve, contact Technical Support via electronic mail at

    [email protected]

    If you prefer to use th e telephone, you can call Customer Support a t

    9789772174

    Be ready to provide:

    Your software registrat ion number.

    A detailed description of th e steps tha t led to th e difficulty.

  • 8/7/2019 XDOC DATA FORMAT

    7/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification 11

    1Overview

    XDOC captu res th e content and stru cture of compound documentsconta ining both text a nd gra phicsas r ecognized and output by Scan Soft In c.document recognition software. In addition to recognized text, XDOCprovides detailed information about the following:

    page layout (X,Y coordinates of text and graph ics on the page)

    chara cter forma tt ing (bold, italic, subscript, superscript, un derline)

    fon t me t rics

    g raph ic images

    horizonta l and vertical rulings

    document structuring conventions

    lexical classes

    chara cter and word confidence

    word bounding boxes

    XDOC provides a flexible gramma r t ha t your applications can easilyinterpr et an d convert t o outp ut format s for use with a word pr ocessor ordesktop publishing system.

    ScanSofts document recognition software stores XDOC data formatinformation about each document in a set of related files. A single XDOC

    Te x t file contains informat ion a bout the text an d references any files tha tdescribe the graphics. XDOC Text is a ScanSoft-defined document storagelanguage.

    In m ost cases, each gra phic element in a documen t is a sepa ra te file writtenin an image file format , such as T I F F .

    Figure 11 illustr ates t he hiera rchy of th e XDOC data forma t.

    XDOC Text filerefers to separategraphics files

    Graphics files:TIFF, RES, PCX,FAX, or othergraphics format

    Figu re 11 . XDOC data hierarchy

  • 8/7/2019 XDOC DATA FORMAT

    8/58

    12 XDOC Data Form at: Technical S pecification Ver. 3.0 (Rev. 79), 05/ 06/ 97

    You can th en u se th ese XDOC Text and graph ics files as input for your a ppli-cations (an d selected ScanS oft pr oducts), to inter pret an d convert to a formatuseful to an end-user.

    You can a lso read a gr aph ics file with gen eric, off-the-shelf softwar e.

    The form at, st ru cture, an d layout of XDOC Text are discussed in more deta ilin the remainder of this specification.

    Chapt er 2, XDOC Text Marku ps, details th e stru cture an d synta x of thetext ma rku ps tha t describe th e physical an d logical attr ibutes of text in anXDOC Text file.

    Chap ter 3, XDOC File Structur e, describes the st ru ctu re an d page sequenceof an XDOC Text file. This informa tion can h elp you d etermine the best wayto read a nd int erpret an XDOC Text file with your applications.

    Chapter 4, Page Image Analysis, explains how the document recognitionsoftware initially perceives the layout of a page image, divides it intomean ingful ar eas of graph ics or text, ha ndles ru lings, an d identifies font s.

  • 8/7/2019 XDOC DATA FORMAT

    9/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification 21

    2XDOC Text Markups

    A document in XDOC Text consists of text with int ermixed mar kup s th atdescribe physical a nd logical a ttr ibutes of the text. Th is chapter describes th estru cture an d synta x of th e mar kups in a n XDOC Text file.

    Figure 21 illustr ates a sa mple documen t with a variety of typefaces andpar agra ph in dents; Figure 22 shows the corresponding XDOC Text filewhose contents describe the sa mple docum ent.

    This is the first program that you willwrite when you study the C programminglanguage .

    Section 1.1 Getting Started. . . . . . . . . . . . . . . page 7

    HELLO, WORLD

    Figu re 21 . Original Pa ge

    [a;"XDOC.12.0";E;"FWX12.0"][d;"hellowconf.xdc"][p;1;P;0;S;0;-909;400;400;0;0;2142;2794;0;0;1][t;1;1;227;386;A;"";"";"";0;0;2140;2793;0;0,0,0,0][f;0;"";R;s;2540;F;0;0;0;10;100][f;2;"C";R;q;2201;F;22;21;16;9;100][f;3;"T";R;q;1693;V;25;25;17;10;100][f;4;"C";B;s;3471;F;37;37;25;15;100]

    [O;1252;1][s;1;569;323;1;264;c;4;9][w;835][e;1][c;4]HELLO,[h;1066;19;2][w;904]WORLD[y;1522;253;264;3;H][s;1;569;129;3;439;p;2;5][w;541][c;2]This[h;777;29;4][w;623]is[h;842;28;5][w;581]the[h;928;28;6][w;477]first[h;1055;25;7][w;669]program[h;1228;25;8][w;451]that[h;1331;25;9][w;811]you[h;1417;23;10][w;629]Will[y;1522;0;439;1

    ;S][s;1;569;128;11;482;p;2;5][w;601]write[h;800;24;12][w;789]when[h;908;23;13][w;861]you[h;992;26][w;623]study[h;1121;25;14][w;824]the[h;1204;25;15][w;492]C[h;1246;25;16][w;733]programming[y;1522;19;482;1;S][s;1;569;130;;17;523;p;2;5][w;526]language[y;1522;658;523;2;H][s;1;569;0;20;18;608;t;3;0;1][w;532][c;3]Section[h;673;15;21;19][w;632]1.1[h;723;14;22][w;601]Getting[h;841;12;23][w;632]Started[l;".";950;266;24;15;1][w;523]page[h;1280;12;24][w;794]7[y;1522;215;608;0;H][g;1666;0;0;2142;2794;0]

    Figu re 22 . XDOC Text file

  • 8/7/2019 XDOC DATA FORMAT

    10/58

    22 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    2.1 Modifiers and Operands

    With in XDOC, marku ps ar e called m o d i f i e r s and t he values assigned toth eir att ributes, if an y, are called o p e r a n d s . Operands appear in a l istform at. An operan ds position in th e list, ra ther tha n a n insert ed tag,indicates its particular meaning or attribute.

    There a re th ree basic types of operan ds: cha racters, int egers, and str ings. Ac h a r a c t e r o p er a n d is a single alphabetic chara cter. Cha racter operan dsenu mera te a limited num ber of discrete values for th e operan d.

    N u m e r i c op e r a n d s consist of a sequ ence of up to 10 digits pr eceded by anoptiona l minus sign. They encode quan titat ive data . A nu meric operan dcannot include a decimal point, and t hu s only int eger values are repr esented.

    A s t r in g o p e r a n d is a sequence of printing cha racters, wher e the sequen cecan be quite lar ge (up t o 256 chara cters), and can include all printingchara cters. String operands a re enclosed in double quote cha racters.

    Note If a strin g operan d includes t he double quote cha racter, t wo double

    quote cha racters a re used together. In a ll other respects, stringopera nds a re literal. They have no embedded mar kups or modifiers.

    All XDOC Text m odifiers begin with th e modifier sta rt escape cha ra cter:

    [

    To avoid confusion, the body of the text cannot include the modifier startescape char acter literally. Where [ should appea r in the body of the text,XDOC replaces it with the doubled modifier star t escape chara cter:

    [[

    The m odifier code, which is a single alph abet ic char acter , followsimmediately after th e modifier sta rt escape char acter. Lowercase anduppercase modifier codes are available. If the modifier does not have anyoperands then it is followed directly by th e text th at it modifies oth erwise it isfollowed by th e operands. All opera nds a re pr eceded by an operand separa torcharacter:

    ;

    At the en d of the operand list (but only if ther e is such a list) comes t hemodifier end escape chara cter:

    ]

  • 8/7/2019 XDOC DATA FORMAT

    11/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text M ark ups 23

    [s;1;569;323;264;c;4;9]Start escapecharacterModifiercode

    Operands

    Operandseparators

    End escapecharacter

    Figu re 23 . Sample XDOC Text ma rku p

    Syntactically, XDOC Text can include any text and any modifier. However,once a modifier is begun , it mu st be completed before a new m odifier canbegin.

    2.2 Transactions

    A t r a n s a c t i o n is a series of marku ps th at, t aken togeth er, describe a block of text as a unit . There are t hree fundamenta l tran sactions in XDOC Text:

    the l ine of text

    t he pa ge

    t h e d ocu m en t

    When a n ew tra nsa ction begins in XDOC Text, the previous tr ansa ction of th e same t ype termina tes. For example, text continues t o flow into thecurren t text line unt il a new text line begins. No fur th er text is added to theprevious line.

    Similar ly, text flows into th e curren t page, line-by-line, u nt il a n ew pagebegins. At tha t point, no new text can be added to th e previous pa ge.

    Fina lly, text will flow into th e curren t document, pa ge-by-page un til a n ewdocum ent begins. Then, no furt her t ext can be added to the pr eviousdocument.

    The star t of the next tr ansaction t erminates t he previous t ransa ction. Newdat a can be added to the curr ent line, page or documen t un til a new line,page or documen t ha s begun .

    Two modifiers, one a t th e end of a text line an d one at th e end of a page,provide sum ma ry informa tion a bout the previous line or page. Theseinforma tional modifiers do not force the cur rent tra nsa ction to end.

    There is also an optional modifier th at m ar ks th e end of the curr entdocum ent for your convenience. This modifier does not pr event you fromadding more informa tion to the curr ent docum ent.

  • 8/7/2019 XDOC DATA FORMAT

    12/58

    24 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    As shown in Figure 24, the t hr ee tran saction types are u sua lly nested soth at line tr an sactions occur with in an ongoing page tran saction, an d pagesoccur within an ongoing document tra nsaction.

    [a;"XDOC.12.0";E;"FWX12.0s"[d;"hellowconf.xdc"]

    [p;1;P;83;S;0;1666;0;0;2142;2794

    [t;1;1;227;386;A;"";"";"";0;0;2140;2793;1[f;0;"";R;s;2540;F;0;0;0;10;100[f;2;"C";R;q;2201;F;22;21;16;9;100[f;3;"T";R;q;1693;V;25;25;17;10;100[f;4;"C";B;s;3471;F;37;37;25;15;100

    [s;1;569;323;264;c;4;9][w;835[k;T;1;0;0;568;1524][e;1][c;4]HELLO[h;1066;19][w;904]WORLD[y;1522;253;264;3;H[s;1;569;129;439;p;2;5][w;541][c;2]This[h;777;2[w;623]is[h;842;28][w;581]the[h;928;28][w;477]fir[h;1055;25][w;669]program[h;1228;25][w;451]that[h;1331;2[w;811]you[h;1417;23][w;629]Will[y;1522;0;439;1;

    [s;1;569;128;482;p;2;5][w;601]write[h;800;24][w;789]wh[h;908;23][w;861]you[h;992;26][w;623]study[h;1121;2[w;824]the[h;1204;25][w;492]C[h;1246;25][w;733]programmi[y;1522;19;482;1;S][s;1;569;130;523;p;2;5][w;526]language[y;1522;658;523;2;[s;1;569;0;608;t;3;0;1][w;532][c;3]Section[h;673;1[w;632]1.1[h;723;14][w;601]Getting[h;841;12][w;632]Start[l;".";950;266;15;1][w;523]page[h;1280;12][w;794[y;1522;215;608;0;H]

    Document

    Text lines

    Page

    Page

    [g;1666;0;0;2142;2794]

    Figu re 24 . Nested XDOC tr an sactions

    Refer to Section 2.3, Text Tr an saction Syn ta x, for a n a lpha betical listing of each modifier an d its a ssociated operan ds.

    2.2.1 Text line transactions

    All text is organized into lines. Each line ha s a single established Ycoordinate or baseline value. Words in the line of text a re sepa rat ed bywhitespace. Each whitespace has a beginn ing X coordinate a nd a length.

    There ar e two types of whitespace: non-printing (horizontal spa ce) andprint ing (leader). Each line has a left ma rgin an d a right m ar gin, which a re,in effect, whitespaces at the beginn ing and end of the line between th e linean d the m argins of th e galley in which tha t line sits.

    Modifiers for text line transactions fall into three categories: essentialmodifiers, m ode shift modifiers, a nd lexical m odifiers.

  • 8/7/2019 XDOC DATA FORMAT

    13/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text M ark ups 25

    2.2.1.1 Essential modifiers

    The essential m odifiers for a text line tr an saction are:

    Mod ifie r E xa m p le

    st ar t-of-t ext lin e [s;1;569;128;5;482;p;2]

    t ext lin e in for ma t ion [y;1522;0;439;1;S ]

    horizonta l space (nonpr intingwhitespace)

    [h;842;28;3]

    leader characters (printingwhitespace)

    [l;".";950;266;2;15;1]page[h;1280;12;3]7

    2.2.1.2 Mode shift modifiers

    XDOC may a pply a n umber of modifiers indicat ing a m ode shift t o the textwithin a text line. These mode shifts are:

    Mod ifie r E xa m p le

    change fon t [c;1]

    change r egion [e;1]

    change subscr ipt [B1[B

    change super scr ipt [S1[S

    change under lin e [Umu st[U

    d r op pe d ca p it a l le tt er [u ;76 3;1 59 ;2 34 ;8 69 ;1 ;3 ]

    2.2.1.3 Lexical modifiers

    Lexical modifiers subst itute for n ormal text or indicate the quality of norma ltext. They are:

    Mod ifie r E xa m p le

    opt iona l h yphen [H

    quest ion able cha racter [Ql

    un recogn ized cha racter [E

    cha racter confidence [q;581]

    word confidence [w;904]

    lexical class (built- in and user) [v;14]

    w or d bou n din g box [b;892;229;10 66;274;250 ;0]

  • 8/7/2019 XDOC DATA FORMAT

    14/58

    26 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    2.2.2 Page transactions

    There a re t hr ee types of page tr an saction modifiers: essential m odifiers,layout m odifiers, an d font modifier.

    2.2.2.1 Essential modifiers

    A page tran saction a lways begins with a st art -of-page modifier an d ends witha page information modifier. These are the following:

    Mod ifie r E x a m p le

    s t a r t -of-page [p ;1 ;P ;0 ;S ;0 ;909 ;400 ;400 ;0 ;0 ;2142;2794;0 ;0 ]

    p a ge in for m a tion [g;1 66 6;0 ;0 ;2 14 2;2 794 ;0 ]

    2.2.2.2 Layout modifiers

    Three modifiers provide the page layout. These are the following:

    Mod ifie r E x a m p le

    text zone descriptor [t;1;1;227;386;A;"";"";"";0;0;2140;2793;0;1]

    image zone descriptor [x;2;11;183;1483;1916;954;"H1.tif";183;1483;1916;954]

    ru ling descr ip tor [r ;805 ;2584;H;1610;s ;3 ;0 ;0 ;R ;0 ]

    n ew t a ble [j;3;251;373;2024;2268;5;5;0;2 61;399;39 9;681;681;995;995;1489;1489;2011]

    s ta r t t a ble colu m n [n ;3;2;1;1;0;0;2;0]

    en d cell t able [A

    hard column break [X

    s ect ion br ea k [k ;T;2 ;-2 ;1 ;1 ;1 ;1;3 01 ;9 34 ;9 82 ;1 61 0]

    2.2.2.3 Font modifier

    XDOC Text specifies th e font inform at ion modifier for each font page-by-pageeven th ough font informa tion is pr operly par t of th e docum ent tra nsa ction.This mean s th at XDOC Text may describe the sa me font inconsistent ly frompage to page.

    To resolve this problem, you m ay need t o search for a nd u se th e lastdescription of a font in a document, proceeding in workflow order. In general,you can use t he font description on a page when processing tha t pa ge. Anyfont appear ing on a page ha s a corr esponding description on th at page.

    There is one font information modifier.

    Mod ifie r E x a m p le

    fon t iden t ifie r [f;3 ;"T";R ;q ;1693;V;25;25;17;10;100]

  • 8/7/2019 XDOC DATA FORMAT

    15/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text M ark ups 27

    2.2.3 Document transactions

    There ar e three modifiers appropriate to the docum ent tr an saction. Theseare :

    Mod ifie r E xa m p le

    st ar t of docu men t [a ;"XDOC.12.0";E ;"F WX12.5"]

    end of documen t [Zdocument name [d;"hellowconf.xdc"]

    2.3 Text Transaction Syntax

    Table 21 lists t he XDOC Text m odifiers a lpha betically by modifier code. Ineach des cription, th e modifier code is a liter al code value. All operan ds,however, ar e just var iable names for values, whose first letter indicat es thetype of th e operan d: c (cha ra cter), n (num eric) or s (string).

    The second part of the operan d nam e is a num ber indicat ing the ordinalposition of the opera nd. F or example, n3 is th e na me of the th ird operan d;its value is a series of chara cters represent ing an int eger.

    The XDOC Define is th e modifier na me a s defined in th e kdoctext.hinclude file. kdoctext.h also defines th e acceptable values for each operan d.

    Ta ble 21. XDOC Text Mar kups

    * indicat es not found in XDOC_LITE.

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    a Sta r t of document

    s1: version identifieralways"XDOC12.0"

    c2: XDOC flavor enhanced,

    lite or pluss3: OCR version - version of OCR engine

    KDC_STDOC

    A End cell table None KDC_NDTABLE

    b Wor d bou n din gbox

    n1:n2:n3:n4: left, top, right, bottom

    boundin g box of the n extword

    n 5: b a sel in e

    n6: leader dot ; t rue or false

    KDC_WBOX

    B S h ift t o/fr omsubscript

    None KDC_SUB

    c Ch an ge fon t n1: identifier of the fontchanged to

    KDC_CHGFONT

    d Docu men tname

    s1: in terna l document name,defined by u ser

    KDC_DOCNAME

  • 8/7/2019 XDOC DATA FORMAT

    16/58

    28 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Ta ble 21. XDOC Text Mar kups

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    E U n r ecogn ize dcharacter

    None KDC_UNREC

    e Change region n1 : iden t ifi er of the regionbeing chan ged to

    KDC_REGION

    f Fontinformation

    n1: font ident ifiers 2: fon t n a mec3: face s tylec4 : se r if s tylen5: average character widthc6: code for fixed or variable

    char acter widthn7 : average he igh t of

    uppercase characters, 0 if unknown

    n8: average he igh t of lowercase cha ra cters withdescender, 0 if unkn own

    n9: average he igh t of lowercase cha ra cters withno descender a nd noascender, 0 if unkn own

    n10: typographers point sizen11: font width, e.g.,

    100%=normal,80%=compressed,120%=expanded

    KDC_FONTINFO

    g Pageinformation

    n1: cosecant of angular tilt of page on image

    n2,n3: X,Y coordinates of top left

    corner of source imagen4,n5: X,Y coordinates of bottom

    right corn er of sourceimage,

    c6: required or opt page break

    KDC_NDPAGE

    H Opt iona l (soft )hyphen

    None KDC_OHP HEN

    h N on pr in tin gwhitespace

    n1: X coordinate of left edge of whitespace

    n2: distance from left edge of whitespa ce to next word

    n3: unique id of next word*n4: space char acter count for

    ta bulat ion from previousword (output only if moretha n one space)

    *n5: tab advance char actercount for ta bulation fromprevious word (out put if line is part of a t able)

    KDC_HSPACE

  • 8/7/2019 XDOC DATA FORMAT

    17/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text M ark ups 29

    Ta ble 21. XDOC Text Markups (cont.)

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    j N ew t able n 1: u niqu e id of cell t ablen2:n3:n4:n5: left, top, right, bottom

    coordinates of tablen6: number of columns in tablen7: number of rows in tablen8: position of table on page

    (cur ren tly always left)n9-x: pairs of left, right

    coordinates for eachcolumn

    (x+1)-y: prs of top,bot coordina tesfor each row.

    KDC_STABLE

    J End Drop Cap None KDC_NDROP

    *k Sect ion change c1 : type (column, header,footer, caption, timest am p)

    n2: number of columnsn3: If type is header or footer

    then t his operand is theposition (left, right,center). If type is captionth en value is pictu re id.

    n4: I f type is column then thisoperand indicates if ther ear e vertical lines betweenthe column s. If type iscaption then value iscaption expands up ordown

    n5: number of vert ical half lines to output in t hedefault font if not captionor inset

    n6: use balanced column s forword pr ocessors tha t canha ndle balan ced cols andha rd col breaks (such asMS Word)

    n7: use balanced cols andha rd col breaks for wordprocessors th at do notproperly handle th is (suchas WordPer fect)

    n8-x: pairs of left, rightcoordinates for eachcolumn in t he section

    KDC_SECTION

  • 8/7/2019 XDOC DATA FORMAT

    18/58

    210 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Ta ble 21. XDOC Text Markups (cont.)

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    l P r in t ingwhitespace(leadercharacters)

    s1: r epea ted characte r in theleader

    n2: X coordinate of left edge of whitespace

    n3: distance from left edge of whitespa ce to next word

    n4: unique id of next word*n5: space char acter count for

    ta bulat ion from previousword (output only if moretha n one space)

    *n6: tab advance char actercount for ta bulation fromprevious word (out put if line is part of a t able)

    KDC_LEADER

    M St ar t/st opheadline

    none KDC_HEADLINE

    n S ta r t t ab le ce ll n1 : un ique id of t ab le tha t ce llbelongs t o

    n2 : cur ren t column numbern3: number of columns the cell

    spansn4: number of rows the cell

    spansn5: does this cell exist or is it a

    cont inu at ion of a cell aboveor to th e left

    n6: cell horizontal alignment(always; left)

    n7: number of decimal placesfor decimal alignment

    (always 2, not yetsupported)

    n8: cell vertical alignment(always top)

    KDC_SCOL

    O la ngu age n 1: MS Win dows code pa ge

    n2: language (see langids.h)

    KDC_LANGUAGE

    o Sta r t newta ble row

    n1: row height (always 0)n2-x: border codes for top, left,

    bottom r ight side of eachcell in th e row. A bordercode of 0 indicat es th at itis a "fake" border between

    two su b cells (joined orspan ned cells).

    KDC_SROW

  • 8/7/2019 XDOC DATA FORMAT

    19/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text M ark ups 211

    Ta ble 21. XDOC Text Markups (cont.)

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    p S ta r t -of-page n1 : logica l page numberc2: code for page orientat ion,

    portr ait or landscapen3: recomp on or off?c4: recognit ion moden 5: i ma ge s ke wn6: page untiltcosecant of

    angular correction alreadyapplied

    n7 : x resolu tionn8: y res olutionn9:n10: X,Y coordinates of original

    top left corn er of pagen11: page width, 0 if unknownn12: page height, 0 if unkn own

    n13: user set or okd zones(M_USER_SPECIFIED_R

    EGIONS) was setn14: user set or okd zone order

    (M_USER_SPECIFIED_ORDER) was set

    n15: word box units

    Q Qu es tion a blecharacter

    None KDC_QABLE

    q Characterconfidence

    n1: character confidence(0 - 999)

    KDC_CCONF

    R Re ve r se vid eo? n 1: s ta r t or s top KDC_REVERSE

    r Rulingdescriptor n1:n2: X,Y coordinates of themid-point of ruling

    c3: code for or ientat ionn4: tota l l ength of ru l ingc5: style: single, double, triplen6: thickness, 0 i f unknownn7: interval , 0 i f unknown

    n8: id

    n9: type (recognit ion or IP)

    n10: cell table rul ing?

    KDC_RULE

  • 8/7/2019 XDOC DATA FORMAT

    20/58

    212 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Ta ble 21. XDOC Text Markups (cont.)

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    S S h ift t o/fr omsuperscript

    None KDC_SUP ER

    s S t ar t -of-t extline

    n 1: zon e n um be rn2: X coordinate of the

    leftmost edge of the t exton this line

    n3 : d is tance from n2 to thebeginning of the text onthis line

    n4: unique id of next wordn5: Y coordinate of baseline*c6: the current s tyle tag:

    para graph, table or centerline

    *n7: identifier of primary font*n8: space char acter count for

    indent off the left mar gin(output if style is table, or

    if not 0)*n9: tab advance character

    coun t for ta bulation off th e left ma rgin (out put if style is table)

    KDC_STLINE

    t Text zonedescriptor

    n1 : zone iden t ifi ern 2: ou t pu t or de rn3: Y coordinate of the top of

    zonen4: he igh t of the zonec5 : p la in t ext or t ab le?s6: pr efixs7: su ffix

    s8: zone name, as defined bythe user

    n9: top coordinate of theoriginal r egion fra me

    n10: left coordinate of theoriginal r egion fra me

    n11; right coordinate of theoriginal r egion fra me

    n12: bottom coordinate of theoriginal r egion fra me

    n13: top border visible?n14: left border visible?n15: bottom border visible?n16 top border vis ible?

    n17: inverse video?

    KDC_TZONE

  • 8/7/2019 XDOC DATA FORMAT

    21/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text M ark ups 213

    Ta ble 21. XDOC Text Markups (cont.)

    Cod e Me a n in g Op e r a n d s XDOC De fin e

    U S h ift t o/fr omunderline

    None KDC_UNDERLINE

    u Dr oppedcapital letter(s)

    n1:n2:

    n3:n4: top, left , r ight , bot tom.

    coordinat es of the framen5: part ia l word or complete

    wordn6: number of l ines drop cap

    ha s to th e right of it (forfuture use)

    KDC_DROP

    v Web link n1-x web link KDC_LINK

    w Wordconfidence

    n1: word confidence (0-999) KDC_WCONF

    *X Colu m n b re ak *n1: har d or soft column break? KDC_COL_BK

    x Im age zonedescriptor

    n1 : zone iden t ifi er

    n 2: zon e or d ern3: code of the graphics com-

    pression scheme u sed forthe data .

    n4: Left coordinate of zonen5: Top coordinate of zonen6: Right coordinate of zonen7: Bottom coordinate of zones8: zone name, as defined by

    the user

    *n9: Left expanded frame

    *n10: top expanded frame

    *n11: right expanded frame

    *n12: bottom expanded frame

    KDC_PZONE

    Y Ch ar act er box n1: left

    n2: top

    n 3: r igh t

    n 4: bot tom

    KDC_CBOX

    y Text lineinformation

    n1: X coordinate of the r ightedge of the t ext zone a tthis line

    n2: dis tance from end of thetext to th e right edge of zone on th is line

    n3: Y coordinate of baseline*n4: number of l ine advances

    in curren t font t o nextline

    *c5: hard or soft l ine endinghere

    KDC_NDLINE

  • 8/7/2019 XDOC DATA FORMAT

    22/58

    214 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Z End of document

    None KDC_NDDOC

  • 8/7/2019 XDOC DATA FORMAT

    23/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification 31

    3XDOC File Structure

    This chapt er describes the st ru cture of an XDOC Text file. This informa tioncan help you determ ine how your applications should input XDOC Text dataand interpret i t .

    Ea ch XDOC Text file represent s a sepa ra te documen t a nd, in most cases,serves as th e sta rt ing point for r eferencing docum ent images. An XDOC Textfile does not cross-reference an y other XDOC Text files.

    Alth ough fur ther processing can combine documen ts or split them apa rt , theend r esult is a lways one or more XDOC Text files, each with its associatedgraph ics files, an d each sta nding independen t of the other s.

    3.1 Page Sequence

    An XDOC Text file arr an ges the text of a docum ent as a sequence of pages.However, th e sequen ce of pages in th e physical XDOC Text file is notnecessarily th e intended presen tat ion sequen ce. Each page in the docum entha s a logical sequence number t hat specifies the intended pr esenta tion order,but the pages in the physical XDOC Text file occur in working order.

    By concatenating individual XDOC files together, you can use a single XDOCText file to represen t m ultiple documents, su ch as t he chapt ers of a book. Thefirst page of each document would divide one chapter from another. If you doconcatenate documents, th e page num bering system mu st span the ent ireXDOC Text file.

    The XDOC reader in your a pplicat ion m ight first collate t he pa ges intoascending page num ber sequence, with a ny duplicates together . Whenreading a n XDOC Text file with m ultiple documen ts, a pa ge tha t begins adocum ent sh ould always be the first pa ge of that documen t r egardless of th ecollated order . All oth er pa ges in th e XDOC Text file can m igrat e from onedocum ent to an other for page collat ion pur poses.

    Ea ch document h as a n inter na l document n ame. When an XDOC Text filecont ains only one docum ent, as is usua lly the case, the inter na l documentna me is also the n am e of th e entire docum ent.

    The XDOC Text files should be stored within a Text InformationMana gement System, wher e document n am ing issues can be resolved.

  • 8/7/2019 XDOC DATA FORMAT

    24/58

    32 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Figure 31 shows four documents in an XDOC Text file. The first page of thedocumen t, Cha pter W, is pa ge 1. The first pa ge of Chapter X is page 20. Thefirst page of cha pter Y is 30. The first page of Chapt er Z is page 55.

    576 20 .............. 30...............5 7432 1

    Chapter W Chapter X Chapter Y Chapter Z

    v

    Page

    Documentboundaries

    6058..7 5 12 61 1 55. .

    v v v

    Figu re 31 . XDOC text file: macro structure

    The pages of Chap ter W ar e n o t in sequence. Pages 1, 5, and 7 a re dup licatedan d th e duplicates a re out of sequence.

    ScanSofts documen t recognition softwar e r epeatedly at tempt s t orecognize the data on each page unt il it can m ake a positive cha ra cteridentification. This mean s th at the information in th e last version of an yduplicated pa ge is always th e most accur at e.

    3.2 Text Data XDOC Text is cha racter da ta, wher e each char acter is an eight-bit byte. Thecha ra cter set is a m odified version of th e Intern ational Sta nda rdsOrgan ization ( IS O ) 8859/1 character set, (see Appendix C). Extensions to theISO set provide special char acters needed for high-end publishingapplications.

    XDOC Text includes many newline char acters th at m ake it easier to read thechara cter dat a with na tive-platform text display softwar e for U NIX, DOS, orMacintosh. (Conventional conversions are available for text display onplatforms oth er th an th e platform where t he XDOC Text file was created.)

    These newline cha ra cters a re not par t of the XDOC dat a itself. Instead,explicit XDOC Text ma rku ps indicate th e organization of th e source text intolines of text. XDOC Text includes n o other nonprint ing chara cters.

    XDOC text uses explicit whitespace mar kups, ra th er th an Space, Horizonta lTab an d Vertical Tab cha racters t o represent h orizonta l and vertical textspa cing. Refer to Cha pter 2, XDOC Text Ma rk ups , for m ore deta iledinformation.

    Store and ret rieve applicat ions ma y treat XDOC Text as a pur e byte streamwith no intern al stru cture. Since th e bytes are not limited to ASCII cha ractercodes, it may be n ecessary t o treat th ese files as bina ry.

  • 8/7/2019 XDOC DATA FORMAT

    25/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 X DOC Text Files 33

    Applications th at interpr et XDOC Text mu st incorporat e an XDOC Textpar ser. You can create a text pa rser either by direct coding or by specifying agrammar , which is th en converted to a program by a stan dard pa rsergenerator, such as lex on UNIX. Becau se XDOC Text is a regular langua ge,i t may be parsed with a Finite State Automaton.

  • 8/7/2019 XDOC DATA FORMAT

    26/58

  • 8/7/2019 XDOC DATA FORMAT

    27/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification 41

    4Page Image Analysis

    XDOC Text files include detailed information about the original page image.This chapt er explains h ow the documen t r ecognition softwar e initiallyperceives the pa ge layout , segments it int o areas of gra phics or t ext, han dlesru lings, an d ident ifies fonts.

    4.1 Units of Measure and Coordinate Transformations

    XDOC Text data contains X,Y coordinates for text and, if known, for the pageframe th at contains t he text. The r ecognition software positions the pa ge and

    th e text on an image with a r ecta ngular coordinate system wh ose origincoincides with t he u pper left h an d corn er of the image an d in wh ich Xincreases to the right an d Y increases downwar d.

    The recognition software may initially perceive a page as offse t and its texta s s k e w e d within t he coordina te system of th e image. A page th at is offset isrem oved from th e origins (0,0) of th e X an d Y axes of the coordina te fra me.Skewed text appears t o be rotat ed. It runs n either par allel nor perpendicularto the X and Y axes of the coordinate fram e.

    Figure 41 illustr ates a page with exaggerated sk ew and offset.

    (0,0)

    X-offset of nearPage corner

    Y-offset ofnear page corner

    Tilt

    w1

    Figu re 41 . Offset a nd sk ewThe length s of th e hea vy arr ows repr esent t he X offset a nd Y offset of thenear pa ge corner, an d the t i l t . These values ar e recorded in XDOC Text aspar t of the pa ge description.

  • 8/7/2019 XDOC DATA FORMAT

    28/58

    42 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Tilt is th e cotan gent of th e angle, which r epresent s th e angu lar skew. Tilt isth e nu mber of units of horizonta l movement needed along a line of text toencoun ter one un it of vertical displacement.

    XDOC Text never includes a tilt value of less tha n four . A tilt value t ha tsmall would correspond t o text whose angu lar s kew is un acceptably large for

    th e softwar e to perform accur ate recognition.

    A tilt of 4000 corresp onds to a n egligible angula r sk ew for pa ges of rea sonablelength a nd width . Consequen tly, very large tilts (very small angular skews)ar e represen ted by a fixed tilt value t ha t is appr oximately 4000.

    Since the actua l text skew in acceptable documen t ima ges is relatively small,th e recognition softwar e can elimina te it by approximate mea ns with outintr oducing noticeable err ors. Once the recognition software elimina tes skew,th e text lines are h orizontal an d th e left ma rgin is vertical.

    Note The recognition softwar e determ ines an d corr ects t he t ext skew, notthe pa per (page) skew.

    In XDOC Text, coordinates a re n ever actua lly skewed. Instead , skew isalways reduced to s h e a r by projecting t he sk ewed ba seline of a line of text t oth e left t ext ma rgin for t he pa ge. See Appendix A, Tran sformation Deta ils,for more information.

    Figure 42 shows the final tra nslat ion of th e image.

    F i g u r e 42. Translation of page origin to image originSince XDOC Text maint ains values for t he tilt an d th e tra nslat ion, you canalways register th e text positions quite accur ately with th e original image.

    However, since the recognition software maintains XDOC page coordinates inabsolut e un its, an d does not record the image resolution in XDOC Text, youmu st k eep tra ck of the scale of the text r elative to the scale of the imagedisplay.

  • 8/7/2019 XDOC DATA FORMAT

    29/58

    Ver. 4 .0 (R ev. 8 0), 0 5/ 1 5/ 9 9 X DO C P age I m a ge A n al ys is 43

    The recognition softwar e always assum es tha t th e image has th e coordina tes0,0 for its upp er left corner. You mus t a ccount for an y tr an slation of theimage in display. Refer t o Section 4.4, Algebra ic Tran sform at ions, for moreinform ation about t he tr an sformat ion of page coordina tes to imagecoordinates.

    4.2 Page Segmentation

    Many docum ents include both t ext and graph ic images, sometimes on th esame page image. These docum ents a re called c om p o u n d d o c u m e n t s . Therecognition softwar e can au tomat ically distinguish t ext from graph ics whenprocessing such pa ges and t reat each type of mat erial appr opriately.

    Figure 43 shows an illustr at ion of a compoun d documen t.

    F igu re 43 . Compound documen t

    XDOC Text describes the layout of a page in subunits called z o n e s . The termzone is non-specific; it does not n ecessarily denote a mea nin gful u nit of textor image. A zone can be any kind of page segment.

    A r e g i o n is one kind of page segment. A region is a defined area of the pageimage th at is delineat ed by an applicat ion en d user , or by th e recognitionsoftwar es au to-segmenta tion algorithm s.

  • 8/7/2019 XDOC DATA FORMAT

    30/58

    44 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    As shown in F igure 44, an application end -user can define r egions withinth e page by d rawing arbitr ary, convex polygons on th e pa ge. The recognitionsoftwar e does not rear ra nge such u ser-defined regions by oth er pr ocessing.

    Text regions identifyarea of page to beprocessed as text

    Preview toolbox

    Image regionidentifies area tobe processed asan image

    Figu re 44 . User -specified r egions

    A user does not n eed to segment compoun d docum ents m anu ally. Therecognition software automatically divides the image into regions of consistent texture.

    The r ecognition softwar e ma y th en bu ild these r egionseither text orimageinto XDOC Text pages, possibly one or m ore pages per image. Thesoftwar e only appropriates pa ges in an XDOC Text file for docum ents tha tcontain run ning text and t abular da ta. Pa ges are holders for r unn ing text.

    The r ecognition softwar e does not group r egions by pa ge for ta bles, forms, orgraph ics, wher e ph ysical position provides t he essential context.

    However, for r un ning t ext, the recognition software a na lyzes the text r egionson a page for typographic layout an d documen t str ucture a nd links relat edtext elements together into ga l l eys .

    Galley is a typogra phical term, mean ing a un it of contin uous text t ha tincludes typographic elements such as typeface, cha ra cter style, and inden ts,but ignores physical page breaks a nd gra phic elements.

  • 8/7/2019 XDOC DATA FORMAT

    31/58

    Ver. 4 .0 (R ev. 8 0), 0 5/ 1 5/ 9 9 X DO C P age I m a ge A n al ys is 45

    Figu re 45 . Galley

    Within XDOC Text, all these varieties of page segmentsuser-specifiedregions, auto-segmented regions, linked regions, and galleysare called

    zones.You can still distinguish regions, which are t he funda ment al un it of pagesegmenta tion within t he r ecognition system, from other types of zones bythe Change region (KDC_REGION) modifier within text zones in XDOCText.

    The recognition software never reorganizes graphics zones. A graphics zonein XDOC is identical to an image region with in t he r ecognition system.

    All zones ha ve b o u n d a r i e s . Text zones a lso have m a r g i n s . Boun dar iesprovide a fra mework th at encloses th e indicated text or grap hics. Theframework a ids furt her an alysis of individual zones.

    The r elationsh ips between bounda ries of multiple zones a re r elativelyunconstrained. For example, zones can overlap. Also, zones need not coverth e entire pa ge. Some, if not all, of the page a rea can be un zoned.

    For most applications, the topological layout and lexical order are moresignificant than the physical layout. Topological layout and lexical order arediscussed in S ection 4.2.1 an d Section 4.2.2, resp ectively.

    4.2.1 Topology A text zone in XDOC Text has both a f r a m e and c o n t e n t . The conten t issimply th e list of lines of text. Th e XDOC Text file pr esent s th is list of textlines directly.

    The recognition software derives the frame coordinates of a text zone, whichare X and Y values for an enclosing rectangle, from these lines of text. Thetopmost an d bottom-most baselines give th e Y coordinat es, and the rightmostand leftmost text lines (including any line margins) give the X coordinates.

  • 8/7/2019 XDOC DATA FORMAT

    32/58

    46 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    The frame is not th e boun dary for a text zone. The left a nd r ight boundar iesof a t ext zone a re defined by the left a nd right exten ts of each line of text.This allows a text zone to have irregular sides, which r un aroun d includedgraph ics. The top and bottom boundar ies, however, are st ra ight lines. Thesear e identical to th e top and bottom edges of the frames.

    Text zoneframes

    Text zoneboundaries

    Figu re 46 . Text zone frames a nd bounda ries

    Since text zone bound aries a re defined by the positions of the text lines, textzone bounda ries depend on t he corrections for t ilt a nd offset th at are a ppliedto each line of text , as discussed in Appendix A.

    Figure 47 shows the final tra nslat ion of th e image.

    F igu re 47 . Text an d picture zones on a desheared an d tra nslated page

  • 8/7/2019 XDOC DATA FORMAT

    33/58

    Ver. 4 .0 (R ev. 8 0), 0 5/ 1 5/ 9 9 X DO C P age I m a ge A n al ys is 47

    4.2.2 Lexical order The lexical ordering of text zones preserves t he r eading order. H ighlysegmented pages can h ave many reading threads, which mean s tha t ma nylexical orderings a re a llowable for th ose text zones. No r eading t hr ead isinterr upt ed by some other r eading thr ead in th e listed sequence of th e zones.

    In an XDOC Text file, the recognition software sequences the lines of text in

    th e lexical order specified by th e zone order. Th e lexical order of th e zones isspecified by the sequ ence of text zone fra mes.

    4.2.3 Image zones In most respects, XDOC Text handles image zones very differently from textzones. For example, the content of each image zone is stored in a separa tefile. The image in th is file ha s a r ectan gular fram e whose sides ar e para llel tothe axes of the original image. However, th e image within t ha t fram e will betilted if th e original page image wa s skewed. Unlike skewed text, th erecognition software does not correct image skew in any way.

    The origin of an image is at th e upper left corn er of the frame of the imagezone. XDOC tr eat s th e edges of th e fram e of th e image zone as crop lines for

    displaying t he ima ge in th e associated file.

    4.3 Rulings R u l i n g s are h orizontal or vertical lines on the p age image. XDOC trea tsthem like zones without t ext. Rulings ar e a pr operty of the page but n ot aproperty of th e text.

    XDOC Text specifies the orientation of each ruling as horizontal or vertical.The p osition of a horizonta l ru ling is defined by th e X,Y coordin at e of its m id-point and its t otal length (X extent). Similarly, the position of a verticalru ling is given by th e X,Y coordina te of its m id-point a nd it s tota l length (Yextent).

    The tr an sforma tions for r ulings from a skewed image to XDOC Text arealmost exactly the same as th e tra nsforma tions th at t he recognition softwar euses to eliminate sk ew and offset in lines of text a nd in zone bounda ries. Theonly difference is that the recognition software positions rulings by referenceto their mid-points only.

    The r ecognition softwar e projects t he en d points of all ru lings a t t he sk ewan gle to th e left edge of the paper (Figure 48) an d th en pr ojects t hem back horizonta lly to th eir shea red position.

  • 8/7/2019 XDOC DATA FORMAT

    34/58

    48 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    Ruling end pointsprojected to leftedge of paper

    Figu re 48. Rulings on skewed a nd offset pa ge

    Vertical rulings rema in vertical and horizonta l rulings remain horizont al.The ru lings ar e now easily tra nslat ed to the up per left of the ima ge bytra nslat ing each ru lings mid-point, as shown in F igure 49.

    F igu re 49 . Rulings after deshearing and translation

  • 8/7/2019 XDOC DATA FORMAT

    35/58

    Ver. 4 .0 (R ev. 8 0), 0 5/ 1 5/ 9 9 X DO C P age I m a ge A n al ys is 49

    4.4 Algebraic Transformations

    This s ection describes specifically how to t ra nsform t he following:

    Convert page coordina tes to image coordinates

    Convert skew informat ion to degrees

    Convert origina l image coordinates to deskewed coordinates

    4.4.1 Convert page coordinates to image coordinates

    The appr oximations used to tr an sform im age coordina tes to the pa gecoordinates in XDOC Text introduce negligible errors. To perform the reversetra nsform ation, from pa ge coordina tes t o ima ge coordina tes, you can use t heclassical tr igonometric affine tr an sformat ion t o scale, rotate, an d tr an slateth e page coordinates.

    Fu rth ermore, since the sk ew angle is almost always quite small, you can u se

    small an gle approximat ions in th is tra nsforma tion. This simplifiestrigonometr ic fun ctions t o ar ithmet ic fun ctions.

    XDOC Text indirectly provides the X,Y coordinates of many interestingpoints on a page, for exa mple, th e X,Y coordina tes of the lower left corn er of an a rbitrary word.

    If that word is the first on its t ext line, the X coordina te is th e sum of th esecond and third operands of the Start-of-line (KDC_STLINE) modifier. Foran y oth er word in th at line, th e X coordinate is th e sum of the first a ndsecond operand s of the pr eceding Nonprint ing whitespace (KDC_HSPACE)modifier or Leader character (KDC_LEADER) modifier.

    The Y coordina te of this point is th e sum of the baseline for th is text line an dth e length of the avera ge descender for the p rimar y font of th e line. Thebaseline a ppear s in both t he four th operand of the KDC_STLINE m odifieran d th e th ird operan d of the Text line inform at ion (KDC_NDLINE ) modifier.

    The length of a descender for t he pr imary font of the line is th e eighthoperand less the n inth opera nd of th e Font inform ation (KDC_FONTINF O)modifier for t ha t pr imary font . The pr imary font for t he t ext line is given inth e sixth operand of the KDC_STLINE modifier for t ha t line.

    The coordinates of the other corn ers of the word (in the page coordinatesystem) are obtained as follows:

    The Y coordinate of the lower right corner is identical to the Y coordinateof th e lower left corner , which wa s described a bove.

    The Y coordinates of th e two upper corn ers are also identical. Their valueis the baseline value minus t he height of large cha ra cters in th e primar yfont of the line. This height is t he seventh operand of theKDC_FONTINFO modifier for the primary font of the line.

    The X coordinate of the upper left corner is identical to the X coordinateof th e lower left corner .

  • 8/7/2019 XDOC DATA FORMAT

    36/58

    410 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    The X coordinates of th e two left corn ers are also identical. If the word isthe last word in t he line, this X coordina te value is th e first operand lessthe second operand in th e KDC_NDLINE modifier of th is line.

    For all other words in this text line, the value of the X coordinate of theleft corners is the first operand of the immediately followingKDC_HSPACE modifier, or t he second opera nd of th e followingKDC_LEADER m odifier.

    Let ( x, y) be the coordinates of the upper left-hand corner of the frame forth e cur rent page. The upper left corner a nd t he dimensions of this page, if known, are given in th e fifth t hr ough eight h operan ds of th e Star t page(KDC_STPAGE) modifier. (Otherwise, th e image fram e, as given in th e Pa geinform ation (KDC_NDPAGE) modifier, is ta ken as the page fram e.)

    The unit of measure of all coordinates in XDOC Text is 0.1 millimeters (mm).All coordinates in the page of text must be scaled to the size of the displayimage. Multiply each X coordinate and Y coordinate by a constant, which isth e num ber of un its of distan ce on th e displayed image needed to represent

    0.1 mm on th e source image.

    Let (X,Y) be the arbitrary point (x,y) after scaling and let ( X, Y) be th eposition of th e upper left h an d corn er of the pa ge on the image after scaling.The cotan gent of the a ngular skew of the t ext within th e image is the firstoperand of the KDC_NDPAGE modifier. Let T (for tilt) be this cotangent.

    Since the an gular skew is small, the t an gent of the an gular skew is small andappr oximately equal to th e sine of th e an gular sk ew. The cosine of th eangular skew is very close to 1.0. Thus, the following simple, approximate(but rather accurate) coordinate transformation carries (X,Y) to (X,Y), theequivalent ima ge coordinate.

    X = X - Y/T + X

    Y = Y + Y

    The design inten t of XDOC is to describe, to a word pr ocessor, a pa ge as seenby recognition. The X values foun d in XDOC out put very accura tely positionword boun dar ies (begin word , end word).

    However, th e Y values a re chosen from am ong the char acter Y values, suchth at a coherent horizonta l line of text will be rend ered. A Y value, somewherebetween the ma ximum an d minimum , is derived using proprietary heur istics.Thus, t he h orizontal placement of a given Y value cann ot be assum ed to be atan y specific position on th e horizonta l line.

    Note When tr an sforming page coordinates t o ima ge coordina tes on a multi-zone docum ent, you should process each zone separ ately rat her tha napplying tra nsform ations to t he pa ge as wh ole. Refer t o Appendix A,Transformation Details, for more in-depth information.

  • 8/7/2019 XDOC DATA FORMAT

    37/58

    Ver. 4 .0 (R ev. 8 0), 0 5/ 1 5/ 9 9 X DO C P age I m a ge A n al ys is 411

    4.4.2 Convert skew information to degrees

    The a ngular correction of th e text on a page can be retr ieved from XDOCfrom th e KDC_STPAGE mar kup or the KDC_NDPAGE ma rku p.

    The following is an example of a KDC_STPAGE ma rku p with its operand s:

    [p;1;P;0;S;0;-20;400;400;0;0;2142;2794;0;0;1]

    The four th operand, (-20) repr esents t he cosecan t of the a ngular correctionalready a pplied to th e text. This value can be positive or n egative dependingon th e an gle of correction. To convert th is value t o degrees do th e following:

    arcsin(1/cosecant) = angle of correction

    Table 41 shows some exam ples of this tr ansformat ion.

    Ta ble 41. Sample Transformations

    C o s e c a n tC a l c u l a t e d

    An gle o f Cor r ec t ion P e r ce ive d T ilt

    4000 0.014 degrees in sign ifican t

    200 0.28 degrees barely t ilt ed

    -30 -1.9 degrees qu ite t ilted

    20 2.8 degrees qu it e t ilted

    12 4.7 degrees severe t ilt

    4.4.3 Convert original image coordinates to deskewed coordinates

    The V_XDC_WBOX tagged valu e cont rols the out pu t of word bound ing boxcoordinates in XDOC forma t. When th e value of this tag is non-zero, the APIoutpu ts word-bound ing boxes for all word wh ose confiden ce is less th an orequal t o the value of V_ACCEP T_THRESH . The default va lue is 0.

    To include boun ding boxes for all words, set t he va lue of V_ACCEPT_THRESH to 999.

    Word boun ding boxes enable you t o locate and highlight words on the pageimage seen by r ecognition. Word boun ding boxes ar e pa ssed to th e textoutput system from recognition; boun dar ies include ascenders and

    descenders, wher e applicable. Becau se the a scender s an d descenders a reincluded, you cannot u se bounding box coordina tes t o determine t he wordbaseline.

    If you want to use word bounding box coordina tes to r efer t o the ima ge a f t e rrecognition, you mu st sa ve the im age if any prepr ocessing, such as r otation ordeskewing, was performed. The coordina tes ma tch th e image passed torecognit ion from p repr ocessing.

  • 8/7/2019 XDOC DATA FORMAT

    38/58

    412 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    However, if the deskewed ima ge is not available, this section d ocum ents th eformu la used to map coordinates back an d forth from t he origina l bitma p tothe deskewed bitmap when an image has been deskewed using thepreprocessing routines (PP_SKEW, PP_SKEW_C).

    Le t w and h be the width an d height of the bitm ap r espectively.

    Assume a coordinate system where t he pixel at th e center of the bitma p, at(w/2,h/2 ), is the origin and all coordinates are referenced as offsets from thisorigin. In creasing y is up, increasing x is to the r ight .

    The an gle theta is measu red in t he clockwise positive sense.

    (0,0)

    (0,-h/2)

    (0,h/2)

    (-w/2,0) (w/2,0)

    (-w/2,-h/2)lowerleft

    (-w/2,-h/2)lowerright

    (-w/2,h/2)upperleft

    (w/2,h/2)upperright

    Figu re 410 . Deskewing coordinate system

    To find th e input pixel th at got m oved to a given output coordinate in term sof th e above coordin at e system :

    x_in = x_out - y_out*sin(theta)y_in = x_out*sin(theta)+y_out

    To find t he outpu t coordina te t ha t a given inpu t pixel will be relocated to :

    x_out = (x_in + y_in*sin(theta) )/(1+sin 2 (theta) )y_out = (-x_in*sin(theta)+y_in )/(1+sin 2 (theta) )

    Note, that th is is different from a tru e rotation in which :

    x_in = x_out*cos(theta) - y_out*sin(theta)y_in = x_out*sin(theta) + y_out*cos(theta)

  • 8/7/2019 XDOC DATA FORMAT

    39/58

    Ver. 4 .0 (R ev. 8 0), 0 5/ 1 5/ 9 9 X DO C P age I m a ge A n al ys is 413

    4.5 Fonts A fon t is a set of typeface chara cters of a p art icular design a nd size.

    The recognition softwar e uses letter sizes an d letter sh apes distinguish onefont from a nother font. After encoun tering a reasonable nu mber of letters of aconsistent size an d sha pe, the softwar e assigns a n umer ic font identifier toth is letter category. This font identifier is great er th an or equal to 1.

    If the recognition software sees too few letters in a particular size category, itassigns th ose cha racters to the nu meric ident ifier 0. The 0 font identifier ha sno corresponding font description an d indicat es th at th ese cha ra cters do notbelong t o any pa rticular font .

    The r ecognition softwar e ha s n o mecha nism for merging categories once theyha ve been identified. Thu s, you cann ot be certa in th at two similar ly sizedfont s are actua lly identical, although you can m ake th at a ssum ption in th einterests of simplicity.

    A font descript ion consists of four ba sic types of dat a:

    font family name,

    font style (serif, italic, roman , bold),

    font spacing (mono-spaced vs. variable-spaced)

    fon t s ize

    The r ecognition softwar e measu res font size in t hr ee different ways. The firstis colum n widt h. Column width is valid only for m ono-spa ced fonts.

    The second is in th e appr oximate typogra pher s point size for t he font face, inun its th at are 72 points to the inch. Point size is rounded off to th e near estpopular point size: 6, 7, 8, 10, 12, 14, 18, 24, and 36.

    The th ird size measur ement is a set of four values for t he h eight s of foursepara te t ypes of cha ra cters: uppercase, lowercase with n o ascender ordescender, lowercase with a n a scender, a nd lowercase with a descender.

    Figure 411 (next pa ge) illustr at es the dista nces used to obtain th e fourchara cter height values for font values.

    Height of upper-case letter

    Height of lowercaseletter noascenderor descender

    Height of lowercaseletter with ascender

    Height of lowercaseletter with descender

    A

    n

    d

    pFigu re 411. Letter h eights

  • 8/7/2019 XDOC DATA FORMAT

    40/58

    414 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    The recognition softwar e gather s th ese values by averaging the mea sur ementof individua l cha ra cters in t he designated category for the pa rticular font. Inmost cases, the n um ber of samples is quite lar ge. The u nit of font sizemeasu remen t, outside of th e typographer s point size, is 0.1 millimeter s (mm).

  • 8/7/2019 XDOC DATA FORMAT

    41/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification A1

    ATransformations

    This appen dix provides additiona l details about t he t ra nsform at ion of imagecoordinates to page coordinates in XDOC Text.

    A.1 Text Lines In XDOC Text, coordinates a re n ever actua lly skewed. Instead, sk ew isalways redu ced to shea r by projecting th e skewed baseline of a line of text tothe left text mar gin for the pa ge. See Figure A1 and F igur e A2.

    Baseline Yfor each skewed lineof text

    Figu re A1 . Baseline Y coord ina tes of skewed text

    Top edge of

    Lines of text

    Bottom edge of page

    Figu re A2 . Shear ed approximat ion to skew

    Thus, each line of text has a single Y coordinate, or baseline value. Lines of text in XDOC always appear a s horizonta l.

  • 8/7/2019 XDOC DATA FORMAT

    42/58

    A2 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    The document recognition software deshears all X coordinates in XDOC Text.This means that vertical text margins, such as the left margin, appearvertical. This pr ocess is illustra ted in Figure A3.

    F igu re A3 . Desheared lines of text

  • 8/7/2019 XDOC DATA FORMAT

    43/58

    Ver. 4.0 (R ev. 80), 05/ 1 5/ 99 Tra nsfor m at ion Det ai ls A3

    A.2 Zones Figures A4, A5, and A6 illustrate the process of transforming an offsetan d skewed page image to a deshear ed page.

    Text zones

    Picture zone

    Figu re A4 . Text zones a nd a picture zone on a sk ewed and offset pa ge

    F igu re A5 Skew converted to shear, zone by zone

    F igu re A6 . Text an d pictur e zones on a desheared an d tra nslated page

  • 8/7/2019 XDOC DATA FORMAT

    44/58

    A4 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    XDOC Text portr ays th e lines of text as horizont al an d gives them the Ycoordinate wh ere th e tilted line of text int ersects th e left text ma rgin for t hetext zone.

    A.3 Rulings The first step in t he t ra nsform ation of a skewed ima ge to XDOC Text is theconversion of skew to shear. This happens before the recognition softwareproduces XDOC Text. In the case of ru lings, which a re pa ge-wide featu res,the software performs shea ring relat ive to the left edge of th e page.

    Vertical rulings ar e always vertical, even on a sh eared XDOC Text page. Thiscreates gaps at the junctions of horizont al a nd vertical r ulings. However, onceth e softwar e has corrected shear , these gaps disappear.

    The softwar e projects th e end points of all ru lings at the s kew an gle to theleft edge of th e paper an d th en pr ojects t hem back horizont ally to theirsheared position (Figure A7).

    F igu re A7 . Rulings converted from skew to shear at end points

    Fr om th ese positions, the softwar e calculat es th e positions of the m id-point s(Figur e A8). The vertical rulings never appea r t o be sheared since th esoftwar e only present s th e mid-points. The shea red ru lings still have gaps.

    F igu re A8 . Shear ed rulings when presented a t mid-points only

  • 8/7/2019 XDOC DATA FORMAT

    45/58

    Ver. 4.0 (R ev. 80), 05/ 1 5/ 99 Tra nsfor m at ion Det ai ls A5

    After deshear ing, however, as shown in F igure A9, th e softwar e restores th ehorizonta l and vertical ru lings to their corr ect r elative positions. Thesoftwar e performs the d eshear ing, which is a horizonta l displacementproportional to the Y coordinate of a point, at the mid-point of each ruling.

    F igu re A9 . Rulings after deshearing

  • 8/7/2019 XDOC DATA FORMAT

    46/58

  • 8/7/2019 XDOC DATA FORMAT

    47/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification B1

    BSample XDOC Text

    This appen dix provides a sam ple page image an d its corresponding XDOCText file.

    B.1 Sample Page Image

    This document h as mu ltiple column s conta ining alphan umer ic dat a, tele-phone nu mbers, an d cur rency amounts. It includes severa l typefaces andstyles.

    F i g u r e B 1. Sample page image

  • 8/7/2019 XDOC DATA FORMAT

    48/58

    B2 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    B.2 Sample XDOC Text

    This XDOC Text file includes a ll required ma rku ps only. The lines h ave beenbroken artificially for readability.

    [a;"XDOC.12.0";E;"FWX12.5"]

    [d;"beth.xdc"][p;1;P;1;S;0;-434;400;400;0;0;2150;2794;0;0;1][t;2;1;227;386;A;"";"";"";0;0;2140;2793;0;0;0;0;1][r;1069;2;H;2140;s;1;0;0;R;0][f;0;"";R;s;2540;F;0;0;0;10;100][f;2;"T";T;q;2963;V;34;34;24;14;100][f;3;"T";B;q;2624;V;30;29;21;12;100][f;4;"C";R;q;1947;F;22;21;16;9;100][O;1252;1][s;2;244;522;1;275;c;2;17][k;T;1;-2;1;1;1;1;243;1887][e;2][c;2]New[h;854;14;2]England[h;1045;12;3]Begonia[h;1224;16;4]Society[y;1883;501;275;1;H][s;2;244;580;5;327;c;2;19]Annual[h;980;13;6]Fund[h;1105;10;7]Donations[y;1883;558;327;2;H][s;2;244;129;7;469;t;3;4;1][c;3]Name[h;474;452;8;17;1]Address[h;1070;377;9;14;1]Telephone[h;1633;84;10;3;1]Donation[y;1883;0;469;2;H][s;2;244;5;11;543;t;4;0;1][c;4]Peter[h;351;24;12]Adams[h;477;158;8;1]1234[h;712;27;13]Rock[h;822;25;14]Lane,[h;941;32;15]Bath,[h;1068;31;16]ME[h;1139;27;17]01201[h;1265;133;18;6;1]207-555-2314[h;1648;103;19;5;1]25.00[y;1883;32;543;2;H][s;2;244;3;20;592;t;4;0;1]John[h;331;22;21]Albert[h;478;156;22;8;1]321[h;691;26;23]Riley[h;822;23;24]Road,[h;940;33;25]Bath,[h;1068;31;26]ME[h;1139;27;27]01201[h;1265;133;28;6;1]207-555-3425[h;1648;104;29;5;1]15.00[y;1883;32;592;2;H][s;2;244;5;30;649;t;4;0;1]Patricia[h;414;24;31]Arnel[h;541;93;32;4;1]822[h;689;27;33]Martin[h;843;24;34]Drive,[h;982;33;35]Bath,[h;1110;31;36]ME[h;1181;27;37]01201

    [h;1307;91;38;4;1]207-555-7869[h;1647;104;39;5;1]30.0[y;1883;54;649;2;H][s;2;244;3;40;704;t;4;0;1]Heidi[h;348;27;41]Barnett[h;521;113;42;5;1]666[h;690;27;43]Billiard[h;885;22;44]Ave.,[h;1004;33;45]Bath,[h;1131;32;46]ME[h;1203;27;47]01201[h;1328;70;48;3;1]207-555-4454[h;1648;104;49;5;1]65.00[y;1883;32;704;2;H][s;2;244;3;50;760;t;4;0;1]Rocky[h;353;24;51]Balboa[h;500;135;52;6;1]565[h;691;27;53]Bliss[h;819;25;54]Way,[h;920;32;55]Bath,[h;1047;31;56]ME[h;1118;27;57]01201[h;1244;155;58;7;1]207-555-1357[h;1648;103;59;5;1]20.00[y;1883;31;760;2;H][s;2;244;4;60;817;t;4;0;1]Scott[h;349;27;61]Booker[h;500;134;62;6;1]787[h;690;27;63]Hunan[h;822;23;64]Avenue,[h;983;32;65]Bath,[h;1110;31;66]ME[h;1182;27;67]01201[h;1308;91;68;4;1]207-555-2465[h;1648;103;69;5;1]25.00[y;1883;32;817;2;H][s;2;244;3;70;873;t;4;0;1]Henry[h;352;24;71]Bzada[h;479;155;72;7;1]111[h;691;26;73]Roberts[h;862;26;74]Drive,[h;1004;33;75]Bath,[h;1132;31;76]ME[h;1203;27;77]01201[h;1329;70;78;3;1]207-555-2244[h;1648;82;79;4;1]100.00[y;1883;32;873;2;H][s;2;244;2;80;929;t;4;0;1]Andrew[h;373;24;81]Chate[h;500;134;82;6;1]567[h;690;27;83]Rodway[h;844;24;84]Lane,[h;961;33;85]Bath,[h;1088;32;86]ME[h;1160;27;87]01201[h;1286;113;88;5;1]207-555-5689[h;1648;103;89;5;1]10.00[y;1883;32;929;2;H][s;2;244;2;90;986;t;4;0;1]Andrea[h;371;25;91]Challen[h;544;90;92;4;1]88[h;669;29;93]Patterson[h;884;25;94]Road,[h;1004;33;95]Bath,[h;1131;32;96]ME[h;1203;27;97]01201[h;1328;71;;98;3;1]207-555-0987[h;1648;102;99;5;1]25.00

  • 8/7/2019 XDOC DATA FORMAT

    49/58

    Ver. 4.0 (R ev. 80), 05/ 15/ 99 S am ple XDOC Text B3

    [y;1883;32;986;2;H][s;2;244;2;100;1041;t;4;0;1]William[h;396;23;101]Davis[h;520;114;102;5;1]5[h;648;28;103]Broadway,[h;856;33;104]Bath,[h;984;31;105]ME[h;1055;27;106]01201[h;1180;220;107;11;1]207-555-1212[h;1648;104;108;5;1]35[h;1787;1;1093]-[h;1804;13;110]00[y;1883;31;1041;2;H][s;2;244;2;111;1097;t;4;0;1]Adrian[h;373;24;112]Davis[h;499;135;113;6;1]32[h;669;27;114]Whitney[h;844;25;115]Place,[h;983;33;116]Bath,[h;1111;31;117]ME[h;1182;27;118]01201[h;1307;92;119;4;1]207-555-6935[h;1649;103;120;5;1]50.00[y;1883;31;121;1097;2;H][s;2;244;3;122;1153;t;4;0;1]Howard[h;373;24;123]Dundee[h;521;113;124;5;1]9546[h;712;28;125]LightAvenue[h;968;69;126]Bath,[h;1131;32;127]ME[h;1203;27;128]01201[h;1328;71;129;3;1]207-555-0283[h;1647;105;130;5;1]15.00[y;1883;31;131;1153;2;H][s;2;244;2;132;1209;t;4;0;1]Ralph[h;351;25;133]Ester[h;478;156;134;8;1]35[h;669;27135]Togo[h;777;25;136]Way,[h;876;33;137]Bath,[h;1004;31;138]ME[h;1075;27;139]01201

    [h;1201;198;140;10;1]207-555-3729[h;1648;102;141;5;1]25.00[y;1883;31;1209;2;H][s;2;244;2;142;1266;t;4;0;1]Robert[h;371;26;143]Ezra[h;478;155;144;7;1]456[h;690;27;145]Outland[h;863;23;146]Avenue,[h;1025;32;147]Bath,[h;1152;31;148]ME[h;1223;27;149]01201[h;1350;49;150;2;1]207-555-0447[h;1647;104;151;5;1]35.00[y;1883;32;1266;2;H][s;2;244;3;152;1323;t;4;0;1]Jane[h;328;26;153]Farguar[h;500;135;154;6;1]44[h;670;29;155]Peters[h;819;26;156]Road,[h;941;33;157]Bath,[h;1068;32;158]ME[h;1140;27;159]01201[h;1265;134;160;6;1]207-555-9742[h;1647;103;161;5;1]20.00[y;1883;32;1323;2;H][s;2;244;4;162;1379;t;4;0;1]Bill[h;327;28;163]Fuzell[h;477;158;164;8;1]5[h;648;27;165]Rockpoint[h;862;26;166]Road,[h;983;33;167]Bath,[h;1110;32;168]ME[h;1181;28;169]01201[h;1307;92;170;4;1]207-555-8080[h;1648;103;171;5;1]15.00[y;1883;32;1379;2;H][s;2;244;3;172;1436;t;4;0;1]Jake[h;328;26;173]Fratz[h;455;180;174;9;1]695[h;691;26;175]Neary[h;822;23;176]Road,[h;940;33;177]Bath,[h;1068;31;178]ME[h;1139;27;179]01201[h;1264;135;180;6;1]207-555-0865[h;1648;102;181;5;1]25.00[y;1883;32;1436;2;H][s;2;244;2;182;1493;t;4;0;1]Joe[h;306;25;183]Gunda[h;435;199;184;10;1]900[h;690;28;185]Finn[h;799;23;186]Avenue,[h;961;33;187]Bath,[h;1089;31;188]ME[h;1160;27;189]01201[h;1285;113;190;5;1]207-555-2499[h;1647;103;191;5;1]25[h;1786;13;192]-[h;1803;13;193]00[y;1883;32;1493;2;H][s;2;244;2;194;1549;t;4;0;1]B.G.[h;322;31;195]Hunter[h;477;157;196;8;1]12[h;668;28;197]Ryder[h;798;26;198]Lane,[h;918;33;199]Bath,[h;1046;31;200]ME[h;1117;27;201]01201

    [h;1243;155;202;7;1]207-555-1009[h;1647;82;203;4;1]150.00[y;1883;32;1549;2;H][s;2;244;4;204;1604;t;4;0;1]Peter[h;350;25;205]Jacobs[h;498;136;206;6;1]33[h;668;30;207]PeeWee[h;820;25;208]Way,[h;919;33;209]Bath,[h;1047;31;210]ME[h;1118;27;211]01201[h;1243;156;212;8;1]207-555-9989[h;1648;103;213;5;1]30.00[y;1883;32;1604;2;H][s;2;244;4;214;1661;t;4;0;1]Paul[h;326;27;215]Jacoby[h;480;153;216;7;1]7899[h;711;27;217]Helter[h;862;25;218]Drive,[h;1003;33;219]Bath,[h;1131;31;220]ME[h;1202;27;221]01201[h;1328;71;222;3;1]207-555-3343[h;1647;104;223;5;1]35.00[y;1883;32;1661;2;H][s;2;244;1;224;1717;t;4;0;1]Dorothy[h;394;24;225]Johnson[h;564;69;2263;1]46

  • 8/7/2019 XDOC DATA FORMAT

    50/58

    B4 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    [h;669;26;227]Main[h;778;25;228]Street,[h;940;34;229]Peabody,[h;1131;31;230]MA,[h;1216;35;231]01970[h;1349;49;232;2;1]207-555-4318[h;1648;103;233;5;1]25.00[y;1883;32;1717;2;H][s;2;244;1;234;1772;t;4;0;1]Kenneth[h;393;25;235]Kuntz[h;519;113;236;5;1]2[h;646;28;237]Cyndy[h;778;22;238]Avenue,[h;940;32;239]Bath,[h;1067;31;240]ME[h;1138;27;241]01201[h;1264;134;242;6;1]207-555-9823[h;1646;105;243;5;1]10.00[y;1883;32;1772;2;H][s;2;244;0;244;1830;t;4;0;1]Jacqueline[h;456;26;245]Kitee[h;584;49;246;2;1]567[h;689;27;247]Cassia[h;841;25;248]Drive,[h;981;33;249]Bath,[h;1109;31;250]ME[h;1180;27;251]01201[h;1307;91;252;4;1]207-555-0978[h;1647;104;253;5;1]10.00[y;1883;32;1830;2;H][s;2;244;1;254;1886;t;4;0;1]Chen[h;329;25;255]Ka[h;392;242;256;12;1]545[h;690;27;257]Built[h;819;26;258]Road,[h;940;33;259]Bath,[h;1067;32;260]ME[h;1139;27;270]01201[h;1264;135;271;6;1]207-555-3719[h;1648;104;272;5;1]30.00[y;1883;32;1886;2;H][s;2;244;1;273;1941;t;4;0;1]Harold[h;372;25;274]Langlois[h;562;70;275;3;1]200[h;690;25]Wayland[h;864;22]Avenue,[h;1025;33]Bath,[h;1152;32]ME[h;1224;27;276]01201[h;1349;49;277;2;1]207-555-1947[h;1647;104;278;5;1]25.00

    [y;1883;32;1941;2;H][s;2;244;0;279;1999;t;4;0;1]William[h;394;24;280]Lyttle[h;541;91;281;4;1]43[h;667;30;282]Sudbury[h;843;23;283]Drive,[h;982;33;284]Bath,[h;1109;31;285]ME[h;1180;27;286]01201[h;1307;91;287;4;1]207-555-1009[h;1647;104;288;5;1]35.00[y;1883;32;1999;2;H][s;2;244;1;289;2055;t;4;0;1]Rawley[h;372;23;290]Mooler[h;520;112;291;5;1]799[h;689;28;292]Concord[h;863;25;293]Lane,[h;982;33;294]Bath,[h;1109;32;295]ME[h;1180;27;296]01201[h;1306;92;297;4;1]207-555-2357[h;1647;103;298;5;1]25.00[y;1883;32;2055;2;H][s;2;244;1;299;2111;t;4;0;1]George[h;369;27;300]Smyth[h;499;133;301;6;1]1234[h;711;25]Wash[h;820;23]Ave,[h;917;35]Peabody,[h;1109;32]MA[h;1181;26]01970[h;1306;91;302;4;1]207-555-1212[h;1647;104;303;5;1]10.00[y;1883;33;2111;0;H][g;285;0;0;2150;2794,0]

  • 8/7/2019 XDOC DATA FORMAT

    51/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification C1

    CCharacter SetThe characters in an XDOC file are encoded using the codes defined byMicrosoft Windows code pages. XDOC ma rku ps embeded in th e text u se th eASCII (7 bits) su bset of each code pa ge. ScanS oft Incs r ecognition softwar ecan r ecognize most cha ra cters in cluded in t he following Microsoft Windowscode pa ges: Lat in1 (1252), Centr al Eu rope (1250), Greek (1253), Tur kish(1254), Cyrillic (1251) and Baltic (1257).

    These Microsoft Windows code page definitions can be foun d at th is web site:

    http://www.microsoft.com/globaldev/reference/wincp.asp

    The following char acter s a re n ot recognized by ScanSoft Incs r ecognit ionsoftware (listed as hex values):

    Lat in1 (1252):5e, 7e, 7f, 81, 83, 84, 85, 86, 87, 88, 89, 8b, 8d, 8e, 8f, 90, 96, 98, 99, 9b, 9c, 9e,a0, a 2, a4, a 6, a8, a a, a c, af, b2, b3, b4, b5, b7, b8, b9, ba, d7, de, fe

    Western Eu rope (1250):5e, 7e, 7f, 81, 83, 85, 86, 87, 88, 89, 8b, 90, 96, 98, 99, 9b, a0, a 1, a2, a 4, a6,a8, a c, b2, b4, b5, b7, b8, bd, d7, ff

    Greek (1253):Fe, 7e, 7f, 81, 83, 85, 86, 87, 88, 89, 90, 96, 98, 99, a0, a1, a 4, a6, a 8, aa , ac, af,

    b2, b3, b4, b5, d2, ff

    Turkish (1254):fe, 7e, 7f, 81, 83, 85, 86, 87, 88, 89, 8b, 8d, 8e, 8f, 90, 96, 98, 99, 9b, 9d, 9e, a0,a2, a 4, a6, ac, af, b2, b3, b4, b5, b7, b8, b9, ba, d7

    Cyr illic (1251):5e, 7e, 7f, 85, 86, 87, 89, 8b, 96, 98, 99, 9b, a4, a 6, ac, b5, b7

    Baltic (1257):5e, 7e, 7f, 81, 83, 85, 86, 87, 88, 89, 8a, 8b, 8c, 8d, 8e, 8f, 90, 96, 98, 99, 9a, 9b,9c, 9d, 9e, 9f, a0, a 1, a2, a 4, a5, a 6, a8, b2, b3, b4, b5, b7, b8, b9, d7, ff

  • 8/7/2019 XDOC DATA FORMAT

    52/58

    C2 XDOC Data Form at: Technical S pecification Ver. 4.0 (Rev. 80), 05/ 15/ 99

    ScanSoft also defines th ree cont rol codes for represent ing th ree char actersnot repr esented in t he su pported code pages. The following chara cters aredefined in th e icrpub.h include file, and u nlike the other cha ra cters, shouldalways be referred to by th eir symbols rat her th an t heir values.

    N OT _E Q U AL : This is the not equal ( !=).

    L E S S _O R _E Q U AL : This is th e less th an or equa l (=).

  • 8/7/2019 XDOC DATA FORMAT

    53/58

    Ver. 4.0 (Rev. 80), 05/ 15/ 99 XDOC Data Form at: Technical S pecification I1

    Index

    AAffine tr an sform at ion, 4, 9Angular correction

    retr ieving, 4, 11

    BBoundaries

    zones, 4, 5

    CCha nge font, 2, 5, 10Change region, 2, 5, 10Chan ge subscript, 2, 5Chan ge superscript, 2, 5Chan ge underline, 2, 5Character confidence, 2, 6, 14Character operands, 2, 2Chara cter set, 3, 2Compoun d documen ts, 1, 1Compoun d documen ts, 4, 3Concaten ating files, 3, 1Confidence

    chara cters, 2, 14word, 2, 16

    Confidence, 2, 6

    DDefining re gions, 4, 4Document nam e, 2, 9, 10Document nam e, 3, 1Documentation conventions,, viDropped capital lett er, 2, 5, 16

    EEnd cell table, 2, 7En d escape char acter, 2, 2En d of docum ent , 2, 9, 17Essen tial modifiers

    page, 2, 7text line, 2, 5

    FFont description, 4, 13

  • 8/7/2019 XDOC DATA FORMAT

    54/58

    Font identifier, 2, 8Font inform at ion, 2, 10Font modifier

    page, 2, 7Fonts, 4, 13Fra me, 4, 5

    GGalleys, 4, 4Graph ics form at s, 1, 1

    HHar d column break, 2, 7, 16Horizonta l space, 2, 5

    IIdentifying fonts, 4, 13Image zone descriptor, 2, 7, 16Ima ge zones, 4, 7ISO chara cter set, 3, 2

    KKDC_CCONF , 2, 14KDC_CHGFONT. 2, 10KDC_DOCNAME, 2, 1 0KDC_DROP, 2, 16KDC_FONTINFO, 2, 10KDC_HARD_COL, 2, 16KDC_HSPACE, 2, 11, 12KDC_LEADER, 2, 13KDC_LEXCL, 2, 16KDC_NDDOC, 2, 17KDC_NDLINE, 2, 17KDC_NDPAGE, 2, 11KDC_NDTABLE, 2 , 9KDC_OHPH EN, 2, 11KDC_PZONE, 2, 16KDC_QABLE, 2, 14KDC_REGION, 2, 10KDC_RULE, 2, 14KDC_SCOL, 2, 13KDC_STABLE, 2, 12KDC_STDOC, 2, 9KDC_STLINE, 2, 15KDC_SUB, 2, 10KDC_SUPER, 2, 15KDC_TZONE, 2, 15KDC_UNDERLINE , 2, 16KDC_UNREC, 2, 10KDC_WBOX, 2, 9KDC_WCONF , 2, 16

  • 8/7/2019 XDOC DATA FORMAT

    55/58

    LLayout modifiers

    page, 2, 7Leader chara cters, 2, 5, 13Lexical class, 2, 6, 16Lexical modifiers

    text line, 2, 5Lexical order, 4, 7

    MMarkups

    synta x, 2, 1Mode shift modifiers

    text line, 2, 5Modifier code, 2, 2Modifier codes

    alph abet ical listing, 2, 9, 10, 12, 13, 14, 15, 16Modifier end escape char acter, 2, 2

    Modifier st ar t escape chara cter, 2, 2Modifiers

    alpha betical listing, 2, 9page, 2, 7text line, 2, 4

    Modifiers, 2 , 2Multiple documen t, 3, 1

    NNesting tr an sactions, 2, 4New ta ble, 2, 7, 12Nonprint ing whitespa ce, 2, 5, 11Num eric operan ds, 2, 2

    OOffset page image, 4, 1Operand separa tor char acter, 2, 2Operands

    chara cter, 2, 2num eric, 2, 2str ing, 2, 2

    Operan ds, 2, 2Optional hyphen , 2, 6, 11

    P

    Page imagezones, 4, 3

    Pa ge image ana lysis, 4, 1Pa ge image shea r, 4, 2Pa ge image tilt, 4, 2Pa ge inform at ion, 2, 7, 11Page modifiers, 2, 7Pa ge segment at ion, 4, 3Page segments

    regions, 4, 3

  • 8/7/2019 XDOC DATA FORMAT

    56/58

    zones, 4, 3Pa ge sequence, 3, 1Page tr ansa ctions

    essentia l, 2, 7font modifier, 2, 7layout m odifiers, 2, 7

    Pa ge tran sactions, 2, 7Parsing XDOC Text, 3, 3print ing whitespace, 2, 5Pr inting whitespa ce, 2, 13

    QQuestionable chara cter, 2, 6, 14

    RReading thr eads, 4, 7Regions

    defining, 4, 4Regions, 4, 3Retrieving an gular corr ection, 4, 11Ruling descriptor, 2, 7, 14Rulings

    tr an sform ations, A, 4Rulings, 4, 7Runn ing text vs. form s, 4, 4

    SSam ple page image, B, 1Sa mple XDOC file, 2, 1Sa mple XDOC Text, B, 2Scale

    text vs. image, 4, 3Section br eak, 2, 7Section change2, 12Separat or char acter, 2, 2Shift t o/from subscript, 2, 10Shift t o/from super script, 2, 15Shift t o/from un derline, 2, 16Skewed page ima ge, 4, 1Star t escape character, 2, 2Sta rt n ew table row, 2, 13Sta rt of docum ent, 2, 9Sta rt ta ble cell, 2, 13Sta rt t able column , 2, 7Start-of-page, 2, 7, 14Start-of-text line, 2, 5, 15Str ing opera nds, 2, 2

    TTechnical Specification

    documentation conventions,, viorganization,, v

    Techn ical s upport,, vi

  • 8/7/2019 XDOC DATA FORMAT

    57/58

    Terminating t ran sactions, 2, 3Text dat a, 3, 2Text line information, 2, 5, 17Text line m odifiers, 2, 4Text line transactions

    essentia l, 2, 5mode sh ift modifiers, 2, 5

    Text line tra nsa ctions, 2, 4Text line tra nsform at ions, A, 1Text pars er, 3, 3Text tra nsa ction synta x, 2, 9Text zone descript or, 2, 7, 15ScanSoft SDK Pr ogra mm ers Gu ide,, viScan Soft SDK,, vTextline transactions,

    lexical m odifiers , 2, 5Tilt

    page image, 4, 2Topological layout, 4, 5Transactions

    nest ing, 2, 4Tra nsa ctions, 2, 3Transformations

    algebr aic, 4, 9image coordinates to page coordinates, A, 1page to image, 4, 9ru lings, 4, 7ru lings, A, 4text line, A, 1text, 4, 2zones, 4, 6zones, A, 3

    UUnr ecognized cha ra cter, 2, 6, 10

    VViewing XDOC files, 3, 2

    WWord bound ing box, 2, 6Word bound ing box. 2, 9Word boundin g boxes

    coordinates, 4, 11Word confidence, 2, 6, 16

    XXDOC

    flexible gram ma r, 1, 1mar kups, 2, 1tr ans actions, 2, 3

    XDOC coordina te s ystem , 4, 1XDOC define , 2, 9XDOC features, 1, 1

  • 8/7/2019 XDOC DATA FORMAT

    58/58

    XDOC file hiera rchy, 1, 1XDOC ima ges, 1, 1XDOC Text

    file stru cture, 3, 1XDOC Text file

    concatena ting, 3, 1mu ltiple documents, 3, 1page sequence, 3, 1

    XDOC Text m ar kups, 2, 9XDOC Text pa ges

    ru nnin g text, 4, 4XDOC Text, 1, 1

    ZZone

    frame, 4, 5Zones

    boun dar ies, 4, 5Zones, 4, 3