Making Scanned Documents Web Accessible


Yongli Zhou


  • ARE YOUR DIGITAL DOCUMENTS WEB FRIENDLY? | ZHOU 151

    Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible

    The Internet has greatly changed how library users search and use library resources. Many of them prefer resources available in electronic format over traditional print materials. While many documents are now born digital, many more are only accessible in print and need to be digitized. This paper focuses on how the Colorado State University Libraries creates and optimizes text-based and digitized PDF documents for easy access, downloading, and printing.

    Tutorial

    Yongli Zhou

    To digitize print materials, we normally scan originals, save them in archival digital formats, and then make them Web-accessible. There are two types of print documents, graphic-based and text-based. If we apply the same techniques to digitize these two different types of materials, the documents produced will not be Web-friendly.

    Graphic-based materials include archival resources such as historical photographs, drawings, manuscripts, maps, slides, and posters. We normally scan them in color at a very high resolution to capture and present a reproduction that is as faithful to the original as possible. Then we save the scanned images in TIFF (Tagged Image File Format) for archival purposes and convert the TIFFs to JPEG (Joint Photographic Experts Group) 2000 or JPEG for Web access. However, the same practice is not suitable for modern text-based documents, such as reports, journal articles, meeting minutes, and theses and dissertations. Many old text-based documents (e.g., aged newspapers and books) should be treated as graphic-based material. These documents often have faded text, unusual fonts, stains, and colored background. If they are scanned using the same practice as modern text documents, the document created can be unreadable and contain incorrect information. This topic is covered in the section "Full-Text Searchable PDFs and Troubleshooting OCR Errors."

    Currently, PDF is the file format used for most digitized text documents. While PDFs that are created from high-resolution color images may be of excellent quality, they can have many drawbacks. For example, a multipage PDF may have a large file size, which increases download time and the memory required while viewing. Sometimes the download takes so long it fails because a time-out error occurs. Printers may have insufficient memory to print large documents. In addition, the Optical Character Recognition (OCR) process is not accurate for high-resolution images in either color or grayscale. As we know, users want the ability to easily download, view, print, and search online textual documents. All of the drawbacks created by high-quality scanning defeat one of the most important purposes of digitizing text-based documents: making them accessible to more users.

    This paper addresses how Colorado State University Libraries (CSUL) manages these problems and others as staff create Web-friendly digitized textual documents. Topics include scanning, long-time archiving, full-text searchable PDFs and troubleshooting OCR problems, and optimizing PDF files for Web delivery.

    Preservation Master Files and Access Files

    For digitization projects, we normally refer to images in uncompressed TIFF format as master files and compressed files for fast Web delivery as access files. For text-based files, access files normally are PDFs that are converted from scanned images.

    "BCR's CDP Digital Imaging Best Practices Version 2.0" says that the master image should be the highest quality you can afford, it should not be edited or processed for any specific output, and it should be uncompressed.1 This statement applies to archival images, such as photographs, manuscripts, and other image-based materials. If we adopt the same approach for modern text documents, the result may be problematic. PDFs that are created from such master files may have the following drawbacks:

    Because of their large file size, they require a long download time or cannot be downloaded because of a timeout error.

    They may crash a user's computer because they use more memory while viewing.

    They sometimes cannot be printed because of insufficient printer memory.

    Poor print and on-screen viewing qualities can be caused by background noise and bleed-through of text. Background noise can be caused by stains, highlighter marks made by users, and yellowed paper from aged documents.

    The OCR process sometimes does not work for high-resolution images.

    Content creators need to spend more time scanning images at a high resolution and converting them to PDF documents.

    Web-friendly files should be small, accessible by most users, full-text searchable, and have good

    Yongli Zhou is Digital Repositories Librarian, Colorado State University Libraries, Colorado State University, Fort Collins, Colorado.

  • 152 INFORMATION TECHNOLOGY AND LIBRARIES | SEPTEMBER 2010

    factors that determine PDF file size. Color images typically generate the largest PDFs and black-and-white images generate the smallest PDFs. Interestingly, an image of smaller file size does not necessarily generate a smaller PDF. Table 1 shows how file format and color mode affect PDF file size.

    The source file is a page containing black-and-white text and line art drawings. Its physical dimensions are 8.047" by 10.893". All images were scanned at 300 dpi.

    CSUL uses Adobe Acrobat Professional to create PDFs from scanned images. The current version we use is Adobe Acrobat 9 Professional, but most of its features listed in this paper are available for other Acrobat versions. When Acrobat converts TIFF images to a PDF, it compresses images. Therefore a final PDF has a smaller file size than the total size of the original images. Acrobat compresses TIFF uncompressed, LZW, and Zip the same amount and produces PDFs of the same file size. Because our in-house scanning software does not support TIFF G4, we did not include TIFF G4 test data here. By comparing similar pages, we concluded that TIFF G4 works the same as TIFF uncompressed, LZW, and Zip. For example, if we scan a text-based page as black-and-white and save it separately in TIFF uncompressed, LZW, Zip, or G4, then convert each page into a PDF, the final PDF will have the same file size without a noticeable quality difference. TIFF JPEG generates the smallest PDF, but it is a lossy format, so it is not recommended. Both JPEG and JPEG 2000 have smaller file sizes but generate larger PDFs than those converted from TIFF images.
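
    As a rough illustration of the last point, the 24-bit color rows of Table 1 can be compared directly. The sketch below is not part of the CSUL workflow; the function names are hypothetical and chosen only for this example. It computes each PDF's size relative to its source image and lists the formats whose PDF came out larger than the source.

```python
# Sizes in KB copied from the 24-bit color rows of Table 1:
# format name -> (source image size, resulting PDF size).
TABLE1_COLOR = {
    "TIFF": (23141, 900),
    "TIFF LZW": (5773, 900),
    "TIFF ZIP": (4892, 900),
    "TIFF JPEG": (4854, 873),
    "JPEG 2000": (5361, 5366),
    "JPEG": (4849, 5066),
}


def pdf_to_source_ratio(source_kb, pdf_kb):
    """Return the PDF size as a fraction of the source image size."""
    return pdf_kb / source_kb


def formats_that_grow(table):
    """List the formats whose PDF is larger than the source image."""
    return [name for name, (source_kb, pdf_kb) in table.items() if pdf_kb > source_kb]


if __name__ == "__main__":
    for name, (source_kb, pdf_kb) in TABLE1_COLOR.items():
        print(f"{name:10s} {pdf_to_source_ratio(source_kb, pdf_kb):5.2f}")
    print(formats_that_grow(TABLE1_COLOR))
```

    For these rows, only the JPEG and JPEG 2000 sources produce PDFs larger than the source images, which matches the observation in the text.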

    Recommendations

    1. Use TIFF uncompressed or LZW in 24 bits color for pages with color graphs or for historical documents.

    2. Use TIFF uncompressed or LZW

    compress an image up to 50 percent. Some vendors hesitate to use this format because it was proprietary; however, the patent expired on June 20, 2003. This format has been widely adopted by much software and is safe to use. CSUL saves all scanned text documents in this format.

    TIFF Zip: This is a lossless compression. Like LZW, ZIP compression is most effective for images that contain large areas of single color.2

    TIFF JPEG: This is a JPEG file stored inside a TIFF tag. It is a lossy compression, so CSUL does not use this file format.

    Other image formats:

    JPEG: This format is a lossy compression and can only be used for nonarchival purposes. A JPEG image can be converted to PDF or embedded in a PDF. However, a PDF created from JPEG images has a much larger file size compared to a PDF created from TIFF images.

    JPEG 2000: This format's file extension is .jp2. This format offers superior compression performance and other advantages. JPEG 2000 normally is used for archival photographs, not for text-based documents.

    In short, scanned images should be saved as TIFF files, either with compression or without. We recommend saving text-only pages and pages containing text and/or line art as TIFF G4 or TIFF LZW. We also recommend saving pages with photographs and illustrations as TIFF uncompressed or TIFF LZW.

    How Image Format and Color Mode Affect PDF File Size

    Color mode and file format are two

    on-screen viewing and print qualities. In the following sections, we will discuss how to make scanned documents Web-friendly.

    Scanning

    There are three main factors that affect the quality and file size of a digitized document: file format, color mode, and resolution of the source images. These factors should be kept in mind when scanning text documents.

    File Format and Compression

    Most digitized documents are scanned and saved as TIFF files. However, there are many different formats of TIFF. Which one is appropriate for your project?

    TIFF: Uncompressed format. This is a standard format for scanned images. However, an uncompressed TIFF file has the largest file size and requires more space to store.

    TIFF G3: TIFF with G3 compression is the universal standard for faxes and multipage line-art documents. It is used for black-and-white documents only.

    TIFF G4: TIFF with G4 compression has been approved as a lossless archival file format for bitonal images. TIFF images saved in this compression have the smallest file size. It is a standard file format used by many commercial scanning vendors. It should only be used for pages with text or line art. Many scanning programs do not provide this file format by default.

    TIFF Huffman: A method for compressing bi-level data based on the CCITT Group 3 1D facsimile compression schema.

    TIFF LZW: This format uses a lossless compression that does not discard details from images. It may be used for bitonal, grayscale, and color images. It may


    to be scanned at no less than 600 dpi in color. Our experiments show that documents scanned at 300 or 400 dpi are sufficient for creating PDFs of good quality. Resolutions lower than 300 dpi are not recommended because they can degrade image quality and produce more OCR errors. Resolutions higher than 400 dpi also are not recommended because they generate large files with little improved on-screen viewing and print quality. We compared PDF files that were converted from images of resolutions at 300, 400, and 600 dpi. Viewed at 100 percent, the difference in image quality both on screen and in print was negligible. If a page has text with very small font, it can be scanned at a higher resolution to improve OCR accuracy and viewing and print quality.

    Table 2 shows that high-resolution images produce large files and require more time to be converted into PDFs. The time required to combine images is not significantly different compared to scanning time and OCR time, so it was omitted. Our example is a modern text document with text and a black-and-white chart.

    Most of our digitization projects do not require scanning at 600 dpi; 300 dpi is the minimum requirement. We use 400 dpi for most documents and choose a proper color mode for each page. For example, we scan our theses and dissertations in black-and-white at 400 dpi for bitonal pages. We scan pages containing photographs or illustrations in 8-bit grayscale or 24-bit color at 400 dpi.

    Other Factors that Affect PDF File Size

    In addition to the three main factors we have discussed, unnecessary edges, bleed-through of text and graphs, background noise, and blank pages also increase PDF file sizes. Figure 1 shows how a clean scan can largely reduce a PDF file size and

    cover. The updated file has a file size of 42.8 MB. The example can be accessed at http://hdl.handle.net/10217/3667. Sometimes we scan a page containing text and photographs or illustrations twice, in color or grayscale and in black-and-white. When we create a PDF, we combine two images of the same page to reproduce the original appearance and to reduce file size. How to optimize PDFs using multiple scans will be discussed in a later section.

    How Image Resolution Affects PDF File Size

    Before we start scanning, we check with our project manager regarding project standards. For some funded projects, documents are required

    in grayscale 8 bits for pages with black-and-white photographs or grayscale illustrations.

    3. Use TIFF uncompressed, LZW, or G4 in black-and-white for pages containing text or line art.
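
    The three recommendations can be read as a lookup from page content to scan settings. The sketch below is one way to write that down, under stated assumptions: the page labels ("color", "grayscale", "text", "blank") and all function and class names are hypothetical, and the sample input mirrors the document discussed in this section (one color cover, 790 pages of text and line art, and 7 blank pages).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSettings:
    color_mode: str
    bits_per_pixel: int
    tiff_compression: tuple  # acceptable TIFF variants, per the recommendations


# Hypothetical page labels mapped to the recommended settings.
RECOMMENDED = {
    "color": ScanSettings("color", 24, ("uncompressed", "LZW")),
    "grayscale": ScanSettings("grayscale", 8, ("uncompressed", "LZW")),
    "text": ScanSettings("black-and-white", 1, ("uncompressed", "LZW", "G4")),
}


def settings_for(page_kind):
    """Return the recommended settings for one page label."""
    try:
        return RECOMMENDED[page_kind]
    except KeyError:
        raise ValueError(f"unknown page kind: {page_kind!r}")


def plan_document(page_kinds):
    """Return (page_number, settings) pairs, dropping pages labeled 'blank'."""
    plan = []
    for number, kind in enumerate(page_kinds, start=1):
        if kind == "blank":
            continue  # blank pages are left out of the final PDF
        plan.append((number, settings_for(kind)))
    return plan


if __name__ == "__main__":
    pages = ["color"] + ["text"] * 790 + ["blank"] * 7
    plan = plan_document(pages)
    print(len(plan), plan[0][1].color_mode, plan[1][1].color_mode)
```

    Labeling each page still requires a person to look at it; the sketch only records the decision once it has been made.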

    To achieve the best result, each page should be scanned accordingly. For example, we had a document with a color cover, 790 pages containing text and line art, and 7 blank pages. We scanned the original document in color at 300 dpi. The PDF created from these images was 384 MB, so large that it exceeded the maximum file size that our repository software allows for uploading. To optimize the document, we deleted all blank pages, converted the 790 pages with text and line art from color to black-and-white, and retained the color

    Table 1. File format and color mode versus PDF file size

    File Format   Scan Specifications   TIFF Size (KB)   PDF Size (KB)
    TIFF          Color 24 bits         23,141           900
    TIFF LZW      Color 24 bits         5,773            900
    TIFF ZIP      Color 24 bits         4,892            900
    TIFF JPEG     Color 24 bits         4,854            873
    JPEG 2000     Color 24 bits         5,361            5,366
    JPEG          Color 24 bits         4,849            5,066
    TIFF          Grayscale 8 bits      7,729            825
    TIFF LZW      Grayscale 8 bits      2,250            825
    TIFF ZIP      Grayscale 8 bits      1,832            825
    TIFF JPEG     Grayscale 8 bits      2,902            804
    JPEG 2000     Grayscale 8 bits      2,266            2,270
    JPEG          Grayscale 8 bits      2,886            3,158
    TIFF          Black-and-white       994              116
    TIFF LZW      Black-and-white       242              116
    TIFF ZIP      Black-and-white       196              116

    Note: Black-and-white scans cannot be saved in JPEG, JPEG 2000, or TIFF JPEG formats.
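
    The uncompressed TIFF sizes in Table 1 can be approximated from the page dimensions, resolution, and bit depth alone. The sketch below is a back-of-the-envelope estimate, not the scanner's actual calculation: it assumes each pixel row is padded to a whole number of bytes and ignores TIFF headers and metadata, so it lands slightly below the measured values. The function name is hypothetical.

```python
def uncompressed_kb(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw pixel data size in KB for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    # Assumption: each row of pixels is padded to a whole number of bytes.
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px / 1024


if __name__ == "__main__":
    # Page from Table 1: 8.047" by 10.893" scanned at 300 dpi.
    for label, bits in (("color", 24), ("grayscale", 8), ("black-and-white", 1)):
        print(label, round(uncompressed_kb(8.047, 10.893, 300, bits)))
```

    For the 24-bit color scan the estimate is roughly 23,100 KB against the 23,141 KB reported in Table 1; the grayscale and black-and-white estimates are similarly within a few percent of the measured 7,729 KB and 994 KB.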


    Many PDF files cannot be saved as PDF/A files. If an error occurs when saving a PDF to PDF/A, you may use Adobe Acrobat Preflight (Advanced > Preflight) to identify problems. See figure 2.

    Errors can be created by non-embedded fonts, embedded images with unsupported file compression, bookmarks, embedded video and audio, etc. By default, the Reduce File Size procedure in Acrobat Professional compresses color images using JPEG 2000 compression. After running the Reduce File Size procedure, a PDF may not be saved as a PDF/A because of a "JPEG 2000 compression used" error. According to the PDF/A Competence Center, this problem will be eliminated in the second part of the PDF/A standard; PDF/A-2 is planned for 2008/2009. There are many other features in new PDFs; for example, PDF/A-2 will allow transparency and layers.5 However, at the time this paper was written PDF/A-2 had not been announced.6
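
    As a quick pre-check before attempting a PDF/A save, one can look for signs of JPEG 2000 image data in a file. In PDF, JPEG 2000 image streams are marked with the /JPXDecode filter name, so the sketch below simply searches the raw bytes for that name. This is a naive heuristic and an assumption of this example, not what Preflight does: it can miss filter names stored inside compressed object streams, and it is no substitute for a real validator. The function names are hypothetical.

```python
def uses_jpeg2000(pdf_bytes):
    """Return True if the raw PDF bytes mention the JPEG 2000 stream filter."""
    return b"/JPXDecode" in pdf_bytes


def file_uses_jpeg2000(path):
    """Read a PDF from disk and apply the same naive byte search."""
    with open(path, "rb") as handle:
        return uses_jpeg2000(handle.read())


if __name__ == "__main__":
    sample = b"%PDF-1.6\n1 0 obj\n<< /Subtype /Image /Filter /JPXDecode >>\nendobj\n"
    print(uses_jpeg2000(sample))
```

    A positive result suggests the file may trigger the JPEG 2000 error described above; a negative result does not guarantee that the file will pass.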

    portable, which means the file created on one computer can be viewed with an Acrobat viewer on other computers, handheld devices, and on other platforms.3

    A PDF/A document is basically a traditional PDF document that fulfills precisely defined specifications. The PDF/A standard aims to enable the creation of PDF documents whose visual appearance will remain the same over the course of time. These files should be software-independent and unrestricted by the systems used to create, store, and reproduce them.4 The goal of PDF/A is for long-term archiving. A PDF/A document has the same file extension as a regular PDF file and must be at least compatible with Acrobat Reader 4.

    There are many ways to create a PDF/A document. You can convert existing images and PDF files to PDF/A files, export a document to PDF/A format, scan to PDF/A, to name a few. There are many software programs you can use to create PDF/A, such as Adobe Acrobat Professional 8 and later versions, Compart AG, PDFlib, and PDF Tools AG.

    simultaneously improve its viewing and print quality.

    Recommendations

    1. Unnecessary edges: Crop out.

    2. Bleed-through text or graphs: Place a piece of white or black card stock on the back of a page. If a page is single sided, use white card stock. If a page is double sided, use black card stock and increase contrast ratio when scanning. Often color or grayscale images have bleed-through problems. Scanning a page containing text or line art as black-and-white will eliminate bleed-through text and graphs.

    3. Background noise: Scanning a page containing text or line art as black-and-white can eliminate background noise. Many aged documents have yellowed papers. If we scan them as color or grayscale, the result will be images with yellow or gray background, which may increase PDF file sizes greatly. We also recommend increasing the contrast for better OCR results when scanning documents with background colors.

    4. Blank pages: Do not include if they are not required. Blank pages scanned in grayscale or color can quickly increase file size.
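
    Two of these recommendations, scanning text as black-and-white and dropping blank pages, come down to simple pixel tests. The sketch below illustrates the idea on a grayscale page represented as a list of rows of 8-bit values (0 is black, 255 is white). It is an illustration only: CSUL relies on scanning software for this, and the threshold and dark-pixel fraction used here are hypothetical values.

```python
def binarize(gray_rows, threshold=128):
    """Map each 8-bit gray value to 0 (black) or 255 (white)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray_rows]


def is_blank(gray_rows, threshold=128, max_dark_fraction=0.001):
    """Treat a page as blank when almost no pixels are darker than the threshold."""
    total = sum(len(row) for row in gray_rows)
    if total == 0:
        return True
    dark = sum(1 for row in gray_rows for value in row if value < threshold)
    return dark / total <= max_dark_fraction


if __name__ == "__main__":
    # A yellowed background (gray value 200) with one dark text pixel (30).
    page = [[200, 200, 30], [200, 200, 200]]
    print(binarize(page))            # the background becomes pure white
    print(is_blank(page))            # the page has content
    print(is_blank([[200] * 100] * 10))  # uniform background only
```

    Thresholding turns a gray or yellowed background into pure white, which is why a black-and-white scan removes background noise; the same threshold applied to aged or faded originals can also erase faint text, so it suits modern text pages best.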

    PDF and Long-Term Archiving PDF/A

    PDF vs. PDF/A

    PDF, short for Portable Document Format, was developed by Adobe as a unique format to be viewed through Adobe Acrobat viewers. As the name implies, it is

    Table 2. Color Mode and Image Resolution vs. PDF File Size

    Color mode   Resolution (DPI)   Scanning time (sec.)   OCR time (sec.)   TIFF LZW (KB)   PDF size (KB)
    color        600                100                    N/A*              16,498          2,391
    color        400                25                     35                7,603           1,491
    color        300                18                     16                5,763           952
    grayscale    600                36                     33                6,097           2,220
    grayscale    400                18                     18                2,888           1,370
    grayscale    300                14                     12                2,240           875
    B/W          600                12                     18                559             325
    B/W          400                10                     10                333             235
    B/W          300                8                      9                 232             140

    *N/A due to an OCR error
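
    To make the trend in Table 2 easier to see, the sketch below computes how much larger each PDF becomes when the resolution rises from 300 to 600 dpi. The PDF sizes are copied from Table 2; the function name is hypothetical.

```python
# PDF sizes in KB copied from Table 2: color mode -> {resolution in dpi: size}.
TABLE2_PDF_KB = {
    "color": {300: 952, 400: 1491, 600: 2391},
    "grayscale": {300: 875, 400: 1370, 600: 2220},
    "B/W": {300: 140, 400: 235, 600: 325},
}


def growth(mode, low=300, high=600):
    """Return the PDF size at `high` dpi divided by the size at `low` dpi."""
    sizes = TABLE2_PDF_KB[mode]
    return sizes[high] / sizes[low]


if __name__ == "__main__":
    for mode in TABLE2_PDF_KB:
        print(f"{mode:10s} {growth(mode):.2f}x")
```

    In every color mode the 600 dpi PDF is between two and three times the size of the 300 dpi PDF, and a black-and-white PDF at 600 dpi (325 KB) is still far smaller than a color PDF at 300 dpi (952 KB).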


    Full-Text Searchable PDFs and Troubleshooting OCR Errors

    A PDF created from a scanned piece of paper is inherently inaccessible because the content of the document is an image, not searchable text. Assistive technology cannot read or extract the words, users cannot select or edit the text, and one cannot manipulate the PDF document for accessibility. Once OCR is properly applied to the scanned files, however, the image becomes searchable text with selectable graphics, and one may apply other accessibility features to the document.7

    Acrobat Professional provides three OCR options:

    1. Searchable Image (Exact): Ensures that text is searchable and selectable. This option keeps the original image and places an invisible text layer over it. Recommended for cases requiring maximum fidelity to the original image.8 This is the only option used by CSUL.

    2. Searchable Image: Ensures that text is searchable and selectable. This option keeps the original image, de-skews it as needed, and places an invisible text layer over it. The selection for downsample images in this same dialog box determines whether the image is downsampled and to what extent.9 The downsampling combines several pixels in an image to make a single larger pixel; thus some information is deleted from the image. However, downsampling does not affect the quality of text or line art. When a proper setting is used, the size of a PDF can be significantly reduced with little or no loss of detail and precision.

    3. ClearScan: Synthesizes a new Type 3 font that closely approximates the original, and preserves the page background using a low-resolution copy.10 The final PDF is the same as a born-digital PDF. Because Acrobat cannot guarantee the accuracy of

    Figure 1. PDFs Converted from different images: (a) the original PDF converted from a grayscale image and with unnecessary edges; (b) updated PDF converted from a black-and-white image and with edges cropped out; (c) screen viewed at 100 percent of the PDF in grayscale; and (d) screen viewed at 100 percent of the PDF in black-and-white.

    Dimensions: 9.127 X 11.455; Color Mode: grayscale; Resolution: 600 dpi; TIFF LZW: 12.7 MB; PDF: 1,051 KB

    Dimensions: 8 X 10.4; Color Mode: black-and-white; Resolution: 400 dpi; TIFF LZW: 153 KB; PDF: 61 KB

    Figure 2. Example of Adobe Acrobat 9 Preflight


    but at least users can read all text, while the black-and-white scan contains unreadable words.

    Troubleshoot OCR Error 3: Cannot OCR Image Based Text

    The search of a digitized PDF is actually performed on its invisible text layer. The automated OCR process inevitably produces some incorrectly recognized words. For example, Acrobat cannot recognize the Colorado State University Logo correctly (see figure 6).

    Unfortunately, Acrobat does not provide a function to edit a PDF file's invisible text layer. To manually edit or add OCRed text, Adobe Acrobat Capture 3.0 (see figure 7) must be purchased. However, our tests show that Capture 3.0 has many drawbacks. This software is complicated and produces its own errors. Sometimes it consolidates words; other times it breaks them up. In addition, it is time-consuming to add or modify invisible text layers using Acrobat Capture 3.0.

    At CSUL, we manually add searchable text for title and abstract pages only if they cannot be OCRed by Acrobat correctly. The example in

    Troubleshoot OCR Error 2: Could Not Perform Recognition (OCR)

    Sometimes Acrobat generates an "Outside of the Allowed Specifications" error when processing OCR. This error is normally caused by color images scanned at 600 dpi or more.

    In the example in figure 4, the page only contains text but was scanned in color at 600 dpi. When we scanned this page as black-and-white at 400 dpi, we did not encounter this problem. We could also use a lower-resolution color scan to avoid this error. Our experiments also show that images scanned in black-and-white work best for the OCR process.

    In this article we mainly discuss running the OCR process on modern textual documents. Black-and-white scans do not work well for historical textual documents or aged newspapers. These documents may have faded text and background noise. When they are scanned as black-and-white, broken letters may occur, and some text might become unreadable. For this reason they should be scanned in color or grayscale. In figure 5, images scanned in color might not produce accurate OCR results,

    OCRed text at 100 percent, this option is not acceptable for us.
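
    The downsampling described under the Searchable Image option, where several pixels are combined into a single larger pixel, can be sketched in a few lines. The version below averages each block of pixels in a grayscale image stored as a list of rows. Averaging is one common approach and is used here as an assumption; it is not necessarily the method Acrobat applies, and the function name is hypothetical.

```python
def downsample(gray_rows, factor=2):
    """Average each factor-by-factor block of gray values into one pixel.

    Rows and columns that do not fill a complete block are dropped.
    """
    height = len(gray_rows) // factor * factor
    width = len(gray_rows[0]) // factor * factor if gray_rows else 0
    result = []
    for y in range(0, height, factor):
        row = []
        for x in range(0, width, factor):
            block = [
                gray_rows[y + dy][x + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(block) // len(block))
        result.append(row)
    return result


if __name__ == "__main__":
    image = [
        [0, 0, 255, 255],
        [0, 0, 255, 255],
    ]
    print(downsample(image))
```

    Each output pixel replaces four input pixels, so the pixel count drops to a quarter at a factor of 2, which is the sense in which some information is deleted from the image.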

    For a tutorial on how to make a full-text searchable PDF, please see appendix A.

    Troubleshoot OCR Error 1: Acrobat Crashes

    Occasionally Acrobat crashes during the OCR process. The error message does not indicate what causes the crash and where the problem occurs. Fortunately, the page number of the error can be found on the top shortcuts menu. In figure 3, we can see the error occurs on page 7.

    We discovered that errors are often caused by figures or diagrams. For a problem like this, the solution is to skip the error-causing page when running the OCR process. Our initial research was performed on Acrobat 8 Professional. Our recent study shows that this problem has been significantly improved in Acrobat 9 Professional.
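
    Skipping an error-causing page means running OCR on the page ranges on either side of it. The sketch below computes those ranges for any set of failing pages. The function name is hypothetical, and the 20-page document in the example is an assumption; only the failing page 7 comes from figure 3.

```python
def ranges_excluding(total_pages, bad_pages):
    """Return inclusive (first, last) page ranges that avoid the bad pages."""
    bad = set(bad_pages)
    ranges = []
    start = None
    for page in range(1, total_pages + 1):
        if page in bad:
            if start is not None:
                ranges.append((start, page - 1))
                start = None
        elif start is None:
            start = page
    if start is not None:
        ranges.append((start, total_pages))
    return ranges


if __name__ == "__main__":
    # Hypothetical 20-page document in which OCR crashes on page 7.
    print(ranges_excluding(20, [7]))
```

    Each resulting range can then be entered in the OCR dialog in turn, leaving the problem page as an image without a text layer.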

    Figure 3. Adobe Acrobat 8 Professional crash window

    Figure 4. Could not perform recognition (OCR) error

    Figure 5. An aged newspaper scanned in color and black-and-white

    Aged Newspaper Scanned in Color Aged Newspaper Scanned in Black-and-White


    a very light yellow background. The undesirable marks and background contribute to its large file size and create ink waste when printed.
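
    The savings from choosing an appropriate color mode can be expressed as a percentage. The sketch below applies the usual percent-reduction formula to two examples reported in this paper: the figure 9 document (66,265 KB reduced to 8,842 KB) and the 791-page document discussed earlier (384 MB reduced to 42.8 MB). The function name is hypothetical.

```python
def percent_reduction(before, after):
    """Return how much smaller `after` is than `before`, as a percentage."""
    if before <= 0:
        raise ValueError("the original size must be positive")
    return (before - after) / before * 100


if __name__ == "__main__":
    print(f"{percent_reduction(66265, 8842):.1f}%")  # figure 9 document, in KB
    print(f"{percent_reduction(384, 42.8):.1f}%")    # 791-page document, in MB
```

    Both files shrink by well over 80 percent, in each case mainly by converting text pages from color to black-and-white.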

    Method 2: Running Acrobat's Built-In Optimization Processes

    Acrobat provides three built-in processes to reduce file size. By default, Acrobat uses JPEG compression for color and grayscale images and CCITT Group 4 compression for bitonal images.

    Optimize Scanned PDF

    Open a scanned PDF and select Documents > Optimize Scanned PDF. A number of settings, such as image quality and background removal, can be specified in the Optimize Scanned PDF dialog box. Our experiments show this process can noticeably degrade images and sometimes even increase file size. Therefore we do not use this option.

    Reduce File Size

    Open a scanned PDF and select Documents > Reduce File Size. The Reduce File Size command resamples and recompresses images, removes embedded Base-14 fonts, and subset-embeds fonts that were left embedded. It also compresses document structure and cleans up elements such as invalid bookmarks. If the file size is already as small as possible, this command has no effect.11 After processing, some files cannot be saved as PDF/A, as we discussed in a previous section. We also noticed that different versions of Acrobat can create files of different file sizes even if the same settings were used.

    PDF Optimizer

    Open a scanned PDF and select Advanced > PDF Optimizer. Many settings can be specified in the PDF Optimizer dialog box. For example, we can downsample images from

    sections, we can greatly reduce a PDF's size by using an appropriate color mode and resolution. Figure 9 shows two different versions of a digitized document. The source document has a color cover and 111 bitonal pages. The original PDF, shown in figure 9 on the left, was created by another university department. It was not scanned according to standards and procedures adopted by CSUL. It was scanned in color at 300 dpi and has a file size of 66,265 KB. We exported the original PDF as TIFF images, batch-converted color TIFF images to black-and-white TIFF images, and then created a new PDF using black-and-white TIFF images. The updated PDF has a file size of 8,842 KB. The image on the right is much cleaner and has better print quality. The file on the left has unwanted marks and

    figure 8 is a book title page for which we used Acrobat Capture 3.0 to manually add searchable text. The entire book may be accessed at http://hdl.handle.net/10217/1553.

    Optimizing PDFs for Web Delivery

    A digitized PDF file with 400 color pages may be as large as 200 to 400 MB. Most of the time, optimizing processes may reduce files this large without a noticeable difference in quality. In some cases, quality may be improved. We will discuss three optimization methods we use.

    Method 1: Using an Appropriate Color Mode and Resolution

    As we have discussed in previous

    Figure 6. Incorrectly recognized text sample. Panels: Original Logo; Text OCRed by Acrobat ("~dO UniversitY").

    Figure 7. Adobe Acrobat capture interface

    Figure 8. Image-based text sample


    grayscale. A PDF may contain pages that were scanned with different color modes and resolutions. A PDF may also have pages of mixed resolutions. One page may contain both bitonal images and color or grayscale images, but they must be of the same resolution.

    The following strategies were adopted by CSUL:

    1. Combine bitmap, grayscale, and color images. We use grayscale images for pages that contain grayscale graphs, such as black-and-white photos, color images for pages that contain color images, and bitmap images for text-only or text and line art pages.

    2. If a page contains high-definition color or grayscale images, scan that page in a higher resolution and scan other pages at 400 dpi.

    3. If a page contains a very small font and the OCR process does not work well, scan it at a higher resolution and the rest of the document at 400 dpi.

    4. If a page has text together with color or grayscale graphs, we scan it twice. Then we modify images using Adobe Photoshop and combine two images in Acrobat.

    In figure 10, the grayscale image has a gray background and a true reproduction of the original photograph. The black-and-white scan has a white background and clean text, but details of the photograph are lost. The PDF converted from the grayscale image is 491 KB and has nine OCR errors. The PDF converted from the black-and-white image is 61 KB and has no OCR errors. The PDF converted from a combination of the grayscale and black-and-white images is 283 KB and has no OCR errors.

    The following are the steps used to create a PDF in figure 10 using Acrobat:

    1. Scan a page twice: grayscale

    Optimizer can be found at http://www.acrobatusers.com/tutorials/understanding-acrobats-optimizer.

    Method 3: Combining Different Scans

    Many documents have color covers and color or grayscale illustrations, but the majority of pages are text-only. It is not necessary to scan all pages of such documents in color or

    a higher resolution to a lower resolution and choose a different file compression. Different collections have different original sources; therefore, different settings should be applied. We normally run several tests for each collection and choose the settings that work best for it. We also make our PDFs compatible with Acrobat 6 to allow users with older versions of software to view our documents. A detailed tutorial of how to use the PDF

    Figure 9. Reduce file size example

    Figure 10. Reduce file size example: combine images


    help.html?content=WSfd1234e1c4b69f30ea53e41001031ab64-7757.html (accessed Mar. 3, 2010).

    3. Ted Padova, Adobe Acrobat 7 PDF Bible, 1st ed. (Indianapolis: Wiley, 2005).

    4. Olaf Drümmer, Alexandra Oettler, and Dietrich von Seggern, PDF/A in a Nutshell: Long Term Archiving with PDF (Berlin: Association for Digital Document Standards, 2007).

    5. PDF/A Competence Center, PDF/A: An ISO Standard - Future Development of PDF/A, http://www.pdfa.org/doku.php?id=pdfa:en (accessed July 20, 2010).

    6. PDF/A Competence Center, PDF/A - A new Standard for Long-Term Archiving, http://www.pdfa.org/doku.php?id=pdfa:en:pdfa_whitepaper (accessed July 20, 2010).

    7. Adobe, Creating Accessible PDF Documents with Adobe Acrobat 7.0: A Guide for Publishing PDF Documents for Use by People with Disabilities, 2005, http://www.adobe.com/enterprise/accessibility/pdfs/acro7_pg_ue.pdf (accessed Mar. 8, 2010).

    8. Adobe, Recognize Text in Scanned Documents, 2010, http://help.adobe.com/en_US/Acrobat/9.0/Standard/WS2A3DD1FA-CFA5-4cf6-B993-159299574AB8.w.html (accessed Mar. 8, 2010).

    9. Ibid.

    10. Ibid.

    11. Adobe, Reduce File Size by Saving, 2010, http://help.adobe.com/en_US/Acrobat/9.0/Standard/WS65C0A053-BC7C-49a2-88F1-B1BCD2524B68.w.html (accessed Mar. 3, 2010).

    the other 76 pages as grayscale and black-and-white. Then we used the procedure described above to combine text pages and photographs. The final PDF has clear text and correctly reproduced photographs. The example can be found at http://hdl.handle.net/10217/1553.

    Conclusion

    Our case study, as reported in this article, demonstrates the importance of investing the time and effort to apply the appropriate standards and techniques for scanning and optimizing digitized documents. If proper techniques are used, the final result will be Web-friendly resources that are easy to download, view, search, and print. Users will be left with a positive impression of the library and feel encouraged to use its materials and services again in the future.

    References

    1. BCR's CDP Digital Imaging Best Practices Working Group, BCR's CDP Digital Imaging Best Practices Version 2.0, June 2008, http://www.bcr.org/dps/cdp/best/digital-imaging-bp.pdf (accessed Mar. 3, 2010).

    2. Adobe, About File Formats and Compression, 2010, http://livedocs.adobe.com/en_US/Photoshop/10.0/

    and black-and-white.

    2. Crop out text on the grayscale scan using Photoshop.

    3. Delete the illustration on the black-and-white image using Photoshop.

    4. Create a PDF using the black-and-white image.

    5. Run the OCR process and save the file.

    6. Insert the color graph. Select Tools > Advanced Editing > TouchUp Object Tool. Right-click on the page and select Place Image. Locate the color graph in the Open dialog, then click Open and move the color graph to its correct location.

    7. Save the file and run the Reduce File Size or PDF Optimizer procedure.

    8. Save the file again.
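The cropping, deleting, and placing steps above are done by hand in Photoshop and Acrobat, but the underlying idea can be sketched in a few lines. The example below is only a toy model under stated assumptions: each scan is a small grid of gray values (0 is black, 255 is white), the illustration's bounding box is known in advance, and all function names are hypothetical. The black-and-white scan is simulated here by thresholding; in practice it is a separate scan. OCR and PDF creation are not modeled.

```python
WHITE = 255


def to_bitonal(gray, threshold=128):
    """Stand-in for a black-and-white scan: each pixel becomes 0 or 255."""
    return [[0 if px < threshold else WHITE for px in row] for row in gray]


def crop(img, box):
    """Keep only the illustration; box is (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def blank(img, box):
    """Delete the illustration from the text layer by painting the box white."""
    top, left, bottom, right = box
    return [[WHITE if top <= y < bottom and left <= x < right else px
             for x, px in enumerate(row)]
            for y, row in enumerate(img)]


def place(page, patch, top, left):
    """Place the cropped illustration at its original location on the page."""
    out = [row[:] for row in page]
    for dy, patch_row in enumerate(patch):
        out[top + dy][left:left + len(patch_row)] = patch_row
    return out


def combine(gray_scan, bw_scan, box):
    """Clean bitonal text everywhere, grayscale detail inside the box."""
    return place(blank(bw_scan, box), crop(gray_scan, box), box[0], box[1])


gray = [[200, 40, 210, 220],
        [190, 120, 90, 230],
        [205, 60, 150, 35]]
for row in combine(gray, to_bitonal(gray), (1, 1, 3, 3)):
    print(row)
```

Pixels outside the box end up pure black or white, which is what keeps the text clean and the file small, while the boxed region keeps its intermediate gray values.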

    This method produces the smallest file size with the best quality, but it is very time-consuming. At CSUL we used this method for some important documents, such as one of our institutional repository's showcase items, Agricultural Frontier to Electronic Frontier. The book has 220 pages, including a color cover, 76 pages with text and photographs, and 143 text-only pages. We used a color image for the cover page and 143 black-and-white images for the 143 text-only pages. We scanned

    Appendix A. Step-by-Step Creating a Full-Text Searchable PDF

    In this tutorial, we will show you how to create a full-text searchable PDF using Adobe Acrobat 9 Professional.

    Creating a PDF from a Scanner

    Adobe Acrobat Professional can create a PDF directly from a scanner. Acrobat 9 provides five options: Black and White Document, Grayscale Document, Color Document, Color Image, and Custom Scan. The Custom Scan option allows you to scan, run the OCR procedure, add metadata, combine multiple pages into one PDF, and also make it PDF/A compliant. To create a PDF from a scanner, go to File > Create PDF > From Scanner > Custom Scan. See figure 1.

    At CSUL, we do not directly create PDFs from scanners because our tests show that it can produce fuzzy text and it is not time efficient. Both scanning and running the OCR process can be very time-consuming. If an error occurs during these processes, we would have to start over again. We normally have student employees scan images on scanning stations


    or outsource them to vendors. Then library staff will perform quality control and create PDFs on separate machines. In this way, we can work on multiple documents at the same time and ensure that we provide high-quality PDFs.

    Creating a PDF from Scanned Images

    1. From the task bar select Combine > Merge Files into a Single PDF > From Multiple Files. See figure 2.

    2. In the Combine Files dialog, make sure the Single PDF radio button is selected. From the Add Files dropdown menu select Add Files. See figure 3.

    3. In the Add Files dialog, locate images and select multiple images by holding the Shift key, and then click the Add Files button.

    4. By default, Acrobat sorts files by filenames. Use the Move Up and Move Down buttons to change the image order and use the Remove button to delete images. Choose a target file size. The smallest icon produces a smaller file with lower image quality, and the largest icon produces a PDF with high image quality but a very large file size. We normally use the default file size setting, which is the middle icon.

    5. Save the file.

    At this point, the PDF is not full-text searchable.
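Because the merge order follows filenames, it can help to name scanned images so that any filename sort yields the page order. The source does not say whether Acrobat's sort is strictly character-by-character, so the sketch below only shows how a plain lexicographic sort behaves and how zero-padded page numbers avoid the problem. The filenames and helper names are hypothetical.

```python
import re


def natural_key(name: str):
    """Split digits from text so 'page10' sorts after 'page2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def zero_padded(name: str, width: int = 4) -> str:
    """Pad every run of digits, e.g. 'page2.tif' becomes 'page0002.tif'."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), name)


files = ["page10.tif", "page2.tif", "page1.tif"]

print(sorted(files))                         # plain sort puts page10 before page2
print(sorted(files, key=natural_key))        # page order
print(sorted(zero_padded(f) for f in files)) # padded names sort correctly as-is
```

Renaming files with padded numbers before merging reduces the need for the Move Up and Move Down buttons.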

    Making a Full-Text Searchable PDF

    A PDF document created from a scanned piece of paper is inherently inaccessible because the content of the document is an image, not searchable text. Assistive technology cannot read or extract the words, users cannot select or edit the text, and one cannot manipulate the PDF document for accessibility. Once optical character recognition (OCR) is properly applied to the scanned files, however, the image becomes searchable text with selectable graphics, and one may apply other accessibility features to the document.

    Adobe Acrobat Professional provides three OCR options: Searchable Image (Exact), Searchable Image, and Clean Scan. Because Searchable Image (Exact) is the only option that keeps the original look, we use only this option.

    To run an OCR procedure using Acrobat 9 Professional:

    1. Open a digitized PDF.

    2. Select Document > OCR text recognition > Recognize text using OCR.

    3. In the Recognize Text dialog, specify pages to be OCRed.

    4. In the Recognize Text dialog, click the Edit button in the Settings section to choose OCR language and PDF Output Style. We recommend the Searchable Image (Exact) option. Click OK. The setting will be remembered by the program and will be used until a new setting is chosen.

    Sometimes a PDF's file size increases greatly after an OCR process. If this happens, use the PDF Optimizer to reduce its file size.
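When many files are processed, the size check described above can be automated by comparing each file's size before and after OCR. The following is a minimal sketch, not part of our workflow: the 1.5 growth threshold is a hypothetical cutoff for "increases greatly," and the function names are ours.

```python
def growth_ratio(before_bytes: int, after_bytes: int) -> float:
    """How many times larger the file is after OCR."""
    return after_bytes / before_bytes


def needs_optimizing(before_bytes: int, after_bytes: int,
                     max_ratio: float = 1.5) -> bool:
    """Flag a file whose size grew past the (assumed) acceptable ratio."""
    return growth_ratio(before_bytes, after_bytes) > max_ratio


# Example with made-up sizes: 1.0 MB before OCR, 2.5 MB after.
print(needs_optimizing(1_000_000, 2_500_000))
```

In practice the two sizes could be read with `os.path.getsize` from copies of the file saved before and after the OCR step.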

    Figure 2. Merge files into a single PDF

    Figure 3. Combine Files dialog

    Figure 1. Acrobat 9 Professional's Create PDF from Scanner dialog