Making Scanned Documents Web Accessible
tHe Next GeNerAtioN liBrArY cAtAloG | ZHou 151Are Your DiGitAl DocuMeNts weB FrieNDlY? | ZHou 151
Are Your Digital Documents Web Friendly?: Making Scanned Documents Web Accessible
The Internet has greatly changed how library users search and use library resources. Many of them prefer resources available in electronic format over traditional print materials. While many documents are now born digital, many more are only accessible in print and need to be digitized. This paper focuses on how the Colorado State University Libraries creates and optimizes text-based and digitized PDF documents for easy access, downloading, and printing.
To digitize print materials, we normally scan originals, save them in archival digital formats, and then make them Web-accessible. There are two types of print documents, graphic-based and text-based. If we apply the same techniques to digitize these two different types of materials, the documents produced will not be Web-friendly.
Graphic-based materials include archival resources such as historical photographs, drawings, manuscripts, maps, slides, and posters. We normally scan them in color at a very high resolution to capture and present a reproduction that is as faithful to the original as possible. Then we save the scanned images in TIFF (Tagged Image File Format) for archival purposes and convert the TIFFs to JPEG (Joint Photographic Experts Group) 2000 or JPEG for Web access. However, the same practice is not suitable for modern text-based documents, such as reports, journal articles, meeting minutes, and theses and dissertations. Many old text-based documents (e.g., aged newspapers and books) should be
Yongli Zhou
Tutorial
files for fast Web delivery as access files. For text-based files, access files normally are PDFs that are converted from scanned images.
BCR's CDP Digital Imaging Best Practices Version 2.0 says that the master image should be the highest quality you can afford, it should not be edited or processed for any specific output, and it should be uncompressed.1 This statement applies to archival images, such as photographs, manuscripts, and other image-based materials. If we adopt the same approach for modern text documents, the result may be problematic. PDFs that are created from such master files may have the following drawbacks:
Because of their large file size, they require a long download time or cannot be downloaded because of a timeout error.
They may crash a user's computer because they use more memory while viewing.
They sometimes cannot be printed because of insufficient printer memory.
Poor print and on-screen viewing qualities can be caused by background noise and bleed-through of text. Background noise can be caused by stains, highlighter marks made by users, and yellowed paper from aged documents.
The OCR process sometimes does not work for high-resolution images.
Content creators need to spend more time scanning images at a high resolution and converting them to PDF documents.
Web-friendly files should be small, accessible by most users, full-text searchable, and have good
treated as graphic-based material. These documents often have faded text, unusual fonts, stains, and colored background. If they are scanned using the same practice as modern text documents, the document created can be unreadable and contain incorrect information. This topic is covered in the section "Full-Text Searchable PDFs and Troubleshooting OCR Errors."
Currently, PDF is the file format used for most digitized text documents. While PDFs that are created from high-resolution color images may be of excellent quality, they can have many drawbacks. For example, a multipage PDF may have a large file size, which increases download time and the memory required while viewing. Sometimes the download takes so long it fails because a time-out error occurs. Printers may have insufficient memory to print large documents. In addition, the Optical Character Recognition (OCR) process is not accurate for high-resolution images in either color or grayscale. As we know, users want the ability to easily download, view, print, and search online textual documents. All of the drawbacks created by high-quality scanning defeat one of the most important purposes of digitizing text-based documents: making them accessible to more users.
This paper addresses how Colorado State University Libraries (CSUL) manages these problems and others as staff create Web-friendly digitized textual documents. Topics include scanning, long-term archiving, full-text searchable PDFs and troubleshooting OCR problems, and optimizing PDF files for Web delivery.
Preservation Master Files and Access Files
For digitization projects, we normally refer to images in uncompressed TIFF format as master files and compressed
Yongli Zhou is Digital Repositories Librarian, Colorado State University Libraries, Colorado State University, Fort Collins, Colorado.
Information Technology and Libraries | September 2010
factors that determine PDF file size. Color images typically generate the largest PDFs and black-and-white images generate the smallest PDFs. Interestingly, an image of smaller file size does not necessarily generate a smaller PDF. Table 1 shows how file format and color mode affect PDF file size.
The source file is a page containing black-and-white text and line art drawings. Its physical dimensions are 8.047" by 10.893". All images were scanned at 300 dpi.
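As a rough cross-check of the uncompressed TIFF sizes reported in Table 1, the raw raster size of the test page can be estimated from its dimensions, resolution, and bit depth. The sketch below is an illustration only: the function name is hypothetical, and the estimate ignores TIFF headers and row padding, so it should land within a few percent of the table values rather than match them exactly.

```python
def uncompressed_kb(width_in, height_in, dpi, bits_per_pixel):
    """Approximate raw raster size in KB (1 KB = 1024 bytes).

    Assumption: no file header, metadata, or row padding is counted.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return total_bits / 8 / 1024


# The test page described in the text: 8.047" by 10.893" at 300 dpi.
for label, bits in [("24-bit color", 24),
                    ("8-bit grayscale", 8),
                    ("black-and-white", 1)]:
    size = uncompressed_kb(8.047, 10.893, 300, bits)
    print(f"{label}: about {size:,.0f} KB")
```

The three estimates can be compared against the uncompressed TIFF rows of Table 1 (23,141 KB, 7,729 KB, and 994 KB).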
CSUL uses Adobe Acrobat Professional to create PDFs from scanned images. The current version we use is Adobe Acrobat 9 Professional, but most of its features listed in this paper are available for other Acrobat versions. When Acrobat converts TIFF images to a PDF, it compresses images. Therefore a final PDF has a smaller file size than the total size of the original images. Acrobat compresses TIFF uncompressed, LZW, and Zip the same amount and produces PDFs of the same file size. Because our in-house scanning software does not support TIFF G4, we did not include TIFF G4 test data here. By comparing similar pages, we concluded that TIFF G4 works the same as TIFF uncompressed, LZW, and Zip. For example, if we scan a text-based page as black-and-white and save it separately in TIFF uncompressed, LZW, Zip, or G4, then convert each page into a PDF, the final PDF will have the same file size without a noticeable quality difference. TIFF JPEG generates the smallest PDF, but it is a lossy format, so it is not recommended. Both JPEG and JPEG 2000 have smaller file sizes but generate larger PDFs than those converted from TIFF images.
Recommendations
1. Use TIFF uncompressed or LZW in 24 bits color for pages with color graphs or for historical documents.
2. Use TIFF uncompressed or LZW
compress an image up to 50 percent. Some vendors hesitate to use this format because it was proprietary; however, the patent expired on June 20, 2003. This format has been widely adopted by much software and is safe to use. CSUL saves all scanned text documents in this format.
TIFF Zip: This is a lossless compression. Like LZW, ZIP compression is most effective for images that contain large areas of single color.2
TIFF JPEG: This is a JPEG file stored inside a TIFF tag. It is a lossy compression, so CSUL does not use this file format.
Other image formats:
JPEG: This format is a lossy compression and can only be used for nonarchival purposes. A JPEG image can be converted to PDF or embedded in a PDF. However, a PDF created from JPEG images has a much larger file size compared to a PDF created from TIFF images.
JPEG 2000: This format's file extension is .jp2. This format offers superior compression performance and other advantages. JPEG 2000 normally is used for archival photographs, not for text-based documents.
In short, scanned images should be saved as TIFF files, either with compression or without. We recommend saving text-only pages and pages containing text and/or line art as TIFF G4 or TIFF LZW. We also recommend saving pages with photographs and illustrations as TIFF uncompressed or TIFF LZW.
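The recommendation above can be written down as a small lookup, which may be convenient when a scanning checklist is generated by script. This is only a sketch: the page-type labels and the function name are hypothetical, and the returned format names simply restate the recommendation in the text.

```python
# Page-type labels are illustrative assumptions, not terms from any tool.
RECOMMENDED_FORMATS = {
    "text_or_line_art": ["TIFF G4", "TIFF LZW"],
    "photographs_or_illustrations": ["TIFF uncompressed", "TIFF LZW"],
}


def recommended_formats(page_type):
    """Return the lossless TIFF variants recommended for a page type."""
    try:
        return RECOMMENDED_FORMATS[page_type]
    except KeyError:
        raise ValueError(f"unknown page type: {page_type!r}")


print(recommended_formats("text_or_line_art"))
```

Lossy formats (TIFF JPEG, JPEG) are deliberately absent from the lookup because the text rules them out for master files.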
How Image Format and Color Mode Affect PDF File Size
Color mode and file format are two
on-screen viewing and print qualities. In the following sections, we will discuss how to make scanned documents Web-friendly.
Scanning
There are three main factors that affect the quality and file size of a digitized document: file format, color mode, and resolution of the source images. These factors should be kept in mind when scanning text documents.
File Format and Compression
Most digitized documents are scanned and saved as TIFF files. However, there are many different formats of TIFF. Which one is appropriate for your project?
TIFF: Uncompressed format. This is a standard format for scanned images. However, an uncompressed TIFF file has the largest file size and requires more space to store.
TIFF G3: TIFF with G3 compression is the universal standard for faxes and multipage line-art documents. It is used for black-and-white documents only.
TIFF G4: TIFF with G4 compression has been approved as a lossless archival file format for bitonal images. TIFF images saved in this compression have the smallest file size. It is a standard file format used by many commercial scanning vendors. It should only be used for pages with text or line art. Many scanning programs do not provide this file format by default.
TIFF Huffman: A method for compressing bi-level data based on the CCITT Group 3 1D facsimile compression schema.
TIFF LZW: This format uses a lossless compression that does not discard details from images. It may be used for bitonal, grayscale, and color images. It may
to be scanned at no less than 600 dpi in color. Our experiments show that documents scanned at 300 or 400 dpi are sufficient for creating PDFs of good quality. Resolutions lower than 300 dpi are not recommended because they can degrade image quality and produce more OCR errors. Resolutions higher than 400 dpi also are not recommended because they generate large files with little improved on-screen viewing and print quality. We compared PDF files that were converted from images of resolutions at 300, 400, and 600 dpi. Viewed at 100 percent, the difference in image quality both on screen and in print was negligible. If a page has text with very small font, it can be scanned at a higher resolution to improve OCR accuracy and viewing and print quality.
Table 2 shows that high-resolution images produce large files and require more time to be converted into PDFs. The time required to combine images is not significantly different compared to scanning time and OCR time, so it was omitted. Our example is a modern text document with text and a black-and-white chart.
Most of our digitization projects do not require scanning at 600 dpi; 300 dpi is the minimum requirement. We use 400 dpi for most documents and choose a proper color mode for each page. For example, we scan our theses and dissertations in black-and-white at 400 dpi for bitonal pages. We scan pages containing photographs or illustrations in 8-bit grayscale or 24-bit color at 400 dpi.
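The per-page choice of color mode and resolution described above can be sketched as a small helper. The page-kind labels, class name, and function name below are hypothetical; the default of 400 dpi and the 300 dpi floor restate the practice described in the text, not the behavior of any scanning software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanSettings:
    color_mode: str
    bits_per_pixel: int
    dpi: int


# Page-kind labels are illustrative assumptions.
COLOR_MODES = {
    "text_or_line_art": ("black-and-white", 1),
    "grayscale_graphics": ("grayscale", 8),
    "color_graphics": ("color", 24),
}


def choose_scan_settings(page_kind, dpi=400):
    """Pick a color mode for a page; reject resolutions below 300 dpi."""
    if dpi < 300:
        raise ValueError("resolutions below 300 dpi are not recommended")
    mode, bits = COLOR_MODES[page_kind]
    return ScanSettings(mode, bits, dpi)


print(choose_scan_settings("text_or_line_art"))
print(choose_scan_settings("color_graphics"))
```

A page with a very small font could be passed a higher `dpi` value, in line with the exception noted earlier in this section.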
Other Factors that Affect PDF File Size
In addition to the three main factors we have discussed, unnecessary edges, bleed-through of text and graphs, background noise, and blank pages also increase PDF file sizes. Figure 1 shows how a clean scan can largely reduce a PDF file size and
cover. The updated file has a file size of 42.8 MB. The example can be accessed at http://hdl.handle.net/10217/3667. Sometimes we scan a page containing text and photographs or illustrations twice, in color or grayscale and in black-and-white. When we create a PDF, we combine two images of the same page to reproduce the original appearance and to reduce file size. How to optimize PDFs using multiple scans will be discussed in a later section.
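To put the example of the document with a color cover and 790 text pages (384 MB reduced to 42.8 MB) in perspective, the short sketch below computes the percentage saved and an idealized download time. The 10 Mbit/s link speed is purely an assumed figure for illustration, and the calculation treats 1 MB as 8 megabits with no protocol overhead.

```python
def percent_reduction(original_size, optimized_size):
    """Percentage by which the optimized file is smaller."""
    return (original_size - optimized_size) / original_size * 100


def download_seconds(size_mb, link_mbit_per_s):
    """Idealized transfer time: 8 bits per byte, no overhead (assumption)."""
    return size_mb * 8 / link_mbit_per_s


ASSUMED_LINK_MBIT_PER_S = 10  # hypothetical connection speed

print(f"reduction: {percent_reduction(384, 42.8):.1f} percent")
for size_mb in (384, 42.8):
    seconds = download_seconds(size_mb, ASSUMED_LINK_MBIT_PER_S)
    print(f"{size_mb} MB: about {seconds:.0f} seconds")
```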
How Image Resolution Affects PDF File Size
Before we start scanning, we check with our project manager regarding project standards. For some funded projects, documents are required
in grayscale 8 bits for pages with black-and-white photographs or grayscale illustrations.
3. Use TIFF uncompressed, LZW, or G4 in black-and-white for pages containing text or line art.
To achieve the best result, each page should be scanned accordingly. For example, we had a document with a color cover, 790 pages containing text and line art, and 7 blank pages. We scanned the original document in color at 300 dpi. The PDF created from these images was 384 MB, so large that it exceeded the maximum file size that our repository software allows for uploading. To optimize the document, we deleted all blank pages, converted the 790 pages with text and line art from color to black-and-white, and retained the color
Table 1. File format and color mode versus PDF file size
File Format | Scan Specifications | TIFF Size (KB) | PDF Size (KB)
TIFF | Color 24 bits | 23,141 | 900
TIFF LZW | Color 24 bits | 5,773 | 900
TIFF ZIP | Color 24 bits | 4,892 | 900
TIFF JPEG | Color 24 bits | 4,854 | 873
JPEG 2000 | Color 24 bits | 5,361 | 5,366
JPEG | Color 24 bits | 4,849 | 5,066
TIFF | Grayscale 8 bits | 7,729 | 825
TIFF LZW | Grayscale 8 bits | 2,250 | 825
TIFF ZIP | Grayscale 8 bits | 1,832 | 825
TIFF JPEG | Grayscale 8 bits | 2,902 | 804
JPEG 2000 | Grayscale 8 bits | 2,266 | 2,270
JPEG | Grayscale 8 bits | 2,886 | 3,158
TIFF | Black-and-white | 994 | 116
TIFF LZW | Black-and-white | 242 | 116
TIFF ZIP | Black-and-white | 196 | 116
Note: Black-and-white scans cannot be saved in JPEG, JPEG 2000, or TIFF JPEG formats.
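One way to read Table 1 is to ask which source formats yield a PDF that is larger than the source image itself. The sketch below transcribes the table values into a list and filters on that condition; the variable and function names are illustrative only.

```python
# (file format, color mode, source image KB, PDF KB), copied from Table 1.
TABLE_1 = [
    ("TIFF", "color", 23141, 900),
    ("TIFF LZW", "color", 5773, 900),
    ("TIFF ZIP", "color", 4892, 900),
    ("TIFF JPEG", "color", 4854, 873),
    ("JPEG 2000", "color", 5361, 5366),
    ("JPEG", "color", 4849, 5066),
    ("TIFF", "grayscale", 7729, 825),
    ("TIFF LZW", "grayscale", 2250, 825),
    ("TIFF ZIP", "grayscale", 1832, 825),
    ("TIFF JPEG", "grayscale", 2902, 804),
    ("JPEG 2000", "grayscale", 2266, 2270),
    ("JPEG", "grayscale", 2886, 3158),
    ("TIFF", "black-and-white", 994, 116),
    ("TIFF LZW", "black-and-white", 242, 116),
    ("TIFF ZIP", "black-and-white", 196, 116),
]


def pdf_larger_than_source(rows):
    """Return (format, mode) pairs whose PDF exceeds the source image size."""
    return [(fmt, mode) for fmt, mode, src_kb, pdf_kb in rows if pdf_kb > src_kb]


for fmt, mode in pdf_larger_than_source(TABLE_1):
    print(f"{fmt} ({mode}): PDF is larger than the source image")
```

Only the JPEG and JPEG 2000 rows meet the condition, which matches the observation that a smaller source image does not necessarily produce a smaller PDF.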
Many PDF files cannot be saved as PDF/A files. If an error occurs when saving a PDF to PDF/A, you may use Adobe Acrobat Preflight (Advanced > Preflight) to identify problems. See figure 2.
Errors can be created by nonembedded fonts, embedded images with unsupported file compression, bookmarks, embedded video and audio, etc. By default, the Reduce File Size procedure in Acrobat Professional compresses color images using JPEG 2000 compression. After running the Reduce File Size procedure, a PDF may not be saved as a PDF/A because of a "JPEG 2000 compression used" error. According to the PDF/A Competence Center, this problem will be eliminated in the second part of the PDF/A standard; PDF/A-2 is planned for 2008/2009. There are many other features in new PDFs; for example, PDF/A-2 will allow transparency and layers.5 However, at the time this paper was written PDF/A-2 had not been announced.6
portable, which means the file created on one computer can be viewed with an Acrobat viewer on other computers, handheld devices, and on other platforms.3
A PDF/A document is basically a traditional PDF document that fulfills precisely defined specifications. The PDF/A standard aims to enable the creation of PDF documents whose visual appearance will remain the same over the course of time. These files should be software-independent and unrestricted by the systems used to create, store, and reproduce them.4 The goal of PDF/A is long-term archiving. A PDF/A document has the same file extension as a regular PDF file and must be at least compatible with Acrobat Reader 4.
There are many ways to create a PDF/A document. You can convert existing images and PDF files to PDF/A files, export a document to PDF/A format, or scan to PDF/A, to name a few. There are many software programs you can use to create PDF/A, such as Adobe Acrobat Professional 8 and later versions, Compart AG, PDFlib, and PDF Tools AG.
simultaneously improve its viewing and print quality.
Recommendations
1. Unnecessary edges: Crop out.
2. Bleed-through text or graphs: Place a piece of white or black card stock on the back of a page. If a page is single sided, use white card stock. If a page is double sided, use black card stock and increase contrast ratio when scanning. Often color or grayscale images have bleed-through problems. Scanning a page containing text or line art as black-and-white will eliminate bleed-through text and graphs.
3. Background noise: Scanning a page containing text or line art as black-and-white can eliminate background noise. Many aged documents have yellowed papers. If we scan them as color or grayscale, the result will be images with yellow or gray background, which may increase PDF file sizes greatly. We also recommend increasing the contrast for better OCR results when scanning documents with background colors.
4. Blank pages: Do not include if they are not required. Blank pages scanned in grayscale or color can quickly increase file size.
PDF and Long-Term Archiving PDF/A
PDF vs. PDF/A
PDF, short for Portable Document Format, was developed by Adobe as a unique format to be viewed through Adobe Acrobat viewers. As the name implies, it is
Table 2. Color Mode and Image Resolution vs. PDF File Size
Color mode | Resolution (dpi) | Scanning time (sec.) | OCR time (sec.) | TIFF LZW (KB) | PDF size (KB)
color | 600 | 100 | N/A* | 16,498 | 2,391
color | 400 | 25 | 35 | 7,603 | 1,491
color | 300 | 18 | 16 | 5,763 | 952
grayscale | 600 | 36 | 33 | 6,097 | 2,220
grayscale | 400 | 18 | 18 | 2,888 | 1,370
grayscale | 300 | 14 | 12 | 2,240 | 875
B/W | 600 | 12 | 18 | 559 | 325
B/W | 400 | 10 | 10 | 333 | 235
B/W | 300 | 8 | 9 | 232 | 140
*N/A due to an OCR error
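The cost of doubling the resolution can be read directly off Table 2. The sketch below transcribes the PDF sizes and computes, for each color mode, how many times larger the 600 dpi PDF is than the 300 dpi PDF; the names are illustrative only.

```python
# PDF size in KB by color mode and resolution (dpi), copied from Table 2.
PDF_KB = {
    "color": {600: 2391, 400: 1491, 300: 952},
    "grayscale": {600: 2220, 400: 1370, 300: 875},
    "B/W": {600: 325, 400: 235, 300: 140},
}


def growth_factor(mode, low_dpi=300, high_dpi=600):
    """How many times larger the PDF is at high_dpi than at low_dpi."""
    return PDF_KB[mode][high_dpi] / PDF_KB[mode][low_dpi]


for mode in PDF_KB:
    print(f"{mode}: 600 dpi PDF is {growth_factor(mode):.2f} times the 300 dpi PDF")
```

In every color mode the 600 dpi PDF is more than twice the size of the 300 dpi PDF, while the text reports a negligible visible difference at 100 percent.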
manipulate the PDF document for accessibility. Once OCR is properly applied to the scanned files, however, the image becomes searchable text with selectable graphics, and one may apply other accessibility features to the document.7
Acrobat Professional provides three OCR options:
1. Searchable Image (Exact): Ensures that text is searchable and selectable. This option keeps the original image and places an invisible text layer over it. Recommended for cases requiring maximum fidelity to the original image.8 This is the only option used by CSUL.
2. Searchable Image: Ensures that text is searchable and selectable. This option keeps the original image, de-skews it as needed, and places an invisible text layer over it. The selection for downsample images in this same dialog box determines whether the image is downsampled and to what extent.9 The downsampling combines several pixels in an image to make a single larger pixel; thus some information is deleted from the image. However, downsampling does not affect the quality of text or line art. When a proper setting is used, the size of a PDF can be significantly reduced with little or no loss of detail and precision.
3. ClearScan: Synthesizes a new Type 3 font that closely approximates the original, and preserves the page background using a low-resolution copy.10 The final PDF is the same as a born-digital PDF. Because Acrobat cannot guarantee the accuracy of
Full-Text Searchable PDFs and Troubleshooting OCR Errors
A PDF created from a scanned piece of paper is inherently inaccessible because the content of the document is an image, not searchable text. Assistive technology cannot read or extract the words, users cannot select or edit the text, and one cannot
Figure 1. PDFs Converted from different images: (a) the original PDF converted from a grayscale image and with unnecessary edges; (b) updated PDF converted from a black-and-white image and with edges cropped out; (c) screen viewed at 100 percent of the PDF in grayscale; and (d) screen viewed at 100 percent of the PDF in black-and-white.
Dimensions: 9.127 X 11.455; Color Mode: grayscale; Resolution: 600 dpi; TIFF LZW: 12.7 MB; PDF: 1,051 KB
Dimensions: 8 X 10.4; Color Mode: black-and-white; Resolution: 400 dpi; TIFF LZW: 153 KB; PDF: 61 KB
Figure 2. Example of Adobe Acrobat 9 Preflight
but at least users can read all text, while the black-and-white scan contains unreadable words.
Troubleshoot OCR Error 3: Cannot OCR Image Based Text
The search of a digitized PDF is actually performed on its invisible text layer. The automated OCR process inevitably produces some incorrectly recognized words. For example, Acrobat cannot recognize the Colorado State University Logo correctly (see figure 6).
Unfortunately, Acrobat does not provide a function to edit a PDF file's invisible text layer. To manually edit or add OCRed text, Adobe Acrobat Capture 3.0 (see figure 7) must be purchased. However, our tests show that Capture 3.0 has many drawbacks. This software is complicated and produces its own errors. Sometimes it consolidates words; other times it breaks them up. In addition, it is time-consuming to add or modify invisible text layers using Acrobat Capture 3.0.
At CSUL, we manually add searchable text for title and abstract pages only if they cannot be OCRed by Acrobat correctly. The example in
Troubleshoot OCR Error 2: Could Not Perform Recognition (OCR)
Sometimes Acrobat generates an "Outside of the Allowed Specifications" error when processing OCR. This error is normally caused by color images scanned at 600 dpi or more.
In the example in figure 4, the page only contains text but was scanned in color at 600 dpi. When we scanned this page as black-and-white at 400 dpi, we did not encounter this problem. We could also use a lower-resolution color scan to avoid this error. Our experiments also show that images scanned in black-and-white work best for the OCR process.
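Because the error described above is tied to a recognizable combination of settings, a batch of scans could be screened before OCR is attempted. The following is a minimal sketch of such a check; the function name and the mode labels are hypothetical, and the rule encodes only the observation reported in the text, not any documented Acrobat limit.

```python
def ocr_precheck(color_mode, dpi):
    """Return a warning for scans likely to fail OCR, otherwise None.

    Rule of thumb taken from the text: color images at 600 dpi or more
    normally trigger the recognition error.
    """
    if color_mode == "color" and dpi >= 600:
        return ("color scan at 600 dpi or more: rescan in black-and-white "
                "at 400 dpi or use a lower-resolution color scan")
    return None


for mode, dpi in [("color", 600), ("color", 300), ("black-and-white", 400)]:
    warning = ocr_precheck(mode, dpi)
    print(mode, dpi, "->", warning or "no known OCR risk")
```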
In this article we mainly discuss running the OCR process on modern textual documents. Black-and-white scans do not work well for historical textual documents or aged newspapers. These documents may have faded text and background noise. When they are scanned as black-and-white, broken letters may occur, and some text might become unreadable. For this reason they should be scanned in color or grayscale. In figure 5, images scanned in color might not produce accurate OCR results,
OCRed text at 100 percent, this option is not acceptable for us.
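The downsampling mentioned under the Searchable Image option, in which several pixels are combined into a single larger pixel, can be illustrated with a short pure-Python sketch. Averaging each block is only one possible way to combine pixels and is an assumption made here for illustration; the source does not describe the resampling method Acrobat actually uses.

```python
def downsample(pixels, factor):
    """Shrink a grayscale image by averaging factor-by-factor blocks.

    pixels: list of rows, each a list of ints in 0..255.
    Assumption: both dimensions are exact multiples of factor.
    """
    height, width = len(pixels), len(pixels[0])
    if height % factor or width % factor:
        raise ValueError("image dimensions must be multiples of factor")
    result = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [pixels[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) // len(block))  # one pixel replaces the block
        result.append(row)
    return result


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [10, 20, 30, 40],
    [10, 20, 30, 40],
]
print(downsample(image, 2))
```

The 4-by-4 sample becomes a 2-by-2 image, so three quarters of the stored values are discarded, which is the information loss the text refers to.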
For a tutorial on how to make a full-text searchable PDF, please see appendix A.
Troubleshoot OCR Error 1: Acrobat Crashes
Occasionally Acrobat crashes during the OCR process. The error message does not indicate what causes the crash and where the problem occurs. Fortunately, the page number of the error can be found on the top shortcuts menu. In figure 3, we can see the error occurs on page 7.
We discovered that errors are often caused by figures or diagrams. For a problem like this, the solution is to skip the error-causing page when running the OCR process. Our initial research was performed on Acrobat 8 Professional. Our recent study shows that this problem has been significantly improved in Acrobat 9 Professional.
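Skipping an error-causing page means running OCR on the page ranges on either side of it. The sketch below works out those ranges for any set of pages to skip; the function name and the 20-page document are hypothetical, and the ranges would still be entered by hand in the OCR dialog.

```python
def ocr_page_ranges(total_pages, skip_pages):
    """Return inclusive (first, last) page ranges that avoid skip_pages."""
    skip = set(skip_pages)
    ranges = []
    start = None
    for page in range(1, total_pages + 1):
        if page in skip:
            if start is not None:
                ranges.append((start, page - 1))  # close the open range
                start = None
        elif start is None:
            start = page  # open a new range
    if start is not None:
        ranges.append((start, total_pages))
    return ranges


# Hypothetical 20-page document in which page 7 crashes the OCR process.
print(ocr_page_ranges(20, [7]))
```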
Figure 3. Adobe Acrobat 8 Professional crash window
Figure 4. Could not perform recognition (OCR) error
Figure 5. An aged newspaper scanned in color and black-and-white
Aged Newspaper Scanned in Color Aged Newspaper Scanned in Black-and-White
a very light yellow background. The undesirable marks and background contribute to its large file size and create ink waste when printed.
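The effect of converting a color scan to black-and-white can be illustrated with a simple luminance threshold. This is a sketch under stated assumptions: the source does not say which tool or algorithm performed the batch conversion, and the luminance weights and the threshold of 128 are common conventions chosen here for illustration.

```python
def to_black_and_white(rgb_rows, threshold=128):
    """Map each (r, g, b) pixel to 1 (white) or 0 (black).

    Assumption: a fixed luminance threshold; real batch tools may use
    adaptive thresholds or dithering instead.
    """
    result = []
    for row in rgb_rows:
        result.append([
            1 if (0.299 * r + 0.587 * g + 0.114 * b) >= threshold else 0
            for r, g, b in row
        ])
    return result


# One row: yellowed paper, dark text, a faint gray mark.
sample = [[(250, 240, 200), (30, 30, 30), (200, 200, 200)]]
print(to_black_and_white(sample))
```

In this sample the yellowed background and the faint mark both become white while the dark text stays black, which is why a bitonal version of a text page can look cleaner and compress better.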
Method 2: Running Acrobat's Built-In Optimization Processes
Acrobat provides three built-in processes to reduce file size. By default, Acrobat uses JPEG compression for color and grayscale images and CCITT Group 4 compression for bitonal images.
Optimize Scanned PDF
Open a scanned PDF and select Document > Optimize Scanned PDF. A number of settings, such as image quality and background removal, can be specified in the Optimize Scanned PDF dialog box. Our experiments show this process can noticeably degrade images and sometimes even increase file size. Therefore we do not use this option.
Reduce File Size
Open a scanned PDF and select Document > Reduce File Size. The Reduce File Size command resamples and recompresses images, removes embedded Base-14 fonts, and subset-embeds fonts that were left embedded. It also compresses document structure and cleans up elements such as invalid bookmarks. If the file size is already as small as possible, this command has no effect.11 After processing, some files cannot be saved as PDF/A, as we discussed in a previous section. We also noticed that different versions of Acrobat can create files of different file sizes even if the same settings were used.
PDF Optimizer
Open a scanned PDF and select Advanced > PDF Optimizer. Many settings can be specified in the PDF Optimizer dialog box. For example, we can downsample images from
sections, we can greatly reduce a PDF's size by using an appropriate color mode and resolution. Figure 9 shows two different versions of a digitized document. The source document has a color cover and 111 bitonal pages. The original PDF, shown in figure 9 on the left, was created by another university department. It was not scanned according to standards and procedures adopted by CSUL. It was scanned in color at 300 dpi and has a file size of 66,265 KB. We exported the original PDF as TIFF images, batch-converted color TIFF images to black-and-white TIFF images, and then created a new PDF using black-and-white TIFF images. The updated PDF has a file size of 8,842 KB. The image on the right is much cleaner and has better print quality. The file on the left has unwanted marks and
figure 8 is a book title page for which we used Acrobat Capture 3.0 to manually add searchable text. The entire book may be accessed at http://hdl.handle.net/10217/1553.
Optimizing PDFs for Web Delivery
A digitized PDF file with 400 color pages may be as large as 200 to 400 MB. Most of the time, optimizing processes may reduce files this large without a noticeable difference in quality. In some cases, quality may be improved. We will discuss three optimization methods we use.
Method 1: Using an Appropriate Color Mode and Resolution
As we have discussed in previous
Figure 6. Incorrectly recognized text sample (panels: Original Logo; Text OCRed by Acrobat)
Figure 7. Adobe Acrobat capture interface
Figure 8. Image-based text sample
grayscale. A PDF may contain pages that were scanned with different color modes and resolutions. A PDF may also have pages of mixed resolutions. One page may contain both bitonal images and color or grayscale images, but they must be of the same resolution.
The following strategies were adopted by CSUL:
1. Combine bitmap, grayscale, and color images. We use grayscale images for pages that contain grayscale graphs, such as black-and-white photos, color images for pages that contain color images, and bitmap images for text-only or text and line art pages.
2. If a page contains high-definition color or grayscale images, scan that page in a higher resolution and scan other pages at 400 dpi.
3. If a page contains a very small font and the OCR process does not work well, scan it at a higher resolution and the rest of the document at 400 dpi.
4. If a page has both text and color or grayscale graphs, we scan it twice. Then we modify images using Adobe Photoshop and combine two images in Acrobat.
In figure 10, the grayscale image has a gray background and a true reproduction of the original photograph. The black-and-white scan has a white background and clean text, but details of the photograph are lost. The PDF converted from the grayscale image is 491 KB and has nine OCR errors. The PDF converted from the black-and-white image is 61 KB and has no OCR errors. The PDF converted from a combination of the grayscale and black-and-white images is 283 KB and has no OCR errors.
The following are the steps used to create a PDF in figure 10 using Acrobat:
1. Scan a page twice: grayscale
Optimizer can be found at http://www.acrobatusers.com/tutorials/understanding-acrobats-optimizer.
Method 3: Combining Different Scans
Many documents have color covers and color or grayscale illustrations, but the majority of pages are text-only. It is not necessary to scan all pages of such documents in color or
a higher resolution to a lower resolution and choose a different file compression. Different collections have different original sources, therefore different settings should be applied. We normally do several tests for each collection and choose the one that works best for it. We also make our PDFs compatible with Acrobat 6 to allow users with older versions of software to view our documents. A detailed tutorial of how to use the PDF
Figure 9. Reduce file size example
Figure 10. Reduce file size example: combine images
help.html?content=WSfd1234e1c4b69f30ea53e41001031ab64-7757.html (accessed Mar. 3, 2010).
3. Ted Padova, Adobe Acrobat 7 PDF Bible, 1st ed. (Indianapolis: Wiley, 2005).
4. Olaf Drümmer, Alexandra Oettler, and Dietrich von Seggern, PDF/A in a Nutshell - Long Term Archiving with PDF (Berlin: Association for Digital Document Standards, 2007).
5. PDF/A Competence Center, PDF/A: An ISO Standard - Future Development of PDF/A, http://www.pdfa.org/doku.php?id=pdfa:en (accessed July 20, 2010).
6. PDF/A Competence Center, PDF/A - A New Standard for Long-Term Archiving, http://www.pdfa.org/doku.php?id=pdfa:en:pdfa_whitepaper (accessed July 20, 2010).
7. Adobe, Creating Accessible PDF Documents with Adobe Acrobat 7.0: A Guide for Publishing PDF Documents for Use by People with Disabilities, 2005, http://www.adobe.com/enterprise/accessibility/pdfs/acro7_pg_ue.pdf (accessed Mar. 8, 2010).
8. Adobe, Recognize Text in Scanned Documents, 2010, http://help.adobe.com/en_US/Acrobat/9.0/Standard/WS2A3DD1FA-CFA5-4cf6-B993-159299574AB8.w.html (accessed Mar. 8, 2010).
9. Ibid.
10. Ibid.
11. Adobe, Reduce File Size by Saving, 2010, http://help.adobe.com/en_US/Acrobat/9.0/Standard/WS65C0A053-BC7C-49a2-88F1-B1BCD2524B68.w.html (accessed Mar. 3, 2010).
the other 76 pages as grayscale and black-and-white. Then we used the procedure described above to combine text pages and photographs. The final PDF has clear text and correctly reproduced photographs. The example can be found at http://hdl.handle.net/10217/1553.
Conclusion
Our case study, as reported in this article, demonstrates the importance of investing the time and effort to apply the appropriate standards and techniques for scanning and optimizing digitized documents. If proper techniques are used, the final result will be Web-friendly resources that are easy to download, view, search, and print. Users will be left with a positive impression of the library and feel encouraged to use its materials and services again in the future.
References
1. BCR's CDP Digital Imaging Best Practices Working Group, BCR's CDP Digital Imaging Best Practices Version 2.0, June 2008, http://www.bcr.org/dps/cdp/best/digital-imaging-bp.pdf (accessed Mar. 3, 2010).
2. Adobe, About File Formats and Compression, 2010, http://livedocs.adobe.com/en_US/Photoshop/10.0/
and black-and-white.
2. Crop out text on the grayscale scan using Photoshop.
3. Delete the illustration on the black-and-white image using Photoshop.
4. Create a PDF using the black-and-white image.
5. Run the OCR process and save the file.
6. Insert the color graph. Select Tools > Advanced Editing > TouchUp Object Tool. Right-click on the page and select Place Image. Locate the color graph in the Open dialog, then click Open and move the color graph to its correct location.
7. Save the file and run the Reduce File Size or PDF Optimizer procedure.
8. Save the file again.
This method produces the smallest file size with the best quality, but it is very time-consuming. At CSUL we used this method for some important documents, such as one of our institutional repository's showcase items, Agricultural Frontier to Electronic Frontier. The book has 220 pages, including a color cover, 76 pages with text and photographs, and 143 text-only pages. We used a color image for the cover page and 143 black-and-white images for the 143 text-only pages. We scanned
Appendix A. Step-by-Step Creating a Full-Text Searchable PDF
In this tutorial, we will show you how to create a full-text searchable PDF using Adobe Acrobat 9 Professional.
Creating a PDF from a Scanner
Adobe Acrobat Professional can create a PDF directly from a scanner. Acrobat 9 provides five options: Black and White Document, Grayscale Document, Color Document, Color Image, and Custom Scan. The custom scan option allows you to scan, run the OCR procedure, add metadata, combine multiple pages into one PDF, and also make it PDF/A compliant. To create a PDF from a scanner, go to File > Create PDF > From Scanner > Custom Scan. See figure 1.
At CSUL, we do not directly create PDFs from scanners because our tests show that it can produce fuzzy text and it is not time efficient. Both scanning and running the OCR process can be very time consuming. If an error occurs during these processes, we would have to start over again. We normally have images scanned on scanning stations by student employees
or outsource them to vendors. Then library staff will perform quality control and create PDFs on separate machines. In this way, we can work on multiple documents at the same time and ensure that we provide high-quality PDFs.
Creating a PDF from Scanned Images
1. From the task bar select Combine > Merge Files into a single PDF > From Multiple Files. See figure 2.
2. In the Combine Files dialog, make sure the Single PDF radio button is selected. From the Add Files dropdown menu select Add Files. See figure 3.
3. In the Add Files dialog, locate images and select multiple images by holding the Shift key, and then click the Add Files button.
4. By default, Acrobat sorts files by file names. Use the Move Up and Move Down buttons to change image order and use the Remove button to delete images. Choose a target file size. The smallest icon will produce a file with a smaller file size but a lower image quality PDF, and the largest icon will produce a high image quality PDF but with a very large file size. We normally use the default file size setting, which is the middle icon.
5. Save the file.
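Because the files are ordered by file name in step 4, the naming of scanned images affects how much manual reordering is needed. If the ordering is a plain character-by-character comparison (an assumption; the source does not specify the sort rule), a file named page10.tif would come before page2.tif. The sketch below shows the difference and a numeric-aware sort key that could be used to check or rename files beforehand; the file names are hypothetical.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers by value, not by character."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


names = ["page10.tif", "page2.tif", "page1.tif"]  # hypothetical scan names

print(sorted(names))                   # plain character-by-character order
print(sorted(names, key=natural_key))  # page order a reader would expect
```

Naming files with zero-padded numbers (for example page001.tif) makes both orderings agree, so no reordering in the Combine Files dialog should be needed.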
At this point, the PDF is not full-text searchable.
Making a Full-Text Searchable PDF
A PDF document created from a scanned piece of paper is inherently inaccessible because the content of the document is an image, not searchable text. Assistive technology cannot read or extract the words, users cannot select or edit the text, and one cannot manipulate the PDF document for accessibility. Once optical character recognition (OCR) is properly applied to the scanned files, however, the image becomes searchable text with selectable graphics, and one may apply other accessibility features to the document.
Adobe Acrobat Professional provides three OCR options: Searchable Image (Exact), Searchable Image, and ClearScan. Because Searchable Image (Exact) is the only option that keeps the original look, we only use this option.
To run an OCR procedure using Acrobat 9 Professional:
1. Open a digitized PDF.
2. Select Document > OCR text recognition > Recognize text using OCR.
3. In the Recognize Text dialog, specify pages to be OCRed.
4. In the Recognize Text dialog, click the Edit button in the Settings section to choose OCR language and PDF Output Style. We recommend the Searchable Image (Exact) option. Click OK. The setting will be remembered by the program and will be used until a new setting is chosen.
Sometimes a PDF's file size increases greatly after an OCR process. If this happens, use the PDF Optimizer to reduce its file size.
Figure 2. Merge files into a single PDF
Figure 3. Combine Files dialog
Figure 1. Acrobat 9 Professional's Create PDF from Scanner Dialog