
Microfilm, Paper, and OCR: Issues in Newspaper Digitization
The Utah Digital Newspapers Program

by Kenning Arlitsch and John Herbert

Kenning Arlitsch and John Herbert are both at the J. Willard Marriott Library, University of Utah. Mr. Arlitsch (kenning.arlitsch@library.utah.edu - 295 S. 1500 East, Room 463, Salt Lake City, UT 84112) is Head of Information Technology, and Mr. Herbert ([email protected] - 295 S. 1500 East, Room 418, Salt Lake City, UT 84112) is Program Director - Utah Digital Newspapers. They would like to gratefully acknowledge the contributions of Scott Christensen and Frederick Zarndt of iArchives Inc., and of Randy Silverman, Preservation Librarian at the Marriott Library, in the preparation of this manuscript.

History of the UDN Program

The Marriott Library at the University of Utah (U of U) has a long history of large-scale newspaper projects beginning with the National Endowment for the Humanities' United States Newspapers Program (USNP) in the 1980s, in which the Library led the effort to catalog and microfilm Utah newspapers. This involvement continues today with the Utah Digital Newspaper (UDN) program, which is digitizing historic Utah newspapers, making them searchable and available on the Internet.

UDN's Grant History: 2002-2004[1]

With the first of three Library Services and Technology Act (LSTA) grants, the Marriott Library digitized 30 years of three weekly newspapers in 2002. During this first phase of the program, the newspaper digitization process was developed and the UDN website (http://digitalnewspapers.org) was launched with some 30,000 total pages.

A second LSTA grant, which ran from January to September 2003, digitized 106,000 new pages, effectively quadrupling the collection. The grant also funded a project director to run day-to-day operations and secure ongoing funding, and funded a publicity campaign to ensure broad knowledge of the program across the state.

In September 2003, the program was awarded a $1 million federal grant to continue for another two years by the Institute of Museum and Library Services (IMLS), an independent federal agency. IMLS is providing $470,000, with the U of U and Brigham Young University (BYU) providing matching funds of $450,000 and $100,000 respectively. With this grant, the program will digitize 264,000 newspaper pages, with portions distributed to other sites, namely BYU and Utah State University (USU). The metadata (including searchable full text) from these sites will be harvested and combined with the metadata from the U of U's collection. This will present a combined, or aggregated, collection to readers so they can search the entire collection at once, regardless of where the data is located. Another major goal of the grant is to administer a training program for other academic and historical institutions in the West, providing information on launching a digital newspapers program, managing the digitization process, and writing compelling grant proposals.

Kenning Arlitsch and John Herbert Microform & Imaging Review

In March 2004, the program was awarded a third LSTA grant to digitize 10,000 pages from each of five specific Utah newspapers in five different counties. In administering this grant, the Utah State Library is providing $74,000, with matching funds of $25,000 raised locally, $5,000 each from public libraries in the five newspaper communities. These matching funds in particular show how the program has substantial grassroots support in local communities throughout the state. By the time the two current grants expire in September 2005, the program should have 450,000 newspaper pages digitized.

Impact of the Program

As the program has grown during the past three years, it has had an increasing impact on Utahans. Monthly website usage has increased five-fold from June 2003 to March 2004.[2] Numerous emails and phone calls have been received from patrons who either want more information about the program or who are willing to support it in some way. What the program has done, at a very high level, is break down the traditional barriers between a major university and the general citizenry. Not only is the program telling the unique story of Utah's history to the world via the Internet, it is also helping to create a new generation of "citizen historians" who are experiencing Utah history more easily and effectively than ever before.

Digitizing Microfilm

The first newspapers digitized by the UDN were scanned from microfilm. After decades of independent newspaper microfilm creation and USNP participation, the U of U's newspaper microfilm was clearly the most complete and accessible source for scanning. Many newspaper originals were destroyed following filming, so the expectation was that paper would be difficult to locate.[3] But problems with the quality and availability of our microfilm caused us to pursue print archives; during 2003, 65% of the 106,000 pages were digitized from paper.

Service Bureaus

Libraries have long used service bureaus to convert their documents to microfilm, and the quality of work performed by these bureaus can have repercussions long after contracts are completed. Lockhart and Swartzell[4] conducted extensive tests on five vendors in the late 1980s, determining that while "all vendors met the basic technical standards ... each test batch had problems which would require detailed attention in project initiation."[5] In the UDN, these problems would have a significant impact.

The U of U began microfilming newspapers through a service bureau in 1948 and by the time the USNP was launched in 1983, "the Marriott Library had almost complete microfilm holdings for 30 years' worth of Utah Newspapers."[6] Some of that microfilm was digitized in 2002, and its defects had an impact on the digitized images, both visually and for optical character recognition (OCR) processes. Uneven lighting plagued many of the newspapers. An image might go from an acceptable exposure on one side of the frame to a one or two f/stop difference on the other side. Consistent focus across the frame was another challenge; letters were sharp on one side but sometimes more softly focused on the other. (This can easily occur when a copy-stand-mounted camera is not level.) Black smudges infected many frames, blocking out words or entire columns.

Most of these visual defects appear in the early years of the newspapers, leading us to conclude they were the first to be microfilmed and that service bureaus of the late 1940s had not yet perfected their techniques. There may also have been little or no quality control efforts on the part of the U of U; recommendations from the American Library Association, RLG, and ANSI/AIIM for inspection of microfilm only became available in the 1990s.[7]

Vol. 33 No. 2 Microfilm, Paper, and OCR

Even the ownership of master reels can come into question with a service bureau. The Library's microfilm service bureau had changed ownership several times, and a misunderstanding resulted in the master reels being shipped out of state. The service bureau erroneously believed it had acquired the master reels as a part of the purchase from the previous owner. During the 2002 processing, we discovered, shockingly, that the master reels were in Texas and the service bureau refused to return them. The University reacted by contacting the Utah Attorney General's office, and after several months of correspondence, the reels were returned. Now the Library is using storage and duplication services offered by BYU.

Physical Condition

The physical condition of microfilm can also affect scan and OCR quality. Cellulose acetate film, used widely through the 1970s before being replaced by stronger polyester, is known to tear.[8] Cellulose acetate is also prone to the same kind of "vinegar syndrome" chemical decomposition (though it is not as flammable) as the older cellulose nitrate base.[9] This decomposition leads to "buckling and shrinking, embrittlement, and bubbling,"[10] causing distortions in the image. Separately, chemical "redox blemishes," resulting from oxidative attack in less-than-ideal storage conditions, have been noted in microfilm throughout the country,[11] and have been seen in a few instances in film used by the UDN. These reddish spots adversely affect the quality of the scanned image.

Advantages of Microfilm

Despite the problems mentioned above, scanning newspapers from microfilm offers several distinct advantages:

• Inexpensive scanning. With the right equipment, microfilm can be scanned in an automated fashion, allowing an operator to load a reel of film and essentially walk away from the scanner. These scanners can cost $100,000, but the UDN experience shows that firms with this equipment can offer pricing at approximately $0.15/page.

• Low conservation costs. Whereas paper may require conservation treatment prior to scanning, microfilm is usually physically stable and requires no such treatment. Barring the physical problems described above, preparation costs are limited to making scanning copies from the master reels. Scanning from microfilm is best done from a clean copy, free of the defects found in service copies.

• Availability. Thanks to the USNP, newspaper microfilm collections are available and fairly complete in each state.

Digitizing Paper

In 2003, sixty-five percent of the 106,000 pages were digitized from paper. When in good condition, paper represents original source material, whereas microfilm represents as much as a third-generation copy (paper-to-master-to-scan copy). Our hypothesis that scanning from paper produces better images and more accurate searching is discussed in the section "OCR Accuracy - Microfilm vs. Paper." However, for all its promise of cleaner images and better search accuracy, original newsprint has its own set of challenges - not the least of which is finding the collection in the first place. The UDN is constantly canvassing the state in an effort to locate original collections.

Scanning Equipment

The oversized nature of newspapers makes them difficult to scan on conventional equipment; a book scanner or a high-resolution digital scanning camera with copy stand and lighting is required. Equipment that scans this size at a minimum of 300 dpi (400 dpi is recommended) and at an economically feasible speed costs $50,000 - $100,000. The UDN outsources its scanning at $0.20-$0.30/page, depending on whether the newspaper is loose or bound.

Conservation Costs

Newsprint from the mid-19th century can still be in very good condition, if it was properly stored and handled minimally. Some collections, however, have been stored in adverse conditions, and have deteriorated over time, requiring conservation work to render them stable enough for scanning. This repair work generally consists of minor mending and cleaning. While the time and effort for this can vary widely from one collection to another, our overall cost average for this minimal conservation is $0.19/page.
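The per-page prices quoted so far can be put side by side with a little arithmetic. The following is only an illustrative sketch: the function names are hypothetical, the prices are the approximate figures given in the text, and the microfilm figure ignores the cost of making scanning copies from master reels.

```python
# Per-page prices quoted in the article, in US cents (integers avoid float rounding).
MICROFILM_SCAN = 15       # approx. $0.15/page
PAPER_SCAN = (20, 30)     # $0.20-$0.30/page, loose vs. bound
PAPER_CONSERVATION = 19   # average $0.19/page for minimal conservation


def paper_cost_range_cents(pages):
    """Low and high cost of scanning plus conservation for paper originals."""
    low, high = PAPER_SCAN
    return (pages * (low + PAPER_CONSERVATION),
            pages * (high + PAPER_CONSERVATION))


def microfilm_cost_cents(pages):
    """Scanning cost for microfilm (duplication of master reels not included)."""
    return pages * MICROFILM_SCAN


print(paper_cost_range_cents(1000))  # (39000, 49000), i.e. $390-$490
print(microfilm_cost_cents(1000))    # 15000, i.e. $150
```

On these figures, paper costs roughly two and a half to three times as much per page as microfilm, before any difference in OCR accuracy is considered.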

Advantages of Paper

• Cleaner digital images, more accurate OCR. Provided the paper is in fair (or better) condition, better digital images are achieved by scanning directly from originals. And cleaner digital images produce more accurate searchable text. These assumptions are reinforced by our initial testing (see "OCR Accuracy - Microfilm vs. Paper"), though more testing is needed. That said, the variable quality of each source medium - microfilm or paper - makes it difficult to state categorically that one is always a better choice than the other.

• Color scanning. Using originals makes it possible to scan in full color, though the desirability of this is questionable. Color scanning offers a more accurate representation of the original, but storage costs for the larger color files are much greater, if not prohibitive. Issues of file presentation and storage are discussed in the next section.

Digital File Storage

As the UDN program grows, we face the problem of storing and managing both low-resolution PDF files presented on the Web and high-resolution archival TIFF files. The bi-tonal PDF files presented on the Web are quite small: articles average 10-50KB, and full pages average 300KB. The entire current collection of images and metadata requires only 130GB of online disk space. But because a separate file is generated for each article, the files are numerous. The 136,000 pages currently in the collection comprise 1.6 million files, or nearly 12 files per page. Provided this average continues, we will have nearly 5.4 million files for our 450,000 newspaper pages by the end of 2005. The good news is that even with the Library's current total of 2 million files from all digital collections, including metadata and full text, we are not experiencing performance issues.[12]
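The file-count projection above is simple proportional arithmetic. A minimal sketch follows, with variable names of our own choosing; it shows that the 5.4 million figure corresponds to rounding the ratio up to 12 files per page, while the unrounded ratio gives a slightly lower estimate.

```python
CURRENT_PAGES = 136_000
CURRENT_FILES = 1_600_000
TARGET_PAGES = 450_000

# Observed ratio of files to pages in the current collection.
files_per_page = CURRENT_FILES / CURRENT_PAGES

# Projection using the unrounded ratio.
projected_exact = round(files_per_page * TARGET_PAGES)

# Projection using the rounded "nearly 12 files per page" figure.
projected_rounded = 12 * TARGET_PAGES

print(f"{files_per_page:.2f} files per page")
print(f"{projected_exact:,} files (unrounded ratio)")
print(f"{projected_rounded:,} files (12 per page)")
```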

Numerically daunting as the eventual 5.4 million files are, we face more serious problems, including the long-term storage of the full-page archival files for the newspapers, which are 4-bit grayscale TIFF files averaging 14MB in size. Standard procedure for the other digital collections at the Library (photographs, books, documents, maps, art prints, etc.) is to store the archival files directly on the server.[13] But the sheer number and size of the newspaper files preclude archiving them online. In 2002, we stored these files on DVD but soon realized that with two copies of each file, we would be quickly overwhelmed by creating and maintaining discs. Instead, we are now storing them on magnetic tape, realizing that while tapes do not have the longevity of optical discs, they offer more flexibility and reusability. We are using LTO Ultrium tapes, which have an uncompressed capacity of 100GB and cost $50 each.
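To give a sense of scale, the tape requirement implied by these figures can be estimated as follows. This is a rough sketch under stated assumptions: decimal units (1GB = 1,000MB), every page at the 14MB average, tapes filled completely, and a hypothetical helper name `tapes_needed`.

```python
TIFF_MB_PER_PAGE = 14   # average archival TIFF size from the text
TAPE_CAPACITY_GB = 100  # uncompressed LTO Ultrium capacity from the text
TAPE_COST_USD = 50


def tapes_needed(pages, copies=1):
    """Number of tapes required, rounding up to whole tapes."""
    total_mb = pages * TIFF_MB_PER_PAGE * copies
    tape_mb = TAPE_CAPACITY_GB * 1000
    return -(-total_mb // tape_mb)  # ceiling division on integers


single = tapes_needed(450_000)
double = tapes_needed(450_000, copies=2)
print(single, single * TAPE_COST_USD)  # 63 tapes, $3,150
print(double, double * TAPE_COST_USD)  # 126 tapes, $6,300
```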

Storage of Color Files

Our partners at BYU have decided to present the 40,000 pages of the Deseret News in full color on the Web. At 80MB each, the full-page archival images are considerably larger than the 14MB grayscale images at the U of U, and present an extreme case of the storage and management concerns described above. BYU has decided to archive these files on DVD, with a second copy stored off-site.

To present the color files on the Web, BYU is using CVista compression software from CVision Technologies Inc.[14] Achieving a 64:1 compression ratio, BYU is able to reduce the 80MB full-page TIFF files to 1.25MB PDF files. While this is still large for some dial-up Internet users, the remarkable compression ratio does allow BYU to present full-color images on the Web.
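The compression figures, and why 1.25MB is still heavy for dial-up, can be checked with a short calculation. This sketch assumes decimal megabytes and a nominal 56 kbit/s modem with no protocol overhead; both are our assumptions, not figures from BYU.

```python
def compressed_mb(original_mb, ratio):
    """Size after compression at the given ratio (e.g. 64 for 64:1)."""
    return original_mb / ratio


def download_seconds(size_mb, link_kbit_per_s):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    bits = size_mb * 1_000_000 * 8
    return bits / (link_kbit_per_s * 1000)


pdf_mb = compressed_mb(80, 64)
print(pdf_mb)                               # 1.25
print(round(download_seconds(pdf_mb, 56)))  # about 179 seconds per page
```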

Distributing the Collection

In 2002, the Utah Academic Library Consortium (UALC)[15] established the Mountain West Digital Library (MWDL). Four digitization centers in Utah and two in Nevada support educational and cultural heritage partners by providing digitization infrastructure, training, and standards, as well as by creating their own digital collections. An aggregating server[16] at the U of U harvests metadata from each center and provides a single searchable index at http://mwdl.org. Images are called in real time from the CONTENTdm server where they reside.

As the MWDL matures, it has become clear that the UDN would benefit from its distributed network. The four centers in Utah plan to host newspaper titles in their region, thereby reducing the burden and cost of collection storage at a single location. As the aggregating server at the U of U harvests metadata, we believe we will create the first distributed digital newspaper collection in the country. We anticipate the MWDL will begin harvesting newspaper metadata from BYU by early summer 2004.

Optical Character Recognition

OCR is the centerpiece of creating full-text from digital images. Today there are literally hundreds of commercially available OCR packages with wide ranges in sophistication and price. Digitizing historic newspapers is much more complex and difficult than generating text from most other documents. Some common problems are deteriorated originals, unusual fonts, faded printing, shaded backgrounds, fragmented letters, touching/overlapping letters, skewed text, curved lines (which are very common in bound volumes), and bleed through.[17]

In fact, running OCR against a gray-scale image (rather than bi-tonal) can actually reduce OCR accuracy where bleed through has occurred.[18] Consequently, nothing but the most robust OCR software should be used for historic newspapers.

One method of measuring OCR effectiveness is how accurately it determines which words are on the printed page. This is normally expressed as the percentage of words on the page that are accurately "read" by the software. Of course, "reading" a word involves piecing it together letter-by-letter (see the next section), so sometimes OCR accuracy is measured as the percentage of letters accurately "read." It is important to realize, however, that these two levels of accuracy are fundamentally different: Word accuracy is by definition significantly lower than letter accuracy because it is effectively the joint accuracy (or joint probability) of the letters in the word. For example, OCR accuracy at the letter level for a document may be 98%. But computing the accuracy of a five-letter word in that same document is done by taking 0.98 to the fifth power (the joint probability of five letters, treating letter errors as independent), which is 90.4%. This distinction between letter and word accuracy is critical to note when analyzing OCR accuracy.
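The relationship between letter and word accuracy is easy to tabulate. The sketch below simply raises the letter accuracy to the power of the word length, which assumes that errors on individual letters are independent; the function name is ours.

```python
def word_accuracy(letter_accuracy, word_length):
    """Probability that every letter in a word is read correctly,
    assuming independent per-letter errors."""
    return letter_accuracy ** word_length


# Word accuracy falls quickly as words get longer, even at 98% per letter.
for length in (1, 3, 5, 8, 10):
    print(length, f"{word_accuracy(0.98, length):.1%}")
```

The five-letter row reproduces the 90.4% figure in the text.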

One other consideration with newspaper OCR is that articles often have a great deal of redundancy. The same important search keyword, such as a last name or a city, often appears more than once in the same article. In these cases, word accuracy need not be 100% for a successful search.[19]
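The benefit of redundancy can be quantified in the same spirit. As a hedged illustration, assume each occurrence of a keyword is read correctly with the same probability and independently of the others (a simplification, since damage on a page tends to be localized); a search then succeeds if at least one occurrence survives.

```python
def search_hit_probability(word_accuracy, occurrences):
    """Chance that at least one of several occurrences of a keyword
    is read correctly, assuming independent errors."""
    return 1 - (1 - word_accuracy) ** occurrences


for k in (1, 2, 3):
    print(k, f"{search_hit_probability(0.9, k):.3f}")
```

Under these assumptions, a keyword that appears twice in an article is found about 99% of the time even when single-word accuracy is only 90%.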

General Description of Software

Included in the processing services offered by our service provider, iArchives Inc. of Lindon, Utah, is OCR. Their OCR software is not only very sophisticated and state-of-the-art, it is also proprietary and patented. Consequently, we are limited in the details we can present here. What follows is a fairly generic description of how OCR software operates.

The OCR process begins with a clean-up of the raw TIFF images created by the scanning process. This clean-up involves:

• Cropping each image, which determines where the edges of each page are and removes everything outside them;

• De-skewing each image to present the printed lines on the horizontal;

• De-speckling each image to remove extraneous spots/speckles from the original.
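Two of the clean-up steps listed above, cropping and de-speckling, can be illustrated on a tiny bi-tonal grid. This is a generic sketch and not iArchives' proprietary method: the image is a plain list of rows (1 = dark, 0 = white), a speckle is defined as a dark pixel with no dark neighbours, and de-skewing is omitted. The sketch de-speckles before cropping so that stray dots do not enlarge the bounding box.

```python
def despeckle(img):
    """Remove dark pixels that have no dark pixel among their 8 neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


def crop(img):
    """Trim the image to the bounding box of its dark pixels."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


page = [
    [1, 0, 0, 0, 0, 0],  # isolated speckle in the corner
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],  # the actual "print"
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = crop(despeckle(page))
print(cleaned)  # [[1, 1], [1, 1]]
```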

Once the clean-up is complete, each page is "zoned" into individual articles. This creates a separate image of each article, which is important because the OCR runs on the individual articles, not the full page.

The next step is the OCR software itself. iArchives' OCR framework uses multiple engines, each of which runs with a different orientation. For instance, one engine may work better with lighter images and thinner fonts, while another may work better with darker images and bolder fonts. These orientations are needed because images span a wide range of quality from article to article and page to page.

Each engine inspects each image pixel by pixel, looking carefully at contiguous dark pixels, determining their overall shape, and comparing the shape to known letters in many different fonts. It also closely examines the adjacent white areas. One of the most important decisions the software makes is, given the size, shape, and location of a white space, is it: 1) part of a letter that is fragmented; 2) the space between letters; or 3) the space between words? Various algorithms are run to help answer these questions, including: a) growing and shrinking potential letter fragments to see if dark areas can be connected to form letters; b) whitening the background and darkening the text to improve the contrast; and c) changing near-white to white and near-black to black. In the end, the software assembles the contiguous/connected dark pixels into a letter, sometimes noting that more than one letter is a possibility. Letters within a certain (very small) distance of each other are assembled into a "node," which is what the engine believes is a word.
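The grouping of contiguous dark pixels into letters, and of nearby letters into nodes, can be sketched generically. The code below is an assumption-laden toy, not the iArchives engine: it works on a single line of text in a bi-tonal grid, treats each 8-connected dark region as one letter, and uses a hypothetical `max_gap` threshold (in pixel columns) to decide whether a white space separates letters or words.

```python
from collections import deque


def components(img):
    """Return (x_min, x_max) extents of 8-connected dark regions, left to right."""
    h, w = len(img), len(img[0])
    seen = set()
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or (y, x) in seen:
                continue
            queue = deque([(y, x)])
            seen.add((y, x))
            x_min = x_max = x
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                for ny in range(max(0, cy - 1), min(h, cy + 2)):
                    for nx in range(max(0, cx - 1), min(w, cx + 2)):
                        if img[ny][nx] == 1 and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            boxes.append((x_min, x_max))
    return sorted(boxes)


def group_into_nodes(boxes, max_gap):
    """Join letters into a node when the white gap between them is <= max_gap."""
    nodes = []
    for box in boxes:
        if nodes and box[0] - nodes[-1][-1][1] - 1 <= max_gap:
            nodes[-1].append(box)
        else:
            nodes.append([box])
    return nodes


line = [
    [1, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 1],
]
letters = components(line)
nodes = group_into_nodes(letters, max_gap=1)
print(len(letters), len(nodes))  # 4 letters grouped into 2 nodes
```

A real engine must also handle fragmented and touching letters, which is where the growing, shrinking, and contrast adjustments described above come in.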

No. of Words   OCR        Accuracy      Excess Words   Additional
Generated      Accuracy   Improvement   Generated      Words
1              77.9%      n/a           18,990         n/a
2              80.3%      2.4%          26,473         7,483
3              80.7%      0.4%          28,293         1,820
No limit       80.8%      0.1%          28,676         383

Table 1: OCR Accuracy for Multiple Word Options

Finally, because the engine may have multiple possibilities for a particular letter, it may as a result have multiple possibilities for the associated word that the letter is in. For example, if the engine can't decide about the third letter in the node "be*t," possible word options include beat, beet, bent, and best. The engine will rank order the possible words from best to worst option, and then, depending on the setup parameters, provide the requisite number of words for the text. These words are accumulated into the text file for the article.[20]
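The "be*t" example can be expressed as a small lookup. In this sketch the ranking is simply the order of a hypothetical word list, standing in for whatever confidence measure the real engine uses, and `limit` plays the role of the setup parameter that controls how many words are emitted per node.

```python
def matches(node, word):
    """True if word fits the node, where '*' marks an undecided letter."""
    return len(node) == len(word) and all(
        n in ("*", w) for n, w in zip(node, word)
    )


def word_options(node, ranked_words, limit=None):
    """Candidate words for a node, best first, truncated to `limit` if given."""
    options = [w for w in ranked_words if matches(node, w)]
    return options if limit is None else options[:limit]


# Hypothetical ranking, best candidate first.
ranked = ["best", "beat", "bent", "beet", "bath"]

print(word_options("be*t", ranked))           # ['best', 'beat', 'bent', 'beet']
print(word_options("be*t", ranked, limit=2))  # ['best', 'beat']
```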

The Dilemma of Multiple Word Options

The UDN program presents full-text searching for 136,000 newspaper pages. When searches are performed on the entire collection, the search-hit limit of 10,000 can easily be reached, especially if a single and somewhat common word is used in the search. One method of limiting a search, of course, is to search on more than one word. As is quite popular, a user will search on a first and last name together, instead of the last name only. For example, searching the collection for "cassidy" results in 392 hits, but when that is limited by searching for "butch cassidy," 62 hits result. Additionally, if an exact-phrase search is used to further limit the search, only 40 hits result. This is where things get interesting, as the example below illustrates.

On page 2 of the Eastern Utah Advocate, November 14, 1901, there is an article titled "Eastern and Southern Utah." On the printed page is the phrase "George Parker alias Butch Cassidy," and the OCR has generated this corresponding text: "george parker alia alfas[21] butch dutch cassidy." The OCR had some difficulty with the words "alias" and "Butch," putting two options in for each. In particular, the two options for "Butch" (butch and dutch) present an intriguing problem. The three words "butch dutch cassidy" as they appear in the text preclude a successful search on the exact phrase "butch cassidy" because "dutch," as the second word option for "butch," is in between.

So the dilemma is this: when the OCR generates more than one word for a particular node, we enhance our ability to successfully search on that single word because there is more than one possibility on which to search. In the example above, both a search for "butch" and for "dutch" will generate a hit on the article. But at the same time, almost paradoxically, a multiple word option reduces the chances of a successful search if that word is used in an exact phrase. As noted above, the article is not included in the hits for "butch cassidy" because the word "dutch" is in between "butch" and "cassidy." In fact, an exact-phrase search for "butch dutch cassidy" reveals five hits, so the identical OCR result occurs four other times in the collection.
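The dilemma can be reproduced with the OCR text from the Eastern Utah Advocate example. The two helper functions below are a minimal sketch of single-word and exact-phrase matching on a token list; they are not the search engine actually used by the UDN.

```python
def contains_word(tokens, word):
    """Single-word search: the word appears anywhere in the text."""
    return word in tokens


def contains_phrase(tokens, phrase):
    """Exact-phrase search: the words appear consecutively, in order."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


ocr_text = "george parker alia alfas butch dutch cassidy".split()

print(contains_word(ocr_text, "butch"))                  # True
print(contains_word(ocr_text, "dutch"))                  # True
print(contains_phrase(ocr_text, "butch cassidy"))        # False
print(contains_phrase(ocr_text, "butch dutch cassidy"))  # True
```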

Testing to Find a Solution

Working closely with iArchives, we ran a series of formal tests of their OCR framework in October 2003. The testing was designed to find the best word-option setting for their software. The test set included 16 randomly selected full pages, representative of the entire collection. The text for each page was keyed and verified to nearly 100% accuracy, becoming what was called the "ground truth," which was used to compare against actual OCR results. The OCR framework was then run four times: with one word generated for each node, with two words generated, with three words generated, and with no limit on the number of words. The results are in Table 1.[22]
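The exact scoring procedure used in the tests is not described here, so the following is only one plausible way to compare OCR output against a ground truth: count how many ground-truth word occurrences also appear in the OCR text, and count the OCR words that have no counterpart as excess. The one-word OCR output below is hypothetical; the two-word output is the example quoted earlier.

```python
from collections import Counter


def score(ground_truth, ocr_tokens):
    """Return (word accuracy, excess word count) using multiset comparison."""
    truth, ocr = Counter(ground_truth), Counter(ocr_tokens)
    matched = sum((truth & ocr).values())  # words found in both
    excess = sum((ocr - truth).values())   # OCR words with no counterpart
    return matched / sum(truth.values()), excess


truth = "george parker alias butch cassidy".split()
one_word = "george parker alia dutch cassidy".split()  # hypothetical output
two_words = "george parker alia alfas butch dutch cassidy".split()

print(score(truth, one_word))   # (0.6, 2)
print(score(truth, two_words))  # (0.8, 3)
```

Even on this toy input, allowing a second word per node raises accuracy and the number of excess words together, which is the pattern seen in Table 1.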

The data shows that as the limit on the number of words generated increases, OCR accuracy increases, as does the number of excess words. These excess words, like "dutch" in the example above, can reduce exact-phrase search accuracy. It should be noted that OCR accuracy and excess words are strongly dependent on the quality of the images and format complexity. Good image quality and simple formats will generally have higher OCR accuracy and fewer excess words, while poor image quality and complex formats will likely have lower OCR accuracy and more excess words.

Our analysis concluded that the best overall search accuracy comes from the two-word option. This conclusion tried to strike the right balance between single-word and exact-phrase accuracy. The two-word option had a significant increase (2.4%) in accuracy over the one-word option, while the three-word and unlimited options increased accuracy only slightly at 0.4% and 0.1%, respectively. The 2,203 additional excess words generated by these options, which have the undesired side-effect of further reducing phrase search accuracy, were not, we concluded, worth the small 0.5% increase in single-word accuracy.
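The trade-off behind this conclusion can be recomputed directly from the Table 1 figures. The sketch below derives the marginal accuracy gain and the marginal excess words for each step up in the word-option setting; the data structure and function name are ours.

```python
# (setting, OCR accuracy in %, excess words generated), copied from Table 1.
TABLE_1 = [
    ("1", 77.9, 18_990),
    ("2", 80.3, 26_473),
    ("3", 80.7, 28_293),
    ("No limit", 80.8, 28_676),
]


def marginal_gains(rows):
    """Accuracy gain (percentage points) and extra excess words per step."""
    gains = []
    for (_, prev_acc, prev_excess), (name, acc, excess) in zip(rows, rows[1:]):
        gains.append((name, round(acc - prev_acc, 1), excess - prev_excess))
    return gains


gains = marginal_gains(TABLE_1)
for name, acc_gain, extra_words in gains:
    print(f"{name:>8}: +{acc_gain} points, +{extra_words:,} excess words")

# Going beyond the two-word option:
beyond_two = gains[1:]
print(sum(g[2] for g in beyond_two))            # 2203 excess words
print(round(sum(g[1] for g in beyond_two), 1))  # 0.5 percentage points
```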

In spite of this well-tested solution, we consider it less than optimal because it merely strikes a balance between two competing interests (single-word and exact-phrase searches) rather than strongly supporting both. The ultimate resolution to this tricky problem lies with utilizing a "proximity search." Proximity searches, which are growing in popularity among search engines, allow a search for a phrase, such as "butch cassidy." But instead of the words literally having to be together in the text, these searches allow them to be within a certain preset number of words (say, three) of each other. So, any time "cassidy" is within three words of "butch," a hit would be generated. This longer-term solution will allow us to support multiple words generated by the OCR, providing higher accuracy for single-word searches, and at the same time be able to search accurately on phrases.
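A proximity search of the kind described can be sketched in a few lines. This is an illustrative toy rather than any particular search engine's implementation: it looks only forward from the first word, and `max_distance` is the preset window in words.

```python
def within_proximity(tokens, first, second, max_distance=3):
    """True if `second` occurs within `max_distance` words after `first`."""
    positions = [i for i, token in enumerate(tokens) if token == first]
    return any(
        second in tokens[i + 1:i + 1 + max_distance] for i in positions
    )


ocr_text = "george parker alia alfas butch dutch cassidy".split()

# With a window of three words, the extra word option no longer blocks the hit.
print(within_proximity(ocr_text, "butch", "cassidy", max_distance=3))  # True

# A window of one word behaves like an exact two-word phrase and fails.
print(within_proximity(ocr_text, "butch", "cassidy", max_distance=1))  # False
```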

Dictionary Filters

After the initial text is generated by the OCR, it is filtered through a number of different dictionaries to ensure that only valid words are in the final text. In our early days, we used a small English dictionary of only 28,000 words. There are many ways to count them, but according to Oxford Dictionaries, "... there are, at the very least, a quarter of a million distinct English words."[23] So our filtering dictionary was too small by an order of magnitude, and we found - not surprisingly - that it was leaving out far too many important words. iArchives located and incorporated a two-million item dictionary containing all English words, common foreign language words, surnames, and place names. Additionally, we augmented the place names dictionary with a set of Utah place names provided by BYU. We felt it particularly important, since we are digitizing Utah newspapers, that we be able to accurately search on all Utah place names, and, given the high genealogical use of the collection, to have a robust surnames dictionary.

Once we incorporated the new dictionaries into the text-generation process, we re-filtered the originally generated text and re-built the text files using the expanded dictionary. This re-filtering merely involved re-running a script at the very back end of iArchives' process and was accomplished in a short timeframe with little additional expense. The result is that the entire collection is now filtered properly through the new dictionaries.
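Dictionary filtering and re-filtering can be illustrated with sets. The word lists and raw tokens below are tiny hypothetical stand-ins for the real dictionaries and OCR output; the point is that because the unfiltered text is retained, adding dictionaries later only requires re-running the filter.

```python
def filter_tokens(tokens, *dictionaries):
    """Keep only tokens found in at least one of the given dictionaries."""
    valid = set().union(*dictionaries)
    return [t for t in tokens if t.lower() in valid]


# Hypothetical, deliberately tiny word lists.
small_english = {"the", "price", "train"}
surnames = {"cassidy", "parker"}
utah_places = {"moab", "price"}

raw = ["the", "train", "moab", "cassidy", "tbe", "xq7"]

# Early filtering with only a small English dictionary drops valid names.
print(filter_tokens(raw, small_english))  # ['the', 'train']

# Re-filtering the same raw text with expanded dictionaries recovers them.
print(filter_tokens(raw, small_english, surnames, utah_places))
# ['the', 'train', 'moab', 'cassidy']
```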

OCR Accuracy - Microfilm vs. Paper

One of the important processing issues involved in newspaper digitization is deciding which source material to digitize from: microfilm or original newspapers. The USNP has made available almost every important newspaper title in microfilm. However, this often decades-old microfilm, while available, does not necessarily provide the best source material for OCR. We also wanted to examine how well OCR operates on originals, because intuitively we believed originals would provide higher-quality images for the OCR. When paper is scanned, a new digital photograph is taken of the original newspaper page, which is often a tremendous improvement in overall image quality from that paper's microfilm.

To test this assumption, during 2003 UDN scanned originals when they could be found in good condition. After completing the 2003 processing, we ran a series of acceptance tests to ensure the work was ready for installation onto our servers. The QA testing involved, in part, performing keyword searches and determining overall search accuracy for each newspaper title. The results of the QA testing are in Table 2.

                 Issues Sampled   Keyword Searches   Hits   Pct
Original Paper               43                294    218   74.2%
Microfilm                    30                219    142   64.8%
TOTAL                        73                513    360   70.2%

Table 2: QA Testing Results

As this somewhat small sample shows, original newspapers provide approximately a ten-percentage-point improvement in OCR accuracy over microfilm. While these results certainly support our assumption that originals are better source materials, we consider these results preliminary and in need of further sampling and study to confirm the numbers. The 2004 and 2005 processing for UDN should have an even higher percentage from originals and provide us with more material from which to test.
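The percentages in Table 2 follow from dividing hits by keyword searches. The short Python sketch below redoes that arithmetic from the published counts. The variable and function names are illustrative and not part of the UDN workflow.

```python
# Keyword searches and hits per source type, taken from Table 2.
results = {
    "Original Paper": {"searches": 294, "hits": 218},
    "Microfilm": {"searches": 219, "hits": 142},
}

def hit_rate(searches, hits):
    """Search accuracy as a percentage of keyword searches that produced a hit."""
    return 100.0 * hits / searches

rates = {name: hit_rate(r["searches"], r["hits"]) for name, r in results.items()}

# Overall rate is computed from the pooled counts, not by averaging the two rates.
total_searches = sum(r["searches"] for r in results.values())
total_hits = sum(r["hits"] for r in results.values())
total_rate = hit_rate(total_searches, total_hits)

# Difference between paper and microfilm, in percentage points.
gap = rates["Original Paper"] - rates["Microfilm"]

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
print(f"TOTAL: {total_rate:.1f}%")
print(f"Gap: {gap:.1f} percentage points")
```

Recomputed this way, the original-paper rate comes to about 74.1%, a tenth of a point below the 74.2% printed in Table 2, and the gap between the two sources is about 9.3 percentage points, consistent with the "approximately ten" stated above.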

Next Steps

Assessing Microfilm "OCR-Ability"

The economic conditions and technical solutions are in place today for launching and expanding all types of digital collections, complete with full-text searching. Generating accurate full-text, of course, relies upon securing good quality source materials and using effective OCR software. Microfilm has been widely used as the storage medium of choice for newspapers and other media for a generation or more, and the amount of material on microfilm is staggering. Moreover, as noted earlier, many originals were destroyed once they were filmed, reducing the ability to find originals in any condition. Even though the UDN has had some good fortune in acquiring originals, we understand that not all digitization projects will be as fortunate. So it is not difficult to foresee circumstances where microfilm may be the only available source material for some digitization projects. Our UDN experience shows, however, that while some microfilm is digitized as accurately as any original, other film clearly yields much poorer results. While microfilm may be prevalent and less expensive to digitize, it is not necessarily the best source material.

How then do we decide whether microfilm should be digitized or whether we should incur the additional time and expense of locating and scanning original paper? An assessment methodology to predict the accuracy of OCR-generated text extracted from a microfilm scan is needed. In other words, what is the "OCR-ability" of a reel of film? We need the ability to estimate the overall OCR accuracy of film without having to go through the entire digitization process and the expense that would necessarily be incurred. This type of evaluation model would have broad applicability well beyond newspapers as many different media have been copied to microfilm.
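No such assessment methodology is defined here. Purely as an illustration of what a sampling-based estimate might look like, the Python sketch below scores a handful of sample frames from a reel by the fraction of OCR tokens that pass a dictionary, then averages the scores. The proxy itself, the function names, and the sample data are all assumptions. The proxy has not been validated against measured OCR accuracy.

```python
import re
from statistics import mean

def dictionary_hit_ratio(ocr_text, dictionary):
    """Fraction of OCR tokens found in the dictionary (an assumed proxy, not a validated metric)."""
    tokens = re.findall(r"[a-z']+", ocr_text.lower())
    if not tokens:
        return 0.0
    return sum(t in dictionary for t in tokens) / len(tokens)

def estimate_reel_score(sample_texts, dictionary):
    """Average the per-frame ratios over a small sample of frames from one reel."""
    return mean(dictionary_hit_ratio(text, dictionary) for text in sample_texts)

# Toy dictionary and two simulated sample frames (illustrative only):
# one clean frame and one with typical OCR misreads.
dictionary = {"the", "council", "met", "on", "tuesday", "evening"}
samples = [
    "The council met on Tuesday evening",
    "Tbe couneil rnet on Tuesdav evening",
]

score = estimate_reel_score(samples, dictionary)
print(f"Estimated dictionary hit ratio: {score:.2f}")
```

A real evaluation model would need to establish how well such a score predicts search accuracy, and how many frames must be sampled per reel, before it could guide a decision between film and paper.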

Distributed Collections and Aggregated Searching

Our current plans at UDN call for aggregating the distributed collections at BYU and USU, and presenting a single searchable index at our website. Developing this technology will greatly enhance our ability to link digital newspaper collections together, enabling powerful search engines to provide users with nearly complete data in very short response times. This is the real promise of digital newspaper collections, indeed of digital collections of all types - providing to practically everyone immediate access to meaningful data.

Endnotes

1 John Herbert and Kenning Arlitsch. "digitalnewspapers.org: The Digital Newspapers Program at the University of Utah," Serials Librarian, 47, nos. 1 and 2 (forthcoming).
2 Web site visits averaged 433 per month from April-June 2003, and 2,347 per month from January-March 2004.
3 Nicholson Baker. Doublefold: Libraries and the Assault on Paper. New York: Random House (2001).
4 Vickie Lockhart and Ann Swartzell. "Evaluation of Microform Vendors," Microform Review, 19, no. 3 (1990): 119-123.
5 Ibid.
6 Robert P. Holley. The Utah Newspaper Project Final Report, Project No. PS-200010-85 National Endowment for the Humanities United States Newspaper Program (1987).
7 Walter Cybulski. "You Say You Want a Resolution? Technical Inspection and the Evaluation of Quality in Preservation Microfilming," Microform & Imaging Review, 28, no. 2 (1999): 56-67.
8 Michael J. Gunn. "'Poly' or 'Cell'?" Microform Review, 16, no. 3 (1987): 231-232.
9 Thomas A. Bourke. "The Curse of Acetate; or a Base Conundrum Confronted." Microform Review, 23, no. 1 (1994): 15-17.
10 Ibid.
11 James M. Reilly, Douglas W. Nishimura, Kaspars M. Cupriks, and Peter Z. Adelstein. "Stability of Black and White Images, with Special Reference to Microfilm." Abbey Newsletter, 12, no. 5 (July 1988) and reprinted in Microform Review 17, no. 5 (1988): 270-278. Also available at: http://palimpsest.stanford.edu/byorg/abbey/an/an12/an12-5/an12-507.html.
12 We use CONTENTdm digital asset management software from DiMeMa Inc. to manage and present all our digital collections, including the newspapers. See http://contentdm.com for product information.
13 We track the online files using CONTENTdm's Full Resolution Manager feature. Regular tape backups of the server files and rotating copies sent off-site ensure long-term viability.
14 See http://www.cvisiontech.com/ for CVista product information.
15 See http://www.ualc.net for information about the higher education library consortium.
16 Each MWDL site runs a CONTENTdm server, and the aggregator at the University of Utah is a CONTENTdm product known as the Multi-Site Server. See http://contentdm.com for details.
17 Frank R. Jenkins, Thomas A. Nartker, and Stephen V. Rice. "Testing OCR Accuracy," Inform 10 (September 1996): 20-22+.
18 Thomas A. Nartker, Stephen V. Rice, and Frank R. Jenkins. "OCR Accuracy," Inform 9 (July 1995): 38-40+.
19 Kazem Taghva, Julie Borsack, Allen Condit, and Srinivas Erva. "The effects of noisy data on text retrieval," Journal of the American Society for Information Science, 45, no. 1 (January 1994): 50-58.
20 Included with the text are the x- and y-coordinates of each node's location within the image. These coordinates are used to highlight the word when a successful search for it is done.
21 The words "alfas" and "alia," while uncommon, do actually pass a dictionary filter. When searching Google, "alfas" retrieves 80 pages of hits, with the most common reference appearing to be the shortened plural of "alfa romeos." "Alia" retrieves 87 pages of hits from Google and is the acronym of both the "Australasian Lighting Industry Association" and the "Australian Library Information Association."
22 "OCR Accuracy" was defined as the percentage of single words in the ground truth which are also found in the OCR text. "Excess Word" was defined as a word in the OCR text which was not in the ground truth. The total number of ground truth words was 62,228.
23 AskOxford.com, http://www.askoxford.com/asktheexperts/faq/aboutwords/numberwords.
