Document Image Analysis Lecture 5: Metrics
UC Berkeley CS294-9 Fall 2000 5- 1
Document Image Analysis
Lecture 5: Metrics
Richard J. Fateman, University of California – Berkeley
Henry S. Baird, Xerox Palo Alto Research Center
The course so far…
• Reminder: All course materials are online: http://www-inst.eecs.berkeley.edu/~cs294-9/
• Overview of the DIA Research Field
• Some applications (Postal Addresses, Checks)
• Research Objectives: more systematic modeling, design
• Some basic engineering
How well are we doing?
• Cost to achieve a useful result
• Compare digital version to
  – hand keying / digitizing
  – verification
  – correction
• Correction cost may dominate total system cost
When is a result nearly correct?
• Character model
  – Correct
  – Reject
  – Error
• String model
  – Insertion
  – Deletion
  – Rejection
  – Substitution [wrong letter identification]
Using ASCII character labels
ABCDEFGHIJKL = s1
ACD~~OIIUKL = s2
Insert B after A in s2
Substitute E for ~, F for ~ [~ = reject]
Substitute G for O in s2
Substitute H for the first I in s2
Substitute J for U in s2 … etc. (really, H was recognized as II, and IJ was recognized as U)
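The edit operations listed for s1 and s2 can also be produced mechanically. The following is a minimal sketch using Python's standard `difflib.SequenceMatcher`; the helper names `edit_script` and `apply_script` are made up for illustration. Note that `difflib` does not minimize edit cost, so the script it reports need not be the one listed on the slide.

```python
from difflib import SequenceMatcher

REJECT = "~"  # reject marker, as on the slide


def edit_script(ocr, truth):
    """List the non-matching operations that turn the OCR string into the truth."""
    matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
    return [(tag, ocr[i1:i2], truth[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_script(ocr, truth):
    """Rebuild the truth from the OCR string by walking the opcodes."""
    matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" keeps the OCR text; replace/insert take the truth text;
        # "delete" contributes nothing because truth[j1:j2] is empty.
        pieces.append(ocr[i1:i2] if tag == "equal" else truth[j1:j2])
    return "".join(pieces)


s1 = "ABCDEFGHIJKL"   # ground truth
s2 = "ACD~~OIIUKL"    # recognizer output

for tag, got, wanted in edit_script(s2, s1):
    print(f"{tag:8s} {got!r} -> {wanted!r}")
print("rejects:", s2.count(REJECT))
```

A cost-minimizing script requires the dynamic programming alignment discussed later in the lecture.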
ASCII labels are inadequate
• Unicode +
• Font +
• Point size +
• Tag information <author> .. </author>
Simple measures may mislead
$$\text{error rate} = \frac{\#\,\text{errors}}{\#\,\text{characters} - \#\,\text{rejected}} \times 100\%$$
Increase the rejection rate and this “error rate” decreases. Reject all characters to get 0/0?
Some applications (e.g. post office) force very low error, even if (low confidence) correct results are sometimes rejected.
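To make the effect of rejection concrete, here is a small sketch that computes the error rate as errors divided by accepted (non-rejected) characters. It assumes the truth and the recognizer output are already aligned character by character, and uses "~" as the reject marker; the sample strings are invented for illustration.

```python
REJECT = "~"  # reject marker (assumed convention)


def tally(truth, output):
    """Count correct, rejected, and erroneous characters in two aligned strings."""
    if len(truth) != len(output):
        raise ValueError("strings must be aligned to the same length")
    counts = {"correct": 0, "reject": 0, "error": 0}
    for t, o in zip(truth, output):
        if o == REJECT:
            counts["reject"] += 1
        elif o == t:
            counts["correct"] += 1
        else:
            counts["error"] += 1
    return counts


def error_rate(truth, output):
    """Errors as a percentage of accepted characters; None for the 0/0 case."""
    counts = tally(truth, output)
    accepted = len(truth) - counts["reject"]
    if accepted == 0:
        return None  # every character rejected: the rate is undefined
    return 100.0 * counts["error"] / accepted


truth = "ABCDEFGHIJ"
bold = "ABCDEPGHTJ"      # never rejects, makes two errors
cautious = "ABCDE~GH~J"  # rejects the two doubtful characters instead

print(error_rate(truth, bold))      # two errors out of ten accepted
print(error_rate(truth, cautious))  # no errors among eight accepted
print(error_rate(truth, "~" * 10))  # everything rejected
```

The cautious recognizer reports a lower error rate only because its doubtful characters were moved out of the denominator, which is why the reject rate has to be reported alongside the error rate.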
Some errors are acceptable
• Keyword search: if the key word occurs many times and is occasionally rejected
• Erroneous (nonsense) words are unlikely to be found by a search
• Caveat: if a key word is consistently changed to a nearby word, it may be missed (e.g. search for durnptruck and never find it.)
Example: UNLV-ISRI document collection
• 20 million pages of scientific, legal, official memos from DOE and contractors
  – Rock mining
  – Maps
  – Safe transportation of nuclear waste
  – Average length 44 pages
Example: UNLV-ISRI document collection
• DOE’s Licensing Support System Prototype
  – 104,000 page images, 2,600 documents
  – Manually typed “correct” text
  – OCR text
• To determine relevance to queries, 3 methods used
  – Geology students ranking (0/1)
  – OCR keyword search
  – “correct” text search
Example: UNLV-ISRI document collection
• Exact match on 71 queries
  – 632 returned by correct text
  – 617 returned by OCR
  – Essentially: OCR is OK for this application.
• Probabilistic ranking / frequency
  – Excessive OCR errors affected ranking
  – On average, similar results
• Feedback on relevance was not helpful for poor OCR
• Benchmarking: similar relevance = good results
Example: UNLV-ISRI document collection
One surprising result is that for some standard tests of precision and recall, processing OCR did better than actual text.
[Crummy OCR meant that some terms were not recognized; but the documents were irrelevant….]
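A toy calculation shows how this can happen. Precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that are returned. The document IDs and result sets below are invented purely for illustration and are not taken from the UNLV-ISRI experiments.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical data: documents 1-4 are judged relevant to the query.
relevant = {1, 2, 3, 4}

# Searching the hand-typed text finds the query term in five documents,
# two of which (5 and 6) are irrelevant.
clean_hits = {1, 2, 3, 5, 6}

# In the OCR text the term was garbled in document 6, so it is not returned.
ocr_hits = {1, 2, 3, 5}

print("clean:", precision(clean_hits, relevant), recall(clean_hits, relevant))
print("ocr:  ", precision(ocr_hits, relevant), recall(ocr_hits, relevant))
```

In this made-up case the OCR search has the same recall but higher precision, simply because the lost match happened to fall in an irrelevant document.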
A theory for computing accuracy
• Consider the result of OCR to be a string
  – Idealization: most common errors involve mis-counting the number of spaces!
  – Ignores size/font/absolute position etc.
Computing the shortest edit distance
• Bio-informatics sequencing
• Associate a cost for each correspondence. For example,
  – Match or substitute (cost 0 or 1)
  – Insert or delete (cost 2)
Attempt to align AUGGAA to ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:
ACUGAUGUGA
A-UG--G-AA
[explain dynamic programming here?]
[Figure: dynamic programming matrix with AUGGAA labeling the rows and ACUGAUGUGA labeling the columns]
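The alignment can be computed with the standard dynamic programming recurrence: each cell holds the cheapest cost of aligning a prefix of one string with a prefix of the other, built from its diagonal, upper, and left neighbors. The sketch below uses the costs stated on the slide (match 0, substitute 1, insert or delete 2); the function name `align` and the traceback tie-breaking order are choices made here, so when several optimal alignments exist it may print a different one from the slide.

```python
def align(a, b, sub_cost=1, indel_cost=2):
    """Return (cost, aligned_a, aligned_b) for a minimum-cost global alignment."""
    n, m = len(a), len(b)
    # d[i][j] = minimum cost of aligning a[:i] with b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = d[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else sub_cost)
            d[i][j] = min(diag, d[i - 1][j] + indel_cost, d[i][j - 1] + indel_cost)

    # Trace back from the bottom-right cell to recover one optimal path.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else sub_cost
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + indel_cost:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return d[n][m], "".join(reversed(top)), "".join(reversed(bottom))


cost, short_row, long_row = align("AUGGAA", "ACUGAUGUGA")
print(cost)
print(long_row)
print(short_row)
```

The table has (n+1) × (m+1) cells and each is filled in constant time, which is where the quadratic running time comes from.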
Computing the shortest edit distance
• Also useful for other tasks (recognizing speech)
• There are many ways to organize the dynamic programming, but it is still O(n²).
• Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding “and”, “the”, “of” … etc.)
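One simple way to estimate word accuracy is to tokenize both texts, optionally drop stopwords, and count how many ground-truth words find a match in the OCR output. The sketch below uses `difflib.SequenceMatcher` over word lists as a stand-in for a proper word-level alignment; the stopword list and the sample sentence are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# A tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"and", "the", "of", "is", "in", "a", "to"}


def words(text, drop_stopwords=False):
    """Lowercase, split on whitespace, and optionally remove stopwords."""
    tokens = text.lower().split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def word_accuracy(truth, ocr, drop_stopwords=False):
    """Fraction of ground-truth words matched, in order, by the OCR output."""
    t = words(truth, drop_stopwords)
    o = words(ocr, drop_stopwords)
    if not t:
        return None
    matcher = SequenceMatcher(None, t, o, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(t)


truth = "the dumptruck is in the yard of the mine"
ocr = "the durnptruck is in the yard of the mine"

print(word_accuracy(truth, ocr))                       # all words
print(word_accuracy(truth, ocr, drop_stopwords=True))  # content words only
```

A single misread content word barely moves the all-words figure but costs a third of the non-stopword accuracy here, which is closer to what a keyword search would experience.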
Correct Zoning is essential
• Read order in multi-column pages
• How to compare competing programs on performance of repeated headers
• What to do with figures, logos.
[Figure: two page diagrams, each with zones labeled 1–6]
Document Attribute Format Specification: DAFS
“While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding.
There are three storage formats: DAFS-Unicode, DAFS-ASCII and a more compact DAFS-Binary form.
DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA.”
www.raf.com, Illuminator, UW CD-ROMs (English and Japanese)
DAFS vs SGML
• DAFS = SGML + Unicode + CCITT Fax4
• SGML requires DTD (document type definition)
• SGML is intended for structure, not appearance (e.g. not bold, italic)
• Images which accidentally contain an ASCII version of <tag> can be problematic
  – Solved by putting images in separate files!
Perfect results: how to obtain ground truth?
• Painfully enter it by hand, or
• Painfully correct OCR results, or
• Compute some kind of average of OCR programs
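One reading of "some kind of average" is a character-by-character majority vote across several recognizers. The sketch below assumes the outputs have already been aligned to equal length, which in practice needs a multiple-string alignment step; the sample outputs are invented.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per character position over equal-length OCR outputs."""
    if not outputs:
        raise ValueError("need at least one output")
    length = len(outputs[0])
    if any(len(o) != length for o in outputs):
        raise ValueError("outputs must be aligned to the same length")
    # zip(*outputs) yields one tuple of candidate characters per position.
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*outputs))


# Three hypothetical recognizers, each wrong in a different place.
outputs = ["nuclear", "nuc1ear", "nuclcar"]
print(vote(outputs))
```

Voting only helps where the recognizers make different mistakes; an error they all share survives the vote, so the result is a cheaper approximation to ground truth rather than a perfect one.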
Perfect ground truth: a synthetic approach
• (Kanungo, UMD): start with TeX
  – Produce the ground truth for layout from TeX
  – Extract character positions, glyphs by analyzing DVI files
  – This provides essentially every bit position of each character.
Ground truth
• Next, commit to paper:
  – Print the DVI files
  – Scan a calibration page
  – Compute parameters of the 2D → 2D transformation T imposed by physics
  – Scan the printout
  – Align the page
  – Run the recognizer
  – Compare reported positions (mapped through T⁻¹) to correct ones
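The calibration and comparison steps can be sketched as follows, under the simplifying assumption that T is an affine map (scale, rotation, shear, translation) estimated from exactly three calibration marks. The slides do not state which family of transformations was used, and all coordinates below are invented.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    base = det3(m)
    if abs(base) < 1e-12:
        raise ValueError("calibration points are collinear")
    solution = []
    for col in range(3):
        replaced = [list(row) for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / base)
    return solution


def fit_affine(src, dst):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f from three point pairs."""
    m = [[x, y, 1.0] for x, y in src]
    a, b, c = solve3(m, [p[0] for p in dst])
    d, e, f = solve3(m, [p[1] for p in dst])
    return (a, b, c, d, e, f)


def apply_affine(t, point):
    a, b, c, d, e, f = t
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


def invert_affine(t):
    """Return the affine map that undoes t."""
    a, b, c, d, e, f = t
    det = a * e - b * d
    if abs(det) < 1e-12:
        raise ValueError("transformation is not invertible")
    ia, ib, id_, ie = e / det, -b / det, -d / det, a / det
    return (ia, ib, -(ia * c + ib * f), id_, ie, -(id_ * c + ie * f))


# Ideal positions of three calibration marks and where the scanner saw them
# (a 2% enlargement plus a small shift, made up for this example).
ideal = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
scanned = [(3.0, -2.0), (105.0, -2.0), (3.0, 100.0)]

T = fit_affine(ideal, scanned)
T_inv = invert_affine(T)

reported = (54.0, 49.0)                   # position reported by the recognizer
print(apply_affine(T_inv, reported))      # compare this with the DVI position
```

With more than three marks, or with distortions that are not affine, a least-squares fit over a richer model would replace `fit_affine`.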
Change of Pace
• Assignment 1
  – What does it mean to write a program?
    • Documentation
    • Demo
    • Instructions for use
    • (perhaps optional)
  – Extensions, limitations, discussion
• Discussion questions