Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interactive Postcorrection of...
impact-centre-of-competence -
PoCoTo: An Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts
Thorsten Vobl, Annette Gotscharek, Ulrich Reffle,
Christoph Ringlstetter, Klaus U. Schulz
CIS - Center for Information and Language Processing, University of Munich
Gini GmbH, Munich
Motivation
- Historical texts still contain many OCR errors
- Downstream applications are harmed
- Option: improve quality with interactive postcorrection
Why: selected and important texts, corpora, or parts of them can or must be lifted to a much higher level of accuracy, up to perfection. Somewhat “business driven”.
How: the user experience of the software has a major influence on the time and effort needed to improve accuracy.
Approach
Features to raise productivity, within our competence and explorative:
• Plug in language technology that unmasks orthographic variation in historical language and returns document-specific distributions of OCR errors.
• Tool visualizes series of similar OCR errors.
• Error series can be corrected in one shot.
• Implement a productive UX through interface and functionality.
Evaluation
• Tool developed in a university environment during the EU project IMPACT and maintained since, despite serious fluctuation.
• Practical user tests in three major European libraries have shown gains in time and correction rates.
• User ratings from practitioners are high.
• Interest is being maintained; the tool is open for new languages and new functionalities.
• Language resources and tool are separated through a server-client model.
• Published as an open source tool on GitHub.
Starting Point: Postcorrection Tool as a Carrier of Technology
• Language technology used for improvement of interactive postcorrection
• Lexica, matching tool, and profiler integrated as background technology
• Document-centric knowledge from unsupervised analysis of the OCRed document used for detection of error classes and suggested corrections
• Batch mode for correction of many errors in „one shot“
• Rich graphical user interface to let users fully benefit from „knowledge“ on document-derived error classes
Flexible GUI
[Screenshot: GUI panels labeled OCR, Image, and Correction candidates / Special workflows]
• Unlimited configuration of the views:
  – OCR with image snippets
  – Complete image page
  – Correction candidates, special workflows
• Font and window size configuration
View: OCR + Image Snippets
• OCRed text is presented to the user with word-image alignment.
• The natural flow of text is maintained; comparison with the original text images is much easier than with focus hopping.
View: Original Image
• Alternative view with the complete page image.
  – Useful for difficult-to-read words
  – Useful if the word segmentation of the OCR is too poor
  – Useful if long-distance text understanding is needed
Manual Correction
• Classical correction workflow through sequential manual input
Drop Down Selection of Correction Candidates
• Speed-up through selection of proposed correction candidates
• In line with what is usually offered: „Base Mode“
Two-Channel Model for OCRed Historical Text
[Diagram: modern word form Wmod → patterns applied („pattern trace“) → word form in ground truth Wgt → OCR errors applied („OCR trace“) → word form in OCRed text Wocr]
• „Interpretation“ of the OCR token, starting from the OCR token Wocr
• Estimation of the channel model
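The two channels can be illustrated with a short Python sketch. This is only an illustration of the idea on the slide, not PoCoTo's implementation: the function names, the rule representation, and the example word with its rewrite rules are all assumptions chosen for the example.

```python
from typing import List, Tuple

# A rewrite rule (left, right, position): replace `left` by `right` at `position`.
# This representation is an assumption made for the sketch.
Rule = Tuple[str, str, int]


def apply_rules(word: str, rules: List[Rule]) -> str:
    """Apply rewrite rules from right to left so earlier positions stay valid."""
    for left, right, pos in sorted(rules, key=lambda r: r[2], reverse=True):
        if word[pos:pos + len(left)] != left:
            raise ValueError(f"{left!r} not found at position {pos} in {word!r}")
        word = word[:pos] + right + word[pos + len(left):]
    return word


def generate_ocr_token(w_mod: str, pattern_trace: List[Rule],
                       ocr_trace: List[Rule]) -> Tuple[str, str]:
    """Send a modern word through both channels."""
    w_gt = apply_rules(w_mod, pattern_trace)  # channel 1: historical spelling patterns
    w_ocr = apply_rules(w_gt, ocr_trace)      # channel 2: OCR errors
    return w_gt, w_ocr


# Hypothetical example: modern "teil", historical spelling "theil",
# and an OCR engine that misreads "h" as "b".
w_gt, w_ocr = generate_ocr_token("teil", [("t", "th", 0)], [("h", "b", 1)])
print(w_gt, w_ocr)
```

An interpretation of an OCR token runs this process backwards: starting from Wocr, it proposes a ground-truth form, a modern form, and the two traces that connect them.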
Profiling of Historical OCRed Corpora with EM
[Diagram: iteration between a local guess and a global guess]
• Improved model for words, patterns, and OCR errors, and their probabilities
• For each OCR token Wocr: improved list of interpretations with probabilities
• Final result: modern word, ground truth, OCR trace, hist trace
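The alternation between per-token interpretations and a document-wide model can be sketched as a small EM loop. This is a deliberately simplified toy, not the profiler's algorithm: it assumes each interpretation carries exactly one OCR error label, it ignores historical patterns and word frequencies, and the label notation (`"u>n"` for a true u read as n), the function name `em_profile`, and the example data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One interpretation of an OCR token: (proposed correct word, OCR error label).
Interpretation = Tuple[str, str]


def em_profile(tokens: List[List[Interpretation]],
               iterations: int = 20) -> Tuple[Dict[str, float], List[List[float]]]:
    """Estimate error-label probabilities; every token needs at least one interpretation."""
    labels = {label for cands in tokens for _, label in cands}
    # Global guess: start from a uniform distribution over error labels.
    theta = {label: 1.0 / len(labels) for label in labels}
    posteriors: List[List[float]] = []
    for _ in range(iterations):
        # E-step (local guess): weight each interpretation of a token by the
        # current probability of its error label, normalised per token.
        posteriors = []
        for cands in tokens:
            weights = [theta[label] for _, label in cands]
            total = sum(weights)
            posteriors.append([w / total for w in weights])
        # M-step (global guess): re-estimate label probabilities from expected counts.
        counts: Dict[str, float] = defaultdict(float)
        for cands, post in zip(tokens, posteriors):
            for (_, label), p in zip(cands, post):
                counts[label] += p
        theta = {label: counts[label] / len(tokens) for label in labels}
    return theta, posteriors


# Hypothetical candidate lists for the OCR tokens "nnd", "nns", "Hans", "der".
tokens = [
    [("und", "u>n"), ("end", "e>n")],
    [("uns", "u>n")],
    [("Haus", "u>n"), ("Hans", "none")],
    [("der", "none")],
]
theta, posteriors = em_profile(tokens)
print(theta)
print(posteriors[0])
```

In this toy data the unambiguous token "nns" supports the u>n error, so over the iterations the ambiguous token "nnd" shifts toward the interpretation "und".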
Result: Probabilities of Historical Patterns
Document: Eckartshausen
LMF
Result: Probabilities of OCR Errors
Document: Eckartshausen
Lexicons Triggered by Profiles
• Valid historical words are not marked as errors even if they are not in the lexicon („hypothetical lexicon“)
• Historical variants are proposed as correction candidates
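One way to picture a hypothetical lexicon is to expand a modern lexicon with spelling patterns and accept any token that matches a generated form. The sketch below assumes exactly that; the function names, the tiny lexicon, and the two patterns (t→th, ei→ey) are illustrative assumptions, and it applies only one pattern per word.

```python
from typing import Dict, List, Set, Tuple


def historical_variants(word: str, patterns: List[Tuple[str, str]]) -> Set[str]:
    """All forms reachable by applying one pattern at one position."""
    variants = set()
    for modern, historical in patterns:
        start = word.find(modern)
        while start != -1:
            variants.add(word[:start] + historical + word[start + len(modern):])
            start = word.find(modern, start + 1)
    return variants


def build_hypothetical_lexicon(modern_lexicon: Set[str],
                               patterns: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each generated historical form to the modern word it came from."""
    hypothetical: Dict[str, str] = {}
    for word in modern_lexicon:
        for variant in historical_variants(word, patterns):
            hypothetical.setdefault(variant, word)
    return hypothetical


def is_suspicious(token: str, modern_lexicon: Set[str],
                  hypothetical: Dict[str, str]) -> bool:
    """A token is flagged only if neither lexicon explains it."""
    return token not in modern_lexicon and token not in hypothetical


# Hypothetical data for the example.
modern_lexicon = {"teil", "sein", "und"}
patterns = [("t", "th"), ("ei", "ey")]
hypothetical = build_hypothetical_lexicon(modern_lexicon, patterns)

print(is_suspicious("seyn", modern_lexicon, hypothetical))   # historical variant of "sein"
print(is_suspicious("tbeil", modern_lexicon, hypothetical))  # not explained by any pattern
```

The same mapping can be read in the other direction to propose historical variants such as "theil" as correction candidates for a nearby OCR token.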
Selection of Correction Candidates
• Improved ranking of candidates through document-specific language and error profile
• Concordance error view with high-confidence corrections
Rapid Workflow - Batch Processing Identical Strings
• High-probability identical strings are corrected as a batch
• Concordance views are optional
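Batch correction of identical strings amounts to indexing all occurrences of an OCR string and replacing them in one step. The following sketch assumes a flat token list, a suggestion table mapping each OCR string to its top candidate and a probability, and a confidence threshold of 0.9; none of these names or values come from the tool itself.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def batch_correct_identical(tokens: List[str],
                            suggestions: Dict[str, Tuple[str, float]],
                            threshold: float = 0.9) -> Dict[str, int]:
    """Correct all occurrences of confidently suggested strings in place.

    Returns how many occurrences were replaced for each corrected OCR string.
    """
    # Index every position at which each distinct string occurs.
    positions: Dict[str, List[int]] = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)

    applied: Dict[str, int] = {}
    for ocr_string, (candidate, probability) in suggestions.items():
        if probability < threshold:
            continue  # leave low-confidence suggestions for manual review
        hits = positions.get(ocr_string, [])
        for index in hits:
            tokens[index] = candidate
        if hits:
            applied[ocr_string] = len(hits)
    return applied


# Hypothetical page content and suggestions.
tokens = ["nnd", "der", "nnd", "Hans", "nnd"]
suggestions = {"nnd": ("und", 0.97), "Hans": ("Haus", 0.40)}
applied = batch_correct_identical(tokens, suggestions)
print(applied, tokens)
```

The list of positions per string is also what a concordance view would need in order to show each occurrence in context before the user confirms the batch.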
Rapid Workflow - Batch Processing Identical Error Patterns
• Strings with identical error patterns are corrected as a batch
• In the example: n -> u
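Grouping by error pattern generalizes the previous workflow from one string to every string that shares the same character confusion. The sketch below makes several assumptions: it reads "n -> u" as "the OCR token has n where the correction has u", it only handles single-character substitutions, and the function names and example tokens are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (character in the OCR token, character in the suggested correction)
Pattern = Tuple[str, str]


def single_substitution(ocr: str, candidate: str) -> Optional[Pattern]:
    """Return the pattern if the two strings differ in exactly one character."""
    if len(ocr) != len(candidate):
        return None
    diffs = [(o, c) for o, c in zip(ocr, candidate) if o != c]
    return diffs[0] if len(diffs) == 1 else None


def group_by_pattern(suggestions: Dict[str, str]) -> Dict[Pattern, List[str]]:
    """Collect the OCR strings whose top suggestion uses the same pattern."""
    groups: Dict[Pattern, List[str]] = defaultdict(list)
    for ocr, candidate in suggestions.items():
        pattern = single_substitution(ocr, candidate)
        if pattern is not None:
            groups[pattern].append(ocr)
    return dict(groups)


def batch_correct_pattern(tokens: List[str], suggestions: Dict[str, str],
                          pattern: Pattern) -> int:
    """Apply all suggestions that share `pattern`; return the number of corrections."""
    affected = set(group_by_pattern(suggestions).get(pattern, []))
    count = 0
    for index, token in enumerate(tokens):
        if token in affected:
            tokens[index] = suggestions[token]
            count += 1
    return count


# Hypothetical page content and top suggestions.
tokens = ["nnd", "Hcrr", "nns", "der", "Brnder"]
suggestions = {"nnd": "und", "nns": "uns", "Brnder": "Bruder", "Hcrr": "Herr"}
corrected = batch_correct_pattern(tokens, suggestions, ("n", "u"))
print(corrected, tokens)
```

Tokens with a different pattern, such as "Hcrr" here, stay untouched and can be handled in their own batch.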
Controlled “Hard” Evaluations
[Chart: BSB Dokument1, corrections made (y-axis, 0 to 800) over time in minutes (x-axis, 0 to 90). Series: User1 Full, User2 Full, User3 Base, User4 Base, User5 Full, User6 Base]
• Measuring points every 10 minutes for 90 minutes
• Each user with a base and a full session (inter- and intra-user comparison)
• On average 1.5x to 3x more corrections in Full mode
• Early gains: first 10 minutes
Closer Look into the Data
Soft Evaluations
• Questionnaires with all three institutions
• Favorite aspect: batch corrections
• Main problems: stability, correction of segmentation errors
Future work
• Extend to new Languages e.g. Latin
• New Correction Scenarios e.g. specific Named Entity Correction
• Turn Interest into a Community and Implement Industrial Tool Partnerships for isolated parts of the Software
Thanks for your attention!
… and special thanks to the University of Alicante, the Bavarian State Library, and the Royal Library of the Netherlands for their time and effort during the experiments