IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
ABBYY & OCR Improvements for IMPACT
Michael Fuchs, Senior Product Marketing Manager, ABBYY [email protected]
Agenda
● Who is ABBYY? Company Overview, (Short) Product Overview, ABBYY Technology in the IMPACT project
● OCR & Processing – IMPACT improvements: Binarisation, Segmentation, Recognition, Dictionary API, Export Formats
● Lessons Learned, Pricing, Pre-Announcement, Q&A
ABBYY & OCR for IMPACT
ABBYY & IMPACT
ABBYY Group
Overview ABBYY Group
● Founded in 1989 as BIT Software
● More than 1000 employees in 14 offices worldwide
● Headquarters/R&D in Moscow, Russia
ABBYY OCR Products – Usage View
OCR & Document Conversion
Desktop/Workgroup
● Products: FineReader (Professional, Corporate, Site Licence Edition; note: no Gothic/Fraktur OCR!), PDF Transformer, FotoReader, ScreenshotReader
● Users are: End Users, Companies, (Libraries)
● User driven processing, ready to use
Server/Backend
● Products: Recognition Server (Professional, Extended Edition)
● Gothic/Fraktur OCR & XML Export Support!
● Users are: Companies, Scan Service Providers, Libraries
● Automated processing, ready to use
SDK/Integration
● Products: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iOS)
● Users are: Developers, Scan Service Providers, IMPACT Research
● Automated processing, development needed
What (ABBYY) OCR can read...
Recognition Languages
● Almost 200 OCR languages
● 34 languages with dictionary support and spell check
● Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
● Chinese, Japanese, Korean (CJK): 4 character sets (Chinese traditional and simplified, Japanese, Korean)
● Arabic (Technical Preview in the SDK)
Font Types
● Recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.)
● OCR-A, OCR-B, MICR (E13B), CMC-7
IMPACT & ABBYY
ABBYY is the OCR technology provider for IMPACT members
ABBYY also improved the core technologies for the recognition of old documents in IMPACT. Focus areas are/were:
● Image pre-processing
● Segmentation
● Character recognition
● Export
IMPACT members work with the Software Development Kit (SDK) FineReader Engine – not the desktop application
IMPACT focus is/was on research and not on setting up a production system ;o)
Improved technologies are/will be added to current/future products
Designed not to be OCRed
Why ABBYY? - OCR …
[Figure: original image (perfect quality :o) ) next to the results of Std. OCR* and ABBYY Fraktur OCR*]
* Recognition Server 3.0 R1 – Gothic/Fraktur disabled and enabled
ABBYY “History” and Old Fonts Recognition
● 2003: FineReader XIX (V7 Technology), METAe result 2000-2003
● 2008: FineReader Engine 9.0 (Release 1), Pre-IMPACT “State of the Art”
● 2010: FineReader Engine 10, IMPACT Project Optimizations
ABBYY and Old European Fonts Accuracy Comparison:
ABBYY Technology Version 10 recognition of old European fonts:
● 25% more accurate than FRE 9.0
● 38% more accurate than FR XIX
● Up to 98.2% on good quality images
[Chart: comparison across 2003, 2008 and 2010]
OCR Processing Steps & ABBYY Improvements for IMPACT
Step 1. Scanning, Image Loading, Pre-Processing and Modification
● Compensating image defects and making the document suited for automatic OCR
Step 2. Document Layout Analysis
● Layout analysis, detection of document sections like text, images and barcodes
Step 3. (Optical) Character Recognition
● Automatic recognition of characters, applying the selected recognition languages & dictionaries
Step 4. (optional) Verification, by operators or automated post-correction
● Manual validation of suspicious characters and words
Step 5. Document Synthesis and Export
● Generating an output document in the selected format
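The five steps can be read as a simple pipeline in which each stage enriches the page and hands it on. The following is a minimal Python sketch of that flow only. Every name in it (`Page`, `preprocess`, `analyse_layout`, `recognise`, `verify`, `export`, `run_pipeline`) is a hypothetical placeholder and not a FineReader Engine call.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical container handed from step to step."""
    image: str                                   # stand-in for pixel data
    blocks: list = field(default_factory=list)   # filled by layout analysis
    text: list = field(default_factory=list)     # filled by recognition
    log: list = field(default_factory=list)      # records which steps ran


def preprocess(page):            # Step 1: placeholder, would clean the image
    page.log.append("pre-processing")
    return page


def analyse_layout(page):        # Step 2: placeholder, would find text/image blocks
    page.blocks = ["text block 1"]
    page.log.append("layout analysis")
    return page


def recognise(page):             # Step 3: placeholder, would run OCR per block
    page.text = [f"<text of {block}>" for block in page.blocks]
    page.log.append("recognition")
    return page


def verify(page):                # Step 4 (optional): placeholder for correction
    page.log.append("verification")
    return page


def export(page):                # Step 5: placeholder, would write the output format
    page.log.append("export")
    return "\n".join(page.text)


def run_pipeline(image, with_verification=False):
    """Run the steps in order; verification is only included on request."""
    page = Page(image=image)
    steps = [preprocess, analyse_layout, recognise]
    if with_verification:
        steps.append(verify)
    for step in steps:
        page = step(page)
    return export(page), page.log
```

Calling `run_pipeline("scan.tif", with_verification=True)` returns the exported text together with the list of steps that ran, which makes the optional nature of step 4 visible.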
Processing Steps
Step 1: Image pre-processing
Step 1: Image pre-processing
Image Loading, Pre-Processing and Modification
● Intelligent background filtering
● Adaptive Binarisation
General binarisation on an image level cannot deliver good results for OCR
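Why a single image-wide threshold fails can be shown on a tiny synthetic example. The sketch below compares a global threshold with a simple local-mean threshold in plain Python. It is a generic textbook technique chosen for illustration, not ABBYY's binarisation algorithm, and the window radius, offset and test image are assumptions.

```python
def global_binarise(image, threshold):
    """Mark a pixel as text (1) if it is darker than one fixed threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def adaptive_binarise(image, radius=2, offset=20):
    """Mark a pixel as text (1) if it is darker than its local mean minus offset."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Window around (x, y), clipped at the image borders.
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result


# Synthetic strip: the background darkens from 200 to 90 (uneven lighting),
# with two dark text strokes at columns 2 and 9.
row = [200 - 10 * x for x in range(12)]
row[2] -= 80
row[9] -= 80
image = [row[:] for _ in range(3)]

global_result = global_binarise(image, threshold=100)
adaptive_result = adaptive_binarise(image)
```

With the fixed threshold of 100 the stroke on the bright side is lost and a dark background pixel on the right is marked as text. In this example no fixed threshold separates both strokes from the background, while the local threshold marks exactly columns 2 and 9.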
Step 1: Image pre-processing
New V10: Binarisation, Textured Background optimisations
[Figure, example 1: original scan, V9 binarisation, new V10 binarisation]
[Figure, example 2: original scan, V9 binarisation, V10 binarisation]
Step 1: Image pre-processing
New V10: Binarisation for the IMPACT project
[Figure: Original, State of the Art (V9), New (V10)]
No text from the other page!
Step 2: Document Layout Analysis
Step 2: Document Layout Analysis
Analyze layout and find text, images, tables and barcodes
Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Image/Text detection – Example 1/3
[Figure: V9 Technology vs. V10 Technology]
Part of the column was detected as an image
Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Word Order Detection – Example 2/3
[Figure: V9 Technology vs. V10 Technology]
Fewer linear word order errors
Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Lost text (no Detection) – Example 3/3
[Figure: V9 Technology vs. V10 Technology]
Less lost text
Step 2: Document Layout Analysis
Segmentation Improvements: IMPACT Results over time
Before IMPACT:
Overall segmentation improvements
● Better picture detection
● Better separators
● Better page layout reconstruction
Only a random set of old newspapers available
After IMPACT:
IMPACT Segmentation Ground Truth available
New (internal) document analysis (DA) model for historic newspapers
New segmentation evaluation methodology
Evaluation results on newspapers
● 40% fewer split/merge errors
● 25% less garbage and lost text
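To make "split/merge errors" concrete: a split is a ground-truth region that the engine cut into several regions, and a merge is a detected region that covers several ground-truth regions. The sketch below counts both from bounding boxes. It is a simplified illustration under assumptions (axis-aligned boxes, a 25% overlap rule) and not the IMPACT evaluation methodology. All names are hypothetical.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection(a, b):
    """Area shared by two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def overlaps(a, b, min_fraction=0.25):
    """True if the shared area is a significant part of the smaller box."""
    smaller = min(area(a), area(b))
    return smaller > 0 and intersection(a, b) / smaller >= min_fraction


def count_split_merge(ground_truth, detected, min_fraction=0.25):
    """Return (splits, merges) for two lists of boxes."""
    splits = sum(
        1 for gt in ground_truth
        if sum(overlaps(gt, d, min_fraction) for d in detected) >= 2
    )
    merges = sum(
        1 for d in detected
        if sum(overlaps(gt, d, min_fraction) for gt in ground_truth) >= 2
    )
    return splits, merges


# Two newspaper columns in the ground truth.
ground_truth = [(0, 0, 100, 300), (110, 0, 210, 300)]
# The engine joined the top of both columns and kept the bottoms separate.
detected = [(0, 0, 210, 140), (0, 150, 100, 300), (110, 150, 210, 300)]

splits, merges = count_split_merge(ground_truth, detected)
```

In this example both columns are split (each is covered by two detected regions) and one detected region merges the two columns.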
Step 3: Text/Character Recognition
Step 3: Text/Character Recognition
Samples for Classifiers used in ABBYY technologies
After line detection, character recognition is applied with different classifiers:
● Raster classifier
● Contour classifier
● Feature differentiating classifier
● Structure classifier
Step 3: Text/Character Recognition
Optimization and new Developments: Improved Gothic Classifiers
● A significant amount of time was invested in Gothic classifier training
● The libraries' selection of ground truth material (historical relevance) was used
● New Gothic graphemes were added
Results
● Good quality images: 2.8% (total) error rate on the test set used, which is about a 20% improvement over the “state of the art” (V9) and almost comparable to modern documents
● Bad quality images: 7% (total) error rate on the test set used, which is about a 30% improvement over the “state of the art” (V9)
Most of the improvements are available in current ABBYY products: ABBYY FineReader Engine 10 (SDK) & Recognition Server 3.0
Quality optimization will be continued in future releases, and technology cycles optimized
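For readers who want to reproduce such figures on their own material, a character error rate is commonly computed as the edit distance between ground truth and OCR output, divided by the ground-truth length. The sketch below shows that calculation and, assuming the quoted improvements are relative reductions of the error rate, the V9 error rates they would imply. That reading of the percentages is an assumption, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Errors per ground-truth character."""
    return edit_distance(reference, hypothesis) / len(reference)


def implied_baseline(new_error_rate, relative_improvement):
    """Error rate before the improvement, read as a relative reduction."""
    return new_error_rate / (1 - relative_improvement)


# One wrong character in a 14-character word.
cer = character_error_rate("Frakturschrift", "Fralturschrift")

# Implied V9 error rates under the relative-reduction assumption.
v9_good = implied_baseline(2.8, 0.20)   # about 3.5 (%)
v9_bad = implied_baseline(7.0, 0.30)    # about 10 (%)
```

Under that assumption, V9 would have been at roughly 3.5% on good quality images and roughly 10% on bad quality images of the same test set.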
Step 3: Text/Character Recognition
Optimization and new Developments
Old Slavonic as new OCR Language (New Development)
[Figure: recognition result before and now]
Quality Test Comparison: Binarisation & Recognition Improvements
Binarisation & Recognition Improvements
How to evaluate the recognition improvements of binarisation?
Binarisation & recognition quality go hand in hand!
-> # Errors = 100% (baseline) with V9 binarisation & V9 recognition
-> # Errors = -5% with V9 binarisation & V10 recognition
-> # Errors = -11% with V10 binarisation & V9 recognition
-> # Errors = -15% with V10 binarisation & V10 recognition
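The four measurements can be checked against each other with a little arithmetic. The sketch below assumes each percentage is a change in the number of errors relative to the V9/V9 baseline, and asks what the combined result would be if the two improvements acted independently. That independence model is an illustration, not a claim from the measurements.

```python
BASELINE = 100.0  # errors with V9 binarisation & V9 recognition, in percent

# (binarisation, recognition) -> change in errors relative to the baseline
changes = {
    ("V9", "V9"): 0.0,
    ("V9", "V10"): -5.0,
    ("V10", "V9"): -11.0,
    ("V10", "V10"): -15.0,
}

# Remaining errors for each combination, in percent of the baseline.
remaining = {combo: BASELINE + change for combo, change in changes.items()}

# If both improvements acted independently, their factors would multiply.
recognition_factor = remaining[("V9", "V10")] / BASELINE    # 0.95
binarisation_factor = remaining[("V10", "V9")] / BASELINE   # 0.89
predicted_combined = BASELINE * recognition_factor * binarisation_factor

# Simply adding the two individual gains would overstate the effect.
naive_sum = changes[("V9", "V10")] + changes[("V10", "V9")]  # -16.0
```

The multiplicative estimate leaves about 84.6% of the errors, close to the measured 85%, while adding the two gains would suggest -16%. Most of the gain in this comparison comes from binarisation.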
[Chart axes: Binarisation, Recognition Technology]
Step 3-5: Dictionaries & Export
Step 3 – 5: Other Optimizations
External Dictionary API Tuning
● External Dictionary API was available in the FineReader Engine (SDK)
● Support for any language, any time period
● API was/is heavily used by IMPACT language partners to run quality tests
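The idea behind an external dictionary can be sketched independently of any SDK: the recognition step proposes candidate words, and a lexicon for the right language and period decides which candidate is plausible. The Python below is a hypothetical illustration only. `PeriodDictionary` and `pick_candidate` are invented names and do not reflect the FineReader Engine interface.

```python
class PeriodDictionary:
    """Hypothetical external dictionary for one language and time period."""

    def __init__(self, words):
        self._words = {word.lower() for word in words}

    def contains(self, word):
        return word.lower() in self._words


def pick_candidate(candidates, dictionary):
    """Prefer the most confident candidate that the dictionary knows.

    `candidates` is a list of (word, confidence) pairs, as a recogniser
    might offer for one word image. If no candidate is in the dictionary,
    the most confident one is returned unchanged.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    for word, _confidence in ranked:
        if dictionary.contains(word):
            return word
    return ranked[0][0]


# Historical German spellings that a modern dictionary would reject.
lexicon = PeriodDictionary(["Thür", "seyn", "Theil"])

chosen = pick_candidate([("Thiir", 0.48), ("Thür", 0.45)], lexicon)
fallback = pick_candidate([("Xyz", 0.60), ("Xyy", 0.30)], lexicon)
```

Here the slightly less confident reading "Thür" wins because the period lexicon knows it, while an unknown word falls back to the recogniser's best guess.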
New ALTO XML Export Formats
● FineReader Engine 10 R2, December 2010
● Recognition Server 3.0, July 2011
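ALTO stores each recognised word with its position on the page, which is what makes it attractive for libraries. The sketch below builds a heavily reduced ALTO document with Python's standard library to show that structure. It is an illustration of the format, not the output of any ABBYY product. It omits parts a complete file would carry (for example the Description section), and the word coordinates are made up.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"  # ALTO 2.x namespace


def words_to_alto(words):
    """Build a reduced ALTO tree from word tuples.

    Each tuple is (text, hpos, vpos, width, height, confidence).
    All words are placed on one line of one text block for simplicity.
    """
    ET.register_namespace("", ALTO_NS)

    def tag(name):
        return f"{{{ALTO_NS}}}{name}"

    alto = ET.Element(tag("alto"))
    layout = ET.SubElement(alto, tag("Layout"))
    page = ET.SubElement(layout, tag("Page"), ID="P1", PHYSICAL_IMG_NR="1")
    print_space = ET.SubElement(page, tag("PrintSpace"))
    block = ET.SubElement(print_space, tag("TextBlock"), ID="TB1")
    line = ET.SubElement(block, tag("TextLine"))
    for text, hpos, vpos, width, height, confidence in words:
        ET.SubElement(
            line, tag("String"),
            CONTENT=text,
            HPOS=str(hpos), VPOS=str(vpos),
            WIDTH=str(width), HEIGHT=str(height),
            WC=f"{confidence:.2f}",   # word confidence between 0 and 1
        )
    return ET.tostring(alto, encoding="unicode")


# Made-up coordinates for two words of a Fraktur line.
alto_xml = words_to_alto([
    ("Die", 120, 340, 85, 42, 0.97),
    ("Zeitung", 215, 340, 190, 42, 0.91),
])
```

Each `String` element carries the word in `CONTENT`, its bounding box in `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`, and a word confidence in `WC`.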
Additional Notes
Further Information & Trial Versions
The ABBYY Gothic/Fraktur OCR Portal: www.frakturschrift.com
What IMPACT taught ABBYY about Libraries & Mass Digitalization projects…
The Reality
● Masses of books/documents are available & already scanned
● It is unclear if Antiqua and/or Gothic/Fraktur fonts are used in the documents
● Pre-sorting is impossible, it would be too time/cost expensive
ABBYY Europe's Answer
● Reduced the pricing for mixed “Old” + “Modern” font OCR projects
● The pricing is now ready for “mass processing”
Examples: Recognition Server 3.0 with “Gothic” enabled
● 10,000 pages – 299 Euro – available online
● 500,000 pages* – 5,000 Euro = 1 Euro cent per page = ca. 2,000 books of 250 pages each
● Over 3 Mio pages* – ca. 0.52 Euro cent per page = 12,000 books at 1.25 € each (250 pages)
● Over 10 Mio pages* – ca. 40,000 books = ca. 0.5 € per book
... No more excuses for not OCRing :o)
* page size is A4, bigger formats are counted as multiple pages
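The per-page and per-book figures follow from simple division, which the short sketch below reproduces for the listed tiers. It assumes 250 A4 pages per book, as on the slide. The helper names are hypothetical.

```python
PAGES_PER_BOOK = 250  # book size used in the examples


def cents_per_page(total_euro, pages):
    """Price per page in Euro cents for a package of `pages` pages."""
    return total_euro * 100 / pages


def euro_per_book(page_price_cents, pages_per_book=PAGES_PER_BOOK):
    """Price per book in Euro for a given per-page price in cents."""
    return page_price_cents * pages_per_book / 100


small = cents_per_page(299, 10_000)        # online package
medium = cents_per_page(5_000, 500_000)    # 500,000 page package

books_medium = 500_000 // PAGES_PER_BOOK       # books in 500,000 pages
books_large = 3_000_000 // PAGES_PER_BOOK      # books in 3 Mio pages
books_xlarge = 10_000_000 // PAGES_PER_BOOK    # books in 10 Mio pages

book_price_large = euro_per_book(0.52)         # at ca. 0.52 cent per page
page_price_xlarge = 0.5 * 100 / PAGES_PER_BOOK  # cents per page at 0.5 € per book
```

The small online package works out at 2.99 cent per page and the 500,000 page package at 1 cent per page. At 0.52 cent per page a 250-page book costs about 1.30 €, slightly above the rounded 1.25 € quoted, which corresponds to exactly 0.5 cent per page. At 0.5 € per book the largest tier is about 0.2 cent per page.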
Pre-Announcement: ABBYY Online OCR Services with Gothic/Fraktur
The ABBYY Gothic/Fraktur OCR Portal: finereader.abbyyonline.com
● Historic OCR added just last week
● Web GUI to upload documents and get results
● Simple to use
● Low volume, ad hoc usage
● Instant results, quality evaluation
● Pay as you go
ABBYY Online OCR SDK
● OCR Service with API and XML Output
● Runs on Windows Azure
● Currently Closed Beta Test
● Public Beta Test Q1/2012
Summary
The whole is greater than the sum of its parts (Aristotle)
Thank you for your attention!
Questions?
Michael Fuchs, Senior Product Marketing Manager, ABBYY [email protected]