IMPACT Final Conference - Michael Fuchs


ABBYY FineReader: IMPACT Improvements with Michael Fuchs from ABBYY Europe

Transcript of IMPACT Final Conference - Michael Fuchs

Page 1: IMPACT Final Conference - Michael Fuchs

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

ABBYY & OCR Improvements for IMPACT

Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
[email protected]

Page 2: IMPACT Final Conference - Michael Fuchs


Agenda

Who is ABBYY?
● Company Overview
● (Short) Product Overview
● ABBYY Technology in the IMPACT project

OCR & Processing – IMPACT improvements
● Binarisation, Segmentation, Recognition
● Dictionary API, Export Formats

Lessons Learned, Pricing, Pre-Announcement, Q&A

Page 3: IMPACT Final Conference - Michael Fuchs

ABBYY & OCR for IMPACT

ABBYY & IMPACT


Page 4: IMPACT Final Conference - Michael Fuchs

ABBYY Group

Overview
● Founded in 1989 as BIT Software
● More than 1000 employees in 14 offices worldwide
● Headquarters/R&D in Moscow, Russia

Page 5: IMPACT Final Conference - Michael Fuchs

ABBYY OCR Products – Usage View

OCR & Document Conversion

Desktop/Workgroup
● FineReader (Professional, Corporate, Site Licence Edition). Note: No Gothic/Fraktur OCR!
● PDF Transformer
● FotoReader
● ScreenshotReader
● Users are: End Users, Companies, (Libraries)
● User driven processing, ready to use

Server/Backend
● Recognition Server (Professional, Extended Edition)
● Gothic/Fraktur OCR & XML Export Support!
● Users are: Companies, Scan Service Provider, Libraries
● Automated processing, ready to use

SDK/Integration
● FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems)
● Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iOS)
● Users are: Developers, Scan Service Provider, IMPACT Research
● Automated processing, development needed

Page 6: IMPACT Final Conference - Michael Fuchs

What (ABBYY) OCR can read...

Recognition Languages
● Almost 200 OCR languages
● 34 languages with dictionary support and spell check
● Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
● Chinese, Japanese, Korean (CJK): 4 character sets (Chinese traditional and simplified, Japanese, Korean)
● Arabic (Technical Preview in the SDK)

Font Types
● Recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.)
● OCR-A
● OCR-B
● MICR (E13B)
● CMC-7

Page 7: IMPACT Final Conference - Michael Fuchs


IMPACT & ABBYY

ABBYY is the OCR technology provider for IMPACT members

ABBYY also improved the core technologies for the recognition of old documents in IMPACT. Focus areas are/were:
● Image pre-processing
● Segmentation
● Character recognition
● Export

IMPACT members work with the Software Development Kit (SDK) FineReader Engine – not the desktop application

IMPACT focus is/was on research and not on setting up a production system ;o)

Improved technologies are/will be added to current/future products

Page 8: IMPACT Final Conference - Michael Fuchs

Designed not to be OCRed

Page 9: IMPACT Final Conference - Michael Fuchs

Why ABBYY? - OCR …

Original Image [perfect quality :o)]
Std. OCR*
ABBYY Fraktur OCR*

*Recognition Server 3.0 R1, with Gothic/Fraktur disabled and enabled

Page 10: IMPACT Final Conference - Michael Fuchs

ABBYY “History” and Old Fonts Recognition

● 2003: FineReader XIX (V7 Technology), METAe result 2000-2003
● 2008: FineReader Engine 9.0 (Release 1), Pre-IMPACT “State of the Art”
● 2010: FineReader Engine 10, IMPACT Project Optimizations

Page 11: IMPACT Final Conference - Michael Fuchs

ABBYY and Old European Fonts Accuracy Comparison:

ABBYY Technology Version 10 recognition of old European fonts:
● 25% more accurate than FRE 9.0
● 38% more accurate than FR XIX
● Up to 98.2% on good quality images

[Chart: accuracy for 2003, 2008 and 2010]

Page 12: IMPACT Final Conference - Michael Fuchs

OCR Processing Steps & ABBYY Improvements for IMPACT

Page 13: IMPACT Final Conference - Michael Fuchs

Processing Steps

Step 1. Scanning, Image Loading, Pre-Processing and Modification
Compensating image defects and making the document suited for automatic OCR

Step 2. Document Layout Analysis
Layout analysis, detection of document sections like text, images and barcodes

Step 3. (Optical) Character Recognition
Automatic recognition of characters, applying the selected recognition languages & dictionaries

Step 4. (optional) Verification, by operators or automated post-correction
Manual validation of suspicious characters and words

Step 5. Document Synthesis and Export
Generating an output document in the selected format
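The five processing steps can be sketched as a chain of small functions. This is only an illustration of how the stages hand data to each other, not ABBYY's implementation: every name here (`Document`, `Word`, `fake_recogniser`, the 0.8 confidence threshold) is a hypothetical placeholder, and each stage body is a trivial stand-in.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    text: str
    confidence: float  # 0.0 .. 1.0, hypothetical recogniser score


@dataclass
class Document:
    image: list                                     # rows of greyscale values
    regions: list = field(default_factory=list)     # filled in Step 2
    words: list = field(default_factory=list)       # filled in Step 3
    suspicious: list = field(default_factory=list)  # filled in Step 4


def preprocess(doc):
    # Step 1: stand-in for defect compensation (here: clamp values to 0..255)
    doc.image = [[min(255, max(0, p)) for p in row] for row in doc.image]
    return doc


def analyse_layout(doc):
    # Step 2: stand-in that treats the whole page as a single text region
    doc.regions = [{"type": "text", "rows": (0, len(doc.image))}]
    return doc


def recognise(doc, recogniser):
    # Step 3: delegate to a caller-supplied recogniser (stand-in for an OCR engine)
    for region in doc.regions:
        doc.words.extend(recogniser(doc.image, region))
    return doc


def verify(doc, threshold=0.8):
    # Step 4 (optional): collect low-confidence words for manual validation
    doc.suspicious = [w for w in doc.words if w.confidence < threshold]
    return doc


def export(doc):
    # Step 5: synthesise the output document (here: plain text)
    return " ".join(w.text for w in doc.words)


def fake_recogniser(image, region):
    # Hypothetical fixed result so that the sketch runs without an OCR engine
    return [Word("Fraktur", 0.95), Word("Schrift", 0.60)]


doc = Document(image=[[0, 300], [128, -5]])
doc = verify(recognise(analyse_layout(preprocess(doc)), fake_recogniser))
text = export(doc)
print(text)                              # Fraktur Schrift
print([w.text for w in doc.suspicious])  # ['Schrift']
```

Keeping verification as a separate stage mirrors the slide: it is optional and can be skipped without touching recognition or export.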

Page 14: IMPACT Final Conference - Michael Fuchs

Step 1: Image pre-processing

Page 15: IMPACT Final Conference - Michael Fuchs

Step 1: Image pre-processing
Image Loading, Pre-Processing and Modification

● Intelligent background filtering
● Adaptive Binarisation

General binarisation at image level (one threshold for the whole image) cannot deliver good results for OCR
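To illustrate why a single image-level threshold struggles, the following minimal sketch compares it with a simple local-mean threshold. The slides do not describe ABBYY's adaptive binarisation, so this uses a generic textbook technique; the function names, the window `radius`, the `offset` and the sample pixel values are all assumptions made for the example.

```python
def global_threshold(image, threshold=128):
    """Mark a pixel as ink (1) if it is darker than one fixed threshold."""
    return [[1 if p < threshold else 0 for p in row] for row in image]


def adaptive_threshold(image, radius=1, offset=20):
    """Mark a pixel as ink (1) if it is clearly darker than its local mean."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


# Left half: bright paper (200) with ink (120).
# Right half: darkened paper (100) with ink (40).
image = [[200, 120, 200, 100, 40, 100],
         [200, 120, 200, 100, 40, 100]]

print(global_threshold(image)[0])    # [0, 1, 0, 1, 1, 1]
print(adaptive_threshold(image)[0])  # [0, 1, 0, 0, 1, 0]
```

With the fixed threshold the darkened paper on the right is classified as ink, while the local comparison keeps only the two ink columns.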

Page 16: IMPACT Final Conference - Michael Fuchs

Step 1: Image pre-processing
New V10: Binarisation, Textured Background optimisations

[Images: Original scan, V9 binarisation, New V10 binarisation]

Page 17: IMPACT Final Conference - Michael Fuchs

Step 1: Image pre-processing
New V10: Binarisation, Textured Background optimisations

[Images: Original scan, V9 binarisation, V10 binarisation]

Page 18: IMPACT Final Conference - Michael Fuchs

Step 1: Image pre-processing
New V10: Binarisation for the IMPACT project

[Images: Original, State of Art (V9), New (V10)]

No text from the other page!

Page 19: IMPACT Final Conference - Michael Fuchs

Step 2: Document Layout Analysis

Page 20: IMPACT Final Conference - Michael Fuchs

Step 2: Document Layout Analysis
Analyze layout and find text, images, tables and barcodes

Page 21: IMPACT Final Conference - Michael Fuchs

Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Image/Text detection – Example 1/3

[Images: V9 Technology, V10 Technology]

Part of the column was detected as an image

Page 22: IMPACT Final Conference - Michael Fuchs

Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Word Order Detection – Example 2/3

[Images: V9 Technology, V10 Technology]

Fewer linear word order errors

Page 23: IMPACT Final Conference - Michael Fuchs

Step 2: Document Layout Analysis (old Newspapers)
Segmentation Improvements: Lost text (no Detection) – Example 3/3

[Images: V9 Technology, V10 Technology]

Less lost text

Page 24: IMPACT Final Conference - Michael Fuchs

Step 2: Document Layout Analysis
Segmentation Improvements: IMPACT Results over time

Before IMPACT:
Overall segmentation improvements
● Better picture detection
● Better separators
● Better page layout reconstruction
Only a random set of old newspapers available

After IMPACT:
IMPACT Segmentation Ground Truth available
New (internal) document analysis (DA) model for historic newspapers
New segmentation evaluation methodology
Evaluation results on newspapers
● 40% fewer split/merge errors
● 25% less garbage and lost text

Page 25: IMPACT Final Conference - Michael Fuchs

Step 3: Text/Character Recognition

Page 26: IMPACT Final Conference - Michael Fuchs

Step 3: Text/Character Recognition
Samples for Classifiers used in ABBYY technologies

After line detection, character recognition is applied with different classifiers:
● Raster classifier
● Contour classifier
● Feature differentiating classifier
● Structure classifier

Page 27: IMPACT Final Conference - Michael Fuchs

Step 3: Text/Character Recognition
Optimization and new Developments

Improved Gothic Classifiers
● A significant amount of time was invested in gothic classifier training
● The library selection of ground truth material (historical relevance) was used
● New gothic graphemes were added

Results
● Good quality images: 2.8% (total) error rate on the used test set, which is about a 20% improvement over the “state of art” (V9) = almost comparable to modern documents
● Bad quality images: 7% (total) error rate on the used test set, which is about a 30% improvement over the “state of art” (V9)

Most of the improvements are available in current ABBYY products: ABBYY FineReader Engine 10 (SDK) & Recognition Server 3.0
Quality optimization will be continued in future releases and technology cycles
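The V9 error rates implied by the results above can be estimated with a short calculation. This sketch assumes that "improvement" means a relative reduction of the error rate, which the slide does not state explicitly; the function name is chosen for illustration only.

```python
def baseline_error_rate(new_rate, relative_improvement):
    """Error rate before an improvement, assuming a relative error reduction."""
    return new_rate / (1.0 - relative_improvement)


good_v9 = baseline_error_rate(2.8, 0.20)  # good quality images
bad_v9 = baseline_error_rate(7.0, 0.30)   # bad quality images

print(f"Good quality: V9 about {good_v9:.1f}% -> V10 2.8%")  # about 3.5%
print(f"Bad quality:  V9 about {bad_v9:.1f}% -> V10 7.0%")   # about 10.0%
```

Under this reading, V9 would have had roughly 3.5% errors on the good quality images and roughly 10% on the bad quality images of the same test set.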

Page 28: IMPACT Final Conference - Michael Fuchs

Step 3: Text/Character Recognition
Optimization and new Developments

Old Slavonic as new OCR Language (New Development)

[Images: Before, Now]

Page 29: IMPACT Final Conference - Michael Fuchs

Quality-Test-Comparison: Binarisation & Recognition Improvements

Page 30: IMPACT Final Conference - Michael Fuchs

Binarisation & Recognition Improvements

How to evaluate the recognition improvements of binarisation?

Binarisation & recognition quality go hand in hand!

-> # Errors = 100% with V9 binarisation & V9 recognition
-> # Errors = -5% with V9 binarisation & V10 recognition
-> # Errors = -11% with V10 binarisation & V9 recognition
-> # Errors = -15% with V10 binarisation & V10 recognition

[Chart: number of errors by Binarisation and Recognition Technology version]
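The four measurements can be put side by side as remaining error counts relative to the V9/V9 baseline. As a rough plausibility check, the sketch also computes what the combined result would be if the two individual gains were independent and simply multiplied; that independence is an assumption of this example, not a claim from the slide.

```python
# Change in number of errors (percent) keyed by (binarisation, recognition)
change_percent = {
    ("V9", "V9"): 0.0,      # baseline = 100% of errors
    ("V9", "V10"): -5.0,
    ("V10", "V9"): -11.0,
    ("V10", "V10"): -15.0,
}

# Remaining errors as a percentage of the baseline
remaining = {combo: 100.0 + delta for combo, delta in change_percent.items()}

# Hypothetical estimate: both gains act independently and multiply
expected_combined = remaining[("V9", "V10")] * remaining[("V10", "V9")] / 100.0

for (binarisation, recognition), value in remaining.items():
    print(f"{binarisation} binarisation + {recognition} recognition: {value:.0f}% of baseline errors")
print(f"Independent-gains estimate for V10 + V10: {expected_combined:.2f}%")  # 84.55%
```

The estimate of about 84.5% is close to the measured 85%, which fits the statement that binarisation and recognition quality go hand in hand.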

Page 31: IMPACT Final Conference - Michael Fuchs

Step 3-5: Dictionaries & Export

Page 32: IMPACT Final Conference - Michael Fuchs

Step 3 – 5: Other Optimizations

External Dictionary API Tuning
● External Dictionary API was available in the FineReader Engine (SDK)
● Support for any language, any time period
● API was/is heavily used by IMPACT language partners to run quality tests

New ALTO XML Export Formats
● FineReader Engine 10 R2, December 2010
● Recognition Server 3.0, July 2011
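For readers unfamiliar with ALTO, the following sketch shows the general shape of such an export: recognised words stored as `String` elements with position and word confidence (`WC`) inside `TextLine`, `TextBlock`, `PrintSpace`, `Page` and `Layout`. It is a simplified illustration built with Python's standard library, not the output of FineReader Engine or Recognition Server; it omits the ALTO namespace, the `Description` section and other attributes a schema-valid file needs, and the sample words and coordinates are invented.

```python
import xml.etree.ElementTree as ET


def words_to_alto(words):
    """Build a simplified ALTO-like tree from (text, hpos, vpos, width, height, wc) tuples."""
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page", ID="P1")
    print_space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(print_space, "TextBlock", ID="TB1")
    line = ET.SubElement(block, "TextLine")
    for text, hpos, vpos, width, height, wc in words:
        ET.SubElement(
            line, "String",
            CONTENT=text,
            HPOS=str(hpos), VPOS=str(vpos),
            WIDTH=str(width), HEIGHT=str(height),
            WC=f"{wc:.2f}",  # word confidence between 0 and 1
        )
    return alto


# Invented sample data for illustration
sample = [
    ("Fraktur", 100, 200, 180, 40, 0.97),
    ("Schrift", 290, 200, 170, 40, 0.62),
]
xml_text = ET.tostring(words_to_alto(sample), encoding="unicode")
print(xml_text)
```

Because each word carries its coordinates and confidence, downstream tools can highlight search hits on the page image or route low-confidence words to correction.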

Page 33: IMPACT Final Conference - Michael Fuchs

Additional Notes

Page 34: IMPACT Final Conference - Michael Fuchs

Further Information & Trial Versions

The ABBYY Gothic/Fraktur OCR Portal: www.frakturschrift.com

Page 35: IMPACT Final Conference - Michael Fuchs

What IMPACT taught ABBYY about Libraries & Mass Digitalization projects…

The Reality
● Masses of books/documents are available & already scanned
● It is unclear if Antiqua and/or Gothic/Fraktur fonts are used in the documents
● Pre-sorting is impossible, it would be too expensive in time and cost

ABBYY Europe's Answer
● Reduced the pricing for mixed “Old” + “Modern” font OCR projects
● The pricing is now ready for “mass processing”

Examples: Recognition Server 3.0 with “Gothic” enabled
● 10,000 pages: 299 Euro, available online
● 500,000 pages*: 5,000 Euro = 1 Euro cent per page = ca. 2,000 books of 250 pages each
● Over 3 million pages*: ca. 0.52 Euro cent per page = 12,000 books at 1.25 € each (250 pages)
● Over 10 million pages*: ca. 40,000 books = ca. 0.5 € per book

... No more excuses for not OCRing :o)

* page size is A4, bigger formats are counted as multiple pages
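The per-page and per-book figures in the pricing examples follow from simple arithmetic, sketched here for the tiers whose totals are stated or can be derived. The helper names are chosen for illustration, and the 20,000 Euro total for the largest tier is derived from the quoted 40,000 books at ca. 0.5 € rather than stated on the slide.

```python
PAGES_PER_BOOK = 250  # book size used in the pricing examples


def cents_per_page(total_euro, pages):
    """Price per page in Euro cents."""
    return total_euro * 100 / pages


def books(pages, pages_per_book=PAGES_PER_BOOK):
    """Number of whole books that fit into a page volume."""
    return pages // pages_per_book


print(cents_per_page(299, 10_000))     # 2.99
print(cents_per_page(5_000, 500_000))  # 1.0
print(books(500_000))                  # 2000
print(books(3_000_000))                # 12000
print(books(10_000_000))               # 40000

# Derived, not quoted: 40,000 books at ca. 0.5 Euro each
implied_total_euro = books(10_000_000) * 0.5
print(cents_per_page(implied_total_euro, 10_000_000))  # 0.2
```

The slide marks most of its figures with "ca.", so small differences between these results and the quoted per-book prices are to be expected.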

Page 36: IMPACT Final Conference - Michael Fuchs

Pre-Announcement
ABBYY Online OCR Services with Gothic/Fraktur

The ABBYY Gothic/Fraktur OCR Portal: finereader.abbyyonline.com
● Historic OCR added just last week
● Web GUI to upload documents and get results
● Simple to use
● Low Volume, ad hoc Usage
● Instant results, quality evaluation
● Pay as you go

ABBYY Online OCR SDK
● OCR Service with API and XML Output
● Runs on Windows Azure
● Currently Closed Beta Test
● Public Beta Test Q1/2012

Page 37: IMPACT Final Conference - Michael Fuchs

Summary

Page 38: IMPACT Final Conference - Michael Fuchs

The whole is greater than the sum of its parts (Aristotle)

Page 39: IMPACT Final Conference - Michael Fuchs


Thank you for your attention!

Questions?

Michael Fuchs
Senior Product Marketing Manager
ABBYY Europe
[email protected]