IMPACT Final Conference - Jesse de Does

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

IMPACT Lexica in OCR and IR: Evaluation for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovenian, Spanish. Jesse de Does


Evaluation of lexicon-supported OCR and information retrieval, with Jesse de Does from the INL

Transcript of IMPACT Final Conference - Jesse de Does

Page 1: IMPACT Final Conference - Jesse de Does


IMPACT Lexica in OCR and IR

Evaluation for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovenian, Spanish

Jesse de Does

Page 2: IMPACT Final Conference - Jesse de Does



Contents

OCR evaluation
– Use of lexica in OCR
– Evaluation method
– (Non-final) results

IR evaluation
– Use of lexica in IR
– Evaluation method
– (Very preliminary) results

Page 3: IMPACT Final Conference - Jesse de Does


Use of lexica in OCR

Note: this is not about post-correction, but about what happens during OCR.

Using the “Finereader Engine External Dictionary Interface”

Functionality:
– Any procedure that prunes a set of candidates and assigns weights can be implemented in this way
– Such a procedure need not be limited to the use of static word lists
– Permits dynamic implementations (spelling variation rules, morphology, …)

Page 4: IMPACT Final Conference - Jesse de Does



Finereader SDK external dictionaries

SDK users have to implement a COM interface to prune a set of “Fuzzy Words”.

[Figure: a Fuzzy Word with alternative character candidates (c, f, o, …) at several positions, covering the readings “eerste” and “eerde”]

The external dictionary prunes this to the linguistically possible ones (in this case: {eerste, eerde}).

Fuzzy Word: set of character recognition candidates for each position in a word
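The pruning idea can be illustrated with a minimal Python sketch. It does not use the Finereader COM interface; the data layout, the names `FUZZY_WORD`, `LEXICON`, `lexicon_score` and `prune`, the frequencies, and the use of an empty string for an optional position are all assumptions made for illustration only.

```python
from itertools import product

# Hypothetical Fuzzy Word: one list of character candidates per position.
# The empty string marks a position that may be absent (an assumption of this sketch).
FUZZY_WORD = [["e", "c"], ["e", "c"], ["r"], ["s", "d"], ["t", "e"], ["e", ""]]

# Hypothetical lexicon with made-up frequencies.
LEXICON = {"eerste": 900, "eerde": 100, "eerder": 500}


def lexicon_score(word):
    """Static word list: the weight is the lexicon frequency, 0 if unknown."""
    return LEXICON.get(word, 0)


def prune(fuzzy_word, score):
    """Keep only the readings that the scoring procedure accepts, with their weights."""
    readings = {"".join(chars) for chars in product(*fuzzy_word)}
    return {word: score(word) for word in readings if score(word) > 0}


print(prune(FUZZY_WORD, lexicon_score))
```

Because `prune` only needs a scoring function, a static word list could in principle be swapped for a dynamic procedure (spelling variation rules, morphology) without changing the rest of the sketch.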

Page 5: IMPACT Final Conference - Jesse de Does



Finereader SDK external dictionaries

[Figure: the same Fuzzy Word example, pruned to {eerste, eerde}]

Of course, a lot of things may go wrong in this simple scenario:
– The lexicon may be too small (you will never have all spelling variations, compounds, …)
– The lexicon may include typical OCR errors (eu, cn, …)
– The Fuzzy Word may be too restricted (or, of course, too comprehensive)


Page 6: IMPACT Final Conference - Jesse de Does



OCR Evaluation

We measure the performance of:
– Finereader SDK 10 with the default included dictionary
– Finereader SDK 10 with both the default dictionary AND a historical lexicon

Main performance indicator: word recall, i.e. after alignment, how many of the words in the ground truth have a (case-insensitive) match in the OCR. Errors on punctuation are not penalized.
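A word recall of this kind could be computed roughly as follows. This is a sketch under stated assumptions, not the project's evaluation tool: it aligns the two token sequences with Python's `difflib.SequenceMatcher`, strips ASCII punctuation only, and the function names `normalize` and `word_recall` are hypothetical.

```python
import string
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip ASCII punctuation, so case and punctuation are not penalized."""
    table = str.maketrans("", "", string.punctuation)
    tokens = (word.translate(table).lower() for word in text.split())
    return [token for token in tokens if token]


def word_recall(ground_truth, ocr):
    """Fraction of ground-truth words that have a match in the OCR after alignment."""
    gt, hyp = normalize(ground_truth), normalize(ocr)
    matcher = SequenceMatcher(a=gt, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gt) if gt else 0.0


# Made-up example: one of five ground-truth words ("eerste") is misrecognized.
print(word_recall("De eerste dag, in Zwolle.", "de eerfte dag in Zwolle"))  # 0.8
```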

Specific evaluation tool (only word accuracy):
– Workaround for region segmentation problems
– Displays specific information about dictionary coverage, performance on dictionary words, false friends, …

Page 7: IMPACT Final Conference - Jesse de Does



Page 8: IMPACT Final Conference - Jesse de Does



Dictionary “cleaning”

Dictionary hallucinations:
– Many in-dictionary errors (“false friends”)
– Many errors on short words

Dictionary cleaning procedures:
– Remove false friends (words related by a frequent OCR substitution to much more frequent words)
– Remove infrequent short words (even if correct)
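The two cleaning procedures can be sketched as a single filter over a frequency lexicon. The thresholds (`ratio`, `max_short_len`, `min_freq`), the function names and the toy data below are hypothetical; the slides do not state the values actually used.

```python
def variants(word, confusions):
    """Yield words obtained by applying one (ocr, gt) substitution at one position."""
    for ocr, gt in confusions:
        start = word.find(ocr)
        while start != -1:
            yield word[:start] + gt + word[start + len(ocr):]
            start = word.find(ocr, start + 1)


def clean_lexicon(freq, confusions, ratio=100, max_short_len=3, min_freq=5):
    """Drop infrequent short words and likely false friends from a frequency lexicon."""
    cleaned = {}
    for word, count in freq.items():
        # Infrequent short words are removed even if they are correct.
        if len(word) <= max_short_len and count < min_freq:
            continue
        # A false friend maps, via a frequent OCR substitution, to a much more frequent word.
        if any(freq.get(v, 0) >= ratio * count for v in variants(word, confusions)):
            continue
        cleaned[word] = count
    return cleaned


# Made-up frequencies; ("c", "e") means OCR reads "c" where the ground truth has "e".
freq = {"hebben": 3000, "hcbben": 2, "een": 5000, "en": 4000, "cn": 2, "huis": 50}
print(clean_lexicon(freq, confusions=[("c", "e")]))
```

Here `hcbben` is removed as a false friend of `hebben`, and `cn` is removed as an infrequent short word.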

Page 9: IMPACT Final Conference - Jesse de Does



Evaluation procedure (data)

IMPACT demonstrator sets (between ~1200 and 8000 pages each), split into:
– Development
– Evaluation
– Demonstration

OCR evaluation sets: a random choice of about 200 pages from the evaluation portion
– Manageable size (one experiment takes between 30 min and 1 h 30 min)

Page 10: IMPACT Final Conference - Jesse de Does



Bulgarian

– Bylgarska iliustracia, 1880
– Jenski glas, 1889
– Sborniche za spomen na 25-godishninata ot smyrtta na Levski, 1898
– Spisanie Dennica, 1890
– Ugozapadna Bulgaria, 1893
– Zelokupna Bulgaria, 1880

Page 11: IMPACT Final Conference - Jesse de Does



Page 12: IMPACT Final Conference - Jesse de Does



OCR→GT   Freq.
п→и      968
ж→ѫ      825
н→и      732
н→п      579
е→с      441
п→н      378
и→н      356
ь→ъ      270
ъ→ь      256
г→т      247
и→п      242
ш→ні     218
д→л      114

OCR→GT   Freq.
п→и      733
н→и      599
н→п      463
п→н      354
ь→ъ      330
ж→ѫ      283
и→н      249
ш→ні     220
ъ→ь      217
и→п      205
е→с      200
г→т      185
е→ѣ      165
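A frequency list of OCR→GT character confusions like the ones above can be tallied from aligned word pairs. The following is only a sketch of one way to do so, using `difflib.SequenceMatcher` for character alignment; it is not necessarily how the figures on the slide were produced, and the function name and example pairs are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(pairs):
    """Count (ocr_string, gt_string) substitutions over aligned (ocr_word, gt_word) pairs."""
    counts = Counter()
    for ocr, gt in pairs:
        matcher = SequenceMatcher(a=ocr, b=gt, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(ocr[i1:i2], gt[j1:j2])] += 1
    return counts


# Made-up aligned pairs (OCR output, ground truth).
pairs = [("eerfte", "eerste"), ("huif", "huis"), ("cn", "en")]
for (ocr, gt), n in confusion_counts(pairs).most_common():
    print(f"{ocr}→{gt}  {n}")
```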

Page 13: IMPACT Final Conference - Jesse de Does



Page 14: IMPACT Final Conference - Jesse de Does



Czech

– Co jest konstituce?, čili, Krátký, prostonárodní wýklad hlawnějších zásad konstitucí ewropejských, 1848
– Ferina Lišák z Kuliferdy a na Klukově, čili, Kratičká historye zlopověstných kousků starého Reinecke, 1848
– Homerowa Iliada, 1802
– Na den narození neimocněišího, a neijasněišího cysare rímského, téz dědičného rakauského a krále ceského, Frantiska II., w Praze 12. den mesyce Unora, léta 1805, 1805
– Plody sborů učenců řeči českoslowanské prešporského, 1836
– Rozprawy o gmenách, počátkách i starožitnostech národu Slawského a geho kmeni /, 1830
– Sokol, 1872
– Základowé pitwy (Anatomie), čili, Soustawnj rozbor a popis těla lidského a gednotliwých geho částek, 1840

Page 15: IMPACT Final Conference - Jesse de Does



Page 16: IMPACT Final Conference - Jesse de Does



Dutch

18th and 19th century books, newspapers, parliamentary papers, …

– Provinciale Overijsselsche en Zwolsche courant : staats-, handels-, nieuws- en advertentieblad, 1852-1852
– Rechtsgeleerd advis in de zaak van den gewezen stadhouder, en over deszelfs schryven aan de gouverneurs van de Oost- en West-Indische bezittingen van den staat [...]. Ingelevert [...] op den 7 january 1796. / By B. Voorda et al, 1796-1796
– Verhaal van het levensgevaar, waar in zig drie Rotterdamsche burgers [...] bevonden hebben, te Utrecht, 1784-1784
– Vrijmoedige aanmerkingen, over de uitsluiting van allen die door publieke armkassen bedeeld worden, als stemgerechtigden [...] bij eene oproeping van het Nederlandsche volk tot eene Nationaale Conventie, 1795-1795

Page 17: IMPACT Final Conference - Jesse de Does



Page 18: IMPACT Final Conference - Jesse de Does



Page 19: IMPACT Final Conference - Jesse de Does




English

Standard Finereader language: OldEnglish

15th-19th century material, 2 sets:
– One general set, 15th-19th century
– One 17th century-specific set

Page 20: IMPACT Final Conference - Jesse de Does



General set with various choices of dictionary – no improvement!

Page 21: IMPACT Final Conference - Jesse de Does



More distinct improvement on 17th century set with special dictionary compiled from OED quotations dated 1580-1720

Page 22: IMPACT Final Conference - Jesse de Does



French

Standard Finereader language: OldFrench

17th century books:
– Conduite du jugement naturel où tous les bons esprits de l'un et l'autre sexe pourront facilement puiser la pureté de la science, par M. Jacques Forton, sieur de S. Ange,..., 1653
– Dissertation de la philosophie en général, 1668
– La Dialectique du sieur de Launay, contenant l'art de raisonner juste sur toute sorte de matières..., 1673
– Lettre de M. Gadroys à M. de La Grange Trianon,... pour servir de réponse à celle que M. de Castelet a écrite contre les raisons de M. Descartes touchant le flux et le reflux de la mer. - Seconde lettre de M. Gadroys... [au même, sur le même sujet.], 1677
– Traitez de métaphysique démontrée selon la méthode des géomètres. [Par le sieur de La Coudraye.], 1693

Page 23: IMPACT Final Conference - Jesse de Does



Page 24: IMPACT Final Conference - Jesse de Does



German

Standard Finereader language: OldGerman

– Das Buch des heyligen Römischen Reichs unnderhalltunge, 1501
– Die Poesie ihr Wesen und ihre Formen mit Grundzügen der vergleichenden Literaturgeschichte, 1884
– Echo Deß Hochzeitlichen Te Deum Laudamus, 1722
– Ergebnisse der Erhebungen über die Beschäftigung gewerblicher Arbeiter an Sonn- und Festtagen, Bd.:1, Gruppe I bis VII der Gewerbestatistik, Berlin, 1887, 1887
– Quedlinburgisches Kreis-Tags-Memorial, 1673
– Von der Regierung der Kirche und den unterschiedlichen Würden der Geistlichkeit *(full title in comments), 1779
– Warhaffter und grundlicher Bericht uß was Ursachen Martinus du Voysin (zu Basel verburgerter Krämer) inn der Statt Surseew im Aargöw, ..., den 13. Tag Octobris deß 1608. Jars erstlich enthauptet, und volgends verbrennt worden, 1609

Page 25: IMPACT Final Conference - Jesse de Does



Page 26: IMPACT Final Conference - Jesse de Does



Polish

– Adwersaria, albo terminata sprawy wojennej, która się toczyła w wołoskiej ziemi z tureckim cesarzem, 1621
– Chorągiew Sarmacka w Wołoszech, to jest pospolite ruszenie i szczęśliwy powrót Polaków z Wołoch w roku 1621, 1621
– Diariusz wiadomości od wyjazdu króla z Wilna do Smoleńska, 1610
– Discurs o cenie pieniedzy teraznieyszey y o niektorych skutkach iey…, 1632
– Nowe Ateny, albo Akademia wszelkiey scyencyi pełna, na różne tytuły iak na classes podzielona, mądrym dla memoryału, idiotom dla nauki, politykom dla praktyki, melancholikom dla rozrywki erygowana ... . Część 3 albo Supplement., 1746
– Pasja żołnierzy obojga narodów w stolicy moskiewskiej krótko opisana, 1613
– Powodzenia niebezpiecznego ale szczęśliwego wojska j. k. m. w Multanach opisanie, 1601
– Relacja chwalebnej ekspedycji Jana Kazimierza, króla polskiego i szwedzkiego, 1650
– Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej, 1634
– Wyprawa i wyjazd sułtana Amurata, cesarza tureckiego, na wojnę do Korony Polskiej_BW, 1634
– Żałosne opisanie upadku króla hiszpańskiego na morzu i na lądzie, 1589

Page 27: IMPACT Final Conference - Jesse de Does



Page 28: IMPACT Final Conference - Jesse de Does



Slovene

– Genovefa, 1841
– Gosp. Krištofa Šmida korarja avgustanskiga, zgodBe S. Pisma za mlade ljud..., 1850
– Kmetijske in rokodelske novice, 1844
– Kratkozhasne uganke, 1788
– Kuharske Bukve, 1799
– Marianske Kempensar, ali Dvoje bukuvze, 1769
– Novice kmetijskih, rokodelnih in narodskih reči, 1851
– Sgodbe svetiga pisma za mlade ljudi, 1830
– Ta male katechismus, 1768
– Vezhna pratika od gospodarstva, 1789
– Zerkviza na skali, 1855

Page 29: IMPACT Final Conference - Jesse de Does



Page 30: IMPACT Final Conference - Jesse de Does




Spanish

– Carta athenagorica, 1690
– Commentarios reales, 1609
– El Parnasso español, 1648
– Obras de Garcilasso de la Vega con las anotaciones por el Mtro. Francisco Sánchez Brocense, 1612
– Obras de Lope de Vega, 1604
– Vida de Lazarillo de Tormes, 1652

Page 31: IMPACT Final Conference - Jesse de Does




Page 32: IMPACT Final Conference - Jesse de Does




Page 33: IMPACT Final Conference - Jesse de Does



Evaluation of “IR”

Main question: Are we able to retrieve historical variants of words?

Practical evaluation criterion: measure the accuracy of modern lemma assignment (if we can do this, good retrieval is possible).

More complete evaluation to follow soon – all partners are finishing the work

Page 34: IMPACT Final Conference - Jesse de Does



Evaluation method

Each language partner annotates ~10,000 tokens of Ground Truth with modern lemma and/or equivalent word form.

We measure the performance of:
– Lemmatization with a modern lexicon
– Lemmatization with a modern lexicon and spelling variation patterns
– Lemmatization with a historical lexicon, a modern lexicon and spelling variation patterns

No context information is used.
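The three configurations can be mimicked with a small context-free lemmatizer. This is a minimal sketch, not the IMPACT tooling: the lookup order (historical lexicon first, then modern lexicon, then patterns), the function names and the toy Dutch data are assumptions made for illustration.

```python
def apply_patterns(word, patterns):
    """Yield candidates made by applying one (historical, modern) rewrite at one position."""
    for hist, modern in patterns:
        start = word.find(hist)
        while start != -1:
            yield word[:start] + modern + word[start + len(hist):]
            start = word.find(hist, start + 1)


def lemmatize(word, modern, patterns=(), historical=None):
    """Return (ranked lemma suggestions, kind of match) for one word form; no context is used."""
    word = word.lower()
    if historical and word in historical:
        return historical[word], "historical exact"
    if word in modern:
        return modern[word], "modern exact"
    suggestions = []
    for candidate in apply_patterns(word, patterns):
        for lemma in modern.get(candidate, []):
            if lemma not in suggestions:
                suggestions.append(lemma)
    return suggestions, ("modern with patterns" if suggestions else "no match")


def evaluate(tokens, **config):
    """tokens: list of (word form, gold lemma). Returns (recall, average rank of correct lemma)."""
    ranks = []
    for form, gold in tokens:
        suggestions, _ = lemmatize(form, **config)
        if gold in suggestions:
            ranks.append(suggestions.index(gold) + 1)
    recall = len(ranks) / len(tokens)
    average_rank = sum(ranks) / len(ranks) if ranks else 0.0
    return recall, average_rank


# Toy, made-up data (not IMPACT ground truth).
modern = {"mens": ["mens"], "vis": ["vis"], "zijn": ["zijn"]}
patterns = [("sch", "s"), ("y", "ij")]
historical = {"menschen": ["mens"]}
tokens = [("mensch", "mens"), ("visch", "vis"), ("zyn", "zijn"),
          ("menschen", "mens"), ("vis", "vis")]

print(evaluate(tokens, modern=modern))
print(evaluate(tokens, modern=modern, patterns=patterns))
print(evaluate(tokens, modern=modern, patterns=patterns, historical=historical))
```

On this toy data recall rises from 0.2 (modern lexicon only) to 0.8 (with patterns) to 1.0 (with the historical lexicon), which mirrors the direction of the comparison described above, not its actual magnitudes.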

Page 35: IMPACT Final Conference - Jesse de Does



English

Using the OED IR lexicon and a very restricted set of spelling variation patterns

– Considered tokens: 9409
– Tokens with a correct lemma: 8994 (recall 0.956)
– Total correct suggestions: 8994
– Average rank of correct lemma: 1.086280
– Total possible lemmata: 23859
– No match at all: 265
– Matched with patterns: 1330
– Exact match: 7814
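As a quick arithmetic check of the English figures, the reported recall follows from the token counts, and the three match categories add up to the number of considered tokens. The variable names are illustrative; reading "total possible lemmata" as the total number of suggestions is an assumption of this sketch.

```python
considered = 9409
correct = 8994
exact, with_patterns, no_match = 7814, 1330, 265
total_lemmata = 23859

recall = correct / considered
lemmata_per_token = total_lemmata / considered  # assumed interpretation

# The three categories account for every considered token.
assert exact + with_patterns + no_match == considered

print(f"recall = {recall:.3f}")                        # recall = 0.956
print(f"lemmata per token = {lemmata_per_token:.2f}")  # lemmata per token = 2.54
```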

Page 36: IMPACT Final Conference - Jesse de Does



Spanish

Using the Apertium modern Spanish lexicon, the IMPACT historical Spanish IR lexicon and patterns

9298 tokens considered

With only modern lexicon and patterns:
– 7473 with at least one correct lemma and 1825 without (recall 0.80)
– Average rank of correct lemma: 1.1
– Total suggestions: 9699
– No match at all: 991
– Modern exact: 7471
– Modern with patterns: 836

With historical lexicon, modern lexicon and patterns:
– 8864 with at least one correct lemma and 434 without (recall 0.926)
– Average rank of correct lemma: 1.16
– Total suggestions: 12417
– Historical lexicon exact match: 8265
– Modern exact: 305
– Modern with patterns: 186
– No match at all: 542