Europeana Newspapers German infoday - Processing Digital Newspapers

Digital Newspapers: Processing in Europeana Newspapers. Information Day, SBB Berlin, 27 February 2014. Clemens Neudecker, KB, Twitter: @cneudecker

This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp

Overview

• Goals & challenges

• Newspapers in the project

• Workflow & technologies

• Questions & answers

Goals

• Process 8 million newspaper pages with OCR (UIBK)

• Process 2 million newspaper pages with OLR (CCS)

• Build software for NER in 3 languages (KB)

• Develop tools that automate the workflow

• Produce guidelines and recommendations ("best practices")

Challenges

• Quality vs. throughput

• Complexity of newspaper layouts (columns, advertisements, illustrations)

• Highly variable quality of the digitised images (microfilm, bitonal)

• Differing file formats, languages, alphabets

• Historical spelling variants

• A clearly structured and largely automated workflow

The Newspapers

Europeana Newspapers Dataset (1)

Europeana Newspapers Dataset (2)

Europeana Newspapers Dataset (3)

Europeana Newspapers Dataset (4)

Workflow

OCR @ UIBK

• OCR = Optical Character Recognition

• Technology: ABBYY FineReader SDK

• State-of-the-art OCR software, supports Fraktur/Latin/Cyrillic out of the box

• Export as a METS/ALTO package consisting of images, metadata & full text
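
To illustrate what the full-text part of such a package offers downstream, here is a minimal sketch that reads the recognised text back out of an ALTO file. It relies only on the general ALTO convention that each word is a `String` element with a `CONTENT` attribute inside a `TextLine`. The sample XML and the function names are assumptions made for this example, not output of the project's OCR workflow.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal ALTO fragment for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Berliner"/><SP/><String CONTENT="Tageblatt"/>
    </TextLine>
    <TextLine>
      <String CONTENT="27."/><SP/><String CONTENT="Februar"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> list[str]:
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in elem
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE_ALTO))
```

Matching on local element names rather than a fixed namespace is a design choice for this sketch, since packages from different sources may use different ALTO schema versions.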

Tools (BCT)

• BCT = Binarisation and Colour Reduction Tool

• Goal: convert colour/greyscale scans to 1-bit using a method optimised for OCR (GPP) + JP2k

• Background: reduce the file size of the images to make the data volume manageable (hundreds of TB)
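
The idea of reducing a greyscale scan to 1-bit can be sketched as follows. This is not the GPP method named on the slide. As a stand-in it uses Otsu's global threshold, a much simpler technique, purely to show the principle. The image is represented as plain lists of grey values (0 to 255), and all names are hypothetical.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the grey level (0-255) that maximises the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels this dark yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarise(image: list[list[int]]) -> list[list[int]]:
    """Map a greyscale image to 1-bit: 1 = ink (dark), 0 = paper (light)."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Tiny synthetic "scan": a dark vertical stroke on light paper.
page = [
    [250, 245, 30, 240],
    [248, 25, 20, 242],
    [251, 247, 35, 239],
]
print(binarise(page))
```

A single global threshold tends to struggle with uneven backgrounds such as microfilm scans, which is one reason a method tuned for OCR would be preferred in practice.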

Tools (FRT)

• FRT = File Rename Tool

• Goal: support the libraries with data delivery by renaming files and folders

• Background: prepare the data in the structure required for automated processing
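
A renaming step of this kind might look like the sketch below. The target naming scheme (`<prefix>_<4-digit sequence><extension>`) is an assumption made for illustration; the slides do not describe the structure the project required. The plan is computed first and applied separately, so it can be reviewed before any file is touched.

```python
from pathlib import Path, PurePath


def plan_renames(names: list[str], prefix: str) -> dict[str, str]:
    """Map each file name (sorted alphabetically) to '<prefix>_<NNNN><ext>'.

    The naming scheme is hypothetical; extensions are lower-cased.
    """
    return {
        old: f"{prefix}_{index:04d}{PurePath(old).suffix.lower()}"
        for index, old in enumerate(sorted(names), start=1)
    }


def apply_renames(folder: Path, plan: dict[str, str]) -> None:
    """Rename files inside `folder`, refusing to overwrite existing targets."""
    for old, new in plan.items():
        target = folder / new
        if target.exists() and new != old:
            raise FileExistsError(f"target already exists: {target}")
        (folder / old).rename(target)


plan = plan_renames(["scan b.TIF", "scan a.TIF"], prefix="issue")
for old, new in plan.items():
    print(f"{old} -> {new}")
```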

Tools (FAT)

• FAT = File Analyzer Tool

• Goal: check and validate the data structure before delivery for processing

• Background: assurance for everyone involved that the data is in a suitable form for further processing
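
A pre-delivery check can be sketched as a function that takes a description of the delivery and returns a list of problems. The rules used here (one folder per issue named `YYYY-MM-DD`, only `.tif` or `.jp2` files) are invented for this example and are not the project's actual requirements.

```python
import re
from pathlib import PurePath

# Assumed rules, for illustration only.
ISSUE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # one folder per issue date
ALLOWED_SUFFIXES = {".tif", ".jp2"}             # accepted image formats


def check_delivery(delivery: dict[str, list[str]]) -> list[str]:
    """Return human-readable problems; an empty list means the delivery passes.

    `delivery` maps each folder name to the file names it contains.
    """
    problems = []
    for folder, files in sorted(delivery.items()):
        if not ISSUE_DIR.match(folder):
            problems.append(f"{folder}: folder name is not YYYY-MM-DD")
        if not files:
            problems.append(f"{folder}: no files")
        for name in files:
            if PurePath(name).suffix.lower() not in ALLOWED_SUFFIXES:
                problems.append(f"{folder}/{name}: unexpected file type")
    return problems


report = check_delivery({
    "1914-08-01": ["0001.tif", "0002.tif"],
    "misc": ["notes.txt"],
})
print("\n".join(report) or "delivery OK")
```

Collecting all problems rather than stopping at the first one lets a library fix a whole delivery in one pass.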

OLR @ CCS

• OLR = Optical Layout Recognition

• Technology: docWorks

• Segmentation of the page into columns, articles, headings, "page types" (advertisements)

• Export as a METS/ALTO package consisting of images, metadata & full text

OLR → Article recognition

NER @ KB

• NER = Named Entity Recognition

• Technology: Stanford CRF-NER

• 3 languages: German, Dutch, French

• Open source: https://github.com/KBNLresearch/europeananp-ner

• Recognition of 3 classes: person, location, organisation
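
A sequence tagger such as a CRF assigns one label to each token, so a small post-processing step is needed to obtain whole entities. The sketch below shows one way to do this by merging consecutive tokens that share a label. The label names (`PER`, `LOC`, `ORG`, and `O` for "no entity") and the sample sentence are assumptions for illustration and may differ from what the project's models emit.

```python
def group_entities(
    tagged: list[tuple[str, str]], outside: str = "O"
) -> list[tuple[str, str]]:
    """Merge consecutive tokens with the same label into (entity text, label) pairs."""
    entities = []
    current_tokens: list[str] = []
    current_label = outside
    for token, label in tagged:
        if label != current_label:
            # The label changed: close the entity that was open, if any.
            if current_label != outside:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens = []
            current_label = label
        if label != outside:
            current_tokens.append(token)
    # Close an entity that runs to the end of the input.
    if current_label != outside:
        entities.append((" ".join(current_tokens), current_label))
    return entities


# Hypothetical tagger output for a short German sentence.
tagged_sentence = [
    ("Anna", "PER"), ("Schmidt", "PER"),
    ("reiste", "O"), ("nach", "O"),
    ("Berlin", "LOC"),
]
print(group_entities(tagged_sentence))
```

Note that this simple grouping would merge two adjacent entities of the same class into one, which a scheme with explicit begin/inside labels would avoid.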

Results for NL

Model trained on manually tagged newspaper pages from 1618 to 1900.

100 pages with a total of 183,421 tokens ("words")

Evaluation: k-fold cross-validation, with 1/4 of the training data used for evaluation only

            Persons   Locations   Organisations
Precision   0.940     0.950       0.942
Recall      0.588     0.760       0.559
F-measure   0.689     0.838       0.671
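
For readers unfamiliar with these measures, the sketch below shows how precision, recall and F-measure are computed from counts of correct and incorrect predictions, and how data can be split so that each round holds out one fold for evaluation. Holding out 1/4 of the data would correspond to k = 4. The counts in the example are invented, and nothing here reproduces the project's actual evaluation code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall and F1 from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def k_fold_indices(n_items: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Return k (train, test) index pairs; every item is in a test set exactly once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    return [
        (
            [i for j, fold in enumerate(folds) if j != held_out for i in fold],
            folds[held_out],
        )
        for held_out in range(k)
    ]


# Invented counts: 94 correct entities, 6 spurious ones, 66 missed ones.
p, r, f = precision_recall_f1(tp=94, fp=6, fn=66)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")

# With k = 4, each round evaluates on a quarter of the items.
for train, test in k_fold_indices(n_items=8, k=4):
    print("train:", train, "test:", test)
```

When scores are averaged per fold, the mean F-measure generally differs from the F-measure computed from the mean precision and mean recall.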

NER vs. OCR

[Chart: NER and OCR plotted on a scale from 0.25 to 0.95]

Thank you for your attention!

Any questions?

[email protected]