Refinement of Digitised Newspapers
Europeana Newspapers Workshop:
Refinement
WP2 – Introduction to Refinement
Munich, 26 June 2013
Clemens Neudecker (@cneudecker)
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the Competitiveness and Innovation Framework Programme by the European Community http://ec.europa.eu/ict_psp
Overview
• Objectives & Challenges
• Overview of Refinement Dataset
• Introduction to Refinement: Workflow & Technologies
• Questions & Answers
Objectives
• Analysis of the project partners' available digital newspaper collections and identification of subsets suitable for refinement
• Definition of requirements and minimum quality of digitised newspapers for refinement, to enable advanced services in Europeana
• Coordination of the scalable processing of 10 million digitised newspaper pages with several refinement technologies
• Recommendations on best practices for refining digitised newspaper collections with full text (and ingesting them into Europeana)
Challenges
• Processing quality vs. speed/throughput
• Volume of data requires focus on simple & standardised workflow with clear checkpoints
• Diverse partners supplying content with different digitisation & access policies
• Large variety of content in terms of file formats, fonts, languages, etc.
The data
Europeana Newspapers Dataset (1)
Europeana Newspapers Dataset (2)
Europeana Newspapers Dataset (3)
Europeana Newspapers Dataset (4)
Refinement Workflow steps
Tools (BCT)
• BCT = Binarisation and Colour Reduction Tool
• Purpose: Convert grey/colour scans to bitonal using highly optimised GPP method
• Background: Reduce total file size of master images to guarantee feasibility and timing of data transfers
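To illustrate what converting a grey scan to bitonal involves, here is a minimal sketch in plain Python. The slides do not describe the GPP method, so this example swaps in Otsu's global threshold, a well-known textbook technique; it is not the BCT implementation. The function names and the flat list of 0-255 grey values are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0  # pixel count and grey-value sum of the dark class
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no dark pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(pixels, threshold):
    """Map grey values to 0 (ink) or 1 (paper) using a global threshold."""
    return [0 if p <= threshold else 1 for p in pixels]


# Hypothetical six-pixel "scan": three dark ink pixels, three light paper pixels.
scan = [10, 12, 11, 200, 210, 205]
threshold = otsu_threshold(scan)
bitonal = binarise(scan, threshold)
```

Storing one bit per pixel instead of eight (grey) or twenty-four (colour) is what makes the size reduction mentioned above possible, before any compression is applied.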
Tools (FRT)
• FRT = File Rename Tool
• Purpose: Support content holders in preparing their data in the correct format
• Background: Ensure folder structure and file naming requirements for automated processing are met
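The kind of preparation step described above could look roughly like the following sketch. The slides do not state the actual naming requirements, so the `page_0001.tif` scheme, the `plan_renames` function and its parameters are hypothetical. The sketch only computes a rename plan and does not touch the file system.

```python
import re
from pathlib import PurePath


def plan_renames(filenames, prefix="page", width=4):
    """Map each original name to a zero-padded sequential name, keeping the extension."""

    def natural_key(name):
        # Split digit runs from text so that "scan10" sorts after "scan9".
        return [int(part) if part.isdigit() else part.lower()
                for part in re.split(r"(\d+)", name)]

    plan = {}
    for index, name in enumerate(sorted(filenames, key=natural_key), start=1):
        suffix = PurePath(name).suffix.lower()
        plan[name] = f"{prefix}_{index:0{width}d}{suffix}"
    return plan


# Hypothetical input: scanner output with inconsistent numbering and case.
plan = plan_renames(["scan10.tif", "scan2.tif", "scan1.TIF"])
```

Returning a plan rather than renaming immediately lets a content holder review the mapping before any file is changed.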
Tools (FAT)
• FAT = File Analyzer Tool
• Purpose: Final quality check of data before refinement
• Background: Assure content and refinement partners that all preparation steps have been executed successfully
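A final check of this kind can be sketched as a function that reports problems instead of fixing them. The rules below (a `page_NNNN.tif` naming pattern and gap-free page numbering) are assumptions for illustration; the slides do not list the checks that FAT actually performs.

```python
import re

# Hypothetical naming convention; the real requirements are not given in the slides.
NAME_PATTERN = re.compile(r"^page_(\d{4})\.tif$")


def check_issue_folder(filenames):
    """Return a list of human-readable problems; an empty list means the folder passes."""
    problems = []
    numbers = []
    for name in sorted(filenames):
        match = NAME_PATTERN.match(name)
        if match is None:
            problems.append(f"unexpected file name: {name}")
        else:
            numbers.append(int(match.group(1)))

    if not numbers:
        problems.append("no page images found")
    else:
        # Every page number from 1 to the highest one seen should be present.
        missing = sorted(set(range(1, max(numbers) + 1)) - set(numbers))
        problems.extend(f"missing page {n:04d}" for n in missing)
    return problems


report = check_issue_folder(["page_0001.tif", "page_0003.tif", "notes.txt"])
```

An empty report is the signal that both the content partner and the refinement partner can rely on before data transfer starts.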
Refinement: OCR@UIBK
• OCR = Optical Character Recognition
• Number of pages to be refined: 8 million
• Technologies: ABBYY FineReader SDK
• State-of-the-art OCR software, fully supports Fraktur/Latin/Cyrillic fonts
• Result: METS/ALTO package containing images, metadata & full text
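As a rough illustration of how full text can be read back out of the ALTO part of such a package, the sketch below uses only Python's standard library. In ALTO, recognised words are stored in the `CONTENT` attribute of `String` elements inside `TextLine` elements. The sample document and the function names are made up for this example, and the code matches on local tag names because the ALTO namespace differs between schema versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml):
    """Return one string per ALTO TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


# Made-up, heavily shortened ALTO document (coordinates and metadata omitted).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Europeana"/><SP/><String CONTENT="Newspapers"/></TextLine>
    <TextLine><String CONTENT="Munich"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

lines = alto_text_lines(SAMPLE)
```

A real ALTO file also carries the position of every word on the page, which this sketch ignores; those coordinates are what allow search hits to be highlighted on the page image.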
OCR Full text search
http://www.europeana-newspapers.eu/building-a-content-browser-for-digital-newspapers/
Refinement: OLR@CCS
• OLR = Optical Layout Recognition
• Number of pages to be refined: 2 million
• Technologies: docWorks
• Separation of columns, articles, headlines, page classes
• Result: METS/ALTO package containing images, metadata & full text
OLR Article separation
Refinement: NER@KB
• NER = Named Entity Recognition
• Number of pages to be refined: 2 million
• Technologies: Stanford CRF-NER
• Languages supported: German, Dutch, English (+ French, Latvian)
• Open source: https://github.com/KBNLresearch/europeananp-ner
• Detection of Named entities: Person, Location, Organization
• Feedback cycle with a manual training step leads to better results
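To show what detecting persons, locations and organisations produces in practice, the sketch below groups per-token labels into entity mentions. It does not call Stanford CRF-NER or the europeananp-ner tool; it assumes a tagger has already produced (token, label) pairs, with `O` marking tokens outside any entity, as Stanford NER does. The label names and the sample sentence are assumptions for illustration.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-'O' label into (text, label) mentions."""
    entities = []
    current_words, current_label = [], "O"
    for word, label in tagged_tokens:
        if label != current_label:
            # The label changed: close the running mention, if there is one.
            if current_label != "O":
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], label
        if label != "O":
            current_words.append(word)
    # Close a mention that runs to the end of the input.
    if current_label != "O":
        entities.append((" ".join(current_words), current_label))
    return entities


# Hypothetical tagger output for one sentence.
tagged = [
    ("Clemens", "PERSON"), ("Neudecker", "PERSON"), ("spoke", "O"), ("in", "O"),
    ("Munich", "LOCATION"), ("for", "O"), ("Europeana", "ORGANIZATION"),
]
mentions = group_entities(tagged)
```

Mentions grouped this way are the kind of unit a reviewer can confirm or correct during a manual training step, and the kind of unit a browse-by-name index would be built on.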
NER Browse by names or places
Thank you for your attention!