IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT Workshop, 7th May 2010, Bratislava
Aly Conteh, British Library
Overview of the IMPACT Project
Background
Text that is not digital is virtually invisible
Digitised material is becoming available too slowly, in too small quantities and from too few sources
OCR (optical character recognition) technology does not produce satisfactory results for historical documents
There is a lack of institutional knowledge and expertise which causes inefficiency and ‘re-inventing the wheel’
Objectives
Significantly improve mass digitisation of historical printed text by
Innovating OCR software and language technology
Sharing expertise and building capacity across Europe
Ensuring that tools and services will be sustained after the end of the project
The IMPACT Consortium - Original
Libraries
– National Library of the Netherlands (KB)
– The British Library (BL)
– Bibliothèque nationale de France (BNF)
– German National Library (DNB)
– Bavarian State Library (BSB)
– Göttingen State and University Library (UGOE)
– Austrian National Library (ONB)
– University of Innsbruck Library (UIBK)
Universities & Research centres
– Dutch Institute for Lexicology (INL)
– National Centre for Scientific Research – Demokritos (NCSR)
– University of Salford (USAL)
– University of Munich (CIS group)
– University of Innsbruck (InfMath group)
– University of Bath (UKOLN)
Industry partners
– IBM (Haifa Research Lab)
– ABBYY (Moscow)
IMPACT Extension: objectives
To demonstrate the IMPACT tools for efficient lexicon building for language families outside the current IMPACT focus
→ Currently in IMPACT three Germanic languages: English, German, Dutch
→ Add Romance and Slavic languages
To demonstrate and disseminate project results in Southern and Eastern Europe, and support building capacity in digitisation in these countries
To reinforce cooperation and better exploitation of ICT R&D synergies across the enlarged European Union
To build strategic partnerships with the aim of gaining access to knowledge and developing standards and interoperable solutions
Extension in two iterations:
1. Second phase, foreseen in original IMPACT contract
→ 3 languages: French, Spanish, Polish
→ 5 partners (entry 1 February 2010)
2. Proposal in Objective ICT-2009.9.5, call 5 of FP7: Enlarged European Union
→ 3 languages: Slovene, Bulgarian and Czech
→ 6 partners (entry 1 April 2010)
All will be equal partners in the consortium
Full integration expected in June 2010
New partners identified: second phase
22 Analyse et Traitement Informatique de la Langue Française ATILF FR
23 Biblioteca Nacional de España BNE ES
24 Fundación Biblioteca Virtual Miguel de Cervantes BVC ES
25 Poznań Supercomputing and Networking Center PSNC PL
26 University of Warsaw, Department of Formal Linguistics UW DFL PL
New partners identified – IMPACT enlarged EU
16 Institute for Parallel Processing, Bulgarian Academy of Sciences BAS BG
17 “St. Cyril and Methodius” National Library NLB BG
18 Jožef Stefan Institute JSI SI
19 Narodna in univerzitetna knjižnica (National and University Library) NUK SI
20 Institute of the Czech National Corpus, Charles University Prague CUP CZ
21 Národní knihovna České republiky (National Library of the Czech Republic) NKC CZ
Facts and figures
Project supported by the European Community under the FP7 ICT Work Programme
Coordinated by the National Library of the Netherlands (KB)
Project type: Large-scale Integrating Project
EU funding: € 11 500 000
Start date: 1 January 2008
Duration: 48 months
From 2012: sustainable Centre of Competence
Contact: [email protected]
Web site: www.impact-project.eu
Project Structure
OPERATIONAL CONTEXT
Requirements, Benchmarking and Metrics
Best Practices and Guidelines
Technical Framework and Technical Integration
CAPACITY BUILDING
Published resources
Training and support
Demonstration
TEXT RECOGNITION
Pre-processing and segmentation
Adaptive and experimental OCR
Models and dictionaries
ENHANCEMENT & ENRICHMENT
Collaborative correction
Lexicons and gazetteers
Structural metadata
Tools for Text Recognition (OCR)
Technologies for the extraction of text in a digital form from the page
Adaptive OCR engine: the core of IMPACT, a cutting-edge software system tailored specifically to the needs of libraries. It adapts itself to the material during the OCR process and integrates several other tools:
– Image enhancement toolkit
– Segmentation toolkit
– Post-correction modules
– Other OCR engines
Experimental prototypes and tools
– Typewritten OCR prototype
– Wordspotting engine
– Inventory extraction prototype
Tools for Enrichment (language technology)
Make the OCR results more accurate and more accessible
Collaborative correction
– Full web-based collaborative correction system: a web-based platform, suitable for massive volunteer participation, that validates and corrects OCR results. It is the first tool of its kind to be directly linked to an OCR engine
Lexicons and gazetteers
– General and Named Entities lexica for Dutch, German and English, as well as support for lexicon development in other European languages
– Toolboxes providing the means to overcome the historical language barrier
– Collaborative web-based workspace for named entity management
Structural metadata
– Functional Extension Parser: a set of web services that can be exploited to automatically detect and tag structural metadata of scanned material
Strategic tools and services
– Web site provides access to all project outputs and forms the nucleus of a virtual network of all European digitisation centres of competence and associated research activities
– A set of Decision Support Tools that can be used to initiate, organise, manage and cost mass digitisation projects
– A learning resource toolbox will contain operational guidelines, providing guidance on real-world implementation of all tools produced within the project
Training and support
– Help Desk system that brokers end-user requests to project partners and to other digitisation centres of competence
– Training programme dealing with large-scale digitisation issues and technologies, with a range of supporting documentation made available through the project website
Demonstration
Building a sustainable Centre of Competence
First Phase 2008: IMPACT core consortium of 15 partners
– Good mix of public and private partners
– Experience in mass digitisation and research in OCR, language and image processing
Second Phase 2010: extension with 11 additional partners
– Public collection holders and language institutes
– Adding wider set of European languages and experience in mass digitisation
Third Phase 2011: Open to all partners
– Other Centres of Competence
– Digitisation Suppliers
– Research Institutes
– Libraries, Archives and Museums
By 2012 IMPACT exists as a sustainable Centre of Competence
http://www.impact-project.eu
Twitter: impactocr
Blog: impactocr.wordpress.com
Thank you