Project AIDE

Bruxelles, 2006-03-10

Computer Aided Document Indexing System (CADIS) with Eurovoc

Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]

Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]

Project AIDE

idea for a project

September 2004, conference at JRC, Ispra

interdisciplinary collaboration of 3 institutions

Croatian Information Documentation Referral Agency (HIDRA)

Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb

Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

AIDE – collaborating institutions

HIDRA

collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia

coordinator Maja Cvitaš, M.A.

ZEMRIS

research in the field of artificial intelligence, neural networks, machine learning, data and text mining

coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder

ZZL

computational linguistic research and building language technologies for Croatian

coordinator prof. Marko Tadić

AIDE – project objective

Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

AIDE – how?

automatic indexing, how?

program which “learns to index”

Joint Research Center of EC (JRC), Ispra, Italy

at least 10,000 manually indexed documents

3-5 descriptors per document

10-15 documents per descriptor

indexed documents stored in XML format

Steinberger (2003)

compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors

situation with Croatian documentation in 2004: only a few hundred documents had been indexed

manual indexing: painfully slow
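The figures quoted from JRC on this slide (at least 10,000 manually indexed documents, 3-5 descriptors per document, 10-15 documents per descriptor) can be read as rough targets for a training corpus. The following is a minimal sketch, under that reading, of how a collection of indexed documents might be measured against them. The function names, the threshold constants and the input shape (a mapping from document id to descriptor list) are assumptions made for illustration; nothing here is taken from JRC or from the AIDE project.

```python
from collections import Counter

# Hypothetical thresholds read off the slide figures; not an official specification.
MIN_DOCUMENTS = 10_000
MIN_DOCS_PER_DESCRIPTOR = 10


def corpus_stats(indexed_docs):
    """Summarise a corpus given as {document id: list of assigned descriptors}."""
    docs_per_descriptor = Counter()
    for descriptors in indexed_docs.values():
        # Count each descriptor once per document.
        docs_per_descriptor.update(set(descriptors))
    n_docs = len(indexed_docs)
    total_assignments = sum(len(set(d)) for d in indexed_docs.values())
    return {
        "documents": n_docs,
        "avg_descriptors_per_document": total_assignments / n_docs if n_docs else 0.0,
        "docs_per_descriptor": dict(docs_per_descriptor),
    }


def underrepresented(stats, min_docs=MIN_DOCS_PER_DESCRIPTOR):
    """Return descriptors that occur in fewer than min_docs documents, sorted."""
    return sorted(d for d, n in stats["docs_per_descriptor"].items() if n < min_docs)
```

Comparing `stats["documents"]` with `MIN_DOCUMENTS` and inspecting `underrepresented(stats)` would show how far a corpus of a few hundred documents is from those targets.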

AIDE – how?

how could we speed up the manual indexing?

plan:

develop a workstation for computer aided document indexing

conduct research and development of algorithms in the field of computational linguistics/language technologies

build that knowledge into the workstation and turn it into the Computer Aided Document Indexing System (CADIS)

CADIS: two windows

Document window

Eurovoc browser window

Document Window

CADIS features

Enhanced user interface

list of descriptors appearing in document

CADIS features

Descriptors and non-descriptors marked in document
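The slides do not describe how CADIS locates thesaurus terms in the text. As a rough illustration only, the sketch below marks occurrences of descriptors and non-descriptors by case-insensitive exact matching. The mini-thesaurus and the function name `mark_terms` are hypothetical. For an inflected language such as Croatian, exact matching would miss inflected forms, which is presumably where a linguistic module comes in.

```python
import re

# Hypothetical mini-thesaurus: term -> kind. Not real Eurovoc data.
TERMS = {
    "fishing industry": "descriptor",
    "customs duties": "descriptor",
    "fisheries sector": "non-descriptor",
}


def mark_terms(text, terms=TERMS):
    """Return (start, end, matched_text, kind) for every term occurrence, in text order."""
    hits = []
    for term, kind in terms.items():
        # Whole-word, case-insensitive match of the literal term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), m.group(0), kind))
    return sorted(hits)
```

A user interface could use the returned character offsets to highlight each span, with a different style for descriptors and non-descriptors.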

CADIS features

Lists of n-grams
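For readers unfamiliar with the term, an n-gram here is a sequence of n consecutive words. The sketch below shows one simple way such lists could be produced from a document. The tokenisation rule (lowercased runs of word characters) and the function names are assumptions, not a description of what CADIS does.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return all word n-grams of the text, in order, as space-joined strings."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count n-grams for every n from 1 to max_n; returns {n: Counter}."""
    return {n: Counter(ngrams(text, n)) for n in range(1, max_n + 1)}
```

Calling `ngram_counts(document_text)[2].most_common(10)` would give the ten most frequent two-word sequences of a document.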

CADIS features

Integration of corpus analysis

greyed n-grams are statistically relevant in the corpus
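The slides do not say which statistic decides whether an n-gram is relevant in the corpus. Purely as a stand-in, the sketch below scores the n-grams of one document with a smoothed TF-IDF weight, a common way of separating distinctive n-grams from ones that occur everywhere. The function name, the input shapes and the choice of TF-IDF are assumptions, not the CADIS method.

```python
import math
from collections import Counter


def relevance_scores(doc_ngrams, corpus_docs):
    """Score each n-gram of one document by a smoothed TF-IDF weight.

    doc_ngrams: list of n-grams from the current document.
    corpus_docs: list of n-gram lists, one per corpus document.
    """
    n_docs = len(corpus_docs)
    doc_freq = Counter()
    for grams in corpus_docs:
        # Document frequency: count each n-gram once per corpus document.
        doc_freq.update(set(grams))
    term_freq = Counter(doc_ngrams)
    return {
        gram: count * math.log((1 + n_docs) / (1 + doc_freq[gram]))
        for gram, count in term_freq.items()
    }
```

An n-gram that appears in every corpus document scores zero under this weighting, while one that is frequent in the current document but rare elsewhere scores high.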

CADIS features

Manual marking of significant n-grams: an important step towards automatic indexing

Eurovoc browser window

Further development

CADIS for other languages?

already available for Croatian and English

usable for other languages without the linguistic module

cooperation with the respective language technology experts is needed to develop linguistic modules for other languages

partners for EU project proposals for the next step

AIDE

research on machine learning and text-mining

use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc

establishing a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

http://textmining.zemris.fer.hr

Conclusion

CADIS is unique in Europe

Web info at:

HIDRA: www.hidra.hr/hidra/aide/aide.htm

ZEMRIS: textmining.zemris.fer.hr

for download contact: [email protected]
