Talk overview

Leuven, 2007-05-22. Computer Aided Document Indexing System for Accessing Legislation: A Joint Venture of Flanders and Croatia. Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, [email protected]; Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, [email protected]; Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven, [email protected]

Page 1: Talk overview

Leuven, 2007-05-22

Computer Aided Document Indexing System for Accessing Legislation

A Joint Venture of Flanders and Croatia

Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb

[email protected]

Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb

[email protected]

Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven

[email protected]

Page 2: Talk overview

Talk overview

document indexing and computer aided document indexing

project AIDE

CADIS workstation: features

project CADIAL

eCADIS workstation: additional features

machine learning techniques

future developments

conclusions

Page 3: Talk overview

Computer Aided Document Indexing

document indexing

attachment of descriptors from a controlled thesaurus to a document

descriptors = labels representing the content of a document

necessary for document retrieval in many document collections

parliamentary documentation

legislation

technical documentation

usually done manually

tedious, error-prone, slow (max. 30-40 documents/day)

could computers be of any help in this process?

yes, if we build a Computer Aided Document Indexing System (CADIS)
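
A minimal sketch of why attached descriptors matter for retrieval, assuming a toy collection: an inverted index maps each descriptor to the documents labelled with it. The document ids and descriptor names are made up for illustration and are not taken from Eurovoc or from the CADIS data.

```python
from collections import defaultdict

# Hypothetical manually indexed documents: id -> attached descriptors.
indexed_docs = {
    "doc-1": {"fisheries policy", "environmental protection"},
    "doc-2": {"customs tariff"},
    "doc-3": {"environmental protection", "waste management"},
}

def build_descriptor_index(docs):
    """Map each descriptor to the set of document ids labelled with it."""
    index = defaultdict(set)
    for doc_id, descriptors in docs.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return dict(index)

def retrieve(index, *descriptors):
    """Return ids of documents that carry all of the given descriptors."""
    hits = [index.get(d, set()) for d in descriptors]
    return sorted(set.intersection(*hits)) if hits else []

descriptor_index = build_descriptor_index(indexed_docs)
print(retrieve(descriptor_index, "environmental protection"))
```

A document that was never indexed is invisible to this kind of lookup, which is why the speed of indexing limits what can be retrieved.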

Page 4: Talk overview

Project AIDE in Croatia

idea for a project

September 2004

interdisciplinary collaboration of 3 institutions

Croatian Information Documentation Referral Agency (HIDRA)

Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb

Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

Page 5: Talk overview

AIDE – collaborating institutions

HIDRA

collects, processes, provides public access to and promotes the official documentation of the Republic of Croatia

coordinator Maja Cvitaš, M.A.

ZEMRIS

research in the field of artificial intelligence, neural networks, machine learning, data and text mining

coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc.

ZZL

computational linguistic research and building language technologies for Croatian

coordinator prof. Marko Tadić

Page 6: Talk overview

AIDE – project objective

Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

Page 7: Talk overview

AIDE – how?

AIDE = Automatic Indexing of Documents with Eurovoc

automatic indexing, how?

a program which “learns to index” documents

conference at the Joint Research Centre of the EC (JRC), Ispra, Italy, 2004-09

at least 10,000 manually indexed documents

3-5 descriptors per document

10-15 documents per descriptor

indexed documents stored in XML format

Steinberger (2003)

compiling a corpus of manually indexed Croatian documents for machine learning of automatic indexing with Eurovoc descriptors

situation with Croatian documentation in 2004-09

only a few hundred documents had been indexed

manual indexing: painfully slow

how could we speed up the manual indexing?
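
The training-data figures on this slide (at least 10,000 documents, 3-5 descriptors per document, 10-15 documents per descriptor) can be checked against a corpus with a few counts. The sketch below assumes the corpus is already loaded as a mapping from document id to descriptors. The function name and the toy data are hypothetical, and parsing the XML files is left out.

```python
from collections import Counter
from statistics import mean

def corpus_statistics(docs):
    """docs: mapping of document id -> list of descriptors attached to it."""
    per_document = [len(set(descriptors)) for descriptors in docs.values()]
    per_descriptor = Counter(
        d for descriptors in docs.values() for d in set(descriptors)
    )
    return {
        "documents": len(docs),
        "mean_descriptors_per_document": mean(per_document),
        "mean_documents_per_descriptor": mean(list(per_descriptor.values())),
    }

# Toy corpus with made-up descriptors, only to show the shape of the data.
toy_corpus = {
    "doc-1": ["a", "b", "c"],
    "doc-2": ["a", "b", "d"],
    "doc-3": ["a", "c", "d"],
    "doc-4": ["b", "c", "d"],
}
stats = corpus_statistics(toy_corpus)
print(stats)
```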

Page 8: Talk overview

AIDE – activities

investigate and develop algorithms in the field of computational linguistics/language technologies

include that knowledge into the Computer Aided Document Indexing System (CADIS)

demonstration of CADIS in the European Parliament (2006-03-10)

Page 9: Talk overview

CADIS: two parallel windows

Document window

Eurovoc browser window

Page 10: Talk overview

Document Window

Page 12: Talk overview

CADIS features

Enhanced user interface

list of descriptors literally appearing in the document
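
A rough sketch of how such a list could be produced, assuming plain case-insensitive matching of whole words. This is an illustration, not the CADIS implementation: for an inflected language such as Croatian, matching surface forms alone would miss inflected variants, which is where lemmas come in. The sample text and descriptors are made up.

```python
import re

def literal_descriptors(text, descriptors):
    """Return the descriptors whose wording occurs verbatim in the text."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    found = []
    for descriptor in descriptors:
        phrase = " ".join(re.findall(r"\w+", descriptor.lower()))
        # Pad with spaces so that only whole-word matches count.
        if phrase and f" {phrase} " in f" {normalized} ":
            found.append(descriptor)
    return found

sample_text = "The act regulates waste management and the protection of inland waters."
sample_descriptors = ["waste management", "water", "customs tariff"]
print(literal_descriptors(sample_text, sample_descriptors))
```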

Page 13: Talk overview

CADIS features

Descriptors and non-descriptors marked in the document

Page 14: Talk overview

CADIS features

Lists of n-grams
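
As an illustration of what an n-gram list contains, here is a small sketch that counts runs of n consecutive word tokens. The tokenization rule (lower-cased alphanumeric runs) is an assumption; the slides do not say how CADIS tokenizes.

```python
import re
from collections import Counter

def word_ngrams(text, n):
    """Count all runs of n consecutive word tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

sample = "public access to official documentation and public access to legislation"
bigrams = word_ngrams(sample, 2)
print(bigrams.most_common(3))
```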

Page 15: Talk overview

CADIS features

Integration of corpus analysis

greyed n-grams are statistically relevant in the corpus, i.e. collocations
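
The slide does not say which statistic decides that an n-gram is relevant. One common choice, used here purely as an assumption, is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur by chance. The tiny token list is made up.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

corpus_tokens = ("official gazette of the republic of croatia "
                 "the official gazette publishes the law of the republic").split()
pmi_scores = bigram_pmi(corpus_tokens)
for pair, score in sorted(pmi_scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs scoring above some threshold would be the candidates to grey out. The threshold and the minimum count are tuning choices, not values from the talk.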

Page 16: Talk overview

CADIS features

Manual marking of significant n-grams

important step towards further refinement of automatic indexing

Page 17: Talk overview

Eurovoc browser window

Page 18: Talk overview

AIDE – activities

investigate and develop algorithms in the field of computational linguistics/language technologies

include that knowledge into the Computer Aided Document Indexing System (CADIS)

demonstration of CADIS in the European Parliament (2006-03-10)

ca. 10,000 Croatian documents indexed at HIDRA using the CADIS workstation during 2006

joint project proposal with Katholieke Universiteit Leuven for the CADIAL project

Page 19: Talk overview

CADIAL project

Computer Aided Document Indexing for Accessing Legislation

a joint Flemish-Croatian project

Department International Flanders, grant no. KRO/009/06

partners:

Katholieke Universiteit Leuven (prof. Marie-Francine Moens)

University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić)

started: 2007-03

duration: 2 years

web: www.cadial.org

the goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

a new version of CADIS (eCADIS) is one of the modules in this project, planned as a web-based service

Page 20: Talk overview

CADIAL project 2

used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian

used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English

included that training data into the next version: eCADIS (-version)

Page 21: Talk overview

eCADIS () features

Automatic suggestion of relevant descriptors, i.e. automatic indexing

application of machine learning techniques
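
To make "automatic suggestion" concrete, here is a deliberately simple sketch: one centroid per descriptor, built from the training documents that carry it, and suggestions ranked by similarity to those centroids (a Rocchio-style prototype without the negative term). This is a stand-in chosen for brevity and not the trained eCADIS models. All names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def vectorize(tokens):
    """Length-normalized term-frequency vector."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()}

def train_centroids(training_docs):
    """training_docs: list of (tokens, descriptors); one centroid per descriptor."""
    sums, sizes = defaultdict(Counter), Counter()
    for tokens, descriptors in training_docs:
        vector = vectorize(tokens)
        for descriptor in descriptors:
            sums[descriptor].update(vector)  # adds the weights term by term
            sizes[descriptor] += 1
    return {d: {t: v / sizes[d] for t, v in total.items()}
            for d, total in sums.items()}

def suggest(tokens, centroids, top_k=3):
    """Rank descriptors by dot product between the document and each centroid."""
    vector = vectorize(tokens)
    scores = {d: sum(w * c.get(t, 0.0) for t, w in vector.items())
              for d, c in centroids.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Made-up training documents with made-up descriptors.
training = [
    ("fishing vessel quota sea".split(), ["fisheries"]),
    ("sea fishing licence".split(), ["fisheries"]),
    ("income tax rate".split(), ["taxation"]),
    ("tax customs duty".split(), ["taxation", "customs"]),
]
centroids = train_centroids(training)
print(suggest("fishing quota for the sea".split(), centroids, top_k=1))
```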

Page 22: Talk overview

eCADIS () features

Compare it to manually attached indexes…
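
The comparison shown on screen can also be expressed as numbers. The sketch below computes precision, recall and F1 of the suggested descriptors against the manually attached ones for a single document. The descriptor sets are invented, and the talk does not state which measures the project reports.

```python
def compare_indexing(suggested, manual):
    """Precision, recall and F1 of suggested descriptors against manual ones."""
    suggested, manual = set(suggested), set(manual)
    agreed = suggested & manual
    precision = len(agreed) / len(suggested) if suggested else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if agreed else 0.0
    return {"agreed": sorted(agreed), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up descriptor sets for one document.
result = compare_indexing(
    suggested=["environmental protection", "waste management",
               "water pollution", "taxation"],
    manual=["environmental protection", "waste management", "recycling"],
)
print(result)
```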

Page 23: Talk overview

eCADIS () features

Manual marking of inappropriate suggestions

another step in further refinement of automatic indexing

Page 24: Talk overview

eCADIS () on document in English

Page 25: Talk overview

eCADIS () on document in English

Automatic suggestion of relevant descriptors, i.e. automatic indexing

Page 26: Talk overview

eCADIS () on document in English

Compare it to manually attached indexes…

Page 27: Talk overview

Training the classifiers

already existing classifiers

profile classifier (Steinberger 2003)

K-nearest neighbours

binary classifiers

SVM, Logistic Regression, Rocchio, Bayes, …

classifiers used for the preliminary training

ca 3500 independent binary classifiers

need to be further evaluated

Logistic Regression used for 10,000 documents in Croatian

SVM used for 20,000 documents in English

features: tokens, lemmas, stems, character n-grams

various feature selection methods and their combinations: χ², IG, MI, …
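
Since every descriptor gets its own binary classifier, feature selection can be run per descriptor. Below is a small sketch of chi-square scoring for one such binary problem: it measures how strongly the presence of a term is associated with the descriptor being attached. The toy documents and the descriptor they stand for are hypothetical, and this is not the project's code.

```python
def chi_square(docs, labels, term):
    """Chi-square statistic between term presence and a binary label.

    docs: list of token sets; labels: list of booleans
    (True if the descriptor is attached to the document).
    """
    tp = sum(1 for d, y in zip(docs, labels) if term in d and y)
    fp = sum(1 for d, y in zip(docs, labels) if term in d and not y)
    fn = sum(1 for d, y in zip(docs, labels) if term not in d and y)
    tn = sum(1 for d, y in zip(docs, labels) if term not in d and not y)
    n = tp + fp + fn + tn
    denominator = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denominator if denominator else 0.0

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary, key=lambda t: chi_square(docs, labels, t),
                    reverse=True)
    return ranked[:k]

# Four made-up documents; the label says whether a hypothetical
# "fisheries" descriptor was attached.
toy_docs = [{"fish", "quota", "sea"}, {"fish", "vessel", "sea"},
            {"tax", "rate", "law"}, {"tax", "customs", "law"}]
toy_labels = [True, True, False, False]
print(select_features(toy_docs, toy_labels, 4))
```

Note that chi-square is symmetric: terms that reliably signal the absence of the descriptor ("tax", "law") score as high as terms that signal its presence.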

Page 28: Talk overview

Further development of eCADIS

training with new features and feature selection methods

collocations, word n-grams, chunks

new measures for evaluation of results

sensitive to thesaurus hierarchy

web-interface for eCADIS for inclusion into the CADIAL system

eCADIS for other languages

now only Croatian and English (-version) covered

usable for other languages as it is, but less efficient without the linguistic module

no list of lemmas, only types

poor statistics for n-grams

cooperation with language technology experts in different languages for development of linguistic modules
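
The slides do not define the evaluation measures "sensitive to thesaurus hierarchy" listed above. One possible reading, sketched here under that assumption, is to give partial credit when a suggested descriptor is a broader or narrower term of a manually attached one. The credit value 0.5, the single-parent structure and the thesaurus fragment are all illustrative choices.

```python
def ancestors(term, parent_of):
    """All broader terms of a descriptor, following parent links upwards.

    Assumes each term has at most one broader term and there are no cycles;
    a real thesaurus may give a term more than one broader term.
    """
    chain = []
    while term in parent_of:
        term = parent_of[term]
        chain.append(term)
    return chain

def hierarchical_credit(suggested, correct, parent_of):
    """1.0 for an exact match, 0.5 if one term is broader than the other, else 0.0."""
    if suggested == correct:
        return 1.0
    if (suggested in ancestors(correct, parent_of)
            or correct in ancestors(suggested, parent_of)):
        return 0.5
    return 0.0

def hierarchical_precision(suggested, manual, parent_of):
    """Average, over suggestions, of the best credit against any manual descriptor."""
    if not suggested or not manual:
        return 0.0
    best = [max(hierarchical_credit(s, m, parent_of) for m in manual)
            for s in suggested]
    return sum(best) / len(best)

# Made-up fragment of a thesaurus: narrower term -> broader term.
toy_parent_of = {"waste management": "environmental policy",
                 "recycling": "waste management"}
score = hierarchical_precision(
    suggested=["waste management", "taxation"],
    manual=["recycling"],
    parent_of=toy_parent_of,
)
print(score)
```

Plain precision would score this example as zero, although one suggestion is only one level away from the manually chosen descriptor.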

Page 29: Talk overview

Further development of eCADIS …

eCADIS for other languages

training the automatic indexing system for other languages

enables automatic suggestions of relevant descriptors in new, unseen documents

analysis of manual markings: descriptors, word n-grams, suggestions

promote the use of eCADIS in other countries beyond the scope of CADIAL project

e.g. Belgium (Flanders)

linguistic module for Dutch and French needed

computational linguistics expertise

training data from Acquis can be used to make an automatic indexing system for Dutch and French

machine learning expertise

Page 30: Talk overview

Conclusion

CADIAL

a joint Flemish-Croatian project sponsored by the Flemish government

better public access to Croatian official documentation

faster and improved document indexing

automatic content metadata generation (Semantic Web)

easier document retrieval and exploration of legislation

multilingual access via the standardized EU thesaurus Eurovoc

a test-case for the usage of such a system in Flanders

Web information on CADIAL project and eCADIS

www.cadial.org

contact:

[email protected]

[email protected]
