Eurovoc and parliamentary documents: a semi-automatic classification experience at the Camera dei deputati

Calogero Salamone

Luxembourg, 19 November 2010

General

Establishing techniques that give citizens access to legal information is of primary importance to the fundamentals of public service

Classification of parliamentary and legal resources provides important support for research

History

In 1969/1970, Italy’s Chamber of Deputies and Senate began to consider the classification of laws, in the context of early automation projects of the Parliament

An automatic machine dictionary of the Italian language (“Camera 72”) was planned for use in the information retrieval of legal texts

History

The project was to include a search system based on the stored full text of laws, decrees, treaties etc. dating back to 1848

An accurate legal-linguistic analysis was to establish a classification system that would identify and resolve problems of homographs, polysemy and shifts of meaning

This project was abandoned

History

In 1992 the thesaurus TESEO (TEsauro Senato per l’Organizzazione dei documenti parlamentari) was adopted for the classification of the bills’ database managed by the Senate

The same thesaurus was adopted for the database of parliamentary oversight (Sindacato ispettivo) managed by the Chamber of Deputies (questions to the government, motions and resolutions)

History

TESEO includes 3650 terms grouped into 45 thematic areas (Top Terms), derived from an old home-made classification system and arranged according to the logical structure of the Universal Decimal Classification (UDC)

There are only 358 language equivalent terms (non-descriptors) used for cross-referencing

From TESEO to EUROVOC

The use of TESEO at the Chamber of Deputies was satisfactory overall

Difficulties were sometimes encountered in some areas due to the vagueness or absence of appropriate descriptors

These problems led to the creation of a supplementary list of additional descriptors


In 2005 the Chamber began to consider whether to switch from TESEO to EUROVOC

We considered inter alia the advantages of multilingual classification, including the possibility of connecting different legal and social phenomena under a single system of categorization


We also considered the larger number of descriptors available and the even larger number of language equivalent terms (non-descriptors) available for the Italian language

However, some areas are arranged from an EU perspective and can be difficult to apply in a national context


We hope to gradually extend classification with the Eurovoc thesaurus from policy-setting and oversight documents to the whole information system

That’s why we developed a map to match and link the descriptors of Eurovoc to those of TESEO
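A map of this kind can be thought of as a two-way lookup table between the two thesauri. The following is only a minimal sketch of that idea: the descriptor labels are placeholders rather than real TESEO or Eurovoc entries, and the function names `build_maps` and `translate` are hypothetical, not part of the Chamber's system.

```python
from collections import defaultdict

# Hypothetical (TESEO, Eurovoc) descriptor pairs. The labels are placeholders,
# not real entries from either thesaurus.
PAIRS = [
    ("teseo:term-A", "eurovoc:term-1"),
    ("teseo:term-B", "eurovoc:term-1"),
    ("teseo:term-C", "eurovoc:term-2"),
    ("teseo:term-C", "eurovoc:term-3"),
]


def build_maps(pairs):
    """Build lookup tables in both directions from a list of matched pairs."""
    teseo_to_eurovoc = defaultdict(set)
    eurovoc_to_teseo = defaultdict(set)
    for teseo, eurovoc in pairs:
        teseo_to_eurovoc[teseo].add(eurovoc)
        eurovoc_to_teseo[eurovoc].add(teseo)
    return dict(teseo_to_eurovoc), dict(eurovoc_to_teseo)


def translate(descriptors, mapping):
    """Return (sorted mapped descriptors, descriptors with no match)."""
    mapped, unmatched = set(), []
    for descriptor in descriptors:
        if descriptor in mapping:
            mapped |= mapping[descriptor]
        else:
            unmatched.append(descriptor)
    return sorted(mapped), unmatched
```

Keeping the unmatched descriptors separate makes it easy to see where the map still has gaps, since a descriptor in one thesaurus may correspond to several, or to none, in the other.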

Automatic indexing

We know that automatic classification processes do not achieve the same quality as human indexing does

They can be efficient enough to be used for specific purposes, e.g. to automatically index documents that otherwise would not be indexed at all, or to support the process of human indexing

Automatic indexing

The Chamber of Deputies chose to test automatic indexing on policy-setting and oversight documents

These are texts written in everyday language whose length is usually limited

Automatic indexing

The application of automatic indexing to the classification of legislative texts is probably more difficult

Legislative texts use a more formalized language, and the documentary units to be indexed (down to the level of individual paragraphs) may be too short for automated tools to work well

Automatic indexing

The Chamber of Deputies' decision to use an automated classification system was finalised in 2005

In an initial phase we tested automatic classification with TESEO descriptors

In a second phase, started in 2006, the program was set up for automatic classification with the Eurovoc thesaurus

Automatic indexing

In 2008, with the beginning of the 16th Parliament, the Eurovoc classification of policy-setting and oversight documents of the Chamber of Deputies and the Senate was launched

Automatic indexing

We selected a semantic technology solution (COGITO by Expert System), which automatically suggests a set of descriptors to be applied to each document

Each document is analyzed and interpreted in order to be archived quickly in the corresponding category

Automatic indexing

The categorizer automatically analyzes each document and suggests a list of descriptors that could be used

This list is checked, modified and validated by a professional operator

Automatic indexing

The current procedure is in fact semi-automatic

Automatic suggestions are amended and supplemented

The operator is responsible for the selection and final results
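The semi-automatic step described above can be sketched as a small data structure that keeps the suggested descriptors apart from the operator's changes. This is an illustration under assumptions, not the Cogito or Camer@voc interface: the class `DescriptorReview`, its methods, and the descriptor labels used with it are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DescriptorReview:
    """One operator's review of the descriptors suggested for a document."""

    suggested: list                                # output of the categorizer
    rejected: set = field(default_factory=set)     # suggestions the operator drops
    added: list = field(default_factory=list)      # descriptors the operator adds

    def reject(self, descriptor):
        self.rejected.add(descriptor)

    def add(self, descriptor):
        if descriptor not in self.added:
            self.added.append(descriptor)

    def final_descriptors(self):
        """Suggestions the operator kept, followed by the operator's additions."""
        kept = [d for d in self.suggested if d not in self.rejected]
        extra = [d for d in self.added if d not in kept]
        return kept + extra
```

Because the suggestions are never overwritten, the final classification and the original automatic output can be compared later.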

Automatic indexing

Until now, the classification suggested by the Cogito categorizer has been transferred manually to another application, which records the Eurovoc descriptors in the database used for searching

History

A new integrated application, Camer@voc, is now available. It lets the Cogito categorizer analyse all the texts automatically, and lets operators then revise, validate and record the Eurovoc descriptors

History

Camer@voc is a Web application created to manage the automatic classification of policy-setting and oversight documents

The application also allows the management of various stages of classification and its history
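Managing stages and their history can be pictured as a small state machine that logs every transition. The sketch below is an assumption-laden illustration: the stage names, the allowed transitions, and the names `ClassificationRecord` and `group_by_status` are hypothetical, since the presentation does not list the actual stages used in Camer@voc.

```python
from datetime import datetime, timezone

# Hypothetical stage names and allowed transitions (not taken from Camer@voc).
TRANSITIONS = {
    "to_classify": {"auto_classified"},
    "auto_classified": {"under_review", "validated"},
    "under_review": {"validated"},
    "validated": set(),
}


class ClassificationRecord:
    """Tracks the current stage of one document and how it got there."""

    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.status = "to_classify"
        self.history = []  # (timestamp, old status, new status, actor)

    def move_to(self, new_status, actor):
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot go from {self.status} to {new_status}")
        stamp = datetime.now(timezone.utc)
        self.history.append((stamp, self.status, new_status, actor))
        self.status = new_status


def group_by_status(records):
    """Return document ids grouped by current stage, for list views."""
    groups = {status: [] for status in TRANSITIONS}
    for record in records:
        groups[record.status].append(record.doc_id)
    return groups
```

Grouping by current stage gives the kind of per-status document lists an operator would work through, while the history list preserves who moved a document and when.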

History

Camer@voc is developed entirely in an open source environment, using a three-tier architecture

The application infrastructure is divided into three modules, dedicated respectively to the user interface (View), the functional logic, also called business logic (Model), and the data persistence management (Controller)

Automatic indexing

Main functionalities:

Sampling of new texts needing to be classified

Display lists of documents automatically classified, divided by classification status

Viewing and editing the automatic classification of a document; confirmation and subsequent storage of the final classification

Automatic indexing

Future developments include a phase of extensive and deep fine-tuning

The aim is to check whether the system can ultimately reach a level of accuracy high enough for its output to be considered acceptable, even temporarily, without human intervention
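One possible way to run such a check is to compare the descriptors suggested automatically with those the operator finally validated. The presentation does not say which measure is used, so precision and recall are an assumption here, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(suggested, validated):
    """Compare automatic suggestions with the operator-validated descriptors.

    precision: share of suggested descriptors that the operator kept
    recall:    share of validated descriptors that had been suggested
    """
    suggested, validated = set(suggested), set(validated)
    hits = len(suggested & validated)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(validated) if validated else 0.0
    return precision, recall
```

For example, with four suggestions of which the operator keeps two and adds one more, precision is 2/4 and recall is 2/3. Averaged over many documents, figures like these could indicate whether unrevised output is good enough to publish.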

Automatic indexing

If the results are positive, we could consider publishing the automatic classification before it is revised

Users would be warned about this characteristic by a message like “Classification to be reviewed”

Questions to:

[email protected]
