Eurovoc and parliamentary documents: a semi-automatic classification experience at the Camera dei...
Eurovoc and parliamentary documents: a semi-automatic classification experience at the Camera dei deputati
Calogero Salamone
Luxembourg, 19 November 2010
General
Establishing techniques to allow citizens access to legal information is a matter of primary importance in terms of the fundamentals of public service
Classification of parliamentary and legal resources provides important support for research
History
In 1969/1970, Italy’s Chamber of Deputies and Senate began to consider the classification of laws, in the context of early automation projects of the Parliament
An automatic machine dictionary of the Italian language (“Camera 72”) was planned for use in the information retrieval of legal texts
History
The project was to include a search system based on full-text storage of laws, decrees, treaties, etc. dating back to 1848
An accurate legal-linguistic analysis was to establish a classification system to identify and resolve the problems of homographs, polysemy and shifts of meaning
This project was abandoned
History
In 1992 the thesaurus TESEO (TEsauro Senato per l’Organizzazione dei documenti parlamentari) was adopted for the classification of the bills’ database managed by the Senate
The same thesaurus was adopted for the database of parliamentary oversight (Sindacato ispettivo) managed by the Chamber of Deputies (questions to the government, motions and resolutions)
History
TESEO includes 3650 terms grouped into 45 thematic areas (Top Terms), derived from an old home-made classification system and arranged according to the logical structure of the Universal Decimal Classification (UDC)
There are only 358 language equivalent terms (non-descriptors) used for cross-referencing
From TESEO to EUROVOC
The use of TESEO at the Chamber of Deputies was satisfactory overall
Difficulties were sometimes encountered in some areas due to the vagueness or absence of appropriate descriptors
These problems led to creating a supplementary list with additional descriptors
From TESEO to EUROVOC
In 2005 the Chamber began to consider whether to switch from TESEO to EUROVOC
We considered inter alia the advantages of multilingual classification, including the possibility of connecting different legal and social phenomena under a single system of categorization
From TESEO to EUROVOC
We also considered the larger number of descriptors available and the even larger number of language equivalent terms (non-descriptors) available for the Italian language
There are some areas arranged in an EU perspective that can be difficult to use in a national perspective.
From TESEO to EUROVOC
We hope to gradually extend classification with the Eurovoc thesaurus from policy-setting and oversight documents to the whole information system
That’s why we developed a map to match and link the descriptors of Eurovoc to those of TESEO
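A minimal sketch of how such a descriptor map could be represented and used, assuming a simple one-to-many lookup table. The identifiers below (`eurovoc:term-a`, `teseo:term-1`, and so on) and the function names are hypothetical placeholders, not actual Eurovoc or TESEO descriptors, and this is not the Chamber's implementation.

```python
from collections import defaultdict

# Hypothetical mapping data: these identifiers are illustrative placeholders,
# not actual Eurovoc or TESEO descriptors.
EUROVOC_TO_TESEO = {
    "eurovoc:term-a": ["teseo:term-1"],
    "eurovoc:term-b": ["teseo:term-2", "teseo:term-3"],  # one-to-many match
    "eurovoc:term-c": [],                                # no TESEO equivalent
}


def invert_mapping(mapping):
    """Build the reverse lookup (TESEO -> Eurovoc) from the forward map."""
    reverse = defaultdict(list)
    for source_term, target_terms in mapping.items():
        for target_term in target_terms:
            reverse[target_term].append(source_term)
    return dict(reverse)


def translate(descriptors, mapping):
    """Map a document's descriptors into the other thesaurus.

    Returns (translated, unmapped) so that descriptors without an
    equivalent can be handled by a human indexer.
    """
    translated, unmapped = [], []
    for descriptor in descriptors:
        targets = mapping.get(descriptor, [])
        if not targets:
            unmapped.append(descriptor)
            continue
        for target in targets:
            if target not in translated:  # keep order, drop duplicates
                translated.append(target)
    return translated, unmapped
```

Returning the unmapped descriptors separately reflects the point that some areas of one thesaurus have no clean counterpart in the other.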
Automatic indexing
We know that automatic classification processes do not achieve the same quality as human indexing does
They can be efficient enough to be used for specific purposes, e.g. to automatically index documents that otherwise would not be indexed at all, or to support the process of human indexing
Automatic indexing
The Chamber of Deputies chose to test automatic indexing on policy-setting and oversight documents
These are texts written in everyday language whose length is usually limited
Automatic indexing
The application of automatic indexing to the classification of legislative texts is probably more difficult
Legislative texts use more formalized language, and the documentary units to be indexed (down to the level of individual paragraphs) may be too short for automated tools to work well
Automatic indexing
The Chamber of Deputies' decision to use an automated classification system was finalised in 2005
In an initial phase we tested automatic classification with TESEO descriptors
In a second phase, started in 2006, the program was set up for automatic classification with the Eurovoc thesaurus
Automatic indexing
In 2008, with the beginning of the 16th Parliament, the Eurovoc classification of policy-setting and oversight documents of the Chamber of Deputies and the Senate was launched
Automatic indexing
We selected a semantic technology solution (COGITO by Expert System), which automatically suggests a set of descriptors to be applied to each document
Each document is analyzed and interpreted in order to be archived quickly in the corresponding category
Automatic indexing
The categorizer automatically analyzes each document and suggests a list of descriptors that could be used
This list is checked, modified and validated by a professional operator
Automatic indexing
The current procedure is in fact semi-automatic
Automatic suggestions are amended and supplemented
The operator is responsible for the selection and final results
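The semi-automatic procedure can be sketched as a small data model: the categorizer's suggestions are stored as-is, and the operator's amendments produce the final, validated list. This is only an illustration. The `Classification` class and `review` function are hypothetical names, and the suggestions are passed in as plain strings because the COGITO interface is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    doc_id: str
    suggested: list                              # descriptors proposed automatically
    final: list = field(default_factory=list)    # descriptors after human review
    validated: bool = False


def review(classification, remove=(), add=()):
    """Apply the operator's amendments and mark the result as validated.

    `remove` lists suggested descriptors the operator rejects;
    `add` lists descriptors the operator supplies in addition.
    """
    rejected = set(remove)
    final = [d for d in classification.suggested if d not in rejected]
    for descriptor in add:
        if descriptor not in final:  # avoid duplicate descriptors
            final.append(descriptor)
    classification.final = final
    classification.validated = True
    return classification
```

Keeping `suggested` untouched alongside `final` preserves what the tool proposed, so the two lists can later be compared when tuning the categorizer.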
Automatic indexing
So far, the classification suggested by the Cogito categorizer has been transferred manually to another application in order to record the Eurovoc descriptors in the database used for searching
History
A new integrated application, Camer@voc, is now available: the Cogito categorizer analyses all the texts automatically, and operators then revise, validate and record the Eurovoc descriptors in the same application
History
Camer@voc is a Web application created to manage the automatic classification of policy-setting and oversight documents
The application also allows the management of various stages of classification and its history
History
Camer@voc is developed entirely in an open source environment using a three-tier architecture
The application infrastructure is divided into three modules, dedicated respectively to the user interface (View), the functional logic, also called business logic (Model), and data persistence management (Controller)
Automatic indexing
Main functionalities:
Sampling of new texts needing to be classified
Automatic indexing
Main functionalities:
Display lists of documents automatically classified, divided by classification status
Automatic indexing
Main functionalities:
Viewing and editing the automatic classification of a document; confirmation and subsequent storage of the final classification
Automatic indexing
Future developments include a phase of extensive and deep fine-tuning
The aim is to check whether the system can ultimately reach a level of quality high enough for its output to be considered acceptable, even temporarily, without human intervention
Automatic indexing
If the results are positive, we could consider publishing the automatic classification before it is revised
Users would be warned about this characteristic by a message like “Classification to be reviewed”
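As a rough illustration of the idea above, a display routine could attach the warning whenever a classification has not yet been validated. The function names and the output layout are assumptions made for this sketch; only the message text comes from the slide.

```python
REVIEW_NOTICE = "Classification to be reviewed"  # message text from the slide


def notice_for(validated):
    """Return the warning shown to users, or an empty string once validated."""
    return "" if validated else REVIEW_NOTICE


def render(doc_title, descriptors, validated):
    """Format a document's descriptors, flagging unreviewed classifications."""
    line = f"{doc_title}: {', '.join(descriptors)}"
    notice = notice_for(validated)
    return f"{line} [{notice}]" if notice else line
```

With this layout, an unreviewed document is shown with the bracketed notice appended, and the notice disappears as soon as an operator validates the classification.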
Questions to: