Post on 17-Jan-2016
Metadata in NIR
Fabio Vitali, University of Bologna
Maria Guercio, University of Urbino
Introduction
Metadata support has always been present in NIR
Recently (June/July 2004), deep (and heated) discussions took place within the WG about identifying a full set of metadata information
This presentation reports the current status of that discussion.
Some terminology
Automatic: any task that can be left entirely to the machine
– All kinds of data format conversion
– E.g. XML -> HTML or NIR XML -> NIR RDF
Semi-automatic: any task that the machine can perform with a certain degree of precision, but that still requires a human for final verification and approval
– Identification of structures
– E.g. partitioning of documents, identification and interpretation of citations
Manual: any task that needs to be decided upon and performed by a thinking human, even though the machine can provide support to help him/her and ease the task itself
Some terminology (2)
Objective
– An objective datum is something whose value admits no reasonable discussion
– E.g. the title of article 15, the publication date
Subjective
– A subjective datum is something that requires an active interpretation from a human that may be wrong, or for which different opinions exist
– E.g. resolution of implicit citations, classification of provisions
Explicit
– A datum that is actually written somewhere in the text
Implicit
– A datum that needs to be deduced from outside the text, or through the application of specific reasoning
Some terminology (3)
Low competence
– The kind of competence one may expect from a non-specialized employee, such as a secretary, armed with just common sense and some topical experience
– E.g.: where does article 1 end and article 2 start
High competence
– The kind of competence one may expect from overspecialized jurists that come to some results after careful and painful reasoning
– E.g.: dates and times in norms
Editorial intervention
– by the publisher of a document
Authorial intervention
– by the author of a document
Design issues for NIR (1)
Data structure rather than application
– Norme In Rete knows about applications, but does not depend on any use of the data and is not targeted at any specific application (except presentation)
– The same text should be marked up in the same way by different editors (at least in the most fundamental structures)
Design issues for NIR (2)
Rigorous distinction of roles
– The author of a norm is the legislator, the provider of the actual XML document is the editor.
– The legislator is GOD (his decisions cannot be discussed), but He only speaks through the text of the norms.
– The editor can add a large quantity of information, but it has no official status
– The very act of adding tags is an editorial operation, subjective and open to discussion.
– In fact, any addition coming from editors (structure identification, notes, comments, interpretation) happens outside of the document content (in markup structures or in special metadata sections)
Design issues for NIR (3)
Complexity of the access to texts
– Many editors, many publishing systems, many copies in different stages of evolution
– There is no authoritative source of XML documents (only of printed documents).
– One web site could forget about updating a law to the latest version
– Using URNs makes it possible to refer to the text of a law without identifying a single existing authoritative source.
Design issues for NIR (4)
Support for description and prescription
– Tagging of existing texts can only be descriptive (supporting any possible mess that the legislator may have put in)
– Support for legal drafting can be provided, suggesting or enforcing legal drafting rules in the writing.
Design issues for NIR (5)
Everything has a reliable name
– Every legal structure needs to be referenced and accessible.
– References need to be unambiguous, universal, definitive.
– URN for whole documents
– id attributes for substructures and spans
– XPointers for even smaller entities
Design issues for NIR (6)
Clean separation between objective properties and interpretation
– Objective properties can be marked by low-level editors, while interpretation requires experts and high-level editors.
– Objective (manifest) properties include identification of boundaries (articles, clauses, etc.) and official facts about texts (publication dates, etc.)
– Interpretation includes identification of troublesome dates (dies coactu, dies valens), identification of the normative content of the text's provisions, application of modifications.
Design issues for NIR (7)
Specific support for multiple interpretations
– “Disposizioni” (law provisions) can be identified and specified on the text.
– Multiple different interpretations of the same text must be allowed
– So they can be placed outside of the main document.
Basic structures (1)
Containers
– Documents, parts, subparts, articles, etc.
– All numbered and titled
Text containers
– Clauses (comma), list elements, etc.
Inline elements
– Presentation oriented (bold, italics, etc.): discouraged, we rely on HTML elements and CSS styles
– Legal oriented (references, modifications, specification of dates, organizations, roles, places, etc.): we rely on specific NIR elements.
Basic structures (2)
Metadata
– Publication information and other data supplied by editors (publication notes, document evolution, etc.)
– Law provisions for the interpretation of the semantics of the content
Support for irregular texts (those that do not comply with standard legal drafting rules) is available through relaxed syntax in some cases (documentoNIR)
The Schemas for NIR documents
3 different DTDs
– Strict rules (prescriptive)
– Loose rules (descriptive)
– Light rules (support for most common cases)
– They are intercompatible
The vocabulary is exactly the same
All light documents are also loose
All strict documents are also loose
The needs for metadata
Metadata are the only place to put information that was not explicitly written by the legislator.
All possible types of additional information beyond those provided in the text need to find a place here.
Uses: archival, analysis, annotations, automatic processing (consolidation), etc.
Official classification of metadata
A starting point is provided by NISO (US National Information Standards Organization) in the guide “Understanding metadata” (2004):
– descriptive metadata to describe a resource “for purposes such as discovery and identification”
– structural metadata to indicate “how compound objects are put together”
– administrative metadata to provide information “to help manage a resource”, articulated (only) as rights management metadata and preservation metadata (“information needed to archive and preserve a resource”)
But…
The distinction between descriptive, structural and administrative metadata has no concrete basis in real practice:
– All the communities involved in the preservation of documents have developed and used information about structure identification as a subset of their descriptive systems. They have never considered structural data an independent component.
– The ambiguity of administrative metadata is even more evident, specifically in digital systems, where the technological components are less and less relevant for long-term preservation and serve for the physical retrieval of a resource in a digital repository, but are considered part of the descriptive system in the case of web resources.
Metadata in the NIR DTD
Any kind of information that is provided by the editor rather than by the author.
In a way even tagging text is metadata
Deriving new versions out of an original and a few modification documents is also adding metadata.
But adding proper metadata means providing additional information to a version of a document that can be used to better search, contextualize and understand a document.
[Figure: an XML document containing the text and a "meta" section, alongside several XML "Changes" documents]
Proper metadata in the NIR DTD
Can be specified
– In an external document (in RDF - still underspecified)
– In an internal section at the beginning of the document (meta) in a NIR vocabulary
– In many internal sections near the parts of the text they refer to, in a NIR vocabulary
Conversion back and forth is always possible and automatic.
Deals with description, structure, administration, as well as:
– Interpretation of content
– Relationships with other documents
– Comments and notes
Seven types of proper metadata
Reflective information
– Things the document knows about itself
Positioning information
– Things the document knows about the norms it expresses and the legal system it belongs to
Lifecycle information
– Special moments in the history of the document and of its norms, and the list of other documents that justify them
Editorial notes
– Things the editor wants to attach to specific parts of the document but cannot, since the DTD does not allow editorial intervention on content
Iter-connected texts
– The history of the document before its approval
Proprietary extensions
Provisions (disposizioni)
Reflection info (descrittori)
Refers to the document, not its content
– Publication date. Re-publications. Errata. Official clarifications.
– URN(s), aliases
– Objective data, easy to find even with low competence
Storing freshness information?
– A document does not usually know whether it is up-to-date. We may deal with stale documents, dead web sites, CD-ROMs
– The best we can do is to provide them with a last-updated date
– The normative system will confirm whether this is the last interesting date, or whether more recent versions of the same document exist
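The freshness check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `is_possibly_stale` and the idea of receiving the list of dates known to the normative system as a plain Python list are hypothetical and not part of NIR.

```python
from datetime import date
from typing import List


def is_possibly_stale(last_updated: date, known_event_dates: List[date]) -> bool:
    """Return True if the normative system knows of an event that is
    later than the last-updated date carried by this copy of the document."""
    return any(d > last_updated for d in known_event_dates)


# Hypothetical data: a copy last updated in mid-1998, checked against
# the dates the normative system knows for the same document.
copy_last_updated = date(1998, 7, 1)
system_dates = [
    date(1996, 1, 1),
    date(1997, 3, 1),
    date(1998, 6, 12),
    date(1999, 9, 24),
]

print(is_possibly_stale(copy_last_updated, system_dates))
```

The document itself only carries `last_updated`; the comparison has to happen wherever the full list of dates is available.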
Positioning info (inquadramento)
Refers to the norms contained in the doc
– Missing parts
– Rank, function, nature and proposers of the law
– Keywords and taxonomies they belong to
Objective data (mostly), but requiring high competence to write down.
Lifecycle (altriatti) - 1
Over time, documents undergo changes (in content, efficacy, power and so on)
These changes happen at specific points in time and depend on specific documents (modification documents).
Usually modification documents specify several changes on the same modified document, and may specify multiple modification dates.
Therefore it makes sense to create a secondary structure where all relevant moments and documents can be matched
Lifecycle (altriatti) - 2
[Figure: timeline of the document. t01 (1/1/1996): original, v01. t02 (1/3/1997): modified, v02. t03 (12/6/1998): suspended. t04 (24/9/1999): resumed, v02. t05 (1/1/2001): repealed.]

ID   URN of law              relation
r01  urn:nir:xxxxxxx12/1995  original
r02  urn:nir:xxxxxxx1/1997   passive
r03  urn:nir:xxxxxxx5/1998   passive
r04  urn:nir:xxxxxxx12/2000  passive

ID   date       idref
t01  1/1/1996   r01
t02  1/3/1997   r02
t03  12/6/1998  r03
t04  24/9/1999  r03
t05  1/1/2001   r04
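The two tables above form a small relational structure: each event date points, through its idref, to the document that caused it. A minimal Python sketch of that matching follows. The class and function names (`Relation`, `Event`, `events_caused_by`, `source_of`) are hypothetical; only the ids, URNs, dates (read as day/month/year) and relation labels come from the slide.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Relation:
    id: str
    urn: str    # URN of the law, kept exactly as elided on the slide
    kind: str   # "original" or "passive"


@dataclass(frozen=True)
class Event:
    id: str
    when: date
    idref: str  # id of the Relation that causes this event


RELATIONS = {r.id: r for r in [
    Relation("r01", "urn:nir:xxxxxxx12/1995", "original"),
    Relation("r02", "urn:nir:xxxxxxx1/1997", "passive"),
    Relation("r03", "urn:nir:xxxxxxx5/1998", "passive"),
    Relation("r04", "urn:nir:xxxxxxx12/2000", "passive"),
]}

EVENTS = [
    Event("t01", date(1996, 1, 1), "r01"),
    Event("t02", date(1997, 3, 1), "r02"),
    Event("t03", date(1998, 6, 12), "r03"),
    Event("t04", date(1999, 9, 24), "r03"),
    Event("t05", date(2001, 1, 1), "r04"),
]


def events_caused_by(relation_id: str) -> List[Event]:
    """All events whose idref points to the given relation."""
    return [e for e in EVENTS if e.idref == relation_id]


def source_of(event_id: str) -> Relation:
    """The relation (modifying document) behind a given event."""
    event = next(e for e in EVENTS if e.id == event_id)
    return RELATIONS[event.idref]


# One modification document (r03) accounts for two dates (t03 and t04).
print([e.id for e in events_caused_by("r03")])
print(source_of("t05").urn)
```

Keeping dates and documents in separate lists is what lets one modification document specify several dates without repeating its URN.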
Lifecycle (altriatti) - 3
The lifecycle section only provides information about the relation to the document that causes the modifications
This information is objective and can be provided with low competence
Information about each actual modification is optional and placed in the provision section.
That information is sometimes subjective and can be provided only with significant competence
Other sections
Editorial notes (redazionale)
– Footnotes, comments, and any text the editor feels like adding. It can point to specific places in the text through <ndr> elements
Iter-connected data (lavoripreparatori)
– An official blurb detailing the iter for the approval of the act, with presentation dates, discussion dates, etc. Plain text.
Proprietary
– An open-ended section where editors can add their own metadata with freedom.
Provisions
Provisions describe the meaning of each meaningful fragment of the text according to a predefined (and hopefully complete) taxonomy (ontology???)
Divided in three main sections plus a residual category:
– Justifications
– Analytical provisions
– Modifications
– Other
Justifications
Some norms (e.g., decrees) introduce before the actual text a foreword providing a number of justifications:
– Considered…
– Consulted…
– Based on a proposal by
– Considering…
– Etc.
Analytical provisions
Describe properties and meaning of fragments of the actual text.
A full taxonomy exists, including concepts like definition, obligation, right, etc.
Carlo will be speaking about them
Modifications
In a modifying law, each modification can be described in detail with a provision.
The provision describes in detail what kind of modification it is, the document it is applied to, where inside it, and when.
Possible modifications are: abrogation, substitution, insertion, renumbering, change of terms, prorogation, repetition, suspension, retro-activity, ultra-activity, etc. (a total of 24 different types).
There is currently no way to express the normal case (dies coactu = dies valens = 15 days after publication for the whole act), but a way will be found soon.
Arguments for provisions
All provisions have some specific arguments, plus some shared arguments
E.g.:
<motivazioni>
<regole>
  <obbligo>
    <pos href="#art12com5"/>
    <destinatario>sindaco</destinatario>
    <controparte>ufficio tributi</controparte>
    <termine da="r01" a="r02"/>
  </obbligo>
  …
</regole>
Important shared arguments are positions and terms
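To make the split between shared and specific arguments concrete, here is a rough Python sketch that reads a provision like the one in the example. It assumes, for illustration only, that `pos` and `termine` are the shared arguments and that every other child element is specific to the provision type; the function name `read_provisions` and the returned dictionary layout are hypothetical.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment modelled on the slide's example
# (only the <regole> part, without the elided content).
SAMPLE = """
<regole>
  <obbligo>
    <pos href="#art12com5"/>
    <destinatario>sindaco</destinatario>
    <controparte>ufficio tributi</controparte>
    <termine da="r01" a="r02"/>
  </obbligo>
</regole>
"""

SHARED = ("pos", "termine")  # assumption: the shared arguments


def read_provisions(xml_text):
    """Return one dictionary per provision found under the root element."""
    root = ET.fromstring(xml_text)
    provisions = []
    for prov in root:
        pos = prov.find("pos")
        termine = prov.find("termine")
        specific = {
            child.tag: (child.text or "").strip()
            for child in prov
            if child.tag not in SHARED
        }
        provisions.append({
            "type": prov.tag,
            "position": pos.get("href") if pos is not None else None,
            "term": dict(termine.attrib) if termine is not None else None,
            "arguments": specific,
        })
    return provisions


for p in read_provisions(SAMPLE):
    print(p["type"], p["position"], p["term"], p["arguments"])
```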
Positions
All provisions point to a position inside the document where the text of the provision is placed.
<articolo id="art1">
  <num>1.</num>
  <comma id="art1-com1">
    <num>1</num>
    <corpo>blah blah</corpo>
…
<obbligo>
  <pos href="#art1-com1"/>
  <destinatario>xxx</destinatario>
  <controparte>y1</controparte>
</obbligo>
The pos element points to the id, or XPointer, or the text content, of the part of the document that contains the provision.
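Resolving a pos that points to an id can be sketched as a simple lookup. The following Python fragment is illustrative only: the wrapping `<doc>` element, the closing tags and the function name `resolve_pos` are assumptions added to get a well-formed sample, the href is chosen to match the id of the comma, and XPointer or text-content targets are not handled.

```python
import xml.etree.ElementTree as ET

DOC = """
<doc>
  <articolo id="art1">
    <num>1.</num>
    <comma id="art1-com1">
      <num>1</num>
      <corpo>blah blah</corpo>
    </comma>
  </articolo>
  <obbligo>
    <pos href="#art1-com1"/>
    <destinatario>xxx</destinatario>
    <controparte>y1</controparte>
  </obbligo>
</doc>
"""


def resolve_pos(root, href):
    """Return the element whose id matches a '#id' href, or None."""
    if not href.startswith("#"):
        return None  # XPointer and other forms are out of scope here
    target_id = href[1:]
    for element in root.iter():
        if element.get("id") == target_id:
            return element
    return None


root = ET.fromstring(DOC)
href = root.find("obbligo/pos").get("href")
target = resolve_pos(root, href)
print(target.tag, target.findtext("corpo"))
```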
Terms
Specify conditions, and specific efficacy (dies coactu) and validity (dies valens) intervals.
No formal language exists yet for specifying conditions
– E.g.: “after the approval of the corresponding regulation”
Dates are specified by referring to the id of the relevant date as placed in the lifecycle section.
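Since terms carry only ids, an application has to look the dates up in the lifecycle section before it can reason about intervals. A minimal Python sketch under stated assumptions: the ids and dates are the ones from the lifecycle example, while the function names `resolve_term` and `in_interval`, the use of date ids such as "t02" as arguments, and the choice of treating the end date as exclusive are all hypothetical.

```python
from datetime import date
from typing import Dict, Optional, Tuple

# Dates as they would be listed in the lifecycle section (day/month/year).
LIFECYCLE_DATES: Dict[str, date] = {
    "t01": date(1996, 1, 1),
    "t02": date(1997, 3, 1),
    "t03": date(1998, 6, 12),
    "t04": date(1999, 9, 24),
    "t05": date(2001, 1, 1),
}

Interval = Tuple[date, Optional[date]]


def resolve_term(da: str, a: Optional[str], dates: Dict[str, date]) -> Interval:
    """Turn the ids of a term into an actual (start, end) interval.
    An end of None means the interval is still open."""
    start = dates[da]
    end = dates[a] if a is not None else None
    if end is not None and end < start:
        raise ValueError("term ends before it starts")
    return start, end


def in_interval(day: date, interval: Interval) -> bool:
    """Assumption: start is inclusive, end is exclusive."""
    start, end = interval
    return start <= day and (end is None or day < end)


interval = resolve_term("t02", "t03", LIFECYCLE_DATES)
print(interval)
print(in_interval(date(1998, 1, 1), interval))
```

Conditions such as “after the approval of the corresponding regulation” are deliberately left out, since no formal language for them exists yet.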
Conclusions
Metadata are still under heavy evolution within the NIR WG.
In the last 4 months, major work has started on a systematic analysis of the desired metadata information for NIR documents.
I haven’t even mentioned namespaces
Some details are still shaky (required elements, repeatable elements, conditions, default values), but the structure should be reasonably stable.
These are not in the published version: it is still way too early.