Digital representation of rights for language resources

Digital Representation of Rights for Language Resources

Victor Rodriguez-Doncel, Ontology Engineering Group, Universidad Politécnica de Madrid
Penny Labropoulou, Institute for Language and Speech Processing, Athena RC, Athens
4th Workshop on Linked Data in Linguistics: Resources and Applications
Beijing, China, 31st July 2015. Co-located with ACL-IJCNLP 2015

Good afternoon and welcome to this session.

This is a recording presenting the work Digital Representation of Rights for Language Resources, by Victor Rodriguez-Doncel (speaking now) and Penny Labropoulou.

This work is the result of a joint effort of UPM (Madrid, Spain) and the ILSP (Athens, Greece), in close cooperation with the EU-funded LIDER project and the W3C LD4LT community group, and it draws heavily on resources and licenses from the META-SHARE network.

In this presentation, it will be shown how the most commonly used licenses for language resources can be digitally represented, reusing existing vocabularies and extending the Open Digital Rights Language core model.

The most important information in licenses will be represented as RDF.

Practical examples and guidelines for using the Rights Information for Language Resources vocabulary will be given.

Under the umbrella term Language Resources, we find a number of different items such as dictionaries, lexicons, thesauri, semantic networks, written and spoken corpora, POS taggers, syntactic analyzers, ontologies, term banks and phonetic databases.

Some of these heterogeneous resources are actually databases, some are pieces of software, some are pure creative works.

[Figure: word cloud of language resource types (lexicon, spoken corpus, term bank, dictionary, semantic network, phonetic database, thesaurus, application, POS tagger, Porter stemmer, aligned corpus, knowledge source, n-gram model, written corpus, ontology, tokenizer, concordancer, word sense disambiguation service, translation web service, morphology analyzer, syntactic analyzer). Generated with https://www.jasondavies.com/wordcloud/]

Producing these resources sometimes requires considerable effort, which almost every legal system acknowledges. In general, language resources are protected by law.

Most of the resources qualify as intellectual property works and as such are protected by copyright. Some receive full protection as works (like ontologies or computer programs), some a reduced protection as databases. Language resources are not, however, patentable.

In any case, for their safe consumption, rights information must be present. In particular, licenses determine which uses are allowed under which conditions.

XML, RDF

https://creativecommons.org/licenses/?lang=en

In fact, one of the priorities set by the FLARENET Strategic Research Agenda is the availability of LRs within an adequate IPR and legal framework. The recommendations include the elaboration of specific, simple and harmonised licensing solutions for data resources, taking into account licensing schemes already in use, as well as the adoption of electronic licensing and the adaptation of current distribution models to new media.

The most important parts of a license can be represented in a digital form as XML or RDF.

This is not a new idea: it is also supported by the Creative Commons licenses, each of which comes with a legal code, a summary and a purely electronic version.

Advantages of digital (RDF) licensing
Better understanding of the licensing terms by human users
Allows processing of the licensing terms by machines
Enhances the search and discovery of LRs
Allows easier management of the LRs by publishers

There are a number of advantages when digital licensing is used:

1. Improved understanding of the licensing terms by human users. Although licenses are texts in natural language, the legal jargon may not be easily understood by newcomers. A harmonised vocabulary for licensing terms favours universal understanding of their precise meaning.

2. Processing of the licensing terms by machines. Computer programs can take decisions based on the license, like selectively granting access or permitting or denying the combination of differently licensed resources.

3. Enhanced search and discovery of LRs. Querying by licensing terms becomes possible, for example limiting a search to resources where the license allows commercial use, creation of derivative works, etc.

4. Finally, better management, preservation and interoperation of the LRs by publishers, who have a clearer account of which rights have been granted to which resources.
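As a minimal sketch of points 2 and 3, assume a hypothetical in-memory catalog in which every resource carries a set of condition-of-use labels. The resource names and the two helper functions are invented for illustration and are not part of any catalog's API; the labels echo the conditions of use that appear in the Turtle example later in this talk.

```python
# Hypothetical catalog: resource name -> set of condition-of-use labels.
# The resource names are invented for illustration only.
CATALOG = {
    "corpus-a": {"attribution"},
    "lexicon-b": {"attribution", "nonCommercialUse"},
    "termbank-c": {"noDerivatives", "nonCommercialUse"},
}


def allows(conditions, prohibited_label):
    """A use is allowed when the license does not list the matching prohibition."""
    return prohibited_label not in conditions


def search(catalog, commercial=False, derivatives=False):
    """Return the names of resources whose license permits the requested uses."""
    hits = []
    for name, conditions in sorted(catalog.items()):
        if commercial and not allows(conditions, "nonCommercialUse"):
            continue
        if derivatives and not allows(conditions, "noDerivatives"):
            continue
        hits.append(name)
    return hits


print(search(CATALOG, commercial=True))   # resources usable commercially
print(search(CATALOG, derivatives=True))  # resources that may be adapted
```

With free-text licensing metadata, the same query would require interpreting each license text; with a shared vocabulary it reduces to a set membership test.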

License as metadata

The license is a piece of metadata describing a language resource, like the authorship or the creation date.

License as metadata in catalogues

Language resources are sometimes collected in catalogs, and in this case licensing metadata appears within the resource but also as a metadata record in the catalog. This second case is more important. Indeed, this duplication of information may lead to inconsistencies, which would be minor if rights were represented digitally in a uniform or interoperable manner.

Some catalogues
OLAC
META-SHARE
CLARIN
CLARIN Virtual Language Observatory
Datahub.io
LREMap
Linghub

These language resource catalogs are actually databases with the metadata of the resources. All of them handle licensing information, as it is crucial for the safe consumption of resources in industrial settings. Some data catalogs are the OLAC Language Resource Catalog, META-SHARE, CLARIN, CLARIN Virtual Language Observatory, Datahub.io, LREMap and Linghub.

Licensing info is metadata
Licensing info as free text
Licensing info as a choice among several possibilities
Licensing info as a more complex rights expression

We have studied how each of these repositories handles the licensing information, finding three possible scenarios.

Licensing info as free text: catalogs where the rights information is loosely represented as a free-text metadata element. This is mainly the case for portals harvesting from various sources, such as OLAC, the LRE Map and the CLARIN Virtual Language Observatory (VLO).

Licensing info as a choice among several possibilities: META-SHARE and, partly, Datahub and the CLARIN network repositories.

Licensing info as a more complex rights expression: the META-SHARE ontology defined a richer set of combinable elements to build licenses.

The latter is the most complete option. For example, faceted browsing by access rights or license is a feature integrated in most of the catalogs mentioned before, but META-SHARE additionally allows faceted browsing with a filter for conditions of use (e.g. whether the license allows commercial use, derivatives, etc.).

Rights information in the META-SHARE model

Stelios Piperidis. 2012. The META-SHARE language resources sharing infrastructure: Principles, challenges, solutions

We find the META-SHARE repository the most interesting case from a licensing point of view. The META-SHARE network includes 13 resource repositories, with over 1200 resource packages.

The META-SHARE (MS) metadata schema constitutes an essential ingredient of the META-SHARE infrastructure. META-SHARE creates a space within which different LRs may be shared under specific licensing terms.

META-SHARE metadata model

The original abstract META-SHARE metadata model was first implemented as an XML Schema.

A resource in META-SHARE is described with an XML document adhering to that schema. Some elements are obligatory (the minimal version), some recommended and some optional (thus giving a maximal version).

The model contains 5 types of entities, of which the most important is the resource, specialized by the language resource. The language resource contains information on the resource, on the version, other metadata and also the distribution info.

The core of the schema is the resourceInfo component, which includes administrative components relevant to all LRs, like identification, usage info or media type.

Licenses are root entities, but licensing terms are also present as features of the distribution info: its availability, detailed conditions or the IPR holder.

The META-SHARE Metadata model has been ported to OWL/RDF by Marta Villegas, and has also largely influenced the ontology presented in the session before.

Rights Information for Language Resources Ontology

http://purl.org/NET/ms-rights#

The work presented today proposes a model specifically addressing the licensing elements in a more standardized manner.

This rights ontology builds upon the META-SHARE schema for the LanguageResource and Distribution classes, and upon the ODRL model for the License class.

The ontology defines 4 main classes:
Language resource
Distribution
License
Conditions of use

A language resource is represented as an instance of the first class, which is connected with one or more distributions through the distribution object property. Every distribution may have one or more licenses (dual licensing is permitted), and the license can be further specified with conditions of use.

The language resource is characterized by its availability, which may be restricted, unrestricted or under negotiation.

The distribution is characterized by its access medium and can be limited depending on the nature of the user (whether it is a member of a consortium, an academic user, a commercial user, etc.).

The author of the language resource may have licensed the distribution rights to different persons; consequently, there may be one holder of the distribution rights per type of distribution.
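The structure just described can be sketched with plain Python dataclasses. This is only an illustration: the ontology itself is defined in RDF, and the field names, the example values and the reduction of conditions of use to simple labels are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Availability(Enum):
    # The three availability values named in the talk.
    RESTRICTED = "restricted"
    UNRESTRICTED = "unrestricted"
    UNDER_NEGOTIATION = "underNegotiation"


@dataclass
class License:
    name: str
    # Conditions of use are reduced to plain labels in this sketch.
    conditions_of_use: List[str] = field(default_factory=list)


@dataclass
class Distribution:
    access_medium: str                       # e.g. "download" (illustrative value)
    licenses: List[License] = field(default_factory=list)  # several = dual licensing
    rights_holder: Optional[str] = None      # holder of the distribution rights
    user_nature: Optional[str] = None        # e.g. "academic", "commercial"


@dataclass
class LanguageResource:
    name: str
    availability: Availability
    distributions: List[Distribution] = field(default_factory=list)


def licenses_of(resource):
    """Collect the names of every license reachable through the resource's distributions."""
    return [lic.name for dist in resource.distributions for lic in dist.licenses]


# Hypothetical resource with one distribution offered under two licenses.
lexicon = LanguageResource(
    name="example-lexicon",
    availability=Availability.RESTRICTED,
    distributions=[
        Distribution(
            access_medium="download",
            licenses=[
                License("academic-licence", ["attribution", "nonCommercialUse"]),
                License("commercial-licence", ["attribution", "compensation"]),
            ],
            rights_holder="Example Rights Holder",
        )
    ],
)

print(licenses_of(lexicon))
```

The chain resource, distribution, license, conditions of use is the only path from a resource to its licensing terms, which is why dual licensing shows up as a distribution holding two licenses.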

Finally, the license can belong to a license category and can be further specialized by different conditions of use. These conditions of use include:
a) obligations, like attribution, compensation (that is to say, payments) or informing the licensor of further uses;
b) prohibitions, like making commercial use or redistribution;
c) conditions, for example based on the purpose (educational, evaluation, etc.).

This is a surprisingly flat list of license terms, but one that can be mapped directly to the existing META-SHARE records in XML.
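To make the contrast between a flat list and the three kinds of terms concrete, here is a small sketch that sorts a flat list of labels into obligations, prohibitions and conditions. The label spellings and the lookup table are assumptions for illustration and are not the ontology's actual identifiers.

```python
# Grouping of condition-of-use labels into the three kinds named above.
# The label spellings are illustrative, not the ontology's identifiers.
KINDS = {
    "attribution": "obligation",
    "compensation": "obligation",
    "informLicensor": "obligation",
    "nonCommercialUse": "prohibition",
    "noRedistribution": "prohibition",
    "education": "condition",
    "evaluation": "condition",
}


def group_conditions(flat_terms):
    """Split a flat list of terms into obligations, prohibitions and conditions."""
    groups = {"obligation": [], "prohibition": [], "condition": []}
    for term in flat_terms:
        kind = KINDS.get(term)
        if kind is None:
            raise ValueError(f"unknown condition of use: {term}")
        groups[kind].append(term)
    return groups


print(group_conditions(["attribution", "noRedistribution", "evaluation"]))
```

In the flat form the kind of each term is implicit in its name; a structured rights expression language makes that kind explicit in the data.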

Other RELs: ccREL

In contrast, Rights Expression Languages (or RELs) define the license features in a structured fashion. For example, ccREL (the Creative Commons REL) presents this structure, where a license permits, prohibits or requires different actions. These actions include reproduction, distribution or making derivative works.
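As a rough illustration of that structure, the sketch below writes down a Creative Commons Attribution-NonCommercial license with the three ccREL properties. The term names follow the ccREL vocabulary as we understand it; the dictionary layout and the modality_of helper are assumptions made for this example only.

```python
# CC BY-NC expressed with the three ccREL properties (permits, requires, prohibits).
# The dictionary layout and the helper below are illustrative assumptions.
CC_BY_NC = {
    "permits": {"cc:Reproduction", "cc:Distribution", "cc:DerivativeWorks"},
    "requires": {"cc:Notice", "cc:Attribution"},
    "prohibits": {"cc:CommercialUse"},
}


def modality_of(license_terms, action):
    """Return which ccREL property mentions the given term, or None if none does."""
    for modality in ("permits", "requires", "prohibits"):
        if action in license_terms[modality]:
            return modality
    return None


print(modality_of(CC_BY_NC, "cc:DerivativeWorks"))
print(modality_of(CC_BY_NC, "cc:CommercialUse"))
```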

Other RELs: MPEG-21 MCO

The MPEG-21 Media Contract Ontology has similar elements, also including the deontic modalities of permission, obligation and prohibition.

The ODRL Core Ontology

Finally, the ODRL core ontology allows licenses to be represented as collections of rules. Again, every rule can be a permission, a prohibition or an obligation, exercised over an asset and possibly by a specific party. Constraints are generic and include an operator and a right operand.

ODRL 2.1 is a policy and rights expression language suitable for representing the licensing terms of language resources. ODRL specifies both an abstract core model and a common vocabulary, which can be extended for the particular domains ODRL is applied to (like eBooks, mobile devices or the news industry).

ODRL is the most natural choice for expressing licenses and policies in RDF, and its expressions can be used within the Rights Information for Language Resources Ontology.
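The shape of the core model (a policy as a collection of rules, each with an action, a target, an optional party and generic constraints) can be sketched as follows. This is a simplified stand-in and not an ODRL implementation: only two operators are covered, and the "prohibitions win" evaluation rule is an assumption of the sketch, not ODRL's own conflict handling.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Only two comparison operators are sketched here; ODRL defines more.
OPERATORS = {
    "eq": lambda left, right: left == right,
    "neq": lambda left, right: left != right,
}


@dataclass
class Constraint:
    name: str            # what is constrained, e.g. "purpose"
    operator: str        # key into OPERATORS
    right_operand: str   # value the context is compared against

    def holds(self, context):
        return OPERATORS[self.operator](context.get(self.name), self.right_operand)


@dataclass
class Rule:
    kind: str                       # "permission", "prohibition" or "obligation"
    action: str
    target: str                     # the asset the rule applies to
    assignee: Optional[str] = None  # the party, when the rule names one
    constraints: List[Constraint] = field(default_factory=list)


@dataclass
class Policy:
    rules: List[Rule]

    def is_permitted(self, action, target, context):
        """Simplified check: any matching prohibition denies; otherwise a
        permission whose constraints all hold is required."""
        matching = [r for r in self.rules if r.action == action and r.target == target]
        if any(r.kind == "prohibition" for r in matching):
            return False
        return any(
            r.kind == "permission" and all(c.holds(context) for c in r.constraints)
            for r in matching
        )


# Hypothetical policy: reproduction for research only, no commercial use.
policy = Policy(rules=[
    Rule("permission", "reproduce", "langResource",
         constraints=[Constraint("purpose", "eq", "research")]),
    Rule("prohibition", "commercialize", "langResource"),
])

print(policy.is_permitted("reproduce", "langResource", {"purpose": "research"}))
print(policy.is_permitted("commercialize", "langResource", {}))
```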

:example0 a odrl:Set ;
    odrl:permission [
        odrl:target :langResource ;
        odrl:action odrl:reproduce
    ] ;
    odrl:prohibition [
        odrl:target :langResource ;
        odrl:action odrl:derive, odrl:commercialize
    ] .

:langResource :distribution :myDistribution .
:myDistribution :license :myLicense .
:myLicense :conditionsOfUse :noDerivatives, :nonCommercialUse .

SPARQL CONSTRUCT Queries

This creates a duality, however: the same license can be represented both in the structured ODRL form and in the flat META-SHARE form. The slide shows an example of a license in Turtle using the ODRL model and the META-SHARE model.

A set of SPARQL queries can nevertheless easily transform one form into the other, as long as the ODRL structures are simple (not nested).
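The queries themselves are not reproduced in this transcript. As a stand-in for the idea, the following Python sketch maps the ODRL side of the example above onto the flat form; the lookup table covers only the two prohibitions of that example, and the function name and data layout are assumptions of the sketch.

```python
# Illustrative mapping from (rule kind, ODRL action) to a flat condition of use.
# It covers only the two prohibitions that occur in the example above.
FLAT_TERM = {
    ("prohibition", "odrl:derive"): ":noDerivatives",
    ("prohibition", "odrl:commercialize"): ":nonCommercialUse",
}

# The ODRL side of the example, reduced to (rule kind, action) pairs.
EXAMPLE_RULES = [
    ("permission", "odrl:reproduce"),
    ("prohibition", "odrl:derive"),
    ("prohibition", "odrl:commercialize"),
]


def to_flat_triples(license_id, rules):
    """Emit one (subject, predicate, object) triple per rule with a flat equivalent."""
    triples = []
    for kind, action in rules:
        term = FLAT_TERM.get((kind, action))
        if term is not None:
            triples.append((license_id, ":conditionsOfUse", term))
    return triples


for triple in to_flat_triples(":myLicense", EXAMPLE_RULES):
    print(" ".join(triple), ".")
```

A lookup of this kind works because each rule is considered in isolation, which mirrors the restriction to simple, non-nested ODRL structures: a permission carrying its own duties and constraints has no single flat term to map to.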

License templates

An innovative feature of the Rights Information for Language Resources Ontology is the use of license templates.

A license template is a license where some fields are left incomplete. These licensing patterns are public and immutable, and can be referenced again and again, avoiding verbose licenses.

Use of license templates

Thus, two different resources may refer to the same license template with a single triple, while each of them specifies a different price.
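A minimal sketch of the template idea, assuming a hypothetical template whose only open field is the price. The field names, the label and the prices are invented for illustration; in the ontology the template and the reference to it are RDF, not Python objects.

```python
from types import MappingProxyType

# A hypothetical template: "price" is deliberately left unfilled (None).
# MappingProxyType gives a read-only view, mimicking an immutable public template.
COMMERCIAL_TEMPLATE = MappingProxyType({
    "label": "Example commercial licence",
    "conditions_of_use": ("attribution", "compensation"),
    "price": None,
})


def instantiate(template, **values):
    """Return a concrete license: the template plus values for its unfilled fields only."""
    license_ = dict(template)
    for key, value in values.items():
        if key not in template:
            raise KeyError(f"template has no field {key!r}")
        if template[key] is not None:
            raise ValueError(f"field {key!r} is fixed by the template")
        license_[key] = value
    missing = [k for k, v in license_.items() if v is None]
    if missing:
        raise ValueError(f"unfilled fields: {missing}")
    return license_


# Two resources share the template and differ only in price (values invented).
license_a = instantiate(COMMERCIAL_TEMPLATE, price="100 EUR")
license_b = instantiate(COMMERCIAL_TEMPLATE, price="250 EUR")

print(license_a["price"], license_b["price"])
```

Everything fixed by the template stays identical across the two licenses, so only the open field has to be stated per resource.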

The schema of a META-SHARE license is presented in this slide. In this case, we are representing the META-SHARE Commercial NoRedistribution policy.

This is represented by means of a license which has only one permission. This permission allows making reproductions and derivative works, commercial use and exercising database rights. However, there is a duty (attribution) and a prohibition (further distribution). Further, the permission is constrained to the purpose of language engineering research, and the location must be the assignee's site.

01
02   a odrl:Policy ;
03   dct:hasVersion "1.0" ;
04   rdfs:label "META-SHARE Commercial NoRedistribution" ;
05   dct:alternative "MS C-NoReD" ;
06   dct:language ;
07   ms:conditionsOfUse ms:noRedistribution, cc:Attribution,
08     cc:CommercialUse, ms:conditionsOfUse,
09     ms:languageEngineeringResearch ;
10   cc:legalcode .
11   ms:licenseCategory ms:PUB ;
12   odrl:permission [
13     odrl:action cc:Reproduction, cc:DerivativeWorks , odrl:extract,
14       odrl:aggregate, cc:CommercialUse ;
15     odrl:duty [
16       odrl:action cc:Attribution ;
17     ] ;
18     odrl:constraint [
19       odrl:operator odrl:eq ;
20       odrl:purpose ms:languageEngineeringResearch
21     ] , [
22       odrl:operator odrl:eq ;
23       odrl:spatial "only at assignees site"
24     ]
25   ];
26   odrl:prohibition [
27     odrl:action cc:Distribution ;
28   ] .

The same example, this time in Turtle, is shown in this slide. It is unabridged. The main resource in the license is an odrl:Policy (line 02), to which some metadata elements are attributed: version (03), label (04), alternative name (05) and location of the legal code (10). The policy additionally has information regarding the language (06) and a flat list with the conditions (ms:noRedistribution, cc:Attribution, etc. in lines 07-09). The main permission (lines 12-25), which explicitly authorizes making derivative works and commercial use, carries the duty of attribution (15-17) and the constraints of being used only for language engineering purposes (lines 18-21) and at the user's site (lines 21-24). Distribution is forbidden in lines 26-28.

Rights Information for Language Resources

http://purl.org/NET/ms-rights#

Ontology specification, grounded with a formal list of requirements and illustrated with examples

To conclude the presentation: this paper has presented the Rights Information for Language Resources Ontology, specified in the framework of the W3C Linked Data for Language Technology Group. It is expected to enhance the accessibility of language resources, to ease the publication of licenses as linked data, and to enable the automatic processing of licenses by web services and other tools. The URI shown in the slide leads to the ontology specification, grounded with a formal list of requirements and illustrated with examples.

In the future, we expect to improve the model, especially as regards user modelling, as well as to formalize constraints for the data structures. Finally, the use of SPARQL queries to move from the flat META-SHARE form to the ODRL-like policies has to be further documented. The same applies to the CONSTRUCT queries capable of automatically filling in the license templates.

Victor Rodriguez-Doncel
Penny Labropoulou
Thanks

Thanks for your attention; we hope to be there in person next time.