N. Calzolari 1 Dottorato, Pisa, Maggio 2009 Nicoletta Calzolari Istituto di Linguistica...


Slide 1: Dottorato (PhD course), Pisa, May 2009, N. Calzolari

Nicoletta Calzolari

Istituto di Linguistica Computazionale - CNR - Pisa

[email protected]

Language Resources (lexicons, corpora, ontologies, …)

Standards and language technologies (cont.)

… and Projects

With many others at ILC

Slide 2

SIMPLE Model for a BioLexicon

• Design a representational model for a BioLexicon, a comprehensive lexical resource
  – able to integrate terminological, lexical and ontological info
  – compatible with HLT international standards (i.e. ISO)
  – able to meet the domain-specific requirements

• Implement a BioLexicon database, a container with lexical objects to be filled with data provided by "populators" (EBI, UoM & CNR-ILC)
  – able to be automatically incremented with new terms and linguistic info extracted from texts

from Valeria Quochi

Slide 3

BioLexicon building cycle (diagram)

• Term Repository: gather terms (EBI)
• Bio-Lexicon Population: variants; syntactic info of terms (UoM)
• Bio-events: extraction of bio-events (ILC)
• Terminology to Ontology (Jena/Rennes/EBI)
• Bio-Lexicon: conceptual model and physical DB (ILC)

from Valeria Quochi

Slide 4

The BioLexicon: where from (diagram)

Incremental population process:

• Existing repositories: chemical compounds, species names, disease, enzymes, genes/proteins; subclustering of term variants
• MEDLINE: Named Entity Recognition; term mapping by normalisation; new gene/protein names
• Manual curation: verbs, nouns, adjs, advs (variants, inflected forms, derivative relations, ...)
• Linguistic pre-processing; subcat extraction
• Manual annotation of a bio-event corpus; bio-event extraction; syn-sem mapping

from Simonetta Montemagni

Slide 5

BioLexicon Model: high-level lexical objects, Data Categories

Syntax

Semantics

e.g.
<feat att="POS" val="VVZ"/>
<feat att="ConfScore" val="0.9"/>
<feat att="source" val="UNIPROT"/>
……

from Valeria Quochi

Slide 6

GeneRegOnto – BioLex: Concepts to Predicates

from Valeria Quochi

Slide 7

(Diagram: GeneRegOnto concepts mapped to BioLexicon predicates)

Node labels: Regulation; PositiveProteinRegulation; NegativeProteinRegulation; TranscriptionFactorProtein; regulates; isregulatedby; regulate; regulation; regulator; regulatee; PredRegulate; Arg0Regulate; Arg1Regulate; NF-AT; IL2

Legend: bio event concept; bio entity concept; bio relations (bio-specific qualia relations); bio semantic entry; predicative argument structure; bio semantic roles

Example: NF-AT positively regulates IL2

from Valeria Quochi

Slide 8

(Diagram: lexical objects linked in the BioLexicon)

• SynBehaviour: Lesion1
• SubcatFrame: pp-of
• SynArg: Arg0 pp-of
• Sense: Lesion1
• Predicate: LESION
• SemArg: Arg0 Pat
• Ontological nodes: Activity; Protein

The pattern "lesion of PROTEIN" is not in the lexicon, but it can be computed by accessing info scattered over various lexical objects (i.e. the syntactic unit lesion heads a pp-of corresponding to the patient argument, restricted by the ontological node PROTEIN)

All lexical items labelled as PROTEIN can be candidates to fill this argument slot. Lesion of OmpC, OmpR, etc. are all admitted instances/sentences of this "predicate"/pattern.
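The derivation of "lesion of PROTEIN" from separate lexical objects can be illustrated with a minimal Python sketch. The class names, the dictionaries and the non-protein entry `mouse` are hypothetical stand-ins for the BioLexicon objects on this slide, not the actual database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynArg:
    name: str         # syntactic argument, e.g. "Arg0"
    realization: str  # subcategorisation frame slot, e.g. "pp-of"


@dataclass(frozen=True)
class SemArg:
    name: str         # semantic argument, e.g. "Arg0"
    role: str         # semantic role, e.g. "Pat" (patient)
    restriction: str  # ontological node restricting the slot


# Hypothetical toy lexicon: each piece of information lives in its own object.
SYN_BEHAVIOUR = {"lesion": SynArg("Arg0", "pp-of")}
PREDICATE = {"lesion": SemArg("Arg0", "Pat", "PROTEIN")}
ONTOLOGY_LABELS = {"OmpC": "PROTEIN", "OmpR": "PROTEIN", "mouse": "SPECIES"}


def _preposition(lemma):
    """Read the preposition off the 'pp-<prep>' slot after checking the syn-sem link."""
    syn, sem = SYN_BEHAVIOUR[lemma], PREDICATE[lemma]
    if syn.name != sem.name:
        raise ValueError("syntactic and semantic arguments are not linked")
    return syn.realization.split("-", 1)[1]


def pattern_for(lemma):
    """Compute the abstract pattern, e.g. 'lesion of PROTEIN'."""
    return f"{lemma} {_preposition(lemma)} {PREDICATE[lemma].restriction}"


def instances(lemma):
    """List the admitted fillers: every term labelled with the restricting node."""
    prep = _preposition(lemma)
    node = PREDICATE[lemma].restriction
    return [f"{lemma} {prep} {term}"
            for term, label in ONTOLOGY_LABELS.items() if label == node]


print(pattern_for("lesion"))
print(instances("lesion"))
```

The pattern itself is never stored: it is assembled on demand from the syntactic behaviour, the predicate and the ontology labels, which is the point made on the slide.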

Slide 9

Good mapping of Relations

OBO Relations -> Relations from Extended Qualia Structure

derivesFrom -> derived_from
precededBy -> ?
participatesIn -> ?
hasParticipant -> ?
agentOf -> …
hasAgent -> ?
functionOf -> is_the_activity_of
hasFunction -> …
instanceOf -> …
isA -> is_a
partOf -> is_a_part_of
hasPart -> has_as_part
GrainOf -> …
hasGrain -> …
componentOf -> …
hasComponent -> …
properPartOf -> …
hasProperPart -> …
locatedIn -> …
locationOf -> …
containedIn -> …
contains -> contains
adjacentTo -> ?

Qualia roles shown: Constitutive; Telic; Formal; Agentive

Slide 10

Enhancing Semantic Relations

Source_Sense | Rel Type | Target_Sense
Phosphoglycolate | BelongsToSpecies | Mouse

(Diagram: phosphoglycolate, BelongsToSpecies, mouse)

from Valeria Quochi

Slide 11

How to link Bio-Ontology and Bio-Lexicon

Place(s) of Semantics in BootStrep

• The Bio-Ontology holds domain-specific as well as general semantics (in terms of classes and relations between classes)
• The Lexicon model comes with a semantic layer based on a linguistic ontology (SIMPLE-CLIPS Ontology)

Questions:

• What is the relation between the bio-ontology and the linguistic ontology?
• Do they overlap? What is the overlap/intersection? The difference?
• Is a mapping possible? What could a mapping look like?

Aim: bringing lexical semantics and ontological semantics together

Slide 12

The BioLexicon Model & Standards

The Bio-Lexicon is based on the MILE metamodel and the more recent ISO proposal of a Lexical Markup Framework (LMF)

Data Categories are drawn as far as possible from already existing repositories and standards (i.e. morphosyntactic datacat)

There is the need, however, to define a set of Data Categories specific to the biology domain (i.e. semantic roles and relations)

Slide 13

ISO Meta-model & Data Categories

An ISO standard for NLP lexica

Definition of the Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description

Objectives:

• Design of the abstract lexical meta-model
• Definition of the common set of related Data Categories

The field is mature

from Monica Monachini

Slide 14

ISO - LMF

Specifically designed to accommodate as many models of lexical representation as possible

Its pros:

• Meta-model: a high-level specification (ISO 24613)
• Data Category Registry: low-level specifications (ISO 12620)
• Not a monolithic model, rather a modular framework
• The LMF library provides the hierarchy of lexical objects (with structural relations among them)
• The Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)

Slide 15

ISO LMF – Lexical Markup Framework

Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry, + various extensions:

• Morphology
• MRD
• NLP Syntax
• NLP Semantic
• NLP Multilingual notations
• NLP MWE pattern
• NLP Paradigm class
• Constraint Expression

LMF specs comply with UML modelling principles; an XML DTD allows implementation

Builds also on EAGLES/ISLE

Projects and initiatives shown: LIRICS; ICT KYOTO; NEDO Asian Lang.; NICT Language-Grid Service Ontology

Slide 16

LMF: NLP Extension for Semantics

Slide 17

Lexical Entry

<LexicalEntry rdf:ID="LEprotein">
  <hasSyntacticBehaviour rdf:resource="../../#SB_protein"/>
  <hasLemma>
    <Lemma rdf:ID="L_protein"/>
    <hasRepresentationFrame>
      <RepresentationFrame rdf:ID="RF_protein"/>
    </hasRepresentationFrame>
  </hasLemma>
</LexicalEntry>

(Diagram)

• Lexical Entry: LE_protein
• Lemma: L_protein
• SyntacticBehaviour: SB_protein
• Representation Frame: RF_protein (DC: writtenForm = protein)

Slide 18

Event Representation through SemanticPredicate

• SemanticPredicate: SP_regulate
• SemanticArgument: SP_TF_protein (DC: role=agent)
• SemanticArgument: SP_Target Gene (DC: role=patient)

Slide 19

Sense Representation

<Sense rdf:ID="activate_2">
  <belongsToSynset rdf:resource="#activate"/>
  <hasSemanticRelation rdf:resource="#is_a_1"/>
  <hasSemanticRelation rdf:resource="#has_as_part_1"/>
  <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
  <hasSemanticFeature rdf:resource="#SF_chemistry"/>
  <hasSemanticFeature rdf:resource="#SF_process"/>
</Sense>

(Diagram)

• Sense: activate_2
• Synset: activate
• PredicativeRepresentation
• SemanticFeature: SF_chemistry; SF_process
• Collocation
• SemanticRelation: is_a: [SenseID]; Typical_of: [SenseID] S_protein

Slide 20

Example of Semantic Relation

<SemanticRelation rdf:ID="is_in">
  <hasSourceSense>
    <Sense rdf:ID="S_cox15">
      <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_cox15</id>
    </Sense>
  </hasSourceSense>
  <hasTargetSense>
    <Sense rdf:ID="S_chromosome19">
      <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_chromosome19</id>
    </Sense>
  </hasTargetSense>
  <relationName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">is_in</relationName>
</SemanticRelation>

(Diagram: Sense S_cox15, SemanticRelation is_in, Sense S_chromosome19)
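As an illustration of how such a relation could be consumed, the sketch below reads the SemanticRelation into a (source, relation, target) triple with Python's standard `xml.etree.ElementTree`. The enclosing `rdf:RDF` root and its namespace declaration are assumptions added so that the fragment parses on its own; the slide shows only the inner element, and `read_relations` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# The slide's fragment, wrapped in an assumed rdf:RDF root that declares the prefix.
SNIPPET = f"""<rdf:RDF xmlns:rdf="{RDF_NS}">
<SemanticRelation rdf:ID="is_in">
  <hasSourceSense>
    <Sense rdf:ID="S_cox15">
      <id rdf:datatype="{XSD_STRING}">S_cox15</id>
    </Sense>
  </hasSourceSense>
  <hasTargetSense>
    <Sense rdf:ID="S_chromosome19">
      <id rdf:datatype="{XSD_STRING}">S_chromosome19</id>
    </Sense>
  </hasTargetSense>
  <relationName rdf:datatype="{XSD_STRING}">is_in</relationName>
</SemanticRelation>
</rdf:RDF>"""


def read_relations(xml_text):
    """Return one (source sense, relation name, target sense) triple per relation."""
    root = ET.fromstring(xml_text)
    id_attr = f"{{{RDF_NS}}}ID"  # ElementTree spells rdf:ID as {namespace}ID
    triples = []
    for rel in root.iter("SemanticRelation"):
        source = rel.find("hasSourceSense/Sense").get(id_attr)
        target = rel.find("hasTargetSense/Sense").get(id_attr)
        triples.append((source, rel.findtext("relationName"), target))
    return triples


print(read_relations(SNIPPET))
```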

Slide 21

Example: how to encode WordNet-type info in LMF (object diagram)

• Lexical Entry (partOfSpeech = noun) with Lemma (wordForm = oak tree)
• Lexical Entry (partOfSpeech = noun) with Lemma (wordForm = oak)
• Senses: id = oak_tree0; id = oak0; id = oak2
• Synsets: id = 12100739; id = 12100067
• Semantic Definition (text = a deciduous tree of the genus Quercus), with Statements (text = has acorns and lobed leaves) and (text = great oaks grow from little acorns)
• Semantic Definition (text = the hard durable wood of any oak), with Statement (text = used especially for furniture and flooring)
• Synset Relation (label = substanceHolonym)

Slide 22

XML-based Abstract Lexicon Interchange Format: mapping exercise

Entries from existing lexicons have been mapped to LMF to prove that the model is able to represent many best practices and achieve unification

Major best practices:

• OLIF
• PAROLE/SIMPLE
• LC-Star
• WordNet - EuroWordNet
• FrameNet
• BDef, a formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
• … others on the way …

from Monica Monachini

Slide 23

Lexical WEB & Content Interoperability 'Standards'

As a critical step for semantic mark-up in the SemWeb

(Diagram: ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x and Lex_y connected through LMF, with intelligent agents)

Standards for Interoperability: enough??

Slide 24

Need for tools to make this vision operational & concrete

New prototype "LeXFlow":

• web-based collaborative environment for semi-automatic management/integration of lexical resources
• enabling interoperability of distributed lexical resources accessed by different types of agents
• addressing semi-automatic integration of computational lexicons, with focus on linking and cross-lingual enrichment of distributed LRs
• Case study: cross-fertilization between Italian and Chinese WordNets

From Language Resources to Language Services

Slide 25

Slide 26

Our WN case study

• ItalWordNet (Roventini et al., 2003)
• Academia Sinica Bilingual Ontological WordNet (Sinica BOW, Huang et al., 2004)
• Both connected to Princeton WordNet (although to different versions)
• Same set of semantic relations (EWN ones)

Slide 27

Architecture for cooperative integration of lexicons (diagram)

• Coordination: Agent Role1; Agent Role2; Agent Role3; Agent Role4
• Application: ILIMapper; RelationMapper; MultiWordnet Relation Calculator (web service interface); Simple-Wordnet Relation Calculator (web service interface)
• Data: ItalianSimple; ItalianWordnet; ChineseWordnet

Slide 28

Basic assumptions behind MWN …

Interlingual level:

• Interlingua provides an indirect linkage between different WordNets: the Interlingual Index (ILI), an unstructured version of WordNet used in EuroWordNet
• Each synset in a WN_A is linked to at least one record of the ILI by means of a set of relations (eq_synonym, eq_near_synonym, …)

Synset correspondence:

• If there is an S_A and an S_B that point to the same ILI, they are correspondent

Relation correspondence:

• If there are two synsets in WN_A and a relation between them, the same holds between the corresponding synsets in WN_B
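The two correspondence assumptions can be sketched in a few lines of Python. This is an illustrative sketch only, not the LeXFlow implementation: the identifiers are invented toy data, and both wordnets are assumed to point to the same ILI version, which sidesteps the version mapping needed in the Italian and Chinese case study.

```python
from collections import defaultdict


def correspondences(ili_a, ili_b):
    """Synset correspondence: pair synsets of WN_A and WN_B sharing an ILI record."""
    by_ili = defaultdict(set)
    for synset_b, ili_records in ili_b.items():
        for ili in ili_records:
            by_ili[ili].add(synset_b)
    pairs = defaultdict(set)
    for synset_a, ili_records in ili_a.items():
        for ili in ili_records:
            pairs[synset_a] |= by_ili.get(ili, set())
    return pairs


def project_relations(relations_a, relations_b, pairs):
    """Relation correspondence: carry each WN_A relation over to WN_B.

    Returns (confirmed, proposed): relations already present in WN_B and
    relations that are new candidates for WN_B.
    """
    confirmed, proposed = set(), set()
    for source_a, rel, target_a in relations_a:
        for source_b in pairs.get(source_a, ()):
            for target_b in pairs.get(target_a, ()):
                triple = (source_b, rel, target_b)
                (confirmed if triple in relations_b else proposed).add(triple)
    return confirmed, proposed


# Hypothetical toy data: synset -> set of ILI records it is linked to.
ili_a = {"it:strada": {"ili:road"}, "it:carreggiata": {"ili:roadway"}}
ili_b = {"zh:dao_lu": {"ili:road"}, "zh:che_dao": {"ili:roadway"}}
relations_a = {("it:strada", "has_hyponym", "it:carreggiata")}
relations_b = set()

pairs = correspondences(ili_a, ili_b)
confirmed, proposed = project_relations(relations_a, relations_b, pairs)
print(proposed)
```

Splitting the result into confirmed and proposed relations separates validation of existing links from enrichment with new candidate links, which a curator would still need to check.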

Slide 29

(Diagram: ItalWordNet and Sinica BOW synsets linked through ILI records, WordNet versions 1.5 and 1.6)

ItalWordNet synsets:

• N#1290: passaggio, strada, via
• N#12348: parte, tratto
• N#21225: carreggiata
• N#20944: curvatura, svolta, curva

Sinica BOW synsets:

• N#3245327: che_dao (車道)
• N#03092396: tong_dao (通道)
• N#03243979: dao_lu, dao, lu (道路, 道, 路)
• N#9992072: wan (彎)

ILI records (1.5 / 1.6):

• road, route: ILI1.5-3001757-n / ILI1.6-3243979-n
• bend, crook, turn: ILI1.5-8488101-n / ILI1.6-9992072-n
• passage: ILI1.5-2857000-n / ILI1.6-3092396-n
• stretch: ILI1.5-5691718-n / ILI1.6-???
• roadway: ILI1.5-3002522-n / ILI1.6-3245327-n

Relation labels: Synonym; iperonimia/HYP (hypernymy); iponimia/HPO (hyponymy); meronimy/MPT; 上位(泛稱)詞_為/HYP; 下位(特指)詞_為/HPO; 部件_部份詞_為/MPT

Annotations: "A new proposed mero relation"; "Reinforcement & validity"

(Diagram: verb example with derived relations)

English synsets (two version identifiers each):

• absorb, assimilate, ingest, take_in: 00406975-v / 00338206-v
• receive, have: 01513366-v / 01260836-v
• imbibe: 00407124-v / 00338343-v
• acquire_knowledge: 00403772-v / 00335115-v
• respective, several, various: 00462055-a / 00364361-a

ItalWordNet synsets:

• V#32080: assimilare_5, assorbire_3, accettare_2, recepire_1
• V#39802: prendere_3
• V#32925: studiare_3, imparare_1, apprendere_2
• AG#42011: relativo_4

Chinese side: 吸 00407124-v; 00403772-v; 00001533-v

Relation labels: eq_syn; eq_near_syn; has_hyperonym; has_hyponym; causes; HYP; HPO; CAU; Derived

Slide 31

For a Global WordNet Grid

This architecture for making distributed wordnets interoperable lends itself to different applications in LR processing:

• Enrichment of existing lexical resources
• Creation of new resources
• Validation of existing resources

It can provide a platform for cooperative & collective creation & management of LRs, by providing a web-based environment for the collaboration & interaction of distributed agents and resources

It can be seen as the prototype of a web application supporting the GlobalWordNet Grid initiative, i.e. a shared multi-lingual knowledge base for cross-lingual processing based on distributed resources over the Grid

New project: KYOTO

Slide 32

(Diagram: KYOTO overview)

1. Capture text from distributed, diverse & dynamic data: "Sudden increase of CO2 emissions in 2008 in Europe"
2. Tybot (term yielding robot): CO2 emission
3. Wikyoto: maintain terms & concepts
4. Kybot (knowledge yielding robot), index facts: Process: Emission; Involves: CO2; Property: increase, sudden; When: 2008; Where: Europe
5. Text & Fact Index
6. Semantic Search

Resources: Wordnets; Ontology with Top, Middle and Domain levels (Abstract, Physical, Process, Substance, H2O, CO2; CO2 Emission, H2O Pollution, Greenhouse Gas)

Users: environmental organizations, citizens, governments, companies

from Piek Vossen

Slide 33

(Diagram: KYOTO annotation layers applied to TEXT)

• Linear DAF: Discourse Annotation
• Linear MAF: Morphological Annotation
• Linear SYNAF: Syntactic Annotation
• Linear SEMAF; Semantic Annotation
• Term Extraction (Tybot); Generic TMF; Domain Terms
• Fact Extraction (Kybot); Linear Generic FACTAF
• Wordnet and Domain Wordnet (LMF API)
• ontology and domain ontology (OWL API)

Legend: Language Specific; Language Neutral & Specific; Language Neutral

from Piek Vossen

Slide 34

System components

• Wikyoto = wiki environment for a social group:
  – to model the terms and concepts of a domain and agree on their meaning, within the group, across languages and cultures
  – to define the types of knowledge and facts of interest
• Tybots = term extraction robots, extract term data from a text corpus
• Kybots = knowledge yielding robots, extract facts from a text corpus
• Linguistic processors:
  – tokenizers, segmentizers, taggers, grammars
  – named entity recognition
  – word sense disambiguation
  – generate a layered text annotation in Kyoto Annotation Format (KAF)

from Piek Vossen

Slide 35

KYOTO SYSTEM (diagram)

• Linear SYNAF/SEMAF
• Linear SEMAF; Term extraction (Tybot); Generic TMF
• Semantic annotation; Linear Generic FACTAF; Fact extraction (Kybot)
• Domain editing (Wikyoto)
• Wordnet and Domain Wordnet (LMF API)
• Ontology and Domain ontology (OWL API)
• Concept User; Fact User

from Piek Vossen

Slide 36

Fact mining by Kybots (diagram)

• Source Documents; Linguistic Processors
• Morpho-syntactic analysis: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
• Fact analysis: [the emission]NP Process: e1; [of greenhouse gases]PP Patient: s2; [in agricultural areas]PP Location: a3
• Ontology, Generic: Abstract; Physical; Process; Substance; Chemical Reaction
• Ontology, Domain: H2O; CO2; CO2 emission; water pollution
• Wordnets & Linguistic Expressions; Logical Expressions

from Piek Vossen
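The step from morpho-syntactic analysis to fact analysis can be illustrated with a toy role-assignment rule in Python. This is a hedged sketch, not the actual Kybot mechanism: the `Chunk` class, the `PP_ROLES` table and the rule "first NP is the process, prepositional phrases are mapped by their preposition" are assumptions made for this one example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    label: str  # chunk type from the morpho-syntactic analysis: "NP" or "PP"


# Hypothetical rule table: preposition heading a PP -> semantic role.
PP_ROLES = {"of": "Patient", "in": "Location"}


def analyse(chunks):
    """Assign a role to each chunk and return the resulting fact as a dict."""
    fact = {}
    for chunk in chunks:
        if chunk.label == "NP" and "Process" not in fact:
            fact["Process"] = chunk.text
        elif chunk.label == "PP":
            preposition, _, rest = chunk.text.partition(" ")
            role = PP_ROLES.get(preposition)
            if role is not None:
                fact[role] = rest
    return fact


chunks = [
    Chunk("the emission", "NP"),
    Chunk("of greenhouse gases", "PP"),
    Chunk("in agricultural areas", "PP"),
]
print(analyse(chunks))
```

A realistic system would presumably decide the roles by consulting the wordnets and the ontology (for example, checking that "emission" is a Process) rather than by preposition alone.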

Slide 37

Contribution of KYOTO

• Hundreds of thousands of sources in the environment domain (html, xls, pdf)
• in many different languages
• spread all over the world
• changing every day

• KYOTO learns terms and concepts from text documents
• Stored as structures that people and computers understand (Wordnets: environment terms; Ontology: environment concepts)

• KYOTO delivers a Web 2.0 environment for community-based control
• Connects people across languages and cultures
• Establishes consensus and knowledge transition

• KYOTO enables semantic search and fact extraction (environment facts)
• Software can partially understand language and exploit web 1 data
• Understanding is helped by the terms and concepts defined for each language

Components shown: TYBOT; KYBOT; WIKYOTO

from Piek Vossen

Slide 38

A common representation format: WordNet-LMF (class diagram)

Classes: LexicalResource; GlobalInformation; Lexicon; LexicalEntry; Lemma; Sense; MonolingualExternalRefs; MonolingualExternalRef; Synset; Definition; Statement; SynsetRelations; SynsetRelation; SenseAxes; SenseAxis; InterlingualExternalRefs; InterlingualExternalRef; Meta

Associations carry multiplicities (1..1, 1..*, 0..1, 0..*); Meta is attached with multiplicity 0..1

Data Categories

from Monica Monachini

Slide 39

Centralized WordNet DC Registry

A list of 85 semantic relations as a result of a mapping of the KYOTO WordNet grid

• Inter-WN
• Intra-WN

from Monica Monachini

Slide 40

WordNet-LMF multilingual level - Cross-lingual synset relations

<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
    id      ID    #REQUIRED
    relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
    ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
    externalSystem    CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>

Annotations on the DTD:

• SenseAxis groups monolingual synsets corresponding to each other and sharing the same relations to English
• relType specifies the type of correspondence
• InterlingualExternalRefs: link to ontology/(ies)

Example synsets:

• SWN <fuego_3, llama_1> 09686541-n
• IWN <fuoco_1, fiamma_1> 00001251-n
• WN3.0 <fire_1 flame_1 flaming_1> 13480848-n

from Monica Monachini
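To make the DTD concrete, the following Python sketch builds one SenseAxis that groups the three synsets shown on the slide, using only element and attribute names from the DTD. The axis id `sa_fire`, the relType value `eq_synonym` and the prefixed Target IDs are hypothetical; the slide does not say how these values are spelled in real WordNet-LMF files.

```python
import xml.etree.ElementTree as ET


def build_sense_axis(axis_id, rel_type, target_ids):
    """Build a SenseAxis element with one empty Target child per synset ID."""
    if not target_ids:
        # The content model (Meta?, Target+, ...) requires at least one Target.
        raise ValueError("a SenseAxis needs at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    return axis


# Hypothetical identifiers combining a wordnet prefix with the slide's synset offsets.
targets = ["swn-09686541-n", "iwn-00001251-n", "wn30-13480848-n"]

axes = ET.Element("SenseAxes")
axes.append(build_sense_axis("sa_fire", "eq_synonym", targets))
xml_text = ET.tostring(axes, encoding="unicode")
print(xml_text)
```

The optional Meta and InterlingualExternalRefs children are omitted here; note that `ElementTree` does not validate against a DTD, so the only constraint checked is the one enforced by hand in `build_sense_axis`.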

Slide 41

Ultimate goal

Global standardization and anchoring of meaning such that:

• Machines can start to approach text understanding -> the semantic web connects to the current web
• Communities can dynamically maintain knowledge, concepts and their terms in an easy-to-use system
• Cross-linguistic and cross-cultural sharing and communication of knowledge is enabled

Comparable to a formalization of Wikipedia for humans AND machines across languages

from Piek Vossen

Slide 42

Some steps for a "new generation" of LRs

From huge efforts in building static, large-scale, general-purpose LRs
To non-static LRs rapidly built on demand, tailored to specific user needs

From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them

From Language Resources
To Language Services

Slide 43

Distributed Language Services

A long-term scenario implying content interoperability standards, supra-national cooperation and development of architectures enabling accessibility

• Create new resources on the basis of existing ones
• Exchange and integrate information across repositories
• Compose new services on demand

Collaborative & collective/social development and validation, cross-resource integration and exchange of information

Language Grid

Wiki

Slide 44

In the "Semantic Web" vision ...

… need to tackle the twofold challenge of content availability & multilinguality

Natural convergence with HLT:

• multilingual semantic processing
• ontologies
• semantic-syntactic computational lexicons

Slide 45

(Diagram: Semantic Web and LT & LRs)

• Content Interoperable LRs & LT
• Language Tech … & … Knowledge, Content
• Knowledge Markup

Ready???

How to cooperate??

Slide 46

LR and the future of LT or Content Tech

The need for ever-growing and richer LRs for effective multilingual content processing requires a change in the paradigm, & the design of a new generation of LRs, based on open content interoperability standards

The Semantic Web notion may be used to shape the LRs of the future, in the vision of an open space of sharable knowledge available on the Web for processing

The effort of making available millions of "richly annotated words" for dozens of languages is not affordable by any single group

This objective can only be achieved by creating integrated Open and Distributed Linguistic Infrastructures

Not only linguistic experts can participate in these: they may also include designers, developers, users of content encoding practices, etc., in wiki mode

Is the LR/LT field mature enough to broaden and open itself to the concept of cooperative effort of different sets of communities?

Could a sort of "Language Genome" large initiative be effective? Storing lots of (annotated) facts

Slide 47

Today, many vitality & success signs … for LRs

• LREC (> 900 submissions); many LRs at COLING and even at ACL!!
• ELRA (self-sustaining) & LDC
• LRE (new Journal: N. Ide & NC)
• ISO-TC37-SC4/WG4 (International Standards for LRs)
• AFNLP
• …
• FLaReNet
• ESFRI - CLARIN (also political & strategic role)
• New calls or initiatives in EU, US, ASIA, on LRs, interoperability, cooperation, …

In Spoken, Written, Multimodal areas … in new emerging areas

• Statistical approaches …
• Different dimensions & layers: Content (Ontologies), Emotion, Time, …
• For Evaluation
• For Training
• …

Slide 48

BUT … an important point

In the '90s there was a global vision of the field & its main components:

• Standards
• Creation of LRs
• Distribution
• Then: Automatic acquisition

… towards the Infrastructure of LRs & LT

ELRA, LDC

While today:

• There is an ever increasing set of initiatives for new LRs, basic robust technologies, models??, algorithms, …
• We have a LR community culture
• BUT sort of scattered, opportunistic, not much coherence

49Dottorato, Pisa, Maggio 2009N. Calzolari

Today …Today …The wealth of data & of basic technologies is such that:The wealth of data & of basic technologies is such that:

We should reflect again at the field as a whole & ask We should reflect again at the field as a whole & ask if if

Standards

Creation of LRs

Automatic acquisition

Distribution

are still “the” important components, or how they have changed/must change

… Which new challenges towards a new & more mature infrastructure of LRs & LTs??

Dynamic LRs

Sharing

Collaborative creation & Manag.

Content interoperability

50Dottorato, Pisa, Maggio 2009N. Calzolari

These dimensions

could be at the basis of a new Paradigm for LRs & LT

& of a new Infrastructure ??

Dynamic LRs

Sharing

Collaborative creation & Manag.

Content interoperability

+ Distributed architectures/infrastr.

Need more

Technology exists

51Dottorato, Pisa, Maggio 2009N. Calzolari

Cultural issues

Language … and cultural identity

Language … and the Humanities

Many dimensions around the notion of language

Economic, social issues

Applications, Services

Technical issues

Interdisciplinarity & Multidisciplinarity

Political issues

e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK

Multilingualism

Need of bodies for a broad research agenda & strategic actions for LT&LRs (W/S/MM) based on all the dimensions

We need to put together technical, organisational, strategic, economic, political issues of LRs

Finally: two new European Infrastructural & Networking Initiatives

52Dottorato, Pisa, Maggio 2009N. Calzolari

Which Communities?

Language Resources

Language Technologies

Standardisation

Grid, Semantic Web, Ontologists, ICT, …

Humanities, Social Sciences, Digital Libraries, Cultural Heritage, …

Many application domains (eculture, egovernment, ehealth, …)

[Diagram labels: core; Multilinguality; Enabling infrastr.; FLaReNet Network; CLARIN ResInfra]

Focus on cooperation

Technologies exist, but the infrastructure that puts them together and sustains them is still missing

53Dottorato, Pisa, Maggio 2009N. Calzolari

CLARIN

Large-scale pan-European collaborative effort (31+ countries)

Make LRs & LTs available & readily usable to scholars of humanities & social sciences (& all disciplines)

Need to overcome the present fragmented situation by harmonising structural and terminological differences

Basis is a Grid-type infrastructure and Semantic Web technology

The benefits of computer enhanced language processing become available only when a critical mass of coordinated effort is invested in building an enabling infrastructure, which can provide services in the form of provision of tools & resources as well as training & counseling across a wide span of domains

The infrastructure will be based on a number of resource, service and expertise centres

ESFRI Research Infrastructures

Common Language Resources and Technologies Infrastructure for the Humanities & Social Sciences

54Dottorato, Pisa, Maggio 2009N. Calzolari

Create a comprehensive and free to use distributed archive of LRs & LTs covering not only the languages of all member states, but also other languages studied and used in Europe

By making tools & resources interoperable across languages & domains, contribute to preserving and supporting the multilingual & multicultural European heritage

An operational open infrastructure of web services will introduce a new paradigm of distributed collaborative development

Allow many contributors to add all kinds of new services based on existing ones, thus ensuring reusability and allowing scaling up to suit individual needs

CLARIN Mission

55Dottorato, Pisa, Maggio 2009N. Calzolari

How can we tackle these challenges?

J. Taylor: “eScience is about global collaboration in key areas of science and the next generation of infrastructures that will enable it”

Need to build new types of platforms

to allow researchers to combine existing resources easily into new ones to tackle the big challenges

to increase the productivity of all interested researchers, since currently too much time is wasted on preparatory work

from P. Wittenburg

56Dottorato, Pisa, Maggio 2009N. Calzolari

eScience Vision

CLARIN establishes such a new generation of extended infrastructure

Thus CLARIN is not about creating and building new language resources and technology, but making them available and accessible as services in a stable and persistent infrastructure to allow tackling the great challenges

CLARIN: http://www.clarin.eu

Grid Project: http://www.mpi.nl/dam-lr

ISO TC37/SC4: http://www.tc37sc4.org

Standards Project: http://lirics.loria.fr/

from P. Wittenburg

57Dottorato, Pisa, Maggio 2009N. Calzolari

We still have a long way to go …

in an eContentplus Call for a “Thematic Network on Language Resources”:

FLaReNet

To provide common recommendations (to the EC) for future actions

To give priorities

Need of ‘visions’

& also a “new project”

In a global context, in cooperation with CLARIN & also with non-EU members

58Dottorato, Pisa, Maggio 2009N. Calzolari

CLARIN ResInf

Which Communities?

Language Resources

Language Technologies

Standardisation, Ontologists, Content

EC, Funding agencies, …

Humanities, Social Sciences, Digital Libraries, Cultural Heritage, …

Many application domains (eculture, egovernment, ehealth, intelligence, domotics, content industry, …)

[Diagram labels: core; Multilinguality; EU Forum; FLaReNet Network]

Focus on cooperation

LRs & LTs exist, but a global vision, policy and strategy is still missing

59Dottorato, Pisa, Maggio 2009N. Calzolari

eContentplus

A new European Network for Language Resources –

Nicoletta Calzolari (coord.) [email protected]

Fostering Language Resources Network

http://www.flarenet.eu

N. Calzolari Dottorato, Pisa, Maggio 2009 60

A European forum to facilitate interaction among LR stakeholders

The Network structure considers that LRs present various dimensions and must be approached from many perspectives:

technical, but also organisational, economic, legal, political

Addresses also multicultural and multilingual aspects,

essential when facing access and use of digital content in today’s Europe

FLaReNet Fostering Language Resources Network

http://www.flarenet.eu/

N. Calzolari Dottorato, Pisa, Maggio 2009 61

A layered structure, with leading experts & groups (national and European institutions, SMEs, large companies) for all relevant LR areas (about 40 partners)

in collaboration with CLARIN to ensure coherence of LR-related efforts in Europe

FLaReNet will consolidate existing knowledge, presenting it analytically and visibly

contribute to structuring the area of LRs of the future by discussing new strategies to:

convert existing and experimental technologies related to LRs into useful economic and societal benefits

integrate so far partial solutions into broader infrastructures

consolidate areas mature enough for recommendation of best practices

anticipate the needs of new types of LRs

Organised in Thematic Working Groups

N. Calzolari Dottorato, Pisa, Maggio 2009 62

The Chart for the area of LRs in its different dimensions

Methods and models for LR building, reuse, interlinking and maintenance

Harmonisation of formats and standards

Definition of evaluation protocols and evaluation procedures

Methods for the automatic construction and processing of LRs

Thematic Areas

To build together:

Evolving RoadMap

Blueprint of actions and infrastructures

N. Calzolari Dottorato, Pisa, Maggio 2009 63

The largest Network of LR and HLT players, with diverse approaches, efforts and technologies

Enable progress toward community consensus

Give an extended picture of LRs & recast its definition in the light of recent scientific, methodological, technological, social developments

Consolidate methods & approaches, common practices, frameworks and architectures

A “roadmap” identifying areas where consensus has been achieved or is emerging vs. areas where additional discussion and testing is required, together with an indication of priorities

Recommendations in the form of a plan of coherent actions for the EU and national organizations

A European model for the LRs of the next years

Objectives & expected results

Ambitious!

N. Calzolari Dottorato, Pisa, Maggio 2009 64

The outcomes will be of a directive nature, to help the EC and national funding agencies identify priority areas of LRs of major interest for the public that need public funding to develop or improve

A blueprint of actions will constitute input to policy development both at EU and national level for identifying new language policies that support linguistic diversity in Europe, in combination with strengthening the language product market, e.g. for new products & innovative services, especially for less technologically advanced languages

Outcomes of FLaReNet

N. Calzolari Dottorato, Pisa, Maggio 2009 65

Call for international cooperation also outside Europe

and will be relevant for setting up a global worldwide Forum

of Language Resources and Language Technologies

These Initiatives, … together