N. Calzolari 1 Dottorato, Pisa, Maggio 2009 Nicoletta Calzolari Istituto di Linguistica...
Slide 1. Doctoral course (Dottorato), Pisa, May 2009. N. Calzolari
Nicoletta Calzolari
Istituto di Linguistica Computazionale - CNR - Pisa
Language Resources (lexicons, corpora, ontologies, …)
Standards and language technologies (cont.)
… and Projects
With many others at ILC
Slide 2
SIMPLE Model for a BioLexicon
Design a representational model for a BioLexicon, a comprehensive lexical resource
• able to integrate terminological, lexical and ontological info
• compatible with HLT international standards (i.e. ISO)
• able to meet the domain-specific requirements
Implement a BioLexicon database, a container with lexical objects to be filled with data provided by “populators” (EBI, UoM & CNR-ILC)
• able to be automatically incremented with new terms and linguistic info extracted from texts
from Valeria Quochi
Slide 3
BioLexicon Building cycle
• Terminology to Ontology (Jena/Rennes/EBI)
• Bio-Lexicon Population: variants; synt info of terms (UoM)
• Term Repository: gather terms (EBI)
• Bio-events: extraction of bio-events (ILC)
• Bio-Lexicon: conceptual model and physical DB (ILC)
from Valeria Quochi
Slide 4
The BioLexicon: where from
[Diagram. Labels as they appear on the slide:]
Existing repositories
MEDLINE
BioLexicon
chemical compounds, species names, disease, enzymes
Subclustering of term variants
genes/proteins
Incremental population process
Named Entity Recognition
Term Mapping by Normalisation
new genes/proteins names
Manual curation
Verbs, nouns, adjs, advs (variants, inflected forms, derivative relations, ...)
Linguistic pre-processing
Subcat extraction
Manual annotation of a bio-event corpus
Bio-event extraction
Syn-sem mapping
from Simonetta Montemagni
Slide 5
BioLexicon Model: High-level lexical objects, Data Categories
Syntax
Semantics
e.g. <feat att="POS" val="VVZ"/> <feat att="ConfScore" val="0.9"/> <feat att="source" val="UNIPROT"/> …
from Valeria Quochi
Slide 6
GeneRegOnto – BioLex: Concepts to Predicates
from Valeria Quochi
Slide 7
[Diagram: GeneRegOnto concepts linked to BioLexicon predicates, for the example sentence "NF-AT positively regulates IL2".]
Node labels: Regulation, PositiveProteinRegulation, NegativeProteinRegulation, TranscriptionFactorProtein; regulates, isregulatedby; regulate, regulation, regulator, regulatee; PredRegulate, Arg0Regulate, Arg1Regulate; NF-AT, IL2
Legend: bio event concept, bio entity concept, bio relations, bio-specific qualia relations, bio semantic entry, predicative argument structure, bio semantic roles
from Valeria Quochi
Slide 8
[Diagram of linked lexical objects in the BioLexicon. Labels as they appear on the slide:]
SynBehaviour: Lesion1
SubcatFrame: pp-of
SynArg: Arg0, pp-of
Sense: Lesion1
Predicate: LESION
SemArg: Arg0, Pat
Ontology nodes: Activity, Protein
The pattern “lesion of PROTEIN” is not in the lexicon, but can be calculated by accessing info scattered over various lexical objects (i.e. the syntactic unit lesion heads a pp-of corresponding to the patient argument, restricted by the ontological node PROTEIN).
All lexical items labelled as PROTEIN can be candidates to fill this argument slot. Lesion of OmpC, lesion of OmpR, etc. are all admitted instances of this “predicate”/pattern.
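The computation described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the dictionary layout and the toy ontology table (including the `mouse`/`SPECIES` entry) are hypothetical and are not the BioLexicon database schema. It shows how the pattern "lesion of PROTEIN" can be derived by joining a syntactic argument with the semantic argument of the same name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynArg:
    name: str         # argument label, e.g. "Arg0"
    realization: str  # syntactic realization, e.g. "pp-of"


@dataclass(frozen=True)
class SemArg:
    name: str         # argument label, e.g. "Arg0"
    role: str         # semantic role, e.g. "Pat" (patient)
    restriction: str  # ontological class restricting the filler


# Toy lexical objects (hypothetical layout, values taken from the slide).
SUBCAT = {"lesion": [SynArg("Arg0", "pp-of")]}
PREDICATE = {"lesion": ("LESION", [SemArg("Arg0", "Pat", "PROTEIN")])}
# Toy ontology typing; the "mouse" entry is an invented contrast case.
ONTOLOGY_CLASS = {"OmpC": "PROTEIN", "OmpR": "PROTEIN", "mouse": "SPECIES"}


def patterns(lemma):
    """Derive surface patterns by joining syntactic and semantic arguments."""
    syn_args = {arg.name: arg for arg in SUBCAT[lemma]}
    _, sem_args = PREDICATE[lemma]
    result = []
    for sem in sem_args:
        syn = syn_args.get(sem.name)
        if syn is None:
            continue  # semantic argument without a syntactic realization
        preposition = syn.realization.split("-", 1)[1]  # "pp-of" -> "of"
        result.append(f"{lemma} {preposition} {sem.restriction}")
    return result


def fillers(ontology_class):
    """All lexical items typed with the given ontological class."""
    return sorted(t for t, c in ONTOLOGY_CLASS.items() if c == ontology_class)


print(patterns("lesion"))   # ['lesion of PROTEIN']
print(fillers("PROTEIN"))   # ['OmpC', 'OmpR']
```

The pattern is never stored; it is recomputed from the subcategorisation frame and the predicate, which is the point the slide makes about information being scattered over lexical objects.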
Slide 9
Good mapping of Relations
OBO Relations -> Relations from Extended Qualia Structure
(qualia roles shown on the slide: Constitutive, Telic, Formal, Agentive)
derivesFrom -> derived_from
precededBy -> ?
participatesIn -> ?
hasParticipant -> ?
agentOf -> …
hasAgent -> ?
functionOf -> is_the_activity_of
hasFunction -> …
instanceOf -> …
isA -> is_a
partOf -> is_a_part_of
hasPart -> has_as_part
GrainOf -> …
hasGrain -> …
componentOf -> …
hasComponent -> …
properPartOf -> …
hasProperPart -> …
locatedIn -> …
locationOf -> …
containedIn -> …
contains -> contains
adjacentTo -> ?
Slide 10
Enhancing Semantic Relations
Source_Sense | Rel Type | Target_Sense
Phosphoglycolate | BelongsToSpecies | Mouse
[Diagram: phosphoglycolate -> BelongsToSpecies -> mouse]
from Valeria Quochi
Slide 11
How to link Bio-Ontology and Bio-Lexicon
Place(s) of Semantics in BootStrep
• Bio-Ontology holds domain-specific as well as general semantics (in terms of classes and relations between classes)
• Lexicon model comes with a semantic layer based on a linguistic ontology (SIMPLE-CLIPS Ontology)
Questions:
• What is the relation between the bio-ontology and the linguistic ontology?
• Do they overlap? What is the overlap/intersection? The difference?
• Is a mapping possible? What could a mapping look like?
Aim: Bringing lexical semantics and ontological semantics together
Slide 12
The BioLexicon Model & Standards
The Bio-Lexicon is based on the MILE metamodel and the more recent ISO proposal of a Lexical Markup Framework (LMF)
Data Categories are drawn as far as possible from already existing repositories and standards (e.g. morphosyntactic data categories)
There is a need, however, to define a set of Data Categories specific to the biology domain (e.g. semantic roles and relations)
Slide 13
ISO Meta-model & Data Categories
An ISO standard for NLP lexica
Definition of the Lexical Markup Framework, a general & abstract meta-model & a set of structural nodes relevant for linguistic description
Objectives
• Design of the abstract lexical meta-model
• Definition of the common set of related Data Categories
The field is mature
from Monica Monachini
Slide 14
ISO - LMF
Specifically designed to accommodate as many models of lexical representation as possible
Its pros:
• Meta-model: a high-level specification (ISO 24613)
• Data Category Registry: low-level specifications (ISO 12620)
Not a monolithic model, rather a modular framework
• LMF library provides the hierarchy of lexical objects (with structural relations among them)
• Data Category Registry provides a library of descriptors to encode linguistic information associated with lexical objects (N.B. Data Categories can also be user-defined)
Slide 15
ISO LMF – Lexical Markup Framework
Core Package: structural skeleton, with the basic hierarchy of information in a lexical entry
+ various extensions:
• Morphology
• NLP Multilingual notations
• NLP MWE pattern
• NLP Paradigm class
• NLP Semantic
• MRD
• NLP Syntax
• Constraint Expression
LMF specs comply with UML modelling principles; an XML DTD allows implementation
Builds also on EAGLES/ISLE
Also on the slide: NEDO Asian Lang., NICT Language-Grid Service Ontology, ICT KYOTO, LIRICS
Slide 16
LMF: NLP Extension for Semantics
Slide 17
Lexical Entry
<LexicalEntry rdf:ID="LEprotein">
  <hasSyntacticBehaviour rdf:resource="../../#SB_protein"/>
  <hasLemma>
    <Lemma rdf:ID="L_protein"/>
    <hasRepresentationFrame>
      <RepresentationFrame rdf:ID="RF_protein"/>
    </hasRepresentationFrame>
  </hasLemma>
</LexicalEntry>
[Diagram: Lexical Entry LE_protein; Lemma L_protein; SyntacticBehaviour SB_protein; Representation Frame RF_protein (DC: writtenForm = protein)]
Slide 18
Event Representation through SemanticPredicate
SemanticPredicate: SP_regulate
SemanticArgument: SP_TF_protein (DC: role=agent)
SemanticArgument: SP_Target Gene (DC: role=patient)
Slide 19
Sense Representation
<Sense rdf:ID="activate_2">
  <belongsToSynset rdf:resource="#activate"/>
  <hasSemanticRelation rdf:resource="#is_a_1"/>
  <hasSemanticRelation rdf:resource="#has_as_part_1"/>
  <hasSemanticRelation rdf:resource="#object_of_the_activity_1"/>
  <hasSemanticFeature rdf:resource="#SF_chemistry"/>
  <hasSemanticFeature rdf:resource="#SF_process"/>
</Sense>
[Diagram: Sense activate_2; Synset activate; PredicativeRepresentation; SemanticFeature SF_chemistry, SF_process; Collocation; SemanticRelation is_a: [SenseID], Typical_of: [SenseID] S_protein]
Slide 20
Example of Semantic Relation
<SemanticRelation rdf:ID="is_in">
  <hasSourceSense>
    <Sense rdf:ID="S_cox15">
      <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_cox15</id>
    </Sense>
  </hasSourceSense>
  <hasTargetSense>
    <Sense rdf:ID="S_chromosome19">
      <id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">S_chromosome19</id>
    </Sense>
  </hasTargetSense>
  <relationName rdf:datatype="http://www.w3.org/2001/XMLSchema#string">is_in</relationName>
</SemanticRelation>
[Diagram: Sense S_cox15 -> SemanticRelation is_in -> Sense S_chromosome19]
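As a rough illustration of how such a relation could be consumed, the sketch below reads the fragment with Python's standard `xml.etree.ElementTree` and returns it as a (source, relation, target) triple. Two things are assumptions of this sketch and not part of the slide: the `xmlns:rdf` declaration (the slide shows only a fragment, so the prefix has to be bound somewhere) and the helper name `relation_triple`.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# The fragment from the slide, with the rdf prefix declared (assumption).
SNIPPET = f"""
<SemanticRelation xmlns:rdf="{RDF_NS}" rdf:ID="is_in">
  <hasSourceSense>
    <Sense rdf:ID="S_cox15">
      <id rdf:datatype="{XSD_STRING}">S_cox15</id>
    </Sense>
  </hasSourceSense>
  <hasTargetSense>
    <Sense rdf:ID="S_chromosome19">
      <id rdf:datatype="{XSD_STRING}">S_chromosome19</id>
    </Sense>
  </hasTargetSense>
  <relationName rdf:datatype="{XSD_STRING}">is_in</relationName>
</SemanticRelation>
"""


def relation_triple(xml_text):
    """Return (source sense, relation name, target sense) from one relation."""
    root = ET.fromstring(xml_text.strip())
    rdf_id = f"{{{RDF_NS}}}ID"  # ElementTree's "{namespace}local" attribute key
    source = root.find("hasSourceSense/Sense").get(rdf_id)
    target = root.find("hasTargetSense/Sense").get(rdf_id)
    name = root.findtext("relationName")
    return (source, name, target)


print(relation_triple(SNIPPET))  # ('S_cox15', 'is_in', 'S_chromosome19')
```

A full RDF toolkit would be the more robust choice for real data; plain element access is used here only to keep the example self-contained.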
Slide 21
Example: How to encode WordNet type of info in LMF
[Object diagram for "oak" and "oak tree". Objects as they appear on the slide:]
: Lexical Entry, partOfSpeech = noun
: Lexical Entry, partOfSpeech = noun
: Lemma, wordForm = oak
: Lemma, wordForm = oak tree
: Sense, id = oak0
: Sense, id = oak2
: Sense, id = oak_tree0
: Synset, id = 12100067
: Synset, id = 12100739
: Synset Relation, label = substanceHolonym
: Semantic Definition, text = a deciduous tree of the genus Quercus
: Semantic Definition, text = the hard durable wood of any oak
: Statement, text = has acorns and lobed leaves
: Statement, text = used especially for furniture and flooring
: Statement, text = great oaks grow from little acorns
Slide 22
XML-based Abstract Lexicon Interchange Format: mapping exercise
Major best practices:
• OLIF
• PAROLE/SIMPLE
• LC-Star
• WordNet - EuroWordNet
• FrameNet
• BDef, a formal database of lexicographic definitions derived from the Explanatory Dictionary of Contemporary French
• … others on the way …
Entries from existing lexicons have been mapped to LMF to prove that the model is able to represent many best practices and achieve unification
from Monica Monachini
Slide 23
Lexical WEB & Content Interoperability ‘Standards’
As a critical step for semantic mark-up in the SemWeb
[Diagram: lexicons (ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y) connected through LMF, with intelligent agents]
Standards for Interoperability
Enough??
Slide 24
Need of tools to make this vision operational & concrete
New prototype “LeXFlow”:
• web-based collaborative environment for semi-automatic management/integration of lexical resources
• enabling interoperability of distributed lexical resources accessed by different types of agents
• addressing semi-automatic integration of computational lexicons, with focus on linking and cross-lingual enrichment of distributed LRs
• Case-study: cross-fertilization between Italian and Chinese WordNets
From Language Resources To Language Services
Slide 26
Our WN case study
• ItalWordNet (Roventini et al., 2003)
• Academia Sinica Bilingual Ontological WordNet (Sinica BOW, Huang et al., 2004)
• Both connected to Princeton WordNet (although to different versions)
• Same set of semantic relations (EWN ones)
Slide 27
Architecture for cooperative integration of lexicons
[Layered diagram. Labels as they appear on the slide:]
Layers: Coordination, Application, Data
Agents: Agent Role1, Agent Role2, Agent Role3, Agent Role4
Components: ILIMapper, RelationMapper, MultiWordnet Relation Calculator (Web service interface), Simple-Wordnet Relation Calculator (Web service interface)
Resources: ItalianSimple, ItalianWordnet, ChineseWordnet
Slide 28
Basic assumptions behind MWN …
Interlingual level:
• Interlingua provides an indirect linkage between different WordNets: the Interlingual Index (ILI), an unstructured version of WordNet used in EuroWordNet
• Each synset in a WN_A is linked to at least one record of the ILI by means of a set of relations (eq_synonym, eq_near_synonym, …)
Synset correspondence:
• If there is an S_A and an S_B that point to the same ILI, they are correspondent
Relation correspondence:
• If there are two synsets in WN_A and a relation between them, the same holds between corresponding synsets in WN_B
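The two correspondence assumptions can be sketched as a small projection routine. This is a minimal illustration, not the LeXFlow or MultiWordNet implementation: the function names, the synset and ILI identifiers and the relation label `has_mero_part` are all hypothetical, and the sketch ignores the fact that the two wordnets are linked to different Princeton WordNet versions.

```python
from collections import defaultdict


def correspondences(ili_links_a, ili_links_b):
    """Synset correspondence: A and B synsets that point to the same ILI record."""
    by_ili = defaultdict(set)
    for synset_b, ili_records in ili_links_b.items():
        for ili in ili_records:
            by_ili[ili].add(synset_b)
    result = {}
    for synset_a, ili_records in ili_links_a.items():
        matches = set()
        for ili in ili_records:
            matches |= by_ili.get(ili, set())
        result[synset_a] = matches
    return result


def project_relations(relations_a, relations_b, corr):
    """Relation correspondence: propose for WN_B the relations that hold in WN_A."""
    proposed = set()
    for source, rel, target in relations_a:
        for s in corr.get(source, ()):
            for t in corr.get(target, ()):
                if (s, rel, t) not in relations_b:
                    proposed.add((s, rel, t))  # new candidate relation for WN_B
    return proposed


# Toy data with invented identifiers.
ILI_A = {"it:strada": {"ili:road"}, "it:carreggiata": {"ili:roadway"}}
ILI_B = {"zh:dao_lu": {"ili:road"}, "zh:che_dao": {"ili:roadway"}}
RELS_A = {("it:strada", "has_mero_part", "it:carreggiata")}
RELS_B = set()

CORR = correspondences(ILI_A, ILI_B)
PROPOSED = project_relations(RELS_A, RELS_B, CORR)
print(PROPOSED)  # {('zh:dao_lu', 'has_mero_part', 'zh:che_dao')}
```

Relations already present in WN_B are skipped, so the output contains only candidate enrichments; a relation found in both wordnets could instead be counted as mutual validation.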
Slide 29
[Diagram: Italian and Chinese synsets linked through ILI records for WordNet 1.5 and 1.6. Labels as they appear on the slide:]
Italian synsets: passaggio, strada, via (N#1290); parte, tratto (N#12348); carreggiata (N#21225); curvatura, svolta, curva (N#20944)
Chinese synsets: che_dao (車道, N#3245327); tong_dao (通道, N#03092396); dao_lu, dao, lu (道路, 道, 路, N#03243979); wan (彎, N#9992072)
ILI records: road, route (ILI1.5-3001757-n, ILI1.6-3243979-n); bend, crook, turn (ILI1.5-8488101-n, ILI1.6-9992072-n); passage (ILI1.5-2857000-n, ILI1.6-3092396-n); stretch (ILI1.5-5691718-n, ILI1.6-???); roadway (ILI1.5-3002522-n, ILI1.6-3245327-n)
Relation labels: iperonimia/HYP, iponimia/HPO, meronimy/MPT, Synonym, 上位(泛稱)詞_為/HYP, 下位(特指)詞_為/HPO, 部件_部份詞_為/MPT
A new proposed mero relation
Reinforcement & validity
[Diagram: a second cross-wordnet example with derived relations. Labels as they appear on the slide:]
English synsets: absorb, assimilate, ingest, take_in (00406975-v, 00338206-v); receive, have (01513366-v, 01260836-v); imbibe (00407124-v, 00338343-v); acquire_knowledge (00403772-v, 00335115-v); respective, several, various (00462055-a, 00364361-a)
Italian synsets: assimilare_5, assorbire_3, accettare_2, recepire_1 (V#32080); prendere_3 (V#39802); relativo_4 (AG#42011); studiare_3, imparare_1, apprendere_2 (V#32925)
Chinese: 吸 (00407124-v)
Other identifiers: 00001533-v, 00403772-v
Relation labels: eq_syn, eq_near_syn, has_hyperonym, has_hyponym, causes, HYP, HPO, CAU, Derived
Slide 31
For a Global WordNet Grid
This architecture for making distributed wordnets interoperable lends itself to different applications in LR processing:
• Enrichment of existing lexical resources
• Creation of new resources
• Validation of existing resources
Can provide a platform for cooperative & collective creation & management of LRs, by providing a web-based environment for the collaboration & interaction of distributed agents and resources
Can be seen as the prototype of a web application supporting the Global WordNet Grid initiative, i.e. a shared multilingual knowledge base for cross-lingual processing based on distributed resources over the Grid
New project: KYOTO
Slide 32
[KYOTO overview diagram. Labels as they appear on the slide:]
Sources: Environmental organizations; distributed, diverse & dynamic data
1. Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
2. CO2 emission
3. Wikyoto: maintain terms & concepts
4. Index facts: Process: Emission; Involves: CO2; Property: increase, sudden; When: 2008; Where: Europe
5. Text & Fact Index
6. Semantic Search
Robots: Tybot (term yielding robot), Kybot (knowledge yielding robot)
Ontology (Top, Middle, Domain): Abstract, Physical, Process, Substance, H2O, CO2
Wordnets, domain terms: CO2 Emission, H2O Pollution, Greenhouse Gas
Users: Citizens, Governments, Companies
from Piek Vossen
Slide 33
[KYOTO processing pipeline diagram. Labels as they appear on the slide:]
Input: TEXT
Annotation formats: Linear DAF, Linear MAF, Linear SYNAF, Linear SEMAF, Generic TMF, Linear Generic FACTAF
Processing steps: Discourse Annotation, Morphological Annotation, Syntactic Annotation, Term Extraction (Tybot), Semantic Annotation, Fact Extraction (Kybot)
Resources: Wordnet, Domain Wordnet (LMF API); ontology, domain ontology (OWL API); Domain Terms
Scope labels: Language Specific; Language Neutral & Specific; Language Neutral
from Piek Vossen
Slide 34
System components
• Wikyoto = wiki environment for a social group:
  – to model the terms and concepts of a domain and agree on their meaning, within group, across languages and cultures
  – to define the types of knowledge and facts of interest
• Tybots = Term extraction robots, extract term data from text corpus
• Kybots = Knowledge yielding robots, extract facts from a text corpus
• Linguistic processors:
  – tokenizers, segmentizers, taggers, grammars
  – named entity recognition
  – word sense disambiguation
  – generate a layered text annotation in Kyoto Annotation Format (KAF)
from Piek Vossen
Slide 35
KYOTO SYSTEM
[System diagram. Labels as they appear on the slide:]
Annotation formats: Linear SYNAF/SEMAF, Linear SEMAF, Generic TMF, Linear Generic FACTAF
Processing steps: Term extraction (Tybot), Semantic annotation, Fact extraction (Kybot), Domain editing (Wikyoto)
Resources: Wordnet, Domain Wordnet (LMF API); Ontology, Domain ontology (OWL API)
Users: Concept User, Fact User
from Piek Vossen
Slide 36
Fact mining by Kybots
[Diagram. Labels as they appear on the slide:]
Source Documents, Linguistic Processors
Morpho-syntactic analysis: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP]NP
Ontology (Generic, Domain): Abstract, Physical, Process, Substance, Chemical Reaction, H2O, CO2
Wordnets & Linguistic Expressions: CO2 emission, water pollution
Logical Expressions
Fact analysis:
[the emission]NP -> Process: e1
[of greenhouse gases]PP -> Patient: s2
[in agricultural areas]PP -> Location: a3
from Piek Vossen
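The fact analysis on this slide maps syntactic chunks to roles (Process, Patient, Location). The toy sketch below reproduces only that final mapping for the slide's example; it is a deliberately naive stand-in and not how Kybots work. In particular, the preposition-to-role table `PREP_ROLE` and the function name `analyse` are assumptions made for illustration, whereas the slide indicates that the real system draws on wordnets and the ontology.

```python
# Chunks from the slide's morpho-syntactic analysis: (label, text).
CHUNKS = [
    ("NP", "the emission"),
    ("PP", "of greenhouse gases"),
    ("PP", "in agricultural areas"),
]

# Hypothetical rule table: which role a PP introduced by a preposition fills.
PREP_ROLE = {"of": "Patient", "in": "Location"}


def analyse(chunks):
    """Assign the head NP to Process and each PP to a role via its preposition."""
    fact = {}
    for label, text in chunks:
        if label == "NP" and "Process" not in fact:
            fact["Process"] = text
        elif label == "PP":
            preposition, _, rest = text.partition(" ")
            role = PREP_ROLE.get(preposition)
            if role is not None:
                fact[role] = rest
    return fact


FACT = analyse(CHUNKS)
print(FACT)
# {'Process': 'the emission', 'Patient': 'greenhouse gases', 'Location': 'agricultural areas'}
```

A preposition table cannot distinguish, say, a location from a time expression introduced by "in"; that is the kind of decision the ontology-backed typing in the diagram is meant to support.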
Slide 37
Contribution of KYOTO
Sources (html, xls):
• hundreds of thousands of sources in the environment domain
• in many different languages
• spread all over the world
• changing every day
Learning:
• KYOTO learns terms and concepts from text documents
• Stored as structures that people and computers understand
Community:
• KYOTO delivers a Web 2.0 environment for community-based control
• Connects people across languages and cultures
• Establishes consensus and knowledge transition
Search:
• KYOTO enables semantic search and fact extraction
• Software can partially understand language and exploit Web 1.0 data
• Understanding is helped by the terms and concepts defined for each language
Diagram labels: Wordnet (environment terms), Ontology (environment concepts), environment facts, TYBOT, KYBOT, WIKYOTO
from Piek Vossen
Slide 38
A common representation format: WordNet-LMF
[UML class diagram. Classes: LexicalResource, GlobalInformation, Lexicon, LexicalEntry, Lemma, Sense, MonolingualExternalRefs, MonolingualExternalRef, Synset, Definition, Statement, SynsetRelations, SynsetRelation, SenseAxes, SenseAxis, InterlingualExternalRefs, InterlingualExternalRef, Meta. Associations carry the multiplicities 1..1, 1..*, 0..1 and 0..*.]
Data Categories
from Monica Monachini
Slide 39
Centralized WordNet DC Registry
A list of 85 semantic relations, as a result of a mapping of the KYOTO WordNet grid:
• Inter-WN
• Intra-WN
from Monica Monachini
Slide 40
WordNet-LMF multilingual level - Cross-lingual synset relations
<!ELEMENT SenseAxes (SenseAxis+)>
<!ELEMENT SenseAxis (Meta?, Target+, InterlingualExternalRefs?)>
<!ATTLIST SenseAxis
    id ID #REQUIRED
    relType CDATA #REQUIRED>
<!ELEMENT Target EMPTY>
<!ATTLIST Target
    ID CDATA #REQUIRED>
<!ELEMENT InterlingualExternalRefs (InterlingualExternalRef+)>
<!ELEMENT InterlingualExternalRef (Meta?)>
<!ATTLIST InterlingualExternalRef
    externalSystem CDATA #REQUIRED
    externalReference CDATA #REQUIRED
    relType (at|plus|equal) #IMPLIED>
Annotations on the slide:
• groups monolingual synsets corresponding to each other and sharing the same relations to English
• specifies the type of correspondence
• link to ontology/(ies)
Example synsets: SWN <fuego_3, llama_1> 09686541-n; IWN <fuoco_1, fiamma_1> 00001251-n; WN3.0 <fire_1 flame_1 flaming_1> 13480848-n
from Monica Monachini
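To make the DTD more concrete, here is a hedged sketch that builds one `SenseAxis` element with Python's standard `xml.etree.ElementTree`, following the content model and attribute lists shown on the slide. The helper name `sense_axis`, the axis id `sa_001`, the `relType` value `eq_synonym` and the external reference (`ExampleOntology`, `Fire`) are illustrative assumptions; the three target ids are the synset ids listed on the slide.

```python
import xml.etree.ElementTree as ET

ALLOWED_REF_TYPES = {"at", "plus", "equal"}  # enumerated in the DTD


def sense_axis(axis_id, rel_type, target_ids, external_refs=()):
    """Build a SenseAxis element: Target+ followed by optional external refs."""
    if not target_ids:
        raise ValueError("the DTD requires at least one Target")
    axis = ET.Element("SenseAxis", {"id": axis_id, "relType": rel_type})
    for target_id in target_ids:
        ET.SubElement(axis, "Target", {"ID": target_id})
    if external_refs:
        refs = ET.SubElement(axis, "InterlingualExternalRefs")
        for system, reference, ref_type in external_refs:
            attrs = {"externalSystem": system, "externalReference": reference}
            if ref_type is not None:  # relType is #IMPLIED, so it may be absent
                if ref_type not in ALLOWED_REF_TYPES:
                    raise ValueError(f"relType must be one of {ALLOWED_REF_TYPES}")
                attrs["relType"] = ref_type
            ET.SubElement(refs, "InterlingualExternalRef", attrs)
    return axis


AXIS = sense_axis(
    "sa_001",                                     # hypothetical axis id
    "eq_synonym",                                 # hypothetical relType value
    ["09686541-n", "00001251-n", "13480848-n"],   # synset ids from the slide
    [("ExampleOntology", "Fire", "equal")],       # hypothetical ontology link
)
print(ET.tostring(AXIS, encoding="unicode"))
```

The optional `Meta` child is left out for brevity. The builder enforces only the constraints visible in the DTD fragment; validating against the full WordNet-LMF DTD would need a validating parser.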
Slide 41
Ultimate goal
Global standardization and anchoring of meaning such that:
• Machines can start to approach text understanding -> semantic web connects to the current web
• Communities can dynamically maintain knowledge, concepts and their terms in an easy-to-use system
• Cross-linguistic and cross-cultural sharing and communication of knowledge is enabled
Comparable to a formalization of Wikipedia for humans AND machines across languages
from Piek Vossen
Slide 42
Some steps for a “new generation” of LRs
From huge efforts in building static, large-scale, general-purpose LRs
To non-static LRs rapidly built on demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources
To Language Services
Slide 43
Distributed Language Services
A long-term scenario implying content interoperability standards, supra-national cooperation and development of architectures enabling accessibility
• Create new resources on the basis of existing ones
• Exchange and integrate information across repositories
• Compose new services on demand
Collaborative & collective/social development and validation, cross-resource integration and exchange of information
Language Grid
Wiki
Slide 44
In the “Semantic Web” vision ...
… need to tackle the twofold challenge of content availability & multilinguality
Natural convergence with HLT:
• multilingual semantic processing
• ontologies
• semantic-syntactic computational lexicons
Slide 45
Semantic Web
LT & LRs
Content Interoperable LRs & LT
Language Tech … & … Knowledge, Content
Knowledge Markup
Ready???
How to cooperate??
Slide 46
LR and the future of LT or Content Tech
The need for ever growing and richer LRs for effective multilingual content processing requires a change in the paradigm, & the design of a new generation of LRs, based on open content interoperability standards
The Semantic Web notion may be used to shape the LRs of the future, in the vision of an open space of sharable knowledge available on the Web for processing
The effort of making available millions of “richly annotated words” for dozens of languages is not affordable by any single group
This objective can only be achieved by creating integrated Open and Distributed Linguistic Infrastructures
Not only linguistic experts can participate in these: they may also include designers, developers, users of content encoding practices, etc., in wiki mode
Is the LR/LT field mature enough to broaden and open itself to the concept of a cooperative effort of different sets of communities?
Could a sort of “Language Genome” large initiative be effective? Storing lots of (annotated) facts
Slide 47
Today, many signs of vitality & success … for LRs
In Spoken, Written, Multimodal areas … in new emerging areas
Statistical approaches …
Different dimensions & layers: Content (Ontologies), Emotion, Time, …
For Evaluation
For Training …
• LREC (> 900 submissions); many LRs at COLING and even at ACL!!
• ELRA (self-sustaining) & LDC
• LRE (new Journal: N. Ide & NC)
• ISO-TC37-SC4/WG4 (International Standards for LRs)
• AFNLP
• …
• FLaReNet
• ESFRI - CLARIN (also political & strategic role)
• New calls or initiatives in EU, US, ASIA, on LRs, interoperability, cooperation, …
Slide 48
BUT … an important point
In the ’90s:
There was a global vision of the field & its main components:
• Standards
• Creation of LRs
• Distribution
Then:
• Automatic acquisition
… towards the Infrastructure of LRs & LT
While today:
There is an ever increasing set of initiatives for new LRs, basic robust technologies, models??, algorithms
We have a LR community culture
BUT sort of scattered, opportunistic, not much coherence
ELRA, LDC
Slide 49
Today …
The wealth of data & of basic technologies is such that we should reflect again on the field as a whole & ask if
• Standards
• Creation of LRs
• Automatic acquisition
• Distribution
are still “the” important components, or how they have changed/must change
… Which new challenges towards a new & more mature infrastructure of LRs & LTs??
• Dynamic LRs
• Sharing
• Collaborative creation & management
• Content interoperability
Slide 50
These dimensions
• Dynamic LRs
• Sharing
• Collaborative creation & management
• Content interoperability
• + Distributed architectures/infrastr
could be at the basis of a new Paradigm for LRs & LT & of a new Infrastructure ??
Need more
Technology exists
Many dimensions around the notion of language
Cultural issues: Language … and cultural identity; Language … and the Humanities
Economic, social issues: Applications, Services
Technical issues
Interdisciplinarity & Multidisciplinarity
Political issues, e.g. a commonly agreed list of minimal requirements for "national" LRs: BLARK
Multilingualism
Need of bodies for a broad research agenda & strategic actions for LT & LRs (W/S/MM) based on all the dimensions
We need to put together technical, organisational, strategic, economic, political issues of LRs
Two new European Infrastructural & Networking Initiatives, finally
Which Communities?
Core: Language Resources; Language Technologies; Standardisation
Grid; Semantic Web; Ontologists; ICT; …
Humanities; Social Sciences; Digital Libraries; Cultural Heritage; …
Many application domains (eculture, egovernment, ehealth, …)
Multilinguality
CLARIN ResInfra: Enabling infrastr.
FLaReNet Network: Focus on cooperation
Technologies exist, but the infrastructure that puts them together and sustains them is still missing
CLARIN
Common Language Resources and Technologies Infrastructure for the Humanities & Social Sciences
ESFRI Research Infrastructures
Large-scale pan-European collaborative effort (31+ countries)
Make LRs & LTs available & readily usable to scholars of humanities & social sciences (& all disciplines)
Need to overcome the present fragmented situation by harmonising structural and terminological differences
Basis is a Grid-type infrastructure and Semantic Web technology
The benefits of computer enhanced language processing become available only when a critical mass of coordinated effort is invested in building an enabling infrastructure, which can provide services in the form of provision of tools & resources as well as training & counseling across a wide span of domains
The infrastructure will be based on a number of resource, service and expertise centres
CLARIN Mission
Create a comprehensive and free to use distributed archive of LRs & LTs covering not only the languages of all member states, but also other languages studied and used in Europe
Because the tools & resources will be interoperable across languages & domains, contribute to preserving and supporting the multilingual & multicultural European heritage
An operational open infrastructure of web services will introduce a new paradigm of distributed collaborative development
Allow many contributors to add all kinds of new services based on existing ones, thus ensuring reusability and allowing scaling up to suit individual needs
How can we tackle these challenges?
J. Taylor: "eScience is about global collaboration in key areas of science and the next generation of infrastructures that will enable it"
Need to build new types of platforms
to allow researchers to combine existing resources easily into new ones to tackle the big challenges
to increase the productivity of all interested researchers, since currently too much time is wasted on preparatory work
from P. Wittenburg
eScience Vision
CLARIN establishes such a new generation of extended infrastructure
Thus CLARIN is not about creating and building new language resources and technology, but making them available and accessible as services in a stable and persistent infrastructure to allow tackling the great challenges
CLARIN: http://www.clarin.eu
Grid Project: http://www.mpi.nl/dam-lr
ISO TC37/SC4: http://www.tc37sc4.org
Standards Project: http://lirics.loria.fr/
from P. Wittenburg
We still have a long path …
in an e-Contentplus Call for a "Thematic Network on Language Resources": FLaReNet
To provide common recommendations (to the EC) for future actions
To give priorities
Need of 'visions'
& also a "new project"
In a global context, in cooperation with CLARIN & also with non-EU members
Which Communities?
Core: Language Resources; Language Technologies; Standardisation; Ontologists; Content
EC; Funding agencies; …
Humanities; Social Sciences; Digital Libraries; Cultural Heritage; …
Many application domains (eculture, egovernment, ehealth, intelligence, domotics, content industry, …)
Multilinguality
CLARIN ResInf
EU Forum
FLaReNet Network: Focus on cooperation
LRs & LTs exist, but a global vision, policy and strategy is still missing
eContentplus
Fostering Language Resources Network
A new European Network for Language Resources
Nicoletta Calzolari (coord.) [email protected]
http://www.flarenet.eu
A European forum to facilitate interaction among LR stakeholders
The Network structure considers that LRs present various dimensions and must be approached from many perspectives: technical, but also organisational, economic, legal, political
Addresses also multicultural and multilingual aspects, essential when facing access and use of digital content in today’s Europe
FLaReNet Fostering Language Resources Network
http://www.flarenet.eu/
A layered structure, with leading experts & groups (national and European institutions, SMEs, large companies) for all relevant LR areas (about 40 partners)
in collaboration with CLARIN to ensure coherence of LR-related efforts in Europe
FLaReNet will consolidate existing knowledge, presenting it analytically and visibly
contribute to structuring the area of LRs of the future by discussing new strategies to:
– convert existing and experimental technologies related to LRs into useful economic and societal benefits
– integrate so far partial solutions into broader infrastructures
– consolidate areas mature enough for recommendation of best practices
– anticipate the needs of new types of LRs
Organised in Thematic Working Groups
Thematic Areas
The Chart for the area of LRs in its different dimensions
Methods and models for LR building, reuse, interlinking and maintenance
Harmonisation of formats and standards
Definition of evaluation protocols and evaluation procedures
Methods for the automatic construction and processing of LRs
To build together:
– Evolving RoadMap
– Blueprint of actions and infrastructures
Objectives & expected results
The largest Network of LR and HLT players, with diverse approaches, efforts and technologies
Enable progress toward community consensus
Give an extended picture of LRs & recast its definition in the light of recent scientific, methodological, technological, social developments
Consolidate methods & approaches, common practices, frameworks and architectures
A “roadmap” identifying areas where consensus has been achieved or is emerging vs. areas where additional discussion and testing is required, together with an indication of priorities
Recommendations in the form of a plan of coherent actions for the EU and national organizations
A European model for the LRs of the next years
Ambitious!
Outcomes of FLaReNet
The outcomes will be of a directive nature, to help the EC and national funding agencies identify priority areas of LRs of major interest for the public that need public funding to develop or improve
A blueprint of actions will constitute input to policy development both at EU and national level for identifying new language policies that support linguistic diversity in Europe, in combination with strengthening the language product market, e.g. for new products & innovative services, especially for less technologically advanced languages