2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Semantic annotation of text: techniques and applications. Prof. Luis Sanchez-Fernandez, Web Technologies Laboratory, University Carlos III of Madrid. http://webtlab.it.uc3m.es


Page 1: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Semantic annotation of text: techniques and applications

Prof. Luis Sanchez-Fernandez
Web Technologies Laboratory
University Carlos III of Madrid

http://webtlab.it.uc3m.es

Page 2: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Outline

Semantic Web
Techniques for semantic annotation of text
An approach to named entity disambiguation using Wikipedia

Page 3: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Short history of the Web

1990: Creation of the World Wide Web infrastructure at CERN by Tim Berners-Lee
  HTTP, HTML, first Web client, first Web server
1993: Mosaic, first graphical Web client
1994: Netscape Navigator
1996: Commercial use of the WWW becomes widespread
1999: Tim Berners-Lee proposes the Semantic Web
2002: Weblogs and RSS, Web 2.0
6 October 2009: at least 8 billion indexable Web pages
23 September 2010: at least 15 billion indexable Web pages, according to http://www.worldwidewebsize.com/

Page 4: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

The problem of information overload

The great success of the Web has led to one of its current problems: information overload
Finding relevant information and keeping it up to date is difficult and time-consuming for people and companies
  Example: keeping a state-of-the-art survey up to date
Company employees can spend up to 20% of their working time searching the Web (Outsell Inc., 2002)

Page 5: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

The Semantic Web proposal

The goal of the Semantic Web is to automate Web tasks by enriching current Web content with formal representations that enable better cooperation between humans and computers

Page 6: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Semantic Web Stack

Page 7: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Máster interuniversitario en Ingeniería Telemática

RDF

“Resource Description Framework” (RDF)
Goal of RDF (alternative views):
  Language for describing resources on the Web
  Language for the formal representation of (parts of) the information available in a Web document (metadata)
  Formal => machine readable
  Vocabulary defined with ontologies
What is a resource?
  Web content: Web pages, images, e-mails, files, …
  Resources mentioned in Web content: persons, locations, organizations, …

Page 8: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

RDF basic principles

We want to represent a piece of information available on the Web that describes a resource
Each piece of metadata states a property that can be modelled as a (formal) statement, composed of:
  subject: the resource being described
  predicate: a property of the resource
  object: the value of that property for the resource being described
Example: “http://www.example.org has a creator whose value is John Smith”
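The subject/predicate/object decomposition can be sketched as a small Python data structure. This is only an illustration: the class name `Statement` is hypothetical, and the predicate URI is assumed to be the Dublin Core "creator" term, which the slide's sentence does not spell out.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF statement, split into its three parts."""
    subject: str    # resource being described
    predicate: str  # property of the resource
    object: str     # value of the property for that resource


# The example sentence from the slide. The predicate URI is an assumption
# (Dublin Core "creator"); the sentence itself only says "has a creator".
stmt = Statement(
    subject="http://www.example.org",
    predicate="http://purl.org/dc/elements/1.1/creator",
    object="John Smith",
)

print(f"subject:   {stmt.subject}")
print(f"predicate: {stmt.predicate}")
print(f"object:    {stmt.object}")
```

A plain tuple would work equally well; the named fields just make the three roles explicit.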

Page 9: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

RDF Model

An RDF model (a set of RDF statements) can be represented as a graph
For each statement:
  the subject is a node
  the predicate is an arc
  the object is a node
Subject and predicate are resources
The object can be either a resource or a literal
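As a rough sketch of the graph view, the snippet below turns a few statements into nodes and labelled arcs. The statements, the `ex:` and `dc:` abbreviations, the `ex:name` property and the helper `build_graph` are all made up for illustration.

```python
from collections import defaultdict

# Each statement: (subject, predicate, object, object_is_literal).
# The data is illustrative only; prefixes stand in for full URIs.
statements = [
    ("ex:index.html", "dc:creator", "ex:staffid/85740", False),
    ("ex:index.html", "ex:language", "English", True),
    ("ex:staffid/85740", "ex:name", "John Smith", True),
]


def build_graph(stmts):
    """Return (nodes, arcs): node kinds and outgoing labelled arcs."""
    nodes = {}                 # node label -> "resource" or "literal"
    arcs = defaultdict(list)   # subject -> [(predicate, object), ...]
    for subj, pred, obj, obj_is_literal in stmts:
        nodes[subj] = "resource"  # a subject is always a resource
        nodes.setdefault(obj, "literal" if obj_is_literal else "resource")
        arcs[subj].append((pred, obj))  # the predicate labels the arc
    return nodes, dict(arcs)


nodes, arcs = build_graph(statements)
for subj, out in arcs.items():
    for pred, obj in out:
        print(f"{subj} --{pred}--> {obj} ({nodes[obj]})")
```

Literals only ever appear as arc targets here, which mirrors the rule that only the object of a statement may be a literal.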

Page 10: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Example

“http://www.example.org has a creator whose value is John Smith”.

Page 11: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Textual notation (triples)

<http://www.example.org/index.html> <http://purl.org/dc/elements/1.1/creator> <http://www.example.org/staffid/85740> .
<http://www.example.org/index.html> <http://www.example.org/terms/creation-date> "August 16, 1999" .
<http://www.example.org/index.html> <http://www.example.org/terms/language> "English" .
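A minimal sketch of how such triple lines could be produced from Python values follows. The helper names `term` and `to_triple_line` are hypothetical, and the escaping is deliberately simplified (only backslashes and double quotes in literals are handled).

```python
def term(value, is_literal=False):
    """Format one term: <...> for a resource URI, "..." for a literal."""
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def to_triple_line(subject, predicate, obj, obj_is_literal=False):
    """Serialize one statement as 'subject predicate object .'"""
    return f"{term(subject)} {term(predicate)} {term(obj, obj_is_literal)} ."


PAGE = "http://www.example.org/index.html"
lines = [
    to_triple_line(PAGE, "http://purl.org/dc/elements/1.1/creator",
                   "http://www.example.org/staffid/85740"),
    to_triple_line(PAGE, "http://www.example.org/terms/creation-date",
                   "August 16, 1999", obj_is_literal=True),
    to_triple_line(PAGE, "http://www.example.org/terms/language",
                   "English", obj_is_literal=True),
]
print("\n".join(lines))
```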

Page 12: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Ontologies: goal

An ontology is a formal, explicit specification of a shared conceptualization

An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as rules that should be fulfilled by such terms and relations

Page 13: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

RDF Schema

RDF vocabulary for:
  Definition and description of properties
  Definition and description of classes
Can be used to define simple ontologies

Page 14: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Properties in RDF Schema

rdfs:subPropertyOf
rdfs:range
rdfs:domain
rdfs:subClassOf
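To give a feel for what these four properties allow a tool to infer, here is a small sketch. The vocabulary (`Teacher`, `Person`, `Course`, `teaches`, `lectures`) is invented, each class or property has at most one parent for simplicity, and the function covers only a fragment of RDFS reasoning.

```python
# Hypothetical schema, one dictionary per RDF Schema property.
SUBCLASS = {"Teacher": "Person", "Student": "Person"}  # rdfs:subClassOf
SUBPROP = {"lectures": "teaches"}                      # rdfs:subPropertyOf
DOMAIN = {"teaches": "Teacher"}                        # rdfs:domain
RANGE = {"teaches": "Course"}                          # rdfs:range


def ancestors(name, parent_of):
    """Follow single-parent links upwards and return all ancestors."""
    found = []
    while name in parent_of:
        name = parent_of[name]
        found.append(name)
    return found


def infer_types(facts):
    """Derive (instance, class) pairs from (subject, property, object) facts."""
    types = set()
    for subj, prop, obj in facts:
        for p in [prop] + ancestors(prop, SUBPROP):
            if p in DOMAIN:
                types.add((subj, DOMAIN[p]))  # the subject belongs to the domain
            if p in RANGE:
                types.add((obj, RANGE[p]))    # the object belongs to the range
    for inst, cls in list(types):
        for sup in ancestors(cls, SUBCLASS):
            types.add((inst, sup))            # members of a class belong to its superclasses
    return types


inferred = infer_types([("luis", "lectures", "semantic_web")])
for pair in sorted(inferred):
    print(pair)
```

From the single fact "luis lectures semantic_web", the sketch derives that luis is a Teacher and a Person, and that semantic_web is a Course.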

Page 16: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

OWL

Ontology language
More powerful than RDF Schema
Examples:
  Existence/cardinality constraints
    all instances of person have a mother who is also a person, or persons have exactly 2 parents
  Transitive, inverse or symmetric properties
    isPartOf is a transitive property, hasPart is the inverse of isPartOf, touches is symmetric
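The three property characteristics named on the slide can be illustrated over plain sets of pairs. This is a sketch of what a reasoner would derive, not an OWL implementation, and the individuals (`spoke`, `wheel`, `car`, `road`) are invented.

```python
def transitive_closure(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def inverse(pairs):
    """Swap the two ends of every pair."""
    return {(b, a) for a, b in pairs}


def symmetric_closure(pairs):
    """Make the relation hold in both directions."""
    return set(pairs) | inverse(pairs)


# isPartOf is transitive; hasPart is its inverse; touches is symmetric.
is_part_of = transitive_closure({("spoke", "wheel"), ("wheel", "car")})
has_part = inverse(is_part_of)
touches = symmetric_closure({("wheel", "road")})

print(sorted(is_part_of))
print(sorted(has_part))
print(sorted(touches))
```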

Page 17: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Semantic Web and Technology Enhanced Learning

Page 18: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Typical applications

Modelling (ontologies)
  learning processes
  learning content
  learning output (competences)
  learning agents (students, teachers)
Adding metadata (annotations) according to the models
Using the models and the metadata in tools to make decisions
  example: personalized, adaptive content and/or problems

Page 19: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Semantic annotation of text

Page 20: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Generalities

Goal: extract semantic annotations from free text

Natural language is complex and ambiguous

  Language dependent
  Domain dependent applications
    News
    Literature
    E-mail
    Transcriptions of spoken dialogues

Some useful results can be achieved nowadays

Page 21: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Taxonomy of semantic annotations

Content based annotations
  Document categorization
  Named entities
Ontology based domain annotations
  Concepts and instances identification
  Relations extraction

Examples:
Named entity: (Washington, location)
Concepts and instances:
<rdf:Description rdf:about='WST'>
  <rdf:type rdf:resource='State'/>
</rdf:Description>
<rdf:Description rdf:about='WDC'>
  <rdf:type rdf:resource='City'/>
</rdf:Description>
Relation: isGovernor(GaryLocke, WST)

Page 22: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Basic techniques (i)

Symbolic NLP
  Based on the use of lexicons and grammar rules to process text
  Example: “Barack Obama Elected President”

Lexical analysis
  NP → Barack
  NP → Obama
  VBT → Elect
  VBT → VBT + ‘ed’
  NN → President

Parsing
  S → NP NP* VBT NN
  Parse tree: S with children NP, NP, VBT, NN

Semantic analysis
  S → NP NP*(X) VBT(Elect) NN(Y) ⇒ hasFunction(X, Y)
  hasFunction(BarackObama, President)
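The three stages (lexical analysis, parsing, semantic analysis) can be sketched as a toy rule-based pipeline. The lexicon and rules are the ones on the slide; the function names `tag` and `analyse` are hypothetical, and everything beyond this single sentence pattern is out of scope.

```python
# Lexicon from the slide: word -> part-of-speech tag.
LEXICON = {"Barack": "NP", "Obama": "NP", "Elect": "VBT", "President": "NN"}


def tag(token):
    """Lexical analysis: return (tag, base form) for one token."""
    if token in LEXICON:
        return LEXICON[token], token
    # Morphological rule VBT -> VBT + 'ed' (e.g. "Elected" -> "Elect").
    stem = token[:-2]
    if token.endswith("ed") and LEXICON.get(stem) == "VBT":
        return "VBT", stem
    raise ValueError(f"unknown token: {token}")


def analyse(sentence):
    """Parse with S -> NP NP* VBT NN and emit hasFunction(X, Y) for 'Elect'."""
    tagged = [tag(t) for t in sentence.split()]
    tags = [t for t, _ in tagged]
    n_np = 0
    while n_np < len(tags) and tags[n_np] == "NP":
        n_np += 1                      # consume the leading NP NP* run
    if n_np == 0 or tags[n_np:] != ["VBT", "NN"]:
        return None                    # sentence does not match the grammar rule
    x = "".join(base for _, base in tagged[:n_np])   # X: the noun-phrase run
    verb = tagged[n_np][1]
    y = tagged[n_np + 1][1]                           # Y: the final noun
    if verb == "Elect":
        return f"hasFunction({x}, {y})"
    return None


print(analyse("Barack Obama Elected President"))
```

Such hand-written rules are precise but brittle: any sentence outside the lexicon or the single grammar rule is rejected.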

Page 23: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Basic techniques (ii)

Statistical NLP
  Based on counting: finding frequent patterns that make the occurrence of a certain text feature likely
  Use of extensive corpora
  Example:
    “Washington”, when appearing in the same document as “Hollywood”, is likely to represent (Denzel Washington, actor), while “Washington”, when appearing in the same document as “Obama”, is likely to represent (Washington D.C., American capital)
    We can count the frequency of the different meanings of “Washington” in different contexts
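The counting idea can be sketched with a tiny hand-made corpus. All documents and counts below are invented for illustration, and a real system would use a large annotated corpus and a proper probabilistic model rather than raw vote counts.

```python
from collections import Counter, defaultdict

# Invented training data: (context words in the document, meaning of "Washington").
corpus = [
    ({"Hollywood", "film"}, "Denzel Washington"),
    ({"Hollywood", "Oscar"}, "Denzel Washington"),
    ({"Obama", "Congress"}, "Washington D.C."),
    ({"Obama", "White House"}, "Washington D.C."),
    ({"Hollywood", "Obama"}, "Washington D.C."),
]

# counts[word][meaning] = number of documents where both appear together.
counts = defaultdict(Counter)
for context, meaning in corpus:
    for word in context:
        counts[word][meaning] += 1


def most_likely(context):
    """Pick the meaning seen most often with the given context words."""
    votes = Counter()
    for word in context:
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


print(most_likely({"Hollywood"}))
print(most_likely({"Obama"}))
```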

Page 24: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

An approach to named entity disambiguation with Wikipedia

Page 25: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Introduction

Entity: text + type
Instance: a particular person, location (GPE), organization, ...

Page 26: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Strategy I

Page 27: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Strategy II

Approach
  Find entities in the document
  For each entity, identify candidate instances that are compatible with the entity name
  Assign a ranking value to each candidate instance: 0 ≤ r ≤ 1
    Greater ranking values indicate greater likelihood of occurrence

Page 28: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Strategy III

Semantic coherence (in terms of ranking)
  “An instance would have a high ranking value if the instances that typically co-occur with it also have high ranking values”

$r(I_i) = \sum_{j \in C_i} \mathrm{Cooc}(I_i, I_j)\, r(I_j)$

In matrix form: $R = A R$
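One common way to find a vector R satisfying an equation of the form R = A·R is power iteration. The sketch below assumes that approach (the slides do not say how the equation is solved), renormalizes R at every step so the values stay between 0 and 1, and uses an invented co-occurrence matrix.

```python
def iterate_ranking(A, steps=50):
    """Repeatedly apply r <- A r, rescaling so that the entries sum to 1."""
    n = len(A)
    r = [1.0 / n] * n                     # start from uniform ranking values
    for _ in range(steps):
        new = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        total = sum(new)
        r = [v / total for v in new]      # normalization is an assumption of this sketch
    return r


# Invented co-occurrence counts between three candidate instances:
# 0 = Washington D.C., 1 = Barack Obama, 2 = Denzel Washington.
cooc = [
    [0, 3, 1],
    [3, 0, 1],
    [1, 1, 0],
]

ranking = iterate_ranking(cooc)
for name, value in zip(["Washington D.C.", "Barack Obama", "Denzel Washington"], ranking):
    print(f"{name}: {value:.3f}")
```

Instances 0 and 1 co-occur strongly with each other and so reinforce each other's ranking, which is the coherence effect the quoted sentence describes.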

Page 29: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Strategy IV

We can add a vector E that accounts for other context information
Equation similar to Google PageRank

$R = \alpha A R + (1 - \alpha) E$

(the weighting symbol is not legible in the source; $\alpha$ is assumed here)
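A PageRank-style update with a context vector E can be sketched as follows. The exact form of the slide's equation is not fully recoverable, so this assumes R = α·A·R + (1 − α)·E; the value α = 0.85 (the damping factor usually quoted for PageRank), the column normalization of A and all data are further assumptions.

```python
def rank_with_context(A, E, alpha=0.85, steps=100):
    """Iterate r <- alpha * A_norm r + (1 - alpha) * E."""
    n = len(A)
    # Column sums used to normalize A; empty columns are left as they are.
    col = [sum(A[i][j] for i in range(n)) or 1.0 for j in range(n)]
    r = E[:]                               # start from the context vector
    for _ in range(steps):
        r = [
            alpha * sum(A[i][j] / col[j] * r[j] for j in range(n))
            + (1 - alpha) * E[i]
            for i in range(n)
        ]
    return r


# Invented data: co-occurrence counts and context weights (summing to 1).
A = [
    [0, 3, 1],
    [3, 0, 1],
    [1, 1, 0],
]
E = [0.2, 0.2, 0.6]

r = rank_with_context(A, E)
print([round(v, 3) for v in r])
```

With `alpha=0.0` the result is just E, and larger values of alpha give the co-occurrence structure more influence.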

Page 30: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Instance finder & filter

Alternative instance names extracted by processing a Wikipedia dump
  Page titles, redirects, disambiguation pages, anchors
  Indexed by Lucene
Candidate instances are obtained by querying Lucene
Candidate instances are weighted by combining Lucene scores and PageRank values
Filtering limits the maximum number of candidates
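The weighting and filtering steps might look roughly like the sketch below. It assumes the combination is a weighted sum with the finder parameters αL = 0.8 and αPR = 0.2 listed for the runs in the results slides; the slides do not state the formula. The candidate names and scores are invented, no real Lucene call is made, and `max_candidates=3` is an arbitrary choice.

```python
def weight_and_filter(candidates, alpha_l=0.8, alpha_pr=0.2, max_candidates=3):
    """candidates: list of (instance, lucene_score, pagerank), scores assumed in [0, 1].

    Returns the best max_candidates as (instance, weight), highest weight first.
    """
    weighted = [
        (instance, alpha_l * lucene + alpha_pr * pagerank)  # assumed linear combination
        for instance, lucene, pagerank in candidates
    ]
    weighted.sort(key=lambda item: item[1], reverse=True)
    return weighted[:max_candidates]                        # filtering step


# Invented candidates for the entity name "Washington".
found = [
    ("Washington, D.C.", 0.90, 0.8),
    ("Denzel Washington", 0.70, 0.3),
    ("George Washington", 0.80, 0.9),
    ("Washington (state)", 0.85, 0.6),
]

kept = weight_and_filter(found)
for instance, weight in kept:
    print(f"{instance}: {weight:.2f}")
```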

Page 31: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Instance ranker

$R = k_L A_L R + k_C A_C R + k_E E$

E: candidate instance weights passed by the instance filter
$A_C$: based on instance co-occurrence in Wikipedia pages
$A_L$: based on direct links
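An equation of the form R = kL·AL·R + kC·AC·R + kE·E can be solved by fixed-point iteration, sketched below. The matrices and E are invented, and the sketch assumes AL and AC are already normalized so that no column sums to more than 1, which makes the iteration converge when kL + kC < 1. The default weights are the WebTLab1 ranker parameters from the results slides.

```python
def instance_ranker(A_L, A_C, E, k_l=0.55, k_c=0.25, k_e=0.2, steps=200):
    """Iterate r <- k_l * A_L r + k_c * A_C r + k_e * E until (near) convergence."""
    n = len(E)
    r = E[:]
    for _ in range(steps):
        r = [
            k_l * sum(A_L[i][j] * r[j] for j in range(n))
            + k_c * sum(A_C[i][j] * r[j] for j in range(n))
            + k_e * E[i]
            for i in range(n)
        ]
    return r


# Invented example with three candidate instances.
A_L = [            # direct links: instances 0 and 1 link to each other
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
A_C = [            # co-occurrence: every instance co-occurs with the other two
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
E = [0.4, 0.2, 0.4]  # weights passed by the instance filter

scores = instance_ranker(A_L, A_C, E)
print([round(v, 3) for v in scores])
```

In this example instance 1 ends up ranked above instance 2 even though its filter weight is lower, because it is linked to the strongest candidate.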

Page 32: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Results I

Run        I. finder      I. ranker            I. selector
           αL    αPR      kL     kC     kE     σL     σH
WebTLab1   0.8   0.2      0.55   0.25   0.2    1.2    2.0
WebTLab2   0.8   0.2      0.55   0.25   0.2    1.05   1.5
WebTLab3   0.8   0.2      0.4    0.4    0.2    1.2    2.0

Run        2250 queries   1020 non-NIL   1230 NIL
WebTLab1   0.7698         0.6647         0.8569
WebTLab2   0.7636         0.6098         0.8911
WebTLab3   0.7596         0.6049         0.8878

$R = k_L A_L R + k_C A_C R + k_E E$
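As a quick consistency check on the Results I figures, the score over all 2250 queries should equal the query-weighted average of the non-NIL and NIL scores, assuming each score is the fraction of correctly answered queries in its subset (the slide does not name the measure).

```python
N_NON_NIL, N_NIL = 1020, 1230

# (all 2250 queries, non-NIL, NIL) as reported on the slide.
reported = {
    "WebTLab1": (0.7698, 0.6647, 0.8569),
    "WebTLab2": (0.7636, 0.6098, 0.8911),
    "WebTLab3": (0.7596, 0.6049, 0.8878),
}


def recompute_overall(non_nil, nil):
    """Query-weighted average of the two subset scores."""
    return (N_NON_NIL * non_nil + N_NIL * nil) / (N_NON_NIL + N_NIL)


for run, (overall, non_nil, nil) in reported.items():
    combined = recompute_overall(non_nil, nil)
    print(f"{run}: reported {overall:.4f}, recomputed {combined:.4f}")
```

The recomputed values agree with the reported ones to four decimal places, which supports reading the first column as the overall score on the full query set.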

Page 33: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Results II

Run        I. finder      I. ranker            I. selector
           αL    αPR      kL     kC     kE     σL     σH
WebTLab1   0.8   0.2      0.55   0.25   0.2    1.2    2.0
WebTLab2   0.8   0.2      0.55   0.25   0.2    1.05   1.5
WebTLab3   0.8   0.2      0.4    0.4    0.2    1.2    2.0

Run        ORG      GPE      PER
WebTLab1   0.7613   0.6569   0.8908
WebTLab2   0.7707   0.6262   0.8935
WebTLab3   0.7680   0.6195   0.8908

$R = k_L A_L R + k_C A_C R + k_E E$

Page 34: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto

Conclusions

Approach based on instance co-occurrence
Text from Wikipedia restricted to: titles, anchors
Results considered promising
  Should improve for GPE

Page 35: 2011 03 11 (upm) emadrid lsanchez uc3m anotación semántica de texto


Thank You!

Questions?