Populating Ontologies for the Semantic Web

Alexiei Dingli

What’s the problem?

Towards a solution … (1)

Ask intelligent agents to do the job for us!!

But they don’t understand the WWW !!!

Towards a solution … (2)

But there’s another way in which this can be achieved, by supplying the missing semantic information

For the Web to reach its full potential, it must evolve into a Semantic Web, providing a universally accessible platform that allows data to be shared and processed by automated tools as well as by people.

(W3C Web Guru)

Creating the Semantic Web !!

Towards a solution … (3)

Why do many believe this solution will fail?

It requires lots of time and effort

It needs lots of people willing to do it

Not everyone can do it

Our approaches

• Active learning to reduce annotation burden
  • Supervised learning
  • Adaptive IE
  • The Melita methodology
• Automatic annotation of large repositories
  • Largely unsupervised
  • Armadillo

Adaptive IE

What is AIE?

• Performs tasks of traditional IE
• Exploits the power of Machine Learning in order to adapt to
  • complex domains having large amounts of domain dependent data
  • different sub-language features
  • different text genres
• Considers the usability and accessibility of the system important

Amilcare

• Tool for adaptive IE from Web-related texts
• Specifically designed for document annotation
• Based on the (LP)² algorithm
  • Covering algorithm based on Lazy NLP
  • Trains with a limited amount of examples
  • Effective on different text types
    • free texts
    • semi-structured texts
    • structured texts
• Uses Gate and Annie for preprocessing
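
What a covering algorithm does can be illustrated with a small sketch. The Python below is a generic sequential-covering loop over toy data, not the (LP)² algorithm itself: the rule form (a rule is simply the word that precedes a target token), the function name `learn_covering_rules`, the precision threshold and the example tokens are all assumptions made for illustration.

```python
from collections import Counter

def learn_covering_rules(examples, min_precision=0.75):
    """Generic sequential covering (an illustration, not (LP)2).

    examples: list of (previous_word, is_target) pairs, one per token.
    A rule is just a previous word; it fires on every token that follows it.
    """
    fires = Counter(prev for prev, _ in examples)              # how often each rule fires
    correct = Counter(prev for prev, pos in examples if pos)   # how often it is right
    remaining = [ex for ex in examples if ex[1]]               # positives still uncovered
    rules = []
    while remaining:
        hits = Counter(prev for prev, _ in remaining)
        # Best candidate: highest precision, then most uncovered positives
        best = max(hits, key=lambda w: (correct[w] / fires[w], hits[w]))
        if correct[best] / fires[best] < min_precision:
            break  # no sufficiently reliable rule is left
        rules.append(best)
        # Remove the positives the new rule covers and look for the next rule
        remaining = [ex for ex in remaining if ex[0] != best]
    return rules

# Hypothetical toy data: does the token after this word start a speaker name?
examples = (
    [("Speaker:", True)] * 3
    + [("by", True)] * 2
    + [("by", False), ("in", False)]
)
strict_rules = learn_covering_rules(examples)
loose_rules = learn_covering_rules(examples, min_precision=0.6)
```

With the default threshold only the reliable "Speaker:" rule is kept; lowering `min_precision` to 0.6 also admits the noisier "by" rule.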

CMU: detailed results

            (LP)²   BWI     HMM     SRV     Rapier  Whisk
speaker     77.6    67.7    76.6    56.3    53.0    18.3
location    75.0    76.7    78.6    72.3    72.7    66.4
stime       99.0    99.6    98.5    98.5    93.4    92.6
etime       95.5    93.9    62.1    77.9    96.2    86.0
All Slots   86.0    83.9    82.0    77.1    77.3    64.9

1. Best overall accuracy
2. Best result on speaker field
3. No results below 75%

Gate

• General Architecture for Text Engineering
• Provides a software infrastructure for researchers and developers working in NLP
• Contains
  • Tokeniser
  • Gazetteers
  • Sentence Splitter
  • POS Tagger
  • Semantic Tagger (ANNIE)
  • Orthographic Coreference
  • Pronominal Coreference
  • Multi lingual support
  • Protégé
  • WEKA
  • many more exist and can be added
• http://www.gate.ac.uk

Annotation

• Current practice of annotation for knowledge identification and extraction
  • is time consuming
  • needs annotation by experts
  • is complex
• Reduce burden of text annotation for Knowledge Management

Different Annotation Systems

SGML, TeX, Xanadu, CoNote, ComMentor, JotBot, Third Voice, Annotate.net, The Annotation Engine, Visual Text, Alembic, Annotea, CritLink, The Gate Annotation Tool, iMarkup, MnM, S-CREAM, Yawas

Melita

• Tool for assisted automatic annotation
• Uses an Adaptive IE engine to learn how to annotate (no use of rule writing for adapting the system)
• User: annotates document samples
• IE System:
  • Trains while users annotate
  • Generalizes over seen cases
  • Provides preliminary annotation for new documents
  • Performs smart ordering of documents
• Advantages
  • Annotates trivial or previously seen cases
  • Focuses slow/expensive user activity on unseen cases
  • User mainly validates extracted information
    • Simpler & less error prone / Speeds up corpus annotation
  • The system learns how to improve its capabilities

Methodology: Melita Bootstrap Phase

[Diagram: Bare Text; User Annotates; Amilcare Learns in background]

Methodology: Melita Checking Phase

[Diagram: Bare Text; Amilcare Annotates; User Annotates; Learning in background from missing tags, mistakes]

Methodology: Melita Support Phase

[Diagram: Bare Text; Amilcare Annotates; User Corrects; Corrections used to retrain]
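
The bootstrap, checking and support phases above can be read as one loop in which the learner proposes annotations, the user supplies or corrects them, and the validated result is fed back as training data. The sketch below is a toy rendering of that loop under strong assumptions: `ToyLearner` merely memorises token/tag pairs and stands in for Amilcare, and `annotate_corpus`, `GOLD` and the example documents are hypothetical.

```python
class ToyLearner:
    """Stand-in for the IE engine: memorises token -> tag pairs it has seen."""
    def __init__(self):
        self.known = {}

    def train(self, annotations):
        self.known.update(annotations)

    def annotate(self, tokens):
        return {t: self.known[t] for t in tokens if t in self.known}


def annotate_corpus(docs, user_annotate):
    """docs: list of token lists; user_annotate(tokens) returns the true {token: tag}.

    Returns, per document, how many annotations the user had to add or fix.
    """
    learner = ToyLearner()
    corrections_per_doc = []
    for tokens in docs:
        suggested = learner.annotate(tokens)   # empty during the bootstrap phase
        truth = user_annotate(tokens)          # user annotates, checks or corrects
        fixes = sum(1 for t, tag in truth.items() if suggested.get(t) != tag)
        fixes += sum(1 for t in suggested if t not in truth)
        corrections_per_doc.append(fixes)
        learner.train(truth)                   # retrain on the validated result
    return corrections_per_doc


# Hypothetical seminar-announcement tokens and their true tags
GOLD = {"Smith": "speaker", "Jones": "speaker", "3pm": "stime"}
docs = [
    ["Smith", "talks", "at", "3pm"],
    ["Smith", "again", "at", "3pm"],
    ["Jones", "at", "3pm"],
]
effort = annotate_corpus(docs, lambda tokens: {t: GOLD[t] for t in tokens if t in GOLD})
```

In this toy run the user's effort drops once the learner has seen a case before, which is the behaviour the three phases are meant to produce.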

Intrusivity

• An evolving system is difficult to control
• Goal:
  • Avoiding unwelcome/unreliable suggestions
  • Adapting proactivity to user’s needs
• Method:
  • Allow users to tune proactivity
  • Monitor user reactions to suggestions
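
One possible way to picture tunable proactivity is a confidence threshold that the user can set and that also drifts with the user's reactions. The slides do not say how Melita implements this, so the class below, its name `SuggestionFilter`, the confidence scores and the step size are assumptions for illustration only.

```python
class SuggestionFilter:
    """Shows only suggestions whose confidence reaches a tunable threshold."""

    def __init__(self, threshold=0.5, step=0.1):
        self.threshold = threshold   # user-tunable: higher means fewer suggestions
        self.step = step

    def visible(self, suggestions):
        """suggestions: list of (text, confidence) pairs."""
        return [s for s in suggestions if s[1] >= self.threshold]

    def record_reaction(self, accepted):
        """Lower the bar after an accepted suggestion, raise it after a rejection."""
        delta = -self.step if accepted else self.step
        self.threshold = min(1.0, max(0.0, self.threshold + delta))


# Hypothetical suggestions with made-up confidence scores
suggestions = [("Smith", 0.9), ("3pm", 0.55), ("Room", 0.35)]
flt = SuggestionFilter()
shown_initially = flt.visible(suggestions)
flt.record_reaction(accepted=False)        # a rejection makes the system more cautious
shown_after_rejection = flt.visible(suggestions)
for _ in range(4):
    flt.record_reaction(accepted=True)     # repeated acceptance makes it more proactive
shown_after_acceptance = flt.visible(suggestions)
```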

Smart ordering of Documents

[Diagram: Bare Text; Tries to annotate all the documents and selects the document with partial annotations; User Annotates; Learns annotations]
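
The selection step, picking a document with partial annotations, might look like the sketch below. The exact criterion used by Melita is not given on the slide, so the priority order (partial first, then untouched, then fully annotated), the function name `pick_next_document` and the example data are assumptions.

```python
def pick_next_document(docs, annotate, expected_tags):
    """Return the index of the document the user should annotate next.

    annotate(doc) returns the set of tag types the current learner finds in doc.
    """
    def priority(index):
        found = annotate(docs[index]) & expected_tags
        if 0 < len(found) < len(expected_tags):
            return 0   # partial: the learner is close but still needs help
        if not found:
            return 1   # nothing recognised yet
        return 2       # everything found: little left to learn here
    return min(range(len(docs)), key=priority)


# Hypothetical learner output for three unseen documents
expected = {"speaker", "stime", "etime", "location"}
found_by_learner = {
    "doc0": {"speaker", "stime", "etime", "location"},
    "doc1": set(),
    "doc2": {"speaker", "stime"},
}
docs = list(found_by_learner)
next_index = pick_next_document(docs, found_by_learner.get, expected)
```

Here `doc2` is chosen because the learner recognises some of its tags but not all.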

Methodology: Melita

[Screenshot labels: Ontology defining concepts; Control Panel; Document Panel]

Results

Tag        Amount of texts needed for training   Prec   Rec
stime      20                                    84     63
etime      20                                    96     72
location   30                                    82     61
speaker    100                                   75     70

[Chart: Location; x-axis: training examples (0 to 150); y-axis: 0 to 100; series: Original Order, Selected Order; marked values: 30, 60]

Future Work

• Research better ways of annotating concepts in documents
• Optimise document ordering to maximise the discovery of new tags
• Allow users to edit the rules
• Learn to discover relationships !!
• Not only suggest but also correct user annotations !!

Annotation for the Semantic Web

• Semantic Web requires document annotation
• Current approaches
  • Manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita)
• BUT: Manual/semi-automatic annotation of
  • Large diverse repositories
  • Containing different and sparse information
  is unfeasible
• E.g. a Web site (So: 1,600 pages)

Redundancy

• Information on the Web (or large repositories) is redundant
• Information repeated in different superficial formats
  • Databases/ontologies
  • Structured pages (e.g. produced by databases)
  • Largely structured pages (bibliography pages)
  • Unstructured pages (free texts)

Our Proposal

• Largely unsupervised annotation of documents
  • Based on Adaptive Information Extraction
  • Bootstrapped using redundancy of information
• Method
  • Use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract)
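
The bootstrapping idea can be sketched as follows: strings harvested from an easy, structured source are located in less structured pages, and those occurrences then serve as training annotations without any user effort. This is a minimal sketch assuming exact string matching; the function name `seed_annotations` and the example pages are hypothetical, and the two names are simply taken from the author list quoted later in these slides.

```python
import re

def seed_annotations(seeds, pages):
    """Mark every occurrence of a known seed string in less structured pages.

    seeds: non-empty collection of strings taken from a structured source.
    pages: {page_id: text}. Returns {page_id: [(start, end, seed), ...]}.
    """
    # Longest seeds first so that a longer name wins over a shorter prefix
    ordered = sorted(seeds, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return {
        page_id: [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
        for page_id, text in pages.items()
    }


seeds = {"Fabio Ciravegna", "Yorick Wilks"}
pages = {
    "home": "Talk by Yorick Wilks and Fabio Ciravegna.",
    "news": "No names here.",
}
annotations = seed_annotations(seeds, pages)
```

The resulting spans play the role that manual annotations play in the supervised setting: a learner can be trained on them to find further, previously unknown instances.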

Example: Extracting Bibliographies

• Mines web-sites to extract biblios from personal pages
• Tasks:
  • Finding people’s names
  • Finding home pages
  • Finding personal biblio pages
  • Extract biblio references
• Sources
  • NE Recognition (Gate’s Annie)
  • Citeseer/Unitrier (largely incomplete biblios)
  • Google
  • Homepagesearch

AKT Reference Ontology

• Developed by the AKT partners
• Represents the knowledge used in the CS AKTive Portal testbed
• Consists of several sub-ontologies
• Available in several flavours …
  • DAML+OIL
  • OWL
• Has 9,000,000 RDF triples !!
• Available at
  • Ontology: http://www.aktors.org/publications/ontology/
  • RDF Triples: http://triplestore.aktors.org/

Mining Web sites (1)

• Mines the site looking for people’s names
• Uses
  • Generic patterns (NER)
  • Citeseer for likely bigrams
• Looks for structured lists of names
• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
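
The last three steps (annotate known names, learn the HTML structure, recover all names and hyperlinks) can be approximated with a very simple notion of structure: the stack of enclosing tags. The sketch below uses Python's standard `html.parser`; it is not Armadillo's learner, and the names `PathCollector` and `recover_names` as well as the sample page and people are made up for the example.

```python
from html.parser import HTMLParser

class PathCollector(HTMLParser):
    """Records each text node together with the stack of enclosing tags."""
    def __init__(self):
        super().__init__()
        self.stack, self.href, self.nodes = [], None, []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags)
            while self.stack.pop() != tag:
                pass
        if tag == "a":
            self.href = None

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text, self.href))


def recover_names(html, known_names):
    collector = PathCollector()
    collector.feed(html)
    # "Training": the tag paths under which already known names appear
    name_paths = {path for path, text, _ in collector.nodes if text in known_names}
    # "Recovery": every text node, with its link, found under the same paths
    return [(text, href) for path, text, href in collector.nodes if path in name_paths]


# Hypothetical staff page; only the first name is known in advance
page = """
<h1>Staff</h1>
<ul>
  <li><a href="/~alice">Alice Example</a></li>
  <li><a href="/~bob">Bob Sample</a></li>
  <li><a href="/~carol">Carol Placeholder</a></li>
</ul>
<p>Contact the office for details.</p>
"""
people = recover_names(page, {"Alice Example"})
```

One known name is enough here to identify the list structure and recover the two unknown names with their links, while the heading and the paragraph are ignored.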

Experimental Results (1) People

• Discovering who works in the department using Information Integration
• Total present in site: 129
• Using generic patterns + online repositories
  • 48 correct, 3 wrong
  • Precision 48 / 51 = 94 %
  • Recall 48 / 129 = 37 %
  • F-measure 51 %
• Errors
  • A. Schriffin
  • Eugenio Moggi
  • Peter Gray

Experimental Results (2) People

• Using Information Extraction
• Total present in site: 129
  • 96 correct, 9 wrong
  • Precision 96 / 105 = 91 %
  • Recall 96 / 129 = 74 %
  • F-measure 87 %
• Errors
  • Speech and Hearing
  • European Network
  • Department Of
  • Position Paper
  • The Network
  • To System

Mining Web sites (2)

• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information

Experimental Results (1) Papers

• Discovering publications in the department using Information Integration
• Total present in site: 320
• Using generic patterns + online repositories
  • 151 correct, 1 wrong
  • Precision 151 / 152 = 99 %
  • Recall 151 / 320 = 47 %
  • F-measure 64 %
• Errors - Garbage in database!!

  @misc{ computer-mining,
    author = "Department Of Computer",
    title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks",
    url = "citeseer.nj.nec.com/582939.html" }

Experimental Results (2) Papers

• Using Information Extraction
• Total present in site: 320
  • 214 correct, 3 wrong
  • Precision 214 / 217 = 99 %
  • Recall 214 / 320 = 67 %
  • F-measure 80 %
• Errors
  • Wrong boundaries in detection of paper names!
  • Names of workshops mistaken as paper names!
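
The figures on the four results slides can be recomputed from the reported counts. The sketch below assumes the usual definitions (precision = correct / extracted, recall = correct / total present) and the balanced F-measure, i.e. the harmonic mean of the two; the slides do not state which F-measure variant was used, and the function name and run labels are illustrative.

```python
def precision_recall_f1(correct, wrong, total_present):
    """Standard precision and recall, plus their harmonic mean (balanced F1)."""
    precision = correct / (correct + wrong)
    recall = correct / total_present
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (correct, wrong, total present in site), as reported on the slides
runs = {
    "People, information integration": (48, 3, 129),
    "People, information extraction": (96, 9, 129),
    "Papers, information integration": (151, 1, 320),
    "Papers, information extraction": (214, 3, 320),
}
scores = {name: precision_recall_f1(*counts) for name, counts in runs.items()}
for name, (p, r, f) in scores.items():
    print(f"{name}: P={p:.0%} R={r:.0%} F1={f:.0%}")
```

Under these assumptions the precision and recall values agree with the slides, and so do the F-measures for the two Papers runs (64 % and 80 %). For the two People runs the balanced F1 comes out at about 53 % and 82 %, whereas the slides report 51 % and 87 %, so those two figures may rest on a different weighting or on a slip in the original.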

User Role

• Providing …
  • A URL
  • List of services
    • Already wrapped (e.g. Google is in default library)
    • Train wrappers using examples
  • Examples of fillers (e.g. project names)
• In case …
  • Correcting intermediate results
  • Reactivating Armadillo when paused

Armadillo

• Library of known services (e.g. Google, Citeseer)
• Tools for training learners for other structured sources
• Tools for bootstrapping learning
  • From un/structured sources
  • No user annotation
  • Multi-strategy acquisition of information using redundancy
• User-driven revision of results
  • With re-learning after user correction

Rationale

• Armadillo learns how to extract information
  • From large repositories
  • By integrating information from diverse and distributed resources
• Use:
  • Ontology population
  • Information highlighting
  • Document enrichment
  • Enhancing user experience
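
For the ontology population use case, extracted facts ultimately have to be expressed as RDF triples. The sketch below writes a few facts as N-Triples lines using only the standard library. The base IRI `http://example.org/`, the identifiers `person1` and `paper1`, and the predicate names `hasName`, `hasTitle` and `hasAuthor` are placeholders, not terms of the AKT Reference Ontology.

```python
def to_ntriples(facts, entities, base="http://example.org/"):
    """Serialise (subject, predicate, object) facts as N-Triples lines.

    Objects listed in `entities` are written as IRIs, all others as string literals.
    """
    def iri(local):
        return f"<{base}{local}>"

    def literal(text):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = []
    for subject, predicate, obj in facts:
        obj_term = iri(obj) if obj in entities else literal(obj)
        lines.append(f"{iri(subject)} {iri(predicate)} {obj_term} .")
    return lines


# Facts of the kind the bibliography example extracts (identifiers are hypothetical)
facts = [
    ("person1", "hasName", "Alexiei Dingli"),
    ("paper1", "hasTitle", "Mining Web Sites Using Adaptive Information Extraction"),
    ("paper1", "hasAuthor", "person1"),
]
triples = to_ntriples(facts, entities={"person1", "paper1"})
print("\n".join(triples))
```

A real deployment would use the ontology's own class and property IRIs and an RDF library rather than hand-built strings.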

Data Navigation (1)

Data Navigation (2)

Data Navigation (3)

What’s so new about Armadillo?

• In other systems …
  • User defined examples are used
  • Generic patterns are used that work independently of the site
• In our system …
  • We also make use of generic patterns & some user defined examples
  • We learn page specific patterns
  • And we integrate information from different sources

IE for SW: The Vision

• Automatic annotation services
  • For a specific ontology
  • Constantly re-indexing/re-annotating documents
  • Semantic search engine
• Effects:
  • No annotation in the document
    • As today’s indexes are not stored in the documents
  • No legacy with the past
    • Annotation with the latest version of the ontology
    • Multiple annotations for a single document
  • Simplifies maintenance
    • Page changed but not re-annotated

Links

• Melita: http://nlp.shef.ac.uk/melita/
• Armadillo: http://nlp.shef.ac.uk/armadillo/
• Amilcare: http://nlp.shef.ac.uk/amilcare/
• Gate: http://www.gate.ac.uk
• AKT Reference Ontology: http://www.aktors.org/publications/ontology/
• AKT 3Store: http://triplestore.aktors.org/
• More than 40 semantic web technologies: http://www.aktors.org/technologies/
  • Most of them can be freely downloaded
  • Range from IE tools, semantic portals, annotation tools, semantic web services, dialogue systems, etc.