1 Dr Alexiei Dingli Introduction to Web Science Harvesting the SW.
2
• Acquire
• Model
• Reuse
• Retrieve
• Publish
• Maintain
Six challenges of the Knowledge Life Cycle
4
A couple of approaches …
• Active learning to reduce annotation burden
  – Supervised learning
  – Adaptive IE
  – The Melita methodology
• Automatic annotation of large repositories
  – Largely unsupervised
  – Armadillo
5
• Created by Carnegie Mellon School of Computer Science
• How to retrieve
  – Speaker
  – Location
  – Start Time
  – End Time
• From seminar announcements received by email
The Seminar Announcements Task
6
Dr. Steals presents in Dean Hall at one am.
becomes
<speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
Seminar Announcements Example
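As a minimal sketch of how inline annotations like the ones above could be read back into slot/filler pairs, the snippet below uses a regular expression. The function names `extract_slots` and `strip_tags`, and the assumption that tags are never nested, are illustrative only and are not part of the original task definition.

```python
import re

# Matches <tag>filler</tag>, where the closing tag repeats the opening tag name.
SLOT_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>")


def extract_slots(annotated: str) -> dict:
    """Collect the fillers of each annotated slot (assumes non-nested tags)."""
    slots = {}
    for tag, filler in SLOT_PATTERN.findall(annotated):
        slots.setdefault(tag, []).append(filler)
    return slots


def strip_tags(annotated: str) -> str:
    """Recover the bare text by replacing each tagged span with its filler."""
    return SLOT_PATTERN.sub(r"\2", annotated)


example = ("<speaker>Dr. Steals</speaker> presents in "
           "<location>Dean Hall</location> at <stime>one am</stime>.")
slots = extract_slots(example)
bare = strip_tags(example)
```

Here `slots` maps each of `speaker`, `location` and `stime` to a one-element list, and `bare` is the original unannotated sentence.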
7
• Precision: how many of the retrieved documents are relevant?
• Recall: how many of all the relevant documents were retrieved?
• F-measure: weighted harmonic mean of precision and recall
Information Extraction Measures
8
• If I ask the librarian to search for books on cars, there are 10 relevant books in the library and out of the 8 he found, only 4 seem to be relevant books. What is his precision, recall and f-measure?
IE Measures Examples
9
• If I ask the librarian to search for books on cars, there are 10 relevant books in the library and out of the 8 he found, only 4 seem to be relevant books. What is his precision, recall and f-measure?
• Precision = 4/8 = 50%
• Recall = 4/10 = 40%
• F = (2*50*40)/(50+40) = 44.4%
IE Measures Answers
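The librarian arithmetic can be checked with a few lines of Python. This is a small sketch, not a reference implementation: the function name `precision_recall_f` and the optional `beta` weight (the usual F-beta generalisation, with `beta = 1` giving the balanced harmonic mean used above) are assumptions added for illustration.

```python
def precision_recall_f(relevant_retrieved: int, retrieved: int, relevant: int,
                       beta: float = 1.0):
    """Return (precision, recall, F-beta) as fractions between 0 and 1."""
    # Guard against empty result sets and empty gold standards.
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Librarian example: 10 relevant books exist, 8 were found, 4 of those are relevant.
p, r, f = precision_recall_f(relevant_retrieved=4, retrieved=8, relevant=10)
print(f"precision={p:.1%} recall={r:.1%} f={f:.1%}")
```

The same function can be used to re-check the precision, recall and F-measure figures reported on the later experimental-results slides.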
10
• What is IE?
  – Automated ways of extracting unstructured or partially structured information from machine-readable files
• What is AIE?
  – Performs the tasks of traditional IE
  – Exploits the power of Machine Learning in order to adapt to
    • complex domains having large amounts of domain-dependent data
    • different sub-language features
    • different text genres
  – Treats the usability and accessibility of the system as important
Adaptive IE
11
Amilcare
• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Based on the (LP)² algorithm*
    *Linguistic Patterns by Learning Patterns
• Covering algorithm based on Lazy NLP
• Trains with a limited amount of examples
• Effective on different text types
  – free texts
  – semi-structured texts
  – structured texts
• Uses Gate and Annie for preprocessing
12
CMU: detailed results
            (LP)²   BWI    HMM    SRV    Rapier  Whisk
speaker     77.6    67.7   76.6   56.3   53.0    18.3
location    75.0    76.7   78.6   72.3   72.7    66.4
stime       99.0    99.6   98.5   98.5   93.4    92.6
etime       95.5    93.9   62.1   77.9   96.2    86.0
All Slots   86.0    83.9   82.0   77.1   77.3    64.9
(LP)²:
1. Best overall accuracy
2. Best result on speaker field
3. No results below 75%
13
Gate
• General Architecture for Text Engineering
  – Provides a software infrastructure for researchers and developers working in NLP
• Contains
  – Tokeniser
  – Gazetteers
  – Sentence Splitter
  – POS Tagger
  – Semantic Tagger (ANNIE)
  – Co-reference Resolution
  – Multilingual support
  – Protégé
  – WEKA
  – many more exist and can be added
• http://www.gate.ac.uk
14
Current practice of annotation for knowledge identification and extraction
Annotation
• is time consuming
• needs annotation by experts
• is complex
Reduce burden of text annotation for Knowledge Management
15
Different Annotation Systems
• SGML
• TEX
• Xanadu
• CoNote
• ComMentor
• JotBot
• Third Voice
• Annotate.net
• The Annotation Engine
• Alembic
• The Gate Annotation Tool
• iMarkup, Yawas
• MnM, S-CREAM
16
• Tool for assisted automatic annotation
• Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system)
• User: annotates document samples
• IE System:
  – Trains while users annotate
  – Generalizes over seen cases
  – Provides preliminary annotation for new documents
• Performs smart ordering of documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – User mainly validates extracted information
    • Simpler and less error-prone; speeds up corpus annotation
  – The system learns how to improve its capabilities
Melita
18
Methodology: Melita Checking Phase
[Diagram labels: Bare Text; User Annotates; Amilcare Annotates; Learning in background from missing tags, mistakes]
19
Methodology: Melita Support Phase
[Diagram labels: Bare Text; Amilcare Annotates; User Corrects; Corrections used to retrain]
20
Smart ordering of Documents
[Diagram labels: Bare Text; Tries to annotate all the documents and selects the document with partial annotations; User Annotates; Learns annotations]
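The selection step in smart ordering (try to annotate every document, then hand the user one with partial annotations) might be sketched as follows. This is an illustration under stated assumptions rather than Melita's actual algorithm: `pick_next_document`, the `annotate` callable standing in for the learner, the keyword-based `toy_annotate`, and the rule "prefer the lowest non-zero coverage" are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Set


def pick_next_document(documents: Iterable[str],
                       annotate: Callable[[str], Set[str]],
                       expected_tags: Set[str]) -> Optional[str]:
    """Return a document the learner annotates only partially, if any exists."""
    if not expected_tags:
        return None
    best_doc, best_coverage = None, None
    for doc in documents:
        found = annotate(doc) & expected_tags
        coverage = len(found) / len(expected_tags)
        # Partial means: some tags found, but not all of them.
        # Assumption: among partial documents, prefer the least covered one.
        if 0 < coverage < 1 and (best_coverage is None or coverage < best_coverage):
            best_doc, best_coverage = doc, coverage
    return best_doc


def toy_annotate(doc: str) -> Set[str]:
    """Hypothetical stand-in for the IE engine: one keyword cue per tag."""
    cues = {"speaker": "Dr.", "location": "Hall", "stime": " am"}
    return {tag for tag, cue in cues.items() if cue in doc}


docs = [
    "Dr. Steals presents in Dean Hall at one am.",  # all three tags found
    "Lunch will be served afterwards.",             # no tags found
    "Dr. Grey will talk about wrappers.",           # partial: speaker only
]
next_doc = pick_next_document(docs, toy_annotate, {"speaker", "location", "stime"})
```

With these toy inputs the third document is selected, since it is the only one the stand-in annotator covers partially.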
21
• An evolving system is difficult to control
• Goal:
  – Avoiding unwelcome/unreliable suggestions
  – Adapting proactivity to user’s needs
• Method:
  – Allow users to tune proactivity
  – Monitor user reactions to suggestions
Intrusivity
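One plausible way to combine the two methods (user-tuned proactivity plus monitoring of reactions) is a confidence threshold that the user sets and that drifts with accepted or rejected suggestions. The sketch below is an assumption for illustration only: the class `ProactivityFilter`, its `threshold` and `step` parameters, and the update rule are hypothetical and do not describe how Melita itself implements this.

```python
class ProactivityFilter:
    """Decide which suggestions to show, based on a tunable confidence threshold."""

    def __init__(self, threshold: float = 0.5, step: float = 0.05):
        self.threshold = threshold  # set by the user: higher means less intrusive
        self.step = step            # how strongly each reaction moves the threshold

    def should_show(self, confidence: float) -> bool:
        """Show a suggestion only if the learner is confident enough."""
        return confidence >= self.threshold

    def record_reaction(self, accepted: bool) -> None:
        """Rejections raise the bar; acceptances lower it (clamped to [0, 1])."""
        delta = -self.step if accepted else self.step
        self.threshold = min(1.0, max(0.0, self.threshold + delta))


suggestion_filter = ProactivityFilter(threshold=0.5, step=0.1)
shown_before = suggestion_filter.should_show(0.55)
suggestion_filter.record_reaction(accepted=False)  # user rejects a suggestion
shown_after = suggestion_filter.should_show(0.55)
```

After one rejection the same medium-confidence suggestion is no longer shown, while high-confidence suggestions still pass the filter.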
23
Results
Tag        Amount of texts needed for training   Prec   Rec
stime      20                                    84     63
etime      20                                    96     72
location   30                                    82     61
speaker    100                                   75     70
[Figure: Location; x-axis: training examples (0 to 150); y-axis: 0 to 100; series: Original Order, Selected Order; marked values: 30, 60]
24
• Research better ways of annotating concepts in documents
• Optimise document ordering to maximise the discovery of new tags
• Allow users to edit the rules
• Learn to discover relationships!!
• Not only suggest but also correct user annotations!!
Future Work
25
• Semantic Web requires document annotation
  – Current approaches
    • Manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita)
• BUT:
  – Manual/semi-automatic annotation of
    • Large diverse repositories
    • Containing different and sparse information
    is unfeasible
    • E.g. a Web site (So: 1,600 pages)
Annotation for the Semantic Web
26
• Information on the Web (or in large repositories) is redundant
• Information is repeated in different superficial formats
  – Databases/ontologies
  – Structured pages (e.g. produced by databases)
  – Largely structured pages (bibliography pages)
  – Unstructured pages (free texts)
Redundancy
27
• Largely unsupervised annotation of documents
  – Based on Adaptive Information Extraction
  – Bootstrapped using redundancy of information
• Method
  – Use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract)
The Idea
28
• Mines web sites to extract bibliographies from personal pages
• Tasks:
  – Finding people’s names
  – Finding home pages
  – Finding personal bibliography pages
  – Extracting bibliography references
• Sources
  – NE Recognition (Gate’s Annie)
  – Citeseer/Unitrier (largely incomplete bibliographies)
  – Google
  – Homepagesearch
Example: Extracting Bibliographies
29
Mining Web sites (1)
• Mines the site looking for people’s names
• Uses
  – Generic patterns (NER)
  – Citeseer for likely bigrams
• Looks for structured lists of names
• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
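The idea of annotating known names, learning the HTML structure around them, and then recovering every name in that structure can be illustrated with a toy example. Note the swap: the system described here trains an adaptive IE learner on the annotations, whereas this sketch uses plain exact matching of the enclosing tag path, which is a much cruder stand-in. The names `PathCollector` and `recover_names` and the sample page are hypothetical, and hyperlink recovery is left out.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without a closing tag; they must not stay on the tag stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Record every non-empty text node with the stack of enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []  # list of (tag_path, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def recover_names(html: str, known_names: set) -> list:
    """Learn the tag path of the known names, then return all text on that path."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    seed_paths = Counter(path for path, text in collector.nodes if text in known_names)
    if not seed_paths:
        return []
    learned_path, _ = seed_paths.most_common(1)[0]
    return [text for path, text in collector.nodes if path == learned_path]


page = """
<html><body>
<h1>Staff</h1>
<ul>
  <li><a href="/a">Ann Example</a></li>
  <li><a href="/b">Bob Sample</a></li>
  <li><a href="/c">Cy Placeholder</a></li>
</ul>
<p>Contact the office for details.</p>
</body></html>
"""
names = recover_names(page, {"Ann Example"})
```

Starting from a single known name, the sketch recovers all three list entries, because they share the same `html > body > ul > li > a` path, while the heading and the paragraph are ignored.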
30
Experimental Results II - Sheffield
• People
  – discovering who works in the department
  – using Information Integration
• Total present in site: 139
• Using generic patterns + online repositories
  – 35 correct, 5 wrong
  – Precision 35 / 40 = 87.5 %
  – Recall 35 / 139 = 25.2 %
  – F-measure 39.1 %
• Errors
  – A. Schriffin
  – Eugenio Moggi
  – Peter Gray
31
Experimental Results IE - Sheffield
• People
  – using Information Extraction
• Total present in site: 139
  – 116 correct, 8 wrong
  – Precision 116 / 124 = 93.5 %
  – Recall 116 / 139 = 83.5 %
  – F-measure 88.2 %
• Errors
  – Speech and Hearing
  – European Network
  – Department Of
• Enhancements
  – Lists, Postprocessor
  – Position Paper
  – The Network
  – To System
32
Experimental Results - Edinburgh
• People
  – using Information Integration
• Total present in site: 216
• Using generic patterns + online repositories
  – 11 correct, 2 wrong
  – Precision 11 / 13 = 84.6 %
  – Recall 11 / 216 = 5.1 %
  – F-measure 9.6 %
• Using Information Extraction
  – 153 correct, 10 wrong
  – Precision 153 / 163 = 93.9 %
  – Recall 153 / 216 = 70.8 %
  – F-measure 80.7 %
33
Experimental Results - Aberdeen
• People
  – using Information Integration
• Total present in site: 70
• Using generic patterns + online repositories
  – 21 correct, 1 wrong
  – Precision 21 / 22 = 95.5 %
  – Recall 21 / 70 = 30.0 %
  – F-measure 45.7 %
• Using Information Extraction
  – 63 correct, 2 wrong
  – Precision 63 / 65 = 96.9 %
  – Recall 63 / 70 = 90.0 %
  – F-measure 93.3 %
34
Mining Web sites (2)
• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information
35
Experimental Results (1)
• Papers
  – discovering publications in the department
  – using Information Integration
• Total present in site: 320
• Using generic patterns + online repositories
  – 151 correct, 1 wrong
  – Precision 151 / 152 = 99 %
  – Recall 151 / 320 = 47 %
  – F-measure 64 %
• Errors - Garbage in database!!
@misc{ computer-mining,
author = "Department Of Computer",
title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks",
url = "citeseer.nj.nec.com/582939.html" }
36
Experimental Results (2)
• Papers
  – using Information Extraction
• Total present in site: 320
  – 214 correct, 3 wrong
  – Precision 214 / 217 = 99 %
  – Recall 214 / 320 = 67 %
  – F-measure 80 %
• Errors
  – Wrong boundaries in detection of paper names!
  – Names of workshops mistaken as paper names!
37
• Task
– Given the name of an artist, find all the paintings of that artist.
– Created for the ArtEquAKT project
Artists domain
38
Artists domain Evaluation
Artist       Method   Precision   Recall   F-Measure
Caravaggio   II       100.0%      61%      75.8%
             IE       100.0%      98.8%    99.4%
Cezanne      II       100.0%      27.1%    42.7%
             IE       91.0%       42.6%    58.0%
Manet        II       100.0%      29.7%    45.8%
             IE       100.0%      40.6%    57.8%
Monet        II       100.0%      14.6%    25.5%
             IE       86.3%       48.5%    62.1%
Raphael      II       100.0%      59.9%    74.9%
             IE       96.5%       86.4%    91.2%
Renoir       II       94.7%       40.0%    56.2%
             IE       96.4%       60.0%    74.0%
39
• Providing …
  – A URL
  – List of services
    • Already wrapped (e.g. Google is in default library)
    • Train wrappers using examples
  – Examples of fillers (e.g. project names)
• In case …
  – Correcting intermediate results
  – Reactivating Armadillo when paused
User Role
40
• Library of known services (e.g. Google, Citeseer)
• Tools for training learners for other structured sources
• Tools for bootstrapping learning
  – From un/structured sources
  – No user annotation
  – Multi-strategy acquisition of information using redundancy
• User-driven revision of results
  – With re-learning after user correction
Armadillo
41
• Armadillo learns how to extract information
  – From large repositories
  – By integrating information from diverse and distributed resources
• Use:
  – Ontology population
  – Information highlighting
  – Document enrichment
  – Enhancing user experience
Rationale
45
• Automatic annotation services
  – For a specific ontology
  – Constantly re-indexing/re-annotating documents
  – Semantic search engine
• Effects:
  – No annotation in the document
    • As today’s indexes are not stored in the documents
  – No legacy with the past
    • Annotation with the latest version of the ontology
    • Multiple annotations for a single document
  – Simplifies maintenance
    • Page changed but not re-annotated
IE for SW: The Vision