Human Language Technologies for the Semantic Web
Department of Computer Science, University of Sheffield
Fabio Ciravegna and Yorick Wilks
F. Ciravegna- AKT Town Meeting April 2003
Language Technologies
• Goal
  – Building systems able to process natural language in its written or spoken form
• Methodology
  – Use of language analysis
• Technologies (examples):
  – Information Extraction from Text
  – Question Answering
  – Text Generation
HLT for Knowledge Management
• Use of HLT for knowledge
  – Acquisition
  – Retrieval
  – Publication
• Main benefits
  – Cost reduction
  – Reduced time needed for KM
  – Improving knowledge accessibility
    • Accessing/Diffusing/Understanding
HLT in AKT for KM
acquisition retrieval publishing
Text mining
Information Extraction from Text
Text Generation
HLT for Semantic Web
• Use of HLT for:
  – Document annotation
  – Information integration from different sources
• Benefit
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
Information Extraction
• Textual documents are pervasive (e.g. Web)
  – Contained knowledge cannot be queried, therefore cannot be
    • Used by automatic systems
    • Easily managed by humans
• IE can identify information in documents
  – e.g. to populate a database
  – e.g. to annotate documents
• Method: natural language analysis (Words → Information → Knowledge)
IE tasks
• Tasks: Named Entities; Template Elements; Template Relations; Scenario Template
• Example text: WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that Paul Jacobs, former Vice-President of E-Commerce at SRA International, has joined the company's executive management team as president.
• Named entities: nQuest Inc.; Paul Jacobs; SRA International
• Filled templates:
  – Company: nQuest Inc.; Date: today; InPerson: Paul Jacobs; InRole: president
  – Company: SRA International; OutPerson: Paul Jacobs; OutRole: Vice-President of E-Commerce
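The filled templates in the example can be pictured as records that are ready to go into a database. The following is only a minimal sketch: the class `SuccessionEvent`, its field names and the helper `roles_of` are hypothetical names chosen for illustration, not part of any Sheffield tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SuccessionEvent:
    """One filled template for a management change (hypothetical structure)."""
    company: str
    person: str
    role: str
    direction: str              # "in" = joining the company, "out" = leaving it
    date: Optional[str] = None  # kept as the surface string found in the text


# Named entities recognised in the example sentence.
named_entities = ["nQuest Inc.", "Paul Jacobs", "SRA International"]

# The two templates from the slide, written as records.
events = [
    SuccessionEvent("nQuest Inc.", "Paul Jacobs", "president", "in", date="today"),
    SuccessionEvent("SRA International", "Paul Jacobs",
                    "Vice-President of E-Commerce", "out"),
]


def roles_of(person: str, events: List[SuccessionEvent]) -> List[Tuple[str, str, str]]:
    """Query the extracted records: every (direction, company, role) for a person."""
    return [(e.direction, e.company, e.role) for e in events if e.person == person]


print(roles_of("Paul Jacobs", events))
```

Once the text has been turned into records like these, the question "which roles has Paul Jacobs held, and where?" becomes a simple query, which is not possible against the bare sentence.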
IE Tools @ Sheffield
• GATE:
  – General Architecture for Language Engineering
  – Used to integrate HLT modules
• ANNIE:
  – Rule-based Named Entity Recogniser
  – Download at www.gate.ac.uk
• Amilcare:
  – Adaptive IE system
  – Portable using examples
  – www.nlp.shef.ac.uk/amilcare
IE Tools @ Sheffield (2)
• Melita:
  – Annotation tool supported by adaptive IE (Amilcare)
  – Learns how to annotate
  – www.aktors.org/technologies/melita/
• Lasie:
  – IE system for complex event extraction
  – Manual rule development
  – www.dcs.shef.ac.uk/research/groups/nlp/funded/lasie.html
GATE is…
• An architecture
  – A macro-level organisational picture for LE software systems.
• A framework
  – For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment
  – For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995).
• Comes with…
  – Some free components... ...and wrappers for other people's components
  – Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Some users…
At time of writing, a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KgAa, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU Universities;
• the Perseus Digital Library project, Tufts University, US.
GATE and Content Extraction
• ANNIE: open-source IE system in GATE, providing modules needed for content extraction
  – Pre-processing
  – Named entity recognition
  – Coreference resolution
• ANNIE handles proper names, pronouns, and nominals
• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
• Contact Hamish Cunningham ([email protected])
Amilcare Active annotation for the Semantic Web
• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Trains with a limited number of examples
  – Effective on different text types
    • From free texts to rigid docs (XML, HTML, etc.)
  – Tools for:
    • Normal user
      – Able to annotate a corpus
    • Amilcare expert
      – Able to optimise experiments
    • IE expert
      – Able to edit rules
  – Uses ANNIE for preprocessing up to Named Entity Recognition
[Ciravegna – IJCAI 2001]
Implementation details
• 100% Java
• External interfaces:
  – API for use from other programs
  – GUI for manual training
• Requirements:
  – 10 MB on hard disk
  – Up to 300 MB RAM
• Contact Fabio Ciravegna ([email protected])
Users
• Integrated with SW annotation tools:
  – MnM (Open Univ.)
  – Ontomat (Karlsruhe Univ.)
  – Melita (Sheffield Univ.)
• Users:
  – Merck (D)
  – ISOCO (SP)
  – Quinary (I)
  – Ontoprise (D)
  – University College Dublin (IE)
  – 2 departments of CNRS (F)
  – University of Trier (D)
  – University of Texas (Austin, USA)
Document Annotation
• Many application areas require document annotation (enrichment)
  – Knowledge Management
    • Protocol analysis in industry (Kingston 94)
    • Italian police: 100 annotators, 6 pages a day each
  – Semantic Web (Staab00, Motta02, Ciravegna02)
• Annotation is generally manual
  – Expensive
  – Inefficient
  – Difficult
  – Tedious & tiring
    • Error prone (15-30% inter-annotator disagreement)
  – Never ending
Melita
• Document annotation tool
  – Uses adaptive IE engine to support annotation
• IE system:
  – Trains while users annotate
  – Provides preliminary annotation for new documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – Validating extracted information
    • Simpler & less error prone
    • Speeds up corpus annotation
  – Learns how to improve capabilities
Annotation with IE
[Diagram: the user annotates bare text and the IE system trains on the annotated corpus. The system then annotates bare text itself; its annotation is compared with the user's, and it retrains using errors, missing tags and mistakes.]
Annotation with Suggestions
[Diagram: the IE system annotates bare text, the user corrects the suggestions, and the system uses the corrections to retrain.]
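The suggest, correct, retrain cycle can be sketched in a few lines. This is an illustration under strong assumptions, not how Amilcare or Melita work internally: `MemorisingTagger` is a hypothetical stand-in learner that simply remembers strings the user has tagged, and the user is simulated by a set of correct annotations per document.

```python
class MemorisingTagger:
    """Toy stand-in for an adaptive IE engine: remembers tagged strings."""

    def __init__(self):
        self.known = {}  # surface string -> tag

    def suggest(self, document):
        """Propose a tag for every remembered string that occurs in the text."""
        return {(s, t) for s, t in self.known.items() if s in document}

    def retrain(self, correct, wrong):
        """Learn from the user's corrections: add missed tags, drop wrong ones."""
        for string, _tag in wrong:
            self.known.pop(string, None)
        for string, tag in correct:
            self.known[string] = tag


def annotate_corpus(documents, gold):
    """Run the loop and count how many corrections the user makes per document."""
    tagger = MemorisingTagger()
    corrections = []
    for document, truth in zip(documents, gold):
        suggested = tagger.suggest(document)  # system annotates bare text
        missing = truth - suggested           # tags the user has to add
        wrong = suggested - truth             # suggestions the user removes
        corrections.append(len(missing) + len(wrong))
        tagger.retrain(truth, wrong)          # corrections are used to retrain
    return corrections


docs = [
    "Seminar by Dr. Smith in Room 101",
    "Dr. Smith speaks in Hall A",
    "Talk in Room 101 by Dr. Smith",
]
gold = [
    {("Dr. Smith", "speaker"), ("Room 101", "location")},
    {("Dr. Smith", "speaker"), ("Hall A", "location")},
    {("Dr. Smith", "speaker"), ("Room 101", "location")},
]
print(annotate_corpus(docs, gold))  # prints [2, 1, 0]
```

The falling correction count is the point of the cooperation: previously seen cases are handled by the system, and the user's effort goes to the unseen ones.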
Cooperation: is IE a Useful Support?
CMU Seminars task. Test: 250 texts (Amilcare reports the best IE results ever)
[Figure: four panels (Location, Speaker, Stime, Etime), each plotting Precision, Recall and F-measure (y-axis, 0 to 100) against the number of training examples (x-axis, 0 to 140).]
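For readers unfamiliar with the three measures plotted in these charts, the sketch below computes them from a set of extracted items and a set of correct items. It uses the standard balanced F-measure (harmonic mean of precision and recall); the function name `prf` and the example data are made up for illustration and are not taken from the CMU Seminars experiment.

```python
def prf(predicted, gold):
    """Return (precision, recall, F-measure) for two sets of annotations."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: the system proposes 4 speakers, 3 of them correct,
# and the test documents contain 5 speakers in total.
predicted = {"s1", "s2", "s3", "x1"}
gold = {"s1", "s2", "s3", "s4", "s5"}
p, r, f = prf(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

Precision answers "how much of what the system proposed was right?", recall answers "how much of what was there did the system find?", and the F-measure summarises both in one number.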
Integrating Information
• Information is available over the Web
  – Dispersed
  – In textual format
• IE as basis for retrieval and integration of information
  – Unsupervised learning using
    • The redundancy of the Web
    • Available repositories
      – Collections of documents/data
      – Known services (e.g. databases, digital libraries, search engines)
  – to bootstrap learning and produce simple, high-precision IE applications
Mining Web Sites
• Extracting knowledge from CS Web sites
  – Person: Name; Position; Email/Telephone; Involvement in projects; Publications; Co-workers
• Information distributed
• Challenges
  – Retrieving information
  – Integrating information
  – Largely unsupervised by user
Mining Web sites
People and Project names
HomePageSearch
• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
• Mines the site looking for Project and People names
• Uses
  – Generic patterns
  – ANNIE
  – Citeseer for likely bigrams
Basket: Project/People name lists and hyperlinks
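The "annotate known names, learn the HTML structure, recover all names" step can be illustrated with a deliberately small sketch. It assumes that names appear as link text and that the page structure is captured well enough by the opening tag immediately before each link; the page, the names, the URLs and the functions `learn_parent_tag` and `recover_links` are all invented for this example, and a real wrapper learner would be considerably more robust.

```python
import re
from collections import Counter

# An opening tag, optional whitespace, then a link: (parent tag, URL, link text).
LINK = re.compile(r'(<\w+[^>]*>)\s*<a href="([^"]*)">([^<]+)</a>')


def learn_parent_tag(html, known_names):
    """Find the opening tag that most often precedes links to known names."""
    votes = Counter(
        parent for parent, _url, text in LINK.findall(html)
        if text.strip() in known_names
    )
    return votes.most_common(1)[0][0] if votes else None


def recover_links(html, parent_tag):
    """Return (name, URL) for every link that sits in the learned structure."""
    return [
        (text.strip(), url) for parent, url, text in LINK.findall(html)
        if parent == parent_tag
    ]


page = """
<ul>
<li class="staff"><a href="/people/ada">Ada Example</a></li>
<li class="staff"><a href="/people/bob">Bob Sample</a></li>
<li class="staff"><a href="/people/carol">Carol Test</a></li>
</ul>
<p class="nav"><a href="/contact">Contact us</a></p>
"""

pattern = learn_parent_tag(page, {"Ada Example"})  # one known name as the seed
print(pattern)
print(recover_links(page, pattern))
```

A single known name is enough here to pick out the list structure, recover the two unknown people with their hyperlinks, and ignore the navigation link.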
Mining Web sites
Projects/People Web pages
HomePageSearch
• Extracts personal data
  – Addresses
  – Tel number
  – Email address
  – …
Basket (Project/People): name lists and hyperlinks
Basket (People and Projects): name lists and hyperlinks; personal data
Mining Web sites
People Publications
HomePageSearch
Basket (People and Projects): name lists and hyperlinks; personal data
• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information
Basket (People and Projects): name lists and hyperlinks; personal data; co-authoring information
Paper discovery
Focus on people
User Role
• Providing:
  – A URL
  – A list of services (e.g. Google)
    • Train wrappers using examples
  – Some examples of fillers (e.g. projects)
• If necessary, correcting intermediate results
Rationale
• Large collections (e.g. Web) contain redundant information
  – Redundancy can be used to bootstrap learning
• Mining the Web for information
  – Learned patterns
• Integration of information
  – Multiple evidence
    • Different strategies with different reliability
    • Scruffy works!
  – User corrections of data if necessary
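One simple way to picture "multiple evidence from strategies with different reliability" is a noisy-OR combination: a candidate is rejected only if every strategy that proposed it was wrong. This is an assumption made for illustration, not the combination method described in the talk; the strategy names, reliability values and the function `combine` are likewise hypothetical.

```python
def combine(evidence, reliability):
    """Noisy-OR score per candidate: 1 - product of (1 - reliability)."""
    scores = {}
    for candidate, strategies in evidence.items():
        all_wrong = 1.0
        for strategy in strategies:
            all_wrong *= 1.0 - reliability[strategy]
        scores[candidate] = 1.0 - all_wrong
    return scores


# Assumed reliabilities: rough chance that a proposal from each strategy is right.
reliability = {"generic_pattern": 0.5, "named_entity": 0.7, "html_wrapper": 0.9}

# Which strategies proposed each candidate project name (made-up data).
evidence = {
    "AKT": ["generic_pattern", "named_entity"],
    "Home": ["generic_pattern"],
}

scores = combine(evidence, reliability)
accepted = sorted(name for name, score in scores.items() if score >= 0.8)
print(scores)
print(accepted)  # prints ['AKT']
```

Two weak, independent sources agreeing on "AKT" outweigh a single weak source proposing "Home", which is how redundancy lets individually unreliable strategies produce usable results.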
Conclusion
• In AKT we are using HLT (IE) for:
  – Helping in document annotation
  – Integrating information from different sources
• Benefit:
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
    • Minimum user intervention