Human Language Technologies for the Semantic Web

Department of Computer Science, University of Sheffield

Fabio Ciravegna and Yorick Wilks

F. Ciravegna - AKT Town Meeting, April 2003

Language Technologies

• Goal
  – Building systems able to process natural language in its written or spoken form
• Methodology
  – Use of language analysis
• Technologies (examples):
  – Information Extraction from Text
  – Question Answering
  – Text Generation


HLT for Knowledge Management

• Use of HLT for Knowledge
  – Acquisition
  – Retrieval
  – Publication
• Main benefits
  – Cost reduction
  – Time needed for KM
  – Improving knowledge accessibility
    • Accessing/Diffusing/Understanding


HLT in AKT for KM

acquisition retrieval publishing

Text mining

Information Extraction from Text

Text Generation


HLT for Semantic Web

• Use of HLT for:
  – Document annotation
  – Information integration from different sources
• Benefit
  – Reduce annotation needs
  – Retrieve and integrate dispersed information


Information Extraction

• Textual documents are pervasive (e.g. Web)
  – Contained knowledge cannot be queried, therefore cannot be
    • Used by automatic systems
    • Easily managed by humans
• IE can identify information in documents
  – e.g. to populate a database
  – e.g. to annotate documents
• Method: natural language analysis
  – Words → Information → Knowledge

IE tasks

• Named Entities
• Template Elements
• Template Relations
• Scenario Template

WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced that Paul Jacobs, former Vice-President of E-Commerce at SRA International, has joined the company's executive management team as president.

nQuest Inc.; Paul Jacobs; SRA International

Company: nQuest Inc.
Date: today
InPerson: Paul Jacobs
InRole: president

Company: SRA International
OutPerson: Paul Jacobs
OutRole: Vice-President of E-Commerce
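The filled slots above can be held in a small record type. The following is a minimal sketch only: the class name `SuccessionEvent` and the snake_case field names are assumptions chosen to mirror the slot names on the slide, not the output format of any Sheffield tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuccessionEvent:
    """One filled template; slot names mirror the slide (hypothetical schema)."""
    company: str
    date: Optional[str] = None
    in_person: Optional[str] = None
    in_role: Optional[str] = None
    out_person: Optional[str] = None
    out_role: Optional[str] = None


# Named entities found in the press release.
named_entities = ["nQuest Inc.", "Paul Jacobs", "SRA International"]

# The two templates shown on the slide, filled by hand for illustration.
joining = SuccessionEvent(company="nQuest Inc.", date="today",
                          in_person="Paul Jacobs", in_role="president")
leaving = SuccessionEvent(company="SRA International",
                          out_person="Paul Jacobs",
                          out_role="Vice-President of E-Commerce")


def people_involved(events):
    """Collect every person mentioned in an In/Out slot of the given templates."""
    people = set()
    for event in events:
        for person in (event.in_person, event.out_person):
            if person is not None:
                people.add(person)
    return people
```

Keeping unfilled slots as `None` makes it explicit which information the text did not supply (for example, no date is given for the departure from SRA International).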


IE Tools @ Sheffield

• GATE:
  – General Architecture for Text Engineering
  – Used to integrate HLT modules
• Annie:
  – Rule-based Named Entity Recogniser
  – Download at www.gate.ac.uk
• Amilcare:
  – Adaptive IE system
  – Portable using examples
  – www.nlp.shef.ac.uk/amilcare


IE Tools @ Sheffield (2)

• Melita:
  – Annotation tool
  – Supported by adaptive IE (Amilcare)
  – Learns how to annotate
  – www.aktors.org/technologies/melita/
• Lasie
  – IE system for complex event extraction
  – Manual rule development
  – www.dcs.shef.ac.uk/research/groups/nlp/funded/lasie.html



GATE is…

• An architecture
  – A macro-level organisational picture for LE software systems.
• A framework
  – For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment
  – For language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995).
• Comes with…
  – Some free components... ...and wrappers for other people's components
  – Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.


Some users…

At time of writing, a representative fraction of GATE users includes:
• Longman Pearson publishing, UK;
• BT Exact Technologies, UK;
• Merck KgAa, Germany;
• Canon Europe, UK;
• Knight Ridder (the second biggest US news publisher);
• BBN Technologies, US;
• Sirma AI Ltd., Bulgaria;
• Resco AB, Sweden/Finland/Germany;
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts;
• Master Foods NV: extraction of commodities events from news;
• the American National Corpus project, US;
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California and a large number of other UK, US and EU Universities;
• the Perseus Digital Library project, Tufts University, US.


GATE and Content Extraction

• ANNIE: open-source IE system in GATE, providing modules needed for content extraction
  – Pre-processing
  – Named entity recognition
  – Coreference resolution

• ANNIE handles proper names, pronouns, and nominals

• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results

• Contact Hamish Cunningham ([email protected])
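The slide does not show the rule language itself, so the sketch below makes no attempt to reproduce GATE's syntax. It only illustrates the general pattern-action idea in Python: each rule pairs a pattern (here a regular expression, which is an assumption) with an action that produces an annotation. The names `Rule`, `Annotation`, `label_as` and `apply_rules` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Annotation:
    start: int
    end: int
    label: str
    text: str


@dataclass(frozen=True)
class Rule:
    pattern: re.Pattern                       # what to look for
    action: Callable[[re.Match], Annotation]  # what to do with each match


def label_as(label: str) -> Callable[[re.Match], Annotation]:
    """Build an action that annotates the whole match with a fixed label."""
    def action(match: re.Match) -> Annotation:
        return Annotation(match.start(), match.end(), label, match.group(0))
    return action


def apply_rules(text: str, rules: List[Rule]) -> List[Annotation]:
    """Run every rule over the text and return annotations in document order."""
    found = []
    for rule in rules:
        for match in rule.pattern.finditer(text):
            found.append(rule.action(match))
    return sorted(found, key=lambda a: (a.start, a.end))


# Two toy rules: a single word followed by "Inc." is a company,
# and a "Month day, year" string is a date.
RULES = [
    Rule(re.compile(r"\b[A-Za-z]+ Inc\."), label_as("Company")),
    Rule(re.compile(r"\b(?:January|February|March|April|May|June|July|August|"
                    r"September|October|November|December) \d{1,2}, \d{4}"),
         label_as("Date")),
]

annotations = apply_rules(
    "WASHINGTON, D.C. (October 5, 1999) - nQuest Inc. today announced a new president.",
    RULES,
)
```

Customisation in this style means adding or editing entries in `RULES`; postprocessing would be a further pass over the returned annotations.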


Amilcare Active annotation for the Semantic Web

• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Trains with a limited number of examples
  – Effective on different text types
    • From free texts to rigid docs (XML, HTML, etc.)
  – Tools for:
    • Normal user
      – Able to annotate a corpus
    • Amilcare Expert
      – Able to optimise experiments
    • IE Expert
      – Able to edit rules
  – Uses Annie for preprocessing up to Named Entity Recognition

[Ciravegna – IJCAI 2001]


Implementation details

• 100% Java
• External Interfaces:
  – API for use from other programs
  – GUI for manual training
• Requirements:
  – 10 MB on hard disk
  – Up to 300 MB RAM

• Contact Fabio Ciravegna ([email protected])


Users

• Integrated with SW annotation tools:
  – MnM (Open Univ.)
  – Ontomat (Karlsruhe Univ.)
  – Melita (Sheffield Univ.)
• Users:
  – Merck (D)
  – ISOCO (SP)
  – Quinary (I)
  – Ontoprise (D)
  – University College Dublin (IE)
  – 2 departments of CNRS (F)
  – University of Trier (D)
  – University of Texas (Austin, USA)


Document Annotation

• Many application areas require document annotation (enrichment)
  – Knowledge Management
    • Protocol analysis in industry (Kingston 94)
    • Italian police: 100 annotators/6 pages a day each
  – Semantic Web (Staab00, Motta02, Ciravegna02)
• Annotation is generally manual
  – Expensive
  – Inefficient
  – Difficult
  – Tedious & tiring
    • Error prone (15-30% inter-annotator disagreement)
  – Never ending


Melita

• Document annotation tool
  – Uses an adaptive IE engine to support annotation
• IE System:
  – Trains while users annotate
  – Provides preliminary annotation for new documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – Validating extracted information
    • Simpler & less error prone
    • Speeds up corpus annotation
  – Learns how to improve capabilities


Annotation with IE

• User annotates bare text
• System trains on annotated corpus
• System annotates bare text
• Annotations are compared
• System retrains using errors, missing tags and mistakes


Annotation with Suggestions

• System annotates bare text
• User corrects
• System uses corrections to retrain
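The suggest, correct and retrain cycle can be sketched in a few lines. This is an illustration under strong assumptions: `ToyAnnotator` merely memorises token-to-label pairs and stands in for a real adaptive IE engine, which induces rules instead. All names here are hypothetical.

```python
from typing import Dict, List, Tuple

Annotation = Tuple[str, str]  # (token, label)


class ToyAnnotator:
    """Stand-in for an adaptive IE engine: it memorises token -> label pairs."""

    def __init__(self) -> None:
        self.lexicon: Dict[str, str] = {}

    def train(self, annotations: List[Annotation]) -> None:
        for token, label in annotations:
            self.lexicon[token] = label

    def suggest(self, text: str) -> List[Annotation]:
        return [(tok, self.lexicon[tok]) for tok in text.split() if tok in self.lexicon]


def annotate_with_suggestions(system: ToyAnnotator, text: str,
                              user_annotations: List[Annotation]) -> List[Annotation]:
    """One round of the loop: system suggests, user corrects, system retrains.

    `user_annotations` is the full set of annotations the user considers
    correct for `text`. Returns the suggestions made before the correction.
    """
    suggestions = system.suggest(text)
    # Anything the user wanted but the system missed or mislabelled is a correction.
    corrections = [a for a in user_annotations if a not in suggestions]
    system.train(corrections)
    return suggestions


system = ToyAnnotator()
first = annotate_with_suggestions(
    system, "Seminar by Smith in Room-101",
    [("Smith", "speaker"), ("Room-101", "location")],
)
second = system.suggest("Smith moves to Room-101 tomorrow")
```

On the first document the system has nothing to suggest, so the user does all the work; on the second it proposes both annotations and the user only has to validate them.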


Cooperation: is IE a Useful Support?

CMU Seminars task. Test: 250 texts (Amilcare reports the best IE results ever)

[Figure: four plots, one per slot (Location, Speaker, Stime, Etime), showing Precision, Recall and F-measure (y axis, 0 to 100) against the number of training examples (x axis, 0 to 140).]
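Precision, recall and F-measure are conventionally computed from three counts, as in the sketch below. The function name and the numbers in the usage example are illustrative assumptions; they are not read from the plots.

```python
def precision_recall_f(correct: int, proposed: int, expected: int, beta: float = 1.0):
    """Return (precision, recall, F) as percentages.

    correct:  proposed annotations that match the gold standard
    proposed: all annotations the system produced
    expected: all annotations in the gold standard
    beta:     weight of recall relative to precision (1.0 gives the balanced F1)
    """
    precision = 100.0 * correct / proposed if proposed else 0.0
    recall = 100.0 * correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Made-up counts: 100 annotations proposed, 80 of them correct, 160 expected.
p, r, f = precision_recall_f(correct=80, proposed=100, expected=160)
```

With these counts precision is 80, recall is 50 and the balanced F-measure is their harmonic mean, about 61.5.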



Integrating Information

• Information is available over the Web
  – Dispersed
  – In textual format
• IE as basis for retrieval and integration of information
  – Unsupervised learning using
    • The redundancy of the Web
    • Available repositories
      – Collections of documents/data
      – Known services (e.g. databases, digital libraries, search engines)
  – to bootstrap learning and produce simple high-precision IE applications


Mining Web Sites

• Extracting knowledge from CS Web sites
  – Person:
    • Name
    • Position
    • Email/Telephone
    • Involvement in projects
    • Publications
    • Co-workers
• Information distributed
• Challenges
  – Retrieving information
  – Integrating information
  – Largely unsupervised by user
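Because the information about one person is distributed over several pages, integration amounts to merging partial records. A minimal sketch, assuming a hypothetical `Person` class whose fields follow the slots listed on the slide; the merge policy and the sample values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Person:
    name: str
    position: Optional[str] = None
    email: Optional[str] = None
    telephone: Optional[str] = None
    projects: Set[str] = field(default_factory=set)
    publications: Set[str] = field(default_factory=set)
    co_workers: Set[str] = field(default_factory=set)

    def merge(self, other: "Person") -> "Person":
        """Combine two partial records about the same person.

        Single-valued slots keep the first value seen; set-valued slots are unioned.
        """
        if self.name != other.name:
            raise ValueError("records describe different people")
        return Person(
            name=self.name,
            position=self.position or other.position,
            email=self.email or other.email,
            telephone=self.telephone or other.telephone,
            projects=self.projects | other.projects,
            publications=self.publications | other.publications,
            co_workers=self.co_workers | other.co_workers,
        )


# Two partial records, as they might come from a home page and a project page.
from_home_page = Person(name="Ann Example", position="Lecturer", projects={"Project X"})
from_project_page = Person(name="Ann Example", email="ann@example.org",
                           co_workers={"Bob Sample"})
merged = from_home_page.merge(from_project_page)
```

Matching records by exact name is the simplest possible policy; real pages would need name normalisation before merging.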


Mining Web sites

People and Project names

HomePage, Search

Basket: Project/People name lists and hyperlinks

• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
• Mines the site looking for Project and People names
• Uses
  – Generic patterns
  – Annie
  – Citeseer for likely bigrams (http://citeseer.nj.nec.com/cs)
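The three steps (annotate known names, learn the page structure, recover all names and hyperlinks) can be imitated with a very simple structural cue. In this sketch the "learned structure" is just the tag path leading to a link whose text is a known name; that is an assumption standing in for the trained IE system, and the page, names and URLs are invented.

```python
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LinkCollector(HTMLParser):
    """Collects (tag path, link text, href) for every <a> element in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.links: List[Tuple[str, str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:  # only keep text that sits inside a link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            path = "/".join(self.stack)
            self.links.append((path, "".join(self._text).strip(), self._href))
            self._href = None
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed tags.
            while self.stack.pop() != tag:
                pass


def recover_names(html: str, known_names: Set[str]) -> List[Tuple[str, str]]:
    collector = LinkCollector()
    collector.feed(html)
    # "Training": tag paths under which known names appear as link text.
    learned = {path for path, text, _ in collector.links if text in known_names}
    # "Recovery": every link found under one of the learned paths.
    return [(text, href) for path, text, href in collector.links if path in learned]


PAGE = """
<html><body>
<div><a href="/about.html">About us</a></div>
<ul>
  <li><a href="/people/a.html">Ann Example</a></li>
  <li><a href="/people/b.html">Bob Sample</a></li>
  <li><a href="/people/c.html">Cy Placeholder</a></li>
</ul>
</body></html>
"""
found = recover_names(PAGE, {"Ann Example"})
```

One known name is enough here to recover the other two entries of the list, while the "About us" link is ignored because it sits under a different tag path.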


Mining Web sites

Projects/People Web pages

HomePage, Search

• Extracts personal data
  – Addresses
  – Tel number
  – Email address
  – …

Basket: Project/People name lists and hyperlinks
Basket (People and Projects): Name lists and hyperlinks; Personal data
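Contact details such as email addresses and telephone numbers are regular enough for generic patterns. A minimal sketch, assuming two deliberately loose regular expressions; the patterns, the function name and the sample text are illustrative, and postal addresses are left out because they need more than a single pattern.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose pattern: optional "+", then at least seven characters that start and
# end with a digit and contain only digits, whitespace or hyphens in between.
PHONE_RE = re.compile(r"\+?\d[\d\s-]{5,}\d")


def extract_personal_data(text: str) -> dict:
    """Return the email addresses and telephone numbers found in a page's text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "telephones": [m.strip() for m in PHONE_RE.findall(text)],
    }


sample = "Ann Example, Room 12. Tel: +44 114 000 0000. Email: ann@example.org"
data = extract_personal_data(sample)
```

The room number is not mistaken for a telephone number because it is too short for the pattern; longer digit strings such as dates or identifiers could still be, which is why results like these may need checking.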


Mining Web sites

People Publications

HomePage, Search

Basket (People and Projects): Name lists and hyperlinks; Personal data

• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information

Basket (People and Projects): Name lists and hyperlinks; Personal data; Co-authoring information
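Once author lists have been extracted from a publications page, co-authoring information is a matter of counting pairs. A small sketch, assuming each paper is already available as a list of author names; the function name and the names used are placeholders.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def co_author_counts(papers: Iterable[List[str]]) -> Counter:
    """Count how many papers each unordered pair of authors shares."""
    counts = Counter()
    for authors in papers:
        # Sort so that each pair has one canonical order; set() drops duplicates.
        for pair in combinations(sorted(set(authors)), 2):
            counts[pair] += 1
    return counts


papers = [
    ["Ann Example", "Bob Sample"],
    ["Bob Sample", "Ann Example", "Cy Placeholder"],
]
pairs = co_author_counts(papers)
```

Here the pair ("Ann Example", "Bob Sample") is counted twice and the two pairs involving "Cy Placeholder" once each.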



Paper discovery


Focus on people


User Role

• Providing:
  – A URL
  – List of services (e.g. Google)
    • Train wrappers using examples
  – Some examples of fillers (e.g. projects)
• If necessary, correcting intermediate results


Rationale

• Large collections (e.g. Web) contain redundant information
  – Redundancy can be used to bootstrap learning
• Mining the Web for information
  – Learned patterns
• Integration of information
  – Multiple evidence
    • Different strategies with different reliability
    • Scruffy works!
  – User corrections of data if necessary
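One way to picture "multiple evidence" from strategies of different reliability is a noisy-or combination: a candidate is doubted only if every strategy that proposed it was wrong. This scheme, the strategy names and the reliability values below are all assumptions made for illustration; the slides do not say how evidence was actually combined.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def combine_evidence(observations: Iterable[Tuple[str, str]],
                     reliability: Dict[str, float]) -> Dict[str, float]:
    """Score each candidate as 1 - product of (1 - reliability) over its evidence.

    observations: (candidate value, name of the strategy that proposed it)
    reliability:  assumed probability that a strategy's proposal is correct
    """
    doubt = defaultdict(lambda: 1.0)
    for value, strategy in observations:
        doubt[value] *= 1.0 - reliability[strategy]
    return {value: 1.0 - d for value, d in doubt.items()}


# Made-up reliabilities for two hypothetical strategies.
RELIABILITY = {"generic_pattern": 0.5, "named_entity": 0.6}

scores = combine_evidence(
    [
        ("Ann Example", "generic_pattern"),
        ("Ann Example", "named_entity"),
        ("Contact Us", "generic_pattern"),
    ],
    RELIABILITY,
)
```

A candidate found by two independent strategies ends up with a higher score (0.8) than one found by a single weak strategy (0.5), which is how redundancy can compensate for individually unreliable extractors.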


Conclusion

• In AKT we are using HLT (IE) for:
  – Helping in document annotation
  – Integrating information from different sources
• Benefit:
  – Reduce annotation needs
  – Retrieve and integrate dispersed information
    • Minimum user intervention
