
Dr. Andreas Hotho

Text Clustering with Background Knowledge

A. Hotho: Text Clustering with Background Knowledge 2

Agenda

• Introduction
• Semantic Web
• Semantic Web Mining
• Typical Preprocessing steps for Text Mining
• Ontology Learning
• Text Clustering with Background Knowledge
• Text Clustering using FCA
• Text Classification using Background Knowledge
• Application driven Evaluation of Ontology Learning
• Different kinds of Background Knowledge

A. Hotho: Text Clustering with Background Knowledge 3

Knowledge and Data Engineering Group @ University of Kassel

Founded in April 2004
Head: Prof. Gerd Stumme
Member of the Research Center L3S

Research areas:
• Semantic Web / Ontologies
• Knowledge Discovery
• Web Mining
• Peer-to-Peer
• Folksonomies
• Social Bookmark Systems

A. Hotho: Text Clustering with Background Knowledge 4

Acknowledgement

Some of the slides are taken from:

• ECML/PKDD Tutorial “Ontology Learning from text”, Paul Buitelaar, Philipp Cimiano, Marko Grobelnik, Michael Sintek

• KDD Course of AIFB Karlsruhe and KDE Kassel

• Semantic Web Tutorial Slides from AIFB

• Some slides of the Semantic Web Introduction have been stolen from various places, from Jim Hendler and Frank van Harmelen, in particular


A. Hotho: Text Clustering with Background Knowledge 5

Resources in BibSonomy tagged with: SumSchool06

http://www.bibsonomy.org/tag/SumSchool06

Dr. Andreas Hotho

Introduction: Semantic Web

A. Hotho: Text Clustering with Background Knowledge 7

Syntax is not enough

Andreas

• Tel

• E-Mail

A. Hotho: Text Clustering with Background Knowledge 8

Information Convergence

Convergence not just in devices, but also in "information":

Your personal information (phone, PDA, …)
• calendar, photos, home page, files, …

Your "professional" life (laptop, desktop, …, Grid)
• web site, publications, files, databases, …

Your "community" contexts (Web)
• hobbies, blogs, fanfic, social networks, …

The Web teaches us that people will work to share.
How do we CREATE, SEARCH, and BROWSE in the non-text based parts of our lives?


A. Hotho: Text Clustering with Background Knowledge 9

Meaning of Information (or: what it means to be a computer)

[Figure: a CV document with the sections name, education, work, private]

A. Hotho: Text Clustering with Background Knowledge 10

[Figure: the same CV marked up with XML tags <CV>, <name>, <education>, <work>, <private>; to the machine the tag names are just arbitrary symbols]

XML ≠ Meaning, XML = Structure

A. Hotho: Text Clustering with Background Knowledge 11

Source of Problems

XML is unspecific:
• no predetermined vocabulary
• no semantics for relationships
⇒ both must be specified upfront

Only possible in close cooperation:
• small, reasonably stable group
• common interests or authorities

Not possible on the Web or on a broad scale in general!

A. Hotho: Text Clustering with Background Knowledge 12

(One) Layer Model of the Semantic Web


A. Hotho: Text Clustering with Background Knowledge 13

Some Principal Ideas

• URI – uniform resource identifiers
• XML – common syntax
• Interlinked
• Layers of semantics – from database to knowledge base to proofs

Design principles of the WWW applied to semantics!!

Tim Berners-Lee, Weaving the Web

A. Hotho: Text Clustering with Background Knowledge 14

Ontology

Ontologies enable a better communication between humans/machines.
Ontologies standardize and formalize the meaning of words through concepts.

„An ontology is an explicit specification of a conceptualization.“ [Gruber, 1993]

„People can‘t share knowledge if they do not speak a common language.“ [Davenport & Prusak, 1998]

A. Hotho: Text Clustering with Background Knowledge 15

What is an Ontology?

Gruber 93:

An ontology is a
• formal specification ⇒ executable
• of a shared ⇒ group of persons
• conceptualization ⇒ about concepts
• of a domain of interest ⇒ between application and „unique truth“

A. Hotho: Text Clustering with Background Knowledge 16

Communication Principle

[Figure: the semiotic triangle: the form “Jaguar“ evokes a concept, the concept refers to a referent, and the form stands for the referent]

[Ogden, Richards, 1923]


A. Hotho: Text Clustering with Background Knowledge 17

Views on Ontologies

[Figure: spectrum of ontology-like structures from front-end to back-end (topic maps, thesauri, taxonomies, semantic networks, extended ER models, predicate logic, ontologies) and their applications: navigation, queries, sharing of knowledge, information retrieval, query expansion, mediation, reasoning, consistency checking, EAI]

A. Hotho: Text Clustering with Background Knowledge 18

Taxonomy

Taxonomy := segmentation, classification and ordering of elements into a classification system according to their relationships with each other

[Figure: example taxonomy: Object subsumes Person, Topic and Document; Person subsumes Researcher and Student; Student subsumes PhD Student and Doctoral Student; Topic subsumes Semantics, which subsumes Ontology and F-Logic]

A. Hotho: Text Clustering with Background Knowledge 19

Thesaurus

• Terminology for a specific domain
• Graph with primitives and two fixed relationships (similar, synonym)
• Originates from bibliography

[Figure: example thesaurus: the taxonomy from the previous slide extended with similar/synonym links, e.g. PhD Student synonym Doctoral Student, Ontology similar F-Logic]

A. Hotho: Text Clustering with Background Knowledge 20

Topic Map

• Topics (nodes), relationships and occurrences (links to documents)
• ISO standard
• Typically used for navigation and visualisation

[Figure: example topic map: topics such as Person, Topic, Document, Researcher, Student, PhD/Doctoral Student, Semantics, Ontology, F-Logic, connected by relationships like knows, writes, described_in, with similar/synonym links and attributes such as Affiliation and Tel]


A. Hotho: Text Clustering with Background Knowledge 21

Ontology (in our sense)

• Representation language: predicate logic (F-Logic)
• Standards: RDF(S); upcoming standard: OWL

[Figure: example ontology: the taxonomy (Object with Person, Topic, Document; Researcher, Student, PhD Student via is_a; instances via instance_of), relations knows, writes, described_in, is_about, subTopicOf, attributes Affiliation and Tel, a rule combining knows, writes and is_about (e.g. P knows T if P writes D and D is_about T), and an instance A. Hotho with affiliation KDE and phone +49 561 804 6252]

A. Hotho: Text Clustering with Background Knowledge 22

Ontology & Metadata

[Figure: ontology layer: classes PhD_Student and AssProf, both rdfs:subClassOf AcademicStaff; property cooperate_with with rdfs:domain and rdfs:range on these classes]

Annotation of web pages (instances of the ontology):

WebPage http://www.aifb.uni-karlsruhe.de/WBS/sst
<swrc:AssProf rdf:ID="sst">
  <swrc:name>Steffen Staab</swrc:name>
  ...
</swrc:AssProf>

WebPage http://www.aifb.uni-karlsruhe.de/WBS/sha
<swrc:PhD_Student rdf:ID="sha">
  <swrc:name>Siegfried Handschuh</swrc:name>
  ...
  <swrc:cooperate_with rdf:resource="http://www.aifb.uni-karlsruhe.de/WBS/sst#sst"/>
</swrc:PhD_Student>

Links have explicit meanings!

A. Hotho: Text Clustering with Background Knowledge 23

What’s in a link? Formally

W3C recommendations
• RDF: an edge in a graph
• OWL: consistency (+ subsumption + classification + …)

Currently under discussion
• Rules: a deductive database

Currently under intense research
• Proof: worked-out proofs
• Trust: signature & everything working together

A. Hotho: Text Clustering with Background Knowledge 24

What’s in a link? Informally

• RDF: pointing to shared data
• OWL: shared terminology
• Rules: if-then-else conditions
• Proof: proof already shown
• Trust: reliability


A. Hotho: Text Clustering with Background Knowledge 25

Ontologies and their Relatives (I)

There are many relatives around:

Controlled vocabularies, thesauri and classification systems available in the WWW, see http://www.lub.lu.se/metadata/subject-help.html
• classification systems (e.g. UNSPSC, Library Science, etc.)
• thesauri (e.g. Art & Architecture, Agrovoc, etc.)
• DMOZ Open Directory, http://www.dmoz.org

Lexical semantic nets
• WordNet, see http://www.cogsci.princeton.edu/~wn/
• EuroWordNet, see http://www.hum.uva.nl/~ewn/

Topic Maps, http://www.topicmaps.org (e.g. used within knowledge management applications)

In general it is difficult to find the border line!

A. Hotho: Text Clustering with Background Knowledge 26

Ontologies and their Relatives (II)

[Figure: ontology spectrum with increasing expressiveness: catalog/ID, terms/glossary, thesauri, informal is-a, formal is-a, formal instance, frames, value restrictions, general logical constraints, axioms (disjointness, inverse relations, ...)]

A. Hotho: Text Clustering with Background Knowledge 27

Ontologies - Some Examples

General purpose ontologies:
• WordNet / EuroWordNet, http://www.cogsci.princeton.edu/~wn
• The Upper Cyc Ontology, http://www.cyc.com/cyc-2-1/index.html
• IEEE Standard Upper Ontology, http://suo.ieee.org/

Domain and application-specific ontologies:
• RDF Site Summary RSS, http://groups.yahoo.com/group/rss-dev/files/schema.rdf
• UMLS, http://www.nlm.nih.gov/research/umls/
• GALEN
• SWRC – Semantic Web Research Community, http://ontoware.org/projects/swrc/
• RETSINA Calendering Agent, http://ilrt.org/discovery/2001/06/schemas/ical-full/hybrid.rdf
• Dublin Core, http://dublincore.org/

Web services ontologies:
• Core Ontology of Services, http://cos.ontoware.org
• Web Service Modeling Ontology, http://www.wsmo.org
• DAML-S

Meta-ontologies:
• Semantic Translation, http://www.ecimf.org/contrib/onto/ST/index.html
• RDFT, http://www.cs.vu.nl/~borys/RDFT/0.27/RDFT.rdfs
• Evolution Ontology, http://kaon.semanticweb.org/examples/Evolution.rdfs

Ontologies in a wider sense:
• Agrovoc, http://www.fao.org/agrovoc/
• Art and Architecture, http://www.getty.edu/research/tools/vocabulary/aat/
• UNSPSC, http://eccma.org/unspsc/
• DTD standardizations, e.g. HR-XML, http://www.hr-xml.org/

A. Hotho: Text Clustering with Background Knowledge 28

Wordnet

• WordNet contains 207,016 word-sense pairs and 117,597 synsets
• WordNet categorizes words into the syntactic categories (N, noun), (V, verb), (Adj, adjective) and (Adv, adverb)

• WordNet additionally contains lexical-semantic relations between word meanings

[ http://wordnet.princeton.edu/]

Statistics under: http://wordnet.princeton.edu/man/wnstats.7WN#sect2


A. Hotho: Text Clustering with Background Knowledge 29

Wordnet II

Lexical-semantic relation | Syntactic categories | Examples
Synonymy    | N, V, Adj, Adv   | jolly, merry
Antonymy    | Adj, Adv, (N, V) | fast, slow; friendly, unfriendly
Hyperonymy  | N                | animal, living being; mammal, animal; dog, mammal
Meronymy    | N                | flour, cake; tyre, car

A. Hotho: Text Clustering with Background Knowledge 30

Wordnet III

• Lexical semantic relations in WordNet mainly correspond to their counterparts in frame-oriented representation formalisms:
  - hyperonym / hyponym is analogous to the is-a relation
  - meronym / holonym corresponds to has-part / part-of relations

• WordNet allows a fluid transition between linguistic information and conceptual structures

A. Hotho: Text Clustering with Background Knowledge 31

UMLS (I)

• provided by the National Library of Medicine (NLM), a database of medical terminology.

• Unifies terms from several medical databases(MEDLINE, SNOMED International, Read Codes, etc.) such that different terms are identified as the same medical concept.

• Applications: primarily browse/search in document collections, e.g.
  - PubMed: access to documents (e.g. MEDLINE)
  - CliniWeb International: clinical information in the WWW
[ http://www.nlm.nih.gov/research/umls/umlsapps.html ]

[ http://www.nlm.nih.gov/research/umls/ ]

A. Hotho: Text Clustering with Background Knowledge 32

UMLS (II)

UMLS Knowledge Sources:

Metathesaurus provides the concordance of medical concepts:
• 730,000 concepts
• 1.5 million concept names in different source vocabularies

SPECIALIST Lexicon provides word synonyms, derivations, lexical variants, and grammatical forms of words used in Metathesaurus terms:
• 130,000 entries

Semantic Network codifies the relationships (e.g. causality, "is a", etc.) among medical terms.

134 semantic types, 54 relationships.


A. Hotho: Text Clustering with Background Knowledge 33

The semantic web and machine learning

1. What can machine learning do for the Semantic Web?
2. Learning ontologies (even if not fully automatic)
3. Learning to map between ontologies
4. Duplicate recognition
5. Deep annotation: reconciling databases and ontologies
6. Annotation by information extraction

1. What can the Semantic Web do for machine learning?
2. Lots and lots of SW tools to describe and exchange data for later use by machine learning methods in a canonical way (preprocessing!)
3. Using ontological structures to improve the machine learning task
4. Providing background knowledge to guide machine learning

A. Hotho: Text Clustering with Background Knowledge 34

Foundations of the Semantic Web: References

• Semantic Web Activity at W3C, http://www.w3.org/2001/sw/
• www.semanticweb.org (currently being relaunched)
• Journal of Web Semantics
• D. Fensel et al.: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press 2003
• G. Antoniou, F. van Harmelen: A Semantic Web Primer, MIT Press 2004
• S. Staab, R. Studer (eds.): Handbook on Ontologies, Springer Verlag, 2004
• S. Handschuh, S. Staab (eds.): Annotation for the Semantic Web, IOS Press, 2003
• International Semantic Web Conference series, yearly since 2002, LNCS
• World Wide Web Conference series, ACM Press, first Semantic Web papers since 1999
• York Sure, Pascal Hitzler, Andreas Eberhart, Rudi Studer: The Semantic Web in One Day, IEEE Intelligent Systems, http://www.aifb.uni-karlsruhe.de/WBS/phi/pub/sw_inoneday.pdf
• Some slides have been stolen from various places, from Jim Hendler and Frank van Harmelen in particular.

Dr. Andreas Hotho

Semantic Web Mining

A. Hotho: Text Clustering with Background Knowledge 36

Where to start?

Web Mining areas:
• Web content mining
• Web structure mining
• Web usage mining


A. Hotho: Text Clustering with Background Knowledge 37

Extracting Semantics from the Web

• Web Mining can help
  - to learn structures for knowledge organization (e.g. ontologies) → Ontology Learning
  - and to populate them → Instance Learning

A. Hotho: Text Clustering with Background Knowledge 38

Ontology Learning

• Typically, a domain-specific document corpus contains much information about a specific domain.

• One possible approach is to take this given corpus and extract linguistic and ontological resources from it.

Concentration on Web content

[Figure: Ontology Learning at the intersection of Knowledge Discovery and Ontology Engineering]

A. Hotho: Text Clustering with Background Knowledge 39

Ontology Learning Steps

1. Concept extraction
   - multi-word term extraction
   - word meaning recognition

2. Concept relation extraction
   - taxonomy learning
   - non-taxonomic relation extraction
   - labeling of non-taxonomic relations

Besides these two steps, ontology reuse via pruning is applicable.

A. Hotho: Text Clustering with Background Knowledge 40

Example: Ontology Learning from the Web [Mädche, Staab: ECAI 2000]

[Figure: learned is-a hierarchy: root subsumes furnishing, accommodation, event and area; accommodation subsumes hotel (e.g. wellness hotel) and youth hostel; area subsumes city and region]

Derived concept pairs:
• (wellness hotel, area)
• (hotel, area)
• (accommodation, area)

Association Rule Mining

Generalized conceptual relation: hasLocation(accommodation, area)
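A toy sketch (not the ECAI 2000 system) of the association-rule idea behind the slide: count how often concept pairs co-occur in documents and keep pairs with sufficient support and confidence; the document sets and thresholds are made up for illustration.

from itertools import combinations
from collections import Counter

def concept_pair_rules(transactions, min_support=0.2, min_confidence=0.6):
    # transactions: one set of concepts per document
    n = len(transactions)
    single = Counter(c for t in transactions for c in t)
    pair = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))
    rules = []
    for p, count in pair.items():
        a, b = tuple(p)
        support = count / n
        if support < min_support:
            continue
        for head, body in ((a, b), (b, a)):
            confidence = count / single[head]
            if confidence >= min_confidence:
                rules.append((head, body, support, confidence))
    return rules

docs = [{"wellness hotel", "area"}, {"hotel", "area"}, {"accommodation", "area"}, {"hotel", "sauna"}]
print(concept_pair_rules(docs))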


A. Hotho: Text Clustering with Background Knowledge 41

Extracting Semantics from the Web

• Web Mining can help
  - to learn structures for knowledge organization (e.g. ontologies) → Ontology Learning
  - and to populate them → Instance Learning

A. Hotho: Text Clustering with Background Knowledge 42

Example: Instance Learning from the Web

Information Extraction, e.g. [Craven et al., AI Journal 2000]

Knowledge base:
• Hotel: Wellnesshotel
• GolfCourse: Seaview
• belongsTo(Seaview, Wellnesshotel)
• ...

[Figure: ontology with concepts Hotel, GolfCourse, Organization, attribute name, relations belongsTo and cooperatesWith, and the F-Logic rule
FORALL X,Y  Y:Hotel[cooperatesWith ->> X] <- X:Project[cooperatesWith ->> Y].]

A. Hotho: Text Clustering with Background Knowledge 43

Example

Information highlighting for supporting annotation, based on IE techniques.

A. Hotho: Text Clustering with Background Knowledge 44

Example: Crawling the (semantic) web for filling the ontology [Ehrig et al., 2002]

Crawling:
• load a document
• extract links
• load the next document

Focused Crawling:
• intelligent, focused decision on the next step


A. Hotho: Text Clustering with Background Knowledge 45

Example: Mining the Semantic Web

ILP-based Association Rule Mining, e.g. [Dehaspe, Toivonen, J. DMKD 1998]

Knowledge base:
• Hotel: Wellnesshotel
• GolfCourse: Seaview
• belongsTo(Seaview, Wellnesshotel)
• ...

Mined rule:
Hotel(x), GolfCourse(y), belongsTo(y,x) → hasStars(x,5)
support = 0.4 %, confidence = 89 %

[Figure: ontology with concepts Hotel, GolfCourse, Organization, attribute name, relations belongsTo and cooperatesWith, and the F-Logic rule
FORALL X,Y  Y:Hotel[cooperatesWith ->> X] <- X:Project[cooperatesWith ->> Y].]

A. Hotho: Text Clustering with Background Knowledge 46

Semantic Web Usage Mining

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478

[Annotations of the log entries: search by location → search by location and price (refine search) → choose item (look at an individual hotel)]

From logfile analysis ...

... to semantic logfile analysis:

Basic idea: associate each requested page with one or more ontological entities, to better understand the process of navigation

[Berendt & Spiliopoulou 2000; Berendt 2002; Oberle 2003]

Use the gained knowledge to

• understand search strategies

• improve navigation design

• personalization

Example

A. Hotho: Text Clustering with Background Knowledge 47

Example: Text Document Clustering of Crawled Documents

[Figure: pipeline: focused crawling of the WWW, clustering of the crawled documents, explanation of the clusters]

Dr. Andreas Hotho

Preprocessing steps for Text Mining

Slides partially from:
- AIFB KDD course
- Raymond J. Mooney (http://www.cs.utexas.edu/users/mooney/ir-course/)


A. Hotho: Text Clustering with Background Knowledge 49

Preprocessing of Text documents

[Figure: documents are mapped to a document × term matrix with entries x_ij (document i, key j)]

Kinds of features to extract:
• Terms
• Words
• Phrases
• Concepts
• Metadata
• Shallow parsing
• Deep parsing
• …

A. Hotho: Text Clustering with Background Knowledge 50

Which kind of features to extract?

Metadata
• e.g., author, date, document type, language, copyright status
• according to a metadata schema that specifies attributes, e.g.:
  - Dublin Core Metadata Initiative (http://dublincore.org/)
  - BibTeX schema for bibliographic metadata
• typically by explicit markup from a human indexer, but possibly also automatically by means of information extraction (next lecture)

Controlled vocabulary of index terms
• fixed set of index terms that describe the content of documents
• often in hierarchical form (taxonomies)
• indexing vocabulary / taxonomy is centrally designed and maintained by some authority, e.g.:
  - Inspec Topic Classification (http://www.iee.org/Publish/Inspec/)
  - MeSH, Medical Subject Headings (http://www.nlm.nih.gov/mesh/)
  - IPC, International Patent Classification (http://www.wipo.int/classifications/ipc/en/)
  - many many more…

A. Hotho: Text Clustering with Background Knowledge 51

Example: MeSH Classification

PMID- 7810287
OWN - NLM
STAT- MEDLINE
DA  - 19950202
DCOM- 19950202
LR  - 20041117
PUBM- Print
IS  - 0094-6354 (Print)
VI  - 62
IP  - 4
DP  - 1994 Aug
TI  - Cockayne syndrome: a case report.
PG  - 346-8
AB  - A 4-year-old female with Cockayne syndrome presented for cataract extraction under general anesthesia. […]
FAU - O'Brien, F C
AU  - O'Brien FC
FAU - Ginsberg, B
AU  - Ginsberg B
LA  - eng
PT  - Case Reports
PT  - Journal Article
PL  - UNITED STATES
TA  - AANA J
JT  - AANA journal.
JID - 0431420
SB  - N
MH  - Anesthesia, General/*methods/nursing
MH  - Cataract Extraction
MH  - Child, Preschool
MH  - Cockayne Syndrome/complications/*surgery
MH  - Female
MH  - Humans
EDAT- 1994/08/01
MHDA- 1994/08/01 00:01
PST - ppublish
SO  - AANA J. 1994 Aug;62(4):346-8.

A. Hotho: Text Clustering with Background Knowledge 52

Which kind of features to extract? (cont.)

Alternative: dynamic vocabulary
• social tagging systems ("folksonomies"), e.g.:
  - del.icio.us for bookmarks (http://del.icio.us)
  - flickr for photographs (http://www.flickr.com)
• tags correspond to index terms that are freely chosen and assigned to documents by users without centralized management

Derived features: full-text indexing
• also known as the bag-of-words model
• general assumption: every word or expression in the text document can be a valid key
• index terms are automatically extracted from the document collection
• the dictionary of index terms is continually increasing
• many design decisions in choosing appropriate terms
• next section…


A. Hotho: Text Clustering with Background Knowledge 53

Example: social tagging

A. Hotho: Text Clustering with Background Knowledge 54

Example: BibSonomy also contains publication metadata

A. Hotho: Text Clustering with Background Knowledge 55

Document Representation: Full-Text Indexing

[Figure: typical full-text indexing pipeline: Tokenization → Stopword Removal → Stemming; example output stems such as treat, infection, blood, medic, potent, transmiss]

A. Hotho: Text Clustering with Background Knowledge 56

Full Text Representation

Tokenization
• goal: segment the input character sequence into "useful" tokens (e.g., individual terms)
• design decisions and problems:
  - set of word delimiters to use (e.g., whitespace, punctuation marks)
  - handling of special and numerical characters
  - handling of capitalization (typically conversion to lower case)
  - handling of punctuation marks (sentence delimiter or abbreviation?)
  - different languages have different rules for compound words (e.g., "color screen" vs. "Farbbildschirm")

Stemming or Lemmatization
• morphological normalization of inflected word forms to a base form (e.g., "houses" → "house", "goes" → "go")
• Stemming: simple approach based on a few structural rules, e.g., the Porter stemming algorithm for English
• Lemmatization: retrieval of the base form, typically based on a dictionary; can handle exceptional cases (e.g., "mice" → "mouse")


A. Hotho: Text Clustering with Background Knowledge 57

Full Text Representation

Stopword removal

• removal of very frequent and uninformative words
• typically function words such as "the", "a", "an", "of", "for"
• e.g., the SMART stopword list for English defines 571 stopwords (ftp://ftp.cs.cornell.edu/pub/smart/english.stop)
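A minimal sketch of the tokenization → stopword removal → stemming pipeline described above, assuming NLTK's PorterStemmer is available; the tiny stopword set stands in for a real list such as SMART.

import re
from nltk.stem import PorterStemmer  # assumption: NLTK is installed

STOPWORDS = {"the", "a", "an", "of", "for", "in", "and", "to"}  # stand-in for a full list
stemmer = PorterStemmer()

def preprocess(text):
    # tokenize on letter sequences and lower-case everything
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # remove stopwords, then reduce the remaining tokens to their stems
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The treatment of blood infections requires potent medication."))
# e.g. ['treatment', 'blood', 'infect', 'requir', 'potent', 'medic']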

A. Hotho: Text Clustering with Background Knowledge 58

Property: Word Frequency

• A few words are very common: the 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
• Most words are very rare: half the words in a corpus appear only once; they are called hapax legomena (Greek for "read only once").
• This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".

A. Hotho: Text Clustering with Background Knowledge 59

Sample Word Frequency Data

(from B. Croft, UMass)

A. Hotho: Text Clustering with Background Knowledge 60

Zipf’s Law

• Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ).

• Zipf (1949) “discovered” that:

• If the probability of the word of rank r is p_r and N is the total number of word occurrences:

f ∝ 1/r,  i.e.  f · r = k  (for a constant k)

p_r = f / N  and  p_r · r = A ≈ 0.1  (A approximately constant, independent of the corpus)
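A small illustration (not from the slides) of checking Zipf's law on a token list: for each rank r, the product f · r should stay roughly constant.

from collections import Counter

def rank_frequency(tokens):
    # sort words by decreasing frequency and report rank, word, frequency and f*r
    counts = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

tokens = "the cat sat on the mat the dog sat on the log".split()
for rank, word, freq, product in rank_frequency(tokens):
    print(rank, word, freq, product)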


A. Hotho: Text Clustering with Background Knowledge 61

Zipf and Term Weighting

Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.

A. Hotho: Text Clustering with Background Knowledge 62

Pruning based on Zipf Law

• Stopword removal

• drop words with fewer than a given number of occurrences (e.g. 30) to remove the extremely uncommon words

• a similar idea is behind the tfidf weighting

A. Hotho: Text Clustering with Background Knowledge 63

Further possible (non-standard) steps

• separate indexing of phrases and compound words (e.g., "machine learning" ≠ "machine", "learning"), based on background dictionaries or statistical detection of frequent phrases (machine learning again ;-))
• alternative: additional indexing of all adjacent words up to a certain window length (bigrams, trigrams, n-grams)
• expansion with synonymous terms based on thesauri
• separate consideration of different parts of speech (e.g., "walk" as verb or "walk" as noun)
• many more …

A. Hotho: Text Clustering with Background Knowledge 64

Levels of Linguistic Analysis: The 'Human Language Technologies Layer Cake'

• Tokenization (incl. Named-Entity Recognition): [table] [2005-06-01] [John Smith]
• Morphological Analysis: [table:N:ART] [Sommer~schule:N] [work~ing:V]
• Part-of-Speech & Semantic Tagging: [table:N:ARTIFACT] [table:N:furniture_01]
• Phrase Recognition / Chunking: [[the] [large] [table] NP] [[in] [the] [corner] PP]
• Dependency Structure (Phrases): [[the:SPEC] [large:MOD] [table:HEAD] NP]
• Dependency Structure (Sentence): [[He:SUBJ] [booked:PRED] [[this] [table:HEAD] NP:DOBJ] S]
• Discourse Analysis: [[He:SUBJ] [booked:PRED] [[this] [table:HEAD] NP:DOBJ:X1] …] … [[It:SUBJ:X1] [was:PRED] still available …]

© Paul Buitelaar, DFKI


A. Hotho: Text Clustering with Background Knowledge 65

Full Text Representation and Sparseness

Full-text indexing typically results in a very sparse matrix: usually, less than 1% of the matrix cells are non-zero!

Sparseness requires special attention with respect to storage and computation:
• store only the non-zero elements with their respective indices and assume the rest of the matrix to be zero
• tune computations to this data structure
• frequent terms are likely to be indexed already at the beginning

[Figure: sparseness structure of the document × term matrix for the training documents of the Reuters-21578 corpus (9603 × 17525); the non-sparse fraction is 0.25 %]
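A short illustration (assuming SciPy) of the storage idea above: only the non-zero entries of the document × term matrix are kept, together with their row and column indices, and computations use that layout.

import numpy as np
from scipy.sparse import csr_matrix

# toy document x term matrix: 4 documents, 6 terms, mostly zeros
dense = np.array([
    [0, 0, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 0, 10, 0],
    [0, 2, 0, 0, 0, 23],
])
sparse = csr_matrix(dense)             # stores only non-zero values plus indices
print(sparse.nnz, "non-zero cells out of", dense.size)
print(sparse.dot(sparse.T).toarray())  # document-document products on the sparse structure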

A. Hotho: Text Clustering with Background Knowledge 66

Comparison of Representation Approaches

[Figure: representation approaches compared along two axes, automation of indexing (from only human annotation to fully automatic) and dynamics of the feature set (from a fixed set of features to a highly dynamic feature set): traditional library classification and news agency systems (e.g. Reuters) use fixed, human-assigned features; social tagging (e.g. del.icio.us) is human-driven but highly dynamic; full-text indexing is fully automatic with a dynamic feature set]

A. Hotho: Text Clustering with Background Knowledge 67

Retrieval Models: Vector Space [Salton 60s]

Vector Space Model (best-match model + ranking)
• typically full-text indexing of documents
• documents are regarded as vectors
• vector space dimensions are defined by the different index terms
• the query is also treated as a vector in the same space
• documents are ranked based on their geometric similarity to the query
• very successful paradigm with many connections to the machine learning view

Issues of the vector space model
• choice of an appropriate term weighting (typically TFIDF)
• choice of the geometric similarity measure to use (typically cosine)

A. Hotho: Text Clustering with Background Knowledge 68

Term Weighting - Alternatives

• boolean weighting (simplest case) with entries 0 and 1
• absolute frequency tf_ji of term i in document j
• relative frequency rf_ji of term i in document j
• most popular choice: term-frequency inverse document frequency (TFIDF) weighting:

tfidf(w) = tf(w) · log(N / df(w))

tf(w)      term frequency (number of occurrences of the word in a document)
df(w)      document frequency (number of documents containing the word)
N          number of all documents
tfidf(w)   relative importance of the word in the document

The word is more important if it appears in fewer documents.
The word is more important if it appears several times in a target document.
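A minimal sketch of the plain TFIDF weighting above (tf · log(N/df)); the variable names and toy documents are illustrative only.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [["crude", "oil", "oil", "price"],
        ["gold", "price", "market"],
        ["crude", "oil", "market"]]
print(tfidf_vectors(docs)[0])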


A. Hotho: Text Clustering with Background Knowledge 69

Cosine Measure

• typically used as similarity measure: document vectors are ranked according to their cosine score with the query
• corresponds to the angle between two vectors, i.e. the normalized inner product of the input vectors
• Note: the direction distinguishes document vectors, not the length!
• all vector entries are positive, so the cosine varies between 0 (orthogonal vectors) and 1 (same direction)
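A small sketch of the cosine similarity between two sparse term-weight vectors, matching the description above.

import math

def cosine(u, v):
    # u, v: {term: weight} dictionaries; cosine = normalized inner product
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc = {"crude": 1.2, "oil": 2.0, "price": 0.5}
query = {"oil": 1.0, "price": 1.0}
print(cosine(doc, query))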

A. Hotho: Text Clustering with Background Knowledge 70

Cosine Measure (Illustration: scaling onto the unit hypersphere)

[Figure: doc1, doc2 and query vectors in 2D; normalizing them to doc1', doc2', query' projects them onto the unit circle, where the angle α determines the similarity]

A. Hotho: Text Clustering with Background Knowledge 71

Evaluation Measures

• We need to analyse results and to evaluate systems.
• Important considerations:
  - Precision: how well did the presented result set match the information need of the user?
  - Recall: how much of the relevant information available was presented in the result set?
• There is a well-known set of information retrieval measures which evaluate information retrieval engines with respect to the subjective (!) perception of relevancy of a test user.

A. Hotho: Text Clustering with Background Knowledge 72

Evaluation Measures: Notation

Two partitions of a set of documents:
• according to perceived relevancy to the user (human judgement)
• according to the result of the retrieval engine (retrieval result)

                               | judged relevant (positive) | judged non-relevant (negative)
positive (docs returned)       | true positive (TP)         | false positive (FP)
negative (docs not returned)   | false negative (FN)        | true negative (TN)


A. Hotho: Text Clustering with Background Knowledge 73

Evaluation Measures: Information Retrieval (and Text Classification)

• Error rate: measures the overall error, (FP + FN) / (TP + TN + FP + FN)
• Precision: fraction of relevant documents in the result, P = TP / (TP + FP)
• Recall: fraction of returned documents w.r.t. all relevant documents, R = TP / (TP + FN)
• F-measure: harmonic mean of precision and recall, F = 2 · P · R / (P + R)
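The measures above written out as a small helper (a sketch; the counts come from the contingency table of the previous slide and the example numbers are made up).

def ir_measures(tp, fp, fn, tn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    error = (fp + fn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "error": error}

print(ir_measures(tp=30, fp=10, fn=20, tn=940))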

A. Hotho: Text Clustering with Background Knowledge 74

Evaluation Measures (cont.): Considering Ranked Retrieval

• Precision and recall are well-defined only for exact-match (i.e. unranked) retrieval.
• Approach for ranked retrieval:
  - rank all test documents
  - calculate precision and recall at fixed cutoff points (e.g., at position k = 5)
  - different results will be achieved for varying k
• Typical observation for k → n: precision decreases and recall increases in the long run (think about why!)
• The break-even point measure is defined as the value of precision and recall at which they become equal.

[Figure: a ranked result list with relevance judgements (+/-) per position 1..n, and a plot connecting precision and recall for different k; the break-even point is where precision equals recall]

A. Hotho: Text Clustering with Background Knowledge 75

Evaluation of Text Document Clustering

Goal: clusters should be as similar as possible to the given classes.

Compare the clustering P* of a document set D with the given classes L*:

Precision(P, L) := |P ∩ L| / |P|

Purity(P*, L*) := Σ_{P ∈ P*} (|P| / |D|) · max_{L ∈ L*} Precision(P, L)

InversePurity(P*, L*) := Σ_{L ∈ L*} (|L| / |D|) · max_{P ∈ P*} Precision(L, P)

[Figure: a clustering P* with 60 clusters compared against 46 given classes L*]
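A sketch of purity and inverse purity as defined above, with clusters and classes given as lists of document-id sets; the toy data is illustrative.

def precision(p, l):
    return len(p & l) / len(p)

def purity(clusters, classes, n_docs):
    # clusters, classes: lists of sets of document ids; n_docs = |D|
    return sum(len(p) / n_docs * max(precision(p, l) for l in classes)
               for p in clusters)

def inverse_purity(clusters, classes, n_docs):
    return sum(len(l) / n_docs * max(precision(l, p) for p in clusters)
               for l in classes)

clusters = [{1, 2, 3}, {4, 5}]
classes = [{1, 2}, {3, 4, 5}]
print(purity(clusters, classes, 5), inverse_purity(clusters, classes, 5))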

A. Hotho: Text Clustering with Background Knowledge 76

Different Text Clustering and Classification Datasets

Reuters-21578
• documents about finance from 1987
• 9603 training documents and 3299 test documents (ModApte split)
• binary classification on the top 50 classes

Reuters RCV1
• documents about finance from 1996/1997
• 806,791 documents categorized with respect to three controlled vocabularies:
  - 4 major topic categories
  - 10 major industry codes
  - region codes without a hierarchy
• David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li: RCV1: A New Benchmark Collection for Text Categorization Research, 2004


A. Hotho: Text Clustering with Background Knowledge 77

Different Text Clustering and Classification Datasets

20 Newsgroups
• newsgroup documents on different topics like sports, cs, …
• 20,000 documents
• 20 classes, every class contains 1000 documents

OHSUMED Corpus
• OHSUMED (TREC-9), titles and abstracts from medical journals, 1987
• 36,369 training documents and 18,341 test documents
• binary classification on the top 50 classes (MeSH classifications)

FAODOC Corpus
• documents about agricultural information
• 1501 documents within 21 categories

Dr. Andreas Hotho

Ontology Learning

Thanks to Philipp Cimiano for the slides

A. Hotho: Text Clustering with Background Knowledge 79

Motivation for Ontology Learning

• High cost of modelling ontologies.
• Typically, ontologies are domain dependent.

• Idea: learn from existing domain data?
• Which data?
  - legacy data (XML or DB schema) => lifting
  - texts?
  - images?

• In this lecture we will discuss some ideas for ontology learning from text data using knowledge discovery techniques.

A. Hotho: Text Clustering with Background Knowledge 80

Learning ontologies from texts

Problems: Bridge the gap between symbol and concept/ontology level

Knowledge is rarely mentioned explicitly in texts.

[Figure: authors write texts based on a shared world model; ontology learning is the reverse engineering step from the texts back to that model]


A. Hotho: Text Clustering with Background Knowledge 81

Some Current Work on OL from Text

Terms, Synonyms & Classes
• statistical analysis
• patterns
• (shallow) linguistic parsing
• term disambiguation & compositional interpretation

Taxonomies
• statistical analysis & clustering (e.g. FCA)
• patterns
• (shallow) linguistic parsing
• WordNet

Relations
• anonymous relations (e.g. with association rules)
• named relations (linguistic parsing)
• (linguistic) compound analysis
• web mining, social network analysis

Definitions
• (linguistic) compound analysis (incl. WordNet)

Overview of current work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini: Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.

A. Hotho: Text Clustering with Background Knowledge 82

The Ontology Learning Layer Cake

Terms: river, country, nation, city, capital, ...
Synonyms: {country, nation}
Concepts: country := c := ⟨i(c), [c], Ref_C(c)⟩
Concept Hierarchy: capital ≤_C city, city ≤_C inhabited geo. area
Relations: flow_through(dom: river, range: geo. area)
Relation Hierarchy: capital_of ≤_R located_in
Axioms: ∀x (country(x) → ∃y (capital(y) ∧ has_capital(x,y) ∧ located_in(y,x) ∧ ∀z (capital(z) ∧ has_capital(x,z) → y = z)))
Rules / axiom schemata: disjoint(river, mountain)

A. Hotho: Text Clustering with Background Knowledge 83

Tools - Axioms

[Table: ontology learning systems and the layers they cover (terms, synonyms, concept formation, concept hierarchy, relations, relation hierarchy, axiom schemata, general axioms): Text2Onto (AIFB, Univ. Karlsruhe), AEON, HASTI (Amir Kabir Univ. Tehran), OntoBasis (CNTS, Univ. Antwerpen), ASIUM / Mo'K (Univ. de Paris-Sud), OntoLearn (Univ. di Roma), ATRACT (Univ. of Salford), Parmenides (Univ. Zürich), CBC (ISI, USC), DIRT, DODDLE (Keio Univ.), PMI-IR (NRC-CNRC), TextToOnto++, OntoLT / RelExt (DFKI), and a system from the Economic Univ. Prague; cells are marked with X, "clusters", "labels" or "int." depending on the kind of support]

A. Hotho: Text Clustering with Background Knowledge 84

Evaluation of Ontology Learning

The a priori approach is based on a gold standard ontology:
• given an ontology modelled by an expert, the so-called gold standard
• compare the learned ontology with the gold standard

Which methods exist: pattern-based
• learning accuracy / precision / recall / F-measure

Which methods exist: clustering-based
• problem: labels for clusters are either unknown or difficult to find

Basic idea for both:
• count edges in the "ontology graph"
• counting of direct relations only (Reinberger et al. 2005)
• least common superconcept
• semantic cotopy
• …

Evaluation via application (cf. the section on using ontologies)


A. Hotho: Text Clustering with Background Knowledge 85

Evaluation of Ontology Learning

The a posteriori approach:
• ask a domain expert for a per-concept evaluation of the learned ontology
• count three categories of concepts:
  - Correct: both in the learned and the gold ontology
  - New: only in the learned ontology, but relevant and should be in the gold standard as well
  - Spurious: useless
• compute precision = (correct + new) / (correct + new + spurious)

As a result: a posteriori evaluations are costly, BUT a posteriori evaluation by domain experts still shows very good results and is very helpful for the domain expert!

Sabou M., Wroe C., Goble C. and Mishne G.: Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

A. Hotho: Text Clustering with Background Knowledge 86

Some Knowledge Discovery Techniques for Ontology Learning

[Figure: the ontology learning layer cake (terms, synonyms, concepts, concept hierarchy, relations, relation hierarchy, axioms, rules) with the focus of today's lecture highlighted]

A. Hotho: Text Clustering with Background Knowledge 87

How do people acquire taxonomic knowledge?

I have no idea!

But people apply taxonomic reasoning!
„Never do harm to any animal!“ => „Don‘t do harm to the cat!“

More difficult questions:
• representation
• reasoning patterns

But let‘s speculate a bit! ;-)

A. Hotho: Text Clustering with Background Knowledge 88

How do people acquire taxonomic knowledge?

What is liver cirrhosis?

Mr. Smith died from liver cirrhosis.
Mr. Jagger suffers from liver cirrhosis.
Alcohol abuse can lead to liver cirrhosis.

=> prob(isa(liver cirrhosis, disease))


A. Hotho: Text Clustering with Background Knowledge 89

How do people acquire taxonomic knowledge?

What is liver cirrhosis?

Diseases such as liver cirrhosis are difficult to cure. (New York Times)

A. Hotho: Text Clustering with Background Knowledge 90

How do people acquire taxonomic knowledge?

What is liver cirrhosis?

Cirrhosis: noun [uncountable]: serious disease of the liver, often caused by drinking too much alcohol

liver cirrhosis ≈ cirrhosis ∧ isa(cirrhosis, disease) → prob(isa(liver cirrhosis, disease))

Pattern based

A. Hotho: Text Clustering with Background Knowledge 91

How do people acquire taxonomic knowledge?

Clustering based

• ……….
• The old lady loves her dog.
• The old lady loves her cat.
• The old lady loves her husband.
• ……….

[Figure: dog, cat and husband cluster together as things the lady loves]

A. Hotho: Text Clustering with Background Knowledge 92

Context Extraction

Extract syntactic dependencies from text:
⇒ verb/object, verb/subject, verb/PP relations
⇒ car: drive_obj, crash_subj, sit_in, …

LoPar, a trainable statistical left-corner parser.

[Figure: processing pipeline: Parser → tgrep → Lemmatizer → Smoothing → Weighting → FCA → Lattice → Compaction → Pruning]


A. Hotho: Text Clustering with Background Knowledge 93

Ontology Learning as Term Clustering

• Distributional Hypothesis: "Words are [semantically] similar to the extent to which they appear in similar [syntactic] contexts." [Harris 1985]
• Linguistic context can be represented in vector form.
• This allows measuring similarity w.r.t. some similarity measure (e.g. the cosine measure).
• Hierarchical clustering approaches can be used to create taxonomic structures.

[Table: example context vectors: frequency counts of car and bike for the features drive_obj, crash_into, ride_obj, sit_in]

A. Hotho: Text Clustering with Background Knowledge 94

Extracting attributes using techniques from NLP

The museum houses an impressive collection of medieval and modern art.
The building combines geometric abstraction with classical references that allude to the Roman influence on the region.

Extracted dependencies:
• house_subj(museum), house_obj(collection)
• combine_subj(museum), combine_obj(abstraction), combine_with(reference)
• allude_to(influence)

[Figure: parse tree of "The museum houses an impressive collection of modern art" (S → NP VP, VP → V NP, NP → NP PP)]

A. Hotho: Text Clustering with Background Knowledge 95

Extraction Process for Linguistic Contexts

Preprocessing:
• part-of-speech tagging
• lemmatizing
• matching regular expressions over POS tags

Extract shallow syntactic dependencies from text:
• adjective modifiers: "a nice city" → nice(city)
• prepositional phrase modifiers: "a city near the river" → near_river(city) and city_near(river)
• possessive modifiers: "the city's center" → has_center(city)
• noun phrases in subject or object position: "the city offers an exciting nightlife" → offer_subj(city) and offer_obj(nightlife)
• prepositional phrases following a verb: "the river flows through the city" → flow_through(city)
• copula constructs: "a flamingo is a bird" → is_bird(flamingo)
• verb phrases with the verb to have: "every country has a capital" → has_capital(country)
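A very small sketch of the "regular expressions over POS tags" idea, assuming NLTK's tokenizer and tagger are available; it only covers the adjective-modifier pattern ("a nice city" → nice(city)) and is not the extraction system used on the slides.

import nltk  # assumption: NLTK with 'punkt' and 'averaged_perceptron_tagger' data installed

def adjective_modifiers(sentence):
    # POS-tag the sentence and match the pattern "adjective followed by noun"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("JJ") and t2.startswith("NN"):
            pairs.append(f"{w1.lower()}({w2.lower()})")
    return pairs

print(adjective_modifiers("The old lady loves her nice city near the river."))
# e.g. ['old(lady)', 'nice(city)']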

A. Hotho: Text Clustering with Background Knowledge 96

Example

• People book hotels. The man drove the bike along the beach.

Before lemmatization:
book_subj(people), book_obj(hotels), drove_subj(man), drove_obj(bike), drove_along(beach)

After lemmatization:
book_subj(people), book_obj(hotel), drive_subj(man), drive_obj(bike), drive_along(beach)


A. Hotho: Text Clustering with Background Knowledge 97

Representation of the context of a word as feature vector

              | book_obj/ | rent_obj/ | drive_obj/ | ride_obj/ | join_obj/
              | bookable  | rentable  | driveable  | rideable  | joinable
apartment     |     X     |     X     |            |           |
car           |     X     |     X     |     X      |           |
motor-bike    |     X     |     X     |     X      |     X     |
trip          |     X     |           |            |           |     X
excursion     |     X     |           |            |           |     X

A. Hotho: Text Clustering with Background Knowledge 98

Tourism Lattice

A. Hotho: Text Clustering with Background Knowledge 99

Concept Hierarchy

[Figure: concept hierarchy derived from the lattice: bookable subsumes rentable and joinable; rentable subsumes apartment and driveable; driveable subsumes car and rideable (bike); joinable subsumes trip and excursion]

A. Hotho: Text Clustering with Background Knowledge 100

Example Clustering (Bi-Section-KMeans)

[Figure: example Bi-Section-KMeans clustering tree over apartment, car, bike, trip, excursion]

Issues:
• not easy to understand
• no formal interpretation


A. Hotho: Text Clustering with Background Knowledge 101

Agglomerative/Bottom-Up Clustering

[Figure: agglomerative clustering dendrogram over car, bus, trip, excursion, apartment]

A. Hotho: Text Clustering with Background Knowledge 102

Linkage Strategies

Complete linkage: consider the two most dissimilar elements of each of the clusters => O(n² log(n))

Average linkage: consider the average similarity of the elements in the clusters => O(n² log(n))

Single linkage: consider the two most similar elements of each of the clusters => O(n²)
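A short example (assuming SciPy) of the three linkage strategies applied to made-up context vectors for the running example terms; the numbers are illustrative only.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# rows: car, bus, trip, excursion, apartment; columns: illustrative context features
X = np.array([
    [25, 40, 0, 3],
    [20, 35, 0, 2],
    [0, 0, 12, 9],
    [0, 0, 10, 8],
    [5, 0, 0, 1],
], dtype=float)

distances = pdist(X, metric="cosine")
for method in ("single", "complete", "average"):
    # each row of the result merges two clusters at a given distance
    print(method, "\n", linkage(distances, method=method))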

A. Hotho: Text Clustering with Background Knowledge 103

Data Sets

Tourism (118 million tokens):
• http://www.all-in-all.de/english
• http://www.lonelyplanet.com
• British National Corpus (BNC)
• handcrafted tourism ontology (289 concepts)

Finance (185 million tokens):
• Reuters news from 1987
• GETESS finance ontology (1178 concepts)

A. Hotho: Text Clustering with Background Knowledge 104

Results Tourism Domain


A. Hotho: Text Clustering with Background Knowledge 105

Results in Finance Domain

A. Hotho: Text Clustering with Background Knowledge 106

Results Tourism Domain

A. Hotho: Text Clustering with Background Knowledge 107

Results in Finance Domain

A. Hotho: Text Clustering with Background Knowledge 108

Summary

                          Effectiveness      Efficiency     Traceability
FCA                       43.81 / 41.02 %    O(2^n)         Good
Agglomerative Clustering  38.57 / 32.15 %    O(n²)          Fair
                          36.55 / 32.92 %    O(n² log(n))
                          36.78 / 33.35 %    O(n² log(n))
Divisive Clustering       36.42 / 32.77 %    O(n²)          Weak-Fair


A. Hotho: Text Clustering with Background Knowledge 109

TextToOnto & FCA

A. Hotho: Text Clustering with Background Knowledge 110

Text2Onto

Ontology learning framework developed at AIFB.
Algorithms for extracting …
• concepts, instances
• subclass-of / instance-of relations
• non-taxonomic / subtopic-of relations
• disjointness axioms

Incremental ontology learning
Independent of the concrete ontology language

A. Hotho: Text Clustering with Background Knowledge 111

Experimental results

• Formal Concept Analysis yields better concept hierarchies than similarity-based clustering algorithms,

• The results of FCA are easier to understand (intensional description of concepts!),

• Bi-Section-KMeans is most efficient (O(n²)),

• Though FCA is exponential in the worst case, it shows a favourable runtime behaviour (sparsely populated formal contexts).

A. Hotho: Text Clustering with Background Knowledge 112

Other Clustering Approaches

Bottom-up / agglomerative:
• (ASIUM system) Faure and Nedellec 1998
• Caraballo 1999
• (Mo‘K Workbench) Bisson et al. 2000

Other:
• Hindle 1990
• Pereira et al. 1993
• Hovy et al. 2000


A. Hotho: Text Clustering with Background Knowledge 113

Ontology Learning References

• Reinberger, M.-L., & Spyns, P. (2005). Unsupervised text mining for the learning of dogma-inspired ontologies. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

• Philipp Cimiano, Andreas Hotho, Steffen Staab: Comparing Conceptual, Divise and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004: 435-439

• P. Cimiano, A. Pivk, L. Schmidt-Thieme and S. Staab, Learning Taxonomic Relations from Heterogenous Evidence. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

• Sabou M., Wroe C., Goble C. and Mishne G.: Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

• Alexander Maedche, Ontology Learning for the Semantic Web, PhD Thesis, Kluwer, 2001.

• Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)

• Alexander Maedche, Steffen Staab: Ontology Learning. Handbook on Ontologies 2004: 173-190

• M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of semantic relations between concepts of a molecular biology ontology. IJCAI, 659ff.

• A. Schutz, P. Buitelaar. RelExt: A Tool for Relation Extraction from Text in Ontology Extension. ISWC 2005.

• Faure, D., & Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology. In Velardi, P. (Ed.), Proceedings of the LREC Workshop on Adapting lexical and corpus resources to sublanguages and applications, pp. 5–12.

• Michele Missikoff, Paola Velardi, Paolo Fabriani: Text Mining Techniques to Automatically Enrich a Domain Ontology. Applied Intelligence 18(3): 323-340 (2003).

• Gilles Bisson, Claire Nedellec, Dolores Cañamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. ECAI Workshop on Ontology Learning 2000

Dr. Andreas Hotho

Text Clustering

A. Hotho: Text Clustering with Background Knowledge 115

Motivation

• Challenge: browse, search and organize the huge amount of unstructured text documents available on the internet or in company intranets
  - huge sets of documents on the internet
  - portals like Yahoo.com, DMoz.org, Web.de are manually structured
  - meta search engines like Vivisimo.com use clustering techniques to structure the search results

• Advantage: the structure and the visualization of the information provided by the clustering help the user to work with a larger amount of information

A. Hotho: Text Clustering with Background Knowledge 116

Motivation


A. Hotho: Text Clustering with Background Knowledge 117

Text Clustering

[…] the partitioning of texts into previously unseen categories […]
A. Hotho et al., SIGIR 2003 Semantic Web Workshop

Automatic text clustering uses full-text vector representations of text documents, as in information retrieval, within standard clustering algorithms.

A. Hotho: Text Clustering with Background Knowledge 118

Motivation: Overall Process

[Figure: objects are mapped to a representation (e.g. term counts for Obj1..Obj4 over features such as morgens, abends, team, baseman, oman, discount, oil, crude); a similarity measure / distance function feeds the cluster algorithm, which produces clusters plus an explanation; background knowledge can support representation, similarity and explanation]

A. Hotho: Text Clustering with Background Knowledge 119

Motivation: requirements on the clustering methods

Efficient
• results should also be available on large data sets or on ad-hoc collections, e.g. from search engines

Effective
• the cluster result must be correct

Explanatory power
• the results of the clustering process must be understandable

User interaction and subjectivity
• the user has his own idea of the clustering goal and wants to integrate this into the clustering process

A. Hotho: Text Clustering with Background Knowledge 120

Text Clustering with Background Knowledge

Design decisions:
• choose a representation → bag of terms (details on the next slide)
• similarity measure → cosine
• clustering algorithm → Bi-Section-KMeans (a variant of KMeans)
• data set → Reuters for our studies (min15, max100)


A. Hotho: Text Clustering with Background Knowledge 121

Preprocessing steps

docid  term1  term2  term3  ...
doc1   0      0      1
doc2   2      3      1
doc3   10     0      0
doc4   2      23     0
...

• build a bag-of-words model
• extract word counts (term frequencies)
• remove stopwords
• pruning: drop words with less than e.g. 30 occurrences
• weighting of document vectors with tfidf (term frequency - inverted document frequency):

tfidf(t,d) = log(tf(t,d) + 1) · log(|D| / df(t))

|D|     number of documents
df(t)   number of documents d which contain term t

A. Hotho: Text Clustering with Background Knowledge 122

Ontology

An ontology O represents the background knowledge. The core ontology consists of:
• a set of concepts C
• a concept hierarchy or taxonomy ≤_C
• a lexicon Lex

[Figure: example ontology: Root subsumes Person, Publication, Project and Topic; Person subsumes AcademicStaff and Student (with PhDStudent); Publication subsumes Article and Book; Topic subsumes ResearchTopic with subtopics such as KnowledgeManagement and DistributedOrganization; the lexicon maps labels to concepts, e.g. DE:Wissensmanagement and EN:Knowledge Management to the concept KnowledgeManagement]

A. Hotho: Text Clustering with Background Knowledge 123

WordNet as ontology
• 109,377 concepts (synsets)
• 144,684 lexical entries

[Figure: WordNet entries around "oil": as a noun, oil (related senses crude oil, oil color, oil paint) sits below lipid, organic compound, chemical compound, substance, physical object, entity, Root; as a verb, oil/anoint (EN:anoint, EN:inunct) means to cover with oil or to bless; further nodes shown include covering, coating, paint, cover]

Use of superconcepts (hypernyms in WordNet)
• exploit more general concepts
• example: chemical compound is the 3rd superconcept of oil
• "prune" unimportant superconcepts with tfidf

Word sense disambiguation strategies: all, first, context
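A sketch (assuming NLTK's WordNet interface) of looking up the hypernyms ("superconcepts") of a term, as used when building the concept vectors; the "first" disambiguation strategy simply takes the first synset, and the printed path is only an example.

from nltk.corpus import wordnet as wn  # assumption: NLTK with the WordNet data installed

def superconcepts(word, depth=5):
    # "first" strategy: take the first noun synset of the word
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    concept, result = synsets[0], []
    for _ in range(depth):
        hypernyms = concept.hypernyms()
        if not hypernyms:
            break
        concept = hypernyms[0]      # follow one hypernym path upwards
        result.append(concept.name())
    return result

print(superconcepts("oil"))  # e.g. ['lipid.n.01', ...]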

A. Hotho: Text Clustering with Background Knowledge 124

Reuters texts

Doc 17892 (class: crude)
=============

Oman has granted term crude oil customers retroactive discounts from official prices of 30 to 38 cents per barrel on liftings made during February, March and April, the weekly newsletter Middle East Economic Survey (MEES) said. MEES said the price adjustments, arrived at through negotiations between the Omani oil ministry and companies concerned, are designed to compensate for the difference between market-related prices and the official price of 17.63 dlrs per barrel adopted by non-OPEC Oman since February. REUTER


A. Hotho: Text Clustering with Background Knowledge 125

Ontology-based representation

(1) Original text: Oman has granted term crude oil customers retroactive discounts ...
(2) Term vector: Oman 2, granted 1, term 1, crude 1, oil 2, customer 1, retroactive 1, discount 1, ...
(3) Concept-enriched vector: Oman 2, granted 1, term 1, crude 1, oil 2, lipid 2, compound 2, customer 1, retroactive 1, discount 1, ...

different strategies: add, replace, only
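A sketch of the three integration strategies; the mapping from terms to concepts (e.g. WordNet synsets and hypernyms) is abstracted into a term_to_concepts dictionary:

def integrate(term_counts, term_to_concepts, strategy="add"):
    """term_counts: dict term -> frequency. Returns the concept-enriched vector.

    add:     keep all term frequencies and add the concept frequencies
    replace: like add, but drop terms for which at least one concept was found
    only:    use the concept frequencies alone
    """
    vector = {}
    for term, freq in term_counts.items():
        for c in term_to_concepts.get(term, []):
            vector[c] = vector.get(c, 0) + freq
    if strategy == "only":
        return vector
    for term, freq in term_counts.items():
        if strategy == "replace" and term_to_concepts.get(term):
            continue                     # term is covered by a concept -> drop it
        vector[term] = vector.get(term, 0) + freq
    return vector

# integrate({"oil": 2, "oman": 2}, {"oil": ["lipid", "compound"]}, "add")
# -> {"lipid": 2, "compound": 2, "oil": 2, "oman": 2}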

A. Hotho: Text Clustering with Background Knowledge 126

Bi-Partitioning K-Means

Input: set of documents D, number of clusters k
Output: k clusters that exhaustively partition D

Initialize: P* = {D}

Outer loop: repeat k-1 times: bi-partition the largest cluster E ∈ P*

  Inner loop: randomly choose two documents from E as initial e1, e2
    Repeat until convergence is reached:
      Assign each document from E to the nearest of the two ei; thus split E into E1, E2
      Re-compute e1, e2 as the centroids of the document representations assigned to them

  P* := (P* \ {E}) ∪ {E1, E2}
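A compact sketch of this algorithm with cosine similarity on dense numpy vectors (the fixed number of inner iterations stands in for the convergence test):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def bisection_kmeans(X, k, iters=20, seed=0):
    """X: (n_docs, n_features) array. Returns a list of index arrays, one per cluster."""
    rng = np.random.default_rng(seed)
    partition = [np.arange(len(X))]
    for _ in range(k - 1):
        # outer loop: pick the largest cluster E and bi-partition it
        largest = max(range(len(partition)), key=lambda i: len(partition[i]))
        E = partition.pop(largest)
        centroids = X[rng.choice(E, size=2, replace=False)].astype(float)
        for _ in range(iters):
            # inner loop: assign each document of E to the nearest centroid ...
            sims = np.array([[cosine(X[i], c) for c in centroids] for i in E])
            assign = sims.argmax(axis=1)
            # ... and recompute the centroids of the two halves
            for j in (0, 1):
                members = E[assign == j]
                if len(members):
                    centroids[j] = X[members].mean(axis=0)
        partition.extend([E[assign == 0], E[assign == 1]])
    return partition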

A. Hotho: Text Clustering with Background Knowledge 127

Evaluation of Text Clustering

[Chart: average purity (approx. 0.51 to 0.63) with and without background knowledge, varying the disambiguation strategy (context, all), the hypernym depth (0, 5) and pruning (30)]

Evaluation parameters:
• min 15, max 100, 2619 documents
• k = 60 clusters
• tfidf weighting
• term and concept vector
• varied: background knowledge, depth, disambiguation

A. Hotho: Text Clustering with Background Knowledge 128

Evaluation of Text Clustering

[Chart: average purity for all parameter combinations - background knowledge (false/true), hypernym depth (0, 5), disambiguation (context, first, all), integration strategy (add, replace, only), weighting (tfidf vs. none), prune 30; CLUSTERCOUNT 60, EXAMPLE 100, MINCOUNT 15. Best values 0.618 and 0.616 with background knowledge versus 0.570 without]

Evaluation parameters:
• min 15, max 100, 2619 documents
• k = 60 clusters


A. Hotho: Text Clustering with Background Knowledge 129

Evaluation of Text Clustering

Backgr. | depth | integr. | Mean PURITY  | Mean INVPURITY
false   |   -   |    -    | 0.570 ±0.019 | 0.479 ±0.016
true    |   0   |   add   | 0.585 ±0.014 | 0.492 ±0.017
true    |   0   |   only  | 0.603 ±0.019 | 0.504 ±0.021
true    |   5   |   add   | 0.618 ±0.015 | 0.514 ±0.019
true    |   5   |   only  | 0.593 ±0.010 | 0.500 ±0.016

Evaluation parameters:
• min 15, max 100, 2619 documents
• k = 60 clusters
• disambiguation = context
• prune = 30
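Purity and inverse purity, as reported in the table above, can be computed like this (a sketch assuming clusters and classes are given as dicts of document-id sets):

def purity(clusters, classes, n_docs):
    """clusters, classes: dicts mapping a label to a set of document ids."""
    return sum(max(len(docs & cls) for cls in classes.values())
               for docs in clusters.values()) / n_docs

def inverse_purity(clusters, classes, n_docs):
    # the same measure with the roles of clusters and classes exchanged
    return purity(classes, clusters, n_docs)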

A. Hotho: Text Clustering with Background Knowledge 130

Variance analysis of the Reuters classes

Idea: Ideally, documents of one class should have the same representation (variance = 0). If the representation of the documents is changed, the variance will also change.

Analysis:
• Compare the variance of the classes for both representations (with and without ontology)
• Compare the purity per class
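The per-class variance used in this analysis can be sketched as the mean squared distance of a class's document vectors to their centroid (numpy assumed):

import numpy as np

def class_variance(X, labels):
    """X: (n_docs, n_features) array; labels: array of class names, one per document."""
    variances = {}
    for cls in np.unique(labels):
        members = X[labels == cls]
        centroid = members.mean(axis=0)
        variances[cls] = float(((members - centroid) ** 2).sum(axis=1).mean())
    return variances

# comparing class_variance(X_terms, y) with class_variance(X_concepts, y) shows how the
# ontology-based representation changes the homogeneity of each Reuters class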

A. Hotho: Text Clustering with Background Knowledge 131

Variance analysis

[Chart: variance and purity per class for PRC-min15-max100 (depth 5, add strategy). X-axis: the Reuters classes (earn, pet-chem, meal-feed, ship, lead, jobs, ..., gold, rubber, heat, cotton); y-axis: percentage difference from -30% to 60%; series: variance difference, purity difference, and a linear trend of the purity difference]

A. Hotho: Text Clustering with Background Knowledge 132

Conclusion

Background knowledge helps to improve clustering results: similar terms in two documents may contribute to a good similarity rating if they are related via WordNet synsets or hypernyms.

Adding background knowledge is not beneficial per se

It has to be combined with
• term and concept weighting
• word sense disambiguation


A. Hotho: Text Clustering with Background Knowledge 133

Conclusion and Outlook

Ontologies provide the background knowledge for
• clustering of text/web documents, to achieve better clustering results
• describing text clusters, to make the descriptions more understandable
For more details see [Hotho et al. 2003]

Some possible improvements:
• include more aspects of WordNet, e.g. adjectives
• use domain-specific ontologies, e.g. AGROVOC

Dr. Andreas Hotho

Text Clustering with FCA

A. Hotho: Text Clustering with Background Knowledge 135

Introduction: Clustering

case | sex | glasses | moustache | smile | hat
  1  |  m  |    y    |     n     |   y   |  n
  2  |  f  |    n    |     n     |   y   |  n
  3  |  m  |    y    |     n     |   n   |  n
  4  |  m  |    n    |     n     |   n   |  n
  5  |  m  |    n    |     n     |   y?  |  n
  6  |  m  |    n    |     y     |   n   |  y
  7  |  m  |    y    |     n     |   y   |  n
  8  |  m  |    n    |     n     |   y   |  n
  9  |  m  |    y    |     y     |   y   |  n
 10  |  f  |    n    |     n     |   n   |  n
 11  |  m  |    n    |     y     |   n   |  n
 12  |  f  |    n    |     n     |   n   |  n

A. Hotho: Text Clustering with Background Knowledge 136

Introduction to Formal Concept Analysis


A. Hotho: Text Clustering with Background Knowledge 137

Extracted Word/Concept lists

A. Hotho: Text Clustering with Background Knowledge 138

Motivation for an Explanation of Clustering Results

Starting Point:

How do people describe a group of documents/objects?

• general and specific words are used

• background knowledge provides the general words

• background knowledge could help to find links between important but rare words of a text document

A. Hotho: Text Clustering with Background Knowledge 139

Introduction to Formal Concept Analysis

Formal Concept Analysis [Wille 1982] makes it possible to generate and visualize concept hierarchies.

FCA models concepts as units of thought, consisting of two parts:
• The extension consists of all objects belonging to the concept.
• The intension consists of all attributes common to all those objects.

A. Hotho: Text Clustering with Background Knowledge 140

Introduction to Formal Concept Analysis

             | bank | financ | market | team | baseman | season
FinanceText1 |  X   |   X    |   X    |      |         |
FinanceText2 |  X   |   X    |   X    |      |         |
SportText1   |      |        |        |  X   |    X    |   X
SportText2   |      |        |        |  X   |    X    |   X

Example: texts from the WWW


A. Hotho: Text Clustering with Background Knowledge 141

Introduction to Formal Concept Analysis

(formal context: the table from the previous slide, where the rows are the objects and the columns are the attributes)

Definition: A formal context is a triple (G, M, I), where
• G is a set of objects,
• M is a set of attributes,
• and I is a relation between G and M.
• (g, m) ∈ I is read as "object g has attribute m".
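For a toy context like the one above, all formal concepts can be computed naively by closing the object intents under intersection (a sketch; real FCA tools use far more efficient algorithms):

from itertools import combinations

def formal_concepts(context, attributes):
    """context: dict object -> set of attributes. Returns (extent, intent) pairs."""
    # every concept intent is an intersection of object intents (or the full attribute set M)
    intents = {frozenset(attributes)}
    for r in range(1, len(context) + 1):
        for objs in combinations(context, r):
            intents.add(frozenset.intersection(*(frozenset(context[o]) for o in objs)))
    concepts = []
    for intent in intents:
        extent = {g for g, attrs in context.items() if intent <= set(attrs)}
        concepts.append((extent, set(intent)))
    return concepts

context = {
    "FinanceText1": {"bank", "financ", "market"},
    "FinanceText2": {"bank", "financ", "market"},
    "SportText1":   {"team", "baseman", "season"},
    "SportText2":   {"team", "baseman", "season"},
}
# formal_concepts(context, {"bank", "financ", "market", "team", "baseman", "season"})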

A. Hotho: Text Clustering with Background Knowledge 142

Introduction to Formal Concept Analysis

Concept lattice

(formal context as on the previous slides)

A. Hotho: Text Clustering with Background Knowledge 143

Introduction to Formal Concept Analysis

             | bank | financ | market | american | team | baseman | season
FinanceText1 |  X   |   X    |   X    |    X     |      |         |
FinanceText2 |  X   |   X    |   X    |          |      |         |
SportText1   |      |        |        |          |  X   |    X    |   X
SportText2   |      |        |        |    X     |  X   |    X    |   X

A. Hotho: Text Clustering with Background Knowledge 144

Introduction to Formal Concept Analysis


A. Hotho: Text Clustering with Background Knowledge 145

FCA text clustering

• preprocess text documents

• extract a description for all documents

• calculate FCA lattice

• visualize lattice

A. Hotho: Text Clustering with Background Knowledge 146

Motivation: Overall Process

[Diagram: overall process - objects → representation of objects (a table with attributes such as morning, evening, team, baseman and counts for Obj1 to Obj4) → similarity measure / distance function → cluster algorithm (FCA) → explanation (FCA)]

A. Hotho: Text Clustering with Background Knowledge 147

Example corpus

• 21 documents collected from the internet

• 3 categories: soccer, finance and software

• 1419 different word stems, of which 253 are stopwords

A. Hotho: Text Clustering with Background Knowledge 148

Lattice for 21 documents with 117 terms (θ = 15%)


A. Hotho: Text Clustering with Background Knowledge 149

Extraction of cluster descriptions

• The lattice with all terms/concepts is too large to act as a basis for a description

selection of the most important terms/concepts

• Approach: introduce a threshold θ and remove all terms of the document vector with a value smaller than θ (e.g. θ = 25% of the max value); a sketch follows after the tables below

Term weights:
             | bank | financ | market | team | baseman | season
FinanceText1 |  1   |   2    |   1    |      |         |
FinanceText2 |  2   |   2    |   1    |      |         |
SportText1   |      |        |        |  1   |    3    |   2
SportText2   |      |        |        |  1   |    2    |   2

After thresholding, the binary formal context shown above is obtained.
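A sketch of the thresholding step (here θ is interpreted relative to the maximum weight of each document vector, as in the 25% example):

def to_formal_context(doc_vectors, theta=0.25):
    """doc_vectors: dict doc -> dict term -> weight. Keeps terms whose weight
    reaches theta * (maximum weight of that document) as binary attributes."""
    context = {}
    for doc, weights in doc_vectors.items():
        limit = theta * max(weights.values())
        context[doc] = {t for t, w in weights.items() if w >= limit}
    return context

# to_formal_context({"FinanceText1": {"bank": 1, "financ": 2, "market": 1}}, theta=0.25)
# -> {"FinanceText1": {"bank", "financ", "market"}}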

A. Hotho: Text Clustering with Background Knowledge 150

Lattice with θ = 80%

A. Hotho: Text Clustering with Background Knowledge 151

Lattice with θ = 45%

A. Hotho: Text Clustering with Background Knowledge 152

Lattice with manually selected terms


A. Hotho: Text Clustering with Background Knowledge 153

Lessons learned

• results are not really good
• the lattice is too fine-grained
• the lattice is difficult to interpret
• the absence/presence of a term in a document description usually results in a totally different lattice
• use clustering approaches like KMeans to reduce these effects

A. Hotho: Text Clustering with Background Knowledge 154

Motivation: Overall Process

[Diagram: overall process as before - objects → representation of objects → similarity measure / distance function → cluster algorithm → explanation]

A. Hotho: Text Clustering with Background Knowledge 155

Visualization of Bi-Sec-K-Means clustering results

• Compute 10 Bi-Sec-KMeans clusters
• Extract a term description
• Compute the lattice
• Visualize the lattice

A. Hotho: Text Clustering with Background Knowledge 156

Result for 10 clusters


A. Hotho: Text Clustering with Background Knowledge 157

Result for the same terms, but not based on the clusters

A. Hotho: Text Clustering with Background Knowledge 158

Motivation: Overall Process

[Diagram: overall process as before, now with background knowledge feeding into the representation of objects]

A. Hotho: Text Clustering with Background Knowledge 159

Extracted Word/Concept lists

A. Hotho: Text Clustering with Background Knowledge 160

Combine FCA & Standard Text-clustering

• preprocess the Reuters documents and enrich them with background knowledge (WordNet)

• calculate a reasonable number k (100) of clusters with BiSec-KMeans using cosine similarity

• extract a description for all clusters
• relate the clusters (objects) with FCA
• use the visualization of the concept lattice for better understanding


A. Hotho: Text Clustering with Background Knowledge 161

Extracting Cluster Descriptions

Using all concepts (synsets) as attributes for FCA results in a concept lattice that is too large

select the important ones

Approach: introduce two thresholds θ1, θ2:
• for every centroid, drop all concepts (synsets) with a value lower than θ1
• mark all concepts (synsets) between θ1 and θ2 with "m" and those above θ2 with "h"
• we chose θ1 = 7% and θ2 = 20% of the max value
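A sketch of this two-threshold scaling of a cluster centroid; encoding the m/h marks as suffixed attribute names (e.g. "oil_m", "oil_h") is just one possible way to feed them into FCA:

def scale_centroid(centroid, theta1=0.07, theta2=0.20):
    """centroid: dict concept -> weight. Returns the set of FCA attributes."""
    top = max(centroid.values())
    attributes = set()
    for concept, w in centroid.items():
        if w < theta1 * top:
            continue                         # drop unimportant concepts
        attributes.add(concept + "_m")       # "m": weight of at least theta1 * max
        if w >= theta2 * top:
            attributes.add(concept + "_h")   # "h": weight of at least theta2 * max
    return attributes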

A. Hotho: Text Clustering with Background Knowledge 162

Result

A. Hotho: Text Clustering with Background Knowledge 163

Result

A. Hotho: Text Clustering with Background Knowledge 164

Result

[Lattice excerpt: a chain of concepts with increasing specificity - compound, chemical compound → oil → crude oil, barrel and palm oil]


A. Hotho: Text Clustering with Background Knowledge 165

Similar example

[Lattice excerpt: a chain of concepts with increasing specificity - compound, chemical compound → oil → refiner]

A. Hotho: Text Clustering with Background Knowledge 166

Results

[Lattice excerpt: crude oil, barrel]

A. Hotho: Text Clustering with Background Knowledge 167

Results

[Lattice excerpt: resin, palm]

• The resulting concept lattice can also be interpreted as a concept hierarchy directly on the documents

• All documents in one cluster obtain exactly the same description

A. Hotho: Text Clustering with Background Knowledge 168

Results: multi-topic cluster

[Lattice excerpt for the multi-topic cluster CL8: pork, meat, ..., music, coffee, food, beverage]

• The BiSec-KMeans results are bad for this cluster

• FCA helps to identify such inconsistencies


A. Hotho: Text Clustering with Background Knowledge 169

Formal Concept Analysis (FCA) for Providing Cluster Descriptions

Apply FCA to the clusters generated by Bi-Section-KMeans: embed the clusters into a lattice structure
• Clusters are the objects
• Terms and concepts are the attributes

FCA provides two achievements:
• Intensional descriptions of the clusters are generated, exploiting concepts from the background knowledge
• Interactive exploration of the document collection is supported: browse the lattice structure, zoom into interesting parts

A. Hotho: Text Clustering with Background Knowledge 170

Conclusion and Outlook

• FCA allows a more understandable explanation of ontology-enriched (KMeans) text clusters

• Clustering of text/web documents with an ontology achieves the best clustering results

More details in: WordNet improves Text Document Clustering, Semantic Web Workshop at SIGIR, Hotho et al. 2003

• Some possible improvements:
  - include more aspects of WordNet, e.g. adjectives
  - use domain-specific ontologies, e.g. AGROVOC
  - use more sophisticated means for feature selection within FCA

Dr. Andreas Hotho

Text Classification

A. Hotho: Text Clustering with Background Knowledge 172

Text Classification

Text Classification (Text Categorization)

Text categorization (TC - a.k.a. text classification, or topic spotting), [is] the activity of labeling natural language texts with thematic categories from a predefined set […].

F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys 34(1), 2002.

Automatic text classification uses full-text vector representations of text documents, as in information retrieval, within standard classification algorithms.


A. Hotho: Text Clustering with Background Knowledge 173

Text Classification Approaches

[Diagram: documents → bag-of-words representation ("oman has granted ..." with counts per object), enriched with background knowledge → classification algorithm (AdaBoost)]

A. Hotho: Text Clustering with Background Knowledge 174

Conceptual Document Representation

Let's extract some concepts...

Detecting the appropriate set of concepts from an ontology (O, Lex) requires multiple steps:

1. Candidate Term Detection
2. Morphological Transformations
3. Word Sense Disambiguation
4. Generalization

A. Hotho: Text Clustering with Background Knowledge 175

Conceptual Document Representation: Candidate Term Detection

Querying the lexicon directly for each single word will not do the trick! (Remember the multi-word expressions!)

Solution: Move a window of maximum size over the text and decrease the window size if the lookup is unsuccessful, before moving on.

But querying the lexicon for every candidate term window produces a lot of overhead!

Solution: Avoid unnecessary lexicon queries by matching the POS tags in the window against appropriately defined syntactic patterns (e.g. noun phrases).
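A sketch of the shrinking-window lookup (the POS-pattern filter from the second solution is left out, and tokens are assumed to be already normalized):

def candidate_terms(tokens, lexicon, max_window=3):
    """tokens: list of word forms; lexicon: set of single- and multi-word entries."""
    i, found = 0, []
    while i < len(tokens):
        for size in range(max_window, 0, -1):        # try the largest window first
            candidate = " ".join(tokens[i:i + size])
            if candidate in lexicon:
                found.append(candidate)
                i += size                            # continue behind the matched expression
                break
        else:
            i += 1                                   # nothing matched at this position
    return found

# candidate_terms(["crude", "oil", "customers"], {"crude oil", "customer"}) -> ["crude oil"]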

A. Hotho: Text Clustering with Background Knowledge 176

AdaBoost

Boosting is a relatively young and very successful machine learning technique.

Boosting algorithms build so-called ensemble classifiers (meta classifiers):

1. Build many very simple "weak" classifiers.
2. Combine the weak learners in an additive model.


A. Hotho: Text Clustering with Background Knowledge 177

AdaBoost

AdaBoost maintains weights Dt over the training instances.

At each iteration t: choose the base classifier ht that performs best on the weighted training instances.

Calculate the weight parameter αt based on the performance of the base classifier: higher errors lead to smaller weights, smaller errors to higher weights.

The weight update increases (decreases) the weights of wrongly (correctly) classified instances.

Thereby, AdaBoost focuses on the "hard" training instances.
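A compact sketch of these update rules for labels in {-1, +1}; the greedy one-feature decision stump below is only an illustrative weak learner, not the exact setup used in the experiments:

import numpy as np

def stump_predict(X, feature, threshold, sign):
    return sign * np.where(X[:, feature] > threshold, 1, -1)

def best_stump(X, y, D):
    # weak learner: the stump with the smallest weighted error under distribution D
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                err = D[stump_predict(X, f, thr, sign) != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def adaboost(X, y, rounds=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                                 # weights over the training instances
    ensemble = []
    for _ in range(rounds):
        err, f, thr, sign = best_stump(X, y, D)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # small error -> large alpha
        h = stump_predict(X, f, thr, sign)
        D *= np.exp(-alpha * y * h)                         # raise weights of misclassified instances
        D /= D.sum()
        ensemble.append((alpha, f, thr, sign))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(X, f, thr, s) for a, f, thr, s in ensemble))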

A. Hotho: Text Clustering with Background Knowledge 178

Evaluation

Datasets:

Reuters-21578
• documents about finance from 1987
• 9,603 training documents and 3,299 test documents (ModApte split)
• binary classification on the top 50 classes

OHSUMED corpus
• OHSUMED (TREC-9), titles and abstracts from medical journals, 1987
• 36,369 training documents and 18,341 test documents
• binary classification on the top 50 classes (MeSH classifications)

FAODOC corpus
• documents about agricultural information
• 1,501 documents within 21 categories

A. Hotho: Text Clustering with Background Knowledge 179

Evaluation: Reuters Results

• Top 50 Reuters classes with 17,525 term stems / 10,259 - 27,236 synset features

A. Hotho: Text Clustering with Background Knowledge 180

Evaluation: OHSUMED Results

Top 50 classes with WordNet


A. Hotho: Text Clustering with Background Knowledge 181

Evaluation: OHSUMED Results

Relative improvement on the top 50 classes with WordNet

A. Hotho: Text Clustering with Background Knowledge 182

Evaluation: OHSUMED Results

Relative improvement on the top 50 classes with the MeSH ontology (~22,000 concepts, all strategy)

A. Hotho: Text Clustering with Background Knowledge 183

Evaluation: Reuters Results

• Top 50 Reuters classes with 17,525 term stems / 10,259 - 27,236 synset features

A. Hotho: Text Clustering with Background Knowledge 184

Evaluation: Reuters Results

Relative improvement on the top 50 classes

Page 47: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

A. Hotho: Text Clustering with Background Knowledge 185

Evaluation: FAODOC Results

A. Hotho: Text Clustering with Background Knowledge 186

Evaluation: FAODOC Results

A. Hotho: Text Clustering with Background Knowledge 187

Conclusion and Outlook

• Successful integration of conceptual features to improve classification performance

• Generalization does improve classification results in most cases

A. Hotho: Text Clustering with Background Knowledge 188

Conclusion and Outlook

• Advanced Generalization Strategies

• Development of additional weak learner plugins that exploit ontologies more directly

• Heuristics for efficient handling of continuous feature values like TFIDF in AdaBoost

• Multilingual Text Classification

Page 48: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

Dr. Andreas Hotho

Application driven Evaluation of Ontology Learning

A. Hotho: Text Clustering with Background Knowledge 190

Ontology Learning

• Until now we used manually engineered ontologies.
• Large ontologies are not available for every domain.
• Building such ontologies requires a big effort.

• Idea: Learn Ontologies from text

A. Hotho: Text Clustering with Background Knowledge 191

Ontologies: Semantic Structures

Ontologies: MeSH Tree Structures, WordNet → conceptual feature representation

[Diagram: ontology learning - term vectors + linguistic context vectors → term clustering → concept vectors and "learned" ontology structures]
[Maedche & Staab 2001] [Cimiano et al. ECAI 2004] [Cimiano et al. JAIR 2005]

Why?
• knowledge acquisition bottleneck
• adaptation to the domain context
• just have some fun trying something weird

A. Hotho: Text Clustering with Background Knowledge 192

Ontology Learning as Term Clustering

Hierarchical clustering approaches, e.g.
• agglomerative (bottom-up) clustering
• Bi-Section-KMeans clustering

are used to create taxonomic structures (concept hierarchies); a minimal sketch follows at the end of this slide.

Quality of learned semantic structures is surprisingly high.

[Cimiano et al. ECAI 2004]

Deficiencies (?):
• taxonomic relations mix with synonyms and other relations
• binary splits
• superconcepts - i.e. clusters - are not mapped to lexical entries
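As a minimal sketch of taxonomy learning by agglomerative clustering of term context vectors (scipy assumed; the terms and context counts are a toy example, not data from the experiments):

import numpy as np
from scipy.cluster.hierarchy import linkage

terms = ["oil", "gas", "coffee", "cocoa"]
contexts = np.array([                 # toy counts of linguistic contexts per term
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 3, 2],
], dtype=float)

# agglomerative (bottom-up) clustering with cosine distance and average linkage;
# the resulting merge tree is read as an (unlabeled) concept hierarchy over the terms
Z = linkage(contexts, method="average", metric="cosine")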

Page 49: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

A. Hotho: Text Clustering with Background Knowledge 193

Learned Ontology: kmeans 7000

A. Hotho: Text Clustering with Background Knowledge 194

Evaluation Setting: Ontologies

Learned ontologies:
• linguistic contexts from the 1987 portion of the OHSUMED corpus
• based on the top 10,000 terms ∩ MeSH terms = 7,000 terms (cheated)
  - agglomeratively clustered
  - bi-sec-kmeans clustered
• based on the top 14,000 terms
  - bi-sec-kmeans clustered

Competitors:
• MeSH Tree Structures
  - maintained by the United States National Library of Medicine
  - more than 22,000 hierarchically organized concepts
• WordNet
  - (psycho-)linguistic ontology
  - 115,424 synsets in total, 79,689 synsets in the noun category

A. Hotho: Text Clustering with Background Knowledge 195

Evaluation Setting: Text Classification and Clustering

OHSUMED Corpus, (TREC-9), titles and abstracts from medical journals, 1987

Typically regarded as a rather "hard" corpus.

Text classification setting:
• 36,369 training documents and 18,341 test documents
• binary classification on the top 50 classes (MeSH classifications)
• classification algorithm: AdaBoost with binary decision stumps, 1000 iterations

Text clustering setting:
• 4,390 documents rated relevant for one of 106 queries
• cluster-to-query evaluation
• clustering algorithm: bi-section-kmeans
• weighting: TFIDF, pruning level: 20

A. Hotho: Text Clustering with Background Knowledge 196

Evaluation Results: Text Classification

* extensive experimental evaluation for different superconcept integration depths (3, 5, 10, 15, 20, 25, 30) - only the optimal feature configuration (wrt. F1) for each ontology is shown

Page 50: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

A. Hotho: Text Clustering with Background Knowledge 197

Evaluation Results: Text Classification

[Chart: relative improvement in macro F1 and micro F1 (0% to 8%) for the best feature configuration per ontology - 7000-agglo (term & concept.sc30), 7000-bisec-kmeans (term & concept.sc15), 14000-bisec-kmeans (term & concept.sc20), WordNet (term & synset.context.hyp5), MeSH Tree Struct (term & mesh.sc3); significance markers T*/T** and S*/S**]

A. Hotho: Text Clustering with Background Knowledge 198

Evaluation Results: Text Clustering

* extensive experimental evaluation for different superconcept integration depths (3, 5, 10, 15, 20, 25, 30) - only the optimal feature configuration (wrt. purity) for each ontology is shown

** all results are averages over 20 runs with different random seeds

A. Hotho: Text Clustering with Background Knowledge 199

Last but not least…

Main points of this lesson:
• Integration of explicit conceptual features improves text clustering and classification performance.
• Learned ontologies achieve improvements competitive with manually created ontologies.
• In both cases, the major improvement is due to generalization.

Outlook:
• Investigation of the relation to purely statistical "conceptualizations", e.g. LSI, PLSA
• Improvements in ontology learning
• More advanced generalization strategies

A. Hotho: Text Clustering with Background Knowledge 200

Literature

• Stephan Bloehdorn, Andreas Hotho: Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM 2004.

• Andreas Hotho, Steffen Staab, Gerd Stumme: WordNet improves text document clustering; Semantic Web Workshop @ SIGIR 2003.

• W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam. OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. SIGIR 1994.

• Alexander Maedche, Steffen Staab. Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16(2):72–79, 2001.

• Philipp Cimiano, Andreas Hotho, Steffen Staab. Comparing Conceptual, Partitional and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004. Extended version to appear (JAIR 2005).

Page 51: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

Dr. Andreas Hotho

Background Knowledge

A. Hotho: Text Clustering with Background Knowledge 202

Statistical Concepts as Background Knowledge

• Calculating a kind of statistical concept and combining it with the classical bag-of-words representation

L. Cai and T. Hofmann. Text Categorization by Boosting Automatically Extracted Concepts. In Proc. of the 26th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003.

• Clustering words to set up a kind of concept

G. Karypis and E. Han. Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. In Proc. of 9th ACM International Conference on Information and Knowledge Management, CIKM-00, pages 12–19, New York, US, 2000. ACM Press.

• Clustering words and documents simultaneously

Inderjit S. Dhillon, Yuqiang Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In 2nd SIAM International Conference on Data Mining (Workshop on Clustering High-Dimensional Data and its Applications), 2002.

A. Hotho: Text Clustering with Background Knowledge 203

Text Classification and Ontologies

• Using hypernyms of WordNet as concept features (no WSD, no significantly better results)

Sam Scott , Stan Matwin, Feature Engineering for Text Classification, Proceedings of the Sixteenth International Conference on Machine Learning, p.379-388, June 27-30, 1999

• The Brown Corpus tagged with WordNet senses does not show significantly better results.

A. Kehagias, V. Petridis, V. G. Kaburlasos, and P. Fragkou. A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems, 21(3):227–247, 2000.

• Map terms to concepts of the UMLS ontology to reduce the size of the feature set, use a search algorithm to find superconcepts; evaluation using kNN and Medline documents shows an improvement.

B. B. Wang, R. I. Mckay, H. A. Abbass, and M. Barlow. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australian Computer Science Conference (ACSC-2003), pages 69–78. Australian Computer Society, 2003.

• Generative model consisting of features, concepts and topics, using WordNet to initialize the parameters for the concepts; evaluation on Reuters and Amazon corpora

Georgiana Ifrim, Martin Theobald, Gerhard Weikum, Learning Word-to-Concept Mappings for Automatic Text Classification Learning in Web Search Workshop 2005.

A. Hotho: Text Clustering with Background Knowledge 204

Using Ontologies

WordNet and IR: Query expansion with WordNet does not really improve the performance

Ellen M. Voorhees, Query expansion using lexical-semantic relations, Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, p.61-69, July 03-06, 1994, Dublin, Ireland

Text Clustering and Ontologies: WordNet synset chains

Green: WordNet chains (Stephen J. Green. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(5):713–730, 1999.)

Dave et al.: worse results using an ontology (no word sense disambiguation)

(Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the Twelfth International World Wide Web Conference, WWW2003. ACM, 2003.)

Part of Speech attributes and named entities used as features

(Vasileios Hatzivassiloglou, Luis Gravano, and Ankineedu Maganti. An investigation of linguistic features and clustering algorithms for topical document clustering. In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece. ACM, 2000.)


Dr. Andreas Hotho

Literature

Tag: SumSchool06

http://www.bibsonomy.org/tag/SumSchool06

A. Hotho: Text Clustering with Background Knowledge 206

Selected Literature

Semantic Web & Ontology
Y. Sure and R. Studer. Vision for Semantically-Enabled Knowledge Technologies.

Online at: KTweb -- Connecting Knowledge Technologies Communities, 2003.

Y. Sure and R. Studer: A Methodology for Ontology-based Knowledge Management. In: On-To-Knowledge: Semantic Web enabled Knowledge Management. J. Davies, D. Fensel, F. van Harmelen (eds.), ISBN: 0-470-84867-7, Wiley, 2002, pages 33-46.

Y. Sure, S. Staab and R. Studer. Methodology for Development and Employment of Ontology Based Knowledge Management Applications. In: SIGMOD Record, Vol. 31, No. 4, pp. 18-23, December 2002.

S. Staab, H.-P. Schnurr, R. Studer, and Y. Sure: Knowledge Processes and Ontologies. In: IEEE Intelligent Systems 16(1), January/February 2001, Special Issue on Knowledge Management.

A. Hotho: Text Clustering with Background Knowledge 207

Selected Literature

Foundations of the Semantic Web

Semantic Web Activity at W3C: http://www.w3.org/2001/sw/
www.semanticweb.org (currently relaunched)
Journal of Web Semantics
D. Fensel et al.: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press 2003
G. Antoniou, F. van Harmelen. A Semantic Web Primer, MIT Press 2004.
S. Staab, R. Studer (eds.). Handbook on Ontologies. Springer Verlag, 2004.
S. Handschuh, S. Staab (eds.). Annotation for the Semantic Web. IOS Press, 2003.
International Semantic Web Conference series, yearly since 2002, LNCS
World Wide Web Conference series, ACM Press, first Semantic Web papers since 1999
York Sure, Pascal Hitzler, Andreas Eberhart, Rudi Studer, The Semantic Web in One Day, IEEE Intelligent Systems, http://www.aifb.uni-karlsruhe.de/WBS/phi/pub/sw_inoneday.pdf

Some slides have been stolen from various places, from Jim Hendler and Frank van Harmelen, in particular.

A. Hotho: Text Clustering with Background Knowledge 208

Selected Literature

Ontology Learning References Reinberger, M.-L., & Spyns, P. (2005). Unsupervised text mining for the learning of dogma-inspired ontologies. In Buitelaar, P.,

Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

Philipp Cimiano, Andreas Hotho, Steffen Staab: Comparing Conceptual, Divisive and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004: 435-439

P. Cimiano, A. Pivk, L. Schmidt-Thieme and S. Staab, Learning Taxonomic Relations from Heterogenous Evidence. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

Sabou M., Wroe C., Goble C. and Mishne G., Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

Alexander Maedche, Ontology Learning for the Semantic Web, PhD Thesis, Kluwer, 2001.

Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)

Alexander Maedche, Steffen Staab: Ontology Learning. Handbook on Ontologies 2004: 173-190

M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of semantic relations between concepts of a molecular biology ontology. IJCAI, 659ff.

A. Schutz, P. Buitelaar. RelExt: A Tool for Relation Extraction from Text in Ontology Extension. ISWC 2005.

Faure, D., & Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology. In Velardi, P. (Ed.), Proceedings of the LREC Workshop on Adapting lexical and corpus resources to sublanguages and applications, pp. 5–12.

Michele Missikoff, Paola Velardi, Paolo Fabriani: Text Mining Techniques to Automatically Enrich a Domain Ontology. Applied Intelligence 18(3): 323-340 (2003).

Gilles Bisson, Claire Nedellec, Dolores Cañamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. ECAI Workshop on Ontology Learning 2000


A. Hotho: Text Clustering with Background Knowledge 209

Selected Literature

Semantic Web & Ontology
Y. Sure, S. Staab, J. Angele. OntoEdit: Guiding Ontology Development by Methodology and Inferencing. In: R. Meersman, Z. Tari et al. (eds.). Proceedings of the Confederated International Conferences CoopIS, DOA and ODBASE 2002, October 28th - November 1st, 2002, University of California, Irvine, USA, Springer, LNCS 2519, pages 1205-1222.

Y. Sure, M. Erdmann, J. Angele, S. Staab, R. Studer and D. Wenke. OntoEdit: Collaborative Ontology Engineering for the Semantic Web. In: Proceedings of the first International Semantic Web Conference 2002 (ISWC 2002), June 9-12 2002, Sardinia, Italia, Springer, LNCS 2342, pages 221-235.

E. Bozsak, M. Ehrig, S. Handschuh, A. Hotho, A. Mädche, B. Motik, D. Oberle, C. Schmitz, S. Staab, L. Stojanovic, N. Stojanovic, R. Studer, G. Stumme, Y. Sure, J. Tane, R. Volz, V. Zacharias. KAON - Towards a large scale Semantic Web. In: Proceedings of EC-Web 2002 (in combination with DEXA 2002). Aix-en-Provence, France, September 2-6, 2002. LNCS, Springer, 2002, pages 304-313.

A. Hotho: Text Clustering with Background Knowledge 210

Selected Literature

Text Clustering with Background Knowledge
A. Hotho, S. Staab, and G. Stumme. Explaining text clustering results using semantic structures. In Proc. of the 7th PKDD, 2003.

B. Lauser and A. Hotho. Automatic multi-label subject indexing in a multilingual environment. In Proc. of the 7th European Conference in Research and Advanced Technology for Digital Libraries, ECDL 2003, 2003.

A. Hotho, S. Staab, and G. Stumme. Text clustering based on background knowledge. Technical Report 425, University of Karlsruhe, Institute AIFB, 2003.

Hotho, A., Mädche, A., Staab, S.: Ontology-based Text Clustering, Workshop "Text Learning: Beyond Supervision", IJCAI 2001.

A. Hotho, A. Maedche, S. Staab, V. Zacharias: On Knowledgeable Supervised Text Mining. To appear in: "Text Mining" Workshop Proceedings, Springer, 2002.

A. Hotho: Text Clustering with Background Knowledge 211

Selected Literature

Using Ontologies
Stephan Bloehdorn, Andreas Hotho: Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM 2004: 331-334

Andreas Hotho, Steffen Staab, Gerd Stumme: Ontologies Improve Text Document Clustering. ICDM 2003: 541-544

Andreas Hotho, Steffen Staab, Gerd Stumme: Explaining Text Clustering Results Using Semantic Structures. PKDD 2003: 217-228

Stephan Bloehdorn, Philipp Cimiano, and Andreas Hotho: Learning Ontologies to Improve Text Clustering and Classification, Proc. of GfKl, 2005.

Semantic Web Mining
B. Berendt, A. Hotho, and G. Stumme. Towards semantic web mining. In I. Horrocks and J. A. Hendler, editors, The Semantic Web - ISWC 2002, First International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, Proceedings, volume 2342 of Lecture Notes in Computer Science, pages 264–278. Springer, 2002.