
Dr. Andreas Hotho

Text Clustering with Background Knowledge

A. Hotho: Text Clustering with Background Knowledge 2

Agenda

• Introduction
• Semantic Web
• Semantic Web Mining
• Typical Preprocessing steps for Text Mining
• Ontology Learning
• Text Clustering with Background Knowledge
• Text Clustering using FCA
• Text Classification using Background Knowledge
• Application driven Evaluation of Ontology Learning
• Different kinds of Background Knowledge

A. Hotho: Text Clustering with Background Knowledge 3

Knowledge and Data Engineering Group @ University of Kassel

Founded in April 2004
Head: Prof. Gerd Stumme
Member of the Research Center L3S

Research areas:
• Semantic Web / Ontologies
• Knowledge Discovery
• Web Mining
• Peer-to-Peer
• Folksonomies
• Social Bookmark Systems

A. Hotho: Text Clustering with Background Knowledge 4

Acknowledgement

Some of the slides are taken from:

• ECML/PKDD Tutorial “Ontology Learning from text”, Paul Buitelaar, Philipp Cimiano, Marko Grobelnik, Michael Sintek

• KDD Course of AIFB Karlsruhe and KDE Kassel

• Semantic Web Tutorial Slides from AIFB

• Some slides of the Semantic Web Introduction have been stolen from various places, from Jim Hendler and Frank van Harmelen, in particular


A. Hotho: Text Clustering with Background Knowledge 5

Resources in BibSonomy tagged with: SumSchool06

http://www.bibsonomy.org/tag/SumSchool06

Dr. Andreas Hotho

Introduction: Semantic Web

A. Hotho: Text Clustering with Background Knowledge 7

Syntax is not enough

Andreas

• Tel

• E-Mail

A. Hotho: Text Clustering with Background Knowledge 8

Information Convergence

Convergence not just in devices, but also in "information":

Your personal information (phone, PDA, …)
• calendar, photos, home page, files, …

Your "professional" life (laptop, desktop, …, Grid)
• web site, publications, files, databases, …

Your "community" contexts (Web)
• hobbies, blogs, fanfic, social networks, …

The Web teaches us that people will work to share.
How do we CREATE, SEARCH, and BROWSE in the non-text based parts of our lives?


A. Hotho: Text Clustering with Background Knowledge 9

Meaning of Information (or: what it means to be a computer)

[Figure: a CV document with the sections name, education, work, private]

A. Hotho: Text Clustering with Background Knowledge 10

[Figure: the same CV marked up with XML tags <CV>, <name>, <education>, <work>, <private>; to the machine the tag names are just arbitrary symbols]

XML ≠ Meaning, XML = Structure

A. Hotho: Text Clustering with Background Knowledge 11

Source of Problems

XML is unspecific:
• no predetermined vocabulary
• no semantics for relationships
⇒ both must be specified upfront

Only possible in close cooperation:
• small, reasonably stable group
• common interests or authorities

Not possible on the Web or on a broad scale in general!

A. Hotho: Text Clustering with Background Knowledge 12

(One) Layer Model of the Semantic Web


A. Hotho: Text Clustering with Background Knowledge 13

Some Principal Ideas

• URI – uniform resource identifiers
• XML – common syntax
• Interlinked
• Layers of semantics – from database to knowledge base to proofs

Design principles of the WWW applied to semantics!!

Tim Berners-Lee, Weaving the Web

A. Hotho: Text Clustering with Background Knowledge 14

Ontology

Ontologies enable a better communication between humans/machines.
Ontologies standardize and formalize the meaning of words through concepts.

„An ontology is an explicit specification of a conceptualization.“ [Gruber, 1993]

„People can‘t share knowledge if they do not speak a common language.“ [Davenport & Prusak, 1998]

A. Hotho: Text Clustering with Background Knowledge 15

What is an Ontology?

Gruber 93:

An ontology is a
• formal specification ⇒ executable
• of a shared ⇒ group of persons
• conceptualization ⇒ about concepts
• of a domain of interest ⇒ between application and „unique truth“

A. Hotho: Text Clustering with Background Knowledge 16

Communication Principle

[Figure: the semiotic triangle: the form “Jaguar“ evokes a concept, the concept refers to a referent, and the form stands for the referent]

[Ogden, Richards, 1923]


A. Hotho: Text Clustering with Background Knowledge 17

Views on Ontologies

[Figure: spectrum of ontology-like structures from front-end to back-end (topic maps, thesauri, taxonomies, semantic networks, extended ER models, predicate logic, ontologies) and their applications: navigation, queries, sharing of knowledge, information retrieval, query expansion, mediation, reasoning, consistency checking, EAI]

A. Hotho: Text Clustering with Background Knowledge 18

Taxonomy

Taxonomy := segmentation, classification and ordering of elements into a classification system according to their relationships with each other

[Figure: example taxonomy: Object subsumes Person, Topic and Document; Person subsumes Researcher and Student; Student subsumes PhD Student and Doctoral Student; Topic subsumes Semantics, which subsumes Ontology and F-Logic]

A. Hotho: Text Clustering with Background Knowledge 19

Thesaurus

• Terminology for a specific domain
• Graph with primitives and two fixed relationships (similar, synonym)
• Originates from bibliography

[Figure: example thesaurus: the taxonomy from the previous slide extended with similar/synonym links, e.g. PhD Student synonym Doctoral Student, Ontology similar F-Logic]

A. Hotho: Text Clustering with Background Knowledge 20

Topic Map

• Topics (nodes), relationships and occurrences (links to documents)
• ISO standard
• Typically used for navigation and visualisation

[Figure: example topic map: topics such as Person, Topic, Document, Researcher, Student, PhD/Doctoral Student, Semantics, Ontology, F-Logic, connected by relationships like knows, writes, described_in, with similar/synonym links and attributes such as Affiliation and Tel]


A. Hotho: Text Clustering with Background Knowledge 21

Ontology (in our sense)

• Representation language: predicate logic (F-Logic)
• Standards: RDF(S); upcoming standard: OWL

[Figure: example ontology: the taxonomy (Object with Person, Topic, Document; Researcher, Student, PhD Student via is_a; instances via instance_of), relations knows, writes, described_in, is_about, subTopicOf, attributes Affiliation and Tel, a rule combining knows, writes and is_about (e.g. P knows T if P writes D and D is_about T), and an instance A. Hotho with affiliation KDE and phone +49 561 804 6252]

A. Hotho: Text Clustering with Background Knowledge 22

Ontology & Metadata

[Figure: ontology layer: classes PhD_Student and AssProf, both rdfs:subClassOf AcademicStaff; property cooperate_with with rdfs:domain and rdfs:range on these classes]

Annotation of web pages (instances of the ontology):

WebPage http://www.aifb.uni-karlsruhe.de/WBS/sst
<swrc:AssProf rdf:ID="sst">
  <swrc:name>Steffen Staab</swrc:name>
  ...
</swrc:AssProf>

WebPage http://www.aifb.uni-karlsruhe.de/WBS/sha
<swrc:PhD_Student rdf:ID="sha">
  <swrc:name>Siegfried Handschuh</swrc:name>
  ...
  <swrc:cooperate_with rdf:resource="http://www.aifb.uni-karlsruhe.de/WBS/sst#sst"/>
</swrc:PhD_Student>

Links have explicit meanings!

A. Hotho: Text Clustering with Background Knowledge 23

What’s in a link? Formally

W3C recommendations
• RDF: an edge in a graph
• OWL: consistency (+ subsumption + classification + …)

Currently under discussion
• Rules: a deductive database

Currently under intense research
• Proof: worked-out proofs
• Trust: signature & everything working together

A. Hotho: Text Clustering with Background Knowledge 24

What’s in a link? Informally

• RDF: pointing to shared data
• OWL: shared terminology
• Rules: if-then-else conditions
• Proof: proof already shown
• Trust: reliability


A. Hotho: Text Clustering with Background Knowledge 25

Ontologies and their Relatives (I)

There are many relatives around:

Controlled vocabularies, thesauri and classification systems available in the WWW, see http://www.lub.lu.se/metadata/subject-help.html
• classification systems (e.g. UNSPSC, Library Science, etc.)
• thesauri (e.g. Art & Architecture, Agrovoc, etc.)
• DMOZ Open Directory, http://www.dmoz.org

Lexical semantic nets
• WordNet, see http://www.cogsci.princeton.edu/~wn/
• EuroWordNet, see http://www.hum.uva.nl/~ewn/

Topic Maps, http://www.topicmaps.org (e.g. used within knowledge management applications)

In general it is difficult to find the border line!

A. Hotho: Text Clustering with Background Knowledge 26

Ontologies and their Relatives (II)

[Figure: ontology spectrum with increasing expressiveness: catalog/ID, terms/glossary, thesauri, informal is-a, formal is-a, formal instance, frames, value restrictions, general logical constraints, axioms (disjointness, inverse relations, ...)]

A. Hotho: Text Clustering with Background Knowledge 27

Ontologies - Some Examples

General purpose ontologies:
• WordNet / EuroWordNet, http://www.cogsci.princeton.edu/~wn
• The Upper Cyc Ontology, http://www.cyc.com/cyc-2-1/index.html
• IEEE Standard Upper Ontology, http://suo.ieee.org/

Domain and application-specific ontologies:
• RDF Site Summary RSS, http://groups.yahoo.com/group/rss-dev/files/schema.rdf
• UMLS, http://www.nlm.nih.gov/research/umls/
• GALEN
• SWRC – Semantic Web Research Community, http://ontoware.org/projects/swrc/
• RETSINA Calendering Agent, http://ilrt.org/discovery/2001/06/schemas/ical-full/hybrid.rdf
• Dublin Core, http://dublincore.org/

Web services ontologies:
• Core Ontology of Services, http://cos.ontoware.org
• Web Service Modeling Ontology, http://www.wsmo.org
• DAML-S

Meta-ontologies:
• Semantic Translation, http://www.ecimf.org/contrib/onto/ST/index.html
• RDFT, http://www.cs.vu.nl/~borys/RDFT/0.27/RDFT.rdfs
• Evolution Ontology, http://kaon.semanticweb.org/examples/Evolution.rdfs

Ontologies in a wider sense:
• Agrovoc, http://www.fao.org/agrovoc/
• Art and Architecture, http://www.getty.edu/research/tools/vocabulary/aat/
• UNSPSC, http://eccma.org/unspsc/
• DTD standardizations, e.g. HR-XML, http://www.hr-xml.org/

A. Hotho: Text Clustering with Background Knowledge 28

Wordnet

• WordNet contains 207,016 word-sense pairs and 117,597 synsets
• WordNet categorizes words into the syntactic categories (N, noun), (V, verb), (Adj, adjective) and (Adv, adverb)

• WordNet additionally contains lexical-semantic relations between word meanings

[ http://wordnet.princeton.edu/]

Statistics under: http://wordnet.princeton.edu/man/wnstats.7WN#sect2


A. Hotho: Text Clustering with Background Knowledge 29

Wordnet II

Lexical-semantic relation | Syntactic categories | Examples
Synonymy    | N, V, Adj, Adv   | jolly, merry
Antonymy    | Adj, Adv, (N, V) | fast, slow; friendly, unfriendly
Hyperonymy  | N                | animal, living being; mammal, animal; dog, mammal
Meronymy    | N                | flour, cake; tyre, car

A. Hotho: Text Clustering with Background Knowledge 30

Wordnet III

• Lexical semantic relations in WordNet mainly correspond to their counterparts in frame-oriented representation formalisms:
  - hyperonym / hyponym is analogous to the is-a relation
  - meronym / holonym corresponds to has-part / part-of relations

• WordNet allows a fluid transition between linguistic information and conceptual structures

A. Hotho: Text Clustering with Background Knowledge 31

UMLS (I)

• provided by the National Library of Medicine (NLM), a database of medical terminology.

• Unifies terms from several medical databases(MEDLINE, SNOMED International, Read Codes, etc.) such that different terms are identified as the same medical concept.

• Applications: primarily browse/search in document collections, e.g.
  - PubMed: access to documents (e.g. MEDLINE)
  - CliniWeb International: clinical information in the WWW
[ http://www.nlm.nih.gov/research/umls/umlsapps.html ]

[ http://www.nlm.nih.gov/research/umls/ ]

A. Hotho: Text Clustering with Background Knowledge 32

UMLS (II)

UMLS Knowledge Sources:

Metathesaurus provides the concordance of medical concepts:
• 730,000 concepts
• 1.5 million concept names in different source vocabularies

SPECIALIST Lexicon provides word synonyms, derivations, lexical variants, and grammatical forms of words used in Metathesaurus terms:
• 130,000 entries

Semantic Network codifies the relationships (e.g. causality, "is a", etc.) among medical terms.

134 semantic types, 54 relationships.


A. Hotho: Text Clustering with Background Knowledge 33

The semantic web and machine learning

1. What can machine learning do for the Semantic Web?
2. Learning ontologies (even if not fully automatic)
3. Learning to map between ontologies
4. Duplicate recognition
5. Deep annotation: reconciling databases and ontologies
6. Annotation by information extraction

1. What can the Semantic Web do for machine learning?
2. Lots and lots of SW tools to describe and exchange data for later use by machine learning methods in a canonical way (preprocessing!)
3. Using ontological structures to improve the machine learning task
4. Providing background knowledge to guide machine learning

A. Hotho: Text Clustering with Background Knowledge 34

Foundations of the Semantic Web: References

• Semantic Web Activity at W3C, http://www.w3.org/2001/sw/
• www.semanticweb.org (currently being relaunched)
• Journal of Web Semantics
• D. Fensel et al.: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press 2003
• G. Antoniou, F. van Harmelen: A Semantic Web Primer, MIT Press 2004
• S. Staab, R. Studer (eds.): Handbook on Ontologies, Springer Verlag, 2004
• S. Handschuh, S. Staab (eds.): Annotation for the Semantic Web, IOS Press, 2003
• International Semantic Web Conference series, yearly since 2002, LNCS
• World Wide Web Conference series, ACM Press, first Semantic Web papers since 1999
• York Sure, Pascal Hitzler, Andreas Eberhart, Rudi Studer: The Semantic Web in One Day, IEEE Intelligent Systems, http://www.aifb.uni-karlsruhe.de/WBS/phi/pub/sw_inoneday.pdf
• Some slides have been stolen from various places, from Jim Hendler and Frank van Harmelen in particular.

Dr. Andreas Hotho

Semantic Web Mining

A. Hotho: Text Clustering with Background Knowledge 36

Where to start?

Web Mining areas:
• Web content mining
• Web structure mining
• Web usage mining


A. Hotho: Text Clustering with Background Knowledge 37

Extracting Semantics from the Web

• Web Mining can help
  - to learn structures for knowledge organization (e.g. ontologies) → Ontology Learning
  - and to populate them → Instance Learning

A. Hotho: Text Clustering with Background Knowledge 38

Ontology Learning

• Typically, a domain-specific document corpus contains much information about a specific domain.

• One possible approach is to take this given corpus and extract linguistic and ontological resources from it.

Concentration on Web content

[Figure: Ontology Learning at the intersection of Knowledge Discovery and Ontology Engineering]

A. Hotho: Text Clustering with Background Knowledge 39

Ontology Learning Steps

1. Concept extraction
   - multi-word term extraction
   - word meaning recognition

2. Concept relation extraction
   - taxonomy learning
   - non-taxonomic relation extraction
   - labeling of non-taxonomic relations

Besides these two steps, ontology reuse via pruning is applicable.

A. Hotho: Text Clustering with Background Knowledge 40

Example: Ontology Learning from the Web [Mädche, Staab: ECAI 2000]

[Figure: learned is-a hierarchy: root subsumes furnishing, accommodation, event and area; accommodation subsumes hotel (e.g. wellness hotel) and youth hostel; area subsumes city and region]

Derived concept pairs:
• (wellness hotel, area)
• (hotel, area)
• (accommodation, area)

Association Rule Mining

Generalized conceptual relation: hasLocation(accommodation, area)
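A toy sketch (not the ECAI 2000 system) of the association-rule idea behind the slide: count how often concept pairs co-occur in documents and keep pairs with sufficient support and confidence; the document sets and thresholds are made up for illustration.

from itertools import combinations
from collections import Counter

def concept_pair_rules(transactions, min_support=0.2, min_confidence=0.6):
    # transactions: one set of concepts per document
    n = len(transactions)
    single = Counter(c for t in transactions for c in t)
    pair = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))
    rules = []
    for p, count in pair.items():
        a, b = tuple(p)
        support = count / n
        if support < min_support:
            continue
        for head, body in ((a, b), (b, a)):
            confidence = count / single[head]
            if confidence >= min_confidence:
                rules.append((head, body, support, confidence))
    return rules

docs = [{"wellness hotel", "area"}, {"hotel", "area"}, {"accommodation", "area"}, {"hotel", "sauna"}]
print(concept_pair_rules(docs))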


A. Hotho: Text Clustering with Background Knowledge 41

Extracting Semantics from the Web

• Web Mining can help
  - to learn structures for knowledge organization (e.g. ontologies) → Ontology Learning
  - and to populate them → Instance Learning

A. Hotho: Text Clustering with Background Knowledge 42

Example: Instance Learning from the Web

Information Extraction, e.g. [Craven et al., AI Journal 2000]

Knowledge base:
• Hotel: Wellnesshotel
• GolfCourse: Seaview
• belongsTo(Seaview, Wellnesshotel)
• ...

[Figure: ontology with concepts Hotel, GolfCourse, Organization, attribute name, relations belongsTo and cooperatesWith, and the F-Logic rule
FORALL X,Y  Y:Hotel[cooperatesWith ->> X] <- X:Project[cooperatesWith ->> Y].]

A. Hotho: Text Clustering with Background Knowledge 43

Example

Information highlighting for supporting annotation, based on IE techniques.

A. Hotho: Text Clustering with Background Knowledge 44

Example: Crawling the (semantic) web for filling the ontology [Ehrig et al., 2002]

Crawling:
• load a document
• extract links
• load the next document

Focused Crawling:
• intelligent, focused decision on the next step


A. Hotho: Text Clustering with Background Knowledge 45

Example: Mining the Semantic Web

ILP-based Association Rule Mining, e.g. [Dehaspe, Toivonen, J. DMKD 1998]

Knowledge base:
• Hotel: Wellnesshotel
• GolfCourse: Seaview
• belongsTo(Seaview, Wellnesshotel)
• ...

Mined rule:
Hotel(x), GolfCourse(y), belongsTo(y,x) → hasStars(x,5)
support = 0.4 %, confidence = 89 %

[Figure: ontology with concepts Hotel, GolfCourse, Organization, attribute name, relations belongsTo and cooperatesWith, and the F-Logic rule
FORALL X,Y  Y:Hotel[cooperatesWith ->> X] <- X:Project[cooperatesWith ->> Y].]

A. Hotho: Text Clustering with Background Knowledge 46

Semantic Web Usage Mining

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450

p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478

[Annotations of the log entries: search by location → search by location and price (refine search) → choose item (look at an individual hotel)]

From logfile analysis ...

... to semantic logfile analysis:

Basic idea: associate each requested page with one or more ontological entities, to better understand the process of navigation

[Berendt & Spiliopoulou 2000; Berendt 2002; Oberle 2003]

Use the gained knowledge to

• understand search strategies

• improve navigation design

• personalization

Example

A. Hotho: Text Clustering with Background Knowledge 47

Example: Text Document Clustering of Crawled Documents

[Figure: pipeline: focused crawling of the WWW, clustering of the crawled documents, explanation of the clusters]

Dr. Andreas Hotho

Preprocessing steps for Text Mining

Slides partially from:
- AIFB KDD course
- Raymond J. Mooney (http://www.cs.utexas.edu/users/mooney/ir-course/)


A. Hotho: Text Clustering with Background Knowledge 49

Preprocessing of Text documents

[Figure: documents are mapped to a document × term matrix with entries x_ij (document i, key j)]

Kinds of features to extract:
• Terms
• Words
• Phrases
• Concepts
• Metadata
• Shallow parsing
• Deep parsing
• …

A. Hotho: Text Clustering with Background Knowledge 50

Which kind of features to extract?

Metadata
• e.g., author, date, document type, language, copyright status
• according to a metadata schema that specifies attributes, e.g.:
  - Dublin Core Metadata Initiative (http://dublincore.org/)
  - BibTeX schema for bibliographic metadata
• typically by explicit markup from a human indexer, but possibly also automatically by means of information extraction (next lecture)

Controlled vocabulary of index terms
• fixed set of index terms that describe the content of documents
• often in hierarchical form (taxonomies)
• indexing vocabulary / taxonomy is centrally designed and maintained by some authority, e.g.:
  - Inspec Topic Classification (http://www.iee.org/Publish/Inspec/)
  - MeSH, Medical Subject Headings (http://www.nlm.nih.gov/mesh/)
  - IPC, International Patent Classification (http://www.wipo.int/classifications/ipc/en/)
  - many many more…

A. Hotho: Text Clustering with Background Knowledge 51

Example: MeSH Classification

PMID- 7810287
OWN - NLM
STAT- MEDLINE
DA  - 19950202
DCOM- 19950202
LR  - 20041117
PUBM- Print
IS  - 0094-6354 (Print)
VI  - 62
IP  - 4
DP  - 1994 Aug
TI  - Cockayne syndrome: a case report.
PG  - 346-8
AB  - A 4-year-old female with Cockayne syndrome presented for cataract extraction under general anesthesia. […]
FAU - O'Brien, F C
AU  - O'Brien FC
FAU - Ginsberg, B
AU  - Ginsberg B
LA  - eng
PT  - Case Reports
PT  - Journal Article
PL  - UNITED STATES
TA  - AANA J
JT  - AANA journal.
JID - 0431420
SB  - N
MH  - Anesthesia, General/*methods/nursing
MH  - Cataract Extraction
MH  - Child, Preschool
MH  - Cockayne Syndrome/complications/*surgery
MH  - Female
MH  - Humans
EDAT- 1994/08/01
MHDA- 1994/08/01 00:01
PST - ppublish
SO  - AANA J. 1994 Aug;62(4):346-8.

A. Hotho: Text Clustering with Background Knowledge 52

Which kind of features to extract? (cont.)

Alternative: dynamic vocabulary
• social tagging systems ("folksonomies"), e.g.:
  - del.icio.us for bookmarks (http://del.icio.us)
  - flickr for photographs (http://www.flickr.com)
• tags correspond to index terms that are freely chosen and assigned to documents by users without centralized management

Derived features: full-text indexing
• also known as the bag-of-words model
• general assumption: every word or expression in the text document can be a valid key
• index terms are automatically extracted from the document collection
• the dictionary of index terms is continually increasing
• many design decisions in choosing appropriate terms
• next section…


A. Hotho: Text Clustering with Background Knowledge 53

Example: social tagging

A. Hotho: Text Clustering with Background Knowledge 54

Example: BibSonomy also contains publication metadata

A. Hotho: Text Clustering with Background Knowledge 55

Document Representation: Full-Text Indexing

[Figure: typical full-text indexing pipeline: Tokenization → Stopword Removal → Stemming; example output stems such as treat, infection, blood, medic, potent, transmiss]

A. Hotho: Text Clustering with Background Knowledge 56

Full Text Representation

Tokenization
• goal: segment the input character sequence into "useful" tokens (e.g., individual terms)
• design decisions and problems:
  - set of word delimiters to use (e.g., whitespace, punctuation marks)
  - handling of special and numerical characters
  - handling of capitalization (typically conversion to lower case)
  - handling of punctuation marks (sentence delimiter or abbreviation?)
  - different languages have different rules for compound words (e.g., "color screen" vs. "Farbbildschirm")

Stemming or Lemmatization
• morphological normalization of inflected word forms to a base form (e.g., "houses" → "house", "goes" → "go")
• Stemming: simple approach based on a few structural rules, e.g., the Porter stemming algorithm for English
• Lemmatization: retrieval of the base form, typically based on a dictionary; can handle exceptional cases (e.g., "mice" → "mouse")


A. Hotho: Text Clustering with Background Knowledge 57

Full Text Representation

Stopword removal

• removal of very frequent and uninformative words
• typically function words such as "the", "a", "an", "of", "for"
• e.g., the SMART stopword list for English defines 571 stopwords (ftp://ftp.cs.cornell.edu/pub/smart/english.stop)
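A minimal sketch of the tokenization → stopword removal → stemming pipeline described above, assuming NLTK's PorterStemmer is available; the tiny stopword set stands in for a real list such as SMART.

import re
from nltk.stem import PorterStemmer  # assumption: NLTK is installed

STOPWORDS = {"the", "a", "an", "of", "for", "in", "and", "to"}  # stand-in for a full list
stemmer = PorterStemmer()

def preprocess(text):
    # tokenize on letter sequences and lower-case everything
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    # remove stopwords, then reduce the remaining tokens to their stems
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The treatment of blood infections requires potent medication."))
# e.g. ['treatment', 'blood', 'infect', 'requir', 'potent', 'medic']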

A. Hotho: Text Clustering with Background Knowledge 58

Property: Word Frequency

• A few words are very common: the 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
• Most words are very rare: half the words in a corpus appear only once; they are called hapax legomena (Greek for "read only once").
• This is called a "heavy-tailed" distribution, since most of the probability mass is in the "tail".

A. Hotho: Text Clustering with Background Knowledge 59

Sample Word Frequency Data

(from B. Croft, UMass)

A. Hotho: Text Clustering with Background Knowledge 60

Zipf’s Law

• Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ).

• Zipf (1949) “discovered” that:

• If the probability of the word of rank r is p_r and N is the total number of word occurrences:

f ∝ 1/r,  i.e.  f · r = k  (for a constant k)

p_r = f / N  and  p_r · r = A ≈ 0.1  (A approximately constant, independent of the corpus)
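A small illustration (not from the slides) of checking Zipf's law on a token list: for each rank r, the product f · r should stay roughly constant.

from collections import Counter

def rank_frequency(tokens):
    # sort words by decreasing frequency and report rank, word, frequency and f*r
    counts = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]

tokens = "the cat sat on the mat the dog sat on the log".split()
for rank, word, freq, product in rank_frequency(tokens):
    print(rank, word, freq, product)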


A. Hotho: Text Clustering with Background Knowledge 61

Zipf and Term Weighting

Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.

A. Hotho: Text Clustering with Background Knowledge 62

Pruning based on Zipf Law

• Stopword removal

• drop words with fewer than a given number of occurrences (e.g. 30) to remove the extremely uncommon words

• a similar idea is behind the tfidf weighting

A. Hotho: Text Clustering with Background Knowledge 63

Further possible (non-standard) steps

• separate indexing of phrases and compound words (e.g., "machine learning" ≠ "machine", "learning"), based on background dictionaries or statistical detection of frequent phrases (machine learning again ;-))
• alternative: additional indexing of all adjacent words up to a certain window length (bigrams, trigrams, n-grams)
• expansion with synonymous terms based on thesauri
• separate consideration of different parts of speech (e.g., "walk" as verb or "walk" as noun)
• many more …

A. Hotho: Text Clustering with Background Knowledge 64

Levels of Linguistic Analysis: The 'Human Language Technologies Layer Cake'

• Tokenization (incl. Named-Entity Recognition): [table] [2005-06-01] [John Smith]
• Morphological Analysis: [table:N:ART] [Sommer~schule:N] [work~ing:V]
• Part-of-Speech & Semantic Tagging: [table:N:ARTIFACT] [table:N:furniture_01]
• Phrase Recognition / Chunking: [[the] [large] [table] NP] [[in] [the] [corner] PP]
• Dependency Structure (Phrases): [[the:SPEC] [large:MOD] [table:HEAD] NP]
• Dependency Structure (Sentence): [[He:SUBJ] [booked:PRED] [[this] [table:HEAD] NP:DOBJ] S]
• Discourse Analysis: [[He:SUBJ] [booked:PRED] [[this] [table:HEAD] NP:DOBJ:X1] …] … [[It:SUBJ:X1] [was:PRED] still available …]

© Paul Buitelaar, DFKI


A. Hotho: Text Clustering with Background Knowledge 65

Full Text Representation and Sparseness

Full-text indexing typically results in a very sparse matrix: usually, less than 1% of the matrix cells are non-zero!

Sparseness requires special attention with respect to storage and computation:
• store only the non-zero elements with their respective indices and assume the rest of the matrix to be zero
• tune computations to this data structure
• frequent terms are likely to be indexed already at the beginning

[Figure: sparseness structure of the document × term matrix for the training documents of the Reuters-21578 corpus (9603 × 17525); the non-sparse fraction is 0.25 %]
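A short illustration (assuming SciPy) of the storage idea above: only the non-zero entries of the document × term matrix are kept, together with their row and column indices, and computations use that layout.

import numpy as np
from scipy.sparse import csr_matrix

# toy document x term matrix: 4 documents, 6 terms, mostly zeros
dense = np.array([
    [0, 0, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 0, 10, 0],
    [0, 2, 0, 0, 0, 23],
])
sparse = csr_matrix(dense)             # stores only non-zero values plus indices
print(sparse.nnz, "non-zero cells out of", dense.size)
print(sparse.dot(sparse.T).toarray())  # document-document products on the sparse structure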

A. Hotho: Text Clustering with Background Knowledge 66

Comparison of Representation Approaches

[Figure: representation approaches compared along two axes, automation of indexing (from only human annotation to fully automatic) and dynamics of the feature set (from a fixed set of features to a highly dynamic feature set): traditional library classification and news agency systems (e.g. Reuters) use fixed, human-assigned features; social tagging (e.g. del.icio.us) is human-driven but highly dynamic; full-text indexing is fully automatic with a dynamic feature set]

A. Hotho: Text Clustering with Background Knowledge 67

Retrieval Models: Vector Space [Salton 60s]

Vector Space Model (best-match model + ranking)
• typically full-text indexing of documents
• documents are regarded as vectors
• vector space dimensions are defined by the different index terms
• the query is also treated as a vector in the same space
• documents are ranked based on their geometric similarity to the query
• very successful paradigm with many connections to the machine learning view

Issues of the vector space model
• choice of an appropriate term weighting (typically TFIDF)
• choice of the geometric similarity measure to use (typically cosine)

A. Hotho: Text Clustering with Background Knowledge 68

Term Weighting - Alternatives

• boolean weighting (simplest case) with entries 0 and 1
• absolute frequency tf_ji of term i in document j
• relative frequency rf_ji of term i in document j
• most popular choice: term-frequency inverse document frequency (TFIDF) weighting:

tfidf(w) = tf(w) · log(N / df(w))

tf(w)      term frequency (number of occurrences of the word in a document)
df(w)      document frequency (number of documents containing the word)
N          number of all documents
tfidf(w)   relative importance of the word in the document

The word is more important if it appears in fewer documents.
The word is more important if it appears several times in a target document.
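A minimal sketch of the plain TFIDF weighting above (tf · log(N/df)); the variable names and toy documents are illustrative only.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

docs = [["crude", "oil", "oil", "price"],
        ["gold", "price", "market"],
        ["crude", "oil", "market"]]
print(tfidf_vectors(docs)[0])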


A. Hotho: Text Clustering with Background Knowledge 69

Cosine Measure

• typically used as similarity measure: document vectors are ranked according to their cosine score with the query
• corresponds to the angle between two vectors, i.e. the normalized inner product of the input vectors
• Note: the direction distinguishes document vectors, not the length!
• all vector entries are positive, so the cosine varies between 0 (orthogonal vectors) and 1 (same direction)
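A small sketch of the cosine similarity between two sparse term-weight vectors, matching the description above.

import math

def cosine(u, v):
    # u, v: {term: weight} dictionaries; cosine = normalized inner product
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc = {"crude": 1.2, "oil": 2.0, "price": 0.5}
query = {"oil": 1.0, "price": 1.0}
print(cosine(doc, query))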

A. Hotho: Text Clustering with Background Knowledge 70

Cosine Measure (Illustration: scaling onto the unit hypersphere)

[Figure: doc1, doc2 and query vectors in 2D; normalizing them to doc1', doc2', query' projects them onto the unit circle, where the angle α determines the similarity]

A. Hotho: Text Clustering with Background Knowledge 71

Evaluation Measures

• We need to analyse results and to evaluate systems.
• Important considerations:
  - Precision: how well did the presented result set match the information need of the user?
  - Recall: how much of the relevant information available was presented in the result set?
• There is a well-known set of information retrieval measures which evaluate information retrieval engines with respect to the subjective (!) perception of relevancy of a test user.

A. Hotho: Text Clustering with Background Knowledge 72

Evaluation Measures: Notation

Two partitions of a set of documents:
• according to perceived relevancy to the user (human judgement)
• according to the result of the retrieval engine (retrieval result)

                               | judged relevant (positive) | judged non-relevant (negative)
positive (docs returned)       | true positive (TP)         | false positive (FP)
negative (docs not returned)   | false negative (FN)        | true negative (TN)


A. Hotho: Text Clustering with Background Knowledge 73

Evaluation Measures: Information Retrieval (and Text Classification)

• Error rate: measures the overall error, (FP + FN) / (TP + TN + FP + FN)
• Precision: fraction of relevant documents in the result, P = TP / (TP + FP)
• Recall: fraction of returned documents w.r.t. all relevant documents, R = TP / (TP + FN)
• F-measure: harmonic mean of precision and recall, F = 2 · P · R / (P + R)
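The measures above written out as a small helper (a sketch; the counts come from the contingency table of the previous slide and the example numbers are made up).

def ir_measures(tp, fp, fn, tn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    error = (fp + fn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "error": error}

print(ir_measures(tp=30, fp=10, fn=20, tn=940))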

A. Hotho: Text Clustering with Background Knowledge 74

Evaluation Measures (cont.): Considering Ranked Retrieval

• Precision and recall are well-defined only for exact-match (i.e. unranked) retrieval.
• Approach for ranked retrieval:
  - rank all test documents
  - calculate precision and recall at fixed cutoff points (e.g., at position k = 5)
  - different results will be achieved for varying k
• Typical observation for k → n: precision decreases and recall increases in the long run (think about why!)
• The break-even point measure is defined as the value of precision and recall at which they become equal.

[Figure: a ranked result list with relevance judgements (+/-) per position 1..n, and a plot connecting precision and recall for different k; the break-even point is where precision equals recall]

A. Hotho: Text Clustering with Background Knowledge 75

Evaluation of Text Document Clustering

Goal: clusters should be as similar as possible to the given classes.

Compare the clustering P* of a document set D with the given classes L*:

Precision(P, L) := |P ∩ L| / |P|

Purity(P*, L*) := Σ_{P ∈ P*} (|P| / |D|) · max_{L ∈ L*} Precision(P, L)

InversePurity(P*, L*) := Σ_{L ∈ L*} (|L| / |D|) · max_{P ∈ P*} Precision(L, P)

[Figure: a clustering P* with 60 clusters compared against 46 given classes L*]
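A sketch of purity and inverse purity as defined above, with clusters and classes given as lists of document-id sets; the toy data is illustrative.

def precision(p, l):
    return len(p & l) / len(p)

def purity(clusters, classes, n_docs):
    # clusters, classes: lists of sets of document ids; n_docs = |D|
    return sum(len(p) / n_docs * max(precision(p, l) for l in classes)
               for p in clusters)

def inverse_purity(clusters, classes, n_docs):
    return sum(len(l) / n_docs * max(precision(l, p) for p in clusters)
               for l in classes)

clusters = [{1, 2, 3}, {4, 5}]
classes = [{1, 2}, {3, 4, 5}]
print(purity(clusters, classes, 5), inverse_purity(clusters, classes, 5))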

A. Hotho: Text Clustering with Background Knowledge 76

Different Text Clustering and Classification Datasets

Reuters-21578
• documents about finance from 1987
• 9603 training documents and 3299 test documents (ModApte split)
• binary classification on the top 50 classes

Reuters RCV1
• documents about finance from 1996/1997
• 806,791 documents categorized with respect to three controlled vocabularies:
  - 4 major topic categories
  - 10 major industry codes
  - region codes without a hierarchy
• David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li: RCV1: A New Benchmark Collection for Text Categorization Research, 2004


A. Hotho: Text Clustering with Background Knowledge 77

Different Text Clustering and Classification Datasets

20 Newsgroups
• newsgroup documents on different topics like sports, cs, …
• 20,000 documents
• 20 classes, every class contains 1000 documents

OHSUMED Corpus
• OHSUMED (TREC-9), titles and abstracts from medical journals, 1987
• 36,369 training documents and 18,341 test documents
• binary classification on the top 50 classes (MeSH classifications)

FAODOC Corpus
• documents about agricultural information
• 1501 documents within 21 categories

Dr. Andreas Hotho

Ontology Learning

Thanks to Philipp Cimiano for the slides

A. Hotho: Text Clustering with Background Knowledge 79

Motivation for Ontology Learning

• High cost of modelling ontologies.
• Typically, ontologies are domain dependent.

• Idea: learn from existing domain data?
• Which data?
  - legacy data (XML or DB schema) => lifting
  - texts?
  - images?

• In this lecture we will discuss some ideas for ontology learning from text data using knowledge discovery techniques.

A. Hotho: Text Clustering with Background Knowledge 80

Learning ontologies from texts

Problems: Bridge the gap between symbol and concept/ontology level

Knowledge is rarely mentioned explicitly in texts.

[Figure: authors write texts based on a shared world model; ontology learning is the reverse engineering step from the texts back to that model]


A. Hotho: Text Clustering with Background Knowledge 81

Some Current Work on OL from Text

Terms, Synonyms & Classes
• statistical analysis
• patterns
• (shallow) linguistic parsing
• term disambiguation & compositional interpretation

Taxonomies
• statistical analysis & clustering (e.g. FCA)
• patterns
• (shallow) linguistic parsing
• WordNet

Relations
• anonymous relations (e.g. with association rules)
• named relations (linguistic parsing)
• (linguistic) compound analysis
• web mining, social network analysis

Definitions
• (linguistic) compound analysis (incl. WordNet)

Overview of current work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini: Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.

A. Hotho: Text Clustering with Background Knowledge 82

The Ontology Learning Layer Cake

Terms: river, country, nation, city, capital, ...
Synonyms: {country, nation}
Concepts: country := c := ⟨i(c), [c], Ref_C(c)⟩
Concept Hierarchy: capital ≤_C city, city ≤_C inhabited geo. area
Relations: flow_through(dom: river, range: geo. area)
Relation Hierarchy: capital_of ≤_R located_in
Axioms: ∀x (country(x) → ∃y (capital(y) ∧ has_capital(x,y) ∧ located_in(y,x) ∧ ∀z (capital(z) ∧ has_capital(x,z) → y = z)))
Rules / axiom schemata: disjoint(river, mountain)

A. Hotho: Text Clustering with Background Knowledge 83

Tools - Axioms

[Table: ontology learning systems and the layers they cover (terms, synonyms, concept formation, concept hierarchy, relations, relation hierarchy, axiom schemata, general axioms): Text2Onto (AIFB, Univ. Karlsruhe), AEON, HASTI (Amir Kabir Univ. Tehran), OntoBasis (CNTS, Univ. Antwerpen), ASIUM / Mo'K (Univ. de Paris-Sud), OntoLearn (Univ. di Roma), ATRACT (Univ. of Salford), Parmenides (Univ. Zürich), CBC (ISI, USC), DIRT, DODDLE (Keio Univ.), PMI-IR (NRC-CNRC), TextToOnto++, OntoLT / RelExt (DFKI), and a system from the Economic Univ. Prague; cells are marked with X, "clusters", "labels" or "int." depending on the kind of support]

A. Hotho: Text Clustering with Background Knowledge 84

Evaluation of Ontology Learning

The a priori approach is based on a gold standard ontology:
• given an ontology modelled by an expert, the so-called gold standard
• compare the learned ontology with the gold standard

Which methods exist: pattern-based
• learning accuracy / precision / recall / F-measure

Which methods exist: clustering-based
• problem: labels for clusters are either unknown or difficult to find

Basic idea for both:
• count edges in the "ontology graph"
• counting of direct relations only (Reinberger et al. 2005)
• least common superconcept
• semantic cotopy
• …

Evaluation via application (cf. the section on using ontologies)


A. Hotho: Text Clustering with Background Knowledge 85

Evaluation of Ontology Learning

The a posteriori approach:
• ask a domain expert for a per-concept evaluation of the learned ontology
• count three categories of concepts:
  - Correct: both in the learned and the gold ontology
  - New: only in the learned ontology, but relevant and should be in the gold standard as well
  - Spurious: useless
• compute precision = (correct + new) / (correct + new + spurious)

As a result: a posteriori evaluations are costly, BUT a posteriori evaluation by domain experts still shows very good results and is very helpful for the domain expert!

Sabou M., Wroe C., Goble C. and Mishne G.: Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

A. Hotho: Text Clustering with Background Knowledge 86

Some Knowledge Discovery Techniques for Ontology Learning

[Figure: the ontology learning layer cake (terms, synonyms, concepts, concept hierarchy, relations, relation hierarchy, axioms, rules) with the focus of today's lecture highlighted]

A. Hotho: Text Clustering with Background Knowledge 87

How do people acquire taxonomic knowledge?

I have no idea!

But people apply taxonomic reasoning!
„Never do harm to any animal!“ => „Don‘t do harm to the cat!“

More difficult questions:
• representation
• reasoning patterns

But let‘s speculate a bit! ;-)

A. Hotho: Text Clustering with Background Knowledge 88

How do people acquire taxonomic knowledge?

What is liver cirrhosis?

Mr. Smith died from liver cirrhosis.
Mr. Jagger suffers from liver cirrhosis.
Alcohol abuse can lead to liver cirrhosis.

=> prob(isa(liver cirrhosis, disease))


A. Hotho: Text Clustering with Background Knowledge 89

How do people acquire taxonomic knowledge?

What is liver cirrhosis?

Diseases such as liver cirrhosis are difficult to cure. (New York Times)

A. Hotho: Text Clustering with Background Knowledge 90

How do people acquire taxonomic knowledge?

What is liver cirrhosis?

Cirrhosis: noun [uncountable]: serious disease of the liver, often caused by drinking too much alcohol

liver cirrhosis ≈ cirrhosis ∧ isa(cirrhosis, disease) → prob(isa(liver cirrhosis, disease))

Pattern based

A. Hotho: Text Clustering with Background Knowledge 91

How do people acquire taxonomic knowledge?

Clustering based

• ……….
• The old lady loves her dog.
• The old lady loves her cat.
• The old lady loves her husband.
• ……….

[Figure: dog, cat and husband cluster together as things the lady loves]

A. Hotho: Text Clustering with Background Knowledge 92

Context Extraction

Extract syntactic dependencies from text:
⇒ verb/object, verb/subject, verb/PP relations
⇒ car: drive_obj, crash_subj, sit_in, …

LoPar, a trainable statistical left-corner parser.

[Figure: processing pipeline: Parser → tgrep → Lemmatizer → Smoothing → Weighting → FCA → Lattice → Compaction → Pruning]


A. Hotho: Text Clustering with Background Knowledge 93

Ontology Learning as Term Clustering

• Distributional Hypothesis: "Words are [semantically] similar to the extent to which they appear in similar [syntactic] contexts." [Harris 1985]
• Linguistic context can be represented in vector form.
• This allows measuring similarity w.r.t. some similarity measure (e.g. the cosine measure).
• Hierarchical clustering approaches can be used to create taxonomic structures.

[Table: example context vectors: frequency counts of car and bike for the features drive_obj, crash_into, ride_obj, sit_in]

A. Hotho: Text Clustering with Background Knowledge 94

Extracting attributes using techniques from NLP

The museum houses an impressive collection of medieval and modern art.
The building combines geometric abstraction with classical references that allude to the Roman influence on the region.

Extracted dependencies:
• house_subj(museum), house_obj(collection)
• combine_subj(museum), combine_obj(abstraction), combine_with(reference)
• allude_to(influence)

[Figure: parse tree of "The museum houses an impressive collection of modern art" (S → NP VP, VP → V NP, NP → NP PP)]

A. Hotho: Text Clustering with Background Knowledge 95

Extraction Process for Linguistic Contexts

Preprocessing:
• part-of-speech tagging
• lemmatizing
• matching regular expressions over POS tags

Extract shallow syntactic dependencies from text:
• adjective modifiers: "a nice city" → nice(city)
• prepositional phrase modifiers: "a city near the river" → near_river(city) and city_near(river)
• possessive modifiers: "the city's center" → has_center(city)
• noun phrases in subject or object position: "the city offers an exciting nightlife" → offer_subj(city) and offer_obj(nightlife)
• prepositional phrases following a verb: "the river flows through the city" → flow_through(city)
• copula constructs: "a flamingo is a bird" → is_bird(flamingo)
• verb phrases with the verb to have: "every country has a capital" → has_capital(country)
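A very small sketch of the "regular expressions over POS tags" idea, assuming NLTK's tokenizer and tagger are available; it only covers the adjective-modifier pattern ("a nice city" → nice(city)) and is not the extraction system used on the slides.

import nltk  # assumption: NLTK with 'punkt' and 'averaged_perceptron_tagger' data installed

def adjective_modifiers(sentence):
    # POS-tag the sentence and match the pattern "adjective followed by noun"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("JJ") and t2.startswith("NN"):
            pairs.append(f"{w1.lower()}({w2.lower()})")
    return pairs

print(adjective_modifiers("The old lady loves her nice city near the river."))
# e.g. ['old(lady)', 'nice(city)']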

A. Hotho: Text Clustering with Background Knowledge 96

Example

• People book hotels. The man drove the bike along the beach.

Before lemmatization:
book_subj(people), book_obj(hotels), drove_subj(man), drove_obj(bike), drove_along(beach)

After lemmatization:
book_subj(people), book_obj(hotel), drive_subj(man), drive_obj(bike), drive_along(beach)


A. Hotho: Text Clustering with Background Knowledge 97

Representation of the context of a word as feature vector

              | book_obj/ | rent_obj/ | drive_obj/ | ride_obj/ | join_obj/
              | bookable  | rentable  | driveable  | rideable  | joinable
apartment     |     X     |     X     |            |           |
car           |     X     |     X     |     X      |           |
motor-bike    |     X     |     X     |     X      |     X     |
trip          |     X     |           |            |           |     X
excursion     |     X     |           |            |           |     X

A. Hotho: Text Clustering with Background Knowledge 98

Tourism Lattice

A. Hotho: Text Clustering with Background Knowledge 99

Concept Hierarchy

[Figure: concept hierarchy derived from the lattice: bookable subsumes rentable and joinable; rentable subsumes apartment and driveable; driveable subsumes car and rideable (bike); joinable subsumes trip and excursion]

A. Hotho: Text Clustering with Background Knowledge 100

Example Clustering (Bi-Section-KMeans)

[Figure: example Bi-Section-KMeans clustering tree over apartment, car, bike, trip, excursion]

Issues:
• not easy to understand
• no formal interpretation


A. Hotho: Text Clustering with Background Knowledge 101

Agglomerative/Bottom-Up Clustering

[Figure: agglomerative clustering dendrogram over car, bus, trip, excursion, apartment]

A. Hotho: Text Clustering with Background Knowledge 102

Linkage Strategies

Complete linkage: consider the two most dissimilar elements of each of the clusters => O(n² log(n))

Average linkage: consider the average similarity of the elements in the clusters => O(n² log(n))

Single linkage: consider the two most similar elements of each of the clusters => O(n²)
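A short example (assuming SciPy) of the three linkage strategies applied to made-up context vectors for the running example terms; the numbers are illustrative only.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# rows: car, bus, trip, excursion, apartment; columns: illustrative context features
X = np.array([
    [25, 40, 0, 3],
    [20, 35, 0, 2],
    [0, 0, 12, 9],
    [0, 0, 10, 8],
    [5, 0, 0, 1],
], dtype=float)

distances = pdist(X, metric="cosine")
for method in ("single", "complete", "average"):
    # each row of the result merges two clusters at a given distance
    print(method, "\n", linkage(distances, method=method))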

A. Hotho: Text Clustering with Background Knowledge 103

Data Sets

Tourism (118 million tokens):
• http://www.all-in-all.de/english
• http://www.lonelyplanet.com
• British National Corpus (BNC)
• handcrafted tourism ontology (289 concepts)

Finance (185 million tokens):
• Reuters news from 1987
• GETESS finance ontology (1178 concepts)

A. Hotho: Text Clustering with Background Knowledge 104

Results Tourism Domain


A. Hotho: Text Clustering with Background Knowledge 105

Results in Finance Domain

A. Hotho: Text Clustering with Background Knowledge 106

Results Tourism Domain

A. Hotho: Text Clustering with Background Knowledge 107

Results in Finance Domain

A. Hotho: Text Clustering with Background Knowledge 108

Summary

                          Effectiveness      Efficiency     Traceability
FCA                       43.81 / 41.02 %    O(2^n)         Good
Agglomerative Clustering  38.57 / 32.15 %    O(n²)          Fair
                          36.55 / 32.92 %    O(n² log(n))
                          36.78 / 33.35 %    O(n² log(n))
Divisive Clustering       36.42 / 32.77 %    O(n²)          Weak-Fair


A. Hotho: Text Clustering with Background Knowledge 109

TextToOnto & FCA

A. Hotho: Text Clustering with Background Knowledge 110

Text2Onto

Ontology learning framework developed at AIFB.
Algorithms for extracting …
• concepts, instances
• subclass-of / instance-of relations
• non-taxonomic / subtopic-of relations
• disjointness axioms

Incremental ontology learning
Independent of the concrete ontology language

A. Hotho: Text Clustering with Background Knowledge 111

Experimental results

• Formal Concept Analysis yields better concept hierarchies than similarity-based clustering algorithms,

• The results of FCA are easier to understand (intensional description of concepts!),

• Bi-Section-KMeans is most efficient (O(n²)),

• Though FCA is exponential in the worst case, it shows a favourable runtime behaviour (sparsely populated formal contexts).

A. Hotho: Text Clustering with Background Knowledge 112

Other Clustering Approaches

Bottom-up / agglomerative:
• (ASIUM system) Faure and Nedellec 1998
• Caraballo 1999
• (Mo‘K Workbench) Bisson et al. 2000

Other:
• Hindle 1990
• Pereira et al. 1993
• Hovy et al. 2000


A. Hotho: Text Clustering with Background Knowledge 113

Ontology Learning References

• Reinberger, M.-L., & Spyns, P. (2005). Unsupervised text mining for the learning of dogma-inspired ontologies. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

• Philipp Cimiano, Andreas Hotho, Steffen Staab: Comparing Conceptual, Divise and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004: 435-439

• P. Cimiano, A. Pivk, L. Schmidt-Thieme and S. Staab, Learning Taxonomic Relations from Heterogenous Evidence. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

• Sabou M., Wroe C., Goble C. and Mishne G.: Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

• Alexander Maedche, Ontology Learning for the Semantic Web, PhD Thesis, Kluwer, 2001.

• Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)

• Alexander Maedche, Steffen Staab: Ontology Learning. Handbook on Ontologies 2004: 173-190

• M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of semantic relations between concepts of a molecular biology ontology. IJCAI, 659ff.

• A. Schutz, P. Buitelaar. RelExt: A Tool for Relation Extraction from Text in Ontology Extension. ISWC 2005.

• Faure, D., & Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology. In Velardi, P. (Ed.), Proceedings of the LREC Workshop on Adapting lexical and corpus resources to sublanguages and applications, pp. 5–12.

• Michele Missikoff, Paola Velardi, Paolo Fabriani: Text Mining Techniques to Automatically Enrich a Domain Ontology. Applied Intelligence 18(3): 323-340 (2003).

• Gilles Bisson, Claire Nedellec, Dolores Cañamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. ECAI Workshop on Ontology Learning 2000

Dr. Andreas Hotho

Text Clustering

A. Hotho: Text Clustering with Background Knowledge 115

Motivation

• Challenge: browse, search and organize the huge amount of unstructured text documents available on the internet or in company intranets
  - huge sets of documents on the internet
  - portals like Yahoo.com, DMoz.org, Web.de are manually structured
  - meta search engines like Vivisimo.com use clustering techniques to structure the search results

• Advantage: the structure and the visualization of the information provided by the clustering help the user to work with a larger amount of information

A. Hotho: Text Clustering with Background Knowledge 116

Motivation


A. Hotho: Text Clustering with Background Knowledge 117

Text Clustering

[…] the partitioning of texts into previously unseen categories […]
A. Hotho et al., SIGIR 2003 Semantic Web Workshop

Automatic text clustering uses full-text vector representations of text documents, as in information retrieval, within standard clustering algorithms.

A. Hotho: Text Clustering with Background Knowledge 118

Motivation: Overall Process

[Figure: objects are mapped to a representation (e.g. term counts for Obj1..Obj4 over features such as morgens, abends, team, baseman, oman, discount, oil, crude); a similarity measure / distance function feeds the cluster algorithm, which produces clusters plus an explanation; background knowledge can support representation, similarity and explanation]

A. Hotho: Text Clustering with Background Knowledge 119

Motivation: requirements on the clustering methods

Efficient
• results should also be available on large data sets or on ad-hoc collections, e.g. from search engines

Effective
• the cluster result must be correct

Explanatory power
• the results of the clustering process must be understandable

User interaction and subjectivity
• the user has his own idea of the clustering goal and wants to integrate this into the clustering process

A. Hotho: Text Clustering with Background Knowledge 120

Text Clustering with Background Knowledge

Design decisions:
• choose a representation → bag of terms (details on the next slide)
• similarity measure → cosine
• clustering algorithm → Bi-Section-KMeans (a variant of KMeans)
• data set → Reuters for our studies (min15, max100)


A. Hotho: Text Clustering with Background Knowledge 121

Preprocessing steps

docid  term1  term2  term3  ...
doc1   0      0      1
doc2   2      3      1
doc3   10     0      0
doc4   2      23     0
...

• build a bag-of-words model
• extract word counts (term frequencies)
• remove stopwords
• pruning: drop words with less than e.g. 30 occurrences
• weighting of document vectors with tfidf (term frequency - inverted document frequency):

tfidf(t,d) = log(tf(t,d) + 1) · log(|D| / df(t))

|D|     number of documents
df(t)   number of documents d which contain term t

A. Hotho: Text Clustering with Background Knowledge 122

Ontology

An ontology O represents the background knowledge. The core ontology consists of:
• a set of concepts C
• a concept hierarchy or taxonomy ≤_C
• a lexicon Lex

[Figure: example ontology: Root subsumes Person, Publication, Project and Topic; Person subsumes AcademicStaff and Student (with PhDStudent); Publication subsumes Article and Book; Topic subsumes ResearchTopic with subtopics such as KnowledgeManagement and DistributedOrganization; the lexicon maps labels to concepts, e.g. DE:Wissensmanagement and EN:Knowledge Management to the concept KnowledgeManagement]

A. Hotho: Text Clustering with Background Knowledge 123

WordNet as ontology
• 109,377 concepts (synsets)
• 144,684 lexical entries

[Figure: WordNet entries around "oil": as a noun, oil (related senses crude oil, oil color, oil paint) sits below lipid, organic compound, chemical compound, substance, physical object, entity, Root; as a verb, oil/anoint (EN:anoint, EN:inunct) means to cover with oil or to bless; further nodes shown include covering, coating, paint, cover]

Use of superconcepts (hypernyms in WordNet)
• exploit more general concepts
• example: chemical compound is the 3rd superconcept of oil
• "prune" unimportant superconcepts with tfidf

Word sense disambiguation strategies: all, first, context
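A sketch (assuming NLTK's WordNet interface) of looking up the hypernyms ("superconcepts") of a term, as used when building the concept vectors; the "first" disambiguation strategy simply takes the first synset, and the printed path is only an example.

from nltk.corpus import wordnet as wn  # assumption: NLTK with the WordNet data installed

def superconcepts(word, depth=5):
    # "first" strategy: take the first noun synset of the word
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    concept, result = synsets[0], []
    for _ in range(depth):
        hypernyms = concept.hypernyms()
        if not hypernyms:
            break
        concept = hypernyms[0]      # follow one hypernym path upwards
        result.append(concept.name())
    return result

print(superconcepts("oil"))  # e.g. ['lipid.n.01', ...]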

A. Hotho: Text Clustering with Background Knowledge 124

Reuters texts

Doc 17892 (class: crude)
=============

Oman has granted term crude oil customers retroactive discounts from official prices of 30 to 38 cents per barrel on liftings made during February, March and April, the weekly newsletter Middle East Economic Survey (MEES) said. MEES said the price adjustments, arrived at through negotiations between the Omani oil ministry and companies concerned, are designed to compensate for the difference between market-related prices and the official price of 17.63 dlrs per barrel adopted by non-OPEC Oman since February. REUTER


A. Hotho: Text Clustering with Background Knowledge 125

Ontology-based representation

(1) Original text: Oman has granted term crude oil customers retroactive discounts ...
(2) Term vector: Oman 2, granted 1, term 1, crude 1, oil 2, customer 1, retroactive 1, discount 1, ...
(3) Concept-enriched vector: Oman 2, granted 1, term 1, crude 1, oil 2, lipid 2, compound 2, customer 1, retroactive 1, discount 1, ...

different strategies: add, replace, only
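A sketch of the three integration strategies; the mapping from terms to concepts (e.g. WordNet synsets and hypernyms) is abstracted into a term_to_concepts dictionary:

def integrate(term_counts, term_to_concepts, strategy="add"):
    """term_counts: dict term -> frequency. Returns the concept-enriched vector.

    add:     keep all term frequencies and add the concept frequencies
    replace: like add, but drop terms for which at least one concept was found
    only:    use the concept frequencies alone
    """
    vector = {}
    for term, freq in term_counts.items():
        for c in term_to_concepts.get(term, []):
            vector[c] = vector.get(c, 0) + freq
    if strategy == "only":
        return vector
    for term, freq in term_counts.items():
        if strategy == "replace" and term_to_concepts.get(term):
            continue                     # term is covered by a concept -> drop it
        vector[term] = vector.get(term, 0) + freq
    return vector

# integrate({"oil": 2, "oman": 2}, {"oil": ["lipid", "compound"]}, "add")
# -> {"lipid": 2, "compound": 2, "oil": 2, "oman": 2}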

A. Hotho: Text Clustering with Background Knowledge 126

Bi-Partitioning K-Means

Input: set of documents D, number of clusters k
Output: k clusters that exhaustively partition D

Initialize: P* = {D}

Outer loop: repeat k-1 times: bi-partition the largest cluster E ∈ P*

  Inner loop: randomly choose two documents from E as initial e1, e2
    Repeat until convergence is reached:
      Assign each document from E to the nearest of the two ei; thus split E into E1, E2
      Re-compute e1, e2 as the centroids of the document representations assigned to them

  P* := (P* \ {E}) ∪ {E1, E2}
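A compact sketch of this algorithm with cosine similarity on dense numpy vectors (the fixed number of inner iterations stands in for the convergence test):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def bisection_kmeans(X, k, iters=20, seed=0):
    """X: (n_docs, n_features) array. Returns a list of index arrays, one per cluster."""
    rng = np.random.default_rng(seed)
    partition = [np.arange(len(X))]
    for _ in range(k - 1):
        # outer loop: pick the largest cluster E and bi-partition it
        largest = max(range(len(partition)), key=lambda i: len(partition[i]))
        E = partition.pop(largest)
        centroids = X[rng.choice(E, size=2, replace=False)].astype(float)
        for _ in range(iters):
            # inner loop: assign each document of E to the nearest centroid ...
            sims = np.array([[cosine(X[i], c) for c in centroids] for i in E])
            assign = sims.argmax(axis=1)
            # ... and recompute the centroids of the two halves
            for j in (0, 1):
                members = E[assign == j]
                if len(members):
                    centroids[j] = X[members].mean(axis=0)
        partition.extend([E[assign == 0], E[assign == 1]])
    return partition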

A. Hotho: Text Clustering with Background Knowledge 127

Evaluation of Text Clustering

[Chart: average purity (approx. 0.51 to 0.63) with and without background knowledge, varying the disambiguation strategy (context, all), the hypernym depth (0, 5) and pruning (30)]

Evaluation parameters:
• min 15, max 100, 2619 documents
• k = 60 clusters
• tfidf weighting
• term and concept vector
• varied: background knowledge, depth, disambiguation

A. Hotho: Text Clustering with Background Knowledge 128

Evaluation of Text Clustering

[Chart: average purity for all parameter combinations - background knowledge (false/true), hypernym depth (0, 5), disambiguation (context, first, all), integration strategy (add, replace, only), weighting (tfidf vs. none), prune 30; CLUSTERCOUNT 60, EXAMPLE 100, MINCOUNT 15. Best values 0.618 and 0.616 with background knowledge versus 0.570 without]

Evaluation parameters:
• min 15, max 100, 2619 documents
• k = 60 clusters


A. Hotho: Text Clustering with Background Knowledge 129

Evaluation of Text Clustering

Backgr. | depth | integr. | Mean PURITY  | Mean INVPURITY
false   |   -   |    -    | 0.570 ±0.019 | 0.479 ±0.016
true    |   0   |   add   | 0.585 ±0.014 | 0.492 ±0.017
true    |   0   |   only  | 0.603 ±0.019 | 0.504 ±0.021
true    |   5   |   add   | 0.618 ±0.015 | 0.514 ±0.019
true    |   5   |   only  | 0.593 ±0.010 | 0.500 ±0.016

Evaluation parameters:
• min 15, max 100, 2619 documents
• k = 60 clusters
• disambiguation = context
• prune = 30
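Purity and inverse purity, as reported in the table above, can be computed like this (a sketch assuming clusters and classes are given as dicts of document-id sets):

def purity(clusters, classes, n_docs):
    """clusters, classes: dicts mapping a label to a set of document ids."""
    return sum(max(len(docs & cls) for cls in classes.values())
               for docs in clusters.values()) / n_docs

def inverse_purity(clusters, classes, n_docs):
    # the same measure with the roles of clusters and classes exchanged
    return purity(classes, clusters, n_docs)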

A. Hotho: Text Clustering with Background Knowledge 130

Variance analysis of the Reuters classes

Idea: Ideally, documents of one class should have the same representation (variance = 0). If the representation of the documents is changed, the variance will also change.

Analysis:
• Compare the variance of the classes for both representations (with and without ontology)
• Compare the purity per class
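The per-class variance used in this analysis can be sketched as the mean squared distance of a class's document vectors to their centroid (numpy assumed):

import numpy as np

def class_variance(X, labels):
    """X: (n_docs, n_features) array; labels: array of class names, one per document."""
    variances = {}
    for cls in np.unique(labels):
        members = X[labels == cls]
        centroid = members.mean(axis=0)
        variances[cls] = float(((members - centroid) ** 2).sum(axis=1).mean())
    return variances

# comparing class_variance(X_terms, y) with class_variance(X_concepts, y) shows how the
# ontology-based representation changes the homogeneity of each Reuters class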

A. Hotho: Text Clustering with Background Knowledge 131

Variance analysis

[Chart: variance and purity per class for PRC-min15-max100 (depth 5, add strategy). X-axis: the Reuters classes (earn, pet-chem, meal-feed, ship, lead, jobs, ..., gold, rubber, heat, cotton); y-axis: percentage difference from -30% to 60%; series: variance difference, purity difference, and a linear trend of the purity difference]

A. Hotho: Text Clustering with Background Knowledge 132

Conclusion

Background knowledge helps to improve clustering results: similar terms in two documents may contribute to a good similarity rating if they are related via WordNet synsets or hypernyms.

Adding background knowledge is not beneficial per se

It has to be combined with
• term and concept weighting
• word sense disambiguation


A. Hotho: Text Clustering with Background Knowledge 133

Conclusion and Outlook

Ontologies provide the background knowledge for
• clustering of text/web documents, to achieve better clustering results
• describing text clusters, to make the descriptions more understandable
For more details see [Hotho et al. 2003]

Some possible improvements:
• include more aspects of WordNet, e.g. adjectives
• use domain-specific ontologies, e.g. AGROVOC

Dr. Andreas Hotho

Text Clustering with FCA

A. Hotho: Text Clustering with Background Knowledge 135

Introduction: Clustering

case | sex | glasses | moustache | smile | hat
  1  |  m  |    y    |     n     |   y   |  n
  2  |  f  |    n    |     n     |   y   |  n
  3  |  m  |    y    |     n     |   n   |  n
  4  |  m  |    n    |     n     |   n   |  n
  5  |  m  |    n    |     n     |   y?  |  n
  6  |  m  |    n    |     y     |   n   |  y
  7  |  m  |    y    |     n     |   y   |  n
  8  |  m  |    n    |     n     |   y   |  n
  9  |  m  |    y    |     y     |   y   |  n
 10  |  f  |    n    |     n     |   n   |  n
 11  |  m  |    n    |     y     |   n   |  n
 12  |  f  |    n    |     n     |   n   |  n

A. Hotho: Text Clustering with Background Knowledge 136

Introduction to Formal Concept Analysis


A. Hotho: Text Clustering with Background Knowledge 137

Extracted Word/Concept lists

A. Hotho: Text Clustering with Background Knowledge 138

Motivation for an Explanation of Clustering Results

Starting Point:

How do people describe a group of documents/objects?

• general and specific words are used

• background knowledge provides the general words

• background knowledge could help to find links between important but rare words of a text document

A. Hotho: Text Clustering with Background Knowledge 139

Introduction to Formal Concept Analysis

Formal Concept Analysis [Wille 1982] makes it possible to generate and visualize concept hierarchies.

FCA models concepts as units of thought, consisting of two parts:
• The extension consists of all objects belonging to the concept.
• The intension consists of all attributes common to all those objects.

A. Hotho: Text Clustering with Background Knowledge 140

Introduction to Formal Concept Analysis

             | bank | financ | market | team | baseman | season
FinanceText1 |  X   |   X    |   X    |      |         |
FinanceText2 |  X   |   X    |   X    |      |         |
SportText1   |      |        |        |  X   |    X    |   X
SportText2   |      |        |        |  X   |    X    |   X

Example: texts from the WWW


A. Hotho: Text Clustering with Background Knowledge 141

Introduction to Formal Concept Analysis

(formal context: the table from the previous slide, where the rows are the objects and the columns are the attributes)

Definition: A formal context is a triple (G, M, I), where
• G is a set of objects,
• M is a set of attributes,
• and I is a relation between G and M.
• (g, m) ∈ I is read as "object g has attribute m".
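For a toy context like the one above, all formal concepts can be computed naively by closing the object intents under intersection (a sketch; real FCA tools use far more efficient algorithms):

from itertools import combinations

def formal_concepts(context, attributes):
    """context: dict object -> set of attributes. Returns (extent, intent) pairs."""
    # every concept intent is an intersection of object intents (or the full attribute set M)
    intents = {frozenset(attributes)}
    for r in range(1, len(context) + 1):
        for objs in combinations(context, r):
            intents.add(frozenset.intersection(*(frozenset(context[o]) for o in objs)))
    concepts = []
    for intent in intents:
        extent = {g for g, attrs in context.items() if intent <= set(attrs)}
        concepts.append((extent, set(intent)))
    return concepts

context = {
    "FinanceText1": {"bank", "financ", "market"},
    "FinanceText2": {"bank", "financ", "market"},
    "SportText1":   {"team", "baseman", "season"},
    "SportText2":   {"team", "baseman", "season"},
}
# formal_concepts(context, {"bank", "financ", "market", "team", "baseman", "season"})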

A. Hotho: Text Clustering with Background Knowledge 142

Introduction to Formal Concept Analysis

Concept lattice

(formal context as on the previous slides)

A. Hotho: Text Clustering with Background Knowledge 143

Introduction to Formal Concept Analysis

             | bank | financ | market | american | team | baseman | season
FinanceText1 |  X   |   X    |   X    |    X     |      |         |
FinanceText2 |  X   |   X    |   X    |          |      |         |
SportText1   |      |        |        |          |  X   |    X    |   X
SportText2   |      |        |        |    X     |  X   |    X    |   X

A. Hotho: Text Clustering with Background Knowledge 144

Introduction to Formal Concept Analysis


A. Hotho: Text Clustering with Background Knowledge 145

FCA text clustering

• preprocess text documents

• extract a description for all documents

• calculate FCA lattice

• visualize lattice

A. Hotho: Text Clustering with Background Knowledge 146

Motivation: Overall Process

[Diagram: overall process - objects → representation of objects (a table with attributes such as morning, evening, team, baseman and counts for Obj1 to Obj4) → similarity measure / distance function → cluster algorithm (FCA) → explanation (FCA)]

A. Hotho: Text Clustering with Background Knowledge 147

Example corpus

• 21 documents collected from the internet

• 3 categories: soccer, finance and software

• 1419 different word stems, of which 253 are stopwords

A. Hotho: Text Clustering with Background Knowledge 148

Lattice for 21 documents with 117 terms (θ = 15%)


A. Hotho: Text Clustering with Background Knowledge 149

Extraction of cluster descriptions

• The lattice with all terms/concepts is too large to act as a basis for a description

selection of the most important terms/concepts

• Approach: introduce a threshold θ and remove all terms of the document vector with a value smaller than θ (e.g. θ = 25% of the max value); a sketch follows after the tables below

Term weights:
             | bank | financ | market | team | baseman | season
FinanceText1 |  1   |   2    |   1    |      |         |
FinanceText2 |  2   |   2    |   1    |      |         |
SportText1   |      |        |        |  1   |    3    |   2
SportText2   |      |        |        |  1   |    2    |   2

After thresholding, the binary formal context shown above is obtained.
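A sketch of the thresholding step (here θ is interpreted relative to the maximum weight of each document vector, as in the 25% example):

def to_formal_context(doc_vectors, theta=0.25):
    """doc_vectors: dict doc -> dict term -> weight. Keeps terms whose weight
    reaches theta * (maximum weight of that document) as binary attributes."""
    context = {}
    for doc, weights in doc_vectors.items():
        limit = theta * max(weights.values())
        context[doc] = {t for t, w in weights.items() if w >= limit}
    return context

# to_formal_context({"FinanceText1": {"bank": 1, "financ": 2, "market": 1}}, theta=0.25)
# -> {"FinanceText1": {"bank", "financ", "market"}}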

A. Hotho: Text Clustering with Background Knowledge 150

Lattice with θ = 80%

A. Hotho: Text Clustering with Background Knowledge 151

Lattice with θ = 45%

A. Hotho: Text Clustering with Background Knowledge 152

Lattice with manually selected terms


A. Hotho: Text Clustering with Background Knowledge 153

Lessons learned

• results are not really good
• the lattice is too fine-grained
• the lattice is difficult to interpret
• the absence/presence of a term in a document description usually results in a totally different lattice
• use clustering approaches like KMeans to reduce these effects

A. Hotho: Text Clustering with Background Knowledge 154

Motivation: Overall Process

[Diagram: overall process as before - objects → representation of objects → similarity measure / distance function → cluster algorithm → explanation]

A. Hotho: Text Clustering with Background Knowledge 155

Visualization of Bi-Sec-K-Means clustering results

• Compute 10 Bi-Sec-KMeans clusters
• Extract a term description
• Compute the lattice
• Visualize the lattice

A. Hotho: Text Clustering with Background Knowledge 156

Result for 10 clusters


A. Hotho: Text Clustering with Background Knowledge 157

Result for the same terms, but not based on the clusters

A. Hotho: Text Clustering with Background Knowledge 158

Motivation: Overall Process

[Diagram: overall process as before, now with background knowledge feeding into the representation of objects]

A. Hotho: Text Clustering with Background Knowledge 159

Extracted Word/Concept lists

A. Hotho: Text Clustering with Background Knowledge 160

Combine FCA & Standard Text-clustering

• preprocess the Reuters documents and enrich them with background knowledge (WordNet)

• calculate a reasonable number k (100) of clusters with BiSec-KMeans using cosine similarity

• extract a description for all clusters
• relate the clusters (objects) with FCA
• use the visualization of the concept lattice for better understanding


A. Hotho: Text Clustering with Background Knowledge 161

Extracting Cluster Descriptions

Using all concepts (synsets) as attributes for FCA results in a concept lattice that is too large

select the important ones

Approach: introduce two thresholds θ1, θ2:
• for every centroid, drop all concepts (synsets) with a value lower than θ1
• mark all concepts (synsets) between θ1 and θ2 with "m" and those above θ2 with "h"
• we chose θ1 = 7% and θ2 = 20% of the max value
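A sketch of this two-threshold scaling of a cluster centroid; encoding the m/h marks as suffixed attribute names (e.g. "oil_m", "oil_h") is just one possible way to feed them into FCA:

def scale_centroid(centroid, theta1=0.07, theta2=0.20):
    """centroid: dict concept -> weight. Returns the set of FCA attributes."""
    top = max(centroid.values())
    attributes = set()
    for concept, w in centroid.items():
        if w < theta1 * top:
            continue                         # drop unimportant concepts
        attributes.add(concept + "_m")       # "m": weight of at least theta1 * max
        if w >= theta2 * top:
            attributes.add(concept + "_h")   # "h": weight of at least theta2 * max
    return attributes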

A. Hotho: Text Clustering with Background Knowledge 162

Result

A. Hotho: Text Clustering with Background Knowledge 163

Result

A. Hotho: Text Clustering with Background Knowledge 164

Result

[Lattice excerpt: a chain of concepts with increasing specificity - compound, chemical compound → oil → crude oil, barrel and palm oil]


A. Hotho: Text Clustering with Background Knowledge 165

Similar example

[Lattice excerpt: a chain of concepts with increasing specificity - compound, chemical compound → oil → refiner]

A. Hotho: Text Clustering with Background Knowledge 166

Results

[Lattice excerpt: crude oil, barrel]

A. Hotho: Text Clustering with Background Knowledge 167

Results

[Lattice excerpt: resin, palm]

• The resulting concept lattice can also be interpreted as a concept hierarchy directly on the documents

• All documents in one cluster obtain exactly the same description

A. Hotho: Text Clustering with Background Knowledge 168

Results: multi-topic cluster

[Lattice excerpt for the multi-topic cluster CL8: pork, meat, ..., music, coffee, food, beverage]

• The BiSec-KMeans results are bad for this cluster

• FCA helps to identify such inconsistencies


A. Hotho: Text Clustering with Background Knowledge 169

Formal Concept Analysis (FCA) for Providing Cluster Descriptions

Apply FCA to the clusters generated by Bi-Section-KMeans: embed the clusters into a lattice structure
• Clusters are the objects
• Terms and concepts are the attributes

FCA provides two achievements:
• Intensional descriptions of the clusters are generated, exploiting concepts from the background knowledge
• Interactive exploration of the document collection is supported: browse the lattice structure, zoom into interesting parts

A. Hotho: Text Clustering with Background Knowledge 170

Conclusion and Outlook

• FCA allows a more understandable explanation of ontology-enriched (KMeans) text clusters

• Clustering of text/web documents with an ontology achieves the best clustering results

More details in: WordNet improves Text Document Clustering, Semantic Web Workshop at SIGIR, Hotho et al. 2003

• Some possible improvements:
  - include more aspects of WordNet, e.g. adjectives
  - use domain-specific ontologies, e.g. AGROVOC
  - use more sophisticated means for feature selection within FCA

Dr. Andreas Hotho

Text Classification

A. Hotho: Text Clustering with Background Knowledge 172

Text Classification

Text Classification (Text Categorization)

Text categorization (TC - a.k.a. text classification, or topic spotting), [is] the activity of labeling natural language texts with thematic categories from a predefined set […].

F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys 34(1), 2002.

Automatic text classification uses full-text vector representations of text documents, as in information retrieval, within standard classification algorithms.


A. Hotho: Text Clustering with Background Knowledge 173

Text Classification Approaches

[Diagram: documents → bag-of-words representation ("oman has granted ..." with counts per object), enriched with background knowledge → classification algorithm (AdaBoost)]

A. Hotho: Text Clustering with Background Knowledge 174

Conceptual Document Representation

Let's extract some concepts...

Detecting the appropriate set of concepts from an ontology (O, Lex) requires multiple steps:

1. Candidate Term Detection
2. Morphological Transformations
3. Word Sense Disambiguation
4. Generalization

A. Hotho: Text Clustering with Background Knowledge 175

Conceptual Document Representation: Candidate Term Detection

Querying the lexicon directly for each single word will not do the trick! (Remember the multi-word expressions!)

Solution: Move a window of maximum size over the text and decrease the window size if the lookup is unsuccessful, before moving on.

But querying the lexicon for every candidate term window produces a lot of overhead!

Solution: Avoid unnecessary lexicon queries by matching the POS tags in the window against appropriately defined syntactic patterns (e.g. noun phrases).
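A sketch of the shrinking-window lookup (the POS-pattern filter from the second solution is left out, and tokens are assumed to be already normalized):

def candidate_terms(tokens, lexicon, max_window=3):
    """tokens: list of word forms; lexicon: set of single- and multi-word entries."""
    i, found = 0, []
    while i < len(tokens):
        for size in range(max_window, 0, -1):        # try the largest window first
            candidate = " ".join(tokens[i:i + size])
            if candidate in lexicon:
                found.append(candidate)
                i += size                            # continue behind the matched expression
                break
        else:
            i += 1                                   # nothing matched at this position
    return found

# candidate_terms(["crude", "oil", "customers"], {"crude oil", "customer"}) -> ["crude oil"]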

A. Hotho: Text Clustering with Background Knowledge 176

AdaBoost

Boosting is a relatively young and very successful machine learning technique.

Boosting algorithms build so-called ensemble classifiers (meta classifiers):

1. Build many very simple "weak" classifiers.
2. Combine the weak learners in an additive model.


A. Hotho: Text Clustering with Background Knowledge 177

AdaBoost

AdaBoost maintains weights Dt over the training instances.

At each iteration t: choose the base classifier ht that performs best on the weighted training instances.

Calculate the weight parameter αt based on the performance of the base classifier: higher errors lead to smaller weights, smaller errors to higher weights.

The weight update increases (decreases) the weights of wrongly (correctly) classified instances.

Thereby, AdaBoost focuses on the "hard" training instances.
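A compact sketch of these update rules for labels in {-1, +1}; the greedy one-feature decision stump below is only an illustrative weak learner, not the exact setup used in the experiments:

import numpy as np

def stump_predict(X, feature, threshold, sign):
    return sign * np.where(X[:, feature] > threshold, 1, -1)

def best_stump(X, y, D):
    # weak learner: the stump with the smallest weighted error under distribution D
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                err = D[stump_predict(X, f, thr, sign) != y].sum()
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def adaboost(X, y, rounds=50):
    n = len(y)
    D = np.full(n, 1.0 / n)                                 # weights over the training instances
    ensemble = []
    for _ in range(rounds):
        err, f, thr, sign = best_stump(X, y, D)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # small error -> large alpha
        h = stump_predict(X, f, thr, sign)
        D *= np.exp(-alpha * y * h)                         # raise weights of misclassified instances
        D /= D.sum()
        ensemble.append((alpha, f, thr, sign))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(X, f, thr, s) for a, f, thr, s in ensemble))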

A. Hotho: Text Clustering with Background Knowledge 178

Evaluation

Datasets:

Reuters-21578
• documents about finance from 1987
• 9,603 training documents and 3,299 test documents (ModApte split)
• binary classification on the top 50 classes

OHSUMED corpus
• OHSUMED (TREC-9), titles and abstracts from medical journals, 1987
• 36,369 training documents and 18,341 test documents
• binary classification on the top 50 classes (MeSH classifications)

FAODOC corpus
• documents about agricultural information
• 1,501 documents within 21 categories

A. Hotho: Text Clustering with Background Knowledge 179

Evaluation: Reuters Results

• Top 50 Reuters classes with 17,525 term stems / 10,259 - 27,236 synset features

A. Hotho: Text Clustering with Background Knowledge 180

Evaluation: OHSUMED Results

Top 50 classes with WordNet


A. Hotho: Text Clustering with Background Knowledge 181

Evaluation: OHSUMED Results

Relative improvement on the top 50 classes with WordNet

A. Hotho: Text Clustering with Background Knowledge 182

Evaluation: OHSUMED Results

Relative improvement on the top 50 classes with the MeSH ontology (~22,000 concepts, all strategy)

A. Hotho: Text Clustering with Background Knowledge 183

Evaluation: Reuters Results

• Top 50 Reuters classes with 17,525 term stems / 10,259 - 27,236 synset features

A. Hotho: Text Clustering with Background Knowledge 184

Evaluation: Reuters Results

Relative improvement on the top 50 classes

Page 47: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

A. Hotho: Text Clustering with Background Knowledge 185

Evaluation: FAODOC Results

A. Hotho: Text Clustering with Background Knowledge 186

Evaluation: FAODOC Results

A. Hotho: Text Clustering with Background Knowledge 187

Conclusion and Outlook

• Successful integration of conceptual features to improve classification performance

• Generalization does improve classification results in most cases

A. Hotho: Text Clustering with Background Knowledge 188

Conclusion and Outlook

• Advanced Generalization Strategies

• Development of additional weak learner plugins that exploit ontologies more directly

• Heuristics for efficient handling of continuous feature values like TFIDF in AdaBoost

• Multilingual Text Classification

Page 48: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

Dr. Andreas Hotho

Application driven Evaluation of Ontology Learning

A. Hotho: Text Clustering with Background Knowledge 190

Ontology Learning

• Until now we used manually engineered ontologies.
• Large ontologies are not available for every domain.
• Building such ontologies requires a big effort.

• Idea: Learn Ontologies from text

A. Hotho: Text Clustering with Background Knowledge 191

Ontologies: Semantic Structures

Ontologies: MeSH Tree Structures, WordNet → conceptual feature representation

[Diagram: ontology learning - term vectors + linguistic context vectors → term clustering → concept vectors and "learned" ontology structures]
[Maedche & Staab 2001] [Cimiano et al. ECAI 2004] [Cimiano et al. JAIR 2005]

Why?
• knowledge acquisition bottleneck
• adaptation to the domain context
• just have some fun trying something weird

A. Hotho: Text Clustering with Background Knowledge 192

Ontology Learning as Term Clustering

Hierarchical clustering approaches, e.g.
• agglomerative (bottom-up) clustering
• Bi-Section-KMeans clustering

are used to create taxonomic structures (concept hierarchies); a minimal sketch follows at the end of this slide.

Quality of learned semantic structures is surprisingly high.

[Cimiano et al. ECAI 2004]

Deficiencies (?):
• taxonomic relations mix with synonyms and other relations
• binary splits
• superconcepts - i.e. clusters - are not mapped to lexical entries
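As a minimal sketch of taxonomy learning by agglomerative clustering of term context vectors (scipy assumed; the terms and context counts are a toy example, not data from the experiments):

import numpy as np
from scipy.cluster.hierarchy import linkage

terms = ["oil", "gas", "coffee", "cocoa"]
contexts = np.array([                 # toy counts of linguistic contexts per term
    [3, 1, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 3, 2],
], dtype=float)

# agglomerative (bottom-up) clustering with cosine distance and average linkage;
# the resulting merge tree is read as an (unlabeled) concept hierarchy over the terms
Z = linkage(contexts, method="average", metric="cosine")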

Page 49: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

A. Hotho: Text Clustering with Background Knowledge 193

Learned Ontology: kmeans 7000

A. Hotho: Text Clustering with Background Knowledge 194

Evaluation Setting: Ontologies

Learned ontologies:
• linguistic contexts from the 1987 portion of the OHSUMED corpus
• based on the top 10,000 terms ∩ MeSH terms = 7,000 terms (cheated)
  - agglomeratively clustered
  - bi-sec-kmeans clustered
• based on the top 14,000 terms
  - bi-sec-kmeans clustered

Competitors:
• MeSH Tree Structures
  - maintained by the United States National Library of Medicine
  - more than 22,000 hierarchically organized concepts
• WordNet
  - (psycho-)linguistic ontology
  - 115,424 synsets in total, 79,689 synsets in the noun category

A. Hotho: Text Clustering with Background Knowledge 195

Evaluation Setting: Text Classification and Clustering

OHSUMED Corpus, (TREC-9), titles and abstracts from medical journals, 1987

Typically regarded as a rather "hard" corpus.

Text classification setting:
• 36,369 training documents and 18,341 test documents
• binary classification on the top 50 classes (MeSH classifications)
• classification algorithm: AdaBoost with binary decision stumps, 1000 iterations

Text clustering setting:
• 4,390 documents rated relevant for one of 106 queries
• cluster-to-query evaluation
• clustering algorithm: bi-section-kmeans
• weighting: TFIDF, pruning level: 20

A. Hotho: Text Clustering with Background Knowledge 196

Evaluation Results: Text Classification

* extensive experimental evaluation for different superconcept integration depths (3, 5, 10, 15, 20, 25, 30) - only the optimal feature configuration (wrt. F1) for each ontology is shown

Page 50: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

A. Hotho: Text Clustering with Background Knowledge 197

Evaluation Results: Text Classification

[Chart: relative improvement in macro F1 and micro F1 (0% to 8%) for the best feature configuration per ontology - 7000-agglo (term & concept.sc30), 7000-bisec-kmeans (term & concept.sc15), 14000-bisec-kmeans (term & concept.sc20), WordNet (term & synset.context.hyp5), MeSH Tree Struct (term & mesh.sc3); significance markers T*/T** and S*/S**]

A. Hotho: Text Clustering with Background Knowledge 198

Evaluation Results: Text Clustering

* extensive experimental evaluation for different superconcept integration depths (3, 5, 10, 15, 20, 25, 30) - only the optimal feature configuration (wrt. purity) for each ontology is shown

** all results are averages over 20 runs with different random seeds

A. Hotho: Text Clustering with Background Knowledge 199

Last but not least…

Main points of this lesson:
• Integration of explicit conceptual features improves text clustering and classification performance.
• Learned ontologies achieve improvements competitive with manually created ontologies.
• In both cases, the major improvement is due to generalization.

Outlook:
• Investigation of the relation to purely statistical "conceptualizations", e.g. LSI, PLSA
• Improvements in ontology learning
• More advanced generalization strategies

A. Hotho: Text Clustering with Background Knowledge 200

Literature

• Stephan Bloehdorn, Andreas Hotho: Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM 2004.

• Andreas Hotho, Steffen Staab, Gerd Stumme: WordNet improves text document clustering; Semantic Web Workshop @ SIGIR 2003.

• W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam. OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. SIGIR 1994.

• Alexander Maedche, Steffen Staab. Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16(2):72–79, 2001.

• Philipp Cimiano, Andreas Hotho, Steffen Staab. Comparing Conceptual, Partitional and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004. Extended version to appear (JAIR 2005).

Page 51: Text Clustering with Background Knowledge · A. Hotho: Text Clustering with Background Knowledge 9 CV name education work private Meaning of Informationen: (or: what it means to be

Dr. Andreas Hotho

Background Knowledge

A. Hotho: Text Clustering with Background Knowledge 202

Statistical Concepts as Background Knowledge

• Calculating a kind of statistical concept and combining it with the classical bag-of-words representation

L. Cai and T. Hofmann. Text Categorization by Boosting Automatically Extracted Concepts. In Proc. of the 26th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003.

• Clustering words to set up a kind of concept

G. Karypis and E. Han. Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. In Proc. of 9th ACM International Conference on Information and Knowledge Management, CIKM-00, pages 12–19, New York, US, 2000. ACM Press.

• Clustering words and documents simultaneously

Inderjit S. Dhillon, Yuqiang Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In 2nd SIAM International Conference on Data Mining (Workshop on Clustering High-Dimensional Data and its Applications), 2002.

A. Hotho: Text Clustering with Background Knowledge 203

Text Classification and Ontologies

• Using hypernyms of WordNet as concept features (no WSD, no significantly better results)

Sam Scott , Stan Matwin, Feature Engineering for Text Classification, Proceedings of the Sixteenth International Conference on Machine Learning, p.379-388, June 27-30, 1999

• The Brown Corpus tagged with WordNet senses does not show significantly better results.

A. Kehagias, V. Petridis, V. G. Kaburlasos, and P. Fragkou. A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems, 21(3):227–247, 2000.

• Map terms to concepts of the UMLS ontology to reduce the size of the feature set, use a search algorithm to find superconcepts; evaluation using kNN and Medline documents shows an improvement.

B. B. Wang, R. I. Mckay, H. A. Abbass, and M. Barlow. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australian Computer Science Conference (ACSC-2003), pages 69–78. Australian Computer Society, 2003.

• Generative model consisting of features, concepts and topics, using WordNet to initialize the parameters for the concepts; evaluation on Reuters and Amazon corpora

Georgiana Ifrim, Martin Theobald, Gerhard Weikum, Learning Word-to-Concept Mappings for Automatic Text Classification Learning in Web Search Workshop 2005.

A. Hotho: Text Clustering with Background Knowledge 204

Using Ontologies

WordNet and IR: Query expansion with WordNet does not really improve the performance

Ellen M. Voorhees, Query expansion using lexical-semantic relations, Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, p.61-69, July 03-06, 1994, Dublin, Ireland

Text Clustering and Ontologies: WordNet synset chains

Green: WordNet chains (Stephen J. Green. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(5):713–730, 1999.)

Dave et al.: worse results using an ontology (no word sense disambiguation)

(Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the Twelfth International World Wide Web Conference, WWW2003. ACM, 2003.)

Part of Speech attributes and named entities used as features

(Vasileios Hatzivassiloglou, Luis Gravano, and Ankineedu Maganti. An investigation of linguistic features and clustering algorithms for topical document clustering. In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece. ACM, 2000.)


Dr. Andreas Hotho

Literature

Tag: SumSchool06

http://www.bibsonomy.org/tag/SumSchool06

A. Hotho: Text Clustering with Background Knowledge 206

Selected Literature

Semantic Web & Ontology
Y. Sure and R. Studer. Vision for Semantically-Enabled Knowledge Technologies.

Online at: KTweb -- Connecting Knowledge Technologies Communities, 2003.

Y. Sure and R. Studer: A Methodology for Ontology-based Knowledge Management. In: On-To-Knowledge: Semantic Web enabled Knowledge Management. J. Davies, D. Fensel, F. van Harmelen (eds.), ISBN: 0-470-84867-7, Wiley, 2002, pages 33-46.

Y. Sure, S. Staab and R. Studer. Methodology for Development and Employment of Ontology Based Knowledge Management Applications. In: SIGMOD Record, Vol. 31, No. 4, pp. 18-23, December 2002.

S. Staab, H.-P. Schnurr, R. Studer, and Y. Sure: Knowledge Processes and Ontologies. In: IEEE Intelligent Systems 16(1), January/February 2001, Special Issue on Knowledge Management.

A. Hotho: Text Clustering with Background Knowledge 207

Selected Literature

Foundations of the Semantic Web

Semantic Web Activity at W3C: http://www.w3.org/2001/sw/
www.semanticweb.org (currently relaunched)
Journal of Web Semantics
D. Fensel et al.: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press 2003
G. Antoniou, F. van Harmelen. A Semantic Web Primer, MIT Press 2004.
S. Staab, R. Studer (eds.). Handbook on Ontologies. Springer Verlag, 2004.
S. Handschuh, S. Staab (eds.). Annotation for the Semantic Web. IOS Press, 2003.
International Semantic Web Conference series, yearly since 2002, LNCS
World Wide Web Conference series, ACM Press, first Semantic Web papers since 1999
York Sure, Pascal Hitzler, Andreas Eberhart, Rudi Studer, The Semantic Web in One Day, IEEE Intelligent Systems, http://www.aifb.uni-karlsruhe.de/WBS/phi/pub/sw_inoneday.pdf

Some slides have been stolen from various places, from Jim Hendler and Frank van Harmelen, in particular.

A. Hotho: Text Clustering with Background Knowledge 208

Selected Literature

Ontology Learning References Reinberger, M.-L., & Spyns, P. (2005). Unsupervised text mining for the learning of dogma-inspired ontologies. In Buitelaar, P.,

Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

Philipp Cimiano, Andreas Hotho, Steffen Staab: Comparing Conceptual, Divisive and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004: 435-439

P. Cimiano, A. Pivk, L. Schmidt-Thieme and S. Staab, Learning Taxonomic Relations from Heterogenous Evidence. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.

Sabou M., Wroe C., Goble C. and Mishne G., Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

Alexander Maedche, Ontology Learning for the Semantic Web, PhD Thesis, Kluwer, 2001.

Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)

Alexander Maedche, Steffen Staab: Ontology Learning. Handbook on Ontologies 2004: 173-190

M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of semantic relations between concepts of a molecular biology ontology. IJCAI, 659ff.

A. Schutz, P. Buitelaar. RelExt: A Tool for Relation Extraction from Text in Ontology Extension. ISWC 2005.

Faure, D., & Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology. In Velardi, P. (Ed.), Proceedings of the LREC Workshop on Adapting lexical and corpus resources to sublanguages and applications, pp. 5–12.

Michele Missikoff, Paola Velardi, Paolo Fabriani: Text Mining Techniques to Automatically Enrich a Domain Ontology. Applied Intelligence 18(3): 323-340 (2003).

Gilles Bisson, Claire Nedellec, Dolores Cañamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. ECAI Workshop on Ontology Learning 2000


A. Hotho: Text Clustering with Background Knowledge 209

Selected Literature

Semantic Web & Ontology
Y. Sure, S. Staab, J. Angele. OntoEdit: Guiding Ontology Development by Methodology and Inferencing. In: R. Meersman, Z. Tari et al. (eds.). Proceedings of the Confederated International Conferences CoopIS, DOA and ODBASE 2002, October 28th - November 1st, 2002, University of California, Irvine, USA, Springer, LNCS 2519, pages 1205-1222.

Y. Sure, M. Erdmann, J. Angele, S. Staab, R. Studer and D. Wenke. OntoEdit: Collaborative Ontology Engineering for the Semantic Web. In: Proceedings of the first International Semantic Web Conference 2002 (ISWC 2002), June 9-12 2002, Sardinia, Italia, Springer, LNCS 2342, pages 221-235.

E. Bozsak, M. Ehrig, S. Handschuh, A. Hotho, A. Mädche, B. Motik, D. Oberle, C. Schmitz, S. Staab, L. Stojanovic, N. Stojanovic, R. Studer, G. Stumme, Y. Sure, J. Tane, R. Volz, V. Zacharias. KAON - Towards a large scale Semantic Web. In: Proceedings of EC-Web 2002 (in combination with DEXA 2002). Aix-en-Provence, France, September 2-6, 2002. LNCS, Springer, 2002, pages 304-313.

A. Hotho: Text Clustering with Background Knowledge 210

Selected Literature

Text Clustering with Background Knowledge
A. Hotho, S. Staab, and G. Stumme. Explaining text clustering results using semantic structures. In Proc. of the 7th PKDD, 2003.

B. Lauser and A. Hotho. Automatic multi-label subject indexing in a multilingual environment. In Proc. of the 7th European Conference in Research and Advanced Technology for Digital Libraries, ECDL 2003, 2003.

A. Hotho, S. Staab, and G. Stumme. Text clustering based on background knowledge. Technical Report 425, University of Karlsruhe, Institute AIFB, 2003.

Hotho, A., Mädche, A., Staab, S.: Ontology-based Text Clustering, Workshop "Text Learning: Beyond Supervision", IJCAI 2001.

A. Hotho, A. Maedche, S. Staab, V. Zacharias: On Knowledgeable Supervised Text Mining. To appear in: "Text Mining" Workshop Proceedings, Springer, 2002.

A. Hotho: Text Clustering with Background Knowledge 211

Selected Literature

Using Ontologies
Stephan Bloehdorn, Andreas Hotho: Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM 2004: 331-334

Andreas Hotho, Steffen Staab, Gerd Stumme: Ontologies Improve Text Document Clustering. ICDM 2003: 541-544

Andreas Hotho, Steffen Staab, Gerd Stumme: Explaining Text Clustering Results Using Semantic Structures. PKDD 2003: 217-228

Stephan Bloehdorn, Philipp Cimiano, and Andreas Hotho: Learning Ontologies to Improve Text Clustering and Classification, Proc. of GfKl, 2005.

Semantic Web Mining
B. Berendt, A. Hotho, and G. Stumme. Towards semantic web mining. In I. Horrocks and J. A. Hendler, editors, The Semantic Web - ISWC 2002, First International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, Proceedings, volume 2342 of Lecture Notes in Computer Science, pages 264–278. Springer, 2002.