Transcript of: Text Clustering with Background Knowledge
Dr. Andreas Hotho
Text Clustering with Background Knowledge
A. Hotho: Text Clustering with Background Knowledge 2
Agenda
• Introduction
• Semantic Web
• Semantic Web Mining
• Typical preprocessing steps for Text Mining
• Ontology Learning
• Text Clustering with Background Knowledge
• Text Clustering using FCA
• Text Classification using Background Knowledge
• Application-driven evaluation of Ontology Learning
• Different kinds of Background Knowledge
A. Hotho: Text Clustering with Background Knowledge 3
Knowledge and Data Engineering Group @ University of Kassel
Founded in April 2004
Head: Prof. Gerd Stumme
Member of the Research Center L3S
Research areas:
• Semantic Web / Ontologies
• Knowledge Discovery
• Web Mining
• Peer-to-Peer
• Folksonomies
• Social Bookmark Systems
A. Hotho: Text Clustering with Background Knowledge 4
Acknowledgement
Some of the slides are taken from:
• ECML/PKDD Tutorial “Ontology Learning from text”, Paul Buitelaar, Philipp Cimiano, Marko Grobelnik, Michael Sintek
• KDD Course of AIFB Karlsruhe and KDE Kassel
• Semantic Web Tutorial Slides from AIFB
• Some slides of the Semantic Web introduction have been stolen from various places, from Jim Hendler and Frank van Harmelen in particular
A. Hotho: Text Clustering with Background Knowledge 5
Resources in BibSonomy tagged with: SumSchool06
http://www.bibsonomy.org/tag/SumSchool06

Dr. Andreas Hotho
IntroductionSemantic Web
A. Hotho: Text Clustering with Background Knowledge 7
Syntax is not enough
Andreas
• Tel
• E-Mail

A. Hotho: Text Clustering with Background Knowledge 8
Information Convergence
Convergence not just in devices, also in “information”
Your personal information (phone, PDA, …)
Calendar, photo, home page, files…
Your “professional” life (laptop, desktop, … Grid)
Web site, publications, files, databases, …
Your “community” contexts (Web)
Hobbies, blogs, fanfic, social networks…
The Web teaches us that people will work to share
How do we CREATE, SEARCH, and BROWSE in the non-text based parts of our lives?
A. Hotho: Text Clustering with Background Knowledge 9
CV
name
education
work
private
Meaning of Information: (or: what it means to be a computer)
A. Hotho: Text Clustering with Background Knowledge 10
CV
name
education
work
private
<CV> <name> <education> <work> <private>

(to the computer, these tags are as opaque as: <Χς> <ναµε> <εδυχατιον> <ωορκ> <πριϖατε>)
XML ≠ Meaning, XML = Structure
A. Hotho: Text Clustering with Background Knowledge 11
XML is unspecific:
• No predetermined vocabulary
• No semantics for relationships
• These must be specified upfront
Only possible in close cooperations:
• Small, reasonably stable group
• Common interests or authorities
Not possible in the Web or on a broad scale in general !
Source of Problems
A. Hotho: Text Clustering with Background Knowledge 12
(One) Layer Model of the Semantic Web
A. Hotho: Text Clustering with Background Knowledge 13
Some Principal Ideas
• URI – uniform resource identifiers
• XML – common syntax
• Interlinked
• Layers of semantics –
from database to knowledge base to proofs
Design principles of WWW applied to Semantics!!
Tim Berners-Lee, Weaving the Web
A. Hotho: Text Clustering with Background Knowledge 14
Ontology
Ontologies enable a better communication between Humans/Machines
Ontologies standardize and formalize the meaning of words through concepts
„An ontology is an explicit specification of a conceptualization.“ [Gruber, 1993]
„People can't share knowledge if they do not speak a common language.“ [Davenport & Prusak, 1998]
A. Hotho: Text Clustering with Background Knowledge 15
What is an Ontology?
Gruber 93:
An ontology is a formal specification of a shared conceptualization of a domain of interest:
• formal specification ⇒ executable
• shared ⇒ group of persons
• conceptualization ⇒ about concepts
• domain of interest ⇒ between application and „unique truth“
A. Hotho: Text Clustering with Background Knowledge 16
Communication Principle
The semiotic triangle: a form (e.g. the word “Jaguar”) evokes a concept; the concept refers to a referent; the form stands for the referent. [Ogden, Richards, 1923]
A. Hotho: Text Clustering with Background Knowledge 17
Views on Ontologies
A spectrum from front-end to back-end, with ontologies covering the whole range:
• Front-end: Topic Maps, Thesauri, Taxonomies
• Back-end: Extended ER-Models, Semantic Networks, Predicate Logic
Typical front-end uses: Navigation, Queries, Sharing of Knowledge, Information Retrieval, Query Expansion.
Typical back-end uses: Mediation, Reasoning, Consistency Checking, EAI.
A. Hotho: Text Clustering with Background Knowledge 18
Taxonomy
Taxonomy := segmentation, classification and ordering of elements into a classification system according to the relationships between them.

Example hierarchy:
Object
  Person: Researcher, Student (PhD Student, Doctoral Student)
  Topic: Semantics (Ontology, F-Logic)
  Document
A. Hotho: Text Clustering with Background Knowledge 19
Thesaurus
• Terminology for a specific domain
• Graph with primitives and 2 fixed relationships (similar, synonym)
• Originates from bibliography

Example: the hierarchy above (Object; Person: Researcher, Student; Topic: Semantics; Document), extended with relations such as synonym(PhD Student, Doctoral Student) and similar(Ontology, F-Logic).
A. Hotho: Text Clustering with Background Knowledge 20
Topic Map
• Topics (nodes), relationships and occurrences (links to documents)
• ISO standard
• Typically used for navigation and visualisation

Example: the topics above, connected by relationships such as knows(Person, Person), writes(Person, Document) and described_in(Topic, Document), with occurrences like Affiliation and Tel attached to Person, plus similar/synonym links as in the thesaurus.
A. Hotho: Text Clustering with Background Knowledge 21
Ontology (in our sense)
• Concept hierarchy (is_a): Object; Person with Researcher and Student (PhD Student); Topic with Semantics; Document; attributes such as Affiliation and Tel
• Relations: knows, writes, described_in, is_about, subTopicOf
• Rules over these relations
• Instances (instance_of), e.g. A. Hotho, KDE, +49 561 804 6252
• Representation language: Predicate Logic (F-Logic)
• Standards: RDF(S); upcoming standard: OWL
A. Hotho: Text Clustering with Background Knowledge 22
Ontology layer: PhD_Student and AssProf are rdfs:subClassOf AcademicStaff; the relation cooperate_with has AcademicStaff as rdfs:domain and rdfs:range.

Annotation layer (instances of the ontology classes, attached to web pages):

<swrc:AssProf rdf:ID="sst">
  <swrc:name>Steffen Staab</swrc:name>
  ...
</swrc:AssProf>
(URL: http://www.aifb.uni-karlsruhe.de/WBS/sst)

<swrc:PhD_Student rdf:ID="sha">
  <swrc:name>Siegfried Handschuh</swrc:name>
  ...
  <swrc:cooperate_with rdf:resource="http://www.aifb.uni-karlsruhe.de/WBS/sst#sst"/>
</swrc:PhD_Student>
(URL: http://www.aifb.uni-karlsruhe.de/WBS/sha)
Ontology & Metadata
Links have explicit meanings!
A. Hotho: Text Clustering with Background Knowledge 23
What’s in a link? Formally
W3C recommendations:
• RDF: an edge in a graph
• OWL: consistency (+ subsumption + classification + …)
Currently under discussion:
• Rules: a deductive database
Currently under intense research:
• Proof: worked-out proofs
• Trust: signatures & everything working together
A. Hotho: Text Clustering with Background Knowledge 24
What’s in a link? Informally
• RDF: pointing to shared data• OWL: shared terminology
• Rules: if-then-else conditions
• Proof: proof already shown• Trust: reliability
A. Hotho: Text Clustering with Background Knowledge 25
Ontologies and their Relatives (I)
There are many relatives around:
Controlled vocabularies, thesauri and classification systems available in the WWW, see http://www.lub.lu.se/metadata/subject-help.html
• Classification systems (e.g. UNSPSC, Library Science, etc.)
• Thesauri (e.g. Art & Architecture, Agrovoc, etc.)
• DMOZ Open Directory, http://www.dmoz.org
Lexical semantic nets:
• WordNet, see http://www.cogsci.princeton.edu/~wn/
• EuroWordNet, see http://www.hum.uva.nl/~ewn/
Topic Maps, http://www.topicmaps.org (e.g. used within knowledge management applications)
In general it is difficult to draw the borderline!
A. Hotho: Text Clustering with Background Knowledge 26
Ontologies and their Relatives (II)
A spectrum from lightweight to heavyweight:
Catalog / ID → Terms / Glossary → Thesauri → Informal Is-a → Formal Is-a → Formal Instance → Frames → Value Restrictions → General Logical Constraints → Axioms (Disjointness, Inverse Relations, …)
A. Hotho: Text Clustering with Background Knowledge 27
Ontologies - Some Examples
General purpose ontologies:
• WordNet / EuroWordNet, http://www.cogsci.princeton.edu/~wn
• The Upper Cyc Ontology, http://www.cyc.com/cyc-2-1/index.html
• IEEE Standard Upper Ontology, http://suo.ieee.org/
Domain- and application-specific ontologies:
• RDF Site Summary RSS, http://groups.yahoo.com/group/rss-dev/files/schema.rdf
• UMLS, http://www.nlm.nih.gov/research/umls/
• GALEN
• SWRC – Semantic Web Research Community, http://ontoware.org/projects/swrc/
• RETSINA Calendering Agent, http://ilrt.org/discovery/2001/06/schemas/ical-full/hybrid.rdf
• Dublin Core, http://dublincore.org/
Web services ontologies:
• Core Ontology of Services, http://cos.ontoware.org
• Web Service Modeling Ontology, http://www.wsmo.org
• DAML-S
Meta-ontologies:
• Semantic Translation, http://www.ecimf.org/contrib/onto/ST/index.html
• RDFT, http://www.cs.vu.nl/~borys/RDFT/0.27/RDFT.rdfs
• Evolution Ontology, http://kaon.semanticweb.org/examples/Evolution.rdfs
Ontologies in a wider sense:
• Agrovoc, http://www.fao.org/agrovoc/
• Art and Architecture, http://www.getty.edu/research/tools/vocabulary/aat/
• UNSPSC, http://eccma.org/unspsc/
• DTD standardizations, e.g. HR-XML, http://www.hr-xml.org/
A. Hotho: Text Clustering with Background Knowledge 28
Wordnet
• WordNet contains 207,016 word-sense pairs and 117,597 synsets
• WordNet categorizes words into syntactic categories: nouns (N), verbs (V), adjectives (Adj) and adverbs (Adv)
• WordNet additionally contains lexical-semantic relations between word meanings
[ http://wordnet.princeton.edu/]
Statistics under: http://wordnet.princeton.edu/man/wnstats.7WN#sect2
A. Hotho: Text Clustering with Background Knowledge 29
Wordnet II
| Lexical-semantic relation | Syntactic categories | Examples |
|---|---|---|
| Synonymy | N, V, Adj, Adv | jolly, merry |
| Antonymy | Adj, Adv, (N, V) | fast/slow; friendly/unfriendly |
| Hyperonymy | N | animal/living being; mammal/animal; dog/mammal |
| Meronymy | N | flour/cake; tyre/car |
A. Hotho: Text Clustering with Background Knowledge 30
Wordnet III
• Lexical-semantic relations in WordNet largely correspond to their counterparts in frame-oriented representation formalisms:
  hyperonym / hyponym is analogous to the is-a relation
  meronym / holonym corresponds to has-part / part-of relations
• WordNet thus allows a smooth transition between linguistic information and conceptual structures
A. Hotho: Text Clustering with Background Knowledge 31
UMLS (I)
• Provided by the National Library of Medicine (NLM): a database of medical terminology.
• Unifies terms from several medical databases (MEDLINE, SNOMED International, Read Codes, etc.) such that different terms are identified as the same medical concept.
• Applications: primarily browsing/searching in document collections, e.g.
  PubMed: access to documents (e.g. MEDLINE)
  CliniWeb International: clinical information in the WWW
  [ http://www.nlm.nih.gov/research/umls/umlsapps.html ]
[ http://www.nlm.nih.gov/research/umls/ ]
A. Hotho: Text Clustering with Background Knowledge 32
UMLS (II)
UMLS Knowledge Sources:
Metathesaurus provides the concordance of medical concepts:
• 730,000 concepts
• 1.5 million concept names in different source vocabularies
SPECIALIST Lexicon provides word synonyms, derivations, lexical variants, and grammatical forms of words used in Metathesaurus terms:
• 130,000 entries
Semantic Network codifies the relationships (e.g. causality, "is a", etc.) among medical terms:
• 134 semantic types, 54 relationships
A. Hotho: Text Clustering with Background Knowledge 33
The semantic web and machine learning
What can machine learning do for the Semantic Web?
1. Learning ontologies (even if not fully automatic)
2. Learning to map between ontologies
3. Duplicate recognition
4. Deep annotation: reconciling databases and ontologies
5. Annotation by information extraction

What can the Semantic Web do for machine learning?
1. Lots and lots of SW tools to describe and exchange data for later use by machine learning methods in a canonical way (preprocessing!)
2. Using ontological structures to improve the machine learning task
3. Providing background knowledge to guide machine learning
A. Hotho: Text Clustering with Background Knowledge 34
Foundations of the Semantic Web: References
• Semantic Web Activity at W3C, http://www.w3.org/2001/sw/
• www.semanticweb.org (currently being relaunched)
• Journal of Web Semantics
• D. Fensel et al.: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press, 2003
• G. Antoniou, F. van Harmelen: A Semantic Web Primer, MIT Press, 2004
• S. Staab, R. Studer (eds.): Handbook on Ontologies, Springer Verlag, 2004
• S. Handschuh, S. Staab (eds.): Annotation for the Semantic Web, IOS Press, 2003
• International Semantic Web Conference series, yearly since 2002, LNCS
• World Wide Web Conference series, ACM Press, first Semantic Web papers since 1999
• Y. Sure, P. Hitzler, A. Eberhart, R. Studer: The Semantic Web in One Day, IEEE Intelligent Systems, http://www.aifb.uni-karlsruhe.de/WBS/phi/pub/sw_inoneday.pdf
• Some slides have been stolen from various places, from Jim Hendler and Frank van Harmelen in particular.
Dr. Andreas Hotho
Semantic Web Mining
A. Hotho: Text Clustering with Background Knowledge 36
Where to start?
Web Mining areas:
• Web content mining
• Web structure mining
• Web usage mining
A. Hotho: Text Clustering with Background Knowledge 37
• Web Mining can help
  • to learn structures for knowledge organization (e.g. ontologies): Ontology Learning
  • and to populate them: Instance Learning
Extracting Semantics from the Web
A. Hotho: Text Clustering with Background Knowledge 38
Ontology Learning
• Typically, a domain-specific document corpus contains much information about a specific domain.
• One possible approach is to take this given corpus and extract linguistic and ontological resources from it.
Here the concentration is on Web content: ontology learning sits at the intersection of Knowledge Discovery and Ontology Engineering.
A. Hotho: Text Clustering with Background Knowledge 39
Ontology Learning Steps
1. Concept extraction:
   • Multi-word term extraction
   • Word meaning recognition
2. Concept relation extraction:
   • Taxonomy learning
   • Non-taxonomic relation extraction
   • Labeling of non-taxonomic relations
Besides these two steps, ontology reuse via pruning is applicable.
A. Hotho: Text Clustering with Background Knowledge 40
Example: Ontology Learning from the Web [Mädche, Staab: ECAI 2000]

An is-a hierarchy (root: furnishing, accommodation, event, area, …; accommodation: hotel, wellness hotel, youth hostel; area: city, region) yields derived concept pairs such as (wellness hotel, area), (hotel, area), (accommodation, area). Association rule mining over such pairs produces the generalized conceptual relation hasLocation(accommodation, area).
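The generalization-plus-counting step above can be sketched in a few lines. This is an illustrative sketch, not the Mädche/Staab implementation: the taxonomy and concept names follow the example, while the counting scheme (document-level co-occurrence, generalized along the hierarchy) is an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical taxonomy (child -> parent), following the example's concepts
parent = {"wellness hotel": "hotel", "hotel": "accommodation",
          "youth hostel": "accommodation", "city": "area", "region": "area"}

def ancestors(concept):
    """The concept itself plus all its ancestors in the is-a hierarchy."""
    out = [concept]
    while concept in parent:
        concept = parent[concept]
        out.append(concept)
    return out

def pair_support(docs):
    """Support of (generalized) concept pairs: in how many documents they co-occur."""
    counts = Counter()
    for concepts in docs:
        pairs = set()
        for a, b in combinations(sorted(concepts), 2):
            # generalize both concepts along the taxonomy before counting
            for ga in ancestors(a):
                for gb in ancestors(b):
                    pairs.add((ga, gb))
        counts.update(pairs)   # each document contributes at most once per pair
    return counts

docs = [{"wellness hotel", "area"}, {"hotel", "city"}, {"youth hostel", "region"}]
support = pair_support(docs)
```

Although each document mentions a different specific pair, the generalized pair (accommodation, area) is supported by all three documents, which is what makes the relation hasLocation(accommodation, area) surface.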
A. Hotho: Text Clustering with Background Knowledge 41
• Web Mining can help
  • to learn structures for knowledge organization (e.g. ontologies): Ontology Learning
  • and to populate them: Instance Learning
Extracting Semantics from the Web
A. Hotho: Text Clustering with Background Knowledge 42
Knowledge base
Hotel: Wellnesshotel
GolfCourse: Seaview
belongsTo(Seaview, Wellnesshotel)
...
Information Extraction, e.g. [Craven et al., AI Journal 2000]
Ontology: concepts Organization, Hotel, GolfCourse; attribute name; relations cooperatesWith, belongsTo; plus a rule:
FORALL X, Y  Y:Hotel[cooperatesWith ->> X] <- X:ProjectHotel[cooperatesWith ->> Y].
Example: Instance Learning from the Web
A. Hotho: Text Clustering with Background Knowledge 43
Example
Information highlighting for supporting annotation, based on IE techniques.
A. Hotho: Text Clustering with Background Knowledge 44
Crawling:
• load a document
• extract links
• load the next document
Focused Crawling:
• intelligent, focused decision on the next step
Crawling the (semantic) web for filling the ontologyExample
[Ehrig et al, 2002]
A. Hotho: Text Clustering with Background Knowledge 45
Knowledge base
Hotel: Wellnesshotel
GolfCourse: Seaview
belongsTo(Seaview, Wellnesshotel)
...
ILP-based Association Rule Mining, e.g. [Dehaspe, Toivonen, J. DMKD 1998]
Hotel(x), GolfCourse(y), belongsTo(y,x) → hasStars(x,5)
(support = 0.4 %, confidence = 89 %)
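Support and confidence of such a rule can be computed as follows. A toy sketch: the fact base, the predicate encodings, and the support convention (fraction of all hotels covered by the whole rule) are assumptions for illustration, not the slides' data.

```python
# Toy fact base; constants h1..h4, g1, g2 are invented for illustration
hotels = {"h1", "h2", "h3", "h4"}
golf_courses = {"g1", "g2"}
belongs_to = {("g1", "h1"), ("g2", "h2")}       # belongsTo(GolfCourse, Hotel)
has_stars = {"h1": 5, "h2": 5, "h3": 3, "h4": 4}

# Rule body: Hotel(x), GolfCourse(y), belongsTo(y, x)
body = [(y, x) for (y, x) in belongs_to if x in hotels and y in golf_courses]
# Rule head additionally requires hasStars(x, 5)
head = [(y, x) for (y, x) in body if has_stars.get(x) == 5]

support = len(head) / len(hotels)      # fraction of hotels covered by the whole rule
confidence = len(head) / len(body)     # fraction of body matches where the head holds
```

Here both golf-course hotels are 5-star hotels, so the confidence is 1.0 while the support stays at 0.5 because only two of the four hotels match the rule at all.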
Ontology: concepts Organization, Hotel, GolfCourse; attribute name; relations cooperatesWith, belongsTo; plus a rule:
FORALL X, Y  Y:Hotel[cooperatesWith ->> X] <- X:ProjectHotel[cooperatesWith ->> Y].
Example: Mining the Semantic Web
A. Hotho: Text Clustering with Background Knowledge 46
Semantic Web Usage Mining
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478
Search by location → search by location and price → refine search → choose item → look at individual hotel.
From logfile analysis ...
... to semantic logfile analysis:
Basic idea: associate each requested page with one or more ontological entities, to better understand the process of navigation
[Berendt & Spiliopoulou 2000; Berendt 2002; Oberle 2003]
Use the gained knowledge to
• understand search strategies
• improve navigation design
• personalization
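The basic idea of associating a requested page with ontological entities can be sketched directly on the log lines above. The parameter-to-concept mapping is invented for illustration; a real system would look the mapping up in the site's ontology.

```python
import re
from urllib.parse import parse_qs, urlparse

LOG_RE = re.compile(r'"GET (?P<url>\S+) HTTP')

def concepts_of(log_line):
    """Associate a requested page with ontological entities via its query string."""
    url = LOG_RE.search(log_line).group("url")
    params = parse_qs(urlparse(url).query)
    concepts = set()
    # hypothetical mapping from request parameters to ontology concepts
    if "l" in params:
        concepts.add("SearchByLocation")
    if "p" in params:
        concepts.add("SearchByPrice")
    if "Item" in params:
        concepts.add("ChooseItem")
    return concepts

line = ('p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] '
        '"GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450')
concepts = concepts_of(line)
```

Sequences of such concept sets, rather than raw URLs, are then what the usage-mining step analyses.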
Example
A. Hotho: Text Clustering with Background Knowledge 47
Text Document Clustering of Crawled Documents
Pipeline: WWW → Focused Crawling → Clustering → Explanation.

Example
Dr. Andreas Hotho
Preprocessing steps for Text Mining
Slides partially from:
• AIFB KDD course
• Raymond J. Mooney (http://www.cs.utexas.edu/users/mooney/ir-course/)
A. Hotho: Text Clustering with Background Knowledge 49
Preprocessing of Text documents
Documents are represented as a document × term matrix (rows: documents 1…n, columns: keys 1…m, entries x_ij).
Kind of Features to extract
• Terms
• Words
• Phrases
• Concepts
• Metadata
• Shallow parsing
• Deep parsing
• …
A. Hotho: Text Clustering with Background Knowledge 50
Which kind of features to extract?
Metadata
• e.g., author, date, document type, language, copyright status;
• according to a metadata schema that specifies attributes, e.g.:
  Dublin Core Metadata Initiative (http://dublincore.org/),
  BibTeX schema for bibliographic metadata;
• typically by explicit markup from a human indexer, but possibly also automatically by means of information extraction (next lecture).

Controlled vocabulary of index terms
• fixed set of index terms that describe the content of documents;
• often in hierarchical form (taxonomies);
• the indexing vocabulary / taxonomy is centrally designed and maintained by some authority, e.g.:
  Inspec Topic Classification (http://www.iee.org/Publish/Inspec/),
  MeSH – Medical Subject Headings (http://www.nlm.nih.gov/mesh/),
  IPC – International Patent Classification (http://www.wipo.int/classifications/ipc/en/),
  many many more…
A. Hotho: Text Clustering with Background Knowledge 51
Example: MeSH Classification
PMID- 7810287
OWN - NLM
STAT- MEDLINE
DA  - 19950202
DCOM- 19950202
LR  - 20041117
PUBM- Print
IS  - 0094-6354 (Print)
VI  - 62
IP  - 4
DP  - 1994 Aug
TI  - Cockayne syndrome: a case report.
PG  - 346-8
AB  - A 4-year-old female with Cockayne syndrome presented for cataract extraction under general anesthesia. […]
FAU - O'Brien, F C
AU  - O'Brien FC
FAU - Ginsberg, B
AU  - Ginsberg B
LA  - eng
PT  - Case Reports
PT  - Journal Article
PL  - UNITED STATES
TA  - AANA J
JT  - AANA journal.
JID - 0431420
SB  - N
MH  - Anesthesia, General/*methods/nursing
MH  - Cataract Extraction
MH  - Child, Preschool
MH  - Cockayne Syndrome/complications/*surgery
MH  - Female
MH  - Humans
EDAT- 1994/08/01
MHDA- 1994/08/01 00:01
PST - ppublish
SO  - AANA J. 1994 Aug;62(4):346-8.
A. Hotho: Text Clustering with Background Knowledge 52
Which kind of features to extract? (cont.)
Alternative: dynamic vocabulary
• social tagging systems ("folksonomies"), e.g.:
  del.icio.us for bookmarks (http://del.icio.us),
  flickr for photographs (http://www.flickr.com);
• tags correspond to index terms that are freely chosen and assigned to documents by users without centralized management.

Derived features: full-text indexing
• also known as the bag-of-words model;
• general assumption: every word or expression in the text document can be a valid key;
• index terms are automatically extracted from the document collection;
• the dictionary of index terms is continually increasing;
• many design decisions in choosing appropriate terms (next section).
A. Hotho: Text Clustering with Background Knowledge 53
Example: social tagging
A. Hotho: Text Clustering with Background Knowledge 54
Example: BibSonomy contains also publication meta data
A. Hotho: Text Clustering with Background Knowledge 55
Tokenization → Stopword Removal → Stemming
Document Representation: Full-Text Indexing
Typical stems produced by the full-text indexing pipeline, e.g.: treat, infection, blood, medic, potent, transmiss, …
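The three pipeline stages can be sketched as follows. The stopword list and the suffix rules are toy stand-ins; a real system would use something like the SMART stopword list and a Porter stemmer.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "for", "in", "is"}  # tiny illustrative list

def stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return [stem(t) for t in tokens]                      # stemming

terms = index_terms("The treatment of blood infections")
```

Each stage is a pure function over the token list, so stages can be swapped out (e.g. replacing `stem` with lemmatization) without touching the rest of the pipeline.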
A. Hotho: Text Clustering with Background Knowledge 56
Full Text Representation
Tokenization
• goal: segment the input character sequence into "useful" tokens (e.g., individual terms);
• design decisions and problems:
  set of word delimiters to use (e.g., whitespace, punctuation marks),
  handling of special and numerical characters,
  handling of capitalization (typically conversion to lower case),
  handling of punctuation marks (sentence delimiter or abbreviation?),
  different languages have different rules for compound words (e.g., "color screen" vs. "Farbbildschirm")

Stemming or Lemmatization
• morphological normalization of inflected word forms to a base form (e.g., "houses" → "house", "goes" → "go");
• Stemming: simple approach based on a few structural rules,
  e.g., the Porter stemming algorithm for English;
• Lemmatization: retrieval of the base form, typically based on a dictionary;
  can handle exceptional cases (e.g., "mice" → "mouse")
A. Hotho: Text Clustering with Background Knowledge 57
Full Text Representation
Stopword removal
• removal of very frequent and uninformative words,
• typically function words such as "the", "a", "an", "of", "for",
• e.g., the SMART stopword list for English defines 571 stopwords (ftp://ftp.cs.cornell.edu/pub/smart/english.stop)
A. Hotho: Text Clustering with Background Knowledge 58
Property: Word Frequency
• A few words are very common.2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
• Most words are very rare.Half the words in a corpus appear only once, called hapax legomena (Greek for “read only once”)
• Called a “heavy tailed” distribution, since most of the probability mass is in the “tail”
A. Hotho: Text Clustering with Background Knowledge 59
Sample Word Frequency Data
(from B. Croft, UMass)
A. Hotho: Text Clustering with Background Knowledge 60
Zipf’s Law
• Rank (r): The numerical position of a word in a list sorted by decreasing frequency (f ).
• Zipf (1949) “discovered” that:
• If probability of word of rank r is pr and N is the total number of word occurrences:
f ∝ 1/r, i.e. f · r = k (for some constant k)

With p_r = f / N, this becomes p_r = A / r, where A ≈ 0.1 is approximately constant, independent of the corpus.
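The rank-frequency computation looks like this. On the toy text below the product r · f is only roughly constant; real corpora follow the law much more closely.

```python
from collections import Counter

# Toy text; real corpora obey Zipf's law far better than this
words = ("the quick fox and the lazy dog and the fox " * 50).split()
freq = Counter(words)
ranked = [f for _, f in freq.most_common()]            # frequencies at rank 1, 2, ...

# Zipf's law predicts rank * frequency to be roughly constant
products = [rank * f for rank, f in enumerate(ranked, start=1)]
```

Plotting `ranked` on log-log axes against rank is the usual way to eyeball how well a corpus fits the law.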
A. Hotho: Text Clustering with Background Knowledge 61
Zipf and Term Weighting
Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for indexing.
A. Hotho: Text Clustering with Background Knowledge 62
Pruning based on Zipf Law
• Stopword removal drops the extremely common words
• dropping words with fewer than a given number of occurrences (e.g. 30) removes the extremely uncommon words
• a similar idea is behind the TFIDF weighting
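Both pruning directions can be expressed as document-frequency thresholds. A minimal sketch; the collection and the thresholds are invented for illustration.

```python
# Toy document collection; prune terms by document frequency, dropping
# both extremely rare and extremely common terms
docs = [{"the", "hotel", "sea"}, {"the", "hotel", "golf"},
        {"the", "spa"}, {"the", "hotel", "sea", "rarity"}]

df = {}
for d in docs:
    for t in d:
        df[t] = df.get(t, 0) + 1

min_df, max_df = 2, 3          # illustrative thresholds
vocab = {t for t, n in df.items() if min_df <= n <= max_df}
```

Here "the" (in every document) and the hapax terms "golf", "spa", "rarity" are pruned, leaving the mid-frequency terms that Luhn's observation marks as most useful for indexing.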
A. Hotho: Text Clustering with Background Knowledge 63
Further possible (non-standard) steps
• separate indexing of phrases and compound words (e.g., "machine learning" ≠ "machine", "learning"),
  based on background dictionaries or statistical detection of frequent phrases (machine learning again ;-));
• alternative: additional indexing of all adjacent words up to a certain window length (bigrams, trigrams, n-grams);
• expansion with synonymous terms based on thesauri;
• separate consideration of different parts of speech (e.g., "walk" as verb or "walk" as noun);
• many more …
A. Hotho: Text Clustering with Background Knowledge 64
Levels of Linguistic Analysis: the 'Human Language Technologies Layer Cake'
• Tokenization (incl. named-entity recognition): [table] [2005-06-01] [John Smith]
• Morphological analysis: [table:N:ART] [Sommer~schule:N] [work~ing:V]
• Part-of-speech & semantic tagging: [table:N:ARTIFACT] [table:N:furniture_01]
• Phrase recognition / chunking: [[the] [large] [table] NP] [[in] [the] [corner] PP]
• Dependency structure (phrases): [[the:SPEC] [large:MOD] [table:HEAD] NP]
• Dependency structure (sentence): [[He:SUBJ] [booked:PRED] [[this] [table:HEAD] NP:DOBJ] S]
• Discourse analysis: [[He:SUBJ] [booked:PRED] [[this] [table:HEAD] NP:DOBJ:X1] …] … [[It:SUBJ:X1] [was:PRED] still available …]
© Paul Buitelaar, DFKI
A. Hotho: Text Clustering with Background Knowledge 65
Full Text Representation and Sparseness
Full-text indexing typically results in a very sparse matrix; usually fewer than 1% of the matrix cells are non-zero!

Sparseness requires special attention with respect to storage and computation:
• store only the non-zero elements with their respective indices and assume the rest of the matrix to be zero;
• tune computations to this data structure.
Frequent terms are likely to be indexed already at the beginning.

Figure: sparseness structure of the document × term matrix for the training-document part of the Reuters-21578 corpus (9603 × 17525); the non-sparse fraction is 0.25%!
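Storing only the non-zero entries can be sketched with a plain dictionary per document row; production systems use compressed formats such as CSR, but the idea is the same.

```python
# Store only non-zero entries of a document-term matrix as {column: weight} per row
dense = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]

sparse = [{j: v for j, v in enumerate(row) if v != 0.0} for row in dense]

def dot(sparse_row, vector):
    """Inner product that touches only the stored non-zeros."""
    return sum(v * vector[j] for j, v in sparse_row.items())

score = dot(sparse[1], [1.0, 1.0, 1.0, 1.0])
```

With less than 1% non-zeros, both memory and the cost of an inner product shrink by roughly two orders of magnitude compared to the dense representation.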
A. Hotho: Text Clustering with Background Knowledge 66
Comparison of Representation Approaches
Two axes: automation of indexing (from only human annotation to fully automatic) and dynamics of the feature set (from a fixed set of features to a highly dynamic feature set). Along these axes: traditional library classification and news agency systems (e.g. Reuters) rely on human annotation over a fixed feature set, full-text indexing is fully automatic, and social tagging (e.g. del.icio.us) has a highly dynamic feature set.
A. Hotho: Text Clustering with Background Knowledge 67
Retrieval Models: Vector Space [Salton 60s]
Vector Space Model (best-match model + ranking)
• typically full-text indexing of documents;
• documents are regarded as vectors;
• the vector space dimensions are defined by the different index terms;
• the query is also treated as a vector in the same space;
• documents are ranked based on geometric similarity with the query;
• very successful paradigm with many connections to the machine learning view.

Issues of the vector space model:
• choice of appropriate term weighting (typically TFIDF);
• choice of geometric similarity measure (typically cosine).
A. Hotho: Text Clustering with Background Knowledge 68
Term Weighting - Alternatives
• boolean weighting (simplest case) with entries 0 and 1
• absolute frequency tf_ji of term i in document j
• relative frequency rf_ji of term i in document j
• most popular choice: term frequency inverse document frequency (TFIDF) weighting:

  tfidf(w) = tf(w) · log(N / df(w))

  tf(w): term frequency (number of word occurrences in a document)
  df(w): document frequency (number of documents containing the word)
  N: number of all documents
  tfidf(w): relative importance of the word in the document

The word is more important if it appears in fewer documents.
The word is more important if it appears several times in a target document.
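The TFIDF weighting, directly as code; the example frequencies are invented for illustration.

```python
import math

def tfidf(tf, df, n_docs):
    """tfidf(w) = tf(w) * log(N / df(w))."""
    return tf * math.log(n_docs / df)

weight_rare = tfidf(3, 10, 1000)        # appears 3x in the doc, in 10 of 1000 docs
weight_common = tfidf(3, 1000, 1000)    # appears in every document -> weight 0
```

The `log` term is what suppresses words that occur in almost every document: for df(w) = N the weight vanishes regardless of the term frequency.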
A. Hotho: Text Clustering with Background Knowledge 69
Cosine Measure

Typically used as similarity measure: document vectors are ranked according to their cosine score with the query. It corresponds to the angle between two vectors, i.e. the normalized inner product of the input vectors.

Note: the direction distinguishes document vectors, not their length! All vector entries are positive, so the cosine varies between 0 (orthogonal vectors) and 1 (same direction).
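A minimal sketch of the normalized inner product (plain Python, with made-up vectors):

```python
import math

def cosine(u, v):
    """Cosine similarity: the normalized inner product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

doc = [2.0, 1.0, 0.0]
query = [1.0, 0.0, 0.0]
print(cosine(doc, query))
# only the direction matters: scaling a vector leaves the score unchanged
print(cosine([4.0, 2.0, 0.0], query))
```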
A. Hotho: Text Clustering with Background Knowledge 70
Cosine Measure (Illustration: scaling onto the unit hypersphere)

[Figure: doc1, doc2 and the query are scaled to unit length (doc1', doc2', query'); the angle α between query' and a document vector determines the similarity score.]
A. Hotho: Text Clustering with Background Knowledge 71
Evaluation Measures
• Need to analyse results and to evaluate systems.
• Important considerations:
  - Precision: how well did the presented result set match the information need of the user?
  - Recall: how much of the relevant information available was presented in the result set?
• There is a well-known set of information retrieval measures which evaluate information retrieval engines with respect to the subjective (!) perception of relevancy of a test user.
A. Hotho: Text Clustering with Background Knowledge 72
Evaluation Measures: Notation

Two partitions of a set of documents:
- according to perceived relevancy to the user (human judgement)
- according to the result of the retrieval engine

                                 human judgement
                                 positive                 negative
                                 (doc's judged relevant)  (doc's judged non-relevant)
retrieval result
positive (doc's returned)        true positive (TP)       false positive (FP)
negative (doc's not returned)    false negative (FN)      true negative (TN)
A. Hotho: Text Clustering with Background Knowledge 73
Evaluation Measures: Information Retrieval (and Text Classification)

Accuracy = (TP + TN) / (TP + FP + FN + TN): measures the overall error
Precision = TP / (TP + FP): fraction of relevant documents in the result
Recall = TP / (TP + FN): fraction of returned documents wrt. all relevant documents
F1 = 2 · Precision · Recall / (Precision + Recall): harmonic mean of precision and recall
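These measures computed from a confusion matrix (plain Python; the counts are invented for illustration):

```python
def ir_measures(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from a retrieval confusion matrix."""
    precision = tp / (tp + fp)                          # relevant fraction of result
    recall = tp / (tp + fn)                             # found fraction of relevant docs
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)          # overall correctness
    return precision, recall, f1, accuracy

# invented counts: 10 docs returned (8 relevant), 16 relevant docs in total
p, r, f1, acc = ir_measures(tp=8, fp=2, fn=8, tn=82)
print(p, r, f1, acc)   # 0.8 0.5 0.615... 0.9
```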
A. Hotho: Text Clustering with Background Knowledge 74
Evaluation Measures (cont.): Considering Ranked Retrieval
Precision and recall are well-defined only for exact match (i.e. unranked) retrieval.

Approach for ranked retrieval: rank all test documents, calculate precision and recall at fixed cutoff points (e.g., at position k = 5); different results will be achieved for varying k.

Typical observation for k → n: precision decreases and recall increases in the long run (think about why!).
The break-even point measure is defined as the value of precision and recall where they become equal.

[Figure: a ranked result list (positions 1…n, each marked "+" relevant or "−" not relevant) and the plot connecting precision and recall for different cutoffs k; the break-even point lies where precision equals recall.]
A. Hotho: Text Clustering with Background Knowledge 75
Evaluation of Text Document Clustering

Goal: clusters should be as similar as possible to the given classes.

Compare a clustering P* of the document set D with the given classes L*:

    Precision(P, L) := |P ∩ L| / |P|

    Purity(P*, L*) := Σ_{P ∈ P*} (|P| / |D|) · max_{L ∈ L*} Precision(P, L)

    InvPurity(P*, L*) := Σ_{L ∈ L*} (|L| / |D|) · max_{P ∈ P*} Precision(L, P)

[Example: a clustering P* with 60 clusters is compared against 46 given classes L*.]
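The purity formula above as a short sketch (plain Python; the six-document clustering is invented):

```python
from collections import Counter

def purity(clusters, labels):
    """Purity: each cluster is matched to its majority class.

    clusters: cluster id per document; labels: given class per document.
    """
    by_cluster = {}
    for c, l in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(l)
    # sum over clusters of |P| * max Precision(P, L), divided by |D|
    majority = sum(Counter(ls).most_common(1)[0][1]
                   for ls in by_cluster.values())
    return majority / len(labels)

# hypothetical clustering of six documents into two clusters
clusters = [0, 0, 0, 1, 1, 1]
labels = ["earn", "earn", "crude", "crude", "crude", "earn"]
print(purity(clusters, labels))   # (2 + 2) / 6 = 0.666...
```

Inverse purity is computed the same way with the roles of clusters and classes swapped.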
A. Hotho: Text Clustering with Background Knowledge 76
Different Text Clustering and Classification Datasets
Reuters-21578
- documents about finance from 1987
- 9603 training documents and 3299 test documents (ModApte split)
- binary classification on the top 50 classes

Reuters RCV1
- documents about finance from 1996/1997
- 806791 documents categorized with respect to three controlled vocabularies:
  - 4 major topic categories
  - 10 major industry codes
  - region codes without a hierarchy
David D. Lewis, Yiming Yang, Tony G. Rose, Fan Li. RCV1: A New Benchmark Collection for Text Categorization Research, 2004
A. Hotho: Text Clustering with Background Knowledge 77
Different Text Clustering and Classification Datasets
20 Newsgroups
- newsgroup documents on different topics like sport, cs, …
- 20000 documents
- 20 classes, every class contains 1000 documents

OHSUMED Corpus
- OHSUMED (TREC-9), titles and abstracts from medical journals, 1987
- 36369 training documents and 18341 test documents
- binary classification on the top 50 classes (MeSH classifications)

FAODOC Corpus
- documents about agricultural information
- 1501 docs within 21 categories
Dr. Andreas Hotho
Ontology Learning
Thanks to Philipp Cimiano for the slides
A. Hotho: Text Clustering with Background Knowledge 79
Motivation for Ontology Learning
• High cost of modelling ontologies.
• Typically, ontologies are domain dependent.
• Idea: learn from existing domain data?
• Which data?
  - legacy data (XML or DB schemas) => lifting
  - texts?
  - images?
• In this lecture we will discuss some ideas of ontology learning from text data using knowledge discovery techniques.
A. Hotho: Text Clustering with Background Knowledge 80
Learning ontologies from texts
Problems: bridge the gap between the symbol level and the concept/ontology level.
Knowledge is rarely mentioned explicitly in texts.

[Figure: authors write texts based on a shared world model; ontology learning is the reverse engineering step from the texts back to that model.]
A. Hotho: Text Clustering with Background Knowledge 81
Some Current Work on OL from Text
Terms, Synonyms & Classes
- statistical analysis
- patterns
- (shallow) linguistic parsing
- term disambiguation & compositional interpretation

Taxonomies
- statistical analysis & clustering (e.g. FCA)
- patterns
- (shallow) linguistic parsing
- WordNet

Relations
- anonymous relations (e.g. with association rules)
- named relations (linguistic parsing)
- (linguistic) compound analysis
- web mining, social network analysis

Definitions
- (linguistic) compound analysis (incl. WordNet)
Overview of Current Work: Paul Buitelaar, Philipp Cimiano, Bernardo Magnini Ontology Learning from Text: Methods, Evaluation and Applications Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.
A. Hotho: Text Clustering with Background Knowledge 82
The Ontology Learning Layer Cake

Terms:              country, nation, river, city, capital, …
Synonyms:           {country, nation}
Concepts:           c := ⟨i(c), [c], Ref_C(c)⟩   (e.g. country)
Concept Hierarchy:  capital ≤_C city,  city ≤_C inhabited geo. area
Relations:          flow_through(dom: river, range: geo. area)
Relation Hierarchy: capital_of ≤_R located_in
Axioms/Rules:       ∀x (country(x) → ∃y (capital(y) ∧ has_capital(x,y) ∧ located_in(y,x) ∧ ∀z (capital(z) ∧ has_capital(x,z) → y = z)))
                    disjoint(river, mountain)
A. Hotho: Text Clustering with Background Knowledge 83
Tools - Axioms
[Table: ontology learning systems and the layers of the layer cake they cover (terms, synonyms, concept formation, concept hierarchy, relations, relation hierarchy, axiom schemata, general axioms). Listed systems: Text2Onto (AIFB, Univ. Karlsruhe), AEON, HASTI (Amir Kabir Univ. Tehran), OntoBasis (CNTS, Univ. Antwerpen), ASIUM / Mo'K (Univ. de Paris-Sud), OntoLearn (Univ. di Roma), ATRACT (Univ. of Salford), Parmenides (Univ. Zürich), CBC (ISI, USC), DIRT, DODDLE (Keio Univ.), PMI-IR (NRC-CNRC), TextToOnto, OntoLT / RelExt (DFKI), and a system of the Economic Univ. Prague. Several systems form concepts as clusters; TextToOnto additionally provides labels.]
A. Hotho: Text Clustering with Background Knowledge 84
Evaluation of Ontology Learning
The a priori approach is based on a gold standard ontology:
- given an ontology modeled by an expert -> the so-called gold standard
- compare the learned ontology with the gold standard

Which methods exist: pattern-based
- learning accuracy / precision / recall / F-measure

Which methods exist: clustering-based
- problem: labels for clusters are either unknown or difficult to find

Basic idea for both:
- count edges in the "ontology graph"
- counting of direct relations only (Reinberger et al. 2005)
- least common superconcept
- semantic cotopy
- …

Evaluation via application (cf. the section on using ontologies)
A. Hotho: Text Clustering with Background Knowledge 85
Evaluation of Ontology Learning
The a posteriori approach:
- ask a domain expert for a per-concept evaluation of the learned ontology
- count three categories of concepts:
  - correct: in both the learned and the gold ontology
  - new: only in the learned ontology, but relevant and should be in the gold standard as well
  - spurious: useless
- compute precision = (correct + new) / (correct + new + spurious)

As a result: a posteriori evaluations are costly – BUT a posteriori evaluation by domain experts still shows very good results and is very helpful for the domain expert!
Sabou M., Wroe C., Goble C. and Mishne G., Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics, In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.
A. Hotho: Text Clustering with Background Knowledge 86
Some Knowledge Discovery Techniques for Ontology Learning

[Figure: the layer cake layers Terms, Synonyms, Concepts, Concept Hierarchy, Relations, Relation Hierarchy, Axioms, Rules – with the focus of today's lecture highlighted.]
A. Hotho: Text Clustering with Background Knowledge 87
How do people acquire taxonomic knowledge?
I have no idea!
But people apply taxonomic reasoning!
"Never do harm to any animal!" => "Don't do harm to the cat!"

More difficult questions:
- representation
- reasoning patterns

But let's speculate a bit! ;-)
A. Hotho: Text Clustering with Background Knowledge 88
How do people acquire taxonomic knowledge?
What is liver cirrhosis?
Mr. Smith died from liver cirrhosis.Mr. Jagger suffers from liver cirrhosis.Alcohol abuse can lead to liver cirrhosis.
=> prob(isa(liver cirrhosis, disease))
A. Hotho: Text Clustering with Background Knowledge 89
How do people acquire taxonomic knowledge?
What is liver cirrhosis?
Diseases such as liver cirrhosis are difficult to cure. (New York Times)
A. Hotho: Text Clustering with Background Knowledge 90
How do people acquire taxonomic knowledge?
What is liver cirrhosis?
Cirrhosis: noun [uncountable]
serious disease of the liver, often caused by drinking too much alcohol

    liver cirrhosis ≈ cirrhosis ∧ isa(cirrhosis, disease)
    → prob(isa(liver cirrhosis, disease))

Pattern based
A. Hotho: Text Clustering with Background Knowledge 91
How do people acquire taxonomic knowledge?
Clustering based

• ……….
• The old lady loves her dog.
• The old lady loves her cat.
• The old lady loves her husband.
• ……….

[Figure: "dog", "cat" and "husband" are clustered together as things the lady loves.]
A. Hotho: Text Clustering with Background Knowledge 92
Context Extraction
extract syntactic dependencies from text⇒ verb/object, verb/subject, verb/PP relations⇒ car: drive_obj, crash_subj, sit_in, …
LoPar, a trainable statistical left-corner parser:

[Pipeline: Parser → tgrep → Lemmatizer → Smoothing → Weighting → FCA → Lattice → Compaction → Pruning]
A. Hotho: Text Clustering with Background Knowledge 93
Ontology Learning as Term Clustering

• Distributional hypothesis: "Words are [semantically] similar to the extent to which they appear in similar [syntactic] contexts." [Harris 1985]
• Linguistic context can be represented in vector form.
• This allows measuring the similarity wrt. some similarity measure (e.g. the cosine measure).
• Hierarchical clustering approaches can be used to create taxonomic structures.

        drive_obj  crash_into  ride_obj  sit_in
car         5          2          0        4
bike        3          1          3        0
A. Hotho: Text Clustering with Background Knowledge 94
Extracting attributes using techniques from NLP
The museum houses an impressive collection of medieval and modern art. The building combines geometric abstraction with classical references that allude to the Roman influence on the region.
house_subj(museum)
house_obj(collection)
combine_subj(museum)
combine_obj(abstraction)
combine_with(reference)
allude_to(influence)
[Parse tree: S → NP ("The museum") VP; VP → V ("houses") NP; NP → NP ("an impressive collection") PP ("of modern art")]
A. Hotho: Text Clustering with Background Knowledge 95
Extraction Process for Linguistic Contexts
Preprocessing:
- part-of-speech tagging
- lemmatizing
- matching regular expressions over POS tags

Extract shallow syntactic dependencies from text:
- adjective modifiers: "a nice city" -> nice(city)
- prepositional phrase modifiers: "a city near the river" -> near_river(city) and city_near(river)
- possessive modifiers: "the city's center" -> has_center(city)
- noun phrases in subject or object position: "the city offers an exciting nightlife" -> offer_subj(city) and offer_obj(nightlife)
- prepositional phrases following a verb: "the river flows through the city" -> flow_through(city)
- copula constructs: "a flamingo is a bird" -> is_bird(flamingo)
- verb phrases with the verb to have: "every country has a capital" -> has_capital(country)
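Matching a regular expression over POS tags can be sketched as follows (toy Penn-Treebank-style tags and a single adjective-modifier pattern; real extraction chunks far more carefully):

```python
import re

def adjective_modifiers(tagged):
    """Extract adjective modifiers like nice(city) from POS-tagged tokens.

    tagged: list of (word, tag) pairs with Penn-Treebank-style tags
    (JJ = adjective, NN = noun).
    """
    # encode the tag sequence as a string and match "adjective then noun"
    tags = " ".join(tag for _, tag in tagged)
    features = []
    for m in re.finditer(r"JJ NN", tags):
        i = tags[: m.start()].count(" ")        # token index of the adjective
        adj, noun = tagged[i][0], tagged[i + 1][0]
        features.append(f"{adj}({noun})")
    return features

sentence = [("a", "DT"), ("nice", "JJ"), ("city", "NN")]
print(adjective_modifiers(sentence))   # ['nice(city)']
```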
A. Hotho: Text Clustering with Background Knowledge 96
Example
• People book hotels. The man drove the bike along the beach.

Extracted dependencies:
book_subj(people), book_obj(hotels), drove_subj(man), drove_obj(bike), drove_along(beach)

After lemmatization:
book_subj(people), book_obj(hotel), drive_subj(man), drive_obj(bike), drive_along(beach)
A. Hotho: Text Clustering with Background Knowledge 97
Representation of the context of a word as feature vector
             book_obj/  rent_obj/  drive_obj/  ride_obj/  join_obj/
             bookable   rentable   driveable   rideable   joinable
apartment       X          X
car             X          X          X
motor-bike      X          X          X           X
trip            X                                             X
excursion       X                                             X
A. Hotho: Text Clustering with Background Knowledge 98
Tourism Lattice
A. Hotho: Text Clustering with Background Knowledge 99
Concept Hierarchy
[Figure: concept hierarchy derived from the lattice – bookable on top; below it rentable (apartment, car, bike) and joinable (trip, excursion); below rentable: driveable (car, bike) and rideable (bike).]
A. Hotho: Text Clustering with Background Knowledge 100
Example Clustering (Bi-Section-KMeans)

[Dendrogram: a Bi-Section-KMeans clustering of apartment, car, bike, trip and excursion.]

Issues:
- not easy to understand
- no formal interpretation
A. Hotho: Text Clustering with Background Knowledge 101
Agglomerative/Bottom-Up Clustering
[Dendrogram: agglomerative clustering of car, bus, trip, excursion and apartment.]
A. Hotho: Text Clustering with Background Knowledge 102
Linkage Strategies
Complete linkage: consider the two most dissimilar elements of each of the clusters => O(n² log n)

Average linkage: consider the average similarity of the elements in the clusters => O(n² log n)

Single linkage: consider the two most similar elements of each of the clusters => O(n²)
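All three strategies are available in SciPy's hierarchical clustering; the context vectors below are made-up toy data:

```python
# Agglomerative clustering of term context vectors with different linkage
# strategies (SciPy sketch; the feature counts are invented).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows: car, bike, trip, excursion (hypothetical context-feature counts)
X = np.array([
    [5.0, 2.0, 0.0, 4.0],   # car
    [3.0, 1.0, 3.0, 0.0],   # bike
    [0.0, 0.0, 0.0, 1.0],   # trip
    [0.0, 0.0, 0.0, 2.0],   # excursion
])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                   # (n-1) x 4 merge matrix
    flat = fcluster(Z, t=2, criterion="maxclust")   # cut into two clusters
    print(method, flat)
```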
A. Hotho: Text Clustering with Background Knowledge 103
Data Sets
Tourism (118 mio. tokens):
- http://www.all-in-all.de/english
- http://www.lonelyplanet.com
- British National Corpus (BNC)
- handcrafted tourism ontology (289 concepts)

Finance (185 mio. tokens):
- Reuters news from 1987
- GETESS finance ontology (1178 concepts)
A. Hotho: Text Clustering with Background Knowledge 104
Results Tourism Domain
A. Hotho: Text Clustering with Background Knowledge 105
Results in Finance Domain
A. Hotho: Text Clustering with Background Knowledge 106
Results Tourism Domain
A. Hotho: Text Clustering with Background Knowledge 107
Results in Finance Domain
A. Hotho: Text Clustering with Background Knowledge 108
Summary
                          Effectiveness   Efficiency    Traceability
FCA                       43.81/41.02%    O(2^n)        Good
Agglomerative Clustering  38.57/32.15%    O(n²)         Fair
                          36.55/32.92%    O(n² log n)
                          36.78/33.35%    O(n² log n)
Divisive Clustering       36.42/32.77%    O(n²)         Weak-Fair
A. Hotho: Text Clustering with Background Knowledge 109
TextToOnto & FCA
A. Hotho: Text Clustering with Background Knowledge 110
Text2Onto
Ontology learning framework developed at AIFB.
Algorithms for extracting …
- concepts, instances
- subclass-of / instance-of relations
- non-taxonomic / subtopic-of relations
- disjointness axioms

Incremental ontology learning
Independent of the concrete ontology language
A. Hotho: Text Clustering with Background Knowledge 111
Experimental results
• Formal Concept Analysis yields better concept hierarchies than similarity-based clustering algorithms.
• The results of FCA are easier to understand (intensional description of concepts!).
• Bi-Section-KMeans is the most efficient (O(n²)).
• Though FCA is exponential in the worst case, it shows favourable runtime behaviour (sparsely populated formal contexts).
A. Hotho: Text Clustering with Background Knowledge 112
Other Clustering Approaches
Bottom-Up/Agglomerative:
- (ASIUM system) Faure and Nédellec 1998
- Caraballo 1999
- (Mo'K Workbench) Bisson et al. 2000

Other:
- Hindle 1990
- Pereira et al. 1993
- Hovy et al. 2000
A. Hotho: Text Clustering with Background Knowledge 113
Ontology Learning References
• Reinberger, M.-L., & Spyns, P. (2005). Unsupervised text mining for the learning of dogma-inspired ontologies. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.
• Philipp Cimiano, Andreas Hotho, Steffen Staab: Comparing Conceptual, Divisive and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004: 435-439
• P. Cimiano, A. Pivk, L. Schmidt-Thieme and S. Staab, Learning Taxonomic Relations from Heterogenous Evidence. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.
• Sabou M., Wroe C., Goble C. and Mishne G., Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics, In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.
• Alexander Maedche, Ontology Learning for the Semantic Web, PhD Thesis, Kluwer, 2001.
• Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)
• Alexander Maedche, Steffen Staab: Ontology Learning. Handbook on Ontologies 2004: 173-190
• M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of semantic relations between concepts of a molecular biology ontology. IJCAI, 659ff.
• A. Schutz, P. Buitelaar. RelExt: A Tool for Relation Extraction from Text in Ontology Extension. ISWC 2005.
• Faure, D., & Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology. In Velardi, P. (Ed.), Proceedings of the LREC Workshop on Adapting lexical and corpus resources to sublanguages and applications, pp. 5–12.
• Michele Missikoff, Paola Velardi, Paolo Fabriani: Text Mining Techniques to Automatically Enrich a Domain Ontology. Applied Intelligence 18(3): 323-340 (2003).
• Gilles Bisson, Claire Nedellec, Dolores Cañamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. ECAI Workshop on Ontology Learning 2000
Dr. Andreas Hotho
Text Clustering
A. Hotho: Text Clustering with Background Knowledge 115
Motivation
• Challenge: browse, search and organize the huge amount of unstructured text documents available on the internet or in company intranets
  - huge sets of documents in internet portals like Yahoo.com, DMoz.org, Web.de are manually structured
  - meta search engines like Vivisimo.com use clustering techniques to structure the search results
• Advantage: the structure and the visualization of the information provided by the clustering help the user to work with a larger amount of information
A. Hotho: Text Clustering with Background Knowledge 116
Motivation
A. Hotho: Text Clustering with Background Knowledge 117
Text Clustering
Text Clustering
[…] the partitioning of texts into previously unseen categories […]
A. Hotho et al., SIGIR 2003 Semantic Web Workshop
Automatic text clustering uses full-text vector representations of text documents, as in information retrieval, within standard clustering algorithms.
A. Hotho: Text Clustering with Background Knowledge 118
Motivation - Overall Process

[Figure: objects are mapped to a representation (a term-count table with, e.g., columns for team, baseman, …), a similarity measure / distance function is chosen, the cluster algorithm runs, and an explanation of the result is produced – with background knowledge feeding into both representation and explanation; example terms: oman, discount, oil, crude.]
A. Hotho: Text Clustering with Background Knowledge 119
Motivation: requirements on the cluster methods

Efficient: results should also be available on large data sets or on ad-hoc collections, e.g. from search engines
Effective: the cluster result must be correct
Problem of explanatory power: results of the clustering process must be understandable
User interaction and subjectivity: the user has his own idea of the clustering goal and wants to integrate it into the cluster process
A. Hotho: Text Clustering with Background Knowledge 120
Text Clustering with Background Knowledge
- choose a representation: bag of terms (details on the next slide)
- choose a similarity measure: cosine
- choose a clustering algorithm: Bi-Section-KMeans (a version of KMeans)
- Reuters data set for our studies (min15, max100)
A. Hotho: Text Clustering with Background Knowledge 121
Preprocessing steps
docid  term1  term2  term3  ...
doc1     0      0      1
doc2     2      3      1
doc3    10      0      0
doc4     2     23      0
...

- build a bag of words model
- extract word counts (term frequencies)
- remove stopwords
- pruning: drop words with less than e.g. 30 occurrences
- weighting of document vectors with tfidf (term frequency - inverted document frequency):

    tfidf(t, d) = log(tf(t, d) + 1) · log(|D| / df(t))

  |D|    number of documents
  df(t)  number of documents d which contain term t
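The whole preprocessing pipeline as a sketch (plain Python; the stopword list and corpus are toy data, and the pruning threshold is lowered from 30 so the tiny corpus keeps a term):

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to"}   # toy stopword list

def preprocess(docs, min_count=2):
    """Bag-of-words pipeline: counts, stopword removal, pruning, tfidf."""
    counts = [Counter(w for w in d.lower().split() if w not in STOPWORDS)
              for d in docs]
    total = Counter()
    for c in counts:
        total.update(c)                       # corpus-wide occurrence counts
    vocab = sorted(t for t, n in total.items() if n >= min_count)  # pruning
    df = Counter(t for c in counts for t in c)
    n = len(docs)
    # tfidf(t, d) = log(tf(t, d) + 1) * log(|D| / df(t))
    vectors = [[math.log(c[t] + 1) * math.log(n / df[t]) for t in vocab]
               for c in counts]
    return vocab, vectors

docs = ["oil price and oil supply", "the oil barrel", "a soccer team"]
vocab, vectors = preprocess(docs)
print(vocab)   # only "oil" survives pruning
```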
A. Hotho: Text Clustering with Background Knowledge 122
Ontology
Ontology O represents the background knowledge. A core ontology consists of:
- a set of concepts C
- a concept hierarchy (taxonomy) ≤_C
- a lexicon Lex

[Figure: example taxonomy with Root above Person, Publication, Project and Topic; AcademicStaff and Student (with PhDStudent) below Person; Article and Book below Publication; Research Topic below Topic with Knowledge Management and Distributed Organization; the lexicon maps labels to concepts, e.g. DE:Wissensmanagement / EN:Knowledge Management.]
A. Hotho: Text Clustering with Background Knowledge 123
WordNet as ontology
- 109377 concepts (synsets)
- 144684 lexical entries

[Figure: WordNet hypernym chains for "oil": Root/entity → something → physical object → substance → chemical compound → organic compound → lipid → oil (EN:oil; related: crude oil, oil color); a second noun sense covering → coating → paint → oil paint; and verb senses cover → cover with oil and bless → oil, anoint (EN:anoint, EN:inunct).]
Use of superconcepts (hypernyms in WordNet)
• exploit more generalized concepts
• example: chemical compound is the 3rd superconcept of oil
• "prune" unimportant superconcepts with tfidf

Word sense disambiguation strategies: all, first, context
A. Hotho: Text Clustering with Background Knowledge 124
Reuters texts
Doc 17892, topic: crude
=============
Oman has granted term crude oil customers retroactive discounts from official prices of 30 to 38 cents per barrel on liftings made during February, March and April, the weekly newsletter Middle East Economic Survey (MEES) said. MEES said the price adjustments, arrived at through negotiations between the Omani oil ministry and companies concerned, are designed to compensate for the difference between market-related prices and the official price of 17.63 dlrs per barrel adopted by non-OPEC Oman since February. REUTER
A. Hotho: Text Clustering with Background Knowledge 125
Ontology-based representation

Different strategies: add, replace, only

[Figure: three term vectors for the Oman document – (1) the plain bag of terms (Oman 2, granted 1, term 1, crude 1, oil 2, customer 1, retroactive 1, discount 1, …), (2) the vector with added superconcepts (e.g. lipid and compound next to oil), and (3) the variant produced by the chosen strategy.]
A. Hotho: Text Clustering with Background Knowledge 126
Bi-Partitioning K-Means

Input: set of documents D, number of clusters k
Output: k clusters that exhaustively partition D

Initialize: P* = {D}
Outer loop: repeat k-1 times: bi-partition the largest cluster E ∈ P*
  Inner loop: randomly initialize two documents from E to become e1, e2
    Repeat until convergence is reached:
      - assign each document from E to the nearest of the two e_i; thus split E into E1, E2
      - re-compute e1, e2 to become the centroids of the document representations assigned to them
  P* := (P* \ {E}) ∪ {E1, E2}
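The pseudo-code above as a compact NumPy sketch (toy 2-D data; a real run would use cosine similarity on tfidf vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def bipartition(E):
    """Inner loop: split one cluster (rows of E) into two by 2-means."""
    centroids = E[rng.choice(len(E), size=2, replace=False)]
    assign = np.zeros(len(E), dtype=int)
    while True:
        # distance of every document to both centroids
        d = np.linalg.norm(E[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)
        if (new_assign == assign).all():
            break
        assign = new_assign
        for i in (0, 1):
            if (assign == i).any():
                centroids[i] = E[assign == i].mean(axis=0)
    return E[assign == 0], E[assign == 1]

def bisection_kmeans(D, k):
    """Outer loop: repeatedly bi-partition the largest cluster."""
    clusters = [D]
    for _ in range(k - 1):
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        E = clusters.pop(largest)
        clusters.extend(bipartition(E))
    return clusters

# two well-separated blobs of 10 points each
D = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
parts = bisection_kmeans(D, 3)
print([len(p) for p in parts])
```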
A. Hotho: Text Clustering with Background Knowledge 127
Evaluation of Text Clustering
[Chart: average purity (y-axis, ≈0.51–0.63) for different settings – disambiguation strategy (context vs. all), hypernym depth (0 vs. 5), background knowledge false vs. true, pruning at 30.]

Evaluation parameters:
• min 15, max 100, 2619 documents
• cluster k = 60
• tfidf
• term and concept vector
A. Hotho: Text Clustering with Background Knowledge 128
Evaluation of Text Clustering
[Chart: average purity for the integration strategies add / replace / only, crossed with disambiguation (context, first, all), hypernym depth (0 vs. 5) and weighting (tfidf vs. without, pruning at 30); best values ≈0.616–0.618 with background knowledge vs. 0.570 without. Settings: CLUSTERCOUNT 60, EXAMPLE 100, MINCOUNT 15.]

Evaluation parameters:
• min 15, max 100, 2619 documents
• cluster k = 60
A. Hotho: Text Clustering with Background Knowledge 129
Evaluation of Text Clustering
Backgr.  depth  integr.  Mean PURITY    Mean INVPURITY
false      -      -      0.570 ±0.019   0.479 ±0.016
true       0     add     0.585 ±0.014   0.492 ±0.017
true       0     only    0.603 ±0.019   0.504 ±0.021
true       5     add     0.618 ±0.015   0.514 ±0.019
true       5     only    0.593 ±0.010   0.500 ±0.016

Evaluation parameters:
• min 15, max 100, 2619 documents
• cluster k = 60
• disamb = context
• prune = 30
A. Hotho: Text Clustering with Background Knowledge 130
Variance analysis of the Reuters classes
Idea: ideally, documents of one class should have the same representation (variance = 0); if the representation of the documents is changed, the variance will also change.

Analysis:
- compare the variance of the classes in both representations (with and without ontology)
- compare the purity per class
A. Hotho: Text Clustering with Background Knowledge 131
Variance analysis
[Chart: "Variance and purity per class for PRC-min15-max100" – percentage difference (−30% to +60%) in variance and purity per Reuters class (earn, pet-chem, meal-feed, ship, …, cotton), with a linear trend over the purity differences.]
A. Hotho: Text Clustering with Background Knowledge 132
Conclusion
Background knowledge helps to improve clustering results: similar terms in two documents may contribute to a good similarity rating if they are related via WordNet synsets or hypernyms.

Adding background knowledge is not beneficial per se; it has to be combined with
- term and concept weighting
- word sense disambiguation
A. Hotho: Text Clustering with Background Knowledge 133
Conclusion and Outlook
Ontologies provide the background knowledge
- for clustering of text/web documents, to achieve better clustering results
- for describing text clusters, to make the descriptions more understandable
- for more details see: [Hotho et al. 2003]

Some possible improvements:
- include more aspects of WordNet, e.g. adjectives
- take domain-specific ontologies, e.g. AGROVOC
Dr. Andreas Hotho
Text Clustering with FCA
A. Hotho: Text Clustering with Background Knowledge 135
Introduction Clustering
case sex glasses moustache smile hat1 m y n y n2 f n n y n3 m y n n n4 m n n n n5 m n n y? n6 m n y n y7 m y n y n8 m n n y n9 m y y y n
10 f n n n n11 m n y n n12 f n n n n
A. Hotho: Text Clustering with Background Knowledge 136
Introduction Formal Concept Analysis
A. Hotho: Text Clustering with Background Knowledge 137
Extracted Word/Concept lists
A. Hotho: Text Clustering with Background Knowledge 138
Motivation for an Explanation of Clustering Results
Starting Point:
How do people describe a group of documents/objects?
• general and specific words are used
• background knowledge provides general words
• background knowledge could help to find links between important but rare words of a text document
A. Hotho: Text Clustering with Background Knowledge 139
Introduction to Formal Concept Analysis
Formal Concept Analysis [Wille 1982] allows generating and visualizing concept hierarchies.
FCA models concepts as units of thought, consisting of two parts:
The extension consists of all objects belonging to the concept.The intension consists of all attributes common to all those objects.
A. Hotho: Text Clustering with Background Knowledge 140
Introduction to Formal Concept Analysis
              bank  financ  market  team  baseman  season
FinanceText1    X      X       X
FinanceText2    X      X       X
SportText1                             X      X       X
SportText2                             X      X       X
Example: Textsfrom the WWW
A. Hotho: Text Clustering with Background Knowledge 141
Introduction to Formal Concept Analysis
              bank  financ  market  team  baseman  season
FinanceText1    X      X       X
FinanceText2    X      X       X
SportText1                             X      X       X
SportText2                             X      X       X
objects
attributes
formal context
Def.: A formal context is a triple (G, M, I), where
• G is a set of objects,
• M is a set of attributes,
• and I is a relation between G and M.
• (g, m) ∈ I is read as "object g has attribute m".
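The two derivation operators behind this definition can be sketched in a few lines of Python; the toy context mirrors the finance/sport table above, and all names are illustrative:

```python
# Toy formal context: objects mapped to their attribute sets.
context = {
    "FinanceText1": {"bank", "financ", "market"},
    "FinanceText2": {"bank", "financ", "market"},
    "SportText1":   {"team", "baseman", "season"},
    "SportText2":   {"team", "baseman", "season"},
}

def intent(objects):
    """All attributes common to the given objects (the intension)."""
    attrs = [context[g] for g in objects]
    return set.intersection(*attrs) if attrs else set()

def extent(attributes):
    """All objects having every one of the given attributes (the extension)."""
    return {g for g, m in context.items() if attributes <= m}

# A formal concept is a pair (A, B) with intent(A) == B and extent(B) == A:
A = extent({"bank"})   # both finance texts
B = intent(A)          # the attributes they share: bank, financ, market
```

The pair (A, B) computed at the end is exactly one formal concept of this context; enumerating all such closed pairs yields the concept lattice shown on the following slides.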
A. Hotho: Text Clustering with Background Knowledge 142
Introduction to Formal Concept Analysis
Concept lattice
              bank  financ  market  team  baseman  season
FinanceText1    X      X       X
FinanceText2    X      X       X
SportText1                             X      X       X
SportText2                             X      X       X
A. Hotho: Text Clustering with Background Knowledge 143
Introduction to Formal Concept Analysis
              bank  financ  market  american  team  baseman  season
FinanceText1    X      X       X       X
FinanceText2    X      X       X
SportText1                                      X      X       X
SportText2                             X        X      X       X
A. Hotho: Text Clustering with Background Knowledge 144
Introduction to Formal Concept Analysis
A. Hotho: Text Clustering with Background Knowledge 145
FCA text clustering
• preprocess text documents
• extract a description for all documents
• calculate FCA lattice
• visualize lattice
A. Hotho: Text Clustering with Background Knowledge 146
Motivation-Overall Process
cluster algorithm
FCA
Objects
explanation
FCA
representation of objects
      mornings  evenings  team  baseman
Obj1 1 1
Obj2 1 1
Obj3 2 1
Obj4 2 1
similarity measure
distance function
A. Hotho: Text Clustering with Background Knowledge 147
Example corpus
• 21 documents collected from the internet
• 3 categories: soccer, finance, and software
• 1419 different word stems, of which 253 are stopwords
A. Hotho: Text Clustering with Background Knowledge 148
Lattice for 21 documents with 117 terms (θ = 15%)
A. Hotho: Text Clustering with Background Knowledge 149
Extraction of cluster descriptions
• Lattice with all terms/concepts is too large to act as the basis of a description
→ selection of the most important terms/concepts
• Approach: introduce a threshold θ
Remove all terms of the document vector with a value smaller than θ (e.g. θ = 25% of the max value)
              bank  financ  market  team  baseman  season
FinanceText1    X      X       X
FinanceText2    X      X       X
SportText1                             X      X       X
SportText2                             X      X       X
              bank  financ  market  team  baseman  season
FinanceText1    1      2       1
FinanceText2    2      2       1
SportText1                             1      3       2
SportText2                             1      2       2
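The thresholding step described above can be sketched as follows; this is a minimal illustration, not the exact implementation from the talk:

```python
def prune(doc_vector, theta=0.25):
    """Keep only the terms whose value is at least theta * (document maximum),
    returning the binary attribute set for the formal context."""
    max_val = max(doc_vector.values())
    return {t for t, v in doc_vector.items() if v >= theta * max_val}

# Toy count vector, as in the table above:
finance_text2 = {"bank": 2, "financ": 2, "market": 1}
low = prune(finance_text2, theta=0.25)   # all three terms survive
high = prune(finance_text2, theta=0.80)  # only the dominant terms survive
```

Raising θ prunes the context and thus coarsens the resulting lattice, which is exactly the effect compared in the θ = 80% / θ = 45% lattices on the next slides.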
A. Hotho: Text Clustering with Background Knowledge 150
Lattice with θ = 80%
A. Hotho: Text Clustering with Background Knowledge 151
Lattice with θ = 45%
A. Hotho: Text Clustering with Background Knowledge 152
Lattice with manually selected terms
A. Hotho: Text Clustering with Background Knowledge 153
Lesson learned
• results are not really good
• lattice is too fine grained
• lattice is difficult to interpret
• the absence/presence of a term in a document description usually results in a totally different lattice
→ use clustering approaches like k-means to reduce these effects
A. Hotho: Text Clustering with Background Knowledge 154
Motivation-Overall Process
cluster algorithm
Objects
explanation
representation of objects
      mornings  evenings  team  baseman
Obj1 1 1
Obj2 1 1
Obj3 2 1
Obj4 2 1
similarity measure
distance function
A. Hotho: Text Clustering with Background Knowledge 155
Visualization of Bi-Sec-K-Means clustering results
• Compute 10 Bi-Sec-K-Means clusters
• Extract a term description
• Compute the lattice
• Visualize the lattice
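The Bi-Section-KMeans idea behind these steps can be sketched roughly as follows; the deterministic farthest-pair initialization, the data, and all names are illustrative simplifications:

```python
import numpy as np

def two_means(X, iters=10):
    """Plain 2-means (Lloyd's algorithm); initialized with a farthest pair
    so the sketch stays deterministic."""
    far = int(np.argmax(np.linalg.norm(X - X[0], axis=1)))
    centers = np.stack([X[0], X[far]]).astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in (0, 1):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(X, k):
    """Start with one cluster; repeatedly bisect the largest one until
    k clusters exist (the core of Bi-Section-KMeans)."""
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()             # always split the largest cluster
        labels = two_means(X[biggest])
        a, b = biggest[labels == 0], biggest[labels == 1]
        if len(a) == 0 or len(b) == 0:       # degenerate split: stop early
            clusters.append(biggest)
            break
        clusters += [a, b]
    return clusters
```

On document vectors, each returned index set is one cluster; its most frequent terms then serve as the cluster description fed into FCA.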
A. Hotho: Text Clustering with Background Knowledge 156
Result for 10 clusters
A. Hotho: Text Clustering with Background Knowledge 157
Result for the same terms, but not based on the clusters
A. Hotho: Text Clustering with Background Knowledge 158
Motivation-Overall Process
cluster algorithm
Objects
explanation
representation of objects
      mornings  evenings  team  baseman
Obj1 1 1
Obj2 1 1
Obj3 2 1
Obj4 2 1
similarity measure
distance function
background knowledge
A. Hotho: Text Clustering with Background Knowledge 159
Extracted Word/Concept lists
A. Hotho: Text Clustering with Background Knowledge 160
Combine FCA & Standard Text-clustering
• preprocess Reuters documents and enrich them with background knowledge (Wordnet)
• calculate a reasonable number k (100) of clusters with BiSec-k-Means using cosine similarity
• extract a description for all clusters
• relate clusters (objects) with FCA
• use the visualization of the concept lattice for better understanding
A. Hotho: Text Clustering with Background Knowledge 161
Extracting Cluster Descriptions
Using all concepts (synsets) as attributes for FCA yields too large a concept lattice
→ select the important ones
Approach: introduce two thresholds: θ1, θ2
For every centroid:
• drop all concepts (synsets) with a value lower than θ1,
• mark all concepts (synsets) between θ1 and θ2 with "m",
• and above θ2 with "h".
We chose θ1 = 7% and θ2 = 20% of the max value.
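A small sketch of this two-threshold scaling, with made-up centroid values:

```python
def scale(centroid, theta1, theta2):
    """Turn a centroid's concept weights into 'm'/'h' attributes:
    values below theta1 * max are dropped, values in [theta1, theta2) * max
    become medium ('m') attributes, values >= theta2 * max become high ('h')."""
    max_val = max(centroid.values())
    attrs = set()
    for concept, v in centroid.items():
        if v >= theta2 * max_val:
            attrs.add(concept + ":h")
        elif v >= theta1 * max_val:
            attrs.add(concept + ":m")
    return attrs

# Illustrative centroid weights for one cluster:
attrs = scale({"oil": 50, "barrel": 8, "music": 2}, theta1=0.07, theta2=0.20)
```

With θ1 = 7% and θ2 = 20% the dominant concept becomes a high attribute, a moderately weighted one a medium attribute, and noise concepts drop out, which keeps the resulting lattice readable.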
A. Hotho: Text Clustering with Background Knowledge 162
Result
A. Hotho: Text Clustering with Background Knowledge 163
Result
A. Hotho: Text Clustering with Background Knowledge 164
Result
compound, chemical compound
oil
Crude oil
barrel
Palm oil
chain of concepts with increasing specificity
A. Hotho: Text Clustering with Background Knowledge 165
Similar example
compound, chemical compound
oil
refiner
chain of concepts with increasing specificity
A. Hotho: Text Clustering with Background Knowledge 166
Results
Crude oil
barrel
A. Hotho: Text Clustering with Background Knowledge 167
Results
resin palm
• Resulting concept lattice can also be interpreted as a concept hierarchy directly on the documents
• all documents in one cluster obtain exactly the same description
A. Hotho: Text Clustering with Background Knowledge 168
Results Multi topic cluster
pork, meat, … music, coffee, food, beverage
Multi Topic Cluster CL8
• BiSec-k-Means results are bad
• FCA helps to identify inconsistencies
A. Hotho: Text Clustering with Background Knowledge 169
Formal Concept Analysis (FCA) for Providing Cluster Descriptions
Apply FCA to the clusters generated by Bi-Section-KMeans
Embed clusters into a lattice structure
Clusters are objects
Terms and concepts are attributes
FCA provides two achievements:
• Intensional descriptions of clusters are generated
  – exploit concepts from background knowledge
• Interactive exploration of the document collection is supported
  – browse the lattice structure
  – zoom into interesting parts
A. Hotho: Text Clustering with Background Knowledge 170
Conclusion and Outlook
• FCA enables a more understandable explanation of ontology-enriched (k-Means) text clusters
• Clustering of text/web documents with an ontology achieves better clustering results
More details in: Wordnet improves Text Document Clustering,
Semantic Web WS at SIGIR, Hotho et al. 2003
• Some possible improvements:
  include more aspects of WordNet, e.g. adjectives
  take domain-specific ontologies, e.g. AGROVOC
  use more sophisticated means for feature selection within FCA
Dr. Andreas Hotho
Text Classification
A. Hotho: Text Clustering with Background Knowledge 172
Text Classification
Text Classification (Text Categorization)
Text categorization (TC - a.k.a. text classification, or topic spotting), [is] the activity of labeling natural language texts with thematic categories from a predefined set […].
F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys 34(1), 2002.
Automatic text classification uses full-text vector representations of text documents, as in information retrieval, within standard classification algorithms.
A. Hotho: Text Clustering with Background Knowledge 173
Text Classification Approaches
classification algorithm
(AdaBoost)
Documents
Bag of Words
background knowledge
[Bag-of-words matrix: documents Obj1–Obj4 with term counts for "oman", "has", "granded", …]
A. Hotho: Text Clustering with Background Knowledge 174
Conceptual Document Representation
Let's extract some concepts...
Detecting the appropriate set of concepts from an ontology (O, Lex) requires multiple steps:
1. Candidate Term Detection2. Morphological Transformations3. Word Sense Disambiguation4. Generalization
A. Hotho: Text Clustering with Background Knowledge 175
Conceptual Document Representation: Candidate Term Detection
Querying the lexicon directly for each single word will not do the trick! (Remember the multi-word expressions!)
Solution: Move a window of maximum size over the text; decrease the window size if unsuccessful before moving on.
But querying the lexicon for every candidate term window produces much overhead!
Solution: Avoid unnecessary lexicon queries by matching POS tags in the window against appropriately defined syntactical patterns (e.g. noun phrases).
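The shrinking-window lookup can be sketched as follows; the tiny lexicon and the concept identifiers are purely illustrative (a real system would query WordNet or a domain lexicon, and would filter candidates by POS patterns first):

```python
# Illustrative lexicon mapping lexical entries to concept identifiers.
lexicon = {"crude oil": "c_crude_oil", "oil": "c_oil", "palm oil": "c_palm_oil"}

def detect(tokens, max_window=3):
    """Slide a window over the tokens, trying the largest window first,
    so multi-word entries like 'crude oil' win over the single word 'oil'."""
    i, found = 0, []
    while i < len(tokens):
        for size in range(min(max_window, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in lexicon:
                found.append(lexicon[candidate])
                i += size          # skip past the matched expression
                break
        else:
            i += 1                 # no match at any window size: move on
    return found

concepts = detect("prices of crude oil rose".split())
```

Note that the longest match is preferred: "crude oil" maps to its own concept rather than decomposing into "oil".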
A. Hotho: Text Clustering with Background Knowledge 176
AdaBoost
Boosting is a relatively young and very successful machine learning technique.
Boosting algorithms build so-called ensemble classifiers (meta classifiers):
1. Build many very simple "weak" classifiers.
2. Combine the weak learners in an additive model:
A. Hotho: Text Clustering with Background Knowledge 177
AdaBoost
AdaBoost maintains weights Dt over the training instances.
At each iteration t: choose a base classifier ht that performs best on weighted training instances.
Calculate a weight parameter αt based on the performance of the base classifier: higher errors lead to smaller weights, and smaller errors lead to higher weights.
Weight update increases (decreases) weights for wrongly (correctly) classified instances.
Thereby, AdaBoost “is focusing in” on “hard” training instances.
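The loop just described can be sketched compactly with decision stumps as weak learners; this is a generic textbook AdaBoost variant, not necessarily the exact BoosTexter-style implementation used in the experiments:

```python
import numpy as np

def adaboost(X, y, rounds=5):
    """AdaBoost with threshold stumps; y is in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # weights over training instances
    ensemble = []
    for _ in range(rounds):
        best = None
        for f in range(X.shape[1]):            # weak learner: one-feature stump
            for thr in np.unique(X[:, f]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, f] >= thr, 1, -1)
                    err = D[pred != y].sum()   # weighted training error
                    if best is None or err < best[0]:
                        best = (err, f, thr, sign, pred)
        err, f, thr, sign, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # small error -> big alpha
        D = D * np.exp(-alpha * y * pred)      # up-weight misclassified instances
        D /= D.sum()
        ensemble.append((alpha, f, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Weighted majority vote of the stumps (the additive model)."""
    score = sum(a * s * np.where(X[:, f] >= t, 1, -1) for a, f, t, s in ensemble)
    return np.sign(score)
```

The weight update D ← D · exp(−αt · y · ht(x)) is what makes later rounds "focus in" on the hard instances mentioned above.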
A. Hotho: Text Clustering with Background Knowledge 178
Evaluation
Datasets: Reuters-21578
Documents about finance from 1987
9603 training documents and 3299 test documents (ModApte split)
Binary classification on the top 50 classes.
OHSUMED Corpus
OHSUMED (TREC-9), titles and abstracts from medical journals, 1987
36369 training documents and 18341 test documents
Binary classification on the top 50 classes (MeSH classifications).
FAODOC CorpusDocuments about agricultural information1501 docs within 21 categories
A. Hotho: Text Clustering with Background Knowledge 179
Evaluation: Reuters Results
• Top 50 Reuters classes with 17525 term stems / 10259–27236 synset features
A. Hotho: Text Clustering with Background Knowledge 180
Evaluation: OHSUMED Results
Top 50 classes with WordNet
A. Hotho: Text Clustering with Background Knowledge 181
Evaluation: OHSUMED Results
Relative improvement on the top 50 classes with WordNet
A. Hotho: Text Clustering with Background Knowledge 182
Evaluation: OHSUMED Results
Relative improvement on the top 50 classes with the MeSH ontology (~22,000 concepts, "all" strategy)
A. Hotho: Text Clustering with Background Knowledge 183
Evaluation: Reuters Results
• Top 50 Reuters classes with 17525 term stems / 10259–27236 synset features
A. Hotho: Text Clustering with Background Knowledge 184
Evaluation: Reuters Results
Relative improvement on the top 50 classes
A. Hotho: Text Clustering with Background Knowledge 185
Evaluation: FAODOC Results
A. Hotho: Text Clustering with Background Knowledge 186
Evaluation: FAODOC Results
A. Hotho: Text Clustering with Background Knowledge 187
Conclusion and Outlook
• Successful integration of conceptual features to improve classification performance
• Generalization does improve classification results in most cases
A. Hotho: Text Clustering with Background Knowledge 188
Conclusion and Outlook
• Advanced Generalization Strategies
• Development of additional weak learner plugins that exploit ontologies more directly
• Heuristics for efficient handling of continuous feature values like TFIDF in AdaBoost
• Multilingual Text Classification
Dr. Andreas Hotho
Application driven Evaluation of
Ontology Learning
A. Hotho: Text Clustering with Background Knowledge 190
Ontology Learning
• Until now we used manually engineered ontologies.
• Large ontologies are not available for every domain.
• Building such ontologies takes a big effort.
• Idea: Learn Ontologies from text
A. Hotho: Text Clustering with Background Knowledge 191
Ontologies: Semantic Structures
Ontologies:
MeSH Tree StructuresWordNet
conceptual feature representation
ontology learning
term vectors
concept vectors
linguistic context vectors
+
term clustering
"learned" ontology structures
[Maedche & Staab 2001][Cimiano et al. ECAI 2004][Cimiano et al. JAIR 2005]
Why?
• knowledge acquisition bottleneck
• adaptation to domain context
• just have some fun trying something weird
A. Hotho: Text Clustering with Background Knowledge 192
Ontology Learning as Term Clustering
Hierarchical clustering approaches, e.g.
• agglomerative (bottom-up) clustering
• Bi-Section-kMeans clustering
are used to create taxonomic structures (concept hierarchies)
Quality of learned semantic structures is surprisingly high.
[Cimiano et al. ECAI 2004]
Deficiencies (?):
• taxonomic relations mix with synonyms and other relations
• binary splits
• superconcepts – i.e. clusters – are not mapped to lexical entries
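Agglomerative term clustering of the kind discussed here can be sketched as follows; the context vectors, the cosine similarity, and the centroid-style merge are illustrative simplifications of the approaches cited above:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two context vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def agglomerate(vectors, names, target=2):
    """Bottom-up clustering: start with singletons and repeatedly merge the
    most similar pair until `target` clusters remain. Each merge would form
    one internal node of the learned taxonomy."""
    clusters = [([n], vectors[i]) for i, n in enumerate(names)]
    while len(clusters) > target:
        best = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cosine(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        (na, va), (nb, vb) = clusters[best[0]], clusters[best[1]]
        merged = (na + nb, (va + vb) / 2)   # averaged vector for the new cluster
        clusters = [c for k, c in enumerate(clusters) if k not in best] + [merged]
    return [sorted(n) for n, _ in clusters]

terms = ["oil", "barrel", "team"]
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
result = agglomerate(vecs, terms)
```

The deficiency noted above is visible even in this sketch: the merged node has no lexical label of its own, only the union of its members.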
A. Hotho: Text Clustering with Background Knowledge 193
Learned Ontology: kmeans 7000
A. Hotho: Text Clustering with Background Knowledge 194
Evaluation Setting: Ontologies
Learned ontologies:
linguistic contexts from the 1987 portion of the OHSUMED corpus
based on top 10,000 terms ∩ MeSH terms = 7,000 terms (cheated)
agglomeratively clustered
bi-sec-kmeans clustered
based on top 14,000 termsbi-sec-kmeans clustered
Competitors:
MeSH Tree Structures
Maintained by the United States National Library of Medicine
> 22,000 hierarchically organized concepts
WordNet
(psycho-)linguistic ontology
115,424 synsets in total – 79,689 synsets in the noun category
A. Hotho: Text Clustering with Background Knowledge 195
Evaluation Setting: Text Classification and Clustering
OHSUMED Corpus (TREC-9), titles and abstracts from medical journals, 1987
Typically regarded as a rather "hard" corpus.Text Classification Setting:
36,369 training documents and 18,341 test documents
Binary classification on top 50 classes (MeSH classifications).
Classification algorithm: AdaBoost with binary decision stumps, 1000 iterations.
Text Clustering Setting:
4,390 documents rated relevant for one of 106 queries
cluster-to-query evaluation
clustering algorithm: bi-section-kmeans
weighting: TFIDF, pruning level: 20
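The TFIDF weighting referenced here can be sketched with the common tf · log(N/df) scheme; the exact variant and the pruning step used in the experiments may differ, and the documents are toy token lists:

```python
import math

def tfidf(docs):
    """Weight each term by its in-document frequency times the log of the
    inverse document frequency across the collection."""
    N = len(docs)
    df = {}                                   # document frequency per term
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    return [{t: d.count(t) * math.log(N / df[t]) for t in set(d)} for d in docs]

docs = [["oil", "oil", "barrel"], ["oil", "team"], ["team", "season"]]
weights = tfidf(docs)
```

Terms occurring in many documents are down-weighted, so cluster centroids are dominated by the discriminative vocabulary rather than by ubiquitous words.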
A. Hotho: Text Clustering with Background Knowledge 196
Evaluation Results: Text Classification
* extensive experimental evaluation for different superconcept integration depths (3, 5, 10, 15, 20, 25, 30) – only the optimal feature configuration (wrt. F1) for each ontology is shown
A. Hotho: Text Clustering with Background Knowledge 197
Evaluation Results: Text Classification
[Bar chart: relative improvement in macro F1 and micro F1 (0%–8%) for the learned ontologies (7000-agglo, 7000-bisec-kmeans, 14000-bisec-kmeans) and the manually built ontologies (WordNet, MeSH Tree Structures), each at its best feature configuration (term & concept.sc30, term & concept.sc15, term & concept.sc20, term & synset.context.hyp5, term & mesh.sc3); significance marks T**/S**, T*/S*, T**/S**, T**/S*.]
A. Hotho: Text Clustering with Background Knowledge 198
Evaluation Results: Text Clustering
* extensive experimental evaluation for different superconcept integration depths (3, 5, 10, 15, 20, 25, 30) – only the optimal feature configuration (wrt. purity) for each ontology is shown
** all results are averages over 20 results with different random seeds
A. Hotho: Text Clustering with Background Knowledge 199
Last but not least…
Main points of this lesson:
Integration of explicit conceptual features improves text clustering and classification performance.
Learned ontologies achieve improvements competitive with manually created ontologies.
In both cases, the major improvement is due to generalizations.
Outlook:
Investigation of the relation to purely statistical "conceptualizations", e.g. LSI, PLSA
Improvements in ontology learning.
More advanced generalization strategies.
A. Hotho: Text Clustering with Background Knowledge 200
Literature
• Stephan Bloehdorn, Andreas Hotho: Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM 2004.
• Andreas Hotho, Steffen Staab, Gerd Stumme: WordNet improves text document clustering; Semantic Web Workshop @ SIGIR 2003.
• W. R. Hersh, C. Buckley, T. J. Leone, and D. H. Hickam. OHSUMED: An Interactive Retrieval Evaluation and New Large Test Collection for Research. SIGIR 1994.
• Alexander Maedche, Steffen Staab. Ontology Learning for the Semantic Web. IEEE Intelligent Systems, 16(2):72–79, 2001.
• Philipp Cimiano, Andreas Hotho, Steffen Staab. Comparing Conceptual, Partitional and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004. Extended version to appear (JAIR 2005).
Dr. Andreas Hotho
Background Knowledge
A. Hotho: Text Clustering with Background Knowledge 202
Statistical Concepts as Background Knowledge
• Calculating a kind of statistical concept and combining them with the classical bag-of-words representation
L. Cai and T. Hofmann. Text Categorization by Boosting Automatically Extracted Concepts. In Proc. of the 26th Annual Int. ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003.
• Clustering words to set up a kind of concept
G. Karypis and E. Han. Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval. In Proc. of 9th ACM International Conference on Information and Knowledge Management, CIKM-00, pages 12–19, New York, US, 2000. ACM Press.
• Clustering words and documents simultaneously
Inderjit S. Dhillon, Yuqiang Guan, and J. Kogan. Iterative clustering of high dimensional text data augmented by local search. In 2nd SIAM International Conference on Data Mining (Workshop on Clustering High-Dimensional Data and its Applications), 2002.
A. Hotho: Text Clustering with Background Knowledge 203
Text Classification and Ontologies
• Using hypernyms of WordNet as concept features (no WSD, no significantly better results)
Sam Scott , Stan Matwin, Feature Engineering for Text Classification, Proceedings of the Sixteenth International Conference on Machine Learning, p.379-388, June 27-30, 1999
• Brown Corpus tagged with WordNet senses does not show significantly better results.
A. Kehagias, V. Petridis, V. G. Kaburlasos, and P. Fragkou. A Comparison of Word- and Sense-Based Text Categorization Using Several Classification Algorithms. Journal of Intelligent Information Systems, 21(3):227–247, 2003.
• Map terms to concepts of the UMLS ontology to reduce the size of the feature set; use a search algorithm to find superconcepts; evaluation using KNN and Medline documents; shows improvement.
B. B. Wang, R. I. Mckay, H. A. Abbass, and M. Barlow. A comparative study for domain ontology guided feature extraction. In Proceedings of the 26th Australian Computer Science Conference (ACSC-2003), pages 69–78. Australian Computer Society, 2003.
• Generative model consisting of features, concepts and topics; uses WordNet to initialize the parameters for concepts; evaluation on Reuters and Amazon corpora
Georgiana Ifrim, Martin Theobald, Gerhard Weikum, Learning Word-to-Concept Mappings for Automatic Text Classification Learning in Web Search Workshop 2005.
A. Hotho: Text Clustering with Background Knowledge 204
Using Ontologies
WordNet and IR
Query expansion with WordNet does not really improve the performance
Ellen M. Voorhees, Query expansion using lexical-semantic relations, Proceedings of the 17th annual internationalACM SIGIR conference on Research and development in information retrieval, p.61-69, July 03-06, 1994, Dublin, Ireland
Text Clustering and Ontologies
WordNet synset chains
Green: WordNet chains (Stephen J. Green. Building hypertext links by computing semantic similarity. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(5):713–730, 1999.)
Dave et al.: worse results using an ontology (no word sense disambiguation)
(Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery: opinion extraction and semantic classification of product reviews. In Proceedings of the Twelfth International World Wide Web Conference, WWW2003. ACM, 2003.)
Part-of-speech attributes and named entities used as features
(Vasileios Hatzivassiloglou, Luis Gravano, and Ankineedu Maganti. An investigation of linguistic features and clustering algorithms for topical document clustering. In SIGIR 2000: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 24-28, 2000, Athens, Greece. ACM, 2000.)
Dr. Andreas Hotho
Literature
Tag: SumSchool06
http://www.bibsonomy.org/tag/SumSchool06
A. Hotho: Text Clustering with Background Knowledge 206
Selected Literature
Semantic Web & Ontology
Y. Sure and R. Studer. Vision for Semantically-Enabled Knowledge Technologies.
Online at: KTweb -- Connecting Knowledge Technologies Communities, 2003.
Y. Sure and R. Studer: A Methodology for Ontology-based Knowledge Management. In: On-To-Knowledge: Semantic Web enabled Knowledge Management. J. Davies, D. Fensel, F. van Harmelen (eds.), ISBN: 0-470-84867-7, Wiley, 2002, pages 33-46.
Y. Sure, S. Staab and R. Studer. Methodology for Development and Employment of Ontology Based Knowledge Management Applications. In: SIGMOD Record, Vol. 31, No. 4, pp. 18-23, December 2002.
S. Staab, H.-P. Schnurr, R. Studer, and Y. Sure: Knowledge Processes and Ontologies. In: IEEE Intelligent Systems 16(1), January/February 2001, Special Issue on Knowledge Management.
A. Hotho: Text Clustering with Background Knowledge 207
Selected Literature
Foundations of the Semantic Web
Semantic Web Activity at W3C: http://www.w3.org/2001/sw/
www.semanticweb.org (currently relaunched)
Journal of Web Semantics
D. Fensel et al.: Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press 2003
G. Antoniou, F. van Harmelen. A Semantic Web Primer, MIT Press 2004.
S. Staab, R. Studer (eds.). Handbook on Ontologies. Springer Verlag, 2004.
S. Handschuh, S. Staab (eds.). Annotation for the Semantic Web. IOS Press, 2003.
International Semantic Web Conference series, yearly since 2002, LNCS
World Wide Web Conference series, ACM Press, first Semantic Web papers since 1999
York Sure, Pascal Hitzler, Andreas Eberhart, Rudi Studer, The Semantic Web in One Day, IEEE Intelligent Systems, http://www.aifb.uni-karlsruhe.de/WBS/phi/pub/sw_inoneday.pdf
Some slides have been stolen from various places, from Jim Hendler and Frank van Harmelen, in particular.
A. Hotho: Text Clustering with Background Knowledge 208
Selected Literature
Ontology Learning References
Reinberger, M.-L., & Spyns, P. (2005). Unsupervised text mining for the learning of dogma-inspired ontologies. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.
Philipp Cimiano, Andreas Hotho, Steffen Staab: Comparing Conceptual, Divisive and Agglomerative Clustering for Learning Taxonomies from Text. ECAI 2004: 435-439
P. Cimiano, A. Pivk, L. Schmidt-Thieme and S. Staab, Learning Taxonomic Relations from Heterogenous Evidence. In Buitelaar, P., Cimiano, P., & Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications.
Sabou M., Wroe C., Goble C. and Mishne G., Learning Domain Ontologies for Web Service Descriptions: an Experiment in Bioinformatics. In Proceedings of the 14th International World Wide Web Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.
Alexander Maedche, Ontology Learning for the Semantic Web, PhD Thesis, Kluwer, 2001.
Alexander Maedche, Steffen Staab: Ontology Learning for the Semantic Web. IEEE Intelligent Systems 16(2): 72-79 (2001)
Alexander Maedche, Steffen Staab: Ontology Learning. Handbook on Ontologies 2004: 173-190
M. Ciaramita, A. Gangemi, E. Ratsch, J. Saric, I. Rojas. Unsupervised Learning of semantic relations between concepts of a molecular biology ontology. IJCAI, 659ff.
A. Schutz, P. Buitelaar. RelExt: A Tool for Relation Extraction from Text in Ontology Extension. ISWC 2005.
Faure, D., & Nédellec, C. (1998). A corpus-based conceptual clustering method for verb frames and ontology. In Velardi, P. (Ed.), Proceedings of the LREC Workshop on Adapting lexical and corpus resources to sublanguages and applications, pp. 5–12.
Michele Missikoff, Paola Velardi, Paolo Fabriani: Text Mining Techniques to Automatically Enrich a Domain Ontology. Applied Intelligence 18(3): 323-340 (2003).
Gilles Bisson, Claire Nedellec, Dolores Cañamero: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. ECAI Workshop on Ontology Learning 2000
A. Hotho: Text Clustering with Background Knowledge 209
Selected Literature
Semantic Web & Ontology
Y. Sure, S. Staab, J. Angele. OntoEdit: Guiding Ontology Development by Methodology and Inferencing. In: R. Meersman, Z. Tari et al. (eds.). Proceedings of the Confederated International Conferences CoopIS, DOA and ODBASE 2002, October 28th - November 1st, 2002, University of California, Irvine, USA, Springer, LNCS 2519, pages 1205-1222.
Y. Sure, M. Erdmann, J. Angele, S. Staab, R. Studer and D. Wenke. OntoEdit: Collaborative Ontology Engineering for the Semantic Web. In: Proceedings of the first International Semantic Web Conference 2002 (ISWC 2002), June 9-12 2002, Sardinia, Italia, Springer, LNCS 2342, pages 221-235.
E. Bozsak, M. Ehrig, S. Handschuh, A. Hotho, A. Mädche, B. Motik, D. Oberle, C. Schmitz, S. Staab, L. Stojanovic, N. Stojanovic, R. Studer, G. Stumme, Y. Sure, J. Tane, R. Volz, V. Zacharias. KAON - Towards a large scale Semantic Web. In: Proceedings of EC-Web 2002 (in combination with DEXA2002). Aix-en-Provence, France, September 2-6, 2002. LNCS, Springer, 2002, pages 304-313.
A. Hotho: Text Clustering with Background Knowledge 210
Selected Literature
Text Clustering with Background Knowledge
A. Hotho, S. Staab, and G. Stumme. Explaining text clustering results using semantic structures. In Proc. of the 7th PKDD, 2003.
B. Lauser and A. Hotho. Automatic multi-label subject indexing in a multilingual environment. In Proc. of the 7th European Conference in Research and Advanced Technology for Digital Libraries, ECDL 2003, 2003.
A. Hotho, S. Staab, and G. Stumme. Text clustering based on background knowledge. Technical Report 425, University of Karlsruhe, Institute AIFB, 2003.
Hotho, A., Mädche, A., Staab, S.: Ontology-based Text Clustering. Workshop "Text Learning: Beyond Supervision", IJCAI 2001.
A. Hotho, A. Maedche, S. Staab, V. Zacharias : On Knowledgeable Supervised Text Mining . to appear in: "Text Mining" Workshop Proceedings, Springer, 2002.
A. Hotho: Text Clustering with Background Knowledge 211
Selected Literature
Using Ontologies
Stephan Bloehdorn, Andreas Hotho: Text Classification by Boosting Weak Learners based on Terms and Concepts. ICDM 2004: 331-334
Andreas Hotho, Steffen Staab, Gerd Stumme: Ontologies Improve Text Document Clustering. ICDM 2003: 541-544
Andreas Hotho, Steffen Staab, Gerd Stumme: Explaining Text Clustering Results Using Semantic Structures. PKDD 2003: 217-228
Stephan Bloehdorn, Philipp Cimiano, and Andreas Hotho: Learning Ontologies to Improve Text Clustering and Classification, Proc. of GfKl, 2005.
Semantic Web Mining
B. Berendt, A. Hotho, and G. Stumme. Towards semantic web mining. In I. Horrocks and J. A. Hendler, editors, The Semantic Web - ISWC 2002, First International Semantic Web Conference, Sardinia, Italy, June 9-12, 2002, Proceedings, volume 2342 of Lecture Notes in Computer Science, pages 264–278. Springer, 2002.