Bayesian Networks in Document Clustering
![Page 1: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/1.jpg)
Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using
Bayesian networks and artificial immune systems"
![Page 2: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/2.jpg)
A search engine with SOM-based document set representation
![Page 3: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/3.jpg)
Map visualizations in 3D (BEATCA)
![Page 4: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/4.jpg)
Processing Flow Diagram - BEATCA
[Diagram components: Internet, DB registry, Spider (downloading), HT-Base, Indexing + Optimizing, VEC-Base, Mapping/Clustering of docs, MAP-Base, DocGR-Base, Clustering of cells, CellGR-Base, Search Engine]
The preparation of documents is done by an indexer, which turns a document into a vector-space model representation.
The indexer also identifies frequent phrases in the document set for clustering and labelling purposes.
Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded.
The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation.
The best map (w.r.t. some similarity measure) is used by the query processor in response to the user's query.
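The dictionary-optimization step above can be sketched as a simple document-frequency filter; the thresholds and example data below are illustrative assumptions, not BEATCA's actual settings, and the entropy criterion works analogously:

```python
from collections import Counter

def prune_dictionary(doc_term_counts, low_df=2, high_df_ratio=0.7):
    """Drop terms that are too rare or too frequent in the collection.

    doc_term_counts: one Counter({term: count}) per document.
    The thresholds are illustrative assumptions; BEATCA's dictionary
    optimization also uses an entropy criterion not shown here.
    """
    n_docs = len(doc_term_counts)
    df = Counter()                      # document frequency per term
    for doc in doc_term_counts:
        df.update(doc.keys())
    return {t for t, f in df.items()
            if f >= low_df and f / n_docs <= high_df_ratio}

docs = [Counter({"dog": 2, "food": 1, "the": 3}),
        Counter({"walk": 1, "food": 1, "the": 2}),
        Counter({"owner": 1, "the": 1})]
print(prune_dictionary(docs))  # → {'food'}
```

Here "the" is dropped for appearing in every document, while "dog", "walk", and "owner" are dropped as too rare.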
![Page 5: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/5.jpg)
Document model in search engines
In the so-called vector model a document is considered as a vector in space spanned by the words it contains.
[Figure: documents plotted in a space with axes "dog", "food", "walk"; example documents: "My dog likes this food" and "When walking, I take some food"]
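A minimal sketch of this vector-space representation, using the two example sentences and an assumed three-term vocabulary (a crude prefix match stands in for stemming):

```python
# Vector-space model sketch: each document becomes a vector of term
# counts over a fixed vocabulary (the three axes in the slide's
# example).  The prefix match is a crude stand-in for stemming, so
# that "walking" counts toward "walk".
vocabulary = ["dog", "food", "walk"]

def to_vector(text):
    tokens = text.lower().split()
    return [sum(1 for tok in tokens if tok.startswith(term))
            for term in vocabulary]

d1 = to_vector("My dog likes this food")
d2 = to_vector("When walking, I take some food")
print(d1)  # [1, 1, 0]
print(d2)  # [0, 1, 1]
```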
![Page 6: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/6.jpg)
Clustering document vectors
Document space → 2D map
[Figure caption, translated from Polish: a strong change of position is shown by the thick arrow.]
Important difference from general clustering: not only must each cluster contain similar documents, but neighboring clusters must also be similar.
![Page 7: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/7.jpg)
Our problem
Instability
Pre-defined major themes needed
Our approach: find a coarse clustering into a few themes
![Page 8: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/8.jpg)
Bayesian Networks in Document Clustering
SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.
Latent Semantic Indexing based methods appear to be promising for this purpose.
One of them, PLSA, has been empirically investigated.
A modification is proposed to the original algorithm, and an extension via TAN-like Bayesian networks is suggested.
![Page 9: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/9.jpg)
A Bayesian Network
[Example network nodes: chappi, dog, owner, food, meat, walk]
Represents a joint probability distribution as a product of conditional probabilities of children given their parents in a directed acyclic graph
High compression
Simplification of reasoning
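A minimal sketch of this factorization, with a toy structure (dog → food, dog → walk) assumed purely for illustration; the slide's network is larger:

```python
# A Bayesian network factorizes the joint distribution into a product
# of each node's probability conditioned on its parents.  All
# probability values below are made up for illustration.
P_dog = {True: 0.3, False: 0.7}
P_food_given_dog = {True: {True: 0.9, False: 0.1},
                    False: {True: 0.2, False: 0.8}}
P_walk_given_dog = {True: {True: 0.8, False: 0.2},
                    False: {True: 0.1, False: 0.9}}

def joint(dog, food, walk):
    # P(dog, food, walk) = P(dog) * P(food | dog) * P(walk | dog)
    return (P_dog[dog]
            * P_food_given_dog[dog][food]
            * P_walk_given_dog[dog][walk])

# The eight joint probabilities must sum to 1.
total = sum(joint(d, f, w)
            for d in (True, False)
            for f in (True, False)
            for w in (True, False))
print(round(total, 10))  # 1.0
```

The compression the slide mentions is visible even here: the network stores 5 independent parameters instead of the 7 needed for the full joint table.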
![Page 10: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/10.jpg)
BN application in text processing
Document classification
Document clustering
Query expansion
![Page 11: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/11.jpg)
Hidden variable approaches
PLSA (Probabilistic Latent Semantic Analysis)
PHITS (Probabilistic Hyperlink Analysis)
Combined PLSA/PHITS
Assumption of a hidden variable expressing the topic of the document. The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).
![Page 12: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/12.jpg)
PLSA - concept
Let N be the term-document matrix of word counts, i.e., Nij denotes how often a term (single word or phrase) ti occurs in document dj.
Probabilistic decomposition into factors zk (1 ≤ k ≤ K):
P(ti | dj) = Σk P(ti|zk) P(zk|dj),
with non-negative probabilities and two sets of normalization constraints:
Σi P(ti|zk) = 1 for all k, and Σk P(zk|dj) = 1 for all j.
[Figure: Bayesian network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
![Page 13: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/13.jpg)
PLSA - concept
PLSA aims at maximizing L := Σi,j Nij log Σk P(ti|zk) P(zk|dj).
Factors zk can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence).
The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.
different factors usually capture distinct "topics" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge
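The decomposition and its EM updates can be sketched as follows; this is a compact illustration of the standard PLSA update equations, not the implementation investigated here, and the toy matrix at the end is an assumed example with two clearly separated topics:

```python
import random

def plsa_em(N, K, iters=50, seed=0):
    """EM for PLSA on a term-document count matrix N (terms x docs).

    Returns P(t|z) and P(z|d).  A compact sketch of the standard
    update equations, not the implementation studied in the talk.
    """
    rng = random.Random(seed)
    n_t, n_d = len(N), len(N[0])
    # random normalized initialization of P(t|z) and P(z|d)
    p_t_z = [[rng.random() for _ in range(n_t)] for _ in range(K)]
    p_z_d = [[rng.random() for _ in range(K)] for _ in range(n_d)]
    for row in p_t_z + p_z_d:
        s = sum(row)
        row[:] = [v / s for v in row]

    for _ in range(iters):
        # E-step: P(z | t, d) ∝ P(t|z) P(z|d)
        post = [[None] * n_d for _ in range(n_t)]
        for i in range(n_t):
            for j in range(n_d):
                probs = [p_t_z[k][i] * p_z_d[j][k] for k in range(K)]
                s = sum(probs) or 1.0
                post[i][j] = [p / s for p in probs]
        # M-step: re-estimate P(t|z) and P(z|d) from expected counts
        for k in range(K):
            w = [sum(N[i][j] * post[i][j][k] for j in range(n_d))
                 for i in range(n_t)]
            s = sum(w) or 1.0
            p_t_z[k] = [v / s for v in w]
        for j in range(n_d):
            w = [sum(N[i][j] * post[i][j][k] for i in range(n_t))
                 for k in range(K)]
            s = sum(w) or 1.0
            p_z_d[j] = [v / s for v in w]
    return p_t_z, p_z_d

# Toy matrix: terms 0-1 occur in docs 0-1, terms 2-3 in docs 2-3.
N = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 5, 4],
     [0, 0, 4, 5]]
_, p_z_d = plsa_em(N, K=2)
# cluster each document by its dominant factor, as the slide suggests
clusters = [max(range(2), key=lambda k: p_z_d[j][k]) for j in range(4)]
print(clusters)
```

On this block-diagonal toy matrix, documents 0-1 end up sharing one dominant factor and documents 2-3 the other.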
![Page 14: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/14.jpg)
EM algorithm – step 0

Data (Z unknown):

| D | Z | T1 | T2 | ... | Tn |
|---|---|----|----|-----|----|
| 1 | ? | 1 | 0 | ... | 1 |
| 2 | ? | 0 | 0 | ... | 1 |
| 3 | ? | 1 | 1 | ... | 1 |
| 4 | ? | 0 | 1 | ... | 1 |
| 5 | ? | 1 | 0 | ... | 0 |

Z randomly initialized:

| D | Z | T1 | T2 | ... | Tn |
|---|---|----|----|-----|----|
| 1 | 1 | 1 | 0 | ... | 1 |
| 2 | 2 | 0 | 0 | ... | 1 |
| 3 | 1 | 1 | 1 | ... | 1 |
| 4 | 1 | 0 | 1 | ... | 1 |
| 5 | 2 | 1 | 0 | ... | 0 |
![Page 15: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/15.jpg)
EM algorithm – step 1

Data (current Z assignment):

| D | Z | T1 | T2 | ... | Tn |
|---|---|----|----|-----|----|
| 1 | 1 | 1 | 0 | ... | 1 |
| 2 | 2 | 0 | 0 | ... | 1 |
| 3 | 1 | 1 | 1 | ... | 1 |
| 4 | 1 | 0 | 1 | ... | 1 |
| 5 | 2 | 1 | 0 | ... | 0 |

BN trained on the data.
[Figure: Bayesian network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
![Page 16: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/16.jpg)
EM algorithm – step 2

Data (Z resampled):

| D | Z | T1 | T2 | ... | Tn |
|---|---|----|----|-----|----|
| 1 | 2 | 1 | 0 | ... | 1 |
| 2 | 2 | 0 | 0 | ... | 1 |
| 3 | 1 | 1 | 1 | ... | 1 |
| 4 | 2 | 0 | 1 | ... | 1 |
| 5 | 1 | 1 | 0 | ... | 0 |

Z sampled from the BN
[Figure: Bayesian network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn]
GOTO step 1 until convergence (the Z assignment is "stable").
Z is sampled for each record according to the probability distribution P(Z=1|D=d,T1=t1,...,Tn=tn), P(Z=2|D=d,T1=t1,...,Tn=tn), ...
![Page 17: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/17.jpg)
The problem
Too high a number of adjustable variables
Pre-defined clusters not identified
Long computation times
Instability
![Page 18: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/18.jpg)
Solution
Our suggestion:
Use the "Naive Bayes" "sharp version": each document is assigned to the most probable class.
We were successful:
Up to five classes well clustered
High speed (with 20,000 documents)
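A rough sketch of such a "sharp" (hard-assignment) naive Bayes clustering loop, assuming binary term vectors; the data, smoothing, and initialization choices are illustrative, not the actual BEATCA code:

```python
import math
import random

def sharp_nb_cluster(docs, K, iters=20, seed=1):
    """'Sharp' naive Bayes clustering: each document is assigned
    outright to its most probable class (no soft posteriors), and
    class parameters are re-estimated from the hard assignments."""
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in docs]            # random start
    n_terms = len(docs[0])
    for _ in range(iters):
        # M-step: Laplace-smoothed per-class term probabilities
        counts = [[1.0] * n_terms for _ in range(K)]
        sizes = [2.0] * K
        for doc, k in zip(docs, z):
            sizes[k] += 1
            for i, t in enumerate(doc):
                counts[k][i] += t
        theta = [[c / sizes[k] for c in counts[k]] for k in range(K)]
        prior = [sizes[k] / sum(sizes) for k in range(K)]

        def log_score(doc, k):
            s = math.log(prior[k])
            for t, p in zip(doc, theta[k]):
                s += math.log(p if t else 1.0 - p)
            return s

        # E-step, "sharp": argmax class per document
        z = [max(range(K), key=lambda k: log_score(doc, k))
             for doc in docs]
    return z

docs = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
print(sharp_nb_cluster(docs, K=2))
```

The only difference from the sampling scheme on the previous slides is the final line of the loop: argmax instead of a draw from the posterior.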
![Page 19: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/19.jpg)
Next step
Naive Bayes assumes document and term independence.
What if they are in fact dependent?
Our solution: the TAN approach. First we create a BN of terms/documents, then assume there is a hidden variable.
Promising results; a deeper study is needed.
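One common way to build such a tree of terms is a Chow-Liu maximum-spanning tree over pairwise mutual information; the sketch below assumes binary term occurrences and only illustrates the idea (in a TAN model the hidden topic variable would then be added as an extra parent of every term):

```python
import math
from itertools import combinations

def chow_liu_tree(docs):
    """Maximum-spanning tree over terms, weighted by pairwise mutual
    information (Chow-Liu).  docs: list of binary term vectors."""
    n = len(docs[0])
    N = len(docs)

    def mi(a, b):
        # mutual information between binary term occurrences a and b
        total = 0.0
        for va in (0, 1):
            for vb in (0, 1):
                p_ab = sum(1 for d in docs
                           if d[a] == va and d[b] == vb) / N
                p_a = sum(1 for d in docs if d[a] == va) / N
                p_b = sum(1 for d in docs if d[b] == vb) / N
                if p_ab > 0:
                    total += p_ab * math.log(p_ab / (p_a * p_b))
        return total

    # Kruskal-style maximum spanning tree over all term pairs
    edges = sorted(((mi(a, b), a, b)
                    for a, b in combinations(range(n), 2)),
                   reverse=True)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            tree.append((a, b))
    return tree

docs = [[1, 1, 0], [1, 1, 0], [0, 0, 1],
        [1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(chow_liu_tree(docs))  # 2 edges connecting the 3 terms
```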
![Page 20: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/20.jpg)
PLSA – a model with term TAN
[Figure: hidden variable Z with document nodes D1, D2, ..., Dk and a tree structure over term nodes T1-T6]
![Page 21: Bayesian Networks in Document Clustering](https://reader036.fdocuments.us/reader036/viewer/2022062423/56814664550346895db3843d/html5/thumbnails/21.jpg)
PLSA – a model with document TAN
[Figure: hidden variable Z with term nodes T1, T2, ..., Ti and a tree structure over document nodes D1-D6]