News Clustering: Community Finding in a Lexical Network of News Feeds
8/3/2019 News Clustering: Community Finding in a Lexical Network of News Feeds
Community Finding in a Lexical Network of News Feeds
Jessica Hullman ([email protected]), Bryan Gibson ([email protected])
Abstract: We create a weighted lexical network derived from the cosine similarities of financial news feeds to compare two clustering methods, Newman's modularity method and hierarchical clustering. We find that hierarchical clustering, which groups documents according to shared unique terms, produces results that are closer to expectation.
INTRODUCTION
Network analysis is increasingly used to model
large datasets, as many systems take the form of
networks, with sets of nodes joined together in pairs
by edges. Edges can be weighted to better capture
granularity in the relationships between the nodes.
This method can be applied to lexical networks, where
each node represents a collection of words (a title, or
an article), and edges are the similarity between these
collections.
News is a dataset in which articles naturally
fall into categories and topics. Usually these categories
are mutually exclusive, as in newspaper sectioning,
though a particular article may apply to several topics.
Network analysis is well-suited to capture these sub-primary
connections using weighted links between
articles. By creating a lexical network using basic
information retrieval techniques, clustering algorithms
can then be applied in order to identify topical
categories. We hypothesize that such a network will
naturally exhibit clustering of articles into
communities, some with stronger ties than others.
To measure the strength of links between feeds and
the communities created, we will employ
two different network analysis techniques, hierarchical
clustering and the modularity algorithm proposed by
Girvan and Newman [1] and improved by Clauset,
Newman and Moore [2]. Our hypothesis is that the
modularity algorithm is better-suited than traditional
hierarchical clustering for capturing the multiple topic
categories that news feeds fall under.
DATA
The corpus consisted of 1388 Reuters news-feeds
from the month of August, 2007. In order to narrow
the range of topics addressed in the feeds, we selected
feeds that pertained to the currency market. The
primary feed source was Dow Jones Newswires, with
additional feeds from the Wall Street Journal,
Barron's, and SMARTMONEY. The feeds ranged
from ~50 to ~800 words.
Because each feed title was itself fairly long (~8-9 words), we constructed and analyzed two lexical
networks, one of titles alone and one of full-article
text.
DJ DATA SNAP: Philadelphia Fed: Mfg Activity Stagnates In August
DJ US Fed Discount Window Borrowings Barely Budge On Week
Sample titles from the network.
METHODS
Lexical Similarity
Lexical similarity, given by cosine similarity, is the
most common measure of document similarity in basic
information retrieval. To compute cosine similarities,
each document is treated as a bag of words, or an
unordered collection of all the unique terms it
contains. Unique terms are found by first stemming
the documents and then using a combination of term
frequency and inverse document frequency.
The frequency of each term is the number of times
it appears in a document (tf). The inverse document
frequency is a measure of the general importance of
the term (obtained by dividing the number of all
documents by the number of documents containing the
term, and then taking the logarithm of that quotient).
idf_i = log( |D| / |{d_j : t_i ∈ d_j}| )

with |D| as the total number of documents in the
corpus and |{d_j : t_i ∈ d_j}| as the number of documents
in which the term t_i appears.
Using tf and idf for each term, we are able to
calculate the cosine similarity between two documents.
The cosine similarity between documents i
and j, denoted sim(i, j), is given by:

sim(i, j) = [ Σ_w tf(w,i) tf(w,j) idf(w)² ] / [ √(Σ_x (tf(x,i) idf(x))²) · √(Σ_y (tf(y,j) idf(y))²) ]
The dataset is then represented as an adjacency
matrix, in which the (i, j)th entry is the cosine
similarity between the corresponding nodes (titles or
articles). This weighted network is the network we
analyze.
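As an illustration, the pipeline above (term counts, idf weighting, tf-idf cosine similarity, and the resulting adjacency matrix) can be sketched in Python. This is a minimal stdlib-only sketch, not the code used in the study; it tokenizes on whitespace and omits the stemming step described earlier.

```python
import math
from collections import Counter

def tfidf_cosine_matrix(docs):
    """Build the weighted adjacency matrix of pairwise tf-idf cosine similarities."""
    n = len(docs)
    # bag-of-words term counts (tf) per document
    bags = [Counter(d.lower().split()) for d in docs]
    # document frequency, then idf_t = log(N / |{d : t in d}|)
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    idf = {t: math.log(n / df[t]) for t in df}
    # tf-idf weight vectors and their Euclidean norms
    vecs = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = set(vecs[i]) & set(vecs[j])
            dot = sum(vecs[i][t] * vecs[j][t] for t in shared)
            s = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
            sim[i][j] = sim[j][i] = s
    return sim
```

Each product vecs[i][t] * vecs[j][t] equals tf(t,i) tf(t,j) idf(t)², matching the formula above.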
Community-finding within a network
The task of finding good clusters, closely related to
network transitivity (the property that two vertices
linked to the same third vertex have a higher
likelihood of being connected themselves), has been
the focus of considerable research in machine learning
and network analysis. Central to all of the goals of
cluster analysis is the notion of a degree of similarity (or
dissimilarity) between the individual objects being
clustered. The problem is NP-complete: an efficient
solution would yield efficient solutions to all such
problems, and none is known at present, inspiring
debate among researchers as to the optimum algorithm.
The clustering problem as it applies to the lexical
network of news-feeds is a case of unsupervised
document classification, in that initially, no human-
tagged gold standard is provided to the algorithm.
We investigate two methods that have shown
promising results, modularity and hierarchical
clustering.
Modularity
While other methods of partitioning require
predefined parameters of the communities to be found,
and often partition networks in which no good division
exists, modularity is a measure which quantifies
statistically-surprising arrangements of edges, those
that (in the case of edges between subgraphs) are less
than what would be expected by chance. Modularity (a
positive or negative number) is the number of edges
falling within groups minus the expected number in an
equivalent network with edges placed at random. The
best division of a network is thus that with a large and
positive modularity value.
Given a network of n vertices, for a division of the
network into two groups let s_i = 1 if vertex i belongs
to group 1 and s_i = -1 if it belongs to group 2. The
network is represented as an adjacency matrix with
elements A_ij. The expected number of edges between
i and j if edges are placed at random is k_i k_j / 2m,
where k_i and k_j are the degrees of the vertices and
m = (1/2) Σ_i k_i is
the total number of edges in the network. Modularity
is defined as:
Q = (1/4m) Σ_ij [ A_ij - k_i k_j / 2m ] s_i s_j = (1/4m) sᵀBs

The leading factor of 1/4m is conventional, and B is a
new real symmetric matrix with elements

B_ij = A_ij - k_i k_j / 2m
called the modularity matrix [1].
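For a concrete two-group split, Q can be evaluated directly from this definition. The following is an illustrative sketch, not the Clairlib implementation used in the study:

```python
def modularity(A, s):
    """Q = (1/4m) * sum_ij (A_ij - k_i k_j / 2m) * s_i s_j
    for a two-group split s with entries in {+1, -1}."""
    n = len(A)
    k = [sum(row) for row in A]   # (weighted) degree of each vertex
    two_m = sum(k)                # 2m = sum of all degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            q += (A[i][j] - k[i] * k[j] / two_m) * s[i] * s[j]
    return q / (2 * two_m)        # 1/4m = 1/(2 * 2m)
```

On a network made of two triangles joined by a single edge, splitting along the triangles gives a clearly positive Q, while putting every vertex in one group gives Q = 0, as the definition requires.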
This was improved in [2] by maintaining the matrix of
Q_ij only for pairs of nodes i, j that are connected,
since joining two disconnected nodes will not
increase Q; this yields a substantial gain in efficiency.
We used this algorithm as it was implemented by
Ryan Roth in Clairlib [3] to do preliminary clustering
analysis on our sample cosine networks.
Hierarchical Clustering
Hierarchical clustering is the traditional method for detecting community structure in networks. In
hierarchical clustering, the data are not partitioned into
a particular cluster in a single step. Instead, a series of
partitions takes place, which may run from a single
cluster containing all objects to n clusters each
containing a single object. Hierarchical clustering is
subdivided into agglomerative methods, which
proceed by series of fusions of the n objects into
groups, and divisive methods, which separate n
objects successively into finer groupings.
We used the R statistical software [4] to do
hierarchical clustering on our dataset in order to
compare this more-traditional method with
modularity. We used the hclust() functionality using
the complete-linkage method, an agglomerative
method in which each object is initially assigned to its
own cluster and then the algorithm proceeds
iteratively, at each stage joining the two most similar
clusters, continuing until there is just a single cluster.
At each stage distances between clusters are
recomputed by the Lance-Williams dissimilarity
update formula, which for complete linkage gives

D(r, s) = max{ d(i, j) : i ∈ r, j ∈ s }

where D(r, s) is the dissimilarity between clusters r
and s, and d(i, j) is the distance between objects i and
j, with object i in cluster r and object j in cluster s.
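The agglomerative complete-linkage procedure can be sketched directly. This is a naive illustration of the same method, not R's optimized hclust(); for a cosine network, distances can be taken as 1 minus similarity.

```python
def complete_linkage(dist, n_clusters):
    """Agglomerative complete-linkage clustering of a symmetric distance
    matrix: start with one cluster per object and repeatedly merge the
    pair of clusters minimizing D(r, s) = max{d(i, j) : i in r, j in s},
    stopping when n_clusters remain (i.e., cutting the dendrogram)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete-linkage distance between clusters a and b
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))   # merge the closest pair
    return clusters
```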
RESULTS
Title Network
Table 1 summarizes the network statistics for
varying cosine thresholds of the title network. A
threshold between 0.2 and 0.3 appears to capture a
phase change. Here the largest connected component
(LCC) over the number of nodes (n) drops from 91%
to 33%.
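The thresholding behind these statistics can be sketched as follows. The counting conventions here (a node is kept only if it retains at least one edge) are our assumptions, not necessarily those used to produce the table.

```python
def threshold_stats(sim, threshold):
    """Keep edges with cosine similarity >= threshold;
    return (surviving nodes, edges, size of largest connected component)."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    edges = 0
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
                edges += 1
    nodes = [i for i in range(n) if adj[i]]
    # largest connected component via depth-first search
    seen, lcc = set(), 0
    for start in nodes:
        if start in seen:
            continue
        comp, stack = 0, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        lcc = max(lcc, comp)
    return len(nodes), edges, lcc
```

Sweeping the threshold over such a function is how a phase change like the one between 0.2 and 0.3 shows up: the LCC collapses while the edge count falls smoothly.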
Table 1: Summary of network statistics for varying
cosine thresholds in the title network.
Threshold Nodes Edges Diameter LCC ASP Avg Degree
0 1117 623286 1 1117 1 558
0.1 1116 46915 5 1116 2.35 42.04
0.2 1100 14351 10 1006 3.88 13.05
0.3 1020 8068 19 340 3.34 7.91
0.4 973 6419 5 55 1.04 6.6
0.5 937 5930 6 55 0.98 6.33
0.6 916 5187 4 40 0.87 5.66
0.7 895 4954 4 40 0.87 5.54
0.8 876 4177 2 40 0.84 4.77
0.9 842 3802 2 40 0.82 4.52
1 832 3258 1 40 0.78 3.92

Figures 1 and 2 visualize this network at thresholds
of 0.2 and 0.3, respectively, showing the
large difference in connectedness between the two
networks.
Table 2 shows the highest modularity value
with the corresponding number of clusters the
Modularity algorithm found for each threshold.
No gold standard division of the title network
existed for assessment purposes. To compare the
clusters produced by each algorithm, we opted to cut
the dendrogram produced by R (Figure 3) at eleven clusters in order to obtain a similar number of clusters
with which to compare the clusters produced for the
0.3 cosine threshold network by the modularity
algorithm. We then qualitatively assessed the clusters
created by counting the number of clusters falling into
each of the following categories: 3 or more shared
words, 2 shared words, 1 shared word, single-word
links, and 0 words, where shared words are defined as
words appearing in each cluster, single word links are
defined as groupings in which each title shares at least
one word (though not necessarily the same word) with
at least one other title in the cluster, and 0 shared
words/links describes clusters in which at least one
title shares no words with any other titles.
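These category definitions can be made precise with a small sketch. This reflects our reading of the definitions above; the lowercase whitespace tokenization is an assumption.

```python
def classify_cluster(titles):
    """Assign a cluster of titles to one of the assessment categories:
    '3+', '2', '1' - count of words appearing in every title;
    'SWL'          - single-word links: every title shares at least one
                     word (not necessarily the same word) with another;
    '0'            - some title shares no words with any other."""
    bags = [set(t.lower().split()) for t in titles]
    shared = set.intersection(*bags)   # words common to every title
    if len(shared) >= 3:
        return "3+"
    if len(shared) == 2:
        return "2"
    if len(shared) == 1:
        return "1"
    if all(any(b & other for k, other in enumerate(bags) if k != i)
           for i, b in enumerate(bags)):
        return "SWL"
    return "0"
```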
Table 3 shows the segmentation of clusters produced
by each algorithm.
Closer examination of the clusters showed that the
3+ Shared Word clusters produced by R appeared to
be topic based. This sample is clustered around news
from Philadelphia in August.
DJ DATA SNAP: Philadelphia Fed: Mfg Activity Stagnates In August
DJ Philadelphia Fed Aug Business Index 0.0 Vs Jul 9.2
DJ Philadelphia Fed Aug Price Paid 15.4 Vs Jul 28.1
The 2 word cluster was also based around a topic
Table 2: Modularity clustering of title network
Threshold Number of Clusters Q
0.2 22 0.84
0.3 11 0.64
Figure 3: Hierarchical clustering at 0.3 threshold
Figure 2: Title cosine network at 0.3 threshold
Figure 1: Title cosine network at 0.2 threshold
Table 3: Segmentation of clusters by number of
shared words or single word links (SWL)
             3+  2  1  SWL  0
Hierarchical  3  1  1   1   1
Modularity    0  0  6   1   0
(the Bank of Japan):
DJ BOJ Drains Extra Y500 Bln From Money Market
DJ BOJ: Foresees Bank Reserve Balances To Be Y3.6 Tln
Other clusters were less valid; the 1 word cluster was based around the DJ (Dow Jones) initial on the
majority of the feeds, while the single-word link
cluster contained feeds from various sources with
linking words that in most cases failed to contribute
much information, such as week and market. The
0 word cluster was as follows:
DJ HOT TOPIC: Long Way Before Asian Equities Out Of The Woods
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 2
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 3
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 4
Here, the -2- appended after the third title results
from the fact that the feeds came out in real time, and
were sometimes continued some minutes after the
initial feed, titled identically except for a number
denoting the part appended to the end. The first title
shares no words with any other feed in this group,
which also contains additional continuations of the
second title. There appears to be no basis for the
categorization.
Clusters produced by the modularity algorithm
were less intuitive overall, contradicting our
hypothesis with regard to this network. The majority
grouped feeds on diverse topics on the basis of the
DJ (Dow Jones) initials. In addition, a group of
virtually identical titles pertaining to various parts of
the same feed (with the only difference being a
number appended to the end) were placed in a total of
five different clusters. This result appears to be a
consequence of the algorithm itself rather than the
Clairlib implementation.
Full-Text Network
The cosine similarity network for the full-text of
the feeds differs substantially from the title network,
as shown in Table 4. It is more densely connected due
to the increase in the number of words and therefore
the similarities between various nodes. Here a phase
change appears to take place between 0.3 and 0.4, with
the LCC/n dropping from 85% to 52%.
Figures 4 and 5 show the full content networks at thresholds of 0.3 and 0.4, respectively. Compared to
the networks created from titles alone, these networks
are much more interconnected. Again, this is due to
the much larger number of words in the articles.
The clustering found through modularity is given
in Table 5. The number of clusters found by the
modularity algorithm is much higher than that found
in the title networks. This is likely a result of the
larger average number of connections for each node.
Table 5: Modularity clustering of content network
Threshold Number of Clusters Q
0.3 28 0.43
0.4 22 0.65
Table 4: Summary of network statistics for varying
cosine thresholds in the content network.
Threshold Nodes Edges Diameter LCC ASP Avg Degree
0 1119 625521 1 1119 1 1118
0.1 732 134809 3 732 1.51 368.33
0.2 723 44524 7 717 2.25 123.16
0.3 689 11941 9 596 3 34.66
0.4 628 3966 9 328 3.44 12.63
0.5 581 2132 7 84 2.67 7.34
0.6 525 1627 3 28 1.12 6.2
0.7 504 1456 3 28 1.06 5.78
0.8 481 1287 2 24 1.01 5.35
0.9 457 1230 3 24 1 5.38
1 415 1093 1 24 1 5.27
Figure 6 shows the dendrogram produced for
hierarchical clustering on the 0.4 cosine network. The
structure appears to show some distinct clustering.
However, due to the large number of words in each
article, and the large number of clusters produced, it
becomes difficult to judge how successfully the
algorithm is partitioning the data. The size and nature
of the dataset make relatively fine-grained groupings
ideal. It would be possible to measure its success if
the articles were to be manually partitioned into such
clusters, giving us a standard against which to judge.
This is a task which we will undertake in the future.
DISCUSSION
These results seem to indicate that hierarchical
clustering works as well as, if not better than, the
modularity algorithm on the data we studied. Nodes
clustered together using hierarchical clustering appear
to share more attributes (unique terms) than those
clustered using modularity.
Initial tests gave us positive results showing that
such algorithms can be used to find communities in
lexical networks. We have yet to determine whether
the clusters identified by the algorithm give an
accurate representation of the topic delineation among
the titles.
In order to determine if any of the clustering
algorithms are partitioning the graph correctly, we
would need to manually partition the graph according
to some predefined method. We could then compare
that manual partition to those created by the different
algorithms to see which most closely resembles
clustering as it would be done by a human. This
manual partitioning would be a time intensive task for
a corpus of this size.
Additional work would continue comparing
different clustering algorithms and their effectiveness. This would include algorithms operating by removing
links of high betweenness, as well as splitting the
network along eigenvectors [5].
We used Pajek to compute betweenness for both
networks, and found the 0.20 threshold network to
Figure 4: Content cosine network at 0.3 threshold
Figure 5: Content cosine network at 0.4 threshold
Figure 6: Hierarchical clustering at 0.4 threshold
have a score of 0.05471 and the 0.3 network a score of 0.01003. This represents the
variation in the betweenness centrality of vertices
divided by the maximum variation in betweenness
centrality scores in a network of the same size (Pajek
Center and Periphery). We would expect these scores
based on the fact that the larger network (0.2) will in
most cases have more variation in centrality of nodes.
Depending on what sort of algorithm was used, the
difference in betweenness for the two networks might
yield similarly different numbers of clusters if a
clustering algorithm based on betweenness was used.
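The quantities Pajek reports can be sketched from first principles: Brandes' algorithm for betweenness centrality, and a Freeman-style centralization score (our rough analogue of Pajek's Center and Periphery value, normalized so a star graph scores 1).

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm: betweenness centrality of each node in an
    unweighted, undirected graph given as {node: set(neighbors)}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, tracking shortest-path counts and predecessors
        pred = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # back-propagate path dependencies
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w] / 2   # undirected: each path counted twice
    return bc

def centralization(bc):
    """Freeman centralization: total deviation from the maximum score,
    divided by the maximum possible deviation (achieved by a star graph)."""
    vals = list(bc.values())
    top, n = max(vals), len(vals)
    denom = (n - 1) * (n - 1) * (n - 2) / 2
    return sum(top - v for v in vals) / denom
```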
REFERENCES
[1] M. Girvan and M. E. J. Newman, Community
structure in social and biological networks. Proc.
Natl. Acad. Sci. USA 99, 7821-7826 (2002).
[2] A. Clauset, M. E. J. Newman and C. Moore,
Finding community structure in very large
networks. Phys. Rev. E 70, 066111 (2004).
[3] http://www.clairlib.org
[4] http://www.r-project.org
[5] M. E. J. Newman, Finding community structure in
networks using the eigenvectors of matrices. Phys.
Rev. E 74, 036104 (2006).