News Clustering: Community Finding in a Lexical Network of News Feeds
8/3/2019 News Clustering: Community Finding in a Lexical Network of News Feeds
Community Finding in a Lexical Network of News Feeds
Jessica Hullman ([email protected]), Bryan Gibson ([email protected])
Abstract: We create a weighted lexical network derived from the cosine similarities of financial news feeds to compare two clustering methods, Newman's modularity method and hierarchical clustering. We find that hierarchical clustering, which groups documents according to shared unique terms, produces results that are closer to expectation.
INTRODUCTION
Network analysis is increasingly used to model
large datasets, as many systems take the form of
networks, with sets of nodes joined together in pairs
by edges. Edges can be weighted to better capture
granularity in the relationships between the nodes.
This method can be applied to lexical networks, where
each node represents a collection of words (a title, or
an article), and edges are the similarity between these
collections.
News is a dataset in which articles naturally
fall into categories and topics. Usually these categories
are mutually exclusive, as in newspaper sectioning,
though a particular article may apply to several topics.
Network analysis is well-suited to capture these sub-primary
connections using weighted links between
articles. By creating a lexical network using basic
information retrieval techniques, clustering algorithms
can then be applied in order to identify topical
categories. We hypothesize that such a network will
naturally exhibit clustering of articles into
communities, some with stronger ties than others.
To measure the strength of links between feeds and
the communities created, we will employ
two different network analysis techniques, hierarchical
clustering and the modularity algorithm proposed by
Girvan and Newman [1] and improved by Clauset,
Newman and Moore [2]. Our hypothesis is that the
modularity algorithm is better-suited than traditional
hierarchical clustering for capturing the multiple topic
categories that news feeds fall under.
DATA
The corpus consisted of 1388 Reuters news-feeds
from the month of August, 2007. In order to narrow
the range of topics addressed in the feeds, we selected
feeds that pertained to the currency market. The
primary feed source was Dow Jones Newswires, with
additional feeds from the Wall Street Journal,
Barron's, and SMARTMONEY. The feeds ranged
from ~50 to ~800 words.
Because each feed title was itself fairly long (~8-9 words), we constructed and analyzed two lexical
networks, one of titles alone and one of full-article
text.
DJ DATA SNAP: Philadelphia Fed: Mfg Activity Stagnates In August
DJ US Fed Discount Window Borrowings Barely Budge On Week
Sample titles from the network.
METHODS
Lexical Similarity
Lexical similarity, given by cosine similarity, is the
most common measure of document similarity in basic
information retrieval. To compute cosine similarities,
each document is treated as a bag of words, or an
unordered collection of all the unique terms it
contains. Unique terms are found by first stemming
the documents and then using a combination of term
frequency and inverse document frequency.
The frequency of each term is the number of times
it appears in a document (tf). The inverse document
frequency is a measure of the general importance of
the term (obtained by dividing the number of all
documents by the number of documents containing the
term, and then taking the logarithm of that quotient).
idf_i = log( |D| / |{d_j : t_i ∈ d_j}| )

with |D| as the total number of documents in the
corpus and |{d_j : t_i ∈ d_j}| as the number of documents
in which the term t_i appears.
Using tf and idf for each term, we are able to
calculate the cosine similarity between two documents.
The cosine similarity between documents i
and j, denoted sim(i, j), is given by:

sim(i, j) = [ Σ_w tf(w,i) tf(w,j) idf(w)² ] / [ √(Σ_x (tf(x,i) idf(x))²) · √(Σ_y (tf(y,j) idf(y))²) ]
The dataset is then represented as an adjacency
matrix, in which the (i, j)th entry is the cosine
similarity between the corresponding nodes (titles or
articles). This weighted network is the network we
analyze.
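As an illustration, the pipeline above (term counts, idf weighting, tf-idf cosine similarity, and the resulting adjacency matrix) can be sketched in Python. This is a minimal stdlib-only sketch, not the code used in the study; it tokenizes on whitespace and omits the stemming step described earlier.

```python
import math
from collections import Counter

def tfidf_cosine_matrix(docs):
    """Build the weighted adjacency matrix of pairwise tf-idf cosine similarities."""
    n = len(docs)
    # bag-of-words term counts (tf) per document
    bags = [Counter(d.lower().split()) for d in docs]
    # document frequency, then idf_t = log(N / |{d : t in d}|)
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    idf = {t: math.log(n / df[t]) for t in df}
    # tf-idf weight vectors and their Euclidean norms
    vecs = [{t: tf * idf[t] for t, tf in bag.items()} for bag in bags]
    norms = [math.sqrt(sum(w * w for w in v.values())) for v in vecs]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = set(vecs[i]) & set(vecs[j])
            dot = sum(vecs[i][t] * vecs[j][t] for t in shared)
            s = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
            sim[i][j] = sim[j][i] = s
    return sim
```

Each product vecs[i][t] * vecs[j][t] equals tf(t,i) tf(t,j) idf(t)², matching the formula above.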
Community-finding within a network
The task of finding good clusters, closely related to
network transitivity (the property that two vertices
linked to the same third vertex have a higher
likelihood of being connected themselves), has been
the focus of considerable research in machine learning
and network analysis. Central to all of the goals of
cluster analysis is the notion of a degree of similarity (or
dissimilarity) between the individual objects being
clustered. The problem is NP-complete: an efficient
solution would yield efficient solutions to all such
problems, and none is known at present, inspiring
debate among researchers as to the optimum algorithm.
The clustering problem as it applies to the lexical
network of news-feeds is a case of unsupervised
document classification, in that initially, no human-
tagged gold standard is provided to the algorithm.
We investigate two methods that have shown
promising results, modularity and hierarchical
clustering.
Modularity
While other methods of partitioning require
predefined parameters of the communities to be found,
and often partition networks in which no good division
exists, modularity is a measure which quantifies
statistically-surprising arrangements of edges, those
that (in the case of edges between subgraphs) are less
than what would be expected by chance. Modularity (a
positive or negative number) is the number of edges
falling within groups minus the expected number in an
equivalent network with edges placed at random. The
best division of a network is thus that with a large and
positive modularity value.
Given a network of n vertices, for a division of the
network into two groups let s_i = 1 if vertex i belongs
to group 1 and s_i = -1 if it belongs to group 2. The
network is represented as an adjacency matrix with
elements A_ij. The expected number of edges between
i and j if edges are placed at random is k_i k_j / 2m,
where k_i and k_j are the degrees of the vertices and
m = (1/2) Σ_i k_i is
the total number of edges in the network. Modularity
is defined as:
Q = (1/4m) Σ_ij [ A_ij - k_i k_j / 2m ] s_i s_j = (1/4m) sᵀBs

The leading factor of 1/4m is conventional, and B is a
new real symmetric matrix with elements

B_ij = A_ij - k_i k_j / 2m
called the modularity matrix [1].
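For a concrete two-group split, Q can be evaluated directly from this definition. The following is an illustrative sketch, not the Clairlib implementation used in the study:

```python
def modularity(A, s):
    """Q = (1/4m) * sum_ij (A_ij - k_i k_j / 2m) * s_i s_j
    for a two-group split s with entries in {+1, -1}."""
    n = len(A)
    k = [sum(row) for row in A]   # (weighted) degree of each vertex
    two_m = sum(k)                # 2m = sum of all degrees
    q = 0.0
    for i in range(n):
        for j in range(n):
            q += (A[i][j] - k[i] * k[j] / two_m) * s[i] * s[j]
    return q / (2 * two_m)        # 1/4m = 1/(2 * 2m)
```

On a network made of two triangles joined by a single edge, splitting along the triangles gives a clearly positive Q, while putting every vertex in one group gives Q = 0, as the definition requires.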
This was improved in [2] by maintaining the matrix of
Q_ij only for pairs of nodes i, j that are connected,
since joining two disconnected nodes will not
increase Q; this yields a substantial gain in efficiency.
We used this algorithm as it was implemented by
Ryan Roth in Clairlib [3] to do preliminary clustering
analysis on our sample cosine networks.
Hierarchical Clustering
Hierarchical clustering is the traditional method for detecting community structure in networks. In
hierarchical clustering, the data are not partitioned into
a particular cluster in a single step. Instead, a series of
partitions takes place, which may run from a single
cluster containing all objects to n clusters each
containing a single object. Hierarchical clustering is
subdivided into agglomerative methods, which
proceed by series of fusions of the n objects into
groups, and divisive methods, which separate n
objects successively into finer groupings.
We used the R statistical software [4] to do
hierarchical clustering on our dataset in order to
compare this more-traditional method with
modularity. We used the hclust() functionality using
the complete-linkage method, an agglomerative
method in which each object is initially assigned to its
own cluster and then the algorithm proceeds
iteratively, at each stage joining the two most similar
clusters, continuing until there is just a single cluster.
At each stage distances between clusters are
recomputed by the Lance-Williams dissimilarity
update formula, which for complete linkage gives

D(r, s) = max{ d(i, j) : i ∈ r, j ∈ s }

where D(r, s) is the dissimilarity between clusters r
and s, and d(i, j) is the distance between objects i and
j, with object i in cluster r and object j in cluster s.
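The agglomerative complete-linkage procedure can be sketched directly. This is a naive illustration of the same method, not R's optimized hclust(); for a cosine network, distances can be taken as 1 minus similarity.

```python
def complete_linkage(dist, n_clusters):
    """Agglomerative complete-linkage clustering of a symmetric distance
    matrix: start with one cluster per object and repeatedly merge the
    pair of clusters minimizing D(r, s) = max{d(i, j) : i in r, j in s},
    stopping when n_clusters remain (i.e., cutting the dendrogram)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete-linkage distance between clusters a and b
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))   # merge the closest pair
    return clusters
```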
RESULTS
Title Network
Table 1 summarizes the network statistics for
varying cosine thresholds of the title network. A
threshold between 0.2 and 0.3 appears to capture a
phase change. Here the largest connected component
(LCC) over the number of nodes (n) drops from 91%
to 33%.
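The thresholding behind these statistics can be sketched as follows. The counting conventions here (a node is kept only if it retains at least one edge) are our assumptions, not necessarily those used to produce the table.

```python
def threshold_stats(sim, threshold):
    """Keep edges with cosine similarity >= threshold;
    return (surviving nodes, edges, size of largest connected component)."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    edges = 0
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                adj[i].add(j)
                adj[j].add(i)
                edges += 1
    nodes = [i for i in range(n) if adj[i]]
    # largest connected component via depth-first search
    seen, lcc = set(), 0
    for start in nodes:
        if start in seen:
            continue
        comp, stack = 0, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        lcc = max(lcc, comp)
    return len(nodes), edges, lcc
```

Sweeping the threshold over such a function is how a phase change like the one between 0.2 and 0.3 shows up: the LCC collapses while the edge count falls smoothly.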
Table 1: Summary of network statistics for varying
cosine thresholds in the title network.
Threshold Nodes Edges Diameter LCC ASP Avg Degree
0 1117 623286 1 1117 1 558
0.1 1116 46915 5 1116 2.35 42.04
0.2 1100 14351 10 1006 3.88 13.05
0.3 1020 8068 19 340 3.34 7.91
0.4 973 6419 5 55 1.04 6.6
0.5 937 5930 6 55 0.98 6.33
0.6 916 5187 4 40 0.87 5.66
0.7 895 4954 4 40 0.87 5.54
0.8 876 4177 2 40 0.84 4.77
0.9 842 3802 2 40 0.82 4.52
1 832 3258 1 40 0.78 3.92

Figures 1 and 2 visualize this network at thresholds
of 0.2 and 0.3, respectively, showing the
large difference in connectedness between the two
networks.
Table 2 shows the highest modularity value
with the corresponding number of clusters the
Modularity algorithm found for each threshold.
No gold standard division of the title network
existed for assessment purposes. To compare the
clusters produced by each algorithm, we opted to cut
the dendrogram produced by R (Figure 3) at eleven clusters in order to obtain a similar number of clusters
with which to compare the clusters produced for the
0.3 cosine threshold network by the modularity
algorithm. We then qualitatively assessed the clusters
created by counting the number of clusters falling into
each of the following categories: 3 or more shared
words, 2 shared words, 1 shared word, single-word
links, and 0 words, where shared words are defined as
words appearing in each cluster, single word links are
defined as groupings in which each title shares at least
one word (though not necessarily the same word) with
at least one other title in the cluster, and 0 shared
words/links describes clusters in which at least one
title shares no words with any other titles.
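These category definitions can be made precise with a small sketch. This reflects our reading of the definitions above; the lowercase whitespace tokenization is an assumption.

```python
def classify_cluster(titles):
    """Assign a cluster of titles to one of the assessment categories:
    '3+', '2', '1' - count of words appearing in every title;
    'SWL'          - single-word links: every title shares at least one
                     word (not necessarily the same word) with another;
    '0'            - some title shares no words with any other."""
    bags = [set(t.lower().split()) for t in titles]
    shared = set.intersection(*bags)   # words common to every title
    if len(shared) >= 3:
        return "3+"
    if len(shared) == 2:
        return "2"
    if len(shared) == 1:
        return "1"
    if all(any(b & other for k, other in enumerate(bags) if k != i)
           for i, b in enumerate(bags)):
        return "SWL"
    return "0"
```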
Table 3 shows the segmentation of clusters produced
by each algorithm.
Closer examination of the clusters showed that the
3+ Shared Word clusters produced by R appeared to
be topic based. This sample is clustered around news
from Philadelphia in August.
DJ DATA SNAP: Philadelphia Fed: Mfg Activity Stagnates In August
DJ Philadelphia Fed Aug Business Index 0.0 Vs Jul 9.2
DJ Philadelphia Fed Aug Price Paid 15.4 Vs Jul 28.1
The 2 word cluster was also based around a topic
Table 2: Modularity clustering of title network
Threshold Number of Clusters Q
0.2 22 0.84
0.3 11 0.64
Figure 3: Hierarchical clustering at 0.3 threshold
Figure 2: Title cosine network at 0.3 threshold
Figure 1: Title cosine network at 0.2 threshold
Table 3: Segmentation of clusters by number of
shared words or single word links (SWL)
             3+  2  1  SWL  0
Hierarchical  3  1  1   1   1
Modularity    0  0  6   1   0
(the Bank of Japan):
DJ BOJ Drains Extra Y500 Bln From Money Market
DJ BOJ: Foresees Bank Reserve Balances To Be Y3.6 Tln
Other clusters were less valid; the 1 word cluster was based around the DJ (Dow Jones) initial on the
majority of the feeds, while the single-word link
cluster contained feeds from various sources with
linking words that in most cases failed to contribute
much information, such as week and market. The
0 word cluster was as follows:
DJ HOT TOPIC: Long Way Before Asian Equities Out Of The Woods
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 2
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 3
DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 4
Here, the -2- appended after the third title results
from the fact that the feeds came out in real time, and
were sometimes continued some minutes after the
initial feed, titled identically except for a number
denoting the part appended to the end. The first title
shares no words with any other feed in this group,
which also contains additional continuations of the
second title. There appears to be no basis for the
categorization.
Clusters produced by the modularity algorithm
were less intuitive overall, contradicting our
hypothesis with regard to this network. The majority
grouped feeds on diverse topics on the basis of the
DJ (Dow Jones) initials. In addition, a group of
virtually identical titles pertaining to various parts of
the same feed (with the only difference being a
number appended to the end) were placed in a total of
five different clusters. This result appears to be a
consequence of the algorithm itself rather than the
Clairlib implementation.
Full-Text Network
The cosine similarity network for the full-text of
the feeds differs substantially from the title network,
as shown in Table 4. It is more densely connected due
to the increase in the number of words and therefore
the similarities between various nodes. Here a phase
change appears to take place between 0.3 and 0.4, with
the LCC/n dropping from 85% to 52%.
Figures 4 and 5 show the full content networks at thresholds of 0.3 and 0.4, respectively. Compared to
the networks created from titles alone, these networks
are much more interconnected. Again, this is due to
the much larger number of words in the articles.
The clustering found through modularity is given
in Table 5. The number of clusters found by the
modularity algorithm is much higher than that found
in the title networks. This is likely a result of the
larger average number of connections for each node.
Table 5: Modularity clustering of content network
Threshold Number of Clusters Q
0.3 28 0.43
0.4 22 0.65
Table 4: Summary of network statistics for varying
cosine thresholds in the content network.
Threshold Nodes Edges Diameter LCC ASP Avg Degree
0 1119 625521 1 1119 1 1118
0.1 732 134809 3 732 1.51 368.33
0.2 723 44524 7 717 2.25 123.16
0.3 689 11941 9 596 3 34.66
0.4 628 3966 9 328 3.44 12.63
0.5 581 2132 7 84 2.67 7.34
0.6 525 1627 3 28 1.12 6.2
0.7 504 1456 3 28 1.06 5.78
0.8 481 1287 2 24 1.01 5.35
0.9 457 1230 3 24 1 5.38
1 415 1093 1 24 1 5.27
Figure 6 shows the dendrogram produced for
hierarchical clustering on the 0.4 cosine network. The
structure appears to show some distinct clustering.
However, due to the large number of words in each
article, and the large number of clusters produced, it
becomes difficult to judge how successfully the
algorithm is partitioning the data. The size and nature
of the dataset make relatively fine-grained groupings
ideal. It would be possible to measure its success if
the articles were to be manually partitioned into such
clusters, giving us a standard against which to judge.
This is a task which we will undertake in the future.
DISCUSSION
These results seem to indicate that hierarchical
clustering works as well as, if not better than, the
modularity algorithm on the data we studied. Nodes
clustered together using hierarchical clustering appear
to share more attributes (unique terms) than those
clustered using modularity.
Initial tests gave us positive results showing that
such algorithms can be used to find communities in
lexical networks. We have yet to determine whether
the clusters identified by the algorithm give an
accurate representation of the topic delineation among
the titles.
In order to determine if any of the clustering
algorithms are partitioning the graph correctly, we
would need to manually partition the graph according
to some predefined method. We could then compare
that manual partition to those created by the different
algorithms to see which most closely resembles
clustering as it would be done by a human. This
manual partitioning would be a time intensive task for
a corpus of this size.
Additional work would continue comparing
different clustering algorithms and their effectiveness. This would include algorithms operating by removing
links of high betweenness, as well as splitting the
network along eigenvectors [5].
We used Pajek to compute betweenness for both
networks, and found the 0.20 threshold network to
Figure 4: Content cosine network at 0.3 threshold
Figure 5: Content cosine network at 0.4 threshold
Figure 6: Hierarchical clustering at 0.4 threshold
have a score of 0.05471 and the 0.3 network a score of 0.01003. This represents the
variation in the betweenness centrality of vertices
divided by the maximum variation in betweenness
centrality scores in a network of the same size (Pajek
Center and Periphery). We would expect these scores
based on the fact that the larger network (0.2) will in
most cases have more variation in centrality of nodes.
Depending on what sort of algorithm was used, the
difference in betweenness for the two networks might
yield similarly different numbers of clusters if a
clustering algorithm based on betweenness was used.
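The quantities Pajek reports can be sketched from first principles: Brandes' algorithm for betweenness centrality, and a Freeman-style centralization score (our rough analogue of Pajek's Center and Periphery value, normalized so a star graph scores 1).

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm: betweenness centrality of each node in an
    unweighted, undirected graph given as {node: set(neighbors)}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, tracking shortest-path counts and predecessors
        pred = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        # back-propagate path dependencies
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w] / 2   # undirected: each path counted twice
    return bc

def centralization(bc):
    """Freeman centralization: total deviation from the maximum score,
    divided by the maximum possible deviation (achieved by a star graph)."""
    vals = list(bc.values())
    top, n = max(vals), len(vals)
    denom = (n - 1) * (n - 1) * (n - 2) / 2
    return sum(top - v for v in vals) / denom
```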
REFERENCES
[1] M. Girvan and M. E. J. Newman, Community
structure in social and biological networks. Proc.
Natl. Acad. Sci. USA 99, 7821-7826 (2002).
[2] A. Clauset, M. E. J. Newman and C. Moore,
Finding community structure in very large
networks. Phys. Rev. E 70, 066111 (2004).
[3] http://www.clairlib.org
[4] http://www.r-project.org
[5] M. E. J. Newman, Finding community structure in
networks using the eigenvectors of matrices. Phys.
Rev. E 74, 036104 (2006).