    Community Finding in a Lexical Network of News Feeds

Jessica Hullman ([email protected]), Bryan Gibson ([email protected])

Abstract: We create a weighted lexical network derived from the cosine similarities of financial news feeds to compare two clustering methods: Newman's modularity method and hierarchical clustering. We find that hierarchical clustering, which groups documents according to shared unique terms, produces results closer to expectation.

    INTRODUCTION

Network analysis is increasingly used to model large datasets, as many systems take the form of networks: sets of nodes joined together in pairs by edges. Edges can be weighted to better capture granularity in the relationships between nodes. This method can be applied to lexical networks, where each node represents a collection of words (a title or an article) and each edge carries the similarity between these collections.

News is a dataset in which articles naturally fall into categories and topics. Usually these categories are mutually exclusive, as in newspaper sectioning, though a particular article may apply to several topics. Network analysis is well suited to capture these sub-primary connections using weighted links between articles. By creating a lexical network using basic information retrieval techniques, clustering algorithms can then be applied to identify topical categories. We hypothesize that such a network will naturally exhibit clustering of articles into communities, some with stronger ties than others.

To measure the strength of links between feeds and the communities created, we employ two different network analysis techniques: hierarchical clustering and the modularity algorithm proposed by Girvan and Newman [1] and improved by Clauset, Newman, and Moore [2]. Our hypothesis is that the modularity algorithm is better suited than traditional hierarchical clustering for capturing the multiple topic categories that news feeds fall under.

    DATA

The corpus consisted of 1388 Reuters news feeds from the month of August 2007. To narrow the range of topics addressed in the feeds, we selected feeds that pertained to the currency market. The primary feed source was Dow Jones Newswires, with additional feeds from the Wall Street Journal, Barron's, and SMARTMONEY. The feeds ranged from ~50 to ~800 words.

Because each feed title was itself fairly long (~8-9 words), we constructed and analyzed two lexical networks: one of titles alone and one of full-article text.

    DJ DATA SNAP: Philadelphia Fed: Mfg Activity Stagnates In August

    DJ US Fed Discount Window Borrowings Barely Budge On Week

    Sample titles from the network.

METHODS

Lexical Similarity

Lexical similarity, given by cosine similarity, is the most common measure of document similarity in basic information retrieval. To compute cosine similarities, each document is treated as a bag of words: an unordered collection of all the unique terms it contains. Unique terms are found by first stemming the documents; each term is then weighted using a combination of term frequency and inverse document frequency.

The frequency of each term (tf) is the number of times it appears in a document. The inverse document frequency (idf) is a measure of the general importance of the term, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d_j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents in the corpus and $|\{d_j : t_i \in d_j\}|$ is the number of documents in which the term $t_i$ appears.

Using the tf and idf of each term, we are able to calculate the cosine similarity between two documents. The cosine similarity between documents $i$ and $j$, denoted $\mathrm{sim}(i,j)$, is given by:

$$\mathrm{sim}(i,j) = \frac{\sum_{w \in i,j} \mathrm{tf}_{w,i}\, \mathrm{tf}_{w,j}\, (\mathrm{idf}_w)^2}{\sqrt{\sum_{x \in i} (\mathrm{tf}_{x,i}\, \mathrm{idf}_x)^2} \;\sqrt{\sum_{y \in j} (\mathrm{tf}_{y,j}\, \mathrm{idf}_y)^2}}$$

The dataset is then represented as an adjacency matrix in which the $(i,j)$-th entry is the cosine similarity between nodes $i$ and $j$ (titles or articles). We use this weighted network as the network to be analyzed.
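As an illustration of this construction (a minimal sketch, not the Clairlib pipeline we actually used), the similarity matrix can be built with off-the-shelf tf-idf tooling; the toy titles, the omitted stemming step, and the 0.3 threshold below are placeholders:

```python
# Sketch: tf-idf cosine-similarity network from document texts.
# Assumes scikit-learn and numpy; stemming is omitted for brevity.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Philadelphia Fed Mfg Activity Stagnates In August",
    "Philadelphia Fed Aug Business Index 0.0 Vs Jul 9.2",
    "BOJ Drains Extra Y500 Bln From Money Market",
]

# Term-frequency vectors reweighted by inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(docs)

# Pairwise cosine similarities form the weighted adjacency matrix.
A = cosine_similarity(tfidf)
np.fill_diagonal(A, 0.0)           # no self-loops

# Optionally drop weak edges below a cosine threshold (cf. Tables 1 and 4).
threshold = 0.3                    # illustrative value
A_thresholded = np.where(A >= threshold, A, 0.0)
print(A_thresholded)
```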

    Community-finding within a network

The task of finding good clusters is closely related to network transitivity (the property that two vertices linked to the same third vertex have a higher likelihood of being connected themselves) and has been the focus of considerable research in machine learning and network analysis. Central to all of the goals of cluster analysis is the notion of a degree of similarity (or dissimilarity) between the individual objects being clustered. The problem is NP-complete: an efficient solution would solve all such problems, but none is known at present, inspiring debate among researchers as to the optimal algorithm.

The clustering problem as it applies to the lexical network of news feeds is a case of unsupervised document classification, in that no human-tagged gold standard is initially provided to the algorithm. We investigate two methods that have shown promising results: modularity and hierarchical clustering.

    Modularity

While other methods of partitioning require predefined parameters for the communities to be found, and often partition networks in which no good division exists, modularity is a measure that quantifies statistically surprising arrangements of edges: those that (in the case of edges between subgraphs) occur less often than would be expected by chance. Modularity (a positive or negative number) is the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random. The best division of a network is thus the one with a large, positive modularity value.

Given a network of $n$ vertices, for a division of the network into two groups let $s_i = 1$ if vertex $i$ belongs to group 1 and $s_i = -1$ if it belongs to group 2. The network is represented as an adjacency matrix with elements $A_{ij}$. The expected number of edges between $i$ and $j$ if edges are placed at random is $k_i k_j / 2m$, where $k_i$ and $k_j$ are the degrees of the vertices and $m = \frac{1}{2}\sum_i k_i$ is the total number of edges in the network. Modularity is defined as:

$$Q = \frac{1}{4m} \sum_{ij} \left( A_{ij} - \frac{k_i k_j}{2m} \right) s_i s_j = \frac{1}{4m}\, \mathbf{s}^{T} B\, \mathbf{s}$$

The leading factor of $1/4m$ is conventional, and $B$ is a new real symmetric matrix with elements

$$B_{ij} = A_{ij} - \frac{k_i k_j}{2m},$$

called the modularity matrix [1].
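As a worked illustration of this definition (our own sketch, not part of the study), $Q$ for a two-group division can be computed directly from the adjacency matrix; the small example graph and the chosen split are arbitrary:

```python
# Sketch: modularity Q of a two-group division via Q = (1/4m) s^T B s.
import numpy as np

# Adjacency matrix of a small example graph: two triangles (nodes 0-2
# and 3-5) joined by a single bridge edge between nodes 2 and 3.
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

k = A.sum(axis=1)                   # vertex degrees k_i
m = k.sum() / 2.0                   # total number of edges
B = A - np.outer(k, k) / (2 * m)    # modularity matrix B_ij

# s_i = +1 for group 1, -1 for group 2: split along the bridge.
s = np.array([1, 1, 1, -1, -1, -1], dtype=float)

Q = s @ B @ s / (4 * m)
print(Q)   # positive, since the division respects the two triangles
```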

This was improved in [2], which maintains the entries $Q_{ij}$ only for pairs of nodes $i, j$ that are connected, since joining two disconnected nodes cannot increase $Q$. This results in increased efficiency.

We used this algorithm, as implemented by Ryan Roth in Clairlib [3], to perform preliminary clustering analysis on our sample cosine networks.
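For readers without Clairlib, the same Clauset-Newman-Moore greedy algorithm [2] is also available in NetworkX; the following is a substitute sketch (not the implementation used in this paper), with a stand-in similarity matrix:

```python
# Sketch: Clauset-Newman-Moore greedy modularity clustering with NetworkX.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in for a thresholded cosine-similarity matrix (see earlier sketch).
A = np.array([
    [0.0, 0.8, 0.7, 0.0],
    [0.8, 0.0, 0.6, 0.0],
    [0.7, 0.6, 0.0, 0.4],
    [0.0, 0.0, 0.4, 0.0],
])
G = nx.from_numpy_array(A)   # nonzero entries become weighted edges

# Greedy agglomeration: repeatedly merge the pair of communities that
# yields the largest increase in modularity Q.
communities = greedy_modularity_communities(G, weight="weight")
for i, c in enumerate(communities):
    print(f"cluster {i}: {sorted(c)}")
```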

    Hierarchical Clustering

Hierarchical clustering is the traditional method for detecting community structure in networks. In hierarchical clustering, the data are not partitioned into particular clusters in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings.

We used the R statistical software [4] to perform hierarchical clustering on our dataset in order to compare this more traditional method with modularity. We used the hclust() function with the complete-linkage method, an agglomerative method in which each object is initially assigned to its own cluster; the algorithm then proceeds iteratively, at each stage joining the two most similar clusters, until there is just a single cluster. At each stage, distances between clusters are recomputed according to the Lance-Williams dissimilarity update formula:

$$D(r,s) = \max\{\, d(i,j) : i \in r,\ j \in s \,\}$$

where $D(r,s)$ is the dissimilarity between clusters $r$ and $s$, and $d(i,j)$ is the distance between objects $i$ and $j$, with object $i$ in cluster $r$ and object $j$ in cluster $s$.
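The equivalent computation in Python (a sketch under the assumption that similarities are converted to distances as 1 - similarity; the study itself used R's hclust()) might look as follows:

```python
# Sketch: complete-linkage hierarchical clustering on a similarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Stand-in cosine-similarity matrix; distances taken as 1 - similarity.
S = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
])
D = 1.0 - S
np.fill_diagonal(D, 0.0)             # squareform requires a zero diagonal
condensed = squareform(D)            # condensed distance vector

# Complete linkage: inter-cluster distance is the maximum pairwise
# distance, matching the Lance-Williams update used above.
Z = linkage(condensed, method="complete")

# Cut the dendrogram at a fixed number of clusters (cf. the eleven-cluster
# cut used for the title network).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```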

    RESULTS

    Title Network

Table 1 summarizes the network statistics for varying cosine thresholds of the title network. A threshold between 0.2 and 0.3 appears to capture a phase change: here the largest connected component (LCC) over the number of nodes (n) drops from 91% to 33%.

Figures 1 and 2 visualize this network at thresholds of 0.2 and 0.3, respectively, showing the large difference in connectedness between the two networks.

Figure 1: Title cosine network at 0.2 threshold

Figure 2: Title cosine network at 0.3 threshold


Table 1: Summary of network statistics for varying cosine thresholds in the title network.

Threshold  Nodes  Edges    Diameter  LCC   ASP   Avg Degree
0          1117   623286   1         1117  1     558
0.1        1116   46915    5         1116  2.35  42.04
0.2        1100   14351    10        1006  3.88  13.05
0.3        1020   8068     19        340   3.34  7.91
0.4        973    6419     5         55    1.04  6.6
0.5        937    5930     6         55    0.98  6.33
0.6        916    5187     4         40    0.87  5.66
0.7        895    4954     4         40    0.87  5.54
0.8        876    4177     2         40    0.84  4.77
0.9        842    3802     2         40    0.82  4.52
1          832    3258     1         40    0.78  3.92
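Statistics like those in Table 1 can be recomputed per threshold with NetworkX; the sketch below is our own illustration with a random stand-in similarity matrix, not the tooling used for the paper:

```python
# Sketch: per-threshold network statistics (nodes, edges, LCC size).
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
S = rng.random((50, 50))
S = (S + S.T) / 2.0          # symmetric stand-in similarity matrix
np.fill_diagonal(S, 0.0)

for t in [0.1, 0.2, 0.3, 0.4]:
    A = np.where(S >= t, S, 0.0)
    G = nx.from_numpy_array(A)
    G.remove_nodes_from(list(nx.isolates(G)))   # drop disconnected nodes
    if len(G) == 0:
        continue
    lcc = max(nx.connected_components(G), key=len)
    avg_deg = 2 * G.number_of_edges() / G.number_of_nodes()
    print(f"t={t}: n={G.number_of_nodes()} edges={G.number_of_edges()} "
          f"LCC={len(lcc)} avg_deg={avg_deg:.2f}")
```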



Table 2 shows, for each threshold, the highest modularity value Q and the corresponding number of clusters found by the modularity algorithm.

Table 2: Modularity clustering of the title network

Threshold  Number of Clusters  Q
0.2        22                  0.84
0.3        11                  0.64

No gold-standard division of the title network existed for assessment purposes. To compare the clusters produced by each algorithm, we cut the dendrogram produced by R (Figure 3) at eleven clusters in order to obtain a number of clusters comparable to that produced by the modularity algorithm for the 0.3 cosine threshold network. We then qualitatively assessed the clusters by counting the number of clusters falling into each of the following categories: 3 or more shared words, 2 shared words, 1 shared word, single-word links, and 0 shared words. Shared words are defined as words appearing in every title in the cluster; single-word links are groupings in which each title shares at least one word (though not necessarily the same word) with at least one other title in the cluster; and 0 shared words/links describes clusters in which at least one title shares no words with any other title.

Figure 3: Hierarchical clustering at 0.3 threshold
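These category counts could be reproduced with a small helper along the following lines; this is a hypothetical reconstruction of our qualitative scheme, not code used in the study:

```python
# Sketch: categorizing a cluster of titles by shared words.
def shared_word_count(titles: list[str]) -> int:
    """Number of words appearing in every title of the cluster."""
    word_sets = [set(t.lower().split()) for t in titles]
    return len(set.intersection(*word_sets))

def has_single_word_links(titles: list[str]) -> bool:
    """True if each title shares at least one word with some other title."""
    word_sets = [set(t.lower().split()) for t in titles]
    return all(
        any(i != j and ws & other for j, other in enumerate(word_sets))
        for i, ws in enumerate(word_sets)
    )

cluster = [
    "DJ Philadelphia Fed Aug Business Index 0.0 Vs Jul 9.2",
    "DJ Philadelphia Fed Aug Price Paid 15.4 Vs Jul 28.1",
]
print(shared_word_count(cluster))      # -> 6 (DJ, Philadelphia, Fed, ...)
print(has_single_word_links(cluster))  # -> True
```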

Table 3 shows the segmentation of clusters produced by each algorithm.

Table 3: Segmentation of clusters by number of shared words or single-word links (SWL)

              3+  2  1  SWL  0
Hierarchical  3   1  1  1    1
Modularity    0   0  6  1    0

Closer examination of the clusters showed that the 3+ shared-word clusters produced by R appeared to be topic-based. This sample is clustered around news from Philadelphia in August:

    DJ DATA SNAP: Philadelphia Fed: Mfg Activity Stagnates In August

    DJ Philadelphia Fed Aug Business Index 0.0 Vs Jul 9.2

    DJ Philadelphia Fed Aug Price Paid 15.4 Vs Jul 28.1

The 2-shared-word cluster was also based around a topic (the Bank of Japan):





    DJ BOJ Drains Extra Y500 Bln From Money Market

    DJ BOJ: Foresees Bank Reserve Balances To Be Y3.6 Tln

Other clusters were less valid. The 1-shared-word cluster was based around the "DJ" (Dow Jones) initials found on the majority of the feeds, while the single-word-link cluster contained feeds from various sources whose linking words in most cases failed to contribute much information, such as "week" and "market". The 0-shared-word cluster was as follows:

    DJ HOT TOPIC: Long Way Before Asian Equities Out Of The Woods

    DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July

    DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 2

    DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 3

    DJ Hungary Ctrl Bk MPC Voted 9 To 1 For Steady Rates In July 4

Here, the "2" appended to the third title results from the fact that the feeds came out in real time and were sometimes continued some minutes after the initial feed, titled identically except for a number, denoting the part, appended to the end. The first title shares no words with any other feed in this group, which also contains additional continuations of the second title. There appears to be no basis for the categorization.

Clusters produced by the modularity algorithm were less intuitive overall, contradicting our hypothesis with regard to this network. The majority grouped feeds on diverse topics on the basis of the "DJ" (Dow Jones) initials. In addition, a group of virtually identical titles pertaining to various parts of the same feed (the only difference being the number appended to the end) was placed in a total of five different clusters. This result appears to be a consequence of the algorithm itself rather than of the Clairlib implementation.

    Full-Text Network

The cosine similarity network for the full text of the feeds differs substantially from the title network, as shown in Table 4. It is more densely connected, owing to the larger number of words and therefore the greater similarities between nodes. Here a phase change appears to take place between 0.3 and 0.4, with LCC/n dropping from 85% to 52%.

Table 4: Summary of network statistics for varying cosine thresholds in the content network.

Threshold  Nodes  Edges    Diameter  LCC   ASP   Avg Degree
0          1119   625521   1         1119  1     1118
0.1        732    134809   3         732   1.51  368.33
0.2        723    44524    7         717   2.25  123.16
0.3        689    11941    9         596   3     34.66
0.4        628    3966     9         328   3.44  12.63
0.5        581    2132     7         84    2.67  7.34
0.6        525    1627     3         28    1.12  6.2
0.7        504    1456     3         28    1.06  5.78
0.8        481    1287     2         24    1.01  5.35
0.9        457    1230     3         24    1     5.38
1          415    1093     1         24    1     5.27

Figures 4 and 5 show the full-content networks at thresholds of 0.3 and 0.4, respectively. Compared to the networks created from titles alone, these networks are much more interconnected; again, this is due to the much larger number of words in the articles.

Figure 4: Content cosine network at 0.3 threshold

Figure 5: Content cosine network at 0.4 threshold

The clustering found through modularity is given in Table 5. The number of clusters found by the modularity algorithm is much higher than in the title networks, likely a result of the larger average number of connections per node.


Table 5: Modularity clustering of the content network

Threshold  Number of Clusters  Q
0.3        28                  0.43
0.4        22                  0.65



Figure 6 shows the dendrogram produced by hierarchical clustering on the 0.4 cosine network. The structure appears to show some distinct clustering; however, due to the large number of words in each article and the large number of clusters produced, it becomes difficult to judge how successfully the algorithm is partitioning the data. The size and nature of the dataset make relatively fine-grained groupings ideal. It would be possible to measure success if the articles were manually partitioned into such clusters, giving us a standard against which to judge. This is a task we will undertake in the future.

Figure 6: Hierarchical clustering at 0.4 threshold

    DISCUSSION

These results seem to indicate that hierarchical clustering works as well as, if not better than, the modularity algorithm on the data we studied. Nodes clustered together using hierarchical clustering appear to share more attributes (unique terms) than those clustered using modularity.

Initial tests gave positive results showing that such algorithms can be used to find communities in lexical networks. We have yet to determine whether the clusters identified by the algorithms give an accurate representation of the topic delineation among the titles.

To determine whether any of the clustering algorithms partition the graph correctly, we would need to manually partition the graph according to some predefined method. We could then compare that manual partition to those created by the different algorithms to see which most closely resembles clustering as a human would do it. Such manual partitioning would be a time-intensive task for a corpus of this size.
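One standard way to make that comparison quantitative (our suggestion, not a step taken in this paper) is an agreement score between the two labelings, such as the adjusted Rand index:

```python
# Sketch: comparing an algorithmic partition against a manual gold standard.
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster labels for the same six documents.
manual_labels    = [0, 0, 0, 1, 1, 2]   # human partition
algorithm_labels = [1, 1, 0, 0, 0, 2]   # e.g., from modularity clustering

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
print(adjusted_rand_score(manual_labels, algorithm_labels))
```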

Additional work would continue comparing different clustering algorithms and their effectiveness. This would include algorithms that operate by removing links of high betweenness, as well as algorithms that split the network along eigenvectors [5].

We used Pajek to compute betweenness for both networks, finding a score of 0.05471 for the 0.2 threshold network and 0.01003 for the 0.3 network. This score represents the variation in the betweenness centrality of vertices divided by the maximum possible variation in betweenness centrality scores in a network of the same size (Pajek's Center and Periphery). These scores are what we would expect, given that the larger (0.2) network will in most cases have more variation in the centrality of its nodes. A clustering algorithm based on betweenness might therefore yield similarly different numbers of clusters for the two networks.
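A Freeman-style centralization score of this kind can be sketched as follows; this is our assumption about what Pajek's Center and Periphery score computes, not a verified reimplementation of it:

```python
# Sketch: betweenness centralization of a graph (assumed Freeman-style
# normalization; may differ in detail from Pajek's definition).
import networkx as nx

def betweenness_centralization(G: nx.Graph) -> float:
    """Variation of vertex betweenness relative to the maximum possible
    variation in a graph of the same size (attained by a star graph)."""
    n = G.number_of_nodes()
    if n < 3:
        return 0.0
    c = nx.betweenness_centrality(G, normalized=True)
    c_max = max(c.values())
    # For normalized betweenness, the star graph maximizes the sum of
    # differences at (n - 1).
    return sum(c_max - v for v in c.values()) / (n - 1)

# A star graph is maximally centralized; a cycle has centralization 0.
print(betweenness_centralization(nx.star_graph(9)))    # -> 1.0
print(betweenness_centralization(nx.cycle_graph(10)))  # -> 0.0
```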

    REFERENCES

[1] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002).

[2] A. Clauset, M. E. J. Newman, and C. Moore, "Finding community structure in very large networks," Phys. Rev. E 70, 066111 (2004).

[3] Clairlib, http://www.clairlib.org

[4] The R Project for Statistical Computing, http://www.r-project.org

[5] M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Phys. Rev. E 74, 036104 (2006).
