Web Communities

Prasanna Desikan (06/13/2002)

Definition

Web community: A group of individuals who share common interests, together with the web pages most popular among them.

Web page collections with a shared topic.

Types of Communities

Explicitly defined: Communities that manifest themselves as newsgroups or as resource collections in directories such as Yahoo!

Implicitly defined: Communities that result from the nature of content creation on the web.

Terms and Definitions

Directed Bipartite Graph: A graph whose node set can be partitioned into two sets F and C such that every directed edge in the graph goes from a node u in F to a node v in C.

Complete Bipartite Graph: A bipartite graph that contains all possible edges between a vertex of F and a vertex of C.

Core: A complete bipartite sub-graph with at least i nodes from F and at least j nodes from C. In the web world, the i pages that contain the links are referred to as ‘fans’ and the j pages that are referenced are referred to as ‘centers’.
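To make the core definition concrete, here is a minimal sketch that enumerates cores with exactly i fans and j centers by brute force. The function name `find_cores` and the edge-set representation are illustrative assumptions, and the approach is only practical for tiny graphs.

```python
from itertools import combinations


def find_cores(edges, i, j):
    """Enumerate (i, j) cores: i fans that all link to the same j centers.

    edges is a set of (fan, center) pairs. Brute force, for illustration only.
    """
    out_links = {}
    for fan, center in edges:
        out_links.setdefault(fan, set()).add(center)

    cores = []
    for fan_group in combinations(sorted(out_links), i):
        # A center qualifies only if every fan in the group points to it.
        shared = set.intersection(*(out_links[f] for f in fan_group))
        for center_group in combinations(sorted(shared), j):
            cores.append((fan_group, center_group))
    return cores


example_edges = {
    ("f1", "c1"), ("f1", "c2"),
    ("f2", "c1"), ("f2", "c2"),
    ("f3", "c1"),
}
example_cores = find_cores(example_edges, 2, 2)
```

In this toy graph only f1 and f2 both link to c1 and c2, so they are the fans of the single (2, 2) core.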

Inferring Web Communities From Link Topology

A community is a core of central authoritative pages linked together by hub pages.

Identify communities corresponding to the principal and non-principal eigenvectors discovered by HITS.

For communities on broad topics: the grouping of pages discovered is relatively independent of the exact choice of root set.
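The hub and authority scores that HITS assigns can be sketched with a plain power iteration. This is a minimal illustration, not the authors' implementation: it recovers only the principal eigenvector, and the function name `hits` and the iteration count are assumptions. Finding the non-principal eigenvectors mentioned above would need a full eigendecomposition, which is not shown.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages that link in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = sum(x * x for x in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


example_links = [("h1", "a1"), ("h2", "a1"), ("h3", "a1"), ("h1", "a2")]
example_hub, example_auth = hits(example_links)
```

In the example, a1 receives the most links and ends up as the top authority, while h1, which links to both authorities, ends up as the top hub.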


Findings on Structure of Communities

Robustness: For broad topics, HITS produces stable, robust communities.

Topic Generalization: HITS tends to generalize topics that are not broad. For example, “Michael Jordan” produces links to pages on Michael Jordan and his team, and “Dennis Ritchie” produces links to pages on the C programming language.


Other Generalization: HITS tends to converge on topics with a greater density of linkage. For example, for a query on “linguistics”, the top authorities focus on the sub-topic “computational linguistics” because of its greater density of linkage on the web.

Temporal Issues: To obtain the long-term “core” of a topic, we can superimpose the results of HITS runs on the same topic, spaced several months apart.

Trawling the Web for Emerging Web Communities

Trawling: Systematic enumeration of emerging communities from a web crawl.

Scan through a web crawl and identify all instances of graph structures that are indicative signatures of communities.


Data Source: A copy of the web from Alexa.

Pre-processing the data:

Identify potential fan pages (pages that link to at least six different websites): out of 200 million pages, around 24 million were extracted.

Eliminate mirrors: this removed around 60% of the 24 million pages.
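The fan-page filter described above can be sketched as follows. Treating a "website" as a distinct hostname is an assumption made here for illustration, and the function name `is_potential_fan` is hypothetical.

```python
from urllib.parse import urlsplit


def is_potential_fan(out_links, min_sites=6):
    """Return True if the page links to at least min_sites distinct hostnames.

    Assumption: a "website" is identified by its hostname.
    """
    sites = {urlsplit(url).hostname for url in out_links}
    sites.discard(None)  # relative or malformed links carry no hostname
    return len(sites) >= min_sites


six_sites = [f"http://site{n}.example/page" for n in range(6)]
two_sites = ["http://a.example/1", "http://a.example/2", "http://b.example/1"] * 2
```

Here `six_sites` would pass the filter, while `two_sites` has six links but only two distinct hostnames and would not.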


Prune by in-degree: Eliminate all pages whose in-degree is greater than a threshold value k. In the experiments, k is set to 50.

Iterative pruning: When looking for (i, j) cores, any potential fan with out-degree smaller than j can be pruned and its edges deleted from the graph.
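Both pruning steps can be sketched on an edge set of (fan, center) pairs. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the symmetric rule that drops centers with in-degree smaller than i is an addition not stated above.

```python
def prune_by_in_degree(edges, k=50):
    """Drop every page whose in-degree exceeds k, together with its edges."""
    in_degree = {}
    for _, dst in edges:
        in_degree[dst] = in_degree.get(dst, 0) + 1
    too_popular = {page for page, d in in_degree.items() if d > k}
    return {(u, v) for u, v in edges if u not in too_popular and v not in too_popular}


def iterative_prune(edges, i, j):
    """Repeatedly remove fans with out-degree < j (and, by assumption,
    centers with in-degree < i) until nothing more can be removed."""
    edges = set(edges)
    changed = True
    while changed:
        out_degree, in_degree = {}, {}
        for u, v in edges:
            out_degree[u] = out_degree.get(u, 0) + 1
            in_degree[v] = in_degree.get(v, 0) + 1
        kept = {(u, v) for u, v in edges if out_degree[u] >= j and in_degree[v] >= i}
        changed = kept != edges
        edges = kept
    return edges


toy_edges = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")}
pruned = iterative_prune(toy_edges, 2, 2)
```

Pruning one fan lowers the in-degree of its centers, which can in turn disqualify other nodes; that is why the loop runs until the edge set stops changing.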


Inclusion-exclusion pruning: Let {c1, c2, ..., cj} be the centers adjacent to a fan x, and let N(ct) be the neighborhood of ct, that is, the set of fans that point to ct. Then x is part of a core if and only if the intersection of the sets N(ct) has size at least i.

Filter nepotistic cores (cores in which some of the fans come from the same website).
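The inclusion-exclusion test for a single fan can be sketched as a set intersection. The function name `fan_survives` is hypothetical, and the sketch assumes the fan x has exactly j out-links, so that its adjacent centers are the only candidate center set.

```python
def fan_survives(edges, x, i):
    """Return True if fan x, with its adjacent centers, belongs to a core.

    edges is a set of (fan, center) pairs. x survives when at least i fans
    (x included) point to every center that x points to.
    """
    centers = {c for fan, c in edges if fan == x}
    if not centers:
        return False
    # N(c): the set of fans pointing to center c.
    neighborhoods = [{fan for fan, c2 in edges if c2 == c} for c in centers]
    common_fans = set.intersection(*neighborhoods)
    return len(common_fans) >= i


toy_graph = {("f1", "c1"), ("f1", "c2"), ("f2", "c1"), ("f2", "c2"), ("f3", "c1")}
```

If the test fails, x can be discarded; if it succeeds, x, the common fans, and the centers form a core that can be output directly.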


Evaluation of Communities

Fossilization: 30% of communities were fossilized. A fossil is a community core not all of whose fans exist on the web today.

Reliability: Only 4% of the trawled cores were coincidental, i.e., collections of fan pages without any cogent theme unifying them.


Quality: 56% of the communities were not in Yahoo! as constructed from the crawl, and 29% were not in Yahoo! at the time of the paper. This suggests that trawling identifies emerging communities.

Self Organization and Identification of Web Communities

A web community is defined as a collection of web pages such that each member page has more hyperlinks (in either direction) within the community than outside of the community.

Approach: Maximal Flow – Minimal Cut framework.

Benefits: Focused crawling, automatic population of portal categories.
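The community definition on this slide can be checked directly for a candidate set of pages. The sketch below is illustrative only: the function name `is_community` is hypothetical, and self-links are ignored by assumption.

```python
def is_community(edges, members):
    """Return True if every member has more links (in either direction)
    to pages inside the set than to pages outside it."""
    members = set(members)
    for page in members:
        inside = outside = 0
        for u, v in edges:
            if u == v or page not in (u, v):
                continue  # skip self-links and links not touching this page
            other = v if u == page else u
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Two triangles joined by the single link c -> d.
two_triangles = [
    ("a", "b"), ("b", "c"), ("c", "a"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("f", "d"),
]
```

With `two_triangles`, the set {a, b, c} satisfies the definition, while {a, b} does not, because a and b each have as many links to c as to each other.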

A Simple Community Identification Example

Figure: Maximum flow methods will separate the two subgraphs for any choice of s and t that has s in the left subgraph and t in the right subgraph, removing the three dashed links.

Approximate Flow Community

Exact Flow Community


An artificial source s is added, with infinite-capacity edges routed to all seed vertices in S.

Each pre-existing edge is made bidirectional and rescaled to a constant capacity k.


All vertices except the source, sink, and seed vertices are routed to the artificial sink with unit capacity.

A residual flow graph is produced by a maximum flow procedure.

All vertices accessible from s through non-zero positive edges form the desired result and satisfy our definition of a community.
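The construction described in these steps can be sketched with a small breadth-first augmenting-path (Edmonds-Karp) maximum flow. This is a minimal illustration under stated assumptions, not the authors' code: the maximum flow procedure is a swapped-in textbook method, and the names `flow_community`, `_source`, and `_sink` are hypothetical.

```python
from collections import defaultdict, deque


def flow_community(edges, seeds, k):
    """Return the vertices reachable from the artificial source in the
    residual graph after a maximum flow computation."""
    source, sink = "_source", "_sink"  # artificial vertices (hypothetical names)
    seeds = set(seeds)
    cap = defaultdict(lambda: defaultdict(float))

    vertices = {n for edge in edges for n in edge}
    for u, v in edges:
        cap[u][v] = k  # every pre-existing edge becomes bidirectional
        cap[v][u] = k  # with constant capacity k
    for v in seeds:
        cap[source][v] = float("inf")  # source -> seeds, infinite capacity
    for v in vertices - seeds:
        cap[v][sink] = 1  # non-seed vertices -> sink, unit capacity

    def reachable_from_source():
        """BFS over edges with positive residual capacity; returns parent map."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break  # no augmenting path left: the flow is maximum
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck  # consume forward capacity
            cap[v][u] += bottleneck  # add residual capacity backwards
    return set(parent) - {source}


triangles = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"),
             ("d", "e"), ("e", "f"), ("f", "d")]
community = flow_community(triangles, seeds={"a", "b"}, k=2)
```

In the toy graph, two triangles are joined by one link; seeding with a and b pulls in c, and the cut falls on the single link between the triangles.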

Sample Results From Community Identification

The scores are the total number of inbound and outbound links that a web page has to other pages that are also in the community.

Characterization of Communities

Table 3: The fifteen most significant text features for each community, sorted in descending order of the Kullback-Leibler metric.

Discovering Seeds of New Interest Spread From Premature Pages

A method for discovering topics that stimulate communities of people into earnest communication about the topics’ meaning and that grow into a trend of popular interest.

Here, a community is a group of people sharing some value.

Agora Method on Links

Archive page: The page of highest rank, according to Google, in a community.

Agora pages: Pages that are linked from multiple archive pages but are not in any community themselves. They are taken as novel topics attracting multiple communities and are called agora-topic pages.

Step 1: A query representing the user’s interest domain is entered into a search engine (Google here, obtaining 10^5 to 10^6 pages).

Step 2: Communities are extracted from the pages obtained in Step 1, and an archive page is selected from each community.

Step 3: Pages that are not in the communities but are linked from multiple archive pages are obtained as agora pages. With all results obtained up to this point, the archive pages (black nodes), the agora pages (red nodes), and the links between them are visualized as in Fig. 1.
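Steps 2 and 3 can be sketched once communities and page ranks are available. The inputs here are assumptions for illustration: `communities` is a list of page sets, `rank` maps each page to a score where higher means better ranked, and "multiple" is read as at least two archive pages. The function name `agora_on_links` is hypothetical.

```python
def agora_on_links(communities, rank, links):
    """Return (archive_pages, agora_pages).

    communities: list of sets of pages
    rank: dict mapping page -> score (higher is better; stands in for Google rank)
    links: iterable of (source, target) hyperlinks
    """
    # Step 2: the highest-ranked page of each community is its archive page.
    archive_pages = {max(c, key=lambda p: rank[p]) for c in communities}
    in_some_community = set().union(*communities)

    # Step 3: collect, for each outside page, the archive pages linking to it.
    cited_by = {}
    for src, dst in links:
        if src in archive_pages and dst not in in_some_community:
            cited_by.setdefault(dst, set()).add(src)
    agora_pages = {p for p, sources in cited_by.items() if len(sources) >= 2}
    return archive_pages, agora_pages


demo_communities = [{"a1", "a2"}, {"b1", "b2"}]
demo_rank = {"a1": 0.9, "a2": 0.1, "b1": 0.8, "b2": 0.2}
demo_links = [("a1", "n"), ("b1", "n"), ("a1", "m"), ("a2", "m"), ("a1", "b2")]
```

In the demo data, page n is linked from both archive pages and belongs to no community, so it is the only agora page; m is linked from just one archive page, and b2 is already in a community.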

Fig: The output of Agora on Links, for domain query “Human Genome”

Evaluation

Stage 1: An interest domain is fixed, a group of people relevant to the domain is gathered, and the domain name is input as a query (e.g., “information retrieval”).

Stage 2: The output graph, with fake red nodes added alongside the real ones as if all had been obtained as agora pages, is shown to the subjects. That is, some red nodes that were not really obtained are added, with red links to black archive nodes. Subjects report their individual impressions and exchange ideas in the group.

Sample Results

Institutes shown in red were the ones that hold data sources for human or mouse genomes; it is useful for researchers at other institutes to look at those data.

8 of the 12 red nodes were rated “interesting for thinking of future work” by the subjects.

References

[1] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia.

[2] Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Trawling the web for emerging cyber-communities. In Proc. 8th Int. World Wide Web Conf., 1999.

[3] Gary William Flake, Steve Lawrence, and C. Lee Giles. Efficient Identification of Web Communities. In Proc. Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[4] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans M. Coetzee. Self-Organization and Identification of Web Communities. IEEE Computer, 35(3), 66–71, 2002.

[5] Naohiro Matsumura, Yukio Ohsawa, and Mitsuru Ishizuka. Discovering Seeds of New Interest Spread from Premature Pages Cited by Multiple Communities. In Proc. 2001 International Conference on Web Intelligence.

Kullback-Leibler Metric

Let p and q be probability distributions with support X and Y, respectively. The relative entropy, or Kullback-Leibler distance, between p and q is defined as

$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$$
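As a minimal sketch, the relative entropy can be computed for discrete distributions stored as dictionaries, using the standard definition: the sum over x of p(x) log(p(x) / q(x)). The function name `kl_divergence`, the natural logarithm, and the convention of returning infinity when q assigns zero probability to an outcome that p supports are assumptions made here.

```python
import math


def kl_divergence(p, q):
    """Relative entropy of p with respect to q, in nats.

    p and q map outcomes to probabilities. Terms with p(x) == 0 contribute
    nothing; if q(x) == 0 where p(x) > 0 the result is infinite.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total


p_example = {"a": 0.5, "b": 0.5}
q_example = {"a": 0.25, "b": 0.75}
```

The measure is not symmetric: `kl_divergence(p_example, q_example)` and `kl_divergence(q_example, p_example)` generally differ, which is why it is a "distance" only in a loose sense.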
