Date posted: 21-Dec-2015

1

Discovering Unexpected Information from Your

Competitor’s Web Sites

Bing Liu, Yiming Ma, Philip S. Yu

Héctor A. Villa Martínez

2

Objective of this article

The authors present a system that helps find unexpected information in a web site.

3

Searching for information on the web

Many methods exist:

Keyword based (e.g. Google, Yahoo).
Wrapper based (e.g. extract prices).
Web query languages (e.g. extend SQL).
User preference based (specify categories).

4

Searching for information on the web

Main drawbacks of these methods:

They only find anticipated information.
It is hard to find unexpected information with them.

5

What is unexpected, anyway?

A piece of information is unexpected if:

it is relevant but unknown, or
it contradicts existing beliefs or expectations.

Relevant here means interesting to the user, which is subjective.

6

Summary of the approach

U: user web site

E: knowledge about the competitor

C: competitor web site

Compare C vs. U and E to find unexpected information in C.

7

How to compare two web pages

Use the vector space representation:

Define a set of p keywords (index terms) K = {k1, k2, …, kp}.

Represent a document D using a vector D = {w1, w2, …, wp}, where wi is the weight of keyword ki:

wi > 0 if keyword ki appears in D
wi = 0 otherwise

8

Vector space representation

Example:

K = {night, day, empire, barbarians, people, house}

D = [“Because night is here but the barbarians have not come.
And some people arrived from the borders,
and said that there are no longer any barbarians.
And now what shall become of us without any barbarians?
Those people were some kind of solution.”]

D = {1, 0, 0, 3, 2, 0} or normalized to: D = {1/6, 0, 0, 3/6, 2/6, 0}
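The keyword counts in this example can be reproduced with a short sketch. The function names `keyword_vector` and `normalize` are illustrative and not part of the original system, and the tokenization (lowercasing, letters only) is an assumption; normalization divides by the total count, as the slide does.

```python
import re
from collections import Counter


def keyword_vector(text, keywords):
    """Count how often each keyword occurs in text (case-insensitive, whole words)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[k] for k in keywords]


def normalize(vector):
    """Divide each weight by the sum of all weights (assumed normalization)."""
    total = sum(vector)
    return [w / total for w in vector] if total else list(vector)


K = ["night", "day", "empire", "barbarians", "people", "house"]
D = (
    "Because night is here but the barbarians have not come. "
    "And some people arrived from the borders, "
    "and said that there are no longer any barbarians. "
    "And now what shall become of us without any barbarians? "
    "Those people were some kind of solution."
)

raw = keyword_vector(D, K)   # [1, 0, 0, 3, 2, 0]
weights = normalize(raw)     # [1/6, 0, 0, 3/6, 2/6, 0]
print(raw, weights)
```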

9

Comparing two web pages

Given two web pages in vector space representation, D = {d1, d2, …, dn} and Q = {q1, q2, …, qn}, the cosine of the angle between the two vectors gives a measure of their similarity:

sim (D, Q) = (D ● Q) / (|D| * |Q|)

10

Comparing two web pages

Example:

P = {0.3, 0.0, 0.0, 0.7}

Q = {0.5, 0.0, 0.1, 0.4}

R = {0.0, 0.5, 0.5, 0.0}

sim (P, P) = (P ● P) / (|P| * |P|) = 1.0

sim (P, Q) = (P ● Q) / (|P| * |Q|) = 0.87

sim (P, R) = (P ● R) / (|P| * |R|) = 0
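The three values above can be checked with a minimal sketch of the cosine formula. The function name `cosine_similarity` is illustrative; returning 0 for an all-zero vector is an assumption, since the slides do not cover that case.

```python
import math


def cosine_similarity(d, q):
    """sim(D, Q) = (D . Q) / (|D| * |Q|) for two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an empty page is similar to nothing
    return dot / (norm_d * norm_q)


P = [0.3, 0.0, 0.0, 0.7]
Q = [0.5, 0.0, 0.1, 0.4]
R = [0.0, 0.5, 0.5, 0.0]

sim_pp = cosine_similarity(P, P)
sim_pq = cosine_similarity(P, Q)
sim_pr = cosine_similarity(P, R)
print(round(sim_pp, 2), round(sim_pq, 2), round(sim_pr, 2))  # 1.0 0.87 0.0
```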

11

Methods to find unexpected information in a site

Let U = (u1, …, um) be the pages of the user web site, and C = (c1, …, cn) the pages of the competitor web site:

1. Find the corresponding C page(s) of a U page.

2. Find unexpected terms in a C page.

3. Find unexpected pages in C.

4. Find unexpected concepts in a C page.

5. Find unexpected outgoing links.

12

1. Find the corresponding C page(s) of a U page

Given a page ui in U:

Compare ui with each page in C using the cosine similarity. Sort the pages of C by decreasing similarity.

13

1. Find the corresponding C page(s) of a U page

Example: Select u1.

Find sim(u1, c1), sim(u1, c2), …, sim(u1, cn).

Order the results in decreasing order: say c4, c2, c8, … etc.

Complexity: O(G|C| + |ui||C|)

where G is the maximum number of terms in a page cj
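As a rough sketch of this task, the ranking can be written as a sort over cosine scores. The page names and weight vectors below are made up for illustration, and `rank_competitor_pages` is a hypothetical helper, not the authors' implementation.

```python
import math


def cosine_similarity(d, q):
    """sim(D, Q) = (D . Q) / (|D| * |Q|); 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(d, q))
    norms = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norms if norms else 0.0


def rank_competitor_pages(u_page, c_pages):
    """Return (page name, similarity) pairs, most similar competitor page first."""
    scored = [(name, cosine_similarity(u_page, vec)) for name, vec in c_pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weight vectors over the same four keywords.
u1 = [0.3, 0.0, 0.0, 0.7]
C = {
    "c1": [0.0, 0.5, 0.5, 0.0],
    "c2": [0.5, 0.0, 0.1, 0.4],
    "c3": [0.3, 0.0, 0.0, 0.7],
}

ranking = rank_competitor_pages(u1, C)
print([name for name, _ in ranking])  # ['c3', 'c2', 'c1']
```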

14

2. Find unexpected terms in a C page

Given uj and ci, measure the unexpectedness of each term tr in ci:

unexpTrij = 1 – (frj / fri)   if (frj / fri) ≤ 1
unexpTrij = 0                 otherwise

where frj and fri are the frequencies (weights) of tr in uj and ci.

15

2. Find unexpected terms in a C page

Example:

keywords = {data, predict, classify, state}

uj = {0.4, 0.5, 0.0, 0.1}

ci = {0.3, 0.3, 0.2, 0.2}

unexpT = {0, 0, 1, 0.5}

Complexity: O(Z)

where Z is the number of terms in ci
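The example values can be reproduced with a small sketch of the piecewise formula. The function name `term_unexpectedness` is illustrative, and returning 0 when the term does not occur in ci at all is an assumption, since the formula is only defined for terms of ci.

```python
def term_unexpectedness(f_u, f_c):
    """Unexpectedness of a term with weight f_u in uj and f_c in ci.

    Implements 1 - (f_u / f_c) when the ratio is at most 1, else 0.
    """
    if f_c == 0:
        return 0.0  # assumption: a term absent from ci cannot be unexpected in ci
    ratio = f_u / f_c
    return 1 - ratio if ratio <= 1 else 0.0


keywords = ["data", "predict", "classify", "state"]
uj = [0.4, 0.5, 0.0, 0.1]
ci = [0.3, 0.3, 0.2, 0.2]

unexpT = [term_unexpectedness(fu, fc) for fu, fc in zip(uj, ci)]
print(dict(zip(keywords, unexpT)))  # classify scores 1.0, state scores 0.5
```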

16

3. Find unexpected pages in C

1. Combine all pages of U into a single document Du.

2. Combine all pages of C into a single document Dc.

3. Compute the unexpectedness of each term kt in Dc with respect to Du (as in Task 2).

4. The unexpectedness of a page ci is the sum of the unexpectedness of its terms, normalized by m:

unexpPi = (ΣunexpTrcu) / m

17

3. Find unexpected pages in C

Complexity

O(Mu|U| + Mc|C|)

where

Mu is the maximal number of terms in a U page

Mc is the maximal number of terms in a C page
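The page-level score can be sketched as follows. Several choices here are assumptions rather than details given in the slides: pages are represented as lists of keyword tokens, term weights are relative frequencies in the combined documents, and m is taken to be the number of distinct terms in the page being scored.

```python
from collections import Counter


def relative_frequencies(pages):
    """Merge pages (lists of terms) into one document; map term -> relative frequency."""
    counts = Counter(term for page in pages for term in page)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def page_unexpectedness(c_pages, u_pages):
    """Score each competitor page by the average unexpectedness of its terms."""
    f_u = relative_frequencies(u_pages)  # weights in Du
    f_c = relative_frequencies(c_pages)  # weights in Dc

    def unexp_term(term):
        ratio = f_u.get(term, 0.0) / f_c[term]
        return 1 - ratio if ratio <= 1 else 0.0

    scores = []
    for page in c_pages:
        terms = set(page)  # assumption: m = number of distinct terms in the page
        scores.append(sum(unexp_term(t) for t in terms) / len(terms) if terms else 0.0)
    return scores


# Hypothetical pages, already reduced to keyword tokens.
U = [["data", "mining", "rules"], ["data", "classify"]]
C = [["data", "mining", "data"], ["visual", "tool", "data"]]

scores = page_unexpectedness(C, U)
print(scores)  # the second competitor page scores higher
```

In this toy input the second competitor page is the more unexpected one, because "visual" and "tool" never occur in the user site.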

18

4. Find unexpected concepts in a C page

A concept is a set of keywords that occur together and express the same idea.

Example: “information extraction”, “extraction of information”, and “information is extracted” express the same idea “information extraction”

19

4. Find unexpected concepts in a C page

Algorithm:

1. Divide the page into sentences.

2. Use the Apriori algorithm (Agrawal & Srikant) to find association rules of the form X → Y with confidence c, where X, Y ⊆ K, the set of keywords, and c is user-defined. These association rules are the concepts present in the page.

20

4. Find unexpected concepts in a C page

3. Treating each concept as a term, proceed as Task 2, finding unexpected terms in C.
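The sentence-level mining step can be illustrated with a deliberately simplified sketch. It is not the full Apriori algorithm: it only considers rules between single keywords, and the `min_support` and `min_confidence` thresholds, the sentence splitting, and the function name `find_concepts` are all assumptions made for illustration. No stemming is applied, so "extracted" and "extraction" count as different words.

```python
import re
from collections import Counter
from itertools import permutations


def find_concepts(page_text, keywords, min_support=2, min_confidence=0.6):
    """Find rules x -> y between keywords that co-occur in the same sentence."""
    sentences = [s for s in re.split(r"[.!?]+", page_text.lower()) if s.strip()]
    keyword_set = set(keywords)
    # Each sentence becomes a transaction: the set of keywords it contains.
    transactions = [keyword_set & set(re.findall(r"[a-z]+", s)) for s in sentences]

    single = Counter(k for t in transactions for k in t)
    pair = Counter((x, y) for t in transactions for x, y in permutations(sorted(t), 2))

    rules = []
    for (x, y), n in pair.items():
        confidence = n / single[x]  # fraction of sentences with x that also contain y
        if n >= min_support and confidence >= min_confidence:
            rules.append((x, y, confidence))
    return sorted(rules)


text = (
    "Information extraction is hard. "
    "We study extraction of information from pages. "
    "Information is extracted by wrappers. "
    "Pages contain tables."
)
K = ["information", "extraction", "pages", "tables"]

rules = find_concepts(text, K)
for x, y, c in rules:
    print(f"{x} -> {y} (confidence {c:.2f})")
```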

21

5. Find unexpected outgoing links

Let Lu be the set of outgoing links from U

Let Lc be the set of outgoing links from C

unexpL = Lc – Lu
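The set difference maps directly onto a set operation. The URLs below are placeholders and the function name `unexpected_links` is illustrative; the sketch also assumes links are compared as exact strings, with no URL normalization.

```python
def unexpected_links(competitor_links, user_links):
    """unexpL = Lc - Lu: links the competitor points to that the user site does not."""
    return set(competitor_links) - set(user_links)


# Placeholder link sets for illustration only.
Lu = {"http://example.org/a", "http://example.org/b"}
Lc = {"http://example.org/b", "http://example.org/c"}

unexpL = unexpected_links(Lc, Lu)
print(unexpL)  # {'http://example.org/c'}
```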

22

Incorporating user knowledge

Let E be the user's knowledge about the competitor. E is specified as:

Keyword terms
Concepts
Links

23

Incorporating user knowledge

The elements in E are incorporated in tasks 2 through 5 to find unexpected terms, pages, concepts, and links.

Elements in E are ranked low in unexpectedness.

24

System architecture

Implemented in C++/Win32. Components:

A spider or crawler: collects information.
Keyword extractor and concept finder.
Comparison component: performs tasks 1-5.
User interface.

25

A running example

The authors compare their own site with SGI’s MineSet data mining site, using no extra knowledge:

http://www.comp.nus.edu.sg/~dm2

http://www.sgi.com/software/mineset

26

Results

Found documentation pages in the SGI site. The authors now plan to add documentation pages to their own site.

Found previously unknown pages describing MineSet technology.

Found some previously unknown MineSet features.

Found many interesting terms, concepts, and links.

27

Evaluation

The system was further tested with three different organizations:

Travel company
Private school
Diving company

28

Evaluation

The users reported that the system helped them to:

Focus on interesting pages, terms, and concepts.
Make a more complete analysis of the competitor’s site.
Avoid missing important information.
Find unexpected things.

29

Efficiency

If number of keywords is constant, the algorithms are linear in the number of pages.

Tested on a Pentium II 350 PC with 64MB of RAM:

Site  #pages  sim     unexpTrij  unexpPj  Assoc. mining
[1]   143     0.0128  0.0156     0.0232   0.0379
[2]   21      0.0134  0.0189     0.0213   0.0182
[3]   66      0.0113  0.0177     0.0198   0.0206
[4]   127     0.0097  0.0201     0.0224   0.0115
[5]   46      0.0143  0.0153     0.0188   0.0105

[1] http://www.bluemartini.com
[2] http://www.datamining.com
[3] http://www.mineit.com
[4] http://www.sgi.com/software/mineset
[5] http://www.spss.com/clementine

30

Future work

Use of metadata.
Study how links can be used to infer more unexpected information.
Monitor the site, reporting any unexpected change.

31

Intrinsic limitations

Text oriented.
Does not work with images.
Can have problems with tables.
Does not work with dynamic web sites.

32