WEB MINING. Why IR ? Research & Fun .

Page 1: WEB MINING. Why IR ? Research & Fun .

Web Mining

Page 2: WEB MINING. Why IR ? Research & Fun .

Why IR?

Page 3: WEB MINING. Why IR ? Research & Fun .

Why IR?

Page 4: WEB MINING. Why IR ? Research & Fun .

Research & Fun

http://duilian.msra.cn

Page 5: WEB MINING. Why IR ? Research & Fun .

Overview of Search Engine

Page 6: WEB MINING. Why IR ? Research & Fun .

Flow Chart of SE

Page 7: WEB MINING. Why IR ? Research & Fun .

Text Processing (1) - Indexing

Index: a list of terms with relevant information (frequency of terms, location of terms, etc.)

Index terms: represent document content and separate documents from one another, e.g. “economy” vs. “computer” in a news article of the Financial Times

To get an index: extraction of index terms, then computation of their weights

Page 8: WEB MINING. Why IR ? Research & Fun .

Text Processing (2) - Extraction

Extraction of index terms

Word or phrase level

Morphological analysis (stemming in English): “information”, “informed”, “informs”, “informative” => “inform”

Removal of stop words: “a”, “an”, “the”, “is”, “are”, “am”, …
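The extraction step can be sketched in a few lines of Python. This is a toy illustration only: the suffix list and the function names `stem` and `extract_terms` are assumptions made for this example, not a real stemming algorithm such as Porter's, and the stop-word set contains just the words listed on the slide.

```python
import re

# Stop words taken from the slide; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Hypothetical toy suffix list, checked in order (longest first).
SUFFIXES = ("ation", "ative", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Lowercase, tokenize on letters, drop stop words, and stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


terms = extract_terms("The informed user informs an informative information")
```

With this toy rule set, all four word forms from the slide reduce to the same index term, "inform".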

Page 9: WEB MINING. Why IR ? Research & Fun .

Text Processing (3) – Term Weight

Calculation of term weights: statistical weights using frequency information, reflecting the importance of a term in a document

E.g. TF*IDF

TF: total frequency of a term k in a document

IDF: inverse document frequency of a term k in a collection

DF: the number of documents in which the term appears

High TF, low DF: a good word to represent the text

High TF, high DF: a bad word
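A minimal sketch of TF*IDF weighting follows. The slide does not give the exact IDF formula, so this example assumes the common form IDF = log(N / DF), where N is the number of documents in the collection; the function name `tf_idf` and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)

    # DF: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # TF: frequency of each term in this document
        # Assumed IDF form: log(N / DF)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["arizona", "university", "arizona"],
    ["university", "library"],
]
weights = tf_idf(docs)
```

Here "arizona" has high TF and low DF in the first document, so it receives a positive weight, while "university" appears in every document and its weight drops to zero.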

Page 10: WEB MINING. Why IR ? Research & Fun .

An Example

Document 1

Document 2

Page 11: WEB MINING. Why IR ? Research & Fun .

Text Processing (4) - Storing indexing results

[Figure: index table with columns “Index Word” and “Word Info.”; the index words “Arizona”, “University”, ... point to entries for Document 1 and Document 2]

Page 12: WEB MINING. Why IR ? Research & Fun .

Text Processing (2) - Storing indexing result

Page 13: WEB MINING. Why IR ? Research & Fun .

Text Processing (3) - Inverted File
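An inverted file maps each index term to the documents that contain it. The sketch below is one possible in-memory layout, assuming each posting stores only a term frequency; the function name `build_inverted_index` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: list of terms}. Returns {term: {doc_id: term frequency}}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term in terms:
            # Create or update the posting for this (term, document) pair.
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    1: ["arizona", "university"],
    2: ["university", "university"],
}
index = build_inverted_index(docs)
```

Looking up a query term is then a single dictionary access, rather than a scan over every document.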

Page 14: WEB MINING. Why IR ? Research & Fun .

Matching & Ranking (2)

Ranking

Retrieval models: Boolean (exact) => Fuzzy Set (inexact), Vector Space, Probabilistic, Inference Net, ...

Weighting schemes: index terms, query terms, document characteristics

Page 15: WEB MINING. Why IR ? Research & Fun .

Vector Space Model
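In the vector space model, documents and queries are represented as vectors of term weights and ranked by a similarity measure. The sketch below assumes cosine similarity over sparse {term: weight} dictionaries; the function names and the sample weights are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document ids sorted by decreasing similarity to the query."""
    return sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)


docs = {
    "d1": {"web": 1.0, "mining": 1.0},
    "d2": {"web": 1.0},
    "d3": {"index": 1.0},
}
ranking = rank({"web": 1.0}, docs)
```

For the query "web", d2 ranks first because its vector points in exactly the same direction as the query, d1 second, and d3 last because it shares no term with the query.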

Page 16: WEB MINING. Why IR ? Research & Fun .

Matching & Ranking (2)

Techniques for efficiency

New storage structures, esp. for new document types

Use of accumulators for efficient generation of ranked output

Compression/decompression of indexes

Techniques for Web search engines

Use of hyperlinks: inlinks & outlinks (PageRank), authority vs. hub pages (HITS)

In conjunction with directory services (e.g. Yahoo)
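The use of accumulators for ranked output can be sketched as follows: each query term's posting list is read once, partial scores are added to a per-document accumulator, and only the top k accumulators are returned. This is a simplified term-at-a-time sketch; the index layout ({term: {doc_id: weight}}) and the function name `ranked_retrieval` are assumptions made for this example.

```python
import heapq
from collections import defaultdict


def ranked_retrieval(query_terms, index, k=10):
    """index: {term: {doc_id: weight}}. Returns the top-k (doc_id, score) pairs."""
    accumulators = defaultdict(float)
    for term in query_terms:
        # Only documents that contain the term are ever touched.
        for doc_id, weight in index.get(term, {}).items():
            accumulators[doc_id] += weight
    # Select the k highest-scoring accumulators without sorting all of them.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


index = {
    "web": {"d1": 0.5, "d2": 1.0},
    "mining": {"d1": 1.0},
}
top = ranked_retrieval(["web", "mining"], index, k=2)
```

Documents that match no query term never get an accumulator, which is where the efficiency gain comes from.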

Page 17: WEB MINING. Why IR ? Research & Fun .
Page 18: WEB MINING. Why IR ? Research & Fun .

Pagerank Algorithm

Basic idea: more links to a page imply a better page. But not all links are created equal: links from a more important page should count more than links from a weaker page.

Basic PageRank R(A) for page A: $R(A) = \sum_{B \to A} R(B) / \mathrm{outDegree}(B)$, summing over all pages B that link to A, where outDegree(B) = number of edges leaving page B = number of hyperlinks on page B.

Page B distributes its rank boost evenly over all the pages it points to.
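The basic idea can be sketched as a simple iteration in which every page repeatedly splits its current rank evenly over the pages it links to. This is a minimal sketch of the basic form only: it assumes no damping factor and a graph in which every page has at least one outlink, and the function name `basic_pagerank` and the three-page graph are made up for illustration.

```python
def basic_pagerank(links, iterations=50):
    """links: {page: list of pages it links to}; every page must appear as a key."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform rank
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for b, targets in links.items():
            if not targets:
                continue  # a page without outlinks passes nothing on in this sketch
            share = rank[b] / len(targets)  # R(B) / outDegree(B)
            for a in targets:
                new_rank[a] += share
        rank = new_rank
    return rank


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = basic_pagerank(links)
```

In this small graph, A and C end up with the same rank and B with half as much, because B's only inlink carries half of A's rank.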

Page 19: WEB MINING. Why IR ? Research & Fun .
Page 20: WEB MINING. Why IR ? Research & Fun .
Page 21: WEB MINING. Why IR ? Research & Fun .
Page 22: WEB MINING. Why IR ? Research & Fun .

Readings

Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.

Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.

Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.

James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.

Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp. 56-58.

Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html

Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.

Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”

Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag, 1997. (See http://nlp.cs.nyu.edu/publication/index.shtml)