WEB MINING. Why IR? Research & Fun.
Web Mining
Why IR?
Research & Fun
http://duilian.msra.cn
Overview of Search Engine
Flow Chart of SE
Text Processing (1) - Indexing
A list of terms with relevant information
  Frequency of terms
  Location of terms
  Etc.
Index terms: represent document content & separate documents
  E.g. “economy” vs “computer” in a news article of the Financial Times
To get an index
  Extraction of index terms
  Computation of their weights
Text Processing (2) - Extraction
Extraction of index terms
  Word or phrase level
  Morphological analysis (stemming in English)
    “information”, “informed”, “informs”, “informative” → “inform”
  Removal of stop words
    “a”, “an”, “the”, “is”, “are”, “am”, …
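The extraction steps above can be sketched in a few lines of Python. This is only an illustration: the suffix list in `toy_stem` is a hypothetical rule set chosen to cover the slide's example (it is not the Porter stemmer or any published algorithm), and the stop-word set contains only the words listed on the slide.

```python
import re

# Stop words taken from the slide; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "am"}

# Toy suffix list (an assumption for illustration only),
# checked in this order so longer suffixes win.
SUFFIXES = ("ation", "ative", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_index_terms(text):
    """Lowercase, tokenize on letters, drop stop words, then stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


terms = extract_index_terms(
    "The informed reader informs an informative information desk"
)
print(terms)
```

With this toy rule set, the four word forms from the slide all reduce to the same index term "inform", while "the" and "an" are dropped before stemming.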
Text Processing (3) - Term Weight
Calculation of term weights
  Statistical weights using frequency information
  Importance of a term in a document
E.g. TF*IDF
  TF: term frequency, the number of times term k occurs in a document
  IDF: inverse document frequency of term k in a collection
  DF: the number of documents in which the term appears
High TF, low DF: a good word to represent the text
High TF, high DF: a bad word
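A minimal sketch of TF*IDF weighting follows. The slides do not say which IDF formula is used, so this assumes the common variant IDF = log(N / DF), where N is the number of documents; the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw number of occurrences in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Invented toy collection, already tokenized.
docs = [
    ["economy", "market", "economy", "news"],
    ["computer", "market", "news"],
    ["computer", "chip", "news"],
]
weights = tf_idf(docs)
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

In the first document, "economy" (high TF, low DF) gets the largest weight, while "news" appears in every document and so gets weight zero under this IDF variant.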
An Example
Document 1
Document 2
Text Processing (4) - Storing indexing results
[Figure: index table with columns Index, Word, and Word Info.; the words shown include “Arizona” and “University”, with entries pointing to Document 1 and Document 2]
Text Processing (2) - Storing indexing result
Text Processing (3) - Inverted File
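As a rough sketch of what an inverted file stores, the snippet below maps each term to the documents containing it, together with the term's positions (and hence its frequency) in each document. The function name `build_inverted_file` and the two toy documents are assumptions made for illustration; they are not taken from the slides.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


# Invented toy documents keyed by document number.
docs = {
    1: "University of Arizona",
    2: "Arizona State University Arizona",
}
inverted = build_inverted_file(docs)

# Look up one term: which documents contain it, how often, and where.
for doc_id, positions in inverted["arizona"].items():
    print(doc_id, len(positions), positions)
```

Looking up a query term is then a single dictionary access, rather than a scan over every document.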
Matching & Ranking (2)
Ranking
  Retrieval Model
    Boolean (exact) => Fuzzy Set (inexact)
    Vector Space
    Probabilistic
    Inference Net
    ...
  Weighting Schemes
    Index terms, query terms
    Document characteristics
Vector Space Model
Techniques for efficiency
  New storage structure esp. for new document types
  Use of accumulators for efficient generation of ranked output
  Compression/decompression of indexes
Techniques for Web search engines
  Use of hyperlinks
    Inlinks & outlinks (PageRank)
    Authority vs hub pages (HITS)
  In conjunction with Directory Services (e.g. Yahoo)
Matching & Ranking (2)
PageRank Algorithm
Basic idea: more links to a page implies a better page
  But not all links are created equal
  Links from a more important page should count more than links from a weaker page
Basic PageRank R(A) for page A: $R(A) = \sum_{B \to A} R(B) / \mathrm{outDegree}(B)$, summed over the pages B that link to A
  outDegree(B) = number of edges leaving page B = hyperlinks on page B
  Page B distributes its rank boost over all the pages it points to
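The basic PageRank idea can be sketched as a simple iteration in which every page repeatedly splits its current rank evenly over its outgoing links. This sketch covers only the basic form described above: it uses no damping factor, and it assumes every page has at least one outgoing link and that every link target is itself a key of the graph. The three-page graph is invented for illustration.

```python
def basic_pagerank(links, iterations=100):
    """Iterate R(A) = sum over pages B linking to A of R(B) / outDegree(B).

    `links` maps each page to the list of pages it links to.
    Assumes every page has at least one outlink (no damping, no sink handling).
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from a uniform rank
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for b, targets in links.items():
            share = rank[b] / len(targets)  # B splits its rank over its outlinks
            for a in targets:
                new_rank[a] += share
        rank = new_rank
    return rank


# Invented toy graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = basic_pagerank(links)
for page, r in ranks.items():
    print(page, round(r, 3))
```

On this graph the ranks settle near A = 0.4, B = 0.2, C = 0.4. B and C each have a single inlink, but C's link comes from the more important page A as well as from B, so C ends up ranked higher than B.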
Readings
Gregory Grefenstette (1998). “The Problem of Cross-Language Information Retrieval.” In Cross-Language Information Retrieval (ed: Grefenstette), Kluwer Academic Publishers.
Doug Oard et al. (1999). “Multilingual Information Discovery and AccesS (MIDAS).” D-Lib Magazine, 5 (10), Oct.
Sung Hyon Myaeng et al. (1998). “A Flexible Model for Retrieval of SGML Documents.” Proc. of the 21st ACM SIGIR Conference, Australia.
James Allan (2002). “Introduction to Topic Detection and Tracking.” In Topic Detection and Tracking: Event-based Information Organization (ed: Allan), Kluwer Academic Publishers.
Paul Resnick & Hal Varian (1997). “Recommender Systems.” CACM 40 (3), March, pp 56-58.
Badrul Sarwar et al. (2001). “Item-based Collaborative Filtering Recommendation Algorithms.” http://citeseer.nj.nec.com/sarwar01itembased.html
Karen Sparck Jones (1999). “Automatic summarizing: factors and directions.” In Advances in Automatic Text Summarization (eds: Mani & Maybury), MIT Press.
Ellen Voorhees (2000). “Overview of TREC-9 Question Answering Track.”
Ralph Grishman (1997). “Information Extraction: Techniques and Challenges.” In Information Extraction - International Summer School SCIE-97 (ed: Maria Teresa Pazienza), Springer-Verlag, 1997. (See http://nlp.cs.nyu.edu/publication/index.shtml)