Search Engine Technology (1) Prof. Dragomir R. Radev [email protected].

62
Search Engine Technology (1) Prof. Dragomir R. Radev [email protected]

Transcript of Search Engine Technology (1) Prof. Dragomir R. Radev [email protected].

Page 1: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Search Engine Technology(1)

Prof. Dragomir R. Radev

[email protected]

Page 2: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

SET FALL 2013

…1.Introduction…………

Page 3: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.
Page 4: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.
Page 5: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.
Page 6: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.
Page 7: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Examples of search engines

• Conventional (library catalog). Search by keyword, title, author, etc.

• Text-based (Lexis-Nexis, Google, Yahoo!).Search by keywords. Limited search using queries in natural language.

• Multimedia (QBIC, WebSeek, SaFe)Search by visual appearance (shapes, colors,… ).

• Question answering systems (Ask, NSIR, Answerbus)Search in (restricted) natural language

• Clustering systems (Vivísimo, Clusty)• Research systems (Lemur, Nutch)

Page 8: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

What does it take to build a search engine?

• Decide what to index

• Collect it

• Index it (efficiently)

• Keep the index up to date

• Provide user-friendly query facilities

Page 9: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

What else?

• Understand the structure of the web for efficient crawling

• Understand user information needs

• Preprocess text and other unstructured data

• Cluster data

• Classify data

• Evaluate performance

Page 10: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Goals of the course• Understand how search engines work

• Understand the limits of existing search technology

• Learn to appreciate the sheer size of the Web

• Learn to write code for text indexing and retrieval

• Learn about the state of the art in IR research

• Learn to analyze textual and semi-structured data sets

• Learn to appreciate the diversity of texts on the Web

• Learn to evaluate information retrieval

• Learn about standardized document collections

• Learn about text similarity measures

• Learn about semantic dimensionality reduction

• Learn about the idiosyncracies of hyperlinked document collections

• Learn about web crawling

• Learn to use existing software

• Understand the dynamics of the Web by building appropriate mathematical models

• Build working systems that assist users in finding useful information on the Web

Page 11: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Course logistics

• Wednesdays 6:10-7:55 in 410 IAB• Dates:

– Sep 4, 11, 18, 25– Oct 2, 9, 16, 23, 30– Nov 6, 13, 20, 27– Dec 4 + final in mid-December, date TBA

• URL: http://www1.cs.columbia.edu/~cs6998/

• Instructor: Dragomir Radev– Email: [email protected] – Office hours: TBA

• TAs: Amit Ruparel and Ashlesha Shirbhate– {ar3202, ass2167}@columbia.edu– [email protected]

Page 12: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Course outline

• Classic document retrieval: storing, indexing, retrieval

• Web retrieval: crawling, query processing.

• Text and web mining: classification, clustering

• Network analysis: random graph models, centrality, diameter and clustering coefficient

Page 13: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Syllabus• Introduction. • Queries and Documents. Models of Information retrieval. The

Boolean model. The Vector model. • Document preprocessing. Tokenization. Stemming. The Porter

algorithm. Storing, indexing and searching text. Inverted indexes. • Word distributions. The Zipf distribution. The Benford distribution.

Heap's law. TF*IDF. Vector space similarity and ranking. • Retrieval evaluation. Precision and Recall. F-measure. Reference

collections. The TREC conferences. • Automated indexing/labeling. Compression and coding. Optimal

codes. • String matching. Approximate matching. • Query expansion. Relevance feedback. • Text classification. Naive Bayes. Feature selection. Decision

trees.

Page 14: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Syllabus• Linear classifiers. k-nearest neighbors. Perceptron. Kernel

methods. Maximum-margin classifiers. Support vector machines. Semi-supervised learning.

• Lexical semantics and Wordnet. • Latent semantic indexing. Singular value decomposition. • Vector space clustering. k-means clustering. EM clustering. • Random graph models. Properties of random graphs: clustering

coefficient, betweenness, diameter, giant connected component, degree distribution.

• Social network analysis. Small worlds and scale-free networks. Power law distributions. Centrality.

• Graph-based methods. Harmonic functions. Random walks. • PageRank. Hubs and authorities. Bipartite graphs. HITS. • Models of the Web.

Page 15: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Syllabus• Crawling the web. Webometrics. Measuring the size of the web. The Bow-

tie-method. • Hypertext retrieval. Web-based IR. Document closures. Focused crawling. • Question answering • Burstiness. Self-triggerability • Information extraction • Adversarial IR. Human behavior on the web. • Text summarization

POSSIBLE TOPICS

• Discovering communities, spectral clustering

• Semi-supervised retrieval

• Natural language processing. XML retrieval. Text tiling. Human behavior on the web.

Page 16: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Readings

• required: Information Retrieval by Manning, Schuetze, and Raghavan (http://nlp.stanford.edu/IR-book/information-retrieval-book.html), freely available, hard copy for sale

• optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003, ISBN: 0-470-84906-1 (http://ibook.ics.uci.edu).

• papers from SIGIR, WWW and journals (to be announced in class).

Page 17: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Prerequisites

• Linear algebra: vectors and matrices.• Calculus: Finding extrema of functions.• Probabilities: random variables, discrete and

continuous distributions, Bayes theorem.• Programming: experience with at least one web-

aware programming language such as Perl (highly recommended) or Java in a UNIX environment.

• Required CS account

Page 18: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Course requirements

• Three assignments (30%)– Some of them will be in Perl. The rest can be done in

any appropriate language (e.g. Python or Java). All will involve some data analysis and evaluation.

• Final project (30%)– Research paper or software system.

• Class participation (10%)• Final exam (30%)

Page 19: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Final project format

• Research paper - using the SIGIR format. Students will be in charge of problem formulation, literature survey, hypothesis formulation, experimental design, implementation, and possibly submission to a conference like SIGIR or WWW.

• Software system - develop a working system or API. Students will be responsible for identifying a niche problem, implementing it and deploying it, either on the Web or as an open-source downloadable tool. The system can be either stand alone or an extension to an existing one.

Page 20: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Active research projects

• Scientific paper analysis, bibliometrics• Citation analysis• Question answering• Social media• Political debates• Blogs and rumors• IR for the humanities• Health IR• Collective intelligence• Sentiment analysis and word polarity• Cartoons• Social networks

Page 21: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

More project ideas

• Shingling• Build a language identification system.• Participate in the Netflix challenge.• Query log analysis.• Build models of Web evolution.• Information diffusion in blogs or web.• Author-topic models of web pages.• Using the web for machine translation.• News recommendation system.• Compress the text of Wikipedia (losslessly).• Spelling correction using query logs.• Automatic query expansion.

Page 22: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

List of projects from the past• Document Closures for Indexing• Tibet - Table Structure Recognition Library• Ruby Blog Memetracker• Sentence decomposition for more accurate information retrieval• Extracting Social Networks from LiveJournal• Google Suggest Programming Project (Java Swing Client and Luce

ne Back-End)• Leveraging Social Networks for Organizing and Browsing Shared P

hotographs• Media Bias and the Political Blogosphere• Measuring Similarity between search queries• Extracting Social Networks and Information about the people within t

hem from Text• LSI + dependency trees

Page 23: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Available corpora• Netflix challenge • AOL query logs • Blogs • Bio papers • AAN • Email • Generifs • Web pages • Political science corpus • VAST • del.icio.us • SMS • News data: aquaint, tdt, nantc, reuters,

setimes, trec, tipster • Europarl multilingual • US congressional data • DMOZ • Pubmedcentral • DUC/TAC

• Timebank• Wikipedia • wt2g/wt10g/wt100g • dotgov • RTE • Paraphrases • GENIA • Generifs • Hansards • IMDB • MTA/MTC • nie • cnnsumm • Poliblog • Sentiment • xml • epinions • Enron

Page 24: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Related courses elsewhere

• Stanford (Chris Manning, Prabhakar Raghavan, and Hinrich Schuetze)

• Cornell (Jon Kleinberg)• CMU (Yiming Yang and Jamie Callan)• UMass (James Allan)• UTexas (Ray Mooney)• Illinois (Chengxiang Zhai)• Johns Hopkins (David Yarowsky)• UNT (Rada Mihalcea)

Page 25: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

The size of the World Wide Web

• The size of the indexed world wide web pages (By Sep.4, 2012)– Indexed by Google: about 40 billion pages– Indexed by Bing: about 16.5 billion pages– Indexed by Yahoo: about 4.8 billion pages

http://www.worldwidewebsize.com/

Page 26: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

• Twitter hits 400 million tweets per day (June, 2012. Dick Costolo, CEO at Twitter)

• Over 2.5 billion photos uploaded to Facebook each month (2010. blog.facebook.com)

• Google’s clusters process a total of more than 20 petabytes of data per day. (2008. Jeffrey Dean from Google [link])

Page 27: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

• 55 Million WordPress Sites in the World

• WordPress.com users produce about 500,000 new posts and 400,000 new comments on an average day

http://en.wordpress.com/stats/

Page 28: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

• Dynamically generated content

• New pages get added all the time

• The size of the blogosphere doubles every 6 months

• Yahoo deals with 12TB of data per day (according to Ron Brachman)

Page 29: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

SET FALL 2013

…2. Models of Information retrieval The Vector model The Boolean model……

Page 30: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Sample queries (from Excite)In what year did baseball become an offical sport?play station codes . combirth control and depressiongovernment"WorkAbility I"+conferencekitchen applianceswhere can I find a chines rosewoodtiger electronics58 Plymouth FuryHow does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a

hero?emeril LagasseHubbleM.S Subalaksmirunning

Page 31: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Key Terms Used in IR

• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.

• DOCUMENT: an information entity that the user wants to retrieve

• COLLECTION: a set of documents• INDEX: a representation of information that makes

querying easier• TERM: word or concept that appears in a document

or a query

Page 32: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Mappings and abstractions

Reality Data

Information need Query

From Robert Korfhage’s book

Page 33: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Documents

• Not just printed paper

• Can be records, pages, sites, images, people, movies

• Document encoding (Unicode)

• Document representation

• Document preprocessing (e.g., removing metadata)

• Words, terms, types, tokens

Page 34: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Sample query sessions (from AOL)

• toley spies gramestolley spies gamestotally spies games

• tajmahal restaurant brooklyn nytaj mahal restaurant brooklyn nytaj mahal restaurant brooklyn ny 11209

• do you love me like you saydo you love me like you say lyricsdo you love me like you say lyrics marvin gaye

Page 35: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Characteristics of user queries

• Sessions: users revisit their queries.

• Very short queries: typically 2 words long.

• A large number of typos.

• A small number of popular queries. A long tail of infrequent ones.

• Almost no use of advanced query operators with the exception of double quotes

Page 36: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Queries as documents

• Advantages:– Mathematically easier to manage

• Problems:– Different lengths– Syntactic differences– Repetitions of words (or lack thereof)

Page 37: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Document representations

• Term-document matrix (m x n)

• Document-document matrix (n x n)

• Typical example in a medium-sized collection: 3,000,000 documents (n) with 50,000 terms (m)

• Typical example on the Web: n=30,000,000,000, m=1,000,000

• Boolean vs. integer-valued matrices

Page 38: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Storage issues

• Imagine a medium-sized collection with n=3,000,000 and m=50,000

• How large a term-document matrix will be needed?

• Is there any way to do better? Any heuristic?

Page 39: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Tokenizing text

• (CNN) -- A tropical storm has strengthened into Hurricane Leslie in the Atlantic Ocean, forecasters said Wednesday.

• The slow-moving storm could affect Bermuda this weekend, according to the National Hurricane Center in Miami.

• The Category 1 hurricane was churning Wednesday afternoon about 465 miles (750 kilometers) south-southeast of the British territory and moving north at 2 mph (4 kph), the hurricane center said.

http://www.cnn.com/2012/09/05/world/americas/bermuda-hurricane-leslie/index.html

Page 40: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Inverted index

• Instead of an incidence vector, use a posting table

• CLEVELAND: D1, D2, D6• OHIO: D1, D5, D6, D7• Use linked lists to be able to insert new

document postings in order and to remove existing postings.

• Can be used to compute document frequency• Keep everything sorted! This gives you a

logarithmic improvement in access.

Page 41: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Basic operations on inverted indexes

• Conjunction (AND) – iterative merge of the two postings: O(x+y)

• Disjunction (OR) – very similar• Negation (NOT) – can we still do it in

O(x+y)? – Example: MICHIGAN AND NOT OHIO– Example: MICHIGAN OR NOT OHIO

• Recursive operations• Optimization: start with the smallest sets

Page 42: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Major IR models

• Boolean

• Vector

• Probabilistic

• Language modeling

• Fuzzy retrieval

• Latent semantic indexing

Page 43: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

The Boolean model

x w y z

D1D2

Venn diagrams

Page 44: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Boolean queries

• Operators: AND, OR, NOT, parentheses• Example:

– CLEVELAND AND NOT OHIO– (MICHIGAN AND INDIANA) OR (TEXAS AND

OKLAHOMA)

• Ambiguous uses of AND and OR in human language– Exclusive vs. inclusive OR– Restrictive operator: AND or OR?

Page 45: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Canonical forms of queries

• De Morgan’s Laws:

NOT (A AND B) = (NOT A) OR (NOT B)

NOT (A OR B) = (NOT A) AND (NOT B)

• Normal forms– Conjunctive normal form (CNF)– Disjunctive normal form (DNF) – Some people swear by CNF - why?

Page 46: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Evaluating Boolean queries

• Incidence vectors:– CLEVELAND: 1100010– OHIO: 1000111

• Examples:– CLEVELAND AND OHIO– CLEVELAND AND NOT OHIO– CLEVALAND OR OHIO

Page 47: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Exercise

• D1 = “computer information retrieval”

• D2 = “computer retrieval”

• D3 = “information”

• D4 = “computer information”

• Q1 = “information AND retrieval”

• Q2 = “information AND NOT computer”

Page 48: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Exercise0

1 Swift

2 Shakespeare

3 Shakespeare Swift

4 Milton

5 Milton Swift

6 Milton Shakespeare

7 Milton Shakespeare Swift

8 Chaucer

9 Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift

12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

Page 49: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

SET FALL 2013

…3. Document preprocessing. Tokenization. Stemming. The Porter algorithm. Storing, indexing and searching text.

Inverted indexes.…

Page 50: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Document preprocessing

• Dealing with formatting and encoding issues• Hyphenation, accents, stemming, capitalization• Tokenization:

– USA vs. U.S.A. – equivalence class– Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad

hoc, can’t– Example: “The New York-Los Angeles flight”– Hewlett-Packard– numbers, e.g., (888) 555-1313, 1-888-555-1313– dates, e.g., Jan-13-2012, 20120113, 13 January 2012,

01/13/12– MIT, mit (in German)?

Page 51: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Non-English languages

• http://www.kyodo.co.jp/entame/showbiz/2012-09-04_64060/

• ストロベリーナイト• http://ja.wikipedia.org/wiki/

%E3%82%B9%E3%83%88%E3%83%AD%E3%83%99%E3%83%AA%E3%83%BC%E3%83%8A%E3%82%A4%E3%83%88

• http://it.wikipedia.org/wiki/Dorama• http://it.wikipedia.org/wiki/Strawberry_Night• テレビドラマ

Page 52: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Non-English languages

– Arabic:

– Japanese:

(kono hon ha omoi)– German: Lebensversicherungsgesellschaftsangesteller– Chinese: shàng hé– http://www.mandarintools.com/worddict.html– http://en.wiktionary.org/wiki/%D9%83%D8%AA

%D8%A7%D8%A8

كتاب

この本は重い。

尚和

Page 53: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Hiragana, Katakana, Romaji, Kanji

• http://tastymiso.com/japanese-alphabets/113

Page 54: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Document preprocessing

• Normalization:– Casing (cat vs. CAT), the Fed– Stemming (computer, computation)– Soundex– Accent removal – cote in French– Labeled/labelled, extraterrestrial/extra-terrestrial/extra

terrestrial, Qaddafi/Kadhafi/Ghadaffi

• Index reduction– Dropping stop words (“and”, “of”, “to”)– Problematic for “to be or not to be”

Page 55: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Porter’s algorithmExample: the word “duplicatable”

duplicat rule 4duplicate rule 1b1duplic rule 3

The application of another rule in step 4, removing “ic,” cannotbe applied since one rule from each step is allowed to be applied.

More examples:

SSES SS caresses caressIES I ponies poniSS SS caress caressS [blank] cats cat

Page 56: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Porter’s algorithm

Computable Comput

Intervention Intervent

Retrieval Retriev

Document Docum

Representing Repres

Representative Repres

Page 57: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Links

• http://maya.cs.depaul.edu/~classes/ds575/porter.html

• http://www.tartarus.org/~martin/PorterStemmer/def.txt

Page 58: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

When does stemming help?

• Camera, cameras?

• Electricity, electrical?

• Operating, operations, operative, operational

• (systems, research, dentistry, plan)

Page 59: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Approximate string matching

• The Soundex algorithm (Odell and Russell)

• Uses:– spelling correction– hash function– non-recoverable

Page 60: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

The Soundex algorithm

1. Retain the first letter of the name, and drop all occurrences of a,e,h,I,o,u,w,y in other positions

2. Assign the following numbers to the remaining letters after the first:b,f,p,v : 1

c,g,j,k,q,s,x,z : 2

d,t : 3

l : 4

m n : 5

r : 6

Page 61: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

The Soundex algorithm

3. if two or more letters with the same code were adjacent in the original name, omit all but the first

4. Convert to the form “LDDD” by adding terminal zeros or by dropping rightmost digits

Examples:

Euler: E460, Gauss: G200, H416: Hilbert, K530: Knuth, Lloyd: L300

same as Ellery, Ghosh, Heilbronn, Kant, and Ladd

Some problems: Rogers and Rodgers, Sinclair and StClair

Page 62: Search Engine Technology (1) Prof. Dragomir R. Radev radev@cs.columbia.edu.

Readings

• MRS1, MRS2, MRS3• MRS5 (Zipf), MRS6• MRS7, MRS8