Open Source Search Tools: Applications of Open Search Tools Tutorial

Applications of Open Search Tools: WWW2010 Tutorial. Rosie Jones and Ted Drake, Yahoo! Inc., April 26th, 2010

description

Presentation by Ted Drake and Rosie Jones for the WWW2010 conference in North Carolina. It covers open source search software, search APIs, and industry trends.

Transcript

Page 1:

Applications of Open Search Tools: WWW2010 Tutorial

Rosie Jones and Ted Drake

Yahoo! Inc

April 26th, 2010

Page 2:

- 2 - WWW 2010 Tutorial, Open Search Tools, Drake & Jones

Introductions

Page 3:

Schedule

2:00 – 2:15 Introductions and Overview (Rosie & Ted)
2:15 – 2:30 Motivation – state of the industry (Ted Drake)
2:30 – 3:00 Indexing and Search (Rosie & Ted)
3:00 – 3:30 Hello World! Using Search Service APIs & Examples (Ted Drake)
3:30 – 4:00 Coffee Break
4:00 – 4:30 Mashup Patterns (Ted Drake)
4:30 – 5:00 Ranking and Evaluation (Rosie Jones)
5:00 – 5:30 Discussion, Questions (Ted & Rosie)

Page 4:

Caveat

• There is a lot of open search software out there!

• This tutorial is breadth-oriented, and example driven

– And therefore necessarily kind of shallow

For the slides: http://www.slideshare.net/7mary4

Page 5:

Schedule

2:00 – 2:15 Introductions and Overview (Rosie & Ted)
2:15 – 2:30 Motivation – state of the industry (Ted Drake)
2:30 – 3:00 Search and Indexing (Rosie & Ted)
3:00 – 3:30 Hello World! Using Search Service APIs & Examples (Ted Drake)
3:30 – 4:00 Coffee Break
4:00 – 4:30 Mashup Patterns (Ted & Rosie)
4:30 – 5:00 Ranking and Evaluation (Rosie Jones)
5:00 – 5:30 Discussion, Questions (Ted & Rosie)

Page 6:

Motivation

Page 7:

State of the Industry - Mashups

Programmable Web: Resource for API and Mashup development

• 10 new search mashups every month (average)
• 62 search APIs (as of April 25, 2010)

Page 8:

State of the Industry - Healthy Market

1,500 search related companies on TechCrunch

Page 9:

Open Source Technology Reduces Barriers

• Yahoo! Query Language
– Select * from (insert your desire)
– Built in cache, threading, authentication
– Easily extended with Open Tables
• Hadoop
– Yahoo Distribution of Hadoop includes patches and updates
– Your Hadoop installation can perform at your current scale, all the way up to Yahoo scale
• Open Source Search Engines
– Lemur
– Lucene

Page 10:

Motivation II: Tools for Academic Papers

Page 11:

In Academia: Paper in WWW 2010

• Highlighting Disputed Claims on the Web. Rob Ennals, Beth Trushkowsky, John Mark Agosta, Tye Rattenbury, Tad Hirsch

“The server uses Yahoo BOSS to search the web for snippets that resemble a paraphrase entered by the user.”

Page 12:

In Academia: Papers from SIGIR 2008

• Towards breaking the quality curse: a web-querying approach to web people search [Kalashnikov et al. SIGIR 2008]

– Web as external corpus

– Use Yahoo! API to retrieve

• Emulating query-biased summaries using document titles [Joho et al SIGIR 2008]

– Yahoo!, Google, Terrier (TREC)

Page 13:

More Publications using Open Source Search Engines

• Affective feedback: an investigation into the role of emotions in the information seeking process [Arapakis et al. SIGIR 2008]
– Use Indri to parse and retrieve TREC newswire and web collections
• [Jung et al. IP&M 2007]
– Last clicked document is predictor of relevance (used Nutch search engine on university website)
• Minimal test collections for retrieval evaluation [Carterette et al. SIGIR 2006]
– Indri, Lemur, Lucene, Mg, SMART, Zettair

Page 14:

Schedule

2:00 – 2:15 Introductions and Overview (Rosie & Ted)
2:15 – 2:30 Motivation – state of the industry (Ted Drake)
2:30 – 3:00 Search and Indexing (Rosie & Ted)
3:00 – 3:30 Hello World! Using Search Service APIs & Examples (Ted Drake)
3:30 – 4:00 Coffee Break
4:00 – 4:30 Mashup Patterns (Ted & Rosie)
4:30 – 5:00 Automatic Evaluation (Rosie Jones)
5:00 – 5:30 Discussion, Questions (Ted & Rosie)

Page 15:

Web Search Architecture

Offline
• Crawlers: find documents, follow links, fetch freshest content, build graph of hyperlinks
• Indexers: process text and meta-data, compressed, for quick lookup
• Index: text and meta-data, compressed, for quick lookup

Runtime
• Retrieval: find documents containing query words
• Ranking
• Interface

Page 16:

What is Open Search

Page 17:

Open Source Search and Open Search

Open source code lets you build your own search engine

Open search lets you leverage existing commercial search engines

Page 18:

Why Open Search?

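#!/usr/local/bin/perl -w
use LWP::Simple;                 # provides get()
$searchResultPage = get($url);   # fetch the raw search result page
process($searchResultPage);      # process() stands for your own scraping code

Curl (php)

Javascript…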

Page 19:

Scraping Modules

http://search.cpan.org/~jfriedl/Yahoo-Search-1.10.13/lib/Yahoo/Search.pm

Page 20:

Do I Look Like A Piece of Bad Software?

Page 21:

Information Superhighway for Known Robots

Search engine may stop accepting requests from your IP, or just slow down service

Page 22:

Scrape with Search Engine’s Blessing

• http://code.google.com/apis/ajaxsearch/

• http://msdn.microsoft.com/en-us/library/dd251056.aspx

• http://developer.yahoo.com/search/boss/

MUCH MORE DETAIL IN THE NEXT SECTION!

Page 23:

Other Parts to the Search Process

• Indexing

– Indexing algorithms

– Access to the index – what is overall document frequency? What if I rank differently using the index?

• Presentation

– User interface effects

• Existing Open Search Platforms Can Get You Started

Page 24:

Indexing Your Own Content

Page 25:

Task of Indexing

• Store document contents in format that allows quick lookup

• Invest time offline

– For fast runtime access

• Runtime task

– Given the current query

– Which subset of documents should we spend time ranking

Page 26:

Brute Force Document Scoring

• Check every document in collection to see if it contains any query terms

– Most documents don’t contain any of the query terms

– Look at query terms to see which documents to consider

Page 27:

Inverted Index

query = open search ted drake

[Figure: each index term (open, search, ted, drake, and others such as mit) points to a posting list of document IDs, for example D3 D8 D9 D15 D32. Only the documents that appear in the posting lists of the query terms (D1, D3, D6, D8, D9, D15, D32, D46) need to be considered.]
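The inverted index idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions (whitespace tokenization, lowercase terms, made-up toy documents); the names `build_inverted_index` and `candidates` are hypothetical and do not come from any of the engines in this tutorial.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def candidates(index, query):
    """Return IDs of documents that contain at least one query term."""
    found = set()
    for term in query.lower().split():
        found.update(index.get(term, []))
    return sorted(found)


# Toy collection (invented for illustration only).
docs = {
    1: "open search tools tutorial",
    2: "ted drake on accessibility",
    3: "coffee break",
}
index = build_inverted_index(docs)
print(candidates(index, "open search ted drake"))
```

Document 3 is never touched at query time, which is the point of investing the indexing time offline.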

Page 28:

High Level Comparison

Platform  License   Lang.  Docs             Ranking        Users       Parallel  Scale
Lucene    Apache    Java   Many             Flexible       Amazon      Yes       TB
zettair   BSD like  C      HTML, TREC, TXT  Flexible       Research    No        TB
Indri     BSD like  C++    Many             Very Flexible  Research    Yes       TB
Sphinx    GPL       C++    Many             Flexible       craigslist  Yes       TB
RDBMS     BSD, GPL  C      SQL Text         Limited        -           Maybe     GB
Xapian    GPL       C++    Many             Flexible       gmane       Yes       TB

Page 29:

Previous Benchmarks (Middleton+Baeza-Yates 07)

Page 30:

Open Search Benchmarking

• http://zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter/

– An over the weekend experiment to make code examples

Page 31:

Benchmarks

• Not enough comparative benchmarks out there
• Hard to do; we really need standards
– Optimize each platform, per hardware and data set
– Lot of platforms, with different APIs, options and numerical settings
• Need good diverse data sets, small & large
• Hard to please
– Winners & losers in benchmarks; lot of biases
– Always room for improvement
• Really evolutionary to nail benchmarks
– It’s an Open Source project
• http://github.com/zooie/opensearch/tree/master

Page 32:

In action

Lucene

Sphinx

Indri

All the code examples are here: http://github.com/zooie/opensearch/tree/master

Page 33:

Lucene

• Lot of industrial support w/ proven scalability
– Amazon, Netflix, Wikipedia
• An IR Library in Java
– There’s also pyLucene & CLucene
• Use Nutch, Solr or Hounder for the rest
– Crawlers, result abstracts…

Page 34:

Lucene Indexing

Page 35:

Lucene Search

javac -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index.java

java -Xmx512m -cp /lucenedir/lucene-2.4.1/lucene-core-2.4.1.jar:. Index

Page 36:

Sphinx

• Runs Craigslist Search

• MySQL integration focus
– But also supports an XML input pipe

• Pretty fast indexer

• searchd, indexer commands

• Mostly declarative option setting (sphinx.conf)

• Client API (python, Java, ruby, php) sockets to searchd

Page 37:

Sphinx Indexing

• SQL text columns or XML input
• sphinx.conf
• indexer --quiet --config sphinx.conf medindex

Page 38:

Sphinx Search

Socket connection to searchd Sphinx service

Page 39:

Indri

• Lemur Project
– http://www.lemurproject.org/
• Powerful Structured Query Language
• Advanced Language Models
• Native C++; swigged Java, php
• Command line binaries
• Developer resources
– http://lemur.wiki.sourceforge.net/

Page 40:

Indri: Hello World

• Index & Search directory of txt files

• IndriBuildIndex -index=/Users/viksi/sigir/med_data/indri_index -corpus.path=/Users/viksi/sigir/med_data/indri_data -corpus.class=txt -memory=300m

– http://www.lemurproject.org/lemur/indexing.php#IndriBuildIndex

• IndriRunQuery -index=/Users/viksi/sigir/med_data/indri_index -count=100 -rule="method:dirichlet,mu:2500" -query="#weight(1.0 #uw2(chest pain) 2.0 #1(heart attack))"

– http://www.lemurproject.org/lemur/IndriQueryLanguage.php

Page 41:

Indexed Info in Search API

Page 42:

Index - Structured Meta Data

SearchMonkey:

Yahoo! SearchMonkey captures the structured data from web sites for the index.

• RDF

• Microformats

Page 43:

Index - Social

Social:

• Delicious saves/tags

• FOAF (Friend of a Friend), XFN

• Recent social activity: Twitter, Facebook, Buzz, Blogs…

Page 44:

Index – Machine Tags

• Keyterms

• Mis-Spelling

• Content Enrichment

• Inbound Links

Page 45:

Schedule

2:00 – 2:15 Introductions and Overview (Rosie & Ted)
2:15 – 2:30 Motivation – state of the industry (Ted Drake)
2:30 – 3:00 Search and Indexing (Rosie & Ted)
3:00 – 3:30 Hello World! Using Search Service APIs & Examples (Ted Drake)
3:30 – 4:00 Coffee Break
4:00 – 4:30 Mashup Patterns (Ted & Rosie)
4:30 – 5:00 Automatic Evaluation (Rosie Jones)
5:00 – 5:30 Discussion, Questions (Ted & Rosie)

Page 46:

Hello, World! Open Search Service APIs

Photo by Oskay

Page 47:

Roadmap of APIs

• Google
• Bing
• BOSS
• Twitter
• YQL
• Live examples

Photo by Scorpions and Centaurs

Page 48:

Google AJAX Search

• Javascript Widget or API
• REST API:
– http://ajax.googleapis.com/ajax/services/search/{vertical}?v=1.0&q={query}

• Web, Local, Video, Blogs, News, Books, Images, Patents

• Can’t modify results though

Page 49:

Google Custom Search

• Turn-key product
• Bulk load 1000s site restricts; On-demand 24 hour Web Indexing
• Iframe or Custom Search Element results for developers; XML for enterprise

Page 50:

Bing 2.0 API

• Multiple Sources; Batch support
– Web, Images, InstantAnswer, Phonebook, RelatedSearch, Spell
• Usage: http://api.search.live.net/json.aspx?AppId={appid}&Market=en-US&Query={query}&Sources=web+spell&Web.Count=1
• Can modify (w/ some restrictions, i.e. re-ranking, blending with non-Bing sources)

Page 51:

Yahoo! BOSS

• BOSS = Build your Own Search Service

• Open Yahoo’s core search features via web services to let 3rd parties revolutionize Search

• Unrestricted

Page 52:

Unrestricted?

• Unlimited queries
• Blend, re-order, discard
• Full Presentation control
• Limited only by your imagination

Page 53:

BOSS API

• Usage
– http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}&start=0&count=10&lang=en&format=xml&view=keyterms
• Verticals
– Web, News, Images, Spelling
• In query syntax
– inurl, url, intitle, site, AND/OR, “-”, “+”
• Notable web view fields
– Delicious bookmarks
– SearchMonkey (microformats)
– Larger abstracts
– Extracted Entities (keyterms)
• Can modify

[Figure callouts: SearchMonkey, keyterms, Bookmarks]

Page 54:

Web = Cross Platform

• Google AJAX, Bing, BOSS
• HTTP GET, URI => XML, JSON
• Any programming lang. that supports HTTP
• Many language specific libraries available
– Web Search “[platform] [language]”
– e.g. “yahoo boss python”
• Mobile: HTML web apps work on all smart phones
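Because all three services are plain HTTP GET endpoints, a request is just a filled-in URL. The sketch below reuses the URL templates shown on the Google AJAX, Bing and BOSS slides; the helper name `search_url` and the `APPID` placeholder are hypothetical, and the code only builds the URL without sending any request.

```python
from urllib.parse import quote

# URL templates as given on the Google AJAX, Bing 2.0 and BOSS slides.
TEMPLATES = {
    "google": "http://ajax.googleapis.com/ajax/services/search/{vertical}?v=1.0&q={query}",
    "bing": ("http://api.search.live.net/json.aspx?AppId={appid}"
             "&Market=en-US&Query={query}&Sources=web+spell&Web.Count=1"),
    "boss": ("http://boss.yahooapis.com/ysearch/{vert}/v1/{q}?appid={appid}"
             "&start=0&count=10&lang=en&format=xml&view=keyterms"),
}


def search_url(service, **values):
    """Fill a service template, percent-encoding every supplied value."""
    encoded = {name: quote(value, safe="") for name, value in values.items()}
    return TEMPLATES[service].format(**encoded)


print(search_url("google", vertical="web", query="open search"))
print(search_url("bing", appid="APPID", query="hadoop"))
print(search_url("boss", vert="web", q="ted drake", appid="APPID"))
```

Any HTTP client in any language can then fetch the resulting URL and parse the XML or JSON response.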

Page 55:

Platforms

Page 56:

Yahoo! YQL

• select * from internet API (e.g. flickr, ebay, amazon)
– http://developer.yahoo.com/yql/
– many standard & “open tables” services

Page 57:

Amazon Web Services (AWS)

• Amazon Cloud Support

• Amazon SimpleDB, Relational Database Services

• E-Commerce Fulfillment Services

• Messaging

• Monitoring

• Networking

• Payments & Billing

• Storage

• Workforce: Amazon Mechanical Turk

Large scale functionality at startup prices

Page 58:

Google App Engine

• Free application hosting (up to 5 million pv/month)

• Java, Ruby, or Python

• Extensive SDK support

• Distributed Data Storage (up to 500 mb for free)

Page 59:

Examples

Page 60:

BOSS Out in the Open

• http://www.xurch.com
• http://search.techcrunch.com
• http://www.spysee.jp
• http://www.123people.com
• http://www.pipl.com
• http://tweetnews.appspot.com
• http://bossy.appspot.com
• http://www.hakia.com
• http://oneriot.com
• http://www.daylife.com
• http://www.inquisitorx.com/
• http://insiderfood.com/
• http://ask-boss.appspot.com/
• http://www.4hoursearch.com
• http://www.devunity.com (Techcrunch 50)
• http://copyrightspot.com/ (Mashable)
• http://imusicmash.com (Mashable)
• http://truevert.com (Mashable)
• http://professeurs.esiea.fr/wassner/?2008/10/20/171-semantic-calculator
• http://www.ysearchblog.com/archives/000613.html
• http://www.ysearchblog.com/archives/000621.html
– DNS Mashup
– BuildASearch
– PlayerSearch
– V3GGIE
– Dipidity Newsline
– Tianamo

Page 61:

Google Custom Search Examples

• CopyScape – Looks for sites copying your text

• Topicalizer – Extracts topics, finds related information from text

Page 62:

Bing Examples

• Site Search Engine by a Microsoft engineer: http://nathanbuggia.com/blog/post/Custom-Site-Search-Engine-Using-the-Live-Search-API.aspx

Page 63:

Coolest Features Across the Board

• BOSS: se_link (graphs), delicious (bookmarks), keyterms (extracted entities), searchmonkey (rdfa, microformats, structured abstracts)
• Yahoo! YQL
• Bing: Video, Translation, Instant Answer, Batch
• Google CSE: large site restricts, refinements
• Google AJAX: Transliteration, Blogs, Books

Page 64:

Schedule

2:00 – 2:15 Introductions and Overview (Rosie & Ted)
2:15 – 2:30 Motivation – state of the industry (Ted Drake)
2:30 – 3:00 Search and Indexing (Rosie & Ted)
3:00 – 3:30 Hello World! Using Search Service APIs & Examples (Ted Drake)
3:30 – 4:00 Coffee Break
4:00 – 4:30 Mashups (Ted Drake)
4:30 – 5:00 Automatic Evaluation (Rosie Jones)
5:00 – 5:30 Discussion, Questions (Ted & Rosie)

Page 65:

Mashups

Page 66:

Let’s Build Something

• TweetNews
– http://tweetnews.appspot.com/search?q=twitter
– “the best mashup we’ve ever seen” (Wired)
• Tools
– BOSS, BOSS Mashup Framework, Google App Engine, Python 2.5
• Source
– http://vik.singh.googlepages.com/fresh.zip

Page 67:

Digression: TF-IDF for Ranking

• TF = Term Frequency

– Documents containing the query terms often tend to be relevant

• IDF – Inverse Document Frequency

– Words that are in every document aren’t as important

• The, of, “click here”, “home page”

– Document frequency: number of documents containing this term

– Divide by Document frequency: Inverse Document Frequency

• Sort by TF * IDF to get a ranking over documents
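A minimal sketch of TF * IDF ranking, assuming whitespace tokenization and the common logarithmic form of IDF, log(N / document frequency). The slide only says to divide by document frequency, so the logarithm is an assumption, and the function name and toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf_ranking(query, docs):
    """Rank documents by the sum of TF * IDF over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if df[term] > 0  # skip query terms that occur nowhere
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy documents (invented for illustration only).
docs = {
    "a": "the hadoop cluster",
    "b": "the home page",
    "c": "the hadoop hadoop tutorial",
}
ranking = tf_idf_ranking("the hadoop", docs)
print(ranking)
```

In this toy collection "the" occurs in every document, so its IDF is log(3/3) = 0 and only "hadoop" contributes to the scores.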

Page 68:

TweetNews Model

• Goal: Inject relevance in latest news search results
• Approach:
– Fetch latest (order by date) news results for query
– Also fetch latest tweets for query (search.twitter.com)
– Vectorize each Twitter and News search result
– Euclidean Normalized TFIDF document vector of term:freq pairs
– Compute cosine sim between each twitter & news result vector
– Assign tweet to news result if sim >= threshold
– Sort news results by # of related tweets
• WWW2010 similar paper
– Time is of the Essence: Improving Recency Ranking Using Twitter Data. Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Bai Jing, Yi Chang, Fernando Diaz, Zhaohui Zheng, Hongyuan Zha
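The vectorize, compare and sort steps of the TweetNews model can be sketched as follows. This is not the TweetNews source: the function names, the smoothed IDF form log(1 + N/df), the 0.2 threshold and the sample texts are all assumptions chosen for illustration, and fetching the news results and tweets is left out.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one Euclidean-normalized TF-IDF vector (a dict) per text."""
    tokenized = [text.lower().split() for text in texts]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(tokenized)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {term: freq * math.log(1 + n / df[term]) for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def rank_news_by_tweets(news, tweets, threshold=0.2):
    """Sort news results by the number of tweets with similarity >= threshold."""
    vectors = tfidf_vectors(news + tweets)
    news_vecs, tweet_vecs = vectors[:len(news)], vectors[len(news):]
    counts = [
        sum(1 for tv in tweet_vecs if cosine(tv, nv) >= threshold)
        for nv in news_vecs
    ]
    order = sorted(range(len(news)), key=lambda i: counts[i], reverse=True)
    return [(news[i], counts[i]) for i in order]


# Sample texts (invented for illustration only).
news = ["new tablet computer goes on sale", "volcano ash grounds flights in europe"]
tweets = ["flights grounded by volcano ash", "lunch was great"]
ranking = rank_news_by_tweets(news, tweets)
print(ranking)
```

The news item that matching tweets talk about moves to the top even though it was listed second.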

Page 69:

TweetNews Main Source

Page 70:

Non-Search: delicious Classifier

Usage
• &view=delicious_toptags
• &view=delicious_saves

Idea: Liberal v. Conservative Classifier

1. Generate politics queries list
• Mine Reuters or editors
2. BOSS search each; take top 1k results
3. Filter on tag ‘liberal’ or ‘conservative’; assemble binary training set
4. Features
• “&abstract=long”, “&view=keyterms,delicious_saves,searchmonkey_rss”, title, url, date, se_link # inbound links

Page 71:

Mashup: Related terms

• Delicious users can tag web sites they bookmark.

• Get a ranked list of tags for a general topic

• select delicious_toptags,title from search.web where query="hadoop" and view="delicious_toptags"

Page 72:

Mashup – Social Impact

• What are your friends buzzing, digging, tagging…

• YQL: select * from social.connections.updates where guid=me

• Use data to find more recent and relevant information

• Lijit creates a vertical search engine based on a user’s delicious, facebook, and other saved bookmarks

• WWW2010 Related Paper: Liquid Query: Multi-domain Exploratory Search on the Web Marco Brambilla, Alessandro Bozzon, Stefano Ceri, Piero Fraternali

• Now it’s time to turn on the FIRE HOSE

Page 73:

Mashup – The Fire Hose

Page 74:

Mashup – Government Data

• Guardian’s World Government Data Collection: http://www.guardian.co.uk/world-government-data

– U.S. Unemployment Statistics

– U.S. Aviation Accidents

– Raw Data for U. S. Department of Energy (DOE) Categorical Exclusion(CX) Determinations Under the National Environmental Policy Act (NEPA)

– Treasury Recovery Act Data

– Migratory Bird Flyways - Continental United States

Page 75:

Coming Soon: Twitter Annotations

Metadata for tweets

Step 1. Create link for users to tweet your page.
Step 2. Insert metadata into each tweet.
Step 3. Pull that information back and mash with other data.

Example
• Yahoo! Finance has a “tweet this stock” link.
• Insert information (ticker:yhoo) into the tweet’s metadata.
• Follow the distribution of this metadata and look for correlations in stock price activity. Perhaps a new line on Finance Charts.

Page 76:

Mashup – Open Tables on YQL

– Define new API definitions

– Open Source in GitHub

– Server-side JavaScript allows Insert and more

– Allows for private keys

Page 77:

Mashup – Open Tables on YQL

<?xml version="1.0" encoding="UTF-8"?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <author>Nagesh Susarla</author>
    <documentationURL>See search.web and search.images for more details</documentationURL>
  </meta>
  <bindings>
    <select itemPath="results.result" produces="XML">
      <inputs>
        <key id="query" type="xs:string" paramType="query" required="true"/>
      </inputs>
      <execute><![CDATA[
        var qs = query;
        var search = y.query('select * from search.web(50) where query=@query', {query: qs}).results;
        var images = [];
        default xml namespace='http://www.inktomi.com/';
        for each (var result in search.result) {
          images.push(y.query('select * from search.images(1) where query=@query and url=@url', {url:result.url, query:qs}));
        }
        var i = 0;
        for each (var result in search.result) {
          var image = images[i].results.result;
          if (image) {
            result.image = <image>{image}</image>;
          }
          i++;
        }
        response.object = search;
      ]]></execute>
    </select>
  </bindings>
</table>

Page 78:

Mashup – Using an Open Table

Page 79:

Blending Vertical + Service

Comprehensiveness! Every Search Engine should be a One-Stop Shop

Page 80:

Delicious Blending Idea

• Goal: Blend delicious + web results
• Approach:
– 1000s BOSS Web Queries, Filter w/ delicious_saves
– Training set: x: search features | y: delicious count
– Machine learn the transfer function
  • Infer the delicious count for any web result
  • Can now normalize the two search result sets

Page 81:

[Figure labels: From Web]

Page 82:

Hack Ideas

Discovery (BOSS Search App Store)
• Designing a fairer marketplace for app distribution
• Emerging problem for Facebook, iPhone App Store

Desktop, Data Visualization (Cooliris, Inquisitor)

Mobile (iPhone, Android, BlackBerry)
• Passive Location/Contextual Based Search

Social (Facebook, Twitter, OpenSocial, Friend Connect, OneConnect)

Semantic
• BOSS keyterms, SearchMonkey
• Bing Instant Answers
• Google CSE Refiners

Page 83: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Schedule

2:00 – 2:15 Introductions and Overview Rosie & Ted

2:15 – 2:30 Motivation – state of the industry Ted Drake

2:30 – 3:00 Search and Indexing Rosie & Ted

3:00 – 3:30 Hello World! Using Search Service APIs & Examples Ted Drake

3:30 – 4:00 Coffee Break

4:00 – 4:30 Mashups Ted Drake

4:30 – 5:00 Ranking and Evaluation Rosie Jones

5:00 – 5:30 Discussion, Questions Ted & Rosie

Page 84: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Ranking

Page 85: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Retrieval and Ranking

• RETRIEVE the documents matching simple conditions

– Boolean AND on query terms

– TF-IDF

– …

• RANK using a more sophisticated function

– Term proximity

– Page authority

– Author identity

– …
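
The two-stage idea above can be sketched as follows: a cheap Boolean AND pass selects candidates, then a costlier function orders them. This is a toy illustration only; the tiny corpus, the function names, and the choice of term proximity (smallest window covering all query terms) as the only ranking signal are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def build_index(docs):
    """Positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def retrieve(index, terms):
    """RETRIEVE: Boolean AND, i.e. documents containing every query term."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


def min_span(index, terms, doc_id):
    """Width of the smallest window holding one occurrence of each term."""
    position_lists = [index[t][doc_id] for t in terms]
    return min(max(combo) - min(combo) for combo in product(*position_lists))


def search(index, query):
    """RANK: order the retrieved candidates by term proximity."""
    terms = query.lower().split()
    candidates = retrieve(index, terms)
    return sorted(candidates, key=lambda d: (min_span(index, terms, d), d))


docs = {
    "d1": "open source search tools",
    "d2": "search the open web for source code",
    "d3": "open table definitions",
}
index = build_index(docs)
results = search(index, "open source")
```

Here `d3` never reaches the ranking stage because it lacks "source", and `d1` outranks `d2` because its query terms are adjacent.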

Page 86: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Ranking with Open Source Tools

• Indri/Lemur

– Language modeling

– BM25, Okapi, Cosine similarity, inQuery

• Lucene

– TF-IDF, weighted by term occurrences

– Fielded search

• Terrier

– Okapi BM25, language modeling and TF-IDF

– Divergence from Randomness

• Your own re-ranking code using open search
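
As one example of "your own re-ranking code", here is a small sketch of Okapi BM25 scoring. It uses a common textbook form with a non-negative IDF and the usual default parameters k1 = 1.2 and b = 0.75; it is not taken from Indri, Lucene, or Terrier, whose implementations differ in details, and the whitespace tokenizer is a simplifying assumption.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (a list of strings) against `query`."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            # Length normalization: long documents need more occurrences.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "open source search engine",
    "search engine ranking with bm25",
    "coffee break",
]
scores = bm25_scores("bm25 ranking", corpus)
```

The scores can be used to re-order results returned by an open search API, provided you have the result text (title or abstract) to score against.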

Page 87: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Evaluation with Click Logs

Page 88: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Evaluating with Clicks

People click on the good results, right?

Page 89: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Not All Results Are Equally Likely to be Looked At

(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Page 90: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Clicks and Views Depend on Rank

[Joachims et al, 2005]

Page 91: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Evaluation from Click Logs

• Show a screenshot and me doing a “skip first”

Read From Top to Bottom

[Joachims et al SIGIR 2005]

Page 92: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Mining Clicks for Ranking

• Clicks can be used to predict

– Pairwise preference

• Query: Doc1, Doc2 [Joachims 2002]

– Absolute relevance

• Taking clicks on other documents into account

• [Carterette and Jones, NIPS 2007]

• [Chapelle and Zhang, WWW 2009]
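
A minimal sketch of the pairwise-preference idea, assuming users read results from top to bottom: a clicked document is taken to be preferred over every unclicked document ranked above it (the heuristic often called "click > skip above"). The function name and document ids are made up for the example.

```python
def skip_above_preferences(ranking, clicked):
    """Return (preferred, less_preferred) pairs inferred from one result page.

    ranking: document ids in the order they were shown.
    clicked: ids of the documents the user clicked.
    """
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Every unclicked document shown above a click was seen and skipped.
        for skipped in ranking[:pos]:
            if skipped not in clicked:
                prefs.append((doc, skipped))
    return prefs


shown = ["d1", "d2", "d3", "d4"]
prefs_single = skip_above_preferences(shown, {"d3"})
prefs_double = skip_above_preferences(shown, {"d1", "d3"})
```

Note that a click on the top result yields no pairs at all, which is one reason rank bias makes raw click counts hard to interpret.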

Page 93: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Interleaving for Learning from Clicks – Pairwise Judgments

• [Joachims, KDD 2002]

• [Radlinski and Joachims, KDD 2007]

• [Radlinski et al, CIKM 2008]

Results from Method 1 Results from Method 2
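
One interleaving scheme from this line of work is team-draft interleaving: the two methods take turns picking their best not-yet-shown result, and each click is credited to the method that contributed the clicked result. The sketch below is an illustrative reading of that scheme, not code from the cited papers; the function names and sample rankings are assumptions.

```python
import random


def team_draft_interleave(list_a, list_b, rng=None):
    """Merge two rankings; return (interleaved list, doc -> contributing team)."""
    rng = rng or random.Random()
    interleaved, team_of = [], {}
    count = {"A": 0, "B": 0}
    sources = {"A": list_a, "B": list_b}

    def next_unused(docs):
        return next((d for d in docs if d not in team_of), None)

    while True:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if count["A"] != count["B"]:
            order = ["A", "B"] if count["A"] < count["B"] else ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            doc = next_unused(sources[team])
            if doc is not None:
                interleaved.append(doc)
                team_of[doc] = team
                count[team] += 1
                break
        else:
            # Neither ranking has an unused document left.
            return interleaved, team_of


def credit_clicks(team_of, clicked_docs):
    """Count clicks per team; the team with more clicks wins the comparison."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins


method_1 = ["d1", "d2", "d3"]
method_2 = ["d2", "d4", "d1"]
interleaved, team_of = team_draft_interleave(method_1, method_2, random.Random(0))
```

The user sees a single list, so neither method gets a presentation advantage, and many such impressions are aggregated before declaring a winner.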

Page 94: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Evaluation using Discounted Cumulative Gain

• Discounted Cumulative Gain (DCG)

• [Järvelin and Kekäläinen 2000]

Highly relevant: Value = 3

Somewhat relevant: Value = 2

Tangentially relevant: Value = 1

Irrelevant: Value = 0

Most important: Value = 1

Less important: Value = 1/log(i)
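
A short sketch of the DCG computation: each result contributes its relevance grade (0 to 3), divided by a rank discount that is 1 at the top of the list and log(i) further down. The logarithm base of 2 is an assumption (the slide does not give one); with base 2, ranks 1 and 2 are both undiscounted.

```python
import math


def dcg(gains, base=2):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        # No discount before rank `base`; afterwards divide by log_base(rank).
        discount = 1.0 if rank < base else math.log(rank, base)
        total += gain / discount
    return total


# Grades for five results, top to bottom: 3 = highly relevant, 0 = irrelevant.
good_order = dcg([3, 2, 1, 0, 0])
bad_order = dcg([0, 0, 1, 2, 3])
```

Putting the highly relevant result first gives a higher DCG than burying it at rank 5, which is the behavior the discount is meant to capture.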

Page 95: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Directly Modeling Relevance From Clicks

[Figure: two result lists, each showing Rank 1 to Rank 5; label "Click count 1"]

Is DCG1 > DCG2?

P(DCG1 > DCG2)

Which ranking of web pages is better for the query “NIPS 2007”?

[Carterette and Jones, NIPS 2007]
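
To make P(DCG1 > DCG2) concrete, here is a toy calculation. It is not the model from the cited paper, which estimates relevance from click data; the sketch simply assumes each document already has a probability distribution over relevance grades, treats documents as independent, and enumerates every grade assignment. All names and numbers are hypothetical.

```python
import math
from itertools import product


def dcg(gains, base=2):
    """Discounted cumulative gain with no discount before rank `base`."""
    return sum(
        gain / (1.0 if rank < base else math.log(rank, base))
        for rank, gain in enumerate(gains, start=1)
    )


def prob_first_better(ranking1, ranking2, grade_probs):
    """P(DCG of ranking1 > DCG of ranking2) under independent grade beliefs.

    grade_probs: doc id -> {grade: probability}, probabilities summing to 1.
    """
    docs = sorted(grade_probs)
    total = 0.0
    for combo in product(*(grade_probs[d].items() for d in docs)):
        # combo holds one (grade, probability) pair per document.
        grades = {d: grade for d, (grade, _) in zip(docs, combo)}
        weight = math.prod(prob for _, prob in combo)
        dcg1 = dcg([grades[d] for d in ranking1])
        dcg2 = dcg([grades[d] for d in ranking2])
        if dcg1 > dcg2:
            total += weight
    return total


# Hypothetical beliefs: "a" is probably highly relevant, "b" marginal, "c" not.
beliefs = {"a": {3: 0.8, 0: 0.2}, "b": {1: 1.0}, "c": {0: 1.0}}
p = prob_first_better(["a", "b", "c"], ["c", "b", "a"], beliefs)
```

Exhaustive enumeration only works for a handful of documents; for longer lists one would sample grade assignments instead.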

Page 96: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Ingredients for Learning from Clicks

• Sufficient users

• Ability to record results shown

• Ability to vary presentation order

• Ability to vary results shown

• Ability to log clicks

• Ability to run experiments

varying your secret sauce

Page 97: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Page 98: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

How to Get Search Engine Results to Modify?

• Radlinski and Joachims

• citeseer/arXiv.org results and permuted rankings, recorded clicks, skip above, skip next

• See also their open source engine Osmot

– http://radlinski.org/osmot/

Page 99: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Query Logs

• Might be in /etc/httpd/logs/access_log*; check httpd.conf

• [IP] - - [Time] "[Method] [URI] [Version]" [Code] [Bytes] "[Referrer]" "[User-Agent]"

– 10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] "GET /search?q=awesome+presentation HTTP/1.1" 200 2940 "http://i_was_referred_from_here.com" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10 Ubiquity/0.1.4"

• Tip: Instrument as much as possible in GET URI via CGI parameters

– search?q=yahoo&region=us&tab=local&device=mobile&advanced=1

– One log, avoid joins; URI must be < 2k bytes

• grep, cut, uniq, wc, sort, cat are your friends

– Ex. Count user query sessions (session key = IP+hour)

– sudo grep '/search?q=' /etc/httpd/logs/access_log.1 | cut -d' ' -f1,4 | cut -d':' -f1,2 | sort -u | wc -l

• For advanced SQL processing on single machine: sqlite3 import script

– http://selinap.com/2008/04/python-parse-apache-log-to-sqlite-database/

• Distributed: Hadoop & Pig

– http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
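
The same session count can be done in a few lines of Python once the log line is parsed into fields. This is a sketch assuming the Apache combined log format shown on this slide and a search endpoint at /search with the query in the q parameter; the regular expression and function names are illustrative, not part of any library.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Apache combined log format, one named group per field.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<version>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the log fields as a dict (plus the search query), or None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    parts = urlsplit(fields["uri"])
    params = parse_qs(parts.query)
    is_search = parts.path == "/search"
    fields["query"] = params.get("q", [None])[0] if is_search else None
    return fields


def count_sessions(lines):
    """Count distinct (IP, hour) pairs among search requests."""
    sessions = set()
    for line in lines:
        fields = parse_line(line)
        if fields and fields["query"] is not None:
            # "08/Jun/2009:21:24:44 -0700" -> "08/Jun/2009:21"
            hour = ":".join(fields["time"].split(":")[:2])
            sessions.add((fields["ip"], hour))
    return len(sessions)


sample = (
    '10.66.91.231 - - [08/Jun/2009:21:24:44 -0700] '
    '"GET /search?q=awesome+presentation HTTP/1.1" 200 2940 '
    '"http://i_was_referred_from_here.com" "Mozilla/5.0 (Macintosh)"'
)
parsed = parse_line(sample)
```

Having the query, referrer, and user agent as named fields makes it easy to go beyond counting, for example grouping reformulations within a session.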

Page 100: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Other Wishlist Items

• A good baseline

– Motivate your users to use your engine

– More fun than reading newspaper stories from 1997

• Evaluate something that is different from ranking

– Summarization

– Information extraction

• Or improve on existing ranking

• NLP tasks “take top results and do X…”

– Data mining

• Pseudo-relevance feedback

Page 101: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Reasons to Build a Demo

“Eat Your Own Dogfood”

algorithm design and testing

- allows you to improve without labeled data

- look closely at the results

- convince your advisor/funders it works!

Observe user behavior

Cheap flight to boston
Cheap flights to boston
Cheap flights
Travelocity
Expedia
American arlines.com
American airlines.com
Americanairlines.com

Puppy
Cute puppy
More cute puppy picutres

Page 102: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

More About Logs and Evaluation in Other Tutorials

• Web Search Engine Metrics (Direct Metrics to Measure User Satisfaction) – Tuesday, 2:00 PM–5:30 PM

– Ali Dasdan, Yahoo! (USA); Kostas Tsioutsiouliklis, Yahoo! (USA); Emre Velipasaoglu, Yahoo! (USA)

• Web Search/Browse Log Mining: Challenges, Methods, and Applications – Today, 9:00 AM–5:30 PM

– Daxin Jiang, Microsoft (China); Jian Pei, Simon Fraser University (Canada); Hang Li, Microsoft (

Page 103: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

What Doesn’t Exist?

• Query log mining tools

– An opportunity for you!

• …

Page 104: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Other Open Source Tools

Page 105: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Lemur Query Log Toolbar

• Research community project for collecting query logs

– Sign up at http://lemurstudy.cs.umass.edu/

• Built and maintained by LTI CMU and CIIR UMass Amherst

• http://www.lemurproject.org/

Page 106: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Book on Hadoop Scale Processing Coming Out

• Ivory: A Hadoop toolkit for Web-scale information retrieval

http://www.umiacs.umd.edu/~jimmylin/ivory/docs/index.html

• Jimmy Lin

Page 107: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Take Home Messages

• You can evaluate with clicks

• You can collect clicks by building a useful / fun search service

• You can create a useful/fun search service using open search APIs

• You can obtain implementations of standard retrieval algorithms from open source search engines

• Modify that code with your new techniques

Page 108: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Pointers - Tools

[1] Indri Homepage. http://www.lemurproject.org/indri/.

[2] Lemur Toolkit Homepage. http://www.lemurproject.org/.

[3] Lucene Homepage. http://jakarta.apache.org/lucene/.

[4] Xapian Code Library Homepage. http://www.xapian.org/.

[5] Zettair Homepage. http://www.seg.rmit.edu.au/zettair/.

[6] Terrier Homepage. http://ir.dcs.gla.ac.uk/terrier/.

[7] Nutch Homepage. http://lucene.apache.org/nutch/.

[8] Sphinx Search Homepage. http://sphinxsearch.com/.

Page 109: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Mashup Resources

• Yahoo Developer Network: developer.yahoo.com

• Y.Q.L. : developer.yahoo.com/yql

• BOSS : developer.yahoo.com/boss

• Bing : bing.com/developers

• Google Search: code.google.com/apis/ajaxsearch/

• App Engine : code.google.com/appengine/

• A.W.S. : aws.amazon.com

• Programmable Web : programmableWeb.com

• Mashable : mashable.com

• Tech Crunch : TechCrunch.com

Page 110: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Acknowledgements

• Vik Singh co-wrote an earlier version of this tutorial

• A few slides from Ricardo Baeza-Yates and Ben Carterette

• Andrew Tomkins, Wei Vivian Zhang, Ahmed Hassan, Eran Palmon for helpful feedback

Page 111: Open Source Search Tools for  conferencesourcesearchtools of Open Search Tools:  Tutorial

Q&A