Page 1: Faceted Search Nycto Talk

New York CTO Club
December 9, 2009

Daniel Tunkelang, Google
Otis Gospodnetić, Sematext

Faceted Search

Page 2: Faceted Search Nycto Talk

Agenda

Daniel:
- What is faceted search?
- Why use faceted search?
- Thoughts about design and user experience.

Otis:
- What are Lucene and Solr?
- Why use an open-source search library?
- Thoughts about implementation.

Page 3: Faceted Search Nycto Talk

“Regular” Search

Interface:
- User expresses information need as short query.
- Search engine returns ranked, pageable result set.

User happy when...
- Top-ranked result satisfies information need.
- At least some result on first page is relevant.

User unhappy when...
- No result on first page satisfies information need.
- Results misleadingly appear relevant (bait and switch).

Page 4: Faceted Search Nycto Talk

Relevance Is Subjective

"Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance."

William Goffman, On relevance as a measure, 1964.

Page 5: Faceted Search Nycto Talk

Regular Search Experience

Page 6: Faceted Search Nycto Talk

Assumptions Are Dangerous

- self-awareness
- self-expression
- model knows best
- answer is a document
- one-shot query

tf-idf
PageRank

Page 7: Faceted Search Nycto Talk

What is Faceted Search?

- Best understood through examples.
  - See the following slides.
  - Or shop on almost any ecommerce site.
- Facets = multiple ways to organize information.
  - Often based on available structured information.
  - But not always, e.g., facets obtained via text mining.
- Typical interaction:
  - User starts with a full-text search.
  - Facets guide query refinement process.

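The typical interaction above (full-text search first, then facet-guided refinement) can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the catalog, the field names, and the helpers `search` and `facet_counts` are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical toy catalog; field names are assumptions for illustration.
DOCS = [
    {"title": "red running shoes", "brand": "Acme", "color": "red"},
    {"title": "blue running shoes", "brand": "Acme", "color": "blue"},
    {"title": "red hiking boots", "brand": "Summit", "color": "red"},
    {"title": "red dress shoes", "brand": "Summit", "color": "red"},
]


def search(docs, text, filters=None):
    """Return docs whose title contains every query word and match all filters."""
    words = text.lower().split()
    filters = filters or {}
    return [
        d for d in docs
        if all(w in d["title"].split() for w in words)
        and all(d[field] == value for field, value in filters.items())
    ]


def facet_counts(results, field):
    """Count how many results carry each value of the given facet field."""
    return Counter(d[field] for d in results)


# Step 1: the user starts with a full-text search.
results = search(DOCS, "shoes")
# Step 2: facet counts over the result set suggest possible refinements.
print(facet_counts(results, "color"))
# Step 3: the user picks a facet value, which narrows the result set.
refined = search(DOCS, "shoes", filters={"color": "red"})
print([d["title"] for d in refined])
```

A real engine computes the counts inside the index rather than by scanning documents, but the contract is the same: counts are always relative to the current result set.
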
Page 8: Faceted Search Nycto Talk

Faceted Search for News

Page 9: Faceted Search Nycto Talk

Faceted Search for People

Page 10: Faceted Search Nycto Talk

Faceted Search for Breakfast

Page 11: Faceted Search Nycto Talk
Page 12: Faceted Search Nycto Talk

But Facets are Not a Silver Bullet...

- Screen real estate is finite.
  - Choose facets wisely.
  - Choose facet values wisely for monster facets.
- Multiple selection within a facet is powerful, but...
  - Has to be intuitive, especially AND vs. OR.
  - Even trickier for hierarchical facets.
- Search relevance still matters!
  - Most faceted search applications rank results.
  - Irrelevant results lead to irrelevant facet refinements.

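The AND vs. OR question for multiple selection can be made concrete with a small sketch. It assumes one common convention (OR within a facet, AND across facets) and contrasts it with AND within a facet; the data and the function names `matches_any` and `matches_all` are hypothetical.

```python
# Hypothetical documents with multivalued facet fields.
DOCS = [
    {"id": 1, "color": ["red"], "brand": ["Acme"]},
    {"id": 2, "color": ["blue"], "brand": ["Acme"]},
    {"id": 3, "color": ["red", "blue"], "brand": ["Summit"]},
]


def matches_any(doc, selections):
    """OR within a facet, AND across facets."""
    return all(
        set(doc.get(field, [])) & chosen
        for field, chosen in selections.items()
    )


def matches_all(doc, selections):
    """AND within a facet: the doc must carry every chosen value."""
    return all(
        chosen <= set(doc.get(field, []))
        for field, chosen in selections.items()
    )


# The user ticks two colors and one brand.
selections = {"color": {"red", "blue"}, "brand": {"Acme"}}
print([d["id"] for d in DOCS if matches_any(d, selections)])

# Same two colors, interpreted as "must be both red and blue".
both_colors = {"color": {"red", "blue"}}
print([d["id"] for d in DOCS if matches_all(d, both_colors)])
```

The two interpretations return different documents for the same clicks, which is why the interface has to make the chosen semantics obvious.
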
Page 13: Faceted Search Nycto Talk

Exploring Information Science

Page 14: Faceted Search Nycto Talk

Deliver Precision and Recall

Easier said than done!

Ranking of facet values is an open research topic.

Page 15: Faceted Search Nycto Talk

Be Careful with Faceted Search!

Cameras have artists?!

Page 16: Faceted Search Nycto Talk

Clarify, Then Refine

Page 17: Faceted Search Nycto Talk

Take-Aways

- Faceted search addresses the subjectivity of relevance and information overload.
- But deploying faceted search effectively requires that you think about user experience.
- Recommended reading:
  - My thin book entitled Faceted Search
  - Marti Hearst's book on Search User Interfaces
  - Peter Morville's upcoming book on Search Patterns

Page 18: Faceted Search Nycto Talk

Otis Gospodnetić, Sematext

Faceted Search with Lucene & Solr

Page 19: Faceted Search Nycto Talk

What is / isn't Lucene

- Free, ASL, Java IR library, JAR
- Doug Cutting, ASF, 2001
- Application agnostic: Indexing & Searching
- High performance, scalable
- No dependencies
- Heavily ported
- No: crawler, rich doc parser, turn-key solution
- No: out-of-the-box faceted search capability... but...

Page 20: Faceted Search Nycto Talk
Page 21: Faceted Search Nycto Talk

What is/isn't Solr

- Indexing/Search server with HTTP API built on top of Lucene
- Fast & scalable (distributed search, index replication)
- XML, JSON, Ruby, Perl, PHP, javabin
- No: crawler (but Nutch ==> Solr works)
- Yes: rich text parser
- Yes: Faceted Search out of the box!

Page 22: Faceted Search Nycto Talk

Solr and Faceted Search

- 3 types of facets: Field Values (text), Dates, Queries.
- “Text”: return counts for all/top terms in a field for a result set, e.g. categories à la Amazon
- Dates: return counts for docs in specified date ranges
- Queries: return counts for docs that also match a given query; handy for number ranges (think prices!)

Page 23: Faceted Search Nycto Talk

Facet Field Requirements

- Must be indexed
- Often not tokenized
- Often not altered (lowercase, punctuation)
- Storing not required
- Multivalued fields OK
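
Why facet fields are often left untokenized and unaltered can be shown without Solr at all. The sketch below uses hypothetical category values, and `analyze` is a crude stand-in for a real analyzer (lowercasing plus whitespace splitting), not Solr's actual analysis chain.

```python
from collections import Counter

# Hypothetical category values as they might appear in three documents.
CATEGORIES = ["Digital Cameras", "Digital Cameras", "Camera Bags"]


def analyze(value):
    """Crude stand-in for a tokenizing, lowercasing analyzer."""
    return value.lower().split()


# Faceting on the raw, untokenized value gives the labels users expect.
untokenized = Counter(CATEGORIES)

# Faceting on a tokenized field counts individual terms instead.
tokenized = Counter(term for value in CATEGORIES for term in analyze(value))

print(untokenized)
print(tokenized)
```

Because facet values are the indexed terms, a tokenized field yields facet values such as "digital" and "bags" rather than whole category names.
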

Page 24: Faceted Search Nycto Talk

Turn It On

- 0 facets: http://host:80/solr/select?q=foo
- 1 facet: http://host:80/solr/select?q=foo&facet=true&facet.field=category
- N facets: http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
- facet=true or facet=on
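
A client would normally assemble these URLs rather than hand-write them. The following is a minimal sketch using only the Python standard library; the helper name `facet_url` is hypothetical, and the host is the same placeholder used on the slide.

```python
from urllib.parse import urlencode

# Placeholder host and path, mirroring the URLs on the slide.
BASE_URL = "http://host:80/solr/select"


def facet_url(query, facet_fields=()):
    """Build a select URL; facet.field is repeated once per requested facet."""
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return BASE_URL + "?" + urlencode(params)


print(facet_url("foo"))
print(facet_url("foo", ["category"]))
print(facet_url("foo", ["category", "inStock"]))
```

Passing a list of pairs to `urlencode` (rather than a dict) is what allows the `facet.field` key to appear more than once.
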

Page 25: Faceted Search Nycto Talk

Text Facet Response

<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>

- facet.mincount=1 to avoid 0-count facet values
- facet.limit=N to limit to top N facet values
- facet.missing=true to catch uncategorized
- lots of other options!
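
On the client side, a response like the one above can be turned into a plain dictionary. This sketch uses the standard-library XML parser; the function name `parse_facet_fields` is hypothetical, the excerpt is wrapped in a `<response>` root element so it parses as a single document, and the `mincount` argument only imitates on the client what `facet.mincount` does on the server.

```python
import xml.etree.ElementTree as ET

# The slide's excerpt, wrapped in a <response> root so it is one XML document.
XML = """<response>
<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>
</response>"""


def parse_facet_fields(xml_text, mincount=0):
    """Return {field: {value: count}}, dropping values below mincount."""
    root = ET.fromstring(xml_text)
    fields = root.find("lst[@name='facet_counts']/lst[@name='facet_fields']")
    return {
        field.get("name"): {
            item.get("name"): int(item.text)
            for item in field.findall("int")
            if int(item.text) >= mincount
        }
        for field in fields.findall("lst")
    }


print(parse_facet_fields(XML))
print(parse_facet_fields(XML, mincount=1))
```
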

Page 26: Faceted Search Nycto Talk

Date Facets

- http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
- (%2B is the URL-encoded "+", so %2B1DAY ==> +1DAY)
- Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY
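
The `%2B` escaping is easy to get wrong by hand, because a literal `+` in a query string is read as a space. The sketch below lets the standard library do the encoding; the host is a placeholder (the slide elides it), and the `safe` argument is only there to keep `*`, `:` and `/` readable.

```python
from urllib.parse import urlencode

# Placeholder host; the slide elides the real one.
BASE_URL = "http://host:80/solr/select/"

params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    ("facet.date", "timestamp"),
    ("facet.date.start", "NOW/DAY-5DAYS"),
    ("facet.date.end", "NOW/DAY+1DAY"),
    ("facet.date.gap", "+1DAY"),
]

# "+" is not in the safe set, so it is escaped as %2B automatically.
url = BASE_URL + "?" + urlencode(params, safe="*:/")
print(url)
```
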

Page 27: Faceted Search Nycto Talk

Date Facet Response

<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
  <lst name="facet_dates">
    <lst name="timestamp">
      <int name="2007-08-11T00:00:00.000Z">1</int>
      <int name="2007-08-12T00:00:00.000Z">5</int>
      <int name="2007-08-13T00:00:00.000Z">3</int>
      <int name="2007-08-14T00:00:00.000Z">7</int>
      <int name="2007-08-15T00:00:00.000Z">2</int>
      <int name="2007-08-16T00:00:00.000Z">16</int>
      <str name="gap">+1DAY</str>
      <date name="end">2007-08-17T00:00:00Z</date>
    </lst>
  </lst>
</lst>

Page 28: Faceted Search Nycto Talk

Query Facets

- http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
- Avoids the bucket-at-index-time work-around
- Keep queries disjoint
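
Square-bracket ranges include both endpoints, so two ranges that share a boundary value would both count a document priced exactly at that value. One way to keep range queries disjoint, assuming integer-valued prices, is to generate them from a list of boundaries. The helper `price_range_queries` is a hypothetical name, and the field name `price` is taken from the slide.

```python
def price_range_queries(boundaries, field="price"):
    """Build non-overlapping inclusive range queries for integer values.

    Each boundary starts a new bucket; the previous bucket ends one below it.
    """
    queries = []
    lower = "*"
    for boundary in boundaries:
        queries.append(f"{field}:[{lower} TO {boundary - 1}]")
        lower = boundary
    queries.append(f"{field}:[{lower} TO *]")
    return queries


for query in price_range_queries([500, 1000]):
    print("facet.query=" + query)
```

With non-integer prices the "one below" trick does not apply, and the bucket edges need a different treatment.
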

Page 29: Faceted Search Nycto Talk

Query Facet Response

<result numFound="3" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="price:[* TO 500]">3</int>
    <int name="price:[500 TO *]">1</int>
  </lst>
  <lst name="facet_fields">
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>

Page 30: Faceted Search Nycto Talk

UI Integration

- Use Filter Queries via fq
- http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]
- http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]&fq=inStock:true
- Important: single request does it all
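
In a UI, each facet value the user clicks becomes one more `fq` parameter on the next request. The sketch below shows that accumulation; `refine_url` is a hypothetical helper, the host is a placeholder, and brackets and colons are left unescaped only to mirror the style of the slide's URLs.

```python
from urllib.parse import urlencode

# Placeholder host; the slide elides the real one.
BASE_URL = "http://host:80/solr/select"


def refine_url(query, facet_fields, selected_filters):
    """One request carries the query, the facets to count, and every active filter."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", name) for name in facet_fields]
    params += [("fq", fq) for fq in selected_filters]
    # Spaces become "+"; ":", "[" and "]" are kept readable.
    return BASE_URL + "?" + urlencode(params, safe=":[]")


# First click: a price range. Second click: in-stock items only.
print(refine_url("shoes", ["category"], ["price:[0 TO 300]"]))
print(refine_url("shoes", ["category"], ["price:[0 TO 300]", "inStock:true"]))
```

Each response then contains both the filtered results and the facet counts recomputed for them, which is what lets a single request drive the whole page.
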

Page 31: Faceted Search Nycto Talk

State of Lucene & Solr

- Super healthy community, exploding development
- Lucene 3.0 – 2009-11-25:
  - Performance, faster range queries, clean API, better Unicode support, more non-English support
- Solr 1.4 – 2009-11-10:
  - Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...

Page 32: Faceted Search Nycto Talk

Lucene, Solr, Enterprise

- Free: Community
  - Lucene ~600 emails/month (dev: 2000/month)
  - Solr ~1300 emails/month (dev: 800/month)
- Commercial: Support Subscriptions
  - Sematext
  - Lucid Imagination