Faceted Search Nycto Talk

New York CTO Club, December 9, 2009. Daniel Tunkelang, Google; Otis Gospodnetić, Sematext. Faceted Search

description

These slides were used for a presentation by Daniel Tunkelang (Google) and Otis Gospodnetić (Sematext) at the New York CTO Club on December 9, 2009.

Faceted Search

People come to your site to get the information they need by exploring, discovering, and making comparisons. You want them to sift through all of your content quickly and effectively. The traditional approach of providing a search box and a ranked list of results can frustrate users, who need more guidance to find what they are looking for, or even to know whether the information is available.

Enter faceted search. Faceted search enables users to navigate a multi-dimensional information space by combining text search with a progressive narrowing of choices in each dimension. This technique has become ubiquitous in online retail, and is increasingly popular in other domains, both on the public internet and on intranets.

This talk will review the basic concepts of faceted search, and then dive into some of the subtler concerns. Specifically, we will elaborate on both the design and implementation concerns that determine whether a faceted search deployment will be successful.

Our own Daniel Tunkelang co-founded Endeca, a pioneer in faceted search, and worked there for 10 years before recently moving to Google. In addition to building the world's leading commercial technology for faceted search, he has played an active role in engaging the broader community of researchers and practitioners to advance understanding of this field. These efforts include organizing an annual workshop on human-computer information retrieval and publishing a textbook on faceted search.

Otis Gospodnetić is the co-founder of Sematext, a Lucene expert, co-author of Lucene in Action and the upcoming Solr in Action, and a long-time Lucene and Solr developer with over 10 years of experience in search and related technologies. Sematext implements open-source search, linguistic, and text analytics technology in the enterprise. They focus on the development of scalable and high-performance search solutions.

Transcript of Faceted Search Nycto Talk

Page 1: Faceted Search Nycto Talk

New York CTO Club, December 9, 2009

Daniel Tunkelang, Google; Otis Gospodnetić, Sematext

Faceted Search

Page 2: Faceted Search Nycto Talk

Agenda

Daniel:
• What is faceted search?
• Why use faceted search?
• Thoughts about design and user experience.

Otis:
• What are Lucene and Solr?
• Why use an open-source search library?
• Thoughts about implementation.

Page 3: Faceted Search Nycto Talk

“Regular” Search

Interface:
• User expresses information need as short query.
• Search engine returns ranked, pageable result set.

User happy when...
• Top-ranked result satisfies information need.
• At least some result on first page is relevant.

User unhappy when...
• No result on first page satisfies information need.
• Results misleadingly appear relevant (bait and switch).

Page 4: Faceted Search Nycto Talk

Relevance Is Subjective

"Relevance is defined as a measure of information conveyed by a document relative to a query. It is shown that the relationship between the document and the query, though necessary, is not sufficient to determine relevance."

William Goffman, On relevance as a measure, 1964.

Page 5: Faceted Search Nycto Talk

Regular Search Experience

Page 6: Faceted Search Nycto Talk

Assumptions Are Dangerous

• self-awareness
• self-expression
• model knows best
• answer is a document
• one-shot query

tf-idf, PageRank

Page 7: Faceted Search Nycto Talk

What is Faceted Search?

• Best understood through examples.
  - See the following slides.
  - Or shop on almost any ecommerce site.
• Facets = multiple ways to organize information.
  - Often based on available structured information.
  - But not always, e.g., facets obtained via text mining.
• Typical interaction:
  - User starts with a full-text search.
  - Facets guide query refinement process.

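The typical interaction above (full-text search first, then facet-guided refinement) can be sketched in a few lines of Python. This is a toy in-memory illustration, not how any particular engine implements it; the catalog, the field names, and the functions `search` and `facet_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy catalog; field names are illustrative only.
DOCS = [
    {"title": "red running shoes", "brand": "Acme", "color": "red"},
    {"title": "blue running shoes", "brand": "Acme", "color": "blue"},
    {"title": "red hiking boots", "brand": "Bolt", "color": "red"},
    {"title": "blue rain jacket", "brand": "Bolt", "color": "blue"},
]


def search(docs, text, filters=None):
    """Return docs whose title contains every query word and that
    match every selected facet value in `filters`."""
    words = text.lower().split()
    filters = filters or {}
    return [
        d for d in docs
        if all(w in d["title"].split() for w in words)
        and all(d[field] == value for field, value in filters.items())
    ]


def facet_counts(results, fields):
    """Count how often each facet value occurs in the current result set."""
    return {f: Counter(d[f] for d in results) for f in fields}


# Step 1: the user starts with a full-text search.
results = search(DOCS, "shoes")
# Step 2: facet counts over the result set suggest refinements.
counts = facet_counts(results, ["brand", "color"])
# Step 3: the user picks a facet value and the result set narrows.
refined = search(DOCS, "shoes", {"color": "red"})
```

The counts are always computed over the current result set, which is what lets the facets guide the next refinement step.
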
Page 8: Faceted Search Nycto Talk

Faceted Search for News

Page 9: Faceted Search Nycto Talk

Faceted Search for People

Page 10: Faceted Search Nycto Talk

Faceted Search for Breakfast

Page 11: Faceted Search Nycto Talk
Page 12: Faceted Search Nycto Talk

But Facets are Not a Silver Bullet...

• Screen real estate is finite.
  - Choose facets wisely.
  - Choose facet values wisely for monster facets.
• Multiple selection within a facet is powerful, but...
  - Has to be intuitive, especially AND vs. OR.
  - Even trickier for hierarchical facets.
• Search relevance still matters!
  - Most faceted search applications rank results.
  - Irrelevant results lead to irrelevant facet refinements.

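To make the AND vs. OR point concrete, here is a minimal sketch of the two interpretations of a multiple selection within one multi-valued facet. The documents, the `tags` field, and the `select` function are hypothetical and exist only for illustration.

```python
# Hypothetical documents with a multi-valued "tags" facet.
DOCS = [
    {"id": 1, "tags": {"wifi", "pool"}},
    {"id": 2, "tags": {"wifi"}},
    {"id": 3, "tags": {"pool", "gym"}},
]


def select(docs, field, chosen, mode):
    """Return ids of docs matching a multi-value selection in one facet.

    mode="OR":  a doc matches if it has at least one chosen value.
    mode="AND": a doc matches only if it has every chosen value.
    """
    chosen = set(chosen)
    if mode == "OR":
        return [d["id"] for d in docs if d[field] & chosen]
    if mode == "AND":
        return [d["id"] for d in docs if chosen <= d[field]]
    raise ValueError("mode must be 'AND' or 'OR'")


or_ids = select(DOCS, "tags", ["wifi", "pool"], "OR")
and_ids = select(DOCS, "tags", ["wifi", "pool"], "AND")
```

The same two clicks give three results under OR and one under AND, which is why the interface has to make the chosen semantics obvious to the user.
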
Page 13: Faceted Search Nycto Talk

Exploring Information Science

Page 14: Faceted Search Nycto Talk

Deliver Precision and Recall

Easier said than done!

Ranking of facet values is an open research topic.
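
For reference, precision and recall for a single query can be computed as below. This is the standard set-based definition; the function name `precision_recall` and the sample ids are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, three relevant overall, two of them found.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
```

Narrowing by a facet value typically raises precision at the cost of recall, which is one way to think about why the choice and ordering of facet values matters.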

Page 15: Faceted Search Nycto Talk

Be Careful with Faceted Search!

Cameras have artists?!

Page 16: Faceted Search Nycto Talk

Clarify, Then Refine

Page 17: Faceted Search Nycto Talk

Take-Aways

• Faceted search addresses the subjectivity of relevance and information overload.
• But deploying faceted search effectively requires that you think about user experience.
• Recommended reading:
  - My thin book entitled Faceted Search
  - Marti Hearst's book on Search User Interfaces
  - Peter Morville's upcoming book on Search Patterns

Page 18: Faceted Search Nycto Talk

Otis Gospodnetić, Sematext

Faceted Search with Lucene & Solr

Page 19: Faceted Search Nycto Talk

What is / isn't Lucene

• Free, ASL, Java IR library, JAR
• Doug Cutting, ASF, 2001
• Application agnostic: indexing & searching
• High performance, scalable
• No dependencies
• Heavily ported
• No: crawler, rich doc parser, turn-key solution
• No: out-of-the-box faceted search capability... but...

Page 20: Faceted Search Nycto Talk
Page 21: Faceted Search Nycto Talk

What is/isn't Solr

• Indexing/search server with HTTP API built on top of Lucene
• Fast & scalable (distributed search, index replication)
• XML, JSON, Ruby, Perl, PHP, javabin
• No: crawler (but Nutch ==> Solr works)
• Yes: rich text parser
• Yes: faceted search out of the box!

Page 22: Faceted Search Nycto Talk

Solr and Faceted Search

• 3 types of facets: field values (text), dates, queries.
• “Text”: return counts for all/top terms in a field for a result set, e.g. categories a la Amazon
• Dates: return counts for docs in specified date ranges
• Queries: return counts for docs that also match a given query; handy for number ranges (think prices!)

Page 23: Faceted Search Nycto Talk

Facet Field Requirements

• Must be indexed
• Often not tokenized
• Often not altered (lowercase, punctuation)
• Storing not required
• Multivalued fields OK
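
A small sketch of why facet fields are often left untokenized and unaltered: facet values are the indexed terms, so analysis changes what the user sees. The sample values and the whitespace-plus-lowercase "analysis" below are stand-ins chosen for illustration, not Solr's actual analyzers.

```python
from collections import Counter

# Hypothetical category values as they appear on three documents.
values = ["Digital Cameras", "Digital Cameras", "Camera Accessories"]


def facet_untokenized(values):
    """Each whole field value is one facet value."""
    return Counter(values)


def facet_tokenized(values):
    """Lowercase and split on whitespace, as a simple analyzer might;
    every resulting token becomes its own facet value."""
    return Counter(tok for v in values for tok in v.lower().split())


whole = facet_untokenized(values)
tokens = facet_tokenized(values)
```

With the untokenized field the user sees "Digital Cameras (2)"; with the tokenized field the facet list shows fragments such as "digital (2)" and "camera (1)", which are rarely useful as refinements.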

Page 24: Faceted Search Nycto Talk

Turn It On

• 0 facets: http://host:80/solr/select?q=foo
• 1 facet: http://host:80/solr/select?q=foo&facet=true&facet.field=category
• N facets: http://host:80/solr/select?q=foo&facet=true&facet.field=category&facet.field=inStock
• facet=true or facet=on
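
The request URLs above can be assembled programmatically. The sketch below uses only Python's standard library; the helper name `facet_url` is an assumption, and the host is the placeholder from the slide. No request is sent.

```python
from urllib.parse import urlencode


def facet_url(base, query, facet_fields):
    """Build a Solr select URL, repeating facet.field once per facet."""
    params = [("q", query)]
    if facet_fields:
        params.append(("facet", "true"))
        # A list of pairs lets the same parameter name appear many times.
        params.extend(("facet.field", f) for f in facet_fields)
    return base + "?" + urlencode(params)


BASE = "http://host:80/solr/select"  # placeholder host from the slide
no_facets = facet_url(BASE, "foo", [])
two_facets = facet_url(BASE, "foo", ["category", "inStock"])
```

Passing a list of pairs rather than a dict is the design choice that matters here, since a dict cannot hold two `facet.field` entries.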

Page 25: Faceted Search Nycto Talk

Text Facet Response

<result numFound="4" start="0"/>
<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="category">
      <int name="electronics">3</int>
      <int name="copier">0</int>
    </lst>
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>

• facet.mincount=1 to avoid 0-count facet values
• facet.limit=N to limit to top N facet values
• facet.missing=true to catch uncategorized
• lots of other options!
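
A client might read the text facet response like this. The sketch wraps the fragment from the slide in a `<response>` root element so that it parses as one document; that wrapper, the function name `parse_facet_fields`, and the client-side `mincount` filter (which only imitates the effect of facet.mincount) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in an assumed <response> root element.
SAMPLE = """<response>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="category">
        <int name="electronics">3</int>
        <int name="copier">0</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>"""


def parse_facet_fields(xml_text, mincount=0):
    """Return {field: {value: count}}, dropping counts below mincount."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    fields = {}
    for field in root.findall(path):
        fields[field.get("name")] = {
            e.get("name"): int(e.text)
            for e in field.findall("int")
            if int(e.text) >= mincount
        }
    return fields


all_counts = parse_facet_fields(SAMPLE)
nonzero = parse_facet_fields(SAMPLE, mincount=1)
```

In practice it is cheaper to let the server drop zero-count values with facet.mincount=1 than to filter them on the client.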

Page 26: Faceted Search Nycto Talk

Date Facets

• http://.../solr/select/?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=NOW/DAY-5DAYS&facet.date.end=NOW/DAY%2B1DAY&facet.date.gap=%2B1DAY
• (%2B is the URL-encoded form of +, so %2B1DAY ==> +1DAY)
• Solr Date Math Parser syntax: /HOUR, +2YEARS, -1DAY, /DAY+6MONTHS+3DAYS, +6MONTHS+3DAYS/DAY
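
Hand-encoding the plus sign is easy to get wrong, because a literal + in a query string decodes to a space. A sketch that lets Python's standard library do the encoding follows; the helper name `date_facet_url` and the base URL are assumptions, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def date_facet_url(base, field, start, end, gap):
    """Build a date-facet request; urlencode escapes '+' as %2B."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.date", field),
        ("facet.date.start", start),
        ("facet.date.end", end),
        ("facet.date.gap", gap),
    ]
    return base + "?" + urlencode(params)


url = date_facet_url(
    "http://host:80/solr/select",  # placeholder host
    "timestamp", "NOW/DAY-5DAYS", "NOW/DAY+1DAY", "+1DAY",
)
# Decoding the query string recovers the original date math expressions.
decoded = parse_qs(urlsplit(url).query)
```

Note that urlencode also escapes characters such as / and * that the hand-written URL leaves as-is; both forms decode to the same parameter values.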

Page 27: Faceted Search Nycto Talk

Date Facet Response

<result name="response" numFound="42" start="0"/>
<lst name="facet_counts">
  <lst name="facet_dates">
    <lst name="timestamp">
      <int name="2007-08-11T00:00:00.000Z">1</int>
      <int name="2007-08-12T00:00:00.000Z">5</int>
      <int name="2007-08-13T00:00:00.000Z">3</int>
      <int name="2007-08-14T00:00:00.000Z">7</int>
      <int name="2007-08-15T00:00:00.000Z">2</int>
      <int name="2007-08-16T00:00:00.000Z">16</int>
      <str name="gap">+1DAY</str>
      <date name="end">2007-08-17T00:00:00Z</date>
    </lst>
  </lst>
</lst>
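
Conceptually, a date facet counts documents per fixed-width bucket between a start and an end. The sketch below imitates that in memory; it is not Solr's implementation, the function name `date_buckets` and the sample timestamps are hypothetical, and treating each bucket as including its lower bound and excluding its upper bound is an assumption made here.

```python
from datetime import datetime, timedelta


def date_buckets(timestamps, start, end, gap):
    """Count timestamps per bucket [lower, lower + gap) from start to end.

    Keys are the ISO-formatted lower bound of each bucket.
    """
    counts = {}
    lower = start
    while lower < end:
        upper = min(lower + gap, end)
        counts[lower.isoformat()] = sum(
            1 for t in timestamps if lower <= t < upper
        )
        lower = upper
    return counts


# Hypothetical document timestamps.
stamps = [
    datetime(2007, 8, 11, 10, 0),
    datetime(2007, 8, 12, 1, 0),
    datetime(2007, 8, 12, 23, 59),
]
buckets = date_buckets(
    stamps, datetime(2007, 8, 11), datetime(2007, 8, 13), timedelta(days=1)
)
```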

Page 28: Faceted Search Nycto Talk

Query Facets

• http://.../solr/select?q=shoes&rows=0&facet=true&facet.field=inStock&facet.query=price:[*+TO+500]&facet.query=price:[500+TO+*]
• Avoids the bucket-at-index-time work-around
• Keep queries disjoint
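
Square-bracket range queries include both endpoints, so price:[* TO 500] and price:[500 TO *] both count a document priced at exactly 500. One way to keep the buckets disjoint, assuming whole-number prices, is to end each range one below the next boundary. The helper name `price_range_queries` is hypothetical.

```python
def price_range_queries(field, boundaries):
    """Build disjoint inclusive range queries for an integer field.

    boundaries must be sorted ascending. For [100, 500] the buckets are
    [* TO 99], [100 TO 499], and [500 TO *].
    """
    queries = []
    lower = "*"
    for b in boundaries:
        # End this bucket one below the boundary so no value is shared.
        queries.append(f"{field}:[{lower} TO {b - 1}]")
        lower = b
    queries.append(f"{field}:[{lower} TO *]")
    return queries


two_buckets = price_range_queries("price", [500])
three_buckets = price_range_queries("price", [100, 500])
```

Each string can be passed as one facet.query parameter. For fractional prices this integer trick does not apply, and the boundaries would need a different treatment.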

Page 29: Faceted Search Nycto Talk

Query Facet Response

<result numFound="3" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries">
    <int name="price:[* TO 500]">3</int>
    <int name="price:[500 TO *]">1</int>
  </lst>
  <lst name="facet_fields">
    <lst name="inStock">
      <int name="false">3</int>
      <int name="true">1</int>
    </lst>
  </lst>
</lst>

Page 30: Faceted Search Nycto Talk

UI Integration

• Use filter queries via fq
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]
• http://.../solr/select?q=shoes&facet=true&facet.field=category&fq=price:[0 TO 300]&fq=inStock:true
• Important: single request does it all
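
In a UI, each facet value the user clicks typically becomes one more fq parameter on the next request. A minimal sketch of that mapping, using only the standard library; the helper name `refine_url` and the base URL are assumptions, and no request is sent.

```python
from urllib.parse import urlencode, parse_qsl


def refine_url(base, query, facet_fields, selections):
    """Build one request that returns results and facet counts together.

    selections is a list of (field, value) pairs the user has clicked;
    each becomes its own fq filter, and the filters are intersected.
    """
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    params += [("fq", f"{field}:{value}") for field, value in selections]
    return base + "?" + urlencode(params)


url = refine_url(
    "http://host:80/solr/select",  # placeholder host
    "shoes",
    ["category"],
    [("price", "[0 TO 300]"), ("inStock", "true")],
)
# Decode the query string to check what the server would receive.
sent = parse_qsl(url.split("?", 1)[1])
```

Because the facet counts come back in the same response as the filtered results, the UI can redraw both the result list and the refinement links after every click.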

Page 31: Faceted Search Nycto Talk

State of Lucene & Solr

• Super healthy community, exploding development
• Lucene 3.0 (2009-11-25):
  - Performance, faster range queries, clean API, better Unicode support, more non-English support
• Solr 1.4 (2009-11-10):
  - Performance, new replication, DB indexing, rich-doc indexing, results clustering, faster response protocol, deduplication...

Page 32: Faceted Search Nycto Talk

Lucene, Solr, Enterprise

• Free: community
  - Lucene: ~600 emails/month (dev: 2000/month)
  - Solr: ~1300 emails/month (dev: 800/month)
• Commercial: support subscriptions
  - Sematext
  - Lucid Imagination