Best practices in museum search


Slides from my workshop at MCN2011

Transcript of Best practices in museum search

"Best Practices" inMuseum Search

.. in my (researched) opinion

Nate Solas
#mcn2011 #search
@homebrewer
nate.solas@walkerart.org
http://bit.ly/mcn2011search

Search is hard...

... shouldn't we just leave this to Google?

"Leave it to Google" IS a best practice!

For them, it's a solved problem. They have absolutely solved searching for content on websites, especially a finite domain like a museum website.

http://www.powerhousemuseum.com/search/index.php?cx=018242116655519399236%3A4srvv8yns7w&q=blue&sa=&cof=FORID%3A11&siteurl=www.powerhousemuseum.com%2Fvisit%2F

http://www.tate.org.uk/search/default.jsp?q=blue
http://www.brooklynmuseum.org/
http://www.amnh.org/
http://si.edu/ (GSA)

<title>We can do more to help</title>

<article>

Mark up the content! Google indexes ALL the words, so all of our nav, advertising, footer... If we don't indicate what the "content" is, it's all fair game. (Sort of. They're actually smarter than that.)

<sidebar>Meta tags (OG), RDFa, valid HTML5 markup, etc.</sidebar>

</article>

Internal search: yes

• We (should) know the most about our content, so we know:
  o how to suggest things
  o how to interpret queries in context (run the search)
  o how to present things to make sense

It's no longer just a 'web page'!

• We (should) have the content as discrete pieces of metadata: title, date, body, author, etc.
  o We can therefore index just the content, none of the other chrome on the page (see the sketch below).
  o Facets: we can use this metadata to drill down.
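If the content lives as fields rather than whole pages, indexing it looks roughly like the sketch below. This is only an illustration against Solr (the tool this deck gets to later); the host, core name, and field names (title, date, body, author) are placeholders, not anyone's real schema.

import json
import urllib.request

# Hypothetical Solr core and field names -- adjust to your own schema.
SOLR_UPDATE = "http://localhost:8983/solr/collection1/update/json?commit=true"

doc = {
    "id": "event-1234",
    "title": "Merce Cunningham Dance Company",
    "date": "2011-11-16T19:00:00Z",
    "author": "Walker Art Center",
    "body": "Just the content: no nav, no ads, no footer chrome.",
}

req = urllib.request.Request(
    SOLR_UPDATE,
    data=json.dumps([doc]).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))

Because title, date, author, etc. are separate fields, the same index can drive facets and field-weighted relevance later on.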

Phases of search:

... let's just look at three parts:

the query, results, & dead ends

Search box, top right. Done. (Powerhouse Museum has it bottom left, but they're in Australia, so this makes sense. ;)

• If there's text in the box ("search"), clear it when they click in!

• Autocomplete / suggest isn't really common (yet), but seems very useful where it shows up.
  o Three strategies I see:
    – Suggest page ("live search taxonomy") (http://www.imamuseum.org/)
    – Suggest tag/title (http://www.vam.ac.uk/)
    – Suggest phrase from full corpus (http://beta.walkerart.org/ (beta))

The Query

Suggest / Autocomplete

Full text autocomplete is sort of the holy grail, IMO, but we can't be as smart as Google.

IMA does "live search" (auto-suggest) instead of autocomplete, very useful but it doesn't help me spell Lichtenstein.

The real point is to eliminate dead ends.

Suggest / Autocomplete

Results

Questions your result page should answer immediately:

1. What are these things?
– Why did they match (and why in that order?)
– Was I understood / can I try again easily?

Finally:
• What's next?
  o try some results, or
  o narrow (refine) the search, or
  o broaden the search

WHAT are these things?

Mixed results ("All")

MOMA gets it:
• http://www.moma.org/search?query=blue
  o Full breadcrumb, excerpt, title, media if they have it.

This is confusing at first:
• http://www.metmuseum.org/search-results?ft=blue

Separate results

V&A splits into sections• http://www.vam.ac.uk/contentapi/search/?q=blue&searc

h-submit=Goo .. but some of the "articles" aren't

articles.

MFA sections and staggers• http://www.mfa.org/search/mfa/blue

Careful. This sort of assumes people know what they're looking for.

...um

Why did they match (in this order)?

• Highlight the match, if possible

• Sort by relevance
  o (But see the section on "boosting"...)

• If you're splitting up content, it's hard to explain.
  o ...the best result could be at the bottom of the page

... so... don't. Let the user do this.

Was I understood / Can I try again?

MFA site: http://www.mfa.org/search/mfa/blue
• Without the URL hint, can you even tell what was searched for?
  o And what if you want to add a single word? (WAC site is guilty of this. Blame the designer. ;-)

A few "not like this" examples:

• "blue phase"o http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Goo http://www.imamuseum.org/search/ima/%22blue%20phase%22

(People are going to use quotes!)

Was I really understood?

We know what you want: "Hours"
• http://www.britishmuseum.org/search_results.aspx?searchText=hours
• http://www.moma.org/search?query=hours&page=1
• http://beta.walkerart.org/search/?q=hours

"We have a special “live search” taxonomy for explicitly boosting content pages we know people are searching for. E.g. “jobs” on our employment page; “love” is our Love sculpture, not the hundreds of other works, “wedding” is for facility rentals, not our hundred wedding dresses in the collection."

-- Charlie Moad, IMA

Do me a favor:
• http://beta.walkerart.org/search/?q=articel
• http://www.vam.ac.uk/contentapi/search/?suggest=article&q=articel
  o (again, a bit confusing but right)

Narrow results with facets

Awesome:
• si.edu collections
  o http://collections.si.edu/search/results.jsp?q=blue

Good:
• IMA
  o http://www.imamuseum.org/search/ima/blue
• WAC (I'm biased)
  o http://beta.walkerart.org/magazine/type/articles/genre/film

Less awesome:
• British Museum
  o http://www.britishmuseum.org/search_results.aspx?searchText=blue&searchPrevious=blue&itemsPerPage=10

Broaden results

• Similar searches / More Like This
  o http://beta.walkerart.org/search/?q=absent+landlord
  o http://www.powerhousemuseum.com/collection/database/search_tags.php?tag=blue
  o http://www.vam.ac.uk/contentapi/search/?q=%22blue+phase%22&search-submit=Go (sort of weird, though)

• More Like This
  o We're trying it on detail pages:
    http://beta.walkerart.org/calendar/2011/merce-cunningham-dance-company

Dead ends / spell check

"Did you mean?"• http://beta.walkerart.org/search/?q=absent+landlord• http://www.vam.ac.uk/contentapi/search/?q=blu&search-sub

mit=Go

This is really just spellcheck. But it's apparently really hard, since nobody's doing it.

Final thoughts

Can we just spider our own pages like Google?
• Sure. Lots of tools to do this, and it looks like that's how MOMA does it.
  o However... http://www.moma.org/search?query=%22ad+reinhardt%22+%22sum+of+days%22&page=1
  o http://www.moma.org/search?query=blu&page=1 (look at the mp4!)

Boosting
• What kind of boosting makes sense?
  o Weight towards recent content; push down past events, maybe.
  o "We know what you want": look at logs to see what people are searching for.

So... "best practices"

• Unified search across all content
  o full-text search with stemming, phrases, etc.
• Coherent, user-centric divisions of content for faceting
• Prevent dead ends
  o show #s for facets
  o autocomplete the query
• Help the user
  o "Did you mean?"
  o Or just give it to them, don't ask

Let's build that!

"Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable."-- http://en.wikipedia.org/wiki/Solr

There's a tool for you...

http://wiki.apache.org/solr/IntegratingSolr

• ColdFusion - ColdFusion 9 now includes Apache Solr
• Django - Haystack
• Drupal - A Drupal module that integrates Apache Solr in Drupal.
• eZ Find - eZ Find, a solid Solr integration for the open source CMS eZ Publish
• Forrest/Cocoon - SolrForrest
• Foswiki - A Foswiki plugin that integrates Apache Solr in Foswiki.
• Plone - collective.solr
• SVN - reposearch
• TYPO3
• Various Library Catalog Applications - Solr4Lib
• Woltlab Community Framework - A WCF package working with the burning board, the blog, and all other WCF components.
• WordPress - solr-for-wordpress, a WordPress plugin that replaces the default WordPress search with Solr.
• ZooKeeperIntegration
• OpenCms - opencms-solr

Hurry, hurry!

1. introducing Solr
2. build fulltext search & introduce dismax
3. facets
4. build autocomplete
5. did you mean?

Installation, fast test

user:~/solr$ ls
solr-nightly.zip
user:~/solr$ unzip -q solr-nightly.zip
user:~/solr$ cd solr-nightly/example/
user:~/solr/example$ java -jar start.jar

That's it! You can actually do local development against that sort of setup and it works fine.
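A quick way to confirm the example server is answering is to hit it from code; this is a minimal smoke-test sketch assuming the default single-core example layout (Jetty on port 8983, /solr/select), not part of the workshop materials.

import json
import urllib.request

# wt=json asks the example Solr (listening on 8983) for a JSON response.
url = "http://localhost:8983/solr/select?q=*:*&rows=0&wt=json"
resp = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
print("Solr is up; documents indexed:", resp["response"]["numFound"])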

Installation, f'realz (Ubuntu)

apt-get install build-essential jetty \
    libjetty-extra openjdk-6-jdk
cp dist/apache-solr-3.4.0.war \
    /usr/share/jetty/webapps/solr.war
cp -r example/solr /usr/share/jetty/

edit /usr/share/jetty/solr/conf/schema.xml and solrconfig.xml

edit /etc/default/jetty: turn off no-start, make it bind to all IPs, and set the java opts:

JAVA_OPTIONS="-Dsolr.solr.home=/usr/share/jetty/solr -Dsolr.data.dir=/usr/share/jetty/solr/data $JAVA_OPTIONS"

/etc/init.d/jetty start

For today:

http://172.16.0.67/

Explore the fieldtypes: core0

Get the sample text on your clipboard. In core0, click Admin, then Analysis.

Field Names:
id (string)
text_ws
text_general
text_en
phonetic
text_general_rev
alphaOnlySort

core1: fulltext search engine

Click search on core1. Try it out. (dataset is Walker Art Center events)

Click "edit" on core1. Discuss.

core1: dismax query parser

DisMax is an abbreviation of Disjunction Max, and is a popular query mode in Solr.

Disjunction refers to the fact that your search is executed across multiple fields, e.g. title, body, and keywords, with different relevance weights.

Max means that if your word "foo" matches both title and body, the max score of the two (probably the title match) is added to the score, not the sum of the two as a simple OR query would do. This gives more control over your ranking.

core1: dismax in practice

The DisMaxQParserPlugin is designed to process simple user entered phrases (without heavy syntax) and search for the individual words across several fields using different weighting (boosts) based on the significance of each field.

In English: it does a really good job helping you figure out what the user meant to look for.
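From code, a dismax request is just extra parameters on an ordinary select. Here's a minimal sketch against core1; the qf field names and boosts are placeholders for whatever the core's schema actually defines, not the workshop's exact configuration.

import json
import urllib.parse
import urllib.request

params = {
    "q": "chuck close",       # a plain user phrase, no Lucene syntax required
    "defType": "dismax",      # use the DisMax query parser
    "qf": "title^5 body^1",   # placeholder fields: boost title matches over body
    "debugQuery": "true",     # ask Solr to explain each document's score
    "wt": "json",
}
url = "http://localhost:8983/solr/core1/select?" + urllib.parse.urlencode(params)
resp = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
print(resp["response"]["numFound"], "hits")

debugQuery=true is the easiest way to see why one result outranks another, which comes up again in the boosting discussion.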

Try some quotes

chuck close

vs.

chuck "close"

Debug: what's going on?

core2: facets

99% chance your Solr library will abstract this for you, but it's good to know what's under the hood.

... we won't do it today, but you can facet by queries, not just field names.

So you can do things like this in one call (see the sketch below):
• Give me all events matching the query
• Show how many by type (like we're doing)
• Show how many are happening today
• Show how many are happening "this weekend"
• ... etc.
• http://beta.walkerart.org/calendar/type/free-events
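A hedged sketch of that one-call pattern: repeat facet.query with different date-math ranges alongside a facet.field. The core name (core2) follows the slides; the event_type and start_date field names are assumptions for illustration.

import json
import urllib.parse
import urllib.request

# A list of tuples lets us repeat facet.query in a single request.
params = [
    ("q", "blue"),
    ("wt", "json"),
    ("facet", "true"),
    ("facet.field", "event_type"),                              # counts per type
    ("facet.query", "start_date:[NOW/DAY TO NOW/DAY+1DAY]"),    # happening today
    ("facet.query", "start_date:[NOW/DAY TO NOW/DAY+7DAYS]"),   # in the next week
]
url = "http://localhost:8983/solr/core2/select?" + urllib.parse.urlencode(params)
resp = json.loads(urllib.request.urlopen(url).read().decode("utf-8"))
print(resp["facet_counts"]["facet_queries"])
print(resp["facet_counts"]["facet_fields"])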

core3: Autocomplete (a)

Read this later:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

This is a very popular and decent solution. It really only works the way he suggests, though, by seeding with popular queries (since it starts at character 0). If you have this data, go for it, but our top queries actually aren't very interesting: "jobs", "staff", "hours", etc.

We want something that can complete any phrase that occurs in our corpus (a), ideally in the middle of the phrase (b).

Key technologies

ShingleFilterFactory
Make tokens out of phrases.

TermsComponent
"return terms and document frequency of those terms"

Post-processing for stopwords
Index them in phrases, but remove from suggestions in certain scenarios.

ShingleFilterFactory

    <fieldType name="shingle_text" class="solr.TextField" positionIncrementGap="100">        <analyzer type="index">            <charFilter class="solr.HTMLStripCharFilterFactory" />            <tokenizer class="solr.StandardTokenizerFactory"/>            <filter class="solr.LowerCaseFilterFactory"/>            <filter class="solr.ASCIIFoldingFilterFactory"/>            <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />-->            <filter class="solr.ShingleFilterFactory" maxShingleSize="5" />        </analyzer>        <analyzer type="query">            <charFilter class="solr.HTMLStripCharFilterFactory" />            <tokenizer class="solr.StandardTokenizerFactory"/>            <filter class="solr.LowerCaseFilterFactory"/>            <filter class="solr.ASCIIFoldingFilterFactory"/>        </analyzer>    </fieldType>

TermsComponent

<!-- in solrconfig.xml -->        <arr name="last-components">            <str>terms</str>        </arr>

# strict "starts with"/select?terms=true&terms.fl=auto_text&terms.prefix=term

OR

# attempt at "infix" (sloooow on big corpus)/select?terms=true&terms.fl=auto_text&terms.rege=(^|.* +)term.*

core4: Autocomplete (b)

Infix. Big challenges, decent hacks.

Smaller shingles.

Fewer words (only title & subtitle).

Still... kinda slow in our beta site. Probably have to move to prefix. :(

core5: spellcheck

Similar to the setup for autocomplete

Just remember to call a url with spellcheck.build=true to get things started.

For better results, use spellcheck.q and escape spaces. This makes it a phrase instead of spellchecking individual words and correcting them to dead ends.

select?q=chuc+closee&spellcheck.q=chuc\+closee
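In code, that two-step dance (build once, then query with an escaped-space spellcheck.q) looks roughly like the sketch below; it assumes the spellcheck component is attached to core5's /select handler as described above.

import json
import urllib.parse
import urllib.request

base = "http://localhost:8983/solr/core5/select"

# One-time: build the spellcheck index.
urllib.request.urlopen(base + "?" + urllib.parse.urlencode(
    {"q": "*:*", "rows": 0, "spellcheck": "true", "spellcheck.build": "true", "wt": "json"}))

# Escape the space in spellcheck.q so the whole phrase is checked, not each word alone.
params = {
    "q": "chuc closee",
    "spellcheck": "true",
    "spellcheck.q": "chuc\\ closee",
    "wt": "json",
}
resp = json.loads(urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)).read().decode("utf-8"))
print(resp["spellcheck"]["suggestions"])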

Search is hard...

Our content team (and I know the MET too with their new site) constantly struggle to understand why certain results come up over others. They always ask us to make tweaks which inevitably hurt other results. It’s a constant battle for perfection and I have to do a lot of educating.

· Retail results come up over artworks because they actually write good descriptions! We even set our boost on retail to 0.5.
· Why does "after van Gogh" show up before the real "van Gogh"?
· Why does last year's event show up before this year's?

While there are answers to all these, it’s inevitably a slippery slope. My final answer is to usually use the live search taxonomy. It is in place to tell the search engine what users are looking for specific to your institution. People just need to understand that it is a content task just as much as creating a page.

-- Charlie Moad, IMA

If we're bored

ASCII / UTF8
http://beta.walkerart.org/search/?q=jerome+bel
http://beta.walkerart.org/search/?q=J%C3%A9r%C3%B4me+Bel

<!-- remove diacritics BEFORE stemming to match cases without diacritics -->
<filter class="solr.ASCIIFoldingFilterFactory"/>

boost in general, elevate.xml

bq=(instances:{20110927 TO *})^1000 OR (display_type:Walker\ Shop)^20 OR (display_type:Events)^1

http://wiki.apache.org/solr/QueryElevationComponent - "sponsored search"

Index non-data resources (pdf, docs, etc.): Apache Tika
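For completeness, a hedged sketch of attaching that bq clause to a dismax request from code; the core name and qf fields are placeholders, while the bq value is the example above.

import json
import urllib.parse
import urllib.request

params = {
    "q": "blue",
    "defType": "dismax",
    "qf": "title^5 body^1",   # placeholder field weights
    # Boost upcoming events hard, shop items a little, events barely (the bq example above).
    "bq": "(instances:{20110927 TO *})^1000 OR (display_type:Walker\\ Shop)^20 OR (display_type:Events)^1",
    "wt": "json",
}
url = "http://localhost:8983/solr/core1/select?" + urllib.parse.urlencode(params)
print(json.loads(urllib.request.urlopen(url).read().decode("utf-8"))["response"]["numFound"])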