Building an easy to use search solution (for different languages) Ivo Lukač @ J.Boye Aarhus 13: Web...
Building an easy to use search solution
(for different languages)
Ivo Lukač @ J.Boye Aarhus 13: Web & Intranet Conference
“Making search work” track
www.netgenlabs.com
Speaker
•Co-owner of Netgen - web development agency, Zagreb, Croatia
•Started as developer 11 years ago
•Now I do a variety of things, but my role is best described as International Business Developer
Use case
•Regulatory reform project: cutting unneeded legislation, laws and/or procedures
•Netgen is the technology implementation partner
•Project led by Sense Consulting
•Croatia, Egypt, Vietnam, Armenia, Iraq - mostly “exotic” countries
We would rather work in Denmark, but it seems that it doesn’t need such a solution :(
How we use search
Solution
•In 2006: a simple filter
•Today: a flexible information architecture powered by eZ Publish CMS, with Solr for search
•Usually 70% common features, 30% customisation
•Aiming for 90%/10%
•If you are interested in tech specifics, ask me later…
Search features
• Simple (default) and advanced search (with filters)
• Full text search on complex data, boosting on attribute level
• Filtering with multilevel tags/taxonomies
• Stopwords
• Search-time spelling suggestions based on indexed data
• Sometimes using faceting on result set
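Several of the features above map onto standard Solr request parameters. The following is a minimal sketch, not the project's actual configuration: the field names (`title`, `summary`, `body`, `tags`), the boost values and the helper `build_solr_query` are assumptions made for illustration, and `spellcheck=true` only has an effect if the spellcheck component is configured on the request handler.

```python
from urllib.parse import urlencode


def build_solr_query(phrase, tag_filters=(), facet_field=None):
    """Build a query string for a Solr /select request (hypothetical schema)."""
    params = [
        ("q", phrase),
        # eDisMax query parser: full text search across several fields.
        ("defType", "edismax"),
        # Attribute-level boosting: title matches weigh more than body matches.
        # Field names and boost values are assumptions, not from the talk.
        ("qf", "title^3 summary^2 body"),
        # Ask for search-time spelling suggestions built from indexed data.
        ("spellcheck", "true"),
    ]
    # One filter query per selected tag (advanced search filters).
    for tag in tag_filters:
        params.append(("fq", f'tags:"{tag}"'))
    # Optional faceting on the result set.
    if facet_field:
        params.append(("facet", "true"))
        params.append(("facet.field", facet_field))
    return urlencode(params)


if __name__ == "__main__":
    print(build_solr_query("business licence", ["Permits"], facet_field="tags"))
```

Filters go into `fq` rather than `q` so that they narrow the result set without influencing relevance scoring.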
Additional features
•Sometimes using multi search
•Typing suggestions
•Latest search phrase list
Challenges
Characters
•At the beginning we didn’t have Unicode - it was a mess!
•Unicode solved a lot of problems but not all
•The same character can be represented by more than one byte sequence, and these are not normalised by default
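The normalisation problem can be illustrated with a short sketch. It assumes that text is normalised to the NFC form both at index time and at query time; the talk does not say which form or which tool was used, and the helper name `normalise` is hypothetical.

```python
import unicodedata


def normalise(text: str) -> str:
    """Normalise text to Unicode NFC (an assumed choice of form)."""
    return unicodedata.normalize("NFC", text)


# "č" written in two ways that look identical on screen.
composed = "Luka\u010d"      # č as a single code point (U+010D)
decomposed = "Lukac\u030c"   # c followed by a combining caron (U+030C)

if __name__ == "__main__":
    print(composed == decomposed)                        # False: different code points
    print(composed.encode("utf-8"))                      # different bytes as well
    print(decomposed.encode("utf-8"))
    print(normalise(composed) == normalise(decomposed))  # True after normalisation
```

Without this step, a query typed in one form will not match a document indexed in the other, even though both look the same to the user.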
Indexing
•Indexing files like Word, PDF or similar proved to be problematic due to character problems
•token delimiter configuration could be language specific
•stemming sometimes supported, sometimes not
Blind work
•the biggest challenge is that developers don’t know the language
•first level of testing is very hard
•still can’t trust Google Translate
What vehicle would you use to transport 10 cases of Heineken?
How to overcome this?
Main idea
•let’s try to assess search result quality
•use editors for rating (not the public)
•use most frequently searched terms (we can’t test all)
•rate results above the fold
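Picking the terms to rate can be sketched as a simple frequency count over a search log. This assumes the log is available as a plain list of query strings; the function name `top_search_terms` and the default of 7 terms are hypothetical choices for illustration.

```python
from collections import Counter


def top_search_terms(search_log, n=7):
    """Return the n most frequent search phrases with their counts."""
    # Lowercase and collapse whitespace so "CMS" and "cms " count as one term.
    cleaned = (" ".join(query.lower().split()) for query in search_log)
    counts = Counter(query for query in cleaned if query)
    return counts.most_common(n)


if __name__ == "__main__":
    log = ["CMS", "cms ", "solr", "", "Solr", "cms", "ez publish"]
    for term, count in top_search_terms(log, n=2):
        print(term, count)
```

The counts are worth keeping, since they can later serve as popularity weights when averaging quality scores across terms.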
The tool
•integrated in the public site
•added thumbs up/down buttons for the first X results, shown only to editors
Demo
•imported articles about CMS topics from various sources into a test instance
•rating result quality for 7 search terms
•thumbs up/down for the 3 suggested search results
•Test periods are used for framing test data
Rating side
Analysing side
Rate measures
•Discounted Cumulative Gain (DCG) - sum of ratings, each discounted by its position in the search results
•Normalised Discounted Cumulative Gain (NDCG) - DCG normalised against the best possible outcome (to get a percentage as the unit)
•Popularity based NDCG - takes into account the popularity of the search term
http://en.wikipedia.org/wiki/Discounted_cumulative_gain
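The measures above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: thumbs up is scored as 1 and thumbs down as 0, the discount is the common log2(position + 1), the "best possible outcome" is taken to be the same ratings sorted best-first, and the popularity based variant is assumed to be an average of per-term NDCG weighted by how often each term was searched.

```python
import math


def dcg(ratings):
    """Discounted Cumulative Gain: ratings discounted by result position."""
    return sum(r / math.log2(pos + 1) for pos, r in enumerate(ratings, start=1))


def ndcg(ratings):
    """DCG divided by the DCG of the ideal (best-first) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    # No good result at all: report 0 instead of dividing by zero.
    return dcg(ratings) / ideal if ideal > 0 else 0.0


def popularity_ndcg(ratings_by_term, searches_by_term):
    """Average NDCG over terms, weighted by search counts (assumed definition)."""
    total = sum(searches_by_term[term] for term in ratings_by_term)
    if total == 0:
        return 0.0
    weighted = sum(
        ndcg(ratings) * searches_by_term[term]
        for term, ratings in ratings_by_term.items()
    )
    return weighted / total


if __name__ == "__main__":
    # Editor ratings for the first 3 results of two search terms.
    ratings = {"cms": [1, 1, 0], "solr": [0, 1, 1]}
    searches = {"cms": 30, "solr": 10}
    print(dcg([1, 0, 1]))  # 1/1 + 0 + 1/2 = 1.5
    print(ndcg(ratings["solr"]))
    print(popularity_ndcg(ratings, searches))
```

Sorting the observed ratings to get the ideal ordering means a term whose top results are all rated down scores 0, while a term with no good content in the index at all cannot be told apart from one that is ranked badly.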
Known problems
•What if good results are not showing? - something bad is going on with the search engine
•what if there is no good result?
•what about new content added over time?
•at the end of the day, the measurements are good for comparing between test periods, but not meaningful by themselves
Improvements
•opening rating to public users
•using clicks as ratings
•implement “did you find what you were looking for?” feature
•integrate with analytics
•use rating data to boost particular items in search!
Questions now or later
[email protected] ilukac.com/twitter ilukac.com/facebook ilukac.com/gplus ilukac.com/linkedin