FIBEP World Media Intelligence Congress, 17-20 November 2015, Vienna
www.wmicongress.com
Session Title: How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform
Speaker: Kristian Schou, Infomedia & Charlie Hull, Flax
Twitter: @InfomediaDK @Flaxsearch
Web: www.flax.co.uk
2015-11-19
About Infomedia
• Founded in 2003
• The leading Danish provider of media monitoring and media analysis
• Largest and oldest Danish media archive with access to approximately 75 million searchable articles
@_FIBEP #_FIBEP #WMIC15 2015-11-19
About Flax
• Founded in 2001 in Cambridge, U.K.
• Independent, honest advice and analysis
• Expert design & development, Apache Solr committers
• Test-driven relevancy and performance tuning
• Custom training & mentoring for your staff
• Flexible support up to 24/7/365 with SLAs
The situation at Infomedia in 2013
• Very old media monitoring system based on Verity
  – Verity was put into production in 2001 at the company that would later become Infomedia!
• Slightly less old installation of Autonomy IDOL used for Infomedia's Media Archive
  – Put into production at Infomedia in 2009/10
• Drawbacks:
  – Verity at almost max capacity, needing constant attention
  – Old and complex workflow for receiving and processing articles
  – Different platforms for monitoring and archive searches meant we were 'bi-lingual', using two different query languages in-house
  – Verity no longer supported by the owning company (HP)
  – Verity not scalable!
What to do?
• Different upgrading options explored throughout 2011-2012
  – Upgrade everything to Autonomy IDOL?
  – Switch to another commercial search engine?
  – Go open-source?
• Recommendations and internal testing drew us to Apache Solr, an open-source enterprise search platform
• Advantages:
  – Transparency (going from commercial to open-source)
  – Rapid maturity of Solr: development moving very fast
  – Large and active Solr community
  – Customizability
  – Solr is known to be fast and highly scalable
  – No license fees
Defining the project with Flax
• Infomedia searched for Solr expertise in Denmark/Scandinavia
  – Could not find an option that we were comfortable with
• Introduced to Flax through networking and recommendations
  – Experience from similar upgrade projects with Gorkana and AAP
  – Very impressed with Flax's insight, knowledge and credentials
  – Actual committer to Apache Solr
• Project began in autumn of 2013 with the goals of:
  – Building a completely new search architecture to replace Verity and IDOL
  – Defining Infomedia's own query language, IQL, owned and controlled by Infomedia
  – Translating old monitoring queries (approx. 8,000) to this new IQL syntax
Replacing Verity
• Verity replaced by Flax Monitor
  – Parses IQL to Lucene queries
  – Runs on 2 servers
  – Uses Luwak, Flax's 'stored search' library:
    • Built on Apache Lucene (as is Solr)
    • Also used by Bloomberg, Booz Allen Hamilton & others
    • In use for 1m stored searches (some 250k characters), 1m stories/day
    • 40x faster than Elasticsearch Percolator
    • Open source at https://github.com/flaxsearch/luwak
Turning search upside down
[Diagram, built up over several slides. Labels: Docs, Query, Stored Queries, Result]
• 1 million stored queries, some 250k long, with complex rules
• 1 million new documents a day
• Result within 5-100ms
• Running the stored queries against each new document: $$$$$$
• 1. Pre-query: the incoming doc selects a subset of ~200 stored queries
• 2. Search: the subset is run against the doc to produce the result
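The two-step flow in the diagram can be illustrated with a small Python sketch. This is not Luwak's API or its actual presearcher algorithm; it is a simplified stand-in that assumes each stored query is a plain list of terms that must all appear in the document, and the class and method names are hypothetical.

```python
from collections import defaultdict


class StoredQueryMonitor:
    """Toy stored-search monitor (hypothetical, not Luwak).

    Each stored query is a set of terms that must ALL appear in a document.
    Query ids are assumed to be unique and registered once.
    """

    def __init__(self):
        self.queries = {}                # query id -> frozenset of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries indexed under it

    def register(self, query_id, terms):
        required = frozenset(t.lower() for t in terms)
        if not required:
            raise ValueError("a query needs at least one term")
        self.queries[query_id] = required
        # Index the query under one of its required terms. A document that
        # lacks this term cannot match, so the query can be skipped for it.
        self.by_term[min(required)].add(query_id)

    def candidates(self, doc_terms):
        # Step 1 (pre-query): select only the stored queries that could match.
        found = set()
        for term in doc_terms:
            found |= self.by_term.get(term, set())
        return found

    def match(self, text):
        doc_terms = set(text.lower().split())
        subset = self.candidates(doc_terms)
        # Step 2 (search): fully evaluate only the candidate subset.
        return sorted(q for q in subset if self.queries[q] <= doc_terms)


monitor = StoredQueryMonitor()
monitor.register("q1", ["solr", "search"])
monitor.register("q2", ["vienna", "congress"])
monitor.register("q3", ["media", "monitoring"])
print(monitor.match("Infomedia moved its media monitoring to open source search"))
```

In this sketch the pre-query step never discards a query that would match; it only avoids evaluating queries whose indexed term is absent from the document, which is where the saving comes from when the stored-query set is large.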
Replacing Autonomy IDOL
• Autonomy IDOL replaced by Apache Solr
  – Parses IQL to Lucene queries
  – SolrCloud distributes the index & queries across several servers
  – Setup: 75 million documents hosted on 8 servers, with 6 cores, 24 GB memory and 125 GB storage per server
  – This setup is doubled to have full redundancy
• Features added to standard Solr by Flax:
  – Custom highlighting
  – Framework to handle multiple languages
  – Extended error logging
  – Cluster management
  – Performance enhancements for complex wildcard queries
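The setup figures above can be turned into rough totals with a few lines of Python. This is back-of-envelope arithmetic only: it assumes the documents are spread evenly across the 8 servers and that "doubled" means a second, identical set of 8 servers. Neither assumption is confirmed by the slides, and the function name is hypothetical.

```python
def cluster_totals(docs, servers, cores, mem_gb, disk_gb, copies):
    """Rough totals for a cluster that is replicated `copies` times.

    Assumes an even spread of documents across servers (an assumption,
    not something stated in the slides).
    """
    total_servers = servers * copies
    return {
        "docs_per_server": docs // servers,
        "total_servers": total_servers,
        "total_cores": total_servers * cores,
        "total_mem_gb": total_servers * mem_gb,
        "total_disk_gb": total_servers * disk_gb,
    }


totals = cluster_totals(
    docs=75_000_000, servers=8, cores=6, mem_gb=24, disk_gb=125, copies=2
)
for name, value in totals.items():
    print(f"{name}: {value:,}")
```

Under these assumptions each server holds about 9.4 million documents, and the redundant installation comes to 16 servers in total.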
Benefits of the project
• Articles indexed and searchable within minutes of receiving them
• New, much smarter tools for constructing and comparing monitoring queries
• The Flax Monitor is an extremely smart and performant monitoring solution
• Huge benefits from defining the Infomedia Query Language, IQL
  – Extremely enlightening and empowering process to analyze what we actually need from a query language
  – We fully understand and have documented how IQL works
  – IQL is designed to match Infomedia's demands and preferences
  – We can revise and expand IQL as new needs and opportunities arrive
  – Not bound to any search platform. We can take it with us
Learnings/Where are we now?
• A challenging, complex, time-consuming but ultimately rewarding project
  – The ripple effect: we have had to revisit and update a lot of legacy systems
  – Customization is great, but can also mean more specification
  – Open source prevents lock-in but demands investment in education, otherwise it is still just a magic box
  – Flax's expert knowledge has been invaluable
• A successful migration
  – More than 90% of Infomedia's monitoring queries have been migrated to IQL with practically no negative change in precision or recall
• The collaboration with Flax continues
  – As Infomedia develops, so do new ideas and feature requests
  – A customized open-source platform also means continuous improvement
  – Currently updating to Solr 5.3
  – Still experimenting with different ways to scale our Solr installation
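For readers unfamiliar with the precision and recall measures mentioned above, here is a minimal Python sketch of how a migrated query could be compared with its original. The article ids and the judged-relevant set are invented for illustration, and the slides do not describe how Infomedia actually ran its comparison.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    precision = share of returned articles that are relevant
    recall    = share of relevant articles that were returned
    Empty sets score 1.0 here; that is a convention chosen for this sketch.
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical article ids judged relevant for one monitoring query.
relevant = {1, 2, 3, 4}
old_hits = {1, 2, 3, 9}     # what the legacy query returned (made up)
new_hits = {1, 2, 3, 4, 9}  # what the migrated query returns (made up)

print("old:", precision_recall(old_hits, relevant))
print("new:", precision_recall(new_hits, relevant))
```

A migrated query shows "no negative change" in this sense when neither number drops relative to the original.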
Other lessons
• You can also keep your old query language
  – Flax have written dtSearch & Verity parsers for Lucene
• Some of your old queries might not be working
  – e.g. Verity doesn't always tell you when queries are broken!
• Open source can help future-proof your search
  – And you have control of the software
• Engage with the open source community:
  – User groups
  – Mailing lists
  – Contribute back if you can
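One way to catch silently broken queries during a migration is a simple structural check before translation. The Python sketch below is only an illustration: it assumes a generic syntax with double-quoted phrases and parenthesised groups, which is not a description of Verity or IQL syntax, and the function name is hypothetical.

```python
def structural_problems(query):
    """Return a list of structural problems found in a query string.

    Assumes double quotes delimit phrases and parentheses group clauses
    (a generic, hypothetical syntax used only for this sketch).
    """
    problems = []
    depth = 0
    in_quote = False
    for ch in query:
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    problems.append("closing parenthesis without an opening one")
                    depth = 0  # reset so later groups are still checked
    if in_quote:
        problems.append("unterminated quote")
    if depth > 0:
        problems.append(f"{depth} unclosed parenthesis group(s)")
    return problems


for q in ['(solr AND "open source")', '(solr AND "open source', "solr)"]:
    print(repr(q), "->", structural_problems(q) or "ok")
```

A check like this only finds malformed structure; a query can be well-formed and still not mean what its author intended.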
Thanks for listening - any questions?
Kristian Schou, Infomedia & Charlie Hull, Flax
@InfomediaDK @Flaxsearch
Web: www.flax.co.uk