Post on 02-Jul-2015
Thoth
Real-time Solr Monitor
Search Analysis Engine
dbraga@trulia.com
pmhatre@trulia.com
Damiano Braga
Sr. Software Engineer
Praneet Mhatre
Data Mining Engineer
Overview
- What is Thoth?
- Data Collection and Thoth Core Indexing
- Thoth API & Thoth Dashboard
- Thoth Monitor
- Thoth ML: Prediction and Topic Modeling
- Special Thanks & Q/A
Demo
What is Thoth?
- Innovation project at Trulia
- Understand our search infrastructure without touching logs
- Troubleshoot search performance issues
- Designed as a modular system
- Set of tools that help gather information about, monitor, and understand a search infrastructure
- Open source project:
thoth
thoth-ml
thoth-api
thoth-dashboard
thoth-monitor
thoth-demo
Problem: Know Your Search Infrastructure
- Solr logs are a good source, but they sometimes hold only partial information
- Decentralized data (at least 1 log per search server)
- Log rotation
- Not searchable
If only we could index all the information... Let's use Solr!
- We can search on it
- We have some handy features for free: facets, stats etc
- It’s scalable
Thoth Document
1 Solr Request = 1 Thoth (Solr) Document
Server Info
hostname, port number, core name, pool name
Query Info
timestamp, actual query, qtime, hits, exception?
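As a rough illustration of the "1 Solr Request = 1 Thoth Document" idea, the record could be modeled as below. This is only a sketch: the class name, field names, and the `to_solr_fields` helper are assumptions made for the example, not the schema Thoth actually uses.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ThothDocument:
    """One Solr request captured as one document (hypothetical field names)."""
    # Server info
    hostname: str
    port: int
    core_name: str
    pool_name: str
    # Query info
    timestamp: datetime
    query: str
    qtime_ms: int
    hits: int
    exception: Optional[str] = None  # None when the request succeeded

    def to_solr_fields(self) -> dict:
        """Flatten to a dict, rendering the timestamp in Solr's UTC date format."""
        fields = asdict(self)
        fields["timestamp"] = self.timestamp.strftime("%Y-%m-%dT%H:%M:%SZ")
        return fields


doc = ThothDocument(
    hostname="search01", port=8983, core_name="collection", pool_name="default",
    timestamp=datetime(2014, 9, 16, 18, 0, 2, tzinfo=timezone.utc),
    query="q=*:*&rows=10", qtime_ms=12, hits=95,
)
```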
Data Collection (1/2)
- Should be smooth: collection must not slow down search traffic
- We care about near real-time data
- We care about historical data
- Dataset is growing fast
- Interceptor on each search server
- We use a SolrComponent attached to a Request Handler
- Queue system (e.g. ActiveMQ) to relay and temporarily store messages
- Each search server has a manifest in the solrconfig.xml
Data Collection (2/2)
<requestHandler name="select" class="com.solr2activemq.SolrToActiveMQHandler">
  <arr name="last-components">
    <str>solr2activemq</str>
  </arr>
</requestHandler>
<searchComponent name="solr2activemq" class="com.solr2activemq.SolrToActiveMQComponent">
  <str name="activemq-broker-uri">localhost</str>
  <int name="activemq-broker-port">61616</int>
  <str name="activemq-broker-destination-type">queue</str>
  <str name="activemq-broker-destination-name">test-queue</str>
  <str name="solr-hostname">localhost</str>
  <int name="solr-port">8983</int>
  <str name="solr-poolname">default</str>
  <str name="solr-corename">collection</str>
  <int name="solr2activemq-buffer-size">1000</int>
  <int name="solr2activemq-dequeuing-buffer-polling">500</int>
  <int name="solr2activemq-check-activemq-polling">5000</int>
</searchComponent>
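The buffer and polling settings above suggest a design where the request thread only drops a message into a bounded in-memory buffer and a background worker forwards it to the queue. The sketch below shows that general pattern in Python; it is not the solr2activemq implementation. The class name, the drop-when-full policy, and the reading of the polling value as milliseconds are all assumptions.

```python
import queue
import threading


class NonBlockingInterceptor:
    """Bounded buffer between request threads and a slow sink (hypothetical)."""

    def __init__(self, send, buffer_size=1000, poll_seconds=0.5):
        self._buffer = queue.Queue(maxsize=buffer_size)
        self._send = send                  # callable that ships one message
        self._poll_seconds = poll_seconds  # how long the worker waits for work
        self.dropped = 0                   # messages discarded because the buffer was full
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)

    def start(self):
        self._worker.start()

    def record(self, message):
        """Called on the request path: never blocks, drops when the buffer is full."""
        try:
            self._buffer.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Keep forwarding until stop is requested and the buffer is empty.
        while not (self._stop.is_set() and self._buffer.empty()):
            try:
                message = self._buffer.get(timeout=self._poll_seconds)
            except queue.Empty:
                continue
            self._send(message)

    def stop(self):
        """Flush whatever is still buffered, then stop the worker."""
        self._stop.set()
        self._worker.join()
```

Dropping on overflow trades completeness of the monitoring data for a guarantee that search traffic is never slowed down; a different policy could be chosen if every request must be recorded.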
Sizing of Data
- Need for granular information for near real-time data
- Less granularity for historical data
Too much data = slow search, space problem
- Shrinking feature:
- Create Shrank Document
- Real-time Core cleanup
- Shrinking time is configurable
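One way to picture shrinking is as a roll-up of granular per-request documents into one summary per server, core, and time bucket. The sketch below assumes that interpretation; the function name, the dict keys, the chosen aggregates, and the 15-minute default are illustrative and not taken from Thoth.

```python
from collections import defaultdict
from datetime import datetime, timezone


def shrink(documents, bucket_seconds=900):
    """Aggregate per-request documents into one summary per host/core/time bucket."""
    buckets = defaultdict(list)
    for doc in documents:
        epoch = int(doc["timestamp"].timestamp())
        bucket_start = epoch - epoch % bucket_seconds  # floor to the bucket boundary
        buckets[(doc["hostname"], doc["core"], bucket_start)].append(doc)

    shrunk = []
    for (hostname, core, bucket_start), docs in sorted(buckets.items()):
        qtimes = [d["qtime"] for d in docs]
        shrunk.append({
            "hostname": hostname,
            "core": core,
            "bucket_start": datetime.fromtimestamp(bucket_start, tz=timezone.utc),
            "nqueries": len(docs),
            "avg_qtime": sum(qtimes) / len(qtimes),
            "max_qtime": max(qtimes),
            "zero_hits": sum(1 for d in docs if d["hits"] == 0),
        })
    return shrunk
```

Once a bucket has been summarized, the granular documents it covers can be removed from the real-time core, which is what keeps that core small.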
Thoth Index
- Solr 4.7
- Soft commit for near real-time search
- Soft commit maxTime set to 1s
- Auto commit set to 15s
- Update chain configured to enforce a UUID as the primary key
- SolrJ used to index and query data
Thoth API
- Abstraction for Thoth index and Thoth data
- Read only REST-like API
- JSON response
- Written in Node.js to accommodate socket.io
Example:
{"numFound":95,"values":[
{"timestamp":"2014-09-16T18:00:02Z","value":45337},
{"timestamp":"2014-09-16T18:15:02Z","value":77325},
{"timestamp":"2014-09-16T18:30:02Z","value":109523},
{"timestamp":"2014-09-16T18:45:02Z","value":112279},
{"timestamp":"2014-09-16T19:00:02Z","value":115334}
thoth:3001/api/server/foo/core/bar/port/portbar/start/NOW-1DAY/end/NOW/count/nqueries
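A client for an endpoint shaped like the one above might look like the following sketch. The path layout is copied from the example URL; the function names, the `http://` scheme, and the assumption that the response always carries a `values` list of timestamp/value pairs are mine.

```python
import json
from urllib.parse import quote


def build_count_url(base, server, core, port,
                    start="NOW-1DAY", end="NOW", metric="nqueries"):
    """Assemble a count URL following the path layout of the example above."""
    parts = ["api", "server", server, "core", core, "port", str(port),
             "start", start, "end", end, "count", metric]
    return base.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


def parse_count_response(body):
    """Turn the JSON body into a list of (timestamp, value) tuples."""
    payload = json.loads(body)
    return [(v["timestamp"], v["value"]) for v in payload["values"]]


url = build_count_url("http://thoth:3001", "foo", "bar", "portbar")
```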
Thoth Dashboard (1/5)
- Visual insight on Thoth data
- Useful graphs divided by server or pool
- Handy list of slow queries and exceptions
- Real-time view for server
- Selecting data based on time
- Shareable URLs (for the Ops, QA, and Release Engineering teams)
Thoth Dashboard (2/5)
Thoth Dashboard (3/5)
Thoth Dashboard (4/5)
Thoth Dashboard (5/5)
Thoth Monitor
- Continuously monitoring for metrics
- Stateless
- Alerting through email or Nagios
- Examples: QTime, number of zero hits, predictor model health
- Possibility to implement custom monitors
- Reuse StatsComponent [http://wiki.apache.org/solr/StatsComponent] if possible
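Because the monitor is stateless, each check can be written as a pure function over the documents in the current window. The sketch below assumes that shape; the function name, dict keys, and threshold defaults are invented for the example, and a real monitor would likely delegate the aggregation to StatsComponent rather than compute it client-side.

```python
from statistics import mean


def check_window(documents, max_avg_qtime_ms=100.0, max_zero_hits=50):
    """Return alert messages for one time window; keeps no state between calls."""
    alerts = []
    if not documents:
        return alerts
    avg_qtime = mean(d["qtime"] for d in documents)
    zero_hits = sum(1 for d in documents if d["hits"] == 0)
    if avg_qtime > max_avg_qtime_ms:
        alerts.append(
            f"avg QTime {avg_qtime:.1f} ms exceeds {max_avg_qtime_ms:.1f} ms")
    if zero_hits > max_zero_hits:
        alerts.append(
            f"{zero_hits} zero-hit queries exceeds limit of {max_zero_hits}")
    return alerts
```

The returned messages could then be handed to whichever channel is configured, such as email or Nagios.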
Thoth ML
What can we do with all this data?
• Rich source of information
• Can we turn it into knowledge?
• How about machine learning?
1. Query time prediction
2. Query pattern recognition
3. Server sizing and resource allocation
1. Query Time Prediction (1/4)
• Goal: route each query to the appropriate pool (slow or fast)
• Look at query attributes
• Query text
• Start parameter
• Facets, range queries, geospatial searches, etc.
• Train a supervised learning model
• Use the learned model to predict whether a query will be slow or fast
• H2O Machine Learning Library
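The attributes listed above can be turned into model features by parsing the raw Solr parameter string. The sketch below is one possible encoding, not the feature set Thoth ML uses; the function name, the feature names, and the simple string checks for range and geo clauses are assumptions.

```python
from urllib.parse import parse_qs


def extract_features(raw_params):
    """Derive simple boolean and numeric features from a Solr query string."""
    params = parse_qs(raw_params)
    q = params.get("q", [""])[0]
    clauses = [q] + params.get("fq", [])  # main query plus filter queries
    return {
        "n_terms": len(q.split()),
        "start": int(params.get("start", ["0"])[0]),
        "has_facet": params.get("facet", ["false"])[0].lower() == "true",
        # Range syntax such as price:[100000 TO 300000]
        "has_range_query": any(" TO " in c for c in clauses),
        # Spatial filters written with the geofilt or bbox query parsers
        "has_geo_filter": any("{!geofilt" in c or "{!bbox" in c for c in clauses),
    }


features = extract_features(
    "q=two+bedroom+house&start=40&facet=true"
    "&fq=price:[100000+TO+300000]&fq=%7B!geofilt+d%3D5%7D"
)
```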
1. Query Time Prediction (2/4)
Challenges
• Imbalanced dataset
• Frequency of model training
• Type of model
• Minimal delay requirement
1. Query Time Prediction (3/4)
Challenges Addressed
• Imbalanced dataset
• Stratified sampling
• Frequency of model training
• Auto identify relearning frequency
• Type of model
• Boolean, categorical features -> Tree based
• High accuracy
• Gradient Boosted Machine
• Minimal delay requirement
• User pool queries: 45-50 ms
• Prediction: 1-3 ms
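The slides do not say how the stratified sampling was applied, so the following is only a generic sketch of the technique: rows are grouped by class label and the same fraction is drawn from each group, so that the rare slow class is never lost from the sample. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_of, fraction, seed=0):
    """Draw the same fraction from every class so each class stays represented."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(len(members) * fraction))  # keep at least one row per class
        sample.extend(rng.sample(members, k))
    return sample
```

Using a different fraction per class (for example, keeping all slow queries and only a share of the fast ones) would be a natural variation when the goal is to rebalance the training set.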
1. Query Time Prediction (4/4)
• 1000 Gradient Boosted Trees
• Slow query = QTime > 100 ms (configurable)
• Experimental Results
• Training on ~3.1 million
• Test on ~1.4 million
• AUC: 0.94542
• Accuracy: 0.9202223
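For readers unfamiliar with the two metrics, the sketch below shows how the slow/fast label, accuracy, and AUC can be computed. It uses four made-up values purely to exercise the functions; it does not reproduce the experiment above, and the function names and the 0.5 decision threshold are assumptions.

```python
def is_slow(qtime_ms, threshold_ms=100):
    """Label a query as slow when its QTime exceeds the configurable threshold."""
    return qtime_ms > threshold_ms


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for l, p in zip(labels, predictions) if l == p)
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random slow query scores above a random fast one."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data only: observed QTimes and a model's predicted probability of "slow".
qtimes = [250, 40, 130, 60]
scores = [0.9, 0.2, 0.4, 0.6]
labels = [is_slow(q) for q in qtimes]
predictions = [s >= 0.5 for s in scores]
```

The pairwise AUC here is quadratic in the number of samples; it is fine for illustration but a rank-based computation would be needed at the scale of millions of queries.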
Query Time Prediction in Action (1/2)
Performance on real time traffic at Trulia
Query Time Prediction in Action (2/2)
Performance on real time traffic at Trulia
2. Query Pattern Recognition
• Exceptions, zero hit queries
• Analyze and find out why
• Probabilistic Topic Modeling
• Using MALLET open source toolkit
Topic Modeling Flow
Topics With Keywords
Future Direction
- Thoth ML improvements:
• Predicting query time buckets
• Regression vs. classification
• Exceptions and zero hit query analysis
• Sizing and resource allocation
- SolrCloud integration
- Dashboard integration with SolrCloud
- More standard metrics on Thoth Monitor
- More data collection (load, GC)
Contributors and Special Thanks
Damiano: dbraga@trulia.com
Praneet: pmhatre@trulia.com
Fork us on Github!
github.com/trulia/thoth
JD Cantrell (API, Dashboard)
Giulio Grillanda (API, Dashboard)
Rajendra Shioramwar (Core)
Ying Wang (Design)
Girish Gudla (Monitor)
Alexander Kanarsky
Alex Burmester