SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Ingersoll, Lucidworks
October 11-14, 2016 • Boston, MA
SearchHub: How to Spend Your Summer Keeping it Real
Grant Ingersoll, CTO, Lucidworks
SearchHub Demo
github.com/lucidworks/searchhub
http://searchhub.lucidworks.com
SearchHub Details
• Basics:
• 37 Apache Projects registered so far plus LW properties, opensource.com, Stack Overflow
• 130 data sources* including email, GitHub, JIRA*, website and wiki
• Fusion 2.4.2
• Signals everywhere
• UI based on View (work not complete)
• ASF Mail archives mirrored at: http://asfmail.lucidworks.io
Goals
• Company:
• “LucidFind” aka SearchHub on Fusion
• Provide backend for LW.com search, including docs and support
• Real, production, living, breathing instance of Fusion that we control
• Fusion best practices demo of major use cases
• CTO Office
• Real data, including clicks
• Platform for machine learning and experimentation
• Demos and talks
Agenda
• Quick Intro to Fusion and SearchHub
• Fusion Configuration, UI, Middle Tier
• Data Acquisition
• Deployment
• Signals and Machine Learning
• Next Steps
Fusion in a Nutshell
• Drive next-generation relevance via Content, Collaboration and Context
• Built on best-in-class open source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
• Access data from anywhere to build intelligent, data-driven applications
Fusion
[Architecture diagram. Components shown:]
• Data sources: logs, files, web, databases, cloud, Hadoop
• Core services: Connectors, Pipelines, Scheduling, Alerting / Messaging, Blob Storage, Recommenders / Signals, NLP
• Access: REST API, Admin UI, Lucidworks View, with security built in
• Apache Solr (shards)
• Apache ZooKeeper (ZK 1 ... ZK N): leader election, load balancing, shared config management
• Apache Spark (cluster manager, workers)
• HDFS (optional)
Fusion Configuration, UI and Middle Tier
• UI
• Derivative of Lucidworks View (https://lucidworks.com/products/view/)
• Deep integration of Snowplow Javascript Tracker (https://github.com/snowplow/snowplow/wiki/javascript-tracker)
• Python Flask middle tier ($SEARCHHUB_HOME/python)
• Data sources (project_config)
• Pipelines (fusion_config)
• Schedules (fusion_config)
Data Acquisition
• Sources:
• ASF Mail archives mirrored at: http://asfmail.lucidworks.io
• Stack Overflow (SO)
• GitHub
• Processing
• Pipelines, including custom stage for parsing mail
• Main Challenges:
• “fail2ban” by the ASF
• Focused crawling of SO — JSoup FTW! (try.jsoup.org)
• Mail Threads
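The mail-thread challenge above can be illustrated with a minimal sketch. It assumes threads are rebuilt from the standard `Message-ID` and `In-Reply-To` headers using only the Python standard library; the function names are hypothetical, and this is not the custom Fusion pipeline stage the slide mentions.

```python
from collections import defaultdict
from email import message_from_string


def thread_root(msg_id, parent_of):
    """Follow In-Reply-To links upward until no known parent remains."""
    current = msg_id
    seen = set()  # guards against an infinite loop on malformed reply chains
    while current in parent_of and current not in seen:
        seen.add(current)
        current = parent_of[current]
    return current


def group_into_threads(raw_messages):
    """Return {root_message_id: [message_ids in input order]}."""
    parent_of = {}
    ids = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg["Message-ID"] or "").strip()
        parent = (msg["In-Reply-To"] or "").strip()
        if not msg_id:
            continue  # cannot place a message without an id
        ids.append(msg_id)
        if parent:
            parent_of[msg_id] = parent
    threads = defaultdict(list)
    for msg_id in ids:
        threads[thread_root(msg_id, parent_of)].append(msg_id)
    return dict(threads)
```

If a parent message is missing from the archive, its id still serves as the thread key, so orphaned replies stay grouped together. Real archives also need the `References` header and subject-based fallbacks, which this sketch leaves out.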
Deployment
• Client and Middle Tier run in a Docker container using Apache HTTPd and mod_wsgi
• Hosted on AWS (m4.2xlarge instances)
• Fusion backend is out-of-the-box 2.4.2 with extra memory for Connectors and Solr
• README has the gory details: https://github.com/lucidworks/searchhub/blob/master/README.md
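As a rough illustration of the mod_wsgi setup above: by default mod_wsgi looks for a WSGI callable named `application` in the script file it loads. The sketch below uses a plain standard-library WSGI function as a stand-in. The real middle tier is a Flask app (a Flask app object is itself a WSGI callable), and the response body here is purely hypothetical.

```python
import json


def application(environ, start_response):
    """Stand-in WSGI entry point; mod_wsgi calls this once per request."""
    payload = {"status": "ok", "path": environ.get("PATH_INFO", "/")}
    body = json.dumps(payload).encode("utf-8")
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

The README linked above remains the authority on how the actual container is wired together.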
Signals
• UI is fully instrumented, using the Snowplow JavaScript Tracker, for most user interactions. See SnowplowService.js
• Captures, amongst other things:
• User Id, Session Id, Unique Query Id, IP address, Location, Timing data
• Actions tracked:
• Page View
• Page Ping (heartbeat) every 30 seconds
• Search with query, displayed doc list and displayed facet list
• Clicks with query, doc id, position, score and query UUID
• Typeahead Clicks with characters typed and suggestions offered
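To make the click signal above concrete, here is a minimal sketch of such a record. The class and field names are hypothetical and chosen only to mirror the slide's list (query, doc id, position, score, query UUID); they are not the Snowplow or Fusion signal schema.

```python
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class ClickSignal:
    """One click event; field names are illustrative, not a real schema."""
    user_id: str
    session_id: str
    query_id: str   # unique id issued when the search ran
    query: str
    doc_id: str
    position: int   # 1-based rank of the clicked result
    score: float
    timestamp: str  # ISO 8601, UTC


def new_query_id():
    """Issue a unique id at search time so later clicks can reference it."""
    return str(uuid.uuid4())


def make_click(user_id, session_id, query_id, query, doc_id, position, score, now=None):
    """Build a click record; `now` can be injected for testing."""
    now = now or datetime.now(timezone.utc)
    return ClickSignal(user_id, session_id, query_id, query, doc_id,
                       position, score, now.isoformat())
```

Carrying the same query id on the search event and on each click is what lets clicks be joined back to the exact result list the user saw. `asdict(click)` gives a plain dict ready to serialize as JSON.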
Machine Learning
• Fusion makes it easy to “round-trip” ML data/models between Spark and Solr
• Examples of:
• Recommenders
• Spark Lucene tokenization
• k-Means
• Word2Vec
• Topic Detection (LDA)
• Random Forests Classifier
• Many examples in SparkShellHelpers.scala
Experiment Management and Bandits
• Goal: Experimentation, not hard-coded rules*
• Goal: Drive down the cost of experimentation
• “A/B testing on steroids”
• Exploration vs. Exploitation
• Fusion 3.0 (beta):
• Record and calculate relevance metrics from within Fusion (gold standard, TREC, other)
• Easily calculate MRR, NDCG, Precision, Recall and report over time
• Support for Bandits: Epsilon-Greedy, SoftMax, UCB1
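For readers unfamiliar with the metrics and bandit strategies named above, here is a minimal textbook sketch of MRR, NDCG and an epsilon-greedy bandit in plain Python. It does not show how Fusion 3.0 implements them; all function, class and parameter names are assumptions made for illustration.

```python
import math
import random


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranked_ids, gain_by_id, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [gain_by_id.get(d, 0.0) for d in ranked_ids[:k]]
    ideal = sorted(gain_by_id.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


class EpsilonGreedy:
    """Explore a random arm with probability epsilon, else exploit the best."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = {arm: 0 for arm in arms}
        self.values = {arm: 0.0 for arm in arms}  # running mean reward

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental mean: new = old + (reward - old) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a search setting, each arm could be a candidate query pipeline and the reward a click (1.0) or no click (0.0). That is where "exploration vs. exploitation" shows up: epsilon controls how much traffic keeps testing the variants that currently look worse.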
Demo
Still Hungry?
• “Combining Content and Collaboration in Recommenders” by Jake Mannix: Friday at 1:10 pm http://sched.co/7amt
• https://github.com/lucidworks/searchhub
• http://searchhub.lucidworks.com
• Email: [email protected]
• Twitter: @gsingers
• Web: http://lucidworks.com