Never Stop Exploring: Pushing the Limits of Solr
Anirudha Jadhav ©2014 Bloomberg L.P.
Who am I ?
• Big Search and Distributed database specialist
• Built a Search as a Service platform
• Lead Search Architect @ Bloomberg Vault
• Credit Derivatives Analytics Engineer @ Bloomberg
• Master's @ Courant Institute of Mathematical Sciences, New York University
• Passionate about Search, Scuba Diving, Motorcycles and German Shepherds
bloomberg.com/company
Agenda
• Search at Bloomberg
• Goals and Objectives
• A little background
• Factors affecting indexing
• Our tests and benchmarks
• Design for a better NRT indexer
• Future work
• Q/A
Search at Bloomberg
• News Search
• Federated Search
• Complex re-ranking of search results
• Archival Search
• GeoSpatial Search
• Analytics and Statistics on Search
Objective
Significantly increase Near Real Time (NRT) indexing throughput
E.g. building a search application that receives market data
Indexing workflow
Indexing Data Flow in SolrCloud
Indexing Workflow
Consider the sentence:
We were talking about IBM during the fishing trip

Tokenization: A tokenizer splits the stream of characters into a series of tokens.
[We] [were] [talking] [about] [IBM] [during] [the] [fishing] [trip]

Down Casing: Creates tokens by lowercasing all letters and dropping non-letters.
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip]

Stemming / Lemmatization: Stemming algorithms reduce the words "fishing", "fished", "fish", and "fisher" to the root word "fish". Lemmatization groups a word with its inflected forms (i.e. fishing -> fished, fishes, fish, but not fisher).
[we] [were] [talking] [about] [ibm] [during] [the] [fishing] [trip] [talk] [fish]

Stop Word Removal: Removes common stop words ("and", "or", etc.) that introduce noise in the search process.

Synonym Expansion: Maps words based on a thesaurus (synonyms, acronyms, hypernyms, business rules, etc.). For example: talk -> chat, IBM -> "big blue", trip -> journey.
[talking] [about] [ibm] [fishing] [trip] [talk] [big] [blue] [fish] [journey] [chat]
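The analysis steps on this slide can be sketched as a small chain of functions. This is a toy illustration in plain Python, not Solr's analyzer API: the stem table, stop word list, and synonym map are hypothetical stand-ins chosen only to reproduce the example sentence, and the function names are made up for this sketch.

```python
import re

# Hypothetical lookup tables, sized only for the example sentence.
STEMS = {"talking": "talk", "fishing": "fish"}       # stand-in for a real stemmer
STOP_WORDS = {"we", "were", "during", "the"}         # assumed stop word list
SYNONYMS = {"talk": ["chat"], "ibm": ["big", "blue"], "trip": ["journey"]}


def tokenize(text):
    """Split the character stream into tokens, dropping non-letters."""
    return re.findall(r"[A-Za-z]+", text)


def down_case(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stem(tokens):
    """Keep the original tokens and append the stem of any token that has one."""
    return tokens + [STEMS[t] for t in tokens if t in STEMS]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def expand_synonyms(tokens):
    """Append the synonyms of each token after the original tokens."""
    extra = [s for t in tokens for s in SYNONYMS.get(t, [])]
    return tokens + extra


def analyze(text):
    """Run the full chain in the order shown on the slide."""
    tokens = tokenize(text)
    tokens = down_case(tokens)
    tokens = stem(tokens)
    tokens = remove_stop_words(tokens)
    return expand_synonyms(tokens)
```

Calling `analyze("We were talking about IBM during the fishing trip")` yields the same set of tokens as the last line of the example, although the order of the appended tokens may differ from the slide.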
Designing the Search Index
Designing a good Search Application also involves many aspects of user interaction that directly influence indexing design
• Data Type and Data Distribution
• Server side parameters
• Networking
• Client side parameters
• Query patterns
Factors Affecting Indexing
Data and Distribution of Tokens
Common types of data that we index in a search index
• Textual data ( human generated ) e.g. messages, news, blogs
• Textual data ( machine generated ) e.g. logs, tickets
• Numerical data
• Geospatial data
How does this affect search index designs ?
• Query speed and indexing speed depend on the size of an index
• Size is dependent on
  • Number of documents in the index
  • Average size of each document
  • Distribution of tokens
  • Index features, e.g. Faceting, Highlighting
Server-side Factors
• Ratio of CPUs to the number of Solr cores running
  • 2 Solr indices per CPU or thread
• Disk space
  • Disk space for Solr index * 2 ( headroom for merge cycles )
• Memory
  • JVM heap
  • Off heap
• DocValues
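The two rules of thumb above (2 Solr indices per CPU, and 2x disk headroom for merges) can be turned into a quick sizing calculation. This is a back-of-the-envelope sketch only; the function name and parameters are hypothetical, and memory sizing is left out because the slide gives no rule for it.

```python
import math


def size_node(num_solr_cores, index_bytes_per_core,
              cores_per_cpu=2, merge_headroom=2):
    """Return (cpus_needed, disk_bytes_needed) for one node.

    cores_per_cpu:  Solr indices served per CPU or thread (slide suggests 2).
    merge_headroom: multiplier on index size to leave room for merge cycles
                    (slide suggests 2).
    """
    cpus = math.ceil(num_solr_cores / cores_per_cpu)
    disk = num_solr_cores * index_bytes_per_core * merge_headroom
    return cpus, disk
```

For example, 8 Solr cores of 50 GB each would call for 4 CPUs and 800 GB of disk under these assumptions.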
Networking
Cluster design considerations
• Should a cluster span data centers?
  • Latency between data centers
  • Reliability and availability SLAs
• Where does your ZooKeeper ensemble live?
  • How many election members
  • Consider observers to scale ZooKeeper
  • Dynamically promote an observer to election member
Manage concurrent connections on the server
Monitor network latencies for QoS guarantees
Client-side Factors
• Managing connections and reusing connections
• Which format to use for indexing data
• javabin • csv • json • xml
• How many simultaneous threads to use
Experiments with NRT Indexing
It’s not always efficient to send a single document to Solr for indexing
How do you decide how many documents to send?
Collector: A buffer that collects Solr update documents
• Time Triggers ( T )
  • Time-based collector on the client side to batch document payloads to Solr
• Document Size Triggers ( S )
  • Document-size-based collector on the client side to batch document payloads to Solr
• Document Number Triggers ( N )
  • Number-of-documents-based collector on the client side to batch document payloads to Solr
The collectors are all used simultaneously, in order of priority. The lower-priority collectors act as cut-off backups that safeguard against overflows.
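A client-side collector with all three triggers could look roughly like the sketch below. It is an assumption-laden illustration, not the implementation used in the talk: the class name, parameter names, and the T, S, N checking order are all hypothetical, and since any trigger that fires flushes the whole buffer, the order only decides which check runs first.

```python
import time


class Collector:
    """Buffers serialized update documents until one of three triggers fires."""

    def __init__(self, max_age_s, max_bytes, max_docs, clock=time.monotonic):
        self.max_age_s = max_age_s    # Time trigger (T)
        self.max_bytes = max_bytes    # Document size trigger (S)
        self.max_docs = max_docs      # Document number trigger (N)
        self._clock = clock           # injectable so the timer can be tested
        self._docs = []
        self._bytes = 0
        self._started = None

    def add(self, doc_bytes):
        """Buffer one document; return the batch if a trigger fired, else None."""
        if not self._docs:
            self._started = self._clock()   # window opens with the first doc
        self._docs.append(doc_bytes)
        self._bytes += len(doc_bytes)
        return self.poll()

    def poll(self):
        """Check the triggers; call periodically so T fires on a quiet stream."""
        if not self._docs:
            return None
        if (self._clock() - self._started >= self.max_age_s
                or self._bytes >= self.max_bytes
                or len(self._docs) >= self.max_docs):
            return self._flush()
        return None

    def _flush(self):
        """Hand back the buffered batch and reset the buffer."""
        batch, self._docs, self._bytes = self._docs, [], 0
        return batch
```

The caller would send each returned batch to Solr as a single update request. A background timer calling `poll()` covers the case where documents stop arriving before the size or count limit is reached.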
Tests and Benchmarks
Benchmarking Setup
• Client application sending data to 4-way replicated SolrCloud
• 5 node Zookeeper ensemble
• All tests done with a similar dataset ( machine generated text )
  • We synthesize a high-throughput ingest stream, which serves as our input
• Soft commits set at 1 sec
Benchmarking : Time Limit Tests
[Chart: docs/sec vs. Time Triggers: collection window in ms]
Benchmarking : Document Limit Tests
[Chart: docs/sec vs. Document Number Triggers: collection window in number of documents]
Benchmarking : Byte Limit Tests
[Chart: docs/sec vs. Document Size Triggers: collection window in bytes]
Observations
• On average, we observed a 5x-7x increase in ingestion throughput
• Optimization parameters depend on constantly changing factors
• The tuning variables need to be constantly adjusted for best performance
• How to use this now?
Design for a better NRT indexer
PID Controller
Proportional term ( P ) - present
Output proportional to the current error value
Integral term ( I ) - past
Sum of the instantaneous error over time; gives the accumulated offset that should
have been corrected previously
Derivative term ( D ) - future
Calculated by determining the slope of the error over time (its rate of change)
and multiplying it by the derivative gain
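The three terms combine into a single control output. The sketch below is a textbook discrete PID controller, shown only to make the terms concrete; the class name and the gain names `kp`, `ki`, `kd` are conventional choices, not taken from the talk.

```python
class PID:
    """Minimal discrete PID controller (textbook form)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self._integral = 0.0
        self._prev_error = None

    def update(self, measurement, dt):
        """Return the control output for one sample taken dt seconds after the last."""
        error = self.setpoint - measurement            # present (P)
        self._integral += error * dt                   # past (I): accumulated error
        if self._prev_error is None:
            derivative = 0.0                           # no slope on the first sample
        else:
            derivative = (error - self._prev_error) / dt   # future (D): error slope
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative
```

With `PID(kp=2.0, ki=0.5, kd=1.0, setpoint=10.0)`, a first measurement of 8.0 at `dt=1.0` gives an error of 2, an integral of 2, and no derivative yet, so the output is 2.0 * 2 + 0.5 * 2 = 5.0.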
PID implementation in the indexer
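[Diagram: the client indexer process runs indexing threads that send batches to SolrCloud. A sampling thread reads the Solr responses and measures the process variable (docs/sec). The PID controller implementation uses it to adjust the control variable, which is one of the triggers, e.g. Time ( T ).]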
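The feedback loop in the diagram might be sketched as follows, with docs/sec as the process variable and the time trigger window as the control variable. Everything here is an assumption made for illustration: the target throughput, the gains, the clamping limits, the `measure_docs_per_sec` callback (standing in for the sampling thread), and in particular the premise that a larger window raises throughput within the operating range. It is not the indexer described in the talk.

```python
class PID:
    """Compact textbook PID controller, repeated here so the sketch is self-contained."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd, self.setpoint = kp, ki, kd, setpoint
        self._integral, self._prev_error = 0.0, None

    def update(self, measurement, dt):
        error = self.setpoint - measurement
        self._integral += error * dt
        slope = 0.0 if self._prev_error is None else (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * slope


def tune_time_trigger(measure_docs_per_sec, target_docs_per_sec, window_ms,
                      steps, kp, ki=0.0, kd=0.0, min_ms=1.0, max_ms=5000.0):
    """Adjust the time trigger window toward a target throughput.

    measure_docs_per_sec(window_ms) is a hypothetical callback that indexes
    with the given window for one sampling period and returns observed docs/sec.
    """
    pid = PID(kp, ki, kd, target_docs_per_sec)
    for _ in range(steps):
        rate = measure_docs_per_sec(window_ms)          # process variable
        window_ms += pid.update(rate, dt=1.0)           # nudge the control variable
        window_ms = max(min_ms, min(max_ms, window_ms)) # keep the window in bounds
    return window_ms
```

Clamping the window keeps a badly tuned controller from driving the collector to a zero-length or unbounded batch; in practice the gains would have to be tuned against the real cluster.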
Future Work
• Perfect the PID indexer
• Add it to the YCSB benchmarking framework
• Add other server side parameters on the PID indexer
• Use the PID indexer along with the YCSB framework to size hardware
QUESTIONS ?