High Performance Solr
Uploaded by shalin-mangar
Performance constraints
• CPU
• Memory
• Disk
• Network
Tuning (CPU) Queries
• Phrase query
• Boolean query (AND)
• Boolean query (OR)
• Wildcard
• Fuzzy
• Soundex
• …roughly in order of increasing cost
• Query performance is inversely proportional to the number of matching documents (document frequency)
Tuning (CPU) Queries
• Reduce frequent-term queries
• Remove stopwords
• Try CommonGramsFilter
• Index pruning (advanced)
• Some function queries match ALL documents - terribly inefficient
Tuning (CPU) Queries
• Make efficient use of caches
• Watch those eviction counts
• Beware of NOW in date range queries: it changes every millisecond, so such filters never get a cache hit. Use NOW/DAY or NOW/HOUR
• No need to cache every filter
• Use fq={!cache=false}year:[2005 TO *]
• Specify cost for non-cached filters for efficiency
• fq={!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}
• Use PostFilters for very expensive filters (cache=false, cost >= 100)
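The caching advice above can be sketched as a small parameter builder. This is only an illustration: the `timestamp` field and the `build_filter_params` helper are assumptions, while the `year` and `location` filters are the ones quoted on the slide.

```python
from urllib.parse import urlencode, parse_qsl


def build_filter_params(days_back=7):
    """Return Solr query parameters as a list of (name, value) pairs."""
    return [
        ("q", "*:*"),
        # Rounded to the day, so every request issued on the same day sends
        # the same filter string and can reuse one filter cache entry.
        # `timestamp` is a hypothetical date field.
        ("fq", f"timestamp:[NOW/DAY-{days_back}DAYS TO NOW/DAY+1DAY]"),
        # Rarely reused filter: keep it out of the filter cache.
        ("fq", "{!cache=false}year:[2005 TO *]"),
        # Expensive non-cached filter: the cost orders it relative to
        # other non-cached filters (lower costs run first).
        ("fq", "{!geofilt sfield=location pt=22,-127 d=50 cache=false cost=50}"),
    ]


# URL-encoded form, ready to append to a /select request.
query_string = urlencode(build_filter_params())
```

Building the parameters as a list of pairs, rather than a dict, keeps the repeated `fq` entries intact.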
Tuning (CPU) Queries
• Warm those caches
• Auto-warming
• Warming queries
• firstSearcher
• newSearcher
Tuning (CPU) Queries
• Stop using primitive number/date fields if you are performing range queries
• facet.query (sometimes) or facet.range are also range queries
• Use Trie* Fields
• When performing range queries on a string field (rare use-case), use frange to trade off memory for speed
• It will un-invert the field
• No additional cost is paid if the field is already being used for sorting or other function queries
• fq={!frange l=martin u=rowling}author_last_name instead of fq=author_last_name:[martin TO rowling]
Tuning (CPU) Queries
• Faceting methods
• facet.method=enum - great for fields with few unique values
• facet.enum.cache.minDf - terms with a document frequency at or above this value use the filter cache; rarer terms are counted by iterating through DocsEnum
• facet.method=fc
• facet.method=fcs (per-segment)
• facet.sort=index is faster than facet.sort=count but rarely useful in typical cases
Tuning (CPU) Queries
• ReRankQueryParser
• Like a PostFilter but for queries!
• Run expensive queries at the very end, on the top results only
• Solr 4.9+ only (soon to be released)
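As a rough sketch of how a re-rank request might be assembled, the helper below builds the `rq` parameter for the rerank query parser. The `rerank_params` function, the example queries, and the chosen `reRankDocs` and `reRankWeight` values are illustrative assumptions, not recommendations from the talk.

```python
def rerank_params(main_query, expensive_query, top_n=1000, weight=3):
    """Build parameters that re-score only the top hits of a cheap query."""
    return {
        "q": main_query,
        # Only the top_n documents matched by q are re-scored with the
        # expensive query, which is passed by reference as $rqq.
        "rq": f"{{!rerank reRankQuery=$rqq reRankDocs={top_n} reRankWeight={weight}}}",
        "rqq": expensive_query,
    }


# Hypothetical queries: a cheap term query re-ranked by a costlier phrase query.
params = rerank_params("title:solr", 'body:"high performance search"')
```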
Tuning (CPU) Queries
• Divide and conquer
• Shard’em out
• Use multiple CPUs
• Sometimes multiple cores are the answer even for small indexes, especially under high update rates
Tuning Memory Usage
• Use DocValues for sorting/faceting/grouping
• The docValuesFormat options {'default', 'memory', 'direct'} have different trade-offs.
• default - Helps avoid OOM but uses disk and OS page cache
• memory - compressed in-memory format
• direct - no-compression, in-memory format
Tuning Memory Usage
• termIndexInterval - Controls how often terms are loaded into the in-memory term index. Default is 128. Larger values use less memory but slow down term lookups.
Tuning Memory Usage
• Garbage Collection pauses kill search performance
• GC pauses expire ZooKeeper (ZK) sessions in SolrCloud, leading to many problems
• Large heap sizes are almost never the answer
• Leave a lot of memory for the OS page cache
• http://wiki.apache.org/solr/ShawnHeisey
Tuning Disk Usage
• Atomic updates are costly
• Lookup from transaction log
• Lookup from Index (all stored fields)
• Combine
• Index
Tuning Disk Usage
• Experiment with merge policies
• TieredMergePolicy is great but LogByteSizeMergePolicy can be better if multiple indexes are sharing a single disk
• Increase buffer size - ramBufferSizeMB (>1024M doesn’t help, may reduce performance)
Tuning Disk Usage
• Always hard commit once in a while
• Best to use autoCommit with maxDocs
• Trims transaction logs
• Solution for slow startup times
• Use autoSoftCommit for new searchers
• commitWithin is a great way to commit frequently
Tuning Network
• Batch writes together as much as possible
• Always use CloudSolrServer in SolrCloud
• Routes updates intelligently to correct leader
• ConcurrentUpdateSolrServer (previously known as StreamingUpdateSolrServer) for indexing in non-Cloud mode
• Don’t use it for querying!
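Batching writes could look roughly like the sketch below. It is client-agnostic: `send_batch` is a hypothetical callable standing in for whatever actually posts a list of documents to Solr, and the default batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice


def batched(docs, batch_size):
    """Yield lists of at most batch_size documents from any iterable."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(docs, send_batch, batch_size=1000):
    """Send documents in batches and return the number of requests made.

    send_batch is a hypothetical callable that posts one list of
    documents to Solr in a single update request.
    """
    requests_made = 0
    for batch in batched(docs, batch_size):
        send_batch(batch)
        requests_made += 1
    return requests_made
```

With 2,500 documents and the default batch size, this makes three update requests instead of 2,500.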
Tuning Network
• Share one HttpClient instance across all SolrJ clients, or just re-use the same client object
• Disable retries on HttpClient
Tuning Network
• Distributed search is optimised if you ask for fl=id,score only
• Avoids numShards*rows stored field lookups
• Saves numShards network calls
Tuning Network
• Consider setting up a caching proxy such as squid or varnish in front of your Solr cluster
• Solr can emit the right cache headers if configured in solrconfig.xml
• Last-Modified and ETag headers are generated based on the properties of the index such as last searcher open time
• You can even force new ETag headers by changing the ETag seed value
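• <httpCaching never304="false"><cacheControl>max-age=30, public</cacheControl></httpCaching>
• With the above config your caching proxy caches responses for 30s; after that it can revalidate them, and Solr answers 304 Not Modified unless the index has been modified. (never304="true" would turn off the Last-Modified and ETag validation described above.)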
Avoid wastage
• Don’t store what you don’t need back
• Use stored=false
• Don’t index what you don’t search
• Use indexed=false
• Don’t retrieve what you don’t need back
• Don’t use fl=* unless necessary
• Don’t use rows=10 when all you need is numFound; use rows=0
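A minimal sketch of the retrieval advice above follows. The helper names and the `wt=json` choice are assumptions, and `num_found` expects an already parsed JSON response.

```python
def count_only_params(query):
    """Parameters for a request that only needs the hit count."""
    # rows=0 asks Solr for numFound without returning any documents.
    return {"q": query, "rows": 0, "wt": "json"}


def lean_page_params(query, fields=("id", "score"), rows=10):
    """Parameters that fetch only the listed fields instead of fl=*."""
    return {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}


def num_found(response):
    """Read the hit count from a parsed Solr JSON response."""
    return response["response"]["numFound"]
```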
Reduce indexed info
• omitNorms=true - Use if you don’t need index-time boosts
• omitTermFreqAndPositions=true - Use if you don’t need term frequencies and positions
• No fuzzy query, no phrase queries
• Can do simple exists check, can do simple AND/OR searches on terms
• No scoring difference whether the term exists once or a thousand times
DocValue tricks & gotchas
• DocValue field should be stored=false, indexed=false
• It can still be retrieved using fl=field(my_dv_field)
• If you store DocValue field, it uses extra space as a stored field also.
• In future, update-able doc value fields will be supported by Solr but they’ll work only if stored=false, indexed=false
• DocValues also save disk space (storing all values next to each other leads to very efficient compression)
Deep paging
• Bulk exporting documents from Solr will bring it to its knees
• Enter deep paging and cursorMark parameter
• Specify cursorMark=* on the first request
• Use the returned ‘nextCursorMark’ value as the cursorMark parameter of the next request
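The cursor loop can be sketched as below. `fetch_page` is a hypothetical callable that sends the given parameters to Solr and returns the parsed JSON response. The sort on `id` assumes `id` is the uniqueKey field, since cursors need a sort that includes the uniqueKey.

```python
def export_all(fetch_page, rows=500):
    """Yield every matching document by following Solr cursor marks.

    fetch_page is a hypothetical callable: it takes a dict of query
    parameters and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # cursorMark=* starts a new cursor
    while True:
        response = fetch_page({
            "q": "*:*",
            "rows": rows,
            "sort": "id asc",  # assumes `id` is the uniqueKey field
            "cursorMark": cursor,
        })
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        # An unchanged cursor means there are no more results.
        if next_cursor == cursor:
            break
        cursor = next_cursor
```

Because the function is a generator, a caller can stream documents to a file without holding the whole export in memory.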
Classic paging vs Deep paging
LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash
LucidWorks
• We’re hiring!
• Work on open source Apache Lucene/Solr
• Help our customers win
• Work remotely from home! Location no bar!
• Contact me at [email protected]