Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana

OCTOBER 13-16, 2015 AUSTIN, TX

Transcript of Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana

Page 1: Solr At Scale For Time-Oriented Data: Presented by Brett Hoerner, Rocana

OCTOBER 13-16, 2015 • AUSTIN, TX

Page 2

Solr at Scale for Time-Oriented Data
Brett Hoerner (@bretthoerner)

Senior Platform Engineer, Rocana

Page 3

• Local to Austin, TX

• Have used Solr(Cloud) since 4.0 (2012)

• Not a contributor, just a user

• Work for startups, typically focused on scalability & performance

• Generally (have to) handle operations in addition to development


Page 4: Quick plug

• "Tuning Solr for Logs"Radu Gheorghe's talk atLucene/Solr Revolution 2014bit.ly/tuning-solr-for-logs


Page 5: Spredfast

• SaaS social media marketing research tool

• Access to full firehose for multiple networks

• Example SolrCloud collection:
  ~150+ billion documents spanning 1 year
  ~10k writes/second
  ~45-65 fields per document
  ~800 shards
  On 13 machines in EC2
  Engineering+Operations team of 1-2


Page 6

(image slide)

Page 7

(image slide)

Page 8: Rocana

• (Ro)ot (Ca)use A(na)lysis for complex IT operations (large datacenters)

• On-premises enterprise software (not SaaS)

• Monitors 10s or 100s of thousands of machines

• Customers care about 1TB/day on the low end

• Hadoop ecosystem


Page 9

(image slide)

Page 10: Time-Oriented Realtime Search

• Each social post or log line becomes a Solr doc

• Almost always sort on time field (not TF-IDF)

• Queries almost always include facets

• Queries always include a time range: "last 30 minutes", "last 30 days", "December 2014" (see the example request below)

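Putting those together, a representative request might look like the sketch below; the facet field source is an illustrative assumption (the ts range reuses the epoch values from a later slide):

  /solr/my_collection/select?q=*:*
    &fq=ts:[1444755392 TO 1444841789]
    &facet=true&facet.field=source
    &sort=ts desc&rows=100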

Page 11: Time-Oriented Realtime Search

• Typically part of a larger stream processing system

• Kafka, or something like it, is recommended


(Diagram: multiple firehoses feed into Kafka; a pool of Solr indexers consumes from Kafka and writes to the Solr nodes; an S3 writer also consumes from Kafka and archives the raw data to S3.)

Page 12: Optimizing indexing

• Adjust... (see the config sketch below)
  JVM heap (up to ~30GB)
  ramBufferSizeMB (up to ~512MB)
  solr.autoCommit.maxTime (multiple minutes, with autoCommit openSearcher=false)
  solr.autoSoftCommit.maxTime (as high as possible)
  mergeFactor

• Batch writes! (by count and time)

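A minimal solrconfig.xml sketch of those knobs; the values shown are the rough upper bounds from the bullet above, not universal recommendations:

  <indexConfig>
    <ramBufferSizeMB>512</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
  </indexConfig>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- hard commit every few minutes, without opening a new searcher -->
      <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- soft commit as rarely as your freshness requirements allow -->
      <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
    </autoSoftCommit>
  </updateHandler>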

Page 13: Optimizing queries

• DocValues on any field you sort/facet on

• Warm on most common sort (time)

• Small filterCache, and only use it for the time range:
    fq=ts:[1444755392 TO 1444841789]
    q=text:happy+birthday
  OR at least cache separately:
    fq=ts:[1444755392 TO 1444841789]
    fq=text:happy+birthday
    q=*:*
  (see the config sketch below)

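A sketch of the matching solrconfig.xml <query> section; the cache size, autowarm count, and the field name ts are illustrative assumptions:

  <query>
    <!-- keep the filterCache small when the time-range fq is its main customer -->
    <filterCache class="solr.FastLRUCache" size="64" initialSize="64" autowarmCount="8"/>

    <!-- warm each new searcher with the common time sort -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">ts desc</str>
        </lst>
      </arr>
    </listener>
  </query>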

Page 14: Sharding by time

• By default, Solr hashes the unique field* of each document to decide which shard it belongs on.
  * uniqueKey in schema.xml (see below)

• The effect is that documents are evenly spread across *all* shards

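For reference, the hashed field is whatever schema.xml declares as the unique key, e.g.:

  <uniqueKey>id</uniqueKey>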

Page 15: Sharding by time

• This means every shard is actively writing and merging new segments all the time

• Your per-node write rate is total docs/sec divided by the number of nodes, which spreads writes pretty thin if you're thinking of using, say, 500 shards


Page 16: Sharding by time

• Even worse, on the read side this means *every* query must be sent to *every* shard (unless you're looking up a document by its unique field, which is a pretty poor use case for Solr...)

• Given 1 query and 500 shards:
  q=text:happy+timestamp:[37 TO 286]&sort=timestamp desc&rows=100
  sends 500 requests out
  searches/sorts your *entire* data set
  waits for 500 responses
  merges them
  and finally responds


Page 17: Sharding by time

• The solution is to take full control of document routing:
  /admin/collections?action=CREATE&name=my_collection
    &router.name=implicit
    &shards=1444780800,1444867200,1444953600,...

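With the implicit router, new time shards don't appear on their own: you (or a cron job) create each upcoming day's shard via the Collections API before data for it arrives. A sketch, using the day boundary after the ones above:

  /admin/collections?action=CREATESHARD
    &collection=my_collection
    &shard=1445040000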

Page 18: Sharding by time


(Diagram: Solr writers consume from Kafka and route each document to the my_collection shard named for its day: 1444780800, 1444867200, 1444953600, ...)

  {
    id: "event100",
    body: "hello, world",
    created_at: 1444965428,
    _route_: 1444953600
  }
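The _route_ value is simply the event time floored to its day boundary: 1444965428 - (1444965428 mod 86400) = 1444953600.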

Page 19: Sharding by time


(Diagram: the query below fans out only to the two my_collection shards named in its shards parameter.)

  /solr/my_collection/select?q=text:hello
    &fq=created_at:[1444874953 TO 1444989225]
    &shards=1444867200,1444953600
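The shards list is derived the same way documents were routed: floor each end of the time range to a day boundary and enumerate the days in between. Here 1444874953 floors to 1444867200 and 1444989225 floors to 1444953600, so only those two shards are searched.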

Page 20: Cluster "layering"

• Duplicate cluster that only holds more recent data

• ... but with more hardware per document


(Diagram: a query for "last hour" goes to the layer holding 30 days of data; a query for "last June" goes to the layer holding 12 months of data.)

Page 21: Hacks

• bit.ly/created-at-hack

• If we can make assumptions about what's in each shard, we can optimize the "sub" queries that are sent to each node

• Also optionally disable facet refinement


Page 22: Replication

• Solr on HDFS is one interesting option: it can recover existing distributed indexes on another node (using the *same* directory!); see autoAddReplicas in the Collections API CREATE call (example below)

• "Normal" replication was historically an issue (for us) at scale

• Apparently made 100% faster in Solr 5.2

• Remember that replicas aren't backups

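A sketch of a CREATE call with autoAddReplicas enabled (meaningful when the indexes live on a shared filesystem like HDFS; other parameters elided):

  /admin/collections?action=CREATE&name=my_collection
    &router.name=implicit&shards=...
    &autoAddReplicas=true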

Page 23

• So, you have your >100 billion document cluster running...

• Indexes are slowly created over the course of months/years by ingesting realtime data...


Page 24

• But what if...
  We need to add new fields (to old docs)
  We need to remove unused fields
  We need to change fields (type, content)
  We decide we need to query further in the past
  We have catastrophic data loss
  We want to upgrade Solr (with no risk)


Page 25: Timebomb

• Let's say:
  We index 5k docs/sec for a year
  That means 157,680,000,000 documents
  Say the cluster can ingest 50k docs/sec max
  It'd take 36.5 days to reindex a year
  ... for any/every change
  ... if nothing went wrong for 36.5 days
  ... and you need to write the code to do it

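Checking the arithmetic: 5,000 docs/sec × 86,400 sec/day × 365 days = 157,680,000,000 docs, and 157,680,000,000 docs ÷ 50,000 docs/sec = 3,153,600 seconds ≈ 36.5 days.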

Page 26: MapReduceIndexerTool

• Hadoop to the rescue (?)

• Under Solr contrib: github.com/apache/lucene-solr/tree/trunk/solr/contrib/map-reduce

• Given raw input data*, run a MapReduce job that generates Solr indexes (locally!) (representative invocation below)
  * this is one good reason to use something like Kafka and push all your raw data to HDFS/S3/etc in addition to Solr

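A representative invocation, assuming Cloudera-style packaging; the jar name, morphline file, and HDFS paths are placeholders for your environment:

  hadoop jar search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
    --morphline-file morphline.conf \
    --output-dir hdfs://nn/outdir \
    --zk-host zk1:2181/solr \
    --collection my_collection \
    hdfs://nn/indir

Adding --go-live merges the generated indexes directly into a live cluster running on HDFS, which is the GoLive option mentioned on the next slide.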

Page 27: MapReduceIndexerTool

• Amazon ElasticMapReduce works well for this
  Plus, you can use spot instances (cheap!)

• The trick is, you have to load the completed indexes yourself
  At that point it becomes an Ops problem; some kind of orchestration like Chef comes in handy here, but it's not done for you or open-source (yet?)

• Unless you run Solr on HDFS (GoLive)


Page 28: MapReduceIndexerTool

• ~150 billion document collection spanning 1 year reindexed from scratch and running on a new cluster in ~6 days for ~$3k
  Bug/bribe Adam McElwee to open source it :) twitter.com/txlord


Page 29: Conclusion

• Optimize like you would any Solr cluster

• Reduce caching; RAM is probably scarce and hit rates are probably low

• Shard based on time

• Be prepared to rebuild the entire collection so you can iterate on product/design


Page 30: Fin

[email protected]
twitter.com/bretthoerner
rocana.com/careers
