Using druid for interactive count distinct queries at scale @ nmc
![Page 1: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/1.jpg)
Yakir Buskilla + Itai Yaffe, Nielsen
USING DRUID FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE
![Page 2: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/2.jpg)
Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years
![Page 3: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/3.jpg)
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)
![Page 4: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/4.jpg)
NMC high-level architecture
![Page 5: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/5.jpg)
The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream
that may contain repeated elements
![Page 6: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/6.jpg)
The need
![Page 7: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/7.jpg)
The need
![Page 8: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/8.jpg)
Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices - 1.25 GB/day
○ 10B devices * 80K attributes - 100 TB/day
● Approx.: approximate
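The bit-vector figures on this slide can be checked with simple arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and one bit vector per attribute per day; the variable names are illustrative only.

```python
# Storage needed for a bit vector with one bit per device.
DEVICES = 10_000_000_000   # 10B devices (from the slide)
ATTRIBUTES = 80_000        # 80K attributes (from the slide)

bytes_per_day = DEVICES // 8                      # 1 bit per device, 8 bits per byte
gb_per_day = bytes_per_day / 10**9                # decimal gigabytes (assumption)
tb_per_day_all_attributes = bytes_per_day * ATTRIBUTES / 10**12

print(f"one bit vector per day: {gb_per_day} GB")
print(f"one bit vector per attribute per day: {tb_per_day_all_attributes} TB")
```

A single vector is manageable, but one vector per attribute is not, which is what motivates the approximate approach.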
![Page 9: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/9.jpg)
Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data, 10 hours to index
■ Affects query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index
![Page 10: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/10.jpg)
What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)
![Page 11: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/11.jpg)
ThetaSketch
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
● ThetaSketch mathematical framework - generalization of KMV
(Figures: Venn diagrams of two sets, X and Y)
![Page 12: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/12.jpg)
KMV intuition
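A minimal sketch of the KMV idea, not the ThetaSketch implementation itself: hash every item to a pseudo-uniform value in [0, 1), keep only the k smallest distinct hash values, and estimate the cardinality as (k - 1) divided by the k-th smallest value. The function names, the SHA-1 hash, and k = 1024 are assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) from 64 bits of SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(items, k=1024):
    """Estimate the number of distinct items, keeping only k hash values."""
    heap = []        # max-heap (negated values) holding the k smallest hashes
    members = set()  # hashes currently in the heap, used to skip duplicates
    for item in items:
        h = hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # h is smaller than the current k-th smallest value: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))   # fewer than k distinct items: count is exact
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest


# 200,000 events coming from 50,000 distinct (hypothetical) device IDs.
events = [f"device-{i % 50_000}" for i in range(200_000)]
estimate = kmv_estimate(events, k=1024)
print(f"estimated distinct devices: {estimate:,.0f}")
```

Memory use depends on k rather than on the number of devices, and a larger k trades memory for a tighter estimate.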
![Page 13: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/13.jpg)
ThetaSketch error

| Sketch size (k) | 1 std dev (68.27% confidence) | 2 std dev (95.45% confidence) |
|---|---|---|
| 16,384 | 0.78% | 1.56% |
| 32,768 | 0.55% | 1.10% |
| 65,536 | 0.39% | 0.78% |
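The error figures on this slide are consistent with a relative standard error of roughly 1/sqrt(k), where k is the sketch size. That formula is an approximation assumed here, not something stated on the slide; the sketch below only reproduces the listed values from it.

```python
import math


def relative_standard_error(k: int) -> float:
    """Approximate relative standard error for a sketch of size k (assumed 1/sqrt(k))."""
    return 1.0 / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    rse = relative_standard_error(k)
    # Two standard deviations correspond to the 95.45% confidence column.
    print(f"k={k:>6,}  1 std dev: {rse:.2%}  2 std dev: {2 * rse:.2%}")
```

Under this approximation, quadrupling k halves the error.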
![Page 14: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/14.jpg)
DRUID
“Very fast highly scalable columnar data-store”
![Page 15: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/15.jpg)
Roll-up

| Timestamp | Attribute | Device ID |
|---|---|---|
| 2016-11-15 | 11111 | 3a4c1f2d84a5c179435c1fea86e6ae02 |
| 2016-11-15 | 22222 | 3a4c1f2d84a5c179435c1fea86e6ae02 |
| 2016-11-15 | 11111 | 5dd59f9bd068f802a7c6dd832bf60d02 |
| 2016-11-15 | 22222 | 5dd59f9bd068f802a7c6dd832bf60d02 |
| 2016-11-15 | 33333 | 5dd59f9bd068f802a7c6dd832bf60d02 |

ThetaSketchAggregator

| Timestamp | Attribute | Count Distinct |
|---|---|---|
| 2016-11-15 | 11111 | 2 |
| 2016-11-15 | 22222 | 2 |
| 2016-11-15 | 33333 | 1 |
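The roll-up on this slide can be mimicked in a few lines. This is only an illustration of the grouping logic: it uses exact Python sets where Druid would store a ThetaSketch per row, and the function name `roll_up` is hypothetical.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, attribute, device ID).
rows = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    # Written as 33333 here, matching the attribute in the slide's output table.
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(rows):
    """Group rows by (timestamp, attribute) and count distinct device IDs."""
    devices = defaultdict(set)   # exact set; a sketch would replace this in Druid
    for timestamp, attribute, device_id in rows:
        devices[(timestamp, attribute)].add(device_id)
    return {key: len(ids) for key, ids in devices.items()}


rolled = roll_up(rows)
for (timestamp, attribute), count in sorted(rolled.items()):
    print(timestamp, attribute, count)
```

Five raw rows collapse into three rolled-up rows, one per timestamp and attribute.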
![Page 16: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/16.jpg)
Druid architecture
![Page 17: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/17.jpg)
How do we use Druid
![Page 18: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/18.jpg)
Guidelines and pitfalls
● Setup is not easy
![Page 19: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/19.jpg)
Guidelines and pitfalls
● Monitoring your system
![Page 20: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/20.jpg)
Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases
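| Timestamp | Attribute | Count Distinct |
|---|---|---|
| 2016-11-15 | US | XXXXXX |
| 2016-11-15 | Porsche Intent | XXXXXX |
| ... | ... | ... |

| Timestamp | Attribute | Region | Count Distinct |
|---|---|---|---|
| 2016-11-15 | Porsche Intent | US | XXXXXX |
| ... | ... | ... | ... |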
![Page 21: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/21.jpg)
Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into single query
○ Use filters
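As one possible illustration of both points, the sketch below builds a single Druid native query that filters to the attributes of interest, computes one filtered ThetaSketch aggregation per attribute, and intersects them in a post-aggregation. The datasource name `device_attributes`, the dimension `attribute`, the metric `device_sketch`, and the helper `filtered_sketch` are hypothetical; the JSON shape follows the DataSketches extension's `thetaSketch`, `thetaSketchSetOp` and `thetaSketchEstimate` types as documented, and should be checked against the Druid version in use.

```python
import json


def filtered_sketch(name, attribute):
    """One ThetaSketch aggregation restricted to a single attribute value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": attribute},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": "device_sketch"},
    }


query = {
    "queryType": "timeseries",
    "dataSource": "device_attributes",          # hypothetical datasource
    "granularity": "all",
    "intervals": ["2016-11-15/2016-11-16"],
    # Top-level filter: scan only rows for the attributes being compared.
    "filter": {"type": "in", "dimension": "attribute", "values": ["11111", "22222"]},
    # Both sketches are computed in one query instead of two round trips.
    "aggregations": [
        filtered_sketch("attr_11111", "11111"),
        filtered_sketch("attr_22222", "22222"),
    ],
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_with_both",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "both",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "attr_11111"},
                    {"type": "fieldAccess", "fieldName": "attr_22222"},
                ],
            },
        }
    ],
}

print(json.dumps(query, indent=2))
```

The printed JSON would be posted to the broker's query endpoint; the estimate of devices having both attributes comes back in `devices_with_both`.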
![Page 22: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/22.jpg)
Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced storage used by 10x
![Page 23: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/23.jpg)
Guidelines and pitfalls
● Community
![Page 24: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/24.jpg)
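Summary

| | DRUID | ES |
|---|---|---|
| Daily data | 10TB/day | 250GB/day |
| Indexing time | 4 Hours/day | 10 Hours/day |
| Storage | 15GB/day | 2.5TB (total) |
| Query latency | 280ms-350ms | 500ms-6000ms |
| Cost | $55K/month | $80K/month |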
![Page 25: Using druid for interactive count distinct queries at scale @ nmc](https://reader035.fdocuments.us/reader035/viewer/2022062302/58ed35b21a28abb27e8b45dd/html5/thumbnails/25.jpg)
THANK YOU!