Fast > Perfect: Practical real-time approximations using Spark Streaming

Transcript of Fast > Perfect: Practical real-time approximations using Spark Streaming

Page 1: Fast > Perfect: Practical real-time approximations using Spark Streaming

Fast > Perfect

Practical real-time approximations using Spark Streaming

Kevin Schmidt @kevinschmidtbiz

Luis Vicente @lvicentesanchez

Page 2: Fast > Perfect: Practical real-time approximations using Spark Streaming

A Bit of Context: Mind Candy

Page 3: Fast > Perfect: Practical real-time approximations using Spark Streaming

A Bit of Context: Free To Play

Sum Arbitrary Values

Count Uniques

It’s Complicated

Page 4: Fast > Perfect: Practical real-time approximations using Spark Streaming

A Bit of Context: Setup

Page 5: Fast > Perfect: Practical real-time approximations using Spark Streaming

A Bit of Context: Requirements

• Constant storage space usage independent of number of users

• Handle delayed or duplicate data

• Error rate under 3%

Page 6: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Basics

How To Count IDs Uniquely

Page 7: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: HyperLogLog

addIdentifier(value: String)

merge(other: HyperLogLog): HyperLogLog

zero(): HyperLogLog

countUniques(): Long

HyperLogLog

12-bit size: Error Rate = 1.6%, Fixed Size = 4KB

14-bit size: Error Rate = 0.9%, Fixed Size = 16KB
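The four operations on this slide can be sketched in plain Scala. This is a minimal illustration and not the code from the talk's repository. It assumes an immutable design in which `addIdentifier` returns a new sketch, a 32-bit `MurmurHash3` hash from the Scala standard library, one byte per register, and a `bits` parameter on `zero`. None of these choices appear on the slides.

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog sketch: m = 2^bits registers, one byte each (assumption).
final case class HyperLogLog(registers: Vector[Byte]) {
  private val m = registers.length
  private val p = Integer.numberOfTrailingZeros(m) // m = 2^p

  // Hash the identifier, pick a register from the first p bits, and keep
  // the largest "position of the first 1 bit" seen in the remaining bits.
  def addIdentifier(value: String): HyperLogLog = {
    val h = MurmurHash3.stringHash(value)
    val idx = h >>> (32 - p)
    val rest = h << p
    val rank = (math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1).toByte
    if (rank > registers(idx)) copy(registers = registers.updated(idx, rank)) else this
  }

  // Merging is a register-wise maximum, so it is associative and idempotent.
  def merge(other: HyperLogLog): HyperLogLog = {
    require(other.registers.length == m, "sketches must have the same size")
    HyperLogLog(registers.zip(other.registers).map { case (a, b) => if (a >= b) a else b })
  }

  // Raw harmonic-mean estimate, with linear counting for small cardinalities.
  def countUniques(): Long = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val zeros = registers.count(_ == 0)
    if (raw <= 2.5 * m && zeros > 0) math.round(m * math.log(m.toDouble / zeros))
    else math.round(raw)
  }
}

object HyperLogLog {
  // `bits` is a hypothetical parameter: 12 gives 4096 registers, 14 gives 16384.
  def zero(bits: Int = 12): HyperLogLog = {
    require(bits >= 7 && bits <= 16, "bits must be between 7 and 16")
    HyperLogLog(Vector.fill(1 << bits)(0.toByte))
  }
}
```

Because `merge` takes the maximum of each register, adding the same identifier twice or merging overlapping sketches does not change the estimate. With a 32-bit hash this sketch is only suitable for modest cardinalities; a production implementation would normally use a 64-bit hash.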

Page 8: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: DStream

Page 9: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: RDD

Page 10: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Transform

Page 11: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Storing

Page 12: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Some Scala

https://github.com/lvicentesanchez/fast-gt-perfect

Page 13: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Adding Up

Page 14: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Performance

Page 15: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Users: Result

• Constant storage usage for one day of data using 14-bit HyperLogLogs: 288 * 16KB = 4608KB

• HyperLogLogs count users only once even if data is duplicated or repeated

• Time bucketing ensures delayed data is counted correctly

• Difference of <1% between HyperLogLogs and real count
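The bucketing and storage arithmetic behind these results can be sketched as follows. The five-minute bucket width is an inference from the 288 buckets per day (24 h * 60 min / 288 = 5 min) and is not stated on the slides. The object and function names are hypothetical, and an exact `Set[String]` stands in for the HyperLogLog so that the example stays self-contained.

```scala
object TimeBuckets {
  // Assumed bucket width: five minutes, which yields 288 buckets per day.
  val bucketMillis: Long = 5 * 60 * 1000L
  val bucketsPerDay: Int = (24 * 60 * 60 * 1000L / bucketMillis).toInt

  // Events are keyed by their own timestamp, not by arrival time,
  // so delayed data still lands in the bucket it belongs to.
  def bucketStart(eventTimeMillis: Long): Long =
    eventTimeMillis - Math.floorMod(eventTimeMillis, bucketMillis)

  // Exact Set used as a stand-in for a per-bucket HyperLogLog.
  def addEvent(buckets: Map[Long, Set[String]], userId: String,
               eventTimeMillis: Long): Map[Long, Set[String]] = {
    val key = bucketStart(eventTimeMillis)
    buckets.updated(key, buckets.getOrElse(key, Set.empty[String]) + userId)
  }

  // "Adding up": merge every bucket in [from, until) and count once.
  def uniquesBetween(buckets: Map[Long, Set[String]], from: Long, until: Long): Long =
    buckets
      .collect { case (start, ids) if start >= from && start < until => ids }
      .foldLeft(Set.empty[String])(_ union _)
      .size
      .toLong

  // Storage for one day is the bucket count times the fixed sketch size.
  def dailyStorageKB(sketchKB: Double): Double = bucketsPerDay * sketchKB
}
```

With a real HyperLogLog in place of the `Set`, `union` becomes `merge` and `size` becomes `countUniques()`. The storage stays at `bucketsPerDay` fixed-size sketches regardless of how many users are seen.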

Page 16: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Basics

How To Sum Arbitrary Values

Page 17: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: BloomFilter

BloomFilter

Configurable Size: Capacity = 10k, Error Rate = 1%, Size = 11.7KB

addIdentifier(value: String)

merge(other: BloomFilter): BloomFilter

zero(): BloomFilter

contains(value: String): Boolean
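A minimal Bloom filter with these operations might look like the sketch below. It is an illustration rather than the talk's implementation. It assumes an immutable design, double hashing over two seeded `MurmurHash3` hashes (the seeds are arbitrary), a `contains` that takes the value to look up, and the textbook sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. For a capacity of 10k and a 1% error rate these formulas give about 11.7KB, which agrees with the figure on the slide.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

final case class BloomFilter(numBits: Int, numHashes: Int, bits: BitSet) {
  // Double hashing: derive numHashes bit positions from two base hashes.
  private def indexes(value: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(value, 0x3c074a61)
    val h2 = MurmurHash3.stringHash(value, 0x7f4a7c15)
    (0 until numHashes).map(i => Math.floorMod(h1 + i * h2, numBits))
  }

  def addIdentifier(value: String): BloomFilter =
    copy(bits = indexes(value).foldLeft(bits)(_ + _))

  // May return a false positive, never a false negative.
  def contains(value: String): Boolean = indexes(value).forall(bits.contains)

  // Merging two filters of the same shape is a bitwise OR.
  def merge(other: BloomFilter): BloomFilter = {
    require(numBits == other.numBits && numHashes == other.numHashes,
      "filters must have the same shape")
    copy(bits = bits | other.bits)
  }

  def sizeKB: Double = numBits / 8.0 / 1024.0
}

object BloomFilter {
  // `capacity` and `errorRate` are hypothetical parameter names.
  def zero(capacity: Int = 10000, errorRate: Double = 0.01): BloomFilter = {
    val ln2 = math.log(2.0)
    val numBits = math.ceil(-capacity * math.log(errorRate) / (ln2 * ln2)).toInt
    val numHashes = math.max(1, math.round(numBits.toDouble / capacity * ln2).toInt)
    BloomFilter(numBits, numHashes, BitSet.empty)
  }
}
```

A false positive here means a sale that was never seen is reported as already counted, so any error from the filter shows up as slightly under-counted revenue.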

Page 18: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Transform

Page 19: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Transform

Page 20: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Storing

Page 21: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Some Scala

https://github.com/lvicentesanchez/fast-gt-perfect

Page 22: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Adding Up

Page 23: Fast > Perfect: Practical real-time approximations using Spark Streaming

Counting Revenue: Result

• Constant storage usage for one day of data using a 10k BloomFilter: 288 * 11.7KB = 3370KB

• BloomFilter eliminates sales already counted

• Time bucketing ensures delayed data is counted correctly and keeps BloomFilters small

• Difference of <1% between approximated and real revenue
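The way a membership filter and time bucketing combine to sum revenue without double counting can be sketched like this. The `Sale` and `BucketState` types, the five-minute bucket width, and the function names are assumptions for illustration. An exact `Set[String]` stands in for the BloomFilter so the example is self-contained.

```scala
object RevenueCounter {
  final case class Sale(id: String, amount: BigDecimal, eventTimeMillis: Long)

  // Per-bucket state: which sale ids were seen, and the running total.
  // `seen` is an exact Set here; the talk uses a BloomFilter in this role.
  final case class BucketState(seen: Set[String], revenue: BigDecimal)

  // Assumed bucket width of five minutes (288 buckets per day).
  val bucketMillis: Long = 5 * 60 * 1000L

  def addSale(state: Map[Long, BucketState], sale: Sale): Map[Long, BucketState] = {
    // Bucket by the sale's own timestamp so delayed data is placed correctly.
    val bucket = sale.eventTimeMillis - Math.floorMod(sale.eventTimeMillis, bucketMillis)
    val current = state.getOrElse(bucket, BucketState(Set.empty, BigDecimal(0)))
    if (current.seen.contains(sale.id)) state // duplicate: already counted
    else state.updated(bucket,
      BucketState(current.seen + sale.id, current.revenue + sale.amount))
  }

  // "Adding up": sum the per-bucket totals.
  def totalRevenue(state: Map[Long, BucketState]): BigDecimal =
    state.values.map(_.revenue).sum
}
```

A redelivered sale carries the same timestamp as the original, so it falls into the same bucket and is checked against the same filter. Each filter therefore only has to hold the sales of one bucket, which is what keeps it small.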

Page 24: Fast > Perfect: Practical real-time approximations using Spark Streaming

Trending: Basics

How To Find the Top K

Page 25: Fast > Perfect: Practical real-time approximations using Spark Streaming

Trending: StreamSummary

StreamSummary

Configurable Size: Capacity = 400, Max Size = 21.9KB

addIdentifier(value: String)

merge(other: SS): SS

topK(k: Int): Seq[(String, Long)]

Metwally, Agrawal & Abbadi: Efficient Computation of Frequent and Top-k Elements in Data Streams (2005)
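The core of the Space-Saving algorithm from the cited paper can be sketched with a plain map of counters. This is a simplified illustration and not the talk's implementation. It omits the linked stream-summary structure and the per-counter error bounds from the paper, uses a simple "sum and truncate" merge, and adds a `zero` constructor that is not on the slide.

```scala
final case class StreamSummary(capacity: Int, counters: Map[String, Long]) {
  // Space-Saving update: increment a monitored item, or replace the
  // current minimum and inherit its count plus one when the summary is full.
  def addIdentifier(value: String): StreamSummary =
    if (counters.contains(value))
      copy(counters = counters.updated(value, counters(value) + 1))
    else if (counters.size < capacity)
      copy(counters = counters.updated(value, 1L))
    else {
      val (minKey, minCount) = counters.minBy(_._2)
      copy(counters = (counters - minKey).updated(value, minCount + 1))
    }

  // Simplified merge (assumption): add counts per item, keep the largest.
  def merge(other: StreamSummary): StreamSummary = {
    val summed = other.counters.foldLeft(counters) { case (acc, (key, count)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + count)
    }
    copy(counters = summed.toSeq.sortBy { case (key, count) => (-count, key) }
      .take(capacity).toMap)
  }

  // Highest counts first; ties are broken by identifier for stable output.
  def topK(k: Int): Seq[(String, Long)] =
    counters.toSeq.sortBy { case (key, count) => (-count, key) }.take(k)
}

object StreamSummary {
  // Hypothetical constructor; 400 matches the capacity quoted on the slide.
  def zero(capacity: Int = 400): StreamSummary = StreamSummary(capacity, Map.empty)
}
```

Every call to `addIdentifier` increases a count, so repeated delivery of the same event inflates the totals. Unlike the HyperLogLog and BloomFilter, this structure does nothing to remove duplicates.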

Page 26: Fast > Perfect: Practical real-time approximations using Spark Streaming

Trending: Transform

Page 27: Fast > Perfect: Practical real-time approximations using Spark Streaming

Trending: Storing

Page 28: Fast > Perfect: Practical real-time approximations using Spark Streaming

Trending: Adding Up

Page 29: Fast > Perfect: Practical real-time approximations using Spark Streaming

Trending: Result

• Constant storage usage for one day of data using a Top400 StreamSummary: 288 * 21.9KB = 6307KB

• StreamSummary will not eliminate duplicates

• Time bucketing ensures delayed data is counted correctly

• Difference of <2% between StreamSummary trending items and the real trending items

Page 30: Fast > Perfect: Practical real-time approximations using Spark Streaming

Questions?

Kevin Schmidt @kevinschmidtbiz

Luis Vicente @lvicentesanchez

https://github.com/lvicentesanchez/fast-gt-perfect