Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson
Erik Erlandson
Sketching Data With T-Digest in Apache Spark
Red Hat, Inc.
Introduction
Erik Erlandson
Software Engineer at Red Hat, Inc.
Emerging Technologies Group
Internal Data Science
Insightful Applications
Why Sketching?
● Faster
● Smaller
● Essential Features
We All Sketch Data
Data: 3.4, 6.0, 2.5, ⋮
Mean = 3.97
Variance = 3.30
Data:
3.4, 5.0, 9.0
6.0, 2.1, 7.7
2.5, 4.4, 3.2
⋮
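The mean and variance on this slide are themselves a tiny sketch: three numbers summarize the whole column and can be updated one value at a time. A minimal Scala illustration, assuming Welford's running update and the sample variance (dividing by n - 1), which reproduces the slide's figures for 3.4, 6.0, 2.5. The names `MomentSketchExample` and `Moments` are hypothetical and not taken from the talk.

```scala
object MomentSketchExample {
  // Running summary of a stream: count, mean, and sum of squared deviations.
  final case class Moments(n: Long, mean: Double, m2: Double) {
    // Welford's update: fold one more value into the summary.
    def add(x: Double): Moments = {
      val n1 = n + 1
      val d = x - mean
      val mean1 = mean + d / n1
      Moments(n1, mean1, m2 + d * (x - mean1))
    }

    // Sample variance (divides by n - 1); NaN for fewer than two values.
    def variance: Double = if (n > 1) m2 / (n - 1) else Double.NaN
  }

  val empty: Moments = Moments(0L, 0.0, 0.0)

  // The column of values shown on the slide.
  val sketch: Moments = Seq(3.4, 6.0, 2.5).foldLeft(empty)((s, x) => s.add(x))
}
```

Rounded to two decimals, `MomentSketchExample.sketch.mean` and `MomentSketchExample.sketch.variance` match the slide's 3.97 and 3.30.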
T-Digest
• Computing Extremely Accurate Quantiles Using t-Digests
• Ted Dunning & Otmar Ertl
• https://github.com/tdunning/t-digest
• Implementations in Java, Python, R, JS, C++ and Scala
What is T-Digest Sketching?
Data: 3.4, 6.0, 2.5, ⋮
or weighted data: (3.4, 3), (6.0, 2), (2.5, 8), ⋮
Sketch of CDF
[Plot: P(X <= x) against X, over the data domain]
Incremental Updates
Current T-Digest + (x, w) = Updated T-Digest
Large or Streaming Data
Compact “Running” Sketch
The Payoff
REST Service
Query Latencies
What does my latency distribution look like?
I want to simulate my latencies!
Are 90% of my latencies under 1 second?
Representation
Clusters: (location, mass) = (x, m)
Distribution CDF
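To connect the cluster representation to the latency questions on "The Payoff" slide, here is a rough Scala sketch. It assumes a deliberately coarse step-function CDF (the mass at or below x divided by the total mass); real t-digest implementations interpolate between clusters. `ClusterCdfExample`, `Cluster`, and the `latencies` values are hypothetical illustrations, not code or data from the talk.

```scala
object ClusterCdfExample {
  // A cluster is a (location, mass) pair.
  final case class Cluster(x: Double, m: Double)

  // Step-function CDF estimate: fraction of total mass at or below x.
  // Assumes a non-empty cluster list.
  def cdf(clusters: Seq[Cluster], x: Double): Double = {
    val total = clusters.map(_.m).sum
    clusters.filter(_.x <= x).map(_.m).sum / total
  }

  // Smallest cluster location whose cumulative mass fraction reaches q.
  // Assumes a non-empty cluster list.
  def quantile(clusters: Seq[Cluster], q: Double): Double = {
    val sorted = clusters.sortBy(_.x)
    val total = sorted.map(_.m).sum
    val cumulative = sorted.scanLeft(0.0)((acc, c) => acc + c.m).tail
    sorted.zip(cumulative)
      .find { case (_, c) => c >= q * total }
      .map(_._1.x)
      .getOrElse(sorted.last.x)
  }

  // Made-up latency sketch, in seconds.
  val latencies: Seq[Cluster] = Seq(
    Cluster(0.1, 40.0), Cluster(0.3, 30.0), Cluster(0.7, 20.0),
    Cluster(1.5, 8.0), Cluster(4.0, 2.0))
}
```

With this made-up sketch, "Are 90% of my latencies under 1 second?" becomes a check on `cdf(latencies, 1.0)`, and `quantile(latencies, 0.5)` gives a median estimate.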
Update(x, m)
Nearest Cluster
Update location
Increment Mass
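The three steps of this slide can be sketched in a few lines of Scala. This is an assumption-laden illustration, not the talk's implementation: it uses a linear scan for the nearest cluster, moves the location to the mass-weighted mean (the usual t-digest centroid update), and ignores the mass bounds introduced on the next slides. `NearestUpdateExample` and `Cluster` are hypothetical names.

```scala
object NearestUpdateExample {
  final case class Cluster(x: Double, m: Double)

  // Merge (x, m) into the nearest cluster; with no clusters yet, start one.
  def update(clusters: Vector[Cluster], x: Double, m: Double): Vector[Cluster] =
    if (clusters.isEmpty) Vector(Cluster(x, m))
    else {
      // Step 1: nearest cluster (linear scan here, for brevity).
      val i = clusters.indices.minBy(j => math.abs(clusters(j).x - x))
      val c = clusters(i)
      // Step 3: increment the mass.
      val mass = c.m + m
      // Step 2: the new location is the mass-weighted mean of old location and x.
      clusters.updated(i, Cluster((c.x * c.m + x * m) / mass, mass))
    }
}
```

For example, adding (6.0, 1.0) to clusters (1.0, 1.0) and (5.0, 3.0) moves the second cluster to (5.25, 4.0) and leaves the first untouched.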
Cluster Mass Bounds
Quantiles q(x), from q=0 to q=1
M = Σ(masses)
C = compression
B(x) = C∙M∙q(x)∙(1-q(x))
Maximum bound: C∙M/4, at q = 1/2
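A small Scala sketch of the bound formula shows its shape: zero at the extremes and largest, C∙M/4, in the middle. The values C = 0.5 and M = 100 are arbitrary choices for illustration only, and `MassBoundExample` is a hypothetical name.

```scala
object MassBoundExample {
  // B = C * M * q * (1 - q): the most mass a cluster at quantile q may hold.
  def bound(c: Double, totalMass: Double, q: Double): Double =
    c * totalMass * q * (1.0 - q)

  // Bounds at a few quantiles for the arbitrary choices C = 0.5, M = 100.
  val table: Seq[(Double, Double)] =
    Seq(0.0, 0.1, 0.5, 0.9, 1.0).map(q => (q, bound(0.5, 100.0, q)))
}
```

With these values the bound peaks at 12.5 for q = 0.5, which equals C∙M/4, and shrinks symmetrically toward both tails.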
Bounds Force New Clusters
Incoming (x,m), nearest cluster (xc,mc): mc + m?
mc + m > B(xc)!
(xc,mc) becomes (xu,B(xc))
New cluster: (x, (mc + m) - B(xc))
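The overflow rule on this slide might look like the following in Scala. Several details are assumptions rather than facts from the talk: q(xc) is taken as the mass to the left of the nearest cluster plus half its own mass, divided by M; M is the total mass before the update; and clusters live in a sorted `Vector` with a linear scan instead of a balanced tree. `BoundedUpdateExample` and `Cluster` are hypothetical names.

```scala
object BoundedUpdateExample {
  final case class Cluster(x: Double, m: Double)

  // B = C * M * q * (1 - q), as on the "Cluster Mass Bounds" slide.
  def bound(c: Double, total: Double, q: Double): Double =
    c * total * q * (1.0 - q)

  // clusters must be sorted by location; the result is sorted as well.
  def update(clusters: Vector[Cluster], x: Double, m: Double, c: Double): Vector[Cluster] =
    if (clusters.isEmpty) Vector(Cluster(x, m))
    else {
      val i = clusters.indices.minBy(j => math.abs(clusters(j).x - x))
      val near = clusters(i)
      val total = clusters.map(_.m).sum
      // Assumption: q(xc) = (mass left of the cluster + half its own mass) / M.
      val q = (clusters.take(i).map(_.m).sum + near.m / 2.0) / total
      // Mass the nearest cluster can still absorb before reaching its bound.
      val room = math.max(0.0, bound(c, total, q) - near.m)
      val absorbed = math.min(m, room)
      val overflow = m - absorbed
      val grown =
        if (absorbed > 0.0) {
          val mass = near.m + absorbed
          clusters.updated(i, Cluster((near.x * near.m + x * absorbed) / mass, mass))
        } else clusters
      // Any overflow mass becomes a new cluster at x.
      if (overflow > 0.0) (grown :+ Cluster(x, overflow)).sortBy(_.x) else grown
    }
}
```

When mc + m stays within B(xc) the cluster count is unchanged; otherwise the nearest cluster is topped up to exactly B(xc) and the remaining (mc + m) - B(xc) starts a new cluster at x.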
Resolution
Near q=0 and q=1: more small clusters
Near the middle: fewer large clusters
T-Digests are Monoidal
D1 ≡ C1, D2 ≡ C2
C1 ∪ C2 ⟹ D1 |+| D2
Monoidal => Map-Reduce
Data in Spark: partitions P1, P2, ..., Pn
Map: each partition to a t-digest
Reduce with |+|: result
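The map-reduce pattern on this slide can be tried without a cluster. The Scala sketch below uses plain collections in place of Spark partitions, and a toy stand-in for a t-digest that keeps exact mass per distinct value, so that |+| is simply a sum of masses. All names (`MapReduceSketchExample`, `Sketch`, `add`, `combine`, `sketchAll`) are hypothetical; on a Spark RDD the same two functions would typically be handed to an aggregate-style operation.

```scala
object MapReduceSketchExample {
  // Toy stand-in for a t-digest: exact mass per distinct location.
  type Sketch = Map[Double, Double]
  val empty: Sketch = Map.empty

  // Fold one unit-mass value into a sketch.
  def add(s: Sketch, x: Double): Sketch =
    s.updated(x, s.getOrElse(x, 0.0) + 1.0)

  // |+| : union of locations, summing the masses of shared locations.
  def combine(a: Sketch, b: Sketch): Sketch =
    b.foldLeft(a) { case (acc, (x, m)) => acc.updated(x, acc.getOrElse(x, 0.0) + m) }

  // Map: sketch each partition independently. Reduce: combine with |+|.
  def sketchAll(partitions: Seq[Seq[Double]]): Sketch =
    partitions
      .map(p => p.foldLeft(empty)((s, x) => add(s, x)))
      .foldLeft(empty)((a, b) => combine(a, b))
}
```

Because `combine` is associative and `empty` is its identity, the result does not depend on how the data is split into partitions, which is the property the slide relies on.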
|+| - Randomized Order
[Diagram: numbered clusters of D1 and D2 combined into D1 |+| D2]
|+| - Merged Order
[Diagram: numbered clusters of D1 and D2 combined into D1 |+| D2]
|+| - Large to Small
[Diagram: numbered clusters of D1 and D2 combined into D1 |+| D2]
Comparing |+| Definitions
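The three |+| variants named on the preceding slides differ in the order in which the clusters of D1 and D2 are re-inserted. A hedged Scala sketch of those orderings follows. It assumes that "Merged Order" means sorted by location and "Large to Small" means decreasing mass, and it leaves the insertion step as a caller-supplied function; none of these names come from the talk.

```scala
object MergeOrderExample {
  final case class Cluster(x: Double, m: Double)

  // Randomized order: shuffle the union of both cluster sets.
  def randomized(d1: Vector[Cluster], d2: Vector[Cluster], seed: Long): Vector[Cluster] =
    new scala.util.Random(seed).shuffle(d1 ++ d2)

  // Merged order (assumed here to mean: sorted by location).
  def mergedOrder(d1: Vector[Cluster], d2: Vector[Cluster]): Vector[Cluster] =
    (d1 ++ d2).sortBy(_.x)

  // Large to small (assumed here to mean: decreasing mass).
  def largeToSmall(d1: Vector[Cluster], d2: Vector[Cluster]): Vector[Cluster] =
    (d1 ++ d2).sortBy(c => -c.m)

  // Build D1 |+| D2 by inserting clusters one at a time in the given order.
  // `insert` is whatever cluster-update rule the digest uses.
  def combine(order: Seq[Cluster])(
      insert: (Vector[Cluster], Cluster) => Vector[Cluster]): Vector[Cluster] =
    order.foldLeft(Vector.empty[Cluster])(insert)
}
```

Passing each ordering to `combine` with the same `insert` rule gives a simple way to compare the definitions against each other.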
Algorithmic Considerations
• Clusters maintained in sorted order by location
• Clusters frequently inserted / deleted / updated
• Query the cluster nearest to an incoming (x,m)
• Given (x,m), query the prefix-sum of cluster mass
– Σ(m’), over all clusters (x’,m’) where x’ <= x
• Do it all in logarithmic time!
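The two queries in this list can be illustrated with binary search over sorted arrays. This Scala sketch is a simplification: lookups are logarithmic, but inserting a cluster into a `Vector` of running totals is linear, which is exactly why a balanced tree that stores subtree mass sums is needed to make every operation logarithmic. `SortedQueryExample` and its functions are hypothetical names.

```scala
object SortedQueryExample {
  // Index of the first location strictly greater than x (binary search).
  def upperBound(xs: Vector[Double], x: Double): Int = {
    var lo = 0
    var hi = xs.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (xs(mid) <= x) lo = mid + 1 else hi = mid
    }
    lo
  }

  // Index of the location nearest to x; xs must be sorted and non-empty.
  def nearest(xs: Vector[Double], x: Double): Int = {
    val i = upperBound(xs, x)
    if (i == 0) 0
    else if (i == xs.length) xs.length - 1
    else if (x - xs(i - 1) <= xs(i) - x) i - 1
    else i
  }

  // Sum of masses over clusters with location <= x, given running totals
  // where cumulative(k) is the mass of clusters 0..k inclusive.
  def prefixSum(xs: Vector[Double], cumulative: Vector[Double], x: Double): Double = {
    val i = upperBound(xs, x)
    if (i == 0) 0.0 else cumulative(i - 1)
  }
}
```

The running totals can be built once with `masses.scanLeft(0.0)(_ + _).tail`; a tree-backed structure would instead maintain the sums incrementally as clusters are inserted or updated.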
Backed By Balanced Tree
Scala Considerations
• Immutable Red/Black Tree
• Extends Map and MapLike
• Capabilities are Mixable Traits
– Red/Black
– Ordered
– Incrementable-Values
– Nearest-Neighbor
– Prefix-Sum
• Interface to Algebird Monoids & Aggregators
Discrete Distributions
if (tdigest.clusters.size <= max_discrete) {
  // few clusters so far: increment the mass at x by m (or insert a new cluster)
  tdigest.clusters.increment(x, m)
} else {
  // otherwise do the full t-digest cluster updating algorithm
  tdigest.update(x, m)
}
Experimental
Applications
• Quantile Estimation
• Feature Data Characterization
• Building CoDecs
• Value-At-Risk Modeling
• Generative Data Models
Demo
Thank You
[email protected]
@manyangled
https://github.com/isarn/isarn-sketches