Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik Erlandson

Erik Erlandson

Sketching Data With T-Digest in Apache Spark

Red Hat, Inc.

Introduction
Erik Erlandson
Software Engineer at Red Hat, Inc.
Emerging Technologies Group
Internal Data Science
Insightful Applications

Why Sketching?

● Faster
● Smaller
● Essential Features

We All Sketch Data

Data: 3.4, 6.0, 2.5, ⋮
Sketch: Mean = 3.97, Variance = 3.30

Rows: (3.4, 5.0, 9.0), (6.0, 2.1, 7.7), (2.5, 4.4, 3.2), ⋮
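The mean and variance on this slide are already sketches: they summarize a stream in constant space and can be updated one value at a time. A minimal Python sketch of that idea, using Welford's online update. The class name `RunningMoments` is hypothetical, and sample variance (dividing by n - 1) is assumed because it reproduces the slide's numbers for 3.4, 6.0, 2.5.

```python
class RunningMoments:
    """Constant-space sketch of a stream: count, mean, sum of squared deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's online update: no past values are stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Sample variance (assumed, since it matches the slide).
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


sketch = RunningMoments()
for value in [3.4, 6.0, 2.5]:
    sketch.update(value)

print(round(sketch.mean, 2), round(sketch.variance(), 2))
```

A t-digest follows the same pattern, but keeps enough state to approximate the whole distribution rather than two moments.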

T-Digest
• Computing Extremely Accurate Quantiles Using t-Digests
• Ted Dunning & Otmar Ertl
• https://github.com/tdunning/t-digest
• Implementations in Java, Python, R, JS, C++ and Scala

What is T-Digest Sketching?

Data: 3.4, 6.0, 2.5, ⋮

Sketch: (3.4, 3), (6.0, 2), (2.5, 8), ⋮

or

Sketch of CDF

[Figure: CDF sketch; vertical axis P(X <= x), horizontal axis X (data domain)]

Incremental Updates

Current T-Digest + (x, w) = Updated T-Digest

Large or Streaming Data

Compact “Running” Sketch

The Payoff

REST Service

Query Latencies

What does my latency distribution look like?

I want to simulate my latencies!

Are 90% of my latencies under 1 second?
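All three questions can be answered from the clusters alone. The following is a rough Python sketch, not the library's method: the cluster values are made up, and the CDF is assumed to be piecewise linear through the cluster locations, with half of each cluster's mass counted below its location.

```python
import bisect
import random

# Hypothetical latency sketch: (location in seconds, mass), sorted by location.
clusters = [(0.1, 20.0), (0.3, 40.0), (0.6, 25.0), (0.9, 10.0), (2.0, 5.0)]


def cdf_points(clusters):
    """Return (xs, qs): cluster locations and the quantile assigned to each."""
    total = sum(m for _, m in clusters)
    xs, qs, below = [], [], 0.0
    for x, m in clusters:
        xs.append(x)
        qs.append((below + m / 2.0) / total)  # assumption: half the mass lies below x
        below += m
    return xs, qs


def interpolate(v, src, dst, lo, hi):
    """Piecewise-linear map from src to dst, clamped to lo/hi outside src."""
    if v < src[0]:
        return lo
    if v > src[-1]:
        return hi
    j = min(bisect.bisect_right(src, v), len(src) - 1)
    t = (v - src[j - 1]) / (src[j] - src[j - 1])
    return dst[j - 1] + t * (dst[j] - dst[j - 1])


def cdf(clusters, x):
    xs, qs = cdf_points(clusters)
    return interpolate(x, xs, qs, 0.0, 1.0)


def quantile(clusters, q):
    xs, qs = cdf_points(clusters)
    return interpolate(q, qs, xs, xs[0], xs[-1])


def sample(clusters, rng):
    # Inverse-transform sampling: push a uniform draw through the inverse CDF.
    return quantile(clusters, rng.random())


rng = random.Random(7)
print("P(latency <= 1s) ~", round(cdf(clusters, 1.0), 3))
print("90% under 1 second?", cdf(clusters, 1.0) >= 0.9)
print("simulated latencies:", [round(sample(clusters, rng), 2) for _ in range(3)])
```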

Representation

clusters

Distribution CDF

(location, mass) = (x, m)

Update (x, m)

Find the nearest cluster

Update its location

Increment its mass
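A small Python sketch of this update step. The slide says only that the location is updated and the mass incremented; the mass-weighted mean used for the new location is an assumption here, and `update_nearest` is a hypothetical helper name. The linear scan keeps the example short.

```python
def update_nearest(clusters, x, m):
    """Fold (x, m) into the nearest cluster of a non-empty list of [location, mass]."""
    i = min(range(len(clusters)), key=lambda k: abs(clusters[k][0] - x))
    xc, mc = clusters[i]
    # Assumed location rule: mass-weighted mean of the old location and the new point.
    clusters[i] = [(xc * mc + x * m) / (mc + m), mc + m]
    return i


# Clusters from the earlier slide, sorted by location.
clusters = [[2.5, 8.0], [3.4, 3.0], [6.0, 2.0]]
i = update_nearest(clusters, 3.0, 1.0)
print(i, clusters[i])
```

Because the new point is closer to its cluster than to any other, the moved location stays between the neighbouring clusters and the sorted order is preserved.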

Cluster Mass Bounds

Quantiles q(x), from q=0 to q=1

M = Σ(masses)

C = compression

B(x) = C∙M∙q(x)∙(1-q(x))

Maximum bound: C∙M/4, at q(x) = 1/2
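The bound is easy to tabulate. In this Python sketch the values chosen for M and C are hypothetical; only the formula comes from the slide.

```python
def mass_bound(q, total_mass, compression):
    """B = C * M * q * (1 - q): the slide's upper bound on a cluster's mass."""
    return compression * total_mass * q * (1.0 - q)


M = 1000.0  # hypothetical total mass
C = 0.1     # hypothetical compression setting
for q in (0.001, 0.01, 0.1, 0.5):
    print(q, mass_bound(q, M, C))
```

The bound shrinks toward zero at both tails and peaks at C∙M/4 in the middle, so clusters near q=0 and q=1 are forced to stay small while clusters near the median may grow large.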

Bounds Force New Clusters

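Incoming (x,m), nearest cluster (xc,mc): does mc + m fit under the bound?

If mc + m > B(xc): (xc,mc) becomes (xu, B(xc)), and the excess (x, (mc + m) - B(xc)) is treated as a new (x,m)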
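A Python sketch of this split, under stated assumptions: `absorb` is a hypothetical helper, the bound is passed in as a plain number, and the updated location xu is taken to be the mass-weighted mean over the absorbed mass only.

```python
def absorb(cluster, x, m, bound):
    """Add (x, m) to `cluster` without exceeding `bound`.

    Returns (updated_cluster, remainder); remainder is None, or an
    (x, mass) pair that still has to be inserted.
    """
    xc, mc = cluster
    room = max(0.0, bound - mc)  # mass the cluster can still accept
    dm = min(m, room)            # mass actually absorbed
    rm = m - dm                  # excess mass
    if dm > 0.0:
        # Assumed location rule: weighted mean over the absorbed mass only.
        xc = (xc * mc + x * dm) / (mc + dm)
    remainder = (x, rm) if rm > 0.0 else None
    return (xc, mc + dm), remainder


# Cluster (3.4, 3) receives (4.0, 2) under a bound of 4: one unit is left over.
updated, remainder = absorb((3.4, 3.0), 4.0, 2.0, bound=4.0)
print(updated, remainder)
```

The leftover mass equals (mc + m) - B(xc), so no mass is lost in the split.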

Resolution

q=0 q=1

More small clusters

Fewer Large Clusters

T-Digests are Monoidal

D1 ≡ C1
D2 ≡ C2

C1 ∪ C2 ⟹ D1 |+| D2

Monoidal => Map-Reduce

[Figure: data partitions P1, P2, …, Pn in Spark are each mapped to a t-digest; the t-digests are combined with |+| into the result]
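The map-reduce pattern can be imitated without Spark. The Python sketch below is a toy, not the isarn implementation: `MiniDigest`, `combine`, and `sketch` are hypothetical names, the weighted-mean location rule is an assumption, and every update costs linear time because total and prefix masses are recomputed with `sum`. It uses the mass bound C∙M∙q∙(1-q) and defines |+| as re-inserting the union of both cluster sets.

```python
import bisect
import random
from functools import reduce


class MiniDigest:
    """Toy cluster sketch in the spirit of a t-digest (illustrative only)."""

    def __init__(self, compression=0.2):
        self.c = compression
        self.xs = []  # cluster locations, kept sorted
        self.ms = []  # cluster masses, parallel to xs

    def update(self, x, m=1.0):
        if self.xs:
            # Nearest cluster: compare the neighbours on either side of x.
            j = bisect.bisect_left(self.xs, x)
            near = [k for k in (j - 1, j) if 0 <= k < len(self.xs)]
            i = min(near, key=lambda k: abs(self.xs[k] - x))
            total = sum(self.ms) + m
            q = (sum(self.ms[:i]) + self.ms[i] / 2.0) / total
            bound = self.c * total * q * (1.0 - q)
            dm = min(m, max(0.0, bound - self.ms[i]))  # mass the cluster absorbs
            if dm > 0.0:
                self.xs[i] = (self.xs[i] * self.ms[i] + x * dm) / (self.ms[i] + dm)
                self.ms[i] += dm
            m -= dm
        if m > 0.0:
            # Leftover mass (or the very first point) becomes a new cluster.
            j = bisect.bisect_left(self.xs, x)
            self.xs.insert(j, x)
            self.ms.insert(j, m)
        return self


def combine(d1, d2):
    """D1 |+| D2: re-insert the union of both cluster sets into a fresh digest."""
    out = MiniDigest(d1.c)
    union = list(zip(d1.xs + d2.xs, d1.ms + d2.ms))
    # Insertion order is a design choice; large-to-small is used here as one option.
    for x, m in sorted(union, key=lambda c: -c[1]):
        out.update(x, m)
    return out


def sketch(partition):
    """Map step: one digest per partition."""
    return reduce(lambda d, x: d.update(x), partition, MiniDigest())


rng = random.Random(42)
partitions = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
result = reduce(combine, map(sketch, partitions), MiniDigest())
print(len(result.xs), "clusters summarize", round(sum(result.ms)), "points")
```

Only the per-partition digests travel to the reducer, which is what makes the monoid structure a good fit for distributed data.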

|+| - Randomized Order

[Figure: clusters of D1 and D2 inserted into D1 |+| D2 in randomized order]

|+| - Merged Order

[Figure: clusters of D1 and D2 inserted into D1 |+| D2 in merged order]

|+| - Large to Small

[Figure: clusters of D1 and D2 inserted into D1 |+| D2 from large to small]

Comparing |+| Definitions
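The three candidate definitions differ only in the order in which the union of clusters is re-inserted. A Python sketch of the orderings, with hypothetical function names. Reading "merged order" as sorted by location, and "large to small" as descending mass, is an interpretation of the slide titles.

```python
import random


def randomized_order(c1, c2, rng):
    union = c1 + c2  # new list, inputs are left untouched
    rng.shuffle(union)
    return union


def merged_order(c1, c2):
    # Assumed meaning: interleave both cluster lists by location.
    return sorted(c1 + c2, key=lambda c: c[0])


def large_to_small(c1, c2):
    # Assumed meaning: heaviest clusters first.
    return sorted(c1 + c2, key=lambda c: c[1], reverse=True)


c1 = [(1.0, 2.0), (5.0, 7.0)]  # (location, mass)
c2 = [(3.0, 4.0)]
print(merged_order(c1, c2))
print(large_to_small(c1, c2))
print(randomized_order(c1, c2, random.Random(0)))
```

Each ordering feeds the same clusters through the same update rule, so total mass is identical; what changes is how the resulting clusters approximate the combined distribution.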

Algorithmic Considerations
• Clusters maintained in sorted order by location
• Clusters frequently inserted / deleted / updated
• Query the cluster nearest to an incoming (x,m)
• Given (x,m), query the prefix-sum of cluster mass
  – Σ(m’), over all clusters (x’,m’) where x’ <= x
• Do it all in logarithmic time!

https://en.wikipedia.org/wiki/Prefix_sum
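The two queries can be illustrated on a static snapshot with binary search. This Python sketch swaps in sorted arrays for the balanced tree: lookups are logarithmic, but inserts and mass updates would cost linear time, which is exactly what a tree carrying per-node mass sums avoids. The function names are hypothetical.

```python
import bisect
from itertools import accumulate

locations = [2.5, 3.4, 6.0]            # sorted cluster locations
masses = [8.0, 3.0, 2.0]               # parallel cluster masses
cumulative = list(accumulate(masses))  # cumulative[i] = mass of clusters 0..i


def nearest(x):
    """Index of the cluster nearest to x, via binary search."""
    j = bisect.bisect_left(locations, x)
    candidates = [k for k in (j - 1, j) if 0 <= k < len(locations)]
    return min(candidates, key=lambda k: abs(locations[k] - x))


def prefix_mass(x):
    """Sum of m' over all clusters (x', m') with x' <= x."""
    j = bisect.bisect_right(locations, x)
    return cumulative[j - 1] if j > 0 else 0.0


print(nearest(3.0), prefix_mass(3.4))
```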

Backed By Balanced Tree

Scala Considerations
• Immutable Red/Black Tree
• Extends Map and MapLike
• Capabilities are Mixable Traits
  – Red/Black
  – Ordered
  – Incrementable-Values
  – Nearest-Neighbor
  – Prefix-Sum
• Interface to Algebird Monoids & Aggregators

http://erikerlandson.github.io/blog/2015/09/26/a-library-of-binary-tree-algorithms-as-mixable-scala-traits/

https://github.com/isarn/isarn-sketches-algebird-api

Discrete Distributions

if (tdigest.clusters.size <= max_discrete) {
  // increment by m (or insert new)
  tdigest.clusters.increment(x, m)
} else {
  // do full t-digest cluster updating algorithm
  tdigest.update(x, m)
}
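A runnable Python rendering of the same branch, as a sketch only. Here `clusters` is a plain dict from location to mass, and `full_update` is a hypothetical stand-in for the full cluster-updating algorithm.

```python
def update(clusters, x, m, max_discrete, full_update):
    """Mirror of the slide's branch; `clusters` maps location -> mass."""
    if len(clusters) <= max_discrete:
        # Few distinct values so far: keep an exact mass per value.
        clusters[x] = clusters.get(x, 0.0) + m
    else:
        # Too many distinct values: hand over to the full clustering update.
        full_update(clusters, x, m)


overflow = []  # records calls that reached the fallback
clusters = {}
for x in [1, 2, 2, 3, 3, 3]:
    update(clusters, x, 1.0, max_discrete=5,
           full_update=lambda c, x, m: overflow.append((x, m)))
print(clusters, overflow)
```

While the number of distinct values stays small, every mass is exact, so a discrete distribution is represented without approximation.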

Experimental

Applications

• Quantile Estimation
• Feature Data Characterization
• Building CoDecs
• Value-At-Risk Modeling
• Generative Data Models

https://github.com/willb/var-notebook/blob/master/var-notebook/var-pdfs.ipynb

http://erikerlandson.github.io/blog/2016/05/05/random-forest-clustering-of-machine-package-configurations/

Thank You!
@manyangled
https://github.com/isarn/isarn-sketches
