Transcript of STRATIFIED SAMPLING MEETS MACHINE LEARNING · 2020-04-07
STRATIFIED SAMPLING MEETS MACHINE LEARNING · Kevin Lang, Konstantin Shmakov, Edo Liberty
Apps with Flurry SDK
Examples:
• Number of events of a certain type
• Number of unique users
• Number of unique users on a specific day
• Total time spent in a certain geo
• Average $ spent by age

Each of these is a sum Σ_i q(u_i) over the records u_i, for some query function q(·).
SAMPLING

Challenges:
1. The data is very large. Computing Σ_i q(u_i) exactly is too costly.
2. The function q(·) is user specified and completely unconstrained.

Good News: An approximate answer is acceptable (if the error is small).
Solution: Estimate the answer on a random subset of the records.
NOTATIONS

• q_i := q(u_i), for brevity
• y := Σ_i q_i, the exact answer for the query q
• p_i, the probability of choosing record i
• S, the set of sampled records, each chosen with probability p_i
• ŷ := Σ_{i∈S} q_i / p_i, the Horvitz-Thompson estimator for y
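The estimator above can be sketched in a few lines (a minimal illustration; the names `ht_estimate`, `values`, and `probs` are ours, not from the slides):

```python
import random

def ht_estimate(values, probs, seed=0):
    """Horvitz-Thompson estimate of sum(values).

    Record i is kept independently with probability probs[i];
    each kept record is up-weighted by 1/probs[i]."""
    rng = random.Random(seed)
    return sum(q / p for q, p in zip(values, probs) if rng.random() < p)

# Unbiasedness check: averaging many independent estimates
# approaches the true sum y.
values = [1.0] * 1000
probs = [0.1] * 1000
true_sum = sum(values)  # y = 1000.0
avg = sum(ht_estimate(values, probs, seed=s) for s in range(2000)) / 2000
```

With p_i = 1 for every record the estimate is exact; with smaller p_i it is unbiased but noisy, which is what the variance bounds on the next slide quantify.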
PROPERTIES

• The Horvitz-Thompson estimator is unbiased: E[ŷ − y] = 0
• Its standard deviation isn't large: σ[ŷ − y] ≤ y √(1/(ζ · card(q))),
  where card(q) := Σ_i |q_i| / max_i |q_i| and ζ := min_i p_i
• The probability of a large error is small: Pr[|ŷ − y| ≥ εy] ≤ e^(−O(ε² ζ · card(q)))
• If card(q) ∼ Ω(n), then |S| ∼ 1/ε² suffices

(Olken, Rotem and Hellerstein 1986 and 1990) application to databases
(Acharya, Gibbons, Poosala 2000) uniform sampling is best in the worst case
STRATIFIED SAMPLING

• Sample = 100,000 US individuals. Query = Republicans vs. Democrats in American Samoa?
• American Samoa is 0.02% of the population, so only ~20 individuals in the sample are from Samoa. The survey error is very large!
• If card(q) is small, |S| must be large.
• Fix: sample different strata (e.g. US territories) with different probabilities. (Neyman, Jerzy 1934)
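The Samoa example can be reproduced numerically (a hedged sketch; the population layout and stratum names below are illustrative, not data from the talk):

```python
import random

def stratified_sample(records, strata, rates, seed=0):
    """Keep each record with its stratum's sampling rate; return
    (record, weight) pairs with weight = 1/rate, so that
    Horvitz-Thompson sums over the sample stay unbiased."""
    rng = random.Random(seed)
    return [(r, 1.0 / rates[s])
            for r, s in zip(records, strata) if rng.random() < rates[s]]

# Illustrative population: 0.02% of 100,000 records are in the rare stratum.
records = list(range(100_000))
strata = ["samoa"] * 20 + ["mainland"] * 99_980

uniform = stratified_sample(records, strata, {"samoa": 0.01, "mainland": 0.01})
boosted = stratified_sample(records, strata, {"samoa": 1.0, "mainland": 0.01})

n_rare_uniform = sum(1 for r, _ in uniform if r < 20)
n_rare_boosted = sum(1 for r, _ in boosted if r < 20)
```

Uniform 1% sampling keeps the rare stratum almost never (expected 0.2 records), while boosting its rate to 1.0 keeps all 20, at a negligible cost to the overall budget.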
DBLP EXAMPLE

Choosing the right strata is hard!
• 2,101,151 papers
• 1000 most populous venues
• Query examples:
  – title contains “learning” and # authors <= 3
  – title contains “mechanism” and year > 2004
What is the right stratification here?
• Stratifying by venue made things worse!
• Stratifying by year was better, but still worse than uniform sampling.
SAMPLING, STRATIFICATION, AND DATABASES

• Acharya, Gibbons, Poosala 2000 (Congressional Sampling): design strata that minimize worst-case variance over possible queries; combine stratified and uniform sampling.
  Important idea: consider past queries to the database!
• Chaudhuri, Das and Narasayya 2007: optimize for the query log; each stratum is a set of records that agree on all queries.
• Joshi, Jermaine 2008: split into two strata per query; take linear combinations of stratified probabilities based on record features.
OUR APPROACH

• Assume queries are drawn from an unknown distribution Q
• Use the query log Q as a “training set” (assumed drawn i.i.d. w.r.t. Q)
• Allow each record i to be sampled with a different probability p_i
• Minimize the Risk E[(ŷ − y)²]
• This translates to minimizing E_{q∼Q} Σ_i q_i² (1/p_i − 1)
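The step from the MSE to the sum above is the standard variance computation for independent (Poisson) sampling: since ŷ is unbiased, the risk equals the variance, and each record contributes independently:

```latex
\mathbb{E}\left[(\hat y - y)^2\right]
  = \operatorname{Var}(\hat y)
  = \sum_i \operatorname{Var}\!\left(\frac{q_i\,\mathbf{1}[i \in S]}{p_i}\right)
  = \sum_i q_i^2\,\frac{1 - p_i}{p_i}
  = \sum_i q_i^2 \left(\frac{1}{p_i} - 1\right)
```

Taking the expectation over q ∼ Q then yields the stated risk E_{q∼Q} Σ_i q_i² (1/p_i − 1).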
OUR APPROACH

• ERM: Minimize Σ_{q∈Q} Σ_i q_i² (1/p_i − 1) over the query log Q
• Sampling budget (B): Σ_i p_i c_i ≤ B, where B ≪ Σ_i c_i
• Regularization (ζ): ∀i, p_i ∈ [ζ, 1], where ζ ≤ B / Σ_i c_i
OUR APPROACH

• Solve with Lagrange multipliers:

  max_{α,β,γ} [ (1/|Q|) Σ_{q∈Q} Σ_i q_i² (1/p_i − 1) − Σ_i α_i (p_i − ζ) − Σ_i β_i (1 − p_i) − γ (B − Σ_i p_i c_i) ]

• By KKT conditions, for each record i:

  p_i ∝ √( (1/c_i) · (1/|Q|) Σ_{q∈Q} q_i² )   or   p_i = ζ   or   p_i = 1
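The KKT characterization above suggests a simple procedure (a hedged sketch of the closed form, not the authors' exact algorithm): set p_i proportional to √(avg_q q_i² / c_i), clip into [ζ, 1], and bisect on the proportionality constant until the budget Σ_i p_i c_i ≈ B is met.

```python
import math

def erm_probs(qsq_avg, costs, budget, zeta, iters=50):
    """qsq_avg[i] = (1/|Q|) * sum over training queries of q_i**2.
    Returns p_i proportional to sqrt(qsq_avg[i] / costs[i]),
    clipped to [zeta, 1], scaled so sum(p_i * c_i) ~= budget."""
    raw = [math.sqrt(v / c) for v, c in zip(qsq_avg, costs)]
    lo, hi = 1e-12, 1e12  # geometric bisection over the scale factor
    for _ in range(iters):
        scale = math.sqrt(lo * hi)
        p = [min(1.0, max(zeta, scale * r)) for r in raw]
        spend = sum(pi * ci for pi, ci in zip(p, costs))
        if spend > budget:
            hi = scale
        else:
            lo = scale
    return p

# Records that are heavy in past queries get high probabilities;
# records never touched by the query log fall to the floor zeta.
p = erm_probs(qsq_avg=[100.0, 1.0, 0.0], costs=[1.0, 1.0, 1.0],
              budget=1.5, zeta=0.01)
```

With unit costs and a budget of 1.5, the heavy record saturates at p = 1, the untouched record sits at the floor ζ = 0.01, and the middle record absorbs the remaining budget.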
OUR APPROACH

• Generalization: the algorithm's risk is close to the best possible risk:

  Risk(p) ≤ Risk(p*) · ( 1 + O( skew · √( log(n/δ) / |Q| ) ) )

  Here Risk(p) is the Alg’ Risk, Risk(p*) is the Best Risk, “skew” measures database badness, and |Q| is the training set size.
RESULTS
[Figure: DBLP dataset. Expected error vs. value of regularization parameter η ([weaker...] 0 → 1 [...stronger]). Curves: uniform sampling (p = 1/100) and ERM with 5000, 10000, 20000, and 40000 training queries.]
RESULTS

[Figure: Cube dataset. Expected error vs. value of regularization parameter η ([weaker...] 0 → 1 [...stronger]). Curves: uniform sampling (p = 1/10) and ERM with 50, 100, 200, 800, and 6400 training queries.]
RESULTS

[Figure: YAM+ dataset. Expected error vs. value of regularization parameter η ([weaker...] 0 → 1 [...stronger]). Curves: uniform sampling (p = 1/100) and ERM with 50, 100, 200, 400, and all training queries.]
RESULTS

[Figure: YAM+ dataset. Expected error (log scale) vs. numeric cardinality of the test query (1 to 10⁶, log scale), comparing regularized ERM with uniform sampling.]
RESULTS

[Figure: YAM+ dataset. Probability (rescaled) vs. average error, comparing the error distributions of uniform sampling and regularized ERM.]
RESULTS

[Figure: YAM+ dataset. Expected error (log scale) vs. “sampling rate” = budget / (total cost) (10⁻⁴ to 1, log scale), comparing uniform sampling and regularized ERM.]
RESULTS

[Figure: Cube dataset. Expected error (log scale) vs. numeric cardinality of the test query (1 to 10⁴, log scale), comparing ERM with uniform sampling.]