Transcript of STRATIFIED SAMPLING MEETS MACHINE LEARNING · 2020-04-07
STRATIFIED SAMPLING MEETS MACHINE LEARNING · Kevin Lang, Konstantin Shmakov, Edo Liberty
Apps with Flurry SDK
Examples:
• Number of events of a certain type
• Number of unique users
• Number of unique users on a specific day
• Total time spent in a certain geo
• Average $ spent by age

Each of these is a sum Σ_i q(u_i) over the records u_i, for some query function q(·).
SAMPLING

Challenges:
1. The data is very large. Computing Σ_i q(u_i) exactly is too costly.
2. The function q(·) is user specified and completely unconstrained.

Good News: An approximate answer is acceptable (if the error is small).
Solution: Estimate the answer on a random subset of the records.
NOTATIONS

• q_i := q(u_i), for brevity
• y := Σ_i q_i, the exact answer for the query q
• p_i, the probability of choosing record i
• S, the set of sampled records, each chosen with probability p_i
• ŷ := Σ_{i∈S} q_i / p_i, the Horvitz-Thompson estimator for y
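The estimator above can be sketched in a few lines (a minimal illustration; the names `ht_estimate`, `values`, and `probs` are ours, not from the slides):

```python
import random

def ht_estimate(values, probs, seed=0):
    """Horvitz-Thompson estimate of sum(values).

    Record i is kept independently with probability probs[i];
    each kept record is up-weighted by 1/probs[i]."""
    rng = random.Random(seed)
    return sum(q / p for q, p in zip(values, probs) if rng.random() < p)

# Unbiasedness check: averaging many independent estimates
# approaches the true sum y.
values = [1.0] * 1000
probs = [0.1] * 1000
true_sum = sum(values)  # y = 1000.0
avg = sum(ht_estimate(values, probs, seed=s) for s in range(2000)) / 2000
```

With p_i = 1 for every record the estimate is exact; with smaller p_i it is unbiased but noisy, which is what the variance bounds on the next slide quantify.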
PROPERTIES

• The Horvitz-Thompson estimator is unbiased: E[ŷ − y] = 0
• Its standard deviation isn't large: σ[ŷ − y] ≤ y √(1/(ζ · card(q))),
  where card(q) := Σ_i |q_i| / max_i |q_i| and ζ := min_i p_i
• The probability of a large error is small: Pr[|ŷ − y| ≥ εy] ≤ e^(−O(ε² ζ · card(q)))
• If card(q) ∼ Ω(n), then |S| ∼ 1/ε² suffices

(Olken, Rotem and Hellerstein 1986 and 1990) application to databases
(Acharya, Gibbons, Poosala 2000) uniform sampling is best in the worst case
STRATIFIED SAMPLING

• Sample = 100,000 US individuals. Query = Republicans vs. Democrats in American Samoa?
• American Samoa is 0.02% of the population, so only ~20 individuals in the sample are from Samoa. The survey error is very large!
• If card(q) is small, |S| must be large.
• Fix: sample different strata (e.g. US territories) with different probabilities. (Neyman, Jerzy 1934)
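The Samoa example can be reproduced numerically (a hedged sketch; the population layout and stratum names below are illustrative, not data from the talk):

```python
import random

def stratified_sample(records, strata, rates, seed=0):
    """Keep each record with its stratum's sampling rate; return
    (record, weight) pairs with weight = 1/rate, so that
    Horvitz-Thompson sums over the sample stay unbiased."""
    rng = random.Random(seed)
    return [(r, 1.0 / rates[s])
            for r, s in zip(records, strata) if rng.random() < rates[s]]

# Illustrative population: 0.02% of 100,000 records are in the rare stratum.
records = list(range(100_000))
strata = ["samoa"] * 20 + ["mainland"] * 99_980

uniform = stratified_sample(records, strata, {"samoa": 0.01, "mainland": 0.01})
boosted = stratified_sample(records, strata, {"samoa": 1.0, "mainland": 0.01})

n_rare_uniform = sum(1 for r, _ in uniform if r < 20)
n_rare_boosted = sum(1 for r, _ in boosted if r < 20)
```

Uniform 1% sampling keeps the rare stratum almost never (expected 0.2 records), while boosting its rate to 1.0 keeps all 20, at a negligible cost to the overall budget.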
DBLP EXAMPLE

Choosing the right strata is hard!
• 2,101,151 papers
• 1000 most populous venues
• Query examples:
  – title contains “learning” and # authors <= 3
  – title contains “mechanism” and year > 2004
What is the right stratification here?
• Stratifying by venue made things worse!
• Stratifying by year was better, but still worse than uniform sampling.
SAMPLING, STRATIFICATION, AND DATABASES

• Acharya, Gibbons, Poosala 2000 (Congressional Sampling): design strata that minimize worst-case variance over possible queries; combine stratified and uniform sampling.
  Important idea: consider past queries to the database!
• Chaudhuri, Das and Narasayya 2007: optimize for the query log; each stratum is a set of records that agree on all queries.
• Joshi, Jermaine 2008: split into two strata per query; take linear combinations of stratified probabilities based on record features.
OUR APPROACH

• Assume queries are drawn from an unknown distribution Q
• Use the query log Q as a “training set” (assumed drawn i.i.d. w.r.t. Q)
• Allow each record i to be sampled with a different probability p_i
• Minimize the Risk E[(ŷ − y)²]
• This translates to minimizing E_{q∼Q} Σ_i q_i² (1/p_i − 1)
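The step from the MSE to the sum above is the standard variance computation for independent (Poisson) sampling: since ŷ is unbiased, the risk equals the variance, and each record contributes independently:

```latex
\mathbb{E}\left[(\hat y - y)^2\right]
  = \operatorname{Var}(\hat y)
  = \sum_i \operatorname{Var}\!\left(\frac{q_i\,\mathbf{1}[i \in S]}{p_i}\right)
  = \sum_i q_i^2\,\frac{1 - p_i}{p_i}
  = \sum_i q_i^2 \left(\frac{1}{p_i} - 1\right)
```

Taking the expectation over q ∼ Q then yields the stated risk E_{q∼Q} Σ_i q_i² (1/p_i − 1).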
OUR APPROACH

• ERM: Minimize Σ_{q∈Q} Σ_i q_i² (1/p_i − 1) over the query log Q
• Sampling budget (B): Σ_i p_i c_i ≤ B, where B ≪ Σ_i c_i
• Regularization (ζ): ∀i, p_i ∈ [ζ, 1], where ζ ≤ B / Σ_i c_i
OUR APPROACH

• Solve with Lagrange multipliers:

  max_{α,β,γ} [ (1/|Q|) Σ_{q∈Q} Σ_i q_i² (1/p_i − 1) − Σ_i α_i (p_i − ζ) − Σ_i β_i (1 − p_i) − γ (B − Σ_i p_i c_i) ]

• By KKT conditions, for each record i:

  p_i ∝ √( (1/c_i) · (1/|Q|) Σ_{q∈Q} q_i² )   or   p_i = ζ   or   p_i = 1
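The KKT characterization above suggests a simple procedure (a hedged sketch of the closed form, not the authors' exact algorithm): set p_i proportional to √(avg_q q_i² / c_i), clip into [ζ, 1], and bisect on the proportionality constant until the budget Σ_i p_i c_i ≈ B is met.

```python
import math

def erm_probs(qsq_avg, costs, budget, zeta, iters=50):
    """qsq_avg[i] = (1/|Q|) * sum over training queries of q_i**2.
    Returns p_i proportional to sqrt(qsq_avg[i] / costs[i]),
    clipped to [zeta, 1], scaled so sum(p_i * c_i) ~= budget."""
    raw = [math.sqrt(v / c) for v, c in zip(qsq_avg, costs)]
    lo, hi = 1e-12, 1e12  # geometric bisection over the scale factor
    for _ in range(iters):
        scale = math.sqrt(lo * hi)
        p = [min(1.0, max(zeta, scale * r)) for r in raw]
        spend = sum(pi * ci for pi, ci in zip(p, costs))
        if spend > budget:
            hi = scale
        else:
            lo = scale
    return p

# Records that are heavy in past queries get high probabilities;
# records never touched by the query log fall to the floor zeta.
p = erm_probs(qsq_avg=[100.0, 1.0, 0.0], costs=[1.0, 1.0, 1.0],
              budget=1.5, zeta=0.01)
```

With unit costs and a budget of 1.5, the heavy record saturates at p = 1, the untouched record sits at the floor ζ = 0.01, and the middle record absorbs the remaining budget.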
OUR APPROACH

• Generalization: the algorithm's risk is close to the best possible risk:

  Risk(p) ≤ Risk(p*) · ( 1 + O( skew · √( log(n/δ) / |Q| ) ) )

  Here Risk(p) is the Alg’ Risk, Risk(p*) is the Best Risk, “skew” measures database badness, and |Q| is the training set size.
RESULTS
[Figure: DBLP dataset. Expected error vs. value of regularization parameter η ([weaker...] 0 → 1 [...stronger]). Curves: uniform sampling (p = 1/100) and ERM with 5000, 10000, 20000, and 40000 training queries.]
RESULTS

[Figure: Cube dataset. Expected error vs. value of regularization parameter η ([weaker...] 0 → 1 [...stronger]). Curves: uniform sampling (p = 1/10) and ERM with 50, 100, 200, 800, and 6400 training queries.]
RESULTS

[Figure: YAM+ dataset. Expected error vs. value of regularization parameter η ([weaker...] 0 → 1 [...stronger]). Curves: uniform sampling (p = 1/100) and ERM with 50, 100, 200, 400, and all training queries.]
RESULTS

[Figure: YAM+ dataset. Expected error (log scale) vs. numeric cardinality of the test query (1 to 10⁶, log scale), comparing regularized ERM with uniform sampling.]
RESULTS

[Figure: YAM+ dataset. Probability (rescaled) vs. average error, comparing the error distributions of uniform sampling and regularized ERM.]
RESULTS

[Figure: YAM+ dataset. Expected error (log scale) vs. “sampling rate” = budget / (total cost) (10⁻⁴ to 1, log scale), comparing uniform sampling and regularized ERM.]
RESULTS

[Figure: Cube dataset. Expected error (log scale) vs. numeric cardinality of the test query (1 to 10⁴, log scale), comparing ERM with uniform sampling.]