Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the...

51
Quantile Regression Two-step Algorithm Oracle Rules Simulation Embracing Blessing of Massive Scale in Big Data Guang Cheng Department of Statistics Purdue University July 31, 2017 A joint work with Shih-Kang Chao and Stanislav Volgushev

Transcript of Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the...

Page 1: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Embracing Blessing of Massive Scale in BigData

Guang ChengDepartment of Statistics

Purdue University

July 31, 2017A joint work with Shih-Kang Chao and Stanislav Volgushev

Page 2: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

With the advance of technology, it is increasingly common thatdata set is so large that it cannot be stored in a single machine

Social media (views, likes, comments, images...)Meteorological and environmental surveillanceTransactions in e-commerceOthers...

Figure: A server room in Council Bluffs, Iowa. Photo: Google/ConnieZhou.

Page 3: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, weneed to deal with storing (disc) and computational(memory, processor) bottlenecks.

Divide and Conquer:

Randomly divide N samples into S groups;Conduct parallel computing based on each subset,implemented by e.g., Hadoop;Individual sub-estimates or sub-inference results areaggregated in the end.

Data that are stored in distributed locations can beanalyzed in a similar manner.

Page 4: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, weneed to deal with storing (disc) and computational(memory, processor) bottlenecks.

Divide and Conquer:

Randomly divide N samples into S groups;Conduct parallel computing based on each subset,implemented by e.g., Hadoop;Individual sub-estimates or sub-inference results areaggregated in the end.

Data that are stored in distributed locations can beanalyzed in a similar manner.

Page 5: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, weneed to deal with storing (disc) and computational(memory, processor) bottlenecks.

Divide and Conquer:

Randomly divide N samples into S groups;Conduct parallel computing based on each subset,implemented by e.g., Hadoop;Individual sub-estimates or sub-inference results areaggregated in the end.

Data that are stored in distributed locations can beanalyzed in a similar manner.

Page 6: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, weneed to deal with storing (disc) and computational(memory, processor) bottlenecks.

Divide and Conquer:

Randomly divide N samples into S groups;Conduct parallel computing based on each subset,implemented by e.g., Hadoop;Individual sub-estimates or sub-inference results areaggregated in the end.

Data that are stored in distributed locations can beanalyzed in a similar manner.

Page 7: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, weneed to deal with storing (disc) and computational(memory, processor) bottlenecks.

Divide and Conquer:

Randomly divide N samples into S groups;Conduct parallel computing based on each subset,implemented by e.g., Hadoop;Individual sub-estimates or sub-inference results areaggregated in the end.

Data that are stored in distributed locations can beanalyzed in a similar manner.

Page 8: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, weneed to deal with storing (disc) and computational(memory, processor) bottlenecks.

Divide and Conquer:

Randomly divide N samples into S groups;Conduct parallel computing based on each subset,implemented by e.g., Hadoop;Individual sub-estimates or sub-inference results areaggregated in the end.

Data that are stored in distributed locations can beanalyzed in a similar manner.

Page 9: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Does D&C fit for statistical analysis?

Sometimes it does, but sometimes it doesn’t...

In the following, we give two simple examples.

Page 10: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Example 1: Sample Mean

Problem(N)

subproblem(n)

subproblem(n)

subproblem(n)

subproblem(n)

Xn,1 Xn,2Xn,3

Xn,4

1

4

4∑

s=1

Xs =1

4n

4∑

s=1

n∑

i=1

Xis =1

N

N∑

i=1

Xi = XN .

It fits!

Page 11: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Example 2: Sample Median

Problem(N)

subproblem(n)

subproblem(n)

subproblem(n)

subproblem(n)

X1(0.5) X2

(0.5)X3

(0.5)X4

(0.5)

Xs(0.5) = the middle value of ordered n samples in s group;

X(0.5) = overall median

1

4

4∑

s=1

Xs(0.5)

??= X(0.5)

Page 12: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Example 2: Sample Median

Simulation 1: Xi ∼ N(0, 1); Simulation 2: Xi ∼ Exponential(1).N = 215. True median v.s. simulated S−1

∑Ss=1X

s(0.5)

0.2 0.4 0.6 0.8−0.2

0−0

.10

0.00

0.10

τ = 0.5 , N (0,1)

logN(S)

0.2 0.4 0.6 0.8

0.7

0.8

0.9

1.0

τ = 0.5 , EXP (1)

logN(S)

Page 13: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Example 2: Sample Median

Simulation 1: Xi ∼ N(0, 1); Simulation 2: Xi ∼ Exponential(1).N = 215. True median v.s. simulated S−1

∑Ss=1X

s(0.5)

0.2 0.4 0.6 0.8−0.2

0−0

.10

0.00

0.10

τ = 0.5 , N (0,1)

logN(S)

0.2 0.4 0.6 0.8

0.7

0.8

0.9

1.0

τ = 0.5 , EXP (1)

logN(S)

Page 14: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Specific Goals

When does the D&C algorithm work?

Especially for skewed and heavy tail distribution

Statistical inference

Asymptotic distribution

Inference for the “whole” underlying distribution?

Take advantage of massive size to discover subtle patternshidden in the “whole” distribution

Page 15: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Specific Goals

When does the D&C algorithm work?

Especially for skewed and heavy tail distribution

Statistical inference

Asymptotic distribution

Inference for the “whole” underlying distribution?

Take advantage of massive size to discover subtle patternshidden in the “whole” distribution

Page 16: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Specific Goals

When does the D&C algorithm work?

Especially for skewed and heavy tail distribution

Statistical inference

Asymptotic distribution

Inference for the “whole” underlying distribution?

Take advantage of massive size to discover subtle patternshidden in the “whole” distribution

Page 17: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Outline

1 Quantile Regression

2 Two-Step Algorithm: D&C and Quantile Projection

3 Oracle Rules: Linear Model and Nonparametric Model

4 Simulation

Page 18: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Quantile

Response Y , predictors X. For τ ∈ (0, 1), conditional quantilecurve Q(·; τ) of Y ∈ R conditional on X is defined through

P (Y ≤ Q(X; τ)|X = x) = τ ∀x.

Q(x; τ) at τ = 0.1, 0.25, 0.5, 0.75, 0.9 under different models

● ●

● ●

●●

● ●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

● ●

● ●

●●●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

● ●

●●

0.0 0.2 0.4 0.6 0.8 1.0

−2

−1

01

23

Y = 0.5*X + N(0,1)

x

y

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●●

● ●

●●

●●●

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

● ●

●●

●●

● ●●

● ●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●● ●

●●

●●

●●

●●

0.0 0.2 0.4 0.6 0.8 1.0

−2

02

46

Y = X + (.5+X)*N(0,1)

x

y

● ●

●●

●●

● ●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

Y|X=x ~ Beta(1.5−x,.5+x)

x

y

Page 19: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Quantile Regression v.s. Mean Regression

Mean Regression:Yi = m(Xi) + εi, E[ε|X = x] = 0

m: regression function, object ofinterest.

εi: “errors.”

Quantile Regression:P (Y ≤ Q(x; τ)|X = x) = τ

No strict distinction between “signal”and “noise.”

Object of interest: properties ofconditional distribution of Y |X = x.

Contains much richer informationthan just mean trend.

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

● ●

● ●

●●

● ●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

● ●●

10 15 20 25

−0.

050.

000.

050.

100.

150.

20

Age

Bon

e M

ass

Den

sity

(B

MD

)

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

● ●

●●

●●

●● ●

●●

●●

●●

●●

●●

● ●

● ●

●●

● ●

●●

●●●

● ●

●●

●●

●●

●●

●●

●●

● ●●

10 15 20 25

−0.

050.

000.

050.

100.

150.

20

Age

Bon

e M

ass

Den

sity

(B

MD

)

Page 20: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Quantile Curves v.s. Conditional Distribution

Let FY |X(y|x) be the conditional dist. function of Y given X.

Q(x0; τ) = F−1(τ |x0).

0.0 0.2 0.4 0.6 0.8 1.0

0.00

0.05

0.10

τ

BM

D

Q̂( Age x0 = 13 ; τ)Q̂( Age x0 = 18 ; τ)Q̂( Age x0 = 23 ; τ)

0.00 0.05 0.10

0.0

0.2

0.4

0.6

0.8

1.0

BMD

CD

F

F̂(y | Age x0 = 13)F̂(y | Age x0 = 18)F̂(y | Age x0 = 23)

Page 21: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Quantile Regression as Optimization

Koenker and Bassett (1978): if Q(x; τ) = β(τ)>x, estimate by

β̂or(τ) := arg minb

N∑

i=1

ρτ (Yi − b>Xi) (1.1)

where ρτ (u) := τu+ + (1− τ)u− is the so-called “check function.”

Remark: Optimization problem (1.1) is convex (butnon-smooth), and can be solved by linear programming.

Page 22: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

A General Series Approximation Framework:

Q(x; τ) ≈ Z(x)>β(τ)

m := dim(Z) (it is possible that m→∞). Solve

β̂or(τ) := arg minb

N∑

i=1

ρτ{Yi − b>Z(Xi)

}

Examples of Z(x) include

linear model with fixed/increasing dimension;B-splines, polynomials, trigonometric polynomials.

Page 23: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

A General Series Approximation Framework:

Q(x; τ) ≈ Z(x)>β(τ)

m := dim(Z) (it is possible that m→∞). Solve

β̂or(τ) := arg minb

N∑

i=1

ρτ{Yi − b>Z(Xi)

}

Examples of Z(x) include

linear model with fixed/increasing dimension;B-splines, polynomials, trigonometric polynomials.

Page 24: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

A General Series Approximation Framework:

Q(x; τ) ≈ Z(x)>β(τ)

m := dim(Z) (it is possible that m→∞). Solve

β̂or(τ) := arg minb

N∑

i=1

ρτ{Yi − b>Z(Xi)

}

Examples of Z(x) include

linear model with fixed/increasing dimension;B-splines, polynomials, trigonometric polynomials.

Page 25: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

A General Series Approximation Framework:

Q(x; τ) ≈ Z(x)>β(τ)

m := dim(Z) (it is possible that m→∞). Solve

β̂or(τ) := arg minb

N∑

i=1

ρτ{Yi − b>Z(Xi)

}

Examples of Z(x) include

linear model with fixed/increasing dimension;B-splines, polynomials, trigonometric polynomials.

Page 26: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Motivation for Two-Step Algorithm

To infer the “whole” conditional distribution by

FY |X(y|x0) =

∫ 1

01{Q(x0; τ) < y}dτ,

we need to compute Q(x0; τ) for {τ1, . . . , τK} ∈ [τL, τU ]K ;

What is the minimal number of K?

For each fixed τj , we have to distribute data to S machines

since β̂or(τ) is computationally infeasible.

What is the maximum number of S?

Page 27: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Motivation for Two-Step Algorithm

To infer the “whole” conditional distribution by

FY |X(y|x0) =

∫ 1

01{Q(x0; τ) < y}dτ,

we need to compute Q(x0; τ) for {τ1, . . . , τK} ∈ [τL, τU ]K ;

What is the minimal number of K?

For each fixed τj , we have to distribute data to S machines

since β̂or(τ) is computationally infeasible.

What is the maximum number of S?

Page 28: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Motivation for Two-Step Algorithm

To infer the “whole” conditional distribution by

FY |X(y|x0) =

∫ 1

01{Q(x0; τ) < y}dτ,

we need to compute Q(x0; τ) for {τ1, . . . , τK} ∈ [τL, τU ]K ;

What is the minimal number of K?

For each fixed τj , we have to distribute data to S machines

since β̂or(τ) is computationally infeasible.

What is the maximum number of S?

Page 29: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Motivation for Two-Step Algorithm

To infer the “whole” conditional distribution by

FY |X(y|x0) =

∫ 1

01{Q(x0; τ) < y}dτ,

we need to compute Q(x0; τ) for {τ1, . . . , τK} ∈ [τL, τU ]K ;

What is the minimal number of K?

For each fixed τj , we have to distribute data to S machines

since β̂or(τ) is computationally infeasible.

What is the maximum number of S?

Page 30: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Step I: D&C Algorithm at Any Fixed τ

Problem(N)

subproblem(n)

subproblem(n)

subproblem(n)

subproblem(n)

β̂1(τ1) β̂2(τ1) β̂3(τ1)β̂4(τ1)

β̂s(τ) := arg minb∈Rm

n∑

i=1

ρτ{Yis − b>Z(Xis)

}

β(τ) :=1

S

S∑

s=1

β̂s(τ).

Page 31: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Step II: Quantile Projection Algorithm

We want to estimate β(τ) as a function over τ based on{β(τ1), . . . ,β(τK)} obtained in Step I:

β̂(τ) := Ξ̂>B(τ). (2.1)

B := (B1, ..., Bq)> is B-spline basis (along τ -direction)

defined on {t1, ..., tG} ⊂ T with degree rτ ∈ N;

Ξ̂ is computed as [α̂1 α̂2 ... α̂m], where

α̂j := arg minα∈Rq

K∑

k=1

(βj(τk)−α>B(τk)

)2. (2.2)

Quantile projection requires only {β(τ1), ...,β(τK)} of sizem×K, without the need to access the raw data set.

Page 32: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Step II: Quantile Projection Algorithm

We want to estimate β(τ) as a function over τ based on{β(τ1), . . . ,β(τK)} obtained in Step I:

β̂(τ) := Ξ̂>B(τ). (2.1)

B := (B1, ..., Bq)> is B-spline basis (along τ -direction)

defined on {t1, ..., tG} ⊂ T with degree rτ ∈ N;

Ξ̂ is computed as [α̂1 α̂2 ... α̂m], where

α̂j := arg minα∈Rq

K∑

k=1

(βj(τk)−α>B(τk)

)2. (2.2)

Quantile projection requires only {β(τ1), ...,β(τK)} of sizem×K, without the need to access the raw data set.

Page 33: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Step II: Quantile Projection Algorithm

We want to estimate β(τ) as a function over τ based on{β(τ1), . . . ,β(τK)} obtained in Step I:

β̂(τ) := Ξ̂>B(τ). (2.1)

B := (B1, ..., Bq)> is B-spline basis (along τ -direction)

defined on {t1, ..., tG} ⊂ T with degree rτ ∈ N;

Ξ̂ is computed as [α̂1 α̂2 ... α̂m], where

α̂j := arg minα∈Rq

K∑

k=1

(βj(τk)−α>B(τk)

)2. (2.2)

Quantile projection requires only {β(τ1), ...,β(τK)} of sizem×K, without the need to access the raw data set.

Page 34: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Step II: Quantile Projection Algorithm

We want to estimate β(τ) as a function over τ based on{β(τ1), . . . ,β(τK)} obtained in Step I:

β̂(τ) := Ξ̂>B(τ). (2.1)

B := (B1, ..., Bq)> is B-spline basis (along τ -direction)

defined on {t1, ..., tG} ⊂ T with degree rτ ∈ N;

Ξ̂ is computed as [α̂1 α̂2 ... α̂m], where

α̂j := arg minα∈Rq

K∑

k=1

(βj(τk)−α>B(τk)

)2. (2.2)

Quantile projection requires only {β(τ1), ...,β(τK)} of sizem×K, without the need to access the raw data set.

Page 35: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Estimation of FY |X(y|x)

Now we can estimate FY |X as

F̂Y |X(y|x0) := τL +

∫ τU

τL

1{Q̂(x0; τ) < y}dτ,

where Q̂(x0; τ) = Z(x0)>β̂(τ), and τk ∈ [τL, τU ] for 1 ≤ k ≤ K.

Page 36: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Computational Cost of Two-Step Procedure

The two-step procedure requires only one pass through theentire data;

The computational cost and statistical performance aredetermined by the number of machines S and the numberof quantiles K in the projection, which both grow as N .

Find an upper bound of S and an lower bound of K s.t.β̂(τ) are “close” to β̂or(τ) in some statistical sense;The sharp upper and lower bounds characterize the intrinsiccomputational limit.

Page 37: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Computational Cost of Two-Step Procedure

The two-step procedure requires only one pass through theentire data;

The computational cost and statistical performance aredetermined by the number of machines S and the numberof quantiles K in the projection, which both grow as N .

Find an upper bound of S and an lower bound of K s.t.β̂(τ) are “close” to β̂or(τ) in some statistical sense;The sharp upper and lower bounds characterize the intrinsiccomputational limit.

Page 38: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Computational Cost of Two-Step Procedure

The two-step procedure requires only one pass through theentire data;

The computational cost and statistical performance aredetermined by the number of machines S and the numberof quantiles K in the projection, which both grow as N .

Find an upper bound of S and an lower bound of K s.t.β̂(τ) are “close” to β̂or(τ) in some statistical sense;The sharp upper and lower bounds characterize the intrinsiccomputational limit.

Page 39: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Computational Cost of Two-Step Procedure

The two-step procedure requires only one pass through theentire data;

The computational cost and statistical performance aredetermined by the number of machines S and the numberof quantiles K in the projection, which both grow as N .

Find an upper bound of S and an lower bound of K s.t.β̂(τ) are “close” to β̂or(τ) in some statistical sense;The sharp upper and lower bounds characterize the intrinsiccomputational limit.

Page 40: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Oracle Rule

Chao, Volgushev and C. (2016) showed that

aNu>(β̂or(τ)− β(τ)

) G(τ) in `∞(T ) for T ⊂ (0, 1), (3.1)

where G is a mean-zero Gaussian process.

We say that β̂(τ) satisfies oracle rule if it shares the sameweak convergence as β̂or(τ), i.e., (3.1);

Statistical inferential accuracy (based on F̂Y |X(·|x0)) forFY |X(·|x0) achieves its oracle level if the above oracle ruleholds.

Page 41: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Oracle Rule

Chao, Volgushev and C. (2016) showed that

aNu>(β̂or(τ)− β(τ)

) G(τ) in `∞(T ) for T ⊂ (0, 1), (3.1)

where G is a mean-zero Gaussian process.

We say that β̂(τ) satisfies oracle rule if it shares the sameweak convergence as β̂or(τ), i.e., (3.1);

Statistical inferential accuracy (based on F̂Y |X(·|x0)) forFY |X(·|x0) achieves its oracle level if the above oracle ruleholds.

Page 42: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Oracle Rule

Chao, Volgushev and C. (2016) showed that

aNu>(β̂or(τ)− β(τ)

) G(τ) in `∞(T ) for T ⊂ (0, 1), (3.1)

where G is a mean-zero Gaussian process.

We say that β̂(τ) satisfies oracle rule if it shares the sameweak convergence as β̂or(τ), i.e., (3.1);

Statistical inferential accuracy (based on F̂Y |X(·|x0)) forFY |X(·|x0) achieves its oracle level if the above oracle ruleholds.

Page 43: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Two Leading Models

Linear model: m = dim(Z(x)) is fixed and Q(x; τ) = Z(x)>β(τ);

Univariate spline model: m = dim(Z(x)) → ∞ with the splineapproximation error defined as cN (γN ) :=

∣∣Q(x; τ)−Z(x)>γN (τ)∣∣,

γN (τ) := arg minγ∈Rm

E[(Z>γ −Q(X; τ))2f(Q(X; τ)|X)

]. (3.2)

Page 44: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

“Theoretical” Phase Transitions

K

SN1/2

log2N

N1/(2ητ )

N1/2

?

?

N1/2

m1/2 log2N

(Nm

)1/2

(Nm

)1/(2ητ )

Figure: Regions (S,K) for the oracle rule of linear model and splinenonparametric model. ”?” region is the discrepancy between thesufficient and necessary conditions.

Page 45: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Simulations for Nonparametric Spline Model

B: cubic B-spline with q = dim(B) defined on G = 4 + qequidistant knots. Require K > q so that β̂(τ) iscomputable (see (2.2));

Xi follows a multivariate uniform distribution U([0, 1]m−1)with covariance matrix ΣX := E[XiX

>i ] with

Σjk = 0.120.7|j−k| for j, k = 1, ...,m− 1;

x0 = (1, (m− 1)−1/2l>m−1)> and N = 214;

y0 = Q(x0; τ) so that FY |X(y0|x0) = τ ;

Our theorem suggest the following CI for τ :

[F̂Y |X(Q(x0; τ)|x0)±N−1/2

√τ(1− τ)x>0 Σ−1X x0Φ

−1(1− α/2)].

(4.1)

Page 46: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

FY |X(y|x), ε ∼ N (0, 0.12) and m = 4

● ●● ●

●●

● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Normal(0, 0.12) , m= 4 , y0 = Q( x0 ; τ = 0.1 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ●

●●

● ● ●●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Normal(0, 0.12) , m= 4 , y0 = Q( x0 ; τ = 0.5 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ● ● ●●

●●

● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Normal(0, 0.12) , m= 4 , y0 = Q( x0 ; τ = 0.9 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

Page 47: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

FY |X(y|x), ε ∼ N (0, 0.12) and m = 32

●●

●●

● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Normal(0, 0.12) , m= 32 , y0 = Q( x0 ; τ = 0.1 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ●●

● ● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Normal(0, 0.12) , m= 32 , y0 = Q( x0 ; τ = 0.5 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

●● ●

●●

● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Normal(0, 0.12) , m= 32 , y0 = Q( x0 ; τ = 0.9 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

Page 48: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

FY |X(y|x), ε ∼ Exp(0.8) and m = 4

● ●● ●

● ● ● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Exp (0.8) , m= 4 , y0 = Q( x0 ; τ = 0.1 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ● ●● ● ●

●● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Exp (0.8) , m= 4 , y0 = Q( x0 ; τ = 0.5 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ●● ● ● ● ●

● ●

● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Exp (0.8) , m= 4 , y0 = Q( x0 ; τ = 0.9 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

Page 49: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

FY |X(y|x), ε ∼ Exp(0.8) and m = 32

● ●●

● ● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Exp (0.8) , m= 32 , y0 = Q( x0 ; τ = 0.1 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ● ● ●

● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Exp (0.8) , m= 32 , y0 = Q( x0 ; τ = 0.5 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

● ● ● ● ●● ●

0.0 0.2 0.4 0.6 0.8

0.0

0.2

0.4

0.6

0.8

1.0

Exp (0.8) , m= 32 , y0 = Q( x0 ; τ = 0.9 )

logN(S)

Cov

erag

e P

roba

bilit

y

q = 7 , K = 20q = 7 , K = 30q = 10 , K = 20q = 10 , K = 30q = 15 , K = 20q = 15 , K = 30

Page 50: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Empirical Observations from Simulations

Either S > S∗ or q < q∗ leads to the collapse of the oraclerule.

Increase in q and K improves the coverage probability;

Coverage is no longer symmetry around τ even when modelerror is normal;

Page 51: Embracing Blessing of Massive Scale in Big Datascience.purdue.edu/bigdata/Quantile.pdf · With the advance of technology, it is increasingly common that data set is so large that

Quantile Regression Two-step Algorithm Oracle Rules Simulation

Thanks for your attention