Download - Patrick Breheny

7/23/2019 Patrick Breheny

1/18

IntroductionThe empirical distribution function

Introduction; The empirical distribution function

Patrick Breheny

August 26

Patrick Breheny STA 621: Nonparametric Statistics


2/18


Nonparametric vs. parametric statistics

The main idea of nonparametric statistics is to makeinferences about unknown quantities without resorting tosimple parametric reductions of the problem

For example, suppose x F, and we wish to estimate, sayE{X} or P{X > 1}The approach taken by parametric statistics is to assume thatF belongs to a family of distribution functions that can bedescribed by a small number of parameters e.g., the normaldistribution:

f(x) = 122

exp(x )222

These parameters are then estimated, and we make inferenceabout the quantities we were originally interested in (E{X} orP

{X > 1

}) based on assuming X

N(, 2)



3/18


Parametric statistics (contd)

Or suppose we wish to know how E{y} changes with xAgain, the parametric approach is to assume that

E{y|x} = + x

We estimate and , then base all future inference on thoseestimates



4/18


Shortcomings of the parametric approach

Both of the aforementioned parametric approach rely on atremendous reduction of the original problem

They assume that all uncertainty regarding F(x), or E

{y

|x

},

can be reduced to just two unknown numbers

If these assumptions are true, then of course, there is nothingwrong with making them

If they are false, however:

The resulting statistical inference will be questionableWe might miss interesting patterns in the data


I d i


5/18


The nonparametric approach

In contrast, nonparametric statistics tries to make as fewassumptions as possible about the data

Instead of assuming that F(x) is normal, we will allow F(x)

to be any function (provided, of course, that it satisfies thedefinition of a cdf)

Instead of assuming that E{y} is linear in x, we will allow itto be any continuous function

Obviously, this requires the development of a whole new set oftools, as instead of estimating parameters, we will beestimating functions (which are much more complex)


I t d ti


6/18


The four main topics

We will go over four main areas of nonparametric statistics in thiscourse:

Estimating aspects of the distribution of a random variableTesting aspects of the distribution of a random variable

Estimating the density of a random variable

Estimating the regression function E

{y

|x

}= f(x)


Introduction


7/18


The empirical distribution function

We will begin with the problem of estimating a CDF(cumulative distribution function)

Suppose X F, where F(x) = P(X x) is a distributionfunction

The empirical distribution function, F, is the CDF that putsmass 1/n at each data point xi:

F(x) =1

n

n

i=1

I(xi

x)

where I is the indicator function


Introduction


8/18


The empirical distribution function in R

R provides the very useful function ecdf for working with theempirical distribution function

Data Fhat(0.6)

[1] 0.933667

plot(Fhat)



9/18

Introduction


10/18


Properties of F

At any fixed value of x,

E{F(x)} = F(x)V{F(x)} = 1

nF(x)(1 F(x))

Note that these two facts imply that

F(x)P F(x)

An even stronger proof of convergence is given by theGlivenko-Cantelli Theorem:

supx

|F(x) F(x)| a.s. 0


Introduction


11/18


F as a nonparametric MLE

The empirical distribution function can be thought of as anonparametric maximum likelihood estimator

Homework:

Show that, out of all possible CDFs,

Fmaximizes

L(F|x) =n

i=1PF(xi)


Introduction


12/18


Confidence intervals vs. confidence bands

Before moving into the issue of calculating confidenceintervals for F, we need to discuss the notion of a confidenceinterval for a function

One approach is to fix x and calculate a confidence intervalfor F(x) i.e., find a region C(x) such that, for any CDF F,

P{F(x) C(x)} 1

These intervals are referred to as pointwise confidenceintervals


Introduction


13/18


Confidence intervals vs. confidence bands (contd)

Clearly, however, if there is a 1 probability that F(x) willnot lie in C(x) at each point x, there is greater than a 1 probability that there exists an x such that F(x) will lieoutside C(x)

Thus, a different approach to inference is to find a confidenceregion C(x) such that, for any CDF F,

P{F(x) C(x) x} 1

These intervals are referred to as confidence bands orconfidence envelopes


Introductionf


14/18


Inference regarding F

We can use the fact that, for each value of x, F(x) follows abinomial distribution with mean F(x) to construct pointwiseintervals for F

To construct confidence bands, we need a result called theDvoretzky-Kiefer-Wolfowitz inequality, or DKW inequality:

P

supx

|F(x) F(x)| > 2exp(2n2)

Note that this is a finite-sample, not an asymptotic, result


IntroductionTh i i l di ib i f i


15/18


A confidence band for F

Thus, setting the right side of the DKW inequality equal to ,

= 2exp(2n2)

log

2

= 2n2

log

2

= 2n2

=1

2n log 2


IntroductionTh i i l dist ib ti f ti


16/18


A confidence band for F (contd)

Thus, the following functions define an upper and lower 1 confidence band for any F and n:

L(x) = max{F(x) , 0}U(x) = min{F(x) + , 1}




17/18


Pointwise vs. confidence for nerve pulse data

0.0 0.5 1.0 1.5

0.0

0.2

0.4

0.6

0.8

1.0

Time

F^(x

)




18/18


Homework

I claimed that the confidence band based on the DKWinequality worked for any distribution function . . . does it?

Homework: Generate 100 observations from a N(0, 1)distribution. Compute a 95 percent confidence band for theCDF F. Repeat this 1000 times to see how often theconfidence band contains the true distribution function.Repeat using data generated from a Cauchy distribution.