A short survey of Stein’s method
Sourav Chatterjee
Stanford University
Sourav Chatterjee A short survey of Stein’s method
Assumption
- Since this is a general-audience talk, I will assume that the audience has a passing familiarity with the definitions of random variable, expected value, variance and the central limit theorem, but not much more.
Common pursuits in probability theory
- Some of the most common things that probabilists do are: compute or estimate expected values and variances of random variables, and prove generalized central limit theorems, sometimes on function spaces.
- A large fraction of papers in probability theory may be classified into one of the above categories, although they may not say so explicitly.
- There are many important problems where we do not know how to compute or estimate expected values (e.g. almost any statistical mechanics model in three and higher dimensions, such as the Ising model); many important problems where we do not know how to compute the order of the variance (e.g. almost any random combinatorial optimization problem, such as the traveling salesman problem); and many important problems where we do not know how to prove generalized central limit theorems (e.g. quantum field theories).
What is Stein’s method?
- Stein's method is a sophisticated technique for proving generalized central limit theorems, pioneered in the 1970s by Charles Stein, one of the leading statisticians of the 20th century.
- Recall the ordinary central limit theorem: if X_1, X_2, ... are independent and identically distributed random variables, then
\[
\lim_{n\to\infty} P\left(\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \le x\right) = \int_{-\infty}^{x} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du,
\]
where \(\mu = E(X_i)\) and \(\sigma^2 = \mathrm{Var}(X_i)\).
- Usual method of proof: the probability on the left-hand side is computed using Fourier transforms. Independence of the summands implies that the Fourier transform decomposes as a product. The rest is analysis.
- Stein's motivation: what if the X_i's are not exactly independent? Fourier transforms are usually not very helpful.
Stein’s idea
- Let W be any random variable and Z be a standard Gaussian random variable. That is,
\[
P(Z \le x) = \int_{-\infty}^{x} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du.
\]
- Suppose that we wish to show that W is "approximately Gaussian", in the sense that P(W ≤ x) ≈ P(Z ≤ x) for all x.
- Or, more generally, that Eh(W) ≈ Eh(Z) for any well-behaved h. (To see that this is a generalization, recall that P(W ≤ x) = Eh(W), where h(u) = 1 if u ≤ x and 0 otherwise.)
- To put this in context, imagine that
\[
W = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}.
\]
Stein’s idea (contd.)
- We have a random variable W and a standard Gaussian random variable Z, and we wish to show that Eh(W) ≈ Eh(Z) for all well-behaved h.
- In the Fourier-theoretic approach, we write both expectations in terms of Fourier transforms and then use analysis to show that they are approximately equal.
- The problem with this approach is that it may be hard or useless to write Eh(W) in terms of Fourier transforms if W is a relatively complicated random variable.
- For example: toss a coin n times, and let W be the number of times a certain pattern, say HTHH, appears.
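This pattern-counting variable is easy to simulate. The sketch below is a quick sanity check, not from the talk; the helper names `count_pattern` and `sample_W` are my own. The one moment we can compute directly: each of the n − 3 windows matches HTHH with probability (1/2)^4, so E(W) = (n − 3)/16.

```python
import random

def count_pattern(tosses, pattern="HTHH"):
    """Count (possibly overlapping) occurrences of `pattern` in a toss sequence."""
    k = len(pattern)
    return sum(1 for i in range(len(tosses) - k + 1)
               if "".join(tosses[i:i + k]) == pattern)

def sample_W(n, rng):
    """Toss a fair coin n times and count appearances of HTHH."""
    tosses = [rng.choice("HT") for _ in range(n)]
    return count_pattern(tosses)

rng = random.Random(0)
n, reps = 1000, 2000
mean_W = sum(sample_W(n, rng) for _ in range(reps)) / reps
# Each of the n - 3 windows matches HTHH with probability 1/16,
# so E(W) = (n - 3)/16, about 62.3 for n = 1000.
print(round(mean_W, 1))
```

Note that the window indicators are dependent (windows overlap), which is exactly why the Fourier approach is awkward here.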
Stein’s idea (contd.)
- Stein's idea:
  - Given h, obtain a function f by solving the differential equation
\[
f'(x) - x f(x) = h(x) - Eh(Z).
\]
  - Show that E(f'(W) - W f(W)) ≈ 0 using the properties of W.
  - Since
\[
E(f'(W) - W f(W)) = Eh(W) - Eh(Z),
\]
    conclude that Eh(W) ≈ Eh(Z).
- What's the gain? Stein's method is a local-to-global approach for proving central limit theorems; often, it is possible to prove E(f'(W) - W f(W)) ≈ 0 using small local perturbations of W.
- The Fourier-theoretic method (and most other methods of proving central limit theorems) are non-perturbative.
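For a concrete h, the Stein equation can be solved in closed form and checked numerically. The sketch below is my own illustration, not from the talk: it takes h(u) = 1{u ≤ 0}, for which Eh(Z) = 1/2, uses the standard solution f(x) = e^{x²/2} ∫_{-∞}^{x} (h(u) − 1/2) e^{-u²/2} du, and verifies the differential equation by finite differences.

```python
import math

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stein_f(x):
    """Solution of f'(x) - x f(x) = h(x) - Eh(Z) for h(u) = 1{u <= 0}:
    f(x) = e^{x^2/2} * integral_{-inf}^{x} (h(u) - 1/2) e^{-u^2/2} du."""
    c = math.sqrt(2.0 * math.pi)
    integral = c * (Phi(min(x, 0.0)) - 0.5 * Phi(x))
    return math.exp(x * x / 2.0) * integral

def ode_residual(x, eps=1e-5):
    """f'(x) - x f(x), with f' computed by a central difference."""
    fprime = (stein_f(x + eps) - stein_f(x - eps)) / (2.0 * eps)
    return fprime - x * stein_f(x)

# h(x) - Eh(Z) equals 1/2 for x < 0 and -1/2 for x > 0.
print(round(ode_residual(-1.3), 4), round(ode_residual(0.7), 4))  # prints 0.5 -0.5
```

The same construction works for any bounded measurable h; the point of the next slides is that this particular f is the one constant choice on the right-hand side that keeps f bounded.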
Stein’s idea (contd.)
- Indeed, Stein's method can be successfully implemented in various examples by showing that for some function g related to f,
\[
E(f'(W) - W f(W)) \approx \alpha\, E(g(W') - g(W)),
\]
where α is a real number and W' is a small perturbation of W that has the same probability distribution as W.
- Since Eg(W') = Eg(W), we may conclude that E(f'(W) - W f(W)) ≈ 0. This is known as the method of exchangeable pairs.
Why does the method work?
- If we replace Eh(Z) by any other constant in the differential equation
\[
f'(x) - x f(x) = h(x) - Eh(Z),
\]
the f that we get is not well-behaved: it blows up badly at infinity.
- There is a relationship between this differential equation and the Gaussian distribution.
- In fact, a random variable X has the standard Gaussian distribution if and only if E(f'(X) - X f(X)) = 0 for all well-behaved f.
- For this reason, the differential operator T, defined as Tf(x) = f'(x) - x f(x), is called a characterizing operator for the standard Gaussian distribution.
- Stein's method can be (and has been) generalized to prove other probabilistic limit theorems, even on function spaces, by working with suitable characterizing operators and solving the related differential equations.
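The characterization can be illustrated by Monte Carlo. The sketch below is my own check, with f = sin as an assumed test function: E(f'(X) − X f(X)) is estimated for a standard Gaussian sample and for a uniform sample on [−√3, √3] (which also has mean 0 and variance 1). The Gaussian estimate is near 0; the uniform one is not (a short calculation gives cos(√3) ≈ −0.16).

```python
import math
import random

def stein_discrepancy(sample, f, fprime):
    """Monte Carlo estimate of E(f'(X) - X f(X)) over a sample of X."""
    return sum(fprime(x) - x * f(x) for x in sample) / len(sample)

rng = random.Random(42)
N = 200_000
gauss = [rng.gauss(0.0, 1.0) for _ in range(N)]
# Uniform on [-sqrt(3), sqrt(3)] also has mean 0 and variance 1.
a = math.sqrt(3.0)
unif = [rng.uniform(-a, a) for _ in range(N)]

d_gauss = stein_discrepancy(gauss, math.sin, math.cos)
d_unif = stein_discrepancy(unif, math.sin, math.cos)
print(round(d_gauss, 3), round(d_unif, 3))  # near 0 and near -0.16 respectively
```

Of course, vanishing for a single f proves nothing; the characterization requires E(f'(X) − X f(X)) = 0 for all well-behaved f.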
A simple example
- Let X_1, X_2, ..., X_n be independent random variables with E(X_i) = 0 and E(X_i^2) = 1.
- Let S = n^{-1/2}(X_1 + ... + X_n).
- For each i, let S_i = S - n^{-1/2} X_i. Then S_i and X_i are independent.
- Therefore, for any f, E(X_i f(S_i)) = E(X_i) E(f(S_i)) = 0.
- By a first-order Taylor expansion, this gives
\[
E(X_i f(S)) = E(X_i (f(S) - f(S_i))) \approx E(X_i (S - S_i) f'(S)) = n^{-1/2} E(X_i^2 f'(S)).
\]
- Thus,
\[
E(S f(S)) = n^{-1/2} \sum_i E(X_i f(S)) \approx n^{-1} \sum_i E(X_i^2 f'(S)) = E\Big(\Big(n^{-1} \sum_i X_i^2\Big) f'(S)\Big).
\]
- By the law of large numbers, \(n^{-1} \sum X_i^2 \approx 1\). Therefore, E(S f(S)) ≈ E(f'(S)). By Stein's method, this shows that S is approximately Gaussian.
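The key approximation E(S f(S)) ≈ E(f'(S)) can be checked by simulation. A minimal sketch, assuming fair ±1 coin flips for the X_i's and f = sin as the test function (both my own choices):

```python
import math
import random

rng = random.Random(1)

def sample_S(n):
    """S = n^{-1/2}(X_1 + ... + X_n) with fair ±1 coin flips X_i."""
    return sum(rng.choice((-1.0, 1.0)) for _ in range(n)) / math.sqrt(n)

n, reps = 100, 20_000
samples = [sample_S(n) for _ in range(reps)]
lhs = sum(s * math.sin(s) for s in samples) / reps  # estimates E(S f(S))
rhs = sum(math.cos(s) for s in samples) / reps      # estimates E(f'(S))
print(round(lhs, 3), round(rhs, 3))  # the two estimates nearly agree
```

The discrepancy between the two estimates shrinks as n grows, in line with the Taylor-expansion error in the argument above.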
History of Stein’s method
- Too much to summarize in this talk. See my ICM article for a recap of the main developments.
- Many variants: exchangeable pairs, size-biased couplings, zero-biased couplings, dependency graphs, multivariate and function-space versions, Poisson approximation, etc.
- Key figures in the historical development: Charles Stein, Louis Chen, Andrew Barbour, Erwin Bolthausen, Larry Goldstein, Gesine Reinert, Yosef Rinott, Vladimir Rotar, ...
- Several young probabilists are working on developing the theory of Stein's method these days, including myself. Many more are using Stein's method in their research.
- Significant opportunities for new results and new applications.
In the remainder of this talk, I will present my interpretation of Stein's method.
What causes central limit behavior?
- Suppose that we are investigating whether a random variable W is "approximately Gaussian".
- Traditional wisdom: if W is built out of many small random influences which are approximately independent of each other, then one may expect W to exhibit Gaussian behavior.
- It is clear what this means if W is a sum of independent or approximately independent random variables, but not otherwise.
- Examples where central limit behavior is expected but the above heuristic does not make sense in any obvious way: minimal spanning trees, the traveling salesman problem, optimal matching, ...
Minimal spanning tree
- Suppose that we have an undirected graph, e.g. a discrete grid.
- On each edge of the grid, there is a random weight. The weights are independent.
- A minimal spanning tree (MST) of this graph is a subtree (i.e. a cycle-free subgraph) that minimizes the sum of edge weights among all subtrees.
- It is expected that under fairly general conditions, the total edge weight of a minimal spanning tree should obey a central limit theorem.
- Proved on integer lattices by Alexander (1995, for 2D) and Kesten and Lee (1996, general dimensions).
- It is not clear how the weight of the MST may be seen as a sum total of many small influences that are approximately independent.
Search for a better heuristic
- Often, the random variable W of interest may be written as a function f(X_1, ..., X_n) of a collection of independent random variables X_1, ..., X_n.
- For example, the weight of the MST is a function of the edge weights, which are independent.
- Let ∂_i W be the change in W if X_i is perturbed (typically, replaced by an independent copy).
- The typical size of ∂_i W is often called the "influence" of X_i on W. Influences are in widespread use in theoretical CS, machine learning and other areas.
- Is it true that if the influences are small, then W exhibits central limit behavior?
Search for a better heuristic (contd.)
- Not quite.
- For example, let X_1, ..., X_n be independent random variables with P(X_i = 1) = P(X_i = -1) = 1/2.
- Let
\[
W = \frac{1}{n} \sum_{1 \le i < j \le n} X_i X_j.
\]
- W has a limiting distribution that is not Gaussian.
- However,
\[
\partial_i W = \frac{1}{n} \sum_{j \ne i} X_j,
\]
which is typically quite small.
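The non-Gaussian behavior of this W is easy to see by simulation. A minimal sketch (my own, not from the talk), using the algebraic identity W = ((X_1 + ... + X_n)² − n)/(2n): the samples are bounded below by −1/2 and heavily right-skewed, unlike any Gaussian law.

```python
import random

rng = random.Random(7)

def sample_W(n):
    """W = (1/n) * sum_{i<j} X_i X_j with X_i = ±1; equivalently,
    W = ((X_1 + ... + X_n)^2 - n) / (2n)."""
    s = sum(rng.choice((-1, 1)) for _ in range(n))
    return (s * s - n) / (2 * n)

n, reps = 200, 10_000
ws = [sample_W(n) for _ in range(reps)]
mean = sum(ws) / reps
var = sum((w - mean) ** 2 for w in ws) / reps
skew = sum((w - mean) ** 3 for w in ws) / reps / var ** 1.5
# The sample minimum sits near -1/2 and the skewness is large and positive,
# consistent with the non-Gaussian limit (Z^2 - 1)/2.
print(round(min(ws), 2), round(skew, 1))
```

The identity used in `sample_W` also explains the limit: (∑X_i)/√n is approximately standard Gaussian, so W ≈ (Z² − 1)/2.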
The perturbative heuristic for central limit behavior
- It turns out that one can still formulate a heuristic for central limit behavior using influences.
- In a 2008 paper, I showed, roughly speaking, that if the influences ∂_i W are not only small, but also approximately independent, then W is approximately Gaussian.
- The proof uses Stein's method. I will give a precise statement and proof shortly.
Coming back to MST
- Does this heuristic work for the MST? Yes.
- If the weight of an edge e is perturbed, then the change ∂_e W in the total weight W may be shown to be approximately equal (with high probability) to a function of the weights of the edges close to e.
- Consequently, if e and e' are two edges that are far apart, then ∂_e W and ∂_{e'} W are approximately independent.
- Since most pairs of edges are far apart from each other, the perturbative heuristic implies that W is approximately Gaussian.
- This argument has recently been used to obtain rates of convergence in central limit theorems for minimal spanning trees in C. & Sen (2013).
Open problem: Traveling salesman
- n points are uniformly distributed in the unit square.
- L_n is the length of the minimum-length path that touches every point.
- It is widely believed that L_n should satisfy a central limit theorem.
- According to the perturbative heuristic, one should be able to solve this open question if one shows that the change in L_n that occurs upon perturbing a single point depends (approximately) only on a small number of points in its neighborhood.
- There are many other similar questions about central limit behavior in combinatorial optimization that modern probability theory is unable to tackle. The perturbative heuristic gives a possible unified approach to attack such problems, if someone can figure out how to verify the required condition (approximate independence of influences).
Formal statement of the perturbative heuristic
- Let X = (X_1, ..., X_n) be a vector of independent random variables.
- Let W = f(X) for some function f.
- Let X' = (X'_1, ..., X'_n) be an independent copy of X.
- Let [n] = {1, ..., n}, and for each A ⊆ [n], define the random vector X^A by
\[
X^A_i =
\begin{cases}
X'_i & \text{if } i \in A, \\
X_i & \text{if } i \notin A.
\end{cases}
\]
- Let
\[
\partial_i f := f(X) - f(X^{\{i\}}),
\]
and for each A ⊆ [n] and i ∉ A, let
\[
\partial_i f^A := f(X^A) - f(X^{A \cup \{i\}}).
\]
Formal statement of the perturbative heuristic (contd.)
- For each proper subset A of [n], define
\[
\nu(A) := \frac{1}{n \binom{n-1}{|A|}}.
\]
- Let T_i be the weighted average
\[
T_i := \frac{1}{2} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f\, \partial_i f^A.
\]
- This T_i will be our measure of the influence of X_i on W.
- The heuristic is that if the T_i's are approximately uncorrelated, then W is approximately Gaussian. This will be made more precise on the next slide.
Formal statement of the perturbative heuristic (contd.)
- Let T := ∑_{i=1}^{n} T_i.
- The order of fluctuations of T is a measure of the uncorrelatedness of the T_i's. If the T_i's have small correlation, then T has a small variance.

Theorem (C., 2008)
Suppose that E(W) = 0 and Var(W) = 1. Then
\[
\sup_{x \in \mathbb{R}} \left| P(W \le x) - \int_{-\infty}^{x} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du \right| \le 2 \left( \sqrt{\mathrm{Var}(T)} + \frac{1}{4} \sum_{i=1}^{n} E|\partial_i f|^3 \right)^{1/2}.
\]
The first term measures the uncorrelatedness of the influences and the second term measures the smallness of the influences.
Proof sketch
- First, recall the notation:
  - X = (X_1, ..., X_n) is a vector of independent random variables.
  - W = f(X) is some function of X. Assume that E(W) = 0 and Var(W) = 1.
  - X' = (X'_1, ..., X'_n) is an independent copy of X.
  - For A ⊆ [n], the random vector X^A is defined by
\[
X^A_i =
\begin{cases}
X'_i & \text{if } i \in A, \\
X_i & \text{if } i \notin A.
\end{cases}
\]
  - Discrete derivatives are defined as
\[
\partial_i f := f(X) - f(X^{\{i\}}), \qquad \partial_i f^A := f(X^A) - f(X^{A \cup \{i\}}).
\]
  - Let
\[
T := \frac{1}{2} \sum_{i=1}^{n} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f\, \partial_i f^A,
\qquad \text{where} \qquad
\nu(A) := \frac{1}{n \binom{n-1}{|A|}}.
\]
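For a tiny n, the statistic T can be computed by brute force over all subsets. The sketch below is my own illustration with an assumed toy function f(x) = (x_1 + ... + x_n)/√n and X_i = ±1 (so Var(W) = 1 exactly); it estimates E(T), which should come out equal to Var(W) = 1.

```python
import math
import random
from itertools import combinations

rng = random.Random(3)
n = 6

def f(x):
    """Toy choice: the normalized sum, so W = f(X) has mean 0 and variance 1."""
    return sum(x) / math.sqrt(len(x))

def nu(A, n):
    """nu(A) = 1 / (n * binom(n-1, |A|))."""
    return 1.0 / (n * math.comb(n - 1, len(A)))

def X_A(x, xp, A):
    """Coordinates in A come from the independent copy xp."""
    return [xp[i] if i in A else x[i] for i in range(len(x))]

def T_value(x, xp):
    """T = (1/2) sum_i sum_{A subset [n]\\{i}} nu(A) d_i f d_i f^A, by enumeration."""
    m = len(x)
    total = 0.0
    for i in range(m):
        others = [j for j in range(m) if j != i]
        di_f = f(x) - f(X_A(x, xp, {i}))
        for k in range(m):
            for A in combinations(others, k):
                A = set(A)
                di_fA = f(X_A(x, xp, A)) - f(X_A(x, xp, A | {i}))
                total += nu(A, m) * di_f * di_fA
    return total / 2.0

reps = 2000
est = sum(T_value([rng.choice((-1, 1)) for _ in range(n)],
                  [rng.choice((-1, 1)) for _ in range(n)])
          for _ in range(reps)) / reps
print(round(est, 2))  # E(T) should match Var(W) = 1
```

The enumeration costs 2^{n-1} subsets per coordinate, so this is only a check of the definitions, not a practical way to compute T.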
Proof sketch (contd.)
- For any g, f, A, and i ∉ A,
\[
E(\partial_i g\, \partial_i f^A) = E(g(X)\, \partial_i f^A) - E(g(X^{\{i\}})\, \partial_i f^A) = 2\, E(g(X)\, \partial_i f^A),
\]
by the exchangeability of (X_i, X'_i).
- With g = ϕ ∘ f, we have ∂_i g ≈ ϕ'(W) ∂_i f, and hence
\[
\frac{1}{2} E(\varphi'(W)\, \partial_i f\, \partial_i f^A) \approx \frac{1}{2} E(\partial_i g\, \partial_i f^A) = E(\varphi(W)\, \partial_i f^A).
\]
- Thus,
\[
E(\varphi'(W)\, T) \approx E\left( \varphi(W) \sum_{i=1}^{n} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f^A \right).
\]
Proof sketch (contd.)
- Now note that
\[
\sum_{i=1}^{n} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f^A = f(X) - f(X'),
\]
which is just an algebraic identity.
- Thus,
\[
E(\varphi'(W)\, T) \approx E[\varphi(W)(f(X) - f(X'))] = E(\varphi(W)\, W).
\]
- Exact equality holds for ϕ(u) = u, which gives E(T) = Var(W) = 1.
- Thus, if Var(T) is tiny, then
\[
E(\varphi(W)\, W) \approx E(\varphi'(W)).
\]
- One can now apply Stein's method to conclude that W is approximately Gaussian.
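The algebraic identity in the first bullet can be verified directly for a small n. A minimal sketch (my own, with an arbitrary nonlinear test function f as an assumption):

```python
import math
import random
from itertools import combinations

rng = random.Random(11)
n = 5

def f(x):
    """An arbitrary nonlinear test function; any f works for the identity."""
    return sum(v * v for v in x) + x[0] * x[-1]

def nu(A, n):
    """nu(A) = 1 / (n * binom(n-1, |A|))."""
    return 1.0 / (n * math.comb(n - 1, len(A)))

def X_A(x, xp, A):
    """Coordinates in A come from the independent copy xp."""
    return [xp[i] if i in A else x[i] for i in range(len(x))]

x = [rng.gauss(0, 1) for _ in range(n)]
xp = [rng.gauss(0, 1) for _ in range(n)]

# Left-hand side: sum_i sum_{A subset [n]\{i}} nu(A) * d_i f^A.
lhs = 0.0
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for A in combinations(others, k):
            A = set(A)
            di_fA = f(X_A(x, xp, A)) - f(X_A(x, xp, A | {i}))
            lhs += nu(A, n) * di_fA

print(abs(lhs - (f(x) - f(xp))) < 1e-9)  # True: the identity holds exactly
```

Since the identity is purely algebraic, it holds for every realization of X and X', not just in expectation; the randomness here is only a way of picking a generic test point.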
Summary
- Stein's method is a powerful tool for proving central limit theorems in the presence of dependence. It has many variants.
- The perturbative heuristic is a version of Stein's method that is intended to give a unified approach to understanding central limit behavior for complicated random variables.
- The heuristic says, roughly, that if a random variable W is a function of many independent random variables X_1, ..., X_n, and (a) the influence of each X_i on W, measured in a certain way, is small; and (b) the influences are approximately independent, then W may be expected to exhibit central limit behavior.
- The heuristic has been used to prove central limit theorems in statistics, random matrix theory, number theory, and other areas.
- There are lots of open questions that may be approachable by this heuristic.