A short survey of Stein’s method
Sourav Chatterjee
Stanford University
Sourav Chatterjee A short survey of Stein’s method
Assumption
- Since this is a general-audience talk, I will assume that the audience has a passing familiarity with the definitions of random variable, expected value, variance and the central limit theorem, but not much more.
Common pursuits in probability theory
- Some of the most common things that probabilists do are: compute or estimate expected values and variances of random variables, and prove generalized central limit theorems, sometimes on function spaces.
- A large fraction of papers in probability theory may be classified into one of the above categories, although they may not say so explicitly.
- There are many important problems where we do not know how to compute or estimate expected values (e.g. almost any statistical mechanics model in three and higher dimensions, such as the Ising model); many important problems where we do not know how to compute the order of the variance (e.g. almost any random combinatorial optimization problem, such as the traveling salesman problem); and many important problems where we do not know how to prove generalized central limit theorems (e.g. quantum field theories).
What is Stein’s method?
- Stein's method is a sophisticated technique for proving generalized central limit theorems, pioneered in the 1970s by Charles Stein, one of the leading statisticians of the 20th century.
- Recall the ordinary central limit theorem: if X_1, X_2, ... are independent and identically distributed random variables, then
\[
\lim_{n\to\infty} P\left(\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \le x\right) = \int_{-\infty}^{x} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du,
\]
where \(\mu = E(X_i)\) and \(\sigma^2 = \mathrm{Var}(X_i)\).
- Usual method of proof: the probability on the left-hand side is computed using Fourier transforms. Independence of the summands implies that the Fourier transform decomposes as a product. The rest is analysis.
- Stein's motivation: what if the X_i's are not exactly independent? Fourier transforms are usually not very helpful.
Stein’s idea
- Let W be any random variable and Z be a standard Gaussian random variable. That is,
\[
P(Z \le x) = \int_{-\infty}^{x} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du.
\]
- Suppose that we wish to show that W is "approximately Gaussian", in the sense that P(W ≤ x) ≈ P(Z ≤ x) for all x.
- Or, more generally, that Eh(W) ≈ Eh(Z) for any well-behaved h. (To see that this is a generalization, recall that P(W ≤ x) = Eh(W), where h(u) = 1 if u ≤ x and 0 otherwise.)
- To put this in context, imagine that
\[
W = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}}.
\]
Stein’s idea (contd.)
- We have a random variable W and a standard Gaussian random variable Z, and we wish to show that Eh(W) ≈ Eh(Z) for all well-behaved h.
- In the Fourier-theoretic approach, we write both expectations in terms of Fourier transforms and then use analysis to show that they are approximately equal.
- The problem with this approach is that it may be hard or useless to write Eh(W) in terms of Fourier transforms if W is a relatively complicated random variable.
- For example: toss a coin n times, and let W be the number of times a certain pattern, say HTHH, appears.
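This pattern-counting variable is easy to simulate. The sketch below is a quick sanity check, not from the talk; the helper names `count_pattern` and `sample_W` are my own. The one moment we can compute directly: each of the n − 3 windows matches HTHH with probability (1/2)^4, so E(W) = (n − 3)/16.

```python
import random

def count_pattern(tosses, pattern="HTHH"):
    """Count (possibly overlapping) occurrences of `pattern` in a toss sequence."""
    k = len(pattern)
    return sum(1 for i in range(len(tosses) - k + 1)
               if "".join(tosses[i:i + k]) == pattern)

def sample_W(n, rng):
    """Toss a fair coin n times and count appearances of HTHH."""
    tosses = [rng.choice("HT") for _ in range(n)]
    return count_pattern(tosses)

rng = random.Random(0)
n, reps = 1000, 2000
mean_W = sum(sample_W(n, rng) for _ in range(reps)) / reps
# Each of the n - 3 windows matches HTHH with probability 1/16,
# so E(W) = (n - 3)/16, about 62.3 for n = 1000.
print(round(mean_W, 1))
```

Note that the window indicators are dependent (windows overlap), which is exactly why the Fourier approach is awkward here.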
Stein’s idea (contd.)
- Stein's idea:
  - Given h, obtain a function f by solving the differential equation
\[
f'(x) - x f(x) = h(x) - Eh(Z).
\]
  - Show that E(f'(W) - W f(W)) ≈ 0 using the properties of W.
  - Since
\[
E(f'(W) - W f(W)) = Eh(W) - Eh(Z),
\]
    conclude that Eh(W) ≈ Eh(Z).
- What's the gain? Stein's method is a local-to-global approach for proving central limit theorems; often, it is possible to prove E(f'(W) - W f(W)) ≈ 0 using small local perturbations of W.
- The Fourier-theoretic method (and most other methods of proving central limit theorems) are non-perturbative.
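For a concrete h, the Stein equation can be solved in closed form and checked numerically. The sketch below is my own illustration, not from the talk: it takes h(u) = 1{u ≤ 0}, for which Eh(Z) = 1/2, uses the standard solution f(x) = e^{x²/2} ∫_{-∞}^{x} (h(u) − 1/2) e^{-u²/2} du, and verifies the differential equation by finite differences.

```python
import math

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stein_f(x):
    """Solution of f'(x) - x f(x) = h(x) - Eh(Z) for h(u) = 1{u <= 0}:
    f(x) = e^{x^2/2} * integral_{-inf}^{x} (h(u) - 1/2) e^{-u^2/2} du."""
    c = math.sqrt(2.0 * math.pi)
    integral = c * (Phi(min(x, 0.0)) - 0.5 * Phi(x))
    return math.exp(x * x / 2.0) * integral

def ode_residual(x, eps=1e-5):
    """f'(x) - x f(x), with f' computed by a central difference."""
    fprime = (stein_f(x + eps) - stein_f(x - eps)) / (2.0 * eps)
    return fprime - x * stein_f(x)

# h(x) - Eh(Z) equals 1/2 for x < 0 and -1/2 for x > 0.
print(round(ode_residual(-1.3), 4), round(ode_residual(0.7), 4))  # prints 0.5 -0.5
```

The same construction works for any bounded measurable h; the point of the next slides is that this particular f is the one constant choice on the right-hand side that keeps f bounded.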
Stein’s idea (contd.)
- Indeed, Stein's method can be successfully implemented in various examples by showing that for some function g related to f,
\[
E(f'(W) - W f(W)) \approx \alpha\, E(g(W') - g(W)),
\]
where α is a real number and W' is a small perturbation of W that has the same probability distribution as W.
- Since Eg(W') = Eg(W), we may conclude that E(f'(W) - W f(W)) ≈ 0. This is known as the method of exchangeable pairs.
Why does the method work?
- If we replace Eh(Z) by any other constant in the differential equation
\[
f'(x) - x f(x) = h(x) - Eh(Z),
\]
the f that we get is not well-behaved: it blows up badly at infinity.
- There is a relationship between this differential equation and the Gaussian distribution.
- In fact, a random variable X has the standard Gaussian distribution if and only if E(f'(X) - X f(X)) = 0 for all well-behaved f.
- For this reason, the differential operator T, defined as Tf(x) = f'(x) - x f(x), is called a characterizing operator for the standard Gaussian distribution.
- Stein's method can be (and has been) generalized to prove other probabilistic limit theorems, even on function spaces, by working with suitable characterizing operators and solving the related differential equations.
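The characterization can be illustrated by Monte Carlo. The sketch below is my own check, with f = sin as an assumed test function: E(f'(X) − X f(X)) is estimated for a standard Gaussian sample and for a uniform sample on [−√3, √3] (which also has mean 0 and variance 1). The Gaussian estimate is near 0; the uniform one is not (a short calculation gives cos(√3) ≈ −0.16).

```python
import math
import random

def stein_discrepancy(sample, f, fprime):
    """Monte Carlo estimate of E(f'(X) - X f(X)) over a sample of X."""
    return sum(fprime(x) - x * f(x) for x in sample) / len(sample)

rng = random.Random(42)
N = 200_000
gauss = [rng.gauss(0.0, 1.0) for _ in range(N)]
# Uniform on [-sqrt(3), sqrt(3)] also has mean 0 and variance 1.
a = math.sqrt(3.0)
unif = [rng.uniform(-a, a) for _ in range(N)]

d_gauss = stein_discrepancy(gauss, math.sin, math.cos)
d_unif = stein_discrepancy(unif, math.sin, math.cos)
print(round(d_gauss, 3), round(d_unif, 3))  # near 0 and near -0.16 respectively
```

Of course, vanishing for a single f proves nothing; the characterization requires E(f'(X) − X f(X)) = 0 for all well-behaved f.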
A simple example
- Let X_1, X_2, ..., X_n be independent random variables with E(X_i) = 0 and E(X_i^2) = 1.
- Let S = n^{-1/2}(X_1 + ... + X_n).
- For each i, let S_i = S - n^{-1/2} X_i. Then S_i and X_i are independent.
- Therefore, for any f, E(X_i f(S_i)) = E(X_i) E(f(S_i)) = 0.
- By a first-order Taylor expansion, this gives
\[
E(X_i f(S)) = E(X_i (f(S) - f(S_i))) \approx E(X_i (S - S_i) f'(S)) = n^{-1/2} E(X_i^2 f'(S)).
\]
- Thus,
\[
E(S f(S)) = n^{-1/2} \sum_i E(X_i f(S)) \approx n^{-1} \sum_i E(X_i^2 f'(S)) = E\Big(\Big(n^{-1} \sum_i X_i^2\Big) f'(S)\Big).
\]
- By the law of large numbers, \(n^{-1} \sum X_i^2 \approx 1\). Therefore, E(S f(S)) ≈ E(f'(S)). By Stein's method, this shows that S is approximately Gaussian.
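The key approximation E(S f(S)) ≈ E(f'(S)) can be checked by simulation. A minimal sketch, assuming fair ±1 coin flips for the X_i's and f = sin as the test function (both my own choices):

```python
import math
import random

rng = random.Random(1)

def sample_S(n):
    """S = n^{-1/2}(X_1 + ... + X_n) with fair ±1 coin flips X_i."""
    return sum(rng.choice((-1.0, 1.0)) for _ in range(n)) / math.sqrt(n)

n, reps = 100, 20_000
samples = [sample_S(n) for _ in range(reps)]
lhs = sum(s * math.sin(s) for s in samples) / reps  # estimates E(S f(S))
rhs = sum(math.cos(s) for s in samples) / reps      # estimates E(f'(S))
print(round(lhs, 3), round(rhs, 3))  # the two estimates nearly agree
```

The discrepancy between the two estimates shrinks as n grows, in line with the Taylor-expansion error in the argument above.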
History of Stein’s method
- Too much to summarize in this talk. See my ICM article for a recap of the main developments.
- Many variants: exchangeable pairs, size-biased couplings, zero-biased couplings, dependency graphs, multivariate and function-space versions, Poisson approximation, etc.
- Key figures in the historical development: Charles Stein, Louis Chen, Andrew Barbour, Erwin Bolthausen, Larry Goldstein, Gesine Reinert, Yosef Rinott, Vladimir Rotar, ...
- Several young probabilists are working on developing the theory of Stein's method these days, including myself. Many more are using Stein's method in their research.
- Significant opportunities for new results and new applications.
In the remainder of this talk, I will present my interpretation of Stein's method.
What causes central limit behavior?
- Suppose that we are investigating whether a random variable W is "approximately Gaussian".
- Traditional wisdom: if W is built out of many small random influences which are approximately independent of each other, then one may expect W to exhibit Gaussian behavior.
- It is clear what this means if W is a sum of independent or approximately independent random variables, but not otherwise.
- Examples where central limit behavior is expected but the above heuristic does not make sense in any obvious way: minimal spanning trees, the traveling salesman problem, optimal matching, ...
Minimal spanning tree
- Suppose that we have an undirected graph, e.g. a discrete grid.
- On each edge of the grid, there is a random weight. The weights are independent.
- A minimal spanning tree (MST) of this graph is a subtree (i.e. a cycle-free subgraph) that minimizes the sum of edge weights among all subtrees.
- It is expected that under fairly general conditions, the total edge weight of a minimal spanning tree should obey a central limit theorem.
- Proved on integer lattices by Alexander (1995, for 2D) and Kesten and Lee (1996, general dimensions).
- It is not clear how the weight of the MST may be seen as a sum total of many small influences that are approximately independent.
Search for a better heuristic
- Often, the random variable W of interest may be written as a function f(X_1, ..., X_n) of a collection of independent random variables X_1, ..., X_n.
- For example, the weight of the MST is a function of the edge weights, which are independent.
- Let ∂_i W be the change in W if X_i is perturbed (typically, replaced by an independent copy).
- The typical size of ∂_i W is often called the "influence" of X_i on W. Influences are in widespread use in theoretical CS, machine learning and other areas.
- Is it true that if the influences are small, then W exhibits central limit behavior?
Search for a better heuristic (contd.)
- Not quite.
- For example, let X_1, ..., X_n be independent random variables with P(X_i = 1) = P(X_i = -1) = 1/2.
- Let
\[
W = \frac{1}{n} \sum_{1 \le i < j \le n} X_i X_j.
\]
- W has a limiting distribution that is not Gaussian.
- However,
\[
\partial_i W = \frac{1}{n} \sum_{j \ne i} X_j,
\]
which is typically quite small.
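The non-Gaussian behavior of this W is easy to see by simulation. A minimal sketch (my own, not from the talk), using the algebraic identity W = ((X_1 + ... + X_n)² − n)/(2n): the samples are bounded below by −1/2 and heavily right-skewed, unlike any Gaussian law.

```python
import random

rng = random.Random(7)

def sample_W(n):
    """W = (1/n) * sum_{i<j} X_i X_j with X_i = ±1; equivalently,
    W = ((X_1 + ... + X_n)^2 - n) / (2n)."""
    s = sum(rng.choice((-1, 1)) for _ in range(n))
    return (s * s - n) / (2 * n)

n, reps = 200, 10_000
ws = [sample_W(n) for _ in range(reps)]
mean = sum(ws) / reps
var = sum((w - mean) ** 2 for w in ws) / reps
skew = sum((w - mean) ** 3 for w in ws) / reps / var ** 1.5
# The sample minimum sits near -1/2 and the skewness is large and positive,
# consistent with the non-Gaussian limit (Z^2 - 1)/2.
print(round(min(ws), 2), round(skew, 1))
```

The identity used in `sample_W` also explains the limit: (∑X_i)/√n is approximately standard Gaussian, so W ≈ (Z² − 1)/2.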
The perturbative heuristic for central limit behavior
- It turns out that one can still formulate a heuristic for central limit behavior using influences.
- In a 2008 paper, I showed, roughly speaking, that if the influences ∂_i W are not only small, but also approximately independent, then W is approximately Gaussian.
- The proof uses Stein's method. I will give a precise statement and proof shortly.
Coming back to MST
- Does this heuristic work for the MST? Yes.
- If the weight of an edge e is perturbed, then the change ∂_e W in the total weight W may be shown to be approximately equal (with high probability) to a function of the weights of the edges close to e.
- Consequently, if e and e' are two edges that are far apart, then ∂_e W and ∂_{e'} W are approximately independent.
- Since most pairs of edges are far apart from each other, the perturbative heuristic implies that W is approximately Gaussian.
- This argument has recently been used to obtain rates of convergence in central limit theorems for minimal spanning trees in C. & Sen (2013).
Open problem: Traveling salesman
- n points are uniformly distributed in the unit square.
- L_n is the length of the minimum-length path that touches every point.
- It is widely believed that L_n should satisfy a central limit theorem.
- According to the perturbative heuristic, one should be able to solve this open question if one shows that the change in L_n that occurs upon perturbing a single point depends (approximately) only on a small number of points in its neighborhood.
- There are many other similar questions about central limit behavior in combinatorial optimization that modern probability theory is unable to tackle. The perturbative heuristic gives a possible unified approach to attack such problems, if someone can figure out how to verify the required condition (approximate independence of influences).
Formal statement of the perturbative heuristic
- Let X = (X_1, ..., X_n) be a vector of independent random variables.
- Let W = f(X) for some function f.
- Let X' = (X'_1, ..., X'_n) be an independent copy of X.
- Let [n] = {1, ..., n}, and for each A ⊆ [n], define the random vector X^A by
\[
X^A_i =
\begin{cases}
X'_i & \text{if } i \in A, \\
X_i & \text{if } i \notin A.
\end{cases}
\]
- Let
\[
\partial_i f := f(X) - f(X^{\{i\}}),
\]
and for each A ⊆ [n] and i ∉ A, let
\[
\partial_i f^A := f(X^A) - f(X^{A \cup \{i\}}).
\]
Formal statement of the perturbative heuristic (contd.)
- For each proper subset A of [n], define
\[
\nu(A) := \frac{1}{n \binom{n-1}{|A|}}.
\]
- Let T_i be the weighted average
\[
T_i := \frac{1}{2} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f\, \partial_i f^A.
\]
- This T_i will be our measure of the influence of X_i on W.
- The heuristic is that if the T_i's are approximately uncorrelated, then W is approximately Gaussian. This will be made more precise on the next slide.
Formal statement of the perturbative heuristic (contd.)
- Let T := ∑_{i=1}^{n} T_i.
- The order of fluctuations of T is a measure of the uncorrelatedness of the T_i's. If the T_i's have small correlation, then T has a small variance.

Theorem (C., 2008)
Suppose that E(W) = 0 and Var(W) = 1. Then
\[
\sup_{x \in \mathbb{R}} \left| P(W \le x) - \int_{-\infty}^{x} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du \right| \le 2 \left( \sqrt{\mathrm{Var}(T)} + \frac{1}{4} \sum_{i=1}^{n} E|\partial_i f|^3 \right)^{1/2}.
\]
The first term measures the uncorrelatedness of the influences and the second term measures the smallness of the influences.
Proof sketch
- First, recall the notation:
  - X = (X_1, ..., X_n) is a vector of independent random variables.
  - W = f(X) is some function of X. Assume that E(W) = 0 and Var(W) = 1.
  - X' = (X'_1, ..., X'_n) is an independent copy of X.
  - For A ⊆ [n], the random vector X^A is defined by
\[
X^A_i =
\begin{cases}
X'_i & \text{if } i \in A, \\
X_i & \text{if } i \notin A.
\end{cases}
\]
  - Discrete derivatives are defined as
\[
\partial_i f := f(X) - f(X^{\{i\}}), \qquad \partial_i f^A := f(X^A) - f(X^{A \cup \{i\}}).
\]
  - Let
\[
T := \frac{1}{2} \sum_{i=1}^{n} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f\, \partial_i f^A,
\qquad \text{where} \qquad
\nu(A) := \frac{1}{n \binom{n-1}{|A|}}.
\]
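For a tiny n, the statistic T can be computed by brute force over all subsets. The sketch below is my own illustration with an assumed toy function f(x) = (x_1 + ... + x_n)/√n and X_i = ±1 (so Var(W) = 1 exactly); it estimates E(T), which should come out equal to Var(W) = 1.

```python
import math
import random
from itertools import combinations

rng = random.Random(3)
n = 6

def f(x):
    """Toy choice: the normalized sum, so W = f(X) has mean 0 and variance 1."""
    return sum(x) / math.sqrt(len(x))

def nu(A, n):
    """nu(A) = 1 / (n * binom(n-1, |A|))."""
    return 1.0 / (n * math.comb(n - 1, len(A)))

def X_A(x, xp, A):
    """Coordinates in A come from the independent copy xp."""
    return [xp[i] if i in A else x[i] for i in range(len(x))]

def T_value(x, xp):
    """T = (1/2) sum_i sum_{A subset [n]\\{i}} nu(A) d_i f d_i f^A, by enumeration."""
    m = len(x)
    total = 0.0
    for i in range(m):
        others = [j for j in range(m) if j != i]
        di_f = f(x) - f(X_A(x, xp, {i}))
        for k in range(m):
            for A in combinations(others, k):
                A = set(A)
                di_fA = f(X_A(x, xp, A)) - f(X_A(x, xp, A | {i}))
                total += nu(A, m) * di_f * di_fA
    return total / 2.0

reps = 2000
est = sum(T_value([rng.choice((-1, 1)) for _ in range(n)],
                  [rng.choice((-1, 1)) for _ in range(n)])
          for _ in range(reps)) / reps
print(round(est, 2))  # E(T) should match Var(W) = 1
```

The enumeration costs 2^{n-1} subsets per coordinate, so this is only a check of the definitions, not a practical way to compute T.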
Proof sketch (contd.)
- For any g, f, A, and i ∉ A,
\[
E(\partial_i g\, \partial_i f^A) = E(g(X)\, \partial_i f^A) - E(g(X^{\{i\}})\, \partial_i f^A) = 2\, E(g(X)\, \partial_i f^A),
\]
by the exchangeability of (X_i, X'_i).
- With g = ϕ ∘ f, we have ∂_i g ≈ ϕ'(W) ∂_i f, and hence
\[
\frac{1}{2} E(\varphi'(W)\, \partial_i f\, \partial_i f^A) \approx \frac{1}{2} E(\partial_i g\, \partial_i f^A) = E(\varphi(W)\, \partial_i f^A).
\]
- Thus,
\[
E(\varphi'(W)\, T) \approx E\left( \varphi(W) \sum_{i=1}^{n} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f^A \right).
\]
Proof sketch (contd.)
- Now note that
\[
\sum_{i=1}^{n} \sum_{A \subseteq [n] \setminus \{i\}} \nu(A)\, \partial_i f^A = f(X) - f(X'),
\]
which is just an algebraic identity.
- Thus,
\[
E(\varphi'(W)\, T) \approx E[\varphi(W)(f(X) - f(X'))] = E(\varphi(W)\, W).
\]
- Exact equality holds for ϕ(u) = u, which gives E(T) = Var(W) = 1.
- Thus, if Var(T) is tiny, then
\[
E(\varphi(W)\, W) \approx E(\varphi'(W)).
\]
- One can now apply Stein's method to conclude that W is approximately Gaussian.
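The algebraic identity in the first bullet can be verified directly for a small n. A minimal sketch (my own, with an arbitrary nonlinear test function f as an assumption):

```python
import math
import random
from itertools import combinations

rng = random.Random(11)
n = 5

def f(x):
    """An arbitrary nonlinear test function; any f works for the identity."""
    return sum(v * v for v in x) + x[0] * x[-1]

def nu(A, n):
    """nu(A) = 1 / (n * binom(n-1, |A|))."""
    return 1.0 / (n * math.comb(n - 1, len(A)))

def X_A(x, xp, A):
    """Coordinates in A come from the independent copy xp."""
    return [xp[i] if i in A else x[i] for i in range(len(x))]

x = [rng.gauss(0, 1) for _ in range(n)]
xp = [rng.gauss(0, 1) for _ in range(n)]

# Left-hand side: sum_i sum_{A subset [n]\{i}} nu(A) * d_i f^A.
lhs = 0.0
for i in range(n):
    others = [j for j in range(n) if j != i]
    for k in range(n):
        for A in combinations(others, k):
            A = set(A)
            di_fA = f(X_A(x, xp, A)) - f(X_A(x, xp, A | {i}))
            lhs += nu(A, n) * di_fA

print(abs(lhs - (f(x) - f(xp))) < 1e-9)  # True: the identity holds exactly
```

Since the identity is purely algebraic, it holds for every realization of X and X', not just in expectation; the randomness here is only a way of picking a generic test point.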
Summary
- Stein's method is a powerful tool for proving central limit theorems in the presence of dependence. It has many variants.
- The perturbative heuristic is a version of Stein's method that is intended to give a unified approach to understanding central limit behavior for complicated random variables.
- The heuristic says, roughly, that if a random variable W is a function of many independent random variables X_1, ..., X_n, and (a) the influence of each X_i on W, measured in a certain way, is small; and (b) the influences are approximately independent, then W may be expected to exhibit central limit behavior.
- The heuristic has been used to prove central limit theorems in statistics, random matrix theory, number theory, and other areas.
- There are lots of open questions that may be approachable by this heuristic.