Correlation and Large-Scale Simultaneous Significance Testing, Bradley Efron, 2007, JASA
Stat 300C: Final Presentation
Leonid Pekelis
June 03, 2011
Main Points
- Correlation between test statistics can have varied effects on multiple hypothesis testing procedures, making it harder to trust FDR procedures which don’t account for correlation.
- Allowing for some assumptions, one can formalize a model which describes how correlations propagate to false discovery estimates.
- There is some evidence that this model is actually how the world works (at least for microarrays).
- It is straightforward to adjust FDR procedures to account for such correlations.
Effect of Correlations
1. Breast Cancer study (BC): compared gene activity between groups of patients observed to have one of two different genetic mutations known to increase breast cancer risk, “BRCA1” or “BRCA2”, Hedenfalk et al. (2001)
   - 7 BRCA1, 8 BRCA2, 15 patients total
   - N = 3225 genes measured
2. HIV study, van’t Wout et al. (2003)
   - 4 HIV positive, 4 HIV negative controls
   - N = 7680 genes per microarray
Ensemble Distribution
$$z_i = \Phi^{-1}(G_0(t_i)) \sim N(0, 1), \qquad i = 1, 2, \ldots, N$$
$$z_i^{\mathrm{BC}} \sim N(-0.09,\ 1.55^2), \qquad z_i^{\mathrm{HIV}} \sim N(-0.11,\ 0.75^2)$$
Outline of the talk
1. Count vector model
1.1 Covariance of count vector under correlation
2. Poisson process model for counts
3. Numerical examples of model’s accuracy
4. Conditional FDR estimates
5. Numerical simulation comparing conditional to traditional FDR
6. Data example, NBA
Counts Model
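K = 82 bins of width $\Delta = 0.1$ from $-4.1$ to $4.1$, $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$
Count vector $y$, $y_k = \#\{z_i \text{ in } k\text{th bin}\}$
$$\pi_k(i) = P(z_i \in \mathcal{Z}_k), \qquad \pi_{k\cdot} = \sum_{i=1}^{N} \pi_k(i)/N \doteq \Delta\,\phi(z[k])$$
$$\gamma_{kl}(i,j) = P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l), \qquad \gamma_{kl\cdot} = \frac{\sum_{i \neq j} \gamma_{kl}(i,j)}{N(N-1)}$$
$$E(y) = N\pi, \qquad \mathrm{Cov}(y) = C_0 + C_1$$
$$C_0 = N\left(\mathrm{diag}(\pi) - \pi\pi'\right)$$
$$C_1 = N(N-1)\,\mathrm{diag}(\pi)\,\delta\,\mathrm{diag}(\pi), \qquad \delta_{kl} = \frac{\gamma_{kl\cdot}}{\pi_{k\cdot}\,\pi_{l\cdot}} - 1$$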
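A minimal sketch of the count vector and the multinomial term $C_0$, using only the Python standard library. It assumes that $z[k]$ is the midpoint of bin $k$ and that the z-values are independent $N(0,1)$ draws; `N = 3000`, the seed, and all function names are hypothetical choices for illustration.

```python
import random
from statistics import NormalDist

K, DELTA, LO = 82, 0.1, -4.1   # bins of width 0.1 covering [-4.1, 4.1]
STD = NormalDist()             # standard normal N(0, 1)

def bin_centers():
    """Midpoints z[k] of the K bins (midpoint convention is an assumption)."""
    return [LO + DELTA * (k + 0.5) for k in range(K)]

def count_vector(z):
    """y_k = number of z-values in the k-th bin; values outside the range are dropped."""
    y = [0] * K
    for zi in z:
        k = int((zi - LO) // DELTA)
        if 0 <= k < K:
            y[k] += 1
    return y

def null_bin_probs():
    """pi_k approximated by DELTA * phi(z[k])."""
    return [DELTA * STD.pdf(c) for c in bin_centers()]

def c0_matrix(n, pi):
    """C0 = N (diag(pi) - pi pi'), the covariance under independence."""
    return [[n * ((pi[k] if k == l else 0.0) - pi[k] * pi[l]) for l in range(K)]
            for k in range(K)]

rng = random.Random(0)         # fixed seed, illustrative only
N = 3000                       # hypothetical number of tests
z = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = count_vector(z)
pi = null_bin_probs()
C0 = c0_matrix(N, pi)
```

The correlation term $C_1$ is not built here, since it needs the pairwise quantities $\gamma_{kl}(i,j)$.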
Counts Model
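Further assume bivariate normality, $\mathrm{Cov}(z_i, z_j) = \rho_{ij}$.
$$\gamma_{kl}(i,j) = \int_{\mathcal{Z}_k}\int_{\mathcal{Z}_l} \psi_2(z_i, z_j, \rho_{ij})\,dz \doteq \frac{\Delta^2}{2\pi\sqrt{1-\rho_{ij}^2}} \exp\left(-\frac{1}{2}\,\frac{z[k]^2 - 2\rho_{ij}\,z[k]z[l] + z[l]^2}{1-\rho_{ij}^2}\right)$$
$$\delta_{kl} + 1 = \frac{\sum_{i \neq j} P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l)}{\sum_i P(z_i \in \mathcal{Z}_k)\,\sum_j P(z_j \in \mathcal{Z}_l)} \doteq \int_{-1}^{1} \frac{1}{\sqrt{1-\rho^2}} \exp\left(-\frac{\rho}{2(1-\rho^2)}\left(\rho z[k]^2 - 2z[k]z[l] + \rho z[l]^2\right)\right) g(\rho)\,d\rho$$
$$= \int_{-1}^{1} R_{kl}(\rho)\,g(\rho)\,d\rho$$
Here $g(\rho)$ is the density of the pairwise correlations $\rho_{ij}$.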
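A small numerical check, as a sketch: $R_{kl}(\rho)$ is the ratio of the bivariate normal density to the product of its standard normal marginals, so the exponent must carry the sign that makes the two functions agree. The function names and the evaluation points are illustrative assumptions.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def bvn_density(x, y, rho):
    """Bivariate normal density with unit variances and correlation rho."""
    s = 1.0 - rho * rho
    quad = (x * x - 2.0 * rho * x * y + y * y) / s
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(s))

def R(x, y, rho):
    """Density ratio psi_2(x, y, rho) / (phi(x) phi(y)) in closed form."""
    s = 1.0 - rho * rho
    expo = -rho / (2.0 * s) * (rho * x * x - 2.0 * x * y + rho * y * y)
    return math.exp(expo) / math.sqrt(s)

# Compare the closed form with the direct ratio at an arbitrary point.
x, y, rho = 1.05, -0.35, 0.3
direct_ratio = bvn_density(x, y, rho) / (phi(x) * phi(y))
closed_form = R(x, y, rho)
```

At $\rho = 0$ the ratio equals 1, which corresponds to $\delta_{kl} = 0$ under independence.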
Counts Model
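Suppose $\rho \sim (0, \alpha^2)$, $\alpha^2 = \int_{-1}^{1} \rho^2 g(\rho)\,d\rho$; then a 2nd order Taylor approximation of $R_{kl}(\rho)$ around $\rho = 0$ gives
$$\delta \doteq \alpha^2 q q', \qquad q_k = (z[k]^2 - 1)/\sqrt{2}.$$
Putting the previous results together (Theorem 1)
$$\mathrm{Cov}(y) \doteq N\left(\mathrm{diag}(\pi) - \pi\pi'\right) + \frac{N(N-1)}{2}\,\alpha^2 w w'$$
$$w_k = \Delta\,w(z[k]), \qquad w(z) = \phi''(z) = \phi(z)(z^2 - 1)$$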
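A hedged numerical illustration of the Taylor step. It takes a hypothetical two-point distribution for $\rho$ (mass 1/2 at $\pm\alpha$, so mean 0 and variance $\alpha^2$), computes $\delta_{kl} = E[R_{kl}(\rho)] - 1$ exactly, and compares it with $\alpha^2 q_k q_l$. The two-point choice and the evaluation points are assumptions made only for this check.

```python
import math

DELTA = 0.1

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def R(x, y, rho):
    """Ratio of the bivariate normal density to the product of its marginals."""
    s = 1.0 - rho * rho
    return math.exp(-rho / (2.0 * s) * (rho * x * x - 2.0 * x * y + rho * y * y)) / math.sqrt(s)

def q(z):
    """q(z) = (z^2 - 1) / sqrt(2)."""
    return (z * z - 1.0) / math.sqrt(2.0)

def w(z):
    """w(z) = phi''(z) = phi(z) (z^2 - 1)."""
    return phi(z) * (z * z - 1.0)

def delta_two_point(x, y, alpha):
    """Exact delta when rho is +alpha or -alpha with probability 1/2 each."""
    return 0.5 * (R(x, y, alpha) + R(x, y, -alpha)) - 1.0

def delta_taylor(x, y, alpha):
    """Second-order approximation alpha^2 q(x) q(y)."""
    return alpha * alpha * q(x) * q(y)

x, y, alpha = 1.5, 2.0, 0.05
exact = delta_two_point(x, y, alpha)
approx = delta_taylor(x, y, alpha)

# Link between the two parametrisations: pi_k q_k = w_k / sqrt(2).
lhs = DELTA * phi(x) * q(x)
rhs = DELTA * w(x) / math.sqrt(2.0)
```

The identity $\pi_k q_k = w_k/\sqrt{2}$ is what turns $N(N-1)\,\mathrm{diag}(\pi)\,\alpha^2 qq'\,\mathrm{diag}(\pi)$ into $\frac{N(N-1)}{2}\alpha^2 ww'$.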
Poisson Model
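Suppose $y \mid u \sim \mathrm{Po}(u)$, $u \sim (v, \Gamma)$; will need $N \sim \mathrm{Po}(N_0)$.
Simplifies $\mathrm{Cov}(y) \doteq N\,\mathrm{diag}(\pi) + \frac{N^2}{2}\,\alpha^2 w w'$. Match with
$$y \sim (v,\ \mathrm{diag}(v) + \Gamma) \ \Rightarrow\ y \sim \mathrm{Po}\left(N\pi + A\,\frac{N}{\sqrt{2}}\,w\right), \qquad A \sim (0, \alpha^2)$$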
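A minimal simulation sketch of the Poisson model $y \sim \mathrm{Po}(N\pi + A\frac{N}{\sqrt{2}}w)$. Several choices here are assumptions not stated in the slides: $A$ is drawn from a normal distribution (the model only fixes its first two moments), negative Poisson means in the tails are clamped to zero, and the Poisson sampler is Knuth's multiplication method. `n = 3000` and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist

K, DELTA, LO = 82, 0.1, -4.1
STD = NormalDist()

def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the modest means used here."""
    if mean <= 0.0:
        return 0
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_counts(rng, n, alpha):
    """Draw A, then y_k ~ Po(n pi_k + A n w_k / sqrt(2)) independently over bins."""
    a = rng.gauss(0.0, alpha)               # normality of A is an assumption
    y = []
    for k in range(K):
        zk = LO + DELTA * (k + 0.5)         # bin midpoint (assumed convention)
        pi_k = DELTA * STD.pdf(zk)
        w_k = DELTA * STD.pdf(zk) * (zk * zk - 1.0)
        mean = n * pi_k + a * n / math.sqrt(2.0) * w_k
        y.append(poisson(rng, max(mean, 0.0)))  # clamp: the mean can dip below 0 in the tails
    return a, y

rng = random.Random(1)
A_draw, y_sim = simulate_counts(rng, n=3000, alpha=0.10)
```

A positive draw of $A$ widens the histogram of counts and a negative draw narrows it, since $w$ is negative for $|z| < 1$ and positive beyond.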
Numerical Examples: one figure each for α = 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50
Numerical Examples
α      0.05    0.10    0.15    0.20    0.25    0.30    0.35    0.40    0.45    0.50
C1     0.9958  0.9925  0.9828  0.9657  0.9291  0.8679  0.8085  0.7758  0.7748  0.8081
Cnorm  0.1007  0.2776  0.4582  0.5962  0.6794  0.7059  0.6996  0.6938  0.7043  0.7390
Cpois  0.1074  0.2790  0.4563  0.5931  0.6765  0.7036  0.6978  0.6923  0.7028  0.7374
Table: Proportion of total variance explained by first eigenvector, as a function of α.
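One way such a proportion could be computed, as a sketch only: take the Poisson-model covariance $N\,\mathrm{diag}(\pi) + \frac{N^2}{2}\alpha^2 ww'$, find its largest eigenvalue by power iteration, and divide by the trace. The transcript does not say how the table rows were produced or which $N$ was used, so `n = 3000` is hypothetical and the output is not expected to reproduce the table.

```python
import math
from statistics import NormalDist

K, DELTA, LO = 82, 0.1, -4.1

def top_eigen_fraction(n, alpha, iters=500):
    """Largest eigenvalue of n*diag(pi) + (n^2 alpha^2 / 2) w w', divided by the trace."""
    std = NormalDist()
    centers = [LO + DELTA * (k + 0.5) for k in range(K)]   # assumed bin midpoints
    pi = [DELTA * std.pdf(c) for c in centers]
    w = [DELTA * std.pdf(c) * (c * c - 1.0) for c in centers]
    c1 = n * n * alpha * alpha / 2.0

    def matvec(v):
        # Diagonal part plus rank-one part, without forming the K x K matrix.
        dot = sum(wk * vk for wk, vk in zip(w, v))
        return [n * pi[k] * v[k] + c1 * w[k] * dot for k in range(K)]

    v = w[:]                                   # start along w
    for _ in range(iters):
        mv = matvec(v)
        norm = math.sqrt(sum(x * x for x in mv))
        v = [x / norm for x in mv]
    lam = sum(a * b for a, b in zip(v, matvec(v)))          # Rayleigh quotient
    trace = n * sum(pi) + c1 * sum(wk * wk for wk in w)
    return lam / trace

frac_small = top_eigen_fraction(3000, 0.05)
frac_large = top_eigen_fraction(3000, 0.50)
```

Because the correlation term is rank one, a large $\alpha$ pushes most of the variance onto the single direction $w$.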
Conditional FDR
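Given $A$, can approximate
$$u_k = N\pi_k + A\,\frac{N}{\sqrt{2}}\,w_k \doteq N\Delta f_A(z[k]), \qquad f_A(z) = \phi(z)\left(1 + A\,q(z)\right).$$
Matching moments, can approximate $u_k \doteq N\,\frac{1}{\sigma_A}\psi(x/\sigma_A)$, with $\sigma_A^2 = 1 + \sqrt{2}A$.
I took the 2nd term in the Edgeworth expansion,
$$f_A(x) \doteq \frac{1}{\sigma_A}\psi(x/\sigma_A)\left(1 + \frac{\mu_4 - 3\sigma^4}{24\sigma^4}H_4(x)\right).$$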
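A quick numerical check of the moment matching, written as a sketch: integrating $z^p f_A(z)$ with a midpoint rule should give total mass 1, second moment $1 + \sqrt{2}A$, and fourth moment $3 + 6\sqrt{2}A$ (the last follows from $E z^4 = 3$ and $E z^6 = 15$ under $N(0,1)$). The integration limits, step count, and `A = 0.1` are illustrative assumptions.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def q(z):
    """q(z) = (z^2 - 1) / sqrt(2)."""
    return (z * z - 1.0) / math.sqrt(2.0)

def f_A(z, a):
    """Conditional null density f_A(z) = phi(z) (1 + A q(z))."""
    return phi(z) * (1.0 + a * q(z))

def moment(a, power, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule approximation of the integral of z^power * f_A(z)."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        z = lo + (j + 0.5) * h
        total += (z ** power) * f_A(z, a)
    return total * h

A = 0.1
mass = moment(A, 0)
var_A = moment(A, 2)
mu4_A = moment(A, 4)
```

For negative $A$ the density $f_A$ dips below zero in the far tails, which is one reason to work with the matched normal instead.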
Conditional FDR
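Use GLM to fit distribution of $y_k \sim \mathrm{Po}\left(e^{\beta_0 + \beta_1 z[k] + \beta_2 z[k]^2}\right)$ for $k \in K_0$.
Using the normal approximation for $u_k$, with $p_0$ the proportion of nulls, gives $E(y_k) = p_0 u_k$, hence
$$\hat\sigma_A = (-2\hat\beta_2)^{-1/2}$$
Estimate $p_0$ by $\hat p_0 = \hat P_0 / P_0(\hat\sigma_A)$, $P_0(\sigma) = 2\Phi(x_0; \sigma) - 1$, $\hat P_0 = Y_0/N$
$$\mathrm{Fdr}(x \mid \hat\sigma_A) = N\hat p_0\,\bar\Phi(x; \hat\sigma_A)/T(x)$$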
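The estimates above can be sketched in a few lines. Two readings are assumptions here, since the slides do not define them: $Y_0$ is taken to be the number of z-values with $|z_i| \le x_0$, and $T(x)$ the number of z-values with $z_i \ge x$. The GLM fit itself is not shown; `sigma_from_beta2` only converts a fitted quadratic coefficient. The simulated data and all names are hypothetical.

```python
import random
from statistics import NormalDist

def sigma_from_beta2(beta2):
    """sigma_hat = (-2 beta2)^(-1/2); needs a negative quadratic coefficient."""
    if beta2 >= 0.0:
        raise ValueError("beta2 must be negative for a normal-shaped fit")
    return (-2.0 * beta2) ** -0.5

def estimate_p0(z, sigma, x0):
    """p0_hat = P0_hat / P0(sigma), with P0(sigma) = 2 Phi(x0; sigma) - 1."""
    y0 = sum(1 for zi in z if abs(zi) <= x0)     # assumed meaning of Y0
    p0_hat_raw = y0 / len(z)
    p0_model = 2.0 * NormalDist(0.0, sigma).cdf(x0) - 1.0
    return p0_hat_raw / p0_model

def conditional_fdr(z, x, sigma, p0):
    """Fdr(x | sigma) = N p0 Phi_bar(x; sigma) / T(x), right tail."""
    t = sum(1 for zi in z if zi >= x)            # assumed meaning of T(x)
    tail = 1.0 - NormalDist(0.0, sigma).cdf(x)
    return len(z) * p0 * tail / max(t, 1)        # max(., 1) avoids division by zero

# Illustration on simulated all-null data with a widened null N(0, 2^2).
rng = random.Random(3)
z = [rng.gauss(0.0, 2.0) for _ in range(5000)]
sigma_hat = sigma_from_beta2(-0.125)             # corresponds to sigma = 2
p0_hat = estimate_p0(z, sigma_hat, x0=1.0)
fdr_at_2 = conditional_fdr(z, 2.0, sigma_hat, p0_hat)
```

With every case null, both `p0_hat` and the Fdr estimate should land near 1, so nothing would be declared interesting.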
Simulation
Data Example: NBA
1. What professional basketball players can really be called exceptional?
2. Data from http://www.databasebasketball.com/
3. 1946-2009, stats on every player, each year, ≈ 22,000 entries
4. Will focus on ppm = (points scored in season) / (minutes played in season)
5. Idea: get z-value for each player, apply BH procedure to determine non-null players
6. Can hypothesise there is some correlation between players’ ppm scores.
7. Cleaned data (year > 1950, minutes ≥ 10)
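The BH step in the plan above can be sketched as follows. This is a generic Benjamini-Hochberg step-up implementation, not the presenter's code; the two-sided p-value helper and its `null_sd` parameter (1 for the theoretical null, larger for a widened null) are assumptions added for illustration.

```python
from statistics import NormalDist

def two_sided_p(z, null_sd=1.0):
    """Two-sided p-value of z under a N(0, null_sd^2) null."""
    return 2.0 * (1.0 - NormalDist(0.0, null_sd).cdf(abs(z)))

def benjamini_hochberg(pvalues, q):
    """Indices rejected by the BH(q) step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k_max = rank                 # keep the largest rank passing its threshold
    return sorted(order[:k_max])

# Small worked example with made-up p-values.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, q=0.05)

# The same z-value is far less significant under a wider null.
p_theoretical = two_sided_p(3.0, null_sd=1.0)
p_widened = two_sided_p(3.0, null_sd=2.0)
```

In the example only the first two p-values fall under their thresholds (0.005 and 0.010), so `rejected` is `[0, 1]`.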
Data Example: NBA
- Detrend: year effect, shot clock (1954), 3 pointer (1979), center
- Aggregate years by players, keep only careers ≥ 5 years
- Gives N = 1535 players
- Calculate $t_k = \frac{\sum_{i=1}^{c_k} \mathrm{ppm}_i / c_k}{SE}$, $c_k$ = career length
- Convert to z values, $z_k = \Phi^{-1}(T_{c_k - 1}(t_k))$
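The last conversion step, $z_k = \Phi^{-1}(T_{c_k-1}(t_k))$, can be sketched with the standard library alone. The Student-t CDF is obtained here by Simpson's rule on the t density, which is a stand-in chosen to avoid external dependencies; function names and the step count are assumptions.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Student-t density with df degrees of freedom (math.gamma limits df to a few hundred)."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2.0)

def t_cdf(t, df, steps=2000):
    """CDF via Simpson's rule on [0, |t|] plus symmetry; steps must be even."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for j in range(1, steps):
        total += (4.0 if j % 2 else 2.0) * t_pdf(j * h, df)
    area = total * h / 3.0
    return 0.5 + area if t > 0 else 0.5 - area

def t_to_z(t, df):
    """z = Phi^{-1}(T_df(t)); probabilities are clipped away from 0 and 1."""
    p = min(max(t_cdf(t, df), 1e-15), 1.0 - 1e-15)
    return NormalDist().inv_cdf(p)

# A 5-year career gives df = c_k - 1 = 4.
z_example = t_to_z(2.5, df=4)
z_cauchy = t_to_z(1.0, df=1)    # T_1(1) = 0.75 exactly
```

For very large positive `t` the upper tail is lost to rounding in `0.5 + area`; a production version would work with the survival function instead.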
Data Example: NBA
Max = 6.74 (Kareem Abdul-Jabbar, ’69-’89), Min = -6.43 (E.C. Coleman, ’94-’00)
Wilt Chamberlain (’59-’72) = 3.31, Michael Jordan (’84-’02) = 6.49
Data Example: NBA
- Naive BH(.10) procedure gives 891 rejections
- Est. correlation from central spread Poisson glm, $z_{\mathrm{null}} \sim N(0, 2^2)$
- Trying BH(.10) with correlated null gives 1 rejection
- Third approach: estimate $\hat p_0 = \hat P_0 / P_0(1.92) \approx 0.588$, $\hat P_0 = Y_0(1)/N$
- Conditional Fdr estimates $\mathrm{Fdr}(\mathrm{naive} \mid 2) = 0.347$, $\mathrm{Fdr}(\mathrm{cor} \mid 2) = 0.673$
- Both > .10!
- $x^* = \arg\max\{x : \mathrm{Fdr}(x \mid 2) \le 0.10\}$, gives 36 rejections
- Actually used $x^* = \arg\min \mathrm{Fdr}(x \mid 2)$, since $\min \mathrm{Fdr}(x \mid 2) = 0.121 > 0.10$.
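The threshold rule in the last two bullets can be written as a small helper. This sketch follows the rule as stated on the slide (arg max over thresholds meeting the target, falling back to the Fdr-minimizing threshold when none does); the grid and Fdr values in the example are made up, apart from the minimum 0.121 quoted above.

```python
def choose_threshold(grid, fdr_values, q):
    """Return (x_star, attained): arg max {x : Fdr(x) <= q}, else arg min Fdr(x)."""
    ok = [x for x, f in zip(grid, fdr_values) if f <= q]
    if ok:
        return max(ok), True
    best = min(range(len(grid)), key=lambda i: fdr_values[i])
    return grid[best], False           # target q not attainable on this grid

# Hypothetical grid where the smallest Fdr is 0.121, so q = 0.10 is never met.
grid = [2.0, 3.0, 4.0]
fdr_values = [0.30, 0.121, 0.20]
x_star, attained = choose_threshold(grid, fdr_values, q=0.10)
```

Returning the `attained` flag makes it explicit when the reported cutoff does not actually control Fdr at the requested level.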
Data Example: NBA
Theoretical Null Dist $N(0, 1)$, Correlated Null Dist $N(0, 2^2)$
Data Example: NBA (Best Players)
 [1] "Kareem , Abdul-jabbar" "Tim , Duncan" "Shaquille , O'neal"
 [4] "Michael , Jordan" "Karl , Malone" "Julius , Erving"
 [7] "Walter , Davis" "Glenn , Robinson" "Jerry , West"
[10] "Dominique , Wilkins" "Tim , Thomas" "Calvin , Murphy"
[13] "Bob , Pettit" "Eddie , Johnson" "Sam , Cassell"
[16] "James , Worthy" "George , Gervin" "John , Drew"
[19] "Allen , Iverson" "Dan , Issel"
Data Example: NBA (Worst Players)
 [1] "Charles , Jones" "Tree , Rollins" "Ben , Wallace"
 [4] "Nate , Mcmillan" "Greg , Kite" "Manute , Bol"
 [7] "Harvey , Catchings" "Paul , Mokeski" "Don , Buse"
[10] "Adonal , Foyle" "Kurt , Rambis" "Bo , Outlaw"
[13] "Matt , Guokas" "Bruce , Bowen" "George , Johnson"
[16] "Chris , Dudley"