Multiplicity and Test field significance · Multiplicity and Test field significance References:...

14
Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a field with multiple locations, all the local null hypotheses much be evaluated simultaneously. The chances for multiple local null hypotheses to be rejected by accidence increases. Consequently, one has to raise the bar for field significance than for local significance. For example, for a spatial field with N=20 independent spatial samples (locations), α=0.05 is significant level of each local test. Without consider the multiplicity, we only need to reject 0.05*20=1 local hypothesis to achieve field significance of 0.05.

Transcript of Multiplicity and Test field significance · Multiplicity and Test field significance References:...

Page 1: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Multiplicity and Test field significance

References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006

The main issue: For a field with multiple locations, all the local null

hypotheses much be evaluated simultaneously. The chances for multiple local null hypotheses to be rejected by accidence increases. Consequently, one has to raise the bar for field significance than for local significance.

For example, for a spatial field with N=20 independent spatial samples (locations), α=0.05 is significant level of each local test. Without consider the multiplicity, we only need to reject 0.05*20=1 local hypothesis to achieve field significance of 0.05.

Page 2: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

•  However, Livezey and Chen (1983) shows that when test N local null hypotheses simultaneously, the number of local hypothesis need to be rejected follows binomial distribution, namely,

•  Thus, to reject the global null hypothesis at 95% significance, we need to have at least 4 local null hypotheses to be rejected simultaneously to satisfy the global field significance at 0.05 level.

Pr{X = x} = N!x!(N − x)!

px (1− p)(N−x ) < 0.05

where x is the number of local null hypotheses need to rejectedin order to have global significance at confidence of 1-0.05=0.95

Pr{X ≥ 3} =1−Pr{X < 3} =1− Pr{X = 0}+Pr{X =1}+Pr{X = 2}( ) =1- 0.358+ 0.377+ 0.189( ) = 0.076

Pr{X = x} = N!x!(N − x)!

α x (1− 0.05)(N−x )

Page 3: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

• Based on binomial distribution, the required fraction of the local null hypothesis been rejected increases with decrease of degree of freedom (DOF).

•  For example, for DOF=40, 12.5% of the local null hypothesis have to be rejected in order to reject the global null hypothesis at 5% significance. Whereas for DOF=52, 11.4% of the local null hypotheses have to be rejected to reject the global null hypothesis at 5%.

Livezey and chen 1983

Page 4: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Livezey and Chen: count norm global significance test: This method involves two steps:

•  Step-1: Test significance of each local null hypothesis at each spatial location (e.g., whether the means of the two local time series are significantly different, or whether the two time series are significantly correlated). The local hypothesis test does NOT have to as vigorous as the global test (no need to use Monte to determine the significance). However, we need to i. Determine the degree of freedom for the temporal correlation between the two time series: •  Two factors need to be considered: Auto-correlation of the time series;

the influence of small samples at large lag iΔt, use biased estimate for the autocorrelation Cuu and Cpp as the following (Chen 1981).

τ = 1+ 2 Cuu(iΔt)Cpp(iΔt)

i=1

N

∑#

$%

&

'(Δt effective time of autocorrelation

where Cuu(iΔt) : autocorrelation of time series 1, Cpp(iΔt) : autocorrelation of time series 2iΔt : lag of the autocorrelation, Δt : time step, N: totla number of samplesn=NΔt/τ Effect number of indepedent samples

Page 5: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

ii. Effective number of independent sample (freedom): n = NΔt/τ Iii. Check critical value for significant level at 95% for effect number of independent sample, n, in the t-test table to determine the significance of the temporal correlation at each grid.

Step-2. Determine the percentage of the local null hypotheses need to be rejected for rejecting the global null hypothesis: •  Calculate the % of the total local null hypotheses, or % of the total area,

that can be rejected, i.e., correlation ≥ critical value for 95% significance). •  Determine whether the % of the local null hypothesis exceed the

threshold for global significance. •  i. If there is an easy way to determine DOF of the spatial field, one can

compare and see if the % of the local null hypotheses is greater than the threshold shown in the figure in slide 3 (Fig. 3 of Livezey&Chen 1983).

•  Ii. Often, it is not so easy to determine DOF, use Monte Carlo test to determine the probability distribution of the local null hypothesis being rejected when one field is correlated with random time series. In this case, no need to estimate autocorrelation because the autocorrelation of a random time series is zero.

Page 6: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Example: •  Test significance of the correlation pattern between northern hemispheric

700 hPa geopotential height anomalies Z’700 and Southern Oscillation index (SOI) based on 29 winters.

•  Left: the correlation pattern based on real data. •  Right: the correlation pattern between Z’700 and random time series. This

pattern suggests that there are significant spatial autocorrelation in Z’700.

Source: Livezey&Chen 1983

Page 7: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Step-1: Compute fraction of the local correlation between Z’700 and SOI exceed 95% significance (or probability of the local null hypotheses<0.05).

At each location, perform the following analysis: a.  Compute auto-correlation for both Z’700 and SOI using formula

b.  Determine autocorrelation length for correlation between between Z’700 and random time series using

c.  Determine temporal DOF (effect sample number): n=NΔt/τ Where N=29

d.  Determine the significance of the local hypothesis for DOF=n Repeat a-d steps for all other locations. 11.4% of the total area pass local tests for lag-0 between Z’700 and random time series, 17.5% of lag-1 and 22% for lag-2 in DJF.

rk =[(xi − x− )(xi+k − x+ )]

i=1

N−k

[ (xi − x− )2 ]1/2 (xi+k − x+ )2

i=k+1

N

∑#

$%

&

'(

1/2

i=1

N−k

where k is the phase lag. N: number of temporal samplesrk : auto-correlation with lag-k.

τ = 1+ 2 Cuu(iΔt)Cpp(iΔt)i=1

N

∑#

$%

&

'(Δt

Page 8: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Step-2: Monte Carlo test to determine the probability distribution of the % areas that the correlation between Z’700 and random time series pass the local significant test. a.  At each location, compute correlation between Z’700 and a random

time series of 29 samples. Because the random time series has 29 DOF, the threshold for significant local correlation at 95% is 0.367.

b.  Repeat step a) for all locations. Then compute the % areas where the correlation between between Z’700 and random time series ≥0.367.

c.  Plot probability distribution as function of % area that passes local test.

For DJF: there is >5% probability for 11.4% total area to pass local significant test at 95% confidence by chance. Thus, global null hypothesis cannot be rejected.

Page 9: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

•  For DJF: •  For lag-0: the probability for 11.4% total area to pass local significant test at 95%

confidence by chance is greater than 5%. Thus, global null hypothesis cannot be rejected at 95%.

•  For lag-1 and lag-2 seasons: the probability for 17.5% and 21% of total areas to pass local significant test at 95% confidence by chance is less that 5%. Thus, global null hypothesis can be rejected at 95% confidence.

•  Thus, the spatial correlation patterns between SOI and lagged Z’700 by 1 and 2 seasons are significant.

•  For JJA: •  For lag-0, 1, 2, <4.5% area pass the local significant test. The probability for these %

areas to pass local significant test by chance is greater than 25%. Global null hypothesis cannot be rejected.

•  Thus, the spatial correlation pattern between SOI and lagged Z’700 is not significant.

Page 10: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Determine field significance use False Discovery Rate (FDR). •  Wilks (2006) indicates that the Livezey & Chen’s count norm

method sets too high bar for reject global null hypothesis for these reasons: 1.  Does not give more weight to strongly rejected local null hypothesis 2.  Binominal distribution use integer number, thus upper bounce of

probability for % area needed to reject global null hypothesis. 3.  Very sensitive to spatial autocorrelation

•  FDR method focuses on the local null hypotheses with the lowest probability, because this local null hypothesis is mostly likely to be rejected. If these local null hypotheses of false discovery can be rejected, then the global null hypothesis can be rejected. However, to guard against false rejection, we must hold high standard by using threshold for even more rare probability for these low probability local hypothesis.

Page 11: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

How do we use FDR?

•  Step-1: Rank all the N independent spatial samples from that of smallest probability (p(1)) to largest probability (p(N)). There are two ways of determining probability of each local hypothesis: i. Use t-distribution table and effect DOF, n, as shown in slide 7 (Step-1 in Livezey&Chen 1983). ii. Compare probability distribution generated by Monte Carlo test.

•  Step-2: Starting from the p(N) to p(1), compare each p(j) (j=n, n-1, ….1) to the FDR threshold, pFDR,j=αglobal*j/N. •  If for example, p(N)≤αglobalN/N=αglobal, this and all the local hypotheses

with less probability, i.e., from p(N) to p(1), can be rejected or regarded as globally significant.

•  If p(N)>αglobal, we check whether p(N-1) ≤αglobal(N-1)/N. If not, we continue go down to check p(n-2), until we find p(j) ≤αglobalJ/N. Then, all the local null hypotheses with smaller p(j-1), … p(1) can be rejected. Even if only one local null hypothesis with the smallest probability satisfies p(1) ≤αglobal/N, the global null hypothesis can be rejected, because it exceeds a very tough threshold for very rare probability, αglobal/N.

Page 12: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Example: •  A spatial field has 20 independent

local null hypothesis to be tested. The probability values of the 20 local hypotheses are ranked from the lowest to the highest and plotted as the dots in the figure. Only three local null hypotheses has probability < 5%.

•  Based on Livezey and Chen binominal distribution, (Pr{X<3)}=0.076. Not enough to reject the global null hypothesis.

Genera

ted b

y C

am

Scanner fro

m in

tsig

.com

0.05

Page 13: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Using FDR method as described below: •  Step 1: Determine the probability of each

local null hypothesis based on the probability distribution generated by the Monte Carlo test as described in Step-2 of Livezey&Chen. Rank from the local null hypotheses from the lowest to the left to the highest to the right as shown in the figure.

•  Step 2: Compare each local null hypothesis from the largest probability from right p(N) to those with the smallest probability p(1) to FDR threshold (PFDR=j/Nαglobal) as shown by the sloped dash line in the figure.

•  Step-3: The comparison shows that the 2 local null hypotheses of the smallest probability are below the PFDR thresholds. Thus, the global null hypothesis can be rejected at 95% confidence.

Genera

ted b

y C

am

Scanner fro

m in

tsig

.com

0.05

Page 14: Multiplicity and Test field significance · Multiplicity and Test field significance References: Wilks Text, Section 5.4, Lievezey and Chen 1983, Wilks 2006 The main issue: For a

Influence of spatial autocorrelation on field significance:

•  Probability of falsely reject true global null hypothesis as a function of spatial correlation of the data.

•  The probability of rejecting true global hypothesis increase rapidly for the Livezey and Chen count method with increase of spatial correlation of the data, whereas for FDR method, it does not change much with the spatial correlation.

assumed independent local tests. The actual test levels(vertical intercepts in Fig. 3) are only slightly smallerthan the nominal !global " 0.05. The results are similaroverall to the Walker test computed with the permuta-tion procedure, with a slight loss of power for theWalker test relative to its resampling counterpart and,as before, somewhat better power for the FDR test ascompared with the Walker test.

Figure 4 compares the actual test levels for the count-ing test [Eqs. (1) and (2)], Walker’s test [Eq. (4b)], andthe FDR test [Eq. (7b)] as functions of the data corre-lation parameter # for K " 100 (Fig. 4a) and K " 1000(Fig. 4b). That is, these figures show probabilities (es-timated using 105 replications each) of rejecting trueglobal null hypotheses as functions of the strength ofthe data correlation, under the (erroneous, except at# " 0) assumption that the underlying K local test re-

sults are independent. As before, the counting test isusually conservative for zero or small levels of datacorrelation but, as expected, becomes very permissivefor strongly correlated tests. This latter attribute is thereason why resampling procedures, rather than Eq. (2),are generally used to define critical values for this test,when Eq. (2) indicates the possibility of a significantresult.

By contrast, the curves in Fig. 4 for the Walker andFDR tests are nearly flat until large values of #, eventhough Eqs. (4b) and (7b) have assumed independenceamong the K tests, reflecting their robustness to corre-lation among the local tests. Katz and Brown (1991)and Ventura et al. (2004) also found the Walker andFDR procedures, respectively, to be robust to correla-tion among the local tests. Remarkable is that the effectof correlation of the local tests on the Walker and FDRtests is opposite to that for the counting test: these testsbecome somewhat conservative, and for large values of# they typically perform similarly to the counting testoperating on independent local tests. This conservativebehavior can be understood qualitatively by adoptingthe view that K correlated local tests behave collec-tively as if there were some smaller number of “effec-tively independent” tests. The effect of reducing thenumber of tests in Eq. (3) is to shift probability mass forp(1) away from the origin: an extremely small value isless likely within a sample of reduced size from theuniform distribution. Therefore, the integral in Eq. (4a)will be smaller than the nominal level !global.

In comparing Figs. 4a and 4b, it is evident that therobustness of the Walker and FDR tests increases asthe number of local tests increases, so that the perfor-mance of these tests even for correlated data is verygood for K " 1000 (Fig. 4b), which is representative ofthe order of magnitude of many atmospheric multipletesting settings. By contrast, Fig. 4b also shows that theperformance of the counting procedure under the as-sumption of independence local tests is extremely pooras the number of local tests increases. For example, forlarge values of the correlation parameter, counting testsat a nominal level of !global " 0.05 actually reject trueglobal null hypotheses more than 25% of the time, andeven tests at the nominal !global " 0.01 level reject trueglobal null hypotheses approximately 20% of the time.

One practical result of the robustness of the Walkerand FDR tests to spatial correlation is that they can becomputed under the assumption of independence, withthe assurances that the results are approximately cor-rect and that the actual test level is smaller than !global.That is, unlike the counting test, observing a significantresult from a Walker or FDR global test under theassumption of local test independence implies that an

FIG. 4. Actual test levels (probabilities of rejecting true globalnull hypotheses) as functions of the data correlation parameter #,when global tests are calculated under the assumption of inde-pendence of K " (a) 100 and (b) 1000 local tests, for nominalglobal test levels of 0.05 and 0.01.

SEPTEMBER 2006 W I L K S 1187