
TWO GRAPH-BASED TESTS

FOR HIGH-DIMENSIONAL INFERENCE

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF STATISTICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Hao Chen

June 2013

http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/vm961zz5360

© 2013 by Hao Chen. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.





I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Siegmund, Co-Adviser


Nancy Zhang, Co-Adviser


Jerry Friedman

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

With modern science there is a growing emphasis on multivariate, complex data

types. Some of these data are high dimensional. Others, such as survey preference,

network, and tree data, cannot be characterized easily with standard models on Eu-

clidean spaces. This dissertation details the investigation in this new setting of two

classic statistical problems: change-point detection and two-sample comparison of

categorical data.

Change-point models are widely used in various fields for detecting lack of homo-

geneity in a sequence of observations. In many applications, the dimension of the

observations in the sequence can be very high, even much larger than the length of

the sequence. Testing the homogeneity of such sequences is a challenging but important

problem. Existing approaches are limited in many ways. We propose a new

non-parametric approach that can be applied to data in high dimension, and even to

non-Euclidean object data, as long as an informative similarity measure on the sample

space can be defined. The approach adapts graph-based two-sample tests to the

scan-statistic setting. Graph-based two-sample tests are tests based on graphs connecting

observations by similarity [Friedman and Rafsky, 1979, Rosenbaum, 2005]. We

show that this new approach is powerful in high dimensions compared to parametric

approaches. We also derive accurate analytic p-value approximations for very general

situations, which lead to easy off-the-shelf homogeneity testing for large multivariate

data sets. This approach has been applied to two data sets: the determination of

authorship of a classic novel, and the detection of change in a social network over

time.

Two-sample comparison of categorical data is a classic problem in statistics. In


many modern applications, the number of categories can be quite large, even com-

parable to the sample size, causing existing methods to have low power. When the

number of categories is large, there is often underlying structure on the sample space

that can be exploited. We propose a general non-parametric approach that makes

use of similarity information on the space of categories in two-sample tests. Our ap-

proach addresses a shortcoming of existing graph-based two-sample tests by no longer

requiring uniqueness of the underlying graph, thus allowing ties in the distance ma-

trix defining the graph. We found two types of statistics that are both powerful and

fast to compute. We show that their permutation null distributions are asymptoti-

cally normal and that their p-value approximations under typical settings are quite

accurate, facilitating the application of this approach.


Acknowledgements

I would like to thank my advisor, Professor Nancy Zhang, for her guidance and friend-

ship throughout my Ph.D. studies, and my co-advisor, Professor David Siegmund, for

his guidance and encouragement. Both of them broadened my perspectives and inspired

me in many ways. From critical thinking to effective communication, I learned

more from them than I can summarize. Beyond academic

assistance, they cared for me when I encountered difficulties in life. I thank them for

being such great advisors and friends.

I would like to thank Professors Jerome Friedman, Wing Wong, and Hua Tang

for serving on my committees and for providing valuable feedback. I am especially

grateful to Professor Jerome Friedman for reading this thesis, discussing related

problems with me, and providing helpful suggestions and insights. In addition, I

would like to thank Professors Susan Holmes, Emmanuel Candes, Art Owen, Richard

Olshen, Bradley Efron, Persi Diaconis, and Tze Lai for their support in my doctoral

years. Moreover, I appreciate the encouragement and help from Professor Mark Low

at the University of Pennsylvania.

My Ph.D. life would not be so colorful and memorable without the companionship

of friends from Sequoia Hall. In particular, I thank Zhen Zhu, Pei He, Luo Lu,

Gourab Mukherjee, Noah Simon, Zongming Ma, Yi Liu, Yao Xie, Li Ma, Linxi Liu,

and Jian Li for all of the wonderful times we’ve had together.

Last but not least, I want to thank my parents for their everlasting love and care.

I dedicate this thesis to them as a simple expression of gratitude for their love and

support.


Contents

Abstract
Acknowledgements

1 Introduction
1.1 A Review of Graph-Based Two-Sample Tests
1.2 Overview
1.2.1 Change-Point Detection
1.2.2 Two-Sample Comparison of Categorical Data
1.3 Remarks

I Graph-Based Change-Point Detection

2 Change-Point Problems
2.1 Background and Challenges
2.2 Review of Change-Point Problems with Multivariate Observations

3 A Graph-Based Framework
3.1 Problem Formulation
3.2 Test Statistics
3.2.1 Single Change-Point Alternative
3.2.2 Changed Interval Alternative

4 Analytic Approximations to Significance Levels
4.1 Quantity of the Interest
4.2 Properties of the Processes
4.2.1 Limiting Distributions
4.2.2 Covariance Function
4.3 Asymptotic Approximations
4.4 Skewness Correction
4.4.1 Derivation of (4.18) and (4.20)
4.4.2 Explicit Expressions for Skewness
4.5 Numerical Studies
4.5.1 Single Change-Point Alternative
4.5.2 Changed Interval Alternative

5 Assessment of the Method
5.1 Numeric Power Studies
5.2 Results on Real Data Examples
5.2.1 Friendship Network
5.2.2 Authorship Debate
5.3 Discussion

II Graph-Based Tests for Two-Sample Comparisons of Categorical Data

6 Introduction
6.1 Background and Challenges
6.2 Implicit Information and Their Role in Improving the Tests
6.3 Notation

7 Graph-Based Test Statistics
7.1 The Test Statistics Based on MST
7.1.1 $R_{aMST}$
7.1.2 $R_{uMST}$
7.2 The Test Statistic Based on MDP
7.2.1 $R_{aMDP}$
7.3 A Numerical Study
7.4 Computational Issues of $R_{aMST}$ and $R_{uMST}$
7.5 A Fast Method Generalized from $R_{aMST}$

8 Examples
8.1 Preference Ranking
8.2 Haplotype Association
8.3 Binary Clinical Features

9 Permutation Distributions
9.1 $R_{C_0}$
9.2 $T_{C_0}$
9.3 Checking the p-Values Under Normal Approximations

10 Discussion

A Existing Theorems Used in Proofs
A.1 Stein’s Method

B Supporting Materials for Part I
B.1 Proofs for the Limiting Distributions
B.2 Effect of Skewness

C Supporting Materials for Part II
C.1 Computation Issues for $R_{aMST}$ and $R_{uMST}$
C.1.1 Theoretical Justifications
C.2 Proofs for Permutation Distributions
C.2.1 Proof of Lemma 9.1.1
C.2.2 Proof of Theorem 9.1.3
C.2.3 Proof of Lemma 9.2.1


List of Tables

4.1 Critical values for the single change-point scan statistic based on MST at 0.05 significance level. n = 1000.

4.2 Critical values for the single change-point scan statistic based on MST

4.3 Critical values for the single change-point scan statistic based on MDP. n = 1000.

4.4 Critical values for the single change-point scan statistic based on NNG

4.5 Critical values for the single change-point scan statistic based on NNG

4.6 Critical values for the changed interval scan statistic based on MST at 0.05 significance level. n = 1000.

4.7 Critical values for the changed interval scan statistic based on MST at

4.8 Critical values for the changed interval scan statistic based on MDP. n = 1000.

4.9 Critical values for the changed interval scan statistic based on NNG at

4.10 Critical values for the changed interval scan statistic based on NNG at

5.1 Number of simulated sequences (out of 100) with significance less than 5%.

5.2 p-values for the tests. In each cell, the first value is calculated from 10,000 permutations and the second value is calculated from the analytic approximation after skewness correction.

5.3 p-values for the tests only using data from the first 350 chapters. Numbers in each cell have the same meaning as in Table 5.2.

6.1 Basic Notations.

7.1 The power of six tests – four graph-based tests based on $R_{aMST}$, $R_{uMST}$, $R_{aMDP}$, $R_{uNNG}$, the deviance test (LR) and Pearson’s Chi-square test – under two significance levels ($\alpha = 0.01, 0.05$) and different simulation settings.

7.2 The number of categories, $K$, and the MSTs on categories, $M$, as haplotype length increases for the haplotype association problem in Section 8.2. All categories are assumed to be non-empty.

7.3 Computational time for $R_{aMST}$ and $R_{uMST}$. $M$ is the number of MSTs on categories.

8.1 The power of five tests – three graph-based tests based on $R_{uMST}$, $R_{C-uMST}$, $R_{C-uNNG}$ and two Chi-square tests – under two significance levels ($\alpha = 0.01, 0.05$) and different simulation settings.

8.2 p-values for the KCS data set.


List of Figures

3.1 The MST, MDP and NNG graphs on an example two-dimensional data set. 20 points were drawn from $N(0, I_2)$ (shown in triangles) and 20 points were drawn from $N((2, 2)', I_2)$ (shown in circles).

3.2 The computation of $R_G(t)$ for nine different values of $t$. The data is a sequence of length $n = 40$, with the first 20 points drawn from $N(0, I_2)$ and the second 20 points drawn from $N((2, 2)', I_2)$. The similarity graph $G$ shown in the plots is the MST on Euclidean distance. Each $t$ divides the observations into two groups, one group for observations before and at $t$ (shown as triangles) and the other group for observations after $t$ (shown as circles). Edges that connect observations from the two different groups (i.e. edges connecting a triangle and a circle) are bold in the graph. Notice that $G$ does not change as $t$ changes, but the group identities of some observations change, causing $R_G(t)$ to change.

3.3 The profile of $R_G(t)$ and $Z_G(t)$ against $t$ for the same data set as in Figure 3.2 (a change-point at 20).

3.4 The profile of $R_G(t)$ and $Z_G(t)$ against $t$ on a sequence of points all randomly drawn from $N(0, I_2)$.

5.1 Results of graph-based scans of the MIT phone call network. Top row shows results from using number of different edges as the dissimilarity measure and bottom row shows results from using the normalized number of different edges. The three columns show three different ways of constructing the graph: MST, MDP, and NNG from left to right. In each plot, $Z(t)$ is plotted along $t$. The estimated change-point is shown in the caption above the plot. The two vertical lines show $n_0$ and $n_1$; we basically excluded the first 5% and the last 5% of the points. The horizontal lines show critical values at 0.05 and 0.01 significance levels, with the solid lines showing critical values computed from 10,000 permutations and the dashed lines showing those computed from the analytic approximation after skewness correction.

5.2 Results of graph-based scans of chapter-wise word usage frequencies of Tirant lo Blanch. The first row shows results from the word length data and the second row shows results from the context-free word frequency data. The three columns show scans based on three different graphs: MST, MDP, and NNG from left to right. The content in each plot is the same as in Figure 5.1. In the caption for each plot, the estimated change-point is shown in the form A/B, where A is the index of the change-point within the 425 chapters used for analysis, and B is the chapter number in the novel.

5.3 Results from the first 350 chapters. The setting of the figure is the same as in Figure 5.2.

7.1 Illustration of MST, MDP, and NNG on six points. Notice that only one of the two possible MSTs on the six points and one of the two possible NNGs on the six points are shown.

7.2 Embedding the MST on categories on the subjects. This figure only shows 3 out of 15552 possible embeddings.

7.3 Power versus type I error for tests based on $R_{aMST}$, $R_{uMST}$, $R_{aMDP}$, the likelihood ratio (deviance), and $R_{uNNG}$ under different simulation settings.

8.1 C-uMST and C-uNNG constructed on a typical data set generated under parameters $\zeta_0 = 1234$ and $\theta = 5$ with $n_a = n_b = 20$. The Spearman’s distance is used in both the generating model and for constructing the graph. Each node is labeled by the ranking it represents, followed by the number of subjects from groups a and b with that ranking in parentheses.

8.2 Power versus type I error for five different tests in the preference ranking example with $\theta = 5$ and $n_a = n_b = 20$. One of two distance measures (Kendall’s or Spearman’s distance) was used for the generating model and for constructing the graph. The title of each plot denotes the choice of distance: the first letter denotes the distance used in the generating model (“K” is Kendall’s and “S” is Spearman’s distance), and the second letter denotes the distance used in constructing the graph. For instance, “KS” in the bottom left panel means that Kendall’s distance is used in the generating model, but Spearman’s distance is used in constructing the graph.

8.3 The power versus type I error plots for the five tests for the haplotype example. The length of the haplotype is 11, but only 4 positions are informative.

9.1 Boxplots for the differences between p-values calculated from normal approximation and 10,000 permutations.

B.1 The three quantities, $\gamma_G(t)$, $\theta_{b,G}(t)$ and $S_G(t)$ from left to right, for a MDP graph. $n = 1000$, $b = 3$.

B.2 The three quantities, $\gamma_G(t)$, $\theta_{b,G}(t)$ and $S_G(t)$ from left to right, for a MST graph constructed using Euclidean distance on a sequence of $n = 1000$ observations iid drawn from $N(0, I_{100})$. $b = 3$.

B.3 The integrand before (left) and after (right) extrapolation. The integrand can only be directly calculated in the middle part ($t \in [248, 752]$), and the outer part is obtained by extending using the boundary tangent.


Chapter 1

Introduction

Statistics is a rapidly growing field where new challenges arise before old ones are fully

understood. As advanced technologies in various fields produce data with increasing

dimensionality and volume, many fundamental problems have to be re-addressed to

meet the demands of modern data. This dissertation focuses on two of these

problems: change-point detection and two-sample comparison of categorical data.

This chapter briefly reviews graph-based two-sample tests, which are the building

block of our approach to both problems in high dimensions, and then gives a brief

overview of the two problems.

1.1 A Review of Graph-Based Two-Sample Tests

Let $y_1, \dots, y_n$ and $y_{n+1}, \dots, y_N$ be two samples from distributions $F_0$ and $F_1$,

respectively. We consider testing the null hypothesis that the two distributions are

the same, $F_0 = F_1$, against an omnibus alternative, $F_0 \neq F_1$.

By graph-based two-sample tests, we refer to tests that are based on graphs with the

observations $y_1, \dots, y_N$ as nodes. Friedman and Rafsky [1979] proposed the first

graph-based two-sample test as a generalization of the Wald-Wolfowitz runs test to

multivariate settings. Their test is based on a minimum spanning tree (MST) on the

observations $y_i$, $i = 1, \dots, N$, which is a spanning tree connecting all observations

that minimizes the sum of distances across edges. The Friedman-Rafsky test is based


on the number of edges connecting observations across different groups:

\[
R_G = \sum_{(i,j) \in G} I_{g_i \neq g_j}, \tag{1.1}
\]

where $G$ is the MST, and $g_i$ is an indicator function for $y_i$ belonging to the first

sample. The sum $R_G$ is compared to its null distribution obtained by permuting the

sample labels – randomly picking $n$ observations out of the $N$ observations to be the

first sample – and the null hypothesis is rejected if $R_G$ is relatively low. The rationale

is that, if the two samples come from different distributions, observations from the

same sample should be closer to each other, and thus edges in the MST should be

more likely to connect observations within a group. Friedman and Rafsky showed

that, while this test has low power in low dimensions, it has comparable power to

likelihood ratio tests in a numerical study of normal data in > 20 dimensions, and

higher power when the normality assumption is violated.
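
The procedure just described can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, builds the MST with a plain Prim's algorithm, and estimates the permutation p-value by random relabeling. The function names, the number of permutations, and the `(hits + 1) / (n_perm + 1)` convention are all choices made for this sketch.

```python
import math
import random

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns N-1 edges (i, j)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known distance from the tree to each node
    best_from = [-1] * n         # tree endpoint that achieves that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the node outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges

def cross_edge_count(edges, labels):
    """R_G in (1.1): number of edges joining observations from different samples."""
    return sum(labels[i] != labels[j] for i, j in edges)

def friedman_rafsky_pvalue(sample1, sample2, n_perm=2000, seed=0):
    """Permutation p-value; a small R_G is evidence against F0 = F1."""
    rng = random.Random(seed)
    points = list(sample1) + list(sample2)
    labels = [1] * len(sample1) + [0] * len(sample2)
    edges = mst_edges(points)          # the graph is built once; only labels are permuted
    observed = cross_edge_count(edges, labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if cross_edge_count(edges, labels) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: two bivariate normal samples whose means differ by (2, 2).
rng = random.Random(1)
a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
b = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(20)]
r_obs, p_value = friedman_rafsky_pvalue(a, b)
print(r_obs, p_value)
```

Note that the MST depends only on the pooled observations, so it is computed once and reused for every relabeling.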

Another graph-based two-sample method, the cross-match test, was proposed

by Rosenbaum [2005]. This test is based on minimum distance non-bipartite pairing

(MDP), which divides the $N$ observations into $N/2$ (assuming $N$ is even) non-overlapping

pairs in such a way as to minimize the total of the $N/2$ distances within

pairs. For odd $N$, Rosenbaum suggested creating a pseudo data point that has distance

0 with all other observations, and later discarding the pair containing this

pseudo point. The sum (1.1) is computed with $G$ set to the MDP, and the same rationale

is adopted: the null hypothesis is rejected if $R_G$ is relatively low compared

to that calculated under permutations. Note that the topology of the MDP is the

same for any data set with $N$ observations: it always consists of $N/2$ disjoint edges. This fact makes the test based on MDP truly

distribution-free under the null hypothesis.
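
To make the distribution-free property concrete, here is a small sketch under stated assumptions. The pairing is found by an exhaustive dynamic program over subsets, which is only practical for a handful of points and stands in for the optimal matching algorithm a real analysis would use. The null distribution is obtained by enumerating all label assignments on a fixed reference pairing, which is valid because every perfect matching has the same topology. All names are hypothetical.

```python
from collections import Counter
from functools import lru_cache
from itertools import combinations

def min_distance_pairing(dist):
    """Minimum-distance non-bipartite pairing by exhaustive DP (N even and small)."""
    n = len(dist)
    assert n % 2 == 0

    @lru_cache(maxsize=None)
    def best(mask):
        # mask marks nodes already paired; returns (total distance, tuple of pairs).
        if mask == (1 << n) - 1:
            return 0.0, ()
        i = next(k for k in range(n) if not (mask >> k) & 1)  # lowest unpaired node
        options = []
        for j in range(i + 1, n):
            if not (mask >> j) & 1:
                cost, pairs = best(mask | (1 << i) | (1 << j))
                options.append((cost + dist[i][j], ((i, j),) + pairs))
        return min(options)

    return list(best(0)[1])

def cross_match_count(pairs, labels):
    """Number of pairs whose two members come from different samples."""
    return sum(labels[i] != labels[j] for i, j in pairs)

def exact_null_distribution(n_total, n_first):
    """Null distribution of the cross-match count by enumerating all labelings.

    Any perfect matching on n_total nodes is n_total/2 disjoint edges, so the
    reference pairing (0,1), (2,3), ... yields the same distribution as the MDP.
    """
    pairs = [(k, k + 1) for k in range(0, n_total, 2)]
    counts = Counter()
    for first in combinations(range(n_total), n_first):
        labels = [0] * n_total
        for i in first:
            labels[i] = 1
        counts[cross_match_count(pairs, labels)] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in sorted(counts.items())}

# Toy data on a line: three points near 0 (sample 1) and three near 5 (sample 2).
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.3]
dist = [[abs(u - v) for v in xs] for u in xs]
pairs = min_distance_pairing(dist)
labels = [1, 1, 1, 0, 0, 0]
a1 = cross_match_count(pairs, labels)
null = exact_null_distribution(6, 3)
p_value = sum(p for a, p in null.items() if a <= a1)
print(pairs, a1, null, p_value)
```

With only six observations the smallest attainable p-value is large, so the example illustrates the mechanics rather than a realistic test.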

1.2 Overview

This section gives a brief overview of the challenges of the two problems in high

dimension and how we approached them. The details will be presented in Part I and

Part II of the dissertation, respectively.


1.2.1 Change-Point Detection

Change-point models are widely used in various fields for detecting lack of homogene-

ity in an ordered sequence of observations. There is a rich literature on theory and

application of change-point models when the observations are real or integer valued

scalars. However, in many applications, the observations are multivariate, and examples

arise close to everyday life. Many classic works of literature, such as Tirant lo

Blanch, a Catalan romance, and Dream of the Red Chamber, a Chinese masterpiece,

are subject to debate over whether the author changes midway through the novel.

In the digital era, an objective approach to these debates is to statistically test for

abrupt changes in writing style, which can be reflected by word usage. Similar prob-

lems arise in genomic sequence analysis in biology, where it is often of interest to find

regions of the genome with different DNA-word compositions. In both settings, each

observation in the sequence is a vector of word counts over a dictionary. Multivariate

change-point models are also useful for detecting abrupt shifts in network connectiv-

ity for either social networks or gene-/protein- interaction networks, and for detecting

abrupt events in image data from climatology, neuroscience, and surveillance tapes.

In these applications, the dimension of the observations in the sequence can be

very high, even larger than the number of observations. Testing the homogeneity

of such sequences is a challenging but important problem. Existing approaches for

change-point detection in a sequence of multivariate observations are limited in many

ways. Most methods are based on parametric models that are highly context specific.

These parametric approaches for multivariate data cannot be applied in very high

dimensions, unless strong assumptions are made to avoid the estimation of the large

number of nuisance parameters that are a by-product of increasing dimension. Non-

parametric methods have also been proposed, but they do not generalize well to high

dimension.

We propose a new non-parametric approach that can be applied to data in high

dimension, and even to non-Euclidean object data, as long as an informative similarity

measure on the sample space can be defined. Briefly, the approach adapts graph-based

two-sample tests to the scan-statistic setting. We show that this new approach

is powerful in high dimensions compared to parametric approaches. We also derive

accurate analytic p-value approximations for very general situations, which lead to

easy off-the-shelf homogeneity testing for large multivariate data sets.
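
As a rough illustration of what adapting a two-sample test to the scan setting involves, the sketch below builds one similarity graph on the whole sequence and then counts, for every candidate split point $t$, the edges that join observations up to $t$ with observations after $t$. It assumes a 1-nearest-neighbor graph on scalar toy data and compares each count only with its permutation mean, $|G| \cdot 2t(n-t)/(n(n-1))$. The standardized statistic developed in Part I is not reproduced here, and all function names are hypothetical.

```python
import random

def nearest_neighbor_graph(dist):
    """Undirected 1-nearest-neighbor graph: join each observation to its closest other one."""
    n = len(dist)
    edges = set()
    for i in range(n):
        j = min((k for k in range(n) if k != i), key=lambda k: dist[i][k])
        edges.add((min(i, j), max(i, j)))
    return sorted(edges)

def scan_counts(edges, n):
    """R_G(t) for t = 1..n-1: edges joining {y_1..y_t} to {y_{t+1}..y_n}."""
    diff = [0] * (n + 1)
    for i, j in edges:        # 0-based with i < j; the edge is cut by t = i+1, ..., j
        diff[i + 1] += 1
        diff[j + 1] -= 1
    counts, running = [], 0
    for t in range(1, n):
        running += diff[t]
        counts.append(running)
    return counts

def permutation_mean(n_edges, n, t):
    """Mean of R_G(t) under random reordering: each edge is cut w.p. 2t(n-t)/(n(n-1))."""
    return n_edges * 2 * t * (n - t) / (n * (n - 1))

# Toy sequence: 15 draws centered at 0 followed by 15 draws centered at 4.
rng = random.Random(0)
series = [rng.gauss(0, 1) for _ in range(15)] + [rng.gauss(4, 1) for _ in range(15)]
n = len(series)
dist = [[abs(u - v) for v in series] for u in series]
edges = nearest_neighbor_graph(dist)
counts = scan_counts(edges, n)

# A split with far fewer cross edges than expected suggests a change there.
deficit = [permutation_mean(len(edges), n, t) - counts[t - 1] for t in range(1, n)]
t_hat = 1 + max(range(n - 1), key=lambda k: deficit[k])
print(t_hat)
```

The graph is constructed once; only the grouping induced by $t$ changes along the scan.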

1.2.2 Two-Sample Comparison of Categorical Data

Two-sample comparison of categorical data is a classic problem in statistics. The

standard procedure is to assume that each sample is drawn from a multinomial distribution,

and the comparison becomes a test of whether the two samples come from

the same multinomial distribution. Classical methods, such as Pearson’s chi-square

test and the deviance test, work well when we observe each category a large

number of times. At least, the region in the contingency table where the two groups

truly differ needs to be adequately sampled for existing tests to achieve good power.

However, in many applications, the number of categories is comparable to the sample

size, causing existing methods to have low power.

When the number of categories is large, there is often underlying structure on

the sample space that can be exploited. For example, in genetics, a haplotype is a

combination of alleles at adjacent loci on a chromosome that is transmitted together.

A common problem of genetic association studies is to compare haplotype counts

between treatment and control groups. Each haplotype can be represented as a fixed-length

binary vector. Haplotypes of length greater than 10 are often of interest in

genetics, leading to more than 1,000 possible combinations. However, the number of subjects

in association studies is often only in the thousands or even hundreds. In this example,

haplotypes can be related through Hamming distance or other more sophisticated

measures. Another example is ranking data from surveys and psychometric research,

where two-group comparisons are common. The number of possible rankings is often

large compared to the number of subjects. In this example, rankings can be related

through Kendall’s or Spearman’s distance.
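
The distances mentioned above are simple to compute. The sketch below assumes haplotypes are written as binary strings and rankings as strings listing items from most to least preferred. It takes Spearman's distance to be the sum of squared position differences, which is one common convention and may differ from the definition used later in the dissertation.

```python
from itertools import combinations

def hamming(u, v):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(u, v))

def kendall_distance(r1, r2):
    """Number of item pairs that the two rankings order differently."""
    pos1 = {item: k for k, item in enumerate(r1)}
    pos2 = {item: k for k, item in enumerate(r2)}
    return sum(
        (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
        for a, b in combinations(r1, 2)
    )

def spearman_distance(r1, r2):
    """Sum of squared differences between the positions each item receives."""
    pos2 = {item: k for k, item in enumerate(r2)}
    return sum((k - pos2[item]) ** 2 for k, item in enumerate(r1))

print(hamming("0110", "0101"))             # two positions differ
print(kendall_distance("1234", "2134"))    # one adjacent swap
print(spearman_distance("1234", "4321"))   # full reversal
print(2 ** 10)                             # binary haplotypes of length 10
```

Because such distances take only a few distinct integer values, many pairs of categories are equally far apart, which is the source of the ties discussed next.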

We propose general non-parametric approaches that make use of similarity information

on the space of categories in two-sample tests. As we saw in Section 1.1, both

existing graph-based two-sample tests require a unique underlying graph $G$ on the

observations. When the similarity matrix on observations is filled with ties, which is

a major characteristic of categorical data, neither the MST nor the MDP can be constructed

uniquely. We explored different ways to construct the graph and the statistic under

the categorical setting and found two types of statistics that are both powerful and

fast to compute. We show that their permutation null distributions are asymptotically

normal under mild conditions and that their p-value approximations are quite

accurate under typical settings, facilitating the application of the new approaches to

real problems.
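
The non-uniqueness caused by ties is easy to see on a tiny example. The following sketch, which is illustrative only, enumerates every spanning tree on the four binary haplotypes of length 2 under Hamming distance and keeps those of minimum total weight. Brute-force enumeration is an assumption made for clarity and does not scale beyond a few nodes.

```python
from itertools import combinations

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def is_spanning_tree(n_nodes, edges):
    """True if the n_nodes - 1 given edges contain no cycle (checked by union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            return False
        parent[ri] = rj
    return True

def all_minimum_spanning_trees(categories, distance):
    """Every spanning tree on the categories that attains the minimum total distance."""
    n = len(categories)
    all_edges = list(combinations(range(n), 2))
    weight = {(i, j): distance(categories[i], categories[j]) for i, j in all_edges}
    trees = [t for t in combinations(all_edges, n - 1) if is_spanning_tree(n, t)]
    best = min(sum(weight[e] for e in t) for t in trees)
    return [t for t in trees if sum(weight[e] for e in t) == best]

categories = ["00", "01", "10", "11"]
msts = all_minimum_spanning_trees(categories, hamming)
for tree in msts:
    print([(categories[i], categories[j]) for i, j in tree])
```

Here the four edges of Hamming distance 1 form a cycle, and dropping any one of them gives a different minimum spanning tree.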

1.3 Remarks

The two problems are separate, except that we approach both through graph-based

tests. Given the review of graph-based two-sample tests in Section 1.1, Part I

and Part II are “independent” and can be read in either order.

The notation is consistent within each part. I try to make it as consistent

as possible across the two parts. However, some symbols are defined twice. Their

meanings are clear within the context, but carrying one over from one part to the other

needs caution.

Nevertheless, I always use $G$ to denote the similarity graph in a generic sense, as

well as the set of edges in the graph when the vertex set is implicitly obvious. $|\cdot|$ is

used to denote the size of a set, so $|G|$ is the number of edges in $G$. If not otherwise

specified, the probabilities are all under permutation.

Some commonly used notation is also defined here. $\phi(\cdot)$ and $\Phi(\cdot)$ are the density

and cumulative distribution function of the standard normal distribution. For any event $x$, $I_x$

is the indicator function that takes value 1 if $x$ is true, and 0 otherwise. For any real

value $x$, $[x]$ denotes the largest integer $\le x$.

Part I

Graph-Based Change-Point

Detection


Chapter 2

Change-Point Problems

2.1 Background and Challenges

Change-point models are widely used in various fields for detecting lack of homogeneity

in a sequence of observations ordered based on an index, such as time. In the

typical formulation, the observations $y_i$, $i = 1, 2, \dots, n$, are assumed to have distribution

$F_0$ for $i \le \tau$ and possibly a different distribution $F_1$ for $i > \tau$. The parameter

$\tau$ is referred to as the change-point. We consider the case where the total length of

the sequence $n$ is fixed. There is a rich literature on theory and applications of this

model when the $y_i$ are real or integer valued scalars. For example, in a well known study

of the annual flow volume of the Nile River at the city of Aswan, Egypt, from 1871

to 1970, each $y_i$ is a continuous measurement of the annual discharge from the river

[Cobb, 1978], and the goal is to detect shifts in flow volume. If the distribution of $y_i$

is assumed to be normal, score- or likelihood-based tests can be applied [James

et al., 1987]. Bayesian and non-parametric approaches have also been developed (see

Carlstein et al. [1994] for a survey).
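
In symbols, the typical formulation above can be written compactly as follows. The model line restates the text; the hypothesis pair is the usual way such a model is tested and is included here as an assumption, since the formal problem statement is deferred to Section 3.1.

```latex
\[
  y_i \sim
  \begin{cases}
    F_0, & i = 1, \dots, \tau, \\
    F_1, & i = \tau + 1, \dots, n,
  \end{cases}
  \qquad
  H_0 : F_0 = F_1
  \quad \text{versus} \quad
  H_1 : F_0 \neq F_1 \ \text{for some } 1 \le \tau < n .
\]
```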

Modern statistical applications are faced with data of increasing richness and

dimension. High-throughput measurement schemes and digitization in many scientific

fields have produced data sequences $\{y_i : i = 1, 2, \dots, n\}$, where each $y_i$ is a high-dimensional

vector or even a non-Euclidean data object. The dimension of each

observation can be larger than the length of the sequence. Testing the homogeneity

of such sequences and estimating the locations of change-points if the sequence is not


homogeneous are challenging but important problems. Following are some motivating

examples:

Network evolution: Data on networks have become increasingly common. For ex-

ample, e-mail, phone, and on-line chat records can be used to construct a net-

work of social interactions among individuals [Kossinets and Watts, 2006, Eagle

et al., 2009]. High-throughput biological experiments have led to the ubiquitous

study of protein- or gene-interaction networks. A large part of these studies is

characterizing how the network evolves through time. Here, the observation at

each time point is a graphical encoding of the network. In a longitudinal study,

one might ask whether there is an abrupt shift in network connectivity at any

point in time.

Image analysis: Image data collected through time appears in diverse applications,

from video surveillance to climatology to neuroscience. The detection of abrupt

events, such as security breaches, storms, or brain activity, can be formulated as

a change-point problem. Here, the observation at each time point is the digital

encoding of an image.

Text or sequence analysis: Many classic works in both western and eastern liter-

ature have ongoing authorship debates. For example, the debate surrounding

both Tirant lo Blanch, a Catalan romance, and Dream of the Red Chamber,

a Chinese masterpiece, is whether there is a change of authorship mid-way

through the novel. In the digital era, an objective approach to these debates is

to statistically test for abrupt changes in writing style, which can be reflected

by word usage. Similar problems arise in genomic sequence analysis in biology,

where it is often of interest to find regions of the genome with different DNA-

word compositions (see, for example, Tsirigos and Rigoutsos [2005]). In both

settings, each observation in the sequence is a vector of word counts over a large

dictionary of words.


2.2 Review of Change-Point Problems with Multivariate Observations

There are several methods that can be used to detect change-point(s) in a sequence

of multivariate observations. Most methods are based on parametric models that are

highly context specific. For example, Zhang et al. [2010] and Siegmund et al. [2011]

studied the problem of detecting common shifts in mean in multivariate Gaussian

sequences with identity covariance. Srivastava and Worsley [1986] and James et al.

[1992] discussed general likelihood ratio tests for a change in multivariate normal

mean. Both tests require that the dimension of the observations be smaller than

the number of observations. Giron et al. [2005] assumed that the observations follow a multinomial distribution and developed a Bayesian approach. This survey is not exhaustive, but in general, parametric approaches for multivariate data cannot be applied in very high dimensions unless strong assumptions are made to avoid estimating the large number of nuisance parameters that come with increasing dimension.

In the nonparametric domain, Desobry et al. [2005] and Harchaoui et al. [2009]

used kernel-based methods. A common drawback of kernel-based methods is that they rely heavily on the choice of kernel function and its parameters, and the problem becomes more severe for high-dimensional data. Lung-Yut-Fong et al. [2011] used a nonparametric approach based on marginal rank statistics, which also requires that the number of observations be larger than the dimension of the data. Although these tests can be quite useful in low dimensions, they are impractical when the data reside in a high-dimensional sample space.

Chapter 3

A Graph-Based Framework for

Change-Point Detection

3.1 Problem Formulation

We start with a formal formulation of the problem. Consider a sequence of indepen-

dent observations yi, i = 1, . . . , n. Let F0 and F1 be two probability measures on

the sample space, e.g., Rd. We do not assume that F0 and F1 are known, that is,

F0 and F1 can be arbitrary, but need to differ on a set of non-zero measure. We are

concerned with testing the null hypothesis,

H0 : yi ∼ F0, i = 1, . . . , n, (3.1)

against the alternative,

\[
H_a : \exists\, 1 \le \tau < n, \quad y_i \sim
\begin{cases}
F_1, & i > \tau, \\
F_0, & \text{otherwise.}
\end{cases}
\tag{3.2}
\]

Under the alternative hypothesis, there exists a time point τ where the distribution

of the observations changes abruptly from F0 to F1. (The index can also represent some other meaningful ordering; for simplicity, we refer to the ordering as time unless otherwise specified.) A related alternative, which we will also study, is that of a changed interval:

\[
H_a : \exists\, 1 \le \tau_1 < \tau_2 \le n, \quad y_i \sim
\begin{cases}
F_1, & i = \tau_1 + 1, \dots, \tau_2, \\
F_0, & \text{otherwise.}
\end{cases}
\tag{3.3}
\]

Under the second alternative, yi changes from F0 to F1 and then back to F0 at some

later time.

In both the single change-point and the changed interval alternatives, the obser-

vations are divided into two distinct groups. Usually, we are interested in the case

that both groups have a minimum number of observations: 1 < n0 ≤ τ ≤ n1 < n

for the single change-point scenario and 1 < l0 ≤ τ2 − τ1 ≤ l1 < n for the changed

interval scenario, where n0, n1, l0, l1 are some prespecified values. Sometimes, the

domain knowledge gives us good choices for these values. We may also have some

constraints on the locations of τ1 and τ2.

3.2 Test Statistics

We do not impose any restrictions on the sample space or distributions of yi. Our

approach requires that the similarity between yi can be represented by a graph, with

edges in the graph connecting observations that are “close” in some sense. Recall from Section 1.1 that the rationale of the graph-based two-sample test is that, if the two distributions are different, observations from the same distribution are more likely to be connected in the graph G than observations from different distributions. The same rationale underlies our proposed method for the change-point setting: if there is a change-point, observations from the same distribution are more likely to be connected. To give a sense of what the similarity graph usually looks like, Figure 3.1 presents the minimum spanning tree (MST), minimum distance pairing (MDP), and nearest neighbor graph (NNG) on 40 points in R2, where 20 points are randomly drawn from N (0, I2) and 20 points from N ((2, 2)′, I2).

3.2.1 Single Change-Point Alternative

Here, we consider testing the null (3.1) versus the single change-point alternative

(3.2). Let G be the similarity graph on yi. Any time t divides the observations into


Figure 3.1: The MST, MDP and NNG graphs on an example two-dimensional dataset. 20 points were drawn from N (0, I2) (shown as triangles) and 20 points were drawn from N ((2, 2)′, I2) (shown as circles).

two groups: those observed at or before t and those observed after t. The number of edges connecting points from different groups at time t is defined as

\[
R_G(t) = \sum_{(i,j)\in G} I_{g_i(t) \neq g_j(t)}, \qquad g_i(t) = I_{i > t}.
\]

Here, gi(t) is the indicator of the event that yi is observed after t. As in the Friedman-Rafsky and Rosenbaum tests, small values of RG(t) are evidence against the null. To standardize RG(t) so that it is comparable across t, let

\[
Z_G(t) = -\frac{R_G(t) - E[R_G(t)]}{\sqrt{\mathrm{Var}[R_G(t)]}}. \tag{3.4}
\]

In the standardization, we also invert the sign, so that large values of ZG(t) are

evidence against the null. The expectation and variance above are defined under

the permutation null, which places 1/n! probability on each of the n! permutations

of yi : i = 1, . . . , n. That is, the time of observing yi is permuted, so i is the

permutation random variable, and gi(t) is the indicator function of observing yi after

t under permutation. Since the graph G is determined by the values of the yi’s, its structure is not changed under permutation.
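The count RG(t) can be sketched in a few lines of code. The example below is only illustrative: the function name and the toy edge list are assumptions made here, and the graph is taken as a plain list of index pairs with 1-based time indices.

```python
def edge_count_between_groups(edges, t):
    """R_G(t): number of edges joining an index <= t to an index > t.

    `edges` is an iterable of (i, j) pairs with 1-based time indices.
    """
    return sum((i > t) != (j > t) for i, j in edges)


# Hypothetical toy graph on n = 6 observations: a path 1-2-3-4-5-6 plus
# one long-range edge (1, 5). The edge list is made up for illustration.
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 5)]

# Profile of R_G(t) for t = 1, ..., 5. The graph stays fixed; only the
# split of the indices into "<= t" and "> t" changes with t.
profile = [edge_count_between_groups(toy_edges, t) for t in range(1, 6)]
```

Only the group labels depend on t, so the same edge list is reused for every split.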

Remark 3.2.1. It would be clearer to use π(i) to denote the observed time of yi after permutation, as we do in Appendix B.1. However, when there is little ambiguity, i is used to avoid cumbersome notation.

Lemma 3.2.2 below gives analytic formulas for E[RG(t)] and Var[RG(t)]. Before

we state the lemma, we introduce the notation Gi for the subgraph of G that includes all edges connecting to node i. Since the vertex set of Gi is obvious (node i and all nodes that connect to node i in G), Gi is also used to denote the set of edges in Gi, and |Gi| is the number of edges in Gi. Clearly, |Gi| is also the degree of node i.

Lemma 3.2.2. Under the permutation null, the expectation and variance of RG(t) are

\[
E(R_G(t)) = p_1(t)|G|,
\]
\[
\mathrm{Var}(R_G(t)) = p_2(t)|G| + \left(\tfrac{1}{2}p_1(t) - p_2(t)\right)\sum_i |G_i|^2 + \left(p_2(t) - p_1^2(t)\right)|G|^2,
\]

where

\[
p_1(t) = \frac{2t(n-t)}{n(n-1)}, \qquad
p_2(t) = \frac{4t(t-1)(n-t)(n-t-1)}{n(n-1)(n-2)(n-3)}.
\]

Proof. Notice that the indices i, j are the random variables under permutation. The formula for the expectation is immediate,

\[
E(R_G(t)) = \sum_{(i,j)\in G} P(g_i(t) \neq g_j(t)) = p_1(t)|G|,
\]

because there are 2t(n − t) ways to place i and j on the two sides of t among all n(n − 1) ways.

For the second moment,

\[
E(R_G^2(t)) = \sum_{(i,j),(k,l)\in G} P(g_i(t) \neq g_j(t),\ g_k(t) \neq g_l(t)).
\]

By examining the different ways of placing i, j, k, l, we have

\[
P(g_i(t) \neq g_j(t),\ g_k(t) \neq g_l(t)) =
\begin{cases}
\dfrac{2t(n-t)}{n(n-1)} = p_1(t) & \text{if } \{i = k,\ j = l\} \text{ or } \{i = l,\ j = k\}, \\[2ex]
\dfrac{t(n-t)}{n(n-1)} = \tfrac{1}{2}p_1(t) & \text{if } \{i = k,\ j \neq l\},\ \{i = l,\ j \neq k\},\ \{j = k,\ i \neq l\}, \text{ or } \{j = l,\ i \neq k\}, \\[2ex]
\dfrac{4t(t-1)(n-t)(n-t-1)}{n(n-1)(n-2)(n-3)} = p_2(t) & \text{if } i, j, k, l \text{ are all different.}
\end{cases}
\]

So

\[
\begin{aligned}
E(R_G^2(t)) &= \sum_{(i,j)\in G} p_1(t)
+ \sum_{\substack{(i,j),(i,k)\in G \\ j \neq k}} \tfrac{1}{2}p_1(t)
+ \sum_{\substack{(i,j),(k,l)\in G \\ i,j,k,l \text{ all different}}} p_2(t) \\
&= p_2(t)|G| + \left(\tfrac{1}{2}p_1(t) - p_2(t)\right)\sum_i |G_i|^2 + p_2(t)|G|^2.
\end{aligned}
\]

Var(RG(t)) then follows from $E(R_G^2(t)) - E^2(R_G(t))$.

Remark 3.2.3. The expectation and variance of RG(t) under the permutation null depend only on t, n, and two characteristics of the graph: the number of edges (|G|) and the sum of squares of node degrees ($\sum_{i=1}^{n} |G_i|^2$).
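As a sanity check on Lemma 3.2.2, the formulas can be compared with the exact permutation moments on a very small graph, where all n! permutations can be enumerated. The sketch below assumes a simple graph (no repeated edges or self-loops) given as a list of index pairs; the function names and the toy graph are illustrative only, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from itertools import permutations


def p1(t, n):
    return Fraction(2 * t * (n - t), n * (n - 1))


def p2(t, n):
    return Fraction(4 * t * (t - 1) * (n - t) * (n - t - 1),
                    n * (n - 1) * (n - 2) * (n - 3))


def lemma_moments(edges, n, t):
    """Mean and variance of R_G(t) from the formulas in Lemma 3.2.2."""
    m = len(edges)
    degree = [0] * (n + 1)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    sum_sq = sum(d * d for d in degree)          # sum of squared node degrees
    mean = p1(t, n) * m
    var = (p2(t, n) * m
           + (p1(t, n) / 2 - p2(t, n)) * sum_sq
           + (p2(t, n) - p1(t, n) ** 2) * m * m)
    return mean, var


def brute_force_moments(edges, n, t):
    """Exact mean and variance of R_G(t) over all n! time permutations."""
    total = total_sq = count = 0
    for perm in permutations(range(1, n + 1)):
        # perm[i - 1] is the permuted observation time of node i.
        r = sum((perm[i - 1] > t) != (perm[j - 1] > t) for i, j in edges)
        total += r
        total_sq += r * r
        count += 1
    mean = Fraction(total, count)
    return mean, Fraction(total_sq, count) - mean ** 2


# Hypothetical toy graph on n = 6 nodes (720 permutations).
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 5)]
```

Enumeration is feasible only for tiny n; its purpose here is to confirm that the moments depend on the graph only through |G| and the sum of squared degrees.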

Figure 3.2 illustrates the computation of RG(t) on a small artificial data set of

length n = 40 with the first 20 points drawn from N (0, I2) and the second 20 points

drawn from N ((2, 2)′, I2), so there is a true change-point at τ = 20. The similarity

graph is the MST constructed using Euclidean distance. Figure 3.3 shows plots of

the RG(t) and ZG(t) processes for the same data set. We see that ZG(t) peaks at the true change-point, t = 20. For contrast, we randomly generated another sequence of length 40 with all points drawn from N (0, I2). The RG(t) and ZG(t) processes calculated from this data set are shown in Figure 3.4. For the data set with no change-point, ZG(t) fluctuates with no clear peak, and its maximum is much smaller (around 1, compared to around 4 in Figure 3.3).


Figure 3.2: The computation of RG(t) for nine different values of t. The data is a sequence of length n = 40, with the first 20 points drawn from N (0, I2) and the second 20 points drawn from N ((2, 2)′, I2). The similarity graph G shown in the plots is the MST on Euclidean distance. Each t divides the observations into two groups, one group for observations before and at t (shown as triangles) and the other group for observations after t (shown as circles). Edges that connect observations from the two different groups (i.e. edges connecting a triangle and a circle) are bold in the graph. Notice that G does not change as t changes, but the group identities of some observations change, causing RG(t) to change.

Figure 3.3: The profile of RG(t) and ZG(t) against t for the same data set as in Figure 3.2 (a change-point at 20).

Figure 3.4: The profile of RG(t) and ZG(t) against t on a sequence of points all randomly drawn from N (0, I2).


We use the scan statistic to test H0 versus Ha:

\[
\max_{n_0 \le t \le n_1} Z_G(t), \tag{3.5}
\]

where n0 and n1 are pre-specified constraints on the range of τ, as described in Section 3.1. The null hypothesis is rejected if the maximum is greater than some threshold. How to determine this threshold is discussed in Chapter 4.
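A minimal sketch of the scan (3.5) is given below, using the moment formulas of Lemma 3.2.2 for the standardization. The function name, the edge-list representation, and the toy graph are assumptions made for illustration; this is not the implementation used for the results in this dissertation.

```python
import math


def scan_statistic(edges, n, n0, n1):
    """Return (max Z_G(t), argmax t) over n0 <= t <= n1, using Lemma 3.2.2."""
    m = len(edges)
    degree = [0] * (n + 1)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    sum_sq = sum(d * d for d in degree)

    best_z, best_t = -math.inf, None
    for t in range(n0, n1 + 1):
        p1 = 2 * t * (n - t) / (n * (n - 1))
        p2 = (4 * t * (t - 1) * (n - t) * (n - t - 1)
              / (n * (n - 1) * (n - 2) * (n - 3)))
        mean = p1 * m
        var = p2 * m + (p1 / 2 - p2) * sum_sq + (p2 - p1 * p1) * m * m
        r = sum((i > t) != (j > t) for i, j in edges)
        z = -(r - mean) / math.sqrt(var)     # sign flipped as in (3.4)
        if z > best_z:
            best_z, best_t = z, t
    return best_z, best_t


# Hypothetical graph on n = 8 time points: two tight clusters
# (times 1-4 and times 5-8) joined by the single edge (4, 5).
toy_edges = [(1, 2), (2, 3), (3, 4), (1, 3),
             (5, 6), (6, 7), (7, 8), (5, 7), (4, 5)]
z_max, t_hat = scan_statistic(toy_edges, n=8, n0=2, n1=6)
```

In this toy graph only one edge crosses the split at t = 4, so the standardized count is largest there.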

3.2.2 Changed Interval Alternative

In this section, we consider testing H0 versus the changed interval alternative (3.3).

Similar to the single change-point case, any specific alternative (t1, t2) divides the

data into two groups, one group containing all points observed during (t1, t2), and the

other group containing points observed outside of this interval. Then, the number of

edges in G connecting data points from different groups is

\[
R_G(t_1, t_2) = \sum_{(i,j)\in G} I_{g_i(t_1,t_2) \neq g_j(t_1,t_2)}, \qquad g_i(t_1, t_2) = I_{t_1 < i \le t_2}.
\]

We standardize RG(t1, t2) as before,

\[
Z_G(t_1, t_2) = -\frac{R_G(t_1, t_2) - E(R_G(t_1, t_2))}{\sqrt{\mathrm{Var}(R_G(t_1, t_2))}}.
\]

Lemma 3.2.4 below gives explicit expressions for E(RG(t1, t2)) and Var(RG(t1, t2)) under the permutation null. The scan statistic involves a maximization over t1 and t2,

\[
\max_{\substack{1 \le t_1 < t_2 \le n \\ l_0 \le t_2 - t_1 \le l_1}} Z_G(t_1, t_2), \tag{3.6}
\]

where l0 and l1 are constraints on the window size. For example, we can set l1 = n − l0 so that only alternatives where the number of observations in either group is at least l0 are considered.

Lemma 3.2.4. Under the permutation null, the expectation and variance of RG(t1, t2) are

\[
E(R_G(t_1, t_2)) = p_1(t_2 - t_1)|G|,
\]
\[
\begin{aligned}
\mathrm{Var}(R_G(t_1, t_2)) = {} & p_2(t_2 - t_1)|G| + \left(\tfrac{1}{2}p_1(t_2 - t_1) - p_2(t_2 - t_1)\right)\sum_i |G_i|^2 \\
& + \left(p_2(t_2 - t_1) - p_1^2(t_2 - t_1)\right)|G|^2,
\end{aligned}
\]

where p1(·) and p2(·) are defined in Lemma 3.2.2.

The proof of this lemma is very similar to the proof of Lemma 3.2.2 and is omitted here.

Remark 3.2.5. We can also constrain t1 and t2 to a pre-specified region using domain knowledge. The procedure is similar, and the p-value approximations in Chapter 4 apply with minor modifications.
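The changed interval count and the expectation in Lemma 3.2.4 can be checked in the same way as in the single change-point case. The sketch below is illustrative only (function names and the toy graph are assumptions): it counts edges with exactly one endpoint observed in (t1, t2] and compares the exact permutation mean with p1(t2 − t1)|G| on a six-node graph.

```python
from fractions import Fraction
from itertools import permutations


def interval_edge_count(edges, t1, t2, time_of=None):
    """R_G(t1, t2): edges with exactly one endpoint observed in (t1, t2]."""
    time_of = time_of or (lambda i: i)      # default: node i observed at time i
    inside = lambda i: t1 < time_of(i) <= t2
    return sum(inside(i) != inside(j) for i, j in edges)


def permutation_mean(edges, n, t1, t2):
    """Exact E[R_G(t1, t2)] over all n! permutations of the time labels."""
    total = count = 0
    for perm in permutations(range(1, n + 1)):
        total += interval_edge_count(edges, t1, t2, lambda i: perm[i - 1])
        count += 1
    return Fraction(total, count)


def lemma_mean(edges, n, t1, t2):
    """E[R_G(t1, t2)] = p1(t2 - t1) |G| from Lemma 3.2.4."""
    w = t2 - t1
    return Fraction(2 * w * (n - w), n * (n - 1)) * len(edges)


# Hypothetical toy graph on n = 6 nodes.
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 5)]
```

The expectation depends on (t1, t2) only through the window length t2 − t1, which is why the single change-point functions p1 and p2 can be reused.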

Chapter 4

Analytic Approximations to

Significance Levels

4.1 Quantity of Interest

How large do the values of the scan statistics (3.5) and (3.6) need to be to constitute sufficient evidence against the null hypothesis of homogeneity? We are concerned with the tail distribution of the scan statistics under H0, that is,

\[
P\left(\max_{n_0 \le t \le n_1} Z_G(t) > b\right) \tag{4.1}
\]

for the single change-point alternative, and

\[
P\left(\max_{\substack{1 \le t_1 < t_2 \le n \\ l_0 \le t_2 - t_1 \le l_1}} Z_G(t_1, t_2) > b\right) \tag{4.2}
\]

for the changed interval alternative. (Since 1 ≤ t1 < t2 ≤ n is implicit, we omit it in the rest of this chapter for simplicity.) The probabilities in (4.1) and (4.2) are defined under the permutation distribution, where the order of yi is permuted. Under the null hypothesis, the observations are iid, so the scan statistic calculated under any permutation of time follows the same distribution. Therefore, we can directly sample from the permutation distribution to approximate the two probabilities (4.1) and (4.2). This is affordable if the number of observations n is not large. When n is large, permutation is computationally prohibitive, especially for (4.2), where each scan is of order $O(n^2)$ if $l_1 - l_0 \sim O(n)$. Even though parallel computing makes permutation more feasible, it requires many computing units. In addition, permutation gives no insight into how different factors affect the probabilities. Analytic expressions for the probabilities make the approach much easier to carry out and give a better understanding of it.
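For small n, direct sampling from the permutation distribution can be sketched as follows. This is a Monte Carlo illustration under stated assumptions: the function names, the number of permutations, the seed, and the toy graph are all chosen here for the example, and the add-one correction in the estimate is a common convention rather than something prescribed in this chapter.

```python
import math
import random


def max_z(edges, n, n0, n1, time_of):
    """Scan statistic max_{n0<=t<=n1} Z_G(t) for a given time labelling.

    time_of[i] is the observation time assigned to node i (index 0 unused).
    """
    m = len(edges)
    degree = [0] * (n + 1)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    sum_sq = sum(d * d for d in degree)
    best = -math.inf
    for t in range(n0, n1 + 1):
        p1 = 2 * t * (n - t) / (n * (n - 1))
        p2 = (4 * t * (t - 1) * (n - t) * (n - t - 1)
              / (n * (n - 1) * (n - 2) * (n - 3)))
        var = p2 * m + (p1 / 2 - p2) * sum_sq + (p2 - p1 * p1) * m * m
        r = sum((time_of[i] > t) != (time_of[j] > t) for i, j in edges)
        best = max(best, -(r - p1 * m) / math.sqrt(var))
    return best


def permutation_p_value(edges, n, n0, n1, n_perm=2000, seed=0):
    """Monte Carlo estimate of P(max Z_G(t) >= observed) under permutation."""
    rng = random.Random(seed)
    identity = list(range(n + 1))            # time_of[i] = i
    observed = max_z(edges, n, n0, n1, identity)
    exceed = 0
    for _ in range(n_perm):
        times = list(range(1, n + 1))
        rng.shuffle(times)                   # the graph is fixed; only times move
        exceed += max_z(edges, n, n0, n1, [0] + times) >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Hypothetical graph: two clusters (times 1-4 and 5-8) joined by one edge.
toy_edges = [(1, 2), (2, 3), (3, 4), (1, 3),
             (5, 6), (6, 7), (7, 8), (5, 7), (4, 5)]
p_hat = permutation_p_value(toy_edges, n=8, n0=2, n1=6)
```

Each permutation costs a full scan, which is the expense the analytic approximations in the rest of this chapter are meant to avoid.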

For (4.1) and (4.2), if we treat ZG(t) and ZG(t1, t2) as families of tests, the two probabilities are the family-wise error rates. However, the tests are all dependent, since they are all based on the same sequence, and each test statistic, ZG(t) or ZG(t1, t2), has a complicated distribution because it is calculated under permutation. Therefore, it is infeasible to obtain exact analytic expressions for the two probabilities for finite n.

In the rest of this chapter, we give analytic approximations to the two probabilities. We first show that, under some mild conditions on G, {ZG([nu]) : 0 < u < 1} converges to a Gaussian process and {ZG([nu], [nv]) : 0 < u < v < 1} converges to a Gaussian random field as n → ∞ (Section 4.2). We then derive analytic approximations to the two probabilities under these conditions on G (Section 4.3). To give better approximations for small n and for more general graphs, we refine the approximations by correcting for skewness in the marginal distributions (Section 4.4). All these approximations are checked by numerical studies under different scenarios (Section 4.5).

4.2 Properties of the Processes

In this section, we study the random process {ZG([nu]) : 0 < u < 1} and the random field {ZG([nu], [nv]) : 0 < u < v < 1} as n → ∞. Their limiting distributions are stated in Section 4.2.1, and the covariance function of the random process {ZG([nu]) : 0 < u < 1} is given in Section 4.2.2.


4.2.1 Limiting Distributions

This section shows the limiting distributions of {ZG([nu]) : 0 < u < 1} and {ZG([nu], [nv]) : 0 < u < v < 1} using Stein’s method. We first introduce some notation. For an edge e = (e−, e+), where e− < e+ are the indices of the nodes connected by the edge e, let

\[
A_e = G_{e_-} \cup G_{e_+} \tag{4.3}
\]

be the set of edges connecting to either node e− or node e+, and

\[
B_e = \bigcup_{e' \in A_e} A_{e'} \tag{4.4}
\]

be the set of edges connecting to nodes in Ge− and Ge+.

Theorem 4.2.1. If $\sum_{e\in G} |A_e||B_e| \sim o(n^{3/2})$ and $|G| \sim O(n)$ as $n \to \infty$, then under the permutation null,

1. $\{Z_G([nu]) : 0 < u < 1\}$ converges to a Gaussian process, which we denote as $\{Z^\star_G(u) : 0 < u < 1\}$,

2. $\{Z_G([nu], [nv]) : 0 < u < v < 1\}$ converges to a two-dimensional Gaussian random field, which we denote as $\{Z^\star_G(u, v) : 0 < u < v < 1\}$.

Remark 4.2.2. The condition on the graph restricts its “hubs”, that is, nodes with a large degree. The condition requires that the maximum degree of the graph cannot be of order $|G|^{3/4}$ or higher. To help understand the condition, it can be replaced by a simpler, stronger version, which requires that the maximum degree of the graph be $o(|G|^{1/6})$.

The proof for Theorem 4.2.1 is in Appendix B.1.

4.2.2 Covariance Function

The covariance function of the Gaussian process $\{Z^\star_G(u) : 0 < u < 1\}$, defined as

\[
\rho^\star_G(u, v) \triangleq \mathrm{cov}(Z^\star_G(u), Z^\star_G(v)), \tag{4.5}
\]

is stated in the next lemma.

Lemma 4.2.3.

\[
\rho^\star_G(u, v) = \frac{4(u \wedge v)^2(1 - (u \vee v))^2|G| + (u \wedge v)(1 - (u \vee v))(1 - 2u)(1 - 2v)\sum_i |G_i|^2}{\sigma^\star_G(u)\,\sigma^\star_G(v)}, \tag{4.6}
\]

where

\[
\sigma^\star_G(u) = \sqrt{4u^2(1 - u)^2|G| + u(1 - u)(1 - 2u)^2\sum_i |G_i|^2}.
\]

Proof. First observe that $\rho^\star_G(u, u) = 1$, which holds for (4.6). Because u and v are interchangeable in the definition of $\rho^\star_G(u, v)$, it is enough to show that when u < v,

\[
\rho^\star_G(u, v) = \frac{4u^2(1 - v)^2|G| + u(1 - v)(1 - 2u)(1 - 2v)\sum_i |G_i|^2}{\sigma^\star_G(u)\,\sigma^\star_G(v)}. \tag{4.7}
\]

Let $\rho_{G,n}(u, v) \triangleq \mathrm{cov}(Z_G([nu]), Z_G([nv]))$; then $\rho^\star_G(u, v) = \lim_{n\to\infty} \rho_{G,n}(u, v)$. Let s = [nu] and t = [nv]; then s < t, $\lim_{n\to\infty} s/n = u$, and $\lim_{n\to\infty} t/n = v$. Since

\[
\mathrm{cov}(Z_G(s), Z_G(t)) = \frac{E(R_G(s)R_G(t)) - E(R_G(s))E(R_G(t))}{\sqrt{\mathrm{Var}(R_G(s))\,\mathrm{Var}(R_G(t))}},
\]

and the expressions for E(RG(s)), E(RG(t)), Var(RG(s)), and Var(RG(t)) can be found in Lemma 3.2.2, we only need to compute

\[
E(R_G(s)R_G(t)) = \sum_{(i,j),(k,l)\in G} P(g_i(s) \neq g_j(s),\ g_k(t) \neq g_l(t)).
\]

By examining the different ways of placing i, j, k, l, we have

\[
P(g_i(s) \neq g_j(s),\ g_k(t) \neq g_l(t)) =
\begin{cases}
\dfrac{2s(n-t)}{n(n-1)} := q_1(s, t) & \text{if } \{i = k,\ j = l\} \text{ or } \{i = l,\ j = k\}, \\[2ex]
\dfrac{s(n-t)(n + 2t - 2s - 2)}{n(n-1)(n-2)} := q_2(s, t) & \text{if } \{i = k,\ j \neq l\},\ \{i = l,\ j \neq k\},\ \{j = k,\ i \neq l\}, \text{ or } \{j = l,\ i \neq k\}, \\[2ex]
\dfrac{4s(n-t)[(s-1)(n-s-1) + (t-s)(n-s-2)]}{n(n-1)(n-2)(n-3)} := q_3(s, t) & \text{if } i, j, k, l \text{ are all different.}
\end{cases}
\]

Then

\[
\begin{aligned}
E(R_G(s)R_G(t)) &= \sum_{(i,j),(k,l)\in G} P(g_i(s) \neq g_j(s),\ g_k(t) \neq g_l(t)) \\
&= \sum_{(i,j)\in G} q_1(s, t)
+ \sum_{\substack{(i,j),(i,k)\in G \\ j \neq k}} q_2(s, t)
+ \sum_{\substack{(i,j),(k,l)\in G \\ i,j,k,l \text{ all different}}} q_3(s, t) \\
&= (q_1(s, t) - 2q_2(s, t) + q_3(s, t))|G| + (q_2(s, t) - q_3(s, t))\sum_{i=1}^{n} |G_i|^2 + q_3(s, t)|G|^2.
\end{aligned}
\]

So

\[
\lim_{n\to\infty} E(R_G(s)R_G(t)) = 4u^2(1 - v)^2|G| + u(1 - v)(1 - 2u)(1 - 2v)\sum_{i=1}^{n} |G_i|^2 + 4uv(1 - u)(1 - v)|G|^2.
\]

Together with

\[
\lim_{n\to\infty} E(R_G(s)) = 2u(1 - u)|G|,
\]
\[
\lim_{n\to\infty} \mathrm{Var}(R_G(s)) = 4u^2(1 - u)^2|G| + u(1 - u)(1 - 2u)^2\sum_{i=1}^{n} |G_i|^2,
\]

and the analogous limits for RG(t), we have (4.7).

Remark 4.2.4. $\rho^\star_G(u, v)$, $u \le v$, is partially differentiable in u whenever $u \neq v$. Viewing v as fixed, its left and right derivatives at u = v are well defined for any order. We denote the k-th left and right derivatives by $f^{(k)}_{v,-}(0)$ and $f^{(k)}_{v,+}(0)$, respectively. It is not hard to check that $f'_{v,-}(0) = -f'_{v,+}(0)$.


4.3 Asymptotic Approximations

This section studies the asymptotic behavior of the two probabilities (4.1) and (4.2).

We need the function ν(x) defined by

\[
\nu(x) = 2x^{-2}\exp\left\{-2\sum_{m=1}^{\infty} m^{-1}\,\Phi\!\left(-\tfrac{1}{2}x m^{1/2}\right)\right\}, \qquad x > 0. \tag{4.8}
\]

This function is closely related to the Laplace transform of the overshoot over the boundary of a random walk. A simple approximation given in Siegmund and Yakir [2007] is sufficient for numerical purposes:

\[
\nu(x) \approx \frac{(2/x)\left(\Phi(x/2) - 0.5\right)}{(x/2)\Phi(x/2) + \phi(x/2)}. \tag{4.9}
\]
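To illustrate how (4.8) and (4.9) compare numerically, the sketch below evaluates both with the Python standard library. The truncation of the infinite sum after a fixed number of terms and the function names are choices made here for illustration; the truncated series is itself only an approximation to ν(x), and it needs more terms as x gets smaller.

```python
import math


def norm_cdf(z):
    # Standard normal CDF via the complementary error function,
    # which stays accurate in the lower tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def nu_series(x, terms=20000):
    """nu(x) from (4.8), truncating the infinite sum after `terms` terms."""
    s = sum(norm_cdf(-0.5 * x * math.sqrt(m)) / m for m in range(1, terms + 1))
    return 2.0 / (x * x) * math.exp(-2.0 * s)


def nu_approx(x):
    """Closed-form approximation (4.9)."""
    half = x / 2.0
    return ((2.0 / x) * (norm_cdf(half) - 0.5)
            / (half * norm_cdf(half) + norm_pdf(half)))
```

For moderate arguments such as x = 1 or x = 2 the two functions agree to within a few hundredths, and the closed form is far cheaper, which matters when ν is evaluated inside a numerical integral.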

The following proposition is the foundation for obtaining analytic approximations to

the probabilities.

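Proposition 4.3.1. Assume that n0 → ∞, n1 → ∞, b → ∞, and n → ∞ in such a way that, for some 0 < x0 < x1 < 1 and b0 > 0,

\[
n_i/n \to x_i \ (i = 0, 1) \quad \text{and} \quad b/\sqrt{n} \to b_0.
\]

Then as n → ∞,

\[
P\left(\max_{n_0 \le t \le n_1} Z^\star_G(t/n) > b\right) \sim b\phi(b)\int_{x_0}^{x_1} h^*_{r_0,r_1}(x)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(x)}\right)dx, \tag{4.10}
\]

\[
P\left(\max_{n_0 \le t_2 - t_1 \le n_1} Z^\star_G(t_1/n, t_2/n) > b\right) \sim b^3\phi(b)\int_{x_0}^{x_1}\left(h^*_{r_0,r_1}(x)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(x)}\right)\right)^2(1 - x)\,dx, \tag{4.11}
\]

where

\[
h^*_{r_0,r_1}(x) = \frac{1}{2x(1 - x)} + \frac{2}{4x(1 - x) + (1 - 2x)^2(r_1/r_0 - 4r_0)},
\]

with $r_0 \triangleq \lim_{n\to\infty} |G|/n$ and $r_1 \triangleq \lim_{n\to\infty} \sum_i |G_i|^2/n$.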

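Proof. We first show the single change-point case. We adopt Woodroofe’s method [Woodroofe, 1976, 1978] of conditioning on the first crossing:

\[
P\left(\max_{n_0 \le t \le n_1} Z^\star_G(t/n) > b\right)
= \sum_{n_0 \le t \le n_1}\int_0^\infty P\big(Z^\star_G(t/n) = b + dx\big)\,
P\Big(\max_{n_0 \le s < t} Z^\star_G(s/n) < b \,\Big|\, Z^\star_G(t/n) = b + dx\Big). \tag{4.12}
\]

By a change of measure and rearranging the terms, we have

\[
P\left(\max_{n_0 \le t \le n_1} Z^\star_G(t/n) > b\right)
= \frac{\phi(b)}{b}\sum_{n_0 \le t \le n_1}\int_0^\infty e^{-x - \frac{x^2}{2b^2}}\,
P\Big(\max_{n_0 \le s < t} b\big(Z^\star_G(s/n) - Z^\star_G(t/n)\big) < -x \,\Big|\, Z^\star_G(t/n) = b + \frac{x}{b}\Big)\,dx.
\]

Since b → ∞, if $x \sim o(b^2)$, then $x^2/(2b^2)$ is negligible relative to x and x/b is negligible relative to b; while if $x \sim O(b)$, then $x + x^2/(2b^2) \to \infty$ and the integrand goes to 0. So

\[
P\left(\max_{n_0 \le t \le n_1} Z^\star_G(t/n) > b\right)
\approx \frac{\phi(b)}{b}\sum_{n_0 \le t \le n_1}\int_0^\infty e^{-x}\,
P\Big(\max_{n_0 \le s < t} b\big(Z^\star_G(s/n) - Z^\star_G(t/n)\big) < -x \,\Big|\, Z^\star_G(t/n) = b\Big)\,dx.
\]

Notice that for u < v,

\[
b\big(Z^\star_G(u) - Z^\star_G(v)\big) \,\big|\, \big(Z^\star_G(v) = b\big) \sim N\big((\rho^\star_G(u, v) - 1)b^2,\ (1 - \rho^{\star 2}_G(u, v))b^2\big).
\]

Let δ = v − u. By Taylor expansion, we have

\[
\rho^\star_G(u, v) = 1 + f'_{v,-}(0)\delta + f''_{v,-}(0)\delta^2/2 + O(\delta^3),
\]
\[
\rho^{\star 2}_G(u, v) = 1 + 2f'_{v,-}(0)\delta + \big((f'_{v,-}(0))^2 + f''_{v,-}(0)\big)\delta^2 + O(\delta^3).
\]

So for $\delta \sim O(n^{-1})$,

\[
b\big(Z^\star_G(u) - Z^\star_G(v)\big) \,\big|\, \big(Z^\star_G(v) = b\big) \sim N\big(-f'_{v,-}(0)|\delta|b^2,\ 2f'_{v,-}(0)|\delta|b^2\big).
\]

One can show that, for $b = b_0\sqrt{n}$ and n → ∞,

\[
\lim_{k\to\infty}\limsup_{n\to\infty}\sum_{|i - t| > k} P\big(Z^\star_G(i/n) > b \,\big|\, Z^\star_G(t/n) = b + dx\big) = 0.
\]

Let $W^{(t)}_m$ be a random walk with $W^{(t)}_1 \sim N(\mu^{(t)}, (\sigma^{(t)})^2)$, where $\mu^{(t)} = \frac{1}{n}f'_{v,-}(0)b^2$ and $(\sigma^{(t)})^2 = 2\mu^{(t)}$. Then

\[
P\Big(\max_{n_0 \le s < t} b\big(Z^\star_G(s/n) - Z^\star_G(t/n)\big) < -x \,\Big|\, Z^\star_G(t/n) = b\Big)
\sim P\Big(\max_{n_0 \le s < t} -W^{(t)}_{t-s} < -x\Big)
\sim P\Big(\min_{m \ge 1} W^{(t)}_m > x\Big).
\]

Together with the fact that

\[
\int_0^\infty \exp\left(-2\mu x/\sigma^2\right) P\Big(\min_{m \ge 1} W_m > x\Big)\,dx = \mu\,\nu(2\mu/\sigma)
\]

for a random walk with $W_1 \sim N(\mu, \sigma^2)$ (see Siegmund [1992]), we have

\[
\lim_{n\to\infty} P\left(\max_{n_0 \le t \le n_1} Z^\star_G(t/n) > b\right)
\approx \lim_{n\to\infty}\frac{\phi(b)}{b}\sum_{n_0 \le t \le n_1} b_0^2\, f'_{t/n,-}(0)\,\nu\!\left(b_0\sqrt{2f'_{t/n,-}(0)}\right).
\]

For $f'_{t/n,-}(0)$, we take the derivative of $\rho^\star_G(u, v)$, and after some tedious calculation, we have

\[
f'_{v,-}(0) = \frac{1}{2v(1 - v)} + \frac{2}{4v(1 - v) + (1 - 2v)^2\left(\sum_i |G_i|^2/|G| - 4|G|/n\right)}. \tag{4.13}
\]

Putting everything together, we have

\[
\begin{aligned}
\lim_{n\to\infty} P\left(\max_{n_0 \le t \le n_1} Z^\star_G(t/n) > b\right)
&\approx \lim_{n\to\infty}\frac{\phi(b)}{b}\sum_{n_0 \le t \le n_1} b_0^2\, h^*_{r_0,r_1}(t/n)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(t/n)}\right) \\
&= \frac{\phi(b)}{b}\int_{x_0}^{x_1} b_0^2\, h^*_{r_0,r_1}(x)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(x)}\right) n\,dx \\
&= b\phi(b)\int_{x_0}^{x_1} h^*_{r_0,r_1}(x)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(x)}\right)dx.
\end{aligned}
\]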

Now, we show the changed interval case following the method of Siegmund [1988, 1992]. We omit most of the technical details, which follow these two papers given that

\[
\rho_{G,(u_1,u_2)}(\delta_1, \delta_2) \triangleq \mathrm{cov}\big(Z^\star_G(u_1 - \delta_1, u_2 - \delta_2),\ Z^\star_G(u_1, u_2)\big)
\]

is differentiable with the derivative being continuous except at δ1 = 0 and at δ2 = 0. A key intermediate form is

\[
P\left(\max_{n_0 \le t_2 - t_1 \le n_1} Z^\star_G(t_1/n, t_2/n) > b\right)
\approx \frac{\phi(b)}{b}\sum_{n_0 \le t_2 - t_1 \le n_1} C_1(t_1, t_2)b^2\, C_2(t_1, t_2)b^2 \times \nu\!\left(\sqrt{2C_1(t_1, t_2)b^2}\right)\nu\!\left(\sqrt{2C_2(t_1, t_2)b^2}\right), \tag{4.14}
\]

where C1, C2 are the partial derivatives

\[
C_1(nu_1, nu_2) \equiv \frac{1}{n}\left.\frac{\partial_-\rho_{G,(u_1,u_2)}(\delta_1, 0)}{\partial\delta_1}\right|_{\delta_1 = 0}
= -\frac{1}{n}\left.\frac{\partial_+\rho_{G,(u_1,u_2)}(\delta_1, 0)}{\partial\delta_1}\right|_{\delta_1 = 0},
\]
\[
C_2(nu_1, nu_2) \equiv -\frac{1}{n}\left.\frac{\partial_+\rho_{G,(u_1,u_2)}(0, \delta_2)}{\partial\delta_2}\right|_{\delta_2 = 0}.
\]

Under the permutation null, the processes derived from perturbation of the left and right end points,

\[
\{Z^\star_G((t_1 + k)/n,\ t_2/n),\ k = \dots, -2, -1, 0, 1, 2, \dots\}
\]

and

\[
\{Z^\star_G(t_1/n,\ (t_2 - k)/n),\ k = \dots, -2, -1, 0, 1, 2, \dots\},
\]

are identical in distribution to the process

\[
\{Z^\star_G((t_2 - t_1 - k)/n),\ k = \dots, -2, -1, 0, 1, 2, \dots\}.
\]

Thus, the partial derivatives are equal to the derivative in the single change-point scenario,

\[
C_1(t_1, t_2) = C_2(t_1, t_2) = \frac{1}{n}f'_{u_2 - u_1,-}(0).
\]

Substituting $\frac{1}{n}f'_{u_2 - u_1,-}(0)$ for C1(t1, t2) and C2(t1, t2) in (4.14), and noting that the double summation goes to an integral as n → ∞, yields (4.11).

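Therefore, when $\sum_{e\in G} |A_e||B_e| \sim o(n^{3/2})$ and $|G| \sim O(n)$, we approximate (4.1) and (4.2) by

\[
P\left(\max_{n_0 \le t \le n_1} Z_G(t) > b\right) \sim b\phi(b)\int_{x_0}^{x_1} h^*_{r_0,r_1}(x)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(x)}\right)dx, \tag{4.15}
\]

\[
P\left(\max_{n_0 \le t_2 - t_1 \le n_1} Z_G(t_1, t_2) > b\right) \sim b^3\phi(b)\int_{x_0}^{x_1}\left(h^*_{r_0,r_1}(x)\,\nu\!\left(b_0\sqrt{2h^*_{r_0,r_1}(x)}\right)\right)^2(1 - x)\,dx. \tag{4.16}
\]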
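The approximation (4.15) is a one-dimensional integral and is easy to evaluate numerically. The sketch below uses the midpoint rule together with the closed-form approximation (4.9) for ν; the function names, the grid size, and the example values r0 = 0.5 and r1 = 1 (as would arise if every node had degree one) are assumptions chosen for illustration. It also assumes that the denominator in h* stays positive over the integration range.

```python
import math


def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def nu_approx(x):
    """Approximation (4.9) to nu(x)."""
    half = x / 2.0
    return ((2.0 / x) * (norm_cdf(half) - 0.5)
            / (half * norm_cdf(half) + norm_pdf(half)))


def h_star(x, r0, r1):
    """h*_{r0,r1}(x) from Proposition 4.3.1."""
    return (1.0 / (2.0 * x * (1.0 - x))
            + 2.0 / (4.0 * x * (1.0 - x)
                     + (1.0 - 2.0 * x) ** 2 * (r1 / r0 - 4.0 * r0)))


def single_change_tail(b, n, n0, n1, r0, r1, grid=2000):
    """Approximation (4.15), integrated by the midpoint rule on [n0/n, n1/n]."""
    x0, x1 = n0 / n, n1 / n
    b0 = b / math.sqrt(n)
    step = (x1 - x0) / grid
    total = 0.0
    for k in range(grid):
        x = x0 + (k + 0.5) * step
        h = h_star(x, r0, r1)
        total += h * nu_approx(b0 * math.sqrt(2.0 * h)) * step
    return b * norm_pdf(b) * total


# Hypothetical setting: n = 1000, search range 50 <= t <= 950.
p_at_3 = single_change_tail(3.0, 1000, 50, 950, r0=0.5, r1=1.0)
p_at_35 = single_change_tail(3.5, 1000, 50, 950, r0=0.5, r1=1.0)
```

A critical value for a given level can then be found by a root search over b, since the approximation is decreasing in b over the range of practical interest.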

Remark 4.3.2. In practice, when we use (4.15) and (4.16) to approximate the probabilities, we use the form of $h^*_{r_0,r_1}(x)$ before taking the limit:

\[
h_G(n, x) = \frac{(n - 1)\left[h_1(n, x)|G| + h_2(n, x)\sum_{i=1}^{n}|G_i|^2 - h_3(n, x)|G|^2\right]}{2x(1 - x)\left[h_4(n, x)|G| + h_5(n, x)\sum_{i=1}^{n}|G_i|^2 - h_6(n, x)|G|^2\right]}, \tag{4.17}
\]

where

\[
\begin{aligned}
h_1(n, x) &= 4n(n - 1)\left(-2nx^2 + 2nx - 1\right), \\
h_2(n, x) &= n\left[n(n + 1)(1 - 2x)^2 - 2(n - 1)\right], \\
h_3(n, x) &= 4n\left[n(1 - 2x)^2 - 1\right], \\
h_4(n, x) &= 4n(n - 1)(nx - 1)(n - nx - 1), \\
h_5(n, x) &= n(n - 1)\left[n^2(1 - 2x)^2 - n + 2\right], \\
h_6(n, x) &= 4n\left[n^2(1 - 2x)^2 - 2n(1 - 3x + 3x^2) + 1\right].
\end{aligned}
\]
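The finite-n form (4.17) should approach the limiting function of Proposition 4.3.1 when |G| and the sum of squared degrees both grow linearly in n. The sketch below checks this numerically for one hypothetical configuration (|G| = n, so r0 = 1, and a sum of squared degrees equal to 5n, so r1 = 5); the function names and these values are assumptions made for illustration.

```python
def h_star(x, r0, r1):
    """Limiting function h*_{r0,r1}(x) from Proposition 4.3.1."""
    return (1.0 / (2.0 * x * (1.0 - x))
            + 2.0 / (4.0 * x * (1.0 - x)
                     + (1.0 - 2.0 * x) ** 2 * (r1 / r0 - 4.0 * r0)))


def h_finite(n, x, n_edges, sum_sq_deg):
    """Finite-n form h_G(n, x) of (4.17)."""
    h1 = 4 * n * (n - 1) * (-2 * n * x * x + 2 * n * x - 1)
    h2 = n * (n * (n + 1) * (1 - 2 * x) ** 2 - 2 * (n - 1))
    h3 = 4 * n * (n * (1 - 2 * x) ** 2 - 1)
    h4 = 4 * n * (n - 1) * (n * x - 1) * (n - n * x - 1)
    h5 = n * (n - 1) * (n * n * (1 - 2 * x) ** 2 - n + 2)
    h6 = 4 * n * (n * n * (1 - 2 * x) ** 2 - 2 * n * (1 - 3 * x + 3 * x * x) + 1)
    numerator = (n - 1) * (h1 * n_edges + h2 * sum_sq_deg - h3 * n_edges ** 2)
    denominator = (2 * x * (1 - x)
                   * (h4 * n_edges + h5 * sum_sq_deg - h6 * n_edges ** 2))
    return numerator / denominator


# Hypothetical large-n configuration with r0 = 1 and r1 = 5.
n_big = 10 ** 6
finite_value = h_finite(n_big, 0.3, n_edges=n_big, sum_sq_deg=5 * n_big)
limit_value = h_star(0.3, r0=1.0, r1=5.0)
```

For small n the two can differ noticeably, which is the reason for preferring the finite-n form in practice.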

4.4 Skewness Correction

Convergence of ZG(t) to normality is slow if t/n is close to 0 or 1. Also, MST and

NNG constructed on high dimensional data can be dominated by hubs under standard


distance measures, such as L2 or L1. We see this to be true, for example, under L2 dis-

tance when the dimension is 100 in the simulations of Section 4.5. In our simulations

we have noticed that if the underlying graph is void of hubs, then the statistic ZG(t)

is right-skewed, and the approximations (4.15) and (4.16) underestimate the true tail

probabilities. If a graph is dominated by hubs, then the statistic is left-skewed, and

the approximations overestimate the tail probabilities. The effects of skewness are

explored in more detail in Appendix B.2.

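By incorporating the skewness of the marginal distribution of the two processes, the two tail probabilities can be approximated by

\[
P\left(\max_{n_0 \le t \le n_1} Z_G(t) > b\right) \approx b\phi(b)\int_{x_0}^{x_1} S_G(nx)\,h_G(n, x)\,\nu\!\left(\sqrt{2b_0^2 h_G(n, x)}\right)dx, \tag{4.18}
\]

where

\[
S_G(t) = \frac{\exp\left(\tfrac{1}{2}(b - \theta_{b,G}(t))^2 + \tfrac{1}{6}\gamma_G(t)\theta_{b,G}(t)^3\right)}{\sqrt{1 + \gamma_G(t)\theta_{b,G}(t)}}, \tag{4.19}
\]

with $\gamma_G(t) = E[Z_G^3(t)]$ and $\theta_{b,G}(t) = \left(-1 + \sqrt{1 + 2\gamma_G(t)b}\right)/\gamma_G(t)$, and

\[
P\left(\max_{n_0 \le t_2 - t_1 \le n_1} Z_G(t_1, t_2) > b\right) \approx \frac{\phi(b)}{b}\sum_{n_0 \le t_2 - t_1 \le n_1} S_G(t_1, t_2)\left(b_0^2\, h_G(n, (t_2 - t_1)/n)\,\nu\!\left(b_0\sqrt{2h_G(n, (t_2 - t_1)/n)}\right)\right)^2, \tag{4.20}
\]

where

\[
S_G(t_1, t_2) = \frac{\exp\left(\tfrac{1}{2}(b - \theta_{b,G}(t_1, t_2))^2 + \tfrac{1}{6}\gamma_G(t_1, t_2)\theta_{b,G}(t_1, t_2)^3\right)}{\sqrt{1 + \gamma_G(t_1, t_2)\theta_{b,G}(t_1, t_2)}}, \tag{4.21}
\]

with $\gamma_G(t_1, t_2) = E[Z_G^3(t_1, t_2)]$ and $\theta_{b,G}(t_1, t_2) = \left(-1 + \sqrt{1 + 2\gamma_G(t_1, t_2)b}\right)/\gamma_G(t_1, t_2)$.

We next show how (4.18) and (4.20) are derived and give explicit expressions for γG(t) and γG(t1, t2).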

4.4.1 Derivation of (4.18) and (4.20)

Skewness correction in the change-point setting was first carried out in Tu et al. [1999] and later modified in Tang and Siegmund [2001]. Both apply a single third-moment correction that is the same for all t. In our problem, the extent of the skewness of ZG(t) depends on how close t is to the two ends of the sequence, so the correction needs to differ across t. Taking the single change-point case as an example, we use the third moment when calculating the marginal probability P(ZG(t) ∈ b + dx/b). We show how to incorporate the third moment through cumulant generating functions; the same treatment applies to the changed interval case.

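In the derivation below we suppress the dependence on the graph G and the time parameter t. Consider the probability measure $dQ_\theta = e^{\theta Z - \psi(\theta)}dP$, where $\psi(\theta) = \log E_P(e^{\theta Z})$. Choose $\theta_b$ such that $\dot\psi(\theta_b) = E_{Q_{\theta_b}}(Z) = b$. Then,

\[
P(Z \in b + dx/b) = E_P\left(1_{Z \in b + dx/b}\right) \approx e^{-\theta_b(b + x/b) + \psi(\theta_b)}\,Q_{\theta_b}(Z \in b + dx/b). \tag{4.22}
\]

Since under $Q_{\theta_b}$, Z is centered at b with variance $\ddot\psi(\theta_b)$, $Q_{\theta_b}(Z \in b + dx/b)$ can be approximated by the normal density,

\[
Q_{\theta_b}(Z \in b + dx/b) \approx \frac{1}{\sqrt{2\pi\ddot\psi(\theta_b)}}\exp\left(-\frac{x^2}{2b^2\ddot\psi(\theta_b)}\right) \approx \frac{1}{\sqrt{2\pi\ddot\psi(\theta_b)}}. \tag{4.23}
\]

The second approximation above is accurate for x/b → 0.

To obtain $\psi(\theta_b)$ and $\ddot\psi(\theta_b)$, we use Taylor expansions, noting that $\psi(0) = \dot\psi(0) = 0$, $\ddot\psi(0) = 1$, and $\dddot\psi(0) = E_P(Z^3) \triangleq \gamma$:

\[
\psi(\theta) \approx \psi(0) + \dot\psi(0)\theta + \frac{\ddot\psi(0)\theta^2}{2} + \frac{\dddot\psi(0)\theta^3}{6} = \frac{\theta^2}{2}\left(1 + \frac{\gamma\theta}{3}\right), \tag{4.24}
\]
\[
\ddot\psi(\theta) \approx \ddot\psi(0) + \dddot\psi(0)\theta = 1 + \gamma\theta. \tag{4.25}
\]

Combining (4.22), (4.23), (4.24), and (4.25) gives

\[
P(Z \in b + dx/b) \approx \frac{1}{\sqrt{2\pi(1 + \gamma\theta_b)}}\exp\left(-\theta_b b - x\theta_b/b + \theta_b^2(1 + \gamma\theta_b/3)/2\right). \tag{4.26}
\]

For an approximation of $\theta_b$, we solve $\dot\psi(\theta_b) = b$ using the third-order Taylor approximation of ψ,

\[
b = \dot\psi(\theta_b) \approx \dot\psi(0) + \ddot\psi(0)\theta_b + \frac{\dddot\psi(0)\theta_b^2}{2} = \theta_b + \frac{1}{2}\gamma\theta_b^2, \tag{4.27}
\]

yielding

\[
\theta_b \approx \left(-1 + \sqrt{1 + 2\gamma b}\right)/\gamma. \tag{4.28}
\]

Note that when γ = 0, $\theta_b = b$. (4.18) follows by using (4.26) in (4.12) in the proof of Proposition 4.3.1 and approximating the $\theta_b x/b$ term in the exponent by x.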

Remark 4.4.1. The term $\theta_b(t)$ is an approximation to the solution of $\dot\psi_t(\theta) = b$, where $\psi_t(\theta)$ is the cumulant generating function of Z(t). By a third-order Taylor approximation to $\psi_t(\theta)$, we have $\dot\psi_t^{-1}(b) \approx \left(-1 + \sqrt{1 + 2\gamma(t)b}\right)/\gamma(t)$. When the marginal distribution is left-skewed, it is possible that γ(t) is too small for 1 + 2γ(t)b to be positive. This does not mean that the solution to $\dot\psi_t(\theta) = b$ does not exist, but that higher moments are needed to get a good approximation. Here we apply an easy heuristic fix to this problem: since 1 + 2γ(t)b < 0 usually happens when t/n is close to 0 or 1, within this problematic region $\theta_b(t)$ can be extrapolated using its values outside the region. The details of the extrapolation method are given in Appendix B.2.
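The quantities in (4.28) and (4.19) are simple to compute once the skewness γ is known. The sketch below is illustrative only: the function names are made up here, and instead of the extrapolation described in Appendix B.2 it simply raises an error when 1 + 2γb is negative.

```python
import math


def theta_b(gamma, b):
    """Approximate solution (4.28) of theta + gamma * theta**2 / 2 = b."""
    if abs(gamma) < 1e-12:
        return b                      # limit of (4.28) as gamma -> 0
    disc = 1.0 + 2.0 * gamma * b
    if disc < 0.0:
        # The region handled by extrapolation in Appendix B.2; not done here.
        raise ValueError("1 + 2*gamma*b < 0: no real root at third order")
    return (-1.0 + math.sqrt(disc)) / gamma


def skewness_factor(gamma, b):
    """Correction factor S of (4.19) for skewness gamma and threshold b."""
    th = theta_b(gamma, b)
    return (math.exp(0.5 * (b - th) ** 2 + gamma * th ** 3 / 6.0)
            / math.sqrt(1.0 + gamma * th))
```

With γ = 0 the factor equals one and the Gaussian approximation is recovered. A positive γ gives a factor above one and a negative γ a factor below one, matching the direction of the bias described at the start of this section.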

4.4.2 Explicit Expressions for Skewness

We have

\[
E(Z_G^3(t)) = \frac{E^3(R_G(t)) + 3E(R_G(t))\mathrm{Var}(R_G(t)) - E(R_G^3(t))}{(\mathrm{Var}(R_G(t)))^{3/2}},
\]
\[
E(Z_G^3(t_1, t_2)) = \frac{E^3(R_G(t_1, t_2)) + 3E(R_G(t_1, t_2))\mathrm{Var}(R_G(t_1, t_2)) - E(R_G^3(t_1, t_2))}{(\mathrm{Var}(R_G(t_1, t_2)))^{3/2}},
\]

where $E^3(\cdot)$ denotes the cube of the expectation. The explicit expressions of E(RG(t)), Var(RG(t)), E(RG(t1, t2)), and Var(RG(t1, t2)) are given in Lemma 3.2.2 and Lemma 3.2.4. The explicit expressions of the third moments $E(R_G^3(t))$ and $E(R_G^3(t_1, t_2))$ are given in the following lemma.


Lemma 4.4.2.

\[
\begin{aligned}
E(R_G^3(t)) = {} & p_1(t)|G| + \frac{3}{2}p_1(t)\sum_i |G_i|(|G_i| - 1) \\
& + 3p_2(t)\left(|G|(|G| - 1) + \frac{1}{2}\sum_i |G_i|(|G_i| - 1)(|G| - |G_i|)\right) \\
& - 3p_2(t)\left(\sum_i |G_i|(|G_i| - 1) + \sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1)\right) \\
& + p_3(t)\sum_i |G_i|(|G_i| - 1)(|G_i| - 2) \\
& + p_4(t)\left(|G|(|G| - 1)(|G| - 2) + 6\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1)\right) \\
& - 2p_4(t)\sum_{(i,j)\in G}\left|\{k : (i, k), (j, k) \in G\}\right| \\
& - p_4(t)\sum_i |G_i|(|G_i| - 1)(3|G| - 2|G_i| - 2).
\end{aligned}
\]

The functions p1(t) and p2(t) are given in Lemma 3.2.2, and

\[
p_3(t) := \frac{t(n - t)\left((n - t - 1)(n - t - 2) + (t - 1)(t - 2)\right)}{n(n - 1)(n - 2)(n - 3)},
\]
\[
p_4(t) := \frac{8t(t - 1)(t - 2)(n - t)(n - t - 1)(n - t - 2)}{n(n - 1)(n - 2)(n - 3)(n - 4)(n - 5)}.
\]

Also

\[
E(R_G^3(t_1, t_2)) = E(R_G^3(t_2 - t_1)). \tag{4.29}
\]

The terms in $E(R_G^3(t))$ can be rearranged and written in other forms. The expansion shown here makes it easier to understand the origin of each term in the context of the proof.

Proof. For the un-centered process RG(t),

\[
E(R_G^3(t)) = \sum_{(i,j),(k,l),(u,v)\in G} P(g_i(t) \neq g_j(t),\ g_k(t) \neq g_l(t),\ g_u(t) \neq g_v(t)).
\]

There are in total eight different configurations for three edges randomly chosen from the graph, and we derive $P(g_i(t) \neq g_j(t),\ g_k(t) \neq g_l(t),\ g_u(t) \neq g_v(t)) \triangleq P_3$ separately for each of them.

1) The three edges are actually the same edge.

\[
P_3 = P(g_i(t) \neq g_j(t)) = \frac{2t(n - t)}{n(n - 1)}.
\]

2) Two edges are the same and share one node with the third edge.

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_i(t) \neq g_k(t)) = \frac{t(n - t)}{n(n - 1)}.
\]

3) Two edges are the same and do not share any node with the third edge.

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_k(t) \neq g_l(t)) = \frac{4t(t - 1)(n - t)(n - t - 1)}{n(n - 1)(n - 2)(n - 3)}.
\]

4) The three edges share one node, and no two of them share their other node (star-shaped).

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_i(t) \neq g_k(t),\ g_i(t) \neq g_l(t)) = \frac{t(n - t)\left((n - t - 1)(n - t - 2) + (t - 1)(t - 2)\right)}{n(n - 1)(n - 2)(n - 3)}.
\]

5) One edge shares one node with another edge and shares its other node with the third edge. The second and third edges share no node (linear chain).

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_i(t) \neq g_k(t),\ g_j(t) \neq g_l(t)) = \frac{2t(t - 1)(n - t)(n - t - 1)}{n(n - 1)(n - 2)(n - 3)}.
\]

6) The three edges form a triangle.

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_j(t) \neq g_k(t),\ g_k(t) \neq g_i(t)) = 0.
\]

7) Two edges share one node, and share no node with the third edge.

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_i(t) \neq g_k(t),\ g_u(t) \neq g_v(t)) = \frac{2t(t - 1)(n - t)(n - t - 1)}{n(n - 1)(n - 2)(n - 3)}.
\]

8) No pair of the three edges shares any node.

\[
P_3 = P(g_i(t) \neq g_j(t),\ g_k(t) \neq g_l(t),\ g_u(t) \neq g_v(t)) = \frac{8t(t - 1)(t - 2)(n - t)(n - t - 1)(n - t - 2)}{n(n - 1)(n - 2)(n - 3)(n - 4)(n - 5)}.
\]

The following lists the number of occurrences of each of the above cases:

1) $|G|$

2) $3\sum_i |G_i|(|G_i| - 1)$

3) $3|G|(|G| - 1) - 3\sum_i |G_i|(|G_i| - 1)$

4) $\sum_i |G_i|(|G_i| - 1)(|G_i| - 2)$

5) $6\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1) - 6\sum_{(i,j)\in G}|\{k : (i, k), (j, k) \in G\}|$

6) $2\sum_{(i,j)\in G}|\{k : (i, k), (j, k) \in G\}|$

7) $3\sum_i |G_i|(|G_i| - 1)(|G| - |G_i|) + 6\sum_{(i,j)\in G}|\{k : (i, k), (j, k) \in G\}| - 12\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1)$

8) $|G|(|G| - 1)(|G| - 2) + 6\sum_{(i,j)\in G}(|G_i| - 1)(|G_j| - 1) - 2\sum_{(i,j)\in G}|\{k : (i, k), (j, k) \in G\}| - \sum_i |G_i|(|G_i| - 1)(3|G| - 2|G_i| - 2)$

The lemma follows by summing up all of the probabilities as enumerated above.

For the changed interval, the probabilities above depend only on the sizes of the two groups, t2 − t1 and n − (t2 − t1), so $E(R_G^3(t_1, t_2)) = E(R_G^3(t_2 - t_1))$.
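Because the third-moment formula has many terms, a brute-force check on a small graph is a useful safeguard when implementing it. The sketch below evaluates the expression of Lemma 4.4.2 with exact rational arithmetic and compares it with the average of the cubed edge count over all permutations. It assumes a simple graph on n ≥ 6 nodes; the function names and the toy graph (a triangle with a tail) are chosen here for illustration.

```python
from fractions import Fraction
from itertools import permutations


def falling(x, k):
    """Falling factorial x (x-1) ... (x-k+1)."""
    out = 1
    for s in range(k):
        out *= x - s
    return out


def third_moment_formula(edges, n, t):
    """E[R_G(t)^3] as stated in Lemma 4.4.2 (simple graph, n >= 6)."""
    m = len(edges)
    nbrs = {v: set() for v in range(1, n + 1)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    d = {v: len(nbrs[v]) for v in nbrs}          # node degrees |G_i|

    a = t * (n - t)
    p1 = Fraction(2 * a, falling(n, 2))
    p2 = Fraction(4 * falling(t, 2) * falling(n - t, 2), falling(n, 4))
    p3 = Fraction(a * ((n - t - 1) * (n - t - 2) + (t - 1) * (t - 2)),
                  falling(n, 4))
    p4 = Fraction(8 * falling(t, 3) * falling(n - t, 3), falling(n, 6))

    s2 = sum(d[v] * (d[v] - 1) for v in d)
    s3 = sum(d[v] * (d[v] - 1) * (d[v] - 2) for v in d)
    s_mix = sum(d[v] * (d[v] - 1) * (m - d[v]) for v in d)
    s_last = sum(d[v] * (d[v] - 1) * (3 * m - 2 * d[v] - 2) for v in d)
    s_edge = sum((d[i] - 1) * (d[j] - 1) for i, j in edges)
    s_tri = sum(len(nbrs[i] & nbrs[j]) for i, j in edges)  # common neighbours

    return (p1 * m + Fraction(3, 2) * p1 * s2
            + 3 * p2 * (m * (m - 1) + Fraction(1, 2) * s_mix)
            - 3 * p2 * (s2 + s_edge)
            + p3 * s3
            + p4 * (falling(m, 3) + 6 * s_edge)
            - 2 * p4 * s_tri
            - p4 * s_last)


def third_moment_brute_force(edges, n, t):
    """Exact E[R_G(t)^3] over all n! permutations of the time labels."""
    total = count = 0
    for perm in permutations(range(1, n + 1)):
        r = sum((perm[i - 1] > t) != (perm[j - 1] > t) for i, j in edges)
        total += r ** 3
        count += 1
    return Fraction(total, count)


# Hypothetical toy graph on n = 6 nodes: a triangle (1, 2, 3) with a tail.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6)]
```

The toy graph contains a triangle, a star, chains, and disjoint edges, so all eight configurations in the proof contribute to the count.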

For MDP, only cases 1, 3, and 8 are possible, and the number of occurrences of each case is

1) $|G| = n$

3) $3|G|(|G|-1) = 3n(n-1)$

8) $|G|(|G|-1)(|G|-2) = n(n-1)(n-2)$

By summing the probabilities of these 3 cases, we have a much simpler expression for $E(R_G^3(t))$ for MDP:

$$E(R_G^3(t)) = p_1(t)\,n + p_2(t)\,3n(n-1) + p_4(t)\,n(n-1)(n-2)$$

$$= \big(p_1(t) - 3p_2(t) + 2p_4(t)\big)\,n + 3\big(p_2(t) - p_4(t)\big)\,n^2 + p_4(t)\,n^3.$$
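The second equality is a matter of expanding the products in $n$ and collecting powers. Writing $p_k$ for $p_k(t)$, the intermediate step is:

```latex
\begin{align*}
p_1 n + 3p_2\, n(n-1) + p_4\, n(n-1)(n-2)
  &= p_1 n + 3p_2\,(n^2 - n) + p_4\,(n^3 - 3n^2 + 2n) \\
  &= (p_1 - 3p_2 + 2p_4)\, n + 3(p_2 - p_4)\, n^2 + p_4\, n^3 .
\end{align*}
```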

4.5 Numerical Studies

To check the analytic approximations to p-values, we compare the critical values obtained from (4.15), (4.18), (4.16), and (4.20) to those obtained from permutation, under various simulation settings. In each simulation, i.i.d. sequences of length 1000 were generated from a given distribution $F_0$ in $\mathbb{R}^d$. MST, MDP, and NNG were constructed on the data based on Euclidean distance. For each graph, analytic and permutation critical values were computed for both 0.05 and 0.01 p-value thresholds.

4.5.1 Single Change-Point Alternative

Tables 4.1 - 4.5 show the results for the single change-point alternative. In the column

headers, “A1” denotes the critical values obtained assuming Gaussianity (4.15), “A2”

denotes the critical values obtained after correcting for skewness (4.18), and “Per”

denotes the critical values obtained by 10,000 permutations.
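A permutation critical value of the kind reported in the "Per" columns can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the tables: it takes $Z(t)$ to be the negated, standardized edge count $-(R_G(t) - E R_G(t))/\mathrm{sd}(R_G(t))$, and it standardizes with the mean and standard deviation estimated from the permutations themselves rather than with the analytic formulas of Chapter 4. The function names and the default `n_perm` are hypothetical.

```python
import random


def edge_counts(edges, position, n):
    """R(t) for t = 1..n-1: number of edges joining the first t observations
    to the remaining n - t. position[v] is the 1-based time index of node v."""
    diff = [0] * (n + 1)
    for a, b in edges:
        lo, hi = sorted((position[a], position[b]))
        diff[lo] += 1          # the edge crosses the cut for lo <= t < hi
        diff[hi] -= 1
    counts, running = [], 0
    for t in range(1, n):
        running += diff[t]
        counts.append(running)
    return counts              # counts[t - 1] == R(t)


def permutation_critical_value(edges, n, n0, n1, alpha=0.05, n_perm=1000, seed=0):
    """Empirical (1 - alpha) quantile of max_{n0 <= t <= n1} Z(t) under
    random relabelling of the time order (assumed form of Z, see lead-in)."""
    rng = random.Random(seed)
    order = list(range(1, n + 1))
    sims = []
    for _ in range(n_perm):
        rng.shuffle(order)
        sims.append(edge_counts(edges, order, n))
    maxima = [float("-inf")] * n_perm
    for t in range(n0, n1 + 1):
        column = [s[t - 1] for s in sims]
        mean = sum(column) / n_perm
        sd = (sum((x - mean) ** 2 for x in column) / (n_perm - 1)) ** 0.5
        if sd == 0:
            continue
        for b, x in enumerate(column):
            # Few between-group edges is evidence of a change, hence the sign flip.
            maxima[b] = max(maxima[b], -(x - mean) / sd)
    maxima.sort()
    return maxima[min(n_perm - 1, int((1 - alpha) * n_perm))]
```

The difference-array trick in `edge_counts` gives all $n-1$ values of $R(t)$ in one pass over the edges, which keeps each permutation cheap even for n = 1000.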

Six different choices for $F_0$ are shown, for two different distributions (standard normal and exponential with mean 1), each in three different dimensions ($d = 1$, 10, or 100). For $d = 10$ or 100, each element of the data vector is generated independently from the given distribution. The analytic approximations also depend on constraints on the region in which the change-point is searched. These are reflected in the choice of $n_0$ and $n_1$ ($l_0$ and $l_1$ for the changed interval alternative). For simplicity, we set $n_1 = n - n_0$, so that both groups are required to have at least $n_0$ observations. In general, the analytic approximations become less precise as the minimum segment length decreases. This is mainly because the Gaussian approximation (and skewness correction) to the distribution of $Z(t)$ degrades for small samples.

Both the analytic and permutation p-values depend on certain characteristics of the graph's structure. The structures of MST (for $d \geq 2$) and NNG depend on the underlying data set, and thus the critical values vary by simulation run. In such cases, we show results for 5 randomly simulated sequences. Two characteristics of the graph are also shown for each simulated sequence: the sum of squared node degrees ($\sum_i |E_i|^2$) and the maximum node degree ($D$). These quantities give some intuition on the size and density of hubs in the graph. Since the MST for any one-dimensional data set is a chain, the critical values for the MST-based scan do not change with simulation run for each setting of the parameters.
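The two graph summaries can be computed directly from an edge list. The sketch below builds a Euclidean MST with a plain $O(n^2)$ Prim's algorithm and then reports the sum of squared node degrees and the maximum degree. It is a minimal illustration with hypothetical function names, not the software used for the simulations.

```python
import math
import random


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over `points`
    (a list of equal-length coordinate tuples). Returns a list of edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(points[u], points[v])
                if dist < best[v]:
                    best[v], parent[v] = dist, u
    return edges


def degree_summaries(edges, n):
    """Return (sum of squared node degrees, maximum node degree)."""
    degree = [0] * n
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sum(d * d for d in degree), max(degree)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (1, 10, 100):
        pts = [tuple(rng.gauss(0, 1) for _ in range(d)) for _ in range(200)]
        print(d, degree_summaries(minimum_spanning_tree(pts), len(pts)))
```

For one-dimensional data with distinct values the tree is a chain, so every interior node has degree 2 and the maximum degree is 2 regardless of the sample.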

The structure of the MDP graph is the same for all data sets. Therefore, the critical values for the MDP-based scan depend only on $n$, $n_0$, and $n_1$ ($l_0$ and $l_1$ for the changed interval alternative); they do not depend on the dimension or the underlying distribution of the data. As emphasized in Rosenbaum [2005], it is a truly distribution-free method, which can be desirable in high dimensions.

We can see from the tables that the analytic approximations after skewness correction perform much better than the analytic approximations under the Gaussian assumption, especially as the dimension increases. The accuracy of the skew-corrected approximation does not degrade significantly with dimension. For the statistics based on MDP, the skew-corrected approximations work quite well when the minimum window size is as small as 25 at the 0.05 significance level, and 50 at the 0.01 significance level. For the statistics based on MST and NNG, the skew-corrected approximations remain accurate for window sizes as small as 25 at both the 0.05 and 0.01 significance levels.

There is not much difference between results for simulations based on normal and those based on exponential distributions. The main factor influencing approximation accuracy, other than the minimum window size, is the dimension ($d$). As dimension increases, the graph becomes more “star-shaped”, as reflected by the increase in both $\sum_i |E_i|^2$ and $D$. As shown in Section 4.4, skewness and other higher order moments of $Z(t)$ are functions of polynomials of the node degrees. Thus the increase in the number and density of hubs makes skewness correction important in high dimensions.

Table 4.1: Critical values for the single change-point scan statistic based on MST at 0.05 significance level. n = 1000.

                  Critical Values                                             Graph
                  n0 = 100            n0 = 50             n0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
d = 1             2.98  3.05  3.04    3.08  3.22  3.23    3.14  3.39  3.49     4994   2
N(0,1), d = 10    2.92  2.90  2.90    3.00  2.95  2.95    3.05  2.98  2.96     5430   8
                  2.92  2.89  2.89    3.00  2.95  2.92    3.05  2.97  2.95     5438   7
                  2.92  2.90  2.87    3.00  2.95  2.94    3.05  2.98  2.96     5394   7
                  2.92  2.89  2.86    3.00  2.94  2.90    3.05  2.97  2.92     5534   8
                  2.92  2.89  2.89    3.00  2.95  2.92    3.05  2.97  2.95     5460   7
Exp(1), d = 10    2.93  2.91  2.89    3.01  2.97  2.96    3.06  3.00  2.97     5064   7
                  2.93  2.91  2.88    3.01  2.97  2.92    3.06  3.00  2.95     5082   7
                  2.93  2.91  2.91    3.01  2.98  2.97    3.06  3.01  3.00     5028   5
                  2.93  2.91  2.87    3.01  2.98  2.93    3.06  3.01  2.97     5028   6
                  2.93  2.91  2.88    3.01  2.96  2.92    3.06  2.98  2.94     5180   9
N(0,1), d = 100   2.86  2.69  2.68    2.94  2.70  2.68    3.00  2.70  2.68    12454  38
                  2.86  2.72  2.72    2.95  2.74  2.72    3.00  2.74  2.72    10904  38
                  2.86  2.70  2.66    2.94  2.71  2.66    3.00  2.71  2.66    11294  42
                  2.87  2.72  2.68    2.95  2.74  2.68    3.00  2.74  2.68    10690  40
                  2.86  2.69  2.65    2.94  2.70  2.65    3.00  2.70  2.65    11722  40
Exp(1), d = 100   2.85  2.64  2.60    2.93  2.65  2.60    2.99  2.65  2.60    14706  56
                  2.87  2.77  2.76    2.95  2.80  2.77    3.01  2.81  2.77     9608  25
                  2.84  2.62  2.53    2.93  2.62  2.53    2.99  2.62  2.53    15536  77
                  2.86  2.74  2.69    2.95  2.76  2.69    3.00  2.76  2.69    10890  30
                  2.86  2.72  2.66    2.94  2.73  2.66    3.00  2.73  2.66    12018  39

4.5.2 Changed Interval Alternative

Tables 4.6 - 4.10 show the results of p-value approximations for the changed interval alternative. The notation and simulation settings are identical to those for the single change-point alternative in Section 4.5.1, except that $n_0$ is replaced by $l_0$ for the smallest window size. ($l_1$ is set to $n - l_0$.)


Table 4.2: Critical values for the single change-point scan statistic based on MST at 0.01 significance level. n = 1000.

                  Critical Values                                             Graph
                  n0 = 100            n0 = 50             n0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
d = 1             3.52  3.62  3.67    3.60  3.81  3.85    3.65  4.05  4.31     4994   2
N(0,1), d = 10    3.47  3.43  3.46    3.53  3.46  3.48    3.57  3.48  3.48     5430   8
                  3.47  3.43  3.44    3.53  3.46  3.46    3.57  3.47  3.46     5438   7
                  3.47  3.43  3.44    3.53  3.46  3.47    3.58  3.48  3.48     5394   7
                  3.47  3.42  3.38    3.53  3.46  3.40    3.57  3.47  3.41     5534   8
                  3.47  3.43  3.44    3.53  3.46  3.46    3.57  3.47  3.46     5460   7
Exp(1), d = 10    3.48  3.45  3.40    3.54  3.49  3.44    3.58  3.50  3.45     5064   7
                  3.48  3.44  3.40    3.54  3.48  3.42    3.58  3.50  3.44     5082   7
                  3.48  3.45  3.47    3.54  3.49  3.49    3.58  3.51  3.52     5028   5
                  3.48  3.45  3.41    3.54  3.49  3.44    3.58  3.51  3.46     5028   6
                  3.48  3.44  3.49    3.54  3.47  3.53    3.58  3.48  3.54     5180   9
N(0,1), d = 100   3.42  3.17  3.19    3.48  3.17  3.19    3.53  3.17  3.19    12454  38
                  3.42  3.21  3.24    3.49  3.21  3.24    3.53  3.21  3.24    10904  38
                  3.42  3.19  3.17    3.49  3.19  3.17    3.53  3.19  3.17    11294  42
                  3.42  3.22  3.18    3.49  3.22  3.18    3.53  3.22  3.18    10690  40
                  3.42  3.18  3.21    3.49  3.18  3.21    3.53  3.18  3.21    11722  40
Exp(1), d = 100   3.41  3.14  3.12    3.48  3.14  3.12    3.52  3.14  3.12    14706  56
                  3.43  3.28  3.26    3.49  3.28  3.26    3.54  3.28  3.26     9608  25
                  3.41  3.15  3.10    3.48  3.15  3.10    3.52  3.15  3.10    15536  77
                  3.42  3.24  3.21    3.49  3.24  3.21    3.53  3.24  3.21    10890  30
                  3.42  3.22  3.13    3.48  3.22  3.13    3.53  3.22  3.13    12018  39


Table 4.3: Critical values for the single change-point scan statistic based on MDP. n = 1000.

significance level = 0.05

                     d = 1             d = 10            d = 100
n0    A1    A2       N(0,1)  Exp(1)    N(0,1)  Exp(1)    N(0,1)  Exp(1)
200   2.82  2.84     2.83    2.81      2.85    2.85      2.85    2.83
100   2.98  3.07     3.06    3.04      3.08    3.08      3.07    3.05
50    3.08  3.27     3.30    3.29      3.35    3.36      3.35    3.31
25    3.14  3.48     3.54    3.58      3.57    3.66      3.60    3.60

significance level = 0.01

                     d = 1             d = 10            d = 100
n0    A1    A2       N(0,1)  Exp(1)    N(0,1)  Exp(1)    N(0,1)  Exp(1)
200   3.38  3.43     3.39    3.38      3.44    3.46      3.45    3.44
100   3.52  3.66     3.66    3.64      3.67    3.75      3.67    3.59
50    3.60  3.90     3.99    3.99      3.94    4.05      3.95    3.99
25    3.65  4.21     4.61    4.65      4.78    4.72      4.59    4.81

From the tables, conclusions similar to those for the single change-point alternative can be drawn. The analytic approximation after skewness correction performs much better than the analytic approximation under the Gaussian assumption, especially as the dimension increases. The accuracy of the skew-corrected approximation does not degrade significantly with dimension. It does well for MST- and NNG-based tests when the smallest window size considered is as small as 25 at both the 0.05 and 0.01 significance levels, and for the MDP-based test when the smallest window size is 50.


Table 4.4: Critical values for the single change-point scan statistic based on NNG at 0.05 significance level. n = 1000.

                  Critical Values                                             Graph
                  n0 = 100            n0 = 50             n0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
N(0,1), d = 1     2.96  2.98  2.95    3.04  3.07  3.03    3.10  3.13  3.08     2008   2
                  2.96  2.98  2.97    3.05  3.07  3.05    3.10  3.13  3.09     1972   2
                  2.96  2.98  3.01    3.04  3.07  3.10    3.10  3.12  3.13     2032   2
                  2.96  2.98  2.97    3.04  3.07  3.04    3.10  3.13  3.09     2008   2
                  2.96  2.99  3.01    3.05  3.08  3.10    3.10  3.13  3.13     1954   2
Exp(1), d = 1     2.96  2.99  2.98    3.05  3.08  3.08    3.10  3.13  3.11     1948   2
                  2.96  2.98  2.96    3.04  3.07  3.07    3.10  3.13  3.11     2038   2
                  2.96  2.98  2.96    3.04  3.08  3.05    3.10  3.13  3.09     2014   2
                  2.96  2.98  2.99    3.04  3.07  3.08    3.10  3.12  3.13     2008   2
                  2.96  2.98  2.99    3.04  3.08  3.08    3.10  3.13  3.13     2038   2
N(0,1), d = 10    2.94  2.92  2.89    3.02  2.97  2.93    3.07  3.00  2.96     3370   6
                  2.94  2.91  2.90    3.02  2.97  2.95    3.07  2.99  2.96     3502   6
                  2.94  2.91  2.89    3.01  2.96  2.95    3.06  2.98  2.96     3444   7
                  2.94  2.91  2.91    3.01  2.96  2.94    3.06  2.98  2.96     3436   6
                  2.94  2.91  2.88    3.02  2.97  2.93    3.07  2.99  2.94     3330   6
Exp(1), d = 10    2.94  2.92  2.91    3.02  2.98  2.96    3.07  3.00  2.98     3144   5
                  2.94  2.92  2.92    3.02  2.98  2.97    3.07  3.00  2.99     3096   6
                  2.94  2.92  2.92    3.02  2.98  2.98    3.07  3.01  3.01     3118   6
                  2.94  2.93  2.92    3.02  2.98  2.97    3.07  3.01  2.99     3114   5
                  2.94  2.92  2.91    3.02  2.98  2.98    3.07  3.01  3.00     3152   6
N(0,1), d = 100   2.87  2.65  2.62    2.95  2.65  2.62    3.00  2.65  2.62     9382  52
                  2.87  2.73  2.70    2.95  2.75  2.71    3.01  2.76  2.71     8466  24
                  2.88  2.76  2.72    2.96  2.78  2.72    3.01  2.79  2.72     7756  20
                  2.86  2.59  2.56    2.94  2.59  2.56    3.00  2.59  2.56    11092  68
                  2.87  2.68  2.64    2.95  2.69  2.64    3.00  2.69  2.64     9538  38
Exp(1), d = 100   2.86  2.71  2.70    2.95  2.72  2.70    3.00  2.73  2.70    10222  34
                  2.86  2.72  2.68    2.95  2.74  2.69    3.00  2.74  2.69    10390  37
                  2.86  2.70  2.64    2.94  2.71  2.64    3.00  2.71  2.64    11574  35
                  2.87  2.74  2.72    2.95  2.76  2.73    3.01  2.77  2.73     8782  22
                  2.87  2.73  2.68    2.95  2.74  2.68    3.01  2.74  2.68     8622  41


Table 4.5: Critical values for the single change-point scan statistic based on NNG at 0.01 significance level. n = 1000.

                  Critical Values                                             Graph
                  n0 = 100            n0 = 50             n0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
N(0,1), d = 1     3.50  3.53  3.53    3.57  3.61  3.59    3.61  3.65  3.63     2008   2
                  3.50  3.54  3.52    3.57  3.61  3.63    3.61  3.65  3.66     1972   2
                  3.50  3.53  3.58    3.57  3.61  3.66    3.61  3.65  3.71     2032   2
                  3.50  3.53  3.56    3.57  3.61  3.63    3.61  3.65  3.68     2008   2
                  3.50  3.54  3.53    3.57  3.62  3.64    3.61  3.66  3.65     1954   2
Exp(1), d = 1     3.50  3.54  3.50    3.57  3.62  3.61    3.61  3.66  3.64     1948   2
                  3.50  3.53  3.57    3.57  3.61  3.63    3.61  3.65  3.65     2038   2
                  3.50  3.54  3.52    3.57  3.61  3.63    3.61  3.66  3.66     2014   2
                  3.50  3.53  3.60    3.57  3.61  3.66    3.61  3.65  3.71     2008   2
                  3.50  3.54  3.54    3.57  3.62  3.58    3.61  3.66  3.66     2038   2
N(0,1), d = 10    3.48  3.45  3.46    3.55  3.49  3.48    3.59  3.50  3.49     3370   6
                  3.48  3.44  3.47    3.54  3.48  3.48    3.59  3.49  3.48     3502   6
                  3.48  3.44  3.42    3.54  3.47  3.45    3.58  3.48  3.46     3444   7
                  3.48  3.44  3.43    3.54  3.47  3.46    3.59  3.48  3.47     3436   6
                  3.48  3.44  3.44    3.55  3.48  3.48    3.59  3.49  3.48     3330   6
Exp(1), d = 10    3.49  3.45  3.46    3.55  3.49  3.51    3.59  3.50  3.51     3144   5
                  3.49  3.45  3.48    3.55  3.49  3.52    3.59  3.50  3.52     3096   6
                  3.49  3.46  3.48    3.55  3.49  3.54    3.59  3.51  3.57     3118   6
                  3.49  3.46  3.41    3.55  3.50  3.46    3.59  3.51  3.46     3114   5
                  3.49  3.46  3.49    3.55  3.49  3.52    3.59  3.51  3.53     3152   6
N(0,1), d = 100   3.42  3.13  3.07    3.49  3.13  3.07    3.54  3.13  3.07     9382  52
                  3.43  3.21  3.19    3.50  3.21  3.19    3.54  3.21  3.19     8466  24
                  3.44  3.25  3.23    3.50  3.25  3.23    3.54  3.25  3.23     7756  20
                  3.42  3.09  3.08    3.48  3.09  3.08    3.53  3.09  3.08    11092  68
                  3.42  3.16  3.16    3.49  3.16  3.16    3.54  3.16  3.16     9538  38
Exp(1), d = 100   3.42  3.20  3.19    3.49  3.20  3.19    3.53  3.20  3.19    10222  34
                  3.42  3.22  3.21    3.49  3.22  3.21    3.53  3.22  3.21    10390  37
                  3.42  3.18  3.17    3.48  3.18  3.17    3.53  3.18  3.17    11574  35
                  3.43  3.23  3.23    3.49  3.23  3.23    3.54  3.23  3.23     8782  22
                  3.43  3.22  3.24    3.50  3.22  3.24    3.54  3.22  3.24     8622  41


Table 4.6: Critical values for the changed interval scan statistic based on MST at 0.05 significance level. n = 1000.

                  Critical Values                                             Graph
                  l0 = 100            l0 = 50             l0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
d = 1             4.08  4.29  4.24    4.22  4.76  4.73    4.33  5.44  5.77     4994   2
N(0,1), d = 10    3.97  3.89  3.84    4.07  3.92  3.89    4.16  3.93  3.89     5454   8
                  3.97  3.91  3.81    4.07  3.95  3.85    4.16  3.97  3.87     5400   7
                  3.97  3.90  3.81    4.07  3.93  3.90    4.16  3.94  3.91     5448   8
                  3.97  3.90  3.91    4.07  3.94  3.93    4.16  3.95  3.94     5440   7
                  3.97  3.89  3.82    4.07  3.91  3.85    4.15  3.93  3.85     5524   8
Exp(1), d = 10    3.99  3.93  3.86    4.09  3.97  3.92    4.17  3.99  3.95     5042   8
                  3.99  3.93  3.84    4.09  3.96  3.90    4.17  4.00  3.92     5040   6
                  3.99  3.93  3.85    4.09  3.97  3.91    4.17  4.00  3.93     5106   6
                  3.99  3.93  3.82    4.09  3.97  3.87    4.17  3.99  3.91     5042   6
                  3.99  3.91  3.94    4.08  3.95  3.98    4.17  3.97  3.98     5126   8
N(0,1), d = 100   3.87  3.51  3.52    3.98  3.51  3.52    4.09  3.51  3.52    11600  40
                  3.86  3.49  3.55    3.98  3.49  3.55    4.08  3.49  3.55    13346  64
                  3.88  3.57  3.66    3.99  3.57  3.66    4.09  3.57  3.66    10422  34
                  3.88  3.57  3.58    3.99  3.57  3.58    4.09  3.57  3.58    10804  43
                  3.88  3.56  3.58    3.99  3.56  3.58    4.09  3.56  3.58    10862  36
Exp(1), d = 100   3.88  3.63  3.59    3.99  3.63  3.59    4.09  3.63  3.59    10384  24
                  3.87  3.58  3.49    3.98  3.58  3.49    4.09  3.58  3.49    11922  33
                  3.88  3.60  3.63    3.99  3.60  3.63    4.09  3.60  3.63    11194  34
                  3.89  3.63  3.55    4.00  3.63  3.55    4.10  3.63  3.55     9680  27
                  3.88  3.62  3.60    3.99  3.62  3.60    4.09  3.62  3.60    10468  29


Table 4.7: Critical values for the changed interval scan statistic based on MST at 0.01 significance level. n = 1000.

                  Critical Values                                             Graph
                  l0 = 100            l0 = 50             l0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
d = 1             4.51  4.78  4.73    4.63  5.31  5.30    4.72  6.08  6.65     4994   2
N(0,1), d = 10    4.42  4.32  4.31    4.50  4.33  4.33    4.58  4.33  4.33     5454   8
                  4.42  4.34  4.22    4.51  4.36  4.25    4.58  4.37  4.25     5400   7
                  4.42  4.33  4.20    4.50  4.51  4.25    4.58  4.35  4.29     5448   8
                  4.42  4.34  4.36    4.50  4.32  4.36    4.58  4.36  4.36     5440   7
                  4.42  4.32  4.31    4.50  4.33  4.32    4.57  4.33  4.32     5524   8
Exp(1), d = 10    4.43  4.36  4.36    4.52  4.39  4.36    4.59  4.39  4.36     5042   8
                  4.43  4.36  4.30    4.52  4.39  4.36    4.59  4.40  4.36     5040   6
                  4.43  4.36  4.32    4.52  4.39  4.38    4.59  4.40  4.44     5106   6
                  4.43  4.36  4.27    4.52  4.39  4.33    4.59  4.39  4.33     5042   6
                  4.43  4.35  4.35    4.52  4.37  4.35    4.59  4.37  4.35     5126   8
N(0,1), d = 100   4.34  3.99  4.28    4.43  3.99  4.28    4.52  3.99  4.28    11600  40
                  4.33  3.98  3.95    4.42  3.98  3.95    4.51  3.98  3.95    13346  64
                  4.34  4.04  4.12    4.44  4.04  4.12    4.52  4.04  4.12    10422  34
                  4.34  4.05  4.22    4.43  4.05  4.22    4.52  4.05  4.22    10804  43
                  4.34  4.03  4.00    4.43  4.03  4.00    4.52  4.03  4.00    10862  36
Exp(1), d = 100   4.34  4.10  3.95    4.44  4.10  3.95    4.52  4.10  3.95    10384  24
                  4.33  4.05  3.87    4.43  4.05  3.87    4.52  4.05  3.87    11922  33
                  4.34  4.08  4.14    4.43  4.08  4.14    4.52  4.08  4.14    11194  34
                  4.35  4.10  3.86    4.44  4.10  3.86    4.53  4.10  3.86     9680  27
                  4.34  4.08  4.10    4.44  4.08  4.10    4.52  4.08  4.10    10468  29


Table 4.8: Critical values for the changed interval scan statistic based on MDP. n = 1000.

significance level = 0.05

                     d = 1             d = 10            d = 100
l0    A1    A2       N(0,1)  Exp(1)    N(0,1)  Exp(1)    N(0,1)  Exp(1)
100   4.08  4.38     4.39    4.46      4.30    4.29      4.32    4.32
50    4.22  4.97     5.03    5.12      5.10    4.87      5.19    4.99
25    4.33  5.81     6.31    6.32      6.14    6.12      6.60    6.35

significance level = 0.01

                     d = 1             d = 10            d = 100
l0    A1    A2       N(0,1)  Exp(1)    N(0,1)  Exp(1)    N(0,1)  Exp(1)
100   4.51  4.90     4.91    5.13      4.93    4.92      5.01    4.91
50    4.63  5.58     5.63    5.94      5.64    5.48      6.13    5.63
25    4.72  6.52     6.91    6.91      6.91    6.91      7.12    6.91


Table 4.9: Critical values for the changed interval scan statistic based on NNG at 0.05 significance level. n = 1000.

                  Critical Values                                             Graph
                  l0 = 100            l0 = 50             l0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
N(0,1), d = 1     4.04  4.10  4.07    4.15  4.23  4.20    4.23  4.31  4.30     2026   2
                  4.04  4.10  4.09    4.15  4.24  4.18    4.23  4.31  4.24     1942   2
                  4.04  4.10  4.11    4.15  4.24  4.23    4.23  4.31  4.35     1948   2
                  4.04  4.10  3.96    4.15  4.23  4.11    4.23  4.31  4.25     2038   2
                  4.04  4.10  4.04    4.15  4.24  4.17    4.23  4.31  4.31     1960   2
Exp(1), d = 1     4.04  4.10  4.00    4.15  4.23  4.14    4.23  4.31  4.24     2086   2
                  4.04  4.10  4.08    4.15  4.23  4.20    4.23  4.31  4.24     1990   2
                  4.04  4.10  4.00    4.15  4.24  4.15    4.23  4.32  4.27     2014   2
                  4.04  4.10  4.01    4.15  4.23  4.20    4.23  4.31  4.34     2080   2
                  4.04  4.10  4.04    4.15  4.23  4.18    4.23  4.31  4.27     2008   2
N(0,1), d = 10    3.99  3.92  3.82    4.09  3.96  3.88    4.18  3.97  3.90     3558   6
                  3.99  3.91  3.86    4.09  3.94  3.86    4.18  3.95  3.88     3508   6
                  4.00  3.92  3.86    4.10  3.96  3.93    4.18  3.97  3.93     3394   6
                  3.99  3.91  3.81    4.09  3.94  3.86    4.18  3.95  3.90     3418   6
                  3.99  3.91  3.88    4.09  3.94  3.88    4.18  3.95  3.88     3450   6
Exp(1), d = 10    4.00  3.94  3.85    4.10  3.98  3.91    4.18  3.99  3.91     3306   6
                  4.01  3.95  3.91    4.11  4.00  3.98    4.19  4.02  3.99     3118   5
                  4.00  3.94  3.89    4.10  3.98  3.93    4.19  4.00  3.94     3018   5
                  4.00  3.95  3.90    4.11  3.99  3.93    4.19  4.01  3.93     3014   5
                  4.01  3.96  3.95    4.11  4.01  3.97    4.19  4.03  3.99     3092   5
N(0,1), d = 100   3.89  3.55  3.48    4.00  3.55  3.48    4.10  3.55  3.48     8240  30
                  3.88  3.50  3.49    3.99  3.50  3.49    4.09  3.50  3.49     9360  33
                  3.90  3.61  3.60    4.00  3.61  3.60    4.10  3.61  3.60     8482  18
                  3.88  3.51  3.48    3.99  3.51  3.48    4.09  3.51  3.48     9154  40
                  3.88  3.50  3.44    3.99  3.50  3.44    4.09  3.50  3.44     9392  39
Exp(1), d = 100   3.88  3.54  3.47    3.99  3.54  3.47    4.09  3.54  3.47    10406  45
                  3.88  3.55  3.55    3.99  3.55  3.55    4.09  3.55  3.55    10504  44
                  3.88  3.54  3.61    3.99  3.54  3.61    4.09  3.54  3.61    10106  32
                  3.90  3.64  3.53    4.00  3.63  3.53    4.10  3.63  3.53     8666  22
                  3.90  3.58  3.57    4.00  3.58  3.57    4.10  3.58  3.57     8274  28


Table 4.10: Critical values for the changed interval scan statistic based on NNG at 0.01 significance level. n = 1000.

                  Critical Values                                             Graph
                  l0 = 100            l0 = 50             l0 = 25
                  A1    A2    Per     A1    A2    Per     A1    A2    Per     ∑|Ei|²  D
N(0,1), d = 1     4.48  4.55  4.58    4.57  4.67  4.65    4.64  4.73  4.65     2026   2
                  4.48  4.56  4.53    4.57  4.68  4.71    4.64  4.74  4.79     1942   2
                  4.48  4.56  4.56    4.57  4.68  4.72    4.64  4.74  4.83     1948   2
                  4.48  4.55  4.45    4.57  4.67  4.68    4.64  4.74  4.69     2038   2
                  4.48  4.56  4.56    4.57  4.68  4.66    4.64  4.74  4.82     1960   2
Exp(1), d = 1     4.48  4.55  4.49    4.57  4.67  4.62    4.64  4.74  4.68     2086   2
                  4.48  4.55  4.49    4.57  4.67  4.57    4.64  4.73  4.57     1990   2
                  4.48  4.56  4.49    4.57  4.68  4.59    4.64  4.75  4.60     2014   2
                  4.48  4.55  4.61    4.57  4.67  4.65    4.64  4.74  4.76     2080   2
                  4.48  4.55  4.60    4.57  4.67  4.65    4.64  4.73  4.78     2008   2
N(0,1), d = 10    4.44  4.35  4.20    4.52  4.39  4.25    4.60  4.37  4.25     3558   6
                  4.44  4.34  4.34    4.52  4.35  4.38    4.59  4.35  4.38     3508   6
                  4.44  4.35  4.28    4.52  4.36  4.33    4.60  4.37  4.33     3394   6
                  4.44  4.34  4.30    4.52  4.36  4.30    4.59  4.35  4.30     3418   6
                  4.44  4.34  4.22    4.52  4.35  4.22    4.59  4.35  4.22     3450   6
Exp(1), d = 10    4.44  4.37  4.31    4.53  4.43  4.39    4.60  4.39  4.39     3306   6
                  4.45  4.38  4.39    4.53  4.42  4.50    4.60  4.42  4.50     3118   5
                  4.45  4.37  4.31    4.53  4.72  4.33    4.60  4.39  4.38     3018   5
                  4.45  4.38  4.42    4.53  4.35  4.45    4.60  4.41  4.45     3014   5
                  4.45  4.39  4.46    4.53  4.43  4.47    4.61  4.43  4.47     3092   5
N(0,1), d = 100   4.35  4.02  3.91    4.44  4.02  3.91    4.53  4.02  3.91     8240  30
                  4.34  3.97  3.82    4.44  3.97  3.82    4.52  3.97  3.82     9360  33
                  4.36  4.07  3.94    4.45  4.07  3.94    4.53  4.07  3.94     8482  18
                  4.35  3.99  4.06    4.44  3.99  4.06    4.52  3.99  4.06     9154  40
                  4.34  3.98  3.83    4.44  3.98  3.83    4.52  3.98  3.83     9392  39
Exp(1), d = 100   4.34  4.02  3.87    4.43  4.02  3.87    4.52  4.02  3.87    10406  45
                  4.34  4.03  3.99    4.43  4.03  3.99    4.52  4.03  3.99    10504  44
                  4.34  4.02  4.22    4.43  4.02  4.22    4.52  4.02  4.22    10106  32
                  4.35  4.10  3.95    4.44  4.10  3.95    4.53  4.10  3.95     8666  22
                  4.36  4.05  4.02    4.45  4.05  4.02    4.53  4.05  4.02     8274  28

Chapter 5

Assessment of the Method

5.1 Numeric Power Studies

We used simulations to compare the power of the graph-based scan statistics to

parametric approaches. In the first simulation set-up, we generated a sequence of 200

observations from the following model:

$$y_t \sim \begin{cases} N(0, I_d), & t = 1, \ldots, 100; \\ N(\mu, \Sigma), & t = 101, \ldots, 200. \end{cases}$$

As before, $d$ is the dimension of each observation. There is a change-point at $t = 100$. The mean $\mu$ of the second half of the data is shifted from 0 by an amount $\Delta$ in Euclidean distance. We considered cases where the covariance matrix remains constant ($\Sigma = I_d$), as well as cases where the covariance matrix also changes. When the covariance matrix changes, we set $\Sigma$ to a diagonal matrix with $\Sigma[1,1] = d^{1/3}$ and $\Sigma[i,i] = 1$ for $i = 2, \ldots, d$. We chose $\Delta$ for each value of $d$ so that most methods have moderate power.
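The simulation model can be sketched in a few lines. The text fixes only the Euclidean length $\Delta$ of the mean shift, not its direction, so the sketch assumes (as an illustration only) that the shift is spread equally over all coordinates. The function names are hypothetical.

```python
import random


def mean_shift(d, delta):
    """Mean vector of length `delta` in Euclidean norm.
    Assumption: equal shift in every coordinate (direction not given in the text)."""
    return [delta / d ** 0.5] * d


def simulate_sequence(d, delta, change_cov=False, n=200, tau=100, seed=0):
    """Draw y_1..y_tau from N(0, I_d) and y_{tau+1}..y_n from N(mu, Sigma).
    With change_cov=True, Sigma is diagonal with Sigma[1,1] = d**(1/3)."""
    rng = random.Random(seed)
    mu = mean_shift(d, delta)
    # Standard deviation of the first coordinate after the change-point.
    sd_first = (d ** (1.0 / 3.0)) ** 0.5 if change_cov else 1.0
    sequence = []
    for t in range(1, n + 1):
        if t <= tau:
            y = [rng.gauss(0.0, 1.0) for _ in range(d)]
        else:
            y = [mu[k] + (sd_first if k == 0 else 1.0) * rng.gauss(0.0, 1.0)
                 for k in range(d)]
        sequence.append(y)
    return sequence
```

Note that for $d = 1$ the changed covariance $d^{1/3}$ equals 1, so the two covariance settings coincide in one dimension.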

Hotelling’s $T^2$ is a parametric test designed specifically for detecting a change

in multivariate normal mean when there is no change in variance. When there is a

change in both mean and variance, the generalized likelihood ratio test (GLR) can be

used. We compare the graph-based scan statistics to scan statistics based on these


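two existing methods. For any candidate change-point $t$, Hotelling’s $T^2$ is

$$T^2(t) = \frac{t(n-t)}{n}\,(\bar{y}_t - \bar{y}_t^*)^T\,\hat{\Sigma}^{-1}\,(\bar{y}_t - \bar{y}_t^*),$$

where

$$\bar{y}_t = \sum_{i=1}^{t} y_i / t, \qquad \bar{y}_t^* = \sum_{i=t+1}^{n} y_i / (n-t),$$

$$\hat{\Sigma} = (n-2)^{-1}\left[\sum_{i=1}^{t} (y_i - \bar{y}_t)(y_i - \bar{y}_t)^T + \sum_{i=t+1}^{n} (y_i - \bar{y}_t^*)(y_i - \bar{y}_t^*)^T\right].$$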
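The statistic can be computed directly from its definition. The sketch below is a dependency-free illustration with hypothetical function names: it forms the two segment means and the pooled covariance with divisor $n - 2$, then solves a linear system by Gaussian elimination instead of inverting the covariance explicitly. It assumes the pooled covariance is nonsingular.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting (a is a square list of lists)."""
    d = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(d):
        p = max(range(c, d), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, d):
            f = m[r][c] / m[c][c]
            for k in range(c, d + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * d
    for r in reversed(range(d)):
        x[r] = (m[r][d] - sum(m[r][k] * x[k] for k in range(r + 1, d))) / m[r][r]
    return x


def hotelling_t2(y, t):
    """T^2(t) for a sequence y of n observations (lists of length d),
    splitting after the first t observations."""
    n, d = len(y), len(y[0])
    first, second = y[:t], y[t:]
    mean1 = [sum(col) / t for col in zip(*first)]
    mean2 = [sum(col) / (n - t) for col in zip(*second)]
    pooled = [[0.0] * d for _ in range(d)]
    for segment, mean in ((first, mean1), (second, mean2)):
        for obs in segment:
            r = [o - mu for o, mu in zip(obs, mean)]
            for a in range(d):
                for b in range(d):
                    pooled[a][b] += r[a] * r[b] / (n - 2)
    diff = [u - v for u, v in zip(mean1, mean2)]
    w = solve(pooled, diff)            # w = pooled^{-1} diff
    return t * (n - t) / n * sum(u * v for u, v in zip(diff, w))
```

A scan statistic would take the maximum of `hotelling_t2(y, t)` over the allowed range of `t`. In one dimension the value reduces to the square of the pooled two-sample t-statistic, which gives a quick check.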

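The GLR is

$$\mathrm{GLR}(t) = n \log|\hat{\Sigma}_n| - t \log|\hat{\Sigma}_t| - (n-t) \log|\hat{\Sigma}_t^*|,$$

where

$$\hat{\Sigma}_t = \frac{\sum_{i=1}^{t} (y_i - \bar{y}_t)(y_i - \bar{y}_t)^T}{t}, \qquad \hat{\Sigma}_t^* = \frac{\sum_{i=t+1}^{n} (y_i - \bar{y}_t^*)(y_i - \bar{y}_t^*)^T}{n-t}.$$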

Some constraints apply to $T^2(t)$ and $\mathrm{GLR}(t)$. For $T^2$, the number of observations $n$ needs to be larger than the dimension of the data $d$ so that the pooled covariance estimate $\hat{\Sigma}$ can be inverted. For GLR, both $t$ and $n - t$ need to be larger than the dimension of the data so that the determinants of the segment covariance estimates $\hat{\Sigma}_t$ and $\hat{\Sigma}_t^*$ are not zero. Thus, when $d \leq 20$, we set $n_0 = d + 10$ and $n_1 = n - n_0$. When $d > 20$, we set $n_0 = 50$ and $n_1 = 150$. (An exception for GLR is that when $d = 50$, $n_0$ and $n_1$ are set to 60 and 140, respectively, so that the test statistic can be calculated.)

Table 5.1 shows the power comparisons. Scan statistics based on the three ways of constructing the graph (MST, MDP, and NNG), using Euclidean distance, are compared to scan statistics based on maximization of $T^2(t)$ and $\mathrm{GLR}(t)$. First, compare the graph-based methods to Hotelling’s $T^2$: when the variance does not change, $T^2$ outperforms all other methods in low to moderate dimension ($d < 175$). This is expected, as $T^2$ was designed specifically for this scenario. Remarkably, graph-based methods surpass $T^2$ at its own game when dimension is high ($d = 175$). Now, consider the case where the variance also changes. Because it assumes an incorrect alternative, $T^2$ is quickly surpassed in power by graph-based methods, for $d$ as low as 5.


Comparing graph-based methods to the GLR-based scan statistic, we see a similar pattern: when dimension is low ($d = 1, 5, 10$), GLR-based scans dominate in power when both the mean and variance change. Graph-based methods exceed GLR in power as $d$ increases, already performing much better by $d = 20$, which is considered quite moderate in today’s applications. The low power of GLR at even moderate dimension is due to its requirement that the covariance matrix be estimated for both segments.

We also considered a case where the normality assumption is violated by generating data from the log-normal distribution ($\Sigma = I_d$). Then, graph-based methods outperform $T^2$ by $d = 75$, and GLR by $d = 5$.

Comparing among the graph-based scan statistics, we see that MST and NNG

have comparable power, and dominate MDP in all situations. An explanation is that,

of these three ways of constructing graphs, the MDP retains the least information

from the data, having half as many edges as the other two graphs. The fact that MST

and NNG have similar power in all scenarios suggests that the graph-based method

is not very sensitive to the method of graph construction.


Table 5.1: Number of simulated sequences (out of 100) with significance less than 5%.

Normal data, Σ = I.

d      1     5     10    20    50    75    100   125   150   175
∆      0.5   0.65  0.8   0.8   1     1.2   1.2   1.4   1.6   2
T²     81    85    98    76    82    90    74    72    67    46
GLR    72    51    28    8     8     -     -     -     -     -
MST    14    20    18    15    29    45    34    42    47    73
MDP    5     7     12    8     19    23    15    18    27    43
NNG    8     16    16    16    33    49    37    46    48    77

Normal data, Σ is diagonal with Σ[1, 1] = d^(1/3), Σ[i, i] = 1, i = 2, . . . , d.

d      1     5     10    20
∆      0.5   0.4   0.1   0.2
T²     80    18    8     3
GLR    76    80    67    31
MST    8     27    35    65
MDP    9     18    22    30
NNG    10    17    34    59

Log-normal data, Σ = I.

d      1     5     10    20    50    75    100
∆      0.7   0.9   1     1     1.2   1.4   1.4
T²     83    77    79    58    60    43    29
GLR    28    21    18    12    7     -     -
MST    18    35    47    28    31    62    71
MDP    7     15    27    14    27    22    25
NNG    19    34    39    28    34    58    74


5.2 Results on Real Data Examples

5.2.1 Friendship Network

The MIT Media Laboratory conducted a study following 90 subjects, consisting of

students and staff at the university, using mobile phones with pre-installed software

recording call logs from July 2004 to June 2005 [Eagle et al., 2009]. In this analysis,

we extract the information on the caller, callee and time for every call that was made

during the study period. The question of interest is whether phone call patterns

changed during this time, which may reflect a change in relationship among these

subjects. We bin the calls by day and, for each day, construct a network with the

90 subjects as nodes and a link between two subjects if they had at least one call

on that day. We encode the network of each day by an adjacency matrix, with 1 for

element [i, j] if there is an edge between subject i and subject j, and 0 otherwise.

Thus, the processed data are adjacency matrices, one for each day from 2004/7/20

to 2005/6/14.

We show results for graphs constructed using two different dissimilarity measures. Let $A_i$ be the 90 by 90 adjacency matrix on day $i$, and let $v_i$ denote the vector form of $A_i$. The dissimilarities are:

(1) the number of different edges: $\|v_i - v_j\|_1 = \|v_i - v_j\|_2^2$,

(2) the number of different edges, normalized by the geometric mean of the total for each day: $\dfrac{\|v_i - v_j\|_1}{\sqrt{\|v_i\|_1 \|v_j\|_1}}$.
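Both dissimilarities are simple to compute from 0/1 adjacency matrices. The sketch below is a minimal illustration with hypothetical function names. It assumes undirected networks and vectorizes only the upper triangle, so that each edge is counted once; vectorizing the full symmetric matrix would double measure (1) and leave the ratio in measure (2) unchanged. How days with no calls are handled is not stated in the text, so the sketch simply raises an error in that case.

```python
def edge_vector(adj):
    """Upper triangle of a symmetric 0/1 adjacency matrix as a flat list,
    so that every undirected edge appears exactly once (an assumption)."""
    n = len(adj)
    return [adj[i][j] for i in range(n) for j in range(i + 1, n)]


def edge_difference(u, v):
    """Dissimilarity (1): number of edges present on one day but not the other.
    For 0/1 vectors the L1 distance equals the squared L2 distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def normalized_edge_difference(u, v):
    """Dissimilarity (2): edge difference divided by the geometric mean
    of the two daily edge totals."""
    total = (sum(u) * sum(v)) ** 0.5
    if total == 0:
        raise ValueError("normalization undefined when a day has no edges")
    return edge_difference(u, v) / total
```

A pairwise dissimilarity matrix over all days, built with either function, is then the only input needed to construct the MST, MDP, or NNG.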

Results based on different dissimilarities and different ways of constructing the

graph are shown in Figure 5.1. We see that statistics based on MST and NNG give

similar results under both dissimilarities. The statistic based on MDP is not infor-

mative here, possibly because MDP is not dense enough to capture the information

in the data. Based on the scans using MST and NNG, a change-point occurred at

around 2005/1/9, which turns out to be the winter break for that year at MIT. The

p-values for the scan based on MST and NNG under both dissimilarity measures are

all < 0.0001, by both 10,000 permutations and analytic approximation (4.18). Per-

haps a change in courses after winter break changed the social organization among

the subjects.


Figure 5.1: Results of graph-based scans of the MIT phone call network. Top row shows results from using number of different edges as the dissimilarity measure and bottom row shows results from using the normalized number of different edges. The three columns show three different ways of constructing the graph: MST, MDP, and NNG from left to right. In each plot, Z(t) is plotted along t. The estimated change-point is shown in the caption above the plot. The two vertical lines show n0 and n1; we basically excluded the first 5% and the last 5% of the points. The horizontal lines show critical values at 0.05 and 0.01 significance levels, with the solid lines showing critical values computed from 10,000 permutations and the dashed lines showing those computed from the analytic approximation after skewness correction.


5.2.2 Authorship Debate

Tirant lo Blanch, a chivalry novel published in 1490, is considered to be one of the

best known medieval works of literature in Catalan. The dedicatory letter at the

beginning of the book states,

... So that no one else can be blamed if any faults are found in this work,

I, Joanot Martorell, knight, take sole responsibility for it, as I have carried

out the task singlehandedly...

However, the colophon at the end of the book states something different,

... by the magnificent and virtuous knight, Sir Joanot Martorell, who

because of his death, could only finish writing three parts of it. The

fourth part, which is the end of the book, was written by the illustrious

knight Sir Marti Joan de Galba. If faults are found in that part, let them

be attributed to his ignorance...

This inconsistency has fueled a debate about the authorship of Tirant lo Blanch that has continued since its publication. Opinions have mainly fallen into two camps, one

favoring the single authorship by Joanot Martorell and the other favoring a change of

author somewhere between chapters 350 and 400. (There are in total 487 chapters in

the book.) One objective way to settle this debate is through the statistical analysis

of word usage, which reflects the unique style of writing of different people.

Giron et al. [2005] analyzed two sets of word usage statistics extracted from the

book. The first, which we call the word length data set, categorizes the words in

each chapter by their length, with a single category for all words with length greater

than nine letters. Thus, this data set represents each chapter by a vector of length

10. The second, which we call the context-free word frequency data set, counts the

occurrence of the 25 most frequent context-free words in each chapter. Giron et al.

[2005] analyzed the two data sets using a Bayesian multinomial change-point model

and a Bayesian clustering method, and concluded in favor of the change of author

hypothesis, with the estimated change-point between chapters 371 and 382.

Here, we apply the graph-based change-point method to the two data sets, treating

each chapter as a time-point. There are in total 487 chapters, and we use the 425


chapters that have more than 200 words. For both data sets, we normalized the count

vector for each chapter by dividing by the total number of words in the chapter. Thus,

our data is a sequence of 425 normalized frequencies, of dimension 10 for the word

length data and dimension 25 for the context-free word frequency data. The L2 norm

is used to construct the MST, MDP and NNG graphs representing similarity between

chapters. Z(t) and the estimated change-points, computed for each type of graph,

are shown in Figure 5.2. Test results using the three different graphs and the two

data sets support the change of author hypothesis, with the estimated change-point

around chapter 360, which is consistent with the view that there is a change of author

somewhere between chapters 350 and 400. The p-values are shown in Table 5.2.

Table 5.2: p-values for the tests. In each cell, the first value is calculated from 10,000 permutations and the second value is calculated from the analytic approximation after skewness correction.

data                          MST             MDP             NNG
word length                   0.0000/0.0000   0.0041/0.0018   0.0000/0.0000
context-free word frequency   0.0000/0.0000   0.0000/0.0000   0.0000/0.0000

To check the robustness of our analysis, we also applied the scan on data for

the first 350 chapters to see if it rejects the null there. Opinions seem to be quite

uniform that the first 350 chapters were all written by Joanot Martorell. The results

are shown in Figure 5.3. The word length data does not reject the null for the 350

chapters at 0.05 significance level. However, the context-free word frequency data

supports a change-point, although different graphs favor different locations for the

change-point. The p-values of the tests are shown in Table 5.3. These results suggest

that word length may be more robust than context-free word frequency in reflecting

writing styles.

Table 5.3: p-values for the tests only using data from the first 350 chapters. Numbers in each cell have the same meaning as in Table 5.2.

data                          MST             MDP             NNG
word length                   0.0538/0.0562   0.1061/0.1040   0.3086/0.3527
context-free word frequency   0.0000/0.0000   0.0019/0.0009   0.0000/0.0000


Figure 5.2: Results of graph-based scans of chapter-wise word usage frequencies of Tirant lo Blanch. The first row shows results from the word length data and the second row shows results from the context-free word frequency data. The three columns show scans based on three different graphs: MST, MDP, and NNG from left to right. The content in each plot is the same as in Figure 5.1. In the caption for each plot, the estimated change-point is shown in the form A/B, where A is the index of the change-point within the 425 chapters used for analysis, and B is the chapter number in the novel.


Figure 5.3: Results from the first 350 chapters. The setting of the figure is the same as in Figure 5.2.


5.3 Discussion

The new nonparametric method for change-point detection can be applied to high

dimensional and non-Euclidean data. The method requires only the existence of a

dissimilarity measure on the sample space. In applications, the choice of a good

dissimilarity measure is critical, and domain knowledge should be used to design a

measure that is sensitive to the signal of interest. The approach we propose decouples

this application-specific choice of dissimilarity measure from the formal test for a

change-point. Graph-based scan statistics are easy to compute, and the analytic

p-value approximations are generally applicable.

We have shown that the p-value approximations are quite accurate. Our simulations

were for a data sequence of length n = 1000. The accuracy of the approximations

depends on n0 (l0 for the changed interval alternative) and not so much on n. Accuracy

also depends on the structure of the graph. When the graph is highly star-shaped,

which is common for high-dimensional data when the Euclidean distance is used in

constructing the graph, the skewness correction is critical for the approximations to

be accurate. For extremely star-shaped graphs, we imagine that adjusting for kurto-

sis and higher order moments might also be helpful. The strategy would be similar

to skewness correction, but more technically complicated. We do not compute these

higher order terms in this paper, but if needed they can be computed in a similar

fashion to the skewness term with the aid of symbolic computation software.

The main reason that higher order corrections are necessary in high dimensions

is the increase in size and density of hubs in the graph, as shown in Section 4.5.

If hubs dominate the topology of the graph, perturbation of any hub can change

the topology drastically. Furthermore, R(t), which does not take into account the

interaction between edges, loses all information regarding the high order structure.

Under such circumstances, the particular graph would not be useful for separating

F1 from F0, and we would suggest exploring other dissimilarity measures and graph

construction methods.

Compared to parametric approaches, the graph-based approach requires far fewer

assumptions, but also makes less use of the data. Although this leads to loss of

power in low dimensions if the data indeed follow the parametric model, it leads to

robustness and wider applicability. An important observation is that the graph-based


approach has desirable power, compared to standard parametric tests, in moderate

and high dimensions. For high dimensional data, it is often hard to predict the

direction and nature of the change. Without such prior knowledge, parametric models

would require the estimation of many parameters, most of which would be unrelated

to the change. For example, Hotelling's T^2 statistic requires the estimation of

the large covariance matrix. If, by prior knowledge or data pre-processing, we can

circumvent the covariance estimation, then Hotelling's T^2 would be preferable when

the data satisfy its assumptions (normality with no change of variance). Otherwise,

graph-based approaches gain increasing advantage over Hotelling's T^2 as d increases,

even in the problem for which Hotelling's T^2 was explicitly designed.

We mainly explored three different ways of constructing the underlying graph

given a dissimilarity measure. From the numerical results and the analysis of the MIT

cell phone network, we see that scans based on MST and NNG perform similarly, while

scans based on MDP have lower power. We suspect this is due to the fact that MDP

is the least dense graph and utilizes the least amount of information from the original

data set. In this regard, one may try denser graphs which retain more information

from the data than the MST and NNG. One may even consider assigning weights to

the edges. As in all problems, building more assumptions into the statistic leads to

improved power if the assumptions are true, but sacrifices robustness.

If more than one change-point or changed interval were of interest, the graph-

based scan can be applied recursively in a procedure that is called binary or circular

binary segmentation [Vostrikova, 1981, Olshen et al., 2004].

Part II

Graph-Based Tests for Two-Sample

Comparisons of Categorical Data


Chapter 6

Introduction

6.1 Background and Challenges

Testing whether two data samples are drawn from the same distribution is a funda-

mental problem in statistics. For low-dimensional Euclidean data, there are many

approaches, both parametric and non-parametric, to this problem. When the data

are categorical, the existing approaches are much more limited. The standard proce-

dure is to assume that each sample is drawn from a multinomial distribution, and the

comparison becomes a test of whether the two samples come from the same multinomial

distribution. Classical methods, such as Pearson's Chi-square test and the

deviance test, work well when we observe each category a large number of times. At

least, the region in the contingency table where the two groups truly differ needs to

be adequately sampled for existing tests to achieve good power. However, in many

modern applications, the number of possible categories is comparable to or even larger

than the sample size. Following are some examples:

Preference rankings: Survey data in marketing or psychometric research often

come in the form of preference rankings. Subjects may be asked to rate wine

(rank from best to worst tasting), pictures (choose 3 most familiar out of 5),

or insurance plans (identify the most and least desirable). See Diaconis [1988]

and Critchlow [1985] for more detailed examples on ranked and partially ranked

data. It is a common problem to compare two groups of subjects to see if there



is any between-group difference in preference. The number of possible full rank-

ings is the factorial of the number of objects being rated, and the number of

possible rankings is higher if some subjects only partially rank the objects.

Haplotype association: In genetics, a haplotype is a combination of alleles at adja-

cent loci on a chromosome that is transmitted together. A common problem of

genetic association studies is to compare haplotype counts between treatment

and control groups (e.g. see Zaykin et al. [2002] and Furihata et al. [2006]).

Each haplotype can be represented as a fixed-length binary vector. The num-

ber of possible haplotypes is exponential in the number of loci. Haplotypes that

are longer than 10 are often of interest in genetics, leading to > 1,000 possible

combinations. However, the number of subjects in association studies is often

only in the thousands or even hundreds, and the counts for most haplotypes are

small.

Sequence or document comparisons: In the modern age of digitized texts, it is

often of interest to compare the word composition in two different documents.

A similar problem is the comparison of DNA or protein sequences, which plays

a large role in bioinformatics [Lippert et al., 2002]. The number of possible

words in these applications can be very large, while the counts for most words

are small. For recent interest in this problem, see Perry and Beiko [2010], Bush

and Lahn [2006] and Rajan et al. [2007].

Classical Chi-square tests have low power in the above scenarios due to sparsity

of the contingency table and high dimensionality of the parameter space. For exact

tests, it is possible to generalize the concept to the setting of more than two categories,

but this is computationally challenging [Mehta and Patel, 1983] and not efficient due

to the existence in high dimensions of many equivalent tables, which are tables that

have the same probability as the one observed.


6.2 Implicit Information and Their Role in Improving the Tests

When the number of categories is very large, there is often underlying similarity

between different categories that can be exploited. For example, rankings can be

related through Kendall’s or Spearman’s distance. Hamming distance or other more

sophisticated measures can be used to compare haplotypes and fixed-length words

in DNA sequences. In document comparison, words are not equally similar to one

another: some words are synonyms of others, and some are more likely to be used

together. Such similarity information between categories can be properly used to

improve the power of two-sample tests.

We assume that a distance matrix has been given on the set of categories, and

adopt a graph-based approach proposed by Friedman and Rafsky [1979] and Rosen-

baum [2005], where a graph is constructed on all subjects so that subjects more similar

in value are connected by an edge. Friedman and Rafsky’s test is based on a minimum

spanning tree (MST), and Rosenbaum’s test is based on minimum distance pairing

(MDP). The test statistic in both cases is the number of edges connecting subjects

from different groups. The underlying rationale is that, if two groups come from the

same distribution, subjects coming from the same group should be as distant to each

other as subjects coming from different groups. More details of these tests are given

in Section 1.1. Both tests, however, require uniqueness of the underlying graphs.

When the distance matrix on subjects is filled with ties, which is characteristic of

categorical data, neither approach can be directly applied.

Ties in the distance matrix lead to ambiguity in constructing the MST or MDP,

and the number of possible graphs increases rapidly with the number of ties. Some

efforts were made to address this problem. In the analysis of a partially ranked data

set with 38 subjects in 23 categories, Critchlow [1985] tried both the graph obtained

from the union of all MSTs (uMST), and the graph obtained from the union of all

nearest neighbor graphs (uNNG). Nettleton and Banerjee [2001] also used uNNG on

a binary clinical feature data set with 64 subjects in 63 categories. In general, nearest

neighbor graphs do not work well for categorical data (see Section 7.3). In this paper,

Critchlow’s method using the uMST is studied in more detail and a computationally


tractable form for categorical data is given. A different statistic, based on averaging

over all optimal graphs of a certain kind, is also proposed and analyzed.

6.3 Notation

We start by introducing our notation. The different categories are indexed by

1, 2, . . . , K. The naming of the categories is arbitrary, that is, category 1 is not

necessarily closer in distance to category 2 than to category 3. The two groups are

labeled a and b. The data is given in the form of a two-way contingency table (Table

6.1). Without loss of generality, we assume that each category has at least one subject

over the two groups. That is, categories with no observation in either group can be

omitted from the analysis without loss of information.

Table 6.1: Basic Notations.

            1      2      ...    K      Total
Group a     n_a1   n_a2   ...    n_aK   n_a
Group b     n_b1   n_b2   ...    n_bK   n_b
Total       m_1    m_2    ...    m_K    N

\[
m_k = n_{ak} + n_{bk}, \quad k = 1, \dots, K;
\]
\[
n_a = \sum_{k=1}^{K} n_{ak}, \quad n_b = \sum_{k=1}^{K} n_{bk}, \quad N = n_a + n_b = \sum_{k=1}^{K} m_k.
\]

Sometimes, we refer to individual subjects themselves, which we denote by

$Y_1, \dots, Y_N$. Thus, each $Y_i$ takes value in $\{1, \dots, K\}$ and has a group label

\[
g_i = \begin{cases}
a, & \text{if } Y_i \text{ belongs to group } a; \\
b, & \text{if } Y_i \text{ belongs to group } b.
\end{cases} \tag{6.1}
\]

We assume that a distance matrix, $\{d(i, j) : i, j = 1, \dots, K\}$, has been given on the

set of possible categories, with d(i, j) small if categories i and j are similar. Possible

ways of defining the distance matrix are shown for various examples in Section 6.1.

As in Part I, we use G to denote both the graph and its set of edges, and $G_i$ to denote

both the subgraph including all edges that connect to node i and its set of edges. Sometimes,

the name of the graph is not as simple as G and we use $E_i^G$ to denote the set of edges

in G that contain node i to avoid ambiguity. In addition, we use $V_i^G$ to denote the

set of nodes in G that are connected to node i by an edge, and $E_{i,2}^G$ to denote the set

of edges in G that contain at least one node in $V_i^G$.

Following is a list of abbreviations for different types of graphs and test statistics:

MST: Minimum Spanning Tree,

MDP: Minimum Distance Pairing,

NNG: Nearest Neighbor Graph,

uMST: The graph obtained by taking the union of all MSTs,

uNNG: The graph obtained by taking the union of all NNGs, equivalent to the graph

connecting each point to all of its nearest neighbors,

RG: The test statistic on the graph G,

RaMST: the test statistic averaging over all test statistics computed on each of the

MSTs,

RaMDP: the test statistic averaging over all test statistics computed on each of the

MDPs.

Chapter 7

Graph-Based Test Statistics

Both Friedman and Rafsky’s test and Rosenbaum’s test assume uniqueness of the

type of graph used. For categorical data, ties appear in the distance matrix whenever

a category has multiple counts. Even sparse contingency tables have quite a few cells

containing more than one subject. The number of possible graphs grows rapidly with

the number of ties. Thus, Friedman and Rafsky's and Rosenbaum's methods cannot

be directly applied to categorical data. For categorical data, distances are often based

on qualitative measures, and thus while their relative ranking may be trustworthy,

their absolute scale is not. Hence, we do not consider methods based directly on the

distance matrix. While there are many ways to construct a graph based on a distance

matrix, we limit our study to MST, MDP and NNG, which are representative. Figure

7.1 illustrates the three different types of graphs on a simple example containing six

points. These six points take on six distinct values.

One natural solution, when the optimizing graph is not unique, is to average

the test statistic over all graphs of the given kind. In this section, we consider the

statistic based on averaging the sum (1.1) over all MSTs (RaMST). Another solution to

non-uniqueness is to take the union over all optimizing graphs, such as the statistic

based on the uMST (RuMST). RaMST and RuMST are analytically tractable and intuitively

appealing, and their derivations are shown in Section 7.1. For comparison, we also

consider the statistic based on averaging (1.1) over all MDPs, RaMDP, and the statistic

based on uNNG, RuNNG. Computation of RaMDP, described in Section 7.2.1, is often

intractable. Computation of uNNG is instantaneous. In Section 7.3, we study by


Figure 7.1: Illustration of MST, MDP, and NNG on six points. Notice that only one of the two possible MSTs on the six points and one of the two possible NNGs on the six points are shown.

simulation the performance of four graph-based statistics, RaMST, RuMST, RaMDP, RuNNG,

comparing them to each other and to Chi-square tests. Our results show that tests

based on minimum spanning trees have best power, and the intuition for this is

explained. The statistics based on uMDP and average over all NNGs are not included

in the comparison because they do not have the potential of high power according

to the performance of RaMDP and RuNNG in Section 7.3, while calculating them is not

instant.

Computation of RaMST and RuMST is described in more detail in Section 7.4. When

the number of MSTs on categories is large, which is common for categorical data,

computation for RaMST can be very costly. We generalize the statistic based on RaMST

to a similar but simpler form in Section 7.5.

7.1 The Test Statistics Based on MST

7.1.1 RaMST

First, we define more notation. For each k = 1, . . . , K, let $C_k \subset \{1, \dots, N\}$ be the

subjects that belong to category k. From Table 6.1, |Ck| = mk. Let Tk be the set of all

spanning trees for Ck. Since the distance between any two subjects in Ck is zero, any


Figure 7.2: Embedding the MST on categories on the subjects. This figure only shows 3 out of 15552 possible embeddings.

spanning tree of $C_k$ is an MST of $C_k$. Let $T_0^*$ be the set of all MSTs on the categories.

We can embed each tree in $T_0^*$ as a graph on the subjects by picking, for each edge (u, v)

of the tree, one subject in $C_u$ and one subject in $C_v$ as its endpoints. For each $\tau_0^* \in T_0^*$, there are

\[
\prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \tag{7.1}
\]

different embeddings, where $|E_k^{\tau_0^*}|$ is the number of edges of $\tau_0^*$ that contain category k.

For example, Figure 7.2 shows 3 out of 15552 ($= 2 \cdot 3^3 \cdot 1 \cdot 4^2 \cdot 3^2 \cdot 2$)

possible embeddings for an MST on six categories containing 2, 3, 1, 4, 3 and 2 subjects.

Let $T_0$ be the set of all graphs obtained from embedding a tree from $T_0^*$ on the subjects.

Then

\[
|T_0| = \sum_{\tau_0^* \in T_0^*} \left( \prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \right). \tag{7.2}
\]

Let T be the set of all MSTs on the N subjects. Then, any member of T can be

represented as a union of a graph from $T_0$ and a graph from each of $\{T_k : k = 1, \dots, K\}$,

and vice versa. Thus,

\[
T = \left\{ \tau_0 \cup \Big( \bigcup_{k=1}^{K} \tau_k \Big) : \tau_0 \in T_0,\ \tau_k \in T_k,\ k = 1, \dots, K \right\},
\]


with

\[
|T| = |T_0| \prod_{k=1}^{K} S_{m_k}, \tag{7.3}
\]

where $S_m = m^{m-2}$ is the number of spanning trees on m points by Cayley's formula.
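The counts in (7.1) and (7.3) can be checked with a few lines of code. The following is a minimal sketch, not part of the original derivation. The category sizes are those of the Figure 7.2 example, but the edge list `tree` is a hypothetical tree chosen only so that its node degrees (1, 3, 1, 2, 2, 1) match the exponents in 2·3^3·1·4^2·3^2·2; the actual tree in the figure may differ. The function names are illustrative.

```python
from math import prod

def degree_counts(num_nodes, edges):
    """Degree of every node in an undirected graph given as an edge list."""
    deg = [0] * num_nodes
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def num_embeddings(m, tree_edges):
    """Number of embeddings of one MST on categories on the subjects, as in (7.1)."""
    deg = degree_counts(len(m), tree_edges)
    return prod(mk ** dk for mk, dk in zip(m, deg))

def cayley(m):
    """Number of spanning trees on m labelled points: S_m = m^(m-2), with S_1 = S_2 = 1."""
    return 1 if m <= 2 else m ** (m - 2)

# Category sizes from the Figure 7.2 example.
m = [2, 3, 1, 4, 3, 2]
# Hypothetical tree on categories 0..5 with degrees (1, 3, 1, 2, 2, 1).
tree = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)]

embeddings = num_embeddings(m, tree)
# (7.3), assuming this tree is the only MST on categories so that |T0| = embeddings.
total_msts = embeddings * prod(cayley(mk) for mk in m)
print(embeddings, total_msts)
```

Under the assumption that the tree above is the only MST on categories, `total_msts` is the number of MSTs on the 15 subjects.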

Then, the test statistic based on averaging all MSTs on subjects can be defined as:

\[
R_{aMST} \triangleq |T|^{-1} \sum_{\tau \in T} R_\tau, \tag{7.4}
\]

where Rτ is (1.1) with G = τ . The following theorem gives a computationally

tractable form for RaMST in terms of the cell counts of the contingency table and

the set of possible MSTs on categories.

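Theorem 7.1.1. The test statistic based on averaging over all MSTs on subjects is

\[
R_{aMST} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m_k} + |T_0|^{-1} \sum_{\tau_0^* \in T_0^*} \prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}. \tag{7.5}
\]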

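Proof.

\[
\begin{aligned}
R_{aMST} &= |T|^{-1} \sum_{\tau \in T} R_\tau \\
&= |T|^{-1} \sum_{\tau_0 \in T_0} \sum_{\tau_1 \in T_1} \cdots \sum_{\tau_K \in T_K} \left[ R_{\tau_0} + R_{\tau_1} + \cdots + R_{\tau_K} \right] \\
&= |T_0|^{-1} \sum_{\tau_0 \in T_0} R_{\tau_0} + \sum_{k=1}^{K} \Big[ \sum_{\tau_k \in T_k} R_{\tau_k} / S_{m_k} \Big].
\end{aligned} \tag{7.6}
\]

First consider the quantity $\sum_{\tau_k \in T_k} R_{\tau_k} / S_{m_k}$. Since all pairs of subjects in a given

category have the same distance (= 0), the edge between them should appear in

the same number of trees. There are in total $m_k(m_k - 1)/2$ possible pairs and each

spanning tree for $C_k$ has $m_k - 1$ edges. Hence, the edge between each pair of subjects

in $C_k$ appears in exactly

\[
\frac{S_{m_k}(m_k - 1)}{m_k(m_k - 1)/2} = \frac{2 S_{m_k}}{m_k}
\]

trees. Thus,

\[
\sum_{\tau_k \in T_k} \frac{R_{\tau_k}}{S_{m_k}} = \sum_{i,j \in C_k : i < j} I_{g_i \neq g_j} \frac{2 S_{m_k} / m_k}{S_{m_k}} = \frac{2 n_{ak} n_{bk}}{m_k}. \tag{7.7}
\]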

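Next consider the summation over $T_0$. For any $i \in C_u$, $j \in C_v$, if $(u, v) \in \tau_0^*$, then the

edge (i, j) appears in

\[
\prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} / (m_u m_v)
\]

elements in $T_0$, since each of the $m_u m_v$ possible edges connecting categories u and v

appears in an equal number of graphs in $T_0$. Thus,

\[
\begin{aligned}
\sum_{\tau_0 \in T_0} R_{\tau_0} &= \sum_{\tau_0^* \in T_0^*} \sum_{(u,v) \in \tau_0^*} \frac{\prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|}}{m_u m_v} \sum_{i \in C_u} \sum_{j \in C_v} I_{g_i \neq g_j} \\
&= \sum_{\tau_0^* \in T_0^*} \prod_{k=1}^{K} m_k^{|E_k^{\tau_0^*}|} \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}.
\end{aligned} \tag{7.8}
\]

Combining (7.6), (7.7) and (7.8) gives (7.5).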
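Theorem 7.1.1 can be sanity-checked numerically on a tiny table by enumerating every MST on the subjects and comparing the average in (7.4) with the closed form (7.5). The sketch below is an illustration under stated assumptions, not the computational method of the text: the three-category example with all between-category distances tied is hypothetical, the function names are invented, and the brute-force enumeration is only feasible for a handful of nodes.

```python
from fractions import Fraction
from itertools import combinations

def all_msts(n, dist):
    """Enumerate every minimum spanning tree on nodes 0..n-1 by brute force.

    dist(i, j) must return an integer distance; only usable for very small n.
    """
    all_edges = list(combinations(range(n), 2))
    trees = []
    for cand in combinations(all_edges, n - 1):
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        acyclic = True
        for u, v in cand:
            ru, rv = find(u), find(v)
            if ru == rv:
                acyclic = False
                break
            parent[ru] = rv
        if acyclic:  # n-1 acyclic edges on n nodes form a spanning tree
            trees.append((sum(dist(u, v) for u, v in cand), cand))
    best = min(w for w, _ in trees)
    return [t for w, t in trees if w == best]

def r_amst_bruteforce(cat, grp, d):
    """Average number of between-group edges over all MSTs on subjects, as in (7.4)."""
    msts = all_msts(len(cat), lambda i, j: d[cat[i]][cat[j]])
    total = sum(sum(grp[u] != grp[v] for u, v in t) for t in msts)
    return Fraction(total, len(msts))

def r_amst_formula(na, nb, d):
    """Closed form (7.5), enumerating the MSTs on categories only."""
    K = len(na)
    m = [na[k] + nb[k] for k in range(K)]
    within = sum(Fraction(2 * na[k] * nb[k], m[k]) for k in range(K))
    num, size_t0 = Fraction(0), 0
    for tree in all_msts(K, lambda u, v: d[u][v]):
        weight = 1  # prod_k m_k^{deg_k}: the embedding count in (7.1)
        for u, v in tree:
            weight *= m[u] * m[v]
        size_t0 += weight
        num += weight * sum(
            Fraction(na[u] * nb[v] + na[v] * nb[u], m[u] * m[v]) for u, v in tree
        )
    return within + num / size_t0

# Hypothetical example: three categories, all between-category distances tied.
d = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
na, nb = [1, 1, 0], [1, 0, 1]          # cell counts for groups a and b
cat = [0, 0, 1, 2]                     # category of each subject
grp = ["a", "b", "a", "b"]             # group label of each subject

brute = r_amst_bruteforce(cat, grp, d)
closed = r_amst_formula(na, nb, d)
print(brute, closed)
```

Exact rational arithmetic is used so that the two values can be compared with equality rather than a tolerance.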

The following corollaries show that RaMST has a much simpler form if there is a

unique MST on categories, or if the total number of subjects in each category is the

same.

Corollary 7.1.2. When $|T_0^*| = 1$, then

\[
R_{aMST} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m_k} + \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}, \tag{7.9}
\]

where $\tau_0^*$ is the unique MST on categories.

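Corollary 7.1.3. When $m_k \equiv m$, k = 1, . . . , K,

\[
R_{aMST} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m} + |T_0^*|^{-1} \sum_{\tau_0^* \in T_0^*} \sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m^2}. \tag{7.10}
\]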

The form (7.9) of the statistic is especially intuitive. For each category k, we call

the term 2naknbk/mk the mixing potential of the category. The mixing potential is


maximized if nak = nbk = mk/2, that is, when the subjects in category k are evenly

divided between groups a and b; it is minimized when the category contains subjects

from only one group. A mixing potential for each edge (u, v) can also be defined as

(naunbv + navnbu)/(mumv). The edge-wise mixing potential is maximized when the

edge connects a category containing only group a subjects with a category containing

only group b subjects; it is minimized when both categories contain subjects only from

one group. Thus, mixing potentials over categories and over edges between categories

measure the similarity between the two groups. Corollary 7.1.2 shows that, when the

MST on categories is unique, the test statistic RaMST reduces to the sum of mixing

potentials over nodes and edges of the MST on categories. The similarity information

on the categories is explicitly incorporated into the test through the sum of mixing

potentials over the edges between categories.
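As a small worked instance of (7.9), consider a hypothetical table (not taken from the text) with three categories whose unique MST on categories is the path 1-2-3. The node and edge mixing potentials add up as follows.

```latex
\begin{align*}
(n_{a1}, n_{b1}) &= (1, 1), \quad (n_{a2}, n_{b2}) = (1, 1), \quad (n_{a3}, n_{b3}) = (0, 1), \\
\sum_{k=1}^{3} \frac{2 n_{ak} n_{bk}}{m_k} &= \frac{2}{2} + \frac{2}{2} + \frac{0}{1} = 2, \\
\sum_{(u,v) \in \tau_0^*} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}
  &= \frac{1 \cdot 1 + 1 \cdot 1}{2 \cdot 2} + \frac{1 \cdot 1 + 0 \cdot 1}{2 \cdot 1}
   = \frac{1}{2} + \frac{1}{2} = 1, \\
R_{aMST} &= 2 + 1 = 3.
\end{align*}
```

Category 3 contains subjects from one group only, so its node potential is zero, while categories 1 and 2 are evenly split and attain the maximum node potential $m_k/2 = 1$.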

In testing, the sums (7.5), (7.9) and (7.10) must be compared to their permutation

distributions. A generalized statistic that we propose later in Section 7.5 is based

directly on (7.9).

7.1.2 RuMST

Following the notation from the previous section, let M∗0 denote the set of edges

appearing in at least one MST on categories. That is,

\[
M_0^* = \{ (u, v) \in \tau_0^* : \tau_0^* \in T_0^* \}.
\]

In other words, M∗0 is the uMST with the categories as nodes. When there is only

one MST on categories, τ ∗0 , then M∗0 = τ ∗0 ; when there are multiple MSTs on cate-

gories, which is common for categorical data, obtaining M∗0 is not straightforward.

Computation of $M_0^*$ is discussed in Section 7.4. The following theorem describes the

analytic form of RuMST given M∗0.

Theorem 7.1.4. The test statistic based on uMST is

\[
R_{uMST} = \sum_{k=1}^{K} n_{ak} n_{bk} + \sum_{(u,v) \in M_0^*} (n_{au} n_{bv} + n_{av} n_{bu}). \tag{7.11}
\]


Proof. Within each category, every pair of subjects is connected, which gives the first

term of (7.11). If categories u and v are connected in any τ ∗0 ∈ T ∗0 , then each point

in category u is connected to every point in category v, giving the second term of

(7.11). If categories u and v are not connected in any τ ∗0 ∈ T ∗0 , no edge will appear

between categories u and v in uMST.
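Formula (7.11) translates directly into code once the edge set of the uMST on categories is available. The sketch below assumes that edge set is supplied as a list of index pairs; the function name and the three-category example (with all between-category distances tied, so every pair of categories is joined in some MST) are hypothetical.

```python
def r_umst(na, nb, umst_edges):
    """R_uMST from (7.11): within-category pairs plus all pairs across uMST edges."""
    within = sum(a * b for a, b in zip(na, nb))
    between = sum(na[u] * nb[v] + na[v] * nb[u] for u, v in umst_edges)
    return within + between

# Hypothetical cell counts for groups a and b over three categories.
na = [1, 1, 0]
nb = [1, 0, 1]
# With all between-category distances tied, every pair of categories is in M0*.
umst_edges = [(0, 1), (0, 2), (1, 2)]

value = r_umst(na, nb, umst_edges)
print(value)
```

Here the result can be confirmed by hand: one between-group pair inside category 0, plus one between-group pair across each of the three category pairs.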

Remark 7.1.5. Both RuMST and RaMST are derived from sums of $I_{g_i \neq g_j}$ over edges

of the uMST on subjects. The main difference between them is that RuMST treats

all of the edges equally, while RaMST assigns each edge a weight proportional to the

number of MSTs on subjects in which the edge appears. Comparing (7.11) to (7.9),

the denominators in (7.9) are omitted in (7.11). Each edge within category k appears in

$|T|/(m_k/2)$ MSTs, while each edge between categories appears in $|T|/(m_u m_v)$ MSTs.

Therefore, RuMST puts more weight on between-category edges than within-category

edges.

7.2 The Test Statistic Based on MDP

7.2.1 RaMDP

We first assume N , the total number of observations, to be even. Let K0 be the

number of categories containing an odd number of subjects. Since N is even, K0

is even. (K0 can be 0.) Without loss of generality, let categories 1, . . . , K0 be the

categories containing an odd number of subjects, and categories K0 +1, . . . , K be the

categories containing an even number of subjects. More notations are defined below.

• $A = \{x = (x_1, \dots, x_{K_0})^T : x_i \in \{a, b\},\ i = 1, \dots, K_0\}$: all possible combinations

of group identities of the subjects with one from each of the categories containing

an odd number of subjects.

• $R_0(n_a, n_b)$: the number of edges connecting subjects from different groups averaged

over all perfect pairings of $n_a$ points from group a and $n_b$ points from

group b in the same category, with $n_a + n_b$ being even.

• $R_x$, $x \in A$: the number of edges connecting subjects from different groups

averaged over all MDPs on categories 1, . . . , K0.


Assumption 7.2.1. If a category has an even number of subjects, the subjects are

paired within the category.

Assumption 7.2.1 is usually true for MDP on subjects for categorical data. It is

explicitly stated here to avoid the complicated scenario when the triangle inequality

becomes equality in the distance metric for any three categories.

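Proposition 7.2.1. Under Assumption 7.2.1, the test statistic based on averaging

(1.1) over all MDPs is:

\[
R_{aMDP} = \sum_{k=K_0+1}^{K} R_0(n_{ak}, n_{bk}) + \frac{1}{\prod_{k=1}^{K_0} m_k} \sum_{x \in A} \prod_{i=1}^{K_0} n_{x_i i} \left[ R_x + \sum_{j=1}^{K_0} R_0(n_{x_j j} - 1, n_{x_j^c j}) \right], \tag{7.12}
\]

where

\[
x_i^c = \begin{cases} b & \text{if } x_i = a \\ a & \text{if } x_i = b, \end{cases}
\]

\[
R_0(n_a, n_b) = \sum_{i \in S} i \binom{n_a}{i} \binom{n_b}{i} i! \, (n_a - i - 1)!! \, (n_b - i - 1)!! / (n_a + n_b - 1)!! \tag{7.13}
\]

with

\[
S = \begin{cases} \{0, 2, \dots, n_a \wedge n_b\} & \text{if } n_a \text{ and } n_b \text{ both even} \\ \{1, 3, \dots, n_a \wedge n_b\} & \text{if } n_a \text{ and } n_b \text{ both odd,} \end{cases}
\]

and

\[
R_x = |\Omega^*|^{-1} \sum_{\omega^* \in \Omega^*} \sum_{(i,j) \in \omega^*} I_{x_i \neq x_j}, \tag{7.14}
\]

where $\omega^*$ is an MDP on categories 1, . . . , K0, and $\Omega^*$ is the set of all these $\omega^*$'s.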

Proof. First consider the simpler case: One category with na subjects from group a

and nb subjects from group b, with na + nb even. Since all subjects are in the same

category, any perfect pairing is an MDP. There are in total (na + nb − 1)!! different

perfect pairings.


When both $n_a$ and $n_b$ are even, the possible numbers of edges connecting different

groups are: $0, 2, \dots, n_a \wedge n_b$. Among all the $(n_a + n_b - 1)!!$ perfect pairings, the number

of perfect pairings having $i \in \{0, 2, \dots, n_a \wedge n_b\}$ edges connecting different groups is

\[
\binom{n_a}{i} \binom{n_b}{i} i! \, (n_a - i - 1)!! \, (n_b - i - 1)!!.
\]

When both $n_a$ and $n_b$ are odd, the possible numbers of edges connecting different

groups are: $1, 3, \dots, n_a \wedge n_b$. Among all the $(n_a + n_b - 1)!!$ perfect pairings, the

number of perfect pairings having $i \in \{1, 3, \dots, n_a \wedge n_b\}$ edges connecting different

groups is also

\[
\binom{n_a}{i} \binom{n_b}{i} i! \, (n_a - i - 1)!! \, (n_b - i - 1)!!.
\]

(7.13) follows immediately.

Under Assumption 7.2.1, an MDP on all subjects would be an MDP on categories

1, . . . , K0, (ω∗), embedded on the subjects similar to the MST case, with all other

subjects paired within each category, so (7.12) follows naturally.
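The single-category quantity $R_0(n_a, n_b)$ in (7.13) is easy to evaluate exactly. The sketch below is illustrative only; the function names are invented and the convention $(-1)!! = 1$ is assumed. As a sanity check it compares the result with $n_a n_b / (n_a + n_b - 1)$, which follows from linearity of expectation because each of the $n_a n_b$ between-group pairs is matched in a uniformly random perfect pairing with probability $1/(n_a + n_b - 1)$; this shortcut is an observation added here, not a statement from the text.

```python
from fractions import Fraction
from math import comb, factorial

def double_factorial(n):
    """n!! for n >= -1, using the convention (-1)!! = 0!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def r0(na, nb):
    """Average number of between-group edges over all perfect pairings, as in (7.13)."""
    if (na + nb) % 2 != 0:
        raise ValueError("na + nb must be even")
    total = 0
    # i runs over values with the same parity as na (and nb), up to min(na, nb).
    for i in range(na % 2, min(na, nb) + 1, 2):
        pairings_with_i = (
            comb(na, i) * comb(nb, i) * factorial(i)
            * double_factorial(na - i - 1) * double_factorial(nb - i - 1)
        )
        total += i * pairings_with_i
    return Fraction(total, double_factorial(na + nb - 1))

for na, nb in [(1, 1), (2, 2), (3, 1), (4, 2), (5, 3)]:
    print(na, nb, r0(na, nb), Fraction(na * nb, na + nb - 1))
```

The two printed fractions agree in each row, which gives an inexpensive check on an implementation of (7.13).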

Remark 7.2.2. If N , the total number of observations, is odd, we can add a pseudo

category with one subject, whose distance to any other category is 0. All derivations

are the same, except that the edge containing the pseudo category is discarded from

the MDP on categories in later steps.

7.3 A Numerical Study

In this section, the power of the four tests based on RaMST, RuMST, RaMDP and RuNNG

is studied and compared to that of Pearson's Chi-square and deviance tests on simulated

data sets. In each simulation, 30 points are randomly sampled from each of two different

distributions – N(0, 1) vs N(1, 1), N(0, 1) vs N(0, 4), N(0, 1) vs N(1, 4), and U(0, 5)

vs U(1, 6). The combined sample of 60 points is then discretized into 12 bins of equal

width. The value 12 is chosen so that the average number of data points in each

category is 5, mimicking the low cell count scenario. The bins are ranked by their


start positions, and the distance between two categories is defined as the difference

in their ranks. The p-values for all tests are calculated through 1,000 permutation

samples for each simulation run, and the power is obtained from 1,000 simulation runs.

In Figure 7.3, power is plotted versus type I error for each test and each simulation

setting. Pearson’s Chi-square and deviance tests give very similar results, so only the

results for the deviance test are shown. The deviance test is denoted by “LR” since

it is based on the log-likelihood ratio. Power for all tests at the two most commonly

used significance levels – 0.01 and 0.05 – are listed in Table 7.1.
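One run of this simulation design can be sketched for the test based on RaMST as follows. This is a hedged illustration rather than the code used for the reported results. It assumes equal-width bins over the range of the pooled sample, uses the N(0, 1) versus N(1, 1) setting, and computes the statistic through (7.9): with the rank-difference distance, the MST on the non-empty bins is the path through consecutive non-empty bins, since any edge skipping a non-empty bin is strictly longer than each edge of the path it spans. The p-value counts permutations whose statistic is at most the observed one, because small values are evidence against the null. All function names are invented.

```python
import random
from fractions import Fraction

def bin_counts(x_a, x_b, n_bins=12):
    """Discretize the pooled sample into equal-width bins; return bin index and group per subject."""
    pooled = list(x_a) + list(x_b)
    lo, hi = min(pooled), max(pooled)
    width = (hi - lo) / n_bins
    bins = [min(int((x - lo) / width), n_bins - 1) for x in pooled]
    labels = [0] * len(x_a) + [1] * len(x_b)   # 0 = group a, 1 = group b
    return bins, labels

def r_amst_path(bins, labels):
    """(7.9) with the MST on categories taken as the path through consecutive non-empty bins."""
    counts = {}
    for b, g in zip(bins, labels):
        counts.setdefault(b, [0, 0])[g] += 1
    cats = sorted(counts)
    stat = Fraction(0)
    for k in cats:                              # node mixing potentials
        a, b = counts[k]
        stat += Fraction(2 * a * b, a + b)
    for u, v in zip(cats, cats[1:]):            # edge mixing potentials
        (au, bu), (av, bv) = counts[u], counts[v]
        stat += Fraction(au * bv + av * bu, (au + bu) * (av + bv))
    return stat

def permutation_pvalue(bins, labels, n_perm=1000, rng=random):
    """Proportion of label permutations with a statistic no larger than the observed one."""
    observed = r_amst_path(bins, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_amst_path(bins, shuffled) <= observed:
            hits += 1
    return hits / n_perm

rng = random.Random(0)                          # fixed seed, for reproducibility only
x_a = [rng.gauss(0, 1) for _ in range(30)]
x_b = [rng.gauss(1, 1) for _ in range(30)]
bins, labels = bin_counts(x_a, x_b)
p_value = permutation_pvalue(bins, labels, n_perm=1000, rng=rng)
print(p_value)
```

Power at level α would then be estimated as the fraction of repeated runs with a p-value at most α.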

First, compare RaMST, RaMDP, and RuNNG. RaMST is always significantly more powerful

than RaMDP, which in turn is always more powerful than RuNNG. This result is intuitive

from the definition of the different graphs. Since the MST must span the entire data

set, K − 1 out of its N − 1 edges are forced to connect points between categories.

For MDP, if a category has an even number of subjects, the subjects in that category

would be paired amongst themselves; between-category edges are only possible for

those categories having an odd number of subjects. For uNNG, as long as a category

has more than one subject, the subjects in that category would not be connected to

subjects from other categories. Therefore, tests based on MST make the most use

of the similarity information among categories, while the test based on RuNNG makes

the least use of this information. The simulation results show a positive correlation

between using similarity information and the power of the test.

Now, we compare the test based on RuMST to RaMST. As discussed in Remark

7.1.5, the two statistics use the same set of edges but with different weighting. In

simulation, the two statistics perform similarly under the three scenarios that compare

two Normal distributions, while RuMST has very little power, even much lower than

RaMDP and the deviance test, for the comparison of two Uniform distributions with

different supports. When comparing two Normal distributions, the similarity between

two categories is closely related to the difference of the ranks of the categories. That

is, the further apart the ranks of the two categories, the less similar. However, when

comparing two Uniform distributions with different supports – [0,5] vs [1,6] – only

the ranks at the two ends are informative while the middle ranks are not. Since RuMST

puts more weight on between-category edges compared to RaMST, its power would be

lower if the similarity measure among categories is not informative.


Note that of all the graph-based tests, only the test based on RaMST consistently

outperforms the deviance test.

Figure 7.3: Power versus type I error for tests based on RaMST, RuMST, RaMDP, the likelihood ratio (deviance), and RuNNG under different simulation settings.

This simulation study is limited and only uses ranked data. We chose this study

design for its interpretability. Though simple, the results are informative and show the

advantage of averaged MST over averaged MDP and uNNG for categorical data. Also,


                     aMST    uMST    aMDP    uNNG    LR      Pearson
N(0,1) vs N(1,1)
  α = 0.01           0.523   0.495   0.428   0.234   0.355   0.346
  α = 0.05           0.762   0.740   0.679   0.492   0.605   0.605
N(0,1) vs N(0,4)
  α = 0.01           0.304   0.321   0.233   0.133   0.165   0.164
  α = 0.05           0.558   0.585   0.482   0.382   0.394   0.396
N(0,1) vs N(1,4)
  α = 0.01           0.560   0.600   0.434   0.291   0.352   0.345
  α = 0.05           0.804   0.824   0.722   0.569   0.632   0.626
U(0,5) vs U(1,6)
  α = 0.01           0.354   0.218   0.310   0.155   0.283   0.251
  α = 0.05           0.665   0.486   0.607   0.383   0.600   0.552

Table 7.1: The power of six tests – four graph-based tests based on RaMST, RuMST, RaMDP, RuNNG, the deviance test (LR) and Pearson's Chi-square test – under two significance levels (α = 0.01, 0.05) and different simulation settings.

averaged MST is better than uMST when the similarity measure used to construct

the graph is not effective. On the other hand, if the similarity measure is effective,

the test based on uMST is comparable to, and sometimes better than, the test based

on averaged MST. Hence, the rest of this paper focuses on the two tests based on

RaMST and RuMST.

7.4 Computational Issues of RaMST and RuMST

The analytic forms for RaMST and RuMST, (7.5) and (7.11), require enumeration of all

MSTs on categories for RaMST, and enumeration of all edges in $M_0^*$ for RuMST. Let

M = |T ∗0 | be the number of MSTs on categories. If the distance matrix between

categories is continuous-valued, then usually M = 1. Even when the distance matrix

is arithmetic, M is small enough to be manageable for many problems. However,

for problems that exhibit certain symmetries, enumeration of the set of all MSTs on

categories is not computationally feasible. For example, Table 7.2 lists the values of

M for the haplotype association problem in Section 8.2, assuming that there are no

empty categories. In this problem, M is computed using the Matrix-Tree Theorem,


Length of haplotype    K     M
2                      4     4
3                      8     384
4                      16    42467328
5                      32    2.078 × 10^19
6                      64    1.66 × 10^45

Table 7.2: The number of categories, K, and the number of MSTs on categories, M, as haplotype length increases for the haplotype association problem in Section 8.2. All categories are assumed to be non-empty.

yielding the formula

\[
M = 2^{2^l - l - 1} \prod_{i=2}^{l} \exp\left\{ \binom{l}{i} \log i \right\},
\]

where l is the haplotype length. We can see from the formula for M that it increases

fast as l increases. When the length of the haplotype is 6, which is a reasonably short

length in genetic studies, there are only 64 possible categories, M equals 1.66× 1045.

For example, when the length of the haplotype is 6, which is a reasonably short length

in genetic studies, there are only 64 possible categories, but M is already larger than

1045. One may argue that in this case, (7.5) may be further simplified using the

symmetry over categories, so that enumeration of |T ∗0 | is not necessary. This is true

if all categories are non-empty, but if one or more of the categories are empty, the

symmetry breaks, while M would still be too large for enumeration.
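The values of M in the table can be reproduced with a short computation. The sketch below assumes the Hamming distance and that all 2^l categories are non-empty, so that every MST on categories uses only edges of Hamming distance 1 and M is the number of spanning trees of that graph. The function names are illustrative and not part of the original work.

```python
from fractions import Fraction
from math import comb


def hypercube_laplacian(l):
    """Laplacian of the graph on all 2^l binary haplotypes, with an edge
    between two haplotypes whenever their Hamming distance is 1."""
    k = 2 ** l
    lap = [[0] * k for _ in range(k)]
    for u in range(k):
        lap[u][u] = l                      # every node has l neighbours
        for bit in range(l):
            lap[u][u ^ (1 << bit)] = -1    # flip one position
    return lap


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                     # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            if factor != 0:
                for c in range(col, n):
                    a[r][c] -= factor * a[col][c]
    return det


def count_msts_matrix_tree(l):
    """Matrix-Tree Theorem: any cofactor of the Laplacian counts spanning trees."""
    lap = hypercube_laplacian(l)
    minor = [row[1:] for row in lap[1:]]   # delete the first row and column
    return int(determinant(minor))


def count_msts_closed_form(l):
    """Closed form: 2^(2^l - l - 1) * prod_{i=2}^{l} i^C(l, i)."""
    m = 2 ** (2 ** l - l - 1)
    for i in range(2, l + 1):
        m *= i ** comb(l, i)
    return m
```

The determinant route is only practical for small l; the closed form is what makes the l = 5 and l = 6 entries computable.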

Table 7.3 summarizes the computation time for RaMST and RuMST in terms of K

and M . Consider first the listing of all edges in uMST on categories, M∗0, which is

required for RuMST. This task can be completed in O(K^2) time through an algorithm

proposed by Eppstein [1995]. Details of the algorithm are in Appendix C.1, as well

as its theoretical justification. O(K^2) time is usually affordable since K is no larger

than the sample size. Thus, RuMST is computationally feasible for any problem. On the

other hand, RaMST requires the enumeration of all MSTs on categories, not just their

edges, and thus adds O(M) computation time to the algorithm. For the haplotype

example, this makes RaMST computationally infeasible. Thus, in the next section, we

propose a statistic that is motivated by RaMST but that is computationally tractable

for all problems.
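To make the edge-listing task concrete, the following is a minimal sketch of one simple way to list the edges of the uMST on categories. It is not Eppstein's sliding transformation: it is a Kruskal-style substitute that processes edges in groups of equal weight and runs in O(K^2 log K) time. It assumes that ties are exact (for example, integer-valued distances); the function name is illustrative.

```python
def umst_edges(dist):
    """Return the set of edges (u, v), u < v, that belong to at least one
    minimum spanning tree of the complete graph with distance matrix `dist`.

    An edge is in some MST exactly when its endpoints are not already joined
    by strictly lighter edges.
    """
    k = len(dist)
    parent = list(range(k))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    edges = sorted((dist[u][v], u, v) for u in range(k) for v in range(u + 1, k))
    result = set()
    i = 0
    while i < len(edges):
        j = i
        while j < len(edges) and edges[j][0] == edges[i][0]:
            j += 1
        group = edges[i:j]
        # Decide membership using components built from lighter edges only.
        for _, u, v in group:
            if find(u) != find(v):
                result.add((u, v))
        # Then merge components along every edge of this weight.
        for _, u, v in group:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
        i = j
    return result
```

For two-SNP haplotypes under the Hamming distance, this returns the four edges of the square, which is the union of the four MSTs.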


Statistic   Task                                    Computation Time
RaMST       Enumerating all MSTs on categories      O(K^2 + M)
RuMST       Listing edges in uMST on categories     O(K^2)

Table 7.3: Computation time for RaMST and RuMST. M is the number of MSTs on categories.

7.5 A Fast Method Generalized from RaMST

Corollary 7.1.2 gives a simple and intuitive form of RaMST when there is a unique MST

on categories. In that special case, RaMST is the sum of mixing potentials computed

within each category and mixing potentials computed between categories that are

connected by an edge of the MST τ ∗0 . Evidence against the null increases if this sum

of mixing potentials is small, as compared to random permutation. In (7.9), the MST

τ ∗0 serves as an enumeration of the pairs of categories that are highly similar. There

is nothing sacred about the choice of MST for this role. The intuitive interpretation

for (7.9) remains if we replace τ ∗0 by any other graph C0 that represents proximity

between categories.

Up to this point, we have assumed that a distance matrix on categories is used

to represent the similarity between categories. We now bypass the distance matrix

and assume that similarity is directly represented by a graph C0 with the categories

as nodes. Our goal is to incorporate the proximity information encoded by the graph

into the two group comparison. We propose the following statistic, which we call RC0 ,

obtained by substituting C0 for τ ∗0 in (7.9),

$$ R_{C_0} = \sum_{k=1}^{K} \frac{2 n_{ak} n_{bk}}{m_k} + \sum_{(u,v) \in C_0} \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}. \tag{7.15} $$

The above statistic has a similar interpretation to RaMST: Consider all C0-spanning graphs, which are graphs on subjects in which every pair of subjects is connected by a path if they are in the same category or in two categories that are connected by a path in C0. Hence, minimum distance C0-spanning graphs connect subjects within categories by spanning trees and connect exactly one pair of subjects between each pair of categories that have an edge in C0. RC0 is the average of the sum (1.1) over all minimum distance C0-spanning graphs.

If C0 is given, computing RC0 only requires O(K + |C0|) time. If C0 is not given,

the choice of C0 can often be guided by domain knowledge. In the examples below,

our choices for C0 include the uMST on categories, which we denote by C-uMST

(same as M∗0), and the uNNG on categories, which we denote by C-uNNG. Since

C-uMST and C-uNNG can both be computed in O(K^2) time, RC−uMST and RC−uNNG

require only O(K^2) computation time for any problem.
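As an illustration of the O(K + |C0|) cost, (7.15) can be evaluated in a single pass over the categories and a single pass over the edges. The sketch below assumes that every category is non-empty and that C0 is supplied as a list of index pairs; the function and argument names are illustrative.

```python
def r_c0(na, nb, edges):
    """Compute R_{C0} as in (7.15).

    na[k], nb[k]: counts of group-a and group-b subjects in category k
                  (every category is assumed non-empty: na[k] + nb[k] > 0).
    edges:        iterable of pairs (u, v) of category indices, the edges of C0.
    """
    m = [a + b for a, b in zip(na, nb)]
    # Mixing potential within each category.
    within = sum(2.0 * a * b / mk for a, b, mk in zip(na, nb, m))
    # Mixing potential between categories joined by an edge of C0.
    between = sum((na[u] * nb[v] + na[v] * nb[u]) / (m[u] * m[v]) for u, v in edges)
    return within + between
```

With two categories joined by one edge, a perfectly separated sample (`na=[2, 0]`, `nb=[0, 2]`) gives 1.0, while a perfectly mixed one (`na=[1, 1]`, `nb=[1, 1]`) gives 2.5, in line with small values being evidence against the null.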

Chapter 8

Examples

In this chapter, the application of RC−uMST, RC−uNNG and RuMST is illustrated on several examples, both real and simulated. In the simulated examples, their power is compared to that of Chi-square tests. The p-values for all tests are calculated through 1,000 permutation samples for each run, and the power is calculated through 1,000 simulation runs.

8.1 Preference Ranking

Consider comparing two groups of subjects on the ranking of four objects. Let Ξ

be the set of all permutations of the set {1, 2, 3, 4}. Data are simulated under the

following model: Subjects from group a have no preference among the four objects,

and so their ranking is uniformly drawn from Ξ. The rankings of subjects from group

b are generated from the distribution

$$ P_\theta(\zeta) = \frac{1}{\psi(\theta)} \exp\{-\theta\, d(\zeta, \zeta_0)\}, \quad \zeta, \zeta_0 \in \Xi, \ \theta \in \mathbb{R}, \tag{8.1} $$

where d(·, ·) is a distance function and ψ a normalizing constant. This probability

model, first considered by Mallows [1957] with Kendall’s or Spearman’s distance,

favors rankings that are similar to a modal ranking ζ0 if θ > 0. See Diaconis [1988]

for more discussions. The larger the value of θ, the more clustering there should be in

group b around the mode ζ0. We experimented with both Kendall’s and Spearman’s

distance and various values for θ. We assumed that the true distance function used

to generate the data is either known and used to construct the graph, or unknown,

in which case an incorrect distance is used.
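Because Ξ has only 24 elements, model (8.1) can be simulated by direct normalization. The following is a minimal sketch that uses Kendall's distance; the function names and the use of `random.Random.choices` are choices made for this illustration, not a description of the code used for the reported simulations.

```python
import itertools
import math
import random


def kendall_distance(r1, r2):
    """Number of object pairs that the two rankings order differently."""
    pos = {obj: i for i, obj in enumerate(r2)}
    seq = [pos[obj] for obj in r1]
    return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
               if seq[i] > seq[j])


def mallows_pmf(theta, mode, distance=kendall_distance):
    """Probability of every ranking under (8.1), by direct normalization."""
    rankings = list(itertools.permutations(mode))
    weights = [math.exp(-theta * distance(r, mode)) for r in rankings]
    psi = sum(weights)                      # normalizing constant psi(theta)
    return {r: w / psi for r, w in zip(rankings, weights)}


def sample_rankings(pmf, n, rng):
    """Draw n rankings independently from the given probability mass function."""
    rankings = list(pmf)
    return rng.choices(rankings, weights=[pmf[r] for r in rankings], k=n)
```

Setting θ = 0 recovers the uniform distribution used for group a, so the same functions can generate both groups.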

Figure 8.1 shows C-uMST and C-uNNG formed on a typical data set generated

under θ = 5 with na = nb = 20. Spearman’s distance is used in both the generating

model and for constructing the graph. In this particular example, C-uMST contains

all edges in C-uNNG with three extra edges, shown in thinner lines. The reason this

happens is that no category is as close to category “3241” as category “3142”, and no

category is as close to category “3142” as category “3241”. For MST on categories,

more edges are needed to form a spanning tree. It is clear that in this case, there

are three MSTs on categories, each one obtained by adding one of the three thinner

edges to the C-uNNG. In most simulation runs, C-uMST and C-uNNG are the same,

while in those runs where they differ, C-uNNG is always a subset of C-uMST.

Figure 8.2 shows the power versus type I error for θ = 5, na = nb = 20 under

different combinations of using Kendall’s or Spearman’s distance for the generating

model and for constructing the graph; and Table 8.1 lists the power under two most

commonly used significance levels – 0.01 and 0.05. We see that even when a wrong

distance is used, the graph-based tests still have significantly higher power than the

Chi-square tests. For this simulation setting, RuMST is the most powerful among the

three graph-based tests; RC−uMST and RC−uNNG perform similarly with RC−uMST a little

better in all cases, implying that the extra edges in C-uMST do give additional useful

information.

8.2 Haplotype Association

In this example, we consider a disease model where the probability for disease depends

on the haplotype at four single nucleotide polymorphisms (SNPs). We encode the allele

at each SNP as 0 or 1, and so the haplotype can be represented as a binary string.

We assume that the disease probability depends on the number of positions at which

the subject’s haplotype agrees with a target haplotype:

P (Disease) = 0.3 + 0.1× (Number of positions in agreement).


Figure 8.1: C-uMST and C-uNNG constructed on a typical data set generated under parameters ζ0 = 1234 and θ = 5 with na = nb = 20. Spearman’s distance is used in both the generating model and for constructing the graph. Each node is labeled by the ranking it represents, followed by the number of subjects from groups a and b with that ranking in parentheses.

Setting  α      uMST    C-uMST  C-uNNG  Pearson  LR
KK       0.01   0.566   0.413   0.397   0.214    0.206
KK       0.05   0.784   0.660   0.648   0.450    0.439
KS       0.01   0.567   0.395   0.385   0.221    0.209
KS       0.05   0.784   0.649   0.631   0.455    0.437
SS       0.01   0.588   0.491   0.478   0.247    0.240
SS       0.05   0.807   0.715   0.703   0.485    0.480
SK       0.01   0.607   0.495   0.486   0.253    0.248
SK       0.05   0.811   0.729   0.715   0.494    0.481

Table 8.1: The power of five tests (three graph-based tests based on RuMST, RC−uMST and RC−uNNG, and two Chi-square tests) under two significance levels (α = 0.01, 0.05) and different simulation settings.


Figure 8.2: Power versus type I error for five different tests in the preference ranking example with θ = 5 and na = nb = 20. One of two distance measures (Kendall’s or Spearman’s distance) was used for the generating model and for constructing the graph. The title of each plot denotes the choice of distance: the first letter denotes the distance used in the generating model (“K” is Kendall’s and “S” is Spearman’s distance), and the second letter denotes the distance used in constructing the graph. For instance, “KS” in the bottom left panel means that Kendall’s distance is used in the generating model, but Spearman’s distance is used in constructing the graph.


Thus, the probability of disease can take values 0.3, 0.4, 0.5, 0.6 or 0.7 depending on whether there are 0, 1, 2, 3 or 4 positions in agreement. To make the problem harder, we assume that seven non-informative SNPs are analyzed together with the four informative SNPs, and that which and how many of the 11 SNPs are informative is unknown in the analysis. Thus the data actually consist of haplotypes of length 11. There are 2^11 = 2,048 possible categories. In each simulation, 1,000 haplotypes of length 11 are generated uniformly from all possible haplotypes. Each subject with a given haplotype is assigned as “patient” or “normal” according to the disease model.

Since only 1,000 subjects are simulated in each run, not all of the 2,048 categories

are represented. The number of non-empty categories in each run ranged from 755

to 823, with an average of 791 in the 1,000 simulation runs. The Hamming distance

is used to construct the graph. Figure 8.3 shows the power versus type I error plots

for the five tests. It is clear that, by incorporating the information in the graph, tests

based on RuMST, RC−uMST and RC−uNNG all have much higher power than the Pearson’s

Chi-square and deviance tests. Among the three graph-based tests, the one based on

RuMST works a little better than the ones based on RC−uMST and RC−uNNG.
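The data-generating step of this example can be sketched as follows. Two details are assumptions made only for illustration: the informative SNPs are taken to be the first four positions, and the target haplotype is taken to be all ones at those positions. The second function gives the expected number of non-empty categories under uniform sampling, which for 1,000 subjects and 2,048 categories is about 791, consistent with the average reported above.

```python
import random


def simulate_haplotype_study(n_subjects=1000, length=11, n_informative=4, rng=None):
    """Simulate one run of the disease model.

    Assumptions for illustration: the informative SNPs are the first
    `n_informative` positions and the target haplotype is all ones there.
    """
    rng = rng or random.Random()
    cases, controls = [], []
    for _ in range(n_subjects):
        hap = tuple(rng.randint(0, 1) for _ in range(length))
        agree = sum(hap[:n_informative])          # positions matching the target
        p_disease = 0.3 + 0.1 * agree
        (cases if rng.random() < p_disease else controls).append(hap)
    return cases, controls


def expected_nonempty(n_subjects, n_categories):
    """Expected number of distinct categories among uniform draws."""
    return n_categories * (1.0 - (1.0 - 1.0 / n_categories) ** n_subjects)
```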

8.3 Binary Clinical Features

This example comes from Anderson et al. [1972] and Nettleton and Banerjee [2001].

Data on the presence or absence of 17 clinical features of the eye ailment Kerato-

conjunctivitis Sicca (KCS) are given for two groups of patients. A question asked

by Nettleton and Banerjee was whether the two groups of patients share a common

distribution with respect to these clinical features. The sizes of the groups are 40 and

24. It turned out that only two subjects had the same outcome for the 17 clinical

features, so there are in total 63 distinct categories. The Hamming distance is used to

construct the graph, and p-values are calculated through 10,000 permutation samples

and shown in Table 8.2. Nettleton and Banerjee’s method is based on the uNNG on

subjects. As discussed before and confirmed by simulation studies in Section 7.3, the

uNNG on subjects has lower power than MST based tests when many categories have

more than one subject. This is not a problem in this data because only one category

has more than one subject. We see that RuMST, RC−uMST and RC−uNNG all detected the


Figure 8.3: The power versus type I error plots for the five tests for the haplotype example. The length of the haplotype is 11, but only 4 positions are informative.


difference between the two groups of patients, while the Chi-square tests did not.

Table 8.2: p-values for the KCS data set.

RuMST    RC−uMST  RC−uNNG  Nettleton and Banerjee’s  Pearson  LR
0.0011   0.0010   0.0006   0.0007                    0.5200   0.5200

Chapter 9

Permutation Distributions of the

Test Statistics

Based on the results in Sections 7.3-7.5, we focus on RC−uMST and RuMST. In this chapter,

we consider the permutation distributions of these two statistics in their generalized

forms. That is, we consider RC0 and TC0 , the latter defined as

$$ T_{C_0} = \sum_{u=1}^{K} n_{au} n_{bu} + \sum_{(u,v) \in C_0} (n_{au} n_{bv} + n_{bu} n_{av}). \tag{9.1} $$

TC−uMST is equivalent to RuMST.

We define two quantities that will be used to characterize the permutation distri-

butions:

$$ \lambda := \max_u |E^{C_0}_u|, \text{ the maximum node degree in } C_0. \tag{9.2} $$

$$ \beta := \max_u m_u, \text{ the maximum total count for a category.} \tag{9.3} $$

By permutation distribution, we are referring to the distribution of the statistic under

random uniform permutation of the group labels. This is used as the null distribution

to assess statistical significance. We use PP, EP and VarP to denote the probability,

expectation and variance under the permutation null.


9.1 RC0

The following lemma states that the first two moments of RC0 under the permutation

null can be computed instantaneously using basic summary statistics of the graph

and cell counts of the contingency table.

Lemma 9.1.1. The mean and variance of RC0 under the permutation null are

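$$ \mathrm{E}_P[R_{C_0}] = (N - K + |C_0|)\, 2p_1, \tag{9.4} $$

$$ \begin{aligned} \mathrm{Var}_P[R_{C_0}] = {} & 4(p_1 - p_2)\left(N - K + 2|C_0| + \sum_{u=1}^{K} \frac{|E_u|^2}{4 m_u} - \sum_{u=1}^{K} \frac{|E_u|}{m_u}\right) \\ & + (6p_2 - 4p_1)\left(K - \sum_{u=1}^{K} \frac{1}{m_u}\right) + p_2 \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \\ & + (N - K + |C_0|)^2 (p_2 - 4p_1^2), \end{aligned} \tag{9.5} $$

where

$$ p_1 = \frac{n_a n_b}{N(N-1)}, \quad p_2 = \frac{4 n_a (n_a - 1) n_b (n_b - 1)}{N(N-1)(N-2)(N-3)}. \tag{9.6} $$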

Proof of Lemma 9.1.1 is given in Appendix C.2.1.
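For small problems, the moments in Lemma 9.1.1 can be checked against exhaustive enumeration of the group labels. The sketch below evaluates (9.4)-(9.6) in exact rational arithmetic and compares them with a brute-force computation; it takes |E_u| to be the degree of category u in C0, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def rc0_moments(m, edges, na):
    """Permutation mean and variance of R_{C0} from (9.4)-(9.6).

    m[u]: total count in category u (all positive); edges: edge list of C0;
    na:   size of group a.  Requires N = sum(m) >= 4.
    """
    N, K, nb = sum(m), len(m), sum(m) - na
    deg = [0] * K                                   # |E_u|, degree of u in C0
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    p1 = Fraction(na * nb, N * (N - 1))
    p2 = Fraction(4 * na * (na - 1) * nb * (nb - 1),
                  N * (N - 1) * (N - 2) * (N - 3))
    n_e = len(edges)
    mean = (N - K + n_e) * 2 * p1
    var = (4 * (p1 - p2) * (N - K + 2 * n_e
                            + sum(Fraction(d * d, 4 * mu) for d, mu in zip(deg, m))
                            - sum(Fraction(d, mu) for d, mu in zip(deg, m)))
           + (6 * p2 - 4 * p1) * (K - sum(Fraction(1, mu) for mu in m))
           + p2 * sum(Fraction(1, m[u] * m[v]) for u, v in edges)
           + (N - K + n_e) ** 2 * (p2 - 4 * p1 ** 2))
    return mean, var


def rc0_moments_brute_force(m, edges, na):
    """Same moments by enumerating every choice of `na` subjects for group a."""
    category = [u for u, mu in enumerate(m) for _ in range(mu)]
    values = []
    for group_a in combinations(range(len(category)), na):
        a = [0] * len(m)
        for i in group_a:
            a[category[i]] += 1
        b = [mu - au for mu, au in zip(m, a)]
        r = sum(Fraction(2 * au * bu, mu) for au, bu, mu in zip(a, b, m))
        r += sum(Fraction(a[u] * b[v] + a[v] * b[u], m[u] * m[v]) for u, v in edges)
        values.append(r)
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var
```

For two categories of two subjects each, joined by one edge, with na = nb = 2, both functions return a mean of 2 and a variance of 1/2.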

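Remark 9.1.2. As N → ∞ with na/N → γ ∈ (0, 1), we have $p_2 = 4p_1^2$ and thus

$$ \begin{aligned} \mathrm{Var}_P[R_{C_0}] = {} & 4(p_1 - p_2)\left(N - K + 2|C_0| + \sum_{u=1}^{K} \frac{|E_u|^2}{4 m_u} - \sum_{u=1}^{K} \frac{|E_u|}{m_u}\right) \\ & + (6p_2 - 4p_1)\left(K - \sum_{u=1}^{K} \frac{1}{m_u}\right) + p_2 \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}. \end{aligned} $$

Furthermore, if γ = 0.5, then $p_1 = p_2 = 1/4$ and we have

$$ \mathrm{Var}_P[R_{C_0}] = \frac{1}{2}\left(K - \sum_{u=1}^{K} \frac{1}{m_u}\right) + \frac{1}{4} \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}. $$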

We next give sufficient conditions guaranteeing the convergence to normality of

RC0 after standardization by its mean and variance.


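Condition 1.

$$ \sum_{u=1}^{K} m_u (m_u + |E^{C_0}_u|)\Big(m_u + \sum_{v \in V_u} m_v + |E^{C_0}_{u,2}|\Big) \sim o(K^{3/2}), $$

$$ \sum_{(u,v) \in C_0} (m_u + m_v + |E^{C_0}_u| + |E^{C_0}_v|)\Big(m_u + m_v + \sum_{w \in (V_u \cup V_v)} m_w + |E^{C_0}_{u,2}| + |E^{C_0}_{v,2}|\Big) \sim o(K^{3/2}). $$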

Condition 1 constrains the size of “hubs” in the graph: The node degrees in C0

and the number of observations in each category must not get too large. It can be

simplified to stronger conditions that are easier to comprehend. For example, the

following implies Condition 1:

Condition 1′′. $\beta^6 \lambda^2$ and $\lambda^8$ are both o(K).

The second condition is usually trivial:

Condition 2. $N$, $|C_0|$, and $\sum_{(u,v) \in C_0} \frac{1}{m_u m_v}$ are all O(K).

The asymptotic distribution of the standardized form of RC0 is given in the fol-

lowing theorem.

Theorem 9.1.3. Assume that Conditions 1 and 2 hold. Under the permutation null, the standardized statistic

$$ \frac{R_{C_0} - \mathrm{E}_P[R_{C_0}]}{\sqrt{\mathrm{Var}_P[R_{C_0}]}} $$

converges in distribution to N(0, 1) as K → ∞ with na/N bounded away from 0 and 1.

The proof of Theorem 9.1.3 is given in Appendix C.2.2.
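In practice, the theorem is used to turn an observed value of the statistic into an approximate p-value. The following is a minimal sketch; it uses the lower tail because small values of the statistic are evidence against the null, and the function name is illustrative.

```python
import math


def normal_p_value(statistic, mean, variance):
    """Lower-tail normal-approximation p-value for a standardized statistic.

    The p-value is Phi(z) with z = (statistic - mean) / sqrt(variance),
    where Phi is the standard normal distribution function.
    """
    z = (statistic - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))
```

The mean and variance arguments would be the permutation moments of the statistic, for example (9.4) and (9.5).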

Theorem 9.1.3 can be applied to any type of graph, allowing for repeated observa-

tions of each node. Since the statistics in Friedman and Rafsky [1979] and Rosenbaum

[2005] do not allow ties, their asymptotic normality results are also restricted to the

case where each node is observed only once. To compare Theorem 9.1.3 to its counterpart in these two papers, we let G = C0 and assume that $m_u \equiv 1$. Thus,

$$ N = K, \quad \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} = |C_0| = |G|. $$


Condition 2 requires that |G| ∼ O(K) and Condition 1 can be simplified to:

$$ \sum_{u=1}^{K} |E^G_u||E^G_{u,2}| \sim o(K^{3/2}), \quad \sum_{(u,v) \in G} (|E^G_u| + |E^G_v|)(|E^G_{u,2}| + |E^G_{v,2}|) \sim o(K^{3/2}). $$

Hence, Theorem 9.1.3 implies the asymptotic normality result in Rosenbaum [2005], since $|E^G_u| \equiv 1$, $|E^G_{u,2}| \equiv 1$ and $|G| = K/2$ for MDP. Friedman and Rafsky proved a more general condition for asymptotic normality of sums (1.1) after standardization: For sparse graphs where |G| ∼ O(K), the number of edge pairs that share a common node must be O(K). Condition 1 is neither stronger nor weaker than Friedman and Rafsky’s condition. For example, consider a graph with one node having degree $K^{1/2}$ and all other nodes having degree 1; this graph satisfies Friedman and Rafsky’s condition but not Condition 1, since $\sum_{(u,v) \in G} |E^G_u||E^G_{u,2}| = O(K^{3/2})$. On the other hand, a graph with $\sqrt{K}$ nodes having degree $K^{0.3}$ and all other nodes having degree 1 would satisfy Condition 1 but not Friedman and Rafsky’s condition.
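The orders of magnitude in these two examples can be checked by a rough count. The sketch below writes $d_u$ for the degree of node $u$, uses the fact that the number of edge pairs sharing a common node is $\sum_u \binom{d_u}{2}$, and assumes that each hub is attached only to degree-1 nodes, so that $|E^G_{u,2}|$ is of the same order as the hub degree for a hub and for its neighbours. Lower-order terms are ignored.

```latex
% Example 1: one hub of degree K^{1/2}, all other nodes of degree 1.
\sum_u \binom{d_u}{2} \approx \tfrac{1}{2} K = O(K),
\qquad
\sum_{(u,v) \in G,\ u = \text{hub}} |E^G_u|\,|E^G_{u,2}|
  \approx K^{1/2} \cdot K^{1/2} \cdot K^{1/2} = K^{3/2}.

% Example 2: K^{1/2} hubs of degree K^{0.3}, all other nodes of degree 1.
\sum_u \binom{d_u}{2} \approx K^{1/2} \cdot \tfrac{1}{2} K^{0.6} = \tfrac{1}{2} K^{1.1} \neq O(K),
\qquad
\sum_{(u,v) \in G} (|E^G_u| + |E^G_v|)(|E^G_{u,2}| + |E^G_{v,2}|)
  \approx K^{0.8} \cdot 2K^{0.6} = 2K^{1.4} = o(K^{3/2}).
```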

9.2 TC0

The following lemma is the counterpart of Lemma 9.1.1 for RC0. Its proof is given

in Appendix C.2.3.

Lemma 9.2.1. The mean and variance of TC0 under the permutation null are

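$$ \mathrm{E}_P[T_{C_0}] = \left( \sum_{u=1}^{K} m_u (m_u - 1) + 2 \sum_{(u,v) \in C_0} m_u m_v \right) p_1, \tag{9.7} $$

$$ \begin{aligned} \mathrm{Var}_P[T_{C_0}] = {} & (p_1 - p_2) \sum_{u=1}^{K} m_u \Big(m_u + \sum_{v \in V_u} m_v - 1\Big)\Big(m_u + \sum_{v \in V_u} m_v - 2\Big) \\ & + (p_1 - p_2/2) \left( \sum_{u=1}^{K} m_u (m_u - 1) + 2 \sum_{(u,v) \in C_0} m_u m_v \right) \\ & + \left( \frac{p_2}{4} - p_1^2 \right) \left( \sum_{u=1}^{K} m_u (m_u - 1) + 2 \sum_{(u,v) \in C_0} m_u m_v \right)^2, \end{aligned} \tag{9.8} $$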

where p1 and p2 are given in (9.6).
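The statistic (9.1) and its permutation mean (9.7) are simple enough to check by exhaustive enumeration on a toy example. The sketch below computes the mean from the closed form and both moments by brute force; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def t_c0(a, b, edges):
    """T_{C0} from (9.1) for group counts a[u], b[u] per category."""
    within = sum(au * bu for au, bu in zip(a, b))
    between = sum(a[u] * b[v] + b[u] * a[v] for u, v in edges)
    return within + between


def t_c0_mean(m, edges, na):
    """Permutation mean of T_{C0} from (9.7), with p1 as in (9.6)."""
    N = sum(m)
    p1 = Fraction(na * (N - na), N * (N - 1))
    total = sum(mu * (mu - 1) for mu in m) + 2 * sum(m[u] * m[v] for u, v in edges)
    return total * p1


def t_c0_moments_brute_force(m, edges, na):
    """Permutation mean and variance by enumerating every group assignment."""
    category = [u for u, mu in enumerate(m) for _ in range(mu)]
    values = []
    for group_a in combinations(range(len(category)), na):
        a = [0] * len(m)
        for i in group_a:
            a[category[i]] += 1
        b = [mu - au for mu, au in zip(m, a)]
        values.append(Fraction(t_c0(a, b, edges)))
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var
```

With two unconnected categories of two subjects each and na = nb = 2, the closed form and the enumeration both give a mean of 4/3, and the enumeration gives a variance of 8/9.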

The next theorem gives a sufficient condition for asymptotic normality of TC0

under the permutation null.

Theorem 9.2.2. If $\sum_{u=1}^{K} m_u \big(m_u + \sum_{v \in V_u} m_v\big)^2 \sim O(N)$, then under the permutation null distribution, the standardized statistic

$$ \frac{T_{C_0} - \mathrm{E}_P[T_{C_0}]}{\sqrt{\mathrm{Var}_P[T_{C_0}]}}, $$

where $\mathrm{E}_P[T_{C_0}]$ and $\mathrm{Var}_P[T_{C_0}]$ are given in (9.7) and (9.8), converges in distribution to N(0, 1) as N → ∞ with na/N bounded away from 0 and 1.

Proof. Let G be the uMST on subjects. Then as long as $\sum_{i=1}^{N} |G_i|(|G_i| - 1) \sim O(N)$, asymptotic normality can be ensured following Friedman and Rafsky [1979]’s result. Notice that if i is in category u, then $|G_i| = (m_u - 1) + \sum_{v \in V_u} m_v$.

9.3 Checking the p-Values Under Normal Approximations

We now check the normal approximations to the p-values of the three graph-based

statistics – RC−uMST, RC−uNNG and RuMST – through simulation. We adopt the setting

of the haplotype example. In each simulation run, N haplotypes with length l are

generated uniformly from all possible haplotypes with length l. They are assigned to

either group with equal probability. Hence, the two groups have the same distribution.


For each simulation run, we calculate the difference between theoretical p-values from

the normal approximation and the permutation p-values from 10,000 permutations

for the three statistics. We consider different sparsity settings by varying l, which

controls the number of categories, and N . Under each setting, 100 simulation runs

are done, with boxplots of the differences between theoretical and simulation p-value

shown in Figure 9.1. We increased l from 6 to 10, and thus the number of possible

categories considered grows from 64 to 1024. The sample size N varies from 100 to

1000. This spectrum of values is reasonable for a genetic association study.

Simulation results under this setting show that the normal approximation is better for RC−uMST and RC−uNNG than for RuMST. Accuracy of the normal approximation improves for all statistics as l and N increase. For RC−uMST and RC−uNNG, when the number of possible categories is larger than 256 and the number of observations is larger than 200, the p-value from the normal approximation is quite accurate. For RuMST, however, the number of observations needs to be larger than 500 to achieve similar accuracy. For RuMST, when the number of possible categories is larger than the number of observations, the p-value calculated from the normal approximation is negatively biased, and thus less conservative. The bias is less severe for RC−uMST and RC−uNNG, but is still problematic when the number of possible categories is 1024 and the number of observations is only 100. Skewness correction can be done to make the theoretical p-values more accurate, but when N is so small, it would be easier to just do permutation directly.


Figure 9.1: Boxplots for the differences between p-values calculated from normal approximation and 10,000 permutations.

Chapter 10

Discussion

We have described a new approach for comparing two categorical samples, which

is appealing when the contingency table is sparsely populated. Sparse contingency

tables are common in many modern applications where the number of categories,

K, is large compared to sample size. In such situations, the different categories can

usually be related to each other in a systematic way. The new approach utilizes

a graphical encoding of the similarity between categories to improve the power of

two-sample comparison. We showed, through simulations and real examples, that

utilizing graphical information improves the power over deviance and Pearson’s Chi-

square tests. The proposed statistics are shown to be asymptotically normal after

standardization, under assumptions that limit the hub size and density of the graph.

This allows instantaneous type I error control for large data sets.

The power of the new approach depends on the choice of an informative similarity

measure between categories. This part of the analysis should rely on domain knowl-

edge that is specific to each application. For ranking data from surveys, one can

start with the standard distance measures used in Example 8.1. When the number of

categories is large, drawing relationships between categories is a necessary and often

default step in analyzing the data.

Both RC−uMST and RuMST work well when the similarity information is effective

with RuMST usually having better power. However, when the similarity measure is

not as informative, RuMST can have very low power, even when compared to Chi-

square tests. In our simulation studies derived from the Haplotype problem, the

normal approximation is more accurate for RC−uMST than for RuMST. For RuMST, p-values

obtained from normal approximation are lower than actual p-values in extremely

sparse situations. All p-value approximations work well when the sample size is

comparable to the number of categories.

Generalization of this approach to multi-sample comparison is straightforward by

letting gi take K ′ distinct values, where K ′ is the number of groups.

Appendix A

Existing Theorems Used in Proofs

A.1 Stein’s Method

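Here we state the form of Stein’s method that we use. Consider sums of the form $W = \sum_{i \in \mathcal{J}} \xi_i$, where $\mathcal{J}$ is an index set and the $\xi_i$ are random variables with $\mathrm{E}[\xi_i] = 0$ and $\mathrm{E}[W^2] = 1$. The following assumption restricts the dependence among $\{\xi_i : i \in \mathcal{J}\}$.

Assumption A.1.1. [Chen and Shao, 2005, p. 17] For each $i \in \mathcal{J}$ there exists $S_i \subset T_i \subset \mathcal{J}$ such that $\xi_i$ is independent of $\xi_{S_i^c}$ and $\xi_{S_i}$ is independent of $\xi_{T_i^c}$.

Theorem A.1.1. [Chen and Shao, 2005, Theorem 3.4] Under Assumption A.1.1, we have

$$ \sup_{h \in \mathrm{Lip}(1)} |\mathrm{E}h(W) - \mathrm{E}h(Z)| \le \delta, $$

where Lip(1) is the class of Lipschitz functions with Lipschitz constant 1, Z has the N(0, 1) distribution and

$$ \delta = 2 \sum_{i \in \mathcal{J}} \big(\mathrm{E}|\xi_i \eta_i \theta_i| + |\mathrm{E}(\xi_i \eta_i)|\, \mathrm{E}|\theta_i|\big) + \sum_{i \in \mathcal{J}} \mathrm{E}|\xi_i \eta_i^2|, \tag{A.1} $$

with $\eta_i = \sum_{j \in S_i} \xi_j$ and $\theta_i = \sum_{j \in T_i} \xi_j$, where $S_i$ and $T_i$ are defined in Assumption A.1.1.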


Appendix B

Supporting Materials for Part I

B.1 Proofs for the Limiting Distributions

We here prove that $\{Z_G([nu]) : 0 < u < 1\}$ converges to a Gaussian process. The proof for the convergence of $\{Z_G([nu], [nv]) : 0 < u < v < 1\}$ to a two-dimensional Gaussian random field can be done in the same manner but with a more careful treatment of the indexes.

To prove that $\{Z_G([nu]) : 0 < u < 1\}$ converges to a Gaussian process, we only need to show that $(Z_G([nu_1]), Z_G([nu_2]), \ldots, Z_G([nu_K]))$ is multivariate Gaussian for any $0 < u_1 < u_2 < \cdots < u_K < 1$ and fixed K. For simplicity, let $t_k = [nu_k]$, $k = 1, \ldots, K$.

To prove that $(Z_G(t_1), Z_G(t_2), \ldots, Z_G(t_K))$ is multivariate Gaussian, we take one step back. Under the permutation distribution, we permute the order of the observations. Let π(i) be the observed time of $y_i$ after permutation; then (π(1), . . . , π(n)) is a permutation of 1, . . . , n. To get the permutation distribution, we can proceed in two steps: 1) for each i, π(i) is sampled uniformly from 1 to n; 2) only those samples in which each value in 1, . . . , n is sampled exactly once are retained. It is easy to see that each permutation has the same occurrence probability after the two steps.
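The two-step construction can be illustrated by rejection sampling. The following is a minimal sketch (the function name is illustrative): step 1 is an i.i.d. uniform draw, and step 2 discards any draw in which some value repeats, so the retained draws are uniform over the n! permutations.

```python
import itertools
import random
from collections import Counter


def sample_permutation_by_rejection(n, rng):
    """Step 1: draw pi(1), ..., pi(n) i.i.d. uniform on {1, ..., n}.
    Step 2: keep the draw only if every value appears exactly once."""
    while True:
        pi = [rng.randint(1, n) for _ in range(n)]
        if len(set(pi)) == n:
            return tuple(pi)
```

Rejection is wasteful for large n, since only a fraction n!/n^n of the draws is retained; the construction matters here as a proof device rather than as a sampling algorithm.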

We call the distribution resulting from only performing the first step the bootstrap distribution, and we use PB, EB and VarB to denote the probability, expectation and variance under it, respectively. (P, E, Var without the subscript B are used to denote the corresponding quantities under the permutation null.) Let

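$$ Z^B_G(t) = -\frac{R_G(t) - \mathrm{E}_B(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}}; $$

$$ n^B(t) = \sum_{i=1}^{n} I_{\pi(i) \le t}, \quad X^B(t) = \frac{n^B(t) - t}{\sqrt{t(1 - t/n)}}. $$

Then, following an argument similar to that in the proof of Lemma 3.2.2 but replacing the permutation distribution with the bootstrap distribution, we have

$$ \mathrm{E}_B(R_G(t)) = p^B_1(t)|G|, $$

$$ \mathrm{Var}_B(R_G(t)) = p^B_2(t)|G| + \left( \frac{1}{2} p^B_1(t) - p^B_2(t) \right) \sum_i |G_i|^2, $$

where

$$ p^B_1(t) = \frac{2t(n - t)}{n^2}, \quad p^B_2(t) = \frac{4t^2(n - t)^2}{n^4}. $$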
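These bootstrap moments can be checked by enumerating all n^n bootstrap draws on a toy graph. The sketch below assumes that R_G(t) counts the edges (i, j) of G with exactly one of π(i), π(j) at most t, and that |G_i| is the degree of node i; nodes are indexed from 0 and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def bootstrap_moments_formula(n, edges, t):
    """Bootstrap mean and variance of R_G(t) from the closed forms."""
    p1 = Fraction(2 * t * (n - t), n ** 2)
    p2 = Fraction(4 * t ** 2 * (n - t) ** 2, n ** 4)
    deg = [0] * n                                   # |G_i|
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    mean = p1 * len(edges)
    var = p2 * len(edges) + (p1 / 2 - p2) * sum(d * d for d in deg)
    return mean, var


def bootstrap_moments_brute_force(n, edges, t):
    """Same moments by enumerating all n^n equally likely bootstrap draws."""
    values = []
    for pi in product(range(1, n + 1), repeat=n):
        r = sum((pi[i] <= t) != (pi[j] <= t) for i, j in edges)
        values.append(Fraction(r))
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var
```

For a path on three nodes with t = 1, both functions give a mean of 8/9 and a variance of 44/81.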

We prove the following two lemmas.

Lemma B.1.1. If $\sum_{e \in G} |A_e||B_e| \sim o(|G|^{3/2})$ and $|G| \sim O(n)$, then under the bootstrap null, for $0 < u_1 < u_2 < \cdots < u_K < 1$,

$$ (Z^B_G(t_1), Z^B_G(t_2), \ldots, Z^B_G(t_K), X^B(t_1), X^B(t_2), \ldots, X^B(t_K)) $$

has a non-degenerate multivariate Gaussian distribution.

Lemma B.1.2. When $|G| \sim o(n^2)$, for $t \sim O(n)$, as $|G| \to \infty$, we have

1. $\dfrac{\mathrm{Var}_B(R_G(t))}{\mathrm{Var}(R_G(t))} \to 1$.

2. $\dfrac{\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} \to 0$.


Since $(Z^B_G(t_1), Z^B_G(t_2), \ldots, Z^B_G(t_K) \mid X^B(t_1) = 0, X^B(t_2) = 0, \ldots, X^B(t_K) = 0)$ under the bootstrap null has the same distribution as $(Z^B_G(t_1), Z^B_G(t_2), \ldots, Z^B_G(t_K))$ under the permutation null, and since

$$ \frac{R_G(t) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}(R_G(t))}} = \sqrt{\frac{\mathrm{Var}_B(R_G(t))}{\mathrm{Var}(R_G(t))}} \left( \frac{R_G(t) - \mathrm{E}_B(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} + \frac{\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} \right), $$

Lemma B.1.1 and Lemma B.1.2 together give the result that $(Z_G([nu_1]), Z_G([nu_2]), \ldots, Z_G([nu_K]))$ is multivariate Gaussian.

We next show the proof for the two lemmas.

Proof for Lemma B.1.1. We only need to show that

(a) $\mathrm{Var}_B\big(\sum_{k=1}^{K} (a_k Z^B_G(t_k) + b_k X^B(t_k))\big)$ is bounded away from 0 for $\sum_k (a_k^2 + b_k^2) \ne 0$;

(b) $\sum_{k=1}^{K} (a_k Z^B_G(t_k) + b_k X^B(t_k))$ is normally distributed for any fixed $a_k$, $b_k$.

Let $\sigma_B(t_k) = \sqrt{\mathrm{Var}_B(R_G(t_k))}$. Following arguments similar to those in Section 4.2.2 but replacing the permutation distribution with the bootstrap distribution, we have

$$ \mathrm{cov}_B(Z^B_G(t_1), Z^B_G(t_2)) = \frac{4u_1^2(1 - u_2)^2|G| + u_1(1 - u_2)(1 - 2u_1)(1 - 2u_2) \sum_i |G_i|^2}{\sigma_B(t_1)\sigma_B(t_2)}. $$

$|\mathrm{cov}_B(Z^B_G(t_1), Z^B_G(t_2))|$ is strictly bounded away from 1 when $u_1 \ne u_2$.

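Notice that

$$ \begin{aligned} R(t)\, n^B(t) &= \sum_{(i,j) \in G} \big(I_{\pi(i) \le t, \pi(j) > t} + I_{\pi(i) > t, \pi(j) \le t}\big) \sum_{l=1}^{n} I_{\pi(l) \le t} \\ &= \sum_{(i,j) \in G} \Big(I_{\pi(i) \le t, \pi(j) > t} + \sum_{l \ne i,j} I_{\pi(i) \le t, \pi(j) > t, \pi(l) \le t} + I_{\pi(i) > t, \pi(j) \le t} + \sum_{l \ne i,j} I_{\pi(i) > t, \pi(j) \le t, \pi(l) \le t}\Big), \end{aligned} $$

so $\mathrm{E}_B(R(t)\, n^B(t)) = |G|\, 2u(1 - u)(un - 2u + 1)$, and then

$$ \mathrm{cov}_B(Z^B_G(t), X^B(t)) = \frac{2u(1 - u)(1 - 2u)|G|}{\sigma_B(t)\sqrt{nu(1 - u)}}. $$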


If u = 1/2, then $\mathrm{cov}_B(Z^B_G(t), X^B(t)) = 0$, which is strictly bounded away from 1. For $u \ne 1/2$,

$$ \mathrm{cov}_B(Z^B_G(t), X^B(t)) = \frac{1}{\sqrt{\dfrac{n \sum_i |G_i|^2}{4|G|^2} + \dfrac{nu(1 - u)}{|G|(1 - 2u)^2}}}. $$

Since $n \sum_i |G_i|^2 \ge 4|G|^2$ by the Cauchy-Schwarz inequality, and $|G| \sim O(n)$, we have

$$ \mathrm{cov}_B(Z^B_G(t), X^B(t)) \ge 0 $$

and strictly bounded away from 1.

Given these two essential facts, namely that $|\mathrm{cov}_B(Z^B_G(t_1), Z^B_G(t_2))|$ is strictly bounded away from 1 when $u_1 \ne u_2$, and that $\mathrm{cov}_B(Z^B_G(t), X^B(t)) \ge 0$ and is strictly bounded away from 1 when $t \sim O(n)$, (a) can be shown with some arithmetic arguments.

Let $\sigma_0 := \sqrt{\mathrm{Var}_B\big(\sum_{k=1}^{K} (a_k Z^B_G(t_k) + b_k X^B(t_k))\big)}$; then $\sigma_0 \sim O(1)$ for $\sum_k (a_k^2 + b_k^2) \ne 0$.

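We prove (b) using Stein’s method. In particular, the version of Stein’s method stated in Appendix A.1 is used. We adopt the same notation with the index set $\mathcal{J} = \{G, 1, \ldots, n\}$. Let

$$ \xi_{e,k} = \frac{I_{g_{\pi(e_-)}(t_k) \ne g_{\pi(e_+)}(t_k)} - p^B_1(t_k)}{\sigma_B(t_k)}. $$

Since $I_{g_{\pi(e_-)}(t_k) \ne g_{\pi(e_+)}(t_k)} \in \{0, 1\}$ and $p^B_1(t_k) \in (0, 1)$, we have

$$ |\xi_{e,k}| \le \frac{1}{\sigma_B(t_k)}. $$

Let

$$ \xi_{i,k} = \frac{I_{\pi(i) \le t_k} - u_k}{\sqrt{nu_k(1 - u_k)}}. $$

Similarly, we have

$$ |\xi_{i,k}| \le \frac{1}{\sqrt{nu_k(1 - u_k)}}. $$

Let $\xi_e = \sum_k a_k \xi_{e,k}/\sigma_0$ and $\xi_i = \sum_k b_k \xi_{i,k}/\sigma_0$; then $W = \sum_{j \in \mathcal{J}} \xi_j = \sum_k (a_k Z^B_G(t_k) + b_k X^B(t_k))/\sigma_0$, $\mathrm{E}_B(W) = 0$ and $\mathrm{E}_B(W^2) = 1$. Let $a = \max(\max_k |a_k|, \max_k |b_k|)$ and $\sigma = \min(\min_k \sigma_B(t_k), \min_k \sqrt{nu_k(1 - u_k)})$; then σ is at least of order $\sqrt{n}$, and

$$ |\xi_j| \le \frac{aK}{\sigma\sigma_0}, \quad \forall j \in \mathcal{J}. $$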

For e ∈ G, let

$$ S_e = \{A_e, e_-, e_+\}, \quad T_e = B_e \cup \{\text{nodes in } A_e\}, $$

where $A_e$ and $B_e$ are defined in (4.3) and (4.4). Then $S_e$ and $T_e$ satisfy Assumption A.1.1.

For i = 1, . . . , n, let

$$ S_i = G_i, \quad T_i = G_{i,2} \cup \{\text{nodes in } G_i\}, $$

where $G_{i,2}$ is the subgraph of G including all edges that connect to $G_i$. Then $S_i$ and $T_i$ satisfy Assumption A.1.1.

We have $|S_e| = |A_e| + 2$, $|T_e| = |B_e| + |A_e| + 1$, $|S_i| = |G_i|$, $|T_i| = |G_{i,2}| + |G_i| + 1$.

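By Theorem A.1.1, we have $\sup_{h \in \mathrm{Lip}(1)} |\mathrm{E}h(W) - \mathrm{E}h(Z)| \le \delta$ for $Z \sim N(0, 1)$, where

$$ \begin{aligned} \delta &= 2 \sum_{j \in \mathcal{J}} \big(\mathrm{E}|\xi_j \eta_j \theta_j| + |\mathrm{E}(\xi_j \eta_j)|\, \mathrm{E}|\theta_j|\big) + \sum_{j \in \mathcal{J}} \mathrm{E}|\xi_j \eta_j^2| \\ &= 2 \sum_{e \in G} \big(\mathrm{E}|\xi_e \eta_e \theta_e| + |\mathrm{E}(\xi_e \eta_e)|\, \mathrm{E}|\theta_e|\big) + \sum_{e \in G} \mathrm{E}|\xi_e \eta_e^2| \\ &\quad + 2 \sum_{i=1}^{n} \big(\mathrm{E}|\xi_i \eta_i \theta_i| + |\mathrm{E}(\xi_i \eta_i)|\, \mathrm{E}|\theta_i|\big) + \sum_{i=1}^{n} \mathrm{E}|\xi_i \eta_i^2| \\ &\le \frac{a^3 K^3}{\sigma^3 \sigma_0^3} \left( \sum_{e \in G} 5(|A_e| + 2)(|B_e| + |A_e| + 1) + \sum_{i=1}^{n} 5|G_i|(|G_{i,2}| + |G_i| + 1) \right) \\ &\le \frac{a^3 K^3}{\sigma^3 \sigma_0^3} \left( 45 \sum_{e \in G} |A_e||B_e| + 15 \sum_{i=1}^{n} |G_i||G_{i,2}| \right). \end{aligned} $$

Since σ is at least of order $\sqrt{n}$, $\sigma_0 \sim O(1)$ and $|G| \sim O(n)$, when $\sum_{e \in G} |A_e||B_e| \sim o(n^{3/2})$ and $\sum_{i=1}^{n} |G_i||G_{i,2}| \sim o(n^{3/2})$, we have δ → 0 as n → ∞.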

Also observe that if e = (i, j), then $G_i, G_j \subseteq A_e$ and $G_{i,2}, G_{j,2} \subseteq B_e$. For each node i, we can pick an arbitrary edge e that connects to i, and we have $|G_i||G_{i,2}| \le |A_e||B_e|$. Each edge in the graph can be picked at most twice since an edge connects two nodes; therefore,

$$ \sum_{i=1}^{n} |G_i||G_{i,2}| \le 2 \sum_{e \in G} |A_e||B_e|. $$

So $\sum_{e \in G} |A_e||B_e| \sim o(n^{3/2})$ ensures $\sum_{i=1}^{n} |G_i||G_{i,2}| \sim o(n^{3/2})$.

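Proof for Lemma B.1.2. Let $u = \lim_{n \to \infty} t/n$; then

$$ \lim_{n \to \infty} p_1(t) = \lim_{n \to \infty} p^B_1(t) = 2u(1 - u), $$

$$ \lim_{n \to \infty} p_2(t) = \lim_{n \to \infty} p^B_2(t) = 4u^2(1 - u)^2, $$

$$ \lim_{n \to \infty} \mathrm{Var}(R_G(t)) = \lim_{n \to \infty} \mathrm{Var}_B(R_G(t)) = 4u^2(1 - u)^2|G| + u(1 - u)(1 - 2u)^2 \sum_i |G_i|^2. $$

So

$$ \frac{\mathrm{Var}_B(R_G(t))}{\mathrm{Var}(R_G(t))} \to 1. $$

Since

$$ \mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t)) = (p^B_1(t) - p_1(t))|G| = -\frac{2t(n - t)}{n^3}|G|, $$

we have

$$ \begin{aligned} \lim_{n \to \infty} \frac{\mathrm{E}_B(R_G(t)) - \mathrm{E}(R_G(t))}{\sqrt{\mathrm{Var}_B(R_G(t))}} &= -\lim_{n \to \infty} \frac{2u(1 - u)|G|/n}{\sqrt{4u^2(1 - u)^2|G| + u(1 - u)(1 - 2u)^2 \sum_i |G_i|^2}} \\ &= -\lim_{n \to \infty} \frac{2u(1 - u)}{\sqrt{4u^2(1 - u)^2 n^2/|G| + u(1 - u)(1 - 2u)^2 n^2 \sum_i |G_i|^2/|G|^2}}, \end{aligned} $$

which is 0 when $|G| \sim o(n^2)$.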


B.2 Effect of Skewness

To gain a better understanding of the role of skewness, we explore the following

quantities involved in the p-value approximations:

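• $\gamma_G(t) \triangleq \mathrm{E}[Z_G^3(t)]$,

• $\theta_{b,G}(t) \triangleq \big(-1 + \sqrt{1 + 2\gamma_G(t) b}\big)/\gamma_G(t)$,

• $S_G(t) \triangleq \dfrac{1}{\sqrt{1 + \gamma_G(t)\theta_{b,G}(t)}} \exp\left( \dfrac{1}{2}(b - \theta_{b,G}(t))^2 + \dfrac{\gamma_G(t)\theta_{b,G}(t)^3}{6} \right)$.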

Figure B.1 shows the three quantities versus t for the single change-point scan statistic

on a MDP graph when n = 1000, b = 3. Since the structure of MDP is always the

same and does not depend on the distribution of yi, Figure B.1 is representative of all

MDP graphs with n = 1000 subjects and threshold b = 3. We can see from the figure

that γ is always larger than 0, indicating right skewness. When γ = 0, θb = b; when

γ > 0, θb < b. When ZG(t) is right-skewed, the analytic approximation of the p-value

assuming Gaussianity is smaller than the actual p-value, so the skewness correction

should increase the p-value approximation. This is indeed true as SG(t) is U-shaped

with a minimum of 1.
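The behavior described here can be reproduced numerically from the definitions of θb and S. The following is a minimal sketch that treats γ as a given number rather than computing it from a graph; the function names are illustrative, and `None` is returned when the quantity is not real-valued.

```python
import math


def theta_b(gamma, b):
    """theta_b for skewness gamma and threshold b; None if not real-valued."""
    if gamma == 0.0:
        return b                                # limiting value as gamma -> 0
    disc = 1.0 + 2.0 * gamma * b
    if disc < 0.0:
        return None                             # gamma < -1/(2b): no real solution
    return (-1.0 + math.sqrt(disc)) / gamma


def skewness_factor(gamma, b):
    """The correction factor S for skewness gamma and threshold b."""
    theta = theta_b(gamma, b)
    if theta is None:
        return None
    denom = 1.0 + gamma * theta
    if denom <= 0.0:
        return None                             # factor is not finite here
    return (math.exp(0.5 * (b - theta) ** 2 + gamma * theta ** 3 / 6.0)
            / math.sqrt(denom))
```

With b = 3, a positive skewness such as γ = 0.1 gives θb below b and a factor above 1, while γ = −0.1 gives θb above b and a factor below 1.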

Each node in the MDP has degree 1. The shapes of γG(t) and θb,G(t) for ZG(t)

computed on graphs with very low number of hubs are similar to their shapes for

ZG(t) computed on the MDP. For example, for data in low dimensions (< 5), scans

based on MST and NNG constructed based on Euclidean distance have similar skew-

ness properties as described above. However, as the dimension of the data increases,

MST and NNG constructed based on Euclidean distance tend to become dominated

by hubs, and the distribution of ZG(t) becomes left-skewed. For a left-skewed distri-

bution, γ ≤ 0, θb ≥ b, and S ≤ 1. One problem for left-skewed distributions is that if

γ is smaller than −1/(2b), the current approximation does not yield real-valued so-

lution for θb. This issue is discussed in Remark 4.4.1 and here we provide a heuristic

solution to this problem based on an extrapolation procedure.

We illustrate the procedure using an MST constructed on simulated 100-dimensional data based on Euclidean distance. From Figure B.2, we see that $\theta_{b,G}(t)$ and $S_G(t)$ are not defined except in the middle region. In this case, the integrand

$$ S_G(nu)\, h_G(n, x)\, \nu\Big(\sqrt{2 b_0^2 h_G(n, x)}\Big) $$

is directly extrapolated to the edge regions using the boundary tangent at each side. If the extrapolated value is negative, it is set to zero. Figure B.3 illustrates the integrand before and after extrapolation.
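The extrapolation step can be sketched as follows for an integrand evaluated on a grid. The tangent slope at each boundary is approximated here by a one-step finite difference, which is an assumption of this sketch; the function name and the representation of undefined values as `None` are also illustrative.

```python
def extrapolate_with_tangents(values, lo, hi):
    """Fill values[0:lo] and values[hi+1:] by extending the tangent line at each
    boundary of the defined region values[lo:hi+1]; negatives are set to zero.
    """
    out = list(values)
    left_slope = values[lo + 1] - values[lo]     # finite-difference tangent
    right_slope = values[hi] - values[hi - 1]
    for t in range(lo):
        out[t] = max(0.0, values[lo] + left_slope * (t - lo))
    for t in range(hi + 1, len(values)):
        out[t] = max(0.0, values[hi] + right_slope * (t - hi))
    return out
```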

Figure B.1: The three quantities, γG(t), θb,G(t) and SG(t) from left to right, for an MDP graph. n = 1000, b = 3.

Figure B.2: The three quantities, γG(t), θb,G(t) and SG(t) from left to right, for an MST graph constructed using Euclidean distance on a sequence of n = 1000 observations drawn iid from N(0, I_100). b = 3.

Figure B.3: The integrand before (left) and after (right) extrapolation. The integrand can only be directly calculated in the middle part (t ∈ [248, 752]), and the outer part is obtained by extending using the boundary tangent.

Appendix C

Supporting Materials for Part II

C.1 Computation Issues for RaMST and RuMST

The main tasks in computing RaMST and RuMST are to enumerate all MSTs on categories
for RaMST and to list the edges in $M_0^*$ for RuMST. Other tasks can be finished in O(K)
time.

Let G be the complete graph on the K categories. |G| = K(K − 1)/2. Eppstein

[1995] proposed a graph operation called the sliding transformation which, when ap-

plied to G, produces an equivalent graph such that the MSTs on categories correspond

one-for-one with the spanning trees of the equivalent graph. The enumeration of all

spanning trees, without having to optimize for total distance, is relatively straight-

forward. Thus, we adopted the following computational approach: Use Eppstein’s

method to construct the equivalent graph of G, enumerate all spanning trees of the

equivalent graph, then transform back to get the set of MSTs on G. The sliding
transformation constructs the equivalent graph in $O(|G| + K \log K) = O(K^2)$ time. To
perform the sliding transformation, an initial MST is needed. Prim's algorithm can
be used to obtain the initial MST; it needs $O(K^2)$ time and so does not increase the time
complexity. The theoretical justification of this algorithm can be found in Eppstein
[1995] and in Section C.1.1, which completes many of the proofs of Eppstein [1995].

After removing any loops formed during the sliding transformations, each
remaining edge appears in at least one spanning tree of the equivalent graph, and thus
in at least one MST on G. We now have the list of edges in the uMST on G,
and thus RuMST can be calculated in $O(K^2)$ time.
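As an independent cross-check of the edge list of the uMST, one can avoid the sliding transformation altogether and use the standard cycle property instead: an edge (u, v) lies in some MST if and only if its weight equals the largest weight on the path between u and v in an initial MST. The sketch below is not Eppstein's method; it is a swapped-in technique, with a hypothetical function name, that runs Prim's algorithm on a K × K distance matrix and records the path maxima as vertices are added, for $O(K^2)$ time overall. Exact equality of distances is assumed (for floating-point distances a tolerance would be needed).

```python
def union_of_msts(d):
    """Return the set of edges (u, v), u < v, that belong to at least one MST.

    d : symmetric K x K matrix (list of lists) of pairwise distances.
    """
    K = len(d)
    in_tree = [False] * K
    best = [float("inf")] * K          # cheapest known connection to the tree
    parent = [-1] * K
    maxw = [[0] * K for _ in range(K)]  # largest weight on the tree path x..v
    order = []                          # vertices already in the tree
    best[0] = 0
    for _ in range(K):
        v = min((x for x in range(K) if not in_tree[x]), key=lambda x: best[x])
        in_tree[v] = True
        p = parent[v]
        if p != -1:
            # The tree path from x to v goes through p, then uses edge (p, v).
            for x in order:
                maxw[x][v] = maxw[v][x] = max(maxw[x][p], d[p][v])
        order.append(v)
        for x in range(K):
            if not in_tree[x] and d[v][x] < best[x]:
                best[x] = d[v][x]
                parent[x] = v
    return {(u, v) for u in range(K) for v in range(u + 1, K)
            if d[u][v] == maxw[u][v]}


tied = union_of_msts([[0, 1, 1], [1, 0, 1], [1, 1, 0]])      # every edge is in some MST
unique = union_of_msts([[0, 1, 3], [1, 0, 2], [3, 2, 0]])    # a single MST
```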

For enumerating all spanning trees of the equivalent graph, the algorithm proposed
by Shioura and Tamura [1995] is used, which requires $O(K + |G| + M) = O(K^2 + M)$
computation time. This was proven to be optimal in time complexity. Shioura and
Tamura's algorithm starts from a spanning tree formed by depth-first search, then
replaces one edge at a time using cycle structures in the graph, traversing the space
of all spanning trees of the graph. Hence, computing RaMST takes $O(K^2 + M)$ time.

C.1.1 Theoretical Justifications

This section proves the four lemmas stated (but not completely proved) in Eppstein
[1995] so as to draw the conclusion that applying the sliding transformation produces an
equivalent graph such that the MSTs of the original graph correspond one-for-one with
the spanning trees of the equivalent graph. Let G be the original graph. We begin
with the definition of the sliding transformation from Eppstein [1995]:

Let edges e = (u, v) and f = (v, w) share a common vertex v, and suppose

w(e) < w(f). We define the result of sliding edge f along edge e as the

graph G′ formed by deleting f from G and inserting in its place an edge

f ′ = (u,w) with the same weight.

and the definition of equivalent graph EG:

Let T0 be some particular minimum spanning tree in G, and choose some

vertex to be the root of T0. Then we form EG by repeatedly performing

sliding operations that slide an edge f = (v, w) along an edge e = (u, v)

as long as e is in T0 and u is closer to the root of T0 than is v.

We next state the four lemmas in Eppstein [1995] and give their proofs.

Lemma C.1.1. Let G′ be formed from G by any sequence of sliding operations.

Then each set of edges giving a minimum spanning tree of G corresponds to a unique

minimum spanning tree in G′ and vice versa.

Proof. We only need to show this for one sliding operation. Let G be the graph before

the sliding and G′ be the graph after sliding. This sliding operation is performed on

edge f as described in the definition of the sliding transformation.


Let $J_m$ be the set of MSTs of $G$, and define

\[ J_m^0 = \{T \in J_m : f \notin T\}, \qquad J_m^f = \{T \in J_m : f \in T,\ e \notin T\}, \qquad J_m^{ef} = \{T \in J_m : e, f \in T\}. \]

Let $J_n$ be the set of spanning trees of $G$, with $J_n^0$, $J_n^f$, $J_n^{ef}$ defined similarly.
$J'_m, J'^{0}_m, J'^{f}_m, J'^{ef}_m, J'_n, J'^{0}_n, J'^{f}_n, J'^{ef}_n$ are the counterparts for $G'$, with $f$ replaced
by $f'$ in their definitions. What we need to show is that there is a one-for-one
correspondence between the elements of $J_m$ and $J'_m$. Observe that $\{J_m^0, J_m^f, J_m^{ef}\}$ is a
partition of $J_m$, and $\{J'^{0}_m, J'^{f}_m, J'^{ef}_m\}$ is a partition of $J'_m$. We next show the
one-for-one correspondence for each of the three subsets.

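Observe that $J_n^0 = J'^{0}_n$, $J_m^0 \subseteq J_n^0$, and $J'^{0}_m \subseteq J'^{0}_n$. Therefore, for all $T_m^0 \in J_m^0$ and $T'^{0}_n \in J'^{0}_n$,

\[ \sum_{i \in T_m^0} w(i) \le \sum_{i \in T'^{0}_n} w(i). \]

Hence $J'^{0}_m \subseteq J_m^0$. By a similar argument, we have $J_m^0 \subseteq J'^{0}_m$. Hence, $J_m^0 = J'^{0}_m$.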

For any $T_m^f \in J_m^f$, since $f = (v, w) \in T_m^f$ and $e = (u, v) \notin T_m^f$, there are two
possible ways in which $u, v, w$ are connected in $T_m^f$:

(1) $-u - \cdots - v - w - \cdots$

(2) $-u - \cdots - w - v - \cdots$

In the second scenario, deleting $f = (v, w)$ and adding $e = (u, v)$ still gives
a tree, but one with a smaller weight sum, which contradicts the minimality of $T_m^f$
(doing the same in the first scenario does not give a tree any more). So $T_m^f$ can only
be of the first form. Let $\tilde T_m^f$ be the graph obtained from $T_m^f$ by deleting $f = (v, w)$ and
adding $f' = (u, w)$. Since $T_m^f$ is of the first form, $\tilde T_m^f$ is still a tree. Since $w(f') = w(f)$, we have

\[ \sum_{i \in \tilde T_m^f} w(i) = \sum_{i \in T_m^f} w(i). \tag{C.1} \]

For any $T'^{f}_m \in J'^{f}_m$, since $f' = (u, w) \in T'^{f}_m$ and $e = (u, v) \notin T'^{f}_m$, there are two
possible ways in which $u, v, w$ are connected in $T'^{f}_m$:

(1) $-v - \cdots - u - w - \cdots$

(2) $-v - \cdots - w - u - \cdots$

In the second scenario, deleting $f' = (u, w)$ and adding $e = (u, v)$ still gives
a tree, but one with a smaller weight sum (doing the same in the first scenario does not
give a tree any more), so $T'^{f}_m$ can only be of the first form. Let $\tilde T'^{f}_m$ be the graph
obtained from $T'^{f}_m$ by deleting $f' = (u, w)$ and adding $f = (v, w)$. Since $T'^{f}_m$ is of the first
form, $\tilde T'^{f}_m$ is still a tree. Since $w(f') = w(f)$, we have

\[ \sum_{i \in \tilde T'^{f}_m} w(i) = \sum_{i \in T'^{f}_m} w(i). \tag{C.2} \]

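Let $\tilde J_m^f$ be the set of trees $\tilde T_m^f$ resulting from $T_m^f \in J_m^f$ by deleting $f$ and adding
$f'$, and let $\tilde J'^{f}_m$ be the set of trees $\tilde T'^{f}_m$ resulting from $T'^{f}_m \in J'^{f}_m$ by deleting $f'$ and adding
$f$. It is easy to observe that $\tilde J_m^f \subseteq J'^{f}_n$ and $\tilde J'^{f}_m \subseteq J_n^f$. Hence

\[ \sum_{i \in T'^{f}_m} w(i) \le \sum_{i \in \tilde T_m^f} w(i), \tag{C.3} \]

\[ \sum_{i \in T_m^f} w(i) \le \sum_{i \in \tilde T'^{f}_m} w(i). \tag{C.4} \]

(C.1), (C.2), (C.3), and (C.4) lead to

\[ \sum_{i \in T_m^f} w(i) = \sum_{i \in \tilde T_m^f} w(i) = \sum_{i \in T'^{f}_m} w(i) = \sum_{i \in \tilde T'^{f}_m} w(i). \]

Therefore $\tilde J'^{f}_m \subseteq J_m^f$ and $\tilde J_m^f \subseteq J'^{f}_m$. Since there is a one-for-one correspondence between
$J'^{f}_m$ and $\tilde J'^{f}_m$, and a one-for-one correspondence between $J_m^f$ and $\tilde J_m^f$, we have $\tilde J'^{f}_m = J_m^f$ and
$\tilde J_m^f = J'^{f}_m$. Hence, there is a one-for-one correspondence between $J_m^f$ and $J'^{f}_m$.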

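For any $T_m^{ef} \in J_m^{ef}$, let $\tilde T_m^{ef}$ be the graph obtained from $T_m^{ef}$ by deleting $f$ and
adding $f'$. Then $\tilde T_m^{ef}$ is still a tree, and

\[ \sum_{i \in \tilde T_m^{ef}} w(i) = \sum_{i \in T_m^{ef}} w(i). \tag{C.5} \]

Let $\tilde J_m^{ef}$ be the set of the trees $\tilde T_m^{ef}$. Since $e, f' \in \tilde T_m^{ef}$, we have $\tilde J_m^{ef} \subseteq J'^{ef}_n$. Thus,
for any $T'^{ef}_m \in J'^{ef}_m$, we have

\[ \sum_{i \in T'^{ef}_m} w(i) \le \sum_{i \in \tilde T_m^{ef}} w(i). \tag{C.6} \]

For any $T'^{ef}_m \in J'^{ef}_m$, let $\tilde T'^{ef}_m$ be the graph obtained from $T'^{ef}_m$ by deleting $f'$ and
adding $f$. Then $\tilde T'^{ef}_m$ is still a tree, and

\[ \sum_{i \in \tilde T'^{ef}_m} w(i) = \sum_{i \in T'^{ef}_m} w(i). \tag{C.7} \]

Let $\tilde J'^{ef}_m$ be the set of the trees $\tilde T'^{ef}_m$; then $\tilde J'^{ef}_m \subseteq J_n^{ef}$. For any $T_m^{ef} \in J_m^{ef}$, we have

\[ \sum_{i \in T_m^{ef}} w(i) \le \sum_{i \in \tilde T'^{ef}_m} w(i). \tag{C.8} \]

(C.5), (C.6), (C.7), and (C.8) lead to

\[ \sum_{i \in T_m^{ef}} w(i) = \sum_{i \in \tilde T_m^{ef}} w(i) = \sum_{i \in T'^{ef}_m} w(i) = \sum_{i \in \tilde T'^{ef}_m} w(i). \]

Therefore, $\tilde J_m^{ef} \subseteq J'^{ef}_m$ and $\tilde J'^{ef}_m \subseteq J_m^{ef}$. Since there is a one-for-one correspondence
between $J'^{ef}_m$ and $\tilde J'^{ef}_m$, and a one-for-one correspondence between $J_m^{ef}$ and $\tilde J_m^{ef}$, we have
$\tilde J'^{ef}_m = J_m^{ef}$ and $\tilde J_m^{ef} = J'^{ef}_m$. Hence, there is a one-for-one correspondence between $J_m^{ef}$ and $J'^{ef}_m$.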
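Lemma C.1.1 can be sanity-checked by brute force on a small graph. The sketch below is an illustration only, not the enumeration algorithm used in this work: it represents a (multi)graph as a list of (endpoint, endpoint, weight) triples, identifies an edge by its position in the list, enumerates all spanning trees by trying every subset of n − 1 edges, and compares the MSTs before and after one sliding operation. All function names are hypothetical.

```python
from itertools import combinations


def is_spanning_tree(n, edges, idx):
    """True if the n-1 edges indexed by idx are acyclic (hence span n vertices)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in idx:
        a, b, _ = edges[i]
        ra, rb = find(a), find(b)
        if ra == rb:          # would close a cycle (also catches loops)
            return False
        parent[ra] = rb
    return True


def all_msts(n, edges):
    """Set of MSTs, each given as a tuple of edge indices."""
    trees = [idx for idx in combinations(range(len(edges)), n - 1)
             if is_spanning_tree(n, edges, idx)]
    weight = lambda t: sum(edges[i][2] for i in t)
    best = min(weight(t) for t in trees)
    return {t for t in trees if weight(t) == best}


def slide(edges, f, e):
    """Slide edge f = (v, w) along edge e = (u, v), where w(e) < w(f)."""
    (a, b, we), (c, d, wf) = edges[e], edges[f]
    assert we < wf
    shared = {a, b} & {c, d}
    assert len(shared) == 1, "e and f must share exactly one vertex"
    v = shared.pop()
    u = b if a == v else a
    w = d if c == v else c
    out = list(edges)
    out[f] = (u, w, wf)       # f' = (u, w) keeps the weight of f
    return out


# e = edge 0 = (0, 1) with weight 1, f = edge 1 = (1, 2) with weight 2.
G = [(0, 1, 1), (1, 2, 2), (0, 2, 2), (2, 3, 3), (1, 3, 3)]
G_slid = slide(G, f=1, e=0)
before, after = all_msts(4, G), all_msts(4, G_slid)
```

In this example the same four sets of edge indices are minimum spanning trees in both graphs, as the lemma states.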

Lemma C.1.2. If we are given a graph G and a rooted minimum spanning tree T0,

then the graph EG described above is well-defined and does not depend on the order

in which sliding transformations are performed.

Proof. Let the root of $T_0$ be $O$. For simplicity, we use the same letter to denote
an edge before and after a sliding transformation, although one of the nodes of the
edge is changed. The set of edges forming $T_0$ still forms a tree after a sliding
transformation, and by Lemma C.1.1, this tree is still a minimum spanning tree in the resulting
graph. For simplicity, we still call this tree $T_0$. Also, we call the type of sliding
transformation performed in the definition of $EG$ a $T_0$-sliding transformation. Observe
that if an edge $e = (e^-, e^+)$ is in $T_0$, with $e^-$ closer to $O$, then no $T_0$-sliding
transformation changes $e^+$. That is, $e^-$ can be changed to some other node,
while $e^+$ is always the same node. Therefore, we can view $e^+$ as fixed for any $e$ in $T_0$.
Let $g = (u, w)$ be an edge in the original graph $G$; we discuss its destination in
$EG$.

Case I: $g \in T_0$.

Without loss of generality, let $u$ be closer to $O$ than $w$ is. In $T_0$, let the edges
connecting $O$ to $u$ be $e_1, \ldots, e_n$, with $e_i = (e_i^-, e_i^+)$ such that $e_i^-$ is the endpoint closer to
$O$. So $e_1^- = O$ and $e_n^+ = u$. Let

\[ m_g = \max\{k \ge 0 : w(e_k) \ge w(g)\}. \]

If $w(e_k) < w(g)$ for all $k = 1, \ldots, n$, then $m_g = 0$. Then, no matter which order of sliding
transformations is used, $g$ will connect $(e^+_{m_g}, w)$ in $EG$ (with $e_0^+ \triangleq O$). This is true based
on the following facts:

i) A $T_0$-sliding transformation of any edge other than $g$ and $e_i$, $i = 1, \ldots, n$, does not
change the path connecting $O$ to $u$.

ii) A $T_0$-sliding transformation of $e_i$, $i = 1, \ldots, n$, will not move $e^+_{m_g}$ further to $O$ than
$g^-$ is.

iii) $T_0$-sliding transformations of $g$ will move $g^-$ to $e^+_{m_g}$ eventually.

Since $e^+_{m_g}$ and $g^+ = w$ are fixed in any $T_0$-sliding transformation, the position of $g$ in
$EG$ is the same for any order of $T_0$-sliding transformations.

Case II: $g \notin T_0$. In $T_0$, let the edges connecting $O$ to $u$ be $e_1, \ldots, e_n$, and the edges
connecting $O$ to $w$ be $f_1, \ldots, f_m$. The sequences $e_1, \ldots, e_n$ and $f_1, \ldots, f_m$ may overlap.
Let $m_{gu}$ and $m_{gw}$ be defined similarly to $m_g$:

\[ m_{gu} = \max\{k \ge 0 : w(e_k) \ge w(g)\}, \]

\[ m_{gw} = \max\{k \ge 0 : w(f_k) \ge w(g)\}. \]

Then, by a similar argument as above, no matter which order of sliding transformations
is used, $g$ will connect $e^+_{m_{gu}}$ and $f^+_{m_{gw}}$ in $EG$. Since $e^+_{m_{gu}}$ and $f^+_{m_{gw}}$ are fixed in any
$T_0$-sliding transformation, the position of $g$ in $EG$ is the same for any order of $T_0$-sliding
transformations.

Lemma C.1.3. Any tree T in EG is minimum iff for each w, n(w, T ) = n(w, T0).

(n(w, T ) is the number of edges in T having weight w.)

Proof. The sufficiency of the condition is obvious: if the condition holds, then
$\sum_{i \in T} w(i) = \sum_{i \in T_0} w(i)$, so $T$ is minimum by definition. The necessity of the
condition is proved in the following stronger lemma.

Lemma C.1.4. For any w and any tree T in EG, n(w, T ) = n(w, T0).

Proof. Assume the edges in $T_0$ have weights $w_1, \ldots, w_m$ with $w_1 > w_2 > \cdots > w_m$ and
$n(w_i, T_0) = n_i$, $i = 1, \ldots, m$. Any edge with weight $w' > w_1$ is not
in $T_0$. According to the proof of Lemma C.1.2, since its weight is larger than that of any edge
in $T_0$, such an edge connects $O$ and $O$ in $EG$, i.e., it forms a loop at $O$. Therefore,
this edge does not appear in any tree in $EG$.

To remove ambiguity, let $T_0^{EG}$ be the tree in $EG$ consisting of all edges in
$T_0$. By the proof of Lemma C.1.2, any edge of weight $w_1$ will be moved by $T_0$-sliding
transformations until it either runs into other edges of the same weight or reaches
$O$. Therefore, the edges of weight $w_1$ in $T_0^{EG}$ also form a tree containing the root $O$ and $n_1$
other nodes. We call this subtree $T_0^{(1)}$. By a similar argument, the edges with weights $w_1$
and $w_2$ in $T_0^{EG}$ form a subtree, which we call $T_0^{(2)}$. This can be continued, and
$T_0^{(m)}$ is $T_0^{EG}$.

Let the nodes other than $O$ in $T_0^{(1)}$ be $v_{11}, \ldots, v_{1n_1}$, the nodes in $T_0^{(2)}$ but not $T_0^{(1)}$
be $v_{21}, \ldots, v_{2n_2}$, and so on. We denote the set of nodes in $T_0^{(i)}$ but not $T_0^{(i-1)}$ by
$V^{(i)} = \{v_{i1}, \ldots, v_{in_i}\}$, $i = 1, \ldots, m$. For completeness, we let $T_0^{(0)}$ be the node $O$
and define $V^{(0)} = \{O\}$.

For any edge $g = (u, w)$ in $EG$, there exist $i, j \in \{0, 1, \ldots, m\}$ such that $u \in V^{(i)}$ and
$w \in V^{(j)}$ ($i$ and $j$ can be the same). Then $w(g) \le w_{i \vee j}$, because otherwise a $T_0$-sliding
transformation could be further performed on this edge.

For any tree $T$ in $EG$, let $E_{T,i} = \{e \in T : e^- \in V^{(0)} \cup V^{(1)} \cup \cdots \cup V^{(i)},\ e^+ \in V^{(i)}\}$.
Then any edge in $T$ belongs to one of the $E_{T,i}$'s, $i = 1, \ldots, m$. Let $n_{T,i} = |E_{T,i}|$. Since
$T$ is a tree, we have

\[ \sum_{i=1}^{m} n_{T,i} = \sum_{i=1}^{m} n_i. \tag{C.9} \]

Also, since $T$ is a tree, we have

\[ \sum_{i=1}^{k} n_{T,i} \le \left|\{O\} \cup V^{(1)} \cup \cdots \cup V^{(k)}\right| - 1 = \sum_{i=1}^{k} n_i. \tag{C.10} \]

Since any edge in $E_{T,i}$ has weight no larger than $w_i$, we have

\[ \sum_{i \in T} w(i) \le \sum_{i=1}^{m} n_{T,i} w_i. \tag{C.11} \]

By the proof of Lemma C.1.1, we know that $T_0^{EG}$ is a minimum spanning tree in $EG$.
So

\[ \sum_{i=1}^{m} n_i w_i \le \sum_{i \in T} w(i). \tag{C.12} \]

(C.11) and (C.12) give that

\[ \sum_{i=1}^{m} n_i w_i \le \sum_{i=1}^{m} n_{T,i} w_i. \tag{C.13} \]

Together with (C.10) and $w_1 > w_2 > \cdots > w_m$, we have

\[ n_i = n_{T,i}, \quad i = 1, \ldots, m, \]

and every edge in $E_{T,i}$ has weight $w_i$.


C.2 Proofs for Lemmas and Theorems in Permutation Distributions

C.2.1 Proof of Lemma 9.1.1

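Proof. Define

\[ R_A = \sum_{u=1}^{K} \frac{1}{m_u} \sum_{i,j \in C_u} I_{g_i \ne g_j}, \]

and

\[ R_B = \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \sum_{i \in C_u,\, j \in C_v} I_{g_i \ne g_j}. \]

We have

\[ E_P[R_{C_0}] = E_P[R_A] + E_P[R_B] = \sum_{u=1}^{K} \frac{1}{m_u} \sum_{i,j \in C_u} P_P(g_i \ne g_j) + \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \sum_{i \in C_u,\, j \in C_v} P_P(g_i \ne g_j). \]

Since

\[ P_P(g_i \ne g_j) = \begin{cases} 0 & \text{if } i = j, \\ \dfrac{2 n_a n_b}{N(N-1)} & \text{if } i \ne j, \end{cases} \]

we have

\[ E_P[R_{C_0}] = \sum_{u=1}^{K} \frac{1}{m_u}\, m_u(m_u - 1)\, \frac{2 n_a n_b}{N(N-1)} + \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}\, m_u m_v\, \frac{2 n_a n_b}{N(N-1)} = (N - K + |C_0|)\, \frac{2 n_a n_b}{N(N-1)}. \]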

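Now, to compute the second moment, first note that

\[ E_P[R_{C_0}^2] = E_P[R_A^2] + E_P[R_B^2] + 2 E_P[R_A R_B]. \]

Expanding the right-hand side of the above,

\[ E_P[R_A^2] = \sum_{u,v=1}^{K} \frac{1}{m_u m_v} \sum_{i,j \in C_u,\ k,l \in C_v} P_P(g_i \ne g_j,\, g_k \ne g_l), \]

\[ E_P[R_B^2] = \sum_{(u,v) \in C_0} \frac{1}{m_u^2 m_v^2} \sum_{i,k \in C_u,\ j,l \in C_v} P_P(g_i \ne g_j,\, g_k \ne g_l) + 2 \sum_{\{(u,v),(w,y)\} \subset C_0} \frac{1}{m_u m_v m_w m_y} \sum_{i \in C_u,\, j \in C_v,\, k \in C_w,\, l \in C_y} P_P(g_i \ne g_j,\, g_k \ne g_l), \]

\[ E_P[R_A R_B] = \sum_{u=1}^{K} \sum_{(v,w) \in C_0} \frac{1}{m_u m_v m_w} \sum_{i,j \in C_u,\, k \in C_v,\, l \in C_w} P_P(g_i \ne g_j,\, g_k \ne g_l). \]

Since

\[ P_P(g_i \ne g_j,\, g_k \ne g_l) = \begin{cases}
0 & \text{if } i = j \text{ and/or } k = l, \\[4pt]
\dfrac{2 n_a n_b}{N(N-1)} = 2p_1 & \text{if } \{i = k,\ j = l,\ i \ne j\} \text{ or } \{i = l,\ j = k,\ i \ne j\}, \\[8pt]
\dfrac{n_a n_b}{N(N-1)} = p_1 & \text{if } \{i = k,\ j \ne i, l\},\ \{i = l,\ j \ne i, k\},\ \{j = k,\ i \ne j, l\}, \text{ or } \{j = l,\ i \ne j, k\}, \\[8pt]
\dfrac{4 n_a (n_a - 1) n_b (n_b - 1)}{N(N-1)(N-2)(N-3)} = p_2 & \text{if } i, j, k, l \text{ are all different},
\end{cases} \]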

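we have

\[ \begin{aligned}
E_P[R_A^2] &= \sum_{u=1}^{K} \frac{1}{m_u^2} \sum_{i,j,k,l \in C_u} P_P(g_i \ne g_j,\, g_k \ne g_l) + \sum_{u=1}^{K} \sum_{v \ne u} \frac{1}{m_u m_v} \sum_{i,j \in C_u,\ k,l \in C_v} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&= \sum_{u=1}^{K} \frac{1}{m_u^2} \left[ 2 m_u (m_u - 1)(2p_1) + 4 m_u (m_u - 1)(m_u - 2) p_1 \right] \\
&\quad + \sum_{u=1}^{K} \frac{1}{m_u^2} \left[ m_u (m_u - 1)(m_u - 2)(m_u - 3) p_2 \right] + \sum_{u=1}^{K} \sum_{v \ne u} \frac{1}{m_u m_v}\, m_u (m_u - 1) m_v (m_v - 1) p_2 \\
&= 4 \left( N - 2K + \sum_{u=1}^{K} \frac{1}{m_u} \right) p_1 + (N - K - 4)(N - K) p_2 + 6 \left( K - \sum_{u=1}^{K} \frac{1}{m_u} \right) p_2,
\end{aligned} \]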

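\[ \begin{aligned}
E_P[R_B^2] &= \sum_{(u,v) \in C_0} \frac{1}{m_u^2 m_v^2} \sum_{i,k \in C_u,\ j,l \in C_v} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&\quad + \sum_{(u,v),(u,w) \in C_0,\ v \ne w} \frac{1}{m_u^2 m_v m_w} \sum_{i,k \in C_u,\, j \in C_v,\, l \in C_w} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&\quad + \sum_{\substack{(u,v),(w,y) \in C_0 \\ u, v, w, y \text{ all different}}} \frac{1}{m_u m_v m_w m_y} \sum_{\substack{i \in C_u,\, j \in C_v \\ k \in C_w,\, l \in C_y}} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&= \sum_{(u,v) \in C_0} \frac{1}{m_u^2 m_v^2} \left[ m_u m_v (2p_1) + m_u m_v (m_u + m_v - 2) p_1 \right] + \sum_{(u,v) \in C_0} \frac{1}{m_u^2 m_v^2} \left[ m_u (m_u - 1) m_v (m_v - 1) p_2 \right] \\
&\quad + \sum_{(u,v),(u,w) \in C_0,\ v \ne w} \frac{1}{m_u^2 m_v m_w} \left[ m_u m_v m_w p_1 + m_u (m_u - 1) m_v m_w p_2 \right] \\
&\quad + \sum_{\substack{(u,v),(w,y) \in C_0 \\ u, v, w, y \text{ all different}}} \frac{1}{m_u m_v m_w m_y}\, m_u m_v m_w m_y\, p_2 \\
&= \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \left[ (m_u + m_v) p_1 + (m_u - 1)(m_v - 1) p_2 \right] + \sum_{(u,v),(u,w) \in C_0,\ v \ne w} \frac{1}{m_u} \left[ p_1 + (m_u - 1) p_2 \right] \\
&\quad + 2 \left| \{ \{(u,v),(w,y)\} \subset C_0 : u, v, w, y \text{ all different} \} \right| p_2 \\
&= \sum_{u=1}^{K} \frac{|E_u^{C_0}|^2}{m_u} (p_1 - p_2) + |C_0|^2 p_2 + \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}\, p_2,
\end{aligned} \]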

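\[ \begin{aligned}
E_P[R_A R_B] &= \sum_{u=1}^{K} \sum_{(u,v) \in E_u^{C_0}} \frac{1}{m_u^2 m_v} \sum_{i,j,k \in C_u,\ l \in C_v} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&\quad + \sum_{u=1}^{K} \sum_{(v,w) \in C_0 \setminus E_u^{C_0}} \frac{1}{m_u m_v m_w} \sum_{i,j \in C_u} \sum_{k \in C_v,\, l \in C_w} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&= \sum_{u=1}^{K} \sum_{(u,v) \in E_u^{C_0}} \frac{1}{m_u^2 m_v} \left[ 2 m_u (m_u - 1) m_v p_1 + m_u (m_u - 1)(m_u - 2) m_v p_2 \right] \\
&\quad + \sum_{u=1}^{K} \sum_{(v,w) \in C_0 \setminus E_u^{C_0}} \frac{1}{m_u m_v m_w}\, m_u (m_u - 1) m_v m_w\, p_2 \\
&= |C_0| (N - K) p_2 + 2 (p_1 - p_2) \left( 2|C_0| - \sum_{u=1}^{K} \frac{|E_u^{C_0}|}{m_u} \right).
\end{aligned} \]

$\mathrm{Var}_P[R_{C_0}]$ follows by combining the above to compute $E_P[R_{C_0}^2]$, and then subtracting
$E_P^2[R_{C_0}]$.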
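The closed forms above can be checked numerically on a toy example by enumerating every assignment of group labels. The sketch below is an illustration under stated assumptions, not part of the proof: categories are indexed 0, ..., K − 1, `m` lists the category sizes, `C0` is a list of category pairs, the inner sums over i, j ∈ C_u run over ordered pairs (so that the within-category term equals 2 n_au n_bu / m_u), and |E_u^{C_0}| is taken to be the number of edges of C_0 incident on u. The function names are hypothetical, and exact rational arithmetic is used so that equality can be tested directly. N must be at least 4 for p_2 to be defined.

```python
from fractions import Fraction
from itertools import combinations


def exact_moments(m, C0, n_a):
    """E_P of R, R_A^2, R_B^2 and R_A*R_B by enumerating all labelings."""
    K, N = len(m), sum(m)
    cat = [u for u in range(K) for _ in range(m[u])]   # subject -> category
    labelings = list(combinations(range(N), n_a))      # subjects in group a
    tot = {"R": Fraction(0), "AA": Fraction(0), "BB": Fraction(0), "AB": Fraction(0)}
    for group_a in labelings:
        na = [0] * K
        for i in group_a:
            na[cat[i]] += 1
        nb = [m[u] - na[u] for u in range(K)]
        r_a = sum(Fraction(2 * na[u] * nb[u], m[u]) for u in range(K))
        r_b = sum(Fraction(na[u] * nb[v] + na[v] * nb[u], m[u] * m[v]) for u, v in C0)
        tot["R"] += r_a + r_b
        tot["AA"] += r_a * r_a
        tot["BB"] += r_b * r_b
        tot["AB"] += r_a * r_b
    return {key: val / len(labelings) for key, val in tot.items()}


def closed_form_moments(m, C0, n_a):
    """The same four quantities from the formulas in the proof."""
    K, N, E = len(m), sum(m), len(C0)
    n_b = N - n_a
    p1 = Fraction(n_a * n_b, N * (N - 1))
    p2 = Fraction(4 * n_a * (n_a - 1) * n_b * (n_b - 1),
                  N * (N - 1) * (N - 2) * (N - 3))
    deg = [sum(u in edge for edge in C0) for u in range(K)]   # |E_u^{C0}|
    inv = sum(Fraction(1, mu) for mu in m)
    inv_edges = sum(Fraction(1, m[u] * m[v]) for u, v in C0)
    return {
        "R": (N - K + E) * 2 * p1,
        "AA": 4 * (N - 2 * K + inv) * p1 + (N - K - 4) * (N - K) * p2
              + 6 * (K - inv) * p2,
        "BB": sum(Fraction(deg[u] ** 2, m[u]) for u in range(K)) * (p1 - p2)
              + E ** 2 * p2 + inv_edges * p2,
        "AB": E * (N - K) * p2
              + 2 * (p1 - p2) * (2 * E - sum(Fraction(deg[u], m[u]) for u in range(K))),
    }


small = exact_moments([2, 2], [(0, 1)], 2)
chain = exact_moments([2, 1, 3], [(0, 1), (1, 2)], 3)
```

The variance then follows as AA + BB + 2·AB − R², exactly as in the last step of the proof.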

C.2.2 Proof of Theorem 9.1.3

To prove Theorem 9.1.3, we first prove a simpler result: Asymptotic normality of the

statistic under the bootstrap null, defined as the distribution obtained by sampling

the group labels from the observed vector of group labels with replacement. Let PB,

EB and VarB denote respectively the probability, expectation and variance under the

bootstrap null.

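Lemma C.2.1. Assuming condition 1, under the bootstrap null distribution, the
standardized statistic

\[ \frac{R_{C_0} - E_B[R_{C_0}]}{\sqrt{\mathrm{Var}_B[R_{C_0}]}} \]

converges in distribution to $N(0, 1)$ as $K \to \infty$, where $E_B[R_{C_0}]$ and $\mathrm{Var}_B[R_{C_0}]$ are
given below.

\[ E_B[R_{C_0}] = (N - K + |C_0|)\, 2p_3, \tag{C.14} \]

\[ \begin{aligned}
\mathrm{Var}_B[R_{C_0}] &= 4 (p_3 - p_4) \left( N - K + 2|C_0| + \sum_{u=1}^{K} \frac{|E_u^{C_0}|^2}{4 m_u} - \sum_{u=1}^{K} \frac{|E_u^{C_0}|}{m_u} \right) \\
&\quad + (6p_4 - 4p_3) \left( K - \sum_{u=1}^{K} \frac{1}{m_u} \right) + p_4 \sum_{(u,v) \in C_0} \frac{1}{m_u m_v},
\end{aligned} \tag{C.15} \]

where

\[ p_3 = \frac{n_a n_b}{N^2}, \qquad p_4 = \frac{4 n_a^2 n_b^2}{N^4} = 4 p_3^2. \tag{C.16} \]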

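Proof. The mean and variance of $R_{C_0}$ under the bootstrap null, (C.14) and (C.15),
can be obtained following steps similar to those in the proof of Lemma 9.1.1, noting that,
under the bootstrap null,

\[ P_B(g_i \ne g_j) = \begin{cases} 0 & \text{if } i = j, \\ \dfrac{2 n_a n_b}{N^2} = 2p_3 & \text{if } i \ne j, \end{cases} \]

and

\[ P_B(g_i \ne g_j,\, g_k \ne g_l) = \begin{cases}
0 & \text{if } i = j \text{ and/or } k = l, \\[4pt]
\dfrac{2 n_a n_b}{N^2} = 2p_3 & \text{if } \{i = k,\ j = l,\ i \ne j\} \text{ or } \{i = l,\ j = k,\ i \ne j\}, \\[8pt]
\dfrac{n_a n_b}{N^2} = p_3 & \text{if } \{i = k,\ j \ne i, l\},\ \{i = l,\ j \ne i, k\},\ \{j = k,\ i \ne j, l\}, \text{ or } \{j = l,\ i \ne j, k\}, \\[8pt]
\dfrac{4 n_a^2 n_b^2}{N^4} = p_4 & \text{if } i, j, k, l \text{ are all different}.
\end{cases} \]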

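To prove asymptotic normality, we rely on Stein's method (A.1). We first define some
more notation. For any node $u$ of $C_0$, let

\[ R_u = \frac{2 n_{au} n_{bu}}{m_u}, \qquad d_u = E_B[R_u] = 2(m_u - 1) p_3, \]

where $p_3$ is defined in (C.16). Similarly, for any edge $(u, v)$ of $C_0$, let

\[ R_{uv} = \frac{n_{au} n_{bv} + n_{av} n_{bu}}{m_u m_v}, \qquad d_{uv} = E_B[R_{uv}] = 2p_3. \]

Let $\sigma_B^2 = \mathrm{Var}_B[R_{C_0}]$, and let $\xi_u$, $\xi_{uv}$ be the standardized mixing potentials,

\[ \xi_u = \frac{R_u - d_u}{\sigma_B}, \tag{C.17} \]

\[ \xi_{uv} = \frac{R_{uv} - d_{uv}}{\sigma_B}. \tag{C.18} \]

Finally, we define the index sets for $\xi_u$ and $\xi_{uv}$:

\[ J_1 = \{1, \ldots, K\}, \]

\[ J_2 = \{uv : u < v \text{ such that } (u, v) \in C_0\}, \]

and let $J = J_1 \cup J_2$. Since $R_{C_0} = \sum_{u=1}^{K} R_u + \sum_{(u,v) \in C_0} R_{uv}$, the standardized statistic
is

\[ W := \sum_{i \in J} \xi_i = \sum_{u \in J_1} \frac{R_u - d_u}{\sigma_B} + \sum_{uv \in J_2} \frac{R_{uv} - d_{uv}}{\sigma_B} = \frac{R_{C_0} - E_B[R_{C_0}]}{\sigma_B}. \]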

Our notation follows that of Theorem A.1.1 and Assumption A.1.1. For $u \in J_1$, let

\[ S_u = \{u\} \cup \{uv, vu : (u, v) \in C_0\}, \]

\[ T_u = S_u \cup \{v, vw, wv : (u, v), (v, w) \in C_0\}. \]

For $uv \in J_2$, let

\[ S_{uv} = \{uv, u, v\} \cup \{uw, wu : (u, w) \in C_0\} \cup \{vw, wv : (v, w) \in C_0\}, \]

\[ T_{uv} = S_{uv} \cup \{w, wy, yw : (u, w), (w, y) \in C_0\} \cup \{w, wy, yw : (v, w), (w, y) \in C_0\}. \]

$S_u$, $T_u$, $S_{uv}$, $T_{uv}$ defined in this way satisfy Assumption A.1.1.

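Since $R_u \in [0, \frac{m_u}{2}]$, $p_3 \in [0, \frac{1}{4}]$, and $R_{uv} \in [0, 1]$, we have $d_u \in [0, \frac{m_u - 1}{2}]$, $d_{uv} \in [0, \frac{1}{2}]$,
and therefore $|\xi_u| \le \frac{m_u}{2\sigma_B}$, $|\xi_{uv}| \le \frac{1}{\sigma_B}$. Hence,

\[ \sum_{j \in S_u} |\xi_j| \le \frac{1}{\sigma_B} \left( m_u + |E_u^{C_0}| \right), \quad u \in J_1, \]

\[ \sum_{j \in T_u} |\xi_j| \le \frac{1}{\sigma_B} \Big( m_u + \sum_{v \in V_u} m_v + |E_{u,2}^{C_0}| \Big), \quad u \in J_1, \]

\[ \sum_{j \in S_{uv}} |\xi_j| \le \frac{1}{\sigma_B} \left( m_u + m_v + |E_u^{C_0}| + |E_v^{C_0}| \right), \quad uv \in J_2, \]

\[ \sum_{j \in T_{uv}} |\xi_j| \le \frac{1}{\sigma_B} \Big( m_u + m_v + \sum_{w \in V_u \cup V_v} m_w + |E_{u,2}^{C_0}| + |E_{v,2}^{C_0}| \Big), \quad uv \in J_2. \]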

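As in Theorem A.1.1, let $\eta_i = \sum_{j \in S_i} \xi_j$ and $\theta_i = \sum_{j \in T_i} \xi_j$. Then

\[ E_B|\xi_i \eta_i \theta_i| = E_B\Big|\xi_i \sum_{j \in S_i} \xi_j \sum_{k \in T_i} \xi_k\Big| \le E_B\Big[ |\xi_i| \sum_{j \in S_i} |\xi_j| \sum_{k \in T_i} |\xi_k| \Big], \]

\[ |E_B(\xi_i \eta_i)|\, E_B|\theta_i| \le E_B\Big|\xi_i \sum_{j \in S_i} \xi_j\Big|\, E_B\Big|\sum_{j \in T_i} \xi_j\Big| \le E_B\Big[ |\xi_i| \sum_{j \in S_i} |\xi_j| \Big]\, E_B\Big[ \sum_{j \in T_i} |\xi_j| \Big], \]

\[ E_B|\xi_i \eta_i^2| = E_B\Big|\xi_i \sum_{j \in S_i} \sum_{k \in S_i} \xi_j \xi_k\Big| \le E_B\Big[ |\xi_i| \sum_{j \in S_i} |\xi_j| \sum_{k \in S_i} |\xi_k| \Big]. \]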

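Thus, for $i = u \in J_1$, the terms $E_B|\xi_i \eta_i \theta_i|$, $|E_B(\xi_i \eta_i)| E_B|\theta_i|$, and $E_B|\xi_i \eta_i^2|$ are all
bounded by

\[ \frac{1}{\sigma_B^3}\, m_u \left( m_u + |E_u^{C_0}| \right) \Big( m_u + \sum_{v \in V_u} m_v + |E_{u,2}^{C_0}| \Big), \]

and for $i = uv \in J_2$, the terms $E_B|\xi_i \eta_i \theta_i|$, $|E_B(\xi_i \eta_i)| E_B|\theta_i|$, and $E_B|\xi_i \eta_i^2|$ are all bounded
by

\[ \frac{1}{\sigma_B^3} \left( m_u + m_v + |E_u^{C_0}| + |E_v^{C_0}| \right) \Big( m_u + m_v + \sum_{w \in V_u \cup V_v} m_w + |E_{u,2}^{C_0}| + |E_{v,2}^{C_0}| \Big). \]

Hence,

\[ \begin{aligned}
\delta \le \frac{5}{\sigma_B^3} \Bigg( &\sum_{u=1}^{K} m_u \left( m_u + |E_u^{C_0}| \right) \Big( m_u + \sum_{v \in V_u} m_v + |E_{u,2}^{C_0}| \Big) \\
&+ \sum_{(u,v) \in C_0} \left( m_u + m_v + |E_u^{C_0}| + |E_v^{C_0}| \right) \Big( m_u + m_v + \sum_{w \in V_u \cup V_v} m_w + |E_{u,2}^{C_0}| + |E_{v,2}^{C_0}| \Big) \Bigg).
\end{aligned} \]

Since $\sigma_B$ is of order $\sqrt{K}$ or higher, under condition 1, $\delta \to 0$ as $K \to \infty$.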
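The mean (C.14) and variance (C.15) can be checked on a toy example by enumerating all 2^N label vectors under the bootstrap null, where each label is drawn independently and equals a with probability p_a = n_a/N. The sketch below is only an illustration under assumptions: categories are indexed from 0, |E_u^{C_0}| is taken to be the number of edges of C_0 incident on u, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def statistic(m, C0, na):
    """R_{C0} = sum_u R_u + sum_{(u,v)} R_uv from per-category counts of group a."""
    nb = [mu - x for mu, x in zip(m, na)]
    r_nodes = sum(Fraction(2 * na[u] * nb[u], m[u]) for u in range(len(m)))
    r_edges = sum(Fraction(na[u] * nb[v] + na[v] * nb[u], m[u] * m[v]) for u, v in C0)
    return r_nodes + r_edges


def bootstrap_mean_var(m, C0, p_a):
    """Exact E_B[R] and Var_B[R]; p_a must be a Fraction."""
    K, N = len(m), sum(m)
    cat = [u for u in range(K) for _ in range(m[u])]    # subject -> category
    mean = second = Fraction(0)
    for labels in product((0, 1), repeat=N):            # 1 means group a
        n_a_total = sum(labels)
        prob = p_a ** n_a_total * (1 - p_a) ** (N - n_a_total)
        na = [0] * K
        for i, g in enumerate(labels):
            na[cat[i]] += g
        r = statistic(m, C0, na)
        mean += prob * r
        second += prob * r * r
    return mean, second - mean ** 2


def closed_form_mean_var(m, C0, p_a):
    """(C.14) and (C.15) with p3 = p_a * p_b and p4 = 4 * p3^2."""
    K, N, E = len(m), sum(m), len(C0)
    p3 = p_a * (1 - p_a)
    p4 = 4 * p3 ** 2
    deg = [sum(u in edge for edge in C0) for u in range(K)]
    inv = sum(Fraction(1, mu) for mu in m)
    mean = (N - K + E) * 2 * p3
    var = (4 * (p3 - p4) * (N - K + 2 * E
                            + sum(Fraction(deg[u] ** 2, 4 * m[u]) for u in range(K))
                            - sum(Fraction(deg[u], m[u]) for u in range(K)))
           + (6 * p4 - 4 * p3) * (K - inv)
           + p4 * sum(Fraction(1, m[u] * m[v]) for u, v in C0))
    return mean, var


balanced = bootstrap_mean_var([2, 2], [(0, 1)], Fraction(1, 2))
unbalanced = bootstrap_mean_var([2, 1, 3], [(0, 1), (1, 2)], Fraction(1, 3))
```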

Proof of Theorem 9.1.3. To show the asymptotic normality of the standardized
statistic under the permutation null, we only need to show that $(R_{C_0}, n_a^B)$ converges to a
non-degenerate bivariate Gaussian distribution under the bootstrap null, where $n_a^B$
is the number of observations that belong to group $a$ in the bootstrap sample. The
asymptotic normality of $R_{C_0}$ under the permutation null then follows from the fact that
its distribution is equal to the conditional distribution of $R_{C_0}$ given $n_a^B = n_a$. The
standardized bivariate vector is

\[ \left( \frac{R_{C_0} - E_B[R_{C_0}]}{\sqrt{\mathrm{Var}_B[R_{C_0}]}},\ \frac{n_a^B - N p_a}{\sigma_0} \right) \]

with $p_a = n_a/N$, $\sigma_0^2 = N p_a (1 - p_a)$. By the Cramér-Wold device, we only need to
show that

\[ a_1\, \frac{R_{C_0} - E_B[R_{C_0}]}{\sqrt{\mathrm{Var}_B[R_{C_0}]}} + a_2\, \frac{n_a^B - N p_a}{\sigma_0} \]

is asymptotically Gaussian under the bootstrap null for all $a_1, a_2 \in \mathbb{R}$, $a_1 a_2 \ne 0$.

Let $\xi_i$, $i \in J$, be defined in the same way as in the proof of Lemma C.2.1. Let
$J_3 = \{|J| + 1, \ldots, |J| + K\}$. For $i \in J_3$, let

\[ \xi_i = \frac{n_{a i'} - p_a m_{i'}}{\sigma_0}, \qquad i' = i - |J|. \]

We use Theorem A.1.1 to show the asymptotic Gaussianity of $\sum_{i \in J} a_1 \xi_i + \sum_{i \in J_3} a_2 \xi_i$.
We need to redefine the neighborhood sets to satisfy Assumption A.1.1.

For $u \in J_1$,

\[ S_u = \{u, u + |J|\} \cup \{uv, vu : (u, v) \in C_0\}, \]

\[ T_u = S_u \cup \{v, v + |J|, vw, wv : (u, v), (v, w) \in C_0\}. \]

For $uv \in J_2$,

\[ S_{uv} = \{uv, u, v, u + |J|, v + |J|\} \cup \{uw, wu : (u, w) \in C_0\} \cup \{vw, wv : (v, w) \in C_0\}, \]

\[ T_{uv} = S_{uv} \cup \{w, w + |J|, wy, yw : (u, w), (w, y) \in C_0\} \cup \{w, w + |J|, wy, yw : (v, w), (w, y) \in C_0\}. \]

And for $u \in J_3$,

\[ S_u = \{u, u'\} \cup \{u'v, vu' : (u', v) \in C_0\}, \qquad u' = u - |J|, \]

\[ T_u = S_u \cup \{v, v + |J|, vw, wv : (u', v), (v, w) \in C_0\}. \]

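From the proof of Lemma C.2.1, we have

\[ |\xi_u| \le \frac{m_u}{2\sigma_B}, \ \forall u \in J_1; \qquad |\xi_{uv}| \le \frac{1}{\sigma_B}, \ \forall uv \in J_2. \]

For $u \in J_3$,

\[ |\xi_u| \le \frac{m_{u'}}{\sigma_0}, \qquad u' = u - |J|. \]

Let $\sigma = \min(\sigma_B, \sigma_0)$; then

\[ \sum_{j \in S_u} |\xi_j| \le \frac{1}{\sigma} \left( 2m_u + |E_u^{C_0}| \right), \quad u \in J_1 \cup J_3, \]

\[ \sum_{j \in T_u} |\xi_j| \le \frac{1}{\sigma} \Big( 2m_u + 2 \sum_{v \in V_u} m_v + |E_{u,2}^{C_0}| \Big), \quad u \in J_1 \cup J_3, \]

\[ \sum_{j \in S_{uv}} |\xi_j| \le \frac{1}{\sigma} \left( 2m_u + 2m_v + |E_u^{C_0}| + |E_v^{C_0}| \right), \quad uv \in J_2, \]

\[ \sum_{j \in T_{uv}} |\xi_j| \le \frac{1}{\sigma} \Big( 2m_u + 2m_v + 2 \sum_{w \in V_u \cup V_v} m_w + |E_{u,2}^{C_0}| + |E_{v,2}^{C_0}| \Big), \quad uv \in J_2. \]

Thus, for $i = u \in J_1 \cup J_3$, the terms $E_B|\xi_i \eta_i \theta_i|$, $|E_B(\xi_i \eta_i)| E_B|\theta_i|$, and $E_B|\xi_i \eta_i^2|$ are
all bounded by

\[ \frac{1}{\sigma^3}\, m_u \left( 2m_u + |E_u^{C_0}| \right) \Big( 2m_u + 2 \sum_{v \in V_u} m_v + |E_{u,2}^{C_0}| \Big), \]

and for $i = uv \in J_2$, the terms $E_B|\xi_i \eta_i \theta_i|$, $|E_B(\xi_i \eta_i)| E_B|\theta_i|$, and $E_B|\xi_i \eta_i^2|$ are all bounded
by

\[ \frac{1}{\sigma^3} \left( 2m_u + 2m_v + |E_u^{C_0}| + |E_v^{C_0}| \right) \Big( 2m_u + 2m_v + 2 \sum_{w \in V_u \cup V_v} m_w + |E_{u,2}^{C_0}| + |E_{v,2}^{C_0}| \Big). \]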

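Define $W_{a_1,a_2} = \sum_{i \in J} a_1 \xi_i + \sum_{i \in J_3} a_2 \xi_i$. The value of $\delta$ in Theorem A.1.1 has the
form

\[ \begin{aligned}
\delta = \frac{1}{\sqrt{E_B[W_{a_1,a_2}^2]}} \Bigg( &2 \sum_{i \in J} \left( E_B|a_1 \xi_i \eta_i \theta_i| + |E_B(a_1 \xi_i \eta_i)|\, E_B|\theta_i| \right) + \sum_{i \in J} E_B|a_1 \xi_i \eta_i^2| \\
&+ 2 \sum_{i \in J_3} \left( E_B|a_2 \xi_i \eta_i \theta_i| + |E_B(a_2 \xi_i \eta_i)|\, E_B|\theta_i| \right) + \sum_{i \in J_3} E_B|a_2 \xi_i \eta_i^2| \Bigg),
\end{aligned} \]

where $\eta_i = \sum_{j \in S_i} \xi_j (a_1 I_{j \in J} + a_2 I_{j \in J_3})$, and $\theta_i = \sum_{j \in T_i} \xi_j (a_1 I_{j \in J} + a_2 I_{j \in J_3})$.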


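Let $a = \max(|a_1|, |a_2|)$; we have

\[ E_B|a_1 \xi_i \eta_i \theta_i|,\ E_B|a_2 \xi_i \eta_i \theta_i| \le a^3 E_B\Big|\xi_i \sum_{j \in S_i} \xi_j \sum_{k \in T_i} \xi_k\Big| \le a^3 E_B\Big[ |\xi_i| \sum_{j \in S_i} |\xi_j| \sum_{k \in T_i} |\xi_k| \Big], \]

\[ |E_B(a_1 \xi_i \eta_i)|\, E_B|\theta_i|,\ |E_B(a_2 \xi_i \eta_i)|\, E_B|\theta_i| \le a^3 E_B\Big|\xi_i \sum_{j \in S_i} \xi_j\Big|\, E_B\Big|\sum_{j \in T_i} \xi_j\Big| \le a^3 E_B\Big[ |\xi_i| \sum_{j \in S_i} |\xi_j| \Big]\, E_B\Big[ \sum_{j \in T_i} |\xi_j| \Big], \]

\[ E_B|a_1 \xi_i \eta_i^2|,\ E_B|a_2 \xi_i \eta_i^2| \le a^3 E_B\Big|\xi_i \sum_{j \in S_i} \sum_{k \in S_i} \xi_j \xi_k\Big| \le a^3 E_B\Big[ |\xi_i| \sum_{j \in S_i} |\xi_j| \sum_{k \in S_i} |\xi_k| \Big]. \]

Thus,

\[ \begin{aligned}
\delta \le \frac{40 a^3}{\sigma^3 \sqrt{E_B[W_{a_1,a_2}^2]}} \Bigg( &\sum_{u=1}^{K} m_u \left( m_u + |E_u^{C_0}| \right) \Big( m_u + \sum_{v \in V_u} m_v + |E_{u,2}^{C_0}| \Big) \\
&+ \sum_{(u,v) \in C_0} \left( m_u + m_v + |E_u^{C_0}| + |E_v^{C_0}| \right) \Big( m_u + m_v + \sum_{w \in V_u \cup V_v} m_w + |E_{u,2}^{C_0}| + |E_{v,2}^{C_0}| \Big) \Bigg).
\end{aligned} \]

Since $\sigma_B^2$ is at least of order $K$ and $\sigma_0^2$ is of order $N$, $\sigma^2$ is at least of order $K$ by
Condition 2. If $E_B[W_{a_1,a_2}^2]$ is uniformly bounded away from 0 for any $a_1 a_2 \ne 0$, then
under Condition 1, $\delta \to 0$ as $K \to \infty$.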

We next show that under Condition 2, $E_B[W_{a_1,a_2}^2]$ is uniformly bounded
away from 0 for any $a_1 a_2 \ne 0$.

Let $W_1 = \sum_{i \in J} \xi_i$ and $W_2 = \sum_{i \in J_3} \xi_i$; then

\[ E_B[W_{a_1,a_2}^2] = a_1^2 E_B W_1^2 + a_2^2 E_B W_2^2 + 2 a_1 a_2 E_B[W_1 W_2] = a_1^2 + a_2^2 + 2 a_1 a_2 E_B[W_1 W_2]. \]

Thus, we only need to show that the absolute correlation between $W_1$ and $W_2$ is
uniformly bounded away from 1. Notice that, in the theorem, we require $n_a/N$ to
be bounded away from 0 and 1, so $p_a$ and $p_b$ are both bounded away from 0 and 1.

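Correlation between $R_{C_0}$ and $n_a^B$: Observe that

\[ \begin{aligned}
R_{C_0} n_a^B &= \Bigg( \sum_{u=1}^{K} \frac{1}{m_u} \sum_{i,j \in C_u} I_{g_i \ne g_j} + \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \sum_{i \in C_u,\, j \in C_v} I_{g_i \ne g_j} \Bigg) \sum_{x=1}^{N} I_{g_x = a} \\
&= \sum_{u=1}^{K} \frac{1}{m_u} \sum_{i,j \in C_u} \Big( I_{g_i \ne g_j} \sum_{x=1}^{N} I_{g_x = a} \Big) + \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \sum_{i \in C_u,\, j \in C_v} \Big( I_{g_i \ne g_j} \sum_{x=1}^{N} I_{g_x = a} \Big).
\end{aligned} \]

For any $i \ne j$,

\[ \begin{aligned}
E_B\Big[ I_{g_i \ne g_j} \sum_{x=1}^{N} I_{g_x = a} \Big] &= E_B\Big[ I_{g_i \ne g_j,\, g_i = a} + I_{g_i \ne g_j,\, g_j = a} + \sum_{x \ne i,j} I_{g_i \ne g_j,\, g_x = a} \Big] \\
&= P_B(g_i = a, g_j = b) + P_B(g_i = b, g_j = a) + \sum_{x \ne i,j} P_B(g_i \ne g_j,\, g_x = a) \\
&= p_a p_b + p_a p_b + 2 p_a p_b p_a (N - 2) = 2 p_a p_b (N p_a + 1 - 2p_a).
\end{aligned} \]

Hence

\[ E_B[R_{C_0} n_a^B] = (N - K + |C_0|)\, 2 p_a p_b (N p_a + 1 - 2p_a). \]

Since $E_B[R_{C_0}] = (N - K + |C_0|)\, 2 p_a p_b$ and $E_B[n_a^B] = N p_a$, we have

\[ \mathrm{cov}_B(R_{C_0}, n_a^B) = (N - K + |C_0|)\, 2 p_a p_b (1 - 2p_a). \tag{C.19} \]

If $p_a = 1/2$, then $\mathrm{cov}_B(R_{C_0}, n_a^B) = 0$. Since $\mathrm{Var}_B[R_{C_0}]$ and $\mathrm{Var}_B[n_a^B] = N p_a p_b$ are
positive, $\mathrm{corr}_B(R_{C_0}, n_a^B) = 0$, which is clearly bounded away from 1. We consider $p_a \ne 1/2$ in the
following.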

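\[ \begin{aligned}
\mathrm{Var}_B[R_{C_0}] &= 4 p_a p_b (1 - 4 p_a p_b) \left( N - K + 2|C_0| + \sum_{u=1}^{K} \frac{|E_u^{C_0}|^2}{4 m_u} - \sum_{u=1}^{K} \frac{|E_u^{C_0}|}{m_u} \right) \\
&\quad + 4 p_a p_b (6 p_a p_b - 1) \left( K - \sum_{u=1}^{K} \frac{1}{m_u} \right) + 4 p_a^2 p_b^2 \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \\
&= 4 p_a p_b (1 - 4 p_a p_b) \left( N - 2K + 2|C_0| + \sum_{u=1}^{K} \frac{(|E_u^{C_0}|/2 - 1)^2}{m_u} \right) \\
&\quad + 8 p_a^2 p_b^2 \left( K - \sum_{u=1}^{K} \frac{1}{m_u} \right) + 4 p_a^2 p_b^2 \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}.
\end{aligned} \]

Since

\[ \begin{aligned}
N \sum_{u=1}^{K} \frac{(|E_u^{C_0}|/2 - 1)^2}{m_u} &= \sum_{u=1}^{K} m_u \sum_{u=1}^{K} \frac{(|E_u^{C_0}|/2 - 1)^2}{m_u} \ge \Bigg( \sum_{u=1}^{K} \sqrt{m_u\, \frac{(|E_u^{C_0}|/2 - 1)^2}{m_u}} \Bigg)^2 \\
&= \Bigg( \sum_{u=1}^{K} \big| |E_u^{C_0}|/2 - 1 \big| \Bigg)^2 \ge \Bigg( \sum_{u=1}^{K} (|E_u^{C_0}|/2 - 1) \Bigg)^2 = (|C_0| - K)^2,
\end{aligned} \]

we have

\[ \mathrm{Var}_B[R_{C_0}]\, \mathrm{Var}_B[n_a^B] \ge 4 p_a^2 p_b^2 (1 - 4 p_a p_b) [N - K + |C_0|]^2 + 4 p_a^3 p_b^3 N \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}. \]

Hence,

\[ |\mathrm{corr}_B(R_{C_0}, n_a^B)| \le \frac{1}{\sqrt{1 + \dfrac{p_a p_b N \sum_{(u,v) \in C_0} \frac{1}{m_u m_v}}{(1 - 4 p_a p_b) [N - K + |C_0|]^2}}}. \]

When $N, |C_0|, \sum_{(u,v) \in C_0} \frac{1}{m_u m_v} \sim O(K)$, $|\mathrm{corr}_B(R_{C_0}, n_a^B)|$ is bounded by a value smaller
than 1.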
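The covariance (C.19) and the resulting correlation bound can also be checked by exhaustive enumeration under the bootstrap null. The following sketch is an illustration only; the function name is hypothetical, categories are indexed from 0, and labels are drawn independently with probability p_a of being a. The bound is compared in squared form so that exact rational arithmetic suffices.

```python
from fractions import Fraction
from itertools import product


def bootstrap_second_moments(m, C0, p_a):
    """Exact Var_B[R], Var_B[n_a^B] and cov_B(R, n_a^B); p_a must be a Fraction."""
    K, N = len(m), sum(m)
    cat = [u for u in range(K) for _ in range(m[u])]    # subject -> category
    e_r = e_n = e_rr = e_nn = e_rn = Fraction(0)
    for labels in product((0, 1), repeat=N):            # 1 means group a
        n = sum(labels)
        prob = p_a ** n * (1 - p_a) ** (N - n)
        na = [0] * K
        for i, g in enumerate(labels):
            na[cat[i]] += g
        nb = [m[u] - na[u] for u in range(K)]
        r = (sum(Fraction(2 * na[u] * nb[u], m[u]) for u in range(K))
             + sum(Fraction(na[u] * nb[v] + na[v] * nb[u], m[u] * m[v]) for u, v in C0))
        e_r += prob * r
        e_n += prob * n
        e_rr += prob * r * r
        e_nn += prob * n * n
        e_rn += prob * r * n
    return e_rr - e_r ** 2, e_nn - e_n ** 2, e_rn - e_r * e_n


m, C0, p_a = [2, 1, 3], [(0, 1), (1, 2)], Fraction(1, 3)
K, N, E = len(m), sum(m), len(C0)
p_b = 1 - p_a
var_r, var_n, cov = bootstrap_second_moments(m, C0, p_a)

cov_formula = (N - K + E) * 2 * p_a * p_b * (1 - 2 * p_a)          # (C.19)
corr_sq = cov ** 2 / (var_r * var_n)
ratio = (p_a * p_b * N * sum(Fraction(1, m[u] * m[v]) for u, v in C0)
         / ((1 - 4 * p_a * p_b) * (N - K + E) ** 2))
corr_sq_bound = 1 / (1 + ratio)                                      # squared bound
```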


C.2.3 Proof of Lemma 9.2.1

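Let $G$ be the uMST on subjects, and $E_i^G = \{(i, j) : (i, j) \in G\}$. Then, for a subject $i$ in category $u$,
$|E_i^G| = m_u + \sum_{v \in V_u} m_v - 1$, and $|G| = \sum_{u=1}^{K} m_u (m_u - 1)/2 + \sum_{(u,v) \in C_0} m_u m_v$. Since $E_P[T_{C_0}] = |G|\, 2p_1$,
the result follows.

Now, we compute the second moment.

\[ \begin{aligned}
E_P[T_{C_0}^2] &= \sum_{(i,j),(k,l) \in G} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&= \sum_{(i,j) \in G} P_P(g_i \ne g_j) + \sum_{(i,j),(i,k) \in G,\ j \ne k} P_P(g_i \ne g_j,\, g_i \ne g_k) + \sum_{\substack{(i,j),(k,l) \in G \\ i, j, k, l \text{ all different}}} P_P(g_i \ne g_j,\, g_k \ne g_l) \\
&= |G|\, 2p_1 + \sum_{i=1}^{N} |E_i^G| (|E_i^G| - 1) p_1 + \Big( |G|^2 - |G| - \sum_{i=1}^{N} |E_i^G| (|E_i^G| - 1) \Big) p_2 \\
&= (p_1 - p_2) \sum_{u=1}^{K} m_u \Big( m_u + \sum_{v \in V_u} m_v - 1 \Big) \Big( m_u + \sum_{v \in V_u} m_v - 2 \Big) \\
&\quad + (p_1 - p_2/2) \Bigg( \sum_{u=1}^{K} m_u (m_u - 1) + 2 \sum_{(u,v) \in C_0} m_u m_v \Bigg) + p_2 \Bigg( \sum_{u=1}^{K} \frac{m_u (m_u - 1)}{2} + \sum_{(u,v) \in C_0} m_u m_v \Bigg)^2.
\end{aligned} \]

$\mathrm{Var}_P[T_{C_0}]$ follows by $E_P[T_{C_0}^2] - E_P^2[T_{C_0}]$.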

Bibliography

J.A. Anderson, K. Whaley, J. Williamson, and W.W. Buchanan. A statistical aid to

the diagnosis of keratoconjunctivitis sicca. QJM, 41(2):175, 1972.

Eliot C Bush and Bruce T Lahn. The evolution of word composition in metazoan

promoter sequence. PLoS computational biology, 2(11):e150, 2006.

E.G. Carlstein, H.G. Muller, and D. Siegmund. Change-point problems, volume 23.

Inst of Mathematical Statistic, 1994.

L.H.Y. Chen and Q.M. Shao. Stein’s method for normal approximation. An intro-

duction to Stein’s method, 4:1–59, 2005.

G.W. Cobb. The problem of the nile: conditional solution to a changepoint problem.

Biometrika, 65(2):243–251, 1978.

D.E. Critchlow. Metric methods for analyzing partially ranked data, volume 34.

Springer, 1985.

F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm.

Signal Processing, IEEE Transactions on, 53(8):2961–2974, 2005.

P. Diaconis. Group representations in probability and statistics. Lecture Notes-

Monograph Series, 11, 1988.

N. Eagle, A.S. Pentland, and D. Lazer. Inferring friendship network structure by

using mobile phone data. Proceedings of the National Academy of Sciences, 106

(36):15274–15278, 2009.


D. Eppstein. Representing all minimum spanning trees with applications to counting

and generation. Citeseer, 1995.

J.H. Friedman and L.C. Rafsky. Multivariate generalizations of the wald-wolfowitz

and smirnov two-sample tests. The Annals of Statistics, pages 697–717, 1979.

S. Furihata, T. Ito, and N. Kamatani. Test of association between haplotypes and

phenotypes in case–control studies: Examination of validity of the application of

an algorithm for samples from cohort or clinical trials to case–control samples using

simulated and real data. Genetics, 174(3):1505–1516, 2006.

J. Giron, J. Ginebra, and A. Riba. Bayesian analysis of a multinomial sequence and

homogeneity of literary style. The American Statistician, 59(1):19–30, 2005.

Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. 2009.

B. James, K.L. James, and D. Siegmund. Tests for a change-point. Biometrika, 74

(1):71, 1987.

B. James, K.L. James, and D. Siegmund. Asymptotic approximations for likelihood

ratio tests and confidence regions for a change-point in the mean of a multivariate

gaussian process. Statistica Sinica, 2(1):69–90, 1992.

G. Kossinets and D.J. Watts. Empirical analysis of an evolving social network. Sci-

ence, 311(5757):88–90, 2006.

R.A. Lippert, H. Huang, and M.S. Waterman. Distributional regimes for the number

of k-word matches between two random sequences. Proceedings of the National

Academy of Sciences, 99(22):13980, 2002.

A. Lung-Yut-Fong, C. Levy-Leduc, and O. Cappe. Homogeneity and change-

point detection tests for multivariate data using rank statistics. Arxiv preprint

arXiv:1107.1971, 2011.

C.L. Mallows. Non-null ranking models. i. Biometrika, 44(1/2):114–130, 1957.


C.R. Mehta and N.R. Patel. A network algorithm for performing fisher’s exact test in

r× c contingency tables. Journal of the American Statistical Association, 78(382):

427–434, 1983.

D. Nettleton and T. Banerjee. Testing the equality of distributions of random vectors

with categorical components. Computational statistics & data analysis, 37(2):195–

208, 2001.

A.B. Olshen, ES Venkatraman, R. Lucito, and M. Wigler. Circular binary segmen-

tation for the analysis of array-based dna copy number data. Biostatistics, 5(4):

557–572, 2004.

Scott C Perry and Robert G Beiko. Distinguishing microbial genome fragments based

on their composition: evolutionary and comparative genomic perspectives. Genome

biology and evolution, 2:117, 2010.

Issaac Rajan, Sarang Aravamuthan, and Sharmila S Mande. Identification of compo-

sitionally distinct regions in genomes using the centroid method. Bioinformatics,

23(20):2672–2677, 2007.

P.R. Rosenbaum. An exact distribution-free test comparing two multivariate dis-

tributions based on adjacency. Journal of the Royal Statistical Society: Series B

(Statistical Methodology), 67(4):515–530, 2005.

Akiyoshi Shioura and Akihisa Tamura. Efficiently scanning all spanning trees of an

undirected graph. Journal of the Operations Research Society of Japan, 38(3):331–

344, 1995. ISSN 04534514. URL http://ci.nii.ac.jp/naid/110001184429/en/.

D. Siegmund. Approximate tail probabilities for the maxima of some random fields.

The Annals of Probability, pages 487–501, 1988.

D. Siegmund and B. Yakir. The statistics of gene mapping. Springer, 2007.

D. Siegmund, B. Yakir, and N.R. Zhang. Detecting simultaneous variant intervals in

aligned sequences. The Annals of Applied Statistics, 5(2A):645–668, 2011.


D.O. Siegmund. Tail approximations for maxima of random fields. In Probability the-

ory: proceedings of the 1989 Singapore probability Conference held at the National

University of Singapore, June 8-16, 1989, page 147. Walter de Gruyter, 1992.

MS Srivastava and K.J. Worsley. Likelihood ratio tests for a change in the multivariate

normal mean. Journal of the American Statistical Association, pages 199–204, 1986.

H.K. Tang and D. Siegmund. Mapping quantitative trait loci in oligogenic models.

Biostatistics, 2(2):147–162, 2001.

A. Tsirigos and I. Rigoutsos. A new computational method for the detection of

horizontal gene transfer events. Nucleic acids research, 33(3):922–933, 2005.

I-Ping Tu, David Siegmund, et al. The maximum of a function of a markov chain

and application to linkage analysis. Advances in Applied Probability, 31(2):510–531,

1999.

L.J. Vostrikova. Detecting disorder in multidimensional random processes. In Soviet

Mathematics Doklady, volume 24, pages 55–59, 1981.

M. Woodroofe. Frequentist properties of bayesian sequential tests. Biometrika, 63

(1):101–110, 1976.

M. Woodroofe. Large deviations of likelihood ratio statistics with applications to

sequential testing. The Annals of Statistics, pages 72–84, 1978.

D.V. Zaykin, P.H. Westfall, S.S. Young, M.A. Karnoub, M.J. Wagner, and M.G. Ehm.

Testing association of statistically inferred haplotypes with discrete and continuous

traits in samples of unrelated individuals. Human heredity, 53(2):79–91, 2002.

N.R. Zhang, D.O. Siegmund, H. Ji, and J.Z. Li. Detecting simultaneous changepoints

in multiple sequences. Biometrika, 97(3):631–645, 2010.
