Page 1:

Parallel 2D Kolmogorov-Smirnov Statistic

Ian Chan

5/12/02 6.338J/18.337J

http://web.mit.edu/ianchan/www/KS2D

Page 2:

Motivation: my friend’s research

A colossal X-ray flare, likely sparked by the Milky Way's central black hole, produced the bright spot in this Chandra image. [Source: CNN]

Page 3:

1D Kolmogorov-Smirnov Statistic

Tests the difference between two empirical distributions F and G nonparametrically

D statistic: the maximum difference between the two CDFs, D = max_x |F(x) - G(x)|

Page 4:

1D KS Test Bound

Kolmogorov (1933) asymptotic bound:
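The formula itself is not reproduced in this transcript; the standard Kolmogorov (1933) asymptotic result the slide presumably shows is the limiting distribution of the scaled statistic (for the two-sample test, n is the effective sample size n1·n2/(n1+n2)):

```latex
\lim_{n \to \infty} P\!\left(\sqrt{n}\, D_n > z\right)
  = 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 z^2}
```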

Page 5:

2D Analog of KS Test

Peacock, J. A., "Two-dimensional goodness-of-fit testing in astronomy", Monthly Notices of the Royal Astronomical Society, 1983, vol. 202, p. 615.

D statistic: over all possible quadrant divisions of the plane, the largest difference between the two samples' cumulative fractions
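In symbols (my notation, not the slide's): writing F1(Q) and F2(Q) for the fraction of each sample falling in quadrant Q anchored at a candidate point (x, y), Peacock's statistic can be written as

```latex
D = \max_{(x,y)} \; \max_{Q \in \{\text{4 quadrants at } (x,y)\}}
      \left| F_1(Q) - F_2(Q) \right|
```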

Page 6:

2D KS Test Bound

Monte Carlo simulated bounds

Z = D · n^(1/2)

Page 7:

KS2D Test Brute Force Algorithm

O(n²): not exhaustive; quadrants centered at each data point

O(n³): exhaustive; quadrants centered at each possible (data x, data y) combination
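A minimal sketch of the O(n²), non-exhaustive variant, with quadrants centered at every data point of both samples; the function and variable names (quadrant_fractions, ks2d_brute, ...) are illustrative, not taken from the project code.

```python
import numpy as np

def quadrant_fractions(x, y, xs, ys):
    """Fraction of the sample (xs, ys) falling in each quadrant around (x, y)."""
    n = len(xs)
    ur = np.sum((xs >  x) & (ys >  y)) / n   # upper right
    ul = np.sum((xs <= x) & (ys >  y)) / n   # upper left
    ll = np.sum((xs <= x) & (ys <= y)) / n   # lower left
    lr = np.sum((xs >  x) & (ys <= y)) / n   # lower right
    return np.array([ur, ul, ll, lr])

def ks2d_brute(xs1, ys1, xs2, ys2):
    """2D KS D statistic with quadrants centered at every data point: O(n^2)."""
    d = 0.0
    for x, y in zip(np.concatenate([xs1, xs2]), np.concatenate([ys1, ys2])):
        f1 = quadrant_fractions(x, y, xs1, ys1)
        f2 = quadrant_fractions(x, y, xs2, ys2)
        d = max(d, float(np.max(np.abs(f1 - f2))))
    return d
```

The exhaustive O(n³) variant replaces the loop over data points with a loop over every (x, y) pair drawn from the data's x- and y-coordinates.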

Page 8:

O(n log n) KS2D Algorithm

Author: A. Cooke (1999). Construction of a binary tree data structure (O(n log n)); requires sample data pre-sorted by y.

Page 9:

How it works: (1) Tree construction

For quadrants centered at (x, y), the upper-left quadrant must contain all samples (a, b) where a < x AND b < y

If a node is childless (a leaf), Dmin = Dmax = +1/N_square or -1/N_circle, depending on the sample's class
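A hedged sketch of the per-node bookkeeping during tree construction; the slides do not give Cooke's exact tree layout, so the Node fields and names below are assumptions, and only the leaf rule (Dmin = Dmax = +1/N_square or -1/N_circle by class) is taken from the slide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    x: float
    y: float
    is_square: bool                  # sample class: True = "square", False = "circle"
    left: Optional["Node"] = None    # subtree of samples with smaller x (assumption)
    right: Optional["Node"] = None   # subtree of samples with larger x (assumption)
    dmin: float = 0.0                # smallest D over quadrants anchored in this subtree
    dmax: float = 0.0                # largest D over quadrants anchored in this subtree
    delta: float = 0.0               # CDF difference if the quadrant holds this whole subtree

def init_leaf(node: Node, n_square: int, n_circle: int) -> None:
    """Childless node: Dmin = Dmax = +1/N_square or -1/N_circle, by class."""
    node.dmin = node.dmax = node.delta = (
        1.0 / n_square if node.is_square else -1.0 / n_circle
    )
```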

Page 10:

How it works: (2) Upward Propagation

At node (2,3), we find the MIN and MAX among 3 choices (a code sketch follows this list):

1. Inherit Dmin/max from its left child (1,2). This implies that Q excludes (2,3), where Q is the quadrant containing the largest |D|.

2. D = delta(left child) + (0/Ns - 1/Nc). This implies Q contains (2,3) and has (2,3) on its border. Delta(x) = difference in CDFs if the quadrant contains all samples in the subtree rooted at x.

3. D = delta(left child) + (0/Ns - 1/Nc) + Dmin/max(right child). This implies Q contains (2,3) and (2,3) is not on its border.
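A hedged sketch of this upward-propagation rule for the top-left quadrant, reusing the Node fields from the earlier sketch. The three candidate sets mirror choices 1-3 above; the handling of missing children (delta = 0, no candidates contributed) and the generic own-contribution term (+1/N_square or -1/N_circle by class) are my assumptions.

```python
def propagate(node: Node, n_square: int, n_circle: int) -> None:
    """Compute Dmin/Dmax and delta at a node from its children (top-left rule)."""
    own = 1.0 / n_square if node.is_square else -1.0 / n_circle
    left_delta = node.left.delta if node.left else 0.0
    right_delta = node.right.delta if node.right else 0.0

    candidates = [left_delta + own]                  # choice 2: node on the quadrant border
    if node.left:                                    # choice 1: quadrant excludes this node
        candidates += [node.left.dmin, node.left.dmax]
    if node.right:                                   # choice 3: node strictly inside the quadrant
        candidates += [left_delta + own + node.right.dmin,
                       left_delta + own + node.right.dmax]

    node.dmin, node.dmax = min(candidates), max(candidates)
    node.delta = left_delta + own + right_delta      # CDF diff if quadrant holds whole subtree
```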

Page 11:

The other 3 quadrants

We have considered the top-left quadrant, but the top-right quadrant can be obtained from the same tree structure if we modify the upward-propagation rule by swapping left and right, i.e.

At node (2,3), we find the MIN and MAX among 3 choices:

1. Inherit Dmin/max from its right child (1,2). This implies that Q excludes (2,3), where Q is the quadrant containing the largest |D|.

2. D = delta(right child) + (0/Ns - 1/Nc). This implies Q contains (2,3) and has (2,3) on its border. Delta(x) = difference in CDFs if the quadrant contains all samples in the subtree rooted at x.

3. D = delta(right child) + (0/Ns - 1/Nc) + Dmin/max(left child). This implies Q contains (2,3) and (2,3) is not on its border.

• The bottom-left/right quadrants can be obtained if the tree is built with samples sorted in reverse order of y.

Page 12:

Parallel KS2D Algorithm

Speed possibly scales linearly with the number of processors during the upward-propagation step; the tree-construction step cannot be parallelized.

Problem size scales linearly with the number of processors because sample nodes are stored distributed across the processors.

Challenges

1. Load balancing: divide the tree nodes equally among processors.

2. Minimize communication: try to store an entire subtree on a single processor so that less inter-processor communication is necessary.

Page 13:

Load Balancing and Minimum Communications

• Ideally…

Page 14:

Load Balancing Strategy (1): Pre-processing

Randomly sample 1000 data points and sort them by x. Take the (1/numproc)·1000th, (2/numproc)·1000th, …, ((numproc-1)/numproc)·1000th positions and use them to define the intervals for load balancing.

Drawback: assumes x and y are more or less independent.
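A minimal sketch of this pre-processing step, assuming each processor owns one contiguous x-interval; the numpy usage and names are my choices, not from the original code.

```python
import numpy as np

def preprocessing_breakpoints(x, numproc, pilot_size=1000, seed=0):
    """Pick numproc-1 x-breakpoints from a random pilot sample so that each
    processor's interval receives roughly the same share of the data."""
    rng = np.random.default_rng(seed)
    pilot = np.sort(rng.choice(x, size=min(pilot_size, len(x)), replace=False))
    # positions (k/numproc)*pilot_size for k = 1 .. numproc-1
    idx = [len(pilot) * k // numproc for k in range(1, numproc)]
    return pilot[idx]

# e.g. x ~ uniform[0,1] with 4 processors gives breakpoints near 0.25, 0.5, 0.75
# breaks = preprocessing_breakpoints(np.random.rand(20000), numproc=4)
```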

Page 15:

Load Balancing Strategy (2): Adaptive

Keep a running average of the x values of the nodes stored on each processor. After every CHECKPOINT (= 2000) samples, if the load is skewed (the difference between the heaviest-loaded and lightest-loaded processor exceeds 30% of the lightest processor's load), change the load-balancing intervals to the midpoints of the running averages.
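A hedged sketch of this adaptive check; only the per-processor running averages, the CHECKPOINT interval, and the 30% skew test come from the slide, while the function's shape and its return of new breakpoints are assumptions.

```python
def maybe_rebalance(loads, running_avg_x, samples_seen, CHECKPOINT=2000):
    """loads[i]: number of sample nodes currently on processor i;
    running_avg_x[i]: running average of the x values stored on processor i.
    Returns new interval breakpoints, or None if no rebalance is needed."""
    if samples_seen % CHECKPOINT != 0:
        return None                                   # only check every CHECKPOINT samples
    if max(loads) - min(loads) <= 0.30 * min(loads):
        return None                                   # load not skewed enough
    # new breakpoints: midpoints between neighbouring processors' running averages
    return [(running_avg_x[i] + running_avg_x[i + 1]) / 2.0
            for i in range(len(running_avg_x) - 1)]
```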

Page 16:

Performance (1)

• 20,000 and 200,000 samples from a uniform [0,1] distribution

Page 17:

Performance (2)

Effects of adaptive load-balancing on performance for samples from standard normal distributions centered at (-0.7, 0.7) and (0.7, -0.7).

Page 18:

Conclusion for Parallel KS2D

Speedup is not great, especially when more processors are used, because of communication overhead.

Load-balancing strategies are noticeably effective for certain data distributions; the need for them depends on the samples.

Distributed memory: gains the ability to solve larger problems.