Distributed Data Mining for Astronomy Catalogs
Chris Giannella, Haimonti Dutta, Kirk Borne, Ran Wolff, Hillol Kargupta

Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Baltimore, Maryland 21250 USA
{cgiannel,hdutta1,hillol}@cs.umbc.edu

School of Computational Sciences
George Mason University
4400 University Drive
Fairfax, Virginia 22030

Computer Science Department
Israel Institute of Technology
Haifa, Israel 32000
DDM FOR ASTRONOMY CATALOGS 1
SUMMARY
The design, implementation, and archiving of very large sky surveys is playing an increasingly important
role in today's astronomy research. However, these data archives will necessarily be geographically
distributed. To fully exploit the potential of this data lode, we believe that users ought to be provided
with a more communication-efficient alternative to multiple-archive data analysis than first
downloading the archives in full to a centralized site.
In this paper, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs.
The system is designed to sit on top of the existing national virtual observatory environment and provide
tools for distributed data mining (as web services) without requiring datasets to be fully down-loaded to
a centralized server. To illustrate the potential effectiveness of our system, we develop communication-
efficient distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we
carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.
In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such
as a reduction of a 3-dimensional data distribution (in astronomer’s observed units) to a planar data
distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through
efficient access to distributed multi-dimensional data sets.
A much abbreviated version of this paper has been submitted to the Mining Scientific and Engineering Datasets workshop
affiliated with the 2006 SIAM International Conference on Data Mining.
Corresponding author.
Also affiliated with Agnik LLC, Columbia, Maryland, USA.
Contract/grant sponsor: National Science Foundation; contract/grant numbers: IIS-0329143, IIS-0093353, IIS-0203958, AST-0122449
Copyright © 2006 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2006; 0:0–0
Prepared using cpeauth.cls
2 C. GIANNELLA ET AL.
KEY WORDS: distributed data mining, astronomy catalogs, cross matching, principal component analysis, outlier
detection
1. Introduction
The design, implementation, and archiving of very large sky surveys is playing an increasingly
important role in today’s astronomy research. Many projects today (e.g. GALEX All-Sky Survey), and
many more projects in the near future (e.g. WISE All-Sky Survey) are destined to produce enormous
catalogs (tables) of astronomical sources (tuples). These catalogs will necessarily be geographically
distributed. It is this virtual collection of gigabyte, terabyte, and (eventually) petabyte catalogs that
will significantly increase science return and enable remarkable new scientific discoveries through the
integration and cross-correlation of data across these multiple survey dimensions [43]. Astronomers
will be unable to fully tap the riches of this data without a new paradigm for astro-informatics that
involves distributed database queries and data mining across distributed virtual tables of joined and
integrated sky survey catalogs [6, 7].
The development and deployment of a National Virtual Observatory (NVO) [48] is a step toward a
solution to this problem. However, processing, mining, and analyzing these distributed and vast data
collections are fundamentally challenging tasks since most off-the-shelf data mining systems require
the data to be down-loaded to a single location before further analysis. This imposes serious scalability
constraints on the data mining system and fundamentally hinders the scientific discovery process.
Figure 1 further illustrates this technical problem. The left part depicts the current data flow in the
NVO. Through web services, data are selected and down-loaded from multiple sky-surveys.
[Figure 1 appears here as two data-flow diagrams. (a) Current data flow is restricted because of data ownership and bandwidth considerations: users select and down-load tables from each sky survey through its NVO interface, with cross-match web services in implementation. (b) Distributed data mining algorithms can process large amounts of data using a small amount of communication: distributed data mining components co-located with each sky survey communicate among themselves, and the users get the data mining output (a model) rather than raw data.]

Figure 1. Proposed data flow for distributed data mining embedded in the NVO.
If distributed data repositories are to be truly accessible to a larger community, then technology
ought to be developed to support distributed data analysis that reduces, as much as possible,
communication requirements among the data servers and the client machines. Communication-efficient
distributed data mining (DDM) techniques will allow a large number of users simultaneously to
perform advanced data analysis without necessarily down-loading large volumes of data to their
respective client machines.
In this paper, we propose a system, DEMAC, for the distributed exploration of massive astronomical
catalogs. DEMAC will offer a collection of data mining tools based on various DDM algorithms. The
system would be built on top of the existing NVO environment and provide tools for data mining
(as web services) without requiring datasets to be down-loaded to a centralized server. Consider again
Figure 1. Our system requires a relatively simple modification – the addition of distributed data mining
functionality in the sky servers. This allows DDM to be carried out without having to down-load large
tables to the users’ desktop or some other remote machine. Instead, the users will only down-load
the output of the data mining process (a data mining model); the actual data mining from multiple data
servers will be performed using communication-efficient DDM algorithms. The algorithms we develop
sacrifice perfect accuracy for communication savings. They offer approximate results at a considerably
lower communication cost than that of exact results through centralization. As such, we see DEMAC
as serving the role of an exploratory "browser": users can quickly get (generally quite accurate) results
for their distributed queries at low communication cost. Armed with these results, users can focus in
on a specific query or portion of the datasets, and down-load that data for more intricate analysis.
To illustrate the potential effectiveness of our system, we develop communication-efficient
distributed algorithms for principal component analysis (PCA) and outlier detection. Then we carry
out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.
Astronomers have previously discovered cases where the observed parameters measured for a
particular class of astronomical objects (such as elliptical galaxies) are strongly correlated, as a
result of universal astrophysical processes (such as gravity). PCA will find such correlations in the
form of principal components. An example of this is the reduction of a 3-dimensional scatter plot
of elliptical galaxy parameters to a planar data distribution. The explanation of this plane follows
from fundamental astrophysical processes within galaxies, and thus the resulting data distribution is
labeled the Fundamental Plane of Elliptical Galaxies. The important physical insights that astronomers
have derived from this fundamental plane suggest that similar new physical insights and scientific
discoveries may come from new analysis of combinations of other astronomical parameters. Since
these multiple parameters are now necessarily distributed across geographically dispersed data
archives, it is scientifically valuable to explore distributed PCA on larger astronomical data collections
and for greater numbers of astrophysical parameters. The application of communication-efficient
distributed PCA and other DDM algorithms will likely enable the discovery of new fundamental planes,
and thus produce new scientific insights into our Universe.
1.1. Astronomical Research Enabled
As an example of the potential astronomical research that our system aims to enable, consider the large
survey databases being produced (now and in the near future) by various U.S. National Aeronautics
and Space Administration (NASA) missions. GALEX is producing all-sky surveys at a variety of depths in the
near-UV and far-UV. The Spitzer Space Telescope is conducting numerous large-area surveys in the
infrared, including fields (e.g., GOODS, Hubble Deep Fields) that are well studied by HST (optical),
Chandra (X-ray), and numerous other observatories. The WISE mission (to be launched circa 2008)
will produce an all-sky infrared survey. The 2-Micron All-Sky Survey (2MASS) is cataloging millions
of stars and galaxies in the near-infrared. Each of these wave-bands contributes valuable astrophysical
knowledge to the study of countless classes of objects in the astrophysical zoo. In many cases, such
as the young star-forming regions within star-bursting galaxies, the relevant astrophysical objects and
phenomena have unique characteristics within each wavelength domain. For example, star-bursting
galaxies are often dust-enshrouded, yielding enormous infrared fluxes. Such galaxies reveal peculiar
optical morphologies, occasional X-ray sources (such as intermediate black holes), and possibly even
some UV bright spots as the short-wavelength radiation leaks through holes in the obscuring clouds. All
of these data, from multiple missions in multiple wave-bands, are essential for a full characterization,
classification, analysis, and interpretation of these cosmologically significant populations. In order
to reap the full potential of multi-wavelength analysis and discovery that NASA’s suite of missions
enables, it is essential to bring together data from multiple heterogeneously distributed data centers
(such as MAST, HEASARC, and IRSA). For the all-sky surveys in particular, it is impossible to access,
mine, navigate, browse, and analyze these data in their current distributed state. To illustrate this point,
suppose that an all-sky catalog contains descriptive data for one billion objects; and suppose that these
descriptive data consist of a few hundred parameters (which is typical for the 2MASS and Sloan Digital
Sky Surveys). Then, assuming that each parameter requires just a 2-byte representation, each
survey database will consume one terabyte of space. If the survey also has a temporal dimension,
then even more space is required. It is clearly infeasible and impractical to drag these
terabyte catalogs back and forth from user to user, from data center to data center, from analysis
package to package, each time someone has a new query to pose against NASA’s multi-wavelength data
collections. Therefore, we see an important need for data mining and analysis tools that are inherently
designed to work on distributed data collections.
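The back-of-envelope estimate above can be checked directly; a minimal sketch (the object count, parameter count, and bytes-per-parameter figures are the ones quoted in the text):

```python
# Back-of-envelope size of a single all-sky survey catalog,
# using the figures quoted in the text.
n_objects = 1_000_000_000   # one billion catalogued sources
n_params = 500              # "a few hundred" parameters per source
bytes_per_param = 2         # assumed 2-byte representation

total_bytes = n_objects * n_params * bytes_per_param
total_terabytes = total_bytes / 10**12

print(f"catalog size: {total_terabytes:.1f} TB")  # 1.0 TB
```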
2. Related Work
2.1. Analysis of Large Data Collections
There are several instances in the astronomy and space sciences research communities where data
mining is being applied to large data collections [15, 46]. Some dedicated data mining projects include
F-MASS [21], Class-X [14], the Auton Astrostatistics Project [3], and additional VO-related data
mining activities (such as SDMIV [52]). Essentially none of these projects involves truly
distributed data mining [44]. Through a past NASA-funded project, K. Borne applied some very basic DDM concepts
to astronomical data mining [10]. However, the primary accomplishments focused only on centralized
co-location of the data sources [9, 8].
One of the first large-scale attempts at grid data mining for astronomy is the U.S. National Science
Foundation (NSF) funded GRIST [26] project. The GRIST goals include application of grid computing
and web services (service-oriented architectures) to mining large distributed data collections. GRIST
is focused on one particular data modality: images. Hence, GRIST aims to deliver mining on the
pixel planes within multiple distributed astronomical image collections. The project that we are
proposing here is aimed at another data modality: catalogs (tables) of astronomical source attributes.
GRIST and other projects also strive for exact results, which usually requires data centralization
and co-location, which further requires significant computational and communications resources.
DEMAC (our system) will produce approximate results without requiring data centralization (low
communication overhead). Users can quickly get (generally quite accurate) results for their distributed
queries at low communication cost. Armed with these results, users can focus in on a specific query or
portion of the datasets, and down-load for more intricate analysis.
The U.S. National Virtual Observatory (NVO) [48] is a large scale effort funded by the NSF
to develop an information technology infrastructure enabling easy and robust access to distributed
astronomical archives. It will provide services for users to search and gather data across multiple
archives and some basic statistical analysis and visualization functions. It will also provide a framework
for new services to be made available by outside parties. These services can provide, among other
things, specialized data analysis capabilities. As such, we envision DEMAC to fit nicely into the NVO
as a new service.
The Virtual Observatory can be seen as part of an ongoing trend toward the integration of information
sources. The main paradigm used today for the integration of these data systems is that of a data grid
[19, 25, 34, 51, 20, 56, 45, 5]. Among the desired functionalities of a data grid, data analysis takes a
central place. As such, several projects [17, 41, 24, 26, 30, 32] have in the last few years
attempted to create a data mining grid. In addition, grid data mining has been the focus of several recent
workshops [40, 18].
2.2. Distributed Data Mining
DDM is a relatively new technology that has been enjoying considerable interest in the recent past
[49, 38]. DDM algorithms strive to analyze the data in a distributed manner without down-loading all of
the data to a single site (which is usually necessary for a regular centralized data mining system). DDM
algorithms naturally fall into two categories according to whether the data is distributed horizontally
(with each site having some of the tuples) or vertically (with each site having some of the attributes for
all tuples). In the latter case, it is assumed that each tuple has an associated unique id used for matching.
In other words, consider a tuple t and assume site A has part of this tuple, t_A, and site B has the remaining
part, t_B. Then, the id associated with t_A equals the id associated with t_B.
The NVO can be seen as a case of vertically distributed data, assuming ids have been generated by
a cross-matching service. With this assumption, DDM algorithms for vertically partitioned data can be
applied. These include algorithms for principal component analysis (PCA) [39, 37], clustering [35, 39],
Bayesian network learning [12, 13], and supervised classification [11, 22, 29, 50, 55].
Each id is unique to the site at which it resides; no two tuples at the same site have the same id. But ids can match across sites; a tuple
at one site can have the same id as a tuple at another site.
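As a concrete illustration of vertical partitioning, a minimal sketch (the attribute names and ids are hypothetical):

```python
# Two sites hold different attributes of the same logical tuples;
# a shared unique id links the fragments. Joining on the id
# reconstructs the full (virtual) tuple.
site_A = {  # id -> attributes held at site A
    101: {"ra": 10.68, "dec": 41.27},
    102: {"ra": 83.82, "dec": -5.39},
}
site_B = {  # id -> attributes held at site B
    101: {"j_mag": 9.45},
    102: {"j_mag": 7.21},
}

def join_fragments(a, b):
    """Reconstruct full tuples for ids present at both sites."""
    return {i: {**a[i], **b[i]} for i in a.keys() & b.keys()}

virtual = join_fragments(site_A, site_B)
print(virtual[101])  # {'ra': 10.68, 'dec': 41.27, 'j_mag': 9.45}
```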
3. Data Analysis Problem: Analyzing Distributed Virtual Catalogs
We illustrate the problem with two archives: the Sloan Digital Sky Survey (SDSS) [53] and the 2-
Micron All-Sky Survey (2MASS) [1]. Each of these has a simplified catalog containing records for
a large number of astronomical point sources, upward of 100 million for SDSS and 470 million for
2MASS. Each record contains sky coordinates (ra,dec) identifying the sources’ position in the celestial
sphere as well as many other attributes (460+ for SDSS; 420+ for 2MASS). While each of these
catalogs individually provides valuable data for scientific exploration, together their value increases
significantly. In particular, efficient analysis of the virtual catalog formed by joining these catalogs
would enhance their scientific value significantly. Henceforth, we use "virtual catalog" and "virtual
table", interchangeably.
To form the virtual catalog, records in each catalog must first be matched based on their position in
the celestial sphere. Consider record r from SDSS and s from 2MASS with sky coordinates (ra_r, dec_r)
and (ra_s, dec_s). Each record represents a set of observations about an astronomical object, e.g. a galaxy.
The sky coordinates are used to determine whether r and s match, i.e. are close enough that r and s represent
the same astronomical object. The issue of how matching is done will be discussed later. For each
match (r, s), the result is a record r∘s in the virtual catalog with all of the attributes of r and s. As
described earlier, the virtual catalog provides valuable data that neither SDSS nor 2MASS alone can
provide.
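A toy positional match might look like the following (the angular tolerance and catalog values are hypothetical; real cross-matching uses proper spherical indexing rather than this naive O(n·m) scan):

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (spherical law of cosines)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(d1) * math.sin(d2)
               + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))
    return math.degrees(math.acos(min(1.0, max(-1.0, cos_sep))))

def cross_match(cat1, cat2, tol_arcsec=2.0):
    """Naive scan: pair records closer than tol_arcsec on the sky."""
    tol_deg = tol_arcsec / 3600.0
    return [(r, s) for r in cat1 for s in cat2
            if angular_sep_deg(r["ra"], r["dec"], s["ra"], s["dec"]) <= tol_deg]

sdss = [{"id": "r1", "ra": 150.00010, "dec": 2.20001}]
tmass = [{"id": "s1", "ra": 150.00020, "dec": 2.20003},
         {"id": "s2", "ra": 151.5, "dec": 2.2}]
print(cross_match(sdss, tmass))  # matches r1 with s1 only; s2 is 1.5 deg away
```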
DEMAC addresses the data analysis problem of developing communication-efficient algorithms for
analyzing user-defined subsets of virtual catalogs. The algorithms allow the user to specify a region
R in the sky and a virtual catalog, then efficiently analyze the subset of tuples from that catalog with
sky coordinates in R. Importantly, the algorithms we propose do not require that the base catalogs first
be centralized and the virtual catalog explicitly realized. Moreover, the algorithms are not intended to
be a substitute for exact, centralization-based methods currently being developed as part of the NVO.
Rather, they are intended to complement these methods by providing quick, communication-efficient,
approximate results to allow browsing. Such browsing will allow the user to better focus their exact,
communication-expensive, queries.
Example 1. The all-sky data release of 2MASS contains the attribute "K band mean surface brightness"
(Kmsb). Data release four of SDSS contains the galaxy attributes "red shift" (rs), "Petrosian I band
angular effective radius" (Iaer), and "velocity dispersion" (vd). To produce a physical variable,
consider the composite attribute "Petrosian I band effective radius" (Ier), formed as the product of Iaer
and rs. Note that, since Iaer and rs are both at the same repository (SDSS), from the standpoint of
distributed computation we may assume Ier is contained in SDSS.
A principal component analysis over a region of sky R on the virtual table with columns log(Ier),
log(vd), and Kmsb is interesting in that it can allow the identification of a "fundamental plane"
(the logarithms are used to place all variables on the same scale). Indeed, if the first two principal
components capture most of the variance, then these two components define a fundamental plane. The
existence of such planes points to interesting astrophysical behaviors. We develop a communication-efficient
distributed algorithm for approximating the principal components of a virtual table.
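The fundamental-plane check described in Example 1 can be sketched on synthetic data (the three columns below are stand-ins for log(Ier), log(vd), and Kmsb; real values would come from the virtual table):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins for the three virtual-table columns. The third
# is nearly a linear combination of the other two, so the points lie
# close to a plane, as a fundamental plane would.
log_vd = rng.normal(size=n)
kmsb = rng.normal(size=n)
log_ier = 1.2 * log_vd + 0.3 * kmsb + 0.05 * rng.normal(size=n)

X = np.column_stack([log_ier, log_vd, kmsb])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize each column

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
frac_2 = eigvals[:2].sum() / eigvals.sum()
print(f"variance captured by first two PCs: {frac_2:.3f}")
# A value near 1 indicates the three attributes lie close to a plane.
```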
4. DEMAC - A System for Distributed Exploration of Massive Astronomical Catalogs
This section describes the high level design of the proposed DEMAC system. DEMAC is designed as
an additional web-service which seamlessly integrates into the NVO. It consists of two basic services.
The main one is a web-service providing DDM capabilities for vertically distributed sky surveys
(WS-DDM). The second one, which is intensively used by WS-DDM, is a web-service providing
cross-matching capabilities for vertically distributed sky surveys (WS-CM). Cross-matching of sky
surveys is a complex topic which is dealt with, in itself, under other NASA funded projects. Thus,
our implementation of this web-service would supply bare minimum capabilities which are required in
order to provide distributed data mining capabilities.
To provide a distributed data mining service, DEMAC would rely on other services of the NVO such
as the ability to select and down-load from a sky survey in an SQL-like fashion. Key to our approach
is that these services be used not over the web, through the NVO, but rather by local agents which are
co-located with the respective sky survey. In this way, the DDM service avoids bandwidth and storage
bottlenecks, and overcomes restrictions due to data ownership concerns. Agents, in turn,
take part in executing distributed data mining algorithms which are highly communication-efficient.
It is the outcome of the data mining algorithm, rather than the selected data table, that is
provided to the end-user. With the removal of the network bandwidth bottleneck, the main factor
limiting the scalability of the distributed data mining service would be database access. For database
access we intend to rely on the SQL-like interface provided by the different sky-surveys to the NVO.
We outline here the architecture we propose for the two web-services we will develop.
4.1. WS-DDM – DDM for Heterogeneously Distributed Sky-Surveys
This web-service will allow running a DDM algorithm (three will be discussed later) on a selection of
sky-surveys. The user would use existing NVO services to locate sky-surveys and define the portion
of the sky to be data mined. The user would then use WS-CM to select a cross-matching scheme for
those sky-surveys. This specifies how the tuples are matched across surveys to define the virtual table
to be analyzed. Following these two preliminary phases the user would submit the data mining task.
Execution of the data mining task would be scheduled according to resource availability. Specifically,
the size of the virtual table selected by the user would dictate scheduling. Having allocated the required
resources, the data mining algorithm would be carried out by agents which are co-located with the
selected sky-surveys. Those agents will access the sky-survey through the SQL-like interface it exposes
to the NVO and will communicate with each other directly, over the Internet. When the algorithm has
terminated, results would be provided to the user using a web-interface.
4.2. WS-CM – Cross-Matching for Heterogeneously Distributed Sky-Surveys
Central to the DDM algorithms we develop is that the virtual table can be treated as vertically
partitioned (see Section 2 for the definition). To achieve this, match indices are created and co-located
with each sky survey. Specifically, for each pair of surveys (tables) A and B, a distinct pair of match
indices must be kept, one at each survey. Each index is a list of pointers; both indices have the
same number of entries. The i-th entry in A's list points to a tuple a and the i-th entry in B's list
points to a tuple b such that a and b match. Tuples in A and B which do not have a match do not have a
corresponding entry in either index. Clearly, algorithms assuming a vertically partitioned virtual table
can be implemented on top of these indices.
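A minimal sketch of the paired match indices (the tuple contents and index values are hypothetical):

```python
# Each survey keeps a list of row positions; the i-th entries of the
# two lists point at tuples that were judged to match. Unmatched
# tuples appear in neither list.
survey_A_rows = ["a0", "a1", "a2", "a3"]     # tuples at survey A
survey_B_rows = ["b0", "b1", "b2"]           # tuples at survey B

index_at_A = [0, 2, 3]   # A's side of the match index
index_at_B = [1, 0, 2]   # B's side; same length as index_at_A

# Walking the two indices in lockstep yields a perfectly aligned
# (vertically partitioned) view of the virtual table.
aligned = [(survey_A_rows[i], survey_B_rows[j])
           for i, j in zip(index_at_A, index_at_B)]
print(aligned)  # [('a0', 'b1'), ('a2', 'b0'), ('a3', 'b2')]
```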
Creating these indices is not an easy job. Indeed, cross-matching sources is a complex problem for
which no single best solution exists. The WS-CM web-service is not intended to address this problem.
Instead it will use already existing solutions (e.g., the cross-matching service already provided by
the NVO), and it will be designed to allow other solutions to be plugged in easily. Moreover, cross-
matching the entirety of two large surveys is a very time-consuming job and would require centralizing
(at least) the (ra, dec) coordinates of all tuples from both.
Importantly, the indices do not need to be created each time a data mining task is run. Instead,
provided the sky survey data are static (they generally are), each pair of indices need only be created once.
Then any data mining task can use them. In particular the DDM tasks we develop can use them. The
net result is the ability to mine virtual tables at low communication cost.
4.3. DDM Algorithms: Definitions and Notation
In the next two sections we describe some DDM algorithms to be used as part of the WS-DDM web
service. All of these algorithms assume that the participating sites have the appropriate alignment
indices. Hence, for simplicity, we describe the algorithms under the assumption that the data in each
site is perfectly aligned – the i-th tuples of the sites match (and the sites have exactly the same number of tuples).
This assumption can be emulated without problems using the matching indices.

Let D denote an n × m matrix with real-valued entries. This matrix represents a dataset of n tuples
from R^m. Let D_j denote the j-th column and D_j[i] denote the i-th entry of this column. Let mean(D_j)
denote the sample mean of this column, i.e.

    mean(D_j) = (1/n) Σ_{i=1..n} D_j[i].    (1)

Let Var(D_j) denote the sample variance of the j-th column, i.e.

    Var(D_j) = (1/n) Σ_{i=1..n} (D_j[i] − mean(D_j))².    (2)
Let Cov(D_j, D_k) denote the sample covariance of the j-th and k-th columns, i.e.

    Cov(D_j, D_k) = (1/n) Σ_{i=1..n} (D_j[i] − mean(D_j))(D_k[i] − mean(D_k)).    (3)

Note that Cov(D_j, D_j) = Var(D_j). Finally, let Cov(D) denote the covariance matrix of D, i.e. the
m × m matrix whose (j, k)-th entry is Cov(D_j, D_k).

Assume this dataset has been vertically distributed over two sites S1 and S2. Since we are assuming
that the data at the sites is perfectly aligned, S1 has the first m1 attributes and S2 has the last m2
attributes (m1 + m2 = m). Let A denote the n × m1 matrix representing the dataset held by S1, and B
denote the n × m2 matrix representing the dataset held by S2. Let A∘B denote the column-wise concatenation
of the datasets, i.e. D = A∘B. The j-th column of A∘B is denoted (A∘B)_j.
In the next two sections, we describe communication-efficient algorithms for PCA and outlier
detection on D vertically distributed over two sites. Both algorithms easily extend to more than two
sites, but, for simplicity, we only discuss the two-site scenario. We have also developed a distributed
algorithm for decision tree induction (supervised classification); we will not discuss it further and refer
the reader to [22] for details. Later, we examine the effectiveness of the distributed PCA algorithm through
a case study on real astronomical data. We leave for future work the job of testing the decision tree and
outlier detection algorithms with case studies.
Following standard practice in applied statistics, we pre-process D by normalizing so that mean(D_j) = 0
and Var(D_j) = 1 for each column j. This is achieved by replacing each entry D_j[i] with
(D_j[i] − mean(D_j)) / sqrt(Var(D_j)). Since both mean(D_j) and Var(D_j) can be computed without any
communication, the normalization can be performed without any communication. Henceforth, we assume
mean(D_j) = 0 and Var(D_j) = 1.
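Since each column's mean and variance depend only on data at the column's own site, the normalization can be done entirely locally; a sketch (the data are synthetic):

```python
import numpy as np

def normalize_local(block):
    """Z-score each column of one site's data block. No communication
    is needed: a column's mean and variance depend only on the data
    stored at that column's own site."""
    return (block - block.mean(axis=0)) / block.std(axis=0)

rng = np.random.default_rng(1)
A = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # one site's columns
A_norm = normalize_local(A)
print(np.allclose(A_norm.mean(axis=0), 0.0),
      np.allclose(A_norm.std(axis=0), 1.0))  # True True
```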
Let λ_1 ≥ λ_2 ≥ … ≥ λ_m denote the eigenvalues of Cov(D) and v_1, v_2, …, v_m the
associated eigenvectors (we assume the eigenvectors are column vectors, i.e. m × 1 matrices, and are
pairwise orthonormal). The j-th principal direction of D is v_j. The j-th principal component is denoted P_j and
equals D v_j (the projection of D along the j-th principal direction).
5. Virtual Catalog Principal Component Analysis
PCA is a well-established data analysis technique used in a large number of disciplines: astronomy,
computer science, biology, chemistry, climatology, geology, etc. Quoting [36] page 1: "The central
idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated
variables, while retaining as much as possible of the variation present in the dataset." Next we provide
a very brief overview of PCA, for a more detailed treatment, the reader is referred to [36].
The j-th principal component, P_j, is, by definition, a linear combination of the columns of D – the
k-th column has coefficient v_j[k]. The sample variance of P_j equals λ_j. The principal components
are all uncorrelated, i.e. have zero pairwise sample covariances. Let P_{1..ℓ} (ℓ ≤ m) denote the
n × ℓ matrix with columns P_1, …, P_ℓ. This is the dataset projected onto the subspace defined by the
first ℓ principal directions. If ℓ = m, then P_{1..ℓ} is simply a different way of representing exactly the
same dataset, because D can be recovered completely as D = P_{1..m} V_{1..m}^T, where V_{1..ℓ} is the
m × ℓ matrix with columns v_1, …, v_ℓ and ^T denotes matrix transpose (V_{1..m} is a square matrix with
orthonormal columns, hence V_{1..m} V_{1..m}^T equals the m × m identity matrix). However, if ℓ < m, then
P_{1..ℓ} is a lossy lower-dimensional representation of D. The amount of loss is typically quantified as
    (Σ_{j=1..ℓ} λ_j) / (Σ_{j=1..m} λ_j),    (4)

the "proportion of variance" captured by the lower-dimensional representation. The larger the
proportion captured, the better P_{1..ℓ} represents the "information" contained in the original dataset D. If
ℓ is chosen so that a large amount of the variance is captured, then, intuitively, P_{1..ℓ} captures many of
the important features of D. So, subsequent analysis on P_{1..ℓ} can be quite fruitful at revealing structure
not easily found by examination of D directly. Our case study will employ this idea.
To our knowledge, the problem of vertically distributed PCA computation was first addressed by
Kargupta et al. [39] based on sampling and communication of dominant eigenvectors. Later, Kargupta
and Puttagunta [37] developed a technique based on random projections. Our method is a slightly
revised version of this work. We describe a distributed algorithm for approximating Cov(A∘B);
clearly, PCA can then be performed from Cov(A∘B) without any further communication.

Recall that A∘B is normalized so that each column has zero sample mean and unit sample variance. As a
result, Cov((A∘B)_j, (A∘B)_k) = (1/n) Σ_{i=1..n} (A∘B)_j[i] (A∘B)_k[i], which is proportional to the
inner product between (A∘B)_j and (A∘B)_k. Clearly this inner product can be computed without
communication when (A∘B)_j and (A∘B)_k are at the same site (i.e. j, k ≤ m1 or j, k > m1). It suffices
to show how the inner product can be approximated across different sites – in effect, how AᵀB can be
approximated. The key idea is based on the following fact, echoing the observation made in [47] that
high-dimensional random vectors are nearly orthogonal. A similar result was proved elsewhere [2].
Fact 1. Let�
be an � � � matrix each of whose entries is drawn independently from a distribution
with variance one and mean zero. It follows that � � � � � � ����� � where � � is the � � � identity matrix.
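Fact 1 is easy to check empirically. The sketch below is ours (not part of the paper): it draws $R$ with $\pm 1$ entries (mean zero, variance one), for which the diagonal of $RR^T/k$ is exactly one, while the off-diagonal entries concentrate near zero as $k$ grows:

```python
import random

def rr_t_over_k(n, k, seed=0):
    """Generate an n-by-k matrix R with i.i.d. entries in {-1, +1}
    (mean 0, variance 1) and return R R^T / k as a list of lists."""
    rng = random.Random(seed)
    R = [[rng.choice((-1.0, 1.0)) for _ in range(k)] for _ in range(n)]
    return [[sum(R[i][t] * R[j][t] for t in range(k)) / k
             for j in range(n)] for i in range(n)]

M = rr_t_over_k(n=4, k=2000)
diag = [M[i][i] for i in range(4)]   # exactly 1.0 for +/-1 entries
off = max(abs(M[i][j]) for i in range(4) for j in range(4) if i != j)
# off is small: the off-diagonal entries have standard deviation 1/sqrt(k)
```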
We will use Algorithm 5.0.1 for computing $\widehat{D^T D}$. The result is obtained at both sites (in the communication cost calculations, we assume a message requires 4 bytes of transmission).

Algorithm 5.0.1 Distributed Covariance Matrix Algorithm
1. $S_1$ sends $S_2$ a random number generator seed. [1 message]
2. $S_1$ and $S_2$ generate the same $n \times k$ random matrix $R$, where $k \le n$. Each entry is generated independently and identically from any distribution with mean zero and variance one.
3. $S_1$ sends $R^T D_1$ to $S_2$; $S_2$ sends $R^T D_2$ to $S_1$. [$k(m_1 + m_2)$ messages]
4. $S_1$ and $S_2$ compute $\widehat{D^T D} = (R^T D)^T (R^T D)/k = D^T R R^T D / k$.

Note that,

$$E\left[\frac{D^T R R^T D}{k}\right] = \frac{D^T\, E[R R^T]\, D}{k} \qquad (5)$$

which, by Fact 1, equals

$$\frac{D^T (k I_n) D}{k} = D^T D. \qquad (6)$$

Hence, in expectation, the algorithm is correct. However, its communication cost (bytes) divided by the cost of the centralization-based algorithm,

$$\frac{4\,[1 + k(m_1 + m_2)]}{4\, n\, m_1} \approx \frac{k(m_1 + m_2)}{n\, m_1}, \qquad (7)$$

is small if $k \ll n$ (here the centralized approach is assumed to ship the $n \times m_1$ data held at $S_1$, with $m_1 \le m_2$). Indeed, $k$ provides a "knob" for tuning the trade-off between communication efficiency and accuracy. Later, in our case study, we present experiments measuring this trade-off.
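Algorithm 5.0.1 can be simulated end to end in a few lines. The sketch below is ours, using made-up two-column toy data rather than catalog attributes: both "sites" derive the same $R$ from a shared seed, exchange $R^T D_1$ and $R^T D_2$, and assemble $(R^T D)^T (R^T D)/k$, which is then compared against the exact $D^T D$:

```python
import random

def rand_matrix(rows, cols, seed):
    """n-by-k matrix with i.i.d. standard normal entries (Fact 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a[t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for a in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

n, k = 400, 150
seed = 7                     # step 1: S1 ships this seed to S2 (1 message)
R = rand_matrix(n, k, seed)  # step 2: both sites build the same n-by-k R

# Toy vertically partitioned data: S1 holds one column, S2 holds a
# correlated column (hypothetical, standing in for catalog attributes).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(n)]
D1 = [[v] for v in x]                          # at site S1
D2 = [[v + 0.3 * rng.gauss(0, 1)] for v in x]  # at site S2

# step 3: each site sends R^T D_s to the other (k*(m1+m2) messages)
P1, P2 = matmul(transpose(R), D1), matmul(transpose(R), D2)

# step 4: both sites assemble R^T D and form (R^T D)^T (R^T D) / k
P = [row1 + row2 for row1, row2 in zip(P1, P2)]
approx = [[v / k for v in row] for row in matmul(transpose(P), P)]

D = [r1 + r2 for r1, r2 in zip(D1, D2)]
exact = matmul(transpose(D), D)
rel_err = max(abs(approx[i][j] - exact[i][j]) / abs(exact[i][j])
              for i in range(2) for j in range(2))
```

With $k = 150$ projections of $n = 400$ rows, the entrywise relative error is on the order of $1/\sqrt{k}$, illustrating the accuracy side of the trade-off that $k$ controls.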
6. Virtual Catalog Outlier Detection
Outlier detection has two important roles: data cleaning and data mining. Firstly, data cleaning is used
during data preparation stages. For example, if a cross-matching technique wrongly joined a pair of
tuples, then the matched tuple will likely have properties making it stand out from the rest of the data.
An outlier detection process might be able to identify the matched tuple and allow human operators to
check for an improper match. Secondly, outlier detection can be used to identify legitimately matched
tuples with unique properties. These may represent exceptional and interesting astronomical objects.
Again, a human operator is alerted to further examine the situation.
Because the definition of an outlier is inherently imprecise, many approaches have been developed for detecting outliers. We will not attempt a comprehensive citation listing; instead, the reader is referred to Hodge and Austin [31] for an excellent survey focusing mostly on outlier detection methodologies from machine learning, pattern recognition, and data mining. Barnett and Lewis [4] provide an excellent survey of outlier detection methodologies from statistics.
Note that some work has been done on parallel algorithms for outlier detection over horizontally distributed data (not in an astronomy context) [33]. However, to our knowledge, no work has been
done on outlier detection over vertically distributed data. Next we describe a PCA-based technique
from the Statistics literature for outlier detection (on centralized data). Following that, we describe a
communication-efficient algorithm implementing the technique on vertically distributed data.
6.1. PCA Based Methods for Outlier Detection
Most commonly the first principal components are used in data analysis. However, the last components
also carry valuable information. Some techniques for outlier detection have been developed based on
the last components [23, 28, 27, 36, 54]. These techniques look to identify data points which deviate
sharply from the “correlation structure” of the data. Before discussing the technical details, let us first
describe what is meant by a point deviating from the correlation structure by means of an example.
Example 1. Consider a dataset where each row consists of the height and weight of a person in some group (borrowed from [36], page 233). We expect that a strong positive correlation will exist: small (large) weights correspond to small (large) heights. A row with height 70 in. (178 cm) and weight 55 lbs (25 kg) may not be abnormal when height or weight is taken separately. Indeed, there may be many people
in the group with height around 70 or weight around 55. However, taken together, the row is very
strange because it violates the usual dependence between height and weight.
In Example 1, the outlying point will not stand out if the projection of the data on each variable
is viewed. Only when the variables are viewed together does the outlier appear. The same idea generalizes to higher-dimensional data: an outlier only stands out when all of the variables are taken into account and does not stand out over a projection onto any proper subset. If the data has more than three variables, then viewing all of them together is impossible. So, automated tests for detecting
“correlation outliers” are quite helpful. The “low variance” property of the last principal components
makes them useful for this job.
Recall the $j$th component $y_j$ is a linear combination of the columns of $D$ (the $i$th column has coefficient $v_j(i)$) with sample variance $\lambda_j$, i.e. the variance over the entries of $y_j$ is $\lambda_j$. Thus, if $\lambda_j$ is very small and there were no outlier data points, one would expect the entries of $y_j$ to be nearly constant. In this case, $y_j$ expresses a nearly linear relationship between the columns of $D$. A data point which deviates sharply from the correlation structure of the data will likely have its $y_j$ entry deviate sharply from the rest of the entries (assuming no other outliers). Since the last components
have the smallest $\lambda_j$, an outlier's entries in these components will likely stand out. This motivates examination of the following statistic for the $i$th data point (the $i$th row of $D$) and some $1 \le q \le m$:

$$T_i^{(q)} = \sum_{j=m-q+1}^{m} y_j(i)^2 \qquad (8)$$

where $y_j(i)$ denotes the $i$th entry of $y_j$, and $q$ is a user-defined parameter. A possible criticism of this approach is pointed out in [36], page 237: "it [$T_i^{(q)}$] still gives insufficient weight to the last few PCs, [...] Because the PCs have decreasing variance with increasing index, the values of $y_j(i)^2$ will typically become smaller as $j$ increases, and $T_i^{(q)}$ therefore implicitly gives the PCs decreasing weights as $j$ increases. This effect can be severe if some of the PCs have very small variances, and this is unsatisfactory as it is precisely the low-variance PCs which may be most effective ..."
To address this criticism, the components are normalized to be given equal weight. Let $\tilde{v}_j$ denote the normalized $j$th principal direction: the $m \times 1$ vector whose $i$th entry is $v_j(i)/\sqrt{\lambda_j}$. The normalized $j$th principal component is $\tilde{y}_j = D \tilde{v}_j$. The sample variance of $\tilde{y}_j$ equals one, so the weights of the normalized components are equal. The statistic we use for the $i$th data point is (following notation in [36])

$$\tilde{T}_i^{(q)} = \sum_{j=m-q+1}^{m} \tilde{y}_j(i)^2. \qquad (9)$$

Next, we discuss the development of a distributed algorithm for finding the top $t$ ranked data points in $D$ with respect to $\tilde{T}_i^{(q)}$. For this, we only consider the case involving the last component, $q = 1$ (developing the algorithm in the $q > 1$ case is quite a bit more complicated):

$$\tilde{T}_i^{(1)} = \tilde{y}_m(i)^2 = \frac{y_m(i)^2}{\lambda_m}. \qquad (10)$$
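For $q = 1$ and two attributes, the statistic of Equation (10) can be computed with a closed-form eigendecomposition: after normalization the covariance matrix is $[[1, c], [c, 1]]$, with eigenvalues $1 \pm c$. The sketch below is ours; the data are made up, echoing the height/weight scenario of Example 1, with one planted correlation outlier:

```python
import math, random

# Toy strongly correlated data (echoing Example 1) plus one planted
# correlation outlier: large first attribute, small second attribute.
rng = random.Random(1)
rows = [[v, v + 0.05 * rng.gauss(0, 1)] for v in
        (rng.gauss(0, 1) for _ in range(200))]
rows.append([2.0, -2.0])          # the outlier, index 200
n = len(rows)

# Normalize each column: zero mean, unit sample variance (Section 4.3).
for j in range(2):
    col = [r[j] for r in rows]
    mu = sum(col) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1))
    for r in rows:
        r[j] = (r[j] - mu) / sd

# 2x2 covariance matrix is [[1, c], [c, 1]] after normalization.
c = sum(r[0] * r[1] for r in rows) / (n - 1)
lam_last = 1.0 - abs(c)           # smallest eigenvalue of [[1,c],[c,1]]
v_last = (1.0 / math.sqrt(2), -math.copysign(1.0, c) / math.sqrt(2))

# Statistic (10): squared score on the last PC, scaled by 1/lambda_m.
scores = [(r[0] * v_last[0] + r[1] * v_last[1]) ** 2 / lam_last
          for r in rows]
top = max(range(n), key=lambda i: scores[i])   # index of the top outlier
```

The planted row scores far above the rest because it violates the nearly linear relationship expressed by the low-variance component, exactly the behavior the statistic is designed to detect.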
6.2. Vertically Distributed Top-$t$ Outlier Detection

First, the algorithm in Section 5 is applied to approximate $\mathrm{Cov}(D)$; at the end, each site has the approximation. The eigenvalues and eigenvectors of this matrix, $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_m$ and $\hat{v}_1, \ldots, \hat{v}_m$, are computed, along with the normalized eigenvectors $\hat{\tilde{v}}_1, \ldots, \hat{\tilde{v}}_m$ (no communication required). However, the approximation to the last component cannot be computed without communication. The approximation to the last normalized principal component is defined to be $\hat{\tilde{y}}_m = D \hat{\tilde{v}}_m$.

Let $\hat{\tilde{v}}_m(1)$ denote the $m_1 \times 1$ vector consisting of the first $m_1$ entries of $\hat{\tilde{v}}_m$. This is the part of $\hat{\tilde{v}}_m$ corresponding to the attributes at site $S_1$. Let $\hat{\tilde{v}}_m(2)$ denote the $m_2 \times 1$ vector consisting of the last $m_2$ entries of $\hat{\tilde{v}}_m$. This is the part of $\hat{\tilde{v}}_m$ corresponding to the attributes at site $S_2$. We have

$$\hat{\tilde{y}}_m = D \hat{\tilde{v}}_m = D_1 \hat{\tilde{v}}_m(1) + D_2 \hat{\tilde{v}}_m(2). \qquad (11)$$

Given $1 \le i \le n$, let $a_i$ denote the $i$th entry of $D_1 \hat{\tilde{v}}_m(1)$ and $b_i$ denote the $i$th entry of $D_2 \hat{\tilde{v}}_m(2)$. Remaining consistent with Equation (10), $\tilde{T}_i^{(1)} = \tilde{y}_m(i)^2$, we define the approximation to $\tilde{T}_i^{(1)}$ as $\hat{\tilde{T}}_i^{(1)} = (a_i + b_i)^2$. Our goal is to develop a distributed algorithm for computing the top $t$ among $\hat{\tilde{T}}_1^{(1)}, \ldots, \hat{\tilde{T}}_n^{(1)}$. Since $a_i$ and $b_i$ can be computed without communication, it suffices to solve the following problem.

Problem 1 (Distributed Sum-Square Top K (DSSTK)). Site $S_1$ has $a_1, \ldots, a_n$ and $S_2$ has $b_1, \ldots, b_n$. Both sites have an integer $1 \le t \le n$. The sites must compute the top $t$ among $(a_1 + b_1)^2, \ldots, (a_n + b_n)^2$.
A communication-efficient algorithm for solving this problem is described next. Unlike the algorithm for distributed covariance computation, this one is not approximate: its output is the correct top $t$. The algorithm assumes a total linear ordering on the $a_i$ and $b_i$. For simplicity, we assume all the $a_i$ and $b_i$ are distinct among themselves and ordered in the standard way. Ties can easily be broken according to the lower index; e.g. if $a_i = a_j$, then $a_i$ is ordered before $a_j$ if $i < j$.
6.3. An Algorithm for the DSSTK Problem
Below we only describe an algorithm for solving the problem when $t = 1$. At the end, both sites have the index $i^* = \mathrm{argmax}_i\{(a_i + b_i)^2\}$ and $(a_{i^*} + b_{i^*})^2$. Hence, the problem when $t > 1$ is solved by removing $a_{i^*}$ and $b_{i^*}$, then repeating, etc.

The algorithm for finding $\mathrm{argmax}_i\{(a_i + b_i)^2\}$ carries out two stages. First, $S_1$ and $S_2$ solve the following two distributed sub-problems: find $i_{\max} = \mathrm{argmax}_i\{a_i + b_i\}$ (Distributed Max Algorithm) and $i_{\min} = \mathrm{argmin}_i\{a_i + b_i\}$ (Distributed Min Algorithm). Second, each site returns whichever of $i_{\max}$ and $i_{\min}$ has the larger value of $(a_i + b_i)^2$. Clearly this equals $i^*$, so all that remains is to explicate the Distributed Max and Min algorithms.
These algorithms are very similar. Both proceed in rounds, during each of which $S_1$ and $S_2$ exchange a small amount of communication (24 bytes). At the end of each round, the sites evaluate a condition based on their local data and all the data communicated during past rounds. If the condition fails, then a new round of communication is triggered. Otherwise, the algorithms terminate, at which point both sites will have the desired indices. The Distributed Max algorithm is described in Algorithm 6.3.1; the Min algorithm is completely analogous and obtained by replacing all occurrences of "argmax" with "argmin".
Next we prove that the Distributed Max Algorithm is correct, i.e. the algorithm terminates and, upon termination, both sites return the correct index. In fact, we prove a more general result.
Algorithm 6.3.1 Distributed Max Algorithm
1. $S_1$ selects $i_1 = \mathrm{argmax}_i\{a_i\}$ and sends $(i_1, a_{i_1})$ to $S_2$. In parallel, $S_2$ selects $i_2 = \mathrm{argmax}_i\{b_i\}$ and sends $(i_2, b_{i_2})$ to $S_1$. [16 bytes]
2. $S_2$ receives the message from $S_1$ and replies with $b_{i_1}$. In parallel, $S_1$ receives the message from $S_2$ and replies with $a_{i_2}$. [8 bytes]
3. Both sites now have data values $a_{i_1}$, $b_{i_1}$, $a_{i_2}$, $b_{i_2}$ and indices $i_1$, $i_2$. If $i_1 = i_2$, then both sites terminate.
4. Otherwise, $S_1$ replaces $a_{i_1}$ and $a_{i_2}$ with $(a_{i_1} + b_{i_1})/2$ and $(a_{i_2} + b_{i_2})/2$. Site $S_2$ replaces $b_{i_1}$ and $b_{i_2}$ in exactly the same way. Goto step 1.
Namely, both sites return the index $i_{\max} = \mathrm{argmax}_i\{f(a_i + b_i)\}$, where $f$ is any strictly increasing function. The corresponding correctness result for the Distributed Min Algorithm, that both sites return $i_{\min} = \mathrm{argmin}_i\{a_i + b_i\}$ (and, more generally, $\mathrm{argmax}_i\{g(a_i + b_i)\}$ for any strictly decreasing function $g$), is proven in an analogous way (omitted).
Correctness of the algorithm falls out of the following technical result. The result shows that, for each $i$, $a_i$ and $b_i$ change values at most once during the course of the algorithm. Moreover, when they change, they change to $(a_i + b_i)/2$. Let $a_i^{(s)}$, $b_i^{(s)}$ denote the values of $a_i$ and $b_i$ held at the end of round $s \ge 1$.

Lemma 1. Suppose the algorithm has just completed round $r \ge 1$. For each $1 \le i \le n$, let $r(i)$ denote the round at which index $i$ was first selected by either site (if $i$ was not selected during the first $r$ rounds, then let $r(i) = r + 1$). It is the case that
1. for all $s < r(i)$, $a_i^{(s)} = a_i$ and $b_i^{(s)} = b_i$;
2. for all $r(i) \le s \le r$, $a_i^{(s)} = b_i^{(s)} = (a_i + b_i)/2$.
The proof can be found in the appendix. Suppose the algorithm did not terminate after $r$ rounds. Let $(i_1^{(1)}, i_2^{(1)}), \ldots, (i_1^{(r)}, i_2^{(r)})$ denote the index pairs selected during the $r$ rounds. It follows from Lemma 1 that a pair $(i_1^{(s)}, i_2^{(s)})$ will not be selected if $i_1^{(s)}$ was selected at a previous round $s_1 < s$ and $i_2^{(s)}$ was selected at a previous round $s_2 < s$ (possibly different from $s_1$); otherwise $i_1^{(s)} = i_2^{(s)}$ and, hence, the algorithm would have terminated at round $s$. Thus every selected pair contains at least one index never selected before, and since there are only $n$ indices, at most $n$ such pairs can be selected. Taking $r$ larger than this bound yields a contradiction; hence the algorithm terminates.
Let $r^*$ denote the termination round and let $i^*$ denote the index agreed upon by both sites upon termination. To complete the correctness proof, we need to show that $i^*$ is the correct index, $i_{\max} = \mathrm{argmax}_i\{f(a_i + b_i)\}$. Consider any index $i \ne i^*$. From Lemma 1, one of the following holds:
1. $a_i^{(r^*)} = a_i$ and $b_i^{(r^*)} = b_i$;
2. $a_i^{(r^*)} = b_i^{(r^*)} = (a_i + b_i)/2$.
In either case it follows that $a_i^{(r^*)} + b_i^{(r^*)} = a_i + b_i$ (and likewise for $i^*$). By definition of the algorithm, $a_{i^*}^{(r^*)} \ge a_i^{(r^*)}$ and $b_{i^*}^{(r^*)} \ge b_i^{(r^*)}$. Thus, $a_{i^*} + b_{i^*} = a_{i^*}^{(r^*)} + b_{i^*}^{(r^*)} \ge a_i^{(r^*)} + b_i^{(r^*)} = a_i + b_i$. Therefore, since $f$ is strictly increasing, $f(a_{i^*} + b_{i^*}) \ge f(a_i + b_i)$. We conclude that $i^* = \mathrm{argmax}_i\{f(a_i + b_i)\} = i_{\max}$, as desired.
Comment: Each round of the Distributed Max algorithm requires 24 bytes of communication. Thus, the total communication is $24r$, where $r$ is the number of rounds until termination. This is quite a small communication requirement unless $r$ is large. However, this communication efficiency is won at a price: the algorithm requires $r$ rounds of synchronization. Even for relatively modest $r$, the cost of these synchronizations could be unacceptably high. A simple way to address this problem is to have each site, at each round, select its top $c > 1$ tuples instead of just its top tuple. The communication cost per round increases to $24c$ bytes, but the total number of rounds executed decreases correspondingly.
7. Case Study: Finding Galactic Fundamental Planes
The identification of certain correlations among parameters has lead to important discoveries in
astronomy. For example, the class of elliptical and spiral galaxies (including dwarfs) have been
found to occupy a two dimensional space inside a 3-D space of observed parameters, radius, mean
surface brightness and velocity dispersion. This 2D plane has been referred to as the Fundamental
Plane([16, 42]).
This section presents a case study involving the detection of a fundamental plane among galaxy parameters distributed across two catalogs: 2MASS and SDSS. We consider several different combinations of parameters and explore for a fundamental plane in each (one such combination was described earlier). Our goal is to demonstrate that, using our distributed covariance matrix algorithm to approximate the principal components, we can find a fundamental plane very similar to that obtained by applying a centralized PCA. Note that our goal is not to make a new discovery in astronomy. Rather, we argue that our distributed algorithm could have found results very similar to the centralized approach at a fraction of the communication cost. Therefore, DEMAC could provide a valuable tool for astronomers wishing to explore many parameter spaces across different catalogs for fundamental planes.
In our study we measure the accuracy of our distributed algorithm in terms of the similarity between its results and those of a centralized approach. We examine accuracy at various amounts of communication allowed to the distributed algorithm in order to assess the trade-off described at the end of Section 5. For each amount of communication allowed, we ran the distributed algorithm 100 times, each with a different random matrix, and report the average result (except where otherwise noted).
For the purposes of our study, a real distributed environment is not necessary. Thus, for simplicity, we
used a single machine and simulated a distributed environment. We carried out two sets of experiments.
The first involves three parameters already examined in the astronomy literature [16, 42], for which a fundamental plane has been observed. The second involves several previously unexplored combinations of parameters.
7.1. Experiment Set One
We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS,
http://irsa.ipac.caltech.edu/applications/Gator/ and
http://cas.sdss.org/astro/en/tools/crossid/upload.asp,
and the SDSS object cross id tool, we obtained an aggregate dataset involving attributes from 2MASS
and SDSS and lying in the sky region between right ascension (ra) 150 and 200 and declination (dec)
0 and 15. The aggregated dataset had the following attributes from SDSS: Petrosian I band angular effective radius (Iaer), redshift (rs), and velocity dispersion (vd);* and had the following attribute from 2MASS: K band mean surface brightness (Kmsb).† After removing tuples with missing attributes, we had a 1307-tuple dataset with four attributes. We produced a new attribute, logarithm Petrosian I band effective radius (log(Ier)), as log(Iaer*rs), and a new attribute, logarithm velocity dispersion (log(vd)), by applying the logarithm to vd. We dropped all attributes except these to obtain the three-attribute dataset: log(Ier), log(vd), Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 4.3).

*The Galaxy-view Petrosian radius in the i band, z (SpecObj view), and velDisp (SpecObj view) in SDSS DR4.
†The K band mean surface brightness column in the extended source catalog of the All Sky Data Release, http://www.ipac.caltech.edu/2mass/releases/allsky/index.html
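The normalization step can be sketched compactly. The helper below is ours, following the Section 4.3 convention (subtract the column mean, divide by the sample standard deviation with divisor $n - 1$); the three-row input is made up:

```python
import math

def zscore_columns(rows):
    """Normalize each column of a row-major table to zero mean and
    unit sample variance (divisor n - 1), as described in Section 4.3."""
    n, m = len(rows), len(rows[0])
    out = [row[:] for row in rows]
    for j in range(m):
        col = [r[j] for r in rows]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1))
        for r in out:
            r[j] = (r[j] - mu) / sd
    return out

norm = zscore_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# each column of norm is [-1.0, 0.0, 1.0]: zero mean, unit sample variance
```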
We applied PCA directly to this dataset to obtain the centralization-based results. Then we treated the dataset as if it were distributed (assuming cross-match indices have been created as described earlier). The data can be thought of as a virtual table with attributes log(Ier) and log(vd) located at one site and attribute Kmsb at another. Finally, we applied our distributed covariance matrix algorithm and computed the principal components from the resulting matrix. Note that our dataset is somewhat small and not necessarily indicative of a scenario where DEMAC would be used in practice. However, for the purposes of our study (accuracy with respect to communication), it suffices.
Figure 2 shows the percentage of variance captured as a function of communication percentage (i.e. at 15%, the distributed algorithm uses 15% of the bytes required to centralize the data). Error bars indicate standard deviation; recall the percentage-of-variance-captured numbers are averages over 100 trials. First observe that the percentage captured by the centralized approach, 90.5%, replicates the known result that a fundamental plane exists among these parameters. Indeed, the dataset fits fairly nicely on the plane formed by the first two PCs. Also observe that the percentage of variance captured by the distributed algorithm (including one standard deviation) using as little as 10% communication never strays more than 5 percent from 90.5%. This is a reasonably accurate result, indicating that the distributed algorithm identifies the existence of a plane using 90% less communication. As such, this provides evidence that the distributed algorithm would serve as a good browser, allowing the user to get decent approximate results at a sharply reduced communication cost. If this piques the user's interest, she can go through the trouble of centralizing the data and carrying out an exact analysis.
[Figure: percentage of variance captured by the first and second PCs (y-axis, 80 to 105) vs. communication percentage (x-axis, 0 to 100); series: Distributed Variance with standard-deviation error bars, Centralized Variance.]
Figure 2. Communication percentage vs. percent of variance captured, (log(Ier), log(vd), Kmsb) dataset.
Interestingly, the average percentage captured by the distributed algorithm appears to approach the
true percentage captured, 90.5%, very slowly (as the communication percentage approaches infinity,
the average percentage captured must approach 90.5%). At present we do not have an explanation
for the slow approach. However, as the communication increases, the standard deviation decreases
substantially (as expected).
To analyze the accuracy of the actual principal components computed by the distributed algorithm,
we consider the data projected onto each pair of PCs. The projection onto the true first and second PC
ought to appear with much scatter in both directions (this is the view of the data perpendicular to the
plane). The projections onto the (first, third) and (second, third) PCs ought to appear more "flattened", as these are views of the data perpendicular to the edge of the plane. Figures 3 and 4 display the results.
The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs).
Here we see the fundamental plane. The right column depicts the projections onto the PCs computed
by our distributed algorithm at 15% communication (for one random matrix and not the average over
100 trials). We see a similar pattern, indicating that the PCs computed by the distributed algorithm are
quite accurate in the sense that they produce very similar projections as those produced by the true
PCs.
In closing, it is important to stress that we are not claiming that the actual projections can be
computed in a communication-efficient fashion (they can’t). Rather, that our the PCs computed in a
distributed fashion are accurate as measured by the projection similarity with the true PCs.
7.2. Experiment Set Two
We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS and the SDSS object cross-id tool, we obtained an aggregate dataset involving attributes from 2MASS and SDSS, lying in
the sky region between right ascension (ra) 150 and 200 and declination (dec) 0 and 15. The aggregated
dataset had the following attributes from SDSS: Petrosian Flux in the U band (petroU), Petrosian Flux
in the G band (petroG), Petrosian Flux in the R band (petroR), Petrosian Flux in the I band (petroI), and Petrosian Flux in the Z band (petroZ);* and had the following attributes from 2MASS: K band mean surface brightness (Kmsb) and K concentration index (KconInd).† We had a 29638-tuple dataset with seven attributes. We produced new attributes petroU-I = petroU - petroI, petroU-R = petroU - petroR, petroU-G = petroU - petroG, petroG-R = petroG - petroR, petroG-I = petroG - petroI, and petroR-I = petroR - petroI. We also produced a new attribute, logarithm K concentration index (log(KconInd)), by applying the logarithm to KconInd. In each experiment, we dropped all attributes except the following to obtain a dataset: petroU-I, petroU-R, petroU-G, petroG-R, petroG-I, petroR-I, log(KconInd), and Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 4.3).

*The Petrosian flux columns in the u, g, r, i, and z bands (PhotoObjAll view) in SDSS DR4.
†The K band mean surface brightness and K concentration index columns in the extended source catalog of the All Sky Data Release.

[Figure: four scatter plots of projections at 15% communication, axes −6 to 6: (a) PC1 vs. PC2, centralized analysis; (b) PC1 vs. PC2, distributed algorithm; (c) PC1 vs. PC3, centralized analysis; (d) PC1 vs. PC3, distributed algorithm.]
Figure 3. Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.

[Figure: two scatter plots of projections at 15% communication, axes −6 to 6: (a) PC2 vs. PC3, centralized analysis; (b) PC2 vs. PC3, distributed algorithm.]
Figure 4. Projections, PC2 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.
We considered the following six combinations of three attributes: (petroU-I,log(KconInd),Kmsb),
(petroU-R,log(KconInd),Kmsb), (petroU-G,log(KconInd),Kmsb), (petroG-R,log(KconInd),Kmsb),
(petroG-I,log(KconInd),Kmsb), and (petroR-I,log(KconInd),Kmsb). In each case we carried out a distributed PCA experiment just as in experiment set one.
Table I depicts the results at 15% communication. The "Band" column indicates the combination of attributes used, e.g. U-I indicates (petroU-I,log(KconInd),Kmsb). The "Centralized Variance" column contains the sum of the variances (times 100) of the first two principal components found by the centralized algorithm; in effect, it contains the percent of variance captured by the first two PCs. The "Distributed Variance" column contains the sum of the variances of the first two PCs found by the distributed algorithm (averaged over 100 trials). The "STD Distributed Variance" column contains the standard deviation over the 100 trials.

Band   Centralized Variance   Distributed Variance   STD Distributed Variance
U-I    72.6186                67.8766                0.717
U-R    72.5375                68.0817                0.8191
U-G    72.6392                69.0102                0.9781
G-R    72.3501                68.5664                0.9879
G-I    72.4842                68.6199                0.9842
R-I    72.7451                68.0521                0.8034

Table I. The centralized and distributed variances captured by the first and second PCs (15% communication).
We see that in all cases the centralized experiments yield a relatively weak fundamental plane (72%
of the variance is captured by the first two PCs) relative to the fundamental plane from experiment set
one (90.5% captured). The distributed algorithm in all cases does a decent job at replicating this result
– 67.8% to 69.0% of the variance captured (all with standard deviation less than 0.99). These results
come at an 85% communication savings over centralizing the data and performing an exact calculation.
To further illuminate the trade-off between communication savings and accuracy, Figure 5 shows
the percentage of variance captured on the petroR-I dataset as a function of communication percentage
(error bars indicate standard deviation – recall the percentage of variance captured numbers are
averages over 100 trials). Interestingly, the average percentage captured by the distributed algorithm
appears to move away from the true percentage (72%) for communication up to 40%, then move very
slightly toward 72%. This is surprising since the distributed algorithm error must approach zero as
its communication percentage approaches infinity. The results appear to indicate that this approach
is quite slow. At the present we don’t have an explanation for this phenomenon. However, as the
communication increases, the standard deviation decreases (as expected). Moreover, despite the slow
approach, the percentage of variance captured by the distributed algorithm (including one standard
deviation) never strays more than 6 percent from the true percent captured – a reasonably accurate
figure.
As in experiment set one, we analyze the accuracy of the actual principal components computed
by the distributed algorithm by considering their data projections. We do so for the (petroR-I,log(KconInd),Kmsb) dataset. Figures 6 and 7 display the results. The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs). Note that 10 projected data points (out of 29638) are omitted from the centralized PC1 vs. PC2 and PC2 vs. PC3 plots due to scaling: the y-coordinate for these points (x in the case of PC2 vs. PC3) does not lie in the range [-20,20] (in both cases a few points have coordinate nearly 60 or -60). Unlike experiment set one, we do not see a very pronounced plane (as expected).
The right column depicts the projections onto the PCs computed by our distributed algorithm at
15% communication (for one random matrix and not the average over 100 trials). The distributed
projections appear to indicate a fundamental plane somewhat more strongly than the centralized
ones. This indicates some inaccuracy in the distributed PCs. However, since the distributed variances
[Figure: communication percentage (x-axis, 0-100) vs. percentage of variance captured by the first and second PCs (y-axis, 66-74); curves for distributed variance and centralized variance.]
Figure 5. Communication percentage vs. percent of variance captured, (petroR-I,log(KconInd),Kmsb) dataset.
correctly did not indicate a strong fundamental plane, the inaccuracies in the PCs would likely not play
an important role for users browsing for fundamental planes.
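The projections in Figures 6 and 7 are simply the data re-expressed in the PC basis. A minimal numpy sketch of how such scatter plots are produced (with hypothetical data in place of the catalog):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * [4.0, 2.0, 1.0]   # hypothetical catalog rows
X = X - X.mean(axis=0)                            # center before PCA

# Rows of Vt are unit-length principal components, largest variance first.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                                 # coordinates in the PC basis

pc1_vs_pc2 = scores[:, [0, 1]]   # the point cloud plotted as "PC1 vs. PC2"
pc2_vs_pc3 = scores[:, [1, 2]]   # likewise "PC2 vs. PC3"
```

For the distributed version, the only change is that the rows of Vt come from the approximate covariance matrix rather than the exact one.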
8. Conclusions
We proposed a system, DEMAC, for the distributed exploration of massive astronomical catalogs.
DEMAC is to be built on top of the existing U.S. national virtual observatory environment and will
provide tools for data mining (as web services) without requiring datasets to be downloaded to a
centralized server. Instead, users will only download the output of the data mining process (a data mining
[Figure: four scatter panels, axes in [-20,20]: (a) PC1 vs. PC2, PCs from centralized analysis; (b) PC1 vs. PC2, PCs from distributed algorithm; (c) PC1 vs. PC3, PCs from centralized analysis; (d) PC1 vs. PC3, PCs from distributed algorithm.]
Figure 6. Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (petroR-I,log(KconInd),Kmsb) dataset.
model); the actual data mining from multiple data servers will be performed using communication-
efficient DDM algorithms. The distributed algorithms we have developed sacrifice perfect accuracy
[Figure: two scatter panels, axes in [-20,20]: (a) PC2 vs. PC3, PCs from centralized analysis; (b) PC2 vs. PC3, PCs from distributed algorithm.]
Figure 7. Projections, PC2 vs. PC3; communication percentage 15%, (petroR-I,log(KconInd),Kmsb) dataset.
for communication savings: they offer approximate results at a considerably lower communication
cost than that of exact results obtained through centralization. As such, we see DEMAC as serving the
role of an exploratory "browser". Users can quickly get (generally quite accurate) results for their
distributed queries at low communication cost. Armed with these results, users can focus on a specific
query or portion of the datasets, and download that portion for more intricate analysis.
To illustrate the potential effectiveness of our system, we developed communication-efficient
distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we carried
out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.
We observed that our distributed algorithm replicated fairly closely the fundamental-plane results
obtained through centralized analysis, but at significantly reduced communication cost.
In closing, we envision our system increasing the ease with which large, geographically distributed
astronomy catalogs can be explored, by providing quick, low-communication solutions. This benefit
will allow astronomers to better tap the riches of distributed virtual tables formed from joined and
integrated sky survey catalogs.
Acknowledgments
H. Dutta, C. Giannella, H. Kargupta, R. Wolff: Support for this work was provided by the U.S. National
Science Foundation (NSF) through grants IIS-0329143, IIS-0093353, IIS-0203958 and by the U.S.
National Aeronautics and Space Administration (NASA) through grant NAS2-37143.
K. Borne: Support for this work was provided in part by the NSF through Cooperative Agreement
AST0122449 to the Johns Hopkins University.
Appendix – Proof of Lemma 1
To ease reading, we restate the lemma. Suppose the Distributed Max Algorithm has just completed iteration
$t \geq 1$. For each $i \in I$, let $r(i)$ denote the round at which index $i$ was first selected by either site
(if $i$ was not selected during the $t$ rounds, then let $r(i) = t + 1$). It is the case that
1. for all $s < r(i)$, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and $\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$;
2. for all $r(i) \leq s \leq t$, $v(i) \leq \hat{v}^{(s)}(i) \leq v(i) + \epsilon$.
Proof: From the algorithm definition it can be seen that, for all $r \leq t$ and all $i \in I$, if $i$
was selected by either site during round $r$, then $v(i) \leq \hat{v}^{(r)}(i) = \hat{v}_1^{(r)}(i) + \hat{v}_2^{(r)}(i) \leq v(i) + \epsilon$. Otherwise, $\hat{v}_1^{(r)}(i) = \hat{v}_1^{(r-1)}(i)$ and
$\hat{v}_2^{(r)}(i) = \hat{v}_2^{(r-1)}(i)$. The remainder of the proof proceeds by induction on $t$.
Base case $t = 1$: If $r(i) = 2$ (i.e. $i$ was not selected by any site), then $\hat{v}_1^{(1)}(i) = \hat{v}_1^{(0)}(i)$ and
$\hat{v}_2^{(1)}(i) = \hat{v}_2^{(0)}(i)$, so the desired result holds. If $r(i) = 1$, then $v(i) \leq \hat{v}^{(1)}(i) \leq v(i) + \epsilon$.
Hence the desired result holds.
Induction case $t > 1$: The first case to consider: $r(i) = t + 1$ (i.e. $i$ was not selected by any site
during the $t$ rounds of the algorithm). Thus, $\hat{v}_1^{(t)}(i) = \hat{v}_1^{(t-1)}(i)$ and $\hat{v}_2^{(t)}(i) = \hat{v}_2^{(t-1)}(i)$. Moreover, since $i$ was
not selected during the first $t - 1$ rounds, then, by induction, for all $s \leq t - 1$, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and
$\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$. Therefore, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and $\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$, for all $s \leq t$, as desired.
The second case to consider: $r(i) = t$ (i.e. $i$ was selected for the first time by some site during round
$t$). It follows that $v(i) \leq \hat{v}^{(t)}(i) = \hat{v}_1^{(t)}(i) + \hat{v}_2^{(t)}(i) \leq v(i) + \epsilon$. Since $i$ was not selected during the first $t - 1$ rounds,
then, by induction, for all $s \leq t - 1$, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and $\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$. Therefore,
the desired result holds.
The final case to consider: $r(i) < t$ (i.e. $i$ was selected for the first time before round $t$). By induction,
the desired result holds for all $s \leq t - 1$; in particular, $v(i) \leq \hat{v}^{(t-1)}(i) \leq v(i) + \epsilon$. It suffices
to show that $v(i) \leq \hat{v}^{(t)}(i) \leq v(i) + \epsilon$.
If $i$ was selected again during round $t$, then $v(i) \leq \hat{v}^{(t)}(i) = \hat{v}_1^{(t)}(i) + \hat{v}_2^{(t)}(i) \leq v(i) + \epsilon$.
If $i$ was not selected again during round $t$, then $\hat{v}^{(t)}(i) = \hat{v}^{(t-1)}(i)$, and hence $v(i) \leq \hat{v}^{(t)}(i) \leq v(i) + \epsilon$. $\Box$