Distributed Data Mining for Astronomy Catalogs
Chris Giannella, Haimonti Dutta, Kirk Borne, Ran Wolff, Hillol Kargupta

Department of Computer Science and Electrical Engineering
University of Maryland, Baltimore County
Baltimore, Maryland 21250 USA
{cgiannel,hdutta1,hillol}@cs.umbc.edu

School of Computational Sciences
George Mason University
4400 University Drive
Fairfax, Virginia 22030

Computer Science Department
Israel Institute of Technology
Haifa, Israel 32000
DDM FOR ASTRONOMY CATALOGS 1
SUMMARY
The design, implementation, and archiving of very large sky surveys is playing an increasingly important
role in today's astronomy research. However, these data archives will necessarily be geographically
distributed. To fully exploit the potential of this data lode, we believe that users ought to be provided
with a more communication-efficient alternative to multiple-archive data analysis than first
downloading the archives in full to a centralized site.
In this paper, we propose a system, DEMAC, for the distributed mining of massive astronomical catalogs.
The system is designed to sit on top of the existing national virtual observatory environment and provide
tools for distributed data mining (as web services) without requiring datasets to be fully down-loaded to
a centralized server. To illustrate the potential effectiveness of our system, we develop communication-
efficient distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we
carry out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.
In particular, PCA enables dimensionality reduction within a set of correlated physical parameters, such
as a reduction of a 3-dimensional data distribution (in astronomer’s observed units) to a planar data
distribution (in fundamental physical units). Fundamental physical insights are thereby enabled through
efficient access to distributed multi-dimensional data sets.
A much abbreviated version of this paper has been submitted to the Mining Scientific and Engineering Datasets workshop
affiliated with the 2006 SIAM International Conference on Data Mining.
Corresponding author.
Also affiliated with Agnik LLC, Columbia, Maryland, USA.
Contract/grant sponsor: National Science Foundation; contract/grant numbers: IIS-0329143, IIS-0093353, IIS-0203958, AST-0122449
Copyright © 2006 John Wiley & Sons, Ltd. Concurrency Computat.: Pract. Exper. 2006; 0:0–0
Prepared using cpeauth.cls
2 C. GIANNELLA ET AL.
KEY WORDS: distributed data mining, astronomy catalogs, cross matching, principal component analysis, outlier
detection
1. Introduction
The design, implementation, and archiving of very large sky surveys is playing an increasingly
important role in today’s astronomy research. Many projects today (e.g. GALEX All-Sky Survey), and
many more projects in the near future (e.g. WISE All-Sky Survey) are destined to produce enormous
catalogs (tables) of astronomical sources (tuples). These catalogs will necessarily be geographically
distributed. It is this virtual collection of gigabyte, terabyte, and (eventually) petabyte catalogs that
will significantly increase science return and enable remarkable new scientific discoveries through the
integration and cross-correlation of data across these multiple survey dimensions [43]. Astronomers
will be unable to fully tap the riches of this data without a new paradigm for astro-informatics that
involves distributed database queries and data mining across distributed virtual tables of joined and
integrated sky survey catalogs [6, 7].
The development and deployment of a National Virtual Observatory (NVO) [48] is a step toward a
solution to this problem. However, processing, mining, and analyzing these distributed and vast data
collections are fundamentally challenging tasks since most off-the-shelf data mining systems require
the data to be down-loaded to a single location before further analysis. This imposes serious scalability
constraints on the data mining system and fundamentally hinders the scientific discovery process.
Figure 1 further illustrates this technical problem. The left part depicts the current data flow in the
NVO. Through web services, data are selected and down-loaded from multiple sky-surveys.
[Figure 1 appears here as two data-flow diagrams. (a) Current data flow is restricted because of data ownership and bandwidth considerations: users select and down-load tables from each sky survey through its NVO interface, with cross-match web services in implementation. (b) Distributed data mining algorithms can process large amounts of data using a small amount of communication: distributed data mining components co-located with each sky survey communicate among themselves, and the users get the data mining output (a model) rather than raw data.]

Figure 1. Proposed data flow for distributed data mining embedded in the NVO.
If distributed data repositories are to be truly accessible to a larger community, then technology
ought to be developed to support distributed data analysis that reduces, as much as possible,
communication requirements among the data servers and the client machines. Communication-efficient
distributed data mining (DDM) techniques will allow a large number of users simultaneously to
perform advanced data analysis without necessarily down-loading large volumes of data to their
respective client machines.
In this paper, we propose a system, DEMAC, for the distributed exploration of massive astronomical
catalogs. DEMAC will offer a collection of data mining tools based on various DDM algorithms. The
system would be built on top of the existing NVO environment and provide tools for data mining
(as web services) without requiring datasets to be down-loaded to a centralized server. Consider again
Figure 1. Our system requires a relatively simple modification – the addition of distributed data mining
functionality in the sky servers. This allows DDM to be carried out without having to down-load large
tables to the users’ desktop or some other remote machine. Instead, the users will only down-load
the output of the data mining process (a data mining model); the actual data mining from multiple data
servers will be performed using communication-efficient DDM algorithms. The algorithms we develop
sacrifice perfect accuracy for communication savings. They offer approximate results at a considerably
lower communication cost than that of exact results through centralization. As such, we see DEMAC
as serving the role of an exploratory "browser": users can quickly get (generally quite accurate) results
for their distributed queries at low communication cost. Armed with these results, users can focus in
on a specific query or portion of the datasets, and down-load that data for more intricate analysis.
To illustrate the potential effectiveness of our system, we develop communication-efficient
distributed algorithms for principal component analysis (PCA) and outlier detection. Then we carry
out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.
Astronomers have previously discovered cases where the observed parameters measured for a
particular class of astronomical objects (such as elliptical galaxies) are strongly correlated, as a
result of universal astrophysical processes (such as gravity). PCA will find such correlations in the
form of principal components. An example of this is the reduction of a 3-dimensional scatter plot
of elliptical galaxy parameters to a planar data distribution. The explanation of this plane follows
from fundamental astrophysical processes within galaxies, and thus the resulting data distribution is
labeled the Fundamental Plane of Elliptical Galaxies. The important physical insights that astronomers
have derived from this fundamental plane suggest that similar new physical insights and scientific
discoveries may come from new analysis of combinations of other astronomical parameters. Since
these multiple parameters are now necessarily distributed across geographically dispersed data
archives, it is scientifically valuable to explore distributed PCA on larger astronomical data collections
and for greater numbers of astrophysical parameters. The application of communication-efficient
distributed PCA and other DDM algorithms will likely enable the discovery of new fundamental planes,
and thus produce new scientific insights into our Universe.
1.1. Astronomical Research Enabled
As an example of the potential astronomical research that our system aims to enable, consider the large
survey databases being produced (now and in the near future) by various U.S. National Aeronautics
and Space Administration (NASA) missions. GALEX is producing all-sky surveys at a variety of depths in the
near-UV and far-UV. The Spitzer Space Telescope is conducting numerous large-area surveys in the
infrared, including fields (e.g., GOODS, Hubble Deep Fields) that are well studied by HST (optical),
Chandra (X-ray), and numerous other observatories. The WISE mission (to be launched circa 2008)
will produce an all-sky infrared survey. The 2-Micron All-Sky Survey (2MASS) is cataloging millions
of stars and galaxies in the near-infrared. Each of these wave-bands contributes valuable astrophysical
knowledge to the study of countless classes of objects in the astrophysical zoo. In many cases, such
as the young star-forming regions within star-bursting galaxies, the relevant astrophysical objects and
phenomena have unique characteristics within each wavelength domain. For example, star-bursting
galaxies are often dust-enshrouded, yielding enormous infrared fluxes. Such galaxies reveal peculiar
optical morphologies, occasional X-ray sources (such as intermediate black holes), and possibly even
some UV bright spots as the short-wavelength radiation leaks through holes in the obscuring clouds. All
of these data, from multiple missions in multiple wave-bands, are essential for a full characterization,
classification, analysis, and interpretation of these cosmologically significant populations. In order
to reap the full potential of multi-wavelength analysis and discovery that NASA’s suite of missions
enables, it is essential to bring together data from multiple heterogeneously distributed data centers
(such as MAST, HEASARC, and IRSA). For the all-sky surveys in particular, it is impossible to access,
mine, navigate, browse, and analyze these data in their current distributed state. To illustrate this point,
suppose that an all-sky catalog contains descriptive data for one billion objects; and suppose that these
descriptive data consist of a few hundred parameters (which is typical for the 2MASS and Sloan Digital
Sky Surveys). Then, assuming that each parameter requires just a 2-byte representation, each
survey database will consume one terabyte of space. If the survey also has a temporal dimension,
then even more space is required. It is clearly infeasible and impractical to drag these
terabyte catalogs back and forth from user to user, from data center to data center, from analysis
package to package, each time someone has a new query to pose against NASA’s multi-wavelength data
collections. Therefore, we see an important need for data mining and analysis tools that are inherently
designed to work on distributed data collections.
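The back-of-envelope estimate above can be checked directly; a minimal sketch (the object count, parameter count, and bytes-per-parameter figures are the ones quoted in the text):

```python
# Back-of-envelope size of a single all-sky survey catalog,
# using the figures quoted in the text.
n_objects = 1_000_000_000   # one billion catalogued sources
n_params = 500              # "a few hundred" parameters per source
bytes_per_param = 2         # assumed 2-byte representation

total_bytes = n_objects * n_params * bytes_per_param
total_terabytes = total_bytes / 10**12

print(f"catalog size: {total_terabytes:.1f} TB")  # 1.0 TB
```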
2. Related Work
2.1. Analysis of Large Data Collections
There are several instances in the astronomy and space sciences research communities where data
mining is being applied to large data collections [15, 46]. Some dedicated data mining projects include
F-MASS [21], Class-X [14], the Auton Astrostatistics Project [3], and additional VO-related data
mining activities (such as SDMIV [52]). Essentially none of these projects involves truly
distributed data mining [44]. Through a past NASA-funded project, K. Borne applied some very basic DDM concepts
to astronomical data mining [10]. However, the primary accomplishments focused only on centralized
co-location of the data sources [9, 8].
One of the first large-scale attempts at grid data mining for astronomy is the U.S. National Science
Foundation (NSF) funded GRIST [26] project. The GRIST goals include application of grid computing
and web services (service-oriented architectures) to mining large distributed data collections. GRIST
is focused on one particular data modality: images. Hence, GRIST aims to deliver mining on the
pixel planes within multiple distributed astronomical image collections. The project that we are
proposing here is aimed at another data modality: catalogs (tables) of astronomical source attributes.
GRIST and other projects also strive for exact results, which usually requires data centralization
and co-location, which further requires significant computational and communications resources.
DEMAC (our system) will produce approximate results without requiring data centralization (low
communication overhead). Users can quickly get (generally quite accurate) results for their distributed
queries at low communication cost. Armed with these results, users can focus in on a specific query or
portion of the datasets, and down-load for more intricate analysis.
The U.S. National Virtual Observatory (NVO) [48] is a large scale effort funded by the NSF
to develop an information technology infrastructure enabling easy and robust access to distributed
astronomical archives. It will provide services for users to search and gather data across multiple
archives and some basic statistical analysis and visualization functions. It will also provide a framework
for new services to be made available by outside parties. These services can provide, among other
things, specialized data analysis capabilities. As such, we envision DEMAC to fit nicely into the NVO
as a new service.
The Virtual Observatory can be seen as part of an ongoing trend toward the integration of information
sources. The main paradigm used today for the integration of these data systems is that of a data grid
[19, 25, 34, 51, 20, 56, 45, 5]. Among the desired functionalities of a data grid, data analysis takes a
central place. As such, several projects [17, 41, 24, 26, 30, 32] have in the last few years
attempted to create a data mining grid. In addition, grid data mining has been the focus of several recent
workshops [40, 18].
2.2. Distributed Data Mining
DDM is a relatively new technology that has been enjoying considerable interest in the recent past
[49, 38]. DDM algorithms strive to analyze the data in a distributed manner without down-loading all of
the data to a single site (which is usually necessary for a regular centralized data mining system). DDM
algorithms naturally fall into two categories according to whether the data is distributed horizontally
(with each site having some of the tuples) or vertically (with each site having some of the attributes for
all tuples). In the latter case, it is assumed that each tuple has an associated unique id used for matching.
In other words, consider a tuple t and assume site A has part of this tuple, t_A, and site B has the remaining
part, t_B. Then, the id associated with t_A equals the id associated with t_B.
The NVO can be seen as a case of vertically distributed data, assuming ids have been generated by
a cross-matching service. With this assumption, DDM algorithms for vertically partitioned data can be
applied. These include algorithms for principal component analysis (PCA) [39, 37], clustering [35, 39],
Bayesian network learning [12, 13], and supervised classification [11, 22, 29, 50, 55].
Each id is unique to the site at which it resides; no two tuples at the same site have the same id. But ids can match across sites; a tuple
at one site can have the same id as a tuple at another site.
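As a concrete illustration of vertical partitioning, a minimal sketch (the attribute names and ids are hypothetical):

```python
# Two sites hold different attributes of the same logical tuples;
# a shared unique id links the fragments. Joining on the id
# reconstructs the full (virtual) tuple.
site_A = {  # id -> attributes held at site A
    101: {"ra": 10.68, "dec": 41.27},
    102: {"ra": 83.82, "dec": -5.39},
}
site_B = {  # id -> attributes held at site B
    101: {"j_mag": 9.45},
    102: {"j_mag": 7.21},
}

def join_fragments(a, b):
    """Reconstruct full tuples for ids present at both sites."""
    return {i: {**a[i], **b[i]} for i in a.keys() & b.keys()}

virtual = join_fragments(site_A, site_B)
print(virtual[101])  # {'ra': 10.68, 'dec': 41.27, 'j_mag': 9.45}
```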
3. Data Analysis Problem: Analyzing Distributed Virtual Catalogs
We illustrate the problem with two archives: the Sloan Digital Sky Survey (SDSS) [53] and the 2-
Micron All-Sky Survey (2MASS) [1]. Each of these has a simplified catalog containing records for
a large number of astronomical point sources, upward of 100 million for SDSS and 470 million for
2MASS. Each record contains sky coordinates (ra,dec) identifying the sources’ position in the celestial
sphere as well as many other attributes (460+ for SDSS; 420+ for 2MASS). While each of these
catalogs individually provides valuable data for scientific exploration, together their value increases
significantly. In particular, efficient analysis of the virtual catalog formed by joining these catalogs
would enhance their scientific value significantly. Henceforth, we use "virtual catalog" and "virtual
table", interchangeably.
To form the virtual catalog, records in each catalog must first be matched based on their position in
the celestial sphere. Consider record r from SDSS and s from 2MASS with sky coordinates (ra_r, dec_r)
and (ra_s, dec_s). Each record represents a set of observations about an astronomical object, e.g. a galaxy.
The sky coordinates are used to determine whether r and s match, i.e. are close enough that r and s represent
the same astronomical object. The issue of how matching is done will be discussed later. For each
match (r, s), the result is a record r∘s in the virtual catalog with all of the attributes of r and s. As
described earlier, the virtual catalog provides valuable data that neither SDSS nor 2MASS alone can
provide.
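A toy positional match might look like the following (the angular tolerance and catalog values are hypothetical; real cross-matching uses proper spherical indexing rather than this naive O(n·m) scan):

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (spherical law of cosines)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(d1) * math.sin(d2)
               + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))
    return math.degrees(math.acos(min(1.0, max(-1.0, cos_sep))))

def cross_match(cat1, cat2, tol_arcsec=2.0):
    """Naive scan: pair records closer than tol_arcsec on the sky."""
    tol_deg = tol_arcsec / 3600.0
    return [(r, s) for r in cat1 for s in cat2
            if angular_sep_deg(r["ra"], r["dec"], s["ra"], s["dec"]) <= tol_deg]

sdss = [{"id": "r1", "ra": 150.00010, "dec": 2.20001}]
tmass = [{"id": "s1", "ra": 150.00020, "dec": 2.20003},
         {"id": "s2", "ra": 151.5, "dec": 2.2}]
print(cross_match(sdss, tmass))  # matches r1 with s1 only; s2 is 1.5 deg away
```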
DEMAC addresses the data analysis problem of developing communication-efficient algorithms for
analyzing user-defined subsets of virtual catalogs. The algorithms allow the user to specify a region
R in the sky and a virtual catalog, then efficiently analyze the subset of tuples from that catalog with
sky coordinates in R. Importantly, the algorithms we propose do not require that the base catalogs first
be centralized and the virtual catalog explicitly realized. Moreover, the algorithms are not intended to
be a substitute for exact, centralization-based methods currently being developed as part of the NVO.
Rather, they are intended to complement these methods by providing quick, communication-efficient,
approximate results to allow browsing. Such browsing will allow the user to better focus their exact,
communication-expensive, queries.
Example 1. The all-sky data release of 2MASS contains the attribute "K band mean surface brightness"
(Kmsb). Data release four of SDSS contains the galaxy attributes "red shift" (rs), "Petrosian I band
angular effective radius" (Iaer), and "velocity dispersion" (vd). To produce a physical variable,
consider the composite attribute "Petrosian I band effective radius" (Ier), formed as the product of Iaer
and rs. Note that, since Iaer and rs are both at the same repository (SDSS), from the standpoint of
distributed computation we may assume Ier is contained in SDSS.
A principal component analysis over a region of sky R on the virtual table with columns log(Ier),
log(vd), and Kmsb is interesting in that it can allow the identification of a "fundamental plane"
(the logarithms are used to place all variables on the same scale). Indeed, if the first two principal
components capture most of the variance, then these two components define a fundamental plane. The
existence of such planes points to interesting astrophysical behaviors. We develop a communication-efficient
distributed algorithm for approximating the principal components of a virtual table.
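The fundamental-plane check described in Example 1 can be sketched on synthetic data (the three columns below are stand-ins for log(Ier), log(vd), and Kmsb; real values would come from the virtual table):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins for the three virtual-table columns. The third
# is nearly a linear combination of the other two, so the points lie
# close to a plane, as a fundamental plane would.
log_vd = rng.normal(size=n)
kmsb = rng.normal(size=n)
log_ier = 1.2 * log_vd + 0.3 * kmsb + 0.05 * rng.normal(size=n)

X = np.column_stack([log_ier, log_vd, kmsb])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # normalize each column

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending
frac_2 = eigvals[:2].sum() / eigvals.sum()
print(f"variance captured by first two PCs: {frac_2:.3f}")
# A value near 1 indicates the three attributes lie close to a plane.
```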
4. DEMAC - A System for Distributed Exploration of Massive Astronomical Catalogs
This section describes the high level design of the proposed DEMAC system. DEMAC is designed as
an additional web-service which seamlessly integrates into the NVO. It consists of two basic services.
The main one is a web-service providing DDM capabilities for vertically distributed sky surveys
(WS-DDM). The second one, which is intensively used by WS-DDM, is a web-service providing
cross-matching capabilities for vertically distributed sky surveys (WS-CM). Cross-matching of sky
surveys is a complex topic which is dealt with, in itself, under other NASA funded projects. Thus,
our implementation of this web-service would supply bare minimum capabilities which are required in
order to provide distributed data mining capabilities.
To provide a distributed data mining service, DEMAC would rely on other services of the NVO such
as the ability to select and down-load from a sky survey in an SQL-like fashion. Key to our approach
is that these services be used not over the web, through the NVO, but rather by local agents which are
co-located with the respective sky survey. In this way, the DDM service avoids bandwidth and storage
bottlenecks, and overcomes restrictions due to data ownership concerns. Agents, in turn,
take part in executing distributed data mining algorithms which are highly communication-efficient.
It is the outcome of the data mining algorithm, rather than the selected data table, that is
provided to the end-user. With the removal of the network bandwidth bottleneck, the main factor
limiting the scalability of the distributed data mining service would be database access. For database
access we intend to rely on the SQL-like interface provided by the different sky-surveys to the NVO.
We outline here the architecture we propose for the two web-services we will develop.
4.1. WS-DDM – DDM for Heterogeneously Distributed Sky-Surveys
This web-service will allow running a DDM algorithm (three will be discussed later) on a selection of
sky-surveys. The user would use existing NVO services to locate sky-surveys and define the portion
of the sky to be data mined. The user would then use WS-CM to select a cross-matching scheme for
those sky-surveys. This specifies how the tuples are matched across surveys to define the virtual table
to be analyzed. Following these two preliminary phases the user would submit the data mining task.
Execution of the data mining task would be scheduled according to resource availability. Specifically,
the size of the virtual table selected by the user would dictate scheduling. Having allocated the required
resources, the data mining algorithm would be carried out by agents which are co-located with the
selected sky-surveys. Those agents will access the sky-survey through the SQL-like interface it exposes
to the NVO and will communicate with each other directly, over the Internet. When the algorithm has
terminated, results would be provided to the user using a web-interface.
4.2. WS-CM – Cross-Matching for Heterogeneously Distributed Sky-Surveys
Central to the DDM algorithms we develop is that the virtual table can be treated as vertically
partitioned (see Section 2 for the definition). To achieve this, match indices are created and co-located
with each sky survey. Specifically, for each pair of surveys (tables) A and B, a distinct pair of match
indices must be kept, one at each survey. Each index is a list of pointers; both indices have the
same number of entries. The i-th entry in A's list points to a tuple a and the i-th entry in B's list
points to a tuple b such that a and b match. Tuples in A and B which do not have a match do not have a
corresponding entry in either index. Clearly, algorithms assuming a vertically partitioned virtual table
can be implemented on top of these indices.
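A minimal sketch of the paired match indices (the tuple contents and index values are hypothetical):

```python
# Each survey keeps a list of row positions; the i-th entries of the
# two lists point at tuples that were judged to match. Unmatched
# tuples appear in neither list.
survey_A_rows = ["a0", "a1", "a2", "a3"]     # tuples at survey A
survey_B_rows = ["b0", "b1", "b2"]           # tuples at survey B

index_at_A = [0, 2, 3]   # A's side of the match index
index_at_B = [1, 0, 2]   # B's side; same length as index_at_A

# Walking the two indices in lockstep yields a perfectly aligned
# (vertically partitioned) view of the virtual table.
aligned = [(survey_A_rows[i], survey_B_rows[j])
           for i, j in zip(index_at_A, index_at_B)]
print(aligned)  # [('a0', 'b1'), ('a2', 'b0'), ('a3', 'b2')]
```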
Creating these indices is not an easy job. Indeed, cross-matching sources is a complex problem for
which no single best solution exists. The WS-CM web-service is not intended to address this problem.
Instead it will use already existing solutions (e.g., the cross-matching service already provided by
the NVO), and it will be designed to allow other solutions to be plugged in easily. Moreover, cross-
matching the entirety of two large surveys is a very time-consuming job and would require centralizing
(at least) the (ra, dec) coordinates of all tuples from both.
Importantly, the indices do not need to be created each time a data mining task is run. Instead,
provided the sky survey data are static (they generally are), each pair of indices need only be created once.
Then any data mining task can use them. In particular the DDM tasks we develop can use them. The
net result is the ability to mine virtual tables at low communication cost.
4.3. DDM Algorithms: Definitions and Notation
In the next two sections we describe some DDM algorithms to be used as part of the WS-DDM web
service. All of these algorithms assume that the participating sites have the appropriate alignment
indices. Hence, for simplicity, we describe the algorithms under the assumption that the data in each
site is perfectly aligned – the i-th tuples of the sites match (and the sites have exactly the same number of tuples).
This assumption can be emulated without problems using the matching indices.

Let D denote an n × m matrix with real-valued entries. This matrix represents a dataset of n tuples
from R^m. Let D_j denote the j-th column and D_j[i] denote the i-th entry of this column. Let mean(D_j)
denote the sample mean of this column, i.e.

    mean(D_j) = (1/n) Σ_{i=1..n} D_j[i].    (1)

Let Var(D_j) denote the sample variance of the j-th column, i.e.

    Var(D_j) = (1/n) Σ_{i=1..n} (D_j[i] − mean(D_j))².    (2)
Let Cov(D_j, D_k) denote the sample covariance of the j-th and k-th columns, i.e.

    Cov(D_j, D_k) = (1/n) Σ_{i=1..n} (D_j[i] − mean(D_j))(D_k[i] − mean(D_k)).    (3)

Note that Cov(D_j, D_j) = Var(D_j). Finally, let Cov(D) denote the covariance matrix of D, i.e. the
m × m matrix whose (j, k)-th entry is Cov(D_j, D_k).

Assume this dataset has been vertically distributed over two sites S1 and S2. Since we are assuming
that the data at the sites is perfectly aligned, S1 has the first m1 attributes and S2 has the last m2
attributes (m1 + m2 = m). Let A denote the n × m1 matrix representing the dataset held by S1, and B
denote the n × m2 matrix representing the dataset held by S2. Let A∘B denote the column-wise concatenation
of the datasets, i.e. D = A∘B. The j-th column of A∘B is denoted (A∘B)_j.
In the next two sections, we describe communication-efficient algorithms for PCA and outlier
detection on D vertically distributed over two sites. Both algorithms easily extend to more than two
sites, but, for simplicity, we only discuss the two-site scenario. We have also developed a distributed
algorithm for decision tree induction (supervised classification); we will not discuss it further and refer
the reader to [22] for details. Later, we examine the effectiveness of the distributed PCA algorithm through
a case study on real astronomical data. We leave for future work the job of testing the decision tree and
outlier detection algorithms with case studies.
Following standard practice in applied statistics, we pre-process D by normalizing so that mean(D_j) = 0
and Var(D_j) = 1 for each column j. This is achieved by replacing each entry D_j[i] with
(D_j[i] − mean(D_j)) / sqrt(Var(D_j)). Since both mean(D_j) and Var(D_j) can be computed without any
communication, the normalization can be performed without any communication. Henceforth, we assume
mean(D_j) = 0 and Var(D_j) = 1.
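Since each column's mean and variance depend only on data at the column's own site, the normalization can be done entirely locally; a sketch (the data are synthetic):

```python
import numpy as np

def normalize_local(block):
    """Z-score each column of one site's data block. No communication
    is needed: a column's mean and variance depend only on the data
    stored at that column's own site."""
    return (block - block.mean(axis=0)) / block.std(axis=0)

rng = np.random.default_rng(1)
A = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))  # one site's columns
A_norm = normalize_local(A)
print(np.allclose(A_norm.mean(axis=0), 0.0),
      np.allclose(A_norm.std(axis=0), 1.0))  # True True
```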
Let λ_1 ≥ λ_2 ≥ … ≥ λ_m denote the eigenvalues of Cov(D) and v_1, v_2, …, v_m the
associated eigenvectors (we assume the eigenvectors are column vectors, i.e. m × 1 matrices, and are
pairwise orthonormal). The j-th principal direction of D is v_j. The j-th principal component is denoted P_j and
equals D v_j (the projection of D along the j-th principal direction).
5. Virtual Catalog Principal Component Analysis
PCA is a well-established data analysis technique used in a large number of disciplines: astronomy,
computer science, biology, chemistry, climatology, geology, etc. Quoting [36] page 1: "The central
idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated
variables, while retaining as much as possible of the variation present in the dataset." Next we provide
a very brief overview of PCA, for a more detailed treatment, the reader is referred to [36].
The j-th principal component, P_j, is, by definition, a linear combination of the columns of D – the
k-th column has coefficient v_j[k]. The sample variance of P_j equals λ_j. The principal components
are all uncorrelated, i.e. have zero pairwise sample covariances. Let P_{1..ℓ} (ℓ ≤ m) denote the
n × ℓ matrix with columns P_1, …, P_ℓ. This is the dataset projected onto the subspace defined by the
first ℓ principal directions. If ℓ = m, then P_{1..ℓ} is simply a different way of representing exactly the
same dataset, because D can be recovered completely as D = P_{1..m} V_{1..m}^T, where V_{1..ℓ} is the
m × ℓ matrix with columns v_1, …, v_ℓ and ^T denotes matrix transpose (V_{1..m} is a square matrix with
orthonormal columns, hence V_{1..m} V_{1..m}^T equals the m × m identity matrix). However, if ℓ < m, then
P_{1..ℓ} is a lossy lower-dimensional representation of D. The amount of loss is typically quantified as
    (Σ_{j=1..ℓ} λ_j) / (Σ_{j=1..m} λ_j),    (4)

the "proportion of variance" captured by the lower-dimensional representation. The larger the
proportion captured, the better P_{1..ℓ} represents the "information" contained in the original dataset D. If
ℓ is chosen so that a large amount of the variance is captured, then, intuitively, P_{1..ℓ} captures many of
the important features of D. So, subsequent analysis on P_{1..ℓ} can be quite fruitful at revealing structure
not easily found by examination of D directly. Our case study will employ this idea.
To our knowledge, the problem of vertically distributed PCA computation was first addressed by
Kargupta et al. [39] based on sampling and communication of dominant eigenvectors. Later, Kargupta
and Puttagunta [37] developed a technique based on random projections. Our method is a slightly
revised version of this work. We describe a distributed algorithm for approximating Cov(A∘B);
clearly, PCA can then be performed from Cov(A∘B) without any further communication.

Recall that A∘B is normalized so that each column has zero sample mean and unit sample variance. As a
result, Cov((A∘B)_j, (A∘B)_k) = (1/n) Σ_{i=1..n} (A∘B)_j[i] (A∘B)_k[i], which is proportional to the
inner product between (A∘B)_j and (A∘B)_k. Clearly this inner product can be computed without
communication when (A∘B)_j and (A∘B)_k are at the same site (i.e. j, k ≤ m1 or j, k > m1). It suffices
to show how the inner product can be approximated across different sites – in effect, how AᵀB can be
approximated. The key idea is based on the following fact, echoing the observation made in [47] that
high-dimensional random vectors are nearly orthogonal. A similar result was proved elsewhere [2].
Fact 1. Let�
be an � � � matrix each of whose entries is drawn independently from a distribution
with variance one and mean zero. It follows that � � � � � � ����� � where � � is the � � � identity matrix.
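Fact 1 is easy to check empirically. The sketch below is ours (not part of the paper): it draws $R$ with $\pm 1$ entries (mean zero, variance one), for which the diagonal of $RR^T/k$ is exactly one, while the off-diagonal entries concentrate near zero as $k$ grows:

```python
import random

def rr_t_over_k(n, k, seed=0):
    """Generate an n-by-k matrix R with i.i.d. entries in {-1, +1}
    (mean 0, variance 1) and return R R^T / k as a list of lists."""
    rng = random.Random(seed)
    R = [[rng.choice((-1.0, 1.0)) for _ in range(k)] for _ in range(n)]
    return [[sum(R[i][t] * R[j][t] for t in range(k)) / k
             for j in range(n)] for i in range(n)]

M = rr_t_over_k(n=4, k=2000)
diag = [M[i][i] for i in range(4)]   # exactly 1.0 for +/-1 entries
off = max(abs(M[i][j]) for i in range(4) for j in range(4) if i != j)
# off is small: the off-diagonal entries have standard deviation 1/sqrt(k)
```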
We will use Algorithm 5.0.1 for computing $\widehat{D^T D}$. The result is obtained at both sites (in the communication cost calculations, we assume a message requires 4 bytes of transmission).

Algorithm 5.0.1 Distributed Covariance Matrix Algorithm
1. $S_1$ sends $S_2$ a random number generator seed. [1 message]
2. $S_1$ and $S_2$ generate the same $n \times k$ random matrix $R$, where $k \le n$. Each entry is generated independently and identically from any distribution with mean zero and variance one.
3. $S_1$ sends $R^T D_1$ to $S_2$; $S_2$ sends $R^T D_2$ to $S_1$. [$k(m_1 + m_2)$ messages]
4. $S_1$ and $S_2$ compute $\widehat{D^T D} = (R^T D)^T (R^T D)/k = D^T R R^T D / k$.

Note that,

$$E\left[\frac{D^T R R^T D}{k}\right] = \frac{D^T\, E[R R^T]\, D}{k} \qquad (5)$$

which, by Fact 1, equals

$$\frac{D^T (k I_n) D}{k} = D^T D. \qquad (6)$$

Hence, in expectation, the algorithm is correct. However, its communication cost (bytes) divided by the cost of the centralization-based algorithm,

$$\frac{4\,[1 + k(m_1 + m_2)]}{4\, n\, m_1} \approx \frac{k(m_1 + m_2)}{n\, m_1}, \qquad (7)$$

is small if $k \ll n$ (here the centralized approach is assumed to ship the $n \times m_1$ data held at $S_1$, with $m_1 \le m_2$). Indeed, $k$ provides a "knob" for tuning the trade-off between communication efficiency and accuracy. Later, in our case study, we present experiments measuring this trade-off.
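Algorithm 5.0.1 can be simulated end to end in a few lines. The sketch below is ours, using made-up two-column toy data rather than catalog attributes: both "sites" derive the same $R$ from a shared seed, exchange $R^T D_1$ and $R^T D_2$, and assemble $(R^T D)^T (R^T D)/k$, which is then compared against the exact $D^T D$:

```python
import random

def rand_matrix(rows, cols, seed):
    """n-by-k matrix with i.i.d. standard normal entries (Fact 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(A, B):
    return [[sum(a[t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))] for a in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

n, k = 400, 150
seed = 7                     # step 1: S1 ships this seed to S2 (1 message)
R = rand_matrix(n, k, seed)  # step 2: both sites build the same n-by-k R

# Toy vertically partitioned data: S1 holds one column, S2 holds a
# correlated column (hypothetical, standing in for catalog attributes).
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(n)]
D1 = [[v] for v in x]                          # at site S1
D2 = [[v + 0.3 * rng.gauss(0, 1)] for v in x]  # at site S2

# step 3: each site sends R^T D_s to the other (k*(m1+m2) messages)
P1, P2 = matmul(transpose(R), D1), matmul(transpose(R), D2)

# step 4: both sites assemble R^T D and form (R^T D)^T (R^T D) / k
P = [row1 + row2 for row1, row2 in zip(P1, P2)]
approx = [[v / k for v in row] for row in matmul(transpose(P), P)]

D = [r1 + r2 for r1, r2 in zip(D1, D2)]
exact = matmul(transpose(D), D)
rel_err = max(abs(approx[i][j] - exact[i][j]) / abs(exact[i][j])
              for i in range(2) for j in range(2))
```

With $k = 150$ projections of $n = 400$ rows, the entrywise relative error is on the order of $1/\sqrt{k}$, illustrating the accuracy side of the trade-off that $k$ controls.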
6. Virtual Catalog Outlier Detection
Outlier detection has two important roles: data cleaning and data mining. Firstly, data cleaning is used
during data preparation stages. For example, if a cross-matching technique wrongly joined a pair of
tuples, then the matched tuple will likely have properties making it stand out from the rest of the data.
An outlier detection process might be able to identify the matched tuple and allow human operators to
check for an improper match. Secondly, outlier detection can be used to identify legitimately matched
tuples with unique properties. These may represent exceptional and interesting astronomical objects.
Again, a human operator is alerted to further examine the situation.
Because the definition of an outlier is inherently imprecise, many approaches have been developed for detecting outliers. We will not attempt a comprehensive citation listing; instead, the reader is referred to Hodge and Austin [31] for an excellent survey focusing mostly on outlier detection methodologies from machine learning, pattern recognition, and data mining. Barnett and Lewis [4] provide an excellent survey of outlier detection methodologies from statistics.
Note that some work has been done on parallel algorithms for outlier detection over horizontally distributed data (not in an astronomy context) [33]. However, to our knowledge, no work has been
done on outlier detection over vertically distributed data. Next we describe a PCA-based technique
from the Statistics literature for outlier detection (on centralized data). Following that, we describe a
communication-efficient algorithm implementing the technique on vertically distributed data.
6.1. PCA Based Methods for Outlier Detection
Most commonly the first principal components are used in data analysis. However, the last components
also carry valuable information. Some techniques for outlier detection have been developed based on
the last components [23, 28, 27, 36, 54]. These techniques look to identify data points which deviate
sharply from the “correlation structure” of the data. Before discussing the technical details, let us first
describe what is meant by a point deviating from the correlation structure by means of an example.
Example 1. Consider a dataset where each row consists of the height and weight of a person in some group (borrowed from [36], page 233). We expect that a strong positive correlation will exist: small (large) weights correspond to small (large) heights. A row with height 70 in. (178 cm) and weight 55 lbs (25 kg) may not be abnormal when height or weight is taken separately. Indeed, there may be many people
in the group with height around 70 or weight around 55. However, taken together, the row is very
strange because it violates the usual dependence between height and weight.
In Example 1, the outlying point will not stand out if the projection of the data on each variable
is viewed. Only when the variables are viewed together does the outlier appear. The same idea generalizes to higher-dimensional data: an outlier only stands out when all of the variables are taken into account and does not stand out over a projection onto any proper subset. If the data has more than three variables, then viewing all of them together is impossible. So, automated tests for detecting
“correlation outliers” are quite helpful. The “low variance” property of the last principal components
makes them useful for this job.
Recall the $j$th component $y_j$ is a linear combination of the columns of $D$ (the $i$th column has coefficient $v_j(i)$) with sample variance $\lambda_j$, i.e. the variance over the entries of $y_j$ is $\lambda_j$. Thus, if $\lambda_j$ is very small and there were no outlier data points, one would expect the entries of $y_j$ to be nearly constant. In this case, $y_j$ expresses a nearly linear relationship between the columns of $D$. A data point which deviates sharply from the correlation structure of the data will likely have its $y_j$ entry deviate sharply from the rest of the entries (assuming no other outliers). Since the last components
have the smallest $\lambda_j$, an outlier's entries in these components will likely stand out. This motivates examination of the following statistic for the $i$th data point (the $i$th row of $D$) and some $1 \le q \le m$:

$$T_i^{(q)} = \sum_{j=m-q+1}^{m} y_j(i)^2 \qquad (8)$$

where $y_j(i)$ denotes the $i$th entry of $y_j$, and $q$ is a user-defined parameter. A possible criticism of this approach is pointed out in [36], page 237: "it [$T_i^{(q)}$] still gives insufficient weight to the last few PCs, [...] Because the PCs have decreasing variance with increasing index, the values of $y_j(i)^2$ will typically become smaller as $j$ increases, and $T_i^{(q)}$ therefore implicitly gives the PCs decreasing weights as $j$ increases. This effect can be severe if some of the PCs have very small variances, and this is unsatisfactory as it is precisely the low-variance PCs which may be most effective ..."
To address this criticism, the components are normalized to be given equal weight. Let $\tilde{v}_j$ denote the normalized $j$th principal direction: the $m \times 1$ vector whose $i$th entry is $v_j(i)/\sqrt{\lambda_j}$. The normalized $j$th principal component is $\tilde{y}_j = D \tilde{v}_j$. The sample variance of $\tilde{y}_j$ equals one, so the weights of the normalized components are equal. The statistic we use for the $i$th data point is (following notation in [36])

$$\tilde{T}_i^{(q)} = \sum_{j=m-q+1}^{m} \tilde{y}_j(i)^2. \qquad (9)$$

Next, we discuss the development of a distributed algorithm for finding the top $t$ ranked data points in $D$ with respect to $\tilde{T}_i^{(q)}$. For this, we only consider the case involving the last component, $q = 1$ (developing the algorithm in the $q > 1$ case is quite a bit more complicated):

$$\tilde{T}_i^{(1)} = \tilde{y}_m(i)^2 = \frac{y_m(i)^2}{\lambda_m}. \qquad (10)$$
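For $q = 1$ and two attributes, the statistic of Equation (10) can be computed with a closed-form eigendecomposition: after normalization the covariance matrix is $[[1, c], [c, 1]]$, with eigenvalues $1 \pm c$. The sketch below is ours; the data are made up, echoing the height/weight scenario of Example 1, with one planted correlation outlier:

```python
import math, random

# Toy strongly correlated data (echoing Example 1) plus one planted
# correlation outlier: large first attribute, small second attribute.
rng = random.Random(1)
rows = [[v, v + 0.05 * rng.gauss(0, 1)] for v in
        (rng.gauss(0, 1) for _ in range(200))]
rows.append([2.0, -2.0])          # the outlier, index 200
n = len(rows)

# Normalize each column: zero mean, unit sample variance (Section 4.3).
for j in range(2):
    col = [r[j] for r in rows]
    mu = sum(col) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1))
    for r in rows:
        r[j] = (r[j] - mu) / sd

# 2x2 covariance matrix is [[1, c], [c, 1]] after normalization.
c = sum(r[0] * r[1] for r in rows) / (n - 1)
lam_last = 1.0 - abs(c)           # smallest eigenvalue of [[1,c],[c,1]]
v_last = (1.0 / math.sqrt(2), -math.copysign(1.0, c) / math.sqrt(2))

# Statistic (10): squared score on the last PC, scaled by 1/lambda_m.
scores = [(r[0] * v_last[0] + r[1] * v_last[1]) ** 2 / lam_last
          for r in rows]
top = max(range(n), key=lambda i: scores[i])   # index of the top outlier
```

The planted row scores far above the rest because it violates the nearly linear relationship expressed by the low-variance component, exactly the behavior the statistic is designed to detect.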
6.2. Vertically Distributed Top-$t$ Outlier Detection

First, the algorithm in Section 5 is applied to approximate $\mathrm{Cov}(D)$; at the end, each site has the approximation. The eigenvalues and eigenvectors of this matrix, $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_m$ and $\hat{v}_1, \ldots, \hat{v}_m$, are computed, along with the normalized eigenvectors $\hat{\tilde{v}}_1, \ldots, \hat{\tilde{v}}_m$ (no communication required). However, the approximation to the last component cannot be computed without communication. The approximation to the last normalized principal component is defined to be $\hat{\tilde{y}}_m = D \hat{\tilde{v}}_m$.

Let $\hat{\tilde{v}}_m(1)$ denote the $m_1 \times 1$ vector consisting of the first $m_1$ entries of $\hat{\tilde{v}}_m$. This is the part of $\hat{\tilde{v}}_m$ corresponding to the attributes at site $S_1$. Let $\hat{\tilde{v}}_m(2)$ denote the $m_2 \times 1$ vector consisting of the last $m_2$ entries of $\hat{\tilde{v}}_m$. This is the part of $\hat{\tilde{v}}_m$ corresponding to the attributes at site $S_2$. We have

$$\hat{\tilde{y}}_m = D \hat{\tilde{v}}_m = D_1 \hat{\tilde{v}}_m(1) + D_2 \hat{\tilde{v}}_m(2). \qquad (11)$$

Given $1 \le i \le n$, let $a_i$ denote the $i$th entry of $D_1 \hat{\tilde{v}}_m(1)$ and $b_i$ denote the $i$th entry of $D_2 \hat{\tilde{v}}_m(2)$. Remaining consistent with Equation (10), $\tilde{T}_i^{(1)} = \tilde{y}_m(i)^2$, we define the approximation to $\tilde{T}_i^{(1)}$ as $\hat{\tilde{T}}_i^{(1)} = (a_i + b_i)^2$. Our goal is to develop a distributed algorithm for computing the top $t$ among $\hat{\tilde{T}}_1^{(1)}, \ldots, \hat{\tilde{T}}_n^{(1)}$. Since $a_i$ and $b_i$ can be computed without communication, it suffices to solve the following problem.

Problem 1 (Distributed Sum-Square Top K (DSSTK)). Site $S_1$ has $a_1, \ldots, a_n$ and $S_2$ has $b_1, \ldots, b_n$. Both sites have an integer $1 \le t \le n$. The sites must compute the top $t$ among $(a_1 + b_1)^2, \ldots, (a_n + b_n)^2$.
A communication-efficient algorithm for solving this problem is described next. Unlike the algorithm for distributed covariance computation, this one is not approximate: its output is the correct top $t$. The algorithm assumes a total linear ordering on the $a_i$ and $b_i$. For simplicity, we assume all the $a_i$ and $b_i$ are distinct among themselves and ordered in the standard way. Ties can easily be broken according to the lower index; e.g. if $a_i = a_j$, then $a_i$ is ordered before $a_j$ if $i < j$.
6.3. An Algorithm for the DSSTK Problem
Below we only describe an algorithm for solving the problem when $t = 1$. At the end, both sites have the index $i^* = \mathrm{argmax}_i\{(a_i + b_i)^2\}$ and $(a_{i^*} + b_{i^*})^2$. Hence, the problem when $t > 1$ is solved by removing $a_{i^*}$ and $b_{i^*}$, then repeating, etc.

The algorithm for finding $\mathrm{argmax}_i\{(a_i + b_i)^2\}$ carries out two stages. First, $S_1$ and $S_2$ solve the following two distributed sub-problems: find $i_{\max} = \mathrm{argmax}_i\{a_i + b_i\}$ (Distributed Max Algorithm) and $i_{\min} = \mathrm{argmin}_i\{a_i + b_i\}$ (Distributed Min Algorithm). Second, each site returns whichever of $i_{\max}$ and $i_{\min}$ has the larger value of $(a_i + b_i)^2$. Clearly this equals $i^*$, so all that remains is to explicate the Distributed Max and Min algorithms.
These algorithms are very similar. Both proceed in rounds, during each of which $S_1$ and $S_2$ exchange a small amount of communication (24 bytes). At the end of each round, the sites evaluate a condition based on their local data and all the data communicated during past rounds. If the condition fails, then a new round of communication is triggered. Otherwise, the algorithms terminate, at which point both sites will have the desired indices. The Distributed Max algorithm is described in Algorithm 6.3.1; the Min algorithm is completely analogous and obtained by replacing all occurrences of "argmax" with "argmin".
Next we prove that the Distributed Max Algorithm is correct, i.e. the algorithm terminates and, upon termination, both sites return the correct index. In fact, we prove a more general result.
Algorithm 6.3.1 Distributed Max Algorithm
1. $S_1$ selects $i_1 = \mathrm{argmax}_i\{a_i\}$ and sends $(i_1, a_{i_1})$ to $S_2$. In parallel, $S_2$ selects $i_2 = \mathrm{argmax}_i\{b_i\}$ and sends $(i_2, b_{i_2})$ to $S_1$. [16 bytes]
2. $S_2$ receives the message from $S_1$ and replies with $b_{i_1}$. In parallel, $S_1$ receives the message from $S_2$ and replies with $a_{i_2}$. [8 bytes]
3. Both sites now have data values $a_{i_1}$, $b_{i_1}$, $a_{i_2}$, $b_{i_2}$ and indices $i_1$, $i_2$. If $i_1 = i_2$, then both sites terminate.
4. Otherwise, $S_1$ replaces $a_{i_1}$ and $a_{i_2}$ with $(a_{i_1} + b_{i_1})/2$ and $(a_{i_2} + b_{i_2})/2$. Site $S_2$ replaces $b_{i_1}$ and $b_{i_2}$ in exactly the same way. Goto step 1.
Namely, both sites return the index $i_{\max} = \mathrm{argmax}_i\{f(a_i + b_i)\}$, where $f$ is any strictly increasing function. The corresponding correctness result for the Distributed Min Algorithm, that both sites return $i_{\min} = \mathrm{argmin}_i\{a_i + b_i\}$ (and, more generally, $\mathrm{argmax}_i\{g(a_i + b_i)\}$ for any strictly decreasing function $g$), is proven in an analogous way (omitted).
Correctness of the algorithm falls out of the following technical result. The result shows that, for each $i$, $a_i$ and $b_i$ change values at most once during the course of the algorithm. Moreover, when they change, they change to $(a_i + b_i)/2$. Let $a_i^{(s)}$, $b_i^{(s)}$ denote the values of $a_i$ and $b_i$ held at the end of round $s \ge 1$.

Lemma 1. Suppose the algorithm has just completed round $r \ge 1$. For each $1 \le i \le n$, let $r(i)$ denote the round at which index $i$ was first selected by either site (if $i$ was not selected during the first $r$ rounds, then let $r(i) = r + 1$). It is the case that
1. for all $s < r(i)$, $a_i^{(s)} = a_i$ and $b_i^{(s)} = b_i$;
2. for all $r(i) \le s \le r$, $a_i^{(s)} = b_i^{(s)} = (a_i + b_i)/2$.
The proof can be found in the appendix. Suppose the algorithm did not terminate after $r$ rounds. Let $(i_1^{(1)}, i_2^{(1)}), \ldots, (i_1^{(r)}, i_2^{(r)})$ denote the index pairs selected during the $r$ rounds. It follows from Lemma 1 that a pair $(i_1^{(s)}, i_2^{(s)})$ will not be selected if $i_1^{(s)}$ was selected at a previous round $s_1 < s$ and $i_2^{(s)}$ was selected at a previous round $s_2 < s$ (possibly different from $s_1$); otherwise $i_1^{(s)} = i_2^{(s)}$ and, hence, the algorithm would have terminated at round $s$. Thus every selected pair contains at least one index never selected before, and since there are only $n$ indices, at most $n$ such pairs can be selected. Taking $r$ larger than this bound yields a contradiction; hence the algorithm terminates.
Let $r^*$ denote the termination round and let $i^*$ denote the index agreed upon by both sites upon termination. To complete the correctness proof, we need to show that $i^*$ is the correct index, $i_{\max} = \mathrm{argmax}_i\{f(a_i + b_i)\}$. Consider any index $i \ne i^*$. From Lemma 1, one of the following holds:
1. $a_i^{(r^*)} = a_i$ and $b_i^{(r^*)} = b_i$;
2. $a_i^{(r^*)} = b_i^{(r^*)} = (a_i + b_i)/2$.
In either case it follows that $a_i^{(r^*)} + b_i^{(r^*)} = a_i + b_i$ (and likewise for $i^*$). By definition of the algorithm, $a_{i^*}^{(r^*)} \ge a_i^{(r^*)}$ and $b_{i^*}^{(r^*)} \ge b_i^{(r^*)}$. Thus, $a_{i^*} + b_{i^*} = a_{i^*}^{(r^*)} + b_{i^*}^{(r^*)} \ge a_i^{(r^*)} + b_i^{(r^*)} = a_i + b_i$. Therefore, since $f$ is strictly increasing, $f(a_{i^*} + b_{i^*}) \ge f(a_i + b_i)$. We conclude that $i^* = \mathrm{argmax}_i\{f(a_i + b_i)\} = i_{\max}$, as desired.
Comment: Each round of the Distributed Max algorithm requires 24 bytes of communication. Thus, the total communication is $24r$, where $r$ is the number of rounds until termination. This is quite a small communication requirement unless $r$ is large. However, this communication efficiency is won at a price: the algorithm requires $r$ rounds of synchronization. Even for relatively modest $r$, the cost of these synchronizations could be unacceptably high. A simple way to address this problem is to have each site, at each round, select its top $c > 1$ tuples instead of just its top tuple. The communication cost per round increases to $24c$ bytes, but the total number of rounds executed decreases correspondingly.
7. Case Study: Finding Galactic Fundamental Planes
The identification of certain correlations among parameters has lead to important discoveries in
astronomy. For example, the class of elliptical and spiral galaxies (including dwarfs) have been
found to occupy a two dimensional space inside a 3-D space of observed parameters, radius, mean
surface brightness and velocity dispersion. This 2D plane has been referred to as the Fundamental
Plane([16, 42]).
This section presents a case study involving the detection of a fundamental plane among galaxy parameters distributed across two catalogs: 2MASS and SDSS. We consider several different combinations of parameters and explore for a fundamental plane in each (one such combination was described earlier). Our goal is to demonstrate that, using our distributed covariance matrix algorithm to approximate the principal components, we can find a fundamental plane very similar to that obtained by applying a centralized PCA. Note that our goal is not to make a new discovery in astronomy. Rather, we argue that our distributed algorithm could have found results very similar to the centralized approach at a fraction of the communication cost. Therefore, DEMAC could provide a valuable tool for astronomers wishing to explore many parameter spaces across different catalogs for fundamental planes.
In our study we measure the accuracy of our distributed algorithm in terms of the similarity between its results and those of a centralized approach. We examine accuracy at various amounts of communication allowed to the distributed algorithm in order to assess the trade-off described at the end of Section 5. For each amount of communication allowed, we ran the distributed algorithm 100 times, each with a different random matrix, and report the average result (except where otherwise noted).
For the purposes of our study, a real distributed environment is not necessary. Thus, for simplicity, we
used a single machine and simulated a distributed environment. We carried out two sets of experiments.
The first involves three parameters already examined in the astronomy literature [16, 42], for which a fundamental plane has been observed. The second involves several previously unexplored combinations of parameters.
7.1. Experiment Set One
We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS,
http://irsa.ipac.caltech.edu/applications/Gator/ and
http://cas.sdss.org/astro/en/tools/crossid/upload.asp,
and the SDSS object cross id tool, we obtained an aggregate dataset involving attributes from 2MASS
and SDSS and lying in the sky region between right ascension (ra) 150 and 200 and declination (dec)
0 and 15. The aggregated dataset had the following attributes from SDSS: Petrosian I band angular effective radius (Iaer), redshift (rs), and velocity dispersion (vd);* and had the following attribute from 2MASS: K band mean surface brightness (Kmsb).† After removing tuples with missing attributes, we had a 1307-tuple dataset with four attributes. We produced a new attribute, logarithm Petrosian I band effective radius (log(Ier)), as log(Iaer*rs), and a new attribute, logarithm velocity dispersion (log(vd)), by applying the logarithm to vd. We dropped all attributes except these to obtain the three-attribute dataset: log(Ier), log(vd), Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 4.3).

*The Galaxy-view Petrosian radius in the i band, z (SpecObj view), and velDisp (SpecObj view) in SDSS DR4.
†The K band mean surface brightness column in the extended source catalog of the All Sky Data Release, http://www.ipac.caltech.edu/2mass/releases/allsky/index.html
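The normalization step can be sketched compactly. The helper below is ours, following the Section 4.3 convention (subtract the column mean, divide by the sample standard deviation with divisor $n - 1$); the three-row input is made up:

```python
import math

def zscore_columns(rows):
    """Normalize each column of a row-major table to zero mean and
    unit sample variance (divisor n - 1), as described in Section 4.3."""
    n, m = len(rows), len(rows[0])
    out = [row[:] for row in rows]
    for j in range(m):
        col = [r[j] for r in rows]
        mu = sum(col) / n
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / (n - 1))
        for r in out:
            r[j] = (r[j] - mu) / sd
    return out

norm = zscore_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
# each column of norm is [-1.0, 0.0, 1.0]: zero mean, unit sample variance
```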
We applied PCA directly to this dataset to obtain the centralization-based results. Then we treated the dataset as if it were distributed (assuming cross-match indices have been created as described earlier). The data can be thought of as a virtual table with attributes log(Ier) and log(vd) located at one site and attribute Kmsb at another. Finally, we applied our distributed covariance matrix algorithm and computed the principal components from the resulting matrix. Note that our dataset is somewhat small and not necessarily indicative of a scenario where DEMAC would be used in practice. However, for the purposes of our study (accuracy with respect to communication), it suffices.
Figure 2 shows the percentage of variance captured as a function of communication percentage (i.e. at 15%, the distributed algorithm uses 15% of the bytes required to centralize the data). Error bars indicate standard deviation; recall the percentage-of-variance-captured numbers are averages over 100 trials. First observe that the percentage captured by the centralized approach, 90.5%, replicates the known result that a fundamental plane exists among these parameters. Indeed, the dataset fits fairly nicely on the plane formed by the first two PCs. Also observe that the percentage of variance captured by the distributed algorithm (including one standard deviation) using as little as 10% communication never strays more than 5 percent from 90.5%. This is a reasonably accurate result, indicating that the distributed algorithm identifies the existence of a plane using 90% less communication. As such, this provides evidence that the distributed algorithm would serve as a good browser, allowing the user to get decent approximate results at a sharply reduced communication cost. If this piques the user's interest, she can go through the trouble of centralizing the data and carrying out an exact analysis.
[Figure: percentage of variance captured by the first and second PCs (y-axis, 80 to 105) vs. communication percentage (x-axis, 0 to 100); series: Distributed Variance with standard-deviation error bars, Centralized Variance.]
Figure 2. Communication percentage vs. percent of variance captured, (log(Ier), log(vd), Kmsb) dataset.
Interestingly, the average percentage captured by the distributed algorithm appears to approach the
true percentage captured, 90.5%, very slowly (as the communication percentage approaches infinity,
the average percentage captured must approach 90.5%). At present we do not have an explanation
for the slow approach. However, as the communication increases, the standard deviation decreases
substantially (as expected).
To analyze the accuracy of the actual principal components computed by the distributed algorithm,
we consider the data projected onto each pair of PCs. The projection onto the true first and second PC
ought to appear with much scatter in both directions (this is the view of the data perpendicular to the
plane). The projections onto the (first, third) and (second, third) PCs ought to appear more "flattened", as these are views of the data perpendicular to the edge of the plane. Figures 3 and 4 display the results.
The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs).
Here we see the fundamental plane. The right column depicts the projections onto the PCs computed
by our distributed algorithm at 15% communication (for one random matrix and not the average over
100 trials). We see a similar pattern, indicating that the PCs computed by the distributed algorithm are
quite accurate in the sense that they produce very similar projections as those produced by the true
PCs.
In closing, it is important to stress that we are not claiming that the actual projections can be
computed in a communication-efficient fashion (they can’t). Rather, that our the PCs computed in a
distributed fashion are accurate as measured by the projection similarity with the true PCs.
7.2. Experiment Set Two
We prepared our test data as follows. Using the web interfaces of 2MASS and SDSS and the SDSS object cross-id tool, we obtained an aggregate dataset involving attributes from 2MASS and SDSS, lying in
the sky region between right ascension (ra) 150 and 200 and declination (dec) 0 and 15. The aggregated
dataset had the following attributes from SDSS: Petrosian Flux in the U band (petroU), Petrosian Flux
in the G band (petroG), Petrosian Flux in the R band (petroR), Petrosian Flux in the I band (petroI), and Petrosian Flux in the Z band (petroZ);* and had the following attributes from 2MASS: K band mean surface brightness (Kmsb) and K concentration index (KconInd).† We had a 29638-tuple dataset with seven attributes. We produced new attributes petroU-I = petroU - petroI, petroU-R = petroU - petroR, petroU-G = petroU - petroG, petroG-R = petroG - petroR, petroG-I = petroG - petroI, and petroR-I = petroR - petroI. We also produced a new attribute, logarithm K concentration index (log(KconInd)), by applying the logarithm to KconInd. In each experiment, we dropped all attributes except the following to obtain a dataset: petroU-I, petroU-R, petroU-G, petroG-R, petroG-I, petroR-I, log(KconInd), and Kmsb. Finally, we normalized each column by subtracting its mean from each entry and dividing by its sample standard deviation (as described in Section 4.3).

*The Petrosian flux columns in the u, g, r, i, and z bands (PhotoObjAll view) in SDSS DR4.
†The K band mean surface brightness and K concentration index columns in the extended source catalog of the All Sky Data Release.

[Figure: four scatter plots of projections at 15% communication, axes −6 to 6: (a) PC1 vs. PC2, centralized analysis; (b) PC1 vs. PC2, distributed algorithm; (c) PC1 vs. PC3, centralized analysis; (d) PC1 vs. PC3, distributed algorithm.]
Figure 3. Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.

[Figure: two scatter plots of projections at 15% communication, axes −6 to 6: (a) PC2 vs. PC3, centralized analysis; (b) PC2 vs. PC3, distributed algorithm.]
Figure 4. Projections, PC2 vs. PC3; communication percentage 15%, (log(Ier), log(vd), Kmsb) dataset.
We considered the following six combinations of three attributes: (petroU-I,log(KconInd),Kmsb),
(petroU-R,log(KconInd),Kmsb), (petroU-G,log(KconInd),Kmsb), (petroG-R,log(KconInd),Kmsb),
(petroG-I,log(KconInd),Kmsb), and (petroR-I,log(KconInd),Kmsb). In each case we carried out a distributed PCA experiment just as in experiment set one.
Table I depicts the results at 15% communication. The "Band" column indicates the combination of attributes used, e.g. U-I indicates (petroU-I,log(KconInd),Kmsb). The "Centralized Variance" column contains the sum of the variances (times 100) of the first two principal components found by the centralized algorithm; in effect, it contains the percent of variance captured by the first two PCs. The "Distributed Variance" column contains the sum of the variances of the first two PCs found by the distributed algorithm (averaged over 100 trials). The "STD Distributed Variance" column contains the standard deviation over the 100 trials.

Band   Centralized Variance   Distributed Variance   STD Distributed Variance
U-I    72.6186                67.8766                0.717
U-R    72.5375                68.0817                0.8191
U-G    72.6392                69.0102                0.9781
G-R    72.3501                68.5664                0.9879
G-I    72.4842                68.6199                0.9842
R-I    72.7451                68.0521                0.8034

Table I. The centralized and distributed variances captured by the first and second PCs (15% communication).
We see that in all cases the centralized experiments yield a relatively weak fundamental plane (72%
of the variance is captured by the first two PCs) relative to the fundamental plane from experiment set
one (90.5% captured). The distributed algorithm in all cases does a decent job at replicating this result
– 67.8% to 69.0% of the variance captured (all with standard deviation less than 0.99). These results
come at an 85% communication savings over centralizing the data and performing an exact calculation.
To further illuminate the trade-off between communication savings and accuracy, Figure 5 shows
the percentage of variance captured on the petroR-I dataset as a function of communication percentage
(error bars indicate standard deviation – recall the percentage of variance captured numbers are
averages over 100 trials). Interestingly, the average percentage captured by the distributed algorithm
appears to move away from the true percentage (72%) for communication up to 40%, then move very
slightly toward 72%. This is surprising since the distributed algorithm error must approach zero as
its communication percentage approaches infinity. The results appear to indicate that this approach
is quite slow. At the present we don’t have an explanation for this phenomenon. However, as the
communication increases, the standard deviation decreases (as expected). Moreover, despite the slow
approach, the percentage of variance captured by the distributed algorithm (including one standard
deviation) never strays more than 6 percent from the true percent captured – a reasonably accurate
figure.
As in experiment set one, we analyze the accuracy of the actual principal components computed
by the distributed algorithm by considering their data projections. We do so for the (petroR-I,log(KconInd),Kmsb) dataset. Figures 6 and 7 display the results. The left column depicts the projections onto the PCs computed by the centralized analysis (true PCs). Note that 10 projected data points (out of 29638) are omitted from the centralized PC1 vs. PC2 and PC2 vs. PC3 plots due to scaling: the y-coordinate for these points (x in the case of PC2 vs. PC3) does not lie in the range [-20,20] (in both cases a few points have coordinate nearly 60 or -60). Unlike experiment set one, we do not see a very pronounced plane (as expected).
The right column depicts the projections onto the PCs computed by our distributed algorithm at
15% communication (for one random matrix and not the average over 100 trials). The distributed
projections appear to indicate a fundamental plane somewhat more strongly than the centralized
ones. This indicates some inaccuracy in the distributed PCs. However, since the distributed variances
[Figure: communication percentage (x-axis, 0-100) vs. percentage of variance captured by the first and second PCs (y-axis, 66-74); curves for distributed variance and centralized variance.]
Figure 5. Communication percentage vs. percent of variance captured, (petroR-I,log(KconInd),Kmsb) dataset.
correctly did not indicate a strong fundamental plane, the inaccuracies in the PCs would likely not play
an important role for users browsing for fundamental planes.
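The projections in Figures 6 and 7 are simply the data re-expressed in the PC basis. A minimal numpy sketch of how such scatter plots are produced (with hypothetical data in place of the catalog):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * [4.0, 2.0, 1.0]   # hypothetical catalog rows
X = X - X.mean(axis=0)                            # center before PCA

# Rows of Vt are unit-length principal components, largest variance first.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
scores = X @ Vt.T                                 # coordinates in the PC basis

pc1_vs_pc2 = scores[:, [0, 1]]   # the point cloud plotted as "PC1 vs. PC2"
pc2_vs_pc3 = scores[:, [1, 2]]   # likewise "PC2 vs. PC3"
```

For the distributed version, the only change is that the rows of Vt come from the approximate covariance matrix rather than the exact one.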
8. Conclusions
We proposed a system, DEMAC, for the distributed exploration of massive astronomical catalogs.
DEMAC is to be built on top of the existing U.S. national virtual observatory environment and will
provide tools for data mining (as web services) without requiring datasets to be downloaded to a
centralized server. Instead, users will only download the output of the data mining process (a data mining
[Figure: four scatter panels, axes in [-20,20]: (a) PC1 vs. PC2, PCs from centralized analysis; (b) PC1 vs. PC2, PCs from distributed algorithm; (c) PC1 vs. PC3, PCs from centralized analysis; (d) PC1 vs. PC3, PCs from distributed algorithm.]
Figure 6. Projections, PC1 vs. PC2 and PC1 vs. PC3; communication percentage 15%, (petroR-I,log(KconInd),Kmsb) dataset.
model); the actual data mining from multiple data servers will be performed using communication-
efficient DDM algorithms. The distributed algorithms we have developed sacrifice perfect accuracy
[Figure: two scatter panels, axes in [-20,20]: (a) PC2 vs. PC3, PCs from centralized analysis; (b) PC2 vs. PC3, PCs from distributed algorithm.]
Figure 7. Projections, PC2 vs. PC3; communication percentage 15%, (petroR-I,log(KconInd),Kmsb) dataset.
for communication savings: they offer approximate results at a considerably lower communication
cost than that of exact results obtained through centralization. As such, we see DEMAC as serving the
role of an exploratory "browser". Users can quickly get (generally quite accurate) results for their
distributed queries at low communication cost. Armed with these results, users can focus on a specific
query or portion of the datasets, and download that portion for more intricate analysis.
To illustrate the potential effectiveness of our system, we developed communication-efficient
distributed algorithms for principal component analysis (PCA) and outlier detection. Then, we carried
out a case study using distributed PCA for detecting fundamental planes of astronomical parameters.
We observed that our distributed algorithm replicated fairly closely the fundamental-plane results
obtained through centralized analysis, but at significantly reduced communication cost.
In closing, we envision our system increasing the ease with which large, geographically distributed
astronomy catalogs can be explored, by providing quick, low-communication solutions. This benefit
will allow astronomers to better tap the riches of distributed virtual tables formed from joined and
integrated sky survey catalogs.
Acknowledgments
H. Dutta, C. Giannella, H. Kargupta, R. Wolff: Support for this work was provided by the U.S. National
Science Foundation (NSF) through grants IIS-0329143, IIS-0093353, IIS-0203958 and by the U.S.
National Aeronautics and Space Administration (NASA) through grant NAS2-37143.
K. Borne: Support for this work was provided in part by the NSF through Cooperative Agreement
AST0122449 to the Johns Hopkins University.
Appendix – Proof of Lemma 1
To ease reading, we restate the lemma. Suppose the Distributed Max Algorithm has just completed iteration
$t \geq 1$. For each $i \in I$, let $r(i)$ denote the round at which index $i$ was first selected by either site
(if $i$ was not selected during the $t$ rounds, then let $r(i) = t + 1$). It is the case that
1. for all $s < r(i)$, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and $\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$;
2. for all $r(i) \leq s \leq t$, $v(i) \leq \hat{v}^{(s)}(i) \leq v(i) + \epsilon$.
Proof: From the algorithm definition it can be seen that, for all $r \leq t$ and all $i \in I$, if $i$
was selected by either site during round $r$, then $v(i) \leq \hat{v}^{(r)}(i) = \hat{v}_1^{(r)}(i) + \hat{v}_2^{(r)}(i) \leq v(i) + \epsilon$. Otherwise, $\hat{v}_1^{(r)}(i) = \hat{v}_1^{(r-1)}(i)$ and
$\hat{v}_2^{(r)}(i) = \hat{v}_2^{(r-1)}(i)$. The remainder of the proof proceeds by induction on $t$.
Base case $t = 1$: If $r(i) = 2$ (i.e. $i$ was not selected by any site), then $\hat{v}_1^{(1)}(i) = \hat{v}_1^{(0)}(i)$ and
$\hat{v}_2^{(1)}(i) = \hat{v}_2^{(0)}(i)$, so the desired result holds. If $r(i) = 1$, then $v(i) \leq \hat{v}^{(1)}(i) \leq v(i) + \epsilon$.
Hence the desired result holds.
Induction case $t > 1$: The first case to consider: $r(i) = t + 1$ (i.e. $i$ was not selected by any site
during the $t$ rounds of the algorithm). Thus, $\hat{v}_1^{(t)}(i) = \hat{v}_1^{(t-1)}(i)$ and $\hat{v}_2^{(t)}(i) = \hat{v}_2^{(t-1)}(i)$. Moreover, since $i$ was
not selected during the first $t - 1$ rounds, then, by induction, for all $s \leq t - 1$, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and
$\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$. Therefore, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and $\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$, for all $s \leq t$, as desired.
The second case to consider: $r(i) = t$ (i.e. $i$ was selected for the first time by some site during round
$t$). It follows that $v(i) \leq \hat{v}^{(t)}(i) = \hat{v}_1^{(t)}(i) + \hat{v}_2^{(t)}(i) \leq v(i) + \epsilon$. Since $i$ was not selected during the first $t - 1$ rounds,
then, by induction, for all $s \leq t - 1$, $\hat{v}_1^{(s)}(i) = \hat{v}_1^{(0)}(i)$ and $\hat{v}_2^{(s)}(i) = \hat{v}_2^{(0)}(i)$. Therefore,
the desired result holds.
The final case to consider: $r(i) < t$ (i.e. $i$ was selected for the first time before round $t$). By induction,
the desired result holds for all $s \leq t - 1$; in particular, $v(i) \leq \hat{v}^{(t-1)}(i) \leq v(i) + \epsilon$. It suffices
to show that $v(i) \leq \hat{v}^{(t)}(i) \leq v(i) + \epsilon$.
If $i$ was selected again during round $t$, then $v(i) \leq \hat{v}^{(t)}(i) = \hat{v}_1^{(t)}(i) + \hat{v}_2^{(t)}(i) \leq v(i) + \epsilon$.
If $i$ was not selected again during round $t$, then $\hat{v}^{(t)}(i) = \hat{v}^{(t-1)}(i)$, and hence $v(i) \leq \hat{v}^{(t)}(i) \leq v(i) + \epsilon$. $\Box$