
Bin Fu

Eugene Fink, Julio López, Garth Gibson

Carnegie Mellon University

Astronomy application of Map-Reduce:

A distributed Friends-of-Friends algorithm

Bin Fu © November 2009 http://www.pdl.cmu.edu/

Motivation

• Future science will be increasingly driven by huge datasets

• Analyzing them requires effective parallel tools for large-scale data, such as MapReduce

• We aim to help domain scientists by solving the problems they care about more effectively

• Astronomy is our first step


Astronomy Dataset

• Sky surveys:
  - Sloan Digital Sky Survey (2000–2008): 230 million objects, 50 TByte
  - Pan-STARRS (just started): order of magnitude larger
  - Large Synoptic Survey Telescope (2016): two orders of magnitude larger

• Simulations:
  - McWilliams Center at CMU: black holes and dark matter, 15B particles, 14 TByte / run
  - LANL Coyote universe: 1B particles, 1 TByte / run, 30 runs


Friends of Friends (FoF) technique:

• Two galaxies are “friends” if they are close to each other

• We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges

• We need to identify the connected components

• From astronomers: The number of connected components and their sizes reflect properties of the universe
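The graph view above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the authors' implementation: the `linking_length` threshold that defines "close", the function names, and the sample coordinates are all assumptions made for the example.

```python
from collections import deque
from itertools import combinations
from math import dist


def friendship_graph(points, linking_length):
    """Adjacency lists: vertex i is points[i]; an edge joins two points
    whose Euclidean distance is at most linking_length (assumed threshold)."""
    adj = {i: [] for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            adj[i].append(j)
            adj[j].append(i)
    return adj


def component_sizes(adj):
    """Sizes of the connected components, largest first (breadth-first search)."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Made-up coordinates, for illustration only.
galaxies = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
            (5.0, 5.0, 5.0), (5.2, 5.0, 5.0), (9.0, 9.0, 9.0)]
sizes = component_sizes(friendship_graph(galaxies, linking_length=0.6))
print(len(sizes), sizes)  # 3 [3, 2, 1]
```

The first three galaxies form one group through a chain of friendships even though the two end points are not friends with each other, which is why the problem is one of connected components rather than simple neighbor lookup.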


Friends of Friends (FoF) technique:

• Sequential algorithms
  - Exact: O((n ∙ log n)^1.5)
  - Approximate: O(n)

• When n is VERY large, parallel processing is required

• Input: (id, x, y, z) for each object

• Output: (id, group-id) for each object
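To make the input and output formats concrete, here is a small sequential sketch that avoids comparing every pair by bucketing objects into a uniform grid. It is not necessarily either of the sequential algorithms the slide refers to; the `linking_length` parameter, the grid bucketing, and the choice of the smallest member id as the group-id are assumptions for illustration.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def fof_grid(objects, linking_length):
    """Map (id, x, y, z) records to (id, group_id) records.

    group_id is the smallest object id in the group (an arbitrary labeling).
    """
    parent = {oid: oid for oid, *_ in objects}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Bucket objects into grid cells of side linking_length, so that any
    # two friends lie in the same cell or in adjacent cells.
    cells = defaultdict(list)
    for oid, x, y, z in objects:
        key = tuple(floor(v / linking_length) for v in (x, y, z))
        cells[key].append((oid, (x, y, z)))

    offsets = list(product((-1, 0, 1), repeat=3))
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in offsets:
            neighbours = cells.get((cx + dx, cy + dy, cz + dz), ())
            for a, pa in members:
                for b, pb in neighbours:
                    if a < b and dist(pa, pb) <= linking_length:
                        ra, rb = find(a), find(b)
                        if ra != rb:
                            # Keep the smaller id as the root of the merged group.
                            parent[max(ra, rb)] = min(ra, rb)
    return [(oid, find(oid)) for oid, *_ in objects]


# Made-up records, for illustration only.
records = [(1, 0.0, 0.0, 0.0), (2, 0.5, 0.0, 0.0), (3, 1.0, 0.0, 0.0),
           (4, 5.0, 5.0, 5.0), (5, 5.2, 5.0, 5.0), (6, 9.0, 9.0, 9.0)]
print(fof_grid(records, linking_length=0.6))
# [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4), (6, 6)]
```

The grid only limits which pairs are tested; the distance check itself is unchanged, so the groups are the same as a brute-force search would produce.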


Distributed Friends of Friends (dFoF)

• Traditionally (HPC community): divide the space into cubes

• But for this application, that approach requires more work (e.g., communication) at the later merge step


Distributed Friends of Friends (dFoF)

• Instead, divide the space into “slightly overlapping” cubes
  - Randomly select a subset of objects
  - Apply the kd-tree construction
  - Send each object to the corresponding cubes
  - Allocate different processors to cubes

• Distributed computation: Apply a sequential FoF algorithm to find the “local groups” within each cube

• Identify cross-cube edges and merge the respective “local groups”
  - Apply the Union-Find algorithm to the galaxies in the cube overlaps
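The steps above can be sketched end to end on a single machine. This is a simplified illustration under stated assumptions, not the authors' code: the cubes are kd-tree leaf boxes grown by one `linking_length` on each side, the local FoF is brute force, and the merge unions local groups that share an object in an overlap (one simple way to apply Union-Find there). All function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import dist, inf
import random


def kd_boxes(sample, depth, lo=(-inf,) * 3, hi=(inf,) * 3, level=0):
    """Split space into up to 2**depth boxes at sample medians (kd-tree style)."""
    if depth == 0 or len(sample) < 2:
        return [(lo, hi)]
    axis = level % 3
    pts = sorted(sample, key=lambda p: p[axis])
    mid = len(pts) // 2
    cut = pts[mid][axis]
    left_hi = hi[:axis] + (cut,) + hi[axis + 1:]
    right_lo = lo[:axis] + (cut,) + lo[axis + 1:]
    return (kd_boxes(pts[:mid], depth - 1, lo, left_hi, level + 1)
            + kd_boxes(pts[mid:], depth - 1, right_lo, hi, level + 1))


class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, a):
        self.parent.setdefault(a, a)
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def local_fof(members, linking_length):
    """Brute-force FoF inside one cube; returns {object id: local root id}."""
    uf = UnionFind()
    for (a, pa), (b, pb) in combinations(members, 2):
        if dist(pa, pb) <= linking_length:
            uf.union(a, b)
    return {oid: uf.find(oid) for oid, _ in members}


def dfof(objects, linking_length, depth=2, sample_size=100, seed=0):
    """objects: list of (id, x, y, z) tuples. Returns {id: global group label}."""
    rng = random.Random(seed)
    sample = [o[1:] for o in rng.sample(objects, min(sample_size, len(objects)))]
    boxes = kd_boxes(sample, depth)
    # Partition: send each object to every cube whose box, grown by one
    # linking length per side, contains it (the "slight overlap").
    cubes = defaultdict(list)
    for oid, *p in objects:
        for c, (lo, hi) in enumerate(boxes):
            if all(l - linking_length <= v <= h + linking_length
                   for l, v, h in zip(lo, p, hi)):
                cubes[c].append((oid, tuple(p)))
    # Distributed step (run serially here): local groups within each cube.
    local = {c: local_fof(m, linking_length) for c, m in cubes.items()}
    # Merge: an object in an overlap appears in several cubes; tying each
    # object to its local group in every cube joins those local groups.
    uf = UnionFind()
    for c, labels in local.items():
        for oid, root in labels.items():
            uf.union(("obj", oid), (c, root))
    return {oid: uf.find(("obj", oid)) for oid, *_ in objects}


# Made-up records, for illustration only.
objects = [(1, 0.0, 0.0, 0.0), (2, 0.5, 0.0, 0.0), (3, 1.0, 0.0, 0.0),
           (4, 5.0, 5.0, 5.0), (5, 5.2, 5.0, 5.0), (6, 9.0, 9.0, 9.0)]
labels = dfof(objects, linking_length=0.6)
print(labels[1] == labels[3], labels[4] == labels[5], labels[1] == labels[4])
```

Growing each box by a full linking length means every friendship edge falls entirely inside at least one cube, so the merge step only needs the objects that were sent to more than one cube and no cube has to query another for points.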


Distributed Friends of Friends (dFoF)

• Pluggable: the sequential FoF can be easily replaced by another similar group-finding algorithm

• Avoid explicit cross-processor communication

• Scalability

• Applicable to other problems


Implementation Properties

• Avoid explicit communication

• Optional out-of-core processing if resources are insufficient

• Fits Hadoop’s <key, value> structure naturally. Uses 3 Map phases and 2 Reduce phases.
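As a rough illustration of why the problem fits a <key, value> model, the sketch below simulates one generic map, shuffle, and reduce round in plain Python. It does not use the Hadoop API and does not reproduce the 3 Map and 2 Reduce phases, which the slides do not detail. The uniform grid key, the `cube_size` parameter, and the placeholder reducer are assumptions.

```python
from collections import defaultdict


def map_to_cube(record, cube_size):
    """Map: emit <cube key, record> for one (id, x, y, z) record.
    A uniform grid without overlap is used here only for brevity."""
    oid, x, y, z = record
    key = (int(x // cube_size), int(y // cube_size), int(z // cube_size))
    yield key, record


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_cube(key, records):
    """Reduce: placeholder that counts the objects in a cube.
    A real reducer would run a sequential FoF on `records` here."""
    yield key, len(records)


# Made-up records, for illustration only.
records = [(1, 0.2, 0.1, 0.0), (2, 0.9, 0.4, 0.3), (3, 1.5, 0.2, 0.1)]
pairs = [kv for r in records for kv in map_to_cube(r, cube_size=1.0)]
result = dict(kv for key, vals in shuffle(pairs).items()
              for kv in reduce_cube(key, vals))
print(result)  # {(0, 0, 0): 2, (1, 0, 0): 1}
```

Because the cube identifier is the key, the framework's shuffle delivers each cube's objects to one reducer without any hand-written communication code.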


Disc Cloud Cluster

• 64 nodes

• 8 cores per node, 2.83 GHz

• 16 GByte memory per node

• 10 GBit / second network


Strong Scalability Experiments

• Input data constant

• Change the number of nodes

• Log-log scale; ideally: straight line

[Figure: Time (min) vs. number of nodes (1 to 32), log-log scale, with one curve per input size: 0.5 bln, 0.9 bln*, 1 bln, and 15 bln]

* University of Washington
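The "straight line" criterion follows from ideal strong scaling: if doubling the nodes halves the time, then T(p) = T(1) / p, which is a line of slope -1 on log-log axes. The helper below is a small sketch for checking measured timings against that ideal. The function names are hypothetical, and the timings in the example are invented for illustration; they are not the measurements from the slides.

```python
from math import log2


def speedup_and_efficiency(times_by_nodes):
    """Return {nodes: (speedup, efficiency)} relative to the smallest node count."""
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    out = {}
    for nodes, t in sorted(times_by_nodes.items()):
        speedup = base_time / t
        out[nodes] = (speedup, speedup / (nodes / base_nodes))
    return out


def loglog_slope(times_by_nodes):
    """Slope between the first and last point on log-log axes; -1 is ideal."""
    nodes = sorted(times_by_nodes)
    first, last = nodes[0], nodes[-1]
    return ((log2(times_by_nodes[last]) - log2(times_by_nodes[first]))
            / (log2(last) - log2(first)))


# Hypothetical timings in minutes, NOT measurements from the slides.
ideal = {1: 240.0, 2: 120.0, 4: 60.0, 8: 30.0}
print(speedup_and_efficiency(ideal)[8])
print(round(loglog_slope(ideal), 6))
```

A slope shallower than -1, or an efficiency below 1, indicates overhead that grows with the node count.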


Weak Scalability Experiments

• Proportionally change the input size and # nodes

• Constant workload for each node

• Ideally: flat line

[Figure: Time (min) vs. number of nodes (1 to 32), with one curve per workload: 32 mln / node and 64 mln / node]


Conclusion

• Good scalability across a series of experiments

• A distributed astronomical group-finding algorithm

• Hadoop implementation


Future Work

• General-purpose astronomy toolkit:
  - Distributed computation for other standard astronomy problems: correlation functions, spatial matching, density distribution, spectral analysis, ...
  - Massive spatial indices of celestial objects integrated with distributed algorithms


Thanks!

http://www.cs.cmu.edu/~binf/dFOF
