
Bin Fu

Eugene Fink, Julio López, Garth Gibson

Carnegie Mellon University

Astronomy application of Map-Reduce:

A distributed Friends-of-Friends algorithm

Bin Fu © November 2009 http://www.pdl.cmu.edu/

Motivation

• Future science will be increasingly driven by very large data sets

• Analyzing such data requires effective parallel tools for large-scale data, such as MapReduce

• We aim to help domain scientists solve the problems they care about more effectively

• Astronomy is our first step


Astronomy Dataset

• Sky surveys:

- Sloan Digital Sky Survey (2000–2008): 230 million objects, 50 TByte

- Pan-STARRS (just started): order of magnitude larger

- Large Synoptic Survey Telescope (2016): two orders of magnitude larger

• Simulations:

- McWilliams Center at CMU: black holes and dark matter, 15B particles, 14 TByte / run

- LANL Coyote universe: 1B particles, 1 TByte / run, 30 runs


Friends of Friends (FoF) technique:

• Two galaxies are “friends” if they are close to each other

• We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges

• We need to identify the connected components

• From astronomers: The number of connected components and their sizes reflect properties of the universe
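The graph view above can be illustrated with a minimal sketch. It assumes that "close" means a Euclidean distance no larger than a fixed threshold; the name `linking_length`, the brute-force pair scan, and the sample coordinates are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
from math import dist


def friends_of_friends(points, linking_length):
    """Label each point with a group index using brute-force FoF.

    `points` is a list of (x, y, z) tuples; `linking_length` is an assumed
    distance threshold below which two objects count as "friends".
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every close pair is an edge; merging its endpoints builds the components.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Attach the larger root to the smaller one, so the group
                # label is always the smallest index in the component.
                parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(points))]


# Made-up coordinates: a chain of three galaxies plus one isolated galaxy.
galaxies = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 10, 10)]
groups = friends_of_friends(galaxies, linking_length=1.5)
```

Galaxies 0 and 2 are not direct friends, but both are friends of galaxy 1, so all three end up in one component. The pair scan is quadratic in the number of objects, which is only workable for small inputs.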


Friends of Friends (FoF) technique:

Sequential algorithms

• Exact: O((n ∙ log n)^1.5)

• Approximate: O(n)

When n is VERY large, parallel processing is required

Input

• (id, x, y, z) for each object

Output

• (id, group-id) for each object


Distributed Friends of Friends (dFoF)

• Traditional approach (HPC community): divide the space into cubes

• For this application, that approach requires more work (e.g., communication) in the later merge step


Distributed Friends of Friends (dFoF)

• Instead, divide the space into “slightly overlapping” cubes

- Randomly select a subset of objects

- Apply the kd-tree construction

• Distributed computation: apply a sequential FoF algorithm to find the “local groups” within each cube

- Send each object to corresponding cubes

- Allocate different processors to cubes

• Identify cross-cube edges and merge the respective “local groups”

- Apply the Union-Find algorithm to the galaxies in the cube overlaps
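The partition, local-grouping, and merge steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' code: the space is cut into slabs along x at hypothetical `boundaries` (a one-dimensional stand-in for the kd-tree cubes), each slab is padded by `linking_length`, the local FoF is a brute-force pair scan, and the slabs are processed in a plain loop rather than on separate processors.

```python
from collections import defaultdict
from math import dist


class UnionFind:
    """Union-Find over arbitrary comparable, hashable keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def local_fof(records, linking_length):
    """Sequential FoF inside one region: returns {id: local group id}."""
    uf = UnionFind()
    for i, (id_a, *pos_a) in enumerate(records):
        uf.find(id_a)  # register the object even if it has no friends
        for id_b, *pos_b in records[i + 1:]:
            if dist(pos_a, pos_b) <= linking_length:
                uf.union(id_a, id_b)
    return {rec[0]: uf.find(rec[0]) for rec in records}


def dfof(records, boundaries, linking_length):
    """records: (id, x, y, z) tuples; boundaries: sorted x cut points."""
    # Step 1: send each object to every slab whose padded range contains it.
    edges = [float("-inf"), *boundaries, float("inf")]
    slabs = defaultdict(list)
    for rec in records:
        x = rec[1]
        for s in range(len(edges) - 1):
            if edges[s] - linking_length <= x < edges[s + 1] + linking_length:
                slabs[s].append(rec)

    # Step 2: independent local FoF per slab (a plain loop in this sketch).
    memberships = defaultdict(list)
    for s, recs in slabs.items():
        for obj_id, local_group in local_fof(recs, linking_length).items():
            memberships[obj_id].append((s, local_group))

    # Step 3: an object seen in several slabs ties its local groups together.
    merged = UnionFind()
    for local_groups in memberships.values():
        for other in local_groups[1:]:
            merged.union(local_groups[0], other)

    # Output: (id, group-id), where group-id is a (slab, local group) pair.
    return {obj_id: merged.find(g[0]) for obj_id, g in memberships.items()}


# Made-up input: a chain of four objects crossing the cut at x = 1.5.
records = [(1, 0.0, 0, 0), (2, 0.9, 0, 0), (3, 1.8, 0, 0),
           (4, 2.7, 0, 0), (5, 9.0, 0, 0)]
result = dfof(records, boundaries=[1.5], linking_length=1.0)
```

Padding each region by the linking length means any pair of friends that straddles a cut lands together in at least one region, so the merge step only needs the objects that were sent to more than one region.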


Distributed Friends of Friends (dFoF)

• Pluggable: the sequential FoF step can easily be replaced by another, similar group-finding algorithm

• Avoid explicit cross-processor communication

• Scalability

• Applicable to other problems


Implementation Properties

• Avoid explicit communication

• Optional out-of-core processing if resources are insufficient

• Fits Hadoop’s <key, value> structure naturally, using 3 Map phases and 2 Reduce phases
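To make the <key, value> fit concrete, here is a small in-process imitation of a single map, shuffle, and reduce round. It does not reproduce the 3 Map and 2 Reduce phases mentioned above, whose details are not given here; the uniform grid of cubes, the names `map_to_cubes`, `shuffle`, and `reduce_cube`, and the sample records are all assumptions for illustration, and no Hadoop API is used.

```python
from collections import defaultdict
from itertools import product
from math import floor


def map_to_cubes(record, cube_size, overlap):
    """Map step: emit (cube key, record) for every padded cube the object touches."""
    obj_id, *pos = record
    # Along each axis, the padded cubes covering coordinate c have indices
    # floor((c - overlap) / size) through floor((c + overlap) / size).
    ranges = [
        range(floor((c - overlap) / cube_size),
              floor((c + overlap) / cube_size) + 1)
        for c in pos
    ]
    for key in product(*ranges):
        yield key, record


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_cube(key, records):
    """Reduce step placeholder: a real job would run sequential FoF here."""
    return key, sorted(rec[0] for rec in records)


# Made-up (id, x, y, z) records; objects 2 and 3 sit near the x = 10 cube face.
records = [(1, 0.5, 0.5, 0.5), (2, 9.9, 0.5, 0.5), (3, 10.2, 0.5, 0.5)]
pairs = [kv for rec in records
         for kv in map_to_cubes(rec, cube_size=10.0, overlap=0.5)]
cubes = dict(reduce_cube(k, v) for k, v in shuffle(pairs).items())
```

Objects near a cube face are emitted under more than one key, which is how the overlap is expressed without any explicit message passing: the shuffle delivers each copy to the reducer responsible for that cube.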


Disc Cloud Cluster

• 64 nodes

• 8 cores per node, 2.83 GHz

• 16 GByte memory per node

• 10 GBit / second network

Strong Scalability Experiments

• Input data constant

• Change the number of nodes

• Log-log scale; ideally: straight line

[Figure: Time (min) vs. number of nodes (1 to 32), log-log scale; one curve per data set: 0.5 bln, 0.9 bln*, 1 bln, 15 bln]

* University of Washington


Weak Scalability Experiments

• Proportionally change the input size and # nodes

• Constant workload for each node

• Ideally: flat line

[Figure: Time (min, 0 to 10) vs. number of nodes (1 to 32); curves for 32 mln / node and 64 mln / node]
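The two "ideal" shapes follow from simple arithmetic: with perfect strong scaling the time on p nodes is T(1) / p, so log T falls linearly in log p with slope -1, and with perfect weak scaling the time stays constant. A small sketch of how measured timings could be compared against those ideals follows; the function names are illustrative and the sample timings are hypothetical, not values read from the plots.

```python
from math import log2


def strong_scaling_efficiency(times):
    """times: {nodes: minutes} for a fixed input. 1.0 means ideal speedup."""
    base_nodes = min(times)
    base_work = times[base_nodes] * base_nodes  # node-minutes at the baseline
    return {p: base_work / (t * p) for p, t in times.items()}


def weak_scaling_efficiency(times):
    """times: {nodes: minutes} with constant input per node. 1.0 means flat."""
    base = times[min(times)]
    return {p: base / t for p, t in times.items()}


def loglog_slope(times):
    """Slope between the first and last points on log-log axes."""
    lo, hi = min(times), max(times)
    return (log2(times[hi]) - log2(times[lo])) / (log2(hi) - log2(lo))


# Hypothetical timings in minutes, NOT taken from the experiments.
ideal_strong = {1: 64.0, 2: 32.0, 4: 16.0, 8: 8.0}
ideal_weak = {1: 5.0, 2: 5.0, 4: 5.0, 8: 5.0}

strong_eff = strong_scaling_efficiency(ideal_strong)
weak_eff = weak_scaling_efficiency(ideal_weak)
slope = loglog_slope(ideal_strong)
```

For real measurements, efficiencies below 1.0 (or a log-log slope shallower than -1) quantify how far a run falls short of the ideal line.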


Conclusion

• Good scalability, demonstrated in a series of experiments

• A distributed group-finding algorithm for astronomy

• Hadoop implementation


Future Work

• General-purpose astronomy toolkit:

- Distributed computation for other standard astronomy problems: correlation functions, spatial matching, density distribution, spectral analysis, ...

- Massive spatial indices of celestial objects integrated with distributed algorithms


Thanks!

http://www.cs.cmu.edu/~binf/dFOF