Bin Fu
Eugene Fink, Julio López, Garth Gibson
Carnegie Mellon University
Astronomy application of Map-Reduce:
A distributed Friends-of-Friends algorithm
Bin Fu © November 2009 http://www.pdl.cmu.edu/
Motivation
• Future science will be increasingly driven by huge data sets
• Analyzing them requires effective parallel tools for large-scale data, such as MapReduce
• We aim to help domain scientists by solving the problems they care about more effectively
• Astronomy is our first step
Astronomy Dataset
Sky surveys:
• Sloan Digital Sky Survey (2000–2008): 230 million objects, 50 TByte
• Pan-STARRS (just started): order of magnitude larger
• Large Synoptic Survey Telescope (2016): two orders of magnitude larger
Simulations:
• McWilliams Center at CMU: black holes and dark matter, 15B particles, 14 TByte / run
• LANL Coyote universe: 1B particles, 1 TByte / run, 30 runs
Friends of Friends (FoF) technique:
• Two galaxies are “friends” if they are close to each other
• We analyze an undirected graph, where galaxies are vertices and their “friendships” are edges
• We need to identify the connected components
• From astronomers: The number of connected components and their sizes reflect properties of the universe
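The graph view above can be sketched in a few lines. This is a minimal brute-force illustration, not the authors' implementation: it assumes "close" means "within a fixed distance", and the names `fof_groups` and `linking_length` are hypothetical. It compares every pair of points and merges friends with a union-find structure, so each connected component ends up with one label.

```python
from itertools import combinations
from math import dist


def fof_groups(points, linking_length):
    """Label each (x, y, z) point with the smallest index in its connected component."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: two objects are "friends" when their distance is at most linking_length.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            ri, rj = find(i), find(j)
            if ri != rj:
                # Keep the smaller index as root so labels are deterministic.
                parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(points))]


points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0)]
labels = fof_groups(points, linking_length=1.5)
sizes = {g: labels.count(g) for g in set(labels)}
```

Here the first three points form a chain of friends and the fourth is isolated, so `sizes` holds the number of components and their sizes, the quantities the astronomers care about. The pairwise loop is quadratic and only meant for small inputs.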
Friends of Friends (FoF) technique:
Sequential algorithms
• Exact: O((n ∙ log n)^1.5)
• Approximate: O(n)
When n is VERY large, parallel processing is required
Input
• (id, x, y, z) for each object
Output
• (id, group-id) for each object
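To make the input and output formats concrete, here is a small sequential sketch that reads (id, x, y, z) records and returns (id, group-id) pairs. It is neither the exact nor the approximate algorithm cited above: it swaps in a simple uniform-grid neighbor search, and it assumes unique, comparable ids and a fixed linking length. The function name `fof_sequential` is hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def fof_sequential(objects, linking_length):
    """objects: iterable of (id, x, y, z). Returns a list of (id, group_id)."""
    objects = list(objects)
    parent = {obj[0]: obj[0] for obj in objects}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Hash objects into grid cells with side = linking length, so every friend
    # of an object lies in the same cell or in one of the 26 adjacent cells.
    cells = defaultdict(list)
    for obj in objects:
        key = tuple(floor(c / linking_length) for c in obj[1:])
        cells[key].append(obj)

    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbors = cells.get((cx + dx, cy + dy, cz + dz), ())
            for a in members:
                for b in neighbors:
                    # a[0] < b[0] makes sure each pair is examined only once.
                    if a[0] < b[0] and dist(a[1:], b[1:]) <= linking_length:
                        ra, rb = find(a[0]), find(b[0])
                        if ra != rb:
                            parent[max(ra, rb)] = min(ra, rb)

    # The group id is the smallest object id in the group.
    return [(obj[0], find(obj[0])) for obj in objects]


catalog = [(10, 0, 0, 0), (11, 0.5, 0, 0), (12, 5, 5, 5), (13, 5.4, 5, 5), (14, -3, 0, 0)]
groups = fof_sequential(catalog, linking_length=1.0)
```

Using the smallest member id as the group-id is a convenience of this sketch; the slides do not say how group ids are chosen.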
Distributed Friends of Friends (dFoF)
• Traditionally (HPC community): divide the space into cubes
• For this application, however, that approach requires more work (e.g. communication) at the later merge step
Distributed Friends of Friends (dFoF)
• Instead, divide the space into “slightly overlapping” cubes
  - Randomly select a subset of objects
  - Apply the kd-tree construction
• Distributed computation: Apply a sequential FoF algorithm to find the “local groups” within each cube
  - Send each object to corresponding cubes
  - Allocate different processors to cubes
• Identify cross-cube edges and merge the respective “local groups”
  - Apply the Union-Find algorithm to the galaxies in the cube overlaps
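The steps above can be sketched as a single-process simulation. This is an illustration under stated assumptions, not the authors' code: each cube is padded by one linking length on every side (the slides only say "slightly overlapping"), the local step is a brute-force FoF, and all names (`local_fof`, `kd_boxes`, `dfof`, `sample_size`, `seed`) are hypothetical.

```python
import random
from math import dist, inf


def local_fof(objs, link):
    """Brute-force FoF on (id, x, y, z) tuples; maps id -> smallest id in its local group."""
    root = {o[0]: o[0] for o in objs}

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]
            a = root[a]
        return a

    for i, a in enumerate(objs):
        for b in objs[i + 1:]:
            if dist(a[1:], b[1:]) <= link:
                ra, rb = find(a[0]), find(b[0])
                root[max(ra, rb)] = min(ra, rb)
    return {o[0]: find(o[0]) for o in objs}


def kd_boxes(sample, depth, lo=(-inf,) * 3, hi=(inf,) * 3):
    """Split space into up to 2**depth boxes at medians of the sampled positions."""
    if depth == 0 or len(sample) < 2:
        return [(lo, hi)]
    axis = depth % 3
    pts = sorted(sample, key=lambda p: p[axis])
    cut = pts[len(pts) // 2][axis]
    left = [p for p in pts if p[axis] < cut]
    right = [p for p in pts if p[axis] >= cut]
    hi_left = hi[:axis] + (cut,) + hi[axis + 1:]
    lo_right = lo[:axis] + (cut,) + lo[axis + 1:]
    return kd_boxes(left, depth - 1, lo, hi_left) + kd_boxes(right, depth - 1, lo_right, hi)


def dfof(objects, link, depth=2, sample_size=8, seed=0):
    """Return (id, group_id) pairs; group_id is the smallest id in the global group."""
    objects = list(objects)
    rng = random.Random(seed)
    sample = [o[1:] for o in rng.sample(objects, min(sample_size, len(objects)))]
    boxes = kd_boxes(sample, depth)

    def inside(p, lo, hi):
        # Assumption: pad each cube by one linking length on every side.
        return all(l - link <= c < h + link for c, l, h in zip(p, lo, hi))

    # Local step: one sequential FoF per padded cube (one processor each in a real run).
    local = [local_fof([o for o in objects if inside(o[1:], lo, hi)], link)
             for lo, hi in boxes]

    # Merge step: union-find over (local label, cube) keys.
    parent = {}

    def find(k):
        parent.setdefault(k, k)
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    first_key = {}
    for cube, labels in enumerate(local):
        for oid, label in labels.items():
            key = (label, cube)
            if oid in first_key:
                # Seen in an earlier cube, so it lies in an overlap: merge both local groups.
                ra, rb = find(first_key[oid]), find(key)
                parent[max(ra, rb)] = min(ra, rb)
            else:
                first_key[oid] = key
    return [(o[0], find(first_key[o[0]])[0]) for o in objects]
```

With a padding of one linking length, both endpoints of any cross-cube edge appear together in at least one padded cube, so merging the local groups that share an overlap object recovers the same groups as a single sequential run. The merge only touches objects that were sent to more than one cube.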
Distributed Friends of Friends (dFoF)
• Pluggable: the sequential FoF can easily be replaced by other similar group-finding algorithms
• Avoid explicit cross-processor communication
• Scalability
• Applicable to other problems
Implementation Properties
• Avoid explicit communication
• Optional out-of-core processing if resources are insufficient
• Fits Hadoop’s <key, value> structure naturally, using 3 Map phases and 2 Reduce phases
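To illustrate how the computation could fit a <key, value> structure, the sketch below simulates one map/shuffle/reduce round in plain Python. It does not reproduce the 3 Map and 2 Reduce phases of the actual implementation, whose details are not given here, and it does not use the Hadoop API. The constants `LINK` and `CUBE`, the uniform cube grid, and all function names are assumptions made for this example.

```python
from collections import defaultdict
from math import dist, floor

LINK = 1.0   # hypothetical linking length
CUBE = 10.0  # hypothetical cube side


def map_to_cubes(obj):
    """Map: emit <cube key, object> for every cube whose padded region contains the object."""
    axes = [range(floor((c - LINK) / CUBE), floor((c + LINK) / CUBE) + 1) for c in obj[1:]]
    for kx in axes[0]:
        for ky in axes[1]:
            for kz in axes[2]:
                yield (kx, ky, kz), obj


def reduce_local_fof(cube, objs):
    """Reduce: find local groups in one cube; emit <id, (cube key, local label)>."""
    label = {o[0]: o[0] for o in objs}
    changed = True
    while changed:  # naive label propagation until no label can shrink further
        changed = False
        for a in objs:
            for b in objs:
                if dist(a[1:], b[1:]) <= LINK and label[b[0]] < label[a[0]]:
                    label[a[0]] = label[b[0]]
                    changed = True
    for o in objs:
        yield o[0], (cube, label[o[0]])


def run_round(records, mapper, reducer):
    """In-memory stand-in for one map/shuffle/reduce round."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output
```

An object near a cube boundary is emitted under several keys, which is how the overlap is expressed without explicit communication: the shuffle delivers each copy to the reducer that owns the cube. A later round keyed on object id could then merge the local labels of objects that appear in more than one cube.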
Disc Cloud Cluster
• 64 nodes
• 8 cores per node, 2.83 GHz
• 16 GByte memory per node
• 10 GBit / second network
Strong Scalability Experiments
[Figure: time (min) versus number of nodes (1, 2, 4, 8, 16, 32) on a log-log scale, one curve per input size: 0.5 bln, 0.9 bln*, 1 bln, 15 bln]
• Input data constant
• Change the number of nodes
• Log-log scale; ideally: straight line
* University of Washington
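Why the ideal curve is a straight line on a log-log plot: with perfect strong scaling the running time is T(p) = T(1) / p, so log T(p) = log T(1) - log p, a line with slope -1. The sketch below checks this on synthetic timings generated from that formula; they are not measurements from the experiments, and the helper name `loglog_slope` is hypothetical.

```python
from math import log2


def loglog_slope(nodes, minutes):
    """Least-squares slope of log2(time) against log2(number of nodes)."""
    xs = [log2(n) for n in nodes]
    ys = [log2(t) for t in minutes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


nodes = [1, 2, 4, 8, 16, 32]
ideal_minutes = [64.0 / p for p in nodes]  # synthetic: T(p) = T(1) / p with T(1) = 64
slope = loglog_slope(nodes, ideal_minutes)
```

A slope close to -1 on measured timings would indicate near-ideal strong scaling; a slope closer to 0 means adding nodes helps less.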
Weak Scalability Experiments
[Figure: time (min) versus number of nodes (1, 2, 4, 8, 16, 32), one curve per workload: 32 mln / node and 64 mln / node]
• Proportionally change the input size and # nodes
• Constant workload for each node
• Ideally: flat line
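In a weak-scaling run the total input grows with the node count while the per-node workload stays fixed. The small sketch below works out the total input sizes implied by the two per-node workloads and defines the usual weak-scaling efficiency, T(1) / T(p), which equals 1 for a flat line. The function names are hypothetical and no measured timings are assumed.

```python
def total_input_mln(per_node_mln, nodes):
    """Total input size (in millions of objects) for a fixed per-node workload."""
    return per_node_mln * nodes


def weak_efficiency(minutes_one_node, minutes_p_nodes):
    """Weak-scaling efficiency: 1.0 means the curve is perfectly flat."""
    return minutes_one_node / minutes_p_nodes


NODES = [1, 2, 4, 8, 16, 32]
totals_32 = [total_input_mln(32, p) for p in NODES]
totals_64 = [total_input_mln(64, p) for p in NODES]
```

At 32 nodes the two workloads correspond to 1024 mln and 2048 mln objects in total.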
Conclusion
• Good scalability, shown in a series of experiments
• A distributed astronomical group-finding algorithm
• Hadoop implementation
Future Work
• General-purpose astronomy toolkit:
  - Distributed computation for other standard astronomy problems: correlation functions, spatial matching, density distribution, spectral analysis, ...
  - Massive spatial indices of celestial objects integrated with distributed algorithms