Evaluating a Processing-in-Memory Architecture with the k-means Algorithm
Simon Bihel [email protected]
Daniel [email protected]
De Moor [email protected]
Thomas [email protected]
4, 2017
University of Rennes I / École Normale Supérieure de Rennes
With Help From…
Dominique Lavenier [email protected]
David Furodet & the Upmem Team [email protected]
Context
• Big Data workloads
• End of Dennard scaling
• End of Moore's law
• Shift towards data-centric architectures (exascale)
• Bandwidth and memory walls
Table of contents
1. The Upmem Architecture
2. k-means Implementation for the Upmem Architecture
3. Experimental Evaluation
The Upmem Architecture
Upmem architecture overview

[Figure: a CPU connected over the DDR bus to a DIMM; the DIMM embeds DPUs 0 to 255, each with its own WRAM and MRAM]

DPU: DRAM Processing Unit
WRAM: execution memory for programs
MRAM: main memory
DIMM: dual in-line memory module
A massively parallel architecture
Characteristics

• Several DIMMs can be added to a CPU
• A 16-GByte DIMM embeds 256 DPUs
• Each DPU can support up to 24 threads

The context is switched between DPU threads every clock cycle.
The programming approach has to consider this fine-grained parallelism.
Upmem Architecture Overview

On the programming level, two programs must be specified:
• Host program: runs on the CPU and orchestrates the execution.
• Tasklets: run on the DPUs and perform the data-intensive operations.

Communication between host and DPUs goes through:
• MRAM
• Mailboxes
Drawbacks and advantages
Drawbacks: computation power

• Frequency around 750 MHz
• No floating-point operations
• Significant multiplication overhead (no hardware multiplier)
• Explicit memory management

Advantages: data access

• Parallelization power
• Minimum latency
• Increased bandwidth
• Reduced power consumption
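To give an intuition for the multiplication overhead, here is a minimal sketch of the kind of shift-and-add routine a core without a hardware multiplier must run in software. The function name and the bit-width parameter are ours, not Upmem's; the point is that the loop does work proportional to the operand width, which is why a single multiply can cost tens of instructions.

```python
def soft_mul(a, b, width=32):
    # Shift-and-add multiplication: roughly one add/shift pair per bit,
    # which is why a software multiply costs on the order of tens of
    # instructions on a core with no hardware multiplier.
    mask = (1 << width) - 1
    result = 0
    for _ in range(width):
        if b & 1:
            result = (result + a) & mask  # accumulate the shifted partial product
        a = (a << 1) & mask               # shift the multiplicand
        b >>= 1                           # consume one bit of the multiplier
    return result
```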
k-means Implementation for the Upmem Architecture
k-means Clustering Problem
Partition data ∈ R^(n×m) into k clusters C_1 … C_k
n (resp. m): number of points (resp. attributes)
d: Euclidean distance

argmin_C Σ_{i=1}^{k} Σ_{p ∈ C_i} d(p, mean(C_i))

Examples of applications:
• Segmentation
• Communities in social networks
• Market research
• Gene sequence analysis
k-means Standard Algorithm [6]
1:  function k-means(k, data, δ)
2:    Choose C̃ := (c̃_1 … c̃_k) initial centroids
3:    repeat
4:      C := C̃
5:      for all points p ∈ data do
6:        j := argmin_i d(p, c̃_i)        ▷ Find nearest cluster
7:        Assign p to cluster C_j
8:      end for
9:      for all i in {1 … k} do
10:       c̃_i := mean(p ∈ C_i)           ▷ Compute new centroids
11:     end for
12:   until ‖C̃ − C‖ ≤ δ                  ▷ Convergence criterion
13:   return C̃                            ▷ Return the final centroids
14: end function
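The pseudocode above can be sketched as a plain sequential reference implementation. This is our illustration, not the code that ran on the DPUs; it uses squared distances and integer means, since the hardware in question has no floating-point unit.

```python
import random

def squared_distance(p, c):
    # Squared Euclidean distance: the square root is unnecessary for
    # nearest-centroid comparisons, and avoiding it suits integer-only hardware.
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans(k, data, delta=0):
    # Lloyd's standard algorithm [6], integer arithmetic throughout.
    centroids = random.sample(data, k)
    while True:
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in data:
            j = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[j].append(p)
        # Update step: each new centroid is the (integer) mean of its cluster;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) // len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Convergence criterion: total centroid movement at most delta.
        moved = sum(squared_distance(a, b)
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved <= delta:
            return centroids
```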
k-means algorithm on Upmem
Computations

[Figure: host/DPU execution flow]
1. HOST: read the data input and choose the initial centroids.
2. HOST: distribute the points across the DPUs.
3. HOST: send the current centroids to the DPUs.
4. DPUs: run the centroids update over their local points.
5. HOST: test convergence; if not reached, go back to step 3, otherwise output the results.
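The split between steps 3-5 can be sketched as a map/reduce over the point partitions: each DPU accumulates per-cluster coordinate sums and counts over its local points, and the host reduces them into the new centroids. This is our conceptual sketch of the data flow, not the actual tasklet code; function names are ours.

```python
def dpu_partial_update(points, centroids):
    # What one DPU would do over its slice of the points:
    # accumulate per-cluster coordinate sums and point counts.
    k, dim = len(centroids), len(centroids[0])
    sums = [[0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        for x in range(dim):
            sums[j][x] += p[x]
        counts[j] += 1
    return sums, counts

def host_update(partitions, centroids):
    # The host reduces the per-DPU partial sums into the new centroids.
    k, dim = len(centroids), len(centroids[0])
    total = [[0] * dim for _ in range(k)]
    counts = [0] * k
    for points in partitions:
        sums, cnts = dpu_partial_update(points, centroids)
        for i in range(k):
            counts[i] += cnts[i]
            for x in range(dim):
                total[i][x] += sums[i][x]
    # Integer mean per cluster; an empty cluster keeps its old centroid.
    return [tuple(total[i][x] // counts[i] for x in range(dim))
            if counts[i] else centroids[i] for i in range(k)]
```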
Implementation & Memory Management
• int type to store distances (easy to overflow with distances)

MRAM layout:
• Global variables (e.g. number of points)
• Centers
• Points
• New centers
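The overflow risk on the first bullet can be made concrete with a back-of-the-envelope bound (ours, not from the slides): a signed fixed-width squared distance over m attributes overflows as soon as m times the squared per-coordinate difference exceeds 2^(bits-1) - 1.

```python
def max_safe_coordinate_diff(m, bits=32):
    # Largest per-coordinate difference d such that the squared Euclidean
    # distance over m attributes, m * d**2, still fits in a signed
    # `bits`-wide integer. Illustrative helper, names are ours.
    limit = 2 ** (bits - 1) - 1
    d = int((limit // m) ** 0.5)   # float estimate, then correct it exactly
    while (d + 1) ** 2 * m <= limit:
        d += 1
    while d ** 2 * m > limit:
        d -= 1
    return d
```

For one attribute the bound is the familiar 46340; with more attributes it shrinks quickly, which is why plausible integer datasets can already overflow a 32-bit distance accumulator.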
Experimental Evaluation
Experimental Setup
Simulator

• Architecture not yet manufactured
• Cycle-accurate simulator

Datasets

• int values
• Randomly generated (not uniformly; with clusters)

We could not find ready-to-use large integer datasets.

[Figure: scatter plot of a randomly generated clustered dataset; both axes range roughly from -200 to 1000]
Number of Threads
[Figure: runtime as a function of the number of threads (0 to 25), for three datasets with a high number of:]
• points (N=1000000, D=10, K=5)
• dimensions (N=500000, D=34, K=3)
• centroids (N=100000, D=2, K=10)

Note: the three curves are not plotted on the same runtime scale.
Number of DPUs
[Figure: runtime in seconds (0 to 80) as a function of the number of DPUs (0 to 35), always with the same number of points]

The runtime is divided by the number of DPUs.
Comparison with sequential k-means
Dataset: Many Points
Algorithm      16 DPUs   1-core SeqC
Runtime (s)    1.568     0.268
Faster than SeqC from 94 DPUs onwards.
Comparison with sequential k-means
Dataset: Many Dimensions
Algorithm      16 DPUs   1-core SeqC
Runtime (s)    4.534     0.119
Faster than SeqC from 610 DPUs onwards.
A large number of dimensions entails a large number of multiplications to compute distances.
Comparison with sequential k-means
Dataset: Many Centers
Algorithm      16 DPUs   1-core SeqC
Runtime (s)    0.4353    0.0142
Faster than SeqC from 491 DPUs onwards.
A large number of centers provides a large amount of computation per memory transfer [2].
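The break-even DPU counts above follow directly from the per-DPU scaling shown on the previous slide: if runtime is divided by the number of DPUs, the 16-DPU measurement extrapolates to match the sequential baseline at 16 × t_16 / t_seq DPUs, rounded up.

```python
import math

def break_even_dpus(runtime_16_dpus, runtime_seq, measured_dpus=16):
    # Assuming runtime scales as 1/#DPUs, the smallest DPU count that
    # matches the 1-core sequential C baseline is the extrapolated ratio,
    # rounded up to a whole number of DPUs.
    return math.ceil(measured_dpus * runtime_16_dpus / runtime_seq)
```

Plugging in the three measured runtimes reproduces the 94, 610, and 491 figures from the tables.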
Conclusion
Conclusion
• Ideal use case: programs with very little computation (e.g. genomic text processing [4, 5])
• Even if there is no gain in time, power consumption might be reduced
• Overflows occur when computing distances
• We implemented k-means++ [1] with the GMP library (arbitrary-precision numbers), but the metric of interest was the time per iteration
Going Further with the Hardware
Actual Physical Device
• Evaluate how the program behaves at large scale
• Impact on the DDR bus & communications

Hardware Multiplication

• Currently, 40% of the instructions are multiplications, at about 30 instructions per multiplication
Going Further with the k-means
Keep the distance to the current nearest centroid [3]
Easy to add to our implementation: keep the distance on the DPU.
+ Avoids useless computations during the next iteration
− Reduces the number of points that fit per DPU

Define a border made of points that can switch cluster [7]
Harder to integrate.
+ Reduces the number of distance computations
− Might involve the CPU
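The first optimization can be sketched as a guarded assignment step: when a point is now at least as close to its current centroid as the distance cached from the previous iteration, the full scan over all centroids is skipped. This is our illustration of the idea from [3]; the function names are hypothetical and the invariants that make the shortcut safe are argued in that paper.

```python
def squared_dist(p, c):
    # Squared Euclidean distance, consistent with the integer-only setting.
    return sum((a - b) ** 2 for a, b in zip(p, c))

def assign_with_cache(points, centroids, assignment, cached):
    # Guarded assignment step: skip the full nearest-centroid search when
    # the cached distance shows the point did not move away from its
    # current centroid.
    for idx, p in enumerate(points):
        d_current = squared_dist(p, centroids[assignment[idx]])
        if d_current <= cached[idx]:
            cached[idx] = d_current   # still nearest: no full search needed
        else:
            j = min(range(len(centroids)),
                    key=lambda i: squared_dist(p, centroids[i]))
            assignment[idx] = j
            cached[idx] = squared_dist(p, centroids[j])
    return assignment, cached
```

Only the per-point cached distance has to live on the DPU, which is what makes the scheme easy to add to the implementation, at the cost of MRAM space.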
Thank You
References

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.
[2] M. A. Bender, J. Berry, S. D. Hammond, B. Moore, B. Moseley, and C. A. Phillips. k-means clustering on two-level memory systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 197-205, New York, NY, USA, 2015. ACM.
[3] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan. An efficient enhanced k-means clustering algorithm. Journal of Zhejiang University-SCIENCE A, 7(10):1626-1633, 2006.
[4] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. BLAST on UPMEM. Research Report RR-8878, INRIA Rennes - Bretagne Atlantique, Mar. 2016.
[5] D. Lavenier, C. Deltel, D. Furodet, and J.-F. Roy. MAPPING on UPMEM. Research Report RR-8923, INRIA, June 2016.
[6] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
[7] C. M. Poteraș, M. C. Mihăescu, and M. Mocanu. An optimized version of the k-means clustering algorithm. In Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, pages 695-699. IEEE, 2014.