K-Means with BSP

Post on 25-May-2015

K-Means Clustering with BSP

Thomas Jungblut, Testberichte.de, 2012

Study assignment 4th semester, HWR Berlin

Content

What is K-Means Clustering?

What is BSP?

K-Means with BSP

What is K-Means Clustering?

Unsupervised Learning

Huge number of input vectors

k initial centers

Two step iterative algorithm

Assignment

Update
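The two steps can be sketched sequentially as follows. This is a minimal illustration, not code from the talk: the function names, the Euclidean distance, the stopping rule (centers no longer change) and the handling of empty clusters are all assumptions.

```python
import math

def assign(vectors, centers):
    """Assignment step: index of the nearest center for every vector."""
    return [min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
            for v in vectors]

def update(vectors, assignment, centers):
    """Update step: move each center to the mean of its assigned vectors."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [v for v, a in zip(vectors, assignment) if a == c]
        if not members:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
            continue
        new_centers.append(tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))))
    return new_centers

def kmeans(vectors, centers, max_iterations=100):
    """Alternate both steps until the centers stop changing (assumed stopping rule)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iterations):
        new_centers = update(vectors, assign(vectors, centers), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For example, `kmeans([(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)], [(0.0, 0.0), (10.0, 0.0)])` settles on the centers `(0.0, 1.0)` and `(10.0, 1.0)`.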

How do we parallelize K-Means?

What is BSP?

BSP = Bulk Synchronous Parallel

Paradigm to design parallel algorithms

Two basic operations

Send message

Barrier synchronization

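[Figure: BSP superstep diagram. Processes P1, P2 and P3 each pass through Computation and Communication; a Sync (barrier) separates one Superstep from the next.]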

Messages are queued during the computation phase

Between two barrier synchronizations, messages are exchanged in bulk

Messages from previous superstep are available in next superstep
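The message semantics can be illustrated with a toy model that runs in a single process. `BspSimulator` and its methods are hypothetical names made up for this sketch and do not belong to any real BSP framework; the point is only that a message sent during a superstep becomes readable after the barrier.

```python
class BspSimulator:
    """Single-process toy model of BSP messaging (hypothetical, not a real framework)."""

    def __init__(self, num_tasks):
        self.inbox = [[] for _ in range(num_tasks)]   # readable in the current superstep
        self.outbox = [[] for _ in range(num_tasks)]  # queued during computation

    def send(self, target, message):
        """Queue a message; the target cannot read it before the next sync."""
        self.outbox[target].append(message)

    def sync(self):
        """Barrier: deliver all queued messages in bulk and start a new superstep."""
        self.inbox = self.outbox
        self.outbox = [[] for _ in self.inbox]

bsp = BspSimulator(3)

# Superstep 0: every task sends its own id to every task.
for task in range(3):
    for peer in range(3):
        bsp.send(peer, task)

before = [list(messages) for messages in bsp.inbox]  # nothing delivered yet
bsp.sync()
after = [list(messages) for messages in bsp.inbox]   # each task now sees all three ids
```

Before the barrier every inbox is empty; after it each of the three tasks can read the ids 0, 1 and 2.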

K-Means with BSP

Partition the dataset into equal sized blocks
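One way to cut the input into blocks is sketched below. The helper name `partition` and the choice of contiguous blocks are assumptions; when the number of vectors is not divisible by the number of tasks, the block sizes differ by at most one.

```python
def partition(vectors, num_tasks):
    """Split vectors into num_tasks contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(vectors), num_tasks)
    blocks, start = [], 0
    for task in range(num_tasks):
        size = base + (1 if task < extra else 0)  # the first `extra` blocks get one more
        blocks.append(vectors[start:start + size])
        start += size
    return blocks
```

With ten vectors and four tasks this yields blocks of sizes 3, 3, 2 and 2.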

Put centers into RAM on each process

Iterate sequentially over vectors on disk

Sum assigned vectors to a new temporary center object

[Figure: the centers, replicated on each process]

Centers and sums

• Center 1: Sum=25, summed 5 times

• Center 2: Sum=50, summed 10 times

• Center 3: Sum=10, summed 5 times

[Figure: four processes, each holding the centers and its own sum]

Send the sum

[Figure: the received sums are added up to a Total Sum; the resulting Means are the New Centers]

• The same calculation on every process

• Floating-point error can be corrected by synchronizing the centers when it exceeds a given threshold

Divide by total increments
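The division can be sketched as a merge of the received messages. The message layout, one `(sum, count)` pair per center, and the function name are assumptions for illustration; the sketch uses one-dimensional sums and assumes every center was assigned at least one vector.

```python
def merge_partial_sums(messages):
    """messages: one list per task, holding a (sum, count) pair for each center."""
    means = []
    for c in range(len(messages[0])):
        total_sum = sum(task[c][0] for task in messages)
        total_count = sum(task[c][1] for task in messages)
        means.append(total_sum / total_count)  # divide by total increments
    return means

# One task's message, using the sums and counts from the example slide.
task_a = [(25.0, 5), (50.0, 10), (10.0, 5)]
# A second task's message; these numbers are made up for the illustration.
task_b = [(15.0, 5), (30.0, 10), (20.0, 5)]
```

Merging only `task_a` gives the means 5.0, 5.0 and 2.0; merging both messages gives 4.0, 4.0 and 3.0. Because every task receives the same messages, every task arrives at the same new centers.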

[Figure: one iteration consists of Assignment, then Sync, then Update]

Partition vectors into equal sized blocks (# Blocks = # Tasks)

Put centers in RAM

Assignment phase

• Iterate over the vectors on disk sequentially

• Sum up temporary centers with assigned vectors

• Message all tasks with the sum and how often something was summed

Update phase

• Calculate the total sum over all received messages and average

• Replace old centers with new centers and calculate convergence
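The whole procedure can be simulated in a single process, with a loop standing in for the parallel tasks. This is a sketch under assumptions and not the original implementation: the function names, the in-memory blocks (instead of vectors on disk), the Euclidean distance and the convergence test on center movement are all illustrative choices.

```python
import math

def local_assignment(block, centers):
    """Assignment phase of one task: per-center vector sums and counts for its block."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for v in block:
        nearest = min(range(len(centers)), key=lambda c: math.dist(v, centers[c]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += v[d]
    return sums, counts

def update_centers(messages, centers):
    """Update phase: every task would run this on the same messages."""
    new_centers = []
    for c, old in enumerate(centers):
        total = sum(counts[c] for _, counts in messages)
        if total == 0:
            new_centers.append(old)  # assumption: keep a center nothing was assigned to
            continue
        new_centers.append(tuple(
            sum(sums[c][d] for sums, _ in messages) / total for d in range(len(old))))
    return new_centers

def bsp_kmeans(blocks, centers, max_supersteps=50, tolerance=1e-9):
    centers = [tuple(c) for c in centers]
    for _ in range(max_supersteps):
        # Assignment phase: each list element stands in for one parallel task's message.
        messages = [local_assignment(block, centers) for block in blocks]
        # A real BSP run has a barrier sync here; the messages are readable afterwards.
        new_centers = update_centers(messages, centers)
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= tolerance:  # assumed convergence test: centers barely moved
            break
    return centers
```

Called with the two blocks `[(0.0, 0.0), (10.0, 0.0)]` and `[(0.0, 2.0), (10.0, 2.0)]` and the initial centers `(0.0, 0.0)` and `(10.0, 0.0)`, it converges to `(0.0, 1.0)` and `(10.0, 1.0)`. Only the sums and counts cross task boundaries, so the message size depends on the number of centers and not on the number of vectors.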

Benchmark

16 servers, 256 cores, 10G network

80 seconds!

Possible starvation: add more servers

Logarithmic scaling

Much better than linear scaling of MapReduce
