Parallel C3M. Aylin Tokuç, Erkan Okuyan, Özlem Gür.

Post on 22-Dec-2015


Parallel C3M

Aylin Tokuç, Erkan Okuyan, Özlem Gür

Outline

• Basics of Parallel computing

• Sequential C3M

• Parallel C3M

Parallel Computation

Decomposition: The process of dividing a computation into smaller parts.

Task: A programmer-defined unit of computation into which the main computation is subdivided by means of decomposition.

Parallel Computation: Primary Considerations

• Load Balancing

• Minimizing Communication

• Task Dependency Optimization

Parallel Computation: Load Balancing

Parallel Computation: Minimizing Communication

Parallel Computation: Task Dependency Optimization

C3M Algorithm

1- Determine the cluster seeds of the database.

2- For each document di that is not a cluster seed, find the cluster seed (if any) that maximally covers di and assign di to that seed's cluster.

3- If there remain unclustered documents, group them into a ragbag cluster.

C3M Formulas
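The formulas on this slide did not survive in the transcript. As a reference sketch, the usual cover-coefficient definitions for a binary document-term matrix D with m documents and n terms can be written as below. The symbols α, β, δ, nc and Pi appear on the slides; the coupling coefficient ψ and the index names are assumptions taken from the standard C3M formulation, not from the slide itself.

```latex
\begin{align*}
\alpha_i &= \Big[\sum_{j=1}^{n} d_{ij}\Big]^{-1}, &
\beta_j  &= \Big[\sum_{k=1}^{m} d_{kj}\Big]^{-1} \\
c_{il}   &= \alpha_i \sum_{j=1}^{n} d_{ij}\,\beta_j\,d_{lj}, &
\delta_i &= c_{ii}, \qquad \psi_i = 1 - \delta_i \\
n_c      &= \sum_{i=1}^{m} \delta_i, &
P_i      &= \delta_i\,\psi_i \sum_{j=1}^{n} d_{ij}
\end{align*}
```

Here c_il measures how much document i is covered by document l, so each row of C sums to 1.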

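C3M – Sample Matrices

$$
D = \begin{bmatrix}
0 & 0 & 0 & 1 & 0 & 1 \\
1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 & 0 & 1
\end{bmatrix}
\qquad
C = \begin{bmatrix}
.375 & 0.0 & .125 & .375 & .125 \\
0.0 & .417 & .417 & 0.0 & .167 \\
.083 & .277 & .361 & .083 & .194 \\
.188 & 0.0 & .063 & .563 & .188 \\
.083 & .111 & .194 & .25 & .361
\end{bmatrix}
$$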
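The sample C matrix above can be reproduced from D with a few lines of code. The following is a minimal sketch, assuming the standard definition c_il = α_i Σ_j d_ij β_j d_lj with α and β the reciprocal row and column sums; the function name `cover_coefficients` is hypothetical.

```python
def cover_coefficients(D):
    """Return the m x m cover-coefficient matrix for a document-term matrix D.

    Assumes every row and every column of D has at least one non-zero entry.
    """
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]  # reciprocal row sums
    beta = [1.0 / sum(D[k][j] for k in range(m)) for j in range(n)]  # reciprocal column sums
    return [
        [alpha[i] * sum(D[i][j] * beta[j] * D[l][j] for j in range(n)) for l in range(m)]
        for i in range(m)
    ]


# Sample D matrix from the slide (5 documents, 6 terms).
D = [
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],
]

C = cover_coefficients(D)
for row in C:
    print(" ".join(f"{c:.3f}" for c in row))
```

The printed values agree with the slide's C matrix up to rounding in the third decimal place.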

Parallel C3M – Distribution

Distribute rows among processors

Load balancing by cyclic block distribution
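A cyclic block distribution deals fixed-size blocks of rows to the processors in round-robin order, so long and short documents are spread more evenly than with one contiguous chunk per processor. The sketch below illustrates the row-to-processor mapping only; the function names and the block size are hypothetical, since the slides do not state them.

```python
def block_cyclic_owner(row, block_size, num_procs):
    """Rank of the processor that owns `row` under a block-cyclic row distribution."""
    return (row // block_size) % num_procs


def local_rows(rank, num_rows, block_size, num_procs):
    """All row indices assigned to processor `rank`."""
    return [r for r in range(num_rows)
            if block_cyclic_owner(r, block_size, num_procs) == rank]


# Example: 10 rows, blocks of 2 rows, 3 processors.
for rank in range(3):
    print(rank, local_rows(rank, 10, 2, 3))
```

With these parameters processor 0 holds rows 0, 1, 6, 7, processor 1 holds rows 2, 3, 8, 9, and processor 2 holds rows 4, 5.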

Local Calculations

All processors calculate α, partial β and Pi

Current method for weighted matrix: too costly

Needs column vectors (but the matrix is row-wise partitioned)

Seed Powers Pi

• Seed power Pi should be small for a document whose terms appear in too many or too few documents.

• Seed power Pi should be larger for a document whose terms appear in a moderate number of documents.

Minimize Communication - Proposed Heuristic

$$\beta'_i = \sum_{k=1}^{m} \min(1, d_{ki})$$

(β’i is the # of non-zeros in column i)

[Equation: heuristic seed power Pi, summed over j = 1..n, in terms of dij, β’j and m]

All processors calculate α, partial β and β’

Effectiveness of Heuristic

• A MATLAB script was written to evaluate the effectiveness of the proposed heuristic.

• Correlation Coefficient = 0.95

Communication between Processors

• Partial β and β’ vectors are exchanged between processors to calculate the final β and β’ vectors.

• Then, all processors calculate cii = δi
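The exchange step can be pictured as an element-wise sum of per-processor vectors, which is what an all-reduce would compute. The sketch below simulates the processors with plain lists instead of a message-passing library. It assumes that "partial β" means the partial column sums over a processor's own rows (inverted only after summation) and that β’ counts non-zeros per column; the function names and the three-way split of the sample D matrix are hypothetical.

```python
def partial_column_stats(rows, n):
    """Partial column sums and partial non-zero counts over one processor's rows."""
    col_sum = [0.0] * n
    nnz = [0] * n
    for row in rows:
        for j, d in enumerate(row):
            col_sum[j] += d
            nnz[j] += 1 if d != 0 else 0  # equals min(1, d) for non-negative integer weights
    return col_sum, nnz


def combine(partials):
    """Element-wise sum of all partial vectors, then invert the column sums to get beta."""
    n = len(partials[0][0])
    total_sum = [sum(p[0][j] for p in partials) for j in range(n)]
    beta_prime = [sum(p[1][j] for p in partials) for j in range(n)]
    beta = [1.0 / s for s in total_sum]  # assumes every term occurs in at least one document
    return beta, beta_prime


D = [
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],
]
parts = [D[0:2], D[2:4], D[4:5]]  # hypothetical split across three processors
beta, beta_prime = combine([partial_column_stats(p, 6) for p in parts])
print(beta_prime)
```

Only two length-n vectors per processor cross the network in this scheme, regardless of how many rows each processor holds.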

# of Clusters

• Processors exchange local δ

• All processors calculate nc

$$n_c = \sum_{i=1}^{m} \delta_i$$
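Once the final β vector is known everywhere, each processor can compute δi for its own rows without further communication, and nc follows from the exchanged values. The sketch below simulates this on the sample D matrix, assuming δi = cii = αi Σj dij βj dij; the helper name `decoupling` and the split into three processors are hypothetical.

```python
def decoupling(row, beta):
    """delta_i = c_ii = alpha_i * sum_j d_ij * beta_j * d_ij for one document row."""
    alpha = 1.0 / sum(row)
    return alpha * sum(d * b * d for d, b in zip(row, beta))


D = [
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],
]
beta = [1 / 3, 1 / 2, 1 / 2, 1 / 2, 1.0, 1 / 4]  # reciprocal column sums of D
parts = [D[0:2], D[2:4], D[4:5]]  # hypothetical split across three processors

# Each processor computes delta for its local rows only.
local_deltas = [[decoupling(row, beta) for row in part] for part in parts]

# After the exchange, every processor can sum all delta values.
n_c = sum(delta for deltas in local_deltas for delta in deltas)
print(round(n_c, 3))
```

For the sample matrix the sum of the diagonal of C is about 2.08, which suggests two clusters.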

Cluster-head Selection

• Calculate seed power of local documents

• Exchange largest nc seed powers.

• Calculate largest nc seed powers among all Pi and find cluster heads.

[Equation: heuristic seed power Pi, summed over j = 1..n, in terms of dij, β’j and m]
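The selection steps above amount to a distributed top-k: each processor sends only its nc largest seed powers, and the global nc largest are picked from the merged candidates. The sketch below simulates this with lists; the function names are hypothetical and the seed-power values are illustrative numbers, not results from the slides.

```python
import heapq


def local_top(seed_powers, nc):
    """Largest nc (seed_power, doc_id) pairs among one processor's documents."""
    return heapq.nlargest(nc, seed_powers)


def global_seeds(all_local_tops, nc):
    """Document ids of the nc largest seed powers over all exchanged candidates."""
    merged = [pair for top in all_local_tops for pair in top]
    return sorted(doc_id for _, doc_id in heapq.nlargest(nc, merged))


# Illustrative (seed_power, doc_id) pairs held by three processors.
local = [
    [(0.47, 0), (0.49, 1)],
    [(0.69, 2), (0.98, 3)],
    [(0.70, 4)],
]
nc = 2
tops = [local_top(p, nc) for p in local]
seeds = global_seeds(tops, nc)
print(seeds)
```

Because a global top-nc element must also be in the top nc of the processor that owns it, exchanging nc candidates per processor is sufficient.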

Clustering Non-seed Docs

• Exchange seed documents

• Cluster non-seed documents (as in sequential C3M) in each processor.
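After the seed rows have been exchanged, each processor can assign its own non-seed documents independently. The sketch below assumes that "maximally covers" means the seed l with the largest cover coefficient c_il for document i, and that a document covered by no seed goes to the ragbag cluster (marked `None`). The function name, the choice of seeds, and the tie-breaking rule (first seed seen wins) are assumptions for illustration.

```python
def assign_to_seeds(local_docs, seeds, beta):
    """Map each non-seed doc id to the seed id that maximally covers it, or None."""
    clusters = {}
    for doc_id, row in local_docs.items():
        if doc_id in seeds:
            continue  # seed documents head their own clusters
        alpha = 1.0 / sum(row)
        best_seed, best_cover = None, 0.0
        for seed_id, seed_row in seeds.items():
            # c_il: extent to which document i is covered by seed l
            cover = alpha * sum(d * b * s for d, b, s in zip(row, beta, seed_row))
            if cover > best_cover:
                best_seed, best_cover = seed_id, cover
        clusters[doc_id] = best_seed
    return clusters


D = [
    [0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 0, 0, 1],
]
beta = [1 / 3, 1 / 2, 1 / 2, 1 / 2, 1.0, 1 / 4]  # reciprocal column sums of D
seeds = {1: D[1], 3: D[3]}  # hypothetical seed choice (0-based document ids)
docs = {i: row for i, row in enumerate(D)}
clusters = assign_to_seeds(docs, seeds, beta)
print(clusters)
```

Each processor would call this on its local rows only, so no communication is needed beyond the initial exchange of seed documents.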

Future Work

• Term Based Clustering

• Overlapping Clusters

C3M Summary

• Load balancing with cyclic block distribution

• Communication minimization by a new heuristic

• Task dependency minimized with block distribution & heuristic.


Questions?

The End

Thank you for your patience