
Parallel Datastore System for Parallel Java

MSCS Capstone Project

by Omonbek Salaev

January 22, 2010

Overview

- Introduction
- Parallel Java Library
- DoubleMatrixFile
- Parallel Datastore System (PDS)
- Test Programs
- Results
- Conclusions

Introduction

- Parallel Virtual File System (PVFS) [1]
  - Files split across I/O nodes
  - Manager node coordinates partitioning
- Java Expandable Parallel File System
  - I/O library for high-performance Java computing
  - Files are declustered onto several NFS servers

Introduction

- Existing parallel file systems are:
  - Complex
  - Difficult to learn and use
  - Often without support for Java programming

Parallel Java (PJ)

- API and middleware for parallel programming
- 100% Java
- Designed and implemented by Professor Alan Kaminsky and Luke McOmber
- Supports parallel programming for:
  - Shared memory multiprocessor (SMP) computers
  - Cluster computers
  - Hybrid computers

DoubleMatrixFile

- Part of the input/output context in PJ
- An object for reading/writing an RxC matrix of doubles from/to a file

DoubleMatrixFile

  DoubleMatrixFile()
  DoubleMatrixFile(int R, int C, double[][] mx)
  setMatrix(int R, int C, double[][] mx)
  getRowCount() : int
  getColCount() : int
  getMatrix() : double[][]
  prepareToRead(InputStream is) : Reader
  prepareToWrite(OutputStream os) : Writer

  Inner classes: Reader, Writer

Class DoubleMatrixFile (writing example)

  String fn = "test.dat";
  int R = 20, C = 10;
  double[][] mx = new double[R][C];
  // compute matrix elements...
  FileOutputStream fos = new FileOutputStream(fn);
  DoubleMatrixFile dmf = new DoubleMatrixFile(R, C, mx);
  DoubleMatrixFile.Writer writer =
      dmf.prepareToWrite(new BufferedOutputStream(fos));
  writer.write();
  writer.close();

Class DoubleMatrixFile (reading example)

  String fn = "test.dat";
  FileInputStream fis = new FileInputStream(fn);
  DoubleMatrixFile dmf = new DoubleMatrixFile();
  DoubleMatrixFile.Reader reader =
      dmf.prepareToRead(new BufferedInputStream(fis));
  reader.read();
  reader.close();
  int R = dmf.getRowCount();
  int C = dmf.getColCount();
  double[][] mx = dmf.getMatrix();
  // use matrix elements...

DoubleMatrixFile file format

  R C
  RL CL M N
  A[RL,CL]      A[RL,CL+1]      ...  A[RL,CL+N-1]
  A[RL+1,CL]    A[RL+1,CL+1]    ...  A[RL+1,CL+N-1]
  ...
  A[RL+M-1,CL]  A[RL+M-1,CL+1]  ...  A[RL+M-1,CL+N-1]
  (further matrix segments follow, each with its own RL CL M N header)

- R, C: number of matrix rows/columns; R ≥ 0, C ≥ 0 (int, 4 bytes each)
- RL: segment's lower row index (int), RL ≥ 0
- CL: segment's lower column index (int), CL ≥ 0
- M: number of rows in segment (int), M ≥ 0
- N: number of columns in segment (int), N ≥ 0
- Matrix elements in each segment (double), stored in row-major order
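To make the layout concrete, here is a small sketch that writes and re-reads a single-segment matrix with plain java.io streams. It is only an illustration of the layout described above, not the actual PDS/PJ implementation; the real class should be consulted for the authoritative byte format.

```java
import java.io.*;

// Sketch: write and re-read a single-segment matrix file following the
// layout above (R C header, then RL CL M N and row-major doubles).
public class MatrixFormatSketch {
    public static byte[] write(double[][] m) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        int R = m.length, C = m[0].length;
        out.writeInt(R); out.writeInt(C);   // matrix dimensions
        out.writeInt(0); out.writeInt(0);   // RL, CL: segment starts at (0,0)
        out.writeInt(R); out.writeInt(C);   // M, N: whole matrix in one segment
        for (double[] row : m)
            for (double v : row) out.writeDouble(v);   // row-major elements
        out.close();
        return bos.toByteArray();
    }

    public static double[][] read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int R = in.readInt(), C = in.readInt();
        int rl = in.readInt(), cl = in.readInt(), M = in.readInt(), N = in.readInt();
        double[][] m = new double[R][C];
        for (int r = 0; r < M; r++)
            for (int c = 0; c < N; c++) m[rl + r][cl + c] = in.readDouble();
        return m;
    }

    public static void main(String[] args) throws IOException {
        double[][] m = {{1.5, 2.5}, {3.5, 4.5}};
        double[][] back = read(write(m));
        System.out.println(back[1][0]);  // 3.5
    }
}
```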

Class Range

- Defined by:
  - L: lower bound
  - U: upper bound
  - S: stride
  - N: length
- A range of integers {L, L+S, L+2*S, ..., L+(N-1)*S}, where U = L+(N-1)*S
- Examples:
  - Range(2, 9, 3) = {2, 5, 8}
  - Range(1, 4) = {1, 2, 3, 4}
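The Range semantics above can be sketched in a few lines of Java. RangeSketch below is a hypothetical stand-in for illustration, not the actual PJ Range class.

```java
// A minimal sketch of the Range abstraction described above -- an
// illustration of its semantics, not the real PJ class.
public class RangeSketch {
    final int lb, stride, length;

    // Range(L, U): stride 1, covering L..U inclusive.
    RangeSketch(int lb, int ub) { this(lb, ub, 1); }

    // Range(L, U, S): {L, L+S, ..., L+(N-1)*S} where L+(N-1)*S <= U.
    RangeSketch(int lb, int ub, int stride) {
        this.lb = lb;
        this.stride = stride;
        this.length = (ub - lb) / stride + 1;
    }

    int[] toArray() {
        int[] a = new int[length];
        for (int i = 0; i < length; i++) a[i] = lb + i * stride;
        return a;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(new RangeSketch(2, 9, 3).toArray())); // [2, 5, 8]
        System.out.println(java.util.Arrays.toString(new RangeSketch(1, 4).toArray()));    // [1, 2, 3, 4]
    }
}
```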

Class DoubleMatrixFile.Writer

- Object that writes a DoubleMatrixFile to an output stream
- Created by the prepareToWrite() method
- When created, the number of rows and the number of columns are written to the output stream

Class DoubleMatrixFile.Writer

- Methods to write:
  - writeColSlice(Range slice)
  - writeRowSlice(Range slice)
  - writePatch(Range rslice, Range cslice)
  - write()
- Range objects must have stride = 1
- Each write operation writes one matrix segment to the output stream

WriteRowSlice

  Writer wr = dmf.prepareToWrite(ostream);
  Range r = new Range(2, 5);
  wr.writeRowSlice(r);
  wr.close();

Writes rows 2 through 5 of the 11x11 example matrix below:

      0  1  2  3  4  5  6  7  8  9 10
  0  97 19 42 50 91 74 94 87 80 29  6
  1  95 42 93 81 62 72 12 11 93 32 77
  2  22 89 42 10 36 77 81 25 47 77 55
  3  91 74 97 53 49 59 98 32 85 18 84
  4  56 71 66 41 69 95  5  7 70 25 94
  5  29 62 68 50 19 95 92 73 19 71 81
  6  82 93 91 72 53 49 33 18 68 59 79
  7  19 18 66 53 84 54 67 22 53 27 77
  8  95 76  6 57 83 14 56 82 52  4  1
  9  46 18 66 30 93 77 69  8 33 92 49
 10  63 96 62 37 91 34 43 18 97 50 85

WriteColSlice

  Writer wr = dmf.prepareToWrite(ostream);
  Range r = new Range(1, 4);
  wr.writeColSlice(r);
  wr.close();

Writes columns 1 through 4 of the same 11x11 example matrix.

WritePatch

  Writer wr = dmf.prepareToWrite(ostream);
  Range rr = new Range(2, 5);
  Range cr = new Range(1, 4);
  wr.writePatch(rr, cr);
  wr.close();

Writes the patch covering rows 2-5 and columns 1-4 of the same 11x11 example matrix.

Write

  Writer wr = dmf.prepareToWrite(ostream);
  wr.write();
  wr.close();

Writes the entire 11x11 example matrix as a single segment.

Class DoubleMatrixFile.Reader

- Object that reads a DoubleMatrixFile from an input stream
- Created by the prepareToRead() method
- When created:
  - The number of rows and the number of columns are read from the input stream
  - Storage is allocated for the underlying matrix's row references

Class DoubleMatrixFile.Reader

- Methods to read:
  - read()
  - readColSlice(Range slice)
  - readRowSlice(Range slice)
  - readPatch(Range rslice, Range cslice)
  - readSegment()
  - readSegmentColSlice(Range slice)
  - readSegmentRowSlice(Range slice)
  - readSegmentPatch(Range rslice, Range cslice)
- Range objects must have stride = 1

ReadRowSlice

  Reader rd = dmf.prepareToRead(istream);
  Range r = new Range(2, 5);
  rd.readRowSlice(r);
  rd.close();

Example input file contents (an 11x11 matrix stored as two segments):

  11 11
  0 0 4 11
  97 19 42 50 91 74 94 87 80 29 6
  95 42 93 81 62 72 12 11 93 32 77
  22 89 42 10 36 77 81 25 47 77 55
  91 74 97 53 49 59 98 32 85 18 84
  4 0 4 9
  17 33 15 10 51 70 54 44 48
  35 12 43 21 42 32 13 12 23
  20 15 12 13 16 17 88 16 17
  51 34 92 33 40 69 38 30 65

ReadSegmentRowSlice

  Reader rd = dmf.prepareToRead(istream);
  Range r = new Range(2, 5);
  rd.readSegmentRowSlice(r);
  rd.close();

(Operates on the same example file as ReadRowSlice, reading the row slice from a single segment.)

ReadColSlice

  Reader rd = dmf.prepareToRead(istream);
  Range r = new Range(4, 6);
  rd.readColSlice(r);
  rd.close();

Example input file contents (an 11x11 matrix stored as two segments):

  11 11
  0 0 4 11
  97 19 42 50 91 74 94 87 80 29 6
  95 42 93 81 62 72 12 11 93 32 77
  22 89 42 10 36 77 81 25 47 77 55
  91 74 97 53 49 59 98 32 85 18 84
  4 0 4 9
  17 9 41 54 94 24 14 27 20
  25 12 96 82 32 71 22 31 91
  12 69 22 14 11 97 21 24 27
  61 54 17 23 39 50 18 31 45

ReadSegmentColSlice

  Reader rd = dmf.prepareToRead(istream);
  Range r = new Range(4, 6);
  rd.readSegmentColSlice(r);
  rd.close();

(Operates on the same example file as ReadColSlice, reading the column slice from a single segment.)

ReadPatch

  Reader rd = dmf.prepareToRead(istream);
  Range rr = new Range(2, 5);
  Range cr = new Range(4, 6);
  rd.readPatch(rr, cr);
  rd.close();

(Reads the patch covering rows 2-5 and columns 4-6 of the same example file.)

ReadSegmentPatch

  Reader rd = dmf.prepareToRead(istream);
  Range rr = new Range(2, 5);
  Range cr = new Range(4, 6);
  rd.readSegmentPatch(rr, cr);
  rd.close();

(Reads the same patch, but from a single segment of the same example file.)

Read

  Reader rd = dmf.prepareToRead(istream);
  rd.read();
  rd.close();

(Reads the entire matrix from the same example file.)

ReadSegment

  Reader rd = dmf.prepareToRead(istream);
  rd.readSegment();
  rd.close();

(Reads a single segment from the same example file.)

Parallel Datastore System (PDS)

- Extension of the I/O context of the PJ library
- A set of classes for working with matrices and arrays
- Modeled after the original DoubleMatrixFile class

MatrixFile class hierarchy

- MatrixFile (abstract), with subclasses:
  - IntMatrixFile, LongMatrixFile, FloatMatrixFile, DoubleMatrixFile,
    CharMatrixFile, ShortMatrixFile, ByteMatrixFile, BooleanMatrixFile,
    ObjectMatrixFile
- Each subclass has inner Reader and Writer classes, derived from the
  abstract MatrixFileReader and MatrixFileWriter

Class MatrixFile

- Abstract class
- Includes methods applicable to all matrix files, regardless of element type
- Inner (abstract) classes:
  - MatrixFileReader
  - MatrixFileWriter

FloatMatrixFile file format

  Class name
  R C
  RL CL M N RS CS
  A[RL,CL]           A[RL,CL+CS]         ...  A[RL,CL+(N-1)*CS]
  A[RL+RS,CL]        A[RL+RS,CL+CS]      ...  A[RL+RS,CL+(N-1)*CS]
  ...
  A[RL+(M-1)*RS,CL]  A[RL+(M-1)*RS,CL+CS] ... A[RL+(M-1)*RS,CL+(N-1)*CS]
  (further matrix segments follow, each with its own RL CL M N RS CS header)

- Class name: the elements' class name (String "float")
- R, C: number of matrix rows/columns; R ≥ 0, C ≥ 0 (int, 4 bytes each)
- RL: segment's lower row index (int), RL ≥ 0
- CL: segment's lower column index (int), CL ≥ 0
- M: number of rows in segment (int), M ≥ 0
- N: number of columns in segment (int), N ≥ 0
- RS: segment's row stride (int), RS > 0
- CS: segment's column stride (int), CS > 0
- Matrix elements in each segment (float), stored in row-major order

Matrix Segment

- Range objects defining slices may have stride > 1
- Examples, on the 11x11 example matrix:
  - Row slice Range(2,8,3): rows {2, 5, 8}
  - Column slice Range(0,10,5): columns {0, 5, 10}
  - Patch Range(2,8,3) by Range(0,10,5): elements at rows {2, 5, 8} and columns {0, 5, 10}

MatrixFile subclasses

  Subclass             Element Size (bytes)  Class Name
  IntMatrixFile        4                     "int"
  LongMatrixFile       8                     "long"
  FloatMatrixFile      4                     "float"
  DoubleMatrixFile     8                     "double"
  CharMatrixFile       2                     "char"
  ShortMatrixFile      2                     "short"
  ByteMatrixFile       1                     "byte"
  BooleanMatrixFile    1                     "boolean"
  ObjectMatrixFile<T>  variable              variable

Class ObjectMatrixFile<T>

- Generic class
- T: a reference type that must implement java.io.Serializable
- Elements may be of variable size
  - E.g.: ObjectMatrixFile<String>

Class ObjectMatrixFile<T>

- When written to an output stream:
  - The actual class name of T is written
  - Elements are written using the writeObject() method of class ObjectOutputStream
  - The size of each element (in bytes) is written before the element itself

Class ObjectMatrixFile<T>

- When read from an input stream:
  - The elements' actual class name is read
  - Elements are read using the readObject() method of class ObjectInputStream
  - The size of each element (in bytes) is read before the element itself
    - Useful when skipping elements
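The size-before-element convention described above can be sketched with standard java.io serialization. The class and method names below are illustrative, not the actual PDS code; they only show why a byte-length prefix makes skipping elements cheap.

```java
import java.io.*;

// Sketch of the "size before element" convention: each serialized element
// is preceded by its byte length, so a reader can skip it without
// deserializing. Illustration only, not the real PDS implementation.
public class SizedElementSketch {
    public static void writeElement(DataOutputStream out, Serializable e) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        oos.writeObject(e);
        oos.close();
        byte[] bytes = buf.toByteArray();
        out.writeInt(bytes.length);   // size written before the element
        out.write(bytes);
    }

    public static Object readElement(DataInputStream in) throws IOException, ClassNotFoundException {
        int size = in.readInt();
        byte[] bytes = new byte[size];
        in.readFully(bytes);
        return new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject();
    }

    public static void skipElement(DataInputStream in) throws IOException {
        in.skipBytes(in.readInt());   // the size prefix makes skipping cheap
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        writeElement(out, "hello");
        writeElement(out, "world");
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
        skipElement(in);                       // skip "hello" without deserializing
        System.out.println(readElement(in));   // world
    }
}
```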

Class MatrixFileManager

- Processes different types of MatrixFiles
- Automatically recognizes the matrix type
- Can be used interactively or programmatically

MatrixFileManager Operations

- Join: join several MatrixFiles into one
- Split: split a MatrixFile into several pieces (by rows, columns, or patches)
- Dump: dump MatrixFile contents in text format

ArrayFile class hierarchy

- ArrayFile (abstract), with subclasses:
  - IntArrayFile, LongArrayFile, FloatArrayFile, DoubleArrayFile,
    CharArrayFile, ShortArrayFile, ByteArrayFile, BooleanArrayFile,
    ObjectArrayFile
- Each subclass has inner Reader and Writer classes, derived from the
  abstract ArrayFileReader and ArrayFileWriter

Class ArrayFile

- Abstract class
- Includes methods applicable to all array files, regardless of element type
- Inner (abstract) classes:
  - ArrayFileReader
  - ArrayFileWriter

FloatArrayFile file format

  Class name
  N
  L M S
  A[L]  A[L+S]  ...  A[L+(M-1)*S]
  (further array segments follow, each with its own L M S header)

- Class name: the elements' class name (String "float")
- N: number of array elements, N ≥ 0 (int, 4 bytes)
- L: segment's lower element index (int), L ≥ 0
- M: number of elements in segment (int), M ≥ 0
- S: segment's element stride (int), S > 0
- Array elements in each segment (float)

Array Segment

- Range objects defining a segment may have stride > 1
- Examples, on an 18-element array (indices 0-17):
  - Range(3,13): elements 3 through 13
  - Range(3,12,3): elements {3, 6, 9, 12}

ArrayFile subclasses

  Subclass            Element Size (bytes)  Class Name
  IntArrayFile        4                     "int"
  LongArrayFile       8                     "long"
  FloatArrayFile      4                     "float"
  DoubleArrayFile     8                     "double"
  CharArrayFile       2                     "char"
  ShortArrayFile      2                     "short"
  ByteArrayFile       1                     "byte"
  BooleanArrayFile    1                     "boolean"
  ObjectArrayFile<T>  variable              variable

Class ObjectArrayFile<T>

- Generic class
- T: a reference type that must implement java.io.Serializable
- Elements may be of variable size
  - E.g.: ObjectArrayFile<String>
- When written to an output stream:
  - The actual class name of T is written
  - Elements are written using the writeObject() method of class ObjectOutputStream
  - The size of each element (in bytes) is written before the element itself
- When read from an input stream:
  - The elements' actual class name is read
  - Elements are read using the readObject() method of class ObjectInputStream
  - The size of each element (in bytes) is read before the element itself
    - Useful when skipping elements

Class ArrayFileManager

- Processes different types of ArrayFiles
- Automatically recognizes the array type
- Can be used interactively or programmatically

ArrayFileManager Operations

- Join: join several ArrayFiles into one
- Split: split an ArrayFile into several pieces
- Dump: dump ArrayFile contents in text format

Test Programs

- Matrix Multiplication
- Floyd's Algorithm
- K-Means Clustering

Matrix Multiplication

- A (m x n) multiplied by B (n x p) gives C (m x p):

  C[i,j] = Σ_{k=1..n} A[i,k] * B[k,j],   where 1 ≤ i ≤ m and 1 ≤ j ≤ p

Matrix Multiplication

  for (int i = 0; i < m; i++) {
      for (int j = 0; j < p; j++) {
          c[i][j] = 0;
          for (int k = 0; k < n; k++) {
              c[i][j] += a[i][k] * b[k][j];
          }
      }
  }

- Complexity: O(m*n*p), i.e. O(n^3) for square matrices
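The triple loop above, wrapped into a runnable demo on a small example: a 2x3 matrix times a 3x2 matrix gives a 2x2 result.

```java
// Runnable version of the sequential triple loop shown above.
public class MatMulDemo {
    public static double[][] multiply(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        double[][] c = new double[m][p];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < p; j++) {
                c[i][j] = 0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];   // C[i,j] = sum of A[i,k]*B[k,j]
            }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2, 3}, {4, 5, 6}};
        double[][] b = {{7, 8}, {9, 10}, {11, 12}};
        double[][] c = multiply(a, b);
        System.out.println(c[0][0] + " " + c[0][1]);  // 58.0 64.0
        System.out.println(c[1][0] + " " + c[1][1]);  // 139.0 154.0
    }
}
```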

Matrix Multiplication

- Easily parallelizable; no sequential dependencies
- To calculate C[i,j], we only need:
  - the i-th row of A
  - the j-th column of B

Matrix Multiplication

- Each parallel process multiplies a row slice of A by a column slice of B to produce a patch of C
- (Figures: the partitioning of A, B, and C for 1, 2, and 4 parallel processes)

Matrix Multiplication

  Accept fileA, fileB, fileC as input parameters
  Each parallel process i:
      Based on # of processes and process rank, determine which
          row slice of A and which column slice of B to process
      Read correct row slice from fileA          <- readRowSlice()
      Read correct column slice from fileB       <- readColSlice()
      Multiply row slice by column slice
      Write produced patch to fileC.i            <- writePatch()
      Exit

Floyd’s Algorithm

- All-pairs shortest distance
- Given a graph G, calculate the shortest distance from every vertex of G to every other vertex
- Example graph with vertices V1..V5 and its distance matrix D[NxN]:

        V1  V2  V3  V4  V5
  V1     0   3   ∞   6   ∞
  V2     3   0   ∞   ∞   6
  V3     ∞   ∞   0   ∞   8
  V4     6   ∞   ∞   0   2
  V5     ∞   6   8   2   0

Floyd’s Algorithm

  for i = 1 to N do:
      for r = 1 to N do:
          for c = 1 to N do:
              D[r,c] = min{ D[r,c], D[r,i] + D[i,c] }

- Complexity: O(N^3)
- Parallel program: distribute the rows of D evenly among the available parallel processes

Floyd’s Algorithm

  for i = 1 to N do:
      for r = 1 to N do:
          for c = 1 to N do:
              D[r,c] = min{ D[r,c], D[r,i] + D[i,c] }

- Every process needs to access the i-th row

Floyd’s Algorithm

  for i = 1 to N do:
      broadcast contents of i-th row
      for r = 1 to N do:
          for c = 1 to N do:
              D[r,c] = min{ D[r,c], D[r,i] + D[i,c] }

- The process which "has" the i-th row for the current value of i broadcasts it to the other processes
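A sequential, runnable sketch of the loop above, applied to the 5-vertex example graph (INF stands for "no edge"); the parallel version additionally distributes the rows of D and broadcasts row i on each iteration.

```java
// Sequential Floyd's algorithm on the example distance matrix.
public class FloydDemo {
    static final double INF = Double.POSITIVE_INFINITY;

    public static void floyd(double[][] d) {
        int n = d.length;
        for (int i = 0; i < n; i++)          // intermediate vertex
            for (int r = 0; r < n; r++)
                for (int c = 0; c < n; c++)
                    d[r][c] = Math.min(d[r][c], d[r][i] + d[i][c]);
    }

    public static void main(String[] args) {
        double[][] d = {
            {0, 3, INF, 6, INF},
            {3, 0, INF, INF, 6},
            {INF, INF, 0, INF, 8},
            {6, INF, INF, 0, 2},
            {INF, 6, 8, 2, 0},
        };
        floyd(d);
        System.out.println(d[0][2]);  // 16.0  (V1 -> V4 -> V5 -> V3)
        System.out.println(d[0][4]);  // 8.0   (V1 -> V4 -> V5)
    }
}
```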

Floyd’s Algorithm

  Accept fileIN, fileOUT as input parameters
  Each parallel process i:
      Based on # of processes and process rank, determine
          which row slice of D to process
      Read correct row slice from fileIN         <- readRowSlice()
      Run Floyd's algorithm
      Write produced row slice to fileOUT.i      <- writeRowSlice()
      Exit

Source code for this program adapted from [4]

K-Means Clustering

- Partition N observations into K clusters
- (Figure: example clustering with N=100, K=3; images taken from [6])

K-Means Clustering

- Inputs:
  - N data points with D coordinates each, as a matrix P[NxD]
  - K: number of clusters
- Outputs:
  - Coordinates of the K cluster centers

        C1   C2   C3   ...  CD
  P1    10   4.3  6.9  ...  5.8
  P2    3.9  0.8  17   ...  62
  P3    11   4.6  40   ...  1.8
  ...   ...  ...  ...  ...  ...
  PN    7.9  68   8.2  ...  0.4

K-Means Clustering

- Algorithm:
  1. Initialize the K cluster centers to be the first K data points
  2. Initially, each data point is not assigned to any cluster
  3. Repeat:
  4.     Assign each data point to the cluster whose center is the closest, using the Euclidean distance
  5.     Recalculate the center point of each cluster
  6. Until none of the data points changed to a different cluster
  7. Output the coordinates of each cluster center
- Euclidean distance:

  d = ( Σ_{i=1..D} (a_i − b_i)^2 )^{1/2}
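The sequential algorithm above as a runnable Java sketch: centers start at the first K points and iteration stops when no point changes cluster. This is an illustration only; the PDS test program additionally handles file I/O and the all-reduce steps.

```java
import java.util.Arrays;

// Sequential sketch of the k-means loop described above.
public class KMeansDemo {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;  // squared Euclidean distance; same argmin as the true distance
    }

    public static int[] kmeans(double[][] p, int k) {
        int n = p.length, d = p[0].length;
        double[][] centers = new double[k][];
        for (int j = 0; j < k; j++) centers[j] = p[j].clone();  // step 1
        int[] assign = new int[n];
        Arrays.fill(assign, -1);                                // step 2
        boolean changed = true;
        while (changed) {                                       // steps 3-6
            changed = false;
            for (int i = 0; i < n; i++) {                       // step 4: nearest center
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(p[i], centers[j]) < dist2(p[i], centers[best])) best = j;
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            double[][] sum = new double[k][d];                  // step 5: recenter
            int[] cnt = new int[k];
            for (int i = 0; i < n; i++) {
                cnt[assign[i]]++;
                for (int c = 0; c < d; c++) sum[assign[i]][c] += p[i][c];
            }
            for (int j = 0; j < k; j++)
                if (cnt[j] > 0)
                    for (int c = 0; c < d; c++) centers[j][c] = sum[j][c] / cnt[j];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] p = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(Arrays.toString(kmeans(p, 2)));  // [0, 0, 1, 1]
    }
}
```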

K-Means Clustering

- Cluster parallel algorithm:
  - Distribute the N data points evenly among the parallel processes
  - Each process handles a row slice of matrix P

K-Means Clustering

  1. Accept fileIN, K as input parameters
  2. Each parallel process i:
  3.     Based on # of processes and process rank, determine which row slice of P to process
  4.     Read correct row slice from fileIN
  5.     Assign initial cluster centers to first K data points
  6.     Assign each data point in slice to closest cluster
  7.     Perform All-Reduce using logical OR
  8.     If no data points changed clusters, go to 12
  9.     Perform All-Reduce using addition to propagate cluster center data to all processes
  10.    Recalculate cluster centers
  11.    Go to 6
  12.    Print out cluster centers
  13.    Exit

K-Means Clustering

- Input: a DoubleMatrixFile P[NxD] with N rows and D columns, stored as 2 segments:
  1. The first K rows
  2. The rest of the rows

K-Means Clustering

  1. Accept fileIN, K as input parameters
  2. Each parallel process i:
  3.     Based on # of processes and process rank, determine which row slice of P to process
  4.     Read K rows from the 1st segment in fileIN         <- readSegmentRowSlice()
  5.     Read correct row slice from 2nd segment in fileIN  <- readSegmentRowSlice()
  6.     Assign initial cluster centers to first K data points
  7.     Assign each data point in slice to closest cluster
  8.     Perform All-Reduce using logical OR
  9.     If no data points changed clusters, go to 13
  10.    Perform All-Reduce using addition to propagate cluster center data to all processes
  11.    Recalculate cluster centers
  12.    Go to 7
  13.    Print out cluster centers
  14.    Exit

Test Runs

- 3 variations:
  1. Sequential program not using PDS
  2. Cluster parallel program not using PDS
  3. Cluster parallel program using PDS

Test Runs

Variation 3 (using PDS):

  IntMatrixFile imf = new IntMatrixFile();
  MatrixFileReader reader = imf.prepareToRead(
      new BufferedInputStream(new FileInputStream("matrix.dat")));
  int r = imf.getRowCount();   // get # of rows
  int c = imf.getColCount();   // get # of cols
  reader.read();
  int[][] A = imf.getMatrix();
  reader.close();
  // use matrix A

Variations 1 & 2 (not using PDS):

  DataInputStream dis = new DataInputStream(
      new BufferedInputStream(new FileInputStream("matrix.dat")));
  dis.readUTF();               // read & skip the element class name
  int r = dis.readInt();       // read # of rows
  int c = dis.readInt();       // read # of cols
  dis.skipBytes(6 * 4);        // skip the 6 ints that describe the segment
  int[][] A = new int[r][c];
  for (int i = 0; i < r; i++) {
      for (int j = 0; j < c; j++) {
          A[i][j] = dis.readInt();
      }
  }
  dis.close();
  // use matrix A

Test Runs

- Hybrid SMP parallel computer [?]:
  - Frontend computer: tardis.cs.rit.edu
    - UltraSPARC-IIe CPU, 650 MHz clock, 512 MB main memory
  - 10 backend computers, dr00 through dr09, each with two AMD Opteron 2218
    dual-core CPUs (four processors), 2.6 GHz clock, 8 GB main memory
  - 1-Gbps switched Ethernet backend interconnection network
  - Aggregate 104 GHz clock, 80 GB main memory

Test Runs

- Run each test program at least 3 times with each configuration:
  - Sequential
  - K = 1, 2, 4, 8, 10 parallel processes
- Take the smallest running time for each configuration
- Calculate and compare:
  - Running times
  - Speedup
  - Efficiency
  - EDSF

Results: Matrix Multiplication

- Input Dataset #1: [1000x800] * [800x900] = [1000x900]
- Problem Size: N = 0.7G
- K: number of parallel processes
- T: minimal running time from 3 runs (in ms)

  Variation   K    T       Spdup   Effic   EDSF
  Sequential  seq  8719
  Non-PDS     1    8610    1.013   1.013
              2    4255    2.049   1.025   -0.012
              4    2602    3.351   0.838   0.070
              8    1468    5.939   0.742   0.052
              10   1305    6.681   0.668   0.057
  PDS         seq  8719
              1    8396    1.038   1.038
              2    4266    2.044   1.022   0.016
              4    2586    3.372   0.843   0.077
              8    1505    5.793   0.724   0.062
              10   1199    7.272   0.727   0.048

Results: Matrix Multiplication

- Input Dataset #2: [4000x4000] * [4000x4000] = [4000x4000]
- Problem Size: N = 64G
- K: number of parallel processes
- T: minimal running time from 3 runs (in ms)

  Variation   K    T        Spdup   Effic   EDSF
  Sequential  seq  8719
  Non-PDS     1    8610     1.013   1.013
              2    4255     2.049   1.025   -0.012
              4    2602     3.351   0.838   0.070
              8    1468     5.939   0.742   0.052
              10   1305     6.681   0.668   0.057
  PDS         seq  1106687
              1    949546   1.165   1.165
              2    489582   2.260   1.130   0.031
              4    243092   4.553   1.138   0.008
              8    127135   8.705   1.088   0.010
              10   99322    11.142  1.114   0.005

Results: Matrix Multiplication

(Result charts)

Results: K-Means Clustering

- Input Dataset #1: D=2, N=180000, K=100
- K: number of parallel processes
- T: minimal running time from 3 runs (in ms)

  Variation   K    T       Spdup   Effic   EDSF
  Sequential  seq  119364
  Non-PDS     1    122578  0.974   0.974
              2    71402   1.672   0.836   0.165
              4    50256   2.375   0.594   0.213
              8    23374   5.107   0.638   0.075
              10   11128   10.726  1.073   -0.010
  PDS         seq  119364
              1    121650  0.981   0.981
              2    70768   1.687   0.843   0.163
              4    49815   2.396   0.599   0.213
              8    22727   5.252   0.657   0.071
              10   11603   10.287  1.029   -0.005

Results: K-Means Clustering

- Input Dataset #3: D=6, N=600000, K=55
- K: number of parallel processes
- T: minimal running time from 3 runs (in ms)

  Variation   K    T       Spdup   Effic   EDSF
  Sequential  seq  319327
  Non-PDS     1    311595  1.025   1.025
              2    185976  1.717   0.859   0.194
              4    107607  2.968   0.742   0.127
              8    53247   5.997   0.750   0.052
              10   43419   7.355   0.735   0.044
  PDS         seq  319327
              1    311950  1.024   1.024
              2    185151  1.725   0.862   0.187
              4    108308  2.948   0.737   0.130
              8    53433   5.976   0.747   0.053
              10   44218   7.222   0.722   0.046

Results: K-Means Clustering

(Result charts)

Results: Floyd’s Algorithm

- Input Dataset: graph with 4000 vertices
- Problem Size: N = 64G
- K: number of parallel processes
- T: minimal running time from 3 runs (in ms)

  Variation   K    T       Spdup   Effic   EDSF
  Sequential  seq  614362
  Non-PDS     1    587473  1.046   1.046
              2    295097  2.082   1.041   0.005
              4    149873  4.099   1.025   0.007
              8    75500   8.137   1.017   0.004
              10   63645   9.653   0.965   0.009
  PDS         seq  614362
              1    564362  1.089   1.089
              2    276787  2.220   1.110   -0.019
              4    143133  4.292   1.073   0.005
              8    72371   8.489   1.061   0.004
              10   61410   10.004  1.000   0.010

Results: Floyd’s Algorithm

(Result charts)

Conclusions

- New PDS developed
  - Extension of the PJ library
  - Set of classes for working with matrices and arrays of different element types
  - Full Javadocs and developer's guide
  - Object-oriented design; concise, elegant

Conclusions

- Manager utilities developed
- 3 test programs developed
- Ready to be used in writing parallel programs

Future Work

- Integrate with the PJ job scheduler
  - Automatically partition and distribute the input files
- Compatibility with existing parallel file systems

References

[1] Avery Ching, Alok Choudhary, Wei-keng Liao, Robert Ross, William Gropp. Noncontiguous I/O through PVFS. Proceedings of the 2002 IEEE International Conference on Cluster Computing, September 2002.

[2] Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface. July 18, 1997. http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.html

[3] J. Perez, L. Sanchez, F. Garcia, A. Calderon, J. Carretero. High performance Java input/output for heterogeneous distributed computing. Proceedings of the 10th IEEE Symposium on Computers and Communications, 2005.

[4] Alan Kaminsky. Building Parallel Programs: SMPs, Clusters, and Java. Cengage Course Technology, 2010.

[5] Alan Kaminsky. Parallel Java Library. http://www.cs.rit.edu/~ark/pj.shtml

[6] Alan Kaminsky. 4005-736-70 Parallel Computing II, Programming Project 1. http://www.cs.rit.edu/~ark/736/project01/project01.shtml

Questions?

Appendix

- Speedup
  - K: number of processes
  - Tseq: running time of the sequential program
  - TK: running time of the parallel program on K processes

  Speedup(K) = Tseq / TK

Appendix

- Efficiency
  - K: number of processes

  Eff(K) = Speedup(K) / K

Appendix

- EDSF: Experimentally Determined Sequential Fraction
  - K: number of processes
  - Tseq: running time of the sequential program
  - TK: running time of the parallel program on K processes

  F(K) = (K * TK − Tseq) / (K * Tseq − Tseq)
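The speedup and efficiency formulas above, applied to one row of the results tables: the PDS matrix-multiplication run on dataset #2 (Tseq = 1106687 ms, T = 99322 ms on K = 10 processes).

```java
// Speedup and efficiency as defined in the appendix, checked against
// one row of the matrix-multiplication results table.
public class MetricsDemo {
    public static double speedup(double tseq, double tk) { return tseq / tk; }
    public static double efficiency(double tseq, double tk, int k) { return speedup(tseq, tk) / k; }

    public static void main(String[] args) {
        double tseq = 1106687, t10 = 99322;
        System.out.printf("speedup = %.3f%n", speedup(tseq, t10));           // speedup = 11.142
        System.out.printf("efficiency = %.3f%n", efficiency(tseq, t10, 10)); // efficiency = 1.114
    }
}
```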