Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

21
Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College

Transcript of Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

Page 1: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

Raspberry Pi Performance Benchmarking

Justin MooreSalish Kootenai College

Page 2: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

2

Overview

• Raspberry Pi Cluster Build

• Performance

• Performance Benchmark Tools

• Tuning

• Analysis

• Conclusion

Page 3: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

3

Raspberry Pi Cluster – Model B

• CPU – Broadcom ARM11 76JZF-S 700MHz• Can overclock to 1000MHz

• RAM – 512MB• 448MB/64MB – CPU/GPU

• Linux OS• 10/100 BaseT Ethernet

port• Size of a credit card• Price - $35

Image: http://commons.wikimedia.org/wiki/File:RaspberryPi.jpg#mediaviewer/File:RaspberryPi.jpg

Page 4: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

4

Raspberry Pi Cluster - Setup

• 4 Pi cluster• USB hub – Manhattan 7 port USB 2.0• SD card – Kodak 8GB

• Router – Western Digital N750• NFS mounted external hard

drive – 500GB Buffalo Inc.

Page 5: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

5

Performance – Floating Point Operations

• How we measure performance• Busy CPU? Speed?

• What is a floating point operation (FLOP)?• Arithmetic operation • Formats

• Single • Double

• Performance measurement by FLOPS

• How do we measure FLOPS?• General Matrix Multiplication (GEMM)• High computational intensity with an increase in matrix size

Page 6: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

6

Performance Benchmarking• Single Raspberry Pi

• BLAS - Basic Linear Algebra Subprograms • ATLAS - Automatically Tuned Linear Algebra

Software• Auto tunes BLAS for any system

• Raspberry Pi Cluster• MPI - Message Passing Interface

• Standard API for inter-process communication• Facilitates parallel programming• MPICH 2-1.4.1p1

• HPL - High Performance LINPACK • Tuned MPI • Combined with ATLAS

• Wrote Custom code• ATLAS• Added parallel capability

• Compared with HPL

Page 7: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

7

MyGEMM – Naïve Implementation

= *C (i,j) A(i,k)

Where N = Matrix Size For i = 1 to N For j = 1 to N For k = 1 to N C(i,j)=C(i,j)+A(i,k)*B(k,j)

B (k,j)

Page 8: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

8

MyGEMM – Naïve Pitfalls

• GEMM – Naïve method inefficient• Two whole matrices are loaded in to memory• Cache is not used efficiently

• Strides through the matrix

• If not Naïve, then what?

Naïve

Page 9: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

9

MyGEMM – Software Tuning

• What is block matrix multiplication• Matrix is split into smaller matrices/blocks• Shrinks matrix size to allow both A & B into fast

memory

a1

1A11

A12 A1

3

A21

A14

a1

2

a2

1

a2

2

A22 A2

3

A24

A3

1

A32 A33 A34

A41 A42 A43 A44

A11

Page 10: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

10

MyGEMM – Cluster Software Tuning

• How do we distribute the matrix multiplication?• MPI used to distribute blocks to nodes

• MyGEMM allows for experimentation on block size

A12 A13

A21

A14

A22 A23 A24

A31 A32 A33 A34

A41 A42 A43 A44

A11

Node 1

Node 2

Node 3 Node 4

Page 11: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

11

Hardware Tuning

• Raspberry Pi allows CPU overclocking and memory sharing between CPU and GPU

• Memory• Memory is shared between CPU and GPU

• 512MB total onboard memory• Up to 496MB can be used for CPU • More memory = larger matrices

• CPU Clock• Up to 1000 MHz from 700MHz

Page 12: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

12

Page 13: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

13

Page 14: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

14

Page 15: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

15

Page 16: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

16

Page 17: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

17

Page 18: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

18

Conclusion

• Performance is dependent on key factors• Matrix Size• Block Size• RAM size• CPU speed

• Top 500 – 1993• T#292

• Tied with General Motors

Page 19: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

19

Conclusion – Pi Cluster vs. Yellowstone

Pi ClusterARM11, 32 bit, Double

Precision

YellowstoneSandy Bridge Xeon, 64 bit, Double Precision

Power 15.6 Watts 1.4 MWatts

Performance 836 MFLOPS 1.2 PFLOPS

Price $250.00 $22,500,000.00

MFLOPS/W 54.8 875.3

MFLOPS/$ 3.4 57.2

Page 20: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

20

Thank You

• Dr. Richard Loft

• Raghu Raj Prasanna Kumar

• Amogh Simha

• Stephanie Barr

Page 21: Raspberry Pi Performance Benchmarking Justin Moore Salish Kootenai College.

21

Questions?