BERKELEY INSTITUTE FOR PERFORMANCE STUDIES
COMPUTATIONAL RESEARCH DIVISION
Performance Understanding, Prediction, and Tuning
at the Berkeley Institute for Performance
Studies (BIPS)
Katherine Yelick, BIPS Director
Lawrence Berkeley National Laboratory and U.C. Berkeley, EECS Dept.
National Science Foundation
Challenges to Performance
Two trends in High-End Computing:
• Increasingly complicated systems
  – Multiple forms of parallelism
  – Many levels of memory hierarchy
  – Complex systems software in between
• Increasingly sophisticated algorithms
  – Unstructured meshes and sparse matrices
  – Adaptivity in time and space
  – Multi-physics models lead to hybrid approaches
• Conclusion: Deep understanding of performance at all levels is important
BIPS Institute Goals
• Bring together researchers on all aspects of performance engineering
• Use performance understanding to:
  – Improve application performance
  – Compare architectures for application suitability
  – Influence the design of processors, networks, and compilers
  – Identify algorithmic needs
BIPS Approaches
• Benchmarking and Analysis
  – Measure performance
  – Identify opportunities for improvements in software, hardware, and algorithms
• Modeling
  – Predict performance on future machines
  – Understand performance limits
• Tuning
  – Improve performance, by hand or with automatic self-tuning tools
Multi-Level Analysis
• Full Applications
  – What users want
  – Do not reveal impact of features
• Compact Applications
  – Can be ported with modest effort
  – Easily match phases of full applications
• Microbenchmarks
  – Isolate architectural features
  – Hard to tie to real applications
[Figure: pyramid of Next-Gen Apps, Full Apps, Compact Apps, and Micro-Benchmarks, ordered by system size and complexity]
Projects Within BIPS
• Application evaluation on vector processors
• APEX: Application Performance Characterization Benchmarking
• BeBOP: Berkeley Benchmarking and Optimization Group
• Architectural probes for alternative architectures
• LAPACK: Linear Algebra Package
• PERC: Performance Engineering Research Center
• Top500
• ViVA: Virtual Vector Architectures
Application Evaluation of Vector Systems
Two vector architectures:
• The Japanese Earth Simulator
• The Cray X1
Comparison to "commodity"-based systems:
• IBM SP, Power4
• SGI Altix
Ongoing study of DOE applications:
  CACTUS    Astrophysics       100,000 lines   grid based
  PARATEC   Material Science    50,000 lines   Fourier space
  LBMHD     Plasma Physics       1,500 lines   grid based
  GTC       Magnetic Fusion      5,000 lines   particle based
  MADCAP    Cosmology            5,000 lines   dense lin. alg.
Work by L. Oliker, J. Borrill, A. Canning, J. Carter, J. Shalf, H. Shan
APEX-MAP Benchmark
• Goal: Quantify the effects of temporal and spatial locality
• Focus on memory system and network performance
[Figure: APEX-MAP surface for sequential runs on the Altix (Itanium 2): cycles per access plotted over spatial locality (1 to 65536) and temporal locality (0.01 to 1000, log scales)]
• Graphs over temporal and spatial locality axes
• Show performance valleys/cliffs
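A minimal sketch of the kind of parameterized access stream APEX-MAP measures. The function name, the power-law form of the temporal-locality knob, and all constants here are illustrative assumptions, not the benchmark's actual definition:

```python
import random

def apex_map_stream(n, stanza, alpha, count, seed=0):
    """Generate a synthetic index stream in the spirit of APEX-MAP.

    n      -- size of the data set being walked
    stanza -- spatial locality: length of each contiguous run
    alpha  -- temporal locality: skew of the starting-address
              distribution (near 1.0 -> uniform; smaller -> more re-use
              of low, "recently used" indices)
    count  -- number of stanzas to emit
    """
    rng = random.Random(seed)
    stream = []
    for _ in range(count):
        # Power-law-skewed starting address favors low indices.
        start = int(n * (rng.random() ** (1.0 / alpha))) % n
        stream.extend((start + i) % n for i in range(stanza))
    return stream

s = apex_map_stream(n=1 << 16, stanza=4, alpha=0.25, count=8)
```

Sweeping `stanza` and `alpha` over log scales yields exactly the kind of two-axis surface shown on this slide.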
Application Kernel Benchmarks
• Microbenchmarks are good for:
  – Identifying architecture/compiler bottlenecks
  – Optimization opportunities
• Application benchmarks are good for:
  – Machine selection for specific apps
• In between: benchmarks that capture important behavior in real applications
  – Sparse matrices: SpMV benchmark
  – Stencil operations: stencil probe
  – Possible future: sorting, narrow datatype ops, …
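For the stencil side, a toy version of what a stencil probe exercises: a sweep with fixed neighbor offsets and no indirection, so performance is governed by streaming and cache reuse rather than random access. This 5-point averaging kernel is an illustrative stand-in, not the actual probe:

```python
def stencil_probe(grid, n):
    """Minimal 5-point stencil sweep: each interior point becomes the
    average of itself and its four neighbors, read from the old grid."""
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = 0.2 * (grid[i][j] + grid[i - 1][j] + grid[i + 1][j]
                               + grid[i][j - 1] + grid[i][j + 1])
    return out

g = [[0.0] * 4 for _ in range(4)]
g[1][1] = 5.0
h = stencil_probe(g, 4)
```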
Sparse Matrix Vector Multiply (SpMV)
• Sparse matrix algorithms
  – Increasingly important in applications
  – Challenge memory systems: poor locality
  – Many matrices have structure (e.g., dense sub-blocks) that can be exploited
• Benchmarking SpMV
  – NAS CG and SciMark use a random matrix
  – Not reflective of most real problems
• Benchmark challenge:
  – Ship real matrices: cumbersome and inflexible
  – Build "realistic" synthetic matrices
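As a concrete reference for the kernel being benchmarked, a plain compressed-sparse-row (CSR) matrix-vector multiply; this is a sketch in Python, whereas the benchmark's real kernel is register-blocked C:

```python
def spmv_csr(rowptr, colidx, vals, x):
    """y = A*x for A in compressed sparse row (CSR) format.

    The indirect load x[colidx[k]] in the inner loop is what stresses
    the memory system on matrices with poor locality.
    """
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            acc += vals[k] * x[colidx[k]]
        y[i] = acc
    return y

# 2x2 example: A = [[2, 0], [1, 3]]
rowptr, colidx, vals = [0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]
y = spmv_csr(rowptr, colidx, vals, [1.0, 1.0])  # -> [2.0, 4.0]
```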
Importance of Using Blocked Matrices
[Chart: SpMV speedup from BCSR, best-case blocked matrix vs. unblocked (1x1), for the medium best-case and medium FEM matrices, on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.3 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz; y-axis 0 to 9x]
Generating Blocked Matrices
• Our approach: uniformly distributed random structure, each nonzero an r x c block
  – Collect data for r and c from 1 to 12
• Validation: can our random matrices simulate "typical" matrices?
  – 44 matrices from various applications
  – 1: dense matrix in sparse format
  – 2–17: Finite Element Method (FEM) matrices
    – 2–9: single block size; 10–17: multiple block sizes
  – 18–44: non-FEM
• Summarization: weighted by occurrence in test suite (ongoing)
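The construction above can be sketched roughly as follows. The function name and the dict-of-entries representation are illustrative conveniences; a real generator would emit BCSR arrays directly:

```python
import random

def random_blocked_matrix(nrows, ncols, r, c, nblocks, seed=0):
    """Sparse matrix with uniformly scattered dense r x c blocks, in the
    spirit of the synthetic SpMV benchmark matrices.  Returns a dict
    {(i, j): value} for clarity rather than speed."""
    rng = random.Random(seed)
    entries = {}
    brows, bcols = nrows // r, ncols // c
    placed = set()
    while len(placed) < nblocks:
        bi, bj = rng.randrange(brows), rng.randrange(bcols)
        if (bi, bj) in placed:
            continue  # keep the chosen block positions distinct
        placed.add((bi, bj))
        for di in range(r):
            for dj in range(c):
                entries[(bi * r + di, bj * c + dj)] = rng.random()
    return entries

A = random_blocked_matrix(12, 12, r=3, c=2, nblocks=4)
```

Sweeping r and c from 1 to 12 and matching the fill of a target matrix gives the validation data described above.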
Itanium 2 prediction
[Chart: Mflop/s, SpMV benchmark vs. test matrices, on Itanium 2: the benchmark is run at the data size matching each test matrix, and test-matrix rates are scaled for fill ratio; values range from 0 to 1400 Mflop/s across matrices 1–44]
UltraSparc III prediction
[Chart: Mflop/s, SpMV benchmark vs. test matrices, on the Sun UltraSparc III, over test matrices 1–44; values range from 0 to 70 Mflop/s]
Sample summary results (Apple G5, 1.8 GHz)
Sparse matrix-vector multiplication (SpMV) benchmark results:

  Matrix type     Matrix size   Mflop/s    (r,c)
  Best case       small         2033.12    (12,6)
  FEM (blocked)   small         1073.04
  Non-blocked     small          299.536   (1,1)
  (small matrix is 2779 x 2779 with 0.0022282 fill)

  Best case       medium         541.875   (12,8)
  FEM (blocked)   medium         443.346
  Non-blocked     medium         193.14    (1,1)
  (medium matrix is 12154 x 12154 with 0.00222844 fill)
Selected SpMV benchmark results
1. Raw results
   ● Which machine is fastest
2. Scaled by the machine's peak floating-point rate
   ● Mitigates chip technology factors
   ● Influenced by compiler issues
3. Fraction of peak memory bandwidth
   ● Use the Stream benchmark for the "attainable peak"
   ● How close to this bound is SpMV running?
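The three scalings can be sketched as below. The 6-bytes-per-flop traffic estimate for CSR (an 8-byte value plus a 4-byte column index per 2 flops, ignoring vector and row-pointer traffic) and the sample machine numbers are illustrative assumptions:

```python
def spmv_report(mflops, peak_mflops, stream_mbs):
    """The three views of an SpMV result: raw rate is `mflops` itself;
    this returns the fraction of peak flop rate and the fraction of
    attainable (Stream) memory bandwidth the kernel sustains."""
    frac_peak = mflops / peak_mflops
    bw_demand_mbs = mflops * 6.0        # MB/s needed at ~6 bytes/flop
    frac_stream = bw_demand_mbs / stream_mbs
    return frac_peak, frac_stream

# Hypothetical machine: 500 Mflop/s SpMV, 6000 Mflop/s peak, 4000 MB/s Stream
frac_peak, frac_stream = spmv_report(500.0, 6000.0, 4000.0)
```

The gap between the two fractions is the point of view 3: a kernel at a few percent of peak flops can still be running close to the memory-bandwidth bound.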
[Chart: raw SpMV performance for the medium problem size (best case / FEM / 1x1), 0 to 1000 Mflop/s, on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz]
[Chart: SpMV performance for the medium problem size, scaled by peak floating-point rate (fraction of peak, 0 to 0.18; best case / FEM / 1x1), on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz]
[Chart: SpMV (medium) memory bandwidth scaled by peak memory bandwidth (fraction of peak, 0 to 0.7; best case and 1x1), on Athlon 1.145 GHz, Opteron 1.4 GHz, Itanium 2 1.5 GHz, Apple G5 1.8 GHz, and Pentium 4 2.4 GHz]
Automatic Performance Tuning
• Performance depends on machine, kernel, and matrix
  – Matrix known only at run time
  – Best data structure and implementation can be surprising
• Filling in explicit zeros can
  – Reduce storage
  – Improve performance
  – Pentium III example: 50% more nonzeros, 50% faster
• BeBOP approach: empirical modeling and search
  – Up to 4x speedups and 31% of peak for SpMV
  – Many optimization techniques for SpMV
  – Several other kernels: triangular solve, A^T*A*x, A^k*x
  – Proof of concept: integrate with Omega3P
  – Release the OSKI library, integrate into PETSc
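Why filling in explicit zeros can pay off: padding the sparsity pattern to aligned dense blocks trades extra stored zeros for a dense, unrollable inner loop. A sketch of the bookkeeping for 2x2 register blocking; the fill-ratio accounting is illustrative, and a real autotuner weighs it against the measured per-block speed of each machine:

```python
def fill_ratio_2x2(entries):
    """Pad a sparsity pattern to full, aligned 2x2 blocks and report
    the fill ratio: entries stored after padding / true nonzeros.
    Register blocking wins when the dense inner loop's speedup
    outweighs the extra flops and storage this ratio represents."""
    blocks = {(i // 2, j // 2) for (i, j) in entries}
    return 4 * len(blocks) / len(entries)

# A pure diagonal blocks badly: each 2x2 block holds only 2 of 4 slots.
r_diag = fill_ratio_2x2({(i, i) for i in range(8)})
# Aligned dense 4x4 structure pads to nothing: ratio 1.0.
r_full = fill_ratio_2x2({(i, j) for i in range(4) for j in range(4)})
```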
Summary of Optimizations
• Optimizations for SpMV (numbers shown are maximums)
  – Register blocking (RB): up to 4x
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x
  – Reordering to create dense structure + splitting: 2x
  – Symmetry: 2.8x
  – Cache blocking: 6x
  – Multiple vectors (SpMM): 7x
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x
• Higher-level kernels
  – A*A^T*x, A^T*A*x: 4x
  – A^2*x: 2x over CSR, 1.5x
• Future: automatic tuning for vectors
Architectural Probes
• Understanding memory system performance
• Interaction with processor architecture:
  – Number of registers
  – Arithmetic units (parallelism)
  – Prefetching
  – Cache size, structure, policies
• APEX-MAP: memory and network system
• Sqmat: processor features included
Impact of Indirection
• Results from the Sqmat "probe"
• Unit-stride access via indirection (S=1)
• Opteron and Power3/4 show less than a 10% penalty once M>8, demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
• Itanium 2 shows a high penalty for indirection
[Chart: slowdown (1x to 6x) vs. computational intensity M (1 to 512) for Itanium 2, Opteron, Power3, and Power4]
Tolerating Irregularity
• S50 (penalty for random access)
  – S is the length of each unit-stride run
  – Start with S= (indirect unit stride)
  – How large must S be to achieve at least 50% of this performance?
  – All done for a fixed computational intensity
• CI50 (hide the random-access penalty using high computational intensity)
  – CI is computational intensity, controlled by the number of squarings (M) per matrix
  – Start with M=1, S=
  – At S=1 (every access random), how large must M be to achieve 50% of this performance?
• For both, lower numbers are better
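A sketch of how such a threshold metric can be computed from any timing routine. The doubling search and the toy cost model (miss latency, streaming cost, flop cost per element) are illustrative assumptions, not the actual Sqmat methodology:

```python
def s50(time_fn, ci, target=0.5):
    """Smallest stanza length S (found by doubling) at which throughput
    reaches `target` of the long-stanza, effectively unit-stride limit,
    at fixed computational intensity `ci`.  `time_fn(S, M)` returns
    time per element-operation."""
    base = 1.0 / time_fn(1 << 20, ci)      # ~unit-stride throughput
    s = 1
    while 1.0 / time_fn(s, ci) < target * base:
        s *= 2
    return s

def toy_time(s, m):
    # Hypothetical cost model: each stanza pays one miss, then streams
    # s elements, each processed m times.
    miss, stream, flop = 100.0, 1.0, 0.5
    return (miss + s * stream + s * m * flop) / (s * m)

S = s50(toy_time, ci=1)   # -> 128 under this toy model
```

CI50 is the same search with the roles of S and M swapped: fix S=1 and double M until the target fraction is reached.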
Tolerating Irregularity
• Gather/scatter is expensive on commodity cache-based systems
  – Power4 tolerates only 1.6% random access (1 in 64)
  – Itanium 2 is much less sensitive, at 25% (1 in 4)
• A huge amount of computation may be required to hide the overhead of irregular data access
  – Itanium 2 requires a CI of about 9 flops/word
  – Power4 requires a CI of almost 75!
S50: what percentage of memory accesses can be random before performance decreases by half? CI50: how much computational intensity is required to hide the penalty of all-random access?
[Charts: (left) percentage of indirection tolerated before performance drops by half, log scale 0%–100%, with values 25%, 6.3%, 1.6%, and 0.8% across Itanium 2, Opteron, Power3, and Power4; (right) computational intensity (CI) required to hide indirection: Itanium 2 9.3, Power3 18.7, Power4 74.7, Opteron 149.3]
Memory System Observations
• Caches are important
• The important gap has moved: between L3 and memory, not L1 and L2
• Prefetching is increasingly important
  – Limited and finicky
  – Its effect may overwhelm cache optimizations if blocking increases non-unit-stride access
• Sparse codes: matrix volume is the key factor, not the indirect loads
Ongoing Vector Investigation
• How much hardware support is needed for vector-like performance?
  – Can small changes to a conventional processor get this effect?
  – Role of compilers/software
  – Related to the Power5 effort
• Latency hiding in software
  – Prefetch engines are easily confused
  – Sparse matrix (random) and grid-based (strided) applications are the target
• Currently investigating simulator tools and any emerging hardware
Summary
• High-level goals:
  – Understand future HPC architecture options that are commercially viable
  – Can minimal hardware extensions improve effectiveness for scientific applications?
• Various technologies
  – Current, future, academic
• Various performance analysis techniques
  – Application-level benchmarks
  – Application kernel benchmarks (SpMV, stencil)
  – Architectural probes
  – Performance modeling and prediction
People within BIPS
• Jonathan Carter
• Kaushik Datta
• James Demmel
• Joe Gebis
• Paul Hargrove
• Parry Husbands
• Shoaib Kamil
• Bill Kramer
• Rajesh Nishtala
• Leonid Oliker
• John Shalf
• Hongzhang Shan
• Horst Simon
• David Skinner
• Erich Strohmaier
• Rich Vuduc
• Mike Welcome
• Sam Williams
• Katherine Yelick
And many collaborators outside Berkeley Lab/Campus
End of Slides
Sqmat overview
• A Java code generator produces unrolled C code
• Stream of matrices
  – Square each matrix M times
  – M controls computational intensity (CI): the ratio between flops and memory accesses
  – Each matrix is size N x N
  – N controls working-set size: 2N^2 registers are required per matrix; N is varied to cover the observable register-set size
• Two storage formats:
  – Direct storage: Sqmat's matrix entries are stored contiguously in memory
  – Indirect: entries are accessed through an indirection vector; "stanza length" S controls the degree of indirection
[Figure: indirection vector pointing at S contiguous N x N matrices in a row]
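A minimal functional sketch of the probe's computation. This is a Python stand-in for the generated C; the function name, the list-of-lists layout, and the optional index argument standing in for the indirection vector are all illustrative:

```python
def sqmat(mats, m, index=None):
    """Square each N x N matrix in `mats` M times, as in the Sqmat
    probe.  `index` optionally supplies an indirection vector over the
    matrix slots, mimicking Sqmat's indirect storage format."""
    order = index if index is not None else range(len(mats))
    out = []
    for slot in order:
        a = mats[slot]
        n = len(a)
        for _ in range(m):
            # One squaring: a = a @ a, written out explicitly.
            a = [[sum(a[i][k] * a[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        out.append(a)
    return out

# One 2x2 matrix squared twice: [[1,1],[0,1]] -> [[1,2],[0,1]] -> [[1,4],[0,1]]
r = sqmat([[[1, 1], [0, 1]]], m=2)
```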
Potential Impact on Applications: T3P
• Source: SLAC [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
  – Symmetric storage
  – Register blocking
• On a single-processor Itanium 2
  – 1.68x speedup: 532 Mflop/s, or 15% of 3.6 Gflop/s peak
  – 4.4x speedup with 8 multiple vectors: 1380 Mflop/s, or 38% of peak
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
  – Symmetric storage
  – Register blocking
  – Reordering
    – Reverse Cuthill-McKee ordering to reduce bandwidth
    – Traveling Salesman Problem-based ordering to create blocks: nodes = columns of A; weight(u, v) = number of nonzeros u and v have in common; a tour is an ordering of the columns; choose the maximum-weight tour (see [Pinar & Heath '97])
• 2x speedup on Itanium 2, but SpMV is not dominant
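The TSP ordering can be approximated greedily, as sketched below: a nearest-neighbor stand-in for the maximum-weight tour of [Pinar & Heath '97], with the function name and the column-to-row-set representation chosen for illustration:

```python
def greedy_block_order(cols):
    """Greedy stand-in for the TSP-based column ordering: repeatedly
    append the unvisited column sharing the most nonzero rows with the
    current one, so similar columns become adjacent and dense blocks
    can form.  `cols` maps column index -> set of nonzero row indices."""
    cols = {j: set(rows) for j, rows in cols.items()}
    tour = [min(cols)]                    # arbitrary starting column
    left = set(cols) - {tour[0]}
    while left:
        cur = cols[tour[-1]]
        # Max shared-nonzero weight; break ties toward the lower index.
        nxt = max(left, key=lambda j: (len(cur & cols[j]), -j))
        tour.append(nxt)
        left.remove(nxt)
    return tour

# Columns 0 and 2 share rows {1, 2}; column 1 shares nothing with 0.
tour = greedy_block_order({0: {1, 2}, 1: {5}, 2: {1, 2, 3}})  # -> [0, 2, 1]
```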
Emerging Architectures
• General-purpose processors are badly suited for data-intensive ops
  – Large caches are not useful if re-use is low
  – Low memory bandwidth, especially for irregular patterns
  – Superscalar methods of increasing ILP are inefficient
  – Power consumption
• Research architectures
  – Berkeley IRAM: vector and PIM chip
  – Stanford Imagine: stream processor
  – ISI DIVA: PIM with conventional processor
Sqmat on PIM Systems
• Performance of Sqmat on PIMs and others for 3x3 matrices, squared 10 times (high computational intensity!)
• Imagine much faster for long streams, slower for short ones
[Chart: Mflop/s with varying stream lengths (8 to 1024) on IMAGINE, IRAM, DIVA, and Power3; values range from 0 to 3500 Mflop/s]
Comparison to HPCC “Four Corners”
[Figure: HPCC "four corners" mapped onto temporal vs. spatial locality, with the Sqmat settings that emulate each corner: RandomAccess ~ Sqmat S=1, M=1, N=1; Stream ~ Sqmat S=0, M=1, N=1; LINPACK ~ Sqmat S=0, M=8, N=8; FFT (future)]

Opteron:
  LINPACK 2000 MFLOPS @ 1.4 GHz vs. Sqmat 2145 MFLOPS @ 1.6 GHz
  STREAMS 1969 MB/s vs. Sqmat 2047 MB/s
  RandomAccess 0.00442 GUPs vs. Sqmat 0.00440 GUPs

Itanium 2:
  LINPACK 4.65 GFLOPs vs. Sqmat 4.47 GFLOPs
  STREAMS 3895 MB/s vs. Sqmat 4055 MB/s
  RandomAccess 0.00484 GUPs vs. Sqmat 0.0141 GUPs