
Page 1:

FFTW and the SiCortex Architecture

Po-Ru Loh

MIT 18.337

May 8, 2008

Page 2:

The email that launched a thousand processors

“Part of my work here at SiCortex is to work with the FFT libraries we provide to our customers.... When I do a comparison of performance of the serial version of FFTW 2.1.5 and 3.2 alpha, FFTW 3.2 alpha is significantly slower.

Both codes are compiled with the same compiler flags. I am running the 64 bit version:

CC=scpathcc LD=scld AR=scar RANLIB=scranlib F77=scpathf95
CFLAGS="-g -O3 -OPT:Ofast -mips64" MPILIBS="-lscmpi"
./configure --build=x86_64-pc-linux-gnu --host=mips64el-gentoo-linux-gnu --enable-fma

FFTW 3.2alpha
bench -opatient -s icf2048x2048

Problem: icf2048x2048, setup: 3.79 s, time: 4.21 s, ``mflops'': 109.7

FFTW 2.1.5
Please wait (and dream of faster computers).
SPEED TEST: 2048x2048, FFTW_FORWARD, in place, specific
time for one fft: 2.914490 s (694.868565 ns/point)

"mflops" = 5 (N log2 N) / (t in microseconds) = "mflops" = 5 (N log2 N) / (t in microseconds) = 158.303319158.303319

One thing I should point out is that FFTW 2.1.5 has been modified from its original stock version. It has had some optimizations based on the work by Franchetti and his paper "FFT Algorithms for Multiply-Add Architectures." From my understanding these optimizations have been folded into the genFFT code generator for FFTW 3.*.*

Am I correct in my assumption that FFTW 3.2 should be significantly faster?”
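The quoted "mflops" convention is easy to sanity-check; a minimal C sketch (the two times are taken from the email above) reproduces both figures:

#include <math.h>
#include <stdio.h>

/* benchFFT convention: "mflops" = 5 N log2(N) / (time in microseconds)
 * for an N-point complex transform. */
static double mflops(double n, double t_sec) {
    return 5.0 * n * log2(n) / (t_sec * 1e6);
}

int main(void) {
    double n = 2048.0 * 2048.0;  /* 2048x2048 complex transform */
    printf("FFTW 3.2alpha: %.1f mflops\n", mflops(n, 4.21));      /* -> ~110 */
    printf("FFTW 2.1.5:    %.1f mflops\n", mflops(n, 2.914490));  /* -> ~158.3 */
    return 0;
}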

Page 3:

Where to start?

Maybe try benchmarking FFTW for myself
But I'm too lazy to write benchmarking code, especially if it already exists somewhere
Unfortunately, that “somewhere” happens to be inside the FFTW package

… and it has its own dependencies
… so I can't simply grab a file and link to the FFTW library on our SC648… ugh

Fine, I'll just download the whole package
… compile the whole thing (with the default config, since I don't know how to do anything smarter)
… and then hijack the last link step to relink against the "real" SC-optimized library

Page 4:

Some surprises

“The numbers you sent previously (for a 2048x2048 complex transform) were:

fftw-2.1.5: ~155 mflops
fftw-3.2alpha: ~110 mflops

My initial tests didn't find a discrepancy nearly that big:

fftw-2.1.5: ~140 mflops
fftw-3.1.2: ~120 mflops
fftw-3.2alpha3: ~130 mflops

Of course, these results are simply using the downloaded version of the libraries without the SiCortex tweaks you mentioned before (and also with gcc and not PathScale).”

Page 5:

More surprises

“Strangely, I then tried re-linking fftw_test and bench using the precompiled fftw libraries that came on the machine and got a substantial slowdown!

/usr/lib/libsfftw.a [2.1.5]: ~45 mflops
/usr/lib/libfftw3.la [3.2alpha]: ~90 mflops

One other thing I noticed was that the configure script for fftw3 complained about not finding a hardware cycle counter…

But fftw2 didn't produce this warning -- perhaps it doesn't use it.
Does a cycle counter actually exist (and do I just need to tell configure where it is)?”

Page 6:

Meanwhile, let's play with FFTW and our machine

In the end, the algorithms make the difference, but we need to know where to look

Previously, I thought about algorithms in terms of O(n log n) flops

Need to get a feel for what's really important

Page 7:

How fast is FFTW anyway? Is there any hope of doing better?

Grab a competitor from “Jörg's useful and ugly FFT page”
2-dim FFT modernized and cleaned up by Stefan Gustavson

Simple but decently optimized radix-8/4/2 transform that does rows, then cols
Output time taken for just rows, then total time taken

ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 1024
Enter number of cols: 1024
Time elapsed: 2.040359 sec
Time elapsed: 4.149302 sec
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 2048
Enter number of cols: 2048
Time elapsed: 8.482921 sec
Time elapsed: 17.633655 sec
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 4096
Enter number of cols: 4096
Time elapsed: 37.851723 sec
Time elapsed: 80.187843 sec
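For reference, the shape of the test driver is roughly the following; a self-contained sketch (assuming power-of-2 sizes), with a plain radix-2 kernel standing in for the actual radix-8/4/2 kube-gustavson kernel and for test_kube.c:

#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* In-place iterative radix-2 FFT of a[0], a[s], ..., a[(n-1)*s]. */
static void fft(double complex *a, int n, int s) {
    for (int i = 1, j = 0; i < n; i++) {          /* bit-reversal permutation */
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j |= bit;
        if (i < j) { double complex t = a[i*s]; a[i*s] = a[j*s]; a[j*s] = t; }
    }
    for (int len = 2; len <= n; len <<= 1) {      /* butterflies */
        double complex w = cexp(-2.0 * M_PI * I / len);
        for (int i = 0; i < n; i += len) {
            double complex wk = 1.0;
            for (int k = 0; k < len / 2; k++, wk *= w) {
                double complex u = a[(i+k)*s], v = wk * a[(i+k+len/2)*s];
                a[(i+k)*s] = u + v;
                a[(i+k+len/2)*s] = u - v;
            }
        }
    }
}

static double wall(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    int rows, cols;
    printf("Enter number of rows: ");
    if (scanf("%d", &rows) != 1) return 1;
    printf("Enter number of cols: ");
    if (scanf("%d", &cols) != 1) return 1;

    double complex *a = malloc(sizeof *a * (size_t)rows * cols);
    for (long i = 0; i < (long)rows * cols; i++) a[i] = cos(i) + I * sin(2.0 * i);

    double t0 = wall();
    for (int r = 0; r < rows; r++) fft(a + (long)r * cols, cols, 1);  /* rows */
    printf("Time elapsed: %f sec\n", wall() - t0);                    /* rows only */
    for (int c = 0; c < cols; c++) fft(a + c, rows, cols);            /* then cols */
    printf("Time elapsed: %f sec\n", wall() - t0);                    /* total */

    free(a);
    return 0;
}

Compile with, e.g., gcc -O3 sketch.c -lm; the two printouts mirror the transcript above.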

Page 8:

How fast is FFTW anyway? Is there any hope of doing better?

How about this -O3 business?

ploh@sc1-m3n6 ~/kube-gustavson $ gcc -c timed-kube-gustavson-fft.c -O3
ploh@sc1-m3n6 ~/kube-gustavson $ gcc test_kube.c timed-kube-gustavson-fft.o -lm
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 1024
Enter number of cols: 1024
Time elapsed: 0.555099 sec
Time elapsed: 1.172668 sec
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 2048
Enter number of cols: 2048
Time elapsed: 2.321096 sec
Time elapsed: 5.325778 sec
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 4096
Enter number of cols: 4096
Time elapsed: 11.917786 sec
Time elapsed: 28.651167 sec

Page 9:

How fast is FFTW anyway? Is there any hope of doing better?

And how about SiCortex’s optimized math library?

ploh@sc1-m3n6 ~/kube-gustavson $ gcc test_kube.c timed-kube-gustavson-fft.o -O3 -lscm -lm

ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 1024
Enter number of cols: 1024
Time elapsed: 0.540417 sec
Time elapsed: 1.143589 sec
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 2048
Enter number of cols: 2048
Time elapsed: 2.260734 sec
Time elapsed: 5.208607 sec
ploh@sc1-m3n6 ~/kube-gustavson $ ./a.out
Enter number of rows: 4096
Enter number of cols: 4096
Time elapsed: 11.729611 sec
Time elapsed: 28.277497 sec

Page 10:

How fast is the machine anyway? How speedy can we hope to get?

150 mflops sounds really slow compared to benchFFT graphs!

http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/

Why: 500MHz MIPS processors
Capable of performing two instructions per cycle*, but still… just 500MHz

(On the other hand, the machine is cool.
Very cool: 1W per processor!
More like 3W counting interconnect and everything else needed to keep a processor happy, but still just 16 kW for 5832 procs.)

So if you want to run a serial FFT, grab a desktop -- even a five-year-old desktop.

Page 11:

Let's talk parallel

“Meanwhile, I've been trying to get a feel for what our SC648 has to offer in terms of relative speeds of communication and computation. (Mainly I wanted a clearer picture of where to look for parallel FFT speedup.) A couple numbers I got were a little surprising -- please comment if you have insight.

Point-to-point communication speed for MPI Isend/Irecv nears 2GB/sec as advertised for message sizes in the MB range. Latency is such that a 32KB message achieves half the max.

The above speeds seem to be mostly independent of the particular pair of processors which are trying to talk, except that in a few cases -- apparently when the processors belong to the same 6-CPU node -- top speed is only 1GB/sec. This is the opposite of what I (perhaps naively) expected -- any explanation?”

[Where's the Kautz graph???][Where's the Kautz graph???]
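For the curious, here is a minimal sketch of the kind of point-to-point test described above (an assumed harness, not necessarily the code actually used): run with exactly two ranks, streaming from rank 0 to rank 1 with Isend/Irecv and reporting MB/sec:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    long bytes = (argc > 1) ? atol(argv[1]) : 1L << 20;  /* default: 1 MB */
    const int reps = 100;
    char *buf = malloc(bytes);
    memset(buf, rank, bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        MPI_Request req;
        if (rank == 0)
            MPI_Isend(buf, (int)bytes, MPI_CHAR, 1, i, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(buf, (int)bytes, MPI_CHAR, 0, i, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }
    double dt = MPI_Wtime() - t0;
    if (rank == 0)
        printf("%ld bytes: %.1f MB/sec\n", bytes, reps * bytes / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Run as, e.g., mpirun -np 2 ./a.out 1048576 and sweep the message size to see the latency/bandwidth crossover.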

Page 12:

Let's talk parallel

“Upon introducing network traffic by asking 128 processors to simultaneously perform 64 pairwise communications, speed drops to around 0.5GB/sec with substantial fluctuations.

For comparison, the speed of memcpy on a single CPU is as follows:

Size (bytes)   Time (sec)   Rate (MB/sec)
        256     0.000000     1152.369380
        512     0.000000     1367.888343
       1024     0.000001     1509.023977
       2048     0.000001     1591.101956
       4096     0.000003     1421.082913
       8192     0.000006     1436.982715
      16384     0.000010     1638.373220
      32768     0.000097      337.867886
      65536     0.000194      337.250067
     131072     0.000413      317.622002
     262144     0.001054      248.758781
     524288     0.002281      229.862557
    1048576     0.004371      239.917118
    2097152     0.008626      243.109613

The table above shows clear fall-offs when the L1 cache (32KB) and L2 cache (256KB) are exceeded. What really surprises me, however, is that according to the previous experiments, sending data to another processor is as fast as memcpy even in the best-case scenario (memcpy staying within L1), and is many times faster once the array being copied no longer fits in cache! (Don't the MPI operations also have to leave cache though? How can point-to-point copy be so much faster?)”
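A minimal sketch of how such a memcpy table can be produced (again an assumed harness, not necessarily the one used): copy each size repeatedly so a roughly fixed total volume moves per row, then report per-copy time and rate:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static double wall(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

int main(void) {
    printf("%12s %12s %14s\n", "Size (bytes)", "Time (sec)", "Rate (MB/sec)");
    for (size_t size = 256; size <= 2097152; size *= 2) {
        char *src = malloc(size), *dst = malloc(size);
        memset(src, 1, size);
        long reps = 100000000L / (long)size + 1;   /* ~100 MB moved per size */
        double t0 = wall();
        for (long i = 0; i < reps; i++) memcpy(dst, src, size);
        double per = (wall() - t0) / reps;
        printf("%12zu %12.6f %14.6f\n", size, per, size / per / 1e6);
        free(src);
        free(dst);
    }
    return 0;
}

The cache cliffs show up as drops in the rate column once the working set (source plus destination) outgrows L1 and then L2.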

Page 13:

A trip to Maynard

Sample output from FFTW 3.1.2:
~3 sec planning, ~30 sec for the transform, ~3 mflops?!
But I got 100+ mflops on my unoptimized version!

Try rebuilding with a couple different config options…

Aha! Blame “--enable-long-double”

Still off by a factor of 2 from FFTW 2.1.5; the missing cycle.h probably accounts for some of that, but not all of it

Page 14:

Back to algorithms: What plans does FFTW actually choose?

“I found out that turning on the "verbose" option (fftw_test -v for 2.x and bench -v5, say, for 3.x) outputs the plan. Based on a handful of test runs on power-of-2 transforms, FFTW 2.1.5 seems to prefer:

radix 16 and radix 8, for transform lengths up to about 2^15
primarily radix 4 but finishing off with three radix 16 steps, for larger transforms.

There are also occasional radix 2 and 32 steps but these seem to be rare.

[In practice, FFTW 2 for 1D power-of-2 transforms seems to just try the various possible radices up to 64 -- nothing fancier than that. FFTW 3 tries a lot more things, including a radix-\sqrt{n} first step.]

Last week we weren't totally satisfied with the 258 "mflops" we were getting from Andy's optimized FFTW 2.1.5; however, the transform we were running was fairly large, and in fact Andy's FFTW 2 gets 500+ "mflops" on peak sizes (around 1024). Basically, there's a falloff in performance as soon as the L1 cache is exceeded and another one once we start reaching into main memory. All of the graphs on the benchFFT page show these falloffs; indeed, getting a peak "mflops" number slightly better than the clock rate is about the best anyone can do (on non-SIMD processors) -- congrats.

On the other hand, we still don't really have a working FFTW 3: without a cycle counter, the code currently does no timing whatsoever and instead chooses plans based on a heuristic. Hopefully we can get that straightened out soon, because at the moment its planner information is pretty useless to us!

Once that's fixed we'll see how its speed compares to FFTW 2. Either way, I'd say FFTW 2 is pretty well-optimized (although the library you're currently distributing with your machines is not!) and our focus should really be on FFTW 3. In particular, the algorithms that FFTW 2 uses for parallel transforms are just plain slow (I can elaborate on that if you wish). On the bright side, 3.2alpha already implements several improvements -- the first three or four possibilities that came to mind for me are all already there -- and despite being in alpha, it's in good enough shape to start playing with, once we have a cycle counter.”
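(Aside: in FFTW 3 the same plan dump is also available programmatically; a minimal sketch using fftw_print_plan, with FFTW_PATIENT mirroring bench -opatient:)

#include <complex.h>   /* before fftw3.h, so fftw_complex == double complex */
#include <fftw3.h>
#include <stdio.h>

int main(void) {
    const int n = 1 << 15;
    fftw_complex *x = fftw_malloc(sizeof(fftw_complex) * n);
    for (int i = 0; i < n; i++) x[i] = 0.0;

    /* FFTW_PATIENT times many candidate plans -- which is exactly why a
     * working cycle counter matters on this machine. */
    fftw_plan p = fftw_plan_dft_1d(n, x, x, FFTW_FORWARD, FFTW_PATIENT);
    fftw_print_plan(p);   /* dumps the radix decomposition chosen */
    printf("\n");

    fftw_destroy_plan(p);
    fftw_free(x);
    return 0;
}

Compile with gcc plan.c -lfftw3 -lm and vary n to watch the preferred radices change.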

Page 15:

What does parallel FFTW actually do?

FFTW 2: Transpose, 1D FFTs, transpose, 1D FFTs, transpose -- straight from the book

Choose radix as close to \sqrt{n} as possible

Example: 1D complex double forward transform of size 2^24 (with workspace)
"Parallel" algorithm on 1 proc: 25% slowdown
2 procs: 1.5x speedup
…
64 procs: 29.900830x
128 procs: 67.845401x
256 procs: 119.773536x (12 GFlops)
512 procs: 143.338615x (15 GFlops)

FFTW 3.2alpha: Same general "six-step" algorithm, but with twists
Choose radix equal to np if possible
Skip local transposes -- let FFTW decide whether or not to do them
Consider doing transposes in stages to avoid all-to-all communication

Example (sans cycle counter!):
Problem: ocf16777216, setup: 18.27 s, time: 308.69 ms, ``mflops'': 6522
Problem: ocf16777216, setup: 13.82 s, time: 158.16 ms, ``mflops'': 12730
Problem: ocf16777216, setup: 17.21 s, time: 118.28 ms, ``mflops'': 17020
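To make the six-step structure concrete, here is a self-contained serial sketch of the decomposition (my illustration using FFTW 3's fftw_plan_many_dft, not the 3.2alpha MPI code itself). Strided column FFTs stand in for transpose-plus-FFT; in the parallel version the transposes become all-to-all communication, and the final transpose is the one FFTW may let you skip:

#include <complex.h>   /* before fftw3.h, so fftw_complex == double complex */
#include <fftw3.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    const int n1 = 64, n2 = 256, n = n1 * n2;   /* n = n1 * n2: "radix n1" first step */
    fftw_complex *x   = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *y   = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *ref = fftw_malloc(sizeof(fftw_complex) * n);

    /* n2 column FFTs of length n1, stride n2 (transpose + FFT rolled together). */
    fftw_plan cols = fftw_plan_many_dft(1, &n1, n2, x, NULL, n2, 1,
                                        y, NULL, n2, 1, FFTW_FORWARD, FFTW_ESTIMATE);
    /* n1 contiguous row FFTs of length n2, in place. */
    fftw_plan rows = fftw_plan_many_dft(1, &n2, n1, y, NULL, 1, n2,
                                        y, NULL, 1, n2, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_plan direct = fftw_plan_dft_1d(n, x, ref, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < n; i++) x[i] = cos(i) + I * sin(2.0 * i);  /* test data */
    fftw_execute(direct);
    fftw_execute(cols);

    /* Twiddle multiply: y[k1][i2] *= exp(-2 pi i k1 i2 / n). */
    for (int k1 = 0; k1 < n1; k1++)
        for (int i2 = 0; i2 < n2; i2++)
            y[k1 * n2 + i2] *= cexp(-2.0 * M_PI * I * ((double)k1 * i2) / n);

    fftw_execute(rows);

    /* Output is in transposed order: y[k1*n2 + k2] == X[k1 + n1*k2]. */
    double err = 0.0;
    for (int k1 = 0; k1 < n1; k1++)
        for (int k2 = 0; k2 < n2; k2++) {
            double d = cabs(y[k1 * n2 + k2] - ref[k1 + n1 * k2]);
            if (d > err) err = d;
        }
    printf("max |six-step - direct| = %g\n", err);

    fftw_destroy_plan(cols);
    fftw_destroy_plan(rows);
    fftw_destroy_plan(direct);
    fftw_free(x);
    fftw_free(y);
    fftw_free(ref);
    return 0;
}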

Page 16:

Ideas for improvement

Serial

Micro level
Low-level vectorization for cache line efficiency and SIMD

Macro level
Investigate large-radix steps (and include them in tuning?)

Stop wasting time waiting for memory
Multithreading -- even on one processor -- to compute while waiting?
Prefetching? (see the sketch after this list)

Parallel
What's the best initial radix?
What's the best number of communication steps to use in a transpose?
Can we do FFTW-esque self-tuning?
Latency-hiding by reordering work

Do half the computation in row-col order, other half in col-row order so that processors always have work to do
Exploit the fact that the DMA Engine/Fabric Switch is a "processor" that takes care of sending and receiving messages

Load-balancing by having pairs of processors work together at each step?
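On the prefetching bullet above, a self-contained sketch (my illustration, using GCC's __builtin_prefetch) of requesting data a few iterations ahead during a strided sweep -- the access pattern a strided FFT pass produces; any actual benefit depends on the memory system:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 1 << 24, stride = 16;   /* strided, cache-unfriendly sweep */
    double *a = calloc(n, sizeof *a);
    if (!a) return 1;

    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride) {
        /* Hint: fetch the operand ~8 iterations ahead (read access,
         * low temporal locality); it is only a hint and may be ignored. */
        __builtin_prefetch(&a[i + 8 * stride], 0, 0);
        sum += a[i];
    }
    printf("sum = %f\n", sum);

    free(a);
    return 0;
}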

Page 17:

How we know there’s more to do

Serial
The edge-of-cache cliff:

http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/

Parallel
HPCC has adopted FFTE, which claims it's faster on large transforms
Whether or not that's true, everything is slow in parallel

SiCortex talk: 174 GFlops for HPCC FFTE(?) on 5832 500MHz procs
University of Tsukuba Center for Computational Sciences poster (2006): 13 GFlops for FFTE on 32 3.0GHz procs

http://www.rccp.tsukuba.ac.jp/SC/sc2006/Posters/16_hpcs-ol.pdf

Intel Math Kernel Library: 25 GFlops on 64 dual-core 3.0GHz procs

Intel claims they’re winning on all fronts:
http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

Page 18:

Anyone want to help write a faster parallel FFT?

Let me know!

Joke of the day: You know you've been spending too much time thinking about parallel computing when…
You cite "multithreading between different class projects" as the reason for slow progress on each (and lament the fact that you have but one processor to apply to an embarrassingly parallel problem)
You start thinking about time wasted switching between tasks in terms of memory reads and writes, and it dawns on you that a cache-oblivious algorithm could help
You realize that LRU is a pretty good model for cramming (and wish you could achieve nanosecond latencies)