High-Performance Computing with C++


Description

Languages such as JavaScript may receive a lot of hype nowadays, but for high-performance, close-to-the-metal computing, C++ is still king. This webinar takes you on a tour of the HPC universe, with a focus on parallelism, be it instruction-level (SIMD), data-level, task-based (multithreading, OpenMP), or cluster-based (MPI). We also discuss how specific hardware can significantly accelerate computation by looking at two such technologies: NVIDIA CUDA and Intel Xeon Phi. (Some scarier tech, such as FPGAs, is also mentioned.) These slides were used as part of the May 29, 2014 webinar, High-Performance Computing with C++. You can watch the webinar on the JetBrainsTV YouTube channel: http://youtu.be/JcSrwxDb-Fs

Transcript of High-Performance Computing with C++

  • [email protected]
  • Quant programmer (C++, .NET, MATLAB); Microsoft MVP, Visual C# (since 2009); Pluralsight course author (MATLAB, CUDA, D, Boost); Technical Evangelist @ JetBrains
  • An overview of available technologies for computation; a look at managed vs. unmanaged code; how to leverage the capabilities of the x86 architecture; what COTS and specialized acceleration hardware exists and how to use it
  • Native code vs. managed code
  • More portable. But C++ is also portable, provided you do not use platform-specific things. In theory, gets optimized for various platforms; in practice, this isn't great. Does not permit low-level interaction with the processor. Additional safety (managed): array bounds checks, type conversion checks, etc.
  • Not always portable (e.g., .NET is only partially portable, excluding UI, WCF, etc.). Typically supports garbage collection. Has ways of interacting with native code (JNI, P/Invoke, C++/CLI).
  • Developer productivity vs. software performance? Managed languages are simpler to use.
  • This talk focuses on CPU-bound problems. Some problems bottleneck on I/O; SSDs have made things a lot better. Optimization mechanisms.
  • Don't expect CPU clock speeds to pick up. PC/server architecture does not scale. The only way to accelerate computation is to provide more entities to compute on.
  • Instruction-level, thread-level, machine-level
  • Via inline assembly; via intrinsics; compiler vectorization; or "magical" compilers (e.g., the Intel SPMD Program Compiler). An intrinsics sketch follows below.
  • SIMD things
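
For illustration only, here is a minimal intrinsics sketch (not from the slides; the function name and the alignment/length assumptions are mine) that adds two float arrays eight elements at a time using AVX:

    #include <immintrin.h>
    #include <cstddef>

    // Add two float arrays, eight floats per instruction, using AVX.
    // Assumes n is a multiple of 8 and the pointers are 32-byte aligned
    // (use _mm256_loadu_ps/_mm256_storeu_ps for unaligned data).
    // Compile with -mavx (GCC/Clang) or /arch:AVX (MSVC).
    void add_avx(const float* a, const float* b, float* out, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 8)
        {
            __m256 va = _mm256_load_ps(a + i);                // load 8 floats from a
            __m256 vb = _mm256_load_ps(b + i);                // load 8 floats from b
            _mm256_store_ps(out + i, _mm256_add_ps(va, vb));  // out = a + b
        }
    }

A plain scalar loop over the same data is often auto-vectorized to similar code at -O2/-O3, which is the "compiler vectorization" option mentioned above.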
  • Processing data in an array: OpenMP, Intel Threading Building Blocks (TBB), Parallel Patterns Library (PPL, Microsoft)
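
As a rough sketch of the OpenMP approach (my example, not the webinar demo), a single pragma is enough to spread independent loop iterations over all cores:

    #include <cmath>
    #include <vector>

    // Process every element of an array in parallel with OpenMP.
    // Compile with -fopenmp (GCC/Clang) or /openmp (MSVC).
    void sqrt_all(std::vector<double>& data)
    {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(data.size()); ++i)
            data[i] = std::sqrt(data[i]);  // each iteration is independent
    }

TBB and PPL express the same idea with parallel_for algorithms instead of pragmas.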
  • GPGPU, expansion boards, custom chips
  • Hardware platforms: NVIDIA, ATI. Software platforms for computation: CUDA, OpenCL, C++ AMP.
  • Typically two GPUs per machine; effectiveness drops off after that due to PCI bus congestion, though it depends on usage patterns.
  • CUDA is the principal commercially successful GPGPU platform. CUDA is supported by many software vendors (Photoshop, MATLAB, etc.). In many domains (e.g., video transcoding), the situation with GPU leverage is dire. In terms of performance, it is thought that CUDA (NVIDIA) has better floating-point math and AMD better integer math.
  • CUDA is actually a managed technology. CUDA is not device-independent. CUDA C is the primary development language.
  • A GPU has several streaming multiprocessors (SMs). Each SM has many streaming processors (SPs). We can launch a large number of threads in parallel. The very large number of SPs ensures that, even at lower clock speeds, the GPU wins out over the CPU.
  • A look at CUDA development
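
As a hedged sketch of what CUDA C development looks like (my own example, not the webinar demo), a kernel is an ordinary-looking C++ function marked __global__ that is launched over a grid of threads, each of which picks its own element by index:

    #include <cuda_runtime.h>

    // Each of the n threads handles one element: y[i] = a * x[i] + y[i].
    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: copy data to the device, launch the kernel, copy results back.
    void run_saxpy(int n, float a, const float* x, float* y)
    {
        float *dx = nullptr, *dy = nullptr;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);

        int block = 256;
        int grid = (n + block - 1) / block;    // enough blocks to cover all n elements
        saxpy<<<grid, block>>>(n, a, dx, dy);

        cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
    }

The file is compiled with nvcc, which separates host and device code.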
  • The GPU does not support ordinary x86. Running several tasks on a GPU is difficult. Branch divergence: branching code (a simple if) turns computation from parallel to sequential.
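
To make the branch-divergence point concrete, here is a hypothetical kernel (not from the slides): threads in the same warp that take different sides of an if are serialized, so the two branches effectively run one after the other.

    // Even and odd threads in a warp take different paths, so the two
    // branches execute sequentially instead of in parallel.
    __global__ void divergent(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i % 2 == 0)
            data[i] *= 2.0f;   // even-numbered threads
        else
            data[i] += 1.0f;   // odd-numbered threads, serialized with the branch above
    }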
  • How do you plug a few more CPUs into a motherboard? You cannot. The architecture doesn't scale. (And never will.) An alternative is to put a coprocessor on the PCI bus.
  • A commercial coprocessor implementation from Intel. A PCI board with 60 cores. Supports x86! Supports different technologies. Runs its own micro Linux (it is not a driver). Can be used in either independent or offload mode. Requires special development tools (the Intel C++ compiler).
  • Intel makes a lot of tools for C++ developers. To work with the Xeon Phi, you need Intel's own tooling (e.g., the Intel C++ compiler).
  • Offload mode, native execution mode, symmetric execution
  • Programming the Xeon Phi
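
A hedged sketch of offload mode using the Intel compiler's offload pragmas (my example; the function and variable names are mine): the marked region and its data are shipped to the coprocessor, executed there (here with OpenMP across the Phi's many hardware threads), and the results are copied back.

    // Requires the Intel C++ compiler with Xeon Phi (MIC) support.
    void scale(float* data, int n, float factor)
    {
        // Copy 'data' to the coprocessor, run the block there, copy it back.
        #pragma offload target(mic) inout(data : length(n))
        {
            #pragma omp parallel for
            for (int i = 0; i < n; ++i)
                data[i] *= factor;
        }
    }

In native mode, the same OpenMP code would simply be cross-compiled and run directly on the card's micro Linux.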
  • 60 cores, 4 hardware threads per core, 8 GB of memory, 512-bit SIMD
  • The same as on ordinary PCs, i.e., OpenMP, MPI, pthreads. Other models are coming soon.
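
For machine-level parallelism, a minimal MPI sketch (my example, not from the slides): each process computes a partial result, and rank 0 collects the total with a reduction.

    #include <mpi.h>
    #include <cstdio>

    // Build with mpicxx; run with, e.g., mpirun -np 4 ./partial_sums
    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 1000000;                // total number of terms
        double local = 0.0;
        for (long i = rank; i < n; i += size)  // interleave the work across processes
            local += 1.0 / (i + 1);

        double total = 0.0;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }

The same source runs unchanged on a single multicore box or across a cluster; only the process launch changes.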
  • FPGA: Field-Programmable Gate Array. Design your own CPU/processing mechanics. A middle ground between a hard-wired ASIC and a very flexible general-purpose CPU. Uses special hardware description languages (HDLs): VHDL, Verilog. There are others (SystemC, OpenCL) and higher-level solutions (e.g., MATLAB, Embeddr).
  • Intrinsically parallel. Low power. Better scalability. Not a COTS solution.
  • An FPGA lets us offload some tasks from the CPU. An FPGA is a lot less flexible, and not so good for math. An FPGA is a low-level construct. FPGAs are relatively expensive to operate.
  • FPGAs do not directly compete with ordinary CPUs. They gain an advantage due to their highly asynchronous nature. The goal is to pre-program an FPGA to solve a single problem very quickly, e.g., protocol parsing in hardware (a so-called feed handler).
  • JetBrains is working on a C++ IDE, and on C++ support in ReSharper. Questions?