Multi-core processors and multithreading


Transcript of Multi-core processors and multithreading

Page 1: Multi-core processors and multithreading


Evolution of processor architectures: growing complexity of CPUs and its impact on the software landscape

Lecture 2

Multi-core processors and multithreading

Paweł Szostek

CERN

Inverted CERN School of Computing, 23-24 February 2015

Page 2: Multi-core processors and multithreading


ADVANCED TOPICS IN COMPUTER ARCHITECTURES

Multi-core processors and multithreading: part 1

Page 3: Multi-core processors and multithreading


CPU evolution

In the past, manufacturers kept increasing the clock frequency, and transistors were invested into larger caches and more powerful cores.

Since 2005, transistors have been spent on new cores → 10 years of a paradigm change (see Herb Sutter's "The Free Lunch Is Over").

Thermal Design Power (TDP) has stalled at ~150 W.

Why does a higher clock speed increase power consumption?
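As a hint, a standard first-order model of dynamic power in CMOS (a textbook approximation, not from the original slides):

$$P_{\text{dyn}} \approx \alpha\, C\, V^2 f$$

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock frequency. Because raising $f$ usually also requires raising $V$, power grows much faster than linearly with clock speed, roughly as $f^3$ in the worst case.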

Page 4: Multi-core processors and multithreading


Interlude: power dissipation

In the past, there were no power dissipation issues.

Heat density (W/cm²) in a modern CPU approaches the level found in a nuclear reactor [1].

"Tricks" are needed to limit power usage (Turbo Boost®, AVX frequencies, more transistors for infrequent use).

This can lead to caveats; see AVX.

[1] David Chisnall, "The Dark Silicon Problem and What It Means for CPU Designers"

Page 5: Multi-core processors and multithreading


Interlude: manufacturing technology

[Figure: size comparison between a flu virus (~120 nm) and a transistor in a 14 nm process]

Page 6: Multi-core processors and multithreading


Simultaneous Multi-Threading

Problem: when executing a stream of instructions, even with out-of-order execution, a CPU cannot keep all the execution units constantly busy.

This can have many causes: hazards, front-end stalls, a homogeneous instruction stream, etc.

Page 7: Multi-core processors and multithreading


Simultaneous Multi-Threading (II)

Solution: we can utilize idle execution units with a different thread.

SMT is a hardware feature that can be turned on/off in the BIOS.

Most of the hardware resources (including caches) are shared; SMT needs a separate fetching unit.

It can both speed up and slow down execution (see next slide).

Page 8: Multi-core processors and multithreading


Simultaneous Multi-Threading (III)

SMT workloads from the HEP-SPEC06 benchmark: many instances of single-threaded processes running in parallel.

Different workloads show different scalability and react differently to SMT.

Cache utilization is the most important factor in the impact of SMT.

Page 9: Multi-core processors and multithreading


Simultaneous Multi-Threading (IV)

Idea: we might want to exploit SMT by running a main thread and a helper thread on the same physical core.

Example: list or tree traversal. The role of the helper thread is to prefetch the data: it works in front of the main thread, accessing data ahead of it.

Think of it as an interesting example of exploiting the hardware.

source: J. Zhou et al., "Improving Database Performance on Simultaneous Multithreading Processors"

Page 10: Multi-core processors and multithreading


Non-Uniform Memory Access

A multi-processor architecture where memory access time depends on the location of the memory with respect to the processor.

Accesses are fast when the memory is "close" to the processor; there is a performance hit when accessing "foreign" memory.

NUMA lowers the pressure on the memory bus. (On Linux, for instance, thread and memory placement can be steered with numactl, e.g. its --cpunodebind and --membind options.)

Page 11: Multi-core processors and multithreading


Cluster-on-die

Problem: with an increasing number of cores there are more and more concurrent accesses to the shared memories (LLC and RAM).

Solution: split the memory on one socket into two NUMA nodes.

Page 12: Multi-core processors and multithreading


Intel architectural extensions

Extension | Generation/year     | Value added
MMX       | Pentium MMX / 1997  | 64b registers with packed data types, integer operations only
SSE       | Pentium III / 1999  | 128b registers (XMM), 32b float only
SSE2      | Pentium 4 / 2001    | SIMD math on any data type
SSE3      | Prescott / 2004     | DSP-oriented math instructions
AVX       | Sandy Bridge / 2011 | 256b registers (YMM), 3-operand instructions
AVX2      | Haswell / 2013      | integer instructions in YMM registers, FMA
AVX512    | Skylake / 2016      | 512b registers

Hardware evolves → programmers and compilers need to adapt
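As a quick way to see which of these extensions the CPU you are running on actually advertises, one can inspect /proc/cpuinfo on Linux. A minimal sketch in Python (an illustrative helper, not from the original slides; note that SSE3 is reported under the flag name pni):

def simd_extensions():
    # SIMD-related flag names as they appear in /proc/cpuinfo ('pni' = SSE3)
    wanted = {'mmx', 'sse', 'sse2', 'pni', 'ssse3', 'sse4_1', 'sse4_2',
              'avx', 'avx2', 'avx512f'}
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.startswith('flags'):
                flags = set(line.split(':', 1)[1].split())
                return sorted(wanted & flags)
    return []

print(simd_extensions())   # e.g. ['avx', 'avx2', 'mmx', 'pni', 'sse', 'sse2']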

Page 13: Multi-core processors and multithreading


Intel extensions example – AVX2

AVX2 is the latest extension from Intel (as of this lecture).

Among others, it introduces FMA3, a multiply-accumulate operation with 3 operands ($0 = $0 × $2 + $1), useful for evaluating a polynomial (remember Horner's method?).

A creative application: the Padé approximant.

VDT is a vector math library based on Padé approximants: a plug&play libm replacement with speed-ups reaching 10x.

$$R(x) = \frac{a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n}{1 + b_1 x + b_2 x^2 + \dots + b_m x^m} = \frac{a_0 + x(a_1 + x(a_2 + \dots + x\,a_n \dots))}{1 + x(b_1 + x(b_2 + \dots + x\,b_m \dots))}$$
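To make the Horner connection concrete, here is a minimal sketch (an illustration of the method, not the actual VDT implementation): each loop iteration is one multiply-add, which maps directly onto a single FMA instruction.

def horner(coeffs, x):
    # coeffs = [a_n, ..., a_1, a_0], highest degree first
    result = 0.0
    for a in coeffs:
        result = result * x + a   # one fused multiply-add per coefficient
    return result

def pade(num_coeffs, den_coeffs, x):
    # A Pade approximant is just the ratio of two Horner evaluations
    return horner(num_coeffs, x) / horner(den_coeffs, x)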

Page 14: Multi-core processors and multithreading


CPU improvements summary

Common ways to improve CPU performance:

Technique                  | Advantages                                                              | Disadvantages
Frequency scaling          | Immediate scaling                                                       | Does not work any more (see: dark silicon)
Hyper-threading            | Medium overhead, up to 30% performance improvement                      | Can double a workload's memory footprint, possible cache pollution
Architectural changes      | Increase versatility and performance, work well with existing software | Huge design overhead, happen ~every 3 years
Microarchitectural changes | Transparent for the users                                               | Huge design overhead
More cores                 | Low design overhead, easy to implement, great scalability               | Requires heavily parallel software

Slide inspiration: A. Nowak, "Multicore Architectures"

Page 15: Multi-core processors and multithreading


PARALLEL ARCHITECTURES ON THE SOFTWARE SIDE

Multi-core processors and multithreading: part 2

Page 16: Multi-core processors and multithreading


Concurrency vs. parallelism

Do concurrent (not parallel) programs need synchronization to access shared resources? Why?

Page 17: Multi-core processors and multithreading


Race conditions

What will the value of n be after both threads finish their work?
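The code this slide refers to is not in the transcript; a minimal sketch of the same idea (two threads incrementing a shared counter without synchronization) could look like this. Even under CPython's GIL, n += 1 compiles to several bytecode operations, so increments from one thread can overwrite those from the other:

from threading import Thread

n = 0

def work():
    global n
    for _ in range(100000):
        n += 1   # read-modify-write: not atomic

threads = [Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(n)   # frequently less than 200000: some increments were lost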

Page 18: Multi-core processors and multithreading


Race conditions (II)

Page 19: Multi-core processors and multithreading


Thread-level parallelism in Python

C++ parallelism skipped on purpose: it is already covered at CSC.

Python is not a performance-oriented language, but it can be made less slow.

We can still use the threading module to benefit from parallel IO operations via threads, relying on the OS. An example is deferred to the synchronization slides.

But wait! Is there real parallelism in Python? What about the Global Interpreter Lock?

Page 20: Multi-core processors and multithreading


Thread-level parallelism in Python (II)

With the multiprocessing package we can easily run many processes to leverage parallelism, though not very efficiently: high memory footprint, no resource sharing, and every worker is a separate process.

from multiprocessing import Pool   # multiprocessing.dummy offers the same API backed by threads

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=4)
    result = pool.map(f, xrange(10))

Page 21: Multi-core processors and multithreading


CSC Refresher: vector operations

Problem: all arithmetic operations are executed one element at a time.

Solution: introduce vector operations and vector registers.

What is the maximal speed-up from vectorization? Why is it hard to obtain in practice?
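As a worked example (not on the original slide): with 256-bit AVX registers and 32-bit floats,

$$\text{max speed-up} = \frac{\text{register width}}{\text{element width}} = \frac{256}{32} = 8$$

In practice, memory bandwidth, data alignment and non-vectorizable loop fragments usually keep the real gain well below this ceiling.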

Page 22: Multi-core processors and multithreading


Auto-vectorization in gcc

Vectorization candidates: (inner) loops.

It only works with more recent gcc versions (>4.6).

By default, auto-vectorization in gcc is disabled.

There are tens of optimization flags, but it is good to remember at least a few:
-mtune=ARCH, -march=ARCH
-O2, -O3, -Ofast
-ftree-vectorize

Page 23: Multi-core processors and multithreading


Vectorization reports

The compiler can tell us which loops were not vectorized and why:
gcc: -ftree-vectorizer-verbose=[0-9]
icc: -vec-report=[0-7]

List of vectorizable loops available on-line: https://gcc.gnu.org/projects/tree-ssa/vectorization.html

Analyzing loop at vect.cc:14
vect.cc:14: note: not vectorized: control flow in loop.
vect.cc:14: note: bad loop form.
vect.cc:6: note: vectorized 0 loops in function.

Page 24: Multi-core processors and multithreading


Intel architectural extensions (II)

The compiler is capable of producing different versions of the same function for different architectures (so-called automatic CPU dispatch). A run-time check is added to the output code.

ICC:

__declspec(cpu_specific(generic))
int foo() { return 0; }

__declspec(cpu_specific(core_i7_sse4_2))
int foo() { return 1; }

In ICC, -axARCH can be used instead.

GCC:

__attribute__((target("default")))
int foo() { return 0; }

__attribute__((target("sse4.2")))
int foo() { return 1; }

Page 25: Multi-core processors and multithreading


Vectorization in C++

It is possible to use intrinsics directly, but this is very cumbersome and "write-only".

There are many libraries approaching vectorization; the choice is not easy.

Example: Agner Fog's Vector Class. Scalar loop:

float a[8], b[8], c[8];
...
for (int i = 0; i < 8; ++i) {
    c[i] = a[i] + b[i] * 1.5f;
}

Vectorized equivalent:

#include "vectorclass.h"

float a[8], b[8], c[8];
...
Vec8f avec, bvec, cvec;
avec.load(a);                 // load 8 floats into a 256b vector
bvec.load(b);
cvec = avec + bvec * 1.5f;    // one vector expression replaces the loop
cvec.store(c);                // write the result back

Page 26: Multi-core processors and multithreading


Vectorization in Python

Vectorization in Python is possible, but requires extra modules and extra care.

numpy has a complete set of vectorized mathematical operations; it requires using its own array types instead of the built-in ones. Array-notation expressions are vectorized.

Any step outside of the numpy world will dramatically slow down execution.

The gains come not only from vectorization, but also from the C types used under the hood.

Example: roots of quadratic equations (see next slide)

Page 27: Multi-core processors and multithreading


Vectorization in Python - example

import numpy as np
from cmath import sqrt
from itertools import izip

# generate 1M sets of coefficients
a = np.random.randn(1000000)
b = np.random.randn(1000000)
c = np.random.randn(1000000)

def solve_numpy(a, b, c):
    delta = b*b - 4*a*c
    delta_s = np.sqrt(delta + 0.j)   # +0.j forces a complex sqrt
    x1 = (-b + delta_s) / (2*a)
    x2 = (-b - delta_s) / (2*a)
    return (x1, x2)

def solve_python(a, b, c):
    for ai, bi, ci in izip(a, b, c):
        delta = bi*bi - 4*ai*ci
        delta_s = sqrt(delta)
        x1 = (-bi + delta_s) / (2*ai)
        x2 = (-bi - delta_s) / (2*ai)
        yield (x1, x2)

timeit list(solve_python(a,b,c))
1 loops, best of 3: 15 s
timeit list(solve_numpy(a,b,c))
10 loops, best of 3: 105 ms

Wow! Where does this speed-up come from?

Page 28: Multi-core processors and multithreading


Accessing shared resources in Python

C++ locking skipped on purpose: covered by Danilo.

threading.Lock: the simplest synchronization primitive, with two possible states, released and acquired. It provides two operations, Lock.acquire(blocking=True) and Lock.release() (see the sketch after this list).

threading.RLock: a reentrant lock; it can be acquired multiple times by the thread that already holds it.

Queue.Queue: a synchronized queue for message/object passing.
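A minimal usage sketch of threading.Lock (an illustration, not from the original slides): the with statement calls acquire() on entry and release() on exit, even if an exception is raised.

from threading import Lock, Thread

counter = 0
lock = Lock()

def safe_increment():
    global counter
    with lock:            # acquire() on entry, release() on exit
        counter += 1

threads = [Thread(target=safe_increment) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # always 10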

Page 29: Multi-core processors and multithreading


Shared resources - example

import Queue
from threading import Thread
import urllib2
from BeautifulSoup import BeautifulSoup

# hosts = [something, something]
url_queue = Queue.Queue()
html_queue = Queue.Queue()

class FetchThread(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        while True:
            host = self.url_queue.get()
            url = urllib2.urlopen(host)
            chunk = url.read()
            self.html_queue.put(chunk)
            self.url_queue.task_done()

A multithreaded application for fetching and processing web pages; communication happens through synchronized queues.

Page 30: Multi-core processors and multithreading


Shared resources – example cont’d

class MineThread(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        while True:
            c = self.html_queue.get()
            soup = BeautifulSoup(c)
            titles = soup.findAll(['title'])
            print(titles)
            self.html_queue.task_done()

def main():
    for i in range(5):
        t = FetchThread(url_queue, html_queue)
        t.setDaemon(True)
        t.start()
    for host in hosts:
        url_queue.put(host)
    for i in range(5):
        dt = MineThread(html_queue)
        dt.setDaemon(True)
        dt.start()
    url_queue.join()
    html_queue.join()

main()

Page 31: Multi-core processors and multithreading


EVOLUTION OF THE COMPUTING LANDSCAPE IN THE FUTURE

Multi-core processors and multithreading: part 3

Page 32: Multi-core processors and multithreading


Intel tick-tock model

Page 33: Multi-core processors and multithreading


Intel Xeon Phi

openlab has been collaborating on it since 2008.

A PCIe co-processor with 61 cores × 4-way SMT, 1 TFLOPS peak performance, and 512-bit vectors.

Next generation: even more cores, 3 times more performance, x86-64 compatible, available as a standalone CPU... maybe in desktops?

But... are my applications ready for such massive parallelism?

Page 34: Multi-core processors and multithreading


ARM 64 (AArch64)

It's all about low power.

64-bit memory addressing provides support for large memory (>4 GB).

RISC architecture.

Common software ecosystem with x86-64, using the same management standards. CISC is also expanding in this direction.

[Figure: energy-efficiency scalability; source: D. Abdurachmanov et al., "Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi"]

Page 35: Multi-core processors and multithreading


Take-home messages

Moore’s law is doing fine. Transistors will be invested into more cores, bigger caches and wider vectors (512b)

NUMA and COD are more of the "complex stuff" that a programmer has to keep in mind.

Parallelization is possible not only with C++

Not everything that looks like an improvement gives you better performance (e.g. AVX)

Multi-threaded applications always require synchronization to protect shared resources

Auto-vectorization is a speed-up for free