
Parallel Architectures

COMP375 Computer Architecture and Organization

    Ever Faster

There is a continuing drive to create faster computers.

Some large engineering problems require massive computing resources.

Increases in clock speed alone will not provide the needed performance.

Parallelism provides an ability to get more done in the same amount of time.

Intel Microprocessor Performance

[Figure: Intel microprocessor performance, 1987-2003, from the Stallings textbook; total performance is split into the contribution of clock speed and of architectural improvements.]


Large Computing Challenges

[Figure: computing speed and memory requirements of large applications; the demands exceed the capabilities of even the fastest current uniprocessor systems.]

from http://meseec.ce.rit.edu/eecc756-spring2002/756-3-8-2005.ppt

    Parallelism

The key to making a computer fast is to do as much in parallel as possible.

Processors in modern home PCs do a lot in parallel:

    Superscalar execution

    Hyperthreading

    Pipelining

Processes and Threads

All modern operating systems support multiple processes and threads.

Processes have their own memory space. Each program is a process.

Threads are a flow of control through a process. Each process has at least one thread.

Single processor systems support multiple processes and threads by sequentially sharing the processor.
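As a concrete illustration (not from the slides), here is a minimal POSIX-threads sketch in C: one process, two threads sharing its memory, scheduled by the operating system.

#include <pthread.h>
#include <stdio.h>

/* Each thread is a separate flow of control through this function. */
static void *worker(void *arg) {
    const char *name = arg;
    for (int i = 0; i < 3; i++)
        printf("%s: step %d\n", name, i);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;

    /* Both threads live in the same process and share its memory. */
    pthread_create(&t1, NULL, worker, "thread 1");
    pthread_create(&t2, NULL, worker, "thread 2");

    /* On a single processor the OS time-slices the two threads;
       on a multiprocessor they can run at the same time. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}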

Hyperthreading

Hyperthreading allows the apparently parallel execution of two threads.

Each thread has its own set of registers, but they share a single CPU.

The CPU alternates executing instructions.

Pipelining is improved because adjacent instructions come from different threads and therefore do not have data hazards.


Hyperthreading Performance

Also called multithreaded processors.

Overall throughput is increased, but individual threads do not run as fast as they would without hyperthreading.

[Figure: instruction slot diagrams comparing pipelining (one thread's instructions back to back) with hyperthreading (slots alternating between Thread 1 and Thread 2).]

Hyperthreading:
1. Runs best in single-threaded applications
2. Replaces pipelining
3. Duplicates the user register set
4. … processors

How Visible is Parallelism?

Superscalar
  Programmer never notices.

Multiple threads
  Programmer must create multiple threads in the program.

Multiple processes
  Different programs: programmer never notices.
  Parts of the same program: programmer must divide the work among different processes.


Flynn's Parallel Classification

SISD - Single Instruction Single Data
  standard uniprocessors

SIMD - Single Instruction Multiple Data
  vector and array processors

MISD - Multiple Instruction Single Data
  systolic processors

MIMD - Multiple Instruction Multiple Data

MIMD Computers

Shared Memory
  Attached processors
  Multiple processor systems
  Multiple instruction issue processors
  Multi-Core processors
  NUMA

Separate Memory
  Clusters
  Message passing multiprocessors
  Hypercube
  Mesh

Single Instruction Multiple Data

Many high performance computing programs perform calculations on large data sets. SIMD computers can be used to perform matrix operations.

The Intel Pentium MMX instructions perform SIMD operations on 8 one-byte integers.
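As a hedged illustration: the MMX intrinsic _mm_add_pi8 performs exactly the 8-byte addition described above, but MMX is legacy today, so this sketch uses the 128-bit SSE2 successor (assumed available; SSE2 is not named in the slides), which adds 16 one-byte integers with a single instruction.

/* packed-byte SIMD addition with Intel SSE2 intrinsics */
#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void) {
    unsigned char a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { a[i] = i; b[i] = 10 * i; }

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi8(va, vb);   /* 16 byte-adds, one instruction */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 16; i++)
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}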

Vector Processors

A vector is a one-dimensional array.

Vector processors are SIMD machines that have instructions to perform operations on vectors.

A vector processor might have vector registers.

A single instruction could add two vectors.


    Vector Example

/* traditional scalar loop: 64 separate additions */
double a[64], b[64], c[64];

for (int i = 0; i < 64; i++)
    c[i] = a[i] + b[i];

/* a vector processor does the same work in one
   instruction: c = a + b, element by element */

    Non-contiguous Vectors

A simple vector is a one-dimensional array with each element in successive addresses.

In a two-dimensional array, the rows can be considered simple vectors. The columns are also vectors, but they are not stored in consecutive addresses.

The distance between elements of a vector is called the stride.
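A small C sketch (mine, not from the slides) showing the two strides in a row-major 3x4 array: row elements are 1 apart, column elements are a full row length apart.

#include <stdio.h>

#define ROWS 3
#define COLS 4

int main(void) {
    double m[ROWS][COLS];
    double *p = &m[0][0];              /* the matrix as flat memory */

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            m[i][j] = 10.0 * i + j;

    /* row 1 is a simple vector: stride 1 */
    double row_sum = 0.0;
    for (int j = 0; j < COLS; j++)
        row_sum += p[1 * COLS + j];

    /* column 1 is a vector too, but with stride COLS */
    double col_sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        col_sum += p[i * COLS + 1];

    printf("row 1 sum = %g, column 1 sum = %g\n", row_sum, col_sum);
    return 0;
}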

    Vector Stride

Matrix multiplication:

    [a11 a12 a13]   [b11 b12]
    [a21 a22 a23] * [b21 b22]
                    [b31 b32]

The rows of A are unit-stride vectors; the columns of B are vectors whose stride equals the row length.
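To make the stride concrete in code, here is a sketch (my own; sdot is a hypothetical helper modeled on the BLAS dot product) that forms each element of the product as a dot product of a unit-stride row of A with a stride-2 column of B.

#include <stdio.h>

/* hypothetical helper: dot product of n elements with arbitrary strides */
static double sdot(int n, const double *x, int incx,
                          const double *y, int incy) {
    double sum = 0.0;
    for (int k = 0; k < n; k++)
        sum += x[k * incx] * y[k * incy];
    return sum;
}

int main(void) {
    /* A is 2x3 and B is 3x2, both stored row-major */
    double A[2 * 3] = {1, 2, 3,
                       4, 5, 6};
    double B[3 * 2] = {1, 2,
                       3, 4,
                       5, 6};
    double C[2 * 2];

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            /* row i of A: stride 1; column j of B: stride 2 */
            C[i * 2 + j] = sdot(3, &A[i * 3], 1, &B[j], 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}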

    Cray-1 Vector Processor

Sold to Los Alamos National Laboratory in 1976 for $8.8 million

80 MHz clock

160 megaflops

8 megabytes (1M words) of memory

Freon cooled


Cray-2

Built in 1985 with ten times the performance of the Cray-1.

Intel calls some of their new Pentium instructions SIMD because:
1. They operate on multiple bytes
2. There is a separate processor for them
3. … units operate on one piece of data
4. They are executed in the GPU

    Data Flow Architecture

[Figure: data flow graph computing N = (A+B) * (B-4).]

MISD Systolic Architectures

A computed value is passed from one processor to the next.

[Figure: Memory -> Processor Element -> Processor Element -> Processor Element]

Each processor may perform a different operation.

Minimizes memory accesses.


Systolic Architectures

Systolic arrays were meant to be special-purpose processors or coprocessors, built from very fine-grained computational units, usually called cells.

Communication is very fast; granularity is meant to be very fine (a small number of computational operations per communication).

Very fast clock rates due to the regular, synchronous structure.

Data moves through pipelined computational units in a regular and rhythmic fashion.

Warp and iWarp were examples of systolic arrays.

from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf

Systolic Array Example: 3x3 Systolic Array Matrix Multiplication

Processors are arranged in a 2-D grid; each processor accumulates one element of the product.

[Figure: snapshots of the array at T = 0 through T = 7. The rows of A enter from the left and the columns of B from the top, skewed so that matching operands align in time; at each step every cell multiplies the pair of values passing through it, adds the product to its accumulator, and passes the values on. By T = 7 cell (i,j) holds a(i,0)*b(0,j) + a(i,1)*b(1,j) + a(i,2)*b(2,j) and the multiplication is done.]

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
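The rhythm of the figure can be checked in software. Below is a sketch (my own, not from the course) that simulates the 3x3 array: each cell takes an A operand from its left neighbor and a B operand from its top neighbor, multiply-accumulates, and passes the operands on, with inputs skewed exactly as in the snapshots.

#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[N][N] = {{0}};   /* one accumulator per cell        */
    double a[N][N] = {{0}};   /* A value currently in each cell  */
    double b[N][N] = {{0}};   /* B value currently in each cell  */

    /* after 3N-2 steps every operand has passed through the array */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* update from the far corner back so each cell reads its
           neighbor's value from the previous step */
        for (int i = N - 1; i >= 0; i--) {
            for (int j = N - 1; j >= 0; j--) {
                /* leftmost column takes skewed input from row i of A */
                a[i][j] = (j == 0)
                    ? ((t >= i && t < i + N) ? A[i][t - i] : 0.0)
                    : a[i][j - 1];
                /* top row takes skewed input from column j of B */
                b[i][j] = (i == 0)
                    ? ((t >= j && t < j + N) ? B[t - j][j] : 0.0)
                    : b[i - 1][j];
                C[i][j] += a[i][j] * b[i][j];
            }
        }
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6.1f ", C[i][j]);   /* matches an ordinary matmul */
        printf("\n");
    }
    return 0;
}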

    Systolic Architectures

Never caught on for general-purpose computation
  Regular structures are HARD to achieve
  Real efficiency requires local communication, but the flexibility of global communication is often critical

Used mainly in structured settings
  High speed multipliers
  Structured array computations

from www-csag.ucsd.edu/teaching/cse160s05/lectures/Lecture15.pdf

Multiple Instruction Multiple Data

Multiple independent processors can execute different parts of a program at the same time.

The processors can be different or identical.

Different architectures allow the processors to use:
  the same memory
  separate memory
  local memory with access to all memory

    Specialized Processors

Some architectures have a general purpose processor with additional specialized processors:

I/O processors - I/O controllers with DMA

Graphics processors

Floating point coprocessors - the Intel 80387 for the Intel 80386

Vector processors


    Separate Memory Systems

Each processor has its own memory that none of the other processors can access.

Each processor has unhindered access to its memory.

No cache coherence problems.

Communication is done by passing messages over special I/O channels.
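The slides do not name an API, but as an illustration, here is a minimal message-passing sketch using MPI (the usual library on separate-memory clusters): process 0 sends an integer to process 1; neither process can read the other's memory directly. Run with at least two processes, e.g. mpirun -np 2.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        /* the message travels over the interconnect, not shared memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}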

Topologies

The CPUs can be interconnected in different ways.

Goals of an interconnection method are:
  Minimize the number of connections per CPU
  Minimize the average distance between nodes
  Minimize the diameter, or maximum distance between any two nodes
  Maximize the throughput
  Simple routing of messages

Grid or Mesh

Each node has 2, 3, or 4 connections.

Routing is simple, and many routes are available.

The diameter is 2(√N - 1), about 2√N.
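A sketch (mine, not from the slides) of the simple routing: dimension-order (X-then-Y) routing walks the row first, then the column, so the hop count is the Manhattan distance that gives the mesh its diameter.

#include <stdio.h>

/* print each hop from (x0,y0) to (x1,y1) and return the hop count */
static int xy_route(int x0, int y0, int x1, int y1) {
    int hops = 0;
    while (x0 != x1) {                 /* correct X first */
        x0 += (x1 > x0) ? 1 : -1;
        printf("-> (%d,%d)\n", x0, y0);
        hops++;
    }
    while (y0 != y1) {                 /* then correct Y */
        y0 += (y1 > y0) ? 1 : -1;
        printf("-> (%d,%d)\n", x0, y0);
        hops++;
    }
    return hops;
}

int main(void) {
    /* corner to corner of a 4x4 mesh: 6 hops = 2(√16 - 1) */
    printf("%d hops\n", xy_route(0, 0, 3, 3));
    return 0;
}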

Torus

Similar to a mesh with connected end nodes.

Many similar routes are available.

The diameter is about √N.

    Parallel Architectures

  • 7/27/2019 Parallel Arch

    11/11

    Parallel Architectures

    11

Hypercube

Each node has log2 N connections.

Routing can be done by changing one bit of the address at a time.

The diameter is log2 N.
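A sketch (mine, not from the slides) of that routing rule: XOR the current and destination node numbers and flip one differing address bit per hop, which takes at most log2 N hops.

#include <stdio.h>

static void hypercube_route(unsigned src, unsigned dst) {
    unsigned node = src;
    printf("%u", node);
    while (node != dst) {
        unsigned diff = node ^ dst;    /* address bits still to fix */
        unsigned bit = diff & -diff;   /* lowest differing bit      */
        node ^= bit;                   /* one hop across that dimension */
        printf(" -> %u", node);
    }
    printf("\n");
}

int main(void) {
    /* in an 8-node cube: 000 -> 001 -> 101 */
    hypercube_route(0u, 5u);
    return 0;
}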

NUMA systems

In NonUniform Memory Access systems, each node has local memory but can also access the memory of other nodes.

Most memory accesses are to local memory.

The memory of other nodes can be accessed, but the access is slower than access to local memory.

Cache coherence algorithms are used when shared memory is accessed.

More scalable than SMP.

[Figure: NUMA system.]
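As an illustration only (the slides name no API), here is a sketch using Linux's libnuma, assumed to be installed (link with -lnuma): the array is placed on one node's local memory, so threads running on that node get the fast, local accesses.

#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this machine\n");
        return 1;
    }

    size_t size = 1024 * 1024 * sizeof(double);

    /* allocate the array on node 0's local memory */
    double *a = numa_alloc_onnode(size, 0);
    if (a == NULL)
        return 1;

    /* a thread running on node 0 gets fast local accesses;
       threads on other nodes can still read it, just more slowly */
    for (size_t i = 0; i < size / sizeof(double); i++)
        a[i] = (double)i;

    numa_free(a, size);
    return 0;
}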