Transcript of Virtual and Physical Cellular Architectures for Kilo-processor Chip Computers Tamás Roska Hungarian...
- Slide 1
- Virtual and Physical Cellular Architectures for Kilo-processor
Chip Computers. Tamás Roska, Hungarian Academy of Sciences and Pázmány
P. Catholic University, Budapest, Hungary. IEEE CNNA 2010
- Slide 2
- A new era in computing starts: NOT just parallel; NOT using the
present algorithms; NOT massaging the nanoscale devices to repeat
the functional primitives of CMOS building blocks. Running time is
NOT the only measure. There are NO efficient tools. Key:
understanding spatial-temporal algorithmic dynamics. Cellular Wave
Computers: one way of thinking.
- Slide 3
- Table of Contents: The technology trend; The six prototype
platforms; Virtual and Physical Cellular Machines and the Design
Scenario; Processor and memory array representations; Design tools
and principles; Qualitative theory of operators with local
connectedness; New principles of Computational Complexity
- Slide 4
- The Technology Trend
- Slide 5
- ~30 nm: over 10 billion transistors. Severe limits for clock
speed and power dissipation. 25 k mixed-mode processors and sensors
on a cellular visual microprocessor at 180 nm. 45 nm: 70 k logic
and 2 k arithmetic processors; about 1 million 8-bit microprocessors
could be placed (~5 billion transistors)... Why not? Wire delay is
bigger than gate delay, hence communication speed is limited
(synchrony radius). Hence the precedence of locality, i.e. a
cellular, mainly locally connected architecture is a must for high
computing power.
- Slide 6
- International Technology Roadmap for Semiconductors, ITRS/2007
/2009
- Slide 7
- but it also doesn't scale terribly well.
- Slide 8
- The six prototype platforms - related new products
- Slide 9
- 1. CELL Broadband Engine Architecture (CBEA) and high-end
supercomputer. Heterogeneous, multi-core CELL multiprocessor chip:
241M transistors, 235 mm²; 200 GFlops (SP) @ 3.2 GHz; 200 GB/s bus
(internal) @ 3.2 GHz; dual XDR controller (25.6 GB/s); two configurable
interfaces (76.8 GB/s). Power Processor Element (PPE): general-purpose
processor. Synergistic Processor Element (SPE): SIMD-only engine, fast
access to 256 KB local memories, globally coherent DMA to transfer
data. Racks, cabinets, PetaFlops systems.
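The quoted single-precision peak can be sanity-checked with a short calculation. This is only a back-of-envelope sketch: the SPE count (8), the 4-lane SIMD width, and one fused multiply-add per lane per cycle are assumptions about the CELL chip that the slide itself does not state.

```python
# Back-of-envelope check of the "200 GFlops (SP) @ 3.2 GHz" figure.
# Assumptions (not stated on the slide): 8 SPEs, 4 single-precision SIMD
# lanes per SPE, one fused multiply-add (2 flops) per lane per cycle.
CLOCK_GHZ = 3.2
NUM_SPE = 8
SIMD_LANES = 4
FLOPS_PER_FMA = 2


def peak_sp_gflops(clock_ghz, units, lanes, flops_per_op):
    """Peak throughput in GFlops for `units` SIMD engines."""
    return clock_ghz * units * lanes * flops_per_op


spe_peak = peak_sp_gflops(CLOCK_GHZ, NUM_SPE, SIMD_LANES, FLOPS_PER_FMA)
print(f"SPE peak: {spe_peak:.1f} GFlops")
```

Under these assumptions the SPEs alone account for roughly the 200 GFlops quoted on the slide.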
- Slide 10
- 2. FPGAs: three cell arrays (logic processors, arithmetic
processors, and memories). XILINX VIRTEX 6 FPGA: over 70k Advanced
Silicon Modular Logic Blocks in a cellular array; over 2k DSP slices
in a cellular array; 36 kB memory blocks in a cellular array;
application-specific array components. Logic and arithmetic processor
arrays and memory arrays are tiled on a 2D plate. Local directional
preference (stream).
- Slide 11
- Xilinx Virtex-5 FPGA architecture
- Slide 12
- 3. Stacked arrays. GPU: Graphics Processing Units. Fermi (NVIDIA,
2010): in 32-core units / 16 kB local memory; 512 cores (16x32);
IEEE 754-2008 32-bit and 64-bit; full 32-bit integer path with
64-bit extensions; memory access instructions: transition to 64-bit
address
- Slide 13
- Fermi's 16 SMs are positioned around a common L2 cache. Each SM
is a vertical rectangular strip that contains an orange portion
(scheduler and dispatch), a green portion (execution units), and
light blue portions (register file and L1 cache).
- Slide 14
- In each core: an integer/logic (64-bit) unit and an arithmetic
unit (32/64-bit floating point). Local and global memory in a cache
architecture. Possibility to use several 512-core chips. Fully
cellular architecture: Physical Cellular Machine.
- Slide 15
- SENSORY INTEGRATED CHIPS: 4. WITH MIXED MODE CELL PROCESSORS:
Eye-RIS; 5. WITH DIGITAL CELL PROCESSORS: XENON
- Slide 16
- Eye-RIS v.1.2, AnaFocus Ltd.
- Slide 17
- Eye-RIS camera computer: 25 k processors with optical sensors,
1000 frames per second. AnaFocus Ltd., Seville. Many kilo-processor
chips using cellular architecture.
- Slide 18
- XENON V3 (MDA SBIR Phase II Project: ROIC No2, 64x64 prototype).
Topographic sensor-processor arrangement (integrated InGaAs/InP
sensors are not shown in the photo). Photo of the ASIC processor
layer. Processors: 64 (8x8). Sensors: 4096 (64x64). Processing speed
(depends on algorithmic complexity): > 1000 fps. Programming:
general purpose. EUTECUS, Inc. - proprietary
- Slide 19
- 6. Cellular big cores: Intel Cloud 48 (2010), 48 IA processors
in a cellular communication scheme
- Slide 20
- Integrated Systems: Bi-i systems with different sensor inputs
and outputs, and mobile versions; desktop high-end systems; VISCUBE 3D
integrated circuits
- Slide 21
- Desktop Megaprocessor Computer
- Slide 22
- Viscube 3D architecture (diagram labels): Sensor layer (320x240);
Mixed-signal layer; Processor array (160x120); Digital proc. layer;
Processor array (32x24); Digital memory layer; Memory array (32x24);
AD, 8 bit/pixel (gray value); Proc. parallel, 1 bit/pixel (mask);
Pixel parallel; BB ROI; ROI parallel. Eutecus-SZTAKI-Anafocus ONR Grant
- Slide 23
- Virtual and Physical Cellular Machines and the Design
Scenario
- Slide 24
- The many-core Cellular Wave Computer: components. The
architectural element is defined as a processor array placed on a
2D or 3D grid. The processors communicate with a delay proportional
to the distance. There are typically two speed classes: a local f_o
within a synchrony radius r, and a global F_o via a Manhattan-type
bus system. In the case of continuous-time dynamics, f ~ 1/T, T being
the dominant time constant. The local and global clock speeds are
adjusted to keep the power dissipation within a prescribed limit P_d.
The Virtual Cellular architecture of a single array is shown, for a
2D case, in Figure 1 (hiding physical details).
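The two speed classes can be illustrated with a toy timing model. This is a minimal sketch under stated assumptions, not the model used by the author: the names `ArrayTiming` and `transfer_time` are hypothetical, the synchrony radius is measured in Manhattan grid steps, a local transfer costs one local clock period, and a global transfer costs one global clock period per grid step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayTiming:
    r: int            # synchrony radius, in grid steps (assumed Manhattan)
    f_local: float    # local clock f_o in Hz
    f_global: float   # global clock F_o in Hz


def manhattan(a, b):
    """Manhattan distance between two grid addresses (2D or 3D tuples)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def transfer_time(src, dst, timing):
    """Seconds to move one value between two processors (toy model)."""
    d = manhattan(src, dst)
    if d == 0:
        return 0.0
    if d <= timing.r:
        # Inside the synchrony radius: one local clock period.
        return 1.0 / timing.f_local
    # Outside: assumed one global clock period per grid step on the bus.
    return d / timing.f_global


timing = ArrayTiming(r=2, f_local=1e9, f_global=1e8)  # illustrative values
print(transfer_time((0, 0), (1, 1), timing))   # local neighbour
print(transfer_time((0, 0), (3, 4), timing))   # distant processor
```

Even in this crude model a distant transfer is far slower than a local one, which is the argument for algorithms that mostly use local communication.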
- Slide 25
- Slide 26
- New principles: the role of the geometric address of a single
processor in an array introduces new algorithmic principles:
the precedence of geometric locality (due to the physical and
logical locality), i.e. the cellular character (granular,
topographic, etc.); cellular wave dynamics as an instruction
streaming activity pattern in logic. Do NOT parallelize existing
single-processor algorithms; invent new cellular algorithms.
- Slide 27
- The physical constraints on communication speed and power
dissipation. The basic constraints are wire delay and power
dissipation. A side constraint is the number of communication pins
(contacts). The execution of a logic, arithmetic or symbolic
elementary array instruction is defined via r input (u(t)), m
output (y(t)) and n state (x(t)) variables (t is the time instant).
This unit is characterized by its surface/area s, energy e,
operating frequency f, and local power dissipation w = e f. The
signals travel on a wire with length L, width q, and speed v_q,
introducing a delay of D = L / v_q.
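The two physical relations on this slide can be written as a small calculation. This sketch assumes that the local power is energy per operation times operating frequency (w = e f) and that the wire delay is length divided by signal speed (D = L / v_q). The numeric values are illustrative only and do not come from the slides.

```python
def local_power(e_joule, f_hz):
    """Local power dissipation w = e * f (energy per operation x frequency)."""
    return e_joule * f_hz


def wire_delay(length_m, speed_m_per_s):
    """Delay D = L / v_q for a signal on a wire of length L (assumed form)."""
    return length_m / speed_m_per_s


# Illustrative numbers only: 1 pJ per operation at 1 GHz, 1 mm wire,
# signal speed 1e7 m/s.
w = local_power(1e-12, 1e9)
d = wire_delay(1e-3, 1e7)
print(f"w = {w:.3e} W, D = {d:.3e} s")
```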
- Slide 28
- A number of cores can be placed on a single Chip, typically in a
rectangular grid, with R input and Q output physical connectors,
typically at the corners of the Chip; altogether there are K
input/output connectors. The maximal value of dissipation of the Chip
is W. The physics is represented by the maximal values of the number
of cores, K, and W (as well as the operating frequency). The operating
frequency might be global for the whole Chip, F_o, or could be local
within the Chip, f_o, f_i (some parts might be switched off, f_i = 0).
- Slide 29
- Virtual and Physical Cellular Machines. The Virtual Cellular
Machine is the modern version of the Virtual Memory. Its role is (i)
to hide the physical details of the huge number of processing elements
(cores, cells, threads, etc.), (ii) to provide the framework for
designing the algorithms with elementary array instructions, and (iii)
to serve as the starting phase for the Virtual to Physical Cellular
Machine mapping.
- Slide 30
- Virtual Cellular Machines. The computational building blocks of a
Virtual Cellular Machine are mainly arrays, composed of operator and
memory elements, acting in space and time. Heterogeneous and
homogeneous computational blocks. Building blocks have no physical
constraints (size, bandwidth, speed, power dissipation), except the
two classes of memory access (local and global).
- Slide 31
- A typical logic elementary array instruction (LA) is a binary
logic function on n variables, or a memory look-up table array. A
typical arithmetic/analog elementary array instruction (AA) is a
multiply-and-accumulate (add) term (MAC core) array. A symbolic
elementary array instruction (SA) might be a string manipulation
core array (e.g. a P system). A complex cell based array instruction
(XA) hosts cells with all three of the above types of data and
instructions. An 8-, 16-, or 32-bit microprocessor could also be
considered an elementary array instruction, with iterative or
multi-thread implementation.
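As an illustration of the first two instruction types, the sketch below applies one operation to every cell of a small array. The function names `mac_array` and `logic_array` are hypothetical, and plain nested lists stand in for the cell array; a real machine would execute all cells in parallel rather than in loops.

```python
def mac_array(acc, a, b):
    """Arithmetic array instruction (AA): each cell computes acc + a * b."""
    return [[acc[i][j] + a[i][j] * b[i][j] for j in range(len(acc[0]))]
            for i in range(len(acc))]


def logic_array(lut, *bit_planes):
    """Logic array instruction (LA): the same n-input look-up table is
    evaluated in every cell; bit_planes[k][i][j] is input k of cell (i, j)."""
    rows, cols = len(bit_planes[0]), len(bit_planes[0][0])
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            index = 0
            for plane in bit_planes:          # first plane is the MSB
                index = (index << 1) | plane[i][j]
            row.append(lut[index])
        out.append(row)
    return out


XOR_LUT = [0, 1, 1, 0]                         # 2-input XOR as a truth table
p = [[0, 1], [1, 0]]
q = [[0, 0], [1, 1]]
print(logic_array(XOR_LUT, p, q))
print(mac_array([[1, 1]], [[2, 3]], [[4, 5]]))
```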
- Slide 32
- A Virtual Cellular Machine is composed of five types of
building blocks: (i) cellular processor arrays/layers, CP, with
simple (L, or A, or S type) or complex (X) cells and their local
memories, these are the protagonist building blocks, (ii) classical
digital stored program computers, P (microprocessors), (iii)
multimodal topographic (T) or non-topographic inputs and outputs,
I/O (e.g. scalar, vector, or matrix signals), (iv) global memories
of different data types M, organized in qualitatively different
sizes and access times (e.g. cache memories), and (v)
interconnection pathways B (busses).
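The five building-block types map naturally onto a small data structure. The following is a hypothetical sketch: the class and field names are invented for illustration, the array sizes and access time are made up, and only the block categories (CP, P, I/O, M, B) come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CellularProcessorArray:      # (i) CP: cellular processor array/layer
    name: str
    shape: Tuple[int, int]         # rows x columns of cells
    cell_type: str                 # "L", "A", "S" (simple) or "X" (complex)


@dataclass
class StoredProgramComputer:       # (ii) P: classical microprocessor
    name: str


@dataclass
class IOChannel:                   # (iii) I/O: topographic or non-topographic
    name: str
    topographic: bool


@dataclass
class GlobalMemory:                # (iv) M: global memory
    name: str
    access_cycles: int             # hypothetical access-time attribute


@dataclass
class Bus:                         # (v) B: interconnection pathway
    name: str
    endpoints: Tuple[str, ...]


@dataclass
class VirtualCellularMachine:
    arrays: List[CellularProcessorArray] = field(default_factory=list)
    processors: List[StoredProgramComputer] = field(default_factory=list)
    io: List[IOChannel] = field(default_factory=list)
    memories: List[GlobalMemory] = field(default_factory=list)
    busses: List[Bus] = field(default_factory=list)

    def cell_count(self) -> int:
        """Total number of cells over all cellular processor arrays."""
        return sum(rows * cols for rows, cols in (a.shape for a in self.arrays))


vcm = VirtualCellularMachine(
    arrays=[CellularProcessorArray("CP1/1", (64, 64), "A"),
            CellularProcessorArray("CP2/1", (32, 32), "X")],
    processors=[StoredProgramComputer("P0")],
    io=[IOChannel("T input 2D", topographic=True)],
    memories=[GlobalMemory("M1", access_cycles=100)],
    busses=[Bus("B0", ("P0", "CP1/1", "CP2/1", "M1"))],
)
print(vcm.cell_count())
```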
- Slide 33
- Simple example (figure). Diagram labels: global system control &
memory; M, M1, M2; P0, P1 ... Pn; I/O; CP1/1 ... CP1/g; CP2/1 ...
CP2/h; T input 2D; B0, b0, b1, b2; F0...
- Slide 34
- Tasks, the algorithms to be implemented, are defined on the
Data/Memory representations of the Virtual Cellular Machine. Various
data representations exist for Topographic (e.g. picture, image, PDE,
molecular dynamics) and Non-topographic (e.g. Markov processes,
algebra, number theory) problems.
- Slide 35
- Physical Cellular Machines: the model for the physical
implementation of system architectures. A) An algorithm with input,
state and output vectors having real/arithmetic, binary/digital
logic, and symbolic variables (typically implemented via digital
circuits). B) A real-valued state and output dynamic system with
analog/continuous or arithmetic variables (typically implemented
via mixed-mode/analog-and-logic circuits and digital control
processors). C) A physical dynamic entity with well-defined
geometric layout and I/O ports (function in layout) (typical
implementations are CMOS and/or beyond-CMOS nanoscale designs, or
optical architectures with programmable control). We have three
elementary programmable cell processor (cell core) types used in
array implementations.
- Slide 36
- Physical Cellular Machines. The above cell processors have their
own physical parameters; hence the arrays of them, in building
arrangements similar to those in the Virtual Cellular Machines, will
be defined by given physical parameters such as size,
bandwidth/latency, operation speed, and power dissipation. A Physical
Cellular Machine might have the same architecture as the Virtual
Cellular Machine or a different one.
- Slide 37
- The geometry of the architecture reflects the physical
layout of the chips or systems. A building block could be
implemented as a separate chip or a part of a chip. This
architectural geometry also defines the communication (interaction)
speed ranges. Hence physical closeness means higher speed ranges.
The spatial location or topographic address of each elementary cell
processor or cell core, as well as that of each building block
within the architecture, and the communication bandwidth, play a
crucial role. In addition to the number of processors, these are
the most dramatic differences compared to classical computer
science.
- Slide 38
- The Design Scenario (diagram labels): Task/Problem/Workload;
algorithms; Data/object topography; Processor/memory topography;
Virtual Cellular Machine; COMPILER; Physical Cellular Machine;
Physical Implementation
- Slide 39
- Processor and memory/Data array representations
- Slide 40
- Signals, variables, memory: logic/symbolic value, logic/symbolic
array; arithmetic/analog value, arithmetic/analog array. Processors:
logic/symbolic processor, logic/symbolic processor array;
arithmetic/analog processor, arithmetic/analog processor array.
- Slide 41
- FPGA organization
- Slide 42
- GPU organization and CELL organization (diagram labels: 1c, 1D,
2D, 200c, ~50c; 1c, 10c, 1c, 100c)
- Slide 43
- Many sensors/data/memory units on one Cell Processor
- Slide 44
- Separate sensory/memory plane. Flow architecture (diagram labels: 1, 2, K)
- Slide 45
- Design Tools and Principles
- Slide 46
- Homogeneous cellular arrays; decomposition in 2D; decomposition
in 1D; composing complex cells for a homogeneous FPGA architecture;
self-organizing distributed control; the CNN Universal Machine as a
compiler; nontopographic data representation; spatial distribution of
arithmetic operators with a large number of terms
- Slide 47
- Equivalent transformations for finite-size homogeneous cellular
arrays (Á. Zarándy). Four different methods when the size of the
Physical Cellular Machine is smaller than that of the Virtual
Cellular Machine
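The slide does not spell out the four methods. As one plausible illustration only, the sketch below processes a large virtual array tile by tile on a smaller physical array, loading each tile with a one-cell halo so that a locally connected (3x3) operator gives the same result as on the full array. The function names and the neighbourhood-sum operator are stand-ins chosen for this example, not the cited transformations.

```python
def local_sum(grid):
    """3x3 neighbourhood sum with zero padding: a stand-in local operator."""
    rows, cols = len(grid), len(grid[0])
    return [[sum(grid[y][x]
                 for y in range(max(0, i - 1), min(rows, i + 2))
                 for x in range(max(0, j - 1), min(cols, j + 2)))
             for j in range(cols)]
            for i in range(rows)]


def tiled_local_sum(grid, tile_rows, tile_cols):
    """Same operator, computed on tiles that fit a smaller physical array."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for r0 in range(0, rows, tile_rows):
        for c0 in range(0, cols, tile_cols):
            r1, c1 = min(r0 + tile_rows, rows), min(c0 + tile_cols, cols)
            # Load the tile plus a one-cell halo, clipped at the array border.
            hr0, hc0 = max(r0 - 1, 0), max(c0 - 1, 0)
            hr1, hc1 = min(r1 + 1, rows), min(c1 + 1, cols)
            patch = [row[hc0:hc1] for row in grid[hr0:hr1]]
            result = local_sum(patch)
            # Keep only the tile interior; halo results are discarded.
            for i in range(r0, r1):
                for j in range(c0, c1):
                    out[i][j] = result[i - hr0][j - hc0]
    return out


virtual = [[(3 * i + 5 * j) % 7 for j in range(10)] for i in range(8)]
assert tiled_local_sum(virtual, 3, 4) == local_sum(virtual)
```

In this sketch the physical array must hold a tile plus its halo, so a 3x4 tile needs up to a 5x6 physical array.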
- Slide 48
- Simple example of a Physical Cellular Machine implemented on a
single chip (FPGA): CP2 {30x80}, CP2 {50x20}, CP2 {50x40}, CP2 {50x20},
CP2 {20x30}, CP2 {20x50}. Chip size: 100x80 processors
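The block sizes in this example can be checked against the chip size with simple arithmetic. The sketch below only verifies that the areas add up; the reading of the layout as three side-by-side columns (widths 30, 50 and 20, each 80 high) is an assumption, since the figure is not part of the transcript.

```python
CHIP = (100, 80)                                   # width x height, in processors
BLOCKS = [(30, 80), (50, 20), (50, 40), (50, 20), (20, 30), (20, 50)]

block_area = sum(w * h for w, h in BLOCKS)
chip_area = CHIP[0] * CHIP[1]
print(block_area, chip_area)

# Assumed layout: three columns, grouped by block width.
columns = {}
for w, h in BLOCKS:
    columns[w] = columns.get(w, 0) + h
print(columns)
```

Equal areas are necessary but not sufficient for a gap-free tiling; the column grouping shows one arrangement that is consistent with the listed sizes.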
- A compiler for FPGA implementation (Z. Nagy and P. Szolgay).
Compiler: the FALCON architecture via the CNN Universal Machine.
Special cases for Navier-Stokes 2D (incompressible and
compressible) and for a multilayer retina model. Introducing special
FIFOs to overcome the bandwidth constraint and the global control:
Cellular Distributed Self-organizing Control structure ==>> streaming
activity pattern
- Slide 51
- Xilinx Virtex-5 FPGA architecture
- Slide 52
- Virtex-5 Logic Cell Element: Configurable Logic Block (CLB), 2
slices
- Slide 53
- Arithmetic and Memory Cell Elements: DSP slice, dual-ported
memory
- Slide 54
- Programmable Interconnect
- Slide 55
- Slide 56
- The CNN State Equation. It is assumed that the input is constant or
changing slowly. Forward Euler discretization: feedback equation,
feed-forward equation.
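The slide's equations are not reproduced in this transcript. The sketch below assumes the standard Chua-Yang CNN state equation, dx/dt = -x + A*y + B*u + z with the piecewise-linear output y = 0.5(|x + 1| - |x - 1|), and a forward Euler step of size h. With a constant input, the feed-forward term g = h(B*u + z) is computed once, and the feedback equation x(n+1) = (1 - h) x(n) + h A*y(n) + g is iterated. The function names and the zero boundary condition are assumptions made for this example.

```python
def output(x):
    """Standard piecewise-linear CNN output: clips the state to [-1, 1]."""
    return 0.5 * (abs(x + 1) - abs(x - 1))


def conv3x3(template, grid, i, j):
    """Weighted 3x3 neighbourhood sum at cell (i, j), zero boundary (assumed)."""
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            y, x = i + di, j + dj
            if 0 <= y < rows and 0 <= x < cols:
                total += template[di + 1][dj + 1] * grid[y][x]
    return total


def feed_forward(B, u, z, h):
    """g = h * (B*u + z); computed once when the input u is constant."""
    return [[h * (conv3x3(B, u, i, j) + z) for j in range(len(u[0]))]
            for i in range(len(u))]


def euler_step(x, g, A, h):
    """One feedback update: x(n+1) = (1 - h) x(n) + h * A*y(n) + g."""
    y = [[output(v) for v in row] for row in x]
    return [[(1 - h) * x[i][j] + h * conv3x3(A, y, i, j) + g[i][j]
             for j in range(len(x[0]))]
            for i in range(len(x))]


ZERO = [[0.0] * 3 for _ in range(3)]
A = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]   # illustrative template
g = feed_forward(ZERO, [[0.0]], 0.0, 0.1)
print(euler_step([[0.5]], g, A, 0.1))
```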
- Slide 57
- The Falcon processor core. Memory unit: contains a belt of the
cell array. Mixer: contains cell values for the next updates. Template
memory. Arithmetic unit.
- Slide 58
- The complex arithmetic unit for implementing the cell dynamics:
9 multipliers (3x3 case); adder tree; DSP slice built-in adders; fully
pipelined; 1 cell update / clock cycle
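The multiplier-plus-adder-tree structure can be mimicked in software. This sketch is not the Falcon hardware description: it forms the nine products of a 3x3 template and a 3x3 window, then sums them pairwise while counting the number of adder levels. The function names are hypothetical.

```python
def adder_tree(values):
    """Sum values pairwise, level by level; return (sum, number of levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # an odd element passes to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


def mac9(template, window):
    """Nine multiplications (3x3 case) followed by an adder tree."""
    products = [t * v
                for t_row, w_row in zip(template, window)
                for t, v in zip(t_row, w_row)]
    return adder_tree(products)


ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(mac9(ones, window))
```

Nine products need four adder levels (9, 5, 3, 2, 1 values per level), which is why pipelining the tree allows one cell update per clock cycle after the initial latency.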
- Slide 59
- The Falcon array. Each core processor works on a narrow slice of
the image. Each line computes one CNN iteration. Results are shifted
down to the next row of processors. Linear speedup.
- Slide 60
- Performance compared to a conventional microprocessor
*Estimated data
- Slide 61
- Inviscid, Adiabatic, Compressible flows on the CNN Universal
Machine. Euler equations; the total energy is defined separately.
Notations: t: time; ∇: Nabla operator; ρ: density; v = (u, v):
velocity vector field; p: pressure; I: identity matrix; E: total
energy; γ: ratio of specific heats
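The equations themselves are not reproduced in this transcript. For reference, the textbook conservative form of the compressible Euler equations, written with the notation listed on this slide, is given below. This is the standard form and is assumed, not confirmed, to match what the slide showed.

```latex
\begin{aligned}
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) &= 0 \\
\frac{\partial (\rho \mathbf{v})}{\partial t}
  + \nabla \cdot \left( \rho\, \mathbf{v} \mathbf{v}^{T} + \mathbf{I}\, p \right) &= 0 \\
\frac{\partial E}{\partial t} + \nabla \cdot \left( (E + p)\, \mathbf{v} \right) &= 0 \\
E &= \frac{p}{\gamma - 1} + \frac{1}{2}\, \rho\, \mathbf{v} \cdot \mathbf{v}
\end{aligned}
```

The first three lines express conservation of mass, momentum and energy; the last line is the total energy definition for an ideal gas.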
- Slide 62
- 1st and 2nd order solution
- Slide 63
- NEW algorithms for non-topographic problems. Particle filters
(A. Horváth and M. Rásonyi). Topographic representation of the
nontopographic data. Random selection within topographic locality.
1000 trajectories x 100 time instants x a few particles. Convergence
and stability both on a Virtual and on an 8-bit Physical Cellular
Machine.
- Slide 64
- Arithmetic operations on a large number of terms. Intel Threading
Building Blocks (~ log n). Spatial-temporal optimization and their
shapes on the cellular core (thread) array. Shape filling
optimization (~ log n/N)
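The "~ log n" figure corresponds to a pairwise (tree) reduction when there are enough cores. The sketch below counts parallel addition steps for summing n terms on a given number of cores, using a common textbook model: each core first sums its own chunk sequentially, then the partial sums are combined in a binary tree. This model and the function name are assumptions; the slide's shape-filling optimization is not specified in enough detail to reproduce.

```python
def parallel_sum_steps(n_terms, n_cores):
    """Parallel addition steps to sum n_terms values on n_cores cores."""
    if n_terms < 1 or n_cores < 1:
        raise ValueError("n_terms and n_cores must be positive")
    cores = min(n_cores, n_terms)
    chunk = -(-n_terms // cores)              # ceil(n_terms / cores)
    partials = -(-n_terms // chunk)           # number of partial sums
    local_steps = chunk - 1                   # sequential adds inside a chunk
    combine_steps = (partials - 1).bit_length()   # ceil(log2(partials))
    return local_steps + combine_steps


for cores in (1, 32, 1024):
    print(cores, parallel_sum_steps(1024, cores))
```

With 1024 terms the step count falls from 1023 on one core to 36 on 32 cores and to 10 on 1024 cores, which is log2(1024).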
- Slide 65
- Shape filling optimization N Log N
- Slide 66
- 4 parallel arrays fill factor: static/dynamic N Log N N Data
arriving in parallel waves
- Slide 67
- Algorithm Representations: UMF diagrams; Dynamic Operational
Graphs; Dynamic Acyclic Graphs; Dynamic process graphs; etc.
- Slide 68
- UMF diagrams. A single cellular array/layer defined by a
standard CNN dynamics: U: input array, X_o: initial state array, z:
threshold or mask array, Y: output array, τ: time constant or clock
time, TEM_k: local instruction
- Slide 69
- Algorithmic structures in terms of arrays/layers: cascade and
parallel. A typical parallel structure with two parallel flows,
combined in the final layer, is shown below.
- Slide 70
- Optimization of algorithms represented by UMF diagrams Directed
acyclic graphs representing UMF diagrams Genetic Programming with
Indexed Memory, GP-IM (G. Pazienza)
- Slide 71
- Dynamic operational graphs. Nodes are memories. Branches are
either the operators or communication paths. Extremal graph
problems.
- Slide 72
- Qualitative theory of Operators with local connectedness
- Slide 73
- Fundamental analytic results for 1D Binary CNN (L. O. Chua et
al.): the richness of patterns. A fundamental theorem for oscillating
CNN (F. Corinto, T. Roska, and M. Gilli): any oscillating fully
connected dynamical system array can be represented by a locally
connected one. Synchronization: a globally connected array of
stochastic oscillator cells can also be implemented by locally
connected ones (G. Máté, E. Á. Horváth, E. Káptalan, A. Tunyagi,
Z. Néda, and T. Roska); level of synchrony; size of local neighborhood
- Slide 74
- Logic operators: any 2D Boolean operator can be represented by a
sequence of locally connected standard CNN. Grammar cells: locally
connected simple grammar cell arrays are equivalent to Turing
Machines (E. Csuhaj-Varjú et al.). True randomness in space: true
random spatial bit patterns can be generated with a standard CNN
Universal Machine and spatially correlated local noise (M.
Ercsey-Ravasz, Z. Néda, and T. Roska)
- Slide 75
- New principles of Computational Complexity
- Slide 76
- The algorithmic and physical complexity measures of these
many-core architectures will eventually be different from those of
single-processor systems, abandoning the asymptotic framework (what
is big? 1 million?). With a given number of cores and M total memory,
the finite virtual algorithmic complexity is C_a = C_a(number of
cores, M). The finite physical computational complexity of an
algorithm is measured by a proper mix of speed, power, area, and
accuracy.
- Slide 77
- A new era in Computer Science Computer engineering begins.