Many-Core Programming with GRAMPS
Jeremy Sugerman, Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan

Problem Statement
Facilitate efficient development and execution in many-/multi-core commodity systems.
Homogeneous or heterogeneous cores.
Status Quo:
– GPUs: Easy to write GL/D3D and run it fast, hard to express anything else
– CPUs: Possible (not easy) to write anything, possible (hard) to run it fast

GRAMPS Background
Resembles a GPU with a software-constructed pipeline.
Not (too) radical even in a pure graphics context
– Similar story: shading went from fixed to programmable
– Now the pipeline topology is under analogous pressures: proliferation of stages and options
And graphics is more than a GL/D3D pipeline…
And throughput / many-core is more than graphics…

GRAMPS Programming Model
Software constructs the pipeline (actually graph)
Exposes threads, shaders, fixed function stages
– Coprocessors exposed via ISA
Exposes FIFOs / Queues connecting stages
– Also enables software push / re-sorting
Exposes Buffers for memory access
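
To make "software constructs the graph" concrete, here is a minimal sketch in Python. It is not the GRAMPS API: the `Graph` class and its methods are hypothetical names invented for illustration. The stage and queue names are borrowed from the rasterization example later in the deck, the Rast -> Shade -> FB Blend wiring is one reading of that figure, and the stage kinds assigned below are guesses.

```python
STAGE_KINDS = {"thread", "shader", "fixed-function"}  # stage types named on the slide


class Graph:
    """Hypothetical container for a software-constructed pipeline graph."""

    def __init__(self):
        self.stages = {}  # stage name -> kind
        self.queues = {}  # queue name -> {"producers": [...], "consumers": [...]}

    def add_stage(self, name, kind):
        if kind not in STAGE_KINDS:
            raise ValueError(f"unknown stage kind: {kind}")
        self.stages[name] = kind

    def add_queue(self, name):
        self.queues[name] = {"producers": [], "consumers": []}

    def connect(self, producer, queue, consumer):
        """Route producer -> queue -> consumer; all three must already exist."""
        for stage in (producer, consumer):
            if stage not in self.stages:
                raise KeyError(f"unknown stage: {stage}")
        self.queues[queue]["producers"].append(producer)
        self.queues[queue]["consumers"].append(consumer)

    def dangling_queues(self):
        """Return queues that lack a producer or a consumer."""
        return [q for q, ends in self.queues.items()
                if not ends["producers"] or not ends["consumers"]]


# Application code, not the hardware, decides the topology.
g = Graph()
g.add_stage("Rast", "fixed-function")      # kinds are illustrative guesses
g.add_stage("Shade", "shader")
g.add_stage("FB Blend", "fixed-function")
g.add_queue("Input Fragment Queue")
g.add_queue("Output Fragment Queue")
g.connect("Rast", "Input Fragment Queue", "Shade")
g.connect("Shade", "Output Fragment Queue", "FB Blend")
```

Checking `g.dangling_queues()` before execution is one simple way a setup layer could reject a malformed graph; a real runtime would presumably validate much more.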

GRAMPS’ Place
Compared to GPU Pipeline:
– More things possible (and medium easy), still (mostly) runs fast, less hardware independent
Compared to CPU:
– Easier to write things, easier to run them well, some loss of expressivity and flexibility
Still a role for a ‘graphics pipeline’. It’s an app!
GRAMPS is a layer, model for state machines.

GRAMPS and Streaming
From some angles, GRAMPS sounds a lot like Stream Processing / Computing
Distinctions are most visible in the target traits.
– Streaming expects predictable data creation, flow, and consumption. Intensive offline / compile-time optimization and pre-scheduling.
– GRAMPS expects dynamic data-dependent execution, (and thus) run-time scheduling
– Also, GRAMPS assumes commodity and heterogeneity.
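
The contrast with pre-scheduled streaming can be illustrated with a toy run-time scheduler. This is a sketch under stated assumptions, not the GRAMPS scheduler: the `run` function, the "drain the highest-priority non-empty queue" policy, and the `expand` / `finish` stages are all hypothetical. The point is only that the amount of work a stage emits depends on its input, so the execution order cannot be fixed ahead of time.

```python
from collections import deque


def run(queues, consumers, priority):
    """Toy run-time scheduler.

    queues:    queue name -> deque of pending items
    consumers: queue name -> stage function taking one item and returning
               a list of (output queue name, output item) pairs
    priority:  queue names in the order the scheduler prefers to drain them
    Returns the number of stage invocations performed.
    """
    steps = 0
    while True:
        # Decide at run time which queue to service next.
        ready = next((q for q in priority if queues[q]), None)
        if ready is None:
            return steps
        item = queues[ready].popleft()
        for out_queue, out_item in consumers[ready](item):
            queues[out_queue].append(out_item)
        steps += 1


results = []


def expand(n):
    # Data-dependent output: emits n work items, possibly zero.
    return [("work", i) for i in range(n)]


def finish(x):
    # Sink stage: records a result and emits nothing.
    results.append(x * x)
    return []


queues = {"seeds": deque([2, 0, 3]), "work": deque()}
consumers = {"seeds": expand, "work": finish}
steps = run(queues, consumers, priority=["work", "seeds"])
```

Preferring the downstream queue here keeps the amount of buffered work small; that is one possible policy among many, chosen only to keep the example short.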

GRAMPS Examples
[Figure: Rasterization Pipeline. Stages: Rast, Shade, FB Blend. Queues: Input Fragment Queue, Output Fragment Queue.]
[Figure: Ray Tracing Pipeline. Stages: Camera, Intersect, Shade, FB Blend. Queues: Ray Queue, Sample Queue, Pixel Queue.]

GRAMPS Overview
Concepts:
– Graphs
– Stages: thread, shader, fixed-function
– Queues: ordered, unordered, sets (exclusion)
– Buffers
Components:
– APIs: setup/driver, thread, shader
– Scheduler: fat core, shader core, top-level
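
The slide does not define "ordered" and "unordered" queues, so the following is only one plausible reading, sketched in Python with hypothetical class and method names: producers reserve a slot, commit their item later (possibly out of order), and an ordered queue releases items in reservation order while an unordered queue releases them in commit order.

```python
from collections import deque


class UnorderedQueue:
    """Releases items in whatever order producers commit them (assumed semantics)."""

    def __init__(self):
        self._ready = deque()

    def reserve(self):
        return None  # no ticket needed when order does not matter

    def commit(self, ticket, item):
        self._ready.append(item)

    def pop(self):
        return self._ready.popleft() if self._ready else None


class OrderedQueue:
    """Releases items in reservation order, even if commits arrive out of order."""

    def __init__(self):
        self._next_ticket = 0      # next ticket handed to a producer
        self._next_release = 0     # ticket the consumer must see next
        self._committed = {}       # ticket -> item, waiting for its turn

    def reserve(self):
        ticket = self._next_ticket
        self._next_ticket += 1
        return ticket

    def commit(self, ticket, item):
        self._committed[ticket] = item

    def pop(self):
        # Only the oldest outstanding reservation may be released.
        if self._next_release in self._committed:
            item = self._committed.pop(self._next_release)
            self._next_release += 1
            return item
        return None


ordered = OrderedQueue()
first = ordered.reserve()
second = ordered.reserve()
ordered.commit(second, "b")        # the later reservation finishes first
blocked = ordered.pop()            # nothing released yet: "a" is still pending
ordered.commit(first, "a")
released = [ordered.pop(), ordered.pop()]
```

The ordered variant trades some parallel slack (a finished item can sit behind an unfinished one) for a deterministic output order, which is the kind of guarantee a graphics pipeline typically needs near its end.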

What We’ve Built
Three rendering pipelines:
– Direct3D, Packet Tracer, D3D + Push (Hybrid)
Simulator and Runtime for two machines:
– GPU-like: Many threads per core, hw sched
– CPU-like: Few threads per core, sw sched

Rendering Pipelines
[Figure: Direct3D Pipeline (with Ray-tracing Extension). Stages: IA 1 through IA N, VS 1 through VS N, RO, Rast, PS, OM. Ray-tracing Extension stages: Trace, PS2. Queues: Input Vertex Queue 1 through N, Primitive Queue 1 through N, Primitive Queue, Fragment Queue, Sample Queue Set, Ray Queue, Ray Hit Queue.]
[Figure: Ray-tracing Pipeline. Stages: Tiler, Sampler, Camera, Intersect, Shade, FB Blend. Queues: Tile Queue, Sample Queue, Ray Queue, Ray Hit Queue, Fragment Queue.]
[Legend: Thread Stage, Shader Stage, Fixed-func Stage, Queue, Stage Output, Output via Push.]

High-level Challenges
Is GRAMPS a suitable GPU evolution?
– Enable pipeline competitive with bare metal?
– Enable innovation: advanced / alternative methods?
– Is there a ‘best’ graphics pipeline on top?
Is GRAMPS a good parallel compute model?
– Map well to hardware, hardware trends?
– Support important apps?
– Concepts influence developers?

What’s Next?
Low level implementation: scheduling, more accurate simulation.
More apps: REYES, physics, likely more.
Audit and refine model: graph modification / state change, fork-join / blocking calls, locks / barriers / synchronization primitives intra- or inter-stage
Prototype, explore next generation graphics pipelines.