Piko: A Framework for Authoring Programmable Graphics Pipelines Anjul Patney and Stanley Tzeng UC...

Post on 01-Jan-2016

Piko: A Framework for Authoring Programmable Graphics Pipelines

Anjul Patney and Stanley Tzeng
UC Davis and NVIDIA

Kerry A. Seitz, Jr. and John D. Owens
UC Davis

What does an efficient graphics pipeline look like?

Renderer | Platform | Algorithm
Unreal Engine 4 | GPU | Rasterization with deferred shading
Unity 5 | GPU | Rasterization with forward / deferred shading
Disney Hyperion | Multicore CPU | Path tracing with deferred shading
Pixar RenderMan | Multicore CPU | Reyes with path tracing
Solid Angle Arnold | Multicore CPU | Path tracing
Media Molecule Dreams | GPU | Point-based rendering with deferred shading

Problem

Efficient graphics pipeline implementations are hard to write and the design space is hard to explore.

Vision

[Figure: a pipeline graph of Stages A through F is mapped, by an as-yet-unknown system ("?"), onto CPU and GPU targets.]

• High-level programmability
• High performance
• Flexibility

Existing Work

Software Pipelines on GPUs

CudaRaster, RenderAnts, VoxelPipe, FreePipe

OptiX and Embree

Programmable engines for accelerating ray tracing on specific platforms.

GRAMPS

• Introduces flexible graphics pipelines
• Abstracts stages in classes
• Abstracts communication by queues

[Sugerman et al. 2009]

Halide

• Introduces programmable image pipelines

• Applies well to shorter and more regular image-processing pipelines

[Ragan-Kelley et al. 2012]

What are the fundamentals of high-performance?

• Parallelism
• Execution locality
• Data locality
• Producer-consumer locality

Spatial tiling

Efficient graphics pipelines utilize spatial tiling

• Packet ray tracing
• SIMD fragment shading on GPUs
• Tiled rendering on mobile GPUs
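As a rough illustration of what spatial tiling means in practice, the sketch below bins primitives into fixed-size screen-space tiles by bounding box. The `Box` type, the `binPrimitives` name, and the row-major tile layout are assumptions made for this example and do not come from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical screen-space bounding box of a primitive, in pixels (inclusive).
struct Box { int minX, minY, maxX, maxY; };

// Returns, for each tile in row-major order, the indices of the primitives
// whose bounding boxes overlap that tile. tileSize is the tile edge in pixels.
std::vector<std::vector<int>> binPrimitives(const std::vector<Box>& prims,
                                            int screenW, int screenH,
                                            int tileSize) {
    assert(screenW > 0 && screenH > 0 && tileSize > 0);
    const int tilesX = (screenW + tileSize - 1) / tileSize;
    const int tilesY = (screenH + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(tilesX * tilesY);
    for (int i = 0; i < static_cast<int>(prims.size()); ++i) {
        const Box& b = prims[i];
        // Clamp the box to the screen; skip primitives that are fully off-screen.
        const int x0 = std::max(b.minX, 0);
        const int y0 = std::max(b.minY, 0);
        const int x1 = std::min(b.maxX, screenW - 1);
        const int y1 = std::min(b.maxY, screenH - 1);
        if (x0 > x1 || y0 > y1) continue;
        // Add the primitive to every tile its clamped box touches.
        for (int ty = y0 / tileSize; ty <= y1 / tileSize; ++ty)
            for (int tx = x0 / tileSize; tx <= x1 / tileSize; ++tx)
                bins[ty * tilesX + tx].push_back(i);
    }
    return bins;
}
```

Once primitives are grouped this way, all work for one tile touches a small, contiguous region of the screen, which is where the locality benefits listed earlier come from.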

Vision

[Figure: a pipeline graph of Stages A through F is mapped onto CPU and GPU targets by Piko, which takes the place of the "?".]

• High-level programmability
• High performance
• Flexibility

System Walkthrough

[Figure: the pikoc toolchain. Components: Pipe Description (Piko), pikoc, Pipe Implementation (C++ / PTX), Host Interface (C++), Host Code (C++), CPU Compiler, Device Compiler, Executable.]

• Host Code (C++): device-independent
• Pipe Description (Piko): pipeline description (graph of stages)
• pikoc: Clang- and LLVM-based infrastructure

Problem

Efficient graphics pipeline implementations are hard to write and the design space is hard to explore.

Approach

Use programmable spatial tiling to help author efficient and flexible graphics pipelines.

Programmable Spatial Tiling

We need three answers from the pipeline author:

• How does data map to spatial tiles? → AssignTile( )
• How do we schedule tiles at runtime? → Schedule( )
• What do we compute for each tile? → Process( )

Each stage in a pipeline consists of these three "phases".
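To make the three phases concrete, here is a minimal sketch of one stage written in that style and a tiny driver that runs the phases in order. It is plain C++ rather than the Piko API: `ToyStage`, `Point`, `runStage`, the constants, and the round-robin scheduling rule are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical input item: a point with a screen position and a value.
struct Point { int x, y, value; };

// A toy stage in the three-phase style. The phase names mirror the slide,
// but this is plain C++, not the Piko API.
struct ToyStage {
    static constexpr int kTileSize = 8;  // tile edge in pixels (assumed)
    static constexpr int kTilesX = 4;    // tiles per row (assumed)
    static constexpr int kNumCores = 2;  // execution cores (assumed)

    // AssignTile: how does data map to a spatial tile?
    int AssignTile(const Point& p) const {
        return (p.y / kTileSize) * kTilesX + (p.x / kTileSize);
    }
    // Schedule: which core runs a given tile? Here, simple round-robin.
    int Schedule(int tileID) const { return tileID % kNumCores; }
    // Process: what do we compute for each item in a tile? Here, double it.
    int Process(const Point& p) const { return 2 * p.value; }
};

// Runs the three phases in order and returns, per core, the sum of outputs.
std::vector<int> runStage(const ToyStage& stage, const std::vector<Point>& in) {
    std::map<int, std::vector<Point>> bins;  // tileID -> items in that tile
    for (const Point& p : in) bins[stage.AssignTile(p)].push_back(p);

    std::vector<int> perCore(ToyStage::kNumCores, 0);
    for (const auto& entry : bins) {
        const int core = stage.Schedule(entry.first);
        for (const Point& p : entry.second) perCore[core] += stage.Process(p);
    }
    return perCore;
}
```

Because the three answers live in separate functions, a tool can change how tiles are scheduled without touching what is computed per tile, and vice versa.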

[Figure: Stage A, Stage B, and Stage C each consist of AssignTile, Schedule, and Process phases.]

[Figure: data flow through the phases of a stage. Labels: Input Scene, Input Primitives, Populated Bins, Execution Cores, Final Output. Legend: A = AssignTile (shown as AssignBin in the figure legend), S = Schedule, P = Process.]

Phases help identify optimization opportunities.

Stages that share all of the following can be fused into one:

• Identical tile size
• Identical data-to-tile mapping (identical AssignTile result)
• Identical tile-to-core mapping (identical Schedule result)

[Figure: in a pipeline of Stages A, B, C, and D, Stages B and C are fused into a single stage.]
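The fusion conditions above can be sketched as a simple predicate over per-stage tiling choices. This is only an illustration: `StageConfig`, the two policy enums, and `canFuse` are hypothetical names, and modelling "identical AssignTile / Schedule result" as equality of a policy tag is a simplifying assumption, not how pikoc is documented to decide.

```cpp
#include <cassert>

// Hypothetical tags summarising what a stage's AssignTile and Schedule do.
enum class AssignPolicy { ByScreenPosition, PassThrough };
enum class SchedulePolicy { LoadBalance, SameCoreAsProducer };

// Hypothetical summary of one stage's tiling choices.
struct StageConfig {
    int tileW;
    int tileH;
    AssignPolicy assign;
    SchedulePolicy schedule;
};

// Two stages are fusion candidates when tile size, data-to-tile mapping,
// and tile-to-core mapping all match.
bool canFuse(const StageConfig& a, const StageConfig& b) {
    return a.tileW == b.tileW && a.tileH == b.tileH &&
           a.assign == b.assign && a.schedule == b.schedule;
}
```

For instance, two stages that both use 8 x 8 tiles, the same assignment policy, and the same schedule policy would pass this check, while changing the tile size of either one would fail it.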

Phases help explore pipeline implementations

Example pipeline stages: Vertex Shade (VS), Geometry Shade (GS), Raster (Rst), Fragment Shade (FS), Composite (Cmp).

[Figure: four slides showing alternative arrangements of four instances each of VS, GS, Rst, FS, and Cmp.]

Evaluation

Piko pipelines are easy to express and customize

Pipeline | Stages shown
Triangle Raster | VS, Rast, FS, Setup, Comp
Stereo Raster | VS, Rast, FS, Setup, Comp, FS, Comp
Reyes | Split, Dice, Sample, Shade, Comp
Raster-Raytrace | VS, Rast, Trace, Setup, FS, Comp

Piko lets us explore implementation alternatives

[Figure: three plots, one per implementation alternative of a pipeline with VS, Setup, Rast, FS, and Comp stages, each marked against a Baseline. Scene: Fairy Forest. X-axis: shader complexity (# lights). Y-axis: relative frame time. Series: NVIDIA GPU, Multicore CPU.]

• No tiling, complete stage fusion
• Tiling with fusion
• Tiling with no fusion

Piko enables high-performance code generation

[Figure: rendering time (ms) of cudaraster [Laine and Karras 2011] and Piko Raster on the Fairy Forest, Buddha, Mecha, and Dragon scenes.]

Performance is within 3.3-5.5x of hand-optimized code.

[Figure: split performance (Mpatches / second) of Micropolis [Weber et al. 2015] and Piko Reyes.]

Split performance is within 30% of hand-optimized GPU Reyes.

Summary

Piko enables programmability and performance

[Figure: a pipeline graph of Stages A through F is mapped by Piko onto CPU and GPU targets.]

• High-level programmability
• High performance
• Flexibility

Our work is not done

Piko can be improved

• Utilization of shared local memory
• Support for dynamic scheduling of pipeline work

The search for a graphics abstraction is not over

• Do tiles have to be 2D, uniform, and one-config-per-stage?
• Are there other abstractions that enable high-level programmability and achieve high performance?

Acknowledgments

Discussions and advice: Tim Foley, Jonathan Ragan-Kelley, Aaron Lefohn, Matt Pharr, Mark Lacey, Kayvon Fatahalian, Bill Mark, Marco Salvi, Chuck Lingle, Jason Mak, Edmund Yan, Calina Copos, Mike Steffen, Alex Elkman

NVVM help: Vinod Grover, Sean Lee

Financial support: Intel Science and Technology Center (VC), NVIDIA Research Fellowship, Intel Ph.D. Fellowship, National Science Foundation Fellowship, NVIDIA, AMD, NSF, UC Lab Fees

Assets: AMD, Intel (Project Offset), Ingo Wald, Bay Raitt, Stanford

Thank you!

github.com/piko-dev/piko-public

Extra Slides

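// Set up the pipeline and run it once.
RasterPipe pipe;
pipe.allocate(...);
pipe.prepare();
pipe.run_single();

// Read back the rendered pixels.
unsigned* pixels = pipe.pikoScreen.getData();

// Display them with OpenGL.
glDrawPixels(screenW, screenH, GL_RGBA, GL_UNSIGNED_BYTE, pixels);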

Host Code is device independent.

Unmodified C++

A pipeline is a C++ class declaration.

class RasterPipe : public PikoPipe {
  VertexShaderStage vertexShader_;
  RasterStage raster_;
  PikoScreen pikoScreen_;
  ...

  RasterPipe() {
    pikoConnect(vertexShader_, raster_, 0, 0);
  }
  ...
};

Connections indicate pipeline structure.

Stages are instantiated as objects.
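One way to picture how connections define pipeline structure is a small directed graph in which each connection is an edge from producer to consumer. The sketch below is a stand-in and not the Piko implementation: `ToyPipe`, `addStage`, `connect`, and `order` are hypothetical names, and the two trailing integer arguments of `pikoConnect` are not modelled because the slides do not explain them.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical pipeline graph: stages are nodes, connections are edges.
struct ToyPipe {
    std::vector<std::string> names;           // stage index -> stage name
    std::vector<std::vector<int>> consumers;  // stage index -> downstream stages

    // Registers a stage and returns its index.
    int addStage(const std::string& name) {
        names.push_back(name);
        consumers.emplace_back();
        return static_cast<int>(names.size()) - 1;
    }

    // Records that `consumer` reads the output of `producer`.
    void connect(int producer, int consumer) {
        consumers[producer].push_back(consumer);
    }

    // Kahn's algorithm: returns stage names with producers before consumers.
    std::vector<std::string> order() const {
        std::vector<int> indegree(names.size(), 0);
        for (const std::vector<int>& outs : consumers)
            for (int c : outs) ++indegree[c];

        std::queue<int> ready;
        for (int i = 0; i < static_cast<int>(names.size()); ++i)
            if (indegree[i] == 0) ready.push(i);

        std::vector<std::string> result;
        while (!ready.empty()) {
            const int s = ready.front();
            ready.pop();
            result.push_back(names[s]);
            for (int c : consumers[s])
                if (--indegree[c] == 0) ready.push(c);
        }
        return result;
    }
};
```

The order in which stages are declared does not matter here; only the connections determine which stage runs before which.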

Each phase is a member function.

class RasterStage : public Stage<8, 8, 32, raster_stri, Pixel> {
  inline void AssignTile(raster_stri p) {
    ...
    this->assignToBin(p, binID);
    ...
  }
  inline void schedule(int binID) {
    this->specifySchedule(LOAD_BALANCE);
  }
  inline void process(raster_stri p) {
    ...
    this->emit(Pixel(pos, color), 0);
    ...
  }
};

A stage is a C++ class definition.

Built-in routines identify common scenarios.

Templates specify tiling configuration.
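The following sketch shows how template arguments can carry a tiling configuration at compile time. It assumes, without confirmation from the slides, that the first two arguments of `Stage<8, 8, 32, raster_stri, Pixel>` are the tile width and height, the third is a per-tile thread count, and the last two are the input and output types. `ToyStageBase` and `RasterLikeStage` are hypothetical and are not the Piko `Stage` template.

```cpp
#include <cassert>

// Hypothetical stand-in for a stage base class; NOT the Piko Stage template.
// Assumed meaning: tile width, tile height, threads per tile, input, output.
template <int TileW, int TileH, int ThreadsPerTile, typename In, typename Out>
struct ToyStageBase {
    static_assert(TileW > 0 && TileH > 0 && ThreadsPerTile > 0,
                  "tiling parameters must be positive");

    typedef In InputType;
    typedef Out OutputType;

    // Number of tiles needed to cover a screen dimension (rounded up).
    static constexpr int tilesAcross(int screenW) {
        return (screenW + TileW - 1) / TileW;
    }
    static constexpr int tilesDown(int screenH) {
        return (screenH + TileH - 1) / TileH;
    }
    // Pixels each thread covers if a tile is split evenly (rounded up).
    static constexpr int pixelsPerThread() {
        return (TileW * TileH + ThreadsPerTile - 1) / ThreadsPerTile;
    }
};

// Same numeric arguments as the slide's RasterStage, with placeholder types.
typedef ToyStageBase<8, 8, 32, int, int> RasterLikeStage;
```

Because the configuration is part of the type, a compiler can compare the tiling of two stages or size per-tile buffers without running the pipeline.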

pikoc implements the pipeline description.

[Figure: pikoc internals. Components: Pipeline, Stages, clang, pikoc frontend, Kernel plan, pikoc backend, clang, libNVVM, Host Interface, Pipe Implementation.]

Frontend walks the AST and performs high-level optimizations.

Backend uses LLVM to generate optimized device code.
