
© Copyright Khronos Group, 2013 - Page 1

OpenCL on Intel® Iris™ Graphics

SIGGRAPH 2013 July 2013

Presenter: Adam Lake

Content: Ben Ashbaugh, Arnon Peleg, others


Agenda

•Intel® Iris™ Graphics Architecture Overview

•Tools to help get the most from OpenCL on Intel CPU and GPU

•Additional Resources


Intel® Iris™ Graphics Architecture Overview

[Figure: Intel® Iris™ Graphics block diagram. Components shown: Ring Bus / LLC / Memory; Command Streamer (CS), Vertex Fetch (VF), Vertex Shader (VS), Hull Shader (HS), Tessellator, Domain Shader (DS), Geometry Shader (GS), Stream Out (SOL), Clip/Setup, Thread Dispatch; Video Front End (VFE), Video Quality Engine, Multi-Format CODEC, Blitter, Display. Slice 0 (Sub Slices 0 and 1) and Slice 1 (Sub Slices 2 and 3) each show L3$, Pixel Ops, Rasterizer / Depth, Render$, and Depth$. Each Sub Slice shows an L1 IC$, 3D Sampler, Media Sampler, Data Port, Tex$, and 10 EUs.]


Intel® Iris™ Graphics Architecture Overview

[Figure: Intel® Iris™ Graphics block diagram, repeated from the previous slide.]

Global Assets
• Command Streamer
• Thread Dispatch


Intel® Iris™ Graphics Architecture Overview

[Figure: Intel® Iris™ Graphics block diagram, repeated from the previous slide.]

Sub Slice
• Execution Units
• Samplers and Data Port
• Instruction and Texture Caches


Intel® Iris™ Graphics Architecture Overview

[Figure: Intel® Iris™ Graphics block diagram, repeated from the previous slide.]

Slice Common
• L3 Cache
• Shared Local Memory
• Barriers


Intel® Iris™ Graphics Architecture Building Blocks

• OpenCL* Kernels run on an Execution Unit (EU)
• Each EU is a Multi-Threaded SIMD Processor
• Up to 7 threads per EU
  - 128 x 8 x 32-bit registers per thread
• Up to 8, 16, or 32 OpenCL* work items per thread (compiler-controlled)
  - “SIMD8”, “SIMD16”, “SIMD32”
  - SIMD8: More Registers
  - SIMD16 and SIMD32: Better Efficiency

[Figure: one EU holding seven threads, Thread 0 through Thread 6.]
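The register trade-off between SIMD widths can be worked out from the numbers on this slide. The following is a minimal sketch in C, assuming the 128 x 8 x 32-bit register file of a thread is split evenly among the work items that thread carries; the helper names are hypothetical and not part of any Intel API.

```c
#include <stddef.h>

/* Register file per EU thread, from the slide: 128 registers x 8 lanes x 32 bits. */
enum { REGS_PER_THREAD = 128, LANES_PER_REG = 8, BYTES_PER_LANE = 4 };

/* Hypothetical helper: 32-bit register slots available to each work item
 * when one thread carries simd_width work items (8, 16, or 32).
 * Assumes an even split of the register file. */
static size_t dwords_per_work_item(size_t simd_width)
{
    return (size_t)REGS_PER_THREAD * LANES_PER_REG / simd_width;
}

/* Total register file size of one EU thread, in bytes. */
static size_t register_bytes_per_thread(void)
{
    return (size_t)REGS_PER_THREAD * LANES_PER_REG * BYTES_PER_LANE;
}
```

Under that assumption, SIMD8 leaves 128 slots per work item, SIMD16 leaves 64, and SIMD32 leaves 32.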


Intel® Iris™ Graphics Architecture Building Blocks

• OpenCL* Work Groups run on a Sub Slice
  - 10 EUs per Sub Slice
  - Texture Sampler (Images)
  - Data Port (Buffers)
  - Instruction and Texture Caches

OpenCL* Work Groups may run on multiple EU threads, on multiple EUs!

[Figure: one Sub Slice with an L1 IC$, 3D Sampler, Media Sampler, Data Port, Tex$, and 10 EUs.]


Intel® Iris™ Graphics Architecture Building Blocks

• Two Sub Slices make a Slice
• Shared Resources: “Slice Common”
  - L3 Cache + Shared Local Memory
  - Barriers
• Intel® Iris™ Graphics has Two Slices
• How many work items in flight at once?
  - 2 slices, each with 2 sub slices = 4 sub slices
  - 4 sub slices x 10 EUs/sub slice = 40 EUs
  - Up to 40 EUs x 7 threads/EU = 280 EU threads
  - Up to 280 EU threads x 32 work items/thread (SIMD32) = 8960 OpenCL* work items in flight!

[Figure: one Slice showing L3$, Pixel Ops, Rasterizer / Depth, Render$, Depth$, and two Sub Slices, each with an L1 IC$, 3D Sampler, Media Sampler, Data Port, Tex$, and 10 EUs.]
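The count of work items in flight is a straight product of the figures on this slide. A small sketch in C (the function name is hypothetical) makes each factor explicit:

```c
/* Hypothetical helper: upper bound on OpenCL work items resident at once.
 * Each parameter corresponds to one factor from the slide. */
static unsigned work_items_in_flight(unsigned slices,
                                     unsigned sub_slices_per_slice,
                                     unsigned eus_per_sub_slice,
                                     unsigned threads_per_eu,
                                     unsigned simd_width)
{
    unsigned eus = slices * sub_slices_per_slice * eus_per_sub_slice; /* 40 */
    unsigned eu_threads = eus * threads_per_eu;                       /* 280 */
    return eu_threads * simd_width;
}
```

With 2 slices, 2 sub slices per slice, 10 EUs, 7 threads, and SIMD32 this gives 8960; the same machine at SIMD8 holds 2240 work items.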


Intel® Iris™ Graphics Cache Hierarchy

[Figure: cache hierarchy from the EU out to DRAM. Labels: EU; Sampler L1 and Sampler L2 (images); L3 (buffers), 256KB/slice; LLC, 2-8MB/package (shared w/ CPU); EDRAM (non-inclusive victim cache), 128MB/package (Intel® Iris™ Pro 5200); DRAM.]


Tools and Additional Resources


Occupancy – Intel® VTune™ Amplifier XE 2013


Additional Resources

• Intel® SDK for OpenCL* Applications 2013
  - Offline kernel builder, Debugger for CPU in MSVC
• Intel® OpenCL* Optimization Guide
• Intel® VTune™ Amplifier XE 2013
• Intel® Graphics Performance Analyzers - OpenCL Tuning
• Intel Linux Graphics hardware Bspec: https://01.org/linuxgraphics/
• SIGGRAPH 2013:
  - Faster, Better Pixels on the Go and in the Cloud with OpenCL* on Intel® Architecture (Arnon Peleg)
  - Optimizing OpenCL* Applications for Intel® Iris™ Graphics (Ben Ashbaugh)
  - Backup of this deck contains more from Ben Ashbaugh’s excellent SIGGRAPH 2012 talk on optimizing for Intel® Iris™ Graphics!


Questions


Legal

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice.

All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Any code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.

Performance claims: Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.Intel.com/performance

Intel, Intel Inside, the Intel logo, Centrino, Intel Core, Intel Atom, Pentium, and Ultrabook are trademarks of Intel Corporation in the United States and other countries.


backup


Scheduling EUs for maximum occupancy


Occupancy

•Goal: Use All Execution Unit Resources
•This is harder than it sounds! Many factors to consider…
  - Launch Enough Work
    - More EU threads means better latency coverage to keep an EU active
    - One thread is sufficient to prevent an EU from going idle
    - Too few EU threads can result in an EU being stalled


Occupancy

•Goal: Use All Execution Unit Resources
  - Don’t Waste SIMD Lanes
    - Use an optimal Local Work Size
    - Good: Query for compiled SIMD size: CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
    - Occasionally Helpful: Compile for a specific local work size (8, 16, or 32): __attribute__((reqd_work_group_size(X, Y, Z)))
    - Best: Let the driver pick (Local Work Size == NULL)
      Ideal for kernels with no barriers or shared local memory
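To see why the local work size matters for SIMD lanes, here is a minimal sketch in C. It assumes the SIMD width is the value an application would obtain by querying CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE through clGetKernelWorkGroupInfo; the helper names are hypothetical, and the model ignores everything except lane counts.

```c
/* Number of EU threads needed for one work group (rounded up). */
static unsigned threads_per_work_group(unsigned local_size, unsigned simd_width)
{
    return (local_size + simd_width - 1) / simd_width;
}

/* Percentage of SIMD lanes doing useful work for one work group
 * (integer division, so the result is rounded down). */
static unsigned lane_utilization_percent(unsigned local_size, unsigned simd_width)
{
    unsigned lanes = threads_per_work_group(local_size, simd_width) * simd_width;
    return 100u * local_size / lanes;
}

/* Round a global size up to the next multiple of the preferred multiple. */
static unsigned round_up_to_multiple(unsigned global_size, unsigned multiple)
{
    return (global_size + multiple - 1) / multiple * multiple;
}
```

A local work size of 20 at SIMD16 needs two EU threads and fills only 20 of 32 lanes, while a local work size of 32 fills every lane. If the global size is padded with round_up_to_multiple, the kernel needs its own bounds check for the extra work items.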


Occupancy (continued…)

•Barriers
  - 16 barriers per sub slice
  - Can be a limiting factor for very small local work groups
•Shared Local Memory
  - 64KB shared local memory per sub slice
  - Can be a limiting factor for kernels that use lots of shared local memory
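The interaction of these limits can be sketched as a simple resource model in C. This is an illustration only: it assumes each work group that uses a barrier holds exactly one of the 16 barriers, takes the 10 EUs x 7 threads per sub slice from the architecture slides, and uses hypothetical names throughout.

```c
enum {
    EU_THREADS_PER_SUB_SLICE = 10 * 7,    /* 10 EUs x 7 threads per EU */
    BARRIERS_PER_SUB_SLICE   = 16,
    SLM_BYTES_PER_SUB_SLICE  = 64 * 1024
};

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* Hypothetical model: how many work groups can be resident on one sub slice.
 * threads_per_group must be at least 1. */
static unsigned work_groups_per_sub_slice(unsigned threads_per_group,
                                          int uses_barrier,
                                          unsigned slm_bytes_per_group)
{
    unsigned limit = EU_THREADS_PER_SUB_SLICE / threads_per_group;
    if (uses_barrier)
        limit = min_u(limit, BARRIERS_PER_SUB_SLICE);
    if (slm_bytes_per_group > 0)
        limit = min_u(limit, SLM_BYTES_PER_SUB_SLICE / slm_bytes_per_group);
    return limit;
}
```

In this model a kernel with a barrier and one EU thread per work group is capped at 16 resident groups, so only 16 of the 70 thread slots are used. A kernel that requests 16KB of shared local memory per group is capped at 4 groups regardless of its thread count.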


Optimizing OpenCL* Applications for Intel® Iris™ Graphics

Ben Ashbaugh (Intel)


Agenda

•Understanding Occupancy
  - How Intel® Iris™ Graphics executes OpenCL* Kernels


How Intel® Iris™ Graphics Runs OpenCL*


1. Divide Into Work Groups

2. Divide Each Work Group Into EU Threads

3. Launch EU Threads for the Work Group Onto a Sub Slice
  - Repeat for each Work Group
  - Must have enough room in the Sub Slice for all EU threads for the Work Group
  - Not enough room in any Sub Slice: EU threads must wait

[Figure: one Sub Slice with an L1 IC$, 3D Sampler, Media Sampler, Data Port, Tex$, and 10 EUs.]
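The three steps can be sketched as arithmetic in C. This is a simplified illustration under stated assumptions: the global size is a multiple of the local size, a sub slice offers 10 EUs x 7 threads, and only thread slots are considered (barriers and shared local memory are ignored). The type and function names are hypothetical.

```c
/* Result of planning one kernel launch. */
typedef struct {
    unsigned work_groups;          /* step 1: number of work groups */
    unsigned threads_per_group;    /* step 2: EU threads per work group */
    unsigned groups_per_sub_slice; /* step 3: groups that fit on a sub slice at once */
} launch_plan;

static launch_plan plan_launch(unsigned global_size, unsigned local_size,
                               unsigned simd_width)
{
    launch_plan p;
    /* Step 1: assumes global_size is a multiple of local_size. */
    p.work_groups = global_size / local_size;
    /* Step 2: each EU thread carries simd_width work items. */
    p.threads_per_group = (local_size + simd_width - 1) / simd_width;
    /* Step 3: all EU threads of a group must fit in the same sub slice. */
    p.groups_per_sub_slice = (10u * 7u) / p.threads_per_group;
    return p;
}
```

For a global size of 4096, a local size of 256, and SIMD16, this gives 16 work groups of 16 EU threads each, and 4 such groups fit on one sub slice; the remaining 6 thread slots cannot hold a fifth group.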



Occupancy

•Goal: Use All Machine Resources
•This is harder than it sounds! Many factors to consider…
  1. Launch Enough Work
    - One thread is sufficient to prevent an EU from going idle
    - Too few EU threads can result in an EU being stalled
    - More EU threads: better latency coverage keeps an EU active


Occupancy – Intel® VTune™ Amplifier XE 2013


Agenda

•Memory Matters
  - Host to Device


Optimizing Host to Device Transfers

•Host (CPU) and Device (GPU) share the same physical memory
•For OpenCL* buffers:
  - No transfer needed (zero copy)!
  - Allocate system memory aligned to a cache line (64 bytes)
  - Create buffer with system memory pointer and CL_MEM_USE_HOST_PTR
  - Use clEnqueueMapBuffer() to access data
•For OpenCL* images:
  - Currently tiled in device memory: transfer required
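The allocation step for buffers might look like the following sketch in C. It assumes a C11 runtime that provides aligned_alloc (not every toolchain does), and the helper names are hypothetical. The OpenCL calls are shown only as a comment so the sketch builds without an OpenCL SDK; ctx, queue, host_ptr, rounded_size, and err in that comment are placeholder names.

```c
#include <stddef.h>
#include <stdlib.h>

#define HOST_CACHE_LINE_BYTES 64u

/* Round a size up to a whole number of cache lines. C11 aligned_alloc
 * requires the size to be a multiple of the alignment. */
static size_t round_up_to_cache_line(size_t size)
{
    return (size + HOST_CACHE_LINE_BYTES - 1) / HOST_CACHE_LINE_BYTES
           * HOST_CACHE_LINE_BYTES;
}

/* Hypothetical helper: allocate host memory aligned to a 64-byte cache line.
 * Returns NULL on failure; release with free(). */
static void *alloc_cache_line_aligned(size_t size)
{
    return aligned_alloc(HOST_CACHE_LINE_BYTES, round_up_to_cache_line(size));
}

/* The pointer would then be handed to OpenCL roughly as follows:
 *
 *   cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
 *                               rounded_size, host_ptr, &err);
 *   void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE,
 *                                CL_MAP_READ | CL_MAP_WRITE,
 *                                0, rounded_size, 0, NULL, NULL, &err);
 */
```

Whether a given driver actually avoids the copy can depend on further conditions not listed on this slide, so treat the alignment as necessary rather than sufficient.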


Operating on Buffers as Images

•Intel® Iris™ Graphics supports cl_khr_image2d_from_buffer
  - New OpenCL* 1.2 Extension
  - Treat data as a buffer for some kernels, as an image for others
  - Some restrictions for zero copy: buffer size, image pitch

[Figure: the same data (0x123 0x456 0x789) shown as a Buffer and as an Image.]


Interop with Graphics and Media APIs / SDKs

•Intel® Iris™ Graphics supports many Graphics and Media interop extensions:
  - cl_khr_dx9_media_sharing (includes DXVA for Intel® Media SDK)
  - cl_khr_d3d10_sharing
  - cl_khr_d3d11_sharing
  - cl_khr_gl_sharing
  - cl_khr_gl_depth_images
  - cl_khr_gl_event
  - cl_khr_gl_msaa_sharing

Use Graphics API / SDK assets in OpenCL* with no copies!


Agenda

•Memory Matters
  - Device Access


Intel® Iris™ Graphics Cache Hierarchy

[Figure: cache hierarchy from the EU out to DRAM. Labels: EU; Sampler L1 and Sampler L2 (images); L3 (buffers), 256KB/slice; LLC, 2-8MB/package (shared w/ CPU); EDRAM (non-inclusive victim cache), 128MB/package (Intel® Iris™ Pro 5200); DRAM.]


__global and __constant Memory

• Global Memory Accesses go through the L3 Cache
• L3 Cache Line is 64 bytes
• EU thread accesses to the same L3 Cache Line are collapsed
  - Order of data within cache line does not matter
  - Bandwidth determined by number of cache lines accessed
  - Maximum Bandwidth: 64 bytes / clock / sub slice
• Good: Load at least 32-bits of data at a time, starting from a 32-bit aligned address
• Best: Load 4 x 32-bits of data at a time, starting from a cache line aligned address
  - Loading more than 4 x 32-bits of data is not beneficial


Global and Constant Memory Access Examples

1. x = data[ get_global_id(0) ]
  - One cache line, full bandwidth
2. x = data[ n - get_global_id(0) ]
  - Reverse order, full bandwidth
3. x = data[ get_global_id(0) + 1 ]
  - Offset, two cache lines, half bandwidth
4. x = data[ get_global_id(0) * 2 ]
  - Strided, half bandwidth
5. x = data[ get_global_id(0) * 16 ]
  - Very strided, worst-case

[Figures: for each example, global IDs 0 to 15 mapped onto Cache Line n and its neighboring cache lines.]
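The five access patterns above can be checked by counting how many 64-byte cache lines 16 work items touch. The sketch below is plain C rather than OpenCL C, and it assumes 4-byte elements, a cache-line-aligned array base, and non-negative indices; the function name is hypothetical. Each pattern is written as index = gid * mul + add.

```c
enum { L3_LINE_BYTES = 64, ELEM_BYTES = 4, WORK_ITEMS = 16 };

/* Count the distinct cache lines touched when work items 0..15 each load
 * data[gid * mul + add]. All resulting indices must be non-negative. */
static int cache_lines_for_pattern(long mul, long add)
{
    long lines[WORK_ITEMS];
    int count = 0;
    for (long gid = 0; gid < WORK_ITEMS; ++gid) {
        long line = (gid * mul + add) * ELEM_BYTES / L3_LINE_BYTES;
        int seen = 0;
        for (int j = 0; j < count; ++j) {
            if (lines[j] == line) { seen = 1; break; }
        }
        if (!seen)
            lines[count++] = line;
    }
    return count;
}
```

Under these assumptions, examples 1 and 2 (with n = 15) touch one cache line, examples 3 and 4 touch two, and example 5 touches sixteen, which matches the full, half, and worst-case bandwidth labels.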


__local Memory Accesses

• Local Memory Accesses also go through the L3 Cache!
• Key Difference: Local Memory is Banked
  - Banked at a DWORD granularity, 16 banks
  - Bandwidth determined by number of bank conflicts
  - Maximum Bandwidth: Still 64 bytes / clock / sub slice
• Supports more access patterns with full bandwidth than Global Memory
  - No bank conflicts: full bandwidth
  - Reading from the same address in a bank: full bandwidth


Local Memory Access Examples

1. x = data[ get_global_id(0) + 1 ]
  - Unique banks, full bandwidth
2. x = data[ get_global_id(0) & ~1 ]
  - Same address read, full bandwidth
3. x = data[ get_global_id(0) * 2 ]
  - Strided, half bandwidth
4. x = data[ get_global_id(0) * 16 ]
  - Very strided, worst-case
5. x = data[ get_global_id(0) * 17 ]
  - Full bandwidth!

[Figures: for each example, global IDs mapped onto the 16 banks (Bank 0 to Bank 15).]
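The bank behavior of these five patterns can be reproduced with a small model in C. It is a simplification, assuming 16 work items accessed together, DWORD elements, bank = index mod 16, and no penalty when several work items read the same address; all names are hypothetical.

```c
enum { SLM_BANKS = 16, SIMD_LANES = 16 };

typedef unsigned (*index_fn)(unsigned gid);

/* The five index expressions from the examples. */
static unsigned idx_offset(unsigned gid)   { return gid + 1; }
static unsigned idx_pairs(unsigned gid)    { return gid & ~1u; }
static unsigned idx_stride2(unsigned gid)  { return gid * 2; }
static unsigned idx_stride16(unsigned gid) { return gid * 16; }
static unsigned idx_stride17(unsigned gid) { return gid * 17; }

/* Largest number of distinct DWORD addresses that fall into a single bank.
 * 1 means conflict-free; N suggests roughly 1/N of peak bandwidth. */
static int worst_bank_conflict(index_fn f)
{
    int worst = 0;
    for (unsigned bank = 0; bank < SLM_BANKS; ++bank) {
        unsigned seen[SIMD_LANES];
        int distinct = 0;
        for (unsigned gid = 0; gid < SIMD_LANES; ++gid) {
            unsigned idx = f(gid);
            if (idx % SLM_BANKS != bank)
                continue;
            int dup = 0;
            for (int j = 0; j < distinct; ++j) {
                if (seen[j] == idx) { dup = 1; break; }
            }
            if (!dup)
                seen[distinct++] = idx;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

In this model a stride of 16 sends every access to bank 0, while a stride of 17 spreads the same 16 accesses across all 16 banks, which is why padding a row by one element restores full bandwidth.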


__private Memory

• Compiler can usually allocate Private Memory in the Register File
  - Even if Private Memory is dynamically indexed
  - Good Performance
• Fallback: Private Memory allocated in Global Memory
  - Accesses are very strided
  - Bad Performance

[Figure: private arrays __private int a[100], __private int b[100], and __private int c[200], each laid out per work item (Work Item 0, Work Item 1, ..., Work Item n) for EU Threads n-1, n, and n+1.]


Agenda

•Compute Characteristics
  - Maximizing GFlops


ISA

•“SIMT” ISA with Predication and Branching
•“Divergent” code executes both branches: Reduced SIMD Efficiency

this();
if ( x )
    that();
else
    another();
finish();

[Figures: SIMD lane versus time for two cases. Example: “x” sometimes true. Example: “x” never true.]
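The cost of divergence in the snippet above can be illustrated with a toy cost model in C. It assumes a branch body is executed by the whole EU thread whenever at least one lane takes it; the cycle counts are made-up inputs, and the type and function names are hypothetical.

```c
/* Hypothetical per-call costs, in cycles, for the four calls in the snippet. */
typedef struct {
    unsigned t_this, t_that, t_another, t_finish;
} branch_costs;

/* Cycles for one EU thread to run
 *   this(); if (x) that(); else another(); finish();
 * where x[i] is the condition seen by SIMD lane i. */
static unsigned thread_cycles(const int *x, int lanes, branch_costs c)
{
    int any_true = 0, any_false = 0;
    for (int i = 0; i < lanes; ++i) {
        if (x[i])
            any_true = 1;
        else
            any_false = 1;
    }
    return c.t_this
         + (any_true ? c.t_that : 0)      /* executed if any lane takes "if" */
         + (any_false ? c.t_another : 0)  /* executed if any lane takes "else" */
         + c.t_finish;
}
```

With costs of 1, 4, 4, and 1 cycles, a thread where “x” is sometimes true pays for both branches (10 cycles), while a thread where “x” is never true skips that() entirely (6 cycles).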


Compute GFlops

•EUs have 2 x 4-wide vector ALUs
•Second ALU has limitations:
  - Subset of instructions: add, mov, mad, mul, cmp
  - Instruction must come from another EU thread
  - Only float operands!
•Peak GFlops: #EUs x ( 2 x 4-wide ALUs ) x ( MUL + ADD ) x Clock Rate
  - For Intel® Iris™ Pro 5200: 40 x 8 x 2 x 1.3 = 832 GFlops!
  - Add Intel® Core™ Host Processor: >1TFlop!
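The peak figure is a direct product of the terms in the formula. A short sketch in C (the function name is hypothetical) spells out each factor:

```c
/* Peak single-precision GFlops, following the formula on the slide:
 * #EUs x (ALUs per EU x ALU width) x flops per MAD x clock rate in GHz. */
static double peak_gflops(unsigned eus, unsigned alus_per_eu, unsigned alu_width,
                          unsigned flops_per_mad, double clock_ghz)
{
    return (double)eus * alus_per_eu * alu_width * flops_per_mad * clock_ghz;
}
```

Calling peak_gflops(40, 2, 4, 2, 1.3) reproduces the 832 GFlops quoted for Intel® Iris™ Pro 5200. Reaching it requires both ALUs to issue a mad every clock, which is what the co-issue restrictions above constrain.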


Maximizing Compute Performance

•Use mad() / fma(): Either explicitly with built-ins, or via -cl-mad-enable
•Use floats wherever possible to maximize co-issue
•Avoid long and size_t data types
  - Prefer float over int, if possible
  - Using short data types may improve performance
•Trade accuracy for speed: “native” built-ins, -cl-fast-relaxed-math
  - Often good enough for graphics


Agenda

Summary / Questions


Summary

•Maximize Occupancy
  - Choose a Good Local Work Size
  - Or, Let the Driver Choose (Local Work Size == NULL)
•Avoid Host-to-Device Transfers
  - Create Buffers with CL_MEM_USE_HOST_PTR
•Access Device Memory Efficiently
  - Minimize Cache Lines for __global Memory
  - Minimize Bank Conflicts for __local Memory
•Maximize Compute
  - Avoid Divergent Branches
  - Use mad / fma and float Data When Possible


Questions / Acknowledgements

•This presentation would not have been possible without material and review comments from many people – Thank you!
•Murali Sundaresan, Sushma Rao, Aaron Kunze, Tom Craver, Brijender Bharti, Rami Jiossy, Michal Mrozek, Jay Rao, Pavan Lanka, Adam Lake, Arnon Peleg, Raun Krisch, Berna Adalier