[Osxdev]metal
Uploaded by naver-d2
![Page 2: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/2.jpg)
WWDC 2014: recap
• Swift
• Yosemite
• Metal
![Page 3: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/3.jpg)
I’m so happy that I was too lazy to learn Objective-C
![Page 4: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/4.jpg)
Game Industry Trend: maybe or not
• C++: OOP, Design Pattern, TDD
• C: FP / PP / DOP, Fast Iteration, Immutability
![Page 5: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/5.jpg)
Game Industry Trend: maybe or not
• C++ / Objective-C: OOP, Design Pattern, TDD
• C / Swift: FP / PP / DOP, Fast Iteration, Immutability
![Page 6: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/6.jpg)
Explaining Metal: seen season one before?
![Page 7: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/7.jpg)
Explaining Metal: season one, BAAM!
![Page 8: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/8.jpg)
A word from the boss
http://www.bloter.net/archives/195819
![Page 9: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/9.jpg)
This talk
• No API in detail
• No code (of my own)
• No demo
![Page 10: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/10.jpg)
CPU vs GPU
[Figure: block diagrams of a CPU and a GPU, showing Control, Cache, ALU, and DRAM blocks]
Excerpt from ACM Queue, March/April 2008, p. 23 (www.acmqueue.com):
the duration of processing for a single frame, different stages will dominate overall execution, often resulting in bandwidth- and compute-intensive phases of execution. Maintaining an efficient mapping of the graphics pipeline to a GPU’s resources in the face of this variability is a significant challenge, as it requires processing and on-chip storage resources to be dynamically reallocated to pipeline stages, depending on current load.
Mixture of predictable and unpredictable data access. The graphics pipeline rigidly defines inter-stage data flows using streams of entities. This predictability presents opportunities for aggregate prefetching of stream data records and highly specialized hardware management of on-chip storage resources. In contrast, buffer and texture accesses performed by shaders are fine-grained memory operations on dynamically computed addresses, making prefetch difficult. As both forms of data access are critical to maintaining high throughput, shader programming models explicitly differentiate stream from buffer/texture memory accesses, permitting specialized hardware solutions for both types of accesses.
Opportunities for instruction stream sharing. While the shader programming model permits each shader invocation to follow a unique stream of control, in practice, shader execution on nearby stream elements often results in the same dynamic control-flow decisions. As a result, multiple shader invocations can likely share an instruction stream. Although GPUs must accommodate situations where this is not the case, instruction stream sharing across multiple shader invocations is a key optimization in the design of GPU processing cores and is accounted for in algorithms for pipeline scheduling.
A large fraction of a GPU’s resources exist within programmable processing cores responsible for executing shader functions. While substantial implementation differences exist across vendors and product lines, all modern GPUs maintain high efficiency through the use of multi-core designs that employ both hardware multi-threading and SIMD (single instruction, multiple data) processing. As shown in Table 1, these throughput-computing techniques are not unique to GPUs (top two rows). In comparison with CPUs, however, GPU designs push these ideas to extreme scales.
Multicore + SIMD Processing = Lots of ALUs. A thread of control is realized by a stream of processor instructions that execute within a processor-managed environment, called an execution (or thread) context. This context consists of states such as a program counter, a stack pointer, general-purpose registers, and virtual memory mappings. A multicore processor replicates processing resources (both ALUs and execution contexts) and organizes them into independent cores. When an application features multiple threads of control, multicore architectures provide increased throughput by executing these instruction streams on each core in parallel. For example, an Intel Core 2 Quad contains four cores and can execute four instruction streams simultaneously. As significant parallelism exists across shader invocations, GPU designs easily push core counts higher. High-end models contain up to 16 cores per chip.
Even higher performance is possible by populating each core with multiple floating-point ALUs. This is done efficiently with SIMD processing, which uses each ALU to perform the same operation on a different piece of data. The most common implementation of SIMD processing is via explicit short-vector instructions, similar to those provided by the x86 SSE or PowerPC Altivec ISA extensions. These extensions provide a SIMD width of four, with instructions that control the operation of four ALUs. Alternative implementations, such as NVIDIA’s 8-series architecture, perform SIMD execution by implicitly shar-
TABLE 1

| Type | Processor | Cores/Chip | ALUs/Core³ | SIMD width | Max T⁴ |
|------|-----------|------------|------------|------------|--------|
| GPUs | AMD Radeon HD 2900 | 4 | 80 | 64 | 48 |
| GPUs | NVIDIA GeForce 8800 | 16 | 8 | 32 | 96 |
| CPUs | Intel Core 2 Quad¹ | 4 | 8 | 4 | 1 |
| CPUs | STI Cell BE² | 8 | 4 | 4 | 1 |
| CPUs | Sun UltraSPARC T2 | 8 | 1 | 1 | 4 |

¹ SSE processing only, does not account for x86 FPU.
² Stream processing (SPE) cores only, does not account for PPU cores.
³ 32-bit, floating point (all ALUs are multiply-add except the Intel Core 2 Quad).
⁴ The ratio of core thread contexts to simultaneously executable threads. We use the ratio T (rather than the total number of per-core thread contexts) to describe the extent to which processor cores automatically hide thread stalls via hardware multithreading.
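The excerpt's heading "Multicore + SIMD Processing = Lots of ALUs" can be checked with simple arithmetic on Table 1. The C sketch below multiplies cores per chip by ALUs per core. The struct and function names are made up for illustration; the figures are copied from the table.

```c
#include <assert.h>
#include <stddef.h>

/* One row of Table 1 (values copied from the table; names are illustrative). */
struct processor {
    const char *name;
    int cores_per_chip;
    int alus_per_core;
};

static const struct processor table1[] = {
    { "AMD Radeon HD 2900",   4, 80 },
    { "NVIDIA GeForce 8800", 16,  8 },
    { "Intel Core 2 Quad",    4,  8 },
    { "STI Cell BE",          8,  4 },
    { "Sun UltraSPARC T2",    8,  1 },
};

/* Total 32-bit floating-point ALUs on one chip: cores x ALUs per core. */
int total_alus(const struct processor *p)
{
    assert(p != NULL);
    return p->cores_per_chip * p->alus_per_core;
}
```

By this count the Radeon HD 2900 has 320 ALUs and the GeForce 8800 has 128, against 32 for the Core 2 Quad.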
![Page 11: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/11.jpg)
Apple A7
http://www.anandtech.com/show/8116/some-thoughts-on-apples-metal-api
![Page 12: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/12.jpg)
Why should we use a driver?
![Page 13: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/13.jpg)
Why should we use a driver?
• GPU runs asynchronously
• Different address space
• Different ISA
• Display is updated frame by frame
![Page 14: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/14.jpg)
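Painting a picture
• Spread out the drawing paper
• (Think about what to paint)
• Choose a brush and paints
• Paint with the brush
• (Crumple it up and throw it away, or hang it up)
• Spread out a new sheet of paper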
![Page 15: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/15.jpg)
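Painting a picture / Graphics App.
• Spread out the drawing paper / Framebuffer setup
• (Think about what to paint) / Data setup
• Choose a brush and paints / State setup
• Paint with the brush / Draw call
• (Crumple it up and throw it away, or hang it up) / Update a frame
• Spread out a new sheet of paper / Framebuffer clear
The graphics driver provides an API for every step of this process.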
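To make the painting analogy concrete, here is a minimal C sketch of the six steps running against a toy framebuffer held in ordinary CPU memory. Every name in it (`framebuffer`, `draw_hline`, `render_one_frame`, and so on) is hypothetical and belongs to no real graphics API; a real driver performs these steps on GPU-owned memory.

```c
#include <assert.h>
#include <string.h>

#define FB_WIDTH  8
#define FB_HEIGHT 4

/* Toy framebuffer in CPU memory; stands in for the "drawing paper". */
struct framebuffer {
    unsigned char pixels[FB_HEIGHT][FB_WIDTH];
    int frames_presented;
};

/* "Brush and paints": the state the next draw call will use. */
struct draw_state {
    unsigned char color;
};

/* Framebuffer setup: spread out the paper. */
void fb_setup(struct framebuffer *fb)
{
    assert(fb != NULL);
    memset(fb->pixels, 0, sizeof fb->pixels);
    fb->frames_presented = 0;
}

/* Framebuffer clear: take a new sheet. */
void fb_clear(struct framebuffer *fb)
{
    memset(fb->pixels, 0, sizeof fb->pixels);
}

/* Draw call: paint a horizontal line from x0 to x1 (inclusive) on one row.
   Pixels outside the framebuffer are skipped. Returns the pixels written. */
int draw_hline(struct framebuffer *fb, const struct draw_state *st,
               int row, int x0, int x1)
{
    int written = 0;
    if (row < 0 || row >= FB_HEIGHT)
        return 0;
    for (int x = x0; x <= x1; x++) {
        if (x < 0 || x >= FB_WIDTH)
            continue;
        fb->pixels[row][x] = st->color;
        written++;
    }
    return written;
}

/* Update a frame: hand the finished picture over (here it is only counted). */
void fb_present(struct framebuffer *fb)
{
    fb->frames_presented++;
}

/* One pass through the steps, on a framebuffer that fb_setup() prepared. */
int render_one_frame(struct framebuffer *fb)
{
    int row = 1;                                  /* data setup */
    struct draw_state st = { .color = 255 };      /* state setup */
    int written = draw_hline(fb, &st, row, 2, 5); /* draw call */
    fb_present(fb);                               /* update a frame */
    fb_clear(fb);                                 /* framebuffer clear */
    return written;
}
```

Calling `fb_setup` once and then `render_one_frame` in a loop mirrors the cycle on the slide: each pass paints, presents, and starts over on a clean sheet.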
![Page 16: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/16.jpg)
Layered structure of a graphics driver
API Interface
State Management
Command Queue Management
I/O Controller
Shader Compiler
![Page 17: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/17.jpg)
What the graphics driver does: why is it expensive?
• State validation
  ■ Confirming API usage is valid
  ■ Encoding API state to hardware state
• Shader compilation
  ■ Run-time generation of shader machine code
  ■ Interactions between state and shaders
• Sending work to GPU
  ■ Managing resource residency
  ■ Batching commands
![Page 18: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/18.jpg)
State validation: OpenGL
void glTexImage2D(GLenum target, GLint level, GLint internalFormat,
                  GLsizei width, GLsizei height, GLint border,
                  GLenum format, GLenum type, const GLvoid *data);
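A call with this many parameters gives the driver a lot to check each time it is made. The C sketch below shows the kind of argument validation a driver might run. It is an assumption-laden illustration, not OpenGL's actual rule set: the `toy_` enums, the size limit, and the specific checks are all invented for this example.

```c
/* Hypothetical enums for this sketch; these are NOT the real OpenGL headers. */
enum toy_error {
    TOY_NO_ERROR,
    TOY_INVALID_ENUM,
    TOY_INVALID_VALUE,
    TOY_INVALID_OPERATION
};
enum toy_target { TOY_TEXTURE_2D = 1 };
enum toy_format { TOY_RGB = 1, TOY_RGBA };
enum toy_type   { TOY_UNSIGNED_BYTE = 1 };

/* Assumed upper bound on texture dimensions for this toy driver. */
#define TOY_MAX_TEXTURE_SIZE 4096

/* Checks every argument before any data would be touched.
   The rules below are illustrative assumptions. */
enum toy_error toy_validate_tex_image_2d(int target, int level,
                                         int internal_format,
                                         int width, int height, int border,
                                         int format, int type)
{
    if (target != TOY_TEXTURE_2D)
        return TOY_INVALID_ENUM;
    if (format != TOY_RGB && format != TOY_RGBA)
        return TOY_INVALID_ENUM;
    if (type != TOY_UNSIGNED_BYTE)
        return TOY_INVALID_ENUM;
    if (level < 0)
        return TOY_INVALID_VALUE;
    if (width < 0 || height < 0)
        return TOY_INVALID_VALUE;
    if (width > TOY_MAX_TEXTURE_SIZE || height > TOY_MAX_TEXTURE_SIZE)
        return TOY_INVALID_VALUE;
    if (border != 0)
        return TOY_INVALID_VALUE;
    if (internal_format != format)
        return TOY_INVALID_OPERATION;
    return TOY_NO_ERROR;
}
```

Even in this toy version, a single upload goes through eight checks before any pixel is copied, and the cost is paid on every call.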
![Page 19: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/19.jpg)
Shader Compilation: are you kidding?
• No standard for pre-built shader
• No standard for shader binary format
int Init(ESContext *esContext)
{
    UserData *userData = esContext->userData;
    GLbyte vShaderStr[] =
        "attribute vec4 vPosition; \n"
        "void main() \n"
        "{ \n"
        " gl_Position = vPosition; \n"
        "} \n";

    GLbyte fShaderStr[] =
        "precision mediump float; \n"
        "void main() \n"
        "{ \n"
        " gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); \n"
        "} \n";

    GLuint vertexShader;
    GLuint fragmentShader;
    GLuint programObject;
    GLint linked;
![Page 20: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/20.jpg)
Shader: shading (陰影)
• A shader paints objects darker
courtesy of 西川善司
![Page 21: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/21.jpg)
Sending work to GPU: copy and paste
• Batching commands and committing
• Transferring data and texture
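"Batching commands and committing" can be pictured as filling a buffer on the CPU and handing it over in one go. The C sketch below is a toy model under that assumption. `command_buffer`, `toy_gpu`, and the `cb_` functions are hypothetical names; the "GPU" is just a pair of counters.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMMANDS 64

enum command_kind { CMD_SET_STATE, CMD_DRAW };

struct command {
    enum command_kind kind;
    int argument;
};

/* Commands accumulate here, in CPU memory, until they are committed. */
struct command_buffer {
    struct command commands[MAX_COMMANDS];
    int count;
};

/* Stand-in for the GPU side: counts submissions and commands received. */
struct toy_gpu {
    int submissions;
    int commands_executed;
};

void cb_reset(struct command_buffer *cb)
{
    assert(cb != NULL);
    cb->count = 0;
}

/* Encoding only writes into the buffer; nothing is sent yet.
   Returns 1 on success, 0 when the buffer is full. */
int cb_encode(struct command_buffer *cb, enum command_kind kind, int argument)
{
    if (cb->count >= MAX_COMMANDS)
        return 0;
    cb->commands[cb->count].kind = kind;
    cb->commands[cb->count].argument = argument;
    cb->count++;
    return 1;
}

/* Commit hands the whole batch over as one submission, then empties the
   buffer. An empty buffer is not submitted at all. */
void cb_commit(struct command_buffer *cb, struct toy_gpu *gpu)
{
    if (cb->count == 0)
        return;
    gpu->submissions++;
    gpu->commands_executed += cb->count;
    cb->count = 0;
}
```

Ten encoded draws followed by one `cb_commit` reach the toy GPU as a single submission, which is the point of batching: the per-submission cost is paid once rather than once per draw.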
![Page 22: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/22.jpg)
Metal: design target
• Low CPU overhead
• More predictable performance
• Better programmability
![Page 23: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/23.jpg)
Metal: key ideas
• Create and validate state up-front
• Shader can be compiled offline
• Enable versatile multi-threading
• Shared memory for CPU & GPU
• Handle synchronisation explicitly
• Tile-based deferred rendering
• C++11 based language
• No legacy baggage
• Compute shader
But, A7 only - What the x
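The first key idea, creating and validating state up-front, can be sketched in plain C by counting how often validation runs. This is a toy model and not Metal's API: `render_state`, `pipeline`, and the functions below are hypothetical, and the "expensive" validation is a trivial check plus a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the (pretend-expensive) validation runs. */
int validation_count = 0;

struct render_state {
    int blend_enabled;
    int depth_test_enabled;
    int shader_id;
};

/* Stand-in for costly driver work: checks the state and counts the call. */
static int validate_state(const struct render_state *s)
{
    assert(s != NULL);
    validation_count++;
    return s->shader_id > 0;
}

/* Late validation: every draw call re-validates the current state. */
int draw_with_late_validation(const struct render_state *s, int draw_calls)
{
    int drawn = 0;
    for (int i = 0; i < draw_calls; i++) {
        if (validate_state(s))
            drawn++;
    }
    return drawn;
}

/* Up-front validation: the state is checked once, when the object is built,
   and is treated as immutable afterwards. */
struct pipeline {
    struct render_state state;
    int valid;
};

struct pipeline pipeline_create(struct render_state s)
{
    struct pipeline p;
    p.state = s;
    p.valid = validate_state(&s);
    return p;
}

/* Drawing with a pre-validated pipeline does no validation at all. */
int draw_with_pipeline(const struct pipeline *p, int draw_calls)
{
    return p->valid ? draw_calls : 0;
}
```

With 1,000 draw calls, the late-validation path runs the check 1,000 times while the pipeline path runs it once, at creation. That difference is the CPU overhead the slide refers to.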
![Page 24: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/24.jpg)
Multi-threading
![Page 25: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/25.jpg)
Code comparison: Metal vs OpenGL ES
![Page 26: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/26.jpg)
So what? Low CPU overhead enables:
• more draw calls
• more objects
• better physics
• better AI
• more complex logic
• lower battery usage
![Page 27: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/27.jpg)
How do I start? Use an engine or forget it
• Unity 5 (next year) - free / $4,500
• Unreal 4 (maybe this year) - $19/month
• Cocos2D - free
• Xcode template
![Page 28: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/28.jpg)
Proprietary API
• Apple is a promoter of Khronos Group
• OpenCL story
• Kicking away the ladder now that the game has changed?
  ■ But Google is no fool (Expansion Pack)
![Page 29: [Osxdev]metal](https://reader033.fdocuments.us/reader033/viewer/2022060106/54b731ae4a795912438b459e/html5/thumbnails/29.jpg)
Conclusion: fine even if you never learn it
• Low CPU overhead
• Can do something more
• A7 only (either you can't, or it's a hassle)
• Game-changer? Maybe or not