Evolution of the Graphical Processing Unit

A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science with a major in Computer Science.

Thomas Scott Crow

February 3, 2005

Acknowledgements

I would like to thank Dr. Harris for his considerable patience and help.

I would like to thank my committee members, Dr. Egbert and Dr. Mensing for their valuable time.

Overview

Introduction
“Computer Graphics” Milestones
The Modern GPU
General Purpose GPU Computing
Future of the GPU

Introduction

Definition: Used primarily for 3D applications, a graphical processing unit (GPU) is a single chip processor that creates lighting effects and transforms objects every time a 3D scene is redrawn. These are mathematically intensive tasks, which otherwise would put quite a strain on the CPU.

History: Graphics computation has evolved from software running on the main CPU, to specialized hardware that handles certain types of graphics computation while the CPU performs the rest, to a fully implemented 3D graphics pipeline run entirely on a GPU. This history closely follows the “Wheel of Reincarnation” first described by Myer and Sutherland in a 1968 ACM paper.

Sutherland and Myer’s “Wheel of Reincarnation”

“Computer Graphics” Milestones

MIT’s Whirlwind Project – 1944

Significance: First computer built specifically for interactive, real-time control, which displayed real-time text and graphics on a video terminal.


“Magnetic” Core Memory (RAM) – 1951

Significance: Miniaturization, speed, and non-volatility.


SAGE (Semi-Automatic Ground Environment) – 1958

Significance: Introduced real-time software, showed feasibility of CRTs in interactive computing, and the light-pen as an input device.


SAGE (Semi-Automatic Ground Environment), 1958, with light-pen


MIT’s TX-0 (Transistorized Experimental Computer Zero) – 1956

Significance: First real-time, programmable, general-purpose computer made entirely from transistors and first ever operating system.


MIT’s TX-2 – 1959

Significance: Specialized I/O circuitry allowed for “online” computing, which made possible the creation of Sutherland’s “Sketchpad”.


Ivan Sutherland’s Sketchpad – 1963

Significance: Precursor of the direct manipulation computer graphic interface of today. Ancestor of Computer Aided Design (CAD) and the modern graphical user interface.


Digital Equipment Corporation (DEC) and the Minicomputer – 1957

Significance: Drastic shift away from the mainframe “time-sharing” model of computing. The VAX supermini would become the workhorse for the CAD industry.


Computer Aided Design (CAD) Systems

Significance: Furthered the concept of Sketchpad by allowing the creation, rotation, and manipulation of 3D models.

General Motors DAC-1


Information Displays IDIIOM


The PC Revolution

Significance: Made the computing power of the early mainframes and minicomputers available to consumers.

Intel 4004, the first Microprocessor


The Altair 8800 is considered the first personal computer.

The Modern GPU

Graphical Processing Unit (GPU)

Professional Graphics Adapter (PGA)

First processor-based video card, with an Intel 8088 microprocessor onboard. All video-related tasks were performed by the onboard microprocessor.

Silicon Graphics Inc. (SGI) – 1980s

SGI’s two most important contributions to the modern GPU:

OpenGL - Vendor-independent Application Programming Interface (API) for the development of 2D and 3D graphics applications. It has become an industry standard API used and supported by all major vendors.

Graphics Pipeline - A conceptual model of stages that graphics data is sent through. It is simply a process for converting 3D coordinates of a model into 2D screen images.

3D Graphics Pipeline from nVidia

Generalized 2-Step Graphics Pipeline

Geometry Stage – Changes 3D object coordinates into 2D window coordinates.

Rendering Stage - Fills the area between the 2D coordinates with pixels to represent the surface of the object.
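The two stages can be illustrated with a minimal Python sketch. It is not the pipeline of any particular GPU: the function names, the simple perspective divide, the `focal` parameter, and the edge-function coverage test are all assumptions chosen for illustration.

```python
def geometry_stage(vertex, width, height, focal=1.0):
    """Project a 3D camera-space vertex (z > 0) to 2D window coordinates."""
    x, y, z = vertex
    # Perspective divide: distant points move toward the screen center.
    ndc_x = focal * x / z
    ndc_y = focal * y / z
    # Map the [-1, 1] range onto pixel coordinates (y grows downward).
    win_x = (ndc_x + 1.0) * 0.5 * width
    win_y = (1.0 - ndc_y) * 0.5 * height
    return win_x, win_y


def edge(a, b, p):
    """Signed test telling which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rendering_stage(tri2d, width, height):
    """Return the pixels whose centers fall inside the 2D triangle."""
    a, b, c = tri2d
    covered = set()
    for py in range(height):
        for px in range(width):
            p = (px + 0.5, py + 0.5)
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            # Inside when all three edge tests agree in sign.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                covered.add((px, py))
    return covered


# Hypothetical triangle two units in front of the camera, drawn on an 8x8 window.
triangle3d = [(-1.0, -1.0, 2.0), (1.0, -1.0, 2.0), (0.0, 1.0, 2.0)]
triangle2d = [geometry_stage(v, 8, 8) for v in triangle3d]
pixels = rendering_stage(triangle2d, 8, 8)
```

Here `triangle2d` holds the window coordinates produced by the geometry stage, and `pixels` is the set of screen positions the rendering stage would fill.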

Main Components of the Geometry Stage

Transform and Lighting – Transform is the process of mapping the coordinates of a 3D object onto a 2D space, and lighting is the process of applying lighting effects to the scene.

Triangle Setup – Converts triangle vertices into pixels and computes the rate of change of color values between pixels.
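The "rate of change of color values between pixels" can be sketched in one dimension. This is a simplified illustration, assuming linear interpolation along a single scanline span; the helper names are hypothetical, and real hardware computes such gradients across the whole triangle in both x and y.

```python
def color_gradient(x0, color0, x1, color1):
    """Per-pixel rate of change of an RGB color between two span endpoints."""
    span = x1 - x0
    return tuple((c1 - c0) / span for c0, c1 in zip(color0, color1))


def shade_span(x0, color0, x1, color1):
    """Step across the span, adding the constant gradient at each pixel."""
    step = color_gradient(x0, color0, x1, color1)
    color = list(color0)
    out = []
    for x in range(x0, x1 + 1):
        out.append((x, tuple(color)))
        color = [c + d for c, d in zip(color, step)]
    return out


# Blend from blue at x=0 to red at x=4.
span = shade_span(0, (0.0, 0.0, 1.0), 4, (1.0, 0.0, 0.0))
```

Because the gradient is computed once during setup, each pixel then costs only one addition per color channel.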

GPU Timeline

Transform Matrix Multiplication

Transform Matrix – Made up of many interim action matrices multiplied together.

Interim Action Matrix – Includes such actions as scaling, rotation, translation, etc.
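As a rough sketch of how interim action matrices combine into one transform matrix, the Python below multiplies 4x4 scaling, rotation, and translation matrices and applies the product to a point. The helper names and the choice of homogeneous 4x4 matrices with column vectors are assumptions for illustration, not taken from the slides.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def scaling(sx, sy, sz):
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]


def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def apply(m, point):
    """Apply a 4x4 transform to a 3D point using homogeneous coordinates."""
    v = [point[0], point[1], point[2], 1.0]
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])


# Compose once (applied right to left): scale, then rotate, then translate.
transform = matmul(translation(5.0, 0.0, 0.0),
                   matmul(rotation_z(math.pi / 2), scaling(2.0, 2.0, 2.0)))
moved = apply(transform, (1.0, 0.0, 0.0))
```

The point (1, 0, 0) is scaled to (2, 0, 0), rotated a quarter turn to roughly (0, 2, 0), and translated to roughly (5, 2, 0). Composing the matrices once means every vertex needs only a single matrix multiplication.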

Fixed Function Pipeline

Programmable Pipeline

Vertex Programs replace the T&L stages of the pipeline.

Fragment Programs replace multi-texturing and blending.

The Classic Von Neumann Architecture

The Von Neumann Bottleneck arises from the separation between the CPU and memory: every instruction and data item must cross the connection between them.

The Stream Processing Model

Streams are sets of sequential data elements that require similar computation.

Kernels are pieces of code that operate on every element of a stream.
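A minimal Python sketch of the two definitions above: a stream is modeled as a list of elements, and a kernel as a function applied to every element. The names `saxpy_kernel` and `run_kernel` are hypothetical, and a real stream processor would run the kernel on many elements at once rather than in a loop.

```python
def saxpy_kernel(element, a=2.0):
    """Kernel: the same computation (a*x + y) for every stream element."""
    x, y = element
    return a * x + y


def run_kernel(kernel, stream):
    """Apply a kernel to each element of an input stream, producing an output stream."""
    return [kernel(element) for element in stream]


input_stream = [(1.0, 0.5), (2.0, 0.5), (3.0, 0.5)]
output_stream = run_kernel(saxpy_kernel, input_stream)
```

Since no element depends on another, the elements could in principle be processed in any order or all at the same time.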

Three Levels of Parallelism Exposed by the Stream Processing Model

Instruction-Level Parallelism – Simultaneous execution of multiple instructions within a kernel.

Data-Level Parallelism – Instruction execution on multiple stream elements simultaneously.

Task-Level Parallelism – Multiple stream processors can divide the work from one kernel or different kernels run on different stream processors.

Memory Access is Expensive:

CPUs use caches to reduce off-chip memory access.

Caches benefit from:

Spatial Locality – Items located physically near an item referenced in the near past will have a higher probability of being referenced in the near future.

Temporal Locality – Items referenced in the near past have a higher probability of being re-referenced in the near future.

GPUs benefit from:

Producer-Consumer Locality – Production of a stream that is immediately consumed by another kernel.

Memory-to-Arithmetic Operations Ratio:

Traditional Accumulator 1:1
Scalar Processor 1:4
Stream Processor 1:100
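One way to see why producer-consumer locality lowers the memory-to-arithmetic ratio is a toy operation count. The model below is an assumption made for illustration only: it supposes each kernel either reads and writes off-chip memory for every element, or passes its output directly to the next kernel. The parameter values are hypothetical and are not the source of the ratios listed above.

```python
def memory_to_arithmetic(n_elements, n_kernels, ops_per_kernel, chained):
    """Count memory and arithmetic operations for a chain of kernels."""
    arithmetic = n_elements * n_kernels * ops_per_kernel
    if chained:
        # Intermediate streams stay on chip: one load and one store per element.
        memory = 2 * n_elements
    else:
        # Every kernel loads its input from memory and stores its output back.
        memory = 2 * n_elements * n_kernels
    return memory, arithmetic


unchained = memory_to_arithmetic(1000, 10, 20, chained=False)
chained = memory_to_arithmetic(1000, 10, 20, chained=True)
```

With these illustrative numbers the unchained version performs one memory operation per 10 arithmetic operations, while the chained version performs one per 100.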

General Purpose GPU Computing

Why General Purpose Computing on a GPU?

GPUs are not hampered by the classic sequential code structure of the CPU. This means that GPUs can use additional transistors more effectively.

Moore’s Law says transistor count at a given die size doubles every 18 months. That of a GPU doubles every 6 months.

The GeForce 6800 has 222 million transistors, considerably more than the Pentium 4 (about 125 million in its Prescott revision).

Speed - The lure of raw computational power; parallelism.

Cost - The multi-billion dollar gaming industry drives down the cost of the commodity GPU, making it a very cost-effective alternative to the CPU.


Moore’s Law Cubed

From ‘Stream Programming Environments’ – Hanrahan, 2004
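The "cubed" label can be checked with a little arithmetic, assuming the two doubling periods quoted earlier (18 months for Moore's Law, 6 months for GPUs) and simple exponential growth. The function name is hypothetical.

```python
def growth_factor(months, doubling_period_months):
    """Transistor-count multiplier after `months`, given a doubling period."""
    return 2.0 ** (months / doubling_period_months)


cpu_growth = growth_factor(18, 18)  # one doubling in 18 months
gpu_growth = growth_factor(18, 6)   # three doublings in 18 months
```

Over the same 18 months the first rate gives a factor of 2 and the second a factor of 8, which is 2 cubed.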


Current Research Topics

Computer Vision
Computational Geometry
Stream Processing
Cloud Simulation
Ice Crystal Growth Simulation
Database Queries
Monte Carlo Methods
Computational Fluid Dynamics
Collision Detection
Voronoi Computations
Molecular Dynamics
Many More…


Stanford’s “General Purpose” Imagine Stream Processor


Imagine Bandwidth Hierarchy


Matrix-Matrix Multiplication – A Test Case

C = AB, where A and B are large, dense N×N matrices.

System Requirements:

CPU Test:

Pentium III 750MHz
ScienceMark 2.0 – BLAS (Basic Linear Algebra Subprograms) software suite.

GPU Test:

GeForce FX 5200 – 1st fully programmable 3D Graphics Pipeline GPU.
Source code from GPUBench suite of performance testing tools, which is written in Cg (“C for Graphics”).
Microsoft Visual Studio .Net 2003 – Programming Environment.
Cygwin – Linux environment for MS Windows.
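For readers who want to see what such a test measures, here is a small pure-Python sketch of dense matrix multiplication with a GFLOPS estimate. It is not the ScienceMark or GPUBench code. The count of 2N³ floating-point operations (one multiply and one add per inner step) is a common convention and an assumption here, since the slides do not state how operations were counted.

```python
import time


def matmul(a, b):
    """Dense C = AB for square matrices given as lists of rows."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c


def flop_count(n):
    """Assumed operation count: one multiply and one add per inner step."""
    return 2 * n ** 3


def measure_gflops(n):
    """Time one N x N multiplication and convert to GFLOPS."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    return flop_count(n) / elapsed / 1e9


small = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
rate = measure_gflops(16)
```

An interpreted loop like this will report far lower rates than tuned BLAS or GPU code; it only shows how a GFLOPS figure is derived from matrix size and elapsed time.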


Results

[Figure: GFLOPS (0 to 1.4) versus dimension of square matrices (0 to 1400) for the GeForce FX 5200 and the Pentium III 750MHz.]


Efficiency

$e = \dfrac{\text{observed peak GFLOPS}}{\text{theoretical peak GFLOPS}}$

CPU: Theoretical peak GFLOPS for the Pentium III 750MHz is 3 GFLOPS. Observed Peak GFLOPS for this test was 1.2 GFLOPS.

e = 40% efficiency

GPU: Theoretical peak GFLOPS for the GeForce FX 5200 is 4 GFLOPS. Observed Peak GFLOPS for this test was 0.6 GFLOPS.

e = 15% efficiency

NOT EXPECTED

In this test the GPU is theoretically capable of about 33% more GFLOPS than the CPU (4 versus 3), but was found to perform only ½ as well (0.6 versus 1.2).
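The efficiency figures can be reproduced from the peak values quoted for each processor. This short Python check assumes efficiency is observed peak divided by theoretical peak, which matches the 40% and 15% results given; the function name is hypothetical.

```python
def efficiency(observed_gflops, theoretical_gflops):
    """Fraction of the theoretical peak actually achieved."""
    return observed_gflops / theoretical_gflops


cpu_e = efficiency(1.2, 3.0)   # Pentium III 750MHz
gpu_e = efficiency(0.6, 4.0)   # GeForce FX 5200
observed_ratio = 0.6 / 1.2     # GPU observed peak relative to CPU observed peak
```

The values come out to roughly 0.40 for the CPU and 0.15 for the GPU, with the GPU reaching half of the CPU's observed rate.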


Future of the GPU

Potential Improvements

Design of new algorithms.

New languages that are highly parallel and data streaming capable.

Compilers and tools to advance parallel stream programming.
• Stanford University’s BrookGPU

Memory bandwidth hierarchy improvements.

GPU Clusters

nVIDIA SLI (Scalable Link Interface)

Can double the performance of a single GPU.

Examples of Load Balancing: Alternate Frame Rendering

Examples of Load Balancing: Split Frame Rendering
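The two load-balancing schemes can be sketched as simple work-assignment rules. This is an illustrative model only, not nVIDIA's driver logic: the function names, the round-robin rule, and the `top_share` parameter are assumptions, and the slides do not say how the split position is chosen.

```python
def alternate_frame_rendering(frame_index, n_gpus=2):
    """AFR: whole frames are handed to the GPUs in round-robin order."""
    return frame_index % n_gpus


def split_frame_rendering(height, top_share=0.5):
    """SFR: one frame is cut into a top and a bottom band, one per GPU.

    Returns two (first_row, end_row) ranges; top_share is the assumed
    fraction of rows given to the first GPU.
    """
    split_row = round(height * top_share)
    return (0, split_row), (split_row, height)


afr_schedule = [alternate_frame_rendering(i) for i in range(4)]
even_split = split_frame_rendering(1024)
uneven_split = split_frame_rendering(1000, top_share=0.4)
```

In the AFR schedule the GPUs take turns on successive frames. In SFR, moving the split row lets one GPU take a smaller band when its part of the scene is more expensive to draw.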

GPU Clustering at Stony Brook University

Questions
