Fixed point math, texturing and...

Fixed point math, texturing and texture

cachingMichael Doggett

Department of Computer ScienceLund university

Pixel shader

Vertex shader

Last week’s stage of the Graphics Pipeline

Rasterization

FrameBuffer

Z & Alpha

Edge functions

Vertex positioning

Traversal

Interpolation

Last weekRasterization

© 2009 Tomas Akenine-Möller 4

But first, Assignment 1!• C++ programming, but very localized in

functions where you should add code–C++ should be no problem (if it is, then ask on the

forum)• Only uses simple OpenGL

–should work on any GPU•But requires Windows


Assignment 1• Two small parts in this assignment:

–Find three bad things in small scenes• Fix the code so that correct behaviour is obtained

–Use a texture cache• Should be able to reduce texture bandwidth to 10-15%


Overview• Theory:

–Fixed-point math (Appendix A – online)–Texturing (Chapter 5 – online)–Texture caching (see assigned papers)

• Caches (Section 5.5 in notes)–For assignment 1, it will help to read chapters 2 and 3

as well (online)

• Practice:–More about the rasterizer framework for assignment 1–More about the actual assignment


Fixed-point math• Not floating point... • Good to know!• Essential for hardware design• Can be used for performance optimizations


Integer vs fixed-point• An 8-bit integer: b7 b6 b5 b4 b3 b2 b1 b0

• bi is ”worth” 2i as usual–But where is the decimal point?

b7 b6 b5 b4 b3 b2 b1 b0 00000...

• What if we move it to the left?

• bi is still ”worth” 2i : b-1=0.5, b-2=0.25, ...

b5 b4 b3 b2 b1 b0 b-1 b-2 00000...


What is fixed?• The decimal point...• A fixed-point number has a representation of [i.f] bits

– i bits for the integer part (with sign, or without)• We assume that two’s-complement is used, i.e., integer math can be

used– f bits for the fractional part

• Look at the fractional bits...

f1 f2 f3 f4

Decimal point

Worth 1/21=0.5Worth 1/22=0.25

1/23=0.125

1/24=0.0625


Resolution• f fractional bits ! resolution is 2-f

• Examples:


How to maintain the best accuracy?

• The number of bits needed for exact accuracy is increased after each mathematical operation (e.g., addition)–Overflow

• We focus on–Addition/subtraction–Multiplication

• Reason: needed for part of assignment 1


Conversion:to fixed and back again

• We have floating point number, a, and want fixed-point, [i.f]

• To fixed:• Nice thing: we now have an integer, and so can use

integer addition, mult etc (but see next slides on that)

• Rounding is implemented:

• If we have fixed-point number, b, we get a float as:

• x 2f and x 2-f are implemented as left and right shifts (fast!)


Very simple example:• We have float b=0.25• And want to represent it in fixed-point with 3

fractional bits, i.e., f=3• round(0.25*23)=2• Thus 2 is the fixed-point representation of

0.25 with three fractional bits• Can look at the 8 bits of the integer:

–0000.010 (= 2 if you disregard the decimal point)


Addition precision

• Why? Imagine the worse case:–Both numbers hold their maximum number:

• Eg 111.11b +111.11b = 1111.10b

• Result grows by one bit in integer part!

• Number of bits becomes:

=?

Note ”+1”


Multiplication precision• More complex. Can be seen as many adds!

–So intuitively, should need more bits to store

• Note, that if you want to maintain exact accuracy, we need to move the ”fixed-point”–Need twice as many fractional bits!

• In general:

• See appendix A for an explanation– Basically, a mult is a series of additions of shifted numbers


Fixed-point in practice• In C++ code, you deal with these as int’s

–32 bit signed numbers (but you need not use all of the bits)

• However, you need to prepare the calculations so that bits are not lost

• For edge functions it is of utmost importance to maintain exact values–(after you have rounded off floating-point screen space

coordinates to sub-pixel fixed-point coords)• Example: a and b are [8.2]. If you write:–c=a*b; // then c is [16.4] –d=c>>2; // d is now 16.2 format – // (but we’ve lost 2 LSB – // fractional bits)


Fixed-point example•float a=2.75f; •int ai=int(a*(1<<2)+0.5); // [2.2] •// should use floatToFixed()

•float b=2.5f; •int bi=int(b*(1<<1)+0.5); //[2.1] •// how to compute ai+bi? •int ci=ai+(bi<<1); // [3.2] bits


End of fixed-point...• In software frame work, a function int floatToFixed(fracBits, float_number) is used.

• When you do a matrix/vector multiply–You often do [16.16]*[16.16] ~=[32.16], or worse

• Remember–Full accuracy needed for edge-functions

• Read appendix A and chapter on Edge Funcs again–Available on course website

Pixel shader

Vertex shader

Last week’s stage of the Graphics Pipeline

Rasterization

FrameBuffer

Z & Alpha

Pixel shader

Vertex shader

Today’s stage of the Graphics Pipeline

Rasterization

FrameBuffer

Z & Alpha

Texture


Texturing – the tiny details

• Surprisingly simple technique–Extremely powerful, especially with programmable

shaders–Simplest form: ”glue” images onto surfaces (or

lines, or points)

Image from ”lpics”-paper by Pellacini et al. SIGGRAPH 2005PIXAR Animation Studios


Texture space, (s,t)

• Texture resolution, often 2a x 2b texels• The ck are texture coordinates, and belong to a

triangle’s vertices• When rasterizing a triangle, we get (u,v)

interpolation parameters for each pixel (x,y):–Thus the texture coords at (x,y) are:

(u,v)

(s,t)


A texture image + coord systems

• An wxh=8x4 texture. –(s,t) are independent of texture resolution–(sw,th) depend on the resolution, and are used to

access texels…• Each pixel in a Texture is called a “Texel”

© 2009 Tomas Akenine-Möller

Texture filtering

MINIFICATION MAGNIFICATION

What do weget here?


• We basically want the sum of the texels in the footprint (dark gray) to the right


Texture magnification (1)

• Middle: nearest neighbor – just pick nearest texel

• Right: bilinear filtering: use the four closest texels, and weight them according to actual sampling point


• Bilinear filtering is simply, linear filtering in x:

Texture magnification (2)

t00 t10

t11t01

• Followed by linear filtering in y:


Texture minification• If nearest neighbor or bilinear filtering is used,

then serious flickering will result–Extremely annoying

Nearestneighbor

Trilinearmipmapping

For a pixel here, there is a 50%chance of getting a black texel


Texture minification: mipmapping

An image pyramidof low-passfiltered images



Trilinear Mipmapping (1)

• Basic idea:–Approximate (dark gray footprint) with square–Then we can use texels in mipmap pyramid


Trilinear mipmapping (2)

• Compute d (LOD) (see Chapter 5), and then use two closest mipmap levels– In example above, level 1 & 2

• Bilinear filtering in each level, and then linear blend between these colors ! trilinear interpolation

• Nice bonus: makes for much better texture cache usage

demo1.exe

Text

© mmxiii mcd

TextureCaching


Texture caching• Without a cache, we can get ridiculously expensive

texturing... • Basic idea is: just use a cache for recently accessed

texels–Since we access coherently, hit rate should be quite

high!– In hardware, a cache can be:

• A small SRAM memory, or• A set of flipflops• We assume that an access in the cache is for ”free”

• In the assignment, texture filtering (eg mipmapping) is done for you.–You should experiment with caching parameters!


Assumptions: memory architecture• Accesses to external memory are expensive

–Both in time and from energy perspective–Bursting (i.e., send a sequence of continuous

words) is often (much) cheaper• E.g., fetching 8x 32-bit words (32 bytes) in a sequence is much faster than fetching 8x 32-bit words that are in random places...

Externalmemory

FragmentGeneration

Triangles

Textured fragments

PIPE

LIN

E

Texture cache


Texture cache readings• A nice introduction :

– My “Texture Caches” paper from IEEE Micro 2012• Also :

– ”The Design and Analysis of a Cache Architecture for Texture Mapping”, by Hakura and Gupta, in ISCA 97.

– ”Prefetching in a Texture Cache Architecture”, by Igehy et al, in Graphics Hardware 1998.

• Note that these are old papers, and cache sizes etc don’t apply to modern systems...• The general results still apply though

• GPU Example: NEON architecture (1998)– Built by Digital Equipment Corporation (bought by Compaq (bought by HP))– Has 256 bytes of cache, fully associative– Split into 8 different small caches

• So 8 texels can be fetched every clock cycle– Cache line size is 32 bits

• This is very small. The optimal size depends on what type of external memory you have

• More about GPU memory architecture in a later lecture


How to get good efficiency• Three important things [Hakura & Gupta]:

–How texels in texture are ordered in memory–Rasterization algorithm–Cache parameters

• Associativity–Number of cache lines = sets X ways–n - way associate cache : means n blocks(lines) in each set

• Cache line size• Cache size


Representation of textures in mem• Normally, a 4x4 texture is stored as:

–RGBA0, RGBA1, RGBA2, ... RGBA15 • What if, we traverse in the vertical

direction?–E.g., accessing 1,5,9,13–Quite bad if we read, say, 4 texels into

the cache at a time• Are better texel orderings possible?• With representation to the right, only

two blocks are read into the cache

1514131211109876543210

1514111013129876325410

• This representation will (on average) get the same performance regardless of traversal direction!!!


• This is called a ”blocked” or ”tiled” representation - “z-order”

• It is a 4D structure: first find 2x2 block, then texel in block

Representation of textures...

1514111013129876325410

• In general, we have an nxn block... –n is power of 2

• Mipmap levels can thrash at exactly the same location in a direct mapped cache

• Solution:–Use a fully associative cache–Hakura & Gupta shows that a 2-way associative cache

gives similar results–Or simpler, ”bake” the mipmap level into the computation of

the ”cache key” (tag)


Texture cache recommendations

• Tile (block) size in texture should be equal to cache line size

• Can even extend to 6D addressing–Another level, where each block is the size of the entire

cache... • Further minimizes conflict misses

–Also, Igehy et al use two separate direct-mapped caches:• One for odd mipmap levels, and one for even• Is enough to get good results• Again, one direct-mapped cache would work if the cache key

(tag) take mipmap level into account (but having two caches gives more bandwith from the caches)


Traversal algorithm• Traversal algorithm affects the order in which

texels are accessed !–Also influences texture caching...

• With scanline-based traversal, we do not get any positive effects for pixels below current scanline–This is assuming a small cache–Positive effects should be possible, due to bilinear

filtering (used in mipmapping and magnification)

• Tiled traversal performs better!–Especially for large triangles


Why is mipmapping good for texture caching?• We choose mipmap levels to

access where footprint becomes ~1 texel

• Therefore, traversal moves slowly in texture space ! many cache hits!

• Better than nearest neighbor (minification)


Back to the assignment...The coding framework (1)

• Implements a subset of OpenGL–(mostly focused on the rasterizer)

• Designed so –that is, it is built around units that exist in real

hardware• Programmability

–We have fragment shaders as well–Though, focus is not on using them right now...


The coding framework (2)• Uses Microsoft Visual Studio 2008

–But upgrades to work with 2010/12/13/15• Nice feature for this assignment:

–Press the R key, and you can toggle rasterizer–You can switch from

• our software rasterizer• to the OpenGL hardware rasterizer


Actual assignment (1)• Two tasks..• Task 1:

–Switch between the software rasterizer and hardware OpenGL rasterizer (press ’R’)

–Use this to detect the ”artifacts”• Three artifacts: need to be corrected so that results are ”very near” identical to hardware OpenGL

• How could I know how to correct the artifacts?• Read the literature that we recommend!

–Everything is very localized in the source:• Change in cRasterizer.* + cEdgeFunc.*


Actual assignment (2)• Task 1: Fix pixel errors.

• Task 2:–Time to reduce texture bandwidth–In glstate.cpp, add a texture cache...–Should be able to reduce texture bandwidth to at

most 10-15%...–You need to experiment quite a bit to get this kind

of performance...


cRasterizeror

cTileRasterizer

More about the software framework

setup()

rasterizeTriangle()[i.e., triangle traversal]

For each pixelinside triangle

perFragment() Compute depth cDepthUnit – depth testPerspectiveinterpolation

Fragment shader

cTextureUnit – texel fetch & filtering

Write color cColorUnit

© mmxvi mcd

Next• Don’t forget to read the literature!

• Text has full background to the slides

• Very valuable for assignments too

• Labs

• Find a partner (ask on the forum)

• Sign up

• Check the web page for info http://cs.lth.se/edan35

• Ask questions on the forum

• Next Lecture :

• Shader programming

http://cs.lth.se/edan35

Fixed point math, texturing and...

Documents

Transcript of Fixed point math, texturing and...