Database Operations on GPU Changchang Wu 4/18/2007.


Database Operations on GPU

Changchang Wu

4/18/2007

Outline

• Database Operations on GPU

• Point List Generation on GPU

• Nearest Neighbor Searching on GPU

Database Operations on GPU

Design Issues

• Low bandwidth between GPU and CPU

  • Avoid frame buffer readbacks

• No arbitrary writes

  • Avoid data rearrangements

• Programmable pipeline has poor branching

  • Evaluate branches using fixed function tests

Design Overview

• Use depth test functionality of GPUs for performing comparisons

• Implements all possible comparisons <, <=, >=, >, ==, !=, ALWAYS, NEVER

• Use stencil test for data validation and storing results of comparison operations

• Use occlusion query to count number of elements that satisfy some condition

Basic Operations

Basic SQL query:

SELECT A

FROM T

WHERE C

A = attributes or aggregations (SUM, COUNT, MAX etc)

T = relational table

C = Boolean combination of predicates (using operators AND, OR, NOT)

Basic Operations

• Predicates – ai op constant or ai op aj

• Op is one of <,>,<=,>=,!=, =, TRUE, FALSE

• Boolean combinations – Conjunctive Normal Form (CNF) expression evaluation

• Aggregations – COUNT, SUM, MAX, MEDIAN, AVG

Predicate Evaluation

• ai op constant (d)

• Copy the attribute values ai into depth buffer

• Define the comparison operation using depth test

• Draw a screen filling quad at depth d

glDepthFunc(…)

glStencilOp(fail, zfail, zpass );
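As a rough CPU-side illustration of the depth-test predicate above, the following Python sketch stands in for the GPU pass. The names `GL_FUNCS`, `MIRRORED`, and `evaluate_predicate` are hypothetical, and plain lists stand in for the depth and stencil buffers (real depth values would be normalized). One detail worth keeping in mind: the OpenGL depth function compares the incoming fragment depth (the quad at depth d) against the stored value (the attribute), so `ai < d` corresponds to `GL_GREATER`.

```python
import operator

# OpenGL depth functions: pass if INCOMING depth <func> STORED depth.
GL_FUNCS = {
    "GL_LESS": operator.lt, "GL_LEQUAL": operator.le,
    "GL_GREATER": operator.gt, "GL_GEQUAL": operator.ge,
    "GL_EQUAL": operator.eq, "GL_NOTEQUAL": operator.ne,
}

# "attribute op d" becomes "d mirrored_op attribute", because the quad
# depth d is the incoming value and the attribute is the stored value.
MIRRORED = {
    "<": "GL_GREATER", "<=": "GL_GEQUAL", ">": "GL_LESS",
    ">=": "GL_LEQUAL", "==": "GL_EQUAL", "!=": "GL_NOTEQUAL",
}

def evaluate_predicate(depth_buffer, op, d):
    """Simulate drawing one screen-filling quad at depth d.

    Returns (stencil, count): stencil[i] is 1 where "attribute op d"
    holds, and count plays the role of the occlusion-query result.
    """
    func = GL_FUNCS[MIRRORED[op]]
    stencil = [1 if func(d, stored) else 0 for stored in depth_buffer]
    return stencil, sum(stencil)

stencil, count = evaluate_predicate([10, 24, 37, 99], "<", 37)
print(stencil, count)
```

The count comes for free in this model, which mirrors the idea that an occlusion query wrapped around the quad draw reports how many records satisfied the predicate.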

Predicate Evaluation

• Comparing two attributes:

  • ai op aj is treated as (ai – aj) op 0

• Semi-linear queries

• Easy to compute with fragment shader
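A minimal Python sketch of what the fragment shader would compute per record. It assumes the usual meaning of a semi-linear query, a weighted sum of attributes compared against a constant, which the slides do not spell out. The function names and the column-list layout are illustrative only.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

def compare_attributes(a_i, a_j, op):
    """Evaluate a_i op a_j per record as (a_i - a_j) op 0."""
    return [OPS[op](x - y, 0) for x, y in zip(a_i, a_j)]

def semi_linear(columns, weights, op, b):
    """Evaluate sum_k weights[k] * columns[k][r] op b for every record r."""
    n = len(columns[0])
    combo = [sum(w * col[r] for w, col in zip(weights, columns))
             for r in range(n)]
    return [OPS[op](value, b) for value in combo]

print(compare_attributes([1, 5, 3], [2, 2, 3], "<"))
print(semi_linear([[1, 2, 3], [4, 5, 6]], [2, -1], ">", -1))
```

Comparing two attributes is the special case with weights (1, -1) and constant 0.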

Boolean Combinations

• Expression provided as a CNF

• CNF is of form ($A_1$ AND $A_2$ AND … AND $A_k$), where $A_i$ = ($B_{i1}$ OR $B_{i2}$ OR … OR $B_{i m_i}$)

• CNF does not have NOT operator

  • If CNF has a NOT operator, invert comparison operation to eliminate NOT

  • Eg. NOT (ai < d) => (ai >= d)

• For example, compute ai within [low, high]

• Evaluated as ( ai >= low ) AND ( ai <= high )
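The CNF evaluation can be sketched on the CPU as follows. This is only an illustration: a Python list of booleans stands in for the stencil buffer, and `evaluate_cnf`, `eliminate_not`, and the `(attribute, op, constant)` tuple encoding are hypothetical names, not taken from the slides.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}

NEGATED = {"<": ">=", "<=": ">", ">": "<=", ">=": "<", "==": "!=", "!=": "=="}

def eliminate_not(predicate):
    """Rewrite NOT (attr op const) as (attr inverted_op const)."""
    attr, op, const = predicate
    return (attr, NEGATED[op], const)

def evaluate_cnf(records, clauses):
    """clauses is a list of OR-clauses; each clause is a list of
    (attr, op, const) predicates. Returns one boolean per record."""
    stencil = [True] * len(records)      # every record starts valid
    for clause in clauses:               # AND across clauses
        for r, record in enumerate(records):
            if stencil[r]:               # already-rejected records stay rejected
                stencil[r] = any(OPS[op](record[attr], const)
                                 for attr, op, const in clause)
    return stencil

records = [{"a": 5}, {"a": 20}, {"a": 100}, {"a": 150}]
low, high = 20, 100
print(evaluate_cnf(records, [[("a", ">=", low)], [("a", "<=", high)]]))
```

The range query is the two-clause case: one clause per bound, each with a single predicate.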

Algorithm

Range Query

• Compute ai within [low, high]

• Evaluated as ( ai >= low ) AND ( ai <= high )

Aggregations

• COUNT, MAX, MIN, SUM, AVG

• No data rearrangements

COUNT

• Use occlusion queries to get pixel pass count

• Syntax:

  • Begin occlusion query

  • Perform database operation

  • End occlusion query

  • Get count of number of attributes that passed database operation

• Involves no additional overhead!

MAX, MIN, MEDIAN

• We compute Kth-largest number

• Traditional algorithms require data rearrangements

• We perform no data rearrangements, no frame buffer readbacks

K-th Largest Number

• By comparing and counting, determine every bit in order from MSB to LSB

Example: Parallel Max

• S = {10, 24, 37, 99, 192, 200, 200, 232}

• Step 1: Draw Quad at 128 (10000000)

  • 192, 200, 200, 232 are >= 128, so bit 7 is 1

• Step 2: Draw Quad at 192 (11000000)

  • 192, 200, 200, 232 are >= 192, so bit 6 is 1

• Step 3: Draw Quad at 224 (11100000)

  • 232 is >= 224, so bit 5 is 1

• Step 4: Draw Quad at 240 (11110000)

  • No values pass, so bit 4 is 0

• Step 5: Draw Quad at 232 (11101000)

  • 232 is >= 232, so bit 3 is 1

• Step 6,7,8: Draw Quads at 236, 234, 233

  • No values pass, Max is 232
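The bit-by-bit procedure generalizes from the maximum to the k-th largest value by requiring at least k values to pass instead of at least one. The Python sketch below illustrates this under the assumption of unsigned integer attributes; `count_ge` is a hypothetical stand-in for one quad draw plus an occlusion query.

```python
def count_ge(values, threshold):
    """Stand-in for drawing a quad at `threshold` and reading the
    occlusion-query count of values that are >= threshold."""
    return sum(1 for v in values if v >= threshold)

def kth_largest(values, k, bits=8):
    """Determine the k-th largest value one bit at a time, MSB to LSB.
    No data is rearranged; each step needs only a count."""
    result = 0
    for bit in reversed(range(bits)):
        candidate = result | (1 << bit)
        if count_ge(values, candidate) >= k:
            result = candidate          # enough values pass: keep this bit
    return result

S = [10, 24, 37, 99, 192, 200, 200, 232]
print(kth_largest(S, 1))   # maximum
print(kth_largest(S, 4))   # 4th largest
```

With k = 1 the candidates tested are 128, 192, 224, 240, 232, 236, 234, 233, the same quad depths as in the example. The median is the case k = n/2.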

Accumulator, Mean

• Accumulator - Use sorting algorithm and add all the values

• Mean – Use accumulator and divide by n

• Interval range arithmetic

• Alternative algorithm

  • Use fragment programs – requires very few renderings

  • Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]

Accumulator

• Data representation is of form $a_k 2^k + a_{k-1} 2^{k-1} + \dots + a_0$

$\mathrm{Sum} = \mathrm{sum}(a_k)\, 2^k + \mathrm{sum}(a_{k-1})\, 2^{k-1} + \dots + \mathrm{sum}(a_0)$

Current GPUs support no bit-masking operations

The Algorithm

>=0.5 means i-th bit is 1
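A small Python sketch of the bitwise accumulator. It assumes the ">= 0.5" test refers to the fractional part of value / 2^(i+1), which is one way to test bit i without bit-masking; the slide's figure is not available, so this reading is an assumption. `bit_is_set` and `accumulate` are hypothetical names.

```python
def bit_is_set(value, i):
    """Test bit i without bit masking: the fractional part of
    value / 2**(i + 1) is >= 0.5 exactly when bit i is 1."""
    frac = (value / 2 ** (i + 1)) % 1.0
    return frac >= 0.5

def accumulate(values, bits=8):
    """Sum = sum over i of (number of values with bit i set) * 2**i.
    Each count stands in for one occlusion-query result."""
    total = 0
    for i in range(bits):
        count = sum(1 for v in values if bit_is_set(v, i))
        total += count * 2 ** i
    return total

S = [10, 24, 37, 99, 192, 200, 200, 232]
print(accumulate(S), sum(S))
```

One counting pass per bit is needed, which is consistent with the accumulator being the slowest of the operations reported later.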

Implementation

• Algorithm

  • CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations

  • GPU – NVIDIA Cg Compiler

• Hardware

  • Dell Precision Workstation with Dual 2.8GHz Xeon Processor

  • NVIDIA GeForce FX 5900 Ultra GPU

  • 2GB RAM

Benchmarks

• TCP/IP database with 1 million records and four attributes

• Census database with 360K records

Copy Time

Predicate Evaluation

Range Query

Multi-Attribute Query

Semi-linear Query

Kth-Largest

Kth-Largest

Kth-Largest conditional

Accumulator

Analysis: Issues

• Precision

• Copy time

• Integer arithmetic

• Depth compare masking

• Memory management

• No Branching

• No random writes

Analysis: Performance

• Relative Performance Gain

  • High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count

  • Medium Performance – Kth-largest number

  • Low Performance - Accumulator

High Performance

• Parallel pixel processing engines

• Pipelining

• Early Z-cull

• Eliminate branch mispredictions

Medium Performance

• Parallelism

  • FX 5900 has clock speed 450MHz, 8 pixel processing engines

  • Rendering a single 1000x1000 quad takes 0.278ms

  • Rendering 19 such quads takes 5.28ms. Observed time is 6.6ms

  • 80% efficiency in parallelism!!
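The figures above can be reproduced with a few lines of arithmetic. The sketch assumes each of the 8 pixel engines retires one pixel per clock, which is what the quoted 0.278ms implies; the variable names are illustrative.

```python
CLOCK_HZ = 450e6            # FX 5900 clock speed
PIXEL_ENGINES = 8           # assumed: one pixel per engine per clock
QUAD_PIXELS = 1000 * 1000
NUM_QUADS = 19
OBSERVED_MS = 6.6

quad_ms = QUAD_PIXELS / (CLOCK_HZ * PIXEL_ENGINES) * 1e3
ideal_ms = NUM_QUADS * quad_ms
efficiency = ideal_ms / OBSERVED_MS

print(f"one quad: {quad_ms:.3f} ms")
print(f"{NUM_QUADS} quads: {ideal_ms:.2f} ms")
print(f"efficiency: {efficiency:.0%}")
```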

Low Performance

• No gain over SIMD based CPU implementation

• Two main reasons:

  • Lack of integer-arithmetic

  • Clock rate

Advantages

• Algorithms progress at GPU growth rate

• Offload CPU work

• Fast due to massive parallelism on GPUs

• Algorithms could be generalized to any geometric shape

  • Eg. Max value within a triangular region

• Commodity hardware!

GPU Point List Generation

• Data compaction

Overall task

3D to 2D mapping

Current Problem

The solution

Overview, Data Compaction

Algorithm: Discriminator

Algorithm: Histogram Builder

Histogram Output

Algorithm: PointList Builder

PointList Output

Timing

Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M (log N) steps
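To make the O(N) + M (log N) cost concrete, here is a simplified one-dimensional Python sketch of a histogram pyramid. It is an illustration only: it reduces pairs of cells instead of the 2x2 texel blocks a GPU version would use, assumes the input length is a power of two, and uses the hypothetical names `build_pyramid` and `point_list`.

```python
def build_pyramid(active):
    """Histogram builder: level 0 is the discriminator output (0 or 1 per
    cell); each higher level sums pairs of cells. O(N) work in total."""
    levels = [list(active)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def point_list(levels):
    """Point list builder: for each output slot, walk the pyramid from the
    top down, O(log N) steps per active entry."""
    total = levels[-1][0]               # M, the number of active entries
    out = []
    for slot in range(total):
        key, idx = slot, 0
        for level in reversed(levels[:-1]):
            idx *= 2                    # move to the left child
            left = level[idx]
            if key >= left:             # target lies in the right child
                key -= left
                idx += 1
        out.append(idx)
    return out

levels = build_pyramid([0, 1, 0, 0, 1, 1, 0, 1])
print(levels[-1][0], point_list(levels))
```

Every output slot is computed independently of the others, which is what lets the traversal run as one fragment per output element.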

Applications

• Image Analysis

  • Feature Detection

• Volume Analysis

• Sparse Matrix Generation

Searching

• 1D Binary Search

• Nearest Neighbor Search for High dimension space

• K-NN Search

Binary Search

• Find a specific element in an ordered list

• Implement just like CPU algorithm

  • Assuming hardware supports long enough shaders

• Finds the first element of a given value v

  • If v does not exist, finds the smallest element greater than v

• Search algorithm is sequential, but many searches can be executed in parallel

• Number of pixels drawn determines number of searches executed in parallel

• 1 pixel == 1 search

Binary Search

• Search for v0

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: position 4 (v2)

Search starts at center of sorted array

v2 >= v0 so search left half of sub-array

Binary Search

• Search for v0

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: position 4

Step 1: position 2 (v0)

v0 >= v0 so search left half of sub-array

Binary Search

• Search for v0

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: position 4

Step 1: position 2

Step 2: position 1 (v0)

v0 >= v0 so search left half of sub-array

Binary Search

• Search for v0

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: position 4

Step 1: position 2

Step 2: position 1

Step 3: position 0 (v0)

At this point, we either have found v0 or are 1 element too far left

One last step to resolve

Binary Search

• Search for v0

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: position 4

Step 1: position 2

Step 2: position 1

Step 3: position 0

Step 4: position 0 (v0)

Done!

Binary Search

• Search for v0 and v2

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: v0 search at position 4, v2 search at position 4 (v2)

Search starts at center of sorted array

Both searches proceed to the left half of the array

Binary Search

• Search for v0 and v2

Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: v0 search at position 4, v2 search at position 4

Step 1: v0 search at position 2, v2 search at position 2

The search for v0 continues as before

The search for v2 overshot, so go back to the right

Binary Search

• Search for v0 and v2

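Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: v0 search at position 4, v2 search at position 4

Step 1: v0 search at position 2, v2 search at position 2

Step 2: v0 search at position 1 (v0), v2 search at position 3 (v2)

We’ve found the proper v2, but are still looking for v0

Both searches continue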

Binary Search

• Search for v0 and v2

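Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: v0 search at position 4, v2 search at position 4

Step 1: v0 search at position 2, v2 search at position 2

Step 2: v0 search at position 1, v2 search at position 3

Step 3: v0 search at position 0 (v0), v2 search at position 2

Now, we’ve found the proper v0, but overshot v2

The cleanup step takes care of this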

Binary Search

• Search for v0 and v2

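Sorted list (indices 0 to 7): v0 v0 v0 v2 v2 v2 v5 v5

Initialize: v0 search at position 4, v2 search at position 4

Step 1: v0 search at position 2, v2 search at position 2

Step 2: v0 search at position 1, v2 search at position 3

Step 3: v0 search at position 0, v2 search at position 2

Step 4: v0 search at position 0 (v0), v2 search at position 3 (v2)

Done! Both v0 and v2 are located properly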

Binary Search Summary

• Single rendering pass

  • Each pixel drawn performs independent search

• O(log n) steps
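A CPU-side Python sketch of the "1 pixel == 1 search" idea. Note that it uses the textbook lo/hi lower-bound formulation rather than the fixed-stride stepping with a final cleanup step shown in the walkthrough; both return the index of the first element >= v. The values v0, v2, v5 are encoded as 0, 2, 5, and the function names are hypothetical.

```python
def lower_bound(sorted_list, v):
    """Index of the first element >= v (len(sorted_list) if none)."""
    lo, hi = 0, len(sorted_list)
    while lo < hi:                      # O(log n) iterations
        mid = (lo + hi) // 2
        if sorted_list[mid] >= v:
            hi = mid                    # answer is at mid or to its left
        else:
            lo = mid + 1                # answer is to the right of mid
    return lo

def parallel_search(sorted_list, queries):
    """Each query is independent, like one pixel per search on the GPU."""
    return [lower_bound(sorted_list, v) for v in queries]

data = [0, 0, 0, 2, 2, 2, 5, 5]         # v0 v0 v0 v2 v2 v2 v5 v5
print(parallel_search(data, [0, 2]))
```

For this list the search for v0 probes positions 4, 2, 1, 0 and the search for v2 probes 4, 2, 3, ending at indices 0 and 3 as in the walkthrough.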

Nearest Neighbor Search

• A fundamental step in similarity search for data mining, retrieval…

• Curse of dimensionality

  • When dimensionality is very high, structures like k-d trees do not help

• Use GPU to improve linear scan

Distances

• N-norm distance

• Cosine distance: acos(dot(x, y)), for unit-length x and y

Data Representation

• Use separate textures to store different dimensions.

Distance Computation

• Accumulating distance component of different dimensions

Reduction in RGBA

Reduction to find NN
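The linear scan can be sketched in Python as below. This is a rough CPU analogue under stated assumptions: each "texture" is a plain list holding one dimension of every data point, squared Euclidean distance is used as the metric, and the reduction keeps the smaller of each pair until one candidate remains. All names are hypothetical.

```python
def nearest_neighbor(dim_textures, query):
    """dim_textures[d][i] is coordinate d of data point i.
    Returns (index, squared_distance) of the nearest point."""
    n = len(dim_textures[0])
    acc = [0.0] * n
    # One pass per dimension, accumulating that dimension's contribution.
    for texture, q in zip(dim_textures, query):
        for i, x in enumerate(texture):
            acc[i] += (x - q) ** 2
    # Reduction: repeatedly keep the closer candidate of each pair.
    candidates = list(enumerate(acc))
    while len(candidates) > 1:
        candidates = [min(candidates[j:j + 2], key=lambda c: c[1])
                      for j in range(0, len(candidates), 2)]
    return candidates[0]

xs = [0.0, 1.0, 5.0]
ys = [0.0, 1.0, 5.0]
print(nearest_neighbor([xs, ys], (1.2, 0.9)))
```

Storing one dimension per texture means the accumulation loop touches a single texture per pass, and the reduction halves the candidate set each round, so it finishes in about log2(n) rounds.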

Results

Results

K-Nearest Neighbor Search

• Given a sample point p, find the k points nearest p within a data set

• On the CPU, this is easily done with a heap or priority queue

  • Can add or reject neighbors as search progresses

  • Don’t know how to build one efficiently on GPU

• kNN-grid

  • Can only add neighbors…

kNN-grid Algorithm

[Figure: search grid around the sample point; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• Candidate neighbors must be within max search radius

• Visit voxels in order of distance to sample point

[Figure: search grid around the sample point; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• If current number of neighbors found is less than the number requested, grow search radius

[Figure: search grid, counter at 1; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

[Figure: search grid, counter at 2; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

• If current number of neighbors found is less than the number requested, grow search radius

kNN-grid Algorithm

• Don’t add neighbors outside maximum search radius

• Don’t grow search radius when neighbor is outside maximum radius

[Figure: search grid, counter at 2; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• Add neighbors within search radius

[Figure: search grid, counter at 3; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• Add neighbors within search radius

[Figure: search grid, counter at 4; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• Don’t expand search radius if enough neighbors already found

[Figure: search grid, counter at 4; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• Add neighbors within search radius

[Figure: search grid, counter at 5; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Algorithm

• Visit all other voxels accessible within determined search radius

• Add neighbors within search radius

[Figure: search grid, counter at 6; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]

kNN-grid Summary

• Finds all neighbors within a sphere centered about sample point

• May locate more than requested k-nearest neighbors

[Figure: search grid, counter at 6; legend: sample point, neighbors found, candidate neighbor; want 4 neighbors]
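The kNN-grid rules from the preceding slides can be put together in a short 2D Python sketch. It is an interpretation, not the original implementation: the uniform grid, the box-distance voxel ordering, and the names `knn_grid`, `cell`, and `max_radius` are assumptions made for illustration.

```python
import math
from collections import defaultdict

def knn_grid(points, sample, k, max_radius, cell=1.0):
    """Add-only k-nearest-neighbor search on a uniform 2D grid.
    Returns (found, radius); found may hold more than k points."""
    grid = defaultdict(list)
    for p in points:
        grid[(math.floor(p[0] / cell), math.floor(p[1] / cell))].append(p)

    def box_distance(c):
        # Distance from the sample point to the nearest edge of voxel c.
        dx = max(c[0] * cell - sample[0], 0.0, sample[0] - (c[0] + 1) * cell)
        dy = max(c[1] * cell - sample[1], 0.0, sample[1] - (c[1] + 1) * cell)
        return math.hypot(dx, dy)

    radius, found = 0.0, []
    # Visit occupied voxels in order of distance to the sample point.
    for dmin, c in sorted((box_distance(c), c) for c in grid):
        if dmin > max_radius or (len(found) >= k and dmin > radius):
            break                        # no remaining voxel can contribute
        for p in grid[c]:
            d = math.hypot(p[0] - sample[0], p[1] - sample[1])
            if d > max_radius:
                continue                 # never add or grow beyond the maximum
            if d <= radius:
                found.append(p)          # inside the current search radius
            elif len(found) < k:
                radius = d               # too few neighbors: grow the radius
                found.append(p)
    return found, radius

pts = [(2.6, 2.5), (2.9, 2.9), (3.05, 2.5), (2.5, 1.2),
       (1.9, 2.5), (6.5, 2.5), (3.6, 3.6)]
found, radius = knn_grid(pts, (2.5, 2.5), k=4, max_radius=2.0)
print(len(found), radius)
```

Because neighbors can only be added, a far point accepted while the count is still below k fixes a radius that later admits closer points as well, which is why the result can contain more than k entries.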

References

• Naga Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin and Dinesh Manocha, Fast Computation of Database Operations using Graphics Processors, http://www.gpgpu.org/s2004/slides/govindaraju.DatabaseOperations.ppt

• Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel Keim, A Graphic Hardware Accelerated Algorithm for Nearest Neighbor Search

• Gernot Ziegler, Art Tevs, Christian Theobalt, Hans-Peter Seidel, GPU Point List Generation through Histogram Pyramids, http://www.mpi-inf.mpg.de/~gziegler/gpu_pointlist/

• Tim Purcell, Sorting and Searching, http://www.gpgpu.org/s2005/slides/purcell.SortingAndSearching.ppt