Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ......

55
San Jose| 02/10/09 | Timo Stich Graph Cuts with CUDA

Transcript of Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ......

Page 1: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

San Jose| 02/10/09 | Timo Stich

Graph Cuts with CUDA

Page 2: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Outline

• Introduction

• Algorithms to solve Graph Cuts

• CUDA implementation

• Image processing application

• Summary

Page 3: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Problems solvable with Graphcuts

Stereo Depth Estimation Binary Image Segmentation

Photo Montage (aka Image Stitching)Source: MRF Evaluation, Middlebury College

Page 4: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Energy Minimization

• Graphcut finds global minimum

),(

|)(|)()(yx

yx

x

xx LLVLDLE

Sum over all Pixels of an Image

Data Term:

Measures fitting of

label to pixel

Neighborhood Term:

Penalizes different labelings

for neighbors

Sum over all neighborhoods

Page 5: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Example: Binary Segmentation Problem

0 0 0 0

? ? ? ?

? ? ? ?

? ? ? 1

User marks some pixels as

Background and Foreground

Compute for all pixels if they

are Background or Foreground

Page 6: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Building a Flow Graph for the Problem

)1(,baV )1(,cbV )1(,dcV )1(,edV

Source

Sink

cba d ePixels

)1(aD

)1(bD )1(cD )1(dD

)1(eD

)0(aD

)0(bD)0(cD

)0(dD

)0(eD

....

Page 7: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Maximum Flow = Minimum Cut

Source

Sink

cba d ePixels ....

Page 8: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Graph Cut Solution

Pixels

Source

Sink

....

Page 9: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Graph Cut Solution

Input Result

Page 10: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Graph Cut Algorithms

• Ford-Fulkerson

– Find augmenting paths from source to sink

– Global scope, based on search trees

– Most used implementation today by Boykov et al.

• Goldberg-Tarjan (push-relabel)

– Considers one node at a time

– Local scope, only direct neighbors matter

– Inherently parallel, good fit for CUDA

Page 11: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Push-Relabel in a nutshell

• Some definitions

– Each node x:

• Has excess flow u(x) and height h(x)

• Outgoing edges to neighbors (x,*) with capacity c(x,*)

– Node x is active: if u(x)> 0 and h(x)< HEIGHT_MAX

– Active node x

• can push to neighbor y: if c(x,y) > 0, h(y) = h(x) – 1

• is relabeled: if for all c(x,*) > 0, h(*) >= h(x)

Page 12: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Push Pseudocode

void push(x, excess_flow, capacity, const height)

if active(x) do

foreach y=neighbor(x)

if height(y) == height(x) – 1 do // check height

flow = min( capacity(x,y), excess_flow(x)); // pushed flow

excess_flow(x) -= flow; excess_flow(y) += flow; // update excess flow

capacity(x,y) -= flow; capacity(y,x) += flow; // update edge cap.

done

end

done

Page 13: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Relabel Pseudocode

void relabel(x, height, const excess_flow, const capacity)

if active(x) do

my_height = HEIGHT_MAX; // init to max height

foreach y=neighbor(x)

if capacity(x,y) > 0 do

my_height = min(my_height, height(y)+1); // minimum height + 1

done

end

height(x) = my_height; // update height

done

Page 14: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Push-Relabel Pseudocode

while any_active(x) do

foreach x

relabel(x);

end

foreach x

push(x);

end

done

Page 15: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Graph setup

Source

Sink

0/00/00/0 0/0 0/0

39 5 6

2

3/3 3/3 4/4 1/1

102 1

8

9

Excess Flow Height

1

1

Page 16: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Direct Push

Source

Sink

0/00/00/0 0/0 0/0

39 10 6

2

3/3 3/3 4/4 1/1

102 1

8

9

-7/0

10 - 3 = -7

Total flow = 0Total flow = 3

Page 17: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Initialized

Source

Sink

4/07/0-7/0 -2/0 -7/03/3 3/3 4/4 1/1

HEIGHT_MAX = 5

active

Total flow = 14

Page 18: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

After Relabel

Source

Sink

4/17/1-7/0 -2/0 -7/03/3 3/3 4/4 1/1

Total flow = 14

Page 19: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

After Push

Source

Sink

0/14/1-4/0 2/0 -7/06/0 3/3 0/8 1/1

Total flow = 19

Page 20: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

2nd iteration

Source

Sink

0/14/1-4/0 2/0 -7/06/0 3/3 0/8 1/1

Total flow = 19

Page 21: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

After Relabel

Source

Sink

0/14/2-4/0 2/1 -7/06/0 3/3 0/6 1/1

Total flow = 19

Page 22: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

After Push

Source

Sink

3/11/2-4/0 1/1 -6/06/0 0/6 0/6 0/2

Total flow = 20

Page 23: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

After 3 more Iterations, Terminated

Source

Sink

3/51/5-4/0 1/5 -6/06/0 0/6 0/6 0/2

Total flow = 20

Page 24: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Inverse BFS from Sink

Source

Sink

3/51/5-4/0 1/1 -6/06/0 0/2

X X

Total flow = 20

Page 25: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Graph Cut and Solution

Source

Sink

cba d e

39 5 6

2

3 3 4 1

102 1

8

9

Total flow = 20

Minimum Cut = 20 = Maximum Flow

Page 26: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Graph Cuts for Image Processing

• Regular Graphs with 4-Neighborhood

• Integers

• Naive approach

– One thread per node

– Push Kernel

– Relabel Kernel

S

T

Page 27: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

CUDA Implementation

• Datastructures

– 4 WxH arrays for residual edge capacities

– 2 WxH array for heights (double buffering)

– WxH array for excess flow

Page 28: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Push Data Access Patterns

• Read/Write: Excess Flow, Edge capacities

• Read only : Height

Excess Flow Data

Page 29: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Relabel Data Access Patterns

• Read/Write: Height (Texture, double buffered)

• Read only : Excess Flow, Edge capacities

Height Data

Page 30: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Data Access Patterns

• Push does scattered write:

Needs global atomics to avoid RAW Hazard!

Page 31: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Naive CUDA Implementation

• Iterative approach:

• Repeat

– Push Kernel (Updates excess flow & edge capacities)

– Relabel Kernel (Updates height)

• Until no active pixels are left

Page 32: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Naive CUDA Implementation

• Both kernels are memory-bound

• Observations on the naive implementation

– Push: Atomic memory bandwidth is lower

– Relabel: 1-bit per edge would be sufficient

Addressing these bottlenecks improves overall performance

Page 33: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Push, improved

• Idea:

– Work on tiles in shared memory

• Share data between threads of a block

– Each thread updates M pixels

• Push first M times in first edge direction

• Then M times in next edge direction

Page 34: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Wave Push

Excess Flow Data-Tile in Shared Memory

M

Active Thread

Push direction

ef = 0;

for k=0...M-1

ef += s_ef(k)

flow = min(right(x+k),ef)

right(x+k)-=flow;

s_ef(k)=ef-flow;

ef = flow;

end

Page 35: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Wave Push

Flow is carried along by each thread

Active Thread

Push direction

ef = 0;

for k=0...M-1

ef += s_ef(k)

flow = min(right(x+k),ef)

right(x+k)-=flow;

s_ef(k)=ef-flow;

ef = flow;

end

Page 36: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Wave Push

Active Thread

Push direction

ef = 0;

for k=0...M-1

ef += s_ef(k)

flow = min(right(x+k),ef)

right(x+k)-=flow;

s_ef(k)=ef-flow;

ef = flow;

end

Page 37: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Wave Push

Border

Active Thread

Push direction

Page 38: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Wave Push

Do the same for other directions

Active Thread

Push direction

Page 39: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Wave Push

• After tile pushing, border is added

• Benefits

– No atomics necessary

– Share data between threads

– Flow is transported over larger distances

Page 40: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Relabel

• Binary decision: capacity > 0 ? 1 : 0

• Idea: Compress residual edges as bit-vectors

– Compression computed during push

Page 41: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Relabel

• Compression Ratio: 1:32 (int capacities)

5

0

3

9

0

0

1

7

1

0

1

1

0

0

1

1

Page 42: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

CUDA Implementation

• Algorithmic observations

– Most parts of the graph will converge early

– Periodic global relabeling significantly reduces

necessary iterations

Page 43: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Tile based push-relabel

0%

10%

20%

30%

40%

50%

60%

70%

80%

1 2 3 4 5 6 7 8 9

Active Pixels per Iteration

Page 44: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Tile based push-relabel

• Split graph in NxN pixel tiles (32x32)

• If any pixel is active, the tile is active

Pixels

Tiles

Page 45: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Tile based push-relabel

• Repeat

– Build list of active tiles

– For each active tile

• Push

• Relabel

• Until no active tile left

Page 46: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Global Relabel

• Local relabel is a bad heuristic for long distance

flow transportation

– Unnecessary pushing of flow back and forth

• Global relabel is exact

– Computes the correct geodesic distances

– Flow will be pushed in the correct direction

– Downside: costly operation

Page 47: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Global Relabel

• BFS from sink

– First step implicit -> multi-sink BFS

• Implemented as local operator:

Page 48: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Global Relabel

• Mechanisms from Push-Relabel can be reused:

– Wave Updates

– Residual Graph Compression

– Tile based

Page 49: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Global Relabel

• Initialize all pixels:

– with flow < 0 to 0 (multi-sink BFS)

– with flow >= 0 to infinity

• Compress residual graph

• Build active tile list

• Repeat

– Wave label update

• Until no label changed

Page 50: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Final CUDA Graphcut

• Repeat

– Global Relabel

– For H times do

• Build active tile list

• For each tile do push-relabel

• Until no active tile

Page 51: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Results

• Comparison between Boykov et al. (CPU),

CudaCuts and our implementation

– Intel Core2 Duo E6850 @ 3.00 GHz

– NVIDIA Tesla C1060

Dataset Boykov

(CPU)

CudaCuts

(GPU)

Our

(GPU)

Speedup

Our vs CPU

Flower (600x450) 191 ms 92 ms 20 ms 9.5x

Sponge (640x480) 268 ms 59 ms 14 ms 19x

Person (600x450) 210 ms 78 ms 35 ms 6x

Average speedup over CPU is 10x

Page 52: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Results

0

1000

2000

3000

4000

5000

6000

7000

8000

0 2 4 6 8 10 12 14

Runti

me (m

s)

Image Size (Mega Pixels)

Boykov et al (CPU)

Our CUDA Impl. (GPU)

Average Speedup of 8.5 x

Page 53: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Example Application: GrabCut

Page 54: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

GrabCut Application(Siggraph 2004 paper)

• Based on Color models for FG and BG

– User specifies a rectangle around the object to cut

– Initialize GMM model of FG and BG colors

– Graph Cut to find labeling

– Use new labeling to update GMM

– Iterate until convergence

• Full CUDA implementation

• Total runtime: ~25 ms per iteration -> 500 ms

Page 55: Graph Cuts with CUDA - Nvidia · •Algorithms to solve Graph Cuts •CUDA implementation ... •Ford-Fulkerson –Find augmenting paths from source to sink –Global scope, based

Summary

• Introduction to Graph Cuts

• Push-Relabel CUDA implementation

– Beats CPU by 8.5 x on average

• Makes full CUDA implementation of many image

processing applications possible