Download - background IEEE 2017 Conference on car 0.90 0 · 2020-07-27 · IEEE 2017 Conference on Computer Vision and Pattern Recognition : Difficulty-Aware Semantic Segmentation via Deep Layer

IEEE 2017 Conference on

Computer Vision and Pattern

Recognition

: Difficulty-Aware Semantic Segmentation via Deep Layer CascadeXiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, Xiaoou Tang

Department of Information Engineering, The Chinese University of Hong Kong

{lx015,lz013,pluo,ccloy,xtang}@ie.cuhk.edu.hk

IEEE 2017 Conference on

Computer Vision and Pattern

Recognition

1. Introduction

• Motivation:

Partition all pixels into three sets by classification confidence

Nearly 40% easy region

• Task & General Approaches:

Semantic Segmentation

ConvNets

Bus

Human

72.5%

0.9%1.1%1.0%0.8%

0.9%

1.8%1.7%

2.6%

1.0%1.3%

0.6%

2.2%

1.1%1.2%

4.8%

0.7%0.8%1.2%1.4%

0.6%6.7%

19.0%

68.0%

bk

g

are

o

bik

e

bir

d

bo

at

bo

ttle

bu

s

car

cat

cha

ir

cow

tab

le

do

g

ho

rse

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

Ea

sy

Mode…

Ha

rdPer

cen

tage

of

pix

els

on

VO

C

Easy Moderate Hard

Pix

els on

Bo

un

da

ry(%

)

background cow Easy Moderate Hard

(a)

(b)

• Existing Works:

Increase the model capacity to achieve promising results

High runtime complexity

Backbone Network Speed(300*500 image)

VGG16 5.7 fps

ResNet-101 7.1 fps

Inception-ResNet 9.0 fps

Layer Cascade 14.7 fps

Layer Cascade (fast) 23.6 fps

• Turning Inception-ResNet into Deep Layer Cascade (LC)

(a) IRNet

Image Stem Reduction-A5 IRNet-A 10 IRNet-B Reduction-B 5 IRNet-C

pooling

Fully-Connected

horse

Softmax

background

horse

person

car

unknown

stag

e-1

stag

e-2

stag

e-3

Image Stem Reduction-A5 IRNet-A 10 IRNet-B Reduction-B 5 IRNet-C Conv

ConvConv

(b) IRNet after LC

I

L1 L2

L3

2. Approach

• Our Idea:

Treats a single deep model as a cascade of several sub-models

Earlier sub-models are trained to handle easy and confident regions

Feed-forward harder regions to the next sub-model for processing

Stage-1 handles the easy regions (background)

Stage-2 focus more on the foreground than stage-1 does

Stage-3 further focus on harder classes

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

Lab

el P

erce

nta

ge

rati

o stage-3

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

Lab

el P

erce

nta

ge

rati

o stage-2

10.3%

45.9%43.7%

car

8.2%

39.2%52.6%

cat

6.6%

16.6%

76.8%

chair

4.9%

14.3%

80.8%

table

stage-1 stage-2 stage-3

Percentage of pixels

(a) (b)

• Ablation Study on Probability Thresholds ρ

ρ controls percentage of easy and hard regions

66%

68%

70%

72%

74%

41 46 51 56 61 66

mIo

U

Running Time (ms)

15.7 15.6

21.731.2

4.7

18.2

Time Spent in Each Stage (ms)

stage-3

stage-2

stage-1

0.93

0.95

0.97

0.98

0.985

0.90

• Performance and Speed Trade-off

3. Experiments

(a) Convolution (b) Region Convolution

M

• Region Convolution

+

(c) Region Convolution in Residual

• Stage-wise Label Distribution

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

La

bel

Per

cen

tag

e ra

tio stage-3

1

1.2

1.4

1.6

1.8

2

are

o

bik

e

bir

d

boa

t

bott

le

bu

s

car

cat

chair

cow

tab

le

dog

hors

e

mb

ike

per

son

pla

nt

shee

p

sofa

train tv

La

bel

Per

cen

tag

e ra

tio stage-2

10.3%

45.9%43.7%

car

8.2%

39.2%52.6%

cat

6.6%

16.6%

76.8%

chair

4.9%

14.3%

80.8%

table

stage-1 stage-2 stage-3

Percentage of pixels

(a) (b)

(a) input image (e) ground truth(b) stage1 (c) stage2 (d) stage3

4. Stages’ Outputs

5. Overall Performance

• Comparisons with DeepLab & SegNet

6. Conclusion

• LC adopts a “difficulty-aware” learning

paradigm.

• LC accelerates both training and testing by the

usage of region convolution. It is capable of

running in real-time.

• LC is an end-to-end trainable

framework that jointly optimizes

the feature learning for different

regions.80.3

82.7