SenseCUSceneParsing (Jianping Shi's conflicted copy 2016...
![Page 1: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/1.jpg)
Understanding Scene in the Wild
SenseCUSceneParsing
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, Tong Xiao, Jiaya Jia
![Page 2: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/2.jpg)
Features of ADE20K Dataset
• Number of images
  • ADE20K Dataset: 20k
• Number of scenes
  • Image label: 1038
• Number of semantic labels
  • Wall / Building, Field / Earth, Mountain / Hill, Stair / Stairway ……
![Page 3: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/3.jpg)
Our baseline
• Pretrained Resnet 101 + FCN pixel prediction
• Result by mIOU / pixel accuracy
  • Train Data: 70.20 / 90.34
  • Val Data: 35.08 / 76.87
• Of course, we have many pre-baselines that are not so good
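For readers unfamiliar with the two numbers reported on each slide, here is a minimal sketch of how mIOU and pixel accuracy are commonly computed from a confusion matrix. The function names, the toy label maps, and the handling of unlabeled pixels are assumptions made for illustration; the challenge's official evaluation script may differ in details.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    """Count (ground truth, prediction) pairs over all labeled pixels."""
    # Assumption: labels outside [0, num_classes) mark unlabeled pixels.
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def pixel_accuracy(cm):
    """Fraction of labeled pixels whose predicted class is correct."""
    return np.diag(cm).sum() / cm.sum()

def mean_iou(cm):
    """Mean over classes of intersection / union."""
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    valid = union > 0  # skip classes absent from both prediction and ground truth
    return (inter[valid] / union[valid]).mean()

# Toy 2x3 label maps with three classes (hypothetical data).
gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(pred, gt, num_classes=3)
print(mean_iou(cm), pixel_accuracy(cm))
```

Because mIOU averages over classes while pixel accuracy averages over pixels, rare classes weigh much more heavily in mIOU, which is one reason the two numbers on the slides are so far apart.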
![Page 4: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/4.jpg)
Our result improves
- Evils in the details
• Various data augmentation
• Dropout to the last convolution layers
• Using dilated convolution
• Learning rate policy
• Total iteration number
• Correct way to use batch normalization
• Larger crop size and larger receptive field
• ……
Code and model will be released later
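The link between the "dilated convolution" and "larger receptive field" items can be illustrated with the standard receptive-field arithmetic. This is a small sketch under assumed layer settings; the kernel sizes and dilation rates below are hypothetical and are not the configuration used in the talk.

```python
def effective_kernel_size(kernel_size, dilation):
    """A dilated kernel spans k + (k - 1) * (d - 1) input positions."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(layers):
    """Receptive field of a stack of (kernel_size, stride, dilation) layers."""
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf

# Three 3x3 layers, stride 1 (hypothetical stacks for comparison).
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]
print(receptive_field(plain), receptive_field(dilated))
```

With the same number of weights and no extra downsampling, the dilated stack sees a wider window, so the output resolution is preserved while the context grows.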
![Page 5: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/5.jpg)
Our result improves
- Evils in the details
• Previous baseline result by mIOU / pixel accuracy
  • Train Data: 70.20 / 90.34
  • Val Data: 35.08 / 76.87
• Current Resnet101 result
  • Train Data: 75.16 / 91.99
  • Val Data: 36.85 / 77.65
![Page 6: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/6.jpg)
Our result improves
• Deeply supervise for better optimization
[Diagram: two branches, each followed by Convolution → Softmax Loss. One branch taps "Resnet 101, to conv4_26", the other "Resnet 101, to conv5_3". Label: "Auxiliary loss, loss weight 0.4".]
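The deeply supervised objective on this slide can be sketched as a weighted sum of two softmax losses. The sketch assumes that the auxiliary branch is the one attached to the intermediate layer (conv4_26) and that the main loss has weight 1.0; the slide only states the auxiliary weight of 0.4. Function names and shapes are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax loss for logits of shape (N, C) and integer labels (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def deeply_supervised_loss(main_logits, aux_logits, labels, aux_weight=0.4):
    """Main loss plus a down-weighted auxiliary loss on the same labels."""
    main = softmax_cross_entropy(main_logits, labels)
    aux = softmax_cross_entropy(aux_logits, labels)  # assumed: intermediate branch
    return main + aux_weight * aux

# Toy example: 5 pixels, 3 classes, uninformative (all-zero) logits.
labels = np.array([0, 1, 2, 1, 0])
zeros = np.zeros((5, 3))
print(deeply_supervised_loss(zeros, zeros, labels))
```

In setups of this kind the auxiliary branch is typically used only during training and dropped at test time, so it adds no inference cost.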
![Page 7: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/7.jpg)
Our result improves
- Deep resnet improves by additional loss
• Previous result by mIOU / pixel accuracy
  • Train Data: 75.16 / 91.99
  • Val Data: 36.85 / 77.65
• Current result by deeply supervised training
  • Train Data: 77.70 / 93.15
  • Val Data: 38.28 / 78.63
• Better optimization policy improves results on confusing labels and inconspicuous objects
![Page 8: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/8.jpg)
Image level information may help scene parsing
• A failure example
![Page 9: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/9.jpg)
Recognize scene in image level
• State-of-the-art image classification
  • FCN + Average Pooling
• Classical scene understanding
  • Spatial Pyramid Matching
• Better scene recognition
  • FCN + Spatial Pyramid Matching Pooling
![Page 10: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/10.jpg)
Pixel Prediction with Image Level Information
[Diagram labels: Resnet 101; Convolution for pixel feature; Global Feature via Spatial Pyramid; Dimension Reduction via conv; Concat and make final prediction]
• Utilizing image level information for scene parsing
• End to end learning
• Marginal computation cost
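One way to read this slide is: pool the feature map over a spatial pyramid, reduce each pooled map with a convolution, upsample it back, and concatenate it with the per-pixel features before the final prediction. The NumPy sketch below follows that reading. The bin sizes `(1, 2, 4)`, the number of reduced channels, the nearest-neighbor upsampling, and the random weights standing in for a learned 1x1 convolution are all assumptions, not values from the talk.

```python
import numpy as np

def pyramid_pool(feat, bins):
    """Average-pool a (C, H, W) feature map into a (C, bins, bins) grid."""
    c, h, w = feat.shape
    ys = np.linspace(0, h, bins + 1).astype(int)
    xs = np.linspace(0, w, bins + 1).astype(int)
    out = np.zeros((c, bins, bins))
    for i in range(bins):
        for j in range(bins):
            out[:, i, j] = feat[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean(axis=(1, 2))
    return out

def upsample_nearest(grid, h, w):
    """Nearest-neighbor upsampling of a (C, b, b) grid to (C, h, w)."""
    b = grid.shape[1]
    yi = np.arange(h) * b // h
    xi = np.arange(w) * b // w
    return grid[:, yi[:, None], xi[None, :]]

def global_context(feat, bin_sizes=(1, 2, 4), reduced_channels=4, seed=0):
    """Concatenate pixel features with upsampled pyramid-pooled features."""
    rng = np.random.default_rng(seed)
    c, h, w = feat.shape
    branches = [feat]
    for bins in bin_sizes:
        pooled = pyramid_pool(feat, bins)
        # Random matrix as a stand-in for a learned 1x1 conv (dimension reduction).
        weight = rng.standard_normal((reduced_channels, c))
        reduced = np.einsum("oc,cij->oij", weight, pooled)
        branches.append(upsample_nearest(reduced, h, w))
    return np.concatenate(branches, axis=0)

feat = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
fused = global_context(feat)
print(fused.shape)
```

The pooled branches are tiny compared with the full feature map, which fits the slide's remark that the extra computation is marginal.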
![Page 11: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/11.jpg)
Our result improves
- Spatial pyramid global feature
• Previous result by mIOU / pixel accuracy
  • Train Data: 77.70 / 93.15
  • Val Data: 38.28 / 78.63
• Current result by image level information
  • Train Data: 79.51 / 93.65
  • Val Data: 41.29 / 80.04
• Global feature fixes errors caused by failing to sense the image label
![Page 12: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/12.jpg)
Deeper Pretrained Model
| Pretrained Model | Result (mIOU / pixel accuracy) |
| --- | --- |
| Resnet 50 | 40.11/79.55 |
| Resnet 101 | 41.29/80.04 |
| Resnet 152 | 42.23/80.46 |
| Resnet 269 | 43.39/80.90 |

• Better and deeper pretrained model improves the result consistently
![Page 13: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/13.jpg)
Testing and Ensemble
| Method | Result (mIOU / pixel accuracy) |
| --- | --- |
| Resnet 269 Single Scale Test | 43.39/80.90 |
| Resnet 269 Multi Scale Test | 44.59/81.80 |
| Ensemble of 5 Models | 45.95/82.48 |

• Proper testing scheme improves the result
• But it is time consuming and only useful for competitions
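A rough sketch of what multi-scale testing and model ensembling usually look like is given below. The slide does not say which scales were used or how predictions were merged, so the scale set, the nearest-neighbor resizing, and the plain averaging of class scores are all assumptions; `model` is any hypothetical callable mapping a `(C, H, W)` image to `(num_classes, H, W)` scores.

```python
import numpy as np

def resize_nearest(arr, h, w):
    """Nearest-neighbor resize of a (C, H, W) array to (C, h, w)."""
    _, ih, iw = arr.shape
    yi = np.arange(h) * ih // h
    xi = np.arange(w) * iw // w
    return arr[:, yi[:, None], xi[None, :]]

def multi_scale_predict(model, image, scales=(0.5, 1.0, 1.5)):
    """Average class scores predicted at several input scales."""
    _, h, w = image.shape
    total = 0.0
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scores = model(resize_nearest(image, sh, sw))  # (num_classes, sh, sw)
        total = total + resize_nearest(scores, h, w)   # back to input size
    return total / len(scales)

def ensemble_predict(models, image, scales=(0.5, 1.0, 1.5)):
    """Average multi-scale scores over models, then take the best class."""
    avg = np.mean([multi_scale_predict(m, image, scales) for m in models], axis=0)
    return avg.argmax(axis=0)
```

Each extra scale and each extra model adds a full forward pass, so the cost grows with the product of the two, which matches the slide's remark that this is time consuming.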
![Page 14: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/14.jpg)
Summary
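| Stage | Result (mIOU / pixel accuracy) | mIOU gain |
| --- | --- | --- |
| Baseline | 35.08/76.87 | |
| Details & Tricks | 36.85/77.65 | 1.77 |
| Deep Supervise | 38.28/78.63 | 1.43 |
| Global Feature | 41.29/80.04 | 3.01 |
| Deeper Model | 43.39/80.90 | 2.10 |
| Multi Scale Test | 44.59/81.80 | 1.20 |
| Ensemble | 45.95/82.48 | 1.36 |

Final Result: 45.95/82.48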
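The increments in the summary can be checked against the validation mIOU reported on the earlier slides. The short script below recomputes each step's gain; the stage names are taken from the summary chart, and the pairing of each gain with its stage is inferred from the arithmetic.

```python
# Validation mIOU after each stage, as reported on the slides.
stages = [
    ("Baseline", 35.08),
    ("Details & Tricks", 36.85),
    ("Deep Supervise", 38.28),
    ("Global Feature", 41.29),
    ("Deeper Model", 43.39),
    ("Multi Scale Test", 44.59),
    ("Ensemble", 45.95),
]

# Gain of each stage over the previous one.
gains = [
    (name, round(miou - prev_miou, 2))
    for (_, prev_miou), (name, miou) in zip(stages, stages[1:])
]
total_gain = round(stages[-1][1] - stages[0][1], 2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f}")
print(f"Total: +{total_gain:.2f}")
```

By this accounting the global feature gives the largest single step, followed by the deeper pretrained model.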
![Page 15: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/15.jpg)
Visual Results
![Page 16: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/16.jpg)
Visual Results
![Page 17: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/17.jpg)
Visual Results
![Page 18: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/18.jpg)
Visual Results
![Page 19: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/19.jpg)
Visual Results
![Page 20: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/20.jpg)
Visual Results
![Page 21: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/21.jpg)
Future Direction
• More labeled data, and clearer definitions
• Use of human semantic similarity matrix
• Small and rare objects
• Scene parsing in video
• Speedup
![Page 22: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/22.jpg)
It is not yet finished…
![Page 23: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/23.jpg)
Learn by failure – Balance Sample
• Sample training images to a uniform distribution
• Better training accuracy
• But overfitting and worse validation accuracy
[Figure: blue is baseline training accuracy, red is training accuracy after balance sample]
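The balanced sampling idea can be sketched as drawing training images with probability inversely proportional to how often their label occurs. The slide does not say which distribution was made uniform, so this sketch assumes one label per image (for example the scene label); the function name and toy labels are hypothetical.

```python
import random
from collections import Counter

def balanced_sample(labels, num_samples, seed=0):
    """Draw image indices so that every label is roughly equally likely."""
    counts = Counter(labels)
    # Each image gets weight 1 / (frequency of its label).
    weights = [1.0 / counts[label] for label in labels]
    rng = random.Random(seed)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)

# Toy dataset: 90 common-scene images and 10 rare-scene images.
labels = ["bedroom"] * 90 + ["street"] * 10
indices = balanced_sample(labels, num_samples=10000)
rare_fraction = sum(labels[i] == "street" for i in indices) / len(indices)
print(rare_fraction)
```

In this toy case each rare image is drawn about nine times as often as each common one, so the model sees the same few rare images repeatedly, which is a plausible route to the overfitting reported on the slide.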
![Page 24: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/24.jpg)
Learn by no significant improvement
• Hard sample mining
• CRF
• Stochastic depth
• Using predefined class correlation
• ……
![Page 25: SenseCUSceneParsing (Jianping Shi's conflicted copy 2016 ...image-net.org/challenges/talks/2016/SenseCUSceneParsing.pdf · image label. DeeperPretrained Model PretrainedModel Result](https://reader030.fdocuments.us/reader030/viewer/2022040207/5e16946f2b19ea1ad46c79bb/html5/thumbnails/25.jpg)
Thanks & Questions