11 - Segmentation, object detection

Deep learning and machine learning in science (deeplea17em)

Pataki Bálint Ármin
ELTE, Department of Physics of Complex Systems

2021.04.27.

11 - Deep learning and machine learning in science - deeplea17em - Pataki Bálint Ármin

● Classification

○ Your car is broken

○ You have a decayed tooth

○ I have found Waldo

→ location is important

→ carries additional information

We need localization for many tasks!

Is classification always descriptive enough?

https://celtic-publications.com/where-is-waldo/


● Train a CNN for a classification task
● Instead of global max pooling, use global average pooling at the end

○ This throws away less spatial information

● Classification is based on the average-pooled values
○ Each value is the spatial average of one feature map from the previous layer
○ Always non-negative → because of the ReLU activation

● To get the class activation map, take the weighted sum of the feature maps using the classification weights (these can be negative)!
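
A minimal sketch of this weighted sum in plain Python. The function names and the toy 2x2 feature maps are illustrative assumptions, not code from Zhou et al., and the bias term of the classifier is ignored.

```python
def global_average_pool(feature_maps):
    """Spatial average of each feature map (one value per channel)."""
    return [sum(map(sum, fmap)) / (len(fmap) * len(fmap[0])) for fmap in feature_maps]


def class_activation_map(feature_maps, class_weights):
    """CAM[y][x] = sum over channels k of class_weights[k] * feature_maps[k][y][x]."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Toy example (hypothetical values): two non-negative feature maps after ReLU.
feature_maps = [
    [[0.0, 2.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
class_weights = [1.5, -0.5]  # weights of one class in the final layer; may be negative

cam = class_activation_map(feature_maps, class_weights)
pooled = global_average_pool(feature_maps)
class_score = sum(w * p for w, p in zip(class_weights, pooled))
cam_mean = sum(map(sum, cam)) / 4
```

Because both the pooling and the weighted sum are linear, the class score (without bias) equals the spatial mean of the map, so each map value can be read as the contribution of that location to the score.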

Class activation maps

Zhou et al, 2015: Learning Deep Features for Discriminative Localization


● Classification
○ Only the class is predicted, not the location
○ No location label is fed to the model
○ Class activation maps can give locations

○ They are a ‘side effect’

Classification vs classification w/ localization

● Classification with localization
○ Single class
○ + bounding box


● Object detection
○ Predict bounding boxes
○ AND a class for each box

Various tasks

● Semantic segmentation
○ Pixel-level mask for each class
○ No separate objects!

Draw with colors!

● Instance segmentation
○ Pixel-level mask for each object separately
○ AND predict its class


● Classification

○ Is there a tumor?

○ Pro: Easy to label the images! → look at historical diagnostic results

○ Con: Often hard to plan treatment (e.g. radiation therapy) without proper localization

● Object detection

○ Where is the tumor? (If there is one)

○ Count objects & locate them

○ How many people are in the room? Can we let more in?

● Semantic segmentation

○ What portion of the wheat is healthy? From aerial images

● Instance segmentation

○ Pro: more detailed information

○ Con: the same. Time-consuming to label!

○ Estimate the weight of individual animals from a photo taken from far away (e.g. for bulls)

Example for each task + pro/con

Examples

https://cdn.technologyreview.com/f/611380/researchers-have-released-the-largest-self-driving-car-data-set-yet/

https://mc.ai/lane-detection-for-self-driving-vehicles/

You might want to:
- Segment the road
- Detect objects on the road


● COCO - Common Objects in Context
○ Segmentation data
○ Object detection data
○ 330k annotated images
○ 80 object categories
○ by Microsoft

It is standard to pre-train models on COCO

● Pascal VOC
○ Also popular

COCO dataset

What is this?

Common objects in context


Pixel mask segmentation


● A is the mask, B is the prediction, |A| is the number of elements in the mask (positives)

● Dice score: 2|A ∩ B| / (|A| + |B|)

● Jaccard index: |A ∩ B| / |A ∪ B|

● Or any binary classification metric that best fits the task

○ Accuracy is usually not meaningful (the many background pixels are easy hits)
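
A minimal sketch of these metrics for binary masks, using the standard definitions Dice = 2|A ∩ B| / (|A| + |B|) and Jaccard = |A ∩ B| / |A ∪ B|. The function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def overlap_metrics(mask, prediction):
    """Return (dice, jaccard, accuracy) for two binary masks given as 2D lists."""
    a = [bool(v) for row in mask for v in row]
    b = [bool(v) for row in prediction for v in row]
    intersection = sum(x and y for x, y in zip(a, b))
    size_a, size_b = sum(a), sum(b)
    union = size_a + size_b - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (size_a + size_b) if size_a + size_b else 1.0
    jaccard = intersection / union if union else 1.0
    accuracy = sum(x == y for x, y in zip(a, b)) / len(a)
    return dice, jaccard, accuracy


mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
shifted = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
empty = [[0] * 4 for _ in range(4)]

dice_shifted, jaccard_shifted, acc_shifted = overlap_metrics(mask, shifted)
dice_empty, jaccard_empty, acc_empty = overlap_metrics(mask, empty)
```

The all-background prediction misses the object entirely (Dice and Jaccard are both 0) but still reaches an accuracy of 0.875, which illustrates why accuracy is rarely a useful mask metric.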

Metrics for pixel masks

https://en.wikipedia.org/wiki/Jaccard_index


How to generate pixel masks?

Idea: a stack of convolutional layers

Output: as many channels as there are categories → each pixel is binary (0-1)

Output dim = input dim (e.g. 224 x 224 px)



Idea:



We need many layers, but running them at high resolution is computationally expensive



Idea:



How to upsample the low-resolution representation?


● When we have a symmetric network

○ Pooling - unpooling pairs

○ Store position of the maximal values

○ Use them when unpooling

● Transposed convolution (de-/up-convolution)

○ Use a 3x3 conv kernel

○ Learnable weights usually

● Sometimes simple upsampling is used

○ Bilinear, bicubic or any standard method

○ And then normal convolutions
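
The pooling - unpooling pair can be sketched as follows. This is a plain-Python illustration with hypothetical function names, assuming 2x2 non-overlapping windows and an input with even height and width; it is not the implementation of Noh et al.

```python
def max_pool_2x2(image):
    """2x2 max pooling that also stores the position of each maximum."""
    pooled, positions = [], []
    for y in range(0, len(image), 2):
        pooled_row, position_row = [], []
        for x in range(0, len(image[0]), 2):
            window = [(image[y + dy][x + dx], (y + dy, x + dx))
                      for dy in (0, 1) for dx in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            position_row.append(position)
        pooled.append(pooled_row)
        positions.append(position_row)
    return pooled, positions


def max_unpool_2x2(pooled, positions, height, width):
    """Put each pooled value back at its stored position; everything else is 0."""
    out = [[0] * width for _ in range(height)]
    for value_row, position_row in zip(pooled, positions):
        for value, (y, x) in zip(value_row, position_row):
            out[y][x] = value
    return out


image = [
    [1, 2, 0, 0],
    [3, 4, 0, 5],
    [0, 0, 7, 0],
    [6, 0, 0, 0],
]
pooled, positions = max_pool_2x2(image)
restored = max_unpool_2x2(pooled, positions, 4, 4)
```

The unpooled map is sparse: only the locations of the maxima are filled in, so it is normally followed by further (learned) convolutions that densify it.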

How to upsample the representation?

Noh et al, 2015: Learning Deconvolution Network for Semantic Segmentation


Deconvolution / upconvolution / transpose convolution

Figure labels: input, kernel, output

Convolution with a special setup.
The kernel can be learned during the training process.
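
A minimal sketch of a transposed convolution, assuming a single channel, no padding and stride 2 (these choices and the function name are illustrative assumptions). Each input value "stamps" a scaled copy of the kernel onto the output, and overlapping stamps are summed.

```python
def transposed_conv2d(inp, kernel, stride=2):
    """Single-channel transposed convolution without padding."""
    in_h, in_w = len(inp), len(inp[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            for a in range(k_h):
                for b in range(k_w):
                    # Input pixel (i, j) adds a scaled kernel at offset (i*stride, j*stride).
                    out[i * stride + a][j * stride + b] += inp[i][j] * kernel[a][b]
    return out


low_res = [[1.0, 2.0], [3.0, 4.0]]
kernel = [[1.0] * 3 for _ in range(3)]  # fixed here; in a network these weights are learned
high_res = transposed_conv2d(low_res, kernel, stride=2)
```

With a 2x2 input, a 3x3 kernel and stride 2 the output is 5x5; the middle row and column receive contributions from several input pixels because the stamped kernels overlap there.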


● 21 channel output → 21 classes
● Task: pixel-wise classification
● Softmax
● Loss: categorical cross-entropy! (assume each pixel has one class)
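
A minimal sketch of the pixel-wise softmax and categorical cross-entropy, written in plain Python for readability. The function names and the tiny 1x2 "image" with 3 classes are assumptions for illustration; a real network would have 21 channels and use a framework loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the class scores of one pixel."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_cross_entropy(logits, labels):
    """Mean categorical cross-entropy over all pixels.

    logits[y][x] is a list of per-class scores, labels[y][x] is the true class index.
    """
    losses = []
    for logit_row, label_row in zip(logits, labels):
        for pixel_logits, label in zip(logit_row, label_row):
            probs = softmax(pixel_logits)
            losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


# Uninformative prediction: equal scores for all 3 classes at both pixels.
uniform_logits = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
labels = [[0, 2]]
uniform_loss = pixelwise_cross_entropy(uniform_logits, labels)

# Confident and correct prediction at both pixels.
good_logits = [[[10.0, 0.0, 0.0], [0.0, 0.0, 10.0]]]
good_loss = pixelwise_cross_entropy(good_logits, labels)
```

An uninformative prediction over C classes gives a loss of ln(C) per pixel, which is a handy sanity check for the first training steps.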

Semantic segmentation networks

Noh et al, 2015: Learning Deconvolution Network for Semantic Segmentation

Problem: we need to store all spatial information here in a 1x1x4096 array (+ we can store some via the unpooling)


● Skip connections added
● Last layers get information from different resolutions
● More a meta-model than an exact architecture (often simply upsample + conv)

U-NET

Ronneberger et al, 2015: U-Net: Convolutional Networks for Biomedical Image Segmentation


Image segmentation / object detection CNNs usually rely on a backbone model
→ one that was successful on image classification
→ VGG, ResNet, EfficientNet etc.

Backbone models

Noh et al, 2015: Learning Deconvolution Network for Semantic Segmentation

This is a VGG model!

https://neurohive.io/en/popular-networks/vgg16/


U-NET code example (or at the end, depending on the time)


Object detection


● Exactly matching the bounding box is impossible

○ Allow some tolerance, but how to define the tolerance?

● IoU: intersection over union

○ Intersection

○ Union

○ IoU > threshold → hit

From now on, we have binary classification!

- TP, FP, FN (TN is not meaningful)
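
A minimal sketch of the IoU test for two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box layout, the function names and the 0.5 default threshold are assumptions chosen for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def is_hit(predicted_box, label_box, threshold=0.5):
    """A prediction counts as a hit (true positive) if its IoU exceeds the threshold."""
    return box_iou(predicted_box, label_box) > threshold


label = (0, 0, 2, 2)
shifted = (1, 1, 3, 3)      # overlaps the label in a 1x1 square
far_away = (5, 5, 6, 6)     # no overlap at all
iou_shifted = box_iou(label, shifted)
```

Here the shifted box has an intersection of 1 and a union of 4 + 4 - 1 = 7, so its IoU of 1/7 is far below a 0.5 threshold and it would be counted as a false positive, with the label left as a false negative.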

● FROC curve

○ Predictions are probabilistic

○ Sweep probability threshold

○ Plot sensitivity vs false marks per image
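
A minimal sketch of how the points of a FROC curve could be computed once every detection has been matched against the labels (for example with an IoU threshold). The data layout, the function name and the toy numbers are assumptions for illustration, not the evaluation code of the cited paper.

```python
def froc_points(detections, n_lesions, n_images, thresholds):
    """Return (false marks per image, sensitivity) for each probability threshold.

    detections: list of (score, lesion_id) pairs pooled over all images, where
    lesion_id is None for a detection that hit no labeled lesion (a false mark).
    """
    points = []
    for threshold in thresholds:
        kept = [d for d in detections if d[0] >= threshold]
        found_lesions = {lesion for _, lesion in kept if lesion is not None}
        false_marks = sum(1 for _, lesion in kept if lesion is None)
        points.append((false_marks / n_images, len(found_lesions) / n_lesions))
    return points


# Hypothetical detections on 2 images containing 2 lesions in total.
detections = [(0.9, "L1"), (0.8, None), (0.6, "L2"), (0.4, None), (0.3, "L1")]
curve = froc_points(detections, n_lesions=2, n_images=2, thresholds=[0.85, 0.5, 0.2])
```

Lowering the threshold moves along the curve towards higher sensitivity at the cost of more false marks per image; a lesion hit by several detections is only counted once.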

Common metrics to measure performance

Dezso Ribli, 2018: Detecting and classifying lesions in mammograms with Deep Learning


● Use a model with two heads!
○ A fully connected layer to predict classes

○ Lc is the usual categorical cross-entropy loss for classification
○ Another fully connected layer to predict bounding box coordinates

○ Predict: (x, y, h, w) or (x_min, x_max, y_min, y_max)
○ Lr is the MSE of the coordinates compared to the label box

● Final loss: L = Lc + Lr
○ Train the weights with backpropagation

● Problem: for object detection we have an unknown number of outputs (objects)!
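
A minimal sketch of the combined loss L = Lc + Lr for a single image with a single object. The function names, the (x, y, h, w) box layout in relative coordinates and the toy numbers are assumptions for illustration; in practice both heads would be trained with a deep learning framework.

```python
import math


def cross_entropy(class_probs, true_class):
    """Lc: categorical cross-entropy for one sample, given predicted class probabilities."""
    return -math.log(class_probs[true_class])


def box_mse(predicted_box, label_box):
    """Lr: mean squared error between predicted and label box coordinates."""
    return sum((p - t) ** 2 for p, t in zip(predicted_box, label_box)) / len(label_box)


def localization_loss(class_probs, true_class, predicted_box, label_box):
    """Final loss L = Lc + Lr (both heads are trained jointly)."""
    return cross_entropy(class_probs, true_class) + box_mse(predicted_box, label_box)


class_probs = [0.5, 0.25, 0.25]          # output of the classification head (after softmax)
predicted_box = (0.5, 0.5, 0.2, 0.2)     # output of the regression head: (x, y, h, w)
label_box = (0.5, 0.5, 0.4, 0.2)
loss = localization_loss(class_probs, 0, predicted_box, label_box)
```

The two terms live on different scales, so a weighting factor between them is a common extension; the unweighted sum is kept here because it is what the slide states.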

Idea: object localization as a classification + regression!

https://neurohive.io/en/popular-networks/vgg16/

Classification: predict a class

Regression: predict box coordinates

Combined loss


Idea: object detection as classification with a sliding window

● Crop regions of the image with a sliding window

○ Use many smaller, larger, wider and narrower windows
○ Classify the cropped image

○ Is it a dog/airplane/car etc. or background (a new class)?
○ If lucky, we get it ‘just right’

This is a simple way to generate region proposals!

Region proposal: a possible localized object
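
A minimal sketch of generating sliding-window region proposals. The function name, the window sizes and the stride are assumptions for illustration; each returned box would then be cropped and passed to a classifier that has an extra background class.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Return candidate boxes (x_min, y_min, x_max, y_max) for every window size."""
    boxes = []
    for win_w, win_h in window_sizes:
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                boxes.append((x, y, x + win_w, y + win_h))
    return boxes


# Toy 8x8 image with one square and one wide window shape.
proposals = sliding_windows(image_w=8, image_h=8, window_sizes=[(4, 4), (8, 4)], stride=2)
```

Even this toy setup yields 12 crops to classify; with realistic image sizes, several window shapes and a small stride the number of classifier calls grows very quickly, which motivates smarter region proposal methods.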


● Instead of a sliding window: a classical approach to get ~2k region proposals
● Cut out the regions & resize to 227 x 227 px
● Run a CNN & save the last feature representation (a 4096-dim vector)
● Train a linear SVM on the extracted representations

○ Performance enhancement (it was replaced in later research)

● Regression as an offset to the proposed region (how much to change the proposed box)

Use an ImageNet pre-trained CNN (e.g. ResNet), then fine-tune it for detection.
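
A minimal sketch of "regression as offset", following the parametrization described in the R-CNN paper: shifts of the box center are measured in units of the proposal size, and size changes are log ratios. The function names and the (center_x, center_y, width, height) layout are assumptions for illustration.

```python
import math


def encode_offsets(proposal, target):
    """Regression targets that turn a proposal box into the label box."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_offsets(proposal, offsets):
    """Adjust a proposal box with predicted offsets (inverse of encode_offsets)."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))


proposal = (50.0, 50.0, 20.0, 40.0)   # (center_x, center_y, width, height)
label = (55.0, 48.0, 30.0, 40.0)
offsets = encode_offsets(proposal, label)
adjusted = apply_offsets(proposal, offsets)
```

Predicting relative offsets instead of absolute coordinates keeps the targets small and independent of the size of the proposed region; a perfect proposal corresponds to offsets of (0, 0, 0, 0).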

R-CNN

Girshick et al, 2014: Rich feature hierarchies for accurate object detection and semantic segmentation


Region proposal method: selective search

Uijlings, 2012: Selective Search for Object Recognition

● Classical algorithm, no NN involved
● Based on hierarchical grouping

○ Start with all the pixels
○ Merge the most similar ones (close to each other and similar in color)
○ Continue merging (like hierarchical clustering)
○ Proposed regions: bounding boxes of the pixel clusters

Over time there are fewer and fewer proposed regions (we can stop at any time, and thus set N_regions)


● Problem: slow & ad-hoc
○ Almost 1 minute / image to predict (2000 CNN calls)
○ Training takes 84h

● Accurate
○ By the standards of 2014 models

R-CNN

Girshick: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0

Girshick et al, 2014: Rich feature hierarchies for accurate object detection and semantic segmentation


● IoU (intersection over union) > 0.5 → positive, else negative (no regression for negatives)
● Mini-batch size of 128

○ 32 positive samples
○ 96 background samples
○ Sampling is biased towards positives

● Non-max suppression
○ Many overlapping predictions
○ Same predicted category
○ Assume they are the same object if IoU > threshold
○ Keep only the one w/ the highest predicted probability
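
A minimal sketch of greedy non-max suppression applied per category. The (box, score, label) tuple layout, the function names and the 0.5 threshold are assumptions for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring detection among overlapping ones of the same category.

    detections: list of (box, score, label) tuples.
    """
    kept = []
    # Visit detections from most to least confident.
    for box, score, label in sorted(detections, key=lambda d: d[1], reverse=True):
        overlaps_kept = any(
            kept_label == label and box_iou(box, kept_box) > iou_threshold
            for kept_box, _, kept_label in kept
        )
        if not overlaps_kept:
            kept.append((box, score, label))
    return kept


detections = [
    ((1, 1, 11, 11), 0.8, "dog"),     # overlaps the 0.9 dog box (IoU = 81/119)
    ((0, 0, 10, 10), 0.9, "dog"),
    ((20, 20, 30, 30), 0.7, "dog"),   # a different dog elsewhere in the image
    ((0, 0, 10, 10), 0.6, "cat"),     # same place, but a different category
]
kept = non_max_suppression(detections)
```

Only the 0.8 "dog" box is removed: it overlaps a more confident box of the same category, while the distant dog and the cat prediction survive.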

R-CNN: a few details

https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c


For R-CNN the CNN is run on each region separately → the CNN runs 2000 times / image
→ run the CNN once on the whole image
→ proposed regions are cut out from the CNN representation!
→ shape them to a fixed size (like global max pooling, but the output is not 1x1, but larger)
→ classify the cutouts with fully connected layers
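
A minimal sketch of cutting a region out of a feature map and max-pooling it to a fixed size. It assumes a single channel, region coordinates already given in feature-map cells (x_max and y_max exclusive) and a simple floor/ceil binning; the function names are hypothetical and this is not the exact RoI pooling code of Fast R-CNN.

```python
def ceil_div(numerator, denominator):
    """Integer division rounded up."""
    return -(-numerator // denominator)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the region roi = (x_min, y_min, x_max, y_max) into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        y_start = y0 + (i * roi_h) // out_h
        y_end = y0 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            x_start = x0 + (j * roi_w) // out_w
            x_end = x0 + ceil_div((j + 1) * roi_w, out_w)
            row.append(max(feature_map[y][x]
                           for y in range(y_start, y_end)
                           for x in range(x_start, x_end)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map computed once for the whole image.
feature_map = [[row * 6 + col for col in range(6)] for row in range(4)]
tall_region = roi_max_pool(feature_map, (1, 0, 5, 4), out_h=2, out_w=2)
wide_region = roi_max_pool(feature_map, (0, 0, 6, 2), out_h=2, out_w=2)
```

Both regions have different shapes but produce the same 2x2 output, so the fully connected layers that follow always see a fixed-size input.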

Fast-R-CNN

Girshick: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0


Fast R-CNN

Figure labels: region proposals via selective search; cut the corresponding region from the feature maps; “resize”


- About 10x faster to train
- 146x faster to generate predictions than R-CNN
Accuracy is similar. Most of the runtime is spent on the non-NN-based region proposal.

Key: do not run the CNN 2000 times for an image!
→ R-CNN runs the same conv kernels over 2000 cutouts
→ We can get almost the same result if we run it once on the whole image
(remember, conv kernels are position independent)

Fast-R-CNN

Girshick 2015: Fast R-CNN


● Get rid of the selective search (region proposal)
● Perform region proposal on the feature maps!

○ Try different boxes (k boxes at a given position) at each position with a NN
○ Proposal network outputs

○ 2k scores - binary prediction (probability of object or not)
○ 4k coordinates (bounding box offsets)
○ Usually

○ For the top N proposals, run the classifier
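
A minimal sketch of placing k reference boxes (anchors) at every position of the feature map. The function name, the stride of 16 pixels and the particular scales and aspect ratios are illustrative assumptions; the proposal network then predicts 2k scores and 4k box offsets per position relative to these boxes.

```python
def generate_anchors(feat_h, feat_w, stride, scales, aspect_ratios):
    """Return anchor boxes (x_min, y_min, x_max, y_max) in image coordinates.

    One anchor is created per (scale, aspect ratio) pair at every feature-map position,
    so k = len(scales) * len(aspect_ratios).
    """
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            # Center of this feature-map cell in image pixels.
            cx, cy = (fx + 0.5) * stride, (fy + 0.5) * stride
            for scale in scales:
                for ratio in aspect_ratios:      # ratio = width / height
                    w = scale * ratio ** 0.5
                    h = scale / ratio ** 0.5     # keeps the area at scale ** 2
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


scales = (64, 128, 256)          # assumed anchor sizes in pixels
aspect_ratios = (0.5, 1.0, 2.0)  # assumed width / height ratios
k = len(scales) * len(aspect_ratios)
anchors = generate_anchors(feat_h=2, feat_w=3, stride=16,
                           scales=scales, aspect_ratios=aspect_ratios)
scores_per_position, coords_per_position = 2 * k, 4 * k
```

For this toy 2x3 feature map there are 6 positions with k = 9 anchors each, so the proposal network would output 18 scores and 36 offsets per position.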

Faster-R-CNN

Ren et al, 2016: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks


https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/

Faster-R-CNN


● Trend: let the NN learn what is important from data → get rid of hand-crafted, heuristic algorithms
● In the paper: 5 fps at test time

○ Since then (2016) GPU performance has improved significantly → real-time detection
○ With smaller models the fps also increases

● Faster & more accurate than previous models

Faster R-CNN


● Combine
○ Faster R-CNN
○ Pixel mask prediction
○ Combine their losses!

● Output
○ Bounding box
○ Pixel-level annotation of each object separately

Mask-R-CNN

He et al, 2018: Mask R-CNN


2014, R-CNN: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
All authors from UC Berkeley

2015, Fast R-CNN: Ross Girshick
Microsoft Research

2016, Faster R-CNN: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Ren: University of Science and Technology of China, while an intern at Microsoft Research
He & Sun: Microsoft Research
Girshick: Facebook AI Research (FAIR)

2018, Mask R-CNN: Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
All authors at Facebook AI Research (FAIR)

Facebook’s object detection/segmentation implementation: detectron2

*-R-CNN (Regions with CNN features) authors


● Many hyperparameters to tune

○ How many anchor boxes

○ How many proposals to consider

○ Thresholds for non-max suppression

○ Threshold for considering a detection as positive

● Detectron2

○ By Facebook (so built on PyTorch, not TensorFlow)

○ Most of the *-R-CNN authors work at Facebook

○ Can be parametrized via config files

○ A bit complicated (no code example)

○ But works well when set up correctly

● Generating labels

○ LabelImg https://github.com/tzutalin/labelImg

○ For bounding boxes

○ Easy to install & does not use the cloud

But there are many other open implementations, and new ones are coming all the time.

Detectron2 + LabelImg

https://github.com/tzutalin/labelImg


● Breast cancer
○ Most common cancer among women
○ 5,000-6,000 cases per year in Hungary
○ 2,000 deaths per year in Hungary

● Screening: mammography
○ X-ray
○ Painless
○ Cheap to make the scan

● Scan evaluation
○ Mammographist

○ 6 years of university
○ 4-5 years of radiology residency
○ 1.5 years of mammography practice
○ Licence (to be renewed regularly)

○ Usually one needs to wait a long time to get the scan evaluated
○ By law: 2 independent readers
○ Time-consuming inspection
○ Humans did not evolve to inspect such images
○ Inspection is often outsourced to other countries

Selected results - mammography

https://radiopaedia.org/articles/mammography?lang=us


● Ribli Dezső, Department of Physics of Complex Systems - ELTE, 2018
● Faster R-CNN
● 3,000-4,000 annotated images
● Similar performance to human readers (who spent >10 years learning)

Mammography

Dezso Ribli, 2018: Detecting and classifying lesions in mammograms with Deep Learning


● Google Health, 2020
● > 130,000 women for training
● Training on English dataset

○ Evaluation on English dataset
○ Evaluation on US dataset

(in the US only 1 reader is required)

Mammography

McKinney et al, 2020: International evaluation of an AI system for breast cancer screening


Mammography

Basic research rarely produces useful products that quickly.
Galvani playing with frogs’ legs (late 1700s) → Maxwell equations (1850s) → Thomas Edison (late 1800s)
Of course there were similar steps here; not all of computer vision started in 2012.


● CNNs consistently outperform classical approaches on images
○ For classification
○ For object detection too

● With detectron2 it is fairly straightforward to train object detection models
● Human-level accuracy is often possible

○ When enough diverse training data is available

● Classification
● Object detection

○ Bounding box (propose boxes → classify them and adjust their coordinates)

● Segmentation
○ Pixel-level annotation
○ We need to upscale convolutional feature maps

Summary

Source: http://cs231n.stanford.edu/


● Predicting landmark points (shoulder, hip, knee etc.) is possible
● Pose estimation

Keypoint detection / pose estimation

https://github.com/facebookresearch/detectron2
