11 - Segmentation, object detection
Deep learning and machine learning in science (deeplea17em)
11 - Segmentation, object detection
Pataki Bálint Ármin
ELTE, Physics of Complex Systems Department
2021.04.27.
11 - Deep learning and machine learning in science - deeplea17em - Pataki Bálint Ármin
● Classification
○ Your car is broken
○ You have a decayed tooth
○ I have found Waldo
→ location is important
→ carries additional information
We need localization for many tasks!
Is classification always descriptive enough?
https://celtic-publications.com/where-is-waldo/
● Train a CNN for a classification task
● Instead of global max pooling, use global average pooling at the end
○ This does not throw away as much spatial information
● Classification: based on the average-pooled values
○ Each value is the spatial average of one feature map from the previous layer
○ Always non-negative, because of the ReLU activation
● Take the weighted sum of the feature maps, using the classification weights of a class (the weights can be negative)!
Class activation maps
Zhou et al, 2015: Learning Deep Features for Discriminative Localization
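The weighted sum of feature maps described on this slide can be sketched in a few lines of NumPy. This is only an illustration: the array shapes, the random feature maps and the weight values are made up, and `class_activation_map` is a hypothetical helper name, not code from the paper.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the last conv feature maps for one class.

    feature_maps: (H, W, C) output of the last conv layer (after ReLU).
    class_weights: (C,) weights connecting the pooled features to the class.
    """
    # Contract the channel axis: one (H, W) map remains for the class.
    return np.tensordot(feature_maps, class_weights, axes=([2], [0]))

# Hypothetical example: 7x7 feature maps with 4 channels.
rng = np.random.default_rng(0)
feature_maps = rng.random((7, 7, 4))
class_weights = np.array([0.5, -1.0, 2.0, 0.0])  # weights may be negative

cam = class_activation_map(feature_maps, class_weights)

# With global average pooling, the class score (before bias and softmax)
# is exactly the spatial mean of the class activation map.
pooled = feature_maps.mean(axis=(0, 1))
score = pooled @ class_weights
```

The last two lines show why the map is a meaningful "side effect": the class score is the average of the map, so regions with high values are the ones pushing the prediction towards that class.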
● Classification
○ Only the class is predicted, not the location
○ No location label is fed to the model
○ Class activation maps can give locations
○ They are a ‘side effect’
Classification vs classification w/ localization
● Classification with localization
○ Single class
○ + bounding box
Expected prediction: Expected prediction:
● Object detection
○ Predict bounding boxes
○ AND a class for each box
Various tasks
● Semantic segmentation
○ Pixel-level mask for each class
○ No objects!
Draw with colors!
● Instance segmentation
○ AND pixel-level mask for each object separately
○ Predict its class
● Classification
○ Is there a tumor?
○ Pro: Easy to label the images! → look at historical diagnostic results
○ Con: Often hard to plan treatment (e.g. radiation therapy) without proper localization
● Object detection
○ Where is the tumor? (If there is one)
○ Count objects & locate them
○ How many people in the room? Can we let more in?
● Semantic segmentation
○ What portion of the wheat is healthy? From aerial images
● Instance segmentation
○ Pro: more detailed information
○ Con: the same. Time-consuming to label!
○ Estimate the weight of individual animals from a photo taken from far away (e.g. for bulls)
Example for each task + pro/con
Examples
https://cdn.technologyreview.com/f/611380/researchers-have-released-the-largest-self-driving-car-data-set-yet/
https://mc.ai/lane-detection-for-self-driving-vehicles/
You might want to:
- Segment the road
- Detect objects on the road
● COCO - Common Objects in Context
○ Segmentation data
○ Object detection data
○ 330k annotated images
○ 80 object categories
○ by Microsoft
It is standard to pre-train models on COCO
● Pascal VOC
○ Also popular
COCO dataset
What is this?
Common objects in context
Pixel mask segmentation
● A is the mask, B is the prediction, |A| is the number of elements in the mask (positives)
● Dice score: Dice(A, B) = 2 |A ∩ B| / (|A| + |B|)
● Jaccard index: J(A, B) = |A ∩ B| / |A ∪ B|
● Or any binary classification metric that best fits the task
○ Accuracy is usually not meaningful (the many background pixels are easy hits)
Metrics for pixel masks
https://en.wikipedia.org/wiki/Jaccard_index
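A minimal NumPy sketch of the two mask metrics, using the standard definitions Dice = 2|A ∩ B| / (|A| + |B|) and Jaccard = |A ∩ B| / |A ∪ B|. The function names and the tiny example masks are made up for illustration; returning 1.0 when both masks are empty is a convention assumed here, not something stated on the slide.

```python
import numpy as np

def dice_score(mask, pred):
    """Dice score of two binary masks of the same shape."""
    mask = np.asarray(mask, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    intersection = np.logical_and(mask, pred).sum()
    total = mask.sum() + pred.sum()
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

def jaccard_index(mask, pred):
    """Jaccard index (intersection over union) of two binary masks."""
    mask = np.asarray(mask, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    intersection = np.logical_and(mask, pred).sum()
    union = np.logical_or(mask, pred).sum()
    return 1.0 if union == 0 else intersection / union

# Hypothetical 2x3 example: one overlapping positive pixel.
mask = np.array([[1, 1, 0],
                 [0, 0, 0]])
pred = np.array([[1, 0, 0],
                 [1, 0, 0]])

dice = dice_score(mask, pred)        # 2 * 1 / (2 + 2) = 0.5
jaccard = jaccard_index(mask, pred)  # 1 / 3
accuracy = (mask == pred).mean()     # 4 of 6 pixels agree
```

Even in this tiny example the pixel accuracy (about 0.67) looks better than the Dice score (0.5) or the Jaccard index (about 0.33), because the background pixels are counted as correct.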
How to generate pixel masks?
Idea:
bunch of convolutional layers
Output: as many channels as there are categories → in each channel every pixel is binary (0 or 1)
Output dim = input dim (e.g. 224 x 224 px)
We need many layers, but it is computationally expensive to run all of them at high resolution
How to upsample the low-resolution representation?
● When we have a symmetric network
○ Pooling - unpooling pairs
○ Store the positions of the maximal values
○ Use them when unpooling
● Transposed convolution (deconvolution / up-convolution)
○ Use a 3x3 conv kernel
○ Usually learnable weights
● Sometimes simple upsampling is used
○ Bilinear, bicubic or any standard interpolation
○ Followed by normal convolutions
How to upsample the representation?
Noh et al, 2015: Learning Deconvolution Network for Semantic Segmentation
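The pooling - unpooling pair can be illustrated with a small NumPy sketch for a single-channel map and 2x2 windows. The helper names are hypothetical and the code assumes even height and width; it only shows the bookkeeping of the stored positions, not the exact layers of the cited network.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling that also returns where each maximum was found."""
    h, w = x.shape
    # Rearrange into (h/2, w/2, 4): the 4 values of every 2x2 block.
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    idx = blocks.argmax(axis=2)   # position of the maximum inside the block
    pooled = blocks.max(axis=2)
    return pooled, idx

def max_unpool_2x2(pooled, idx):
    """Put every pooled value back to its stored position, zeros elsewhere."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, idx] = pooled
    # Undo the block rearrangement of max_pool_2x2.
    return (blocks.reshape(ph, pw, 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * 2, pw * 2))

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [6, 0, 0, 0]])

pooled, idx = max_pool_2x2(x)           # pooled is [[4, 5], [6, 7]]
restored = max_unpool_2x2(pooled, idx)  # maxima return to their original pixels
```

The unpooled map is sparse: only the stored maximum positions are filled, which is why unpooling is normally followed by convolutions that densify it.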
Deconvolution / upconvolution / transpose convolution
input, kernel, output
Convolution with a special setup. The kernel can be learned during the training process.
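One way to picture the transposed convolution is that every input value "stamps" a scaled copy of the kernel onto the output, and overlapping stamps are summed. The sketch below assumes a single channel, no padding and a stride of 2; `transposed_conv2d` is a hypothetical helper, and in a real network the kernel values would be learned rather than fixed to ones.

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Single-channel transposed convolution without padding."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            # Each input pixel adds a scaled copy of the kernel to the output.
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
kernel = np.ones((3, 3))   # fixed here only for illustration

upsampled = transposed_conv2d(x, kernel, stride=2)  # 2x2 input -> 5x5 output
```

With a 3x3 kernel and stride 2 neighbouring stamps overlap by one pixel, so the centre of the 5x5 output receives contributions from all four input values.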
● 21 channel output → 21 classes
● Task: pixel-wise classification
● Softmax
● Loss: categorical cross-entropy! (assume each pixel has one class)
Semantic segmentation networks
Noh et al, 2015: Learning Deconvolution Network for Semantic Segmentation
Problem: we need to store all spatial information here in a 1x1x4096 array (+ we can store some via the unpooling)
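The pixel-wise softmax and categorical cross-entropy can be sketched as follows. The 4x4 image size and the all-zero logits are made-up example values; only the 21 output channels come from the slide, and `pixelwise_cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Mean categorical cross-entropy over all pixels.

    logits: (H, W, C) raw network outputs, one channel per class.
    labels: (H, W) integer class index of every pixel.
    """
    # Numerically stable log-softmax over the channel axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    rows, cols = np.indices(labels.shape)
    # Pick the log-probability of the true class at every pixel.
    return -log_probs[rows, cols, labels].mean()

num_classes = 21
logits = np.zeros((4, 4, num_classes))   # an untrained, uniform prediction
labels = np.zeros((4, 4), dtype=int)

loss = pixelwise_cross_entropy(logits, labels)  # equals log(21) for uniform output
predicted_mask = logits.argmax(axis=-1)         # (H, W) map of class indices
```

The loss is the same one used for image classification, just averaged over every pixel instead of computed once per image.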
● Skip connections added
● Last layers get information from different resolutions
● More a meta-model than an exact architecture (often simply upsample + conv)
U-NET
Ronneberger et al, 2015: U-Net: Convolutional Networks for Biomedical Image Segmentation
Image segmentation / object detection CNNs often rely on a backbone model
→ one that was successful on image classification
→ VGG, ResNet, EfficientNet etc.
Backbone models
Noh et al, 2015: Learning Deconvolution Network for Semantic Segmentation
This is a VGG model!
https://neurohive.io/en/popular-networks/vgg16/
U-NET code example (or at the end, depending on the time)
Object detection
● Exactly matching the bounding box is impossible
○ Allow some tolerance, but how to define the tolerance?
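● IoU: intersection over union
○ Intersection: the area where the predicted box and the label box overlap
○ Union: the total area covered by the two boxes
○ IoU = intersection / union; IoU > threshold → hit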
From now on, we have binary classification!
- TP, FP, FN (TN is not meaningful)
● FROC curve
○ Predictions are probabilistic
○ Sweep probability threshold
○ Plot sensitivity vs. false marks per image
Common metrics to measure performance
Dezso Ribli, 2018: Detecting and classifying lesions in mammograms with Deep Learning
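The IoU criterion and the resulting TP / FP / FN counts can be sketched in plain Python. The box format (x_min, y_min, x_max, y_max), the greedy matching of predictions to labels in order of decreasing score, and the function names are assumptions made for this illustration; evaluation protocols differ in these details.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def count_hits(predictions, truths, threshold=0.5):
    """Greedy matching: returns (TP, FP, FN).

    predictions: list of (score, box); truths: list of label boxes.
    Each label box can be hit by at most one prediction.
    """
    matched = set()
    tp = fp = 0
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            iou = box_iou(box, truth)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou > threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    fn = len(truths) - len(matched)
    return tp, fp, fn

truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(0.9, (1, 1, 10, 10)),     # IoU 0.81 with the first label: hit
               (0.8, (50, 50, 60, 60))]   # overlaps nothing: false positive
tp, fp, fn = count_hits(predictions, truths)
```

Sweeping a score threshold over the predictions and recomputing these counts at each value gives the points of a FROC curve (sensitivity vs. false marks per image).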
● Use a model with two heads!
○ A fully connected layer to predict classes
○ Lc is the usual categorical cross-entropy loss for classification
○ Another fully connected layer to predict bounding box coordinates
○ Predict: (x, y, h, w) or (x_min, x_max, y_min, y_max)
○ Lr is the MSE of the coordinates compared to the label box
● Final loss: L = Lc + Lr
○ Train the weights with backpropagation
● Problem: for object detection we have undefined number of outputs (objects)!
Idea: object localization as a classification + regression!
https://neurohive.io/en/popular-networks/vgg16/
Classification: predict a class
Regression: predict box coordinates
Combined loss
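The combined loss L = Lc + Lr for a single image can be written out as below. This is a sketch of the loss computation only (no network, no backpropagation); the logits, the box values and the helper name `localization_loss` are made up, and the boxes are assumed to be normalized (x, y, h, w) values.

```python
import numpy as np

def localization_loss(class_logits, true_class, pred_box, true_box):
    """Returns (L, Lc, Lr) for one image with a single object."""
    # Lc: categorical cross-entropy from the classification head.
    shifted = class_logits - class_logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    l_c = -log_probs[true_class]
    # Lr: mean squared error of the box coordinates from the regression head.
    diff = np.asarray(pred_box, dtype=float) - np.asarray(true_box, dtype=float)
    l_r = np.mean(diff ** 2)
    return l_c + l_r, l_c, l_r

class_logits = np.zeros(3)              # hypothetical 3-class problem
pred_box = [0.5, 0.5, 0.2, 0.2]         # (x, y, h, w), made-up values
true_box = [0.5, 0.5, 0.4, 0.2]

total, l_c, l_r = localization_loss(class_logits, 0, pred_box, true_box)
```

In practice the two terms are often weighted (L = Lc + λ Lr) because they live on different scales; the unweighted sum above follows the slide.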
Idea: object detection as classification with a sliding window
● Crop regions of the image with a sliding window
○ Use many smaller, larger, wider and narrower windows
○ Classify the cropped image
○ Is it a dog/airplane/car etc. or background (a new class)?
○ If lucky, we get it ‘just right’
This is a simple way to generate region proposals!
Region proposal: a possible localized object
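A sliding-window proposal generator is short enough to sketch in plain Python. The image size, window sizes and stride below are made-up example values, and `sliding_windows` is a hypothetical helper; each returned box would then be cropped and passed to the classifier.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield candidate boxes (x_min, y_min, x_max, y_max) for every window size."""
    for win_w, win_h in window_sizes:
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                yield (x, y, x + win_w, y + win_h)

# Hypothetical 8x8 image, one square and one wide window, stride 2.
proposals = list(sliding_windows(8, 8, window_sizes=[(4, 4), (8, 4)], stride=2))
# 3 x 3 positions for the 4x4 window + 1 x 3 positions for the 8x4 window
```

The number of crops grows quickly with the number of window shapes and with smaller strides, which is the main reason smarter region proposal methods are used instead.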
● Instead of a sliding window: a classical approach to get ~2k region proposals
● Cut out the regions & resize them to 227 x 227 px
● Run a CNN & save the last feature representation (a 4096 dim vector)
● Train a linear SVM on the extracted representations
○ Performance enhancement (it was replaced in later research)
● Regression as an offset to the proposed region (how much to change the proposed box)
Use an ImageNet pre-trained CNN (e.g. ResNet), then fine-tune it for detection.
R-CNN
Girshick et al, 2014: Rich feature hierarchies for accurate object detection and semantic segmentation
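The "regression as offset" idea can be sketched with the parameterization commonly used in the R-CNN family: the centre shift is measured relative to the proposal size and the size change is predicted on a log scale. Boxes are assumed to be (center_x, center_y, width, height); the function names and the example numbers are illustrative only.

```python
import math

def encode_offsets(proposal, truth):
    """Regression targets that move `proposal` onto `truth`."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = truth
    return ((gx - px) / pw,        # shift in x, in units of proposal width
            (gy - py) / ph,        # shift in y, in units of proposal height
            math.log(gw / pw),     # log scale change of the width
            math.log(gh / ph))     # log scale change of the height

def apply_offsets(proposal, offsets):
    """Inverse of encode_offsets: adjust a proposal with predicted offsets."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = offsets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

proposal = (50.0, 50.0, 20.0, 40.0)   # made-up proposed region
truth = (55.0, 50.0, 40.0, 40.0)      # made-up label box

targets = encode_offsets(proposal, truth)    # (0.25, 0.0, log 2, 0.0)
recovered = apply_offsets(proposal, targets)
```

Predicting relative offsets instead of absolute coordinates keeps the targets small and comparable across proposals of very different sizes.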
Region proposal method: selective search
Uijlings, 2012: Selective Search for Object Recognition
● Classical algorithm, no NN involved
● Based on hierarchical grouping
○ Start with all the pixels
○ Merge the most similar ones (close to each other and similar in color)
○ Continue merging (like hierarchical clustering)
○ Proposed regions: bounding boxes of the pixel clusters
As merging proceeds there are fewer and fewer proposed regions (we can stop at any time, and thus set N_regions)
● Problem: slow & ad-hoc
○ Almost 1 minute / image to predict (2000 CNN calls)
○ Training takes 84 h
● Accurate
○ By the standards of 2014 models
R-CNN
Girshick: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0
Girshick et al, 2014: Rich feature hierarchies for accurate object detection and semantic segmentation
● IoU (intersection over union) > 0.5 → positive, else negative (no regression for negatives)
● Mini-batch size of 128
○ 32 positive samples
○ 96 background samples
○ Sampling is biased towards positives
● Non-max suppression
○ Many overlapping predictions
○ Same predicted category
○ Assume they are the same object if IoU > threshold
○ Keep only the one w/ the highest predicted probability
R-CNN a few details
https://towardsdatascience.com/non-maximum-suppression-nms-93ce178e177c
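Greedy non-max suppression for the detections of one predicted category can be sketched in plain Python. The (x_min, y_min, x_max, y_max) box format, the 0.5 threshold and the example detections are assumptions made for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (score, box), all of the same predicted category."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest remaining probability
        kept.append(best)
        # Drop everything that overlaps the kept box too much.
        remaining = [d for d in remaining
                     if box_iou(best[1], d[1]) <= iou_threshold]
    return kept

detections = [(0.9, (0, 0, 10, 10)),
              (0.8, (1, 1, 11, 11)),     # IoU 81/119 with the first box: dropped
              (0.7, (50, 50, 60, 60))]   # far away: kept
kept = non_max_suppression(detections)
```

The procedure is run separately for each category, so two overlapping boxes with different predicted classes do not suppress each other.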
In R-CNN the CNN is run on each region separately → the CNN runs 2000 times / image
→ instead, run the CNN once on the whole image
→ proposed regions are cut out from the CNN representation!
→ shape them to a fixed size (like global max pooling, but the output is not 1x1, but larger)
→ classify the cutouts with fully connected layers
Fast-R-CNN
Girshick: https://dl.dropboxusercontent.com/s/vlyrkgd8nz8gy5l/fast-rcnn.pdf?dl=0
Fast R-CNN
Via selective search
Cut the corresponding region from the feature maps
“Resize”
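The "cut out and resize" step can be sketched as a max pooling over a grid laid on the cut-out region, so that every region yields the same output size. This is a simplified illustration: the region is assumed to be given in feature-map cells with exclusive upper bounds and to be non-empty, the 2x2 output size is made up, and `roi_max_pool` is a hypothetical helper rather than the exact layer of the paper.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Pool one region of a (H, W, C) feature map to a fixed output size.

    roi: (x_min, y_min, x_max, y_max) in feature-map cells, max exclusive.
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid of (roughly equal) cells.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.zeros((out_h, out_w, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Guard: every grid cell covers at least one feature-map cell.
            cell = region[row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                          col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))
    return out

feature_map = np.arange(36).reshape(6, 6, 1)   # made-up 6x6 map, 1 channel

square = roi_max_pool(feature_map, (1, 1, 5, 5))  # 4x4 region -> 2x2
wide = roi_max_pool(feature_map, (0, 0, 6, 3))    # 3x6 region -> 2x2
```

Both regions end up with the same 2x2 shape, so they can be fed to the same fully connected layers even though they had different sizes on the feature map.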
- approx. 10x faster to train
- 146x faster to generate predictions than R-CNN
Accuracy is similar. Most of the runtime is spent on the non-NN-based region proposal.
Key: do not run the CNN 2000 times for an image!
→ R-CNN runs the same conv kernels over 2000 cutouts
→ We can get almost the same result if we run them once on the whole image (remember, conv kernels are position independent)
Fast-R-CNN
Girshick 2015: Fast R-CNN
● Get rid of the selective search (region proposal)
● Perform region proposal on the feature maps!
○ Try different boxes (k boxes at a given position) at each position with a NN
○ Proposal network outputs
○ 2k scores - binary prediction (probability of object or not)
○ 4k coordinates (bounding box offsets)
○ Usually
○ For the top N proposals run the classifier
Faster-R-CNN
Ren et al, 2016: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
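The "k boxes at a given position" can be generated from a few scales and aspect ratios. The sketch below is an illustration only: the three scales and three ratios are example values chosen here (giving k = 9), and `anchors_at` is a hypothetical helper name.

```python
def anchors_at(cx, cy, scales, aspect_ratios):
    """Boxes (x_min, y_min, x_max, y_max) centred on one feature-map position.

    scale: side length of the square anchor; ratio: width / height.
    Every anchor of a given scale keeps the same area, scale ** 2.
    """
    boxes = []
    for scale in scales:
        for ratio in aspect_ratios:
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# Example values, assumed for illustration.
anchors = anchors_at(100, 100, scales=(64, 128, 256), aspect_ratios=(0.5, 1.0, 2.0))

k = len(anchors)
n_scores = 2 * k    # object / not object for every anchor
n_coords = 4 * k    # four box offsets for every anchor
```

The proposal network predicts these 2k scores and 4k offsets at every position of the feature map, and only the highest scoring boxes are passed on to the classifier.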
https://tryolabs.com/blog/2018/01/18/faster-r-cnn-down-the-rabbit-hole-of-modern-object-detection/
Faster-R-CNN
● Trend: let the NN learn what is important from data → get rid of hand-wired, heuristic algorithms
● In the paper: 5 fps at test time
○ Since then (2016) GPU performance has improved significantly → real-time detection
○ Smaller models also increase the fps
● Faster & more accurate than previous models
Faster R-CNN
● Combine
○ Faster R-CNN
○ Pixel mask prediction
○ Combine its losses!
● Output
○ Bounding box
○ Pixel-level annotation of each object separately
Mask-R-CNN
He et al, 2018: Mask R-CNN
2014, R-CNN: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
All authors from UC Berkeley
2015, Fast R-CNN: Ross Girshick
Microsoft Research
2016, Faster R-CNN: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun
Ren: University of Science and Technology of China, while an intern at Microsoft Research
He & Sun: Microsoft Research
Girshick: Facebook AI Research (FAIR)
2018, Mask R-CNN: Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick
All authors from Facebook AI Research (FAIR)
Facebook’s object detection/segmentation implementation: detectron2
*-R-CNN (Regions with CNN features) authors
● Many hyperparameters to tune
○ How many anchor boxes
○ How many proposals to consider
○ Thresholds for non-max suppression
○ Threshold for considering a detection as positive
● Detectron2
○ Facebook (so PyTorch-backed and not TensorFlow)
○ Most of the *-R-CNN authors work at Facebook
○ Can be parametrized via config files
○ A bit complicated (no code example)
○ But works well when set up correctly
● Generating labels
○ LabelImg https://github.com/tzutalin/labelImg
○ For bounding boxes
○ Easy to install & does not use the cloud
But there are many other open implementations, and new ones are coming all the time.
Detectron2 + LabelImg
● Breast cancer
○ Most common cancer for women
○ 5,000-6,000 cases yearly in Hungary
○ 2,000 deaths yearly in Hungary
● Screening: mammography
○ X-ray
○ Painless
○ Cheap to make the scan
● Scan evaluation
○ Mammographist
○ 6 years of university
○ 4-5 years of radiology residency
○ 1.5 years of mammography practice
○ Licence (to be regularly renewed)
○ Usually one needs to wait long to get the scan evaluated
○ By law: 2 independent readers
○ Time-consuming inspection
○ Humans did not evolve to inspect such images
○ Inspection is often outsourced to other countries
Selected results - mammography
https://radiopaedia.org/articles/mammography?lang=us
● Ribli Dezső, Department of Physics of Complex Systems - ELTE, 2018
● Faster R-CNN
● 3,000-4,000 annotated images
● Performance similar to human readers (who spent >10 years learning)
Mammography
Dezso Ribli, 2018: Detecting and classifying lesions in mammograms with Deep Learning
● Google Health, 2020
● > 130,000 women for training
● Training on English dataset
○ Evaluation on English dataset
○ Evaluation on US dataset
(in the US only 1 reader is required)
Mammography
McKinney et al, 2020: International evaluation of an AI system for breast cancer screening
Mammography
Basic research rarely produces useful products that quickly.
Galvani playing with frog’s legs (late 1700s) → Maxwell equations (1860s) → Thomas Edison (late 1800s)
Of course there were similar steps here; not all of computer vision started in 2012.
● CNNs consistently outperform classical approaches on images
○ For classification
○ For object detection too
● With detectron2 it is fairly straightforward to train object detection models
● Human-level accuracy is often possible
○ When enough diverse training data is available
● Classification
● Object detection
○ Bounding box (propose boxes → classify them and adjust their coordinates)
● Segmentation
○ Pixel-level annotation
○ We need to upscale convolutional feature maps
Summary
Source: http://cs231n.stanford.edu/
● Predicting landmark points (shoulder, hip, knee etc.) is possible
● Pose estimation
Keypoint detection / pose estimation
https://github.com/facebookresearch/detectron2