Deep Learning for Image Analysis · 5/13/2020 · The Secret Sauce: Pretrained Networks • A...

© 2020 KNIME AG. All Rights Reserved.

Welcome to Deep Learning for Image Analysis

Benjamin Wilhelm, David Kolb

Going live at:

Berlin 5:00 PM (CEST) · New York City 11:00 AM (EDT) · Austin 10:00 AM (CDT) · London 4:00 PM (BST)


Before we start…

• Please use the Q&A section to post your questions.

• Upvote your favorite questions.

• The session is recorded and will be available on YouTube.



Outline

• Motivation

• Fundamentals

• Image Classification

– Cats & Dogs Classification in KNIME

• Semantic Segmentation

– Natural Image Segmentation in KNIME

• Image Captioning

– Image Captioning in KNIME


Motivation


Why Deep Learning?

[Figure: Winners of the ImageNet Challenge. x-axis: Year (2010 to 2017); y-axis: Error in % (0 to 30); annotation: Deep Learning]


Why Deep Learning?

Sergios Karagiannakos: https://sergioskar.github.io/Semantic_Segmentation/

Bearman and Dong: http://www.catherinedong.com/pdfs/231n-paper.pdf

Isola et al.: http://openaccess.thecvf.com/content_cvpr_2017/papers/Isola_Image-To-Image_Translation_With_CVPR_2017_paper.pdf

Purnasai Gudikandula: https://medium.com/@purnasaigudikandula/artistic-neural-style-transfer-with-pytorch-1543e08cc38f

Silver et al.: https://doi.org/10.1038/nature24270


History of Deep Learning

• 1943: Neural Nets (McCulloch & Pitts)

• 1958: Perceptron (Rosenblatt)

• 1960: Adaline (Widrow & Hoff)

• 1969: XOR Problem (Minsky & Papert)

• 1974: Backpropagation (Werbos)

• 1980: Neocognitron (CNN) (Fukushima)

• 1986: Multi-layered Perceptron (Backpropagation) (Rumelhart, Hinton & Williams)

• 1990: LeNet (LeCun)

• 2012: AlexNet (Krizhevsky)


Interest in Deep Learning according to Google Trends


What has changed?

Image Source: https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Data-Center/tesla-v100/[email protected]

Image Source: https://medium.com/syncedreview/sensetime-trains-imagenet-alexnet-in-record-1-5-minutes-e944ab049b2c


Deep Learning Software

https://developer.nvidia.com/caffe2

https://de.wikipedia.org/wiki/Datei:Pytorch_logo.png

https://danilobzdok.de/links/theano-deeplearning-package/

https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/mxnet2.png

https://geekflare.com/wp-content/uploads/2018/05/MicrosoftCNTKlogo.png

https://chainer.org/images/chainer_icon_red.png

https://upload.wikimedia.org/wikipedia/commons/2/2d/Tensorflow_logo.svg

https://miro.medium.com/max/368/1*u2t2N3lu8sH1CSsSrP_UyQ.png

https://upload.wikimedia.org/wikipedia/commons/c/c0/ONNX_logo_main.png

https://upload.wikimedia.org/wikipedia/commons/c/c9/Keras_Logo.jpg


Deep Learning Software in KNIME



KNIME Keras Integration


Fundamentals


Recap: Machine Learning

• Learning programs from data

• Supervised Learning

– Input: Data points with labels

– Output: Model that maps from data points to labels

– Examples: Classification, regression

• Unsupervised Learning

– Input: Data points without labels

– Output: Model that captures structure of data

– Examples: Clustering, dimensionality reduction


Examples of Supervised Learning

Images → Class labels

Credit history → Credit score

Customer data → Churn probability

Low resolution image → High resolution image

Cell image → Segmentation


The Multilayer Perceptron

Input Hidden Output

Neuron


A Single Neuron

Inputs: $x_1, x_2, x_3$

Weights: $w_1, w_2, w_3$

Output: $\sigma\left(b + \sum_i w_i x_i\right)$
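As a minimal sketch of the neuron above, the following Python computes the activation of the bias plus the weighted sum of the inputs. The choice of a sigmoid as the activation and all numeric values are assumptions for illustration; the slide does not fix them.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return activation(z)

# Illustrative values (not from the slides).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0
out = neuron(x, w, b)  # z = 0.5 - 0.5 + 0.0 = 0, so sigmoid(0) = 0.5
```

A layer of a multilayer perceptron applies many such neurons to the same inputs, each with its own weights and bias.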


Activation Functions


Forward Propagation

$x \to h(x) \to o(h(x))$


Modelling Probabilities

• Classification tasks require the model to output probabilities

• Properties of a probability distribution

– All values are non-negative

– All values sum up to 1

• Binary classification: Sigmoid

• Multi-class classification: Softmax
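The two output activations named above can be sketched in a few lines of plain Python. The input scores are made-up values; the point is only that the outputs are non-negative and, for softmax, sum to 1.

```python
import math

def sigmoid(z):
    """Binary classification: probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Multi-class classification: turn real scores into a distribution."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # illustrative scores for three classes
p_binary = sigmoid(0.0)            # a score of 0 means "undecided"
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores.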


Forward Propagation

$x \to h(x) \to o(h(x))$ → Correct?


Loss Functions

• Evaluate how far model outputs are from the true label

• Task dependent

– Binary classification: Binary cross entropy

– Multi-class classification: Categorical cross entropy

– Regression: Mean squared/absolute error

• Must be differentiable
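The losses listed above can be written down directly. This is a sketch in plain Python for single examples; deep learning libraries provide batched, numerically hardened versions. The 14% / 86% prediction is used only as an example input.

```python
import math

def binary_cross_entropy(y_true, p):
    """y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_cross_entropy(one_hot, probs):
    """one_hot marks the true class, probs is e.g. a softmax output."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def mean_squared_error(y_true, y_pred):
    """Regression loss: average squared difference."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Model output: 14% cat, 86% dog.
loss_if_dog = categorical_cross_entropy([0, 1], [0.14, 0.86])  # true class: dog
loss_if_cat = categorical_cross_entropy([1, 0], [0.14, 0.86])  # true class: cat
```

The loss is small when the model puts high probability on the true class and grows quickly when it is confidently wrong.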


Gradient Descent

Gradient


Gradient Descent


Backpropagation

• All parts of a deep learning model are differentiable

• Backpropagation uses the chain rule to calculate the gradient of the loss with respect to all weights

• Modern deep learning software performs this automagically


Forward Propagation

$x \to h(x) \to o(h(x)) \to \mathit{loss} = l(o(h(x)))$


Backpropagation

$\frac{d\,\mathit{loss}}{dx} = l'(o(h(x))) \cdot o'(h(x)) \cdot h'(x)$

Information Flow
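To make the chain rule concrete, here is a small sketch for a loss of the form l(o(h(x))). The three one-dimensional functions are hypothetical stand-ins chosen for illustration; the analytic gradient is checked against a finite-difference estimate.

```python
import math

# Hypothetical one-dimensional "layers", chosen only for illustration.
def h(x):
    return 3.0 * x + 1.0           # hidden layer

def o(z):
    return math.tanh(z)            # output layer

def l(y):
    return (y - 0.5) ** 2          # loss against a target of 0.5

# Local derivatives of each function.
def dh(x):
    return 3.0

def do(z):
    return 1.0 - math.tanh(z) ** 2

def dl(y):
    return 2.0 * (y - 0.5)

def loss(x):
    return l(o(h(x)))

def loss_gradient(x):
    """Chain rule: multiply local derivatives from the loss back to the input."""
    return dl(o(h(x))) * do(h(x)) * dh(x)

x = 0.2
analytic = loss_gradient(x)
eps = 1e-6
numeric = (loss(x + eps) - loss(x - eps)) / (2 * eps)  # finite-difference check
```

In a real network the same product of local derivatives is evaluated with respect to every weight, which is what deep learning frameworks automate.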


Stochastic Gradient Descent

• Calculating the gradient on the full dataset is time-consuming

• Stochastic Gradient Descent: Evaluate the gradient on a single data point

• Mini-batch Gradient Descent: Evaluate on a small set of data points
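The difference between the variants above is only how many data points enter each gradient estimate. The following sketch fits a line with mean squared error; the toy data, learning rate, and epoch count are assumptions for illustration.

```python
import random

def minibatch_sgd(data, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x + b by gradient descent on mean squared error.

    batch_size=1 gives stochastic gradient descent, batch_size=len(data)
    gives full-batch gradient descent, anything in between is mini-batch.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error on this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data from y = 2x + 1 (an assumption for illustration).
points = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w, b = minibatch_sgd(points, batch_size=4)
```

Smaller batches give noisier but cheaper updates; larger batches give smoother updates at a higher cost per step.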


Momentum

• Averages past gradients

• Analogous to a ball rolling down a slope (acceleration)

• Can help to

– Reduce fluctuation

– Speed up progress in directions with small but consistent gradients

– Escape local minima
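A minimal sketch of a momentum update, assuming the common formulation in which a velocity accumulates a decaying sum of past gradients. The objective, learning rate, and decay factor are illustrative choices.

```python
def sgd_momentum_step(param, velocity, grad, lr=0.1, beta=0.9):
    """One update: the velocity is a decaying sum of past gradients."""
    velocity = beta * velocity + grad
    param = param - lr * velocity
    return param, velocity

# Minimize f(p) = p**2 (gradient 2p), a toy objective chosen for illustration.
p, v = 5.0, 0.0
for _ in range(200):
    p, v = sgd_momentum_step(p, v, grad=2 * p)
```

Because the velocity carries over between steps, consistent gradients build up speed while gradients that keep changing sign partly cancel.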


Adaptive Learning Rate

• The learning rate controls how large the steps taken by gradient descent are

• Not all parameters may require the same learning rate

• Solution: Adapt the learning rate based on the variance of the gradient
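One way to realize per-parameter learning rates is an RMSProp-style update, sketched here. It divides each gradient by the square root of a running average of its squared values; the hyperparameters and gradients are illustrative assumptions.

```python
import math

def rmsprop_step(params, grads, cache, lr=0.01, decay=0.9, eps=1e-8):
    """RMSProp-style update: each parameter gets its own effective step size."""
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        c = decay * c + (1 - decay) * g * g     # running average of squared gradients
        p = p - lr * g / (math.sqrt(c) + eps)   # large recent gradients -> smaller step
        new_params.append(p)
        new_cache.append(c)
    return new_params, new_cache

# Two parameters whose gradients differ by a factor of 1000.
params, cache = [1.0, 1.0], [0.0, 0.0]
new_params, cache = rmsprop_step(params, grads=[100.0, 0.1], cache=cache)
steps = [old - new for old, new in zip(params, new_params)]
```

Despite the very different gradient magnitudes, both parameters move by nearly the same amount, which is the effect adaptive methods aim for.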


Different Gradient Descent Optimizers

Optimizer | Momentum | Adaptive Learning Rates
SGD | ✗ | ✗
SGD + Momentum | ✔ | ✗
Adagrad | ✗ | ✔
Adadelta | ✗ | ✔
RMSProp | ✗ | ✔
Adam | ✔ | ✔


The True Goal: Generalization

• Overfitting: The model fits the noise of the training set

• Low loss on training data but high loss on unseen data

• Remedy

– Decrease model capacity

– Use data augmentation

– Regularization

Image Source: https://upload.wikimedia.org/wikipedia/commons/1/19/Overfitting.svg


Old-school Regularization

• Add regularization term to loss that penalizes large parameters

• 𝐿2-Regularization (weight decay)

– Prefers solutions with small weights

• 𝐿1-Regularization

– Prefers solutions with sparse weights (most weights are 0)

• Elastic net

– Combination of 𝐿1- and 𝐿2-Regularization
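The penalty terms above are simple sums over the weights. In this sketch the coefficients `l1` and `l2` and the example weights are illustrative assumptions.

```python
def l2_penalty(weights):
    """Sum of squared weights: grows quickly for large weights."""
    return sum(w * w for w in weights)

def l1_penalty(weights):
    """Sum of absolute weights: pushes small weights to exactly zero."""
    return sum(abs(w) for w in weights)

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Elastic net when both l1 and l2 are non-zero."""
    return data_loss + l1 * l1_penalty(weights) + l2 * l2_penalty(weights)

weights = [0.5, -2.0, 0.0]
total = regularized_loss(data_loss=1.0, weights=weights, l1=0.1, l2=0.01)
# 1.0 + 0.1 * 2.5 + 0.01 * 4.25 = 1.2925
```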


Dropout

• During training: Randomly drop some neurons

• During inference: Scale neuron activations by the keep probability (1 − drop rate)

• Prevents the network from relying too much on individual features
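A sketch of classic dropout in plain Python, assuming a drop rate of 0.25 for illustration. In this formulation the inference-time scaling factor is the keep probability, 1 − drop rate, so that the expected activation matches training.

```python
import random

def dropout_train(activations, drop_rate, rng):
    """Training: zero each activation independently with probability drop_rate."""
    return [0.0 if rng.random() < drop_rate else a for a in activations]

def dropout_inference(activations, drop_rate):
    """Inference: keep all neurons, scale by the keep probability."""
    keep = 1.0 - drop_rate
    return [a * keep for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
train_mean = sum(dropout_train(acts, 0.25, rng)) / len(acts)
infer_mean = sum(dropout_inference(acts, 0.25)) / len(acts)
```

Many libraries, Keras among them, use the equivalent "inverted" variant: the kept activations are scaled up by 1 / keep during training and inference needs no scaling.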


Image Classification


What is Image Classification?

Task: Decide which class an image belongs to

Example: Cat or dog?


Image Classification with Deep Learning

Input Output


Image Input for Deep Learning

255 250 100 113 117

248 223 89 105 101

227 65 233 95 91

89 6 65 89 186

70 211 100 78 111

Image Source: https://cdn.pixabay.com/photo/2017/09/12/21/17/dog-2743705_960_720.jpg




Input

Output

255 250 100 113 117

248 223 89 105 101

227 65 233 95 91

89 6 65 89 186

70 211 100 78 111


Image Classification Output

Class Probabilities (one-hot vectors)

Cat 0%, Dog 100%

Cat 100%, Dog 0%



Input

255 250 100 113 117

248 223 89 105 101

227 65 233 95 91

89 6 65 89 186

70 211 100 78 111

Cat Dog

14% 86%

Feature Extraction & Information Aggregation


Feature Extraction using Convolution

1 2 3

-4 7 4

2 -5 1

Kernel


Kernel Example

-1 0 1

-2 0 2

-1 0 1

* =

Sobel Y
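The sliding-window operation behind these slides can be sketched in plain Python. The kernel is the one shown on the slide; the toy image (dark left half, bright right half) is an assumption for illustration. As in CNN layers, the kernel is not flipped.

```python
def convolve2d(image, kernel):
    """'Valid' sliding-window filter as used in CNN layers (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Kernel from the slide.
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

# Toy image (an assumption): dark left half, bright right half.
image = [[0, 0, 0, 9, 9, 9]] * 4
response = convolve2d(image, kernel)
```

The response is zero in the flat regions and large where the window straddles the brightness step, which is why such a kernel acts as an edge detector.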


Convolutional Layer

• Filter weights are trainable parameters

• Many filters to extract different kinds of features

Image Source: https://datascience.stackexchange.com/a/67324


Pooling: Aggregating Spatial Information

Input:

1 2 8 2
7 4 6 1
8 5 6 9
5 3 1 0

Max Pooling (2×2):

7 8
8 9

Average Pooling (2×2):

3.5 4.25
5.25 4
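The pooling results on this slide can be reproduced with a short sketch. It assumes non-overlapping 2×2 windows, which matches the numbers shown.

```python
def pool2d(matrix, size, reduce):
    """Apply `reduce` to each non-overlapping size x size block."""
    return [
        [
            reduce([matrix[r + i][c + j]
                    for i in range(size) for j in range(size)])
            for c in range(0, len(matrix[0]), size)
        ]
        for r in range(0, len(matrix), size)
    ]

def mean(values):
    return sum(values) / len(values)

# Input matrix from the slide.
m = [[1, 2, 8, 2],
     [7, 4, 6, 1],
     [8, 5, 6, 9],
     [5, 3, 1, 0]]

max_pooled = pool2d(m, 2, max)    # [[7, 8], [8, 9]]
avg_pooled = pool2d(m, 2, mean)   # [[3.5, 4.25], [5.25, 4.0]]
```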


CNN for Image Classification

Cat

Dog

Image Source: https://upload.wikimedia.org/wikipedia/commons/6/63/Typical_cnn.png



Data Augmentation

• Idea: Create more data using transformations that preserve the ground truth

• Examples

– Mirroring

– Rotation

– Translation

– Zooming

– Color transformations

– Blur

– Noise

Image Source: https://cdn.pixabay.com/photo/2017/09/12/21/17/dog-2743705_960_720.jpg
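A few of the transformations listed above can be sketched on a tiny image stored as nested lists. The helper names are hypothetical; the key point is that every transformed copy keeps the original label.

```python
def mirror(image):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in image]

def rotate90(image):
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def translate_right(image, shift, fill=0):
    """Shift pixels to the right, filling the exposed border."""
    return [[fill] * shift + row[:len(row) - shift] for row in image]

def augment(image, label):
    """Every transformed copy keeps the original label."""
    variants = [image, mirror(image), rotate90(image), translate_right(image, 1)]
    return [(v, label) for v in variants]

tiny = [[1, 2],
        [3, 4]]
samples = augment(tiny, "dog")
```

One labelled image becomes four training examples here; combining and randomizing transformations multiplies the data further.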


The Secret Sauce: Pretrained Networks

• A trained network can be used as initialization for a network solving a different/related task

• Fine-tuning: The other task is similar

– Example: A network trained for classification on ImageNet is fine-tuned to discriminate between images of cats and dogs

• Transfer learning: The other task differs greatly

– Example: A network trained for classification on ImageNet is used to initialize the backbone of a semantic segmentation network

• Feature extraction: The network is only used to extract features


1. Example: Cats & Dogs Classification in KNIME


Cats & Dogs Data

https://www.kaggle.com/c/dogs-vs-cats/overview


Cats & Dogs Classification in KNIME

Three Workflows:

1. Image preprocessing and augmentation

2. Train a simple CNN from scratch

3. Fine-tune a pretrained model


1. Image Preprocessing and Augmentation



Input: 3200 examples

Output: 64000 augmented examples (80/20 split)


2. Train a Simple CNN


Create One-hot Vector


Format Network Output


Score


3. Fine-tune a Pretrained Model


How to Fine-tune a Model?

Basic Recipe (of many):

1. Choose existing architecture, pretrained on a similar task

2. Adapt network head to new task (e.g. number of neurons)

3. Re-train new head only (maybe also some other layers)
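The recipe above can be illustrated without any deep learning library. In this sketch a fixed linear map with ReLU stands in for the pretrained backbone (a real one would be, for example, ResNet50), and only a new one-neuron head is trained. All weights, data, and hyperparameters are hypothetical.

```python
import math

# Stand-in for a pretrained backbone: fixed weights that are never updated.
BACKBONE_W = [[1.0, -1.0], [0.5, 0.5]]

def backbone(x):
    """Frozen feature extractor: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]

def head(features, w, b):
    """New trainable head: one sigmoid neuron for a two-class task."""
    z = b + sum(wi * fi for wi, fi in zip(w, features))
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.5, epochs=500):
    """Train only the head; the backbone is used as-is."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = backbone(x)               # no gradient flows into the backbone
            err = head(f, w, b) - y       # gradient of binary cross entropy w.r.t. z
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b

# Toy task: class 1 when the first coordinate exceeds the second.
data = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]
frozen_before = [row[:] for row in BACKBONE_W]
w, b = train_head(data)
predictions = [round(head(backbone(x), w, b)) for x, _ in data]
```

Because the backbone weights never change, its features can also be computed once and cached, which is the idea behind the "feature extraction" use of pretrained networks.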


Prepare pretrained ResNet50 Model

• ResNet50

• Add new head

• Only train the new network head

ResNet50: https://arxiv.org/abs/1512.03385


3. Fine-tune a Pretrained Model


Score


Semantic Segmentation


Before: Classification

Image Source: https://upload.wikimedia.org/wikipedia/commons/6/63/Typical_cnn.png

We have: One classification per image

We need: One classification per pixel


Simple Approach: Sliding Window

Monkey

Tree

Fence

Problem: Inefficient


Another Approach: Only Convolutional Layers

Image Source: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf


Better: Encoder-Decoder

Image Source: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf


Upsampling? Transposed Convolution

Image Source: https://medium.com/apache-mxnet/transposed-convolutions-explained-with-ms-excel-52d13030c7e8


U-Net

Ronneberger et al.: https://arxiv.org/pdf/1505.04597.pdf


2. Example: Natural Image Segmentation in KNIME


Workflow Demo


Image Captioning


What is Image Captioning?

Task: Describe the contents of an image

Example Captions:

• A fancy desert on a plate with a twisted orange.

• A plate has a dessert and orange slices on it.

• some iice crean sitting next to some orange slices

…


Image Captioning with Deep Learning

`A fancy desert on a plate with a twisted orange.`

Does this work?


Problem Formulation

Problem:

• The length of the caption to predict is unknown, but the output layer needs a fixed output dimension

Simple Approach (of many):

• Iterative approach predicting word by word


Image Captioning with Deep Learning

Image

Next Word in Caption

Partial Caption

One neuron for each possible word, each word is a `class`


Iterative Approach Example

Caption: `A fancy desert on a plate with a twisted orange.`

Input1: Image | Input2: Partial Caption | Output: Target Word
*image1* | startseq | A
*image1* | startseq, A | fancy
*image1* | startseq, A, fancy | desert
*image1* | startseq, A, fancy, desert | on
… | … | …
*image1* | …, with, a, twisted, orange, . | endseq

startseq and endseq are special tokens marking the start and end of the sentence
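The expansion of one caption into training rows can be sketched as follows. The function name and the whitespace tokenization are assumptions for illustration; the caption is the example from the slide.

```python
def make_training_pairs(image_id, caption_tokens):
    """Expand one caption into (image, partial caption, target word) rows."""
    tokens = ["startseq"] + caption_tokens + ["endseq"]
    return [(image_id, tokens[:i], tokens[i]) for i in range(1, len(tokens))]

caption = "A fancy desert on a plate with a twisted orange .".split()
pairs = make_training_pairs("image1", caption)
```

A caption with n tokens yields n + 1 rows, one per word to predict plus one for the end token.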


Network Inputs

Input1:

255 250 100 113 117

248 223 89 105 101

227 65 233 95 91

89 6 65 89 186

70 211 100 78 111

Input2: startseq, A, fancy, desert → ?

Replace words with vocabulary indices: 12, 452, 1120, 38


Network Output

Output: fancy […, 0, 0, 1, 0, …]

Convert vocabulary index of target word to one-hot vector
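A small sketch of the two encodings: words become vocabulary indices on the input side, and the target word becomes a one-hot vector on the output side. The tiny vocabulary is made up, and reserving index 0 for padding is an assumption that matches the zero-padding used later in the slides.

```python
def build_vocab(captions):
    """Assign indices starting at 1; index 0 is reserved for padding."""
    words = sorted({w for caption in captions for w in caption})
    return {w: i for i, w in enumerate(words, start=1)}

def encode(tokens, vocab):
    """Replace each word by its vocabulary index."""
    return [vocab[t] for t in tokens]

def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

captions = [["startseq", "a", "fancy", "desert", "endseq"]]
vocab = build_vocab(captions)
# Sorted words: a, desert, endseq, fancy, startseq -> indices 1..5
partial = encode(["startseq", "a"], vocab)          # [5, 1]
target = one_hot(vocab["fancy"], len(vocab) + 1)    # [0, 0, 0, 0, 1, 0]
```

The one-hot size is the vocabulary size plus one, so that every index, including the padding index 0, stays in range of the output layer.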


How to do Prediction?

Using same iterative approach:

• Predict first word using image and start token (startseq)

• Predict next words using image and partial caption from the previous prediction iteration

• Repeat until endseq is predicted


Reduce Complexity

Use Help ➜ Transfer Learning:

• Image Input: Use pretrained image features (InceptionV3)

• Text Input: Use pretrained embedding vectors (GLOVE)

Approach: Pre-calculate InceptionV3 image features and GLOVE embedding features

• Make captions simpler using text processing

InceptionV3: https://arxiv.org/abs/1512.00567, GLOVE: https://nlp.stanford.edu/projects/glove/


3. Example: Image Captioning in KNIME


COCO Data

Large image dataset for many different tasks, e.g. image captioning

Five captions per image:

• A hot dog bun filled with macaroni salad.

• A hot dog bun has macaroni and cheese in it.

• A hotdog bun filled with noodles on a plate with fries.

• Mac and cheese sub with some fries on the side.

• A nice meal sitting on top of a plate.

We are using a randomly sampled subset containing ≈ 8000 images.

Dataset: http://cocodataset.org/#home


Image Captioning in KNIME

Five Workflows:

1. Caption preprocessing

2. Pre-calculate image features

3. Pre-calculate GLOVE embedding vectors

4. Model Training

5. Prediction


1. Caption Preprocessing


Clean Captions



1830 unique words vs. ≈ 10000 before cleaning


2. Pre-calculate Image Features



Extract features of last dense layer (length 2048)


3. Pre-calculate GLOVE Embedding Vectors

GLOVE is a type of Word Embedding

What are Word Embeddings?

• Map a word (or vocabulary index) to a position in an n-dimensional space; the position (relative to other words) encodes the semantics of the word


GLOVE Embedding Vectors Intuition

Nearest Neighbors to ‘frog’ (in terms of distance on the GLOVE vectors)

Image Source: https://nlp.stanford.edu/projects/glove/
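The nearest-neighbor intuition can be sketched with a handful of made-up 3-dimensional vectors. Real GLOVE vectors have far more dimensions, and cosine similarity is one common choice of distance, assumed here for illustration.

```python
import math

def cosine_similarity(u, v):
    """1.0 for vectors pointing the same way, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
                reverse=True)
    return others[:k]

# Made-up vectors, not real GLOVE values.
toy = {
    "frog": [0.9, 0.1, 0.0],
    "toad": [0.8, 0.2, 0.0],
    "lizard": [0.7, 0.1, 0.3],
    "plate": [0.0, 0.1, 0.9],
}
neighbors = nearest_neighbors("frog", toy)
```

Words used in similar contexts end up close together, so a model that has seen "frog" in a caption can generalize to related words.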



Several versions with different vector lengths exist; we choose the 200-dimensional ones here



Look up the word vector for every vocabulary entry and save it in a Python dictionary
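A sketch of that look-up, assuming the plain-text layout of the published GLOVE files, where each line holds a word followed by its vector components separated by spaces. The function name and the sample lines are hypothetical; the workflow's actual code may differ.

```python
def parse_glove_lines(lines, vocabulary):
    """Build {word: vector} for the words in `vocabulary`.

    Each line has the form "<word> <v1> <v2> ... <vn>".
    """
    embeddings = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        if word in vocabulary:
            embeddings[word] = [float(v) for v in values]
    return embeddings

# Three made-up 4-dimensional lines stand in for the 200-dimensional file.
sample = [
    "frog 0.1 0.2 0.3 0.4",
    "toad 0.1 0.2 0.3 0.5",
    "plate -0.7 0.0 0.2 0.1",
]
table = parse_glove_lines(sample, vocabulary={"frog", "plate"})
```

With a real file, an open file handle can be passed as `lines`, since iterating over it yields one line at a time. Words in the vocabulary that are missing from the file need a fallback, for example a zero vector.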


4. Model Training


Word/Vocab Mapping

…


4. Model Training


Create Training Data

Padded with zeros to create equal-length vectors (29)
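Padding can be sketched as below. The length 29 comes from the slide; whether the zeros go before or after the word indices is not stated, so appending them on the right is an assumption.

```python
MAX_LEN = 29  # sequence length used on the slides

def pad_sequence(indices, max_len=MAX_LEN, pad_value=0):
    """Pad with zeros to max_len (appended on the right here, an assumption)."""
    if len(indices) > max_len:
        raise ValueError("sequence longer than max_len")
    return indices + [pad_value] * (max_len - len(indices))

padded = pad_sequence([12, 452, 1120, 38])
```

Equal-length inputs are what allows partial captions of different lengths to be stacked into one batch.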


4. Model Training


Caption Network

Input1: Image Vector, Shape: [2048]

Input2: Word Indices, Shape: [29]

Maps word indices to GLOVE vectors using our pre-calculated dictionary

Output, Shape: [1831]


Caption Network

Input1: Image Vector, Shape: [2048]

Input2: Word Indices, Shape: [29]

Output: Softmax vector, Shape: [1831] (1830 vocabulary size + ‘0’ padding)


4. Model Training


Training


Training

Creates one-hot vector from indices

Caution: Indices must not get out of range of the output shape!


4. Model Training


5. Prediction


Prepare Test Data

startseq: 1176

Sequence length: 29


5. Prediction


Iterative Prediction

1. Start with startseq token

2. Predict next token

3. If predicted token == endseq, exclude example from next iteration

4. Else, go to 2.

5. Repeat until all examples have been excluded
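The loop above can be sketched for a batch of images as follows. `predict_next` is a stand-in for the trained network, the canned captions are made up, and the `max_len` cap is an added safety assumption so the loop always ends.

```python
START, END = "startseq", "endseq"

def generate_captions(images, predict_next, max_len=29):
    """Greedy word-by-word decoding for a batch of images.

    predict_next(image, partial_caption) is a stand-in for the trained
    network and returns the most likely next token.
    """
    active = {image: [START] for image in images}   # captions still growing
    finished = {}
    while active:
        still_active = {}
        for image, partial in active.items():
            token = predict_next(image, partial)
            if token == END or len(partial) >= max_len:
                finished[image] = partial[1:]        # drop the start token
            else:
                still_active[image] = partial + [token]
        active = still_active                        # loop stops when empty
    return finished

# Hypothetical "model": replays a fixed caption per image.
CANNED = {"img_a": ["a", "dog"], "img_b": ["a", "plate", "of", "food"]}

def fake_predict(image, partial):
    words = CANNED[image]
    position = len(partial) - 1
    return words[position] if position < len(words) else END

captions = generate_captions(["img_a", "img_b"], fake_predict)
```

Examples that have produced the end token leave the loop early, so each iteration only runs the model on captions that are still incomplete.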


Iterative Prediction

endseq: 348

Inputs: Trained Model, Test Data

• Predict next token

• If predicted token == endseq, route example to the loop output

• Else, keep example as data for the next iteration; stop the loop if empty


5. Prediction


Caption Results


Questions?


The End. Thank you for joining this webinar.
