Artificial neural networks and deep learning
Dr. Timur I. Madzhidov
Kazan Federal University, Department of Organic Chemistry
AI achievements
• AI beats humans in games
• Optical character recognition: solved!
• Image recognition better than human
• Speech recognition at human level
• Handwritten text recognition at human level
• Self-driving cars
Deep learning and human intelligence
Deep learning in chemistry
2012
A team led by G. Hinton, with no experience in chemoinformatics or drug design, won the Merck Molecular Activity Challenge by a wide margin over the other teams.
This was the first time "deep learning" came to chemistry.
2014
The Tox21 Data Challenge 2014 was won by another deep learning team, from JKU, in almost all subcompetitions!
Artificial neuron
[Diagram of a biological neuron:]
Dendrites (receive the signal)
Cell body (makes the decision)
Axon (transfers the signal)
Artificial neuron
[Diagram: the artificial neuron drawn alongside the biological one. Inputs x1, x2, x3 (the dendrites) enter with weights w1, w2, w3; the transfer function sums the weighted signals, $a = \sum_i w_i x_i$ (the cell body); an activation function applied to a produces the output y, the signal that is transferred onwards (the axon).]
One neuron is simply MLR (multiple linear regression)
[Diagram: inputs x1, x2, x3 with weights w1, w2, w3 and a bias w0 feed the sum $\sum_i w_i x_i$.]
Linear activation: $f(a) = a$
Linear regression: $y = \sum_i w_i x_i$ (with $w_0$ as the bias term)
One neuron is Logistic Regression
[Diagram: the same neuron with inputs x1, x2, x3 and weights w1, w2, w3, shown with two different activation functions.]
Linear activation $f(a) = a$ gives linear regression: $y = \boldsymbol{w}^T \boldsymbol{x}$
Sigmoid activation $f(a) = \frac{1}{1 + e^{-a}}$ gives logistic regression: $y = \frac{1}{1 + e^{-\boldsymbol{w}^T \boldsymbol{x}}}$
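A minimal sketch of this equivalence in PyTorch (input values are made up): the same single neuron is multiple linear regression with a linear activation and logistic regression with a sigmoid.

import torch
from torch import nn

x = torch.tensor([[0.5, -1.0, 2.0]])              # one object with three descriptors (placeholder values)

# Linear activation f(a) = a: the neuron is linear regression, y = w^T x + w0.
linear_neuron = nn.Linear(3, 1)
y_regression = linear_neuron(x)

# Sigmoid activation: the same neuron becomes logistic regression, y = 1 / (1 + exp(-(w^T x + w0))).
logistic_neuron = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
y_probability = logistic_neuron(x)
print(y_regression, y_probability)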
Multilayered architecture
[Diagram: a feed-forward network with an input layer (x1 … x9), a hidden layer of summation units, and an output layer producing y.]
Hidden layer: $a_h = W_1^T x$, $h = f(a_h)$
Output layer: $a_o = W_2^T h$, $y = f(a_o)$
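A sketch of the forward pass above with explicit weight matrices (bias terms omitted; sizes and values are random placeholders):

import torch

x = torch.randn(9)              # input vector x1 ... x9 (placeholder values)
W1 = torch.randn(9, 3)          # input -> hidden weights
W2 = torch.randn(3, 1)          # hidden -> output weights

a_h = W1.T @ x                  # a_h = W1^T x
h = torch.sigmoid(a_h)          # h = f(a_h), hidden activations
a_o = W2.T @ h                  # a_o = W2^T h
y = torch.sigmoid(a_o)          # y = f(a_o), network output
print(y)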
Functional layers
[The same network viewed as a stack of functional layers:]
Input → Linear transformation layer → Activation function layer → Linear transformation layer → Activation function layer
Setting up the network in popular frameworks
[The same layer stack: Input → Linear transformation → Activation → Linear transformation → Activation, with 9 inputs, 3 hidden units and 1 output.]
In PyTorch:

import torch.nn as nn

net = nn.Sequential(
    nn.Linear(9, 3),   # linear transformation layer: 9 inputs -> 3 hidden units
    nn.Sigmoid(),      # activation function layer
    nn.Linear(3, 1),   # linear transformation layer: 3 hidden units -> 1 output
    nn.Sigmoid()       # activation function layer
)
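As a quick usage sketch (with random placeholder data), the network above maps a batch of 9-dimensional descriptor vectors to values in (0, 1):

import torch

batch = torch.randn(5, 9)    # 5 objects, 9 descriptors each (placeholder data)
predictions = net(batch)     # shape (5, 1); values lie in (0, 1) because of the final sigmoid
print(predictions)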
Important!
A non-linear activation function is essential for building a neural net! Otherwise the network collapses to a simpler (linear) model.
Prove it at home or ask me why after the lecture…
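A numerical check of this claim, as a sketch: stacking two linear layers with no activation in between collapses into a single linear layer, because a composition of linear maps is itself linear.

import torch
from torch import nn

layer1 = nn.Linear(9, 3)
layer2 = nn.Linear(3, 1)

x = torch.randn(4, 9)                                # placeholder data
stacked = layer2(layer1(x))                          # two linear layers, no activation in between

# The equivalent single linear layer: W = W2 W1, b = W2 b1 + b2.
W = layer2.weight @ layer1.weight
b = layer2.weight @ layer1.bias + layer2.bias
single = x @ W.T + b
print(torch.allclose(stacked, single, atol=1e-6))    # True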
Why is a multilayered architecture better?
The deeper the network, the more complex the dependencies it can fit. But more data is needed to avoid overfitting.
Deep learning and big data were born to make each other happy
How to train the network?
[Diagram: descriptor vectors enter the network; the network predicts the property; the prediction error (loss function) is computed; the weights W1 and W2 are updated to minimize the error.]
Loss function for regression
[Diagram: the network with a linear activation on the output produces $y^{pred}$.]
$MSE = \frac{1}{2}\sum_{i}^{N}\left(y_i^{pred} - y_i^{exp}\right)^2$
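A small sketch of this loss on made-up predictions. Note that PyTorch's built-in mse_loss averages the squared errors rather than taking half of their sum, which changes only the scale, not the location of the minimum:

import torch

y_pred = torch.tensor([2.1, 0.8, 3.5])
y_exp  = torch.tensor([2.0, 1.0, 3.0])

# The slide's form: one half of the sum of squared errors.
mse_half_sum = 0.5 * torch.sum((y_pred - y_exp) ** 2)

# PyTorch's built-in version averages instead; both are minimized by the same weights.
mse_mean = torch.nn.functional.mse_loss(y_pred, y_exp)
print(mse_half_sum.item(), mse_mean.item())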
Loss function for classification
[Diagram: the network with a sigmoid activation on the output produces P("1"|X) = p, the probability that the object belongs to class "1".]
Negative log likelihood, a.k.a. binary cross-entropy (BCE):
$NLL = -\sum_{i}^{N}\left[\, y_i \cdot \log p_i + (1 - y_i) \cdot \log(1 - p_i) \,\right]$
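A small sketch with made-up probabilities, comparing the formula above with PyTorch's built-in BCE (which averages over objects by default):

import torch

p      = torch.tensor([0.9, 0.2, 0.7])   # predicted probabilities of class "1"
y_true = torch.tensor([1.0, 0.0, 1.0])   # experimental class labels

# The formula from the slide, summed over objects...
nll = -torch.sum(y_true * torch.log(p) + (1 - y_true) * torch.log(1 - p))

# ...and PyTorch's built-in BCE, which averages over objects.
bce = torch.nn.functional.binary_cross_entropy(p, y_true)
print(nll.item(), bce.item())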
How to update weights?
[Plot: error as a function of a weight value, a curve with a minimum.]
How to update weights?
[Plot: error E versus weight value w, a curve with a minimum. At a point such as w1 the slope of the tangent is the derivative, $\tan\alpha = \frac{\partial E}{\partial w}$; stepping against the sign of the derivative moves the weight (e.g. from w1 toward w2) downhill, and at the minimum $\frac{\partial E}{\partial w} = 0$.]
Gradient descent update: $w_{new} = w_{old} - \alpha\,\frac{\partial E}{\partial w}$, where $\alpha$ is the learning rate.
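A sketch of one gradient-descent step on a toy one-weight error surface, assuming E(w) = (w - 3)^2 purely for illustration:

import torch

w = torch.tensor(0.0, requires_grad=True)
alpha = 0.1                                # learning rate

E = (w - 3.0) ** 2                         # error at the current weight
E.backward()                               # computes dE/dw and stores it in w.grad

with torch.no_grad():
    w -= alpha * w.grad                    # w_new = w_old - alpha * dE/dw
print(w.item())                            # moved from 0.0 toward the minimum at 3.0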
How to calculate the gradient?
$w_i^{t+1} = w_i^{t} - \alpha\,\frac{\partial E}{\partial w_i^{t}}$
We need to calculate the derivative of the loss function with respect to the weights of the NN.
[Diagram: the input x is transformed layer by layer: $W_1^T x$ → $f(W_1^T x)$ → $W_2^T f(W_1^T x)$ → $f(W_2^T f(W_1^T x))$, and the prediction is compared with $y_{true}$:]
$E = \left(y_{true} - f(W_2^T f(W_1^T x))\right)^2$
The NN is nothing more than a (quite complex) analytic function!
Computational graph
Seppo LINNAINMAA
$z = \log(w^2 \cdot \sin x + 1)$. What is z at w = 1, x = 0?
[Computational graph: $a = w^2$, $b = \sin x$, $c = a \cdot b$, $d = 1$, $e = c + d$, $z = \log e$. At w = 1, x = 0 the node values are a = 1, b = 0, c = 0, e = 1, z = 0.]
Chain rule (derivative of a composite function): $\big(f(g(x))\big)' = f'(g) \cdot g'(x)$
$\frac{\partial z}{\partial w} = \frac{\partial z}{\partial e} \cdot \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial a} \cdot \frac{\partial a}{\partial w} = \,?$
Computational graph
Seppo LINNAINMAA
$z = \log(w^2 \cdot \sin x + 1)$. What is $\frac{\partial z}{\partial w}$ at w = 1, x = 0?
[The same computational graph, now with each local derivative evaluated at w = 1, x = 0:]
$\frac{\partial z}{\partial e} = \frac{1}{e} = 1$,  $\frac{\partial e}{\partial c} = 1$,  $\frac{\partial c}{\partial a} = b = 0$,  $\frac{\partial a}{\partial w} = 2w = 2$
$\frac{\partial z}{\partial w} = \frac{\partial z}{\partial e} \cdot \frac{\partial e}{\partial c} \cdot \frac{\partial c}{\partial a} \cdot \frac{\partial a}{\partial w} = 1 \cdot 1 \cdot 0 \cdot 2 = 0$
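The same example can be checked with PyTorch's automatic differentiation, which builds exactly this kind of computational graph:

import torch

# z = log(w^2 * sin(x) + 1) at w = 1, x = 0.
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(0.0, requires_grad=True)

z = torch.log(w ** 2 * torch.sin(x) + 1)
z.backward()                               # backpropagate through the computational graph

print(z.item())        # 0.0, the value of z
print(w.grad.item())   # 0.0, dz/dw, matching the hand calculation
print(x.grad.item())   # 1.0, dz/dx = w^2 * cos(x) / (w^2 * sin(x) + 1)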
Take-home messages
A computational graph is an efficient way to calculate the value of a function at given arguments.
A computational graph can also be used to calculate derivatives using the chain rule.
Derivative calculation starts from the end and propagates toward the input (in the direction opposite to the function calculation).
Computational graphs and derivative calculation are the essence of deep learning frameworks (TensorFlow, PyTorch, etc.).
Backpropagation
[Diagram: a chain of operations x →(·w1)→ o1 →Sigmoid→ o2 →(·w2)→ o3 →Sigmoid→ o4 →(·w3)→ o5 →Linear→ o6 →MSE→ E.]
Forward signal propagation: from the input x to the error E.
Backward derivative (error) propagation: from E back toward the input, multiplying the local derivatives along the chain:
$\frac{\partial E}{\partial o_6},\ \frac{\partial o_6}{\partial o_5},\ \frac{\partial o_5}{\partial w_3},\ \frac{\partial o_5}{\partial o_4},\ \frac{\partial o_4}{\partial o_3},\ \frac{\partial o_3}{\partial w_2},\ \frac{\partial o_3}{\partial o_2},\ \frac{\partial o_2}{\partial o_1},\ \frac{\partial o_1}{\partial w_1}$
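A minimal training-step sketch in PyTorch (made-up data and layer sizes), tying together forward propagation, the loss, backpropagation and the weight update:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(9, 3), nn.Sigmoid(), nn.Linear(3, 1))  # regression head
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

X = torch.randn(32, 9)          # 32 objects, 9 descriptors (placeholder data)
y = torch.randn(32, 1)          # 32 experimental values (placeholder data)

for epoch in range(100):
    optimizer.zero_grad()       # clear old gradients
    y_pred = net(X)             # forward signal propagation
    loss = loss_fn(y_pred, y)   # prediction error
    loss.backward()             # backward derivative (error) propagation
    optimizer.step()            # weight update: w <- w - lr * dE/dw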
Vanishing gradient problem
[The same chain of operations as on the previous slide.]
$\frac{\partial E}{\partial w_1} = \frac{\partial E}{\partial o_6} \cdot \frac{\partial o_6}{\partial o_5} \cdot \frac{\partial o_5}{\partial o_4} \cdot \frac{\partial o_4}{\partial o_3} \cdot \frac{\partial o_3}{\partial o_2} \cdot \frac{\partial o_2}{\partial o_1} \cdot \frac{\partial o_1}{\partial w_1}$, with $\frac{\partial o_1}{\partial w_1} = \frac{\partial (x w_1)}{\partial w_1} = x$
Several factors in this product (the derivatives through the sigmoid layers) are ≈ 0, so the whole product is ≈ 0: the weights are not updated!
Activation functions: ReLU
$ReLU(x) = \begin{cases} x, & \text{if } x \ge 0 \\ 0, & \text{if } x < 0 \end{cases}$
[Plot: the ReLU function and its derivative.]
https://mlfromscratch.com/activation-functions-explained/#/
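A small sketch of why ReLU helps: its derivative is exactly 1 for positive inputs, so it does not shrink the backpropagated signal the way a saturated sigmoid does.

import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
y = torch.relu(x)
y.sum().backward()             # gradients of ReLU with respect to each input element
print(y)                       # tensor([0.0, 0.0, 0.5, 2.0])
print(x.grad)                  # tensor([0.0, 0.0, 1.0, 1.0])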
Deep learning
Deep learning is the application of artificial neural networks with multiple hidden layers to learning tasks.
• New activation functions (ReLU, etc)
• New regularization techniques (dropout, etc)
• New learning techniques (SGD, Adam)
• New output functions (SoftMax, cross-entropy)
• Representation learning using autoencoders
• Generative models using variational autoencoders (VAE)
• Convolutional neural networks (CNN)
• Recurrent neural networks (RNN)
• Generative adversarial networks (GAN)
• Deep reinforcement learning (RL)
WHY DEEP LEARNING?
Precision is often better
https://www.frontiersin.org/articles/10.3389/fphar.2019.01303/full
Neural networks are very flexible
[Diagram: the same architecture shown twice, differing only in the output layer.]
For regression: linear activation on the output and MSE loss.
For classification: sigmoid activation on the output and BCE loss.
Neural nets are building blocks

Task | Y | Output layer activation | Loss function
Regression | (-∞; +∞) | Linear | MSE, L1Loss (MAE), Huber loss
Binary classification | {0; 1} | Sigmoid | BCE
Binary classification (SVM style, with margin) | {-1; +1} | Hyperbolic tangent | Hinge loss
Multiclass classification | {class1, class2, …, classK} | Softmax | NLLLoss, categorical cross-entropy
Embedding learning (minimize difference between vectors) | vector | any | Cosine
Binned regression | one-hot vector of binned values | Softmax | Earth mover distance
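A sketch of how a few rows of this table translate to PyTorch (layer sizes are illustrative). Note that PyTorch's CrossEntropyLoss expects raw scores and applies the softmax internally:

from torch import nn

body = nn.Sequential(nn.Linear(9, 16), nn.ReLU())      # shared "building block"

# Regression head: linear output + MSE loss.
regression = nn.Sequential(body, nn.Linear(16, 1))
mse = nn.MSELoss()

# Binary classification head: sigmoid output + BCE loss.
binary = nn.Sequential(body, nn.Linear(16, 1), nn.Sigmoid())
bce = nn.BCELoss()

# Multiclass head: CrossEntropyLoss applies the softmax itself,
# so no explicit Softmax layer is added here.
multiclass = nn.Sequential(body, nn.Linear(16, 5))
ce = nn.CrossEntropyLoss()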
Multi-headed NN for multitask prediction
Tox21 challenge: 12,707 chemical compounds, 12 different properties (assays).
Comparison of multi-task (MT) and single-task (ST) models: the multitask network outperformed the others in 10 out of 12 assays.
https://www.frontiersin.org/articles/10.3389/fenvs.2015.00080/full
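A sketch of a multi-headed network of this kind: one shared trunk and one output head per assay (all sizes are illustrative, not those of the paper):

import torch
from torch import nn

class MultiTaskNet(nn.Module):
    def __init__(self, n_descriptors=1024, n_tasks=12):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_descriptors, 256), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(256, 1), nn.Sigmoid()) for _ in range(n_tasks)]
        )

    def forward(self, x):
        h = self.shared(x)                        # representation shared by all tasks
        return [head(h) for head in self.heads]   # one probability per assay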
Recurrent Neural Networks (RNN)
What can RNNs do?
• Language modeling and generating text
• Machine translation
• Speech recognition and generation
• Generating image descriptions
• Time series processing
• Movie and video clip processing
• Music classification and generation
• Choreography
• ….
• Bioinformatics (DNA, RNA, proteins, peptides)
• Chemoinformatics (generation of SMILES strings for “useful” structures)
Recurrent Neural Networks (RNN)
[Diagram: an RNN reads the sentence "I love you" word by word, passing the hidden state h1 → h2 → h3 from step to step (starting from an initial state of 0), and produces the translation "Ich liebe dich".]
RNNs process not only the current input x_t but also the previous inputs x_{t-1}, x_{t-2}, …, so the network uses previous experience when making decisions.
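A minimal RNN sketch in PyTorch with placeholder "word" vectors; the hidden state is what carries information from earlier steps to later ones:

import torch
from torch import nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(1, 3, 8)            # (batch, sequence length, input size), placeholder data
outputs, h_last = rnn(sequence)            # outputs: hidden state at every step; h_last: final state
print(outputs.shape, h_last.shape)         # torch.Size([1, 3, 16]) torch.Size([1, 1, 16])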
RNN cell
One-Hot Encoding of SMILES String
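A sketch of the encoding (using phenol, c1ccccc1O, as an example; in practice the vocabulary is fixed over the whole dataset): each character becomes a vector that is all zeros except for a 1 at the position of that character in the vocabulary.

import torch

smiles = "c1ccccc1O"
vocabulary = sorted(set(smiles))
char_to_index = {ch: i for i, ch in enumerate(vocabulary)}

one_hot = torch.zeros(len(smiles), len(vocabulary))
for position, ch in enumerate(smiles):
    one_hot[position, char_to_index[ch]] = 1.0
print(one_hot)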
SMILES for property prediction
[Diagram: input SMILES → multi-layer NN → y]
Too much data is needed! No advantage over simpler approaches.
Autoencoder (AE)
[Diagram: the input is compressed into a latent vector (e.g. 3.2, 6.4, 5.3) and then decoded back into a reconstruction of the input.]
• Is an artificial neural network
• Trained to accurately reconstruct its input object
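A minimal autoencoder sketch (sizes are illustrative): an encoder compresses the input into a three-dimensional latent vector and a decoder reconstructs the input from it.

from torch import nn

encoder = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 3))   # input -> latent vector
decoder = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 100))   # latent vector -> reconstruction
autoencoder = nn.Sequential(encoder, decoder)

# Training would minimize the reconstruction error, e.g. nn.MSELoss()(autoencoder(x), x).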
Sampling SMILES from probability matrix
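A sketch of such sampling with a random placeholder probability matrix and a made-up character vocabulary: at each step one character is drawn according to the predicted probabilities.

import torch

vocabulary = ["C", "c", "1", "O", "N", "(", ")", "=", "<end>"]
probability_matrix = torch.softmax(torch.randn(20, len(vocabulary)), dim=1)  # placeholder predictions

sampled = []
for step_probabilities in probability_matrix:
    index = torch.multinomial(step_probabilities, num_samples=1).item()
    if vocabulary[index] == "<end>":
        break
    sampled.append(vocabulary[index])
print("".join(sampled))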
Seq2seq autoencoder
[Diagram: input SMILES → encoder → latent vector → decoder → returned SMILES]
RNN + maps
[Figure: a map built on the latent space.]
Sattarov B., et al. JCIM, 2019. https://doi.org/10.1021/acs.jcim.8b00751
Discovery of a novel drug within 2 months using AI
Zhavoronkov A., et al. Nature Biotechnology, 2019. https://www.nature.com/articles/s41587-019-0224-x
CONVOLUTION NETWORKS
Convolution
[Diagram: a filter slides over the image.]
Filter (weights can be learnt):
1 0 1
0 1 0
1 0 1
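A sketch of applying the 3×3 filter above to a small image by convolution; in a CNN the filter values would be learnable parameters rather than fixed numbers.

import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)   # batch, channel, 5x5 pixels
kernel = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]]).reshape(1, 1, 3, 3)

feature_map = F.conv2d(image, kernel)      # slides the filter over the image
print(feature_map.shape)                   # torch.Size([1, 1, 3, 3])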
Pooling
Sum pooling: a 2×2 window slides over the matrix
2 1 1
1 1 1
1 3 1
and outputs the sum of each window (e.g. 2+1+1+1 = 5 for the top-left window).
Max pooling: for the matrix
2 1 1
0 1 1
1 3 1
the window outputs its maximum (e.g. 2 for the top-left window).
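A sketch of both pooling operations with a 2×2 window (sum pooling is expressed here as average pooling times the window size):

import torch
import torch.nn.functional as F

a = torch.tensor([[2., 1., 1.],
                  [1., 1., 1.],
                  [1., 3., 1.]]).reshape(1, 1, 3, 3)

sum_pooled = F.avg_pool2d(a, kernel_size=2, stride=1) * 4   # top-left window: 2+1+1+1 = 5
max_pooled = F.max_pool2d(a, kernel_size=2, stride=1)       # top-left window: max = 2
print(sum_pooled)
print(max_pooled)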
Image convolution networks
Graph convolution networks
Atomic feature vectors are assigned to each atom:
- element number
- number of implicit hydrogens
- hybridization
- number of valence electrons ...
Graph convolution networks
[Diagram: a filter with learnable weights is applied to an atom and its neighbouring atoms in the molecular graph.]
Graph convolution networks
[Diagram, continued over several slides: the feature vectors of an atom and its neighbours are combined (+) into an updated representation of each atom.]
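A minimal sketch of one graph-convolution step under a common formulation (sum the feature vectors of an atom and its neighbours, then apply a learnable linear filter); the adjacency pattern and sizes are made up:

import torch
from torch import nn

n_atoms, n_features = 5, 8
atom_features = torch.randn(n_atoms, n_features)        # one feature vector per atom (placeholder)
adjacency = torch.eye(n_atoms)                          # molecular graph: self-loops ...
adjacency[0, 1] = adjacency[1, 0] = 1                   # ... plus a couple of example bonds
adjacency[1, 2] = adjacency[2, 1] = 1

filter_weights = nn.Linear(n_features, n_features)      # the "filter with learnable weights"
updated = torch.relu(filter_weights(adjacency @ atom_features))
print(updated.shape)                                    # (5, 8): an updated vector per atom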
GCN performance
[Figures: performance on logP prediction and on Tox21 challenge data.]
Summary
Deep learning imitates the thinking process using multilayered artificial neural networks.
Deep learning is not a magic bullet: on small test sets its performance approaches that of other methods. It is definitely not a first-choice method, as it is complex to build and tune.
The power of deep learning is its flexibility: with a small change of architecture, even an inexperienced machine learner can adapt it to many different tasks.
Deep learning can be used to build models directly on chemical structures.
The most impressive ability of deep learning is to directly predict chemical structures.
To be continued…
Homework:
Tutorial on deep learning (using PyTorch):
https://bit.ly/2RWAI84
THANK YOU ALL!
You did it!