Defeating the Black Box – Neural Networks in HEP Data Analysis
Transcript of Defeating the Black Box – Neural Networks in HEP Data Analysis
![Page 1: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/1.jpg)
Defeating the Black Box – Neural Networks in HEP Data Analysis
Jan Therhaag (University of Bonn)
TMVA Workshop @ CERN, January 21st, 2011
TMVA on the web: http://tmva.sourceforge.net/
![Page 2: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/2.jpg)
Top Workshop, LPSC, Oct 18–20, 2007, A. Hoecker: Multivariate Analysis with TMVA / CERN, Jan 21st, 2011, J. Therhaag – Neural Networks in HEP Data Analysis
![Page 3: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/3.jpg)
• The single neuron as a classifier
• Network training and regularization
• Advanced topics: The Bayesian approach
![Page 4: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/4.jpg)
The Problem …

[Figure: two overlapping classes of data points in the (x1, x2) plane]
![Page 5: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/5.jpg)
The single neuron as a classifier
![Page 6: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/6.jpg)
A simple approach:

• Code the classes as a binary variable $y \in \{0, 1\}$ (here: blue = 0, orange = 1)
• Perform a linear fit $y = \sum_{i=1}^{N} w_i x_i + w_0$ to this discrete function
• Define the decision boundary by $\{x : y = w^T x = 0.5\}$

[Figure: the fitted boundary in the (x1, x2) plane separates the regions y < 0.5 and y > 0.5]
```cpp
// TMVA code: create Factory
TMVA::Factory *factory =
   new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book linear discriminant classifier (LD)
factory->BookMethod(TMVA::Types::kLD, "LD");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
```
![Page 7: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/7.jpg)
Now consider the sigmoid transformation: $y \mapsto \sigma(y) \equiv \frac{1}{1+\exp(-y)}$

• $\sigma(y)$ has values in [0, 1] and can be interpreted as the probability p(orange | x) (then obviously p(blue | x) = 1 − p(orange | x) = $\sigma(-y)$)

[Figure: the sigmoid curves $\sigma(y)$ and $\sigma(-y)$]
![Page 8: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/8.jpg)
• $\sigma(y)$ is called the activity of the neuron, while $y$ is called the activation

We have just invented the neuron!

[Figure: a single neuron; the inputs $1, x_1, \dots, x_N$ are weighted by $w_0, w_1, \dots, w_N$ and summed to the activation $y = w^T x = \sum_{i=0}^{N} w_i x_i$, which is passed through the sigmoid to give the activity $\sigma(y) = \frac{1}{1+\exp(-w^T x)}$]
![Page 9: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/9.jpg)
The idea of neuron training – searching the weight space
![Page 10: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/10.jpg)
• The training proceeds via minimization of the error function
• The neuron learns via gradient descent*
• Examples may be learned one-by-one (online learning) or all at once (batch learning)
• Overtraining may occur!
*more sophisticated techniques may be used
$$E(w) = -\sum_n \left[ t^{(n)} \ln p(C_1 \mid x^{(n)}) + (1 - t^{(n)}) \ln p(C_2 \mid x^{(n)}) \right]$$

$$\frac{\partial E}{\partial w_j} = -\sum_n \left( t^{(n)} - p^{(n)} \right) x_j^{(n)}$$
![Page 11: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/11.jpg)
Network training and regularization
![Page 12: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/12.jpg)
• The class of networks used for regression and classification tasks is called feedforward networks
• Neurons are organized in layers
• The output of a neuron in one layer becomes the input for the neurons in the next layer
$$y_k(x; w) = \sigma\!\left( \sum_{j=0}^{M} w^{(2)}_{kj} \underbrace{\sigma\!\left( \sum_{i=0}^{N} w^{(1)}_{ji} x_i \right)}_{z_j} \right)$$
```cpp
// TMVA code: create Factory
TMVA::Factory *factory =
   new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book Multi Layer Perceptron (MLP) network and define network architecture
factory->BookMethod(TMVA::Types::kMLP, "MLP", "NeuronType=sigmoid:HiddenLayers=N+5,N");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
```
![Page 13: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/13.jpg)
• Feedforward networks are universal approximators
• Any continuous function can be approximated with arbitrary precision
• The complexity of the output function is determined by the number of hidden units and the characteristic magnitude of the weights
$$y_k(x; w) = \sum_{j=0}^{M} w^{(2)}_{kj} \underbrace{\sigma\!\left( \sum_{i=0}^{N} w^{(1)}_{ji} x_i \right)}_{z_j}$$

[Figure: hidden unit activities $z_1, z_2, z_3$ and the network output $y$ compared to the training data]
![Page 14: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/14.jpg)
14 14 Top Workshop, LPSC, Oct 18–20, 2007 A. Hoecker: Multivariate Analysis with TMVA 14 CERN, Jan 21st, 2011 J. Therhaag – Neural Networks in HEP Data Analysis
From neuron training to network training - backpropagation

• In order to find the optimal set of weights w, we have to calculate the derivatives $\frac{\partial E(w)}{\partial w_{ij}}$
• Recall the single neuron: $\frac{\partial E(w)}{\partial w_k} = (y_k - t_k)\, x_k$
• It turns out that $\frac{\partial E(w)}{\partial w_{ij}} = \delta_j z_i$, with $\delta_k = y_k - t_k$ for output neurons and $\delta_j \propto \sum_k w_{kj} \delta_k$ else
While input information is always propagated forward, errors are propagated backwards!
![Page 15: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/15.jpg)
Some issues in network training

• The error function has several minima; the result of the minimization typically depends on the starting values of the weights
• The scaling of the inputs has an effect on the final solution
• Overtraining – bad generalization and overconfident predictions

[Figure: NN with 10 hidden units]
```cpp
// TMVA code: create Factory
TMVA::Factory *factory =
   new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book Multi Layer Perceptron (MLP) network with normalized input distributions
factory->BookMethod(TMVA::Types::kMLP, "MLP", "RandomSeed=1:VarTransform=N");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
```
![Page 16: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/16.jpg)
Regularization and early stopping

• Early stopping: stop the training before the minimum of E(w) is reached
– a validation data set is needed
– convergence is monitored in TMVA
• Weight decay: penalize large weights explicitly
$$\tilde{E}(w) = E(w) + \lambda\, w^T w$$

[Figure: NN with 10 hidden units and λ = 0.02]
```cpp
// TMVA code: create Factory
TMVA::Factory *factory =
   new TMVA::Factory("TMVAClassification", outputfile, "AnalysisType=Classification");
factory->AddVariable("x1", 'F');
factory->AddVariable("x2", 'F');
// book Multi Layer Perceptron (MLP) network with regularization
factory->BookMethod(TMVA::Types::kMLP, "MLP", "NCycles=500:UseRegulator");
factory->TrainAllMethods();
factory->TestAllMethods();
factory->EvaluateAllMethods();
```
![Page 17: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/17.jpg)
Network complexity vs. regularization

• Unless prohibited by computing power, a large number of hidden units H is to be preferred
– no ad hoc limitation of the model
• In the limit of H → ∞, network complexity is entirely determined by the typical size of the weights

[Figure: network output for different weight magnitudes]
![Page 18: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/18.jpg)
Advanced Topics

Network learning as inference and Bayesian neural networks
![Page 19: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/19.jpg)
Network training as inference

• Reminder: Given the network output $y(x; w) = p(t = 1 \mid w, x)$, the error function is just minus the log likelihood of the training data D:
$$P(D \mid w) = \exp(-E(w)) \qquad \text{(likelihood)}$$
• Similarly, we can interpret the weight decay term as a log probability distribution for w:
$$P(w \mid \lambda) = \frac{1}{Z_W(\lambda)} \exp(-\lambda\, w^T w) \qquad \text{(prior)}$$
• Obviously, there is a close connection between the regularized error function and the inference for the network parameters:
$$P(w \mid D, \lambda) = \frac{P(D \mid w)\, P(w \mid \lambda)}{\int P(D \mid w)\, P(w \mid \lambda)\, dw} = \frac{1}{Z_{\tilde{E}}} \exp(-\tilde{E}(w))$$
(the denominator is the normalization)
![Page 20: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/20.jpg)
![Page 21: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/21.jpg)
Predictions and confidence

• Minimizing the error corresponds to finding the most probable value $w_{MP}$, which is used to make predictions $P(t^{(N+1)} \mid x^{(N+1)}, w_{MP})$
• Problem: Predictions for points in regions less populated by the training data may be too confident

Can we do better?
![Page 22: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/22.jpg)
Using the posterior to make predictions

• Instead of using $w_{MP}$, we can also exploit the full information in the posterior $P(w \mid D, \lambda)$:
$$P(t^{(N+1)} \mid x^{(N+1)}, D, \lambda) = \int P(t^{(N+1)} \mid x^{(N+1)}, w, \lambda)\, P(w \mid D, \lambda)\, dw$$
![Page 23: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/23.jpg)
Using the posterior to make predictions

• Instead of using $w_{MP}$, we can also exploit the full information in the posterior $P(w \mid D, \lambda)$:
$$P(t^{(N+1)} \mid x^{(N+1)}, D, \lambda) = \int P(t^{(N+1)} \mid x^{(N+1)}, w, \lambda)\, P(w \mid D, \lambda)\, dw$$

See Jiahang's talk this afternoon for details of the Bayesian approach to NN in the TMVA framework!
![Page 24: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/24.jpg)
A full Bayesian treatment

• In a full Bayesian framework, the hyperparameter(s) λ are estimated from the data by maximizing the evidence
$$P(D \mid \lambda) = \int P(D \mid w)\, P(w \mid \lambda)\, dw$$
– no test data set is needed
– the neural network tunes itself
– the relevance of input variables can be tested (automatic relevance determination, ARD)
• Simultaneous optimization of parameters and hyperparameters is technically challenging
– TMVA uses a clever approximation

[Figure: evidence as a function of model complexity]
![Page 25: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/25.jpg)
Summary (1)

* A neuron can be understood as an extension of a linear classifier
* A neural net consists of layers of neurons, input information always propagates forward, errors propagate backwards
* Feedforward networks are universal approximators
* The model complexity is governed by the typical weight size, which can be controlled by weight decay or early stopping
* In the Bayesian framework, error minimization corresponds to inference and regularization corresponds to the choice of a prior for the parameters
* The Bayesian approach makes use of the full posterior and gives better predictive power
* The amount of regularization can be learned from the data by maximizing the evidence
![Page 26: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/26.jpg)
Summary (2)

Current features of the TMVA MLP:
* Support for regression, binary and multiclass classification (new in 4.1.0 !)
* Efficient optional preprocessing (Gaussianization, normalization) of the input distributions
* Optional regularization to prevent overtraining
+ efficient approximation of the posterior distribution of the network weights
+ self-adapting regulator
+ error estimation
Future development in TMVA:
* Automatic relevance determination for input variables
* Extended automatic model (network architecture) comparison
Thank you!
![Page 27: Defeating the Black Box – Neural Networks in HEP Data Analysis](https://reader036.fdocuments.us/reader036/viewer/2022081422/568160ae550346895dcfcdc4/html5/thumbnails/27.jpg)
References
Figures taken from:
David MacKay: "Information Theory, Inference and Learning Algorithms", Cambridge University Press, 2003
Christopher Bishop: "Pattern Recognition and Machine Learning", Springer, 2006
Hastie, Tibshirani, Friedman: "The Elements of Statistical Learning", 2nd Ed., Springer, 2009
These books are also recommended for further reading on neural networks