Computational decision making
Dr. Boris Adryan @BorisAdryan
What I aim to provide
✓ basic vocabulary for computational decision making
✓ fundamental concepts of computational decision making
✓ phenomenological introduction to machine learning methods
✓ a rough idea of when and how to use these methods
What this presentation isn’t
x hands-on tutorial
x thorough summary
x comprehensive guide
x technical deep dive
x statistics course
Is this artificial intelligence?
word = input('Enter a word: ')
# British_dictionary is assumed to be a dict keyed by word, defined elsewhere
for key in British_dictionary:
    if key.startswith(word):
        print('This is a British word.')
        break  # one match is enough
Is this machine learning?
temperature = float(input('What is the temperature? '))
if temperature >= 1.0:
    print('Wear shorts.')
else:
    print('Wear long underwear.')
Definition
Rule-based decision making on the basis of numeric thresholds, string patterns, etc. is not machine learning.
And most definitely it is not artificial intelligence.
But what if…
…the threshold is inferred at run time?
“Write software that says how close to Euston you can move if you can afford to spend £650k.”
[Figure: average property price (axis from 450 to 1050) against number of stops from Euston on the Northern Line (axis from 0 to 12)]
# pseudocode: read the price at each station, then report the matching stop
table = input_table('cost at station')
print(where_x_yields_a_low_enough_y)
Linear regression
Linear regression is probably the simplest “machine learning” method.
It is an example of supervised learning, because we teach the computer the relation between an input variable (“feature”) and an output variable (“label”).
[Figure: average property price (axis from 450 to 1050) against number of stops from Euston on the Northern Line (axis from 0 to 12)]
y = m · x + b
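The Euston question can be sketched as a least-squares fit followed by solving the fitted line for the budget. This is a minimal illustration only: the price figures are invented (the data points behind the slide are not reproduced here), and `fit_line` and `min_stops_for_budget` are hypothetical helper names.

```python
import math

# Invented example data: stops from Euston and average price in thousands of pounds.
stops = [0, 2, 4, 6, 8, 10, 12]
price_k = [1000, 920, 830, 760, 670, 590, 500]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m * x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def min_stops_for_budget(budget_k, m, b):
    """Smallest whole number of stops at which the fitted price fits the budget."""
    if m >= 0:
        raise ValueError("this sketch assumes prices fall with distance")
    x = (budget_k - b) / m  # solve m * x + b = budget for x
    return max(0, math.ceil(x))

m, b = fit_line(stops, price_k)
print(f"fitted line: price = {m:.1f} * stops + {b:.1f}")
print("closest affordable stop for a 650k budget:", min_stops_for_budget(650, m, b))
```

The threshold is no longer hard-coded: it follows from the data, which is the step that separates this from the rule-based examples earlier.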
Linear regression
Linear regression can become arbitrarily complicated.
The difference between curve fitting in statistics and machine learning is mostly semantics.
f(number of stops to Euston, square footage, bedrooms, bathrooms, …) → price
[Figure: price as a function of many features, with axes running from Euston to High Barnet and from small to large]
Classification tasks
[Images: I. setosa, I. virginica, I. versicolor. All images from https://en.wikipedia.org/wiki/Iris_flower_data_set]
Rather than projecting the feature vector onto a continuous variable, many supervised learning methods identify “class labels”.
Classification tasks
Supervised learning requires complete input matrices. Missing or nonsensical values have to be replaced or removed.
Non-numerical features (e.g. “name of colour”, “smell”) have to be encoded.
class label   feature 1   feature 2   feature 3   feature 4
1             5.1         3.5         1.4         0.2
1             4.9         3.0         1.4         0.2
2             7.0         3.2         4.7         1.4
3             6.3         3.3         6.0         2.5
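Both preparation steps, removing incomplete rows and encoding a non-numerical feature, can be sketched in plain Python. The rows and the colour names are made up for illustration, and one-hot encoding is only one of several possible encodings; `drop_incomplete` and `one_hot` are hypothetical helper names.

```python
def drop_incomplete(rows):
    """Remove rows that contain a missing value (None)."""
    return [row for row in rows if None not in row]

def one_hot(values):
    """Encode categorical values as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Made-up rows: (a numeric feature, a colour name); the second row is incomplete.
rows = [(1.4, "purple"), (None, "blue"), (4.7, "blue"), (6.0, "violet")]

complete = drop_incomplete(rows)
categories, colour_columns = one_hot([colour for _, colour in complete])

# Numeric feature followed by one indicator column per colour.
matrix = [[value] + flags for (value, _), flags in zip(complete, colour_columns)]
print(categories)
print(matrix)
```

Dropping rows is the bluntest option; replacing a missing value, for example with the column mean, keeps more data at the cost of an extra assumption.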
Classification tasks
[Figure: sepal width against sepal length for I. setosa and I. virginica]
To a first approximation, classification (by regression) aims to find a function that best separates the different class labels.
f(sepal width, sepal length) → (1, 2)
[Figure: sepal width against sepal length for I. setosa, I. virginica and I. versicolor, with regions labelled “1” and “2”]
Decision trees
[Figure: petal width against sepal width]
A decision tree can be understood as a series of linear separations of the data.
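[Tree: the first split is on the ratio of sepal width : sepal length and leads either to I. virginica or to a second split on the ratio of petal width : sepal width, which leads to I. setosa or I. versicolor]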
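A single node of such a tree is a threshold on one feature. The sketch below picks that threshold by minimising the Gini impurity, which is one common criterion and not necessarily the one behind the slides. It uses feature 3 and the class labels of the four example rows in the feature table shown earlier; `gini` and `best_split` are hypothetical helper names.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Threshold on one feature that minimises the weighted Gini impurity.

    Returns (threshold, impurity), or None if all values are identical.
    """
    best = None
    ordered = sorted(set(values))
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2  # candidate half-way between neighbours
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Feature 3 and class label of the four example rows.
feature_3 = [1.4, 1.4, 4.7, 6.0]
classes = [1, 1, 2, 3]

threshold, impurity = best_split(feature_3, classes)
print(f"split at feature 3 <= {threshold:.2f}, weighted impurity {impurity:.2f}")
```

A full tree repeats this search on each resulting group until the groups are pure or a depth limit is reached.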
Random forests
A collection of decision trees, each trained on a random subset of the data, can minimise the risk of over-fitting.
A single big decision tree trained on all data can end up describing individual data points rather than the general pattern.
[Figure: petal width against sepal width]
Over-/Underfitting
Sloppy separation is called underfitting, and greedy separation overfitting.
Counteracting an overfit is called regularisation, and works by penalising too many features (L1) or too strong feature weights (L2).
[Figure panels: underfit (high bias), okay, overfit (high variance)]
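The two penalties can be written as terms added to the training error. This is a minimal sketch with a mean squared error as the base loss; `penalised_loss` and its parameter `lam` (the penalty strength) are names chosen for illustration.

```python
def penalised_loss(errors, weights, lam, kind="l2"):
    """Mean squared error plus an L1 or L2 penalty on the feature weights."""
    mse = sum(e ** 2 for e in errors) / len(errors)
    if kind == "l1":
        # Sum of absolute weights: tends to drive some weights to zero,
        # which effectively removes features.
        penalty = sum(abs(w) for w in weights)
    elif kind == "l2":
        # Sum of squared weights: discourages individual large weights.
        penalty = sum(w ** 2 for w in weights)
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return mse + lam * penalty

errors = [1.0, -1.0]    # prediction errors on two training points
weights = [3.0, -4.0]   # current feature weights
print(penalised_loss(errors, weights, lam=0.5, kind="l1"))
print(penalised_loss(errors, weights, lam=0.5, kind="l2"))
```

A larger `lam` pushes the fit towards the underfit end, a smaller one towards the overfit end.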
Dimensionality reduction
Dimensionality reduction aims to reduce the complexity of a dataset (with respect to the number of features).
The first principal components are dimensions that explain most of a dataset’s variance.
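Support vector machine
The SVM aims to find the separating plane with the widest margin. The plane is defined by the training data vectors closest to it, the support vectors.
A classification margin protects the SVM against overfitting.
[Figure: two classes separated by a decision boundary, with the margin, the support vectors and a point marked “?”]
decision boundary: w^T x = 0
negative hyperplane: w^T x = -1
positive hyperplane: w^T x = 1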
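Once the weight vector is known, the three hyperplanes translate into very little code. The sketch below does not train an SVM; it only evaluates a given `w`, under the assumption that the bias is folded into `w` by appending a constant 1 to every input vector. The weights are invented for illustration.

```python
import math

def decision_value(w, x):
    """w^T x, with the bias stored as the last entry of w and x ending in 1."""
    return sum(wi * xi for wi, xi in zip(w, x))

def classify(w, x):
    """Positive class on or above the decision boundary, negative class below."""
    return 1 if decision_value(w, x) >= 0 else -1

def margin_width(feature_weights):
    """Distance between the hyperplanes w^T x = -1 and w^T x = 1 (bias excluded)."""
    return 2.0 / math.sqrt(sum(wi ** 2 for wi in feature_weights))

w = [1.0, 1.0, -3.0]  # two feature weights plus bias (invented values)

print(classify(w, [2.0, 2.0, 1.0]))        # lies on the positive hyperplane
print(classify(w, [1.0, 1.0, 1.0]))        # lies on the negative hyperplane
print(round(margin_width(w[:2]), 3))
```

Training amounts to choosing `w` so that this margin is as wide as possible while the training points stay on the correct side.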
Kernel trick
Input data can be projected into a higher-dimensional space that allows linear separation of otherwise inseparable data.
There are different kernels, such as the radial basis function (Gaussian) or polynomial kernel.
[Figure: the mapping Φ takes the “2D” input space to a “3D” feature space]
http://scikit-learn.org/0.18/auto_examples/svm/plot_iris.html
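The two kernels named above are short formulas. The sketch also includes one explicit 2D to 3D map, here called `phi`, to show why it is called a trick: the kernel value equals a dot product in the higher-dimensional space without ever computing the projection. Whether this particular map matches the Φ in the figure is an assumption; the parameter names `gamma`, `degree` and `coef0` are chosen for illustration.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function (Gaussian) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def phi(x):
    """Explicit 2D -> 3D map whose dot product equals (x . y) ** 2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)

in_feature_space = sum(a * b for a, b in zip(phi(x), phi(y)))
via_kernel = polynomial_kernel(x, y, degree=2, coef0=0.0)
print(in_feature_space, via_kernel)  # equal up to floating-point rounding
print(rbf_kernel(x, y, gamma=0.1))
```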
General approach
An intuitive example from real life.
[Figure: training data (flights cancelled in the past) with features (weather forecast, airport location, # of gates, # of runways, # of snowploughs, airline, aircraft) feed a BLACK BOX that produces a classifier. The classifier comes with a ranked list of relevant features, weights of features, thresholds for features and a performance metric. Applied to new data, it returns a prediction.]
[Flowchart: data → training → classifier → performance assessment → good enough? If yes: success! If no: more data for training.]
Classifier performance
Not all machine learning behaves ideally, and performance metrics are important for quality checks and parameter tuning.
[Figure: ROC curve of sensitivity (“true positives”) against 1-specificity (“false positives”), both axes from 0 to 1.0, with a region marked “worse than random guess”]
Metrics zoo
There is a wide range of performance metrics, comprising combinations of true & false positives as well as true & false negatives.
                     positive class (P)    negative class (N)
predicted positive   true positive (TP)    false positive (FP)
predicted negative   false negative (FN)   true negative (TN)
https://en.wikipedia.org/wiki/Precision_and_recall
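A few of the most common metrics follow directly from the four cells TP, FP, FN and TN. The counts below are invented for illustration, and the sketch assumes none of the denominators is zero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Common metrics derived from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # also called recall or true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Invented counts for 100 test cases.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and 1 - specificity are the two quantities plotted against each other in a ROC curve.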
ML pipeline
[Figure: development runs from data acquisition through model building and test to use in production. Model building covers raw data, clean-up, feature engineering, model learning and model selection, with stages annotated as labour intense, brain intense and compute intense. Production covers data recording (production system) and evaluation.]
Choosing a method
from: Olson et al., 2017, https://arxiv.org/abs/1708.05070
There is no ‘one-size-fits-all’ machine learning method. Most methods need to be carefully tuned to perform well.
Often, there are ‘non-functional’ constraints on choosing a method: runtime, interpretability, etc.
What about neural networks?
[Figure: a simple perceptron. Features 1 to 3 are multiplied by weights 1 to 3 and pass through an input function and an activation function to give the class output; the error is used for weight updates.]
Neural networks attempt to mimic the integrative properties of neurons. The perceptron is a single-layer network.
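The perceptron parts named in the figure (weights, input function, activation function, error-driven weight updates) fit into a short sketch. The toy task, learning the logical AND of two inputs, and the function names are chosen for illustration; a step function serves as the activation.

```python
def predict(weights, bias, features):
    """Input function (weighted sum plus bias) followed by a step activation."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: shift the weights by the classification error."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)  # -1, 0 or 1
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

# Toy data: logical AND, which is linearly separable.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(weights, bias, predictions)
```

A single perceptron can only learn linearly separable tasks; stacking layers of such units is what leads to the deeper networks discussed next.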
[Figure labels: inputs, outputs]
Deep neural networks
http://www.asimovinstitute.org/neural-network-zoo
While many artificial neural networks show great performance, exactly which features the classification is based on remains largely unknown.
Reinforcement learning
In RL, the methods iteratively learn to optimise an output from an abstract representation of a system.
[Figure: an unknown system provides a map (210 x 160 pixels, 8-bit RGB) and the actual score; the method chooses an action (move l/r or shoot) on the basis of the map to optimise the score]
Mnih et al., Nature (2015)
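The core idea, learning the value of an action from the score it leads to, can be sketched with a lookup table. This is plain tabular Q-learning, a much simpler relative of the deep network approach in the cited paper, which replaces the table with a neural network that reads the screen pixels. The state names, the reward and the parameters `alpha` (learning rate) and `gamma` (discount factor) are invented for illustration.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """One Q-learning step: move Q(s, a) towards reward + gamma * max Q(s', .)."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q

def greedy_action(q, state, actions):
    """Pick the action with the highest learned value in this state."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))

actions = ["left", "right", "shoot"]
q = {}  # unseen (state, action) pairs count as 0.0

# Shooting in state "s0" raised the score by 1 and led to state "s1".
q_update(q, "s0", "shoot", reward=1.0, next_state="s1", actions=actions)

print(q)
print(greedy_action(q, "s0", actions))
```

Repeating this update over many played steps is what "iteratively learn" refers to; in practice the method also has to try non-greedy actions now and then to explore.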
Unsupervised learning
Machine learning can help to structure and explore an unknown dataset.
These methods aid the identification of classes whose existence isn’t known yet.
• Hierarchical clustering
• K-means clustering
• Expectation maximisation
• Density-based clustering
plus clever visualisation
Hierarchical clustering
Derives hierarchical dependencies of individual rows and columns in the dataset on the basis of similarity (correlation) between their properties.
Combined with a heatmap, it gives a good first impression of a dataset.
k-means clustering
Defines k different centroids to which data points are assigned by proximity. If increasing k further doesn’t make the distance between data points and their centroids much smaller, k is the number of clusters in the set.
Density-based clustering and expectation maximisation are conceptually related, the latter giving a probability for membership in any group.
[Figures: clustering with k = 2; k against centroid distance]
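Both the assignment by proximity and the "k against centroid distance" check can be sketched in plain Python. The six points are invented to form two obvious groups, and the initialisation (simply the first k points) is a simplification chosen to keep the example deterministic; real implementations initialise more carefully.

```python
import math

def k_means(points, k, iterations=10):
    """Lloyd's algorithm; returns (centroids, mean distance to nearest centroid)."""
    centroids = list(points[:k])  # naive initialisation: the first k points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    mean_distance = sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)
    return centroids, mean_distance

# Invented data: two well-separated groups of three points each.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]

distances = {k: k_means(points, k)[1] for k in (1, 2, 3)}
for k, d in distances.items():
    print(f"k = {k}: mean distance to centroid {d:.2f}")
```

The distance drops sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, which is the signal that the set contains two clusters.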
Conclusions
https://badryan.github.io/2015/10/20/is-it-all-machine-learning.html
• People on the Internet steal infographics.
• ML methods have been around in the stats world for ages, but big data sets and compute power make them more widely known.
• Understanding key principles behind ML should be part of the school curriculum.