M. Kanevski, Palermo 2009 1
International Workshop: Intelligent Analysis of Environmental Data
Institute of Geomatics and Analysis of Risk (IGAR)
University of Lausanne, Switzerland
Prof. Mikhail Kanevski
Comments and questions to:
• [email protected]
• www.unil.ch/igar
• www.geokernels.org
• General Introduction
• Typical problems
• Approaches
• Solutions
• Future research
Geo- and Environmental Data (classes, continuous variables, images, networks, geomanifolds, …)
• Spatio-temporal
• Multi-scale
• Multivariate
• Highly variable at many scales
• High-dimensional geo-feature spaces
• Uncertainties
• …
• In some cases we do have science-based models: data/knowledge/models integration
Spatio-temporal data in terms of patterns/structures: a. pattern recognition (pattern discovery, pattern extraction), b. pattern modelling, c. pattern prediction
Main Topics:
• Review and posing of typical problems
• From "numbers" to data
• Collection of data: monitoring networks and data representativity. Monitoring network optimisation
• Get more information value from your data – EXPLORE! Exploratory spatio-temporal data analysis (EDA, ESDA)
• Predictions/estimations or simulations? Risk analysis and mapping
• Let data speak for themselves: learning from data. Data mining, machine learning
Methods:
• Monitoring network descriptions
• Geostatistics: predictions/simulations
• Machine learning (neural nets, SLT):
  – Neural networks: MLP, PNN, GRNN, RBF, SOM. ANNEX models. Hybrid models
  – Support Vector Machines
• Recent trends in geostatistics: multiple-point geostatistics, pattern-based geostatistics
• Bayesian approach for uncertainty assessment, integration of data and science-based models (Bayesian Maximum Entropy)
Spatial data analysis: typical tasks
• Predict a value at a given point
• Build a map (isolines, 3D surfaces, …)
• Estimate the prediction error
• Take into account measurement errors
• Risk mapping: uncertainty mapping around an unknown value. Estimate the probability of exceeding a given decision level
• Joint prediction of several variables (improve predictions of the primary variable using auxiliary data and information)
• Optimisation of monitoring networks (design/redesign)
• Simulations: modelling of spatial uncertainty and variability
• Data / science-based model assimilation and fusion
• Image analysis. Remote sensing
• Spatio-temporal events (forest fires, epidemiology, crime, …)
• Predictions/simulations in high-dimensional spaces
• …
Generic Methodology
[Flow diagram: Data → statistical description, monitoring network analysis, quick visualisation → variography / deterministic interpolations → cross-validation → machine learning algorithms / geostatistical predictions & simulations → monitoring network generation → decision-oriented mapping; supported throughout by a data base management system and GIS / remote sensing]
GEOSTATISTICAL ANALYSIS
• Basic/naïve statistical analysis. EDA
• ESDA (regionalised EDA)
• Structural analysis. Spatial correlation analysis (variography)
• Model selection: cross-validation, jack-knife, …
• Prediction and error mapping for decision making (family of kriging models)
• Probability and risk mapping. Conditional stochastic simulations
Some Geostatistics
• Exploration of spatial correlations
• Family of kriging models (simple, ordinary, disjunctive, indicator,…)
• Conditional Stochastic Simulations
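The kriging family above rests on solving a small linear system built from a fitted variogram model. A minimal ordinary-kriging sketch in Python/NumPy (not from the slides; the exponential variogram and its sill/range parameters are illustrative assumptions):

```python
import numpy as np

def exp_variogram(h, sill=1.0, rng=10.0):
    """Exponential variogram model: gamma(h) = sill * (1 - exp(-h / rng))."""
    return sill * (1.0 - np.exp(-h / rng))

def ordinary_kriging(coords, values, target, sill=1.0, rng=10.0):
    """Ordinary kriging prediction at a single target location.

    Solves the kriging system built from the variogram model; the
    unbiasedness constraint (weights sum to 1) is enforced via a
    Lagrange multiplier in the last row/column.
    """
    n = len(values)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_variogram(d, sill, rng)   # gamma between data points
    A[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = exp_variogram(np.linalg.norm(coords - target, axis=1), sill, rng)
    w = np.linalg.solve(A, b)
    estimate = w[:n] @ values
    variance = w @ b                          # kriging (estimation) variance
    return estimate, variance

# Toy example: four samples on a unit square, prediction at the centre
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])
est, var = ordinary_kriging(coords, values, np.array([0.5, 0.5]))
```

By symmetry the four weights are equal here, so the estimate at the centre is simply the mean of the data; the kriging variance quantifies the remaining uncertainty.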
Briansk region (radioactivity, Cs137)
Heavy metals, Japan
Switzerland, indoor radon
Measures to characterise monitoring networks (MN)
• Topological
• Statistical
• Fractal/multifractal
• Lacunarity
Preferential Sampling. Declustering Problem
Example: geostatistical spatial co-predictions
Sr90 – "expensive" information; Cs137 – "cheap", exhaustive information.
(Cross)Variography
Use of Cs137 to improve Sr90 predictions (reduced errors and uncertainty).
Decision-oriented mapping: "thick isolines"
Simulations and Interpolations
Unconditional simulations
SGSim of the precipitation:
Results of the simulations
Post-processing of simulations: mean and standard deviation
Geostatistics: some comments
• Geostatistics is a powerful and well-elaborated model-dependent approach.
• Geostatistics offers a variety of models for spatial data analysis and modelling. It has a long and successful history of developments and applications.
• Some problems:
  – Nonlinearity
  – Non-stationarity
  – Two-point statistics
  – Data/models integration
  – Data mining. Pattern recognition
• Hybrid models (ANN/SVM + geostatistics) can help.
Some useful comments, conclusions and future research
• Detection of patterns: try k-NN or GRNN as exploratory tools
• Cross-validation: leave-one-out, leave-k-out, jackknife, etc. as a control tool
• Model selection and model assessment
K- Nearest Neighbours
K-NN prediction:
k-NN methods use the k observations in the training set T closest in input space to the prediction point x to estimate Y:

Ŷ(x) = (1/k) Σ_{xᵢ ∈ Nₖ(x)} yᵢ

where Nₖ(x) is the neighbourhood of x defined by the k closest points in the training set.
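As a sketch, the estimator is a one-liner once the neighbourhood is found (Python/NumPy; the function name and toy data are illustrative, not from the course material):

```python
import numpy as np

def knn_predict(x_train, y_train, x, k):
    """k-NN regression: average the y-values of the k training
    points closest to x in input space (equal weights 1/k)."""
    dist = np.linalg.norm(x_train - x, axis=1)
    nearest = np.argsort(dist)[:k]       # indices of N_k(x)
    return y_train[nearest].mean()

# Toy 1-D example: the two nearest points to x = 1.6 are x = 2 and x = 1
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])
pred = knn_predict(x_train, y_train, np.array([1.6]), k=2)
```

With k = 2 the prediction at x = 1.6 is the average of y = 4 and y = 1, i.e. 2.5.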
k-NN Classifiers
These classifiers are memory-based and do not require any model to be fit! Given a query point x, we find the k training points closest in distance to x and then classify using a MAJORITY vote among the k neighbours.
Because it uses only the training point closest to the query point, the bias of the 1-NN estimate is often low, but the variance is high.

A famous result of Cover and Hart (1967) shows that asymptotically the error rate of the 1-NN classifier is never more than twice the Bayes rate.

This result gives a rough idea of the best performance possible in a given problem: if the 1-NN rule has a 10% error rate, then asymptotically the Bayes error rate is at least 5%.
Dirichlet cells, Thiessen tessellation, Voronoï polygons
• How to find k ?
Possible answer:
Cross-validation or leave-one-out
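A minimal leave-one-out implementation for choosing k might look like this (an illustrative sketch, not code from the course; the data are synthetic):

```python
import numpy as np

def loo_error(x, y, k):
    """Leave-one-out cross-validation MSE for k-NN regression:
    each point is predicted from its k nearest *other* points."""
    errs = []
    for i in range(len(y)):
        dist = np.linalg.norm(x - x[i], axis=1)
        dist[i] = np.inf                      # exclude the point itself
        nearest = np.argsort(dist)[:k]
        errs.append((y[nearest].mean() - y[i]) ** 2)
    return float(np.mean(errs))

# Synthetic noisy data: pick the k that minimises the LOO error
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, size=200)
best_k = min(range(1, 21), key=lambda k: loo_error(x, y, k))
```

Plotting `loo_error` against k reproduces the U-shaped cross-validation curve shown on the following slides.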
k-NN prediction (n = 6?)
[Diagram: six neighbours at distances r1 … r6 from the prediction point; each neighbour i receives an equal weight Wi ~ 1/n]
Cross-validation
[Same diagram: one data point is left out and predicted from its n neighbours]
Calculate error = (prediction − data)
Leave-next-one-out, etc.
[Same diagram: the next data point is left out in turn]
Calculate error = (prediction − data)
Data and k-nn Cross-validation error curve
Complete data set and 500 training points linearly interpolated
Cross-validation curve
K-nn predictions
Machine Learning Algorithms
• Machine learning is an area of artificial intelligence concerned with the development of techniques that allow computers to "learn".
• More specifically, machine learning is a method for creating computer programs by the analysis of data sets. Machine learning overlaps heavily with statistics, since both fields study the analysis of data; but unlike statistics, machine learning is also concerned with the algorithmic complexity of computational implementations.
Algorithms
Common algorithm types include:
• Supervised learning – the algorithm generates a function that maps inputs to desired outputs.
• Unsupervised learning – models a set of inputs; labelled examples are not available.
• Semi-supervised learning – combines both labelled and unlabelled examples to generate an appropriate function or classifier.
• Reinforcement learning – the algorithm learns a policy of how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm.
• Transduction – similar to supervised learning, but does not explicitly construct a function; instead, it tries to predict new outputs based on training inputs, training outputs, and new inputs.
• The performance and computational analysis of machine learning algorithms is a branch of statistics known as computational learning theory.
ML Topics (short list)
• Modelling conditional probability density functions, regression and classification:
  – Artificial neural networks
  – Decision trees
  – Gene expression programming
  – Genetic programming
  – Gaussian process regression
  – Linear discriminant analysis
  – k-nearest neighbours
  – Minimum message length
  – Perceptron
  – Quadratic classifier
  – Radial basis functions
  – Support vector machines
ML Topics (continued)
• Modelling probability density functions through generative models:
  – Expectation-maximisation algorithm
  – Graphical models, including Bayesian networks and Markov random fields
  – Generative topographic mapping
• Approximate inference techniques:
  – Markov chain Monte Carlo methods
  – Variational Bayes
• Meta-learning (ensemble methods):
  – Boosting
  – Bootstrap aggregating (bagging)
  – Random forests
  – Weighted majority algorithm
• Optimisation: most of the methods listed above either use optimisation or are instances of optimisation algorithms.
• Multi-objective machine learning: an approach that addresses multiple, often conflicting, learning objectives explicitly using Pareto-based multi-objective optimisation techniques.
Machine Learning
• Artificial Neural Networks
  1. Multilayer perceptrons (MLP)
  2. General Regression Neural Networks (GRNN)
• Statistical Learning Theory
  1. Support vector classification
  2. Support vector regression
  3. Monitoring network optimisation
A Generic Model of Learning from Data/Examples
[Diagram: a Generator produces inputs x; a Supervisor returns outputs y; the Learning Machine observes the pairs (x, y) and learns to imitate the supervisor]
The Problem of Risk Minimization

In order to choose the best available approximation to the supervisor's response, one measures the LOSS, or discrepancy, L(y, f(x, α)) between the response y of the supervisor to a given input x and the response f(x, α) provided by the learning machine.
Three Main Learning Problems

• Regression Estimation. Let the supervisor's answer y be a real value, and let f(x, α), α ∈ Λ, be a set of real functions which contains the regression function

f(x, α₀) = ∫ y dF(y|x)
The Problem of Risk Minimization

Consider the expected value of the loss, given by the risk functional

R(α) = ∫ L(y, f(x, α)) dF(x, y)

The goal is to find the function f(x, α₀) which minimises the risk in the situation where the joint pdf F(x, y) is unknown and the only available information is contained in the training set.
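Since F(x, y) is unknown, in practice the risk functional is replaced by the empirical risk averaged over the training set. A small sketch (the linear model class f(x, α) = αx and the synthetic data are illustrative assumptions):

```python
import numpy as np

def empirical_risk(alpha, x, y):
    """Empirical risk: mean squared loss of the model f(x, alpha) = alpha * x,
    averaged over the training set in place of the unknown F(x, y)."""
    return float(np.mean((y - alpha * x) ** 2))

# Synthetic training set drawn from y = 2x + noise
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 2.0 * x + rng.normal(0, 0.1, 100)

# Minimise the empirical risk over a grid of candidate alphas
alphas = np.linspace(0, 4, 401)
best_alpha = alphas[np.argmin([empirical_risk(a, x, y) for a in alphas])]
```

The empirical-risk minimiser recovers a slope close to the true value 2, illustrating the induction principle behind the three learning problems that follow.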
• Classification problem:
[Diagram: two classes of points, labelled A and B, scattered in the input space]
Three Main Learning Problems
• Pattern Recognition (classification). y ∈ {0, 1}; the classification error loss is

L(y, f(x, α)) = 0 if y = f(x, α); 1 if y ≠ f(x, α)
• Regression problem
[Plot: scattered data points and an unknown curve f(x) to be estimated, ŷ = f(x)]
Three Main Learning Problems
• Regression Estimation. It is known that the regression function is the one which minimises the following loss function:

L(y, f(x, α)) = (y − f(x, α))²
• Probability density estimation
[Plot: a density curve p(x) over x]
Three Main Learning Problems
• Density Estimation. For this problem we consider the following loss function:

L(p(x, α)) = −log p(x, α)
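Minimising the average of this loss over the data is exactly maximum-likelihood density estimation. A sketch for a Gaussian family (an illustrative assumption; for a Gaussian the minimiser is available in closed form as the sample mean and standard deviation):

```python
import numpy as np

def neg_log_likelihood(mu, sigma, x):
    """Average loss L = -log p(x; mu, sigma) for a Gaussian density."""
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                         + (x - mu) ** 2 / (2 * sigma**2)))

# Data drawn from N(3, 1.5^2); the MLE minimises the -log p loss
rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.5, size=10_000)
mu_hat, sigma_hat = x.mean(), x.std()
```

The fitted parameters give a lower average negative log-likelihood than any mismatched choice, e.g. (μ, σ) = (0, 1).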
Inductive, Deductive and Transductive inference
[Diagram: training samples (xᵢ, yᵢ) → induction → model F(x, y) → deduction → prediction (x_new, y_new); transduction goes directly from the training samples to the new predictions, without constructing F]
Why Machine Learning algorithms?
• Universal, nonlinear, robust tools
• Data-adapted
• Easy data and knowledge integration
• Efficient in high-dimensional spaces
• Good generalisation (low prediction error)
• Input/feature selection
Our experience, some applications
• Hydrogeology; pollution/contamination (soil, water, air, food chains, …); topo-climatic modelling; geophysics
• Renewable resources – wind fields
• Natural hazards/risks: forest fires, avalanches, indoor radon
• Optimisation of monitoring networks
• Crime data, epidemiology
• ML for remote sensing, change detection
• Socio-economic spatio-temporal multivariate data
• Spatial econometrics. Financial data. Econophysics
• Fractals, chaos, EVT
• Time series
Model Selection & Model Evaluation
Guillaume d'Occam (1285–1349)
"Pluralitas non est ponenda sine necessitate"
Occam's razor: "The simpler explanation of the phenomena is more likely to be correct"
Model Assessment and Model Selection:
Two separate goals
Model Selection: estimating the performance of different models in order to choose the (approximately) best one.

Model Assessment: having chosen a final model, estimating its prediction error (generalisation error) on new data.
If we are in a data-rich situation, the best solution is to split the data randomly:

Raw Data
• Train: 50%
• Validation: 25%
• Test: 25%
Interpretation
• The training set is used to fit the models
• The validation set is used to estimate prediction error for model selection (tuning hyperparameters)
• The test set is used for assessment of the generalization error of the final chosen model
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
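A minimal version of such a split (the 50/25/25 proportions follow the earlier slide; the function name is illustrative):

```python
import numpy as np

def split_data(n, seed=0):
    """Random 50/25/25 split of n samples into train / validation / test
    index sets, matching the roles described above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train, n_val = n // 2, n // 4
    return (idx[:n_train],                  # train: fit the models
            idx[n_train:n_train + n_val],   # validation: tune hyperparameters
            idx[n_train + n_val:])          # test: assess the final model once

train, val, test = split_data(1000)
```

The key discipline is that the test set is touched only once, after model selection is complete; otherwise the generalisation-error estimate is biased.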
Bias and Variance. Model’s complexity
[Plots: fits of the same data illustrating (b) overfitting by an overly complex model and (c) underfitting by an overly simple one]
One of the most serious problems that arises in learning by neural networks is overfitting of the provided training examples.

This means that the learned function fits the training data very closely but does not generalise well; that is, it cannot model sufficiently well unseen data from the same task.

Solution: balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalisation error.
Bias-Variance Dilemma

Assume that

Y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ²
We can derive an expression for the expected prediction error of a regression fit f̂(x) at an input point X = x₀ using squared-error loss:
Err(x₀) = E[(Y − f̂(x₀))² | X = x₀]
        = σ² + [E f̂(x₀) − f(x₀)]² + E[f̂(x₀) − E f̂(x₀)]²
        = σ² + Bias²(f̂(x₀)) + Var(f̂(x₀))
        = Irreducible Error + Bias² + Variance
• The first term is the variance of the target around its true mean f(x₀); it cannot be avoided no matter how well we estimate f(x₀), unless σ² = 0.
• The second term is the squared bias: the amount by which the average of our estimate differs from the true mean.
• The last term is the variance: the expected squared deviation of f̂(x₀) around its mean.
For the k-NN regression fit:

Err(x₀) = E[(Y − f̂ₖ(x₀))² | X = x₀]
        = σ² + [f(x₀) − (1/k) Σₗ₌₁ᵏ f(x₍ₗ₎)]² + σ²/k

Here we assume for simplicity that the training inputs are fixed and the randomness arises from the Y. The number of neighbours k is inversely related to the model complexity.
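The k-NN error decomposition is easy to check numerically: with fixed inputs and randomness only in Y, the variance of the k-NN fit at a point should be close to σ²/k (an illustrative simulation, not from the slides; f = sin and the parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin                    # "true" regression function (assumed)
sigma, k, x0 = 0.3, 5, 2.0
x_train = np.linspace(0, 6, 50)   # fixed training inputs

# k nearest training inputs to x0 (fixed across simulations)
nearest = np.argsort(np.abs(x_train - x0))[:k]

fits = []
for _ in range(5000):         # many training sets, same inputs, new noise in Y
    y = f(x_train) + rng.normal(0, sigma, x_train.size)
    fits.append(y[nearest].mean())        # k-NN fit at x0
fits = np.array(fits)

variance = fits.var()                     # should be close to sigma^2 / k
bias_sq = (fits.mean() - f(x0)) ** 2      # squared bias term
```

Increasing k shrinks the variance term σ²/k but moves the averaged neighbours further from x₀, raising the bias: the bias-variance trade-off in miniature.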
Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001
• A neural network is only as good as the training data!
• Poor training data inevitably leads to an unreliable and unpredictable network.
• Exploratory Data Analysis and data preprocessing are extremely important!!!
• If possible, prior to training, add some noise or other randomness to your examples (such as a random scaling factor). This helps to account for noise and natural variability in real data, and tends to produce a more reliable network.
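A minimal sketch of this kind of noise augmentation (the function name, noise scale and number of copies are illustrative assumptions, not a prescription from the course):

```python
import numpy as np

def augment_with_noise(x, y, copies=3, scale=0.05, seed=0):
    """Jitter the training inputs with Gaussian noise to mimic measurement
    noise and natural variability; targets are repeated unchanged."""
    rng = np.random.default_rng(seed)
    x_aug = [x] + [x + rng.normal(0, scale, x.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.concatenate(x_aug), np.concatenate(y_aug)

# Three original samples become twelve: the originals plus three jittered copies
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 4.0])
x_big, y_big = augment_with_noise(x, y)
```

The noise scale should be chosen relative to the measurement uncertainty of the data; too much jitter blurs real spatial structure.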
Hybrid Models: Geostatistics + ML
NNRK/CK Algorithm
[Flow diagram: Data (F1, F2, …, Fn) → statistical description, trend analysis, structural analysis; data split into training, testing and validation sets → ANN architecture choice, ANN training, testing, validation, accuracy test → ANN estimates for F1, F2, …, Fn and ANN residuals → statistical description and multivariate structural analysis of the residuals → variogram model for the residuals → cokriging (with cross-validation and validation) → residual estimates and errors → final estimates (ANN + geostatistics)]
[Plots: raw-data variogram vs residual variogram (variogram against lag, km)]
Model: Neural Network Residual Cokriging
Final estimate of 90Sr with NNRCK = artificial neural network estimate + geostatistical estimate of the residuals
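The two-step residual logic can be sketched compactly. Here a linear trend stands in for the trained ANN and inverse-distance weighting stands in for kriging/cokriging of the residuals (a toy illustration of the residual idea, not the NNRCK implementation itself):

```python
import numpy as np

def nnrk_predict(coords, values, target):
    """Residual-interpolation sketch: final estimate = trend-model estimate
    + spatial interpolation of the trend residuals."""
    # 1. "ANN" step: fit a linear trend a + b*x + c*y by least squares
    A = np.column_stack([np.ones(len(values)), coords])
    beta, *_ = np.linalg.lstsq(A, values, rcond=None)
    trend = lambda p: beta[0] + p @ beta[1:]
    # 2. Residuals of the trend model at the data points
    resid = values - A @ beta
    # 3. "Kriging" step: interpolate the residuals (IDW as a stand-in)
    d = np.linalg.norm(coords - target, axis=1)
    if np.any(d < 1e-12):
        r_hat = resid[np.argmin(d)]       # exact at a data point
    else:
        w = 1.0 / d**2
        r_hat = (w @ resid) / w.sum()
    # 4. Final estimate = trend + interpolated residual
    return trend(target) + r_hat

coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([1.0, 2.0, 3.0, 5.0])
est = nnrk_predict(coords, values, np.array([0.5, 0.5]))
```

Because the residuals are interpolated exactly at the data points, the hybrid estimator honours the data, while the trend model captures the large-scale nonlinear structure.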
Conclusions
• Machine learning: a universal, recently developed data-driven approach with many successful applications. Nonlinear, robust. Integration of different types of data and information. Efficient in high-dimensional spaces.
• But: depends on the quality and quantity of data. Open issues: uncertainty characterisation, diagnostic tools, hyper-parameter tuning.
Topics for the research
• Multi-task learning
• Automatic feature selection / feature extraction
• Uncertainty characterisation
• Understanding and visualisation of high-dimensional data
• Modelling on geomanifolds; semi-supervised learning
• Active learning
• MLA and simulations?
• …
Thank you for your attention!