AstroInformatics

Transcript of AstroInformaticsbrescia/documents/ASTROINFOEDU/brescia_L3-ST… · M. Brescia - Data Mining - lezione 4

Statistics for classification

M. Brescia - Data Mining - lezione 4 2

A useful representation is the confusion matrix.

The element in row i and column j is the absolute number, or the percentage, of cases of "true" class i that the classifier assigned to class j.

The main diagonal contains the correctly classified cases; all the others are errors.

         A    B    C   Total
A       60   14   13    87   69.0%
B       15   34   11    60   56.7%
C       11    0   42    53   79.2%
Total   86   48   66   200   68.0%

On class A the accuracy is 60 / 87 = 69.0%, on class B it is 34 / 60 = 56.7%, and on class C it is 42 / 53 = 79.2%. The average accuracy is (60 + 34 + 42) / 200 = 136 / 200 = 68.0%. The errors amount to 32%, i.e. 64 cases out of 200. The value of this classification depends not only on the percentages, but also on the cost of each type of error. For instance, if C is the class that is most important to classify well, the result can be considered positive…

The training set contains 200 cases. Class A contains 87 cases:
• 60 correctly classified as A;
• 27 misclassified, of which 14 as B and 13 as C.
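These counts translate directly into code. A minimal Python sketch (the matrix entries are the ones from this slide; the helper names are illustrative):

```python
# Confusion matrix from the slide: rows = true classes, columns = predicted classes.
cm = {
    "A": {"A": 60, "B": 14, "C": 13},
    "B": {"A": 15, "B": 34, "C": 11},
    "C": {"A": 11, "B": 0,  "C": 42},
}

def class_accuracy(cm, label):
    # Correct predictions for a class divided by the true cases of that class.
    row = cm[label]
    return row[label] / sum(row.values())

def overall_accuracy(cm):
    # Diagonal sum divided by the total number of cases.
    correct = sum(cm[k][k] for k in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total

print(round(class_accuracy(cm, "A"), 3))  # 0.69
print(round(class_accuracy(cm, "C"), 3))  # 0.792
print(round(overall_accuracy(cm), 3))     # 0.68
```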


Confusion Matrix

M. Brescia - Data Mining - lezione 4 3

A binary classifier has two possible output classes. The response is also known as:
o Output variable;
o Label;
o Target;
o Dependent variable.

Let's now define the most basic terms:

true positives (TP): we predicted yes, and the cases are actually yes (correct prediction);

true negatives (TN): we predicted no, and the cases are actually no (correct prediction);

false positives (FP): we predicted yes, but the cases are actually no (Type I error);

false negatives (FN): we predicted no, but the cases are actually yes (Type II error).
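As a sketch, the four counts can be tallied from paired true/predicted binary labels (the labels below are illustrative, not from the slides):

```python
def confusion_counts(y_true, y_pred):
    # Tally TP, TN, FP, FN over paired binary labels (1 = yes, 0 = no).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```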


Classification estimators

M. Brescia - Data Mining - lezione 4 4

Classification accuracy: fraction of patterns (objects) correctly classified, with respect to the total number of objects in the sample;

Purity/Completeness: fraction of objects correctly classified, for each class;

Contamination: fraction of objects erroneously classified, for each class;

DICE: the Sorensen–Dice index, also known as the F1-score, is a statistic used for comparing the similarity of two samples.

These are 5 basic quality evaluation criteria, obtained by exploiting the output representation given by the confusion matrix.

DICE = 2|X ∩ Y| / (|X| + |Y|) = 2AB / (A² + B²) = 2TP / (2TP + FP + FN)
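A quick sketch of the last equality, computed from the TP/FP/FN counts of a binary confusion matrix (the numbers are the example values used on the next slide):

```python
def dice(tp, fp, fn):
    # Sorensen-Dice index from confusion-matrix counts: 2TP / (2TP + FP + FN).
    return 2 * tp / (2 * tp + fp + fn)

def f1(tp, fp, fn):
    # Same quantity written as the harmonic mean of precision and recall.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(dice(100, 10, 5), 3))  # 0.93
print(round(f1(100, 10, 5), 3))    # 0.93, the two forms agree
```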

There is a list of basic rates often computed from a confusion matrix for a binary classifier:


Classification estimators

M. Brescia - Data Mining - lezione 4 5

Accuracy: overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91

Misclassification Rate: overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 − Accuracy; also known as "Error Rate"

True Positive Rate: when it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95; also known as "Sensitivity", "Recall" or "Completeness"

False Positive Rate: when it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17

Specificity: when it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83; equivalent to 1 − FPR

Precision (Purity): when it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91

Prevalence: how often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
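All of the rates above can be reproduced from the four counts. A sketch with this slide's values (TP = 100, TN = 50, FP = 10, FN = 5):

```python
# Example binary confusion matrix used on this slide.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                # 165

accuracy    = (TP + TN) / total          # ~0.91
error_rate  = (FP + FN) / total          # ~0.09, equal to 1 - accuracy
tpr         = TP / (TP + FN)             # sensitivity / recall / completeness, ~0.95
fpr         = FP / (FP + TN)             # ~0.17
specificity = TN / (TN + FP)             # ~0.83, equal to 1 - fpr
precision   = TP / (TP + FP)             # purity, ~0.91
prevalence  = (TP + FN) / total          # ~0.64

print(round(accuracy, 2), round(tpr, 2), round(precision, 2))  # 0.91 0.95 0.91
```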

More in general:


ROC curve

M. Brescia - Data Mining - lezione 4 6

Together with the DICE estimator, another useful operator is the ROC curve.

ROC (Receiver Operating Characteristic) is a statistical estimator used to assess the predictive power of a binary classifier (e.g. a logistic regression model).

It comes from Signal Theory, but it is heavily used in Analytics: it is a graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as the threshold for assigning observations to a given class is varied.


ROC curve

M. Brescia - Data Mining - lezione 4 7

To draw a ROC curve, only TPR and FPR are needed. TPR defines how many correct positive results occur among all positive samples available during the test. FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test.

A ROC space is defined by FPR and TPR as x and y axes respectively, which shows the relative trade-off between true positives (benefits) and false positives (costs).

Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs (1 − specificity) plot. Each prediction result, or instance of a confusion matrix, represents one point in the ROC space.


ROC curve

M. Brescia - Data Mining - lezione 4 8

The best possible prediction method would yield a point in the upper left corner, or coordinate (0,1), of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The (0,1) point is also called a perfect classification.

A completely random guess would give a point along the diagonal line (the so-called line of no-discrimination) from the bottom left to the top right corner (regardless of the positive and negative base rates). An intuitive example of random guessing is a decision by flipping coins. As the sample size increases, a random classifier's ROC point migrates towards (0.5, 0.5).


ROC – classifier estimation

M. Brescia - Data Mining - lezione 4 9

In order to compare arbitrary classifiers, Receiver Operating Characteristic (ROC) curve plots may give a quick evaluation of their behavior. The overall effectiveness of the algorithm is measured by the area under the ROC curve, where an area of 1 represents a perfect classification, while an area of 0.5 indicates a useless result (i.e. like a flipped coin).

The curve is obtained by varying the threshold used to discriminate among classes. If the target label is in the [0,1] range, the ROC plot is built by calculating the pair (TPR, FPR) for each value of a threshold, e.g. (0, 0.1, 0.15, 0.2, …, 0.95, 1), and plotting all these results, which describe a curve. The ROC value is therefore the area under that curve.
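The threshold sweep described above can be sketched in a few lines of Python (labels, scores and the threshold grid are illustrative; the area is computed with the trapezoidal rule):

```python
def roc_points(y_true, scores, thresholds):
    # One (FPR, TPR) point per threshold: predict "yes" when score >= threshold.
    pts = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= thr)
        fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < thr)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= thr)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < thr)
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return sorted(pts)

def auc(points):
    # Area under the ROC curve by the trapezoidal rule.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
thresholds = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
points = roc_points(y_true, scores, thresholds)
print(round(auc(points), 2))  # 0.75: better than random (0.5), far from perfect (1)
```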


Probability Density Function

M. Brescia - Data Mining - lezione 4 10

In regression experiments, where the goal is to predict a distribution based on a restricted sample of a true population (KB, or Knowledge Base), the usual mechanism is to infer the knowledge acquired on the true sample through a model able to learn the hidden and unknown correlation between the data parameter space and the expected output.

A typical example in astrophysics is the prediction of the photometric redshift for millions of sky objects, by learning the hidden correlation between the multi-band photometric fluxes and the spectroscopic redshift (almost precise, thus considered the true redshift). The true redshift is usually known only for a very limited sample of objects (those spectroscopically observed). The advantage of predicting the photo-z is that, in real cases, photometric catalogues are much easier to obtain than very complex and expensive spectroscopic observation runs and their reduction.

Forcing a model F to predict a single-point estimate y = F(x) may yield largely inaccurate predictions (outliers), while a prediction based on a Probability Density Function PDF(x) may reduce or minimize physical bias (systematic errors), as well as the occurrence of outliers.

In other words, a PDF improves performance, at the price of a more complex mechanism to infer and calculate the prediction.


Probability Density Function

M. Brescia - Data Mining - lezione 4 11

There exists a plethora of statistical methods which can produce a PDF for analytical problems approached by traditional (deterministic/probabilistic) models. But in the case of models for which an analytical expression y = F(x) does not exist (such as machine learning models), it is extremely difficult to find a well-posed PDF(x), since it is intrinsically complex to separate the error due to the model itself from the error embedded in the data. And, importantly, a PDF is well posed only for known (continuous) probability distributions.

The importance of the PDF is that in most real-life problems it is impossible to answer questions like: given the distribution function of a random variable X, what is the probability that X is exactly equal to a certain value n? We can better answer questions like: what is the probability that X is between n−a and n+a? This corresponds to calculating the area under the PDF over the interval [n−a, n+a]: the probability is an integral value, not a single point!
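That integral view can be sketched for a Gaussian PDF using only the standard library (math.erf gives the normal CDF in closed form; the interval values are illustrative):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF of N(mu, sigma^2) via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_in_interval(n, a, mu=0.0, sigma=1.0):
    # P(n - a <= X <= n + a): area under the PDF, i.e. a difference of CDFs.
    return normal_cdf(n + a, mu, sigma) - normal_cdf(n - a, mu, sigma)

# For a continuous variable, P(X == n) is zero; only intervals carry probability.
print(round(prob_in_interval(0.0, 1.0), 4))  # 0.6827, the classic one-sigma area
```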


Confidence Statistics

M. Brescia - Data Mining - lezione 4 12

As said before, p(z) cannot be verified on a single-point basis. A large sample, however, does support p(z) verification. Let's assume we have a population of N observed objects. For a limited number of them we know their real nature. To the others we applied an estimation model which predicted their nature with a residual uncertainty, i.e. it provides a probability p(z). We want to verify the reliability and accuracy of such an estimation.

What do we expect? About 1% of the predictions should be almost perfect, i.e. their p(z) extremely close to the real value with at least a 99% confidence level.

What happens if this occurs for, let's say, about 40% of them? The model estimation suffers from overconfidence, i.e. the predictions are too precise with respect to the supported evidence!

What happens in the opposite case (i.e. less than 0.5%)? The model is underconfident!

In many astrophysical cases, astronomers spend much time calibrating their measurements, keeping the error budget of the observation quality strongly under control. They remove most of the bias (systematic effect) sources, tune the physical models, increase the statistics of the samples used, compare with past experience, use empirical ("magic") rules of thumb, etc.

This means that in most cases the results are overconfident.


Confidence statistics

M. Brescia - Data Mining - lezione 4 13

The key idea is the concept of the confidence interval.

Let's suppose we have a statistical variable X distributed over a population with mean μ and variance σ². We want to build a confidence interval for μ at level 1−α based on a simple random sample (x1…xn) of size n. The quantity 1−α is called the confidence level. In practice, we want to find an interval of values which contains the true value of the statistical estimator μ. First, we have to distinguish the case in which the variance is known from the one in which it is unknown.


Confidence Statistics

M. Brescia - Data Mining - lezione 4 14

Variance known (rare case in the real world)

The sampling mean

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ   (1)

is a random variable distributed approximately like a Gaussian N(μ, σ²/n), and this approximation improves as the sample size n grows. The quantity σ²/n measures the precision of the estimator (1): when n ≫ σ², the estimate (1) is more precise.

Hence x̄ ~ N(μ, σ²/n) implies that

(x̄ − μ) / √(σ²/n) ~ N(0,1)

so we can use the z-score of a standard normal:

Z = (X − μ) / σ


Confidence Statistics

M. Brescia - Data Mining - lezione 4 15

Therefore, for each probability value 1-α, we can write:

P( −z_{α/2} ≤ (x̄ − μ) / √(σ²/n) ≤ z_{α/2} ) = 1 − α   (2)

where z_{α/2} is the quantile of the Gaussian distribution of order 1−α/2, i.e. the point leaving a left area under the Gaussian equal to 1−α/2. The values of the quantiles are usually tabulated for each distribution.


Confidence Statistics

M. Brescia - Data Mining - lezione 4 16

Quantiles of a standard Gaussian distribution. The table reports the quantiles p0+p1 of a distribution N(0,1). Remember that a standard Gaussian is symmetric around zero, therefore the quantiles with p < 0.5 can be obtained by symmetry (see the example below).

Example: to obtain the quantile 0.975 of a N(0,1) means to find x such that P(N(0,1) ≤ x) = 0.975.

0.975 = p0 + p1 = 0.90 + 0.0750 => find the cross value between 0.90 and 0.0750 => x = 1.96

By symmetry, P(N(0,1) ≤ −x) = 1 − 0.975 = 0.025.

Therefore x = 1.96 is the quantile of order 0.975, and −x = −1.96 the quantile of order 0.025, of the distribution N(0,1).
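Instead of reading the printed table, the same quantiles can be obtained from an inverse-CDF routine; a sketch assuming scipy is available (norm.ppf is the quantile function of N(0,1)):

```python
from scipy.stats import norm

x = norm.ppf(0.975)               # quantile of order 0.975
print(round(x, 2))                # 1.96
print(round(norm.ppf(0.025), 2))  # -1.96, by symmetry
print(round(norm.cdf(1.96), 3))   # 0.975, the inverse check
```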


Confidence Statistics

M. Brescia - Data Mining - lezione 4 17

The confidence interval can then be built by expanding the formula

P( −z_{α/2} ≤ (x̄ − μ) / √(σ²/n) ≤ z_{α/2} ) = 1 − α   (2)

which yields the confidence interval(s)

[ x̄ − z_{α/2}·√(σ²/n) , x̄ + z_{α/2}·√(σ²/n) ]   (3)

In other words, the probability that the intervals (3) contain the true value of the mean μ of the population is approximately equal to the confidence level 1−α.


Confidence Statistics

M. Brescia - Data Mining - lezione 4 18

The confidence level 1−α indicates the "level" of the coverage given by the confidence intervals (3). In other words, there always exists a residual probability α that the sampling data come from a population with mean outside those intervals.

Consider that (3) is centered on the estimated mean x̄, with a radius equal to z_{α/2}·√(σ²/n), whose length depends on the desired level of coverage (i.e. on the chosen quantile) and on the precision of the estimator, measured by √(σ²/n), which is called the standard error of the estimate.

We speak about multiple confidence interval(s) because any choice of the quantile determines a different confidence interval.

Let’s see an example…


Confidence Statistics - Example

M. Brescia - Data Mining - lezione 4 19

From an observed image, after reduction, we calculated the absolute magnitudes of the brightest stars present in that sky region. We also know that these magnitudes are distributed with a variance of σ² = 16 squared magnitudes. We want to calculate a confidence interval with a confidence level of 95% (~2σ) for the mean of the magnitudes.

Let's consider 10 stars with absolute magnitudes: 7.36, 11.91, 12.91, 9.77, 5.99, 10.91, 9.57, 11.01, 6.11, 12.12

We start from the sampling mean and its standard error:

m̄ = (1/10) Σᵢ₌₁¹⁰ mᵢ = 9.766   and   √(σ²/10) = √(16/10) = 1.265

Since we fixed a confidence level of 95%, then 1 − α = 0.95 and consequently α = 0.05. Therefore the desired quantile is z_{α/2} = z_{0.05/2} = z_{0.025} = 1.96.

The radius of the confidence interval is then given by z_{α/2}·√(σ²/n) = 1.96 · 1.265 = 2.4794.

Therefore the confidence interval is [(9.766 − 2.4794), (9.766 + 2.4794)] = [7.2866, 12.2454].

We have 95% confidence that the true value of the mean magnitude of the bright stars in that sky region is between 7.29 and 12.25.
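The whole example can be reproduced in a few lines (the magnitudes and σ² = 16 are taken from the slide; only the standard library is needed):

```python
import math

mags = [7.36, 11.91, 12.91, 9.77, 5.99, 10.91, 9.57, 11.01, 6.11, 12.12]
sigma2 = 16.0          # known variance, in squared magnitudes
z = 1.96               # Gaussian quantile for a 95% confidence level

mean = sum(mags) / len(mags)            # sampling mean, 9.766
stderr = math.sqrt(sigma2 / len(mags))  # standard error, ~1.265
radius = z * stderr                     # ~2.479
print(round(mean - radius, 2), round(mean + radius, 2))  # 7.29 12.25
```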


Confidence Statistics - Example

M. Brescia - Data Mining - lezione 4 20

What happens if we increase the confidence level to about 3σ (99.7%)?

We start from the sampling mean and its standard error:

m̄ = (1/10) Σᵢ₌₁¹⁰ mᵢ = 9.766   and   √(σ²/10) = √(16/10) = 1.265

Since now the confidence level is 99.7%, then 1 − α = 0.997 and consequently α = 0.003. Therefore the desired quantile is z_{α/2} = z_{0.003/2} = z_{0.0015} ≅ 2.97.

The radius of the confidence interval is then given by z_{α/2}·√(σ²/n) = 2.97 · 1.265 = 3.7571.

Therefore the confidence interval is [(9.766 − 3.7571), (9.766 + 3.7571)] = [6.0089, 13.5231].

We have 99.7% confidence that the true value of the mean magnitude of the bright stars in that sky region is between 6.01 and 13.52.

In practice, by increasing the confidence level, the radius of the confidence interval increases. This is expected, since a higher confidence implies enlarging the interval for the true value of the mean estimator.


Confidence Statistics

M. Brescia - Data Mining - lezione 4 21

What changes if the variance is unknown?

In the most frequent real-world cases, a precise estimate of the variance of a population is difficult (if not impossible) to obtain…

Variance unknown (real world)

The formula

σ² = (1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² = (n/(n−1)) [ (1/n) Σᵢ₌₁ⁿ xᵢ² − x̄² ]   (4)

is the sampling variance corrected by the factor n/(n−1), due to the fact that, for small samples, the sampling variance is a distorted estimate, whose precision increases with the sample size. For large samples n/(n−1) ≈ 1 and (4) becomes the standard expression of the variance.

In such cases, to obtain a correct confidence interval for the mean μ of the population, we must consider that the random variable

(x̄ − μ) / √(σ²/n)

(with σ² now the sample estimate (4)) follows the Student's t distribution with n−1 Degrees of Freedom (DoF), where n is the size of the extracted sample.


Student T-distribution

M. Brescia - Data Mining - lezione 4 22

The Student t-distribution describes small samples drawn from a normally distributed full population. It is useful to evaluate the difference between two sample means, in order to assess their statistical significance. It occurs whenever the following random variable is considered:

t = (X̄ − μ) / (S/√N)   (5)

with the variance of the compared sample

S² = (1/(N−1)) Σᵢ₌₁ᴺ (Xᵢ − X̄)²

This statistic, if the sample is Gaussian, is the ratio between a standard normal N(0,1) and the square root of a chi-square divided by its N−1 degrees of freedom.

The t-Student is symmetric around zero like a Gaussian, but has "heavier tails" than a normal distribution, i.e. values far from 0 have a higher probability of being drawn than in the case of a standard Gaussian. Such differences decrease as the sample size increases: in the figure, the t-Student with ν = ∞ degrees of freedom approximates the Gaussian N(0,1). [Figure legend: Gaussian; t-Student]


Student T-distribution

M. Brescia - Data Mining - lezione 4 23

The construction of a confidence interval for the mean estimator is similar to the previous case, and here the quantiles again play a key role. By taking the quantile of the t-Student distribution (n−1 DoF) of order 1−α/2, defined as t_{n−1,α/2}, the confidence interval is derived by the usual chain of inequalities:

[ x̄ − t_{n−1,α/2}·√(S²/n) , x̄ + t_{n−1,α/2}·√(S²/n) ]

The probability that this interval contains the true value of the mean μ of the population is approximately equal to 1−α.

In this example the radius turns out to be smaller than the one obtained previously: the sample provides an estimate of the variance smaller than the σ² = 16 assumed in the known-variance case, and this outweighs the larger t quantile.
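A sketch of the variance-unknown interval, reusing the 10 magnitudes of the earlier example but now estimating S² from the data (scipy is assumed for the t quantile):

```python
import math
from scipy.stats import t

mags = [7.36, 11.91, 12.91, 9.77, 5.99, 10.91, 9.57, 11.01, 6.11, 12.12]
n = len(mags)
mean = sum(mags) / n
s2 = sum((m - mean) ** 2 for m in mags) / (n - 1)           # sample variance S^2
radius = t.ppf(1 - 0.05 / 2, df=n - 1) * math.sqrt(s2 / n)  # 95% level, n-1 DoF
print(round(mean - radius, 2), round(mean + radius, 2))
```

Here the data yield S² ≈ 6.3, well below the σ² = 16 assumed before, so the interval comes out narrower despite the larger t quantile.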


Recap of Confidence Statistics

M. Brescia - Data Mining - lezione 4 24

Summarising (under the hypothesis that the population is normally distributed):

Known variance:    x̄ ± z_{α/2}·√(σ²/n)
Unknown variance:  x̄ ± t_{n−1,α/2}·√(S²/n)

In many real situations it is preferred to infer an interval for a parameter estimate, rather than a single value. Such an interval should also indicate the error associated with the estimate.

A confidence interval for any parameter ϴ (such as the mean or the variance) of a population is an interval, bounded by two limits Linf and Lsup, with a defined probability (1−α) of containing the true parameter of the population:

p(Linf < ϴ < Lsup) = 1 − α

where 1−α is the confidence level and α the error probability.


PDF statistics

M. Brescia - Data Mining - lezione 4 25

As underlined before, p(z) cannot be verified on a single-point basis. A large sample, however, does support p(z) verification. Let's assume we have a population of N observed objects. For a limited number of them we know their real nature. To the others we applied an estimation model which predicted their nature with a residual uncertainty, i.e. it provides a probability p(z). We want to verify the reliability and accuracy of such an estimation.

The key concept is the confidence interval (CI).

We can analyze the over/under-confidence of our model prediction by checking whether x% of the samples have their true value within their x% CI, y% have the true value within their y% CI, etc.

This can be done by calculating and plotting the Empirical Cumulative Distribution Function (ECDF) after having obtained the p(z) for our model (known as the posterior probability).


Empirical Cumulative Distribution Function

M. Brescia - Data Mining - lezione 4 26

An empirical cumulative distribution function (ECDF) is a non-parametric estimator of the underlying CDF of a random variable. It assigns a probability to each datum, orders the data from smallest to largest in value, and calculates the sum of the assigned probabilities up to and including each datum. The result is a step function that increases at each datum. The ECDF is usually denoted by F̂(x) or P(X ≤ x) and is defined as

F̂(x) = n⁻¹ Σᵢ₌₁ⁿ I(xᵢ ≤ x)

where I() is the indicator function:

I(xᵢ ≤ x) = 1 if xᵢ ≤ x, 0 if xᵢ > x

Essentially, to calculate the value of F̂(x) at x, simply (1) count the number of data points less than or equal to x; (2) divide the number found by the total number n of data points in the sample.
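The two steps translate directly into code; a minimal sketch (the data values are illustrative):

```python
def ecdf(data):
    # Return the empirical CDF of the sample as a callable step function.
    xs = sorted(data)
    n = len(xs)
    def F(x):
        # Count data points <= x, divide by the sample size.
        return sum(1 for xi in xs if xi <= x) / n
    return F

F = ecdf([3.0, 1.0, 2.0, 2.0])
print(F(0.5), F(2.0), F(3.0))  # 0.0 0.75 1.0
```

Each jump of the step function has height 1/n (or a multiple of it at tied values), and F reaches 1 at the largest datum.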


Empirical Cumulative Distribution Function

M. Brescia - Data Mining - lezione 4 27

The ECDF is useful because:

• it approximates the true CDF well if the sample size (the number of data points) is large, and knowing the distribution is helpful for statistical inference;

• a plot of the ECDF can be visually compared to known CDFs of frequently used distributions, to check whether the data came from one of those common distributions;

• it can visually display "how fast" the CDF increases to 1; hence, it can be useful to "get a feel" for the data, for example to check for over- or under-confidence of any prediction (Wittman et al. 2016).