UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA SCHOOL OF SOFTWARE ENGINEERING OF USTC
Data Visualisation & Interpretation
The art of reading datasets
Devert Alexandre
School of Software Engineering of USTC
14 February 2012
Descriptive statistics
Descriptive statistics give a general summary of the data
Mean
Example of a descriptive statistic: the arithmetic mean

$$\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i$$
$$\bar{a} = \frac{1}{n} (a_1 + a_2 + \cdots + a_n)$$
The mean is also defined in $\mathbb{R}^n$ ⇒ it is the geometric center of the points
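A minimal sketch of the geometric center, assuming each point is a tuple of coordinates of the same length. The function name `centroid` and the sample points are illustrative and do not come from the slides.

```python
def centroid(points):
    """Return the component-wise arithmetic mean of equal-length points."""
    n = len(points)
    dims = len(points[0])
    # Average each coordinate independently.
    return [sum(p[d] for p in points) / n for d in range(dims)]

# Corners of a 4 x 2 rectangle (made-up data).
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
center = centroid(points)
print(center)  # [2.0, 1.0]
```

The mean of each coordinate gives the middle of the rectangle, which is what "geometric center" means here.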
Mean computation
Do you think it is easy to compute the mean? Consider this sum:
0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1
A naive summation algorithm will return this

    >>> 0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1+0.1
    0.8999999999999999
An accurate summation algorithm will return this. Note that math.fsum takes a sequence of values, not an already computed sum.

    >>> import math
    >>> math.fsum([0.1] * 9)
    0.9
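Algorithms such as the Kahan summation algorithm or the Shewchuk summation algorithm reduce the numerical error

    def KahanSum(data):
        s = 0.0  # running sum
        c = 0.0  # compensation: low-order bits lost so far
        for i in range(len(data)):
            y = data[i] - c   # apply the correction to the next term
            t = s + y         # low-order bits of y may be lost here
            c = (t - s) - y   # recover what was lost
            s = t
        return s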
Listing 1: Kahan summation
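To see the effect on the nine-term sum shown earlier, the sketch below compares three approaches. The names `naive_sum` and `kahan_sum` are illustrative. The naive version uses an explicit loop so that the result does not depend on how the built-in `sum` is implemented, since some Python versions compensate internally.

```python
import math

def naive_sum(data):
    """Plain left-to-right accumulation, no error compensation."""
    s = 0.0
    for x in data:
        s += x
    return s

def kahan_sum(data):
    """Kahan compensated summation, as in Listing 1."""
    s = 0.0
    c = 0.0
    for x in data:
        y = x - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

values = [0.1] * 9
naive = naive_sum(values)
kahan = kahan_sum(values)
exact = math.fsum(values)

print("naive:", naive)
print("kahan:", kahan)
print("fsum: ", exact)
```

The naive result drifts away from 0.9 after only nine additions, while the compensated versions stay at or next to the correctly rounded value.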
Central tendency
The mean is a measure of central tendency ⇒ the main behaviour, the main value of some phenomenon
Mean robustness
The mean is not a robust estimator of the central tendency: a single extreme value can move it arbitrarily far
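A small illustration of this lack of robustness, using made-up measurements in which one value is corrupted (for instance a lost decimal point). The variable names are illustrative.

```python
import math

def mean(data):
    """Arithmetic mean using an accurate summation."""
    return math.fsum(data) / len(data)

clean = [10.1, 9.9, 10.0, 10.2, 9.8]
# Same data, but the last value was recorded as 980.0 instead of 9.8.
corrupted = clean[:-1] + [980.0]

clean_mean = mean(clean)
corrupted_mean = mean(corrupted)
print(clean_mean, corrupted_mean)
```

One bad sample out of five is enough to move the mean far away from every correct measurement.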
Median
The median $\tilde{a}$ is the value such that 50% of the values are higher and 50% of the values are lower

a = [6, 1, 7, 9, 6, 3, 4, 5, 2]
sorted: a = [1, 2, 3, 4, 5, 6, 6, 7, 9]
$\tilde{a} = 5$
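With an even number of values, the median is the average of the two middle values

a = [6, 1, 7, 9, 6, 3, 4, 8, 5, 2]
sorted: a = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9]
$$\tilde{a} = \frac{1}{2}(5 + 6) = 5.5$$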
Median computation
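To compute the median $\tilde{a}$ of $n$ samples, you can

1 sort the list of samples
2 pick the middle of the sorted list
  • if the size is odd → $\tilde{a} = a_{\frac{n+1}{2}}$
  • if the size is even → $\tilde{a} = \frac{1}{2}\left(a_{\frac{n}{2}} + a_{\frac{n}{2}+1}\right)$

Note that these formulas use indexes starting from 1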
Let's code some Python

    def median(data):
        data = sorted(data)  # sorted copy, the caller's list is left untouched
        n = len(data)
        if n % 2 == 0:
            m = n // 2  # integer division, required for indexing in Python 3
            return 0.5 * (data[m - 1] + data[m])
        else:
            return data[(n - 1) // 2]
    >>> a = [6, 1, 7, 9, 6, 3, 4, 5, 2]
    >>> median(a)
    5
The median has an equivalent in $\mathbb{R}^n$ ⇒ the median center
Compute the median for each dimension to get the median center
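The per-dimension rule can be sketched as follows, assuming points are tuples of equal length. The function name `median_center` and the sample points are illustrative; `statistics.median` is the standard library median.

```python
import statistics

def median_center(points):
    """Median of each coordinate, computed independently."""
    dims = len(points[0])
    return [statistics.median(p[d] for p in points) for d in range(dims)]

# Made-up 2D points; the last one has an extreme x coordinate.
points = [(1, 10), (2, 20), (3, 30), (4, 40), (100, 50)]
print(median_center(points))  # [3, 30]
```

The extreme x value of the last point does not move the median center, whereas it would pull the mean of the x coordinates up to 22.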
Median robustness
The median is a more robust estimator of the central tendency

• green is the median
• pink is the arithmetic mean
Statistical dispersion
The following datasets have the same central tendency
But they have different dispersions
Standard deviation
A traditional measure of dispersion is the standard deviation $\sigma$, the square root of

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2$$
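As a worked instance of this formula, here is a direct two-pass sketch (mean first, then squared deviations). The function name `sample_std` and the data are illustrative; the result is checked against the standard library's `statistics.stdev`, which uses the same n − 1 denominator.

```python
import math
import statistics

def sample_std(data):
    """Standard deviation with the n - 1 denominator, computed in two passes."""
    n = len(data)
    mean = math.fsum(data) / n
    squared = math.fsum((x - mean) ** 2 for x in data)
    return math.sqrt(squared / (n - 1))

a = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations add up to 32
sigma = sample_std(a)
print(sigma, statistics.stdev(a))
```

For this data the value is the square root of 32 / 7, roughly 2.14.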
Standard deviation computation
Numerically robust computation of the standard deviation ⇒ Knuth-Welford algorithm

    import math

    def stdDev(data):
        n = 0
        mean = 0
        M2 = 0  # running sum of squared deviations
        # Shifting the data by an estimate of the mean keeps the values small
        meanEstimate = math.fsum(data) / len(data)
        for x in data:
            y = x - meanEstimate
            n = n + 1
            delta = y - mean
            mean = mean + delta / n
            M2 = M2 + delta * (y - mean)
        return math.sqrt(M2 / (n - 1))
Standard deviation
The standard deviation suffers from the same robustness issues as the mean. We will see why later.
Quartiles
The lower quartile or first quartile is the value such that 75% of the values are higher and 25% of the values are lower

a = [6, 1, 2, 7, 9, 6, 3, 4, 5, 2, 6]
sorted: a = [1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
$q_1 = 2$
The upper quartile or third quartile is the value such that 25% of the values are higher and 75% of the values are lower

a = [6, 1, 2, 7, 9, 6, 3, 4, 5, 2, 6]
sorted: a = [1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
$q_3 = 6$
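Several conventions exist for computing quartiles. The sketch below assumes one of them, the "median of each half" rule (with the overall median left out of both halves when the size is odd), chosen because it reproduces the values of the example above. The function names are illustrative.

```python
def median_sorted(s):
    """Median of an already sorted list."""
    n = len(s)
    mid = n // 2
    if n % 2 == 0:
        return 0.5 * (s[mid - 1] + s[mid])
    return s[mid]

def quartiles(data):
    """Return (q1, median, q3) using the median-of-halves convention."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # For an odd size, skip the middle element itself.
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return median_sorted(lower), median_sorted(s), median_sorted(upper)

a = [6, 1, 2, 7, 9, 6, 3, 4, 5, 2, 6]
q1, q2, q3 = quartiles(a)
print(q1, q2, q3)  # 2 5 6
```

Other conventions, such as interpolating between neighbouring samples, can give slightly different values on small datasets.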
Where is the second quartile? ⇒ It is the median
Interquartile range
The difference $q_3 - q_1$ is the interquartile range or IQR ⇒ it is a more robust dispersion measure
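To make the robustness claim concrete, the sketch below replaces the largest value of a small made-up dataset with an outlier and compares both dispersion measures. It relies on `statistics.quantiles` (Python 3.8 or later), whose default interpolation convention is one of several possible choices; the helper name `iqr` is illustrative.

```python
import statistics

def iqr(data):
    """Interquartile range from the standard library quartiles."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

clean = [1, 2, 2, 3, 4, 5, 6, 6, 6, 7, 9]
dirty = clean[:-1] + [900]   # largest value replaced by an outlier

print("IQR:  ", iqr(clean), iqr(dirty))
print("stdev:", statistics.stdev(clean), statistics.stdev(dirty))
```

The IQR is unchanged by the outlier, while the standard deviation grows by about two orders of magnitude.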
Normal distribution
A model for random variables, with 2 parameters µ and σ
[Figure: probability density of a normal distribution, x from −6 to 6, density from 0 to 0.40]
Normal distributions have 2 parameters µ and σ.

$$\Phi(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$$

This is the probability density of the normal distribution.
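The density can be evaluated directly from the formula. In this sketch the function name `normal_pdf` and its default parameters (µ = 0, σ = 1) are illustrative; a simple Riemann sum checks that the area under the curve is close to 1, as expected for a probability density.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of the normal distribution at x."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

peak = normal_pdf(0.0)   # highest point of the curve, reached at x = mu

# Approximate the area under the curve on [-8, 8] with small rectangles.
step = 0.001
area = sum(normal_pdf(-8.0 + i * step) for i in range(16001)) * step

print(peak, area)
```

For µ = 0 and σ = 1 the peak is about 0.399, which matches the top of the vertical axis in the plots.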
It gives the probability density at x, that is, how likely values near x are according to this distribution.
µ is the mode, the central tendency of the normal distribution
[Figure: normal probability density, x from −6 to 6, density from 0 to 0.40]
If some data follow a normal distribution, then the arithmetic mean estimates µ

$$\mu \approx \bar{a}$$

The more samples, the more "true" it will be
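A quick simulation illustrates this. The parameters µ = 5 and σ = 2, the sample sizes and the random seed are arbitrary choices made for this sketch.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable
mu, sigma = 5.0, 2.0

estimates = {}
for n in (10, 1000, 100000):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    estimates[n] = statistics.fmean(sample)
    print(n, estimates[n])
```

The printed means should get closer to 5 as the sample size grows, although any single small sample can land close to µ by chance.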
σ controls the shape of the normal distribution
[Figure: normal probability density curves, x from −6 to 6, density from 0 to 1.6]
If some data follow a normal distribution, σ is estimated by

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2$$

The standard deviation comes from here ⇒ dispersion of a normal distribution
µ and σ are completely independent parameters
[Figure: normal probability density curves, x from −6 to 6, density from 0 to 1.6]
Practical interpretation of the normal distribution
[Figure: normal density divided into bands at µ ± 1σ, µ ± 2σ and µ ± 3σ; on each side of µ the band areas are 34.1%, 13.6%, 2.1% and 0.1%, density axis from 0 to 0.4]
68% of the values within [µ− σ, µ + σ]
95% of the values within [µ− 2σ, µ + 2σ]
99.7% of the values within [µ− 3σ, µ + 3σ]
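These three percentages can be recovered numerically: for a normal distribution, the fraction of values within k standard deviations of µ is erf(k / √2), where erf is the error function from the standard library. The function name `fraction_within` is illustrative.

```python
import math

def fraction_within(k):
    """Fraction of a normal distribution lying in [mu - k*sigma, mu + k*sigma]."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```

The result does not depend on µ or σ, which is why the rule applies to any normal distribution.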
Skewed distributions

Your data might not have a symmetric distribution ⇒ they might have a skewed distribution
[Figure: density of a skewed distribution, x from 0 to 3, with three vertical markers]

• red is the true central tendency
• green is the median
• pink is the arithmetic mean
You can compute the skewness of your data

$$\frac{\frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^2 \right)^{\frac{3}{2}}}$$
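A direct transcription of this formula, as a sketch. The function name `skewness` and both datasets are illustrative.

```python
import math

def skewness(data):
    """Third central moment divided by the second central moment to the power 3/2."""
    n = len(data)
    mean = math.fsum(data) / n
    m2 = math.fsum((x - mean) ** 2 for x in data) / n
    m3 = math.fsum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right side

print(skewness(symmetric))      # 0.0
print(skewness(right_tailed))   # positive
```

A value near zero suggests a symmetric distribution, a positive value a longer tail towards large values, and a negative value a longer tail towards small values.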
Multimodal distributions
Your data might have multiple modes
[Figure: density with several modes, x from −3 to 5, density from 0 to 0.9]
In such a case, the mean, the median and other descriptive quantities might have no reliable meaning
[Figure: density with several modes, x from −3 to 5, density from 0 to 0.9]
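A tiny made-up example with two modes shows the problem: the mean and the median both land in a region where there are no samples at all. All names and values below are illustrative.

```python
import statistics

low_mode = [0.9, 1.0, 1.0, 1.1]    # samples clustered around 1
high_mode = [4.9, 5.0, 5.0, 5.1]   # samples clustered around 5
data = low_mode + high_mode

mean = statistics.fmean(data)
median = statistics.median(data)
# Distance from the mean to the nearest actual sample.
closest = min(abs(x - mean) for x in data)

print(mean, median, closest)
```

Both summaries are close to 3, almost two units away from the nearest observation, so neither describes a typical value of this dataset.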
Devert Alexandre (School of Software Engineering of USTC) — Data Visualisation & Interpretation — Slide 36/1
![Page 48: Data Visualisation & Interpretationmarmakoide.org › download › teaching › dm › dm-visual.pdf · y = data [ i ] c t = s + y c = ( t s ) y s = t returns Listing 1: Kahan summation](https://reader035.fdocuments.us/reader035/viewer/2022070817/5f13521bbfdfc52e0a58d2a9/html5/thumbnails/48.jpg)
UNIVERSITY OF SCIENCE AND TECHNOLOGY OF CHINA SCHOOL OF SOFTWARE ENGINEERING OF USTC
multimodal distributionsIn such case, the mean, median and other descriptivequantities might have no reliable meaning
−3 −2 −1 0 1 2 3 4 50.0
0.2
0.4
0.6
0.8
1.0
Table of Contents
Observe your data
Descriptive statistics can completely miss important information in your data!
Observe your data
Anscombe's quartet
[Figure: scatter plots of the datasets in Anscombe's quartet]
Those 4 datasets have almost exactly the same
• mean
• variance
• regression line
But they are not the same at all!
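The claim can be checked numerically. The sketch below uses the values of Anscombe's quartet as published in 1973; the helper name `summarize` and the dictionary layout are choices made for this illustration. It computes the mean, the sample variance and the least-squares regression line of each dataset:

```python
import statistics

# Anscombe's quartet (Anscombe, 1973). Datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Mean, sample variance and least-squares line y = intercept + slope * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": statistics.variance(x),
        "mean_y": my,
        "var_y": statistics.variance(y),
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

Rounded to two decimals, the four summaries agree, although the statistics of y differ slightly in the third decimal. Only a plot reveals how different the datasets are.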
Boxplot
A nice way to summarize the distribution of the data is the boxplot
Boxplot
The red mark shows the median
The box goes from the lower quartile to the upper quartile
The box thus holds the middle half of the data, and the median lies inside it (not necessarily at its centre)
The whiskers extend to the minimum and maximum values that are not outliers
Outlier values are shown as blue crosses
Outliers are values that lie more than 1.5 × IQR beyond the quartiles, where the IQR (interquartile range) is the upper quartile minus the lower quartile
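The quantities drawn by a boxplot can be computed directly. The following is a minimal sketch, assuming the "inclusive" quartile convention of Python's `statistics.quantiles` (plotting libraries may use slightly different quartile definitions); the function name `boxplot_summary`, the parameter `k` and the sample values are hypothetical, chosen for illustration:

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Quartiles, whisker ends and outliers, using the k * IQR rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Whiskers stop at the most extreme values that are still inside the fences
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),
        "outliers": outliers,
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

For this sample the quartiles are 3.25 and 7.75, so the IQR is 4.5 and the upper fence is 14.5: the value 30 is reported as an outlier, and the whiskers end at 1 and 9.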
Scatter plot
A scatter plot is simply a plot with the data as points along 2 dimensions
[Figure: scatter plot; x axis from −3 to 3, y axis from −5 to 4]