Statistics I
Chapter 2: Analysis of univariate data
Numerical summary
| Central tendency | Location | Spread | Form |
|---|---|---|---|
| mean | quartiles | range | coeff. of asymmetry |
| median | percentiles | interquartile range | coeff. of kurtosis |
| mode | | variance | |
| | | standard deviation | |
| | | coeff. of variation | |
Descriptive statistics
- What are they useful for?
- Can we calculate them for all types of variables?
- Which are the most useful in each case?
- How can we compute them with a calculator or Excel?
Measures of central tendency
- The mean
- The median
- The mode
Central tendency: the (arithmetic) mean
The (arithmetic) mean
The mean is the average of all the data:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$$
- It is the most common measure of location
- It is the center of gravity of the data
- It can be calculated only for quantitative variables
The mean: example
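For the experience (in years) of the 46 professionals of a computer company, what is the mean?
$$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5 \text{ years}$$
How can we calculate it using the absolute frequency table? And using the relative one?

| Experience, $x_i$ | Absolute freq., $n_i$ | Relative freq., $f_i$ |
|---|---|---|
| 1 | 5 | 0.109 |
| 2 | 4 | 0.087 |
| 3 | 4 | 0.087 |
| 4 | 4 | 0.087 |
| 5 | 3 | 0.065 |
| 6 | 4 | 0.087 |
| 7 | 1 | 0.022 |
| 8 | 4 | 0.087 |
| 10 | 4 | 0.087 |
| 11 | 2 | 0.043 |
| 12 | 2 | 0.043 |
| 13 | 2 | 0.043 |
| 14 | 1 | 0.022 |
| 15 | 1 | 0.022 |
| 16 | 3 | 0.065 |
| 17 | 1 | 0.022 |
| 20 | 1 | 0.022 |
| Total | 46 | 1 |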
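As a rough sketch of how both frequency tables lead to the same mean, the snippet below recomputes the example in Python. The dictionary name `experience` and the other variable names are illustrative choices, not part of the original slides, and the relative frequencies are computed exactly as $n_i/n$ rather than read from the rounded table.

```python
# Absolute frequency table from the example: {experience x_i: frequency n_i}
experience = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
              11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

n = sum(experience.values())  # total number of professionals

# Mean from absolute frequencies: sum(x_i * n_i) / n
mean_abs = sum(x * n_i for x, n_i in experience.items()) / n

# Mean from relative frequencies: sum(x_i * f_i), with f_i = n_i / n
rel_freq = {x: n_i / n for x, n_i in experience.items()}
mean_rel = sum(x * f for x, f in rel_freq.items())

print(n, mean_abs)         # 46 7.5
print(round(mean_rel, 6))  # 7.5
```

Both routes give 7.5 years, matching the raw-data calculation; using the rounded $f_i$ from the table would introduce a small rounding error.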
The mean with grouped data
The same formula applies, using the midpoint of each interval in place of $x_i$.
For the salary of the 46 professionals of a computer company, what is the mean?
Note: the mean salary using the raw data equals 17250.413
The mean: properties
- Linearity: if $Y = a + bX$, then $\bar{y} = a + b\bar{x}$
  If the 46 professionals' salaries are increased by 2 %, how does the mean salary change?
  Afterwards the salaries are reduced by 100 dollars. What is the final mean salary?
- Disadvantage: affected by extreme values (outliers)
  Example: $X$: 3, 1, 5, 4, 2 and $Y$: 3, 1, 50, 4, 2
  $$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3 \qquad \bar{y} = \frac{3 + 1 + 50 + 4 + 2}{5} = 12$$
  A single extreme value has multiplied the mean by 4!
  When the data are skewed, a robust alternative measure of central tendency is more appropriate.
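The two properties can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the constants `a` and `b` encode the salary questions above (a 2 % raise followed by a 100-dollar cut), and applying them to the raw-data mean salary of 17250.413 is one way to work the exercise under the linearity property, not an official solution.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]  # same data with one extreme value

print(mean(x))  # 3
print(mean(y))  # 12

# Linearity: if every observation becomes a + b*x_i, the mean becomes a + b*mean(x)
a, b = -100, 1.02  # 2 % raise, then a 100-dollar reduction
transformed = [a + b * xi for xi in x]
assert abs(mean(transformed) - (a + b * mean(x))) < 1e-9

# Applied to the mean salary computed from the raw data (17250.413)
mean_salary = 17250.413
final_mean = a + b * mean_salary
print(round(final_mean, 2))  # 17495.42
```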
Central tendency: the median
...is the most central datum
1 1 1 3 3 5 5 7 8 8 9
1. Order the data from smallest to largest
2. Include repetitions
3. The median is the physical centre
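1 1 1 3 3 5 5 7 8 8 ⇒ $M = \frac{3 + 5}{2} = 4$
Median
Ordered list from smallest to largest: $x_{(1)}, x_{(2)}, \dots, x_{(n)}$
$$M = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ odd} \\[4pt] \dfrac{x_{(n/2)} + x_{(n/2+1)}}{2} & \text{if } n \text{ even} \end{cases}$$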
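The definition translates almost directly into code. Below is a minimal sketch of the odd/even rule; the function name `median` is an illustrative choice, and the two calls reuse the lists from this slide.

```python
def median(data):
    """Median via the ordered list x_(1), ..., x_(n)."""
    ordered = sorted(data)  # order the data, keeping repetitions
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: x_((n+1)/2), which is index mid in 0-based indexing
        return ordered[mid]
    # n even: average of x_(n/2) and x_(n/2+1)
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]))  # 5
print(median([1, 1, 1, 3, 3, 5, 5, 7, 8, 8]))     # 4.0
```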
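The median via the table of frequencies

| Experience, $x_i$ | $n_i$ | $f_i$ | $N_i$ | $F_i$ |
|---|---|---|---|---|
| 1 | 5 | 0.109 | 5 | 0.109 |
| 2 | 4 | 0.087 | 9 | 0.196 |
| 3 | 4 | 0.087 | 13 | 0.283 |
| 4 | 4 | 0.087 | 17 | 0.370 |
| 5 | 3 | 0.065 | 20 | 0.435 (< 0.5) |
| 6 (= $M$) | 4 | 0.087 | 24 | 0.522 (> 0.5) |
| 7 | 1 | 0.022 | 25 | 0.543 |
| 8 | 4 | 0.087 | 29 | 0.630 |
| 9 | 0 | 0 | 29 | 0.630 |
| 10 | 4 | 0.087 | 33 | 0.717 |
| 11 | 2 | 0.043 | 35 | 0.761 |
| 12 | 2 | 0.043 | 37 | 0.804 |
| 13 | 2 | 0.043 | 39 | 0.848 |
| 14 | 1 | 0.022 | 40 | 0.870 |
| 15 | 1 | 0.022 | 41 | 0.891 |
| 16 | 3 | 0.065 | 44 | 0.957 |
| 17 | 1 | 0.022 | 45 | 0.978 |
| 18 | 0 | 0 | 45 | 0.978 |
| 19 | 0 | 0 | 45 | 0.978 |
| 20 | 1 | 0.022 | 46 | 1.000 |

The median is $M = 6$, the first value whose cumulative relative frequency $F_i$ exceeds 0.5.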
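The table lookup can be sketched in Python as follows. The function name `median_from_table` is hypothetical, and the tie rule (averaging with the next observed value when the cumulative frequency hits exactly one half) is an assumption taken from the even-$n$ case of the median definition rather than something stated on this slide.

```python
from itertools import accumulate

def median_from_table(freq):
    """Median from a {value: absolute frequency} table."""
    values = sorted(v for v in freq if freq[v] > 0)  # skip empty rows
    n = sum(freq[v] for v in values)
    cumulative = list(accumulate(freq[v] for v in values))  # N_i
    for i, (value, N) in enumerate(zip(values, cumulative)):
        if 2 * N > n:   # F_i > 0.5: this value is the median
            return value
        if 2 * N == n:  # F_i == 0.5: average with the next observed value
            return (value + values[i + 1]) / 2

experience = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
              11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}

print(median_from_table(experience))  # 6
```

Working with integer counts ($2N_i$ versus $n$) avoids comparing rounded relative frequencies against 0.5.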
The median: properties
- Linearity: if $Y = a + bX$, then $M_y = a + bM_x$
  If the 46 professionals' salaries are increased by 2 %, how does the median salary change?
  Afterwards the salaries are reduced by 100 dollars. What is the final median salary?
- Can we calculate the median with the education level data?
  Can we calculate the median with the 0-1 position of responsibility variable?
- Advantage: not affected by outliers
  Example: $X$: 3, 1, 5, 4, 2 and $Y$: 3, 1, 50, 4, 2
  $$M_x = 3 \qquad M_y = 3$$
  When the data are skewed, the median is a better measure of central tendency than the mean.
The median and the mean for asymmetric data
Annual gross salary in 2014, Encuesta de Estructura Salarial 2014, I.N.E.
"The difference between the mean salary and the median salary arises because very high salaries strongly influence the mean, even though they correspond to few workers." (INE press release of 28 October 2016)
Central tendency: the mode
...is the most frequent value
The mode of the variable experience in the 46 professionals example is 1 year, with an absolute frequency of 5 employees.
The values 2, 3, 4, 6, 8 and 10 each have an absolute frequency of 4 employees.
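As an illustration, the most frequent value or values can be found by counting. This is a small sketch with `collections.Counter`; the helper name `modes` and the second, bimodal list are made up for the example, while `experience` is the frequency table used throughout this chapter.

```python
from collections import Counter

def modes(data):
    """Return every value that attains the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

experience = {1: 5, 2: 4, 3: 4, 4: 4, 5: 3, 6: 4, 7: 1, 8: 4, 10: 4,
              11: 2, 12: 2, 13: 2, 14: 1, 15: 1, 16: 3, 17: 1, 20: 1}
# Expand the frequency table back into the 46 raw observations
raw = [x for x, n_i in experience.items() for _ in range(n_i)]

print(modes(raw))              # [1]
print(modes([1, 1, 2, 3, 3]))  # [1, 3]
```

Returning a list rather than a single value keeps the function usable when the data have several modes.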
Central tendency: the mode
Does this definition make sense with the education level data?
Does this definition make sense with the 0-1 position of responsibility variable?
Central tendency: the mode
Does this definition make sense with continuous data? ⇒ modal interval
The mode: properties
- It can be calculated for both qualitative and quantitative variables. Indeed, of the three descriptive measures (mean, median, mode), it is the only one that makes sense for nominal qualitative variables.
- Not affected by outliers
- There can be no mode.
- There can be more than one mode: bimodal, trimodal, plurimodal
  What can this indicate?
Location measures
- Quartiles
- Percentiles
Location measures: quartiles and percentiles
- Quartiles split the ranked data into four segments with an equal number of values per segment.
- Percentiles split the ranked data into a hundred segments with an equal number of values per segment.
1. Order the data from smallest to largest
2. Include repetitions
3. Select each quartile (percentile) according to:
   - The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$.
   - The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$.
   - The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$.
   - The $k$-th percentile $P_k$ has position $k(n + 1)/100$, $k = 1, \dots, 99$.
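The position rule can be sketched as below. The slide gives only the position formula, so two details here are assumptions: a non-integer position is resolved by linear interpolation between the two neighbouring ordered values, and positions outside $1, \dots, n$ are clamped to the nearest end. Function names are illustrative.

```python
def value_at_position(ordered, pos):
    """Value at 1-based position pos in an ordered list.

    Assumption: fractional positions are linearly interpolated, and
    positions outside 1..n are clamped to the ends of the list.
    """
    n = len(ordered)
    pos = min(max(pos, 1), n)
    lower = int(pos)    # integer part of the position
    frac = pos - lower  # fractional part
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def percentile(data, k):
    """k-th percentile, located at position k(n + 1)/100."""
    ordered = sorted(data)
    return value_at_position(ordered, k * (len(ordered) + 1) / 100)

def quartiles(data):
    """(Q1, Q2, Q3) as the 25th, 50th and 75th percentiles."""
    return tuple(percentile(data, k) for k in (25, 50, 75))

data = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]  # n = 11, positions 3, 6 and 9
print(quartiles(data))  # (1, 5, 8)
```

With $n = 11$ all three positions are whole numbers, so no interpolation is needed; software packages often use other conventions and may return slightly different quartiles.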
Quartiles: example
Percentiles: example
Measures of spread
- The range and the interquartile range
- The variance and the standard deviation
- The coefficient of variation
Variation: range and interquartile range (IQR)
- The range is the simplest measure of variation
  $$R = x_{\max} - x_{\min}$$
  - Ignores the way the data are distributed
  - Sensitive to outliers
  Example: given observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$
  Example: given observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$
- The interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50 % of the data
  $$\mathrm{IQR} = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$$
Variation: Interquartile range and boxplot
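- Outliers are observations that fall
  - below the value of $Q_1 - 1.5 \cdot \mathrm{IQR}$
  - above the value of $Q_3 + 1.5 \cdot \mathrm{IQR}$
- For extreme outliers, replace 1.5 by 3 in the above definition
[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, $Q_2$ (median) $= 31$, $Q_3 = 42$, $x_{\max} = 58$. Each of the four segments holds 25 % of the data, and $\mathrm{IQR} = 18$.]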
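For the boxplot values above, the outlier limits can be worked out as in this short sketch. The function name `outlier_fences` is an illustrative choice, and the observations 70 and 120 are hypothetical additions used only to show the rule flagging something; the quartiles are held fixed at the figure's values rather than recomputed.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the limits (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

q1, q3 = 24, 42                                   # quartiles from the boxplot
lower, upper = outlier_fences(q1, q3)             # -3.0 and 69.0
ext_lower, ext_upper = outlier_fences(q1, q3, 3)  # -30 and 96

# 70 and 120 are hypothetical extra observations, not part of the example
data = [12, 24, 31, 42, 58, 70, 120]
outliers = [x for x in data if x < lower or x > upper]
extreme = [x for x in data if x < ext_lower or x > ext_upper]

print(outliers)  # [70, 120]
print(extreme)   # [120]
```

Neither $x_{\min} = 12$ nor $x_{\max} = 58$ falls outside the limits $(-3, 69)$, so the original five-number summary contains no outliers.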
Measure of variation: variance
- Average of squared deviations of values from the mean
- Population variance
  $$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
- Sample variance (divided by $n$)
  $$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\overbrace{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}^{\text{faster to calculate}}}{n}$$
- Sample quasi-variance (corrected sample variance, divided by $n - 1$)
  $$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$$
- They are related via
  $$\hat{\sigma}^2 = \frac{n - 1}{n} s^2$$
- If $a, b$ ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$
Measure of variation: standard deviation (SD)
- The most commonly used measure of spread
- Population standard deviation, sample standard deviation and sample quasi-standard deviation are, respectively,
  $$\sigma = \sqrt{\sigma^2} \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2} \qquad s = \sqrt{s^2}$$
- Shows variation about the mean
- Has the same units as the original data, whilst the variance is in units$^2$
- Variance and SD are both affected by outliers
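Calculating variance and standard deviation
Example: $X$: 11, 12, 13, 16, 16, 17, 18, 21; $Y$: 14, 15, 15, 15, 16, 16, 16, 17; $Z$: 11, 11, 11, 12, 19, 20, 20, 20
$$\bar{x} = \frac{124}{8} = 15.5 \qquad \bar{y} = \frac{124}{8} = 15.5 \qquad \bar{z} = \frac{124}{8} = 15.5$$
$$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$$
$$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$$
$$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$$
$$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429 \;\Rightarrow\; s_x = 3.3381$$
$$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571 \;\Rightarrow\; s_y = 0.9258$$
$$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571 \;\Rightarrow\; s_z = 4.5670$$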
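The hand calculation can be reproduced with a few lines of Python. This is a sketch using the shortcut formula for the quasi-variance; the function name `quasi_variance` and the dictionary `samples` are illustrative, and the last column applies the factor $(n-1)/n$ to obtain the divide-by-$n$ variance.

```python
from math import sqrt

def quasi_variance(data):
    """Quasi-variance via the shortcut (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x * x for x in data) - n * mean ** 2) / (n - 1)

samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, data in samples.items():
    n = len(data)
    s2 = quasi_variance(data)        # divided by n - 1
    var_n = (n - 1) / n * s2         # divided by n
    print(name, round(s2, 4), round(sqrt(s2), 4), round(var_n, 4))
```

The first two printed columns should match the values 11.1429, 0.8571, 20.8571 and their square roots obtained above.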
Comparing standard deviations
Example cont.: $X$: 11, 12, 13, 16, 16, 17, 18, 21; $Y$: 14, 15, 15, 15, 16, 16, 16, 17; $Z$: 11, 11, 11, 12, 19, 20, 20, 20
[Figure: dot plots of $X$, $Y$ and $Z$ on a common axis from 11 to 21. All three have mean 15.5, with $s_x = 3.3$, $s_y = 0.9$ and $s_z = 4.6$.]
Measure of variation: coefficient of variation (CV)
- Measures relative variation and is defined as
  $$CV = \frac{s}{|\bar{x}|}$$
- Is a unitless number (sometimes given as a percentage)
- Shows variation relative to the mean
Example: Stock A: average price last year = 50, standard deviation = 5. Stock B: average price last year = 100, standard deviation = 5.
$$CV_A = \frac{5}{50} = 0.10 \qquad CV_B = \frac{5}{100} = 0.05$$
Both stocks have the same SD, but stock B is less variable relative to its mean price.
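A minimal sketch of the same comparison in Python follows. The function name is an illustrative choice, and raising an error for a zero mean is an added safeguard, since the ratio is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """CV = s / |mean|; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / abs(mean)

cv_a = coefficient_of_variation(5, 50)   # stock A
cv_b = coefficient_of_variation(5, 100)  # stock B

print(cv_a, cv_b)  # 0.1 0.05
```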
Numerical summaries and frequency tables. Standardization.
- If the data are discrete, then
  $$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n} \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$$
- If the data are continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals
- To standardize a variable $x$ means to calculate
  $$\frac{x - \bar{x}}{s}$$
- If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed ones $z_1, \dots, z_n$, then the $z$'s have mean zero and standard deviation one
- Standardization = finding the z-score
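The claim about the standardized values can be checked numerically. This sketch uses `statistics.mean` and `statistics.stdev` (the latter divides by $n - 1$, matching the quasi-standard deviation $s$) on the sample $X$ from the variance example; variable names are illustrative.

```python
from statistics import mean, stdev

x = [11, 12, 13, 16, 16, 17, 18, 21]
x_bar = mean(x)  # 15.5
s = stdev(x)     # quasi-standard deviation (divides by n - 1)

# z-score of each observation
z = [(xi - x_bar) / s for xi in x]

# Up to floating-point error, the z-scores have mean 0 and SD 1
assert abs(mean(z)) < 1e-12
assert abs(stdev(z) - 1) < 1e-12
```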
Measures of form
- Fisher's coefficient of asymmetry
- Fisher's coefficient of kurtosis
- Empirical rule
Shape: comparing mode, mean and median
Three types of distributions:
- Skewed to the left: Mean < Median < Mode
- Symmetric: Mean = Median = Mode
- Skewed to the right: Mode < Median < Mean
[Figure: three density curves, labelled left-skewed ($\bar{x} < M$), symmetric ($\bar{x} = M$) and right-skewed ($M < \bar{x}$).]
Note: The distribution in the middle is known as bell-shaped or normal
Measures of form: Asymmetry
- Fisher's coefficient of asymmetry:
  $$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3}$$
  The data are skewed to the right (positive skew) if $\gamma_1 > 0$, and skewed to the left (negative skew) if $\gamma_1 < 0$.
[Figure: two frequency histograms. Right skew ("Asimetría a la derecha"), $\gamma_1 = 2.236$; left skew ("Asimetría a la izquierda"), $\gamma_1 = -1.401$.]
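A small sketch of the coefficient of asymmetry in Python is given below. The slide does not say which standard deviation $S$ denotes, so the code assumes the divide-by-$n$ version; using the quasi-standard deviation instead would rescale the value but not change its sign. The two samples are the $X$ and $Y$ lists from the outlier example earlier in the chapter.

```python
def fisher_skewness(data):
    """gamma_1 = m3 / S^3, assuming S is the divide-by-n standard deviation."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # S^2
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

x = [3, 1, 5, 4, 2]   # symmetric about 3
y = [3, 1, 50, 4, 2]  # one large value on the right

print(fisher_skewness(x))      # 0.0
print(fisher_skewness(y) > 0)  # True
```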
Measures of form: kurtosis
- Fisher's coefficient of kurtosis:
  $$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4} - 3$$
- For the standard normal, $\gamma_2 = 0$. If $\gamma_2 > 0$ the distribution is leptokurtic (sharper than the standard normal); if $\gamma_2 < 0$ it is platykurtic.
[Figure: two density plots, a leptokurtic distribution ("Distribución Leptocúrtica") and a platykurtic distribution ("Distribución Platicúrtica").]
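The kurtosis coefficient can be sketched the same way. As with the asymmetry coefficient, the code assumes $S$ is the divide-by-$n$ standard deviation, which the slide does not state explicitly; both sample lists are invented purely for illustration.

```python
def fisher_kurtosis(data):
    """gamma_2 = m4 / S^4 - 3, assuming S is the divide-by-n standard deviation."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # S^2
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5]                       # evenly spread values
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]   # sharp centre, two far values

print(round(fisher_kurtosis(flat), 2))  # -1.3
print(fisher_kurtosis(peaked))          # 2.0
```

The evenly spread sample comes out platykurtic ($\gamma_2 < 0$) and the sharply peaked one leptokurtic ($\gamma_2 > 0$).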
Empirical rule
If the data are bell-shaped (normal), that is, symmetric and with light tails, the following rule holds approximately:
- 68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
- 95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
- 99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$
Note: This rule is also known as the 68-95-99.7 rule
Example: We know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95 % of the observations.
95 % of the $x_i$'s are in $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$
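The rule is easy to tabulate for any mean and standard deviation. The sketch below does so for the example; the names `COVERAGE` and `empirical_interval` are illustrative, and the percentages are only approximate for data that are roughly bell-shaped.

```python
# Approximate share of bell-shaped data within k standard deviations of the mean
COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}

def empirical_interval(mean, s, k):
    """Interval (mean - k*s, mean + k*s) for k = 1, 2 or 3."""
    return mean - k * s, mean + k * s

mean, s = 40, 5  # values from the example
for k, share in COVERAGE.items():
    print(f"{share:.1%} of the data in {empirical_interval(mean, s, k)}")

print(empirical_interval(mean, s, 2))  # (30, 50)
```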