Post on 28-Dec-2015
Introduction to R, Statistics, and the grammar of graphics Thomas INGICCO
E. Delacroix, Dante et Virgile aux Enfers (Dante and Virgil in Hell)
… but above all, it is a language with its own grammar, made of:
Classes – how you present your data
- Vector – series of values of 1 dimension
- Matrix – series of values of 2 dimensions
- Array – series of values of n dimensions
- Data Frame – series of values in columns
- List – series of objects
- Table – contingency table
# Array
ee <- array(1:4, dim=c(2, 3, 2))
ee <- array(1:4, c(2, 3, 2))
# Matrix of 10 landmarks with x and y coordinates
Data3d1 <- matrix(c(0.72,100.32,0.75,100.36,0.77,100.32,0.81,100.32,0.77,100.29,0.77,100.24,0.73,100.28,0.7,100.26,0.7,100.3,0.67,100.33), 10, 2, byrow=TRUE)
colnames(Data3d1) <- c("x", "y")
rownames(Data3d1) <- paste("Lan", 1:10, sep="")
t(Data3d1) # Transposition
Data3d2 <- Data3d1
# Stack the two matrices into a 10 x 2 x 2 array
array(cbind(Data3d1, Data3d2), dim=c(10, 2, 2))
# List
ff <- list(aa, bb, cc, dd)
# Table
hh <- table(gg)
hh <- table(gg, dd[1:6,11])
hhh <- data.frame(gg, dd[1:6,11])
colnames(hhh) <- c("gg","Lip") # Rename the columns
hhhh <- table(hhh)
data.frame(gg, na.omit(dd[1:6,11])) # Function na.omit
data.frame(gg, na.omit(dd[1:7,11]))
dim(hhhh) # Number of rows and columns
dimnames(hhhh)
margin.table(hhhh) # Calculate the margins
margin.table(hhhh, 1)
margin.table(hhhh, 2)
hhhh[3,] <- c(1000,2000) # Replace row 3
cbind(hhhh,hhh) # Concatenate the columns of two tables
t(hhhh) # Transposition
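The table operations above rely on objects (gg, dd) built on earlier slides. As a self-contained sketch, the following builds a small contingency table from two made-up variables and computes its margins. The names `shape`, `lip` and `tab`, and the values themselves, are hypothetical and only serve the illustration.

```r
# Hypothetical data: six pots described by rim shape and lip type
shape <- c("Everted", "Round", "Round", "Flat", "Flat", "Flat")
lip   <- c("Thick", "Thin", "Thick", "Thin", "Thin", "Thick")

tab <- table(shape, lip)   # cross-tabulate the two variables
dim(tab)                   # number of rows (shapes) and columns (lip types)
margin.table(tab)          # grand total
margin.table(tab, 1)       # row totals
margin.table(tab, 2)       # column totals
t(tab)                     # transposed table
```

Passing 1 or 2 as the second argument selects the dimension that is kept, so 1 gives one total per row and 2 one total per column.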
# Factor
gg <- rep(c("Everted", "Round", "Flat"), c(1,2,3))
is.vector(gg)
is.character(gg)
gg1 <- factor(gg)
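To see the classes side by side, here is a minimal sketch that creates one small object of each kind and inspects it. All object names and values are made up for the illustration; they are not the aa, bb, cc, dd objects of the slides.

```r
v  <- c(1.5, 2.5, 3.5)                     # vector: 1 dimension
m  <- matrix(1:6, nrow = 2, ncol = 3)      # matrix: 2 dimensions
a  <- array(1:12, dim = c(2, 3, 2))        # array: n dimensions (here 3)
df <- data.frame(id = 1:3, size = v)       # data frame: values in columns
l  <- list(v, m, a, df)                    # list: series of objects
f  <- factor(c("Round", "Flat", "Flat"))   # factor: categorical values
tb <- table(f)                             # contingency table

class(v); class(a); class(df); class(l); class(f); class(tb)
dim(a)        # dimensions of the array
levels(f)     # categories of the factor, sorted alphabetically
```

A list can hold objects of different classes and sizes, which is why it can store the vector, the matrix, the array and the data frame together.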
Individuals \ Variables   1     …   j     …   p
1                         x11   …   x1j   …   x1p
…                         …         …         …
i                         xi1   …   xij   …   xip
…                         …         …         …
n                         xn1   …   xnj   …   xnp
FAM      GEN        SP    ID           UNW   LNW    MTW    ATW    ANW   MDW    ADW    TL      NH     AGE
Gibbons  Hylobates  H.sp  1880_1167_D  7.11  7.74   10.99  9.26   8.16  9.42   9.59   188.31  10.37  A
Gibbons  Hylobates  H.sp  1880_1167_G  6.12  8.53   11.3   9.29   8.54  9.5    9.42   187.5   10.13  A
Gibbons  Hylobates  H.sp  1880_1170_D  6.18  9.72   10.81  8.91   7.69  8.05   8.78   177.24  8.94   A
Gibbons  Hylobates  H.sp  1880_1170_G  6.44  10.09  10.68  8.96   9.07  8.05   8.69   177.59  9.29   A
Gibbons  Hylobates  H.sp  1901_102_D   6.31  11.69  15.19  11.79  9.26  11.83  11.6   206.69  11.49  A
Gibbons  Hylobates  H.sp  1901_102_G   7.14  11.13  14.93  11.68  9.06  11.76  11.3   205.32  11.49  A
Different kinds of data
Continuous quantitative variable: length of dog calcaneum
{67.0 54.7 7.0 48.5 14.0 17.2 20.7 13.0 43.4 40.2 38.9 54.5 59.8 48.3 22.9 11.5 34.4 35.1 38.7 30.8 30.6 43.1 56.8 40.8 41.8 42.5 31.0 31.7 30.2 25.9 49.2 37.0 35.9 15.0 30.2 7.2 36.2 45.5 7.8 33.4 36.1 40.2 42.7 42.5 16.2 39.0 35.0 37.0 31.4 37.6 39.9 36.2 42.8 46.4 24.7 49.1 46.0 35.9 7.8 48.2 15.2 32.5 44.7 42.6 38.8 17.4 40.8 29.1 14.6 59.2}
Discrete quantitative variable: number of flakes per context
{1 0 3 3 0 0 1 1 0 0 1 1 0 2 2 1 0 1 0 0 1 3 0 0 0 2 0 2 5 0 0 0 0 1 1 0 0 0 1 0 0 1 4 0 2 2 1 2 2 2 1 1 0 2 0 0 1 0 4 2 0 0 2 3 1 1 1 0 0 1 0 0 2 0 0 0 2 2 0 0 1 0 2 2 0 1 0 3 3 0 2 0 2 2 3 0 3 1 0 0}
Qualitative variable: colour of the pot
{black, red, black, red, brown, brown, black, grey, red, black}
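As an illustration of how these three kinds of data could be stored in R, here is a short sketch using the first five values of each series. The object names `calcaneum`, `flakes` and `colour` are assumptions made for the example.

```r
# First five values of each example series
calcaneum <- c(67.0, 54.7, 7.0, 48.5, 14.0)                       # continuous: numeric
flakes    <- c(1L, 0L, 3L, 3L, 0L)                                # discrete counts: integer
colour    <- factor(c("black", "red", "black", "red", "brown"))   # qualitative: factor

is.numeric(calcaneum)   # TRUE
is.integer(flakes)      # TRUE
levels(colour)          # categories of the qualitative variable
table(colour)           # number of pots per colour
```

Storing a qualitative variable as a factor lets R treat it as a set of categories rather than as free text, which is what functions such as table() expect.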
Descriptive and inferential statistics
Position parameters:
- Mean
- Mode
- Median
Dispersion parameters:
- Standard deviation
- Variance
- Maximum and minimum
- Coefficient of variation
- Covariance
- Coefficient of correlation
Mean: we add all the measures and we divide by the number of measurements.
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
sum(DATA[1:49,6]) / length(DATA[1:49,6])
mean(DATA[1:49,6])
colMeans(DATA[1:49,6:11])
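Since the DATA object is loaded on an earlier slide, the same comparison can be sketched on a small self-contained vector. The values below are the six UNW measurements of the gibbon rows shown earlier; the names `x`, `by_hand` and `builtin` are chosen for the example.

```r
# UNW values of the six gibbon rows shown earlier
x <- c(7.11, 6.12, 6.18, 6.44, 6.31, 7.14)

by_hand <- sum(x) / length(x)   # add all the measures, divide by their number
builtin <- mean(x)              # built-in equivalent

by_hand
all.equal(by_hand, builtin)     # both approaches agree
```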
Example: You are told that you have a serious illness for which the mean survival period is six months… Statistics interest you!
The mode is the most frequent value.
The median is the value that splits the ordered sample in two: half of the values lie above it and half below it.
median(DATA[1:49,6])
quantile(DATA[1:49,6])
min(DATA[1:49,6])
max(DATA[1:49,6])
range(DATA[1:49,6])
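Base R has no function for the statistical mode (mode() returns the storage type of an object), so one possible sketch is to count the values with table() and keep the most frequent one. The example uses the first ten values of the "number of flakes per context" series; the names `counts` and `stat_mode` are hypothetical.

```r
# First ten values of the "number of flakes per context" series
flakes <- c(1, 0, 3, 3, 0, 0, 1, 1, 0, 0)

counts <- table(flakes)   # frequency of each distinct value
# Keep the value(s) with the highest frequency (ties return several values)
stat_mode <- as.numeric(names(counts)[counts == max(counts)])

stat_mode          # most frequent value
median(flakes)     # middle of the ordered sample
quantile(flakes)   # minimum, quartiles and maximum
range(flakes)      # minimum and maximum together
```

With an even number of values, median() returns the midpoint of the two central values, which is why the result here is not one of the observed counts.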
Variance:
We calculate the difference between every value and the mean.
We square this difference.
We sum the squared differences.
And we divide by the number of values minus one.
$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$$
The variance is the mean of the squared deviations from the mean (computed with n - 1 in the denominator for a sample).
The standard deviation is the square root of the variance.
$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2}$$
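The variance and standard deviation formulas can be checked step by step in R. This is a minimal sketch on the six UNW values of the gibbon rows shown earlier; the intermediate names (`dev`, `var_by_hand`, `sd_by_hand`) are assumptions made for the example.

```r
x <- c(7.11, 6.12, 6.18, 6.44, 6.31, 7.14)   # UNW values of the six gibbon rows
n <- length(x)

dev         <- x - mean(x)             # difference between every value and the mean
var_by_hand <- sum(dev^2) / (n - 1)    # sum of squared differences over n - 1
sd_by_hand  <- sqrt(var_by_hand)       # square root of the variance

all.equal(var_by_hand, var(x))   # var() uses the same n - 1 denominator
all.equal(sd_by_hand, sd(x))
```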
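Coefficient of variation:
$$cv = \frac{\sigma}{\bar{X}}$$
It expresses the standard deviation relative to the mean of the variable, so it has no unit.
It makes it possible to compare the dispersion of two variables.
Problem: when the mean is close to zero, it becomes useless.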
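R has no built-in coefficient of variation, so a small helper can be sketched as follows. The function name `cv` is hypothetical, and the two vectors are the UNW and LNW values read from the gibbon rows shown earlier.

```r
# Hypothetical helper: standard deviation divided by the mean
cv <- function(x) sd(x) / mean(x)

unw <- c(7.11, 6.12, 6.18, 6.44, 6.31, 7.14)
lnw <- c(7.74, 8.53, 9.72, 10.09, 11.69, 11.13)

cv(unw)        # relative dispersion of UNW
cv(lnw)        # relative dispersion of LNW, directly comparable
cv(unw * 10)   # changing the unit of measurement does not change the result
```

Because the mean sits in the denominator, the ratio grows without bound as the mean approaches zero, which is the limitation mentioned above.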
To measure the difference from the mean in standard-deviation units, we use:
$$z = \frac{x - \bar{X}}{s}$$
This is the centred and reduced (standardized) variable, with mean 0 and variance 1.
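The standardization can be sketched in R either by hand or with the base function scale(). The vector reuses the six UNW values of the gibbon rows shown earlier; `z` is a name chosen for the example.

```r
x <- c(7.11, 6.12, 6.18, 6.44, 6.31, 7.14)   # UNW values of the six gibbon rows

z <- (x - mean(x)) / sd(x)   # centre on the mean, then divide by the standard deviation

mean(z)                               # zero, up to rounding error
var(z)                                # one
all.equal(z, as.numeric(scale(x)))    # scale() performs the same operation
```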
Covariance measures the degree of dependence between two variables: do the values of each measurement drift away from the centre of gravity independently, or do they drift away together?
If x and y are independent, then the covariance is equal to 0.
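$$S_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{X}_x)(y_i - \bar{X}_y)}{n}$$
We multiply each x-deviation from the mean by its associated y-deviation.
We sum these products.
We divide by the number of values (R's cov() divides by n - 1 instead, which gives the sample covariance).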
So covariance is the mean of the cross-products of the deviations.
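# Covariance by hand (computational formula, n - 1 in the denominator)
( sum(DATA[1:49,6] * DATA[1:49,7]) - prod(sum(DATA[1:49,6]),sum(DATA[1:49,7])) / length(DATA[1:49,6]) ) / ( length(DATA[1:49,6])-1 )
# Same result with the built-in function (R is case-sensitive: cov, not Cov)
cov(DATA[1:49,6],DATA[1:49,7])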
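The two ways of writing the covariance can be compared on a small self-contained example. The vectors are the UNW and LNW values read from the gibbon rows shown earlier; all intermediate names (`cov_dev`, `cov_short`, `cov_pop`) are assumptions made for this sketch.

```r
x <- c(7.11, 6.12, 6.18, 6.44, 6.31, 7.14)       # UNW
y <- c(7.74, 8.53, 9.72, 10.09, 11.69, 11.13)    # LNW
n <- length(x)

# Definition: sum of the products of paired deviations, divided by n - 1
cov_dev <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Computational shortcut used on the slide
cov_short <- (sum(x * y) - sum(x) * sum(y) / n) / (n - 1)

all.equal(cov_dev, cov(x, y))     # both match the built-in sample covariance
all.equal(cov_short, cov(x, y))

# Dividing by n instead, as in the formula above, rescales the sample value
cov_pop <- cov(x, y) * (n - 1) / n
```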
Pearson's correlation coefficient differs from the covariance in that it has no unit and is bounded between -1 and 1.
$$r_{xy} = \frac{S_{xy}}{S_x \times S_y}$$
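# Correlation by hand: covariance divided by the product of the standard deviations
cov(DATA[1:49,6],DATA[1:49,7]) / (sd(DATA[1:49,6]) * sd(DATA[1:49,7]))
# Same result with the built-in function (R is case-sensitive: cor, not Cor)
cor(DATA[1:49,6],DATA[1:49,7])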
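As a final check, the absence of unit can be illustrated on the same two small vectors (UNW and LNW values read from the gibbon rows shown earlier). The name `r_by_hand` is hypothetical.

```r
x <- c(7.11, 6.12, 6.18, 6.44, 6.31, 7.14)       # UNW
y <- c(7.74, 8.53, 9.72, 10.09, 11.69, 11.13)    # LNW

r_by_hand <- cov(x, y) / (sd(x) * sd(y))   # covariance scaled by both standard deviations

all.equal(r_by_hand, cor(x, y))   # matches the built-in Pearson coefficient
abs(r_by_hand) <= 1               # always between -1 and 1
cor(x * 10, y + 5)                # rescaling or shifting the variables leaves r unchanged
```

cor() computes Pearson's coefficient by default; other coefficients are available through its method argument.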