R Introduction and descriptive statistics · 2020-06-11 · Introduction and descriptive statistics...

Introduction and descriptive statistics

tutorial 1

what is R

R is a free software programming language and software environment for statistical computing and graphics. (Wikipedia)

R is open source.

what is R

R is an object oriented programming language.

Everything in R is an object.

R objects are stored in memory, and are acted upon by functions (and operators).

Homepage CRAN: Comprehensive R Archive Network

http://www.r-project.org/

how to get R

Editor RStudio

http://www.rstudio.com

how to edit R


how R works

using R as a calculator

Users type expressions to the R interpreter.

R responds by computing and printing the answers.

arithmetic operators

type operator action performed

arithmetic

results in numeric value(s)

+ addition

- subtraction

* multiplication

/ division

^ raise to power

logical operators

type operator action performed

comparison

results in logical

value(s):

TRUE FALSE

< less than

> greater than

== equal to

!= not equal to

<= less than or equal to

>= greater than or equal to

connectors

& boolean intersection operator (logical and)

| boolean union operator (logical or)

arithmetic operators

power > 3 ^ 2

> 2 ^ (-2)

Note: > 100 ^ (1/2)

is equivalent to

> sqrt(100)

addition / subtraction > 5 + 5

> 10 - 2

multiplication / division > 10 * 10

> 25 / 5

logical operators

> 4 < 3 [1] FALSE

> 2^3 == 9 [1] FALSE

> (3 + 1) != 3 [1] TRUE

> (3 >= 1) & (4 == (3+1)) [1] TRUE

assignment

Values are stored by assigning them a name.

The statements

> z = 17

> z <- 17

> 17 -> z

all store the value 17 under the name z in the workspace. Assignment operators are: <- , = , ->

data types

There are three basic types or modes of variables:

numeric (numbers: integers, real)

logical (TRUE, FALSE)

character (text strings, in "")

The type is shown by the mode() function. Note: A general missing value indicator is NA.

data types

> a = 49 # numeric

> sqrt(a)

[1] 7

> mode(a)

[1] "numeric"

> a = "The dog ate my homework" # character

> a

[1] "The dog ate my homework"

> mode(a)

[1] "character"

> a = (1 + 1 == 3) # logical

> a

[1] FALSE

> mode(a)

[1] "logical"

Elements: numeric, logical, character in

vectors ordered sets of elements of one type

data.frames ordered sets of vectors (different vector types)

matrices ordered sets of vectors (all of one vector type)

lists ordered sets of anything.

data structures

creating vectors

> x = c(1, 3, 5, 7, 8, 9) # numerical vector

> x

[1] 1 3 5 7 8 9

> z = c("I","am","Ironman") # character vector

> z

[1] "I" "am" "Ironman"

> x = c(TRUE,FALSE,NA) # logical vector

> x

[1] TRUE FALSE NA

The function c( ) can combine several elements into vectors.

combining vectors

The function c( ) can be used to combine both vectors and elements into larger vectors.

> x = c(1, 2, 3, 4)

> c(x, 10)

[1] 1 2 3 4 10

> c(x, x)

[1] 1 2 3 4 1 2 3 4

In fact, R stores elements like 10 as vectors of length one, so both arguments in the expressions above are vectors.
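The following sketch illustrates the point about length-one vectors; the names y and w are illustrative and not part of the tutorial. It also shows, as an aside, what happens when c( ) is given elements of different types.

```r
x <- c(1, 2, 3, 4)

length(10)        # 1: a single number is a vector of length one
is.vector(10)     # TRUE

y <- c(x, 10, x)  # vectors and single elements can be mixed freely
length(y)         # 9

# A vector holds one type only, so mixed input is coerced
# to the most general type (here: character).
w <- c(1, "a", TRUE)
mode(w)           # "character"
```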

sequences

A useful way of generating vectors is the sequence operator. The expression n1:n2 generates the sequence of integers ranging from n1 to n2.

> 1:15

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13

[14] 14 15

> 5:-5

[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5

> y = 1:11

> y

[1] 1 2 3 4 5 6 7 8 9 10 11

extracting elements

> x = c(1, 3, 5, 7, 8, 9)

> x[3] # extract 3rd position

[1] 5

> x[1:3] # extract positions 1-3

[1] 1 3 5

> x[-2] # without 2nd position

[1] 1 5 7 8 9

> x[x<7] # select values < 7

[1] 1 3 5

> x[x!=5] # select values not equal to 5

[1] 1 3 7 8 9
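Selections such as x[x<7] work because the comparison first produces a logical vector, which is then used as an index. A minimal sketch of this mechanism, combined with the connectors & and | from the operator table (the names sel_and and sel_or are illustrative):

```r
x <- c(1, 3, 5, 7, 8, 9)

x > 4                        # logical vector: FALSE FALSE TRUE TRUE TRUE TRUE
which(x > 4)                 # positions of the TRUE values: 3 4 5 6

sel_and <- x[x > 2 & x < 8]  # both conditions must hold: 3 5 7
sel_or  <- x[x < 3 | x > 8]  # at least one condition holds: 1 9

x[c(1, length(x))]           # first and last element: 1 9
```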

data frame

Data frames provide a way of grouping a number of related vectors into a single data object. The function data.frame() takes a number of vectors of the same length and returns a single object containing all the variables.

df = data.frame(var1, var2, ...)

data frame

In a data frame the column labels are the vector names. Note: Vectors can be of different types in a data frame (numeric, logical, character).

Data frames can be created in a number of ways:

Binding together vectors by the function data.frame( ).

Reading in data from an external file.

data frame

> time = c("early","mid","late","late","early")

> type <- c("G", "G", "O", "O", "G")

> counts <- c(20, 13, 8, 34, 7)

> data <- data.frame(time,type,counts)

> data

time type counts

1 early G 20

2 mid G 13

3 late O 8

4 late O 34

5 early G 7

> fix(data)

example data: low birth weight

name text variable type

low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'

age age of mother continuous: years

lwt mother's weight at last menstrual period continuous: pounds

race ethnicity nominal: 1 'white' 2 'black' 3 'other'

smoke smoking status nominal: 0 'no' 1 'yes'

ptl premature labor discrete: number of

ht hypertension nominal: 0 'no' 1 'yes'

ui presence of uterine irritability nominal: 0 'no' 1 'yes'

ftv physician visits in first trimester discrete: number of

bwt birthweight of the baby continuous: g

The birthweight data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.

Loading: library(MASS); the data frame is called birthwt. Functions that give an overview of a data frame:

dim(birthwt)

summary(birthwt)

head(birthwt)

str(birthwt)

extracting vectors

data$vectorlabel gives the vector named vectorlabel of the dataframe named data. Extracting elements from this vector is done as usual.

> birthwt$age

> birthwt$age[33]

> birthwt$age[1:10]
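Logical conditions can also be used to extract parts of a data frame. The sketch below assumes the MASS package is installed (it ships with standard R distributions); the names age_smokers and young are illustrative.

```r
library(MASS)   # provides the birthwt data frame

# Ages of the mothers who smoked (smoke == 1)
age_smokers <- birthwt$age[birthwt$smoke == 1]
length(age_smokers)

# data[rows, columns]: rows where the condition holds, three selected columns
young <- birthwt[birthwt$age < 20, c("age", "lwt", "bwt")]
head(young)
```

Leaving the row or column position empty (for example birthwt[, "age"]) selects all rows or all columns.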

some functions in R

name function

summary(x) summary statistics of the elements of x

max(x) maximum of the elements of x

min(x) minimum of the elements of x

sum(x) sum of the elements of x

mean(x) mean of the elements of x

sd(x) standard deviation of the elements of x

median(x) median of the elements of x

quantile(x, probs=…) quantiles of the elements of x

sort(x) ordering the elements of x

some functions in R

> mean(birthwt$age)

[1] 23.2381

> max(birthwt$age)

[1] 45

> min(birthwt$age)

[1] 14
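The remaining functions from the table can be tried on a small vector whose results are easy to check by hand. The data below are made up for illustration; the last lines show the na.rm = TRUE option, which is needed as soon as a vector contains NA.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up example values

mean(x)      # 5
median(x)    # 4.5
sd(x)        # sample standard deviation (divides by n - 1)
quantile(x, probs = c(0.25, 0.75))

y <- c(x, NA)
mean(y)                 # NA: missing values propagate
mean(y, na.rm = TRUE)   # 5: NA is dropped before the calculation
```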

getting help

To get help on the sd() function, you can type either of

> help(sd)

> ?sd

sorting vectors

> help(sort)

> x=sort(birthwt$age, decreasing=FALSE)

> x[1:10]

[1] 14 14 14 15 15 15 16 16 16 16

> x=sort(birthwt$age, decreasing=TRUE)

> x[1:10]

[1] 45 36 36 35 35 34 33 33 33 32

> x[25] # 25th highest age

[1] 30

Sorting / ordering of data in vectors with the function sort()
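sort() returns the sorted values only. To reorder a second vector or a whole data frame by the same criterion, the related base function order() can be used, which returns the positions instead of the values. A small sketch with made-up data (the names age, name, idx, df and df_sorted are illustrative):

```r
age  <- c(23, 19, 31, 19)
name <- c("A", "B", "C", "D")

sort(age)            # 19 19 23 31
idx <- order(age)    # positions in ascending order: 2 4 1 3
name[idx]            # "B" "D" "A" "C"

# Sorting the rows of a data frame by one of its columns
df <- data.frame(name, age)
df_sorted <- df[order(df$age, decreasing = TRUE), ]
df_sorted
```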

graphics

R has extensive graphics facilities.

Graphics functions are divided into

high-level graphics functions

low-level graphics functions

The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.

high-level graphics

name function

plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)

hist(x) histogram of the frequencies of x

barplot(x) bar chart of the values of x; use horiz=TRUE for horizontal bars

dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)

pie(x) circular pie-chart

boxplot(x) box-and-whiskers plot

stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)

mosaicplot(x) mosaic plot from frequencies in a contingency table

qqnorm(x) quantiles of x with respect to the values expected under a normal law

high-level graphics

> hist(birthwt$age)

> boxplot(birthwt$age)

hands-on example

loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Load the dataframe into your workspace with the data("PimaIndiansDiabetes2") command. Get an overview with the functions dim, head.

Calculate the mean and median of the variable insulin. Remove NAs for the calculation with the na.rm = TRUE option in mean and median functions.

Plot a histogram of the variable insulin.

Graphics and probability theory

tutorial 2

graphics

high-level graphics

name function

plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)

hist(x) histogram of the frequencies of x

barplot(x) bar chart of the values of x; use horiz=TRUE for horizontal bars

dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)

pie(x) circular pie-chart

boxplot(x) box-and-whiskers plot

stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)

mosaicplot(x) mosaic plot from frequencies in a contingency table

qqnorm(x) quantiles of x with respect to the values expected under a normal law

plot function

The core R graphics command is plot(). This is an all-in-one function which carries out a number of actions:

It opens a new graphics window.

It plots the content of the graph (points, lines etc.).

It plots x and y axes and boxes around the plot and produces the axis labels and title.

plot function

Parameters in the plot() function are:

x x-coordinate(s)

y y-coordinates (optional, depends on x)

plot function

To plot points with x and y coordinates or two random variables for a data set (one on the x axis, the other on the y axis; called a scatterplot) , type:

> a = c(1,2,3,4)

> b = c(4,4,0,5)

> plot(x=a,y=b)

> plot(a,b) # the same

plot function

To plot points with x and y coordinates or two random variables for a data set (one on the x axis, the other on the y axis; called a scatterplot), type:

> library(MASS)

> plot(x=birthwt$age,y=birthwt$lwt)

# lwt: mother's weight in pounds

> plot(x=birthwt$age[1:10],y=birthwt$lwt[1:10])

# first 10 mothers

plot function

Another example:

> a = seq(-5, +5, by=0.2)

# generates a sequence from -5 to +5 with increment 0.2

> a

[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2

...

[45] 3.8 4.0 4.2 4.4 4.6 4.8 5.0

> b = a^2 # squares all components of a

> plot(a,b)

plot function

Parameters in the plot() function are (see help(plot) and help(par)):

x x-coordinate(s)

y y-coordinates (optional, depends on x)

main, sub title and subtitle

xlab, ylab axes labels

xlim, ylim range of values for x and y

type type of plot

lty type of lines

pch plot symbol

cex scale factor

col color of points etc.

...

plot symbol / line type

plot symbol: pch=

line type: lty=

plot type: type=

"p" points, "l" lines, "b" both, "s" steps, "h" vertical lines, "n" nothing, …

plot function

> a = seq(-5, +5, by=0.2)

> b = a^2

> plot(a, b)

> plot(a,b,main="quadratic function")

> plot(a,b,main="quadratic function",cex=2)

> plot(a,b,main="quadratic function",col="blue")

plot function

> a = seq(-5, +5, by=0.2)

> b = a^2

> plot(a,b,main="quadratic function",type="l")

> plot(a,b,main="quadratic function",type="b")

> plot(a,b,main="quadratic function",pch=2)

[Figure: plot of b against a, titled "quadratic function"]

probability theory, factorials

Binomial coefficients can be computed by choose(n,k):

> choose(8,5)

[1] 56
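The binomial coefficient is defined through factorials, which base R provides as factorial(). The sketch below checks choose() against that definition for the example above; n and k are just local names.

```r
n <- 8
k <- 5

factorial(5)                                      # 5! = 120
factorial(n) / (factorial(k) * factorial(n - k))  # 56, the definition of the binomial coefficient
choose(n, k)                                      # 56, the same value

# choose() also works where the factorials themselves become huge
choose(50, 5)                                     # 2118760
```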

functions for random variables

Distributions can be easily calculated or simulated using R. The functions are named such that the first letter states what the function calculates or simulates:

d = density function (probability function)

p = distribution function

q = quantile (inverse distribution)

r = random number generation

The last part of the name of the function specifies the type of distribution, e.g. binom (binomial distribution) or norm (normal distribution).

binomial distribution

Probability function:

• dbinom(x, size, prob)

x: k, size: n, prob: π

$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$
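The correspondence between the arguments of dbinom() and the binomial probability function can be checked numerically. This is a small sketch with arbitrarily chosen values; the variable names n, k, p and manual are illustrative.

```r
n <- 50     # size: number of trials
k <- 5      # x: number of successes
p <- 0.15   # prob: success probability

# Probability function written out: choose(n, k) * p^k * (1 - p)^(n - k)
manual <- choose(n, k) * p^k * (1 - p)^(n - k)

all.equal(manual, dbinom(k, size = n, prob = p))   # TRUE

# The probabilities of all possible outcomes 0, ..., n add up to 1
sum(dbinom(0:n, size = n, prob = p))
```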

normal distribution

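Density function:

• dnorm(x, mean, sd)

mean: μ, sd: σ

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right)$$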

normal distribution

Calculating the probability density function:

> dnorm(x=2, mean=6, sd=2)

[1] 0.02699548

normal distribution

Distribution function:

• pnorm(q, mean, sd)

$$F(b) = P(X \le b) = \int_{-\infty}^{b} f(x)\,dx$$

with f(x) the density.

normal distribution

Distribution function:

N(10,25) distribution

> pnorm(q=13, mean=10, sd=5)

[1] 0.7257469
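That pnorm() is the area under the density up to q can be verified with the base function integrate(). The sketch also shows two typical uses: the probability of an interval and the inverse relation to qnorm(). The names F_13 and area are illustrative.

```r
# Distribution function of N(10, 25) at 13, i.e. P(X <= 13)
F_13 <- pnorm(13, mean = 10, sd = 5)

# The same value as the area under the density from -Inf to 13
area <- integrate(dnorm, lower = -Inf, upper = 13, mean = 10, sd = 5)
area$value

# Probability of an interval: P(5 < X <= 15) = F(15) - F(5)
pnorm(15, mean = 10, sd = 5) - pnorm(5, mean = 10, sd = 5)

# qnorm() is the inverse of pnorm()
qnorm(F_13, mean = 10, sd = 5)   # 13
```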

binomial distribution

Probability function:

> dbinom(x=5, size=50, prob=0.15)

# Probability of having exactly 5 successes in 50 independent observations/measurements with a success probability of 0.15 each

[1] 0.1072481

> dbinom(5, 50, 0.15) # the same

normal distribution

Plotting the density of a N(5,49) distribution:

> x_values=seq(-15, 25, by=0.5)

> y_values=dnorm(x_values, mean=5, sd=7)

> plot(x_values,y_values,type="l")

hands-on example

Loading: library(mlbench); the dataframe is called PimaIndiansDiabetes2.

Make a scatter plot for the variables glucose and insulin.

What are the possible realizations of a random variable X distributed according to Bin(4,0.85)?

Calculate all possible values of the probability function of X.

Plot the probability function of X with the possible realizations of X on the x axis and the corresponding values of the probability function on the y axis.

Random numbers and factors

tutorial 3

binomial distribution

Generating random realizations:

• rbinom(n, size, prob)

n: number of samples to draw

size: number of trials per sample (the n of the formula)

prob: π

output: number of successes in each sample

$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$

normal distribution

• rnorm(n, mean, sd)

n: number of samples to draw

t distribution

Quantiles:

• qt(p, df)

p: quantile probability

df: degrees of freedom

> rbinom(n=1, size=50, prob=0.15)

# Generating one sample of 50 independent observations/measurements with a success probability of 0.15 each

[1] 14 # 14 successes in this simulation

> rbinom(n=10, size=50, prob=0.15)

# Generating 10 samples

[1] 14 10 6 12 8 6 7 10 5 9

# The number of successes for all samples

normal distribution

> values=rnorm(10, mean=0, sd=1)

> values

[1] -0.56047565 -0.23017749 1.55870831 0.07050839

0.12928774 1.71506499 0.46091621 -1.26506123

-0.68685285 -0.44566197

# 10 simulations from a N(0,1) distribution

> mean(values)

[1] 0.07462565
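Random numbers differ from run to run unless the random number generator is initialised with set.seed(). The following sketch shows this; the seed value 1 is arbitrary and the names a, b and big are illustrative.

```r
set.seed(1)        # fix the state of the random number generator
a <- rnorm(5)

set.seed(1)        # same seed again ...
b <- rnorm(5)

identical(a, b)    # TRUE: ... gives exactly the same random numbers

# With many simulations, the sample mean and standard deviation
# come close to the parameters of the distribution.
big <- rnorm(10000, mean = 0, sd = 1)
mean(big)
sd(big)
```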


t distribution

• qt(p, df)

p: quantile probability df: degrees of freedom

> qt(p=0.95,df=9)

[1] 1.833113

> qt(p=0.95,df=99)

[1] 1.660391

> qnorm(p=0.95,mean=0,sd=1)

[1] 1.644854

> qt(p=0.975,df=99)

[1] 1.984217

# this is the quantile $t_{1-\alpha/2,\,n-1}$ for α=0.05, n=100
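The t quantile is what is needed for a confidence interval for the mean when σ is unknown: mean ± t(1-α/2, n-1) · s/√n. A minimal sketch with made-up measurements; the names x, t_q, half and ci are illustrative, and the result is compared with the interval that t.test() reports.

```r
x <- c(4.9, 5.1, 5.6, 4.7, 5.3, 5.0, 5.4, 4.8)   # made-up measurements
n <- length(x)
alpha <- 0.05                                    # 95% confidence interval

t_q  <- qt(1 - alpha / 2, df = n - 1)            # t quantile with n - 1 degrees of freedom
half <- t_q * sd(x) / sqrt(n)                    # half-width of the interval
ci   <- c(mean(x) - half, mean(x) + half)
ci

# t.test() computes the same interval
t.test(x, conf.level = 1 - alpha)$conf.int
```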

object classes

All objects in R have a class. The class attribute allows R to treat objects differently (e.g. for summary() or plot()).

Possible classes are:

numeric

logical

character

matrix

data.frame

factor

The class is shown by the class() function.

factors

Categorical variables in R are often specified as factors.

Factors have a fixed number of categories, called levels.

summary(factor) displays the frequency of the factor levels.

Functions in R for creating factors:

factor(), as.factor()

levels() displays and sets levels.

factors

• factor(x,levels,labels)

• as.factor(x)

x: vector of data, usually with a small number of distinct values

levels: specifies the values (categories) of x

labels: labels the levels

> smoke = c(0,0,1,1,0,0,0,1,0,1)

> smoke

[1] 0 0 1 1 0 0 0 1 0 1

> summary(smoke)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0 0.0 0.0 0.4 1.0 1.0

> class(smoke)

[1] "numeric"

factors

> smoke_new=factor(smoke)

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1

> summary(smoke_new)

0 1

6 4

> class(smoke_new)

[1] "factor"

factors

> smoke_new=factor(smoke,levels=c(0,1))

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1

> smoke_new=factor(smoke,levels=c(0,1,2))

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1 2

factors

> smoke_new=factor(smoke,levels=c(0,1),

labels=c("no", "yes"))

> smoke_new

[1] no no yes yes no no no yes no yes

Levels: no yes


factors

> library(MASS)

> summary(birthwt$race)

Min. 1st Qu. Median Mean 3rd Qu. Max.

1.000 1.000 1.000 1.847 3.000 3.000

> race_new=as.factor(birthwt$race)

> summary(race_new)

1 2 3

96 26 67

> levels(race_new)

[1] "1" "2" "3"

> levels(race_new)=c("white","black","other")

> summary(race_new)

white black other

96 26 67
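A factor stores integer codes together with the level labels. The sketch below makes both parts visible; the codes in race are made up for illustration and race_f is an illustrative name.

```r
race <- c(1, 3, 2, 1, 1, 3)   # made-up numeric codes

race_f <- factor(race, levels = c(1, 2, 3),
                 labels = c("white", "black", "other"))

table(race_f)          # frequencies per level, like summary(race_f)
nlevels(race_f)        # 3
as.integer(race_f)     # internal codes: 1 3 2 1 1 3
as.character(race_f)   # the labels as plain text
```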

factors

hands-on example

Sample 20 realizations of a N(0,1) distribution.

Calculate mean and standard deviation.

What is the formula for the confidence interval for the mean for unknown σ?

For a 90% confidence interval and the above sample: What are the parameters α and n?

What is the value of $t_{1-\alpha/2,\,n-1}$?

Calculate the 90% confidence interval for our example.

Reading data from files,

frequency tables

tutorial 4

normal distribution

Quantiles:

• qnorm(p, mean, sd)

p: quantile probability

> qnorm(p=0.95, mean=0, sd=1)

[1] 1.644854

# this is the quantile $z_{0.95}$

> qnorm(p=0.975, mean=0, sd=1)

[1] 1.959964

# this is the quantile $z_{1-\alpha/2}$ for α=0.05

reading data: working directory

For reading or saving files, a simple file name identifies a file in the working directory. Files in other places can be specified by the path name.

getwd() gives the current working directory.

setwd("path") sets a specific directory as your working directory.

Use setwd("path") to load and save data in the directory of your choice.

The standard way of storing statistical data is to put them in a rectangular form with rows corresponding to observations and columns corresponding to variables.

Spreadsheets, e.g. Excel, are often used to store and manipulate data in this way.

The function read.table() can be used to read data which has been stored in this way.

The first argument to read.table() identifies the file to be read.

reading data

Optional arguments to read.table() can be used to change its behaviour:

Setting header=TRUE indicates to R that the first row of the data file contains names for each of the columns.

The argument skip= makes it possible to skip the specified number of lines at the top of the file.

The argument sep= can be used to specify the character which separates columns. (Use sep=";" for semicolon-separated csv files.)

The argument dec= can be used to specify the character used as decimal point (e.g. dec="," for decimal commas).
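The effect of header=, sep= and dec= can be tried without an external data file by writing a small file first. This is a sketch with made-up content; the file is created in a temporary location, and the names path and d are illustrative.

```r
# Write a small semicolon-separated file with decimal commas
path <- tempfile(fileext = ".csv")
writeLines(c("id;height;weight",
             "1;1,82;80,5",
             "2;1,65;59,0"), path)

# Read it back: first row holds the names, ";" separates columns,
# "," is the decimal point
d <- read.table(path, header = TRUE, sep = ";", dec = ",")
str(d)
d$weight      # numeric: 80.5 59.0

unlink(path)  # remove the temporary file
```

For files of this kind, read.csv2() is a shortcut with sep=";", dec="," and header=TRUE as defaults.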

example data: infarct

(case/control study)

name label

nro identifier

grp group (string) 'control', 'infarct'

code coded group 0 'control', 1 'infarct'

sex sex 1 'male', 2 'female'

age age years

height body height cm

weight body weight kg

blood sugar blood sugar level mg/100ml

diabet diabetes 0 'no', 1 'yes'

chol cholesterol level mg/100ml

trigl triglyceride level mg/100ml

cig cigarettes number of

> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv")

Error in scan(file, what,...: line 2 did not have 2

elements # wrong separator

> mi = read.table("infarct data.csv",sep=";")

> summary(mi) # no variable names

> mi = read.table("infarct data.csv",sep=";",

header=TRUE)

> summary(mi) # with variable names

frequency tables

table(var1, var2) gives a table of the absolute frequencies of all combinations of var1 and var2 (frequency table, cross-classification table, contingency table). var1 and var2 must take only a finite number of values. var1 defines the rows, var2 the columns.

addmargins(table) adds the sums of rows and columns.

prop.table(table) gives the relative frequencies, overall or with respect to rows or columns.

frequency tables

> grp_sex=table(mi$grp,mi$sex)

> grp_sex

1 2

control 25 15

infarct 28 12

> addmargins(grp_sex)

1 2 Sum

control 25 15 40

infarct 28 12 40

Sum 53 27 80

frequency tables

> prop.table(grp_sex)

1 2

control 0.3125 0.1875

infarct 0.3500 0.1500

> prop.table(grp_sex,margin=1)

1 2

control 0.625 0.375

infarct 0.700 0.300 # rows sum to 1

> prop.table(grp_sex,margin=2)

1 2

control 0.4716981 0.5555556

infarct 0.5283019 0.4444444 # columns sum to 1

hands-on example

Load the dataset from the file bdendo.csv into the workspace.

Generate a table of the variables d (case-control status) and dur (categorical duration of oestrogen therapy).

Generate a table of the variables d (case-control status) and agegr (age group).

Compare the two tables.

Installing packages,

the package "pROC"

tutorial 5

R packages

R consists of a base level of functionality together with a set of contributed libraries which provide extended capabilities.

The key idea is that of a package which provides a related set of software components, documentation and data sets.

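Packages can be installed into R. Installing into the system-wide library needs administrator rights; without them, packages can be installed into a personal library.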

Package: pROC

Type: Package

Title: display and analyze ROC curves

Version: 1.7.1

Date: 2014-02-20

Encoding: UTF-8

Depends: R (>= 2.13)

Imports: plyr, utils, methods, Rcpp (>= 0.10.5)

Suggests: microbenchmark, tcltk, MASS, logcondens, doMC,

doSNOW

LinkingTo: Rcpp

Author: Xavier Robin, Natacha Turck, Alexandre Hainard,

Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez

and Markus Müller.

Maintainer: Xavier Robin <xavier@cbs.dtu.dk>

pROC – diagnostic testing

installing packages

You can install R packages using the install.packages() command.

> install.packages("pROC")

Installing package(s) into

‘C:/Users/Amke/Documents/R/win-library/2.15’

(as ‘lib’ is unspecified)

downloaded 827 Kb

package ‘pROC’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in

C:\Users\Amke\AppData\Local\Temp\RtmpUJPoia\downloaded_packages

Installing R packages using the menu:

using installed packages

When R is running, simply type:

> library(pROC)

This adds the R functions in the library to the search path. You can now use the functions and datasets in the package and inspect the documentation.

cite packages

To cite the package pROC in publications use:

> citation("pROC")

Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, p. 77. DOI: 10.1186/1471-2105-12-77

<http://www.biomedcentral.com/1471-2105/12/77/>

package pROC

The main function is roc(response, predictor). It creates the values necessary for an ROC curve.

response: disease status (as provided by gold standard)

predictor: continuous test result

(to be dichotomized)

For an roc object the plot(roc_obj) function produces an ROC curve.

package pROC

The function coords(roc_obj,x,best.method,ret) calculates measures of test performance.

x: value for which measures are calculated (default: threshold) , x="best" gives the optimal threshold

best.method: if x="best", the method to determine the best threshold (e.g. "youden")

ret: Measures calculated. One or more of "threshold", "specificity", "sensitivity", "accuracy", "tn" (true negative count), "tp" (true positive count), "fn" (false negative count), "fp" (false positive count), "npv" (negative predictive value), "ppv" (positive predictive value)

(default: threshold, specificity, sensitivity)

example data: aSAH (aneurysmal subarachnoid haemorrhage)

name label values

gos6 Glasgow Outcome Score (GOS) at 6 months 1-5

outcome prediction of development (to be diagnosed) 'good', 'poor'

gender sex 'male', 'female'

age age years

wfns World Federation of Neurological Surgeons Score 1-5

s100b S100 calcium binding protein B (biomarker, continuous test result) μg/l

ndka Nucleoside diphosphate kinase A (biomarker, continuous test result) μg/l

> data(aSAH) # loads the data set "aSAH"

> head(aSAH)

> rocobj = roc(aSAH$outcome, aSAH$s100b)

> plot(rocobj)

> coords(rocobj, 0.55)

threshold specificity sensitivity

0.5500000 1.0000000 0.2682927

> coords(rocobj, x="best",best.method="youden")

threshold specificity sensitivity

0.2050000 0.8055556 0.6341463

# The Youden threshold is 0.205, shown with the corresponding specificity and sensitivity.

package pROC

Measures of Test Performance

Outcomes of a diagnostic study for a dichotomous test result:

test result positive test result negative

disease present true positive false negative

disease absent false positive true negative

> coords(rocobj,x="best",best.method="youden",

ret=c("threshold","specificity","sensitivity",

"tn","tp","fn","fp"))

threshold specificity sensitivity

0.2050000 0.8055556 0.6341463

tn tp fn fp

58.0000000 26.0000000 15.0000000 14.0000000

package pROC

test result positive test result negative

disease present tp: 26 fn: 15

disease absent fp: 14 tn: 58
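The measures returned by coords() follow directly from the four counts. The sketch below recomputes them from the counts in the coords() output above (tn = 58, tp = 26, fn = 15, fp = 14); the variable names are illustrative.

```r
tp <- 26   # diseased, test positive
fn <- 15   # diseased, test negative
fp <- 14   # not diseased, test positive
tn <- 58   # not diseased, test negative

sens <- tp / (tp + fn)   # sensitivity: 0.6341463
spec <- tn / (tn + fp)   # specificity: 0.8055556
ppv  <- tp / (tp + fp)   # positive predictive value: 0.65
npv  <- tn / (tn + fn)   # negative predictive value
acc  <- (tp + tn) / (tp + fn + fp + tn)   # accuracy

# Youden index, the criterion maximised by best.method="youden"
youden <- sens + spec - 1
```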

Statistical testing 1

tutorial 6

statistical test functions

name function

t.test( ) Student's t-test

wilcox.test( ) Wilcoxon rank sum test and signed rank test

ks.test( ) Kolmogorov-Smirnov test

chisq.test( ) Pearson's chi-squared test for count data

mcnemar.test( ) McNemar test

One sample t test

The function t.test() performs different Student's t tests.

Parameters for the one sample t test are t.test(x,mu,alternative)

x: numeric vector of values which shall be tested

(assumed to follow a normal distribution)

mu: reference value µ0

alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than µ0),

"greater" (alternative: expectation of x is larger than µ0)

Blood Sugar Level and Myocardial Infarction

H0: µ ≤ µ0 HA: µ > µ0

A study was carried out to assess whether the expected blood sugar level (BSL) µ of patients with myocardial infarction is higher than the expected BSL of control individuals, namely µ0 = 100 mg/100ml.

name label

nro identifier

code coded group 0 'control', 1 'infarct'

age age years

> mi = read.table("infarct data.csv",sep=";",

dec=",", header=TRUE)

> summary(mi$blood.sugar)

> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]

# Attention: two "="s!

# Extracts the blood sugar levels of only the cases.

> summary(bloods_infarct)

One sample t test

>t.test(bloods_infarct,mu=100,alternative="greater")

One Sample t-test

data: bloods_infarct

t = -0.7824, df = 39, p-value = 0.7807

alternative hypothesis: true mean is greater than 100

95 percent confidence interval:

90.14572 Inf

sample estimates:

mean of x

96.875

# Blood sugar level of infarct patients is not

significantly higher than 100mg/100ml.
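What t.test() reports can be reproduced by hand: the test statistic is t = (mean − µ0) / (s/√n), and for the alternative "greater" the p-value is the upper tail probability of the t distribution with n − 1 degrees of freedom. A sketch with made-up values (not the infarct data); the names x, mu0, t_stat, p_greater and res are illustrative.

```r
x   <- c(92, 105, 88, 110, 97, 101, 94, 99)   # made-up blood sugar levels
mu0 <- 100                                    # reference value
n   <- length(x)

t_stat    <- (mean(x) - mu0) / (sd(x) / sqrt(n))          # test statistic
p_greater <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # P(T > t_stat)

res <- t.test(x, mu = mu0, alternative = "greater")
res$statistic   # same value as t_stat
res$p.value     # same value as p_greater
```

A sample mean below µ0 gives a negative t and therefore a p-value above 0.5 for this one-sided alternative, as in the output above.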

One sample t test

hands-on example

Load the dataset from the file infarct data.csv into the workspace. Perform a two-sided one-sample t-test for cholesterol level in infarct patients. The reference value for the population is 180 mg/100ml. What is the result of the test?

Statistical testing 2

tutorial 7

statistical test functions

name function

t.test( ) Student's t-test

wilcox.test( ) Wilcoxon rank sum test and signed rank test

ks.test( ) Kolmogorov-Smirnov test

chisq.test( ) Pearson's chi-squared test for count data

mcnemar.test( ) McNemar test

The function t.test() performs different Student's t tests.

Parameters for the two sample t test are:

t.test(x, y, alternative, var.equal)

x, y: numeric vectors of values which shall be compared

(assumed to follow a normal distribution)

alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than expectation of y), "greater" (alternative: expectation of x is larger than expectation of y)

var.equal: Are the variances of x and y equal? (TRUE or FALSE (default); TRUE is the t test of the lecture)
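The effect of var.equal can be seen by running both variants on the same data. The sketch uses made-up values; with var.equal=TRUE the pooled variance and n1 + n2 − 2 degrees of freedom are used, while the default (FALSE) is the Welch test with adjusted degrees of freedom. The names x, y, pooled, welch, sp2 and t_manual are illustrative.

```r
x <- c(5.1, 4.8, 5.6, 5.0, 5.3)        # made-up group 1
y <- c(4.2, 4.9, 4.4, 4.7, 4.1, 4.5)   # made-up group 2

pooled <- t.test(x, y, var.equal = TRUE)   # classical two sample t test
welch  <- t.test(x, y)                     # default: Welch test

pooled$parameter   # df = 5 + 6 - 2 = 9
welch$parameter    # adjusted, usually non-integer df

# The pooled test statistic by hand
n1 <- length(x); n2 <- length(y)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)   # pooled variance
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))
```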

Two sample t test

The function wilcox.test() performs the Wilcoxon rank sum test and the Wilcoxon signed rank test.

Parameters for the Wilcoxon rank sum test are: wilcox.test(x, y, alternative)

x, y: numeric vectors of values which shall be compared

(need not follow a normal distribution)

alternative: similar to t.test

Wilcoxon rank sum test

Blood Sugar Level and Myocardial Infarction

H0: µ1 ≤ µ2 HA: µ1 > µ2

A case-control study was carried out to assess whether the expected blood sugar level (BSL) µ1 of patients with myocardial infarction is higher than the expected BSL µ2 of control individuals.


> mi = read.table("infarct data.csv", sep=";",

dec=",", header=TRUE)

> summary(mi$blood.sugar)

> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]

> bloods_control=mi$blood.sugar[mi$code==0]

# Extracts the blood sugar levels of the cases

# and of the controls.

Two sample t test

> t.test(bloods_infarct, bloods_control,

var.equal=TRUE, alternative="greater")

Two Sample t-test

data: bloods_infarct and bloods_control

t = 0.0305, df = 78, p-value = 0.4879

alternative hypothesis: true difference in means is greater than 0

95 percent confidence interval:

-13.39077 Inf

sample estimates:

mean of x mean of y

96.875 96.625

# Expected BSL of infarct patients is not

significantly higher than expected BSL of controls.

Two sample t test

> wilcox.test(bloods_infarct, bloods_control,

alternative="greater")

Wilcoxon rank sum test with continuity correction

data: bloods_infarct and bloods_control

W = 867.5, p-value = 0.2576

alternative hypothesis: true location shift is greater

than 0

# The Wilcoxon test can be applied if the BSL does not

# follow a normal distribution. Then the t test is not

# valid.

Wilcoxon rank sum test

Pearson's chi-squared test

The function chisq.test() performs a Pearson's chi-squared test for count data.

chisq.test(x)

x: n x m table (matrix) to be tested

name text variable type

low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'

age age of mother continuous: years

lwt mother's weight at last menstrual period continuous: pounds

race ethnicity nominal: 1 'white' 2 'black' 3 'other'

smoke smoking status nominal: 0 'no' 1 'yes'

ptl premature labor discrete: number of

ht hypertension nominal: 0 'no' 1 'yes'

ui presence of uterine irritability nominal: 0 'no' 1 'yes'

ftv physician visits in first trimester discrete: number of

bwt birthweight of the baby continuous: g

> library(MASS)

> tab_bw_smok=table(birthwt$low, birthwt$smoke)

> tab_bw_smok

0 1

0 86 44

1 29 30

> chisq.test(tab_bw_smok)

Pearson's Chi-squared test with Yates'

continuity correction

data: tab_bw_smok

X-squared = 4.2359, df = 1, p-value = 0.03958

# The probability of having a baby with low birth

# weight is significantly higher for smoking mothers.
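The test statistic compares the observed counts with the counts expected under independence (row sum × column sum / total). The sketch below rebuilds the table from the counts shown above, so it runs without the MASS package, and reproduces the statistic including Yates' continuity correction; the names tab, expected, x2_yates and res are illustrative.

```r
# Observed counts: rows = low (0, 1), columns = smoke (0, 1)
tab <- matrix(c(86, 29, 44, 30), nrow = 2,
              dimnames = list(low = c("0", "1"), smoke = c("0", "1")))

# Expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Chi-squared statistic with Yates' continuity correction (2 x 2 tables)
x2_yates <- sum((abs(tab - expected) - 0.5)^2 / expected)
x2_yates                           # about 4.2359

res <- chisq.test(tab)
res$expected                       # the same expected counts
chisq.test(tab, correct = FALSE)   # without continuity correction
```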

Pearson's chi-squared test

hands-on example

loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Plot a histogram of the variable insulin. Compare the insulin values between cases and controls (variable diabetes) using an appropriate test.

Correlation and linear regression,

low level graphics

tutorial 8

Correlation

The function cor(x, y, method) computes the correlation between two paired random variables.

x, y: numeric vectors of values for which the correlation shall be calculated (must have the same length)

method: "pearson", "spearman" or "kendall"
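The difference between the methods shows up when the relationship is monotone but not linear. Spearman's coefficient is the Pearson coefficient of the ranks, which the following sketch with made-up data illustrates.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- x^2                         # monotone, but not linear in x

cor(x, y, method = "pearson")    # below 1: the relation is not linear
cor(x, y, method = "spearman")   # 1: the ranks agree perfectly
cor(rank(x), rank(y))            # Pearson on the ranks gives the Spearman value
```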

Test of correlation

The function cor.test(x, y, alternative, method) tests for correlation between paired random variables.

x, y: numeric vectors of values for which the correlation shall be tested (must have the same length)

alternative:

"two.sided" (alternative: correlation coefficient ≠ 0,

default),

"less" (alternative: negative correlation),

"greater" (alternative: positive correlation)

method: "pearson", "spearman" or "kendall"

Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

formula: y~x with y response variable and x explanatory variable (must have the same length)

data: optional, if not specified in formula, the dataframe containing x and y


> plot(x=mi$height, y=mi$weight)

> cor(mi$height, mi$weight, method="pearson")

[1] 0.6307697

> cor(mi$height, mi$weight, method="spearman")

[1] 0.6281738

Correlation

> cor.test(mi$height,mi$weight,method="pearson")

Pearson's product-moment correlation

data: mi$height and mi$weight

t = 7.1792, df = 78, p-value = 3.586e-10

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.4771865 0.7469643

sample estimates:

cor

0.6307697

# Significant correlation between body height and

# body weight

Correlation

> lm(mi$weight~mi$height)

Call:

lm(formula = mi$weight ~ mi$height)

Coefficients:

(Intercept) mi$height

-51.2910 0.7477

# Y = a + b × x + E

# with Y: body weight, x: body height,

# a=-51.29, b=0.75
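The same kind of model can be written with the data= argument, which keeps the formula short and makes predictions for new values possible. The sketch uses a small made-up data frame rather than the infarct data; the names d, fit and b are illustrative.

```r
# Made-up heights (cm) and weights (kg)
d <- data.frame(height = c(160, 165, 170, 175, 180, 185),
                weight = c(55, 61, 64, 70, 74, 81))

fit <- lm(weight ~ height, data = d)   # variables are looked up in d
coef(fit)                              # intercept a and slope b

# Predicted weight for a new height
predict(fit, newdata = data.frame(height = 172))

# In simple linear regression the slope equals r * sd(y) / sd(x)
b <- cor(d$height, d$weight) * sd(d$weight) / sd(d$height)
```

summary(fit) additionally gives standard errors, t tests for the coefficients and R².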

Linear regression

graphics

low-level graphics

Plots produced by high-level graphics facilities can be modified by low-level graphics commands.

low-level functions

name function

points(x, y) adds points (the option type= can be used)

lines(x, y) adds lines (the option type= can be used)

text(x, y, labels, ...) adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)

abline(a, b) draws a line of slope b and intercept a

abline(h=y) draws a horizontal line at ordinate y

abline(v=x) draws a vertical line at abscissa x

rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively

polygon(x, y) draws a polygon with coordinates given by x and y

title( ) adds a title and optionally a sub-title

> plot(x=mi$height, y=mi$weight)

> abline(a=-51.29, b=0.75, col="blue")

# Adds the regression line to the scatter plot.

> title("Regression of weight and height")

> text(x=185, y=65, labels="Kieler Woche",

col="green")
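Instead of typing rounded coefficients into abline(), the fitted model object can be passed directly, since abline() accepts an object that has a coef method. The sketch below uses simulated height and weight values, which are hypothetical stand-ins for the tutorial's data.

```r
# Simulated data (hypothetical values)
set.seed(3)
height <- rnorm(40, mean = 175, sd = 8)
weight <- -50 + 0.75 * height + rnorm(40, sd = 6)
fit <- lm(weight ~ height)

plot(x = height, y = weight)            # high-level plot
abline(fit, col = "blue")               # regression line taken from coef(fit)
abline(h = mean(weight), lty = 2)       # horizontal reference line at the mean
title("Regression of weight on height")
```

Passing the model avoids rounding errors in the drawn line and keeps the plot in sync when the data change.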

hands-on example

Load the dataset from the file correlation.csv into the workspace. Calculate the Pearson correlation coefficient between the variables x and y and test whether this coefficient is significantly different from 0. Generate a scatter plot.

Regression models

tutorial 9

Linear regression (simple)

formula: y~x with y response variable and x explanatory variable (must have the same length)

Linear regression (multiple)

formula: y~x1+x2+…+xk with y response variable and x1,…,xk explanatory variables (must have the same length)

data: optional, if not specified in formula, the dataframe containing x1,…,xk and y
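A multiple regression with three explanatory variables might look as follows. The dataframe d, its columns and the true coefficients are simulated assumptions chosen for illustration.

```r
# Simulated data (hypothetical): x3 has no true effect on y
set.seed(4)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 0.5 * d$x2 + rnorm(n)

# Explanatory variables are joined with + in the formula
fit <- lm(y ~ x1 + x2 + x3, data = d)

summary(fit)$coefficients   # one row per term: estimate, SE, t value, p-value
```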

Generalised linear model

The function glm(formula, family) fits a generalised linear model to data.

formula: y~x1+x2+…+xk with y response variable and x1,…,xk explanatory variables (must have the same length)

family: specifies the error distribution and the link function; choose family=binomial for logistic regression
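The following sketch fits a logistic regression to simulated data. The dataframe d, the variable names and the coefficients used to generate the outcome are hypothetical; glm() also accepts a data argument, which is used here to keep the formula short.

```r
# Simulated data (hypothetical): binary outcome depending on age and chol
set.seed(5)
n <- 200
d <- data.frame(age = runif(n, 30, 80), chol = rnorm(n, 220, 30))
p <- plogis(-12 + 0.08 * d$age + 0.03 * d$chol)   # true event probabilities
d$code <- rbinom(n, size = 1, prob = p)

# family = binomial uses the logit link by default (logistic regression)
fit <- glm(code ~ age + chol, family = binomial, data = d)

summary(fit)$coefficients               # estimates on the log-odds scale
exp(coef(fit))                          # odds ratios
head(predict(fit, type = "response"))   # fitted probabilities
```

The coefficients are on the log-odds scale, so exponentiating them gives odds ratios per unit increase of the respective variable.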

name label

nro identifier

code coded group 0 'control', 1 'infarct'

age age years

> model_mi=glm(mi$code~mi$sex+mi$age+

mi$height+mi$weight+mi$blood.sugar+mi$diabet

+mi$chol+mi$trigl+mi$cig,family=binomial)

> summary(model_mi)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -34.60297 12.51757 -2.764 0.005704 **

mi$sex 0.23048 0.90885 0.254 0.799810

mi$age 0.10734 0.04161 2.580 0.009883 **

mi$height 0.14930 0.07838 1.905 0.056799 .

mi$weight -0.11508 0.06304 -1.826 0.067916 .

mi$blood.sugar -0.02246 0.01399 -1.605 0.108425

mi$diabet 2.05732 2.15947 0.953 0.340743

mi$chol 0.07294 0.02188 3.334 0.000855 ***

mi$trigl -0.01936 0.01227 -1.578 0.114638

mi$cig 0.07686 0.04695 1.637 0.101603

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> model_mi=glm(mi$code~mi$age+mi$chol,family=binomial)

> summary(model_mi)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -16.13858 3.78005 -4.269 1.96e-05 ***

mi$age 0.08404 0.03255 2.582 0.009827 **

mi$chol 0.05564 0.01569 3.546 0.000391 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Model after backward selection
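The slides do not state which criterion was used for the backward selection. One possible automated variant is the function step(), which removes terms based on AIC rather than on p-values, so it may keep or drop different variables than a manual p-value based procedure. The sketch below uses simulated data with hypothetical variable names.

```r
# Simulated data (hypothetical): 'noise' has no true effect on the outcome
set.seed(6)
n <- 200
d <- data.frame(age = runif(n, 30, 80),
                chol = rnorm(n, 220, 30),
                noise = rnorm(n))
d$code <- rbinom(n, 1, plogis(-12 + 0.08 * d$age + 0.03 * d$chol))

# Start from the full model and drop terms while the AIC improves
full <- glm(code ~ age + chol + noise, family = binomial, data = d)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # terms kept in the reduced model
```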

hands-on example

Load the package with library(mlbench); the dataframe is called PimaIndiansDiabetes2. Perform a linear regression with the variable insulin as response and the variables glucose, pressure, mass and triceps as explanatory variables. Apply a backward selection to generate a reduced model.

Survival analysis

tutorial 10

Survival object

Before performing the analysis, a survival object has to be created with the function Surv(time, event).

time: time since start of study (survival time)

if event occurred: time of the event

if no event occurred: last observation time

event:

1: event

0: no event

Important: has to be numeric.
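A small sketch of a survival object, built from five made-up observations (the vectors time and event are hypothetical):

```r
library(survival)

# Five hypothetical observations: time since start of study and event indicator
time  <- c(5, 8, 12, 12, 20)
event <- c(1, 0, 1, 1, 0)   # 1: event, 0: no event (censored)

s <- Surv(time = time, event = event)
s   # censored observations are printed with a trailing "+"
```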

Survival curves

The function survfit(formula, data) creates an estimated survival curve. Afterwards use the plot command.

formula: Let y be a Surv object.

y~1 for a Kaplan-Meier curve

y~x for several Kaplan-Meier curves stratified by x
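Both formula variants can be sketched on simulated data. The dataframe d with the columns time, event and group is a hypothetical example, not the tutorial's dataset.

```r
library(survival)

# Simulated survival data (hypothetical values)
set.seed(8)
d <- data.frame(
  time  = round(rexp(40, rate = 0.1), 1),
  event = rbinom(40, 1, 0.7),
  group = rep(c("A", "B"), each = 20)
)
y <- Surv(time = d$time, event = d$event)

km_all   <- survfit(y ~ 1)         # one Kaplan-Meier curve for all observations
km_group <- survfit(y ~ d$group)   # one Kaplan-Meier curve per group

plot(km_group, lty = 1:2)
legend("topright", legend = c("A", "B"), lty = 1:2)
```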

Log-Rank Test

The function survdiff(formula, rho, data) tests if there is a difference between two or more survival curves.

formula: y~x with

y: Surv object

x: group or stratifying variable

rho: a scalar parameter that controls the type of test.

rho=0 (default) for the Log-Rank Test (Modification of test in lecture)
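A minimal sketch of the test on simulated data follows; the dataframe d and its columns are hypothetical. The p-value is recomputed from the chi-square statistic, with degrees of freedom equal to the number of groups minus one.

```r
library(survival)

# Simulated data (hypothetical): group B has a lower hazard than group A
set.seed(9)
d <- data.frame(
  time  = c(rexp(30, rate = 0.10), rexp(30, rate = 0.05)),
  event = rbinom(60, 1, 0.8),
  group = rep(c("A", "B"), each = 30)
)

# rho = 0 gives the log-rank test
res <- survdiff(Surv(time, event) ~ group, data = d, rho = 0)
res

# p-value from the chi-square statistic
p_value <- 1 - pchisq(res$chisq, df = length(res$n) - 1)
p_value
```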

example data: survival

therapy: two chemotherapies, C1 and C2

time: if death occurred: time of death; if no death occurred: last observation time

event: 1: death, 0: no death

> install.packages("survival")

> library(survival)

> cancer=read.table("survival.csv",dec=",",sep=";",

header=TRUE)

> head(cancer)

> surv_object=Surv(time=cancer$time,

event=cancer$event)

> curve=survfit(surv_object~1)

> summary(curve)

> plot(curve)

# One Kaplan-Meier curve for both therapies combined

# (with confidence bands)

> surv_object=Surv(time=cancer$time,

event=cancer$event)

> curve=survfit(surv_object~cancer$therapy)

> summary(curve)

> plot(curve,lty=1:2)

> legend("topright",levels(factor(cancer$therapy)),lty=1:2)

# Two Kaplan-Meier curves, one for each therapy

# group (without confidence bands)

> survdiff(surv_object~cancer$therapy,rho=0)

survdiff(formula = surv_object ~ cancer$therapy,rho = 0)

N Obs Expected (O-E)^2/E (O-E)^2/V

cancer$therapy=C1 10 6 4.07 0.919 1.56

cancer$therapy=C2 10 6 7.93 0.471 1.56

Chisq= 1.6 on 1 degrees of freedom, p= 0.211

# No significant difference between the

# survival functions of the two therapies

hands-on example

Load the package with library(survival); the dataframe is called retinopathy. In this dataframe, futime is the time variable and status is the event variable. Plot a Kaplan-Meier curve.
