A very brief introduction to R - Drexel University

A very brief introduction to R

Erjia Yan

January 25, 2010

1

Outline

• Introduction

• Data input

• Data type

• Functions

• Graphics

• Resources

• Examples

2

What is R?

• Statistical computing language;

• Variety of statistical and numerical methods;

• Easy to build your own functions;

• High-quality visualization and graphics tools;

• Abundant free packages; and

• Extensive help files.

3

Statistics R can do

• Factorial methods

• Clustering

• Probability Distributions

• Statistical Tests

• Regression

• Generalized Linear Models

• Mixed models, etc.

4

Factorial methods

• Principal Component Analysis (PCA)

• Distance-based methods

• SOM (Self-Organizing Maps)

• Simple Correspondence Analysis (CA)

• Multiple Correspondence Analysis

• Log-linear model (Poisson Regression)

• Discriminant Analysis

• Canonical analysis

• Kernel methods

• Neural networks

5

Clustering

• Non-hierarchical clustering (k-means)

• Hierarchical classification (dendrogram)

• Density estimation

6

Probability Distributions

• Discrete probability distributions

• Continuous probability distributions

• Extreme value theory

7

Statistical Tests

• Parametric Tests

• Discrete variables and the Chi^2 test

• Non-parametric tests

8

Regression

• Linear regression

• Non-linear regression

9

Generalized Linear Models

• Naive Bayes classifier

• Discriminant Analysis

• Logistic Regression

10

Reading data into R

• R is not well suited for data preprocessing;

• Preprocess data elsewhere (SPSS, etc.);

• Easiest form of data to input: text file;

• Spreadsheet-like data:

– Small/medium size: read.table()

– Large data: scan()

• Read from other systems:

– Use the library "foreign": library(foreign)

– Can import from SAS, SPSS, Epi Info

– Can export to STATA
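As a minimal sketch of the text-file route, the code below writes a small tab-delimited table to a temporary file and reads it back with read.table(). The file contents and the column names (id, wg) are invented for illustration and are not part of the original slides.

```r
# Write a tiny tab-delimited table to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\twg", "1\t2.5", "2\t-0.4", "3\t1.1"), tmp)

# header = TRUE: the first line holds the column names
# sep = "\t": fields are separated by tabs
D <- read.table(tmp, header = TRUE, sep = "\t")

str(D)        # show the structure of the resulting data frame
mean(D$wg)    # columns are accessed with $

unlink(tmp)   # remove the temporary file
```

For a real file, the temporary path would be replaced by the path to the text file exported from the preprocessing tool.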

11

Reading data into R (cont.)

• R Commander

– Package: Rcmdr

– > library(Rcmdr)

12

Naming conventions

• Names may contain roman letters, digits, and '.' (current R also accepts '_'); a name cannot start with a digit, or with '.' followed by a digit;

• Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi, range, rank, tree, var; and

• These rules hold for variables, data, and functions.
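One way to check a candidate name is to pass it through make.names(), which returns its input unchanged only when the input is already a syntactically valid name. The candidate names below are made up for illustration.

```r
# Candidate names (hypothetical); the last three break the naming rules
candidates <- c("x1", "my.var", ".hidden", "1x", ".2way", "a-b")

# make.names() repairs invalid names, so an unchanged result means "valid"
is_valid <- make.names(candidates) == candidates

data.frame(name = candidates, valid = is_valid)
```

This only checks syntax; it does not warn about masking system names such as c or mean.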

13

Defining new variables

• Assignment symbol: use "<-" (the older "_" form is no longer accepted by current versions of R)

• Scalars

– scal<-6

– value<-7

• Vectors

– vec<-c(0,1,2)

– vec2<-c(1:10)

– vec3<-c(8,6,4,2,10,12,14)

– famnames<-c("Kate", "Andrew", "Brian")

• Variable names are case sensitive

14

Frequently used operators

<-     Assign
+      Sum
-      Difference
*      Multiplication
/      Division
^      Exponent
%%     Mod
%*%    Matrix product (dot product for two vectors)
%/%    Integer division
%in%   Membership (is each element in a set?)

|      Or
&      And
<      Less
>      Greater
<=     Less or equal
>=     Greater or equal
!      Not
!=     Not equal
==     Is equal
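The less familiar operators are easiest to understand by trying them. The following is a small sketch with made-up values; the variable names are illustrative only.

```r
7 %% 3                  # remainder of 7 divided by 3
7 %/% 3                 # integer part of 7 divided by 3
2^10                    # exponent

v <- c(1, 2, 3)
w <- c(4, 5, 6)
v * w                   # element-wise product
dot <- v %*% w          # 1 x 1 matrix holding the dot product of v and w

m <- matrix(1:4, nrow = 2)
mm <- m %*% m           # matrix product, not element-wise

c(2, 9) %in% v          # membership test, one result per left-hand element
(v > 1) & (v < 3)       # element-wise "and" of two logical vectors
```

Note that * always works element by element, while %*% follows the rules of linear algebra.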

15

Frequently used functions

c                  Concatenate
cbind, rbind       Concatenate vectors as columns or rows
min                Minimum
max                Maximum
length             Number of values
dim                Number of rows, columns
floor              Largest integer not exceeding the value
which              Indices of TRUE values
table              Counts
summary            Generic statistics
sort, order, rank  Sort, order, rank a vector
print              Show value
cat                Print as characters
paste              Concatenate as character strings
round              Round
apply              Repeat over rows, columns

16
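A short sketch of several of the frequently used functions listed on the slide above, applied to a small made-up vector and matrix:

```r
x <- c(3, 1, 2, 3, 5)

length(x)                      # number of values
which(x == 3)                  # positions where the condition is TRUE
table(x)                       # count of each distinct value
sort(x)                        # the values in increasing order
ord <- order(x)                # the permutation that sorts x
floor(2.7)                     # largest integer not exceeding 2.7

m <- rbind(1:3, 4:6)           # bind two vectors as rows: a 2 x 3 matrix
dim(m)                         # number of rows, number of columns
row_sums <- apply(m, 1, sum)   # second argument 1: repeat over rows
col_max  <- apply(m, 2, max)   # second argument 2: repeat over columns

paste("n =", length(x))        # build a character string
```

sort() returns the sorted values themselves, whereas order() returns indices, which is what is needed to reorder the rows of a data frame.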

Statistical functions

rnorm, dnorm, pnorm, qnorm   Normal distribution: random sample, density, cdf, and quantiles
lm, glm, anova               Model fitting
loess, lowess                Smooth curve fitting
sample                       Resampling (bootstrap, permutation)
.Random.seed                 Random number generation (state; set it with set.seed)
mean, median                 Location statistics
var, cor, cov, mad, range    Scale statistics
svd, qr, chol, eigen         Linear algebra
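The sketch below exercises a few of the statistical functions listed on the slide above, using simulated data. The true intercept (3) and slope (2) are arbitrary choices made for this illustration.

```r
set.seed(1)                          # fix the seed so the run is repeatable

x <- rnorm(100, mean = 10, sd = 2)   # random sample from a normal distribution
dnorm(0)                             # standard normal density at 0
pnorm(1.96)                          # cumulative probability below 1.96
qnorm(0.975)                         # quantile: the inverse of pnorm

y <- 3 + 2 * x + rnorm(100)          # synthetic response with unit-variance noise
fit <- lm(y ~ x)                     # least-squares linear model
coef(fit)                            # estimated intercept and slope

boot <- sample(x, replace = TRUE)    # one bootstrap resample of x
c(mean = mean(x), median = median(x), var = var(x))
```

Calling summary(fit) or anova(fit) on the fitted model gives the usual regression tables.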

17

Graphical functions

plot             Generic plot, e.g. scatter
points           Add points
lines, abline    Add lines
text, mtext      Add text
legend           Add a legend
axis             Add axes
box              Add box around all axes
par              Plotting parameters
colors, palette  Use colors

18

Writing R code

• Can input lines one at a time into R

• Can write many lines of code in a text editor and run all at once

– Using Windows version, simply paste the commands into R

– Using Unix version, save the commands and run in batch mode

19

Types of commands

• Defining variables

• Inputting data

• Using built-in functions

• Using the help menu and notation

– ?functionname, help.search("functionname")

• Writing your own functions
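As an illustration of writing your own function, here is a small sketch of a helper that computes the coefficient of variation. The function name cv and its input check are hypothetical choices for this example, not something defined by R or by the slides.

```r
# Hypothetical helper: coefficient of variation (standard deviation / mean)
cv <- function(x, na.rm = FALSE) {
  if (!is.numeric(x)) stop("x must be numeric")   # basic input check
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)   # last value is returned
}

vals <- c(2, 4, 4, 4, 5, 5, 7, 9)
cv(vals)                      # call it like any built-in function
cv(c(vals, NA), na.rm = TRUE) # optional arguments are passed by name
```

The value of the last evaluated expression is what the function returns, so an explicit return() is optional.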

20

Language layout

• Three types of statement

– expression: it is evaluated, printed, and the value is lost (3+5)

– assignment: passes the value to a variable but the result is not printed automatically (out<-3+5)

– comment: (#This is a comment)

21

Loops and conditionals

• Conditional

– if (expr) expr

– if (expr) expr else expr

• Iteration

– repeat expr

– while (expr) expr

– for (name in expr1) expr

• For comparisons use:

– == for equal

– != for not equal

– > for greater than

– && for and

– || for or (& and | are the element-wise forms)
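The constructs above can be sketched with a few toy loops. All values are arbitrary and chosen only so the results are easy to verify by hand.

```r
# for: add up the even numbers from 1 to 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) total <- total + i
}

# while: count how many halvings bring 100 below 1
x <- 100
steps <- 0
while (x >= 1) {
  x <- x / 2
  steps <- steps + 1
}

# repeat: loops forever unless break is called
k <- 0
repeat {
  k <- k + 1
  if (k^2 > 50) break
}

# if / else used as an expression, with && combining two scalar conditions
label <- if (total > 20 && steps > 5) "both" else "not both"
```

In practice many such loops can be replaced by vectorized calls, for example sum(seq(2, 10, by = 2)) for the first one.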

22

Plot Command

• The basic command-line command for producing a scatter plot or line graph.

– col= set colors,

– lty= set line types,

– lwd= set line widths,

– pch= set the plotting character,

– type= pick points (type = "p"), lines ("l"),

– cex= set the "character expansion",

– xlab= and ylab= set the labels,

– xlim= and ylim= set the limits of the axes,

– main= put a title on the plot,

– sub= add a sub-title (mtext() is a separate function that writes text in the margins),

– help(par) for details
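A minimal sketch that combines most of these arguments in one call, using a sine curve as made-up data:

```r
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x)

plot(x, y,
     type = "b",                       # both points and lines
     col = "blue",                     # color
     lty = 2, lwd = 2,                 # dashed line, double width
     pch = 16, cex = 0.8,              # filled circles, slightly shrunk
     xlab = "x", ylab = "sin(x)",      # axis labels
     xlim = c(0, 2 * pi),              # axis limits
     ylim = c(-1.5, 1.5),
     main = "Sine curve",              # title
     sub = "50 points")                # sub-title

mtext("text in the top margin", side = 3, line = 0.3)
```

When run non-interactively (for example with Rscript), the plot is written to a file rather than shown on screen.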

23

One-Dimensional Plots

• barplot(height) #simple form

• barplot(height, width, space=.2, names.arg, legend.text, beside=FALSE, horiz=FALSE, density, angle, col, inside=TRUE)

• boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE)

• hist(x, breaks, nclass, plot=TRUE, angle, density, col)
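The three one-dimensional plot types can be tried on the mtcars data set that ships with R. The choice of data set and columns is an assumption made for this sketch; the slides themselves do not use mtcars.

```r
# Counts of cars by number of cylinders
counts <- table(mtcars$cyl)
barplot(counts, names.arg = names(counts), col = "light blue",
        main = "Cars by number of cylinders")

# Distribution of fuel economy within each cylinder group
boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE,
        xlab = "Cylinders", ylab = "Miles per gallon")

# Histogram; the returned object holds the bin counts and break points
h <- hist(mtcars$mpg, breaks = 10, col = "light blue",
          main = "Fuel economy", xlab = "Miles per gallon")
h$counts
```

The breaks argument is only a suggestion when given as a single number; R picks "pretty" break points near that count.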

24

Two-Dimensional Plots

• lines(x, y, type="l")

• points(x, y, type="p")

• matplot(x, y, type="p", lty=1:5, pch=, col=1:4)

• matpoints(x, y, type="p", lty=1:5, pch=, col=1:4)

• matlines(x, y, type="l", lty=1:5, pch=, col=1:4)

• plot(x, y, type="p", log="")

• abline(coef), abline(a, b), abline(reg), abline(h=), abline(v=)

• qqplot(x, y, plot=TRUE)

• qqnorm(x, datax=FALSE, plot=TRUE)

25

Three-Dimensional Plots

• contour(x, y, z, nlevels=10, levels, labcex=0.6, add=FALSE)

• interp(x, y, z, xo, yo, extrap=FALSE) #from the add-on package akima

• persp(x, y, z, theta=0, phi=15, r=sqrt(3), expand=1)

26

Basic Graphics

• Histogram

– hist(D$wg)

27

Basic Graphics

• Add a title…

– The "main" argument will give the plot an overall heading.

– hist(D$wg, main='Weight Gain')

28

Basic Graphics

• Adding axis labels…

• Use "xlab" and "ylab" to label the X and Y axes, respectively.

• hist(D$wg, main='Weight Gain', xlab='Weight Gain', ylab='Frequency')

29

Basic Graphics

• Changing colors…

• Use the col argument.

– ?colors will give you help on the colors.

– Common colors can simply be given by name.

– hist(D$wg, main="Weight Gain", xlab="Weight Gain", ylab="Frequency", col="blue")

30

Basic Graphics – Colors

31

Scatter Plots

• Suppose we have two variables and we wish to see the relationship between them.

• A scatter plot works very well.

• R code:

– plot(x,y)

• Example

– plot(D$metmin,D$wg)

32

Scatterplots

33

Scatterplots

plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)')

34

Scatterplots

plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)',pch=2)

35

Line Plots

• Data are often recorded over time.

• Consider Dell stock

– D2 <- read.csv("H:\\Dell.csv",header=TRUE)

– t1 <- 1:nrow(D2)

– plot(t1,D2$DELL)

36

Line Plots

37

Line Plots

plot(t1,D2$DELL,type="l")

38

Line Plots

plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')

39

Overlaying Plots

• Often we have more than one variable measured against the same predictor (X).

– plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')

– lines(t1,D2$Intel)

40

Overlaying Graphs

41

Overlaying Graphs

lines(t1,D2$Intel,lty=2)

42

Overlaying Graphs

43

Adding a Legend

• Adding a legend is a bit tricky in R.

• Syntax

• legend(x, y, names, line types)

– x: X coordinate

– y: Y coordinate

– names: names of series in column format

– line types: corresponding line types

44

Adding a Legend

legend(60,45,c('Dell','Intel'),lty=c(1,2))

45
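Because Dell.csv is not distributed with the slides, the overlay and legend steps can be reproduced with simulated series. The numbers below are random walks standing in for prices, not real stock data, and the variable names dell and intel are placeholders.

```r
set.seed(42)
t1 <- 1:100
dell  <- 40 + cumsum(rnorm(100))   # synthetic stand-in for D2$DELL
intel <- 30 + cumsum(rnorm(100))   # synthetic stand-in for D2$Intel

# Set ylim from both series so the second line is not clipped
plot(t1, dell, type = "l", lty = 1,
     ylim = range(dell, intel),
     main = "Two simulated price series",
     xlab = "Time", ylab = "Price $")
lines(t1, intel, lty = 2)          # overlay the second series, dashed

# Names and line types must be listed in the same order
legend("topleft", legend = c("Dell", "Intel"), lty = c(1, 2))
```

Using a keyword such as "topleft" instead of x and y coordinates avoids having to guess a position that suits the data range.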

Good resources

• http://cran.r-project.org/manuals.html

• Statistics with R

– 17 chapters, 1266 pages;

– High definition images;

– All the code is online; and

– Freely available at http://zoonek2.free.fr/UNIX/48_R/all.html

46

SOME EXAMPLES

47

Histogram

• library(e1071) # For the "skewness" and "kurtosis" functions

• n <- 1000

• x <- rnorm(n)

• op <- par(mar=c(3,3,4,2)+.1)

• hist(x, col="light blue", probability=TRUE, main=paste("skewness =", round(skewness(x), digits=2)), xlab="", ylab="")

• lines(density(x), col="red", lwd=3)

• par(op)

48

Histogram (cumulative)

• op <- par(mfcol=c(2,4), mar=c(2,2,1,1)+.1)

• do.it <- function (x) {
    hist(x, probability=T, col='light blue', xlab="", ylab="", main="", axes=F)
    axis(1)
    lines(density(x), col='red', lwd=3)
    x <- sort(x)
    q <- ppoints(length(x))
    plot(q~x, type='l', xlab="", ylab="", main="")
    abline(h=c(.25,.5,.75), lty=3, lwd=3, col='blue')
  }

• n <- 200

• do.it(rnorm(n))

• do.it(rlnorm(n))

• do.it(-rlnorm(n))

• do.it(rnorm(n, c(-5,5)))

• par(op)

49

Histogram (Old Faithful Geyser Data)

• Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

• hist(faithful$eruptions, probability=TRUE, breaks=20, col="light blue", xlab="", ylab="", main="Histogram and density estimation")

• points(density(faithful$eruptions, bw=1), type='l', lwd=3, col='black')

• points(density(faithful$eruptions, bw=.5), type='l', lwd=3, col='blue')

• points(density(faithful$eruptions, bw=.3), type='l', lwd=3, col='green')

• points(density(faithful$eruptions, bw=.1), type='l', lwd=3, col='red')

50

Lines

• library(e1071) # For the "skewness" and "kurtosis" functions

• n <- 1000

• x <- rnorm(n)

• qqnorm(x, main=paste("kurtosis =", round(kurtosis(x), digits=2), "(gaussian)"))

• qqline(x, col="red")

• op <- par(fig=c(.02,.5,.5,.98), new=TRUE)

• hist(x, probability=T, col="light blue", xlab="", ylab="", main="", axes=F)

• lines(density(x), col="red", lwd=2)

• box()

• par(op)

51

Dot chart (Areas of the World's Major Landmasses)

• The areas in thousands of square miles of the landmasses which exceed 10,000 square miles.

• data(islands)

• dotchart(islands, main="Island area")

• dotchart(sort(log(islands)), main="Island area (logarithmic scale)")

52

Scatter plot (Old Faithful Geyser Data)

• op <- par(mar=c(3,4,2,2)+.1)

• plot(sort(faithful$eruptions), xlab="")

• rug(faithful$eruptions, side=2)

• par(op)

53

Scatter plot (Edgar Anderson's Iris Data)

• This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

• data(iris)

• plot(iris[1:4], pch = 21, bg = c("red", "green", "blue")[ as.numeric(iris$Species) ])

54

Scatter plot (Lawyers' Ratings of State Judges in the US Superior Court)

• Lawyers' ratings of state judges in the US Superior Court.

• pairs(USJudgeRatings, gap=0)

55

Scatter plot (Longley's Economic Regression Data)

• pairs(longley, gap=0,
    diag.panel = function (x, ...) {
      par(new = TRUE)
      hist(x, col = "light blue", probability = TRUE, axes = FALSE, main = "")
      lines(density(x), col = "red", lwd = 3)
      rug(x)
    })

56

Clustering (Lawyers' Ratings of State Judges in the US Superior Court)

• heatmap(as.matrix(USJudgeRatings))

57

Kernel Density Estimation

• data(faithful)

• x <- faithful$eruptions

• y <- faithful$waiting

• library(MASS)

• library(fields)

• z <- kde2d(x, y, n=300)

• image.plot(z)

58

Contour

• data(faithful)

• x <- faithful$eruptions

• y <- faithful$waiting

• library(MASS)

• library(fields)

• z <- kde2d(x, y, n=300)

• contour(z, col = "red", main = "Density estimation: contour plot")

59

KDE+Contour

• data(faithful)

• x <- faithful$eruptions

• y <- faithful$waiting

• library(MASS)

• library(fields)

• z <- kde2d(x, y, n=300)

• image.plot(z)

• contour(z, col = "red", add=T)

60

3-D KDE

• data(faithful)

• x <- faithful$eruptions

• y <- faithful$waiting

• library(MASS)

• library(fields)

• z <- kde2d(x, y, n=100)

• op <- par(mar=c(0,0,2,0)+.1)

• persp(z, phi = 45, theta = 30, xlab = "eruptions", ylab = "waiting", zlab = "density", col = "yellow", shade = .5, border = NA, main = "Density estimation: perspective plot")

• par(op)

61

References

• http://zoonek2.free.fr/UNIX/48_R/all.html

• http://www.pitt.edu/~super7/17011-18001/17641.ppt

• http://isites.harvard.edu/fs/docs/icb.topic154887.files/Intro_to_R.ppt

62
