A very brief introduction to R - Drexel University
A very brief introduction to R
Erjia Yan
January 25, 2010
1
Outline
• Introduction
• Data input
• Data type
• Functions
• Graphics
• Resources
• Examples
2
What is R?
• A statistical programming language;
• Offers a wide variety of statistical and numerical methods;
• Easy to build your own functions;
• High-quality visualization and graphics tools;
• Has abundant free packages; and
• Extensive help files.
3
Statistics R can do
• Factorial methods
• Clustering
• Probability Distributions
• Statistical Tests
• Regression
• Generalized Linear Models
• Mixed models, etc.
4
Factorial methods
• Principal Component Analysis (PCA)
• Distance-based methods
• SOM (Self-Organizing Maps)
• Simple Correspondence Analysis (CA)
• Multiple Correspondence Analysis
• Log-linear model (Poisson Regression)
• Discriminant Analysis
• Canonical analysis
• Kernel methods
• Neural networks
5
Clustering
• Non-hierarchical clustering (k-means)
• Hierarchical Classification (dendrogram)
• Density estimation
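As a rough sketch of the first two items, the base functions kmeans() and hclust() can be tried on the built-in iris measurements. The choice of data set and of three clusters is an assumption made for illustration, not part of the slide.

```r
# Numeric columns of the built-in iris data (illustrative choice)
X <- iris[, 1:4]

# Non-hierarchical clustering: k-means with 3 clusters
set.seed(1)                       # k-means starts from random centers
km <- kmeans(X, centers = 3)
table(km$cluster, iris$Species)   # compare clusters with the known species

# Hierarchical classification: build the tree, then draw the dendrogram
hc <- hclust(dist(X))
plot(hc, labels = FALSE, main = "Dendrogram of iris measurements")
groups <- cutree(hc, k = 3)       # cut the tree into 3 groups
```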
6
Probability Distributions
• Discrete probability distributions
• Continuous probability distributions
• Extreme value theory
7
Statistical Tests
• Parametric Tests
• Discrete variables and the Chi^2 test
• Non-parametric tests
8
Regression
• Linear regression
• Non-linear regression
9
Generalized Linear Models
• Naive Bayes classifier
• Discriminant Analysis
• Logistic Regression
10
Reading data into R
• R is not well suited for data preprocessing;
• Preprocess data elsewhere (SPSS, etc.);
• Easiest form of data to input: text file;
• Spreadsheet-like data:
– Small/medium size: read.table()
– Large data: scan()
• Read from other systems:
– Use the package "foreign": library(foreign)
– Can import from SAS, SPSS, Epi Info
– Can export to STATA
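A minimal sketch of the two text-file routes mentioned above. The file contents and the column names (id, wg, metmin) are made up for illustration; the file is written to a temporary location so the example runs on its own.

```r
# Write a small tab-delimited file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\twg\tmetmin",
             "1\t2.5\t120",
             "2\t-1.0\t300",
             "3\t4.2\t60"), tmp)

# Small/medium data: read.table() returns a data frame
D <- read.table(tmp, header = TRUE, sep = "\t")
str(D)

# Large data: scan() reads the values into a list of vectors,
# skipping the header line; 'what' gives the type of each column
raw <- scan(tmp, what = list(id = 0L, wg = 0, metmin = 0),
            sep = "\t", skip = 1, quiet = TRUE)
raw$wg
```

scan() does less bookkeeping than read.table(), which is why the slide suggests it for large inputs, at the price of declaring the column types yourself.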
11
Reading data into R (cont.)
• R commander
– Package: Rcmdr
– library(Rcmdr)
12
Naming conventions
• Any Roman letters, digits, and ‘.’ (a digit cannot be in the initial position);
• Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi, range, rank, tree, var; and
• These rules hold for variables, data, and functions.
13
Defining new variables
• Assignment symbol: use "<-" ("=" also works; the old "_" form is no longer valid in R)
• Scalars
– scal <- 6
– value <- 7
• Vectors
– vec <- c(0,1,2)
– vec2 <- c(1:10)
– vec3 <- c(8,6,4,2,10,12,14)
– famnames <- c("Kate", "Andrew", "Brian")
• Variable names are case sensitive
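The definitions above can be tried directly. The following sketch repeats them and adds a second name, Vec, purely as an illustration of case sensitivity.

```r
# Scalars and vectors as defined on the slide
scal <- 6
vec  <- c(0, 1, 2)
vec2 <- c(1:10)                 # the same as 1:10
famnames <- c("Kate", "Andrew", "Brian")

length(vec2)                    # number of elements
class(famnames)                 # character vector

# Names are case sensitive: Vec (hypothetical) is a different object from vec
Vec <- c(9, 9)
identical(vec, Vec)
```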
14
Frequently used operators
<-      Assign
+       Sum
-       Difference
*       Multiplication
/       Division
^       Exponent
%%      Mod
%*%     Matrix product (dot product for two vectors)
%/%     Integer division
%in%    Membership (is each element in a set?)
|       Or
&       And
<       Less
>       Greater
<=      Less or =
>=      Greater or =
!       Not
!=      Not equal
==      Is equal
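A few of the less familiar operators, shown on small made-up vectors a and b (both names are illustrative):

```r
7 %% 3             # remainder of the division
7 %/% 3            # integer part of the division
2 ^ 3              # exponent

a <- c(1, 2, 3)
b <- c(4, 5, 6)
a %*% b            # matrix product; for two vectors this is the dot product (a 1 x 1 matrix)
c(2, 5) %in% a     # membership test, one result per element on the left

(a > 1) & (b < 6)  # element-wise "and" on logical vectors
!(a == 2)          # negation
```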
15
Frequently used functions
c                   Concatenate
cbind, rbind        Concatenate vectors (as columns, as rows)
min                 Minimum
max                 Maximum
length              # values
dim                 # rows, cols
floor               Largest integer not above the value
which               TRUE indices
table               Counts
summary             Generic stats
sort, order, rank   Sort, order, rank a vector
print               Show value
cat                 Print as char
paste               c() as char
round               Round
apply               Repeat over rows, cols
16
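A short sketch of several of these functions. The vector v reuses the values of vec3 from the "Defining new variables" slide; the matrix m is made up for illustration.

```r
v <- c(8, 6, 4, 2, 10, 12, 14)

sort(v)              # values in increasing order
order(v)             # positions that would sort v
rank(v)              # rank of each value
which(v > 9)         # indices where the condition is TRUE
floor(2.7)           # largest integer not above 2.7

m <- cbind(a = 1:3, b = 4:6)   # bind two vectors as columns
dim(m)                          # rows, cols
apply(m, 1, sum)                # repeat sum() over rows
apply(m, 2, sum)                # repeat sum() over columns

table(c("a", "b", "a"))         # counts per distinct value
paste("x", 1:3)                 # combine as character strings
```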
Statistical functions
rnorm, dnorm, pnorm, qnorm   Normal distribution: random sample, density, cdf, and quantiles
lm, glm, anova               Model fitting
loess, lowess                Smooth curve fitting
sample                       Resampling (bootstrap, permutation)
.Random.seed                 Random number generation
mean, median                 Location statistics
var, cor, cov, mad, range    Scale statistics
svd, qr, chol, eigen         Linear algebra
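A small sketch tying a few of these together on simulated data. The true intercept 2, the slope 3, and the sample sizes are arbitrary assumptions chosen for illustration.

```r
set.seed(1)                      # make the random numbers reproducible

# Normal distribution: random sample, density, cdf, quantile
x <- rnorm(100)
dnorm(0)
pnorm(1.96)
qnorm(0.975)

# Model fitting: simulate a line with noise and recover it with lm()
y <- 2 + 3 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)
coef(fit)

# Resampling: bootstrap distribution of the mean via sample()
boot <- replicate(200, mean(sample(x, replace = TRUE)))
range(boot)

# Location and scale statistics
c(mean = mean(x), median = median(x), var = var(x), mad = mad(x))
```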
17
Graphical functions
plot              Generic plot, e.g. scatter
points            Add points
lines, abline     Add lines
text, mtext       Add text
legend            Add a legend
axis              Add axes
box               Add box around all axes
par               Plotting parameters
colors, palette   Use colors
18
Writing R code
• Can input lines one at a time into R
• Can write many lines of code in a text editor and run all at once
– Using Windows version, simply paste the commands into R
– Using Unix version, save the commands and run in batch mode
19
Types of commands
• Defining variables
• Inputting data
• Using built-in functions
• Using the help menu and notation
– ?functionname, help.search("functionname")
• Writing your own functions
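As a minimal sketch of the last item, here is a user-defined function. The name se and its na.rm argument are hypothetical choices for illustration; only sd(), sqrt(), length(), and is.na() are built in.

```r
# Standard error of the mean: sd(x) / sqrt(n)
se <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]   # optionally drop missing values first
  sd(x) / sqrt(length(x))
}

se(c(2, 4, 4, 4, 5, 5, 7, 9))
se(c(2, 4, NA, 9), na.rm = TRUE)

# Help on the built-in pieces (run interactively):
# ?sd
# help.search("standard deviation")
```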
20
Language layout
• Three types of statement
– expression: it is evaluated, printed, and the value is lost (3+5)
– assignment: passes the value to a variable but the result is not printed automatically (out<-3+5)
– comment: (#This is a comment)
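The three statement types side by side. The name out2 and the parenthesized form are additions for illustration: wrapping an assignment in parentheses makes R print the assigned value as well.

```r
3 + 5             # expression: evaluated and printed, value not kept

out <- 3 + 5      # assignment: value stored, nothing printed
out               # typing the name prints the stored value

(out2 <- 3 + 5)   # assignment in parentheses: stored and printed

# This is a comment: R ignores everything after the # sign
```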
21
Loops and conditionals
• Conditional
– if (expr) expr
– if (expr) expr else expr
• Iteration
– repeat expr
– while (expr) expr
– for (name in expr1) expr
• For comparisons use:
– == for equal
– != for not equal
– > for greater than
– && for and
– || for or (use & and | for element-wise operations on vectors)
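A short sketch of each construct. The variable names and the particular conditions are arbitrary choices for illustration.

```r
# for + if: add up the even numbers from 1 to 10, except 4
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0 && i != 4) total <- total + i
}
total

# if / else
label <- if (total > 20) "large" else "small"

# while: loop as long as the condition holds
n <- 0
while (n < 3) n <- n + 1

# repeat: loop until an explicit break
k <- 0
repeat {
  k <- k + 1
  if (k >= 5) break
}
```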
22
Plot Command
• The basic command-line command for producing a scatter plot or line graph.
– col= set colors,
– lty= set line types,
– lwd= set line widths,
– pch= set the character type,
– type= pick points (type = "p"), lines ("l"),
– cex= set the "character expansion",
– xlab= and ylab= set the labels,
– xlim= and ylim= set the limits of the axes,
– main= put a title on the plot,
– sub= add a sub-title (mtext() adds text in the margins),
– help(par) for details
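A sketch that exercises most of these arguments in one call. The data (x and its squares) and all label texts are made up for illustration.

```r
x <- 1:20
y <- x^2

plot(x, y,
     type = "b",                  # both points and lines
     col  = "blue",               # color
     lty  = 2, lwd = 2,           # line type and width
     pch  = 16, cex = 1.2,        # plotting character and its expansion
     xlab = "x", ylab = "x squared",
     xlim = c(0, 20), ylim = c(0, 400),
     main = "Plot arguments",
     sub  = "sub-title set with sub=")

# mtext() is a separate function that writes into the plot margins
mtext("margin text added with mtext()", side = 3, line = 0.3)
```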
23
One-Dimensional Plots
• barplot(height) #simple form
• barplot(height, width, names, space=.2, inside=TRUE, beside=FALSE, horiz=FALSE, legend, angle, density, col, blocks=TRUE)
• boxplot(..., range, width, varwidth=FALSE, notch=FALSE, names, plot=TRUE)
• hist(x, nclass, breaks, plot=TRUE, angle, density, col, inside)
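A minimal sketch of the three functions with a small subset of their arguments. The heights vector is made up; faithful is a data set that ships with R. The argument lists on the slide are summaries, so check ?barplot, ?boxplot, and ?hist for what your R version accepts.

```r
# Bar plot of a small named vector (hypothetical values)
height <- c(a = 3, b = 7, c = 5)
barplot(height, col = "light blue", main = "barplot")

# Box plots of two variables side by side
boxplot(faithful$eruptions, faithful$waiting / 20,
        names = c("eruptions", "waiting / 20"), notch = FALSE)

# Histogram; plot = FALSE returns the bin counts without drawing
h <- hist(faithful$eruptions, breaks = 20, plot = FALSE)
h$counts
hist(faithful$eruptions, breaks = 20, col = "light blue", main = "hist")
```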
24
Two-Dimensional Plots
• lines(x, y, type="l")
• points(x, y, type="p")
• matplot(x, y, type="p", lty=1:5, pch=, col=1:4)
• matpoints(x, y, type="p", lty=1:5, pch=, col=1:4)
• matlines(x, y, type="l", lty=1:5, pch=, col=1:4)
• plot(x, y, type="p", log="")
• abline(coef), abline(a, b), abline(reg), abline(h=), abline(v=)
• qqplot(x, y, plot=TRUE)
• qqnorm(x, datax=FALSE, plot=TRUE)
25
Three-Dimensional Plots
• contour(x, y, z, v, nint=5, add=FALSE, labex)
• interp(x, y, z, xo, yo, ncp=0, extrap=FALSE)
• persp(z, eye=c(-6,-8,5), ar=1)
26
Basic Graphics
• Histogram
– hist(D$wg)
27
Basic Graphics
• Add a title…
– The "main" argument will give the plot an overall heading.
– hist(D$wg, main='Weight Gain')
28
Basic Graphics
• Adding axis labels…
• Use “xlab” and “ylab” to label the X and Y axes, respectively.
• hist(D$wg, main='Weight Gain', xlab='Weight Gain', ylab='Frequency')
29
Basic Graphics
• Changing colors…
• Use the col argument.
– ?colors will give you help on the colors.
– Common colors can simply be given by name.
– hist(D$wg, main="Weight Gain", xlab="Weight Gain", ylab="Frequency", col="blue")
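The slides do not include the data frame D, so the following sketch simulates a stand-in with a wg column. The simulated values are an assumption for illustration only and say nothing about the original data.

```r
# Hypothetical stand-in for the data frame D used on the slides
set.seed(42)
D <- data.frame(wg = rnorm(100, mean = 5, sd = 3))

# Histogram with title, axis labels, and a named color
hist(D$wg, main = "Weight Gain", xlab = "Weight Gain",
     ylab = "Frequency", col = "blue")

# colors() lists every color name R knows
head(colors())
"blue" %in% colors()
```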
30
Basic Graphics – Colors
31
Scatter Plots
• Suppose we have two variables and we wish to see the relationship between them.
• A scatter plot works very well.
• R code:
– plot(x,y)
• Example
– plot(D$metmin,D$wg)
32
Scatterplots
33
Scatterplots
plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)')
34
Scatterplots
plot(D$metmin,D$wg,main='Met Minutes vs. Weight Gain', xlab='Mets (min)',ylab='Weight Gain (lbs)',pch=2)
35
Line Plots
• Data are often recorded over time.
• Consider Dell stock
– D2 <- read.csv("H:\\Dell.csv",header=TRUE)
– t1 <- 1:nrow(D2)
– plot(t1,D2$DELL)
36
Line Plots
37
Line Plots
plot(t1,D2$DELL,type="l")
38
Line Plots
plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')
39
Overlaying Plots
• Often we have more than one variable measured against the same predictor (X).
– plot(t1,D2$DELL,type="l",main='Dell Closing Stock Price',xlab='Time',ylab='Price $')
– lines(t1,D2$Intel)
40
Overlaying Graphs
41
Overlaying Graphs
lines(t1,D2$Intel,lty=2)
42
Overlaying Graphs
43
Adding a Legend
• Adding a legend is a bit tricky in R.
• Syntax
– legend(x, y, names, line types)
– x, y: X and Y coordinates of the legend
– names: names of the series, as a vector
– line types: the corresponding line types
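A self-contained sketch of overlaying two series and adding a legend. The file Dell.csv is not available here, so D2 is simulated as two random walks; the values, the ylim computation, and the "topleft" keyword position are assumptions for illustration.

```r
# Hypothetical stand-in for the stock data read from Dell.csv
set.seed(7)
t1 <- 1:100
D2 <- data.frame(DELL  = 40 + cumsum(rnorm(100)),
                 Intel = 30 + cumsum(rnorm(100)))

# First series sets up the axes; ylim is widened so both series fit
plot(t1, D2$DELL, type = "l", ylim = range(D2$DELL, D2$Intel),
     main = "Closing Stock Price", xlab = "Time", ylab = "Price $")

# Second series is overlaid with a dashed line
lines(t1, D2$Intel, lty = 2)

# Legend entries are listed in the same order as their line types:
# Dell was drawn solid (lty = 1), Intel dashed (lty = 2)
legend("topleft", legend = c("Dell", "Intel"), lty = c(1, 2))
```

Passing a keyword such as "topleft" instead of x and y coordinates avoids having to guess a position inside the data range.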
44
Adding a Legend
legend(60,45,c('Dell','Intel'),lty=c(1,2))
45
Good resources
• http://cran.r-project.org/manuals.html
• Statistics with R– 17 chapters, 1266 pages;
– High definition images;
– All the code is online; and
– Freely available at http://zoonek2.free.fr/UNIX/48_R/all.html
46
SOME EXAMPLES
47
Histogram
• library(e1071) # For the "skewness" and "kurtosis" functions
• n <- 1000
• x <- rnorm(n)
• op <- par(mar=c(3,3,4,2)+.1)
• hist(x, col="light blue", probability=TRUE, main=paste("skewness =", round(skewness(x), digits=2)), xlab="", ylab="")
• lines(density(x), col="red", lwd=3)
• par(op)
48
Histogram (cumulative)
• op <- par(mfcol=c(2,4), mar=c(2,2,1,1)+.1)
• do.it <- function (x) {
    hist(x, probability=T, col='light blue', xlab="", ylab="", main="", axes=F)
    axis(1)
    lines(density(x), col='red', lwd=3)
    x <- sort(x)
    q <- ppoints(length(x))
    plot(q~x, type='l', xlab="", ylab="", main="")
    abline(h=c(.25,.5,.75), lty=3, lwd=3, col='blue')
  }
• n <- 200
• do.it(rnorm(n))
• do.it(rlnorm(n))
• do.it(-rlnorm(n))
• do.it(rnorm(n, c(-5,5)))
• par(op)
49
Histogram (Old Faithful Geyser Data)
• Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
• hist(faithful$eruptions, probability=TRUE, breaks=20, col="light blue", xlab="", ylab="", main="Histogram and density estimation")
• points(density(faithful$eruptions, bw=1), type='l', lwd=3, col='black')
• points(density(faithful$eruptions, bw=.5), type='l', lwd=3, col='blue')
• points(density(faithful$eruptions, bw=.3), type='l', lwd=3, col='green')
• points(density(faithful$eruptions, bw=.1), type='l', lwd=3, col='red')
50
Lines
• library(e1071) # For the "skewness" and "kurtosis" functions
• n <- 1000
• x <- rnorm(n)
• qqnorm(x, main=paste("kurtosis =", round(kurtosis(x), digits=2), "(gaussian)"))
• qqline(x, col="red")
• op <- par(fig=c(.02,.5,.5,.98), new=TRUE)
• hist(x, probability=T, col="light blue", xlab="", ylab="", main="", axes=F)
• lines(density(x), col="red", lwd=2)
• box()
• par(op)
51
Dot chart (Areas of the World's Major Landmasses)
• The areas in thousands of square miles of the landmasses which exceed 10,000 square miles.
• data(islands)
• dotchart(islands, main="Island area")
• dotchart(sort(log(islands)), main="Island area (logarithmic scale)")
52
Scatter plot (Old Faithful Geyser Data)
• op <- par(mar=c(3,4,2,2)+.1)
• plot(sort(faithful$eruptions), xlab="")
• rug(faithful$eruptions, side=2)
• par(op)
53
Scatter plot (Edgar Anderson's Iris Data)
• This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
• data(iris)
• plot(iris[1:4], pch = 21, bg = c("red", "green", "blue")[ as.numeric(iris$Species) ])
54
Scatter plot (Lawyers' Ratings of State Judges in the US Superior Court)
• Lawyers' ratings of state judges in the US Superior Court.
• pairs(USJudgeRatings, gap=0)
55
Scatter plot (Longley's Economic Regression Data)
• pairs(longley, gap=0,
    diag.panel = function (x, ...) {
      par(new = TRUE)
      hist(x, col = "light blue", probability = TRUE, axes = FALSE, main = "")
      lines(density(x), col = "red", lwd = 3)
      rug(x)
    })
56
Clustering (Lawyers' Ratings of State Judges in the US Superior Court)
• heatmap(as.matrix(USJudgeRatings))
57
Kernel Density Estimation
• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=300)
• image.plot(z)
58
Contour
• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=300)
• contour(z, col = "red", main = "Density estimation: contour plot")
59
KDE+Contour
• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=300)
• image.plot(z)
• contour(z, col = "red", add=T)
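If the fields package is not installed, a similar picture can be sketched with base graphics only: image() stands in for image.plot(), at the cost of losing the color legend. This substitution, the smaller grid (n = 50), and the heat.colors palette are assumptions for illustration; MASS is still needed for kde2d().

```r
library(MASS)                     # provides kde2d()

x <- faithful$eruptions
y <- faithful$waiting

# Two-dimensional kernel density estimate on a 50 x 50 grid;
# the result is a list with components x, y (grid) and z (density)
z <- kde2d(x, y, n = 50)

# Base-graphics image of the density, with contours drawn on top
image(z, col = heat.colors(20), xlab = "eruptions", ylab = "waiting")
contour(z, col = "red", add = TRUE)
```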
60
3-D KDE
• data(faithful)
• x <- faithful$eruptions
• y <- faithful$waiting
• library(MASS)
• library(fields)
• z <- kde2d(x, y, n=100)
• op <- par(mar=c(0,0,2,0)+.1)
• persp(z, phi = 45, theta = 30, xlab = "eruptions", ylab = "waiting", zlab = "density", col = "yellow", shade = .5, border = NA, main = "Density estimation: perspective plot")
• par(op)
61
References
• http://zoonek2.free.fr/UNIX/48_R/all.html
• http://www.pitt.edu/~super7/17011-18001/17641.ppt
• http://isites.harvard.edu/fs/docs/icb.topic154887.files/Intro_to_R.ppt
62