Subsetting Data
Stat 480
Heike Hofmann
Outline
• More on qplot
• Subsetting Data
• Indexing (review)
• Logical Operations
• Two continuous variables
• qplot(Burglary, Murder, data=fbi)
• qplot(log(Burglary), log(Murder), data=fbi)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi)
Scatterplot
Load the fbi data into R, load the ggplot2 package
• Big patterns
• Form and direction
• Strength
• Small patterns
• Deviations from the pattern
• Outliers
Revision: Interpreting a scatterplot
• Form
• Is the plot linear? Is the plot curved? Is there a distinct pattern in the plot? Are there multiple groups?
• Strength
• Does the plot follow the form very closely? Or is there a lot of variation?
Interpreting Scatterplots
• Direction
• Is the pattern increasing? Is the plot decreasing?
• Positively: Above (below) average in one variable tends to be associated with above (below) average in another variable.
• Negatively: Above (below) average in one variable tends to be associated with below (above) average in another variable.
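The "above average goes with above average" reading of direction can be checked by hand. The following is a minimal sketch in base R on made-up numbers (the vectors x, y_pos and y_neg are illustrative only and are not taken from the fbi data):

```r
# Made-up illustration data, not from the fbi data set
x     <- c(1, 2, 3, 4, 5, 6)
y_pos <- c(2, 1, 4, 3, 6, 5)   # tends to increase with x
y_neg <- rev(y_pos)            # tends to decrease with x

# Logical vectors: is each value above its own average?
above_x     <- x > mean(x)
above_y_pos <- y_pos > mean(y_pos)
above_y_neg <- y_neg > mean(y_neg)

# Cross-tabulate above/below average in both variables
table(above_x, above_y_pos)

# Share of points that sit on the same side of the average in both variables
agree_pos <- mean(above_x == above_y_pos)
agree_neg <- mean(above_x == above_y_neg)
agree_pos   # more than half: positive association
agree_neg   # less than half: negative association
```

This only looks at direction; it says nothing about form or strength, which still have to be read off the plot.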
Interpreting Scatterplots
Form: Linear
Strength: Strong, very close to a straight line.
Direction: Two variables are positively associated.
No outliers.
Form: Roughly linear, two distinct groups (more than 40% and less than 40%).
Strength: not strong. Data points are scattered.
Direction: Negatively Associated.
Outliers: None
• Can map other variables to size or colour
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, colour=State)
• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, size=Population)
• other aesthetics: shape
Aesthetics
• Can facet to display plots for different subsets
• qplot(Year, Population, data=fbi, facets=~State)
Facetting
• Will need to experiment as to which one answers your question/tells the story best
• Remember, just like with pivot tables we want comparisons of interest to be close together
Facets vs aesthetics
• qplot with a single variable produces a histogram or bar chart
• Categorical variable → bar chart, geom="bar"
• Continuous variable → histogram, geom="histogram"
• For the histogram, you should always vary the binwidth
Univariate plots
qplot(x, data=dataset)
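Why the bin width matters can be seen without drawing anything. The sketch below uses base R's hist() with plot = FALSE as a stand-in for the histogram qplot would draw, on made-up data with two groups; in qplot the corresponding knob is the binwidth argument.

```r
# Made-up data with two clear groups (around 2 and around 9)
x <- c(1, 1, 2, 2, 2, 3, 8, 8, 9, 9, 9, 10)

# Wide bins (width 6): both groups are lumped into two equal bars
wide <- hist(x, breaks = seq(0, 12, by = 6), plot = FALSE)$counts

# Narrow bins (width 2): the gap between the groups becomes visible
narrow <- hist(x, breaks = seq(0, 12, by = 2), plot = FALSE)$counts

wide     # 6 6
narrow   # 5 1 0 2 4 0
```

With the wide bins the distribution looks flat; with the narrow bins the empty bin between 4 and 6 shows the gap.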
• What do we look for?
• Symmetry/Skewness
• Modes, Groups (big pattern: where is the bulk of the data?)
• Gaps & Outliers (deviation from the big pattern: where are the other points?)
Histograms and Barcharts
• Use fill to color bars according to another variable, or use facetting to compare subsets
• Facetting (later) is generally more useful, as it is easier to compare different groups
Aesthetics & facetting
Boxplots
definition by J.W. Tukey (1960s, EDA 1977)
qplot(x, y, geom="boxplot")
Median
Quartiles: 25% and 75%
IQR = inter-quartile range = upper quartile - lower quartile
Hinges (whisker ends): the most extreme data points that are ≤ 75% quartile + 1.5 * IQR and ≥ 25% quartile - 1.5 * IQR
Outliers: data points between the hinges and quartile ± 3 * IQR
Extreme outliers: data points beyond quartile ± 3 * IQR
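The boxplot ingredients can be computed by hand. This is a minimal sketch in base R on a made-up vector; it uses quantile() with its default method, which can differ slightly from the quartiles a plotting function uses, so treat the numbers as an illustration of the rules rather than a reproduction of any particular plot.

```r
# Made-up data with one very large value
x <- c(1, 3, 4, 5, 6, 7, 8, 9, 30)

q      <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
lower  <- q[1]          # 25% quartile
med    <- q[2]          # median
upper  <- q[3]          # 75% quartile
iqr    <- upper - lower # inter-quartile range

# Limits at 1.5 * IQR beyond the quartiles
lo_15 <- lower - 1.5 * iqr
hi_15 <- upper + 1.5 * iqr

# Whisker ends: most extreme data points still inside the 1.5 * IQR limits
whisker_ends <- range(x[x >= lo_15 & x <= hi_15])

# Points outside the whiskers, and those beyond 3 * IQR
outside <- x[x < lo_15 | x > hi_15]
extreme <- x[x < lower - 3 * iqr | x > upper + 3 * iqr]

c(lower = lower, median = med, upper = upper, IQR = iqr)
whisker_ends   # 1 9
outside        # 30
extreme        # 30
```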
Boxplots
• Pros:
• Symmetry vs Skewness
• Outliers
• Quick Summary
• Comparisons across multiple Treatments (side by side boxplots)
• Cons:
• Boxplots hide multiple modes and gaps in the data
• Explore the distribution of Murders
• What can you see? What might explain that pattern?
• Make sure to experiment with bin width in histograms!
• Use facetting to explore the relationship between Murders and States
Your turn
Subsets of Data
• Facetting & Zooming are visual ways of subsetting the data
• Next: use R to subset
Logical vectors
• Very important!
• Usually created with a logical comparison
• <, >, ==, !=, <=, >=
• x %in% c(1, 4, 3, 7)
• subset
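A short sketch of how comparisons create logical vectors, using a made-up vector x (not from the fbi data):

```r
# Made-up example vector
x <- c(2, 7, 4, 9, 1)

# Each comparison returns one TRUE/FALSE per element
gt3    <- x > 3                   # FALSE TRUE TRUE TRUE FALSE
is4    <- x == 4                  # FALSE FALSE TRUE FALSE FALSE
in_set <- x %in% c(1, 4, 3, 7)    # FALSE TRUE TRUE FALSE TRUE

# A logical vector can be used as an index ...
x[gt3]      # 7 4 9

# ... and TRUE counts as 1, so sum() counts the matches
sum(gt3)    # 3
```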
Logical expressions
• & and | are the logical "and" and "or"
• ! is the logical negation
• use parentheses () when linking expressions to avoid mis-interpretation
Logical Operators
A & B is the set of elements that are in both A and B
A | B is the set of elements that are in A or in B or in both
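The operators can be tried on small logical vectors. In this sketch A, B and C are made-up conditions on a made-up vector x; the last two lines show why parentheses matter, since & is evaluated before | in R.

```r
# Made-up example vector and conditions
x <- c(2, 7, 4, 9, 1)
A <- x > 3          # FALSE TRUE  TRUE  TRUE  FALSE
B <- x %% 2 == 0    # TRUE  FALSE TRUE  FALSE FALSE
C <- x == 1         # FALSE FALSE FALSE FALSE TRUE

a_and_b <- A & B    # TRUE only where both are TRUE
a_or_b  <- A | B    # TRUE where at least one is TRUE
not_a   <- !A       # flips TRUE and FALSE

# Without parentheses, B & C is evaluated first
no_parens   <- A | B & C      # same as A | (B & C)
with_parens <- (A | B) & C    # a different question, a different answer

no_parens
with_parens
```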
Updating subsets
• You can take a subset and update the original data
• a <- 1:4
• a[2:3] <- 0
• a
• Very useful with logical subsetting
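A sketch of updating through a subset, first by position as on the slide and then with a logical condition. The vector b and the rule "replace negative values by 0" are made up for illustration.

```r
# Update by position (the example from the slide)
a <- 1:4
a[2:3] <- 0
a          # 1 0 0 4

# Update by logical condition: made-up vector, negative values set to 0
b <- c(5, -2, 8, -7, 3)
b[b < 0] <- 0
b          # 5 0 8 0 3
```

Only the elements selected on the left-hand side change; everything else in the vector stays as it was.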
Practice
• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)
• Get logical vector that is TRUE when number is:
• less than 20
• squared value is at least 100 or less than 10
• equals 1 or 3
• even (look at a %% 2)
Subsets
• subset(dataset, logical expression)
• subset(fbi, Year == 2011)
• subset(fbi, (Year == 2011) & (State == "Kansas"))
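Since the fbi data is not included here, the sketch below runs the same kind of call on a tiny made-up data frame called toy. Its column names mirror those used in the slides, but the values are invented.

```r
# Hypothetical stand-in for the fbi data; all values are made up
toy <- data.frame(
  State    = c("Kansas", "Kansas", "Iowa", "Iowa"),
  Year     = c(2010, 2011, 2010, 2011),
  Burglary = c(100, 120, 80, 90),
  stringsAsFactors = FALSE
)

# All rows from 2011
y2011 <- subset(toy, Year == 2011)

# Rows that satisfy both conditions
ks2011 <- subset(toy, (Year == 2011) & (State == "Kansas"))

# The same selection with bracket indexing and a logical vector
ks2011_b <- toy[toy$Year == 2011 & toy$State == "Kansas", ]

y2011
ks2011
```

Inside subset() the column names can be used directly; with bracket indexing they need the toy$ prefix.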
Useful Commands
• nrow(dataset) # number of records
• quantile(variable, probs=0.001, na.rm=TRUE) # retrieves the 0.1th percentile of variable
• which(logical variable) # retrieves all indices for which the variable is TRUE
• which.max(variable), which.min(variable) # retrieve index of highest (lowest) value in variable
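A quick sketch of these commands on made-up values, including one missing value to show what na.rm = TRUE is for. The data frame d and vector v are illustrative only.

```r
# Made-up values with one missing entry
v <- c(12, 5, NA, 30, 18)
d <- data.frame(v = v)

n_rec <- nrow(d)                                       # number of records: 5

# Median (50th percentile); the NA is dropped via na.rm = TRUE
med <- unname(quantile(v, probs = 0.5, na.rm = TRUE))  # 15

idx    <- which(v > 10)    # indices where the condition is TRUE: 1 4 5
i_max  <- which.max(v)     # index of the largest value: 4
i_min  <- which.min(v)     # index of the smallest value: 2

v[i_max]                   # the largest value itself: 30
```

Note that which() silently skips the position where the comparison is NA.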
Your Turn
• Get a subset of all crimes in Iowa, i.e.: ... iowa <- ... Plot incidences/rates for one type of crime over time.
• Get a subset of all crimes in 2009, and plot one aspect of it.
• Get a subset of the data that includes the number of Homicides in the last five years. Find the rate of homicides, extract all states whose rate is above the 90th percentile across the States, and plot.
FBI Data