Subsetting Data

Subsetting Data Stat 480 Heike Hofmann

Page 1: Subsetting Data

Subsetting Data

Stat 480

Heike Hofmann

Page 2: Subsetting Data

Outline

• More on qplot

• Subsetting Data

• Indexing (review)

• Logical Operations

Page 3: Subsetting Data

Scatterplot

Load the fbi data into R, load the ggplot2 package

• Two continuous variables

• qplot(Burglary, Murder, data=fbi)

• qplot(log(Burglary), log(Murder), data=fbi)

• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi)

Page 4: Subsetting Data

Revision: Interpreting a scatterplot

• Big patterns

  • Form and direction

  • Strength

• Small patterns

  • Deviations from the pattern

  • Outliers

Page 5: Subsetting Data

Interpreting Scatterplots

• Form

  • Is the plot linear? Is the plot curved? Is there a distinct pattern in the plot? Are there multiple groups?

• Strength

  • Does the plot follow the form very closely? Or is there a lot of variation?

Page 6: Subsetting Data

Interpreting Scatterplots

• Direction

  • Is the pattern increasing? Is the pattern decreasing?

  • Positively: Above (below) average in one variable tends to be associated with above (below) average in the other variable.

  • Negatively: Above (below) average in one variable tends to be associated with below (above) average in the other variable.

Page 7: Subsetting Data

Form: Linear

Strength: Strong, very close to a straight line.

Direction: Two variables are positively associated.

No outliers.

Page 8: Subsetting Data

Form: Roughly linear, two distinct groups (more than 40% and less than 40%).

Strength: not strong. Data points are scattered.

Direction: Negatively Associated.

Outliers: None

Page 9: Subsetting Data

Aesthetics

• Can map other variables to size or colour

• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, colour=State)

• qplot(log(Burglary), log(Motor.Vehicle.Theft), data=fbi, size=Population)

• other aesthetics: shape

Page 10: Subsetting Data

Facetting

• Can facet to display plots for different subsets

• qplot(Year, Population, data=fbi, facets=~State)

Page 11: Subsetting Data

Facets vs aesthetics

• Will need to experiment as to which one answers your question/tells the story best

• Remember, just like with pivot tables, we want comparisons of interest to be close together

Page 12: Subsetting Data

Univariate plots

qplot(x, data=dataset)

• produces histogram or barchart

• Categorical variable → bar chart, geom="bar"

• Continuous variable → histogram, geom="histogram"

• For the histogram, you should always vary the binwidth

Page 13: Subsetting Data

Histograms and Barcharts

• What do we look for?

• Symmetry/Skewness

• Modes, Groups (big pattern: where is the bulk of the data?)

• Gaps & Outliers (deviation from the big pattern: where are the other points?)

Page 14: Subsetting Data

Aesthetics & facetting

• Use fill to color bars according to another variable, or use facetting to compare subsets

• Facetting (later) is generally more useful, as it is easier to compare different groups

Page 15: Subsetting Data

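Boxplots

definition by J.W. Tukey (1960s, EDA 1977)

qplot(x, y, geom="boxplot")

[Figure: example boxplot; axis ticks from -3 to 3 and from -0.4 to 0.4]

• Median

• Quartiles: 25% and 75%; IQR = inter-quartile range = upper quartile - lower quartile

• Hinges (ends of the whiskers): the most extreme data points that are ≤ upper quartile + 1.5 * IQR and ≥ lower quartile - 1.5 * IQR

• Outliers: data points between the hinges and upper quartile + 3 * IQR (or lower quartile - 3 * IQR)

• Extreme outliers: data points beyond upper quartile + 3 * IQR or lower quartile - 3 * IQR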
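The 1.5 * IQR and 3 * IQR rules can be checked by hand on a small vector. The sketch below is only an illustration: the vector x is made up, and it uses quantile() with its default method, which may differ slightly from the quartiles a plotting function computes internally.

```r
# Hypothetical data, chosen so that one mild and one extreme outlier appear
x <- c(1:9, 20, 60)

# Lower and upper quartile (default quantile method, an assumption)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Limits at 1.5 * IQR and 3 * IQR beyond the quartiles
inner <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)
outer <- c(q[1] - 3 * iqr, q[2] + 3 * iqr)

# Logical vectors classifying each data point
inside  <- x >= inner[1] & x <= inner[2]
extreme <- x < outer[1] | x > outer[2]
mild    <- !inside & !extreme

whisker_ends <- range(x[inside])  # most extreme points within the 1.5 * IQR limits
x[mild]                           # outliers
x[extreme]                        # extreme outliers
```

Each classification is itself a logical vector, so the same pattern carries over to subsetting a data set.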

Page 16: Subsetting Data

Boxplots

• Pros:

  • Symmetry vs Skewness

  • Outliers

  • Quick Summary

  • Comparisons across multiple Treatments (side by side boxplots)

• Cons:

  • Boxplots hide multiple modes and gaps in the data

Page 17: Subsetting Data

Your turn

• Explore the distribution of Murders

• What can you see? What might explain that pattern?

• Make sure to experiment with bin width in histograms!

• Use facetting to explore the relationship between Murders and States

Page 18: Subsetting Data

Subsets of Data

• Facetting & Zooming are visual ways of subsetting the data

• Next: use R to subset

Page 19: Subsetting Data

Logical vectors

• Very important!

• Usually created with a logical comparison

• <, >, ==, !=, <=, >=

• x %in% c(1, 4, 3, 7)

• subset
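As a small illustration of how comparisons produce logical vectors, here is a sketch on a made-up vector x (the values are hypothetical and not taken from the fbi data):

```r
# Hypothetical example vector
x <- c(2, 4, 7, 1)

big    <- x > 3                   # element-wise comparison
listed <- x %in% c(1, 4, 3, 7)    # TRUE where the element appears in the set

x[big]     # keeps only the elements where big is TRUE
sum(big)   # TRUE counts as 1, so this counts the matches
```

A logical vector used inside [ ] keeps exactly the positions that are TRUE, which is the basis for the subsetting that follows.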

Page 20: Subsetting Data

Logical expressions

• & and | are the logical and and or

• ! is the logical negation

• use parentheses () when linking expressions to avoid misinterpretation
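Why parentheses matter: in R, & is evaluated before |, so an expression without parentheses may not group the way it reads. A minimal sketch with made-up logical vectors:

```r
# Hypothetical logical vectors
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)
z <- c(FALSE, FALSE, FALSE, FALSE)

no_parens   <- a | b & z     # evaluated as a | (b & z)
with_parens <- (a | b) & z   # grouping made explicit

not_a <- !a                  # logical negation, element by element
```

Here no_parens equals a, while with_parens is FALSE everywhere, so the two expressions give different subsets.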

Page 21: Subsetting Data

Logical Operators

[Figure: Venn diagrams of two sets A and B]

A & B is the set of elements that is both in A and B

A | B is the set of elements that is in A or in B or in both
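The set view can be reproduced with membership tests. In this sketch x, A and B are made-up vectors; intersect() and union() are base R functions used only as a cross-check.

```r
# Hypothetical universe and two sets
x <- 1:10
A <- c(2, 4, 6, 8)
B <- c(4, 5, 6, 7)

in_A <- x %in% A   # TRUE where an element of x belongs to A
in_B <- x %in% B

both   <- x[in_A & in_B]   # elements in A and in B
either <- x[in_A | in_B]   # elements in A or in B or in both

setequal(both, intersect(A, B))
setequal(either, union(A, B))
```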

Page 22: Subsetting Data

Updating subsets

• You can take a subset and update the original data

• a <- 1:4

• a[2:3] <- 0

• a

• Very useful with logical subsetting
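A short sketch of both kinds of update, first by position and then by a logical condition. The vector b is made up for illustration.

```r
# Update by position
a <- 1:4
a[2:3] <- 0
a

# Update by logical condition (hypothetical vector)
b <- c(5, -2, 8, -1)
b[b < 0] <- 0   # replace every negative value with 0
b
```

Only the positions selected on the left-hand side change; all other elements keep their values.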

Page 23: Subsetting Data

Practice

• a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)

• Get a logical vector that is TRUE when the number:

  • is less than 20

  • has a squared value that is at least 100 or less than 10

  • equals 1 or 3

  • is even (look at a %% 2)
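One possible way to approach the practice questions is sketched below; other expressions give the same result, and the variable names are just illustrative choices.

```r
a <- c(1, 15, 3, 20, 5, 8, 9, 10, 1, 3)

lt20         <- a < 20                   # less than 20
sq           <- a^2 >= 100 | a^2 < 10    # squared value at least 100 or less than 10
one_or_three <- a %in% c(1, 3)           # equals 1 or 3
even         <- a %% 2 == 0              # remainder 0 after division by 2

a[even]   # the even values themselves
```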

Page 24: Subsetting Data

Subsets

• subset(dataset, logical expression)

• subset(fbi, Year == 2011)

• subset(fbi, (Year == 2011) & (State == "Kansas"))
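To try subset() without the fbi file, the sketch below builds a tiny stand-in data frame. The data frame toy and all its numbers are invented for illustration; only the column names State, Year and Murder mirror names used on the slides.

```r
# Hypothetical stand-in for the fbi data (values are made up)
toy <- data.frame(
  State  = c("Kansas", "Kansas", "Iowa", "Iowa"),
  Year   = c(2010, 2011, 2010, 2011),
  Murder = c(100, 110, 40, 45),
  stringsAsFactors = FALSE
)

y2011  <- subset(toy, Year == 2011)
ks2011 <- subset(toy, (Year == 2011) & (State == "Kansas"))

# The same rows via logical indexing with [ , ]
same <- toy[toy$Year == 2011 & toy$State == "Kansas", ]
```

Inside subset() the column names can be used directly, whereas bracket indexing needs the toy$ prefix.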

Page 25: Subsetting Data

Useful Commands

• nrow(dataset) # number of records

• quantile(variable, probs=0.001, na.rm=TRUE) # retrieves the 0.1th percentile of variable

• which(logical variable) # retrieves all indices for which the variable is TRUE

• which.max(variable), which.min(variable) # retrieve index of highest (lowest) value in variable
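These commands can be tried on a small made-up data frame. The data frame toy and its values are hypothetical; the median (probs = 0.5) is used instead of 0.001 only because the example has so few rows.

```r
# Hypothetical data with one missing value
toy <- data.frame(
  State  = c("A", "B", "C", "D"),
  Murder = c(3, NA, 10, 7),
  stringsAsFactors = FALSE
)

n   <- nrow(toy)                                         # number of records
med <- quantile(toy$Murder, probs = 0.5, na.rm = TRUE)   # NA is dropped first
idx <- which(toy$Murder > 5)                             # indices where the condition is TRUE
hi  <- which.max(toy$Murder)                             # index of the highest value
lo  <- which.min(toy$Murder)                             # index of the lowest value

toy$State[hi]   # look up the record that holds the maximum
```

Note that which() skips positions where the comparison is NA, so missing values do not show up in idx.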

Page 26: Subsetting Data

Your Turn

• Get a subset of all crimes in Iowa, i.e.: ... iowa <- ... Plot incidences/rates for one type of crime over time.

• Get a subset of all crimes in 2009, and plot one aspect of it.

• Get a subset of the data that includes the number of homicides in the last five years. Find the rate of homicides, extract all states whose rate is above the 90% quantile across the states, and plot.

FBI Data