Properties of Data
Digging into Data
University of Maryland
February 11, 2013
ggplot2 material adapted from Karthik Ram
Digging into Data (UMD) Properties of Data February 11, 2013 1 / 53
Roadmap
Getting and cleaning data
  - Unavoidable step
  - Example of how I do it
Goal
  - Not to teach you how
  - What end results you need to tell stories from data
  - Telling those stories with pictures
  - Same thing necessary for making predictions and clustering
  - Homework 1
ggplot2
CaBi
Outline
1 Data Terminology
2 Testbed: Capital Bikeshare
3 Visualizing and Summarizing Data in Rattle
4 ggplot2
5 ggplot2 with "real" data
6 Wrapup
(Confusing) Terminology
A dataset has different components
Input: what you always know
  - Sometimes called the independent variable
  - Sometimes called the regressor
  - Sometimes called a feature
Output: what you’re trying to learn
  - Sometimes called the dependent variable
  - Sometimes called the regressand
  - Sometimes called the response variable
  - Sometimes called the “label”
  - Does not exist for unsupervised learning
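As a rough illustration of the input/output split, the built-in iris data frame can be divided into input columns and an output column. Treating Species as the output is an assumption made for this sketch; the data itself does not dictate which column plays that role.

```r
# Assumption for this sketch: Species is the output (label),
# and the four measurement columns are the inputs (features).
features <- iris[, setdiff(names(iris), "Species")]
label    <- iris$Species

str(features)   # four numeric input columns
table(label)    # how many rows carry each label
```

In an unsupervised setting only `features` would be used, since there is no label to learn.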
Terminology
But not all data are usable
Most data also have an identifier
Could also be metadata
  - When data was collected
  - Who collected it
  - How much it cost
Often important to exclude such data from your algorithms
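One way to keep identifiers and metadata away from an algorithm is to drop those columns before modeling. The sketch below uses a made-up data frame; the column names `ride_id` and `collected_on` are hypothetical and only stand in for whatever identifier and metadata columns a real dataset has.

```r
# Hypothetical toy data: ride_id is an identifier, collected_on is metadata.
rides_toy <- data.frame(
  ride_id      = c(101, 102, 103),
  collected_on = as.Date(c("2013-01-05", "2013-01-05", "2013-01-06")),
  duration     = c(0.10, 0.25, 0.17),
  distance     = c(900, 2400, 1500)
)

# Columns we do not want an algorithm to see.
id_cols <- c("ride_id", "collected_on")

# Keep everything except the identifier and metadata columns.
model_data <- rides_toy[, setdiff(names(rides_toy), id_cols)]
names(model_data)
```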
Terminology
Discrete Data
Also called categorical
Bins that you group data into
There is no “in between”
You can ask for the most frequent value
Continuous Data
Also called numeric
Numeric values that represent data
There is an “in between”
You can take the average
It makes sense to ask questions like “what if this were 10% more X?”
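The difference shows up in which summaries make sense. A small sketch using R's built-in mtcars data, where (as an assumption for illustration) `cyl` is treated as discrete and `mpg` as continuous:

```r
# Discrete: count how many rows fall into each bin, then pick the fullest bin.
cyl_counts    <- table(mtcars$cyl)
most_frequent <- names(which.max(cyl_counts))   # first bin with the highest count

# Continuous: an average is meaningful, and so is "10% more".
avg_mpg      <- mean(mtcars$mpg)
mpg_plus_10  <- mtcars$mpg * 1.10

most_frequent
avg_mpg
```

Averaging the bin labels of a discrete column would run without error in R if they happen to be numbers, but the result would not describe any real category.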
Capital Bikeshare
Until this year, largest bikeshare system in US
Publicly share data
Important problems:
  - Where should new stations be?
  - Rebalancing
  - Pricing
  - Coordinating with other transit
Downloading CaBi Data
CSV File
http://www.capitalbikeshare.com/trip-history-data
What story do you want to tell?
What data are there?
What information do you want?
How to get from point A to point B?
  - More art than science
  - No right answers
Adding it to Google Docs
Import into Google Spreadsheet
Loads nicely into columns
It would be nice to have more
Real world locations
Elevation
CaBi has some of this information
Google (Maps) knows the rest . . .
http://www.capitalbikeshare.com/data/stations/bikeStations.xml
Creating a new sheet just for stations
Load columns from the xml file
We now have columns for lat, long for every station
Now we can attach a location to each row in the original sheet
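The slides do this lookup in a spreadsheet. For comparison, the same idea can be sketched in R with `merge()`. Everything in the toy tables below is an assumption for illustration: the column names are hypothetical and the coordinates are made-up values, not real station locations.

```r
# Hypothetical stand-in for the rides sheet.
rides_toy <- data.frame(
  startStation = c("Thomas Circle", "15th & P St NW", "Thomas Circle"),
  duration     = c(0.10, 0.25, 0.17)
)

# Hypothetical stand-in for the stations sheet (coordinates are invented).
stations_toy <- data.frame(
  name = c("Thomas Circle", "15th & P St NW"),
  lat  = c(38.90, 38.91),
  long = c(-77.03, -77.04)
)

# Attach lat/long to every ride by matching on the station name.
# all.x = TRUE keeps rides whose station is missing from the lookup table.
rides_loc <- merge(rides_toy, stations_toy,
                   by.x = "startStation", by.y = "name", all.x = TRUE)
rides_loc
```

Note that `merge()` may reorder rows, so the result should not be assumed to follow the original ride order.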
Now we’ve added neat new columns to the spreadsheet; time to download
Loading a dataset
rides <- read.csv("data/cabi-rides.ext.csv")
Creates a “data frame”
This is the basic unit of R data (Rattle creates these automatically for you)
Very easy to add columns
Use the $ to access columns
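A short sketch of both points, using a small made-up data frame rather than the CaBi file. The units in the comments are assumptions; the slides do not state them.

```r
# Toy data frame standing in for the rides data.
rides_toy <- data.frame(
  duration = c(0.10, 0.25, 0.17),   # assumed to be hours
  distance = c(900, 2400, 1500)     # assumed to be meters
)

# Use $ to read a column.
rides_toy$duration

# Use $ on the left-hand side to add a new column.
rides_toy$speed <- rides_toy$distance / rides_toy$duration

names(rides_toy)
```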
Summarizing Data
Getting Output Directly
“Explore” tab
Click: “summary”
    duration        startStation
 Min.   : 0.0000   Massachusetts Ave & Dupont Circle NW                : 116
 1st Qu.: 0.1000   15th & P St NW                                      :  97
 Median : 0.1667   Columbus Circle / Union Station                     :  94
 Mean   : 0.2418   Thomas Circle                                       :  79
 3rd Qu.: 0.2667   Eastern Market Metro / Pennsylvania Ave & 7th St SE :  74
 Max.   :13.5667   17th & Corcoran St NW                               :  70
 NA's   : 2.0000   (Other)                                             :3629
 endStation                                  distance           startHour
 Massachusetts Ave & Dupont Circle NW: 148   Min.   :    0.0   Min.   : 0.1333
 15th & P St NW                      : 103   1st Qu.:  921.5   1st Qu.:10.5500
 Thomas Circle                       :  94   Median : 1515.5   Median :15.1500
 17th & Corcoran St NW               :  86   Mean   : 1785.3   Mean   :14.6237
 Columbus Circle / Union Station     :  82   3rd Qu.: 2402.2   3rd Qu.:18.3500
 North Capitol St & F St NW          :  74   Max.   :13166.5   Max.   :23.9667
 (Other)                             :3572                     NA's   : 1.0000
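Rattle's summary button produces the same kind of output as R's `summary()` function. A minimal sketch on a made-up data frame (not the CaBi data) shows how numeric and categorical columns are summarized differently:

```r
# Toy data: one numeric column with a missing value, one categorical column.
toy <- data.frame(
  duration     = c(0.10, 0.25, NA, 0.17),
  startStation = factor(c("Thomas Circle", "15th & P St NW",
                          "Thomas Circle", "Thomas Circle"))
)

# Numeric columns get min / quartiles / mean / max (plus an NA count);
# factor columns get a count per level.
summary(toy)

# The per-column summaries can also be inspected one at a time.
duration_summary <- summary(toy$duration)
station_summary  <- summary(toy$startStation)
```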
Descriptive Statistics: Quartiles
Order your data
Find the middle data point - this is your median
  - If even number of data points, average points in the middle
Repeat on two halves on either side of median - these are your first and third quartiles
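The recipe above can be written out as a small function. This is a sketch under one stated assumption: when the number of points is odd, the median itself is left out of both halves (other conventions exist). The function name `quartiles_by_halves` is made up for this example. R's built-in `quantile()` interpolates by default, so its quartiles may differ slightly.

```r
# Quartiles via "median of each half", following the steps on the slide.
# Assumption: for an odd number of points the median is excluded from both halves.
quartiles_by_halves <- function(x) {
  x <- sort(x[!is.na(x)])        # order the data, dropping missing values
  n <- length(x)
  stopifnot(n >= 2)              # need at least one point in each half
  half  <- floor(n / 2)
  lower <- x[seq_len(half)]      # points below the middle
  upper <- x[(n - half + 1):n]   # points above the middle
  c(Q1 = median(lower), median = median(x), Q3 = median(upper))
}

q <- quartiles_by_halves(c(7, 1, 3, 9, 5, 11, 13, 15))
q
```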
Descriptive Statistics
min - smallest data point
max - largest data point
mean - sum of all data divided by number of data points
$$\mu = \sum_i x_i / N \qquad (1)$$
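Equation (1) translates directly into R. The values below are made up for illustration; the last two lines show how a single missing value affects the built-in `mean()`.

```r
x <- c(2, 4, 6, 8)

# Equation (1): sum of all data divided by the number of data points.
mu <- sum(x) / length(x)
mu == mean(x)                    # same result as the built-in

# A missing value makes the mean NA unless it is explicitly removed.
x_with_na  <- c(x, NA)
mean_naive <- mean(x_with_na)
mean_clean <- mean(x_with_na, na.rm = TRUE)
```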
What to look for . . .
Are the min / max reasonable?
Is there a lot of missing data (NA)?
Do the most frequent levels for categorical data make sense?
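These checks can also be run by hand in R. The sketch below uses a made-up data frame, so the specific values are illustrative only:

```r
# Toy data with one suspiciously large duration and one missing value.
toy <- data.frame(
  duration     = c(0.10, 0.25, NA, 13.57),
  startStation = factor(c("Thomas Circle", "15th & P St NW",
                          "Thomas Circle", "Thomas Circle"))
)

# Are the min / max reasonable?
duration_range <- range(toy$duration, na.rm = TRUE)

# Is there a lot of missing data? Count NAs per column.
na_counts <- colSums(is.na(toy))

# Do the most frequent levels make sense? Sort levels by count.
level_counts <- sort(table(toy$startStation), decreasing = TRUE)

duration_range
na_counts
level_counts
```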
Box Plots
Show median, mean, Q1, Q3, max and min
Show if distributions are skewed
Easier to see than reading off numbers
Introduced by Tukey
Under “Explore”, “Distributions”
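The numbers behind a box plot can be inspected directly. This is a minimal sketch using base R's `boxplot.stats()` on made-up values. That function reports whisker ends and hinges (Tukey's version of the quartiles) rather than the raw min and max, and it does not include the mean, which is computed separately here.

```r
# Made-up, right-skewed data: one value sits far above the rest.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 30)

b <- boxplot.stats(x)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points beyond the whiskers, drawn individually on a box plot

# A mean well above the median is one sign of right skew.
is_right_skewed <- mean(x) > median(x)
is_right_skewed
```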
Some housekeeping
Install some packages (make sure you also have recent copies of reshape2 and plyr)
install.packages("ggplot2", dependencies = TRUE)
Base graphics
Ugly, laborious, and verbose
There are better ways to describe statistical visualizations.
Why ggplot2?
Follows a grammar, just like any language.
It defines basic components that make up a sentence. In this case, the grammar defines components in a plot.
Grammar of graphics originally coined by Lee Wilkinson
Supports a continuum of expertise.
Get started right away, but with practice you can effortlessly build complex, publication-quality figures.
Common pitfall:
  - Never use qplot (short for quick plot).
  - You’ll end up unlearning and relearning a good bit.
Some terminology
ggplot - The main function where you specify the dataset and variables to plot
geoms - geometric objects
  - geom_point(), geom_bar(), geom_density(), geom_line(), geom_area()
aes - aesthetics
  - shape, transparency (alpha), color, fill, linetype.
scales - Define how your data will be plotted
  - continuous, discrete, log
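To see how the four terms fit together, here is a small sketch that uses one of each on the built-in iris data. The particular choices (a log x-axis, coloring by Species, an alpha of 0.6) are assumptions made only to have something to show.

```r
library(ggplot2)

p <- ggplot(data = iris,                                   # ggplot: dataset ...
            aes(x = Sepal.Length, y = Sepal.Width)) +      # ... and variables (aes)
  geom_point(aes(color = Species), alpha = 0.6) +          # geom, with extra aesthetics
  scale_x_log10()                                          # scale: log-transformed x-axis

print(p)   # draws the plot on the current graphics device
```

Swapping `scale_x_log10()` for `scale_x_continuous()` would give the default linear axis.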
The iris dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
Let’s try an example
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
[Figure: scatter plot of the iris data; x-axis Sepal.Length, y-axis Sepal.Width]
Basic structure
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()
myplot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
myplot + geom_point()
Specify the data and variables inside the ggplot function.
Anything else that goes in here becomes a global setting.
Then add layers of geometric objects, statistical models, and panels.
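A sketch of that layering on the iris data: one geometric object, one statistical model, and panels. The specific layers chosen here (a linear fit via `geom_smooth()` and one panel per Species via `facet_wrap()`) are illustrative assumptions, not the only options.

```r
library(ggplot2)

# Global settings: data and variables shared by every layer.
myplot <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

layered <- myplot +
  geom_point() +                                  # geometric object
  geom_smooth(method = "lm", formula = y ~ x) +   # statistical model (linear fit)
  facet_wrap(~ Species)                           # panels, one per species

print(layered)   # draws the plot on the current graphics device
```

Because `myplot` is stored separately, different layers can be tried on the same base without repeating the data and variable specification.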
Scatter Plots: Increase the size of points
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(size = 3)
[Figure: the same iris scatter plot with larger points; x-axis Sepal.Length, y-axis Sepal.Width]
Scatter Plots: Add some color
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 3)
[Figure: iris scatter plot with points colored by Species (setosa, versicolor, virginica); x-axis Sepal.Length, y-axis Sepal.Width]
Scatter Plots: Differentiate points by shape
ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(aes(shape = Species), size = 3)
[Figure: iris scatter plot with both color and point shape varying by Species (setosa, versicolor, virginica); x-axis Sepal.Length, y-axis Sepal.Width]
Boxplots
See ?geom_boxplot for list of options
library(MASS)
ggplot(birthwt, aes(factor(race), bwt)) +
  geom_boxplot()
[Figure: box plots of bwt for each level of factor(race) (1, 2, 3)]
Histograms
See ?geom_histogram for list of options
h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 30, colour = "black")
[Figure: histogram of waiting with binwidth 30; x-axis waiting, y-axis count]
h <- ggplot(faithful, aes(x = waiting))
h + geom_histogram(binwidth = 8, fill = "steelblue",
                   colour = "black")
[Figure: histogram of waiting with binwidth 8 and steelblue fill; x-axis waiting, y-axis count]
Line Plot
climate <- read.csv("climate.csv", header = TRUE)
ggplot(climate, aes(Year, Anomaly10y)) +
  geom_line()
[Figure: line plot of Anomaly10y against Year]
The data can also be read directly from the web:
climate <- read.csv(text =
  RCurl::getURL("https://raw.github.com/karthikram/ggplot-lecture/master/climate.csv"))
Line Plot: Confidence Regions
We can also plot confidence regions
ggplot(climate, aes(Year, Anomaly10y)) +
  geom_ribbon(aes(ymin = Anomaly10y - Unc10y,
                  ymax = Anomaly10y + Unc10y),
              fill = "blue", alpha = .1) +
  geom_line(color = "steelblue")
[Figure: line plot of Anomaly10y against Year with a shaded ribbon of plus or minus Unc10y around the line]
Bar Plots
ggplot(iris, aes(Species, Sepal.Length)) +
  geom_bar(stat = "identity")
[Figure: bar plot with one bar per Species (setosa, versicolor, virginica); y-axis Sepal.Length]
Density vs. Line Plots
ggplot(faithful, aes(waiting)) +
  geom_density(fill = "blue", alpha = 0.1)
[Figure: filled density plot of waiting; x-axis waiting, y-axis density]
ggplot(faithful, aes(waiting)) +
  geom_line(stat = "density")
[Figure: density of waiting drawn as a line; x-axis waiting, y-axis density]
Publication Quality Figures
Raster graphics (bmp, jpeg, png) don’t scale well
Preparing graphics for publication requires vector graphics (pdf, eps)
Much easier to provide publication-quality images with ggplot2
Saving Plots
If the plot is on your screen
ggsave("~/path/to/figure/filename.png")
If your plot is assigned to an object
ggsave(plot1, file = "~/path/to/figure/filename.png")
Specify a size
ggsave(file = "/path/to/figure/filename.png", width = 6, height = 4)
or any format (pdf, png, eps, svg, jpg)
ggsave(file = "/path/to/figure/filename.eps")
ggsave(file = "/path/to/figure/filename.jpg")
ggsave(file = "/path/to/figure/filename.pdf")
Digging into Data (UMD) Properties of Data February 11, 2013 41 / 53
![Page 52: Digging into Data - UMIACS › ~jbg › teaching › DATA_DIGGING › lecture_03.…Outline 1 Data Terminology 2 Testbed: Capital Bikeshare 3 Visualizing and Summarizing Data in Rattle](https://reader033.fdocuments.us/reader033/viewer/2022060321/5f0d30857e708231d4391ca2/html5/thumbnails/52.jpg)
ggplot2 with "real" data
ggplot2 maps
Get an outline of DC
all_states <- map_data("state")
states <- subset(all_states, region %in% c("district of columbia"))
Draw it
p <- ggplot(stations)
p <- p + geom_polygon(data = states, aes(x = long, y = lat))
ggplot2 maps
[Figure: outline of DC (x-axis: long, -77.10 to -76.95; y-axis: lat, 38.85 to 38.95)]
ggplot2 maps
p <- p + geom_point(data = stations, aes(x = long, y = lat, size = count), color = "gold2") +
  scale_size(name = "Bikes")
[Figure: DC outline with one point per station, sized by count (legend "Bikes", 10 to 70; x-axis: long; y-axis: lat)]
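The layering on these map slides can be tried without the bikeshare files. The sketch below uses made-up data: `outline` is a plain rectangle standing in for the DC polygon from `map_data("state")`, and `stations` is a three-row hypothetical table with the columns the slide code expects.

```r
library(ggplot2)

# Hypothetical stand-in for the DC outline: a simple rectangle in
# long/lat coordinates (the slides use map_data("state") instead).
outline <- data.frame(
  long = c(-77.10, -76.95, -76.95, -77.10),
  lat  = c(38.85, 38.85, 38.95, 38.95)
)

# Hypothetical station table; the values are invented for illustration.
stations <- data.frame(
  long  = c(-77.05, -77.02, -77.00),
  lat   = c(38.90, 38.92, 38.88),
  count = c(10, 40, 70)
)

# Layer 1: the outline polygon. Layer 2: one point per station,
# with point size mapped to the number of bikes.
p <- ggplot(stations) +
  geom_polygon(data = outline, aes(x = long, y = lat), fill = "grey30") +
  geom_point(aes(x = long, y = lat, size = count), color = "gold2") +
  scale_size(name = "Bikes")

# The data frame behind the point layer: one row per station.
points_drawn <- layer_data(p, 2)
points_drawn[, c("x", "y", "size")]
```

Each layer can carry its own `data` argument, which is what lets the polygon and the points come from different data frames.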
ggplot2 facets
p <- p + facet_grid(type ~ time)
[Figure: station maps faceted by type (rows: Leave, Return) and time (columns, in alphabetical order: AFTERNOON, EARLYMORN, EVENING, LATEMORN, LATENIGHT, NIGHT); legend "Bikes", 10 to 70; x-axis: long; y-axis: lat]
ggplot2 facets (re-sorted)
stations$time <- factor(stations$time,
                        levels = c("EARLYMORN", "LATEMORN", "AFTERNOON",
                                   "EVENING", "NIGHT", "LATENIGHT"))
[Figure: the same faceted station maps with time columns in chronological order: EARLYMORN, LATEMORN, AFTERNOON, EVENING, NIGHT, LATENIGHT; rows: Leave, Return; legend "Bikes", 10 to 70; x-axis: long; y-axis: lat]
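Facet panels follow the order of the factor levels, which is why resetting the levels reorders the columns. A small base R sketch of the idea, using a hypothetical vector `time` in place of `stations$time`:

```r
# Hypothetical time-of-day labels, as they might appear in stations$time.
time <- c("NIGHT", "EARLYMORN", "EVENING", "AFTERNOON", "LATENIGHT", "LATEMORN")

# Default: levels are sorted alphabetically, which matches the panel
# order seen before re-sorting.
default_levels <- levels(factor(time))
default_levels

# Explicit levels: chronological order, which the facets then follow.
day_order <- c("EARLYMORN", "LATEMORN", "AFTERNOON",
               "EVENING", "NIGHT", "LATENIGHT")
time_sorted <- factor(time, levels = day_order)
levels(time_sorted)
```

Only the level order changes; the values stored in the column stay the same.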
ggplot2 scatterplots
p <- ggplot(rides)
p <- p + geom_smooth(aes(x = startHour, y = distance))
p <- p + coord_cartesian(ylim = c(1000, 2500))
[Figure: smoothed distance by startHour (x-axis: startHour, 5 to 20; y-axis: distance, tick at 2000)]
ggplot2 histograms
p <- ggplot(rides)
p <- p + geom_histogram(aes(x = duration), binwidth = .1)
p <- p + scale_y_sqrt()
p <- p + facet_grid(subscription ~ .)
p <- p + scale_x_continuous(limits = c(0, 4))
[Figure: histograms of duration (x-axis: duration, 0 to 4; y-axis: count on a square-root scale, 0 to 1600), faceted by subscription (rows: Casual, Subscriber)]
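The last two slides restrict an axis in two different ways: `coord_cartesian()` in the smoothing example and `scale_x_continuous(limits = ...)` in the histogram. Here is a minimal sketch of the difference, using a simulated `rides` data frame (made-up durations, not Capital Bikeshare data). Scale limits discard out-of-range rows before binning, while `coord_cartesian()` only zooms the view.

```r
library(ggplot2)

set.seed(1)
# Simulated ride durations (hypothetical), plus two long outliers.
rides <- data.frame(duration = c(rexp(500, rate = 1), 6, 8))

base <- ggplot(rides) + geom_histogram(aes(x = duration), binwidth = 0.1)

# Scale limits: values outside [0, 4] become NA and are dropped
# (with a warning) before the histogram is computed.
p_limits <- base + scale_x_continuous(limits = c(0, 4))

# Coordinate limits: every row is binned, then the view is zoomed.
p_zoom <- base + coord_cartesian(xlim = c(0, 4))

# Total number of rides that ended up in a bin under each approach.
n_limits <- sum(suppressWarnings(layer_data(p_limits))$count)
n_zoom   <- sum(layer_data(p_zoom)$count)

c(binned_with_scale_limits = n_limits, binned_with_zoom = n_zoom)
```

For a smoother such as `geom_smooth()`, the same distinction means scale limits would change the fitted curve, whereas `coord_cartesian()` leaves the fit untouched.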
Wrapup
We’ve done a lot
You don’t have to be able to do everything we did today
You have to be able to do some of it
Play around with whichever way of manipulating data you feel most comfortable with
Further help
You’ve just scratched the surface with ggplot2.
Practice
Read the docs (either locally in R or at http://docs.ggplot2.org/current/)
Work together
First assignment
Find some data
Edit it so it is in a usable form
Find interesting relationships in your data
Use Rattle/ggplot2 to display those relationships (be creative and thorough!)