ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE...
Transcript of ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE...
![Page 1: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/1.jpg)
ORIE 4741: Learning with Big Messy Data
Exploratory Data Analysis
Professor Udell
Operations Research and Information EngineeringCornell
October 1, 2020
1 / 21
![Page 2: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/2.jpg)
Announcements
I If you’re taking lecture async: remember to submitparticipation post after each class!
I Sections start Wednesday. They are optional, attend anyone you prefer.Section this week is a Julia tutorial.
I Office hours: Zoom links and times are posted on coursewebsite.
I Gradescope is open for submission of hw0, due Thursday9-9-19 9:30am.
I First quiz this week! It should occupy about 20 minutes;you’ll have up to half an hour to complete it. Start itanytime between 6:15pm Thursday and 9pm Friday.
(All times ET)
2 / 21
![Page 3: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/3.jpg)
Questions from campuswire
I search for your question before posting new question
I approximate date of voter registration is fine
3 / 21
![Page 4: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/4.jpg)
Why julia?
I the two language problem
I julia is fast (JIT-compiled)
I julia has pleasant syntax, esp for linear algebra(MATLAB-like, but more principled)
I julia supports efficient parallelism (including multithreading)
I the julia ecosystem
for this class: you can use any language you’d like (which yourTAs can read), but the course staff will only support Julia.
4 / 21
![Page 5: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/5.jpg)
Topics to review
We will cover (most of) these in section, too:
I Linear algebra: invertible matrices, rank, norm, basic matrixidentities. When is a matrix invertible?
I QR factorization
I Gradients (multivariate derivative)
I Projections
I SVD
I Maximum likelihood estimation
I Union bound
I Computational complexity
5 / 21
![Page 6: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/6.jpg)
Why look at the data?
I detect errors in data
I check assumptions
I select appropriate models
I understand relationships among the features
I understand relationships between features and labels
6 / 21
![Page 7: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/7.jpg)
How to look at the data?
I inspect raw data
I summary statistics
I visualize
7 / 21
![Page 8: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/8.jpg)
American community survey
2013 ACS:
I 3M respondents, 87 economic/demographic surveyquestionsI incomeI cost of utilities (water, gas, electric)I weeks worked per yearI hours worked per weekI home ownershipI looking for workI use foodstampsI education levelI state of residenceI . . .
I 1/3 of responses missing
find it at https://people.orie.cornell.edu/mru8/orie4741/data/acs_2013.csv
8 / 21
![Page 9: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/9.jpg)
How do computers work?
on a laptop:
I hard disk: usually ≤ 500 GB
I memory (RAM): usually ≤ 16 GB
I many programs (e.g., Excel): substantially more limited
don’t load a giant file into memory.your computer will crash.
how big is ACS data?3M respondents × 100 questions = 300M numbers ≈ 300MB
9 / 21
![Page 10: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/10.jpg)
How do computers work?
on a laptop:
I hard disk: usually ≤ 500 GB
I memory (RAM): usually ≤ 16 GB
I many programs (e.g., Excel): substantially more limited
don’t load a giant file into memory.your computer will crash.
how big is ACS data?3M respondents × 100 questions = 300M numbers ≈ 300MB
9 / 21
![Page 11: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/11.jpg)
How do computers work?
on a laptop:
I hard disk: usually ≤ 500 GB
I memory (RAM): usually ≤ 16 GB
I many programs (e.g., Excel): substantially more limited
don’t load a giant file into memory.your computer will crash.
how big is ACS data?3M respondents × 100 questions = 300M numbers ≈ 300MB
9 / 21
![Page 12: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/12.jpg)
Inspect raw data
solution for large files: technology from the 70s!
bash shell:
I “how big are these files?”: ls -lh
I “show me some lines from the file”: head, tail, less
I “how many lines are in the file?”: wc -l
10 / 21
![Page 13: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/13.jpg)
American Community SurveyVariable Description Type
HHTYPE household type categoricalSTATEICP state categoricalOWNERSHP own home BooleanCOMMUSE commercial use BooleanACREHOUS house on ≥ 10 acres BooleanHHINCOME household income realCOSTELEC monthly electricity bill realCOSTWATR monthly water bill realCOSTGAS monthly gas bill realFOODSTMP on food stamps BooleanHCOVANY have health insurance BooleanSCHOOL currently in school BooleanEDUC highest level of education ordinalGRADEATT highest grade level attained ordinalEMPSTAT employment status categoricalLABFORCE in labor force BooleanCLASSWKR class of worker BooleanWKSWORK2 weeks worked per year ordinalUHRSWORK usual hours worked per week realLOOKING looking for work BooleanMIGRATE1 migration status categorical 11 / 21
![Page 14: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/14.jpg)
Julia and Jupyter
I Julia is a programming language:it parses human-readable code to machine-readable code,executes it, returns the answer
I Jupyter is a protocol for interacting with a programminglanguage.
I Jupyter stores inputs and outputs as .ipynb files.I Jupyter notebooks display inputs and outputs in a browserI JuliaBox is an interface to a webserver running Julia
how to access?
I recommended: install anaconda distribution, then julia(go to section or see section materials for details)
I not recommended: use Google Colabhttps://discourse.julialang.org/t/
julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/
15319
12 / 21
![Page 15: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/15.jpg)
Julia and Jupyter
I Julia is a programming language:it parses human-readable code to machine-readable code,executes it, returns the answer
I Jupyter is a protocol for interacting with a programminglanguage.
I Jupyter stores inputs and outputs as .ipynb files.I Jupyter notebooks display inputs and outputs in a browserI JuliaBox is an interface to a webserver running Julia
how to access?
I recommended: install anaconda distribution, then julia(go to section or see section materials for details)
I not recommended: use Google Colabhttps://discourse.julialang.org/t/
julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/
1531912 / 21
![Page 16: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/16.jpg)
Summary statistics
univariate
I mean, median, mode
I max, min, range
I variance
I . . .
explore via Julia + Jupyter notebook
https:
//github.com/ORIE4741/demos/blob/master/eda.ipynb
multi- (but usualy just bi-)variate
I correlation, covariance
I . . .
13 / 21
![Page 17: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/17.jpg)
Summary statistics
univariate
I mean, median, mode
I max, min, range
I variance
I . . .
explore via Julia + Jupyter notebook
https:
//github.com/ORIE4741/demos/blob/master/eda.ipynb
multi- (but usualy just bi-)variate
I correlation, covariance
I . . .13 / 21
![Page 18: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/18.jpg)
The perils of summary statistics
same mean, variance, correlation, line of best fit. . .
14 / 21
![Page 19: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/19.jpg)
The perils of summary statistics
same mean, variance, correlation, line of best fit. . .14 / 21
![Page 20: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/20.jpg)
The perils of summary statistics
15 / 21
![Page 21: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/21.jpg)
The perils of summary statistics: modern update
https:
//www.autodeskresearch.com/publications/samestats
16 / 21
![Page 22: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/22.jpg)
What to visualize?
I examples across all features (usually not)
I plot features across all examples (much more common)
17 / 21
![Page 23: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/23.jpg)
Best practices
I Always label your axes.
I Ensure all marks on plot are meaningful.
I Beware of pie charts; bar charts are often easier to read.
I Beware of line plots; if your data is not continuous, tryscatter plot instead.
I Consider which curves to plot on same axes. Makecomparisons easy!
18 / 21
![Page 24: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/24.jpg)
Beware of bad data
19 / 21
![Page 25: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/25.jpg)
Take away
I always look at (some of) your data
I decide what question you want to answer
20 / 21
![Page 26: ORIE 4741: Learning with Big Messy Data [2ex] Exploratory Data Analysis · 2019-09-10 · ORIE 4741: Learning with Big Messy Data Exploratory Data Analysis Professor Udell Operations](https://reader033.fdocuments.us/reader033/viewer/2022050503/5f95413a7e2478792265b245/html5/thumbnails/26.jpg)
Questions?
https://docs.google.com/spreadsheets/d/
1vLbwi0WCOn0wU6cU_r0RHAnY7C0fDZ1F8Yq09pqYYuk
21 / 21