Nuts and Bolts
Matthew Gentzkow
Jesse M. Shapiro
Chicago Booth and NBER
Introduction
We have focused on the statistical / econometric issues that arise with big data
In the time that remains, we want to spend a little time on the practical issues...
E.g., where do you actually put a 2 TB dataset?
Goal: Sketch some basic computing ideas relevant to working with large datasets.
Caveat: We are all amateurs
The Good News
Much of what we've talked about here you can do on your laptop
Your OS knows how to do parallel computing (multiple processors, multiple cores)
Many "big" datasets are < 5 GB
Save the data to local disk, fire up Stata or R, and off you go...
How Big is Big?
Congressional record text (1870-2010) ≈50 GB
Congressional record pdfs (1870-2010) ≈500 GB
Nielsen scanner data (34k stores, 2004-2010) ≈5 TB
Wikipedia (2013) ≈6 TB
20% Medicare claims data (1997-2009) ≈10 TB
Facebook (2013) ≈100,000 TB
All data in the world ≈2.7 billion TB
Outline
Software engineering for economists
Databases
Cluster computing
Scenarios
Software Engineering for Economists
Motivation
A lot of the time spent in empirical research is writing, reading, and debugging code.
Common situations...
Broken Code
Incoherent Data
Rampant Duplication
Replication Impossible
Tons of Versions
This Talk
We are not software engineers or computer scientists.
But we have learned that most common problems in social sciences have analogues in these fields and there are standard solutions.
Goal is to highlight a few of these that we think are especially valuable to researchers.
Focus on incremental changes: one step away from common practice.
Automation
Raw Data
Data from original source...
Manual Approach
Open spreadsheet
Output to text files
Open Stata
Load data, merge files
Compute log(chip sales)
Run regression
Copy results to MS Word and save
Two main problems with this approach
Replication: how can we be sure we'll find our way back to the exact same numbers?
Efficiency: what happens if we change our mind about the right specification?
Semi-automated Approach
Problems
Which file does what?
In what order?
Fully Automated Approach
File: rundirectory.bat
stattransfer export_to_csv.stc
statase -b mergefiles.do
statase -b cleandata.do
statase -b regressions.do
statase -b figures.do
pdflatex tv_potato.tex
All steps controlled by a shell script
Order of steps unambiguous
Easy to call commands from different packages
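The same idea can be sketched in Python for anyone who prefers a cross-platform driver to a .bat file. This is an illustration only, not the authors' script: `run_pipeline` and its `runner` parameter are hypothetical names, and the step list simply mirrors the commands in rundirectory.bat above, assuming the same executables are on the PATH.

```python
import subprocess

# Steps copied from rundirectory.bat; the executables (stattransfer, statase,
# pdflatex) are assumed to be installed and on the PATH.
STEPS = [
    ["stattransfer", "export_to_csv.stc"],
    ["statase", "-b", "mergefiles.do"],
    ["statase", "-b", "cleandata.do"],
    ["statase", "-b", "regressions.do"],
    ["statase", "-b", "figures.do"],
    ["pdflatex", "tv_potato.tex"],
]


def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order and stop at the first reported failure.

    `runner` defaults to subprocess.run; it is a parameter only so the
    function can be exercised without the real programs installed.
    """
    completed = []
    for command in steps:
        # check=True makes subprocess.run raise CalledProcessError when a
        # command exits with a nonzero code, so later steps do not run.
        runner(command, check=True)
        completed.append(command)
    return completed
```

Calling `run_pipeline(STEPS)` rebuilds the whole directory in one unambiguous order. Whether a given program actually signals failure through its exit code is something to verify for each tool.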
Make
Framework to go from source to target
Tracks dependencies and revisions
Avoids rebuilding components that are up to date
Used to build executable files
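The rule behind "avoids rebuilding components that are up to date" can be sketched in a few lines. This is a simplified illustration in Python, not Make itself: `needs_rebuild` is a hypothetical helper that applies the timestamp comparison Make uses for a single target, and it ignores chains of dependencies.

```python
import os


def needs_rebuild(target, sources):
    """Return True if `target` is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    # A target is stale as soon as one input was modified after it was built.
    return any(os.path.getmtime(src) > target_time for src in sources)
```

A driver script could call this before each step, for example skipping cleandata.do when its output is newer than both the script and the merged input file.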
Version Control
After Some Editing
Dates demarcate versions, initials demarcate authors
Why do this?
Facilitates comparison
Facilitates "undo"
What's Wrong with the Approach?
Why not do this?
It's a pain: always have to remember to "tag" every new file
It's confusing:
Which log file came from regressions_022713_mg.do?
Which version of cleandata.do makes the data used by regressions_022413.do?
It fails the market test: No software firm does it this way
Version Control
Software that sits "on top" of your filesystem
Keeps track of multiple versions of the same file
Records date, authorship
Manages conflicts
Benefits
Single authoritative version of the directory
Edit without fear: an undo command for everything
Life After Version Control
Aside: If you always run rundirectory.bat before you commit, you guarantee replicability.
Directories
One Directory Does Everything
Pros: Self-contained, simple
Cons:
Have to rerun everything for every change
Hard to figure out dependencies
Functional Directories
Dependencies Obvious
One Resource, Many Projects
Keys
Research Assistant Output
county state cnty_pop state_pop region
36037 NY 3817735 43320903 1
36038 NY 422999 43320903 1
36039 NY 324920 . 1
36040 . 143432 43320903 1
. NY . 43320903 1
37001 VA 3228290 7173000 3
37002 VA 449499 7173000 3
37003 VA 383888 7173000 4
37004 VA 483829 7173000 3
Causes for Concern
Missing values: county 36039 has no state_pop, county 36040 has no state, and one row has no county code or cnty_pop
Inconsistent values: county 37003 is coded region 4 while the other VA counties are coded region 3
Redundancy: state_pop and region are repeated on every county row
Relational Databases
county state population
36037 NY 3817735
36038 NY 422999
36039 NY 324920
36040 NY 143432
37001 VA 3228290
37002 VA 449499
37003 VA 383888
37004 VA 483829
state population region
NY 43320903 1
VA 7173000 3
Each variable is an attribute of an element of the table
Each table has a key
Tables are connected by foreign keys (state field in the county table)
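To make keys and foreign keys concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the two tables above and the rows are a subset of them; the choice of SQLite and the `try_insert` helper are assumptions made for illustration, not something the slides prescribe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE state (
        state      TEXT PRIMARY KEY,
        population INTEGER NOT NULL,
        region     INTEGER NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE county (
        county     INTEGER PRIMARY KEY,
        state      TEXT NOT NULL REFERENCES state(state),
        population INTEGER NOT NULL
    )
""")

conn.executemany("INSERT INTO state VALUES (?, ?, ?)",
                 [("NY", 43320903, 1), ("VA", 7173000, 3)])
conn.executemany("INSERT INTO county VALUES (?, ?, ?)",
                 [(36037, "NY", 3817735), (36038, "NY", 422999),
                  (37001, "VA", 3228290)])
conn.commit()


def try_insert(row):
    """Try to add a county row; return the error text if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO county VALUES (?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as err:
        return str(err)


duplicate_key = try_insert((36037, "NY", 1))        # county key already used
unknown_state = try_insert((99999, "ZZ", 1))        # no such row in state
missing_state = try_insert((36040, None, 143432))   # state left blank
```

Each of the three attempts is rejected by the database, so problems like those in the research assistant's table surface at load time rather than in the regression output.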
Steps
Store data in normalized format as above
Can use flat files, doesn't have to be fancy relational database software
Construct a second set of files with key transformations
e.g., log population
Merge data together and run analysis
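The three steps can be sketched without any database software. The example below is a minimal illustration in plain Python: the rows are a subset of the tables above, held in memory where a real project would read them from flat files, and `index_by_key` is a hypothetical helper.

```python
import math

# Step 1: normalized data, one list of rows per table.
counties = [
    {"county": 36037, "state": "NY", "population": 3817735},
    {"county": 36038, "state": "NY", "population": 422999},
    {"county": 37001, "state": "VA", "population": 3228290},
]
states = [
    {"state": "NY", "population": 43320903, "region": 1},
    {"state": "VA", "population": 7173000, "region": 3},
]


def index_by_key(rows, key):
    """Build a lookup from key value to row, insisting that the key is unique."""
    lookup = {}
    for row in rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate key {row[key]!r}")
        lookup[row[key]] = row
    return lookup


# Step 2: a second table of transformations, keyed like the original.
county_logs = [
    {"county": c["county"], "log_population": math.log(c["population"])}
    for c in counties
]

# Step 3: merge on the keys just before the analysis.
state_by_key = index_by_key(states, "state")
logs_by_key = index_by_key(county_logs, "county")
analysis = [
    {
        "county": c["county"],
        "state": c["state"],
        "log_population": logs_by_key[c["county"]]["log_population"],
        "region": state_by_key[c["state"]]["region"],
    }
    for c in counties
]
```

Because region is stored once per state, a county can never end up with a region that disagrees with its neighbors in the merged data.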
To Come
What to do with enormous databases?
Abstraction
Rampant Duplication
Abstracted
Three Leave-Out Means
Copy and Paste Errors
Abstracted
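The code shown on these slides is not reproduced in this transcript, so the following is only a sketch of the idea: write the leave-out mean once as a function, then call it for each grouping variable instead of pasting the same block three times. The function name and its arguments are hypothetical.

```python
from collections import defaultdict


def leave_out_mean(values, groups):
    """For each observation, average `values` over the other members of its group.

    Returns None for an observation that is alone in its group, since the
    leave-out mean is undefined there.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        totals[group] += value
        counts[group] += 1

    result = []
    for value, group in zip(values, groups):
        others = counts[group] - 1
        # Subtract the observation itself, then divide by the remaining count.
        result.append((totals[group] - value) / others if others > 0 else None)
    return result
```

With this in place, three leave-out means become three one-line calls that differ only in the grouping variable passed in, which removes the opportunity for a copy-and-paste error.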
Documentation
Too Much Documentation
Unclear Code
Self-Documenting Code
Management
A Friendly Chat
Task Management
Source: Asana.
Parting Thoughts
Code and Data
Data are getting larger
Research is getting more collaborative
Need to manage code and data responsibly for collaboration and replicability
Learn from the pros, not from us
Databases
What is a Database?
Database Theory
Principles for how to store / organize / retrieve data efficiently (normalization, indexing, optimization, etc.)
Database Software
Manages storage / organization / retrieval of data (SQL, Oracle, Access, etc.)
Economists rarely use this software because we typically store data in flat files & interact with them using statistical programs
When we receive extracts from large datasets (the census, Medicare claims, etc.) someone else often interacts with the database on the back end
Normalization
"Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. Normalization usually involves dividing large tables into smaller (and less redundant) tables and defining relationships between them."
Benefits of Normalization
Efficient storage
Efficient modification
Guarantees coherence
Makes logical structure of data clear
Indexing
Medicare claims data for 1997-2010 are roughly 10 TB
These data are stored at NBER in thousands of zipped SAS files
To extract, say, all claims for heart disease patients aged 55-65, you would need to read every line of every one of those files
THIS IS SLOW!!!
The obvious solution, long understood for books, libraries, economics journals, and so forth, is to build an index
Database software handles this automatically
Allows you to specify fields that will often be used for lookups, subsetting, etc. to be indexed
For the Medicare data, we could index age, gender, type of treatment, etc. to allow much faster extraction
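A small sketch of what indexing changes, again using Python's built-in sqlite3 module. Everything here is assumed for illustration: the `claims` table, its columns, and the rows are made up and are not Medicare data, and SQLite stands in for whatever database software would actually hold the claims.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE claims (
        claim_id  INTEGER PRIMARY KEY,
        age       INTEGER,
        diagnosis TEXT,
        amount    REAL
    )
""")
# Made-up rows purely for illustration.
conn.executemany(
    "INSERT INTO claims (age, diagnosis, amount) VALUES (?, ?, ?)",
    [(50 + i % 40, "heart disease" if i % 3 == 0 else "other", 100.0 + i)
     for i in range(1000)],
)

QUERY = """
    SELECT amount FROM claims
    WHERE diagnosis = 'heart disease' AND age BETWEEN 55 AND 65
"""


def query_plan(sql):
    """Return SQLite's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the last column of each row.
    return " | ".join(row[-1] for row in rows)


plan_without_index = query_plan(QUERY)   # every row of the table is scanned
conn.execute("CREATE INDEX idx_claims_dx_age ON claims (diagnosis, age)")
plan_with_index = query_plan(QUERY)      # the new index is used for the lookup
matching_rows = conn.execute(QUERY).fetchall()
```

Printing the two plans shows the switch from a scan of the whole table to a search through `idx_claims_dx_age`. The query text itself does not change, which is the sense in which the database software handles indexing automatically.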
Benefits
Fast lookups
Easy to police data constraints
Costs
Storage
Time
Database optimization is the art of tuning database structure and indexing for a specific set of needs
Data Warehouses
Traditional databases are optimized for operational environments
Bank transactions
Airline reservations
etc.
Characteristics
Many small reads and writes
Many users accessing simultaneously
Premium on low latency
Only care about current state
In analytic / research environments, however, the requirements are different
Frequent large reads, infrequent writes
Relatively little simultaneous access
Value throughput relative to latency
May care about history as well as current state
Need to create and re-use many custom extracts
Database systems tuned to these requirements are commonly called "data warehouses"
Distributed Computing
Definition: Computation shared among many independent processors
Terminology
Distributed vs. Parallel (latter usually refers to systems with shared memory)
Cluster vs. Grid (latter usually more decentralized & heterogeneous)
On Your Local Machine
Your OS can run multiple processors each with multiple cores
Your video card has hundreds of cores
Stata, R, Matlab, etc. can all exploit these resources to do parallel computing
Stata
Buy appropriate "MP" version of Stata
Software does the rest
R / Matlab
Install appropriate add-ins (parallel package in R, "Parallel Computing Toolbox" in Matlab)
Include parallel commands in code (e.g., parfor in place of for in Matlab)
On Cluster / Grid
Resources abound
University / department computing clusters
Non-commercial scientific computing grids (e.g., XSEDE)
Commercial grids (e.g., Amazon EC2)
Most of these run Linux w/ distribution handled by a "batch scheduler"
Write code using your favorite application, then send it to scheduler with a bash script
![Page 91: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/91.jpg)
MapReduce
MapReduce is a programming model that facilitates distributed computing
Developed by Google around 2004, though ideas predate that
Most algorithms for distributed data processing can be represented in two steps
Map: Process individual "chunk" of data to generate an intermediate "summary"
Reduce: Combine "summaries" from different chunks to produce a single output file
If you structure your code this way, MapReduce software will handle all the details of distribution:
Partitioning data
Scheduling execution across nodes
Managing communication between machines
Handling errors / machine failures
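The two-step structure can be sketched in a few lines of Python. This is a single-machine toy, not Google's or Hadoop's implementation: the names `map_chunk`, `combine`, and `map_reduce` are made up for illustration, and the "summary" chosen here is a (sum, count) pair so that the chunks' summaries can be combined into an overall mean.

```python
from functools import reduce


def map_chunk(chunk):
    """Map: turn one chunk of numbers into a small summary (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce: merge two summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


def map_reduce(chunks, mapper, reducer, initial):
    # Map step: each chunk is processed on its own, so a real framework
    # could run these calls on different machines.
    summaries = [mapper(chunk) for chunk in chunks]
    # Reduce step: fold the summaries into a single result.
    return reduce(reducer, summaries, initial)


chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
total, count = map_reduce(chunks, map_chunk, combine, (0.0, 0))
mean = total / count
```

The design point is that the summary is small and mergeable: averaging the per-chunk means directly would give the wrong answer when chunks differ in size, while (sum, count) pairs combine exactly.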
![Page 94: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/94.jpg)
MapReduce: Examples
Count words in a large collection of documents
Map: Document $i$ → Set of (word, count) pairs $C_i$
Reduce: Collapse $\{C_i\}$, summing count within word
Extract medical claims for 65-year-old males
Map: Record set $i$ → Subset $H_i$ of $i$ that are 65-year-old males
Reduce: Append elements of $\{H_i\}$
Compute marginal regression for text analysis (e.g., Gentzkow & Shapiro 2010)
Map: Counts $x_{ij}$ of phrase $j$ → Parameters $(\hat{\alpha}_j, \hat{\beta}_j)$ from $E(x_{ij} \mid y_i) = \alpha_j + \beta_j y_i$
Reduce: Append $\{\hat{\alpha}_j, \hat{\beta}_j\}$
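The three examples on this slide can be sketched as plain Python functions. This is a single-machine illustration under stated assumptions: the record fields `age` and `sex`, the phrase names, and all data values are made up, and the regression is read as an ordinary least squares fit of the phrase counts x_ij on a document-level variable y_i.

```python
from collections import Counter


def map_word_count(document):
    """Map: document i -> Counter of (word, count) pairs C_i."""
    return Counter(document.lower().split())


def reduce_word_count(summaries):
    """Reduce: collapse {C_i}, summing count within word."""
    total = Counter()
    for counts in summaries:
        total.update(counts)
    return total


def map_extract(records):
    """Map: record set i -> subset H_i of 65-year-old males."""
    return [r for r in records if r["age"] == 65 and r["sex"] == "M"]


def reduce_append(subsets):
    """Reduce: append the elements of {H_i} into one list."""
    return [r for subset in subsets for r in subset]


def map_marginal_regression(x_j, y):
    """Map: counts x_ij of phrase j -> OLS (alpha_hat_j, beta_hat_j)."""
    n = len(y)
    x_bar, y_bar = sum(x_j) / n, sum(y) / n
    s_xy = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x_j, y))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    beta = s_xy / s_yy
    return (x_bar - beta * y_bar, beta)


docs = ["big data big ideas", "data about data"]
word_counts = reduce_word_count(map_word_count(d) for d in docs)

record_sets = [
    [{"id": 1, "age": 65, "sex": "M"}, {"id": 2, "age": 70, "sex": "F"}],
    [{"id": 3, "age": 65, "sex": "M"}],
]
claims = reduce_append(map_extract(rs) for rs in record_sets)

y = [0.0, 1.0, 2.0, 3.0]  # made-up document-level variable y_i
phrase_counts = {"phrase_a": [1.0, 3.0, 5.0, 7.0],
                 "phrase_b": [4.0, 3.0, 2.0, 1.0]}
# Reduce for the regression is again an append: one (alpha, beta) per phrase.
estimates = {j: map_marginal_regression(x_j, y)
             for j, x_j in phrase_counts.items()}
```

In the first two examples the data are split by document or record set; in the third they are split by phrase, so each map call sees the full y vector but only one column of counts.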
![Page 97: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/97.jpg)
MapReduce: Model
[Figure 1: Execution overview. The user program (1) forks a master and workers; the master (2) assigns map and reduce tasks; map-phase workers (3) read the input file splits 0-4 and (4) write intermediate files to local disks; reduce-phase workers (5) read those files remotely and (6) write output files 0 and 1.]
Inverted Index: The map function parses each document, and emits a sequence of 〈word, document ID〉 pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a 〈word, list(document ID)〉 pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
Distributed Sort: The map function extracts the key from each record, and emits a 〈key, record〉 pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.
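The inverted index description above can be sketched as follows. This is a single-process toy, not the implementation from the paper: the `shuffle` helper is a hypothetical stand-in for the grouping the framework performs between the two phases, and removing duplicate document IDs is an added assumption.

```python
from collections import defaultdict


def map_inverted_index(doc_id, text):
    """Map: parse one document and emit (word, document ID) pairs."""
    return [(word, doc_id) for word in text.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_inverted_index(word, doc_ids):
    """Reduce: sort the document IDs for one word and emit (word, list)."""
    # Duplicates are dropped here (an assumption, not stated in the excerpt).
    return (word, sorted(set(doc_ids)))


documents = {2: "reduce then map", 1: "map then shuffle"}
emitted = [pair for doc_id, text in documents.items()
           for pair in map_inverted_index(doc_id, text)]
index = dict(reduce_inverted_index(w, ids)
             for w, ids in shuffle(emitted).items())
```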
3 Implementation
Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.
This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:
(1) Machines are typically dual-processor x86 processors running Linux, with 2-4 GB of memory per machine.
(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.
(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.
(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
3.1 Execution Overview
The Map invocations are distributed across multiple machines by automatically partitioning the input data
To appear in OSDI 2004
![Page 98: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/98.jpg)
MapReduce: Implementation
MapReduce is the original software developed by Google
Hadoop is the open-source version most people use (developed by Apache)
Amazon has a hosted implementation (Amazon EMR)
How does it work?
Write your code as two functions called map and reduce
Send code & data to scheduler using bash script
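As a rough sketch of what "two functions called map and reduce" can look like, the Python below follows the line-oriented style used by interfaces such as Hadoop Streaming: the mapper emits tab-separated key/value lines, and the reducer assumes its input arrives sorted by key. The "store,month,price" record layout and all values are made up for illustration, and `sorted` stands in for the framework's shuffle step.

```python
from itertools import groupby


def mapper(lines):
    """Map: parse "store,month,price" rows and emit "store<TAB>price" lines."""
    for line in lines:
        store, _month, price = line.strip().split(",")
        yield f"{store}\t{price}"


def reducer(sorted_lines):
    """Reduce: average the prices for each store.

    Assumes the lines arrive sorted by key, so equal keys are adjacent.
    """
    parsed = (line.split("\t") for line in sorted_lines)
    for store, group in groupby(parsed, key=lambda kv: kv[0]):
        prices = [float(price) for _, price in group]
        yield f"{store}\t{sum(prices) / len(prices):.2f}"


raw = ["A,2012-01,2.00", "B,2012-01,3.00", "A,2012-02,4.00"]
# sorted() stands in for the framework's shuffle/sort between the phases.
output = list(reducer(sorted(mapper(raw))))
```

On a real cluster the two functions would live in separate scripts that read standard input and write standard output, and the submission script would point the scheduler at them and at the input data.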
![Page 100: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/100.jpg)
Distributed File Systems
Data transfer is the main bottleneck in distributed systems
For big data, it makes sense to distribute data as well as computation
Data broken up into chunks, each of which lives on a separate node
File system keeps track of where the pieces are and allocates jobs so computation happens "close" to data whenever possible
Tight coupling between MapReduce software and associated file systems
MapReduce → Google File System (GFS)
Hadoop → Hadoop Distributed File System (HDFS)
Amazon EMR → Amazon S3
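The idea of running computation "close" to the data can be illustrated with a toy scheduler. This is a sketch under simplifying assumptions, not how GFS or HDFS actually allocate work: `assign_tasks`, the chunk names, and the node names are all hypothetical, and "close" is reduced to "on a node that stores the chunk".

```python
def assign_tasks(chunk_locations, nodes):
    """Assign one task per chunk, preferring a node that stores the chunk."""
    load = {node: 0 for node in nodes}
    assignment = {}
    for chunk, holders in sorted(chunk_locations.items()):
        # Prefer nodes that hold a replica; fall back to any node otherwise.
        candidates = [n for n in holders if n in load] or list(nodes)
        # Among the candidates, pick the least-loaded one (ties: by name).
        node = min(candidates, key=lambda n: (load[n], n))
        assignment[chunk] = node
        load[node] += 1
    return assignment


chunk_locations = {
    "chunk-0": ["node-a", "node-b"],
    "chunk-1": ["node-a"],
    "chunk-2": ["node-b", "node-c"],
    "chunk-3": ["node-d"],  # replica on a node that is currently unavailable
}
nodes = ["node-a", "node-b", "node-c"]
plan = assign_tasks(chunk_locations, nodes)
```

Only the task for chunk-3 has to pull its data over the network, which is the transfer cost the file system is trying to avoid.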
![Page 102: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/102.jpg)
Distributed File Systems
[Figure 1: GFS Architecture. The application's GFS client sends (file name, chunk index) to the GFS master, which holds the file namespace (e.g., /foo/bar → chunk 2ef0), and receives (chunk handle, chunk locations). The client then sends (chunk handle, byte range) to a GFS chunkserver, which returns chunk data stored on its Linux file system. The master sends instructions to the chunkservers and receives chunkserver state. Legend: data messages, control messages.]
and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.
Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.
The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.
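The translation from a byte offset to a chunk index is simple integer arithmetic. The sketch below is an illustration rather than GFS client code: it uses the 64 MB chunk size that the excerpt on this slide reports, and the function name `locate` and its return format are made up.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB


def locate(offset, length, chunk_size=CHUNK_SIZE):
    """Split a read of `length` bytes at `offset` into per-chunk byte ranges.

    Returns a list of (chunk_index, start_in_chunk, end_in_chunk) tuples,
    with the end position exclusive.
    """
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // chunk_size   # which chunk holds this byte
        start = offset % chunk_size    # position inside that chunk
        stop = min(chunk_size, start + (end - offset))
        pieces.append((index, start, stop))
        offset += stop - start
    return pieces
```

For example, a 10-byte read that starts 4 bytes before the end of the first chunk touches chunk 0 and chunk 1, so the client would need locations for both.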
2.5 Chunk Size
Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.
A large chunk size offers several important advantages. First, it reduces clients’ need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.
2.6 Metadata
The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk’s replicas. All metadata is kept in the master’s memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
2.6.1 In-Memory Data Structures
Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space
![Page 103: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/103.jpg)
Scenarios
![Page 104: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/104.jpg)
Scenario 1: Not-So-Big Data
My data is 100 GB or less
Advice
Store data locally in flat files (csv, Stata, R, etc.)
Organize data in normalized tables for robustness and clarity
Run code serially or (if computation is slow) in parallel
![Page 106: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/106.jpg)
Scenario 2: Big Data, Small Analysis
My raw data is > 100 GB, but the extracts I actually use for analysis are << 100 GB
Example
Medicare claims data → analyze heart attack spending by patient by year
Nielsen scanner data → analyze average price by store by month
Advice
Store data in relational database optimized to produce analysis extracts efficiently
Store extracts locally in flat files (csv, Stata, R, etc.)
Organize extracts in normalized tables for robustness and clarity
Run code serially or (if computation is slow) in parallel
Note: Gains to database increase for more structured data. For completely unstructured data, you may be better off using distributed file system + MapReduce to create extracts.
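The "database produces the extract, flat file holds the extract" workflow can be sketched with Python's built-in sqlite3 module. This is a toy under stated assumptions: an in-memory SQLite database stands in for the large raw database, and the `claims` table, its columns, and all values are invented for illustration.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the large raw database
conn.execute(
    "CREATE TABLE claims (patient_id INTEGER, year INTEGER, "
    "diagnosis TEXT, spending REAL)"
)
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?, ?)",
    [
        (1, 2010, "heart attack", 1200.0),
        (1, 2010, "heart attack", 300.0),
        (1, 2011, "fracture", 800.0),
        (2, 2010, "heart attack", 500.0),
    ],
)
# An index on the filter/group columns is one way to make the extract fast.
conn.execute("CREATE INDEX idx_claims ON claims (diagnosis, patient_id, year)")

extract = conn.execute(
    "SELECT patient_id, year, SUM(spending) AS spending "
    "FROM claims WHERE diagnosis = ? "
    "GROUP BY patient_id, year ORDER BY patient_id, year",
    ("heart attack",),
).fetchall()

# Write the small extract to a flat CSV (here an in-memory buffer).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["patient_id", "year", "spending"])
writer.writerows(extract)
conn.close()
```

The analysis code then reads only the small CSV, so it never has to load the raw claims.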
![Page 110: Nuts and Bolts - National Bureau of · PDF fileNuts and Bolts Matthew Gentzkow Jesse M. Shapiro Chicago Booth and NBER](https://reader034.fdocuments.us/reader034/viewer/2022050719/5a8378d47f8b9aee018ed2be/html5/thumbnails/110.jpg)
Scenario 3: Big Data, Big Analysis
My data is > 100 GB and my analysis code needs to touch all of the data
Example
2 TB of SEC filing text → run variable selection using all data
Advice
Store data in distributed file system
Use MapReduce or other distributed algorithms for analysis