Big mountain data competition training: scraping-n-munge
Uploaded by: david-b-gonzalez
Category: Data & Analytics
Description: BMDC stand-up lunch presentation
Transcript of Big mountain data competition training: scraping-n-munge
BMDC: Utah Air Quality
@davidbgonzalez
Ziff.io
Boawp.com
https://github.com/davidbgonzalez/bmdcfall2014data
WWWWW
Find: /<this is what I'm looking for>
Replace: :%s/<leave me blank to reuse the last thing I searched>/<replace with this>/
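The same find-then-replace habit carries over to scripts. Here is a minimal sketch using Python's `re` module; the sample line, station name, and pattern are made up for illustration and are not from the competition data.

```python
import re

# Hypothetical sample line; the station name and reading are invented.
line = "Station: Hawthorne ; PM2.5=12.4 ; date=2014-01-15"

# "Find": compile the pattern once so it can be reused,
# much like Vim remembering the last search.
pattern = re.compile(r"\s*;\s*")

# "Replace": substitute every match on the line.
cleaned = pattern.sub(",", line)
print(cleaned)
```

Compiling the pattern first mirrors the search-then-substitute flow: check that the pattern matches what you expect, then reuse it for the replacement.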
Scrape
txt www
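A rough sketch of the scrape step, turning an HTML table into tab-separated text. It uses the standard-library `html.parser` as a dependency-free stand-in for the BeautifulSoup4 listed under Tools; the `TableText` class and the HTML fragment are hypothetical.

```python
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up page fragment; a real run would fetch the HTML first.
html = """
<table>
  <tr><th>station</th><th>pm25</th></tr>
  <tr><td>Hawthorne</td><td>12.4</td></tr>
</table>
"""

parser = TableText()
parser.feed(html)
parser.close()

# One line per table row, cells separated by tabs.
text = "\n".join("\t".join(row) for row in parser.rows)
print(text)
```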
BIG
sql
BIG
● wc -l # 128438621 # Oh no!
● Work with a sample for flow
● head -n 100000
● Compress it
– Avro
– Parquet
● Play with it in HDFS
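The count, sample, and compress steps above can be sketched in plain Python. This is only an illustration under assumptions: the helper names and file names are hypothetical, and gzip stands in for compression because Avro and Parquet need third-party libraries.

```python
import gzip
import itertools
import os
import shutil
import tempfile


def count_lines(path):
    """Count lines by streaming the file, roughly what `wc -l` reports."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def head(src, dst, n):
    """Copy the first n lines of src to dst, like `head -n`."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.writelines(itertools.islice(fin, n))


def compress(src, dst):
    """Gzip src into dst without loading the whole file into memory."""
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Demo on a small throwaway file (hypothetical names).
workdir = tempfile.mkdtemp()
full = os.path.join(workdir, "full.txt")
sample = os.path.join(workdir, "sample.txt")
packed = sample + ".gz"

with open(full, "w") as f:
    for i in range(1000):
        f.write(f"row {i}\n")

head(full, sample, 100)
compress(sample, packed)
print(count_lines(full), count_lines(sample))
```

Building the pipeline against the small sample first keeps each iteration fast; the full file only needs to run once the flow works.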
txt
Tools
● BeautifulSoup4 ← Python
● Vim + regex
● xlrd
● CLI
– jq
– json2csv
– csvkit
– subsample
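For a feel of what the CLI tools above do, here is a standard-library sketch that loosely mimics two of them: converting JSON records to CSV and drawing a random subsample. The function names, field names, and records are hypothetical, and the real tools have their own options and behavior.

```python
import csv
import io
import json
import random


def json_lines_to_csv(lines, fields):
    """Convert JSON-lines records to CSV text, keeping only `fields`."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        if line.strip():  # skip blank lines
            writer.writerow(json.loads(line))
    return out.getvalue()


def subsample(rows, k, seed=0):
    """Reservoir sampling: pick k rows uniformly in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Made-up records for illustration.
records = [
    '{"station": "A", "pm25": 12.4, "note": "x"}',
    '{"station": "B", "pm25": 8.1}',
]
csv_text = json_lines_to_csv(records, ["station", "pm25"])
picked = subsample(range(1000), 10)
print(csv_text)
print(len(picked))
```

Reservoir sampling fits this setting because it never needs the total row count up front, so it works on a stream too large to hold in memory.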