MIT Big Data Explorers - presentation by Daniel Burseth
![Page 1: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/1.jpg)
AN END-TO-END DEMONSTRATION OF GENERATING, CLEANING, AND VISUALIZING A “MESSY” DATA SET
Daniel Burseth
Co-president MIT Big Data [email protected]
@dmbnyc
Github: dburseth
![Page 2: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/2.jpg)
WHAT’S THE MOTIVATION?
Acronyms abound
Tremendous complexity
Use building blocks not code
![Page 3: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/3.jpg)
CLEAN DATA IS A LUXURY
This is easy
EPPM of 10 requires 500 professionals
![Page 4: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/4.jpg)
BUT WHAT ABOUT INFORMATION THAT ISN’T NICELY STRUCTURED AND DOESN’T HAVE AN API?
![Page 5: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/5.jpg)
ANOTHER AREA THAT DOESN’T GET MUCH AIR TIME….
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing:
• Missing
• Duplicative
• Conventions (dates, time, geographies)
• Spacing
• Can we measure data cleanliness?
• What’s our Pareto point?
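One rough way to put a number on "data cleanliness" is to count the problems listed above. The sketch below is only an illustration in Python: the column names, the sample rows, and the choice of metrics are assumptions, not something taken from the talk.

```python
import csv
import io

def cleanliness_report(rows, columns):
    """Count missing cells, cells with stray spacing, and duplicate rows."""
    total_cells = len(rows) * len(columns)
    missing = 0
    untrimmed = 0
    duplicates = 0
    seen = set()
    for row in rows:
        values = [(row.get(c) or "") for c in columns]
        missing += sum(1 for v in values if not v.strip())
        untrimmed += sum(1 for v in values if v != v.strip())
        # Rows count as duplicates when they match after trimming and lowercasing.
        key = tuple(v.strip().lower() for v in values)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "rows": len(rows),
        "missing_cells": missing,
        "untrimmed_cells": untrimmed,
        "duplicate_rows": duplicates,
        "missing_share": missing / total_cells if total_cells else 0.0,
    }

# Hypothetical tab-separated sample (not the workshop data).
SAMPLE = "Title\tCity\n$1500 2br\tCambridge\n$1500 2br\tCambridge \n\tBoston\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
report = cleanliness_report(rows, ["Title", "City"])
print(report)
```

Tracking these counts before and after each cleaning pass gives a simple signal for when further effort stops paying off.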
![Page 6: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/6.jpg)
LOGIN TO YOUR AWS INSTANCE
AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type m3.medium
Connect
You should see some software on the desktop
![Page 7: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/7.jpg)
AGENDA
Scrape all of Craigslist’s Boston apartment listings using WebHarvy
Examine, clean, and prepare the data set using OpenRefine
Map our data and apply filters using Tableau
……all without writing a single line of code.
![Page 8: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/8.jpg)
DOWNLOAD MY SLIDES AT SHOUTKEY.COM/EFFIGY
![Page 9: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/9.jpg)
WEBHARVY
A hyper-intelligent utility to scrape website data.
SysNucleus, makers of USBTrace
Heavy duty alternatives: Scrapy (scrapy.org), Beautiful Soup
![Page 10: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/10.jpg)
GO TO HTTP://SHOUTKEY.COM/WIRE
1. Start Config
2. Click on Hungry Mother – capture text
3. Click on Hungry Mother – capture URL
4. Click on Kendall Square/MIT – capture text
5. Click last review – capture text
CLEAR
6. Mine -> Scrape a list of similar links
7. Click on Hungry Mother
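For readers who do want to see what the point-and-click capture corresponds to in code, here is a minimal sketch using only Python's standard library. It is not how WebHarvy works internally; the markup, the URLs, and the second listing are hypothetical, since the real page structure is not shown in the slides.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (text, url) pairs for every <a href=...> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a link we are currently reading.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Hypothetical markup standing in for the review site.
PAGE = """
<ul>
  <li><a href="/biz/hungry-mother">Hungry Mother</a> Kendall Square/MIT</li>
  <li><a href="/biz/another-place">Another Place</a> Back Bay</li>
</ul>
"""

collector = LinkCollector()
collector.feed(PAGE)
for text, url in collector.links:
    print(text, url)
```

"Capture text" maps to the link text and "capture URL" to the `href` attribute; scraping "a list of similar links" amounts to applying the same rule to every matching element.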
![Page 11: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/11.jpg)
WE’VE NOW DRILLED INTO THE TOP LINK
Let’s start collecting information in the first sub-page.
![Page 12: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/12.jpg)
THIS CAPTURED THE FIRST PAGE, BUT WHAT IF WE WANT MORE?
Edit Clear
Navigate into a sub-page
Start Config
Set as Next Page Link
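The "Next Page Link" idea can be sketched as a loop that keeps following a page's next link until there is none. Everything here is an assumption for illustration: `fetch` is a hypothetical callable, and `FAKE_SITE` is an in-memory stand-in for real HTTP requests.

```python
def scrape_all_pages(fetch, first_url, max_pages=50):
    """Follow 'next page' links until none is left or max_pages is reached."""
    records = []
    visited = set()
    url = first_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)            # guard against pages that link in a loop
        page = fetch(url)
        records.extend(page["items"])
        url = page.get("next")      # None ends the crawl
    return records

# Hypothetical three-page site; a real fetch would download and parse HTML.
FAKE_SITE = {
    "/page1": {"items": ["listing A", "listing B"], "next": "/page2"},
    "/page2": {"items": ["listing C"], "next": "/page3"},
    "/page3": {"items": ["listing D"], "next": None},
}

all_items = scrape_all_pages(FAKE_SITE.__getitem__, "/page1")
print(all_items)
```

The `max_pages` cap is a design choice to keep a misconfigured crawl from running indefinitely.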
![Page 13: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/13.jpg)
OTHER BELLS AND WHISTLES
Scheduler
Input keywords
Pause Inject (word of caution: scraping often violates a site's terms of service and may not be viable for apps or commercial purposes!)
TRY VISITING CRAIGSLIST IN AWS BTW!!
Proxy
Database export
![Page 14: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/14.jpg)
20K ROWS OF MESS!
Download Craigslist Boston from http://shoutkey.com/glorify
Look at our data: open Boston Dirty.csv (20k rows of mess!)
Time to CLEAN: Launch GOOGLE-REFINE.EXE
Within MOZILLA, navigate to http://127.0.0.1:3333/
Create Project -> This Computer -> Browse
Parse by tab
Create Project
![Page 15: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/15.jpg)
REMOVE DUPLICATES
1. First, sort your column.
2. Then, invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears on top of the middle of the data table.
3. Then invoke Edit cells and Blank down on the Title column.
4. Then on that column, invoke menu Facet > Custom facets and Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the left most "all" dropdown menu.
6. Remove the facet.
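The same sort, blank-down, and remove-blanks recipe can be sketched in a few lines of Python. This is an illustration of the logic rather than Refine's implementation; the `Title` and `id` fields and the sample rows are hypothetical.

```python
def remove_duplicates(rows, column):
    """Sort by `column`, blank down repeated values, then drop blank rows."""
    ordered = sorted(rows, key=lambda r: r[column])   # sort and reorder permanently
    blanked = []
    previous = None
    for row in ordered:
        value = row[column]
        if previous is not None and value == previous:
            blanked.append({**row, column: ""})        # blank down a repeated value
        else:
            blanked.append(dict(row))
            previous = value
    # Facet by blank, remove matching rows.
    return [r for r in blanked if r[column] != ""]

# Hypothetical listings; ids show which copy of each title survives.
listings = [
    {"id": 1, "Title": "$2000 Sunny 1br"},
    {"id": 2, "Title": "$1500 Cozy studio"},
    {"id": 3, "Title": "$2000 Sunny 1br"},
    {"id": 4, "Title": "$1500 Cozy studio"},
    {"id": 5, "Title": "$3000 Spacious 3br"},
]
deduped = remove_duplicates(listings, "Title")
print([r["id"] for r in deduped])
```

As in the Refine recipe, rows whose title was blank to begin with are removed along with the duplicates.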
![Page 16: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/16.jpg)
DUPLICATE “TITLE”
![Page 17: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/17.jpg)
“TITLE” CONTAINS KEY INFO, LET’S PARSE IT
![Page 18: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/18.jpg)
MORE CHANGES TO “TITLE”
![Page 19: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/19.jpg)
TITLE REMAINS MESSY
Then run the “To Number” transform again
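The exact Refine transforms live in the slide images and are not reproduced in this transcript. As a rough equivalent, the sketch below pulls a price out of a title and converts it to a number. The title format (a dollar amount somewhere in the text) and the sample titles are assumptions.

```python
import re

# A dollar sign followed by digits, optionally with thousands separators.
PRICE_PATTERN = re.compile(r"\$\s*([\d,]+)")

def extract_price(title):
    """Return the first dollar amount in `title` as an int, or None."""
    match = PRICE_PATTERN.search(title)
    if not match:
        return None
    digits = match.group(1).replace(",", "")   # "1,500" -> "1500"
    return int(digits) if digits else None

# Hypothetical titles in the style of apartment listings.
print(extract_price("$1,500 / 2br - Sunny apartment (Cambridge)"))
print(extract_price("2br - no price listed"))
```

Stripping the separators before converting mirrors why a "To Number" step may need to be run again after the text has been cleaned.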
![Page 20: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/20.jpg)
LET’S EXTRACT LOCATION
![Page 21: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/21.jpg)
REMOVE TRAILING PAREN
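A possible code equivalent of the location extraction, assuming (this is not confirmed by the transcript) that the location sits in parentheses at the end of the title. The function names and sample strings are hypothetical.

```python
import re

# Text inside the last pair of parentheses at the end of the string.
LOCATION_PATTERN = re.compile(r"\(([^()]*)\)\s*$")

def extract_location(title):
    """Return the text inside the final parentheses of `title`, or ''."""
    match = LOCATION_PATTERN.search(title)
    return match.group(1).strip() if match else ""

def strip_trailing_paren(value):
    """Drop a dangling ')' left behind after splitting a title on '('."""
    return value.rstrip().rstrip(")").strip()

print(extract_location("$1,500 / 2br - Sunny apartment (Cambridge)"))
print(strip_trailing_paren(" Somerville) "))
```

The regex does both steps at once; `strip_trailing_paren` covers the case where the column was split on "(" first and the closing parenthesis was left behind.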
![Page 22: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/22.jpg)
NOW THE FUN PART: CLUSTERING
![Page 23: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/23.jpg)
SWITCH THE METHOD: NEAREST NEIGHBOR
Increment the radius to 7 and make judgment calls along the way.
Change the Distance Function and do the same thing
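To make the radius idea concrete, here is a simplified sketch of distance-based clustering using Levenshtein (edit) distance. It is not Refine's implementation: the greedy grouping against each cluster's first member is an assumption made to keep the example short, and the city spellings are invented.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def cluster(values, radius):
    """Greedily group values within `radius` edits of a cluster's first member."""
    clusters = []
    for value in values:
        for group in clusters:
            if levenshtein(value.lower(), group[0].lower()) <= radius:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters

# Hypothetical messy city names.
cities = ["Cambridge", "Cambrige", "cambridge", "Somerville", "Sommerville", "Boston"]
groups = cluster(cities, radius=2)
print(groups)
```

A larger radius merges more aggressively, which is why each proposed cluster still needs a human judgment call before it is accepted.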
![Page 24: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/24.jpg)
TRIM WHITESPACE ON OUR CITY DATA
![Page 25: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/25.jpg)
ADD “,MA” TO OUR CITY DATA
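The last two city clean-ups (trim whitespace, then append the state) are simple string operations. A minimal Python equivalent, with a hypothetical function name and sample values:

```python
def normalize_city(city, state="MA"):
    """Trim surrounding whitespace and append ',MA' so the city can be geocoded."""
    cleaned = city.strip()
    return f"{cleaned},{state}" if cleaned else ""

print(normalize_city("  Cambridge "))
```

Blank cities are left blank rather than turned into a bare ",MA".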
![Page 26: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/26.jpg)
LET’S PLOT OUR VALUES
Looks like we have SOME really expensive real estate. Data errors????
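One hedged way to flag prices that are likely data errors is to compare each value with the median. The rule below (more than five times the median, or less than a fifth of it) is an arbitrary assumption for illustration, as are the sample prices.

```python
from statistics import median

def flag_suspect_prices(prices, factor=5):
    """Return prices above median * factor or below median / factor."""
    mid = median(prices)
    return [p for p in prices if p > mid * factor or p < mid / factor]

# Hypothetical monthly rents with two obviously odd entries.
prices = [1500, 1800, 2200, 2500, 1950, 250000, 1]
suspects = flag_suspect_prices(prices)
print(suspects)
```

The median is used instead of the mean because a single extreme value would drag the mean toward itself and hide the problem.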
![Page 27: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/27.jpg)
EXPORT OUR DATA AND LEAVE REFINE
Boston Clean.csv
![Page 28: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/28.jpg)
WELCOME TO TABLEAU
Load Boston Clean.csv
“Go to Worksheet”
![Page 29: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/29.jpg)
DRAG CITY TO THE BLACK BOX
Great “semantic” example. Tableau understands that this text translates to a lat/long
![Page 30: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/30.jpg)
TABLEAU ALERTS TO UNPLOTTED POINTS
Look on the map in the lower right corner
Let’s “Filter Data”
![Page 31: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/31.jpg)
SIZE AND LABEL OUR DATA
Under “Measures”, drag “Price” onto “Size” in “Marks”
Change sum(Price) to avg(Price)
Drag “Price” into Filters, change it to max(Price), and select an “At Most” value
Right click on the filter and show “Quick Filter”
Drag “City” onto “Label”
Menu Map -> Map Options
Click on a node for info and drill down potential
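The aggregation behind this view can be sketched outside Tableau: average price per city, keeping only cities whose maximum price is at most a chosen cutoff. The field names, sample rows, and cutoff are assumptions, and this is an approximation of the filter's effect rather than Tableau's own logic.

```python
from collections import defaultdict

def summarize_by_city(rows, max_price):
    """Average price per city, for cities whose highest price is <= max_price."""
    by_city = defaultdict(list)
    for row in rows:
        by_city[row["City"]].append(row["Price"])
    return {
        city: sum(prices) / len(prices)
        for city, prices in by_city.items()
        if max(prices) <= max_price
    }

# Hypothetical cleaned rows.
clean_rows = [
    {"City": "Cambridge,MA", "Price": 2000},
    {"City": "Cambridge,MA", "Price": 3000},
    {"City": "Boston,MA", "Price": 2500},
    {"City": "Boston,MA", "Price": 250000},
    {"City": "Somerville,MA", "Price": 1800},
]
summary = summarize_by_city(clean_rows, max_price=10000)
print(summary)
```

In this sample the Boston group drops out entirely because one errant listing pushes its maximum above the cutoff.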
![Page 32: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/32.jpg)
VISUALIZATION IS A HUGE TOPIC!
![Page 33: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/33.jpg)
RECAP
1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a real cursory mapping visualization
![Page 34: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/34.jpg)
WHAT’S YOUR BUSINESS IDEA?
Please come talk to me
![Page 35: MIT Big Data Explorers - presentation by Daniel Burseth](https://reader038.fdocuments.us/reader038/viewer/2022103001/558ccc3ed8b42a02638b4639/html5/thumbnails/35.jpg)
QUESTIONS? THANK YOU!
GITHUB:DMBNYC
[email protected]