Data Cleaning & Integration - Visualizationpoloclub.gatech.edu/cse6242/2014spring/lectures/...Jan...

Data Cleaning & Integration CSE6242 / CX4242 Jan 14, 2014 Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos


Data Cleaning & Integration

CSE6242 / CX4242
Jan 14, 2014

Duen Horng (Polo) Chau
Georgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos


Last Time

Data collection & simple data storage

• Why SQLite?
  • Simplicity: nothing to install/maintain, database in a single file
  • Popular: cross-platform, cross-device
• SQL basics (create table, join, create index, etc.)

Big data analytics building blocks

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination


Data Cleaning

How dirty is real data?


Data Cleaners

Watch videos
• Google Refine
• Data Wrangler (research at Stanford)

Write down
• Examples of data dirtiness
• Tool’s features demo-ed (or that you like)

Will collectively summarize similarities and differences afterwards

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/


How dirty is real data?

Examples
• no specific schemas / different names for the same thing / numbers and text mixed
• trailing spaces / text not relevant to data
• different units / data out of range (unrealistic) / skewed data distributions
• missing values / missing rows entirely
• file formats
• text may not be where you want it to be (maybe in a different column)
• improper merge of two tables
• duplications
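A few of the problems listed above (trailing spaces, mixed units, missing values, out-of-range values, duplications) can be handled with a short script. The following is a minimal sketch only: the (name, height) record layout, the missing-value markers, and the 30 to 300 cm plausibility range are all assumptions made for illustration, not something taken from the lecture.

```python
# Hypothetical markers that we treat as "missing" (an assumption).
MISSING = {"", "n/a", "na", "null", "-"}


def parse_height_cm(text):
    """Convert a raw height string to centimeters, or None if unusable."""
    text = text.strip().lower()
    if text in MISSING:
        return None                      # missing value
    factor = 1.0
    if text.endswith("cm"):
        text = text[:-2]
    elif text.endswith("m"):
        text, factor = text[:-1], 100.0  # different units: meters to cm
    try:
        value = float(text) * factor
    except ValueError:
        return None                      # numbers and text mixed
    # Out-of-range (unrealistic) values are treated as missing.
    return value if 30 <= value <= 300 else None


def clean_rows(rows):
    """Clean (name, height) string pairs and drop duplicate records."""
    seen = set()
    cleaned = []
    for name, height in rows:
        name = name.strip()              # trailing spaces
        value = parse_height_cm(height)
        key = (name.lower(), value)
        if key in seen:                  # duplications
            continue
        seen.add(key)
        cleaned.append((name, value))
    return cleaned
```

For example, `clean_rows([("Smith ", "170"), ("smith", "170cm")])` keeps a single Smith record, because both rows normalize to the same name and height.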


How are they similar?

• mass/batch conversion
• graph/chart visualization
• heuristics (e.g., group in G, selection in W)
• removing redundancy
• tracking changes / history / undo-redo
• table based
• suggestions (what to fix)
• filtering (show less)

G = Google Refine
W = Data Wrangler


How do they differ?

• G has a clustering feature
• W has format conversion (1 column spread into multiple)
• W can export actions as scripts
• G supports offline mode (online too?)
• W extracts part of text into a new column
• W can copy and paste
• W allows you to preview changes
• W uses colors to indicate different kinds of changes
• G can show statistics

G = Google Refine
W = Data Wrangler
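The slides do not say how G's clustering works internally. One common way to group different spellings of the same value is a "fingerprint" key collision: normalize each value to a key, then group values whose keys match. The sketch below assumes that approach purely for illustration; the function names `fingerprint` and `cluster` are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: lowercase, drop punctuation, sort unique tokens."""
    no_punct = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(no_punct.split())))


def cluster(values):
    """Group values whose fingerprints collide.

    Only groups that contain more than one distinct spelling are returned,
    since those are the candidates for merging.
    """
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]
```

With this sketch, "Georgia Tech", "georgia tech " and "Tech, Georgia" all share the key "georgia tech" and land in one cluster, which a person can then review and merge.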


The videos only show some of the tools’ features. Try them out.

Google Refine: http://code.google.com/p/google-refine/
Data Wrangler: http://vis.stanford.edu/wrangler/


Data Integration


Course Overview

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination


What is Data Integration? Why is it Important?


Data Integration

Combining data from different sources to provide the user with a unified view

As data’s volume, velocity and variety increase, and veracity decreases, data integration presents new (and more) opportunities and challenges

How to help people effectively leverage multiple data sources? (People: analysts, researchers, practitioners, etc.)


Examples of businesses based on data integration


Mashup


More Examples?

• Palantir Gotham
• Yelp: restaurant reviews, business reviews
• Facebook friend request: look at your friends’ friends and recommend them as your friends
• Trulia / Zillow (real estate sites)
• Graph Search (Facebook)
• Waze
• Yahoo Pipes
• Google search engine
• Google Transit
• Google Now / Apple Siri
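The friend-request example above (look at your friends' friends and recommend them) can be sketched as counting mutual friends. This is only an illustration of the idea as stated on the slide, not a description of how Facebook actually does it; the `friends` dictionary layout and the function name are assumptions.

```python
from collections import Counter


def recommend_friends(friends, user, top_n=3):
    """Rank people who are not yet friends of `user` by mutual-friend count.

    `friends` maps each person to the set of their friends (hypothetical
    layout chosen for this sketch).
    """
    direct = friends.get(user, set())
    counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the user and people who are already friends.
            if candidate != user and candidate not in direct:
                counts[candidate] += 1
    # Most mutual friends first; ties broken by name for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]
```

The integration step here is small but real: each person's friend list is a separate data source, and the recommendation only appears once those lists are combined.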


How to do data integration?


“Low” Effort Approaches

Use database’s “Join” (e.g., SQLite)

id   name     state
111  Smith    GA
222  Johnson  NY
222  Obama    CA

id   name
111  Smith
222  Johnson
333  Obama

id   state
111  GA
222  NY
222  CA

Google Refine
http://code.google.com/p/google-refine/ (video #3)
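The database join named on this slide can be tried with Python's built-in sqlite3 module. The sketch below loads the two small tables with the ids exactly as they appear on the slide; the table names `names` and `states` are assumptions. Note that with these ids an inner join on id pairs both 222 state rows with Johnson and drops 333 Obama, so it does not reproduce the combined table's "222 Obama CA" row. Whether that mismatch is deliberate or a typo is not stated in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE names  (id INTEGER, name  TEXT);
    CREATE TABLE states (id INTEGER, state TEXT);
    INSERT INTO names  VALUES (111, 'Smith'), (222, 'Johnson'), (333, 'Obama');
    INSERT INTO states VALUES (111, 'GA'), (222, 'NY'), (222, 'CA');
""")

# Inner join: only ids present in both tables survive.
inner = conn.execute("""
    SELECT n.id, n.name, s.state
    FROM names AS n JOIN states AS s ON n.id = s.id
    ORDER BY n.id, s.state
""").fetchall()

# Left join: every name is kept; unmatched rows get NULL (None) for state.
left = conn.execute("""
    SELECT n.id, n.name, s.state
    FROM names AS n LEFT JOIN states AS s ON n.id = s.id
    ORDER BY n.id, s.state
""").fetchall()

for row in inner:
    print(row)
```

Comparing `inner` with `left` is a quick way to spot keys that fail to match, which is often the first sign of an improper merge.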


Crowd-sourcing Approaches: Freebase

http://wiki.freebase.com/wiki/What_is_Freebase%3F


Freebase (a graph of entities)

“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…” (Wikipedia)
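To make "a graph of entities" concrete, here is a minimal sketch that stores facts as (subject, predicate, object) triples and follows links between entities. It is an illustration only and does not reflect Freebase's actual data model or API; the `EntityGraph` class, its methods, and the `located_in` predicate are hypothetical names.

```python
from collections import defaultdict


class EntityGraph:
    """Tiny in-memory graph of entities connected by named relations."""

    def __init__(self):
        # subject -> list of (predicate, object) edges
        self._edges = defaultdict(list)

    def add(self, subject, predicate, obj):
        """Record one fact as a directed, labeled edge."""
        self._edges[subject].append((predicate, obj))

    def objects(self, subject, predicate):
        """Return all objects linked from `subject` by `predicate`."""
        return [o for p, o in self._edges.get(subject, []) if p == predicate]

    def follow(self, subject, *predicates):
        """Follow a chain of predicates starting from `subject`."""
        frontier = [subject]
        for predicate in predicates:
            frontier = [o for s in frontier for o in self.objects(s, predicate)]
        return frontier


graph = EntityGraph()
graph.add("Georgia Tech", "located_in", "Atlanta")
graph.add("Atlanta", "located_in", "Georgia")
print(graph.follow("Georgia Tech", "located_in", "located_in"))
```

Because facts from different contributors attach to the same entity node, multi-hop questions like the one above can be answered without joining separate tables by hand.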


So what? What can you do with Freebase?

(Hint: Google acquired it in 2010)


http://www.google.com/insidesearch/features/search/knowledge.html


Given a graph of entities, like Freebase, what other cool things can you do?


https://www.facebook.com/about/graphsearch


Facebook’s Graph Search

Integrate your friends’ info with yours


Feldspar
Finding Information by Association.

CHI 2008
Polo Chau, Brad Myers, Andrew Faulring

Paper: http://www.cs.cmu.edu/~dchau/feldspar/feldspar-chi08.pdf
YouTube: http://www.youtube.com/watch?v=Q0TIV8F_o_E&feature=youtu.be&list=ULQ0TIV8F_o_E


Summary for data integration

Opportunities
• enable new services (Siri, PadMapper)
• enable new ways to discover info
• improve existing services
• reduce redundancy
• new ways to interact with data
• promote knowledge transfer (e.g., between companies)