Juliana Freire PPT

Page 1: Juliana Freire PPT

Exploring Big and not so Big Data: Opportunities and Challenges

Juliana Freire [email protected]

Visualization and Data Analysis (ViDA) Center http://bigdata.poly.edu

NYU Poly

Page 2: Juliana Freire PPT

2 ViDA Center Juliana Freire

Big Data: What is the Big deal?

http://www.google.com/trends/explore#q=%22big%20data%22

Page 3: Juliana Freire PPT

Big Data: What is the Big deal?

  Many success stories
–  Google: many billions of pages indexed, products, structured data
–  Facebook: 1.1 billion users using the site each month
–  Twitter: 517 million accounts, 250 million tweets/day

  This is changing society!

Page 4: Juliana Freire PPT

Big Data: What is the Big deal?

  Smart Cities: 50% of the world population lives in cities
–  Census, crime, emergency visits, cabs, public transportation, real estate, noise, energy, …
–  Make cities more efficient and sustainable, and improve the lives of their citizens

http://www.nyu.edu/about/university-initiatives/center-for-urban-science-progress.html

  Enable scientific discoveries: science is now data rich
–  Petabytes of data generated each day, e.g., Australian radio telescopes, Large Hadron Collider
–  Social data, e.g., Facebook, Twitter (2,380,000 and 2,880,000 results in Google Scholar!)

  Data is currency

Page 6: Juliana Freire PPT

Big Data: What is the Big deal?

  Big data is not new: financial transactions, call detail records, astronomy, …

  What is new is that there are many more data enthusiasts

  More data are widely available, e.g., Web, data.gov, scientific data

  Computing is cheap and easy to access
–  Server with 64 cores, 512GB RAM ~$11k
–  Cluster with 1000 cores ~$150k
–  Pay as you go: Amazon EC2

[Figure: disciplines by rank, 2010 and 2020; axis label: data volumes, % IT investment; disciplines shown: Astronomy, Geosciences, Chemistry, Microbiology, Social Sciences, Physics, Medicine]

Plot from Howe and Halperin, DEB 2012

Page 7: Juliana Freire PPT

Big Data: What is the Big deal?

  Big data is not new: financial transactions, call detail records, astronomy, …

  What is new is that there are many more data enthusiasts

  More data are widely available, e.g., Web, data.gov, scientific data, social and urban data

  Computing is cheap and easy to access
–  Server with 64 cores, 512GB RAM ~$11k
–  Cluster with 1000 cores ~$150k
–  Pay as you go: Amazon EC2

Page 8: Juliana Freire PPT

Big Data: What is hard?

  Scalability is not the problem…
  Usability is the Big issue

[Diagram labels: data, knowledge, statistics, algorithms, machine learning, math, user interfaces, visual encodings, interaction modes, technology, data management, provenance]

Page 9: Juliana Freire PPT

Exploring data is hard

Page 10: Juliana Freire PPT

Exploring data is hard, regardless of whether the data is big or small

Page 11: Juliana Freire PPT

Case Study: Studying Cab Trips in NYC

Prepare data for analysis

  Raw data for 2011: 63 GB
–  24 CSV files, 2 per month: one for trip data and another for fare data
–  ~170M trips

  Cleaning
–  ~60,000 fare records do not have trip records
–  ~200 duplicates per month
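The cleaning step above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the field names (`medallion`, `pickup_datetime`, `distance`, `total`) and the choice of join key are assumptions, and the rows are made up.

```python
def clean(trips, fares, key_fields=("medallion", "pickup_datetime")):
    """Drop duplicate trips and separate fares with and without a matching trip.

    key_fields is a hypothetical join key; the real files may need another one.
    """
    def key(row):
        return tuple(row[f] for f in key_fields)

    seen, unique_trips, duplicates = set(), [], 0
    for row in trips:
        k = key(row)
        if k in seen:
            duplicates += 1          # same key seen before: count as duplicate
        else:
            seen.add(k)
            unique_trips.append(row)

    matched = [f for f in fares if key(f) in seen]
    orphans = [f for f in fares if key(f) not in seen]  # fare without a trip
    return unique_trips, matched, orphans, duplicates


# Made-up rows, for illustration only.
trips = [
    {"medallion": "A1", "pickup_datetime": "2011-05-01 08:00:00", "distance": 1.2},
    {"medallion": "A1", "pickup_datetime": "2011-05-01 08:00:00", "distance": 1.2},
    {"medallion": "B2", "pickup_datetime": "2011-05-01 08:05:00", "distance": 3.4},
]
fares = [
    {"medallion": "A1", "pickup_datetime": "2011-05-01 08:00:00", "total": 7.5},
    {"medallion": "C3", "pickup_datetime": "2011-05-01 09:00:00", "total": 12.0},
]
unique_trips, matched, orphans, duplicates = clean(trips, fares)
```

At the scale quoted on the slide (~170M trips) the key set would be large, so a real run would likely process one month at a time.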

Page 12: Juliana Freire PPT

Storage Solutions: Temporal Queries

  SQLite – 20 GB of storage (index on pickup_time)
–  Ordered queries: 9.39s
–  Reverse ordered queries: 9.41s
–  Shuffled queries: 9.37s

  Custom storage – 12 GB of storage (in-memory binary search instead of index)
–  Ordered queries: 0.6s
–  Reverse ordered queries: 1.4s
–  Shuffled queries: 1.2s
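The "in-memory binary search instead of index" idea can be illustrated with Python's standard `bisect` module. This is a minimal sketch under the assumption that pickup times are kept in one sorted in-memory array; the function name `trips_in_window` and the sample values are hypothetical, and the custom storage on the slide is certainly more elaborate.

```python
from bisect import bisect_left, bisect_right


def trips_in_window(pickup_times, start, end):
    """Return the index range [lo, hi) of trips with start <= pickup_time <= end.

    pickup_times must already be sorted in ascending order.
    """
    lo = bisect_left(pickup_times, start)   # first position with value >= start
    hi = bisect_right(pickup_times, end)    # first position with value > end
    return lo, hi


# Made-up pickup times (e.g. minutes since midnight), sorted ascending.
pickup_times = [10, 20, 20, 35, 50, 65, 80]
lo, hi = trips_in_window(pickup_times, 20, 50)
window = pickup_times[lo:hi]
```

Each query costs two binary searches, so the lookup time grows only logarithmically with the number of trips.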

Page 13: Juliana Freire PPT

Storage Solutions: Spatial-Temporal

  All trips for a week in a given region
  All trips in a week for a given taxi
  All trips in a week for a given taxi in a given region

Needs a complex indexing scheme that combines spatial, temporal, and taxi id searches
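One way to see why a combined index helps: all three queries above are axis-aligned boxes in a four-dimensional (time, taxi id, x, y) space, differing only in which dimensions are constrained. The sketch below expresses them that way and answers them with a plain linear scan as a baseline. The `Trip` fields, the `box` helper and the data are assumptions made for illustration.

```python
import math
from typing import NamedTuple

INF = math.inf


class Trip(NamedTuple):
    time: float     # hours since the start of the data set (hypothetical unit)
    taxi_id: int
    x: float        # pickup coordinates (hypothetical units)
    y: float


def box(time=(-INF, INF), taxi_id=(-INF, INF), x=(-INF, INF), y=(-INF, INF)):
    """A 4D query box; unconstrained dimensions default to (-inf, inf)."""
    return (time, taxi_id, x, y)


def scan(trips, query):
    """Baseline: test every trip against the box (no index)."""
    return [t for t in trips
            if all(lo <= v <= hi for v, (lo, hi) in zip(t, query))]


trips = [
    Trip(10, 42, 0.5, 0.5),
    Trip(20, 42, 5.0, 5.0),
    Trip(30, 7, 0.2, 0.9),
    Trip(500, 42, 0.5, 0.5),
]
week = (0, 7 * 24)
region = {"x": (0.0, 1.0), "y": (0.0, 1.0)}

week_region = scan(trips, box(time=week, **region))
week_taxi = scan(trips, box(time=week, taxi_id=(42, 42)))
week_taxi_region = scan(trips, box(time=week, taxi_id=(42, 42), **region))
```

The scan touches every trip on every query; an index over all four dimensions is what avoids that.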

Page 14: Juliana Freire PPT

Storage Solutions: Spatial-Temporal

  SQLite – 20+10 GB of storage (index on time and id, r-tree for coordinates)
–  Creating indexes: 52 hrs
–  Range queries: 2.1s
–  Combined queries: 15.3s
–  Cross-table queries: 57s

  Custom storage (ours) – 12+4 GB of storage (using (4d) kd-tree on time, id and coordinates)
–  Building kd-tree: 8 mins
–  Range queries: 0.2s
–  Combined queries: 0.2s
–  Cross-table queries: 2s
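For readers unfamiliar with kd-trees, here is a textbook-style sketch of a 4D kd-tree over (time, taxi id, x, y) with a range query. It is not the custom storage measured above: the tuple-based node layout, the function names and the sample points are assumptions chosen for brevity.

```python
def build(points, depth=0):
    """Build a kd-tree as nested (point, left, right) tuples; None is empty."""
    if not points:
        return None
    axis = depth % len(points[0])               # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                      # median splits the set in half
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))


def range_query(node, lo, hi, depth=0, out=None):
    """Collect points p with lo[i] <= p[i] <= hi[i] on every axis."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % len(point)
    if all(l <= v <= h for v, l, h in zip(point, lo, hi)):
        out.append(point)
    if lo[axis] <= point[axis]:                 # box may overlap the left half
        range_query(left, lo, hi, depth + 1, out)
    if point[axis] <= hi[axis]:                 # box may overlap the right half
        range_query(right, lo, hi, depth + 1, out)
    return out


# Made-up (time, taxi_id, x, y) points.
points = [
    (10, 42, 0.5, 0.5),
    (20, 42, 5.0, 5.0),
    (30, 7, 0.2, 0.9),
    (500, 42, 0.5, 0.5),
    (40, 7, 3.0, 3.0),
]
tree = build(points)

# One week, taxi 42, unit-square region: a single box constrains all four axes.
hits = range_query(tree, lo=(0, 42, 0.0, 0.0), hi=(168, 42, 1.0, 1.0))
```

Because one structure covers time, id and coordinates, a combined query is a single traversal, which is consistent with range and combined queries having the same reported cost for the custom storage.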

Page 15: Juliana Freire PPT

Summary Statistics

  13,237 Medallion Cabs
  42,000 Taxi Drivers
  Average Number of Rides: 485k/day
  Average Number of Passengers: 660k/day

Analysis/Modeling

[Figure: Rides in 2011, daily counts by month (Jan–Dec); axis labels 29k and 590k; annotated dates: Apr 2, Apr 3, Aug 28 (Irene), Dec 25]
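A chart like "Rides in 2011" boils down to counting rides per day, after which unusually quiet days stand out. The sketch below shows that aggregation on made-up timestamps; the function names and the "below half the median" rule are assumptions for illustration, not the analysis used in the talk.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def rides_per_day(pickup_times):
    """Count rides per calendar day from ISO-formatted pickup timestamps."""
    return Counter(datetime.fromisoformat(t).date() for t in pickup_times)


def unusual_days(counts, ratio=0.5):
    """Days whose ride count is below ratio * median (an arbitrary threshold)."""
    typical = median(counts.values())
    return sorted(day for day, n in counts.items() if n < ratio * typical)


# Made-up data: two ordinary days and one very quiet day.
pickups = (["2011-08-26 09:00:00"] * 4
           + ["2011-08-27 09:00:00"] * 4
           + ["2011-08-28 09:00:00"])
counts = rides_per_day(pickups)
low = unusual_days(counts)
```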

Page 16: Juliana Freire PPT

Rides per Hour June 2011

Between 5k and 35k rides/hour

[Figure: hourly ride counts for June 2011, with tick marks at 0h (midnight); annotations: Rides at Midnight, Night Life!, Weekly Patterns]

Analysis/Modeling
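Weekly patterns such as the midnight peaks can be surfaced by grouping rides by (weekday, hour). A minimal sketch on made-up timestamps follows; the function name `hour_of_week_profile` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def hour_of_week_profile(pickup_times):
    """Total rides per (weekday, hour) slot; weekday 0 is Monday, 6 is Sunday."""
    profile = Counter()
    for t in pickup_times:
        dt = datetime.fromisoformat(t)
        profile[(dt.weekday(), dt.hour)] += 1
    return profile


# Made-up pickups: three just after midnight on Saturdays, one on a Monday.
pickups = [
    "2011-06-04 00:15:00",
    "2011-06-04 00:40:00",
    "2011-06-11 00:05:00",
    "2011-06-06 00:30:00",
]
profile = hour_of_week_profile(pickups)
saturday_midnight = profile[(5, 0)]
monday_midnight = profile[(0, 0)]
```

The profile holds totals over the whole period; dividing each slot by the number of times that weekday occurs would give averages instead.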

Page 17: Juliana Freire PPT

TLCVis

Page 18: Juliana Freire PPT

Drop-offs vs. Pickups

[Figure: two map panels, Drop-off and Pickup]

Most of the drop-offs occur on the avenues, while most of the pickups occur on the streets

Page 19: Juliana Freire PPT

Studying Anomalies

8:00AM-8:30AM 6:00AM-6:30AM 4:00AM-4:30AM

Sunday, May 1st 2011

Page 22: Juliana Freire PPT

Studying Anomalies

8:00AM-8:30AM 9:30AM-10:00AM Sunday, May 1st 2011

Five Borough Bike Tour

Interpretation

Page 23: Juliana Freire PPT

Studying Anomalies

Sunday May 1st 2011

07:00AM-08:00AM

Page 24: Juliana Freire PPT

Studying Anomalies

Sunday May 1st 2011

08:00AM-10:00AM

Page 25: Juliana Freire PPT

Studying Anomalies

Sunday May 1st 2011

10:00AM-11:00AM

Page 26: Juliana Freire PPT

Studying Patterns

May 1st – May 7th 2011

3.6 Million Trips

Compare movement at the airports against the large train stations

Page 27: Juliana Freire PPT

Studying Patterns

May 1st – May 7th 2011

3.6 Million Trips

Train Stations Airports

Page 29: Juliana Freire PPT

Data exploration reveals bad data…

Page 30: Juliana Freire PPT

Uses of Clean Data: FindMeACab App

Page 31: Juliana Freire PPT

Take Away

  Data exploration is challenging for both small and big data
  It is hard to prepare data for exploration
  For many tasks, existing tools are too cumbersome, not scalable, etc.
  Need better, usable tools
–  Tools for data enthusiasts who are not computer scientists!
  Visualization is essential for exploring large volumes of data: “A picture is worth a thousand words”
  Pictures help us think [Tamara Munzner]
–  Substitute perception for cognition
–  Free up limited cognitive/memory resources for higher-level problems

Page 32: Juliana Freire PPT

Masters in Big Data

  New degree at NYU Poly – Spring 2014
  Courses:
–  Machine learning
–  Massive data analysis
–  Visualization
–  Visual Analytics
–  Database Systems
–  Algorithms
–  …

Page 33: Juliana Freire PPT

Thanks