MapMyCab

MapMyCab Preetika Kulshrestha


Motivation

• Tool for data scientists and cab dispatchers to analyze (by time of day or day of week):

• cab occupancy

• miles travelled

• pickups and drop-offs

• An app for city dwellers to view real-time cab status for unoccupied cabs in a given area

Demo

Pipeline

[Pipeline diagram. Components: Script, Message Broker, Real-Time Streaming, HDFS, MrJob, HBase, UI]

Data Flow

Input (raw record): CabID, Lat, Long, Occ, Timestamp

MrJob

Output (daily aggregate): yyyy_m_day, AvgOcc, Pickups, Drops, Miles
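A minimal sketch of the aggregation this data flow describes, written in plain Python rather than as an actual MrJob job. The rules used here are assumptions, not taken from the slides: a pickup is counted when a cab's occupancy flips from 0 to 1, a drop-off when it flips from 1 to 0, miles are the haversine distance between consecutive GPS points of the same cab, and the key is built as year_month_day in UTC. The names `haversine_miles` and `daily_metrics` are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime, timezone

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def daily_metrics(records):
    """Aggregate (cab_id, lat, lon, occ, unix_ts) records into per-day metrics.

    Assumed rules: pickup = occupancy 0 -> 1, drop-off = 1 -> 0,
    miles = distance between consecutive points of the same cab,
    credited to the day of the later point.
    """
    by_cab = defaultdict(list)
    for cab_id, lat, lon, occ, ts in records:
        by_cab[cab_id].append((ts, lat, lon, occ))

    totals = defaultdict(
        lambda: {"occ_sum": 0, "n": 0, "pickups": 0, "drops": 0, "miles": 0.0}
    )
    for points in by_cab.values():
        points.sort()  # order each cab's points by timestamp
        prev = None
        for ts, lat, lon, occ in points:
            day = datetime.fromtimestamp(ts, tz=timezone.utc)
            t = totals[f"{day.year}_{day.month}_{day.day}"]
            t["occ_sum"] += occ
            t["n"] += 1
            if prev is not None:
                _, plat, plon, pocc = prev
                t["miles"] += haversine_miles(plat, plon, lat, lon)
                if pocc == 0 and occ == 1:
                    t["pickups"] += 1
                elif pocc == 1 and occ == 0:
                    t["drops"] += 1
            prev = (ts, lat, lon, occ)

    return {
        key: {
            "avg_occ": t["occ_sum"] / t["n"],
            "pickups": t["pickups"],
            "drops": t["drops"],
            "miles": t["miles"],
        }
        for key, t in totals.items()
    }
```

In a MapReduce setting the grouping by cab would typically be the map/shuffle step and the per-day accumulation the reduce step; the in-memory dictionaries above only stand in for that.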

Tables

• Hourly data organized by Day of Week

• Aggregate metrics stored in the same table for fast retrieval

Row key (y_m_dow) | c:0 | c:1 | c:2 | c:3 | c:4 | … | c:23 | c:Totals

Day of Week | Hour 0 attributes | hr 1 | hr 2 | hr 3 | hr 4 | … | hr 23 | ..

2008_01_Mon | <pickups, dropoffs, avg_occ, avg_dist> | .. | .. | .. | .. | .. | .. | <sum(pickups), sum(dropoffs), avg(occ), avg(dist)>

2008_01_Tue | <pickups, dropoffs, avg_occ, avg_dist> | .. | .. | .. | .. | .. | .. | <sum(pickups), sum(dropoffs), avg(occ), avg(dist)>

2008_01_Wed | <pickups, dropoffs, avg_occ, avg_dist> | .. | .. | .. | .. | .. | .. | <sum(pickups), sum(dropoffs), avg(occ), avg(dist)>

Hourly Aggregates by Day of Week
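The row key and column layout above can be sketched as follows. This is an illustration under assumptions, not the project's code: the helper names (`row_key`, `hour_column`, `add_counts`) are hypothetical, a nested dictionary stands in for the HBase table, and only the summed metrics (pickups, dropoffs) are accumulated; the averages are left out for brevity.

```python
from datetime import datetime

# Fixed English abbreviations so the key does not depend on the system locale.
DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def row_key(ts: datetime) -> str:
    """Build a y_m_dow row key such as '2008_01_Mon'."""
    return f"{ts.year}_{ts.month:02d}_{DAY_NAMES[ts.weekday()]}"


def hour_column(ts: datetime) -> str:
    """Build the hourly column qualifier, 'c:0' through 'c:23'."""
    return f"c:{ts.hour}"


def add_counts(table: dict, ts: datetime, pickups: int, dropoffs: int) -> None:
    """Add counts to the hourly cell and to the c:Totals cell of the same row."""
    row = table.setdefault(row_key(ts), {})
    for column in (hour_column(ts), "c:Totals"):
        cell = row.setdefault(column, {"pickups": 0, "dropoffs": 0})
        cell["pickups"] += pickups
        cell["dropoffs"] += dropoffs
```

Keeping `c:Totals` in the same row means a single row read returns both the hourly breakdown and the aggregate, which is the "fast retrieval" point made above.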

API and Lessons Learned

• Need to safeguard against corrupt data

• Workflow is very important when connecting different tools
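One way to guard against corrupt data is to validate each incoming message before it enters the pipeline. The sketch below assumes a comma-separated line with fields in the order uid, lat, long, timestamp, occ, a Unix timestamp in seconds, and occupancy encoded as 0 or 1; the wire format, the `CabRecord` type, and `parse_record` are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class CabRecord(NamedTuple):
    cab_id: str
    lat: float
    lon: float
    timestamp: int
    occupied: bool


def parse_record(line: str) -> Optional[CabRecord]:
    """Parse 'uid, lat, long, timestamp, occ'; return None for corrupt input."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 5:
        return None  # wrong number of fields
    cab_id, lat_s, lon_s, ts_s, occ_s = parts
    try:
        lat, lon, ts = float(lat_s), float(lon_s), int(ts_s)
    except ValueError:
        return None  # non-numeric coordinate or timestamp
    if not cab_id or occ_s not in ("0", "1"):
        return None  # missing ID or unknown occupancy flag
    # Range checks also reject NaN and infinity, since those comparisons fail.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return None
    if ts <= 0:
        return None
    return CabRecord(cab_id, lat, lon, ts, occ_s == "1")
```

Returning `None` instead of raising lets a streaming consumer count and skip bad messages without stopping.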

About Me

• Previous Life - Senior Energy Analyst (EnerNOC Inc.).

• M.S. Electrical Engineering - North Carolina State University (focus on robotics, control systems and smart grid).

• https://github.com/PreetikaKuls


Pipeline

[Pipeline diagram. Components: Python Script, Message Broker, Real-Time Streaming, HDFS, MrJob, Hive, HBase, UI]

Record formats shown on the diagram:

• uid, lat, long, timestamp, occ

• y_m_dow_h, pickups, drops, dist, occ

• y_m_dow, hour(pickups, drops, dist, occ)

Data

Item: SF Cabs

Description: GPS coordinates of approx. 500 SF cabs collected over 30 days

Format: [latitude (float), longitude (float), occupancy (boolean), time (timestamp)]

Size: ~500 MB

Throughput: 50-100 messages/sec (500 cabs, 5-10 min granularity)

Master Data Set

Two layouts of the same records:

• Row key: Time. Columns: CabID. Cell value: Lat | Long | Occupancy

• Row key: CabID. Columns: Timestamp. Cell value: Lat | Long | Occupancy

Queries these layouts serve:

• Retrieve all data for a given time frame where latitude and longitude fall within a specific range

• Analyze data based on timestamp
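The time-frame query can be sketched as a range scan over time-ordered records followed by a latitude/longitude filter. This is only an in-memory illustration: a sorted Python list stands in for a table whose rows are ordered by a time-based key, and the tuple layout (timestamp, cab_id, lat, lon, occ) and the name `query_window` are assumptions.

```python
import bisect


def query_window(records_sorted, t_start, t_end, bbox):
    """Return records with t_start <= timestamp <= t_end inside a bounding box.

    records_sorted: list of (timestamp, cab_id, lat, lon, occ), sorted by timestamp.
    bbox: (lat_min, lat_max, lon_min, lon_max), all bounds inclusive.
    """
    timestamps = [r[0] for r in records_sorted]
    # Binary search finds the slice for the time frame without a full scan.
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_right(timestamps, t_end)
    lat_min, lat_max, lon_min, lon_max = bbox
    return [
        r
        for r in records_sorted[lo:hi]
        if lat_min <= r[2] <= lat_max and lon_min <= r[3] <= lon_max
    ]
```

Ordering by time narrows the candidate rows first; the position filter then runs only on that slice.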

Batch Processing Result

Features and Example Queries

Features

• A system that uses crowdsourcing to automatically generate parking spot information for streets

• Parking information overlaid on Google Maps

Queries

• Does West Middlefield Road allow for street parking?

• Can I park on this street for more than 2 hours?

• Which nearby streets might have better parking availability?