Railroad Modeling at Hadoop Scale

Railroad Modeling at Hadoop Scale Hadoop Summit 3 June 2014, San Jose, CA John Akred (@BigDataAnalysis), Tatsiana Maskalevich (@notrockstar) www.svds.com @SVDataScience

description

Hadoop is a very flexible platform for storing disparate types of data, which is the variety dimension of the famous "3 Vs". We discuss how we use Hadoop to combine audio, video, and social media data sources to analyze Caltrain activity and provide real-time insight into variance from the regular schedule. This is an instance of the general problem of combining disparate data sources to reason about the current operational state of a business system. We cover how we store raw sensor and social media data in HDFS, then use various processing frameworks to refine that data and store it in HBase. We explain specific examples of using Hive, Flume, Python, SerDe, and Avro to take data from various inputs and perform the transformations needed to make the data suitable for analysis. Finally, we discuss how we develop and integrate the analytical components using Python’s NumPy, Scikit-learn, OpenCV, and Pandas libraries. The goal of the analyses is to recognize train sounds in audio streams, detect trains in video streams, and combine that with data from social media. Ultimately, we aim to determine what train is where, and how it is running relative to the schedule.

Transcript of Railroad Modeling at Hadoop Scale

Page 1: Railroad Modeling at Hadoop Scale

© 2014 Silicon Valley Data Science LLC All Rights Reserved. svds.com @SVDataScience

Railroad Modeling at Hadoop Scale

Hadoop Summit

3 June 2014, San Jose, CA

John Akred (@BigDataAnalysis), Tatsiana Maskalevich (@notrockstar) www.svds.com @SVDataScience

Page 2: Railroad Modeling at Hadoop Scale


Why is a data science & engineering consulting company building its own Caltrain app?  

Page 3: Railroad Modeling at Hadoop Scale


As of 10AM Today


Page 4: Railroad Modeling at Hadoop Scale


Page 5: Railroad Modeling at Hadoop Scale


•  Commuter rail between San Francisco and San Mateo and Santa Clara counties, ~30 stations

•  118 passenger cars

•  60% >=30 years old

•  2014 weekday ridership is 52,019 people

•  On-time performance is about 92%

•  No reliable real-time status information

•  API outage between April 5th and June 2nd

Page 6: Railroad Modeling at Hadoop Scale


HOW DO WE KNOW IF THE TRAIN IS LATE?

•  Direct observation

–  We can hear the train horn

–  We can see the train when it goes by

•  Purpose-built systems

–  We can use Caltrain APIs (when working)

•  Other signals

–  We can check Twitter for delay info or rider comments


Page 7: Railroad Modeling at Hadoop Scale


SVDS Approach


•  Take advantage of the available signals

•  Use historical data to make direct and latent observations more useful

•  Provide a service that gives users valuable planning and riding features

•  Don’t let the perfect be the enemy of the good

Page 8: Railroad Modeling at Hadoop Scale


Stovepipe: One-to-one relationship from data source to product

Hard Failure: If the data source is broken, so is the app.

Multi-sourced: Redundancy of overlapping data sources makes your products more resilient

Graceful Degradation: If a data source breaks, there is a backup and your app continues to function

Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh:

DATA RESILIENCY

[Diagram labels: Products, Data Sources, Broken Data Sources, Data Services]
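The graceful-degradation idea on this slide can be illustrated with a minimal sketch. It is not the SVDS implementation: the `Estimate` type, the per-source `confidence` weight, and the weighted-mean rule are all assumptions chosen for illustration. The point is only that a data service can keep answering when one of several overlapping sources is broken.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Estimate:
    source: str
    delay_minutes: float
    confidence: float  # hypothetical per-source weight, 0.0 to 1.0


# A source is a callable that returns an Estimate, or None when it has no data.
Source = Callable[[], Optional[Estimate]]


def combined_delay(sources: Sequence[Source]) -> Optional[float]:
    """Confidence-weighted mean delay over the sources that responded."""
    estimates = []
    for read in sources:
        try:
            est = read()
        except Exception:
            est = None  # a broken source must not take the service down
        if est is not None and est.confidence > 0:
            estimates.append(est)
    if not estimates:
        return None  # every source is down: nothing to report
    total = sum(e.confidence for e in estimates)
    return sum(e.delay_minutes * e.confidence for e in estimates) / total


def broken_api() -> Optional[Estimate]:
    raise ConnectionError("API outage")


def audio() -> Optional[Estimate]:
    return Estimate("audio", 6.0, 0.5)


def twitter() -> Optional[Estimate]:
    return Estimate("twitter", 10.0, 0.5)


print(combined_delay([broken_api, audio, twitter]))
```

With the API source failing, the service still returns an estimate from the two remaining signals; it returns `None` only when every source is unavailable.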

Page 9: Railroad Modeling at Hadoop Scale


Source Signals

Audio Image Text API

Variety Volume Velocity

Page 10: Railroad Modeling at Hadoop Scale


Audio Capture and Ingest

•  Microphone connected to Raspberry Pi: mic -> preamp -> analog-to-digital converter -> USB

•  PyAudio running on Raspberry Pi serializes audio as an array of 2-byte integers.

•  Sound data + metadata -> Flume on AWS via flumelogger

•  We use FFT + Decision Trees to detect and classify the trains into express and local based on the whistle sound.

[Diagram labels: Raspberry Pi, Raw Audio Agent (x2)]
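As a rough illustration of "an array of 2-byte integers plus metadata", the sketch below packs signed 16-bit samples with a small JSON header. The payload layout, the field names, and the `pack_audio_event` / `unpack_audio_event` helpers are hypothetical; the slides do not show the actual record format, and the PyAudio capture and flumelogger calls are deliberately left out.

```python
import json
import struct
import time


def pack_audio_event(samples, sample_rate, sensor_id):
    """Pack int16 samples plus metadata into one payload: JSON header, newline, binary body."""
    body = struct.pack("<%dh" % len(samples), *samples)  # little-endian 2-byte ints
    header = json.dumps({
        "sensor_id": sensor_id,        # hypothetical field names
        "sample_rate": sample_rate,
        "n_samples": len(samples),
        "captured_at": time.time(),
    }).encode("utf-8")
    return header + b"\n" + body


def unpack_audio_event(payload):
    """Inverse of pack_audio_event: returns (metadata dict, list of samples)."""
    header, body = payload.split(b"\n", 1)  # the JSON header contains no newline
    meta = json.loads(header.decode("utf-8"))
    samples = list(struct.unpack("<%dh" % meta["n_samples"], body))
    return meta, samples


payload = pack_audio_event([0, 1000, -1000, 32767, -32768], 44100, "pi-1")
meta, samples = unpack_audio_event(payload)
print(meta["n_samples"], samples)
```

Carrying the sample rate and sample count with each record means the downstream FFT step does not need out-of-band configuration to interpret the bytes.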

Page 11: Railroad Modeling at Hadoop Scale


Image Capture and Ingest

•  wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS

•  Flume pushes the image data to our EC2 servers

•  We used OpenCV (Python) to detect trains in images

[Diagram labels: Raw Image Agent (x2), Local Server]

Page 12: Railroad Modeling at Hadoop Scale


Text Capture and Ingest: Twitter

•  Capturing all the tweets with keyword ‘Caltrain’ via Twitter API

•  Flume agent sends tweets to Apache Storm topology for processing

•  Tweets are parsed and written to HDFS and HBase

•  Event Detection is based on the baseline number of tweets per hour

[Diagram labels: Twitter APIs, Twitter Agent]

Page 13: Railroad Modeling at Hadoop Scale


Caltrain API Data Capturing

•  Real-time departure times available via 511.org developer APIs

•  Python script collects data once a minute from 511.org APIs and stores it in HDFS as sequence files using the WebHDFS API.

•  Python script collects data from the Caltrain site that includes run #

•  The API didn’t function from April 5th until June 2nd, 2014

[Diagram labels: 511.Org APIs, Caltrain Webpage, data_collector_api.py, scraper.py]
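For readers unfamiliar with WebHDFS, here is a small sketch of how a once-a-minute collector might address its output files. WebHDFS file creation is a two-step REST call: a PUT with `op=CREATE` to the NameNode returns a redirect to a DataNode, and the data is then PUT to that location. The host name, port, directory, user, and file-naming scheme below are assumptions for illustration; the slides do not show the contents of `data_collector_api.py`.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(host, port, hdfs_path, user, overwrite=False):
    """Build the NameNode URL for the first step of a WebHDFS file CREATE."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, quote(hdfs_path), query)


def minute_path(base_dir, epoch_seconds):
    """One file per polling minute, named by the start of that minute (hypothetical scheme)."""
    minute = int(epoch_seconds) // 60 * 60
    return "%s/departures-%d" % (base_dir.rstrip("/"), minute)


path = minute_path("/data/511/", 1401800430)
print(webhdfs_create_url("namenode.example", 50070, path, "collector"))
```

Truncating the timestamp to the minute keeps a retried poll within the same minute pointed at the same file name.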

Page 14: Railroad Modeling at Hadoop Scale


Combining the Signals

[Diagram labels: Audio Signal Detection, Image Recognition, Text Analysis, STATE of complex system]

Page 15: Railroad Modeling at Hadoop Scale


[Architecture diagram]

External Data Sources: Twitter API, Microphone on Raspberry Pi, Web Camera, Caltrain API

Data Platform: Twitter Agent, Sound Agent, Image Agent, Caltrain Agent, Twitter Spout, Sound Spout, Image Spout, Caltrain Spout, Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector, HDFS Writer, HBase Writer, Event Storage, MapReduce, Analytics Dev, Alerts, Transmit to APP

Page 16: Railroad Modeling at Hadoop Scale


Data Science: Audio

Batch:

•  Apply FFT to audio data to identify train based on train whistle’s fundamental frequencies.

•  Decision tree trained to classify trains into local or express based on minimum and maximum fundamental frequencies (Doppler effect)

Real-Time:

•  Apply FFT to audio signal

•  Extract min and max fundamental frequencies

•  Execute local / express classifier

•  Send data to the Event Detector for APP alerts

•  Store results in HBase

[Figure: Histogram of Whistle Frequencies Over a Period of Time. x-axis: Frequency, Hz; y-axis: Frequency Counts]
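The audio steps above (FFT, minimum and maximum frequency, local/express decision) can be sketched in plain Python as follows. Several things here are assumptions rather than the presenters' method: the strongest spectral bin stands in for the whistle's fundamental frequency, the single threshold in `classify` stands in for the trained decision tree, and the 20 Hz value is made up. The reasoning behind the split is that a train passing at speed produces a larger Doppler shift between approach and departure than one slowing to stop. In practice a library FFT such as NumPy's would replace the hand-written one.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin in one frame."""
    spectrum = fft(samples)
    half = len(samples) // 2
    peak = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak * sample_rate / len(samples)


def min_max_frequency(frames, sample_rate):
    """Lowest and highest dominant frequency seen across a sequence of frames."""
    peaks = [dominant_frequency(f, sample_rate) for f in frames]
    return min(peaks), max(peaks)


def classify(f_min, f_max, spread_threshold_hz=20.0):
    """One-split stand-in for the decision tree (threshold value is hypothetical)."""
    return "express" if f_max - f_min > spread_threshold_hz else "local"


def tone(freq, sample_rate, n):
    """Synthetic sine wave used in place of real whistle audio."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


frames = [tone(328.125, 8000, 1024), tone(296.875, 8000, 1024)]
f_min, f_max = min_max_frequency(frames, 8000)
print(f_min, f_max, classify(f_min, f_max))
```

The synthetic frames imitate a whistle heard at a higher pitch on approach and a lower pitch on departure.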

Page 17: Railroad Modeling at Hadoop Scale


Data Science: Image

Real-Time:

•  ORB algorithm (OpenCV) is used to detect the train in the image

•  Sends results to the Event Detector to identify train and compare to schedule

•  Event Detector updates APP with the train’s status, alerts if late

[Figure: Number of Key-Points That Are the Same in Two Consecutive Images. x-axis: Time (Sec); y-axis: Number of Matching Points]
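ORB describes each key-point with a binary descriptor (256 bits by default in OpenCV), and binary descriptors are usually compared by Hamming distance. The sketch below shows how a "number of matching points between two consecutive images" could be computed from two sets of descriptors. It assumes descriptors are already extracted and stored as Python integers, uses brute-force nearest-neighbour matching, and picks an arbitrary `max_distance`; the slides do not say how matches were counted or thresholded.

```python
def hamming(a, b):
    """Number of differing bits between two descriptors stored as ints."""
    return bin(a ^ b).count("1")


def count_matches(prev_desc, curr_desc, max_distance=40):
    """Count descriptors in prev_desc whose nearest neighbour in curr_desc is within max_distance bits."""
    matches = 0
    for d in prev_desc:
        best = min((hamming(d, c) for c in curr_desc), default=None)
        if best is not None and best <= max_distance:
            matches += 1
    return matches


# Toy 16-bit descriptors standing in for real 256-bit ORB descriptors.
previous_frame = [0b1111, 0b0000, 0xFF00]
current_frame = [0b1110, 0x00FF]
print(count_matches(previous_frame, current_frame, max_distance=1))
```

Tracking this count over time gives a series like the one plotted on the slide, which a downstream detector can then examine for changes as a train enters or leaves the frame.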

Page 18: Railroad Modeling at Hadoop Scale


Data Science: Text

Batch:

•  Update baseline tweet frequencies for each hour as additional historical data is collected

•  Store model parameters in HBase

Real-Time:

•  Count tweets as they stream through topology

•  Alert based on frequency deviations from the baseline
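A minimal sketch of the baseline-and-deviation idea follows. The slides say only that alerts are based on deviations from an hourly baseline; the specific rule used here (mean plus a number of standard deviations, with a floor on the standard deviation) and the parameter values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """Batch step: history is an iterable of (hour_of_day, tweet_count) pairs from past days."""
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {h: (mean(c), pstdev(c)) for h, c in by_hour.items()}


def is_event(baseline, hour, count, n_sigmas=3.0, min_sigma=1.0):
    """Real-time step: True when the count exceeds the hour's mean by n_sigmas deviations."""
    if hour not in baseline:
        return False  # no history for this hour yet
    mu, sigma = baseline[hour]
    return count > mu + n_sigmas * max(sigma, min_sigma)


history = [(8, 10), (8, 12), (8, 14), (17, 20), (17, 20)]
baseline = hourly_baseline(history)
print(is_event(baseline, 8, 40), is_event(baseline, 8, 13))
```

The `min_sigma` floor keeps an hour with identical historical counts from alerting on a single extra tweet.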

Page 19: Railroad Modeling at Hadoop Scale


Baseline Calculation

[Figure label: Baseline]

Page 20: Railroad Modeling at Hadoop Scale


Future Work

•  Detect direction of train in image processing

•  Use natural language processing on Twitter data for the event detector

•  Continue evaluation of analytical frameworks for model computation

•  Add observation posts

•  Release Caltrain Rider Application

Page 21: Railroad Modeling at Hadoop Scale


COMING SOON: CALTRAIN RIDER APP

•  Find out what train to catch using our ‘Ride Now’ view

•  Select a train, see when that train should be reaching each stop in a trip detail view.

•  For more info: www.svds.com/trains


Page 22: Railroad Modeling at Hadoop Scale


Questions?

Yes, We’re Hiring www.svds.com/join-us

Page 23: Railroad Modeling at Hadoop Scale


THANK YOU John @BigDataAnalysis Tatsiana @notrockstar
