Railroad Modeling at Hadoop Scale

© 2014 Silicon Valley Data Science LLC All Rights Reserved. svds.com @SVDataScience

Hadoop Summit

3 June 2014, San Jose, CA

John Akred (@BigDataAnalysis), Tatsiana Maskalevich (@notrockstar)



Why is a data science & engineering consulting company building its own Caltrain app?


As of 10AM Today


•  Commuter rail between San Francisco and San Mateo and Santa Clara counties

•  ~30 stations

•  118 passenger cars

•  60% >= 30 years old

•  2014 weekday ridership is 52,019 people

•  On-time performance is about 92%

•  No reliable real-time status information

•  API outage between April 5th and June 2nd


HOW DO WE KNOW IF THE TRAIN IS LATE?

•  Direct observation
   –  We can hear the train horn
   –  We can see the train when it goes by

•  Purpose-built systems
   –  We can use Caltrain APIs (when working)

•  Other signals
   –  We can check Twitter for delay info or rider comments



SVDS Approach


Ø  Take advantage of the available signals

Ø  Use historical data to make direct and latent observations more useful

Ø  Provide a service that gives users valuable planning and riding features

Ø  Don’t let the perfect be the enemy of the good



DATA RESILIENCY

Stovepipe: One-to-one relationship from data source to product

Hard Failure: If the data source is broken, so is the app.

Multi-sourced: Redundancy of overlapping data sources makes your products more resilient

Graceful Degradation: If a data source breaks, there is a backup and your app continues to function

Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.

Diagram labels: Products, Data Services, Data Sources, Broken Data Sources
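One way to picture graceful degradation is a data service that asks every overlapping source for an estimate and quietly skips the ones that are broken. The sketch below is only an illustration: the `Source` class, the weights, and the weighted average are assumptions standing in for the "probabilistic integration" the slide mentions, not the SVDS implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Source:
    """One data source plus a trust weight (both hypothetical)."""
    name: str
    read_delay_minutes: Callable[[], Optional[float]]  # None means "no answer"
    weight: float


def estimate_delay(sources: Sequence[Source]) -> Optional[float]:
    """Weighted average of the delay estimates from sources that still work."""
    total, weight_sum = 0.0, 0.0
    for source in sources:
        try:
            value = source.read_delay_minutes()
        except Exception:
            continue  # broken source: skip it instead of failing the product
        if value is None:
            continue
        total += source.weight * value
        weight_sum += source.weight
    # Only when every source is down does the service have nothing to say.
    return total / weight_sum if weight_sum > 0 else None


def broken_api() -> Optional[float]:
    """Simulates the stovepipe failure case: the API is unreachable."""
    raise ConnectionError("API outage")


sources = [
    Source("caltrain_api", broken_api, weight=3.0),
    Source("audio", lambda: 4.0, weight=2.0),
    Source("twitter", lambda: 7.0, weight=1.0),
]
```

With the API down, `estimate_delay(sources)` still returns an answer built from the audio and Twitter estimates. A stovepipe design would have failed at the first exception.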



Source Signals

Audio Image Text API

Variety Volume Velocity



Audio Capture and Ingest

•  Microphone connected to Raspberry Pi
   –  mic -> preamp -> analog-to-digital converter -> USB

•  PyAudio running on Raspberry Pi serializes audio as an array of 2-byte integers.

•  Sound data + metadata -> Flume on AWS via flumelogger

•  We use FFT + Decision Trees to detect the trains and classify them as express or local based on the whistle sound.

Diagram labels: Raspberry Pi, Raw Audio Agent
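The "array of 2-byte integers plus metadata" step could look roughly like the sketch below. The record layout, the field names, and the use of JSON with base64 are assumptions for illustration; the slides do not describe the actual format, and shipping the record to Flume is left out.

```python
import base64
import json
import struct


def pack_chunk(samples, sample_rate_hz, sensor_id, captured_at):
    """Serialize signed 16-bit samples plus metadata as one JSON record."""
    # "<h" is a little-endian signed 2-byte integer, one per audio sample.
    raw = struct.pack("<%dh" % len(samples), *samples)
    return json.dumps({
        "sensor_id": sensor_id,            # hypothetical metadata fields
        "captured_at": captured_at,
        "sample_rate_hz": sample_rate_hz,
        "sample_width_bytes": 2,
        "audio_b64": base64.b64encode(raw).decode("ascii"),
    })


def unpack_chunk(record):
    """Recover the metadata dict and the list of int16 samples from a record."""
    fields = json.loads(record)
    raw = base64.b64decode(fields.pop("audio_b64"))
    samples = list(struct.unpack("<%dh" % (len(raw) // 2), raw))
    return fields, samples
```

Fixing the byte order in the format string means the record can be decoded on a machine with a different native byte order from the Raspberry Pi.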



Image Capture and Ingest

•  wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS

•  Flume pushes the image data to our EC2 servers

•  We used OpenCV (Python) to detect trains in images

Diagram labels: Local Server, Raw Image Agent



Text Capture and Ingest: Twitter

•  Capturing all the tweets with keyword ‘Caltrain’ via Twitter API

•  Flume agent sends tweets to Apache Storm topology for processing

•  Tweets are parsed and written to HDFS and HBase

•  Event Detection is based on the baseline number of tweets per hour

Diagram labels: Twitter APIs, Twitter Agent



Caltrain API Data Capturing

•  Real-time departure times available via 511.org developer APIs

•  Python script collects data once a minute from 511.org APIs and stores it in HDFS as sequence files using the WebHDFS API

•  Python script collects data from the Caltrain site that includes run #

•  Didn’t function from April 5th until June 2nd 2014

Diagram labels: 511.org APIs, data_collector_api.py, Caltrain Webpage, scraper.py
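The once-a-minute collection loop can be sketched as below. This is not the contents of data_collector_api.py: the function names, the HDFS path layout, and the injected `fetch` and `store` callables are all hypothetical, and the sketch stores the raw payload rather than encoding a sequence file. The URL shape follows the WebHDFS REST convention, where a file CREATE is an HTTP PUT to the namenode that redirects to a datanode for the actual upload.

```python
import time
from urllib.parse import quote, urlencode


def webhdfs_create_url(namenode, hdfs_path, user, overwrite=False):
    """URL for step one of a WebHDFS file CREATE (an HTTP PUT to the namenode)."""
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": "true" if overwrite else "false",
    })
    return "http://%s/webhdfs/v1%s?%s" % (namenode, quote(hdfs_path), query)


def hdfs_path_for(timestamp, root="/data/caltrain/departures"):
    """Hypothetical layout: one file per poll, grouped by UTC date."""
    t = time.gmtime(timestamp)
    return "%s/%s/%s.raw" % (
        root, time.strftime("%Y-%m-%d", t), time.strftime("%H%M%S", t))


def collect(fetch, store, interval_seconds=60, cycles=None,
            sleep=time.sleep, clock=time.time):
    """Call fetch() every interval and hand the payload to store(path, payload)."""
    done = 0
    while cycles is None or done < cycles:
        started = clock()
        try:
            store(hdfs_path_for(started), fetch())
        except Exception as error:
            # One failed poll should not stop the collector.
            print("poll failed, retrying next cycle:", error)
        done += 1
        # Sleep only for what is left of the interval after the work is done.
        sleep(max(0.0, interval_seconds - (clock() - started)))
```

Passing `fetch` and `store` in as arguments keeps the loop independent of any particular HTTP client. It also means an upstream outage only produces logged failures instead of stopping the collector.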



Combining the Signals

Diagram labels: Audio Signal Detection, Image Recognition, Text Analysis, STATE of complex system



Architecture diagram labels:

•  External Data Sources: Twitter API, Microphone on Raspberry Pi, Web Camera, Caltrain API

•  Data Platform: Twitter Agent, Sound Agent, Image Agent, Caltrain Agent, Twitter Spout, Sound Spout, Image Spout, Caltrain Spout, Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector, Alerts, HDFS Writer, HBase Writer, Event Storage, MapReduce, Analytics Dev, Transmit to APP



Data Science: Audio

Batch:

•  Apply FFT to audio data to identify train based on train whistle’s fundamental frequencies.

•  Decision tree trained to classify trains into local or express based on minimum and maximum fundamental frequencies (Doppler effect)

Real-Time:

•  Apply FFT to audio signal

•  Extract min and max fundamental frequencies

•  Execute local / express classifier

•  Send data to the Event Detector for APP alerts

•  Store results in HBase

Figure: Histogram of Whistle Frequencies Over a Period of Time (x-axis: Frequency, Hz; y-axis: Frequency Counts)
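The FFT and min/max feature extraction can be sketched with the standard library alone. Everything here is an assumption made for illustration: the frame size, the function names, and the use of the strongest spectral bin as a stand-in for the fundamental (a real whistle may have a stronger harmonic). `doppler_speed` uses the textbook relation for a source passing a stationary listener, f_max / f_min = (c + v) / (c - v), which is one reading of why the slide ties the min and max frequencies to the Doppler effect.

```python
import cmath


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(frame, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC bin in one frame."""
    spectrum = fft(frame)
    half = len(frame) // 2  # bins above this mirror the lower half
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate_hz / len(frame)


def min_max_frequency(samples, sample_rate_hz, frame_size=1024):
    """Smallest and largest per-frame dominant frequency across a recording."""
    peaks = [
        dominant_frequency(samples[i:i + frame_size], sample_rate_hz)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return min(peaks), max(peaks)


def doppler_speed(f_min, f_max, speed_of_sound=343.0):
    """Source speed (m/s) implied by approaching (f_max) and receding (f_min) pitch."""
    return speed_of_sound * (f_max - f_min) / (f_max + f_min)
```

The pair returned by `min_max_frequency` is the kind of feature a local/express classifier could be trained on. The speed estimate is included only to show what the spread between the two frequencies encodes.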



Data Science: Image

Real-Time:

•  ORB algorithm (OpenCV) is used to detect the train in the image

•  Sends results to the Event Detector to identify train and compare to schedule

•  Event Detector updates APP with the train’s status, alerts if late

Figure: Number of Key-Points That Are the Same in Two Consecutive Images (x-axis: Time (Sec); y-axis: Number of Matching Points)
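ORB produces binary descriptors that are compared by Hamming distance, so the "number of matching points" between two consecutive images can be sketched without OpenCV. In a real pipeline the descriptors would come from OpenCV's ORB detector. The match threshold, the cross-check, and the `changed` rule below are assumptions: the slides do not say how the match count is turned into a detection, so the rule only flags a count that departs from the recent average in either direction.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def count_matches(previous, current, max_distance=40):
    """Count descriptors in `current` that are mutual nearest neighbours with a
    descriptor in `previous` and lie within max_distance bits of it."""
    if not previous or not current:
        return 0
    matches = 0
    for j, desc in enumerate(current):
        # Nearest descriptor in the previous image...
        i = min(range(len(previous)), key=lambda p: hamming(previous[p], desc))
        # ...must also pick this descriptor as its own nearest (cross-check).
        back = min(range(len(current)),
                   key=lambda c: hamming(previous[i], current[c]))
        if back == j and hamming(previous[i], desc) <= max_distance:
            matches += 1
    return matches


def changed(match_counts, window=5, tolerance=0.5):
    """True if the latest count differs from the mean of the previous `window`
    counts by more than `tolerance` (as a fraction of that mean)."""
    if len(match_counts) <= window:
        return False  # not enough history yet
    history = match_counts[-window - 1:-1]
    mean = sum(history) / window
    if mean == 0:
        return match_counts[-1] != 0
    return abs(match_counts[-1] - mean) > tolerance * mean
```

The brute-force search is quadratic in the number of key-points. That is acceptable for a sketch, but a library matcher would be the natural choice at 2-3 frames per second.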



Data Science: Text

Batch:

•  Update baseline tweet frequencies for each hour as additional historical data collected

•  Store model parameters in HBase

Real-Time:

•  Count tweets as they stream through topology

•  Alert based on frequency deviations from the baseline
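A minimal sketch of the batch and real-time halves is shown below. The slides do not give the alert rule, so this assumes a per-hour mean and standard deviation and flags a count that exceeds the mean by a set number of deviations. The function names, the threshold of 3, and the `min_std` floor are all illustrative choices.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """Batch step: history is an iterable of (hour_of_day, tweet_count) pairs.

    Returns {hour: (mean_count, std_count)}, the model parameters to store.
    """
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {hour: (mean(counts), pstdev(counts))
            for hour, counts in by_hour.items()}


def is_event(baseline, hour, count, threshold=3.0, min_std=1.0):
    """Real-time step: True if the count is far above that hour's baseline."""
    if hour not in baseline:
        return False  # no history for this hour yet, so nothing to compare to
    avg, std = baseline[hour]
    # min_std keeps a perfectly flat history from alerting on one extra tweet.
    return count > avg + threshold * max(std, min_std)
```

Keeping a separate baseline per hour means rush-hour chatter is compared with other rush hours, not with a daily average that a normal commute would already exceed.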



Baseline Calculation (chart legend: Baseline)



Future Work

•  Detect direction of train in image processing

•  Use natural language processing on Twitter data for the Event Detector

•  Continue evaluation of analytical frameworks for model computation

•  Add observation posts

•  Release Caltrain Rider Application


COMING SOON: CALTRAIN RIDER APP

•  Find out what train to catch using our ‘Ride Now’ view

•  Select a train, see when that train should be reaching each stop in a trip detail view.

•  For more info: www.svds.com/trains



Questions?

Yes, We’re Hiring www.svds.com/join-us


THANK YOU

John @BigDataAnalysis

Tatsiana @notrockstar
