Railroad Modeling at Hadoop Scale
© 2014 Silicon Valley Data Science LLC All Rights Reserved. svds.com @SVDataScience
Railroad Modeling at Hadoop Scale
Hadoop Summit
3 June 2014, San Jose, CA
John Akred (@BigDataAnalysis), Tatsiana Maskalevich (@notrockstar) www.svds.com @SVDataScience
Why is a data science & engineering consulting company building its own Caltrain app?
As of 10AM Today
• Commuter rail between San Francisco and San Mateo and Santa Clara counties, ~30 stations
• 118 passenger cars
• 60% >=30 years old
• 2014 weekday ridership is 52,019 people
• On-time performance is about 92%
• No reliable real-time status information
• API outage between April 5th and June 2nd
HOW DO WE KNOW IF THE TRAIN IS LATE?
• Direct observation
  – We can hear the train horn
  – We can see the train when it goes by
• Purpose-built systems
  – We can use Caltrain APIs (when working)
• Other signals
  – We can check Twitter for delay info or rider comments
SVDS Approach
• Take advantage of the available signals
• Use historical data to make direct and latent observations more useful
• Provide a service that gives users valuable planning and riding features
• Don’t let the perfect be the enemy of the good
DATA RESILIENCY
Stovepipe: One-to-one relationship from data source to product
Hard Failure: If the data source is broken, so is the app.
Multi-sourced: Redundancy of overlapping data sources makes your products more resilient
Graceful Degradation: If a data source breaks, there is a backup and your app continues to function
Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh.
[Diagram labels: Products, Data Services, Data Sources, Broken Data Sources]
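A minimal sketch of graceful degradation across overlapping sources. The slides do not say how the Data Mesh weighs its inputs; the confidence-weighted average, the function name `estimate_delay`, and the reader callables are assumptions made for illustration only.

```python
def estimate_delay(sources):
    """Combine delay estimates from overlapping sources.

    `sources` maps a source name to a zero-argument callable that returns
    (delay_minutes, confidence) or raises if the source is broken.
    The confidence-weighted average is an illustrative assumption, not
    the method described in the talk.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    used = []
    for name, read in sources.items():
        try:
            delay_minutes, confidence = read()
        except Exception:
            # Broken source: skip it and keep serving from the others.
            continue
        weighted_sum += delay_minutes * confidence
        total_weight += confidence
        used.append(name)
    if total_weight == 0:
        return None, used  # every source failed: nothing to report
    return weighted_sum / total_weight, used


def broken_api():
    """Stand-in for a data source that is currently down."""
    raise ConnectionError("API outage")


if __name__ == "__main__":
    readers = {
        "caltrain_api": broken_api,
        "audio": lambda: (6.0, 0.5),
        "twitter": lambda: (10.0, 0.5),
    }
    print(estimate_delay(readers))
```

With the API reader failing, the estimate is still produced from the audio and Twitter readers, which is the behavior the "Graceful Degradation" line describes.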
Source Signals
Audio Image Text API
Variety Volume Velocity
Audio Capture and Ingest
• Microphone connected to Raspberry Pi: mic -> preamp -> analog-to-digital converter -> USB
• PyAudio running on the Raspberry Pi serializes audio as an array of 2-byte integers
• Sound data + metadata -> Flume on AWS via flumelogger
• We use FFT + decision trees to detect trains and classify them as express or local based on the whistle sound
[Diagram labels: Raspberry Pi, Raw Audio Agent, Raw Audio Agent]
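The "array of 2-byte integers plus metadata" step can be sketched with the standard library alone. This is an assumption-laden illustration: the field names, the JSON envelope, and the base64 encoding are hypothetical, and the sketch does not call PyAudio or flumelogger. In a real capture the samples would come from a 16-bit audio stream.

```python
import array
import base64
import json
import time


def encode_audio_event(samples, sample_rate, sensor_id, captured_at=None):
    """Pack 16-bit samples and metadata into one JSON event body.

    The envelope layout (field names, base64 payload) is hypothetical.
    Samples are stored in the machine's native byte order.
    """
    pcm = array.array("h", samples)  # "h" = signed 2-byte integers
    return json.dumps({
        "sensor_id": sensor_id,
        "captured_at": time.time() if captured_at is None else captured_at,
        "sample_rate": sample_rate,
        "sample_width_bytes": pcm.itemsize,
        "pcm_base64": base64.b64encode(pcm.tobytes()).decode("ascii"),
    })


def decode_audio_event(event):
    """Inverse of encode_audio_event; adds a "samples" list to the record."""
    record = json.loads(event)
    pcm = array.array("h")
    pcm.frombytes(base64.b64decode(record["pcm_base64"]))
    record["samples"] = pcm.tolist()
    return record


if __name__ == "__main__":
    body = encode_audio_event([0, 1, -1, 32767, -32768], 44100, "mic-1")
    print(decode_audio_event(body)["samples"])
```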
Image Capture and Ingest
• wget pulls images from the camera’s built-in server 2-3 times a second and saves them on a local server/NAS
• Flume pushes the image data to our EC2 servers
• We used OpenCV (Python) to detect trains in images
[Diagram labels: Local Server, Raw Image Agent, Raw Image Agent]
Text Capture and Ingest: Twitter
• Capturing all the tweets with keyword ‘Caltrain’ via Twitter API
• Flume agent sends tweets to Apache Storm topology for processing
• Tweets are parsed and written to HDFS and HBase
• Event detection is based on the baseline number of tweets per hour
[Diagram labels: Twitter APIs, Twitter Agent]
Caltrain API Data Capturing
• Real-time departure times available via 511.org developer APIs
• Python script collects data once a minute from 511.org APIs and stores it in HDFS as sequence files using the WebHDFS API
• Python script collects data from the Caltrain site that includes run #
• Didn’t function from April 5th until June 2nd 2014
[Diagram labels: scraper.py, 511.Org APIs, Caltrain Webpage, data_collector_api.py]
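A rough sketch of the WebHDFS side of the once-a-minute collector. WebHDFS file creation is a two-step HTTP PUT: the first request goes to the NameNode with `op=CREATE`, which answers with a redirect to a DataNode that accepts the bytes. The helper names, host, user, and directory layout below are hypothetical, and the sketch only builds the request URL; it neither calls 511.org nor encodes sequence files.

```python
from datetime import datetime
from urllib.parse import quote, urlencode


def snapshot_path(base_dir, when):
    """One file per minute, partitioned by date (layout is an assumption)."""
    return f"{base_dir}/{when:%Y/%m/%d}/departures-{when:%H%M}.json"


def webhdfs_create_url(namenode, hdfs_path, user, overwrite=False):
    """URL for the first step of a WebHDFS CREATE (PUT to the NameNode).

    The NameNode responds with a redirect to a DataNode; the file
    contents are sent in a second PUT to that redirected location.
    """
    query = urlencode({
        "op": "CREATE",
        "user.name": user,
        "overwrite": str(overwrite).lower(),
    })
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{query}"


if __name__ == "__main__":
    # Hypothetical host, user, and base directory.
    path = snapshot_path("/data/caltrain", datetime(2014, 6, 3, 9, 15))
    print(webhdfs_create_url("namenode.example.com:50070", path, "etl"))
```

Partitioning by date keeps each directory small when a new file arrives every minute; that layout is a design choice of this sketch rather than something stated in the slides.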
Combining the Signals
[Diagram labels: Audio Signal Detection, Image Recognition, Text Analysis -> STATE of complex system]
[Architecture diagram; labels only, grouped by role]
External Data Sources: Microphone on Raspberry Pi, Web Camera, Caltrain API
Agents: Twitter Agent, Sound Agent, Image Agent, Caltrain Agent
Data Platform
Spouts: Twitter Spout, Sound Spout, Image Spout, Caltrain Spout
Processing: Tweet Parser, Tweets Counter, Sounds Classifier, Train Detector, Schedule Integrator, Event Detector
Writers: HDFS Writer, HBase Writer
Storage and batch: Event Storage, MapReduce, Analytics Dev
Outputs: Alerts, API, Transmit to APP
Data Science: Audio
Batch:
• Apply FFT to audio data to identify the train based on the train whistle’s fundamental frequencies
• Decision tree trained to classify trains as local or express based on minimum and maximum fundamental frequencies (Doppler effect)
Real-Time:
• Apply FFT to audio signal
• Extract min and max fundamental frequencies
• Execute local / express classifier
• Send data to the Event Detector for APP alerts
• Store results in HBase
[Figure: Histogram of Whistle Frequencies Over a Period of Time; x-axis: Frequency, Hz; y-axis: Frequency Counts]
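The audio steps above (FFT, min and max frequency, local / express decision) can be sketched in pure Python. Several simplifications are assumptions of this sketch, not the talk's method: the strongest spectral peak in each frame stands in for the fundamental frequency, a single hand-set threshold on the frequency spread stands in for the trained decision tree, and `spread_threshold_hz` is a made-up value. The intuition is that a faster pass produces a larger Doppler shift between approach and departure.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(frame, sample_rate):
    """Frequency (Hz) of the strongest non-DC bin in one frame."""
    spectrum = fft(frame)
    half = len(frame) // 2
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(frame)


def min_max_frequency(samples, sample_rate, frame_size=1024):
    """Lowest and highest per-frame peak frequency across the recording."""
    peaks = [
        dominant_frequency(samples[i:i + frame_size], sample_rate)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]
    return min(peaks), max(peaks)


def classify_train(min_hz, max_hz, spread_threshold_hz=30.0):
    """One-split stand-in for the decision tree (threshold is hypothetical)."""
    return "express" if (max_hz - min_hz) > spread_threshold_hz else "local"


if __name__ == "__main__":
    rate = 8000
    # Synthetic whistle: higher pitch while approaching, lower while receding.
    approach = [math.sin(2 * math.pi * 500.0 * i / rate) for i in range(2048)]
    recede = [math.sin(2 * math.pi * 437.5 * i / rate) for i in range(2048)]
    low, high = min_max_frequency(approach + recede, rate)
    print(low, high, classify_train(low, high))
```

A production pipeline would more likely use an optimized FFT library and a classifier trained on labeled passes; the pure-Python version is only meant to make the steps concrete.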
Data Science: Image
Real-Time:
• ORB algorithm (OpenCV) is used to detect the train in the image
• Sends results to the Event Detector to identify the train and compare to schedule
• Event Detector updates APP with the train’s status, alerts if late
[Figure: Number of Key-Points That Are the Same in Two Consecutive Images; x-axis: Time (Sec); y-axis: Number of Matching Points]
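One way to turn the matching key-point series into train detections is a simple threshold on the counts. This sketch assumes the counts have already been produced by an ORB matcher on consecutive frames, and it assumes that a passing train makes the count drop well below its usual level. The slides do not state the direction of the change or the decision rule, so `drop_ratio`, `min_frames`, and the median baseline are all hypothetical.

```python
from statistics import median


def detect_train_intervals(match_counts, drop_ratio=0.5, min_frames=3):
    """Return (start, end) frame index pairs where a train is assumed present.

    A frame is flagged when its matching key-point count falls below
    drop_ratio * median(match_counts). Runs shorter than min_frames are
    discarded as noise. Both parameters are illustrative guesses.
    """
    threshold = median(match_counts) * drop_ratio
    intervals = []
    start = None
    for i, count in enumerate(match_counts):
        if count < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                intervals.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(match_counts) - start >= min_frames:
        intervals.append((start, len(match_counts) - 1))
    return intervals


if __name__ == "__main__":
    counts = [100] * 10 + [20] * 5 + [100] * 10
    print(detect_train_intervals(counts))
```

If the scene behaves the other way around (more matches while a train is in view), the comparison would flip; the run-length filtering would stay the same.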
Data Science: Text
Batch:
• Update baseline tweet frequencies for each hour as additional historical data is collected
• Store model parameters in HBase
Real-Time:
• Count tweets as they stream through the topology
• Alert based on frequency deviations from the baseline
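The batch and real-time halves of the tweet model can be sketched as an hourly baseline plus a deviation check. The slides only say that alerts fire on "frequency deviations from the baseline"; the mean plus n-sigma rule, the `min_std` floor, and the function names here are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def hourly_baseline(history):
    """Batch step: per-hour (mean, std) of tweet counts.

    `history` is an iterable of (hour_of_day, tweet_count) observations.
    """
    by_hour = defaultdict(list)
    for hour, count in history:
        by_hour[hour].append(count)
    return {hour: (mean(counts), pstdev(counts)) for hour, counts in by_hour.items()}


def is_event(baseline, hour, count, n_sigma=3.0, min_std=1.0):
    """Real-time step: flag a count far above that hour's baseline.

    The n-sigma rule and the min_std floor are illustrative assumptions.
    """
    avg, std = baseline[hour]
    return count > avg + n_sigma * max(std, min_std)


if __name__ == "__main__":
    past = [(8, 10), (8, 12), (8, 14), (17, 20), (17, 22)]
    model = hourly_baseline(past)
    print(is_event(model, 8, 13), is_event(model, 8, 40))
```

Keeping a separate baseline per hour matters because commute-hour chatter about Caltrain is naturally higher than at midday, so one global threshold would either miss midday incidents or fire every rush hour.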
Baseline Calculation
[Figure: plotted series labeled "Baseline"]
Future Work
• Detect direction of train in image processing
• Use natural language processing on Twitter data for the Event Detector
• Continue evaluation of analytical frameworks for model computation
• Add observation posts
• Release Caltrain Rider Application
COMING SOON: CALTRAIN RIDER APP
• Find out what train to catch using our ‘Ride Now’ view
• Select a train, see when that train should be reaching each stop in a trip detail view.
• For more info: www.svds.com/trains
Questions?
Yes, We’re Hiring www.svds.com/join-us
THANK YOU John @BigDataAnalysis Tatsiana @notrockstar