Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily,...
Airfare prediction using
Machine Learning with Apache Spark
on 1 billion observed airfares daily
AGIFORS RM 2016
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
[email protected] www.infare.com
In business since 2000
150 airline customers
11 airport customers and several OTA customers
7 offices worldwide
5000-6000 revenue managers log in to our platform every week
Leading provider of Airfare Intelligence Solutions to the Aviation Industry
Delivers actionable information based on a huge amount of freshly collected and historical data
https://www.youtube.com/watch?v=h9cQTooY92E
Airfare Collection and Analytics
Online Airfare Data Collection
Data Processing and Modelling
Pharos: live analytics
Altus: historical analytics
Data Feeds
Collecting 1 billion airfares a day
Reached 1bn/day airfares on 7th of April 2016
Conservative projected growth based on leads
[Chart: airfare observations daily, observed and extrapolated; linear axis from 0 to 3.5 billion]
Data collection doubling time ~7-12 months
[Chart: airfare observations daily, observed and extrapolated; logarithmic axis from 100,000 to 10 billion]
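The doubling-time figure can be turned into a rough projection. The sketch below assumes growth stays purely exponential with a fixed doubling time, which the slides do not claim; the function names are illustrative and not part of Infare's tooling.

```python
import math


def projected_daily_observations(n0, months_elapsed, doubling_months):
    """Exponential growth: n0 * 2 ** (t / T), with T the doubling time."""
    return n0 * 2.0 ** (months_elapsed / doubling_months)


def months_to_reach(n0, target, doubling_months):
    """Months needed to grow from n0 to target at a fixed doubling time."""
    return doubling_months * math.log2(target / n0)


if __name__ == "__main__":
    start = 1e9  # 1bn airfares/day, reached on 7th of April 2016 (from the slide)
    target = 5e9  # coverage level quoted elsewhere in the talk
    for doubling in (7, 12):  # both ends of the ~7-12 month range
        months = months_to_reach(start, target, doubling)
        print(f"doubling every {doubling} months: {months:.1f} months to 5bn/day")
```

Under these assumptions, growing from 1bn to 5bn airfares a day takes log2(5) ≈ 2.32 doublings, i.e. roughly 16 to 28 months for the quoted range.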
Infare technology stack
2016+
Data processing: Apache Spark
Message streaming: Kafka/Kinesis
BigData storage: Hadoop/S3
Microservices: C#.Net/Akka Spray
Real time analytics: MsSql/Cassandra
Machine Learning: PySpark + Scikit Learn
Tested on 6-8bn airfares a day
Soon reaching full market coverage: how to utilize it?
Infare DataCenter
Altus: historical
Data Feeds
Granular Data Access API (live + historical queries to DB)
Prediction and Analytics API (all models presented later)
Pharos: live data + prediction
Prediction has been researched since 2012; however, accuracy requires larger market coverage. The estimated coverage required to launch the final product is 5bn airfares/day.
Developing Prediction at Scale
• Tens to hundreds of millions of unique
trips observed daily
• Tens to hundreds of observed prices per trip
• Clustering price vectors
• Training model per cluster
• 10000-50000 models
• Training should take 2-3h to enable
daily or real time update
Prediction of highly multivariate time series
The drawing depicts a trivial case with 2 dimensions and 3 models. In reality there are tens of thousands of clusters in a > 300-dimensional space
Each point represents an n-dimensional vector time series
Cluster the time series (after dimensionality reduction, which reduces sparsity)
Train ML models on the data within each respective cluster
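The cluster-then-train idea can be sketched on toy data. Everything concrete here is an assumption made for illustration: the synthetic price shapes, the quadratic polynomial used for dimensionality reduction, k-means with 3 clusters, and a plain linear regressor per cluster that predicts the last observed price from the earlier ones. The helper names are hypothetical and this is not Infare's production pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


def project_to_parameters(histories, degree=2):
    """Map each price history (a row) to its polynomial coefficients."""
    n_steps = histories.shape[1]
    t = np.linspace(0.0, 1.0, n_steps)
    # np.polyfit fits every column of a 2-D y, so pass the series as columns.
    return np.polyfit(t, histories.T, degree).T


def train_per_cluster(prices, n_clusters=3, degree=2, seed=0):
    """Cluster series in parameter space, then fit one regressor per cluster."""
    histories, targets = prices[:, :-1], prices[:, -1]
    theta = project_to_parameters(histories, degree)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(theta)
    models = {}
    for label in range(n_clusters):
        mask = kmeans.labels_ == label
        models[label] = LinearRegression().fit(histories[mask], targets[mask])
    return kmeans, models


def predict_last_price(kmeans, models, histories, degree=2):
    """Route each history to its cluster's model and predict the next price."""
    labels = kmeans.predict(project_to_parameters(histories, degree))
    preds = np.empty(len(histories))
    for label in np.unique(labels):
        mask = labels == label
        preds[mask] = models[int(label)].predict(histories[mask])
    return preds


def make_toy_prices(n_per_group=40, n_steps=10, seed=0):
    """Synthetic data (an assumption): three groups with different price shapes."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_steps)
    shapes = [50.0 * t, -80.0 * t, 60.0 * t ** 2]
    levels = [100.0, 200.0, 150.0]
    groups = []
    for shape, level in zip(shapes, levels):
        base = level + rng.uniform(-20.0, 20.0, size=(n_per_group, 1))
        noise = rng.normal(0.0, 1.0, size=(n_per_group, n_steps))
        groups.append(base + shape + noise)
    return np.vstack(groups)


if __name__ == "__main__":
    prices = make_toy_prices()
    kmeans, models = train_per_cluster(prices)
    preds = predict_last_price(kmeans, models, prices[:, :-1])
    print("mean absolute error:", np.mean(np.abs(preds - prices[:, -1])))
```

Each cluster's fit is independent of the others, which is what makes the per-cluster training step a natural candidate for parallel execution.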
Remarks regarding modelling
• Requires careful feature selection
• Dimensionality reduction of time series space done using
polynomial fitting or inverse exponential series fitting
• Transforms the price vectors into a parameters space
𝑓: 𝑃 ↦ Θ
• Clustering of time series projection Θ using k-means or
Gaussian Mixture Model
• ARIMA formulated as Linear Regression trained on P space:
ARIMA(0, 1, n) ≡ y = Xβ + α, where dim X = (∙, n)
• For some clusters Support Vector Regression
with Radial Basis Function Kernel
• Quantize the continuous co-domain to finite states drawn from data
• Requires in-memory parallel processing, using Scikit Learn on PySpark
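Two of the remarks above, the regression formulation and the quantization of the output, can be illustrated together. The slides do not say how X is constructed, so the sketch below takes one possible reading as an assumption: each row of X holds n consecutive first differences of a price series and y is the next difference, fitted by ordinary least squares. Strictly speaking that is an autoregression on differenced prices. The prediction is then snapped to the nearest price state observed in the data. All function names are hypothetical.

```python
import numpy as np


def lagged_difference_design(prices, n_lags):
    """Build X (rows of n_lags consecutive first differences) and y (next difference)."""
    d = np.diff(np.asarray(prices, dtype=float))
    X = np.array([d[i:i + n_lags] for i in range(len(d) - n_lags)])
    y = d[n_lags:]
    return X, y


def fit_linear(X, y):
    """Least-squares fit of y = X beta + alpha; returns (beta, alpha)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]


def predict_next_price(prices, beta, alpha):
    """Predict the next difference from the last n differences and add it back."""
    p = np.asarray(prices, dtype=float)
    d = np.diff(p)
    return p[-1] + d[-len(beta):] @ beta + alpha


def quantize(value, observed_states):
    """Snap a continuous prediction to the nearest observed price state."""
    states = np.asarray(observed_states, dtype=float)
    return states[np.argmin(np.abs(states - value))]


if __name__ == "__main__":
    # Toy series (an assumption): a fare cycling through three price points.
    prices = [100, 120, 110, 100, 120, 110, 100, 120, 110, 100]
    X, y = lagged_difference_design(prices, n_lags=2)
    beta, alpha = fit_linear(X, y)
    raw = predict_next_price(prices, beta, alpha)
    print("raw prediction:", raw)
    print("quantized:", quantize(raw, np.unique(prices)))
```

To train on a whole cluster rather than a single series, the X and y blocks of every member series could be stacked with `np.vstack` and `np.concatenate` before the fit.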
Future research:
estimating competitors’ demand curves
Could be solved as a Blind Source Separation or Machine Learning problem
Looking for a partner airline to pilot this research project
Airline’s own historical and current demand curves
Infare’s historical and current market prices
Estimate of competitors’ current and future demand curves
Question to audience
What do you think is the
most important product?
1) Granular live and historical data access API
2) Price Prediction in Pharos + API
3) Estimating competitors’ booking curves
THANK YOU!
Please contact us if you would like to collaborate on research