Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily,...

Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily AGIFORS RM 2016 Josef Habdank 20th of May, 2016 Lead Data Scientist & Data Platform Architect [email protected] www.infare.com


Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily

AGIFORS RM 2016

Josef Habdank

20th of May, 2016

Lead Data Scientist & Data Platform Architect

[email protected] www.infare.com

In business since 2000

150 airline customers

11 airports and several OTA customers

7 offices worldwide

5000-6000 revenue managers log in to our platform every week

Leading provider of Airfare Intelligence Solutions to the Aviation Industry

Delivers actionable information based on huge amounts of freshly collected and historical data

https://www.youtube.com/watch?v=h9cQTooY92E

Airfare Collection and Analytics

Online Airfare Data Collection

Data Processing and Modelling

Pharos: live analytics

Altus: historical analytics

Data Feeds

Collecting 1 billion airfares a day

Reached 1bn/day airfares on 7th of April 2016

Conservative projected growth based on leads

[Chart: airfare observations daily, linear axis up to 3,500,000,000. Series: Observations Daily, Extrapolated Observations Daily]

Data collection doubling time ~7-12 months

[Chart: airfare observations daily, logarithmic axis from 100,000 to 10,000,000,000. Series: Observations Daily, Extrapolated Observations Daily]
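The doubling time quoted on this slide can be turned into a rough timeline. The sketch below assumes steady exponential growth (an assumption, since the slide only gives a range) and uses the 1bn/day figure from April 2016 together with the 5bn/day launch coverage estimate mentioned later in the deck. The function name is illustrative only.

```python
import math


def months_to_reach(current_per_day: float, target_per_day: float,
                    doubling_months: float) -> float:
    """Months needed to grow from current to target volume,
    assuming volume doubles every `doubling_months` months."""
    doublings = math.log2(target_per_day / current_per_day)
    return doublings * doubling_months


current = 1e9  # 1bn airfares/day, reached on 7th of April 2016 (from the slide)
target = 5e9   # 5bn airfares/day, the launch coverage estimate quoted later

for doubling_months in (7, 12):  # both ends of the quoted doubling-time range
    months = months_to_reach(current, target, doubling_months)
    print(f"doubling every {doubling_months} months: "
          f"about {months:.0f} months to reach 5bn/day")
```

Under these assumptions the 5bn/day mark is roughly 16 to 28 months away from the 1bn/day milestone, depending on which end of the doubling-time range holds.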

[Figure: Infare technology stack, 2015 and 2016+]

Infare technology stack 2016+

Data processing: Apache Spark

Message streaming: Kafka/Kinesis

BigData storage: Hadoop/S3

Microservices: C#.Net/Akka Spray

Real time analytics: MsSql/Cassandra

Machine Learning: PySpark + Scikit Learn

Tested on 6-8bn airfares a day

Soon reaching full market coverage: how to utilize it?

Infare DataCenter

Altus: historical

Data Feeds

Granular Data Access API (live + historical queries to DB)

Prediction and Analytics API (all models presented later)

Pharos: live data + prediction

Prediction has been researched since 2012, but accuracy requires larger market coverage. The estimated coverage required for launch of the final product is 5bn airfares/day.

Prediction: minimum future price

+ API access

Prediction: price evolution

+ API access

Developing Prediction at Scale

• Tens to hundreds of millions of unique trips observed daily

• Tens to hundreds of observed prices per trip

• Clustering price vectors

• Training a model per cluster

• 10000-50000 models

• Training should take 2-3h to enable daily or real time updates
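The model count and training window above imply a time budget per model. The sketch below works that out. The number of parallel workers is a hypothetical value chosen for illustration, not a figure from the slides, and the calculation ignores scheduling and data-loading overhead.

```python
def per_model_budget_seconds(n_models: int, window_hours: float,
                             parallel_workers: int) -> float:
    """Average wall-clock seconds each model may take so that
    all models finish training within the window."""
    total_worker_seconds = window_hours * 3600 * parallel_workers
    return total_worker_seconds / n_models


WORKERS = 200  # hypothetical number of parallel training slots (assumption)

for n_models in (10_000, 50_000):   # range of model counts from the slide
    for window_hours in (2, 3):     # training window from the slide
        budget = per_model_budget_seconds(n_models, window_hours, WORKERS)
        print(f"{n_models} models in {window_hours}h on {WORKERS} workers: "
              f"{budget:.1f}s per model")
```

With 200 workers the budget ranges from about 29 seconds per model (50000 models in 2h) to 216 seconds per model (10000 models in 3h), which suggests why each per-cluster model has to stay small and train in memory.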

Prediction of highly multivariate time series

The drawing depicts a trivial case in 2 dimensions with 3 models. In reality there are tens of thousands of clusters in a > 300-dimensional space

Each point represents an n-dimensional vector time series

Cluster the time series (after dimensionality reduction to reduce sparsity)

Train ML models on the data within each respective cluster
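The two steps on this slide can be sketched end to end on toy data. This is a minimal illustration, not the production pipeline: the synthetic price curves, the degree-2 polynomial projection, the choice of 3 clusters, and the use of `LinearRegression` on the first 15 observations are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for observed price vectors: one row per trip,
# one column per observation. Three invented price regimes.
days = np.linspace(0.0, 1.0, 20)
regimes = [(100, 50, 0), (300, -40, 80), (200, 0, 150)]  # (level, slope, curvature)
prices = np.vstack([
    a + b * days + c * days**2 + rng.normal(0, 2.0, size=days.size)
    for a, b, c in regimes
    for _ in range(50)
])

# 1) Dimensionality reduction: describe each 20-point curve
#    by its 3 polynomial coefficients.
theta = np.polynomial.polynomial.polyfit(days, prices.T, deg=2).T  # shape (150, 3)

# 2) Cluster the trips in coefficient space.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(theta)
labels = kmeans.labels_

# 3) Train one model per cluster: predict the final price
#    from the first 15 observations of each trip.
models = {}
for cluster_id in np.unique(labels):
    rows = prices[labels == cluster_id]
    models[cluster_id] = LinearRegression().fit(rows[:, :15], rows[:, -1])

# Predict for a trip: project it, assign it to a cluster,
# then use that cluster's model.
trip = prices[0]
trip_theta = np.polynomial.polynomial.polyfit(days, trip, deg=2).reshape(1, -1)
trip_cluster = kmeans.predict(trip_theta)[0]
predicted_final = models[trip_cluster].predict(trip[:15].reshape(1, -1))[0]
print(f"cluster {trip_cluster}: predicted final price {predicted_final:.1f}")
```

The loop in step 3 is independent per cluster, so it is the natural part to distribute when the number of clusters reaches the tens of thousands.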

Remarks regarding modelling


• Requires careful feature selection

• Dimensionality reduction of the time series space is done using polynomial fitting or inverse exponential series fitting

• This transforms the price vectors into a parameter space: $f: P \mapsto \Theta$

• Clustering of the time series projection $\Theta$ using k-means or a Gaussian Mixture Model

• ARIMA formulated as Linear Regression trained on the $P$ space: $\mathrm{ARIMA}(0, 1, n) \equiv \mathbf{y} = \mathbf{X}\beta + \alpha$, where $\dim \mathbf{X} = (\cdot, n)$

• For some clusters, Support Vector Regression with a Radial Basis Function kernel

• Quantize the continuous co-domain to finite states drawn from the data

• Requires in-memory parallel processing, using Scikit Learn on PySpark
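The regression and quantization remarks can be illustrated with a small NumPy sketch. The slide equates ARIMA(0, 1, n) with a linear regression whose design matrix has n columns. One possible reading, assumed here and not confirmed by the source, is a regression of the next first difference on the previous n first differences (strictly speaking, an autoregression on the differenced series). The helper names, the toy price history, and the fare levels are all hypothetical.

```python
import numpy as np


def lagged_difference_design(prices, n_lags):
    """Build X and y for y = X beta + alpha on first differences.
    Column k of X holds the difference lagged by k + 1 steps."""
    diffs = np.diff(prices)  # the d = 1 differencing step
    X = np.column_stack([
        diffs[n_lags - k - 1: len(diffs) - k - 1] for k in range(n_lags)
    ])
    y = diffs[n_lags:]
    return X, y


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns (beta, alpha)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]


def predict_next_price(prices, beta, alpha):
    """One-step forecast: last price plus the predicted next difference."""
    diffs = np.diff(prices)
    recent = diffs[-1: -len(beta) - 1: -1]  # most recent difference first
    return prices[-1] + recent @ beta + alpha


def quantize(values, states):
    """Snap each continuous value to the nearest of a finite set of states."""
    states = np.asarray(states, dtype=float)
    values = np.atleast_1d(np.asarray(values, dtype=float))
    nearest = np.abs(values[:, None] - states[None, :]).argmin(axis=1)
    return states[nearest]


# Toy price history whose differences follow a known linear recurrence,
# so the fitted coefficients can be checked against the generator.
d = [1.0, 2.0]
for _ in range(10):
    d.append(0.5 * d[-1] + 0.3 * d[-2] + 1.0)
prices = np.concatenate([[100.0], 100.0 + np.cumsum(d)])

X, y = lagged_difference_design(prices, n_lags=2)
beta, alpha = fit_linear(X, y)
forecast = predict_next_price(prices, beta, alpha)

# Finite price states; in practice these would be drawn from observed fares.
fare_levels = [100.0, 120.0, 145.0, 170.0]
snapped = quantize(forecast, fare_levels)
print(beta, alpha, forecast, snapped)
```

Quantizing the output restricts predictions to price points that actually occur in the data, which fits fares that move between discrete levels.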

Future research: estimating competitors’ demand curves

Could be solved as a Blind Source Separation or Machine Learning problem

Looking for a partner airline to pilot this research project

Airline’s own historical and current demand curves

Infare’s historical and current market prices

Estimate of competitors’ current and future demand curves

Question to audience

What do you think is the most important product?

1) Granular live and historical data access API

2) Price Prediction in Pharos + API

3) Estimating competitors’ booking curves

THANK YOU!

Please contact us if you would like to collaborate in research

Josef Habdank

20th of May, 2016

Lead Data Scientist & Data Platform Architect

[email protected] www.infare.com