British Gas Connected Homes: Data Engineering

Data Engineering at British Gas Connected Homes | Josep Casals | @jcasals | Cassandra Summit 2015, Santa Clara, CA

Page 1: British Gas Connected Homes: Data Engineering

Data Engineering at British Gas Connected Homes

Page 2: British Gas Connected Homes: Data Engineering

British Gas / Connected Homes

• British Gas is a 200-year-old company

• Connected Homes is BG’s IoT “startup”

• Leader in the UK’s connected home market

Page 3: British Gas Connected Homes: Data Engineering

Data Sources

• Gas and electricity meter readings

• Thermostat temperature data

• Connected boiler data

• Real time energy consumption data

• Introducing motion sensors, window and door sensors, etc.

Page 4: British Gas Connected Homes: Data Engineering

Meter Data

• Millions of gas and electricity customers

• ~600k smart meters

• Readings every 30 minutes from smart meters

Page 5: British Gas Connected Homes: Data Engineering

Machine Learning applied to Meter Data

• Energy disaggregation

• Similar homes comparison

• Smart meters used in indirect algorithms for non-smart customers

Page 6: British Gas Connected Homes: Data Engineering

Connected Thermostats

• > 200k Connected Thermostats

• Temperature data time series

Page 7: British Gas Connected Homes: Data Engineering

Connected Boilers

• Proactive maintenance

• Failure detection

Page 8: British Gas Connected Homes: Data Engineering

In-Home Displays in a Mobile App

• Data every 10 seconds

• Still needs an access device connected to the router

• Allows real time mobile alerts

Page 9: British Gas Connected Homes: Data Engineering

Technologies we use

Technologies we are trying

Page 10: British Gas Connected Homes: Data Engineering

Our Engineering Process

• Two points of friction at the intersection between teams

• Sharing datasets is problematic

• Real infrastructure too different from real environments

• New technologies too hard to deploy

• Time to production > 6 months

Page 11: British Gas Connected Homes: Data Engineering

Solution #1: Data Ops

• Data oriented DevOps instead of service oriented DevOps:

• Stateful instead of stateless

• Jobs instead of config

• Resource management instead of resource partitioning

Page 12: British Gas Connected Homes: Data Engineering

Solution #1: Data Ops

• Ansible and Docker:

1. Smooth transition from development and testing to production

2. Blue/green deployments

3. Swarm/Mesos + Docker = better use of infrastructure

• Time to production down to < 2 months :-|

Page 13: British Gas Connected Homes: Data Engineering

Future Solution #2: Data Science Environment

• Ideally Data Science models should be plug and play

• Python and R dataframes in Spark are promising, but data scientists don’t feel the need for Spark

• Data scientists prefer to work with relational DBs

• We need to find a way to make production datasets available to them

Page 14: British Gas Connected Homes: Data Engineering

Future Solution #2: Data Science Environment

• Possible solutions we are investigating are:

• Automated exports into a data science relational DB

• Spark SQL server

• Automatically generated environment images

• The objective is to reduce implementation time for new features to < 1 month

Page 15: British Gas Connected Homes: Data Engineering

Use Case: High Consumption Alerts

• The red dot on top is what we want to detect

• The green bottom dots are the baseline plus the fridge

Page 16: British Gas Connected Homes: Data Engineering

High Consumption Alerts | Data Ingest

• Very high volume of messages (every 10 seconds)

• Kafka partitions help us cope with volume

• (Experimental) We’re trying Samza for quick sliding-window transformations

• We often miss reads, so the Samza job also does basic interpolation
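
The interpolation step mentioned above could look roughly like the following. This is a minimal sketch in plain Python, not the actual Samza job: the function name `fill_missing_reads`, the `(epoch_seconds, watts)` tuple layout, and the choice of linear interpolation are all assumptions made for illustration; only the 10-second interval comes from the slides.

```python
from typing import List, Tuple

INTERVAL_S = 10  # the slides state one message every 10 seconds


def fill_missing_reads(reads: List[Tuple[int, float]],
                       interval: int = INTERVAL_S) -> List[Tuple[int, float]]:
    """Linearly interpolate readings missing between consecutive samples.

    `reads` is a list of (epoch_seconds, watts) tuples sorted by time.
    (Hypothetical layout, chosen for this sketch.)
    """
    if not reads:
        return []
    filled = [reads[0]]
    for (t0, v0), (t1, v1) in zip(reads, reads[1:]):
        # Number of interval-sized steps between the two known samples.
        steps = (t1 - t0) // interval
        for i in range(1, steps):
            t = t0 + i * interval
            # Straight line between the two known samples.
            v = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
            filled.append((t, v))
        filled.append((t1, v1))
    return filled


# Two reads (at t=20 and t=30) were missed between t=10 and t=40.
window = [(0, 100.0), (10, 110.0), (40, 140.0)]
print(fill_missing_reads(window))
```

In a streaming job the same logic would run over a sliding window per device rather than over a complete list.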

Page 17: British Gas Connected Homes: Data Engineering

High Consumption Alerts | Spark Streaming with Cassandra

• Real time data comes from Kafka

• Cassandra stores historical usage information

• A Spark Streaming job combines both and applies a machine learning algorithm to generate high usage alerts
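
To make the idea of combining history with live readings concrete, here is a rough sketch in plain Python rather than Spark. The slides do not say which machine learning algorithm is used, so a simple z-score threshold stands in for it here; the names `build_baselines` and `high_usage_alerts`, the threshold of 3.0, and the in-memory dictionaries (standing in for Cassandra and Kafka) are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, Iterable, Iterator, List, Tuple


def build_baselines(history: Dict[str, List[float]]) -> Dict[str, Tuple[float, float]]:
    """Summarise each home's historical usage as (mean, standard deviation).

    `history` stands in for what would be read from Cassandra.
    """
    return {home: (mean(vals), stdev(vals))
            for home, vals in history.items() if len(vals) >= 2}


def high_usage_alerts(stream: Iterable[Tuple[str, float]],
                      baselines: Dict[str, Tuple[float, float]],
                      z_threshold: float = 3.0) -> Iterator[Tuple[str, float]]:
    """Yield (home_id, watts) for readings far above that home's baseline.

    `stream` stands in for the real-time readings arriving from Kafka.
    A z-score test is a placeholder for the unspecified ML algorithm.
    """
    for home, watts in stream:
        if home not in baselines:
            continue  # no history yet, so nothing to compare against
        mu, sigma = baselines[home]
        if sigma > 0 and (watts - mu) / sigma > z_threshold:
            yield home, watts


history = {"home-1": [100.0, 110.0, 90.0, 105.0, 95.0]}
live = [("home-1", 105.0), ("home-1", 2500.0), ("home-2", 9000.0)]
print(list(high_usage_alerts(live, build_baselines(history))))
```

Only the 2500 W reading for `home-1` is flagged; `home-2` is skipped because it has no history to compare against.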

Page 18: British Gas Connected Homes: Data Engineering

High Consumption Alerts | Overall Architecture

• Getting the partitions right is very important for scalability

• Spark-Cassandra connector keeps C* partitions

• It’s important to match Kafka partitioning to CassandraRDD partitioning

Page 19: British Gas Connected Homes: Data Engineering

High Consumption Alerts | Main Spark Loop

Page 20: British Gas Connected Homes: Data Engineering

Data Partitioning

• Data systems like Cassandra or Kafka scale by partitioning data

• Given enough partitions, any technology can work

• We need a simple hashing algorithm that works the same in many languages and across technologies

Page 21: British Gas Connected Homes: Data Engineering

Cassandra data modelling with buckets

• Using a hashing function that is uniform and deterministic, we can cope with time series data for any number of customers

• One of our preferred strategies is to use buckets
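
One possible reading of the bucketing strategy is sketched below. The slides do not show the table layout, so everything specific here is an assumption: the bucket count of 16, the `(bucket, day)` composite key, and the use of CRC32 from the standard library as a stand-in for "a hashing function that is uniform and deterministic".

```python
import zlib
from collections import Counter

NUM_BUCKETS = 16  # hypothetical bucket count


def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a customer id to a bucket, deterministically.

    CRC32 is a stand-in hash chosen for this sketch; it is not
    necessarily the function used in the talk.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % num_buckets


def partition_key(customer_id: str, day: str) -> tuple:
    """Hypothetical composite partition key: (bucket, day)."""
    return (bucket_for(customer_id), day)


# Count how many of 10,000 synthetic customers land in each bucket.
counts = Counter(bucket_for(f"customer-{i}") for i in range(10_000))
print(sorted(counts.items()))
print(partition_key("customer-42", "2015-09-22"))
```

Because the bucket depends only on the customer id, the number of partitions stays fixed however many customers are added, and any component that knows the hash can work out where a customer's data lives.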

Page 22: British Gas Connected Homes: Data Engineering

h(k) = ⌊m * frac(kA)⌋

• Multiplicative hashing is our preferred simple partitioning algorithm

• m = number of partitions

• A ≈ (√5−1)/2 = 0.6180339887... (the reciprocal of the golden ratio)

• Online example: jsfiddle.net/joscas/yfp72fq5
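
The formula above translates almost directly into code. The following is a minimal Python sketch, assuming non-negative integer keys; the function name `multiplicative_hash` and the final clamp are choices made for this example, not taken from the linked fiddle.

```python
import math

A = (math.sqrt(5) - 1) / 2  # ≈ 0.6180339887, the constant from the slide


def multiplicative_hash(k: int, m: int) -> int:
    """h(k) = floor(m * frac(k * A)) for a non-negative integer key k."""
    frac = (k * A) % 1.0  # fractional part of k * A
    # The clamp guards against floating-point rounding pushing the result to m.
    return min(int(m * frac), m - 1)


# Consecutive keys spread out across 10 partitions.
print([multiplicative_hash(k, 10) for k in range(1, 6)])  # [6, 2, 8, 4, 0]
```

Since the computation relies on double-precision arithmetic, ports to other languages should agree as long as they also use IEEE 754 doubles, although very large keys lose precision in the fractional part.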

Page 23: British Gas Connected Homes: Data Engineering

Summary

• Increase in productivity with portable environments (Ansible, Docker, Mesos)

• Getting partitions straight is essential

• Using a simple common hashing algorithm across technologies and languages is very helpful

Page 24: British Gas Connected Homes: Data Engineering

Summary

• Streaming technologies are rapidly evolving

• Spark Streaming is complex but has many advantages (Spark’s excellent integration with Cassandra, Spark’s ML libraries, etc.)

• Kafka ticks a lot of boxes for large scale distributed real time data systems

Page 25: British Gas Connected Homes: Data Engineering

Thank you

@jcasals
