British Gas Connected Homes: Data Engineering

Data Engineering at British Gas Connected Homes

Josep Casals | @jcasals | Cassandra Summit 2015, Santa Clara, CA


British Gas / Connected Homes

• British Gas is a 200-year-old company

• Connected Homes is BG’s IoT “startup”

• Leader in the UK’s connected home market

Data Sources

• Gas and electricity meter readings

• Thermostat temperature data

• Connected boiler data

• Real time energy consumption data

• Introducing motion sensors, window and door sensors, etc.


Meter Data

• Millions of gas and electricity customers

• ~600k smart meters

• Readings every 30 minutes from smart meters
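As a rough sense of scale, the figures above can be turned into a daily reading count. This is only a back-of-the-envelope sketch: it assumes exactly 600,000 smart meters and that every half-hourly reading arrives, neither of which the slide claims.

```python
# Assumptions (rounded from the slide): 600k smart meters, one reading every 30 minutes.
SMART_METERS = 600_000
READINGS_PER_METER_PER_DAY = 24 * 60 // 30  # 48 half-hour slots in a day

readings_per_day = SMART_METERS * READINGS_PER_METER_PER_DAY
print(readings_per_day)  # 28800000
```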


Machine Learning applied to Meter Data

• Energy disaggregation

• Similar homes comparison

• Smart meters used in indirect algorithms for non-smart customers


Connected Thermostats

• > 200k Connected Thermostats

• Temperature data time series


Connected Boilers

• Proactive maintenance

• Failure detection


In-Home Displays in a Mobile App

• Data every 10 seconds

• Still needs an access device connected to the router

• Allows real time mobile alerts


Technologies we use

Technologies we are trying

Our Engineering Process

• Two points of friction at the intersection between teams

• Sharing datasets is problematic

• Development and test infrastructure too different from real environments

• New technologies too hard to deploy

• Time to production > 6 months


Solution #1: Data Ops

• Data oriented DevOps instead of service oriented DevOps:

• Stateful instead of stateless

• Jobs instead of config

• Resource management instead of resource partitioning


Solution #1: Data Ops

• Ansible and Docker:

1. Smooth transition from development and testing to production

2. Blue/green deployments

3. Swarm / Mesos + Docker = better use of infrastructure

• Time to production down to < 2 months :-|


Future Solution #2: Data Science Environment

• Ideally Data Science models should be plug and play

• Python and R dataframes in Spark are promising, but data scientists don’t feel the need for Spark

• Data scientists prefer to work with relational DBs

• We need to find a way to make production datasets available to them


Future Solution #2: Data Science Environment

• Possible solutions we are investigating are:

• Automated exports into a data science relational DB

• Spark SQL server

• Automatically generated environment images

• Objective is to reduce implementation time for new features to < 1 month


Use Case | High Consumption Alerts

• The red dot on top is what we want to detect

• The green bottom dots are the baseline plus the fridge


High Consumption Alerts | Data Ingest

• Very high volume of messages (every 10 seconds)

• Kafka partitions help us cope with volume

• Experimental: we’re trying Samza for quick sliding-window transformations

• We often miss readings, so the Samza job also does basic interpolation
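The interpolation step can be illustrated with a small sketch. This is plain Python rather than Samza code, and it assumes linear interpolation on a regular 10-second grid; the slide does not say which interpolation method is used. The `Reading` type and `fill_gaps` function are hypothetical names introduced for the example.

```python
from typing import List, Tuple

# Hypothetical record shape: (timestamp in epoch seconds, power in watts).
Reading = Tuple[int, float]


def fill_gaps(readings: List[Reading], step: int = 10) -> List[Reading]:
    """Linearly interpolate readings missing from a regular `step`-second series.

    Assumes `readings` is sorted by timestamp and that timestamps fall on the grid.
    """
    if not readings:
        return []
    filled = [readings[0]]
    for (t0, v0), (t1, v1) in zip(readings, readings[1:]):
        t = t0 + step
        while t < t1:
            # Value on the straight line between the two known readings.
            filled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
            t += step
        filled.append((t1, v1))
    return filled


# The readings at t=20 and t=30 were missed and get reconstructed.
print(fill_gaps([(0, 100.0), (10, 110.0), (40, 140.0)]))
# [(0, 100.0), (10, 110.0), (20, 120.0), (30, 130.0), (40, 140.0)]
```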


High Consumption Alerts | Spark Streaming with Cassandra

• Real time data comes from Kafka

• Cassandra stores historical usage information

• A Spark Streaming job combines both and applies a machine learning algorithm to generate high usage alerts
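The shape of that combination can be sketched without Spark. The slides do not describe the machine learning algorithm, so the example below swaps in a simple stand-in: it flags a live reading that sits more than a fixed number of standard deviations above a home's historical usage. All names (`high_usage_alerts`, `history`, `threshold`) are hypothetical, and plain dictionaries and lists stand in for Cassandra and Kafka.

```python
from statistics import mean, stdev
from typing import Dict, Iterable, List, Tuple


def high_usage_alerts(
    live: Iterable[Tuple[str, float]],   # (home_id, watts), standing in for the Kafka stream
    history: Dict[str, List[float]],     # home_id -> past readings, standing in for Cassandra
    threshold: float = 3.0,              # assumed cut-off in standard deviations
) -> List[Tuple[str, float]]:
    """Return the live readings that are unusually high for their home."""
    alerts = []
    for home_id, watts in live:
        past = history.get(home_id, [])
        if len(past) < 2:
            continue  # not enough history to estimate a baseline
        mu, sigma = mean(past), stdev(past)
        if sigma > 0 and (watts - mu) / sigma > threshold:
            alerts.append((home_id, watts))
    return alerts


history = {"home-1": [200.0, 210.0, 190.0, 205.0, 195.0]}
live = [("home-1", 205.0), ("home-1", 3000.0), ("home-2", 5000.0)]
print(high_usage_alerts(live, history))  # [('home-1', 3000.0)]
```

In a real job the join between stream and history would be done per partition, which is why the partitioning discussed later matters.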


High Consumption Alerts | Overall Architecture

• Getting the partitions right is very important for scalability

• The Spark-Cassandra connector preserves Cassandra (C*) partitioning

• It’s important to match Kafka partitioning to CassandraRDD partitioning


High Consumption Alerts | Main Spark loop


Data Partitioning

• Data systems like Cassandra or Kafka scale by partitioning data

• Given enough partitions, any technology can work

• We need a simple hashing algorithm that works the same in many languages and across technologies


Cassandra data modelling with buckets

• Using a hashing function that is uniform and deterministic, we can cope with time series data for any number of customers

• One of our preferred strategies is to use buckets
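One way to read the bucket strategy is that each row's partition key combines a hash-derived bucket with a time component, so the number of partitions per day stays fixed however many customers there are. The sketch below assumes that layout; the slides do not show the actual table design. `NUM_BUCKETS`, `modulo_hash`, and `partition_key` are hypothetical, and the modulo hash is only a placeholder for whichever uniform, deterministic function is chosen.

```python
from collections import Counter
from datetime import date
from typing import Callable, Tuple

NUM_BUCKETS = 16  # hypothetical bucket count


def modulo_hash(customer_id: int, num_buckets: int) -> int:
    """Placeholder hash: uniform and deterministic for sequential integer IDs."""
    return customer_id % num_buckets


def partition_key(
    customer_id: int,
    day: date,
    hash_fn: Callable[[int, int], int] = modulo_hash,
    num_buckets: int = NUM_BUCKETS,
) -> Tuple[int, str]:
    """Assumed partition key layout: (bucket, day)."""
    return (hash_fn(customer_id, num_buckets), day.isoformat())


# 1000 customers on one (arbitrary) day end up in exactly NUM_BUCKETS partitions.
keys = Counter(partition_key(cid, date(2015, 9, 22)) for cid in range(1000))
print(len(keys), min(keys.values()), max(keys.values()))  # 16 62 63
```

Within such a partition, customer ID and timestamp would typically act as clustering columns, which is again an assumption rather than something stated in the slides.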


h(k) = ⌊m * frac(kA)⌋

• Multiplicative hashing is our preferred simple partitioning algorithm

• m = number of partitions

• A = (√5−1)/2 ≈ 0.6180339887... (the reciprocal of the golden ratio)

• Online example: jsfiddle.net/joscas/yfp72fq5
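The formula above translates almost directly into code. The following is a minimal Python sketch, not the code behind the jsfiddle link; it assumes integer keys (string keys would first need mapping to integers) and reads frac(x) as the fractional part of x. The function name `partition_for` is hypothetical.

```python
import math

# A = (sqrt(5) - 1) / 2, the constant given on the slide.
A = (math.sqrt(5) - 1) / 2


def partition_for(key: int, m: int) -> int:
    """Return h(k) = floor(m * frac(k * A)) for an integer key and m partitions."""
    if m <= 0:
        raise ValueError("m must be positive")
    fractional = (key * A) % 1.0  # frac(kA)
    # min() guards against floating-point rounding pushing the result up to m.
    return min(m - 1, int(m * fractional))


print([partition_for(k, 10) for k in range(1, 6)])  # [6, 2, 8, 4, 0]
```

Because the computation uses only double-precision multiplication and a floor, it is easy to reproduce in other languages. Very large keys lose precision in the fractional part, so an integer-arithmetic variant may be preferable in that case.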



Summary

• Increase in productivity with portable environments (Ansible, Docker, Mesos)

• Getting partitions straight is essential

• Using a simple common hashing algorithm across technologies and languages is very helpful


Summary

• Streaming technologies are rapidly evolving

• Spark Streaming is complex but has many advantages (excellent integration with Cassandra, Spark’s ML libraries, etc.)

• Kafka ticks a lot of boxes for large scale distributed real time data systems


Thank you

@jcasals

