British Gas Connected Homes: Data Engineering
Data Engineering at British Gas Connected Homes
Josep Casals | @jcasals | CassandraSummit, 2015 Santa Clara CA
British Gas / Connected Homes
• British Gas is a 200-year-old company
• Connected Homes is BG’s IoT “startup”
• Leader in the UK’s connected home market
Data Sources
• Gas and electricity meter readings
• Thermostat temperature data
• Connected boiler data
• Real time energy consumption data
• Introducing motion sensors, window and door sensors, etc.
Meter Data
• Millions of gas and electricity customers
• ~600k smart meters
• Readings every 30 minutes from smart meters
Machine Learning applied to Meter Data
• Energy disaggregation
• Similar homes comparison
• Smart meters used in indirect algorithms for non-smart customers
Connected Thermostats
• > 200k Connected Thermostats
• Temperature data time series
Connected Boilers
• Proactive maintenance
• Failure detection
In Home Displays in a mobile App
• Data every 10 seconds
• Still needs an access device connected to the router
• Allows real time mobile alerts
Technologies we use
Technologies we are trying
Our Engineering process
• Two points of friction at the intersection between teams
• Sharing datasets is problematic
• Real infrastructure too different from real environments
• New technologies too hard to deploy
• Time to production > 6 months
Solution #1: Data Ops
• Data oriented DevOps instead of service oriented DevOps:
• Stateful instead of stateless
• Jobs instead of config
• Resource management instead of resource partitioning
Solution #1: Data Ops
• Ansible and Docker:
1. Smooth transition from development and testing to production
2. Blue/green deployments
3. Swarm / Mesos + Docker = better use of infrastructure
• Time to production down to < 2 months :-|
Future Solution #2: Data Science Environment
• Ideally Data Science models should be plug and play
• Python and R dataframes in Spark are promising, but data scientists don’t feel the need for Spark
• Data scientists prefer to work with relational DBs
• We need to find a way to make production datasets available to them
Future Solution #2: Data Science Environment
• Possible solutions we are investigating are:
• Automated exports into a data science relational DB
• Spark SQL server
• Automatically generated environment images
• Objective is to reduce implementation time for new features to < 1 month
Use Case: High Consumption Alerts
• The red dot on top is what we want to detect
• The green bottom dots are the baseline plus the fridge
High Consumption Alerts | Data Ingest
• Very high volume of messages (every 10 seconds)
• Kafka partitions help us cope with volume
• (Experimental) We’re trying Samza for quick sliding-window type transformations
• We often miss reads, so the Samza job also does basic interpolation
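The slides do not show how the Samza job interpolates. As a rough illustration only, the sketch below fills gaps in a 10-second series by linear interpolation in plain Python. The function name `fill_missing_reads`, the `(timestamp_seconds, watts)` tuple layout, and the choice of linear interpolation are all assumptions, not the team's actual job.

```python
from typing import List, Tuple

Read = Tuple[int, float]  # (timestamp in seconds, power in watts): hypothetical layout


def fill_missing_reads(reads: List[Read], step: int = 10) -> List[Read]:
    """Fill gaps in a sorted series by linear interpolation.

    Assumes timestamps are sorted, unique, and aligned to multiples of `step`.
    """
    if not reads:
        return []
    filled = [reads[0]]
    for (t0, v0), (t1, v1) in zip(reads, reads[1:]):
        t = t0 + step
        while t < t1:
            # Point on the straight line between the two known reads.
            filled.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
            t += step
        filled.append((t1, v1))
    return filled


reads = [(0, 100.0), (10, 110.0), (40, 140.0)]  # reads at 20 s and 30 s are missing
print(fill_missing_reads(reads))
# → [(0, 100.0), (10, 110.0), (20, 120.0), (30, 130.0), (40, 140.0)]
```

In a streaming job the same logic would run over a sliding window per device, so only the last known read needs to be kept in state.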
High Consumption Alerts | Spark Streaming with Cassandra
• Real time data comes from Kafka
• Cassandra stores historical usage information
• A Spark Streaming job combines both and applies a machine learning algorithm to generate high usage alerts
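The talk does not say which machine learning algorithm is applied. To make the "combine live data with history" step concrete, here is a deliberately simple stand-in in plain Python: it flags a live read when it exceeds the home's historical mean by `k` standard deviations. The names `high_usage_alerts`, `history`, and `k` are hypothetical, and in the real pipeline the live reads would come from Kafka and the history from Cassandra.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def high_usage_alerts(
    live_reads: Iterable[Tuple[str, float]],   # (home_id, watts) from the stream
    history: Dict[str, List[float]],           # home_id -> past watts (stand-in for Cassandra)
    k: float = 3.0,                            # assumed sensitivity, not from the slides
) -> List[Tuple[str, float]]:
    """Return the live reads that sit far above each home's historical baseline."""
    alerts = []
    for home_id, watts in live_reads:
        past = history.get(home_id)
        if not past or len(past) < 2:
            continue  # not enough history to form a baseline
        threshold = mean(past) + k * pstdev(past)
        if watts > threshold:
            alerts.append((home_id, watts))
    return alerts


history = {"home-1": [100, 120, 110, 90, 105]}          # mean 105, std dev 10
live = [("home-1", 115), ("home-1", 2500), ("home-2", 9000)]
print(high_usage_alerts(live, history))
# → [('home-1', 2500)]
```

A threshold rule like this is only a baseline; it shows where the historical lookup and the live read meet, which is the part the architecture has to make cheap.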
High Consumption Alerts | Overall Architecture
• Getting the partitions right is very important for scalability
• Spark-Cassandra connector keeps C* partitions
• It’s important to match Kafka partitioning to CassandraRDD partitioning
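Why matching partitioning matters can be shown without any Spark or Kafka API. The plain-Python sketch below assumes that both the stream and the historical store are split with the same function, so every key's live events and history land in the same partition index and can be joined locally. `partition_index` is a placeholder partitioner and all names here are hypothetical; this is not the Spark-Cassandra connector's interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, float]  # (customer key, value): hypothetical layout


def partition_index(key: int, num_partitions: int) -> int:
    """Placeholder partitioner; what matters is that both sides use the same one."""
    return key % num_partitions


def group_by_partition(records: Iterable[Record], num_partitions: int) -> Dict[int, List[Record]]:
    parts: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        parts[partition_index(key, num_partitions)].append((key, value))
    return parts


def local_join(stream_parts, history_parts) -> List[Tuple[int, float, float]]:
    """Join each stream partition only against the history partition with the same index."""
    joined = []
    for p, events in stream_parts.items():
        lookup = dict(history_parts.get(p, []))  # only this partition's history is touched
        for key, live_value in events:
            if key in lookup:
                joined.append((key, live_value, lookup[key]))
    return joined


stream = group_by_partition([(1, 500.0), (2, 90.0), (5, 300.0)], num_partitions=4)
history = group_by_partition([(1, 100.0), (2, 95.0), (5, 280.0)], num_partitions=4)
print(sorted(local_join(stream, history)))
# → [(1, 500.0, 100.0), (2, 90.0, 95.0), (5, 300.0, 280.0)]
```

If the two sides used different partitioners, a key's history could sit in another partition and the join would need data to move across the network.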
High Consumption alerts | Main Spark loop
Data Partitioning
• Data systems like Cassandra or Kafka scale by partitioning data
• Given enough partitions, any technology can work
• We need a simple hashing algorithm that works the same in many languages and across technologies
Cassandra data modelling with buckets
• Using a hashing function that is uniform and deterministic, we can cope with time series data for any number of customers
• One of our preferred strategies is to use buckets
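The slides do not give the table layout, so the following is only one plausible reading of "buckets": a deterministic hash of the customer ID picks a bucket, and the bucket plus the day form the partition key, which keeps partitions bounded as customers and readings grow. CRC32 is used here purely as a stand-in for a uniform, deterministic hash; `NUM_BUCKETS`, `bucket_for`, and `partition_key` are hypothetical names.

```python
import zlib
from datetime import datetime, timezone
from typing import Tuple

NUM_BUCKETS = 64  # assumed value; the talk does not state a bucket count


def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a customer to a bucket with a deterministic hash (CRC32 as a stand-in)."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_buckets


def partition_key(customer_id: str, reading_time: datetime) -> Tuple[int, str]:
    """Hypothetical partition key: (bucket, day). Rows inside would cluster by customer and time."""
    return (bucket_for(customer_id), reading_time.strftime("%Y-%m-%d"))


ts = datetime(2015, 9, 23, 10, 30, tzinfo=timezone.utc)
bucket, day = partition_key("customer-42", ts)
print(day)                       # → 2015-09-23
print(0 <= bucket < NUM_BUCKETS)  # → True
```

Because the bucket depends only on the customer ID, a batch or streaming job can process one bucket at a time and always find a given customer's data in the same place.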
h(k) = ⌊m * frac(kA)⌋
• Multiplicative hashing is our preferred simple partitioning algorithm
• m = number of partitions
• A ≈ (√5 − 1)/2 = 0.6180339887... (the golden ratio conjugate, 1/φ)
• Online example: jsfiddle.net/joscas/yfp72fq5
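The formula above translates almost directly into code. The sketch below is a minimal Python version, assuming non-negative integer keys and IEEE 754 double arithmetic; the function name `multiplicative_hash` is ours, and the linked jsfiddle may differ in detail.

```python
import math

A = (math.sqrt(5) - 1) / 2  # 0.6180339887..., the constant from the slide


def multiplicative_hash(k: int, m: int) -> int:
    """Return floor(m * frac(k * A)), a partition index in the range [0, m).

    Assumes k is a non-negative integer. For very large k the product k * A
    loses floating-point precision, so keys may need reducing first.
    """
    fractional_part = (k * A) % 1.0
    return int(m * fractional_part)


# Consecutive keys are spread across partitions rather than clustered.
print([multiplicative_hash(k, 10) for k in range(1, 6)])
# → [6, 2, 8, 4, 0]
```

Because the function only needs a multiplication, a fractional part, and a floor, it is easy to port to other languages, which is the property the slide asks for. Different runtimes should agree as long as they all use double-precision floats and the keys stay small enough to avoid rounding differences.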
Summary
• Increase in productivity with portable environments (Ansible, Docker, Mesos)
• Getting partitions straight is essential
• Using a simple common hashing algorithm across technologies and languages is very helpful
Summary
• Streaming technologies are rapidly evolving
• Spark Streaming is complex but has many advantages (Spark’s excellent integration with Cassandra, Spark’s ML libraries, etc.)
• Kafka ticks a lot of boxes for large scale distributed real time data systems