
Transcript of Metail at Cambridge AWS User Group Main Meetup #3

Page 1: Metail at Cambridge AWS User Group Main Meetup #3


The fashion shopping future: Metail's Data Pipeline and AWS

OCTOBER 2015

Page 2: Metail at Cambridge AWS User Group Main Meetup #3


Introduction

• Introduction to Metail (from BD shiny)

• Architecture Overview

• Event Tracking and Collection

• Extract Transform and Load (ETL)

• Getting Insights

• Managing the Pipeline

Page 3: Metail at Cambridge AWS User Group Main Meetup #3


The Metail Experience allows customers to…

Discover clothes on their body shape

Create, save, and share outfits

Shop with confidence in size and fit

Page 4: Metail at Cambridge AWS User Group Main Meetup #3


Size & scale

1.6m MeModels created

Page 5: Metail at Cambridge AWS User Group Main Meetup #3


Size & scale

88 countries

Page 6: Metail at Cambridge AWS User Group Main Meetup #3


Architecture Overview

• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net


Page 8: Metail at Cambridge AWS User Group Main Meetup #3


New Data and Collection

Page 9: Metail at Cambridge AWS User Group Main Meetup #3


New Data and Collection

Batch Layer

Page 10: Metail at Cambridge AWS User Group Main Meetup #3


New Data and Collection

Batch Layer

Serving Layer

Page 11: Metail at Cambridge AWS User Group Main Meetup #3


Data Collection

• A significant part of our pipeline is powered by Snowplow: http://snowplowanalytics.com

• We use their technology for tracking and their setup for collection

– They have specified a tracking protocol and implemented it in many languages

– We’re using the JavaScript tracker

– Implementation very similar to Google Analytics (GA): http://www.google.co.uk/analytics/

– But you have all the raw data

Page 12: Metail at Cambridge AWS User Group Main Meetup #3


Data Collection

• Where does AWS come in?

– Snowplow Cloudfront Collector: https://github.com/snowplow/snowplow/wiki/Setting-up-the-Cloudfront-collector

– We uploaded Snowplow's GIF, called i, to an S3 bucket

– Cloudfront serves the content of the bucket

– To collect the events, the tracker performs a GET request

– Query parameters of the GET request contain the payload

– E.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...

– Cloudfront is configured for HTTP and HTTPS, allowing only GET and HEAD, with logging enabled

– The Cloudfront requests, i.e. the events, are logged to our S3 bucket

– In Lambda Architecture terms, these Cloudfront logs are our master record: the raw data
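To make that concrete, here is a minimal Python sketch of recovering an event from one Cloudfront access-log line. The field positions follow Cloudfront's web-distribution log format (tab-separated, with cs-uri-stem and cs-uri-query carrying the tracker payload); the sample line and values are illustrative, not Metail's real data.

from urllib.parse import parse_qs

def parse_log_line(line):
    """Extract the Snowplow event payload from a Cloudfront log line."""
    fields = line.rstrip("\n").split("\t")
    uri_stem, uri_query = fields[7], fields[11]  # cs-uri-stem, cs-uri-query
    if uri_stem != "/i":                         # only tracker pixel requests
        return {}
    # Each query parameter is one attribute of the event, e.g. e=pv (page view)
    return {k: v[0] for k, v in parse_qs(uri_query).items()}

event = parse_log_line(
    "2015-10-21\t09:00:00\tLHR3\t512\t203.0.113.9\tGET\t"
    "d2sgzneryst63x.cloudfront.net\t/i\t200\t-\tMozilla/5.0\te=pv&url=...&page=..."
)
print(event)  # {'e': 'pv', 'url': '...', 'page': '...'}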

Page 13: Metail at Cambridge AWS User Group Main Meetup #3


Extract Transform and Load (ETL)

• This is the batch layer of our architecture

• Runs over the raw (and enriched) data, producing (further) enriched data sets

• Implemented using MapReduce technologies:

– Snowplow ETL written in Scalding

– Scalding is Cascading (a higher-level Java MapReduce library) in Scala: https://github.com/twitter/scalding + http://www.cascading.org/

– It looks like Scala and Cascading

– Metail ETL written in Cascalog: http://cascalog.org

– Cascalog has been described as logic programming over Hadoop

– Cascading + Datalog = Cascalog

– Ridiculously compact and expressive – one of the steepest learning curves I’ve encountered in software engineering, but no hidden traps

– AWS’s Elastic MapReduce (EMR) https://aws.amazon.com/elasticmapreduce/

– AWS has done the hard/tedious work of deploying Hadoop to EC2

Page 14: Metail at Cambridge AWS User Group Main Meetup #3


Extract Transform and Load (ETL)

• Snowplow’s ETL: https://github.com/snowplow/snowplow/wiki/setting-up-EmrEtlRunner

– Initial step executed outside of EMR

– Copy data from the Cloudfront incoming log bucket to another S3 bucket for processing

– Next, create an EMR cluster

– To that cluster you add steps, as sketched below
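A rough Python (boto3) sketch of that flow, for illustration only; the real EmrEtlRunner is a Ruby tool, and the bucket names, instance types and JAR path here are hypothetical:

import boto3

s3 = boto3.client("s3")
emr = boto3.client("emr", region_name="eu-west-1")

# Initial step, outside of EMR: move incoming Cloudfront logs to the processing bucket
for obj in s3.list_objects_v2(Bucket="example-cf-logs").get("Contents", []):
    s3.copy_object(
        Bucket="example-etl-processing",
        Key=obj["Key"],
        CopySource={"Bucket": "example-cf-logs", "Key": obj["Key"]},
    )

# Create the EMR cluster with the ETL step attached; it terminates when the steps finish
emr.run_job_flow(
    Name="snowplow-etl",
    ReleaseLabel="emr-4.1.0",
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Steps=[{
        "Name": "snowplow-enrich",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "s3://example-jars/snowplow-hadoop-enrich.jar",
            "Args": ["--input", "s3://example-etl-processing/",
                     "--output", "s3://example-enriched/"],
        },
    }],
)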

Page 15: Metail at Cambridge AWS User Group Main Meetup #3


Extract Transform and Load (ETL)

• Metail’s ETL

– We run directly on the data in S3

– We store our JARs in S3 and have a process to deploy them

– We have several enrichment steps

– Our enrichment runs on Snowplow’s enriched events

– And further enriches those enriched events

– This is what builds our batch views for the serving layer, as sketched below
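For illustration, submitting one such enrichment step to an existing cluster might look like the following boto3 sketch; the cluster ID, JAR and bucket names are placeholders:

import boto3

emr = boto3.client("emr", region_name="eu-west-1")

# Add a further enrichment step: read Snowplow's enriched events from S3,
# write a further-enriched batch view back to S3 for the serving layer
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLE12345",
    Steps=[{
        "Name": "metail-enrich",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "s3://example-jars/metail-etl.jar",  # JARs are deployed to S3
            "Args": ["--input", "s3://example-enriched/",
                     "--output", "s3://example-batch-views/"],
        },
    }],
)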

Page 16: Metail at Cambridge AWS User Group Main Meetup #3


Extract Transform and Load (ETL)

• EMR and S3 get on very well

– AWS have engineered S3 so that it can behave as a native HDFS file system with very little loss of performance

– They recommend using S3 as permanent data store

– In my mind, the EMR cluster’s HDFS file system is a giant /tmp

– Encourages immutable infrastructure

– You don’t need your compute cluster running to hold your data

– Snowplow and Metail output directly to S3

– The only reason Snowplow copies to local HDFS is because they’re aggregating the Cloudfront logs

– That’s transitory data

– You can archive S3 data to Glacier, as sketched below
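A minimal boto3 sketch of that archiving, via an S3 lifecycle rule; the bucket name and 90-day cut-off are illustrative choices, not Metail's actual policy:

import boto3

s3 = boto3.client("s3")

# Transition objects to Glacier 90 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="example-cf-logs",
    LifecycleConfiguration={"Rules": [{
        "ID": "archive-raw-logs",
        "Filter": {"Prefix": ""},  # apply to the whole bucket
        "Status": "Enabled",
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
    }]},
)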

Page 17: Metail at Cambridge AWS User Group Main Meetup #3


Getting Insights

• The workhorse of Metail’s insights is Redshift: https://aws.amazon.com/redshift/

– I’d like it to be Cascalog but even I’d hate that :P

• Redshift is a “petabyte-scale data warehouse”

– Offers a Postgres-like SQL dialect to query the data

– Uses a columnar distributed data store

– It’s very quick

– Currently we have a nine-node compute cluster (9 × 160 GB = 1.44 TB)

– Thinking of switching to dense storage nodes or re-architecting

– Growing at 10 GB a day

Page 18: Metail at Cambridge AWS User Group Main Meetup #3


Getting Insights

SELECT DATE_TRUNC('mon', collector_tstamp), COUNT(event_id)
FROM events
GROUP BY DATE_TRUNC('mon', collector_tstamp)
ORDER BY DATE_TRUNC('mon', collector_tstamp);

Page 19: Metail at Cambridge AWS User Group Main Meetup #3


Getting Insights

• The Snowplow pipeline is set up to have Redshift as an endpoint: https://github.com/snowplow/snowplow/wiki/setting-up-redshift

• The Snowplow events table is loaded into Redshift directly from S3

• The events we enrich in EMR are also loaded into Redshift, again directly from S3, as sketched below
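Loading from S3 is done with Redshift's COPY command; a minimal Python sketch over the PostgreSQL wire protocol (psycopg2) might look like this, with the host, table, bucket and IAM role all placeholders:

import psycopg2

# Redshift speaks the PostgreSQL protocol, so a standard driver works
conn = psycopg2.connect(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl", password="...",
)
with conn, conn.cursor() as cur:
    # COPY pulls the enriched events straight from S3 into the table
    cur.execute("""
        COPY enriched_events
        FROM 's3://example-batch-views/part-'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
        DELIMITER '\\t' GZIP;
    """)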

Page 20: Metail at Cambridge AWS User Group Main Meetup #3


Getting Insights

The analysis of this data is done using a combination of…

• … a technology called Looker

– This provides a powerful Excel-like interface to the data

– While providing software engineering tools to manage the SQL used to explore the data

• … and R for the heavier stats

– Starting to interface directly with Redshift through a PostgreSQL driver

Page 21: Metail at Cambridge AWS User Group Main Meetup #3


Managing the Pipeline

• I’ve almost certainly run out of time and not reached this slide

• Lemur to submit ad-hoc Cascalog jobs

– The initial manual pipeline

– Clojure-based

• Snowplow have written their configuration tools in Ruby and bash

• We use AWS’s Data Pipeline: https://aws.amazon.com/datapipeline/

– More flaws than advantages; a rough sketch of driving it from boto3 follows
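Purely as illustration, creating and activating a pipeline from boto3 might look like the skeleton below; the names, schedule and fields are hypothetical, and a real definition also needs roles, a log location, and full EmrCluster/EmrActivity configuration:

import boto3

dp = boto3.client("datapipeline", region_name="eu-west-1")

# Register the pipeline; uniqueId makes the call idempotent
pipeline_id = dp.create_pipeline(name="etl-pipeline", uniqueId="etl-pipeline-v1")["pipelineId"]

# A minimal object graph: a daily schedule driving an EMR activity
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
        ]},
        {"id": "DailySchedule", "name": "DailySchedule", "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2015-10-01T00:00:00"},
        ]},
        {"id": "EtlCluster", "name": "EtlCluster", "fields": [
            {"key": "type", "stringValue": "EmrCluster"},
        ]},
        {"id": "EtlActivity", "name": "EtlActivity", "fields": [
            {"key": "type", "stringValue": "EmrActivity"},
            {"key": "runsOn", "refValue": "EtlCluster"},
            {"key": "step", "stringValue": "s3://example-jars/metail-etl.jar,com.example.Enrich"},
        ]},
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)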