Snowplow Analytics and Looker at Oyster.com

SNOWPLOW AND LOOKER AT OYSTER.COM SNOWPLOW MEETUP NYC – MARCH 30, 2016 BEN HOYT, DEVON POHL


WHAT IS OYSTER.COM?

• “The Hotel Tell-All”
• Authentic hotel reviews and photos
• We visit every hotel in person
• 1000 hotels per month
• 7M high-res photos
• 100k 360° panoramas

(SOME OF) OUR TECH STACK

• Python to run our backend: web, scripting, photo processing, ETL
• PostgreSQL for all content data (e.g. hotels, metadata for 12M images)
• Amazon S3 for image storage, EC2 spot instances for photo processing
• Amazon Redshift for analytics and reporting data
• Looker for reporting and visualizations
• Snowplow for analytics tracking and analytics ETL

GOOGLE ANALYTICS V. SNOWPLOW

Google Analytics

• Good for web, but little control and flexibility

• Hard to get data out of (your data!)

• Crazy pricing model ($0 for free tier, or $150,000/y for premium)

• Can only do web analytics, not other business reporting

Snowplow

• Free and open source, with great support and paid tiers

• Puts data into a standard, easily-queryable database (Redshift)

• Focuses on tracking and analytics ETL and does that part well

WHY & HOW WE SWITCHED (1 YEAR AGO)

• We were considering Looker for reporting and visualization
• Looker rep: “majority of our customers use Snowplow to collect their data”
• We dug into Snowplow and liked what we saw
• Initially the design felt a bit overkill, but it’s definitely built to scale
• We implemented the tracking and pipeline, and haven’t looked back

OUR CONTEXT SCHEMA

• We use one “custom fields” schema to rule them all
• Simple, one table, one SQL join gives us all our custom fields

{
  "self": {"name": "custom_fields", "vendor": "com.oyster", "version": "1-0-9"},
  "properties": {
    "page_type": {"type": "string"},
    "page_subtype": {"type": "string"},
    "template_type": {"type": "string", "enum": ["desktop", "mobile"]},

    "hotel_id": {"$ref": "#/definitions/positiveInteger32"},
    "account_id": {"$ref": "#/definitions/positiveInteger32"},

    "ab_cell": {"type": "integer", "minimum": 1, "maximum": 20},
    "checkin_date": {"type": "string", "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}$"},
    ...
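To show how a page might attach these fields to an event, here is a minimal sketch that wraps them in a Snowplow-style self-describing JSON envelope. The Iglu URI layout, the helper name make_custom_fields_context, and the sample values (such as page_type "hotel") are assumptions for illustration, not Oyster's actual tracking code. Only two of the schema's constraints are checked by hand.

```python
import json

# Assumption: the schema is addressed with an Iglu URI of the form
# iglu:<vendor>/<name>/jsonschema/<version>, using the "self" block above.
SCHEMA_URI = "iglu:com.oyster/custom_fields/jsonschema/1-0-9"


def make_custom_fields_context(**fields):
    """Wrap custom fields in a self-describing JSON envelope.

    Hypothetical helper: drops unset (None) fields and checks two of the
    constraints visible in the schema excerpt.
    """
    data = {name: value for name, value in fields.items() if value is not None}
    if "template_type" in data and data["template_type"] not in ("desktop", "mobile"):
        raise ValueError("template_type must be 'desktop' or 'mobile'")
    if "ab_cell" in data and not 1 <= data["ab_cell"] <= 20:
        raise ValueError("ab_cell must be between 1 and 20")
    return {"schema": SCHEMA_URI, "data": data}


# Sample values are made up for the example.
context = make_custom_fields_context(
    page_type="hotel",
    template_type="desktop",
    hotel_id=123,
    ab_cell=4,
    checkin_date=None,  # not set on this page, so it is omitted
)
print(json.dumps(context, indent=2))
```

A full implementation would validate against the complete JSON schema rather than hand-checking individual fields.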

OUR DATASET

• A large, though not a massive, dataset
• Redshift cluster: 6 dc1.large SSD nodes, ~1TB storage
• 640 million rows in our events table
• We add 1.5 million event rows per day

• We copy (a subset of) our PostgreSQL content database into Redshift nightly

• Enables business reporting and advanced content-based queries
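The sketch below illustrates the kind of content-based query this makes possible: one join from events to the custom fields table, then one more to the copied content data. It uses an in-memory SQLite database as a stand-in for Redshift. The table names, the root_id join key, and the sample rows are all assumptions for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT PRIMARY KEY, event TEXT);
    -- Custom fields context table: assumed to point back at its event via root_id.
    CREATE TABLE custom_fields (root_id TEXT, page_type TEXT, hotel_id INTEGER);
    -- Content data copied nightly from PostgreSQL.
    CREATE TABLE hotels (id INTEGER PRIMARY KEY, name TEXT, city TEXT);

    INSERT INTO events VALUES ('e1', 'page_view'), ('e2', 'page_view'), ('e3', 'page_view');
    INSERT INTO custom_fields VALUES ('e1', 'hotel', 1), ('e2', 'hotel', 2), ('e3', 'hotel', 1);
    INSERT INTO hotels VALUES (1, 'Hotel A', 'New York'), (2, 'Hotel B', 'Miami');
""")

# Page views per city: event data joined to custom fields, then to content data.
rows = conn.execute("""
    SELECT h.city, COUNT(*) AS page_views
    FROM events e
    JOIN custom_fields cf ON cf.root_id = e.event_id
    JOIN hotels h ON h.id = cf.hotel_id
    WHERE e.event = 'page_view'
    GROUP BY h.city
    ORDER BY page_views DESC
""").fetchall()
print(rows)
```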

PAGE TRACKING EXAMPLE

ANALYTICS AND LOOKER (DEVON POHL)

REPORTING

• Snowplow and content data are merged to provide insights into:
  • Product
    • A/B testing
    • Funnel mapping
  • Marketing
    • SEO monitoring
    • Ad campaigns
  • Operations
    • Workflow optimization
    • ROI modeling
  • Business trends
    • Traffic
    • Revenue

VISIT TABLE

• Event data is large and granular, and often hard to digest
• The most valuable pre-processing we do is building the visit table
  • Built incrementally by a Python ETL job that runs against Redshift
  • This is key to most of our reporting infrastructure
  • Combines events and custom fields data
• This visit table:
  • Has one row per user and user session ID
  • Includes counts of a variety of event types
  • Includes all information associated with the first event of a visit
    • A/B testing cells
    • Referral information
    • Etc.
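The roll-up described above can be sketched in plain Python. This is an illustration only, not the actual ETL: the field names (user_id, session_idx, tstamp, ab_cell, referrer) and the event name photo_click are assumptions, and the sketch rebuilds every visit from scratch rather than incrementally.

```python
from collections import Counter


def build_visits(events):
    """Roll events up to one row per (user_id, session_idx).

    Each visit keeps the attributes of its first event plus a count
    of every event type seen during the visit.
    """
    visits = {}
    # Process events in time order so the first one seen starts the visit.
    for ev in sorted(events, key=lambda e: e["tstamp"]):
        key = (ev["user_id"], ev["session_idx"])
        visit = visits.get(key)
        if visit is None:
            visit = visits[key] = {
                "user_id": ev["user_id"],
                "session_idx": ev["session_idx"],
                "first_tstamp": ev["tstamp"],
                "ab_cell": ev.get("ab_cell"),    # A/B cell of the first event
                "referrer": ev.get("referrer"),  # referral info of the first event
                "event_counts": Counter(),
            }
        visit["event_counts"][ev["event"]] += 1
    return list(visits.values())


# Made-up sample events: one user, two sessions.
events = [
    {"user_id": "u1", "session_idx": 1, "tstamp": 1, "event": "page_view",
     "ab_cell": 3, "referrer": "google.com"},
    {"user_id": "u1", "session_idx": 1, "tstamp": 2, "event": "photo_click"},
    {"user_id": "u1", "session_idx": 2, "tstamp": 9, "event": "page_view", "ab_cell": 3},
]
visits = build_visits(events)
for visit in visits:
    print(visit["user_id"], visit["session_idx"], dict(visit["event_counts"]))
```

An incremental version would process only events newer than the last run and update or append the affected visit rows.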

LOOKER

• Looker is our core data exploration and reporting tool
  • Web-based YAML + visualization wrapper on Redshift
  • Enables non-technical business owners to self-serve reporting and exploration
• Also used for other pre-processing via persistent derived tables (PDTs)
  • PDTs are temporary tables, each defined by a query, that Looker builds and manages
  • Good for small-to-medium size pre-processing
  • Applications include de-duping and revenue attribution
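As an illustration of the de-duping case, the sketch below shows the kind of query a derived table could be defined by: keep one row per event ID, using the earliest collector timestamp. It runs against in-memory SQLite rather than Looker and Redshift, and the table names, columns, and sample rows are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for Redshift; names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, collector_tstamp TEXT, event TEXT);
    INSERT INTO events VALUES
        ('e1', '2016-03-30 10:00:00', 'page_view'),
        ('e1', '2016-03-30 10:00:05', 'page_view'),  -- duplicate of e1
        ('e2', '2016-03-30 10:01:00', 'page_view');

    -- The derived table: one row per event_id, earliest timestamp wins.
    CREATE TABLE events_deduped AS
    SELECT event_id, MIN(collector_tstamp) AS collector_tstamp, event
    FROM events
    GROUP BY event_id, event;
""")

rows = conn.execute(
    "SELECT event_id, collector_tstamp FROM events_deduped ORDER BY event_id"
).fetchall()
print(rows)
```

In Looker, the SELECT statement would live in the derived table definition, and Looker would handle building and refreshing the table.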

DASHBOARDS / SAVED REPORTS

EXPLORATION

OYSTER.COM

The Hotel Tell-All