Promoting a Data Driven Culture in a Microservices Environment

Democratizing Data: Promoting a data-driven culture in a world of microservices

Overview

1. Introduction to Hudl

2. Hudl Data Journey

3. #DataProblems

4. Data Engineering

5. Data Analytics

6. Key Takeaways

7. Summary

Basketball workflow animation or static images.

Capture and bring value to every moment in sports.

4.9 million users

150 thousand teams

4.5 billion video views in the last 12 months

Our data journey.

Timeline: 2006, 2010, 2014, 2014, 2015, 2015

#DataProblems

“Find all football teams that had 3 or more users watch video in 3 different months.”

SSH + SQL + Mongo + Excel/Python/etc.

Data Engineering + Data Analytics

Data Engineering

Just give me my damn data.

Three questions

1. Where do we put the data?

2. How does it get there?

3. How do people access it?

Question 1: Where do we put the data?

Amazon Redshift

● SQL

● Fully managed on AWS

● Reasonably priced

Rob Story, Data Engineering Architecture at Simple, PyData Chicago

Alternatives

For the Google Cloud User: Google BigQuery

For the Do-it-yourself-er: Hive / Impala / PrestoDB / Druid

For the Enterprise User: Vertica / Teradata?

Question 2: How does it get there?

ETL

Extract, Transform, Load

[Diagram: three Extract / Transform / Load pipelines side by side]
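
The three ETL stages named above can be sketched as three small functions. This is only an illustration under stated assumptions: the source records, the `video_views` table and its columns are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records, standing in for an export from one service.
RAW_EVENTS = [
    {"user_id": "1", "team_id": "1234", "watched_at": "2016-08-15T10:02:00"},
    {"user_id": "3", "team_id": "5678", "watched_at": "2016-08-15T18:40:00"},
    {"user_id": "3", "team_id": "5678", "watched_at": "2016-08-15T19:05:00"},
]


def extract():
    """Pull raw records from the source system (here, an in-memory list)."""
    return list(RAW_EVENTS)


def transform(records):
    """Cast ids to integers, keep only the date, and drop duplicate rows."""
    rows = {
        (int(r["user_id"]), int(r["team_id"]), r["watched_at"][:10])
        for r in records
    }
    return sorted(rows)


def load(conn, rows):
    """Write the cleaned rows into the (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS video_views "
        "(user_id INTEGER, team_id INTEGER, view_date TEXT, "
        "PRIMARY KEY (user_id, team_id, view_date))"
    )
    # INSERT OR REPLACE keeps the load safe to repeat.
    conn.executemany("INSERT OR REPLACE INTO video_views VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    load(conn, transform(extract()))


warehouse = sqlite3.connect(":memory:")  # stand-in for the real warehouse
run_etl(warehouse)
loaded = warehouse.execute("SELECT * FROM video_views ORDER BY user_id").fetchall()
print(loaded)  # [(1, 1234, '2016-08-15'), (3, 5678, '2016-08-15')]
```

Keeping each stage a separate function makes it straightforward to run many such pipelines next to each other, one per source.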

Use a workflow manager.

Luigi (Spotify) / Airflow (Airbnb) / Azkaban (LinkedIn)

● Dependency management

● Parallelism

● Idempotence
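
To make dependency management and idempotence concrete, here is a toy scheduler in plain Python. It is not the Luigi, Airflow or Azkaban API; the `Task` class, the `build` function and the file names are all hypothetical, and parallelism is left out to keep the sketch short. Each task declares what it requires and which file it produces, and a task whose output already exists is skipped.

```python
import os
import tempfile


class Task:
    """A unit of work that writes exactly one output file."""

    name = "task"
    requires = ()  # upstream Task classes that must finish first

    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        return os.path.join(self.workdir, self.name + ".txt")

    def complete(self):
        # A task counts as done once its output file exists.
        return os.path.exists(self.output())

    def run(self):
        raise NotImplementedError


class Extract(Task):
    name = "extract"

    def run(self):
        with open(self.output(), "w") as f:
            f.write("1,1234\n3,5678\n")


class Transform(Task):
    name = "transform"
    requires = (Extract,)

    def run(self):
        with open(Extract(self.workdir).output()) as src:
            rows = [line.strip().split(",") for line in src]
        with open(self.output(), "w") as dst:
            for user_id, team_id in rows:
                dst.write(f"user={user_id} team={team_id}\n")


class Load(Task):
    name = "load"
    requires = (Transform,)

    def run(self):
        with open(Transform(self.workdir).output()) as src:
            count = sum(1 for _ in src)
        with open(self.output(), "w") as dst:
            dst.write(f"loaded {count} rows\n")


def build(task_cls, workdir, log):
    """Run upstream tasks first, then this one; skip anything already complete."""
    task = task_cls(workdir)
    if task.complete():
        return
    for upstream in task.requires:
        build(upstream, workdir, log)
    task.run()
    log.append(task.name)


workdir = tempfile.mkdtemp()
first_run, second_run = [], []
build(Load, workdir, first_run)
build(Load, workdir, second_run)
print(first_run)   # ['extract', 'transform', 'load']
print(second_run)  # []
```

Asking for `Load` is enough to pull in its upstream tasks in order, and the second call does nothing because every output is already on disk.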

Think about your tooling.

● UI

● Logging

● Triggers

○ Cron

○ Dependency

○ GitHub

Single machine jobs

● Zendesk

● Salesforce

● Google Sheets

Multi machine jobs

● Database exports

● Mongo processing

Question 3: How do people access it?

Our needs

● Everyone has access: 430+ Hudlies

● Lots of data

○ 24+ TB

○ 100B+ rows

Commercial options

● Looker

● Periscope

● Tableau

re:dash

● Open source (Python!)

● Query editor + visualizations

● Hosted version or host your own

SSH + SQL + Mongo + Excel/Python/etc.

Data Analytics

Helping employees use data to make better decisions.

Access for All

Finding Data isn’t Easy

● SQL

● So much data

● Only 3 data analysts

Removing Roadblocks

● Education

● Derived Tables

● Report Automation

Removing Roadblocks: Education

Certification Topics

● Relational Database Model

● Basic & intermediate SQL

● Table Familiarity

● Using re:dash

● Data Visualization

Data Dictionary

Understanding Relationships

Removing Roadblocks: Derived Tables

“Find how many football teams had 3 or more users watch video in 3 different months this year.”

hudl_daily_active_users

Userid  Teamid  Date        Has_Watched_Video  Has_Tagged_Video  Has_uploaded_video
1       1234    2016-08-15  True               True              False
2       2345    2016-08-15  False              False             True
3       5678    2016-08-15  True               False             False
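
With a derived table shaped like this, the question above becomes a single query. The sketch below makes several assumptions: the question is read as "teams with at least 3 users who each watched video in at least 3 distinct months of 2016"; the `teams` table with a `sport` column is hypothetical, since the slide's table has no sport; and SQLite stands in for the warehouse, so the month bucketing uses `substr` where a warehouse such as Redshift would more likely use a date function such as `DATE_TRUNC`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hudl_daily_active_users (
    userid INTEGER, teamid INTEGER, date TEXT,
    has_watched_video INTEGER, has_tagged_video INTEGER, has_uploaded_video INTEGER
);
-- Hypothetical lookup table: the derived table on the slide has no sport column.
CREATE TABLE teams (teamid INTEGER PRIMARY KEY, sport TEXT);
INSERT INTO teams VALUES (1234, 'football'), (2345, 'football'), (5678, 'basketball');
""")


def activity(userid, teamid, watched):
    """One active day in each of January, February and March 2016."""
    return [
        (userid, teamid, f"2016-{month:02d}-15", watched, 0, 0)
        for month in (1, 2, 3)
    ]


rows = []
for userid in (1, 2, 3):
    rows += activity(userid, 1234, watched=1)  # football, 3 viewers: counts
for userid in (4, 5):
    rows += activity(userid, 2345, watched=1)  # football, only 2 viewers
rows += activity(6, 2345, watched=0)           # active, but never watched video
for userid in (7, 8, 9):
    rows += activity(userid, 5678, watched=1)  # 3 viewers, but basketball
conn.executemany(
    "INSERT INTO hudl_daily_active_users VALUES (?, ?, ?, ?, ?, ?)", rows
)

QUERY = """
SELECT COUNT(*) FROM (
    SELECT u.teamid
    FROM (
        -- users who watched video in 3 or more distinct months this year
        SELECT userid, teamid
        FROM hudl_daily_active_users
        WHERE has_watched_video = 1
          AND date >= '2016-01-01' AND date < '2017-01-01'
        GROUP BY userid, teamid
        HAVING COUNT(DISTINCT substr(date, 1, 7)) >= 3
    ) AS u
    JOIN teams AS t ON t.teamid = u.teamid
    WHERE t.sport = 'football'
    GROUP BY u.teamid
    HAVING COUNT(*) >= 3  -- 3 or more such users on the team
) AS qualifying_teams
"""

team_count = conn.execute(QUERY).fetchone()[0]
print(team_count)  # 1
```

Only team 1234 qualifies in the sample data: team 2345 has just two users who watched video, and team 5678 is not a football team.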

Removing Roadblocks: Report Automation

Slackalytics

September Stats

● 194 unique users executed a query

● 14,000 ad hoc queries executed

● 940 unique scheduled queries/week

Cons

● Bad Data

● Slow Queries

Key Takeaways

● Being data-driven is a team sport

● Get the data architecture in place

● Make data and metrics accessible

● Be flexible

Summary

1. Introduction to Hudl

2. Hudl Data Journey

3. #DataProblems

4. Data Engineering

5. Data Analytics

6. Key Takeaways

7. Summary

Tools we use

Jenkins      Scheduling
Luigi        Workflow management
Sqoop        RDBMS extraction
Spark        Data transformation
AWS Lambda   Event-driven processing
Redshift     Data warehouse
re:dash      Query interface + visualization