Democratizing Data
Promoting a data-driven culture in a world of microservices
Overview
1. Introduction to Hudl
2. Hudl Data Journey
3. #DataProblems
4. Data Engineering
5. Data Analytics
6. Key Takeaways
7. Summary
Basketball workflow animation or static images.
Capture and bring value to every moment in sports.
4.9 million users
150 thousand teams
4.5 billion video views in the last 12 months
“Find all football teams that had 3 or more users watch video in 3 different months.”
SSH + SQL + Mongo + Excel/Python/etc.
Data Engineering + Data Analytics
Data Engineering
Just give me my damn data.
Three questions
1. Where do we put the data?
2. How does it get there?
3. How do people access it?
1. Where do we put the data?
Amazon Redshift
● SQL
● Fully managed on AWS
● Reasonably priced
Rob Story, Data Engineering Architecture at Simple, PyData Chicago
Alternatives
For the Google Cloud User: Google BigQuery
For the Do-it-yourself-er: Hive / Impala / PrestoDB / Druid
For the Enterprise User: Vertica / Teradata?
2. How does it get there?
Extract, Transform, Load
Extract
Transform
Load
Extract Extract Extract
Transform Transform Transform
Load Load Load
Use a workflow manager.
Luigi (Spotify), Airflow (Airbnb), Azkaban (LinkedIn)
● Dependency management
● Parallelism
● Idempotence
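To make two of those properties concrete, here is a minimal plain-Python sketch of dependency management and idempotence. It is not Luigi, Airflow, or Azkaban code: the `Task` class, `build` function, and file names are hypothetical, and parallelism is left out. A task counts as complete when its output file exists, so a second run does no work.

```python
import os
import tempfile


class Task:
    """One unit of work that writes one output file (hypothetical helper)."""

    def __init__(self, name, output_path, action, requires=()):
        self.name = name
        self.output_path = output_path
        self.action = action              # callable returning the file contents
        self.requires = list(requires)    # upstream tasks that must run first

    def complete(self):
        # Idempotence check: the task is done if its output already exists.
        return os.path.exists(self.output_path)


def build(task, log):
    """Run upstream tasks first, then this one; skip anything already complete."""
    for upstream in task.requires:
        build(upstream, log)
    if task.complete():
        log.append(f"skip {task.name}")
        return
    contents = task.action()
    # Write to a temporary file, then rename, so a crash never leaves a
    # half-written output that would look complete on the next run.
    tmp_path = task.output_path + ".tmp"
    with open(tmp_path, "w") as handle:
        handle.write(contents)
    os.replace(tmp_path, task.output_path)
    log.append(f"run {task.name}")


def read(path):
    with open(path) as handle:
        return handle.read()


def make_pipeline(workdir):
    """Extract writes raw rows; transform drops teams with zero views."""
    raw_path = os.path.join(workdir, "raw.csv")
    clean_path = os.path.join(workdir, "clean.csv")

    def extract_rows():
        return "team_id,views\n1234,10\n2345,0\n"

    def drop_zero_views():
        lines = read(raw_path).splitlines()
        return "\n".join(l for l in lines if not l.endswith(",0")) + "\n"

    extract = Task("extract", raw_path, extract_rows)
    return Task("transform", clean_path, drop_zero_views, requires=[extract])


workdir = tempfile.mkdtemp()
first_log, second_log = [], []
build(make_pipeline(workdir), first_log)    # runs both tasks
build(make_pipeline(workdir), second_log)   # skips both: outputs already exist
print(first_log)
print(second_log)
```

Real workflow managers add scheduling, retries, and parallel workers on top of this pattern.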
Think about your tooling.
● UI
● Logging
● Triggers
○ Cron
○ Dependency
○ GitHub
Single machine jobs
● Zendesk
● Salesforce
● Google Sheets
Multi machine jobs
● Database exports
● Mongo processing
3. How do people access it?
Our needs
● Everyone has access -- 430+ Hudlies
● Lots of data
○ 24+ TB
○ 100B+ rows
Commercial options
● Looker
● Periscope
● Tableau
re:dash
● Open source (Python!)
● Query editor + visualizations
● Hosted version or host your own
SSH + SQL + Mongo + Excel/Python/etc.
Data Analytics
Helping employees use data to make better decisions.
Finding Data isn’t Easy
● SQL
● So much data
● Only 3 data analysts
Removing Roadblocks
● Education
● Derived Tables
● Report Automation
Removing Roadblocks: Education
Certification Topics
● Relational Database Model
● Basic & intermediate SQL
● Table Familiarity
● Using re:dash
● Data Visualization
Data Dictionary
Understanding Relationships
Removing Roadblocks: Derived Tables
“Find how many football teams had 3 or more users watch video in 3 different months this year.”
hudl_daily_active_users
Userid  Teamid  Date        Has_Watched_Video  Has_Tagged_Video  Has_Uploaded_Video
1       1234    2016-08-15  True               True              False
2       2345    2016-08-15  False              False             True
3       5678    2016-08-15  True               False             False
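With a derived table like this, the question above comes down to one SQL query. The sketch below runs it in Python against an in-memory SQLite database rather than Redshift. The `teams` lookup table with a `sport` column, the sample rows, and the year 2016 are assumptions for illustration; the slides do not show where sport is stored. It reads the question as "at least 3 distinct users watched video in each of at least 3 distinct months".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hudl_daily_active_users (
    userid INTEGER, teamid INTEGER, date TEXT,
    has_watched_video INTEGER, has_tagged_video INTEGER, has_uploaded_video INTEGER
);
-- Hypothetical lookup table: the slides do not show where sport is stored.
CREATE TABLE teams (teamid INTEGER PRIMARY KEY, sport TEXT);
""")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [(1234, "football"), (2345, "football"), (5678, "basketball")],
)

rows = []


def add_watchers(teamid, userids, months):
    """Add one made-up 'watched video' row per user per month in 2016."""
    for month in months:
        for userid in userids:
            rows.append((userid, teamid, f"2016-{month:02d}-15", 1, 0, 0))


add_watchers(1234, [1, 2, 3], [8, 9, 10])   # 3 watchers in 3 months: qualifies
add_watchers(2345, [4, 5, 6], [8, 9])       # 3 watchers in only 2 months
add_watchers(2345, [4, 5], [10])            # just 2 watchers in October
add_watchers(5678, [7, 8, 9], [8, 9, 10])   # qualifies, but not football
conn.executemany(
    "INSERT INTO hudl_daily_active_users VALUES (?, ?, ?, ?, ?, ?)", rows
)

QUERY = """
WITH monthly AS (
    -- One row per team and month that had at least 3 distinct watchers.
    SELECT d.teamid, strftime('%Y-%m', d.date) AS month
    FROM hudl_daily_active_users AS d
    JOIN teams AS t ON t.teamid = d.teamid
    WHERE t.sport = 'football'
      AND d.has_watched_video = 1
      AND strftime('%Y', d.date) = ?
    GROUP BY d.teamid, strftime('%Y-%m', d.date)
    HAVING COUNT(DISTINCT d.userid) >= 3
)
SELECT teamid
FROM monthly
GROUP BY teamid
HAVING COUNT(*) >= 3
ORDER BY teamid
"""

qualifying = [row[0] for row in conn.execute(QUERY, ("2016",))]
print(qualifying)        # team ids that meet the criteria
print(len(qualifying))   # "how many football teams"
```

The same query shape should carry over to Redshift with its own date functions in place of SQLite's `strftime`.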
Removing Roadblocks: Report Automation
September Stats
● 194 unique users executed a query
● 14,000 ad hoc queries executed
● 940 unique scheduled queries/week
Cons
● Bad Data
● Slow Queries
Key Takeaways
● Being data-driven is a team sport
● Get the data architecture in place
● Make data and metrics accessible
● Be flexible
Summary
1. Introduction to Hudl
2. Hudl Data Journey
3. #DataProblems
4. Data Engineering
5. Data Analytics
6. Key Takeaways
7. Summary
Summary: Tools we use
Tool        Purpose
Jenkins     Scheduling
Luigi       Workflow management
Sqoop       RDBMS extraction
Spark       Data transformation
AWS Lambda  Event-driven processing
Redshift    Data warehouse
re:dash     Query interface + visualization