© Cloudera, Inc. All rights reserved.

Engines, Algorithms, and Data Models
Josh Wills | Senior Director of Data Science
From Dimensional Modeling to Machine Learning

My First Data Warehouse

My Current Data Warehouse

The Rise of the Data Scientist

Data Scientist Supply vs. Data Scientist Demand

Moneyball and Data Science

Choosing The Right Metrics

1. Analyzing “Unstructured” Data Sources

2. Building Machine Learning Models

3. Turn Static Reports Into Analytical Applications

Answering More Questions in Less Time

How To Answer Questions Like A Data Scientist

The Life of a Data Processing Job

1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize it, send over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
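
The five stages of a data processing job can be illustrated with a small single-process sketch. This is not taken from the slides: the record layout (`user`, `clicks`), the function name `run_job`, and the use of JSON as the wire format are all assumptions chosen for illustration, and the "network" is simulated with in-memory lists.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy job: total clicks per user (hypothetical schema)."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop zero-click rows.
    projected = [(r["user"], r["clicks"]) for r in records if r["clicks"] > 0]

    # 3. Shuffle: serialize each pair, route it to a partition by key,
    #    and deserialize it again on the receiving side (step 4 below).
    partitions = defaultdict(list)
    for key, value in projected:
        wire = json.dumps([key, value])  # serialization cost paid here
        partitions[hash(key) % num_partitions].append(wire)

    # 4. Apply aggregation logic within each partition.
    totals = {}
    for wires in partitions.values():
        for wire in wires:
            key, value = json.loads(wire)  # deserialization cost paid here
            totals[key] = totals.get(key, 0) + value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "clicks": v}) for k, v in sorted(totals.items())]
```

Note that serialization or deserialization appears in stages 1, 3, and 5, which is why the cost of serialization gets its own slide next.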

Handling the Cost of Serialization

The Traditional RDBMS Approach

The Cost of The Traditional RDBMS Approach

Query Scheduling and Exploratory Data Analysis

The Spark Approach

The Cost of the Spark Approach

The MapReduce Approach

MapReduce In The Hands of a Data Scientist

Example: Hive Multi-Insert

Our Goal: Public Transit for Questions

Data Modeling for Data Science

Motivating Example: Spelling Correction

Event Series Analytics

A Simple Star Schema for Spell Correction

The Combinatorial Explosion

Designing the Spell Correction Data Product

• What parameters does this model need…
  • during the analysis phase?
  • during deployment?
• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
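
The two candidate parameters named on this slide, lag time between events and similarity of queries, could be computed per pair of consecutive searches roughly as follows. This is a hedged sketch, not the method from the talk: the event layout `(timestamp_seconds, query)`, the function name `candidate_features`, and the choice of `difflib.SequenceMatcher` as the similarity measure are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def candidate_features(events):
    """events: time-sorted (timestamp_seconds, query) tuples for one session.

    Returns one feature dict per pair of consecutive queries.
    """
    features = []
    for (t0, q0), (t1, q1) in zip(events, events[1:]):
        features.append({
            "from": q0,
            "to": q1,
            # Lag time between events: a short lag hints at a quick retype.
            "lag_seconds": t1 - t0,
            # Similarity of queries: 1.0 means identical strings.
            "similarity": SequenceMatcher(None, q0, q1).ratio(),
        })
    return features
```

A pair with a short lag and a high similarity is a plausible "misspelling, then correction" candidate; the thresholds for either value would have to come from analysis of real query logs.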

A Supernova Schema for Search

Spell Correction in SQL

Exhibit: http://github.com/jwills/exhibit

Querying Nested Types with Impala

Trading Up: From Data Analyst to Data Scientist

• Core Metric: # Outputs / # Jobs
• Measure on both an individual and aggregate level
• Drive the marginal cost of asking one additional question towards zero
• Point business analysts at output tables for interactive analysis with Impala
• Self-serve BI frees up resources (compute + data science time)
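
The core metric, # Outputs / # Jobs, measured at both the individual and the aggregate level, amounts to a simple ratio. The sketch below assumes a hypothetical job log of `(owner, num_outputs)` pairs; the function name and the log layout are illustrative and do not come from the slides.

```python
from collections import defaultdict


def outputs_per_job(jobs):
    """jobs: iterable of (owner, num_outputs) pairs, one per job run.

    Returns (per_owner_ratio, overall_ratio).
    """
    jobs = list(jobs)
    if not jobs:
        return {}, 0.0

    outputs = defaultdict(int)  # total outputs produced per owner
    counts = defaultdict(int)   # number of jobs run per owner
    for owner, num_outputs in jobs:
        outputs[owner] += num_outputs
        counts[owner] += 1

    # Individual level: outputs per job for each owner.
    per_owner = {owner: outputs[owner] / counts[owner] for owner in counts}
    # Aggregate level: all outputs over all jobs.
    overall = sum(outputs.values()) / sum(counts.values())
    return per_owner, overall
```

A job that writes several output tables in one pass raises this ratio, which is one way to read the goal of driving the marginal cost of an additional question towards zero.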

Thanks! @josh_wills