The Road to Data Science - Joel Grus, June 2015

Joel Grus Seattle DAML Meetup June 23, 2015 Data Science from Scratch

Transcript of The Road to Data Science - Joel Grus, June 2015

Page 1: The Road to Data Science - Joel Grus, June 2015

Joel Grus
Seattle DAML Meetup

June 23, 2015

Data Science from Scratch

Page 2: The Road to Data Science - Joel Grus, June 2015

About me
Old-school DAML-er
Wrote a book
SWE at Google
Formerly data science at VoloMetrix, Decide, Farecast

Page 3: The Road to Data Science - Joel Grus, June 2015

The Road to Data Science

Page 4: The Road to Data Science - Joel Grus, June 2015

My Road to Data Science

Page 5: The Road to Data Science - Joel Grus, June 2015
Page 6: The Road to Data Science - Joel Grus, June 2015

Grad School

Page 7: The Road to Data Science - Joel Grus, June 2015
Page 8: The Road to Data Science - Joel Grus, June 2015
Page 9: The Road to Data Science - Joel Grus, June 2015

Fareology

Page 10: The Road to Data Science - Joel Grus, June 2015

Data Science Is A Broad Field

Some Stuff

More Stuff

Even More Stuff

Data Science

People who think they're data scientists, but they're not really data scientists

People who are a danger to everyone around them

People who say "machine learnings"

Page 11: The Road to Data Science - Joel Grus, June 2015
Page 32: The Road to Data Science - Joel Grus, June 2015

a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS

Page 33: The Road to Data Science - Joel Grus, June 2015

A lot of stuff!

Page 34: The Road to Data Science - Joel Grus, June 2015

What Are Hiring Managers Looking For?

Page 35: The Road to Data Science - Joel Grus, June 2015

What Are Hiring Managers Looking For?

Let's check LinkedIn

Page 36: The Road to Data Science - Joel Grus, June 2015
Page 37: The Road to Data Science - Joel Grus, June 2015

a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS

grad students!

Page 38: The Road to Data Science - Joel Grus, June 2015

Learning Data Science

Page 39: The Road to Data Science - Joel Grus, June 2015

I want to be a data scientist.

Great!

Page 40: The Road to Data Science - Joel Grus, June 2015

The Math Way
I like to start with matrix decompositions. How's your measure theory?

Page 41: The Road to Data Science - Joel Grus, June 2015

The Math Way
The Good:
Solid foundation
Math is the noblest known pursuit

Page 42: The Road to Data Science - Joel Grus, June 2015

The Math Way
The Good:
Solid foundation
Math is the noblest known pursuit

The Bad:
Some weirdos don't think math is fun
Can be pretty forbidding
Can miss practical skills

Page 43: The Road to Data Science - Joel Grus, June 2015

So, did you count the words in that document?

No, but I have an elegant proof that the number of words is finite!

Page 44: The Road to Data Science - Joel Grus, June 2015

OK, Let's Try Again

Page 45: The Road to Data Science - Joel Grus, June 2015

I want to be a data scientist.

Great!

Page 46: The Road to Data Science - Joel Grus, June 2015

The Tools Way
Here's a list of the 25 libraries you really ought to know. How's your R programming?

Page 47: The Road to Data Science - Joel Grus, June 2015

The Tools Way
The Good:
Don't have to understand the math
Practical
Can get started doing fun stuff right away

Page 48: The Road to Data Science - Joel Grus, June 2015

The Tools Way
The Good:
Don't have to understand the math
Practical
Can get started doing fun stuff right away

The Bad:
Don't have to understand the math
Can get started doing bad science right away

Page 49: The Road to Data Science - Joel Grus, June 2015

So, did you build that model?

Yes, and it fits the training data almost perfectly!

Page 50: The Road to Data Science - Joel Grus, June 2015

OK, Maybe Not That Either

Page 51: The Road to Data Science - Joel Grus, June 2015

So Then What?

Page 52: The Road to Data Science - Joel Grus, June 2015

Example: k-means clustering
Unsupervised machine learning technique

Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares

i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small")
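To make the objective concrete, here is a minimal sketch of the within-cluster sum-of-squares in plain Python. The helper names (squared_distance, centroid, within_cluster_sum_of_squares) and the sample points are assumptions made up for illustration; they are not from the talk.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Component-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def within_cluster_sum_of_squares(clusters):
    """Total squared distance from every point to its own cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(squared_distance(p, center) for p in cluster)
    return total


# Two ways of grouping the same four (made-up) points into k = 2 clusters.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(within_cluster_sum_of_squares(tight))  # 4.0
print(within_cluster_sum_of_squares(loose))  # 100.0
```

The grouping with the smaller total is the "smaller" clustering in the sense used above, so k-means would prefer the first grouping.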

Page 53: The Road to Data Science - Joel Grus, June 2015
Page 54: The Road to Data Science - Joel Grus, June 2015

The Math Way

Page 55: The Road to Data Science - Joel Grus, June 2015

The Math Way

Page 56: The Road to Data Science - Joel Grus, June 2015

The Tools Way
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)

Page 57: The Road to Data Science - Joel Grus, June 2015

The Tools Way
>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target

>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(copy_x=True, init='k-means++', ...
>>> print(k_means.labels_[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

Page 58: The Road to Data Science - Joel Grus, June 2015

So What To Do?

Page 59: The Road to Data Science - Joel Grus, June 2015

Bootcamps?

Page 60: The Road to Data Science - Joel Grus, June 2015

Certificate Programs?

Data Science from Scratch
This is to certify that Joel Grus has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining.
Joel Grus
June 23, 2015

Page 61: The Road to Data Science - Joel Grus, June 2015

Hey! Data scientists!

Page 62: The Road to Data Science - Joel Grus, June 2015

Learning By Building
You don't really understand something until you build it

For example, I understand garbage disposals much better now that I had to replace one that was leaking water all over my kitchen

More relevantly, I thought I understood hypothesis testing, until I tried to write a book chapter + code about it.

Page 63: The Road to Data Science - Joel Grus, June 2015

Learning By Building
Functional Programming

Page 64: The Road to Data Science - Joel Grus, June 2015

Break Things Down Into Small Functions

Page 65: The Road to Data Science - Joel Grus, June 2015

So you don't end up with something like this

Page 66: The Road to Data Science - Joel Grus, June 2015

Don't Mutate
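As a small illustration of what "don't mutate" can look like in Python, the sketch below contrasts a function that overwrites its argument with one that returns a new value. Both functions and their names are hypothetical examples, not code from the talk.

```python
def add_in_place(v, w):
    """Mutating style: overwrites the elements of v and returns nothing."""
    for i, w_i in enumerate(w):
        v[i] += w_i


def add(v, w):
    """Non-mutating style: leaves v and w alone and returns a new list."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]


original = [1, 2, 3]
total = add(original, [10, 20, 30])

print(total)     # [11, 22, 33]
print(original)  # [1, 2, 3], unchanged by add
```

With the non-mutating version, a caller never has to wonder whether a helper quietly changed its inputs, which makes small functions easier to combine.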

Page 67: The Road to Data Science - Joel Grus, June 2015

Example: k-means clustering
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares

Global optimization is hard, so use a greedy iterative approach

Page 68: The Road to Data Science - Joel Grus, June 2015

Fun Motivation: Image Posterization

Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean
I think it's fun, anyway

8 colors
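The recoloring step can be sketched on its own, assuming the cluster means have already been found. The function names (closest_mean, posterize) and the pixel values are made up for illustration, and the two "means" are hand-picked stand-ins rather than the output of a real k-means run.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (R, G, B) triplets."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_mean(pixel, means):
    """Return the mean color that is nearest to this pixel."""
    return min(means, key=lambda m: squared_distance(pixel, m))


def posterize(pixels, means):
    """Return a new list with each pixel replaced by its closest mean."""
    return [closest_mean(pixel, means) for pixel in pixels]


# Four made-up pixels: two reddish, two bluish.
pixels = [(250, 10, 10), (240, 0, 20), (5, 5, 200), (0, 20, 220)]
# Hand-picked stand-ins for k = 2 cluster means.
means = [(245, 5, 15), (2, 12, 210)]

posterized = posterize(pixels, means)
print(posterized)
# [(245, 5, 15), (245, 5, 15), (2, 12, 210), (2, 12, 210)]
```

A real image would first be flattened into a list of pixel triplets, clustered with k-means, and then passed through a step like this one.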

Page 69: The Road to Data Science - Joel Grus, June 2015

Example: k-means clustering
given some points, find k clusters by

choose k "means"
repeat:
    assign each point to cluster of closest "mean"
    recompute mean of each cluster

sounds simple! let's code!

Page 70: The Road to Data Science - Joel Grus, June 2015

def k_means(points, k, num_iters=10):
    means = list(random.sample(points, k))
    assignments = [None for _ in points]

    for _ in range(num_iters):
        # assign each point to closest mean
        for i, point_i in enumerate(points):
            d_min = float('inf')
            for j, mean_j in enumerate(means):
                d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                if d < d_min:
                    d_min = d
                    assignments[i] = j

        # recompute means
        for j in range(k):
            cluster = [point for i, point in enumerate(points)
                       if assignments[i] == j]
            means[j] = mean(cluster)

    return means

Page 79: The Road to Data Science - Joel Grus, June 2015

def k_means(points, k, num_iters=10):
    means = list(random.sample(points, k))
    assignments = [None for _ in points]

    for _ in range(num_iters):
        # assign each point to closest mean
        for i, point_i in enumerate(points):
            d_min = float('inf')
            for j, mean_j in enumerate(means):
                d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                if d < d_min:
                    d_min = d
                    assignments[i] = j

        # recompute means
        for j in range(k):
            cluster = [point for i, point in enumerate(points)
                       if assignments[i] == j]
            means[j] = mean(cluster)

    return means

start with k randomly chosen points
start with no cluster assignments
for each iteration
    for each point
        for each mean
            compute the distance
        assign the point to the cluster of the mean with the smallest distance
    find the points in each cluster
    and compute the new means

Page 81: The Road to Data Science - Joel Grus, June 2015

def k_means(points, k, num_iters=10):
    means = list(random.sample(points, k))
    assignments = [None for _ in points]

    for _ in range(num_iters):
        # assign each point to closest mean
        for i, point_i in enumerate(points):
            d_min = float('inf')
            for j, mean_j in enumerate(means):
                d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                if d < d_min:
                    d_min = d
                    assignments[i] = j

        # recompute means
        for j in range(k):
            cluster = [point for i, point in enumerate(points)
                       if assignments[i] == j]
            means[j] = mean(cluster)

    return means

Not impenetrable, but a lot less helpful than it could be

Can we make it simpler?

Page 82: The Road to Data Science - Joel Grus, June 2015

Break Things Down Into Small Functions

Page 83: The Road to Data Science - Joel Grus, June 2015

import random

def k_means(points, k, num_iters=10):
    # start with k of the points as "means"
    means = random.sample(points, k)

    # and iterate finding new means
    for _ in range(num_iters):
        means = new_means(points, means)

    return means

Page 84: The Road to Data Science - Joel Grus, June 2015

def new_means(points, means):
    # assign points to clusters
    # each cluster is just a list of points
    clusters = assign_clusters(points, means)

    # return the cluster means
    return [mean(cluster) for cluster in clusters]

Page 85: The Road to Data Science - Joel Grus, June 2015

def assign_clusters(points, means):
    # one cluster for each mean
    # each cluster starts empty
    clusters = [[] for _ in means]

    # assign each point to cluster
    # corresponding to closest mean
    for point in points:
        index = closest_index(point, means)
        clusters[index].append(point)

    return clusters

Page 86: The Road to Data Science - Joel Grus, June 2015

def closest_index(point, means):
    # return index of closest mean
    return argmin(distance(point, mean) for mean in means)

def argmin(xs):
    # return index of smallest element
    return min(enumerate(xs), key=lambda pair: pair[1])[0]

Page 87: The Road to Data Science - Joel Grus, June 2015

To Recap

k_means(points, k, num_iters=10)
mean(points)

k_means(points, k, num_iters=10)
new_means(points, means)
assign_clusters(points, means)
closest_index(point, means)
argmin(xs)
distance(point1, point2)
mean(points)
add(point1, point2)
scalar_multiply(c, point)
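The recap names distance, mean, add, and scalar_multiply, but the slides do not show their bodies. The following is one possible sketch of those four helpers using the same signatures; the implementations are assumptions for illustration, not the talk's own code.

```python
import math


def add(point1, point2):
    """Component-wise sum of two points, returned as a new list."""
    return [x + y for x, y in zip(point1, point2)]


def scalar_multiply(c, point):
    """Multiply every coordinate of a point by the number c."""
    return [c * x for x in point]


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    total = points[0]
    for point in points[1:]:
        total = add(total, point)
    return scalar_multiply(1 / len(points), total)


def distance(point1, point2):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(point1, point2)))


print(mean([(0, 0), (2, 4)]))    # [1.0, 2.0]
print(distance((0, 0), (3, 4)))  # 5.0
```

Note that this mean raises an IndexError on an empty cluster, a case that a more careful version would need to handle. Using the Euclidean distance rather than its square does not change which mean is closest.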

Page 88: The Road to Data Science - Joel Grus, June 2015

As a Pedagogical Tool
Can be used "top down" (as we did here)
    Implement high-level logic
    Then implement the details
    Nice for exposition

Can also be used "bottom up"
    Implement small pieces
    Build up to high-level logic
    Good for workshops

Page 89: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees
Want to predict whether a given Meetup is worth attending (True) or not (False)

Inputs are dictionaries describing each Meetup

{ "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" }

{ "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" }

Page 90: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees

{ "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" }

{ "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" }

[Decision tree diagram]
beer?
    free: True
    none: False
    paid: speaker?
        @jakevdp, @joelgrus: True or False leaves

Page 91: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees

class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        value = input_dict.get(self.attribute)
        subtree = self.subtree_dict[value]
        return subtree.predict(input_dict)

Page 92: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees
Again inspiration from functional programming:

type Input = Map.Map String String

data Tree = Predict Bool
          | Subtrees String (Map.Map String Tree)

always predict a specific value
look at the "beer" entry
a map from each possible "beer" value to a subtree

Page 93: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees

type Input = Map.Map String String

data Tree = Predict Bool
          | Subtrees String (Map.Map String Tree)

predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input =
    predict subtree input
  where subtree = subtrees Map.! (input Map.! a)

Page 94: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees

type Input = Map.Map String String

data Tree = Predict Bool
          | Subtrees String (Map.Map String Tree)

We can do the same, we'll say a decision tree is either
True
False
(attribute, subtree_dict)

("beer", { "free" : True, "none" : False, "paid" : ("speaker", {...})})

Page 95: The Road to Data Science - Joel Grus, June 2015

Example: Decision Trees

predict :: Tree -> Input -> Bool
predict (Predict b) _ = b
predict (Subtrees a subtrees) input =
    predict subtree input
  where subtree = subtrees Map.! (input Map.! a)

def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    else:
        # destructure tree
        attribute, subtree_dict = tree
        # find appropriate subtree
        value = input_dict[attribute]
        subtree = subtree_dict[value]
        # classify using subtree
        return predict(subtree, input_dict)
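A short usage sketch of the tuple-based tree may help. The slides elide the "speaker" subtree, so the speaker labels and their outcomes below ("@speaker_a", "@speaker_b") are hypothetical placeholders, and the inputs are trimmed to the two attributes the tree actually inspects.

```python
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    # otherwise destructure the tree and recurse into the matching subtree
    attribute, subtree_dict = tree
    value = input_dict[attribute]
    return predict(subtree_dict[value], input_dict)


# Hypothetical tree: the "speaker" branch values are invented placeholders.
tree = ("beer", {
    "free": True,
    "none": False,
    "paid": ("speaker", {"@speaker_a": True, "@speaker_b": False}),
})

free_beer = predict(tree, {"beer": "free", "speaker": "@joelgrus"})
no_beer = predict(tree, {"beer": "none"})
paid_beer = predict(tree, {"beer": "paid", "speaker": "@speaker_b"})

print(free_beer, no_beer, paid_beer)  # True False False
```

Because the tree is plain nested data, it can be printed, compared, and built up by other functions without defining any classes. An attribute value the tree has never seen raises a KeyError in this sketch.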

Page 96: The Road to Data Science - Joel Grus, June 2015

Not Just For Data Science

Page 97: The Road to Data Science - Joel Grus, June 2015

In Conclusion
Teaching data science is fun, if you're smart about it
Learning data science is fun, if you're smart about it
Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun