The Road to Data Science - Joel Grus, June 2015

Joel Grus
Seattle DAML Meetup

June 23, 2015

Data Science from Scratch

About me

Old-school DAML-er
Wrote a book
SWE at Google
Formerly data science at VoloMetrix, Decide, Farecast

The Road to Data Science

My Road to Data Science

Grad School

Fareology

Data Science Is A Broad Field

Some Stuff

More Stuff

Even More Stuff

Data Science

People who think they're data scientists, but they're not really data scientists

People who are a danger to everyone around them

People who say "machine learnings"

a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.

JOEL GRUS

A lot of stuff!

What Are Hiring Managers Looking For?

Let's check LinkedIn

a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.

JOEL GRUS

grad students!

Learning Data Science

I want to be a data scientist.

Great!

The Math Way

I like to start with matrix decompositions. How's your measure theory?

The Math Way

The Good:
Solid foundation
Math is the noblest known pursuit

The Bad:
Some weirdos don't think math is fun
Can be pretty forbidding
Can miss practical skills

So, did you count the words in that document?

No, but I have an elegant proof that the number of words is finite!

OK, Let's Try Again

I want to be a data scientist.

Great!

The Tools Way

Here's a list of the 25 libraries you really ought to know. How's your R programming?

The Tools Way

The Good:
Don't have to understand the math
Practical
Can get started doing fun stuff right away

The Bad:
Don't have to understand the math
Can get started doing bad science right away

So, did you build that model?

Yes, and it fits the training data almost perfectly!

OK, Maybe Not That Either

So Then What?

Example: k-means clustering

Unsupervised machine learning technique

Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares

i.e. in a way such that the clusters are as "small" as possible (for a particular conception of "small")

The Math Way
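In symbols, the within-cluster sum-of-squares objective described above is commonly written as below. The notation ($S_j$ for the $j$-th cluster, $\mu_j$ for its mean) is an assumption chosen for illustration and is not taken from the slides.

```latex
\min_{S_1, \dots, S_k} \; \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert S_j \rvert} \sum_{x \in S_j} x
```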

The Tools Way

    # a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 2))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex = 2)

The Tools Way

    >>> from sklearn import cluster, datasets
    >>> iris = datasets.load_iris()
    >>> X_iris = iris.data
    >>> y_iris = iris.target

    >>> k_means = cluster.KMeans(n_clusters=3)
    >>> k_means.fit(X_iris)
    KMeans(copy_x=True, init='k-means++', ...
    >>> print(k_means.labels_[::10])
    [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
    >>> print(y_iris[::10])
    [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

So What To Do?

Bootcamps?

Data Science from Scratch

This is to certify that Joel Grus has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining.

Joel Grus
June 23, 2015

Certificate Programs?

Hey! Data scientists!

Learning By Building

You don't really understand something until you build it

For example, I understand garbage disposals much better now that I had to replace one that was leaking water all over my kitchen

More relevantly, I thought I understood hypothesis testing, until I tried to write a book chapter + code about it.

Learning By Building

Functional Programming

Break Things Down Into Small Functions

So you don't end up with something like this

Don't Mutate
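A minimal sketch of what avoiding mutation can look like in Python. The function names are hypothetical, chosen only for illustration: the first version overwrites its argument in place, the second returns a new list and leaves the input alone.

```python
def scale_in_place(xs, c):
    # mutating version: overwrites the caller's list
    for i in range(len(xs)):
        xs[i] = c * xs[i]
    return xs

def scale(xs, c):
    # non-mutating version: builds and returns a new list
    return [c * x for x in xs]

original = [1, 2, 3]
doubled = scale(original, 2)
# original is unchanged; doubled is a separate list
```

With the non-mutating version, any other code still holding a reference to the input list sees the same values as before the call.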

Example: k-means clustering

Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares

Global optimization is hard, so use a greedy iterative approach

Fun Motivation: Image Posterization

Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean

I think it's fun, anyway

8 colors
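The posterization steps above can be sketched in plain Python roughly as follows. This is an illustration under stated assumptions, not the code from the talk: the pixels are a made-up list of (R, G, B) tuples rather than a real image, and the helper names (`squared_distance`, `vector_mean`, `posterize`) are hypothetical.

```python
import random

def squared_distance(p, q):
    # sum of squared coordinate differences
    return sum((a - b) ** 2 for a, b in zip(p, q))

def vector_mean(points):
    # component-wise average of a non-empty list of points
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def closest_index(point, means):
    # index of the mean nearest to point
    return min(range(len(means)),
               key=lambda j: squared_distance(point, means[j]))

def k_means(points, k, num_iters=10):
    # start from k of the points, then alternate assignment and averaging
    means = random.sample(points, k)
    for _ in range(num_iters):
        clusters = [[] for _ in means]
        for point in points:
            clusters[closest_index(point, means)].append(point)
        # assumption: keep the old mean if a cluster ends up empty
        means = [vector_mean(c) if c else m
                 for c, m in zip(clusters, means)]
    return means

def posterize(pixels, k, num_iters=10):
    # recolor every pixel to the (rounded) mean of its cluster
    means = k_means(pixels, k, num_iters)
    palette = [tuple(int(round(c)) for c in m) for m in means]
    return [palette[closest_index(p, means)] for p in pixels]

# a tiny made-up "image": four reddish and four bluish pixels
pixels = [(250, 10, 10), (240, 20, 5), (255, 0, 0), (245, 5, 15),
          (10, 10, 250), (0, 20, 240), (5, 0, 255), (15, 5, 245)]
posterized = posterize(pixels, k=2)
```

The result has the same number of pixels as the input but uses at most k distinct colors.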

Example: k-means clustering

given some points, find k clusters by

    choose k "means"
    repeat:
        assign each point to cluster of closest "mean"
        recompute mean of each cluster

sounds simple! let's code!

    import random

    def k_means(points, k, num_iters=10):
        means = list(random.sample(points, k))
        assignments = [None for _ in points]

        for _ in range(num_iters):
            # assign each point to closest mean
            for i, point_i in enumerate(points):
                d_min = float('inf')
                for j, mean_j in enumerate(means):
                    d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                    if d < d_min:
                        d_min = d
                        assignments[i] = j

            # recompute means
            # (mean is the component-wise average of a list of points)
            for j in range(k):
                cluster = [point for i, point in enumerate(points)
                           if assignments[i] == j]
                means[j] = mean(cluster)

        return means

start with k randomly chosen points
start with no cluster assignments
for each iteration
    for each point
        for each mean
            compute the distance
        assign the point to the cluster of the mean with the smallest distance
    find the points in each cluster
    and compute the new means

Not impenetrable, but a lot less helpful than it could be

Can we make it simpler?

Break Things Down Into Small Functions

    import random

    def k_means(points, k, num_iters=10):
        # start with k of the points as "means"
        means = random.sample(points, k)

        # and iterate finding new means
        for _ in range(num_iters):
            means = new_means(points, means)

        return means

    def new_means(points, means):
        # assign points to clusters
        # each cluster is just a list of points
        clusters = assign_clusters(points, means)

        # return the cluster means
        return [mean(cluster) for cluster in clusters]

    def assign_clusters(points, means):
        # one cluster for each mean
        # each cluster starts empty
        clusters = [[] for _ in means]
        # assign each point to cluster
        # corresponding to closest mean
        for point in points:
            index = closest_index(point, means)
            clusters[index].append(point)
        return clusters

    def closest_index(point, means):
        # return index of closest mean
        return argmin(distance(point, mean) for mean in means)

    def argmin(xs):
        # return index of smallest element
        return min(enumerate(xs), key=lambda pair: pair[1])[0]

To Recap

k_means(points, k, num_iters=10)
new_means(points, means)
assign_clusters(points, means)
closest_index(point, means)
argmin(xs)

distance(point1, point2)
mean(points)

add(point1, point2)
scalar_multiply(c, point)
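The slides name the vector helpers but do not show their bodies. One plausible way to write them is sketched below. The bodies are assumptions, in particular the choice of tuples for points and of squared Euclidean distance (picked to match the inline sum of squares in the first k_means version).

```python
from functools import reduce

def add(point1, point2):
    # component-wise sum of two points
    return tuple(x + y for x, y in zip(point1, point2))

def scalar_multiply(c, point):
    # multiply every coordinate by c
    return tuple(c * x for x in point)

def mean(points):
    # component-wise average of a non-empty list of points
    total = reduce(add, points)
    return scalar_multiply(1 / len(points), total)

def distance(point1, point2):
    # assumption: squared Euclidean distance, which gives the same
    # closest mean as the true distance and avoids a square root
    return sum((x - y) ** 2 for x, y in zip(point1, point2))
```

Note that `mean` as sketched raises an error on an empty cluster, so a more careful version of the algorithm would need to decide what to do when a mean attracts no points.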

As a Pedagogical Tool

Can be used "top down" (as we did here)
    Implement high-level logic
    Then implement the details
    Nice for exposition

Can also be used "bottom up"
    Implement small pieces
    Build up to high-level logic
    Good for workshops

Example: Decision Trees

Want to predict whether a given Meetup is worth attending (True) or not (False)

Inputs are dictionaries describing each Meetup

{ "group" : "DAML", "date" : "2015-06-23", "beer" : "free", "food" : "dim sum", "speaker" : "@joelgrus", "location" : "Google", "topic" : "shameless self-promotion" }

{ "group" : "Seattle Atheists", "date" : "2015-06-23", "location" : "Round the Table", "beer" : "none", "food" : "none", "topic" : "Godless Game Night" }

Example: Decision Trees

[Decision tree diagram: the root node asks "beer?", with branches "free" (True), "none" (False), and "paid"; the "paid" branch leads to a "speaker?" node with branches "@jakevdp" and "@joelgrus", each ending in a True or False leaf.]

Example: Decision Trees

    class LeafNode:
        def __init__(self, prediction):
            self.prediction = prediction

        def predict(self, input_dict):
            return self.prediction

    class DecisionNode:
        def __init__(self, attribute, subtree_dict):
            self.attribute = attribute
            self.subtree_dict = subtree_dict

        def predict(self, input_dict):
            value = input_dict.get(self.attribute)
            subtree = self.subtree_dict[value]
            return subtree.predict(input_dict)
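As a usage sketch, the two classes can be wired up into the "beer?" / "speaker?" tree and applied to Meetup dictionaries like the ones shown earlier. The classes are repeated so the block runs on its own. The mapping from speaker to prediction is an assumption invented for this example.

```python
class LeafNode:
    def __init__(self, prediction):
        self.prediction = prediction

    def predict(self, input_dict):
        return self.prediction

class DecisionNode:
    def __init__(self, attribute, subtree_dict):
        self.attribute = attribute
        self.subtree_dict = subtree_dict

    def predict(self, input_dict):
        value = input_dict.get(self.attribute)
        return self.subtree_dict[value].predict(input_dict)

# assumption: which speaker maps to which prediction is made up here
speaker_node = DecisionNode("speaker", {
    "@jakevdp": LeafNode(True),
    "@joelgrus": LeafNode(False),
})
tree = DecisionNode("beer", {
    "free": LeafNode(True),
    "none": LeafNode(False),
    "paid": speaker_node,
})

daml = {"group": "DAML", "beer": "free", "speaker": "@joelgrus"}
game_night = {"group": "Seattle Atheists", "beer": "none"}
paid_talk = {"beer": "paid", "speaker": "@jakevdp"}  # hypothetical Meetup
predictions = [tree.predict(m) for m in (daml, game_night, paid_talk)]
```

An input whose attribute value has no matching subtree raises a KeyError in this sketch; handling unseen values is left out to keep it short.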

Example: Decision Trees

Again inspiration from functional programming:

    type Input = Map.Map String String

    data Tree = Predict Bool
              | Subtrees String (Map.Map String Tree)

Predict Bool: always predict a specific value
Subtrees String (Map.Map String Tree): the String names the entry to look at (e.g. the "beer" entry), and the map goes from each possible "beer" value to a subtree

Example: Decision Trees

    predict :: Tree -> Input -> Bool
    predict (Predict b) _ = b
    predict (Subtrees a subtrees) input = predict subtree input
      where subtree = subtrees Map.! (input Map.! a)

Example: Decision Trees

We can do the same, we'll say a decision tree is either
    True
    False
    (attribute, subtree_dict)

    ("beer", { "free" : True, "none" : False, "paid" : ("speaker", {...})})

Example: Decision Trees

    def predict(tree, input_dict):
        # leaf node predicts itself
        if tree in (True, False):
            return tree
        else:
            # destructure tree
            attribute, subtree_dict = tree
            # find appropriate subtree
            value = input_dict[attribute]
            subtree = subtree_dict[value]
            # classify using subtree
            return predict(subtree, input_dict)
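A short usage sketch for the tuple-based representation. The function is repeated in condensed form so the block runs on its own. The "speaker" subtree is elided on the slide, so the one below is made up purely for illustration.

```python
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    # otherwise destructure and recurse into the matching subtree
    attribute, subtree_dict = tree
    return predict(subtree_dict[input_dict[attribute]], input_dict)

tree = ("beer", {
    "free": True,
    "none": False,
    # assumption: this speaker subtree is invented for the example
    "paid": ("speaker", {"@jakevdp": True, "@joelgrus": False}),
})

free_beer = predict(tree, {"beer": "free", "speaker": "@joelgrus"})
paid_beer = predict(tree, {"beer": "paid", "speaker": "@joelgrus"})
```

Compared with the class-based version, the whole tree here is plain data (booleans, tuples, and dicts), and all the behavior lives in one small function.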

Not Just For Data Science

In Conclusion

Teaching data science is fun, if you're smart about it
Learning data science is fun, if you're smart about it
Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun

Thanks!

@joelgrus

joelgrus@gmail.com

joelgrus.com