The Road to Data Science - Joel Grus, June 2015

Joel Grus
Seattle DAML Meetup

June 23, 2015

Data Science from Scratch

About me
Old-school DAML-er
Wrote a book (Data Science from Scratch)
SWE at Google
Formerly data science at VoloMetrix, Decide, Farecast

The Road to Data Science

My Road to Data Science

Grad School

Fareology

Data Science Is A Broad Field

Some Stuff

More Stuff

Even More Stuff

Data Science

People who think they're data scientists, but they're not really data scientists

People who are a danger to everyone around them

People who say "machine learnings"

a data scientist should be able to run a regression, write a sql query, scrape a web site, design an experiment, factor matrices, use a data frame, pretend to understand deep learning, steal from the d3 gallery, argue r versus python, think in mapreduce, update a prior, build a dashboard, clean up messy data, test a hypothesis, talk to a businessperson, script a shell, code on a whiteboard, hack a p-value, machine-learn a model. specialization is for engineers.
JOEL GRUS

A lot of stuff!

What Are Hiring Managers Looking For?

Let's check LinkedIn

grad students!

Learning Data Science

I want to be a data scientist.
Great!

The Math Way
I like to start with matrix decompositions. How's your measure theory?

The Math Way
The Good:
  Solid foundation
  Math is the noblest known pursuit
The Bad:
  Some weirdos don't think math is fun
  Can be pretty forbidding
  Can miss practical skills

So, did you count the words in that document?

No, but I have an elegant proof that the number of words is finite!

OK, Let's Try Again

I want to be a data scientist.
Great!

The Tools Way
Here's a list of the 25 libraries you really ought to know. How's your R programming?

The Tools Way
The Good:
  Don't have to understand the math
  Practical
  Can get started doing fun stuff right away
The Bad:
  Don't have to understand the math
  Can get started doing bad science right away

So, did you build that model?

Yes, and it fits the training data almost perfectly!
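To see why a near-perfect training fit proves little, here is a minimal sketch that is not from the talk. It uses a 1-nearest-neighbor classifier (chosen only for illustration) on labels that are pure coin flips; the names `nearest_label`, `accuracy`, and `make_data` are hypothetical.

```python
import random

def nearest_label(point, training):
    """Predict the label of the closest training point (1-nearest-neighbor)."""
    closest = min(training,
                  key=lambda pair: sum((a - b) ** 2 for a, b in zip(point, pair[0])))
    return closest[1]

def accuracy(data, training):
    """Fraction of (point, label) pairs in data that the model labels correctly."""
    hits = sum(nearest_label(point, training) == label for point, label in data)
    return hits / len(data)

def make_data(n):
    """Random 2-d points with coin-flip labels, so there is nothing real to learn."""
    return [((random.random(), random.random()), random.choice([True, False]))
            for _ in range(n)]

random.seed(0)
train, test = make_data(200), make_data(200)

train_accuracy = accuracy(train, train)  # each point is its own nearest neighbor
test_accuracy = accuracy(test, train)    # should hover around coin-flip performance
print(train_accuracy, test_accuracy)
```

The model "fits the training data perfectly" simply by memorizing it, while the held-out accuracy shows that it has learned nothing.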

OK, Maybe Not That Either

So Then What?

Example: k-means clustering
Unsupervised machine learning technique

Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares

That is, in a way such that the clusters are as "small" as possible (for a particular conception of "small")
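The within-cluster sum-of-squares can be made concrete with a short sketch. This is an illustration rather than code from the talk: the helper names `squared_distance`, `centroid`, and `within_cluster_ss` are hypothetical, and points are assumed to be tuples of numbers.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Componentwise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def within_cluster_ss(clusters):
    """Sum, over all clusters, of squared distances from each point to its cluster's centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(squared_distance(p, center) for p in cluster)
    return total

# two ways of splitting the same four points into k = 2 clusters
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(within_cluster_ss(tight))  # 4.0
print(within_cluster_ss(loose))  # 100.0
```

The grouping with the smaller total is the "smaller" clustering in the sense k-means cares about.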

The Math Way

The Tools Way

    # a 2-dimensional example
    x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
               matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
    colnames(x) <- c("x", "y")
    (cl <- kmeans(x, 2))
    plot(x, col = cl$cluster)
    points(cl$centers, col = 1:2, pch = 8, cex = 2)

The Tools Way

    >>> from sklearn import cluster, datasets
    >>> iris = datasets.load_iris()
    >>> X_iris = iris.data
    >>> y_iris = iris.target

    >>> k_means = cluster.KMeans(n_clusters=3)
    >>> k_means.fit(X_iris)
    KMeans(copy_x=True, init='k-means++', ...
    >>> print(k_means.labels_[::10])
    [1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
    >>> print(y_iris[::10])
    [0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
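The two printed arrays look different, but cluster numbers are arbitrary names, so the fair comparison is agreement up to relabeling. The sketch below is an illustration using only the fifteen values printed above (every tenth sample, so it says nothing about the full dataset); the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# values copied from the printed output above
cluster_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2]
true_labels    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

# for each cluster, count which true label its members carry
votes = defaultdict(Counter)
for cluster_id, species in zip(cluster_labels, true_labels):
    votes[cluster_id][species] += 1

# rename each cluster after its most common true label
cluster_to_species = {cluster_id: counts.most_common(1)[0][0]
                      for cluster_id, counts in votes.items()}

relabeled = [cluster_to_species[c] for c in cluster_labels]
agreement = sum(r == t for r, t in zip(relabeled, true_labels)) / len(true_labels)

print(cluster_to_species)  # {1: 0, 0: 1, 2: 2}
print(agreement)           # 1.0
```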

So What To Do?

Bootcamps?

Data Science from Scratch
This is to certify that
Joel Grus
has honorably completed the course of study outlined in the book Data Science from Scratch: First Principles with Python, and is entitled to all the Rights, Privileges, and Honors thereunto appertaining.
Joel Grus
June 23, 2015

Certificate Programs?

Hey! Data scientists!

Learning By Building
You don't really understand something until you build it

For example, I understand garbage disposals much better now that I have had to replace one that was leaking water all over my kitchen

More relevantly, I thought I understood hypothesis testing until I tried to write a book chapter (plus code) about it.

Learning By Building
Functional Programming

Break Things Down Into Small Functions

So you don't end up with something like this

Don't Mutate

Example: k-means clustering
Given a set of points, group them into k clusters in a way that minimizes the within-cluster sum-of-squares

Global optimization is hard, so use a greedy iterative approach

Fun Motivation: Image Posterization

Image consists of pixels
Each pixel is a triplet (R,G,B)
Imagine pixels as points in space
Find k clusters of pixels
Recolor each pixel to its cluster mean
I think it's fun, anyway
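The recoloring step can be sketched as follows, assuming the k cluster means have already been found. This is an illustration, not code from the talk: `closest_mean` and `posterize` are hypothetical names, and the tiny pixel list and two-color palette are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (R, G, B) triplets."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def closest_mean(pixel, means):
    """Return the cluster mean nearest to this pixel."""
    return min(means, key=lambda m: squared_distance(pixel, m))

def posterize(pixels, means):
    """Recolor every pixel to the mean of the cluster it falls in."""
    return [closest_mean(pixel, means) for pixel in pixels]

# made-up data: two reddish pixels and two bluish pixels
pixels = [(250, 10, 10), (240, 0, 20), (5, 5, 200), (0, 20, 220)]
# hypothetical cluster means, rounded to whole color values
means = [(245, 5, 15), (2, 12, 210)]

recolored = posterize(pixels, means)
print(recolored)
# [(245, 5, 15), (245, 5, 15), (2, 12, 210), (2, 12, 210)]
```

With k means the output image uses exactly k colors, which is what gives the posterized look.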

8 colors

Example: k-means clustering
given some points, find k clusters by

    choose k "means"
    repeat:
        assign each point to cluster of closest "mean"
        recompute mean of each cluster

sounds simple! let's code!

    import random

    # mean(points) is a helper listed in the recap; it is not defined on this slide
    def k_means(points, k, num_iters=10):
        means = list(random.sample(points, k))
        assignments = [None for _ in points]

        for _ in range(num_iters):
            # assign each point to closest mean
            for i, point_i in enumerate(points):
                d_min = float('inf')
                for j, mean_j in enumerate(means):
                    d = sum((x - y)**2 for x, y in zip(point_i, mean_j))
                    if d < d_min:
                        d_min = d
                        assignments[i] = j

            # recompute means
            for j in range(k):
                cluster = [point for i, point in enumerate(points)
                           if assignments[i] == j]
                means[j] = mean(cluster)

        return means

start with k randomly chosen points
start with no cluster assignments
for each iteration
    for each point
        for each mean
            compute the distance
        assign the point to the cluster of the mean with the smallest distance
    find the points in each cluster
    and compute the new means
return means

Not impenetrable, but a lot less helpful than it could be

Can we make it simpler?

Break Things Down Into Small Functions

    import random

    def k_means(points, k, num_iters=10):
        # start with k of the points as "means"
        means = random.sample(points, k)

        # and iterate finding new means
        for _ in range(num_iters):
            means = new_means(points, means)

        return means

    def new_means(points, means):
        # assign points to clusters
        # each cluster is just a list of points
        clusters = assign_clusters(points, means)

        # return the cluster means
        return [mean(cluster) for cluster in clusters]

    def assign_clusters(points, means):
        # one cluster for each mean
        # each cluster starts empty
        clusters = [[] for _ in means]
        # assign each point to cluster
        # corresponding to closest mean
        for point in points:
            index = closest_index(point, means)
            clusters[index].append(point)
        return clusters

    def closest_index(point, means):
        # return index of closest mean
        return argmin(distance(point, mean) for mean in means)

    def argmin(xs):
        # return index of smallest element
        return min(enumerate(xs), key=lambda pair: pair[1])[0]

To Recap
k_means(points, k, num_iters=10)
new_means(points, means)
assign_clusters(points, means)
closest_index(point, means)
argmin(xs)
distance(point1, point2)
mean(points)
add(point1, point2)
scalar_multiply(c, point)
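The recap names distance, mean, add, and scalar_multiply without showing their bodies. A minimal sketch of what they might look like follows. It assumes points are tuples of numbers and that distance is the squared Euclidean distance, matching the inline computation in the first version of k_means; the bodies are guesses, not the talk's code.

```python
from functools import reduce

def add(point1, point2):
    """Componentwise sum of two points."""
    return tuple(x + y for x, y in zip(point1, point2))

def scalar_multiply(c, point):
    """Multiply every coordinate of a point by the number c."""
    return tuple(c * x for x in point)

def mean(points):
    """Componentwise mean; assumes points is non-empty (an empty cluster raises TypeError)."""
    total = reduce(add, points)
    return scalar_multiply(1 / len(points), total)

def distance(point1, point2):
    """Squared Euclidean distance (an assumption; any consistent distance would rank means the same way here)."""
    return sum((x - y) ** 2 for x, y in zip(point1, point2))

print(add((1, 2), (3, 4)))          # (4, 6)
print(scalar_multiply(2, (1, 2)))   # (2, 4)
print(mean([(0, 0), (2, 4)]))       # (1.0, 2.0)
print(distance((0, 0), (3, 4)))     # 25
```

Building mean out of add and scalar_multiply keeps each function small, in the same spirit as the decomposition of k_means.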

As a Pedagogical Tool
Can be used "top down" (as we did here)
  Implement high-level logic
  Then implement the details
  Nice for exposition
Can also be used "bottom up"
  Implement small pieces
  Build up to high-level logic
  Good for workshops

Example: Decision Trees
Want to predict whether a given Meetup is worth attending (True) or not (False)

Inputs are dictionaries describing each Meetup

    { "group" : "DAML",
      "date" : "2015-06-23",
      "beer" : "free",
      "food" : "dim sum",
      "speaker" : "@joelgrus",
      "location" : "Google",
      "topic" : "shameless self-promotion" }

    { "group" : "Seattle Atheists",
      "date" : "2015-06-23",
      "location" : "Round the Table",
      "beer" : "none",
      "food" : "none",
      "topic" : "Godless Game Night" }

Example: Decision Trees
[Figure: decision tree diagram for the two Meetup inputs above. Recoverable labels: branches "free" and "none", a "speaker?" node with branches "@jakevdp" and "@joelgrus", and True/False leaves.]

Example: Decision Trees

    class LeafNode:
        def __init__(self, prediction):
            self.prediction = prediction

        def predict(self, input_dict):
            return self.prediction

    class DecisionNode:
        def __init__(self, attribute, subtree_dict):
            self.attribute = attribute
            self.subtree_dict = subtree_dict

        def predict(self, input_dict):
            value = input_dict.get(self.attribute)
            subtree = self.subtree_dict[value]
            return subtree.predict(input_dict)

Example: Decision Trees
Again inspiration from functional programming:

    type Input = Map.Map String String

    data Tree = Predict Bool                           -- always predict a specific value
              | Subtrees String (Map.Map String Tree)  -- look at the "beer" entry; a map from each possible "beer" value to a subtree

Example: Decision Trees

    type Input = Map.Map String String

    predict :: Tree -> Input -> Bool
    predict (Predict b) _ = b
    predict (Subtrees a subtrees) input = predict subtree input
      where subtree = subtrees Map.! (input Map.! a)

Example: Decision Trees

    type Input = Map.Map String String

We can do the same; we'll say a decision tree is either

    True
    False
    (attribute, subtree_dict)

    ("beer", { "free" : True, "none" : False, "paid" : ("speaker", {...})})

    predict :: Tree -> Input -> Bool
    predict (Predict b) _ = b
    predict (Subtrees a subtrees) input = predict subtree input
      where subtree = subtrees Map.! (input Map.! a)

Example: Decision Trees

    def predict(tree, input_dict):
        # leaf node predicts itself
        if tree in (True, False):
            return tree
        else:
            # destructure tree
            attribute, subtree_dict = tree
            # find appropriate subtree
            value = input_dict[attribute]
            subtree = subtree_dict[value]
            # classify using subtree
            return predict(subtree, input_dict)
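A usage sketch of the tuple representation, with predict restated from the slide so that it runs on its own. The "paid" subtree is hypothetical: the slide elides it, and which speaker leads to True is an assumption made only for this example.

```python
def predict(tree, input_dict):
    # leaf node predicts itself
    if tree in (True, False):
        return tree
    # otherwise the tree is an (attribute, subtree_dict) pair
    attribute, subtree_dict = tree
    subtree = subtree_dict[input_dict[attribute]]
    return predict(subtree, input_dict)

# hypothetical tree: the "paid" branch is an assumption, not taken from the talk
tree = ("beer", {
    "free": True,
    "none": False,
    "paid": ("speaker", {"@jakevdp": True, "@joelgrus": False}),
})

daml = {"group": "DAML", "beer": "free", "speaker": "@joelgrus"}
game_night = {"group": "Seattle Atheists", "beer": "none"}
paid_event = {"group": "made-up group", "beer": "paid", "speaker": "@jakevdp"}

print(predict(tree, daml))        # True
print(predict(tree, game_night))  # False
print(predict(tree, paid_event))  # True
```

Only the attributes on the path actually taken are consulted, so game_night needs no "speaker" entry. An attribute value the tree has never seen raises KeyError in this version.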

Not Just For Data Science

In Conclusion
Teaching data science is fun, if you're smart about it

Learning data science is fun, if you're smart about it

Writing a book is not that much fun
Having written a book is pretty fun
Making slides is actually kind of fun
Functional programming is a lot of fun

Thanks!
@joelgrus

joelgrus@gmail.com

joelgrus.com
