Deconstructing Data Science -...

46
Deconstructing Data Science David Bamman, UC Berkeley Info 290 Lecture 4: Regression overview Jan 26, 2017

Transcript of Deconstructing Data Science -...

Page 1: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Deconstructing Data ScienceDavid Bamman, UC Berkeley

Info 290

Lecture 4: Regression overview

Jan 26, 2017

Page 2: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Regression

x = the empire state building y = 17444.5625”

A mapping from input data x (drawn from instance space 𝓧) to a point y in ℝ

(ℝ = the set of real numbers)

Page 3: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

task x y

predicting box office revenue movie total box office

predicting stock movements $TWTR price at time t+1

predicting vote share Clinton 47%

Page 4: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Regression

Supervised learning

Given training data in the form of <x, y> pairs, learn

ĥ(x)

Page 5: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Regression

• Can you create (or find) labeled data that marks that value for a bunch of examples? Can you make that choice?

• Can you create features that might help in distinguishing those classes?

Page 6: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Experiment designtraining development testing

size 80% 10% 10%

purpose training models model selectionevaluation;

never look at it until the very

end

Page 7: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Metrics• Measure difference between the prediction ŷ and

the true y

1N

N�

i=1(yi � yi)2Mean squared error

(MSE)

1N

N�

i=1|yi � yi|Mean absolute error

(MAE)

Page 8: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

y ŷ MAE MSE

1 2 1 11 1.1 0.1 0.011 100 99 98011 5 4 161 -5 6 361 10 9 81

1 3 2 41 0.9 0.1 0.011 1 0 0

121.2 9939.02

81.7% of total MAE → 98.6% of

total MSE←

MSE error penalizes outliers more than MAE

Page 9: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Linear regression

y =F�

i=1xiβi

β ∈ ℝF

(F-dimensional vector of real numbers)

0

5

10

15

20

0 5 10 15 20x

y

Page 10: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Polynomial regression

y =F�

i=1xiβa,i +

F�

i=1x2i βb,i

βa, βb ∈ ℝF

(F-dimensional vector of real numbers)

0

100

200

300

400

-20 -10 0 10 20x

x^2

Page 11: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Polynomial regression

βa, βb, βc ∈ ℝF

(F-dimensional vector of real numbers)

y =F�

i=1xiβa,i +

F�

i=1x2i βb,i +

F�

i=1x3i βc,i

-5000

0

5000

-20 -10 0 10 20x

x^3

Page 12: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Support vector machines (regression)

Probabilistic graphical models

Networks

Neural networks

Deep learning

Decision trees

Random forests

Nonlinear regression

Page 13: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Number of Parametersorder 1

(linear reg.)

order 3 y =F�

i=1xiβa,i +

F�

i=1x2i βb,i +

F�

i=1x3i βc,i

order 2 y =F�

i=1xiβa,i +

F�

i=1x2i βb,i

y =F�

i=1xiβa,i

Page 14: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

Page 15: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

labeled data

labeled data

labeled data

𝓧instance space

Page 16: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 1, training MSE = 73.4

Page 17: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 2, training MSE = 71.9

Page 18: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 3, training MSE = 60.9

Page 19: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 4, training MSE = 60.6

Page 20: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 5, training MSE = 59.1

Page 21: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 6, training MSE = 50.2

Page 22: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 7, training MSE = 49.6

Page 23: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 8, training MSE = 46.8

Page 24: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 9, training MSE = 41.2

Page 25: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 10, training MSE = 35.8

Page 26: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 11, training MSE = 21.1

Page 27: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

degree 12, training MSE = 18.4

Page 28: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4

Page 29: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4 118.8

Page 30: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4 118.8

136.5

Page 31: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

18.4 118.8

136.5 87.7

Page 32: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

73.4 86.3

65.0 94.7

Page 33: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Overfitting• Memorizing the nuances (and noise) of the training

data that prevents generalizing to unseen data

-20

0

20

0 5 10 15 20x

y

-20

0

20

0 5 10 15 20x

y

Page 34: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Sources of error• Bias: Error due to mis-specifying the relationship

between input and the output. [too few parameters, or the wrong kinds]

• Variance: Error due to sensitivity to random fluctuations in the training data. If you train on different data, do you get radically different predictions?[too many parameters]

Page 35: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

X XX

XX

X

X

X

X

X

X

X

X

X

X

X XX

XX

Low variance High variance

Low bias

High bias

Image from Flach 2012

Page 36: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

X XX

XX High bias, low variance: Always

predict “Berkeley”

X

X

X

X

X

High bias, high variance: Predict most frequent city in training data

X

X

X

X

X

Low bias, high variance: many features, some of which capture true signal but capture random noise

X XX

XX

Low bias, low variance: enough features to capture the true signal

Example: geolocation on

Twitter

Page 37: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Ordinal regression• In between classification and regression

• 𝒴 is categorical (e.g.,✩, ✩✩, ✩✩✩)

• Elements of 𝒴 are ordered

• ✩ < ✩✩ • ✩✩ < ✩✩✩ • ✩ < ✩✩✩

Page 38: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

task 𝓧 𝒴

predicting star ratings movie {✩, ✩✩, ✩✩✩}

Ordinal regression

Page 39: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

• Sarah Cohen, James T. Hamilton, and Fred Turner, “Computational Journalism,” Communications of the ACM (2011)

• Sylvain Parasie, “Data-Driven Revelation? Epistemological tensions in investigative journalism in the age of ‘big data,’” Digital Journalism (2015)

Computational Journalism

Page 40: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Computational Journalism

• “Changing how stories are discovered, presented, aggregated, monetized and archived” (Cohen et al. 2012)

• Draws on earlier tradition of computer-assisted reporting and “precision journalism” (Meyer 1972)

Page 41: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

• Database linking, e.g.:

• voting records to the deceased • press releases from different members of

congress • indictments/settlements from U.S. attorneys • documents from SEC, Pentagon, defense

contractors to note movement to industry (Cohen 2012)

• DSA database of safety status of CA public schools + US seismic zones + school list from CA Dept of (Parasie 2015)

Computational Journalism

Page 42: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

• Information extraction: need to pull out people, places, organizations and their relationship from large (often sudden) dumps of documents.

• Analyzing the relationship between entities

Computational Journalism

Page 43: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Computational Journalism• Data-driven stories about large-scale trends

43

Change in insured Americans under the ACA, NY Times (Oct 29, 2014)

Relationship between birth year and political views NY Times (July 7, 2014)

Page 44: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

• Data-driven lead generation; the outliers in analysis that point to a story

Computational Journalism

Page 45: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

• Demands:

• High precision • Fast turnaround

• Needs (Stray 2016):

• Accurate document analysis • Guided search • Interactive methods

Computational Journalism

Page 46: Deconstructing Data Science - Coursescourses.ischool.berkeley.edu/i290-dds/s17/slides/4_regression.pdf · “Computational Journalism,” Communications of the ACM (2011) • Sylvain

Project proposal, due 2/16• Collaborative project (involving up to 3 students), where the

methods learned in class will be used to draw inferences about the world and critically assess the quality of those results.

• Proposal (2 pages):

• outline the work you’re going to undertake • formulate a hypothesis to be examined • motivate its rationale as an interesting question worth asking • assess its potential to contribute new knowledge by

situating it within related literature in the scientific community. (cite 5 relevant sources)

• who is the team and what are each of your responsibilities (everyone gets the same grade)