EECS E6690: Statistical Learning for Biological and Information Systems

Lecture 1: Introduction

Prof. Predrag R. Jelenkovic
Time: Tuesday 4:10-6:40pm
1127 Seeley W. Mudd Building

Dept. of Electrical Engineering
Columbia University, NY 10027, USA
Office: 812 Schapiro Research Bldg.
Phone: (212) 854-8174
Email: [email protected]
URL: http://www.ee.columbia.edu/~predrag

E6690 Statistical Learning: Brief Description

- Deluge of Data in Biology and Information Systems: Ongoing advancements in information systems as well as the emerging revolution in microbiology and neuroscience are creating a deluge of data, whose mining, inference and prediction will have an enormous economic, social, scientific and medical/therapeutic impact.

- Biology: For example, in biology, microarray technology is creating vast amounts of gene expression data, whose understanding could lead to better diagnostics and a potential cure for cancer.

- Information Systems: Similarly, in information systems, companies like Google, Amazon, Facebook, etc., are facing various problems on massive data sets, e.g., ranking and community detection.

This course will cover a variety of fundamental statistical (machine) learning techniques that are suitable for the emerging problems in these application areas:

- Basics of Statistics and Optimization

- Introduction to Statistical/Machine Learning Techniques
  - Supervised versus unsupervised learning
  - Inference and prediction
  - Linear versus nonlinear models
  - Training, testing and validation
  - Regularization
  - And many more

- Specifics of Biological and Information Systems Data
  - High dimensionality and need for regularization
  - Large sparse graphs
  - Community detection
  - Ranking
  - Association rules (Market basket analysis)

E6690 Statistical Learning: Course Logistics

Prerequisites: Calculus. Some knowledge of probability/statistics and optimization is strongly encouraged, but not required. Familiarity with a programming language, say Matlab, is highly desirable.

Textbooks: The following two books are the supporting references for the course. Both are available online:

ESL: Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition. Springer, 2009. https://web.stanford.edu/~hastie/Papers/ESLII.pdf

ISL: James, G., Witten, D., Hastie, T. and Tibshirani, R. An Introduction to Statistical Learning. Springer, 2014. http://www-bcf.usc.edu/~gareth/ISL/

In addition, lecture notes and research papers will be used.

Homework: Homework will be assigned biweekly (about 4 assignments).

Programming: The course uses the R language. Pointers to its free download, as well as basic examples of programming in R, will be covered in class.

Grading: Homework (20%) + Midterm (35%) + Final Project (45%).


Midterm: In class, closed book; 2-page cheat sheet allowed; 2 1/2 hours
- Mixture of problem solving and descriptive answers

Final Project: Done in groups of 2-3 students
- First, select a paper(s) from a data repository, e.g.:
  - GEO (Gene Expression Omnibus) Data Repository: https://www.ncbi.nlm.nih.gov/geo/
  - UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.html
- General Project Outline
  1. Introduction: e.g., describe the application area, problems considered, etc.
  2. Data set(s) and paper(s): e.g., describe data in detail, what was done in the paper(s), common stat/machine learning tools, etc.
  3. Reproduce the results from the paper(s)
  4. Try different techniques learned in class, or propose new ones
  5. Discussion and conclusion: e.g., compare different techniques, pros and cons, future work, etc.


Statistical Learning: What Does It Involve?

In general, (supervised) Statistical (Machine) Learning problems can typically be posed as

$$Y = f(X)$$

Problem: Estimate $f$ from training data $\{(x_i, y_i)\}$, and then use it in general, i.e., on inputs beyond the training data.

Areas involved:

- Approximation theory: for picking a class of functions

- Optimization: for fitting the training data

- Computing: fitting and testing

- Probability and Statistics: testing, error estimation

Interesting Question: What is the difference between classical programming and statistical/machine learning?

- Classical Programming: $f$ is an algorithm designed by a person

- Statistical Learning: $f$ is discovered through examples by training

General Course Objectives

- Focus/motivation: emerging applications in
  - Biology and Medicine
  - Information Technology, e.g., problems at Google, Facebook, Twitter, Amazon, etc.

- Learn fundamental concepts and techniques in statistical (machine) learning that are
  - Suitable for these application areas
  - Useful and applicable in general

- Develop the necessary knowledge as we go (e.g., Statistics, Optimization, Approximation Theory)

- Learn R

- Gain hands-on experience with a real, practical problem through a final project

Overall objective: Become an expert in Statistical/Machine Learning

Programming in R: Computing Platform

- Language and environment for statistical computing and graphics

- Free software

- Download
  - R from http://cran.r-project.org/
  - RStudio, an Integrated Development Environment for R, from http://www.rstudio.com/products/rstudio/download/

- Resources
  - R for beginners: https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
  - Quick-R: http://www.statmethods.net/index.html
  - Cookbook for R: http://www.cookbook-r.com/
  - R for Data Science: http://r4ds.had.co.nz/
  - Try R: http://tryr.codeschool.com/

Brief Statistics Review

Example

The following numbers are particle (contamination) counts for a sample of 10 semiconductor silicon wafers:

50 48 44 56 61 52 53 55 67 51

Over a long run the process average for wafer particle counts has been 50 counts per wafer, and on the basis of the sample, we want to test whether a change has occurred.

- Are the data consistent with a given hypothesis?

- Idea: Data → scalar with a known distribution → likelihood

- Not a unique "transformation"

Estimates

- A statistic is a property of sample data taken from a population

- A point estimate of some unknown parameter is a statistic that provides a best guess at the parameter value

- A point estimate $\hat\theta$ is unbiased if $E\hat\theta = \theta$

- $X_1, X_2, \ldots, X_n$: i.i.d. with mean $\mu$ and variance $\sigma^2$

- Examples
  - Sample mean
    $$\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$$
  - Sample variance
    $$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar X)^2$$

- Variability: $\mathrm{Var}(\bar X) = \sigma^2/n \approx SE(\bar X)^2$
  SE is standard error, $SE(\bar X)^2 = S^2/n$
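As a small illustration, the estimates above can be computed for the wafer counts from the earlier example. This is only a sketch; the variable names (`counts`, `x_bar`, `s2`, `se`) are chosen here and are not part of the lecture.

```r
# Particle counts for the 10 wafers from the example above
counts <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)
n <- length(counts)

x_bar <- sum(counts) / n                      # sample mean
s2    <- sum((counts - x_bar)^2) / (n - 1)    # sample variance (divides by n - 1)
se    <- sqrt(s2 / n)                         # standard error of the sample mean

# R's built-in mean() and var() use the same definitions
stopifnot(isTRUE(all.equal(x_bar, mean(counts))),
          isTRUE(all.equal(s2, var(counts))))

cat("mean =", x_bar, " variance =", round(s2, 3), " SE =", round(se, 3), "\n")
```

Note that `var()` in R divides by n - 1, matching the definition of $S^2$ above.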

Variability of estimates: Known variance

- If $X_1, \ldots, X_n$ are i.i.d. normal, then
  - $\bar X$ is normal:
    $$\frac{\bar X - \mu}{\sqrt{\sigma^2/n}} \sim \mathcal{N}(0, 1)$$
  - $S^2$ has a known distribution:
    $$\frac{n-1}{\sigma^2} S^2 \sim \chi^2_{n-1},$$
    where $\chi^2_{n-1}$ (chi-square) is the distribution of the sum of $(n-1)$ squares of independent standard normal random variables
  - $\bar X$ and $S^2$ are independent

- ... if not, then CLT:
  $$\frac{\bar X - \mu}{\sqrt{\sigma^2/n}} \Rightarrow \mathcal{N}(0, 1)$$
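The distributional statements above can be checked by simulation. The sketch below draws many normal samples and compares empirical moments with the theoretical ones (a chi-square variable with k degrees of freedom has mean k and variance 2k). The values of `n`, `mu`, `sigma`, `reps` and the seed are arbitrary choices made for this illustration.

```r
set.seed(1)                 # arbitrary seed, only for reproducibility
n <- 10; mu <- 50; sigma <- 5
reps <- 20000

# Each column is one i.i.d. normal sample of size n
samples <- matrix(rnorm(n * reps, mean = mu, sd = sigma), nrow = n)
x_bar <- colMeans(samples)
s2    <- apply(samples, 2, var)

z    <- (x_bar - mu) / sqrt(sigma^2 / n)   # should behave like N(0, 1)
chi2 <- (n - 1) * s2 / sigma^2             # should behave like chi-square, n - 1 df

cat("mean(z) =", mean(z), " var(z) =", var(z), "\n")
cat("mean(chi2) =", mean(chi2), "(theory:", n - 1, ")",
    " var(chi2) =", var(chi2), "(theory:", 2 * (n - 1), ")\n")
# Independence implies zero correlation, so this should be close to 0
cat("cor(x_bar, s2) =", cor(x_bar, s2), "\n")
```

A near-zero correlation is consistent with independence of $\bar X$ and $S^2$ but does not prove it; the simulation is only a sanity check.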

Variability of estimates: Unknown variance

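- If $X_1, \ldots, X_n$ are i.i.d. normal, then
  - t-statistic:
    $$\frac{\bar X - \mu}{\sqrt{S^2/n}} \sim \frac{\mathcal{N}(0, 1)}{\sqrt{\chi^2_{n-1}/(n-1)}} \sim t_{n-1},$$
    where $t_{n-1}$ is Student's t-distribution with $(n-1)$ degrees of freedom

- $t_n$: for independent $Z \sim \mathcal{N}(0, 1)$ and $V \sim \chi^2_n$,
  $$\frac{Z}{\sqrt{V/n}} \sim t_n$$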

t-distribution
- Zero mean
- Variance ($n > 2$): $n/(n-2)$

[Figure: "PDFs of t distributions": density versus x value for degrees of freedom n = 3, 5, 8, 30, together with the normal density]
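A short numerical sketch of the two facts on this slide, using the same degrees of freedom as in the figure: the t density approaches the normal density as n grows, and its variance equals n/(n - 2). The variance is obtained here by numerical integration, which is just one convenient way to check the formula; the variable names are illustrative.

```r
dfs <- c(3, 5, 8, 30)

# Height of each t density at 0: it rises toward the normal value as df grows
peak <- dt(0, dfs)
cat("peaks:", round(peak, 4), " normal:", round(dnorm(0), 4), "\n")

# Variance by numerical integration of u^2 * density, versus n / (n - 2)
t_var <- sapply(dfs, function(df)
  integrate(function(u) u^2 * dt(u, df), -Inf, Inf)$value)
print(cbind(df = dfs, numeric = t_var, formula = dfs / (dfs - 2)))
```

The heavier tails for small n show up both as a lower peak at 0 and as a variance well above 1.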

t-test

- Null hypothesis $H_0 : \mu = \mu_0$

- Under $H_0$, the t-statistic
  $$t = \frac{\bar X - \mu_0}{\sqrt{S^2/n}} \sim t_{n-1}$$
  and the corresponding p-value is the probability of observing $|t_{n-1}|$ that is $\ge |t|$, i.e., $p = P[|t_{n-1}| \ge |t|]$.

- Large values of $|t|$ are unlikely under $H_0$

- Typically:
  - reject if $p < 0.01$
  - accept if $p > 0.1$
  - not sure if $0.01 \le p \le 0.1$.

  (Or, simply: if $p < 0.05$ → reject, if $p \ge 0.05$ → accept)
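Applied to the wafer counts from the earlier example (with $\mu_0 = 50$), the test can be sketched as follows. The manual computation follows the formulas above, and R's built-in `t.test` is used as a cross-check; the variable names are illustrative.

```r
counts <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)
mu0 <- 50                                  # long-run process average (H0)
n   <- length(counts)

# t-statistic and two-sided p-value from the formulas above
t_stat <- (mean(counts) - mu0) / sqrt(var(counts) / n)
p_val  <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# Same test with the built-in function
res <- t.test(counts, mu = mu0)
stopifnot(isTRUE(all.equal(unname(res$statistic), t_stat)),
          isTRUE(all.equal(res$p.value, p_val)))

cat("t =", round(t_stat, 3), " p =", round(p_val, 3), "\n")
```

Here t is about 1.78 on 9 degrees of freedom and the p-value is about 0.11, so by the rules of thumb above the sample does not give grounds to reject the hypothesis that the process average is still 50.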

Intro to Statistical Learning

Supervised vs. unsupervised learning

- Supervised learning: there is an input-output relationship
  $$Y = f(X)$$
  - $X$: vector of $p$ predictor measurements
  - $Y$: outcome measurements
  - Two problems:
    - Regression: $Y$ is quantitative
    - Classification: $Y$ is categorical
  - Training data (observations): $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
  - Objectives:
    - Prediction
    - Inference

- Unsupervised learning: no outcome variable $Y$
  - Objective can be vague: just exploring data
  - Learn interesting phenomena in data, e.g.:
    - Clustering, community detection, data association, low-dimensional representation

Learning

- Let $Y$ be the output variable, and $X$ the input variables $X_1, X_2, \ldots, X_p$. Then
  $$Y = f(X) + \varepsilon$$

- Want to estimate what $f$ is

- $\varepsilon$ is unavoidable noise that is independent of $X$ and has zero mean

- How to estimate $f$ from the data? How to evaluate the estimate?

- Given an estimate $\hat f$ for $f$, predict unavailable values of $Y$ for known values of $X$: $\hat Y = \hat f(X)$

- Reducible and irreducible errors:
  - $\hat f$ is not exactly $f$, but $f$ can potentially be learnt given enough data
  - even if $f$ is known, there is error: $\varepsilon = Y - f(X)$
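The two kinds of error can be seen in a small simulation. Everything in the sketch below is an assumption made for illustration: the "true" $f$ is taken to be `sin`, the noise standard deviation is 0.3, and the estimate $\hat f$ is a deliberately crude straight-line fit.

```r
set.seed(2)                      # arbitrary seed
f <- function(x) sin(x)          # "true" f, chosen only for illustration
sigma <- 0.3                     # noise standard deviation (assumed)

x_train <- runif(200, 0, 6);  y_train <- f(x_train) + rnorm(200, sd = sigma)
x_test  <- runif(5000, 0, 6); y_test  <- f(x_test)  + rnorm(5000, sd = sigma)

# A deliberately crude estimate of f: a straight line
fit   <- lm(y ~ x, data = data.frame(x = x_train, y = y_train))
y_hat <- predict(fit, newdata = data.frame(x = x_test))

mse_hat  <- mean((y_test - y_hat)^2)        # reducible + irreducible error
mse_true <- mean((y_test - f(x_test))^2)    # irreducible error only, about sigma^2
cat("MSE with estimated f:", mse_hat, "\nMSE with true f:", mse_true,
    "(sigma^2 =", sigma^2, ")\n")
```

Even when the true $f$ is used for prediction, the mean squared error stays near $\sigma^2$; a better estimate $\hat f$ can only shrink the gap between the two numbers.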

Two approaches to estimate f

- Parametric
  - Assume a specific form of $f$
  - Example: the linear model
    $$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p$$
  - Use training data to choose the values of parameters $\beta_0, \beta_1, \ldots, \beta_p$
  - Pro: easier to estimate parameters than an arbitrary function
  - Con: the choice of $f$ might be (very) wrong

- Non-parametric
  - Make the parametric form more flexible
  - This makes $\hat f$ more complex and potentially following the noise too closely, thereby overfitting
  - Get $\hat f$ as close as possible to the data points, subject to not being too non-smooth
  - Pro: more likely to get $f$ right, especially if $f$ is "strange"
  - Con: more data is needed to obtain a good estimate for $f$
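A minimal sketch of the contrast, on synthetic data with a nonlinear $f$ (again `sin`, an assumption made for illustration). The parametric fit uses `lm`; for the non-parametric fit, local regression via `loess` is used here as one possible smoother, not as the method prescribed by the course.

```r
set.seed(3)                              # arbitrary seed
f <- function(x) sin(x)                  # illustrative "strange" (nonlinear) f
d <- data.frame(x = runif(300, 0, 6))
d$y <- f(d$x) + rnorm(300, sd = 0.3)

grid <- data.frame(x = seq(0.5, 5.5, length.out = 101))   # evaluation points

# Parametric: assume the linear form beta0 + beta1 * x
fit_lin <- lm(y ~ x, data = d)
# Non-parametric: local regression; span controls how smooth the fit is
fit_np  <- loess(y ~ x, data = d, span = 0.5)

err_lin <- mean((predict(fit_lin, grid) - f(grid$x))^2)
err_np  <- mean((predict(fit_np,  grid) - f(grid$x))^2)
cat("mean squared distance from true f: linear =", err_lin,
    " loess =", err_np, "\n")
```

With a wrong parametric form the linear fit stays far from $f$ no matter how much data is available, while the smoother tracks $f$ at the cost of needing enough points in every neighborhood.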

Example

[Figure: two scatter plots of y versus x (x from 5 to 20, y from 0.0 to 1.0), with panels titled "training" and "testing"]

- More complicated models are not always better, e.g., because of overfitting

- Amount of available data

- Interpretability
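The training/testing contrast can be reproduced on synthetic data. In the sketch below the true function, noise level, sample sizes and polynomial degrees are all assumptions chosen for illustration; the data are not the ones shown in the figure.

```r
set.seed(4)                                   # arbitrary seed
f <- function(x) sin(x / 3)                   # illustrative true f on [1, 20]
make_data <- function(n) {
  x <- runif(n, 1, 20)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.2))
}
train <- make_data(20)
test  <- make_data(2000)

# Fit polynomials of increasing degree and record both errors
degrees <- c(1, 3, 10)
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train = mean((train$y - predict(fit, train))^2),
    test  = mean((test$y  - predict(fit, test))^2))
}))
print(mse)
```

The training error can only decrease as the degree grows, because the models are nested. The testing error typically stops improving and then deteriorates once the polynomial starts following the noise in the 20 training points.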

Linear Regression

Idea

- Simple approach to supervised learning

- Assumes linear dependence of quantitative $Y$ on $X_1, X_2, \ldots, X_p$

- True regression functions are never linear!

- Extremely useful both conceptually and practically

Data set

- Will use Advertising.csv to illustrate concepts
- 200 observations:

"","TV","Radio","Newspaper","Sales"
"1",230.1,37.8,69.2,22.1
"2",44.5,39.3,45.1,10.4
"3",17.2,45.9,69.3,9.3
...
"198",177,9.3,6.4,12.8
"199",283.6,42,66.2,25.5
"200",232.1,8.6,8.7,13.4

Advertising data set

[Figure: scatter-plot matrix of the variables TV, Radio, Newspaper, and Sales]

Single predictor: TV vs. Sales

> adv<-read.csv("advertising.csv",header=TRUE,sep=",")
> plot(adv$TV,adv$Sales,xlab="TV",ylab="Sales",col="red")

[Figure: scatter plot of Sales versus TV]

- Linear model
  $$Y = \beta_0 + \beta_1 X + \varepsilon,$$
  where
  - $\beta_0$ and $\beta_1$: unknown constants/parameters/coefficients (intercept and slope)
  - $\varepsilon$: error term

Single predictor: Model selection

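- Estimate $\beta_0$ and $\beta_1$ based on data

- Given estimates $\hat\beta_0$ and $\hat\beta_1$, predict future sales using
  $$\hat y = \hat\beta_0 + \hat\beta_1 x$$

- $\hat y$: prediction of $Y$ given $X = x$

- Residuals: $y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$

- Select $\hat\beta_0$ and $\hat\beta_1$ to "minimize" residuals

- How to minimize a vector?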

Need to Define Distance: Vector norms

- Example: $l_p$ norm
  $$\|z\|_p = \left(\sum_{i=1}^{n} |z_i|^p\right)^{1/p}$$

- Example: 3 data points, $\{(0, 1), (1, 0), (2, 1)\}$. The result depends on the choice of the norm (!). (The fitted line is parallel to the x-axis due to symmetry.)

[Figure: the three data points plotted for x from 0.0 to 2.0 and y from 0.0 to 1.0]
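For the three-point example, the dependence on the norm can be checked numerically. Since the fitted line is horizontal by symmetry, the sketch below searches only over its height `h`; the function and variable names are illustrative, and `optimize` is used as a generic one-dimensional minimizer.

```r
y <- c(1, 0, 1)        # y-values of the points (0,1), (1,0), (2,1)

# l_p norm of the residual vector when the fit is the horizontal line at height h
lp_norm <- function(h, p) sum(abs(y - h)^p)^(1 / p)

best_l1 <- optimize(lp_norm, interval = c(0, 1), p = 1)$minimum
best_l2 <- optimize(lp_norm, interval = c(0, 1), p = 2)$minimum
# l_infinity norm: the largest absolute residual
best_linf <- optimize(function(h) max(abs(y - h)), interval = c(0, 1))$minimum

cat("l1:", best_l1, " l2:", best_l2, " l_inf:", best_linf, "\n")
```

Up to the tolerance of the optimizer, the three heights are 1 (the median of the y-values) for $l_1$, 2/3 (their mean) for $l_2$, and 1/2 (the midrange) for $l_\infty$.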

l2 regression: Least squares

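- $\min \|y - \hat y\|_2$

- Residual Sum of Squares (RSS):
  $$\mathrm{RSS} \equiv \mathrm{RSS}(\hat\beta_0, \hat\beta_1) = \|y - \hat y\|_2^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

- Least squares approach: $\min_{\hat\beta_0, \hat\beta_1} \mathrm{RSS}$

- Solution:
  $$\hat\beta_1 = \frac{\sum_{i=1}^{n} (y_i - \bar y)(x_i - \bar x)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x,$$
  where $\bar x = n^{-1}\sum_{i=1}^{n} x_i$ and $\bar y = n^{-1}\sum_{i=1}^{n} y_i$ are the sample means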
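The closed-form solution can be compared with `lm` directly. The sketch below uses synthetic data rather than the Advertising data set; the generating coefficients (7 and 0.05), the noise level and the seed are assumptions made for illustration.

```r
set.seed(5)                                  # arbitrary seed
x <- runif(50, 0, 300)                       # synthetic predictor (not the Advertising data)
y <- 7 + 0.05 * x + rnorm(50, sd = 3)        # synthetic response; coefficients are assumptions

x_bar <- mean(x); y_bar <- mean(y)
b1 <- sum((y - y_bar) * (x - x_bar)) / sum((x - x_bar)^2)   # slope formula
b0 <- y_bar - b1 * x_bar                                    # intercept formula

fit <- lm(y ~ x)
print(rbind(formula = c(b0, b1), lm = unname(coef(fit))))
```

Both rows agree up to floating-point error, since `lm` solves the same least squares problem.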

Example

> lm1<-lm(adv$Sales~adv$TV)
> summary(lm1)

Sales = 7.032594 + 0.047537 × TV

> plot(adv$TV,adv$Sales,xlab="TV",ylab="Sales",col="red",pch=20)
> abline(lm(adv$Sales~adv$TV),col="blue",lwd=2)
> Sales_Predict<-predict(lm1)
> segments(adv$TV, adv$Sales, adv$TV, Sales_Predict)

[Figure: scatter plot of Sales versus TV with the fitted least-squares line and a vertical segment from each point to the line]

Example: l2 vs. l1

- One point in the data set modified

[Figure: two scatter plots of Sales versus TV, titled "Original data set" and "Modified data set"]
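The sensitivity of the two norms to a single modified point can be sketched on a small synthetic data set (not the Advertising data). The $l_1$ fit below relies on a known property of least absolute deviations: an optimal line can always be chosen to pass through two of the data points, so for small n every pair can simply be tried. The data, the names and this brute-force approach are illustrative choices, not the method used to produce the figure.

```r
x <- 1:20
y <- 2 + 0.5 * x          # noise-free line, chosen so the effect is easy to see
y[20] <- 50               # modify one point (its value on the line would be 12)

# l2 fit: ordinary least squares
l2 <- unname(coef(lm(y ~ x)))

# l1 fit: try the line through every pair of points and keep the one with the
# smallest sum of absolute deviations
pairs <- combn(length(x), 2)
sad <- apply(pairs, 2, function(p) {
  b1 <- (y[p[2]] - y[p[1]]) / (x[p[2]] - x[p[1]])
  b0 <- y[p[1]] - b1 * x[p[1]]
  sum(abs(y - b0 - b1 * x))
})
best <- pairs[, which.min(sad)]
l1_slope <- (y[best[2]] - y[best[1]]) / (x[best[2]] - x[best[1]])
l1 <- c(y[best[1]] - l1_slope * x[best[1]], l1_slope)

print(rbind(l2 = l2, l1 = l1))   # columns: intercept, slope
```

In this constructed case the $l_1$ line ignores the modified point and recovers slope 0.5, while the $l_2$ slope is pulled up to about 1.04.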

Coefficient estimates

- Suppose the true model is
  $$\text{Sales} = \beta_0 + \beta_1 \times \text{TV} + \varepsilon$$

- How good are the estimates $\hat\beta_0$ and $\hat\beta_1$?

[Figure: two scatter plots of Sales versus TV]

i = 1, ..., 100: Sales = 7.241734 + 0.049069 × TV

i = 101, ..., 200: Sales = 6.803818 + 0.046135 × TV

Properties of β0 and β1

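- Repeated sampling

- $\hat\beta_0$ and $\hat\beta_1$ vary

- Means: $E\hat\beta_0 = \beta_0$ and $E\hat\beta_1 = \beta_1$

- Variances:
  $$\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \mathrm{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar x^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}\right),$$
  where $\sigma^2 = \mathrm{Var}(\varepsilon)$

- An estimate of $\sigma^2$:
  $$\mathrm{RSE}^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat y_i)^2 = \frac{1}{n-2}\mathrm{RSS},$$
  where RSE is the Residual Standard Error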
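Repeated sampling is easy to imitate by simulation. In the sketch below the "true" coefficients, the noise level, the fixed design `x`, the number of repetitions and the seed are all assumptions chosen for illustration; the empirical means and variances of the estimates are compared with the formulas above.

```r
set.seed(6)                                         # arbitrary seed
n <- 100; beta0 <- 7; beta1 <- 0.05; sigma <- 3     # assumed "true" values
x <- runif(n, 0, 300)                               # design held fixed across repetitions
sxx <- sum((x - mean(x))^2)

# One repetition: new noise, new fit, new estimates
one_fit <- function() {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  fit <- lm(y ~ x)
  c(coef(fit), rse2 = sum(resid(fit)^2) / (n - 2))
}
sims <- t(replicate(2000, one_fit()))
colnames(sims) <- c("b0", "b1", "rse2")

cat("mean of b1:", mean(sims[, "b1"]), "(true:", beta1, ")\n")
cat("var of b1: ", var(sims[, "b1"]), "(formula:", sigma^2 / sxx, ")\n")
cat("var of b0: ", var(sims[, "b0"]),
    "(formula:", sigma^2 * (1 / n + mean(x)^2 / sxx), ")\n")
cat("mean RSE^2:", mean(sims[, "rse2"]), "(sigma^2:", sigma^2, ")\n")
```

The agreement is only approximate because a finite number of repetitions is used.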

Confidence intervals

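- Normality assumption: $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

- t-statistic:
  $$\frac{\hat\beta_1 - \beta_1}{SE(\hat\beta_1)} \sim t_{n-2},$$
  where
  $$SE(\hat\beta_1)^2 = \frac{\frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat y_i)^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}$$

- $(1 - \gamma)$ confidence interval:
  $$[\hat\beta_1 - SE(\hat\beta_1) \cdot t_{\gamma/2, n-2},\ \hat\beta_1 + SE(\hat\beta_1) \cdot t_{\gamma/2, n-2}]$$
  is such that
  $$P[\beta_1 \in [\hat\beta_1 - SE(\hat\beta_1) \cdot t_{\gamma/2, n-2},\ \hat\beta_1 + SE(\hat\beta_1) \cdot t_{\gamma/2, n-2}]] = 1 - \gamma,$$
  where $t_{\gamma/2, n-2}$ is the $(1 - \gamma/2)$-th quantile of the $t_{n-2}$ distribution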

Hypothesis testing

- Typical testing (null vs. alternative hypothesis):

  $H_0$: there is no relationship between $X$ and $Y$
  versus
  $H_1$: there is some relationship between $X$ and $Y$

- Formally:
  $$H_0 : \beta_1 = 0 \quad \text{vs.} \quad H_1 : \beta_1 \neq 0$$

- To test $H_0$ ($\beta_1 = 0$), compute a t-statistic:
  $$t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)},$$
  which, under $H_0$, is distributed according to a t-distribution with $(n-2)$ degrees of freedom

- Compute the p-value: the probability, under $H_0$, of observing a value of $|t_{n-2}|$ equal to $|t|$ or larger

Example

> summary(lm1)

Call:
lm(formula = adv$Sales ~ adv$TV)

Residuals:
    Min      1Q  Median      3Q     Max
-8.3860 -1.9545 -0.1913  2.0671  7.2124

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
adv$TV      0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

> qt(0.975,198)
[1] 1.972017
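The t value, p-value and a 95% confidence interval for the TV coefficient can be recomputed from the numbers printed in the summary. The sketch below copies the estimate, standard error and degrees of freedom from the output above; the variable names are illustrative.

```r
# Numbers copied from the summary(lm1) output above
b1    <- 0.047537      # estimate for adv$TV
se_b1 <- 0.002691      # its standard error
df    <- 198           # n - 2 with n = 200

t_stat <- (b1 - 0) / se_b1                                # t value for H0: beta1 = 0
p_val  <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)     # two-sided p-value
t_crit <- qt(0.975, df)                                   # quantile for gamma = 0.05
ci     <- c(lower = b1 - t_crit * se_b1, upper = b1 + t_crit * se_b1)

cat("t =", round(t_stat, 2), " p < 2e-16:", p_val < 2e-16, "\n")
print(round(ci, 4))
```

The recomputed t value matches the 17.67 in the table, and the 95% interval is roughly [0.042, 0.053]. It does not contain 0, which is consistent with rejecting $H_0 : \beta_1 = 0$.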

Reading:

ISL: Read in detail Chapter 2 and Section 3.1. Also, looking through the entire Chapters 1-3 is recommended.
