BIO503: Lecture 5

Harvard School of Public Health Wintersession 2009

Jess Mar

Department of Biostatistics

jmar@hsph.harvard.edu

Roadmap for Today

Some More Advanced Statistical Models:

Multiple Linear Regression

Generalized Linear Models

– Logistic Regression

– Poisson Regression

– Survival Analysis

Multivariate Data Analysis

Programming Tutorials

Bits & Pieces

Tutorial 4

Multiple Linear Regression

Some handy functions to know about:

new.model <- update(old.model, new.formula)

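Model selection functions: drop1, add1 and step are in the stats package; dropterm, addterm and stepAIC are their counterparts in the MASS package.

drop1, dropterm

add1, addterm

step, stepAIC

Similarly, to test the terms of a fitted model:

anova(modObj, test="Chisq")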
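As a rough illustration of how these functions fit together, the sketch below uses the built-in mtcars data. The data set and the choice of predictors are assumptions made for the example; the lecture does not name a data set on this slide.

```r
library(MASS)  # provides dropterm, addterm and stepAIC

# Full model on the built-in mtcars data (illustrative choice, not from the lecture)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# update() refits with a modified formula; ". ~ . - qsec" drops qsec
reduced <- update(full, . ~ . - qsec)

# F tests for removing each single term from the full model
drop1(full, test = "F")

# Direct comparison of the two nested models
anova(reduced, full)

# Stepwise selection by AIC, starting from the full model
best <- stepAIC(full, trace = FALSE)
formula(best)
```

The same calls work on glm fits, where test="Chisq" is the usual choice for anova and drop1.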

Generalized Linear Models

Linear regression models hinge on the assumption that, given the covariates, the response variable follows a Normal distribution.

Generalized linear models handle non-Normal response variables (e.g. binary or count data) by relating the mean response to a linear predictor through a link function.

Logistic Regression

When faced with a binary response Y ∈ {0, 1}, we use logistic regression.

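$$\pi_i = P(Y_i = 1 \mid x_i, \beta), \qquad x_i = (x_{i1}, \ldots, x_{ip})^T, \qquad \beta = (\beta_1, \ldots, \beta_p)^T$$

where

$$\log\frac{P(Y_i = 1 \mid x_i, \beta)}{P(Y_i = 0 \mid x_i, \beta)} = \log\frac{\pi_i}{1 - \pi_i} = x_i^T \beta = \sum_j \beta_j x_{ij}$$

$$\pi_i = \frac{\exp\left(\sum_j \beta_j x_{ij}\right)}{1 + \exp\left(\sum_j \beta_j x_{ij}\right)}$$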

Problem 2 – Logistic Regression

Read in the anesthetic data set (data file: anaesthetic.txt).

Variables:

move: binary numeric vector for patient movement (1 = movement, 0 = no movement)

conc: anesthetic concentration

Goal: estimate how the probability of movement varies with increasing concentration of the anesthetic agent.

Fit the Logistic Regression Model

The response used here, nomove, is taken to be the complement of move (1 = no movement), so the model describes the probability of not moving.

> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)

The output summary looks like this:

> summary(anes.logit)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **

Estimates of P(Y=1) are given by:

> fitted.values(anes.logit)
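The data file itself is not reproduced here, so the following is only a sketch on simulated data. The data frame sim, its concentration levels and the coefficients used to generate it are hypothetical stand-ins. It shows the same glm call and checks that the fitted values are the inverse logit of the linear predictor.

```r
# Simulated stand-in for the anesthetic data (hypothetical values, not the real file)
set.seed(503)
conc   <- rep(c(0.8, 1.0, 1.2, 1.4, 1.6), each = 50)
nomove <- rbinom(length(conc), size = 1, prob = plogis(-6.5 + 5.5 * conc))
sim    <- data.frame(conc = conc, nomove = nomove)

# Same model form as on the slide
fit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = sim)
summary(fit)$coefficients

# Fitted values are P(Y = 1): the inverse logit of the linear predictor
p_hat <- fitted.values(fit)
all.equal(unname(p_hat), unname(plogis(fit$linear.predictors)))

# Predicted probability of no movement at new concentrations
predict(fit, newdata = data.frame(conc = c(1.0, 1.5)), type = "response")
```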

Estimating Log Odds

To get back the log odds (the linear predictor) for each observation:

> anes.logit$linear.predictors

> plot(anesthetic$conc, anes.logit$linear.predictors)

> abline(coefficients(anes.logit))

Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.
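A small worked calculation may help connect the coefficients to these statements. The sketch below plugs the two estimates printed in the summary output (-6.469 and 5.567) into the logistic formula. The grid of concentrations is an arbitrary choice for illustration.

```r
# Coefficient estimates copied from the summary output in the lecture
b0 <- -6.469
b1 <-  5.567

# Arbitrary grid of concentrations, for illustration only
conc <- c(0.8, 1.0, 1.2, 1.4, 1.6)

log_odds <- b0 + b1 * conc       # linear predictor
p_nomove <- plogis(log_odds)     # exp(eta) / (1 + exp(eta))
data.frame(conc, log_odds, p_nomove)

# Concentration at which P(no movement) = 0.5, i.e. log odds = 0
conc50 <- -b0 / b1

# Multiplicative change in the odds of not moving per 0.1 increase in conc
or_per_tenth <- exp(b1 * 0.1)
```

Under this fit the log odds cross zero at a concentration of about 1.16, and the coefficient for conc is the log odds ratio per unit increase in concentration.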

Problem 3 – Multiple Logistic Regression

Read in data set birthwt.txt.

low: indicator of birth weight less than 2.5 kg

age: mother's age in years

lwt: mother's weight in pounds at last menstrual period

race: mother's race (1 = white, 2 = black, 3 = other)

smoke: smoking status during pregnancy

ptl: number of previous premature labours

ht: history of hypertension

ui: presence of uterine irritability

ftv: number of physician visits during the first trimester

bwt: birth weight in grams

We fit a logistic regression using the glm function and using the binomial family.
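One possible version of this fit is sketched below. It assumes that birthwt.txt holds the same variables as the birthwt data frame shipped with the MASS package, and uses that copy so the example runs without the file. The choice of predictors is illustrative; bwt is left out because low is derived from it.

```r
library(MASS)  # ships a birthwt data frame with the variables listed above

bw <- birthwt
# Treat race as a categorical variable rather than a number
bw$race <- factor(bw$race, labels = c("white", "black", "other"))

# Multiple logistic regression for low birth weight
bw.logit <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui + ftv,
                family = binomial, data = bw)
summary(bw.logit)

# Sequential analysis of deviance for the terms
anova(bw.logit, test = "Chisq")

# Odds ratios for each predictor
exp(coef(bw.logit))
```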

Problem 4 - Poisson Regression

Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease.

Example: schooldata.csv.

We can fit the Poisson regression model using the glm function and the poisson family.
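The contents of schooldata.csv are not shown in the slides, so the sketch below uses simulated counts instead. The variable names x and y and the generating coefficients are hypothetical.

```r
# Simulated count data (hypothetical; not the schooldata.csv file)
set.seed(1)
n <- 200
x <- runif(n, min = 0, max = 2)
y <- rpois(n, lambda = exp(0.5 + 0.8 * x))
counts <- data.frame(x = x, y = y)

# Poisson regression with the default log link
pois.fit <- glm(y ~ x, family = poisson(link = "log"), data = counts)
summary(pois.fit)$coefficients

# exp(coefficient) is the rate ratio per unit increase in x
exp(coef(pois.fit))
```

When the counts come from groups of different sizes or exposure times, a term such as offset(log(exposure)) in the formula turns the model into one for rates.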

Survival Analysis

library(survival)

Example: aml leukemia data

Kaplan-Meier curve

fit1 <- survfit(Surv(time, status) ~ 1, data=aml[1:11, ])

summary(fit1)

plot(fit1)

Log-rank test

survdiff(Surv(time, status)~x, data=aml)

Survival Analysis

Fit a Cox proportional hazards model

coxfit1 <- coxph(Surv(time, status)~x, data=aml)

summary(coxfit1)

Cumulative baseline hazard estimator:

basehaz(coxph(Surv(time, status)~x, data=aml))

Survival function for one group:

plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))
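Putting the survival calls together, a compact sketch on the aml data might look like this. The object names km, lr, cox and hr are arbitrary, and the plotting options are just one reasonable choice.

```r
library(survival)

# Kaplan-Meier curves for both groups of the aml data
km <- survfit(Surv(time, status) ~ x, data = aml)
plot(km, lty = 1:2, xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(aml$x), lty = 1:2)

# Log-rank test comparing the two groups
lr <- survdiff(Surv(time, status) ~ x, data = aml)
lr

# Cox proportional hazards model; exp(coef) is the hazard ratio
cox <- coxph(Surv(time, status) ~ x, data = aml)
hr  <- exp(coef(cox))
hr

# Model-based survival curves for each level of x
cox.curves <- survfit(cox, newdata = data.frame(x = levels(aml$x)))
```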

Tutorial 5

Cluster Analysis

Hierarchical Methods:

(Agglomerative, Divisive) + (Single, Average, Complete) Linkage…

Model-based Methods:

Mixed models. Plaid models. Mixture models…

A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.

Clustering observations on the basis of experiments or across a time series.

Clustering experiments together on the basis of observations.

Examples of Clustering Algorithms Available in R

Data matrix (rows: genes; columns: experiments or microarray slides):

$$X = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,N_E} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,N_E} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N_G,1} & x_{N_G,2} & \cdots & x_{N_G,N_E}
\end{pmatrix}$$

More generally, rows are observations and columns are different samples.

Hierarchical Methods: hclust, agnes

Partitioning Methods: som, kmeans, pam

Packages: cluster

Hierarchical Clustering

Agglomerative: from n genes in n clusters to n genes in 1 cluster.

Divisive: from n genes in 1 cluster to n genes in n clusters.

We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.

Euclidean distance

(Pearson) correlation

Source: J-Express Manual

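Different Ways to Determine Distances Between Clusters

Single linkage: smallest distance between a member of one cluster and a member of the other

Complete linkage: largest distance between a member of one cluster and a member of the other

Average linkage: mean of all pairwise distances between members of the two clusters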
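To make the distance and linkage choices concrete, here is a sketch with hclust on a small simulated matrix. The matrix is random noise with made-up gene and experiment labels, and cutting the tree into three groups is an arbitrary choice.

```r
# Simulated expression matrix: 20 genes (rows) by 6 experiments (columns)
set.seed(42)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("exp", 1:6)))

# Two notions of dissimilarity between genes
d.euc <- dist(x)                    # Euclidean distance
d.cor <- as.dist(1 - cor(t(x)))     # 1 - Pearson correlation

# Agglomerative clustering with different linkages
hc.single   <- hclust(d.euc, method = "single")
hc.complete <- hclust(d.euc, method = "complete")
hc.average  <- hclust(d.cor, method = "average")

# Dendrogram, then cut the tree into 3 clusters
plot(hc.complete, main = "Complete linkage, Euclidean distance")
groups <- cutree(hc.complete, k = 3)
table(groups)
```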

Partitioning Methods

Examples of partitioning methods are k-means, partitioning about medoids (pam).

Gap statistic:

source("http://www.bioconductor.org/biocLite.R")

biocLite("SAGx")

?gap

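The number of clusters k is chosen to maximize the gap statistic; in practice, take the smallest k such that Gap(k) ≥ Gap(k+1) − s(k+1), where s(k+1) is the standard error at k+1.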
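As an alternative to the SAGx function named above, the cluster package also provides clusGap. The sketch below swaps in clusGap on simulated data with three well-separated groups. The data, K.max = 6 and B = 50 are all illustrative assumptions.

```r
library(cluster)  # pam, clusGap, maxSE

# Simulated data: three groups of 30 points in two dimensions
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2))

# Two partitioning methods with k = 3
km <- kmeans(x, centers = 3, nstart = 20)
pm <- pam(x, k = 3)

# Gap statistic for k = 1..6, using k-means as the clustering function
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 20, verbose = FALSE)

# Smallest k with Gap(k) >= Gap(k+1) - SE(k+1)
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"],
               method = "Tibs2001SEmax")
k.hat
```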

K-means Clustering

W – within-cluster variance

B – between-cluster variance

Reference: J-Express manual
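In R, kmeans reports these quantities as sums of squares rather than variances. The sketch below, on simulated two-group data (a hypothetical example), shows where to find them and checks that they add up to the total.

```r
# Simulated data: two groups of 20 points in two dimensions
set.seed(2009)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(x, centers = 2, nstart = 10)

W <- km$tot.withinss   # total within-cluster sum of squares
B <- km$betweenss      # between-cluster sum of squares

# The two components decompose the total sum of squares
all.equal(W + B, km$totss)

# Share of the total variation explained by the clustering
B / km$totss
```

K-means looks for the partition that makes W small, which for a fixed data set is the same as making B large.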

Example: 241 genes from 19 cell samples grouped into 6 clusters.

Classification (Machine Learning)

Machine learning algorithms predict new classes based on patterns discerned from existing data.

Classification algorithms are a form of supervised learning.

Clustering algorithms are a form of unsupervised learning.

R packages:

class – contains knn, SOM

nnet

MLInterfaces (Bioconductor) – a simplified way to construct machine learning algorithms from microarray data.

Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).

Classification

Linear Discriminant Analysis: lda (in the MASS package)

Support Vector Machines: library(e1071), svm

K-nearest neighbors: knn

Tree-based methods: rpart, randomForest
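A minimal sketch of three of these classifiers follows, using the built-in iris data rather than microarray data (an assumption made so the example is self-contained). The 100/50 train/test split and k = 5 are arbitrary.

```r
library(MASS)   # lda
library(class)  # knn
library(rpart)  # classification trees

# Random split of iris into training and test sets
set.seed(7)
train.idx <- sample(nrow(iris), 100)
train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

# Linear discriminant analysis
lda.fit  <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, test)$class

# k-nearest neighbours with k = 5
knn.pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Classification tree
tree.fit  <- rpart(Species ~ ., data = train)
tree.pred <- predict(tree.fit, test, type = "class")

# Proportion of test samples classified correctly by each method
acc <- c(lda  = mean(lda.pred  == test$Species),
         knn  = mean(knn.pred  == test$Species),
         tree = mean(tree.pred == test$Species))
acc
```

Evaluating on held-out samples, as here, is what distinguishes a usable classifier from one that only memorizes the training data.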

Scaling Methods

Principal Component Analysis: prcomp

Multi-dimensional Scaling: cmdscale (classical MDS); isoMDS in the MASS package

Self Organizing Maps: SOM (class package)

Independent Component Analysis: fastICA
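The first two methods can be tried with base R alone. The sketch below applies prcomp and cmdscale to the scaled iris measurements (an illustrative data set, not one used in the lecture). It also checks the known fact that classical MDS on Euclidean distances recovers the principal component scores up to sign.

```r
# Scaled numeric measurements from the built-in iris data
x <- scale(iris[, 1:4])

# Principal component analysis
pca <- prcomp(x)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var.explained, 3)

# Classical multi-dimensional scaling on Euclidean distances
mds <- cmdscale(dist(x), k = 2)

# First MDS coordinate matches the first PC score up to sign
abs(cor(pca$x[, 1], mds[, 1]))

# Samples in the space of the first two principal components
plot(pca$x[, 1:2], col = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2")
```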

R Shortcuts

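Ctrl + A: move the cursor to the start of the line

Ctrl + E: move the cursor to the end of the line

Ctrl + K: delete from the cursor to the end of the line

Esc: interrupt the current command

{Up, Down} Arrow: scroll through the command history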

Laundry List

.Rprofile file

Outline of R packages

Graphics – lattice, Rwiki

Homework

R/SAS/Stata Comparison

Exercises