20130308 Preparing data for modeling in R

Post on 27-Jun-2015


Preparing data for modeling in R

2013-03-08 @HSPH

Kazuki Yoshida, M.D., MPH-CLE student

FREEDOM TO KNOW

Group Website is at:

http://rpubs.com/kaz_yos/useR_at_HSPH

Open R Studio

Create a new script and save it.

lowbwt.dat

http://www.umass.edu/statdata/statdata/data/lowbwt.txt
http://www.umass.edu/statdata/statdata/data/lowbwt.dat

We will use the lowbwt dataset used in BIO213 Applied Regression for Clinical Research

NAME: LOW BIRTH WEIGHT DATA (LOWBWT.DAT)
KEYWORDS: Logistic Regression
SIZE: 189 observations, 11 variables

SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.

DESCRIPTIVE ABSTRACT:

The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

NOTE:

This data set consists of the complete data. A paired data set created from this low birth weight data may be found in lowbwtm11.dat and a 3 to 1 matched data set created from the low birth weight data may be found in mlowbwt.dat.


LIST OF VARIABLES:

Columns  Variable                                                 Abbreviation
-----------------------------------------------------------------------------
2-4      Identification Code                                      ID
10       Low Birth Weight (0 = Birth Weight >= 2500g,             LOW
         1 = Birth Weight < 2500g)
17-18    Age of the Mother in Years                               AGE
23-25    Weight in Pounds at the Last Menstrual Period            LWT
32       Race (1 = White, 2 = Black, 3 = Other)                   RACE
40       Smoking Status During Pregnancy (1 = Yes, 0 = No)        SMOKE
48       History of Premature Labor (0 = None, 1 = One, etc.)     PTL
55       History of Hypertension (1 = Yes, 0 = No)                HT
61       Presence of Uterine Irritability (1 = Yes, 0 = No)       UI
67       Number of Physician Visits During the First Trimester    FTV
         (0 = None, 1 = One, 2 = Two, etc.)
73-76    Birth Weight in Grams                                    BWT
-----------------------------------------------------------------------------


PEDAGOGICAL NOTES: These data have been used as an example of fitting a multiple logistic regression model.

STORY BEHIND THE DATA: Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.

References:

1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley, (1989).

lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", header = TRUE, skip = 4)

Load dataset from web

header = TRUE to pick up variable names

skip 4 rows
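The effect of header and skip can be tried without a network connection. A minimal sketch, assuming a made-up miniature file with four lines of preamble (the text, the values, and the name toy are illustrative and are not taken from lowbwt.dat):

```r
## Hypothetical miniature file: 4 preamble lines, a header row, then data
txt <- "Preamble line 1
Preamble line 2
Preamble line 3
Preamble line 4
ID LOW AGE
1 0 25
2 1 31
3 0 22"

## skip = 4 drops the preamble; header = TRUE uses the next row as variable names
toy <- read.table(text = txt, header = TRUE, skip = 4)
names(toy)
dim(toy)
```

Without skip = 4, read.table would try to parse the preamble as data.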

lbw[c(10,39), "BWT"] <- c(2655, 3035)

“Fix” dataset

Replace data points to make the dataset identical to the BIO213 dataset

10th, 39th rows

BWT column
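The same [rows, column] indexing can be checked on a small example first. A sketch with a made-up data frame (toy and its values are illustrative only):

```r
## Toy data frame with made-up values
toy <- data.frame(ID = 1:5, BWT = c(2500, 2600, 2700, 2800, 2900))

## Rows 2 and 4, column "BWT": the two replacement values are assigned in order
toy[c(2, 4), "BWT"] <- c(9999, 8888)
toy
```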

Lower case variable names

names(lbw) <- tolower(names(lbw))

Convert variable names to lower case

Put them back into variable names
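A quick check of the same idiom on a made-up data frame (toy is illustrative only):

```r
toy <- data.frame(ID = 1:2, BWT = c(2500, 3000))

## tolower() returns lower-case copies; assigning to names() replaces the originals
names(toy) <- tolower(names(toy))
names(toy)
```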

See overview

library(gpairs)
gpairs(lbw)

Recoding

Changing and creating variables

Why?

Different variable forms mean different modeling assumptions!

Variable form and assumption

• Continuous variables:
    • Linearity assumption
• Categorical variables:
    • No residual confounding assumption
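How the variable form changes the model can be seen in the number of coefficients. A minimal sketch on made-up data (toy, x, and y are illustrative and are not part of lbw):

```r
## Made-up data: x takes the values 1, 2, 3 and y does not change linearly in x
toy <- data.frame(x = rep(1:3, each = 4),
                  y = c(10, 11, 9, 10,  20, 21, 19, 20,  12, 13, 11, 12))

## Continuous form: a single slope, the same change in y per unit of x
fit.cont <- lm(y ~ x, data = toy)

## Categorical form: one coefficient per non-reference level
fit.cat <- lm(y ~ factor(x), data = toy)

length(coef(fit.cont))  # intercept + 1 slope
length(coef(fit.cat))   # intercept + 2 dummy variables
```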

Relabel race: 1, 2, 3 to White, Black, Other

lbw$race.cat <- factor(lbw$race, levels = 1:3, labels = c("White","Black","Other"))

Using this variable as continuous is meaningless!!

Take race variable

Order levels 1, 2, 3

Make 1 reference level

Label levels 1, 2, 3 as White, Black, Other

Create new variable named

race.cat
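The relabeling can be tried on a short vector of codes. A sketch with made-up codes (the vector race here is illustrative and is not lbw$race):

```r
race <- c(1, 3, 2, 1, 3)  # made-up codes

## levels gives the order of the codes; labels gives their names in that order
race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
race.cat
levels(race.cat)  # the first level is used as the reference level in regression
```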

Dichotomize ptl

lbw$preterm <- factor(ifelse(lbw$ptl >= 1, "1+", "0"))

Change to categorical

If condition is true, then “1+”

if not (else) “0”

The ifelse function gives one of two values

condition
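A small check of the dichotomization, using made-up counts (the vector ptl here is illustrative and is not lbw$ptl):

```r
ptl <- c(0, 1, 0, 2, 3)  # made-up counts of previous premature labors

## ifelse() works element by element: "1+" where the condition holds, "0" elsewhere
preterm <- factor(ifelse(ptl >= 1, "1+", "0"))
preterm
table(ptl, preterm)  # cross-check the old values against the new categories
```

Cross-tabulating the original against the recoded variable is a cheap way to confirm a recode did what was intended.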

Change 0,1 binary to No,Yes binary

lbw$smoke <- factor(ifelse(lbw$smoke == 1, "Yes", "No"))
lbw$ht <- factor(ifelse(lbw$ht == 1, "Yes", "No"))
lbw$ui <- factor(ifelse(lbw$ui == 1, "Yes", "No"))
lbw$low <- factor(ifelse(lbw$low == 1, "Yes", "No"))

equality is tested by ==, not =

if 1, return “Yes”

if not, return “No”
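When factor() is given a character vector, the levels are sorted, so "No" comes before "Yes" and becomes the reference level. A sketch on made-up codes comparing this with stating the level order explicitly (smoke.a and smoke.b are illustrative names):

```r
smoke <- c(1, 0, 0, 1)  # made-up 0/1 codes

## Levels are sorted alphabetically: "No" first, then "Yes"
smoke.a <- factor(ifelse(smoke == 1, "Yes", "No"))

## Same labels, with the level order stated explicitly
smoke.b <- factor(smoke, levels = 0:1, labels = c("No", "Yes"))

levels(smoke.a)
all(smoke.a == smoke.b)
```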

Cutting a continuous variable into categories

lbw$ftv.cat <- cut(lbw$ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))

[Number line of ftv values 0 to 6, from -Inf to Inf: (-Inf, 0] None, (0, 2] Normal, (2, Inf] Many]

breaks = c(-Inf, 0, 2, Inf)

labels = c("None","Normal","Many")

Breaks at

Label them as

4 bounds for 3 categories
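To see where each value lands, cut() can be applied to the visit counts 0 to 6. A minimal sketch (the vector ftv here is illustrative and is not lbw$ftv):

```r
ftv <- 0:6  # visit counts 0 through 6

## Intervals are right-closed by default: (-Inf,0], (0,2], (2,Inf]
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))
data.frame(ftv, ftv.cat)
```

Because the intervals include their upper bound, 0 falls in "None" and 2 falls in "Normal".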

Make “Normal” the reference level

lbw$ftv.cat <- relevel(lbw$ftv.cat, ref = "Normal")

“Normal” as reference level
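relevel() only changes the order of the levels, not the data. A sketch on a made-up factor (the values are illustrative):

```r
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))
levels(ftv.cat)  # "None" comes first

## Move "Normal" to the front so that it becomes the reference level
ftv.cat <- relevel(ftv.cat, ref = "Normal")
levels(ftv.cat)
```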

within() allows direct use of variable names

lbw <- within(lbw, {

    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Categorize ftv (frequency of visit)
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## Dichotomize ptl
    preterm <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0","1+"))

    ## Categorize smoke ht ui
    smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht    <- factor(ht, levels = 0:1, labels = c("No","Yes"))
    ui    <- factor(ui, levels = 0:1, labels = c("No","Yes"))
})

You can specify variables with variable name only. No need for lbw$

within() method
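A smaller version of the same pattern, on a made-up data frame (toy and its values are illustrative only):

```r
toy <- data.frame(race = c(1, 2, 3), ptl = c(0, 1, 2))

## Inside within(), columns are referred to by name; new variables become new columns
toy <- within(toy, {
    race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
    preterm  <- factor(ptl >= 1, levels = c(FALSE, TRUE), labels = c("0", "1+"))
})
str(toy)
```

The result must be assigned back (toy <- within(...)); within() returns a modified copy and does not change the data frame in place.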

model formula

outcome ~ predictor1 + predictor2 + predictor3

formula

SAS equivalent: model outcome = predictor1 predictor2 predictor3;

age ~ zyg

In the case of t-test

continuous variable to be compared

grouping variable to separate groups
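The t-test formula can be sketched on made-up data, with y as the continuous variable and g as the grouping variable (toy, y, and g are illustrative names, not the age and zyg variables above):

```r
## Made-up data: continuous y and a two-level grouping variable g
toy <- data.frame(y = c(5.1, 4.9, 5.6, 6.2, 6.8, 6.5),
                  g = factor(rep(c("a", "b"), each = 3)))

## continuous variable ~ grouping variable
res <- t.test(y ~ g, data = toy)
res$estimate  # mean of y in each group
```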

Variable to be explained

Variable used to explain

Y ~ X1 + X2

linear sum

• .       All variables except for the outcome
• + X2    Add X2 term
• - 1     Remove intercept
• X1:X2   Interaction term between X1 and X2
• X1*X2   Main effects and interaction term

Y ~ X1 + X2 + X1:X2

Interaction term

Main effects Interaction

Y ~ X1 * X2

Interaction term

Main effects & interaction
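One way to see what each formula operator does is to look at the columns of the design matrix that R builds from it. A sketch on made-up data (toy and its values are illustrative only):

```r
toy <- data.frame(Y = c(1, 2, 3, 4), X1 = c(0, 1, 0, 1), X2 = c(2, 2, 5, 5))

cols.main  <- colnames(model.matrix(Y ~ X1 + X2, data = toy))
cols.noint <- colnames(model.matrix(Y ~ X1 + X2 - 1, data = toy))     # no intercept
cols.colon <- colnames(model.matrix(Y ~ X1 + X2 + X1:X2, data = toy)) # explicit interaction
cols.star  <- colnames(model.matrix(Y ~ X1 * X2, data = toy))         # shorthand
cols.dot   <- colnames(model.matrix(Y ~ ., data = toy))               # everything but Y

cols.noint
cols.star
identical(cols.colon, cols.star)
```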

Y ~ X1 + I(X2 * X3)

On-the-fly variable manipulation

New variable (X2 times X3) created on-the-fly and used

I() inhibits formula interpretation, so * is treated as arithmetic multiplication
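The design matrix shows that I(X2 * X3) becomes a single column holding the arithmetic product. A sketch on made-up data (toy and its values are illustrative only):

```r
toy <- data.frame(Y = c(1, 2, 3, 4), X1 = c(0, 1, 0, 1),
                  X2 = c(2, 2, 5, 5), X3 = c(1, 3, 1, 3))

mm <- model.matrix(Y ~ X1 + I(X2 * X3), data = toy)
colnames(mm)
mm[, "I(X2 * X3)"]  # element-wise product of X2 and X3
```

Without I(), X2 * X3 would be read as a formula operator and expanded to the two main effects plus their interaction.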

lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)

Fit a model

lm.full

See model object

Call: command repeated

Coefficient for each variable

summary(lm.full)

See summary

Call: command repeated

Model F-test

Residual distribution

Dummy variables created

R^2 and adjusted R^2

Coef/SE = t

ftv.catNone: People with no 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level)

ftv.catMany: People with many 1st trimester visits compared to people with a normal number of 1st trimester visits (reference level)

race.catBlack: Black people compared to White people (reference level)

race.catOther: Other people compared to White people (reference level)
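The Coef/SE = t relationship can be checked from the coefficient table returned by summary(). A sketch on simulated data (toy, fit, and the simulation settings are assumptions for illustration, not the lbw model):

```r
set.seed(1)  # simulated data for illustration
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)

fit <- lm(y ~ x, data = toy)
ctab <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)
ctab

## t value is the estimate divided by its standard error
all.equal(ctab[, "Estimate"] / ctab[, "Std. Error"], ctab[, "t value"])
```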

confint(lm.full)

Confidence intervals

Lower boundary

Upper boundary

Confidence intervals
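For a linear model, the boundaries are the estimate plus or minus a t quantile times the standard error. A sketch on simulated data that reproduces confint() by hand (toy, fit, and the simulation settings are assumptions for illustration, not the lbw model):

```r
set.seed(1)  # simulated data for illustration
toy <- data.frame(x = 1:20)
toy$y <- 3 + 2 * toy$x + rnorm(20)
fit <- lm(y ~ x, data = toy)

ci <- confint(fit)  # default level = 0.95
ci

## Same interval by hand: estimate +/- t quantile * standard error
ctab <- coef(summary(fit))
tq <- qt(0.975, df = df.residual(fit))
manual <- cbind(ctab[, "Estimate"] - tq * ctab[, "Std. Error"],
                ctab[, "Estimate"] + tq * ctab[, "Std. Error"])
all.equal(unname(ci), unname(manual))
```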