20130308 Preparing data for modeling in R
Preparing data for modeling in R
2013-03-08 @HSPH
Kazuki Yoshida, M.D., MPH-CLE student
FREEDOM TO KNOW
Group Website is at:
http://rpubs.com/kaz_yos/useR_at_HSPH
Open RStudio
Create a new script and save it.
http://www.umass.edu/statdata/statdata/data/
lowbwt.dat
http://www.umass.edu/statdata/statdata/data/lowbwt.txt
http://www.umass.edu/statdata/statdata/data/lowbwt.dat
We will use the lowbwt dataset used in BIO213 Applied Regression for Clinical Research
NAME: LOW BIRTH WEIGHT DATA (LOWBWT.DAT)
KEYWORDS: Logistic Regression
SIZE: 189 observations, 11 variables
SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.
DESCRIPTIVE ABSTRACT:
The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of which had low birth weight babies and 130 of which had normal birth weight babies. Four variables which were thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
NOTE:
This data set consists of the complete data. A paired data set created from this low birth weight data may be found in lowbwtm11.dat and a 3 to 1 matched data set created from the low birth weight data may be found in mlowbwt.dat.
http://www.umass.edu/statdata/statdata/data/lowbwt.txt
LIST OF VARIABLES:
Columns  Variable                                                Abbreviation
-----------------------------------------------------------------------------
2-4      Identification Code                                     ID
10       Low Birth Weight (0 = Birth Weight >= 2500g,            LOW
         1 = Birth Weight < 2500g)
17-18    Age of the Mother in Years                              AGE
23-25    Weight in Pounds at the Last Menstrual Period           LWT
32       Race (1 = White, 2 = Black, 3 = Other)                  RACE
40       Smoking Status During Pregnancy (1 = Yes, 0 = No)       SMOKE
48       History of Premature Labor (0 = None, 1 = One, etc.)    PTL
55       History of Hypertension (1 = Yes, 0 = No)               HT
61       Presence of Uterine Irritability (1 = Yes, 0 = No)      UI
67       Number of Physician Visits During the First Trimester   FTV
         (0 = None, 1 = One, 2 = Two, etc.)
73-76    Birth Weight in Grams                                   BWT
-----------------------------------------------------------------------------
PEDAGOGICAL NOTES: These data have been used as an example of fitting a multiple logistic regression model.
STORY BEHIND THE DATA: Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for low birth weight babies. A woman's behavior during pregnancy (including diet, smoking habits, and receiving prenatal care) can greatly alter the chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight. The variables identified in the code sheet given in the table have been shown to be associated with low birth weight in the obstetrical literature. The goal of the current study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.
References:
1. Hosmer and Lemeshow, Applied Logistic Regression, Wiley, (1989).
Load dataset from web
lbw <- read.table("http://www.umass.edu/statdata/statdata/data/lowbwt.dat", header = TRUE, skip = 4)
header = TRUE: pick up variable names from the first line that is read
skip = 4: skip the first 4 rows of the file
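To see what header = TRUE and skip do without touching the network, here is a minimal sketch on a made-up miniature file. The `raw_lines` content and the `toy` name are hypothetical stand-ins, not the real lowbwt.dat.

```r
# Hypothetical miniature of a data file: 4 lines of notes, then a header row.
raw_lines <- c(
  "NOTE line 1",
  "NOTE line 2",
  "NOTE line 3",
  "NOTE line 4",
  "ID LOW AGE",
  "85 0 19",
  "86 0 33"
)

# skip = 4 drops the notes; header = TRUE turns the next line into column names.
toy <- read.table(text = raw_lines, header = TRUE, skip = 4)
print(names(toy))
print(nrow(toy))
```

Spelling out header = TRUE rather than head = T avoids relying on partial argument matching and on T, which can be reassigned.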
“Fix” dataset
lbw[c(10,39), "BWT"] <- c(2655, 3035)
Replace data points to make the dataset identical to the BIO213 dataset
c(10,39): 10th and 39th rows
"BWT": BWT column
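The same row-and-column assignment can be tried on a small made-up data frame first. The `toy` values below are invented for illustration only.

```r
# Made-up data frame; only the indexing pattern matters here.
toy <- data.frame(ID = 1:4, BWT = c(2500, 2600, 2700, 2800))

# Rows 2 and 4 of the BWT column receive the two new values, in that order.
toy[c(2, 4), "BWT"] <- c(2655, 3035)
print(toy)
```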
Lower case variable names
names(lbw) <- tolower(names(lbw))
Convert variable names to lower case
Put them back into variable names
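A quick sketch of the two steps on a made-up one-row data frame (the `toy` object is hypothetical):

```r
# Made-up data frame with upper-case column names.
toy <- data.frame(ID = 1, LOW = 0, BWT = 2523)

# tolower() converts the names; assigning to names(toy) puts them back.
names(toy) <- tolower(names(toy))
print(names(toy))
```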
See overview
library(gpairs)
gpairs(lbw)
Recoding
Changing and creating variables
Why?
Different variable forms mean different modeling assumptions!
Variable form and assumption
- Continuous variables:
  - Linearity assumption
- Categorical variables:
  - No residual confounding assumption
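As a rough illustration of how variable form changes the model, the sketch below fits the same made-up predictor once as a number and once as a factor. The `toy` data are invented so that the middle group is highest, which a straight line cannot capture.

```r
# Made-up data: y rises from x = 1 to x = 2 but falls again at x = 3.
toy <- data.frame(
  x = rep(1:3, each = 4),
  y = c(10, 11, 9, 10,  20, 21, 19, 20,  12, 13, 11, 12)
)

# Continuous form: one slope, i.e. a straight-line trend across 1, 2, 3.
fit_num <- lm(y ~ x, data = toy)

# Categorical form: one coefficient per non-reference level, no linearity imposed.
fit_cat <- lm(y ~ factor(x), data = toy)

print(coef(fit_num))
print(coef(fit_cat))
```

The categorical fit reproduces the three group means exactly, while the continuous fit is forced onto a single line through them.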
Relabel race: 1, 2, 3 to White, Black, Other
Using this variable as continuous is meaningless!
lbw$race.cat <- factor(lbw$race, levels = 1:3, labels = c("White","Black","Other"))
lbw$race: take the race variable
levels = 1:3: order the levels 1, 2, 3 and make 1 the reference level
labels = c("White","Black","Other"): label levels 1, 2, 3 as White, Black, Other
lbw$race.cat: create a new variable named race.cat
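A minimal sketch of the relabeling on a short made-up vector (the `race` values here are invented, not taken from the dataset):

```r
# Made-up numeric codes using the same 1/2/3 scheme.
race <- c(1, 3, 2, 1)

# levels fixes the order (first level = reference); labels renames them.
race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
print(race.cat)
print(levels(race.cat))
```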
Dichotomize ptl
lbw$preterm <- factor(ifelse(lbw$ptl >= 1, "1+", "0"))
The ifelse function gives one of two values depending on a condition
lbw$ptl >= 1: condition
If the condition is true, then "1+"; if not (else), "0"
factor(): change to categorical
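The dichotomization can be checked on a few made-up counts; the cross-table shows which original values ended up in which category. The `ptl` vector below is invented.

```r
# Made-up counts of previous premature labors.
ptl <- c(0, 1, 0, 3)

# ifelse() works element by element: "1+" where ptl >= 1, otherwise "0".
preterm <- factor(ifelse(ptl >= 1, "1+", "0"))

# Cross-tabulate old against new to confirm the recoding.
print(table(ptl, preterm))
```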
Change 0,1 binary to No,Yes binary
lbw$smoke <- factor(ifelse(lbw$smoke == 1, "Yes", "No"))
lbw$ht <- factor(ifelse(lbw$ht == 1, "Yes", "No"))
lbw$ui <- factor(ifelse(lbw$ui == 1, "Yes", "No"))
lbw$low <- factor(ifelse(lbw$low == 1, "Yes", "No"))
Equality is tested by ==, not =
If 1, return "Yes"; if not, return "No"
Cutting a continuous variable into categories
lbw$ftv.cat <- cut(lbw$ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
Breaks at: breaks = c(-Inf, 0, 2, Inf)
Label them as: labels = c("None","Normal","Many")
4 bounds for 3 categories
Intervals: (-Inf, 0] = None, (0, 2] = Normal, (2, Inf] = Many
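Because cut() closes intervals on the right by default, the boundary values 0 and 2 fall into the lower category. A small sketch with made-up visit counts makes this visible:

```r
# Made-up visit counts, including the boundary values 0 and 2.
ftv <- c(0, 1, 2, 3, 6)

# Default right = TRUE gives (-Inf, 0], (0, 2], (2, Inf].
ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf),
               labels = c("None", "Normal", "Many"))
print(data.frame(ftv, ftv.cat))
```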
Make “Normal” the reference level
lbw$ftv.cat <- relevel(lbw$ftv.cat, ref = "Normal")
“Normal” as reference level
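relevel() only moves the chosen level to the front; the data values themselves are unchanged. A sketch on a made-up factor:

```r
# Made-up factor with the same three labels.
ftv.cat <- factor(c("None", "Normal", "Many", "Normal"),
                  levels = c("None", "Normal", "Many"))

# Move "Normal" to the first position so it becomes the reference level.
releveled <- relevel(ftv.cat, ref = "Normal")
print(levels(ftv.cat))
print(levels(releveled))
```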
within() allows direct use of variable names
within() method
Note: this block is an alternative to the lbw$ assignments above. Run it on freshly loaded data, because smoke, ht, and ui must still be coded 0/1.
lbw <- within(lbw, {
    ## Relabel race
    race.cat <- factor(race, levels = 1:3, labels = c("White","Black","Other"))

    ## Categorize ftv (frequency of visit)
    ftv.cat <- cut(ftv, breaks = c(-Inf, 0, 2, Inf), labels = c("None","Normal","Many"))
    ftv.cat <- relevel(ftv.cat, ref = "Normal")

    ## Dichotomize ptl
    preterm <- factor(ptl >= 1, levels = c(FALSE,TRUE), labels = c("0","1+"))

    ## Categorize smoke ht ui
    smoke <- factor(smoke, levels = 0:1, labels = c("No","Yes"))
    ht    <- factor(ht,    levels = 0:1, labels = c("No","Yes"))
    ui    <- factor(ui,    levels = 0:1, labels = c("No","Yes"))
})
You can specify variables by variable name only. No need for lbw$
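A reduced sketch of the within() pattern on a made-up three-row data frame. The `toy` object is hypothetical; the point is only that columns are referred to by bare name and the modified data frame is returned.

```r
# Made-up data frame with numeric codes.
toy <- data.frame(race = c(1, 2, 3), smoke = c(0, 1, 0))

toy <- within(toy, {
  # New column, created from an existing one by bare name
  race.cat <- factor(race, levels = 1:3, labels = c("White", "Black", "Other"))
  # Existing column, overwritten in place
  smoke <- factor(smoke, levels = 0:1, labels = c("No", "Yes"))
})
str(toy)
```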
model formula
outcome ~ predictor1 + predictor2 + predictor3
formula
SAS equivalent: model outcome = predictor1 predictor2 predictor3;
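A formula is an ordinary R object that can be stored and inspected before any data exist. The sketch below uses the placeholder names from the slide; nothing is fitted.

```r
# The variables need not exist yet; a formula is just a description.
f <- outcome ~ predictor1 + predictor2 + predictor3

print(class(f))
print(all.vars(f))                     # every variable named in the formula
print(attr(terms(f), "term.labels"))   # right-hand-side terms only
```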
age ~ zyg
Left-hand side (age): variable to be explained. In the case of a t-test, the continuous variable to be compared.
Right-hand side (zyg): variable used to explain. In the case of a t-test, the grouping variable that separates the groups.
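A sketch of the formula interface to t.test() on made-up data. The column names follow the slide, but the values and the group labels "A" and "B" are invented.

```r
# Made-up data: a continuous variable and a two-level grouping variable.
toy <- data.frame(
  age = c(21, 25, 30, 28, 35, 40, 38, 33),
  zyg = factor(rep(c("A", "B"), each = 4))
)

# Left of ~ : what is compared. Right of ~ : what defines the groups.
res <- t.test(age ~ zyg, data = toy)
print(res)
```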
Y ~ X1 + X2
linear sum
- `.`: All variables except for the outcome
- `+ X2`: Add X2 term
- `- 1`: Remove intercept
- `X1:X2`: Interaction term between X1 and X2
- `X1*X2`: Main effects and interaction term
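One way to check how R expands each operator is to ask terms() for the resulting term labels. The helper `labels_of` and the `toy` data frame are hypothetical names introduced for this sketch.

```r
# Made-up data frame; needed only so that "." can be expanded.
toy <- data.frame(Y = 1:4, X1 = c(0, 1, 0, 1), X2 = c(0, 0, 1, 1))

# Hypothetical helper: list the terms a formula expands to.
labels_of <- function(f) attr(terms(f, data = toy), "term.labels")

print(labels_of(Y ~ .))         # all columns other than the outcome
print(labels_of(Y ~ X1 + X2))   # two main effects
print(labels_of(Y ~ X1:X2))     # interaction only
print(labels_of(Y ~ X1 * X2))   # main effects plus interaction

# "- 1" switches the intercept flag off.
print(attr(terms(Y ~ X1 - 1), "intercept"))
```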
Y ~ X1 + X2 + X1:X2
Interaction term
Main effects Interaction
Y ~ X1 * X2
Interaction term
Main effects & interaction
On-the-fly variable manipulation
Y ~ X1 + I(X2 * X3)
I(): inhibits formula interpretation, for math manipulation
X2 * X3: new variable (X2 times X3) created on the fly and used
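The difference between X2 * X3 inside and outside I() shows up in the design matrix. This sketch uses a made-up data frame and model.matrix() to list the columns each formula produces.

```r
# Made-up data frame.
toy <- data.frame(
  Y  = c(1, 2, 3, 4, 5, 7),
  X1 = c(1, 2, 3, 4, 5, 6),
  X2 = c(1, 0, 2, 1, 3, 2),
  X3 = c(2, 2, 1, 3, 1, 2)
)

# With I(): a single column holding the arithmetic product.
mm_I <- model.matrix(Y ~ X1 + I(X2 * X3), data = toy)
print(colnames(mm_I))

# Without I(): * is a formula operator, giving X2, X3, and X2:X3.
mm_star <- model.matrix(Y ~ X1 + X2 * X3, data = toy)
print(colnames(mm_star))
```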
Fit a model
lm.full <- lm(bwt ~ age + lwt + smoke + ht + ui + ftv.cat + race.cat + preterm, data = lbw)
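For readers without the dataset at hand, the same fitting pattern can be tried on a small made-up data frame. The `toy` values are invented and the model is reduced to two predictors.

```r
# Made-up data: outcome, one continuous predictor, one factor.
toy <- data.frame(
  bwt   = c(2500, 2900, 3100, 2700, 3300, 2400, 3000, 2800),
  lwt   = c(110, 130, 150, 120, 160, 105, 140, 125),
  smoke = factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes"))
)

# Same call shape as on the slide: formula first, then the data frame.
fit <- lm(bwt ~ lwt + smoke, data = toy)

# Printing the object shows the call and the coefficients.
print(fit)
```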
lm.full
See model object
Call: command repeated
Coefficient for each variable
summary(lm.full)
See summary
Call: command repeated
Model F-test
Residual distribution
Dummy variables created
R^2 and adjusted R^2
Coef/SE = t
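The pieces of the summary can also be pulled out as numbers. The sketch below refits a small model on made-up data and confirms that the t value is the coefficient divided by its standard error.

```r
# Made-up data, invented for illustration.
toy <- data.frame(
  bwt   = c(2500, 2900, 3100, 2700, 3300, 2400, 3000, 2800),
  lwt   = c(110, 130, 150, 120, 160, 105, 140, 125),
  smoke = factor(c("Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes"))
)
fit <- lm(bwt ~ lwt + smoke, data = toy)
s <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
ctab <- coef(s)
t_by_hand <- ctab[, "Estimate"] / ctab[, "Std. Error"]
print(cbind(t_by_hand, t_reported = ctab[, "t value"]))

# R^2, adjusted R^2, and the model F statistic
print(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared))
print(s$fstatistic)
```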
ftv.catNone: people with no 1st-trimester visits compared to people with a Normal number of 1st-trimester visits (reference level)
ftv.catMany: people with many 1st-trimester visits compared to people with a Normal number of 1st-trimester visits (reference level)
race.catBlack: Black people compared to White people (reference level)
race.catOther: people of other races compared to White people (reference level)
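The coefficient names come from the dummy variables R builds for each factor. A sketch with a made-up factor shows the columns that are created and that the reference level gets none:

```r
# Made-up factor with White as the first (reference) level.
race.cat <- factor(c("White", "Black", "Other", "White"),
                   levels = c("White", "Black", "Other"))

# One indicator column per non-reference level; White rows are all zero.
mm <- model.matrix(~ race.cat)
print(mm)
```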
confint(lm.full)
Confidence intervals
Lower boundary
Upper boundary
Confidence intervals
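For a linear model, the default 95% limits are the estimate plus or minus a t quantile times the standard error. The sketch below checks this by hand on a made-up model; the `toy` data are invented.

```r
# Made-up data, invented for illustration.
toy <- data.frame(
  bwt = c(2500, 2900, 3100, 2700, 3300, 2400, 3000, 2800),
  lwt = c(110, 130, 150, 120, 160, 105, 140, 125)
)
fit <- lm(bwt ~ lwt, data = toy)

# Built-in intervals: one row per coefficient, lower and upper columns.
ci <- confint(fit)
print(ci)

# By hand: estimate +/- t quantile (residual df) * standard error
est  <- coef(fit)
se   <- sqrt(diag(vcov(fit)))
crit <- qt(0.975, df = df.residual(fit))
by_hand <- cbind(lower = est - crit * se, upper = est + crit * se)
print(by_hand)
```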