Transcript of SPH 247 Statistical Analysis of Laboratory Data. Cystic Fibrosis Data Set The 'cystfibr' data frame...
- Slide 1
- SPH 247 Statistical Analysis of Laboratory Data
- Slide 2
- Cystic Fibrosis Data Set The 'cystfibr' data frame has 25 rows
and 10 columns. It contains lung function data for cystic fibrosis
patients (7-23 years old) We will examine the relationships among
the various measures of lung function April 30, 2010SPH 247
Statistical Analysis of Laboratory Data2
- Slide 3
- age: a numeric vector. Age in years. sex: a numeric vector
code. 0: male, 1:female. height: a numeric vector. Height (cm).
weight: a numeric vector. Weight (kg). bmp: a numeric vector. Body
mass (% of normal). fev1: a numeric vector. Forced expiratory
volume. rv: a numeric vector. Residual volume. frc: a numeric
vector. Functional residual capacity. tlc: a numeric vector. Total
lung capacity. pemax: a numeric vector. Maximum expiratory
pressure. April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data3
- Slide 4
- Scatterplot matrices We have five variables and may wish to
study the relationships among them We could separately plot the
(5)(4)/2 = 10 pairwise scatterplots In R we can use the pairs()
function, or the splom() function in the lattice package. In Stata,
we can use graph matrix April 30, 2010SPH 247 Statistical Analysis
of Laboratory Data4
- Slide 5
- Scatterplot matrices > pairs(lungcap) > library(lattice)
> splom(lungcap).graph matrix fev1 rv frc tlc pemax April 30,
2010SPH 247 Statistical Analysis of Laboratory Data5
- Slide 6
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data6
- Slide 7
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data7
- Slide 8
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data8
- Slide 9
- Principal Components Analysis The idea of PCA is to create new
variables that are combinations of the original ones. If x 1, x 2,
, x p are the original variables, then a component is a 1 x 1 + a 2
x 2 ++ a p x p We pick the first PC as the linear combination that
has the largest variance The second PC is that linear combination
orthogonal to the first one that has the largest variance, and so
on HSAUR2 Chapter 16 April 30, 2010SPH 247 Statistical Analysis of
Laboratory Data9
- Slide 10
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data10
- Slide 11 lungcap.pca$sdev [1] 1.7955824 0.9414877 0.6919822
0.5873377 0.2562806 > lungcap.pca$center fev1 rv frc tlc pemax
34.72 255.20 155.40 114.00 109.12 > lungcap.pca$scale fev1 rv
frc tlc pemax 11.19717 86.01696 43.71880 16.96811 33.43691 >
plot(lungcap.pca$x[,1:2]) Always use scaling before PCA unless all
variables are on the Same scale. This is equivalent to PCA on the
correlation matrix instead of the covariance matrix">
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data11
> lungcap.pca plot(lungcap.pca) > names(lungcap.pca) [1]
"sdev" "rotation" "center" "scale" "x" > lungcap.pca$sdev [1]
1.7955824 0.9414877 0.6919822 0.5873377 0.2562806 >
lungcap.pca$center fev1 rv frc tlc pemax 34.72 255.20 155.40 114.00
109.12 > lungcap.pca$scale fev1 rv frc tlc pemax 11.19717
86.01696 43.71880 16.96811 33.43691 > plot(lungcap.pca$x[,1:2])
Always use scaling before PCA unless all variables are on the Same
scale. This is equivalent to PCA on the correlation matrix instead
of the covariance matrix
- Slide 12
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data12
Scree Plot
- Slide 13
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data13
- Slide 14
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data14. pca fev1 rv frc tlc pemax Principal components/correlation
Number of obs = 25 Number of comp. = 5 Trace = 5 Rotation:
(unrotated = principal) Rho = 1.0000
--------------------------------------------------------------------------
Component | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Comp1 | 3.22412 2.33772 0.6448 0.6448 Comp2 |.886399.40756 0.1773
0.8221 Comp3 |.478839.133874 0.0958 0.9179 Comp4 |.344966.279286
0.0690 0.9869 Comp5 |.0656798. 0.0131 1.0000
--------------------------------------------------------------------------
Principal components (eigenvectors)
------------------------------------------------------------------------------
Variable | Comp1 Comp2 Comp3 Comp4 Comp5 | Unexplained
-------------+--------------------------------------------------+-------------
fev1 | -0.4525 0.2140 0.5539 0.6641 -0.0397 | 0 rv | 0.5043 0.1736
-0.2977 0.4993 -0.6145 | 0 frc | 0.5291 0.1324 0.0073 0.3571 0.7582
| 0 tlc | 0.4156 0.4525 0.6474 -0.4134 -0.1806 | 0 pemax | -0.2970
0.8377 -0.4306 -0.1063 0.1152 | 0
------------------------------------------------------------------------------
- Slide 15
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data15
- Slide 16
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data16
- Slide 17
- Fishers Iris Data This famous (Fisher's or Anderson's) iris
data set gives the measurements in centimeters of the variables
sepal length and width and petal length and width, respectively,
for 50 flowers from each of 3 species of iris. The species are
_Iris setosa_, _versicolor_, and _virginica_. April 30, 2010SPH 247
Statistical Analysis of Laboratory Data17
- Slide 18 attach(iris) > iris.dat splom(iris.dat) >
splom(iris.dat,groups=Species) > splom(~ iris.dat | Species)
> summary(iris) Sepal.Length Sepal.Width Petal.Length
Petal.Width Species Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
setosa :50 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
versicolor:50 Median :5.800 Median :3.000 Median :4.350 Median
:1.300 virginica :50 Mean :5.843 Mean :3.057 Mean :3.758 Mean
:1.199 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 Max.
:7.900 Max. :4.400 Max. :6.900 Max. :2.500"> data(iris) >
help(iris) > names(iris) [1] "Sepal.Length" "Sepal.Width"
"Petal.Length">
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data18
> data(iris) > help(iris) > names(iris) [1] "Sepal.Length"
"Sepal.Width" "Petal.Length" "Petal.Width" "Species" >
attach(iris) > iris.dat splom(iris.dat) >
splom(iris.dat,groups=Species) > splom(~ iris.dat | Species)
> summary(iris) Sepal.Length Sepal.Width Petal.Length
Petal.Width Species Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
setosa :50 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
versicolor:50 Median :5.800 Median :3.000 Median :4.350 Median
:1.300 virginica :50 Mean :5.843 Mean :3.057 Mean :3.758 Mean
:1.199 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 Max.
:7.900 Max. :4.400 Max. :6.900 Max. :2.500
- Slide 19
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data19
- Slide 20
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data20
- Slide 21
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data21
- Slide 22 plot(iris.pc) > iris.pc$sdev [1] 1.7083611
0.9560494 0.3830886 0.1439265 > iris.pc$rotation PC1 PC2 PC3 PC4
Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863 Sepal.Width
-0.2693474 -0.92329566 -0.2443818 -0.1235096 Petal.Length 0.5804131
-0.02449161 -0.1421264 -0.8014492 Petal.Width 0.5648565 -0.06694199
-0.6342727 0.5235971">
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data22
> data(iris) > iris.pc
plot(iris.pc$x[,1:2],col=rep(1:3,each=50)) > names(iris.pc) [1]
"sdev" "rotation" "center" "scale" "x" > plot(iris.pc) >
iris.pc$sdev [1] 1.7083611 0.9560494 0.3830886 0.1439265 >
iris.pc$rotation PC1 PC2 PC3 PC4 Sepal.Length 0.5210659 -0.37741762
0.7195664 0.2612863 Sepal.Width -0.2693474 -0.92329566 -0.2443818
-0.1235096 Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
- Slide 23
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data23
- Slide 24
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data24
- Slide 25
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data25. summarize Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
index | 150 75.5 43.44537 1 150 sepallength | 150 5.843333.8280661
4.3 7.9 sepalwidth | 150 3.057333.4358663 2 4.4 petallength | 150
3.758 1.765298 1 6.9 petalwidth | 150 1.199333.7622377.1 2.5
-------------+--------------------------------------------------------
species | 0. graph matrix sepallength sepalwidth petallength
petalwidth
- Slide 26
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data26
- Slide 27
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data27. pca sepallength sepalwidth petallength petalwidth Principal
components/correlation Number of obs = 150 Number of comp. = 4
Trace = 4 Rotation: (unrotated = principal) Rho = 1.0000
--------------------------------------------------------------------------
Component | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Comp1 | 2.9185 2.00447 0.7296 0.7296 Comp2 |.91403.767274 0.2285
0.9581 Comp3 |.146757.126042 0.0367 0.9948 Comp4 |.0207148. 0.0052
1.0000
--------------------------------------------------------------------------
Principal components (eigenvectors)
--------------------------------------------------------------------
Variable | Comp1 Comp2 Comp3 Comp4 | Unexplained
-------------+----------------------------------------+-------------
sepallength | 0.5211 0.3774 -0.7196 -0.2613 | 0 sepalwidth |
-0.2693 0.9233 0.2444 0.1235 | 0 petallength | 0.5804 0.0245 0.1421
0.8014 | 0 petalwidth | 0.5649 0.0669 0.6343 -0.5236 | 0
--------------------------------------------------------------------
- Slide 28
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data28
- Slide 29
- Discriminant Analysis An alternative to logistic regression for
classification is discrimininant analysis This comes in two
flavors, (Fishers) Linear Discriminant Analysis or LDA and
(Fishers) Quadratic Discriminant Analysis or QDA In each case we
model the shape of the groups and provide a dividing line/curve
April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data29
- Slide 30
- One way to describe the way LDA and QDA work is to think of the
data as having for each group an elliptical distribution We
allocate new cases to the group for which they have the highest
likelihoods This provides a linear cut-point if the ellipses are
assumed to have the same shape and a quadratic one if they may be
different April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data30
- Slide 31
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data31
> library(MASS) > iris.lda iris.lda Call: lda(iris[, 1:4],
iris[, 5]) Prior probabilities of groups: setosa versicolor
virginica 0.3333333 0.3333333 0.3333333 Group means: Sepal.Length
Sepal.Width Petal.Length Petal.Width setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326 virginica 6.588 2.974 5.552
2.026 Coefficients of linear discriminants: LD1 LD2 Sepal.Length
0.8293776 0.02410215 Sepal.Width 1.5344731 2.16452123 Petal.Length
-2.2012117 -0.93192121 Petal.Width -2.8104603 2.83918785 Proportion
of trace: LD1 LD2 0.9912 0.0088
- Slide 32 iris.pred$class[71:80] [1] virginica versicolor
versicolor versicolor versicolor versicolor versicolor [8]
versicolor versicolor versicolor Levels: setosa versicolor
virginica > iris.pred$posterior[71:80,] setosa versicolor
virginica 71 7.408118e-28 0.2532282 7.467718e-01 72 9.399292e-17
0.9999907 9.345291e-06 73 7.674672e-29 0.8155328 1.844672e-01 74
2.683018e-22 0.9995723 4.277469e-04 75 7.813875e-18 0.9999758
2.421458e-05 76 2.073207e-18 0.9999171 8.290530e-05 77 6.357538e-23
0.9982541 1.745936e-03 78 5.639473e-27 0.6892131 3.107869e-01 79
3.773528e-23 0.9925169 7.483138e-03 80 9.555338e-12 1.0000000
1.910624e-08 > sum(iris.pred$class != iris$Species) [1]
3">
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data32
> plot(iris.lda,col=rep(1:3,each=50)) > iris.pred
names(iris.pred) [1] "class" "posterior" "x" >
iris.pred$class[71:80] [1] virginica versicolor versicolor
versicolor versicolor versicolor versicolor [8] versicolor
versicolor versicolor Levels: setosa versicolor virginica >
iris.pred$posterior[71:80,] setosa versicolor virginica 71
7.408118e-28 0.2532282 7.467718e-01 72 9.399292e-17 0.9999907
9.345291e-06 73 7.674672e-29 0.8155328 1.844672e-01 74 2.683018e-22
0.9995723 4.277469e-04 75 7.813875e-18 0.9999758 2.421458e-05 76
2.073207e-18 0.9999171 8.290530e-05 77 6.357538e-23 0.9982541
1.745936e-03 78 5.639473e-27 0.6892131 3.107869e-01 79 3.773528e-23
0.9925169 7.483138e-03 80 9.555338e-12 1.0000000 1.910624e-08 >
sum(iris.pred$class != iris$Species) [1] 3
- Slide 33
- April 30, 2010SPH 247 Statistical Analysis of Laboratory
Data33
- Slide 34 iris.cv(10,1000) [1] "total number classified = 10000"
[1] "total number wrong = 213" [1] "percent wrong =
2.13%"">
- April 30, 2010SPH 247 Statistical Analysis of Laboratory Data34
iris.cv