Post on 04-Jun-2018
8/13/2019 [] Statistics - Statistical Methods for Data Analytic
Chapter 3
Statistical Methods
Paul C. Taylor
University of Hertfordshire
28th March 2001

3.1 Introduction
- Generalized Linear Models
- Special Topics in Regression Modelling
- Classical Multivariate Analysis
- Summary

3.2 Generalized Linear Models
- Regression
- Analysis of Variance
- Log-linear Models
- Logistic Regression
- Analysis of Survival Data

The fitting of generalized linear models is currently the most frequently applied statistical technique. Generalized linear models are used to describe the relationship between the mean, sometimes called the trend, of one variable and the values taken by several other variables.

3.2.1 Regression
How is a variable, $y$, related to one, or more, other variables, $x_1$, $x_2$, ..., $x_p$?
Names for $y$: response; dependent variable; output.
Names for the $x$s: regressors; explanatory variables; independent variables; inputs.
Here, we will use the terms output and inputs.

Common reasons for doing a regression analysis include:
- the output is expensive to measure, but the inputs are not, and so cheap predictions of the output are sought;
- the values of the inputs are known earlier than the output is, and a working prediction of the output is required;
- we can control the values of the inputs, we believe there is a causal link between the inputs and the output, and so we want to know what values of the inputs should be chosen to obtain a particular target value for the output;
- it is believed that there is a causal link between some of the inputs and the output, and we wish to identify which inputs are related to the output.

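The (general) linear model is
$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i$   (3.1)
where the $\epsilon_i$s are independently and identically distributed as $N(0, \sigma^2)$ and $n$ is the number of data points ($i = 1, \ldots, n$).
The model is linear in the $\beta$s:
$\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$   (3.2)
(A weighted sum of the $\beta$s.)
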
The main reasons for the use of the linear model.
- The maximum likelihood estimators of the $\beta$s are the same as the least squares estimators; see Section 2.4 of Chapter 2.
- There are explicit formulae and rapid, reliable numerical methods for finding the least squares estimators of the $\beta$s.
- Many problems can be framed as general linear models. For example, the model in (3.3) can be converted into a general linear model by a suitable choice of new inputs.
- Even when the linear model is not strictly appropriate, there is often a way to transform the output and/or the inputs, so that a linear model can provide useful information.

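
The explicit formulae mentioned above can be illustrated for the simplest case of a single input. The following is a minimal sketch, not the chapter's own code: the function name `fit_simple_linear` and the data are made up for illustration, and the data are chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_simple_linear(x, y):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Corrected sums of products and squares.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Made-up data lying exactly on y = 1 + 2x, so the fit should recover (1, 2).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear(x, y)
print(b0, b1)
```

With several inputs the same idea is expressed in matrix form, and in practice a statistical package or linear algebra library would do the work.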
Non-linear Regression
Two examples are:
(3.4)
(3.5)
where the $\beta$s and $\epsilon$s are as in (3.1).

Problems
1. Estimation is carried out using iterative methods which require good choices of starting values, might not converge, might converge to a local optimum rather than the global optimum, and will require human intervention to overcome these difficulties.
2. The statistical properties of the estimates and predictions from the model are not known, so we cannot perform statistical inference for non-linear regression.

Generalized Linear Models
The generalization is in two parts.
1. The distribution of the output does not have to be the normal, but can be any of the distributions in the exponential family.
2. Instead of the expected value of the output being a linear function of the $\beta$s, we have
$g(E(y_i)) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$   (3.6)
where $g(\cdot)$ is a monotone differentiable function. The function $g(\cdot)$ is called the link function.
There is a reliable general algorithm for fitting generalized linear models.

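
To make the role of the link function concrete, here is a small sketch of how a linear predictor is mapped to a mean under two common links. The coefficients and input value are hypothetical and were chosen only for illustration; the logit and log links are standard choices, not ones singled out by the slides at this point.

```python
import math

def logit(mu):
    """Logit link: maps a mean in (0, 1) to the whole real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients and input value.
beta0, beta1 = -1.0, 0.5
x = 3.0

eta = beta0 + beta1 * x      # linear predictor: a weighted sum of the betas
mu_logit = inv_logit(eta)    # mean implied by a logit link, always in (0, 1)
mu_log = math.exp(eta)       # mean implied by a log link, always positive
print(eta, mu_logit, mu_log)
```

Both links are monotone and differentiable, so applying the link to the mean recovers the linear predictor.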
Generalized Additive Models
Generalized additive models are a generalization of generalized linear models. The generalization is that $g(E(y_i))$ need not be a linear function of a set of $\beta$s, but has the form
$g(E(y_i)) = \beta_0 + g_1(x_{1i}) + g_2(x_{2i}) + \cdots + g_p(x_{pi})$   (3.7)
where the $g_j$s are arbitrary, usually smooth, functions.
An example of the model produced using a type of scatterplot smoother is shown in Figure 3.1.

[Figure 3.1: Diabetes Data, Spline Smooth (df = 3). Axes: Age and Log C-peptide.]

Methods for fitting generalized additive models exist and are generally reliable. The main drawback is that the framework of statistical inference that is available for generalized linear models has not yet been developed for generalized additive models.
Despite this drawback, generalized additive models can be fitted by several of the major statistical packages already.

3.2.2 Analysis of Variance
The analysis of variance, or ANOVA, is primarily a method of identifying which of the $\beta$s in a linear model are non-zero. This technique was developed for the analysis of agricultural field experiments, but is now used quite generally.
Example 27 Turnips for Winter Fodder. The data in Table 3.1 are from an experiment to investigate the growth of turnips. These types of turnips would be grown to provide food for farm animals in winter. The turnips were harvested and weighed by staff and students of the Departments of Agriculture and Applied Statistics of The University of Reading, in October, 1990.

The following linear model
(3.8)
or an equivalent one could be fitted to these data. The inputs take the values 0 or 1 and are usually called dummy or indicator variables.
On first sight, (3.8) should also include two further $\beta$ terms, but we do not need them.

The first question that we would try to answer about these data is
    Does a change in treatment produce a change in the turnip yield?
which is equivalent to asking
    Are any of the treatment $\beta$s in (3.8) non-zero?
which is the sort of question that can be answered using ANOVA.

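This is how the ANOVA works. Recall the general linear model of (3.1),
$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i$
The estimate of $\beta_j$ is $\hat{\beta}_j$.
Fitted values
$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_p x_{pi}$   (3.9)
Residuals
$e_i = y_i - \hat{y}_i$   (3.10)
The size of the residuals is related to the size of $\sigma^2$, the variance of the $\epsilon_i$s. It turns out that we can estimate $\sigma^2$ by
$s^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}$   (3.11)
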
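
As a rough illustration of the variance estimate described above, the sketch below divides the residual sum of squares by the residual degrees of freedom, taken here to be the number of data points minus the number of inputs minus one for the intercept. The function name and the residuals are hypothetical; the residuals are assumed to come from a model with one input that has already been fitted.

```python
def estimate_sigma2(residuals, n_inputs):
    """Residual sum of squares divided by (n - p - 1), with p = n_inputs."""
    n = len(residuals)
    rss = sum(e * e for e in residuals)
    return rss / (n - n_inputs - 1)

# Hypothetical residuals from an already fitted model with one input.
residuals = [0.1, -0.1, 0.2, -0.2, 0.0]
s2 = estimate_sigma2(residuals, n_inputs=1)
print(s2)
```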
The key facts about $s^2$ that allow us to compare different linear models are:
- if the fitted model is adequate ('the right one'), then $s^2$ is a good estimate of $\sigma^2$;
- if the fitted model includes redundant terms (that is, includes some $\beta$s that are really zero), then $s^2$ is still a good estimate of $\sigma^2$;
- if the fitted model does not include one or more inputs that it ought to, then $s^2$ will tend to be larger than the true value of $\sigma^2$.
So if we omit a useful input from our model, the estimate of $\sigma^2$ will shoot up, whereas if we omit a redundant input from our model, the estimate of $\sigma^2$ should not change much. Note that omitting one of the inputs from the model is equivalent to forcing the corresponding $\beta$ to be zero.

Example 28 Turnips for Winter Fodder continued. Let $\Omega_1$ be the model at (3.8), and $\Omega_0$ be the following model
(3.18)
So, $\Omega_0$ is the special case of $\Omega_1$ in which all of the treatment $\beta$s are zero.

Table 3.2
            Df   Sum of Sq    Mean Sq    F Value        Pr(F)
block        3     163.737   54.57891   2.278016   0.08867543
Residuals   60    1437.538   23.95897

Table 3.3
            Df   Sum of Sq    Mean Sq    F Value        Pr(F)
block        3     163.737   54.57891   5.690430  0.002163810
treat       15    1005.927   67.06182   6.991906  0.000000171
Residuals   45     431.611    9.59135

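
The treat row of Table 3.3 can be recovered from the Residuals rows of the two tables. The sketch below is only an arithmetic check under the usual extra-sum-of-squares reading of an ANOVA table; the variable names are made up, and the numbers are copied from Tables 3.2 and 3.3.

```python
# Residual sums of squares and degrees of freedom from Tables 3.2 and 3.3.
rss_0, df_0 = 1437.538, 60   # model with block only
rss_1, df_1 = 431.611, 45    # model with block and treat

extra_ss = rss_0 - rss_1     # sum of squares explained by adding treat
extra_df = df_0 - df_1       # degrees of freedom used by treat
s2 = rss_1 / df_1            # estimate of the error variance from the larger model

f_value = (extra_ss / extra_df) / s2
print(round(extra_ss, 3), extra_df, round(f_value, 4))
```

Dropping the treatment terms makes the residual mean square jump from about 9.6 to about 24.0, which is the "shoot up" behaviour that signals a useful input has been omitted.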
Table 3.4 shows the ANOVA that would usually be produced for the turnip data. Notice that the block and Residuals rows are the same as in Table 3.3. The basic difference between Tables 3.3 and 3.4 is that the treatment information is broken down into its constituent parts in Table 3.4.

Table 3.4
                         Df  Sum of Sq   Mean Sq   F Value      Pr(F)
block                     3   163.7367   54.5789   5.69043  0.0021638
variety                   1    83.9514   83.9514   8.75282  0.0049136
sowing                    1   233.7077  233.7077  24.36650  0.0000114
density                   3   470.3780  156.7927  16.34730  0.0000003
variety:sowing            1    36.4514   36.4514   3.80045  0.0574875
variety:density           3     8.6467    2.8822   0.30050  0.8248459
sowing:density            3   154.7930   51.5977   5.37960  0.0029884
variety:sowing:density    3    17.9992    5.9997   0.62554  0.6022439
Residuals                45   431.6108    9.5914

3.2.3 Log-linear Models
The data shown in Table 3.7 show the sort of problem attacked by log-linear modelling. There are five categorical variables displayed in Table 3.7:
- centre: one of three health centres for the treatment of breast cancer;
- age: the age of the patient when her breast cancer was diagnosed;
- survived: whether the patient survived for at least three years from diagnosis;
- appear: appearance of the patient's tumour, either malignant or benign;
- inflam: amount of inflammation of the tumour, either minimal or greater.

Table 3.7
                                               State of Tumour
                                  Minimal Inflammation     Greater Inflammation
                                  Malignant    Benign      Malignant    Benign
Centre     Age         Survived   Appearance   Appearance  Appearance   Appearance
Tokyo      Under 50    No              9            7           4            3
                       Yes            26           68          25            9
           50-69       No              9            9          11            2
                       Yes            20           46          18            5
           70 or over  No              2            3           1            0
                       Yes             1            6           5            1
Boston     Under 50    No              6            7           6            0
                       Yes            11           24           4            0
           50-69       No              8           20           3            2
                       Yes            18           58          10            3
           70 or over  No              9           18           3            0
                       Yes            15           26           1            1
Glamorgan  Under 50    No             16            7           3            0
                       Yes            16           20           8            1
           50-69       No             14           12           3            0
                       Yes            27           39          10            4
           70 or over  No              3            7           3            0
                       Yes            12           11           4            1

For these data, the output is the number of patients in each cell.
The model is
$\log(E(y_i)) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$   (3.21)
Since all the variables of interest are categorical, we need to use indicator variables as inputs in the same way as in (3.8).

Table 3.8
Terms added sequentially (first to last)
                 Df   Deviance  Resid. Df  Resid. Dev    Pr(Chi)
NULL                                   71    860.0076
centre            2     9.3619         69    850.6457  0.0092701
age               2   105.5350         67    745.1107  0.0000000
survived          1   160.6009         66    584.5097  0.0000000
inflam            1   291.1986         65    293.3111  0.0000000
appear            1     7.5727         64    285.7384  0.0059258
centre:age        4    76.9628         60    208.7756  0.0000000
centre:survived   2    11.2698         58    197.5058  0.0035711
centre:inflam     2    23.2484         56    174.2574  0.0000089
centre:appear     2    13.3323         54    160.9251  0.0012733
age:survived      2     3.5257         52    157.3995  0.1715588
age:inflam        2     0.2930         50    157.1065  0.8637359
age:appear        2     1.2082         48    155.8983  0.5465675
survived:inflam   1     0.9645         47    154.9338  0.3260609
survived:appear   1     9.6709         46    145.2629  0.0018721
inflam:appear     1    95.4381         45     49.8248  0.0000000

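
The Pr(Chi) column of Table 3.8 compares each change in deviance with a chi-squared distribution on the corresponding degrees of freedom. For 2 degrees of freedom the upper tail probability has the simple closed form exp(-x/2), so several rows can be checked with the standard library alone. This is a sketch for checking the table, not the method a package would use in general; the function name is made up, and the numbers are copied from Table 3.8.

```python
import math

def chi2_upper_tail_2df(deviance):
    """P(X > deviance) for X distributed as chi-squared with 2 degrees of freedom."""
    return math.exp(-deviance / 2.0)

# (deviance, Pr(Chi)) for some of the 2 df rows of Table 3.8.
two_df_rows = {
    "centre": (9.3619, 0.0092701),
    "centre:survived": (11.2698, 0.0035711),
    "age:survived": (3.5257, 0.1715588),
    "age:inflam": (0.2930, 0.8637359),
    "age:appear": (1.2082, 0.5465675),
}

for term, (deviance, reported) in two_df_rows.items():
    print(term, round(chi2_upper_tail_2df(deviance), 7), reported)
```

For other degrees of freedom there is no such shortcut, and a statistical library's chi-squared distribution function would normally be used instead.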
To summarise this model, I would construct its conditional independence graph and present tables corresponding to the interactions. Tables are in the book.
The conditional independence graph is shown in Figure 3.2.
[Figure 3.2: Conditional independence graph with nodes age, centre, survived, inflam and appear.]

3.2.4 Logistic Regression
In logistic regression, the output is the number of successes out of a number of trials, each trial resulting in either a success or failure.
For the breast cancer data, we can regard each patient as a trial, with success corresponding to the patient surviving for three years.
The output would simply be given as the number of successes, either 0 or 1, for each of the 764 patients involved in the study.
The model that we will fit is $y_i \sim \mathrm{Binomial}(1, \pi_i)$, where $\pi_i$ is the success probability, and
$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}$   (3.22)
Again, the inputs here will be indicators for the breast cancer data, but this is not generally true; there is no reason why any of the inputs should not be quantitative.

Table 3.15
                          Df  Deviance  Resid. Df  Resid. Dev    Pr(Chi)
NULL                                          763    898.5279
centre                     2  11.26979        761    887.2582  0.0035711
age                        2   3.52566        759    883.7325  0.1715588
appear                     1   9.69100        758    874.0415  0.0018517
inflam                     1   0.00653        757    874.0350  0.9356046
centre:age                 4   7.42101        753    866.6140  0.1152433
centre:appear              2   1.08077        751    865.5332  0.5825254
centre:inflam              2   3.39128        749    862.1419  0.1834814
age:appear                 2   2.33029        747    859.8116  0.3118773
age:inflam                 2   0.06318        745    859.7484  0.9689052
appear:inflam              1   0.24812        744    859.5003  0.6184041
centre:age:appear          4   2.04635        740    857.4540  0.7272344
centre:age:inflam          4   7.04411        736    850.4099  0.1335756
centre:appear:inflam       2   5.07840        734    845.3315  0.0789294
age:appear:inflam          2   4.34374        732    840.9877  0.1139642
centre:age:appear:inflam   3   0.01535        729    840.9724  0.9994964

The fitted model is simple enough in this case for the parameter estimates to be included here; they are shown in the form that a statistical package would present them in Table 3.16.

Table 3.16
Coefficients:
 (Intercept)     centre2     centre3     appear
    1.080257  -0.6589141  -0.4944846  0.5157151

Using the estimates given in Table 3.16, the fitted model is
$\log\left(\frac{\hat{\pi}}{1 - \hat{\pi}}\right) = 1.080257 - 0.6589141\,\mathrm{centre2} - 0.4944846\,\mathrm{centre3} + 0.5157151\,\mathrm{appear}$   (3.23)
where $\hat{\pi}$ is the fitted probability of surviving for three years.

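
To show how coefficients like those in Table 3.16 translate into fitted probabilities, here is a hedged sketch. It assumes that centre2, centre3 and appear enter as 0/1 indicator inputs; the slides do not say how the package coded these factors or which category is the baseline, so the numbers are illustrative only. The function name `survival_probability` is made up.

```python
import math

# Coefficients copied from Table 3.16.
COEF = {
    "(Intercept)": 1.080257,
    "centre2": -0.6589141,
    "centre3": -0.4944846,
    "appear": 0.5157151,
}

def survival_probability(centre2, centre3, appear):
    """Fitted success probability, ASSUMING 0/1 indicator coding of the inputs."""
    eta = (COEF["(Intercept)"]
           + COEF["centre2"] * centre2
           + COEF["centre3"] * centre3
           + COEF["appear"] * appear)
    return 1.0 / (1.0 + math.exp(-eta))   # inverse of the logit link

p_baseline = survival_probability(0, 0, 0)
p_appear = survival_probability(0, 0, 1)
print(round(p_baseline, 3), round(p_appear, 3))
```

If the package used a different contrast coding for the factors, the same coefficients would need to be combined differently before applying the inverse logit.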
3.2.5 Analysis of Survival Data
Survival data are data concerning how long it takes for a particular event to happen. In many medical applications the event is death of a patient with an illness, and so we are analysing the patient's survival time. In industrial applications the event is often failure of a component in a machine.
The output in this sort of problem is the survival time. As with all the other problems that we have seen in this section, the task is to fit a regression model to describe the relationship between the output and some inputs. In the medical context, the inputs are usually qualities of the patient, such as age and sex, or are determined by the treatment given to the patient.
We will skip this topic.

3.3 Special Topics in Regression Modelling
- Multivariate Analysis of Variance
- Repeated Measures Data
- Random Effects Models
The topics in this section are special in the sense that they are extensions to the basic idea of regression modelling. The techniques have been developed in response to methods of data collection in which the usual assumptions of regression modelling are not justified.

3.3.1 Multivariate Analysis of Variance
Model
$\underset{(c \times 1)}{\mathbf{y}_i} = \boldsymbol{\beta}_0 + \boldsymbol{\beta}_1 x_{1i} + \cdots + \boldsymbol{\beta}_p x_{pi} + \boldsymbol{\epsilon}_i$   (3.26)
where the $\boldsymbol{\epsilon}_i$s are independently and identically distributed as $N_c(\mathbf{0}, \Sigma)$ and $n$ is the number of data points. The $(c \times 1)$ under $\mathbf{y}_i$ indicates the dimensions of the vector, in this case $c$ rows and 1 column; the $\boldsymbol{\beta}$s are also vectors.
This model can be fitted in exactly the same way as a linear model (by least squares estimation). One way to do this fitting would be to fit a linear model to each of the $c$ dimensions of the output, one-at-a-time.

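Having fitted the model, we can obtain fitted values
$\hat{\mathbf{y}}_i = \hat{\boldsymbol{\beta}}_0 + \hat{\boldsymbol{\beta}}_1 x_{1i} + \cdots + \hat{\boldsymbol{\beta}}_p x_{pi}$
and hence residuals
$\mathbf{y}_i - \hat{\mathbf{y}}_i$
The analogue of the residual sum of squares from the (univariate) linear model is the matrix of residual sums of squares and products for the multivariate linear model. This matrix is defined to be
$R = \sum_{i=1}^{n} (\mathbf{y}_i - \hat{\mathbf{y}}_i)(\mathbf{y}_i - \hat{\mathbf{y}}_i)^T$
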
3.3.2 Repeated Measures Data
Repeated measures data are generated when the output variable is observed at several points in time, on the same individuals. Usually, the covariates are also observed at the same time points as the output; so the inputs are time-dependent too. Thus, as in Section 3.3.1, the output is a vector of measurements. In principle, we can simply apply the techniques of Section 3.3.1 to analyse repeated measures data. Instead, we usually try to use the fact that we have the same set of variables (output and inputs) at several times, rather than a collection of different variables making up a vector output.
Repeated measures data are often called longitudinal data, especially in the social sciences. The term cross-sectional is often used to mean 'not longitudinal'.

3.3.3 Random Effects Models
Overdispersion
In a logistic regression we might replace (3.22) with
$\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + U_i$   (3.29)
where $\pi_i$ is the success probability and the $U_i$s are independently and identically distributed as $N(0, \sigma_U^2)$. We can think of $U_i$ as representing either the effect of the missing input on $\pi_i$ or simply as random variation in the success probabilities for individuals that have the same values for the input variables.

Hierarchical models
In the turnip experiment, the growth of the turnips is affected by the different blocks, but the effects (the $\beta$s) for each block are likely to be different in different years. So we could think of the $\beta$s for each block as coming from a population of $\beta$s for blocks. If we did this, then we could replace the model in (3.8) with
(3.30)
where the four block effects are independently and identically distributed.

3.4 Classical Multivariate Analysis
- Principal Components Analysis
- Correspondence Analysis
- Multidimensional Scaling
- Cluster Analysis and Mixture Decomposition
- Latent Variable and Covariance Structure Models

3.4.1 Principal Components Analysis
Principal components analysis is a way of transforming a set of $c$-dimensional vector observations, $\mathbf{x}_1$, $\mathbf{x}_2$, ..., $\mathbf{x}_n$, into another set of $c$-dimensional vectors, $\mathbf{y}_1$, $\mathbf{y}_2$, ..., $\mathbf{y}_n$. The $\mathbf{y}$s have the property that most of their information content is stored in the first few dimensions (features).
This will allow dimensionality reduction, so that we can do things like:
- obtaining (informative) graphical displays of the data in 2-D;
- carrying out computer intensive methods on reduced data;
- gaining insight into the structure of the data, which was not apparent in $c$ dimensions.

[Figure 3.3: Fisher's Iris Data (collected by Anderson). Scatterplot matrix of Sepal L., Sepal W., Petal L. and Petal W.]

The main idea behind principal components analysis is that high information corresponds to high variance. So, if we wanted to reduce the $\mathbf{x}$s to a single dimension we would transform $\mathbf{x}$ to $y = \mathbf{a}^T \mathbf{x}$, choosing $\mathbf{a}$ so that $y$ has the largest variance possible. It turns out that $\mathbf{a}$ should be the eigenvector corresponding to the largest eigenvalue of the variance (covariance) matrix of $\mathbf{x}$, $\Sigma$.
It is also possible to show that of all the directions orthogonal to the direction of highest variance, the (second) highest variance is in the direction parallel to the eigenvector of the second largest eigenvalue of $\Sigma$. These results extend all the way to $c$ dimensions.

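The estimate of $\Sigma$ is
$S = \frac{1}{n - 1} \sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$   (3.31)
where $\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$.
The eigenvalues of $S$ are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_c \ge 0$.
The eigenvectors of $S$ corresponding to $\lambda_1$, $\lambda_2$, ..., $\lambda_c$ are $\mathbf{e}_1$, $\mathbf{e}_2$, ..., $\mathbf{e}_c$, respectively.
The vectors $\mathbf{e}_1$, $\mathbf{e}_2$, ..., $\mathbf{e}_c$ are called the principal axes. ($\mathbf{e}_1$ is the first principal axis, etc.)
The $c \times c$ matrix whose $i$th column is $\mathbf{e}_i$ will be denoted as $E$.
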
The principal axes (can be and) are chosen so that they are of length 1 and are orthogonal (perpendicular). Algebraically, this means that
$\mathbf{e}_i^T \mathbf{e}_j = 1$ if $i = j$, and $\mathbf{e}_i^T \mathbf{e}_j = 0$ if $i \ne j$   (3.32)
The vector $\mathbf{y}$ defined as
$\mathbf{y} = E^T \mathbf{x} = (\mathbf{e}_1^T \mathbf{x}, \ldots, \mathbf{e}_c^T \mathbf{x})^T$
is called the vector of principal component scores of $\mathbf{x}$. The $i$th principal component score of $\mathbf{x}$ is $y_i = \mathbf{e}_i^T \mathbf{x}$; sometimes the principal component scores are referred to as the principal components.

1. The elements of $\mathbf{y}$ are uncorrelated and the sample variance of the $i$th principal component score is $\lambda_i$, the $i$th largest eigenvalue of the sample variance matrix of $\mathbf{x}$. In other words the sample variance matrix of $\mathbf{y}$ is
$\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_c)$
2. The sum of the sample variances for the principal components is equal to the sum of the sample variances for the elements of $\mathbf{x}$. That is,
$\sum_{i=1}^{c} \lambda_i = \sum_{i=1}^{c} s_i^2$
where $s_i^2$ is the sample variance of $x_i$.

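
The two properties above can be checked numerically. The sketch below is restricted to two-dimensional data so that the eigen-decomposition of the sample variance matrix has a closed form and only the standard library is needed; the data and all names are made up for illustration. In higher dimensions a linear algebra library would be used for the eigen-decomposition.

```python
import math

def sample_cov(u, v):
    """Sample covariance with the n - 1 divisor (the variance when u is v)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

# Made-up two-dimensional observations.
x1 = [2.0, 3.1, 4.2, 5.0, 6.1]
x2 = [1.9, 3.0, 3.8, 5.1, 5.9]

s11, s12, s22 = sample_cov(x1, x1), sample_cov(x1, x2), sample_cov(x2, x2)

# Closed-form eigen-decomposition of the symmetric matrix [[s11, s12], [s12, s22]].
centre = (s11 + s22) / 2.0
radius = math.hypot((s11 - s22) / 2.0, s12)
lam1, lam2 = centre + radius, centre - radius      # eigenvalues, largest first
theta = 0.5 * math.atan2(2.0 * s12, s11 - s22)     # angle of the first principal axis
e1 = (math.cos(theta), math.sin(theta))            # unit length by construction
e2 = (-math.sin(theta), math.cos(theta))           # orthogonal to e1

# Principal component scores: project each observation onto the principal axes.
y1 = [e1[0] * a + e1[1] * b for a, b in zip(x1, x2)]
y2 = [e2[0] * a + e2[1] * b for a, b in zip(x1, x2)]

print(sample_cov(y1, y1), lam1)    # variance of first score vs first eigenvalue
print(sample_cov(y1, y2))          # covariance of the scores, close to zero
print(lam1 + lam2, s11 + s22)      # total variance is preserved
```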
[Figure 3.4: Principal component scores for Fisher's Iris Data (scatterplot matrix of y1, y2, y3 and y4). Compare with Figure 3.3.]

Effective Dimensionality
1. The proportion of variance accounted for. Take the first $k$ principal components and add up their variances. Divide by the sum of all the variances, to give
$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{c} \lambda_i}$
where $\lambda_i$ is the variance of the $i$th of the $c$ principal components. This is called the proportion of variance accounted for by the first $k$ principal components.
Usually, projections accounting for over 75% of the total variance are considered to be good. Thus, a 2-D picture will be considered a reasonable representation if
$\frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{c} \lambda_i} > 0.75$

2. The size of important variance. The idea here is to consider the variance if all directions were equally important. In this case the variances would be approximately
$\bar{\lambda} = \frac{1}{c} \sum_{i=1}^{c} \lambda_i$
where $\lambda_i$ is the variance of the $i$th of the $c$ principal components. The argument runs
    If $\lambda_i < \bar{\lambda}$, then the $i$th principal direction is less interesting than average.
and this leads us to discard principal components that have sample variances below $\bar{\lambda}$.
3. Scree diagram. A scree diagram is an index plot of the principal component variances. In other words it is a plot of $\lambda_i$ against $i$. An example of a scree diagram, for the Iris Data, is shown in Figure 3.5.

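
The first two criteria are simple arithmetic on the principal component variances. The sketch below uses hypothetical variances, not those of the Iris Data, and made-up function names; it only illustrates the 75% rule and the below-average rule described above.

```python
def proportion_of_variance(variances, k):
    """Share of the total variance carried by the first k principal components."""
    return sum(variances[:k]) / sum(variances)

def components_at_or_above_average(variances):
    """1-based indices of components whose variance is not below the mean variance."""
    mean_variance = sum(variances) / len(variances)
    return [i + 1 for i, v in enumerate(variances) if v >= mean_variance]

# Hypothetical principal component variances, largest first.
variances = [4.2, 0.24, 0.08, 0.02]

share_2d = proportion_of_variance(variances, 2)
print(round(share_2d, 3), share_2d > 0.75)          # is a 2-D picture reasonable?
print(components_at_or_above_average(variances))    # components kept by rule 2
```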
Normalising
The data can be normalised by carrying out the following steps.
- Centre each variable. In other words subtract the mean of each variable to give
$x_{ij} - \bar{x}_j$
where $x_{ij}$ is the value of variable $j$ for observation $i$ and $\bar{x}_j$ is the mean of variable $j$.
- Divide each element of the centred data by its standard deviation; as a formula this means calculate
$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$
where $s_j$ is the sample standard deviation of variable $j$.

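
The two normalising steps can be sketched with the standard library as follows. The function name `normalise` and the data are made up; the measurements are hypothetical and merely put two variables on very different scales.

```python
import statistics

def normalise(columns):
    """Centre each variable and divide by its sample standard deviation."""
    normalised = []
    for col in columns:
        mean = statistics.mean(col)
        sd = statistics.stdev(col)   # sample standard deviation (n - 1 divisor)
        normalised.append([(value - mean) / sd for value in col])
    return normalised

# Hypothetical measurements: the second variable is on a much larger scale.
data = [
    [5.1, 4.9, 6.3, 5.8, 7.1],
    [140.0, 150.0, 470.0, 410.0, 590.0],
]
z = normalise(data)
for col in z:
    print(round(statistics.mean(col), 12), round(statistics.stdev(col), 12))
```

After normalising, every variable has mean 0 and standard deviation 1, so no variable dominates the principal components simply because of its units.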
[Figure 3.6: If we don't normalise. Two panels: "Mean Centred Data" (axes Petal L. and Sepal W.) and "Scaled Data" (axes 5 x Petal L. and Sepal W.).]

Interpretation
The final part of a principal components analysis is to inspect the eigenvectors in the hope of identifying a meaning for the (important) principal components.
See the book for an interpretation for Fisher's Iris Data.

3.4.2 Correspondence Analysis
Correspondence analysis is a way to represent the structure within incidence matrices. Incidence matrices are also called two-way contingency tables. An example of a $(5 \times 4)$ incidence matrix, with marginal totals, is shown in Table 3.17.

Table 3.17
                        Smoking Category
Staff Group         None  Light  Medium  Heavy  Total
Senior Managers        4      2       3      2     11
Junior Managers        4      3       7      4     18
Senior Employees      25     10      12      4     51
Junior Employees      18     24      33     13     88
Secretaries           10      6       7      2     25
Total                 61     45      62     25    193

Two Stages
- Transform the values in a way that relates to a test for association between rows and columns (chi-squared test).
- Use a dimensionality reduction method to allow us to draw a picture of the relationships between rows and columns in 2-D.
Details are like principal components analysis mathematically; see the book.

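
As background for the first stage, the sketch below computes the ordinary chi-squared statistic for association between the rows and columns of Table 3.17. It shows only the test that the transformation relates to, not the correspondence analysis transformation itself, whose details are in the book; the variable names are made up.

```python
# Counts from Table 3.17 (rows: staff groups; columns: None, Light, Medium, Heavy).
table = [
    [4, 2, 3, 2],       # Senior Managers
    [4, 3, 7, 4],       # Junior Managers
    [25, 10, 12, 4],    # Senior Employees
    [18, 24, 33, 13],   # Junior Employees
    [10, 6, 7, 2],      # Secretaries
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        # Expected count if rows and columns were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi2, 2), df)
```

The marginal totals computed here agree with the Total row and column of Table 3.17.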
3.4.3 Multidimensional Scaling
Multidimensional scaling is the process of converting a set of pairwise dissimilarities for a set of points into a set of co-ordinates for the points.
Examples of dissimilarities could be:
- the price of an airline ticket between pairs of cities;
- road distances between towns (as opposed to straight-line distances);
- a coefficient indicating how different the artefacts found in pairs of tombs within a graveyard are.

Classical Scaling
Classical scaling is also known as metric scaling and as principal co-ordinates analysis. The name 'metric' scaling is used because the dissimilarities are assumed to be distances, or in mathematical terms the measure of dissimilarity is the euclidean metric. The name 'principal co-ordinates analysis' is used because there is a link between this technique and principal components analysis. The name 'classical' is used because it was the first widely used method of multidimensional scaling, and pre-dates the availability of electronic computers.
The derivation of the method used to obtain the configuration is given in the book.

The results of applying classical scaling to British road distances are shown in Figure 3.7. These road distances correspond to the routes recommended by the Automobile Association; these recommended routes are intended to give the minimum travelling time, not the minimum journey distance.
An effect of this that is visible in Figure 3.7 is that the towns and cities have lined up in positions related to the motorway network.
The map also features distortions from the geographical map, such as the position of Holyhead (holy), which appears to be much closer to Liverpool (lver) and Manchester than it really is, and the position of the Cornish peninsula (the part ending at Penzance, penz), which is further from Carmarthen (carm) than it is physically.

[Figure 3.7: Classical scaling configuration for British road distances. Axes: Component 1 and Component 2. Points are labelled with four-letter abbreviations of town and city names, such as holy, lver, penz and carm.]

Ordinal Scaling
Ordinal scaling is used for the same purposes as classical scaling, but for dissimilarities that are not metric, that is, they are not what we would think of as distances. Ordinal scaling is sometimes called non-metric scaling, because the dissimilarities are not metric. Some people call it Shepard-Kruskal scaling, because Shepard and Kruskal are the names of two pioneers of ordinal scaling.
In ordinal scaling, we seek a configuration in which the pairwise distances between points have the same rank order as the corresponding dissimilarities. So, if $\delta_{ij}$ is the dissimilarity between points $i$ and $j$, and $d_{ij}$ is the distance between the same points in the derived configuration, then we seek a configuration in which
$d_{ij} \le d_{kl}$ if $\delta_{ij} \le \delta_{kl}$

3.4.4 Cluster Analysis and Mixture Decomposition
Cluster analysis and mixture decomposition are both techniques to do with identification of concentrations of individuals in a space.

Cluster Analysis
Cluster analysis is used to identify groups of individuals in a sample. The groups are not pre-defined, nor, usually, is the number of groups. The groups that are identified are referred to as clusters.
- hierarchical
  - agglomerative
  - divisive
- non-hierarchical

- Minimum distance or single-link
- Maximum distance or complete-link
- Average distance
- Centroid distance defines the distance between two clusters as the squared distance between the mean vectors (that is, the centroids) of the two clusters.
- Sum of squared deviations defines the distance between two clusters as the sum of the squared distances of individuals from the joint centroid of the two clusters minus the sum of the squared distances of individuals from their separate cluster means.

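
The between-cluster distances listed above can be sketched as small functions. The centroid and sum-of-squared-deviations versions follow the definitions given in the list; the single-link, complete-link and average versions use the usual definitions (smallest, largest and mean pairwise Euclidean distance), which is an assumption because the slides do not spell them out. All names and the two toy clusters are made up.

```python
from itertools import product

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Mean vector of a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def single_link(a, b):
    return min(sq_dist(p, q) ** 0.5 for p, q in product(a, b))

def complete_link(a, b):
    return max(sq_dist(p, q) ** 0.5 for p, q in product(a, b))

def average_distance(a, b):
    return sum(sq_dist(p, q) ** 0.5 for p, q in product(a, b)) / (len(a) * len(b))

def centroid_distance(a, b):
    return sq_dist(centroid(a), centroid(b))

def within_ss(cluster):
    """Sum of squared distances of individuals from their cluster centroid."""
    c = centroid(cluster)
    return sum(sq_dist(p, c) for p in cluster)

def sum_of_squared_deviations(a, b):
    return within_ss(a + b) - within_ss(a) - within_ss(b)

# Two made-up clusters in the plane.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(4.0, 0.0), (4.0, 2.0)]
print(single_link(cluster_a, cluster_b), complete_link(cluster_a, cluster_b))
print(centroid_distance(cluster_a, cluster_b))
print(sum_of_squared_deviations(cluster_a, cluster_b))
```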
[Figure 3.8: Usual way to present results of hierarchical clustering. Tree diagram for individuals 1 to 9; axis: Distance between clusters.]

Non-hierarchical clustering is essentially trying to partition the sample so as to optimize some measure of clustering.
The choice of measure of clustering is usually based on properties of sums of squares and products matrices, like those met in Section 3.3.1, because the aim in the MANOVA is to measure differences between groups.
The main difficulty here is that there are too many different ways to partition the sample for us to try them all, unless the sample is very small. Thus our only way, in general, of guaranteeing that the global optimum is achieved is to use a method such as branch-and-bound.
One of the best known non-hierarchical clustering methods is the $k$-means method.

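
A minimal sketch of the k-means idea is given below, using the common alternating scheme of assigning each individual to its nearest centroid and then recomputing the centroids. This is one standard variant and is not taken from the slides; the function name, starting centroids and data are made up. Like other iterative partitioning methods it is only guaranteed to reach a local optimum, which is exactly the difficulty described above.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid updates until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:   # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Made-up data with two well separated groups, and made-up starting centroids.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

In practice the procedure is usually repeated from several starting configurations and the best partition is kept.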
Mixture Decomposition
Mixture decomposition is related to cluster analysis in that it is used to identify concentrations of individuals. The basic difference between cluster analysis and mixture decomposition is that there is an underlying statistical model in mixture decomposition, whereas there is no such model in cluster analysis. The probability density that has generated the sample data is assumed to be a mixture of several underlying distributions. So we have
$f(\mathbf{x}) = \sum_{k=1}^{K} w_k f_k(\mathbf{x}; \theta_k)$
where $K$ is the number of underlying distributions, the $f_k$s are the densities of the underlying distributions, the $\theta_k$s are the parameters of the underlying distributions, the $w_k$s are positive and sum to one, and $f$ is the density from which the sample has been generated.
Details in one of Hand's books.

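
To make the mixture density concrete, the sketch below evaluates a mixture of two univariate normal distributions. The choice of normal components, the weights and the parameters are all hypothetical; the slides do not specify the form of the underlying distributions.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """Weighted sum of component densities; the weights are positive and sum to one."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Hypothetical two-component mixture.
weights = [0.3, 0.7]
means = [0.0, 5.0]
sds = [1.0, 2.0]

print(mixture_pdf(0.0, weights, means, sds))
print(mixture_pdf(5.0, weights, means, sds))

# Crude numerical check that the mixture is itself a density (area close to 1).
step = 0.01
area = sum(mixture_pdf(-10.0 + step * i, weights, means, sds) for i in range(3001)) * step
print(round(area, 6))
```

Fitting a mixture means estimating the weights and component parameters from the sample, which is the part covered in the book referred to above.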
3.4.5 Latent Variable and Covariance Structure Models
I have never used the techniques in this section, so I do not consider myself expert enough to give a presentation on them.
Not enough time to cover everything.

3.5 Summary
The techniques presented in this chapter do not form anything like an exhaustive list of useful statistical methods. These techniques were chosen because they are either widely used or ought to be widely used. The regression techniques are widely used, though there is some reluctance amongst researchers to make the jump from linear models to generalized linear models.
The multivariate analysis techniques ought to be used more than they are. One of the main obstacles to the adoption of these techniques may be that their roots are in linear algebra.
I feel the techniques presented in this chapter, and their extensions, will remain or become the most widely used statistical techniques. This is why they were chosen for this chapter.