Louis Roussos Sports Data - Istics.Net

Louis Roussos Sports Data

Rank the sports you most like to participate in, 1 = favorite, 7 = least favorite. There are n = 130 rank vectors.

> sportsranks
Baseball Football Basketball Tennis Cycling Swimming Jogging
       1        3          7      2       4        5       6
       1        3          2      5       4        7       6
       1        3          2      5       4        7       6
       4        7          3      1       5        6       2
[...]
       3        2          1      4       7        5       6
       3        2          1      4       5        6       7
       5        7          6      4       1        3       2
       2        1          6      7       3        5       4

K-means in R

Set the number of clusters with centers = K. nstart is the number of times the algorithm is run, each time using a different random starting set of means.

> kmeans(sportsranks,centers=2,nstart=10)
K-means clustering with 2 clusters of sizes 62, 68

Cluster means:
  Baseball Football Basketball   Tennis  Cycling Swimming  Jogging
1 2.451613 2.596774   3.064516 4.112903 4.709677 5.209677 5.854839
2 5.014706 5.838235   4.352941 3.632353 2.573529 2.470588 4.117647

Clustering vector:
1 1 1 2 1 2 2 2 2 2 2 1 2 1 1 2 2 1 1 1 2 1 1 2 2 1 1 2 1 2 2 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 1
2 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2 2 1 1 1 1
2 2 1 1 1 1 2 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 1 1 2 1

Within cluster sum of squares by cluster:
[1] 1074.968 1288.176

Available components:
[1] "cluster" "centers" "withinss" "size"
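The sportsranks data are not reproduced here, so the following is only a sketch on made-up data: fake_ranks is a hypothetical stand-in consisting of 130 random rankings of the seven sports. It shows the same kmeans call and how to read the components listed above.

```r
# Hypothetical stand-in for sportsranks: 130 random rankings of 7 sports.
set.seed(1)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")
fake_ranks <- t(replicate(130, sample(7)))  # each row is a permutation of 1..7
colnames(fake_ranks) <- sports

# Same call as on the slide: K = 2 clusters, 10 random starts.
km <- kmeans(fake_ranks, centers = 2, nstart = 10)

km$size            # number of observations in each cluster
km$centers         # 2 x 7 matrix of cluster means
km$withinss        # within-cluster sum of squares, one value per cluster
table(km$cluster)  # cluster labels of the 130 observations
```

Because the rankings are random, the clusters found here carry no meaning; only the structure of the output matches the slide.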

Getting clusters of size K=2, ..., 10

kms <- vector('list',10)
for(K in 2:10) {
  kms[[K]] <- kmeans(sportsranks,centers=K,nstart=10)
}

K = 1     BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
Group 1    3.79   4.29   3.74   3.86   3.59   3.78   4.95

K = 2     BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
Group 1    5.01   5.84   4.35   3.63   2.57   2.47   4.12
Group 2    2.45   2.60   3.06   4.11   4.71   5.21   5.85

K = 3     BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
Group 1    2.33   2.53   3.05   4.14   4.76   5.33   5.86
Group 2    4.94   5.97   5.00   3.71   2.90   3.35   2.13
Group 3    5.00   5.51   3.76   3.59   2.46   1.90   5.78

K = 4     BaseB  FootB  BsktB    Ten    Cyc   Swim    Jog
Group 1    5.10   5.47   3.75   3.60   2.40   1.90   5.78
Group 2    2.30   2.10   2.65   5.17   4.75   5.35   5.67
Group 3    2.40   3.75   3.90   1.85   4.85   5.20   6.05
Group 4    4.97   6.00   5.07   3.80   2.80   3.23   2.13

K = 2: Group 1 likes swimming and cycling, while group 2 likes the team sports: baseball, football, and basketball.

K = 3: Group 1 appears to be about the same as the team-sports group from K = 2, while groups 2 and 3 both like swimming and cycling. The difference is that group 3 does not like jogging, while group 2 does.

K = 4: The team-sports group has split into one that likes tennis (group 3) and one that doesn't (group 2).
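Readings like these come from looking for the smallest mean ranks in each row of the centers. A minimal sketch, again on hypothetical random rankings rather than the real data, that lists each group's two best-liked sports:

```r
# Hypothetical stand-in for sportsranks: 130 random rankings of 7 sports.
set.seed(7)
x <- t(replicate(130, sample(7)))
colnames(x) <- c("Baseball", "Football", "Basketball", "Tennis",
                 "Cycling", "Swimming", "Jogging")

round(colMeans(x), 2)          # the K = 1 "cluster": overall mean ranks

km <- kmeans(x, centers = 3, nstart = 10)
round(km$centers, 2)           # one row of mean ranks per group

# Lower mean rank = better liked; take the two smallest means per group.
top2 <- apply(km$centers, 1, function(m) names(sort(m))[1:2])
top2                           # 2 x 3 character matrix, one column per group
```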

Plotting two clusters

The idea is to project the observations onto the subspace (which is just a line) that goes through the two clusters' mean vectors. The vector

$$z = \frac{\hat\mu_1 - \hat\mu_2}{\|\hat\mu_1 - \hat\mu_2\|}$$

is the unit vector pointing from $\hat\mu_2$ to $\hat\mu_1$. Then using z as an axis, the projections of the observations onto z have coordinates

$$w_i = x_i z', \quad i = 1, \ldots, N.$$
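The projection can be sketched in a few lines of R. The data matrix x below is a hypothetical stand-in (random rankings), and the variable names d, z, and w are ours; only the formulas for z and w_i come from the slide.

```r
set.seed(2)
x  <- t(replicate(130, sample(7)))      # hypothetical stand-in for sportsranks
km <- kmeans(x, centers = 2, nstart = 10)

d <- km$centers[1, ] - km$centers[2, ]  # mu-hat_1 minus mu-hat_2
z <- d / sqrt(sum(d^2))                 # unit vector pointing from mu-hat_2 to mu-hat_1
w <- as.vector(x %*% z)                 # w_i = x_i z', one coordinate per observation

# One histogram of the projections per cluster, on a common scale.
par(mfrow = c(2, 1))
hist(w[km$cluster == 1], xlim = range(w), xlab = "W", main = "Cluster 1")
hist(w[km$cluster == 2], xlim = range(w), xlab = "W", main = "Cluster 2")
```

By construction, the mean projection of cluster 1 exceeds that of cluster 2 by the distance between the two cluster means.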

The histogram

[Figure, K = 2: two histograms of the projections W (horizontal axis W from −6 to 6, vertical axis Frequency from 0 to 12), with the seven sports (Baseball, Football, Basketball, Tennis, Cycling, Swimming, Jogging) labeled in the figure.]

Plot for K=3

If K = 3, then the three means lie in a plane, hence we would like to project the observations onto that plane. One approach is to use principal components on the means. With

$$Z = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \\ \hat\mu_3 \end{pmatrix},$$

we apply the spectral decomposition to the sample covariance matrix of Z:

$$\frac{1}{3} Z' H_3 Z = G L G', \qquad (1)$$

where G is orthogonal and L is diagonal. The diagonals of L here are 11.77, 4.07, and five zeros. We then rotate the data and the means using G,

$$W = XG \quad \text{and} \quad W^{(\text{means})} = ZG.$$

Only the first two columns in each matrix are relevant.
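A sketch of these steps in R, under two assumptions: x is a hypothetical stand-in for the data (random rankings), and H_3 is taken to be the 3 x 3 centering matrix I_3 − (1/3)11', which is what makes the left side of (1) a sample covariance matrix.

```r
set.seed(3)
x  <- t(replicate(130, sample(7)))   # hypothetical stand-in for sportsranks
km <- kmeans(x, centers = 3, nstart = 10)

Z  <- km$centers                     # 3 x 7 matrix whose rows are the cluster means
H3 <- diag(3) - matrix(1/3, 3, 3)    # centering matrix (assumed meaning of H_3)
S  <- t(Z) %*% H3 %*% Z / 3          # 7 x 7 sample covariance matrix of the means

eg <- eigen(S, symmetric = TRUE)
G  <- eg$vectors                     # orthogonal matrix G
L  <- eg$values                      # diagonal of L: at most two are nonzero

W       <- x %*% G                   # rotated data
W_means <- Z %*% G                   # rotated means

# Only the first two columns matter for the plot.
plot(W[, 1:2], col = km$cluster, xlab = "Var 1", ylab = "Var 2")
text(W_means[, 1:2], labels = 1:3, cex = 2)
```

Three centered points span at most two dimensions, which is why five of the seven eigenvalues are zero up to rounding error.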

The Plot

[Figure, K = 3: scatter plot of Var 2 versus Var 1 (both axes from −4 to 4), with the labels 1, 2, 3 and the seven sports (Baseball, Football, Basketball, Tennis, Cycling, Swimming, Jogging) marked.]

The sums of squares

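[Figure: SS versus K for K = 1, ..., 10; the SS axis runs from about 1500 to 3500.]

$$SS_K = \mathrm{obj}(\hat\mu_1, \ldots, \hat\mu_K) = \sum_{k=1}^{K} \sum_{\{i \mid y_i = k\}} \|x_i - \hat\mu_k\|^2.$$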
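The objective can be computed directly from its definition and checked against the tot.withinss component that kmeans returns. A sketch on hypothetical random rankings (x and ss are our names):

```r
set.seed(4)
x  <- t(replicate(130, sample(7)))        # hypothetical stand-in for sportsranks
ss <- rep(NA_real_, 10)

# K = 1: a single cluster centered at the overall mean vector.
ss[1] <- sum(scale(x, scale = FALSE)^2)

for (K in 2:10) {
  km <- kmeans(x, centers = K, nstart = 10)
  # Squared distance from each observation to the mean of its own cluster.
  ss[K] <- sum((x - km$centers[km$cluster, ])^2)
  # Should agree with the value kmeans reports.
  stopifnot(isTRUE(all.equal(ss[K], km$tot.withinss)))
}

plot(1:10, ss, type = "b", xlab = "K", ylab = "SS")
```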

The reduction of sums of squares

[Figure: 1-SS[k]/SS[k-1] versus K for K = 2, ..., 10; the vertical axis runs from about 0.05 to 0.30.]

$$1 - \frac{SS_K}{SS_{K-1}}$$
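As a small illustration of this ratio, the helper below (ss_reduction is a hypothetical name) turns a vector of sums of squares for K = 1, 2, ... into the proportional reductions for K = 2, 3, .... The input numbers are invented for the example and are not the sports data.

```r
# Proportional reduction 1 - SS_K / SS_{K-1}, for K = 2, ..., length(ss).
ss_reduction <- function(ss) 1 - ss[-1] / ss[-length(ss)]

ss_toy <- c(100, 60, 45, 40)   # made-up SS values for K = 1, 2, 3, 4
r <- ss_reduction(ss_toy)
round(r, 3)                    # 0.400 0.250 0.111
```

A large reduction at K followed by small ones afterwards suggests that K clusters capture most of the structure.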

Silhouettes in R

The function silhouette.km finds the silhouettes for a given clustering, then sort.silhouette orders them, first by cluster number, then by value. To plot the silhouettes for K = 2, ..., 10:

sil.ave <- NULL # To collect the silhouettes' means for each K
par(mfrow=c(3,3))
for(K in 2:10) {
  sil <- silhouette.km(sportsranks,kms[[K]]$centers)
  sil.ave <- c(sil.ave,mean(sil))
  ssil <- sort.silhouette(sil,kms[[K]]$cluster)
  plot(ssil,type='h',xlab='Observations',ylab='Silhouettes')
  title(paste('K =',K))
}

The sil.ave calculated above can then be used to obtain the plot of averages:

plot(2:10,sil.ave,type='l',xlab='K',ylab='Average silhouette width')
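silhouette.km and sort.silhouette are the author's own functions and are not part of base R. As a substitute, the sketch below uses silhouette from the cluster package, which computes the standard silhouette widths from pairwise distances. That definition may differ from the one in silhouette.km (which takes the cluster centers), so the numbers need not match the slides. The data are hypothetical random rankings.

```r
library(cluster)   # provides silhouette(); swapped in for the author's silhouette.km

set.seed(8)
x <- t(replicate(130, sample(7)))      # hypothetical stand-in for sportsranks
d <- dist(x)                           # Euclidean distances between observations

sil_ave <- numeric(0)
par(mfrow = c(3, 3))
for (K in 2:10) {
  km  <- kmeans(x, centers = K, nstart = 10)
  sil <- silhouette(km$cluster, d)     # columns: cluster, neighbor, sil_width
  w   <- sil[, "sil_width"]
  sil_ave <- c(sil_ave, mean(w))
  ord <- order(km$cluster, w)          # sort by cluster number, then by value
  plot(w[ord], type = "h", xlab = "Observations", ylab = "Silhouettes")
  title(paste("K =", K))
}

par(mfrow = c(1, 1))
plot(2:10, sil_ave, type = "l", xlab = "K", ylab = "Average silhouette width")
```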

Plotting the silhouettes

[Figure: sorted silhouettes (vertical axis from about 0.2 to 0.8) plotted against the observations for K = 2, 3, 4, 5. Average silhouette widths: K = 2, Ave = 0.625; K = 3, Ave = 0.555; K = 4, Ave = 0.508; K = 5, Ave = 0.534.]

Plotting the silhouettes’ averages

[Figure: average silhouette width (axis from about 0.50 to 0.62) versus K = 2, ..., 10.]

K = 2 seems like a good choice.

Model-based clustering – Car data

The data consist of size measurements on 111 automobiles. The variables include length, wheelbase, width, height, front and rear head room, front leg room, rear seating, front and rear shoulder room, and luggage area. The data are in the file cars. The variables have been normalized to have medians of 0 and median absolute deviations (MAD) of 1.4826. (The constant 1.4826 is the factor by which R's mad function scales the raw median absolute deviation so that, for N(0, 1) data, it estimates the standard deviation 1.)
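One normalization consistent with this description is sketched below, assuming that the value 1.4826 is what R's mad function reports after scaling, so that the raw median absolute deviation of each normalized variable is 1. The vector x is a made-up measurement, not a column of the cars file.

```r
set.seed(5)
x <- rnorm(111, mean = 180, sd = 12)     # hypothetical raw measurement on 111 cars

raw_mad <- median(abs(x - median(x)))    # unscaled median absolute deviation
z <- (x - median(x)) / raw_mad           # center at the median, scale by the raw MAD

median(z)   # 0
mad(z)      # 1.4826, since mad() multiplies the raw MAD (here 1) by 1.4826
```

Using the median and MAD instead of the mean and standard deviation keeps a few extreme vehicles from dominating the scaling.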

R for model-based clustering

The R function we use is in the package mclust. The function is Mclust. The basic command is simple:

mcars <- Mclust(cars)

There are many options for plotting in the package. To see a plot of the BIC's, use

plot(mcars,cars,what='BIC')

You have to click on the graphics window, or hit enter, to reveal the plot. Note that the BIC's in this function are actually the −BIC's, so we want to maximize them.
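Since the cars file is not included here, the following sketch runs the same kind of fit on R's built-in faithful data set, purely as a stand-in. It assumes the mclust package is installed and skips the fit otherwise. The plot calls omit the data argument, which current mclust releases do not appear to need; that is our reading of the package, not something stated on the slides.

```r
# Sketch only: 'faithful' (built into R) stands in for the cars data.
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)

  mc <- Mclust(faithful)           # fits the covariance models over a range of K

  print(mc$modelName)              # code of the best model, e.g. "VVV" or "EEE"
  print(mc$G)                      # number of components chosen
  print(table(mc$classification))  # cluster sizes

  plot(mc, what = "BIC")                               # BIC curves, larger is better
  plot(mc, what = "classification", dimens = c(1, 2))  # clusters in two chosen variables
}
```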

Plotting the BIC’s

[Figure: BIC (axis from about −6000 to −4000) versus number of components, with one curve for each of the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, and VVV.]

K = 2, VVV is best.

What is VVV?

To find the name of the best model:

> mcars
best model: ellipsoidal, unconstrained with 2 components

That K = 2 is easy to see. The assumptions on the covariance matrices are "ellipsoidal," which means they have no special structure, and "unconstrained," which means they are not assumed equal for the two groups, Σ1 ≠ Σ2.

To plot variable 1 (length) versus variable 4 (height), use

plot(mcars,cars,what='classification',dimens=c(1,4))

Plotting the clusters

[Figure: four scatter plots of the clusters: Height versus Length, FrtLegRoom versus Width, Luggage versus RearHd, and PC2 versus PC1.]

The cars in group 2

                      Rear Head  Rear Seating  Rear Shoulder  Luggage
Chevrolet Corvette         −4.0        −19.67         −28.00     −8.0
Honda Civic CRX            −4.0        −19.67         −28.00     −8.0
Mazda MX5 Miata            −4.0        −19.67         −28.00     −8.0
Mazda RX7                  −4.0        −19.67         −28.00     −8.0
Nissan 300ZX               −4.0        −19.67         −28.00     −8.0
Chevrolet Astro             2.5          0.33          −1.75     −8.0
Chevrolet Lumina APV        2.0          3.33           4.00     −8.0
Dodge Caravan               2.5         −0.33          −6.25     −8.0
Dodge Grand Caravan         2.0          2.33           3.25     −8.0
Ford Aerostar               1.5          1.67           4.25     −8.0
Mazda MPV                   3.5          0.00          −5.50     −8.0
Mitsubishi Wagon            2.5        −19.00           2.50     −8.0
Nissan Axxess               2.5          0.67           1.25     −8.5
Nissan Van                  3.0        −19.00           2.25     −8.0
Volkswagen Vanagon          7.0          6.33          −7.25     −8.0

Just group 1

Redo on just the group 1 automobiles:

cars1 <- cars[mcars$classification==1,]
mcars1 <- Mclust(cars1)
mcars1
best model: ellipsoidal multivariate normal with 1 components

The best is one big cluster.

The models in mclust

Code   Description                                          Σ_k
EII    spherical, equal volume                              σ² I_p
VII    spherical, unequal volume                            σ_k² I_p
EEI    diagonal, equal volume and shape                     Λ
VEI    diagonal, varying volume, equal shape                c_k Δ
EVI    diagonal, equal volume, varying shape                c Δ_k
VVI    diagonal, varying volume and shape                   Λ_k
EEE*   ellipsoidal, equal volume, shape, and orientation    Σ
EEV    ellipsoidal, equal volume and equal shape            Γ_k Λ Γ_k′
VEV    ellipsoidal, equal shape                             c_k Γ_k Δ Γ_k′
VVV*   ellipsoidal, varying volume, shape, and orientation  arbitrary

Here, Λ's are diagonal matrices with positive diagonals, Δ's are diagonal matrices with positive diagonals whose product is 1, Γ's are orthogonal matrices, Σ's are arbitrary nonnegative definite symmetric matrices, and c's are positive scalars. A subscript k on an element means the groups can have different values for that element. No subscript means that element is the same for each group.
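To make the volume, shape, and orientation decomposition concrete, here is a small two-dimensional sketch of the VEV row: each group has its own scalar c_k and orthogonal Γ_k but shares the shape matrix Δ. The specific numbers and the helper rot are invented for illustration.

```r
# 2 x 2 rotation by angle theta: an orthogonal matrix, playing the role of Gamma_k.
rot <- function(theta) {
  matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
}

Delta <- diag(c(4, 1/4))   # shared shape: positive diagonal whose product is 1

# VEV: Sigma_k = c_k Gamma_k Delta Gamma_k'
Sigma1 <- 1 * rot(0)      %*% Delta %*% t(rot(0))       # c_1 = 1, no rotation
Sigma2 <- 3 * rot(pi / 3) %*% Delta %*% t(rot(pi / 3))  # c_2 = 3, rotated 60 degrees

det(Sigma1)                # c_1^2 = 1, because det(Delta) = 1
det(Sigma2)                # c_2^2 = 9
eigen(Sigma2)$values / 3   # 4 and 1/4: the same shape as Sigma1
```

The determinant depends only on c_k (the volume), the eigenvalue ratios only on Δ (the shape), and the eigenvectors only on Γ_k (the orientation).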

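Hierarchical clustering of the sports

plot(hclust(dist(t(sportsranks)))) # the original slides call plclust(), which is defunct in current R

[Figure: complete-linkage dendrogram of the seven sports; vertical axis Height (about 20 to 40); leaves in the order Baseball, Football, Basketball, Jogging, Tennis, Cycling, Swimming.]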

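Hierarchical clustering of the individuals

par(mfrow=c(2,1))
dxs <- dist(sportsranks) # Gets Euclidean distances
lbl <- rep(' ',130) # Prefer no labels for the individuals
# The original slides call plclust(), which is defunct in current R; plot() takes the same arguments here
plot(hclust(dxs),xlab='Complete linkage',sub=' ',labels=lbl)
plot(hclust(dxs,method='single'),xlab='Single linkage',sub=' ',labels=lbl)

[Figure: two dendrograms of the 130 individuals. Top: complete linkage, Height from 0 to about 8. Bottom: single linkage, Height from 0 to about 4.]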
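A self-contained sketch of both hierarchical clusterings on hypothetical random rankings. The call to cutree, which cuts a dendrogram into a chosen number of groups, is our addition and does not appear on the slides.

```r
set.seed(6)
x <- t(replicate(130, sample(7)))        # hypothetical stand-in for sportsranks
colnames(x) <- c("Baseball", "Football", "Basketball", "Tennis",
                 "Cycling", "Swimming", "Jogging")

# Cluster the sports: transpose so that each sport is a row.
hc_sports <- hclust(dist(t(x)))          # complete linkage is the default
plot(hc_sports, xlab = "Complete linkage", sub = "")

# Cluster the individuals, here with single linkage.
hc_people <- hclust(dist(x), method = "single")
plot(hc_people, labels = FALSE, xlab = "Single linkage", sub = "")

# Cut the individuals' tree into two groups (not on the slides).
groups <- cutree(hc_people, k = 2)
table(groups)
```

Single linkage tends to chain observations together, so cutting its tree often yields one large group and a few stragglers, which is one reason to compare it with complete linkage.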