louis roussos sports data - istics.net
Post on 21-May-2022
Louis Roussos Sports Data
Rank the sports you most like to participate in, 1 = favorite, 7 = least favorite. There are n = 130 rank vectors.
> sportsranks
Baseball Football Basketball Tennis Cycling Swimming Jogging
1 3 7 2 4 5 6
1 3 2 5 4 7 6
1 3 2 5 4 7 6
4 7 3 1 5 6 2
[...]
3 2 1 4 7 5 6
3 2 1 4 5 6 7
5 7 6 4 1 3 2
2 1 6 7 3 5 4
K-means in R
Set the number of clusters K with the centers argument. nstart is the number of times the algorithm is run, each time using a different random starting set of means.
> kmeans(sportsranks,centers=2,nstart=10)
K-means clustering with 2 clusters of sizes 62, 68
Cluster means:
  Baseball Football Basketball   Tennis  Cycling Swimming  Jogging
1 2.451613 2.596774   3.064516 4.112903 4.709677 5.209677 5.854839
2 5.014706 5.838235   4.352941 3.632353 2.573529 2.470588 4.117647
Clustering vector:
1 1 1 2 1 2 2 2 2 2 2 1 2 1 1 2 2 1 1 1 2 1 1 2 2 1 1 2 1 2 2 2 1 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 1 1
2 1 1 2 2 1 1 1 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2 2 1 1 1 1
2 2 1 1 1 1 2 2 2 1 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 1 1 1 2 1
Within cluster sum of squares by cluster:
[1] 1074.968 1288.176
Available components:
[1] "cluster" "centers" "withinss" "size"
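As a small illustration of the components listed above, the following sketch fits K = 2 and checks that the reported centers are the column means of the rows assigned to each cluster. The matrix x is a stand-in of random rankings (an assumption), not the real sportsranks data.

```r
# Stand-in data (assumption): 130 random rankings of 7 sports,
# NOT the real sportsranks matrix.
set.seed(1)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")
x <- t(replicate(130, sample(7)))
colnames(x) <- sports

fit <- kmeans(x, centers = 2, nstart = 10)

fit$size      # number of observations in each cluster
fit$centers   # 2 x 7 matrix of cluster means
fit$withinss  # within-cluster sum of squares, one value per cluster

# Recompute the centers by hand from the clustering vector.
manual <- rbind(colMeans(x[fit$cluster == 1, ]),
                colMeans(x[fit$cluster == 2, ]))
max(abs(manual - fit$centers))  # essentially zero
```

The cluster numbers themselves are arbitrary, so a rerun may swap the labels 1 and 2.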
Getting clusters of size K=2, ..., 10
kms <- vector('list',10)
for(K in 2:10) {
  kms[[K]] <- kmeans(sportsranks,centers=K,nstart=10)
}
K = 1    BaseB FootB BsktB  Ten  Cyc Swim  Jog
Group 1   3.79  4.29  3.74 3.86 3.59 3.78 4.95

K = 2    BaseB FootB BsktB  Ten  Cyc Swim  Jog
Group 1   5.01  5.84  4.35 3.63 2.57 2.47 4.12
Group 2   2.45  2.60  3.06 4.11 4.71 5.21 5.85

K = 3    BaseB FootB BsktB  Ten  Cyc Swim  Jog
Group 1   2.33  2.53  3.05 4.14 4.76 5.33 5.86
Group 2   4.94  5.97  5.00 3.71 2.90 3.35 2.13
Group 3   5.00  5.51  3.76 3.59 2.46 1.90 5.78

K = 4    BaseB FootB BsktB  Ten  Cyc Swim  Jog
Group 1   5.10  5.47  3.75 3.60 2.40 1.90 5.78
Group 2   2.30  2.10  2.65 5.17 4.75 5.35 5.67
Group 3   2.40  3.75  3.90 1.85 4.85 5.20 6.05
Group 4   4.97  6.00  5.07 3.80 2.80 3.23 2.13
K = 2: Group 1 likes swimming and cycling, while group 2 likes the team sports,
baseball, football, and basketball. K = 3: Group 1 appears to be about the same as the
team sports group from K = 2, while groups 2 and 3 both like swimming and cycling.
The difference is that group 3 does not like jogging, while group 2 does. K = 4: The
team-sports group has split into one that likes tennis (group 3), and one that doesn’t
(group 2).
Plotting two clusters
The idea is to project the observations onto the subspace (which is just a line) that goes through the two clusters' mean vectors. The vector
$$z = \frac{\hat\mu_1 - \hat\mu_2}{\|\hat\mu_1 - \hat\mu_2\|}$$
is the unit vector pointing from $\hat\mu_2$ to $\hat\mu_1$. Then using $z$ as an axis, the projections of the observations onto $z$ have coordinates
$$w_i = x_i z', \quad i = 1, \ldots, N.$$
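The projection just described can be sketched in a few lines of R. This is only an illustration on stand-in data (random rankings, an assumption), with variable names chosen here rather than taken from the slides.

```r
# Stand-in data (assumption): random rankings, not the real sportsranks.
set.seed(1)
x <- t(replicate(130, sample(7)))

fit <- kmeans(x, centers = 2, nstart = 10)
mu1 <- fit$centers[1, ]
mu2 <- fit$centers[2, ]

# Unit vector pointing from mu2 to mu1.
z <- (mu1 - mu2) / sqrt(sum((mu1 - mu2)^2))

# Coordinates of the projections: w_i = x_i z'.
w <- drop(x %*% z)

# Histogram of the projected observations.
hist(w, breaks = 20, xlab = "W", main = "K=2")
```

Because z points toward the first mean, observations in cluster 1 have larger w on average than those in cluster 2.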
The histogram
[Figure: K=2. Histograms of the projections W (horizontal axis W from −6 to 6, vertical axis Frequency), with the seven sports (Baseball, Football, Basketball, Tennis, Cycling, Swimming, Jogging) marked.]
Plot for K=3
If K = 3, then the three means lie in a plane, hence we would like to project the observations onto that plane. One approach is to use principal components on the means: with
$$Z = \begin{pmatrix} \hat\mu_1 \\ \hat\mu_2 \\ \hat\mu_3 \end{pmatrix},$$
we apply the spectral decomposition to the sample covariance matrix of Z:
$$\frac{1}{3} Z' H_3 Z = G L G', \tag{1}$$
where G is orthogonal and L is diagonal. The diagonals of L here are 11.77, 4.07, and five zeros. We then rotate the data and the means using G,
$$W = XG \quad \text{and} \quad W^{(\mathrm{means})} = ZG.$$
Only the first two columns in each matrix are relevant.
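A minimal sketch of this rotation, on stand-in data (random rankings, an assumption). It takes H3 to be the 3 x 3 centering matrix, which is the usual meaning of that symbol but is not spelled out on the slide.

```r
# Stand-in data (assumption): random rankings, not the real sportsranks.
set.seed(1)
x <- t(replicate(130, sample(7)))

fit <- kmeans(x, centers = 3, nstart = 10)
Z <- fit$centers                     # 3 x 7 matrix of cluster means

# Centering matrix H3 = I - (1/3) 1 1' (assumed definition).
H3 <- diag(3) - matrix(1 / 3, 3, 3)

# Sample covariance matrix of the means and its spectral decomposition.
S <- t(Z) %*% H3 %*% Z / 3
eg <- eigen(S, symmetric = TRUE)
G <- eg$vectors                      # orthogonal
L <- eg$values                       # only the first two are nonzero

# Rotate the data and the means.
W <- x %*% G
Wmeans <- Z %*% G

# Plot the first two rotated variables, marking the means 1, 2, 3.
plot(W[, 1:2], col = fit$cluster, xlab = "Var 1", ylab = "Var 2")
text(Wmeans[, 1:2], labels = 1:3, cex = 2)
```

Three points always lie in a plane, so the covariance matrix of the means has rank at most two, which is why the remaining five eigenvalues vanish.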
The Plot
[Figure: K=3. Plot of Var 2 versus Var 1 (both axes from −4 to 4), with the clusters labeled 1, 2, 3 and the seven sports (Baseball, Football, Basketball, Tennis, Cycling, Swimming, Jogging) marked.]
The sums of squares
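[Figure: SS plotted against K, for K up to 10; the vertical axis runs from about 1500 to 3500.]
$$SS_K = \mathrm{obj}(\hat\mu_1, \ldots, \hat\mu_K) = \sum_{k=1}^{K} \sum_{\{i \mid y_i = k\}} \|x_i - \hat\mu_k\|^2.$$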
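To make the double sum concrete, here is a sketch that evaluates it term by term for K = 3 and compares it with the tot.withinss component that kmeans returns. The data are stand-in random rankings (an assumption).

```r
# Stand-in data (assumption): random rankings, not the real sportsranks.
set.seed(1)
x <- t(replicate(130, sample(7)))

fit <- kmeans(x, centers = 3, nstart = 10)
y <- fit$cluster    # y_i: cluster assignment of observation i
mu <- fit$centers   # row k is the estimated mean of cluster k

# Outer sum over clusters, inner sum over the observations in cluster k.
ss <- 0
for (k in 1:3) {
  xk <- x[y == k, , drop = FALSE]
  ss <- ss + sum(sweep(xk, 2, mu[k, ])^2)
}

ss
fit$tot.withinss    # same value, as reported by kmeans
```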
The reduction of sums of squares
[Figure: the relative reduction 1-SS[k]/SS[k-1] plotted against K, for K up to 10; the vertical axis runs from about 0.05 to 0.30.]
$$1 - \frac{SS_K}{SS_{K-1}}$$
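The relative reduction can be computed directly from the sequence of sums of squares. The sketch below uses stand-in random rankings (an assumption), so the values will not match the plotted ones.

```r
# Stand-in data (assumption): random rankings, not the real sportsranks.
set.seed(1)
x <- t(replicate(130, sample(7)))

ss <- numeric(10)
# K = 1: sum of squares around the overall mean vector.
ss[1] <- sum(scale(x, scale = FALSE)^2)
for (K in 2:10) {
  ss[K] <- kmeans(x, centers = K, nstart = 10)$tot.withinss
}

# Proportional reduction when going from K-1 to K clusters.
reduction <- 1 - ss[2:10] / ss[1:9]

plot(2:10, reduction, type = "b", xlab = "K", ylab = "1-SS[k]/SS[k-1]")
```

A large value at some K followed by small values afterward suggests that adding clusters beyond that K buys little.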
Silhouettes in R
The function silhouette.km finds the silhouettes for a given clustering, then sort.silhouette orders them, first by cluster number, then by value. To plot the silhouettes for K = 2, ..., 10:
sil.ave <- NULL # To collect silhouette's means for each K
par(mfrow=c(3,3))
for(K in 2:10) {
  sil <- silhouette.km(sportsranks,kms[[K]]$centers)
  sil.ave <- c(sil.ave,mean(sil))
  ssil <- sort.silhouette(sil,kms[[K]]$cluster)
  plot(ssil,type='h',xlab='Observations',ylab='Silhouettes')
  title(paste('K =',K))
}
The sil.ave calculated above can then be used to obtain the plot of averages:
plot(2:10,sil.ave,type='l',xlab='K',ylab='Average silhouette width')
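The functions silhouette.km and sort.silhouette are not part of base R. For readers without them, the sketch below defines a hypothetical helper, sil_centers, that computes one centroid-based version of the silhouette: with a the squared distance to the observation's own cluster mean and b the squared distance to the nearest other mean, the silhouette is (b - a)/max(a, b). Whether this matches silhouette.km exactly is an assumption, and the data are stand-in random rankings.

```r
# Hypothetical helper (not the course's silhouette.km): centroid-based
# silhouettes using squared Euclidean distances to the cluster means.
sil_centers <- function(x, centers, cluster) {
  n <- nrow(x)
  # n x K matrix of squared distances from each observation to each mean
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  own <- cbind(seq_len(n), cluster)
  a <- d2[own]              # distance to own cluster mean
  d2[own] <- Inf            # exclude own cluster when finding the runner-up
  b <- apply(d2, 1, min)    # distance to the nearest other mean
  (b - a) / pmax(a, b)
}

# Stand-in data (assumption): random rankings, not the real sportsranks.
set.seed(1)
x <- t(replicate(130, sample(7)))

sil.ave <- numeric(0)
for (K in 2:5) {
  fit <- kmeans(x, centers = K, nstart = 10)
  sil <- sil_centers(x, fit$centers, fit$cluster)
  sil.ave <- c(sil.ave, mean(sil))
}
plot(2:5, sil.ave, type = "l", xlab = "K", ylab = "Average silhouette width")
```

Values near 1 indicate observations much closer to their own mean than to any other; values near 0 indicate observations on the border between two clusters.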
Plotting the silhouettes
[Figure: silhouette plots, one panel per K. Horizontal axis: observations (0 to about 130); vertical axis: silhouettes (about 0.2 to 0.8). Panel averages shown:]
K = 2: Ave = 0.625
K = 3: Ave = 0.555
K = 4: Ave = 0.508
K = 5: Ave = 0.534
Plotting the silhouettes’ averages
[Figure: average silhouette width plotted against K, for K up to 10; the vertical axis runs from about 0.50 to 0.62.]
K = 2 seems like a good choice.
Model-based clustering – Car data
The data consist of size measurements on 111 automobiles. The variables include length, wheelbase, width, height, front and rear head room, front leg room, rear seating, front and rear shoulder room, and luggage area. The data are in the file cars. The variables have been normalized to have medians of 0 and median absolute deviations (MAD) of 1.4826 (the MAD for a N(0, 1)).
R for model-based clustering
The R function we use is in the package mclust. The function is Mclust. The basic command is simple:
mcars <- Mclust(cars)
There are many options for plotting in the package. To see a plot of the BIC's, use
plot(mcars,cars,what='BIC')
You have to click on the graphics window, or hit enter, to reveal the plot. Note that the BIC's in this function are actually the −BIC's, so we want to maximize them.
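The sign convention can be illustrated by hand for the simplest case, a single unconstrained multivariate normal component. To my understanding mclust reports 2 loglik − (number of parameters) log(n), which is the negative of the usual BIC; the sketch below computes both in base R on simulated stand-in data (an assumption, not the cars data) and does not call mclust itself.

```r
# Stand-in data (assumption): simulated normal data, not the cars file.
set.seed(1)
n <- 111
p <- 3
y <- matrix(rnorm(n * p), n, p)

# Maximum likelihood estimate of the covariance matrix (divisor n).
Sigma <- cov(y) * (n - 1) / n
logdet <- as.numeric(determinant(Sigma)$modulus)  # log determinant

# Maximized log-likelihood of one multivariate normal component.
loglik <- -n / 2 * (p * log(2 * pi) + logdet + p)

# Free parameters: p means plus p(p+1)/2 covariance entries.
npar <- p + p * (p + 1) / 2

bic_mclust <- 2 * loglik - npar * log(n)  # larger is better
bic_usual <- -bic_mclust                  # smaller is better
c(bic_mclust = bic_mclust, bic_usual = bic_usual)
```

With this convention, comparing models means picking the largest value on the BIC plot rather than the smallest.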
Plotting the BIC’s
[Figure: BIC plotted against the number of components (up to about 9), vertical axis from about −6000 to −4000, one curve for each model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV.]
K = 2, VVV is best.
What is VVV?
To find the name of the best model:
> mcars
best model: ellipsoidal, unconstrained with 2 components
That K = 2 is easy to see. The assumptions on the covariance matrices are “ellipsoidal,” which means they have no special structure, and “unconstrained,” which means they are not assumed equal for the two groups, $\Sigma_1 \neq \Sigma_2$.
To plot variable 1 (length) versus variable 4 (height), use
plot(mcars,cars,what='classification',dimens=c(1,4))
Plotting the clusters
[Figure: four scatter plots of the two clusters: Height versus Length, FrtLegRoom versus Width, Luggage versus RearHd, and PC2 versus PC1.]
The cars in group 2
                      Rear Head  Rear Seating  Rear Shoulder  Luggage
Chevrolet Corvette         −4.0        −19.67         −28.00     −8.0
Honda Civic CRX            −4.0        −19.67         −28.00     −8.0
Mazda MX5 Miata            −4.0        −19.67         −28.00     −8.0
Mazda RX7                  −4.0        −19.67         −28.00     −8.0
Nissan 300ZX               −4.0        −19.67         −28.00     −8.0
Chevrolet Astro             2.5          0.33          −1.75     −8.0
Chevrolet Lumina APV        2.0          3.33           4.00     −8.0
Dodge Caravan               2.5         −0.33          −6.25     −8.0
Dodge Grand Caravan         2.0          2.33           3.25     −8.0
Ford Aerostar               1.5          1.67           4.25     −8.0
Mazda MPV                   3.5          0.00          −5.50     −8.0
Mitsubishi Wagon            2.5        −19.00           2.50     −8.0
Nissan Axxess               2.5          0.67           1.25     −8.5
Nissan Van                  3.0        −19.00           2.25     −8.0
Volkswagen Vanagon          7.0          6.33          −7.25     −8.0
Just group 1
Redo on just the group 1 automobiles:
> cars1 <- cars[mcars$classification==1,]
> mcars1 <- Mclust(cars1)
> mcars1
best model: elliposidal multivariate normal with 1 components
The best is one big cluster.
The models in mclust
Code   Description                                           $\Sigma_k$
EII    spherical, equal volume                               $\sigma^2 I_p$
VII    spherical, unequal volume                             $\sigma_k^2 I_p$
EEI    diagonal, equal volume and shape                      $\Lambda$
VEI    diagonal, varying volume, equal shape                 $c_k \Delta$
EVI    diagonal, equal volume, varying shape                 $c \Delta_k$
VVI    diagonal, varying volume and shape                    $\Lambda_k$
EEE*   ellipsoidal, equal volume, shape, and orientation     $\Sigma$
EEV    ellipsoidal, equal volume and equal shape             $\Gamma_k \Lambda \Gamma_k'$
VEV    ellipsoidal, equal shape                              $c_k \Gamma_k \Delta \Gamma_k'$
VVV*   ellipsoidal, varying volume, shape, and orientation   arbitrary
Here, Λ’s are diagonal matrices with positive diagonals, ∆’s are diagonal matrices with
positive diagonals whose product is 1, Γ’s are orthogonal matrices, Σ’s are arbitrary
nonnegative definite symmetric matrices, and c’s are positive scalars. A subscript k on
an element means the groups can have different values for that element. No subscript
means that element is the same for each group.
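As an illustration of this notation, the sketch below builds two covariance matrices of the VEV form, sharing the shape matrix Δ but with different volumes c_k and orientations Γ_k. All numerical values are made up for the example.

```r
p <- 2

# Helper for this example: a 2 x 2 rotation (orthogonal) matrix.
rotation <- function(theta) {
  matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
}

Delta <- diag(c(4, 1 / 4))   # shape: positive diagonals with product 1
Gamma1 <- rotation(pi / 6)   # orientation of group 1
Gamma2 <- rotation(pi / 3)   # orientation of group 2
c1 <- 1                      # volume scalar of group 1
c2 <- 9                      # volume scalar of group 2

# VEV: Sigma_k = c_k Gamma_k Delta Gamma_k'
Sigma1 <- c1 * Gamma1 %*% Delta %*% t(Gamma1)
Sigma2 <- c2 * Gamma2 %*% Delta %*% t(Gamma2)

# Since det(Delta) = 1, the determinant depends only on the volume: c_k^p.
det(Sigma1)
det(Sigma2)

# Equal shape: after dividing out c_k, the eigenvalues agree.
eigen(Sigma1 / c1)$values
eigen(Sigma2 / c2)$values
```

Dropping the subscript on Γ (one common rotation) or on c (one common scalar) gives the more restrictive models in the same family.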
Hierarchical clustering of the sports
plclust(hclust(dist(t(sportsranks))))
[Figure: dendrogram of the seven sports, complete linkage. Vertical axis: Height (about 20 to 40). Leaves from left to right: Baseball, Football, Basketball, Jogging, Tennis, Cycling, Swimming.]
Hierarchical clustering of the individuals
par(mfrow=c(2,1))
dxs <- dist(sportsranks) # Gets Euclidean distances
lbl <- rep(' ',130) # Prefer no labels for the individuals
plclust(hclust(dxs),xlab='Complete linkage',sub=' ',labels=lbl)
plclust(hclust(dxs,method='single'),xlab='Single linkage',sub=' ',labels=lbl)
[Figure: two dendrograms of the 130 individuals. Top: complete linkage, Height from 0 to about 8. Bottom: single linkage, Height from 0 to about 4.]
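In recent versions of R, plclust is no longer available, and the plot method for hclust objects serves the same purpose. The sketch below redoes both dendrograms that way on stand-in data (random rankings, an assumption) and also uses cutree, which is not in the slides, to turn a tree into cluster labels.

```r
# Stand-in data (assumption): random rankings, not the real sportsranks.
set.seed(1)
sports <- c("Baseball", "Football", "Basketball", "Tennis",
            "Cycling", "Swimming", "Jogging")
x <- t(replicate(130, sample(7)))
colnames(x) <- sports

# Clustering the sports: distances between the columns.
hc_sports <- hclust(dist(t(x)))   # complete linkage is the default
plot(hc_sports, xlab = "Complete linkage", sub = "")

# Clustering the individuals: distances between the rows.
dxs <- dist(x)                    # Euclidean distances
hc_complete <- hclust(dxs)
hc_single <- hclust(dxs, method = "single")

par(mfrow = c(2, 1))
plot(hc_complete, labels = FALSE, xlab = "Complete linkage", sub = "")
plot(hc_single, labels = FALSE, xlab = "Single linkage", sub = "")

# Cut the complete-linkage tree into two groups.
groups <- cutree(hc_complete, k = 2)
table(groups)
```

Passing labels = FALSE suppresses the leaf labels, which replaces the vector of blank strings used with plclust.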