#By Joos Korstanje
In this project, I will work with customer data of a wholesale store. The data contains two categorical variables, Region and Channel (whether the client is retail or horeca), and six continuous variables containing the client's annual spending in each product category.
###Data information
1) FRESH: annual spending (m.u.) on fresh products (Continuous)
2) MILK: annual spending (m.u.) on milk products (Continuous)
3) GROCERY: annual spending (m.u.) on grocery products (Continuous)
4) FROZEN: annual spending (m.u.) on frozen products (Continuous)
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
7) CHANNEL: customer's channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
8) REGION: customer's region - Lisbon, Oporto or Other (Nominal)

Descriptive statistics:

| Variable | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|
| FRESH | 3 | 112151 | 12000.30 | 12647.329 |
| MILK | 55 | 73498 | 5796.27 | 7380.377 |
| GROCERY | 3 | 92780 | 7951.28 | 9503.163 |
| FROZEN | 25 | 60869 | 3071.93 | 4854.673 |
| DETERGENTS_PAPER | 3 | 40827 | 2881.49 | 4767.854 |
| DELICATESSEN | 3 | 47943 | 1524.87 | 2820.106 |
| Region | Frequency |
|---|---|
| Lisbon | 77 |
| Oporto | 47 |
| Other Region | 316 |
| Total | 440 |

| Channel | Frequency |
|---|---|
| Horeca | 298 |
| Retail | 142 |
| Total | 440 |
##Part 1: Data preparation
###Getting the data
#getwd()
setwd("/Users/joos/Downloads/") # Adjust this path to where the csv file is stored
getwd()
## [1] "/Users/joos/Downloads"
dat = read.csv("Wholesale customers data.csv", header = TRUE, sep = ",")
str(dat)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
The two categorical variables have to be converted to factor type.
###Changing variable type
data.fact = dat
data.fact$Channel = as.factor(dat$Channel)
data.fact$Region = as.factor(dat$Region)
str(data.fact) # Verify that Channel and Region are now factors
##Inspecting the data
I will check whether there are missing values, whether the variances are equal, whether the variables are correlated, and whether there are outliers.
table(is.na(data.fact))
##
## FALSE
## 3520
This means that there are no missing values: all 3520 cells (440 rows × 8 columns) are present.
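A quick per-column check confirms this:
colSums(is.na(data.fact)) # Number of missing values per column; expect all zeros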
The covariance matrix and the correlation matrix are shown here, in order to see whether a generative approach seems useful for this data. That is the case when the variables are strongly correlated.
cov(data.fact[,-(1:2)])
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 159954927 9381789 -1424713 21236655 -6147825.7
## Milk 9381789 54469967 51083186 4442612 23288343.5
## Grocery -1424713 51083186 90310104 -1854282 41895189.7
## Frozen 21236655 4442612 -1854282 23567853 -3044324.9
## Detergents_Paper -6147826 23288343 41895190 -3044325 22732436.0
## Delicassen 8727310 8457925 5507291 5352342 931680.7
## Delicassen
## Fresh 8727310.0
## Milk 8457924.8
## Grocery 5507291.3
## Frozen 5352341.8
## Detergents_Paper 931680.7
## Delicassen 7952997.5
cor(data.fact[,-(1:2)])
## Fresh Milk Grocery Frozen Detergents_Paper
## Fresh 1.00000000 0.1005098 -0.01185387 0.34588146 -0.1019529
## Milk 0.10050977 1.0000000 0.72833512 0.12399376 0.6618157
## Grocery -0.01185387 0.7283351 1.00000000 -0.04019274 0.9246407
## Frozen 0.34588146 0.1239938 -0.04019274 1.00000000 -0.1315249
## Detergents_Paper -0.10195294 0.6618157 0.92464069 -0.13152491 1.0000000
## Delicassen 0.24468997 0.4063683 0.20549651 0.39094747 0.0692913
## Delicassen
## Fresh 0.2446900
## Milk 0.4063683
## Grocery 0.2054965
## Frozen 0.3909475
## Detergents_Paper 0.0692913
## Delicassen 1.0000000
These two matrices can only be computed for the continuous variables. The strong correlations, such as 0.92 between Grocery and Detergents_Paper, already show that a generative algorithm could be beneficial.
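As a sketch, the strongly correlated pairs can also be extracted programmatically (the 0.6 threshold is my own illustrative choice):
cor.mat = cor(data.fact[,-(1:2)])
cor.mat[lower.tri(cor.mat, diag = TRUE)] = NA # Keep each pair only once
strong = which(abs(cor.mat) > 0.6, arr.ind = TRUE)
data.frame(var1 = rownames(cor.mat)[strong[,1]],
           var2 = colnames(cor.mat)[strong[,2]],
           correlation = cor.mat[strong])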
I will also inspect the categorical variables, starting with the number of observations in each category.
table(data.fact$Channel)
##
## 1 2
## 298 142
table(data.fact$Region)
##
## 1 2 3
## 77 47 316
table(data.fact$Channel,data.fact$Region)
##
## 1 2 3
## 1 59 28 211
## 2 18 19 105
Region 3 is clearly overrepresented. This could be an argument for using a generative algorithm.
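Whether Channel and Region are associated can be checked with a chi-squared test (a quick sketch):
chisq.test(table(data.fact$Channel, data.fact$Region)) # Tests independence of Channel and Region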
Another argument can be unequal variances per category. I will look at this now using several boxplots; a numeric check follows after the plots.
boxplot(data.fact[,-(1:2)])
title("Overall Boxplot")
obsChnl1 = which(data.fact$Channel == 1)
obsChnl2 = which(data.fact$Channel == 2)
datChnl1 = data.fact[obsChnl1,]
datChnl2 = data.fact[obsChnl2,]
boxplot(datChnl1[,-(1:2)])
title("Boxplot of spending of Horeca clients")
boxplot(datChnl2[,-(1:2)])
title("Boxplot of spending by Retail clients")
obsRgn1 = which(data.fact$Region == 1)
obsRgn2 = which(data.fact$Region == 2)
obsRgn3 = which(data.fact$Region == 3)
datRgn1 = data.fact[obsRgn1,]
datRgn2 = data.fact[obsRgn2,]
datRgn3 = data.fact[obsRgn3,]
boxplot(datRgn1[,-(1:2)])
title("Boxplot of spending in Region 1")
boxplot(datRgn2[,-(1:2)])
title("Boxplot of spending in Region 2")
boxplot(datRgn3[,-(1:2)])
title("Boxplot of Spending in Region 3")
The conclusion of this first part is that there is certainly reason to look into generative methods.
##Modeling
Goal of the project: cluster the clients in a way other than the existing Horeca/Retail split. The clustering can be used to send personalized publicity, flyers, or discounts.
I will apply two methods: k-Means and a Gaussian Mixture Model (GMM). The GMM is expected to perform better, because there is correlation in the data and the variances of the different variables are not equal.
The performance of the models will be assessed by creating a training and a test set. It would be best to assess performance with external indices, such as the efficiency of the publicity budget, but I do not have such data. Therefore I will use some internal indices on the application of the model to the test set.
###Splitting the data in training and test
set.seed(12345) # For reproducibility; set.seed is a function call, not an assignment
train.samples = sample(nrow(data.fact), 300)
data.train = data.fact[train.samples, ] # 300 training observations
data.test = data.fact[-train.samples, ] # The remaining 140 observations as test set
###k-Means
First do hierarchical clustering to decide on the k.
d = dist(data.fact[,-(1:2)]) # dist requires numeric data, so the factor columns are dropped
h.clust = hclust(d)
names(h.clust)
## [1] "merge" "height" "order" "labels" "method"
## [6] "call" "dist.method"
plot(h.clust)
Following the dendrogram, I will use 8 clusters, because there seems to be a relatively large gap in merge heights around that split.
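The gap can also be inspected numerically through the merge heights (a sketch; the largest heights belong to the final merges):
rev(h.clust$height)[1:10] # The ten largest merge heights; a big jump suggests a cut point
table(cutree(h.clust, k = 8)) # Cluster sizes when cutting the tree into 8 groups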
I will now do the k-Means.
kmeans.fit = kmeans(data.fact[,-(1:2)], 8) # k-means also requires numeric data
I will make a plot using the principal components.
library(cluster)
clusplot(data.fact, kmeans.fit$cluster, color=TRUE, lines=0)
The fpc package can make a centroid plot against the first two discriminant functions. I had to remove the categorical variables for this plot.
#install.packages("fpc")
library(fpc)
#getAnywhere(plotcluster)
plotcluster(data.fact[,-c(1:2)], kmeans.fit$cluster)
names(kmeans.fit)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
kmeans.fit$tot.withinss
## [1] 39739132712
kmeans.fit$betweenss
## [1] 117856724813
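As internal indices, the proportion of between-cluster variance and the average silhouette width can be computed (a sketch, reusing the cluster package and the distance matrix from above):
kmeans.fit$betweenss / kmeans.fit$totss # Proportion of total variance that is between clusters
sil = silhouette(kmeans.fit$cluster, d) # Silhouette value per observation
summary(sil)$avg.width # Average silhouette width; closer to 1 is better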
##Gaussian Mixture Model
Now I will fit the Gaussian Mixture Model:
#install.packages("mclust")
library(mclust)
## Package 'mclust' version 5.4.5
## Type 'citation("mclust")' for citing this R package in publications.
Fitting the GMM:
GMM.fit = Mclust(data.fact)
summary(GMM.fit)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## -24432.14 440 120 -49594.69 -49597.2
##
## Clustering table:
## 1 2 3
## 144 188 108
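The best models according to BIC can also be listed directly (a sketch using the BIC object stored in the fit):
summary(GMM.fit$BIC) # The top models by BIC value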
names(GMM.fit)
## [1] "call" "data" "modelName" "n"
## [5] "d" "G" "BIC" "bic"
## [9] "loglik" "df" "hypvol" "parameters"
## [13] "z" "classification" "uncertainty"
Looking at the results: three clusters have been made.
GMM.fit$classification
## [1] 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 2 1 2 3 1 1 1 2 2 1 2 2 2 2 2 2 1 2
## [38] 1 1 3 3 2 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 2 2 1 1 1 1 2 1 2 1 2 2 2 3 2 1
## [75] 1 2 2 1 2 2 2 1 1 2 1 1 1 3 2 2 2 2 1 3 1 2 1 2 2 2 1 1 1 3 2 2 1 1 1 1 2
## [112] 1 2 2 2 2 2 2 2 2 2 2 2 1 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 1 2 2
## [149] 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 1 2 1 2 2 2 2 2 3 2 1 2
## [186] 2 2 2 1 1 2 2 2 1 2 2 3 1 3 3 1 1 3 3 3 1 3 1 3 1 3 1 3 3 1 3 1 3 1 3 3 3
## [223] 3 1 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 1 3 1 3 1 3 3 3 3 2 2 2 2 3 2 1 2 1 2 2 3 2 2 2 2 2 2 2 2 1 3 1
## [297] 3 1 1 3 1 1 1 1 1 1 1 3 3 1 3 3 1 3 3 1 3 3 3 1 3 3 3 3 3 1 3 3 3 3 3 1 3
## [334] 1 1 1 3 3 3 3 1 1 2 1 2 2 1 1 2 1 2 1 2 1 2 2 2 1 3 2 2 2 2 2 2 1 2 2 2 2
## [371] 1 2 2 1 2 2 1 2 2 1 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 3 2 2 2 2 2
## [408] 1 1 2 2 2 2 3 2 1 1 2 1 2 2 1 2 1 1 3 2 3 2 2 3 3 2 2 2 3 2 1 2 2
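Since the goal is a segmentation other than Horeca versus Retail, it is worth cross-tabulating the GMM clusters with Channel (a quick check):
table(Cluster = GMM.fit$classification, Channel = data.fact$Channel)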
plot(GMM.fit, what="BIC")
plot(GMM.fit, what="classification")
plot(GMM.fit, what="uncertainty")
plot(GMM.fit, what="density")
For the question of which publicity to send to which cluster of clients, the cluster means give more information.
GMM.fit$parameters$mean
## [,1] [,2] [,3]
## Channel 1.986111 1.000016 1.000000
## Region 2.611103 3.000000 1.655336
## Fresh 9263.521035 11640.291550 16283.094075
## Milk 10989.553304 2786.508043 4113.367431
## Grocery 16331.020308 3327.062831 4830.529959
## Frozen 2305.980702 2578.571941 4955.665063
## Detergents_Paper 7179.166684 760.354550 843.925278
## Delicassen 2101.001619 1008.352343 1656.849532
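As a sketch for targeting publicity, the dominant spending category of each cluster can be read off these means:
spend.means = GMM.fit$parameters$mean[-(1:2), ] # Drop the Channel and Region rows
apply(spend.means, 2, function(m) names(which.max(m))) # Highest-spending category per cluster
From the table above, cluster 1 spends most on Grocery and Milk, while clusters 2 and 3 spend most on Fresh, with cluster 3 also relatively high on Frozen.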
#Conclusion
In this paper, two clustering methods have been applied. The first was a k-means with k based on hierarchical clustering; the second was a Gaussian Mixture Model. In a practical situation, I would have had more information on which results are more valuable in a company setting, which I do not have here. Since I described the use case of sending several types of publicity, I would choose a solution without too many clusters. That would be an argument to rerun k-means with a smaller k, for example the three clusters selected by the GMM. However, in this situation there are large differences in sample sizes between the groups of the categorical variables, and at the same time there is considerable correlation between the variables. This is why, in the end, I have more faith in the Gaussian Mixture Model, and its solution seems better.