#By Joos Korstanje

In this project, I will work with customer data from a wholesale store. The data contains two categorical variables: Region and Channel (whether the client is retail or horeca). There are also six continuous variables describing the client's annual spending in each product category.

###Data information
1) FRESH: annual spending (m.u. = monetary units) on fresh products (Continuous)
2) MILK: annual spending (m.u.) on milk products (Continuous)
3) GROCERY: annual spending (m.u.) on grocery products (Continuous)
4) FROZEN: annual spending (m.u.) on frozen products (Continuous)
5) DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
6) DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
7) CHANNEL: customer's channel - Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
8) REGION: customer's region - Lisbon, Oporto or Other (Nominal)

Descriptive statistics:

| Variable         | Minimum | Maximum | Mean     | Std. Deviation |
|------------------|---------|---------|----------|----------------|
| FRESH            | 3       | 112151  | 12000.30 | 12647.329      |
| MILK             | 55      | 73498   | 5796.27  | 7380.377       |
| GROCERY          | 3       | 92780   | 7951.28  | 9503.163       |
| FROZEN           | 25      | 60869   | 3071.93  | 4854.673      |
| DETERGENTS_PAPER | 3       | 40827   | 2881.49  | 4767.854      |
| DELICATESSEN     | 3       | 47943   | 1524.87  | 2820.106      |

| REGION       | Frequency |
|--------------|-----------|
| Lisbon       | 77        |
| Oporto       | 47        |
| Other Region | 316       |
| Total        | 440       |

| CHANNEL | Frequency |
|---------|-----------|
| Horeca  | 298       |
| Retail  | 142       |
| Total   | 440       |

##Part 1: Data preparation

###Getting the data

#getwd()
setwd("/Users/joos/Downloads/")
getwd()
## [1] "/Users/joos/Downloads"
dat = read.csv("Wholesale customers data.csv", header = TRUE, sep = ",")
str(dat)
## 'data.frame':    440 obs. of  8 variables:
##  $ Channel         : int  2 2 2 1 2 2 2 2 1 2 ...
##  $ Region          : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Fresh           : int  12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
##  $ Milk            : int  9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
##  $ Grocery         : int  7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
##  $ Frozen          : int  214 1762 2405 6404 3915 666 480 1669 425 1159 ...
##  $ Detergents_Paper: int  2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
##  $ Delicassen      : int  1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...

The variable type of the two categorical variables has to be set to Factor.

###Changing variable type

data.fact=dat
data.fact$Channel = as.factor(dat$Channel)
data.fact$Region = as.factor(dat$Region)
str(data.fact) # verify that Channel and Region are now factors

##Inspecting the data

I will check whether there are missing values, whether the variances are equal, whether the variables are correlated, and whether there are outliers.

table(is.na(data.fact))
## 
## FALSE 
##  3520

This means that there are no missing values.
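As a hypothetical illustration on a small synthetic frame (not the wholesale data), a per-column count of NAs can be more informative than a single overall `table()`:

```r
# Per-column NA counts; 'df' is a made-up example frame.
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
colSums(is.na(df))
## a b
## 1 0
```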

The covariance matrix and the correlation matrix are shown here, in order to see whether a Generative approach seems useful for this data. That is the case when the variables show substantial correlation and covariance.

cov(data.fact[,-(1:2)])
##                      Fresh     Milk  Grocery   Frozen Detergents_Paper
## Fresh            159954927  9381789 -1424713 21236655       -6147825.7
## Milk               9381789 54469967 51083186  4442612       23288343.5
## Grocery           -1424713 51083186 90310104 -1854282       41895189.7
## Frozen            21236655  4442612 -1854282 23567853       -3044324.9
## Detergents_Paper  -6147826 23288343 41895190 -3044325       22732436.0
## Delicassen         8727310  8457925  5507291  5352342         931680.7
##                  Delicassen
## Fresh             8727310.0
## Milk              8457924.8
## Grocery           5507291.3
## Frozen            5352341.8
## Detergents_Paper   931680.7
## Delicassen        7952997.5
cor(data.fact[,-(1:2)])
##                        Fresh      Milk     Grocery      Frozen Detergents_Paper
## Fresh             1.00000000 0.1005098 -0.01185387  0.34588146       -0.1019529
## Milk              0.10050977 1.0000000  0.72833512  0.12399376        0.6618157
## Grocery          -0.01185387 0.7283351  1.00000000 -0.04019274        0.9246407
## Frozen            0.34588146 0.1239938 -0.04019274  1.00000000       -0.1315249
## Detergents_Paper -0.10195294 0.6618157  0.92464069 -0.13152491        1.0000000
## Delicassen        0.24468997 0.4063683  0.20549651  0.39094747        0.0692913
##                  Delicassen
## Fresh             0.2446900
## Milk              0.4063683
## Grocery           0.2054965
## Frozen            0.3909475
## Detergents_Paper  0.0692913
## Delicassen        1.0000000

These two matrices can only be computed for the continuous variables. The strong correlations (for example, 0.92 between Grocery and Detergents_Paper) already suggest that a Generative algorithm could be beneficial.

I will also look at this for the categorical variables.

First, I will look at the number of observations in each category.

table(data.fact$Channel)
## 
##   1   2 
## 298 142
table(data.fact$Region)
## 
##   1   2   3 
##  77  47 316
table(data.fact$Channel,data.fact$Region)
##    
##       1   2   3
##   1  59  28 211
##   2  18  19 105

Region 3 is clearly overrepresented. This could be another argument for using a Generative algorithm.

Another argument can be the variance per category. I will look at this now, using several boxplots.

boxplot(data.fact[,-(1:2)])
title("Overall Boxplot")

obsChnl1 = which(data.fact$Channel == 1)
obsChnl2 = which(data.fact$Channel == 2)

datChnl1 = data.fact[obsChnl1,]
datChnl2 = data.fact[obsChnl2,]

boxplot(datChnl1[,-(1:2)])
title("Boxplot of spending of Horeca clients")

boxplot(datChnl2[,-(1:2)])
title("Boxplot of spending by Retail clients")

obsRgn1 = which(data.fact$Region == 1)
obsRgn2 = which(data.fact$Region == 2)
obsRgn3 = which(data.fact$Region == 3)

datRgn1 = data.fact[obsRgn1,]
datRgn2 = data.fact[obsRgn2,]
datRgn3 = data.fact[obsRgn3,]

boxplot(datRgn1[,-(1:2)])
title("Boxplot of spending in Region 1")

boxplot(datRgn2[,-(1:2)])
title("Boxplot of spending in Region 2")

boxplot(datRgn3[,-(1:2)])
title("Boxplot of Spending in Region 3")

The conclusion of this first part is that there is certainly a reason to look into Generative methods.

##Modeling

Goal of the project: make a clustering of the clients (other than Horeca vs. Retail). This can be used to send personalized publicity / flyers / personalized discounts.

I will apply two methods: k-Means and a Gaussian Mixture Model (GMM). The GMM is expected to perform better, because there is correlation in the data and the variances of the different variables are not equal.

The performance of the models will be assessed by creating a training and a test set. It would be best to assess performance using external indices, such as the efficiency of the publicity budget, but I do not have such data. Therefore I will use some internal indices on the application of the model to the test set.
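As a sketch of one such internal index, the average silhouette width can be computed with the cluster package (used later in this document anyway); shown here on synthetic data rather than the wholesale file:

```r
library(cluster)
set.seed(1)
# Two well-separated synthetic groups, purely for illustration
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 5), ncol = 2))
km  <- kmeans(x, centers = 2)
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])  # close to 1 means compact, well-separated clusters
```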

http://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-without-having-truth-labels

###Splitting the data in training and test

set.seed(12345) #For reproducibility; note set.seed is a function call, not a variable
train.samples = sample(nrow(data.fact), 300)
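
A minimal sketch of how such a split would then be used, on a synthetic stand-in for the data (the CSV is not bundled here; variable names are illustrative):

```r
set.seed(12345)                        # set.seed() is a function call
df <- data.frame(x = rnorm(440))       # stand-in with the same number of rows
train.samples <- sample(nrow(df), 300)
df.train <- df[train.samples, , drop = FALSE]
df.test  <- df[-train.samples, , drop = FALSE]
c(nrow(df.train), nrow(df.test))       # 300 training rows, 140 test rows
```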

###k-Means

First do hierarchical clustering to decide on the k.

d=dist(dat) # dist() needs numeric input, so use the numeric data frame
h.clust=hclust(d)
names(h.clust)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
plot(h.clust)

Following the dendrogram, I will use 8 classes, because there seems to be a larger gap between these splits.
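For reference, `cutree()` turns such a dendrogram into a flat clustering at a chosen k; sketched here on synthetic data rather than the wholesale file:

```r
set.seed(1)
x  <- matrix(rnorm(200), ncol = 2)  # synthetic stand-in
hc <- hclust(dist(x))
cl <- cutree(hc, k = 8)             # cut the tree into 8 clusters
table(cl)                           # sizes of the 8 clusters
```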

I will now do the k-Means.

kmeans.fit=kmeans(dat,8) # kmeans() requires numeric data, so use the numeric data frame

I will make a plot, by using the Principal components.

library(cluster)
clusplot(data.fact, kmeans.fit$cluster, color=TRUE, lines=0)

The fpc package can be used to make a centroid plot against the first two discriminant functions.

I had to remove the categorical variables for this plot.

#install.packages("fpc")
library(fpc)
#getAnywhere(plotcluster)
plotcluster(data.fact[,-c(1:2)], kmeans.fit$cluster)

names(kmeans.fit)
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
kmeans.fit$tot.withinss
## [1] 39739132712
kmeans.fit$betweenss
## [1] 117856724813
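
An elbow plot of `tot.withinss` over a range of k can complement the dendrogram when choosing k; a sketch on synthetic data (not the wholesale file):

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # synthetic stand-in
# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```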

##Gaussian Mixture Model

Now I will fit the Gaussian Mixture Model:

#install.packages("mclust")
library(mclust)
## Package 'mclust' version 5.4.5
## Type 'citation("mclust")' for citing this R package in publications.

Fitting the GMM:

GMM.fit = Mclust(dat) # Mclust() requires numeric data, so use the numeric data frame
summary(GMM.fit)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood   n  df       BIC      ICL
##       -24432.14 440 120 -49594.69 -49597.2
## 
## Clustering table:
##   1   2   3 
## 144 188 108
names(GMM.fit)
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "bic"           
##  [9] "loglik"         "df"             "hypvol"         "parameters"    
## [13] "z"              "classification" "uncertainty"

Seeing the results: three clusters have been made.

GMM.fit$classification
##   [1] 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 2 1 2 1 2 3 1 1 1 2 2 1 2 2 2 2 2 2 1 2
##  [38] 1 1 3 3 2 1 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 2 2 1 1 1 1 2 1 2 1 2 2 2 3 2 1
##  [75] 1 2 2 1 2 2 2 1 1 2 1 1 1 3 2 2 2 2 1 3 1 2 1 2 2 2 1 1 1 3 2 2 1 1 1 1 2
## [112] 1 2 2 2 2 2 2 2 2 2 2 2 1 2 3 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 2 1 2 2
## [149] 2 2 2 2 2 2 2 1 1 2 1 1 1 2 2 1 1 1 1 2 2 2 1 1 2 1 2 1 2 2 2 2 2 3 2 1 2
## [186] 2 2 2 1 1 2 2 2 1 2 2 3 1 3 3 1 1 3 3 3 1 3 1 3 1 3 1 3 3 1 3 1 3 1 3 3 3
## [223] 3 1 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3
## [260] 3 3 3 3 3 1 3 1 3 1 3 3 3 3 2 2 2 2 3 2 1 2 1 2 2 3 2 2 2 2 2 2 2 2 1 3 1
## [297] 3 1 1 3 1 1 1 1 1 1 1 3 3 1 3 3 1 3 3 1 3 3 3 1 3 3 3 3 3 1 3 3 3 3 3 1 3
## [334] 1 1 1 3 3 3 3 1 1 2 1 2 2 1 1 2 1 2 1 2 1 2 2 2 1 3 2 2 2 2 2 2 1 2 2 2 2
## [371] 1 2 2 1 2 2 1 2 2 1 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 3 2 2 2 2 2
## [408] 1 1 2 2 2 2 3 2 1 1 2 1 2 2 1 2 1 1 3 2 3 2 2 3 3 2 2 2 3 2 1 2 2
plot(GMM.fit, what="BIC")

plot(GMM.fit, what="classification")

plot(GMM.fit, what="uncertainty")

plot(GMM.fit, what="density")

For the question of which publicity to send to which cluster of clients, looking at the cluster means gives more information.

GMM.fit$parameters$mean
##                          [,1]         [,2]         [,3]
## Channel              1.986111     1.000016     1.000000
## Region               2.611103     3.000000     1.655336
## Fresh             9263.521035 11640.291550 16283.094075
## Milk             10989.553304  2786.508043  4113.367431
## Grocery          16331.020308  3327.062831  4830.529959
## Frozen            2305.980702  2578.571941  4955.665063
## Detergents_Paper  7179.166684   760.354550   843.925278
## Delicassen        2101.001619  1008.352343  1656.849532
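
A fitted Mclust model can also classify new clients via `predict()`, which would be useful when a new customer signs up; sketched on synthetic two-group data, since reproducing the real GMM.fit needs the CSV:

```r
library(mclust)
set.seed(1)
# Two synthetic groups around (0,0) and (6,6), for illustration only
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
fit <- Mclust(x, G = 2, verbose = FALSE)
new_obs <- matrix(c(6, 6), nrow = 1)   # a hypothetical new client
predict(fit, newdata = new_obs)$classification  # cluster label for the new client
```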

#Conclusion

In this paper, two clustering methods have been applied. The first was k-Means, with k based on hierarchical clustering. The second was a Gaussian Mixture Model. In a practical situation, I would have had more information on which results were more valuable in a company setting, which I do not have here. Since I described the use case of sending several types of publicity, I would choose a solution without too many clusters; this would be an argument to try k-Means again with 4 clusters. However, in this situation there are large differences in sample sizes between the groups of the categorical variables, and at the same time there is considerable correlation between the variables. For these reasons, I ultimately have more faith in the Gaussian Mixture Model, so that solution seems better.