Chris Chapman and Elea McDonnell Feit
September 2016
Chapter 11: Segmentation - Clustering and Classification
Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html
Segmentation is a process of finding groups of customers who are similar to one another, are different from other groups, and exhibit differences that are important for the business.
There is no magic method to solve all three of those requirements simultaneously.
Segmentation requires trying multiple methods and evaluating the results to determine whether they are useful for the business question.
The statistically “best” segmentation is often difficult to understand in the business context; a model that is statistically weaker but clear and actionable may be preferable.
In this chapter, we give an overview of methods to demonstrate some common approaches.
Clustering is the process of finding groups inside data without using pre-existing labels.
Classification is the process of assigning observations (e.g., customers) to known categories (segments or clusters).
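As a minimal, hypothetical sketch of the distinction (not from the original text), classification could be as simple as assigning a new observation to the nearest group mean (nearest-centroid assignment). The group centers and the new observation below are invented values used only for illustration:
# Hypothetical nearest-centroid sketch: the centers (age, income) and the
# new observation are made-up values, not estimated from any data here
centers <- rbind(c(24, 21000), c(40, 53000), c(58, 62000))
new.obs <- c(35, 48000)
dists   <- apply(centers, 1, function(ctr) sqrt(sum((new.obs - ctr)^2)))
which.min(dists)   # index of the closest group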
seg.raw <- read.csv("http://goo.gl/qw303p")
seg.df <- seg.raw[ , -7] # remove the known segment assignments
summary(seg.df)
age gender income kids ownHome
Min. :19.26 Female:157 Min. : -5183 Min. :0.00 ownNo :159
1st Qu.:33.01 Male :143 1st Qu.: 39656 1st Qu.:0.00 ownYes:141
Median :39.49 Median : 52014 Median :1.00
Mean :41.20 Mean : 50937 Mean :1.27
3rd Qu.:47.90 3rd Qu.: 61403 3rd Qu.:2.00
Max. :80.49 Max. :114278 Max. :7.00
subscribe
subNo :260
subYes: 40
We create a simple function to look at mean values by group. (This is a placeholder for a more complex evaluation of an interpretable business outcome.) Because it coerces factors to numeric, factor columns such as gender and ownHome report the mean of the underlying level codes rather than a true mean.
seg.summ <- function(data, groups) {
aggregate(data, list(groups), function(x) mean(as.numeric(x)))
}
seg.summ(seg.df, seg.raw$Segment)
Group.1 age gender income kids ownHome subscribe
1 Moving up 36.33114 1.30 53090.97 1.914286 1.328571 1.200
2 Suburb mix 39.92815 1.52 55033.82 1.920000 1.480000 1.060
3 Travelers 57.87088 1.50 62213.94 0.000000 1.750000 1.125
4 Urban hip 23.88459 1.60 21681.93 1.100000 1.200000 1.200
Clustering methods work by looking at some measure of the distance between observations. They try to find groups whose members are close to one another (and far from others).
A common metric is Euclidean distance, the square root of the sum of squared differences. We could compute it manually:
c(1,2,3) - c(2,3,2)
[1] -1 -1 1
sum((c(1,2,3) - c(2,3,2))^2)
[1] 3
sqrt(sum((c(1,2,3) - c(2,3,2))^2))
[1] 1.732051
Note that distances are computed between observations; the result is a matrix of distances between all pairs (in this case, just one pair). dist() computes Euclidean distance:
dist(rbind(c(1,2,3), c(2,3,2)))
1
2 1.732051
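To see the all-pairs behavior (a small sketch, not from the original text), add a third made-up observation; dist() then returns the lower triangle of all pairwise distances:
dist(rbind(c(1,2,3), c(2,3,2), c(0,1,5)))   # distances for every pair of rows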
In the case of mixed data types (e.g., continuous, binary, ordinal), dist() may not be appropriate because of large implied differences in scale among the variables. daisy() (from the cluster package) is an alternative that rescales automatically.
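To see the scale problem concretely (a sketch, not from the original text, using only the numeric columns age and income and just four rows for illustration), the raw Euclidean distance is dominated by income, while standardizing with scale() puts the variables on a comparable footing:
dist(seg.df[1:4, c("age", "income")])          # dominated by income differences
dist(scale(seg.df[1:4, c("age", "income")]))   # columns standardized first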
library(cluster)
seg.dist <- daisy(seg.df) # daisy works with mixed data types
as.matrix(seg.dist)[1:4, 1:4] # distances of first 4 observations
1 2 3 4
1 0.0000000 0.2532815 0.2329028 0.2617250
2 0.2532815 0.0000000 0.0679978 0.4129493
3 0.2329028 0.0679978 0.0000000 0.4246012
4 0.2617250 0.4129493 0.4246012 0.0000000
Hierarchical clustering combines the closest neighbors (with “closest” defined in various ways by the linkage method) into progressively larger groups. In R, we first compute distances (previous slide) and then cluster those:
seg.hc <- hclust(seg.dist, method="complete")
Plot the result to see a tree of the solution:
plot(seg.hc)
We can cut the tree at a particular height and plot the portion above or below the cut. In this case, we cut at a height of 0.5 and plot the first of the resulting subtrees below that cut ($lower[[1]]):
plot(cut(as.dendrogram(seg.hc), h=0.5)$lower[[1]])
From the previous tree, we select observations from close and far branches:
seg.df[c(101, 107), ] # similar
age gender income kids ownHome subscribe
101 24.73796 Male 18457.85 1 ownNo subYes
107 23.19013 Male 17510.28 1 ownNo subYes
seg.df[c(278, 294), ] # similar
age gender income kids ownHome subscribe
278 36.23860 Female 46540.88 1 ownNo subYes
294 35.79961 Female 52352.69 1 ownNo subYes
seg.df[c(173, 141), ] # less similar
age gender income kids ownHome subscribe
173 64.70641 Male 45517.15 0 ownNo subYes
141 25.17703 Female 20125.80 2 ownNo subYes
The cophenetic correlation coefficient is a measure of how well the clustering model (expressed in the dendrogram) reflects the distance matrix.
cor(cophenetic(seg.hc), seg.dist)
[1] 0.7682436
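The same check can be used to compare linkage methods (a sketch, not part of the original text); a higher cophenetic correlation means the dendrogram preserves the original distances more faithfully:
seg.hc.avg <- hclust(seg.dist, method="average")   # alternative linkage
cor(cophenetic(seg.hc.avg), seg.dist)              # compare with the value above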
To get K groups, read from the top of the dendrogram until there are K branches.
rect.hclust() shows where the tree would be cut for K groups:
plot(seg.hc)
rect.hclust(seg.hc, k=4, border="red")
Get a vector of class (cluster) assignment:
seg.hc.segment <- cutree(seg.hc, k=4) # membership vector for 4 groups
table(seg.hc.segment)
seg.hc.segment
1 2 3 4
124 136 18 22
Compare the resulting groups with our quick summary function:
seg.summ(seg.df, seg.hc.segment)
Group.1 age gender income kids ownHome subscribe
1 1 40.78456 2.000000 49454.08 1.314516 1.467742 1
2 2 42.03492 1.000000 53759.62 1.235294 1.477941 1
3 3 44.31194 1.388889 52628.42 1.388889 2.000000 2
4 4 35.82935 1.545455 40456.14 1.136364 1.000000 2
Plot the groups against gender and subscription status:
# plot gender vs. subscription status, jittered so overlapping points are visible,
# with points colored by hierarchical cluster membership
plot(jitter(as.numeric(seg.df$gender)) ~
     jitter(as.numeric(seg.df$subscribe)),
     col=seg.hc.segment, yaxt="n", xaxt="n", ylab="", xlab="")
# label the axes with the underlying factor levels
axis(1, at=c(1, 2), labels=c("Subscribe: No", "Subscribe: Yes"))
axis(2, at=c(1, 2), labels=levels(seg.df$gender))
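As one more check (a sketch, not from the original text), cross-tabulating the assigned groups against subscription status shows how strongly this solution tracks that single variable:
table(seg.hc.segment, seg.df$subscribe)   # cluster membership vs. subscription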