R for Marketing Research and Analytics

Chris Chapman and Elea McDonnell Feit
September 2016

Chapter 11: Segmentation - Clustering and Classification

Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html

Segmentation as Clustering & Classification

Segmentation is a process of finding groups of customers who are similar to one another, are different from other groups, and exhibit differences that are important for the business.

There is no magic method to solve all three of those requirements simultaneously.

Segmentation requires trying multiple methods and evaluating the results to determine whether they are useful for the business question.

It often happens that the statistically “best” segmentation is difficult to interpret in the business context. A model that is statistically weaker, but clear and actionable, may be the preferable result.

In this chapter, we demonstrate some common approaches.

Clustering vs Classification

Clustering is the process of finding groups inside data. Key problems include:

  • Determining which variables to use
  • Finding the right number of clusters
  • Ensuring the groups differ in interesting ways

Classification is the process of assigning observations (e.g., customers) to known categories (segments, clusters). Some important concerns are:

  • Predicting better than chance
  • Optimizing for positive vs. negative prediction
  • Generalizing to new data sets
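
To make “predicting better than chance” concrete, here is a minimal sketch of a majority-class baseline on a holdout set. It uses the seg.raw data introduced in the next section; the split proportion and random seed are arbitrary choices:

seg.raw <- read.csv("http://goo.gl/qw303p")
set.seed(98109)                       # arbitrary seed, for reproducibility
train.prop <- 0.65                    # arbitrary train/test split
train.idx  <- sample(nrow(seg.raw), size=floor(train.prop * nrow(seg.raw)))
seg.train  <- seg.raw[train.idx, ]
seg.test   <- seg.raw[-train.idx, ]
# "chance" here: always predict the most common segment in the training data
majority <- names(which.max(table(seg.train$Segment)))
mean(seg.test$Segment == majority)    # accuracy a real classifier must beat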

Example data

seg.raw <- read.csv("http://goo.gl/qw303p")
seg.df  <- seg.raw[ , -7]     # remove the known segment assignments

summary(seg.df)
      age           gender        income            kids        ownHome   
 Min.   :19.26   Female:157   Min.   : -5183   Min.   :0.00   ownNo :159  
 1st Qu.:33.01   Male  :143   1st Qu.: 39656   1st Qu.:0.00   ownYes:141  
 Median :39.49                Median : 52014   Median :1.00               
 Mean   :41.20                Mean   : 50937   Mean   :1.27               
 3rd Qu.:47.90                3rd Qu.: 61403   3rd Qu.:2.00               
 Max.   :80.49                Max.   :114278   Max.   :7.00               
  subscribe  
 subNo :260  
 subYes: 40  

Group differences

We create a simple function to look at mean values by group. (This is a placeholder for a more complex evaluation of an interpretable business outcome.)

seg.summ <- function(data, groups) {
  # mean of each variable by group; factors are coerced to their numeric codes
  aggregate(data, list(groups), function(x) mean(as.numeric(x)))
}

seg.summ(seg.df, seg.raw$Segment)
     Group.1      age gender   income     kids  ownHome subscribe
1  Moving up 36.33114   1.30 53090.97 1.914286 1.328571     1.200
2 Suburb mix 39.92815   1.52 55033.82 1.920000 1.480000     1.060
3  Travelers 57.87088   1.50 62213.94 0.000000 1.750000     1.125
4  Urban hip 23.88459   1.60 21681.93 1.100000 1.200000     1.200

Distance

Clustering methods work by looking at some measure of the distance between observations. They try to find groups whose members are close to one another (and far from others).

A common metric is Euclidean distance, the square root of the sum of squared differences. We could compute it manually:

c(1,2,3) - c(2,3,2)
[1] -1 -1  1
sum((c(1,2,3) - c(2,3,2))^2)
[1] 3
sqrt(sum((c(1,2,3) - c(2,3,2))^2))
[1] 1.732051

Note that distance is computed between observations; for a data set, the result is a matrix of distances between all pairs (in this case, just one pair).

dist()

dist() computes Euclidean distance:

sqrt(sum((c(1,2,3) - c(2,3,2))^2))
[1] 1.732051
dist(rbind(c(1,2,3), c(2,3,2)))
         1
2 1.732051

With mixed data types (e.g., continuous, binary, ordinal), dist() may not be appropriate because of the large implied scale differences among the variables. daisy() from the cluster package is an alternative that rescales automatically.

library(cluster)                  
seg.dist <- daisy(seg.df)       # daisy works with mixed data types
as.matrix(seg.dist)[1:4, 1:4]   # distances of first 4 observations
          1         2         3         4
1 0.0000000 0.2532815 0.2329028 0.2617250
2 0.2532815 0.0000000 0.0679978 0.4129493
3 0.2329028 0.0679978 0.0000000 0.4246012
4 0.2617250 0.4129493 0.4246012 0.0000000

Hierarchical Clustering

Hierarchical clustering combines closest neighbors (defined in various ways) into progressively larger groups. In R, we first compute distances (previous slide) and then cluster those:

seg.hc <- hclust(seg.dist, method="complete")
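
“Defined in various ways” refers to the linkage method: how the distance between groups is measured when merging. hclust() offers several; a quick sketch of alternatives (see ?hclust for the full list):

seg.hc.single  <- hclust(seg.dist, method="single")    # nearest-neighbor linkage
seg.hc.average <- hclust(seg.dist, method="average")   # mean pairwise distance
seg.hc.ward    <- hclust(seg.dist, method="ward.D2")   # minimum-variance criterion

We continue with the complete-linkage solution, seg.hc.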

Plot the result to see a tree of the solution:

plot(seg.hc)

[Figure: dendrogram of the full hclust solution]

Examining Similarities

We can cut the tree at a particular height and plot the branches above or below the cut. Here we cut at a height of 0.5, then plot the first ($lower[[1]]) of the resulting subtrees below the cut:

plot(cut(as.dendrogram(seg.hc), h=0.5)$lower[[1]])

[Figure: first subtree below the h=0.5 cut]

Comparing observations in branches

From the previous tree, we select observations from close and far branches:

seg.df[c(101, 107), ]  # similar
         age gender   income kids ownHome subscribe
101 24.73796   Male 18457.85    1   ownNo    subYes
107 23.19013   Male 17510.28    1   ownNo    subYes
seg.df[c(278, 294), ]  # similar
         age gender   income kids ownHome subscribe
278 36.23860 Female 46540.88    1   ownNo    subYes
294 35.79961 Female 52352.69    1   ownNo    subYes
seg.df[c(173, 141), ]  # less similar
         age gender   income kids ownHome subscribe
173 64.70641   Male 45517.15    0   ownNo    subYes
141 25.17703 Female 20125.80    2   ownNo    subYes

Comparing the dendrogram to the distance matrix

The cophenetic correlation coefficient is a measure of how well the clustering model (expressed in the dendrogram) reflects the distance matrix.

cor(cophenetic(seg.hc), seg.dist)
[1] 0.7682436
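
Because it is a single number, the cophenetic correlation also gives a quick way to compare linkage methods; a sketch (method names from ?hclust):

# cophenetic correlation for several linkage methods
for (m in c("single", "complete", "average", "ward.D2")) {
  hc.m <- hclust(seg.dist, method=m)
  cat(m, ":", round(cor(cophenetic(hc.m), seg.dist), 4), "\n")
}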

Getting K groups from the tree

To get K groups, read from the top of the dendrogram until there are K branches.

rect.hclust() shows where the tree would be cut for K groups:

plot(seg.hc)
rect.hclust(seg.hc, k=4, border="red")

[Figure: dendrogram with red rectangles marking the K=4 groups]

Getting segment membership from hclust

Get a vector of class (cluster) assignments:

seg.hc.segment <- cutree(seg.hc, k=4)     # membership vector for 4 groups
table(seg.hc.segment)
seg.hc.segment
  1   2   3   4 
124 136  18  22 

Compare the groups with our quick summary function:

seg.summ(seg.df, seg.hc.segment)
  Group.1      age   gender   income     kids  ownHome subscribe
1       1 40.78456 2.000000 49454.08 1.314516 1.467742         1
2       2 42.03492 1.000000 53759.62 1.235294 1.477941         1
3       3 44.31194 1.388889 52628.42 1.388889 2.000000         2
4       4 35.82935 1.545455 40456.14 1.136364 1.000000         2
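
Because seg.raw retains the known segment assignments that we removed from seg.df, one additional check (possible here only because the true segments are known) is to cross-tabulate them against the hclust groups:

# how do the 4 hclust groups line up with the known segments?
table(seg.hc.segment, seg.raw$Segment)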

Is the result interesting?

# gender by subscription status, jittered to reduce overplotting,
# with points colored by hclust segment
plot(jitter(as.numeric(seg.df$gender)) ~ 
     jitter(as.numeric(seg.df$subscribe)), 
     col=seg.hc.segment, yaxt="n", xaxt="n", ylab="", xlab="")
axis(1, at=c(1, 2), labels=c("Subscribe: No", "Subscribe: Yes"))
axis(2, at=c(1, 2), labels=levels(seg.df$gender))
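
The points are colored by cluster number using R’s default palette; a legend makes the mapping explicit (a small sketch; the placement is an arbitrary choice):

# label the cluster colors (default palette, in cluster order)
legend("topright", legend=paste("Cluster", 1:4), col=1:4, pch=1)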