Chris Chapman and Elea McDonnell Feit

September 2016

**Chapter 11: Segmentation - Clustering and Classification**

Website for all data files:

http://r-marketing.r-forge.r-project.org/data.html

Segmentation is a process of finding groups of customers who are **similar to one another**, are **different from other groups**, and exhibit differences that are **important for the business**.

There is **no magic method** to solve all three of those requirements simultaneously.

Segmentation requires trying **multiple methods** and evaluating the results
to determine whether they are useful for the business question.

It often occurs that the statistically “best” segmentation is difficult to understand in the business context. A model that is statistically “not as strong” – but is clear and actionable – may be a preferable result.

In this chapter, we give an overview of methods to demonstrate some common approaches.

Clustering is the process of *finding* groups inside data. Key problems include:

- Determining which variables to use
- Finding the right number of clusters
- Ensuring the groups differ in interesting ways

Classification is the process of *assigning* observations (e.g., customers) to known categories
(segments, clusters). Some important concerns are:

- Predicting better than chance
- Optimizing for positive vs. negative prediction
- Generalizing to new data sets
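To make the first concern concrete, here is a small hedged sketch on toy data (hypothetical labels, not from the chapter's dataset): a classifier's raw accuracy should be compared to the chance rate obtained by always predicting the most common class.

```r
# Toy illustration: accuracy must be judged against the chance rate
actual    <- factor(c("yes", "no", "no", "no", "yes", "no", "no", "no"))
predicted <- factor(c("yes", "no", "no", "yes", "no", "no", "no", "no"),
                    levels = levels(actual))

accuracy    <- mean(predicted == actual)            # 0.75
chance.rate <- max(table(actual)) / length(actual)  # 0.75: always guess "no"
accuracy > chance.rate                              # FALSE: no better than chance
```

Here 75% accuracy sounds respectable but is no better than always guessing "no", which is why chance-corrected measures matter when classes are imbalanced.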

```
seg.raw <- read.csv("http://goo.gl/qw303p")
seg.df <- seg.raw[ , -7] # remove the known segment assignments
summary(seg.df)
```

```
      age            gender        income            kids        ownHome
 Min.   :19.26   Female:157   Min.   : -5183   Min.   :0.00   ownNo :159
 1st Qu.:33.01   Male  :143   1st Qu.: 39656   1st Qu.:0.00   ownYes:141
 Median :39.49                Median : 52014   Median :1.00
 Mean   :41.20                Mean   : 50937   Mean   :1.27
 3rd Qu.:47.90                3rd Qu.: 61403   3rd Qu.:2.00
 Max.   :80.49                Max.   :114278   Max.   :7.00
   subscribe
 subNo :260
 subYes: 40
```

We create a simple function to look at mean values by group. Because `as.numeric()` coerces factors to their level codes, the "mean" of a factor such as `gender` (1 = Female, 2 = Male) is just the average of those codes. (This is a placeholder for a more complex evaluation of an interpretable business outcome.)

```
seg.summ <- function(data, groups) {
  aggregate(data, list(groups), function(x) mean(as.numeric(x)))
}
seg.summ(seg.df, seg.raw$Segment)
```

```
     Group.1      age gender   income     kids  ownHome subscribe
1  Moving up 36.33114   1.30 53090.97 1.914286 1.328571     1.200
2 Suburb mix 39.92815   1.52 55033.82 1.920000 1.480000     1.060
3  Travelers 57.87088   1.50 62213.94 0.000000 1.750000     1.125
4  Urban hip 23.88459   1.60 21681.93 1.100000 1.200000     1.200
```

Clustering methods work by looking at some measure of the *distance* between observations. They try to
find groups whose members are close to one another (and far from others).

A common metric is *Euclidean* distance: the square root of the sum of squared differences. We could compute it manually:

```
c(1,2,3) - c(2,3,2)
```

```
[1] -1 -1 1
```

```
sum((c(1,2,3) - c(2,3,2))^2)
```

```
[1] 3
```

```
sqrt(sum((c(1,2,3) - c(2,3,2))^2))
```

```
[1] 1.732051
```

Note that `dist()` computes distances between *observations* (rows of a matrix or data frame) and returns the distances for all pairs of rows (in this case, just one pair). By default, `dist()` uses Euclidean distance, matching the manual computation above:

```
dist(rbind(c(1,2,3), c(2,3,2)))
```

```
1
2 1.732051
```

For mixed data types (e.g., continuous, binary, ordinal), `dist()` may not be appropriate because of the large implied scale differences. `daisy()` from the `cluster` package is an alternative that automatically rescales, using Gower's distance when the data are mixed.

```
library(cluster)
seg.dist <- daisy(seg.df) # daisy works with mixed data types
as.matrix(seg.dist)[1:4, 1:4] # distances of first 4 observations
```

```
1 2 3 4
1 0.0000000 0.2532815 0.2329028 0.2617250
2 0.2532815 0.0000000 0.0679978 0.4129493
3 0.2329028 0.0679978 0.0000000 0.4246012
4 0.2617250 0.4129493 0.4246012 0.0000000
```
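To see why rescaling matters, consider a small hedged sketch on toy data (hypothetical values, not the chapter's dataset). With raw `dist()`, a variable measured in tens of thousands, like income, swamps a variable measured in single digits, like number of kids; standardizing with `scale()` puts them on comparable footing:

```r
# Toy illustration of scale dominance in raw Euclidean distance:
# income is orders of magnitude larger than kids, so it dominates dist()
toy <- data.frame(income = c(20000, 21000, 60000),
                  kids   = c(0,     5,     0))

dist(toy)         # distances driven almost entirely by income
dist(scale(toy))  # after standardizing, kids contributes comparably
```

On the raw data, rows 1 and 2 look nearly identical despite a five-kid difference; after standardizing, that difference carries real weight, which is the same problem `daisy()` addresses automatically.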

Hierarchical clustering combines closest neighbors (defined in various ways) into progressively larger groups. In R, we first compute distances (previous slide) and then cluster those:

```
seg.hc <- hclust(seg.dist, method="complete")
```

Plot the result to see a tree of the solution:

```
plot(seg.hc)
```

We can cut the tree at a particular height and plot the portion above or below the cut. In this case, we cut at a height of `0.5` and then plot the first (`$lower[[1]]`) of the resulting subtrees:

```
plot(cut(as.dendrogram(seg.hc), h=0.5)$lower[[1]])
```

From the previous tree, we select observations from close and far branches:

```
seg.df[c(101, 107), ] # similar
```

```
age gender income kids ownHome subscribe
101 24.73796 Male 18457.85 1 ownNo subYes
107 23.19013 Male 17510.28 1 ownNo subYes
```

```
seg.df[c(278, 294), ] # similar
```

```
age gender income kids ownHome subscribe
278 36.23860 Female 46540.88 1 ownNo subYes
294 35.79961 Female 52352.69 1 ownNo subYes
```

```
seg.df[c(173, 141), ] # less similar
```

```
age gender income kids ownHome subscribe
173 64.70641 Male 45517.15 0 ownNo subYes
141 25.17703 Female 20125.80 2 ownNo subYes
```

The cophenetic correlation coefficient is a measure of how well the clustering model (expressed in the dendrogram) reflects the distance matrix.

```
cor(cophenetic(seg.hc), seg.dist)
```

```
[1] 0.7682436
```
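The cophenetic correlation can also be used to compare linkage methods. As a hedged sketch on toy data (hypothetical, not the chapter's dataset) — values above roughly 0.7, as here, are commonly read as a reasonably faithful dendrogram:

```r
# Toy comparison: which linkage method best preserves the distance matrix?
set.seed(100)
toy      <- matrix(rnorm(40), ncol = 2)   # 20 hypothetical observations
toy.dist <- dist(toy)

sapply(c("complete", "average", "single"), function(m) {
  cor(cophenetic(hclust(toy.dist, method = m)), toy.dist)
})
```

The method with the highest correlation reproduces the original pairwise distances most faithfully, though interpretability of the resulting groups still matters more for segmentation.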

To get K groups, read from the top of the dendrogram down until there are K branches. `rect.hclust()` shows where the tree would be cut for K groups:

```
plot(seg.hc)
rect.hclust(seg.hc, k=4, border="red")
```

Get a vector of class (cluster) assignment:

```
seg.hc.segment <- cutree(seg.hc, k=4) # membership vector for 4 groups
table(seg.hc.segment)
```

```
seg.hc.segment
1 2 3 4
124 136 18 22
```

Compare them with our quick function:

```
seg.summ(seg.df, seg.hc.segment)
```

```
  Group.1      age   gender   income     kids  ownHome subscribe
1       1 40.78456 2.000000 49454.08 1.314516 1.467742         1
2       2 42.03492 1.000000 53759.62 1.235294 1.477941         1
3       3 44.31194 1.388889 52628.42 1.388889 2.000000         2
4       4 35.82935 1.545455 40456.14 1.136364 1.000000         2
```

```
plot(jitter(as.numeric(seg.df$gender)) ~
       jitter(as.numeric(seg.df$subscribe)),
     col=seg.hc.segment, yaxt="n", xaxt="n", ylab="", xlab="")
axis(1, at=c(1, 2), labels=c("Subscribe: No", "Subscribe: Yes"))
axis(2, at=c(1, 2), labels=levels(seg.df$gender))
```