R for Marketing Research and Analytics

Chris Chapman and Elea McDonnell Feit
September 2016

Chapter 11: Segmentation - Clustering and Classification

Website for all data files:
http://r-marketing.r-forge.r-project.org/data.html

Segmentation as Clustering & Classification

Segmentation is a process of finding groups of customers who are similar to one another, are different from other groups, and exhibit differences that are important for the business.

No single method satisfies all three of those requirements at once.

Segmentation requires trying multiple methods and evaluating the results to determine whether they are useful for the business question.

Often, the statistically “best” segmentation is difficult to understand in the business context. A model that is statistically “not as strong” but clear and actionable may be the preferable result.

This chapter gives an overview that demonstrates some common approaches.

Clustering vs Classification

Clustering is the process of finding groups inside data. Key problems include:

  • Determining which variables to use
  • Finding the right number of clusters
  • Ensuring the groups differ in interesting ways
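
The three problems above can be seen in a minimal sketch. It uses base R's kmeans() on synthetic data, not the chapter's data set; the data frame toy.clus, its two simulated groups, and the seed are assumptions made purely for illustration.

```r
# Synthetic data: two hypothetical customer groups (illustration only)
set.seed(4625)
toy.clus <- data.frame(
  age    = c(rnorm(50, mean = 30, sd = 4), rnorm(50, mean = 55, sd = 4)),
  income = c(rnorm(50, mean = 40, sd = 5), rnorm(50, mean = 80, sd = 5))  # in $1000s
)

# Which variables: standardize so no variable dominates the distance
toy.sc <- scale(toy.clus)

# How many clusters: total within-cluster sum of squares for k = 2..6
wss <- sapply(2:6, function(k) {
  kmeans(toy.sc, centers = k, nstart = 10)$tot.withinss
})

# Do the groups differ: mean of each original variable by cluster
toy.k     <- kmeans(toy.sc, centers = 2, nstart = 10)
toy.means <- aggregate(toy.clus, by = list(cluster = toy.k$cluster), FUN = mean)
toy.means
```

A sharp drop in wss that then levels off suggests a candidate number of clusters, but whether the resulting groups are useful remains a business judgment.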

Classification is the process of assigning observations (e.g., customers) to known categories (segments, clusters). Some important concerns are:

  • Predicting better than chance
  • Optimizing for positive vs. negative prediction
  • Generalizing to new data sets
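
As a rough sketch of these concerns, the following fits a classifier on training data and checks it on holdout data. It assumes synthetic data (toy.cls, with a made-up relationship between income and subscription) and uses logistic regression via glm() only because it ships with base R; other classifiers would follow the same train/test pattern.

```r
# Synthetic data: subscription probability rises with income (assumed rule)
set.seed(98040)
n <- 400
toy.cls <- data.frame(income = rnorm(n, mean = 50, sd = 15))  # in $1000s
p.sub   <- plogis((toy.cls$income - 50) / 10)
toy.cls$subscribe <- factor(ifelse(runif(n) < p.sub, "subYes", "subNo"),
                            levels = c("subNo", "subYes"))

# Hold out part of the data to check generalization
train.rows <- sample(n, size = 260)
toy.train  <- toy.cls[train.rows, ]
toy.test   <- toy.cls[-train.rows, ]

# Fit on training data, predict on holdout data
fit        <- glm(subscribe ~ income, data = toy.train, family = binomial)
pred.prob  <- predict(fit, newdata = toy.test, type = "response")
pred.class <- factor(ifelse(pred.prob > 0.5, "subYes", "subNo"),
                     levels = levels(toy.cls$subscribe))

# Positive vs. negative prediction: confusion matrix on the holdout set
conf.mat <- table(predicted = pred.class, actual = toy.test$subscribe)

# Better than chance: compare accuracy with always guessing the larger class
accuracy  <- mean(pred.class == toy.test$subscribe)
base.rate <- max(prop.table(table(toy.test$subscribe)))
conf.mat
c(accuracy = accuracy, base.rate = base.rate)
```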

Example data

seg.raw <- read.csv("http://goo.gl/qw303p")
seg.df  <- seg.raw[ , -7]     # remove the known segment assignments

summary(seg.df)
      age           gender        income            kids        ownHome   
 Min.   :19.26   Female:157   Min.   : -5183   Min.   :0.00   ownNo :159  
 1st Qu.:33.01   Male  :143   1st Qu.: 39656   1st Qu.:0.00   ownYes:141  
 Median :39.49                Median : 52014   Median :1.00               
 Mean   :41.20                Mean   : 50937   Mean   :1.27               
 3rd Qu.:47.90                3rd Qu.: 61403   3rd Qu.:2.00               
 Max.   :80.49                Max.   :114278   Max.   :7.00               
  subscribe  
 subNo :260  
 subYes: 40  
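
The summary shows that seg.df mixes numeric variables (age, income, kids) with categorical ones (gender, ownHome, subscribe). Distance-based methods need numeric input, so one simple option is to recode two-level factors as 0/1. The sketch below does this on a tiny hand-made data frame (toy.mix, a hypothetical stand-in that only mirrors some of the column names); it is one possible approach, not necessarily the one used in the chapter. Alternatives such as Gower distance, available through daisy() in the cluster package, handle mixed types directly.

```r
# Tiny hand-made stand-in for seg.df (values are invented for illustration)
toy.mix <- data.frame(
  age     = c(25.3, 41.0, 63.8),
  gender  = factor(c("Male", "Female", "Female")),
  ownHome = factor(c("ownNo", "ownYes", "ownYes"))
)

# Recode two-level factors as 0/1 indicators
toy.num         <- toy.mix
toy.num$gender  <- ifelse(toy.mix$gender == "Male", 1, 0)
toy.num$ownHome <- ifelse(toy.mix$ownHome == "ownYes", 1, 0)

# Standardize, then compute Euclidean distances between the three rows
toy.dist <- dist(scale(toy.num))
toy.dist
```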




Conclusion: A Few Key Points

Segmentation is not a method, but a process that must focus clearly on the business need and question. Sometimes a “better” model is less useful.

Clustering can help identify potentially interesting groups in the data. The appropriateness of a solution depends on both statistical criteria (fit) and business utility (clarity, ability to target, etc.).

If specific groups are known (e.g., segments or behaviors), classification methods find rules to predict membership in those groups. We saw how to predict likelihood-to-subscribe. Depending on cost & margin, one might target more or fewer customers based on likelihood.
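
The link between likelihood, cost, and margin can be made concrete with a small worked example. All numbers here (the predicted probabilities, the contact cost, and the margin per subscriber) are hypothetical and chosen only to show the arithmetic.

```r
# Hypothetical predicted likelihood-to-subscribe for four customers
p.subscribe  <- c(0.02, 0.10, 0.25, 0.60)

# Hypothetical economics: cost to contact one customer, margin per subscriber
cost.contact <- 5
margin.sub   <- 40

# Expected profit from contacting each customer
expected.profit <- p.subscribe * margin.sub - cost.contact

# Target only customers whose expected profit is positive
target <- expected.profit > 0

# Equivalent rule: target when likelihood exceeds the break-even probability
break.even <- cost.contact / margin.sub
data.frame(p.subscribe, expected.profit, target)
```

With these numbers the break-even likelihood is 5 / 40 = 0.125. A lower contact cost or a higher margin lowers that threshold, so more customers are worth targeting.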

Important considerations for classification include performance on holdout data, generalization to new data sets, and avoiding class imbalance problems.
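
Class imbalance is easy to illustrate with the subscribe counts from the example data (260 subNo, 40 subYes). The sketch below rebuilds only those counts rather than loading the data; the "always predict subNo" rule and the downsampling step are illustrative assumptions, and downsampling is just one of several ways to balance classes.

```r
# Rebuild the class counts reported for subscribe (260 subNo, 40 subYes)
actual <- factor(rep(c("subNo", "subYes"), times = c(260, 40)))

# A naive rule that always predicts the majority class
naive.pred <- factor(rep("subNo", length(actual)), levels = levels(actual))

# Accuracy looks high, yet no subscriber is ever identified
naive.acc   <- mean(naive.pred == actual)
sensitivity <- sum(naive.pred == "subYes" & actual == "subYes") /
               sum(actual == "subYes")

# One remedy: downsample the majority class to match the minority class
set.seed(3141)
idx.yes  <- which(actual == "subYes")
idx.no   <- sample(which(actual == "subNo"), size = length(idx.yes))
balanced <- c(idx.no, idx.yes)
table(actual[balanced])
```

The naive rule is right 260 / 300 of the time (about 87%) while finding none of the subscribers, which is why accuracy alone can mislead on imbalanced data.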

Suggested Readings

James, Witten, Hastie, & Tibshirani (2013). An Introduction to Statistical Learning, with Applications in R. New York: Springer.

  • An excellent overview of a wide variety of statistical approaches to learning and classification.

Kuhn & Johnson (2013). Applied Predictive Modeling. New York: Springer.

  • A detailed examination of how to build regression and classification models that generalize. It is focused on practical application and overcoming many typical problems of such projects. There is a superb related package (“caret”).

Wedel & Kamakura (2000). Market Segmentation: Conceptual and Methodological Foundations. New York: Springer.

  • Discusses a wide variety of approaches to segmentation for marketing applications (not specific to R).

Notes

This presentation is based on Chapter 11 of Chapman and Feit, R for Marketing Research and Analytics © 2015 Springer.

All code in the presentation is licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.