choicetools: a package for conjoint analysis and best-worst surveys

Chris Chapman (Chrome OS) & Eric Bahna (Android Auto), Google
July 10, 2019

Slides: http://bit.ly/2RO51fq [2 Romeo Oscar 51 foxtrot quebec]

Overview

  • Basics of Conjoint Analysis surveys for product preference
  • Creating a survey and example data
  • Modeling respondents' preferences
  • Additional features in choicetools package

Package: In development, v 0.9073

library(devtools)
install_github("cnchapman/choicetools")

Choice-Based Conjoint Analysis

In a Choice-Based Conjoint (CBC) survey, respondents choose among products. A product has attributes such as brand, features, and price, and levels, such as brand names and specific prices.

CBC question

Hypothetical Product: USB Drive

We imagine a USB flash drive with five attributes. Each attribute has 2-5 levels (brand name, price, etc.)

cbc.attrs     <- c(Brand=4, Style=4, Price=5, Color=2, Size=4)
cbc.levels    <- c("Alpha", "Bravo", "Charlie", "Delta",    # Brand
                   "Shiny", "Flat",  "Sparkly", "Odd",      # Style
                   "$9",  "$14",  "$19",  "$24",  "$29",    # Price
                   "Blue",  "Red",                          # Color
                   "64GB", "256GB", "512GB", "1TB")         # Size

Given choices among products with randomized attributes, we can model the contribution of each feature (multinomial/conditional logit model). Conceptually:

\[ p(choice | product) \propto preference(product) \]

\[ preference(product) \propto \sum{preference(attributes)} \]

Study Setup

Each respondent answers multiple choices (tasks). We set up the study to ask 12 choices with 3 products on each. The randomized design matrix will have N=400 versions of the 12-task survey:

set.seed(98103)

cbc.tasks     <- 12   # trials per respondent
cbc.concepts  <- 3    # cards per trial
N             <- 400  # N of respondents

Design Matrix

Levels should appear roughly the same number of times with every other level. generateMNLrandomTab() does so:

cbc.tab <- generateMNLrandomTab(cbc.attrs, respondents=N,
                                cards=cbc.concepts, trials=cbc.tasks )
#> Searching for a balanced design ...
#> Improved design found on trial:  8  SSE =  7.375579e-05 
#> Improved design found on trial:  17  SSE =  5.564236e-05 
#> Improved design found on trial:  35  SSE =  5.005787e-05 
#> Improved design found on trial:  37  SSE =  4.819637e-05 
#> Improved design found on trial:  47  SSE =  4.241898e-05
knitr::kable(head(cbc.tab, 3))  # first choice trial, 3 products
Brand Style Price Color Size
2 3 2 2 2
1 2 3 1 4
3 1 5 2 1

(Usually design comes from a survey authoring platform.)

Dummy Coded Design Matrix

We can convert the layout to a dummy coded matrix:

cbc.des <- convertSSItoDesign(cbc.tab)     # dummy coded matrix
knitr::kable(head(cbc.des, 3))
Brand-1 Brand-2 Brand-3 Brand-4 Style-1 Style-2 Style-3 Style-4 Price-1 Price-2 Price-3 Price-4 Price-5 Color-1 Color-2 Size-1 Size-2 Size-3 Size-4
0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0
1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1
0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0

The first product (first row) has Brand 2, Style 3, Price level 2, and so forth.

Survey Output as CSV

Given a design, writeCBCdesignCSV() produces a minimal “survey” in CSV format. This is easy to use in a classroom.

To ensure CSV data match the design, digest (Eddelbuettel et al, 2018) adds a hash value for the design.

writeCBCdesignCSV(head(cbc.tab, 3), cards=cbc.concepts, trials=1,
                  attr.list=cbc.attrs, lab.attrs=names(cbc.attrs),
                  lab.levels = cbc.levels)
#> ##############################
#> CBC response file for design: 1d029ebfab00967d76d9a00dad28dc43
#> 
#> ##############################
#> Respondent 1 
#> 
#> TRIAL: 1
#>       1       2       3  
#> Brand:    Bravo   Alpha   Charlie
#> Style:    Sparkly     Flat    Shiny
#> Price:    $14     $19     $29
#> Color:    Red     Blue    Red
#> Size:     256GB   1TB     64GB
#> 
#> CHOICE for Trial 1:

Creating Simulated Preference Data

The simplest model is a multinomial logit (MNL). Coefficients are part worths, which sum to 0 across attribute levels. generateRNDpws() simulates a single vector of part worths pickMNLwinningCards() finds a winner for each task:

cbc.pws <- generateRNDpws(cbc.attrs)    # make up some zero-sum part worths
cbc.win <- pickMNLwinningCards(cbc.des, cbc.pws)  # winning cards with them
#> Processing trial:  2000 
#> Processing trial:  4000
knitr::kable(head(cbind(cbc.win, cbc.des), 3))
cbc.win Brand-1 Brand-2 Brand-3 Brand-4 Style-1 Style-2 Style-3 Style-4 Price-1 Price-2 Price-3 Price-4 Price-5 Color-1 Color-2 Size-1 Size-2 Size-3 Size-4
0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0
0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1
1 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0

In this task the 3rd concept (Brand 3, etc.) is preferred and selected (cbc.win).

Aggregate Multinomial Logit Model

The MNL model estimates preference for a product as proportional to the sum of the utility values for its features. Preference for A vs. B is the proportion of total utility represented by A:

\[ pref(A | \{A, B\}) = \frac{pref(A)}{pref(A)+pref(B)} \]

Utility is the exponentiated sum of the features' part worth coefficients, i.e., a product's logit value:

\[ \frac{pref(A)}{pref(A)+pref(B)} = \frac{e^{\sum{PW_A}}}{e^{\sum{PW_A}}+e^{\sum{PW_B}}} \]

(For simplicity, we omit intercepts and error terms.)

Aggregate MNL Estimation

Estimate part worths based on “observed” choices, and plot them vs. original values. There is near perfect recovery:

cbc.mnl <- estimateMNLfromDesign(cbc.des, cbc.win, cards=cbc.concepts)
plot(cbc.pws, cbc.mnl)

plot of chunk unnamed-chunk-8

Write CSV Structure and Read Choices from It

It is more interesting to collect real data. Usually one would use survey authoring tools such as Sawtooth Software or Qualtrics. These platforms display CBC tasks in a suitable format for respondents.

For this vignette, we write out a CSV file. We later read the data as if responses had been given there.

csv.filename <- "~/Downloads/testCBC.csv"
writeCBCdesignCSV(cbc.tab, filename=csv.filename,   # file "" for console
                  cards=cbc.concepts, trials=cbc.tasks,
                  attr.list=cbc.attrs, lab.attrs=names(cbc.attrs),
                  lab.levels = cbc.levels, overwrite=TRUE)

CSV with Random Choices

If respondents had filled in choices, we could estimate the MNL model. Instead, we make random choices and rewrite the CSV:

# read the CSV
csvfile.in  <- readLines(csv.filename)
# make random choices (1, 2, or 3)
lines.with.choices <- which(grepl("CHOICE for Trial [0-9]+", csvfile.in))
csvfile.in[lines.with.choices] <- paste(csvfile.in[lines.with.choices],
                                        sample(cbc.concepts,
                                               length(lines.with.choices),
                                               replace = TRUE))
writeLines(csvfile.in, con=csv.filename)

Read the CSV using readCBCchoices(), including the design matrix (cbc.tab):

# get those choices
cbc.choices <- readCBCchoices(cbc.tab, filename=csv.filename,
                              cards=cbc.concepts, trials=cbc.tasks,
                              verbose=FALSE)

Estimation of Random Choice Data

estimateMNLfromDesign() estimates part worths. It is mainly for teaching and uses gradient descent. HB (next slide) is for production. For random data, part worths do not correspond to original values:

cbc.mnl2 <- estimateMNLfromDesign(cbc.des, cbc.choices, cards=cbc.concepts, no.output = TRUE)
plot(cbc.pws, cbc.mnl2); abline(h=0)

plot of chunk unnamed-chunk-12

Hierarchical Bayes Estimation

Hierachical Bayes (HB) estimates a mixed effects model. The upper level has fixed effects, while the lower level gives random effects for each respondent.

To estimate HB, use estimateMNLfromDesignHB(). It is a wrapper that calls ChoiceModelR::choicemodelr() (Sermas, 2012). HB uses MCMC iteration; for demonstration, we specify a chain with length 2000. In practice, this would typically be 10000s.

estimateMNLfromDesignHB() includes common HB options:

  • proportion of the MCMC chain that is regarded as posterior draws (pitersUsed)
  • whether to save MCMC chain (“draws”) for each respondent (drawKeep)
  • frequency of posterior draws to retain (every K'th draw, to avoid autocorrelation; drawKeepK)

HB Estimation

# replace 30% of the perfect "winning" vector with random draws
cbc.win2    <- cbc.win
cbc.replace <- sample(length(cbc.win2), length(cbc.win2)*0.3)  # to replace
cbc.win2[cbc.replace] <- sample(3, length(cbc.replace), replace=TRUE)

# estimate using the design and winning cards
cbc.hb <- estimateMNLfromDesignHB(tmp.des=cbc.tab, tmp.win=cbc.win2,
                                  kCards=cbc.concepts, kTrials=cbc.tasks,
                                  kResp=N , mcmcIters=2000)
#>                     Logit Data                    
#> ==================================================
#> Attribute       Type         Levels
#> -----------------------------------
#> Attribute 1    Part Worth      4
#> Attribute 2    Part Worth      4
#> Attribute 3    Part Worth      5
#> Attribute 4    Part Worth      2
#> Attribute 5    Part Worth      4
#> 
#> 14 parameters to be estimated.
#> 
#> 400 total units.
#> Average of 3 alternatives in each of 12 sets per unit.
#> 4800 tasks in total.
#> 
#> Table of choice data pooled across units:
#> Choice  Count   Pct.
#> --------------------
#>    1    972    20.25%
#>    2    1599   33.31%
#>    3    2229   46.44%
#> 
#>       MCMC Inference for Hierarchical Logit       
#> ==================================================
#> Total Iterations:          2000
#> Draws used in estimation:  1000
#> Units:                     400
#> Parameters per unit:       14
#> Constraints not in effect.
#> Draws are to be saved.
#> Prior degrees of freedom:  5
#> Prior variance:            2
#> 
#> MCMC Iteration Beginning...

plot of chunk unnamed-chunk-13

#> Iteration  Acceptance   RLH     Pct. Cert.   Avg. Var.   RMS     
#> Time to End
#>       100  0.362        0.380   0.114        0.16        0.25    0:09  
#>       200  0.303        0.429   0.224        0.30        0.50    0:09  
#>       300  0.306        0.453   0.277        0.40        0.65    0:08  
#>       400  0.300        0.467   0.306        0.48        0.75    0:07

plot of chunk unnamed-chunk-13

#>       500  0.306        0.472   0.317        0.52        0.80    0:07  
#>       600  0.306        0.475   0.323        0.57        0.84    0:06  
#>       700  0.304        0.478   0.329        0.59        0.86    0:06  
#>       800  0.308        0.481   0.334        0.61        0.87    0:05  
#>       900  0.303        0.479   0.331        0.61        0.87    0:05  
#>      1000  0.303        0.481   0.333        0.61        0.89    0:04  
#>      1100  0.302        0.480   0.332        0.59        0.88    0:04  
#>      1200  0.300        0.480   0.332        0.60        0.88    0:03  
#>      1300  0.299        0.478   0.328        0.61        0.87    0:03  
#>      1400  0.302        0.478   0.328        0.60        0.87    0:03  
#>      1500  0.304        0.480   0.332        0.61        0.88    0:02  
#>      1600  0.297        0.478   0.328        0.60        0.87    0:02  
#>      1700  0.299        0.476   0.323        0.58        0.85    0:01  
#>      1800  0.307        0.474   0.321        0.56        0.84    0:01  
#>      1900  0.308        0.477   0.327        0.56        0.84    0:00  
#>      2000  0.305        0.478   0.327        0.57        0.84    0:00  
#> 
#> Total Time Elapsed: 0:09
#> 
#> Writing estimated unit-level betas to Rbetas.csv in the working directory

Get Individual Estimates

Estimates may be extracted from the HB model with extractHBbetas() (and we add respondent IDs):

cbc.est        <- data.frame(extractHBbetas(cbc.hb, cbc.attrs))
names(cbc.est) <- cbc.levels
cbc.est$ID     <- 1:nrow(cbc.est)   # set respondent ID

# mean of MCMC chain per individual
head(cbc.est)
#>        Alpha       Bravo   Charlie       Delta      Shiny         Flat
#> 1 -0.4997400 -0.68652143 0.4198542  0.76640724 -0.5983791 -0.534706669
#> 2 -0.2927157 -0.38988933 0.5871751  0.09542995  0.3158654 -0.287247328
#> 3 -1.4966944  0.32503856 1.4698404 -0.29818457  0.1632377 -0.206093768
#> 4 -0.8882575 -0.04936736 0.8446392  0.09298558  1.0793851 -1.134929988
#> 5 -0.4101659 -1.01777366 0.9038024  0.52413712  0.1865601 -0.599280419
#> 6 -0.2664094 -0.66078766 0.3958085  0.53138856  0.1548074 -0.003601601
#>     Sparkly         Odd          $9        $14        $19        $24
#> 1 1.9313875 -0.79830174 -0.01871368  1.7422740 -1.1538048  0.4764291
#> 2 1.3424898 -1.37110790  0.39612222  1.4830933 -1.1347539 -0.1371388
#> 3 0.5353565 -0.49250051 -0.75974943  0.5904404  0.3789250 -0.9212234
#> 4 0.3079874 -0.25244245 -0.48370356  0.6257977 -0.5677943  0.6909679
#> 5 0.5088712 -0.09615084 -0.29313176 -0.7557123 -0.2117941  1.2288428
#> 6 0.1970143 -0.34822007 -0.40228457  0.7168540 -0.9230615  0.5014125
#>           $29       Blue       Red       64GB      256GB       512GB
#> 1 -1.04618465 -0.4316321 0.4316321  0.5124843 -0.3821001 -0.42334567
#> 2 -0.60732272 -0.9769510 0.9769510  0.9684142 -0.3010086 -0.85703427
#> 3  0.71160741 -0.6302464 0.6302464  0.6959751 -0.3620261 -0.04441025
#> 4 -0.26526781 -0.3799779 0.3799779  0.3444515 -0.6725858 -0.20206752
#> 5  0.03179532 -0.1376253 0.1376253 -0.2521733 -0.1732746 -0.04664018
#> 6  0.10707963 -0.6193282 0.6193282  0.6857107 -0.4190226 -0.94944218
#>          1TB ID
#> 1  0.2929615  1
#> 2  0.1896287  2
#> 3 -0.2895388  3
#> 4  0.5302019  4
#> 5  0.4720881  5
#> 6  0.6827541  6

Plot Individual HB Estimates

It is helpful to plot individuals' estimates, as this is informative about the variation in preferences for features. ggridges works well (next slide):

library(ggplot2)
library(reshape2)
cbc.m <- melt(cbc.est, id.vars = "ID")

library(ggridges)
ggplot(data=cbc.m, aes(x=value, y=variable, group=variable)) +
      geom_density_ridges(scale=0.9, alpha=0, jittered_points=TRUE,
                          rel_min_height=0.005,
                          position="points_sina",
                          point_color = "blue", point_alpha=1/sqrt(N),
                          point_size=2.5) +
        ylab("Attribute / Level") +
        xlab("Relative preference (blue circles=individuals)")

Because we are using simulated data, these will be Guassian. In real data, you might see other patterns such as bimodal distributions.

Individual estimates