What is Cluster Analysis: Put simply, cluster analysis groups observations so that the members of each group are similar to one another. These groups are known as clusters. The objective is to maximize the distance between clusters while minimizing the distance between observations within each cluster.
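To make that objective concrete, here is a minimal Python sketch that scores a grouping by its within-cluster and between-cluster sums of squares. The points and labels are hypothetical toy values, not the survey data, and the function names are our own.

```python
# Toy 2-D observations (hypothetical values) and two candidate groupings.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
sensible_labels = [0, 0, 0, 1, 1, 1]   # nearby points share a cluster
scrambled_labels = [0, 1, 0, 1, 0, 1]  # clusters mix distant points


def mean_point(pts):
    """Coordinate-wise mean (centroid) of a list of points."""
    n = len(pts)
    return tuple(sum(p[d] for p in pts) / n for d in range(len(pts[0])))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_between(pts, labels):
    """Return (within-cluster SS, between-cluster SS) for a labeling."""
    grand = mean_point(pts)
    groups = {}
    for p, lab in zip(pts, labels):
        groups.setdefault(lab, []).append(p)
    within = 0.0
    between = 0.0
    for members in groups.values():
        centroid = mean_point(members)
        within += sum(sq_dist(p, centroid) for p in members)
        between += len(members) * sq_dist(centroid, grand)
    return within, between


good = within_between(points, sensible_labels)
bad = within_between(points, scrambled_labels)
print("sensible  (within, between):", good)
print("scrambled (within, between):", bad)
```

The two sums always add up to the same total, so a grouping with a smaller within-cluster sum automatically has a larger between-cluster sum.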
What is SPSS: A statistical package now owned by IBM, SPSS is commonly used by researchers to analyze survey data through statistical analysis, machine learning algorithms, text analysis, and more.
Cluster Analysis in SPSS: SPSS offers three methods for cluster analysis:
- K-Means Cluster: This form of clustering is used for large data sets when the researcher has already defined the number of clusters.
- Hierarchical Cluster: Considered the most common approach, this model of clustering generates a series of solutions from 1 cluster where all observations are grouped together to n clusters where each observation is its own cluster. In hierarchical clustering, variables as well as observations or cases can be clustered. Finally, nominal, scale, and ordinal data can be used when creating clusters using the hierarchical method.
- Two-Step Cluster: A combination of the previous two approaches, two-step clustering gets its name from its approach of first running pre-clustering and then running hierarchical clustering. Similar to K-means, it can handle large sets of data that would take too long with the hierarchical method. A limitation is that two-step clustering can only handle scale and ordinal data. However, two-step clustering finds the optimal number of clusters for you.
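The "series of solutions" produced by hierarchical clustering can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SPSS's implementation: it uses average linkage on squared Euclidean distance, the five points are hypothetical, and `agglomerate` is a name of our own.

```python
# Hypothetical toy observations, referred to below by their index 0..4.
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 5.2), (9.0, 1.0)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage(c1, c2, pts):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(sq_dist(pts[i], pts[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(pts):
    """Merge the two closest clusters repeatedly, recording every solution.

    Returns a list of solutions, from n single-member clusters down to
    one cluster that contains every observation.
    """
    clusters = [[i] for i in range(len(pts))]
    solutions = [[sorted(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], pts)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        solutions.append([sorted(c) for c in clusters])
    return solutions


solutions = agglomerate(points)
for solution in solutions:
    print(len(solution), "clusters:", solution)
```

The researcher then picks one solution from the series, whereas K-means needs the number of clusters up front.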
Case Application
Using a dataset of over 300 survey responses that Chipotle collected within a local area to segment the market, we aim to answer a research question by applying cluster analysis in SPSS.
Research Question: What type of customers should Chipotle target when pursuing a customer segment expansion?
We broke our process down into steps:
Step 1: Analyze the Dataset
- Contains demographic, psychographic, and behavioral variables
- Demographic: age, income, ethnicity, etc.
- Psychographic: attitude, beliefs, lifestyle
- Behavioral: purchase behavior, customer loyalty
- Cleaned for missing and incomplete data
- No information about how data was gathered
Step 2: Choose Cluster Analysis Method
Based on our initial research, we decided to proceed with two-step cluster analysis, as it allows ordinal data to be clustered and the algorithm finds the optimal number of clusters.
Step 3: Choose Variables
After analyzing the data, we decided to use only psychographic variables, also known as lifestyle variables. As the research question is focused on expanding Chipotle’s customer segment, it is important to know what potential customers find important when choosing a fast-food option.
There were quite a few psychographic variables, and we did not perform a t-test to find which of them were significant. Instead, we performed a two-step cluster analysis with all the psychographic variables. That output gave us a breakdown of which predictors (variables) were the most important. Using that, we decided to run another two-step cluster with the 4 most important predictors.
After looking at the predictor importance, we found that ambiance, health, convenience, and variety were the most important in forming the clusters. Price was considerably less important, and taste did not appear on the predictor importance graph at all, so it had no bearing on the clusters.
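The "rank, then keep the top 4" step can be sketched in Python as follows. This is a stand-in rather than SPSS's own predictor importance calculation: it scores each variable by eta squared, the share of its variance explained by cluster membership. The ratings and cluster labels are invented purely for illustration and are not the survey data.

```python
# Hypothetical 1-5 importance ratings from six respondents (invented values).
ratings = {
    "ambiance":    [5, 4, 5, 1, 2, 1],
    "health":      [5, 5, 4, 2, 1, 1],
    "convenience": [4, 5, 4, 2, 2, 1],
    "variety":     [4, 4, 5, 2, 3, 2],
    "price":       [3, 4, 3, 3, 4, 3],
    "taste":       [5, 5, 5, 5, 5, 5],
}
# Assumed cluster membership from a first clustering run.
labels = [0, 0, 0, 1, 1, 1]


def eta_squared(values, cluster_labels):
    """Between-cluster sum of squares divided by total sum of squares."""
    grand = sum(values) / len(values)
    total = sum((v - grand) ** 2 for v in values)
    if total == 0:
        return 0.0  # a constant variable cannot separate clusters
    groups = {}
    for v, lab in zip(values, cluster_labels):
        groups.setdefault(lab, []).append(v)
    between = sum(
        len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
    )
    return between / total


importance = {name: eta_squared(vals, labels) for name, vals in ratings.items()}
top4 = sorted(importance, key=importance.get, reverse=True)[:4]
print("variables kept for the second run:", top4)
```

A variable that every respondent rates the same, like taste in this toy data, scores zero because it cannot help tell clusters apart.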
Step 4: Output and Analysis
Two-Step Cluster with all Psychographic Variables*
- Three clusters formed with an uneven size distribution
- Cluster quality is fair
- Most important predictor is importance of convenience
- Evaluation fields show little variation between the demographics of each cluster
Two-Step Cluster with Select Variables**
- Two clusters formed with fairly even distribution of cases
- Cluster quality is fair to almost good
- Most important predictor is importance of healthy options
- Evaluation fields show little variation between the demographics of each cluster
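The cluster quality rating in the two-step output is a silhouette measure of cohesion and separation. The Python sketch below computes a mean silhouette by hand. The toy points are hypothetical, and the band boundaries of 0.2 and 0.5 are our assumption about how the output labels poor, fair, and good.

```python
import math

# Hypothetical toy observations, not the survey data.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]


def euclid(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(pts, labels):
    """Average silhouette over all observations (needs at least 2 clusters)."""
    cluster_ids = set(labels)
    scores = []
    for i, p in enumerate(pts):
        # a: mean distance to the other members of the same cluster
        own = [euclid(p, q) for j, q in enumerate(pts)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        other_means = []
        for other in cluster_ids:
            if other == labels[i]:
                continue
            dists = [euclid(p, q) for q, lab in zip(pts, labels) if lab == other]
            other_means.append(sum(dists) / len(dists))
        b = min(other_means)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def quality_band(score):
    """Map a silhouette score to a label (boundaries are an assumption)."""
    if score < 0.2:
        return "poor"
    if score < 0.5:
        return "fair"
    return "good"


tight = mean_silhouette(points, [0, 0, 0, 1, 1, 1])
mixed = mean_silhouette(points, [0, 1, 0, 1, 0, 1])
print(round(tight, 3), quality_band(tight))
print(round(mixed, 3), quality_band(mixed))
```

Values near 1 mean observations sit much closer to their own cluster than to any other, which is what a move from "fair" toward "good" reflects.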
Step 5: Analyze Outputs
Looking at the two outputs, it is clear that better clusters are formed when using only the select variables**. The cluster quality increases, though not by much, and the distribution of cases between clusters is more even. One clear distinction is that healthiness is more important to Cluster 2 in the second output. This matters because Chipotle is known for its healthier fast-food options. Going forward, Chipotle would have greater success by targeting customers who align with Cluster 2.
Lessons Learned:
- The quality of the dataset plays a huge role in the creation of clusters. The responses given by those surveyed largely determine how distinct the resulting clusters are from one another.
- Picking variables that are significant is very important in improving the quality of the cluster output.
- Determining the type of cluster analysis performed is the first step towards answering any research question. That decision is heavily dependent on the type of data, amount of data, and goal of analysis.
- There are many options and tools within each method of clustering in SPSS. It can be overwhelming, but researching the details of cluster analysis in SPSS helps in understanding what the options mean and how they add value to the output.
*Psychographic Variables: Ambiance, Health, Convenience, Variety, Price, and Taste
**Select Variables: Ambiance, Health, Convenience, and Variety