Explaining Cluster Analysis

A deeper review of how cluster segmentation analysis works


Segmentation is the process of dividing a population of objects or consumers into relatively homogeneous groups that differ from each other. The term is most closely associated with consumers or markets, though its application is much broader.

Market or consumer segmentation is the process of splitting consumers into clearly defined groups that differ from each other in their needs, behavior, and socio-demographic characteristics. These groups also vary in the types of goods and services they prefer. This analysis helps companies define the target markets or audiences that are most attractive for the business, and develop a proposition of goods and services that will resonate with consumer needs.

From a technical point of view, a segmentation task is solved with a family of statistical tools called cluster analysis. Any method within cluster analysis works from a set of individual characteristics measured for each object or consumer, which is then allocated to a group. Depending on the purpose of the analysis, different parameters of the cluster solution are taken into account. Therefore, there is no single standard approach to cluster analysis, and there are no strict mathematical criteria for solution quality. The decision on the optimal number of groups is made by the researcher, based on how logically the results can be interpreted and how applicable they are afterwards.

In cluster analysis, there is no dependent variable given in advance. Instead, a grouping variable is created in the process of analysis; it records the allocation of each case/sample to a certain group. This is the key output produced by cluster analysis.

In order to perform segmentation, we need to take measurements for a sample of objects or consumers. Some of the measurements can be straightforward (e.g. the width or height of an object, or gender for consumers). Consumer segmentations usually require surveying a large number of people, and consumer groups are then defined based on similarities in their answers to the research questions.

Real World Examples:

Imagine we are a manufacturer of shoes for women and want to enter the segment of shoes with high heels. We need to find a target audience and develop a communication message for them.

  1. As a first step we need to define the core users of shoes with high heels. For this purpose we can conduct segmentation on the basis of socio-demographic characteristics. We hypothesize that middle-aged women living in cities wear such shoes more often than others. So, we ask a sample of women aged 15-65 living in both urban and rural areas whether they wear shoes with high heels at least once in 3 months (‘yes/no’ alternatives). Based on our hypothesis we segment the sample into the following groups and look at the share of those wearing shoes with high heels:

| | Women living in rural areas | Women aged 15-25 living in urban areas | Women aged 26-45 living in urban areas | Women aged 46-65 living in urban areas |
| --- | --- | --- | --- | --- |
| Group size | 25% | 10% | 35% | 30% |
| % of women wearing shoes with high heels | 20% | 30% | 70% | 45% |

The data in the table above confirm the hypothesis: our core target group is women aged 26-45 living in urban areas, both in terms of group size and in terms of penetration of shoes with high heels within the group.
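The comparison of group size and penetration can be made explicit with a little arithmetic. The sketch below assumes that group size is a share of the whole sample and that penetration is measured within each group (our reading of the table); the short segment labels are ours.

```python
# (group size as share of sample, share of the group wearing high heels)
# Figures are taken from the table; the short labels are our own.
segments = {
    "rural": (0.25, 0.20),
    "urban 15-25": (0.10, 0.30),
    "urban 26-45": (0.35, 0.70),
    "urban 46-65": (0.30, 0.45),
}

# Share of the whole sample that belongs to a segment AND wears high heels.
wearers = {name: size * pen for name, (size, pen) in segments.items()}
total_wearers = sum(wearers.values())

for name, share in wearers.items():
    print(f"{name}: {share:.1%} of the sample, "
          f"{share / total_wearers:.0%} of all wearers")

core = max(wearers, key=wearers.get)
print("Core segment:", core)
```

Multiplying size by penetration puts both criteria on one scale, so the core group is simply the segment with the largest product.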

  2. As a next step, we add some behavioral patterns and ask the selected group 2 questions: 1) frequency of purchasing shoes with high heels (1pt – very rarely (less often than once in 2 years); 9pt – very often (once in 2-3 months or more often)); 2) cost of the shoes purchased (1pt – very cheap; 9pt – very expensive). Below is an extract of the dataset (which represents the whole set of data), where we have 3 equal groups with clearly defined patterns:
Dataset columns: Respondent # | Frequency of purchase of shoes with high heels | Cost of the shoes | Group

Group 1: women who purchase expensive shoes with high heels very rarely

Group 2: women who purchase cheap shoes with high heels quite often

Group 3: women who frequently purchase expensive shoes with high heels

Apparently, middle-aged women are more inclined to purchase expensive shoes: two of the three groups do so.

  3. As a third step, let’s see how such grouping is done with the help of cluster analysis.

Imagine we have a sample of 33 women who often purchase expensive shoes with high heels. We have collected their needs with regard to the category under investigation. For this purpose, we’ve asked them to rate on a 9pt scale how important the following characteristics of shoes with high heels are:

V1: Comfort of the shoes

V2: Importance of high status

“1” stands for low importance, “9” stands for high importance.

Below is an example of a spreadsheet file containing the data for this case: each column stores one variable, and the first row contains the variable names.
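A file laid out this way can be read with very little code. The sketch below assumes the spreadsheet has been exported as CSV text with the columns V1 and V2; the three rows shown are hypothetical placeholders, not the survey data.

```python
import csv
import io

# Hypothetical CSV export: first row holds variable names, one column per variable.
raw = "V1,V2\n5,3\n6,6\n8,9\n"

reader = csv.DictReader(io.StringIO(raw))
# Each respondent becomes a (V1, V2) tuple of integer ratings.
data = [(int(row["V1"]), int(row["V2"])) for row in reader]
print(data)
```

For a real file, `io.StringIO(raw)` would be replaced by an open file handle.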

Once we’ve run the cluster analysis, there are 2 key outputs to analyze and interpret: a summary of the cluster solution and a plot with a visual representation of the results. Let’s have a look:

  1. Summary of cluster solution:

    K-means clustering with 4 clusters of sizes 11, 2, 10, 10

    Cluster means:
            V1       V2
    1 5.818182 5.818182
    2 1.000000 6.000000
    3 7.500000 8.200000
    4 5.000000 2.800000

    Clustering vector:
    [1] 4 4 4 1 4 4 4 2 4 2 1 4 4 1 1 1 1 1 1 1 1 4 3 3 3 3 1 3 3 3 3 3 3

    Within cluster sum of squares by cluster:
    [1] 17.27273  2.00000 24.10000 27.60000
    (between_SS / total_SS =  76.3 %)
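A summary of this kind is produced by a k-means procedure. The article does not show the code that was run, so what follows is only a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain Python. The function names, the random initialization and the nine example ratings are all our own assumptions, and the ratings are hypothetical, so the result will not reproduce the numbers above.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen observations
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(sum(v) / len(members) for v in zip(*members))
                if members else centers[c]
            )
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical (V1, V2) ratings on the 9pt scale, NOT the survey data.
ratings = [(8, 9), (7, 8), (8, 8), (5, 3), (5, 2), (4, 3), (6, 6), (5, 6), (6, 5)]
centers, labels = kmeans(ratings, k=3)
for c, center in enumerate(centers):
    print(c, center, labels.count(c))
```

Because the starting centers are random, k-means can settle on different solutions for different seeds; in practice the procedure is usually run from several starts and the best solution is kept.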

Let’s have a look at the key parameters of the cluster solution:

  1. Cluster sizes. As a rule, a good solution has clusters of reasonable size. Having a lot of very small groups and one large group might not properly reveal the cluster patterns, in which case we need to merge some clusters or break the large group into 2 or 3 smaller clusters. However, there might be cases when some groups legitimately stand out because of the number of consumers/objects they contain. In order to understand whether we should keep a cluster separate or merge it with another one, we look into the cluster profiles.
  2. Cluster means. This is the basic description of the clusters in terms of the variables used in the procedure (here V1 and V2). The more the groups differ from each other, the better. In our case:
     1. Cluster #1 stands for average importance of both status and comfort in shoes with high heels.
     2. Cluster #2 is a niche: it contains 2 women who don’t care about shoe comfort, while their need for high status is average. Their behavior is atypical, therefore we keep them as a separate group.
     3. Cluster #3 stands for high scores for both status and comfort.
     4. Cluster #4 is characterized by average importance of shoe comfort and a low need for high status.
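  3. Within cluster sum of squares by cluster, and the between_SS / total_SS ratio. The within-cluster sum of squares measures how tightly the observations in each cluster sit around their cluster mean: the lower the value, the more compact the cluster. The ratio between_SS / total_SS (76.3% in our case) shows how well the analysis grouped the observations: the higher the value, the better the solution differentiates the observations into groups. Increasing the number of clusters generally increases this statistic, and if the number of clusters equals the number of cases/observations, it equals 1 (100%).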
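The 76.3% figure can be recomputed from the numbers printed in the summary alone. The sketch below uses the standard decomposition total_SS = between_SS + within_SS, where between_SS is the size-weighted squared distance of each cluster mean from the grand mean. The variable names are ours; the inputs are copied from the output.

```python
# Values copied from the k-means summary.
sizes = [11, 2, 10, 10]
means = [(5.818182, 5.818182), (1.0, 6.0), (7.5, 8.2), (5.0, 2.8)]
within = [17.27273, 2.0, 24.1, 27.6]

n = sum(sizes)
n_vars = len(means[0])

# Grand mean of each variable = size-weighted average of the cluster means.
grand = [sum(s * m[j] for s, m in zip(sizes, means)) / n for j in range(n_vars)]

# Between-cluster SS: how far the cluster means sit from the grand mean.
between_ss = sum(
    s * sum((m[j] - grand[j]) ** 2 for j in range(n_vars))
    for s, m in zip(sizes, means)
)
within_ss = sum(within)
total_ss = between_ss + within_ss

ratio = between_ss / total_ss
print(f"between_SS / total_SS = {ratio:.1%}")
```

If every observation were its own cluster, each within-cluster sum of squares would be 0, so the ratio would reach 100%.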
  2. Plot for cluster solution:

This plot allows us to visually interpret the results of the analysis and see whether the data structure is logical. In our case, as there are only 2 features, we’ve put them on the X and Y axes. The selection of 4 groups demonstrates good grouping results.

Based on the parameters of the shoes produced and on group size (we strive to get the larger slice of the pie), we can pick the appropriate target audience for our new proposition. If we as a company have an appropriate offer combining both comfort and high status, we should go with cluster #3 (which contains 30% of the audience).
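The audience share of each cluster follows directly from the clustering vector in the summary. A small sketch (the variable names are ours) counts the labels and converts them to shares:

```python
from collections import Counter

# Clustering vector copied from the k-means summary: one label per respondent.
vector = [4, 4, 4, 1, 4, 4, 4, 2, 4, 2, 1, 4, 4, 1, 1, 1, 1,
          1, 1, 1, 1, 4, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3]

cluster_sizes = Counter(vector)
shares = {c: cluster_sizes[c] / len(vector) for c in sorted(cluster_sizes)}

for c, share in shares.items():
    print(f"Cluster #{c}: {cluster_sizes[c]} women, {share:.0%} of the sample")
```

Cluster #3 holds 10 of the 33 respondents, which is where the 30% figure comes from.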