The Math: Segment Builder

Segment Builder groups survey respondents into distinct clusters based on patterns in their numeric responses. Rather than analysing the whole sample as one homogeneous group, segmentation surfaces subgroups with meaningfully different attitudes, behaviours, or needs.

Step 1 — Z-Score Standardisation

Before clustering, every selected variable is standardised to a common scale so that high-range variables (e.g., income in thousands) do not dominate low-range ones (e.g., satisfaction on a 1–10 scale).

For each variable j across all respondents:

z_ij = (x_ij − μ_j) / σ_j

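where x_ij is respondent i's value on variable j, μ_j is the column mean, and σ_j is the population standard deviation (computed by dividing by n rather than n − 1). Each standardised variable has mean 0 and standard deviation 1. All distances in the clustering step are computed on these standardised values.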
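
As an illustration of the formula above, here is a minimal Python sketch using only the standard library. The function name standardise and the sample data are hypothetical, and the handling of a constant column (σ_j = 0) is an assumption, since the text does not describe that case.

```python
import math

def standardise(rows):
    """Return z-scored copies of rows (a list of equal-length numeric lists)."""
    n = len(rows)
    n_vars = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_vars)]
    # Population standard deviation: divide by n, not n - 1.
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
           for j in range(n_vars)]
    # Assumption: a constant column (sd == 0) is mapped to 0.0 to avoid
    # division by zero; the source does not say how this case is handled.
    return [[(r[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0
             for j in range(n_vars)]
            for r in rows]

# Illustrative data: income in thousands next to a 1-10 satisfaction score.
data = [[30.0, 2.0], [50.0, 6.0], [70.0, 10.0]]
z = standardise(data)
for row in z:
    print([round(v, 3) for v in row])
```

After standardisation the two columns hold the same values even though their raw ranges differ, so neither variable dominates the distance calculation.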

Step 2a — k-Means++ Clustering

k-means partitions n respondents into k clusters by minimising the Within-Cluster Sum of Squares (WCSS):

WCSS = Σ_c Σ_i∈c ‖x_i − μ_c‖²

where x_i is the standardised response vector of respondent i and μ_c is the centroid of cluster c. The algorithm alternates two steps until convergence:

Assignment: assign each respondent to the nearest centroid (Euclidean distance).
Update: recalculate each centroid as the mean of its assigned members.
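
The two steps can be sketched in plain Python as follows. This is an illustrative sketch, not the Krosstabs implementation: the function names, the stopping rule (stop when no label changes), the max_iter cap, and the treatment of empty clusters are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd(points, centroids, max_iter=100):
    """Alternate assignment and update steps; return labels, centroids, WCSS."""
    centroids = [list(c) for c in centroids]  # work on a copy
    labels = None
    for _ in range(max_iter):
        # Assignment: index of the nearest centroid for each point.
        new_labels = [min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assumed stopping rule: no respondent changed cluster
        labels = new_labels
        # Update: each centroid becomes the mean of its assigned members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss

# Illustrative (already standardised) data with two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centroids, wcss = lloyd(points, [[0.0, 0.0], [5.0, 5.0]])
print(labels, centroids, wcss)
```

Each pass can only lower or preserve WCSS, which is why the loop settles after a finite number of iterations.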

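Krosstabs uses the k-means++ initialisation strategy rather than purely random starts. The first centroid is chosen uniformly at random from the respondents; each subsequent centroid is drawn from the respondents with probability proportional to the squared distance between that respondent and its nearest existing centroid. This spreads the starting centroids apart, carries a theoretical guarantee (the expected WCSS is within an O(log k) factor of the optimum), and reduces the risk of converging to a poor local minimum.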
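
A minimal sketch of k-means++ seeding, using only the Python standard library. The function name kmeanspp_init and the explicit rng argument are hypothetical, and the sketch assumes the data contains at least k distinct points (otherwise every sampling weight would be zero).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """Pick k starting centroids from points using k-means++ seeding."""
    # First centroid: uniform random choice among the points.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Next centroid: sampled with probability proportional to d2.
        # Points already chosen have weight 0 and cannot be picked again.
        centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
seeds = kmeanspp_init(points, 2, random.Random(0))
print(seeds)
```

Because far-away points receive larger weights, the second seed is much more likely to land in the other group than it would be under a uniform draw.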

Step 2b — Hierarchical Clustering (Ward's Method)

The alternative algorithm builds segments bottom-up. Every respondent starts as their own cluster. At each step, the two clusters whose merger produces the smallest increase in WCSS are joined — this is Ward's linkage criterion:

ΔWard(i, j) = (n_i · n_j) / (n_i + n_j) · ‖μ_i − μ_j‖²

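where n_i and n_j are the sizes of clusters i and j, and μ_i and μ_j are their centroids. Merging continues until the desired number of clusters k is reached. Because Ward's method must compute an n × n distance matrix, its memory use grows with the square of the number of respondents, so it is limited to datasets of 5,000 rows or fewer. For larger files, k-means is faster and uses far less memory.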
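
The sketch below checks numerically that the Ward formula equals the increase in WCSS caused by a merge, and then runs a naive bottom-up merge loop. It is an illustration under stated assumptions: all function names are hypothetical, and the loop recomputes every pairwise cost at each step, which is far slower than a production implementation would be.

```python
def centroid(cluster):
    """Mean vector of a list of points."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def wcss(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    mu = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, mu)) for p in cluster)

def ward_cost(a, b):
    """Increase in WCSS if clusters a and b were merged (Ward's criterion)."""
    gap = sum((x - y) ** 2 for x, y in zip(centroid(a), centroid(b)))
    return len(a) * len(b) / (len(a) + len(b)) * gap

def ward_cluster(points, k):
    """Naive agglomeration: merge the cheapest pair until k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Worked check: merging {0, 2} with {10} in one dimension.
a, b = [[0.0], [2.0]], [[10.0]]
delta = ward_cost(a, b)                           # (2 * 1 / 3) * 9**2
increase = wcss(a + b) - (wcss(a) + wcss(b))      # 56 - 2
print(delta, increase)

segments = ward_cluster([[0.0], [1.0], [10.0], [11.0]], 2)
print(segments)
```

Both printed values agree (54 up to floating-point rounding), which is the sense in which Ward's linkage picks the merge with the smallest increase in WCSS.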

Step 3 — Choosing k: The Elbow Method

When Auto-suggest is enabled, the tool runs k-means for k = 2 through 8 and records WCSS at each value. Adding more clusters always reduces WCSS, but the marginal gain diminishes. The "elbow" is the point where the curve bends most sharply — the best trade-off between fit and parsimony.

Krosstabs detects this point using a simplified Kneedle algorithm:

Normalise the (k, WCSS) curve to the unit square.
Draw a straight line from the first point (k=2) to the last (k=8).
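Select the k whose point is furthest from that line (maximum perpendicular distance). The endpoints k=2 and k=8 lie on the line by construction, so in practice the suggestion falls between 3 and 7.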
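
The three steps above can be sketched as follows. The function name suggest_k and the WCSS values are made up for illustration, and the sketch assumes the curve is not flat (the largest and smallest WCSS differ).

```python
import math

def suggest_k(ks, wcss):
    """Return the k whose normalised point lies furthest from the chord."""
    # Step 1: normalise both axes to the unit square.
    x = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    lo, hi = min(wcss), max(wcss)
    y = [(w - lo) / (hi - lo) for w in wcss]
    # Step 2: the chord runs from the first point to the last.
    x0, y0, x1, y1 = x[0], y[0], x[-1], y[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    # Step 3: perpendicular distance of every point from the chord.
    dist = [abs((y1 - y0) * (xi - x0) - (x1 - x0) * (yi - y0)) / length
            for xi, yi in zip(x, y)]
    return ks[dist.index(max(dist))]

# Hypothetical WCSS values for k = 2..8 with a visible bend at k = 4.
ks = [2, 3, 4, 5, 6, 7, 8]
wcss = [1000.0, 600.0, 300.0, 260.0, 240.0, 225.0, 215.0]
suggested = suggest_k(ks, wcss)
print(suggested)
```

With these illustrative numbers the gains flatten out after k = 4, and that is the value the function returns.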

The elbow plot is shown on the results screen so you can review the curve and override the suggestion with a fixed k if the context warrants it.

Step 4 — Silhouette Score

The silhouette score measures how well each respondent fits their assigned cluster versus the next-closest one. For respondent i:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

a(i) — mean distance from i to all other members of its own cluster (cohesion)
b(i) — mean distance from i to all members of the nearest other cluster (separation)

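s(i) ranges from −1 to +1: values near +1 mean the respondent sits well inside its cluster, values near 0 mean it lies on the boundary between two clusters, and negative values suggest it may fit the neighbouring cluster better. The mean silhouette across all respondents is reported with a qualitative rating: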

≥ 0.70 — Strong: clusters are well-separated and internally compact.
0.50–0.69 — Good: reasonable structure; most respondents clearly belong.
0.25–0.49 — Fair: some overlap; segments may benefit from a different k.
< 0.25 — Poor: little evidence of real clustering structure in this data.

For large datasets (> 2,000 respondents), the silhouette is computed on an evenly spaced subsample to keep run time manageable, because the full calculation compares every pair of respondents. The estimate is stable for surveys of typical size.
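
A minimal sketch of the silhouette calculation, the rating bands, and an evenly spaced subsample. Several details are assumptions rather than documented behaviour: the function names, the score of 0 for a respondent alone in its cluster (a common convention), and the subsample size of 2,000, which the text does not state. The sketch also assumes at least two clusters.

```python
import math

def mean_silhouette(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def rating(score):
    """Map a mean silhouette to the qualitative bands listed above."""
    if score >= 0.70:
        return "Strong"
    if score >= 0.50:
        return "Good"
    if score >= 0.25:
        return "Fair"
    return "Poor"

def evenly_spaced_indices(n, max_n=2000):
    """Indices of an evenly spaced subsample; max_n is a hypothetical size."""
    if n <= max_n:
        return list(range(n))
    step = n / max_n
    return [int(i * step) for i in range(max_n)]

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 1, 1]
score = mean_silhouette(points, labels)
print(round(score, 3), rating(score))

idx = evenly_spaced_indices(5000)
print(len(idx), idx[:4])
```

For a large file, the points and labels at the subsample indices would be passed to mean_silhouette instead of the full data.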

Reading the Segment Profile

The profile table shows the unstandardised mean of each variable within each segment, alongside the total sample mean. Values are coloured relative to the total — darker shading indicates a segment mean that deviates more from average, making it easy to spot the defining characteristics of each group.
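
A rough sketch of how such a profile could be computed from the raw (unstandardised) data and the segment labels. The function name segment_profile and the sample values are hypothetical; the deviation from the total mean is shown because it is the quantity the shading is described as reflecting, but how the tool scales the colours is not specified here.

```python
def segment_profile(rows, labels):
    """Return the total mean per variable and per-segment means and deviations."""
    n_vars = len(rows[0])
    total = [sum(r[j] for r in rows) / len(rows) for j in range(n_vars)]
    profile = {}
    for seg in sorted(set(labels)):
        members = [r for r, l in zip(rows, labels) if l == seg]
        means = [sum(r[j] for r in members) / len(members)
                 for j in range(n_vars)]
        # Deviation from the total sample mean for each variable.
        profile[seg] = {
            "mean": means,
            "diff": [m - t for m, t in zip(means, total)],
        }
    return total, profile

# Illustrative raw data: income in thousands, satisfaction on a 1-10 scale.
rows = [[30.0, 2.0], [50.0, 6.0], [70.0, 10.0], [90.0, 8.0]]
labels = [0, 0, 1, 1]
total, profile = segment_profile(rows, labels)
print(total)
print(profile)
```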

Segment names can be edited inline on the results screen. Once you are satisfied with the solution, the Export CSV button downloads the original dataset with a _segment column appended. If the data was loaded from your dataset library, Save to Dataset writes _segment directly back to that dataset so it can be used as a banner variable in Banner Table.
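
To show what an exported file might look like, the sketch below appends a _segment column to CSV text using Python's csv module. The function name append_segment and the sample rows are hypothetical, and whether the column holds segment names or numbers is an assumption (names are used here).

```python
import csv
import io

def append_segment(csv_text, segments):
    """Return csv_text with a _segment column appended to every row."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["_segment"])
    # One segment value per data row, in the original row order.
    for row, seg in zip(reader, segments):
        writer.writerow(row + [seg])
    return out.getvalue()

original = "id,score\n1,7\n2,3\n"
exported = append_segment(original, ["Enthusiasts", "Sceptics"])
print(exported)
```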