Clustering Mixed Data Types in R

[This article was first published on Wicked Good Data - r, and kindly contributed to R-bloggers.]


Clustering allows us to better understand how a sample might be comprised of distinct subgroups given a set of variables. While many introductions to cluster analysis typically review a simple application using continuous variables, clustering data of mixed types (e.g., continuous, ordinal, and nominal) is often of interest. The following is an overview of one approach to clustering data of mixed types using Gower distance, partitioning around medoids, and silhouette width.

In total, there are three related decisions that need to be taken for this approach:

  1. Calculating distance
  2. Choosing a clustering algorithm
  3. Selecting the number of clusters

For illustration, the publicly available “College” dataset from the ISLR package will be used, which contains various statistics on US colleges from 1995 (N = 777). To highlight the challenge of handling mixed data types, both categorical and continuous variables will be used; they are listed below:

  • Continuous
    • Acceptance rate
    • Out-of-state tuition
    • Number of new students enrolled
    • Graduation rate
  • Categorical
    • Whether a college is public/private
    • Whether a college is elite, defined as having more than 50% of new students who graduated in the top 10% of their high school class

The code was run using R version 3.2.2 with the following packages:

set.seed(1680) # for reproducibility

library(dplyr)   # for data cleaning
library(ISLR)    # for college dataset
library(cluster) # for gower similarity and pam
library(Rtsne)   # for t-SNE plot
library(ggplot2) # for visualization

Before clustering can begin, some data cleaning must be done:

  • Acceptance rate is created by dividing the number of acceptances by the number of applications
  • isElite is created by labeling colleges with more than 50% of their new students who were in the top 10% of their high school class as elite
college_clean <- College %>%
  mutate(name = row.names(.),
         accept_rate = Accept/Apps,
         isElite = cut(Top10perc,
                       breaks = c(0, 50, 100),
                       labels = c("Not Elite", "Elite"),
                       include.lowest = TRUE)) %>%
  mutate(isElite = factor(isElite)) %>%
  select(name, accept_rate, Outstate, Enroll, Grad.Rate, Private, isElite)

glimpse(college_clean)
## Observations: 777
## Variables: 7
## $ name        (chr) "Abilene Christian University", "Ad...
## $ accept_rate (dbl) 0.7421687, 0.8801464, 0.7682073, 0....
## $ Outstate    (dbl) 7440, 12280, 11250, 12960, 7560, 13...
## $ Enroll      (dbl) 721, 512, 336, 137, 55, 158, 103, 4...
## $ Grad.Rate   (dbl) 60, 56, 54, 59, 15, 55, 63, 73, 80,...
## $ Private     (fctr) Yes, Yes, Yes, Yes, Yes, Yes, Yes,...
## $ isElite     (fctr) Not Elite, Not Elite, Not Elite, E...

Calculating Distance

In order for a yet-to-be-chosen algorithm to group observations together, we first need to define some notion of (dis)similarity between observations. A popular choice for clustering is Euclidean distance. However, Euclidean distance is only valid for continuous variables, and thus is not applicable here. In order for a clustering algorithm to yield sensible results, we have to use a distance metric that can handle mixed data types. In this case, we will use something called Gower distance.

Gower distance

The concept of Gower distance is actually quite simple. For each variable type, a distance metric that works well for that type is used and scaled to fall between 0 and 1. Then, a linear combination using user-specified weights (most simply an average) is calculated to create the final distance matrix. The metrics used for each data type are described below, and a small hand calculation after the list illustrates the idea:

  • quantitative (interval): range-normalized Manhattan distance
  • ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties
  • nominal: variables of k categories are first converted into k binary columns and then the Dice coefficient is used

  • pros: Intuitive to understand and straightforward to calculate
  • cons: Sensitive to non-normality and outliers in continuous variables, so transformations may be necessary as a pre-processing step. Also requires an NxN distance matrix, which is computationally intensive to keep in memory for large samples
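To make this concrete, here is a minimal hand calculation for two hypothetical colleges on three of the variables. The values and ranges below are only illustrative, and this sketches the general logic rather than daisy()'s exact internals:

# Two toy observations mixing continuous and nominal variables
x <- list(tuition = 7440,  enroll = 721, private = "Yes")
y <- list(tuition = 12280, enroll = 512, private = "No")

# Ranges would normally be computed over the full dataset; these are illustrative
tuition_range <- 21700 - 2340
enroll_range  <- 6392 - 35

d_tuition <- abs(x$tuition - y$tuition) / tuition_range  # range-normalized Manhattan
d_enroll  <- abs(x$enroll - y$enroll) / enroll_range     # range-normalized Manhattan
d_private <- as.numeric(x$private != y$private)          # 0/1 mismatch for a binary nominal

# Equal-weight average of the per-variable distances
mean(c(d_tuition, d_enroll, d_private))

Each per-variable distance lies between 0 and 1, so their (weighted) average does as well.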

Below, we see that Gower distance can be calculated in one line using the daisy function. Note that due to positive skew in the Enroll variable, a log transformation is conducted internally via the type argument. Instructions to perform additional transformations, like for factors that could be considered as asymmetric binary (such as rare events), can be seen in ?daisy.

# Remove college name before clustering
gower_dist <- daisy(college_clean[, -1],
                    metric = "gower",
                    type = list(logratio = 3))

# Check attributes to ensure the correct methods are being used
# (I = interval, N = nominal)
# Note that despite logratio being called,
# the type remains coded as "I"
summary(gower_dist)
## 301476 dissimilarities, summarized :
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## 0.0018601 0.1034400 0.2358700 0.2314500 0.3271400 0.7773500
## Metric :  mixed ;  Types = I, I, I, I, N, N
## Number of objects : 777

As a sanity check, we can print out the most similar and most dissimilar pairs in the data to see if they make sense. In this case, University of St. Thomas and John Carroll University are rated to be the most similar given the six features used in the distance calculation, while University of Science and Arts of Oklahoma and Harvard are rated to be the most dissimilar.

gower_mat <- as.matrix(gower_dist)

# Output most similar pair
college_clean[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
##                            name accept_rate Outstate Enroll
## 682 University of St. Thomas MN   0.8784638    11712    828
## 284     John Carroll University   0.8711276    11700    820
##     Grad.Rate Private   isElite
## 682        89     Yes Not Elite
## 284        89     Yes Not Elite
# Output most dissimilar pair
college_clean[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]
##                                        name accept_rate
## 673 University of Sci. and Arts of Oklahoma   0.9824561
## 251                      Harvard University   0.1561486
##     Outstate Enroll Grad.Rate Private   isElite
## 673     3687    208        43      No Not Elite
## 251    18485   1606       100     Yes     Elite

Choosing a clustering algorithm

Now that the distance matrix has been calculated, it is time to select an algorithm for clustering. While many algorithms that can handle a custom distance matrix exist, partitioning around medoids (PAM) will be used here.

Partitioning around medoids is an iterative clustering procedure with the following steps:

  1. Choose k random entities to become the medoids
  2. Assign every entity to its closest medoid (using our custom distance matrix in this case)
  3. For each cluster, identify the observation that would yield the lowest average distance to the other members of that cluster if it were the medoid, and make that observation the new medoid
  4. If at least one medoid has changed, return to step 2. Otherwise, end the algorithm.

If you know the k-means algorithm, this might look very familiar. In fact, the two approaches are nearly identical, except that k-means defines cluster centers as the means of the assigned observations (i.e., centroids), while cluster centers for PAM are restricted to be the observations themselves (i.e., medoids). A naive sketch of these steps in R follows the pros and cons below.

  • pros: Easy to understand, more robust to noise and outliers when compared to k-means, and has the added benefit of having an observation serve as the exemplar for each cluster
  • cons: Both run time and memory are quadratic (i.e., $O(n^2)$)
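For intuition, the four steps above can be sketched in a few lines of base R. This is only a naive teaching illustration (no handling of empty clusters, no optimized swap phase), not the algorithm actually implemented in cluster::pam():

naive_pam <- function(dist_mat, k, max_iter = 100) {
  dist_mat <- as.matrix(dist_mat)
  n <- nrow(dist_mat)
  medoids <- sample(n, k)                              # 1. random initial medoids
  clusters <- rep(NA_integer_, n)
  for (iter in seq_len(max_iter)) {
    # 2. assign every observation to its closest medoid
    clusters <- apply(dist_mat[, medoids, drop = FALSE], 1, which.min)
    # 3. within each cluster, pick the member with the lowest average distance
    new_medoids <- sapply(seq_len(k), function(j) {
      members <- which(clusters == j)
      members[which.min(rowMeans(dist_mat[members, members, drop = FALSE]))]
    })
    # 4. stop once no medoid changes; otherwise repeat
    if (setequal(new_medoids, medoids)) break
    medoids <- new_medoids
  }
  list(medoids = medoids, clustering = clusters)
}

# e.g., naive_pam(gower_dist, k = 3) gives a rough analogue of pam(gower_dist, diss = TRUE, k = 3)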

Selecting the number of clusters

A variety of metrics exist to help choose the number of clusters to be extracted in a cluster analysis. We will use silhouette width, an internal validation metric that aggregates how similar each observation is to its own cluster compared to its closest neighboring cluster. The metric ranges from -1 to 1, where higher values are better. After calculating the average silhouette width for PAM solutions with 2 to 10 clusters, we see that 3 clusters yields the highest value.
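For a single observation i, silhouette width is defined as s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other members of its own cluster and b(i) is the average distance from i to the members of the nearest other cluster; the quantity plotted below is the average of s(i) over all observations.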

# Calculate silhouette width for many k using PAM
sil_width <- c(NA)

for(i in 2:10){
  pam_fit <- pam(gower_dist,
                 diss = TRUE,
                 k = i)
  sil_width[i] <- pam_fit$silinfo$avg.width
}

# Plot silhouette width (higher is better)
plot(1:10, sil_width,
     xlab = "Number of clusters",
     ylab = "Silhouette Width")
lines(1:10, sil_width)

[Figure: average silhouette width for k = 2 through 10; the peak is at k = 3]

Cluster Interpretation

Via Descriptive Statistics

After running the algorithm and selecting three clusters, we can interpret the clusters by running summary on each. Based on these results, Cluster 1 is mainly Private/Not Elite with medium out-of-state tuition and smaller enrollment. Cluster 2, on the other hand, is mainly Private/Elite with lower acceptance rates, high out-of-state tuition, and high graduation rates. Finally, Cluster 3 is mainly Public/Not Elite with the lowest tuition, the largest enrollment, and the lowest graduation rate.

pam_fit <- pam(gower_dist, diss = TRUE, k = 3)

pam_results <- college_clean %>%
  dplyr::select(-name) %>%
  mutate(cluster = pam_fit$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

pam_results$the_summary
## [[1]]
##   accept_rate        Outstate         Enroll
##  Min.   :0.3283   Min.   : 2340   Min.   :  35.0
##  1st Qu.:0.7225   1st Qu.: 8842   1st Qu.: 194.8
##  Median :0.8004   Median :10905   Median : 308.0
##  Mean   :0.7820   Mean   :11200   Mean   : 418.6
##  3rd Qu.:0.8581   3rd Qu.:13240   3rd Qu.: 484.8
##  Max.   :1.0000   Max.   :21700   Max.   :4615.0
##    Grad.Rate      Private        isElite       cluster
##  Min.   : 15.00   No :  0   Not Elite:500   Min.   :1
##  1st Qu.: 56.00   Yes:500   Elite    :  0   1st Qu.:1
##  Median : 67.50                             Median :1
##  Mean   : 66.97                             Mean   :1
##  3rd Qu.: 78.25                             3rd Qu.:1
##  Max.   :118.00                             Max.   :1
##
## [[2]]
##   accept_rate        Outstate         Enroll
##  Min.   :0.1545   Min.   : 5224   Min.   : 137.0
##  1st Qu.:0.4135   1st Qu.:13850   1st Qu.: 391.0
##  Median :0.5329   Median :17238   Median : 601.0
##  Mean   :0.5392   Mean   :16225   Mean   : 882.5
##  3rd Qu.:0.6988   3rd Qu.:18590   3rd Qu.:1191.0
##  Max.   :0.9605   Max.   :20100   Max.   :4893.0
##    Grad.Rate      Private       isElite      cluster
##  Min.   : 54.00   No : 4   Not Elite: 0   Min.   :2
##  1st Qu.: 77.00   Yes:65   Elite    :69   1st Qu.:2
##  Median : 89.00                           Median :2
##  Mean   : 84.78                           Mean   :2
##  3rd Qu.: 94.00                           3rd Qu.:2
##  Max.   :100.00                           Max.   :2
##
## [[3]]
##   accept_rate        Outstate         Enroll
##  Min.   :0.3746   Min.   : 2580   Min.   : 153
##  1st Qu.:0.6423   1st Qu.: 5295   1st Qu.: 694
##  Median :0.7458   Median : 6598   Median :1302
##  Mean   :0.7315   Mean   : 6698   Mean   :1615
##  3rd Qu.:0.8368   3rd Qu.: 7748   3rd Qu.:2184
##  Max.   :1.0000   Max.   :15516   Max.   :6392
##    Grad.Rate      Private        isElite       cluster
##  Min.   : 10.00   No :208   Not Elite:199   Min.   :3
##  1st Qu.: 46.00   Yes:  0   Elite    :  9   1st Qu.:3
##  Median : 54.50                             Median :3
##  Mean   : 55.42                             Mean   :3
##  3rd Qu.: 65.00                             3rd Qu.:3
##  Max.   :100.00                             Max.   :3

Another benefit of the PAM algorithm with respect to interpretation is that the medoids serve as exemplars of each cluster. From this, we see that Saint Francis College is the medoid of the Private/Not Elite cluster, Barnard College is the medoid for the Private/Elite cluster, and Grand Valley State University is the medoid for the Public/Not Elite cluster.

college_clean[pam_fit$medoids, ]
##                              name accept_rate Outstate
## 492         Saint Francis College   0.7877629    10880
## 38                Barnard College   0.5616987    17926
## 234 Grand Valley State University   0.7525653     6108
##     Enroll Grad.Rate Private   isElite
## 492    284        69     Yes Not Elite
## 38     531        91     Yes     Elite
## 234   1561        57      No Not Elite

Via Visualization

One way to visualize many variables in a lower dimensional space is with t-distributed stochastic neighbor embedding, or t-SNE. This method is a dimension reduction technique that tries to preserve local structure so as to make clusters visible in a 2D or 3D visualization. While it typically uses Euclidean distance, it can also handle a custom distance metric like the one we created above. In this case, the plot shows the three well-separated clusters that PAM was able to detect. One curious thing to note is that there is a small group that is split between the Private/Elite cluster and the Public/Not Elite cluster.

tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data <- tsne_obj$Y %>%
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = college_clean$name)

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

[Figure: 2D t-SNE projection of the Gower distances, colored by PAM cluster]

By investigating further, it looks like this group is made up of the larger, more competitive public schools, like the University of Virginia or the University of California at Berkeley. While not large enough to warrant an additional cluster according to silhouette width, these 13 schools certainly have characteristics that set them apart within the three clusters.

tsne_data %>%
  filter(X > 15 & X < 25,
         Y > -15 & Y < -10) %>%
  left_join(college_clean, by = "name") %>%
  collect %>%
  .[["name"]]
## [1] "College of William and Mary" ## [2] "Georgia Institute of Technology" ## [3] "SUNY at Binghamton" ## [4] "SUNY College at Geneseo" ## [5] "Trenton State College" ## [6] "University of California at Berkeley" ## [7] "University of California at Irvine" ## [8] "University of Florida" ## [9] "University of Illinois - Urbana" ## [10] "University of Michigan at Ann Arbor" ## [11] "University of Minnesota at Morris" ## [12] "University of North Carolina at Chapel Hill"## [13] "University of Virginia"

A Final Note: Dealing with Larger Samples and One-Hot Encoding

Because using a custom distance metric requires keeping an NxN distance matrix in memory, memory usage becomes a problem for larger sample sizes (> 10,000 or so on my machine). For clustering larger samples, I have found two options:

  1. Two-step clustering in SPSS: This model-based clustering approach can handle categorical and continuous variables and utilizes silhouette width (using rule-of-thumb cutoffs) to find the optimal number of clusters.
  2. Using Euclidean distance on data that has been one-hot encoded: While much quicker computationally, note that this is not optimal, as you run into the curse of dimensionality fairly fast since all categoricals are recoded into sparse indicator columns. Note that this approach is actually fairly similar to the Dice coefficient found in the calculation of Gower distance, except it incorrectly labels 0-0 as a match. More on this can be seen in this discussion; a quick sketch of this approach follows the list below.
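As a rough illustration of option 2, the sketch below expands the factors into indicator columns and runs k-means directly on the scaled matrix, which avoids building the NxN distance matrix at all. Note that model.matrix() uses treatment (dummy) coding for factors after the first, so this is an approximation of strict one-hot encoding rather than an exact equivalent:

# Expand factors to indicator columns (name column dropped first)
onehot <- model.matrix(~ . - 1, data = college_clean[, -1])

# k-means on the scaled matrix: no NxN distance matrix required
km_fit <- kmeans(scale(onehot), centers = 3, nstart = 25)

# Rough comparison against the PAM solution from earlier
table(kmeans = km_fit$cluster, pam = pam_fit$clustering)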
