Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Abstract The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. data with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R, comparing 4 model-based clustering methods (the Kamila algorithm, Latent Class Analysis, the Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based methods (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, and K-prototypes). Clustering performance was assessed with the Adjusted Rand Index (ARI) on 1000 simulated virtual populations of mixed variables across 7 scenarios with varying population sizes, numbers of clusters, numbers of continuous and categorical variables, proportions of relevant (non-noisy) variables and degrees of variable relevance (low, mild, high). The clustering methods were then applied to data from the EPHESUS randomized clinical trial (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between techniques. The simulations showed that K-prototypes, Kamila and LCM dominated all other methods. Overall, methods that fed dissimilarity matrices into classical algorithms such as Partitioning Around Medoids and hierarchical clustering had a lower ARI than model-based methods in all scenarios. When applied to the real-life clinical dataset, LCM showed promising results with regard to (1) differences in clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit. These findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes performed best in the setting of heterogeneous data.

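As an illustration of the kind of ready-to-use R tools compared in this benchmark, the short sketch below runs a Gower-dissimilarity strategy (PAM and hierarchical clustering), K-prototypes and the Kamila algorithm on a hypothetical mixed data frame "df" and scores each partition against a reference labeling "truth" with the Adjusted Rand Index. This is not the authors' benchmark code: the data frame, the reference labels and the number of clusters k are assumptions, and the package calls (cluster, clustMixType, kamila, mclust) follow their publicly documented interfaces.

## Minimal sketch; assumed inputs: mixed data frame df (numeric + factor columns),
## reference labels truth, and a pre-chosen number of clusters k
library(cluster)       # daisy() for Gower dissimilarity, pam() for Partitioning Around Medoids
library(clustMixType)  # kproto() for K-prototypes
library(kamila)        # kamila() for the KAMILA model-based method
library(mclust)        # adjustedRandIndex()

k <- 3  # assumed number of clusters

## Distance/dissimilarity-based: Gower dissimilarity fed to PAM and to hierarchical clustering
gower_d <- daisy(df, metric = "gower")
pam_lab <- pam(gower_d, k = k)$clustering
hc_lab  <- cutree(hclust(as.dist(gower_d), method = "ward.D2"), k = k)

## Partitional method with mixed prototypes: K-prototypes
kp_lab <- kproto(df, k = k)$cluster

## Model-based: KAMILA takes the continuous and categorical parts separately
is_num  <- vapply(df, is.numeric, logical(1))
kam_lab <- kamila(conVar    = data.frame(scale(df[, is_num])),
                  catFactor = df[, !is_num, drop = FALSE],
                  numClust  = k, numInit = 10)$finalMemb

## Agreement of each recovered partition with the reference labels
sapply(list(PAM = pam_lab, HClust = hc_lab, Kprototypes = kp_lab, Kamila = kam_lab),
       adjustedRandIndex, y = truth)

The ARI equals 1 when a recovered partition matches the reference partition exactly and is close to 0 for chance-level agreement, which is the criterion the simulation study uses to rank the methods across the 1000 virtual populations.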
Bibliographic details
Authors: Gregoire Preud’homme, Kevin Duarte, Kevin Dalleau, Claire Lacomblez, Emmanuel Bresso, Malika Smaïl-Tabbone, Miguel Couceiro, Marie-Dominique Devignes, Masatake Kobayashi, Olivier Huttin, João Pedro Ferreira, Faiez Zannad, Patrick Rossignol, Nicolas Girerd
Format: article
Language: English
Published: Nature Portfolio, 2021
Published in: Scientific Reports, Vol 11, Iss 1, Pp 1-14 (2021)
DOI: 10.1038/s41598-021-83340-8
ISSN: 2045-2322
Subjects: Medicine (R); Science (Q)
Online access: https://doaj.org/article/26646d58a34f4588aac72200ad772731