Date: 29/10/2021
Time: 10:00 - 11:00
Location: CNAM, entrance 31, room 31.2.87
Speakers: Mouhamadou Lamine NDAO and Fadela SADOU ZOULEYA, interns in the MSDMA team
Title: Clustering with missing values: how and why can we use multiple imputation?
Abstract: Multiple imputation (MI) is a popular method for dealing with missing values. The methodology is well established for many analysis methods (such as linear or logistic regression) when data are incomplete. However, using MI to handle missing values in clustering is a recent research topic that has not yet been studied in depth. The principle essentially consists in 1) imputing the incomplete data set M times using a dedicated imputation model, 2) applying a clustering method (e.g. k-means) to each imputed data set, and 3) aggregating the M partitions using consensus clustering (Basagaña et al. 2013; Bruckers et al. 2017; Faucheux et al. 2021; Audigier and Niang, 2021).
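The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch under strong assumptions, not the procedure used in the cited papers: the imputation model (`impute_once`, drawing from a normal distribution fitted to each column's observed values) is a crude stand-in for a dedicated imputation model, `kmeans` is a plain Lloyd's algorithm, and the consensus step (`consensus_partition`) is a simple majority vote on the co-association matrix followed by connected components, so it does not guarantee a fixed number of clusters. All function names and the toy data are hypothetical.

```python
import numpy as np

def impute_once(X, rng):
    """Fill NaNs column by column with draws from N(mean, sd) of the observed
    values. Assumption: a crude stand-in for a dedicated imputation model."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        obs = X[~miss, j]
        X_imp[miss, j] = rng.normal(obs.mean(), obs.std(), size=miss.sum())
    return X_imp

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd's algorithm with random initial centers; returns labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def consensus_partition(C, threshold=0.5):
    """Group points that are co-clustered in more than `threshold` of the
    partitions (connected components of the thresholded matrix)."""
    n = len(C)
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero((C[u] > threshold) & (labels < 0)):
                labels[v] = current
                stack.append(v)
        current += 1
    return labels

# Toy data (hypothetical): two Gaussian groups with 15% of entries removed.
rng = np.random.default_rng(0)
n, M, k = 60, 10, 2
X = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(6, 1, (n // 2, 2))])
X[rng.random(X.shape) < 0.15] = np.nan

# 1) impute M times, 2) cluster each imputed data set, 3) aggregate.
partitions = [kmeans(impute_once(X, rng), k, rng) for _ in range(M)]
C = np.mean([(p[:, None] == p[None, :]) for p in partitions], axis=0)
consensus = consensus_partition(C)
```

Entry `C[i, j]` is the fraction of the M partitions in which observations i and j fall in the same cluster, which avoids having to match cluster labels across partitions.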
Aggregation of partitions is generally performed by giving all partitions the same weight. This uniform weighting makes sense since the imputation step assumes that each imputed data set is imputed independently of the others (conditionally on the observed values). In such a case, there is no reason to give a larger or smaller weight to certain partitions. However, independence is difficult to guarantee in practice. If the independence assumption does not hold, then the M partitions obtained from the M imputed data sets are not representative of the set of partitions compatible with the observed values. To account for this under- (or over-) representation, a non-uniform weighting of partitions can be considered. This corresponds to the use of weighted consensus techniques, like WNMF (Tao and Ding, 2008), instead of unweighted consensus methods. Thus, the first part of this talk will investigate how weighted consensus clustering can be useful for aggregating partitions.
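To make the difference between uniform and non-uniform weighting concrete, here is a minimal sketch based on a weighted co-association matrix. It is not WNMF: it only illustrates how partition weights enter an aggregation step. The function names, the toy partitions, and the weights are hypothetical.

```python
import numpy as np

def coassociation(labels):
    """Binary matrix: 1 where two observations share a cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def weighted_coassociation(partitions, weights=None):
    """Weighted average of the co-association matrices of the partitions.
    With weights=None every partition gets weight 1/M (uniform consensus)."""
    A = np.stack([coassociation(p) for p in partitions])  # shape (M, n, n)
    M = len(partitions)
    if weights is None:
        w = np.full(M, 1.0 / M)
    else:
        w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    w = w / w.sum()                       # normalise so the weights sum to 1
    return np.tensordot(w, A, axes=1)     # sum over m of w[m] * A[m]

# Three toy partitions of four observations (hypothetical values).
partitions = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
C_uniform = weighted_coassociation(partitions)
C_weighted = weighted_coassociation(partitions, weights=[0.25, 0.25, 0.5])
```

With uniform weights, observations 0 and 1 are co-clustered in two partitions out of three. Giving the third partition half of the total weight lowers that agreement to 0.5, which shows how the weights shift the consensus.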
The second part of this talk will deal with the relevance of using MI to address missing values in clustering. Indeed, some clustering methods have already been extended to handle missing values. For instance, the k-pod algorithm (Chi et al., 2016) has been proposed to apply k-means clustering to incomplete data. Such techniques are called “direct methods” since they do not (explicitly) require an imputation step to perform clustering. Through a simulation study, clustering after MI is compared with clustering by direct methods for various clustering techniques (k-means, fuzzy c-means, Gaussian mixture clustering).
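As an illustration of what a direct method can look like, the sketch below follows the general idea behind k-pod as it is usually described: alternate between running k-means on a filled-in data set and refilling the missing entries with the coordinates of the assigned cluster centers. It is a simplified sketch, not the reference implementation of Chi et al.; the initial fill by column means, the iteration counts, the function names, and the toy data are assumptions.

```python
import numpy as np

def lloyd(X, centers, n_iter=20):
    """A few Lloyd iterations starting from the given centers."""
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

def kpod_sketch(X, k, n_outer=10, seed=0):
    """Alternate k-means and centroid-based refilling of missing entries."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(X)
    # Assumption: start by filling missing entries with observed column means.
    X_fill = np.where(miss, np.nanmean(X, axis=0), X)
    centers = X_fill[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_outer):
        labels, centers = lloyd(X_fill, centers)
        # Replace only the missing entries by the assigned center's coordinate.
        X_fill = np.where(miss, centers[labels], X)
    return labels, X_fill

# Toy data (hypothetical): two Gaussian groups with 20% of entries removed.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
X[rng.random(X.shape) < 0.2] = np.nan
labels, X_fill = kpod_sketch(X, k=2)
```

Observed entries are never modified. The filled-in values are a by-product of the clustering rather than draws from an imputation model, which is why such approaches are not considered imputation methods in the MI sense.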
The conclusions of these studies are quite unexpected. First, unlike in regression modelling, clustering after MI is quite robust to the non-independence of imputed values between the M data sets. Second, while direct methods are known to provide inferences similar to those obtained by MI, MI generally outperforms direct methods in clustering.