Unified Embedding and Clustering

Abstract—In this paper, we introduce a novel algorithm that unifies manifold embedding and clustering (UEC), which efficiently predicts the clustering assignments of high-dimensional data points in a new embedding space. The algorithm is based on a bi-objective optimisation problem combining embedding and clustering loss functions. This original formulation allows us to simultaneously preserve the original structure of the data in the embedding space and produce better clustering assignments. Experimental results on a number of real-world datasets show that UEC is competitive with state-of-the-art clustering methods.


I. Introduction
CLUSTERING is a fundamental pillar of unsupervised machine learning and has grown increasingly popular among various research communities. There are two ways of applying clustering algorithms: either operating on data in the original space, or projecting the data into a new space. Such a projection, which can be a linear or non-linear transformation, is often done through dimensionality reduction techniques such as Isomap [1], local linear embedding (LLE) [2], and Laplacian eigenmaps [3]. These techniques embed data in a new space based on the local geometric structure of the manifold. Once the embedding is done, clustering can be applied in a straightforward way on the transformed data [4], [5], [6].
With the advent of deep learning, work on clustering has taken a new direction based on deep representation learning, leading to a number of algorithms such as Deep Embedding Clustering (DEC) [7], Joint Unsupervised LEarning (JULE) [8], DEeP Embedded RegularIzed ClusTering (DEPICT) [9], Deep Adaptive Image Clustering (DAC) [10], Information Maximising Self-Augmented Training (IMSAT) [11], SpectralNet [12], and Deep Clustering by Gaussian Mixture Variational Autoencoders with Graph Embedding (DGG) [13]. Based on nonlinear embedding, these algorithms have shown excellent clustering results. Despite their ability to learn the intrinsic structure of data, they suffer from inherent defects in the embedding space. In addition, they do not focus on learning from the similarities between the data points; therefore, the global and local structure of the data set is not preserved in their low-dimensional spaces. In contrast, manifold learning techniques, including algorithms such as Isomap [1], t-SNE [14] and UMAP [15], tend to preserve similarities and project similar data points to nearby representations, while dissimilar ones are projected far apart, preserving the local and global structure of the data.
In this paper, we propose an algorithm that learns to simultaneously produce a low-dimensional embedding space and assign data to clusters. It is motivated by the fact that manifold learning-based techniques [16], [17] improve the quality of clustering when jointly learned [18]. The proposed algorithm, called Unified Embedding and Clustering (UEC), seeks to learn the embedding space in a way similar to UMAP [15]. First, UEC takes the features extracted by a deep autoencoder [19], [20]. Then it initialises both the low-dimensional space and the cluster assignments. In order to improve the embedding and the quality of the assignments, UEC optimises a two-term objective function that controls both embedding and clustering.
Our contributions in this paper are:
• The proposal of a novel manifold-based learning algorithm that simultaneously learns the embedding space of the data and the clustering assignments. The joint optimisation of the embedding and clustering objective functions leads to better preservation of the similarities of the original space and better clusterability of the embedding space.
• The theoretical derivation of the optimisation process associated with the proposed UEC.
• An extensive evaluation against state-of-the-art clustering methods to show the superiority of UEC.
The rest of this paper is organised as follows: Section II reviews related work. Section III describes the proposed algorithm, whose ingredients and optimisation are formulated in Sections IV and V. Section VI discusses the experimental results before the paper is concluded in Section VII.

II. Related work
Clustering has been extensively studied in the literature, resulting in a plethora of algorithms such as the well-known K-means [21], [22], GMM [23], agglomerative clustering [24], DBSCAN [25], [26], and spectral clustering [27], [28]. They can be categorised into distance-based, density-based or connectivity-based algorithms, depending on their conceptual idea [29]. Despite the popularity of these techniques, they are not efficient in all cases and suffer from high computational complexity when dealing with large-scale data sets.
To address the aforementioned issues, dimensionality reduction techniques have been used to map data into a low-dimensional space. Dimensionality reduction methods can be categorised into deep or manifold learning methods, and manifold embedding methods can focus on local or global structure. The well-known Principal Component Analysis (PCA) [30] seeks to produce a linear transformation of the data into a new feature space. However, due to its linearity, PCA does not perform well in cases where relationships are non-linear. Alternative manifold learning methods exist to overcome this shortcoming; the most popular ones are Isomap [1] and its precursor Multidimensional Scaling (MDS), Locally Linear Embedding (LLE) [2], t-distributed Stochastic Neighbour Embedding (t-SNE) [14], and the most recent algorithm, UMAP [15]. These methods utilise the distances between the original data points to better learn the underlying structure and to preserve the similarity among data points in the low-dimensional space. A number of such techniques [16], [17], [18] have proved powerful in supporting clustering.
Among dimensionality reduction methods, various autoencoders [31], [20] and CNNs [32], [33] have been proposed, showing significant improvements on many computer vision tasks [34], [35], [36], [37], [38]. Recent research on clustering has explored the application of deep learning to both dimensionality reduction and clustering, known as deep clustering [39], [40]. We divide those techniques into two classes according to how they perform their optimisation: the first class includes the techniques that use one loss function, while the second class contains the techniques that use two loss functions (an embedding and a clustering loss). Among the sequential techniques we find DEC [7], N2D [17], JULE [8], IMSAT [11], and DAC [10]. DEC trains an autoencoder to learn features and imposes a soft assignment constraint on them. However, how to effectively pre-train deep networks remains an open problem. N2D [17] passes the features extracted by a deep autoencoder to UMAP, followed by GMM [23] applied on the embedding space to cluster the data. However, its simplicity makes it a trivial method, not suitable for complex data sets.
JULE [8] uses a convolutional neural network with an agglomerative clustering loss, without the need for any reconstruction loss. In every iteration, hierarchical clustering is performed on the forward pass using an affinity measure, and representations are optimised on the backward pass. However, JULE suffers from computational and memory complexity issues, due to the undirected affinity matrix constructed by the agglomerative clustering loss function. IMSAT [11] is based on data augmentation: the network is trained to maximise the mutual information between data and predicted clusters, while being regularised so that the cluster assignment of the original data is consistent with that of the augmented data. DAC [10] uses a convolutional neural network with binary pairwise classification as the clustering loss. It is based on the assumption that the relationship between pairwise images is binary. It also adds a regularisation constraint that helps learn one-hot encoded label features, and the similarity is computed as the dot product of these label features. SpectralNet [12] is a method based on spectral clustering. It attempts to learn a network that maps the training data into the eigenspace of the graph Laplacian matrix. A Siamese network is applied to learn the weights of the connections between the graph's nodes before k-means performs the final clustering.
On the other hand, a set of techniques belongs to the interlaced family, such as DEPICT [9] and DERC [18]. DEPICT [9] consists of a convolutional autoencoder and a single-layer classifier, which learn the latent features and the distribution of the cluster assignments. DEPICT is optimised by minimising the reconstruction error and the relative entropy between the distribution of the cluster assignments and its prior. Deep Embedded Dimensionality Reduction Clustering (DERC) [18] combines a deep autoencoder and t-SNE to represent the data. The centres of the clusters are then initialised using GMM, and a probability-based triplet loss is incorporated to retrain the model and improve its clustering performance.
All the models mentioned above are based on deep learning, which requires tuning many hyper-parameters. In contrast, our model does not need any supervisory signals for hyper-parameter tuning. Models that use an autoencoder to produce the low-dimensional space suffer from the inherent defects that exist in the embedding space. In addition, their way of optimising the embedding space does not focus on preserving the data's global and local structure. In contrast, our model maintains the data's global and local structure by learning from the similarities, which leads to better clustering assignments.

III. Unification of Manifold Embedding and Clustering
Before discussing the details of the proposed method, the list of symbols used in the rest of this paper is introduced in Table I:

Table I. Notation.
Z: the representation of the data in the embedding space
M = {µ_k}_{k=1}^{C}: the set of cluster centres, where C is the number of clusters
p_{i|j}: probability distribution between the input data point y_i and its j-th nearest neighbour; these form the matrix P = {p_ij}, to be defined later
q_ij: probability distribution between the embedded data point z_i and its j-th nearest neighbour; these form the matrix Q
S: soft assignment distribution indicating the probability between the point z_i and the cluster centre µ_k
T: auxiliary target distribution derived from S
CE: binary cross-entropy, the adopted embedding loss function
KL: Kullback-Leibler divergence, the adopted clustering objective function
F: total loss, a weighted sum of CE and KL

The proposed algorithm, UEC, aims to preserve the closeness of the input space when mapping the data into the output space. Thus, data points with similar characteristics are projected nearby, and dissimilar ones are mapped apart. As with general manifold embedding, UEC considers the original data as a high-dimensional graph to be transformed into a lower-dimensional one.
UEC maps high-dimensional input data into a lower-dimensional embedded space while considering clustering constraints. To achieve that, UEC optimises the following objective function:

F = α CE ⊕ β KL    (1)

The first term (CE) is the cross-entropy that assesses the quality of the embedding in the low-dimensional space. It is referred to as the embedding loss function. The second term (KL) is the Kullback-Leibler divergence and is used to assess the clustering quality. It is referred to as the clustering loss function. The quantities α and β define the relative weights of the two loss components of the overall function and serve to control the trade-off between embedding and clustering. We can optimise this function in two different ways, each following an interpretation of ⊕:
1) ⊕ = ',' (comma) indicates sequential operations: evaluate the embedding using CE, then assess the clustering using KL, updating first the embedded data points and then the cluster centres. This scenario is referred to as the sequential update.
2) ⊕ = '+' (plus) indicates that the evaluation of Eq. 1 is done in one combined step (multi-objective function) to update the sought quantities simultaneously. This scenario is referred to as the joint update.
Equation 1 depends on a number of quantities (matrices), namely P, Q, T and S, described in Table I. The matrix P holds the affinity scores between the individual data points and their nearest neighbours in the input space. The matrix Q represents the similarities between the data points in the embedding space. The matrix S is a soft assignment of the embedded data to clusters, while T represents the probabilistic point-to-cluster assignments derived from the soft assignment S (the formal definitions of these matrices follow below). Equation 1 can thus be rewritten as follows:

F(P, Q, T, S) = α CE(P, Q) ⊕ β KL(T‖S)    (2)

Clearly, CE computes the total entropy between P and Q, while KL measures the relative entropy (divergence) between S and T. This objective function is used to analytically compute the coordinates of the set Z and the centres M. Before delving into the details, it is worthwhile to portray the structure of the UEC algorithm (Alg. 1).
Algorithm 1: UEC
1: Compute the affinity matrix P for the input data Y, Eqs. 3 and 4.
2: Initialise the embedding space of dimension d.
3: Initialise the cluster centres.
4: while the convergence criterion is not met do
5:   Compute the affinity matrix Q for the embedded data Z, Eq. 5.
6:   Compute the soft assignment of the embedded data to the clusters S(Z, M), Eq. 6.
7:   Compute the probabilistic point-to-cluster assignments T(Z, M) using the obtained soft assignment S, Eq. 7.
8:   Update the coordinates of the embedded data Z, as in Eq. 13.
9: end while
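To make the flow of Algorithm 1 concrete, the following is a minimal Python sketch of the outer loop. The helper functions (compute_P, spectral_embedding, kmeans_centres, compute_Q, soft_assign, target_T, grad_F_z, grad_KL_mu) are hypothetical stand-ins for the quantities and gradients defined in Sections IV and V, not the paper's implementation.

```python
import numpy as np

def uec(Y, n_clusters, dim=2, lr=1.0, tol=1e-5, max_iter=200):
    P = compute_P(Y)                      # input-space affinities (Eqs. 3-4)
    Z = spectral_embedding(Y, dim)        # step 2: initialise the embedding
    M = kmeans_centres(Z, n_clusters)     # step 3: initialise the centres
    prev = None
    for _ in range(max_iter):
        Q = compute_Q(Z)                  # step 5: embedded-space affinities (Eq. 5)
        S = soft_assign(Z, M)             # step 6: soft assignments (Eq. 6)
        T = target_T(S)                   # step 7: auxiliary target (Eq. 7)
        Z = Z - lr * grad_F_z(P, Q, S, T, Z, M)   # step 8: update Z (Eq. 13)
        M = M - lr * grad_KL_mu(S, T, Z, M)       # centre update (Eqs. 11-12)
        labels = S.argmax(axis=1)
        if prev is not None and np.mean(labels != prev) < tol:
            break                         # change in assignments below threshold
        prev = labels
    return Z, M, labels
```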

IV. Formulation of the algorithm ingredients
A. Computation of the affinity matrix P
The affinity matrix P represents the similarity scores between pairs of data points using the exponential probability:

p_{i|j} = exp( −( d(y_i, y_j) − ρ_i ) / σ_i )    (3)

where d(y_i, y_j) is the distance between the i-th data point and its j-th nearest neighbour, and ρ_i is the distance between the i-th data point and its first nearest neighbour. The quantities ρ_i and σ_i ensure the local connectivity of the manifold. We use a symmetrisation of the high-dimensional probability, since the weight of the edge between the i-th and j-th nodes is not equal to the weight between the j-th and i-th nodes. The final formulation of P is given as follows:

p_ij = p_{i|j} + p_{j|i} − p_{i|j} p_{j|i}    (4)
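As an illustration, a minimal numpy sketch of Eqs. 3 and 4 follows. It replaces σ_i with a crude bandwidth stand-in (the mean k-NN distance minus ρ_i); in practice σ_i is found by a binary search as in UMAP, so the values below are only indicative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def affinity_P(Y, k=15):
    D = cdist(Y, Y)                                    # pairwise distances d(y_i, y_j)
    np.fill_diagonal(D, np.inf)
    knn = np.sort(D, axis=1)[:, :k]                    # distances to the k nearest neighbours
    rho = knn[:, 0]                                    # distance to the first neighbour
    sigma = np.maximum(knn.mean(axis=1) - rho, 1e-12)  # stand-in for the UMAP bandwidth
    P = np.exp(-(D - rho[:, None]) / sigma[:, None])   # Eq. 3
    P[D > knn[:, -1][:, None]] = 0.0                   # keep only the k-NN edges
    return P + P.T - P * P.T                           # Eq. 4: symmetrisation
```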

B. Initialisation
To initialise the embedding space and the centres of the clusters, we use spectral embedding [41] and a centroid-based algorithm (e.g., k-means, GMM), respectively.

C. Computation of Q
The affinity matrix Q represents the similarity scores between each embedded data point and its neighbours. It is computed using a smooth approximation of the membership strength:

q_ij = ( 1 + a‖z_i − z_j‖^{2b} )^{−1}    (5)

The parameters a and b are defined using a piece-wise non-linear least-squares fitting method.
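A possible sketch of that fitting step, modelled on the procedure used by UMAP, is shown below; the min_dist and spread parameters and the shape of the target curve ψ are assumptions, not taken from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_ab(min_dist=0.1, spread=1.0):
    x = np.linspace(0.0, 3.0 * spread, 300)
    # piece-wise target: flat at 1 up to min_dist, then exponential decay
    psi = np.where(x <= min_dist, 1.0, np.exp(-(x - min_dist) / spread))
    phi = lambda x, a, b: 1.0 / (1.0 + a * x ** (2 * b))   # the family of Eq. 5
    (a, b), _ = curve_fit(phi, x, psi)
    return a, b
```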

D. Computation of the soft assignments S
The matrix S is computed using a smooth approximation of the membership strength between the embedded points z_i and the cluster centres µ_k as follows:

S_ik = ( 1 + a‖z_i − µ_k‖^{2b} )^{−1}    (6)

We could also consider the normalised form used in DEC [7] (which is inspired by t-SNE [14]):

s_ik = (1 + ‖z_i − µ_k‖²)^{−1} / Σ_{k′} (1 + ‖z_i − µ_{k′}‖²)^{−1}

However, it has been shown in UMAP [15] that the normalisation increases processing time without bringing any improvement in accuracy. It can be experimentally shown that avoiding normalisation does not affect the quality of clustering.
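A minimal vectorised sketch of Eq. 6, reusing the a and b fitted above (names illustrative):

```python
import numpy as np

def soft_assign(Z, M, a, b):
    d2 = ((Z[:, None, :] - M[None, :, :]) ** 2).sum(-1)  # squared distances ||z_i - mu_k||^2
    return 1.0 / (1.0 + a * d2 ** b)                     # Eq. 6: (1 + a ||.||^(2b))^-1
```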

E. Computation of the auxiliary target distribution T
The matrix T should satisfy three constraints: (1) improving the cluster purity; (2) ensuring that high-confidence assignments get more emphasis; and (3) preventing large clusters from distorting the embedding space by normalising the loss contribution of each centre. A formulation that addresses these constraints is given as follows:

T_ik = ( S_ik² / Σ_i S_ik ) / Σ_{k′} ( S_{ik′}² / Σ_i S_{ik′} )    (7)

F. Formulation of the optimisation problem
The coordinates of the data points and the centres of the clusters are updated in light of the minimisation of two objective functions: the embedding and the clustering loss functions. These functions are coupled in Eq. 2. Before discussing the two optimisation options presented earlier, we first formulate the first term, the embedding objective loss, represented as a binary cross-entropy (CE):

CE(P, Q) = Σ_i Σ_j [ p_ij log( p_ij / q_ij ) + (1 − p_ij) log( (1 − p_ij) / (1 − q_ij) ) ]    (8)

where P and Q are the probabilistic similarity scores of the input data Y and of the embedded data Z, respectively. The second term is the clustering objective function, defined as the Kullback-Leibler (KL) divergence between the soft assignments S and the auxiliary distribution T. The KL divergence refines the clusters by learning from the high-confidence assignments with the help of the auxiliary target distribution:

KL(T‖S) = Σ_i Σ_k T_ik log( T_ik / S_ik )    (9)

Now that the objective function, its individual terms and all the quantities used have been defined, we discuss the optimisation problem used to update the coordinates of Z and the cluster centres.
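A minimal numpy sketch of Eq. 7 and the two loss terms (Eqs. 8 and 9); the small EPS constant is an assumption added here to avoid log(0):

```python
import numpy as np

EPS = 1e-12

def target_T(S):
    w = S ** 2 / (S.sum(axis=0) + EPS)       # sharpen S, normalise per cluster (Eq. 7)
    return w / (w.sum(axis=1, keepdims=True) + EPS)

def embedding_loss(P, Q):                    # binary cross-entropy, Eq. 8
    return np.sum(P * np.log((P + EPS) / (Q + EPS))
                  + (1.0 - P) * np.log((1.0 - P + EPS) / (1.0 - Q + EPS)))

def clustering_loss(T, S):                   # KL divergence, Eq. 9
    return np.sum(T * np.log((T + EPS) / (S + EPS)))
```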

V. Optimisation of the objective function
The coordinates of the data points z_i and the cluster centres µ_k are updated at each time step t until the convergence criterion is met. We use Stochastic Gradient Descent (SGD) and negative sampling as optimisation algorithms, due to their rapid convergence and low memory consumption. The optimisation steps, as illustrated in Alg. 1, are repeated until the change in the cluster assignment of the points is less than a threshold of 10^{-5}.
As indicated earlier, the update of z_i and µ_k can take place according to two variants.

A. Sequential (Comma) variant
Here the terms of the objective function are sequentially evaluated.
1) Update of the embedding: The coordinates of the data in the embedded space are updated twice, in a sequential manner. The first update stems from the embedding loss:

z_i(t+1) = z_i(t) − η ∂CE/∂z_i

where η is a learning rate. The quantity ∂CE/∂z_i is expressed as follows:

∂CE/∂z_i = Σ_j [ 2ab‖z_i − z_j‖^{2(b−1)} / (1 + a‖z_i − z_j‖^{2b}) p_ij − 2b / ( ‖z_i − z_j‖² (1 + a‖z_i − z_j‖^{2b}) ) (1 − p_ij) ] (z_i − z_j)    (10)

For more details, please see Appendix A. The second update of the coordinates of z_i comes from the clustering loss.
The partial derivative of the clustering loss function (KL) given by Eq. 9 w.r.t. z_i drives this second update:

z_i(t+1) = z_i(t) − η ∂KL/∂z_i

2) Update of the cluster centres: The centres µ_k do not depend on the embedding loss function. Hence, the centres of the clusters are updated using the derivative of the clustering loss function, the KL divergence (Eq. 9), as follows:

µ_k(t+1) = µ_k(t) − η ∂KL/∂µ_k    (11)

where η is a learning rate, and ∂KL/∂µ_k is as follows:

∂KL/∂µ_k = −2ab Σ_i T_ik S_ik ‖z_i − µ_k‖^{2(b−1)} (z_i − µ_k)    (12)
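Put together, one sequential iteration could look like the following sketch; the grad_* helpers are hypothetical stand-ins for Eqs. 10-12, and S, T are refreshed by the outer loop of Alg. 1:

```python
def comma_step(Z, M, P, S, T, lr):
    Z = Z - lr * grad_CE_z(Z, P)             # first update: embedding loss (Eq. 10)
    Z = Z - lr * grad_KL_z(Z, M, S, T)       # second update: clustering loss
    M = M - lr * grad_KL_mu(Z, M, S, T)      # centre update (Eqs. 11-12)
    return Z, M
```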

B. Combined (plus) variant
The main difference between the combined and the sequential variants is that, in the former, the clustering loss influences the computation of both z_i and µ_k.
1) Update of the embedding: According to this second variant, the coordinates of the data points z_i are updated using the embedding and the clustering loss functions simultaneously.
From Eq. 1, the update is executed as follows:

z_i(t+1) = z_i(t) − η ( α ∂CE/∂z_i + β ∂KL/∂z_i )

where ∂CE/∂z_i is given by Eq. 10 and ∂KL/∂z_i is given by Eq. 15. The derivative of KL is the sum of two terms, since the derivative of T_{i′k} with respect to z_i is computed according to two cases: 1) when i′ = i and 2) when i′ ≠ i (please see Appendix B for more details).
The final formulation of the update (Eq. 14) is obtained by combining Eq. 10 and Eq. 15 in the derivative of the total loss (Eq. 16). 2) Update of the cluster centres: Since CE does not depend on µ_k, only the clustering loss function KL is relevant:

µ_k(t+1) = µ_k(t) − η β ∂KL/∂µ_k

The derivative of the KL divergence is the sum of two parts, requiring the consideration of two cases: when k′ = k and when k′ ≠ k. For more details, please see Appendix B.
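As a sketch, one joint iteration then becomes (again with hypothetical grad_* helpers standing in for Eqs. 10, 15 and 16):

```python
def plus_step(Z, M, P, S, T, lr, alpha, beta):
    # one combined gradient step on F = alpha*CE + beta*KL (Eqs. 14 and 16)
    Z = Z - lr * (alpha * grad_CE_z(Z, P) + beta * grad_KL_z(Z, M, S, T))
    M = M - lr * beta * grad_KL_mu(Z, M, S, T)
    return Z, M
```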

C. Light combined variant
In the combined variant, we consider that the updates of z_i and µ_k depend on the auxiliary target distribution T. While this variant improves the performance of the proposed UEC significantly, it is not highly efficient in terms of computational time, due to the heavy computation involved in the gradient descent update; hence this light version of the combined variant. The underlying assumption of this third variant is to consider T_ik as not dependent on z_i and µ_k.
1) Update of the embedding: The KL divergence is derived with respect to z_i:

∂KL/∂z_i = 2ab Σ_k T_ik S_ik ‖z_i − µ_k‖^{2(b−1)} (z_i − µ_k)    (17)

By substituting the derivative of CE (Eq. 10) and the derivative of the KL divergence (Eq. 17) in the total loss function (Eq. 16), we obtain the final update of z_i(t+1), as shown in Eq. 14.
2) Update of the cluster centres: The derivative of the total loss F (Eq. 1) with respect to µ_k is computed only for the KL divergence, the only term that depends on the cluster centroids. The formulation obtained in the sequential variant is applied here for updating the centres (see Eqs. 11 and 12).
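Because T is frozen, the light-variant gradients have the closed forms above and can be sketched compactly; the chain-rule factor below follows Appendix C, and the helper name is illustrative:

```python
import numpy as np

def light_grads(Z, M, T, a, b):
    diff = Z[:, None, :] - M[None, :, :]           # differences z_i - mu_k, shape (n, C, d)
    d2 = (diff ** 2).sum(-1) + 1e-12               # squared distances
    S = 1.0 / (1.0 + a * d2 ** b)                  # soft assignments (Eq. 6)
    coef = 2.0 * a * b * d2 ** (b - 1.0) * S * T   # shared chain-rule factor per (i, k)
    grad_z = (coef[..., None] * diff).sum(axis=1)      # dKL/dz_i  (Eq. 17)
    grad_mu = -(coef[..., None] * diff).sum(axis=0)    # dKL/dmu_k (Eq. 12)
    return grad_z, grad_mu
```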

VI. Experiments and discussion
In this section, we show the performance of the three proposed variants of UEC on a set of benchmarks. Specifically, we discuss the following experiments:
• In the first experiment, we evaluate the three variants and compare their performance, see Sec. VI-A.
• In the second experiment, we study the sensitivity of the UEC variants with respect to the initialisation of the clusters' centres, see Sec. VI-B.
• In the third experiment, we discuss the weights α and β in Eq. 1 and their effect, see Sec. VI-C.
• In the fourth experiment, we compare the performance of the UEC variants against state-of-the-art clustering methods, see Sec. VI-D.
• The final experiment discusses the performance of the UEC variants in terms of internal clustering validation, see Sec. VI-E.

Datasets
We conduct experiments on four benchmark datasets: 1) USPS: consists of 9298 images belonging to 10 different classes; each image is a 16x16 grayscale image [42]. 2) MNIST: consists of 10 handwritten digits with a total of 70,000 images; each image is a 28x28 grayscale image [43]. 3) CIFAR10: a dataset of 60,000 samples with 10 classes, where each sample is a 32x32 RGB image [44]. 4) Reuters 10K: an English news dataset [45] consisting of 10,700 documents. Similar to [7], we use four categories/classes and discard all documents labelled with multiple root categories. We removed stop words and used the tf-idf representation of the 2000 most frequent words.

Evaluation Metrics
Accuracy: We evaluate all clustering methods with clustering ACCuracy (ACC), defined as the best match between the ground truth and the predicted labels:

ACC = max_m (1/n) Σ_{i=1}^{n} 1{ GT_i = m(PL_i) }

where GT_i and PL_i are the ground truth and predicted label of example z_i respectively, and m is a one-to-one mapping from predicted labels to ground truth labels.
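The optimal mapping m is usually found with the Hungarian algorithm; a short sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(gt, pl):
    n = max(gt.max(), pl.max()) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for g, p in zip(gt, pl):
        count[p, g] += 1                       # co-occurrence of predicted and true labels
    row, col = linear_sum_assignment(-count)   # best one-to-one mapping m
    return count[row, col].sum() / len(gt)
```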

Normalised Mutual Information (NMI): a normalisation of the mutual information score that scales the results between 0, meaning no mutual information, and 1, representing perfect correlation. It is given by the mutual information I(GT; PL) divided by a normalising function of the entropies H(GT) and H(PL).
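In practice both metrics are one-liners with scikit-learn (whose NMI uses the arithmetic mean of the two entropies as its default normaliser):

```python
from sklearn.metrics import normalized_mutual_info_score

nmi = normalized_mutual_info_score(gt, pl)   # in [0, 1], higher is better
```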

Implementation Details
In the optimisation step of UEC, the minimisation of our weighted sum function is conducted using Stochastic Gradient Descent and negative sampling. We set the learning rate to 1.0 and then decrease it by a factor of 10 every 10 epochs. The number of epochs is set to 200 for big data sets and 500 for small ones. For the image data sets, we use a deep autoencoder to extract the features. Our UEC source code is publicly available on:

A. Comparison of the variants
In this experiment, we evaluate the performance of each variant of UEC (Comma variant, Plus variant, Light plus variant) in terms of accuracy and execution time. Table II reports the clustering accuracy on the benchmark data sets. We can see that the clustering performance of the Plus variant is better than that of the other two versions. However, Table III shows that the Comma variant and the Light plus variant are faster than the Plus variant, because the Plus variant involves a large number of computational operations, as reflected by the number of equations in Appendix B. The Comma variant and the Light plus variant are close to each other in execution time; however, the Light plus variant has better clustering performance than the Comma variant. These results therefore support our claim that the joint optimisation of the embedding and clustering loss functions improves the clustering performance of our algorithm, while the light version reduces the execution time. Notice: based on the results achieved by the three versions of UEC, in the upcoming experiments in Sec. VI-B and Sec. VI-C we use only the Light plus variant, since it is the fastest of the three versions and its clustering performance is close to that of the Plus variant.

B. Centre Initialisation
In this experiment, we analyse the sensitivity of our algorithm to the initialisation of the clusters' centres. We perform three types of initialisation, using k-means, Gaussian mixture models (GMM) and randomly chosen medoids. The experiments are conducted using the Light plus variant on the original data sets (see Section VI-A). Table IV shows the performance of our algorithm using the three initialisations, evaluated by the accuracy measure. We can see that our algorithm is not sensitive to the type of initialisation adopted (even with random initialisation, it achieves good results). Notice: based on these results, in the upcoming experiments we use k-means to initialise the centres.

C. Effect of α and β
In this experiment, we study the effect of the parameters α and β on the loss function. Recall that α is related to the embedding loss, while β is related to clustering. The goal here is to vary them in the unit interval [0, 1] and observe their effect on the algorithm's performance and execution time. The experiments are conducted using the Light plus variant, following the note in Sec. VI-A.
To analyse the effect of the parameters, two experiments are designed as follows. 1) Experiment 1: we study the effect of the parameters on the gradient magnitude of both the embedding and clustering loss functions with respect to z_i by varying α and β in the interval [0, 1]. Using the USPS dataset, the accuracy results obtained are presented in Tab. V (see Fig. 2 for more visibility). Clearly, the algorithm achieves its best results when α and β both go to 1, and its worst results when they go to 0 (since only the centroids move). If we allow the data points to move, the model makes a quantum leap in its performance: as we see in Tab. V, comparing the outcomes when the parameters are set to 0 and 0.2 respectively, the accuracy increases from 0.61 to 0.95. As β increases, the performance improves, while as α increases, the performance remains relatively constant. We can also observe that the execution time increases when α decreases (because the algorithm needs more epochs to converge). 2) Experiment 2: Fig. 3 shows the accuracy of the algorithm after the clipping transformation; we can observe a slight positive change in the performance results.
D. Comparison with state-of-the-art methods
Tables VII and VIII outline the performance in terms of accuracy and NMI, respectively. The top three accuracy scores are highlighted, and the scores reported in [9] and [10] are marked accordingly. The dash mark (-) means that the result was not reported in the referenced paper. Table VII shows that UEC outperforms the other algorithms across all benchmarks, surpassing most of the deep clustering methods by a significant margin. It is worth mentioning that deep autoencoders improved the performance of the UEC algorithm. The latter did not need any fine-tuning, in contrast to other deep clustering models, which require tweaking several hyper-parameters and fine-tuning to achieve their best results. This advantage makes UEC a significantly better embedding-and-clustering choice compared to the alternative clustering models.
The results achieved by UEC, and even by the other methods, prove that representation learning plays an important role in the clustering process. The fact that UEC is a manifold-based embedding method helps the clustering process and improves its performance.

E. Internal validation
The goal of clustering is to assign similar data objects to the same cluster and dissimilar ones to different clusters. Internal validation quantifies the quality of clustering, usually using two criteria: compactness and separation. The first criterion measures how similar the data objects are within individual clusters. The second criterion measures how distinct or well-separated clusters are from each other. These two criteria are often embedded in various clustering quality indices. In these experiments, we use three well-known internal validation indices: Davies-Bouldin (DB) [46], Silhouette coefficient (S) [47] and Calinski-Harabasz (CH) [48]. Tables IX, X and XI show the clustering quality scores of the UEC variants. They indicate that the Plus variant is the best among the three variants across the three indices; in fact, the UEC variants obtain approximately comparable scores.
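The three indices are available in scikit-learn and can be computed directly on the embedding and the predicted labels:

```python
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

db = davies_bouldin_score(Z, labels)        # lower is better
s = silhouette_score(Z, labels)             # in [-1, 1], higher is better
ch = calinski_harabasz_score(Z, labels)     # higher is better
```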

VII. Conclusion
We have proposed a new algorithm, UEC, that jointly optimises the representation and clustering of data. UEC is a manifold-based clustering algorithm that preserves the local and global structure of the data and seeks to learn the manifold within the embedding space. It comes in three variants resulting from the optimisation process: the Comma variant, the Plus variant, and the Light plus variant.
The empirical results obtained through performance and sensitivity analysis have shown the high effectiveness of UEC across a number of large-scale benchmarks and against a number of baseline algorithms.
Future work will investigate different optimisation techniques and objective functions to improve the performance of the algorithm in terms of both the evaluation measures and run-time.
Appendix A
Derivation of the embedding objective function (cross-entropy)
The cross-entropy function depends only on z_i, so in this part we derive CE with respect to z_i. For the partial derivative of CE with respect to z_i, we use the same derivative as UMAP.
Let us first compute the derivatives of q_ij and 1 − q_ij:

∂q_ij/∂z_i = −2ab‖z_i − z_j‖^{2(b−1)} q_ij² (z_i − z_j)    (18)

∂(1 − q_ij)/∂z_i = −∂q_ij/∂z_i    (19)

Now we can derive the CE objective function. Since p_ij does not depend on z_i, the derivatives of p_ij log(p_ij) and (1 − p_ij) log(1 − p_ij) are zero, so CE can be derived as follows:

∂CE/∂z_i = Σ_j [ −(p_ij/q_ij) + (1 − p_ij)/(1 − q_ij) ] ∂q_ij/∂z_i    (20)

Substituting Eq. 18 and Eq. 19 into Eq. 20, and simplifying the exponential powers (which are otherwise long to compute), we can write the derivative of the embedding loss function as follows:

∂CE/∂z_i = Σ_j [ 2ab‖z_i − z_j‖^{2(b−1)} / (1 + a‖z_i − z_j‖^{2b}) p_ij − 2b / ( ‖z_i − z_j‖² (1 + a‖z_i − z_j‖^{2b}) ) (1 − p_ij) ] (z_i − z_j)    (21)
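A dense numpy sketch of Eq. 21 follows; a practical implementation would instead sample edges and negative pairs, and the eps term is an assumption added here for numerical stability:

```python
import numpy as np

def grad_CE(Z, P, a, b, eps=1e-3):
    diff = Z[:, None, :] - Z[None, :, :]             # z_i - z_j for all pairs
    d2 = (diff ** 2).sum(-1) + eps                   # squared distances
    denom = 1.0 + a * d2 ** b
    attract = 2.0 * a * b * d2 ** (b - 1.0) / denom * P        # attractive term
    repulse = 2.0 * b / (d2 * denom) * (1.0 - P)               # repulsive term
    return ((attract - repulse)[..., None] * diff).sum(axis=1)  # Eq. 21
```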

Appendix B
Plus variant derivations
We describe how the clustering loss function is derived, considering that the auxiliary target distribution T_ik depends on z_i and µ_k. First, let us recall that S_ik and T_ik are expressed as follows:

S_ik = ( 1 + a‖z_i − µ_k‖^{2b} )^{−1}    (22)

T_ik = ( S_ik² / Σ_l S_lk ) / Σ_m ( S_im² / Σ_l S_lm )    (23)

The derivative of the objective function F with respect to z_i is:

∂F/∂z_i = α ∂CE/∂z_i + β ∂KL/∂z_i    (24)

The derivation of the cross-entropy is given in Appendix A, while the derivation of the KL divergence with respect to z_i is given in the following.
The derivative is formulated as a sum of two parts, since the derivative of T_{i′k} with respect to z_i must be computed according to two cases: the first when i′ = i and the second when i′ ≠ i. Since, when i′ ≠ i, S_{i′k} does not depend on z_i (see Eq. 22), we can simplify the derivative as in Eq. 25. The derivative of S_ik with respect to z_i is:

∂S_ik/∂z_i = −2ab‖z_i − µ_k‖^{2(b−1)} S_ik² (z_i − µ_k)    (26)

To derive T_{i′k}, we need to distinguish the two cases: the first corresponds to i′ = i (written as T_ik) and the second to i′ ≠ i (written as T_{i′k}). The expression of T_ik and T_{i′k} is the same as in Eq. 23.
Let us show the derivation of T_ik with respect to z_i by considering the numerator and the denominator separately. The numerator is given as:

N_ik = S_ik² / Σ_l S_lk    (27)

and its derivative follows, since S_lk depends on z_i only in the case l = i (Eq. 28). The denominator is expressed as:

D_i = Σ_m N_im    (29)

and its derivative with respect to z_i is obtained in the same way, for all m, from ∂(Σ_l S_lm)/∂z_i (Eq. 30). The derivative of T_ik with respect to z_i is then:

∂T_ik/∂z_i = ( ∂N_ik/∂z_i · D_i − N_ik · ∂D_i/∂z_i ) / D_i²    (31)

where the quantities are substituted by Eqs. 27, 28, 29 and 30 respectively in Eq. 31. Now we compute the derivative of T_{i′k} with respect to z_i, where (as in Eq. 25) T_{i′k} denotes the case i′ ≠ i. The numerator and the denominator are again derived separately. Like in Eq. 27, the numerator is:

N_{i′k} = S_{i′k}² / Σ_l S_lk    (32)

Since S_{i′k} does not depend on z_i, its derivative is 0; also, ∂S_lk/∂z_i ≠ 0 only when l = i, and ∂S_lk/∂z_i = 0 otherwise. The derivative of N_{i′k} with respect to z_i is then given by Eq. 33, where ∂S_ik/∂z_i is given by Eq. 26. Like in Eq. 29, the denominator is:

D_{i′} = Σ_m N_{i′m}    (34)

Since S_{i′m} does not depend on z_i, its derivative is 0.
We can therefore write the derivative of D_{i′} as in Eq. 35. The derivative of T_{i′k} with respect to z_i is then:

∂T_{i′k}/∂z_i = ( ∂N_{i′k}/∂z_i · D_{i′} − N_{i′k} · ∂D_{i′}/∂z_i ) / D_{i′}²    (36)

where the quantities are substituted by Eqs. 32, 33, 34 and 35 respectively in Eq. 36. Now let us recall the derivative of the KL divergence, Eq. 25. We substitute S_ik and ∂S_ik/∂z_i by Eqs. 22 and 26 in Eq. 25, which can then be simplified; substituting T_ik by its expression, we obtain the final formulation of the derivative of KL (Eq. 37).
A. Derivative of the total loss function with respect to z_i
Now we assemble the two parts by substituting the derivative of the cross-entropy (Eq. 21) and the derivative of the KL divergence (Eq. 37) into the derivative of the total loss function, obtaining Eq. 38. The update of z_i is performed using ∂F/∂z_i, where α and β are parameters to be set.
However, Eq. 38 involves the two fractions log(T_ik/S_ik) and log(T_{i′k}/S_{i′k}), which could involve a division by 0. To address this problem, we analyse the definition domain of S_ik and T_ik, the sign of the hyper-parameter a, and the definition domain of log(T_ik/S_ik).

The sign of the hyper-parameter 'a'
Define φ : R^d × R^d → [0, 1], a smooth approximation of the membership strength between two points in R^d:

φ(x, y) = ( 1 + a‖x − y‖^{2b} )^{−1}

where a and b are chosen by non-linear least-squares fitting against a curve ψ : R^d × R^d → [0, 1] that equals 1 below a minimum distance threshold and decays exponentially beyond it. Since q_ij = φ(z_i, z_j) ∈ [0, 1], the quantity a‖z_i − z_j‖^{2b} must be non-negative; as ‖z_i − z_j‖^{2b} is always positive, we conclude that 'a' is also always positive. Similarly, define S : R^d × R^d → [0, 1], a soft assignment between the data points and the cluster centroids in R^d.

Definition domain of S_ik and T_ik
S_ik = (1 + a‖z_i − µ_k‖^{2b})^{−1} is well defined, since a is positive, and takes its values in [0, 1]. Since T_ik is computed from S_ik, it is also defined on the interval [0, 1].

Definition domain of log(T_ik/S_ik)
The term log(T_ik/S_ik) is defined when: (1) S_ik ≠ 0; (2) T_ik ≠ 0; and (3) S_ik and T_ik have the same sign (Eq. 39). Let us treat these three conditions separately. The first two conditions are satisfied whenever the embedded point z_i is at a finite distance from the centres, since S_ik = (1 + a‖z_i − µ_k‖^{2b})^{−1} > 0 and T_ik is built from S_ik.

Condition (3):
T_ik and S_ik must have the same sign. This condition is always satisfied, since S_ik ∈ [0, 1] and T_ik is computed from S_ik, so they always have the same sign.
From what we mentioned above, we can conclude our solutions to the problem of division by zero, the first of which is shown in Algorithm B-A.
B. The derivative of the objective function with respect to µ_k
Now we derive our objective function with respect to µ_k. First, let us recall our weighted sum function and its derivative:

∂F/∂µ_k = α ∂CE/∂µ_k + β ∂KL/∂µ_k

CE does not depend on the cluster centroids, so its derivative is zero. Therefore, we only compute the derivative of the KL divergence with respect to µ_k (Eq. 42). We divide this derivative into a sum of two parts, since the derivative of T_{ik′} with respect to µ_k must be computed in two cases: the first when k′ = k and the second when k′ ≠ k. To simplify the derivative of the KL divergence with respect to µ_k, we calculate the derivatives of S_ik = (1 + a‖z_i − µ_k‖^{2b})^{−1} and T_ik w.r.t. µ_k in the case k′ = k, and the derivatives of S_{ik′} and T_{ik′} w.r.t. µ_k in the case k′ ≠ k. We start with:

∂S_ik/∂µ_k = 2ab‖z_i − µ_k‖^{2(b−1)} S_ik² (z_i − µ_k)    (43)

and since S_{ik′} does not depend on µ_k when k′ ≠ k, we conclude that ∂S_{ik′}/∂µ_k = 0. We now derive T_ik (case k′ = k) and T_{ik′} (case k′ ≠ k) with respect to µ_k. Starting with T_ik, we derive the numerator and the denominator separately. The numerator is N_ik = S_ik²/Σ_l S_lk; its derivative ∂N_ik/∂µ_k follows from Eq. 43, where ∂S_lk/∂µ_k is deduced from Eq. 43 for each l. The denominator is D_i = Σ_m N_im, whose derivative involves ∂S_im/∂µ_k, which is non-zero only when m = k. The derivative of T_ik with respect to µ_k then follows by the quotient rule. We proceed in the same way for the derivative of T_{ik′} with respect to µ_k in the case k′ ≠ k. Writing T_{ik′} with numerator N_{ik′} and denominator D_i: since S_{ik′} does not depend on µ_k, we deduce from Eqs. 22 and 43 that ∂S_{ik′}/∂µ_k = 0; likewise, Σ_l S_{lk′} does not depend on µ_k, so Σ_l ∂S_{lk′}/∂µ_k = 0. We can conclude that the derivative of N_{ik′} with respect to µ_k is zero. For the denominator, we reuse the derivative of D_i computed above, and from the previous equations we conclude the derivative of T_{ik′}. Finally, we substitute the previous equations into the derivative of the KL divergence (Eq. 42) with respect to µ_k, substitute S_ik and ∂S_ik/∂µ_k by Eq. 22 and Eq. 43, and multiply by T_ik/S_ik to obtain a simpler form, which yields the final expression.

Appendix C
Light plus variant derivations
Our optimisation purpose is to update the data points z_i and the centroids µ_k. First, we derive the function F with respect to z_i; then we compute the derivative of the weighted sum function with respect to µ_k. However, the CE function does not depend on the cluster centres, so for the centres we compute only the derivative of the clustering loss function. The derivative of the weighted sum function is:

∂F/∂z_i = α ∂CE/∂z_i + β ∂KL/∂z_i    (45)

A. Derivation of the KL divergence with respect to z_i
Notice that, for the data points z_i, two objective functions need their derivatives computed with respect to z_i. The derivation of the CE function is already explained in Appendix A. In this part, we provide the details of the derivative of the KL divergence with respect to z_i. Let us recall the KL divergence:

KL(T‖S) = Σ_i Σ_k ( T_ik log T_ik − T_ik log S_ik )

When updating z_i, T_ik has already been computed and is considered a constant: T_ik does not depend on z_i, so the derivative of T_ik log T_ik with respect to z_i is zero. We obtain:

∂KL/∂z_i = −Σ_k (T_ik/S_ik) ∂S_ik/∂z_i    (49)

We can derive S_ik with respect to z_i as follows:

∂S_ik/∂z_i = −2ab‖z_i − µ_k‖^{2(b−1)} S_ik² (z_i − µ_k)

B. Derivative of the total loss function with respect to z_i
Now we assemble the two parts by substituting the derivative of the cross-entropy (Eq. 21) and the derivative of the KL divergence (Eq. 49) into the total loss function (Eq. 45). The update of z_i is then performed using ∂F/∂z_i; we just have to choose α and β.
C. The derivative of the objective function with respect to µ_k
Now we derive our objective function with respect to µ_k. First, let us recall our weighted sum function:

F(P, Q, T, S) = α CE(P, Q) + β KL(T‖S)

and the derivative of our weighted sum function:

∂F/∂µ_k = α ∂CE/∂µ_k + β ∂KL/∂µ_k

CE does not depend on the cluster centroids, so its derivative is zero. Therefore, we compute only the derivative of the KL divergence with respect to µ_k:

KL(T‖S) = Σ_i Σ_k ( T_ik log T_ik − T_ik log S_ik )

T_ik is considered a constant, since it does not depend on µ_k; thus the derivative of T_ik log T_ik with respect to µ_k is zero. The partial derivative of the KL function with respect to µ_k is:

∂KL/∂µ_k = −Σ_i (T_ik/S_ik) ∂S_ik/∂µ_k    (52)

To simplify this derivative, we calculate the derivative of S_ik w.r.t. µ_k:

∂S_ik/∂µ_k = 2ab‖z_i − µ_k‖^{2(b−1)} S_ik² (z_i − µ_k)    (53)

Substituting S_ik and its derivative (Eq. 53) into Eq. 52, we obtain the final expression.

Appendix D
Definition of the value domains of the CE and KL divergence gradients with respect to z_i
A. Value domain of the CE gradient with respect to z_i
The gradient of CE (Eq. 21) has two terms, so we need to study their value domains. Let us start with the first term (Eq. 57): it is preceded by a negative sign, so its value domain is ]−∞, 0]. The other term is non-negative, so its value domain is [0, +∞[. Hence, the value domain of the CE gradient is ]−∞, +∞[.
B. Value domain of the KL divergence gradient with respect to z_i
In this part, we study the value domain of the KL divergence gradient w.r.t. z_i in the two variants.
1) The derivative of the KL divergence, Plus variant: Let us recall the derivative of the KL divergence with respect to z_i (Eq. 37). As for the KL gradient, we proved in the division-by-zero subsection (see Appendix B) that the value domains of S_ik and T_ik lie within [0, 1]; hence the value domain of the KL derivative is [0, +∞[.
2) The derivative of the KL divergence, Light plus variant: Let us recall the derivative of the KL divergence with respect to z_i (Eq. 49); the same argument applies, since S_ik and T_ik take their values in [0, 1].