Deep Clustering with Self-supervision using Pairwise Data Similarities

—Deep clustering incorporates embedding into clustering to find a lower-dimensional space appropriate for clustering. Most of the existing methods try to group similar data points by simultaneously minimizing clustering and reconstruction losses, employing an autoencoder (AE). However, they all ignore the relevant, useful information available within pairwise data relationships. In this paper, we propose a novel deep clustering framework with self-supervision using pairwise data similarities (DCSS). The proposed method consists of two successive phases. First, we propose a novel AE-based approach that aims to aggregate similar data points near a common group center in the latent space of an AE. The AE's latent space is obtained by minimizing weighted reconstruction and centering losses of data points, where the weights are defined based on the similarity of data points and group centers. In the second phase, we map the AE's latent space, using a fully connected network MNet, onto a K-dimensional space used to derive the final data cluster assignments, where K is the number of clusters. MNet is trained to strengthen (weaken) the similarity of similar (dissimilar) samples. Experimental results on multiple benchmark datasets demonstrate the effectiveness of DCSS for data clustering and as a general framework for boosting state-of-the-art clustering methods.


INTRODUCTION
In many science and practical applications, information about the category (aka label) of data samples is inaccessible or expensive to collect. Clustering, as a major data analysis tool in pattern recognition and machine learning, endeavors to gather essential information from unlabeled data samples. Clustering can address many issues in different real-world applications, such as celestial data analysis [1], medical analysis [2], tumor gene analysis [3], and data retrieval [4,5,6,7]. The main goal of clustering methods is to partition data points based on a similarity metric. Although a wide variety of clustering algorithms have recently been developed, the more traditional clustering algorithms k-means [8] and fuzzy c-means [9] are still appealing due to their simplicity. Nevertheless, these algorithms do not provide proper clustering performance when dealing with an uneven distribution of samples. Moreover, they usually fail to properly cluster high-dimensional data points due to the curse of dimensionality; this makes them inappropriate in many recent applications where data points have many features [10]. Nowadays, novel deep-learning-based clustering methods have been widely developed in many applications such as image segmentation [11], social network analysis [12], face recognition [13], and machine vision [14]. The ultimate goal of these methods is to embed original data samples into a lower-dimensional space (aka latent space), where groups of data points can be distinguished by applying common algorithms such as k-means and fuzzy c-means.

[Fig. 1 caption: Arrows show a nonlinear mapping, using an AE, from the original input space to the AE's latent space (i.e., the u space). In the first phase of DCSS, we aim to gather data points near their corresponding group centers in the u space. In the second phase, DCSS employs pairwise sample similarity and dissimilarity to create a new space q in which similar pairs are tightly packed together and dissimilar pairs sit as far apart as possible. Similar samples are connected with solid lines; dashed lines represent dissimilar samples/clusters.]

In the literature, many research studies have been devoted to finding an optimal lower-dimensional space in an unsupervised manner using different autoencoder (AE) structures [15,16,17]. This provides a highly nonlinear transformation of data points. An AE consists of two networks: 1) an encoder network that maps the original input space to a lower-dimensional space; 2) a decoder network that tries to reconstruct the original space from the output of the encoder network. Accordingly, the encoder and decoder networks are trained to minimize the reconstruction loss. As discussed in Section 2, almost all conventional deep-learning-based algorithms aim to group data points near their corresponding cluster centers using an AE. They attempt to include a clustering loss in addition to the reconstruction loss of the AE to make the latent representation more suitable for data clustering [18,19,20]. The main common disadvantages of these algorithms are: (1) At each training epoch, they first perform crisp cluster assignment, i.e. assign each data point to only one cluster, and then compute the data clustering loss using the obtained crisp assignments, assuming they are correct. This is not a valid assumption because of the unsupervised nature of the clustering task, where the true cluster labels are unknown during the training phase. This issue is more serious when the non-crisp (aka soft) K-dimensional cluster assignment vector, obtained prior to snapping to the one-hot vector, is far from the one-hot vector imposed by the crisp assignment, because this misleads the training process. K is the number of clusters.
(2) In all of the former methods, a single common loss function is utilized for all data clusters, without considering the possible differences between the characteristics of the different clusters. (3) All conventional deep-learning-based clustering algorithms seek an optimal latent space suitable for data clustering. To this end, they try to locate data points close to their corresponding cluster centers without considering relations between the data points, such as pairwise data similarities.
In this paper, we propose a deep clustering with self-supervision using pairwise data similarities (DCSS) framework. DCSS employs similar and dissimilar pairwise samples to supervise the training process, where similarities are measured in reliable subspaces. DCSS performs clustering in two phases. First, the original data points are mapped into a reliable latent space u, i.e. the latent space of an AE, in which samples of the same cluster are encouraged to form a group by sitting close to the corresponding group center. The second phase creates an improved subspace q which is trained under the supervision of similar and dissimilar pairwise samples. Since similar and dissimilar samples are not recognizable in the original input feature space, due to the curse of dimensionality [21], we propose to use the (partially) trained subspaces u and q for similarity measurement. More specifically, at the outset of training q, similar and dissimilar pairwise samples are determined within the u subspace, which was trained in the first phase. After a few training epochs, when the subspace q becomes a reliable space, the similarities are measured in the q subspace itself. The subspace q is trained such that similar pairwise samples are tightly packed together in the q space and dissimilar pairwise samples sit in q as far apart as possible. Pairwise samples that are neither similar nor dissimilar, i.e. those in the ambiguous region, do not contribute to the second phase of DCSS. Our experimental results confirm that as the training in the second phase progresses, through performing more and more training epochs, more and more pairwise samples contribute to the training, i.e. very few pairs remain in the ambiguous region at the end of the training phase.
Our experiments show that such a training process results in sample representations, in the q space, that are very close to one-hot vectors, where the active element of the vector indicates the true cluster assignment, assuming the dimension of the q subspace equals the true number of data clusters K. An intuitive motivation of the proposed method is illustrated in Fig. 1. Theoretical support for the proposed DCSS method is provided in Appendix A. Training in both DCSS phases is performed in an end-to-end manner employing gradient descent and back-propagation. In the second phase, which is mainly for training q, the u space also receives small updates to be refined and made more helpful for the q space training.
In the first phase of DCSS, to obtain the subspace u, we propose to train an autoencoder by simultaneously minimizing two losses: the sample reconstruction loss and the centering loss; these two losses are linearly combined in a weighted manner and form the final loss function. The latent space of the trained AE is then taken as DCSS's u subspace. Minimizing the reconstruction loss helps to preserve the local structure of the data points [20] and constrains excessive manipulation of the AE latent space by the centering loss. Since pairwise similarities measured in the u space are used for supervising the q space training, the data distribution in the u space should be close to the desired data distribution in the q space, i.e. having dense pure groups of similar data in q. This goal is realized by incorporating the data centering loss in the loss function of Phase 1. The subspace u is trained in K successive runs, where at each run a specific loss function, corresponding to a specific data group, is minimized. More specifically, at the kth run, k ∈ {1, ..., K}, the AE network focuses on the centering and reconstruction of those samples that are more likely to belong to the kth data group. This implicitly means that, at the kth run, those AE parameters are optimized that have more impact on the centering of the kth data group, around its corresponding center, in the u space. Unlike previous deep clustering methods that use their estimated crisp assignments for training a latent space, we propose to train the u subspace employing the original soft group assignments (without snapping to 0 and 1). The soft assignments are used as sample weights when computing the reconstruction and centering losses. The block diagram of Phase 1 for training the u space is shown in Fig. 2(a). Throughout this paper, we refer to the proposed method for finding u as Deep Successive Learning (DSL). Details of the DSL formulation are presented in Section 3.1.
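The Phase-1 objective described above can be sketched numerically. The snippet below is a minimal NumPy illustration, not the paper's exact equation set (1): it assumes the loss at the kth run is a membership-weighted sum of per-sample reconstruction and centering terms, with a trade-off weight alpha and fuzzifier m (the exact combination of terms is our assumption; alpha and m are the Phase-1 hyperparameters listed in the implementation details).

```python
import numpy as np

def dsl_loss_run_k(X, X_hat, U, mu_k, p_k, alpha=0.1, m=1.5):
    """Membership-weighted Phase-1 loss for the kth run (illustrative sketch)."""
    rec = ((X - X_hat) ** 2).sum(axis=1)   # per-sample reconstruction loss
    cen = ((U - mu_k) ** 2).sum(axis=1)    # per-sample centering loss
    w = p_k ** m                           # soft memberships act as sample weights
    return float((w * (rec + alpha * cen)).sum())
```

Note how a sample's soft membership p_ik scales its contribution: samples unlikely to belong to the kth group barely affect the kth run.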
In the second phase of the proposed DCSS method, unlike the conventional AE-based clustering methods that define the final cluster assignments based on the data representations in the AE's latent space [18,19,20,22], we propose to employ the pairwise relationships between the data points as a supervisor to map the u space representations to an improved space q. Such a map is realized through a fully connected network called mutual net (MNet), whose objective is to strengthen pairwise similarities and dissimilarities, where the inner product is used for similarity measurement. An overview of the second phase of the DCSS method is shown in Fig. 2(b). The final cluster assignment is then performed in the discriminative space q (see Fig. 2(c)).
To sum up, the main contributions of this work are: • We introduce a novel framework that employs the similarities of pairwise samples as a means of self-supervision for data clustering.
• To realize this, since the original data space suffers from the curse of dimensionality, we propose to measure the similarities in two reliable subspaces u and q that are trained successively.
• To create the reliable space u, we propose a novel AE-based subspace learning approach, DSL, that trains the AE by minimizing weighted reconstruction and centering losses of data points, where soft group assignments are used as sample weights. See Fig. 2(a). The latent space of the AE is used as the u space required in DCSS.
• We propose to create the more reliable subspace q by appending a fully connected network, MNet, to the encoder part of the AE trained by DSL, where the process of training q (and refining u) is supervised by the similarity of the pairwise samples. See Fig. 2(b).
• The final crisp cluster assignment of a given data point is realized by mapping it to the q space. The map is performed by feeding the data to the trained encoder followed by the trained MNet. Our experiments show that sample representations in the q space are very close to one-hot vectors. Assuming the dimension of the q space is set to the true number of clusters K, the index of the maximum element of the q vector indicates the input data's cluster label. See Fig. 2(c).

• The proposed method can be considered a general framework that can be employed to improve the clustering performance of existing subspace-based clustering methods such as [22,20,19,18]. More specifically, the reliable subspace u can be obtained with other state-of-the-art methods (instead of DSL) followed by the proposed MNet.

• We performed an extensive set of experiments on seven benchmark datasets to demonstrate the effectiveness of the proposed DCSS (and DSL). Our results outperform both traditional and state-of-the-art deep-network-based clustering methods.
Appendix A presents mathematical proofs showing that, under certain assumptions, DCSS can effectively capture pairwise similarities and dissimilarities; hence, DCSS performs a reasonable final cluster assignment. We prove that a sample's q vector is very close to a one-hot vector (Corollary 1.2), similar samples are assigned to the same cluster (Theorem 2), dissimilar samples are assigned to different clusters (Corollary 1.3), and if two samples are both similar to a third sample, then the two samples are not dissimilar (Theorem 3).
The rest of this paper is organized as follows. In Section 2, we present a brief review of conventional and deep-learning-based clustering methods. Section 3 presents the details of our proposed DCSS framework. Extensive experimental results that demonstrate the effectiveness of DCSS are presented in Section 4. Section 5 conveys the gist of this paper.

RELATED WORK
Clustering has been widely studied in machine learning from different aspects such as feature selection [23,24,25], distance metrics [26,27], and categorizing methods [28,29,30]. K-means [8] and fuzzy c-means [9] are two popular conventional methods that are applicable to a wide range of tasks [31,32,33]. However, because of their distance metrics, they can only extract local information of the data when dealing with high-dimensional feature spaces. Some conventional algorithms, such as [34], aim to handle this difficulty by jointly performing subspace selection and clustering in an iterative manner. At each iteration, they group data points using k-means and endeavor to maximize the inter-cluster variance employing the data projected in a lower-dimensional space. They repeat this process until convergence. Another group of conventional methods, called spectral clustering, such as [35,36], addresses the high input dimension issue by embedding the high-dimensional data into a lower-dimensional space. They then apply clustering algorithms in the new space. For the embedding phase, they first construct a weighted graph in which nodes are data samples and weights are defined based on pairwise relationships between them in the original space. Then, they define a minimization problem using the Laplacian matrix of the weighted graph. Although they can surpass the clustering performance of k-means in different applications, the complexity of solving the optimization problem limits the application of these algorithms to only small datasets. In order to make spectral clustering more applicable to large datasets, [37,38] propose stochastic optimization methods that approximate the original optimization problem. However, these conventional methods only consider linear embeddings, which are not successful when dealing with complex datasets.
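The graph-Laplacian embedding scheme described above can be sketched as follows. This is a generic illustration (Gaussian affinities, unnormalized Laplacian), not the specific formulations of [35,36]:

```python
import numpy as np

def spectral_embedding(X, dim, sigma=1.0):
    """Embed samples using low-frequency eigenvectors of the graph Laplacian."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian affinity graph weights
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W         # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)            # eigenvalues returned in ascending order
    return vecs[:, 1:dim + 1]              # skip the constant eigenvector
```

A standard algorithm such as k-means is then applied to the embedded coordinates; the cubic cost of the eigendecomposition is exactly the scalability issue noted above.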
In order to take into consideration nonlinear embeddings of the data points, deep clustering has been broadly studied during the past few years. These methods utilize deep autoencoders to embed original data points in a lower-dimensional space. In some algorithms, learning the lower-dimensional representation of the data points is separated from the clustering task. For example, the deep embedding network (DEN) [39] uses an AE to find a lower-dimensional representation of data points by enforcing group-sparsity and locality-preserving constraints. It then obtains cluster assignments by applying the k-means algorithm to the obtained lower-dimensional space. As another example, [40] takes advantage of a deep autoencoder to find a lower-dimensional representation of a graph; it then utilizes the k-means algorithm to define clusters. In order to further improve clustering performance, more recent algorithms simultaneously embed data points in a lower-dimensional feature space and perform clustering to assign data points to different clusters using the new feature space. E.g., deep embedded clustering (DEC) [22] initializes the parameters of a stacked autoencoder layer by layer using [41,42]. After removing the decoder part, it then updates the encoder part of the AE by minimizing a Kullback-Leibler (KL) divergence loss involving soft assignments and estimated target distributions. Soft assignments measure the degree of similarity between data points and cluster centers using Student's t-distribution, following [43]. In an unsupervised learning problem, since the true cluster assignments (aka target distributions) of data points are unknown, DEC performs a form of self-training [44] to estimate the target distributions from the soft cluster assignments. A few research studies, e.g. [19,18,20], propose to take advantage of the decoder part of an AE to combine the reconstruction loss with the clustering loss to maintain the local structure of the original data points.
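DEC's soft assignments and self-training targets mentioned above follow a well-known published form and can be sketched as:

```python
import numpy as np

def dec_soft_assignments(U, centers, alpha=1.0):
    """Student's t-kernel similarity between embedded points and cluster centers."""
    d2 = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target_distribution(q):
    """Self-training target: square q and renormalize by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)
```

The target distribution sharpens confident assignments, which is precisely the self-training behavior DCSS's critique in Section 1 is aimed at: the targets are derived from the model's own (possibly wrong) assignments.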
For example, improved deep embedding clustering (IDEC) [20] improved the clustering performance of DEC by considering the reconstruction loss of an AE besides the KL divergence loss of DEC to preserve local dependencies and structures between the original data points. Deep convolutional embedded clustering (DCEC) [45] enhanced the performance of IDEC by changing the fully connected structure of IDEC to a deep convolutional autoencoder. Moreover, DCEC proposed an end-to-end pre-training scheme that minimizes the reconstruction loss instead of pre-training a stacked autoencoder as proposed in DEC and IDEC. Some other works, instead of applying self-training using soft cluster assignments, improve the clustering performance of DEC by proposing an independent approach for finding target distributions. For example, improved deep embedding clustering with fuzzy supervision (IDECF) [46] aims to estimate the target distributions by training a deep fuzzy c-means network, which is specifically designed and trained for this purpose. The deep clustering network (DCN) [19], which jointly learns a lower-dimensional feature representation and performs clustering, aims to find a new representation of data points in which data points are separable by applying the k-means algorithm. To this end, DCN updates an autoencoder's parameters by minimizing a combination of the reconstruction loss and the objective function of the k-means algorithm; this is to find a k-means-friendly space. DCN updates the AE's parameters and the cluster centers separately; the latter is based on solving a discrete optimization problem. In contrast, in deep k-means (DKM) [18], which has the same objective function as DCN, the AE's weights and the cluster centers are updated by minimizing a continuous optimization problem using stochastic gradient descent (SGD).
Some recent works, such as Deep Spectral Clustering (DSC) [47], proposed a joint learning framework to create a discriminative embedded space using a dual autoencoder. They devise a common encoder part and two decoder parts for their autoencoder. The first decoder tries to reconstruct the original input, and the second decoder endeavors to denoise the encoder's latent space. They consider reconstruction, mutual information, and spectral clustering losses while updating the parameters of their network. They utilize mutual information to exploit more discriminative information from the inputs. Moreover, a deep spectral clustering approach is applied to extract the mutual relationships between data points in the latent space. Contrastive learning methods, such as [48,49], have widely attracted researchers' attention in recent years due to their promising performance. They first construct negative and positive pairs by applying data augmentation to data points. Then, they map them into the feature space and endeavor to maximize the similarity (minimize the dissimilarity) between positive (negative) pairs. For example, contrastive clustering (CC) [48] defines instance- and cluster-level losses, respectively, on the rows and columns of the feature space in order to maximize similarity while minimizing dissimilarity.

PROPOSED METHOD
Consider a K-clustering problem that aims to partition a given dataset X = {x_1, x_2, ..., x_N} into K disjoint clusters, where x_i indicates the ith data sample, N is the number of data points, and K is a predefined user-settable parameter.
DCSS utilizes an AE consisting of an encoder and a decoder network, respectively represented by f(.) and g(.). The latent representation of x_i is denoted by u_i = f(x_i; θ_e) ∈ R^d, where d indicates the dimension of the latent space and θ_e denotes the parameters of the encoder network. The reconstructed output of the AE is denoted by x̂_i = g(u_i; θ_d), where θ_d represents the decoder parameters. The center of the kth data group in the u space is denoted by µ^(k). As aforementioned, in the second phase of DCSS, to investigate the mutual relationship between the data points, we employ a fully connected network, MNet, which takes the latent representation u_i of each data point as input and maps it to a K-dimensional vector q_i whose kth element indicates the probability of x_i belonging to the kth data cluster. The output of MNet for the ith data point is denoted by q_i = M(u_i; θ_M), where M(.) and θ_M respectively denote MNet and its parameters.
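To fix the notation, the following toy NumPy sketch uses hypothetical linear maps as stand-ins for the encoder f(.; θ_e), decoder g(.; θ_d), and MNet M(.; θ_M); the linear maps and dimensions are illustrative assumptions, not the paper's architectures:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, d, K = 8, 20, 5, 3              # samples, input dim, latent dim, clusters

W_e = rng.standard_normal((D, d))     # stand-in for encoder f(.; theta_e)
W_d = rng.standard_normal((d, D))     # stand-in for decoder g(.; theta_d)
W_m = rng.standard_normal((d, K))     # stand-in for MNet M(.; theta_M)

X = rng.standard_normal((N, D))
U = X @ W_e                           # u_i = f(x_i; theta_e)
X_hat = U @ W_d                       # x_hat_i = g(u_i; theta_d)
logits = U @ W_m
Q = np.exp(logits - logits.max(axis=1, keepdims=True))
Q = Q / Q.sum(axis=1, keepdims=True)  # q_i = M(u_i; theta_M): softmax over K clusters
```

Each row of Q is a probability vector over the K clusters, matching the role of q_i described above.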

The first phase of DCSS (DSL)
To obtain a reliable low-dimensional space u in which the detection of similar and dissimilar samples is possible, we propose to train an AE with a novel and effective loss function consisting of weighted reconstruction and centering losses. The latent space of the AE is in fact DCSS's u space. At each training data batch, we propose to train the AE in K successive runs, where at the kth run the AE focuses on the reconstruction and centering of the data points that are more probable to belong to the kth data group. Note that data clustering is an unsupervised task, and the group/cluster memberships of the data points are unknown at the problem outset. As such, at the kth run of DSL, we use the Euclidean distance between u_i and µ^(k) (obtained in the previous training iteration), i.e. ||u_i − µ^(k)||_2, as a means of measuring the degree of membership of x_i to the kth data group in the u space. The group memberships are used as the sample weights in the (k+1)th run. More specifically, the closer a sample is to the group center µ^(k), the higher the contribution that sample has in minimizing the loss function at the kth run.
The formulation of the proposed DSL method is shown in equation set (1), where (1a) presents the DSL's loss function at the kth run, i.e. L_c; higher weights are assigned to the samples closer to the group center µ^(k). The membership of the ith data point to the kth data group in the u space, denoted p_ik, is defined in (2). p_ik is used as the weight of the ith sample's reconstruction and centering losses at the kth run of DSL. Every T_1 training iterations, we update the group centers to the weighted average of the samples in the u space, as shown in (3); i.e., samples closer to µ^(k) contribute more to the update.
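A minimal NumPy sketch of the membership and center-update computations described above. It assumes a fuzzy-c-means-style form for the memberships in (2), with fuzzifier m as used in the experiments, and a membership-weighted average for the center update (3); the exact exponents are our assumption since (1)-(3) are not reproduced in the text:

```python
import numpy as np

def soft_memberships(U, centers, m=1.5):
    """p_ik: soft membership of u_i to group k, from Euclidean distances."""
    d = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=-1) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)     # each row sums to one

def update_centers(U, P, m=1.5):
    """mu^(k): membership-weighted average of the samples in the u space."""
    W = P ** m
    return (W.T @ U) / W.sum(axis=0)[:, None]
```

Samples sitting close to a center receive a membership near one for that group, so they dominate both the kth-run loss and the center update, as intended.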

The block diagram of the DSL framework is shown in Fig. 2(a). A preliminary version of DSL is presented in [50].

The second phase of DCSS
After completing the first phase of the DCSS method, we discard the decoder part of DSL's autoencoder and append a fully connected network, MNet, to the trained encoder. More specifically, MNet's input is the latent space of DSL's encoder, i.e. the u space. We use pairwise sample similarity to supervise the MNet training phase. The MNet output space q is used for data clustering; hence, it is trained to strengthen similarities and dissimilarities. MNet's output layer, i.e. the q space, consists of K neurons, where each neuron corresponds to one data cluster. We utilize the softmax function at the output layer to obtain probability values for cluster assignment. More specifically, for an input sample x_i, the output value at the jth neuron, i.e. q_ij, is the probability of x_i belonging to the jth cluster.
At the problem outset, the MNet parameters θ_M are initialized with random values. Hence, at the first few epochs of the q training process, when q is not yet a reliable space, we measure pairwise similarities in the u space using the inner product of the group assignments shown in (2). More specifically, knowing that p_i = [p_i1, ..., p_iK]^T represents the assignment vector of the ith data point x_i to the different data groups in the u space, the inner product of p_i and p_j shows the pairwise similarity of the two data points x_i and x_j. The loss function proposed for the MNet training at the first T_2 training epochs is shown in (4), where ζ and γ are two user-settable hyperparameters and 1{.} is the indicator function.
As can be inferred from (4), only similar and dissimilar samples contribute to the MNet training phase; i.e., a pair of samples contributes to the training if their similarity is greater than ζ or less than γ. A pair of samples with a similarity value between γ and ζ, i.e. in the ambiguity region, does not contribute to the current training epoch. Therefore, minimizing L_M strengthens (weakens) the similarity of similar (dissimilar) samples. Note that, along with training the MNet parameters, the encoder parameters θ_e are also updated through back-propagation in an end-to-end manner. The group centers in the u space, i.e. µ^(k), are also updated using (3) after completing each training epoch.
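The gating behavior of (4) can be sketched as follows. The thresholding with ζ and γ and the inner-product similarities follow the text; the specific log-based pull/push terms are our illustrative assumption, since (4) itself is not reproduced here:

```python
import numpy as np

def mnet_pairwise_loss(P, Q, zeta=0.8, gamma=0.2):
    """Pull together q-space representations of similar pairs, push apart
    dissimilar pairs; pairs in the ambiguity region [gamma, zeta] are ignored."""
    S_u = P @ P.T                          # pairwise similarities in the u space
    S_q = Q @ Q.T                          # pairwise similarities in the q space
    similar = S_u > zeta                   # indicator 1{<p_i, p_j> > zeta}
    dissimilar = S_u < gamma               # indicator 1{<p_i, p_j> < gamma}
    pull = -np.log(np.clip(S_q[similar], 1e-12, 1.0)).sum()
    push = -np.log(np.clip(1.0 - S_q[dissimilar], 1e-12, 1.0)).sum()
    return pull + push
```

With near-one-hot q vectors, the loss vanishes exactly when similar pairs share a cluster and dissimilar pairs do not; after T_2 epochs the same gating would be applied with S_q in place of S_u, mirroring the switch to (5).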
After training the MNet (and refining the encoder) parameters by minimizing L_M for T_2 epochs, when q becomes a reliable space for pairwise similarity measurement, we use the loss function L_M defined in (5) to complete the MNet training. Similar to (4), a pair contributes to L_M if its corresponding similarity value is not in the ambiguity region. As shown in Section 4, as the MNet training progresses, more and more pairs contribute to the training procedure. Again, the u space receives small updates through the back-propagation process when minimizing L_M.
As proved in Appendix A, a proper choice for the hyperparameters ζ and γ is 2/3 < ζ and γ < ζ/2. In our experiments, ζ and γ are set to 0.8 and 0.2, respectively. Fig. 2(b) shows the overall training procedure of DCSS's second phase. Moreover, Appendix A presents several mathematical proofs showing that, under certain assumptions, the final q vector for a query sample is very close to a one-hot vector whose maximum element's index indicates the cluster label. This also shows that similar (dissimilar) samples tend to sit in the same (different) data cluster(s).

Final cluster assignments
To determine the final cluster assignment of a data point x_i, we utilize the trained encoder and MNet networks to compute the data representation in the K-dimensional q space, i.e. q_i. x_i is assigned to the most probable cluster, i.e. the index corresponding to the highest element of q_i. This cluster assignment process is shown in Fig. 2(c).
The pseudo-code of the DCSS algorithm is presented in Algorithm 1.

[Algorithm 1 fragment; only partially recoverable: 12: Compute soft cluster assignment vectors q_i for i ∈ B; ... 17: Update θ_e and θ_M to minimize (5); 18: end if; 19: end for. Final Cluster Assignments: 20: Compute q_i for x_i, i = 1, ..., N; 21: Assign each data sample to the most probable cluster.]

EXPERIMENTAL RESULTS
The proposed DCSS method is compared with ten conventional and state-of-the-art deep-learning-based clustering methods. The DCSS code is available at: https://github.com/Armanfard-Lab/DCSS.

Datasets
The effectiveness of the proposed method is shown on seven widely used datasets. Considering the unsupervised nature of the clustering task, we concatenate the training and test sets when applicable. Combining train and test sets is a common practice in the clustering research field [22,18,20,19,47]. The datasets are: (1) MNIST [51] consists of 60,000 training and 10,000 test gray-scale handwritten digit images of size 28 × 28. This dataset has 10 classes.
(2) Fashion MNIST [52] has the same image size and number of samples as MNIST. However, instead of handwritten digits, it consists of different types of fashion products. This makes it fairly more complicated for data clustering compared to the MNIST dataset. It has 10 classes of data.
(3) 2MNIST is a more challenging dataset created by concatenating the MNIST and Fashion MNIST datasets. Thus, it has 140,000 gray-scale images from 20 classes. […] CIFAR-100 [54] is similar to CIFAR-10, except it has 20 super-classes, based on similarity between images, instead of 10 classes.

Evaluation Metrics
We utilize two standard metrics to evaluate clustering performance: clustering accuracy (ACC) [57] and normalized mutual information (NMI) [58]. ACC finds the best mapping between the true and predicted cluster labels. NMI is a normalized measure of the similarity between two different labelings of the same data points. The ACC and NMI formulations are shown below:

ACC = max_map ( Σ_{i=1}^{N} 1{l_i = map(c_i)} ) / N,    NMI = I(l; c) / max{H(l), H(c)},

where l_i and c_i denote the true and predicted labels for the data point x_i, and map(.) indicates the best mapping between the predicted and true labels of data points. I(l; c) denotes the mutual information between the true labels l = {l_1, l_2, ..., l_N} and the predicted cluster assignments c = {c_1, c_2, ..., c_N} for all data points, and H(.) denotes the entropy function. ACC and NMI range in the interval [0,1], where higher scores indicate better clustering performance.
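Both metrics can be sketched as below, assuming the max-entropy normalization for NMI (one common choice; the cited works do not fix it here) and using the Hungarian algorithm to find ACC's best label mapping:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(l, c):
    """ACC: fraction correct under the best one-to-one label mapping."""
    K = int(max(l.max(), c.max())) + 1
    cost = np.zeros((K, K), dtype=np.int64)
    for li, ci in zip(l, c):
        cost[ci, li] += 1                     # confusion counts: predicted vs true
    row, col = linear_sum_assignment(-cost)   # Hungarian algorithm (maximize matches)
    return cost[row, col].sum() / len(l)

def entropy(x):
    p = np.bincount(x) / len(x)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def nmi(l, c):
    """NMI = I(l; c) / max(H(l), H(c))."""
    joint = np.zeros((l.max() + 1, c.max() + 1))
    for li, ci in zip(l, c):
        joint[li, ci] += 1
    joint /= len(l)
    outer = joint.sum(1, keepdims=True) @ joint.sum(0, keepdims=True)
    nz = joint > 0
    mi = (joint[nz] * np.log(joint[nz] / outer[nz])).sum()
    return mi / max(entropy(l), entropy(c))
```

Note that a perfect clustering with permuted label names scores 1.0 on both metrics, which is why ACC needs the mapping step.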

Networks Architecture
The proposed DCSS method includes an autoencoder and a fully connected MNet. This section presents the structure of these networks. We use two variations of autoencoders, depending on the dataset nature (i.e. RGB or gray-scale), when training the proposed DCSS framework.
For gray-scale datasets, we propose to use an asymmetric autoencoder, where, following [56], we use the bottleneck layer shown in Fig. 3(c) in the encoder structure. Fig. 3(a) and (b) respectively show the encoder and decoder structures of the proposed asymmetric AE. Employing such an asymmetric structure provides a more discriminative latent space. The hyperparameters of the proposed AE for each dataset are indicated in Section 4.4. For the RGB datasets, we first apply a ResNet-152 [56], pre-trained on ImageNet [59], to extract abstract features. Then we feed the extracted features to a symmetric fully connected AE. Inspired by [22], we set the AE architecture to 2048-500-500-2000-d for RGB datasets, where the ReLU activation function is utilized in all layers.
MNet is a fully connected network that takes the d-dimensional latent space of the AE (the u space) as input and generates a K-dimensional output q. The architecture of MNet is d-128-128-128-K for all datasets except CIFAR-100. Since CIFAR-100 is a more complicated dataset, it needs a more complex MNet architecture; so we set the MNet architecture for CIFAR-100 to d-1000-1000-1000-K. Batch normalization and the ReLU activation function are utilized, for all datasets, in all layers of MNet except the last layer, in which we use the softmax function.

Implementation Details
In this section, we discuss hyperparameter values and implementation details of DCSS.
The network hyperparameters n, p_1, s_1, p_2, s_2, p_3, s_3, f_1, and f_2 (shown in Fig. 3) are respectively set to 28, 2, 2, 1, […].

[Fig. 3 caption: Structure of the proposed asymmetric autoencoder. In the encoder part, in order to obtain an informative lower-dimensional representation of data points, we propose to use a bottleneck layer. Following [56], we use the bottleneck layer after the 5 × 5 and 3 × 3 convolutional layers. The values of the hyperparameters are presented in Section 4.4.]
Following [18,19,20,22], in order to initialize DSL's parameters, i.e. θ_e, θ_d, and µ^(k) for k = 1, . . . , K, we train an autoencoder end-to-end by minimizing only the samples' reconstruction losses. The Adam optimization method [60], with the same parameters mentioned in the original paper, is used for training. θ_e and θ_d are then initialized with the trained autoencoder's parameters. We apply the k-means algorithm [8] to the latent space of the trained autoencoder and initialize µ^(k), k = 1, . . . , K, to the centers found by k-means.
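The initialization step above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name is ours, and a toy identity "encoder" on synthetic blobs stands in for the reconstruction-pre-trained encoder.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def init_group_centers(encoder, batches, K):
    """Embed the data with the pre-trained encoder, then initialize the
    group centers mu^(k) with k-means on the latent representations."""
    encoder.eval()
    with torch.no_grad():
        z = torch.cat([encoder(x) for x in batches]).numpy()
    km = KMeans(n_clusters=K, n_init=10).fit(z)
    return torch.tensor(km.cluster_centers_, dtype=torch.float32)

# toy example: identity "encoder" on two well-separated blobs
batches = [torch.randn(50, 2), torch.randn(50, 2) + 8.0]
centers = init_group_centers(nn.Identity(), batches, K=2)
print(centers.shape)  # torch.Size([2, 2])
```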
For all datasets, in the first phase of DCSS, α, MaxIter_1, T_1, and m are respectively set to 0.1, 200, 2, and 1.5. The second-phase hyperparameters ζ, γ, T_2, and MaxIter_2 are respectively set to 0.8, 0.2, 5, and 20. We utilize the Adam optimizer for updating the weights of the AE and MNet; their learning rates are set to 10^-5 and 10^-3, respectively.
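For convenience, the hyperparameter values above can be collected into a small configuration; the dictionary keys are our own naming, not the paper's.

```python
# DCSS hyperparameters reported above, gathered for reference
PHASE1 = {"alpha": 0.1, "MaxIter1": 200, "T1": 2, "m": 1.5, "lr_ae": 1e-5}
PHASE2 = {"zeta": 0.8, "gamma": 0.2, "T2": 5, "MaxIter2": 20, "lr_mnet": 1e-3}

# sanity checks implied by the text: gamma < zeta and zeta + gamma = 1
assert PHASE2["gamma"] < PHASE2["zeta"]
assert PHASE2["zeta"] + PHASE2["gamma"] == 1.0
```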
All algorithms were implemented in Python using the PyTorch framework. All code is run on a Google Colaboratory GPU (Tesla K80) with 12 GB RAM. The code of the proposed algorithm is included in the supplementary materials.

Clustering Performance
The effectiveness of our proposed DCSS method is compared against ten well-known algorithms, including conventional and state-of-the-art deep-learning-based clustering methods, using the commonly used evaluation metrics ACC and NMI.
The conventional clustering methods are k-means [8], large-scale spectral clustering (LSSC) [61], and locality preserving non-negative matrix factorization (LPMF) [62]. Deep-learning-based algorithms are deep embedding clustering (DEC) [22], improved deep embedding clustering (IDEC) [20], deep clustering network (DCN) [19], deep k-means (DKM) [18], deep spectral clustering (DSC) [47], and the very recent contrastive clustering method (CC) [48]. In addition, we report the clustering performance of a baseline method, AE + k-means, in which k-means is simply applied to the latent representation of an AE with an architecture similar to that of the AE used in DCSS, trained by minimizing the dataset reconstruction loss. More details about the comparison algorithms can be found in Section 2. We also demonstrate the success of the DSL method, presented in Section 3.1, in creating a reliable subspace u where the data points are effectively grouped around their corresponding centers. To this end, we implement only the first phase of the DCSS algorithm, i.e. we train the DCSS's AE by minimizing the loss function presented in (1), where the AE architecture and its initialization are as presented in Section 4.3. After training the u space, we perform a crisp cluster assignment by considering each data group, in the u space, as a data cluster and assigning each data point to the cluster (group) with the closest center.
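The crisp assignment step used to evaluate DSL can be sketched as follows; the function name and toy centers are ours, and Euclidean distance in the u space is assumed.

```python
import torch

def crisp_assign(u, centers):
    """Assign each embedded point to the group with the closest center
    in the u space (the DSL evaluation described above).
    u: (N, d) latent representations; centers: (K, d) group centers."""
    dists = torch.cdist(u, centers)   # (N, K) Euclidean distances
    return dists.argmin(dim=1)        # cluster index per sample

centers = torch.tensor([[0., 0.], [10., 10.]])
u = torch.tensor([[0.5, -0.2], [9.7, 10.1], [0.1, 0.3]])
print(crisp_assign(u, centers))  # tensor([0, 1, 0])
```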
The clustering performance of our proposed DSL and DCSS, along with that of the comparison algorithms, is shown in Table 1. For the comparison methods, if the ACC and NMI results for a dataset are not reported in the corresponding original paper, we ran the released code with the same hyperparameters discussed in the original paper. The best result for each dataset is shown in bold. The second-best results are marked with *. Several observations can be made from this table. (1) The proposed DCSS method outperforms all of the comparison methods on all datasets. (2) The proposed DSL approach effectively centers the data around the group centers. This can be inferred from its ACC and NMI results. Indeed, DSL outperforms all the clustering methods except DSC, which is one of the most recent state-of-the-art AE-based clustering methods. DSL outperforms DSC on 4 out of the 7 datasets and provides competitive results on the remaining three. (3) The effectiveness of the second phase of the proposed DCSS framework, where we append MNet to the latent space of the DSL, can be inferred by comparing DCSS with DSL. It can be seen that DCSS significantly outperforms DSL on all the datasets. (4) The effectiveness of the AE loss function proposed in equation (1), compared to training the AE with only the reconstruction loss, can be inferred by comparing the DSL performance with the baseline method AE + k-means. As can be seen, DSL clearly outperforms AE + k-means on all the datasets. Fig. 4 illustrates the effectiveness of the different phases of our proposed DCSS framework for all datasets, where t-SNE [63] is used to map the output of DCSS's encoder/MNet to a 2D space. Different colors correspond to different data groups/clusters.

t-SNE visualization
The first row of Fig. 4 shows the representation of different data points in the u space, i.e. the latent space of the DCSS's AE, after completing only the first phase discussed in Section 3.1. As can be seen, after completing the first phase of DCSS, different groups of data points are fairly separated and sit near the group centers; however, this phase is not adequate for defining cluster assignments. For example, in the USPS dataset, the data groups shown in pink, purple, and magenta are mixed together. This indicates the insufficiency of the reconstruction and centering losses. The second row of Fig. 4 shows the data representations in the u space after completing the second phase of DCSS discussed in Section 3.2, where u is refined by minimizing (4) and (5). As can be seen, refining the u space employing pairwise similarities results in clearer and more separated group distributions. For example, the pink, purple, and magenta groups of USPS are now well distinguishable in the refined u space. As another example, see the samples from the three groups shown in red, olive, and brown of the 2MNIST dataset. These groups are more separated in the refined u space, shown in the second row, compared to the representation shown in the first row of Fig. 4. The last row in Fig. 4 depicts the output space of MNet (the q space), in which we make decisions about the cluster assignments of data points. As expected, clusters in this space have low within- and high between-cluster distances. As an example, consider the navy blue and purple groups of the Fashion MNIST dataset at the end of the second phase. Although these groups are mixed in the u space, they are completely isolated in the q space.

Effect of Pre-trained Network
In this section, we investigate the effect of the structure of the pre-trained network employed for extracting features from RGB images. To this end, we compare the clustering performance of all the deep-learning-based algorithms presented in Section 4.5 using four different ResNet architectures, namely ResNet-34 [56], ResNet-50 [56], ResNet-101 [56], and ResNet-152 [56]. All ResNets are pre-trained on the ImageNet dataset [59]. The corresponding ACC and NMI are reported in Table 3. The best result for each dataset is shown in boldface and the second-best results are marked with *. As can be seen, regardless of the employed structure, the proposed DCSS method outperforms all other algorithms on all datasets. In addition, the DSL method is the second-best method in 19 out of the 24 reported values. Fig. 5 depicts the average, over different groups on different batches of data points, of the reconstruction, centering, and total losses corresponding to the first phase of DCSS (i.e. DSL) shown in (1). As can be seen, all losses converge by the end of training. The noticeable reduction in the centering loss shows the effectiveness of our proposed approach in creating a reliable u space in which the data points are gathered around the group centers. Moreover, the figures show that in the first training epochs, our method trades the reconstruction loss for improved centering performance. This demonstrates the insufficiency of the reconstruction loss alone in creating a reliable latent space.

Loss function convergence
In Fig. 6, we investigate the convergence of the second phase's loss of our proposed DCSS method, shown in equations (4) and (5). Since we initialize MNet randomly, in the first few epochs MNet knows nothing about the lower-dimensional representation of the data points in the q space; thus, we face a high loss value. As the training process progresses, the loss value drops and converges to zero by the end of the training process. In the first 5 epochs, the algorithm minimizes the loss presented in (4), and it minimizes (5) in the remaining epochs. The continuous loss reduction over epochs, along with the sharp loss drop at the 5th epoch, confirms the effectiveness of our proposed strategy of employing the reliable u space for supervision in the early epochs and then employing the trained q space for supervision in the later epochs.
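The supervision scheme just described can be sketched as follows. This is a heavily hedged illustration: the paper's exact losses (4) and (5) are not reproduced here, and the weighting below is our own simplification. What the sketch does preserve is the mechanism: pairs whose similarity in the supervising space exceeds ζ are pulled together in q, pairs below γ are pushed apart, and ambiguous pairs are ignored; the supervising space is u (suitably normalized) for the first T_2 epochs and q itself afterwards.

```python
import torch

def pairwise_similarity_loss(rep_sup, q, zeta=0.8, gamma=0.2):
    """Sketch of phase-2 supervision: rep_sup supplies pairwise
    similarities (the u space early on, q later); pairs with
    similarity >= zeta are strengthened in q, pairs < gamma are
    weakened, and in-between pairs are excluded. rep_sup is assumed
    to contain probability-like row vectors."""
    sim_sup = rep_sup @ rep_sup.t()      # supervising similarities
    sim_q = q @ q.t()                    # similarities in the q space
    similar = (sim_sup >= zeta).float()
    dissimilar = (sim_sup < gamma).float()
    # drive q_i^T q_j -> 1 for similar pairs, -> 0 for dissimilar ones
    loss = similar * (1 - sim_q) + dissimilar * sim_q
    n_pairs = (similar + dissimilar).sum().clamp(min=1)
    return loss.sum() / n_pairs

q = torch.eye(3)  # perfectly crisp assignments
print(pairwise_similarity_loss(q, q))  # tensor(0.)
```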

DCSS as a General Framework
In this section, we demonstrate the effectiveness of DCSS as a general framework in which the u space can be trained with other AE-based clustering techniques. To this end, we substitute the DSL technique proposed in Section 3.1 with other deep-learning-based techniques that train an effective subspace using an AE for the purpose of data clustering. Among our comparison methods, the algorithms DEC, IDEC, DCN, and DKM are AE-based. For each dataset, we train AEs using these algorithms, then take their encoder part and append our proposed MNet to their latent space. Then we run the second phase of DCSS. Results of this implementation are reported in Table 2, where X+MNet indicates the performance of DCSS employing the X method's latent space as the DCSS's u space. Note that, for the RGB datasets, we construct features using the pre-trained ResNet-152. Comparing the clustering results reported in Tables 1, 2, and 3 confirms the effectiveness of DCSS as a general framework for improving existing state-of-the-art AE-based clustering methods. On average, MNet improves the clustering performance of DEC, IDEC, DCN, and DKM respectively by 3.58% (1.87%), 2.93% (1.54%), 4.04% (2.98%), and 2.56% (1.65%) in terms of ACC (NMI).

Performance on Imbalanced Dataset
To demonstrate the effectiveness of our proposed DCSS method on imbalanced datasets, we randomly collect five subsets [...] times less than that of the last cluster. As shown in Fig. 7, our proposed DCSS framework significantly outperforms the comparison methods for all r values. This indicates the robustness of DCSS on imbalanced data. As expected, for all methods, increasing r generally results in higher performance because the dataset gets closer to a balanced one.

[Fig. 5 caption: In the earlier epochs, the latent representations of the data points are irregularly scattered around the group centers, which causes the high centering loss value. During the first phase of DCSS, our proposed algorithm aims to simultaneously minimize the centering and reconstruction losses; this leads to an increase in the reconstruction loss (and a decrease in the centering loss) in the later training epochs. When the first phase is complete, the latent representations of the data points sit close to the group centers, so the centering loss converges to a small value.]

[Fig. 6 caption: The second phase's loss function of DCSS for different benchmark datasets. In the first T_2 epochs, when MNet is naïve, we make use of the u space to measure pairwise similarities; hence, the first T_2 = 5 epochs show the loss value defined in (4). After the first T_2 epochs, all datasets show an eye-catching decrease in the loss values, since we switch from using the u space to using the more reliable q space when measuring pairwise similarities, by minimizing the loss defined in (5).]

Visualization of representations in the q space
Fig. 9 shows the q vectors corresponding to samples from different data clusters. As can be seen, the proposed method usually results in q-space representations close to a one-hot vector. Note that the kth element of q_i shows the probability of sample x_i being in the kth cluster. The closer q_i is to a one-hot vector, the more confident the crisp assignment that can be performed. As proved in Corollary 1.2 of Appendix A, if a data point has at least one similar neighbor, the maximum element of its corresponding q is greater than ζ. In our experiments, ζ is set to 0.8. This justifies the aggregation of the data points near one-hot representations in the q space.
To further demonstrate the convergence of the q representations to one-hot vectors, histograms of the residuals R_i = ||I_i − q_i||_1, i = 1, . . . , N, for all datasets are shown in Fig. 8, where ||·||_1 indicates the ℓ1-norm and I_i is the one-hot crisp assignment corresponding to q_i, i.e. the index of the non-zero element of I_i is equal to the index of the maximum element of q_i. As can be seen in Fig. 8, the q representations of almost all data points are very close to their corresponding one-hot vectors.
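The residual R_i = ||I_i − q_i||_1 can be computed as follows; the function name and the toy q matrix are ours.

```python
import numpy as np

def onehot_residuals(Q):
    """R_i = ||I_i - q_i||_1, where I_i is the one-hot vector at the
    index of q_i's maximum element; a small R_i means a confident,
    nearly crisp cluster assignment."""
    I = np.zeros_like(Q)
    I[np.arange(len(Q)), Q.argmax(axis=1)] = 1.0
    return np.abs(I - Q).sum(axis=1)

Q = np.array([[0.9, 0.05, 0.05],   # confident sample
              [0.4, 0.35, 0.25]])  # ambiguous sample
print(onehot_residuals(Q))  # [0.2 1.2]
```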

Hyperparameters Sensitivity
In Fig. 10, we investigate the effect of different hyperparameters on the DCSS clustering performance. For the hyperparameters of the first phase (i.e. α, m, and T_1), we report the clustering performance of DSL (as presented in Section 4.5).
In Fig. 10(a), we explore the importance of the centering loss in the first phase's loss function shown in (1) by changing α ∈ {0, 0.01, 0.1, 1} for the MNIST dataset. As shown in this figure, increasing the value of α from 0 to 0.01 significantly enhances the DCSS performance in terms of ACC and NMI, which demonstrates the effectiveness of incorporating the centering loss beside the reconstruction loss in the first phase's loss function. The best clustering performance is achieved at α = 0.1 for the MNIST dataset. Fig. 10(b) shows the impact of the level of fuzziness m on the clustering performance of DSL, where m ∈ {1.1, 1.3, 1.5, 1.7}. As m → 1 (m → ∞), the group membership vectors converge to one-hot (equal-probability) vectors. As shown in this figure, the best performance is obtained at m = 1.5 for the Fashion MNIST dataset.
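The limiting behavior of the fuzziness level m can be illustrated with fuzzy-c-means-style memberships; this is a sketch of the behavior discussed above, not necessarily the paper's exact membership formulation.

```python
import numpy as np

def fuzzy_memberships(dists, m):
    """FCM-style memberships u_ik = 1 / sum_l (d_ik / d_il)^(2/(m-1))
    from sample-to-center distances. As m -> 1 the memberships approach
    one-hot vectors; as m -> infinity they approach uniform vectors."""
    ratio = dists[:, :, None] / dists[:, None, :]   # d_ik / d_il
    return 1.0 / (ratio ** (2.0 / (m - 1))).sum(axis=2)

d = np.array([[1.0, 2.0]])          # one sample, two centers
print(fuzzy_memberships(d, 1.1))    # nearly one-hot: first center wins
print(fuzzy_memberships(d, 10.0))   # close to uniform memberships
```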
In Fig. 10(c), we scrutinize the effect of the update interval T_1 on the clustering performance of the first phase for T_1 ∈ {2, 5, 10, 15}. As expected, better clustering performance in terms of ACC and NMI is obtained for the smaller value of T_1, i.e. T_1 = 2.
In Fig. 10(d), we change the number of training epochs T_2, defined in Section 3.2, where T_2 ∈ {1, 5, 10, 20}. As expected, for a very small T_2 value, e.g. T_2 = 1, where training of the q space is mainly supervised by the q space itself even at the outset of MNet training, DCSS cannot provide a proper q space, since q is not yet a reliable space to be used for self-supervision. The figure also shows that for a very large T_2 value, e.g. T_2 = 20, where we only trust the u space for supervising the q space, we cannot train an effective q space. As shown, the best clustering performance is obtained when T_2 is set to a moderate value, e.g. T_2 = 5. This demonstrates the effectiveness of the proposed strategy of supervising the MNet training using both the u and q spaces.
In Fig. 11, we vary ζ and γ in the range [0, 1], where ζ + γ = 1, to observe model convergence and accuracy for different lengths of the ambiguity interval, defined as ζ − γ, ranging from 1 (when ζ = 1) to 0 (when ζ = 0.5). Fig. 11(a) shows the number of pairs participating in minimizing the loss functions defined in (4) and (5). As can be seen, at the beginning of the second phase, our model can make a decisive decision about only a few pairs, and the remaining pairs fall in the ambiguous region. As the second-phase training progresses, more and more pairs are included in the loss optimization. Finally, at the end of the second phase, almost all pairs contribute to the training.
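The pair bookkeeping behind Fig. 11(a) can be sketched as follows; the function name and the toy q matrix are ours.

```python
import torch

def count_pair_types(q, zeta=0.8, gamma=0.2):
    """Split all pairs (i, j), i < j, into similar (q_i^T q_j >= zeta),
    dissimilar (< gamma), and ambiguous (in between); only the first
    two kinds participate in the phase-2 losses."""
    sim = q @ q.t()
    iu = torch.triu_indices(len(q), len(q), offset=1)
    s = sim[iu[0], iu[1]]
    n_sim = (s >= zeta).sum().item()
    n_dis = (s < gamma).sum().item()
    return n_sim, n_dis, s.numel() - n_sim - n_dis

q = torch.tensor([[0.9, 0.1], [0.85, 0.15], [0.1, 0.9]])
print(count_pair_types(q))  # (0, 1, 2): one dissimilar, two ambiguous pairs
```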
Furthermore, in Fig. 11(b), we investigate the influence of ζ and γ on the clustering performance. As desired, the final clustering performance of our DCSS framework is not highly sensitive to the choice of ζ and γ when they are set to reasonable values. In all our experiments, ζ = 0.8 and γ = 0.2.

Features visualization
In order to investigate the effectiveness of our model in extracting useful features for different datasets, we train a deep neural network with the same structure as proposed in Section 4.3 in a supervised manner, and then compare the outputs of the first convolutional layers of the supervised model and of our proposed DCSS model. As can be seen in Fig. 12, DCSS learns a variety of low- and high-frequency features, similar to the features learned in a supervised manner. This demonstrates the effectiveness of our framework in finding informative features in an unsupervised manner.

CONCLUSION
In this research study, we developed a novel and effective AE-based deep clustering framework, deep clustering with self-supervision (DCSS), which considers pairwise similarities and dissimilarities between data points for the data clustering task. DCSS is based on a two-phase training procedure. In the first phase, through K successive runs, DCSS obtains a lower-dimensional representation for the data points, where at the kth run, DSL concentrates on the reconstruction and grouping of those data points that are more likely to belong to the kth data group. In the second phase, DCSS aims to find the pairwise relationships between data points employing the lower-dimensional representations obtained in the first phase. To this end, we train a fully connected network, MNet, in an unsupervised manner by minimizing a novel and effective loss function that only considers similar and dissimilar pairs identified in the AE's latent space or in the output space of MNet itself. We evaluated our DCSS framework on several benchmark datasets. Empirical results show that DCSS outperforms state-of-the-art data clustering methods. The effectiveness of the proposed method is also supported by the rigorous mathematical proofs discussed in Appendix A. Moreover, the results shown in Section 4.9 demonstrate the effectiveness of the DCSS framework in improving the clustering performance of state-of-the-art AE-based clustering algorithms.

APPENDIX A
Notation clarification: The representation of the ith sample in the q space is denoted by q_i. The kth element of q_i is denoted by q_ik, k = 1, . . . , K, where K is the number of data clusters. Note that q_i is the MNet output when the input sample is x_i. Since softmax is employed as the final layer of MNet, 0 ≤ q_ik ≤ 1 and the ℓ1-norm of q_i is equal to 1. Furthermore, as discussed before, the parameters ζ and γ are values between 0 and 1.

Definition A.1. Two data points, i and j, are adjacent (aka similar) if and only if q_i^T q_j ≥ ζ.

Definition A.2. Two data points, i and j, are in the same cluster if and only if the indices of the maximum values in their corresponding q vectors (i.e. q_i and q_j) are equal.

Theorem 1. For any two data points i and j,

q_i^T q_j ≤ min{ max_l {q_il}, max_l {q_jl} },    (7)

where q_i^T q_j is the inner product of the two vectors q_i and q_j, and q_il and q_jl denote the lth elements of q_i and q_j, respectively.
Proof. Assume q*_j is a maximal vector, i.e. it satisfies

q_i^T q_j ≤ q_i^T q*_j for any q_j with ||q_j||_1 = 1,    (8)

where ||q*_j||_1 = 1. In addition, assume the index of the maximum element of q_i is r:

r = arg max_l {q_il}.    (9)

In the following, we first prove by contradiction that q*_j must be a one-hot vector; then we prove (7). Assume q*_j is not a one-hot vector. Then there exists at least one index e, with e ≠ r, whose corresponding element q*_je is non-zero:

∃ e : e ≠ r and q*_je ≠ 0.    (10)

Now, let us define a vector q̃_j as follows:

q̃_jr = q*_jr + q*_je,  q̃_je = 0,  q̃_jl = q*_jl for l ≠ r, e,    (11)

where q̃_jl denotes the lth element of q̃_j. Since ||q*_j||_1 = 1, we immediately have ||q̃_j||_1 = 1. Moreover, we can write the inner products q_i^T q*_j and q_i^T q̃_j as shown in (12) and (13), respectively:

q_i^T q*_j = q_ir q*_jr + q_ie q*_je + Σ_{l ≠ r,e} q_il q*_jl,    (12)

q_i^T q̃_j = q_ir (q*_jr + q*_je) + q_ie × 0 + Σ_{l ≠ r,e} q_il q*_jl.    (13)

Since q_ir is the maximum element of q_i, we can readily show that q_i^T q*_j ≤ q_i^T q̃_j, which contradicts the maximality assumption shown in equation (8); thus, q*_j must be a one-hot vector, with its non-zero element at index r. Therefore:

q_i^T q*_j = q_ir = max_l {q_il}.    (14)

Considering (14) and (8), we have

q_i^T q_j ≤ max_l {q_il}.    (15)

Similarly, for the ith sample, we can show that q*_i^T q_j ≤ max_l {q_jl}, hence

q_i^T q_j ≤ max_l {q_jl}.    (16)

(15) and (16) together prove (7).

Corollary 1.1. If two samples i and j are adjacent, the maximum element of each of q_i and q_j is greater than ζ.
Proof. This statement is a corollary of Theorem 1 and can be proved from (7), (15), (16), and the adjacency definition provided in Definition A.1. In other words:

ζ ≤ q_i^T q_j ≤ min{ q_ir, q_jo },    (17)

where the indices of the maximum elements of q_i and q_j are denoted by r and o, respectively, i.e. r = arg max_l {q_il} and o = arg max_l {q_jl}.

Corollary 1.2.
If a data point has at least one adjacent neighbor, the maximum element of its corresponding q is greater than ζ.
Proof. We first empirically test the validity of the employed assumption, i.e. the existence of at least one adjacent (i.e. similar) sample for each data point, on our datasets. Fig. 13 shows the number of data samples that are not similar to any other data point in the q space. As can be seen, at the beginning of the second phase of DCSS, since MNet is initialized randomly, many data points do not have any adjacent neighbor (i.e. almost ∀ i, j, i ≠ j : q_i^T q_j < ζ). By minimizing (4) and (5) in the second phase of DCSS, similar samples become tightly packed in the q space; therefore, almost all samples have at least one adjacent neighbor. For example, only 330 data points, out of the 60,000 samples of the CIFAR-100 dataset, are not similar to any other data sample at the end of the second phase. Note that CIFAR-100 presents the worst case among the datasets shown in Fig. 13. All in all, we can roughly assume that each data point has at least one adjacent sample.
Let us consider an arbitrary data point i and one of its adjacent data points j. From Corollary 1.1, we can conclude that max_l {q_il} > ζ. Thus, we have proved that the maximum element of q_i, where i is an arbitrary data point, is greater than ζ.

Corollary 1.3. Assume each data point has at least one adjacent neighbor and γ < ζ². If two data points i and k are dissimilar, then i and k are not from the same cluster.

Theorem 3. Consider three data points i, j, and k, where i and j, and also i and k, are adjacent (aka similar). Assume ζ > 2/3. If γ < ζ², then the two samples j and k are not dissimilar (i.e. q_j^T q_k ≮ γ).

Proof. Since i and j (i and k) are adjacent and ζ > 2/3, from Theorem 2 we can conclude that i and j (i and k) are in the same cluster; hence, the three samples i, j, and k are all in the same cluster. Therefore, we do not want to include the pair of samples j and k as a dissimilar pair when minimizing the loss function defined in equation (5). Since j and k are in the same cluster, and knowing that the index of the maximum element of a q vector indicates the cluster of the corresponding sample, we have

arg max_l {q_jl} = arg max_l {q_kl} = η.

Thus, q_j^T q_k = Σ_{l ≠ η} q_jl q_kl + q_jη q_kη, and therefore

q_j^T q_k ≥ q_jη q_kη.    (26)

Since ζ ≤ q_i^T q_j and ζ ≤ q_i^T q_k, we can infer (27) from (17):

q_jη ≥ ζ and q_kη ≥ ζ.    (27)

From (26) and (27), we obtain

q_j^T q_k ≥ ζ² > γ.

Thus, if we choose γ < ζ², we will not include the pair of samples j and k as a dissimilar pair in equation (5). In this paper, γ is set to 0.2, i.e. γ = 0.2 < 0.8².
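As a quick numerical sanity check of Theorem 1 and of the constraint γ < ζ² required by Theorem 3, one can verify the bounds on random softmax-like vectors (the Dirichlet sampling below is our own test setup, not part of the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(5), size=200)  # random probability vectors (K = 5)

# Theorem 1: q_i^T q_j <= min(max_l q_il, max_l q_jl) for every pair
G = Q @ Q.T                    # all pairwise inner products
M = Q.max(axis=1)              # per-vector maximum elements
assert np.all(G <= np.minimum.outer(M, M) + 1e-12)

# the paper's choice gamma = 0.2, zeta = 0.8 satisfies gamma < zeta^2,
# the condition used in Theorem 3
zeta, gamma = 0.8, 0.2
assert gamma < zeta ** 2
print("bounds hold")
```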