DC-FUDA: Improving Deep Clustering via Fully Unsupervised Domain Adaptation

Abstract—By transferring knowledge from a source domain, the performance of deep clustering on an unlabeled target domain can be improved. To achieve this, traditional approaches assume that an adequate amount of labeled data is available in the source domain. However, this assumption is usually unrealistic in practice. The source domain must be carefully selected to share some characteristics with the target domain, and it cannot be guaranteed that rich labeled samples are always available in the selected source domain. We propose a novel framework to improve deep clustering by transferring knowledge from a source domain without any labeled data. To select reliable source-domain instances for transfer, we propose a novel adaptive threshold algorithm that selects low-entropy instances. To transfer the important features of the selected instances, we propose a feature-level domain adaptation network (FeatureDA) that avoids the unstable image generation process. With extensive experiments, we validate that our method effectively improves deep clustering without using any labeled data in the source domain, and that it achieves competitive results compared to state-of-the-art methods that require labeled source data.


I. INTRODUCTION
With deep neural networks as feature extractors, deep clustering methods can achieve higher accuracy in unsupervised learning tasks than traditional clustering methods. Deep clustering has received massive attention and can be widely applied to scenarios without sufficient supervised information, such as motion segmentation and face clustering [1], acoustic source separation [2], and reducing manufacturing process quality variations [3]. The performance of deep clustering methods on an unlabeled target domain can be further improved by transferring knowledge from a related domain (i.e., a source domain) [4], [5]. To achieve this, traditional approaches, such as unsupervised domain adaptation (UDA) [4], [5], are developed to transfer information from a source domain to improve accuracy on the target task.
However, these traditional approaches [4], [5] for transferring knowledge from source domains often make the unrealistic assumption that rich labeled samples are available in the source domain. In practice, the source domain must be carefully selected to share some characteristics with the target domain; if it is not, the accuracy in the target domain decreases. More importantly, it cannot be guaranteed that rich labeled samples are always available in the carefully selected source domain. Labeling data in a particular domain can require a lot of human labor or expertise, which is very expensive. When there are only limited labeled samples in the source domain, traditional approaches cannot be applied.
To overcome the above assumption made by traditional approaches, we present the first work and propose a novel framework to improve deep clustering performance by transferring knowledge from a source domain without any labeled data. First, we apply deep clustering on the source domain to obtain the cluster soft assignments (i.e., pseudo labels) of the source instances. Then, we select instances with high confidence and use them to transfer knowledge from the source domain to the target domain. Specifically, with these selected instances, we train a generative adversarial network (GAN) [6] to generate fake features that share a similar distribution with the learned features in the target domain. Finally, the selected instances and the generated fake features are used to train a classifier, which predicts the final cluster assignments of the instances in the target domain.
It is non-trivial to select the appropriate instances and transfer information from them in our approach. We need to (i) select reliable instances with important characteristics that can be shared between the source and target domains, and (ii) transfer important features of these instances from the source domain to enhance the clustering performance in the target domain. For (i), information entropy is an intuitive measurement of an instance's confidence. If we construct a model supervised by one-hot encoded labels, the normalized output vector can be interpreted as the probability of an instance belonging to each cluster, from which we can compute the instance's entropy. The problem is how to choose a cutoff entropy value such that the instances below it are selected for transfer. Inspired by [7], we use a low-entropy objective for the clustering network so that the resulting entropy values can be easily divided. For (ii), existing GAN-based UDA networks consider both image generation and classification, which means that the vast majority of the computing power is spent on the generation task rather than classification. Some works decrease the computational complexity by taking the features extracted by an existing classification network as input, but this ignores the training process of the feature extractor. We propose a new model that takes the source and target images as input and generates features following the distribution of the target image features.
In summary, we make three major contributions.
• We identify an unrealistic assumption made by traditional approaches, and present the first work to improve deep clustering performance by transferring knowledge from a source domain without any labeled data.
• To select reliable instances in the unlabeled source domain for transfer, we propose a novel adaptive threshold algorithm that selects low-entropy instances. To transfer important features of the selected instances, we propose a feature-level domain adaptation network for clustering.
• It is empirically verified that our method effectively improves deep clustering when no label information in the source domain is provided. In addition, compared to existing domain adaptation methods that need labeled source data, our method obtains competitive performance.

II. RELATED WORK
Our method first needs a traditional deep clustering algorithm to label the instances based on their entropy, and second a UDA method. We therefore review traditional deep clustering algorithms and UDA algorithms in this section.

A. Deep Clustering
For clustering methods based on a deep neural network, the most important and difficult part is to set a suitable objective; finding a practical and widely applicable objective is an accelerator for unsupervised learning. In recent years, many deep clustering methods with different objectives have been proposed. The autoencoder (AE) [8] has been the basic network frame for its ability to learn highly non-linear mapping functions. [9] proposed a deep clustering method based on the autoencoder that considers both data reconstruction and compactness. To learn cluster-oriented feature representations, [10] imposed a locality-preserving constraint and group sparsity when learning the deep autoencoder. [11] proposed a deep subspace clustering method that enhances the autoencoder by considering the structural prior of the samples. Density-based clustering algorithms have also been extended to deep learning; e.g., [12] used an autoencoder to learn low-dimensional feature representations and then proposed a density-based method to partition the learned features. Inspired by t-SNE [13], [7] proposed a joint framework to optimize feature learning and a clustering objective, named deep embedded clustering (DEC). DEC also uses the features extracted by the autoencoder and then sets a low-entropy objective for the encoder. As variations of DEC, improved deep embedded clustering (IDEC) with local structure preservation [14] and deep embedded clustering with data augmentation (DEC-DA) [15] have been proposed. IDEC jointly optimizes a weighted clustering loss and the reconstruction loss of the autoencoder. DEC-DA uses data augmentation in deep embedded clustering. GANs can also be used in clustering tasks: [16] proposed adversarial deep embedded clustering (ADEC), which addresses feature randomness and feature drift using adversarial training.
Beyond the model objective, different training strategies have been applied to deep clustering, e.g., self-paced learning [17], [18], multiview learning [19], [20] and semi-supervised learning [21], [22].

B. Unsupervised Domain Adaptation
The objective of domain adaptation is based not only on the data we want to cluster or generate, but also on another dataset or module.
Unsupervised domain adaptation (UDA) aims at transferring knowledge from labeled data in the source domain to enhance prediction performance on a target domain with only unlabeled data. The objectives of UDA methods are often based on the discrepancy between domains [5], [23]. [23] used maximum mean discrepancy (MMD) [24] to calculate the discrepancy between domains. Since then, a number of MMD-based methods have been proposed, mainly differing in the version of MMD they choose, such as the joint adaptation network (JAN) [5] and the weighted deep adaptation network (WDAN) [25]. JAN chooses joint MMD, which measures the Hilbert-Schmidt norm between the kernel mean embeddings of the empirical joint distributions of source and target data. WDAN addresses imbalanced data distributions by introducing an auxiliary weight for each class in the source domain. Reconstruction-based methods use the AE as the base model: [26] applies the same encoder to source and target data; the source features extracted by the encoder are delivered to a classifier, while the target data are reconstructed by the decoder. [27] proposes the Contrastive Adaptation Network (CAN), which alternately optimizes the intra-class and inter-class domain discrepancies. GAN-based UDA methods have also attracted attention. They generally use a generator to produce fake samples and a discriminator to judge whether each generated instance is sampled from the target domain. Once the discriminator is fixed, the source and target data can be projected into the same feature space with the generator. [28] used noise vectors and source data to generate fake data for the discriminator. [29] provided an independent generator and discriminator; domain adaptation is completed by sharing the weights of the generator's first layer and the discriminator's last layer.
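For concreteness, the (biased) squared MMD estimate with an RBF kernel, the quantity such discrepancy-based methods penalize, can be sketched as follows (a minimal NumPy sketch; the function names are illustrative, not from any of the cited implementations):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel matrix between rows of a and rows of b.
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(xs, xt, gamma=1.0):
    # Biased estimate of squared MMD between source samples xs and target samples xt.
    k_ss = rbf_kernel(xs, xs, gamma).mean()
    k_tt = rbf_kernel(xt, xt, gamma).mean()
    k_st = rbf_kernel(xs, xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

Samples drawn from the same distribution give an estimate near zero, while a distribution shift drives the estimate up, which is exactly the signal an MMD-based UDA loss minimizes.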
[30] proposed an attention module in the GAN training process that allows the discriminator to distinguish the transferable regions between the source and target images. [31] proposes an unsupervised domain adaptation model that can generate image-label pairs in the target domain and achieve class-level transfer.
How to pick an appropriate source domain is often ignored. Recently, UDA methods without a source domain have been proposed. [32] proposes a collaborative class conditional generative adversarial net to bypass the dependence on the source data. [33] proposes a framework called SHOT (Source HypOthesis Transfer), which freezes the classifier module of the source model and learns a target-specific feature extraction module by exploiting both information maximization and self-supervised pseudo-labeling, implicitly aligning representations from the target domain to the source hypothesis.

III. PROPOSED METHOD
In this section, we introduce our deep clustering approach via fully unsupervised domain adaptation (DC-FUDA). To transfer knowledge from an unlabeled source domain to the target domain, we select high-confidence pseudo-labeled data from the source domain and use it for target-domain deep clustering with a feature-level domain adaptation network (FeatureDA).
The proposed method mainly comprises two modules: an instance selection module with an adaptive threshold, and feature-level domain adaptation with a generative adversarial network. Fig. 2 shows the whole model. The upper part is the instance selection module with the adaptive threshold. In the training process, the source domain instances are fed to the feature extractor, and the features are optimized to obtain lower entropy. Then, with our adaptive threshold instance selection algorithm, we pass the low-entropy instances to the lower part of Fig. 2: FeatureDA. In FeatureDA, the generator G generates fake features that share the distribution of the target data features. The labels of the fake features are the same as those of the instances used to generate them. Then, the low-entropy source features (extracted by the feature extractor) and the fake features are used to optimize the classifier C. In the test phase, both the source and target images are fed to the classifier.
Consider the source domain with $N_s$ instances $X_s = \{x_i^s\}_{i=1}^{N_s}$. We aim at clustering instances from both the source and target domains into $k$ clusters. The whole model consists of the following components:
• Low-entropy feature extractor: $E(x; \boldsymbol{\theta}_E)$, where $\boldsymbol{\theta}_E$ are the model parameters. It iteratively maps the input $X_s$ to a feature space and assigns pseudo labels to the source instances, resulting in $\{(x_i^s, y_i^{ps})\}_{i=1}^{N_s}$, where $y_i^{ps}$ is the pseudo label of instance $i$.
• Low-entropy instance selector: it selects $N_{le}$ ($N_{le} < N_s$) high-confidence (i.e., low-entropy) instances. It takes the source instances as input and outputs the low-entropy instances with their pseudo labels. The lower part of the model is a feature-level adaptation network, used to transfer information from the source domain to the target domain at the feature level.
• Generator: $G(x^s, z; \boldsymbol{\theta}_G) \rightarrow f^{fake}$, which generates fake features that share the distribution of the target features; $z$ is random Gaussian noise, $f_i^{fake}$ is the fake feature generated by $G$, and its label $y_i^{fake}$ is the same as $y_i^{ps}$.
• Discriminator: $D(f; \boldsymbol{\theta}_D)$, used to discriminate between the features of the target data and the fake data.
• Classifier: $C(x; \boldsymbol{\theta}_C) \rightarrow \hat{y}$, parameterized by $\boldsymbol{\theta}_C$, where $\hat{y}$ is the predicted label. $C$ predicts the labels of the source and target data.
We introduce the details of each component in the next subsections.

A. Selecting Low Entropy Source Instances with Adaptive Threshold
To select the source domain samples that are useful for target domain clustering, we extract the features of the source domain instances and use a standard clustering algorithm (e.g., k-means) to obtain labeled data with high confidence. We use an autoencoder [15] and the pretrained network Inception-V3 [34] as feature extractors. The label confidence is represented by the entropy, which is computed from the last layer of the feature extractor: $H(E(x_i)) = h_i$, where $h_i$ is the entropy of $x_i$. The number of neurons in the last layer equals the number of clusters, so $p_{ij}$, the output of the last layer after normalization, can be interpreted as the probability that instance $i$ belongs to cluster $j$.
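As an illustration, the entropy-based confidence described above can be computed from the last-layer outputs as follows (a minimal NumPy sketch with illustrative names; softmax stands in for whatever normalization the last layer applies):

```python
import numpy as np

def softmax(logits):
    # Normalize last-layer outputs into per-cluster probabilities p_ij.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def instance_entropy(logits):
    # Entropy h_i of each instance's probability vector; low entropy = confident.
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1)
```

A confident instance (one dominant logit) gets an entropy near zero, while a maximally ambiguous instance gets entropy log k.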
Once we label the source domain data, we propose an adaptive threshold algorithm to select the instances with low entropy. Here we introduce a parameter ∆ based on the entropy: the threshold of instance selection for cluster $j$ is defined as $n \times \Delta_j$ ($n = 1, 2, 3, \ldots$). It defines an entropy scope for each cluster: the $N_{min}$ lowest-entropy instances plus all instances within $n\Delta_j$ of them, where $N_{min}$ is the minimum number of instances that will be selected; in the experiments, we set it as 50% of the instances of every cluster. The instances within this scope are used in target domain clustering. If $n$ is too small, we do not have enough instances to transfer source domain knowledge: for example, a small $n_1$ in Fig. 3(2) results in a small area $S_1$, and the generated fake features would have as little variety as the source instances. If $n$ is too large, as $n_3$, it may include too much noise to obtain ideal performance. Therefore, we aim at finding an appropriate $n$, such as $n_2$ in Fig. 3, such that the selected area does not include instances with a large entropy gap. The threshold selection process consists of the following steps:
1) Initialize the adaptive threshold so that each cluster contains $N_{min}$ instances. This is the minimum number of instances we should choose in each cluster; otherwise, serious data imbalance problems would result.
2) Increase $n$ from 1 upward (the final value of $n$ is less than 50 in all our experiments). Each time all thresholds grow by one ∆; once the number of newly included instances over all scopes (every cluster has its own scope, and the union of these scopes is marked as $S$) is less than the number of clusters $k$, $n$ is confirmed. Every cluster uses the same $n$, but because the $\Delta_j$ differ, the scopes of the thresholds differ across clusters.
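The selection rule above can be sketched as follows. This is a minimal NumPy sketch: the per-cluster step size $\Delta_j$ (here a fixed fraction of each cluster's entropy spread) and the exact stopping rule (stop once one more step adds fewer than k instances, matching "S − S_old < k" in Algorithm 1) are assumptions based on the description, and `select_low_entropy` is a hypothetical helper name:

```python
import numpy as np

def select_low_entropy(entropy, labels, k, n_min_frac=0.5, delta=None):
    # entropy: per-instance entropy; labels: per-instance cluster assignment.
    selected = set()
    base, deltas = {}, {}
    for j in range(k):
        idx = np.where(labels == j)[0]
        order = idx[np.argsort(entropy[idx])]
        n_min = max(1, int(n_min_frac * len(order)))
        # Always keep the N_min lowest-entropy instances of each cluster.
        selected.update(order[:n_min].tolist())
        base[j] = entropy[order[n_min - 1]]
        # Assumed step size Delta_j: a fraction of the remaining entropy spread.
        deltas[j] = delta if delta is not None else (entropy[idx].max() - base[j]) / 50 + 1e-12
    n = 0
    while True:
        n += 1
        grown = set(selected)
        for j in range(k):
            idx = np.where(labels == j)[0]
            cut = base[j] + n * deltas[j]           # per-cluster threshold n * Delta_j
            grown.update(idx[entropy[idx] <= cut].tolist())
        if len(grown) - len(selected) < k:          # growth stalled: n is confirmed
            return sorted(selected)
        selected = grown
```

On an entropy distribution with a clear gap (the "ideal" case of Fig. 3(2)), the growth stalls at the gap and the high-entropy outliers are excluded.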
Nevertheless, the adaptive threshold algorithm cannot handle all entropy distributions. As shown in Fig. 3(1), it is difficult to find a precise cutoff when the distribution of entropy is almost uniform. Fig. 3(2) is an ideal distribution for the adaptive threshold algorithm, in which the entropy of the instances outside scope $S_2$ increases abnormally. Inspired by DEC, we therefore iteratively refine E to strengthen the confidence of the pseudo labels.
The original feature extractor $E$ is represented as $F_s = E(x_s; \boldsymbol{\theta}_E)$, with $F_s = \{(f_i^s, y_i^{le})\}_{i=1}^{N_{le}}$, where $F_s$ denotes the features of the instances in the source dataset and the pseudo labels are assigned by a shallow clustering method such as k-means [35]. The network $E$ is then optimized to produce features following a distribution in which the low-entropy instances are easy to identify.
Following DEC [7], we choose Student's t-distribution [36] as a kernel to measure the similarity between every embedded point and cluster centroid and then normalize it (in our experiments, the centroid of every cluster is initialized with k-means [35]):
$$q_{ij} = \frac{(1 + \|f_i^s - \mu_j\|^2)^{-1}}{\sum_{j'} (1 + \|f_i^s - \mu_{j'}\|^2)^{-1}}, \qquad (3)$$
where $q_{ij}$ represents the probability of assigning sample $i$ to cluster $j$. At the beginning of the training process, it is difficult to find an abnormal increment in the entropy distribution of $q_{ij}$ (this is shown in the experiments). To sharpen the probability distribution and refine the clustering results into a high-confidence mode, we set a target distribution:
$$p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \qquad (4)$$
where $f_j = \sum_i q_{ij}$. The loss function of the low-entropy feature extractor is the KL divergence between the soft assignments $q_{ij}$ and the target distribution $p_{ij}$:
$$L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}. \qquad (5)$$
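The soft assignment, target distribution, and KL objective above follow standard DEC and can be sketched as (a minimal NumPy sketch; function names are illustrative):

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    # Student's t kernel similarity between embedded points z and centroids mu,
    # normalized per row: the soft assignments q_ij.
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # Sharpened target p_ij = (q_ij^2 / f_j) / sum_j' (q_ij'^2 / f_j'),
    # with cluster frequencies f_j = sum_i q_ij.
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q):
    # KL(P || Q), the clustering loss driving q toward the sharpened target.
    return float((p * np.log(p / q)).sum())
```

Minimizing this loss pulls each soft assignment toward its sharpened target, which compresses the per-instance entropy and makes the threshold in the previous subsection easier to find.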

B. PixelDA to FeatureDA
After obtaining instances with low entropy and pseudo labels, we can leverage them in target domain clustering. We choose PixelDA as the base domain adaptation model. PixelDA is a generative network that generates images from source domain samples and random noise; each generated image has the same label as the image that generates it. However, the original goal of PixelDA is not only UDA, so some of its constraints are unnecessary for our goal: clustering. In the clustering problem, what we want is identification instead of generation. Hence we adapt the PixelDA model into the FeatureDA model, which can handle the classification task with lower complexity, especially when the generator input has a high dimension. The lower part of Fig. 2 describes the architecture of FeatureDA: G generates fake features from noise and the low-entropy source instances, and each fake feature has the same label as the source instance used to generate it; D is used to discriminate between the fake features and the target features. Theoretically, with different noise vectors, one image from the source domain can produce different fake features, which prevents the model from underfitting. For broader applicability, we adopt the base model of PixelDA, which is optimized with the losses $L_d$ and $L_c$. As shown in Fig. 2, our goal is to minimize the objective in Eq. (6):
$$\min_{G, C} \max_{D} \; \alpha L_d(D, G) + \beta L_c(G, C), \qquad (6)$$
where $L_d$ is the routine loss of a generative adversarial network:
$$L_d(D, G) = \mathbb{E}_{x^t}\big[\log D(E(x^t))\big] + \mathbb{E}_{x^s, z}\big[\log\big(1 - D(G(x^s, z))\big)\big], \qquad (7)$$
and $L_c$ is the loss of the classifier and generator, shown in Eq. (8). The cross-entropy loss of the classifier is calculated on both the source instances and the generated features:
$$L_c(G, C) = -\mathbb{E}_{x^s, y^{ps}, z}\Big[(y^{ps})^{\top} \log C(E(x^s)) + (y^{ps})^{\top} \log C(G(x^s, z))\Big]. \qquad (8)$$
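The adversarial and classification terms described above can be written as plain loss functions of the network outputs. The sketch below is a minimal NumPy illustration under the assumption of a binary real/fake discriminator score and cross-entropy classification on both source and generated features; all names are illustrative:

```python
import numpy as np

def bce(p, y):
    # Binary cross entropy between discriminator scores p and target label y.
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def discriminator_loss(d_target, d_fake):
    # L_d side of the game: D should score real target features as 1, fakes as 0.
    return bce(d_target, 1.0) + bce(d_fake, 0.0)

def generator_adv_loss(d_fake):
    # Adversarial term for G: fool D into scoring fake features as real.
    return bce(d_fake, 1.0)

def classifier_loss(probs_src, y_src, probs_fake, y_fake):
    # L_c: cross entropy of C on both selected source features and fake features.
    def ce(p, y):
        return float(-np.log(np.clip(p[np.arange(len(y)), y], 1e-7, None)).mean())
    return ce(probs_src, y_src) + ce(probs_fake, y_fake)
```

In training, D is updated to decrease `discriminator_loss` while G and C are updated to decrease `generator_adv_loss` and `classifier_loss`, mirroring the alternating updates of Eqs. (7) and (8).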
The influence of whether the low-entropy instances are used to train the classifier is shown in Fig. 4. Training the classifier with both source features and fake features pushes G to generate features that are easier to classify. Our algorithm is summarized in Algorithm 1. Let N, M, and k be the numbers of source domain instances, target domain instances, and clusters, respectively, and let $L_1$ denote the maximum number of neurons in the feature extractor's hidden layers. The complexity of pretraining and optimizing the feature extractor is $O(N L_1^2)$. The complexity of selecting the low-entropy instances is $O(N \log N)$, because a sorting algorithm is used. Let $L_2$ denote the maximum number of neurons in the hidden layers of G, D, and C; the complexity of training FeatureDA is $O(N L_2^2)$. Therefore, with respect to the data size, the complexity of our model is $O(N \log N)$.

IV. EXPERIMENTS
A. Experimental Settings
1) Datasets: In this section, we evaluate our method in both quantitative and qualitative ways. We compare our method on the following datasets: MNIST [37], MNIST-M [38], and USPS [39], which are digit images, and OFFICE-31 [40], which contains office environment images. The details of these datasets are listed in Table I. We applied data augmentation (i.e., noise addition, random rotation and intensification) such that A, D and W all have 500 samples, each class providing 100 samples.
Algorithm 1 (input: $X_s$, $X_t$, $k$, $N_{min}$, $N_{epoch}$, $\alpha$, $\beta$; output: cluster assignments $y_{source}$ and $y_{target}$) proceeds as follows: pretrain the original feature extractor E; initialize the centroids $\mu_j$ by k-means; predict the pseudo labels $y^{ps}$ with E; repeatedly grow the adaptive thresholds, increasing n by one ∆ per step, until the selected set S satisfies $S - S_{old} < k$; build the low-entropy set $X^{le}$ cluster by cluster; then, for each epoch, update D and G on $X^{le}$ according to Eq. (7) and update C and G on $X^{le}$ according to Eq. (8); finally, output $y_{source}$ and $y_{target}$ as predicted by C.
2) Implementation Details: The algorithm is implemented in Python. For the digit datasets, a convolutional autoencoder (CAE) is applied as the feature extractor, whose encoder structure is Input − Conv(5, 32) − Conv(5, 64) − Conv(3, 128) − FC(10), where Conv(k, c) denotes a convolutional layer with kernel size k × k and c channels. For the OFFICE-31 dataset, Inception-V3 with ImageNet-pretrained weights is used as the feature extractor for all methods. For all the compared methods, the suggested parameters are adopted. For the proposed method, we suggest setting $N_{min}$ to 50% of the instances of every cluster; $\alpha$ and $\beta$ are set to 1, and $N_{epoch}$ is set to 80.
3) Evaluation Metrics: Cluster accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI) are the clustering metrics. The reported results are averages over 10 independent runs.
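For reference, cluster accuracy (ACC) scores predictions under the best one-to-one mapping between predicted clusters and ground-truth classes. A minimal sketch follows, using brute force over label permutations, which is only feasible for small k; in practice the Hungarian algorithm (e.g., scipy.optimize.linear_sum_assignment) is used, and NMI/ARI are available in scikit-learn:

```python
import numpy as np
from itertools import permutations

def cluster_accuracy(y_true, y_pred):
    # ACC: best accuracy over all one-to-one relabelings of predicted clusters.
    pred_labels = np.unique(y_pred)
    true_labels = np.unique(y_true)
    best = 0.0
    for perm in permutations(true_labels):
        mapping = dict(zip(pred_labels, perm))
        acc = np.mean([mapping[p] == t for p, t in zip(y_pred, y_true)])
        best = max(best, acc)
    return float(best)
```

Because cluster indices are arbitrary, a prediction that is a pure relabeling of the ground truth scores a perfect 1.0.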

B. Comparison with State-of-the-art Methods
We compare our method with classic and state-of-the-art methods. Since our method combines a clustering problem with a UDA problem, we choose four clustering methods and UDA methods with a labeled source dataset as comparison methods. All the clustering methods are evaluated in two modes: "only target" and "mixture of source and target", where "only target" means the model is trained and tested on the target dataset, and "mixture of source and target" means the model is trained on both the target and source datasets and tested on the target dataset. The quantitative comparison results for the digit datasets and OFFICE-31 are shown in Table II and Table III, respectively. Table II shows that our method performs best on the MNIST→MNIST-M and USPS→MNIST-M dataset pairs. The traditional clustering methods rely heavily on the feature extractor, so the 3-layer autoencoder limits their performance. The UDA methods, which transfer the knowledge of MNIST and USPS to MNIST-M, achieve a large performance improvement. In Table III, our method performs better than most of the comparison methods, even though the UDA methods have labeled source data. We can see that the "mixture of source and target" experiments perform worse than "only target" on most dataset pairs, which shows that naively mixing different datasets cannot improve clustering.

C. Entropy Optimization Experiments
We visualize the process of optimizing the entropy distribution with Eq. (5), as shown in Fig. 5. To show the thresholds in every cluster clearly, we fix n = 18 (n is computed automatically after the optimization in our algorithm). At the beginning of the optimization, it is hard to obtain clear thresholds from the instances' entropy; in iteration 0 of Fig. 5, the first and third clusters have no thresholds. At the end of training, every cluster can be divided with a threshold, and the entropy of most instances in the same cluster is compressed into a narrow range. As training proceeds, the entropy of the dataset also declines. Fig. 5 shows that the first part of our method, "Selecting Low Entropy Source Instances with Adaptive Threshold", can produce a clear threshold according to the instances' entropy. The results in the ablation studies show that the selected instances are reliable.

D. Ablation Studies
In our method, we mainly have two strategies to enhance the performance: the first is instance selection and the second is FeatureDA. In this part, we show the influence of each. Table IV shows the results of three methods: "FeatureDA", which uses image instances to generate features and performs the clustering operation at the feature level; our method "DC-FUDA"; and DC-FUDA with all source instances transferred, called "All-DC-FUDA". Compared with All-DC-FUDA and FeatureDA, DC-FUDA obtains better results, especially when the source domain is noisy, as shown in the "A→D" and "A→W" dataset pairs. Dataset "A" has some wrongly labeled images in OFFICE-31, and our method only chooses credible instances from the source domain, which relieves the destructiveness of the noisy instances. To illustrate the influence of data selection and exclude the influence of PixelDA, we use the SPL [41] model (which also needs labeled data in the source domain) as the domain adaptation network. SPL is a selective pseudo-labeling strategy based on structured prediction. The experimental results are shown in Table V (note that, due to its high complexity, we only validate it on the OFFICE-31 dataset). In Table V, we can see the improvement brought by data selection: "SPL with DC-FUDA" performs better than "SPL with All-DC-FUDA" on every dataset setting, and even performs better than SPL itself, which has a labeled source domain. The main reason is that several wrongly labeled instances in the OFFICE-31 dataset can cause negative transfer, whereas "SPL with DC-FUDA" only chooses the low-entropy instances with reliable pseudo labels.
The difference between FeatureDA and PixelDA is whether G needs to generate image instances. Fig. 6 shows that FeatureDA focuses more on feature generation and improves the prediction performance on every dataset pair.

V. CONCLUSIONS
In this paper, we present the first work and propose a novel framework to improve deep clustering performance by transferring knowledge from a source domain without any labeled data. To select reliable instances in the source domain for transfer, we propose a novel adaptive threshold algorithm that selects low-entropy instances. To transfer important features of the selected instances, we propose a feature-level domain adaptation network for clustering. Experimental results demonstrate that our method can significantly enhance deep clustering performance without using any label information from the source domain. Moreover, compared to state-of-the-art methods that need labeled source data, our method achieves competitive results.