Cross-Modality Hierarchical Clustering and Refinement for Unsupervised Visible-Infrared Person Re-Identification

Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality image retrieval task. Compared to visible-modality person re-identification, which handles only the intra-modality discrepancy, VI-ReID suffers from an additional modality gap. Most existing VI-ReID methods achieve promising accuracy in a supervised setting, but the high annotation cost limits their scalability to real-world scenarios. Although a few unsupervised VI-ReID methods already exist, they typically rely on intra-modality initialization and cross-modality instance selection, even though intra-modality initialization incurs additional computational time. In this paper, we study the fully unsupervised VI-ReID problem and propose a novel cross-modality hierarchical clustering and refinement (CHCR) method that promotes modality-invariant feature learning and improves the reliability of pseudo-labels. Unlike conventional VI-ReID methods, CHCR does not rely on any manual identity annotation or intra-modality initialization. First, we design a simple and effective cross-modality clustering baseline that clusters samples across the two modalities. Then, to provide sufficient inter-modality positive sample pairs for modality-invariant feature learning, we propose a cross-modality hierarchical clustering algorithm that promotes the clustering of inter-modality positive samples into the same cluster. In addition, we develop an inter-channel pseudo-label refinement algorithm that eliminates unreliable pseudo-labels by checking the clustering results of the three channels in the visible modality. Extensive experiments demonstrate that CHCR outperforms state-of-the-art unsupervised methods and achieves performance competitive with many supervised methods.

I. INTRODUCTION

Although existing supervised VI-ReID methods [24], [25], [26] have achieved promising performance, they require large-scale cross-modality labeled datasets [20], [21]. Labeling datasets for image retrieval problems is time-consuming, and the modality gap further increases the difficulty of annotation. To address the aforementioned problems, Liang et al. [27] proposed the first unsupervised VI-ReID method, H2H. As shown in Fig. 1(a), H2H is first pre-trained on a labeled source domain [28] and then conducts both intra-modality initialization and cross-modality instance selection on the unlabeled target domain (a visible-infrared dataset) [20], [21]. Although H2H needs no identity annotation in the visible-infrared domain, annotation is still necessary in the source domain for pre-training. Therefore, in practice, H2H is not a fully unsupervised method but in fact a cross-domain method [29], [30]. Cross-domain methods not only require additional data preprocessing but also require an appropriate source domain, which does not always exist [31], [32]. Recently, Yang et al. [33] proposed the first fully unsupervised VI-ReID method, called ADCA. As shown in Fig. 1(b), although ADCA does not need pre-training, it still requires intra-modality initialization. Similarly, DFC [34] also follows the principle of intra-modality initialization and cross-modality instance selection. Although intra-modality initialization may lead to higher computational complexity and longer training times, both intra-modality initialization and cross-modality instance selection have become prevailing principles in unsupervised VI-ReID. This observation raises the question of whether new approaches relying solely on cross-modality clustering are feasible; we will show that they are.
The existing methods require intra-modality initialization because the significant gap between modalities impedes cross-modality clustering, compelling them to cluster solely within each modality. To address this challenge, in this paper we propose a novel fully unsupervised VI-ReID method called cross-modality hierarchical clustering and refinement (CHCR). CHCR is fully unsupervised and does not require intra-modality initialization. As shown in Fig. 1(c), we design a simple and effective cross-modality clustering baseline in CHCR. Different from existing methods, the baseline aims to reduce the modality gap at two levels to promote cross-modality clustering. At the image level, previous research [20] shows that the gap between grayscale and infrared images is smaller than that between visible and infrared images, so the baseline converts visible images to grayscale images and uses a linear transformation to reduce the modality gap. In addition, inspired by CAJ [25], we incorporate a gamma transformation as a data augmentation technique during training to increase the robustness of the model to the modality gap. At the feature level, following AGW [35], we share most layers of the CNN model between the two modalities and design a modality contrastive loss to encourage the model to learn modality-invariant features. Unlike the maximum mean discrepancy (MMD) distance [36], which is widely used in existing VI-ReID methods [27], [37], the modality contrastive loss can prevent identity misalignment [12] when aligning the feature distributions of the two modalities. Our cross-modality clustering baseline achieves promising recognition performance, which is improved upon by the additional innovations that we describe next.
Following the existing unsupervised visible ReID methods, our baseline also iterates between clustering and fine-tuning. However, under the cross-modality setting, the baseline inevitably encounters two issues: 1) Despite the reduction of the modality gap, clustering positive sample pairs (i.e., images of the same person captured in different modalities) into a single cluster remains challenging. This conflicts with the modality contrastive loss, which requires inter-modality positive sample pairs for effective learning.
2) The clustering algorithm inevitably generates noisy labels, and the reduction of the three RGB channels of visible images to grayscale images not only causes information loss but can also amplify the label noise. Noisy labels accumulate during training and eventually hinder the improvement of model performance.
To address the first issue, we design a cross-modality hierarchical clustering (CHC) algorithm in CHCR. CHC divides the cross-modality clustering process into two stages: it first clusters within each modality and then merges clusters from the two modalities according to their similarity. The advantages of hierarchical clustering are twofold: in the first stage, CHC effectively exploits sample similarity within each modality and protects the clustering algorithm from the modality gap; in the second stage, CHC provides sufficient inter-modality positive sample pairs for the modality contrastive loss by merging inter-modality clusters. To address the second issue, we design an inter-channel pseudo-label refinement (IPR) algorithm in CHCR. IPR makes effective use of the prior knowledge that all channels of the same sample share the same identity. Specifically, the algorithm first performs clustering within each channel and then refines the pseudo-labels by evaluating the consistency of the clustering results across the three RGB channels.
To summarize, our contributions are as follows:
• We propose a simple and effective cross-modality clustering baseline that requires neither a labeled source domain nor intra-modality initialization. To the best of our knowledge, the baseline is the first attempt to solve the cross-modality clustering problem in VI-ReID.
• We propose a cross-modality hierarchical clustering algorithm, which promotes the clustering of positive samples from different modalities into the same cluster. This promotes the generation of adequate inter-modality positive sample pairs, which are essential for modality-invariant feature learning.
• We propose an inter-channel pseudo-label refinement algorithm, which improves the reliability of pseudo-labels by checking the clustering results of the three RGB channels in the visible modality.
• Extensive experimental results on two standard benchmarks demonstrate that our method performs favorably relative to state-of-the-art unsupervised methods. In addition, our method achieves promising performance compared to that of supervised methods.

II. RELATED WORK

A. Fully Unsupervised Visible Person Re-Identification
The existing research on fully unsupervised ReID mainly focuses on the visible modality [38], [39], [40], [41], [42], [43], [44]. These methods usually learn from unlabeled visible images and then match within the visible modality. For example, BUC [42] uses bottom-up hierarchical clustering to generate pseudo-labels and designs a repelled loss to increase intra-class similarity. This method also serves as a paradigm for much of the subsequent research. HCT [43] introduces the batch hard triplet loss [45] on the basis of BUC, which effectively improves the robustness of the model to hard samples. Wu et al. [46] construct patch surrogate classes as initial supervision and propose assigning pseudo-labels to images through pairwise gradient-guided similarity separation, achieving better performance than BUC and HCT. However, the camera gap, i.e., the difference in features across cameras, limits the performance of these methods in the unsupervised scenario.
Among the early ReID methods, PCSL [47] considers the influence of both intra-camera and inter-camera labels and trains a deep neural network using generated cross-camera soft labels. To bridge the camera gap, IICS [16] adopts a strategy of training a classifier for each camera individually and utilizes the scores from these classifiers to enhance the similarity of inter-camera positive sample pairs. Although this method has achieved promising performance, it is difficult to train a classifier for each camera in large-scale scenes. MetaCam [44] introduces meta-learning into model training as a new approach for handling the camera gap. In subsequent research, ICE [31] achieves competitive performance by adopting camera-invariant feature learning via a suitably designed optimization method. Recently, CIFL [32] has further improved model performance through enhancements in clustering and optimization that are designed to combat the camera gap. Different from the above methods, in this paper a more challenging fully unsupervised VI-ReID problem is explored. Although unsupervised visible ReID methods are difficult to apply directly to the VI-ReID scenario, they provide inspiration for our research.

B. Visible-Infrared Person Re-Identification
Compared with visible ReID, VI-ReID is much more challenging because the modality gap increases the difficulty of cross-modality matching between visible and IR imagery. Wu et al. [20] proposed the first supervised VI-ReID method and converted visible images into grayscale images to address the modality gap. More recent studies usually reduce the modality gap in two ways: (1) mapping images from different modalities to the same feature space to learn shared features [48], [49], [50], [51], [52], [53], and (2) exploiting alignment methods to reduce the modality gap [54], [55], [56], [57]. JSIA [22] and Hi-CMD [58] both additionally use feature disentanglement frameworks to learn features that are modality-invariant and identity-related. Recent methods have achieved promising performance by making use of color-invariant learning [25], [37], neural feature search [24], and feature-level compensation [26].
In practice, however, the utility of supervised VI-ReID is severely limited by its strong reliance on identity annotations. Recently, Liang et al. [27] proposed the first unsupervised VI-ReID method, H2H. H2H is first pre-trained on the Market-1501 dataset [28] and then completes homogeneous-to-heterogeneous learning on an unlabeled visible-infrared dataset [20], [21]. In addition, H2H relies on a suitably designed cross-modality re-ranking (CMRR) to further improve test accuracy. Although H2H does not use identity annotation in the cross-modality scenario, it still relies on an additional labeled source domain, so the method is not fully unsupervised. In subsequent research, ADCA [33] and DFC [34] developed fully unsupervised approaches that remove the need for a labeled source domain and further improve model performance. OTAL [59] follows the standard unsupervised domain adaptation approach of generating pseudo-labels for the visible subset with the help of well-annotated RGB datasets and then assigns pseudo-labels from the visible modality to the infrared modality. Although existing methods continue to make performance gains, they invariably rely on intra-modality initialization and cross-modality instance selection. Different from these prior methods, in this paper a fully unsupervised VI-ReID method is developed based on cross-modality clustering, which completely eliminates the reliance on intra-modality initialization.

III. PROPOSED METHOD
The purpose of fully unsupervised VI-ReID is to learn a modality-invariant and identity-related feature representation without using identity annotation. Specifically, we train the model on an unlabeled visible-infrared dataset $\{X^v, X^{ir}\}$ to enable the model to match samples with the same identity between $X^v$ and $X^{ir}$. Here, $X^v = \{x_i^v\}_{i=1}^{N^v}$ represents the visible modality dataset and $X^{ir} = \{x_i^{ir}\}_{i=1}^{N^{ir}}$ represents the infrared modality dataset, where $N^v$ and $N^{ir}$ are the numbers of samples in the visible and infrared modalities, respectively.
As shown in Fig. 2, the proposed CHCR consists of three components: a cross-modality clustering baseline, cross-modality hierarchical clustering (CHC), and inter-channel pseudo-label refinement (IPR). The baseline applies linear scaling and a gamma transformation to the data, as detailed in the implementation details, and, as shown in Fig. 2(a), utilizes DBSCAN [60] to generate pseudo-labels between the modalities. In Fig. 2(b), CHC and IPR are used to improve the baseline. We introduce these three components in the following sections.

A. Cross-Modality Clustering Baseline
Inspired by MoCo [61], our cross-modality clustering baseline includes two DNN encoders: an encoder E and a momentum encoder M. E and M have the same structure. As shown in Fig. 2(a), E is updated by back-propagation, and the weight of M is defined as the temporal accumulation of E:

$\theta_M^t = w\,\theta_M^{t-1} + (1 - w)\,\theta_E^t \quad (1)$

where $\theta_M^t$ and $\theta_E^t$ represent the weights of M and E, respectively, at the $t$-th iteration, $\theta_M^{t-1}$ represents the weight of M at the $(t-1)$-th iteration, and $w$ is the momentum coefficient that controls the update speed of the momentum encoder.
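As a concrete illustration, the following is a minimal PyTorch sketch of the momentum update in Eq. (1); the assumption that E and M are `nn.Module` instances with identical architectures, and the value $w = 0.999$, follow the implementation details:

```python
import torch

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, w=0.999):
    # Eq. (1): theta_M^t = w * theta_M^(t-1) + (1 - w) * theta_E^t
    for p_e, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(w).add_(p_e.data, alpha=1.0 - w)
```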
First, inspired by deep zero-padding [20], we convert the visible images $X^v$ to grayscale images $X^s$ to reduce the modality gap at the image level. Additionally, as detailed in the implementation details, we apply linear scaling and gamma transformation to $\{X^s, X^{ir}\}$. Then, we use M to extract the features of $\{X^s, X^{ir}\}$ and use DBSCAN [60] to generate pseudo-labels. We discard outliers and implement PK sampling [45] on the labeled data. Finally, we use the softmax loss $L_{soft}$ [42], batch hard triplet loss $L_{hard}$ [45], and modality contrastive loss $L_{moda}$ to jointly optimize encoder E:

$L = L_{soft} + \lambda_h L_{hard} + \lambda_m L_{moda} \quad (2)$

where $\lambda_h$ and $\lambda_m$ are used to control the scale of $L_{hard}$ and $L_{moda}$, respectively.

1) Softmax Loss: Based on the pseudo-labels obtained from the cross-modality clustering, we first calculate the centroid of each cluster. For example, the centroid $c_p$ of the $p$-th cluster is defined as:

$c_p = \frac{1}{n_p} \sum_{i=1}^{n_p} m_i \quad (3)$

where $m_i$ is a feature of sample $x_i$ in the cluster extracted by M and $n_p$ is the total number of samples in the cluster. We refer to the cluster centroid with the same cluster label as sample $x_p$ as the positive cluster centroid for $x_p$, and to the other cluster centroids as the negative cluster centroids for $x_p$. For any sample $x_p$, the purpose of the softmax loss $L_{soft}$ is to increase the similarity between $x_p$ and its positive cluster centroid and to reduce the similarity between $x_p$ and its negative cluster centroids:

$L_{soft} = -\log \frac{\exp(f_p \cdot c_p / \tau_s)}{\sum_{j=1}^{n_c} \exp(f_p \cdot c_j / \tau_s)} \quad (4)$

where $f_p$ is the feature of sample $x_p$ extracted by E, $c_p$ is the positive cluster centroid of $x_p$, $n_c$ is the number of clusters at the current stage, and $\tau_s$ is the temperature parameter [62] for $L_{soft}$.
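A minimal sketch of the centroids in Eq. (3) and the loss in Eq. (4), assuming L2-normalized feature tensors and integer pseudo-labels; the tensor shapes are our assumptions:

```python
import torch
import torch.nn.functional as F

def cluster_centroids(feats_m, labels):
    # Eq. (3): centroid c_p is the mean of momentum features m_i in cluster p
    n_c = int(labels.max()) + 1
    cents = torch.stack([feats_m[labels == p].mean(dim=0) for p in range(n_c)])
    return F.normalize(cents, dim=1)

def softmax_loss(f, centroids, labels, tau_s=0.5):
    # Eq. (4): pull each feature toward its positive cluster centroid and
    # push it away from all other (negative) cluster centroids
    logits = f @ centroids.t() / tau_s        # (B, n_c) similarities
    return F.cross_entropy(logits, labels)
```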
2) Batch Hard Triplet Loss: To compute $L_{hard}$, we first select $P$ identities from the clustering results and then choose $K$ samples from each identity to form a mini-batch. For a mini-batch of size $P \times K$, $L_{hard}$ selects each sample $x_a^i$ as an anchor, then favors increasing the similarity between $x_a^i$ and the hardest positive sample $x_p^i$ and decreasing the similarity between $x_a^i$ and the hardest negative sample $x_n^j$:

$L_{hard} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[ \beta_{mar} + \max_{p=1,\dots,K} d\big(f(x_a^i), f(x_p^i)\big) - \min_{\substack{j=1,\dots,P,\; n=1,\dots,K \\ j \neq i}} d\big(f(x_a^i), f(x_n^j)\big) \Big]_+ \quad (5)$

where $\beta_{mar}$ is the margin hyperparameter and $d(\cdot, \cdot)$ is a function used to measure the distance between features extracted by E. Specifically, in this paper, the Euclidean distance is used.
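A sketch of the batch-hard mining [45] in Eq. (5) for a $P \times K$ mini-batch; vectorizing the anchor loop is a presentation choice, not part of the paper:

```python
import torch

def batch_hard_triplet_loss(feats, labels, margin=0.5):
    # For each anchor: farthest same-label sample (hardest positive) and
    # nearest different-label sample (hardest negative), Euclidean distance.
    dist = torch.cdist(feats, feats)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.relu(margin + d_pos - d_neg).mean()   # [.]_+ hinge, averaged
```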
3) Modality Contrastive Loss: To handle the modality gap, traditional methods usually employ MMD [36] to align the feature distributions of the two modalities. However, MMD often results in identity misalignment [22]. To address this problem, we design a modality contrastive loss. First, we calculate the modality centroids based on the clustering results. The modality centroid $c_{pq}$ with modality label $q$ in the $p$-th cluster is defined as:

$c_{pq} = \frac{1}{n_{pq}} \sum_{i=1}^{n_{pq}} m_i \quad (6)$

where $q \in \{0, 1\}$ is the modality label, with $q = 0$ representing the visible modality and $q = 1$ representing the infrared modality; $m_i$ is a feature of the sample $x_i$, extracted by M, having modality label $q$ in cluster $p$; and $n_{pq}$ is the total number of samples with modality label $q$ in cluster $p$. For sample $x_{pq}$, we refer to the modality centroid $c_{pl}$ ($l = 1 - q$) with the same cluster label and a different modality label as the positive modality centroid, and we refer to the modality centroids $c_{ij}$ ($i \neq p$) with cluster labels different from that of sample $x_{pq}$ as the negative modality centroids. As shown in Fig. 3, for any labeled sample $x_{pq}$, minimizing the loss $L_{moda}$ increases the similarity to the positive modality centroid and reduces the similarity to the negative modality centroids:

$L_{moda} = -\log \frac{\exp(f_{pq} \cdot c_{pl} / \tau_m)}{\exp(f_{pq} \cdot c_{pl} / \tau_m) + \sum_{c \in Q} \exp(f_{pq} \cdot c / \tau_m)} \quad (7)$

where $f_{pq}$ is the feature of sample $x_{pq}$ extracted by E, $c_{pl}$ is the positive modality centroid of $x_{pq}$, $Q$ is the set of the hardest negative modality centroids of sample $x_{pq}$, and $\tau_m$ is the temperature parameter [62] of $L_{moda}$.
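A minimal sketch of Eq. (7) for a single sample, assuming normalized features; constructing the hardest negative set $Q$ ($|Q| = 20$ in the implementation details) as a top-k over similarities is our assumption:

```python
import torch

def modality_contrastive_loss(f_pq, c_pl, neg_centroids, tau_m=0.1, k=20):
    # f_pq: (d,) feature from E; c_pl: (d,) positive modality centroid;
    # neg_centroids: (n, d) modality centroids from other clusters.
    sims = neg_centroids @ f_pq                       # similarities to negatives
    q = sims.topk(min(k, sims.numel())).values        # hardest negative set Q
    pos = torch.exp(f_pq @ c_pl / tau_m)
    neg = torch.exp(q / tau_m).sum()
    return -torch.log(pos / (pos + neg))              # Eq. (7)
```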

B. Cross-Modality Hierarchical Clustering
While the modality contrastive loss has the potential to overcome the modality gap, its effectiveness relies on the availability of a sufficient number of inter-modality positive sample pairs and positive modality centroids. The modality gap makes it difficult to cluster inter-modality positive sample pairs effectively, as their similarity is typically reduced. Consequently, when the number of inter-modality positive sample pairs is insufficient, the performance of the modality contrastive loss is constrained.
To address the above problems, we propose a cross-modality hierarchical clustering (CHC) algorithm. As shown in Fig. 4, CHC includes two stages: 1) intra-modality clustering and 2) inter-modality clustering. In intra-modality clustering, we cluster within each modality separately. In inter-modality clustering, we first calculate cluster centroids for the two modalities according to Eq. (3) and normalize them to form the cluster centroid matrices $C^v \in \mathbb{R}^{d \times n^v}$ and $C^{ir} \in \mathbb{R}^{d \times n^{ir}}$, where $n^v$ and $n^{ir}$ are the numbers of cluster centroids in the visible and infrared modalities, respectively, and $d$ is the dimension of the feature vector. Then, we calculate the similarity matrix of inter-modality cluster pairs based on cosine similarity and the Jaccard distance of k-reciprocal nearest neighbors [63]:

$S = (1 - \alpha)\,(C^v)^{\top} C^{ir} + \alpha \big(1 - d_J(C^v, C^{ir})\big) \quad (8)$

where $d_J(C^v, C^{ir})$ represents the Jaccard distance of k-reciprocal nearest neighbors, which has been proven effective in alleviating the modality gap [27], and $\alpha$ is used to control the relative contributions of the two distances. Element $s_{ij}$ in the similarity matrix $S \in \mathbb{R}^{n^v \times n^{ir}}$ represents the similarity between the $i$-th cluster in the visible modality and the $j$-th cluster in the infrared modality. Finally, we select the largest $h$ elements in $S$ and merge the corresponding clusters to obtain the final clustering result. It should be noted that, unlike intra-modality initialization and cross-modality instance selection, two-stage clustering does not increase computational complexity. On the contrary, in the second stage, CHC only calculates cluster similarities, which reduces computational complexity. More significantly, CHC effectively promotes the clustering of samples from different modalities into the same cluster in the second stage.
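A sketch of the second CHC stage; the blend weighting in Eq. (8) follows the convention of [63] and, together with the helper supplying the Jaccard-distance matrix, is an assumption on our part:

```python
import numpy as np

def chc_merge(C_v, C_ir, d_jaccard, alpha=0.7, h=None):
    # C_v: (d, n_v), C_ir: (d, n_ir) normalized centroid matrices;
    # d_jaccard: (n_v, n_ir) Jaccard distances of k-reciprocal neighbors [63].
    S = (1 - alpha) * (C_v.T @ C_ir) + alpha * (1 - d_jaccard)   # Eq. (8)
    r = min(S.shape)
    h = h if h is not None else int(0.7 * r)      # h = 0.7r per Section IV-B
    top = np.argsort(S, axis=None)[::-1][:h]      # h largest similarities
    pairs = np.column_stack(np.unravel_index(top, S.shape))
    return [(int(i), int(j)) for i, j in pairs]   # (visible, infrared) clusters to merge
```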

C. Inter-Channel Pseudo-Label Refinement
Due to the influence of brightness, background, and other factors, the clustering algorithm inevitably produces noisy labels. The existing method [20] simplifies the three-channel visible images into single-channel grayscale images, which not only causes information loss but may also exacerbate the generation of noisy labels. To this end, we consider how to improve the quality of the pseudo-labels.
In reality, we can determine identity from any of the three channels of a visible image. Similarly, in the clustering process, the three channels of the same visible image should receive the same pseudo-label. However, due to the poor performance of the initial model, the features of different channels of the same sample differ substantially, which predisposes the clustering algorithm to assign different pseudo-labels to different channels. Such pseudo-labels are usually unreliable. Noisy labels interfere with the correct optimization direction and hinder performance improvement. Intuitively, we can eliminate these unreliable pseudo-labels by checking the clustering results of the three channels, which cannot be done with grayscale images alone. Therefore, we design an inter-channel pseudo-label refinement (IPR) algorithm that refines the pseudo-labels by considering the consistency of the clustering results across the three channels.
As shown in Fig. 2(b), we first extract the three channels $X^r$, $X^g$ and $X^b$ from the visible modality. Then, we combine each channel with the infrared modality to obtain $\{X^r, X^{ir}\}$, $\{X^g, X^{ir}\}$ and $\{X^b, X^{ir}\}$, and we cluster these three combinations. We use $I_i^r$, $I_i^g$ and $I_i^b$ to represent the sample set in the $i$-th cluster of $\{X^r, X^{ir}\}$, $\{X^g, X^{ir}\}$ and $\{X^b, X^{ir}\}$, respectively. Next, we calculate the clustering consistency matrix $U \in \mathbb{R}^{n^r \times n^g \times n^b}$ of the three combinations based on the intersection over union (IoU), where $n^r$, $n^g$ and $n^b$ represent the numbers of clusters in the three combinations. Any element $u_{i,j,k}$ in $U$ is defined as the IoU of the corresponding clusters:

$u_{i,j,k} = \frac{|I_i^r \cap I_j^g \cap I_k^b|}{|I_i^r \cup I_j^g \cup I_k^b|} \quad (9)$

where $|\cdot|$ indicates the number of elements of a set, and $I_i^r$, $I_j^g$, and $I_k^b$ denote the sample sets from the $i$-th cluster of $\{X^r, X^{ir}\}$, the $j$-th cluster of $\{X^g, X^{ir}\}$, and the $k$-th cluster of $\{X^b, X^{ir}\}$, respectively. When calculating the IoU, we regard different channels of an image as the same sample. As shown in Fig. 5, we set a threshold $t$; for the elements in $U$ that reach the threshold, we take the corresponding intersection as the refined cluster and add all channels of the same image to this cluster. The advantage of this is twofold: 1) even though the various channels of a single sample may differ in brightness and contrast, they are essentially positive samples for each other, so optimizing the distance between them enhances the model's robustness to brightness and contrast; 2) refinement reduces the amount of training data, and we find that the number of visible images is reduced more than that of infrared images, so adding multiple channels supplements the training set. Finally, we take all refined clusters as the clustering result.
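A brute-force sketch of IPR using Eq. (9), assuming each cluster is a Python set of sample IDs and that the three channels of an image share one ID; the triple loop is $O(n^r n^g n^b)$ and is kept only for clarity:

```python
def refine_pseudo_labels(clusters_r, clusters_g, clusters_b, t=0.5):
    # clusters_*: lists of sample-ID sets from clustering {X^r, X^ir},
    # {X^g, X^ir} and {X^b, X^ir}. Keep intersections whose IoU reaches t.
    refined = []
    for I_r in clusters_r:
        for I_g in clusters_g:
            for I_b in clusters_b:
                inter = I_r & I_g & I_b
                union = I_r | I_g | I_b
                if union and len(inter) / len(union) >= t:  # Eq. (9)
                    refined.append(inter)  # reliable refined cluster
    return refined
```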
Previous research [32], [61] has demonstrated that the momentum encoder has higher stability. Therefore, we utilize the momentum encoder M to extract features in the testing phase. Unlike ICE [31], we calculate the distances between the three channels ($x_i^r$, $x_i^g$ and $x_i^b$) of the visible image $x_i^v$ and the infrared image $x_j^{ir}$ and take the sum as the final distance:

$d(x_i^v, x_j^{ir}) = d(x_i^r, x_j^{ir}) + d(x_i^g, x_j^{ir}) + d(x_i^b, x_j^{ir}) \quad (10)$

where $d(\cdot, \cdot)$ is a function used to measure the distance between features extracted by M; the Euclidean distance is used in this paper. In summary, the overall training process of CHCR is listed in Algorithm 1. The data preprocessing is described in the implementation details section.
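A sketch of the test-time matching in Eq. (10), assuming feature matrices extracted by M for each visible channel and for the infrared gallery:

```python
import torch

def channel_sum_distance(f_r, f_g, f_b, f_ir):
    # f_r, f_g, f_b: (N_v, d) features of the three visible channels;
    # f_ir: (N_ir, d) infrared features. Eq. (10) sums the three distances.
    return (torch.cdist(f_r, f_ir)
            + torch.cdist(f_g, f_ir)
            + torch.cdist(f_b, f_ir))
```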
IV. EXPERIMENTS

A. Experimental Settings

1) Datasets and Evaluation Metrics: The SYSU-MM01 [20] dataset is a challenging benchmark for VI-ReID; it contains 30,071 visible images and 15,792 infrared images of 491 identities captured by 6 cameras (2 infrared and 4 visible). The training set contains 395 identities, and the test set contains 96 identities. In the test phase, infrared images are used for the probe set and visible images are used for the gallery set. Cameras 2 and 3 are placed in the same scene, so probe images from camera 3 skip the gallery images of camera 2. The dataset includes all-search and indoor-search test modes, and we conduct tests under the more challenging all-search mode.
The RegDB [21] dataset contains 8,240 images of 412 identities, with each identity having 10 visible images and 10 thermal images. Following TONE+HCML [64], we select 206 identities as the training set and the other 206 identities as the test set. This random selection is repeated ten times, and the overall average accuracy is reported in the final performance statistics.
We use the cumulative matching characteristics (CMC) and the mean average precision (mAP) to evaluate model performance. All training is conducted in a fully unsupervised mode, and the identity labels are used only in the test phase.
2) Implementation Details: During data preprocessing, all images are resized to 288×144 pixels. For RegDB [21], we perform grayscale inversion on both the grayscale images and the single-channel images. In addition, for both datasets, we linearly scale the gray values to the range [127, 255]. Inspired by CAJ [25], we incorporate a random gamma transformation with a range of [0.5, 1.0] as a data augmentation technique during training to increase the robustness of the model to the modality gap.
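A NumPy sketch of these two preprocessing steps; normalizing each image by its own min and max before scaling into [127, 255] is our assumption:

```python
import numpy as np

def linear_scale(gray, lo=127, hi=255):
    # Linearly scale gray values into [lo, hi] to shrink the gap to IR images.
    g = gray.astype(np.float32)
    g = (g - g.min()) / max(float(g.max() - g.min()), 1e-6)
    return (lo + g * (hi - lo)).astype(np.uint8)

def random_gamma(img, low=0.5, high=1.0):
    # Random gamma transformation in [0.5, 1.0] used as data augmentation.
    gamma = np.random.uniform(low, high)
    return (255.0 * (img / 255.0) ** gamma).astype(np.uint8)
```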
We use the ResNet-50 [65] based AGW [35] pre-trained on ImageNet [66] as the network backbone of both the encoder and the momentum encoder. For DBSCAN [60], we set the minimum number of cluster samples to 4 and the distance threshold to 0.55 on SYSU-MM01, and the minimum number of cluster samples to 4 and the distance threshold to 0.25 on RegDB. We renew the pseudo-labels at the beginning of each epoch. We set the batch size to 32, where P = 8 and K = 4. We use Adam [67] to optimize all models, with the learning rate set to 0.00035. For $L_{soft}$, we set $\tau_s = 0.5$. For $L_{hard}$, we set $\lambda_h = 5$ and $\beta_{mar} = 0.5$. For $L_{moda}$, we set the number of elements in $Q$ to 20. Following existing research, we set $w = 0.999$ [61] and $\alpha = 0.7$ [63]. In the training phase, we train for 50 epochs in total. In the test phase, only the momentum encoder is used for inference.
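For reference, the clustering step with scikit-learn's DBSCAN under these settings; treating the stated distance threshold as the eps parameter over a precomputed distance matrix is our assumption:

```python
from sklearn.cluster import DBSCAN

# dist: (N, N) pairwise distance matrix over features extracted by M
labels = DBSCAN(eps=0.55,            # 0.25 on RegDB
                min_samples=4,
                metric='precomputed').fit_predict(dist)
# DBSCAN marks outliers with label -1; they are discarded before PK sampling
```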

B. Parameter Analysis
In this section, we analyze the impact of the following four hyperparameters on performance: λ m , τ m , h and t.
1) $\lambda_m$ and $\tau_m$ of the Modality Contrastive Loss: In Fig. 6, we illustrate how the performance varies as $\lambda_m$ ranges from 0 to 2 via plots of mAP and Rank-1. Note that $\lambda_m = 0$ corresponds to the situation where the modality contrastive loss makes no contribution to the overall loss. We find that on both SYSU-MM01 and RegDB, the model obtains the best performance when $\lambda_m = 0.5$; the results verify that the hyperparameter generalizes across datasets. The worst performance is obtained when $\lambda_m = 0$, which preliminarily verifies the effectiveness of the modality contrastive loss. In Fig. 7, we explore the best $\tau_m$ for the modality contrastive loss and find that the model achieves optimal performance on both datasets when $\tau_m = 0.1$.
2) $h$ of Cross-Modality Hierarchical Clustering: In Fig. 8, we plot the mAP and Rank-1 as the parameter $h$ is varied from $0.6r$ to $r$, where $r$ denotes the smaller of the number of rows and the number of columns of the similarity matrix $S$. We find that the model obtains the best performance on both SYSU-MM01 and RegDB when $h = 0.7r$, which verifies the generalization of $h$.
3) $t$ of Inter-Channel Pseudo-Label Refinement: Fig. 9 shows plots of the performance on the two datasets as a function of the parameter $t$ of the inter-channel pseudo-label refinement. On SYSU-MM01, the model achieves the best performance when $t = 0.50$; on RegDB, when $t = 0.45$. When $t$ is too large or too small, the performance is poor. This is because, when $t$ is set to a large value, a large number of samples are discarded, resulting in insufficient training data; on the other hand, when $t$ is set to a small value, a large number of noisy samples are retained, which leads to an insignificant refinement effect. The above results preliminarily validate the effectiveness of IPR.

C. Comparison With State-of-the-Art Methods
In Table I and Table II, we compare our methods with the state-of-the-art methods on SYSU-MM01 and RegDB. The existing research on unsupervised ReID mainly focuses on the visible modality. We refer to the existing unsupervised visible ReID methods reported by H2H [27] and ADCA [33], including HHL [68], SSG [69], ECN [70], MMT [71], SpCL [72], CAP [41], ICE [31], and cluster contrast [73]. The results show that MMT [71] and cluster contrast [73] achieve competitive performance on the two datasets. Since these methods do not consider the modality gap, their performance is lower than that of our baseline and significantly lower than that of the proposed CHCR approach.
Among the unsupervised VI-ReID methods, H2H [27], ADCA [33], OTAL [59], and DFC [34] show better performance than the single-modality methods. While H2H [27] and H2H [27]+AGW [35] are not entirely unsupervised, our baseline outperforms both H2H models without using any ID labels, as seen in Table I and Table II. Moreover, the proposed CHCR demonstrates notably superior performance on SYSU-MM01 compared to ADCA [33], OTAL [59] and DFC [34], and attains results comparable to ADCA on RegDB. Specifically, compared to the second-best performing ADCA method, CHCR provides a significant performance gain of 2.21% Rank-1, 1.98% Rank-10, and 2.61% mAP on SYSU-MM01 (single-shot). This is mainly because the aforementioned methods encounter difficulty in establishing connections between inter-modality positive samples and neglect the impact of noisy labels. In contrast, the proposed method utilizes CHC to obtain pseudo-labels that are robust to the modality gap and employs IPR to handle noisy labels. Additionally, CHCR requires only 50 epochs, while ADCA needs 100 epochs for intra-modality initialization and cross-modality instance selection. These experimental results demonstrate the superiority of our proposed method.
Considering that H2H [27] uses CMRR to improve test accuracy, we also assess the performance of CHCR+CMRR. It can be seen from Table I and Table II that, after the introduction of CMRR, CHCR+CMRR outperforms H2H+AGW+CMRR by a large margin and achieves the optimal performance under all four test scenarios. These experimental results verify the superiority of the proposed methods.
We also compared the computation time requirements (training and test) on the two datasets for the proposed approach against the competitive methods MMT [71], ICE [31], cluster contrast [73], and ADCA [33]. For a fair comparison, two RTX 3090 GPUs were used for each method.

D. Ablation Study
For the fully unsupervised VI-ReID problem, we designed three new components: the modality contrastive loss $L_{moda}$, cross-modality hierarchical clustering (CHC), and inter-channel pseudo-label refinement (IPR). We conduct ablation experiments, reported in Table IV, to validate the effectiveness of each component. Note that, compared with the baseline, the model labeled A1 does not use $L_{moda}$, the model labeled A2 replaces $L_{moda}$ with MMD, and models A3 and A4 add CHC and IPR, respectively.
1) Effectiveness of the Modality Contrastive Loss: The purpose of $L_{moda}$ is to promote modality-invariant feature learning. As shown in Table IV, the performance of the baseline is significantly better than that of A1 and A2. Specifically, relative to A2, the baseline attains gains of 6.25% Rank-1, 9.23% Rank-10 and 6.12% mAP on SYSU-MM01 (single-shot); 4.20% Rank-1, 8.15% Rank-10 and 4.25% mAP on SYSU-MM01 (multi-shot); and 9.28% Rank-1, 5.98% Rank-10 and 8.78% mAP on RegDB (visible to infrared).
To better understand the effectiveness of $L_{moda}$ in promoting modality-invariant feature learning and in preventing identity misalignment, we visualize the distance distributions on SYSU-MM01 in Fig. 10. We observe two phenomena: (1) compared with A1, both A2 and the baseline make the distance distribution of inter-modality positive sample pairs approximate that of intra-modality positive sample pairs; that is, both MMD and $L_{moda}$ guide the model to learn modality-invariant features. (2) Compared with A1 and A2, the baseline effectively reduces the overlap between the distance distributions of inter-modality positive and negative sample pairs; that is, $L_{moda}$ can effectively prevent identity misalignment. These experimental analyses demonstrate the effectiveness of the modality contrastive loss $L_{moda}$.

2) Effectiveness of Cross-Modality Hierarchical Clustering:
The purpose of CHC is to facilitate the clustering of inter-modality positive samples into the same cluster. As shown in Table IV, with the help of CHC, the performance of A3 is better than that of the baseline. Specifically, for the three metrics Rank-1, Rank-10, and mAP, the performance improves by 5.16%, 3.65% and 4.57% on SYSU-MM01 (single-shot); by 4.92%, 5.25% and 6.31% on SYSU-MM01 (multi-shot); and by 9.90%, 5.56% and 11.07% on RegDB (visible to infrared).
To further verify that CHC effectively increases the number of inter-modality positive sample pairs, we count the average number of modality centroids in each cluster generated by the traditional one-stage clustering algorithm (DBSCAN) [60] and by CHC. The average number of modality centroids lies in the interval [1, 2]. It is worth noting that we count modality centroids rather than inter-modality positive sample pairs: if all samples were clustered into a single cluster, the number of pseudo inter-modality positive pairs would reach its maximum, yet this is not a desirable outcome.
As shown in Fig. 11, the average number of modality centroids for the ground truth is two, meaning that each identity in the dataset includes samples from both modalities. The average number of modality centroids generated by DBSCAN is close to 1, which means that almost every cluster contains samples from only one modality. According to Eq. (7), the impact of the modality contrastive loss is limited when the number of modality centroids is inadequate. Compared with DBSCAN, CHC generates more modality centroids, thus facilitating modality-invariant feature learning. The above experimental results verify the effectiveness of CHC.
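This diagnostic is simple to reproduce; a sketch assuming a pseudo-label and a modality flag per sample (outliers labeled -1):

```python
import numpy as np

def avg_modality_centroids(labels, modality):
    # A cluster has two modality centroids iff it contains both modalities.
    counts = [len(np.unique(modality[labels == p]))
              for p in np.unique(labels) if p >= 0]
    return float(np.mean(counts))   # lies in [1, 2]; ground truth is 2
```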
3) Effectiveness of Inter-Channel Pseudo-Label Refinement: The purpose of IPR is to improve the reliability of pseudo-labels. The test results for the baseline+IPR (A4) are presented in Table IV and demonstrate a significant improvement over the baseline.
In addition, we use the F-score [79] to evaluate the accuracy of the pseudo-labels; the higher the F-score, the higher the accuracy. As shown in Fig. 12, the test settings include the F-score of the grayscale images (gray), the average F-score over the three channels (RGB), and the F-score after IPR (IPR). We find that the accuracies of gray and RGB are similar, and both are significantly lower than that of IPR. These experimental results verify that IPR improves performance by improving the reliability of the pseudo-labels.
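One common way to compute a clustering F-score, sketched with scikit-learn's pair confusion matrix; whether [79] defines the F-score exactly this way is an assumption on our part:

```python
from sklearn.metrics.cluster import pair_confusion_matrix

def pairwise_f_score(true_labels, pseudo_labels):
    # A sample pair counts as positive when both samples share a label;
    # precision and recall are computed over all sample pairs.
    tn, fp, fn, tp = pair_confusion_matrix(true_labels, pseudo_labels).ravel()
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```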
4) Effectiveness of the Combination of $L_{moda}$, CHC and IPR: As shown in Table IV, we study the benefits of combining $L_{moda}$, CHC and IPR. The performance of the full CHCR is significantly superior to that of A1 and outperforms each individual component, providing evidence for the overall effectiveness of the three components.

V. CONCLUSION
This paper introduces a cross-modality hierarchical clustering and refinement (CHCR) method to tackle the fully unsupervised VI-ReID problem. Unlike previous VI-ReID methods, CHCR does not rely on intra-modality initialization; instead, CHCR concentrates on cross-modality clustering. This study offers a novel perspective for addressing the unsupervised VI-ReID problem that is particularly relevant for practical real-world settings where labeled data is limited.
The parameter analyses and ablation study demonstrate that the proposed modality contrastive loss and cross-modality hierarchical clustering contribute to modality-invariant feature learning, and that the inter-channel pseudo-label refinement enhances the reliability of pseudo-labels. Comparative test results on the SYSU-MM01 and RegDB datasets validate the effectiveness of our proposed method, which outperforms existing unsupervised VI-ReID approaches and achieves performance that is competitive with many supervised VI-ReID methods.

Fig. 2 .
Fig. 2. Proposed CHCR framework for VI-ReID. (a) The cross-modality clustering baseline uses DBSCAN to generate pseudo-labels. (b) CHCR embeds CHC and IPR in the baseline. $x_i^v$ and $x_i^{ir}$ are visible and infrared images, respectively. $x_i^r$, $x_i^g$ and $x_i^b$ represent the red, green, and blue channels of $x_i^v$, respectively. $x_i^s$ is the image of $x_i^v$ after grayscale processing. Arrows in different colors represent data flows from different images or channels; black arrows represent mixed data flow.

Fig. 3 .
Fig. 3. Illustration of the proposed modality contrastive loss. Different colors indicate different pseudo-labels. Different shapes indicate different modalities. The "pull" and "push" terms, respectively, decrease and increase the distance between the sample and the modality centroids.

Fig. 4 .
Fig. 4. Illustration of the proposed CHC. Each dot represents a sample. Intra-modality positive sample pairs are connected by black lines. The positive cluster pairs of the visible modality (blue dotted line) and the infrared modality (red dotted line) are connected by red lines.

Fig. 5 .
Fig. 5. An example of IPR. Superscripts represent channels or modalities; subscripts represent the sample index. For example, $x_2^r$, $x_2^g$ and $x_2^b$ are the red, green, and blue channels of the same sample $x_2$.

Algorithm 1 Cross-Modality Hierarchical Clustering and Refinement
Input: Unlabeled samples $X^v = \{x_i^v\}_{i=1}^{N^v}$ and $X^{ir} = \{x_i^{ir}\}_{i=1}^{N^{ir}}$, encoder E parameterized by $\theta_E$, momentum encoder M parameterized by $\theta_M$, training epochs and iters.
Output: Trained $\theta_M$.
1: for i = 1 to epochs do
2:   Extract the channels $X^r$, $X^g$ and $X^b$ of $X^v$
3:   Extract features of $\{X^r, X^{ir}\}$, $\{X^g, X^{ir}\}$ and $\{X^b, X^{ir}\}$ with model M
4:   Cluster $\{X^r, X^{ir}\}$, $\{X^g, X^{ir}\}$ and $\{X^b, X^{ir}\}$ with CHC, respectively
5:   Refine the clustering results using IPR
6:   Calculate cluster centroids and modality centroids based on Eq. (3) and Eq. (6), respectively
7:   for j = 1 to iters do
8:     Apply a random gamma transformation
9:     Optimize $\theta_E$ to minimize the loss defined in Eq. (2)
10:    Update $\theta_M$ based on Eq. (1)
11:   end for
12: end for
13: return Trained $\theta_M$

Fig. 11 .
Fig. 11. The average number of modality centroids for DBSCAN, CHC and the ground truth on the two datasets.

TABLE II
COMPARISON OF THE PROPOSED METHODS WITH STATE-OF-THE-ART METHODS ON REGDB. THE BEST PERFORMANCES UNDER THE TWO UNSUPERVISED SETTINGS ARE HIGHLIGHTED IN BOLD, AND "PROP." DENOTES VERSIONS OF APPROACHES PROPOSED IN THIS PAPER

Table III summarizes the results. All methods have similar test times, while there is a significant disparity in their training times. We find that the proposed CHCR approach has slightly higher training times than ICE and cluster contrast, but significantly lower training times than MMT and ADCA on both datasets.
Combining the experimental results from Tables I, II, and III, it is evident that CHCR significantly improves model performance while also reducing the computation time required for training, compared to the best-performing prior unsupervised methods.

TABLE IV
RESULTS OF THE ABLATION STUDY, WHERE THE ALTERNATIVE MODELS DROP OR REPLACE COMPONENTS OF THE PROPOSED MODEL (SEE TEXT FOR DETAILS). THE MAIN COMPONENTS INCLUDE $L_{MODA}$, CHC AND IPR

Fig. 10. The distance distributions of intra-modality positive pairs (orange), inter-modality positive pairs (red) and inter-modality negative pairs (blue).