Yiu-ming Cheung and 1 more

Unsupervised cross-modal retrieval has received increasing attention recently because labeling the explosively growing multimedia data is extremely difficult. Its core challenge is how to measure the similarities between multi-modal data without label information. In previous works, various distance metrics are selected for measuring the similarities and predicting whether samples belong to the same class. However, these predictions are not always correct, and even a few wrong predictions can undermine the final retrieval performance. To address this problem, in this paper, we categorize predictions as solid and soft ones based on their confidence, and further categorize samples as solid and soft ones based on the predictions. We propose that these two kinds of predictions and samples should be treated differently. Moreover, we find that the absolute values of the similarities represent not only the similarity but also the confidence of the predictions. Thus, we first design an elegant dot product fusion strategy to obtain effective inter-modal similarities. Subsequently, utilizing these similarities, we propose a generalized and flexible weighted loss function in which larger weights are assigned to solid samples to increase retrieval performance, and smaller weights are assigned to soft samples to reduce the disturbance of wrong predictions. Although less information is used, empirical studies show that the proposed approach achieves state-of-the-art retrieval performance.
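As a rough illustration of the two ingredients described above, the sketch below fuses intra-modal dot products into inter-modal similarities and reuses their absolute values as confidence weights in a loss. The averaging fusion rule, the threshold tau separating solid from soft samples, and the squared-error form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fused_similarity(img_feat, txt_feat):
    """Dot-product fusion sketch: average the intra-modal similarity
    matrices of the two modalities into one inter-modal estimate."""
    img = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    txt = txt_feat / np.linalg.norm(txt_feat, axis=1, keepdims=True)
    s_img = img @ img.T   # image-side similarities, in [-1, 1]
    s_txt = txt @ txt.T   # text-side similarities, in [-1, 1]
    return 0.5 * (s_img + s_txt)  # assumed fusion rule: simple average

def weighted_loss(pred_sim, fused_sim, tau=0.6):
    """Confidence-weighted loss sketch: |fused_sim| doubles as the
    confidence of a prediction. Solid samples (confidence >= tau) keep
    full weight; soft samples are down-weighted so wrong predictions
    disturb training less. tau and the squared error are assumptions."""
    conf = np.abs(fused_sim)
    weights = np.where(conf >= tau, 1.0, conf / tau)
    return np.mean(weights * (pred_sim - fused_sim) ** 2)

# Toy usage with random features (batch of 4, 8-dimensional).
rng = np.random.default_rng(0)
s = fused_similarity(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(weighted_loss(rng.normal(size=(4, 4)), s))
```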

Yiu-ming Cheung and 1 more

Facial sketch recognition is one of the most commonly used methods to identify a suspect when only witnesses are available. It usually involves four gaps, i.e., the memory gap, the communication gap, the description-sketch gap, and the sketch-image gap, which limit its application in practice to some extent. To circumvent these gaps, this paper therefore focuses on the problem of how to identify a suspect using partial photo information from different persons. Accordingly, we propose a new Logical Operation Oriented Face Retrieval (LOOFR) approach, provided that partial information extracted from several different persons' photos is available. LOOFR defines new AND and OR operators on these pieces of partial information. For example, "eyes of person A AND mouth of person B" means retrieving the target person whose eyes and mouth are similar to those of person A and person B, respectively, while "eyes of person A OR eyes of person B" means retrieving the target person whose eyes are similar to those of both person A and person B. Evidently, these logical operators cannot be directly implemented with the set operations INTERSECTION and UNION; meanwhile, they are easier for humans to understand than set operators. Subsequently, we propose a two-stage LOOFR approach, in which the representations of the partial information are learned in the first stage and the logical operations are processed in the second stage. As a result, the target photo of a suspect can be retrieved. Experiments show promising results.
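To make the semantics of the operators concrete, the sketch below scores gallery photos part by part and combines the scores. The elementwise minimum for AND and the mean-blended query for OR are illustrative assumptions that merely show why these operators differ from set INTERSECTION and UNION; the actual two-stage LOOFR learns its representations and operators rather than using these fixed rules.

```python
import numpy as np

def part_scores(gallery_parts, query_part):
    """Cosine similarity between one query part (e.g., eyes) and the
    same part cropped from every gallery photo."""
    g = gallery_parts / np.linalg.norm(gallery_parts, axis=1, keepdims=True)
    q = query_part / np.linalg.norm(query_part)
    return g @ q

def loofr_and(scores_a, scores_b):
    """Soft AND over two different parts: the target must match both,
    so take the elementwise minimum of the per-photo scores (assumed)."""
    return np.minimum(scores_a, scores_b)

def loofr_or(gallery_parts, part_a, part_b):
    """OR on the same part: "eyes of A OR eyes of B" asks for eyes
    similar to BOTH, so score against a blended query rather than
    taking a set-style union (assumed: mean blend)."""
    return part_scores(gallery_parts, (part_a + part_b) / 2.0)

# Toy usage: 5 gallery photos with 16-dim part embeddings.
rng = np.random.default_rng(1)
eyes_gallery = rng.normal(size=(5, 16))
mouth_gallery = rng.normal(size=(5, 16))
eyes_a, eyes_b, mouth_b = rng.normal(size=(3, 16))
and_rank = np.argsort(-loofr_and(part_scores(eyes_gallery, eyes_a),
                                 part_scores(mouth_gallery, mouth_b)))
or_rank = np.argsort(-loofr_or(eyes_gallery, eyes_a, eyes_b))
print(and_rank, or_rank)  # gallery indices, best match first
```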