Elegant Unsupervised Cross-Modal Hashing
- Yiu-ming Cheung
- Zhikai Hu
Abstract
Unsupervised cross-modal retrieval has received increasing attention
recently because of the extreme difficulty of labeling explosively
growing multimedia data. Its core challenge is how to measure the
similarities between multi-modal data without label information. In
previous works, various distance metrics have been adopted to measure
these similarities and to predict whether two samples belong to the same
class.
However, such predictions are not always correct, and even a few wrong
ones can undermine the final retrieval performance. To address this
problem, in this paper we categorize predictions as solid or soft
according to their confidence, and accordingly categorize samples as
solid or soft based on those predictions. We argue that these two kinds
of predictions and samples should be treated differently.
Moreover, we find that the absolute values of the similarities encode
not only the similarity itself but also the confidence of the
predictions. We therefore first design an elegant dot product fusion
strategy to obtain effective inter-modal similarities. Utilizing these
similarities, we then propose a generalized and flexible weighted loss
function that assigns larger weights to solid samples to improve
retrieval performance, and smaller weights to soft samples to suppress
the disturbance of wrong predictions. Although less information is used,
empirical studies show that the proposed approach achieves
state-of-the-art retrieval performance.
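The abstract only names the two key components. As a rough illustration of how they could fit together, the NumPy sketch below fuses similarities from two modalities via dot products and uses the absolute similarity as a per-pair confidence weight in the loss. Everything here is an assumption for illustration: the function names (`fused_similarity`, `weighted_hash_loss`), the average-based fusion, the squared-error loss form, and the `|sim|` weighting are hypothetical stand-ins, not the paper's actual formulation.

```python
# A minimal NumPy sketch of the ideas named in the abstract. All names,
# thresholds, and the exact loss form are illustrative assumptions; the
# paper's actual method may differ.
import numpy as np

def fused_similarity(img_feat, txt_feat):
    """Dot-product fusion of similarities from two modalities.

    Features are L2-normalized row-wise so dot products lie in [-1, 1];
    the two intra-modal similarity matrices are then averaged (an assumed
    fusion rule) to form one inter-modal similarity estimate.
    """
    img = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    txt = txt_feat / np.linalg.norm(txt_feat, axis=1, keepdims=True)
    s_img = img @ img.T          # similarity within the image modality
    s_txt = txt @ txt.T          # similarity within the text modality
    return (s_img + s_txt) / 2

def weighted_hash_loss(b_img, b_txt, sim):
    """Confidence-weighted loss on (relaxed) hash codes.

    The sign of `sim` predicts same/different class; |sim| serves as the
    weight, so confident ("solid") pairs dominate the loss while
    uncertain ("soft") pairs contribute little.
    """
    k = b_img.shape[1]                  # hash code length
    cos = (b_img @ b_txt.T) / k         # code similarity, roughly in [-1, 1]
    weight = np.abs(sim)                # assumed weighting scheme
    return np.mean(weight * (cos - np.sign(sim)) ** 2)

# Toy usage with random features and relaxed, real-valued hash codes.
rng = np.random.default_rng(0)
img_feat = rng.normal(size=(8, 32))
txt_feat = rng.normal(size=(8, 16))
sim = fused_similarity(img_feat, txt_feat)
b_img = np.tanh(rng.normal(size=(8, 12)))
b_txt = np.tanh(rng.normal(size=(8, 12)))
print(weighted_hash_loss(b_img, b_txt, sim))
```

Under this reading, a pair with similarity near zero (a soft prediction) is effectively ignored by the loss, while a pair with similarity near plus or minus one (a solid prediction) strongly pulls its hash codes together or apart.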