Yiu-ming Cheung and 1 more

Unsupervised cross-modal retrieval has received increasing attention recently because labeling the explosively growing multimedia data is extremely difficult. Its core challenge is how to measure the similarities between multi-modal data without label information. In previous works, various distance metrics are selected for measuring the similarities and predicting whether samples belong to the same class. However, these predictions are not always correct, and even a few wrong predictions can undermine the final retrieval performance. To address this problem, in this paper, we categorize predictions as solid and soft ones based on their confidence, and further categorize samples as solid and soft ones based on the predictions. We propose that these two kinds of predictions and samples should be treated differently. Moreover, we find that the absolute values of the similarities represent not only the similarity but also the confidence of the predictions. Thus, we first design an elegant dot product fusion strategy to obtain effective inter-modal similarities. Subsequently, utilizing these similarities, we propose a generalized and flexible weighted loss function in which larger weights are assigned to solid samples to increase retrieval performance, and smaller weights are assigned to soft samples to reduce the disturbance of wrong predictions. Although less information is used, empirical studies show that the proposed approach achieves state-of-the-art retrieval performance.
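As a rough illustration of the two ingredients described above, the sketch below fuses intra-modal dot products into inter-modal similarities and reuses their absolute values as confidence weights in a loss. The averaging fusion rule, the threshold tau separating solid from soft samples, and the squared-error form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fused_similarity(img_feat, txt_feat):
    """Dot-product fusion sketch: average the intra-modal similarity
    matrices of the two modalities into one inter-modal estimate."""
    img = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    txt = txt_feat / np.linalg.norm(txt_feat, axis=1, keepdims=True)
    s_img = img @ img.T   # image-side similarities, in [-1, 1]
    s_txt = txt @ txt.T   # text-side similarities, in [-1, 1]
    return 0.5 * (s_img + s_txt)  # assumed fusion rule: simple average

def weighted_loss(pred_sim, fused_sim, tau=0.6):
    """Confidence-weighted loss sketch: |fused_sim| doubles as the
    confidence of a prediction. Solid samples (confidence >= tau) keep
    full weight; soft samples are down-weighted so wrong predictions
    disturb training less. tau and the squared error are assumptions."""
    conf = np.abs(fused_sim)
    weights = np.where(conf >= tau, 1.0, conf / tau)
    return np.mean(weights * (pred_sim - fused_sim) ** 2)

# Toy usage with random features (batch of 4, 8-dimensional).
rng = np.random.default_rng(0)
s = fused_similarity(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(weighted_loss(rng.normal(size=(4, 4)), s))
```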

Yiu-ming Cheung and 1 more

Facial sketch recognition is one of the most commonly used methods to identify a suspect when only witnesses are available. It usually involves four gaps, i.e., the memory gap, the communication gap, the description-sketch gap, and the sketch-image gap, which limit its application in practice to some extent. To circumvent these gaps, this paper therefore focuses on the problem of how to identify a suspect using partial photo information from different persons. Accordingly, we propose a new Logical Operation Oriented Face Retrieval (LOOFR) approach, provided that partial information extracted from several different persons' photos is available. LOOFR defines new AND and OR operators on these pieces of partial information. For example, "eyes of person A AND mouth of person B" means retrieving the target person whose eyes and mouth are similar to those of person A and person B, respectively, while "eyes of person A OR eyes of person B" means retrieving the target person whose eyes are similar to those of both person A and person B. Evidently, these logical operators cannot be directly implemented with the set operations INTERSECTION and UNION; meanwhile, they are easier for humans to understand than set operators. Subsequently, we propose a two-stage LOOFR approach, in which the representations of the partial information are learned in the first stage and the logical operations are processed in the second stage. As a result, the target photo of a suspect can be retrieved. Experiments show promising results.
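To make the semantics of the operators concrete, the sketch below scores gallery photos part by part and combines the scores. The elementwise minimum for AND and the mean-blended query for OR are illustrative assumptions that merely show why these operators differ from set INTERSECTION and UNION; the actual two-stage LOOFR learns its representations and operators rather than using these fixed rules.

```python
import numpy as np

def part_scores(gallery_parts, query_part):
    """Cosine similarity between one query part (e.g., eyes) and the
    same part cropped from every gallery photo."""
    g = gallery_parts / np.linalg.norm(gallery_parts, axis=1, keepdims=True)
    q = query_part / np.linalg.norm(query_part)
    return g @ q

def loofr_and(scores_a, scores_b):
    """Soft AND over two different parts: the target must match both,
    so take the elementwise minimum of the per-photo scores (assumed)."""
    return np.minimum(scores_a, scores_b)

def loofr_or(gallery_parts, part_a, part_b):
    """OR on the same part: "eyes of A OR eyes of B" asks for eyes
    similar to BOTH, so score against a blended query rather than
    taking a set-style union (assumed: mean blend)."""
    return part_scores(gallery_parts, (part_a + part_b) / 2.0)

# Toy usage: 5 gallery photos with 16-dim part embeddings.
rng = np.random.default_rng(1)
eyes_gallery = rng.normal(size=(5, 16))
mouth_gallery = rng.normal(size=(5, 16))
eyes_a, eyes_b, mouth_b = rng.normal(size=(3, 16))
and_rank = np.argsort(-loofr_and(part_scores(eyes_gallery, eyes_a),
                                 part_scores(mouth_gallery, mouth_b)))
or_rank = np.argsort(-loofr_or(eyes_gallery, eyes_a, eyes_b))
print(and_rank, or_rank)  # gallery indices, best match first
```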