A Logical Operator Oriented Face Retrieval Approach: How to Identify a Suspect Using Partial Photo Information from Different Persons?

Abstract. Facial sketch recognition is one of the most commonly used methods to identify a suspect when only witnesses are available. However, it usually suffers from four gaps, i.e. the memory gap, communication gap, description-sketch gap, and sketch-photo gap, which limit its application in practice to some extent. To circumvent these gaps, this paper focuses on the following problem: how to identify a suspect using partial photo information from different persons. Accordingly, we propose a new Logical Operation Oriented Face Retrieval (LOOFR) approach, assuming that partial information extracted from several different persons' photos is available. The LOOFR defines new AND and OR operators on such partial information. For example, "eyes of person A AND mouth of person B" means retrieving the target person whose eyes and mouth are similar to those of person A and person B respectively, while "eyes of person A OR eyes of person B" means retrieving the target person whose eyes are similar to both person A and person B. Evidently, these logical operators cannot be directly implemented by INTERSECTION and UNION in set operations; meanwhile, they are easier for humans to understand than set operators. Subsequently, we propose a two-stage LOOFR approach, in which the representations of partial information are learned in the first stage while the logical operations are processed in the second stage. As a result, the target photo of a suspect can be retrieved. Experiments show promising results.


Introduction
In suspect identification, one of the most widely used tools is the facial sketch recognition technique, as illustrated in Figure 1(a). Normally, a facial sketch recognition method sketches the suspect's face manually or by computer based on the narrative of a witness, and then tries to match this sketch with the ones in a database so that the suspect can be found.

Figure 1. (a) The general procedure of standard facial sketch recognition methods: based on a witness' narrative, a sketch artist sketches the face of a suspect manually or by computer; the sketch is then compared with the ones in a photo database to identify the suspect. (b) The procedure of the proposed LOOFR approach: given a set of photos, a witness picks out the photos which they think are similar to the suspect and points out which parts are most alike; the partial information the witness selects is combined by logical operators and then matched against a photo mugshot database to identify the suspect.

In general, such a method needs to address the following four gaps:
-Memory gap. In most cases, a witness only takes a glance at the suspect without seeing him/her clearly, thus missing some facial details.
In addition, sometimes they are asked to describe the suspect several days after the event [8,9], which inevitably increases the uncertainty.
-Communication gap. Two persons, i.e. the witness and the sketch artist, take part in the facial sketching procedure. In general, even for the same thing, different people may have quite different understandings. It turns out that such imperfect communication between them affects the quality of the sketch [9].
-Description-sketch gap. Some descriptions, such as "silent lips" and "murderous eyes", are extremely hard for artists to sketch. This brings another deviation, which is essentially a text-image gap and has been widely studied [20,13,14,22].
-Sketch-photo gap. Due to the heterogeneous features of sketches and photos, the spaces in which they are distributed are different. Thus, they cannot be compared directly.
In the literature, a number of methods focusing on bridging the sketch-photo gap have been presented, including feature engineering [2,18,3,5], common space learning [25], and multi-modal learning [31,30]. Furthermore, some works, e.g. [10,12], have achieved promising results on the main benchmark [31]. Nevertheless, they have yet to take the memory gap into account; in fact, the difference between the true suspect photo and the sketch in all cases they have tried thus far is relatively small. As far as we know, Uhl et al. [29] were the first to discuss forensic sketches and to attempt to bridge the memory gap. Along this line, several papers, e.g. [18,3], have addressed this problem. Furthermore, some studies [20,13,14,22] have been conducted to bridge the gap between text and images, but such work is seldom applied to the facial sketch recognition task. In addition, to the best of our knowledge, the communication and description-sketch gaps have yet to be well explored.
To alleviate the above-mentioned gaps, as illustrated in Figure 1(b), this paper considers the following scenario. First, a witness is provided with a set of photos. Then, he/she selects several photos in which some parts are, in his/her memory, similar to the suspect, and points out these parts. Subsequently, a problem naturally arises: how to identify the suspect based on partial information extracted across multiple images? To answer this question, we propose a novel two-stage Logical Operation Oriented Face Retrieval (LOOFR) approach, which combines such partial information and matches the combination result against a photo mugshot database to identify the suspect. In the LOOFR approach, there are two basic logical operators, AND and OR, described below:
-AND: When a witness says that the nose and mouth of the suspect are similar to those of two different persons, A and B, respectively, we can use AND to combine them, i.e. A's nose AND B's mouth.
-OR: When a witness says that the nose of the suspect is similar to both A's and B's, we can use OR to combine them, i.e. A's nose OR B's nose.
More complex descriptions, e.g. A's nose OR B's nose AND C's eyes, can be expressed in terms of these two basic operators. Compared with facial sketch recognition methods, the merits of the proposed approach are at least two-fold:
-The latter three gaps mentioned previously are bypassed. In the procedure of facial sketch recognition, a witness has to communicate with a sketch artist, and misunderstandings often occur during the communication.
On the contrary, in the proposed approach, a witness can complete the whole procedure independently. All they need to do is recognize whether a photo is similar to the suspect and pick out which part of the photo is similar, which circumvents the communication gap and the text-image gap. Besides, we directly use photos, not sketches, to retrieve photos, so the sketch-photo gap vanishes.
-Images help overcome the memory gap. Instead of recalling from memory on their own initiative, a witness finds it easier to recall the facial details of a suspect when viewing similar photos.
Thus far, there are several related lines of work, e.g. multi-query retrieval [17,1,7,34,32] and instance search [21,33,4,26], but none of them is applicable to the LOOFR problem. In fact, as far as we know, the problem addressed by the LOOFR approach has yet to be explored in the literature.

Related Work
In this section, we review three tasks that are partially similar to LOOFR: facial sketch recognition, multi-query retrieval, and instance search.

Facial sketch recognition can be roughly divided into viewed and forensic sketch based face recognition. A viewed sketch is one that artists draw while viewing the corresponding photo. One of the earliest works [28] adopted principal component analysis (PCA) to learn features and reached 71% accuracy on the CUHK dataset. Roy et al. [24] employed a fuzzy-based texture encoding model to learn sketch features, but it requires the face in the sketch to be separated from the background. Recently, some works [12,16,15] have focused on deep learning based recognition frameworks. Hu et al. [12] fed sketches of different scales into their multiple-input deep networks to learn an effective representation and achieved near-perfect recognition accuracy (99% rank-1) on the CUHK dataset. On the contrary, a forensic sketch is drawn from memory, without the corresponding photo. Uhl et al. [29] were the first to underline the importance and challenge of forensic sketch based face recognition. Klare et al. [18] utilized a combination of SIFT and LBP features to learn a more effective weighting. Later work [3] replaced the SIFT and LBP features with a new combination of Weber and Wavelet descriptors and reached better results. Ouyang et al. [23] built a new forensic sketch dataset, MGDB, to imitate the human forgetting process. This dataset consists of four kinds of sketches: viewed, 1-hour, 24-hour, and unviewed. Based on this new dataset, they employed a cascade model to overcome the memory gap.
Multi-query retrieval uses multiple samples as queries to retrieve the target image. Multiple queries are usually regarded as a means of data augmentation [1,6,7,17,32,34]. Noticing that query samples offered by different users are usually photographed from different views or angles, Wang et al. [32] combined a low-quality photo with photos provided by other users on the same topic to form a multi-query expansion. Such research is similar to our proposed logical operation OR, where different queries of the same kind are combined. However, multi-query retrieval based on queries of different kinds, which is closer to our proposed logical operation AND, is rarely studied [11,27]. Hsiao et al. [11] combined the Pareto front method with manifold ranking and proposed a novel method to handle multiple queries of different semantics. Taghizadeh et al. [27] utilized a binary component vector that represents the different components of an image to handle multi-query retrieval.
Instance search (INS) [4,21,26,33] uses a query image of a specific instance to retrieve images containing that instance. It is quite similar to our task of using local information, such as eyes or a mouth, to retrieve the target face whose eyes or mouth are similar to the query sample. Yu et al. [33] proposed a Fuzzy Objects Matching (FOM) framework to explore the similarity between the query sample and images in the dataset, and used object proposals to detect whether images contain regions matching the query sample. Song et al. [26] combined deep networks with hashing to cope with the large-scale instance search problem. They learn two kinds of hash codes: global region hash codes and local region hash codes. A query sample is first compared with the global region hash codes to obtain a ranking, and then compared with the local region hash codes to re-rank it.

Proposed Approach
Suppose that a face dataset F = [f_1, ..., f_n] contains n face samples and each face f_j consists of five elements {e_j^(1), ..., e_j^(5)}, which represent the eyes, eyebrows, nose, mouth, and outline of the face respectively. The same element of all faces constitutes a new dataset E^(i) = [e_1^(i), ..., e_n^(i)]. Our aim is to retrieve the target face via the two basic logical operators AND and OR. To tackle this problem, we propose a two-stage approach. In the first stage, for each element we learn sparse hashing codes (representations of elements) Z^(i) ∈ {0, 1}^{k_i×n}, the corresponding dictionary D^(i) = [d_1^(i), ..., d_{k_i}^(i)] ∈ R^{d_i×k_i}, and a projection P^(i) ∈ R^{k_i×d_i}, where k_i is the code length of Z^(i). In the second stage, given query samples and their corresponding elements, we obtain several candidate sets via the learned Z^(i) and P^(i); logical operations are then performed on these candidate sets to make the final decision.
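The notation above can be made concrete with a small sketch. All shapes and names here are illustrative placeholders (the paper publishes no code), and random matrices stand in for the learned quantities; the thresholding rule used to binarize a projected query is our assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 188          # number of faces (size of the CUHK student set)
d = 128          # per-element SIFT feature dimension
k = 64           # assumed code length k_i, taken equal for every element

elements = ["eyes", "eyebrows", "nose", "mouth", "outline"]

# E[name]: d x n feature matrix for one element across all faces
E = {name: rng.standard_normal((d, n)) for name in elements}

# Learned in stage one, per element: dictionary D, binary codes Z, projection P
D = {name: rng.standard_normal((d, k)) for name in elements}
Z = {name: (rng.random((k, n)) > 0.5).astype(float) for name in elements}
P = {name: rng.standard_normal((k, d)) for name in elements}

# A new query feature q of some element is coded via its projection
q = rng.standard_normal(d)
z_hat = (P["eyes"] @ q > 0.5).astype(float)   # assumed binarization rule
print(z_hat.shape)  # (64,)
```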

Dictionary Learning for Representations of Five Facial Elements
For the i-th element E^(i), we aim to learn an enriched dictionary D^(i) and a sparse coefficient matrix Z^(i). Specifically, we minimize the following objective function:

min_{D^(i), Z^(i)} ||E^(i) − D^(i)Z^(i)||_F^2 + γ||Z^(i)||_1,   (1)

where γ > 0 is a trade-off parameter. For similar elements e_j^(i) and e_k^(i), their codes z_j^(i) and z_k^(i) should be as similar as possible. To this end, we first define an affinity matrix S^(i), where S^(i)_{jk} is the entry in the j-th row and k-th column of S^(i). Then, we minimize the following objective function:

min_{Z^(i)} tr(Z^(i) L^(i) Z^(i)T),  with L^(i) = G^(i) − S^(i),

where G^(i) ∈ R^{n×n} is a diagonal matrix whose entries are the column sums of S^(i). For out-of-sample elements, we prefer to learn a projection P^(i) that directly maps them into sparse codes rather than solving for the codes via Eq.(1). Subsequently, we minimize the following objective function:

min_{P^(i)} β||Z^(i) − P^(i)E^(i)||_F^2 + λ||P^(i)||_F^2,

where β and λ are trade-off parameters and ||P^(i)||_F^2 is a regularization term. To be concise, we omit the superscripts in the subsequent equations. For each element, we obtain the following overall loss function:

L(D, Z, P) = ||E − DZ||_F^2 + α tr(Z L Z^T) + β||Z − PE||_F^2 + λ||P||_F^2 + γ||Z||_1.   (5)

Because of the term ||Z||_1, L is not differentiable with respect to Z over the whole field of real numbers. In general, the Least Absolute Shrinkage and Selection Operator (Lasso) or K-SVD is adopted to update Z bit by bit or column by column. In this paper, we combine dictionary learning with hashing and introduce a new constraint Z ∈ [0, 1]^{k×n}, aiming to use binary codes to represent Z. Then, Eq.(5) becomes

min_{D, Z, P} L(D, Z, P)  s.t.  Z ∈ [0, 1]^{k×n}.   (6)

Compared with Eq.(5), optimizing the loss function in Eq.(6) has two merits: 1. Eq.(6) is easier to solve than Eq.(5): the loss function is continuous and differentiable with respect to Z on the interval Z ∈ [0, 1]^{k×n}, so we can use gradient descent or the least squares method, whereas several cusps exist when Z ∈ R^{k×n}, making these methods unsuitable there. 2. Optimizing Eq.(6) saves time: the whole of Z can be updated simultaneously by gradient descent or least squares, while it must be updated bit by bit or column by column via Lasso or K-SVD, which means more iterations and computation.
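A minimal NumPy sketch of the overall objective of Eq.(6). The symbol names follow the text, but the precise form of the graph term and the weighting constants are reconstructed from context, so treat this as an assumption-laden sketch rather than the authors' implementation:

```python
import numpy as np

def total_loss(E, D, Z, P, S, alpha, beta, lam, gamma):
    """Overall objective of Eq.(6), with Z relaxed to [0, 1]^{k x n}.

    S is the per-element affinity matrix; L = G - S is its graph
    Laplacian, with G diagonal holding the column sums of S.
    (Symbol names follow the text; exact weighting is an assumption.)
    """
    G = np.diag(S.sum(axis=0))
    L_graph = G - S
    recon = np.linalg.norm(E - D @ Z, "fro") ** 2          # fidelity term
    smooth = alpha * np.trace(Z @ L_graph @ Z.T)           # similar elements -> similar codes
    proj = beta * np.linalg.norm(Z - P @ E, "fro") ** 2    # out-of-sample projection
    reg = lam * np.linalg.norm(P, "fro") ** 2              # projection regularizer
    sparse = gamma * np.abs(Z).sum()                       # ||Z||_1 (= Z.sum() when Z >= 0)
    return recon + smooth + proj + reg + sparse
```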
The minimization problem in Eq.(6) can be solved by alternating optimization.

D-step: Fixing Z and P, we can update the dictionary D column by column. Let D = [d_1, ..., d_k] and Z = [z_1; ...; z_k], where d_i and z_j are the i-th column of D and the j-th row of Z, respectively.
When updating d_i, the other columns of D are held constant and Eq.(5) can be rewritten as

min_{d_i} ||Ê − d_i z_i||_F^2,   (7)

where Ê = E − Σ_{j≠i} d_j z_j.

Z-step: Fixing D and P, the loss function in Eq.(6) is differentiable with respect to Z. In addition, L is bounded on the closed interval Z ∈ [0, 1]^{k×n}. Thus, L must attain its minimum either at a point where ∂L/∂Z = 0 or on the boundary of the closed interval. Setting ∂L/∂Z = 0, we can solve for Z from

AZ + ZB + C = 0,   (8)

where A = 2D^T D, B = 2(αL + βI) with L = G − S the graph Laplacian, C = γ1 − 2(D^T + βP)E, 1 denotes the matrix whose elements are all ones, and I is an identity matrix. Eq.(8) is a Sylvester equation [19], which can be solved by the lyap function of MATLAB.
If the solution of Eq.(8) lies outside the interval [0, 1]^{k×n}, we clip the coordinates that fall outside [0, 1] to the boundary and keep the gradients along the other directions at 0. In this way, we ensure that the solution of Eq.(8) is located either at a point where ∂L/∂Z = 0 or on the boundary of [0, 1]^{k×n}.

P-step: Fixing D and Z and setting ∂L/∂P = 0, we get

P = βZE^T(βEE^T + λI)^{−1},

where I is an identity matrix.
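Both closed-form updates can be sketched as follows. SciPy's solve_sylvester plays the role of MATLAB's lyap, and the clipping implements the projection back into [0, 1]^{k×n}; the constants A, B, C follow the text, while the P-step formula is our reconstruction from setting the gradient to zero:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def z_step(E, D, P, L_graph, alpha, beta, gamma):
    """Z-step: solve the Sylvester equation A Z + Z B + C = 0 of Eq.(8),
    then project the solution back into [0, 1]^{k x n} by clipping.
    A, B, C follow the text; L_graph is the graph Laplacian G - S."""
    k, n = D.shape[1], E.shape[1]
    A = 2 * D.T @ D
    B = 2 * (alpha * L_graph + beta * np.eye(n))
    C = gamma * np.ones((k, n)) - 2 * (D.T + beta * P) @ E
    Z = solve_sylvester(A, B, -C)   # SciPy's analogue of MATLAB's lyap
    return np.clip(Z, 0.0, 1.0)

def p_step(E, Z, beta, lam):
    """P-step: closed form reconstructed from dL/dP = 0, giving
    P = beta Z E^T (beta E E^T + lam I)^{-1}."""
    d = E.shape[0]
    return beta * Z @ E.T @ np.linalg.inv(beta * E @ E.T + lam * np.eye(d))
```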

Logical Operation Oriented Face Retrieval
AND: Given query samples q_1^(i) and q_2^(j) of elements E^(i) and E^(j) respectively, we can compute their corresponding sparse codes ẑ_1^(i) and ẑ_2^(j) via the learned projections.

Set operation: Comparing ẑ_1^(i) with Z^(i) and ẑ_2^(j) with Z^(j), respectively, we get two top-K nearest neighbor candidate sets A and B, which contain the indices of the top-K nearest samples in Z^(i) and Z^(j). The final decision is then made by C_1 = A ∩ B.

Our approach: We concatenate the query codes and the retrieval sets, ẑ = [ẑ_1^(i); ẑ_2^(j)] and Z = [Z^(i); Z^(j)]. Then, we compare ẑ with Z to get the top-K nearest neighbor result C_2.

OR: Given query samples q_1^(i) and q_2^(i) of element E^(i), we can compute their corresponding sparse codes ẑ_1^(i) and ẑ_2^(i) in the same way.

Set operation: Comparing ẑ_1^(i) and ẑ_2^(i) with Z^(i), respectively, we get two top-K nearest neighbor candidate sets A_1 and A_2, which contain the indices of the top-K nearest samples in Z^(i). The final decision is then made by C_1 = A_1 ∪ A_2.

Our approach: We compare ẑ_1^(i) and ẑ_2^(i) with Z^(i) simultaneously and record, for each item of Z^(i), its distance to the queries. Based on these records, we get the top-K nearest neighbor result C_2.

AND+OR: Given query samples q_1^(i) and q_2^(i) of element E^(i) and q_3^(j) of element E^(j), we can compute their corresponding sparse codes ẑ_1^(i), ẑ_2^(i) and ẑ_3^(j).

Set operation: Comparing ẑ_1^(i) and ẑ_2^(i) with Z^(i) and ẑ_3^(j) with Z^(j), respectively, we get three top-K nearest neighbor candidate sets A_1, A_2 and B, which contain the indices of the top-K nearest samples in Z^(i) and Z^(j). The final decision is then made by C_1 = (A_1 ∪ A_2) ∩ B.

Our approach: We concatenate the codes to obtain new queries and a new retrieval set, ẑ_1 = [ẑ_1^(i); ẑ_3^(j)], ẑ_2 = [ẑ_2^(i); ẑ_3^(j)] and Z = [Z^(i); Z^(j)]. Then, we compare ẑ_1 and ẑ_2 with Z simultaneously and record the distance between each query and each item of Z. Based on these records, we get the top-K nearest neighbor result C_2.
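The set-operation baselines and our concatenation/simultaneous-comparison strategies can be sketched as follows. The Euclidean distance and the min-aggregation used for OR are our assumptions, since the text does not pin down these details:

```python
import numpy as np

def topk(z_query, Z, K):
    """Indices of the K nearest database codes and all distances
    (Euclidean distance between the query and each column of Z)."""
    d = np.linalg.norm(Z - z_query[:, None], axis=0)
    return list(np.argsort(d)[:K]), d

def and_query(z1, Z1, z2, Z2, K):
    """AND (our approach): concatenate codes and retrieval sets, rank once."""
    idx, _ = topk(np.concatenate([z1, z2]), np.vstack([Z1, Z2]), K)
    return idx

def or_query(z1, z2, Z, K):
    """OR (our approach): compare both queries with Z simultaneously and
    keep, per database item, the smaller distance (assumed aggregation)."""
    _, d1 = topk(z1, Z, K)
    _, d2 = topk(z2, Z, K)
    return list(np.argsort(np.minimum(d1, d2))[:K])

def and_set(z1, Z1, z2, Z2, K):
    """Set-operation baseline for AND: C1 = A ∩ B (unordered)."""
    A, _ = topk(z1, Z1, K)
    B, _ = topk(z2, Z2, K)
    return set(A) & set(B)

def or_set(z1, z2, Z, K):
    """Set-operation baseline for OR: C1 = A1 ∪ A2 (unordered)."""
    A1, _ = topk(z1, Z, K)
    A2, _ = topk(z2, Z, K)
    return set(A1) | set(A2)
```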

Dataset and Performance Measurement
The CUHK student dataset [31] consists of 188 faces. We first cut them into five elements: eyes, eyebrows, nose, mouth, and outline, and then annotate the similarity between samples of the same element. Each element is represented by a 128-d SIFT feature. 170 faces and their corresponding elements are randomly selected to form the training set, and the remaining 18 form the test set.
Two criteria, R@k and Average Index, are adopted to measure the performance of our proposed method. R@k is the accuracy of the top-k retrieval results. Average Index denotes the average position at which the target face appears in the results.

The results of logical operation AND are reported in Table 1. It can be observed that: firstly, in most cases (i.e. 24/30), our proposed approach (C_2) achieves a better result than both using only one element and the set operation; secondly, the results of the intersection operation (C_1) are always worse than those of using only one element. This is determined by the property of set operations: the intersection of two sets is never bigger than either of them. The poor performance of the intersection operation means that the two elements usually do not yield the right answer simultaneously.
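The two criteria defined above can be sketched as follows (an illustrative implementation; ranked_lists holds, per query, the database indices sorted by increasing distance, and ranks are 1-based):

```python
import numpy as np

def recall_at_k(ranked_lists, targets, k):
    """R@k: fraction of queries whose target appears in the top-k results."""
    hits = sum(1 for r, t in zip(ranked_lists, targets) if t in r[:k])
    return hits / len(targets)

def average_index(ranked_lists, targets):
    """Average Index: mean 1-based rank at which the target face appears."""
    return float(np.mean([r.index(t) + 1 for r, t in zip(ranked_lists, targets)]))
```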

Results of AND Operation
In contrast, our approach takes advantage of the complementary information from both elements. For example, when a high-ranked result is returned by querying one element and a low-ranked result is returned by querying another, our proposed AND operation makes the result returned by using both of them reach an intermediate position.

Results of OR Operation
The results of logical operation OR are reported in Table 2. It can be observed that: firstly, in most cases (i.e. 10/15), our proposed logical operation OR (C_2) achieves a better result than using only one element; secondly, contrary to the intersection operation, the results of the union operation (C_1) are always better than those of the other three operations. This is also determined by the property of set operations: the union of two sets is never smaller than either of them.
Although this result is impressive, it does not mean that the set operation is superior to our approach, because the count of C_1 does not represent the data that are similar to both queries. Take eye OR eye for example: the percentage of such data only accounts for (38.5 + 40.5 − 60.6 =) 18.4%, less than one third of the results in Table 2. Thus, the impressive results of the union operation in Table 2 are of little significance, and this advantage vanishes when conducting a more complex logical operation, which we discuss in the next section.

Table 2. The results of R@k (k=10,20,50) of the three operations: using only one element (A1 and A2), the set operation (C1), and our strategy OR (C2). Best results are marked in bold.

Table 3. The results of R@k (k=20,50) of the operations: using only one element (A1, A2 and B), the set operation (C1), and our approach AND+OR (C2). Best results are marked in bold. Results where logical operation OR outperforms using only one element are underlined.
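The 18.4% figure follows from inclusion-exclusion on the quoted hit rates; a quick check:

```python
# Inclusion-exclusion on the top-K hit rates quoted in the text:
# |A1| + |A2| - |A1 ∪ A2| = |A1 ∩ A2|, here expressed as percentages.
r_a1, r_a2, r_union = 38.5, 40.5, 60.6
overlap = r_a1 + r_a2 - r_union
print(round(overlap, 1))  # → 18.4
```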

Results of AND+OR Operation
To further explore more complicated applications of logical operations, we conducted the AND+OR experiment; the results are reported in Table 3.
It can be observed that in most cases (i.e. 26/40), the proposed approach (C_2) achieves better performance, which is consistent with the former results. This demonstrates the effectiveness of the proposed logical operation approach.
As mentioned before, the impressive results of the union operation (C_1) in Table 2 do not appear again. A plausible reason is that the result of the set operation is determined by C_1 = (A_1 ∪ A_2) ∩ B, which is equivalent to C_1 = (A_1 ∩ B) ∪ (A_2 ∩ B). From Table 1, it is easy to see that the intersection operation dramatically decreases the retrieval results. This means that the sizes of the two sets (A_1 ∩ B) and (A_2 ∩ B) in the latter expression are extremely small. Even the union operation, which achieves "considerable results" in Table 2, cannot increase the result by a big margin.
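The distributive identity invoked here is easy to verify on a toy example:

```python
# Check the set identity (A1 ∪ A2) ∩ B = (A1 ∩ B) ∪ (A2 ∩ B)
# on small candidate sets of database indices (values are illustrative).
A1, A2, B = {1, 2, 3}, {3, 4, 5}, {2, 4, 6}
left = (A1 | A2) & B
right = (A1 & B) | (A2 & B)
print(left == right)  # → True; both sides equal {2, 4}
```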

Rank Improvement after Logical Operation
The purpose of a logical operation is to use more partial information to obtain results with more confidence (i.e. at a higher rank). For example, as shown in Figure 2, when using only one element (eye or mouth), the target face (in bold border) is ranked 13th; when using eye AND mouth, the target face is re-ranked 3rd.
Apparently, the set operation violates this need: since sets are unordered, rank information is discarded when conducting a set operation. Thus, logical operations cannot be directly implemented by set operations. In contrast, as shown in Table 4, in most cases the logical operations AND (i.e. 8/10) and OR (i.e. 5/5) obtain smaller average indexes, which means they indeed improve the retrieval performance and return the results at a better rank with stronger confidence.
To further analyze the effect of logical operations on the ranks of the results, we count the number of indexes (the positions where the target face appears in the results) falling into the different intervals I-IX (representing [1,20], [21,40], and so on), as shown in Figure 3.
It is clear that, compared with using only one element, logical operations increase the number of indexes falling into intervals I, II and III, and decrease those falling into intervals VII, VIII and IX. That is, logical operations re-rank the target face to a higher position by using multiple pieces of partial information. This accounts for the results in Table 4.

Conclusion
This paper has addressed the problem of identifying a suspect using partial photo information from different persons. Accordingly, we have proposed the novel LOOFR approach to bypass the thorny problems faced by existing facial sketch recognition methods. In this two-stage approach, representations of facial elements are learned in the first stage, and the logical operators AND and OR are applied to these representations in the second stage to retrieve the target face at a better rank and with stronger confidence. We have conducted several experiments on three scenarios, AND, OR and their combination AND+OR, and compared the proposed logical operations with their set operation counterparts; the results demonstrate the effectiveness of our approach.