A new convolutional neural network based on a sparse convolutional layer for animal face detection

This paper focuses on the face detection problem for three popular animal categories that need monitoring: horses, cats and dogs. Existing detectors generally rely on Convolutional Neural Networks (CNNs) as backbones. CNNs are strong and fascinating classification tools, but they present some weak points: a large number of layers and parameters, the need for a huge training dataset and the loss of the relationship between image parts. To deal with these problems, this paper contributes a new Convolutional Neural Network for Animal Face Detection (CNNAFD), a new backbone CNNAFD-MobileNetV2 for animal face detection and a new Tunisian Horse Detection Database (THDD). CNNAFD uses processed filters based on gradient features, applied in a new way. A new sparse convolutional layer, ANOFS-Conv, is proposed through a sparse feature selection method known as Automated Negotiation-based Online Feature Selection (ANOFS), which serves as the training optimizer of the new layer. CNNAFD ends with stacked fully connected layers which form a strong classifier. The fusion of CNNAFD and MobileNetV2 constructs the new network CNNAFD-MobileNetV2, which improves the classification results and yields better detection decisions. The proposed detector with the new CNNAFD-MobileNetV2 network provides effective results and proves competitive with the detectors of the related works, with an Average Precision of 98.28%, 99.78%, 99.00% and 92.86% on the THDD, Cat Database, Stanford Dogs Dataset and Oxford-IIIT Pet Dataset respectively.


Introduction
In a natural scene, face detection is a long-standing objective for remote monitoring and security needs. Using facial features, animal monitoring does not require direct contact with a sensor, so the animal remains at ease. Animal face detection can serve many security applications such as animal face identification, gender/age detection and visual monitoring [49]. Ensuring the safety of animals is an important task, particularly for breeders in many commercial fields such as horse racing, livestock buying and selling, and cow breeding. Face detection helps to fight fraud and animal theft and to enforce health monitoring and traceability. Unfortunately, detecting animal faces remains very difficult because face textures and shapes are grossly diverse, which probably explains the small number of existing approaches.
The face detection task is still extremely difficult, mainly because of wide intra-class variation, illumination changes, variable pose, complex backgrounds and partial occlusion. Despite these difficulties, recent research has achieved significant progress on these detection problems. The face detection rate has reached nearly 90% using boosting-based [40] and CNN (Convolutional Neural Network) based [48] approaches. Traditional human face detectors adopting handcrafted features have been replaced in many works by deep convolutional neural networks with the ability to extract discriminative facial features.
In the literature, the detection procedure usually includes three steps: block generation (multi-scale sliding windows or region proposals), face classification (in the backbone of the detector) and post-processing (non-maximum suppression and bounding box regression). In fact, the performance of face detectors is mainly determined by the face classification network, also known as the backbone. Duan et al. [7] observed that a detector and a classifier for general object detection have comparable performances when using the same backbone. This explains why a backbone designed for a classification dataset can easily be applied to general object detection and reach an excellent mAP (mean Average Precision) score. Existing detectors, especially those for humans, have already adopted well-known CNN architectures as backbones. Convolutional neural networks (CNNs) are strong and fascinating classification tools, which is one of the reasons why Deep Learning is immensely popular and widely used for computer vision tasks. Given this rapid development, the questions that must arise are the following: Are CNNs flawless? Are they the best? In fact, CNN training faces several challenges:
-Most network optimization algorithms (such as SGD and Adam) use the backpropagation method to set the layer weights. Backpropagation has yielded good training results in recent years, but it is not a very effective way to learn as it requires a huge dataset for CNNs [15].
-According to Hinton [35], pooling layers eliminate a great deal of information and ignore the relationship between image parts. In face detectors, for instance, just combining some features (mouth, eyes, face oval and nose) makes a face.
-Face detectors use CNNs with a large number of parameters and layers, which leads to long training times and high computational complexity.
Given the above weak points of CNNs, the main problems to be addressed in this paper are as follows: -How to maintain the relationship between image parts in a pooling layer.
-How to achieve good detection results using a small dataset for CNN training.
-Whether it is possible to create a new efficient CNN with a small number of parameters and layers.
The new challenge consists in creating a new convolutional neural network that effectively exploits animal face characterization, maintaining the relationship between image parts and using the smallest possible number of parameters, without the need for a huge dataset, in order to obtain a robust and fast detector. To deal with the above-listed problems, a new CNN is proposed based on the ANOFS method for sparse feature selection, introduced by BenSaid et al. [3,5]. The contributions of this paper are as follows: -A new convolutional neural network CNNAFD (Convolutional Neural Network for Animal Face Detection) is proposed for binary classification (face/non-face) based on a new sparse convolutional layer. -The ANOFS method for sparse feature selection, which had previously been applied only to pattern recognition applications, is employed here for face detection. -The ANOFS method is used as the training optimizer of the new sparse convolutional layer ANOFS-Conv instead of traditional algorithms such as Adam and SGD. -A new backbone CNNAFD-MobileNetV2 (fusion of CNNAFD and MobileNetV2 [36]) is built for efficient animal face detection. -A new horse database called the Tunisian Horse Detection Database (THDD) is collected. This database can contribute to the research community of animal biometrics. To the best of our knowledge, this is the only public face image dataset available for research on horse detection. -Extensive experimental studies show that the proposed CNNAFD-MobileNetV2 backbone achieves better performance than the backbones of traditional detectors. The experiments prove the efficiency of the CNNAFD network compared to other Convolutional Neural Networks, maintaining the relationship between image parts and using the smallest number of parameters without the need for a huge dataset.
The rest of this paper is organized as follows. Section 2 presents the related works on animal face detection. Section 3 describes the proposed CNNAFD network and its layers. Section 4 is devoted to the CNNAFD training methodology. Section 5 presents the final proposed backbone CNNAFD-MobileNetV2 for animal face detection. Section 6 describes the experimental test bed. Section 7 focuses on the evaluation methodology and metrics. Section 8 presents the experimental results. Finally, the main conclusion of this paper, together with some possible future work, is drawn in Section 9. All abbreviations and symbols are listed in Tables 1, 2, 3 and 4.

Related works
The number of works in this area is very limited due to the complexity of the animal face detection task. The existing related works in this field are as follows: Zhang et al. [50] proposed a set of Haar of Oriented Gradients (HOOG) features to capture the texture and shape of animal heads (such as cats, tigers, pandas, foxes and cheetahs). They used an SVM for classification and decision calculation. On the Cat Database, they obtained a precision (P) of 95% and a recall (R) of 99.8% (Table 5).
Yamada et al. [46] proposed detecting dog and cat heads using edge-based features. They selected four directional features (Horizontal, Vertical, Upper Right and Upper Left). The recall rate (R) was equal to 85% on the cat set and 90% on the dog set (Table 5).
Mukai et al. [22] focused on cat and dog face detection. They used the Viola-Jones method and employed both the Haar and HOG descriptors for feature extraction. Using 58 images from the Cat Database for testing, they found a recall (R) of 96.6% and a precision (P) of 75.7%. They achieved a recall (R) of 98.3% and a precision (P) of 90.8% using 60 images from the Stanford Dogs Dataset.
These traditional animal face detectors, adopting handcrafted features, have been replaced in the recent works by deep convolutional neural networks with the ability to extract discriminating face features (Table 5).
Vlachynska et al. [42] used the Faster R-CNN network proposed in [34] with ResNet-101 for dog face detection. They found an Average Precision (AP) equal to 98% on the Columbia Dogs Dataset (Table 5).
Tureckova et al. [41] used the YOLOv3 detector with DarkNet-53 for dog face detection and obtained an Average Precision (AP) of 92% on the Columbia Dogs Dataset and the Oxford-IIIT Pet Dataset (Table 5).
The traditional approaches [22,46,50] adopted two distinct stages: handcrafted feature extraction and feature classification. However, these methods are not effective and have been replaced in many works by deep convolutional neural networks (CNNs) able to extract discriminative facial features. The proposed detectors [41,42] have already adopted well-known CNN architectures (ResNet and DarkNet) as backbones. In addition, other CNN-based detectors for animal detection [17] have been proposed. However, Convolutional Neural Networks (CNNs) ignore the relationship between image parts, have a large number of parameters and layers and require a huge dataset for training, which leads to long training times and high computational complexity.
In the last decades, facial recognition systems have been developed using manually-annotated databases to locate the facial area in the image. Overall, facial recognition systems have not been automated with facial detection systems [12,13,17,21,[24][25][26][27][28]31]. These methods achieve high recognition rates, but their systems lack automatic face detection, which is why an animal face detection system is important to ensure safety and security.
To deal with the above-listed problems, this paper introduces the Convolutional Neural Network for Animal Face Detection (CNNAFD). The proposed network effectively exploits animal face characterization to obtain a robust and fast detector, maintaining the relationship between image parts and using the smallest number of operations and parameters without the need for a huge dataset.

CNNAFD: Convolutional neural network for animal face detection
A small convolutional network and fast training with a small database were the design goals for overcoming the previously-mentioned challenges. Despite the large diversity of animal head textures, each animal species has a distinctive head form with a similar shape. Consequently, gradient features were considered because they are invariant to photometric and geometric transformations. Furthermore, as Dalal and Triggs [6] discovered, fine orientation sampling, coarse spatial sampling and strong local photometric normalization make it possible to ignore object movement as long as the object maintains a roughly vertical position without big transformations, as is the case for animal faces. Gradient features are thus well suited for animal face detection in images.
The Automated Negotiation-based Online Feature Selection (ANOFS) is a sparse online learning method introduced by BenSaid and Alimi [1][2][3][4][5]. The aim of this method is to select a small number of features for binary classification on small databases, thereby replacing the traditional optimizers (such as SGD and Adam), which decreases the number of layer parameters and operations. Moreover, using ANOFS helps to find the most expressive features and extract the best representation of the animal face by keeping the relationship between the face parts during training. This section presents the proposed convolutional neural network CNNAFD.
As shown in Fig. 1, the CNNAFD network included five types of layers as follows: -INPUT: This layer was used to keep the raw pixel values of the image. In this network, the layers had the shape of vectors instead of matrices.

Gradient-Conv layer
The gradient-Conv layer incorporated some constraints and achieved some deformations using local receptive fields, gradient features and spatial subsampling. Each unit in the output vector was connected to local regions of neighborhood pixels in the input image as usual (Fig. 1). The output vector C was considered as a feature map produced by a local window of size 16 * 16 * 1 which scanned the plane of the image with a stride of 8 pixels. The same principle as the HOG descriptor proposed by Dalal and Triggs [6] was applied in this layer for feature calculation. Each window was divided into four sub-windows of 8 * 8. The gradient magnitude Mag and the gradient angle θ were calculated for each sub-window in order to construct a normalized 36 * 1 vector over the whole window. The produced gradient vector represented a unit in the output feature map of the gradient convolution layer. These layer settings produced the best results in our experiments; the improvement was insignificant whether the window size was smaller or bigger. Following Dalal and Triggs [6], both the gradient magnitude and the gradient angle were calculated from the intensity I of each pixel through Eqs. (1)-(4).
The gradient-Conv layer extracted each feature vector automatically and then sent it to the second convolutional layer.
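The gradient-Conv computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the choice of 9 orientation bins per sub-window (so that 4 sub-windows give the 36-dimensional vector) and the simple L2 normalization are assumptions; the paper specifies only the 16x16 window, the 8x8 sub-windows, the stride of 8 and the final normalized 36 * 1 vector.

```python
import numpy as np

def gradient_conv(image, win=16, stride=8, cell=8, bins=9):
    """Sketch of the gradient-Conv layer: a HOG-style 36-d descriptor
    per 16x16 window, scanned over the image with a stride of 8 pixels."""
    # Per-pixel gradients of the intensity image I
    gy, gx = np.gradient(image.astype(float))
    mag = np.sqrt(gx**2 + gy**2)                      # gradient magnitude Mag
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0      # unsigned angle in [0, 180)

    features = []
    for y in range(0, image.shape[0] - win + 1, stride):
        for x in range(0, image.shape[1] - win + 1, stride):
            vec = []
            # Four 8x8 sub-windows, each giving a magnitude-weighted histogram
            for cy in (y, y + cell):
                for cx in (x, x + cell):
                    m = mag[cy:cy + cell, cx:cx + cell].ravel()
                    a = ang[cy:cy + cell, cx:cx + cell].ravel()
                    hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
                    vec.extend(hist)
            v = np.asarray(vec)
            v = v / (np.linalg.norm(v) + 1e-6)        # normalize the 36-d vector
            features.append(v)                        # one unit of the feature map C
    return np.asarray(features)
```

On a 64x64 image this yields ((64-16)/8 + 1)^2 = 49 windows, each described by a normalized 36-dimensional gradient vector.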

ANOFS-Conv layer
In other research works, the feature selection field has been well explored through many learning methods for pattern recognition. However, these are not devoted to object detection applications. Despite their efficiency, feature selection methods are often not accurate enough to process real-world data using a small number of features. The ANOFS [1][2][3][4][5] method is a sparse feature selection method for binary classification that treats this limitation by integrating an automated negotiation process between a simple truncation algorithm (PEtrun) and a randomized feature selection algorithm (RAND) [2,3,5]. The ANOFS method, previously applied only to pattern recognition, is employed in this paper for face detection. It plays the role of the traditional training optimizers in this convolutional layer. The input sequence of ANOFS is (C_t, y_t), where t = 1, ..., T, C_t is the gradient vector of dimension d and y_t is the desired output. ANOFS learns a classifier W_t, which represents the weight vector (kernel) of the ANOFS-Conv layer. W_t contains at most B non-zero elements (where B > 0 is a predefined constant). Thus, the classification of C_t depends only on B features and is made by a linear function. The weight vector W_t is updated in each trial, and the learner classifies the instance C_t using the automated negotiation process between PEtrun and RAND. This scenario is repeated until t = T, when the learner has seen every training instance, yielding the final kernel W = W_T. Using n different kernels W_i, i = 1..n, as convolutional filters, various maps (vectors) ANOFSM_i (ANOFS Maps) are constructed in this layer, so that relevant features can be selected. The ANOFS-Conv layer has a size equal to that of the gradient-Conv layer.
Each neuron in the gradient-Conv layer had a unique relationship with the opposite neuron in the ANOFS-Conv layer, which reduced the number of parameters. The sparse ANOFS-Conv maps were produced when the ANOFS-Conv weights W i were multiplied by the gradient-Conv output C with a linear activation function as shown in the following equation with j = 1..p and p is the number of features:
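ANOFS itself relies on the full automated negotiation between PEtrun and RAND [3,5]. As a rough illustration only, the sparsity constraint can be sketched with a minimal truncation-style online learner that keeps at most B non-zero weights. The function name, the learning rate and the mistake-driven update rule are assumptions for this sketch, not the actual ANOFS algorithm.

```python
import numpy as np

def sparse_online_train(X, y, B, lr=0.1):
    """Illustrative sketch of training one sparse kernel W: an online
    perceptron-style update followed by hard truncation so that W keeps
    at most B non-zero entries (a PEtrun-like step; the negotiation
    with the RAND strategy is omitted here)."""
    d = X.shape[1]
    W = np.zeros(d)
    for C_t, y_t in zip(X, y):                 # y_t in {-1, +1}
        margin = y_t * (W @ C_t)               # linear classification of C_t
        if margin <= 0:                        # mistake-driven update
            W = W + lr * y_t * C_t
            if np.count_nonzero(W) > B:        # truncate: keep the B largest |w|
                keep = np.argsort(np.abs(W))[-B:]
                mask = np.zeros(d, dtype=bool)
                mask[keep] = True
                W[~mask] = 0.0
    return W                                   # final sparse kernel W = W_T
```

After the pass over all T instances, classification of a gradient vector depends only on the at most B surviving features, as in the layer described above.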

Non-zero pool layer
Each ANOFS-Conv kernel contained a large number of zero weights. The pooling layer summarized the sparse convolutional output by eliminating all the values corresponding to zeros and keeping only the relevant features, which represent 10% of the ANOFSM vector. This elimination considerably reduces the number of features. Equation (6) presents the output (PoolM) of the Non-zero Pool layer, where k = j = 1 at the beginning, j = 1..p and p represents the number of features. Unlike max pooling, the Non-zero Pool maintained the relationship between image parts by keeping the same arrangement of values.
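A minimal sketch of the Non-zero Pool idea follows, assuming the kept features are the non-zero responses truncated to 10% of the vector length with their original order preserved; the exact selection rule beyond "non-zero, 10%, same arrangement" is not specified in the text, so the magnitude-based tie-breaking here is an assumption.

```python
import numpy as np

def non_zero_pool(anofs_map, keep_ratio=0.10):
    """Sketch of the Non-zero Pool layer: drop the entries of a sparse
    ANOFS-Conv map that correspond to zeros and keep the relevant
    features (about 10% of the vector), preserving their original order
    so the relationship between image parts is maintained."""
    idx = np.flatnonzero(anofs_map)            # indices of non-zero responses
    k = max(1, int(len(anofs_map) * keep_ratio))
    if len(idx) > k:                           # keep the k strongest responses...
        strongest = np.argsort(np.abs(anofs_map[idx]))[-k:]
        idx = np.sort(idx[strongest])          # ...restoring the original arrangement
    return anofs_map[idx]
```

Because the surviving values stay in their original positions relative to each other, the pooled vector keeps the spatial arrangement of the face parts, unlike max pooling.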

Stacked fully connected (FC) layers
The proposed CNNAFD was completed with stacked Fully Connected (FC) layers for classification. Once one FC layer classified a region (window) as non-face, the region was rejected without going through the rest of the FC layers. In fact, each FC layer was connected to a Non-zero Pool vector and ended with an OutPut layer (OP). The stacked FC layers were applied using the tangent sigmoid transfer function and stochastic gradient descent with momentum (SGD). Assuming that the Pool map PoolM_i is composed of p features, each neuron k of the FC layer has an input calculated as shown in Eq. (7) and an output as illustrated in Eq. (8), where k = 1..r and w_z is the weight value of neuron k of the FC layer. Each FC layer on its own is a weak classifier; when their decisions are combined, however, the stacked FC layers form a strong classifier.
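The rejection cascade over the stacked FC layers can be sketched as follows, assuming a single tanh ("tangent sigmoid") unit per layer and a zero decision threshold; both are assumptions, since the paper specifies only the transfer function and the early rejection of non-face windows.

```python
import numpy as np

def stacked_fc_classify(pool_maps, weights, biases, threshold=0.0):
    """Sketch of the stacked FC classifier: each FC layer scores one
    Non-zero Pool vector with a tanh unit (Eqs. (7)-(8): weighted sum,
    then transfer function). A window is rejected as non-face as soon
    as one weak FC layer says so, skipping the remaining layers."""
    scores = []
    for pool_m, w, b in zip(pool_maps, weights, biases):
        s = np.tanh(w @ pool_m + b)            # weak classifier decision
        if s < threshold:
            return False, scores               # early rejection: non-face
        scores.append(s)
    return True, scores                        # all weak classifiers agreed: face
```

The early exit is what makes the cascade cheap on the many non-face windows, while the combined decisions of all layers act as the strong classifier on true faces.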

Proposed training methodology of CNNAFD
The training database was divided into n overlapped sub-sets. The gradient-Conv vectors of all images were extracted from each sub-set to produce n new sparse convolutional filters.
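A possible sketch of this split is given below, assuming a 50% overlap ratio between consecutive sub-sets; the paper states only that the n sub-sets overlap, so the ratio and the sliding-window layout are assumptions.

```python
import numpy as np

def overlapped_subsets(indices, n, overlap=0.5):
    """Sketch of the training split: divide the database indices into n
    overlapping sub-sets, one per ANOFS-Conv filter. Consecutive
    sub-sets share a fraction `overlap` of their images."""
    # Sub-set size so that n windows with the given overlap cover everything
    size = int(np.ceil(len(indices) / (n - (n - 1) * overlap)))
    step = int(size * (1 - overlap))
    subsets = []
    for i in range(n):
        start = i * step
        subsets.append(indices[start:start + size])
    return subsets
```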
In fact, Fig. 2 shows that each ANOFS-Conv filter was trained separately on a specific sub-set of data using the Automated Negotiation-based Online Feature Selection (ANOFS) method [3][4][5]. This sparse method is accurate enough to create a weight vector for binary classification. The same fraction of selected features used in [3][4][5] was chosen: 10% of all dimensions, with the remaining weights set to zero. BenSaid et al. [3,5] proved the effectiveness of this fraction on several public large-scale benchmark datasets; accordingly, each pooling filter summarized the sparse ANOFS-Conv map to 10% of its real size. Algorithm 1 and Fig. 3 (flow chart of the CNNAFD training process) detail the whole training procedure. Table 6 presents the CNNs that have appeared in the last 10 years. In order to keep a small number of parameters, the decisions of CNNAFD were merged with those of MobileNetV2 [36], which has few parameters. For more efficiency, the addition of CNNAFD to MobileNetV2 was proposed to strengthen the detection process. This fusion gives the new backbone CNNAFD-MobileNetV2. As shown in Fig. 4, the fusion was applied using a single neuron with the tangent sigmoid transfer function trained by stochastic gradient descent with momentum (SGD). This neuron, which represents the final FC layer of the network, triggers the final detection decision. The input input_Fusion and the decision result y of the final FC layer are calculated as shown in Eqs. (9) and (10), based on the CNNAFD decision CNNAFD_Output and the MobileNetV2 decision MobileNetV2_Output, where w is the weight value of the fusion neuron.
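The fusion neuron of Eqs. (9) and (10) can be sketched as follows; the two-element weight vector and the optional bias are assumptions about the exact parameterization, and the learned values would come from SGD with momentum rather than being passed in.

```python
import numpy as np

def fuse_decisions(cnnafd_out, mobilenet_out, w, b=0.0):
    """Sketch of the fusion neuron: a single FC unit with a tanh
    ("tangent sigmoid") transfer function combines the CNNAFD and
    MobileNetV2 decisions into the final face/non-face score."""
    input_fusion = w[0] * cnnafd_out + w[1] * mobilenet_out + b   # Eq. (9)
    return np.tanh(input_fusion)                                  # Eq. (10)
```

Because the weight assigned to each network is learned, the final FC layer can adjust how much each backbone's decision contributes, which is how the fusion adjudicates between the two complementary networks.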

CNNAFD-MobileNetV2 backbone: fusion of CNNAFD and MobileNetV2
The detection system was based on the YOLOv2 [32] strategy as indicated in Fig. 5.
The proposed regions were applied on the original image and used as inputs by CNNAFD. As stated above, the resulting decisions of MobileNetV2 and CNNAFD were merged by the final FC layer to produce the detection decision.

Experimental test bed
In the experiments, the proposed CNNAFD-MobileNetV2 backbone was evaluated on four real-world databases:
-The THDD 1 (Tunisian Horse Detection Database) includes a set of horse images taken at distances ranging from 1 to 2 meters from the horses. The collected database consists of 703 horse images for training and 400 images for testing (see Fig. 7). The testing set contains 415 horse faces.
-The Cat Database [18] involves 10,000 cat head photos. The photos were mainly downloaded from Flickr and are paired with data files that specify the position of each cat's ears, eyes and mouth. Some examples from the Cat Database are shown in Fig. 8.
-The Stanford Dogs Dataset [16] includes over 20,580 annotated images of 120 dog breeds. This dataset was built using images from ImageNet for the fine-grained image categorization task (Fig. 9). Each image is annotated with an object class and a bounding box label. For accurate evaluation, manual face annotations were made, as annotations of the whole face area did not exist in the Cat Database and the Stanford Dogs Dataset.
-The Oxford-IIIT Pet 2 Dataset proposed by Parkhi et al. [29] contains 37 different breeds of cats and dogs, with roughly 200 labeled images per breed. Only the cat categories were used in this study; the total number of labeled cat images was 1188. Figure 10 presents some example images from the Oxford-IIIT Pet Dataset.

Evaluation methodology and metrics
In order to evaluate the animal face classification and detection, outputs were extracted from the images of the testing dataset. The classification rates, the Receiver Operating Characteristic (ROC) and the precision-recall curves were recorded using different metrics such as accuracy, precision, average precision, recall, sensitivity, specificity, negative predictive value and F1-score. Specificity, also called the true negative rate (TNR or SPC), is calculated by dividing the number of correct negative predictions by the total number of negatives: TNR = Tn / (Tn + Fp). The Negative Predictive Value (NPV) is defined as NPV = Tn / (Tn + Fn). The F1-score (F1) can be useful, but it is less frequently used than the other basic measures; it is the harmonic mean of precision and recall: F1 = 2 * P * R / (P + R). The Intersection over Union (IoU) ratio is computed as the ratio between the intersection and the union of the predicted bounding box and the ground-truth bounding box. Following the Pascal VOC challenge [8], every true positive detection has an IoU ratio equal to or larger than 50%. The precision-recall curve presents the relation between the precision and the recall calculated at different detection thresholds; consequently, the area under the precision-recall curve gives the Average Precision (AP) of the detector. A Receiver Operating Characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied, plotting the true positive rate (TPR) of detection against the false positive rate (FPR) of error at various threshold settings: the x axis shows FPR = Fp / (Fp + Tn) (18) and the y axis shows TPR = R.
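The rates and the IoU criterion above can be computed directly from the confusion counts and box coordinates; the function names and the dictionary layout below are illustrative, not from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """The basic rates from the section above, computed from raw counts:
    precision, recall (= TPR / sensitivity), specificity (TNR), NPV,
    F1-score and the ROC x-axis rate FPR."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also the TPR / sensitivity
    specificity = tn / (tn + fp)       # TNR
    npv = tn / (tn + fn)               # negative predictive value
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)               # x axis of the ROC curve
    return dict(P=precision, R=recall, TNR=specificity, NPV=npv, F1=f1, FPR=fpr)

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes; a detection
    counts as a true positive when IoU >= 0.5 (Pascal VOC rule [8])."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

Sweeping the detection threshold and recording (R, P) pairs from these counts traces the precision-recall curve whose area is the AP reported in the results.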

Experimental results
The experiments were performed on an Nvidia GeForce 920MX GPU with 20 GB of memory. In order to get an accurate assessment, the proposed backbone CNNAFD-MobileNetV2 was tested for animal face classification and detection and compared with four pre-trained CNNs [20,48]. A transfer learning of the four pre-trained CNNs was made on the training sets of the THDD, Cat Database and Stanford Dogs Dataset. The last FC layer of these CNNs was replaced with a new FC layer having two outputs (face/non-face). As shown in Table 6, CNNAFD and CNNAFD-MobileNetV2 present a small number of parameters.
The field of face detection has been dominated by generic object detection methods, as only a slight difference exists between face detection and generic object detection. Consequently, it was necessary to discuss and compare the proposed detector with object detection methods such as Faster R-CNN [34], SSD [19], YOLOv3 [33] and YOLOv5 [14], which have also been used for human and animal face detection. Detectron2 [44], by Facebook Artificial Intelligence Research (FAIR), has become one of the most widely adopted open source projects for object detection. SSD-MobileNetV2 [23] is a Single-Shot multibox Detection (SSD) network designed to perform real-time object detection on mobile devices. YOLOv5 [14] is the latest improved version of the You Only Look Once (YOLO) detector, using CSPNet [43]. Thus, there was good reason to compare our detector with Detectron2 [44] using Faster R-CNN [34] with ResNeXt-101 [45], YOLOv3-tiny [41] with Darknet-53 [33], YOLOv5 [14] with the Cross Stage Partial Network (CSPNet) [43] and SSD [19] with MobileNetV2 3 as presented by TensorFlow [23].
Tables 7 and 8 present the experimental configuration of the different CNNs and detectors. YOLOv2 (CNNAFD-MobileNetV2) was trained using the SGD optimizer (Opt), 9 anchors (Anch), 90 epochs (Ep) and a Batch Size (BS) equal to 16. As shown in Table 8, most of the detectors use SGD and a Batch Size of 16; to make a fairer comparison between them, the same values of these parameters were chosen. According to the experiments, 9 anchors and 90 epochs were sufficient to detect animal faces using the four training datasets (THDD, Cat Database, Stanford Dogs Dataset and Oxford-IIIT Pet Dataset). The faces in these datasets are neither very big nor very small, and the disparity between them is not great. For this reason, a larger number of anchors is not needed and does not give better results, while a smaller number decreases the detection accuracy by missing some faces (Table 9). The same holds for the number of epochs: the system converges at epoch 90, and the detection accuracy does not improve even when adding more epochs (Table 10). The selected feature layer for anchor generation on MobileNetV2 was the ReLU layer (Bloc-13-expand). During the training of YOLOv2 with MobileNetV2, three types of data augmentation (horizontal flipping, scaling and jitter of image color) were used. YOLOv3, YOLOv5 and Detectron2 used other transformations as well, such as jitter of image color, translation, scale change, left-right flip, up-down flip, mosaic transformation, image shear and image rotation.

Results on THDD
Classification evaluation: CNNAFD-MobileNetV2 presented the maximum classification accuracy, as illustrated in Fig. 11. Figure 12 shows the ROC curves of the different CNNs. CNNAFD-MobileNetV2 outperformed all the other CNNs since it obtained the biggest critical region. Table 11 presents classification rates competitive with the other CNNs. Thus, it can be concluded that the fusion of the two networks yields encouraging results.
Detection evaluations Animals, and mainly horses, have many face texture variations, which makes detection more difficult. Table 12 shows comparative results with different research studies. Our detector was competitive with the other detectors and achieved high performance. Figure 13 displays some detection examples, while Fig. 14 represents the performance of our detector in terms of precision and recall. The figure also shows that the new detector had a big critical region, which indicates effective results.

Results on cat database
Classification evaluation In Table 13 and Fig. 15, CNNAFD-MobileNetV2 had adequate statistical classification rates compared to the other CNNs. It exhibited competitive results with the biggest critical region of the ROC curves, outperforming all the other CNNs (Fig. 16).

Detection evaluation The proposed detector was evaluated on the challenging Cat Database. Newly-published methods were compared to our results, as illustrated in Table 14. The same database partition as in the related work was taken: 5000 randomly-chosen images were used for training and 3000 for testing. CNNAFD-MobileNetV2 achieved a competitive performance compared with the other approaches, with a recall rate of 99.80% and a precision rate of 99.53%. Figure 17 shows the precision-recall curves on the Cat Database; our detector presented the biggest critical region, which indicates its performance. Figure 18 displays some detection examples which show the efficiency of our detector.

Results on stanford dogs dataset
Classification evaluation CNNAFD-MobileNetV2 competed with the other CNNs, as it had a big critical region of the ROC curves in Fig. 19. In addition, CNNAFD-MobileNetV2 presented effective and competitive statistical classification rates, as illustrated in Fig. 20 and Table 15.

Detection evaluation
The proposed detector was evaluated on the challenging Stanford Dogs Dataset (Fig. 21). As in the related work, 3000 randomly-chosen images were taken for training and 100 images for testing. Our detector achieved a recall rate of 99% and a precision rate of 99.98%, outperforming all recent detection methods (Table 16). Figure 22 shows a comparison between CNNAFD-MobileNetV2 and the related work detectors using recall-precision curves. CNNAFD-MobileNetV2 presented the biggest critical region, which indicates the validity of the proposed detector for dog faces.

Results on Oxford-IIIT pet dataset

Classification evaluation: CNNAFD-MobileNetV2 outperformed all the other CNNs on the Oxford-IIIT Pet Dataset (cat part). The proposed CNN had a big critical region of the ROC curves in Fig. 23 and proved again its effectiveness for cat face classification. Moreover, CNNAFD-MobileNetV2 presented competitive statistical classification rates, as illustrated in Fig. 24 and Table 17.

Detection evaluation
The proposed detector was evaluated on the cat part of the Oxford-IIIT Pet Dataset (Fig. 25). The same training model was used for this dataset (Table 18 and Fig. 26).

Discussion
The proposed CNNAFD-MobileNetV2 backbone proved its performance in the experimental part above for both classification and detection. Figure 27 presents the classification accuracy on the three databases (THDD, Cat Database and Stanford Dogs Dataset). The accuracy of MobileNetV2 was almost equal to that of CNNAFD on the THDD. The accuracy of MobileNetV2 was much higher than that of CNNAFD for cats on the Cat Database and the Oxford-IIIT Pet Dataset, while the accuracy of CNNAFD was higher than that of MobileNetV2 on the Stanford Dogs Dataset. Therefore, according to the horse, cat and dog face results, neither of the two networks can be said to be better than the other. However, the CNNAFD-MobileNetV2 network achieved the best accuracy on all four datasets. In fact, CNNAFD-MobileNetV2 exceeded the accuracy of the other CNNs by about 1.29% on the THDD with an accuracy equal to 98.38%, 0.22% on the Cat Database with an accuracy equal to 99.95%, 0.21% on the Stanford Dogs Dataset with an accuracy equal to 99.99% and 0.57% on the Oxford-IIIT Pet Dataset with an accuracy equal to 99.97% (Tables 11, 13, 15 and 17). It can thus be stated that the fusion of the two networks led to cooperation between them. The two networks are complementary, and the fusion reinforced this coherence. The last FC layer of the fusion adjusts the output decisions of the two networks, assigning a weight to each of them to produce the final decision. The same is noticed for the detection process: the fusion of CNNAFD and MobileNetV2 improved the Recall and Precision results by maximizing true detections and minimizing false detections.
In fact, the addition of CNNAFD to MobileNetV2 reinforced the YOLOv2 detector and surpassed the F1 of the related works (Detectron2, YOLOv5, YOLOv3, YOLOv2) by about 5.14% on the THDD with an Average Precision of 98.28%, by 1.06% on the Cat Database with an Average Precision of 99.66%, by 1.58% on the Stanford Dogs Dataset with an Average Precision of 99.49% and by 2.75% on the Oxford-IIIT Pet Dataset with an Average Precision of 95.22% (Tables 12, 14, 16 and 18).
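The Precision, Recall and F1 measures compared above follow the standard definitions, which can be stated as a short sketch (the counts below are made-up illustrative numbers, not results from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard detection metrics from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example with hypothetical counts: 95 correct face boxes,
# 3 false alarms, 5 missed faces.
p, r, f1 = precision_recall_f1(tp=95, fp=3, fn=5)
```

Maximizing true detections raises Recall, while minimizing false detections raises Precision; F1 balances the two, which is why the fusion's joint effect on both counts improves it.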
The use of another learning algorithm, through a sparse feature selection method (ANOFS), enhanced the information transmitted to the FC layers. In fact, the proposed sparse ANOFS-Conv layer and training methodology contributed to a proper distinction between true and false detections (face/non-face). CNNAFD extracted the relevant features using the ANOFS-Conv layer and then classified the candidate block using the stacked Fully Connected (FC) layers. Moreover, unlike other CNNs, the Non-zero Pool layer maintained the relationship between image parts, keeping as much information as possible while minimizing the number of operations and parameters. Consequently, the new sparse ANOFS-Conv and Non-zero Pool layers positively influenced decisions and brought detections closer to reality. The addition of CNNAFD to the YOLOv2 detector with MobileNetV2 was proposed to reinforce the animal face detection process. In fact, the fusion of CNNAFD with MobileNetV2 helped to increase the Precision rate and to decrease the number of false detections.
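The intent of the Non-zero Pool layer can be illustrated with a small sketch. The layout below (keeping each non-zero response together with its position) is our assumption about how the layer preserves the relationship between image parts while dropping null features; it is not the paper's exact implementation.

```python
def non_zero_pool(feature_map):
    """Keep only non-zero activations, each with its (row, col) position,
    so spatial relationships survive while null features are discarded."""
    return [(i, j, v)
            for i, row in enumerate(feature_map)
            for j, v in enumerate(row)
            if v != 0]

# A sparse 3x3 response map: only 3 of 9 values carry information.
fmap = [[0.0, 1.2, 0.0],
        [0.0, 0.0, 0.7],
        [3.1, 0.0, 0.0]]
kept = non_zero_pool(fmap)
```

Downstream layers then operate on three entries instead of nine, which is how dropping null features reduces both the number of operations and the number of parameters without losing where each response came from.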

Limitations
Since the photos in the used databases were taken close to the animals, their faces are not very small and the system detected them easily. However, our detector cannot detect very small faces when the animal in the photo is far away. Figure 28 shows the detection results on some images, loaded from the Oxford-IIIT Pet Dataset and from the web, that contain distant cats.
These failures were due to the poor resolution of the facial area and the lack of important details and information. The performance of the ANOFS method decreases as the available information is reduced: the more information there is, the more accurately ANOFS classifies. Detecting small faces remains a challenge for the backbones of the most popular detectors.

Conclusion
Traditional approaches based on handcrafted features are not effective. They have been replaced by many recent approaches that use deep convolutional neural networks (CNNs), which are able to extract discriminative facial features. However, CNNs present several weak points, such as ignoring the relationship between image parts, involving a large number of parameters and layers, and requiring a huge amount of training data.
To avoid these problems, CNNAFD was proposed in this work. In fact, the traditional training optimizer (such as ADAM or SGD) was replaced with the ANOFS method for sparse feature selection so as to create the new convolutional layer ANOFS-Conv. The ANOFS-Conv layer was connected to the Non-zero Pool layer to remove all null features while maintaining the relationship between the image parts; this elimination reduced the number of features. CNNAFD ends with stacked fully connected layers that represent a strong classifier. The proposed CNNAFD succeeded in training with a small database, about 1 million parameters and only 5 layers.
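As a rough intuition for the sparsification that ANOFS performs on a filter, the sketch below simply keeps the k largest-magnitude weights and zeroes the rest. This top-k truncation is only an illustrative proxy: the actual ANOFS method uses automated-negotiation-based online feature selection, which is considerably more involved.

```python
def sparsify_filter(weights, k):
    """Illustrative proxy for sparse feature selection: zero all but the
    k largest-magnitude weights of a flattened convolutional filter.
    NOT the actual ANOFS algorithm, just a stand-in for its effect."""
    keep = set(sorted(range(len(weights)),
                      key=lambda i: abs(weights[i]),
                      reverse=True)[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

# Only the two strongest weights (0.9 and -1.3) survive.
sparse = sparsify_filter([0.9, -0.1, 0.05, -1.3, 0.2], k=2)
```

Whatever the selection rule, the resulting sparse filters are what keeps the parameter count near 1 million and feeds the Non-zero Pool layer mostly zeros that it can discard.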
The detection system was based on the YOLOv2 strategy. The addition of CNNAFD to MobileNetV2 was proposed to strengthen the detection process, resulting in the new backbone CNNAFD-MobileNetV2. The fusion was applied using a single neuron that represents the final FC layer of the network: the output decisions of MobileNetV2 and CNNAFD were merged by this final FC layer to obtain the final detection decision. Despite this fusion, the proposed backbone always kept the minimum number of parameters compared to other CNNs, improved the classification results and gave better detection decisions.
The proposed system was evaluated on three well-known datasets, namely the Cat Database, the Stanford Dogs Dataset and the Oxford-IIIT Pet Dataset. Furthermore, our paper introduced a new Tunisian Horse Detection Database (THDD). The performance of the proposed CNNAFD network was demonstrated on these real-world databases. The experimental part clearly showed that CNNAFD-MobileNetV2 outperformed the other CNNs by about 1.29% on the THDD with a classification accuracy of 98.38%, by 0.22% on the Cat Database with an accuracy of 99.95%, by 0.21% on the Stanford Dogs Dataset with an accuracy of 99.99% and by 0.57% on the Oxford-IIIT Pet Dataset with an accuracy of 99.97%.
Using the CNNAFD-MobileNetV2 backbone, the proposed detector outperformed the state-of-the-art detectors by about 5.14% on the THDD with an Average Precision of 98.28%, by 1.06% on the Cat Database with an Average Precision of 99.66%, by 1.58% on the Stanford Dogs Dataset with an Average Precision of 99.49% and by 2.75% on the Oxford-IIIT Pet Dataset with an Average Precision of 95.22%.
In future work, our objective will be to improve the CNNAFD-MobileNetV2 performance by exploring more discriminant filters, and to extend the proposed detector to other animals as well as to humans.