Lightweight Deep Learning Based Intelligent Edge Surveillance Techniques

Decentralized edge computing techniques have attracted strong attention in many applications of the intelligent Internet of Things (IIoT). Among these applications, intelligent edge surveillance (INES) methods play a very important role in automatically recognizing object feature information from surveillance video by combining edge computing with image processing and computer vision. Traditional centralized surveillance techniques recognize objects at the cost of high latency, high expense, and large storage occupancy. In this paper, we propose a deep learning-based INES technique for a specific IIoT application. First, a depthwise separable convolutional strategy is introduced to build a lightweight deep neural network with reduced computational cost. Second, we combine edge computing with cloud computing to reduce network traffic. Third, the proposed INES method is applied to a practical construction site to validate it in a specific IIoT application. The detection speed of the proposed INES reaches 16 frames per second on the edge device. After the joint computing of edge and cloud, the detection precision reaches as high as 89%. In addition, the operating cost of the edge device is only one-tenth of that of the centralized server. Experimental results are given to confirm the proposed INES method in terms of both computational cost and detection accuracy.

Video surveillance plays an important role in many areas such as national security [5], social security [6], and traffic monitoring [7], because information can be obtained more intuitively from surveillance video. However, the way this information is obtained is usually inefficient. At present, most video surveillance systems obtain information by playing back the saved surveillance video after an abnormal situation occurs. This method cannot detect or prevent abnormal events and accidents in time. Other video surveillance systems set up a monitoring room that gathers dozens or even hundreds of surveillance video streams, which are watched by full-time workers. Once an abnormal event occurs, the relevant workers raise an alarm. This method can prevent abnormal events to a certain extent, but it is limited by the number of monitors and often misses key behavior information. A more serious problem is that visual fatigue reduces attention, so monitoring personnel cannot stay focused on the monitors. Real-time observation and analysis of massive video data pose a big challenge to conventional video surveillance systems, and even post-recording queries are extremely labor intensive and make it difficult to obtain all important information.
Fortunately, with the development of computer vision technologies, computers empowered by machine learning are able to analyze video intelligently. In the early years, researchers searched the area to be detected with the sliding window method and then detected the target with hand-designed features. A typical approach combines Haar-like features [8] with adaptive boosting (Adaboost) [9] and a cascade [10]. However, this method has high time complexity and redundancy. In addition, hand-designed features lack robustness, so they cannot adapt to many real-world scenarios. In 2006, Hinton et al. [11] proposed the concept of deep learning (DL), which has been utilized in many applications such as big data analysis [12]-[14], computer vision [15], [16], and wireless communications [17]-[25]. Motivated by these ubiquitous applications, many state-of-the-art deep neural networks (DNN) have been proposed, such as convolutional neural networks (CNN) [26]-[28]. In 2015, He et al. [29] proposed the deep residual network (ResNet) to solve the problem of vanishing and exploding gradients as the number of network layers increases. In order to improve the accuracy of DNN, the number of network layers keeps increasing and the network structure is becoming more and more complicated, so that DNN can only run on centralized GPU or cloud servers. However, the centralized GPU processing method has the disadvantages of high computational cost, high transmission cost, and high system delay, and hence it is hard to utilize in IIoT applications. In order to solve this problem, many researchers focus on lightweight neural networks, which allow these intelligent methods to run on embedded devices or edge devices [30]-[34]. Their development process is shown in Fig. 1.
In recent years, four mainstream approaches to lightweight neural networks have emerged. Luo et al. [35] proposed a parameter pruning and weight sharing method to reduce the redundancy of DNN. Wen et al. [36] proposed a low-rank factorization method to decompose the convolutional kernels in the original CNN. Sandler et al. [33] proposed a compact convolutional computing unit that compresses the storage of the model and reduces the computational complexity. Li et al. [37] proposed a knowledge distillation method, which uses large neural networks to guide the training of small neural networks, to reduce the complexity of the network. In addition, Wang et al. [38] proposed a lightweight automatic modulation recognition method using DL and compressive sensing. All of the above lightweight DL-based methods have made significant contributions to realistic applications. At the same time, with the enhancement of the computing power of edge devices, edge computing came into being. Shi et al. [39] presented the typical application scenarios of edge computing and analyzed its challenges. Sun and Ansari [40] proposed mobile edge computing for the IIoT to ease the pressure of network transmission. Li et al. [41] proposed a deep learning method based on edge computing and analyzed its feasibility.
In this paper, we propose a lightweight DL-based intelligent edge surveillance (INES) method for IIoT applications. Based on the smart construction site system, this paper proposes a joint computing method that combines edge computing and cloud computing. Compared with conventional DL methods, our proposed method has the following two advantages.
Less network resource occupancy: Most of the existing intelligent surveillance systems directly process the video collected by the camera. Because DL requires powerful computing capabilities, videos must be transmitted to a cloud server platform, which consumes a lot of network resources. The method of combining edge computing and cloud computing proposed in this paper reduces the occupation of network resources by sharing the computing burden with the edge nodes.
Less system response delay: Surveillance systems usually have hundreds of cameras. In addition to occupying network resources, the cloud servers cannot process these videos in a timely manner, which causes delays or even crashes of the system. Our proposed method processes video at the edge nodes, and the cloud server is used for secondary confirmation and video storage, so that the system can achieve real-time response.
The rest of this paper is organized as follows. Section II introduces the smart construction site system, including system requirements, system principles, and system models. Section III provides the system operating environment, dataset production, and experimental results. Finally, we conclude this paper in Section IV.

II. SYSTEM MODEL
In this section, we take the smart construction site system as an example to introduce our INES. As shown in [16], the smart construction site system needs to analyze the video captured by the camera to determine whether a worker wears a helmet. We have implemented this function in [16]. However, we have encountered many problems in practical applications. Since there are hundreds of cameras on the construction site, if the original centralized computing method is adopted, all the collected video data is transmitted to the cloud server for processing, which brings the following problems. First, in order to ensure the detection accuracy of the neural network model, high-definition video needs to be transmitted, which takes up a lot of network resources. Second, transmitting so much video can cause a huge delay for the entire system. Third, even the cloud server cannot process the video from hundreds of cameras at the same time, which causes the system to crash. Finally, transmitting the original video directly to the cloud server leaks users' privacy. These problems motivate us to study the new system. By adopting a combination of edge computing and cloud computing, the above problems are solved well. The basic framework of the INES system is shown in Fig. 2 and the functions of each part of the system are shown in Fig. 3.

A. System Framework
In this section, the framework of the system is introduced. The basic framework of the INES is shown in Fig. 2. It consists of five components: camera, edge node, router, core network, and cloud computing center. According to their functions, these components can be divided into three parts, as shown in Fig. 3. First, the camera collects the video data of the construction site and transmits it to the edge node. At the edge node, a lightweight neural network is used to pre-detect the captured video. According to the detection results, only video clips containing workers without helmets are transmitted to the cloud computing center. This not only saves network resources, but also alleviates the computing burden of the cloud computing center. In order to further reduce the occupation of network resources, the video at the edge node is encoded. When transmitting video clips, the user datagram protocol (UDP) is selected. On the cloud computing server side, the received video is decoded and then re-examined by the large neural network to refine the detection result. Finally, the detection result is passed back to the edge node. In the remainder of this section, we describe in detail the construction of the neural networks and the choice of transport protocol.

B. Lightweight Neural Network at the Edge Node
This section introduces the lightweight neural networks used in the smart construction site system. It is difficult to make a fair comparison between two neural networks, because the detection quality of a neural network often depends on the actual requirement and is a trade-off between accuracy and detection speed. In the smart construction site system, the detection speed is paramount, so a one-stage detection neural network is selected. You only look once (YOLO) and single shot multibox detector (SSD) are the best-known one-stage detection networks, with fast detection speed and high detection accuracy. By simplifying their backbone network structures, this paper adopts Tiny-YOLO and MobileNetV2-SSD. They are introduced in detail as follows.
1) The Structure of MobileNetV2-SSD: In image processing, the standard convolutional layer plays a crucial role in extracting image features and is widely used in [26]-[28]. The structure of the standard convolutional layer is shown in Fig. 4. Suppose the input of the standard convolutional layer is a feature map of size $D_i \times D_i \times M$, and the output is a feature map of size $D_o \times D_o \times N$. The computational cost of the standard convolution is

\[ C_{std} = D_k \cdot D_k \cdot M \cdot N \cdot D_o \cdot D_o, \tag{1} \]

where $D_i$ and $D_o$ are the sizes of the input and output feature maps, respectively; $M$ and $N$ are the numbers of channels of the input and output feature maps, respectively; and $D_k$ is the size of the convolutional kernels.
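In order to reduce the computational cost and speed up the operation of the network, the depthwise separable convolution is proposed in [31]. The structure of the depthwise separable convolutional layer is shown in Fig. 5. It divides the standard convolution into two steps: depthwise convolution and pointwise convolution.
The structure of the depthwise convolution is shown in the left half of Fig. 5. The computational cost of the depthwise convolution is

\[ C_{dw} = D_k \cdot D_k \cdot M \cdot D_o \cdot D_o. \tag{2} \]

The structure of the pointwise convolution is shown in the right half of Fig. 5. The computational cost of the pointwise convolution is

\[ C_{pw} = M \cdot N \cdot D_o \cdot D_o. \tag{3} \]

The computational cost of the depthwise separable convolution is

\[ C_{dsc} = C_{dw} + C_{pw} = D_k \cdot D_k \cdot M \cdot D_o \cdot D_o + M \cdot N \cdot D_o \cdot D_o. \tag{4} \]

By comparing the computational cost of the standard convolution and the depthwise separable convolution, we can get the following ratio,

\[ \frac{D_k \cdot D_k \cdot M \cdot D_o \cdot D_o + M \cdot N \cdot D_o \cdot D_o}{D_k \cdot D_k \cdot M \cdot N \cdot D_o \cdot D_o} = \frac{1}{N} + \frac{1}{D_k^2}, \tag{5} \]

where we suppose that $D_i = D_o$ and that $N$ is much larger than $D_k^2$, so the ratio is approximately $1/D_k^2$. The size of the convolutional kernel is usually $3 \times 3$, so the computational cost of the depthwise separable convolution is about one-ninth of that of the standard convolution.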
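The cost comparison can be checked numerically. The following is a minimal sketch that counts multiply-accumulate operations using the standard expressions from [31]; the function names and the layer sizes in the example are hypothetical and chosen only for illustration.

```python
def standard_conv_cost(d_k, m, n, d_o):
    """Cost of a standard convolution: D_k * D_k * M * N * D_o * D_o."""
    return d_k * d_k * m * n * d_o * d_o


def depthwise_separable_cost(d_k, m, n, d_o):
    """Cost of a depthwise convolution followed by a pointwise convolution."""
    depthwise = d_k * d_k * m * d_o * d_o   # one D_k x D_k filter per input channel
    pointwise = m * n * d_o * d_o           # 1 x 1 convolution mixing M channels into N
    return depthwise + pointwise


def cost_ratio(d_k, m, n, d_o):
    """Ratio of depthwise separable cost to standard cost, equal to 1/N + 1/D_k^2."""
    return depthwise_separable_cost(d_k, m, n, d_o) / standard_conv_cost(d_k, m, n, d_o)


# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels, 38x38 output map.
ratio = cost_ratio(d_k=3, m=128, n=256, d_o=38)
print(f"measured ratio = {ratio:.4f}, 1/N + 1/D_k^2 = {1 / 256 + 1 / 9:.4f}")
```

For this example the ratio is slightly above one-ninth, because the $1/N$ term is small but not zero.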
On the basis of MobileNetV1 [31], as shown in Fig. 6(b), MobileNetV2 [33] further improves performance by introducing an inverted residual block structure and a linear activation function, as shown in Fig. 6(c) and Fig. 6(d). Because of the way it is computed, the depthwise convolution cannot change the number of channels. If the number of output channels in the previous layer is small, the depthwise convolution can only extract features in a low-dimensional space, which affects its performance. MobileNetV2 therefore adds a pointwise convolutional layer before the depthwise convolution in order to increase the dimension. In addition, MobileNetV2 changes the non-linear activation function of the second pointwise convolutional layer to a linear activation function, because a non-linear activation function effectively increases non-linearity in a high-dimensional space but can destroy features in a low-dimensional space, where it performs worse than a linear activation function. The structure comparison of the standard convolution, MobileNetV1, and MobileNetV2 is shown in Fig. 6.
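To make the expand, filter, and project sequence concrete, the following sketch traces the feature-map shape through one inverted residual block. It is an illustration only: the function names are hypothetical, "same" padding is assumed, and the shortcut rule (stride 1 with equal input and output channels) follows the MobileNetV2 design in [33] rather than anything stated here.

```python
def bottleneck_shapes(h, w, c_in, c_out, t=6, s=1):
    """Return (height, width, channels) after each stage of one bottleneck block."""
    expanded = (h, w, t * c_in)              # 1x1 pointwise expansion, non-linear activation
    h_out, w_out = -(-h // s), -(-w // s)    # ceiling division, assumes "same" padding
    filtered = (h_out, w_out, t * c_in)      # 3x3 depthwise convolution with stride s
    projected = (h_out, w_out, c_out)        # 1x1 pointwise projection, linear activation
    return [expanded, filtered, projected]


def uses_residual(c_in, c_out, s):
    """The shortcut is only possible when input and output tensors have the same shape."""
    return s == 1 and c_in == c_out


# Hypothetical block: 38x38x32 input, 64 output channels, expansion 6, stride 2.
for stage, shape in zip(("expand", "depthwise", "project"),
                        bottleneck_shapes(38, 38, 32, 64, t=6, s=2)):
    print(stage, shape)
```

The depthwise stage operates on the widened tensor, which is the point of the expansion: the filter sees six times more channels than the block receives or emits.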
The network structure of MobileNetV2-SSD is shown in Fig. 7. Unlike SSD, MobileNetV2-SSD replaces the VGG16 backbone network with MobileNetV2. Compared with other one-stage neural networks, the main strength of SSD is that it uses shallow layers to detect small targets and deep layers to detect large targets. Shallow neurons carry more detailed information, which is more effective for small targets. Deep neurons have larger receptive fields and more abstract semantic information, which is more effective for large targets. SSD therefore uses both low-level and high-level feature maps for detection. As shown in Fig. 7, MobileNetV2-SSD detects on feature maps of six different scales, thereby improving the detection accuracy. In order to avoid using features that are too low-level, six convolutional layers are added behind MobileNetV2. In addition, MobileNetV2-SSD uses the concept of the anchor box from the region proposal network (RPN) [42], which greatly reduces the amount of calculation. Fig. 7 shows the structure of MobileNetV2-SSD, which consists of an input layer, 7 convolutional layers, and 16 bottleneck layers. The structure of each bottleneck layer is shown in Table I, where $h$ and $w$ are the height and width of the input feature map; $c$ is the number of channels of the input feature map; $c'$ is the number of channels of the output feature map; $t$ is the expansion factor of the inverted residual (6 is selected here); and $s$ is the stride of the convolution. The activation function of the first pointwise convolution and the depthwise convolution is ReLU6, as shown in Eq. (6).

\[ f_{ReLU6}(x) = \min(\max(0, x), 6). \tag{6} \]

The second pointwise convolution uses a linear activation function. Suppose that the dataset of MobileNetV2-SSD is represented as

\[ T = \{ (g_i, class_i) \}_{i=1}^{N}, \]

where $g_i$ is the position parameter of the ground truth, $class_i$ is the class of the ground truth, and $N$ is the number of training samples for each picture. The loss function of MobileNetV2-SSD is given as follows

\[ L(T; \Theta) = L_{conf}(T; \Theta) + \lambda L_{loc}(T; \Theta), \tag{7} \]

where $L_{loc}(T; \Theta)$ is the localization loss of the ground truth as shown in Eq. (8); $L_{conf}(T; \Theta)$ is the confidence loss of the ground truth as shown in Eq. (10); $\Theta$ is the trainable neural network parameter; and $\lambda$ is used to adjust the relative importance of the localization loss and the confidence loss.

\[ L_{loc}(T; \Theta) = \sum_{i=1}^{N} f_{smoothL1}(g_i - \hat{g}_i), \tag{8} \]

where $g_i$ is the position parameter of the ground truth and $\hat{g}_i$ is the prediction of the position parameter. $f_{smoothL1}$ is used to calculate the loss between the ground truth and the prediction as

\[ f_{smoothL1}(x) = \begin{cases} 0.5 x^2, & |x| < 1, \\ |x| - 0.5, & \text{otherwise}, \end{cases} \tag{9} \]

\[ L_{conf}(T; \Theta) = -\sum_{i=1}^{N} \log f_{softmax}\left(c_i^{class_i}\right), \tag{10} \]

where $c_i^j$ is the class confidence that the $i$-th ground truth belongs to the $j$-th class; $S$ is the number of classes; and $c^0$ represents the background. The softmax function $f_{softmax}$ is used to calculate the class confidence, as shown in Eq. (11).

\[ f_{softmax}(c_i^j) = \frac{\exp(c_i^j)}{\sum_{s=0}^{S} \exp(c_i^s)}. \tag{11} \]
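The building blocks of this loss are short enough to write out directly. The sketch below is a plain-Python illustration, not the training code of the system: the function names and toy inputs are hypothetical, and the choice to weight the localization term by `lam` follows the usual SSD convention and is an assumption.

```python
import math


def relu6(x):
    """ReLU6 activation: clip the input to the range [0, 6]."""
    return min(max(0.0, x), 6.0)


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ssd_loss(pred_boxes, gt_boxes, class_scores, gt_classes, lam=1.0):
    """Confidence loss plus lam times localization loss (weight placement is an assumption)."""
    loc = sum(smooth_l1(g - p)
              for gt, pred in zip(gt_boxes, pred_boxes)
              for g, p in zip(gt, pred))
    conf = -sum(math.log(softmax(scores)[k])
                for scores, k in zip(class_scores, gt_classes))
    return conf + lam * loc


# Toy example: one box with a small offset, two classes (index 0 is the background).
loss = ssd_loss(pred_boxes=[[0.5, 0.5, 0.2, 0.3]], gt_boxes=[[0.5, 0.5, 0.2, 0.2]],
                class_scores=[[0.1, 2.0]], gt_classes=[1])
print(f"loss = {loss:.4f}")
```

Smooth L1 is less sensitive to badly mislocalized boxes than a squared error, since its gradient stops growing once the error exceeds 1.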
2) The Structure of Tiny-YOLO: At the edge nodes, we also tried another lightweight neural network called Tiny-YOLO, whose network structure is shown in Fig. 8. There are 24 layers in Tiny-YOLO. Compared with the 107 layers in YOLOV3, the number of layers is greatly reduced. Its network structure is simple, with a small amount of calculation, which makes it very suitable for running on the edge nodes. Tiny-YOLO detects on two feature maps of different scales, which improves the detection accuracy of the network. It is worth noting that in the detection phase, it uses a convolutional layer to replace the fully connected layer, which greatly reduces the amount of calculation.

Algorithm 1 Algorithm Description of the MobileNetV2-SSD
Input: Annotated datasets for different construction sites;
Output: The MobileNetV2-SSD for smart construction site system;
1: Data cleaning and data augmentation for datasets;
2: Randomly assign training dataset and validation dataset according to 7:3;
3: Build the bottleneck layer according to Table I for feature extraction;
4: Build the MobileNetV2-SSD structure according to Fig. 7 for multi-scale feature fusion;
5: Initialize the neural network and train the network weights;
6: Set a proper λ, calculate the loss according to Eq. (7) and minimize it;
7: Save weights and verify;
8: return MobileNetV2-SSD.
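Step 2 of Algorithm 1, the random 7:3 assignment of training and validation data, can be sketched as follows. The function name, the fixed seed, and the file names are hypothetical; the seed is only there so that the split is reproducible.

```python
import random


def split_dataset(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and split it into training and validation parts."""
    rng = random.Random(seed)          # local generator, does not disturb global state
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical annotated image names.
names = [f"site_{i:04d}.jpg" for i in range(10)]
train, val = split_dataset(names)
print(len(train), "training samples,", len(val), "validation samples")
```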
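Due to different detection principles, the loss functions of Tiny-YOLO and MobileNetV2-SSD are different. The loss function of Tiny-YOLO is given as follows.

\[ L(T; \Theta) = L_{loc}(T; \Theta) + L_{obj}(T; \Theta) + L_{conf}(T; \Theta), \tag{12} \]

where $L_{loc}(T; \Theta)$ is the localization loss of the ground truth as shown in Eq. (14); $L_{obj}(T; \Theta)$ is the object confidence loss of the ground truth as shown in Eq. (15); and $L_{conf}(T; \Theta)$ is the class confidence loss of the ground truth as shown in Eq. (16). The sum of squared errors function $f_{SSE}$ is used to calculate the loss between the ground truth and the prediction, as shown in Eq. (13).

\[ f_{SSE}(x) = \| x \|_2^2, \tag{13} \]

\[ L_{loc}(T; \Theta) = \sum_{i=1}^{N} f_{SSE}(g_i - \hat{g}_i), \tag{14} \]

\[ L_{obj}(T; \Theta) = \sum_{i=1}^{N} f_{SSE}(C_i - \hat{C}_i), \tag{15} \]

\[ L_{conf}(T; \Theta) = \sum_{i=1}^{N} \sum_{j=0}^{S} f_{SSE}(c_i^j - \hat{c}_i^j), \tag{16} \]

where $g_i$ is the position parameter of the ground truth; $C_i$ is the object confidence; $c_i^j$ is the class confidence; $\hat{g}_i$, $\hat{C}_i$, and $\hat{c}_i^j$ are their predictions, respectively; $S$ is the number of classes; and $c^0$ represents the background.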
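As an illustration of a loss built from squared errors, the sketch below sums the localization, object confidence, and class confidence terms for a list of targets. It assumes an unweighted sum of the three terms; the dictionary keys and toy values are hypothetical.

```python
def sse(truth, pred):
    """Sum of squared errors between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred))


def tiny_yolo_loss(targets, predictions):
    """Localization + object confidence + class confidence, all as squared errors.

    Each item is a dict with 'box' (position parameters), 'obj' (object confidence)
    and 'cls' (per-class confidence, index 0 being the background).
    """
    loc = sum(sse(t["box"], p["box"]) for t, p in zip(targets, predictions))
    obj = sum((t["obj"] - p["obj"]) ** 2 for t, p in zip(targets, predictions))
    cls = sum(sse(t["cls"], p["cls"]) for t, p in zip(targets, predictions))
    return loc + obj + cls


# Toy example with one target.
target = [{"box": [0.5, 0.5, 0.2, 0.2], "obj": 1.0, "cls": [0.0, 1.0]}]
guess = [{"box": [0.5, 0.5, 0.2, 0.4], "obj": 0.5, "cls": [0.0, 1.0]}]
print(f"loss = {tiny_yolo_loss(target, guess):.2f}")
```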
After the lightweight neural network completes the pre-detection, the edge node determines which pieces of video need to be transmitted. Specifically, the lightweight neural network only needs to detect three frames per second. If a worker without a helmet is detected in all three frames, the video for this period of time is transmitted. In order to further reduce the network traffic, the video is also compressed and encoded before transmission.
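The transmission rule can be sketched as a small filter over per-second detection results. The function names and the boolean flag representation are hypothetical; each flag stands for the detector's output on one sampled frame.

```python
def should_transmit(frame_flags, frames_per_second=3):
    """Transmit a one-second segment only if every sampled frame shows a missing helmet.

    frame_flags holds one boolean per sampled frame; True means a worker
    without a helmet was detected in that frame.
    """
    return len(frame_flags) == frames_per_second and all(frame_flags)


def select_segments(per_second_flags):
    """Return the indices of the one-second segments that should be uploaded."""
    return [i for i, flags in enumerate(per_second_flags) if should_transmit(flags)]


# Three seconds of hypothetical detector output, three sampled frames each.
flags = [[True, True, True], [True, False, True], [False, False, False]]
print(select_segments(flags))
```

Requiring agreement across all three frames suppresses isolated false alarms at the edge, at the price of missing violations that are visible in only one or two sampled frames.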

C. Transmission Protocol Selection
In this section, the choice of transmission protocol is introduced. To transfer information between different networks, a transport layer protocol must be followed. The two main transport layer protocols are the transmission control protocol (TCP) and the user datagram protocol (UDP). Their respective flowcharts are shown in Fig. 9.
Fig. 10. The structure of YOLOV3. The feature extraction network of YOLOV3 is DarkNet53, which is composed of residual modules. It can alleviate vanishing gradients and exploding gradients. YOLOV3 detects images at three different feature scales. In the detection phase, YOLOV3 uses convolutional layers instead of the fully connected layer to reduce the number of parameters.

TABLE II: THE EDGE NODE AND SERVER SIDE DEVICE PARAMETERS

Fig. 9(a) shows the TCP flowchart. TCP provides connection-oriented services and is a reliable transport layer communication protocol. To ensure reliability, TCP needs a three-way handshake to establish a connection and a four-way handshake to terminate it. In addition, it also provides congestion control, timeout retransmission, and flow control. Fig. 9(b) shows the UDP flowchart. UDP provides datagram-oriented services and is an unreliable transport layer communication protocol. It therefore has a simple structure and does not perform any splitting or splicing operations on data messages. It always sends data at a constant speed, which is very conducive to video transmission. Based on the above analysis, UDP is selected for the smart construction site system.
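Because UDP does not split or reorder messages for the application, the sender has to cut an encoded clip into datagram-sized pieces itself. The following is a minimal sketch using Python's standard `socket` and `struct` modules; the 6-byte header layout, the 1400-byte payload limit, and the function names are assumptions made for illustration and are not the packet format of the system described here.

```python
import socket
import struct

HEADER = struct.Struct("!IH")   # assumed header: sequence number (uint32), payload length (uint16)
MAX_PAYLOAD = 1400              # assumed limit, chosen to stay below a typical 1500-byte MTU


def packetize(data, max_payload=MAX_PAYLOAD):
    """Cut an encoded clip into numbered datagram payloads."""
    packets = []
    for seq, start in enumerate(range(0, len(data), max_payload)):
        chunk = data[start:start + max_payload]
        packets.append(HEADER.pack(seq, len(chunk)) + chunk)
    return packets


def reassemble(packets):
    """Rebuild the clip from received datagrams, whatever order they arrived in.

    Lost datagrams are not recovered; their bytes are simply missing from the result.
    """
    chunks = {}
    for packet in packets:
        seq, length = HEADER.unpack_from(packet)
        chunks[seq] = packet[HEADER.size:HEADER.size + length]
    return b"".join(chunks[seq] for seq in sorted(chunks))


def send_clip(data, address):
    """Send every datagram of a clip to (host, port) without waiting for acknowledgements."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for packet in packetize(data):
            sock.sendto(packet, address)
```

A call such as `send_clip(encoded_bytes, ("192.0.2.10", 5000))` would push a clip to a receiver at that placeholder address; the sequence numbers let the receiver restore the order and notice gaps, which UDP itself does not report.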

D. Neural Network at the Server Side
On the server side, when the video transmitted from the edge node is received, it is first decompressed and decoded. According to the requirements of the smart construction site system, a neural network with high detection accuracy needs to be selected. We chose the YOLOV3 neural network, whose network structure is shown in Fig. 10. The principle and network structure of YOLOV3 have been analyzed in detail in [16]. Its detection accuracy and detection speed meet the system requirements.

III. EXPERIMENTAL RESULTS
In this section, the experimental environment, experimental process, and experimental results are described. At the edge node, we chose the NVIDIA Jetson TX2 as the edge device. At the server side, our calculations are accelerated by an NVIDIA GTX 1080Ti graphics card. Their specific parameters are shown in Table II. Both the server and the edge device run Ubuntu 16.04, and the programming language is Python.

A. Dataset Preparation
Due to the requirements of the smart construction site system, this experiment requires datasets of pedestrians and helmets, but existing public datasets do not include helmets. Therefore, we build the dataset by manual labeling, collecting site video through the cameras and picking typical scenes. After data cleaning, several data augmentation methods are adopted, such as flipping, random cropping, color jittering, shift transformation, scale transformation, contrast transformation, Gaussian noise addition, and reflection transformation. Through the above processing, a large dataset on the order of 100,000 samples is built.
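Two of the listed augmentations are sketched below on a grayscale image stored as a list of pixel rows. This is an illustration under stated assumptions rather than the pipeline used for the dataset: the function names, the box format `(x_min, y_min, x_max, y_max)` in pixels, and the noise level are hypothetical. The flip also updates the bounding boxes, since a detection label must move with the image.

```python
import random


def flip_horizontal(image, boxes):
    """Mirror the image left to right and move the bounding boxes accordingly."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x_max, y_min, width - x_min, y_max)
                 for x_min, y_min, x_max, y_max in boxes]
    return flipped, new_boxes


def add_gaussian_noise(image, sigma=5.0, seed=0):
    """Add zero-mean Gaussian noise to every pixel and clip to the 8-bit range."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
            for row in image]


# Tiny 2x3 image with one box covering the left column.
image = [[10, 20, 30], [40, 50, 60]]
flipped, boxes = flip_horizontal(image, [(0, 0, 1, 2)])
print(flipped, boxes)
```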

B. Experimental Results
The model loss during the training of MobileNetV2-SSD and Tiny-YOLO is shown in Fig. 11. Because their loss functions are defined differently, their accuracy cannot be compared directly through the loss values. However, the loss curves show that MobileNetV2-SSD converges faster than Tiny-YOLO: MobileNetV2-SSD converges after about 600 epochs, while Tiny-YOLO requires 1,000 epochs. The detection speed and detection accuracy of YOLOV3, MobileNetV2-SSD, and Tiny-YOLO are shown in Table III. The actual detection results of the smart construction site system are shown in Fig. 12.

IV. CONCLUSION
In this paper, we have proposed a lightweight deep learning-based INES method for IIoT applications, which combines edge computing and cloud computing. This system has the advantages of low system delay and low network resource occupancy. In addition, its operating cost is about one-tenth of that of the centralized method. We use the smart construction site system as an example to illustrate how to implement the system. However, the experimental results show that there is still room for improvement in detection accuracy. In the future, we will incorporate federated learning [43] to address the weak robustness of the system and improve detection accuracy.