SecureCam: Selective Detection and Encryption Enabled Application for Dynamic Camera Surveillance Videos

Using dynamic surveillance cameras for security has significantly increased privacy concerns for the individuals captured. Malicious users may misuse these videos by performing Replay and/or Man-in-the-Middle attacks during storage or recording over the network. Considering these risks, this paper proposes an effective security application, SecureCam, based on selective detection (of focused moving objects) and protection using encryption. For object detection, this paper implements a novel, computationally light unsupervised learning algorithm, Motion-Fusion (MF), for more precise motion detection in mobile camera videos. Selective encryption (SE) is then applied to the detected video parts with the lightweight Chacha20 cipher. The proposed SecureCam is extensively evaluated in terms of performance analysis, security analysis and computational complexity. For object detection, the comparative evaluation shows that the MF algorithm outperforms the traditional state-of-the-art dense optical flow (DOF) algorithm, with an average (mean) increase of 54% in accuracy and 42% in precision, making it computationally effective for such videos. The visual results with a 21% encryption space ratio (ESR) indicate that the videos are sufficiently protected against identification. The overall comparative evaluation with existing approaches also affirms the significance and utility of the proposed SecureCam for the Internet of Multimedia Things (IoMT) environment.

identity of captured people, and their associated objects as a part of sensitive information. Dynamic cameras, such as pan-tilt-zoom (PTZ) cameras, dashboard cams, and unmanned aerial vehicles (UAVs) like drones, rotate and/or move as they capture videos, so the background also changes with their movement. In the 20th century, drones were used only for military operations. However, they have been commercially available since the 21st century. Consumer drones [1] are capable of taking photos, recording videos, and delivering packages at previously unreachable heights and distances. They are offered in a variety of forms and with accessories for customizing personal drone experiences in smart cities [2]. Furthermore, with the advent of the Internet of Things (IoT) infrastructure [3], these drones are also utilised for entertainment purposes [4], and for monitoring smart agricultural operations [5] and smart construction sites [6]. Hence, because sensitive visual information is embedded in and exchanged by these devices, they must be protected from intruders.
Additionally, because of the limited internal storage of these devices, recorded videos are broadcast from one device to another over the network. Doing so without securing the video content exposes the videos to several attacks. Firstly, an attacker can launch a Man-in-the-Middle attack to monitor identifiable persons and/or objects in the video. Secondly, an attacker can add a valid replayed scene, one which has possibly occurred years ago, to masquerade as a current scene, which is referred to as a Replay attack [7]. Thirdly, an attacker can hack or hijack these devices in some way and tamper with the videos on them [8]. In each of these cases, the attacker could demand a ransom from, or perpetrate a fraud (scam) against, the individuals recognized in the video [9], clearly violating the European General Data Protection Regulation (EU-GDPR) [10].
Object identification techniques are widely implemented in computer vision, for both static and dynamic camera devices, to separate the regions-of-interest (ROIs). The ROIs (in this study) are the foreground (FG), consisting of the non-stationary objects, and the background (BG), containing the immovable objects in the video. Static camera devices are always fixed to a position, making it easy to detect the FG information in the form of motion by utilizing the background subtraction technique [11], [12], [13]. However, dynamic camera devices result in a more complex situation, as the BG changes along with the FG information. There are various methods that can identify the objects in these videos [14], but optical flow (OF) methods are broadly used in existing research. OF is the apparent estimated motion of objects between two successive video frames, produced by the camera and/or object movement. OF is divided into two types: sparse OF (SOF), which enables motion detection based on some features of the object, and DOF, which performs motion detection based on all the features of an object. Pixel segmentation using a global threshold [15], [16] is generally employed to separate the FG pixels from the BG pixels.
To comply with the EU-GDPR, data-protection-by-design solutions for protecting visual information are advisable [17]. Encryption is the only reversible data protection safeguard suggested by the regulation. Data encryption can be applied to complete video frames in the form of naive encryption (NE), that is, encryption of the full video payload and header information, or to specific parts/regions-of-interest (ROIs) within frames using selective encryption (SE). NE offers strong protection, but it is computationally slow and suffers from high memory consumption. Hence, NE is not advisable for real-time applications [18], [19], especially when dynamic, battery-based devices are likely to face high energy consumption costs during memory access. In contrast, SE has minimal memory consumption and computational cost, which makes it suitable for efficient encryption in real-time applications [20], [21]. The research in [22] also revealed that SE provides satisfactory security against attacks.
Chacha20 [23] is a stream cipher that utilizes a symmetric (single) key of length 256 bits and a nonce of either 64 bits or 96 bits. The implementation of Chacha20 is hardware-independent and it is computationally fast for real-time encryption on dynamic cameras with 50-volt batteries [24], which are considered extra-low voltage devices [25]. Due to its lightweight computational nature, Chacha20 was initially developed for securing Internet of Things (IoT) devices.
Keeping in view the computational and security challenges of surveillance videos captured with dynamic camera devices, the following research questions (RQs) were identified to direct the research of this paper:
RQ1: Does video segmentation into FG and BG information help to prevent Replay attacks?
RQ2: What would be the impact of a joint detection and protection scheme on the avoidance of Replay, tampering and Man-in-the-Middle attacks on surveillance videos during their transmission and storage?
To answer the aforementioned RQs, this paper proposes a Data-Protection-by-Design solution for videos captured with dynamic cameras. (A Protection-by-Design solution is one in which protection is first introduced at the design stage of a proposal.) The research contributions are as follows:
• Prevention of Replay and Man-in-the-Middle attacks on dynamic camera videos by segmenting the videos into parts.
• By implementing the newly proposed Motion-Fusion (MF) algorithm, objects are accurately detected and segmented into FG and BG, for maximum security against their content identification.
• Content protection is achieved by applying the Chacha20 encryption algorithm (initially proposed for IoT devices) to the segmented parts of the videos.
• Evaluation of the SecureCam application demonstrated highly accurate selective detection and protection of the tested videos. Thus, this application fits well as a part of the IoMT environment.
The remainder of this paper is structured as follows: Section II describes related studies on object identification techniques, pixel segmentation with thresholding and encryption algorithms. Section III describes the methodological workflow of the proposed SecureCam application related to selective detection and encryption. Section IV presents the evaluation of SecureCam on publicly available dynamic camera datasets. This section also includes the performance analysis, security analysis and comparative analysis. Finally, Section V concludes this research study.

II. RELATED STUDIES
This section reviews the recent studies in the field of object detection, pixel segmentation and encryption for surveillance videos.

A. Object Identification in Cameras
Cameras are classified into two main groups: (1) static cameras; and (2) dynamic cameras. Static cameras are fixed to a position within a public area, such as a bank, supermarket, or bus station, recording activities in that location. Closed Circuit Television (CCTV) is a common example of a static surveillance camera. By considering motion as a foreground (FG) object, the detection of the FG, consisting of moving objects, by these cameras is easily achieved by means of frame differencing [26], [27], [28] and background subtraction [11], [12], [13] methods. The background subtraction technique can be implemented with unsupervised learning [29], [30], semi-supervised learning [31], [32], or supervised learning [33], [34].
Dynamic/mobile cameras are devices that are in motion when recording. Because their positions are not fixed, the background (BG) tends to change as the camera moves, giving rise to a complex situation. Examples of these cameras are car dashboard-cams, PTZ cameras and UAV security cameras. Object identification by these cameras can be achieved by supervised learning with deep neural networks [35], [36], YOLO [37], or convolutional neural networks (CNNs) [38], but this requires high computation for training these models. Unsupervised learning algorithms, in contrast, do not require pre-training of detection models.
The study [39] implemented a background subtraction for freely moving cameras but the quality of the result was affected by the accuracy of the detection making background subtraction unsuitable for object detection in dynamic camera videos. Thus, different techniques such as a motion compensation method [40], [41], trajectory classification [42], [43], [44], and optical flow (OF) [45] are often prescribed to separate the FG objects from the moving BG.
These techniques cannot be implemented in isolation from other algorithms. Taking Dense Optical Flow (DOF) as an example, the study [46] introduced an OS-Flow method that combines DOF and SOF. The research [47] implemented a computation method for DOF and texture features. Also, the authors of [48] proposed a DOF-based background subtraction technique using a homography matrix, with single Gaussian and DOF. Likewise the authors of [49] described a movement detection method applying DOF and a fundamental matrix.
Also, the studies [42], [50] discussed different unsupervised methods to detect moving objects with a moving camera, such as panoramic background subtraction, motion compensation, motion segmentation, subspace segmentation, sparse matrix decomposition, and trajectory classification. Of these, this study implemented motion segmentation.

B. Pixel Segmentation With Thresholding
Pixel segmentation, which is also referred to as intensity-based segmentation [51], is the grouping of the pixels of a grey-scaled image into two classes, light or dark [52]. This is achieved by applying a threshold to these images. The threshold may be applied either locally or globally [22]. In local thresholding, each individual object has its own threshold value, resulting in multiple thresholds, as described in [53], and in a high computational load. Alternatively, setting a single global threshold value and checking whether each pixel is above or below that value results in a diminished computational burden.
The current study applies global thresholding, which differentiates the foreground from the background after object detection [54].

C. Privacy Protection With Encryption Algorithms
As already discussed, encryption can be performed as Naive (or Full) Encryption (NE) or Selective Encryption (SE). In NE, the whole video content is encrypted, normally including the video header [18], [55]. This converts a 2D image to 1D before encryption [55]. This type of encryption offers good protection against attack but requires considerable computational time and memory consumption [55]. Alternatively, SE can be performed before, after, or simultaneously with video compression [55]. Frequently, the crucial areas in a video are of small pixel extent compared to the entire video content. Thus, SE uses little computational time and memory space and, for many practical purposes, provides sufficient protection against attacks [19].
Asymmetric encryption uses two different keys (one public and one private key) for encryption and decryption, while symmetric encryption uses a single key for both the encryption and decryption processes [18], [21]. Symmetric encryption can further be divided into block ciphers and stream ciphers. Chacha20, a stream cipher and a variant of the Salsa family [56], was created by Daniel Bernstein [23]. Some have claimed it to be among the lightest and fastest encryption algorithms [57]. As the name implies, Chacha20 performs 20 rounds of encryption, which are equivalent to 80 quarter-rounds of XORing, additions and rotations [23]. The study [56] performed a cryptanalysis of Chacha20, arriving at the conclusion that correlation attacks do not pose a threat to this algorithm because it does not use a look-up table. Thus, it is not vulnerable to cache-timing attacks. Currently, Chacha20 has been implemented for text (messages) [23] and IoT devices, but not for the protection of multimedia, i.e., audio and video.
All the existing research studies reviewed herein show that DOF alone is not sufficient to accurately detect FG and BG information in dynamic camera videos. Therefore, DOF is implemented in combination with other algorithms. However, the combined implementation of algorithms for object detection increases the computational cost, making them unsuitable for small battery-operated camera electronics, as discussed in [58].
It is also worth noting that the existing literature is not focused on privacy protection of the detected objects in dynamic camera videos, which constitutes a research gap. In contrast, this paper proposes the SecureCam application, which jointly detects and encrypts the FG and BG video parts of dynamic camera videos using the MF algorithm and the Chacha20 algorithm. The reason for implementing Chacha20 is its low computational cost and hardware-independent nature, making it suitable for low-voltage IoT devices. To date, Chacha20 is known to be used by Google for Transport Layer Security (TLS) [59], and to the best of the authors' knowledge, Chacha20 has not been utilized for the privacy protection of mobile camera surveillance videos to facilitate the IoMT infrastructure.

III. THE PROPOSED METHOD
The process flow of the proposed SecureCam application is shown in "Fig. 1" and has five stages. In the first stage, the video frames are loaded from the recording output of a dynamic camera device. In the second stage, the MF algorithm (explained in Section III-A) is applied to consecutive frames to detect the objects in motion (FG). In the third stage, pixel segmentation is applied to the retrieved output, using global thresholding to separate the FG pixels from the BG pixels. In the fourth stage, the FG and BG pixels are encrypted separately with the Chacha20 algorithm, using a randomly generated key and nonce value. The key and nonce are securely stored in a hardware wallet to protect them from compromise. Finally, the encrypted FG and BG pixels (ROIs) are also stored separately. Storing and transmitting these ROIs separately helps to prevent Replay attacks, because the video parts are not kept together as one file that an attacker can easily identify. By accessing the stored FG file, the attacker cannot identify how the BG looks, and vice versa. "Table I" describes the acronyms frequently used in this paper.

A. Motion Detection With MF
"Fig. 1" (column 2) represents the motion detection stage of the SecureCam application. The MF algorithm was implemented for precise detection of ROIs in the video in order to preserve the device's battery when applying the proposed SecureCam to real-time videos from dynamic camera devices. The MF algorithm first selects the first two consecutive frames, $F_1$ and $F_2$, from the video. The pixel intensity in a frame is assumed constant, as given in "(1)", where I = intensity, a = horizontal axis, b = vertical axis and t = time. Secondly, motion estimation is performed on $F_1$ and $F_2$ by extracting the vector coordinates of the motion between the frames using the Farneback algorithm [45]. The change in the intensity of the pixels with time from the previous frame to the current frame is identified as the movement of the object, as analysed in "(2)".
A Taylor series approximation is applied to the right-hand side (RHS) of "(2)" and the result is divided by the time change, giving "(3)", where $u = \delta a / \delta t$ and $v = \delta b / \delta t$.
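The displayed forms of (1) to (3) are not reproduced in this text. As a hedged reconstruction, the standard brightness-constancy derivation that uses the same symbols (I, a, b, t, u, v) is sketched below; the exact notation of the original equations is an assumption.

```latex
\begin{align}
I &= I(a, b, t) \tag{1}\\
I(a, b, t) &= I(a + \delta a,\; b + \delta b,\; t + \delta t) \tag{2}\\
\frac{\partial I}{\partial a}\,u + \frac{\partial I}{\partial b}\,v + \frac{\partial I}{\partial t} &= 0 \tag{3}
\end{align}
```

Expanding the RHS of (2) to first order gives $I(a,b,t) + \frac{\partial I}{\partial a}\delta a + \frac{\partial I}{\partial b}\delta b + \frac{\partial I}{\partial t}\delta t$; cancelling $I(a,b,t)$ on both sides and dividing by $\delta t$ yields (3).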
Lastly, this motion is extracted and fused back into the frame to identify the actual moving object (FG) in the frame. The fusion was achieved using "(4)", where the α value varies, the β value is constant at 0.5 and the γ value is 0.
This process was repeated on the remaining frames in the video, resulting in "(5)".
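The fusion step can be illustrated with a minimal sketch. It assumes that "(4)" has the usual weighted-sum form, fused = α · frame + β · motion + γ, which is also what OpenCV's `cv2.addWeighted` computes. The function name `fuse`, the assignment of α to the frame and β to the motion map, and the tiny 2 × 2 inputs are illustrative assumptions, not the authors' implementation.

```python
def fuse(frame, motion, alpha, beta=0.5, gamma=0.0):
    """Blend a motion map into a grey-scale frame (assumed form of the fusion step).

    Each output pixel is alpha*frame + beta*motion + gamma, rounded and
    clipped to the 8-bit range [0, 255].
    """
    fused = []
    for frame_row, motion_row in zip(frame, motion):
        fused.append([
            min(255, max(0, round(alpha * f + beta * m + gamma)))
            for f, m in zip(frame_row, motion_row)
        ])
    return fused


# Hypothetical 2x2 frame and motion-magnitude map: only the top-right pixel moves.
frame = [[100, 100], [100, 100]]
motion = [[0, 200], [0, 0]]
fused = fuse(frame, motion, alpha=0.5)  # beta = 0.5 and gamma = 0 as stated in the text
```

In this toy input the moving pixel ends up brighter than its static neighbours, which is what lets a later threshold separate it from the background.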

B. Pixel Segmentation With Global Thresholding
Pixel segmentation is stage 3 of the SecureCam process flow ("Fig. 1", column 3). Pixel segmentation was performed by setting a global threshold on the grey-scaled result of the MF algorithm. This global threshold separates the detected moving object pixels from the static object pixels in the frame.
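As a minimal sketch of this stage, the snippet below applies one global threshold to a grey-scale frame and splits it into FG and BG parts. The threshold value 127 and the helper names are assumptions chosen for illustration; the paper does not state the value used.

```python
def global_threshold(gray, threshold=127):
    """Return a binary mask: 1 where the pixel exceeds the global threshold (FG), else 0 (BG)."""
    return [[1 if p > threshold else 0 for p in row] for row in gray]


def split_fg_bg(gray, mask):
    """Split a frame into an FG-only and a BG-only frame, zero-filling the other region."""
    fg = [[p if m else 0 for p, m in zip(row, mrow)] for row, mrow in zip(gray, mask)]
    bg = [[0 if m else p for p, m in zip(row, mrow)] for row, mrow in zip(gray, mask)]
    return fg, bg


# Hypothetical grey-scale output of the motion detection stage.
gray = [[10, 200], [130, 90]]
mask = global_threshold(gray)
fg, bg = split_fg_bg(gray, mask)
```

Because every pixel is compared against the same value, the cost is one comparison per pixel, which is the reduced computational burden that motivates global over local thresholding.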

C. Encryption With Chacha20 Algorithm
At stage 4 ("Fig. 1", column 4), encryption is applied using the Chacha20 algorithm. The segmented FG and BG pixels are XORed separately with the keystream that Chacha20 derives from a randomly generated key and nonce, and the key and nonce are simultaneously secured in a hardware wallet. To restore the original frame, decryption is performed by XORing the cipher-pixels with the keystream derived from the same securely stored key and nonce.
A complete Chacha20 computation consists of 10 column-rounds and 10 diagonal-rounds over a state that includes:
• Four 4-byte constants (giving a total of 16 bytes (128 bits)).
• A random 32-byte (256 bits in all) key.
• A 4-byte (32 bits in all) block counter.
• A 12-byte (96 bits in all) or 8-byte (64 bits) nonce [60].
A column-round contains four quarter-rounds and a diagonal-round contains four quarter-rounds. A single quarter-round involves three arithmetical operations, namely addition, XOR and rotation [60]. The combination of these arithmetic operations over the complete set of rounds makes Chacha20 resistant to attack.
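The round structure described above can be sketched in pure Python. This is an illustrative implementation of the 96-bit-nonce variant, not the code used in SecureCam; a vetted library should be used in practice. The sample FG/BG byte strings and the use of a separate nonce for each part are assumptions made for the example.

```python
import os
import struct

MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """One quarter-round: additions, XORs and rotations on four state words (in place)."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def chacha20_block(key, counter, nonce):
    """Produce one 64-byte keystream block from a 32-byte key, 32-bit counter, 12-byte nonce."""
    state = (list(struct.unpack("<4I", b"expand 32-byte k"))  # four 4-byte constants
             + list(struct.unpack("<8I", key))                # 256-bit key
             + [counter & MASK32]                             # 32-bit block counter
             + list(struct.unpack("<3I", nonce)))             # 96-bit nonce
    working = state[:]
    for _ in range(10):  # 10 column-rounds + 10 diagonal-rounds = 20 rounds
        quarter_round(working, 0, 4, 8, 12)
        quarter_round(working, 1, 5, 9, 13)
        quarter_round(working, 2, 6, 10, 14)
        quarter_round(working, 3, 7, 11, 15)
        quarter_round(working, 0, 5, 10, 15)
        quarter_round(working, 1, 6, 11, 12)
        quarter_round(working, 2, 7, 8, 13)
        quarter_round(working, 3, 4, 9, 14)
    out = [(w + s) & MASK32 for w, s in zip(working, state)]
    return struct.pack("<16I", *out)


def chacha20_xor(key, nonce, data, counter=0):
    """Encrypt or decrypt by XORing data with the keystream (the same call does both)."""
    out = bytearray()
    for i in range(0, len(data), 64):
        block = chacha20_block(key, counter + i // 64, nonce)
        out.extend(x ^ k for x, k in zip(data[i:i + 64], block))
    return bytes(out)


# Hypothetical flattened FG and BG pixel bytes, encrypted separately.
key = os.urandom(32)
fg_nonce, bg_nonce = os.urandom(12), os.urandom(12)  # assumption: one nonce per part
fg_pixels, bg_pixels = bytes([0, 200, 130, 0]), bytes([10, 0, 0, 90])
fg_cipher = chacha20_xor(key, fg_nonce, fg_pixels)
bg_cipher = chacha20_xor(key, bg_nonce, bg_pixels)
fg_plain = chacha20_xor(key, fg_nonce, fg_cipher)
bg_plain = chacha20_xor(key, bg_nonce, bg_cipher)
```

Using a distinct nonce for the FG and BG parts avoids reusing the same keystream under one key, which would otherwise let an attacker XOR the two ciphertexts together.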

D. Storing of the Encrypted Video Segments (ROIs)
The encrypted FG pixels are stored separately from the encrypted BG pixels, as shown in "Fig. 1" (column 5). If an attacker accesses the encrypted FG pixels, it will be difficult to identify the BG pixels of the frame, because the BG of the captured videos is also encrypted. This makes it hard to perform a Replay attack on the video.

E. Computational Complexity of SecureCam
The pseudo-code of the proposed SecureCam application is presented in "Algorithm 1". The SecureCam algorithm has multiple stages. Stage 1 loads the video for processing and therefore has a complexity of O(n) for n video frames.

IV. EVALUATION
For the development of the proposed SecureCam application, both the MF and Chacha20 algorithms were implemented in the Python programming language, with the OpenCV vision library for MF and the Base64 and Crypto modules for Chacha20. The testbed setup specifications for the experiments are provided in "Table III". The experiments were executed on a dataset of five publicly available dynamic camera videos [61], [62], [63], each with differing characteristics in terms of colours, motion activity, and spatial information. The properties of these test videos are described in "Table IV", where frames per second (FPS) represents the frame rate of the videos.

A. Performance Analysis of SecureCam
As described earlier, the newly proposed MF algorithm is implemented in this paper for selective detection in mobile camera videos. In this section, we analyse the performance of MF and the state-of-the-art (SOTA) DOF algorithm using different parameters along with visual results.
1) Visual Analysis: The comparative visual results given in "Table V" show the increased accuracy in object detection with the MF algorithm in comparison with DOF. The object detection illustrations in column 2 of "Table V" enable a comparison between the MF and DOF algorithms. These visual results demonstrate clearer and more accurate FG object detection by MF in comparison with DOF, according to findings from the test videos.
After detecting FG objects with MF and DOF in the test videos, the next stage in our procedure is to apply privacy protection to the extracted FG and BG ROIs. The visual results of SE implemented with Chacha20 are shown in columns 3 and 4 of "Table V". The results, taken after encryption of the FG and BG segregated frames ("Table V", columns 3 and 4), show that the DOF method [45] is not suitable for SE on videos taken by dynamic cameras. However, the MF results demonstrate an effective and accurate implementation of SE. This signifies that MF can efficiently contribute to privacy protection of FG and BG ROIs by virtue of their accurate detection.
2) Accuracy and Precision: Results for accuracy and precision were analysed in " Table VI". The accuracy was calculated as the ratio, given as a percentage, of the correctly predicted objects versus the total objects in the video, while the precision was calculated as the ratio (given as a percentage) of the correctly detected objects to the total of detected objects in the video.
The accuracy and precision results presented in "Table VI" indicate that MF has a greater accuracy and precision ratio for each of the tested videos compared to DOF, thus verifying the better performance of MF in object detection.
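The two ratios defined in this subsection can be written as a short sketch. The function names and the object counts in the example are hypothetical and are not taken from "Table VI".

```python
def accuracy_pct(correct, total_objects):
    """Correctly predicted objects as a percentage of all objects in the video."""
    return 100.0 * correct / total_objects


def precision_pct(correct, total_detected):
    """Correctly detected objects as a percentage of all detections made."""
    return 100.0 * correct / total_detected


# Hypothetical counts: 10 objects in the video, 16 detections, 8 of them correct.
acc = accuracy_pct(correct=8, total_objects=10)
prec = precision_pct(correct=8, total_detected=16)
```

A detector that marks much of the background as moving inflates the number of detections, so its precision drops even if its accuracy stays moderate.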

3) Encryption Space Ratio (ESR):
The ESR indicates the amount of data encrypted, expressed as a percentage. ESR is directly proportional to the computational cost of encryption. Thus, a smaller ESR indicates a lower encryption cost and a higher efficiency of the encryption scheme over the streaming data. ESR is also directly proportional to the pixel space rate (PSR). The FG ESR of the implemented MF method was compared with that of the DOF method. "Fig. 2(a)" confirms that MF produced a lower PSR/ESR value than DOF. The lower value indicates that fewer objects are encrypted as part of the FG. The higher FG ESR after application of DOF shows that DOF wrongly detects some BG objects as being part of the FG. Thus, MF leads to more efficient encryption because it selectively encrypts the fewer objects that represent the FG.
The BG ESR for the MF and the DOF methods are compared in " Fig. 2(b)" with the MF method having a higher value than DOF. This indicates that more BG objects are encrypted in the videos, while the lower value for the DOF method demonstrates that DOF detects relatively limited BG information in the videos.
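As a hedged illustration of how FG and BG ESR values relate, the sketch below computes both from a binary FG mask, taking ESR as the percentage of pixels that are encrypted. The helper name and the 2 × 5 mask are assumptions for the example.

```python
def esr_from_mask(mask):
    """Return (FG ESR, BG ESR) in percent for a binary mask where 1 marks an FG pixel."""
    total = sum(len(row) for row in mask)
    fg_pixels = sum(sum(row) for row in mask)
    fg_esr = 100.0 * fg_pixels / total
    return fg_esr, 100.0 - fg_esr


# Hypothetical mask with 2 FG pixels out of 10.
mask_example = [[0, 0, 1, 1, 0],
                [0, 0, 0, 0, 0]]
fg_esr, bg_esr = esr_from_mask(mask_example)
```

Since every pixel belongs to exactly one of the two regions, the FG and BG ESR values of a frame sum to 100%, so a detector that over-estimates the FG necessarily under-estimates the BG.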
From the FG and BG ESR/PSR analysis, it can be deduced that the DOF method has a larger False Positive (FP) detection rate for FG objects and a greater True Negative (TN) detection rate for BG objects in the tested videos when compared with the results from using MF.
4) Video Quality Metrics: The Structural Similarity Index (SSIM) measures the similarity between the detected and encrypted test videos (output) and the original test videos (taken as input). The SSIM index ranges from 0 to 1: a value closer to 1 means that the selectively detected/encrypted frames closely resemble the original frames, while a value closer to 0 means the opposite. SSIM is calculated in "(6)" as:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{6}$$
where x = original tested videos, y = encrypted tested videos, $\mu_x$ = average of x, $\mu_y$ = average of y, $\sigma_x^2$ = variance of x, $\sigma_y^2$ = variance of y, $\sigma_{xy}$ = covariance of x and y, $c_1 = (K_1 L)^2$, $c_2 = (K_2 L)^2$, L = dynamic range, $K_1 = 0.01$ and $K_2 = 0.03$.
SSIM was calculated after applying the MF and DOF algorithms and after selective encryption of either the detected FG or BG ROIs using Chacha20 (see "Table V", columns 3 and 4). The results drawn from these operations are shown in "Table VII". For MF, the FG SSIM values range from 0.4684 to 0.7255, giving an average of 0.5553. In contrast, for DOF the FG values range from 0.0766 to 0.3391. These latter values indicate that virtually the whole of each frame is identified as FG by the DOF algorithm, and hence that virtually whole frames are encrypted. Consequently, the metric's values indicate limited resemblance between the original video frames and the video frames with selectively detected/encrypted FGs. In other words, the DOF algorithm has erroneously detected, and consequently encrypted, more objects as FG elements of frames, leading to an enlarged FG ROI.
On the other hand, selectively encrypting the BGs resulting from the MF algorithm gives SSIM values between 0.1361 and 0.3062, with an average of 0.2075 for the test videos. This should not be surprising, as the frames with the BGs encrypted can be viewed as the obverses of the frames with the FGs encrypted, as discussed in the previous paragraph. Thus, for MF, according to the SSIM values, the BGs are a much greater proportion of their frames and consequently the SSIM results are relatively low (indicating a minor resemblance between the frames with selectively encrypted BGs and the original frames).
Conversely, the BG SSIM values after application of DOF and SE are closer to 1 (ranging from 0.3609 to 0.9540 across the videos). Thus, the DOF algorithm identifies a smaller ROI as part of the static background, which again indicates that the DOF algorithm is more likely to identify static objects as part of a larger moving foreground, which visual inspection shows is the case. The average SSIM value for the encrypted FG and BG with MF is 0.381.
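The SSIM index used in this subsection can be sketched in a few lines. The version below computes a single global SSIM over two flattened grey-scale frames using the constants $K_1 = 0.01$ and $K_2 = 0.03$; practical SSIM implementations usually average the index over local windows, so this simplification, the function name and the sample pixel values are assumptions for illustration.

```python
def ssim_global(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """Global SSIM between two equally sized pixel sequences (no sliding window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((p - mu_x) * (p - mu_x) for p in x) / n
    var_y = sum((q - mu_y) * (q - mu_y) for q in y) / n
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Hypothetical flattened frames: an original and an intensity-inverted copy.
original = [0, 64, 128, 255]
inverted = [255 - p for p in original]
ssim_same = ssim_global(original, original)
ssim_inverted = ssim_global(original, inverted)
```

An unchanged frame scores 1, while a frame whose structure has been destroyed or inverted scores far lower, which is why heavily encrypted regions pull the SSIM of a frame towards 0.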

5) Computational Cost Analysis for Selective Detection:
The time taken for the detection of FG objects using either the MF or the DOF algorithm was also calculated and compared in this paper. It is visible from "Fig. 3" that MF took a longer time in comparison with DOF. For example, the difference between MF and DOF for object detection in the MOT16-12 video file was 19920103 μs (about 19.9 s), while for the Mountain Hiking video file it was 30262966 μs (about 30.3 s). On average (arithmetic mean), MF has a time difference of 23894911.4 μs (about 23.9 s) across the five tested videos during object detection. However, such a time difference is relatively minimal, especially when set against the accurate detection exhibited by MF.

B. Security Analysis of SecureCam
The security paradigm of SecureCam was considered against two attacks, i.e., a Replay attack and an extensive key guessing attack. The analysis of these attacks against SecureCam is discussed below.

1) Replay Attack:
Replay attacks occur when a valid data transmission is fraudulently or maliciously repeated or delayed. However, the SecureCam scheme renders Replay attacks infeasible.
To verify the strength of SecureCam against Replay attacks, the attack was simulated against the tested videos, and it can be deduced from "Fig. 4" that the output video frames after decryption appeared tampered. The attacker could not guess the accurate position at which to inject the attack, since the segmented FG and BG were stored separately in our scheme. The visual results of the Replay attack for two different frames from two test videos are shown in "Fig. 4". "Fig. 4 (a) and (b)" are the original video frames of the horse_moving video (frame number 115) and the MOT16-12 video (frame number 160), respectively. "Fig. 4 (a1) and (b1)" are the decrypted frames without the application of a Replay attack, while "Fig. 4 (a2) and (b2)" are the decrypted frames after launching a Replay attack against the frame. The differences between the original and the attacked video frames are obvious in "Fig. 4", demonstrating the effectiveness of the SecureCam segmentation scheme in preventing Replay attacks.
To further verify the tampering of frames under Replay attacks, we also performed pixel correlation testing. The pixels in a frame have two properties, position and colour. Verifying the originality of videos depends on the number of pixels within a frame/image relative to these two properties. For testing, we calculated the number of pixels within the original frames and within the decrypted frames with and without Replay attacks ("Fig. 4"). "Table VIII" verifies that the total pixel counts of the attacked frames after decryption (column 4) are increased. This also reduces the pixel correlation within the attacked frames in contrast to the original. Testing the pixel count/correlation within frames is an easy way to detect tampering within frames.
2) Extensive Key Guessing Attack: Extensive key guessing is an approach to finding the correct key by trying every possible key until the correct one is discovered.
The implementation of the SecureCam application with Chacha20 uses 256-bit keys, and it is not feasible to find a 256-bit key by extensive key guessing techniques. However, to quantify the security effectiveness of SecureCam, the number of attacks generated on data and keys can be related to the Poisson probability distribution, given by $P(\mu; n) = \frac{e^{-\mu}\mu^{n}}{n!}$, where e is a constant approximately equal to 2.71828, μ is the mean number of attacks expected in a fixed period of time and n is the number of attacks that actually occur within that period.
Based on a given number of events (attacks), P is the probability of that number of attacks occurring within a set time interval. According to Cisco security statistics [64] in 2020, there is an attack on a host machine every 5 minutes, which is 24 × 60 / 5 = 288, or approximately 300, attacks per day.
Despite the inherent vulnerabilities of the host machine, the encryption process of SecureCam is robust enough to meet the security requirements, since it uses not only 256-bit keys but also a 64-bit nonce for the random generation of a strong cipher output. In other words, an attack that previously succeeded would not work against the SecureCam scheme.
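The two quantities discussed in this subsection can be illustrated with a short sketch: the Poisson probability of observing n attacks when μ are expected, and the expected time for an exhaustive search of a 256-bit key space. The guessing rate of 10^12 keys per second is a hypothetical figure chosen for illustration, and the function names are not from the paper.

```python
import math


def poisson_pmf(mu, n):
    """P(mu; n) = exp(-mu) * mu**n / n!, evaluated in log space to avoid overflow."""
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))


def expected_bruteforce_years(key_bits=256, guesses_per_second=1e12):
    """Expected years to find a key when, on average, half the key space is searched."""
    expected_guesses = 2 ** (key_bits - 1)
    seconds_per_year = 365 * 24 * 3600
    return expected_guesses / guesses_per_second / seconds_per_year


# Probability of seeing exactly 300 attacks in a day when 288 are expected.
p_300 = poisson_pmf(288, 300)
years = expected_bruteforce_years()
```

Even at the assumed rate, the expected search time exceeds 10^50 years, which illustrates why exhaustive guessing of a 256-bit key is regarded as infeasible regardless of how frequently attacks arrive.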

C. Statistical Analysis of SecureCam
The statistical analysis of SecureCam was carried out using the histogram of the pixel colours and the SSIM values of the decrypted test videos with and without a simulated attack.
1) Histogram Analysis: The histogram analysis is a plot of the frequency distribution of the pixel values, based on the colour components RGB (red, green, and blue), of the original vs. decrypted video frames with and without a simulated Replay attack. Additionally, the histogram determines the correlation between these frames, as shown in "Fig. 5". A lower correlation in the plot indicates a greater variance, and vice versa. Comparing the histogram results of the original frames with the decrypted frames without a Replay attack shows a high correlation, which means there is little or no variance between the two frames. However, comparing the original frame with the decrypted frame after a Replay attack, "Fig. 5 (a2) and (b2)", reveals a low correlation between the two frames, resulting in high variance. This shows that an attempted Replay attack against a frame protected with the SecureCam scheme can be easily detected.
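A histogram comparison of this kind can be sketched as follows. The snippet builds a 256-bin histogram for one colour channel and measures the Pearson correlation between two histograms; the use of Pearson correlation, the single channel and the sample pixel values are assumptions for illustration, since the exact correlation measure is not specified in the text.

```python
def histogram(pixels, bins=256):
    """Count how often each 8-bit intensity value occurs in one colour channel."""
    counts = [0] * bins
    for p in pixels:
        counts[p] += 1
    return counts


def correlation(h1, h2):
    """Pearson correlation between two histograms of equal length."""
    n = len(h1)
    m1 = sum(h1) / n
    m2 = sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    return cov / (var1 * var2) ** 0.5


# Hypothetical channel values: original frame, clean decryption, attacked decryption.
original_px = [10, 10, 20, 30, 30, 30]
decrypted_px = list(original_px)
attacked_px = [10, 200, 200, 200, 30, 250]

corr_clean = correlation(histogram(original_px), histogram(decrypted_px))
corr_attacked = correlation(histogram(original_px), histogram(attacked_px))
```

The same comparison would be repeated for each of the R, G and B channels; a clean decryption keeps the correlation at 1, while replayed or tampered content shifts the distribution and lowers it.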
2) Structural Similarity Analysis: As discussed earlier (Section IV-A4) SSIM evaluates the structural distortions by comparing the luminance values of two different images as

D. Comparative Computational Cost Analysis of Used Cipher
The computational complexity of the proposed SecureCam is 3n (3 × Frame_Size) (Section III-E), which depends on the number of iterations (video length or number of frames). To evaluate the execution cost of the proposed SecureCam scheme (stage 4) against another SOTA cipher, this section also presents a comparative timing analysis. The computation time of Chacha20 as the encryption algorithm in SecureCam was tested against the advanced encryption standard (AES) in cipher feedback (CFB) mode [65]. AES is an industry-standard cipher with proven resistance against attacks, while CFB is the only self-synchronizing mode in which AES operates as a stream cipher. The measured execution time when Chacha20 was applied for SE in SecureCam at stage 4 was less than the time taken by AES-CFB, as shown in Fig. 6 (a) and (b), respectively. The FG and BG encryption and storage with the Chacha20 cipher was faster than with AES-CFB, which supports the efficiency of the SecureCam application with the Chacha20 cipher for the IoMT environment.

E. Comparative Analysis of SecureCam With Existing Approaches
The proposed SecureCam application for selective detection and encryption of objects in dynamic camera videos is compared with other approaches in Table X. Although other approaches use object detection to identify the objects in the videos, none of them combines video segmentation and encryption as an attack prevention strategy. Furthermore, the newly proposed object detection algorithm (MF) proved to be more accurate and efficient in the FG and BG detection of the test videos than the SOTA DOF algorithm.

F. Limitations and Future work
This paper proposes a novel SecureCam application for protecting mobile camera videos against Replay and Man-in-the-Middle attacks using video frame segmentation and encryption techniques. Video segmentation into FG and BG was performed through a newly developed MF algorithm and encryption on detected parts was applied with the Chacha20 cipher. Despite its significance, this application has some limitations.
Achieving Security as a Service (SECaaS) always comes at an additional cost, and every application faces a trade-off among security, computational cost, and storage. The same holds for the SecureCam application: selective detection and encryption of the segmented FG and BG increase the video size, because additional bits are added to these objects in order to change their appearance. Moreover, storing the FG and BG separately doubles the required storage space. For static camera videos (CCTV), the storage issue can be handled through video summarisation methods (by not storing all detected static BG parts). For moving camera videos, however, a remedy remains to be explored as a future research challenge.
Although this paper provides a detailed security and statistical analysis against attacks on SecureCam, in the future we also intend to develop a threat model that will provide effective countermeasures to other types of stealthy threats and attacks on ROI-based encrypted surveillance videos captured by dynamic camera devices.

V. CONCLUSION
The contribution of this paper is twofold: first, we develop an unsupervised learning algorithm, MF, for precisely detecting foreground and background ROIs in mobile/moving camera videos; second, we propose a novel application (SecureCam) for protecting these videos. SecureCam encompasses five implementation stages (Fig. 1) with a linear time complexity of 3n (3 × Frame_Size). The application and the MF algorithm were extensively tested and compared with existing SOTA approaches in this study. The performance of the MF algorithm was compared with the SOTA DOF algorithm using different parameters such as Accuracy, Precision, ESR, SSIM metrics, and detection timings. The results of these comparisons demonstrate the efficiency and accuracy of object detection with MF for moving camera videos. The SE was applied using the Chacha20 cipher in SecureCam, and its computational cost in terms of execution time (for encryption and decryption) was compared with the SOTA AES cipher. In contrast to the industry cipher (AES), ChaCha20 is found to be more efficient, making it a suitable choice for real-time video encryption on low-powered camera devices. Overall, the comparative performance, security, and statistical analysis of SecureCam demonstrate its effectiveness, efficiency, and robustness against Replay and/or Man-in-the-Middle attacks for moving camera surveillance videos in the IoMT infrastructure.