P2SLR: A Privacy-Preserving Sign Language Recognition as-a-Cloud Service Using Deep Learning For Encrypted Gestures

Cloud-based services have revolutionized data storage and processing. However, these services raise security concerns, as service providers may misuse the users' stored data. Privacy loss is particularly problematic for hearing- and speech-impaired individuals who may need to use cloud infrastructure for sign language recognition (SLR). Addressing these challenges, this paper presents privacy-preserving sign language recognition (P2SLR) as a cloud service that operates over cloud infrastructure without revealing the individual's visual information to the cloud service provider (CSP). The proposed P2SLR system is realized through two innovations: (a) a block-based probabilistic image encryption scheme that combines a fractional-order chaotic system (FOCS) and singular value decomposition (SVD) to obfuscate the visual information in video frames, and (b) a cloud-residing deep convolutional neural network (D-CNN) based recognition architecture with a modified classifier to recognize gestures from encrypted video. The proposed scheme is validated on American, Argentinian, and German sign languages and achieves recognition accuracy in the range 90.76-98.09%, comparable to existing state-of-the-art SLR techniques in the plain domain (PD). The proposed image encryption scheme is secure under standard cryptographic image attacks, protecting the individual's identity. P2SLR is a first step towards developing a secure SLR system over the cloud.


I. INTRODUCTION
Individuals with hearing and speech impairments face communication challenges in their daily lives. These individuals use non-verbal visual gestures, known as sign language (SL), to communicate and express their thoughts. These gestures differ across regions. For instance, Argentine SL is used in Argentina, German SL in Germany and Belgium, and American SL in the United States of America. For decades, numerous feature-based SLR methods have been proposed [1] [2], and recently, D-CNN-based SLR frameworks have achieved remarkable recognition performance [3] [4].
However, these data-driven deep learning (DL) techniques demand substantial computational resources such as GPUs and storage space [5]. Setting up and maintaining such platforms is expensive and requires significant time and effort. In contrast, cloud computing has revolutionized the computing landscape, allowing an individual or organization to provision virtual infrastructure resources for storing and processing datasets. These services unshackle users from the burdensome tasks of collecting, assembling, and maintaining expensive computing resources, instead allowing them to obtain these services in a pay-as-you-go business model. Microsoft Azure, Amazon Web Services, Google Cloud, and IBM Cloud are a few instances of CSPs.
Cloud computing's key advantages are mobility via smart devices, quality control, automatic software updates, and data-loss prevention in the event of hardware damage. Despite these benefits, cloud-based services have substantial drawbacks. Third parties develop and maintain these cloud platforms, and users cannot be assured of the confidentiality and integrity of data transmitted to the cloud servers. Consequently, many users avoid cloud services. In this paper, we address an SL individual's security and privacy concerns, which may arise from the leakage of the individual's visual information while using cloud services for an automatic SLR system. Protecting the SL user's visual information, such as face and background environmental details, from an adversary (present at the CSP) before transmitting SL data to the CSP is a primary requisite, as the leakage of an SL user's information may increase the risk of violence against them. For instance, an adversary may misuse this visual information for a security breach to curate illegal activities against SL individuals, such as kidnapping and human trafficking. These privacy issues deter organizations from outsourcing SL individuals' visual data to CSPs to train a D-CNN SLR model. An ideal solution to the data "privacy-utility" trade-off is to encrypt the individual's identity-related information in the video database before transmission to the CSP and then train the D-CNN model on the encrypted data. Fully homomorphic encryption (FHE) schemes can be used to train the SLR model on FHE-encrypted data and achieve state-of-the-art recognition accuracy matching that of unencrypted data [6] [7]. However, FHE schemes increase the data size enormously; a 1 MB image may be encrypted into a 1 GB ciphertext, making them unsuitable for real-time applications.
In contrast with FHE, most computationally efficient image encryption schemes shuffle pixel locations to significantly randomize the image's visual attributes [8] [9]. On the other hand, D-CNN models depend on the relative locations of pixels to generate a robust representation of the input video, and thus these pixel-shuffling-based schemes cannot be used with D-CNN frameworks. Therefore, a new encryption scheme is required for SL images that can significantly obfuscate the global features (the individual's visual information) while preserving the local features (regions depicting gesture information) with low computation and storage overheads. Such a scheme can be used to encrypt the user's SL video before transmission to the CSP, so that a D-CNN-based SLR model can be trained on encrypted data without compromising the SL individual's privacy.
In this paper, we utilize the chaos-based bit-plane approach [10] [11] to obfuscate the SL image information by adding probabilistic noise to each pixel intensity value without altering the location of the pixels. This approach aims to preserve the relevant information, such as facial expression and hand movement, in an SL image as much as possible while encrypting the regions indicating the SL individual's identity. This feature-preserving property benefits the D-CNN model in robust parameter learning and achieves state-of-the-art recognition accuracy for encrypted data. From a security perspective, chaotic systems are sensitive to initial parameters and exhibit orbital unpredictability, ergodicity, and random dynamic phenomena. In particular, an FOCS shows higher non-linearity and more degrees of freedom than integer-order chaotic systems [12] [13]. Combining the FOCS and SVD, we propose an encryption scheme for an SL image I that partitions it into non-overlapping blocks, followed by adding random noise to the SVD components of the bit-planes of each block using the FOCS. We observe that inserting randomness into in-depth modules, such as the SVD components of a bit-plane, secures the image's global features better than directly randomizing the bit-plane. The major contributions of the paper are:
1. A scheme to train an end-to-end privacy-preserving SLR system, namely P2SLR, using DL techniques over a cloud server. The scheme trains a D-CNN-based recognition model on the encrypted SL dataset while protecting the visual information of the user, i.e., the person performing the gestures in the input video data, from adversaries.
2. A probabilistic block-based bit-plane image encryption scheme that does not alter pixel locations, thereby preserving the gesture information. Each bit-plane of a block is obfuscated by encrypting its SVD components via added pseudo-random noise.
3. The performance of P2SLR is reported over American, Argentinian, and German SL datasets for varying block sizes. The recognition results over the encrypted data lie in the range 90.76-98.09%, comparable with existing SLR schemes in the PD.
4. The qualitative and quantitative security analysis indicates that the proposed scheme significantly protects the user's visual information. Further, the scheme is comparable with existing image encryption methods under various cryptographic attacks.
Organization: Section II provides an overview of related privacy-preserving schemes. The proposed encryption scheme and the DL framework are presented in Section III. The experimental recognition results are reported in Section IV. The security and performance analyses are presented in Section V and Section VI, respectively. Finally, Section VII concludes the paper with future applications.
II. RELATED WORK

Some instances of cloud-based privacy-preserving services are secure photo-sharing over social networks [14], secure outsourced biometric identification [15], a privacy-preserving augmented-reality-based virtual cloth try-on system [16], secure and trusted e-healthcare services for social media health users [17], [18], and secure location-based services [19], [20].
Bost et al. [6] proposed an FHE-based scheme for naive Bayes, decision trees, and hyperplane decision classifiers. The authors combined FHE without bootstrapping, Quadratic Residuosity, and Paillier cryptosystems [26] for data encryption. Rahulamathavan et al. [21] proposed a privacy-preserving system for recognizing facial expressions as a user-CSP service. To encrypt the image's information, the authors presented a randomization-based lightweight encryption algorithm using local Fisher discriminant analysis and the Paillier cryptosystem [26]. The reported classification accuracies are 94.37% and 95.24% over the JAFFE and MUG facial expression databases, respectively. Wang and Chang [24] proposed a two-party privacy-preserving image classification scheme that perturbs the image information using local differential privacy (LDP) [27]. They analyzed the effect of perturbation satisfying ε-LDP on data utility with respect to distance- and count-based machine learning algorithms. Chen et al. [25] presented a secure multi-classification scheme to address privacy leakage in robot systems using DL. The authors adopted two pairs of activation and cost functions using HE, namely softmax plus log-likelihood and sigmoid plus cross-entropy, along with secure calculation protocols. The scheme reduced computation and communication overheads. Xie et al. [28] presented an HE-based theoretical protocol for implementing an image classification algorithm in the encrypted domain (ED) but did not validate it over any dataset. Gilad et al. [29] proposed an FHE scheme for encrypting gray-scale images to train a basic neural network architecture and validated it on the MNIST dataset only. For more privacy-preserving data processing schemes, the reader is referred to Li et al. [30] and Ding et al. [31].
Dai et al. [32] proposed a secure identification method using extremely low-resolution (LR) images to perform human action recognition. LR cameras capture real-time video frames, which are processed pixel-by-pixel to recognize the human activity. They evaluated the scheme over a 3D-animated room scenario containing five cameras and avatars describing four actions for 12 users. The reported results are close to those of state-of-the-art methods in the PD. Ryoo et al. [33] presented an inverse super-resolution scheme addressing unstable decision boundaries in low-resolution images, which may affect classification robustness. The authors augmented the training samples by transforming high-resolution (HR) images into multiple LR images, followed by learning sub-pixel transformation features of these LR images using a Siamese model. Ryoo et al. [34] proposed a privacy-preserving human action recognition scheme that transforms a high-resolution image into an extremely low-resolution one to protect the performer's identity in the input data. Further, Ren et al. [22] addressed the privacy concerns over an individual's identity by anonymizing the face only, rather than the complete image, without affecting the action information. In contrast with the datasets utilized in these schemes, the only relevant information in SL datasets is the gestures expressed through hand kinematics and facial features (which also reveal the individual's identity). This relevant information is smaller in amount than the irrelevant information, such as unused body parts, the background behind the gesture performer, and lighting variations, as shown in Fig. 6. Therefore, an encryption scheme is required for SL images that obfuscates the irrelevant information revealing the individual's identity and other related details while preserving the gesture features in the encrypted image.

III. PROPOSED FRAMEWORK
This section introduces an end-to-end framework for developing P2SLR. We first define the roles of the different entities considered in the threat model. Then, the proposed block-based image encryption scheme is presented, in which we obfuscate the image's visual information by encrypting each block's bit-planes using the FOCS and SVD. Finally, the D-CNN-based SLR architecture, namely ResNet with the modified classifier, is described. An overview of the proposed P2SLR framework is presented in Fig. 1.

A. Threat model
In P2SLR, two entities are involved: a user U who wishes to outsource the SL dataset D, containing the video clips of SL individuals performing the gestures, for developing an SLR model without compromising the visual information of the SL individuals in the video clips; and an "honest-but-curious" CSP C who provides a D-CNN-based SLR architecture M along with computational and storage resources in a pay-as-you-go business model. A general user-CSP protocol for developing D-CNN-based cloud models is presented in Fig. 2. The visual information of the gesture performer in each video clip of D is obfuscated using the image encryption scheme proposed in Section III-B, which partially preserves the gesture-related features in an encrypted video clip. The encrypted dataset, denoted by D_Enc, is then transmitted by U to the cloud server for training M (defined in Section III-C) over this encrypted data. M utilizes the inter-correlated pixel intensities to generate a robust representation of the input video and thereby achieve high recognition accuracy. It is important to note that the proposed image encryption scheme preserves the inter-correlated pixels representing gesture regions only and de-correlates the other, non-gesture regions. Therefore, during the training phase, M learns only the gesture features (local features) preserved in the encrypted video clips, which enables M to obtain state-of-the-art gesture recognition accuracy. The non-gesture information (global features) is drastically different in each encrypted video clip of the same gesture, so M treats this information as irrelevant. We assume that U has no control over the computations performed by C on D_Enc. Since C trains M on D_Enc, the model's weights and hyper-parameters are learned from encrypted data, and thus the trained model, say M_Enc, is also in encrypted form. If a practical adversary attempts to extract the visual information of the gesture performer from an encrypted video clip in D_Enc, the obtained information is meaningless and completely random-like.
Finally, C will send back the trained SLR model M Enc to U for recognizing real-time gestures of SL individuals.

B. Proposed image encryption scheme
This section proposes the image encryption scheme required to obfuscate the visual information in a gray-scale SL image I, for instance Fig. 3 (a). The encryption objective is to randomize the visual and other relevant information present in I (global features), which reveals the performer's identity and related confidential details. We aim to preserve the gesture attributes (local features) in an encrypted SL image as much as possible, to benefit the D-CNN SLR model M with robust recognition accuracy. As discussed in Section I, chaos-based image encryption schemes [9][10][11], [35] significantly obfuscate both local and global features of I simultaneously by permuting the locations of the pixels to break the pixel inter-correlation. Thus, these approaches cannot fulfill our requirement of preserving the regions representing the gesture in I. To achieve our objectives, we propose a block-based bit-plane encryption scheme for an SL image that encrypts the global features (the performer's visual information) in I while partially preserving the local features describing gesture attributes. We accumulate randomness and confusing noise in each pixel intensity value without altering the locations of the pixels, thus preserving gesture features. The encryption scheme is explained in two phases. In the first phase, we describe the technique to encrypt a bit-plane BP using the FOCS and SVD, defined in Section III-B2. In the second phase, we apply the proposed bit-plane encryption technique (first phase) to each non-overlapping block of I with an additional layer of block encryption, as explained in Section III-B3. Note that an 8-bit image can be partitioned into eight binary-valued bit-planes of dimension equal to the image, as shown in Fig. 3.

1) Generation of noise-vectors:
We generate the noise vectors using the solution matrix of Chen's chaotic system defined in [36]. Suppose the solution matrix of this system is S of dimension num_sol × 3, where the columns correspond to the x, y, and z variables, and num_sol is the total number of mesh points, obtained as the ratio of simulation time to step size. For the initial noise vectors N_1 and N_2, we generate two random numbers over the range [1, num_sol] using a pseudo-random generator and take the corresponding solution rows in S, namely N_1 = (n_1^1, n_1^2, n_1^3) and N_2 = (n_2^1, n_2^2, n_2^3). These vectors are used to perturb the SVD components of a bit-plane BP, as explained in Section III-B2.
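The sampling of noise vectors from a chaotic solution matrix can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: it integrates the standard integer-order Chen system (parameters a = 35, b = 3, c = 28) as a stand-in, since the fractional-order variant used in the paper requires a dedicated fractional ODE solver; the simulation time, step size, initial state, and seed are all illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

def chen_system(t, state, a=35.0, b=3.0, c=28.0):
    # Integer-order Chen system (the paper uses a fractional-order variant).
    x, y, z = state
    return [a * (y - x), (c - a) * x - x * z + c * y, x * y - b * z]

def noise_vectors(sim_time=10.0, step=0.01, seed=0):
    t_eval = np.arange(0.0, sim_time, step)       # num_sol = sim_time / step mesh points
    sol = solve_ivp(chen_system, (0.0, sim_time), [0.1, 0.2, 0.3],
                    t_eval=t_eval, rtol=1e-8, atol=1e-8)
    S = sol.y.T                                   # num_sol x 3 solution matrix
    rng = np.random.default_rng(seed)             # stands in for the CSPRNG G
    i1, i2 = rng.integers(0, len(S), size=2)      # two random row indices in [1, num_sol]
    return S[i1], S[i2]                           # N1 = (n1^1, n1^2, n1^3) and N2

N1, N2 = noise_vectors()
```

Each returned row is one point on the chaotic trajectory, so small changes in the initial state or sampled index yield entirely different noise vectors.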
2) Encrypting the bit-plane: The intensity values in a bit-plane BP of dimension BP_rows × BP_cols are either 0 or 1. We accumulate probabilistic random noise in BP by obfuscating every numeric value in BP's SVD components without changing the locations of the pixels. The objective is to encrypt BP in a bottom-to-top approach, i.e., to inject randomness into the components of the least significant decomposition of BP, here the SVD, so that a substantial amount of random noise is incorporated in the encrypted bit-plane obtained after reconstruction from the randomized components. From the literature, a small perturbation in the SVD components of an image causes a large variance in the intensity values of the SVD-reconstructed image. Moreover, the bit-plane reconstructed after perturbing its SVD components significantly de-correlates the pixels' inter-correlation. In the SL dataset, we observed that the regions/pixels representing the gesture are at roughly the same locations across video clips of the same gesture. Thus, the proposed bit-plane encryption accumulates a similar noise-like structure at gesture-representing regions in each image of the same gesture and a drastically different noise structure in non-gesture regions. The SVD of BP is defined as

BP = BP_U · BP_S · (BP_V)^T,   (1)

where BP_U and BP_V are orthogonal matrices of dimensions BP_rows × BP_rows and BP_cols × BP_cols respectively, and T denotes the matrix transpose. BP_S = diag(a_1, a_2, ..., a_i, ..., a_r) is a diagonal matrix containing the singular values of BP with a_i ≥ a_{i+1}, i = 1, 2, ..., r − 1, and r = min(BP_rows, BP_cols).
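The decomposition in Eq. 1 can be checked numerically on a toy binary bit-plane; the snippet below is a minimal NumPy sketch (the 8 × 8 size and random contents are illustrative) verifying both the exact reconstruction and the non-increasing ordering of the singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
BP = (rng.random((8, 8)) > 0.5).astype(float)     # toy 8x8 binary bit-plane

# Full SVD: BP = BP_U @ diag(a_1..a_r) @ BP_V^T  (Eq. 1)
U, s, Vt = np.linalg.svd(BP, full_matrices=True)
BP_rec = U[:, :len(s)] @ np.diag(s) @ Vt[:len(s), :]
assert np.allclose(BP, BP_rec)                    # exact reconstruction

# Singular values satisfy a_i >= a_{i+1}
assert np.all(np.diff(s) <= 1e-12)
```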
Let N_1 = (n_1^1, n_1^2, n_1^3) and N_2 = (n_2^1, n_2^2, n_2^3) be two noise vectors generated using the method defined in Section III-B1. These vectors are used to incorporate noise into the three SVD components: each entry (i, j) of BP_U, BP_S, and BP_V is perturbed by a term derived from the noise-vector elements scaled by rand_{i,j}^k, where rand_{i,j}^k is a randomly generated floating-point number over the range [0, 1]. It is important to note that rand_{i,j}^k is different for each (i, j) in every SVD component, and the noise is added to each pixel (i, j) without altering its position. The encrypted components BP_U^enc, BP_S^enc, and BP_V^enc are then recombined, as defined in Eq. 1, to form an intermediate encrypted plane, say BP_temp^enc. Further, we normalize BP_temp^enc over the range [0, 1] and assign the bit value 0 at (i, j) if the intensity value BP_temp^enc(i, j) is greater than the mean value (μ) of BP_temp^enc, and 1 otherwise. This yields the encrypted bit-plane BP_enc for the unencrypted bit-plane BP. The pseudo-code for bit-plane encryption is presented in Algorithm 1. A drastic difference between the bit-planes of an SL image and their encrypted forms can be observed in Fig. 3.
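The bit-plane encryption steps can be sketched end-to-end as follows. This is a hedged sketch of Algorithm 1, not the exact published algorithm: the precise perturbation formula for the SVD components is assumed here to be additive (component entry plus a noise-vector element scaled by a fresh uniform random value), and the way the two noise vectors are distributed across the three components is also an assumption.

```python
import numpy as np

def encrypt_bitplane(BP, N1, N2, rng):
    """Perturb the SVD components of a binary bit-plane with chaotic noise
    vectors, reconstruct, normalize to [0, 1], and binarize by the mean."""
    U, s, Vt = np.linalg.svd(BP.astype(float), full_matrices=False)
    S = np.diag(s)
    # Assumed additive perturbation: entry + noise_element * uniform(0, 1),
    # with a fresh random value for every (i, j) of every component.
    U_enc  = U  + N1[0] * rng.random(U.shape)
    S_enc  = S  + N1[1] * rng.random(S.shape)
    Vt_enc = Vt + N2[0] * rng.random(Vt.shape)
    tmp = U_enc @ S_enc @ Vt_enc                               # recombine (Eq. 1)
    tmp = (tmp - tmp.min()) / (tmp.max() - tmp.min() + 1e-12)  # normalize to [0, 1]
    return (tmp <= tmp.mean()).astype(np.uint8)                # 0 if > mean, else 1

rng = np.random.default_rng(0)
BP = (rng.random((16, 16)) > 0.5).astype(np.uint8)   # toy 16x16 bit-plane
N1 = rng.random(3) * 10                              # stand-ins for the Chen-system
N2 = rng.random(3) * 10                              # solution rows of Section III-B1
BP_enc = encrypt_bitplane(BP, N1, N2, rng)
```

Note that pixel positions are never moved: each output bit sits at the same (i, j) as its input bit, which is the property that preserves gesture regions.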
3) Encryption of a gray-scale image: Having proposed the bit-plane encryption scheme, we now define a block-based approach for encrypting a gray-scale image I of dimension I_rows × I_cols. The pictorial representation of the proposed encryption scheme for a gesture image I of dimension 210 × 210 is depicted in Fig. 4. Initially, I is divided into T non-overlapping blocks, say B = {I_B^1, I_B^2, ..., I_B^T}, each of dimension ρ_1 × ρ_2, where ρ_1 and ρ_2 are factors of I_rows and I_cols respectively and T = (I_rows × I_cols)/(ρ_1 × ρ_2). For instance, the image in Fig. 4 is partitioned into nine blocks, each of dimension 70 × 70. As discussed above, most of the gesture information is contained in the second and fifth blocks, with a small amount in the eighth block; the remaining blocks contain non-gesture or irrelevant information for an SLR system. Thus, we aim to preserve the information of the second, fifth, and eighth blocks (local features) in the encrypted image while completely obfuscating the information in the remaining blocks (global features). Next, each t-th block I_B^t in B is treated as a separate image of dimension ρ_1 × ρ_2, and its bit-planes, say {b_1, b_2, ..., b_8}, are extracted, as shown in Fig. 4 after step (ii). Each bit value in a bit-plane carries a different amount of information depending on its bit position. For instance, consider the 8-bit gray-scale gesture image depicted in Fig. 4. The bit "1" in an 8-bit binary representation contributes 2^0 = 1 at the first (least significant) bit of a pixel and 2^7 = 128 at the eighth bit of the same pixel. Experimentally, the least-significant four bit-planes (b_1, b_2, b_3, b_4) in Fig. 4 contain nearly 6-7%, and the most-significant four bit-planes (b_5, b_6, b_7, b_8) carry 93-94%, of the total image information, as can be observed from the respective bit-planes in Fig. 4.
The percentage of pixel information P(i) carried by the i-th bit-plane is computed from its binary weight as

P(i) = (2^(i−1) / Σ_{k=1}^{8} 2^(k−1)) × 100.

To reduce the computational complexity without much compromising the security efficiency of the proposed image encryption scheme as a cloud service, it is sufficient to obfuscate the least-significant four bit-planes (as they contain a minimal amount of information) by adding random noise location-wise, while the most-significant four bit-planes are encrypted using the bit-plane encryption of Algorithm 1. The encrypted bit-planes are recombined into an intermediate encrypted block Enc_temp, to which a second layer of random noise is added. It is important to note that the range of the random values for each t-th block Enc_temp is bounded by the maximum and minimum intensity values of the same block. Therefore, the gesture features in an encrypted SL image are obfuscated but remain at their original pixel locations, as shown in Fig. 4 (after operation (iii)). Finally, the encrypted image I_Enc corresponding to image I is obtained by concatenating the encrypted blocks I_Enc^1, ..., I_Enc^T at the corresponding locations of I_B^1, I_B^2, ..., I_B^T, as depicted in Fig. 4 (after operations (iv) and (v)). The pseudo-code for encrypting a gray-scale image is presented in Algorithm 2. We observe that the security efficiency for the global features of an SL image is directly proportional to the block size (ρ_1 × ρ_2) used for partitioning the image. For instance, for an image of dimension 200 × 200, the encrypted form with block size ρ_1 × ρ_2 = 40 × 40 will be more secure than the encrypted image with block size ρ_1 × ρ_2 = 4 × 5. An RGB color image can be encrypted by applying the image encryption (Algorithm 2) to each red, green, and blue color channel independently.
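The block partitioning and bit-plane bookkeeping above can be sketched as follows; a minimal NumPy illustration, assuming the weight-based form of P(i) given above (which reproduces the roughly 6% / 94% split quoted for the lowest and highest four bit-planes), with the 210 × 210 image and 70 × 70 blocks taken from the Fig. 4 example.

```python
import numpy as np

def bitplanes(block):
    # Extract the eight binary bit-planes b1..b8 of an 8-bit block (b1 = LSB).
    return [(block >> i) & 1 for i in range(8)]

def info_percentage(i):
    # Weight-based share of image information in the i-th bit-plane:
    # P(i) = 2^(i-1) / sum_k 2^(k-1) * 100, with sum_k 2^(k-1) = 255.
    return 100.0 * 2 ** (i - 1) / (2 ** 8 - 1)

def partition(img, p1, p2):
    # Split into T = (rows*cols)/(p1*p2) non-overlapping p1 x p2 blocks.
    rows, cols = img.shape
    assert rows % p1 == 0 and cols % p2 == 0, "p1, p2 must divide the image dims"
    return [img[r:r + p1, c:c + p2]
            for r in range(0, rows, p1) for c in range(0, cols, p2)]

img = np.arange(210 * 210, dtype=np.uint8).reshape(210, 210)  # toy 210x210 image
blocks = partition(img, 70, 70)                               # 9 blocks, as in Fig. 4
low  = sum(info_percentage(i) for i in range(1, 5))           # b1..b4: ~5.9%
high = sum(info_percentage(i) for i in range(5, 9))           # b5..b8: ~94.1%
```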

C. D-CNN based SLR framework
Having defined the encryption scheme to obfuscate the local and global features in an SL gesture video clip, we now define the D-CNN-based recognition architecture, denoted by M, which is trained on the encrypted data D_Enc. For the experiments, we utilize variants of the residual neural network (ResNet) architecture [37]. In the literature, the ResNet architecture has significantly boosted recognition accuracy with lower computational and storage overheads than other D-CNN architectures in numerous applications, such as face recognition [38], object detection [39], and SLR systems [40]. The ResNet architecture has two parts: a feature extractor and an ImageNet classifier. The feature extractor takes an encrypted gesture video clip as input and produces a robust representation of the clip. It is important to note that this representation inherits only the local features preserved by the encryption scheme. The classifier classifies the input video clip into pre-defined gestures (class labels). In this work, we do not change the feature extractor but modify the classifier by incorporating two adaptive pooling layers, namely maximum and average, over the last batch-normalization layer of the feature extractor, as in Fig. 5. The output features of these pooling layers are concatenated and flattened into a one-dimensional global feature descriptor of the input image. Further, we add the layers BatchNorm-Dropout-FullyConnected-Activation(ReLU)-BatchNorm-Dropout-FullyConnected-LogSoftmax (output layer). Moreover, the class labels associated with the gesture video clips are uniquely transformed into numeric codes (one per class label) before transmission to the cloud server. In this manner, the CSP receives the supervised encrypted data D_Enc with encoded class labels rather than the actual text labels.
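The modified classifier head described above can be sketched in PyTorch as follows; the hidden width (512), dropout rates, input channel count (2048, matching a ResNet50 feature map), and class count (64) are illustrative assumptions, not values given in the paper.

```python
import torch
import torch.nn as nn

class ModifiedClassifier(nn.Module):
    """Sketch of the modified head: concatenated adaptive max/avg pooling,
    then BatchNorm-Dropout-FC-ReLU-BatchNorm-Dropout-FC-LogSoftmax."""
    def __init__(self, in_ch=2048, n_classes=64, hidden=512, p=0.5):
        super().__init__()
        self.maxpool = nn.AdaptiveMaxPool2d(1)
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.BatchNorm1d(2 * in_ch), nn.Dropout(p),
            nn.Linear(2 * in_ch, hidden), nn.ReLU(),
            nn.BatchNorm1d(hidden), nn.Dropout(p),
            nn.Linear(hidden, n_classes), nn.LogSoftmax(dim=1),
        )

    def forward(self, feats):
        # feats: N x C x H x W feature map from the (unmodified) extractor.
        pooled = torch.cat([self.maxpool(feats), self.avgpool(feats)], dim=1)
        return self.head(pooled)

feats = torch.randn(4, 2048, 7, 7)        # e.g., a ResNet50 final feature map
log_probs = ModifiedClassifier()(feats)   # N x n_classes log-probabilities
```

The LogSoftmax output pairs naturally with the negative log-likelihood loss used for training in Section IV.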

IV. RECOGNITION EXPERIMENTS
This section demonstrates the recognition performance of the ResNet variants, namely ResNet18, ResNet34, and ResNet50, and the VGG16 recognition model, with the feature extractor and classifier settings defined in Section III-C, over the three SL datasets given in Section IV-A, in which each video frame is encrypted using the image encryption scheme proposed in Section III-B3.
Implementation details: Codes are implemented in Jupyter Notebook running on 64-bit Ubuntu 18.04, on an HP workstation with an Intel Xeon(R) Gold 5120 CPU @ 2.20 GHz × 56 and an Nvidia Quadro P5000 GPU. All recognition architectures are trained end-to-end on rescaled video frames of dimension 200×200 with a batch size of 128, 100 epochs, the Adam optimizer (lr = 1e-3), a cosine-annealing lr scheduler, and the negative log-likelihood loss. Each dataset is partitioned into training-validation-testing splits of 75%-10%-15%, respectively.
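The training configuration listed in the implementation details can be sketched as below; a minimal PyTorch setup where the tiny stand-in model and random batch are placeholders (the actual pipeline trains the ResNet/VGG architectures of Section III-C on encrypted 200 × 200 frames), while the optimizer, scheduler, and loss match the stated settings.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the D-CNN recognition architecture.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 200 * 200, 64), nn.LogSoftmax(dim=1))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                  # Adam, lr = 1e-3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # over 100 epochs
criterion = nn.NLLLoss()                           # negative log-likelihood loss

# One illustrative step on a random batch standing in for encrypted frames.
x = torch.randn(8, 3, 200, 200)
y = torch.randint(0, 64, (8,))
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
scheduler.step()
```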

B. Gesture recognition analysis for the encrypted datasets
Here, we analyze the recognition accuracy for the encrypted forms of the above-defined SL datasets. Each recognition architecture is evaluated on five encrypted variants of the ASL, LSA64, and RWTH SL datasets, obtained by varying the block size ρ_1 × ρ_2 over 5 × 5, 10 × 10, 20 × 20, 25 × 25, and 40 × 40 in the proposed image encryption scheme. For these encrypted variants, the achieved recognition accuracies on the testing data lie in the ranges 91-98%, 94-99%, and 81-91% for the ASL, LSA64, and RWTH datasets respectively, as depicted in Tables I-III. We also observe an inverse trade-off between the recognition accuracy and the block size used for image encryption. For instance, the recognition accuracy of ResNet50 on the encrypted ASL dataset is 98.09% with block size 5 × 5 and 90.76% with block size 40 × 40, as shown in Table I. A similar trade-off holds for the other datasets and block sizes. Note that ResNet50 outperforms the other frameworks on all three encrypted SL datasets. Further, we trained ResNet50 end-to-end with the same implementation settings on the unencrypted ASL, LSA64, and RWTH datasets and achieved testing accuracies of 98.89%, 99.73%, and 95.84%, respectively. The accuracy gap between the unencrypted and encrypted datasets is thus no more than 7-8%. This small recognition loss is an acceptable price for the significant gain in SL users' privacy and security. Next, we compare the recognition accuracies of ResNet50 on each of the encrypted ASL, LSA64, and RWTH datasets with block sizes 5 × 5 and 40 × 40 against existing permutation-based image encryption schemes.

C. Comparison with image encryption schemes
The image encryption scheme proposed in Section III-B3 aims to partially preserve the gesture features in an encrypted image without permuting the locations of the pixels. Therefore, it is desirable to compare this scheme with pixel-location-permutation image encryption schemes. For the experiments, we consider the chaos-based encryption schemes proposed by Zhu et al. [43], Zhang et al. [44], and Ping et al. [45]. Each of the ASL, LSA64, and RWTH datasets is encrypted using these schemes, and ResNet50 is trained end-to-end on the resulting encrypted datasets. The obtained recognition accuracy lies in the range of 9-15% only, as reported in Table IV, whereas the accuracy of the proposed scheme is close to 81-99%. Hence, the proposed encryption scheme fulfills the desired objective of preserving gesture features while efficiently securing an individual's visual information.

D. Comparison with state-of-the-art SLR methods
Since P2SLR is the first work to perform SLR in the ED, we compare the gesture recognition accuracies of ResNet50 on each of the SL datasets encrypted with the proposed scheme against existing SLR methods in the PD that use classical feature-based and D-CNN-based approaches. As the ASL dataset was provided in Kaggle's SLR challenge, P2SLR is compared with the results reported by the participants, namely Dan, Rohit, and Jeffy. These participants utilized variants of D-CNN frameworks in their end-to-end SLR methods, achieving recognition accuracies in the range 85.49-99%, whereas P2SLR achieves 90.76-98.09%. For the LSA64 dataset, we consider the feature-based SLR methods proposed by Ronchetti et al. [41] and Tanwar et al. [1], with reported accuracies of 95.95% and 82.57%, respectively. Ronchetti's method comprises hand tracking and segmentation of colored gloves, followed by a Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) based classifier. In contrast, Tanwar's model utilizes dense trajectories to trace hand movement and a GMM for hand segmentation, with Random Forest and support vector machine classifiers. Further, Javed et al. [4] generated three deep motion templates for each gesture video and trained three different D-CNN architectures, fusing them with a kernel-based extreme learning machine to obtain the final class label. Javed's model reported recognition accuracies of 97.81% and 85.86% on the LSA64 and RWTH datasets, respectively. The comparative results are shown in Table V. It is observed that P2SLR outperforms the existing schemes on the LSA64 and RWTH datasets and achieves close recognition accuracy on the ASL dataset.

V. SECURITY ANALYSIS
This section presents the proof-of-security of the proposed encryption scheme and evaluates it against various standard cryptographic attacks an adversary may mount to extract the original image's visual information, and hence the individual's identity, from its encrypted form.

A. Proof-of-security
Consider a gray-scale gesture image I of dimension M × N, encrypted using the proposed encryption scheme with block size ρ_1 × ρ_2. In other words, a total of T = (M × N)/(ρ_1 × ρ_2) blocks, each of dimension ρ_1 × ρ_2, are encrypted. Since the encryption is performed on all 8 bit-planes of the T blocks, a total of 8 × T bit-planes of dimension ρ_1 × ρ_2 are encrypted. As discussed in Section III-B3, the information in the least-significant four bit-planes is obfuscated by additive random noise, whereas the most-significant four bit-planes are encrypted through the proposed Algorithm 1. Thus, half of the 8 × T bit-planes, i.e., 4 × T bit-planes, are obfuscated through the former method and the remaining 4 × T bit-planes through the latter. The total number of random values required for the least-significant 4 × T bit-planes is 4 × T × ρ_1 × ρ_2. Now, we calculate the total number of random values for the most-significant 4 × T bit-planes. Each such bit-plane is initialized with two noise vectors, i.e., solutions of Chen's chaotic system; so, for 4 × T bit-planes, it requires 4 × T × 2 solutions. Also, each bit-plane is decomposed into three SVD components, each of dimension ρ_1 × ρ_2, and every entry of each component is obfuscated by a random value (the component perturbation of Section III-B2), which requires 3 × ρ_1 × ρ_2 random values for a single bit-plane and therefore 4 × T × 3 × ρ_1 × ρ_2 for 4 × T bit-planes. For bit-plane encryption, the total number of required random values, denoted by RV_bit, is

RV_bit = (4 × T × ρ_1 × ρ_2) + (4 × T × 2) + (4 × T × 3 × ρ_1 × ρ_2) = 16 × T × ρ_1 × ρ_2 + 8 × T.   (5)

In Eq. 4, a random value is added to each pixel intensity value of the encrypted block (as a second security layer), whose range is different for each block; this adds M × N more random values to RV_bit. Thus, the total number of random values, denoted by RV_total, becomes

RV_total = RV_bit + M × N   (6)
         = 16 × T × ρ_1 × ρ_2 + 8 × T + M × N.   (7)

It is important to note that the RV_bit random values vary uniformly over the range [0, 1], i.e., if this range is partitioned uniformly into L values, then the probability of getting a particular random value rand is 1/L.
For instance, if L = 10^7, then the probability for each random value is 0.0000001 ≈ 0. Moreover, these random values are generated through a cryptographically secure pseudo-random generator G, and we assume that an adversary is unable to recover the output of G under any condition. Further, the last M × N random values (Eqs. 6-7) vary over T different intervals, one for each block (Eq. 4). With the above analysis, we claim that an adversary is unable to extract an original pixel-intensity value from the corresponding encrypted pixel value. For an RGB color image, the total number of random values becomes 3 times RV_total (one set per color channel) (Eq. 7). For a visualization of the total incorporated randomness and confusion in the encrypted images obfuscating the global features, please refer to the supplementary material.
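The tally above can be reproduced with a short script. This is a minimal sketch of the counting argument only (it performs no encryption); the 200 × 200 image and 40 × 40 block-size below are illustrative values, not the paper's experimental settings:

```python
def count_random_values(M, N, rho1, rho2):
    """Count the pseudo-random values consumed when encrypting an
    M x N gray-scale image with block-size rho1 x rho2, following
    the tally in the text."""
    T = (M * N) // (rho1 * rho2)       # number of blocks
    lsb = 4 * T * rho1 * rho2          # additive noise, 4 least-significant bit-planes
    init = 4 * T * 2                   # two noise-vectors per most-significant bit-plane
    svd = 4 * T * 3 * rho1 * rho2      # 3 SVD components per most-significant bit-plane
    rv_bit = lsb + init + svd
    rv_total = rv_bit + M * N          # second security layer adds M x N values
    return rv_bit, rv_total

# Illustrative example: 200 x 200 image, 40 x 40 blocks (T = 25 blocks)
rv_bit, rv_total = count_random_values(200, 200, 40, 40)
# For an RGB image the count triples: 3 * rv_total
```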

B. Visual comparison with varying block-size
After theoretically analyzing the total number of random values in an encrypted image, this section qualitatively compares encrypted images obtained with varying encryption parameters such as block-size. The differences between the encrypted versions of a gesture image can be visualized in Fig. 7. A trade-off between the block-size and the level of randomness in an encrypted image can easily be observed. For instance, an encrypted image obtained with block-size 5 × 5 reveals edge and hand-shape (local) features, whereas an encrypted image with block-size 40 × 40 leaks no original image information (local or global features) and depicts only noisy blocks. In other words, an adversary cannot extract sensitive information from encrypted images with large block-sizes, while small block-sizes leak some local structure.

C. Pixel-known attack
In this attack, an adversary knows the frequency of each pixel-intensity value in an encrypted gesture image I Enc. The task is to extract the frequency of each pixel intensity of the original gesture image I. This attack measures the total amount of diffusion and confusion in I Enc compared to I. Therefore, the frequencies of I Enc must be unrelated to those of I and as uniform as possible. These frequencies explain the contrast, brightness, and saturation effects in an image, and are analyzed through their graphical representation. For experimentation, we compare the frequencies of a 200 × 200-dimensional ASL gesture image I and its encrypted form I Enc obtained with block-size 40 × 40, as depicted in Fig. 8 (a). The channel-wise (red-green-blue) frequencies of I and I Enc are shown in the first and second columns, respectively. It can easily be observed that the histograms of each channel of I Enc are nearly uniform and drastically different from the histograms of I.
In a similar attack, an adversary can perform frequency equalization on the histograms of I Enc to obtain equalized histograms and compare them with an existing histogram database in the PD to retrieve the most similar image. Therefore, we perform histogram equalization over the frequency histograms of I Enc, as shown in the third column of Fig. 8, and compare them with the respective channel histograms of I. It can be noticed that the equalized frequencies of each color channel are significantly different from the original frequencies, leaking zero image information.
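The equalization step of this attack can be sketched in NumPy. This is a generic histogram-equalization routine standing in for the adversary's procedure, applied to a random array as a toy stand-in for a real encrypted channel from Fig. 8:

```python
import numpy as np

def equalize_histogram(channel):
    """Histogram-equalize one 8-bit color channel, as an adversary
    would before matching against a plain-domain histogram database."""
    hist = np.bincount(channel.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(np.float64)
    # Normalize the cumulative distribution to [0, 1], then build a
    # 256-entry lookup table mapping old intensities to new ones.
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())
    lut = np.round(cdf * 255).astype(np.uint8)
    return lut[channel]

# A near-uniform encrypted channel stays near-uniform after equalization,
# so the adversary recovers nothing about the original image.
rng = np.random.default_rng(0)
enc = rng.integers(0, 256, size=(200, 200), dtype=np.uint8)
eq = equalize_histogram(enc)
```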

D. Cipher-known difference attack
In this attack, an adversary considers two different encrypted images, say I Enc1 and I Enc2, corresponding to the same gesture image I, and attempts to extract the pixel intensities of I from I Enc1 and I Enc2. Therefore, we compute the pixel-wise intensity difference between I Enc1 and I Enc2, as shown in Fig. 9(b)-(d), with block-sizes 10 × 10, 25 × 25, and 40 × 40, respectively. We observe that the difference image (third column) of the encrypted images (first and second columns) is non-constant and non-zero, resembling an encrypted image itself. This ensures that the proposed encryption scheme generates different random values each time and hence significantly different encrypted images. However, the range of pixel intensities in the difference images is narrow because our approach adds pixel-wise noise to gesture images without changing the pixels' locations.
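The check above can be sketched in a few lines. The random arrays below merely stand in for the two encryptions of the same image; the real I Enc1 and I Enc2 come from the proposed scheme:

```python
import numpy as np

def cipher_difference(enc1, enc2):
    """Pixel-wise absolute intensity difference between two encryptions
    of the same gesture image. For a probabilistic scheme this should be
    a non-constant, mostly non-zero noise image."""
    # Widen to int16 first so the subtraction cannot wrap around in uint8.
    return np.abs(enc1.astype(np.int16) - enc2.astype(np.int16)).astype(np.uint8)

rng = np.random.default_rng(1)
enc1 = rng.integers(0, 256, size=(200, 200), dtype=np.uint8)
enc2 = rng.integers(0, 256, size=(200, 200), dtype=np.uint8)
diff = cipher_difference(enc1, enc2)
# A non-zero standard deviation of `diff` confirms the two encryptions differ.
```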

E. Mean-square Error
This section evaluates the mean square error (MSE) to analyze the deviation in pixel intensity between an original SL image I and its encrypted form I Enc. The MSE is defined as

MSE = (1/(M × N)) Σ_{m=1}^{M} Σ_{n=1}^{N} [I(m, n) − I Enc(m, n)]²,

where I(m, n) and I Enc(m, n) denote the intensity values of I and I Enc at pixel location (m, n), respectively. Analytically, a high MSE indicates large changes in the global features of an encrypted image relative to the original image, and vice-versa. Experimentally, we considered 100 images from each of the ASL, LSA64, and RWTH datasets and computed the MSE between the original images and their encrypted versions with varying block-sizes. The average MSE values for each dataset are reported in Table VI.
Since the different encryption levels in the proposed image encryption scheme utilize pseudo-random values to generate noise, it is also worthwhile to report the average MSE between two encrypted images, say I Enc1 and I Enc2, of the same original image I, as shown in Table VII. It can be observed in Tables VI and VII that the MSE values are small for small block-sizes and large for large block-sizes. This supports our claimed trade-off between block-size and image security in the proposed encryption scheme: security increases with block-size.
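The MSE above is straightforward to compute; a minimal sketch (the synthetic test images below are illustrative, not drawn from the ASL, LSA64, or RWTH datasets):

```python
import numpy as np

def mse(original, encrypted):
    """Mean square error between an image and its encrypted form,
    averaged over all M x N pixel locations."""
    o = original.astype(np.float64)
    e = encrypted.astype(np.float64)
    return np.mean((o - e) ** 2)
```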

F. Pixel-wise Information Entropy
Information entropy, proposed by Claude Shannon [46], analyzes the total amount of randomness in the probability distribution of intensity values in an image. The more uniform and random the distribution of intensity values, the higher the information entropy. For possible pixel-intensity values {0, 1, ..., L − 1} in an N-bit image I, where L = 2^N, the entropy, denoted by H(I), is defined as

H(I) = − Σ_{l=0}^{L−1} P(l) log₂ P(l),

where P(l) is the probability of intensity value l in I. The entropy range for an N-bit gray-scale image is [0, N], where H(I) close to N indicates high security efficiency of the encryption scheme with a low probability of information leakage, and vice-versa. It also indicates the amount of confusion generated by the encryption scheme to obfuscate the image's visual information.
Experimentally, we computed the entropy for an 8-bit SL image and its encrypted versions with varying block-sizes, as reported in Table VIII. We observe that the entropy of each red-green-blue channel lies in the range [7.7, 7.9] even with block-size 10 × 10, and approaches 8 for large block-sizes such as 40 × 40, again exhibiting the block-size and randomness trade-off. Moreover, the proposed scheme achieves values comparable with existing chaos-based image encryption methods, as presented in Table IX. These existing methods can only obfuscate image information for storage and transmission purposes, whereas our scheme also supports secure computation on an encrypted image.
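The entropy H(I) defined above can be computed directly from the intensity histogram; a minimal sketch for an 8-bit single-channel image:

```python
import numpy as np

def entropy(image, bits=8):
    """Shannon entropy H(I) of an N-bit image; approaches `bits` for a
    near-uniform intensity distribution (well-encrypted image) and 0 for
    a constant image."""
    hist = np.bincount(image.ravel(), minlength=2 ** bits)
    p = hist / hist.sum()
    p = p[p > 0]  # intensities with zero probability contribute nothing
    return -np.sum(p * np.log2(p))
```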

G. Differential analysis
This analysis examines the total change between encrypted images I Enc1 and I Enc2 obtained by making a small variation in the original SL image I. In other words, we compute the difference between I Enc1 and I Enc2, where I Enc1 and I Enc2 are obtained before and after changing a single pixel-intensity value in I, respectively. The difference, denoted by D, is defined as

D(m, n) = 0 if I Enc1(m, n) = I Enc2(m, n), and D(m, n) = 1 otherwise.

Then the difference percentage, denoted by DP, between I Enc1 and I Enc2 is evaluated as

DP = ((Σ_{m=1}^{M} Σ_{n=1}^{N} D(m, n)) / (M × N)) × 100%.

We evaluate DP for encrypted RGB-color SL images with varying block-sizes and report the channel-wise DP in Table X. The obtained values lie in the range of 90-98%, which signifies that the encrypted images' intensities vary drastically with a single-pixel variation in the original image. Further, the obtained DP values are comparable with existing state-of-the-art encryption schemes, as presented in Table XI. Similar to entropy, the DP values exhibit a trade-off: they increase with larger block-sizes and decrease with smaller ones.
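The two formulas above reduce to a few NumPy operations. A minimal sketch (the inputs are any two same-shape encrypted channels; the tiny arrays in the comment are only for illustration):

```python
import numpy as np

def difference_percentage(enc1, enc2):
    """DP between two encrypted images obtained before and after a
    single-pixel change in the original: D(m, n) = 1 where the two
    ciphertexts differ, and DP is the percentage of such pixels."""
    d = (enc1 != enc2).astype(np.float64)  # the indicator image D
    return 100.0 * d.mean()                # (sum D / (M * N)) * 100%

# e.g. two 2x2 channels differing in two of four pixels give DP = 50.0
```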

VI. TIME COMPLEXITY ANALYSIS
Besides high recognition efficiency, the encryption scheme's time complexity is also a vital requirement for cloud-based services, which we discuss in this section. We report the encryption time for RGB-color SL images of different dimensions, namely 32 × 32, 128 × 128, 256 × 256, and 512 × 512, with encryption block-sizes 5 × 5, 10 × 10, 20 × 20, 25 × 25, and 40 × 40, on the system configuration defined in Section IV. Owing to the chaotic behaviour of Chen's system, its solution set S is computed once for the complete dataset and reused to initialize the random noise-vectors (ref. Algorithm 2) for image encryption. Now, we compute the time required to encrypt a 32 × 32-dimensional RGB-color SL image with block-size 5 × 5. The total number of 5 × 5-dimensional blocks is 49 (using the same-padding technique). Each block requires 0.011 seconds to encrypt; thus, 49 blocks require a total of 49 × 0.011 = 0.539 seconds. Moreover, the time required for pre-processing and post-processing operations, such as partitioning the image into (R, G, B) channels and generating the 49 blocks of dimension 5 × 5, followed by concatenating the encrypted blocks and channels to produce a complete 32 × 32 encrypted image, is 1.831 seconds. Therefore, the total encryption time (under the above assumptions) is 0.539 + 1.831 = 2.37 seconds. The same image with block-size 40 × 40 requires only 0.14 seconds, because a large block-size partitions the image into fewer blocks, leaving fewer blocks to encrypt, and vice-versa. Similarly, we compute the encryption times of RGB-color SL images of dimensions 128 × 128, 256 × 256, and 512 × 512 with varying block-sizes, as shown in Table XII.

VII. CONCLUSION

In this paper, a privacy-preserving system, namely P2SLR, for end-to-end training of a D-CNN-based SLR architecture on encrypted SL data over a cloud platform is developed.
The individual's identity, revealed by the visual information in the original data, is protected through the proposed probabilistic block-based bit-plane image encryption scheme, which combines the FOCS and SVD without altering pixel locations. Additionally, the scheme partially preserves the gesture's local features, such as hand kinematics and facial expressions, in the encrypted form, while significantly obfuscating the global (non-gesture) features. For the recognition experiments, we trained variants of the ResNet framework with a modified classifier that performs dual adaptive pooling over the feature-extractor layer. ResNet50 outperformed the other architectures on the three encrypted SL gesture datasets, namely ASL, LSA64, and RWTH, achieving recognition accuracy in the range of 90.76%-98.09%. Moreover, P2SLR is proven secure both theoretically and through various qualitative and quantitative image-cryptographic measures. We also observed an inversely proportional trade-off between recognition accuracy and security level with varying block-sizes. To the best of our knowledge, P2SLR is a first-of-its-kind method to develop a secure SLR system as a cloud service.

ACKNOWLEDGMENT