Deep-BEJT: A New Human Activity Recognition System based on Beta Elliptical Joint Trajectory (BEJT) and Long Short Term Memory (LSTM)

This article introduces a novel approach for Human Activity Recognition (HAR) from 3D skeletal sequences captured by the Kinect sensor, which can operate in low light and simplifies the problems of body detection and human tracking. Recent approaches to 3D human activity recognition have achieved outstanding performance and attracted much interest thanks to the effectiveness of skeleton data; in particular, joint trajectories are a common source of input for HAR models. The 3D skeleton model removes the view, scale and background variations that affect both the original and the depth image. The Beta elliptic model can be employed in a HAR system thanks to its power to characterize a joint trajectory by merging dynamic and geometric descriptions. We therefore propose a deep learning approach to action recognition based on a 3D skeletal representation, the Beta elliptic model and a special kind of Recurrent Neural Network called Long Short-Term Memory (LSTM). The Beta Elliptical Joint Trajectory (BEJT) is a novel feature that describes both the geometric and the dynamic characteristics of skeletal sequences with elliptical arcs. In addition, the classification of human activities is carried out with a two-layer LSTM framework, chosen for its ability to model long-term temporal dependencies automatically. Through an experimental evaluation of the proposed BEJT features and LSTM classifier on four data sets (UT-Kinect, MSR-Action3D, CAD-60 and CAD-120), we validate the performance of our method against existing ones.


Introduction
Human action recognition remains one of the major challenging and important tasks in computer vision and machine learning. It supports an extensive range of applications such as intelligent video surveillance, video understanding and human-computer interaction. Recognizing human actions aims to determine the actions from input sensor streams, where RGB [1, 2], depth [3, 4] and skeleton [5, 6] are three input types. The most popular input is RGB video, which has been widely studied. However, information representing human activities captured in 3D space is richer. Many sensors are used to collect information about a person performing activities, such as accelerometers [7], microphones [8] and video cameras [9]. Generally, depth cameras produce better-quality 3D depth data than that estimated from monocular video sensors. Skeleton information collected from the Microsoft Kinect is a good source of data because it is not affected by environmental light variations; it describes the body shape and simplifies the problems of body detection and human tracking [1]. It is easier to use skeletal data than RGB information in HAR systems because the skeleton model removes the view, scale and background variations that the original image and the depth image have. It also eliminates problems related to clothing texture, color and hair shape, which greatly simplifies the learning of action recognition itself.
In this work, a human action recognition system, exploiting the 3D skeletal joint positions provided by Microsoft Kinect, is proposed.
Our algorithm starts from a 3D skeletal representation as input. Based on the skeleton joint positions in each frame of a sequence, we extract the joint trajectory that represents the skeleton motion and construct the new BEJT feature using the Beta elliptic model [10]. The motivation for using the Beta elliptic model is its rich output in terms of kinematic, graphical and biometric data. Indeed, the Beta-elliptic model is characterized by a description that combines dynamic and geometric profile modeling [11]. In the current skeleton-based action recognition literature, Recurrent Neural Networks (RNN) [12, 13] have a great influence on processing video sequences. Several works [14, 15, 16] have successfully built well-designed multi-layer RNNs for recognizing actions based on skeletons [17]. However, there are often strong dependency relations among the skeletal joints in the spatial domain, and this spatial dependency structure is usually discriminative for action classification. In this paper, a Long Short-Term Memory (LSTM) network is adopted to take advantage of its powerful ability to model long-term contextual information in the temporal domain. It has been successfully applied to language modeling [18], RGB-based video analysis [19-27] and skeleton-based action recognition [28], [29], [30]. One of the problems in recognition is the availability of data. Accurate data, such as MoCap, CMU MoCap and HDM05 [31], are expensive to acquire. Recently, the Microsoft Kinect and other low-cost sensors have provided depth data with acceptable accuracy, and [32] developed a real-time process to obtain 3D joint positions from a single depth image. Thanks to extensive training on synthetic data, extracting the joint locations in real time has become a feasible task. With all these available data, human activity recognition based on joint trajectories reconstructed from 3D joint positions has become tractable. In this paper, we use the 3D skeletal joint locations to develop an efficient action recognition approach.
We introduce a novel descriptor named Beta Elliptic Joint Trajectory (BEJT) based on beta-elliptic modeling. The motivation for using the Beta elliptic model [10, 11] is its rich output in terms of kinematic, graphical and biometric data. Indeed, the Beta-elliptic model is characterized by a description that combines dynamic and geometric profile modeling [33]. We show that the proposed descriptor is efficient and discriminative on popular 3D action recognition data sets (MSR-Action3D, UT-Kinect, CAD-60, among others), and we evaluate its performance with an LSTM classifier, chosen for its capability to learn long-term dependencies, with the Beta elliptic joint trajectories as input. The main contributions of this paper can be summarized as follows. First, BEJT is a novel local descriptor that represents the skeleton movement based on the beta-elliptic model and joint trajectories. Second, we present an integrated system combining the advantages of the new dynamic features and a stacked LSTM model for skeleton-based action recognition. Last but not least, the proposed method achieves state-of-the-art performance on four benchmark data sets. The rest of this paper is organized as follows. In Section 2, we review related work on skeleton-based activity recognition that uses recurrent neural networks (LSTM) to model the temporal dynamics. Our approach is described in Section 3. Section 4 presents the data sets used and the experimental results. Finally, the paper is concluded in Section 5.

Related work
Human activity recognition is an active area of research, with many existing algorithms. Surveys by Weinland et al. [14] and Poppe [15] explore the vast literature on activity recognition. Here, we focus on recent approaches to 3D skeleton-based action recognition and on recent related advances in deep learning.

Skeleton-Based 3D Action Representation
The geometric structure of body motion patterns is determined by the construction of the skeleton. From a mechanical point of view, the joints of the human body are end points of bones of constant length. This observation has inspired much of the literature on human body pose estimation and action recognition [34], [35], [36]: by knowing the positions of multiple body parts, we want the machine to learn to discriminate among action classes. Technologies are evolving fast, and the recent wide diffusion of cheap depth cameras, together with the seminal work by Shotton et al. [37] on estimating the joint locations of a human body from a depth map, has given new stimulus to research on action recognition from body joint locations.
3D action representations based on the human skeleton can generally be split into three categories [38]: joints, joint groups, and joint dynamics. To capture the correlation of the body's joints in the joint representation, it is necessary to extract spatial descriptors [39], [40], geometric descriptors [41], [42], [43], [44], [45] or key poses [46], [47], [48]. Joint-group methods detect discriminative subsets of joints to distinguish actions. Joint-dynamics methods focus on modeling the dynamics of either subsets or all of the skeleton joints. In [49], 3D joint trajectories are projected into three 2D trajectories, and an oriented displacement histogram is calculated to describe them: each trajectory displacement contributes its length to a histogram of orientation angles. Chaudhry et al. [50] proposed dividing the full human skeleton into several body parts described by joints, comprising the upper body, lower body, left/right arms and left/right legs. A shape context feature is then computed from the directions of a set of equidistant points sub-sampled along the segments of each body part. Finally, the skeletal sequence is described as a set of time-series features (position, tangent and shape context). These time series are divided into different temporal scales modeled by linear dynamical systems, and all the estimated parameter series are used to describe the skeleton dynamics in the sequence. In [51], the authors characterized the 3D joint coordinates and their evolution over time as a trajectory on a Riemannian manifold, and human action recognition is cast as the problem of computing the similarity between trajectory shapes. Wang et al. [52] used color to encode the trajectories' dynamics and modeled the spatio-temporal data carried in a skeleton sequence through shape and texture.
They depicted the spatio-temporal information in three 2D images by coding the joint trajectories and their dynamics into color distributions. This method, called Joint Trajectory Maps, adopts ConvNets to learn discriminative features for recognizing human actions. Hai et al. [53] describe an action as a collection of time series of the 3D locations of the joints. Each action sequence is represented as a linear dynamical system assumed to have produced the 3D joint trajectories; in particular, an auto-regressive moving average (ARMA) model is adopted to represent the sequence. The dynamics captured by the ARMA model during an action sequence can be represented by means of the observability matrix, which embeds the parameters of the model. Therefore, two ARMA models can be compared in terms of their finite observability matrices. The subspace spanned by the columns of a finite observability matrix corresponds to a point on a Grassmann manifold.

Direct Acquisition of 3D Skeletal Data
Various commercial devices, including motion capture systems, time-of-flight sensors and structured-light cameras, enable the direct acquisition of 3D skeleton information. The human body models obtained by these devices are shown in Figure 1. Motion Capture (MoCap) systems [54] obtain 3D skeleton information by identifying and tracking markers attached to human joints or body parts. There are two important categories of MoCap systems, based either on inertial sensors or on visual cameras. In inertial-sensor-based MoCap systems, the rotation of a body part with respect to a fixed point is estimated by a 3-axis inertial sensor; collecting this information yields the skeleton data without any optical devices. Commercial MoCap systems provide software for collecting skeletal data, but such systems are expensive and cannot be used outside tightly controlled indoor environments. Compared to other sensors, the Microsoft Kinect camera offers an affordable alternative for obtaining depth information [55]. Moreover, a color camera is incorporated into the sensor to obtain registered color data. The color and depth information can be accessed through the Kinect SDK 2.0 or the OpenKinect library [56]. The Kinect camera offers a useful depth image resolution (512 x 424) at 30 Hz. In addition, this camera can estimate the positions of 25 human body joints to provide 3D skeleton information with good tracking accuracy. Using the Kinect SDK also brings several advantages: it provides 3D human skeletal data in real time using the estimation method described by Shotton et al. [57], and the sensor is inexpensive and eliminates the body tracking and background extraction problems.

Skeleton Based Action Recognition with RNN and LSTM Models
Du et al. [8] proposed an end-to-end RNN with handcrafted subnets. According to the human body structure, the raw positions of the body joints are divided into five parts, each fed into its own bidirectional RNN. As the number of layers increases, the network hierarchically fuses the representations extracted by the subnets into a higher-level representation.
The classical RNN has difficulty learning long-term representations of video sequences because of the exponential decay in retaining the context information of video frames [58]. To overcome this limitation, Long Short-Term Memory (LSTM) [59], a variant of the RNN, was created to learn the long-range dependency between the input frames and the output label. LSTM networks have achieved competitive performance on human activity recognition tasks [60], [61]. Zhu et al. [67] used an LSTM to represent a skeletal sequence as a feature vector; at each step, the LSTM input consists of the concatenated 3D locations of the skeletal joints in a frame. They modeled a feature manifold from a set of encoded feature vectors, and the manifold was then used to aid and regularize the supervised learning of an LSTM for action recognition from RGB video. In [68] and [69], a 2D spatio-temporal LSTM framework was created to explore concurrently the hidden sources of action-related context information in the temporal and spatial domains. They also proposed a trust gate mechanism to deal with the inaccurate 3D joint coordinates provided by depth sensors.

A Beta Elliptical Joint Trajectory (BEJT) System for HAR
The aim of building a human representation based on the 3D skeleton is to extract compact, discriminative descriptions of human pose from 3D skeletal data, which can be acquired either by devices that directly provide the skeletal information or by computational methods that build the skeleton. The human body is thus encoded as an articulated system of rigid segments connected by joints. In our method, the information in the raw joint positions is used to form a joint trajectory, and features are then extracted from this trajectory; such features are often called trajectory-based representations. The Beta-elliptic model is well suited here because it combines dynamic and geometric profile modeling in a single description. The proposed technique comprises three main stages: 3D skeletal data preprocessing, Beta Elliptic Joint Trajectory extraction, and a classification step based on a multi-layer LSTM.

Joint Position Estimation and 3D skeletal data preprocessing
The human body can be defined as a set of rigid segments connected by skeletal joints to form an articulated system, and a human action is considered a continuous evolution of the spatial configuration of these segments (i.e., body postures) [70]. Figure 3 shows an example of the 3D skeletal joints and the corresponding depth map.

Feature Extraction
Using the Microsoft Kinect SDK, we can easily obtain a 3D skeleton in real time by following the approach of Shotton et al. [57]. This skeleton contains the 3D positions of a number of joints representing different parts of the human body; the number of estimated joints depends on the SDK used in combination with the device, and in our approach it is 20. We use the joint positions as the gesture representation and model the dynamics of the full skeleton as a trajectory.

3D Joint Trajectory Construction
For each frame of a sequence, the real-world 3D position of each joint i of the skeleton is represented by three coordinates expressed in the camera reference system:

p_i(t) = (x_i(t), y_i(t), z_i(t))    (1)

The trajectory is defined by the motion over time of the feature point encoding the 3D coordinates of all the joints of the skeleton (or by all the feature points coding the body parts separately).
The 3D joint trajectory of a joint J is constructed from its positions in frame 1 to frame N, where N is the last frame of the skeletal sequence:

T_J = (p_J(1), p_J(2), ..., p_J(N))

To construct the 3D skeleton trajectory, we model the dynamics of all skeletal joints using the raw joint position information (the coordinates of each joint in every frame of the 3D skeletal sequence). Figure 3 shows the dynamics of some skeletal joints (Head, Right Hand and Spine). By concatenating the 3D coordinates of the skeleton joints, the data representation encodes the shape of the human posture at each frame, and the dynamics of human motion are captured by modeling the sequence of frame features along the action as a trajectory. Let N_j be the number of joints composing the skeleton; the posture of the skeleton at frame t is represented by the 3N_j-dimensional tuple:

c(t) = (p_1(t), p_2(t), ..., p_{N_j}(t))

For an action sequence composed of N_f frames, the N_f joint position vectors are arranged as columns of a skeleton joint coordinate matrix M_c of size 3N_j x N_f describing the whole sequence.
M_c = (c(1) c(2) ... c(N_f))    (5)

Thus, the skeleton joint coordinate matrix M_c represents the evolution of the skeleton joint poses over time. Each column vector is regarded as one sample of the continuous trajectory of the body pose.
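As a concrete illustration, the construction of M_c from raw joint positions can be sketched as follows (a minimal NumPy sketch; the array shapes and names are our own, not part of the original method):

```python
import numpy as np

def build_trajectory_matrix(seq):
    """Stack per-frame joint coordinates into the matrix M_c.

    seq: array of shape (N_f, N_j, 3) holding the 3D position of each
         of the N_j joints in each of the N_f frames.
    Returns M_c of shape (3*N_j, N_f): column t is the posture vector c(t).
    """
    n_f, n_j, _ = seq.shape
    # Flatten each frame into a 3*N_j posture vector, then use frames as columns.
    return seq.reshape(n_f, 3 * n_j).T

# Toy sequence: 2 frames, 3 joints.
seq = np.arange(18, dtype=float).reshape(2, 3, 3)
M_c = build_trajectory_matrix(seq)
```

Each column of `M_c` is then one posture sample c(t), and each row traces a single joint coordinate over time.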

Trajectory Filtering and Resampling
First of all, resampling concerns increasing the number of points defining the extracted skeletal trajectory when that number is insufficient to apply the Beta elliptical extraction module. The resampling preserves the velocity profile of the original trajectory. Physiological tremor is present in all normal, healthy subjects and appears in different conditions, such as task execution (motion or isometric contraction), posture maintenance and even rest. It can be observed in both the spatial and temporal domains. In the frequency domain, it can be divided into three ranges according to the tremor frequency: low frequency (less than 4 Hz), intermediate frequency (8-12 Hz) and high frequency (greater than 12 Hz). The most common physiological tremor has a frequency of 8-12 Hz [1]; this motivates our choice of cutoff frequency, f_cut = 8 Hz.
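A low-pass filter with f_cut = 8 Hz can be implemented in several ways; the following FFT-based sketch (our own illustration, assuming the 30 Hz Kinect frame rate) shows the idea of suppressing the tremor band:

```python
import numpy as np

def lowpass(signal, fs, f_cut=8.0):
    """Remove frequency components above f_cut (Hz) from a sampled signal.

    signal: array of shape (N,) or (N, D), sampled at fs Hz along axis 0.
    """
    spectrum = np.fft.rfft(signal, axis=0)
    freqs = np.fft.rfftfreq(signal.shape[0], d=1.0 / fs)
    spectrum[freqs > f_cut] = 0.0          # suppress components above the cutoff
    return np.fft.irfft(spectrum, n=signal.shape[0], axis=0)

# A 1 Hz movement contaminated by a 12 Hz tremor-like component.
fs = 30.0
t = np.arange(90) / fs
noisy = np.sin(2 * np.pi * 1.0 * t) + 0.3 * np.sin(2 * np.pi * 12.0 * t)
clean = lowpass(noisy, fs)
```

In practice an IIR filter (e.g. a Butterworth design) would be the more common choice; the FFT version keeps the sketch dependency-free.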

Beta Elliptical Joint Trajectory Reconstruction
In this paper, a new formalism is proposed to describe the trajectory of the skeleton by extending to 3D an efficient 2D representation called the beta-elliptic model [33]. The beta-elliptic approach to 3D skeletal movement representation consists in modeling the trajectory through two complementary profiles: dynamic and static. In the dynamic profile, overlapping Beta signals model the curvilinear velocity, whereas in the static profile, also called the geometric profile, elliptic arcs model the trajectory segments. Combining the features of each extracted Beta impulse with those of its corresponding elliptic arc makes it possible to decompose a complex trajectory into elementary segments called 'strokes'.
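For intuition, a generalized Beta velocity pulse has the following shape (a hedged sketch; the exact parameterization used in [10, 11] may differ, and the names t0, t1, tc, p, q are ours):

```python
import numpy as np

def beta_profile(t, t0, t1, tc, p, q):
    """Generalized Beta pulse: zero outside (t0, t1), value 1 at t = tc.

    When tc = (p*t1 + q*t0) / (p + q), the peak of the pulse falls at tc.
    """
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = (t > t0) & (t < t1)
    out[inside] = (((t[inside] - t0) / (tc - t0)) ** p *
                   ((t1 - t[inside]) / (t1 - tc)) ** q)
    return out

# p = 2, q = 3 with t0 = 0, t1 = 1 places the peak at tc = 2/5 = 0.4.
t = np.linspace(0.0, 1.0, 101)
pulse = beta_profile(t, t0=0.0, t1=1.0, tc=0.4, p=2.0, q=3.0)
```

Overlapping pulses of this family are what the dynamic profile sums to reproduce the curvilinear velocity.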
3D trajectory modeling based on the beta-elliptic representation proceeds in three steps. The first is the segmentation of the trajectory into strokes according to the extrema of the curvilinear velocity, i.e., its local minima and maxima and the double-inflection points of the curvilinear velocity pulse, defined by

v(t) = sqrt((dx/dt)^2 + (dy/dt)^2 + (dz/dt)^2)

The monotonic variation of the curvilinear velocity during an elliptical beta segment also guarantees a monotonic variation of the radius of curvature along the corresponding trajectory portion. Depending on the level of approximation, each segmented trajectory is then approximated by a planar trajectory: we assume that the trajectory does not change planes appreciably between velocity extrema, i.e., when passing from one segment to another, so that each trajectory segment is approximately a planar shape. Moreover, studies in the literature [70, 71, 72] on the planar trajectories of rapid movements of human bio-motor systems (arm-hand system, locomotion system, eyes, etc.) show that these trajectories are well approximated by elliptical arcs.
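The segmentation step above can be sketched as follows: compute the curvilinear velocity by finite differences and cut the trajectory at its local minima (an illustrative sketch; the paper's segmentation also uses double-inflection points, which are omitted here):

```python
import numpy as np

def curvilinear_velocity(traj, fs):
    """Speed along a sampled 3D trajectory (traj: (N, 3), fs: frame rate in Hz)."""
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    return steps * fs

def stroke_boundaries(velocity):
    """Indices of the local minima of the velocity, plus the two endpoints."""
    v = np.asarray(velocity)
    minima = [i for i in range(1, len(v) - 1)
              if v[i] <= v[i - 1] and v[i] < v[i + 1]]
    return [0] + minima + [len(v) - 1]

# A velocity profile with two bumps separated by a dip.
v = np.array([0.0, 1.0, 2.0, 1.0, 0.5, 1.0, 2.0, 1.0, 0.0])
bounds = stroke_boundaries(v)   # the dip at index 4 splits two strokes
```

Consecutive boundary pairs then delimit the strokes handed to the plane-fitting and ellipse-fitting steps.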
A given segment S_k is defined by the set of points of its sampled trajectory, S_k = {M_i, i = n, ..., m}. The plane P_k that approximately supports the trajectory segment S_k could be estimated precisely by plane regression using a least-squares algorithm. However, to avoid any discontinuity between the strokes of the reconstructed trajectory and to reduce computation time, a direct estimation method is adopted that takes as elements of P_k the following three points of the set S_k: the starting point M_n(x_n, y_n, z_n); the end point M_m(x_m, y_m, z_m); and the point M_j(x_j, y_j, z_j) of the segment S_k that is most distant from the straight line (M_n M_m), i.e., M_j in S_k maximizing the orthogonal distance to (M_n M_m). The plane P_k can be represented by the Cartesian equation:

a_Pk x + b_Pk y + c_Pk z + 1 = 0    (8)

The parameters a_Pk, b_Pk and c_Pk are obtained by solving the system:

a_Pk x_n + b_Pk y_n + c_Pk z_n + 1 = 0
a_Pk x_m + b_Pk y_m + c_Pk z_m + 1 = 0
a_Pk x_j + b_Pk y_j + c_Pk z_j + 1 = 0

The second step consists in projecting the segment onto the corresponding plane P_k. In reality, the trajectory of a segment S_k is not perfectly flat; to be approximated by an elliptic arc, it must first be projected orthogonally onto P_k. To preserve complete information about the tangential direction of the trajectory at the start and end points M_n and M_m, the trajectory neighborhoods just before and after the segment S_k are also projected onto the plane P_k.
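The three-point estimation of P_k amounts to solving a small linear system for (a_Pk, b_Pk, c_Pk) (a minimal sketch of the stated equations; the variable names are ours):

```python
import numpy as np

def fit_plane(m_n, m_m, m_j):
    """Solve a*x + b*y + c*z + 1 = 0 through three non-collinear points.

    Note: this parameterization cannot represent a plane through the origin.
    """
    A = np.array([m_n, m_m, m_j], dtype=float)   # one equation per point
    return np.linalg.solve(A, -np.ones(3))       # (a, b, c)

# Three points lying in the plane z = 1, i.e. 0*x + 0*y - 1*z + 1 = 0.
abc = fit_plane((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```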
The orthogonal projection onto the plane P_k is carried out along the direction of its normal vector (a_Pk, b_Pk, c_Pk). The set S'_k = {M'_i, i = n, ..., m}, the orthogonal projection of the trajectory segment S_k onto P_k, is composed of the points M'_i(x'_i, y'_i, z'_i) satisfying the plane equation. Hence, from the resulting system of linear equations, the coordinates of each point M'_i(x'_i, y'_i, z'_i) can be computed from those of M_i(x_i, y_i, z_i) and the parameters a_Pk, b_Pk and c_Pk of the plane P_k. The third step of the 3D trajectory modeling process is the estimation of the optimal elliptic arc that approximates the obtained set of points S'_k. Since the objective is to estimate the elliptic arc that approximates the projection of the trajectory segment S_k onto the plane P_k, we change the analytical description from the original 3D frame Rep_3D(O, ox, oy, oz) to a 2D description on the plane P_k.
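The orthogonal projection of the sampled points onto P_k follows directly from the plane equation (our own NumPy sketch):

```python
import numpy as np

def project_onto_plane(points, abc):
    """Orthogonally project 3D points onto the plane a*x + b*y + c*z + 1 = 0.

    points: array of shape (N, 3); abc: plane parameters (a, b, c).
    """
    n = np.asarray(abc, dtype=float)             # normal direction of the plane
    residual = points @ n + 1.0                  # signed plane-equation residual
    return points - np.outer(residual / (n @ n), n)

# Project points onto the plane z = 1 (parameters a = b = 0, c = -1).
pts = np.array([[2.0, 3.0, 5.0], [0.0, 0.0, 0.0]])
proj = project_onto_plane(pts, (0.0, 0.0, -1.0))
```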
The new orthonormal frame is defined by its origin M'_n, the direction of the axis M'_n M'_m, and the perpendicular direction in the plane. The unit of measurement on the two axes of the new 2D representation is the same as that used for distance measurements in 3D space. Thus, the points of the projected trajectory segment S'_k pass from a 3D description M'_i(x'_i, y'_i, z'_i) in the original frame Rep_3D to a 2D description in the frame Rep_2D_k associated with the plane P_k, where M''_i denotes the orthogonal projection of the point M'_i onto the straight line (M'_n M'_m).
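The change of frame from Rep_3D to Rep_2D_k can be sketched as follows (an illustration with our own names; the first axis follows M'_n M'_m and the second is its in-plane perpendicular):

```python
import numpy as np

def to_plane_frame(points, origin, endpoint, normal):
    """Express in-plane 3D points as 2D coordinates in the frame (origin, u, v).

    u points from origin towards endpoint; v is perpendicular to u in the plane.
    """
    u = np.asarray(endpoint, float) - np.asarray(origin, float)
    u /= np.linalg.norm(u)
    v = np.cross(np.asarray(normal, float), u)   # in-plane perpendicular axis
    v /= np.linalg.norm(v)
    d = np.asarray(points, float) - np.asarray(origin, float)
    return np.stack([d @ u, d @ v], axis=-1)

# Points in the plane z = 1, origin (0, 0, 1), first axis towards (2, 0, 1).
coords_2d = to_plane_frame([[3.0, 4.0, 1.0]],
                           (0.0, 0.0, 1.0), (2.0, 0.0, 1.0), (0.0, 0.0, 1.0))
```

Distances are preserved because (u, v) is orthonormal, matching the statement that the 2D unit of measurement equals the 3D one.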
Then, to estimate the arc of the ellipse that approximates the projected trajectory segment S'_k described in the 2D frame Rep_2D_k, the two-tangent-endpoints method is used. We present this method below.
Joint trajectory modeling based on an elliptical path limited by the two-tangents method. In this method, the limit point M_1 of the trajectory, which coincides with the local maximum of velocity, is taken as the endpoint of the minor axis of the ellipse modeling the trajectory segment M_1 M_2. Thus, the direction of the ellipse's major axis is parallel to the tangent to the trajectory at M_1. Then, contrary to the previous approach, the endpoint of the major axis of the ellipse is not fixed at the other limit point M_2 of the trajectory segment; instead, the ellipse parameters are calculated from the position of the point M_2 and the angle α, as explained below.
Let M(X_M, Y_M) be a point on the ellipse defined in an orthogonal reference frame (O, X, Y) in which one axis is parallel to the major axis; it verifies the canonical ellipse equation

(X_M / a)^2 + (Y_M / b)^2 = 1

The angle β between the ellipse's major axis and the tangent to the ellipse at the point M(X_M, Y_M) satisfies

tan(β) = -(b^2 X_M) / (a^2 Y_M)

Assuming that the trajectory segment endpoint M_2(X_M2, Y_M2) belongs to the ellipse and that the tangent to the ellipse at this point forms an angle α with respect to the direction of the minor axis, a relation is obtained involving a_0 and b_0, the semi-major and semi-minor axes of the quarter ellipse defined by the points M_1 and M_2, which is tangent to the modeled trajectory at M_1.
The parameters a and b of the ellipse are then evaluated as functions of a_0, b_0 and α from equations (18) and (19), yielding equations (20) and (21). The coordinates of the center C(x_0, y_0) of this ellipse in the original coordinate system O(i, j) are calculated from the coordinates of M_1(x_1, y_1) and M_2(x_2, y_2) in that system, together with the inclination angle θ_0 of the tangent at the point M_1.
It appears that the calculation of the ellipse parameters with this approach takes into account the positions of the two endpoints of the arc, M_1(x_1, y_1) and M_2(x_2, y_2), and the inclination angles of the tangents at these points, θ_1 = θ_0 and θ_2 = θ_0 + β = θ_0 + π/2, hence the name of the presented method.
Finally, the 2D analytical description of the elliptic arc approximating the path segment S_k on the plane P_k is mapped back to the 3D reference frame. The change of reference frame for the coordinates of the ellipse center C(x_C_2D, y_C_2D) is computed by the inverse of the 2D planar transformation. In the 2D reference frame, θ_2D is the angle between the ellipse's major axis and the first coordinate axis of Rep_2D_k, whereas in the 3D reference frame, θ_xy and θ_xz are the inclination angles of the projections of the elliptic arc's major axis on the planes (oxy) and (oxz), respectively. The lengths of the major axis a and the minor axis b remain unchanged. In the spatial domain, each trajectory stroke fitted by an elliptic arc is thus defined by a set of parameters; Table 1 contains a complete description of the model parameters. Each joint trajectory stroke is modeled by a feature vector V combining geometric and dynamic characteristics:

V = {ellipse {a, b, c, θ_xy, θ_xz}, normal vector to the plane P_k {a_Pk, b_Pk, c_Pk}}

HAR based on LSTM
To put our proposal for human action recognition into context, we first review a special type of RNN called the Long Short-Term Memory (LSTM) network.
The choice of LSTM is due to the well-known vanishing gradient problem of the simple RNN [39]. LSTMs are designed to capture long-distance dependencies within sequence data: the contextual semantics of the information are retained and stored in order to obtain long dependencies between data points. For this purpose, special memory units called cells are used to store information over long-range contexts. Each LSTM unit also contains data-driven gates (input, forget and output gates) that determine which pieces of information to remember, to forget and to pass on to the next step. In this way, the LSTM gains the ability to decide what to store and when to allow reads, writes and deletions, via gates that pass or block information through the unit.
The formulas of the LSTM unit are as follows:

i_t = σ_i(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
f_t = σ_f(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_xc x_t + W_hc h_{t-1} + b_c)
o_t = σ_o(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
h_t = o_t ⊙ tanh(c_t)

where i, f, o and c represent the input gate, forget gate, output gate and cell activation vectors, respectively; all of them have the same size as the hidden value h_t (i.e., the memory state of the block). σ_i, σ_f and σ_o are the non-linear functions of the input, forget and output gates, respectively. W_xi, W_hi, W_ci, W_xf, W_hf, W_cf, W_xc, W_hc, W_xo, W_ho and W_co are the weight matrices of the respective gates, where x and h are the input and the hidden value of the LSTM block, respectively. b_i, b_f, b_c and b_o are the bias vectors of the input gate, forget gate, cell and output gate, respectively [74]. Traditional RNNs only mine short-term dynamics and are unable to discover relations among long-term inputs; to alleviate this drawback, a typical LSTM model is employed to classify the human actions. The LSTM is a powerful deep model for time-series classification thanks to its capability to interpret the dynamics in time series and its ability to overcome the RNN's vanishing gradient problem.
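A single step of this unit can be written directly from the equations (a minimal NumPy sketch with the peephole connections W_ci, W_cf, W_co taken elementwise, as in Graves' formulation; the random parameters are for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    """One LSTM step; P maps parameter names to weight matrices / bias vectors."""
    i = sigmoid(P['Wxi'] @ x + P['Whi'] @ h_prev + P['Wci'] * c_prev + P['bi'])
    f = sigmoid(P['Wxf'] @ x + P['Whf'] @ h_prev + P['Wcf'] * c_prev + P['bf'])
    c = f * c_prev + i * np.tanh(P['Wxc'] @ x + P['Whc'] @ h_prev + P['bc'])
    o = sigmoid(P['Wxo'] @ x + P['Who'] @ h_prev + P['Wco'] * c + P['bo'])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 8, 4                       # e.g. one stroke feature vector as input
P = {k: 0.1 * rng.standard_normal(
         (d_h, d_in) if k.startswith('Wx')
         else (d_h, d_h) if k.startswith('Wh')
         else d_h)                     # peephole weights and biases are vectors
     for k in ('Wxi', 'Whi', 'Wci', 'bi', 'Wxf', 'Whf', 'Wcf', 'bf',
               'Wxc', 'Whc', 'bc', 'Wxo', 'Who', 'Wco', 'bo')}
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), P)
```

Since h_t = o_t ⊙ tanh(c_t), every component of the hidden state stays strictly inside (-1, 1).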
As shown in Figure 7, we build a multi-layer LSTM framework. For the first LSTM layer, we use the beta elliptical joint trajectories as the input x_n, where n is the number of joints in the human skeleton, and the output h_t of each lower LSTM layer serves as the input of the upper LSTM layer. The output of the highest LSTM layer is fed to a softmax layer that transforms the output codes into probability values over the class labels. The probability that a sequence V belongs to the class C_k is:

p(C_k | V) = exp(o_k) / Σ_j exp(o_j)    (25)

where o_k is the k-th component of the final-layer output.
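This class-probability computation is the standard softmax; a minimal, numerically stable sketch:

```python
import numpy as np

def softmax(scores):
    """Map final-layer outputs to class probabilities (numerically stable)."""
    z = scores - np.max(scores)        # shifting the scores leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

# Scores for three hypothetical action classes.
probs = softmax(np.array([2.0, 1.0, 0.1]))
```

The predicted label is then simply the class with the highest probability.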

Experimental results and discussion
To evaluate the performance of the proposed HAR system, we used four data sets: UT-Kinect, MSR-Action3D, CAD-60 and CAD-120.
Based on the skeleton information recorded with the Microsoft Kinect, after extracting the BEJT features and constructing the multi-layer LSTM network, we train this network for human activity recognition. Once training is complete, the class with the highest probability according to equation (25) is assigned to each test sequence. For these experiments, the learning rate is α = 0.6, both the batch size and the number of hidden LSTM units are set to 100, the forget bias is set to 1.0 and the dropout rate is set to 0.5.
Descriptions of each data set, the experimental results and a comparison of our system with state-of-the-art methods are given below.

Results on UT-Kinect data set
The UT-Kinect [75] data set is captured by a single stationary Kinect and contains 200 sequences of 10 classes performed by 10 subjects from varied views. Each subject performs each activity twice, and each frame contains 20 skeleton joints. To divide the data set, we follow the half-vs-half protocol, where half of the subjects are used for training and the remainder for testing [51]. Table 2 compares the recognition accuracy of the proposed BEJT method with that of state-of-the-art methods on the UT-Kinect data set. Our system based on beta elliptical joint features and a two-layer LSTM shows significantly better recognition accuracy than the other methods.

Fig. 8. Confusion matrix for the UT-Kinect data set
The confusion matrix of the classification results confirms the performance of the proposed system. Most human action classes obtain classification rates above 90% with the BEJT-LSTM method. Some skeletal sequences are nevertheless misclassified, such as the sit down class, of which 16.70% are classified as stand up. This misclassification can be explained by the use of the same joints of interest, which can produce similar BEJT feature vectors, or by movement of the Kinect sensor while recording the sequences.

Results on MSR Action 3D data set
The MSR-Action3D data set [76] is generated by a Microsoft Kinect and is widely used in action recognition. This data set consists of 20 actions performed by 10 subjects two or three times each. 557 samples are valid, and each frame contains 20 skeleton joints. Following protocol A, we used subjects 1, 3, 5, 7 and 9 for training and subjects 2, 4, 6, 8 and 10 for testing. Table 3 reports the accuracy obtained by the proposed BEJT+LSTM architecture on the MSR-Action 3D data set together with that of other works. As can be seen, the accuracy obtained by our proposed system, 95.20%, outperforms the previous works found in the literature.

Table 3: Comparison of recognition accuracy on the MSR-Action 3D data set. Bold figure means the best performance.

Method                                                     Recognition Accuracy
3D positional pairwise differences of joints + HMM [22]    82.0%
Joint Location + HMM [45]                                  89.23%
Body part + SVM [47]                                       89.48%
STIP Trajectory + LSTM [62]                                90.36%
Joint Trajectory + LSTM [5]                                91.21%
Multi-fused-feature + HMM [41]                             93.3%
BEJT + LSTM                                                95.20%

The confusion matrix in figure 9 shows that thirteen classes are 100% correctly classified, four classes obtain classification rates higher than 90%, and only three classes obtain an accuracy rate below 85%.

Fig.9. Confusion matrix for the MSR Action 3D data set

Results on CAD 60 data set
The CAD-60 [77] data set is a publicly available data set captured by the Kinect sensor. Three types of data are available: the 3-D locations of the 15 tracked skeleton joints, the RGB frames and the depth maps. It consists of 12 human daily life activities performed by four subjects in five different environments.

Table 4 reports the accuracy obtained by the proposed BEJT+LSTM architecture on the CAD60 data set together with that of other works. As can be seen, the accuracy obtained by our proposed system outperforms the previous works found in the literature.

Table 4: Comparison of recognition accuracy on the CAD60 data set. Bold figure means the best performance.

Method                                                 Recognition Accuracy
Joint Coordinates + Naïve-Bayes-Nearest-Neighbor [3]   71.9%
Key pose + RF [46]                                     76.6%
Key pose + RF [45]                                     80.02%
STIP + SVM [62]                                        87.5%
BEJT + LSTM                                            90.5%

In addition, for a better comparison, figure 10 depicts the confusion matrix obtained when testing the action recognition algorithm based on BEJT features.
Most of the human action classes obtain classification rates higher than 90%, such as rinsing mouth, working on computer and cooking. Our system confuses some classes, such as drinking water and brushing teeth. This confusion is due to the similarity of the input data: both activities involve the same movement of the right hand.

Results on CAD 120 data set
The CAD-120 [78] data set contains 120 RGB-D video sequences representing the daily living activities of four human subjects (two are male and two are female), recorded with a Microsoft Kinect camera. Each video is labeled with a single high-level activity name. The data set provides skeleton tracks of the people in the scene, and the 3-D locations of the tracked skeleton joints are available.
When we compare our accuracy results based on the BEJT+LSTM architecture with state-of-the-art works in table 5, we find that our results are as good as the best of the considered works.

Table 5: Comparison of recognition accuracy on the CAD120 data set. Bold figure means the best performance.

Conclusion
In this paper, we presented a deep learning approach for human action recognition based on a 3D skeletal representation. This effective approach achieves high accuracy on the relevant benchmark data sets. Two factors are considered as the keys to our performance. The first is the BEJT feature, which describes both the geometric and dynamic characteristics of skeletal sequences with elliptical arcs that decompose a complex skeleton trajectory into strokes. The second is the LSTM used for the final recognition: this special kind of recurrent neural network has the capability to model long-term temporal dependencies automatically. We validated our model on four public benchmark data sets: MSR Action 3D, UT Kinect, CAD60 and CAD120. Experimental results demonstrated that our approach leads to a significant improvement in classification performance. In the future, we will present a more refined level of 3D joint trajectory approximation that defines not only an ellipse to support the trajectory but also an ellipsoidal spindle.