Temporal Order and Pen Velocity Recovery for Character Handwriting Based on a Sequence-to-Sequence Model with Attention

ABSTRACT. Online signals are rich in dynamic features such as trajectory chronology, velocity, pressure and pen up/down movements, whereas their offline counterparts consist only of a set of pixels. Thus, online handwriting recognition accuracy is generally better than offline. In this paper, we propose an original framework for recovering the temporal order and pen velocity of offline multi-lingual handwriting. Our framework is based on an integrated sequence-to-sequence attention model: a Convolutional Neural Network (CNN) and a Bidirectional Gated Recurrent Unit (BGRU) extract a hidden representation from the image, and a BGRU with temporal attention decodes the encoded vectors to generate the dynamic information. We validate our framework using an online recognition system applied to benchmark Latin, Arabic and Indian on/off dual-handwriting character databases. The performance of the proposed multi-lingual system is demonstrated by a low error rate on point coordinates and a high recognition rate.

* Corresponding author. e-mail: besma.rebhi.2015@ieee.org.


Introduction
Handwriting analysis has been an active area of research, covering handwriting recognition [11,13], writer identification [8], and signature verification [7,14]. Depending on the acquisition technique, handwriting falls into two categories: online and offline. Online handwriting requires special digital devices and represents the data as a succession of points ordered in time. This one-dimensional signal therefore carries dynamic features: the temporal order, the pen velocity, the pressure, and the pen up/down movements. In contrast, offline handwriting is captured from paper with a camera or a scanner. Online devices are thus more expensive than offline ones, but online handwriting has become an attractive choice thanks to its dynamic features, which make more information available to recognition systems. The pen velocity in particular enriches the online information and improves recognition accuracy: in [29], the authors demonstrated the effectiveness of velocity for recognizing Arabic handwritten characters, obtaining 98.8% for a resampled online signal against 95.8% for an online signal without velocity. Offline handwriting, being a set of static images, also requires more storage than online data, and its pixels carry no dynamic information. In general, the dynamic features available to online systems make them more effective than offline systems; Qiao et al. [28] confirmed this using recognition rates as the evaluation metric, obtaining 96% for online digits against 90% for offline images. To exploit the advantages of both offline and online processing, researchers have
proposed many methods to recover the temporal order from static handwriting images. Reconstructing the drawing order has been studied since the nineties [3,5]. The process is generally based on several steps: preprocessing, ambiguous-zone selection, terminal-point detection, and searching for the smoothest path [25]. According to Rousseau et al. [32], each step affects the next one. Moreover, these methods suffer from assumption problems, because the writing direction differs across languages. In [25], the authors affirmed that recovering the trajectory chronology was promising, but there was no way to recover some dynamic information such as the pen velocity. Deep learning can handle these problems without complicated algorithms or hand-crafted assumptions. Recurrent networks with memory [17], able to handle long-term sequential tasks, have achieved great success; among these investigations, image captioning [37] reached the level of translating images into text. Motivated by this, we assume that Sequence-to-Sequence (Seq2Seq) models with attention have great potential to become the new state of the art for handwriting recovery. Our framework contains: a) a Convolutional Neural Network (CNN) to extract low-level features, b) a Bidirectional Gated Recurrent Unit (BGRU) to encode the extracted features into a single vector, and c) a BGRU with an attention model to decode the encoded features into ordered coordinates. To the best of our knowledge, we are the first to implement a Seq2Seq-BGRU model with attention for temporal-order and pen-velocity recovery. The major contributions of this work are as follows:
- Investigating a novel Seq2Seq model with attention based on a BGRU network to predict dynamic information from static handwriting images;
- Recovering, for the first time, the pen velocity in addition to the temporal order;
- Providing an end-to-end system able to recover multilingual characters, so that no assumptions are made about the pen order.
The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 describes the proposed framework. Section 4 discusses the implementation and the obtained results. Finally, Section 5 provides the implications of the study and the conclusion.

Related work
Work reported in the literature has recovered the temporal trajectory order following one of two categories: contour-based or skeleton-based approaches. The contour technique [10,34] suffers from high computational time. For example, in [34], the authors were interested in loop analysis; they processed different models for loop types and performed a thorough loop-contour analysis. Nevertheless, the effectiveness of the proposed investigation of loops was not clear enough, as they did not show any practical handwriting recognition results and the evaluation time was high. The skeleton approach, on the other hand, has given good results and faster responses compared to the contour method [9,12,21,28,32]. Based on the edge continuity relation [28], the authors proposed three main steps. First, they identified the relations at each node; for nodes of degree four they used a neural network, otherwise they relied on assumptions. Second, they selected double-traced lines using maximal weighted matching. Finally, they searched for the smoothest possible path through all the curves of the handwriting graph model, selecting the smoothest among the optimal Euler paths. However, their work was applied to single strokes in a single language. In [32], the authors used handwriting knowledge to propose possible start/end points; different paths were then produced and the best one was chosen. Their approach was applied to multi-stroke letters, and they reported a good recognition rate to demonstrate its performance. Even so, the assumptions they used were based on the Latin language only. Besides the contour and skeleton categories, handwriting recovery methods can also be divided into two groups: local and global search methods. The goal of local tracing is to search for the smoothest path at each ambiguous zone based on the tracing history and the current configuration [3,7]. The major limitation of this method is that designing heuristic rules applicable to different handwriting styles is difficult. This limitation can be overcome by the global graph technique, which creates a graph model of the input skeleton image and then uses a search technique to find an optimal path through the text [5,12,21,28,32]. The drawback of the global method is a high computational time that depends on the complexity of the search algorithms, and some cases remain hard to handle. For example, Phan et al. [26] used a greedy algorithm to search for the optimal path in a global model. Their work relied on limited assumptions about start/end points, ambiguous zones and double-traced segments, which made it difficult to obtain the right trajectory. In [9], the authors considered start-point detection and skeleton separation to be hard tasks, in addition to the high complexity of searching for the smoothest path at junction zones. Based on the skeleton graph, they separated touching characters and crossing strokes, and the optimal path was fixed by a greedy algorithm. Their model was still sensitive to the processed language, and it shared the common problem of being slow and complex [19,28,32]. However, it is not clear whether the local method can achieve a more effective performance than the global method; both suffer from problems.
Consequently, some existing work has combined these two tracing methods [12], where the number of possibilities is reduced by adding local features such as the curvature and the inclination angle. It cannot be denied that previous work has achieved very good performances, particularly for the Latin language. However, most of these methods are weakened by other languages, such as Arabic, and require a dedicated version for each language [12]. Moreover, the handwriting recovery problem has been framed as finding the correct terminal points (start/end points), junction points and the main direction in the detected ambiguous zones. Furthermore, Rousseau et al. [32] demonstrated that each step of the recovery procedure can affect the final recognition rate; in our opinion, these steps are therefore the challenging parts. Some prior work [23,29] addressed the use of an end-to-end system for handwriting recovery. In [29], the authors used a VGG-LSTM to extract features from images and a BLSTM as a decoder; their system is the most closely related to ours, although the focus is different. We use a CNN-BGRU to extract features from images instead of a VGG-LSTM. In the task of recovering the temporal order from offline handwriting, it is more challenging to produce a human-like velocity. Indeed, the authors of [29] recovered an online signal with equidistant points and then used a re-sampling step to add velocity to the obtained signal. In contrast, our framework directly produces an online signal characterized by a trajectory chronology with velocity.

End-to-end recovery framework
In this section, we clarify the main architecture of the proposed framework. We employ a Seq2Seq model with attention [2] to transform a sequence of offline handwriting pixels into the sequence of its homologous online signal. The obtained signal contains dynamic features such as the temporal order and the pen velocity. The main objective of this work can be summarized by the following equation:

pc = D(E(C(fp))),

where pc is the generated sequence of point coordinates characterized by dynamic features (those points correspond to the image I), and fp is the sequence of pixels representing the static features of I. The length and type of the input (image) and output (signal) are therefore different. Our model consists of three parts: a CNN, an encoder BGRU and a decoder with an attention model, assembled as an end-to-end system. Training is supervised, taking into account the image I and its counterpart online signal of n points, i.e. {(x_1, y_1), ..., (x_n, y_n)} ∈ R^(2×n). Fig. 1 gives an overview of the proposed framework. The ConvExtractor function C() is a CNN that transforms each image I into features F. The input images are embedded and converted to a vector; the features are extracted by a sliding receptive field across the image from left to right (and top to bottom). The ConvExtractor receives the offline handwriting and produces a sequence of features associated with the most significant pixels forming the letter of the image. The EncoderExtractor function E() is a multilayer BGRU model that accepts the sequence of features F and extracts the last state S. An attention layer is placed between the encoder and the decoder to highlight the most relevant encoder output features. Finally, the DecoderExtractor function D() is a multilayer BGRU model; it receives the local context state generated by the attention layer and the previously predicted coordinates. In Section 4, we show the effectiveness of different basic encoder-decoder models and prove that a BGRU network is the best candidate. To summarize, these processes can be written as:

F = C(I),
S = E(F),
⟨x_t, y_t⟩ = D(S, ⟨x_{t-1}, y_{t-1}⟩),

where ⟨x_{t-1}, y_{t-1}⟩ are the previously predicted point coordinates.
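The following minimal sketch illustrates how the three extractors compose at inference time; `conv_extractor`, `encoder` and `decoder` are hypothetical callables standing in for C(), E() and D(), and the attention step is assumed to live inside the decoder.

```python
# Minimal sketch of the end-to-end recovery pipeline (hypothetical names).
def recover_trajectory(image, conv_extractor, encoder, decoder, n_points=50):
    feats = conv_extractor(image)              # F = C(I): sequence of visual features
    enc_outputs, last_state = encoder(feats)   # S = E(F): last state + full outputs
    point = (0.0, 0.0)                         # decoding starts from the <0,0> token
    trajectory = []
    for _ in range(n_points):
        # D() consumes the previous point, its state and the encoder outputs
        # (attention over enc_outputs is assumed to happen inside the decoder).
        point, last_state = decoder(point, last_state, enc_outputs)
        trajectory.append(point)
    return trajectory                          # ordered (x, y) points with velocity-aware spacing
```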

Preprocessing and pen velocity process
The training step is based on two inputs: a) the handwriting character image, and b) its counterpart online signal. Generally, the first step in handwriting recovery is preprocessing. This process includes image normalization, ensuring that all images have the same size (64 x 64). These images are passed to the ConvExtractor C(). Their counterpart online signals are used as input to the DecoderExtractor D(); this type of signal is necessary to train the supervised decoder BGRU network. The number of online signal points is fixed to 50, and a sampling step is needed to obtain a signal with the desired number of points (see the normalization algorithm below). In fact, online handwriting is a series of non-equidistant points saved during the writing process. Studies on neuromuscular effects show that the pen velocity decreases at the terminal points of strokes and at the junction zones of curves; this information is used in online systems to compute the velocity. Thus, these points are represented as a two-dimensional matrix characterized by the pen velocity. To the best of our knowledge, researchers have not reconstructed the pen velocity when solving the handwriting recovery problem; in [12,29], the authors used a re-sampling step to add velocity to the recovered signal, without generating a signal carrying both the temporal order and the pen velocity at the same time. By way of contrast, our framework does exactly that. The GRU and LSTM networks can be adapted to variable lengths (with an end-of-stroke token); however, the proposed framework uses 50 points, as in [23,28], because each point is then more informative and long-range dependencies are reduced. The normalization step keeps the velocity information consistent within a character: the distance between two consecutive points depends on the natural speed of the pen and on the total length of the character. The strength of the normalization algorithm is therefore to normalize the online signal to a fixed number of points while preserving this property.
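As a concrete illustration, the sketch below resamples an online stroke to 50 points. The uniform-in-time interpolation and the 64 x 64 rescaling are assumptions consistent with the description above, not the paper's exact algorithm; `normalize_stroke` and its parameters are hypothetical names.

```python
import numpy as np

def normalize_stroke(points, n_points=50):
    """Resample an online stroke to a fixed number of points (hedged sketch).

    Sampling is uniform in *time* (point index), not in arc length, so the
    spacing between consecutive output points still reflects the natural
    pen velocity, as described in the text.
    """
    points = np.asarray(points, dtype=np.float32)   # shape (m, 2): recorded pen points
    t_src = np.linspace(0.0, 1.0, len(points))      # original (normalized) timestamps
    t_dst = np.linspace(0.0, 1.0, n_points)         # 50 target timestamps
    x = np.interp(t_dst, t_src, points[:, 0])
    y = np.interp(t_dst, t_src, points[:, 1])
    # Scale coordinates into the 64 x 64 frame used by the ConvExtractor.
    xy = np.stack([x, y], axis=1)
    xy -= xy.min(axis=0)
    xy *= 63.0 / max(float(xy.max()), 1e-6)
    return xy                                       # shape (n_points, 2)
```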

Convolutional extractor
Deep CNNs [24] have been used in many tasks [19,23,37]. Given the success of CNNs in providing a solid feature representation, we use a CNN to extract a sequence of image features. The main goal of the convolutional extractor is to convert the character image into visual features. A deep CNN model is used without its last fully connected layer. The normalized images are fed into the network, and the convolutional layers produce the feature maps. The features are extracted from the feature maps from left to right, column by column; one feature vector corresponds to a rectangular region and describes that image region. The details of the CNN configuration are given in Section 4. The input image (64 x 64) is transformed into CNN features of size (Batchsize, N, D). Specifically, the batch size is fixed to 32, and N and D are the length and depth of the CNN output, respectively. As illustrated in Fig. 1, the CNN output is denoted by F = (F_1, F_2, ..., F_N) with F_i ∈ R^D (D = 512).
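One plausible way to obtain such a feature sequence is sketched below; pooling over the height axis is an assumption (the paper only states that features are read column by column), and `to_feature_sequence` is a hypothetical helper.

```python
import tensorflow as tf

# Hedged sketch: turning the last convolutional feature maps into the
# left-to-right feature sequence F = (F_1, ..., F_N) consumed by the encoder.
# Each column yields one D-dimensional feature vector (D = channels, 512 here).
def to_feature_sequence(feature_maps):
    # feature_maps: (batch, height, width, channels) from the last conv layer
    return tf.reduce_mean(feature_maps, axis=1)   # (batch, N = width, D = channels)
```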

Encoder extractor
The encoder-decoder model was previously employed for machine translation [2,35], but it has recently been applied to other tasks [19,20,23,31,37]. The encoder is the part of the network that processes the input to produce a single hidden representation of all the input information. Among the various types of neural networks, recurrent neural networks (RNNs) are well suited to this kind of sequential prediction [15,30]. However, simple RNNs suffer from the well-known vanishing gradient problem, which the GRU [6] and the LSTM [16] can alleviate. According to previous studies [18], no firm conclusion can be drawn about which of the LSTM or the GRU is better; both have achieved similar results on many tasks. Even so, the GRU is faster thanks to its smaller parameter size and, according to [17], it can be a better choice for modeling temporal information than the LSTM. That is why the GRU is chosen instead of the LSTM. The hidden state is computed by the following equations:

g_t = σ(W_fg F_t + U_hg h_{t-1}),
r_t = σ(W_fr F_t + U_hr h_{t-1}),
c_t = tanh(W_fh F_t + U_rh (r_t ⊙ h_{t-1})),
h_t = (1 - g_t) ⊙ h_{t-1} + g_t ⊙ c_t,

where g_t, r_t and c_t are respectively the update gate, the reset gate and the candidate activation value, σ is the sigmoid function, ⊙ denotes the element-wise product, F_t is the input (the extracted CNN features), and W_fg, U_hg, W_fr, U_hr, W_fh and U_rh represent the different weights. To obtain more information and a better representation, a BGRU with multiple layers is chosen as the encoder adapted to our problem. Precisely, after many experiments with various configurations, the encoder is fixed as a BGRU with three layers of 512 units each. The encoder BGRU is applied to the CNN features extracted from the character image, and this feature sequence is mapped to a fixed-length vector, as depicted in Fig. 1 and expressed in equation (8):

S = E(F_1, ..., F_N). (8)
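A minimal Keras sketch of such an encoder is given below, taking the CNN feature sequence as input; building S by concatenating the last layer's forward and backward states is an assumption, and all names are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of the EncoderExtractor E(): a three-layer BGRU, 512 units per layer.
# It returns the full output sequence (used by the attention layer) and a summary
# state S built from the final forward/backward states of the last layer.
def build_encoder(n_steps, depth, units=512, n_layers=3):
    inputs = layers.Input(shape=(n_steps, depth))
    x = inputs
    for _ in range(n_layers - 1):
        x = layers.Bidirectional(layers.GRU(units, return_sequences=True))(x)
    x, fwd_state, bwd_state = layers.Bidirectional(
        layers.GRU(units, return_sequences=True, return_state=True))(x)
    s = layers.Concatenate()([fwd_state, bwd_state])   # last state S
    return tf.keras.Model(inputs, [x, s], name="encoder_bgru")
```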

Decoder extractor
The decoder is a BGRU network with three layers, mirroring the encoder architecture; each layer has 512 units. The role of the DecoderExtractor is to generate the predicted coordinate sequence, as shown in equation (9):

(P_1, ..., P_i) = D(S, F_1, ..., F_N), (9)

with (P_1, ..., P_i) = [⟨x_1, y_1⟩, ..., ⟨x_i, y_i⟩], where i is the number of desired points and F is the set of features, F_N ∈ R^D, N and D being respectively the length and depth of the CNN output. Given the entire feature sequence (F_1, ..., F_N), the decoder estimates a set of points (P_1, ..., P_i). S is the last encoder state, which summarizes the whole input feature sequence, and the hidden state of the decoder is initialized with S. In the first layer, the decoder takes as input the last encoder state S and the first coordinate ⟨0, 0⟩. The remaining layers receive as input the hidden state of the previous decoder layer and the previous coordinate ⟨x_{i-1}, y_{i-1}⟩, as shown in equation (10):

h^d_t = g(h^d_{t-1}, ⟨x_{t-1}, y_{t-1}⟩), (10)

where h^d_t is the current state of the decoder, h^d_{t-1} is its previous state, and g is a GRU function. The generated coordinates are computed by equation (11):

⟨x_t, y_t⟩ = U h^d_t + b, (11)

where we apply a linear activation function (dense layer), and U and b are the weights and bias, respectively.
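The sketch below illustrates one decoding step under these equations; it uses a unidirectional GRU stack for brevity (the paper's decoder is a BGRU), and the class and argument names are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of one DecoderExtractor step: a stack of GRU cells that consumes
# the previous point and the attention context, then emits <x_t, y_t> through a
# linear Dense layer, as in equation (11).
class DecoderStep(tf.keras.layers.Layer):
    def __init__(self, units=512, n_layers=3):
        super().__init__()
        self.cells = [layers.GRUCell(units) for _ in range(n_layers)]
        self.out = layers.Dense(2, activation=None)   # linear output: <x_t, y_t>

    def call(self, prev_point, states, context):
        # prev_point: (batch, 2); context: (batch, ctx_dim); states: list of (batch, units)
        x = tf.concat([prev_point, context], axis=-1)
        new_states = []
        for cell, state in zip(self.cells, states):
            x, new_state = cell(x, [state])
            new_states.append(new_state[0] if isinstance(new_state, (list, tuple)) else new_state)
        return self.out(x), new_states
```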
As described so far, this corresponds to the recovery process of a classical Seq2Seq model [35], which can already address the handwriting recovery problem.

Attention mechanism
The classic Seq2Seq model without attention has the defect that all the hidden representations of the encoder are compressed into a single fixed-length context vector S. Consequently, the prediction accuracy gradually decreases as the input length increases [2]. This paper therefore adds an attention mechanism to the handwriting recovery process. An attention layer is placed between the encoder and the decoder. The input of the BGRU encoder is F = (F_1, F_2, ..., F_N); at each time step t, the encoder reads F_t and updates its hidden state h_t. The attention context vector is then produced as a weighted sum of the h_t, which highlights the best hidden representation of the encoder. The following equations describe the attention process:

e_{i,t} = align(h^d_{i-1}, h_t), (12)
α_{i,t} = exp(e_{i,t}) / Σ_{k=1..N} exp(e_{i,k}), (13)
c_a = Σ_{t=1..N} α_{i,t} h_t, (14)

where formula (12) is the alignment score between the encoder hidden state h_t and the decoder hidden state h^d_{i-1}; formula (13) gives the attention weights, which indicate the importance of the input at time step t for generating the output at time step i (the softmax function normalizes the vector e_i of length N into an attention mask over the input sequence); and formula (14) defines the final attention state c_a. At each time step, the decoder predicts the point coordinates. The ground-truth point coordinates are used to teach the network to generate an online signal compatible with the original script (see Fig. 2 (b)). The predicted point coordinates are optimized according to the L1 loss in equation (15):

L1 = (1/i) Σ_{k=1..i} (|x_k - x'_k| + |y_k - y'_k|), (15)

where the ground-truth vector is P' = [⟨x'_1, y'_1⟩, ..., ⟨x'_i, y'_i⟩], i is the number of points, and P is the predicted vector. Training continues until the L1 loss converges, and the model is saved. The test step uses an offline handwriting image as input; based on the trained model, the framework generates a set of point coordinates representing human-like writing.
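A minimal sketch of an additive (Bahdanau-style) attention layer matching equations (12)-(14), together with the L1 loss of equation (15), is shown below; the exact alignment function used in the paper is not specified, so the tanh scoring and the layer sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of the temporal attention step and the L1 training loss.
class TemporalAttention(tf.keras.layers.Layer):
    def __init__(self, units=512):
        super().__init__()
        self.w_enc = layers.Dense(units)
        self.w_dec = layers.Dense(units)
        self.v = layers.Dense(1)

    def call(self, enc_outputs, dec_state):
        # enc_outputs: (batch, N, H) encoder states h_t; dec_state: (batch, H_dec)
        score = self.v(tf.tanh(self.w_enc(enc_outputs) +
                               self.w_dec(dec_state)[:, None, :]))   # e_i: (batch, N, 1)
        alpha = tf.nn.softmax(score, axis=1)                          # attention weights (13)
        context = tf.reduce_sum(alpha * enc_outputs, axis=1)          # c_a: (batch, H) (14)
        return context, alpha

def l1_loss(p_true, p_pred):
    # Mean absolute error over the predicted (x, y) coordinates, as in equation (15).
    return tf.reduce_mean(tf.abs(p_true - p_pred))
```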

Experiments
In this section, we study the effect of different settings of the basic encoder and decoder candidates. Then, we explore the performance of the proposed model and compare our results with existing methods on different datasets.

Datasets
We test the proposed framework on Arabic, Latin and Indian corpora. Specifically, for Arabic, we use the dual on/off Arabic LMCA dataset [22]; it contains 28 letters, either joined or isolated (we choose the isolated form). For Latin, we adopt the dual on/off Latin IRONOFF dataset [36]; we use the isolated letters (upper and lower case) and digits, sorted into 26 and 10 classes, respectively. For Indian, we use the Telugu dataset, which contains 116 Telugu characters. All these datasets are used without considering the pen up/down information. We create additional patterns using a data augmentation strategy based on distorted samples (changing the inclination angle, smoothing, and baselines). The obtained signals are converted to offline handwriting: first, we concatenate the pen points to obtain the skeleton of the image; then, we apply a filter that grows the skeleton so that if the current point is foreground, all its neighbors are set to foreground. Table 1 reports the number of samples used for training, testing and validation.

Metrics and implementation
Thanks to the available on/off data for each corpus, the evaluation step becomes easy. Both signals (ground truth and generated) are not only compared with the Root Mean Square Error (RMSE) and the Euclidean Distance (ED), but also recognized by an online recognition system. We therefore assess the effectiveness of our proposed framework with three evaluation metrics: the RMSE, the ED and an online recognition system based on an LSTM network. The RMSE and the ED are chosen as distance-based evaluation criteria.
The RMSE measures the deviation between one set of points and another; its definition has been used differently in related work such as [7,14]. In our case, it represents the difference between the online signal and the recovered one, according to the following formula:

RMSE = (1/n) Σ_{i=1..n} (1/L) sqrt( (1/l_i) Σ_{t=1..l_i} [ (x_t - x'_t)^2 + (y_t - y'_t)^2 ] ),

where n is the number of samples, l_i is the number of points in sample i, (x_t, y_t) are the coordinates of the online signal, (x'_t, y'_t) are the recovered coordinates, and L denotes the total length of the character. The better system is the one that reaches the lower RMSE value. The results are reported in Table 4.
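A per-sample sketch of this metric is given below; the normalization by the character length L follows our reading of the formula above and is an assumption, and `trajectory_rmse` is a hypothetical helper.

```python
import numpy as np

# Hedged sketch: point-wise RMSE between a ground-truth trajectory and a recovered
# one (both resampled to the same number of points), normalized by the length L.
def trajectory_rmse(truth, pred, length):
    truth, pred = np.asarray(truth), np.asarray(pred)   # shapes (l_i, 2)
    per_point = np.sum((truth - pred) ** 2, axis=1)     # (x_t - x'_t)^2 + (y_t - y'_t)^2
    return np.sqrt(per_point.mean()) / length
```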
The chosen ED is the one used in [7], following the formula:

ED(p, q) = d(P_p, P'_q) + min{ ED(p-1, q), ED(p, q-1), ED(p-1, q-1) },

where d(·,·) is the Euclidean distance between two points and p and q vary between 1 and L. The ED metric thus accumulates the minimal distance between the ground-truth and predicted elements along a warping path, and the best method is the one that achieves the lowest value. The results are given in Table 4.
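The sketch below computes such a warping-path distance with a standard dynamic-programming recursion; whether [7] uses exactly this recursion is an assumption, and `warping_distance` is a hypothetical name.

```python
import numpy as np

# Hedged sketch of the warping-path distance described above (a DTW-style
# cumulative Euclidean distance between the two point sequences).
def warping_distance(truth, pred):
    truth, pred = np.asarray(truth), np.asarray(pred)
    n, m = len(truth), len(pred)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            d = np.linalg.norm(truth[p - 1] - pred[q - 1])   # Euclidean point distance
            acc[p, q] = d + min(acc[p - 1, q], acc[p, q - 1], acc[p - 1, q - 1])
    return acc[n, m]
```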
We acknowledge that the RMSE and ED metrics are harsh when evaluating a predicted signal with a small spatial deviation, even if its temporal order matches the ground truth; the resulting error values can therefore be misleading. For this reason, we also use an online LSTM recognition system [29], based on the beta-elliptic and grapheme segmentation method [4], which requires the presence of the temporal order and the velocity to produce a reasonable result. We extract 10 features from the original online signals to train the network, and then test the network with our reconstructed signals.
Our framework is trained on one Nvidia GT 650M GPU using the TensorFlow platform. The training batch contains 32 image-signal pairs. To update the parameters, we choose the Adam stochastic optimizer with a learning rate of 10^-3. We save a checkpoint of the model every 700 iterations. Training lasts about 16 hours, and testing takes around seven minutes per 20 samples.
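For concreteness, a minimal training-step sketch with this configuration is shown below; the model interface (an end-to-end model returning a (batch, 50, 2) tensor) and all names are assumptions.

```python
import tensorflow as tf

# Hedged sketch of the training configuration described above.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
BATCH_SIZE = 32
CHECKPOINT_EVERY = 700   # iterations between saved checkpoints

@tf.function
def train_step(model, images, gt_points):
    with tf.GradientTape() as tape:
        pred_points = model(images, training=True)                 # (batch, 50, 2)
        loss = tf.reduce_mean(tf.abs(gt_points - pred_points))     # L1 loss (equation (15))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```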

Ablation study
Recently, many networks have been used to extract features from images, such as ResNet-50 [39], VGG-16 [33] and plain CNNs, followed by an LSTM or a GRU to obtain a higher-level feature representation. We therefore evaluate the encoder combinations on the IRONOFF digit dataset; the training losses are presented in Table 2. This table reflects the following studies: (1) We use five networks (ResNet, VGG, CNN1, CNN2 and CNN3) followed by an LSTM as the encoder. The CNN configurations are listed in Table 3, and we fine-tune the pre-trained deep models (ResNet and VGG). The results show that, with the same recurrent network (LSTM), CNN2 performs better than the other networks in terms of training loss. CNN2 uses eight convolution layers with a kernel size of 3 x 3; the max-pooling layers extract the most significant characteristics from the output of the preceding convolution layers, each convolution layer uses a rectified linear unit activation [1], and batch normalization is employed after the third and last convolution layers (see the sketch below). (2) With CNN2, the results show that the GRU performs better than the LSTM; GRU-CNN2 is the combination that achieves the lowest training loss. We then investigate whether the proposed framework can generate the signal closest to the original: with the same GRU-CNN2 based encoder, we compare four decoder-based combinations (Seq2Seq-LSTM, Seq2Seq-GRU, Seq2Seq-BGRU, and Att-BGRU).
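A hedged sketch of a CNN2-like backbone is given below; the filter counts and pooling positions are illustrative assumptions, since the exact values are given in Table 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hedged sketch of the CNN2 configuration described above: eight 3x3 convolution
# layers with ReLU, max pooling, and batch normalization after the third and
# last convolutions. Filter counts and pooling positions are assumptions.
def build_cnn2(input_shape=(64, 64, 1)):
    filters = [64, 64, 128, 128, 256, 256, 512, 512]   # illustrative values
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for i, f in enumerate(filters):
        x = layers.Conv2D(f, 3, padding="same", activation="relu")(x)
        if i in (2, 7):                                  # after the 3rd and last conv
            x = layers.BatchNormalization()(x)
        if i % 2 == 1:                                   # pool after every second conv
            x = layers.MaxPooling2D(pool_size=2)(x)
    return tf.keras.Model(inputs, x, name="cnn2")
```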

Architecture exploration
We aim to demonstrate whether the proposed framework and the velocity reconstruction can boost the performance of the handwriting recovery process. Specifically, we compare five handwriting recovery models: (1) the basic model (Att-BGRU) presented in Section 4.3, which uses equidistant points as online training data and is therefore described as a framework without velocity; (2) a sequence-to-sequence model based on a BLSTM with velocity (S2S-BLSTM-V), inspired by [23], which integrates the velocity feature; its online training data are passed through the normalization algorithm (introduced in Section 3.1) to obtain an online script with velocity and a fixed number of points; (3) a sequence-to-sequence model with a VGG-16 and BLSTM based encoder (S2S-VBLSTM-V), inspired by [29], which takes the same training data as S2S-BLSTM-V; (4) a sequence-to-sequence model with a CNN2 and BGRU based encoder (S2S-CBGRU-V), which takes the same training data as the previous model; and (5) our Att-BGRU-V framework, which integrates the velocity concept as detailed in Section 3.1.
All the experimental results are shown in Table 4, from which we can see the following: (1) The BGRU with the attention model (Att-BGRU) performs better than both S2S-BLSTM-V and S2S-VBLSTM-V, which indicates that the attention mechanism improves the performance of handwriting recovery. However, in some cases its margin over S2S-CBGRU-V is small, which demonstrates the effectiveness of a BGRU trained with velocity-bearing signals; the pen velocity reconstruction thus proves its usefulness. (2) S2S-CBGRU-V performs slightly better than both S2S-BLSTM-V and S2S-VBLSTM-V, thanks to the BGRU, which outperforms the BLSTM for handwriting recovery. (3) Our Att-BGRU-V achieves the best performance over all the evaluation metrics (2.0 RMSE, 22.8 ED and 94.6% recognition rate) on the IRONOFF upper-case letters. (4) To show the effectiveness of the pen velocity process, we also assess the model in a scheme without velocity (Att-BGRU), obtained by training the framework with offline handwriting and a corresponding online signal made of equidistant points. Comparing the latter with our proposed framework with velocity (Att-BGRU-V), we find that using the pen velocity is very beneficial for handwriting recognition.
To sum up, the experimental results demonstrate that the proposed framework with the jointly attention model and BGRU with velocity enhances the effectiveness of the handwriting recovery process.

Comparison to existing methods
To evaluate the effectiveness of the proposed framework, we compare our work with three existing systems, which have been re-implemented and tested under the same environment conditions. Our proposal uses the CNN2-BGRU based encoder and the Att-BGRU based decoder, and is intended to recover an online signal with velocity. Precisely, we compare four frameworks: (1) Elbaati's approach [12], based on a graph model representing an image as a set of segments, with a genetic algorithm used to find the smoothest path across those segments; (2) S2S-BLSTM, proposed in [23], a classical Seq2Seq model without attention with a CNN-BLSTM encoder and a BLSTM decoder; (3) S2S-VBLSTM, proposed in [29], similar to S2S-BLSTM but with a VGG-16 and BLSTM based encoder, where the authors apply a sampling step on the recovered signal to add velocity; and (4) our baseline model (Att-BGRU-V), introduced in Section 4.4. All the experimental results are provided in Table 5.
From Table 5, we can state the following: (1) The approach of Elbaati et al. [12] is clearly less effective than all the deep models over all the evaluation metrics. This indicates that, with the rise of deep learning, handwriting recovery can be handled more efficiently.
(2) The framework of Ayan et al. [23] was the first deep model for handwriting recovery. Its best accuracy (91.9%) is obtained on the LMCA dataset, but it remains below both the S2S-VBLSTM [29] rate (98.8%) and ours (98.9%). This indicates that the pen acceleration obtained after a sampling step [29] is beneficial, and that our attention model with velocity further improves the recognition accuracy.
(3) Rabhi's framework [29] was previously the best state-of-the-art framework. The authors re-sampled the recovered signal (a set of equidistant points) to obtain an online signal with pen acceleration. Our framework, by contrast, directly generates a meaningful signal with velocity, without a sampling-based post-processing step. As a result, their accuracy is lower than ours, thanks to the attention mechanism, which boosts the performance. In addition, the BGRU proves its efficiency in terms of accuracy and time (0.7 s/step) compared to the BLSTM (1 s/step).
(4) Our proposed Att-BGRU-V performs better than the state-of-the-art models in terms of all the evaluation metrics. It outperforms Elbaati's approach [12], Ayan's system [23] and Rabhi's framework [29] by 16, 0.3 and 0.1 absolute error points, respectively, when testing the digit data. For the ED evaluation, our framework also achieves the best performance: it surpasses Elbaati's approach [12], Ayan's system [23] and Rabhi's framework [29] by 30.9, 0.2 and 0.1 absolute error points, respectively, when testing the Arabic data. In addition, the recognition rate is higher than that of the other methods, whatever the database.
To sum up, the effectiveness of the velocity is captured when we apply the attention mechanism with BGRU NN.
Visual analysis of the reconstructed velocity. We analyze the velocity prediction of the proposed framework using a visual plot of the Arabic letter << waw >> from the LMCA dataset. Fig. 4 (a)-(c) shows the velocity curves of the ground truth and of the predicted models (S2S-BLSTM-V, S2S-VBLSTM-V and our model Att-BGRU-V). As shown in these figures, the deviation between the predicted velocity of each model and the ground-truth velocity is not too large, and our model shows the smallest deviation from the ground truth. To further analyze the reconstructed velocity, Fig. 4 (d)-(f) shows the trajectory reconstructions corresponding to these curves. The trajectories are divided into strokes based on the inflection points, which are located according to the variation of the pen acceleration [4]. As indicated in these figures, the differences in point locations are not significant. However, our model reconstructs a trajectory with a flexible curvature, like human writing, thanks to the attention layer, which can focus on the detailed oval curves.
Analysis of reconstructed characters. Fig. 5 presents two successfully recovered samples: an Indian character and an IRONOFF digit. These scripts are reconstructed successfully by the models both with and without an attention layer. Nevertheless, the models without attention (S2S-BLSTM-V, S2S-VBLSTM-V) can recover online trajectories that do not match the offline image, because the encoder-decoder model extracts the final encoder state from different character samples; in the absence of the attention layer, the decoder may generate an online trajectory identical to some existing sample in the training dataset (see Fig. 6). The attention layer therefore helps avoid this overfitting and adapt to the true samples. Fig. 6 shows two reconstructed samples of Latin and Arabic characters. These samples are successfully reconstructed with our attention framework but fail with the models without attention. In some cases, the latter models generate erroneous scripts that differ from the counterpart offline handwriting; these scripts can correspond to other existing samples of the target dataset. For example, when recovering the character << sin >>, we obtain an erroneous signal that corresponds to the existing character << ba >>; likewise, the Latin letter << b >> is recovered as the character << z >>, which exists in the IRONOFF dataset. Here, the decoder relies on the last encoder feature, which can be similar to the encoder feature of other samples. Fig. 7 shows our recovered signal with a small deviation from its counterpart ground-truth signal, while the temporal order matches the online one. The proposed framework takes an offline image as input and generates an online signal with temporal order and pen velocity. Fig. 8 presents the pen velocity of our recovered signal and of the ground-truth signal.
The acceleration decreases at the zones marked by red circles, similarly to the ground-truth signal. This confirms that the recovered signal respects the velocity of human writing. As indicated in Fig. 9, the recovered signal traverses the loop in the correct direction, and both the start and terminal points basically match the original ones; thus, the temporal order of our recovered signal respects that of the ground-truth signal. Fig. 10 depicts some failure cases of the suggested framework, where the zone marked by a pink arrow is missed because of the pen up / down movement, which has not been handled yet. In fact, the proposed system deals with mono-stroke isolated characters. Our contribution focuses on the reconstruction of the pen velocity feature, which had not been addressed before, neither by older systems nor by more recent ones.

Velocity performance
The velocity curve varies between velocity extrema (maxima and minima), which specify the number of strokes. The purpose of reconstructing the pen velocity from an offline image is to give meaning and dynamic information to offline handwriting. We thus become able to segment an image into primitive lines based on the reconstructed pen-tip velocity, and we can obtain more features to improve the offline handwriting recognition rate. In this study, the velocity reconstruction is visually apparent when plotting the character (see Fig. 11(b)), where the points are not equidistant. In addition, the magnitude of the pen velocity (Fig. 11(d)) shows the variation of the acceleration as a function of time. As illustrated in Fig. 11, we obtain an online signal; based on the beta-elliptic model [27], two types of features can then be extracted: the dynamic and geometric profiles. In the geometric profile, each beta stroke is represented by an elliptic arc described by four geometric features: a, b, teta and teta_p, where a and b are the large and small half-dimensions of the elliptic arc and teta and teta_p are respectively the angle of the ellipse and the tangent inclination. These profiles are detailed further in [27].
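As a simple illustration, the velocity magnitude plotted in Fig. 11(d) can be estimated from the recovered points as sketched below, assuming a constant sampling interval; `velocity_magnitude` is a hypothetical helper.

```python
import numpy as np

# Hedged sketch: estimating the pen-velocity magnitude from the recovered points.
def velocity_magnitude(points, dt=1.0):
    points = np.asarray(points)             # (n, 2) recovered coordinates
    v = np.diff(points, axis=0) / dt        # per-step velocity vectors
    return np.linalg.norm(v, axis=1)        # |v| as a function of time
```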

Conclusion
In this study, we have introduced a novel framework based on a Seq2Seq model to recover the temporal order and pen velocity of multilingual handwritten characters. We have shown the importance of the attention model, which focuses on the local state while recovering online trajectories. The framework is an end-to-end system based on a CNN to extract features and an encoder-decoder BGRU with an attention model to generate a signal with temporal order and velocity information. Consequently, we have achieved a higher recognition rate than other existing state-of-the-art models. Among the challenges that could be addressed in the future is recovering dynamic information from words, sentences and complicated signatures, taking the pen up / down information into consideration. The use of meta-learning will be required in this context, especially when the recovery model needs to generalize proficiently to unseen offline handwriting. We will also deploy a dependent bidirectional recurrent network, as it addresses the erroneous-prediction problem of Seq2Seq models and could improve the accuracy results. Finally, other deep learning and reinforcement learning models will need to be explored [40], and a range of benchmark multilingual handwriting character databases will need to be developed for comparative evaluation with other state-of-the-art approaches (e.g. multi-task [41] and multi-model deep learning [42]).