Quantification of Mental Workload Using a Cascaded Deep One-dimensional Convolution Neural Network and Bi-directional Long Short-Term Memory Model

In this paper, a new cascaded one-dimensional convolution neural network (1DCNN) and bidirectional long short-term memory (BLSTM) model has been developed for binary and ternary classification of mental workload (MWL). MWL assessment is important for increasing safety and efficiency in Brain-Computer Interface (BCI) systems and in professions where multitasking is required. Keeping in mind the necessity of MWL assessment, a two-fold study is presented. First, binary classification is performed to classify MWL into Low and High classes. Second, ternary classification is applied to classify MWL into Low, Moderate, and High classes. The cascaded 1DCNN-BLSTM deep learning architecture has been developed and tested on the Simultaneous task EEG workload (STEW) dataset. Unlike recent research in MWL, handcrafted feature extraction and engineering are not performed; rather, end-to-end deep learning is applied to 14-channel EEG signals for classification. Accuracies exceeding those of previous state-of-the-art studies have been obtained: 96.77% for binary and 95.36% for ternary classification, both with 7-fold cross-validation.


I. INTRODUCTION
NOWADAYS almost every person suffers from mental stress, either due to their lifestyle or the nature of their work or profession. Mental stress may lead to mental disorders, and there is a strong correlation between the two. Psychological and mental stress are synonymous and are associated with anger or anxiety. These may also lead to depression if they remain untreated for a long time. Mentally stressed conditions also affect the functioning of the autonomic nervous system (ANS). Therefore, mental stress is considered the main cause of the overall degradation of a person's mental and physical health. Due to stress, a person may also lose interest in their profession and related work. Mental stress also changes attitudes toward life [1].
In various industries and professions, higher mental workload, mental stress, and mental pressure lead to stress-related diseases that decrease the performance and output of industries and increase the burden of employees' medical expenses. Mental workload, mental stress, and related diseases can also increase economic and social losses for the whole country. Suicides and psychiatric illnesses due to mental stress are also reported [1][2]. In universities and colleges, faculty and mental health also suffers due to mental stress generated from various factors as reported in [3]. Hence, precautionary measures to reduce mental stress, and its proper assessment, are necessary for the safety and benefit of humanity. For the assessment of brain/mental state, physiological signal analysis is the best way.
Electroencephalogram (EEG) signals are the most suitable physiological signals for exploring the mental state of humans. Different types of experiments and simulations have been designed as protocols to understand the influence of mental workload and mental stress. Protocols and tools like the Visual Response Test (VRT), Auditory Response Test (ART), Letter Counting (LC) test, Stroop Test, NASA Task Load Index (NASA-TLX), mental arithmetic/calculation, Multi-Attribute Task Battery (MATB), Subjective Workload Assessment Technique (SWAT), and single-session simultaneous capacity (SIMKAP) task were designed to observe and identify the level of alertness and mental workload. These protocols and tools are related to cognitive tasks which induce mental fatigue and change the level of the -8]. Recorded EEG data analyzed through these protocols is very useful in the proper assessment of mental workload, which causes stress and other issues; based on this assessment, timely diagnosis and treatment are possible. Recently, machine learning techniques have started being used to automate the assessment of mental workload. The major issue in these experiments and simulations is the accuracy of classification or prediction of mental workload. The accuracy of the entire protocol or model depends mainly upon factors like the preprocessing of data (generally filtering), the number of classes, categories, or tasks to be classified, and the classification algorithm used for training and testing. These factors are targeted by different researchers in several studies [9][10][11][12]. Among these factors, the classification algorithm plays a crucial role because, at present, methods like deep learning, and specifically convolution neural networks combined with other models or algorithms, have surpassed other classification methods that work on features extracted from EEG signals.
(Authors' affiliation: India; e-mail: vipuls1996@outlook.com, ahirwalmitul@gmail.com. Corresponding author: Mitul K. Ahirwal, ahirwalmitul@gmail.com.)
The common idea in most deep learning models is using the initial convolution neural network (CNN) layers for the generation of feature maps and then the use of fully connected layers to classify these feature maps. Layers like pooling and drop-out are used to prevent over-fitting in such models. Many recent studies have used these CNN-based classification models for the classification of EEG signals. For EEG signals that are time series data, many researchers have also applied long short-term memory (LSTM) models. In the LSTM model, generative and discriminative capabilities of recurrent neural networks (RNN) have been used. The use of RNNs allows for temporal features to be extracted, and CNNs help in extracting spatial features from the data.
We have reviewed the recent studies which utilized deep learning concepts in the fields related to mental workload. Most of the research done focuses on handcrafted feature extraction from EEG signals. A deep 1DCNN has been used for attention classification in the Stroop color test; in that study, raw data, filtered data, and data in five conventional EEG bands are given for training [9]. In [10], a CNN model has been developed which can be used as a generalized model for a few EEG Brain-Computer Interface (BCI) systems working on P300 event-related potentials (ERPs) with visual stimulation, neural oscillations generated for movement-related cortical potentials, and several sensorimotor rhythms generated due to real and imaginary limb movements. Another important field is driver fatigue monitoring because of its relationship with -Shift 2 Unleashed (NFSare used in these experiments. Along with EEG, EKG is recorded and used for the classification of two mental states, -Conv and EEG-Conv-R that are based on deep CNN and deep residual learning concepts have been proposed in [11] for this task. In [12], a deep classifier and a deep autoencoder were used for task engagement assessment, i.e., to learn and label three types of events in flight simulation. EEG and ECG signals were used to monitor the state of pilots in a 4-h flight simulation, and three events were classified, namely two types of air traffic control (ATC) calls and one failure event. Point-wise gated Boltzmann machines (PGBM) have been used to classify the mental state of subjects into task-relevant or task-irrelevant categories, where each subject underwent a working memory experiment with a set of characters [13]. Assessing operator functional states (OFS) plays an important role in safety-critical human-machine (HM) systems.
A new switching deep-belief network with adaptive weights (SDBN) has been implemented for detecting the separate and coupled effects of mental workload and mental fatigue across different subjects [14]. In this work, the automation-enhanced cabin air management system (AutoCAMS) is used as a platform to simulate complex processes as control tasks for real-time HM collaboration. In [15], RNNs were used for the effective prediction of drowsiness from EEG during driving tasks in a high-fidelity vehicle simulator study. A modification of the same has been made with ensemble group-trained Recurrent Self-Evolving Fuzzy Neural Networks (RSEFNNs). A deep CNN-RNN model was used to predict the cognitive load generated by working memory tasks with the help of 2D Azimuthal Equidistant Projections (AEPs) of Power Spectral Density (PSD) features of different EEG bands [16,17]. In most of the RNN models used, the temporal processing direction is forward only, which sometimes reduces the extent of temporal information extracted from the data. To overcome this limitation, a Bidirectional Long Short-Term Memory (BLSTM) model, followed by 2DCNN and fully connected layers, has been used for epileptic seizure classification in [18]. The application of BLSTM has also been reported in [19] for the classification of mental workload from extracted EEG signal features. It involved a combination of BLSTM and LSTM for the classification of mental workload during the task and no-task states. In most of the research stated above, manually extracted features, such as certain time and frequency domain features and linear and nonlinear features, are used instead of raw EEG signals. In addition to this, many feature selection techniques have been applied, including evolutionary algorithms and manual selection of features.
Hence, building on the existing research, in this paper, we propose a new cascaded 1DCNN and BLSTM model to classify mental workload into two and three classes. To the best of our knowledge, the points of novelty and contribution of this study are:
1. Most of the previous studies on the STEW dataset [7][8] classify the mental workload state between task and no-task states, but we quantify different levels of mental workload during multitasking, i.e., the task state, in binary as well as ternary classes.
2. Before this work, a combination of CNN and BLSTM had not been applied to the mental workload data used in this study. This model surpasses the current state-of-the-art models.
3. We use EEG signals for end-to-end deep learning. To our knowledge, no other study on the STEW dataset has done so; instead, they have focused on handcrafted feature extraction and engineering.
4. Besides the above contributions, a new learning rate modification method for the training phase of the proposed 1DCNN-BLSTM model has also been suggested.
The rest of the paper is arranged as follows: section II contains the overall methodology, section III contains details of the proposed method, section IV contains the results obtained, section V discusses the performance of the proposed model and its comparison with recent research, and section VI contains the concluding remarks.

A. Dataset Description
In this study, the simultaneous task EEG workload (STEW) dataset [7][8] is used for the mental workload classification task. STEW measures the mental workload induced by the single-session simultaneous capacity (SIMKAP)-based multitasking test, and the EEG recordings during SIMKAP have been analyzed in the experiments. In SIMKAP, the subjects are given simultaneous audio-visual tasks like arithmetic, finding identical items in two separate windows, and data lookup, and at the end of the tasks they rate their mental workload on a scale of 1 to 9. The STEW dataset provides these workload ratings for SIMKAP. The ratings were binned into 2 and 3 classes for binary and ternary classification, respectively. Table I and Table II show the distribution of the ratings in each class.
The EEG signals were captured with 14 electrodes, namely AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4, during the SIMKAP test with a sampling frequency of 128 Hz for 2.5 minutes. A bandpass filter with a passband of 4 to 32 Hz is used to remove artefacts from the EEG recordings.

B. Experimental Setup
In this experiment, 45 multichannel EEG recordings have been considered, each 2.5 minutes long. To augment this data for deep learning models, windowing has been done over the dataset with overlapping windows of 512 samples and a shift of 128 samples. This sub-sampling is performed over all 14 channels, and each window inherits the label of the recording it was cut from. This augmentation produced 6615 samples (147 windows per recording), and one-hot encoding of the class labels is done. Table III describes the shape of the dataset thus produced. For classification, 85% of the data (5622 samples) were used for training and 15% (993 samples) for testing the deep learning model. In addition to this, K-fold cross-validation (CV) is also performed; after several initial experiments, 5-fold and 7-fold CV were found suitable for the final experiments and for checking the robustness of the results. A deep learning model for the classification of multivariate time series, i.e., EEG signals, into 2 and 3 classes has been developed. The model consists of 1D convolution (1DCNN) layers followed by bidirectional long short-term memory (BLSTM) layers for feature extraction. A fully connected neural network applied to the output of these layers is used for classification. The detailed structure of the layers is discussed in section III. The use of deep learning has allowed for the classification of complex multichannel EEG data without the need for handcrafted feature extraction.
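The window arithmetic above can be checked with a short sketch. It assumes that each recording holds exactly 128 Hz x 150 s = 19,200 samples per channel and that the holdout test set is simply the remainder after taking 85% for training; the function name `sliding_windows` is hypothetical and not taken from the paper.

```python
def sliding_windows(n_samples, window=512, shift=128):
    """Return (start, end) index pairs of overlapping windows over one recording."""
    return [(s, s + window) for s in range(0, n_samples - window + 1, shift)]

FS_HZ = 128          # sampling frequency from the dataset description
DURATION_S = 150     # 2.5 minutes per recording
N_RECORDINGS = 45    # recordings used in the experiment

windows = sliding_windows(FS_HZ * DURATION_S)
total = N_RECORDINGS * len(windows)

# Assumed split rule: floor of 85% for training, remainder for testing.
n_train = total * 85 // 100
n_test = total - n_train

print(len(windows), total, n_train, n_test)
```

Under these assumptions the counts agree with the figures quoted in the text: 147 windows per recording, 6615 samples in total, and a 5622/993 split.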

C. Description of Layers Used
The CNN-BLSTM model is used in our experiment for both binary and ternary classification of EEG signals. This model learns both the spatial and the temporal characteristics of multichannel EEG signals to perform automated feature extraction.

1) 1D Convolution (1D CNN)
1D CNN works based on convolution operations using kernels/filters. Several kernels of small size are passed over the data to learn local patterns from small patches of data and do feature extraction. They learn the spatial information from multivariate time series easily and are often stacked to do feature extraction from raw data.
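As an illustration of how a single kernel is passed over a multichannel series, the following minimal sketch computes one output feature map in plain Python. It assumes no padding ("valid" convolution), which the paper does not specify, and the function name `conv1d_valid` and the toy signal are hypothetical.

```python
def conv1d_valid(signal, kernel, stride=1):
    """Slide one kernel over a multichannel signal and apply ReLU.

    signal: list of time steps, each a list of channel values.
    kernel: list of kernel taps, each a list of per-channel weights.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        acc = 0.0
        for t in range(k):
            for c, w in enumerate(kernel[t]):
                acc += w * signal[start + t][c]
        out.append(max(0.0, acc))  # ReLU activation
    return out

# Toy 2-channel signal with 6 time steps and a kernel spanning 3 time steps.
signal = [[1, 0], [2, 1], [3, 0], [4, 1], [5, 0], [6, 1]]
kernel = [[-1, 0], [0, 1], [1, 0]]
features = conv1d_valid(signal, kernel)
print(features)  # [3.0, 2.0, 3.0, 2.0]
```

Each output value mixes all channels over a short time patch, which is why stacked 1D convolutions can capture cross-channel (spatial) structure. With no padding, an input of n time steps and a kernel of size k yields n - k + 1 outputs at stride 1.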

2) Long short-term memory (LSTM)
LSTM was developed by Hochreiter et al. [21] and is a special type of Recurrent Neural Network (RNN) used for learning temporal information. RNNs are a type of neural network which utilizes a cell or state along with the sequence input. RNNs suffer from the vanishing gradient problem, in which the gradient shrinks toward zero for long sequences. LSTM overcomes this problem and is useful for learning information from long sequences. It consists of 4 blocks: the cell state, the forget gate, the input gate, and the output gate. The cell state helps to transfer information from earlier states to later cells, mitigating the vanishing gradient problem; further, the forget gate learns what information should be retained or forgotten. A combination of these two helps to develop a mix of long and short-term memory.
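The interaction of the gates and the cell state can be sketched for a single LSTM unit as follows. This uses the standard LSTM update equations rather than anything specific to the paper; the parameter layout `p` and all numeric values are hypothetical and chosen only to show how an open forget gate preserves the cell state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM; p maps a block name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b
    f = sigmoid(pre("forget"))        # share of the old cell state to keep
    i = sigmoid(pre("input"))         # share of the candidate to write
    o = sigmoid(pre("output"))        # share of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights: forget gate held open, input gate nearly closed.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.5, h, c, params)
print(round(c, 3))
```

After 100 steps the cell state is still close to its initial value of 1.0, which illustrates how the additive cell-state path carries information across long sequences.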

3) Bidirectional LSTM (BLSTM)
LSTMs are traditionally unidirectional, i.e., they process the time series in only one direction, from past to future. To overcome this limitation, an extension to RNNs was proposed by Schuster et al. [22] as a bidirectional recurrent neural network (BRNN) that can simultaneously train in the positive and negative time directions. BLSTMs are a type of BRNN that process the data in parallel in both the forward and backward directions, and the outputs of the two LSTMs are merged to produce the final output. This bidirectional reading allows BLSTMs to learn the temporal information from the data in a better way.
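The bidirectional reading can be sketched with a generic recurrent step function. The sketch below assumes the two directions are merged by concatenation per time step, which is one common choice and is not stated explicitly in the paper; the running-sum step is a toy stand-in for an LSTM cell, and all names are hypothetical.

```python
def run_recurrent(step, sequence, state):
    """Apply step(x, state) -> state along a sequence and collect every state."""
    outputs = []
    for x in sequence:
        state = step(x, state)
        outputs.append(state)
    return outputs

def bidirectional(step_fwd, step_bwd, sequence, init=0.0):
    """Run one pass forward and one pass backward, then pair outputs per time step."""
    fwd = run_recurrent(step_fwd, sequence, init)
    # The backward pass reads the reversed sequence; reverse its outputs again
    # so that position t in both lists refers to the same time step.
    bwd = run_recurrent(step_bwd, sequence[::-1], init)[::-1]
    return list(zip(fwd, bwd))

running_sum = lambda x, s: s + x   # toy recurrent step
merged = bidirectional(running_sum, running_sum, [1, 2, 3])
print(merged)  # [(1.0, 6.0), (3.0, 5.0), (6.0, 3.0)]
```

At every time step the merged output contains a summary of the past (forward pass) and of the future (backward pass), so a layer with 32 units per direction produces 64 values per step under the concatenation assumption.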

III. PROPOSED METHOD
In this section, the proposed model is explained along with a description of the layers used in it, i.e., the model parameters and the hyperparameters. The proposed CNN-BLSTM model architecture, shown in Fig. 1, consists of two 1D CNN layers stacked with two BLSTM layers, which are then followed by a dense layer and an output layer for classification. The first 1D CNN layer has 32 filters, each with a kernel of size 16 and a stride length of 1. The output of this layer is passed through a ReLU activation function. The second layer is also a 1D CNN layer, with 16 filters, each having a kernel of size 8 and a stride length of 1, followed by a ReLU activation function. The output of these stacked 1D CNN layers is then passed to the BLSTM layers. The first BLSTM layer has 32 neurons with a tanh activation function. The output of this BLSTM layer is a sequence that is fed into another BLSTM layer with 32 neurons and a tanh activation function. The second BLSTM layer generates a single vector as its output, which is fed to a dense layer of 40 neurons, followed by an output layer having 2 or 3 neurons depending on the type of classification. A Softmax function is used at the end for the mental workload classification task. This architecture is also summarized in Table IV.
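As a consistency check on the architecture described above, the trainable-parameter totals can be recomputed by hand. The sketch below assumes standard layer parameterizations (bias terms in every layer, four weight blocks per LSTM direction, and concatenation of the two directions so that the second BLSTM and the dense layer each receive 64 inputs); these details are assumptions rather than statements from the paper, and the helper names are hypothetical.

```python
def conv1d_params(in_channels, filters, kernel_size):
    return in_channels * kernel_size * filters + filters   # weights + biases

def lstm_params(input_dim, units):
    # Three gates plus the candidate block, each with input weights,
    # recurrent weights, and a bias vector (one direction only).
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return input_dim * units + units

def model_params(n_classes, n_channels=14):
    total = conv1d_params(n_channels, 32, 16)   # first 1D CNN layer
    total += conv1d_params(32, 16, 8)           # second 1D CNN layer
    total += 2 * lstm_params(16, 32)            # first BLSTM, both directions
    total += 2 * lstm_params(64, 32)            # second BLSTM, fed by 2 x 32 outputs
    total += dense_params(64, 40)               # dense layer
    total += dense_params(40, n_classes)        # output layer
    return total

print(model_params(2), model_params(3))  # 51370 51411
```

Under these assumptions the totals are 51,370 for the binary model and 51,411 for the ternary model, which matches the parameter counts reported for the two models.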
The resulting models have 51,370 and 51,411 trainable parameters for the binary and ternary classification tasks, respectively. We trained the models using the stochastic gradient descent (SGD) optimizer with cross-entropy loss. An appropriate batch size is selected from the factors of the training set size, i.e., a number that evenly divides the training set. The learning rate is chosen in a specific way, as described by Leslie N. Smith in [20]. Training started with an initial learning rate of 1e-7, which was exponentially increased in each epoch using formulae (1) and (2), where (1) was used for binary and (2) for ternary classification. After plotting the loss versus learning rate graph over the epochs, the learning rate which gave the maximum decrease in loss, i.e., where the rate of change of loss was most negative, has been selected. After fine-tuning, we found the hyperparameters described in Table V, which gave the fastest training and the best accuracy for each model.
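The learning rate selection procedure can be sketched as follows. The exact growth formulae used for the two tasks are not reproduced here, so the constant `epochs_per_decade` is a hypothetical placeholder, and the loss values are synthetic numbers made up purely for illustration; they are not results from the paper.

```python
import math

def lr_at_epoch(epoch, lr0=1e-7, epochs_per_decade=20):
    """Exponentially increasing learning rate; epochs_per_decade is hypothetical."""
    return lr0 * 10 ** (epoch / epochs_per_decade)

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at which loss falls fastest per unit of log10(lr)."""
    best_lr, best_slope = None, 0.0
    for k in range(1, len(lrs)):
        slope = (losses[k] - losses[k - 1]) / (math.log10(lrs[k]) - math.log10(lrs[k - 1]))
        if slope < best_slope:
            best_lr, best_slope = lrs[k], slope
    return best_lr

# Synthetic loss curve: flat at tiny rates, falling, then diverging.
lrs = [lr_at_epoch(e) for e in range(0, 121, 20)]
losses = [1.10, 1.09, 1.05, 0.80, 0.60, 0.90, 2.50]
chosen = steepest_descent_lr(lrs, losses)
print(chosen)
```

On this synthetic curve the steepest drop occurs around a learning rate of 1e-4, so that value would be chosen. In practice the curve is noisy, so some smoothing of the loss before taking slopes is usually advisable.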

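A. Model Evaluation Parameters

1) Accuracy
Accuracy is simply the fraction of correct classifications made by the model. It can be defined as,
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad (3)$$
where TP means True Positive, or the number of instances of the positive class which are predicted correctly, TN means True Negative, or the number of instances of the negative class which are predicted correctly, and FP and FN are the numbers of false positives and false negatives, respectively.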

2) Precision
Precision is the fraction of positive predictions that are correct. It can be defined as,
$$\text{Precision} = \frac{TP}{TP + FP}, \quad (4)$$
where FP means False Positive, or the number of instances of the negative class which are wrongly predicted as positive, and TP is the True Positive as defined in (3).

3) Recall
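Recall is the fraction of all positive instances that the model predicts correctly as positive. It can be defined as,
$$\text{Recall} = \frac{TP}{TP + FN}, \quad (5)$$
where FN means False Negative, or the number of instances of the positive class which are wrongly predicted as negative.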

4) F1 Score
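F1 Score is defined as the harmonic mean of precision and recall, and it helps to give an overall measure of the model. It is defined using (4) and (5) as,
$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \quad (6)$$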
These above metrics are used to evaluate the performance of the proposed model in binary and ternary classification tasks.
In the results, the class-weighted averages of the above metrics have been reported.
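A minimal sketch of how such class-weighted averages can be computed from a confusion matrix is given below. It assumes that each class is weighted by its share of the true samples (its support), which is a common convention but is not spelled out in the paper; the function name and the example matrix are hypothetical and do not reproduce any reported result.

```python
def weighted_metrics(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n_classes)) - tp
        fn = sum(cm[k]) - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = sum(cm[k]) / total   # class support as weight (assumed)
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

# Illustrative 2-class confusion matrix, not taken from the paper.
acc, prec, rec, f1 = weighted_metrics([[40, 10], [5, 45]])
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
```

With support weighting, the weighted recall always equals the overall accuracy, which is a useful sanity check when reading the result tables.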

B. Binary Classification
In this subsection, the analysis of the model trained for EEG classification into low and high workload classes is discussed. The model has been trained as defined in section III using the hyperparameters mentioned in Table V. A single model has been trained and tested through the holdout method, and through 5-fold and 7-fold CV, as mentioned in section II. Fig. 2a and 2b show the training loss and accuracy for 5-fold and 7-fold CV, respectively. In these figures, the bold line represents the mean of these values, and the lines in the background are the individual metrics for each fold. It has been noticed that there are sudden spikes in the loss curve during training which are quickly reduced; these arise due to a bad mini-batch being randomly generated during optimization. It is also seen that towards the end of the training, all curves stagnate and converge to around the same accuracy level, which indicates that our results are robust and reproducible.
We also plotted the confusion matrix of the trained model on the holdout test dataset, as shown in Fig. 3. The model performs well in separating the low workload EEG samples from the high workload samples. Out of 993 test samples in the holdout method, only 21 are misclassified. The misclassification of Class 0 as Class 1 is slightly higher, which may be related to more samples of Class 1 being available in the dataset, leading to a slight class imbalance.
Table VI reports the measured metrics; for K-fold CV these are the mean and standard deviation of the metrics measured across the K folds. The model gives an impressive accuracy of 97.89% on the test dataset when using the holdout method of training. Further, for the 5-fold CV, 3 out of 5 folds had a testing accuracy greater than 97%, but folds 3 and 4 had small disturbances in their loss and accuracy curves towards the end of the training, decreasing their accuracy to 96.07% and 93.27%, respectively. This led to a slight decrease in mean accuracy to 96.54% and an increase in standard deviation for 5-fold CV. Similarly, for 7-fold CV, 6 out of 7 folds had a testing accuracy greater than 96.50%, but as evident from Fig. 2, the learning was unstable for some folds. Particularly for fold 3, there was a large spike in loss from which it could not recover in the given epochs. It also had a drop in its accuracy in the last 3 epochs, giving a testing accuracy of 90.89%, which brought down the mean accuracy to 96.77%.

C. Ternary Classification
The same model architecture as discussed in section III has been used, with the hyperparameters for 3-class classification mentioned in Table V. As in binary classification, in ternary classification we have done training using the holdout method, 5-fold CV, and 7-fold CV. Fig. 4a and 4b show the training loss and accuracy, with the bold line representing the mean over all the folds and the lines in the background representing the individual folds. Like binary classification, ternary classification also shows spikes in the loss in the middle of the training for some folds, caused by a bad mini-batch being randomly generated. But as we reach the end of the training, the curves stabilize and reach the same accuracy level.
We have also shown the 3-class confusion matrix of the trained model on the holdout test dataset in Fig. 5. There is only marginal misclassification, showing that the model was able to learn to distinguish low, moderate, and high workload.
The performance analysis of the proposed model, using the model evaluation parameters discussed earlier, is shown in Table VII. The holdout results and the K-fold CV results are shown in this table. Like binary classification, the K-fold CV results are reported as the mean and standard deviation across the folds. The model gives an impressive test accuracy of 95.87% with an F1 score of 95.88% for the holdout method of training. We have also obtained substantial performance from the proposed model when judged using 5-fold and 7-fold CV. For 5-fold CV, despite spikes in loss during training for fold 3, all the folds were able to learn adequately from the data, giving a mean accuracy of 94.68%. Similarly, for 7-fold CV, except for folds 1 and 6, the testing accuracy was more than 96% for all folds. Due to unstable learning, folds 1 and 6 had difficulty learning from the data and recovering from huge spikes in loss, which resulted in accuracies of 92.91% and 89.21%, respectively. This brought down the mean accuracy measures, but it was simply caused by randomness in the deep learning process and is accounted for when reporting the CV results. The CV results clearly show the robustness and generalizability of our proposed method.
In this study, we estimated the impact of cognitive load during multitasking activities, which leads to mental workload, using EEG data. Here, the STEW [7][8] mental workload data for subjects during the SIMKAP test is used. We were able to build a single classifier for all subjects, overcoming the subject-to-subject variability which is a great challenge when using EEG data for classification.
The comparative analysis of our proposed model with the current state-of-the-art models has been shown in Table VIII. The model architecture, the number of classes to estimate and the protocol used to induce and measure cognitive workload are mentioned in this table. The serial numbers from 1 to 6 represent recent research done on mental workload/ cognitive load estimation using other testing protocols as discussed in section I. The models numbered from serial 7 to 9 represent recent research on the STEW [7-8] dataset using handcrafted feature extraction and engineering. Serial number 10 and 11 -fold CV. For the binary classification task, the current state-of-the-art r models made on testing protocols other than SIMKAP, and for SIMKAP testing protocol, the maximum accuracy is only 86.33%. Our proposed model exceeds this performance significantly by attaining accuracy of 96.77%.
Similarly, for ternary classification, the state-of-the-art based models while for other protocols this accuracy reached up to 86.52%. Again, the proposed model far exceeds these models and attains an accuracy of 95.36%. Since the dataset we used is open access, our work is easily reproducible and can be extended in the future.
We have demonstrated that end-to-end deep learning can be successfully used for multi-channel EEG signals classification. Simple data preprocessing like bandpass filtering and data augmentation like windowing of data are sufficient for adapting the raw EEG data for deep learning. This study follows the recent trend of deep learning surpassing models which use handcrafted feature extraction and engineering.
The model utilizes only around 50,000 parameters, which results in fast inference and training, while other models based on deep learning have significantly more parameters. Due to the lightweight nature of our model, it can easily be updated, maintained, and utilized in real-time classification of mental workload. We have focused on the SIMKAP multitasking data and not on the no-task data, which involves the subjects in a resting state. Classification of mental workload during multitasking is of more use for operator efficiency in tasks like air traffic management than just being able to distinguish between the task and no-task states of a subject. The model has only been tested on the STEW dataset, in which all the subjects have the same gender, education level, and age. The impact of these factors on the classification of MWL is not studied.

VI. CONCLUSION
In this paper, we developed a new model using cascaded deep 1DCNN and BLSTM for binary and ternary classification of a mental workload dataset. The SIMKAP-based multitasking dataset contained data for subjects doing multitasking activities. We used an end-to-end deep learning methodology that did not require any handcrafted feature extraction and engineering. Using only around 50,000 parameters, the proposed model achieves accuracies of 97.89% and 95.87% with the holdout method, 96.54% and 94.68% with 5-fold cross-validation, and 96.77% and 95.36% with 7-fold cross-validation for binary and ternary classification, respectively, far exceeding the state-of-the-art.
In the future, we would like to evaluate our model on other reputed mental workload datasets which have more diversity in subjects. We would also like to explore using a time-distributed 2D convolutional neural network layer and see whether it extracts spatial information better than the 1DCNN.