Physiologically-Informed Gaussian Processes for Interpretable Modelling of Psycho-Physiological States

The widespread popularity of Machine Learning (ML) models in healthcare solutions has increased the demand for their interpretability and accountability. In this paper, we propose the Physiologically-Informed Gaussian Process (PhGP) classification model, an interpretable machine learning model founded on the Bayesian nature of Gaussian Processes (GPs). Specifically, we inject problem-specific domain knowledge of inherent physiological mechanisms underlying the psycho-physiological states as a prior distribution over the GP latent space. Thus, to estimate the hyper-parameters in PhGP, we rely on the information from raw physiological signals as well as the designed prior function encoding the physiologically-inspired modelling assumptions. Alongside this new model, we present novel interpretability metrics that highlight the most informative input regions that contribute to the GP prediction. We evaluate the ability of PhGP to provide an accurate and interpretable classification on three different datasets, including electrodermal activity (EDA) signals collected during emotional, painful, and stressful tasks. Our results demonstrate that, for all three tasks, recognition performance is improved by using the PhGP model compared to competitive methods. Moreover, PhGP is able to provide physiological sound interpretations over its predictions.


I. INTRODUCTION
O WING to their success in several application domains [1], Machine Learning (ML) models are becoming the method of choice for tackling recognition and detection problems in healthcare applications such as affective computing [2]. However, to validate ML models in healthcare domains, traditional ML metrics for performance assessment, e.g., accuracy, precision or recall, are no longer sufficient. Alongside these metrics, the interpretability of model predictions is the highest priority for clinicians and healthcare practitioners, as it allows them to understand, debug and assess the predictive ability of ML model [3]. For instance, in real-life affective computing applications, a psychologist may want to know why a ML model is diagnosing a mental condition, or a physiologist may be interested in understanding whether the physiological dynamics is taken into account in the learning algorithm [4]. Therefore, the choice of the best model for recognition of psycho-physiological states is subject to a trade-off between the performance and the interpretability of the ML system.
By working with the raw, unprocessed signal, end-to-end models, in particular, have often been found to outperform human/expert recognition performance. Unfortunately, the resulting models behave as black-boxes and are less interpretable at the model design level. Furthermore, because of the reliance on human and expert participation in data collection experiments, affective datasets are often small in size, sparsely labelled, and thus violate the big-data assumption that deep learning relies on [5].
On the other hand, rather than employing end-to-end models, practitioners often rely on standard and well understood feature extraction techniques [6] that utilise human-generated expert knowledge and established problem-specific findings. While previous literature has considered learning end-to-end models in conjunction with feature-extraction pipelines [7], [8], to the best of our knowledge an effective method for their principled and interpretable integration in the context of affective computing is missing. Intuitively, incorporating expert domainspecific knowledge into an end-to-end framework has the potential of further informing a ML model about the context of the input data. This may not only enhance the recognition performance, thus enabling end-to-end learning, but can also provide the practitioner with a more transparent, understandable ML model.
In this work, we introduce the Physiologically-Informed Gaussian Process (PhGP) model as a tool for integrating, through a Bayesian principled approach, information contained in automatically-discovered patterns in the raw signal data with expert domain knowledge available about the problem at hand. Specifically, we proceed by encoding the latter, assumed to be in the form of probabilistic assumptions about the data-generating physiological process, as a set of stochastic objects that are probabilistically correlated with the input raw physiological signal and the subject's psycho-physiological state. By relying on a MAP (Maximum A-Posteriori) estimation for the quantities involved, we show how this can be used to infer a prior distribution over a Gaussian Process (GP) classification model.
Intuitively, in the PhGP framework, the prior physiologicallybased model is thus used as a soft starting point for model training, and is adapted in a probabilistic fashion according to the dataset. This potentially reduces the risk of over-fitting when learning the model directly from the raw signal and in practice allows for learning even with affective datasets of small size.
An additional aspect that enhances the interpretability of recognition models from affective biomarkers is ranking and selecting those salient features that make a major contribution to the classifier's decision. This may be potentially of interest to a clinician or physiologist since it allows better understanding of the physiological patterns underlying the classification problem [9]. One example is a recently proposed interpretation of deep regression models for depression detection by identifying salient regions in face images in terms of their severity level, which reveal the visual depression patterns on faces [4].
We take a similar approach and, to enhance the interpretability of our PhGP framework, in addition to the transparency, formulate a novel methodology to produce quantitative and visual explanation of PhGP predictions by generating physiological activation maps (PAMs), which represent the salient patterns leading to the classifier's decisions.
We implement and empirically validate our methods on three datasets of psycho-physiological state recognition from the Electrodermal Activity (EDA) signal, relying on the Convex optimisation tool for EDA processing (cvxEDA) [10] for the design of the physiological prior model. In particular, we focus on the DEAP (A Dataset for Emotion Analysis using EEG, Physiological and Video Signals) dataset for recognition of video-induced emotion [11], the BHVP (BioVid Heat Pain) dataset for pain recognition [12] and Stroop (a dataset containing EDA signals collected during a paced stroop task test for stress recognition).
Empirical results demonstrate that, by combining the raw signal with the physiologically-based prior function, PhGP outperforms GP-based models that have access to only one of these sources of information. Further comparison of PhGP with a state-of-the-art classification method, the support vector machines embedded with recursive feature elimination (SVM-RFE) demonstrates that the former obtains competitive performance results, while still providing physiologically sound interpretations over its predictions. This paper is a significant extension of our previous work [13], which we have extended in a number of directions.
r We develop techniques that, by building on physiologically-inspired priors explored in [13], enable us to train GP classification models directly from raw, unprocessed physiological signals while providing a more transparent and explainable GP architecture.
r We develop methods to provide quantitative and visual interpretation of PhGP predictions by relying on the explicit formulation of the GP inference equations, in addition to the explainability of our framework at the model design level.
r We provide empirical comparison of PhGP with standard GP models and a state-of-the-art feature-based classification algorithm on three datasets of psycho-physiological state recognition.

II. RELATED WORK
Affective recognition from physiological sensors, i.e., the problem of inferring user's emotional/affective state from signals recorded from one's body, is routinely achieved through extraction and processing of features [6]. Mathematical models have been specifically developed as a means to uncover and make explicit the relationship that exists between the psychophysiological state of a user and his/her body signals. Examples include the integral pulse frequency modulation [14] and the point-process model [15] for the modelling of heart rate variability dynamics; causal modelling [16] and cvxEDA [10] for explaining EDA dynamics; as well as the recursive penalised least squares approach for the solution of the electroencephalogram (EEG) signal [17]. Compared to generic feature extraction methods, model-based techniques mathematically encode domainspecific expert knowledge about the physiology of the affective modelling problem itself, and as such are able to provide detailed explanations of the inherent physiological mechanisms [16].
End-to-end learning, especially in the form of deep neural network models, has consistently outperformed standard ML pipelines for affective computing, at least in cases where a sufficient amount of labelled data is available [8], [18], [19], [20], [21]. Unfortunately, overfitting problems and the lack of interpretability inherent in neural networks has thus far limited the use of these methods in practical clinical applications [5]. In an effort to overcome these issues, several works have considered techniques for extensive data augmentation [21], [22] and transfer learning or pre-trained networks [19], [23], as well as learning based on hand-crafted features [24] or creating ensembles of deep and shallow models [8]. While mitigating these issues, data augmentation and transfer learning do not fundamentally overcome them, and the use of hand-crafted features restricts a-priori the learning capabilities of deep models. On the other hand, our method, by relying on patterns automatically learned from raw data by the GP, reduces the risk of overfitting by probabilistically centering the model around the explicit solution given by a physiologically-inspired approach.
GP models have been applied in various forms in physiological signal analysis [25], [26], [27], [28]. Beside GP models, the Relevance Vector Machines are another type of probabilistic extended linear models which offer a higher flexibility for the choice of basis functions with prior on weights that enforces sparse solutions [29], [30]. However, physiologically-based design of the prior distribution in the Bayesian architecture of GP models has not been fully investigated, and priors used in the literature tend to be uninformative. The authors in [25] proposed an approach for designing priors for GPs specifically tailored to capturing hemodynamics in functional magnetic resonance imaging analysis, showing that an informed GP model significantly outperforms a GP trained on uninformative priors. Similarly, the authors in [27] proposed a pseudo-Bayesian method for the estimation of intracranial pressure, where the model likelihood is informed and adapted by physiological modelling. We build on this literature to design an approach, in which the posterior distribution is informed both by the peculiarities of the dataset at hand and the information embedded within mathematical physiological models.
In this work we focus on applying our techniques to EDA signals, but note that the approach can also be used for other physiological signals. Several studies have suggested that the EDA signal, even in single-modality settings, can provide objective means of assessing psycho-physiological states, including emotional changes and distress associated with pain [6], [19], [31], [32].

III. METHODS
A depiction of the complete prediction pipeline for the PhGP model is given in Fig. 1, in the case of recognition of psychophysiological states from EDA recordings (block A of the plot).
First, information from a physiologically-based model of the EDA signals is encoded into a probabilistic generative model that captures their relationship with the raw input signal and the subject's psycho-physiological state (block B in the plot, and discussed in Section III-B). This is then used, together with a MAP estimation of the feature representation and an approximate Maximum Likelihood Estimation (MLE) for the hyper-parameters, to define a Gaussian prior over the PhGP model (block C in the plot). Finally, posterior Bayesian inference is performed for the PhGP model to obtain a prediction for the subject's psycho-physiological state y (i) ∈ {0, 1}, and interpretability analysis is employed to provide a quantitative explanation of the prediction made (block D in the plot, and discussed in Section III-C).

A. Preliminaries
We consider a generic training dataset associated to a binary classification problem, where N = |D| represents the number of observation data points. Each x (i) denotes the i−th, raw, physiological signal recorded at n discrete time-steps, while y (i) represents its associated class label, i.e., the subject's psycho-physiological state that we aim to model. Let x =[x (1) ,...,x (N ) ]∈R N ×n be the combined vector of input physiological signals and y =[y (1) ,...,y (N ) ]∈R N be its associated class vector, encoding the psycho-physiological state of interest (see Section IV-A).
In two-class classification we proceed by defining a latent variable, f ∈ R, that represents the classification logit, and relate it to the class probability by means of a sigmoid function σ(·). 1 We employ Gaussian Process classification with Laplace approximation for modelling the relationship between x and y. Briefly, this is achieved by placing a Gaussian prior function, p(f |x), over the latent variable, performing Bayesian inference on it, and finally computing the predictive probability function, denoted with π(y = 1|D, x), that encodes the probability that x belongs to class 1.
The key observation that we exploit in this paper is that the prior function p(f |x) depends on a mean function μ : R n −→ R and a kernel (covariance) function k : R n × R n −→ R. In Section III-B we show how these can be adapted to automatically fit data arising from psycho-physiological processes. An in-depth overview of Gaussian Process classification can be found in [33], and a summary of the relevant background is given in the Supplementary Material.

B. The Physiologically-Informed Gaussian Process Model
The PhGP model builds on Bayesian learning in order to embed information from a physiological model into the training process. To achieve this, we encode the model as a functional distribution over the latent variable f ∈ R and feed it into the definition of the GP prior, p(f |x).
Specifically, in PhGP we interpret a physiological model as a set of unobservable sub-processes s = [s 1 , . . . , s l ]. Given a subject's psycho-physiological state, y, the process s gives rise to the observable physiological signal x according to a stochastic generative model of the form: for an unknown density function p. Intuitively, s captures the physiological phenomenon related to certain physiological processes. Most of these, are not actually directly observable through the use of physiological signal monitoring, hence we assume that s is unobservable. For example, in the case of the EDA signal (discussed in more detail in Section III-D), a physiological model, p(x, y, s 1 , . . . , s l ), captures the relationship between the subject's condition and the sudomotor nerve activity (SMNA) that gives rise to a variation of the skin conductance properties. The idea is that the space in which the sub-processes contained in s are defined allows for a better understanding of the signal properties, and for the extraction of a set of m relevant quantifiers (features), which we denote as In this way, the relationship with the subject's condition, y, is understood in terms of direct, physiological correlation.
In PhGP, we consider the feature vector ω(s) as a building block of the prior knowledge used to approximate the effect of s on the generation of the subject's condition y in (1). We do this by embedding this information in the prior mean function, μ(x), of the GP and employ a parametric approach to estimate its effect on the GP output. In particular, we investigate the suitability of polynomial and trigonometric functions of the form: where α is a vector of unknown parameters that adapt the shape of m j (s|α). Parameter d in (2) is the degree of the polynomial function and we retrieve a constant, linear, quadratic and cubic function for the values d = 0, 1, 2, 3. In (3) d is the number of projected cosine components. We observe that, by relying on the probabilistic relationship that exists between x and s (1), m j (s|α) can be used to naturally induce a prior mean function over the GP. By marginalising over the random variable s we in fact obtain: Notice that the resulting mean prior function μ(x|α) we obtain, combines standard parametric mean functions (e.g., those of (2) and (3)) with the information coming from the problem specific physiological model of (1) in a modular fashion. In general, this integral cannot be computed analytically, so in practice we employ a Monte Carlo approximation for its computation, namely: for M random samples of s. The PhGP prior is then defined by the choice of the kernel function k(x, x), for which we employ the squared-exponential kernel computed directly on the raw physiological signal x, as this provides flexible and smooth GP models that can adapt to different classification boundaries for specific choices of hyper-parameters [34]. The mean and covariance thus defined centre the PhGP prior around the model estimation provided by the physiologicallybased generative model. The learning procedure for the PhGP model then follows the lines outlined in the Supplementary Material for the standard GP case.
In particular, by plugging (5) for the physiologically informed prior mean for the a-posteriori GP mean, for a test point x * , in PhGP we obtain: where k * ∈ R N is the covariance computed between x * and each point in the training set, K ∈ R N ×N is the training set covariance matrix, andf ∈ R N is the mode of the posterior distribution over the training set. Further notice that p(s i |x * ) represents the conditional density function of s given x * , evaluated at the Monte Carlo sample s i , and m j (s i |α) is the evaluation of the function m j with parameters α in s i . We note that, in (6), the solution provided by the physiological model is adapted by the raw data naturally following the Bayes rule.

C. Interpretability Analysis of the P hGP Model
A key advantage of relying on physiologically-informed features in PhGP is the ease of interpretability of the resulting model architecture. This is because the features are directly fed into the learning process of the GP classification model. However, an additional degree of interpretability can be achieved after obtaining the predictions made by the model. Therefore, we build an interpretability framework for our proposed P hGP model. We achieve this by extending the interpretability methods discussed in [35], [36] based on the explicit form of the GP inference equations to the context of PhGP modelling. Namely, we proceed by propagating small input perturbations in succession through the physiologically-based prior model and then through the GP posterior in order to obtain an estimate of the contribution of each data point to the overall model prediction. The outcome is an interpretability metric, denoted Φ, that corresponds to the importance of each data point in the recognition task.
Consider a generic input point, x ∈ R n , then, for any subset of indices I ⊆ {1, 2, . . . n}, we call x I the subvector of x that includes only the indices of I, that is, set of allowed perturbations of magnitude up to γ around x * . In the following definition we quantify the maximum effect that local perturbations of the subvector of indices I have on the classification probabilities. Definition 1: Consider T I γ,x * , then we define 1|D, x).
Then, for a finite set of test points T we define the interpretability metric by Φ(T , I, γ) = 1 |T |

D. EDA-Based Modelling
EDA broadly refers to the variations in the skin conductance induced by the sudomotor nerve activity which modulates the sweat secretion of the eccrine glands. It can be measured through an EDA meter, a device that displays the electrical conductance change between two points over time. We can now give an explicit formulation for EDA-based PhGP modelling, by considering the cvxEDA model [10] as the generative model of the physiological signal. Namely, this represents the observed n-sample long signal x as a sum of three n-long components, a tonic component (s 1 = t), a phasic component (s 2 = r) and an additive noise term ( ), according to the following equation: The tonic activity contains information about the overall psychophysiological state of the subject, while the phasic component shows rapid changes in EDA signals directly related to an external physiological stimulation. The phasic component is the output of the convolution between the SMNA and an impulse response function that describes the sweat diffusion process. We refer to the sparse SMNA driver of the phasic component as p. We encode the parameters of this model in the stochastic generative model in the form of (1). We then use the following standard quantifiers of x, t, r and p to form a set of features that constitute the vector ω EDA (s) [10], [37]: r ω r,p (s): includes the number of significant phasic driver peaks (nSCR), the sum of Skin Conductance Response (SCR) amplitudes (SumAmpSCR), the maximum value of SCR amplitudes (MaxAmpSCR), and the mean and standard deviation of phasic activity (P hasicM ean, P hasicStd); r ω t (s): includes mean and standard deviation of tonic activity (T onicMean, T onicStd); r ω x (s): includes EDASymp, which is highly correlated to the activity of the sympathetic nervous system and is obtained by integrating the spectrum of x within the (0.045 − 0.25 Hz) frequency band.

A. Experimental Data
We perform our analyses on two publicly available datasets, which report EDA signals during changes in autonomic nervous system (ANS) activity, namely, the DEAP dataset [11], the BHVP (BioVid Heat Pain) dataset [12] and the Stroop test, a dataset collected in our laboratory (block A in Fig. 1). Details of all three datasets are given in the following paragraphs.

1) Affective Valence Recognition (Emotional Stimulus):
The DEAP dataset consists of multi-modal physiological recordings (including EDA), recorded from 32 healthy subjects watching different affective video clips. During each trial, the index of the current trial was first shown for 2 seconds and a consecutive 5 seconds of baseline recording was followed. Then, the subjects were exposed to the emotional stimulus for 1 minute. Finally, they were asked to mark the stimulus on a scale of 1-9. Since the 32 initial subjects were recorded by means of two different EDA acquisition systems, as reported in the dataset description page, we select only the first 21 subjects, i.e., the largest group recorded with the same system to avoid a bias that was evident from the preliminary analysis of the signals. In this paper we focus on the highest arousal and highest valence videos and choose the data recorded during the 5 highest positive valence/highest arousal and 5 highest negative valence/highest arousal videos of the subjects. The resulting dataset contains 105 observations equally balanced between the two class. Additional details of this dataset can be found in [11].
2) Autonomic Arousal Recognition (Pain Stimulus): In the BVHP dataset, a group of 87 subjects underwent a heat-induced pain experiment of four different intensities, while their physiological response was being recorded (including EDA signal). Each pain stimulus was applied at the subject's right arm for around 5 seconds. Each of the specific pain level stimulus was elicited 20 times in a randomised order for each study participant. There was a randomised rest of 8 to 12 seconds between the stimuli. We choose two states corresponding to the states with the highest and the lowest level of heat pain stimulus representing two diverse psycho-physiological states in subjects. This choice was according to pre-existing research on the same dataset, which enables baseline comparison of our results with the literature [38], [39]. We then build the training set related to this dataset from 174 observations in each class. Additional details of this dataset can be found in [12].
3) Stroop Test: 33 healthy subjects volunteered to take part in this study in University of Pisa. The experimental protocol consisted of a 5 minute resting state followed by a stressor, namely, the paced stroop test lasting for 3 minutes [40]. During this task, the subjects were shown words whose meaning was different from their displayed colors. The subjects had two seconds to press the button corresponding to the color of the displayed word and not the corresponding meaning. In case of any mistakes or missed answer, a buzzer was activated and the counter showing the number of consecutive correct answers would turn back to zero. During the experiment, the EDA signal was monitored. The subjects gave their written informed consent and the experiment was approved by the "Comitato Etico Regionale per la Sperimentazione Clinica della Regione Toscana," section "Area Vasta Nord Ovest" -Protocol n. 7803, Registry number 1072, approved on 18 Jan 2018. The recordings were carried out in agreement with the Declaration of Helsinki.

B. Recognition Pipeline With the PhGP Model
A key advantage of PhGP is that, in view of its probabilistic formulation, the model predictions take into account the information provided by the physiologically-based model as well as the raw signals available in the given dataset. In the experiments discussed in Section V, we compare PhGP with variants in which only one of the two sources of information is available, and specifically the following: 1) The input data of the classification model are raw physiological signals x. We refer to this as Raw-GP.
2) The classification model is the last step of a feature extraction pipeline associated to the feature vector ω ∈ R m . We refer to this as Feat-GP. The PhGP model can be viewed as a combination of the two approaches, as it incorporates both learning with GPs and the knowledge of the features of the physiological model. The training of the PhGP model proceeds from raw data by adapting the model distribution around the explicit solution of the physiologically-inspired computational model. We investigate the parametric prior functions described in Section III-B for all three GP-based models. Moreover, for comparison with well performing classification methods outside the GP context, we also evaluate the SVM-RFE algorithm [41] on the experimental data (see the Supplementary Material for additional details on this method). MATLAB software (R2017b version) and the Gaussian Processes for Machine Learning (GPML) toolbox [34] were used to implement GP model training and prior function estimation.

C. Interpretability Pipeline of PhGP Model
After training the PhGP model as well as the other two variants (Raw-GP and Feat-GP), the interpretability metric (Φ) is estimated for each input data point of each recognition model (refer to Section III-C). This metric indicates: 1) For Raw-GP: the contribution of each data sample in the input signal in the final prediction. 2) For Feat-GP: the contribution of each feature in the feature vector ω in the final prediction. 3) For PhGP: the contribution of each data sample in the input signal in the final prediction. However, inherently, when performing posterior inference for PhGP, the contribution of the physiologically-based features is confounded with those of the raw data from the input dataset.   EDA signal and n as the number of subjects in each dataset) over time.

A. Recognition Results
In Table 1 we provide recognition results of the PhGP model and compare its performance with that of Raw-GP and Feat-GP as well as the SVM-RFE algorithm. In particular, we investigate the performance of the GP-based models with respect to different choices of the parametric form of the mean prior functions (i.e., zero, constant, linear, quadratic, cubic and trigonometric). We do not give results for PhGP with zero and constant mean, as the PhGP formulation relies on a non-trivial mean function. The results reported are computed through a Leave-One-Subject-Out (LOSO) cross-validation procedure, so that the results and the models obtained are subject-independent. Namely, at each iteration of the LOSO validation scheme, the recognition model is trained using data from M − 1 subjects (where M is the total number of subjects) and tested on the data from the left-out subject. This procedure is iterated M times. In the table we report final performance results averaged over all subjects in terms of sensitivity (i.e., number of true positive assessments/ number of all positive assessments), specificity (i.e., number of true negative assessments/ number of all negative assessments) and accuracy (i.e., number of correct assessments/ number of all assessments) of predictions.
The results reported in Table 1 suggest that the PhGP model obtains an overall higher accuracy for all GP prior functions compared to the Feat-GP and the SVM-RFE model in all three datasets, while it outperforms the Raw-GP model for some specific GP prior functions. For example, with the linear prior function, PhGP obtains 3%, 2% and 12% higher accuracy compared to Feat-GP and 13%, 2% and 1% higher accuracy compared to the Raw-GP model in DEAP, BVHP and Stroop datasets, respectively.
Results obtained from training the PhGP model with the linear, quadratic, cubic and trigonometric prior functions show virtually similar performance in the BVHP dataset (1% difference in accuracy). However, the difference in performance is more evident in the DEAP dataset (6 − 9%) and the Stroop dataset (3 − 7%). Interestingly, the highest accuracy for the PhGP model is achieved with the linear function in all datasets.
Observe that Feat-GP significantly outperforms Raw-GP in the DEAP dataset (up to 12% improvement), while the opposite occurs in the BVHP and Stroop datasets. Overall, PhGP improves on the LOSO accuracy obtained by the SVM-RFE model by 17%, 5% and 12%, respectively, for the DEAP, BVHP and Stroop datasets.
Note that the results in Table 1 are obtained by using the full EDA prior model, that is, consisting of all the indices, ω EDA (s), discussed in Section III-D. In Fig. 3 we instead investigate the effect that each subset of features comprising ω EDA (s), namely, ω r,p (s), ω t (s), ω x (s), has on the performance of the PhGP model.
We observe that choosing the full set of features (i.e., ω EDA and blue bars in the plots) obtains the highest balanced performance between LOSO sensitivity and specificity of the prediction and therefore highest accuracy compared to when selecting subsets of features for all datasets. Although the feature subsets with ω x and ω t vectors result in higher sensitivity than ω EDA in the DEAP and BVHP datasets, the specificity is very low (53% for ω x and 58% for ω t in the DEAP dataset and 56% for ω x and 66% for ω t in the BVHP dataset). This trend is different in the Stroop dataset, where the sensitivity and specificity is the highest considering the ω EDA feature set.

B. Interpretability Results
We apply the method described in Section III-C to perform interpretability analysis of the trained PhGP model on the three datasets for each subject. For simplicity, we focus on the linear prior model (which has better overall accuracy for PhGP), though similar results can be obtained using the polynomial and the trigonometric prior by employing techniques discussed in the Supplementary Materials. We also apply the methods from [35], [36] to analyse the interpretability of the trained Raw-GP and Feat-GP models on each dataset.
We present the physiological activation maps (PAM) derived for each GP-based recognition model and briefly discuss how these maps can help in gaining better understanding of the trained model. Fig. 4 shows the generated PAM maps for Raw-GP, Feat-GP and PhGP models trained on the three datasets.
The PAM heatmaps show the normalized value (between zero and one) of the interpretability metric (φ, refer to section III-C) for each subject. The vertical axis represents the subject index, which, given the analysis is done in LOSO settings, means a different GP, i.e., one that was learned from the remaining training data. The horizontal axes in the first and third columns (corresponding to Raw-GP and PhGP models) represent the φ value for each selected data patch from the raw EDA signal, whereas in the second column (i.e., the Feat-GP model), φ is obtained for each feature index in the ω EDA (s) vector. In all the PAMs the color-bar varies from blue (denoting the value 0 for φ) to red (showing the highest obtained value for φ). Therefore, the blue blocks represent the lowest contribution of the data patch/feature index for a particular subject in the recognition model, whereas the highest contributions are indicated by the red-colored blocks in the heatmaps.
For all the models we consider 10 values for γ, equally distanced between 0 and 1. Note that the x-axis for Raw-GP and PhGP models represents time, whereas for the Feat-GP it indicates feature indices and the order in which they are shown is arbitrary.
The key observations from the panels (a, c, d, f, g, i) in PAMs (Fig. 4) are the patches corresponding to the highest contribution (red color) in the prediction outcome, which are possibly the salient regions in the signal corresponding to the physiological alternations. On the other hand, the panels (b, From the PAMs of the Raw-GP model in the DEAP dataset, we observe that the patches corresponding to the 48th-52nd seconds of the whole 60 s duration of the EDA acquisition account for the highest contribution in recognition. On the other hand, the first 18 seconds and the 36th-42nd seconds of the data show the least contribution in almost all subjects. This trend is different in the BVHP dataset, where the highest contribution corresponds to the 2nd and the 8th seconds. On the other hand, the patches occurring around the 2th-3th second and the 5th second are the patches corresponding to the highest contribution in the final prediction in the Stroop dataset. The PAMs corresponding to the Feat-GP model show the high contribution of the SumAmpSCR index (refer to Section 7) in the DEAP dataset. Although the same feature has a relatively high contribution in the BVHP dataset, the highest contribution is obtained through the T onicstd (refer to Section 7). Similarly, both phasic and tonic related indices show a high contribution in the final prediction in the Stroop dataset.
Concerning the PhGP model, the patches of data corresponding to the highest value of φ are located at the 18th-26th and 44th-51st seconds in the DEAP dataset and at the 5th and 2nd seconds for the BVHP and Stroop datasets, respectively.

VI. DISCUSSIONS
In this study we presented a novel approach for designing an interpretable recognition model using Bayesian GP classification. We proposed two levels of interpretability: i) at a model design level, we have proposed PhGP modelling which is more transparent compared to traditional GP models, through embedding physiologically-based mathematical models within the GP inference. This level of interpretability is more acceptable for the clinicians since their domain knowledge is taken into account. ii) at a post-hoc level, thanks to the analytical formulation of PhGP, it is amenable to interpretability analysis with the methods discussed in Section III-C.
The main difference between our approach and existing feature-based methods lies in the way we explicitly inject expert knowledge into the learning algorithm of the ML model in the form of previously validated physiologically-inspired models and assumptions. While feature-based approaches may utilise such expert knowledge by considering the hand-crafted features in the input space, they do not inform the learning procedure of the ML model about it. Furthermore notice that in PhGP, the interpretability metric we provide is computed formally, with provable bounds and not approximated with gradient techniques. It is important to note that the innovation in the PhGP design provides interesting insights into identifying the salient regions in the input, in view of access to both the raw physiological signal and physiologically-based features.
Comparison of our PhGP model with the Raw-GP, Feat-GP and SVM-RFE models in Table 1 demonstrate the merit of relying in recognition tasks on physiological signal analysis and information from end-to-end modelling, i.e., PhGP, by drawing on both aspects, is able to achieve competitive performance in all datasets. Moreover, although potentially having access to the same information (that is, the full raw signal), the Raw-GP model tends to overfit, while the PhGP methods benefit from the physiologically-informed prior in shaping its output distribution.
It is interesting to note how all the GP-based models outperform the SVM-RFE method; in fact, the latter tends to overfit in these settings. Furthermore, by using MLE for the hyperparameters of the prior in the GP settings, we also obtain a form of feature selection in the prior space, though in an approximate Bayesian fashion, which provides better generalisation properties. The PhGP model offers higher accuracy compared to a recent study performed in similar settings on the DEAP dataset, which obtained 71% accuracy [19] (similar to that obtained by SVM-RFE). As in here, previous studies have conducted experiments on the BVHP dataset with the aim of classifying the lowest level of pain from the highest pain threshold level and validated their results with LOSO cross-validation, achieving 77% [39] and 79% [38] of accuracy. Interestingly, PhGP improves on the accuracy of all these methods, although it targets a more difficult task of recognition, that is, classifying between the minimum and the maximum level of the pain stimulus.
PhGP obtains comparative results both to discriminate low vs. high sympathetic discharge as in BVHP and Stroop datasets, and when the sympathetic activity is similarly triggered by two different emotional processes as in the DEAP dataset. This latter aspect suggests how sympathetic activity is not a monotonous and stereotypical reaction but is modulated by the activating stimulus.
We highlight that, in addition to obtaining good accuracy performance of GP-based models, they provide different insights into the salient regions in the input data which possibly correspond to the patches in the input data where the highest alterations in physiology are present. While the PAMs in Fig. 4 obtained from Raw-GP and Feat-GP indicate the most informative regions in the raw signal and the most important features, respectively, both sources of information are inherently reflected in the salient regions obtained from PhGP.
It is relevant to highlight how the raw signal and the features which are more informative are different and specific for each dataset. The three datasets reflect three different triggers for sympathetic response: physical stressor (i.e., pain); cognitive stressor (i.e., stroop); emotional stressor (i.e., emotional pictures). Once again this might suggest that sympathetic response, as measured with EDA, is not monotonous and stereotyped but also depends on the nature of the stressor. Further studies are needed to specifically test this hypothesis and to understand the role of this sympathetic specific response.
Moreover, it is interesting to observe the consistency in the φ values reported in each of PAMs in Fig. 4. Those refer to different models learnt in the same settings (only the training/test set split varies), so this highlights how the results obtained by interpretability analysis are in a sense qualitatively independent from the specific subjects. This indicates that the model is learning features and patterns that are specific to the problem itself, rather than the particular subject involved.
Considering the EDA trend depicted in Fig. 2, it is evident that the EDA dynamics is different for each psycho-physiological condition. In the DEAP dataset, the EDA dynamics in a positive valence condition shows an elevated response from the 45th second to the end of the stimulation, while the negative valence condition is relatively smoother along the timeline with a peak response at the 18th second. For pain stimulation, while during the lowest level of pain a decreasing trend in EDA response is observed, during the highest level of pain a maximum peak of response is observed at the 8th second of the stimulation. During the Stroop task, a higher elevation in EDA response is observed compared to the resting state where the fluctuations are relatively lower in amplitude.
Although visualization of these plots can aid understanding of the different trends in each psycho-physiological condition, quantifying this difference is a difficult task. The PAMs obtained as a result of interpretability analysis introduced in this paper are a first step to quantify the position of those patches where the highest difference in the EDA response between the two psycho-physiological conditions is obtained.

VII. CONCLUSION
In this paper we provided an interpretable GP-based framework that facilitates the injection of physiologically-inspired priors in a Bayesian GP model in order to infer a user's psycho-physiological state. The experimental application of the proposed PhGP model on EDA signals in this study indicates that PhGP not only obtains comparative performance among competitive predictive models, but also provides physiologically sound interpretation of predictions which are consistent at the single-subject level. This generality of the results is a remarkable feature of the PhGP model for recognition tasks. We believe that our methodology offers considerable advantages for recognition systems that input signals from non-invasive wearable monitoring systems (e.g, smart-watches, sensorised gloves or shirts), where recording EDA is easy and inexpensive. From a clinical perspective, the proposed method will be able to support the clinician, operating for example in the psychological field, by providing not only an automatic diagnosis based on objective measures such as those of physiological data but also a tool capable of helping to understand the psycho-physiological motivation behind this diagnosis.
Although the experimental applications in this study are limited to the EDA signal, the modelling can be adapted to other available models for the analysis of physiological signals (e.g., point-process modelling of heartbeat dynamics and recursive penalized least squares solution for EEG generation) in the PhGP model. An extension of the current methodology to a multi-modal PhGP model that benefits from meaningful information content of different physiological signals is a focus for future work. The methodology for interpretability analysis of the PhGP model in this paper provides useful insights about the predictions made by the model, but this is just an initial step. In future we aim to provide systematic means of interpretability to investigate more complex properties of psycho-physiological mechanisms such as correlational and causal relationships. Moreover, we will explore methods to combine deep neural networks with GPs to encode the prior physiological model while keeping the interepretability of the framework.