Quantifying Prediction Uncertainty in Regression Using Random Fuzzy Sets: The ENNreg Model

In this article, we introduce a neural network model for regression in which prediction uncertainty is quantified by Gaussian random fuzzy numbers (GRFNs), a newly introduced family of random fuzzy subsets of the real line that generalizes both Gaussian random variables and Gaussian possibility distributions. The output GRFN is constructed by combining GRFNs induced by prototypes using a combination operator that generalizes Dempster's rule of evidence theory. The three output units indicate the most plausible value of the response variable, variability around this value, and epistemic uncertainty. The network is trained by minimizing a loss function that generalizes the negative log-likelihood. Comparative experiments show that this method is competitive, both in terms of prediction accuracy and calibration error, with state-of-the-art techniques such as random forests or deep learning with Monte Carlo dropout. In addition, the model outputs a predictive belief function that can be shown to be calibrated, in the sense that it allows us to compute conservative prediction intervals with specified belief degree.


Thierry Denoeux
Index Terms-Belief functions, Dempster-Shafer theory, evidence theory, machine learning, neural networks.

I. INTRODUCTION
THE quantification of prediction uncertainty in machine learning has recently gained a lot of attention [1], [2], [3], [4]. Whereas most approaches are based on Bayesian inference, other theories of uncertainty, such as Dempster-Shafer (DS) theory [5], [6], have also proved to be very promising [7], [8]. DS theory, also known as the theory of belief functions or evidence theory, is a mathematical formalism for reasoning with uncertainty [5], [6], which makes it possible to overcome some limitations of Bayesian inference. This formalism relies on two main components: 1) the representation of elementary pieces of evidence using belief functions and 2) their combination by a conjunctive operator, referred to as Dempster's rule of combination. The greater flexibility of DS theory makes it suitable to reason and make decisions based on weak information, which would not be adequately represented by probability measures. In particular, it is possible, using belief functions, to distinguish between conflicting evidence equally supporting different hypotheses on the one hand and sheer ignorance resulting from total lack of evidence on the other hand [9].

The author is with the Heudiasyc Lab, Université de technologie de Compiègne, CNRS, 60203 Compiègne, France, and also with the Institut Universitaire de France, 75005 Paris, France (e-mail: tdenoeux@hds.utc.fr).
Digital Object Identifier 10.1109/TFUZZ.2023.3268200
In machine learning, DS theory provides a powerful framework for expressing uncertainty about the output of a learning algorithm. Most applications so far have concerned clustering [10], [11] and classification [12], [13], [14]. In particular, in supervised classification, several algorithms have been proposed to learn evidential classifiers, defined as classifiers that represent class prediction uncertainty by belief functions. One of the first such classifiers is the evidential K-nearest neighbor (EKNN) rule [12], which constructs a predictive belief function by combining the evidence of the K nearest neighbors of the feature vector under consideration using Dempster's rule. A similar idea is implemented in the evidential neural network (ENN) model introduced in [15], in which the learning vectors are summarized by prototypes, whose locations in feature space are optimized together with other parameters. Recently, this idea has been applied to deep networks [7], [8] by adding a "DS layer" to a deep architecture.
The reason why applications of DS theory to machine learning have been confined to clustering and classification is that these learning tasks only require the definition of belief functions over a finite set of clusters or classes, a simple mathematical framework for which many tools and concepts of DS theory have been developed [6]. A belief function on a finite domain (or frame of discernment) can be conveniently represented by a mass function assigning probability masses to some subsets called focal sets. Such mass functions are easily combined by Dempster's rule. In contrast, applying DS theory to regression is more challenging, as the domain of the response variable is then the real line or a real interval. Whereas some mathematical models for generating belief functions on the real line, such as random intervals, have been well studied [16], [17], combining such belief functions by Dempster's rule is often computationally difficult and requires Monte Carlo simulation [18].
A way to circumvent this difficulty is to discretize the response variable. This is the strategy followed in [19], in which a regression neural network directly extending the ENN model is proposed. However, discretizing a variable inevitably results in a loss of accuracy. Another approach, introduced in [20], is to directly extend the EKNN rule by defining, for each neighbor, a simple mass function with two focal sets: a singleton composed of a real number and the frame of discernment. This simple method, called EVREG, was shown in [20] to have good performance compared to nearest neighbor regression and other nonparametric methods, and to allow efficient handling of uncertain response data. However, as a nonparametric method, the performance of EVREG decreases with the input dimension, and it cannot compete with state-of-the-art regression methods.
In this article, we propose a new regression neural network model that quantifies prediction uncertainty using belief functions, based on the theory of epistemic random fuzzy sets (RFSs) recently introduced in [22] and [23]. Specifically, this article makes the following contributions.
1) We formulate a loss function for regression tasks in which the prediction is given in the form of a predictive RFS, and we introduce a corresponding notion of calibration.
2) We propose a new distance-based neural network model, called ENNreg, which computes predictions in the form of Gaussian random fuzzy numbers (GRFNs) [23] defined by three output values: the mean, a variance characterizing aleatory uncertainty, and a precision reflecting epistemic uncertainty.
3) Through numerical experiments, we show the ENNreg model to be competitive with state-of-the-art machine learning models for regression tasks in terms of prediction accuracy and probabilistic calibration.
4) We show that the precision output can be adjusted to provide calibrated conservative predictive belief functions.

The rest of this article is organized as follows. The necessary background on RFSs and GRFNs is first recalled in Section II. The desired properties of predictive RFSs in regression tasks are then studied in Section III, and the proposed ENNreg model is introduced in Section IV. Experimental results are reported in Section V. Finally, Section VI concludes this article.

II. RANDOM FUZZY SETS
The theory of epistemic RFSs was introduced in [22] and [23] as a general framework encompassing both DS theory and possibility theory. We first recall some important definitions related to RFSs in Section II-A. GRFNs, a parametric family of RFSs on the real line, are then described in Section II-B.

A. General Definitions
The notion of an RFS generalizes that of a random set [24] by mapping elements of a probability space to fuzzy subsets of the domain Θ of the variable of interest. This mathematical framework has been used to model random experiments whose outcomes are fuzzy sets [25], [26], [27]. Here, as in [22] and [23], we use it as a model of unreliable and fuzzy evidence, directly generalizing the DS framework.
Under the epistemic interpretation, we see Ω as a set of interpretations of a piece of evidence about a variable θ taking values in Θ. If interpretation ω ∈ Ω holds, we know that "θ is X(ω)," i.e., θ is constrained by the possibility distribution defined by X(ω). The resulting RFS, thus, encodes a state of knowledge about variable θ. If all images X(ω) are crisp, then X defines an ordinary random set. If X is a constant mapping, it is equivalent to specifying a unique fuzzy subset of Θ, which defines a possibility distribution [29].
Belief and plausibility functions: In the following, we assume the RFS X to be normalized, i.e., to verify the following conditions.

1) For all ω ∈ Ω, the fuzzy set X(ω) is either empty or normal (i.e., its membership function reaches 1).
2) The set of elements ω ∈ Ω whose image is empty has zero probability, i.e., P({ω ∈ Ω : X(ω) = ∅}) = 0.

Conditionally on ω ∈ Ω being the true interpretation, we can define possibility and necessity measures [29] on Θ, respectively, as

Π_ω(B) = sup_{θ∈B} X(ω)(θ) and N_ω(B) = 1 − Π_ω(B̄) for all B ∈ Σ_Θ,

where B̄ denotes the complement of B. Let Bel_X and Pl_X denote the mappings from Σ_Θ to [0, 1] that associate to each subset B ∈ Σ_Θ, respectively, its expected necessity and its expected possibility, i.e.,

Bel_X(B) = ∫_Ω N_ω(B) dP(ω) and Pl_X(B) = ∫_Ω Π_ω(B) dP(ω).

It can be shown that Bel_X and Pl_X are, respectively, belief and plausibility functions [28].
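The expected-necessity and expected-possibility definitions above can be illustrated on a toy example with finite Ω and Θ. The following sketch (an illustration, not from the paper; the sets and probabilities are made up) computes Bel and Pl of a subset B as probability-weighted averages of the necessity and possibility of B under each interpretation:

```python
import numpy as np

# Illustrative sketch: a random fuzzy set with a finite interpretation space
# Omega = {w1, w2} and a finite domain Theta = {1, 2, 3}. Each interpretation
# maps to a normal fuzzy subset of Theta (a membership vector); P weights the
# interpretations.
P = {"w1": 0.6, "w2": 0.4}
X = {
    "w1": np.array([1.0, 0.5, 0.0]),  # fuzzy set "theta is around 1"
    "w2": np.array([0.0, 0.7, 1.0]),  # fuzzy set "theta is around 3"
}

def bel_pl(B):
    """Expected necessity (Bel) and expected possibility (Pl) of a subset B,
    encoded as a boolean mask over Theta."""
    B = np.asarray(B, dtype=bool)
    bel = pl = 0.0
    for w, p in P.items():
        poss = X[w][B].max() if B.any() else 0.0             # possibility of B
        nec = 1.0 - (X[w][~B].max() if (~B).any() else 0.0)  # necessity of B
        bel += p * nec
        pl += p * poss
    return bel, pl

bel, pl = bel_pl([True, True, False])  # B = {1, 2}
print(bel, pl)  # Bel <= Pl always holds for a normalized RFS
```

Here interpretation w1 fully supports B = {1, 2} (necessity 1), while w2 only makes it partially possible, so Bel < Pl.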
Combination: Consider two epistemic RFSs (Ω_i, Σ_i, P_i, Θ, Σ_Θ, X_i), i = 1, 2, encoding independent pieces of evidence. If interpretations ω_1 ∈ Ω_1 and ω_2 ∈ Ω_2 both hold, we know that "θ is X_1(ω_1)" and "θ is X_2(ω_2)." It is then natural to combine the fuzzy sets X_1(ω_1) and X_2(ω_2) by an intersection operator. As discussed in [22], the normalized product-intersection operator •, defined for any two fuzzy subsets F and G of Θ as

(F • G)(θ) = F(θ)G(θ) / sup_{θ′∈Θ} F(θ′)G(θ′),   (3)

is suitable for combining fuzzy information from independent sources, and it is associative [30]. We, thus, consider the mapping

X_12 : (ω_1, ω_2) ↦ X_1(ω_1) • X_2(ω_2).

To ensure the normality condition, we need to condition the product measure P_1 × P_2 on the set Θ*_12 of (partially) consistent interpretations, defined as

Θ*_12 = {(ω_1, ω_2) ∈ Ω_1 × Ω_2 : sup_{θ∈Θ} X_1(ω_1)(θ) X_2(ω_2)(θ) > 0}.

We then obtain the combined RFS (Ω_1 × Ω_2, Σ_1 ⊗ Σ_2, P_12, Θ, Σ_Θ, X_12), where P_12 is the conditioned product measure. This operation will be referred to as the generalized product-intersection rule with hard normalization. The corresponding operator, denoted by ⊕, is commutative and associative, and it is a proper generalization of Dempster's rule (it boils down to Dempster's rule when the combined RFSs are crisp).

B. Gaussian Random Fuzzy Numbers
To define a practical model of random fuzzy subsets of the real line, we start with the definition of a Gaussian fuzzy number (GFN). A GFN is a fuzzy subset of ℝ with membership function

x ↦ exp(−(h/2)(x − m)²),

where m ∈ ℝ is the mode and h ∈ [0, +∞] is the precision. Such a fuzzy number will be denoted by GFN(m, h). An interesting property of this family of fuzzy numbers is that it is closed with respect to the normalized product intersection: given two fuzzy numbers GFN(m_1, h_1) and GFN(m_2, h_2), their combination by the • operator (3) yields the fuzzy number GFN(m_12, h_12) with

m_12 = (h_1 m_1 + h_2 m_2)/(h_1 + h_2) and h_12 = h_1 + h_2.   (4)

A GRFN can now be defined as a GFN whose mode is a Gaussian random variable (GRV). More formally, let (Ω, Σ_Ω, P) be a probability space and let M : Ω → ℝ be a GRV with mean μ and variance σ². The RFS X : Ω → [0, 1]^ℝ defined as

X(ω) = GFN(M(ω), h)

is called a GRFN with mean μ, variance σ², and precision h, which we write X ∼ N(μ, σ², h). A GRFN is, thus, defined by a location parameter μ and two parameters h and σ² corresponding, respectively, to possibilistic and probabilistic uncertainty.
A GRFN can be seen either as a generalized GRV with fuzzy mean or as a generalized GFN with random mode. In particular, a GRFN X with infinite precision h = +∞ is equivalent to a GRV with mean μ and variance σ², which we can write N(μ, σ², +∞) = N(μ, σ²). If σ² = 0, M is a constant random variable taking value μ, and X is a possibilistic variable with possibility distribution GFN(μ, h). Another case of interest is that where h = 0, in which case X(ω)(x) = 1 for all ω ∈ Ω and x ∈ ℝ; the belief function induced by X is then vacuous.
The usefulness of GRFNs as a model of uncertain information about a real quantity arises from the fact that GRFNs can easily be combined by the generalized product-intersection rule: given two independent GRFNs X_1 ∼ N(μ_1, σ_1², h_1) and X_2 ∼ N(μ_2, σ_2², h_2), we have, as a direct consequence of (4), X_1 ⊕ X_2 ∼ N(μ_12, σ_12², h_12), with

μ_12 = (h_1 μ_1 + h_2 μ_2)/(h_1 + h_2), σ_12² = (h_1² σ_1² + h_2² σ_2²)/(h_1 + h_2)²,   (5)

and h_12 = h_1 + h_2.
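The combination rule (5) is simple enough to sketch directly in code: precisions add, the mean is a precision-weighted average, and the variance is a squared-precision-weighted average. The class and function names below are illustrative, not from the paper:

```python
from dataclasses import dataclass

# Sketch of the GRFN combination rule (5): combining two independent GRFNs
# by the generalized product-intersection rule yields another GRFN.
@dataclass
class GRFN:
    mu: float      # mean (location)
    sigma2: float  # variance (aleatory uncertainty)
    h: float       # precision (epistemic uncertainty; h = 0 means vacuous)

def combine(x1: GRFN, x2: GRFN) -> GRFN:
    h12 = x1.h + x2.h                       # precisions add
    mu12 = (x1.h * x1.mu + x2.h * x2.mu) / h12          # weighted mean
    s12 = (x1.h**2 * x1.sigma2 + x2.h**2 * x2.sigma2) / h12**2  # variance
    return GRFN(mu12, s12, h12)

# Two equally precise, conflicting pieces of evidence about the same quantity:
y = combine(GRFN(0.0, 1.0, 1.0), GRFN(2.0, 1.0, 1.0))
print(y)  # GRFN(mu=1.0, sigma2=0.5, h=2.0)
```

Note that combining evidence increases the precision h while the mean settles between the two sources, mirroring Dempster's rule in the crisp limit.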

III. PREDICTIVE RFSS
In this section and in the rest of this article, we consider a regression problem, in which the task is to predict a continuous random response variable Y from a random vector X = (X_1, . . ., X_p) of p input variables. In Section IV, we will propose a neural network model that, given an observed input vector X = x, computes a GRFN Y(x) with associated belief function Bel_Y(x) that "approximates" Y in some way. The network will be trained using a training set T = {(x_1, y_1), . . ., (x_n, y_n)} assumed to be an independent and identically distributed (iid) sample from (X, Y). In this section, we define desirable properties for a predictive RFS Y(x). A loss function measuring predictive accuracy is first defined in Section III-A. A notion of calibration is then introduced in Section III-B.

A. Loss Function
When predicting a random variable Y by a probability measure P with density function f, we typically measure the prediction error (or loss) by the negative log-likelihood (NLL)

L(y, f) = −ln f(y),   (6)

which can be interpreted as follows. The continuous variable Y is always observed with finite precision, so we actually observe an interval [y]_ε = [y − ε, y + ε] centered at y. Denoting by F the distribution function of P, the probability of that interval is

P([y]_ε) = F(y + ε) − F(y − ε) ≈ 2ε f(y),

and its logarithm is equal to ln f(y) plus a constant that does not depend on f. Minimizing (6), thus, amounts to maximizing the probability of the observations.
In the case where the prediction takes the form of an RFS Y, we no longer have a single probability of the observation [y]_ε, but two numbers: a degree of belief Bel_Y([y]_ε) and a degree of plausibility Pl_Y([y]_ε). We could, thus, replace (6) by either −ln Bel_Y([y]_ε) or −ln Pl_Y([y]_ε), where the precision ε now appears explicitly in the expression of the loss function. However, neither of these two loss functions adequately measures the quality of the imprecise predictions expressed by an RFS Y. To see this, let us assume that Y ∼ N(μ, σ², h) is a GRFN with precision h. The following inequality holds:

Bel_Y([y]_ε) ≤ F(y + ε) − F(y − ε) ≤ Pl_Y([y]_ε),

where F is the cumulative distribution function of the GRV N(μ, σ²), with equality when h = +∞. Consequently, whatever the values of μ and σ², a GRV with mean μ and variance σ² always achieves a smaller belief-based loss than that of Y, which implies that a higher imprecision of the RFS is never rewarded by a decrease of that loss function. Now, when h = 0, we have Pl_Y([y]_ε) = 1 for any y, μ, and σ²: the plausibility-based loss is, thus, minimized by the vacuous, uninformative RFS.
From the above analysis, we propose to consider as the loss function a weighted sum of the belief-based and plausibility-based losses:

L_λ,ε(y, Y) = −λ ln Bel_Y([y]_ε) − (1 − λ) ln Pl_Y([y]_ε),   (7)

where λ ∈ [0, 1] is a hyperparameter. Choosing a smaller value of λ amounts to favoring more cautious predictions. Fig. 2 shows the variation of the loss L_λ,ε(0, Y) with h for λ ∈ {0, 0.2, 0.5, 0.8, 1}, ε = 0.01, and Y ∼ N(μ, 1, h) with μ = 1 [see Fig. 2(a)] and μ = 5 [see Fig. 2(b)]. We can see that the loss is minimized for h = 0 when λ = 0, for h → +∞ when λ = 1, and for some intermediate value of h when 0 < λ < 1. For any given value of λ, the optimum precision h is smaller when μ = 5: a larger error is compensated by a smaller precision. When μ, σ², and h are free to vary, loss function (7) is minimal for μ ∈ [y]_ε, h = +∞, and σ² → 0. As in support vector regression [31], ε can be seen as a value under which the error is not penalized. We can also remark that, when ε is small, for a probabilistic GRFN Y ∼ N(μ, σ², +∞), we have L_λ,ε(y, Y) ≈ −ln φ((y − μ)/σ) − ln(2ε/σ), where φ denotes the standard normal probability density function; loss function (7) then approximates the NLL, up to an additive constant.
In Section IV, we will describe a neural network model for computing a GRFN Y(x; Ψ) from an input vector x and a parameter vector Ψ. The network will be trained by minimizing the average of loss function (7) over the training set.

B. Evidential Calibration
In the case of probabilistic predictions of the response variable Y, we usually expect the predictive probability distribution to be close to the true conditional distribution of Y given X = x. More precisely, let F_{Y|x} be a predictive cumulative distribution function for Y given X = x. For any α ∈ (0, 1), we can construct a prediction interval at level α as

I_α(x) = [F_{Y|x}⁻¹((1 − α)/2), F_{Y|x}⁻¹((1 + α)/2)].

These intervals are said to be calibrated if

P(Y ∈ I_α(X)) = α for all α ∈ (0, 1).   (8)

This property can be checked by estimating the coverage rates for different values of α using a validation set and plotting the estimates versus the nominal values α. The resulting graph is known as a calibration plot.
In the case considered here, where the response is predicted for X = x by a GRFN Y(x), we also need a notion of calibration that can easily be checked graphically. Different notions of calibration for belief functions are reviewed in [32]. Recently, Cella and Martin [33] proposed two definitions of "valid" predictions in the belief function framework. However, the weaker definition is too weak to be practically useful, while the stronger one appears to be verified only for consonant belief functions. Here, we will adopt a simple definition based on an immediate generalization of (8).
For any α ∈ (0, 1], we define an α-level belief prediction interval (BPI) as an interval B_α(x) = μ(x) ± a(x) centered at μ(x), such that Bel_{Y(x)}(B_α(x)) = α. A BPI at level α is, thus, an interval that is believed to a degree α to contain the true value of the response variable Y. The predictions will be said to be calibrated if, for all α ∈ (0, 1], α-level BPIs have a coverage probability at least equal to α, i.e.,

P(Y ∈ B_α(X)) ≥ α for all α ∈ (0, 1].   (9)

As in the probabilistic case, the calibration of evidential predictions can be checked graphically using a calibration plot (see Fig. 5). The predictions are calibrated if the curve lies above the first diagonal, and they are all the more precise as the curve is close to the first diagonal.
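The probabilistic calibration check in (8) can be sketched as follows. This is an illustration on synthetic data, not the paper's experiment: `mu`, `sigma`, and `y` stand in for the network's mean and standard-deviation outputs and the observed responses, and the data are generated to be perfectly calibrated so the estimated coverage tracks the nominal level:

```python
import numpy as np
from scipy.stats import norm

# Estimate, on a "validation set", the coverage rate of the centered
# prediction intervals mu(x) +/- u_{(1+alpha)/2} * sigma(x) for several
# nominal levels alpha, as in a calibration plot.
rng = np.random.default_rng(0)
n = 10_000
mu = rng.normal(size=n)              # predicted means
sigma = np.full(n, 1.0)              # predicted standard deviations
y = mu + sigma * rng.normal(size=n)  # responses (perfectly calibrated here)

alphas = np.arange(0.1, 1.0, 0.1)
coverage = []
for alpha in alphas:
    u = norm.ppf((1 + alpha) / 2)    # standard normal quantile
    inside = np.abs(y - mu) <= u * sigma
    coverage.append(inside.mean())

# For calibrated predictions, each estimate is close to its nominal alpha.
print(np.round(coverage, 2))
```

Plotting `coverage` against `alphas` would give the calibration plot; for evidential predictions, the same loop would use the belief degree of B_α(x) instead of the Gaussian quantile.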

IV. NEURAL NETWORK MODEL
In this section, we propose a neural network model, called ENNreg, which for an observed input vector X = x computes a GRFN Y(x), with associated belief function Bel_Y(x) quantifying the uncertainty on Y given x. Like the ENN model introduced in [15] for classification, ENNreg is based on prototypes. The distances to the prototypes are treated as independent pieces of evidence about the response and are combined by the generalized product-intersection rule recalled in Section II-A. The network architecture and propagation equations are first described in Section IV-A. Learning is then discussed in Section IV-B. Finally, a brief comparison with recent models for uncertainty quantification in deep networks is performed in Section IV-C.

A. Neural Network Architecture
Let w_1, . . ., w_K denote K vectors in the p-dimensional feature space, called prototypes. The similarity between input vector x and prototype w_k is measured by

s_k(x) = exp(−γ_k² ‖x − w_k‖²),

where γ_k is a positive scale parameter. The evidence of prototype w_k is represented by a GRFN

Y_k(x) ∼ N(μ_k(x), σ_k², s_k(x)h_k),

where σ_k² and h_k are variance and precision parameters for prototype k, respectively; the mean μ_k(x) is defined as

μ_k(x) = β_k^T x + β_k0,

where β_k is a p-dimensional vector of coefficients, and β_k0 is a scalar parameter. The quantity μ_k(x) can be seen as an estimate of the conditional expectation of the response Y given that x is close to w_k, while σ_k² is an estimate of the conditional variance. When the distance ‖x − w_k‖ tends to infinity, the precision s_k(x)h_k tends to 0 and Y_k(x) becomes vacuous. The vector of parameters associated with prototype k is, thus, ψ_k = (w_k, γ_k, β_k, β_k0, σ_k, h_k).

The output Y(x) for input x is computed by combining the GRFNs Y_k(x), k = 1, . . ., K, induced by the K prototypes using the ⊕ operator. From (5), it is a GRFN Y(x) ∼ N(μ(x), σ²(x), h(x)), with

h(x) = Σ_{k=1}^K s_k(x)h_k,
μ(x) = (1/h(x)) Σ_{k=1}^K s_k(x)h_k μ_k(x),
σ²(x) = (1/h(x)²) Σ_{k=1}^K s_k(x)² h_k² σ_k².

We can observe that σ²(x) and h(x) represent different kinds of uncertainty. The variance output σ²(x) estimates the conditional variance of Y given the input x: it, thus, corresponds to aleatory uncertainty (the larger σ²(x), the more uncertainty). In contrast, the precision output h(x) gets smaller and tends to zero when the distance to the prototypes increases: it corresponds to epistemic uncertainty (the smaller h(x), the more uncertainty).
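A forward pass through these equations can be sketched in a few lines. This is a minimal illustration under the assumptions above (precision-weighted pooling of per-prototype GRFNs); the function and parameter names mirror the text's notation but the numbers are made up:

```python
import numpy as np

# Sketch of one ENNreg forward pass: each prototype k contributes a GRFN
# with precision s_k(x) * h_k; pooling by the generalized product-
# intersection rule adds precisions and precision-weights the means.
def ennreg_forward(x, w, gamma, beta, beta0, sigma2, h):
    d2 = ((x - w) ** 2).sum(axis=1)          # squared distances to prototypes
    s = np.exp(-(gamma ** 2) * d2)           # similarities s_k(x)
    mu_k = beta @ x + beta0                  # local means mu_k(x)
    hx = (s * h).sum()                       # output precision h(x)
    mu = (s * h * mu_k).sum() / hx           # output mean mu(x)
    var = ((s * h) ** 2 * sigma2).sum() / hx ** 2  # output variance
    return mu, var, hx

# Tiny example with K = 2 prototypes in p = 1 dimension:
K = 2
out = ennreg_forward(
    x=np.array([0.5]),
    w=np.array([[0.0], [1.0]]),   # prototypes
    gamma=np.ones(K),
    beta=np.array([[1.0], [1.0]]),
    beta0=np.zeros(K),
    sigma2=np.ones(K),
    h=np.ones(K),
)
print(out)
```

Moving `x` far from both prototypes drives `s` and hence `hx` toward 0, i.e., toward a vacuous (epistemically uncertain) prediction, while `var` tracks the aleatory spread of the local models.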
The calculation of Y(x) can be performed by a neural network with an input layer, two hidden layers of 2K units, and an output layer of three units, as shown in Fig. 3. The first hidden layer is composed of K pairs of units totally connected to the input units: a radial basis function (RBF) unit with weight vector w_k that computes s_k(x), and a linear unit with weight vector β_k and bias β_k0 that computes μ_k(x). The second hidden layer is also composed of K pairs of units: in each pair k, a "squaring" unit connected to RBF unit k of the previous layer computes s_k²(x), and a "product" unit connected to RBF unit k and linear unit k of the previous layer computes the product s_k(x)μ_k(x). Finally, the output layer is composed of: 1) a linear unit with shortcut connections from the RBF units in the first hidden layer and weights h_k that computes h(x); 2) a unit connected to the squaring units of the previous layer with weights h_k²σ_k² that outputs σ²(x); and 3) a unit connected to the product units of the previous layer with weights h_k that computes μ(x).
Although the structure of this network is more complex than that of RBF networks [35], [36], the complexity of one input propagation is the same, i.e., O(pK). We can also remark that this neural network model generalizes both RBF networks and the linear model.
1) If β_k = 0 for all k, then μ_k(x) = β_k0, and μ(x) is identical to the output of an RBF neural network [37] with K hidden units, hidden-to-output weights h_k β_k0, and normalized outputs.
2) If γ_k = 0 for all k, the output μ(x) becomes linear in x, and the variance component σ²(x) is constant. We then have a linear model with constant variance.

Finally, ENNreg can also be interpreted as a special kind of fuzzy system (see [38] for a general introduction and, e.g., [39] for a recent application to fuzzy control). From this point of view, each prototype k can be interpreted as a fuzzy rule

If x_1 is W_k1 and . . . and x_p is W_kp, then Y is N(μ_k(x), σ_k², h_k),

where W_kj, j = 1, . . ., p, are fuzzy sets of the real line with membership function

W_kj(x_j) = exp(−γ_k²(x_j − w_kj)²).

Using the product t-norm, the degree of firing of this rule is

∏_{j=1}^p W_kj(x_j) = exp(−γ_k² ‖x − w_k‖²) = s_k(x).

The GRFN in the consequent part is then "discounted" by multiplying its precision by s_k(x), and the outputs from the K rules are combined by the generalized product-intersection operator ⊕. This analogy with fuzzy systems could suggest alternative models based on RFSs; this is left for further research.

B. Learning
Let Ψ = (ψ_1, . . ., ψ_K) denote the vector of all parameters in the model. Given a training set T = {(x_1, y_1), . . ., (x_n, y_n)}, the accuracy of the predictions can be measured by the loss function (7) introduced in Section III-A. To train the ENNreg model described in Section IV-A, we minimize the regularized average loss

C_λ,ε(Ψ) = (1/n) Σ_{i=1}^n L_λ,ε(y_i, Y(x_i; Ψ)) + ξ R_1(Ψ) + ρ R_2(Ψ),   (12)

where the first term is the average loss, R_1(Ψ) and R_2(Ψ) are two regularizers, and ξ and ρ are regularization coefficients. The first regularizer R_1(Ψ) has the effect of reducing the number of prototypes used for the prediction (setting h_k = 0 amounts to discarding prototype k), while R_2(Ψ) shrinks the solution toward a linear model (as mentioned before, setting γ_k = 0 for all k yields a linear model). Loss function (12) can be minimized using any batch or online gradient-based optimization procedure. The constraints h_k ≥ 0 are imposed by introducing intermediate unconstrained variables (e.g., writing each h_k as the square of an auxiliary variable).

The gradient can be computed by error backpropagation, with complexity O(Kp).
As in RBF networks, the prototypes in the ENNreg model can easily be initialized using a clustering procedure such as the K-means algorithm. As before, let {(x_1, y_1), . . ., (x_n, y_n)} be the training set. Let I_k denote the set of indices of input vectors in cluster k, and n_k = |I_k|. The prototypes are initialized as the cluster centers, w_k = (1/n_k) Σ_{i∈I_k} x_i, and γ_k is initialized proportionally to the inverse square root of the mean squared distance within cluster k. Parameters β_k0 and σ_k² are set, respectively, to the mean and variance of the response variable in cluster k. Finally, the remaining parameters β_k and h_k are initialized to fixed default values.

Overall, the model has five hyperparameters: the number K of prototypes, coefficients ε and λ in the definition of the loss (7), and regularization coefficients ξ and ρ in (12). In practice, K can be set to a large value, and the effective number of prototypes can be controlled by ξ. We found the network performance to be quite robust to the choice of ε and λ. In the experiments reported in Section V, these hyperparameters were fixed to ε = 0.01 σ_Y, where σ_Y is the sample standard deviation of Y, and λ = 0.9. Only ξ and ρ, thus, have to be tuned using either cross-validation or the hold-out method.
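The K-means initialization described above can be sketched as follows. This is a hypothetical helper, not the paper's implementation: the exact proportionality constants and defaults are assumptions, and `init_prototypes` is an illustrative name:

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the prototype initialization: cluster centers become prototypes,
# gamma_k is set from the within-cluster spread, and beta_k0 / sigma2_k come
# from the response statistics in each cluster.
def init_prototypes(X, y, K, seed=0):
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    w = km.cluster_centers_
    gamma = np.empty(K)
    beta0 = np.empty(K)
    sigma2 = np.empty(K)
    for k in range(K):
        mask = km.labels_ == k
        d2 = ((X[mask] - w[k]) ** 2).sum(axis=1)
        gamma[k] = 1.0 / np.sqrt(d2.mean() + 1e-12)  # inverse RMS distance
        beta0[k] = y[mask].mean()                    # cluster mean response
        sigma2[k] = y[mask].var()                    # cluster response variance
    return w, gamma, beta0, sigma2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + 0.1 * rng.normal(size=200)
w, gamma, beta0, sigma2 = init_prototypes(X, y, K=5)
print(w.shape, gamma.shape)
```

Starting from cluster statistics rather than random weights gives the gradient-based optimizer of (12) a sensible initial fit, as in classical RBF network training.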

C. Comparison With Alternative Approaches
In ENNreg, prediction uncertainty is quantified by the output GRFN Y(x) obtained by combining evidence from different prototypes, in line with DS theory [6]. Our approach is, thus, radically different from recently proposed methods for quantifying prediction uncertainty in deep networks, which are rooted in Bayesian inference. For instance, probabilistic backpropagation (PBP) [1] performs approximate inference in a Bayesian neural network, in which each weight has a one-dimensional Gaussian prior distribution; the marginal distributions of the activations in each layer are approximated by one-dimensional Gaussians with the same means and variances. Monte Carlo dropout, another method introduced in [2], can be interpreted as approximate Bayesian inference of the parameters in a Gaussian process; the first two moments of the output predictive distribution are estimated. In the deep ensemble method [3], a set of neural networks is trained by minimizing the NLL; the predictive distribution is a mixture of Gaussians, which is approximated by a Gaussian distribution with the same mean and variance. Finally, "deep evidential regression" introduced in [4] assumes a Gaussian prior on the conditional mean of the response variable and an inverse-gamma prior on its conditional variance; the hyperparameters of these "evidential priors" are learnt by minimizing a suitable loss function.
By considering only the mean μ(x) and variance σ²(x) outputs of our method, it is possible to compare it experimentally with the above Bayesian approaches. This will be done in Section V-B, where we will show that ENNreg has slightly better prediction accuracy than Bayesian methods and is on par with them in terms of probabilistic calibration. In addition, our model has a precision output h(x) that can be adjusted to provide conservative predictions calibrated in the sense of the definition introduced in Section III-B.

V. EXPERIMENTAL RESULTS
We first give an illustrative example in Section V-A. Results from comparative experiments are then reported in Section V-B.

A. Illustrative Example
As an illustrative example, we consider iid data simulated from the following distribution: the one-dimensional input X has a uniform distribution in the interval [−2, 2], and the response Y is a function of X and a noise term U ∼ N(0, 1), a standard normal random variable. The conditional standard deviation of Y given X = x, thus, increases linearly from 0 for x = −2 to 1/√2 for x = 2. We generated a learning set of size n = 300 and a validation set of the same size. We initialized a network with K = 30 prototypes. The values of hyperparameters ξ and ρ minimizing the mean squared error on the validation set were ξ = 10⁻⁴ and ρ = 0.01. Fig. 4 shows the outputs μ(x) of a network trained with these values, together with BPIs at levels α ∈ {0.5, 0.9, 0.99}.
A calibration plot is shown in Fig. 5, representing the proportion of α-level BPIs containing the observed value of the response in the validation set, for α ∈ {0.1, 0.2, . . ., 0.9}. In Fig. 5, we also show the calibration graph for the probabilistic prediction intervals defined as μ(x) ± u_(1+α)/2 σ(x), where u_q denotes the quantile of order q of the standard normal distribution. In this particular case, both kinds of prediction intervals are calibrated, the BPIs being more conservative than the probabilistic ones. We can remark that we can make the BPIs less conservative by multiplying the output precisions h(x) by some constant c > 1; the probabilistic intervals are recovered when c → +∞. Conversely, we can make the BPIs more conservative by multiplying the output precisions by a constant c < 1. This technique will be illustrated in Section V-B.
From Fig. 5, we can see that the calibration curve of the BPIs is close to that of the probabilistic prediction intervals, which indicates that, for this dataset, the output precision h(x) is quite high. This is confirmed by Fig. 6(a), which shows h(x) versus x. Precision first increases and then decreases with x as the errors |y_i − μ(x_i)| increase, which can be explained by Fig. 2: larger errors result in more imprecise predictions. Fig. 6(b) shows the output standard deviation σ(x) as a function of x: it slightly overestimates the true standard deviation, but the increasing trend is well captured.

B. Comparative Experiments

1) Comparison With State-of-the-Art Regression Methods: We compared the performance of ENNreg in terms of root-mean-square (RMS) error to those of eight alternative regression methods on ten datasets. Eight of the datasets (Energy efficiency, Concrete Compressive Strength, Yacht Hydrodynamics, Wine quality-red, Communities and Crime, Residential building, Airfoil Self-Noise, and Bike sharing) are available from the UCI Machine Learning Repository. The Boston dataset is included in the R package MASS [40], and the kin8nm dataset was downloaded from the OpenML website. The characteristics of these datasets are summarized in Table I. We also give in Table I the name of the response variable, as some datasets have several possible response variables. For the Communities and Crime dataset, variables with two or more missing values were discarded. For the Bike sharing dataset, the nine predictors were season, year, month, holiday, weekday, weathersit, atemp, hum, and windspeed.

TABLE I
CHARACTERISTICS OF THE DATASETS
The eight methods are as follows: 1) four nonlinear regression algorithms using RBF kernel functions: RBF networks, relevance vector machines (RVM), support vector machines (SVM), and Gaussian processes (GP); 2) two state-of-the-art methods for nonlinear regression: random forests (RF) and the multilayer perceptron (MLP) with one hidden layer of sigmoidal units; and 3) two regularized linear regression methods: ridge and lasso. For all methods except RBF networks, we used the implementation in the R package caret [41]. For training ENNreg, we used batch learning with the algorithm described in [42] for all datasets, except kin8nm, for which we used mini-batch stochastic gradient descent. Each dataset was split randomly into training and test sets containing 90% and 10% of the observations, respectively. The random splits were repeated 20 times. All input variables were scaled to have zero mean and unit standard deviation. For each method, hyperparameters were tuned by fivefold cross-validation. For ENNreg, we set λ = 0.9 and ε = 0.01 σ_Y (where σ_Y is the estimated standard deviation of the response variable) for all the simulations. The number of prototypes was fixed to K = 30 for all datasets except kin8nm and Airfoil, for which it was set to K = 100. Hyperparameters ξ and ρ were tuned by cross-validation. The results are reported in Table II. We can see that ENNreg stands among the best methods for seven of the ten datasets. It is only outperformed by RF and MLP on the Concrete dataset, by RF on the Wine dataset, and by MLP on the kin8nm dataset. In most cases, it performs strictly better than the other RBF-based methods (except for the Crime dataset, on which all methods have equivalent performance, and the Bike dataset, on which it performs as well as RBF and RVM). Overall, ENNreg appears to be a very competitive regression method in terms of prediction accuracy.
Comparing computing time across methods is difficult because it depends strongly on the implementation. We saw in Section IV that both the forward pass and backpropagation in ENNreg can be performed in linear time with respect to both the number of inputs and the number of prototypes; consequently, the algorithmic efficiency of ENNreg is similar to that of other neural network methods. We compared it with the RBF network model, which was implemented similarly to ENNreg and trained with the same optimization algorithm. Average computing times for both methods on five datasets are shown in Table III. We can see that ENNreg is actually faster than RBF networks with the same number of prototypes, which can be explained by faster convergence due to its greater flexibility.

2) Comparison With Alternative Uncertainty Quantification Methods:
We also compared our results with published results obtained with the four neural network methods reviewed in Section IV-C, on the six of the ten datasets for which published results with these methods are available. The RMS values are shown in Table IV. In terms of RMS, ENNreg performs strictly better than, or as well as, these four methods on the six datasets considered.
The PBP, Monte Carlo dropout, deep ensemble, and evidential regression methods were originally introduced for uncertainty quantification in deep networks. Probabilistic calibration is usually measured by the NLL, assuming Gaussian errors. For ENNreg, the NLL can be computed using the mean μ(x) and variance σ²(x) outputs, ignoring the precision h(x); ENNreg is then treated as a probabilistic method. The NLL values for ENNreg and the four neural network methods on the six datasets (as reported in [1], [2], [3], and [4]) are shown in Table IV. We can see that ENNreg performs similarly to the other methods and stands among the best methods for four of the six datasets in terms of probabilistic calibration as measured by the NLL.
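Concretely, when ENNreg is treated as a probabilistic predictor, the Gaussian NLL on a test set can be computed from the μ(x) and σ²(x) outputs alone; this is the standard formula, sketched here with NumPy:

```python
import numpy as np

def gaussian_nll(y, mu, sigma2):
    """Average negative log-likelihood of observations y under independent
    Gaussian predictive distributions N(mu_i, sigma2_i)."""
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma2)
                         + (y - mu) ** 2 / (2.0 * sigma2)))
```

The precision output h(x) is deliberately ignored here, as in the comparison above.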
3) Evidential Calibration: As mentioned in Section V-A, the ENNreg model provides not only conditional mean and variance estimates, as probabilistic models do, but also a precision output h(x) that can be used to compute calibrated BPIs, as defined by (9). Fig. 7 shows calibration plots for four of the six datasets above. Each plot shows the calibration curves of: 1) probabilistic prediction intervals (using only μ(x) and σ²(x)); 2) raw BPIs; and 3) adjusted BPIs obtained by multiplying the output h(x) by a constant c > 0 chosen so that the calibration curve lies above the diagonal, but as close to it as possible. Several cases can be distinguished. For the Boston and Energy datasets [see Fig. 7(a) and (b)], the BPIs are calibrated but too conservative; applying a correction factor c = 2 increases the precision of the predictions. In the case of the Concrete dataset [see Fig. 7(c)], the BPIs before adjustment are not calibrated; here, we need to apply a correction factor c < 1 to make the predictions more imprecise (we used c = 0.1). Finally, for the Wine dataset [see Fig. 7(d)], the calibration curve corresponding to probabilistic predictions already lies above the diagonal, which means that the probabilistic prediction intervals are already calibrated BPIs. In such a case, we can consider that the probabilistic predictions provide an adequate description of prediction uncertainty and neglect the precision output h(x) or, equivalently, multiply it by c = ∞.
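The calibration requirement behind these plots can be checked numerically: for each belief degree α, the empirical coverage of the corresponding intervals must be at least α. The sketch below is generic and does not reproduce the BPI formula (9); the interval bounds are assumed to be given:

```python
import numpy as np

def coverage(y, lo, hi):
    """Fraction of observations y falling inside their prediction interval."""
    return float(np.mean((y >= lo) & (y <= hi)))

def is_conservative(y, intervals_by_alpha):
    """Conservative (evidential) calibration: the empirical coverage at each
    belief degree alpha must be at least alpha."""
    return all(coverage(y, lo, hi) >= alpha
               for alpha, (lo, hi) in intervals_by_alpha.items())
```

Plotting coverage against α over a grid of belief degrees yields calibration curves like those of Fig. 7; the adjustment step then amounts to searching for the constant c that keeps this curve above the diagonal while minimizing its distance to it.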

VI. CONCLUSION
In this article, we introduced an ENN model for regression, called ENNreg. In this model, distances to K prototypes were considered as pieces of evidence about the response variable and were described by GRFNs. The total evidence was then pooled by the generalized product-intersection rule, an extension of Dempster's rule in the epistemic RFS setting. The network architecture was composed of a first hidden layer of K RBF units and K linear units, a second hidden layer of 2K units, and an output layer of three units with cross-cut connections from the two hidden layers. The network output was a GRFN described by three numbers: a point prediction μ(x), a conditional variance estimate σ²(x), and a precision parameter h(x), whose value depends on the distances between the input vector and the prototypes and which can be seen as describing the reliability of the probabilistic predictions. This additional degree of freedom made it possible to quantify not only random uncertainty but also epistemic uncertainty.
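To make this summary concrete, a forward pass with this architecture can be sketched as follows. The combination formulas below reflect our reading of the generalized product-intersection rule for GRFNs (each prototype k contributes an RBF activation s_k(x), a linear mean μ_k(x), a variance σ_k², and a precision h_k); the exact parameterization in the model may differ, so this should be treated as an illustrative sketch rather than the reference implementation:

```python
import numpy as np

def ennreg_forward(x, P, gamma, B, b0, sig2, h):
    """Illustrative ENNreg forward pass for a single input x.
    P: (K, d) prototypes; gamma: (K,) RBF scale parameters;
    B, b0: coefficients of the K prototype-level linear models;
    sig2, h: (K,) prototype variances and precisions."""
    s = np.exp(-gamma ** 2 * np.sum((x - P) ** 2, axis=1))  # RBF activations
    mu_k = B @ x + b0                                       # prototype-level means
    w = s * h                                               # evidence weights
    h_out = w.sum()                                         # combined precision h(x)
    mu = (w * mu_k).sum() / h_out                           # point prediction mu(x)
    sig2_out = (w ** 2 * sig2).sum() / h_out ** 2           # variance sigma^2(x)
    return mu, sig2_out, h_out
```

With a single prototype located at the input (s = 1), the output reduces to that prototype's GRFN, and the combined precision h(x) decays as the input moves away from all prototypes, which is the behavior interpreted above as epistemic uncertainty.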
The network output can also be expressed as a predictive belief function on the real line induced by the output GRFN. We defined a loss function for such outputs, extending the NLL to evidential regression. We also discussed the calibration of predictive belief functions for regression tasks and introduced a definition based on the coverage rates of BPIs (prediction intervals with specified belief degree), which were required to be conservative: calibrated BPIs at degree α thus contain, on average, at least 100α% of future observations. This calibration property can be assessed visually by drawing calibration plots.
Comparative experiments showed that ENNreg performs better in terms of RMS error than other RBF-based regression methods such as RBF networks, SVM, RVM, and Gaussian processes, and that it is competitive with state-of-the-art regression methods such as RF and the MLP. Furthermore, a comparison with recent approaches to uncertainty quantification in deep networks showed that ENNreg is also competitive with these methods in terms of both prediction accuracy and probabilistic calibration, while offering the advantage of a richer quantification of uncertainty thanks to the RFS formalism.
This research can be extended in several directions. For structured inputs such as time series or images, additional feature-extraction layers could be inserted between the input and prototype layers, resulting in a deep architecture, as described in [7] for image classification. For regression with multiple outputs, the ENNreg model could be extended to compute outputs in the form of Gaussian random fuzzy vectors, as introduced in [23]. Finally, other families of random fuzzy numbers could be used to accommodate outputs with, e.g., skewed distributions or bounded support.

Fig. 6. (a) Output precision h(x) and (b) standard deviation σ(x) for the simulated data. The broken line indicates the true standard deviation.

TABLE II: AVERAGE RMS AND STANDARD ERRORS FOR ENNREG AND EIGHT ALTERNATIVE ALGORITHMS ON TEN DATASETS

TABLE III: AVERAGE CPU TIME AND STANDARD ERRORS FOR ENNREG AND RBF NETWORKS ON FIVE DATASETS

TABLE IV: AVERAGE AND STANDARD ERRORS OF RMS AND NLL FOR ENNREG, PBP, MONTE CARLO DROPOUT, DEEP ENSEMBLES, AND DEEP EVIDENTIAL REGRESSION (DER) ON SIX DATASETS

Fig. 7. Calibration plots for ENNreg on four datasets: probabilistic predictions (blue circles), raw evidential predictions (red squares), and adjusted evidential predictions (green triangles).