Continuous and Distribution-free Probabilistic Wind Power Forecasting: A Conditional Normalizing Flow Approach

We present a data-driven approach for probabilistic wind power forecasting based on conditional normalizing flows (CNFs). In contrast with existing approaches, it is distribution-free (as non-parametric and quantile-based approaches are) while directly yielding continuous probability densities, hence avoiding quantile crossing. It relies on a base distribution and a set of bijective mappings. Both the shape parameters of the base distribution and the bijective mappings are approximated with neural networks. A spline-based conditional normalizing flow is considered owing to its non-affine characteristics. Over the training phase, the model sequentially maps input examples onto samples of the base distribution, given the conditional contexts, where parameters are estimated through maximum likelihood. To issue probabilistic forecasts, one eventually maps samples of the base distribution into samples of the desired distribution. Case studies based on open datasets validate the effectiveness of the proposed model and allow us to discuss its advantages and caveats with respect to the state of the art.


A. Motivation
As an essential tool to assess and accommodate wind power generation uncertainty, short-term probabilistic wind power forecasting (PWPF) has gained increasing interest in recent decades. It generally takes numerical weather prediction and historical values as input features, in order to model and communicate the probability density of wind power generation at some time in the future. Such densities may be for a unique lead time and location (hence, univariate), or jointly for several lead times and/or locations (referred to as multivariate) [1]. It has become common now to decouple the estimation of the marginal probability density function of each variable and of the interdependence structure in the multivariate PWPF [2]. In other words, univariate PWPF is usually recognized as the cornerstone of PWPF problems.
A classical approach for univariate PWPF relies on assumptions (often referred to as the parametric approach) for the distribution of future wind power generation, the parameters of which are estimated via statistical and machine learning methods. For instance, the Gaussian, Beta, or Generalized Logit-Normal distributions could be used [3]. Although it is convenient to develop models based on such assumptions, the distribution of wind power at hand may not match them. This is primarily due to the wind power generation process, in other words, the nonlinear power curve that converts energy from the wind into electric power [4]. Concretely, the characteristics of wind power generation distributions differ substantially depending on predicted weather conditions, as illustrated by [5] for instance. This has motivated many to look for distribution-free approaches, i.e., approaches that do not rely on a specific assumption for the densities to model and communicate as forecasts. Certainly the most popular distribution-free approach, also referred to as non-parametric, is quantile regression (QR) [6], which relaxes the use of distributional assumptions for the case of univariate probabilistic forecasting. It achieved great success in the Global Energy Forecasting Competition 2014 (GEFCom 2014) for instance, and has become a mainstream solution owing to its state-of-the-art performance and simplicity of use. However, it requires parallel models to be fitted for each quantile, which raises the cost of computation when the whole distribution is needed. In addition, it only provides discrete quantiles, which may lead to quantile crossing, i.e., the quantiles of the estimated distribution are mutually inconsistent.
To date, parametric models with distributional assumptions and QR remain the most effective methods [1], each with prominent characteristics: parametric models characterize the whole distribution efficiently, whereas QR models are free of distributional assumptions. For multivariate PWPF, the complex interdependence structure of the multivariate distribution can be modeled using copula models [7]. By estimating the marginal probability density function (PDF) via non-parametric methods and modeling the interdependence structure, the copula method can model complicated multivariate distributions. However, the copula-based approach relies on strong assumptions regarding the probabilistic calibration of the predicted marginals, while it often underestimates the strength of the dependence structure among the various variables. Eventually, it remains an open issue to develop an efficient, continuous, and distribution-free probabilistic forecasting model that obtains the whole distribution at once.

B. Related Works
Univariate probabilistic forecasting usually translates to communicating quantile forecasts, prediction intervals (PIs), and predictive densities. Quantile forecasts and PIs are specific characteristics of predictive densities, and are most often obtained by QR. Based on this approach, several machine learning models such as the neural network (NN) [8] and the gradient boosting machine [9] have been adopted to estimate conditional quantile functions. It is then simple and effective to construct a PI from two corresponding quantile functions. A (1−β)×100% PI can be constructed by the pair of quantiles (α, 1−β+α), where α ∈ (0, β). For instance, the pair (β/2, 1−β/2) is typically selected in the literature [10], [11]. However, both quantiles and PIs only provide partial information about the probability density, and can hardly support power systems operation based on stochastic programming, where the whole distribution of future wind power generation is often required.
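For illustration, the central quantile-pair construction above can be sketched in a few lines of Python; the Gaussian predictive density used here is purely hypothetical, standing in for any estimated predictive quantile function:

```python
from statistics import NormalDist

def central_pi(quantile_fn, beta):
    """Central (1 - beta) x 100% prediction interval from a quantile function,
    using the pair (beta/2, 1 - beta/2) as in the text."""
    return quantile_fn(beta / 2.0), quantile_fn(1.0 - beta / 2.0)

# Hypothetical predictive density: Gaussian with mean 0.5 and std 0.1
pred = NormalDist(mu=0.5, sigma=0.1)
lo, hi = central_pi(pred.inv_cdf, beta=0.10)  # 90% central interval
```

By construction, the interval covers exactly 90% of the predictive probability mass, i.e., `pred.cdf(hi) - pred.cdf(lo)` equals 0.9.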
As a result, communicating densities has been an active research topic in the PWPF community. Besides the aforementioned parametric approach, resampling and advanced density estimation techniques have been adopted, as reviewed in [1], [12]. The idea of the resampling method lies in estimating the PDF of the empirical errors of point forecasts, which therefore makes the method distribution-free. In order to issue conditional densities for PWPF, fuzzy inference has been applied to classify forecast conditions into several modes [5]. But such finite classifications cannot continuously adapt to all forecast conditions. Furthermore, the quality of the estimated densities is strongly related to the performance of the underlying point-forecast models. The non-parametric density estimation method, namely kernel density estimation (KDE), has been popular in the PWPF community due to its universal approximation capability. In particular, models based on KDE usually deduce the density of a finite population selected by k-nearest neighbors [13]. As with the resampling method, this method is still limited in modeling conditional densities, since the employed k-nearest neighbors operation is restricted in dealing with heterogeneous distributions. That is, once k is fixed, the KDE-based model cannot adaptively select the finite population. In addition, the k-nearest neighbor operation suffers from the curse of dimensionality. Recently, the mixture density network (MDN) has been applied in PWPF, as it can model more complex distributions (compared to a single Gaussian) through a conic combination of Gaussian distributions [14]. But it can suffer from mode collapse, i.e., the estimated distribution ultimately collapses into a single Gaussian distribution, as well as from training instability [15].
Multivariate probabilistic forecasting often communicates scenarios as forecasts, which are drawn from predictive densities. The scenario generation procedure is based on the probability integral transform (PIT) and the interdependence structure [2]. Concretely, one draws realizations from the estimated multivariate standard Gaussian distribution, and converts the realizations into scenarios of wind power generation via the inverse PIT. Besides, an emerging approach is to directly learn multivariate densities based on advanced generative models, such as the generative adversarial network (GAN) adopted in [16]. The GAN is composed of a generator and a discriminator, where the generator is responsible for generating scenarios at the operation stage. Although it is computationally more efficient than the copula method, it suffers from notorious training instability caused by the game between the generator and discriminator at the training phase [17]. Moreover, [16] only presents the applicability of GANs to generating scenarios, and as such is not focused on producing various forms of probabilistic forecasts, e.g., predictive densities in univariate and multivariate setups. Indeed, it has not even been assessed by proper statistical scores. The most related work is [18], which compares the performance of several generative models, i.e., GAN, variational auto-encoder, and an integration-based normalizing flow (NF). But its primary focus is to compare the performance of deep-learning based generative models. It is reported in [18] that the performance of the integration-based NF is limited, let alone compared to state-of-the-art QR models. Besides, the authors are unaware of the differences between affine NFs (which are in fact equivalent to parametric models with a Gaussian distributional assumption) and integration-based NF models. Therefore, issues such as the applicability of NFs and their relationship to existing models remain unaddressed.

C. Proposed Method and Main Contributions
As a basis for this work, we get inspiration from [5] and [19], which relied on the idea of transforming samples of the bounded stochastic process at hand to make them more suitable to be modeled by a Gaussian (or multivariate Gaussian) variable. Besides, parametric models always serve as good candidates for estimating the underlying distributions of wind power generation [20]. Thus, it is appealing to set a parametric model to learn a base distribution, and to transform the base distribution into the desired distribution (in view of the underlying distribution of wind power generation) at an affordable cost. Indeed, this is permitted by the conservation of probability measure [21], which translates into saying that one can transform a variable that follows an arbitrary distribution into a variable that follows a desired distribution with the assistance of a bijective mapping (transform). Here, instead of using a manually designed transform, we implement such transforms via the NF [22], [23]. An NF framework is composed of a base distribution and a sequence of trainable bijective mappings. Both the shape parameters of the base distribution and the bijective mappings are modeled by neural networks (NNs). Besides, such transforms ought to be non-affine so that the model can flexibly characterize the wind power distribution under different conditions. Concretely, we establish a distribution-free PWPF model based on the combination of a parametric model with a Gaussian distributional assumption and a conditional auto-regressive NF [24], which is applicable to both univariate and multivariate PWPF applications. Unlike copula models, where the marginal PDFs and the interdependence structure are modeled separately, here the joint probability density is derived through the chain rule of probability, i.e., the product of conditional probability densities. In particular, such conditional probability densities are also dependent on input features.
The base Gaussian distribution is estimated by the parametric model, whose realizations are then mapped into those of the desired distribution via a spline-based NF [25]. Owing to the non-affine characteristics of the spline-based NF, the model can characterize the predicted distribution of wind power generation more flexibly. The spline operates in an element-wise manner, i.e., the mapping for each dimension is specified by the outputs of an NN that takes contextual features and the values of previous dimensions as inputs. All parameters are estimated simultaneously based on maximum likelihood. Case studies validate the effectiveness of the proposed model, which achieves state-of-the-art performance.
The main contributions of the paper are: (i) The proposal of a distribution-free PWPF model, which handles the bounded characteristics of wind power by combining a parametric model with a Gaussian distributional assumption and non-affine transforms. (ii) The demonstration of its applicability to model the whole predictive distribution, which avoids the quantile crossing issue in univariate PWPF while still presenting performance comparable to state-of-the-art QR models. (iii) A new perspective on conditional PDF estimation for PWPF based on function theory, which offers complementary understanding of the merits and caveats of distribution-free approaches versus parametric approaches.
The remainder of this paper is organized as follows. Section II introduces the problem formulation and the methodological components of normalizing flows. Our approach to their application to univariate and multivariate probabilistic wind power forecasting is described in Section III. Section IV summarizes the data sources and experimental implementation. The results obtained are presented in Section V, where the performance comparison with existing models is discussed. Section VI concludes the paper.

A. Preliminaries
The most important property to consider for normalizing flows is the conservation of probability measure.
Definition 1 (Conservation of Probability Measure): Consider a random variable Z with PDF f_Z : Z → [0, +∞), and an invertible transform T : Z → Y. For any subset ω ⊆ Z, we have
∫_ω f_Z(z) dz = ∫_γ f_Y(y) dy,
where γ = {T(z) | z ∈ ω}, as illustrated in Fig. 1. By utilizing the change of variable z = T^{-1}(y), we convert the formula into
∫_γ f_Z(T^{-1}(y)) |det J_{T^{-1}}(y)| dy = ∫_γ f_Y(y) dy,
where J_{T^{-1}}(y) denotes the Jacobian matrix s.t. J_{T^{-1}}(y) = ∂T^{-1}(y)/∂y.
As it holds for any subset γ ⊆ Y, we have
f_Y(y) = f_Z(T^{-1}(y)) |det J_{T^{-1}}(y)|.

B. Problem Formulation
Consider p wind farms whose generation is driven by a multivariate stochastic process. For wind farm i, let y_{i,t} denote the generation value at time t, which is a realization of the corresponding random variable Y_{i,t}. Then, let f_{Y_{i,t}}(y) and F_{Y_{i,t}}(y) respectively denote the PDF and cumulative distribution function (CDF) of Y_{i,t}. Univariate PWPF boils down to estimating the PDF of Y_{i,t+H}, i.e., f̂_{Y_{i,t+H}|t}, given the information Ω_{i,t} up to time t via a model M, where H is the forecast horizon and Θ̂ represents the estimate of the real parameters Θ. Certainly, information from nearby wind farms could be used to improve the forecasts, if available [26]. The information may contain previous wind power generation values, i.e., {y_{i,t−l}, ..., y_{i,t−1}, y_{i,t}}, and some exogenous features such as numerical weather predictions (NWPs). Accordingly, one can also obtain the CDF F̂_{Y_{i,t+H}|t}. A PI with nominal level (1−β)×100% can be formed by two quantiles, q̂_α and q̂_{1−β+α}, with α ∈ (0, β). Multivariate PWPF, in turn, aims at communicating the joint probability distribution of a collection of future random variables. For instance, it may communicate the joint probability distribution of random variables at several future times, i.e., Y_{i,t+1}, ..., Y_{i,t+H}, expressed as f̂_{Y_{i,t+1},...,Y_{i,t+H}|t}, or that at several sites, i.e., f̂_{Y_{1,t+H},...,Y_{p,t+H}|t}. In multivariate PWPF, one often draws several realizations as scenarios from the estimated distribution. For instance, one can draw realizations from f̂_{Y_{1,t+H},...,Y_{p,t+H}|t}, which are denoted as ỹ^1_t, ..., ỹ^L_t. Without loss of generality, we write the future random variable as Y_t (which may be univariate or multivariate), and its realization as y_t. The information is denoted as Ω_t, whose realization is x_t. In this paper, we refer to x_t as contextual features, to distinguish them from the inputs of the NF.
Hence, the cornerstone of PWPF can be written in a compact form, i.e., estimating
f̂(y_t | x_t; M, Θ̂).   (9)
In this paper, we assume that Θ does not change with time, and it can therefore be estimated from the training datasets X_train and Y_train. An online setting, where parameters vary with time, could also be considered. With the estimated model at hand, issuing a forecast at time t only requires feeding x_t into the model and yielding results as described in (9). The classic parametric approach usually sets M as a model with a distributional assumption, such as Gaussian or logit-normal, whereas Θ̂ denotes the parameters of a function that maps contextual features to the shape parameters of the distribution. By the conservation of probability measure, we consider an intermediate random variable Z_t that follows a specific distribution f_{Z_t}(z), whose realization is denoted as z_t. Let T map z_t into y_t, i.e.,
y_t = T(z_t; Θ̂_T),
where Θ̂_T denotes the estimate of the parameters of the transform T (whose real parameters are denoted as Θ_T). We then model the distribution of Z_t via a parametric model G, whose parameters are denoted as Θ_G and estimated as Θ̂_G. The model M thus consists of G and T, i.e., M = {G, T}, with parameters Θ = {Θ_G, Θ_T}. The conceptual framework of the training stage is shown in Fig. 2. In other words, by learning G and T, we can estimate the model M.
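The conservation of probability measure is easy to verify numerically. The sketch below (our illustration, not from the paper) pushes a standard normal base through a sigmoid transform T and checks that the density obtained via the change-of-variables formula still integrates to one on (0, 1):

```python
import math

def f_z(z):
    """Standard normal density (the base distribution)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def t_inv(y):
    """Inverse of the sigmoid transform T(z) = 1 / (1 + exp(-z))."""
    return math.log(y / (1.0 - y))

def f_y(y):
    # change of variables: f_Y(y) = f_Z(T^{-1}(y)) * |d T^{-1} / d y|,
    # with d T^{-1}/dy = 1 / (y * (1 - y)) for the logit
    return f_z(t_inv(y)) / (y * (1.0 - y))

# Midpoint-rule check that the transformed density integrates to one
n = 100_000
total = sum(f_y((i + 0.5) / n) / n for i in range(n))
```

Here `total` comes out numerically equal to 1, confirming that the sigmoid-transformed Gaussian (a logit-normal density) is a proper density.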

C. Flow Model for Forecasting
Here we implement the conceptual model T via normalizing flows. Generally, the transform T in an NF consists of a series of invertible functions T_1, T_2, ..., T_K [23], i.e.,
T = T_K ∘ ... ∘ T_2 ∘ T_1,
where ∘ denotes composition. For each T_k, we denote its input as z^{(k−1)}_t, which is the realization of the random variable Z^{(k−1)}_t, and its output as z^{(k)}_t, with z^{(0)}_t = z_t and z^{(K)}_t = y_t. With the sequential transforms, we have
y_t = T_K ∘ ... ∘ T_2 ∘ T_1 (z^{(0)}_t).
The Jacobian determinant is computed by
det J_{T^{-1}}(y_t) = ∏_{k=1}^{K} det J_{T_k^{-1}}(z^{(k)}_t).
Ultimately, we build the connection between the PDF of z^{(0)}_t and that of y_t, i.e.,
f(y_t | x_t) = f(z^{(0)}_t | x_t) ∏_{k=1}^{K} |det J_{T_k^{-1}}(z^{(k)}_t)|.
Each T_k in the NF model is implemented via NNs, and is required to be invertible and to have a tractable Jacobian determinant. The introduced CNF model is trained based on maximum likelihood. As we assume that the parameters Θ do not change with time, we can estimate them from the training datasets Y_train = [y_1, y_2, ..., y_N] and X_train = [x_1, x_2, ..., x_N]. The loss function is defined as
L = −(1/N) ∑_{n=1}^{N} log f̂(y_n | x_n; M, Θ).
At the training stage, we estimate Θ̂ by minimizing the loss function L. To issue a forecast at time t, we feed x_t into the base model and all transforms, as illustrated in Fig. 3. We then derive the density of z^{(0)}_t, i.e., f̂(z^{(0)}_t | x_t; G, Θ̂_G). Based on it, we can draw L realizations z̃^1_t, ..., z̃^L_t. By transforming each realization z̃^l_t via T, we obtain L realizations of f̂(y_t | x_t; M, Θ̂), namely ỹ^1_t, ..., ỹ^L_t. In particular, we can obtain the α-th quantile of f̂(z^{(0)}_t | x_t; G, Θ̂_G), and then transform it via T to obtain the quantile q̂_α of f̂(y_t | x_t; M, Θ̂).
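As a minimal sketch of this maximum-likelihood training, assume a single affine transform y = a·z + b with a standard normal base; minimizing the negative log-likelihood by gradient descent then recovers the sample mean and standard deviation. This is a toy stand-in for the paper's spline-based flow, with illustrative synthetic data and learning rate:

```python
import math, random

random.seed(0)
# Synthetic 1-D targets standing in for wind power values (illustrative only)
data = [random.gauss(0.6, 0.15) for _ in range(2000)]
n = len(data)
s1 = sum(data)                      # sufficient statistic: sum of y
s2 = sum(y * y for y in data)       # sufficient statistic: sum of y^2

# Flow: y = a * z + b with z ~ N(0, 1); inverse path z = (y - b) / a.
# Mean NLL: 0.5 * E[((y - b) / a)^2] + log a + 0.5 * log(2 * pi)
a, b = 1.0, 0.0
lr = 0.01
for _ in range(20_000):
    m2 = (s2 - 2.0 * b * s1 + n * b * b) / n   # E[(y - b)^2]
    grad_a = -m2 / a ** 3 + 1.0 / a            # d(mean NLL) / da
    grad_b = -(s1 / n - b) / a ** 2            # d(mean NLL) / db
    a -= lr * grad_a
    b -= lr * grad_b
# At the optimum, b equals the sample mean and a the sample standard deviation.
```

The recovered (a, b) match the closed-form Gaussian MLE, illustrating why an affine flow is equivalent to a parametric Gaussian model; the spline-based flow replaces the affine map with a richer invertible transform trained by the same criterion.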

D. Relationship with Classic Methods
Here we discuss the relationship between this method and classic methods. In what follows, we assume the base distribution to be standard normal, i.e., z^{(0)}_t ∼ N(0, I).
1) Gaussian Distribution: Models with a Gaussian distributional assumption [28], [29] are described as
Y_t | x_t ∼ N(μ_t(x_t), Σ_t(x_t)),
where μ_t(x_t) and Σ_t(x_t) are the corresponding shape parameters, which are specified by x_t and estimated via statistical learning. They can be translated into setting the transform T as a composition of affine transforms. That is,
y_t = A_t(x_t) z^{(0)}_t + b_t(x_t),
where A_t(x_t) and b_t(x_t) are the corresponding matrix and vector specified by x_t. Then the problem boils down to estimating A_t(x_t) and b_t(x_t) from data. As affine transforms cannot change the family of distributions, y_t still obeys a Gaussian distribution, with μ_t = b_t and Σ_t = A_t A_t^⊤.
2) Logit-Normal Distribution: The logit-normal distribution [19] can be derived by applying the inverse logit (sigmoid) transform to a Gaussian variable. It can be interpreted as setting the transform T in a normalizing flow as a combination of affine transforms and a sigmoid transform. Using the affine transforms, we derive
z̃_t = A_t(x_t) z^{(0)}_t + b_t(x_t), with z̃_t ∼ N(μ_t(x_t), Σ_t(x_t)),
where μ_t(x_t) and Σ_t(x_t) are specified by x_t. Then the sigmoid transform operates element-wise on z̃_t, i.e.,
y_{t,i} = 1 / (1 + exp(−z̃_{t,i})),
where y_{t,i} and z̃_{t,i} respectively represent the i-th elements of y_t and z̃_t.
3) Mixture Density Network: The mixture density network is a popular model that outputs the parameters of a Gaussian mixture model. It is described as
f(y_t | x_t) = ∑_i π_i(x_t) N(y_t; μ^i_t(x_t), Σ^i_t(x_t)),

where ∑_i π_i(x_t) = 1. Models based on mixture density networks can be regarded as setting T as a conic combination of affine transforms. That is,
T(z^{(0)}_t) = ∑_i π_i(x_t) T_i(z^{(0)}_t),
where each T_i operates as
T_i(z^{(0)}_t) = A^i_t z^{(0)}_t + b^i_t,
with A^i_t and b^i_t being parameters specified by x_t.
4) Gaussian Copula: The model based on the Gaussian copula [2] is an instance of NF, which is specified by an element-wise monotone function g and a correlation matrix Σ_t specified by x_t.
Indeed, any desired distribution can be obtained by transforming a Gaussian distribution through a specific mapping. Such a mapping processes each value in the domain in the same manner, such as the aforementioned logit-normal transform. Therefore, the mapping is required to be specified by conditional information, so that the derived distribution can adapt to different wind conditions. Although the conic combination enables deriving more complex distributions than the Gaussian, it is restricted by the number of mixing components. With regard to the Gaussian copula model, it is developed for multivariate modeling. By modeling well-calibrated marginal PDFs and the correlation structure, one can yield the ultimate joint probability density in a distribution-free way. However, as mentioned above, it relies heavily on the estimation of the marginals and tends to underestimate the covariance structure, which often impedes its performance.
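The flow view of the logit-normal model can be checked numerically: sampling the forward path (affine transform of Gaussian noise, then sigmoid) yields samples bounded in (0, 1) whose quantiles equal the sigmoid of the Gaussian quantiles, since monotone transforms commute with quantiles. The shape parameters below are hypothetical:

```python
import math, random
from statistics import NormalDist

random.seed(1)
mu, sig = 0.2, 0.8                  # hypothetical shape parameters
gauss = NormalDist(mu, sig)         # distribution of the affine-transformed base

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward path: affine transform of z ~ N(0, 1), then element-wise sigmoid
samples = sorted(sigmoid(mu + sig * random.gauss(0.0, 1.0)) for _ in range(20_000))

# Monotone transforms commute with quantiles: q_Y(alpha) = sigmoid(q_Z(alpha))
alpha = 0.9
q_analytic = sigmoid(gauss.inv_cdf(alpha))
q_empirical = samples[int(alpha * len(samples))]
```

The empirical 90% quantile of the transformed samples matches the analytic sigmoid-of-Gaussian quantile, illustrating why a quantile of the base distribution can simply be pushed through T, as used in Section II-C.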

III. FORECASTING APPLICATIONS
The basic approach for conditional normalizing flows described above can readily be used for forecasting applications, in both univariate and multivariate settings. We choose a Gaussian distribution as the base distribution, whose shape parameters are learned by an NN, and adopt a non-affine flow to obtain a piece-wise non-Gaussian distribution. Let μ̂_t, Σ̂_t denote the estimated shape parameters of the base distribution, which are determined by a function of x_t, namely φ(x_t; Θ̂_G). In other words, with the Gaussian distributional assumption, the model G described in Section II-B reduces to the function φ(x_t; Θ̂_G). It is described as
(μ̂_t, Σ̂_t) = φ(x_t; Θ̂_G).

A. Probabilistic Forecasting Applications
1) Univariate Probabilistic Forecasting: In the univariate case, each intermediary variable (for instance, the k-th intermediary variable) and the shape parameters of the Gaussian distribution are scalars, which are rewritten as z^{(k)}_t, μ_t, and σ_t. The estimated shape parameters of the base distribution, μ̂_t and σ̂_t, are derived via
(μ̂_t, σ̂_t) = φ(x_t; Θ̂_G),
and each T_k is a univariate function that operates as
z^{(k)}_t = τ_k(z^{(k−1)}_t; c_k(x_t; θ̂_{c_k})).
2) Multivariate Probabilistic Forecasting: The most relevant computation to consider for multivariate forecasting is the transform described in (13). Here, let us consider a function τ_k in the transform T_k that operates element-wise and relies on previous dimensions and contextual information. Take the computation of the i-th dimension as an example, i.e., the forward path and inverse path between z^{(k−1)}_{t,i} and z^{(k)}_{t,i}, where z^{(k)}_{t,1:i−1} = [z^{(k)}_{t,1}, ..., z^{(k)}_{t,i−1}]^⊤, θ̂_{τ_k} represents the parameters of τ_k, and c_k(z^{(k)}_{t,1:i−1}, x_t; θ̂_{c_k}) is a function that outputs conditionals. In other words, Θ̂_{T_k} contains θ̂_{τ_k} and θ̂_{c_k}. The forward path is described as
z^{(k)}_{t,i} = τ_k(z^{(k−1)}_{t,i}; c_k(z^{(k)}_{t,1:i−1}, x_t; θ̂_{c_k})),
and the inverse path as
z^{(k−1)}_{t,i} = τ_k^{−1}(z^{(k)}_{t,i}; c_k(z^{(k)}_{t,1:i−1}, x_t; θ̂_{c_k})).
Using the terminology of [30], c_k(·) and τ_k(·) are respectively referred to as the conditioner and the transformer. An illustration of this calculation procedure is shown in Fig. 4 and Fig. 5.

Remark 1:
With the chain rule of probability, we decompose the joint probability density f(z^{(k)}_t | x_t) into a product of conditional probability densities, i.e.,
f(z^{(k)}_t | x_t) = ∏_i f(z^{(k)}_{t,i} | z^{(k)}_{t,1:i−1}, x_t).
As shown in Section II-C, the training stage relies on the inverse path and the computation of the likelihood. Indeed, the inverse path described in (21) is associated with f(z^{(k)}_{t,i} | z^{(k)}_{t,1:i−1}, x_t), which translates into saying that the computation of the likelihood preserves the conditional structure of the multivariate distribution. The forward path can be translated into sampling z^{(k)}_{t,i} from f(z^{(k)}_{t,i} | z^{(k)}_{t,1:i−1}, x_t) and computing via (22), which can also be regarded as sampling z^{(k)}_t from f(z^{(k)}_t | x_t).
Remark 2: Univariate probabilistic forecasting can be interpreted as a special case of multivariate probabilistic forecasting. As with (22), we rewrite (20) as
z^{(k)}_t = τ_k(z^{(k−1)}_t; c_k(x_t; θ̂_{c_k})).

B. Base Distribution
The function φ(·) described in (18) and (19) is implemented by an NN with N_φ layers. Denote the outputs, weights, and bias of the l-th layer respectively as h^{φ,l}_t, W^{φ,l}, and b^{φ,l}. The l-th layer operates as
h^{φ,l}_t = W^{φ,l} h^{φ,l−1}_t + b^{φ,l},
with h^{φ,0}_t = x_t. Each layer is followed by the non-linear element-wise operator ReLU(·), i.e.,
ReLU(h) = max(0, h).
The output layer yields μ̂_t and Σ̂_t.
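A minimal sketch of φ(·) in the univariate case, assuming two hidden layers and a softplus activation on the scale output to keep σ̂_t positive (the softplus choice and the layer sizes are our assumptions, not specified in the text):

```python
import math, random

random.seed(2)

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, b, h):
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def init(n_out, n_in):
    s = 1.0 / math.sqrt(n_in)
    W = [[random.uniform(-s, s) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

# Two hidden layers, then a 2-unit head for (mu, sigma)
W1, b1 = init(16, 4)
W2, b2 = init(16, 16)
W3, b3 = init(2, 16)

def phi(x):
    h = relu(linear(W1, b1, x))            # h^{phi,1}
    h = relu(linear(W2, b2, h))            # h^{phi,2}
    out = linear(W3, b3, h)                # output layer
    mu = out[0]
    sigma = math.log1p(math.exp(out[1]))   # softplus keeps the scale positive
    return mu, sigma

mu_t, sigma_t = phi([0.3, -1.2, 0.7, 0.0])
```

In the paper's implementation the weights would of course be trained jointly with the flow by maximum likelihood; this sketch only shows the layer recursion.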

C. Non-affine Transform
In this section, we describe the conditioner and transformer of the adopted transform.

1) Conditioner:
The function c_k(·) is set as an additive model and implemented by an NN. Concretely, it contains two parts: a function of z^{(k)}_{t,1:i−1} and a function of x_t. Then, c_k(·) is described as
c_k(z^{(k)}_{t,1:i−1}, x_t) = c_{k,1}(z^{(k)}_{t,1:i−1}) + c_{k,2}(x_t),
where c_{k,1}(·) and c_{k,2}(·) are the two component functions. c_{k,2}(·) is implemented by an NN, similar to that of φ(·). Specially, as the length of z^{(k)}_{t,1:i−1} varies with the dimension i, c_{k,1}(·) is implemented with masks.
2) Transformer: The main idea of a spline-based NF is to implement the transform as a monotonic spline [25]. Each τ_k is represented as a piece-wise function that contains M segments specified by M + 1 coordinates (knots). The knots are obtained from the conditioner c_k(·) and denoted as {(α_{k,m}, β_{k,m}) | m = 0, ..., M}. Accordingly, the transformer τ_k(·) is split into M segments, each of which is a simple monotonic function. Every two neighboring segments meet at the internal knots {(α_{k,m}, β_{k,m}) | m = 1, ..., M − 1}. Specifically, we use monotonic rational-quadratic splines, which are defined by the derivatives at the internal knots in addition to the knots. These derivatives are also derived from the conditioner c_k(·) and denoted as {δ_{k,m} | m = 1, ..., M − 1}. We define the bin slope
s_{k,m} = (β_{k,m} − β_{k,m−1}) / (α_{k,m} − α_{k,m−1})
and the relative position within the m-th bin
ξ = (z^{(k−1)}_{t,i} − α_{k,m−1}) / (α_{k,m} − α_{k,m−1}).
The rational-quadratic function in the m-th bin (α_{k,m−1} ≤ z^{(k−1)}_{t,i} ≤ α_{k,m}) is then expressed as
z^{(k)}_{t,i} = β_{k,m−1} + (β_{k,m} − β_{k,m−1}) [s_{k,m} ξ² + δ_{k,m−1} ξ(1 − ξ)] / [s_{k,m} + (δ_{k,m} + δ_{k,m−1} − 2 s_{k,m}) ξ(1 − ξ)].
Specifically, when z^{(k−1)}_{t,i} < α_{k,0} or z^{(k−1)}_{t,i} > α_{k,M}, we set τ_k(·) as the identity transform, i.e., z^{(k)}_{t,i} = z^{(k−1)}_{t,i}. As τ_k(·) is monotonic, the inverse path can be computed analytically by solving a quadratic equation a ξ² + b ξ + c = 0, i.e.,
ξ = 2c / (−b − √(b² − 4ac)),
where
a = (β_{k,m} − β_{k,m−1})(s_{k,m} − δ_{k,m−1}) + (z^{(k)}_{t,i} − β_{k,m−1})(δ_{k,m} + δ_{k,m−1} − 2 s_{k,m}),
b = (β_{k,m} − β_{k,m−1}) δ_{k,m−1} − (z^{(k)}_{t,i} − β_{k,m−1})(δ_{k,m} + δ_{k,m−1} − 2 s_{k,m}),
c = −s_{k,m} (z^{(k)}_{t,i} − β_{k,m−1}).
This implicitly defines the inverse function τ_k^{−1}(·).
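A sketch of one rational-quadratic transformer τ_k with identity tails and the analytic inverse is given below. The knots and derivatives are hypothetical fixed values; in the model they would be produced by the conditioner c_k(·):

```python
import bisect

def rq_spline(knots_x, knots_y, derivs, x, inverse=False):
    """Monotonic rational-quadratic spline transformer (a sketch after
    Durkan et al.).  knots_x / knots_y hold the M+1 knot coordinates and
    derivs the M+1 positive derivatives at the knots (boundary derivatives
    included here for simplicity).  Outside the knot range the map is the
    identity, as in the text."""
    lo, hi = (knots_y[0], knots_y[-1]) if inverse else (knots_x[0], knots_x[-1])
    if x <= lo or x >= hi:
        return x  # identity tails
    side = knots_y if inverse else knots_x
    m = bisect.bisect_right(side, x)              # bin index m in 1..M
    x0, x1 = knots_x[m - 1], knots_x[m]
    y0, y1 = knots_y[m - 1], knots_y[m]
    d0, d1 = derivs[m - 1], derivs[m]
    s = (y1 - y0) / (x1 - x0)                     # bin slope s_{k,m}
    if not inverse:
        xi = (x - x0) / (x1 - x0)
        num = (y1 - y0) * (s * xi * xi + d0 * xi * (1 - xi))
        den = s + (d1 + d0 - 2 * s) * xi * (1 - xi)
        return y0 + num / den
    # inverse path: solve a*xi^2 + b*xi + c = 0 for xi
    dy = x - y0
    a = (y1 - y0) * (s - d0) + dy * (d1 + d0 - 2 * s)
    b = (y1 - y0) * d0 - dy * (d1 + d0 - 2 * s)
    c = -s * dy
    xi = 2 * c / (-b - (b * b - 4 * a * c) ** 0.5)
    return x0 + xi * (x1 - x0)

# Hypothetical knots on [-3, 3]: strictly increasing in both coordinates
KX = [-3.0, -1.0, 0.5, 3.0]
KY = [-3.0, -2.0, 1.0, 3.0]
KD = [1.0, 0.7, 1.4, 1.0]
```

Round-tripping any input through the forward and inverse paths returns the original value, which is exactly the property the likelihood computation in Section II-C relies on.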

IV. CASE STUDY
In this paper, we validate the proposed approach in both univariate cases (Case 1, Case 2) and multivariate cases (Case 3, Case 4), which cover typical applications in probabilistic wind power forecasting. Case 1 and Case 2 differ in their forecast horizons: Case 1 aims at day-ahead forecasting, whereas Case 2 focuses on forecasting within minutes to hours. Case 3 aims at characterizing the joint distribution of wind power values for various lead times, jointly. Case 4 deals with the joint distribution of wind power values at several geographical locations. Their settings are described as follows and summarized in Table I. 1) Case 1: A day-ahead PWPF case based on GEFCom 2014 data, where numerical weather predictions (NWPs) are taken as inputs and the predictive PDF of wind power at each time step is issued as the forecast. 2) Case 2: A very-short-term PWPF case where previous values of wind power generation are taken as inputs, and the predictive PDF of wind power at a future time is issued as the forecast. The horizon is set to 1 here, for validation based on NREL and France wind farm data. 3) Case 3: A scenario generation case based on the France wind farm data, which considers temporal interdependence. Specifically, we generate scenarios for the next 6 time steps, which can be used in electricity markets. 4) Case 4: A scenario generation case based on NREL data, which considers the spatial interdependence of multiple sites. The horizon is set to 1. Specifically, we choose data from 5 nearby wind farms for validation. As feature selection is not the focus of this paper, in Case 2, Case 3, and Case 4, the length of the input features is determined by a preliminary test, varied from 4 to 24 and empirically set to 6. Certainly, the models may be further improved by finely selecting the features, but the comparison remains fair since all models use the same input features.

A. Dataset Description
Three open datasets are used for validation, i.e., data from GEFCom 2014, NREL, and a France wind farm. The GEFCom 2014 dataset provides NWPs that contain wind speeds and directions at 10 m and 100 m, and the corresponding normalized wind power generation values. It is an hourly dataset collected in 2012 and 2013, and contains a total of 16,800 samples. We randomly select data from 5 wind farms for experiments. The France wind farm data and NREL data are described in [32].

B. Assessment Metrics
In this paper, reliability diagrams and PI width are used to assess the reliability and sharpness of univariate predictive densities. The overall quality of the predictive probability density in the univariate cases is assessed by the continuous ranked probability score (CRPS), as suggested by [33]. The quality of the predictive probability density in the multivariate cases is assessed through scenarios, in terms of the energy score (ES) and the variogram score (VS), as suggested by [34], [35], which can measure the dependence within scenarios. All scores are averaged over the whole test data.
1) CRPS: Let F̂_t(y) denote the predictive CDF of Y_t and y_t denote the observation at time t. The CRPS is defined as
CRPS(F̂_t, y_t) = ∫ (F̂_t(y) − 1(y ≥ y_t))² dy,
where 1(·) is the unit step function, which represents the empirical CDF of the observation.
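When the predictive distribution is available through samples (as with the flow), the CRPS can be estimated via the well-known identity CRPS(F, y) = E|X − y| − ½ E|X − X′| with X, X′ i.i.d. ∼ F, which is equivalent to the integral definition above:

```python
def crps_from_samples(samples, y):
    """Monte Carlo estimate of CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(u - v) for u in samples for v in samples) / (n * n)
    return term1 - 0.5 * term2
```

For example, with the two samples {0, 1} and observation 1, the estimate is 0.5 − 0.25 = 0.25, and it is exactly 0 for a point mass located at the observation.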

2) ES: Given a set of scenarios {ỹ^{(i)}_t | i = 1, ..., S} and the observation y_t, the ES is defined as
ES = (1/S) ∑_{i=1}^{S} ||y_t − ỹ^{(i)}_t||_2 − (1/(2S²)) ∑_{i=1}^{S} ∑_{j=1}^{S} ||ỹ^{(i)}_t − ỹ^{(j)}_t||_2,
where ||·||_2 is the d-dimensional Euclidean norm.
3) VS: Let y_{t,i} and ỹ_{t,i} respectively denote the i-th dimensions of the observation y_t and a scenario ỹ_t. The VS is defined as
VS = ∑_{i,j} w_{ij} ( |y_{t,i} − y_{t,j}|^p − (1/S) ∑_{s=1}^{S} |ỹ^{(s)}_{t,i} − ỹ^{(s)}_{t,j}|^p )²,
where w_{ij} are non-negative weights, set to 1 here. We set p to 0.5 as suggested by [35].
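Both scores are straightforward to compute from a set of scenarios; a sketch (with unit variogram weights, as assumed above):

```python
import math

def energy_score(scenarios, y):
    """Sample-based ES: (1/S) sum_i ||y - s_i|| - (1/(2 S^2)) sum_{i,j} ||s_i - s_j||."""
    S = len(scenarios)
    dist = lambda u, v: math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    t1 = sum(dist(y, s) for s in scenarios) / S
    t2 = sum(dist(si, sj) for si in scenarios for sj in scenarios) / (2 * S * S)
    return t1 - t2

def variogram_score(scenarios, y, p=0.5):
    """Sample-based VS with unit weights w_ij = 1, as in the text."""
    S, d = len(scenarios), len(y)
    score = 0.0
    for i in range(d):
        for j in range(d):
            obs = abs(y[i] - y[j]) ** p
            sim = sum(abs(s[i] - s[j]) ** p for s in scenarios) / S
            score += (obs - sim) ** 2
    return score
```

For instance, with scenarios {(0,0), (1,1)} and observation (1,1), the ES equals √2/2 − √2/4 = √2/4, while the VS is zero because both scenarios reproduce the observed pairwise differences exactly.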
C. Benchmarks
1) Univariate Cases: We set both parametric and non-parametric models as benchmarks. For the parametric approach, we choose NN models that rely on Gaussian and logit-normal distributional assumptions, and refer to them as NN-G and NN-L respectively. They share the same basic NN structure with the proposed model. An MDN is also established, as it can model more complex distributions than the Gaussian. As for the non-parametric approach, we include two popular distribution-free models, namely KDE [13] and the quantile regression gradient boosting machine (QRGBM) [9], as benchmarks, since they proved effective in GEFCom 2014. The QRGBM is an ensemble model that iteratively fits new tree models to minimize the quantile loss. Concretely, in the KDE, we determine the nearest 100 neighbors of each test sample and use their corresponding wind power values to estimate the predictive PDF. In addition, the climatology model is adopted as a naive benchmark, which estimates the predictive probability density using all training data.
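A sketch of such a k-nearest-neighbour KDE benchmark (with hypothetical k, bandwidth, and toy data) illustrates the mechanism and its fixed-population limitation:

```python
import math

def knn_kde_density(train_x, train_y, x_query, y_grid, k=3, h=0.05):
    """k-nearest-neighbour KDE sketch of a conditional PDF: select the k
    training samples whose features are closest to x_query, then place a
    Gaussian kernel of bandwidth h on each of their wind power values.
    k and h are illustrative choices, not the paper's settings."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x_query)))
    neighbours = [train_y[i] for i in order[:k]]
    norm = 1.0 / (k * h * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((y - yi) / h) ** 2) for yi in neighbours)
            for y in y_grid]

# Toy data: feature = (wind speed,), target = normalized wind power
train_x = [(4.0,), (4.2,), (4.1,), (9.0,), (9.3,)]
train_y = [0.18, 0.22, 0.20, 0.80, 0.85]
grid = [i / 200.0 for i in range(201)]
dens = knn_kde_density(train_x, train_y, (4.05,), grid)
```

For the low-wind-speed query, the estimated density peaks near 0.2 and integrates to roughly one; note that the population size k stays fixed regardless of how dense or sparse the neighborhood is, which is precisely the limitation discussed above.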
2) Multivariate Cases: For the multivariate cases, we mainly use NN-G and NN-L as benchmark models, since they are the most popular ones. Besides, the multivariate probabilistic ensemble (MuPEn) [36] is adopted as a naive benchmark. It is a generalization of complete-history persistence, which conducts random sampling without replacement from historical scenarios for each test sample. In the multivariate cases, NN-G, NN-L, and the proposed model use the same NN architecture as in the univariate cases. The only difference is that we adopt an autoregressive structure here to model the joint probability density. It is implemented based on a masked auto-encoder [31] that, via masks, forces each variable to rely only on the previous variables in a given order. Besides, we permute the variable order after each transform, as the PDF is permutation-invariant.
NN-G, NN-L, and the proposed model are implemented in PyTorch and trained with the Adam optimizer [37]. The learning rate is determined through a grid search and ultimately set to 1e-4; it decays by 1/3 every 300 iterations. The QRGBM is implemented based on LightGBM, with hyperparameters set according to the winner of GEFCom 2014 [9]. KDE is implemented using scikit-learn.
The CRPS results are presented in Table II. It is seen that all the benchmark models and the proposed model outperform the climatology model. Amongst the benchmark models, KDE performs slightly worse than the others, which suggests that approximating the conditional PDF by the density of a neighborhood population is overly simplistic. Concretely, the distribution of samples is not homogeneous, which means that more samples could be used to better estimate the conditional PDF where the neighborhood distribution is dense. However, once the criterion for selecting neighborhood samples is fixed, e.g., the value of k in k-nearest neighbors here, the population on which the conditional PDF estimation is based cannot be adjusted adaptively. On the contrary, NN-G, NN-L, QRGBM, MDN, and the proposed model can adaptively estimate the conditional PDF/quantiles by exploiting the similarity of input features via parameterization or entropy measures. This also suggests that the performance of KDE could be further improved by carefully designing the population selection criterion and making use of the similarity of neighborhood samples. The QRGBM outperforms NN-G and NN-L in 3 out of 5 cases, as it is distribution-free. However, the other two cases suggest that the independent fitting of quantiles in QRGBM may accumulate errors. MDN is comparable with NN-G and NN-L, which may be explained by the fact that it is harder to estimate weights and component distributions jointly in the MDN than to estimate only the shape parameters, as in NN-G and NN-L.
The weight of the MDN, namely pi_i(x_t), can be interpreted as the probability that a sample falls in the i-th mixture component. In principle, the approximation capability increases with the number of mixture components. However, MDNs often suffer from mode collapse and training instability when the number of mixture components is large or the dimension of the variables is high. Here, we investigate the number of mixture components via a grid search over 3, 10, 20, 50, and 100; it turns out that the training of the MDN is unstable even with 20 components. Overall, the proposed model exceeds the benchmarks in all cases.
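For reference, the MDN predictive density with Gaussian components is a weighted sum of component densities (a sketch; in an actual MDN the weights, means, and standard deviations are all outputs of a network conditioned on x_t, which is omitted here):

```python
import math

def mdn_density(y, weights, means, stds):
    """Evaluate a Gaussian mixture density: sum_i pi_i * N(y; mu_i, sigma_i^2).
    `weights` must be non-negative and sum to one."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    total = 0.0
    for w, mu, sd in zip(weights, means, stds):
        total += w * math.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return total
```

The difficulty noted above is that the weights and the component parameters must be estimated jointly, whereas NN-G and NN-L only fit the shape parameters of a single fixed family.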
The comparison between NN-L and NN-G shows that the logit-normal transform may at times deteriorate performance, revealing that the logit-normal distributional assumption does not always hold, even though it forces the realizations of the random variable into the physically defined interval. We present the predictive probability density of the proposed model at a selected time in Fig. 6. As illustrated, the PDF derived by the proposed model is more flexible than specific families of distributions, because the proposed model is free of any distributional assumption. In addition, the proposed model has 1.7 million trainable parameters, comparable to the 1.6 million of NN-G and NN-L. This means that the proposed model can flexibly model different wind power distribution characteristics conditional on predicted wind speeds, at an increased but affordable complexity.
2) Reliability and Sharpness: The reliability diagram and PI width for wind farm 1 are illustrated in Fig. 7 and Fig. 8. QRGBM and the proposed model achieve the best reliability, close to the ideal case. Strictly speaking, it is unfair to compare a set of independently trained QR models with a single model that yields the whole distribution, as the computational cost of QR per quantile is much larger. Nevertheless, the proposed model still achieves comparable reliability, which confirms its performance. By contrast, the reliability diagrams of NN-G, NN-L, MDN, and KDE deviate from the ideal to a certain degree. The deviation of NN-G and NN-L cannot be fully mitigated, since the families of distributions they define mismatch the real underlying distribution. These results suggest that the superiority of the proposed model goes beyond the distribution-free property of the QR- and KDE-based methods, by offering an efficient and continuous conditional modeling approach. Fig. 8 shows that the proposed model provides the shortest PIs at all nominal levels. The PI width of NN-G is comparable to that of the proposed model, whereas the PIs of NN-L are much wider. For illustration, we present the 90% PIs of NN-G, the proposed model, and NN-L over 10 days in the top, middle, and bottom subplots of Fig. 9. As shown, the PIs of NN-G violate the bounds of wind power to a large extent, revealing a probability-leakage issue, while the PIs of the proposed model and NN-L are more realistic. Besides, the PIs of NN-L are sometimes unnecessarily wide; for example, between 200 h and 250 h, the upper bound of NN-L is larger than that of NN-G and the proposed model. Indeed, both NN-L and the proposed model can be viewed as models derived from NN-G by applying transforms.
However, the logit-normal transform in NN-L applies to all NWP conditions indifferently, whereas the spline transform of the proposed model is specified by the NWPs. This explains the sacrifice of NN-L in PI width, a side effect of forcing the realizations within the boundaries.
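The two quantities compared above can be computed directly from the issued intervals (a minimal sketch; a reliability diagram plots the empirical coverage against the nominal level, and sharpness summarizes the average width):

```python
def empirical_coverage(lower, upper, observations):
    """Fraction of observations falling inside the per-time-step PIs;
    for a reliable forecaster this matches the nominal level."""
    hits = sum(1 for lo, hi, y in zip(lower, upper, observations) if lo <= y <= hi)
    return hits / len(observations)

def average_width(lower, upper):
    """Mean PI width; narrower is sharper, all else being equal."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
```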

B. Case 2
We present the CRPS values of Case 2 in Table III. As in Case 1, all models are superior to the climatology model. The performance of NN-G, NN-L, QRGBM, and the proposed model is comparable. The performance gap between KDE and the others is larger than in Case 1, because the higher dimension of the input features raises issues for k-nearest neighbors. Comparing the results of KDE, QRGBM, MDN, and the proposed model with those of NN-G and NN-L, we infer that the Gaussian and logit-normal distributional assumptions are fairly adequate for very-short-term PWPF. This may be because the structure of temporal interdependence over a short period of time is simpler than the interdependence spanning several hours.
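For completeness, CRPS values such as those in Tables II and III can be estimated from forecast samples with the standard sample-based estimator CRPS(F, y) = E|X - y| - 0.5 E|X - X'|, with X, X' i.i.d. draws from the forecast distribution (a sketch; the O(m^2) double sum is fine for moderate sample counts):

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimator: E|X - y| - 0.5 * E|X - X'|.
    Lower is better; it reduces to |x - y| for a point forecast."""
    m = len(samples)
    term1 = sum(abs(x - y) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2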

C. Case 3 and Case 4
The ES and VS values are presented in Table IV and Table V. NN-G, NN-L, and the proposed model all outperform MuPEn, since MuPEn draws samples from the empirical unconditional distribution whereas the other models draw samples from estimated conditional distributions. Except for MuPEn, the ES and VS values in Case 3 are larger than those in Case 4, which indicates larger uncertainty in Case 3. This is caused by the increase in generation uncertainty as the forecasting horizon increases. In both cases, NN-L and the proposed model exceed NN-G, which suggests the limited capability of the Gaussian distributional assumption in complex and high-dimensional cases. Besides, the performance of NN-L and the proposed model is comparable in Case 3 but differs in Case 4, which suggests that the spatial interdependence is more complex.

D. Discussion on the Base Distribution
Theoretically, the base distribution modeled by G can be set as any distribution; by learning the transforms, one can still obtain an estimate of the desired distribution. But if the base distribution differs considerably from the underlying distribution, the transforms must be estimated in a relatively large function space. For illustration, we consider wind farm 1 in Case 1 and set the base distribution as a standard normal distribution N(0, 1). In theory, this should make little difference to the estimated distributions, since the standard normal distribution can be mapped to any Gaussian distribution by an affine transform. Compared to the proposed model, however, this setting implies a more complex task: the flow must estimate this affine transform in addition to the non-affine transform. In our experiments, the CRPS under the standard normal base distribution turns out to be 14.9, much larger than the 9.22 of the proposed model. In other words, the estimation of the base distribution is of practical significance.
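The affine relationship invoked above is a one-line change of variables (a sketch: if y = mu + sigma * z with z ~ N(0, 1), then log p_Y(y) = log p_Z((y - mu) / sigma) - log sigma, which is exactly the N(mu, sigma^2) log-density; with a learned Gaussian base this step is absorbed into the base and need not be learned by the flow):

```python
import math

def log_std_normal(z):
    """Log-density of the standard normal N(0, 1)."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def log_density_affine_flow(y, mu, sigma):
    """Change of variables for y = mu + sigma * z, z ~ N(0, 1):
    log p_Y(y) = log p_Z((y - mu) / sigma) - log|sigma|."""
    z = (y - mu) / sigma
    return log_std_normal(z) - math.log(sigma)
```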

E. Discussion on Transforms
In this paper, we set the transforms as rational-quadratic splines. In fact, other splines could also be considered as long as the transforms are invertible, for instance linear and cubic splines [38], [39]. However, it is hard to compute the inverse of high-degree spline-based transforms; as noted in [25], calculating the inverse of a cubic spline is prone to numerical instability. On the other hand, the transforms are required to be flexible enough, so there is a trade-off between complexity and flexibility. The adopted rational-quadratic splines are more flexible than linear and quadratic splines. We use linear and cubic spline-based transforms for a comparative study in Case 1 and Case 3. The CRPS values for the linear and cubic spline-based CNF models on wind farm 1 in Case 1 are 9.34 and 9.11, respectively, and the ES values in Case 3 are 9.28 and 9.22, respectively. That is, the rational-quadratic flow is superior to the linear spline-based flow and comparable to the cubic spline-based flow. Certainly, more advanced normalizing flow models could be used.
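To make the rational-quadratic transform concrete, here is the map on a single bin in the formulation of Durkan et al. (a sketch; the full transform stitches many such bins, with widths, heights, and knot derivatives produced by the conditioning network, into a monotone bijection that is invertible in closed form):

```python
def rq_spline_bin(x, x0, x1, y0, y1, d0, d1):
    """Rational-quadratic map on one bin [x0, x1] -> [y0, y1] with
    positive derivatives d0, d1 at the knots; monotone by construction."""
    s = (y1 - y0) / (x1 - x0)          # bin slope
    xi = (x - x0) / (x1 - x0)          # position within the bin, in [0, 1]
    num = (y1 - y0) * (s * xi * xi + d0 * xi * (1 - xi))
    den = s + (d1 + d0 - 2 * s) * xi * (1 - xi)
    return y0 + num / den
```

Positive knot derivatives guarantee monotonicity, which is what makes the inverse path numerically stable, in contrast to the cubic spline case.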

F. Distribution-free vs Distributional Assumption
In the case study, QRGBM, KDE, and the proposed model are distribution-free, whereas NN-G and NN-L rely on distributional assumptions. Compared to NN-G and NN-L, the proposed model has increased but affordable complexity due to its spline operations; this increased complexity enables it to obtain different wind power distributions conditional on different predicted wind speeds. Compared to QRGBM and KDE, the proposed model is superior in efficiently modeling whole conditional PDFs. The case studies also show that distribution-free methods are not overwhelmingly superior to models with distributional assumptions: NN-G and NN-L rival QRGBM and KDE in several cases, and in Case 2 the performance of NN-L is comparable to that of the proposed model, which means these distributional assumptions are adequate for very-short-term PWPF. But for applications with more uncertainty and more complex interdependence, the proposed approach consistently achieves satisfactory performance at an acceptable computational cost. Indeed, it has been reported in [18] that a distribution-free integration-based NF model is only comparable to an affine NF model equivalent to NN-G; we infer that this results from the difficulty of training the integration-based NF. Intuitively, it takes more effort to find the desired transform in a larger function space. This again reveals the trade-off between complexity and flexibility in modeling distributions.

G. Training Time
The training time of all models in Case 1 is presented in Table VI; we report the training time of the NN-based models for 1000 iterations and that of the 199 independent quantile models of QRGBM. The training time of the proposed model is comparable to that of the commonly used NN-G, which is affordable. In general, the training time of the proposed model is governed by the number of transforms and the number of hidden units, and increases with both. However, generating scenarios for high-dimensional multivariate forecasting remains time-consuming.

VI. CONCLUSIONS
The approach for probabilistic wind power forecasting described in this paper, based on conditional spline normalizing flows, offers a number of advantages with respect to the state of the art. It directly estimates the conditional probability density and does not require any assumption on the distributions involved. In addition, it is applicable to both univariate and multivariate PWPF, with high efficiency in terms of both modeling and computation. Our case studies based on open datasets confirm the interest of the approach and its wide applicability to wind power problems.
Parameters are assumed fixed in this paper; it therefore remains to explore how to estimate them in an online-learning fashion. Besides, scenario generation becomes costly as the dimension increases, so future work will focus on more efficient generation methods.

A. Selection on Hyperparameters
To empirically determine the hyperparameters, we conduct a preliminary test of the influence of the number of transforms, the number of hidden units, and the number of knots by studying variants of Case 1. Specifically, we take wind farm 1 as an example and present the results of several settings.
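The search space of these tests can be enumerated as follows (an illustrative sketch using the values studied below; note that the tests vary one hyperparameter at a time while holding the others fixed, rather than training the full Cartesian product):

```python
from itertools import product

# Hypothetical grid mirroring the values examined in this appendix.
grid = {"n_transforms": [1, 2, 3, 4, 5],
        "n_hidden": [64, 256, 512],
        "n_knots": [5, 10, 20, 50]}

def enumerate_configs(grid):
    """Yield every hyperparameter combination as a dict."""
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```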

1) Number of Transforms:
In this case, we set the number of hidden units in each transform to 64 and the number of knots to 10, and vary the number of transforms from 1 to 5. The corresponding results are shown in Table VII. The CRPS is relatively large when only a few transforms are used: the model is then small, which limits its capability of fitting the ultimate transform and the shape-parameter function of the base distribution. Beyond 3 transforms, the gain of adding transforms is relatively low, which suggests the capability is sufficient. Besides, adding transforms means adding layers to the deep neural network, whose training may become difficult when the model is considerably deep.

2) Number of Hidden Units:
Here we fix the number of transforms to 5 and the number of knots to 10, and set the number of hidden units to 64, 256, and 512. Results are presented in Table VIII. The fitting capability of the NN in each transform is influenced by the number of hidden units: it is limited when the number of hidden units is small, but the model may overfit the data when the number is too large.

3) Number of Knots:
In this case, we fix the number of transforms to 5 and the number of hidden units to 256, and investigate the influence of the number of knots by setting it to 5, 10, 20, and 50; the results are shown in Table IX. As the number of knots increases, the CRPS first decreases and then increases.

ACKNOWLEDGMENT
This work was performed during a research stay at the Technical University of Denmark. The authors would like to thank the China Scholarship Council (No. 202006230261) and the Shanghai Sailing Program (19YF1423700). The research leading to this work is carried out as a part of the Smart4RES project (European Union's Horizon 2020, No. 864337). The sole responsibility for this publication lies with the authors; the European Union is not responsible for any use that may be made of the information contained therein.