Hyperparameter Free MEE-FP based Learning for Next Generation Communication Systems

Information theoretic learning (ITL) criteria have emerged as useful for mitigating degradations caused by unknown non-Gaussian noise processes in future wireless communication systems. Specifically, reproducing kernel Hilbert space (RKHS) based approaches relying on ITL based learning criteria are envisioned to provide near-optimal mitigation of unknown hardware impairments and non-Gaussian noises. Among several ITL criteria, recent works find the minimum error entropy with fiducial points (MEE-FP) promising due to its guarantee of unbiased estimation and its generalization over generic noise distributions. However, MEE-FP based learning approaches are known to depend on an accurate kernel-width initialization, and the optimal value of this kernel-width is well known to vary temporally and across deployment scenarios. To remove the dependency on kernel-width, a hyperparameter-free MEE-FP based adaptive algorithm is derived using random Fourier features with sampled kernel widths (RFF-SKW). In addition, a detailed convergence analysis is presented for the proposed hyperparameter-free MEE-FP, which promises a near-optimal error floor independent of step-size and guarantees convergence for a wide range of step sizes. The promised hyperparameter-independence and improved convergence of the proposed hyperparameter-free MEE-FP are validated by computer simulations considering different case studies.


I. INTRODUCTION
Next-generation communication systems must be capable of delivering disruptively high data-rates, offering higher bandwidths, and supporting massive connectivity [1], [2]. However, the degradations caused by hardware impairments and generic non-Gaussian noise processes are found to present performance bottlenecks for these emerging communication systems [3], [4]. Generally, the nature of these hardware impairments and non-Gaussian processes is known to vary from one ecosystem to another and to change temporally. To mitigate these impairments and to support high data-rates, deep-learning (DL) based receivers have emerged as a viable solution [1]. In existing works, several emerging DL paradigms, such as federated-learning [5], meta-learning for low-pilot training [6], and reinforcement-learning [7], are derived for several of the relevant sub-tasks of next-generation communication systems [1].
In parallel, the reproducing kernel Hilbert space (RKHS) based algorithms promise near-optimal mitigation of arbitrary channel-nonlinearities. Specifically, in the context of visible light communications (VLC) [8], [9], initial RKHS based methods focused mostly on dictionary based learning for post-distortion.
However, recent works promote explicit Monte-Carlo approximations of the RKHS, called random Fourier features (RFF), which significantly improve convergence and reduce implementation complexity compared to the pre-existing dictionary based approaches [10]-[13]. In addition, other important works have extended the scope of RKHS based approaches to other applications, such as detection for massive MIMO (m-MIMO) [13]-[15], parameter-estimation for radar [16], [17], and detection for ultraviolet communications [18]. Further, other works have proposed RFF based DL (RFF-DL) [19] and hyperparameter-independent RFFs [13], [20], and have carried out a rigorous analysis of RFF-DL in the low-data regime. Under additive white Gaussian noise (AWGN), several of the works reviewed above conclude that for unknown channel-nonlinearities, the RKHS based methods achieve a performance equivalent to that of a hypothetical ideal AWGN channel. Thus, the above discussion motivates RKHS based methods as near-optimal for generic nonlinear channels.
Non-Gaussian noise processes, such as impulsive noise (IN) [4] and multi-modal non-Gaussian noises due to user-mobility [21], [22], were found to significantly degrade the performance of classical receivers. These noise processes arise due to ambient processes, such as ice-cracking, switching transients, and electrical fluctuations [23]. Moreover, these noise processes are known to impair the performance of internet of things (IoT) and smart-grid systems [4]. Among several existing methods for non-Gaussian noise mitigation, information theoretic learning (ITL) based methods have emerged as attractive for generalized mitigation of unknown noise processes [24], [25]. Specifically, ITL is found relevant for applications such as localization [26], [27], detection for sparse code multiple access (SCMA) [28], and tracking in scenarios with user-mobility [29], [30]. It is further noted that these ITL based approaches can adapt to generic non-Gaussian noises without requiring explicit knowledge of their noise statistics. However, these ITL based methods depend on an appropriate choice of hyperparameter values, whose optimal values are known to vary from one dataset to another. Also, certain ITL based approaches are known to converge to biased estimators. However, a recent ITL based approach, namely the minimum error entropy with fiducial points (MEE-FP), was found to mitigate this drawback and ensure unbiasedness [31]. Besides, most cost-functions for ITL (including the MEE-FP) involve a Gaussian function with a spread parameter, and several works analytically interpret these cost-functions as optimizers of high-dimensional correlation-coefficients [25]. In this context, it is noted that while recent works aim to approximate these Gaussian functions using RFFs [32], their performance still depends on the choice of a valid spread parameter.
To alleviate the dependence on the spread parameter and to obtain an unbiased and accurate MEE-FP based approximation of Gaussian kernels, this work proposes a hyperparameter-free MEE-FP using random Fourier features with sampled kernel widths (RFF-SKW) [13]. Additionally, this work derives a detailed convergence analysis of the proposed hyperparameter-free MEE-FP and presents relevant case studies. Computer simulations indicate excellent generalization over statistically diverse noise processes.
It is noted that for all the case studies considered in this paper, no hyperparameter initialization/estimation is performed for either the kernel-width of the Mercer kernel or the spread parameters of the ITL cost-functions. In addition, no scenario-specific information on the channel-nonlinearity or any a-priori insight on the distribution/statistics of non-Gaussian noise processes is assumed at the receiver.
The remainder of this paper is organized as follows: Section-II briefly reviews MEE-FP based learning. Section-III presents analytical results for the hyperparameter-free MEE-FP, along with its convergence analysis. Further, Section-IV presents relevant case studies to validate the analytical results. Finally, conclusions are drawn in Section-V.
Table I: Generic notations used in this paper
$\xi$: Regularization parameter
$\boldsymbol{\Omega}$: Weights
$\eta$: Step-size
$\lambda$: Forgetting-factor
$n_G$: Number of RFFs
$\sigma_n^2$: Variance of the underlying noise-process
$\hat{\Phi}_{\alpha,\beta}$: RFF with sampled kernel-width [13]

Notations: In this paper, vectors in $\mathbb{C}^n$ or $\mathbb{R}^n$ are denoted by bold-lowercase characters, such as $\mathbf{x}$, with $n$ denoting the number of dimensions; matrices are denoted by bold-uppercase characters, such as $\mathbf{X}$; and the number of random Fourier features is denoted as $n_G$. In addition, the optimal value of a quantity $(\cdot)$ is denoted by $(\cdot)^o$, and $\tilde{(\cdot)}$ denotes the deviation of $(\cdot)$ from its optimal value. Lastly, throughout the paper, the suffix $(\cdot)_i$ denotes the value of $(\cdot)$ at the $i$th sample/instant. The generic notations used in this paper are summarized in Table I.

II. REVIEW OF MEE-FP BASED LEARNING
In this section, we present a review of MEE-FP based learning for mitigating non-Gaussian noise processes. In this regard, it is noted that classical minimum mean-square based receivers are well known for their optimality under additive Gaussian noise; however, their performance is found to severely degrade in the presence of non-Gaussian additive noises. It may be noted that while the error-energy may be a sufficient statistic for Gaussian noise processes, the error-energy alone is sub-optimal for non-Gaussian noise processes, which require generalized order statistics for their mitigation [25]. In this regard, the ITL based cost-functions are well-suited for mitigating generalized non-Gaussian processes for the following reasons:
• ITL based methods use implicit estimates of the probability density function (PDF) of the underlying noise-process. Consequently, ITL based methods are well known to "adapt" to the underlying noise-distribution, without requiring any knowledge of, or insight into, the actual noise-PDF.
• ITL based methods aim to minimize the randomness in the error-processes, rather than some specific error-statistic (as with classical squared-error or cumulant based solutions). In detail, these criteria are motivated by statistical mechanics, and the optimal points of ITL based optimization can be interpreted as "thermal equilibrium" points.
• Some ITL based methods are also interpreted as maximizers of mutual-information and as generalized higher dimensional inner-products.
While ITL based approaches are known for their generalization to arbitrary noise processes, such approaches are prone to an unknown bias, and also significantly depend on a spread-parameter, whose optimal value could vary temporally and from one deployment scenario to another. The recent MEE-FP based criterion is known to mitigate the issue of estimator-bias.
We consider a generalized supervised scenario with a dataset of observation-label pairs $(\mathbf{r}_i, d_i)$, where $|\cdot|$ denotes cardinality. Without loss of generality, we proceed with the definition of the MEE-FP cost function as:

[equation (1)]

where the error is $e_i = d_i - \boldsymbol{\Omega}^T \mathbf{r}_i$, with $\boldsymbol{\Omega} \in \mathbb{R}^{n_G}$ denoting the weights, $\mathbf{r}_i$ indicating the $i$th observation, $\xi \in [0, 1]$ being a regularization parameter, and $d_i$ denoting the labels. Additionally, $\kappa(\cdot)$ denotes the Gaussian function, which is given by:

[equation (2)]

In this regard, the gradient-expression for the optimization of $J_{\text{MEE-FP}}(e_i)$ is given by:

[equation (3)]

where $\mathbf{r}_i$ denotes the regressors at the $i$th time-instant, and

[equation (4)]

and the weights, $\boldsymbol{\Omega}$, are updated as:

[equation (5)]

where $\eta$ denotes the step-size and $:=$ denotes the assignment operator. It is noted that the objective in (1) does not rely on any statistical assumption on the underlying distribution of $e_i$. Rather, instead of any specific error-statistic, the MEE-FP objective in (1) optimizes the information content of $e_i$ [31], which leads to distribution-invariance. However, we note the dependence of $J_{\text{MEE-FP}}$ on the hyperparameter $h$, whose optimal value is well known to vary across different datasets.
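To make the structure of the criterion concrete, the following minimal Python sketch evaluates one commonly used form of an MEE-FP objective and its gradient for a linear model with error $e_i = d_i - \boldsymbol{\Omega}^T \mathbf{r}_i$. The unnormalized Gaussian function, the averaging factors, and the function names (`gaussian_kernel`, `mee_fp_cost`, `mee_fp_gradient`) are assumptions made for illustration and may differ from the exact expressions used in this paper.

```python
import math


def gaussian_kernel(x, h):
    """Assumed unnormalized Gaussian function exp(-x^2 / (2 h^2))."""
    return math.exp(-(x * x) / (2.0 * h * h))


def mee_fp_cost(errors, xi, h):
    """Assumed objective (to be maximized):
    xi * mean_i k(e_i) + (1 - xi) * mean_{i,j} k(e_i - e_j)."""
    n = len(errors)
    fiducial = sum(gaussian_kernel(e, h) for e in errors) / n
    pairwise = sum(
        gaussian_kernel(ei - ej, h) for ei in errors for ej in errors
    ) / (n * n)
    return xi * fiducial + (1.0 - xi) * pairwise


def mee_fp_gradient(weights, regressors, labels, xi, h):
    """Gradient of mee_fp_cost with respect to the weights."""
    n = len(labels)
    dim = len(weights)
    errors = [
        d - sum(w * x for w, x in zip(weights, r))
        for r, d in zip(regressors, labels)
    ]
    grad = [0.0] * dim
    for i in range(n):
        # Fiducial term: d/dw k(e_i) = k(e_i) * e_i / h^2 * r_i, since de_i/dw = -r_i.
        c1 = xi * gaussian_kernel(errors[i], h) * errors[i] / (h * h * n)
        for k in range(dim):
            grad[k] += c1 * regressors[i][k]
        for j in range(n):
            # Pairwise term: d/dw k(e_i - e_j) = k(.) * (e_i - e_j) / h^2 * (r_i - r_j).
            diff = errors[i] - errors[j]
            c2 = (1.0 - xi) * gaussian_kernel(diff, h) * diff / (h * h * n * n)
            for k in range(dim):
                grad[k] += c2 * (regressors[i][k] - regressors[j][k])
    return grad
```

Under these assumptions, a gradient-ascent step would read `w[k] += eta * grad[k]` for each coordinate. The explicit argument `h` in every function is the kernel-width dependence that the hyperparameter-free formulation is meant to remove.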
In this work, we specifically use the generalized inner-product interpretation of the MEE-FP cost-function and re-express the gradients of the $J_{\text{MEE-FP}}$ cost-function as inner-products of the hyperparameter-free RFFs proposed in [13]. For the hyperparameter-free MEE-FP based learning, we present derivations and analytical results in the next section.

III. HYPERPARAMETER-FREE MEE-FP
In this section, we derive the hyperparameter-free ITL. From (1), we note the instantaneous approximation of $J_{\text{MEE-FP}}(e_i)$, denoted $\hat{J}_{\text{MEE-FP}}(e_i)$, as follows:

[equation (6)]

Using the random Fourier features with sampled kernel-widths derived in [13], we have the following approximations for $\kappa(e_i)$ and $\kappa(e_i - e_j)$:

[equation (7)]

where $\hat{\Phi}_{\alpha,\beta}(\cdot)$ denotes the hyperparameter-free RFF based approximation of the feature map to the RKHS.
In this work, we use the distribution dependent feature map, as used in [13], and $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathbb{R}^{n_G}$, with $n_G$ denoting the number of RFFs. The hyperparameter-free RFF is expressed as:

[equation (8)]

where, for an input $(\cdot)$ of compatible dimensions, $\mathrm{diag}[\boldsymbol{\nu}]$ denotes a diagonal matrix whose diagonal entries, $\boldsymbol{\nu} \in \mathbb{R}^{n_G}$, are drawn from an inverse-gamma distribution with shape parameter $\alpha$ and scale parameter $\beta$, $\mathbf{W}$ denotes a Gaussian random matrix having $n_G$ rows with unit variance, and $\mathbf{b} \in \mathbb{R}^{n_G}$ denotes a uniform random vector.
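As an illustration of the ingredients listed above, the sketch below draws $\boldsymbol{\nu}$, $\mathbf{W}$, and $\mathbf{b}$ and evaluates a cosine feature map. The specific form $\sqrt{2/n_G}\cos(\mathrm{diag}[\boldsymbol{\nu}]\mathbf{W}\mathbf{x} + \mathbf{b})$, the range $[0, 2\pi)$ for $\mathbf{b}$, and the function names are assumptions borrowed from standard random Fourier feature constructions; the exact expression in [13] may differ.

```python
import math
import random


def sample_rff_skw(n_features, input_dim, alpha, beta, rng):
    """Draw (nu, W, b) for a cosine feature map with sampled kernel widths."""
    # If X ~ Gamma(shape=alpha, scale=1/beta), then 1/X ~ InverseGamma(alpha, scale=beta).
    nu = [1.0 / rng.gammavariate(alpha, 1.0 / beta) for _ in range(n_features)]
    # One zero-mean, unit-variance Gaussian row per feature.
    W = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(n_features)]
    # Assumed phase range for the uniform random vector b.
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return nu, W, b


def rff_skw(x, nu, W, b):
    """Assumed map: sqrt(2 / n_G) * cos(diag[nu] W x + b)."""
    scale = math.sqrt(2.0 / len(b))
    return [
        scale * math.cos(nu_k * sum(w * xi for w, xi in zip(row, x)) + b_k)
        for nu_k, row, b_k in zip(nu, W, b)
    ]


def inner(u, v):
    """Inner product in R^{n_G}, used to approximate kernel evaluations."""
    return sum(a * c for a, c in zip(u, v))


rng = random.Random(0)
nu, W, b = sample_rff_skw(n_features=500, input_dim=2, alpha=3.0, beta=2.0, rng=rng)
phi_x = rff_skw([0.3, -0.1], nu, W, b)
phi_y = rff_skw([0.2, 0.4], nu, W, b)
kernel_estimate = inner(phi_x, phi_y)
```

Because every feature uses its own sampled width, no single kernel-width has to be fixed in advance; the values of `alpha` and `beta` shown here are placeholders, not the values recommended in [13].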
Using the above approximations, we find that (6) is equivalently expressed as:

[equation (9)]

where we consider a moving-average mean-gradient estimator $\mathbf{g}_i$ with forgetting-factor $\lambda$:

[equation (10)]

Using the above gradient-expression, we adapt the weights as:

[equation (11)]

Next, we present the convergence analysis of this approach, and provide relevant insights on the convergence-dynamics.
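The role of the forgetting-factor can be sketched as follows. The recursion $\mathbf{g}_i = \lambda \mathbf{g}_{i-1} + \nabla_i$, the ascent-type weight update, and the helper names are assumptions chosen for illustration; the exact estimator in (10) may include additional scaling.

```python
def accumulate_gradient(g_prev, grad_now, lam):
    """One step of the assumed recursion g_i = lam * g_{i-1} + grad_i."""
    return [lam * g + d for g, d in zip(g_prev, grad_now)]


def ascent_step(weights, g, eta):
    """Assumed update Omega := Omega + eta * g_i (ascent on the objective)."""
    return [w + eta * gk for w, gk in zip(weights, g)]


def geometric_gain(lam, power=1):
    """Closed form of sum_{k >= 0} lam^(power * k), valid for |lam| < 1."""
    return 1.0 / (1.0 - lam ** power)


# A constant unit gradient accumulates towards 1 / (1 - lam).
g = [0.0]
for _ in range(200):
    g = accumulate_gradient(g, [1.0], 0.9)
steady_state_mean_gain = geometric_gain(0.9)        # first-order gain
steady_state_power_gain = geometric_gain(0.9, 2)    # gain for uncorrelated inputs
```

Under this assumed recursion, the accumulated gradient is a geometrically weighted sum of past gradients, which is why geometric-progression sums in $\lambda$ and $\lambda^2$ appear when its first and second moments are evaluated.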

A. Convergence Analysis
In alignment with existing analytical methods for adaptive filters [33], we assume that $\mathbf{g}_{i-1}$ is uncorrelated with $\mathbf{r}_i$, $e_i$, and $\Delta\mathbf{r}_i$ in (10). For constants $C_1, C_2 \ll n_G$, we have:

[equation]

Below, we proceed to derive the required simplifications. Inspecting (12), we aim to first simplify the cross-terms. Using the Bussgang theorem [34], these cross-terms are replaced by the Bussgang correlation coefficient, $\alpha_1$, for the function $f(x^2) = \exp(-2x^2)$. Thus, (12) is expressed as:

[equation]

Further, we denote $E[\|\mathbf{r}_i\|^2] = \sigma_r^2$, and assume $\mathbf{r}_i$ to be independent and identically distributed (i.i.d.) with zero mean, which leads to the following simplification:

[equation]

which is further rearranged as:

[equation (15)]

where

[equation]

Using the summation rule for a geometric progression, (15) is further approximated as:

[equation]

with $\sigma_n^2$ denoting the variance of the underlying noise process. Considering a large $n_G$, $E[\|\mathbf{g}_i\|^2]$ is approximated as:

[equation]

Noting that $\tilde{\boldsymbol{\Omega}}_i$ enters only through the latest update term of $\mathbf{g}_i$, we have:

[equation]

with $\alpha_2$ denoting the corresponding Bussgang coefficient. Since $\mathbf{r}_i$ is assumed to be zero-mean i.i.d., the above expression is further simplified as:

[equation]

From (11), the dynamical evolution of $E[\|\tilde{\boldsymbol{\Omega}}_i\|^2]$ is expressed as:

[equation]

From (18) and (20), and for constants $C_4, C_5 \ll n_G$, we arrive at the following dynamical equation:

[equation (22)]

Upon re-arranging (22), it can be re-expressed as:

[equation (23)]

B. Insights from the Convergence Analysis:
From (23), some insights are highlighted below:
• Simplification of $\alpha_1$ and $\alpha_2$: By the Bussgang theorem in [34], we note that the correlation-coefficient, $\alpha_1$, is given by:

[equation (24)]

which simplifies to the following value:

[equation (25)]

From the Bussgang theorem, it is similarly noted that:

[equation (26)]

which is simplified as:

[equation (27)]

• Step-size range for convergence: To guarantee the convergence of (22), we obtain the following condition on the evolution-parameter $\upsilon$ (defined below the underbrace in (23)):

[equation (28)]

which leads to the following range for the step-size when the ideal kernel-width is known:

[equation (29)]

Upon choosing $\xi^2 + 4(1-\xi)^2 = 1 - \lambda^2$, this range becomes infinite, which indicates the robustness of the proposed approach to the step-size in the hypothetical scenario when the kernel-width is exactly known. However, since the kernel-width is sampled in practice, the range is rendered finite, and is given by:

[equation (30)]

This range is noted to be much wider than the typical ranges for existing gradient-descent based approaches, since it scales with $n_G$.
• Mean-squared error: Setting $i \to \infty$ in (22) and rearranging, the steady-state misadjustment is expressed as:

[equation (31)]

and the steady-state mean squared error as:

[equation (32)]

which, under the low step-size regime, is expressed as:

[equation (33)]

The steady-state error in (33) then reduces to the following:

[equation (34)]

Therefore, it is concluded that the proposed algorithm approaches the noise variance-floor of the underlying noise-process irrespective of the noise-distribution, and does so with a time-constant of $\frac{1}{6\eta\alpha_2\sigma_r^4}$. Since $\alpha_2$ is typically very large for the medium/high SNR range, the convergence of this approach is significantly faster than for known first-order methods. Notably, this approach did not require any unknown hyperparameters to achieve the noise-variance floor.
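The step-size-independent regime relies on choosing $\xi$ such that $\xi^2 + 4(1-\xi)^2 = 1 - \lambda^2$. As a worked illustration (not part of the original derivation), this condition expands to the quadratic $5\xi^2 - 8\xi + (3 + \lambda^2) = 0$, which the sketch below solves for real $\xi \in [0, 1]$. The function name is hypothetical. For real $\xi$ and $\lambda$, the left-hand side is at least $0.8$ (attained at $\xi = 0.8$), so real solutions exist only when $\lambda^2 \leq 0.2$.

```python
import math


def fiducial_weights_for(lam):
    """Real roots xi in [0, 1] of xi^2 + 4 (1 - xi)^2 = 1 - lam^2."""
    # Expanding gives 5 xi^2 - 8 xi + (3 + lam^2) = 0.
    disc = 64.0 - 20.0 * (3.0 + lam * lam)
    if disc < 0.0:
        return []  # no real xi satisfies the condition for this lam
    root = math.sqrt(disc)
    candidates = [(8.0 - root) / 10.0, (8.0 + root) / 10.0]
    return [xi for xi in candidates if 0.0 <= xi <= 1.0]


for lam in (0.0, 0.2, 0.4, 0.5):
    print(lam, fiducial_weights_for(lam))
```

When two roots exist, both satisfy the condition exactly; which of them is preferable in practice is not determined by this condition alone.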
In the next section, we present case studies to highlight the generalization of the proposed approach to different communication scenarios.

IV. SIMULATION RESULTS AND DISCUSSIONS
Based on the analysis presented above, we present case studies in this section to demonstrate the universal applicability of the above approach. We consider two case studies.

A. Case study I: Channel Estimation for VLC Systems with User-Mobility and Imperfect CSI
In this section, we present a case study on VLC systems impaired by user-mobility and with channel estimation errors at the receiver. For this case study, we consider a random-waypoint (RWP) based mobility model [35]. We detail the system model for this case study next, which is summarized in Fig. 1.

System Model:
We consider an on-off keying (OOK) transmission consisting of symbols $x_i$ at the $i$th time-instant. These symbols are offset by a DC bias $\Gamma$. The corresponding observations at the receiver, $y_i$, are denoted as:

[equation (35)]

where $n_i \sim \mathcal{N}(0, \sigma_n^2)$ denotes an additive white Gaussian noise (AWGN) process with zero mean and variance $\sigma_n^2$, and $h$ denotes the channel-gain. Moreover, the RWP mobility model admits the following probability density function (PDF) for $h$ [35]:

[equation (36)]

where the coefficients $K_i$ and $a_i$ are derived from the VLC system-model parameters in [35, Eq. (4)]. At the receiver, we assume an erroneous estimate of $h$, denoted by [22], [36], [37, Eq. (21)]:

[equation (37)]

with $\delta$ denoting the channel estimation error, which is drawn from a zero-mean Gaussian distribution with variance $\sigma_\delta^2$. At the receiver, upon removal of the DC bias and subsequent maximal ratio combining (MRC) with this erroneous estimate of $h$, we have the equivalent system model:

[equation (38)]

Notably, this system has a noise floor due to the term $\delta x_i$, and the resulting saturation in the bit-error rate (BER) performance was quantified in [36]. This saturation calls for generalized channel estimation methods to improve the BER performance.
It may be noted that although minimum bit error-rate based channel estimation/combining was proposed in [36], the algorithm requires detailed insight into the underlying parameters of the VLC system-model, such as exact knowledge of the estimation-error level, $\sigma_\delta^2$, the mobility model parameters $h_{\min}$ and $h_{\max}$, etc. The list of notations for this case study is summarized in Table 2.

Table 2: $x_i$: Pilot symbols at the $i$th time-instant
Hyperparameter-free MEE-FP based channel estimation: In order to perform accurate channel estimation without relying on scenario-specific parameters, the proposed hyperparameter-free MEE-FP based channel estimation learns the channel estimate from a pilot dataset, with $N_P$ denoting the number of pilots. For each iteration, we calculate the error-term as follows:

[equation (39)]

Using this error-term for every iteration, we estimate the gradient according to (10), as follows:

[equation (40)]

Finally, for every iteration, we adapt the channel estimate as:

[equation (41)]

where $\eta$ is the step-size. The proposed algorithm is summarized in Algorithm 1, and the architecture of the estimator is depicted in Fig. 2.

[Algorithm 1, recoverable step: Estimate gradient as per (40).]

The proposed channel estimator presents negligible degradation relative to the "clairvoyant" MBER detector (called as such due to its requirement for scenario-specific details, like the estimation-error levels, half-angle values, and other parameters specific to the underlying mobility-model), and demonstrates faster convergence compared to the NLMS based channel estimator. It is also noted that, due to the bias from the uniformly distributed DC bias instability, the performance of the NLMS algorithm is found to degrade compared to other approaches, as seen from the BER plots and the MSE characteristics. These simulations demonstrate the generalized benefits of the proposed hyperparameter-free MEE-FP algorithm over different system models. Next, to validate the steady-state error expression presented in (33), the step-size $\eta$ was varied between $6 \times 10^{-3}$ and $1.2 \times 10^{-2}$, and, without loss of generality, the simulated steady-state error was plotted in Fig. 9 for SNR values of 18 dB, 20 dB, and 23 dB. Further, from Fig. 9 and as per (33), we find no dependence of the steady-state error on the step-size $\eta$, since we choose $\xi^2 + 4(1-\xi)^2 = 1 - \lambda^2$.
Rather, the steady-state error floor is mostly flat for various values of the step-size, and is given as follows (up to an added factor of $1/n_G$):

[equation]

where, from (17), $\Xi$ denotes a Lipschitz constant, which, for fitting, was 0.1 for SNR = 18 dB, 1.5 for SNR = 20 dB, and 2.9 for SNR = 23 dB. One may note that all the above values of $\Xi$ are relatively small as compared to $n_G$, and that the monotonicity of the converged MSE is maintained with respect to the SNR, regardless of the step-size. This SNR independent convergence is guaranteed only when $\eta$ lies within a certain range, which is derived in (30).

[Fig. 9. Horizontal axis: step-size.]

B. Case study II: Uplink m-MIMO Detection
In this section, we present a case study for uplink m-MIMO detection in the presence of impairments such as IN and PA nonlinearity. For the channel-model for this case study (similar to the ones in [13], [14], [38]), we consider $U$ users, each equipped with a single antenna. The channel-matrix at lag $\tau$ is given by $\{\mathbf{H}_\tau\}_{\tau=1}^{T}$, with $T$ denoting the channel-memory, while the vector of user-symbols for time index $\tau$ is denoted as $\mathbf{s}_\tau \in \mathbb{C}^U$. Furthermore, $B$ denotes the number of antennas at the base-station, and each $\mathbf{H}_\tau \in \mathbb{C}^{B \times U}$ is modeled as a correlated Rayleigh channel matrix with correlation coefficient $\rho$ for $\tau \neq 0$.
For $\tau = 0$, a line of sight (LoS) component with a Rice factor of 6 dB is added to the correlated Rayleigh channel. The received signal at the base station at the $j$th time-instant, $\mathbf{u}_j \in \mathbb{C}^B$, is denoted as [14], [15]:

[equation]

where $\mathbf{n}_j$ is a two-component IN vector with zero component-wise means, covariance matrices $\sigma_n^2 \mathbf{I}$ and $1000\sigma_n^2 \mathbf{I}$, and probabilistic weights 0.99 and 0.01 [39], with $\mathbf{I}$ denoting the identity matrix. Several popular models exist for the nonlinear PA characteristics, such as Rapp, Saleh, Ghorbani, etc. [40].
However, the proposed approach does not assume knowledge of these characteristics at the receiver. Without loss of generality, we consider an AM-AM Rapp nonlinearity for our simulations, which is mathematically expressed as follows [40]:

[equation]

where $p$ is a parameter that controls the severity of the nonlinearity. Accordingly, $F(\cdot)$ follows the Rapp model, and the RFF-SKW features are used for its generic hyperparameter-free mitigation [13]. In this context, it is noted that the RFF-SKW in [13] were successful in mitigating the impairments due to PA nonlinearities for AWGN noise, without the requirement of an accurate estimate of the kernel-width.
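The two impairments of this case study can be simulated with a short sketch. The mixture weights (0.99, 0.01) and the variance ratio of 1000 follow the description above, while the circularly symmetric complex noise, the unit small-signal gain, the saturation amplitude `a_sat`, and the function names are assumptions for illustration. The Rapp characteristic is written in its widely used textbook form, which may be normalized differently in [40].

```python
import math
import random


def impulsive_noise(n_samples, sigma2, rng, p_impulse=0.01, impulse_gain=1000.0):
    """Draw complex samples from a two-component zero-mean Gaussian mixture."""
    samples = []
    for _ in range(n_samples):
        # With probability p_impulse the high-variance (impulsive) component is active.
        variance = sigma2 * impulse_gain if rng.random() < p_impulse else sigma2
        # Assumption: the variance is split evenly between real and imaginary parts.
        std = math.sqrt(variance / 2.0)
        samples.append(complex(rng.gauss(0.0, std), rng.gauss(0.0, std)))
    return samples


def rapp_am_am(x, p, a_sat=1.0):
    """Rapp AM-AM curve: x / (1 + (|x| / a_sat)^(2p))^(1 / (2p)); phase is preserved."""
    gain = (1.0 + (abs(x) / a_sat) ** (2.0 * p)) ** (-1.0 / (2.0 * p))
    return x * gain


rng = random.Random(1)
noise = impulsive_noise(10000, sigma2=1.0, rng=rng)
average_power = sum(abs(n) ** 2 for n in noise) / len(noise)
# Theoretical average power: (0.99 + 0.01 * 1000) * sigma2 = 10.99 * sigma2.
distorted = [rapp_am_am(0.5 * n, p=2.0) for n in noise]
```

Smaller values of `p` give a softer, more gradual compression, while larger values approach a hard limiter at `a_sat`. Although only 1% of the samples are impulsive, they carry most of the noise power, which is what penalizes squared-error criteria.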
The considered system model is summarized in Fig. 10, and a summary of notations for this case study is provided in Table 4.

[Fig. 10. System model for case study II.]
[Fig. 11. Block-diagram of the proposed parallel detector.]

The well-known ITL based criteria [25], however, depend on a spread parameter whose optimal value is problem-specific, and acquiring a-priori insight on the value of this hyperparameter is non-trivial. Therefore, in this case study, we utilize the proposed hyperparameter-free MEE-FP based method for symbol-detection (as mentioned previously, it is an ITL based approach that combines the benefits of the MEE and maximum correntropy based approaches and ensures unbiasedness), which has the potential to offer significantly improved convergence as compared to the squared-error based approach in [13].
For generic m-MIMO based parallel detection in RKHS [13] over AWGN, the complexity due to high-dimensional regressors (i.e., the large $n$, which occurs when the base-station is equipped with a large number of antennas) can be mitigated using the parallel RKHS based detector proposed in [14], [15]. Notably, the formulation in [14] was dictionary-based. Due to the reliance on an online dictionary, this approach required extra computations for constructing the dictionary and was found to exhibit instability in early iterations due to the concurrent updating of the dictionary and the parameter. To address these shortcomings, an RFF based formulation of the parallel RKHS based detector was derived in [15], and using the RFF-SKW features, a detector was proposed in [13], which could generalize across several nonlinearity-types under AWGN. However, the approach in [13] was a squared-error based approach, which causes performance to degrade in the presence of IN.
In this case study, we use an RFF-SKW based parallel detector, which, as far as nonlinearity-mitigation is concerned, is well known to alleviate hyperparameter-dependence. First, the observations are transformed using a mapping $\mathbf{x}(\mathbf{u})$:

[equation]

Here, $\mathbf{x}(\mathbf{u})$, which denotes the regressors in this particular study, is simply denoted as $\mathbf{x}$. Due to the large dimensionality of the regressors, the inference is split into $L$ blocks between the compute nodes. The $l$th regressor-block is denoted as $\mathbf{x}^{(l)} \in \mathbb{C}^{2B/L}$, which is mapped at each compute node to $\hat{\Phi}_{\alpha,\beta}(\mathbf{x}^{(l)}) \in \mathbb{C}^{n_G}$, denoted as $\hat{\Phi}^{(l)}_{\alpha,\beta}$ for simplicity. At each node, the local estimate of $\mathbf{s}$ is denoted as $\hat{\mathbf{s}}^{(l)} = \boldsymbol{\Omega}^{(l)T} \hat{\Phi}^{(l)}_{\alpha,\beta}$, where $\boldsymbol{\Omega}^{(l)} \in \mathbb{R}^{n_G \times U}$ is the local parameter at the $l$th compute node. For a modulation-order $M$, the best values of $\alpha$ and $\beta$ for the SKW based regressors are given by [13]:

[equation]

At each compute node, the parameter $\boldsymbol{\Omega}^{(l)}$ is learnt using the proposed MEE-FP, with the appropriate values of $\alpha$ and $\beta$ to adapt the criterion, and the corresponding values of $\alpha$ and $\beta$ for mapping incoming regressors to the RFF. Notably, for the proposed approach, we assumed knowledge of neither the ideal value of the kernel-width nor any particular value of the spread parameter of the MEE-FP. Finally, over all the $L$ compute-units, majority voting based fusion is performed over all the block-wise estimates of the user-symbols. The proposed algorithm is summarized in Algorithm 2, with the demodulation operation denoted as $Q(\cdot)$. In addition, the proposed parallel detection algorithm is pictorially depicted in Fig. 11.
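The final fusion step can be sketched as below, assuming each of the $L$ blocks has already produced one demodulated symbol index per user (for instance, an integer in 0..3 for QPSK). The function name and the integer symbol representation are illustrative choices, not part of Algorithm 2.

```python
from collections import Counter


def majority_vote(block_decisions):
    """Fuse per-block symbol decisions; block_decisions[l][u] is block l's symbol for user u."""
    n_users = len(block_decisions[0])
    fused = []
    for u in range(n_users):
        # Count how many blocks voted for each candidate symbol of user u.
        votes = Counter(block[u] for block in block_decisions)
        fused.append(votes.most_common(1)[0][0])
    return fused


# Three blocks (L = 3), four users (U = 4), QPSK symbol indices 0..3.
decisions = [
    [0, 1, 2, 3],
    [0, 1, 3, 3],
    [1, 1, 2, 3],
]
fused = majority_vote(decisions)
```

With an even number of blocks, ties between candidate symbols can occur; this sketch does not define a dedicated tie-breaking rule, and a practical detector would need one.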
Next, in Fig. 12, we compare the KLMS based RFF-SKW proposed in [13] with the RFF-SKW based hyperparameter-free MEE-FP proposed in this work. The simulation parameters for this case study are summarized in Table 5. Similar to the setup in [13] for the KLMS based RFF-SKW, we assume a quadrature phase shift keying (QPSK) transmission with $U = 4$, $B = 320$ antennas at the base-station, and a step-size $\eta = 0.7$.

[Algorithm 2, recoverable steps: Initialize ($e_1 = s_{1-D}$, $g_i = 0$); Calculate error: $e_j = s_{j-D} - \hat{s}^{(l)}_j$; Update gradient-estimate; Update weights.]

It is noted that the study of the RFF-SKW in [13] reported delayed convergence due to the approximation error caused by kernel-width sampling (which is well known to be non-Gaussian in general). Further, as opposed to the AWGN additive-distortion in [13], this work evaluates the performance of the MEE-FP based parallel detector in IN environments. From the simulations presented in Fig. 12, it is observed that the proposed MEE-FP shows improved adaptability to the errors caused by the sampled kernel width and by IN. These non-Gaussian processes are found to cause degradations in the BER performance and the MSE convergence characteristics for the considered values of $L$. Specifically, from Fig. 12, it is observed that the convergence of the RFF-SKW in [13] is slower and that the RFF-SKW achieves a degraded BER performance for the split values $L = 2, 4, 8$.
Notably, these performance-improvements for the proposed hyperparameter free MEE-FP are obtained without any knowledge of the kernel-width parameter or the value of the spread-parameter, which makes the proposed hyperparameter-free MEE-FP suitable for improving the performance of learning systems over unknown non-Gaussian noises.

V. CONCLUSIONS
In this paper, a hyperparameter-free formulation of an ITL based learning criterion, namely the MEE-FP, was derived using the RFF-SKW. The proposed hyperparameter-free MEE-FP was found to self-adapt to different noise-distributions, and a scenario-specific initialization of the spread parameter was found to be unnecessary. A detailed convergence analysis was presented, in which a step-size range was derived to ensure the convergence of the proposed approach and an expression for the steady-state error was obtained. This enables the proposed approach to generalize across various deployments/datasets that are impaired by generic non-Gaussian noises. The hyperparameter-independence and distribution-invariance of the proposed approach were highlighted through various case studies.