Random Fourier Feature Based Deep Learning for Wireless Communications


Abstract-Deep-learning (DL) has emerged as a powerful machine-learning technique for several classic problems encountered in generic wireless communications. Specifically, random Fourier features (RFF) based deep-learning has emerged as an attractive solution for several machine-learning problems; yet there is a lacuna of rigorous results to justify the viability of RFF based DL-algorithms in general. To address this gap, we attempt to analytically quantify the viability of RFF based DL. Precisely, in this paper, analytical proofs are presented demonstrating that RFF based DL architectures have lower approximation-error and probability of misclassification as compared to classical DL architectures. In addition, a new distribution-dependent RFF is proposed to facilitate DL architectures with low training-complexity. Through computer simulations, the practical application of the presented analytical results and the proposed distribution-dependent RFF is depicted for various machine-learning problems encountered in next-generation communication systems, such as: a) line of sight (LOS)/non-line of sight (NLOS) classification, and b) message-passing based detection of low-density parity check (LDPC) codes over nonlinear visible light communication (VLC) channels. Especially in the low training-data regime, the presented simulations show that significant performance gains are achieved when utilizing RFF maps of the observations. Lastly, in all the presented simulations, it is observed that the proposed distribution-dependent RFFs significantly outperform classical RFFs, which makes them useful for potential machine-learning/DL based applications in the context of next-generation communication systems. This paper is under review in the IEEE Transactions on Vehicular Technology (submitted on September 17, 2020).
A version of this paper was previously submitted to the IEEE Transactions on Neural Networks and Learning Systems on March 20, 2020; it was submitted to the IEEE Transactions on Vehicular Technology after the former's decision on Sept. 1, 2020, which advised rejection of this work due to its better suitability to a communications journal.

I. INTRODUCTION
The capacity of classical machine learning methodologies is limited in terms of learning accurate representations from data and in generalizing over large datasets [1], [2]. On the other hand, deep-learning (DL) has emerged as a viable machine-learning paradigm for modelling nonlinear/abstract mappings from observations and for learning representations from data. Furthermore, DL based algorithms have been successfully deployed in numerous sub-domains like computer-vision, speech processing, natural language processing, wireless communications, and time-series prediction. For various tasks in these sub-domains, several DL-architectures have been proposed, e.g. the multilayer perceptron, the convolutional neural network (CNN), and the recurrent neural network (RNN) [3]-[5], which are optimized using the backpropagation algorithm. Further, long-short term memory (LSTM) based DL architectures are found to be particularly viable, as LSTMs address the issue of exploding/vanishing gradients [6], [7] encountered in the backpropagation algorithm when modelling/predicting dynamical systems with memory. However, in spite of DL enjoying widespread deployment for complex machine-learning tasks, deep neural networks (DNN) have been found to be sensitive to hyperparameters like the number of layers, the number of hidden nodes in each layer, and the nature of the activation functions.
On the other hand, classical kernel based learning techniques are well-known for their ability to model high-dimensional representations and for their generalization [8]-[10], and have fewer hyperparameters requiring optimization than DNNs; however, they require the learning-parameter to be expressed as an implicit inner-product in a reproducing kernel Hilbert space (RKHS) using Mercer kernels. The exact nature of the implicit feature-map is unknown; however, the feature-map can be well-approximated explicitly using sampling methods like random Fourier features (RFF) [11], [12], which can further be utilized as features for potential DL applications. Simulations presented in various works in the literature indicate that using RFFs in DNNs significantly boosts performance compared to utilizing the indigenous features [13]. Moreover, RFFs are approximations of feature maps that facilitate intrinsically regularized parameter updates. This in turn leads to improved generalization [9], [14] in RFF based DNN architectures, which has prompted several DL based architectures that attempt to highlight the viability of RFF-maps through extensive simulation studies [13], [15]. However, existing works on RFF based DL motivate their results via intuition/simulation examples rather than providing rigorous analytical results to establish the paradigm of RFF based DL. Furthermore, RFFs in general require a large number of dimensions to gain an accurate approximation of an RKHS, which significantly increases the overall computational complexity and creates a serious implementation bottleneck in the practical deployment of RFF based DL. Hence, based on this review, we highlight the following novelty points of this work:
• We seek to rigorously quantify the viability of RFF-maps in the context of DNNs. We present our claims in the form of two theorems and provide detailed proofs to justify the benefits of utilizing RFF-mapped observations for DL.
• To overcome the computational complexity incurred by RFF-mappings, a distribution-dependent RFF is proposed, which outperforms the classical RFF in scenarios with low/medium RFF dimensions and delivers better classification performance with less training data.
The paradigm of RFF based DNN is tested on the following practical problems encountered in the context of next-generation communication systems: a) line of sight (LOS)/non-line of sight (NLOS) identification for wireless links using an LSTM based DNN, and b) message-passing based low-density parity check (LDPC) decoding over nonlinear visible light communication (VLC) channels. Next, we review existing works on these two sub-domains.

A. LOS/NLOS classification for wireless links
Accurate inference of the channel-state is critical for node-localization and link-adaptation over ad-hoc tactical networks, which necessitates extracting accurate information about the channel-conditions and tracking the users' channels. However, in high-mobility scenarios, inferring accurate channel-state information is quite challenging, mainly due to the time-varying nature of the wireless channel; NLOS conditions significantly impair localization and degrade the overall wireless link through detrimental outages [16]-[18]. Hence, it is essential to develop accurate signal processing algorithms which estimate and track the channel, and also infer the channel type, i.e. LOS vs. NLOS, such that suitable link-adaptation or network-topology selection can be performed [19].
In this section, we focus on reviewing signal processing algorithms for LOS/NLOS detection and localization. Several LOS/NLOS detection methodologies have been proposed using popular temporal DL paradigms trained on the received signal strength indicator (RSSI) [20], [21], based on LSTM/hybrid CNN architectures [22]. Apart from this, an LSTM trained with local temporal RSSI features is proposed for indoor localization in [23]. Furthermore, there are unsupervised approaches for LOS/NLOS identification, including the use of a Gaussian mixture model for LOS/NLOS classification [24]. Moreover, the work in [17] suggests tracking V2V channels using IEEE 802.11p, and it is particularly highlighted that the NLOS components cause link-outages due to packet losses, and that accurate LOS/NLOS detection is needed to mitigate such losses a-priori via link-adaptation. Furthermore, it is noteworthy that the task of LOS/NLOS detection is more difficult in outdoor scenarios, which are characterized by longer ranges, higher mobility, and larger delay-spread.
From the above review, it can be concluded that fast and accurate LOS/NLOS identification is essential, and LSTMs have been found useful for LOS/NLOS prediction in general. However, in outdoor scenarios characterized by high mobility (i.e., with typically lower coherence time than indoor scenarios), it is essential to perform accurate LOS/NLOS predictions with less training data. Hence, we outline the following novel points of our work, which seek to enhance the accuracy of LOS/NLOS classification using hybrid RFF/LSTM based DNNs in the low training-data regime:
• This work presents an analytical result that justifies the viability of mapping the incoming observations to RFFs for training neural-network architectures like LSTMs. The gains promised by the presented analytical results are validated using computer simulations for LOS/NLOS identification over outdoor wireless channels.
• A new kind of distribution-dependent RFF is proposed, which is found to deliver an improved approximation of the RKHS with fewer RFF dimensions, and hence facilitates low-complexity architectures. When used in conjunction with an LSTM, the distribution-dependent RFF is found to achieve a better F1-score for LOS/NLOS classification, which lowers the overall computational complexity for a given error-floor.

B. LDPC-decoding for static VLC channels
Low-density parity check (LDPC) codes, well-known as one of the capacity-achieving code families, have been widely deployed in both radio-frequency (RF) and VLC based communication systems [25], [26]. However, in the context of VLC, the algebraic structure of the codewords is distorted by the nonlinear LED transfer-characteristics, which, if unmitigated, severely impairs message-passing based detection. In this work, the observed codewords from a nonlinear VLC channel are iteratively mapped to an approximate RKHS using RFFs to mitigate the LED nonlinearity and recover the codewords, where the message-priors are iteratively updated based on the "quality" of the RFF-approximation at each iteration. In the simulations conducted over nonlinear VLC channels, a significant BER gain is observed when performing message-passing based detection using the RFF-map. Furthermore, one can observe a significant BER-performance gain upon deployment of the proposed distribution-dependent RFFs as compared to classical RFFs, which renders the proposed RFFs viable.

II. OVERVIEW OF RFF BASED SIGNAL PROCESSING
In this section, we provide an overview of RFF based signal processing. Using an implicit feature map to an RKHS (denoted by Φ : R^n → H), and invoking the Representer theorem [27], an arbitrary function f(·) may be represented as the following weighted combination:

f(x) = Σ_j β_j κ(x, x_j),

where x_j denotes the j-th observation, β_j denotes the approximation-weights, and κ(·,·) : R^n × R^n → R is a continuous and shift-invariant Mercer kernel.
Estimating the above representation of f(·) is computationally involved and requires expressing f(·) only in terms of Mercer kernels, which prevents us from gaining intuitive insights (as opposed to the insights provided by the intermediate layers of a DNN). To reap the benefits of RKHS based approaches (like regularization, generalization, etc.) by potentially deploying them as features in a DNN, the Mercer kernel κ(·,·) can be well-approximated as an RFF [28]. This approximation is motivated by Bochner's theorem [29], which is restated as follows: a continuous, shift-invariant kernel κ(x − y) on R^n is positive definite if and only if it is the Fourier transform of a non-negative measure p(ω). Using Bochner's theorem, a positive-definite kernel can be expressed as

κ(x − y) = ∫ p(ω) e^{jω^T(x−y)} dω = E_ω[e^{jω^T(x−y)}].

Lastly, to lower the approximation-error, the above mean may be approximated by a sample average such as

κ(x − y) ≈ (1/n_G) Σ_{i=1}^{n_G} cos(ω_i^T(x − y)).

Further bounds on the kernel-approximation error were derived in [30] using the RFF based approximation of feature-maps. In particular, for a real Gaussian kernel, an RFF (denoted here by Φ̂ : R^n → R^{n_G}) is obtained as

Φ̂(x) = sqrt(2/n_G) [cos(ω_1^T x + b_1), ..., cos(ω_{n_G}^T x + b_{n_G})]^T,    (5)

where each {ω_i}_{i=1}^{n_G} is a Gaussian vector with zero mean and covariance (1/σ²) I_n, with I_n denoting the identity matrix of size n, each b_i is drawn uniformly from [0, 2π], σ denotes the kernel-width, and (·)^T denotes the transpose operation.
It is noted that since an RKHS is closed, an exact representation exists for a wide class of functions in an RKHS [27], which makes RKHS based learning methods suitable for function-approximation. However, most RKHS techniques rely on learning a dictionary of observations [9], [31], [32], and hence are sensitive to inclusion of erroneous entries due to noise in the initial learning-stages, and also makes practical implementation complex. In this regard, RFFs provide an approximate explicit map to RKHS, which facilitates practical implementations with a finite memory budget whilst delivering equivalent performance as promised by RKHS based learning algorithms, which make them promising for RFF based DNN-architectures.
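As a concrete check of the approximation in (5), the following minimal sketch (NumPy; the kernel width σ, the dimension n_G, and the random seed are illustrative choices, not values from the paper) draws Gaussian frequencies and verifies that the RFF inner product approaches the Gaussian kernel:

```python
import numpy as np

def gaussian_rff(X, omegas, biases, n_g):
    # Phi_hat(x) = sqrt(2/n_G) * [cos(omega_i^T x + b_i)]_{i=1..n_G}, as in (5)
    return np.sqrt(2.0 / n_g) * np.cos(X @ omegas.T + biases)

rng = np.random.default_rng(0)
n, n_g, sigma = 4, 4000, 1.0
# omega_i ~ N(0, (1/sigma^2) I_n), b_i ~ Uniform[0, 2*pi]
omegas = rng.normal(0.0, 1.0 / sigma, size=(n_g, n))
biases = rng.uniform(0.0, 2 * np.pi, size=n_g)

x = rng.normal(size=(1, n))
y = rng.normal(size=(1, n))
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # Gaussian kernel
approx = (gaussian_rff(x, omegas, biases, n_g)
          @ gaussian_rff(y, omegas, biases, n_g).T).item()
```

The gap |exact − approx| shrinks at the O(1/√n_G) rate discussed above, which is precisely why large n_G is needed and why the distribution-dependent variant of Section IV is of interest.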

III. PROOF OF VIABILITY OF RFF FOR TRAINING LSTM

In this section, a proof is outlined that guarantees the viability of RFF based LSTMs in terms of classification accuracy, i.e., of optimizing the hit-or-miss cost function. First, we enlist the considered assumptions as follows:
• We consider two kinds of sequences of observations: a) independent and identically distributed (i.i.d.) observations in C^n, denoted by s = (x_1, x_2, ..., x_i, ...), and b) observations mapped to the RKHS using RFF, denoted by ζ = (Φ̂_1, Φ̂_2, ..., Φ̂_i, ...).
• We denote the linear inner-product space in C^n as X. Furthermore, we denote the space spanned by Φ̂_1 as H, which can be considered as an extension space of X.
• The sequences of actual and predicted labels are denoted by g = (g_1, g_2, ..., g_i, ...) and ĝ = (ĝ_1, ĝ_2, ..., ĝ_i, ...), respectively.
• We assume that the LSTM is a system that inputs sequences and outputs labels that are asymptotically correct for time-indices i > T, i.e., Pr(ĝ_i = g_i) → 1 ∀ i > T, where T is arbitrarily large.
• Furthermore, we assume two hypothetical cases: i) s is the input to an LSTM based predictor. ii) ζ is the input to the same LSTM considered in i).
Based on these assumptions, we proceed to formulate the following theorem.
Theorem 2. For a given LSTM network, the likelihood of correct detection of a sequence of labels given the mapped sequence ζ is greater than the likelihood of the corresponding correct detection given s, for time-indices i > T, where T is large enough.
Further for t > T , we can assume p(g) ∼ N (ĝ, ǫ 2 ), and ǫ 2 → 0 (where ǫ 2 is the approximation-error energy) and N (µ, σ 2 ) denotes a Gaussian distribution with mean µ and variance σ 2 (which is a soft approximation of g in the neighborhood ofĝ). Under this assumption, one can write (6) as Taking the logarithm of both sides, applying Jensen's inequality, and the min operation, one can re-express (7) and defining θ > 0, such that θ = max In other words, For i > T , the measure of g is concentrated around g; hence, under this assumption the above equation can be re-expressed as Hence, we have or, Using the cosine transform-relation of the random variables from x to ζ, we conclude p(s) p(ζ) ≥ 1 (as the Lebesgue measure of X is less than the Lebesgue measure of H 2 ), and thus we reach the following inequality: This yields a contradiction from our original claim in (6), which concludes the proof.

IV. PROPOSED DISTRIBUTION DEPENDENT RFFS
The RFF based DNNs require a large number of RFFs to gain an accurate approximation of an RKHS. In this context, a distribution-dependent RFF is proposed in this section, which achieves lower approximation error as compared to classical RFFs for a given number of RFF dimensions, and hence significantly lowers the computational complexity required for achieving a given error floor.
Indexing the incoming observations by j, we can update the following Parzen estimate of the p.d.f. of the observations, denoted by p̂(x):

p̂(x) = (1/j) Σ_{l=1}^{j} (2πλ²)^{−n/2} exp(−‖x − x_l‖² / (2λ²)),
where the spread parameter for the kernel density estimation, λ, is chosen via Silverman's rule [34]. Consequently, the mean of the RFF can be readily estimated using a moving-average estimator, where μ_Φ̂ denotes the mean of the RFF obtained by the moving-average estimator, and ν ∈ [0, 1] is the forgetting factor. From (17), one can then adapt the smoothing factors S_i accordingly. Denoting S = {S_i}_{i=1}^{n_G} as a vector, the distribution-dependent RFF is obtained by scaling the classical RFF elementwise by S over a batch of size M. This gives a new smoothed RFF that can provide potentially useful features for low-complexity DNN architectures.
(Footnote 2: It is noted that while several extensions from x are possible, the RFF based extension outperforms classical polynomial-kernel based extensions [33].)
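A minimal sketch of the ingredients above, assuming a Gaussian Parzen window with a Silverman rule-of-thumb bandwidth, a moving-average RFF mean with forgetting factor ν, and the smoothing factor S_i = exp(−λ²‖ω_i‖²/2) suggested by the low-pass interpretation in Section V (all parameter values and the pooled-sample form of the bandwidth are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_g, sigma = 2, 256, 1.0
omegas = rng.normal(0.0, 1.0 / sigma, size=(n_g, n))
biases = rng.uniform(0.0, 2 * np.pi, size=n_g)

def rff(x):
    # Classical Gaussian RFF map, as in (5)
    return np.sqrt(2.0 / n_g) * np.cos(omegas @ x + biases)

# Hypothetical batch of observations
X = rng.normal(size=(500, n))

# Silverman rule-of-thumb bandwidth for the Parzen estimate
lam = 1.06 * X.std() * X.shape[0] ** (-1.0 / 5.0)

# Moving-average estimate of the RFF mean with forgetting factor nu
nu = 0.9
mu = np.zeros(n_g)
for x in X:
    mu = nu * mu + (1 - nu) * rff(x)

# Distribution-dependent smoothing: attenuate high-||omega|| components,
# mirroring the exp(-lam^2 ||omega||^2 / 2) low-pass factor of Section V
S = np.exp(-0.5 * lam ** 2 * np.sum(omegas ** 2, axis=1))
phi_dd = S * rff(X[0])   # smoothed (distribution-dependent) RFF of one sample
```

Since S_i ∈ (0, 1], each component of the smoothed map is attenuated relative to the classical RFF, which is the low-pass behaviour the analysis in Section V relies on.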

V. PROOF OF VIABILITY OF DISTRIBUTION DEPENDENT RFFS
In Section III, we demonstrated the viability of RFF based LSTMs in terms of the misclassification error, i.e., the "hit-or-miss" cost function. In this section, we prove that, compared to classical RFFs, the proposed distribution-dependent RFFs provide a better approximation to the RKHS by reducing the approximation-error (which is a "soft" error-metric). We state this claim in the form of the following two theorems.
Proof: We begin by recalling that, for an approximation error ε, the number of dimensions required by the classical RFF scales as n_G ∼ O(ε⁻² log ε⁻²) [35]. However, we also note that ε is a function not only of the dimension n_G, but also of the set of sampled ω_i. In other words, the approximation error depends on how much the samples ω_i deviate from a Gaussian distribution, particularly when n_G is not high enough for the empirical distribution to converge to the desired Gaussian distribution. Hence, in the sequel, we denote the approximation error as ε_{n_G}(ω), where ω denotes an approximate continuum of the samples {ω_i}. From (20), each component of the proposed RFF can be approximated accordingly. Noting that the RFF can be expressed as the sum of an element of the RKHS H and an error ε_{n_G}, the expression can be simplified.
From the above equation, we make the following observations:
• Applying Parseval's theorem, one can note that the energy of ε^{(1)}_{n_G}(ω) is lower than that of ε_{n_G}(ω), since the exp(−λ²‖ω‖²/2) factor performs low-pass filtering and attenuates higher-magnitude "frequencies" (or ω).
Proof: Upon using the distribution-dependent RFFs and invoking the Cauchy-Schwarz inequality, the number of RFF dimensions required for an error floor Kε can be written as n_G ∼ O(C ε⁻² log ε⁻²), where C is an arbitrary constant and K = (λ²/(2π))^{n/2} < 1 is a scaling constant that makes the error floors of the classical RFF and the distribution-dependent RFF equal, to facilitate comparison. Similarly, for the classical RFF, the dimension required for the same floor Kε can be quantified as n_G ∼ O(C (Kε)⁻² log (Kε)⁻²). Hence, the ratio of the two dimensions may be expressed accordingly. Noting that ε² ≈ 0 for sufficiently large RFF dimensions, and that λ < 1 implies K < 1, the ratio is less than unity, which proves the desired result.
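The dimension-counting argument can be collected in one place as follows (a hedged reconstruction consistent with the scalings stated above; C and K are as defined in the proof, and the superscripts "dd" and "RFF" distinguish the distribution-dependent and classical maps):

```latex
\[
n_G^{\mathrm{dd}}(K\epsilon) \sim C\,\epsilon^{-2}\log\epsilon^{-2},
\qquad
n_G^{\mathrm{RFF}}(K\epsilon) \sim C\,(K\epsilon)^{-2}\log (K\epsilon)^{-2},
\]
\[
\frac{n_G^{\mathrm{dd}}}{n_G^{\mathrm{RFF}}}
= K^{2}\,\frac{\log\epsilon^{-2}}{\log\epsilon^{-2} + \log K^{-2}}
\;\le\; K^{2} \;<\; 1,
\]
```

so, for a common error floor Kε, the distribution-dependent RFF requires strictly fewer dimensions than the classical RFF, with the gap governed by K < 1.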

VI. ARCHITECTURE OF RFF BASED LSTM
Considering the results presented in Theorem 2, wherein the viability of the proposed RFF based features to an LSTM for sequence detection is established, we now describe the neural-network architecture considered in this work. For illustration purposes, the proposed architecture is shown in Fig. 1. As detailed in Theorem 2, the input is mapped to an approximate RKHS using RFFs, as outlined in (5) or (20). Next, there are optional fully-connected layers with a subsequent RFF transformation. This cascade of consecutive mappings renders the overall mapping a sequence of RKHSs, and the result in Theorem 2 can be readily utilized by replacing H with the last RKHS. Lastly, the observations in the RKHS are presented as an input to the LSTM layer for prediction of the LOS/NLOS labels. Notably, the benefit of the proposed mapping prior to the LSTM layer has been highlighted in Theorem 2, which indicates that the posterior is more "peaked" given RFFs as input to the LSTM, as compared to the indigenous observations. These steps can be summarized in the following set of equations:

f_t = σ_g(W^{(1)} Φ̂(x_t) + W^{(2)} γ_{t−1} + b_f),
i_t = σ_g(A^{(1)} Φ̂(x_t) + A^{(2)} γ_{t−1} + b_i),
o_t = σ_g(D^{(1)} Φ̂(x_t) + D^{(2)} γ_{t−1} + b_o),
χ_t = f_t • χ_{t−1} + i_t • σ_c(F^{(1)} Φ̂(x_t) + F^{(2)} γ_{t−1} + b),
γ_t = o_t • σ_h(χ_t),

where • denotes the Hadamard product, σ_g(·) is a sigmoid activation function, σ_c(·) and σ_h(·) denote the tanh(·) activation function, and f_t, i_t, o_t, χ_t, γ_t ∈ R^{n_h}, with n_h denoting the number of hidden nodes. Furthermore, f_t denotes the forget-gate of the LSTM with weights W^{(1)} ∈ R^{n_h×n_G}, W^{(2)} ∈ R^{n_h×n_h} and bias b_f ∈ R^{n_h}; i_t denotes the input gate with weights A^{(1)} ∈ R^{n_h×n_G}, A^{(2)} ∈ R^{n_h×n_h} and bias b_i ∈ R^{n_h}; and o_t denotes the output gate with weights D^{(1)} ∈ R^{n_h×n_G}, D^{(2)} ∈ R^{n_h×n_h} and bias b_o ∈ R^{n_h}. χ_t is the sum of the gating of the previous parameter-value with the forget-gate and the gating between the input nodes and the input layer (parameterized by weights F^{(1)} ∈ R^{n_h×n_G}, F^{(2)} ∈ R^{n_h×n_h} and bias b ∈ R^{n_h}); this is passed through the activation function σ_h(·) and gated with the output gate o_t to obtain the next state γ_t ∈ R^{n_h}. The aforementioned weights and biases are optimized using the backpropagation algorithm; notably, we do not encounter the vanishing/exploding-gradient problem for time-series prediction. In addition, the mapping to an approximate RKHS Φ̂(·) (as derived in (28)), prior to input to the LSTM, increases the accuracy of predictions, as inferred from Theorem 2.
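A minimal NumPy sketch of one forward step of the gates above (the forget-gate weight symbols W^{(1)}, W^{(2)} are an assumed naming, chosen by analogy with A, D, F, since the source states only the gate dimensions; all sizes and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(phi, gamma_prev, chi_prev, P):
    # One LSTM step on an RFF-mapped input phi in R^{n_G}.
    f = sigmoid(P["W1"] @ phi + P["W2"] @ gamma_prev + P["bf"])   # forget gate
    i = sigmoid(P["A1"] @ phi + P["A2"] @ gamma_prev + P["bi"])   # input gate
    o = sigmoid(P["D1"] @ phi + P["D2"] @ gamma_prev + P["bo"])   # output gate
    chi = f * chi_prev + i * np.tanh(P["F1"] @ phi + P["F2"] @ gamma_prev + P["b"])
    gamma = o * np.tanh(chi)                                      # next state
    return gamma, chi

rng = np.random.default_rng(2)
n_g, n_h = 32, 8
P = {k: rng.normal(scale=0.1, size=(n_h, n_g)) for k in ("W1", "A1", "D1", "F1")}
P.update({k: rng.normal(scale=0.1, size=(n_h, n_h)) for k in ("W2", "A2", "D2", "F2")})
P.update({k: np.zeros(n_h) for k in ("bf", "bi", "bo", "b")})

gamma, chi = np.zeros(n_h), np.zeros(n_h)
for t in range(5):                  # hypothetical RFF-mapped input sequence
    phi = rng.normal(size=n_g)
    gamma, chi = lstm_step(phi, gamma, chi, P)
```

In the full architecture, φ would be the output of the RFF layer(s) of (5) or (20), and the weights would be trained by backpropagation rather than drawn at random.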
In the next section, we present simulations which validate the viability of the proposed RFF based LSTM for two problems: first, LOS/NLOS classification over an outdoor WINNER II channel; second, the suitability of the proposed distribution-dependent RFF for LDPC decoding, demonstrated via realistic simulations over nonlinear VLC channels.

VII. SIMULATIONS
In this section, we present simulations to validate the paradigm of RFF based learning for LOS/NLOS based classification and message-passing based detection for LDPC decoding. From the simulations presented below, one can observe significant gains for various nonlinear classification problems when using RFF-approximations of an RKHS, compared to using the indigenous observations.

A. RFF-based LSTM for LOS/NLOS classification
In this subsection, simulations are presented for LOS/NLOS identification in outdoor communication systems. The simulation parameters are summarized in Table I. We consider various outdoor WINNER II channel scenarios, wherein there is a single base-station and the receiver moves from an initial location following a random-walk mobility model. The antenna height at the transmitter is assumed to be 4 m, while the mobile stations are assumed to move along a trajectory in the horizontal plane drawn from a 2D random-walk mobility model. The OFDM standard assumed is IEEE 802.11ax with a guard interval of 3.2 µs. The complex channel-estimates at the receiver are transformed by concatenating their real and imaginary components prior to RFF mapping. In this section, we present the following two comparison cases: a) the case in which x is presented directly to the LSTM layer, and b) the case in which x is mapped to an RKHS using RFF layer(s) prior to the LSTM layer. The candidate neural networks were trained with the initial location of the receiver as mentioned in Scenario I of Table I. It is worthwhile to mention that the initial location of the user for Scenario I was chosen heuristically such that there is an almost equal number of LOS and NLOS observations as the user moves along the aforementioned random-walk based trajectory. Subsequent to training, the candidate neural networks were tested on testing observations derived from Scenario II (which has more LOS labels, the initial location being near the base station) and Scenario III (which has more NLOS labels, the initial location being far away from the base station). The testing F1-score is plotted for the C1 and D1 outdoor scenarios. It is observed that the gains in F1-score performance become more prominent in the low training-data regime, as seen from Fig. 2 and Fig. 3, which makes distribution-dependent RFFs better suited for rapidly fluctuating outdoor scenarios.
Further, from the simulated receiver operating characteristics (ROC), plotted in Fig. 4 and Fig. 5 for 400 training samples, better performance is obtained from the RFF based LSTM than from the generic LSTM, which is in line with the gains promised by the analytical results derived previously.

B. LDPC decoding for VLC
In this subsection, we describe our methodology for LDPC decoding over a nonlinear VLC channel. We assume a LOS VLC channel modelled by a Lambertian radiation pattern [38], [39], with a memory Rapp LED nonlinearity (which is widely used for modelling a white light-emitting diode (LED) [40], [41]). The overall system model at the i-th time-instant can be written as

y_i = f(x_i) + n_i,    (29)

where n ∼ N(0, σ_n² I), with σ_n² denoting the overall variance of the additive noise, which accounts for the combined effect of shot noise and ambient noise at the photodetector. Moreover, x_i denotes encoded independent and identically distributed (i.i.d.) on-off keying (OOK) transmissions, and f(·) denotes the LED transfer characteristic, modelled as an AM-AM Rapp nonlinearity. The encoding is performed according to the 802.11n LDPC generator-matrix. From (29), one can observe that the nonlinearity f(·) warps/distorts the transmitted codewords x, which alters their algebraic structure and causes errors in LDPC decoding. Hence, based on the previous discussion, it is proposed to learn the bits using a detector trained on Φ̂(y), where the number of RFF dimensions is equal to the codeword-length (see Footnote 4). In fact, using the Representer theorem, (29) can be rewritten as

y = ⟨k_x, Φ(x)⟩_H + n,    (30)

where k_x is an operator in the RKHS H. Using the completeness of the RKHS H, there exists an operator k_y such that ⟨k_y, y⟩_H recovers x up to additive noise, which can be modelled as an AWGN channel with noise-variance var[⟨k_y, y⟩_H] = ⟨k_y, k_y⟩_H σ_n². Given the theme of this work, the following approximation in the RKHS is utilized: ⟨k_y, y⟩_H ≈ Ω^T Φ̂(y).
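For concreteness, a memoryless AM-AM Rapp sketch of the channel in (29) in NumPy (the paper's model additionally includes memory with parameter α; the smoothness parameter p, the saturation level, and the noise level here are illustrative assumptions):

```python
import numpy as np

def rapp(x, x_sat=0.4, p=2.0):
    # Memoryless AM-AM Rapp model of the LED transfer characteristic:
    # linear for small |x|, saturating toward x_sat for large |x|.
    return x / (1.0 + (np.abs(x) / x_sat) ** (2 * p)) ** (1.0 / (2 * p))

rng = np.random.default_rng(3)
bits = rng.integers(0, 2, size=648)     # i.i.d. OOK symbols (toy, uncoded)
x = bits.astype(float)
sigma_n = 0.05
y = rapp(x) + rng.normal(0.0, sigma_n, size=x.shape)   # y_i = f(x_i) + n_i
```

The compression of the "on" level toward x_sat is exactly the distortion of the codeword structure that the RFF map Φ̂(y) is meant to mitigate before message-passing.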
Next, a hypothesis, denoted by Ω, is trained on Φ̂(y), which optimizes the quadratic loss-function ‖Ω^T Φ̂(y) − sign(y)‖²₂. (Footnote 4: It is also possible to up-convert to higher dimensions using an RFF and then down-convert using an autoencoder. Though this dual conversion may have performance benefits, it is computationally complex, and hence we focus our attention on the single-layer case.) Channel-decoding is performed using message-passing over a Tanner-graph representation of the
parity-check matrix [42]. We denote the graph-neighborhood of node k as B_k (the set of nodes incident on node k in the Tanner graph, apart from k itself). Additionally, for the j-th bit, the log-likelihood messages from the bit-nodes to the check-nodes and from the check-nodes to the bit-nodes are denoted as m_b(j) and m_c(j), respectively. Lastly, we denote the length of the bit-string b as B and the size of the encoded codeword as C. The algorithm is summarized in Algorithm 1, with the syndrome computed as e = Hb. The simulation parameters follow [41], with a Rapp LED nonlinearity, where the memory-parameter of the nonlinearity, α, is taken to be 0.2, and the saturation current of the LED is 0.4. The generator matrix for the LDPC code is derived following the IEEE 802.11n standard with a block length of 648 [43], [44]. From the BER results presented in Fig. 6, it can be inferred that the proposed distribution-dependent RFF based message-passing outperforms the classical RFF based message-passing in terms of BER performance. Notably, the gains in BER performance are higher at a lower number of outer iterations; though it is noted that even upon increasing the number of iterations to a very high value (such as 50), we still obtain a significant gain with the proposed distribution-dependent RFFs, as seen in Fig. 6.
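The message passing of Algorithm 1 can be sketched as follows, with a small (7,4)-style parity-check matrix standing in for the 802.11n block-length-648 matrix; in the paper the channel LLRs would be derived from the RFF-trained detector Ω^T Φ̂(y), whereas here they are assumed directly (positive LLR favours bit 0):

```python
import numpy as np

# Toy parity-check matrix (illustrative stand-in for the 802.11n LDPC matrix)
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def sum_product_decode(llr, H, n_iter=20):
    """Sum-product message passing over the Tanner graph of H."""
    m, n = H.shape
    M = H * llr                       # bit-to-check messages m_b
    E = np.zeros((m, n))              # check-to-bit messages m_c
    b = (llr < 0).astype(int)
    for _ in range(n_iter):
        for i in range(m):            # check-node update (tanh rule)
            idx = np.flatnonzero(H[i])
            t = np.tanh(M[i, idx] / 2.0)
            for k, j in enumerate(idx):
                prod = np.clip(np.prod(np.delete(t, k)), -0.999999, 0.999999)
                E[i, j] = 2.0 * np.arctanh(prod)
        total = llr + E.sum(axis=0)   # posterior LLRs
        b = (total < 0).astype(int)
        if not np.any(H @ b % 2):     # syndrome e = Hb all-zero: finished
            break
        M = H * total - E             # extrinsic bit-to-check update
    return b

# All-zeros codeword with one unreliable bit; the decoder corrects it.
llr = np.full(7, 4.0)
llr[0] = -2.0
b_hat = sum_product_decode(llr, H)
```

The early exit on a zero syndrome mirrors the "calculate syndrome e = Hb" step of Algorithm 1; in the paper's pipeline, the priors entering this loop would additionally be reweighted by the quality of the RFF approximation at each outer iteration.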

VIII. CONCLUSION
In this work, analytical results motivating the paradigm of RFF based DL are presented, and a novel distribution-dependent RFF is proposed. The validity of the presented analysis and of the proposed distribution-dependent RFF is ratified through realistic computer simulations for critical machine-learning problems encountered in next-generation wireless communications, such as LOS/NLOS identification and LDPC decoding over nonlinear VLC channels. Simulations performed over realistic WINNER II outdoor channels validate the analytical proofs and indicate that the proposed neural network architecture delivers significant gains over classical LSTMs for LOS/NLOS identification, which makes the proposed methodology viable. Lastly, the worth of the traditional RFF and the proposed distribution-dependent RFF maps is compared for message-passing based LDPC detection. In line with the derived theoretical results, the simulations indicate a significant performance gain upon deployment of the proposed distribution-dependent RFF, which reinforces the usefulness of distribution-dependent RFFs for machine-learning applications in generic communication systems.