EM DeepRay: An Expedient, Generalizable, and Realistic Data-Driven Indoor Propagation Model

Efficient and realistic indoor radio propagation modeling tools are inextricably intertwined with the design and operation of next-generation wireless networks. Machine-learning (ML)-based radio propagation models can be trained with simulated or real-world data to provide accurate estimates of wireless channel characteristics in a computationally efficient way. However, most existing research on ML-based propagation models focuses on outdoor propagation modeling, while indoor data-driven propagation models remain site-specific with limited scalability. In this article, we present an efficient and credible ML-based radio propagation modeling framework for indoor environments. Specifically, we demonstrate how a convolutional encoder-decoder can be trained to replicate the results of a ray tracer, by encoding physics-based information of an indoor environment, such as the permittivity of the walls, and decoding it as the path loss (PL) heatmap for an environment of interest. Our model is trained over multiple indoor geometries and frequency bands, and it can then predict the PL for unknown indoor geometries and frequency bands within a few milliseconds. In addition, we illustrate how the concept of transfer learning can be leveraged to calibrate our model by adjusting its preestimated weights, allowing it to make predictions that are consistent with measurement data.

artificial intelligence, to process large volumes of data and extract purposeful information, can revolutionize wireless network operation [6]. Until now, ML has been used to tackle a wide range of wireless-network-design-related problems, such as resource allocation, user mobility analysis, localization, and wireless channel modeling [7]-[9]. The latter case has recently attracted significant interest, as radio propagation modeling is the cornerstone of cellular network design [10], [11].
Conventionally, for radio channel modeling, empirical or deterministic models are used. The empirical channel models, such as the COST-231 model [12], are derived by fitting simplistic mathematical models to measurement campaign data [13], [14]. Although these models are practical in use, they can demonstrate significant deviations from the actual received path loss (PL) values [15], potentially rendering them unreliable. Deterministic models rely on the governing laws of electromagnetic wave propagation, providing an approximate solution to Maxwell's equations. The commonly used deterministic models include ray tracing, the finite-difference time-domain method, or the vector parabolic equation method [16]. Unlike empirical models, deterministic models are site-specific, solving Maxwell's equations within a specified physical geometry, thereby yielding more accurate results.
For that reason, the popularity of deterministic models has grown constantly over recent years. Among the various deterministic models, ray tracing has been widely used to calculate radio channel characteristics, and it is expected to have a leading role in the deployment and the design of 5G and beyond systems [17]. A key difference between 5G and legacy communication system design is that a significant part of 5G radio access networks will be installed in indoor environments. Traditionally, in-building traffic is served by outdoor cells, following an outside-in approach, and until now only a small number of buildings have dedicated indoor mobile networks. However, in the 5G and beyond era, the majority of in-building mobile traffic will be served by indoor base stations or access points [18]. Thus, an accurate and expedient indoor propagation modeling tool is now more important than ever.
0018-926X © 2022 Crown Copyright
Computational efficiency usually appears as a bottleneck for ray tracing, due to the substantial simulation time and memory required to trace all the ray paths when the number of scattering objects and ray intersections within the simulated space increases. Data-driven approaches aim to alleviate this limitation by integrating ray tracing simulators with ML algorithms that are capable of learning and inferring radio propagation parameters [10], [11]. In particular, artificial neural networks (ANNs) have been widely used in an effort to expedite [19] or even replace ray tracing simulators [20]. The preponderance of past research concentrates on urban propagation scenarios [20]-[27]; however, as has been mentioned, indoor propagation modeling is of high significance in the deployment of 5G networks. Currently, most of the existing approaches to indoor propagation modeling are confined to the use of simple multilayer perceptrons (MLPs) [28] to determine the radio channel characteristics [19]. This limits the model's generalizability, as MLPs trained within a certain geometry are agnostic to the characteristics of other geometries (materials, building layout, transmitter position) [29]. In addition, it is necessary to increase the fidelity of ML-based propagation models by considering real measured data, instead of using simulated data only [22], [29], [30].
The main contribution of this article is to tackle these two fundamental limitations of indoor ML-based propagation models. To that end, first we outline how a convolutional encoder-decoder can be trained to predict the PL for an arbitrarily complex indoor environment over multiple frequency bands. In particular, we use a modified version of the U-Net architecture using stacked dilated convolutions (SDU-Net) [31], which learns to transform an input tensor, comprising information regarding the physical properties of an indoor environment, to a PL heatmap. Unlike [19], [29], and [30], the predictions of the proposed data-driven framework are based on the electromagnetic properties of the physical environment, such as the permittivity and the conductivity of the walls, and other parameters that affect wave propagation, e.g., the distance and the frequency. We demonstrate that our approach overcomes the problem of limited generalizability, as it is directly applicable to multiple indoor geometries and frequencies, without any further training. The predictions made by our model can replicate closely the results of a ray tracing simulator, with the additional advantage of a substantially reduced computational time (less than a second in a GPU).
To tackle the second limitation, in this article we present an approach that allows making predictions that match measured data. To that end, we use the idea of transfer learning to calibrate the proposed data-driven propagation model. Specifically, we introduce an expedient method that allows our model to adjust its preestimated weights, computed using simulated data, and provide realistic estimates of the signal level using only a very small quantity of measured data. To summarize, in this article we show how to leverage the results from a ray tracing simulator to create a generalizable data-driven propagation model for indoor environments. Once trained, our model can be used as a standalone propagation solver to predict the PL for unknown indoor geometries and frequencies. Subsequently, it can be readily calibrated to provide realistic estimates of the signal level for any indoor environment, using only a few measurements.
Specifically, our work differs from previous research in the following ways.
1) While urban and suburban ML-propagation models have been widely explored [20]-[26], ML-based indoor propagation models have received less attention, despite their importance for 5G and beyond systems. In our work, we introduce a standalone indoor radio propagation model, which builds on input features that consider the particularities of wave propagation in indoor environments. For instance, in indoor environments it is common to encounter a larger variety of building materials than in outdoor spaces. Moreover, in outdoor ray tracing simulations the building walls are modeled as semi-infinite spaces because of the high losses. However, the walls found in indoor environments are typically thinner, and for materials such as plasterboard or wood, the losses are considerably smaller. Hence, it is necessary to consider the attenuation of waves propagating through walls.
2) Instead of using conventional convolutional neural networks (CNNs) and/or MLPs [19], [23], [25], [26], [30], we leverage the computational efficiency of convolutional encoders-decoders to create a generalizable data-driven propagation model. Unlike [20] and [24], the convolutional encoder-decoder used in this article uses the concept of atrous convolutions, which, as shown in [29] and [31], can be considerably advantageous.
3) Our approach is directly applicable to new indoor environments and frequency bands, without further training, allowing PL estimation in a few milliseconds. On the contrary, the predictions of MLP-based models using relative coordinates, environmental features, and other tabular information [19], [25] are restricted to the environment in which the MLP is trained. Likewise, the data-driven models in [23] and [24] were trained over a sole operating frequency, and in [24] and [26] the testing was performed on parts of cities included in the training set.
4) We do not confine the evaluation of our model's performance to a comparison with synthetic ray tracing data [22]-[24], [30]. We also demonstrate that our model can provide estimates of the signal strength, for unknown geometries and frequencies, which are in close agreement with real-world measurements. More importantly, we outline how sparse measurements can be leveraged through transfer learning to make realistic predictions of the signal strength, minimizing the error between our model's predictions and measured data.
The outline of this article is as follows. First, in Section II, we discuss the functionality of some basic ANN architectures. In Section III, we provide a brief overview of some of the existing approaches to ML-based propagation modeling, focusing on ray tracing. In Section IV, we present the proposed data-driven indoor propagation modeling framework. Next, in Section V, we provide numerical results comparing our model with synthetic data computed through a ray tracing simulator. We study scenarios where the proposed framework is trained over multiple geometries and frequency bands, and we also explore its applicability to unknown indoor environments and frequencies. Then, in Section VI, we outline how transfer learning can be exploited to calibrate our model, and we compare the predictions of our and other propagation models with measured data. Finally, Section VII concludes this article outlining its main contributions.

II. THEORETICAL BACKGROUND
ANNs seek to imitate human intelligence by allowing simple learning components to perform basic computational operations and connect to other learning components. The learning components are referred to as nodes, artificial neurons, or just neurons. In what follows, we briefly discuss the functionality of MLPs and CNNs, focusing on the latter since in this work we are using a convolutional encoder-decoder.

A. Multilayer Perceptrons
In MLPs, the neurons are arranged into layers, where each neuron has input connections originating from the previous layer and output connections pointing toward the next layer. A typical MLP consists of an input layer, a number of hidden layers, and an output layer, as shown in Fig. 1, where $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ are the input features and the outputs of the MLP, respectively. With the term input features, we refer to the characteristic physical quantities that affect a quantity of interest (QoI), which is the output of the MLP. For instance, the commonly used input features of ML propagation models are the operating frequency, the distance, or the transmitter height, while typical QoIs are the signal level or the PL. The hidden layers consist of neurons that apply nonlinear transformations to their input data. Specifically, the output, $u_j^{(l)}$, of the $j$th neuron in the $l$th layer is computed by applying a nonlinear function, $g$, to the weighted sum of the previous layer neuron outputs, plus a bias term, $b^{(l)}$

$$u_j^{(l)} = g\left(\sum_{i=1}^{n_{l-1}} \theta_{i,j}\, u_i^{(l-1)} + b^{(l)}\right)$$

where $u_i^{(l-1)}$ is the output of the $i$th neuron of the previous layer, $\theta_{i,j}$ is the weight associated with $u_i^{(l-1)}$, and $n_{l-1}$ is the number of neurons in the previous layer. Some commonly used nonlinear functions, $g$, are the rectified linear unit (ReLU), the tanh, and the sigmoid function [28]. The output of the MLP is estimated in a similar manner, and the QoI can assume discrete categorical or continuous values, for classification and regression problems, respectively.
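As an illustration of the neuron update described above, the following minimal sketch evaluates one layer in plain Python. The weights, inputs, and the use of one bias per neuron are hypothetical choices made for this example, not values taken from this article.

```python
def relu(v):
    """Rectified linear unit, one of the nonlinearities g named in the text."""
    return max(0.0, v)


def layer_forward(u_prev, theta, bias, g=relu):
    """Return the outputs of one layer.

    u_prev : outputs of the previous layer, length n_{l-1}
    theta  : theta[i][j] is the weight from previous neuron i to neuron j
    bias   : one bias per neuron of the current layer (an assumption; a
             single shared bias is the special case of equal entries)
    """
    n_prev = len(u_prev)
    return [
        g(sum(theta[i][j] * u_prev[i] for i in range(n_prev)) + bias[j])
        for j in range(len(bias))
    ]


u_prev = [1.0, 2.0, 3.0]                          # hypothetical previous-layer outputs
theta = [[0.5, -1.0], [0.25, 0.0], [-0.5, 1.0]]   # 3 inputs -> 2 neurons
bias = [0.1, -0.5]
u = layer_forward(u_prev, theta, bias)
print(u)  # [0.0, 1.5]
```

The first neuron has a negative pre-activation (-0.4), so the ReLU clips it to zero, while the second passes its pre-activation (1.5) unchanged.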

B. Convolutional Neural Networks
CNNs have been widely used in computer vision and in image-processing-related applications. In CNNs, the input data are represented as tensors, and they have three dimensions: the width, $W$, the height, $H$, and the number of channels, $C$. A CNN typically comprises three different layer types: convolutional, pooling, and MLPs. An example of a CNN is shown in Fig. 2, where an RGB image, representing the geometry of an indoor environment, is fed to a convolutional layer.
In a convolutional layer, the input tensor is convolved with $n_f$ different filters. Each filter identifies different hidden features and it consists of stacked 2-D kernels, i.e., 2-D square $f_w \times f_h$ weight matrices. The number of stacked kernels is equal to the number of input channels, $C$, and the number of output channels is equal to the number, $n_f$, of the individual filters convolved with the input tensor. The result of these convolutions, i.e., the layer output, is known as the feature map. To estimate the elements of the feature map, each kernel is slid with a stride, $s$, over the respective channel of the input tensor, and the dot product between the filter kernel weights and the points of the corresponding overlapping channel subarea is computed. Subsequently, the elements of the output feature map are computed as the cross-channel summation of the results of the aforementioned dot products. With respect to Fig. 2, this corresponds to computing the dot product between the orange weight boxes and the pixel intensity values of the overlapping subarea, and then adding the dot products estimated for each one of the three RGB channels. Let $X$ be the input tensor and $\theta$ the weights of the $o$-th filter; then the elements $z$ of the feature map are computed via [28]

$$z_{m,n,o} = \sum_{k=1}^{C} \sum_{i=1}^{f_w} \sum_{j=1}^{f_h} \theta_{i,j,k,o}\, X_{s(m-1)+i,\, s(n-1)+j,\, k}$$

where the summation over $i$ and $j$ captures spatial correlations between the elements of the same channel, while the summation over $k$ enables unveiling correlations between the different channels of the input tensor (cross-channel correlations). The convolutional layers of a CNN are typically followed by pooling layers used to simplify the representation of the encoded feature maps through downsampling and to reduce the number of the model's trainable parameters. The most common pooling operation is max-pooling. In a max-pooling layer, a filter, whose number of kernels is equal to the number of the input tensor channels, is slid over the received input tensor, and the elements of the output feature map are estimated as the maximum element of the overlapping subarea of each channel.
For instance, the different colored squares in Fig. 2 correspond to the kernels of the filter used to extract the maximum point from a subarea of each channel. Finally, CNNs are commonly terminated with MLPs, which are used to estimate the final network output. The main advantages of CNNs are the weight sharing and the connection sparsity properties. The first means that the same filter weights are applied to different parts of the input tensor, i.e., the same orange box is slid over the entire indoor geometry in Fig. 2. Thus, each convolutional filter can identify certain kinds of hidden features, and more importantly it is not required to compute different weights for every point of the input tensor. The second property signifies that unlike MLPs, where each output neuron receives information from all the neurons of the previous layer, in CNNs, each neuron considers information originating only from a small subarea of the previous layer. Thus, the numerical operations required to compute the output value of a neuron decrease substantially. For instance, for the estimation of $z_{1,1,1}$ in Fig. 2 (purple box), only the top-left side of the indoor geometry is considered.
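The convolution and max-pooling operations described above can be sketched in a few lines of plain Python. This is a toy illustration, assuming no padding and zero-based indices; the input values and the single 2 × 2 filter are arbitrary and serve only to make the arithmetic easy to follow.

```python
def conv2d(X, filters, stride=1):
    """Valid (unpadded) multi-channel convolution as used in CNNs.

    X       : input tensor indexed as X[k][row][col], with C channels
    filters : filters[o][k] is the 2-D kernel of filter o for channel k
    Returns a feature map indexed as Z[o][row][col].
    """
    C, H, W = len(X), len(X[0]), len(X[0][0])
    Z = []
    for theta in filters:
        f_h, f_w = len(theta[0]), len(theta[0][0])
        fmap = []
        for m in range(0, H - f_h + 1, stride):
            fmap.append([
                # Dot product over the overlapping subarea, summed across channels.
                sum(theta[k][i][j] * X[k][m + i][n + j]
                    for k in range(C) for i in range(f_h) for j in range(f_w))
                for n in range(0, W - f_w + 1, stride)
            ])
        Z.append(fmap)
    return Z


def max_pool2d(X, size=2, stride=2):
    """Channel-wise max pooling: keep the maximum of each size x size subarea."""
    return [
        [[max(ch[m + i][n + j] for i in range(size) for j in range(size))
          for n in range(0, len(ch[0]) - size + 1, stride)]
         for m in range(0, len(ch) - size + 1, stride)]
        for ch in X
    ]


# Toy two-channel 4x4 input and a single 2x2 filter (values are arbitrary).
X = [
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]],
]
filters = [[[[1, 0], [0, 1]], [[1, 1], [1, 1]]]]

Z = conv2d(X, filters, stride=2)   # [[[11, 15], [27, 31]]]
P = max_pool2d(X)                  # [[[6, 8], [14, 16]], [[1, 1], [1, 1]]]
```

The same kernel weights are reused at every position (weight sharing), and each output element depends only on a 2 × 2 subarea of every channel (connection sparsity).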
The connection sparsity property also gives rise to the concept of the receptive field, which depicts the information region of the input tensor that affects the output response of a neuron. Evidently, since each convolutional layer encodes and compresses information of the previous layer, the neurons of deeper layers have an indirect access to a larger area of the input tensor, and thereby a larger receptive field [28]. Hence, the shallower layers within a CNN can detect low-level features of the input tensor, whereas the deeper layers, due to their augmented receptive field, can identify more complicated abstract high-level features.
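The growth of the receptive field with depth can be made concrete with the standard recurrence used for stacked convolution and pooling layers. The sketch below is illustrative only: the layer stacks are hypothetical and are not the configuration of the network used in this article. The dilation argument is included because dilated (atrous) convolutions enlarge the receptive field without adding weights.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, per dimension) of one output neuron.

    layers : sequence of (kernel_size, stride, dilation) tuples, first layer first.
    Uses the recurrence r <- r + (k_eff - 1) * jump, jump <- jump * stride,
    where k_eff = dilation * (kernel_size - 1) + 1.
    """
    r, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        k_eff = dilation * (kernel_size - 1) + 1
        r += (k_eff - 1) * jump
        jump *= stride
    return r


conv3 = (3, 1, 1)      # 3x3 convolution, stride 1
pool2 = (2, 2, 1)      # 2x2 max pooling, stride 2
atrous3 = (3, 1, 2)    # 3x3 convolution with dilation rate 2

shallow = receptive_field([conv3])                       # 3
deep = receptive_field([conv3, pool2, conv3, pool2])     # 10
dilated = receptive_field([atrous3])                     # 5
```

A neuron after one convolution sees a 3 × 3 patch, whereas a neuron after two convolution and pooling stages already sees a 10 × 10 patch of the input.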

III. RELATED WORK
The idea of using ANNs to improve the performance of radio propagation modeling tools is not new. Perrault et al. [21] used an MLP to calibrate a ray tracer with measurements taken in three different cities at 900 and 1800 MHz. The MLP received as input the received signal strength (RSS) provided by the ray tracer and several simulation parameters (transmitter height, number of reflections, and land type). Then, the authors tried to fit the simulated RSS to the measured RSS values and calibrate the ray tracer. More recently, due to their computational efficiency, the use of CNNs for radio propagation modeling has become popular [20], [22]. In [22] and [23], a CNN was trained with images that depicted the buildings of a city, assuming different pixel intensity values according to the building height. Subsequently, the PL for different urban environments was simulated through a ray tracing simulator. The city images along with the respective simulated PL values were used as the input and the expected output of the data-driven model, respectively. In [20] and [24], a U-Net-like convolutional encoder-decoder was used. The authors incorporated multiple features into their model by encoding additional information through different input channels. Although these models provided accurate results and overcame the scalability limitation posed by MLPs, they focus on outdoor propagation modeling and they are not applicable to indoor environments.
Most of the existing AI-based indoor propagation models make use of MLPs [19], [32]. The motivation in the work done in [19] was to reduce the computational cost of ray tracing through a coarse-to-dense grid MLP-assisted scheme. A ray tracing simulation was conducted in an indoor environment assuming a coarse grid discretization. Then, the ray tracing results were used to train an MLP to recognize radio channel characteristics and infer the RSS for other points and coverage "holes," assuming a dense grid discretization. In [32], a similar approach was implemented using real measurements instead of ray tracing data. A CNN-based formulation was presented in [30], consisting of two convolutional layers and an MLP with four hidden layers, aiming to evaluate the characteristics of a millimeter-wave channel. The input data comprised the coordinates of the receiver and the transmitter, while the output vector included various channel characteristics (PL, delay spread, angle of arrival, etc.). The main drawback all the previous approaches share is that they attempt to associate the geometrical coordinates of an indoor environment to telecommunication-related features. As it was shown in [29], this constitutes a fundamental bottleneck in the applicability of such models in geometries other than those in which they are trained.

IV. PROBLEM FORMULATION AND PROPOSED MODEL
In this section, we present the details of our data-driven indoor propagation model, called EM DeepRay. We describe the format of the input and output data, the general framework pipeline, and the convolutional encoder-decoder used.

A. Input Features and Target Output
The input features used in our work are: 1) the permittivity; 2) the conductivity of the building materials; 3) the distance between the transmitter and every point within the simulated grid; and 4) the free space PL (FSPL) at every point of the indoor geometry, assuming the absence of the building. The predicted parameter is the PL at every spatial point within an indoor geometry, given a specific transmitter position and building layout. The reasoning behind the input feature selection and the generation of the input tensor is explained in the following paragraphs.
Wave propagation is affected by four basic mechanisms: reflection, transmission, diffraction, and diffuse scattering [16], [33]. It is well-known that the impact of these mechanisms depends on the wavelength $\lambda$ of the propagating wave. In particular, reflections and transmissions require that the objects found within the propagation environment are electrically large, i.e., their dimensions are much larger than $\lambda$. When $\lambda$ is much larger than the object (sharp edges, small openings), the propagating waves will bend around the object and diffraction will occur. Finally, diffuse scattering requires that abrupt variations in the surface height are an order larger than the wavelength. The impact of these mechanisms also depends on the material properties, which in turn are related to the radio signal frequency [34].
The electromagnetic properties of a material can be quantified by its permittivity and its permeability. In this work, we assume that all the construction materials found within the simulated indoor geometries are nonmagnetic, i.e., they have a constant permeability equal to $\mu_0$. The permittivity is formulated as $\epsilon = \epsilon_r \epsilon_0 = (\epsilon_r' - j\sigma/(\epsilon_0 \omega))\,\epsilon_0$, where $\epsilon_r'$ is the real part of the relative permittivity $\epsilon_r$, $\sigma$ is the conductivity, and $\omega$ is the angular frequency. Thus, to consider how different materials affect wave propagation, we use two input channels that convey information regarding the relative permittivity. Specifically, the first channel depicts the real part of $\epsilon_r$ at every point of the simulated grid, and the second includes the value of the conductivity. An example of these channels is shown in Fig. 3(a) and (b), respectively. The conductivity is modeled as $\sigma = c f^{d}$, where $f$ is the frequency of the propagating wave in GHz. The values of $\epsilon_r'$, $c$, and $d$ are derived from the ITU-R P.2040-1 Recommendation [34], and they are shown in Table I. The first channel allows our model to understand the presence of an object and infer the strength of the reflected and the transmitted components of an electromagnetic wave impinging onto it. The second channel accounts for the attenuation that an electromagnetic wave undergoes while it propagates through an absorbing medium. Also, to indicate the transmitter position within the simulated grid, the values of $\epsilon_r'$ and $\sigma$ around the transmitter's position are set equal to twice the maximum values of $\epsilon_r'$ and $\sigma$.
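A minimal sketch of how the two material channels could be assembled from a grid of material labels is given below. The material parameters are illustrative numbers only (the values actually used are those of Table I), the function names are hypothetical, and two simplifications are assumed: the maximum is taken over the grid, and only the single transmitter pixel is marked rather than a neighborhood around it.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m

# (real part of eps_r, c, d) per material. Illustrative numbers only;
# the values actually used are those of Table I.
MATERIALS = {
    "air": (1.0, 0.0, 0.0),
    "wall": (5.31, 0.0326, 0.8095),
}


def conductivity(c, d, f_ghz):
    """sigma = c * f^d in S/m, with f in GHz."""
    return c * f_ghz ** d


def complex_relative_permittivity(eps_r_real, sigma, f_ghz):
    """eps_r' - j * sigma / (eps0 * omega), with omega = 2 * pi * f."""
    omega = 2.0 * math.pi * f_ghz * 1e9
    return complex(eps_r_real, -sigma / (EPS0 * omega))


def material_channels(layout, tx, f_ghz):
    """Build the permittivity and conductivity channels from a grid of labels.

    layout : 2-D list of material names; tx : (row, col) of the transmitter.
    The transmitter pixel is set to twice the channel maximum (here the
    maximum is taken over the grid, which is an assumption).
    """
    eps = [[MATERIALS[m][0] for m in row] for row in layout]
    sig = [[conductivity(MATERIALS[m][1], MATERIALS[m][2], f_ghz) for m in row]
           for row in layout]
    for channel in (eps, sig):
        peak = max(max(row) for row in channel)
        channel[tx[0]][tx[1]] = 2.0 * peak
    return eps, sig


layout = [["air", "wall", "air"],
          ["air", "wall", "air"]]
eps_ch, sig_ch = material_channels(layout, tx=(0, 0), f_ghz=1.0)
```

Because the conductivity depends on frequency, the second channel changes from one frequency band to another even when the geometry is fixed, while the first channel does not.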
Another important parameter that affects wave propagation in indoor environments is wall thickness. In outdoor ray tracing simulations, building walls are modeled as semi-infinite spaces due to the high losses, and thus the penetration into buildings or the multiple reflections within the building facets can be omitted. In indoor environments, this assumption does not hold, since the walls are thinner than the building facets, and for materials such as plasterboard or wood, the losses are considerably smaller. Hence, objects such as walls, doors, and windows are modeled as slabs, and consequently the reflection and transmission coefficients depend on slabs' thickness. To account for that, we use a third channel which depicts the physical distance between the transmitter and every point in the simulated grid. An example of the distance channel is shown in Fig. 3(c). Thus, when a convolutional kernel is applied to a subarea of the input tensor, apart from detecting the wall type through the first two channels, it can also consider the wall length via the difference between the values of the third channel.
Furthermore, the third channel can also provide a good estimate about the deterioration of the signal over distance. To further enhance the awareness of our model regarding the impact of distance on wave propagation, we add a fourth channel which includes the FSPL for each point in the simulated grid. We note that if our model was trained only over one frequency, the fourth channel would have been a simple transformation of the third channel, and thus this information would have been redundant. However, EM DeepRay is trained over multiple frequencies, and hence the fourth channel helps our data-driven model to unveil correlations over the frequency and space domains.
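The redundancy argument above can be made explicit by expanding the FSPL. Assuming the standard free-space expression with $d$ in meters, $f$ in Hz, and $c_0$ the speed of light, the distance and frequency contributions separate additively:

```latex
\begin{align}
\mathrm{FSPL}(d, f) &= 20\log_{10}\!\left(\frac{4\pi d f}{c_0}\right)
  = 20\log_{10} d + 20\log_{10} f + 20\log_{10}\frac{4\pi}{c_0}, \\
\mathrm{FSPL}(d, f_2) - \mathrm{FSPL}(d, f_1) &= 20\log_{10}\frac{f_2}{f_1}.
\end{align}
```

For a single training frequency, the last two terms of the first line are constants, so the fourth channel is a fixed monotonic function of the third. Across frequencies, the channel shifts by $20\log_{10}(f_2/f_1)$, e.g., about 6 dB when the frequency doubles, which is the additional information the fourth channel carries.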
The target output of EM DeepRay is a tensor including the PL at every point of the simulated grid. We note that a similar treatment can be applied for another QoI such as the RSS or the phase delay. As shown in [29], each channel of the output tensor corresponds to different sampling heights. In this work, we use a single output channel, representing the PL at the horizontal plane at a height equal to 1.5 m. The target PL values are computed using a ray tracing simulator [35]. The goal is to train a convolutional encoder-decoder which encodes the input tensor and then decodes it as a PL heatmap, i.e., to find a mapping between the material properties, physical distance, and frequency space and the PL space.

B. Workflow
The block diagram of EM DeepRay, shown in Fig. 4(a), preserves the same idea as the initial DeepRay framework [29]. In the initial framework, the simulated results from a ray tracing simulator are leveraged to train a convolutional encoder-decoder to learn how to transform a blueprint of the indoor geometry to a PL heatmap. For each ray tracing simulation, we obtain an image, $X_i$, representing the geometry of an indoor environment and a tensor, $y_o$, comprising the simulated PL values at each point of the indoor geometry. The target PL tensor is converted into a grayscale image, $y_{o,gs}$, and subsequently it is resized to a standard size image, $y$. Similarly, the input geometry image is also resized to a standard size image, $X_c$. Although a convolutional encoder-decoder can be trained with varying sized images, using standard size images facilitates the training procedure [36]. To avoid a substantial stretching or squeezing when resizing is applied to the building layouts and the target PL image, the standard size selected in this work is 512 × 512. The resized input geometry and PL images are provided as an input-target pair to a convolutional encoder-decoder that performs a pixel-to-pixel prediction, translating the blueprint of an input geometry to a PL heatmap.
The main difference between the two frameworks is the form of the input tensor. The initial DeepRay framework uses a blueprint of the indoor geometry, and the convolutional encoder-decoder learns to identify how different construction materials affect wave propagation, by representing each material with a distinct color. In this work, we use the blueprint of the input geometry to extract physics-based information that will augment the performance of the convolutional encoder-decoder and allow it to make more accurate predictions. Specifically, as discussed in Section IV-A, for each indoor geometry image, $X_i$, we derive an input tensor, $X$, with four channels representing: 1) $\epsilon_r$ and 2) $\sigma$ at each point of the grid; 3) the physical distance between the transmitter and each point; and 4) the FSPL for each point. Hence, first we resize the initial input geometry image to a standard size image, and then we use the resized image to compute the values of the four input channels of Fig. 3. The first two channels are derived by matching the colors of the blueprint, i.e., different RGB pixel values, to the values of Table I. The physical distance, $d_{i,j}$, for the $(i,j)$-th pixel is computed from the spatial resolution, $R$, of the ray tracing simulation, the transmitter position, $(x_{Tx}, y_{Tx})$, in the simulated grid, and the width and height resizing ratios, $W_r = W/512$ and $H_r = H/512$, respectively. Subsequently, the values of the fourth channel are estimated as

$$\mathrm{FSPL}_{i,j} = 20\log_{10}\left(\frac{4\pi d_{i,j} f}{c_0}\right)$$

where $f$ is the operating frequency in Hz and $c_0$ is the speed of light.
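The distance and FSPL channels can be sketched as follows. The mapping from a resized pixel to the original grid, the axis convention, and the clamping at the transmitter pixel are assumptions made for this illustration; the function names, the grid dimensions, and the 2.4 GHz frequency are likewise hypothetical. The FSPL uses the standard free-space expression.

```python
import math

C0 = 299_792_458.0  # speed of light, m/s


def fspl_db(d_m, f_hz):
    """Free-space path loss in dB at distance d_m (meters) and frequency f_hz."""
    return 20.0 * math.log10(4.0 * math.pi * d_m * f_hz / C0)


def distance_and_fspl_channels(W, H, R, tx, f_hz, size=512):
    """Distance and FSPL channels on a size x size grid.

    W, H : width and height of the original simulated grid, in grid points
    R    : spatial resolution of the ray tracing grid, in meters
    tx   : (x_tx, y_tx), transmitter position in the original grid
    Assumed mapping: resized pixel (i, j) sits at (W_r * i, H_r * j) in the
    original grid, with i indexing columns and j indexing rows.
    """
    W_r, H_r = W / size, H / size
    x_tx, y_tx = tx
    dist = [[R * math.hypot(W_r * i - x_tx, H_r * j - y_tx) for i in range(size)]
            for j in range(size)]
    # Clamp to one grid step so the logarithm stays finite at the transmitter
    # pixel (an assumption; the text does not say how this point is handled).
    fspl = [[fspl_db(max(d, R), f_hz) for d in row] for row in dist]
    return dist, fspl


# Tiny example: an 8 x 4 grid resized to 4 x 4, 0.5 m resolution, Tx at the origin.
dist, fspl = distance_and_fspl_channels(W=8, H=4, R=0.5, tx=(0, 0), f_hz=2.4e9, size=4)
```

In this toy case the width is compressed by a factor of two, so one pixel step along a row corresponds to 1 m while one step along a column corresponds to 0.5 m.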
A schematic representation of this process is shown in Fig. 4(b). As can be observed, the initial input image is resized to a 512 × 512 image, and it is used to extract the four-channel tensor $X$, which is input to the convolutional encoder-decoder. Apart from the input tensor $X$, the convolutional encoder-decoder receives an auxiliary input $z$, which conveys information about the frequency and the resizing ratios, $W_r$ and $H_r$, of each sample. The details of the incorporation of the auxiliary input with the input tensor are outlined in Section IV-C. The target PL tensor is turned into a grayscale image, assuming values between 0 and 255, and the grayscale image is resized to have the same size as $X$ (i.e., 512 × 512). The resized geometry image along with the auxiliary input tensor form a pair of input values, $(X, z)$, while $y$ is the target value. Each input-output pair accounts for a sample used to train the convolutional encoder-decoder.
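The assembly of one training sample can be sketched as below. The grayscale encoding assumes that a pixel intensity equals the PL in dB, rounded and clipped to [0, 255]; the exact scaling, the unit of the frequency entry in the auxiliary input, and the function names are assumptions for this illustration, and the PL map is taken to be already resized.

```python
def pl_to_grayscale(pl_db):
    """Encode a PL map (dB) as 8-bit pixel intensities.

    Assumes a one-to-one mapping in which a pixel value equals the PL in dB,
    rounded and clipped to [0, 255]; the exact scaling is not stated in the text.
    """
    return [[min(255, max(0, round(v))) for v in row] for row in pl_db]


def make_sample(X, f_ghz, W, H, pl_db, size=512):
    """Assemble one training sample ((X, z), y).

    X     : four-channel input tensor (already computed on the resized grid)
    z     : auxiliary input holding the frequency and the resizing ratios
    pl_db : target PL map in dB, assumed to be already resized to size x size
    """
    z = [f_ghz, W / size, H / size]
    return (X, z), pl_to_grayscale(pl_db)


gray = pl_to_grayscale([[43.6, 300.0, -5.0]])            # [[44, 255, 0]]
(X_in, z), y = make_sample([[0]], 2.4, 1024, 768, [[60.2]])
```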
Once trained, the convolutional encoder-decoder can be used to directly predict the PL for new geometries or frequency bands that are not included in the training dataset (dashed box in Fig. 4(a) and (b)). Predictions for new geometries require only an image of the indoor environment, which is used to extract the physics-based input tensor, as shown in Fig. 4(b). The requirement of an image depicting the input geometry is not undue, since most of the existing commercial ray tracing software uses such a representation [17], [35]. In addition, our approach can be further facilitated by recent research which allows computer-aided design (CAD) floor plans to be turned into blueprints [37]. We also note that the extraction of the physics-based input tensor does not entail any complex operations, only if statements and matrix multiplications whose computation requires a few milliseconds.
Given the physics-based input tensor, the convolutional encoder-decoder predicts a PL heatmap image, $\hat{y}$, for the desired input geometry. The predictions of the convolutional encoder-decoder are pixelwise, and the size of the output tensor is 512 × 512. Thus, after prediction, we use a bilinear interpolation to resize the output image back to the initial dimensions $W$ and $H$ of each geometry. As a final step, the grayscale PL images are converted into numerical values. The latter simply implies perceiving the elements of the resized output image as dB instead of pixel intensity values.
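The postprocessing step can be illustrated with a small bilinear interpolation routine. This sketch assumes corner-aligned sampling, which is only one of several conventions used by image libraries, and the function name is hypothetical; after resizing, the values are simply read as PL in dB.

```python
def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D list of floats with bilinear interpolation.

    Uses corner-aligned sampling: the first and last rows and columns of the
    output coincide with those of the input (an assumed convention).
    """
    in_h, in_w = len(img), len(img[0])
    out = []
    for r in range(out_h):
        # Fractional input row corresponding to output row r.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
            bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


# A 2x2 "predicted image" resized to 3x3; values are then interpreted as dB.
pred = [[0.0, 10.0], [20.0, 30.0]]
pl_db = bilinear_resize(pred, 3, 3)
```

In practice the same routine would map the 512 × 512 prediction back to the $H \times W$ grid of the original geometry.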

C. Convolutional Encoder-Decoder
Conventionally, CNNs are terminated by MLPs, as shown in Fig. 2. There are three main drawbacks associated with this approach: 1) MLPs require a fixed-size input; 2) they neglect the spatial and cross-channel correlations; and 3) they are computationally expensive as they have more parameters [36]. This motivated the development of fully convolutional networks (FCNs), which are able to perform pixel-to-pixel predictions [36]. In FCN architectures, the MLP at the end of the network is replaced by a series of upsampling and convolutional layers. The absence of the MLP translates to a smaller number of trainable parameters and a higher efficiency.
A convolutional encoder-decoder is an FCN that consists of two basic components: 1) an encoder, used to downsample the input tensor and compress its context, and 2) a decoder which is used to recapture the input tensor details and reconstruct it in the form of the target output tensor. Convolutional encoders-decoders may also include skip connections between the encoding and the decoding path, as shown in Fig. 5 (dark yellow arrows), to retain information that might be lost during the encoding procedure. In this work, we use the convolutional encoder-decoder depicted in Fig. 5, which is a slightly modified version of the SDU-Net architecture introduced in [31] (we include an additional upsampling path).
The SDU-Net used has five encoding and decoding layers, i.e., five downsampling and upsampling operations are applied, denoted by orange and purple arrows at the encoding and the decoding branches, respectively. As can be seen in Fig. 5, the left path of SDU-Net is used to encode and downsample the input tensor containing information about the indoor geometry, while the right path upsamples the encoded tensor and transforms it to a PL heatmap. At the end of the encoding path, we add an upsampling path used to upsample the auxiliary input z. Consecutive upsampling operations are applied to z to increase its size and concatenate it with the encoded representation of the input tensor. This will allow the SDU-Net to understand the operating frequency and also be aware of the resizing ratio for each sample. This information is not included in the input tensor for computational efficiency, i.e., there is no need to reserve an entire channel for scalar variables. Instead, it is considered during the decoding by fusing the upsampled z with the encoded input tensor just before the decoding procedure begins.
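The fusion of the auxiliary input with the encoded tensor can be sketched as follows. This is only an illustration under stated assumptions: the learned upsampling path is replaced here by a plain broadcast of each scalar to a constant feature map, and the function name `fuse_auxiliary`, the channel count of 512, and the values in `z` are hypothetical. The 16 × 16 spatial size follows from applying five 2 × 2 poolings to a 512 × 512 input.

```python
import numpy as np

def fuse_auxiliary(encoded, z):
    """Append each scalar of z to the encoded tensor as a constant feature map."""
    h, w, _ = encoded.shape
    # Stand-in for the learned upsampling path: broadcast z to (h, w, len(z)).
    z_maps = np.broadcast_to(z, (h, w, z.size))
    return np.concatenate([encoded, z_maps], axis=-1)

encoded = np.zeros((16, 16, 512))  # 512 / 2**5 = 16; channel count is hypothetical
z = np.array([0.54, 0.8])          # hypothetical normalized frequency and resizing ratio
fused = fuse_auxiliary(encoded, z)
```

The design point this illustrates is that the scalars enter the network only at the bottleneck, where the spatial grid is small, rather than occupying full-resolution input channels.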
At the encoding path, each layer includes a 3 × 3 standard convolution, followed by the SDU-Net block, shown in Fig. 6, and a 2 × 2 max pooling with stride equal to 2. At the decoding branch, the feature map of the previous layer is first upsampled, and then it is concatenated with the feature map of the respective layer of the encoding branch (yellow boxes shown at each layer of the decoding branch). Subsequently, a 3 × 3 standard convolution is applied to the concatenated tensor, and the resulting feature map is forwarded to the SDU-Net block and moved upward to the next layer. All the convolutional layers use a ReLU activation function, except for the last 1 × 1 convolution, which uses a linear activation function. The number of convolutional filters is increased or decreased by a factor of 2 at each encoding or decoding layer, respectively. At the top layer of the decoding branch, a 1 × 1 convolution is used after the SDU-Net block to derive the PL heatmap image. The number of filters for the final 1 × 1 convolution is equal to the number of the desired output channels, i.e., one.
The SDU-Net block, as shown in Fig. 6, comprises five cascaded atrous convolutions, each one using a different number of filters and a different atrous rate. The feature maps of the different atrous convolutions are concatenated and forwarded to the next layer after being down- or upsampled. Atrous convolution, also known as dilated convolution, is a generalized case of the standard convolution operation. Its popularity has recently increased, since it can be used to efficiently augment the receptive field of each neuron without increasing the number of model parameters [38]. In atrous convolution, the kernel is widened and its sparsity is increased by introducing blank spaces between the kernel elements. The number of blank positions between neighboring nonblank elements of the kernel is r − 1, where r is the atrous rate (r = 1 recovers the standard convolution). The parameters for each atrous convolution are shown in Fig. 6, where r is the atrous rate, and $n_{out}$ denotes the number of filters at the next layer (which is twice or half the number of filters at the current encoding or decoding layer, respectively).
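The widening of the kernel can be made concrete with a short sketch. For a k × k kernel and atrous rate r, the dilated kernel spans k + (k − 1)(r − 1) positions per side while keeping the same k² weights. The helper name `dilate_kernel` is hypothetical and is used only for illustration.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) blank positions between neighboring kernel elements."""
    k = kernel.shape[0]
    k_eff = k + (k - 1) * (rate - 1)       # side length of the dilated kernel
    dilated = np.zeros((k_eff, k_eff), dtype=kernel.dtype)
    dilated[::rate, ::rate] = kernel       # original weights, spaced `rate` apart
    return dilated

kernel = np.ones((3, 3))
for r in (1, 2, 3):
    d = dilate_kernel(kernel, r)
    print(r, d.shape, int(np.count_nonzero(d)))
```

For a 3 × 3 kernel, the covered area grows from 3 × 3 (r = 1) to 5 × 5 (r = 2) and 7 × 7 (r = 3), while the number of nonzero weights stays at nine.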
An example of 3 × 3 atrous convolutions with different values of r applied to the image of an indoor environment is shown in Fig. 7. It can be observed that as r increases, the receptive field is augmented, since the convolutional kernel has an enhanced field of view over the indoor geometry. Indeed, the area covered by the convolutional kernel with r = 1 (standard, nondilated, convolution, orange box) is substantially smaller than that of the kernel with r = 3. That comes at no increase in the overall computational cost, since the number of trainable parameters remains the same. Furthermore, the use of various atrous convolutions with different atrous rates enables the aggregation of information originating from different spatial scales and the identification of correlations between distant points. Moreover, due to multiple layers of the SDU-Net architecture, it is also possible to capture multiscale correlations at different resolutions (since the size of the input tensor is decreased by a factor of 2 at each encoding layer). The augmented receptive field and the potential of capturing multiscale correlations (using various r values) at multiple resolutions (due to the multiple encoding and decoding layers) render the SDU-Net an exemplary convolutional encoder-decoder for radio propagation modeling tasks.

D. Evaluation Metrics
To evaluate the performance of DeepRay, we use four error metrics: the mean absolute error (MAE), the mean absolute percentage error (MAPE), the root mean square error (RMSE), and the Pearson correlation coefficient, $r_p$, defined in (5)-(8), where $y_o^{(n)}(i, j)$ and $\hat{y}_o^{(n)}(i, j)$ are the ray tracing and the predicted PL values for the nth sample at the spatial point $(i, j)$, respectively, $\bar{y}_o^{(n)}$ and $\bar{\hat{y}}_o^{(n)}$ are the corresponding mean values for that sample, and N represents the number of samples. We note that the error is measured at the end of the proposed framework, i.e., it is not pixelwise but it is estimated with respect to the actual PL values.
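As an illustration, the four metrics can be computed with the textbook definitions below. This is a sketch under assumptions: all samples are taken to share the same grid size so they can be stacked in one array, the Pearson coefficient is computed per sample and then averaged over the N samples, and the function name `error_metrics` is hypothetical. The exact normalization used in (5)-(8) may differ.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """y_true, y_pred: arrays of shape (N, H, W) holding PL values in dB."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err / y_true))
    rmse = np.sqrt(np.mean(err ** 2))
    # Pearson coefficient per sample, using that sample's own mean values.
    t = y_true - y_true.mean(axis=(1, 2), keepdims=True)
    p = y_pred - y_pred.mean(axis=(1, 2), keepdims=True)
    r_per_sample = np.sum(t * p, axis=(1, 2)) / np.sqrt(
        np.sum(t ** 2, axis=(1, 2)) * np.sum(p ** 2, axis=(1, 2)))
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "r_p": np.mean(r_per_sample)}

# Hypothetical data: a prediction that is off by a constant 2 dB everywhere.
rng = np.random.default_rng(0)
y_true = rng.uniform(50.0, 120.0, size=(3, 8, 8))
metrics = error_metrics(y_true, y_true + 2.0)
```

A constant offset gives an MAE and RMSE equal to the offset while leaving $r_p$ at 1, which is why the correlation coefficient is reported alongside the absolute error metrics.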

V. COMPARISON WITH SIMULATED DATA
In this section, we proceed to the implementation of EM DeepRay, comparing its PL predictions with simulated data. To create the synthetic data to train EM DeepRay, we use the Ranplan Professional ray tracing simulator, developed by Ranplan Wireless [35]. Ranplan Professional is a robust radio propagation engine, supporting radio propagation simulations for both indoor and outdoor environments. It enables easy and efficient wireless network design of entire floors or even buildings and it supports a large number of different communication technologies (WiFi, 5G New Radio, IoT). The software has been widely used for actual network planning, which gives us confidence in its reliability.
To train and validate EM DeepRay, we conduct ray tracing simulations in 20 different indoor environments. The buildings considered include simple and more complicated indoor environments, with more than 20 subrooms in the same building floor, and the walls are made of various construction materials (concrete, brick, plasterboard, wood, and glass) and have various thicknesses. For each building, we consider transmitting devices operating at three different frequencies: 1) 0.433; 2) 2; and 3) 3.7 GHz, which correspond to frequency bands used in IoT and 5G systems. To create our dataset, we place a single transmitter operating at a given frequency within a building, and then conduct a ray tracing simulation only for this transmitter. Once the ray tracing simulation is finished, we change the transmitter position and/or the frequency, and we run a new ray tracing simulation to create a new sample. This corresponds to consecutively sampling the PL response in an environment of interest for a given transmitter position and frequency.
For all the simulations, the transmitter antenna is 1.5 m above the floor and has an omnidirectional beam pattern. The resolution of the ray tracing simulations is set to 0.1 m, i.e., we sample the PL value once every 0.1 m. For each frequency band, we run approximately 35 different ray tracing simulations in each building, i.e., 35 × 3 samples per building, assuming a different transmitting device position within the indoor geometry for each sample. We use 80% of the samples to train DeepRay and 20% to validate its performance, which is a commonly used data splitting ratio in ML problems [28]. Once the data are split into two sets, the training dataset is augmented by flipping the indoor geometry, along with the transmitter's position within it, and the PL heatmap images left, right, and downward. The initial and the flipped input images exhibit a symmetry, and hence one should expect the ray tracing results for the flipped geometry to be the flipped version of the original results. Indeed, the intersections between the rays and the walls, the respective reflection and transmission coefficients, and the diffracting edges remain the same, and thus the flipped PL images are equivalent to the ray tracing simulation results for the flipped geometry. This allows an effective increase in the training set size by a factor of 3 without conducting any extra simulations. We note that EM DeepRay is validated with samples from known geometries and frequency bands, i.e., from indoor environments and frequency bands included in the training dataset, but for unknown transmitting device positions within these environments.
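The augmentation relies on applying the same flip to the input tensor and to the target heatmap. The sketch below assumes that the original sample is kept together with a left-right and an up-down flip, which is one reading consistent with the stated threefold increase; the exact set of flips is not fully specified in the text, and the function name `flip_augment` and the toy arrays are hypothetical.

```python
import numpy as np

def flip_augment(x, y):
    """x: (H, W, C) input tensor, y: (H, W) PL heatmap.

    Returns the original pair plus a left-right and an up-down flipped pair.
    """
    pairs = [(x, y)]
    for axis in (1, 0):  # axis 1: left-right flip, axis 0: up-down flip
        pairs.append((np.flip(x, axis=axis), np.flip(y, axis=axis)))
    return pairs

x = np.zeros((4, 6, 2))
x[1, 2, 0] = 1.0          # hypothetical transmitter marker in channel 0
y = np.zeros((4, 6))
y[1, 2] = 40.0            # hypothetical PL value at the same pixel
augmented = flip_augment(x, y)
```

Because both arrays are flipped along the same axis, the transmitter marker and its PL value stay aligned in every augmented pair.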
To test EM DeepRay, we consider three cases, aiming to explore how well our model: 1) can generalize to new geometries, not included in the training dataset; 2) can infer the PL for frequency bands other than those of the training set; and 3) behaves in a combination of 1) and 2). We refer to these cases as Tests 1, 2, and 3, respectively. The case studies are summarized in Table II, where the terms "known" and "unknown" indicate included or not included in the training dataset, respectively. To implement Test 1, we consider five new indoor environments, generating about 35 samples in each building, for each frequency band of the training dataset (0.433, 2, and 3.7 GHz). We underline that EM DeepRay has no prior information regarding the PL distribution within these indoor environments, and it infers the PL heatmaps based on the weights estimated during the training phase. For Test 2, we assume transmitting devices operating at 866 MHz and 2.6 GHz, and we randomly place approximately 15 devices for each frequency in each one of the 20 buildings used to train and validate our model. Finally, for Test 3, we consider transmitting devices operating at 866 MHz and 2.6 GHz in the five buildings used in Test 1, i.e., for buildings and frequency bands not included in the training dataset, taking approximately 40 samples per building and frequency.
Our data-driven model is trained on an NVIDIA Quadro RTX 8000 GPU using TensorFlow, with the Adam optimization algorithm for 250 epochs, a learning rate of 0.0005, and a batch size of 4. Before passing the input data to our model, a min-max normalization is applied separately to every input channel [28]. The loss function to be optimized is the RMSE. We also explored the use of the MAE, but the performance of our model was slightly worse, and thus we present results only using the RMSE as a loss function. During training, the loss function is minimized with respect to $y$ and $\hat{y}$, i.e., based on the pixel intensity values of the ray tracing and the predicted PL images. However, the error metrics are estimated with respect to the ray tracing and the predicted actual PL values (i.e., between $y_o$ and $\hat{y}_o$, in dB). The training takes almost 2 h, while a prediction for a single sample requires about 100 ms, including the extraction of the physics-based input tensor.
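A per-channel min-max normalization can be sketched as follows. The assumption here is that the minimum and maximum of each channel are taken over the whole batch of samples; whether the statistics are computed per sample or over the dataset is not stated in the text. The function name `min_max_normalize`, the `eps` guard for constant channels, and the random data are hypothetical.

```python
import numpy as np

def min_max_normalize(x, eps=1e-12):
    """Scale every channel of x, shaped (N, H, W, C), to the range [0, 1]."""
    x_min = x.min(axis=(0, 1, 2), keepdims=True)
    x_max = x.max(axis=(0, 1, 2), keepdims=True)
    # eps avoids a division by zero when a channel is constant.
    return (x - x_min) / np.maximum(x_max - x_min, eps)

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 7.0, size=(2, 8, 8, 3))
x[..., 2] = 5.0                      # a constant channel maps to zeros
x_norm = min_max_normalize(x)
```

Normalizing each channel separately keeps quantities with very different scales, such as permittivity and distance, from dominating one another at the input.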
The overall error metrics between the simulated and predicted PL values for all the case studies are presented in Table III. The error metrics are negligible for the training set, with the RMSE and the MAE being equal to 1.2 and 0.99 dB, respectively. This implies that the proposed framework can indeed learn to translate a map of physics-based information to a PL heatmap. The validation set error metrics are approximately 3 dB larger than those of the training set, but they still assume low values. More importantly, the good resemblance between the simulated and predicted data is also preserved for the three different test cases, as indicated by the values of $r_p$ that are very close to 1. Also, Tests 1 and 2 yield an RMSE of approximately 5 dB and an MAE of approximately 3.8 dB, while for Test 3 the respective values are 5.6 and 4.5 dB. Furthermore, the MAPE assumes very small values for every scenario. Thus, in addition to learning to infer the signal attenuation within a given geometry and frequency band, our model is generalizable, and it can provide accurate estimates of the PL for new indoor geometries and frequencies. By exploring the generalizability of our model to buildings and frequencies separately, it is easier to understand its potential limitations and what they can be attributed to. Also, in Test 3, by studying the same buildings and frequency bands as in Tests 1 and 2, we can be confident that any additional error is not owing to the use of different unknown buildings or frequencies, but can be attributed to the simultaneous generalization over new buildings and frequencies.

TABLE II. CASE STUDIES: KNOWN AND UNKNOWN INDICATE INCLUDED OR NOT INCLUDED IN THE TRAINING DATASET
In Figs. 8-11, we visualize the results for a random sample from the validation set and from test sets 1-3, respectively. Figs. 8(a)-11(a) show the simulated ray tracing "ground truth" for each sample, while the corresponding PL heatmap predicted by EM DeepRay is illustrated in Figs. 8(b)-11(b). We note that DeepRay outputs solely the PL heatmap, and to illustrate the input geometry we subsequently superimpose its blueprint on top of the PL heatmap. As can be observed, the similarity between the simulated and predicted PL heatmaps is very high for all the cases. The absolute error maps in Figs. 8(c)-11(c) depict the absolute error, $|y_o(i, j) - \hat{y}_o(i, j)|$, at each spatial point $(i, j)$ within the simulated indoor environment. A common trend is that the predicted PL exhibits the largest absolute errors at positions far away from the transmitting device. For instance, as can be seen in Fig. 11(c), the largest errors are found in the lower left area of the floor, which is more than 40 and 10 m away from the transmitter in the horizontal and vertical directions, respectively. This is encouraging, since a user located far away from a transmitter is likely to be served by another device, and thus these areas do not have a significant impact on network design decisions.

VI. COMPARISON WITH MEASURED DATA AND CALIBRATION THROUGH TRANSFER LEARNING
In Section V, we demonstrated that EM DeepRay can be trained with physics-based data of various indoor environments and eventually learn to predict the PL within them. Once trained, EM DeepRay can be used as a standalone propagation model to furnish estimates of the PL for an arbitrarily complex indoor geometry at a given frequency. The credibility of our model highly depends on the data used to develop it. These data are only an approximation of the actual signal attenuation and they are themselves subject to errors. However, the purpose of an ML-based propagation model is to deliver results that closely resemble actual rather than synthetic data.
In this section, we address this issue by outlining a simple, yet efficient, approach that allows the calibration of EM DeepRay through transfer learning. The aim of transfer learning is to leverage knowledge accrued while tackling a certain problem and use it to solve a different problem that is related to the initial one [28]. In our case, the first problem is the design of an ML-based propagation model that can predict the PL in indoor environments, while the second problem is to make the ML-based propagation model realistic. The motivation behind the use of transfer learning, rather than training EM DeepRay simultaneously with synthetic and measured data, is that the measured data are scarce and sparse. Indeed, measurement campaigns are time-consuming and expensive, and usually the results from large-scale campaigns are not publicly available. In addition, typically only a small number of measured RSS values are available for each indoor environment, since these campaigns are realized by moving the receiving device around within the area of interest and recording the RSS at some sparse points.
Hence, if a dataset with both simulated and measured data were used to train an ML propagation model, it would comprise a limited number of measurement scenarios (compared with the simulated ones) with only a few recordings per scenario. That would render the training challenging, as it would be necessary to balance the importance of the measured data against that of the simulated data. Instead, treating the calibration as a distinct problem enables handling a limited quantity of sparse data in a more efficient manner. In particular, to calibrate our model we use the same framework shown in Fig. 4(a); however, the target output tensor is now different. As shown in Fig. 12, for calibration, the target output tensor comprises the recorded measurements at some points within the simulated grid, instead of the simulated ray tracing PL values. To distinguish the target output tensors of the calibration framework, we use the subscript c, referring to the initial and the resized target output as $y_{c,o}$ and $y_c$, respectively. Thus, the calibration process constitutes a retraining of the pretrained model obtained in Section V to adapt its PL predictions to match the measured PL values of $y_c$.

Fig. 12. Calibration framework; the dots in the target tensor correspond to points in the indoor geometry at which the RSS is measured, and their color represents RSS intensity.

A significant difference between the target output tensors of the frameworks shown in Figs. 4(a) and 12 is that most elements of $y_{c,o}$ and $y_c$ are zeros, i.e., no measurements are recorded for these points. For that reason, it is not possible to use the same loss function as the one used in Section V. More specifically, the operation of EM DeepRay is equivalent to applying a mapping function $f$, parameterized through a set of learnable weights $\Theta$, to the initial input tensor $X$ to derive the output PL image heatmap, i.e., $\hat{y} = f(X|\Theta)$.
The values of $\Theta$ are computed iteratively so as to minimize the loss function (9), i.e., the difference between the pixel intensities of the ray tracing "ground truth" and the predicted PL image heatmap, where $y^{(n)}(i, j)$ and $\hat{y}^{(n)}(i, j)$ are the ray tracing and the predicted pixel intensity values, respectively, for the nth sample at pixel $(i, j)$. Evidently, if $y$ is zero at most points, for (9) to be minimized, the values of $\Theta$ should be selected such that $\hat{y}$ is also zero at these points. Thus, using the same loss function for $y_c$, which has only a few nonzero points, would force the pretrained EM DeepRay to break down. To overcome this limitation, when (9) is computed for $y_c$ and $\hat{y}_c$ during the calibration, their difference is multiplied by a mask $Q_c$ that equals 1 at the points where $y_c$ is nonzero and 0 elsewhere, and hence the only elements that contribute to the loss function are those for which measured data exist. This allows us to slightly modify the preestimated weights $\Theta$ during the calibration (retraining) of our model and compute new weights $\Theta_c$, such that $\hat{y}_c = f(X|\Theta_c)$ matches only the nonzero values of $y_c$. We note that the role of $Q_c$ is to consider only the nonzero points of $y_c$ during the computation of $\Theta_c$; however, that does not imply that the predictions for other points of the simulation grid will remain intact. That is to say, $\Theta_c$ is estimated based only on a few points, but the change in the weights affects the predictions for the entire grid.
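The effect of the mask on the loss can be illustrated with a small NumPy sketch. It assumes an RMSE-type loss normalized by the number of measured points, which is an assumption rather than the exact form used in the paper; the function name `masked_rmse` and the toy values are hypothetical.

```python
import numpy as np

def masked_rmse(y_c, y_hat):
    """RMSE restricted to the pixels where a measurement exists (y_c != 0)."""
    q_c = (y_c != 0).astype(float)       # 1 at measured points, 0 elsewhere
    n_meas = max(q_c.sum(), 1.0)         # guard against an empty target
    return float(np.sqrt(np.sum((q_c * (y_c - y_hat)) ** 2) / n_meas))

# Toy 4 x 4 grid with two measured points.
y_c = np.zeros((4, 4))
y_c[0, 1] = 80.0
y_c[2, 3] = 95.0
y_hat = np.full((4, 4), 90.0)
loss = masked_rmse(y_c, y_hat)

# Changing the prediction at an unmeasured pixel leaves the loss unchanged.
y_hat_alt = y_hat.copy()
y_hat_alt[3, 0] = 140.0
loss_alt = masked_rmse(y_c, y_hat_alt)
```

Without the mask, the 14 unmeasured pixels would each contribute an error of 90 dB and push the network toward predicting zeros almost everywhere.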
To demonstrate the effectiveness of our approach, we generate the indoor geometry of [39] and provide it as an input to DeepRay. The operating frequency is 868 MHz. Note that detailed information about the measurement setup can be found in [39]. In the absence of detailed information regarding the material types, all the walls found within the geometry are assumed to be made of concrete and to be 10 cm thick. Again, the input geometry image is resized and it is used to derive the physics-based input tensor. The target output, $y_{c,o}$, is turned into an image and it is also resized to have the same size as the input tensor. The target output tensor has only 36 nonzero points, i.e., we have measurements for 36 points within the simulated geometry. We note that the measurements usually depict the RSS, while EM DeepRay predicts the PL. To account for that, the elements of $y_{c,o}$ are calculated as $\mathrm{Tx}_{\mathrm{power}} - \mathrm{RSS}_{\mathrm{measured}}$, where $\mathrm{Tx}_{\mathrm{power}}$ is the transmitting power. Then, the two tensors along with the auxiliary vector are passed to the SDU-Net, which performs a pixel-to-pixel regression, updating its weights considering only the nonzero points of the target output image tensor, $y_c$, as discussed earlier. The SDU-Net is retrained using the Adam optimization algorithm. To avoid abrupt changes in $\Theta$, we train the SDU-Net over fewer epochs and we use a smaller learning rate than that of Section V. Specifically, the learning rate is equal to $10^{-5}$ and the training lasts for 50 epochs. Finally, to prevent overfitting, we add an $L_2$ regularization at each convolutional layer, setting the regularization parameter equal to 0.01 [28].

Fig. 13. Predicted RSS and comparison with measured data [39]. (a) DeepRay RSS prediction before calibration for the indoor geometry of [39]. (b) DeepRay RSS prediction after calibration for the indoor geometry of [39]. (c) Comparison between the simulated and measured RSS.
The calibration procedure takes 15 s, since the SDU-Net is retrained over a single sample. Once the SDU-Net is retrained, we obtain the predicted PL tensor $\hat{y}_{c,o}$ and we subtract it from the transmitting power to derive the predicted RSS values. The signal level predicted by DeepRay is benchmarked against ray tracing and the multiwall model (MWM) [39]. The RSS predicted by DeepRay before calibration is shown in Fig. 13(a), with the blueprint of the input geometry printed on top of it. The predicted RSS after calibration is shown in Fig. 13(b). The red dots in Fig. 13(a) and (b), numbered from 1 to 36, correspond to the points at which the measurements were taken. The measured, simulated, and predicted RSS for each point are shown in Fig. 13(c), where the x-axis signifies the corresponding red dot position shown in Fig. 13(a). A comparison of the error metrics for the three propagation models is presented in Table IV.
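The two conversions between RSS and PL used in the calibration can be sketched as below. All numbers (pixel coordinates, RSS values, transmit power) are hypothetical, as are the function names, and the sketch places the measurements directly on the 512 × 512 grid, skipping the image resizing step described in the text.

```python
import numpy as np

def build_calibration_target(shape, points, rss_dbm, tx_power_dbm):
    """Sparse PL target: Tx power minus measured RSS at measured pixels, zero elsewhere."""
    y_c = np.zeros(shape)
    for (i, j), rss in zip(points, rss_dbm):
        y_c[i, j] = tx_power_dbm - rss
    return y_c

def pl_to_rss(pl_db, tx_power_dbm):
    """Predicted RSS (dBm) obtained by subtracting the predicted PL from the Tx power."""
    return tx_power_dbm - pl_db

points = [(10, 20), (100, 250), (400, 90)]   # hypothetical measurement pixels
rss_dbm = [-55.0, -72.5, -88.0]              # hypothetical measured RSS values
y_c = build_calibration_target((512, 512), points, rss_dbm, tx_power_dbm=14.0)
rss_back = pl_to_rss(y_c[10, 20], tx_power_dbm=14.0)
```

Since zero marks the absence of a measurement, this encoding relies on no measured point having a PL of exactly 0 dB, which holds in practice for any receiver away from the transmitter.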
Prior to calibration, ray tracing demonstrates the smallest errors, yielding an RMSE and an MAE equal to 5.2 and 4.3 dB, respectively. The calculated error values for EM DeepRay and MWM are approximately 1 dB larger. We remark that EM DeepRay is not trained on this geometry and it can directly make predictions by just using a blueprint. Also, the fact that our model can provide estimates of the PL that are in close agreement with measurements assures us that the number of training samples used in Section V is sufficient to develop a credible data-driven model. A significant advantage of EM DeepRay over both ray tracing and MWM is the substantially lower computational time required to compute the PL values. Indeed, our model's estimation is based solely on multiple matrix multiplications that can be executed within a few milliseconds on a GPU. On the other hand, ray tracing requires determining all the rays that can reach a receiving point, while MWM needs to estimate the number of walls between the transmitter and each point in the simulation grid through Bresenham's line algorithm [40]. The substantial difference in the computational time can be critical when it comes to optimal network planning, where multiple simulations are conducted within the same geometry, aiming at meeting certain quality of service requirements.
After calibration, the predicted RSS values for EM DeepRay show a close correspondence with the measured data, exhibiting an RMSE and an MAE equal to 1.69 and 1.22 dB, respectively, as reported in Table IV. Note that even if we include the retraining time (15 s), the computational time of EM DeepRay remains much lower than that of ray tracing and MWM. More importantly, as pointed out earlier and as can be seen in Fig. 13(b), due to the new weights $\Theta_c$, the predicted RSS values are different for the entire grid, and not just for the few measurement points. For instance, the RSS was underestimated for the entire outdoor area on the left side of the building, but after the computation of $\Theta_c$ the signal for the entire area (and not only for points 26, 27, 28, 29, and 30) assumes higher values. To further test our assumption that calibration improves the predicted RSS values for the entire grid, we calibrate EM DeepRay using only a fraction of the total 36 measurement points and measure the error with respect to the remaining points. The results are shown in Table V, where we consider cases in which EM DeepRay is calibrated using only 4, 8, 16, and 27 randomly selected measurements out of the 36, and the error metrics are estimated with respect to the remaining 32, 28, 20, and 9 measurement points, respectively. As can be seen, the predictions for the rest of the grid are improved, and the accuracy of our model is increased even when using a small number of points during calibration.
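The hold-out experiment amounts to randomly partitioning the 36 measurement points into a calibration subset and an evaluation subset. A minimal sketch follows; the function name `split_measurements` and the fixed seed are assumptions for illustration, and the retraining itself is not shown.

```python
import numpy as np

def split_measurements(n_points, n_calib, seed=0):
    """Randomly split measurement indices into calibration and held-out sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_points)
    return np.sort(perm[:n_calib]), np.sort(perm[n_calib:])

splits = {}
for n_calib in (4, 8, 16, 27):
    calib_idx, test_idx = split_measurements(36, n_calib)
    splits[n_calib] = (calib_idx, test_idx)
    print(n_calib, len(test_idx))
```

The held-out set sizes are 32, 28, 20, and 9, matching the cases considered in Table V. Only the calibration indices would be left nonzero in the target tensor, so the held-out points play no role in the masked loss and give an independent check on the rest of the grid.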

VII. CONCLUSION
Wireless communication system design requires robust and expedient propagation modeling tools to ensure an optimal network performance. In this article, we introduced a generalizable and realistic ML-based radio propagation modeling framework for indoor environments. Our model exploits physics-based information extracted from the blueprint of an indoor environment to predict the PL within an indoor area of interest. Unlike previous work in the field, the predictions of our model are not restricted to indoor geometries included in the training set, but it can be used to readily predict the PL for an arbitrarily complex indoor environment. Our results indicate that our model can very well replicate the PL simulated by a ray tracer, with the distinct advantage of a considerably lower computational time. More importantly, in this work we presented a calibration method that allows our data-driven model to adjust its weights to make predictions that are closer to measured rather than synthetic data. We demonstrated that after calibration, which only takes a few seconds, our model can provide estimates of RSS that resemble actual measured data with outstanding fidelity. Our work tackles two fundamental problems of ML-based propagation modeling (generalizability and credibility), and it paves the way for the establishment of a family of completely automated ML-based propagation models that will assist the deployment of next-generation wireless networks.