Kernel Regression Imputation in Manifolds via Bi-Linear Modeling: The Dynamic-MRI Case

Abstract—This paper introduces a non-parametric kernel-based modeling framework for imputation by regression on data that are assumed to lie close to an unknown-to-the-user smooth manifold in a Euclidean space. The proposed framework, coined kernel regression imputation in manifolds (KRIM), needs no training data to operate. Aiming at computationally efficient solutions, KRIM utilizes a small number of “landmark” data-points to extract geometric information from the measured data via parsimonious affine combinations (“linear patches”), which mimic the concept of tangent spaces to smooth manifolds and take place in functional approximation spaces, namely reproducing kernel Hilbert spaces (RKHSs). Multiple complex RKHSs are combined in a data-driven way to surmount the obstacle of pin-pointing the “optimal” parameters of a single kernel through cross-validation. The extracted geometric information is incorporated into the design via a novel bi-linear data-approximation model, and the imputation-by-regression task takes the form of an inverse problem which is solved by an iterative algorithm with guaranteed convergence to a stationary point of the non-convex loss function. To showcase the modular character and wide applicability of KRIM, this paper highlights the application of KRIM to dynamic magnetic resonance imaging (dMRI), where reconstruction of high-resolution images from severely under-sampled dMRI data is desired. Extensive numerical tests on synthetic and real dMRI data demonstrate the superior performance of KRIM over state-of-the-art approaches under several metrics and with a small computational footprint.

the required ones, generating "holes/gaps" or "missing data" in measurements, and causing distortion, aliasing, and motion blurring in the observed image series [1]. Such after-effects may hamper the diagnostic ability of a physician.
Missing values/data commonly occur in a wide range of applications owing to a number of reasons like damaged sensors, physical constraints on the signal-acquisition mechanism, or simply because some of the information is not available. Missing observations translate to uncertainties that negatively affect the end result: distortion in seismic imaging [3], suboptimal data modeling in recommender systems [4], as well as large numbers of false positives/negatives in detection methods for single-cell genomics [5].
Several strategies have been proposed to address the missing-data problem. This study focuses on imputation by regression, where, typically, (i) a regression model that fits the measured data values is identified, with the regressors being the measured data themselves, and (ii) unknown data values are imputed by the identified regression model and the measured data [6]. Along these lines, this paper puts forth an efficient non-parametric framework, coined kernel regression imputation in manifolds (KRIM), to address imputation-by-regression tasks via exploration and exploitation of latent geometric information which is hidden beneath the measured data. Although KRIM can be tailored to fit any application domain where data approximation is needed, the dMRI setting is used here to demonstrate the way data are collected and side-information is employed, solidify arguments, clarify concepts, and validate the effectiveness of the approach via extensive numerical tests.

B. Prior Art
The standard approach to imputation by regression is compressed sensing (CS), where signals (image frames or patches) are sparsely represented by dictionary "atoms" (vectors) that reside in appropriately defined transform domains (feature spaces) [7][8][9][10][11]. CS-MRI approaches also witnessed the development of low-rank models to exploit inherent correlations for MRI recovery, even at the image-patch analysis level [12,13].
Recent efforts on imputation by regression revolve around deep-learning (DeepL) approaches [14][15][16][17][18] that rely on lengthy processes to learn models from training data prior to reconstructing MR test images. Nevertheless, concerns were brought to light in [19] with regard to DeepL approaches for medical imaging, through tests which demonstrated undesired perturbations and structural changes (e.g., small tumors) in reconstructed images.
Manifold learning, built on the hypothesis that measured data lie on or close to a low-dimensional manifold embedded in a high-dimensional space [20][21][22], has found applications also in dMRI [23][24][25][26]. According to the standard manifold-learning route, graph Laplacian matrices, which capture relations within data-point clouds, are employed in quadratic regularizers of an inverse problem to recover dMR images in [24]. Following a less conventional path, [26] relies on affine relations between low-dimensional renditions of the acquired data to identify a temporal basis, and follows the lines of [12] to proceed with the computation of a spatial basis to describe dMR image frames.
Kernels have also been employed in MRI data-recovery tasks, e.g., [26][27][28][29], following their success in the realms of non-parametric approximation [30][31][32][33] and machine learning [34]. Loosely speaking, the premise of kernel methods is to non-linearly map the observed data from the input space to a high-dimensional feature space where linear operations are employed. For instance, [27] approaches manifold learning via the application of the celebrated principal component analysis (PCA) in a kernel-induced feature space, while [28] proposes a kernel graph-Laplacian-matrix extension of [24]. A challenging task for kernel-based schemes is to tune kernel parameters to "optimally" fit the measured data via cross-validation. Another obstacle for kernel approaches, especially in regression tasks, is the need for explicit pre-imaging steps from the feature space back to the original input one. Pre-imaging procedures are often computationally costly and rely on the existence of an invertible function that may not even exist for a specific choice of a kernel function, e.g., [27]. Lastly, kernel-based MRI schemes often leave the complex-valued nature of the measured dMRI data underused, e.g., [27].

C. Main Contributions
The proposed KRIM framework offers a novel non-parametric approach to kernel-based imputation by regression via the hypothesis that the measured data lie on or close to an unknown-to-the-user smooth manifold in a high-dimensional Euclidean space. Unlike mainstream DeepL approaches [14][15][16][17][18], KRIM requires no lengthy learning procedures on training data to identify latent geometric structures, perform regression on the measured data, and impute their missing values.
KRIM follows less typical manifold-learning paths than those which capitalize on graph Laplacian matrices to unveil data structures, e.g., [24,28]. More specifically, and aiming at computationally efficient solutions, KRIM first identifies a potentially small number of vectors, coined "landmark points," to describe concisely the data-cloud in its original input space; cf. Sec. III-A. Moving ahead, KRIM exploits the landmark points to extract geometric information from the data via parsimonious affine combinations ("linear patches") that mimic the concept of tangent spaces to smooth manifolds [35] and take place in functional approximation spaces, namely reproducing kernel Hilbert spaces (RKHSs) [30][31][32][33]; cf. Sec. III-B. The learned geometric information is incorporated into the design via a novel bi-linear regression approach, rooted in classic RKHS-based approximation principles, able to describe the measured data and impute their missing values, and to promote computationally efficient solutions by a dimensionality-reduction module [36]; cf. Secs. III-C and III-D. It is worth noticing here that the introduction of the bi-linear model enables KRIM to avoid pre-imaging steps, which map points of an RKHS back to points of the input space, unlike [27]. Moreover, to accommodate complex-valued data, the employed approximation space comprises multiple complex RKHSs [37,38], where multiplicity in kernels is introduced to surmount the usually cumbersome task of pin-pointing the "optimal" parameters of a single kernel through cross-validation, e.g., [39]; cf. Sec. III-C.
The imputation-by-regression task takes the form of an inverse problem which is solved by an algorithmic procedure with guaranteed convergence to a stationary point of the non-convex loss function; cf. Sec. III-E. To showcase the modular character of KRIM, as well as its flexibility in fitting specific application domains, this paper details the application of KRIM to dMRI, where reconstruction of high-resolution images from severely under-sampled dMRI data is desired; cf. Sec. III-F. Extensive numerical tests on synthetic and real dMRI data demonstrate the superior performance of KRIM, under several metrics and with a small computational footprint, over state-of-the-art approaches; cf. Sec. IV.
KRIM differentiates itself from its predecessor [40] not only in the usage of RKHSs, but also in the structure of the inverse dMRI problem, where the computationally intensive 2D discrete Fourier transform (DFT) is employed in the learning task only as a constraint and is not involved at all in the loss function, unlike [40]. This approach yields large savings in computation time when compared with [40], as the extensive numerical tests in Sec. IV show. A more detailed discussion on similarities and differences between KRIM and state-of-the-art approaches can be found in Sec. V.
Useful introductory information on the mathematical concepts which are used often in this manuscript, such as the proximal mapping and complex RKHSs, can be found in Appendix A. Preliminary results of this study were documented in [41].

II. Data Collection
Measured data are considered to be complex-valued. In dMRI, they take the form of a tensor defined on the (N_p × N_f × N_fr)-sized "(k,t)-space"; cf. Fig. 1a. Each "slice" Y_t of the tensor, with t ∈ {1, …, N_fr}, corresponds to the (N_p × N_f)-sized "k-space" measurements collected at the time instance t. Positive integers N_p and N_f represent the numbers of phase- and frequency-encoding lines, respectively [1], while N_k ≔ N_p N_f stands for the number of measurement values required for a fully scanned k-space frame. "Full-resolution" MR images can be obtained ∀t by X_t ≔ ℱ^{-1}(Y_t) (cf. Fig. 1b), where ℱ^{-1}(⋅) is the 2D inverse DFT [1]. In other words, the k-space can be viewed as the 2D "frequency domain" of MRI.
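The ℱ/ℱ^{-1} relation between k-space and image-domain frames can be illustrated with a minimal NumPy sketch; the 4 × 4 array is a toy stand-in for an (N_p × N_f)-sized frame, not the paper's data:

```python
import numpy as np

# A fully sampled k-space frame and its image-domain counterpart are
# linked by the 2D (inverse) DFT.
frame = np.ones((4, 4))          # toy image-domain frame X_t
kspace = np.fft.fft2(frame)      # F: image domain -> k-space
image = np.fft.ifft2(kspace)     # F^{-1}: k-space -> image domain
```

The round trip is lossless: applying ℱ to `image` recovers `kspace` exactly (up to floating-point precision).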
To facilitate data processing via linear-algebra operations, vec(⋅) stacks the columns of a matrix one below the other, so that each k-space frame yields the single N_k × 1 vector y_t ≔ vec(Y_t). Define now Y ≔ [y_1, …, y_{N_fr}] ∈ ℂ^{N_k × N_fr}.
[Fig. 1: (a) The (k,t)-space data; the marked N_p × ν × N_fr box corresponds to the location of the faithful data, often called "navigator/pilot" data in dMRI. (b) The N_p × N_f × N_fr image-domain data. (c) 1D Cartesian and (d) radial sampling trajectories in the k-space.]
Abusing notation to avoid clutter in symbols, ℱ(⋅) still represents the 2D DFT even when applied to vectors, i.e., ℱ[vec(Y_t)] ≔ vec[ℱ(Y_t)]. Due to physical limitations in MR scanners, data acquisition in dMRI is a slow process, with scanners usually unable to keep up with the dynamic nature of the scanned area (patient/organ motion) [2]. Moreover, to reduce as much as possible the discomfort of patients during scanning procedures, it is common practice to partially sample the (N_p × N_f)-sized k-space and to collect only a small number of data per frame, often much smaller than the N_k = N_p N_f values required for a full scan. To express this partial observation in mathematical terms, define the sampling mapping 𝒮(⋅) : ℂ^{N_k × N_fr} → ℂ^{N_k × N_fr} : Y ↦ 𝒮(Y), which operates entry-wise and nullifies an entry of Y whenever that entry is missing, while leaving the entry intact whenever it is measured. Consequently, under severe under-sampling of the k-space, the images ℱ^{-1}[𝒮(Y)] will suffer from temporal bleeding and motion blurring, adding to the deformations caused by the "thermal" noise which is omnipresent in any data-collection procedure. The missing-data problem permeates any data-analytic application, e.g., unreported observations in recommender systems [4] and dropout values in gene expressions [5].
Given the previous stringent sampling conditions, the MR-research community has proposed several strategies to sample the partially observed k-space, e.g., 1D Cartesian (cf. Fig. 1c) and radial trajectories (cf. Fig. 1d) [1,42]. A general trend is to acquire data heavily from the central (i.e., "low-frequency") k-space region, which carries contrast information, while acquiring data sparsely from the peripheral (i.e., "high-frequency") region, which reflects high-resolution image details. Cartesian sampling acquires data along horizontal/vertical lines, concentrated in the low-frequency area of the k-space and scarcely appearing in the high-frequency one (cf. Fig. 1c). Radial sampling acquires data along radial spokes, resulting in dense measurements in the low-frequency and sparse measurements in the high-frequency k-space area (cf. Fig. 1d).
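The 1D Cartesian strategy above can be sketched as a boolean k-space mask. The function name and the center/periphery line counts below are illustrative choices for a toy mask, not the exact sampling schedules of [1,42]:

```python
import numpy as np

def cartesian_mask(n_p, n_f, n_center, n_peripheral, rng=None):
    """Variable-density 1D Cartesian sampling: fully sample a block of
    central (low-frequency) phase-encoding lines and randomly pick a few
    peripheral (high-frequency) lines. Returns a boolean k-space mask."""
    rng = np.random.default_rng(rng)
    mask = np.zeros((n_p, n_f), dtype=bool)
    lo = n_p // 2 - n_center // 2
    mask[lo:lo + n_center, :] = True               # dense central block
    periphery = np.concatenate([np.arange(lo), np.arange(lo + n_center, n_p)])
    mask[rng.choice(periphery, n_peripheral, replace=False), :] = True
    return mask

mask = cartesian_mask(128, 128, n_center=12, n_peripheral=20, rng=0)
```

Applying such a mask entry-wise to a k-space frame realizes the sampling mapping of Sec. II for this trajectory.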
All imputation frameworks, including the proposed one, rely on a faithful subset of the measured data to impute the missing observations of the partially observed 𝒮(Y). In MRI, for example, the heavily sampled central region of the k-space, usually called "navigator data" [24,26,27,40], constitutes the faithful data. More specifically, in 1D Cartesian sampling, a few (ν ≪ N_f) lines, which are marked by the yellow-colored box in Fig. 1a and cross the central region of each frame, are vectorized to form the Ň × 1 vector y̌_t^f, where Ň ≔ N_p ν, with the superscript "f" stressing the fact that the vector comprises faithful data and the check accent (ˇ⋅) the fact that the dimension of these vectors is in general smaller than N_k. These vectors are finally stacked as columns to form the matrix Y̌^f ≔ [y̌_1^f, …, y̌_{N_fr}^f] ∈ ℂ^{Ň × N_fr}. Similarly, in radial sampling, a few radial spokes can be carefully selected by the user to form the faithful data. The availability of faithful data is also met in applications other than MRI; in single-cell genomics, for example, they may take the form of gene-expression data extracted from the bulk analysis of similar cell types [5].

A. Selecting Landmark/Representative Points
To promote parsimonious descriptions of the point-cloud {y̌_t^f}_{t=1}^{N_fr} as in [40], a subset {l_i}_{i=1}^{N_l}, coined landmark/representative points, is selected from {y̌_t^f}_{t=1}^{N_fr}, with N_l being smaller than N_fr. Randomly selecting the landmark points from {y̌_t^f}_{t=1}^{N_fr} is not the most appropriate strategy to capture the geometry of the faithful data. Although several strategies may be implemented to identify {l_i}_{i=1}^{N_l}, the greedy "maxmin" methodology [43] is adopted here. In short, maxmin is a recursive scheme, taking N_l steps to terminate, where at the ith step a landmark point is selected from {y̌_t^f}_{t=1}^{N_fr} that maximizes, over all un-selected points of {y̌_t^f}_{t=1}^{N_fr}, the minimum (Euclidean) distance to the landmark points which have already been selected up to the (i − 1)th step. In general, the larger N_fr is, the larger N_l should be to describe the point-cloud {y̌_t^f}_{t=1}^{N_fr}, justifying thus the non-parametric label of the proposed framework. The maxmin algorithm scores a computational complexity of order 𝒪(N_l N_fr), which is naturally heavier than the case of selecting the landmark points randomly.
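The greedy maxmin scheme admits a short sketch; this is a minimal NumPy rendition in which the seeding of the first landmark and any tie-breaking are arbitrary choices, not prescriptions of [43]:

```python
import numpy as np

def maxmin_landmarks(X, n_landmarks, seed_idx=0):
    """Greedy maxmin selection over the rows of X (one vectorized
    navigator frame per row): each step adds the point whose minimum
    Euclidean distance to the already-selected landmarks is largest."""
    chosen = [seed_idx]
    d = np.linalg.norm(X - X[seed_idx], axis=1)   # distance to nearest landmark
    for _ in range(n_landmarks - 1):
        nxt = int(np.argmax(d))                    # farthest remaining point
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```

Each step costs one pass of distance evaluations over the cloud, matching the 𝒪(N_l N_fr) order stated above.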
By Modeling Assumption 1, there exists a vector a_t such that (1) holds, where (⋅)^H represents the Hermitian (conjugate-transpose) operation on vectors/matrices; in other words, (1) can be expressed also as in (2), ∀t. The following modeling assumption postulates more structure within the data, moving beyond typical linear-subspace constraints such as Modeling Assumption 1. Modeling Assumption 2. Points {φ(l_i)}_{i=1}^{N_l}, as well as {φ(y̌_t^f)}_{t=1}^{N_fr}, lie on or close to a smooth manifold ℳ embedded in the RKHS ℋ (cf. Fig. 2).
Since the landmark points {l_i}_{i=1}^{N_l} have been identified to capture the geometry of the point-cloud {y̌_t^f}_{t=1}^{N_fr}, it is conceivable that a group of "closely related" points, as in Fig. 2, may collaborate to approximate the point φ(y̌_t^f). Here, "closely related" means proximity within ℳ, which may translate into similarities in data structures and attributes in the original input space (the k-space in dMRI).
[Fig. 2: Points {φ(l_i)}_{i=1}^3 lie on or close to the unknown-to-the-user manifold ℳ ⊂ ℋ and are affinely combined to approximate φ(y̌_t^f). All affine combinations of {φ(l_i)}_{i=1}^3 are represented by the gray-coloured plane, which mimics the concept of a tangent space to ℳ.]
Putting together Modeling Assumption 2 and the concept of tangent spaces to a smooth manifold [35], φ(y̌_t^f) may be affinely approximated by neighboring landmark points, as illustrated in Fig. 2 by the affine combination of {φ(l_i)}_{i=1}^3, with 1 denoting the N_l × 1 all-one vector in the affine constraint 1^⊺ a_t = 1. Moreover, driven by the desire for parsimonious data descriptions (only a few points participate in the approximation, as in Fig. 2) and the robustness that such descriptions entail, a_t is considered to be sparse. Gathering all vectors {a_t}_{t=1}^{N_fr} in the sparse matrix A ≔ [a_1, …, a_{N_fr}], and by defining the kernel matrix K ≔ φ(L)^H φ(L), with L gathering the landmark points as columns, the previous discussion and (2) yield the data-approximation model (3), where the complex-valued matrix E_1 accumulates all mis-modeling and approximation errors.

C. Multiple Kernels
Given the rich variety of reproducing kernels in the literature, e.g., Appendix A and [34], it becomes a tough task to select a specific kernel and fine-tune its parameters through cross-validation to fit the available data. A popular strategy to address this issue is to employ a dictionary of user-defined reproducing kernel functions {κ_q(⋅,⋅)}_{q=1}^Q, with their associated RKHSs {ℋ_q}_{q=1}^Q and feature mappings {φ_q(⋅)}_{q=1}^Q, and form the weighted kernel matrix ∑_{q=1}^Q ξ_q K^{(q)}, where each ξ_q ∈ ℝ_{>0} and K^{(q)} is the kernel matrix constructed according to the preceding discussion via κ_q(⋅,⋅); cf. [39]. The advantage of this scheme is that, instead of pin-pointing a single kernel and its parameters, a dictionary of kernel matrices {K^{(q)}}_{q=1}^Q is chosen and the weights {ξ_q}_{q=1}^Q are fine-tuned from the data themselves through some learning algorithm. Rather than simply plugging the previous weighted kernel matrix into (3), an approach with more degrees of freedom in the employed parameters is followed here via the following assumption.
where the complex-valued matrix E_2 accumulates all mis-modeling and approximation errors. It is worth noticing here that the proposed modeling framework employs kernel functions for data approximation without any need for explicit pre-imaging from the feature spaces {ℋ_q}_{q=1}^Q back to the input space, as in [27].
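A minimal sketch of the multiple-kernel construction follows: a dictionary of kernel matrices over the landmark points and their weighted combination. The kernel choices and the weights below are toy values picked for illustration; in KRIM the weights are learned from the data:

```python
import numpy as np

def gaussian_K(L, sigma):
    """Kernel matrix of a Gaussian kernel over landmark points (rows of L)."""
    sq = np.sum(np.abs(L[:, None, :] - L[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def polynomial_K(L, c, r):
    """Kernel matrix of a (complex) polynomial kernel (l_i^H l_j + c)^r."""
    return (L.conj() @ L.T + c) ** r

def weighted_K(kernel_mats, weights):
    """Weighted combination sum_q xi_q K^(q) of the dictionary matrices."""
    return sum(w * K for w, K in zip(weights, kernel_mats))

L = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # toy landmark points
dictionary = [gaussian_K(L, 0.5), gaussian_K(L, 2.0), polynomial_K(L, 1.0, 2)]
K_w = weighted_K(dictionary, [0.2, 0.3, 0.5])        # toy weights xi_q
```

Varying the bandwidths, offsets, and degrees in the dictionary sidesteps the single-kernel cross-validation described above.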

D. Dimensionality Reduction
Owing to the non-parametric character of the proposed scheme, the number N_l of landmark points {l_i}_{i=1}^{N_l}, and thus the size N_l × N_l of the kernel matrix K^{(q)} in (4), can become computationally unbearable if the number N_fr of the faithful data, from which {l_i}_{i=1}^{N_l} are extracted, gets large; cf. Sec. III-A. To meet the limitations of computational and time resources, and to impose a low-rank structure on (4), which appears to have a beneficial effect on dMRI [12,13], a dimensionality-reduction module is introduced here to map each K^{(q)} to a d × N_l complex-valued matrix Ǩ^{(q)}, with d ≪ N_l. To this end, the two-step dimensionality-reduction scheme of [36] is extended to the present RKHS case: (i) a matrix B is identified first to gather all the coefficients of the affine dependencies within the vectors {φ_q(l_i)}_{i=1}^{N_l}, with φ_q(⋅) denoting the feature mapping from the input space to the user-defined RKHS ℋ_q; and (ii) the low-dimensional Ǩ^{(q)}, which preserves the affine dependencies identified in B, is computed. More specifically, along the lines of the discussion that follows Modeling Assumption 2, a sparse matrix B ∈ ℂ^{N_l × N_l} is assumed s.t.
φ_q(L) ≈ φ_q(L) B, under the constraints 1^⊺ B = 1^⊺ and diag(B) = 0, where φ_q(L) is defined as in (17) via the feature mapping φ_q(⋅). The constraint 1^⊺ B = 1^⊺ attributes to the desire for affine combinations, while diag(B) = 0 excludes the trivial solution of the identity matrix for B. Further, to surmount computations on the possibly infinite-dimensional φ_q(L), the kernel trick turns out to be handy also here: multiplying the previous relation from the left by φ_q(L)^H yields K^{(q)} ≈ K^{(q)} B. In a more formal way, for a user-defined regularization weight in ℝ_{>0}, B is identified by (5), where ‖⋅‖_F stands for the Frobenius norm of a matrix, and ℜ(⋅) and ℑ(⋅) denote the real and imaginary parts of a complex number, respectively. Task (5) is an affinely constrained and composite convex minimization task; hence, [45] is used to solve it, by virtue of the fast convergence of [45] and its flexibility in dealing with affine constraints.
After B is identified by (5), Ǩ^{(q)} is computed by (6), where the constraint on Ǩ^{(q)} is employed to avoid the trivial all-zero solution. It turns out that the solution to (6) is the orthogonal d × N_l matrix [u_1, …, u_d]^H, where {u_j}_{j=1}^d are eigenvectors corresponding to the d smallest eigenvalues of the positive-semidefinite matrix (I − B)(I − B)^H. In the context of data-recovery problems, there is a need to unfold the compressed matrix Ǩ^{(q)} back to K^{(q)}, as, e.g., in PCA-driven frameworks. This need calls for the following assumption, where the subsequent matrix D_q acts as a "decompression" operator which maps the d × N_l matrix Ǩ^{(q)} back to the N_l × N_l matrix K^{(q)}.

Modeling Assumption 4. There exist matrices {D_q}_{q=1}^Q such that K^{(q)} ≈ D_q Ǩ^{(q)}, ∀q ∈ {1, …, Q}, where error terms gather the mis-modeling and approximation errors.
Via Modeling Assumptions 3 and 4, it can be verified that (7) holds, where the complex-valued N_k × N_fr matrix E accumulates all approximation and mis-modeling errors. The KRIM data-approximation model in (7) employs multiple kernels; in contrast, a single-kernel model requires effort to fine-tune the parameters of the single kernel function to "optimal" values via cross-validation.
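The compression step of Sec. III-D can be sketched as follows, assuming B has already been identified by (5). The function name is illustrative; the eigen-decomposition mirrors the closed-form solution of (6) stated above:

```python
import numpy as np

def compress(B, d):
    """Return the d x N_l matrix with orthonormal rows that best preserves
    the affine dependencies encoded by B: its rows are (Hermitians of)
    eigenvectors of (I - B)(I - B)^H for the d smallest eigenvalues."""
    n = B.shape[0]
    I = np.eye(n)
    M = (I - B) @ (I - B).conj().T     # positive semidefinite
    w, V = np.linalg.eigh(M)           # eigenvalues in ascending order
    return V[:, :d].conj().T           # rows: u_1^H, ..., u_d^H
```

By construction, the returned matrix achieves a smaller residual under right-multiplication by B than any choice built from the remaining eigenvectors, which is the sense in which the affine dependencies are preserved.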

E. The Inverse Problem
To simplify (7), define the complex-valued matrix D̃, which gathers the decompression operators, and the matrix B̃, which stacks the compressed kernel matrices together with the coefficients of (7). Then, (7) takes the bi-linear form of (9), where "bi-linearity" refers to the behavior that if D̃ (or B̃) is fixed to a specific value, then (9) expresses Y as a linear (more precisely, affine) model of the variable B̃ (or D̃).
Having established (9), the inverse problem (10) is proposed to identify (D̃, B̃) for some user-defined parameters in ℝ_{>0}, where e_j denotes the jth column of the identity matrix. The recovery task comprises the following terms and constraints: (i) the data-fit term T1; (ii) the sparsity-imposing term T2, along the lines of Modeling Assumption 3; (iii) constraint (10a), to prevent unbounded solutions in D̃ due to the bi-linear structure of the model; and (iv) constraint (10b), to introduce affine constraints according to Modeling Assumption 3. A solution (D̃*, B̃*) to (10) smoothes/denoises the measured values and imputes the missing ones in Y ≈ Ŷ, the product of the recovered bi-linear factors.
An iterative algorithm that computes such a solution pair in the context of dMRI is provided in the following subsection.

F. The dMRI Case Study
With ℱ(⋅) defined in Sec. II, consider the image-domain counterpart ℱ^{-1}(D̃) of the complex-valued matrix D̃. By virtue of the linearity of ℱ(⋅), the general KRIM model in (9) takes the form (11) in the present dMRI context. Scanners in dMRI usually observe regions that demonstrate a periodic movement, e.g., cardiac pumping, while being surrounded by a static background. To exploit this prior knowledge for better data reconstruction, let the 1 × N_fr row vector x_j^⊺ denote the time profile of the jth pixel, j ∈ {1, …, N_k}, and let ℱ_t(⋅) stand for the 1D DFT along the time axis. It is then conceivable that ℱ_t(x_j^⊺) becomes an (almost) sparse vector whenever the time profile x_j^⊺ records a periodic movement.
This is also the case whenever the time profile of the jth pixel observes a static background. In other words, ℱ_t(X) ≔ [ℱ_t(x_1^⊺); …; ℱ_t(x_{N_k}^⊺)] can be approximated by a sparse matrix. Infusing (11) and the previous arguments into the general formulation (10), the following dMRI inverse problem is obtained: for user-defined parameters λ_1, λ_2, λ_3 ∈ ℝ_{>0} (among others), solve the minimization task (12) over the tuple of variables, where term T3 in the loss function (12a) is employed to promote the potential sparsity of ℱ_t(X), and (12b) is introduced for ℱ(X) to be consistent with the measured data in 𝒮(Y).
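The sparsity premise behind term T3 can be checked numerically: the 1D temporal DFT of a pixel profile consisting of a static background plus a single periodic component occupies only a few frequency bins. The amplitudes and the harmonic index below are arbitrary toy choices:

```python
import numpy as np

# Time profile of a pixel seeing a static background plus a periodic
# (cardiac-like) component.
n_fr = 256
t = np.arange(n_fr)
profile = 0.8 + 0.3 * np.cos(2 * np.pi * 8 * t / n_fr)
spectrum = np.fft.fft(profile) / n_fr          # 1D temporal DFT, normalized
n_active = int(np.sum(np.abs(spectrum) > 1e-8))
# Only the DC bin (static background) and the +/-8 harmonics survive,
# so the temporal spectrum is 3-sparse out of 256 bins.
```

A profile with several harmonics remains almost sparse, which is exactly the structure that T3 rewards.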
Other than the introduction of kernel functions, (12) is also structurally different from the inverse problem in [40], since the computationally intensive 2D DFT ℱ(⋅) is not employed at all in the loss function (12a), unlike [40], but is only applied in the constraint (12b). Such a strategy yields largely reduced computation times relative to [40], as the extensive numerical tests in Sec. IV demonstrate.

G. Solving the dMRI inverse problem
This section discusses the details of Alg. 1 which solves (12). Even though Alg. 1 addresses the specific dMRI recovery task, its backbone iterations can be applied, with appropriate tweaks, also to other application domains.
The successive-convex-approximation framework of [46] is applied to solve (12) via steps 4-10 in Alg. 1. Convergence to a stationary solution of (12) is guaranteed [46]. Step 7 in Alg. 1 focuses on solving the convex minimization sub-tasks (13), with user-defined parameters in ℝ_{>0}, to update the estimates of the variables. Tasks (13a) and (13b) are solved by [45], as described in Alg. 2. Notice that, for Alg. 2 to operate, only first-order information (gradients) and proximal mappings, like soft-thresholding and projections, are required; cf. Appendix B. Moreover, the unique solution to (13c) can be obtained via (14), where 𝒮^c(⋅) operates on a matrix in a complementary way to the sampling operator 𝒮(⋅), i.e., 𝒮^c(⋅) nullifies the entries of the matrix that are preserved by 𝒮(⋅), leaving the remaining entries intact. A detailed derivation of (14) is provided in Appendix B-C. The updates of the variables can be carried out independently of each other and computed in parallel, offering room for efficient implementations with short computation times. Furthermore, tasks (13) can also be solved inexactly at every iteration, with increasing precision as described in [46], offering space for further reduction in computation times.
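The complementary action of the sampling operator and its complement, together with a data-consistency update in the spirit of (14), can be sketched as follows. Names and the combination rule are illustrative; the exact update (14) involves the algorithm's iterates:

```python
import numpy as np

def S(A, mask):
    """Sampling operator: keep measured entries of A, nullify the rest."""
    return np.where(mask, A, 0)

def Sc(A, mask):
    """Complementary operator: nullify measured entries, keep the rest."""
    return np.where(mask, 0, A)

def consistency_update(Y, estimate, mask):
    """Measured entries come from the data Y; missing entries are filled
    in from the current model estimate."""
    return S(Y, mask) + Sc(estimate, mask)
```

By construction, S(A) + Sc(A) = A for any A, so the two operators split a matrix into its measured and missing parts.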

B. Performance Metrics
The quality of reconstructions is quantified via the following performance metrics.
(i) The normalized root mean square error (NRMSE): NRMSE ≔ ‖X̂ − X‖_F / ‖X‖_F, where X denotes the high-fidelity original image data, generated from a fully measured (k,t)-space (cf. Sec. II), and X̂ represents an estimate of X computed by the competing reconstruction schemes from the under-sampled (k,t)-space. NRMSE highlights the pixel-by-pixel accuracy of a reconstruction scheme. Low NRMSE values are desired. (ii) The high-frequency error norm (HFEN): HFEN ≔ ‖LoG(X̂) − LoG(X)‖_F / ‖LoG(X)‖_F, where LoG is a rotationally symmetric Laplacian-of-Gaussian filter. HFEN explores the effectiveness of a reconstruction algorithm along the edges of an image. As in [48], a kernel filter of size 15 × 15 with a standard deviation of 1.5 pixels is employed. A low HFEN score is an indicator of how close the reconstructed image is to the ground-truth one in terms of sharpness.

(iii) The sharpness measures M1 (intensity-variance based)
and M2 (energy of the image gradient) [49, (43) and (46)]. These gradient-based measures focus on the high-frequency (edge-characteristic) variations in the images and quantify peaks. Large M1 and M2 values are desired. (iv) The structural similarity index measure (SSIM) [50] investigates local patterns in pixel intensities after normalizing for luminance and contrast, with values within [0, 1]. SSIM values close to 1 are desired, since they indicate similarity between the reconstructed data and the ground-truth image. Among the previous metrics, NRMSE, HFEN, and SSIM require knowledge of the ground-truth image, unlike the M1 and M2 measures.
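The ground-truth-based metrics NRMSE and HFEN can be sketched in NumPy. The LoG filter is hand-rolled here and applied via circular FFT convolution, an approximation of the exact filtering of [48]:

```python
import numpy as np

def nrmse(x_hat, x):
    """Normalized root mean square error ||x_hat - x||_F / ||x||_F."""
    return np.linalg.norm(x_hat - x) / np.linalg.norm(x)

def log_kernel(size=15, sigma=1.5):
    """15x15 Laplacian-of-Gaussian filter with sigma = 1.5 pixels."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    g = np.exp(-r2 / (2.0 * sigma ** 2))
    k = (r2 - 2.0 * sigma ** 2) / sigma ** 4 * g
    return k - k.mean()                 # zero sum: flat areas map to zero

def hfen(x_hat, x):
    """HFEN: NRMSE computed on LoG-filtered images (edge content only)."""
    k = log_kernel()
    conv = lambda im: np.real(np.fft.ifft2(np.fft.fft2(im) * np.fft.fft2(k, im.shape)))
    return np.linalg.norm(conv(x_hat) - conv(x)) / np.linalg.norm(conv(x))
```

Both metrics are scale-normalized, so doubling the estimate's error doubles the score regardless of image intensity range.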

C. Experimental Setup
KRIM[M] and KRIM[S] (Alg. 1) were implemented in Julia (ver. 1.0.8) [51], while BiLMDM [40,41] was implemented in MATLAB [52]. MATLAB implementations of SToRM [53] and FRIST [54] are publicly available, whereas implementations of PS-Sparse and KLR were made available by the authors of [12] and [27], respectively. Experiments were carried out on a 12-core Intel(R) 2.40GHz Linux-based system with 48GB RAM. The hyperparameters of the frameworks with which the proposed schemes are compared were tuned to produce reconstructions with the lowest NRMSE score. KRIM's Julia code will be made available online via GitHub as soon as the manuscript is accepted for publication.
Among the three datasets of Sec. IV-A, ground-truth data are available only for the first two. These datasets are retrospectively under-sampled in the (k,t) domain under different acceleration/sampling rates, defined as N_k N_fr /(# of acquired samples). Both the 1D Cartesian (Fig. 1c) and the radial (Fig. 1d) under-sampling schemes were employed. Given the stringent space limitations of this journal, only the results of the 1D Cartesian scheme are demonstrated for the first dataset, and only the results of radial under-sampling for the second one. Nevertheless, similar behavior was also verified for the other under-sampling scenarios. The third dataset of Sec. IV-A is prospectively under-sampled by default, since it comprises readings from an MR scanner.

D. Parameter Tuning
Tasks (12) and (13) contain parameters which relate to different aspects of the implementation; for example, the data-modeling parameters (λ_1, λ_2, λ_3, N_l, d, among others) are tuned for an optimal data fit, while the optimization parameters (step sizes, among others) govern the convergence of the sequence of estimates. Those parameters need tuning for Algs. 1 and 2 to reach "desired performance." To offer a concrete discussion about "desired performance," parameters are carefully tuned for the first two datasets of Sec. IV-A, where ground-truth dMRI data are available. The following discussion identifies "optimal ranges" of parameter values based on the behavior of the NRMSE metric as a function of the parameters: "optimal" parameter values are defined as those which yield low NRMSE while, at the same time, the achieved low NRMSE value shows small fluctuations as the parameters change.
Trial and error revealed the following: parameter λ_1 penalizes the sparsity of the coefficients to control the size of the neighborhood that participates in the approximation of a time-series frame. The higher the value of λ_1, the smaller the neighborhood becomes, allowing only frames with very similar characteristics and structure to participate in data approximations. It is worth noting that this neighborhood is made of landmark points and, hence, the parameter N_l (total number of landmark points) plays an important role in defining the size of neighborhoods. N_l has to be made large enough so that there is enough representation from a group of frames with similar phase and characteristics, but not excessively large, so as to keep computation times manageable. For the datasets used, it was observed that N_l ∈ N_fr ⋅ [15%, 25%] is adequate. An increase of N_l beyond those values did not show any significant improvement in performance.
With regard to the variable D̃, the bounding constraint in (10a) usually leads to "optimal" estimates of D̃ in the case where the k-space is normalized so that the pixel-intensity values lie within [0, 1]. The value of d imposes a low-rank structure on the model, and it was observed to depend on the dynamic nature of the image series. For the parameters (λ_2, λ_3), which exploit the periodicity along the time series, the higher the ratio λ_3/λ_2 is, the more temporal bleeding is observed. Hence, a value low enough to avoid temporal bleeding but high enough to exploit the periodicity along the temporal axis is desired. The step sizes, which affect the optimization procedures, were also tuned by trial and error. Notice that KRIM[S]'s "optimal" polynomial kernel is included in KRIM[M]'s kernel dictionary. The reason for adopting this strategy is to build on the effort spent on KRIM[S] to identify the kernel parameters, and to push for better data-recovery performance than the case of the hand-tuned kernel in KRIM[S] by allowing the data themselves to fine-tune kernel contributions in the data-approximation formula of (7). Indeed, such a performance boost was observed throughout all of the numerical tests. Nevertheless, to provide a complete set of numerical tests, KRIM[M] was also employed with a kernel dictionary that does not contain KRIM[S]'s "optimal" polynomial kernel.

E. MRXCAT Cardiac Phantom
The MRXCAT phantom is a synthetically generated cardiac cine dMRI dataset built with the extended cardiac-torso (XCAT) framework [47]. The phantom emulates breath-hold cardiac cine data of spatial size 408 × 408 with 360 frames. The data have a spatial resolution of 1.56 × 1.56 mm² over a field of view (FOV) of 400 × 400 mm². The generated phantom data contain 15 cardiac cycles and 24 cardiac phases, and are characterized by limited inter-frame variations and a periodic nature along the temporal direction. The KRIM parameters were set to 4 and 100, respectively. It can be seen from Fig. 3a that KRIM consistently outperforms the competing state-of-the-art schemes across different acceleration rates. Spatial and temporal reconstructions, generated from (k,t)-space measurements available at an acceleration rate of 8x, are exhibited for comparison in Figs. 4 and 5, respectively, alongside the ground-truth dMRI data for reference. Examination of these images shows that KRIM produces images which are almost distortion-free (unlike FRIST), artifact-free (unlike PS-Sparse), and alias-free (unlike KLR and SToRM); these deformities are highlighted in Fig. 4. It is also worth noticing that, in terms of contrast, the KRIM reconstructions provide the best results, closely resembling the ground-truth dMRI data, and the images reconstructed by KRIM look sharper than those of the other schemes. Average computation times are reported in Table II; notice the large savings in computation time achieved by KRIM with regard to its predecessor BiLMDM [40].

F. In-Vivo Human Cardiac Phantom
Real human cardiac cine MR data [12], of size 200 × 256, encompassing an FOV of 273 × 300 mm² with a spatial resolution of 1.36 × 1.36 mm², are considered. The data were acquired during a single breath hold. Multiple time warps (introducing temporal variations) and quasi-periodic spatial deformations [55] (to model respiration) were then used to generate a temporal sequence with 256 frames. The KRIM parameters were tuned to 16 and 80, respectively. Fig. 3b shows that KRIM produces images with the lowest NRMSE among all competing methods for all acceleration rates within the range 6x-35x; Table III summarizes the quantitative results, which are supported by Figs. 6 and 7. Those images were generated from (k,t)-space data which were retrospectively under-sampled, emulating a radial sampling trajectory at an acceleration rate of 12x. As indicated by the HFEN, M1, and M2 measures, the cardiac images generated via KRIM are sharper and exhibit no blurring, in contrast to PS-Sparse and SToRM (as pointed out by the arrows). There are no patchy distortions or evidence of aliasing in the images recovered by KRIM, unlike those of FRIST and KLR. The temporal cross-sections of the reconstructions provide better insight, revealing temporal blurring (marked with arrows) in PS-Sparse and the presence of distortions and artifacts in KLR and FRIST; KRIM, on the other hand, does not show any signs of distortions, artifacts, or temporal blurring. To conclude, KRIM appears to score the best dMRI-reconstruction performance among all competing methods. Average computation times are reported in Table IV.

G. Acquired dMRI Cardiac Scan
Measurements were acquired via a FLASH sequence [56] from a volunteer breathing normally under the following acquisition parameters: TR/TE = 5.8/4 ms, FOV of 284 × 350 mm², and spatial resolution of 1.8 × 1.8 mm². A 12-channel scanner was used to continuously measure 4,500 phase-encoded lines under 1D Cartesian sampling. The phase-encoded lines were further grouped into groups of 15 (containing 5 navigator lines each) to obtain 300 frames and a data matrix of size 156 × 192 × 300 per channel; this corresponds to an acceleration/sampling rate of 10x. The reconstructed multi-channel data were combined via the sum-of-squares strategy for all reconstruction frameworks.
Due to the unavailability of ground-truth images, comparisons are qualitative in nature, as exhibited in Fig. 8. Reconstructions via FRIST show severe distortions (highlighted by the box) and suffer from aliasing (evident in the temporal cross-section), while those via SToRM show signs of blurring along the edges (in the area marked by the box). In the PS-Sparse reconstructions, some image structures appear to be missing (marked by arrows) which are clearly present in the images reconstructed by the other schemes; moreover, significant motion blurring occurs, as exhibited in the temporal cross-section. On the other hand, images generated via KRIM look sharper, with little or no sign of distortion and aliasing, unlike KRIM's competitors. Even though no significant difference could be observed when compared with BiLMDM, the M1 and M2 measures (Table V) quantify the improvement in sharpness of KRIM over BiLMDM. The average computation times for the reconstruction of the multi-coil image series are provided in Table VI.

V. Discussion

This section accentuates the differences of KRIM from the state-of-the-art dMRI schemes PS-Sparse [12], SToRM [24] and its variants [28,29], KLR [27], as well as BiLMDM [40].
Matrix-factorization schemes, similar to (11), have also been used in PS-Sparse [12] and KLR [27]. Nevertheless, KRIM differentiates itself from those two methods in several aspects. The PCA-based PS-Sparse [12], for example, searches for local affine/linear patches to manifolds, as in Fig. 2, and determines its temporal basis via the SVD of the faithful or navigator/pilot data, which can indeed be effective if the data exhibit affine/linear structures within the raw-data input space; this, however, is often not true. In contrast, KRIM searches for affine/linear patterns in an RKHS-feature space, by the same token that celebrated kernel-based methods apply standard linear-estimation arguments in the RKHS-feature space, e.g., [34]. Sec. IV provides empirical evidence that KRIM offers better data approximation than PS-Sparse in terms of performance metrics, with no signs of temporal bleeding in images. Moreover, there is no need for the matrix factors in (11) to be orthogonal, unlike in the PCA-driven PS-Sparse.
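For intuition, the PCA-driven temporal-basis step attributed to PS-Sparse above can be sketched as follows; the navigator-data dimensions and the model order are illustrative placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical navigator/pilot data: rows index k-space locations,
# columns index time frames (complex-valued, as in (k,t)-space).
nav = rng.standard_normal((32, 300)) + 1j * rng.standard_normal((32, 300))

r = 4  # low model order (rank)
# PS-Sparse-style temporal basis: leading right singular vectors.
_, _, vh = np.linalg.svd(nav, full_matrices=False)
V_r = vh[:r, :]  # r x n_frames temporal basis

# Orthonormality of the PCA-driven basis -- a constraint that KRIM's
# bi-linear factors in (11) are NOT required to satisfy.
print(np.allclose(V_r @ V_r.conj().T, np.eye(r)))  # True
```

The sketch highlights the contrast drawn in the text: the SVD yields an orthonormal temporal basis tied to linear structure in the raw-data space, whereas KRIM drops the orthogonality requirement and works in an RKHS-feature space.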
On the other hand, the kernel-(K)PCA-driven KLR [27] applies PCA in RKHS-feature spaces to capture potential nonlinearities in respiratory and cardiac motions. Notwithstanding, KLR requires a computationally expensive pre-imaging step to map points from the RKHS-feature space back to the original input one. KRIM's bi-linear formulation in (11) captures the latent point-cloud geometry in the RKHS-feature space without any need for pre-imaging steps. KRIM also employs complex-valued kernels to take into consideration the information carried by both the real and imaginary parts of the complex-valued (k,t)-space measurements [38], as opposed to KLR, which uses only real-valued kernel functions. Furthermore, the single-kernel approaches KLR [27] and [41], as well as KRIM[S], require "optimal" tuning of the parameters of the kernel function, which can be a daunting task given the wide range of kernels available in the literature [34]. KRIM[M] escapes this trap by using a kernel dictionary and by letting the data decide on the "optimal" kernel combinations through the bi-linear modeling in (7). This not only eliminates the need for pin-pointing the optimal kernel, but also yields better results than those of KLR and KRIM[S], as summarized in Tables I and III. The difference between KRIM and KLR is also evident in Fig. 3, where the KLR reconstructions show a sharp increase in NRMSE values, due to aliasing, as the acceleration rate increases.
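The kernel-dictionary idea can be illustrated with a minimal sketch; the helper names, the two dictionary members, and their parameters are hypothetical, and in KRIM[M] the combination weights are learned from the data rather than supplied by hand:

```python
import numpy as np

def poly_kernel(x, y, d=2, c=1.0):
    # Complex-valued polynomial kernel: (y^H x + c)^d.
    return (np.vdot(y, x) + c) ** d

def gauss_kernel(x, y, sigma=1.0):
    # Gaussian kernel on complex vectors via the Euclidean norm.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

# A small kernel dictionary; KRIM[M] may also include the hand-tuned
# "optimal" polynomial kernel of KRIM[S] as one of its members.
kernel_dict = [poly_kernel, gauss_kernel]

def combined_kernel(x, y, weights):
    # Weighted combination of dictionary kernels; the weights stand in
    # for the data-driven contributions fine-tuned by the model of (7).
    return sum(w * k(x, y) for w, k in zip(weights, kernel_dict))
```

Setting one weight to 1 and the rest to 0 recovers a single-kernel (KRIM[S]-style) evaluation, which is how the dictionary subsumes the hand-tuned case.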
SToRM [24,28,29] enforces "smoothness" on the underlying low-dimensional manifold by penalizing the loss of the recovery problem with several norms of a graph-Laplacian matrix, amounting to a "global" smoothness constraint over the data point-cloud. In contrast, to avoid losing local structural information and to offer as much flexibility as possible in data approximations, KRIM enforces smoothness locally over the manifold via point-cloud neighborhoods and affine approximations within those neighborhoods, mimicking the concept of tangent spaces. The "union" of those affine combinations is modeled by the sparse matrix factor in (11). KRIM performs better than SToRM along edges (as verified by the HFEN values in Tables I and III), and produces sharper images with almost no aliasing effects and temporal bleeding.
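A minimal sketch of the "global" graph-Laplacian smoothness penalty described above (the function name and the quadratic form tr(X L X^H) are one common instance of such a penalty, chosen here for illustration):

```python
import numpy as np

def laplacian_penalty(X, W):
    # SToRM-style global smoothness penalty tr(X L X^H),
    # where L = D - W is the graph Laplacian of symmetric
    # frame-similarity weights W (frames = columns of X).
    L = np.diag(W.sum(axis=1)) - W
    return np.real(np.trace(X @ L @ X.conj().T))
```

Since tr(X L X^H) equals the weighted sum of squared differences between connected frames, the penalty vanishes when all similar frames are identical; it thus smooths the whole point-cloud at once, which is exactly the "global" behavior KRIM replaces with local, neighborhood-wise affine approximations.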
The transform-learning-based FRIST scheme performs image reconstruction on a frame-by-frame basis, ignoring the temporal information of the dMRI data on which KRIM capitalizes. In addition to discarding temporal information, FRIST relies on dividing images into patches prior to reconstruction. This does help with exploiting the structural similarities in the image, but, as the acceleration rate increases, the aliasing effect severely corrupts the image-domain data, resulting in patchy distortions and blurring. This is verified in Figs. 4 and 3 by the deterioration of NRMSE with increasing acceleration rates. KRIM has also exhibited an improvement in performance over its predecessor BiLMDM [40]. This comes as a result of reaping the benefits of similarities and bi-linear modeling by operating in high-dimensional RKHS-feature spaces. Numerical evidence of the improvements is summarized in Tables I, III, and VI. It is also worth noticing that the formulation and algorithmic design place KRIM well ahead of BiLMDM in terms of computation times, paving the way for more efficient implementations in future studies.

VI. Conclusions
This manuscript laid down the foundations of a novel imputation-by-regression scheme, based on classic kernel-approximation arguments, for data which lie on or close to an unknown-to-the-user manifold embedded in a high-dimensional Euclidean space. To this end, a bi-linear data-approximation model was introduced which employs multiple kernel functions, bears the advantage of not requiring tuning for an optimal kernel, avoids computationally expensive pre-imaging steps, accommodates complex-valued data, and exhibits a low-rank structure to promote computationally efficient solutions. The proposed framework, coined kernel regression imputation in manifolds (KRIM), facilitates data reconstruction from partially observed data with missing measurements, without the need for a fully observed training dataset. This work also provided evidence of KRIM's rich potential via a dMRI setting, where high-resolution medical images need to be recovered from severely under-sampled data. The manuscript has also put forth elaborate quantitative and qualitative results which highlight the superior performance of KRIM over state-of-the-art schemes under several performance metrics and with a small computational footprint.
The spectral norm ‖A‖₂ of a matrix A ∈ ℂ^(m×n) is defined as ‖A‖₂ ≔ √(λmax(AᴴA)), where λmax(⋅) stands for the maximum eigenvalue of a Hermitian matrix. In the case where A is Hermitian and positive semidefinite, ‖A‖₂ = λmax(A).
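The definition of the spectral norm can be checked numerically with a short sketch (the function name is illustrative):

```python
import numpy as np

def spectral_norm(A):
    # ||A||_2 = sqrt(lambda_max(A^H A)), i.e., the largest singular value.
    return np.sqrt(np.max(np.linalg.eigvalsh(A.conj().T @ A)))

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))

# Agrees with NumPy's matrix 2-norm (largest singular value).
print(np.isclose(spectral_norm(A), np.linalg.norm(A, 2)))  # True

# Hermitian positive-semidefinite case: ||B||_2 = lambda_max(B).
B = A.conj().T @ A
print(np.isclose(spectral_norm(B), np.max(np.linalg.eigvalsh(B))))  # True
```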