Information in Missing Patterns: Enhancing Prediction Accuracy in Weighted Linear Regression with Missing Data Using Soft Clustering

Abstract-The linear system with missing information is investigated in this paper. New methods are introduced to improve the Mean Squared Error (MSE) on the test set in comparison to state-of-the-art methods, through appropriate tuning of the bias-variance trade-off. The concept is to cluster the data and adapt the learning model to each cluster. We thereby introduce a controlled bias into the problem and exploit it to enhance learning capability on instances in a specific neighborhood. To deal with missing information, we propose a novel algorithm, "Missing-SCOP", based on the SCOP-KMEANS algorithm introduced by Wagstaff et al., which uses the missing pattern of the dataset to construct a soft-constraint matrix and to cluster in the missing scenario. It is shown that the controlled over-fitting suggested by our algorithm improves prediction accuracy in various cases. Numerical experiments confirm the efficacy of the proposed algorithm in enhancing prediction accuracy.

I. INTRODUCTION
Recently, there has been a growing interest in enhancing prediction accuracy in machine learning. Although previous studies indicate that clustering may improve accuracy [23], it comes at the cost of a smaller training set and discarded data, since it assigns hard weights to the subjects (i.e., each member has a weight parameter w ∈ {0, 1}). In this paper, a novel weighted method of classification, based on weighted ensemble learning [16], is presented. We call this method Soft Weighted Prediction (SWP); it weighs each cluster [1] obtained from the training set (possibly each training example, if each forms a cluster by itself) based on its Euclidean distance from each test set subject.
Missing information has recently gained importance owing to its wide range of practical applications, such as recommendation systems [6], [17], quantized rating systems and quantized data analysis [11], predictive sparse models with missing information [7], [10], and semi-supervised learning with missing information [9]. Several clustering methods have been developed in the literature to enhance prediction and regression accuracy, and several recent studies address constrained clustering [15]. Hard and soft constrained clustering algorithms modify the K-means algorithm to take into account side information about the connectivity graph of the instances. The soft constrained clustering (SCC) concept, introduced by Wagstaff [25] and known as KSCOP, is the baseline of our work. In this paper, we aim to extend the concept of SCC to prediction scenarios with missing information.
Data loss or idleness can be regarded as a practical source of missing parameters in medical prediction problems. In such cases missing values are clearly not randomly distributed; e.g., patients suffering from the same disease are more likely to be recorded with the same blood factors and symptoms. Thus, patients with similar missing factors tend to be clustered together and to be reported with correlated medical diagnoses [2], [18]. This shared lack of recorded parameters (jointly missing parameters for subjects) is used as a constraint parameter in soft clustering. Prediction for medical data with missing information is treated in [20].

II. MODEL ASSUMPTIONS
In matrix representation, linear models are represented as follows:

$$Y = X\beta + \varepsilon, \tag{1}$$

where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and $X$ is the oracle instance-feature matrix. In practice, however, $X$ is only partially observed. Mathematically speaking, the observed matrix is obtained by applying a random mask to the original data matrix. The mask contains zeros at the entries that are missing or lost, i.e., we have access to a data matrix $\tilde{X} = X \odot M$, where $M$ is the oracle mask and $\odot$ denotes the Hadamard product. $Y$ is the observed measurement vector and $\beta$ is the vector of parameters (weight coefficients).
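As an illustration of the model above, the following minimal sketch draws a small linear system and applies an element-wise mask to it. The sizes, the noise level `sigma`, the 30% missing rate, and the variable names are assumptions chosen for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 6, 4                              # instances x features (illustrative sizes)
X = rng.normal(size=(m, n))              # oracle instance-feature matrix
beta = rng.normal(size=n)                # true coefficient vector
sigma = 0.1                              # assumed noise standard deviation
Y = X @ beta + sigma * rng.normal(size=m)

# Oracle mask: 1 where an entry is observed, 0 where it is missing.
M = (rng.random((m, n)) > 0.3).astype(float)
X_obs = X * M                            # Hadamard product: missing entries become 0
```

Note that a zero in `X_obs` is ambiguous on its own (missing or truly zero), which is why the mask `M` is kept alongside the data.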

A. Mathematical Approaches in Extracting the True Model (Imputing Coefficients)
The coefficient vector β can be estimated from X and Y; we denote the estimate by b. Because b is not unique in many cases, several regularization methods impose assumed constraints on β, such as sparsity, to select an estimator. However, our main concern is accurate prediction of the vector Y, not of the coefficients. Since Lasso restrains the desired over-fitting, the least-squares (LS) solution is used for each cluster in the controlled-bias setting.

1) Lasso Solution:
Assuming β is a sparse vector, the desired b is obtained from the optimization problem (2):

$$b = \arg\min_{b} \; \|Y - Xb\|_2^2 + \lambda \|b\|_1, \tag{2}$$

where the parameter λ controls the sparsity of the estimated coefficient vector, which is equivalent to balancing the bias-variance trade-off. Letting λ = 0, (2) turns into the ordinary least-squares problem. As λ approaches zero, the solution has less bias and more variance. Such a solution depends on the data (the training set), so any variation between the test and training sets leads to inferior estimation and a larger MSE. Conversely, as λ approaches infinity, b is constrained to be sparse; the effect of training-set variation decreases and the estimator's data dependency vanishes.

The least-squares solution is a particular case of Lasso (λ = 0) and can be obtained from the normal equations:

$$b = (X^T X)^{-1} X^T Y.$$

Taking the expectation yields

$$E[b] = (X^T X)^{-1} X^T E[Y] = (X^T X)^{-1} X^T X \beta = \beta,$$

knowing that $E[\varepsilon] = 0$ and hence $E[Y] = X\beta$. Thus, unlike Lasso, the least-squares solution is an unbiased estimator.
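The two limiting cases of λ can be checked numerically. The sketch below assumes the common Lasso objective ‖Y − Xb‖₂² + λ‖b‖₁ and solves it with proximal gradient descent (ISTA), a solver chosen here for brevity and not taken from the paper; `lasso_ista`, `soft_threshold`, and the synthetic data are illustrative names and values.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage toward zero by t (proximal map of t * ||.||_1)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimise ||Y - X b||_2^2 + lam * ||b||_1 by proximal gradient (ISTA)."""
    # Step size 1/L, where L = 2 * sigma_max(X)^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ b - Y)
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.0])       # sparse ground truth (assumed)
Y = X @ beta + 0.1 * rng.normal(size=200)

b_ls = np.linalg.solve(X.T @ X, X.T @ Y)          # normal equations
b_zero = lasso_ista(X, Y, lam=0.0)                # lambda = 0: should match least squares
b_big = lasso_ista(X, Y, lam=1e4)                 # very large lambda: all coefficients shrink to 0
```

With λ = 0 the iteration reduces to plain gradient descent on the least-squares loss, so `b_zero` agrees with `b_ls`; with a sufficiently large λ every coefficient is thresholded to zero.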

B. Controlled Overfitting
Overfitting arises when the test set varies from the training set. This error can be controlled by constraining the training set based on its similarity to each test example, using either hard or soft weighting. Hard weighting algorithms, such as clustering, shrink the training set to the members most similar to the test example. Soft weighting, on the other hand, prevents such data loss by applying a weighting mask based on similarities. Although SWP methods may reduce the accuracy of the estimator b, specifically in sparse cases, they yield a more accurate estimate of Y. A specific estimator b is calculated for each test member based on its distance from X; it is not necessarily a good estimate of β, but it gives a more accurate prediction of Y.

We can also avoid estimating β separately for each test sample by assigning each test sample to one cluster, comparing its distances to the centroids of the clusters. Since overfitting is controlled (by similarity) and acceptable in such scenarios, the introduced clustering algorithm segments X and allocates each test set example to a cluster based on its Euclidean distance from the cluster centroid. Thus, the estimator b is trained on specific members, which increases the variance term and reduces the bias term of the error in the predicted Y. Increasing the number of clusters leads to overfitting and a larger variance term.

K-mapping [19] is one of the methods that try to optimize the bias-variance trade-off [12]. Its error expression consists of a bias term followed by a variance term. Suppose the k nearest neighbors are chosen from the training set. The bias, the first term, rises monotonically as k increases, whereas the variance, the second term, falls at the same time.
Although variance minimization leads to a worse fit of the training set and its response Y, it removes data dependency. Bias minimization has the reverse effect: although the estimator b gives the best calculation of Y for the specific training set X, the vector b itself has a larger MSE with respect to the true coefficient vector β. Clearly, in such cases, if a test sample does not fit in any of the clusters, the estimated Y will have a larger error (large variance and small bias).
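For reference, the bias-variance decomposition usually quoted for k-nearest-neighbor regression is sketched below. It assumes a model $Y = f(x) + \varepsilon$ with noise variance $\sigma^2$ and fixed training inputs, where $x_{(\ell)}$ is the $\ell$-th nearest neighbor of the test point $x_0$. The symbols $f$, $x_0$ and $x_{(\ell)}$ are notation introduced here, and whether this form matches the k-mapping expression of [19] term for term is an assumption.

```latex
\mathrm{Err}(x_0)
  = \Bigl[ f(x_0) - \frac{1}{k} \sum_{\ell=1}^{k} f\bigl(x_{(\ell)}\bigr) \Bigr]^2
  + \frac{\sigma^2}{k}
  + \sigma^2
```

The first term is the squared bias, which typically grows with k because more distant neighbors enter the average; the second is the variance, which shrinks as 1/k; the third is the irreducible noise.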

III. PROPOSED ALGORITHM
Clustering as a method of tuning the bias-variance trade-off has been studied in the recent literature, e.g., in [23]. Although simulations showed improved prediction in some cases, hard clustering results in uncontrolled overfitting and data loss.
When the K-means algorithm with squared Euclidean distance is used for k-mapping, each test sample is assigned to the cluster whose centroid is nearest. The estimator b is then found from the least-squares solution on that cluster, and multiplying the test samples by b gives the predicted Y. As the number of clusters k increases, the number of members in each cluster decreases. Although this lowers the bias, the variance term of the error increases, and if the test set differs from the training set, the accuracy of the estimated Y degrades considerably. The proposed solution is to assign each training set subject a specific weight based on its similarity to the test sample. This filter is an exponential function of distance: W is an m × 1 vector (filter) computed from the normalized distance between the test sample and each training set subject. The parameter w controls the strength of filtering; as w approaches infinity, the filter approaches one (no filtering).
The SWP algorithm is provided in Alg. 1. All sub-figures of Fig. 1 in Section V-B1 depict the bias-variance trade-off.
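The soft weighting idea can be sketched as one weighted least-squares fit per test example. The exact filter is not recoverable from the text, so the code below assumes weights of the form exp(−d / w), with d the distance to the test row normalized by its maximum; this form is an assumption that merely reproduces the stated limit (weights tend to one as w grows). The function name `swp_predict` is hypothetical.

```python
import numpy as np

def swp_predict(X_train, y_train, X_test, w=1.0):
    """Soft Weighted Prediction sketch: one weighted least-squares fit per test row."""
    preds = np.empty(len(X_test))
    for i, x_new in enumerate(X_test):
        d = np.linalg.norm(X_train - x_new, axis=1)
        d = d / d.max() if d.max() > 0 else d        # normalised distances in [0, 1]
        weights = np.exp(-d / w)                     # assumed filter; tends to 1 as w grows
        root = np.sqrt(weights)
        # Weighted LS: scaling rows by sqrt(weight) turns it into an ordinary LS problem.
        b, *_ = np.linalg.lstsq(root[:, None] * X_train, root * y_train, rcond=None)
        preds[i] = x_new @ b
    return preds

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
beta = np.array([1.0, -2.0, 0.5])
y_train = X_train @ beta                             # noiseless responses for the demo
X_test = rng.normal(size=(5, 3))
y_hat = swp_predict(X_train, y_train, X_test, w=0.5)
```

A small `w` concentrates the fit on the nearest training rows (lower bias, higher variance), while a very large `w` recovers the ordinary least-squares fit on the whole training set.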

IV. TREATING WITH MISSING VALUES
The methods introduced so far depend on the data matrix (training set). With missing values, clustering by k-means is not possible. The SWP algorithm therefore requires a new definition of similarity that accommodates missing values.

[Algorithm 1 (SWP), fragment: for all data_new = X_test(i, :) do ... return Y_test]
If the missingness is block-wise, meaning that there are certain feature sets and a patient, for example, either has records for a whole feature set or has none, then clustering can be carried out based on the patient profiles. Similarity within each profile can be addressed easily, as the profiles are consistent among patients and yield similar missing patterns. However, if the missing data is not block-wise, the non-missing pattern differs among patients. Consequently, there is no common profile by which to categorize the patients; rather, we must infer from the missing pattern of the data how the patients may be similar. There are two approaches to dealing with non-block-wise missing data. The first is to impute the missing data and then apply SWP. We therefore discuss a couple of off-the-shelf, efficient matrix completion and imputation methods next; a longer list can be found in [5], [4], [3].
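In the block-wise case, the grouping by profile amounts to collecting rows whose observed/missing patterns are identical. A minimal sketch follows; the function name `group_by_missing_pattern` and the toy mask are illustrative assumptions.

```python
from collections import defaultdict

def group_by_missing_pattern(mask_rows):
    """Map each distinct observed/missing pattern to the row indices sharing it."""
    groups = defaultdict(list)
    for i, row in enumerate(mask_rows):
        groups[tuple(row)].append(i)
    return dict(groups)

# 1 = observed, 0 = missing; features come in two blocks of two.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]
profiles = group_by_missing_pattern(mask)
```

When the missingness is not block-wise, nearly every row ends up with its own pattern, so this exact-match grouping stops being useful and a graded similarity between masks is needed instead.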
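
A. Imputation Methods

1) Soft Impute [13]:
In this method, Z is considered to be a low-rank matrix. As rank(Z) is a non-convex function, it is relaxed by minimizing the nuclear norm of Z instead. The desired matrix Z solves (9):

$$\min_{Z} \; \|Z\|_* \quad \text{subject to} \quad \|M \odot (X - Z)\|_F^2 \le \delta, \tag{9}$$

where $M$ is the mask, $\odot$ is the Hadamard product and $\delta \ge 0$ bounds the error on the observed entries.

The Lagrangian is given as:

$$\min_{Z} \; \tfrac{1}{2} \|M \odot (X - Z)\|_F^2 + \lambda \|Z\|_*.$$

The solution is given by singular value thresholding (SVT) as follows:

$$Z = U (S - \lambda I)_+ V^T,$$

where $U S V^T$ is the singular value decomposition of the current filled-in matrix, and $(S - \lambda I)_+$ keeps the positive entries of $S - \lambda I$ and sets the others to zero.

To reduce the running time, Z is initialized from the mean estimate, which makes the iterations converge faster.

2) MCPAT [8]:
MCPAT is an efficient and adaptive matrix completion method that performs well in scenarios with a high missing rate and yields high SNRs in retrieving information.

[Algorithm 2, fragment: [idx, C] ← SCOP-KMEANS [25] (X, k, S)]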
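The Soft-Impute iteration with singular value thresholding and a mean-estimate start can be sketched as follows. This is a simplified version under stated assumptions: a fixed number of iterations instead of a convergence test, column means as the initial estimate, and a dense SVD; `soft_impute` and its parameters are names chosen for the example.

```python
import numpy as np

def soft_impute(X_obs, mask, lam, n_iter=100):
    """Iterative singular value thresholding with a column-mean start."""
    mask = mask.astype(bool)
    counts = np.maximum(mask.sum(axis=0), 1)                  # avoid division by zero
    col_means = np.where(mask, X_obs, 0.0).sum(axis=0) / counts
    Z = np.where(mask, X_obs, col_means)                      # mean-estimate initialisation
    for _ in range(n_iter):
        filled = np.where(mask, X_obs, Z)                     # observed entries stay fixed
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        Z = (U * np.maximum(s - lam, 0.0)) @ Vt               # U (S - lam I)_+ V^T
    return Z

rng = np.random.default_rng(0)
X = np.outer(rng.normal(size=8), rng.normal(size=5))          # rank-one ground truth
mask = rng.random(X.shape) > 0.3                              # True = observed
Z = soft_impute(X * mask, mask, lam=0.1)
```

Larger values of `lam` shrink more singular values to zero and so return a lower-rank completion; `lam = 0` leaves the observed entries untouched.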

B. Non-Impute Method
Soft-Impute, an imputation method, applies a low-rank restriction to the recovered dataset. Data loss is an inevitable consequence of this solution, as linearly dependent features may be ignored in clustering. Many recent studies have focused on clustering datasets that contain missing information. The most commonly suggested solutions modify clustering algorithms such as KMEANS and FCM, as illustrated in [24] and [14], respectively. Although the main concern of such solutions is the similarity of the observed elements, it is worth noting that sharing the same missing features is itself a kind of resemblance in such scenarios. Balancing the n-dimensional distance of the observed data against the similarity of the missing features through a weight tuning parameter leads to the desired clustering.

1) Missing-SCOP:
We have chosen the SCOP-KMEANS algorithm [25] as the baseline for developing a clustering method for data with missing values. As real data suggest, the missing pattern carries information and is useful in clustering as a factor of similarity; i.e., we leverage the similarity of the missing masks of each pair in the training set as a constraint in soft constrained clustering. Let S be an m × m matrix that assigns each pair $(x_i, x_j) \in X \times X$ a constraint $s \in [-1, 1]$. The value of s is assigned based on the mask similarity and the Euclidean distance over the jointly observed features, balanced by a proportional tuning parameter w. As s approaches −1, the constraint forces separation; as s approaches 1, the two members of the pair must be clustered in the same group. Replicated K-means is employed for centroid initialization to avoid being trapped in a local minimum.
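One way to build such a constraint matrix is sketched below. The exact formula for s is not given in the text, so this code assumes a convex combination: mask similarity is the fraction of features on which two masks agree, feature similarity is one minus the normalized root-mean-square distance over jointly observed features, and w weighs the two before the result is mapped from [0, 1] to [−1, 1]. All of these choices, and the name `constraint_matrix`, are assumptions for illustration.

```python
import numpy as np

def constraint_matrix(X_obs, M, w=0.5):
    """Soft constraints in [-1, 1] from mask agreement and observed-feature distance."""
    M = np.asarray(M).astype(bool)
    m = len(X_obs)
    mask_sim = np.zeros((m, m))
    dist = np.full((m, m), np.nan)
    for i in range(m):
        for j in range(m):
            mask_sim[i, j] = np.mean(M[i] == M[j])            # share of agreeing mask entries
            joint = M[i] & M[j]                               # features observed in both rows
            if joint.any():
                diff = X_obs[i, joint] - X_obs[j, joint]
                dist[i, j] = np.sqrt(np.mean(diff ** 2))
    finite = dist[~np.isnan(dist)]
    d_max = finite.max() if finite.size and finite.max() > 0 else 1.0
    # Pairs with no jointly observed feature are treated as maximally distant.
    dist = np.where(np.isnan(dist), d_max, dist) / d_max
    combined = w * mask_sim + (1.0 - w) * (1.0 - dist)        # in [0, 1]
    return 2.0 * combined - 1.0                               # rescaled to [-1, 1]

X_obs = np.array([[1.0, 2.0, 0.0],
                  [1.0, 2.0, 0.0],
                  [0.0, 5.0, 9.0]])
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]])
S = constraint_matrix(X_obs, M, w=0.5)
```

In this toy example the first two rows share both their mask and their observed values, so their constraint is 1 (must link), while the third row receives a negative constraint against them.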

2) SWP via Missing-SCOP:
The SWP algorithm consists of splitting the training set into one-member clusters and assigning each cluster a weight based on its distance to each individual test example. An alternative is to use soft clustering algorithms [21] to find the probability matrix U for the test examples. The weight matrix is then a diagonal matrix in which members of the same cluster have the same weight. As the problem contains missing values, the introduced Missing-SCOP algorithm is used to obtain more precise clustering than imputation methods provide. Let X be the dataset matrix, divided into an m × n training set $X_{train}$ and a p × n test set $X_{test}$. Assuming $X_{train}$ is clustered into k sub-matrices with centroid matrix C and index vector idx, the probability matrix U is defined in (11), where $i \in [1, p]$ and $j \in [1, k]$.

The weight matrix W of the SWP algorithm is then obtained from U. As $u_{ij}$ is a normalized measure of similarity between the i-th test set example and the j-th cluster centroid, the vector $W_{clusters}$ and the matrix W are defined for each $X_{test}$ example in (13) and (14), respectively; both are calculated for the i-th $X_{test}$ example.

The weighted least-squares solution in the algorithm requires matrix completion, which can be obtained with the MCPAT algorithm [8].
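The step from cluster memberships to per-example weights can be illustrated as follows. The definitions (11), (13) and (14) are not reproduced in this text, so the sketch assumes that $u_{ij}$ is an exponential similarity to centroid j normalized over the clusters, and that each training row inherits the membership value of its own cluster; the function names and the toy data are hypothetical.

```python
import numpy as np

def membership_matrix(X_test, C, w=1.0):
    """Row-normalised similarity of each test row to each centroid (rows sum to 1)."""
    d = np.linalg.norm(X_test[:, None, :] - C[None, :, :], axis=2)   # p x k distances
    sim = np.exp(-d / w)                                             # assumed similarity
    return sim / sim.sum(axis=1, keepdims=True)

def weight_matrix(u_row, idx):
    """Diagonal m x m weights: each training row gets its cluster's membership value."""
    return np.diag(u_row[idx])

C = np.array([[0.0, 0.0], [4.0, 0.0]])        # k = 2 centroids
idx = np.array([0, 0, 1, 1, 1])               # cluster index of each of m = 5 training rows
X_test = np.array([[1.0, 0.0], [2.0, 0.0]])   # p = 2 test rows
U = membership_matrix(X_test, C)
W0 = weight_matrix(U[0], idx)                 # weights used when predicting test row 0
```

Because all members of a cluster share one weight, only k weights per test example need to be computed, rather than m as in the one-member-cluster form of SWP.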

V. SIMULATION RESULTS
A. Datasets

1) Simulated Data:
As real problems dictate, the training set and test set are random processes consisting of normally distributed random sequences (features). Let X be an m × n random process consisting of random variables $X = \{X_1, X_2, \ldots, X_n\}$, where $X_1, X_2, \ldots, X_n$ are normally distributed with uniformly random parameters, i.e., $X_i \sim \mathcal{N}(\mu, \sigma)$. By the Law of Large Numbers (LLN), the average of the results obtained from a large number of trials should be close to the expected value, and tends to become closer as more trials are performed. Because the simulation results are data-dependent, our reported MSEs are averaged over 20 randomly generated datasets.

2) Sample Data:
The algorithms are also tested on the following MATLAB sample datasets: cities, discrim, kmeansdata, stockreturns.

3) Missing Mask:
Real cases show significant and meaningful similarities in the missing patterns of similar elements. The suggested missing mask therefore uses the same missing pattern for every member of a cluster in the dataset matrix. A Gaussian logic mask is added to this mask, as expected in the real world. Consider an m × n dataset X clustered by the index vector idx into k sub-matrices with $n_1, n_2, \ldots, n_k$ members each. The m × n logic mask is generated as described in (15), where $i = 1, \ldots, k$, $r_{max} = \max(r(:))$, $m_{rate}$ is the missing rate and $r_{1 \times n} \sim \text{unif}$.

1) SWP:
The algorithm is tested on the datasets described in Section V-A. The results are depicted in Fig. 1. Although the optimal tuning parameter w varies from case to case, the general behavior of the figures is the same.
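For the missing mask of Section V-A3, a cluster-wise pattern with an entry-wise perturbation could be generated along the following lines. Expression (15) is not reproduced in this text, so the thresholding of a uniform draw r against the missing rate, the independent random flips standing in for the added Gaussian logic mask, and every name in the code are assumptions.

```python
import numpy as np

def clustered_mask(idx, n_features, missing_rate, noise_rate=0.05, seed=0):
    """Logical mask (True = observed) sharing one missing pattern per cluster."""
    rng = np.random.default_rng(seed)
    idx = np.asarray(idx)
    mask = np.ones((len(idx), n_features), dtype=bool)
    for c in np.unique(idx):
        r = rng.random(n_features)                   # one uniform draw per feature
        pattern = r / r.max() > missing_rate         # features kept for this cluster
        mask[idx == c] = pattern                     # same pattern for every member
    flips = rng.random(mask.shape) < noise_rate      # entry-wise perturbation (assumed)
    return mask ^ flips

idx = [0, 0, 0, 1, 1, 2]
mask_clean = clustered_mask(idx, n_features=8, missing_rate=0.4, noise_rate=0.0)
mask_noisy = clustered_mask(idx, n_features=8, missing_rate=0.4, noise_rate=0.1)
```

With `noise_rate=0.0` all members of a cluster have identical rows, which corresponds to the block-wise situation; raising it makes the patterns similar within a cluster but no longer identical.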

C. Missing Scenario
The methods introduced for dealing with missing elements of the training set are tested on the datasets mentioned above.

1) Clustering:
Our main concern in dealing with missing cases is clustering. The impute and non-impute methods introduced in Section IV are tested on the datasets described in Section V-A, masked by the method described there. Silhouettes [22], a well-known measure of clustering accuracy, are used for assessment. The simulation results are reported in Table I to compare the efficiency of each clustering algorithm. Silhouette values for kmeansdata, a dataset well suited to clustering, are depicted in Fig. 2. This figure illustrates a trade-off between missing mask similarity and correlation of the observed values, tuned by the parameter w described in Algorithm 2. A notable improvement in clustering accuracy is observed in this case.
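The mean silhouette value used as the accuracy measure can be computed as in the sketch below, which follows the standard definition s(i) = (b − a) / max(a, b), with a the mean distance to the other members of the own cluster and b the smallest mean distance to another cluster. It assumes fully observed data, Euclidean distance, at least two clusters, and a silhouette of zero for singleton clusters; the name `mean_silhouette` is illustrative.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette value over all points (Euclidean distance, >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                                  # singleton cluster: score stays 0
        a = D[i, same].sum() / (same.sum() - 1)       # mean distance within own cluster
        b = min(D[i, labels == c].mean()              # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(X, [0, 0, 1, 1])   # clusters match the two well-separated groups
bad = mean_silhouette(X, [0, 1, 0, 1])    # clusters cut across the groups
```

Values near 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own.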

VI. CONCLUSION
An innovative method of prediction enhancement for linear models is introduced and explained. The SWP algorithm, a weighted least-squares solution, is proposed; in simulation results it surpassed several state-of-the-art methods, such as clustering. Datasets containing missing information have also been studied, and an adjusted SWP is developed for such scenarios. Clustering, a fundamental part of this adjustment, is discussed, and the Missing-SCOP algorithm is introduced as a means of handling missing values in clustering. This algorithm treats the missing mask similarity of each example as a clustering constraint, weighted by the tuning parameter w. Comparing mean silhouette values as a measure of clustering precision, the simulation results show that the Missing-SCOP algorithm, a non-impute clustering method for cases with missing values, outperformed imputation methods such as Soft-Impute.