Optimality of Spectrum Pursuit for Column Subset Selection Problem: Theoretical Guarantees and Applications in Deep Learning

We propose a novel technique for ﬁnding representatives from a large, unsupervised dataset. The approach is based on the concept of self-rank , deﬁned as the minimum number of samples needed to reconstruct all samples with an accuracy proportional to the rank- K approximation. As the exact computation of self-rank requires a com-putationally expensive combinatorial search, we propose an efﬁcient algorithm that jointly estimates self-rank and selects the optimal samples. The ratio of obtained projection error using selected samples to the error of best rank- K approximation is called approximation ratio (AR). In this paper, a new upper bound for AR is derived which is tight for two asymptotic cases. The best AR for self-representative low-rank approximation was presented in ICML 2017 Chierichetti et al. (2017), which was further improved by the bound √ 1 + K reported in NeurIPS 2019 Dan et al. (2019). Both of these bounds are obtained by brute force search which is not practical and these bounds depend solely on K , the rank which is targeted. In this paper, for the ﬁrst time, we present an adaptive AR depending on spectral properties of the original dataset, A ∈ R N × M . In particular, our performance bound is proportional to the condition number κ ( A ) which is a well-known spectral property. Our derived AR is expressed as 1 +( κ ( A ) 2 − 1 ) / ( N − K ) which approaches 1 in certain asymptotic cases. Our proposed algorithm enjoys linear complexity w.r.t. the size of original dataset which results in ﬁlling a historical gap between practical and theoretical methods in ﬁnding representatives. In addition to evaluating the proposed algorithm on a synthetic dataset, we show that it can be utilized in real-world applications such as graph node sampling for optimizing the shortest path criterion, and learning a classiﬁer with represent In addition to evaluating the proposed algorithm on a synthetic dataset, we show that it can be utilized in real-world applications such as graph node sampling for optimizing the shortest path criterion, and learning a classiﬁer with representative data.


Introduction
Low-rank approximation is one of the fundamental workhorses of machine learning and optimization. It has applications for dimension reduction and clustering in recommendation systems Esmaeili et al. ( , 2016, text miningAzghani et al. (2019), super-resolution , and computer vision . It is a building block for data compression algorithms such as PCA, where the data are transferred into an eigenspace representation to learn the representative samples which are suitable for the whole data. The representations are either in a lower-dimensional space or expressed in terms of pseudo-data not part of the original dataset. 1 Thus, these methods are not applicable for data sampling. However, it was shown recently that spectral methods are useful tools to devise a sampling framework Zaeemzadeh et al. (2019). In spectral-based sampling, the goal is to keep the spectrum of representatives as close as possible to the spectrum of the original data. Shannon sampling Unser (2000), the most theoretically strong sampling scheme also follows the strategy of keeping the spectrum of the original and sampled function equal to each other, and a reconstruction guarantee is established for signal sampling in an infinite-dimensional space based on spectral properties.
Datasets encountered in practice are usually composed of a large number of vectors in a finitedimensional space. Assume M samples in an N-dimensional space organized in matrix A ∈ R N×M . Let S be a set of integers between 1 and M, which, when used to index the columns of A, define a collection of samples. We define a matrix of the form V = [v mk ] which projects back from the sampled to the original data. We consider the following optimization problem for finding the optimal sampling and back-projection: where λ is a parameter that regularizes the rate of sampling. We refer the cardinality of the set S * as the self-rank of matrix A in this paper. Substituting a k in the inner summation with an arbitrary vector u k ∈ R N simplifies this problem to the truncated singular value decomposition (SVD). In this case, |S| is simplified to the conventional definition of rank for matrix A. However, the conventional spectral decomposition using SVD is not suitable for data sampling since singular vectors are arbitrary vectors not actual samples. Enforcing the principal bases to be from the dataset itself in (1) results in a self-representative low-rank approximation. This problem is related to the column subset selection problem (CSSP) , a problem that is known to be NP-hard Shitov (2017); Çivril (2014) and subject of ongoing research Dan et al. (2019); Song et al. (2019b,a).
In this paper, we propose a versatile sampling framework to determine the self-rank and to identify the corresponding samples. Our approach is based on spectral decomposition; we compute the minimum number of samples that cover the K most significant spectral components of the dataset. The main motivation of the present work and the role of proposed method among the vast literature of CSSP are explained in the next section. The main contributions of this paper can be summarized as follows: • We introduce Spectrum Pursuit with Residual Descent (SP-RD), a constructive algorithm that jointly estimates the self rank and samples corresponding to informative columns. • We prove an upper bound for the SP-RD projection error, and show that in two special cases, the bound is asymptotically the tightest one. • We show that the SP-RD algorithm leads to improved performance in a number of applications, including graph node sampling with shortest path criterion and selection of training data for a classifier.

Related Work and Motivation
There is a large line of work on data selection in machine learning which are mainly studied under context of CSSP and volume sampling (VS). See Bardenet & Maillard (2015); Dieng et al. (2017) (2015); Gu (2015); Duersch & Gu (2015) for a non-exhaustive list. Motivated by the fact that datasets continue to grow larger and larger over time, a popular approach relies on the modification of the dataset itself, such that its size shrinks while preserving its original statistical properties. Based on this observation, (Bardenet & Maillard, 2015) studied the reduction in size of a large dataset using random linear projection. On a similar note, (Huggins et al., 2016;Dieng et al., 2017;Campbell & Broderick, 2018) constructed a weighted subset of a large dataset, the Bayesian coreset, for a wider class of Bayesian models.
Many other efforts have been devoted to data selection and coreset construction algorithms. Examples include pivoted QR factorization and matrix subset selection, which are under the umbrella of column subset selection problem Paul et al. (2015); Dan et al. (2019); Song et al. (2019a,b), that can be considered as a form of unsupervised feature selection or prototype selection. We briefly discuss the key models relevant to our work. One line of work focuses on the randomized matrix  , and Nearoptimal Boutsidis et al. (2014).

Algorithm selected samples Upper bound Complexity
algorithms which have been developed in fast least squares Clarkson & Woodruff (2013), sketching algorithms J. W. Demmel & Xiang (2015) and low-rank approximation problems Gu (2015). Authors in Duersch & Gu (2015) develop a randomized QR with column pivoting (RQRCP), where random sampling (RNDS) is used for column selection. A comprehensive summary of these algorithms and their theoretical guarantees is available in Table 1 in Christos Boutsidis & Drinea (2009). Most of these algorithms can be roughly categorized into two branches. One branch of algorithms are based on rank-revealing QR (RRQR) decomposition Gu & Eisenstat (1996) which is related to reduction in E-optimal sense Joneidi et al. (2018). It has been proved in Christos Boutsidis & Drinea (2009) that RRQR is nearly optimal in terms of residue norm. On the other hand, sampling based methods Petros  try to select columns by sampling from certain distributions over all columns of an input matrix. Extension of sampling based methods to general low-rank matrix approximation problems was also investigated in Srinadh Bhojanapalli (2016). Let us denote the best rank-K approximation of matrix A by A K . The data matrix A is approximated in polynomial-time in Deshpande & Rademacher (2010b) and a lower bound on the required number of samples as O (K log K + K/ε), where ε is the coefficient of the relative error, has been obtained in Deshpande & Rademacher (2010b); P. Drineas & Muthukrishnan (2006). In other words, a subset of data is selected such that the projection error of all samples on it is less than (1 + ε) A − A K F . For a smaller value of ε, we need more selected samples to achieve the desired projection error. A tighter bound for the number of required samples is introduced in Boutsidis et al. (2014). They show that O (K/ε) columns contain a subspace which approximates A with coefficient error of (1 + ε). Here, we introduced our approach based on recent advances on data sampling. The reader is invited to see the supplementary document for more elaboration. Our proposed bound is the first work in the literature that targets rank-K approximation using only K samples while can approach to 0 asymptotically.
The AR of existing low-rank approximation methods are only a function of K. We refer the reader to Table 1 for a brief survey of upper bounds on various low-rank models for sampling. Methods in Table 1 are either impractical or shown to be working in very limited practical settings in their experiments. This motivates two main questions in our work: How many samples suffice to represent a dataset? and How can we select such data representatives considering the intrinsic structure of the dataset?. Our proposed method is based on spectrum pursuit (SP) algorithm which is shown to be working efficiently in a wide range of applications Joneidi et al. (2020). In the present paper we will show theoretically that why SP outperforms state-of-the-art methods in sampling. Moreover, an extension of SP algorithm will be presented which is more accurate and provably convergent. As reflected in Table 1, there is no sampling algorithm whose AR is controlled by structural or spectral properties (such as condition number κ(A)) of the matrix A. In this paper, a theoretical upper bound is established which depends on spectral properties of A in addition to data dimensions and the target rank K. Sampling literature includes a vast set of practical methods which function in limited occasions appropriately, however, come short in theoretical guarantee. On the other hand, one can find sampling methods in the literature supported by established theoretical guarantees with no practical applications. In the present work, for the first time we fill a historical gap between practical sampling and theoretical sampling.

Joint Self-rank Estimation and Sampling
We introduce an equivalent problem to (1) that is solved efficiently.  Multi-PIE dataset. Linear combination of these components is able to reconstruct all 52,000 images with an average error less than 25%. (b) A subset of dataset with 15 images is found such that their span is as accurate as the span of the first 9 spectral components.
where E K ∈ [0, 1] is the projection error corresponding to the best rank-K approximation of data normalized to A 2 F . Parameter K is a user specified target rank. As K increases, more samples are needed in set S in order to reach the desired error. The cardinally of set S * for a specified K is denoted by S K (A) and it is called the self-rank of dataset A with parameter K. For example, setting E K = 0.25 results in a set of samples such that their combination is able to reconstruct all data with 25% error. It is straightforward to find a target rank to provide the desired error by using truncated SVD. Fig. 1 Left shows the low-rank approximation error of SVD versus an assumed target rank for a subset of Multi-PIE face dataset with 52, 000 images. For instance, 9 spectral components are needed to span all data with a normalized error of less than 0.25. Since spectral components in SVD are not among samples of data, we need to solve Problem (2) in order to find the minimum number of actual samples, such that their span provides a span as accurate as that of the first 9 spectral components. Fig. 1 Right shows the computed self-rank versus the target rank. As an example, there is a subset of the dataset with 15 samples such that their span is as accurate as the span of first 9 spectral components as shown in Fig 2. In other words, there are 15 samples in the dataset such that they are able to reconstruct all the dataset with an error less than 25%. The normalized error, E K , is a user-specified parameter and it can be set to any value between 0 and 1. The target rank of matrix A corresponding to a desired projection error, E K , is defined as the minimum integer K that satisfies (∑ N n=K+1 σ 2 n )/ A 2 F ≤ E K . The n th singular value of matrix A is denoted by σ n . The self-rank of matrix A with parameter K is equal to the minimum number of columns in A such that their span approximates A by an error less than the error of best rank-K approximation, i.e., the smallest size for set S that holds Matrix A S refers to sampled columns of A, and A † S is the Moore-Penrose inverse of A S .

Practical Algorithm for Sampling
Solving Problem (2) implies a combinatorial search that is not feasible for a massive dataset and a large target rank. However, computing the target rank is tractable via SVD. Since self-rank is lower-bounded by target rank, a practical algorithm should start searching for self-rank values greater than the target rank. Thus, we start with solving the following problem with a fixed |S| = K, and check if the constraint in Problem (2) is satisfied for a desired error threshold.
The solution for S does not guarantee that the error in the constraint of Problem (2) is less than the desired E K . The minimum integer, K, that satisfies the constraint in (2) is considered as the self-rank of dataset A with parameter K. If the error is higher than the desired threshold indicated by E K , the cardinality of set S is increased by 1, and Problem (3) is solved with a new value for K. Joneidi et al. (2020) for solving (3). Inspired by SP, a more accurate solver for (3) is proposed in this paper which is called Spectrum Pursuit with Residual Descend (SP-RD). Deriving SP-RD Alg from (3) and its convergence analysis are studied in the supplementary document. Alg. 1 shows the steps of SP-RD algorithm. In this algorithm P is a parameter that controls the computational burden of the algorithm. The joint problem of self-rank estimation and data sampling is summarized in Alg. 2. The heart of this algorithm is the mentioned SP-RD algorithm. In the experiments we refer Alg. 2 as adaptive SP-RD. In Line 1 of the algorithm K can be initialized using truncated SVD easily. Moreover, Π S (A) is the projection operator on the subspace spanned by columns of A indexed by set S. Mathematically, it is equal to

Spectrum Pursuit (SP) Algorithm is proposed in
It is noteworthy that the computational complexity of the SP-RD algorithm is O(MN + K 2 N + MNP) per iteration, which depends on the computational burden parameter P. Note that we only need the first singular vector and there are fast methods to get it. For the most relevant cases, large dataset with K < N < M, the complexity is dominated by the O(MNP) per iteration and we need to perform O(K) number of iterations. This is a linear complexity w.r.t M.

Algorithm 1: Spectrum Pursuit with Residual Descent (SP-RD)
Require: A, P and K Output: S 1: Initialization: Algorithm 2: Joint Self-rank Estimation and Data Sampling via Adaptive SP-RD Require: dataset A, P, and the desired sampling error e ∈ [0, 1] Output: set S.
The state-of-the-art upper bound for the minimum required number of samples is introduced as O( K ε ) Boutsidis et al. (2014), in order to guarantee that the projection error of sampling is not worse than the projection error of rank-K approximation. In general, the purpose in CSSP problem is to find an upper bound on the number of sampled data in set S to ensure that In the following theorem, we prove there exists an upper bound for ε when the algorithm SP-RD is used to select one sample. Please note that when K = 1, in Line 3, U k is empty and in Line 5, E k = A.
Theorem 1. Let E 1 denote the error achieved by the best first rank-1 approximation of full-rank data matrix A, i. e., A − A 1 2 F . Further, let ρ(A) denote rank-oneness measure of A defined in Zaeemzadeh et al. (2019), and κ(A) denote the condition number of matrix A. Assume that N is less than M for the given large dataset A. Also, let {s 1 } denote the singleton containing the best sample selected by SP-RD algorithm. Then, Therefore, the smallest possible ε for the upper bound to be held can be set to the value found in the above equation. Before proceeding to establish an upper bound for K selected samples, we briefly discuss the established bound for one sample in the following remarks clarifying certain properties of the obtained bound in two specific instances, Remark 1-When κ(A) = 1, the data matrix is isometric, meaning that all samples are of equal importance. In this case, the projection error on the span of any sample is equal to the projection error on any spectral component. Thus, In other words, the upper bound holds with correction factor ε = 0. Remark 2ρ(A) = 1 means that all the energy of the samples selected is accumulated in the direction of the first eigenvector. In other words, the first sample is oriented towards the best rank-1 approximation of the data matrix, i.e., u 1 . Therefore, ε = 0, and the upper bound holds tightly. This is trivial as the first sample turns out to be the best rank-1 approximation itself.
F denote the minimum error achieved by the best first rank-K approximation of data matrix A. Further, let S K = {s 1 , ...s K } be the set containing K samples selected by Alg 1. Then, All proofs can be found in the supplementary material. The proposed SP-RD is the extended version of SP Joneidi et al. (2020) and iterative projection and matching (IPM) algorithm Zaeemzadeh et al. (2019). Theorem 2 shows us the theoretical reason behind the exhibited success of SP family algorithms for a very wide range of applications due to their simplicity and accuracy. SP and IPM algorithms has no parameter for fine tuning and parameter P in SP-RD does not need fine tuning and it can be set according to our accessible computational power. These favorable properties make the class of SP algorithms problem-independent and dataset-independent.

Experimental Results
In this section, we apply our above theoretical results to perform learning tasks employing the selected samples of datasets based on their complexity, which is reflected in their self-rank. Our aim is to reduce training size of massive data such that accuracy of learning tasks is maintained. Applications of the class of SP algorithms are studied recently Joneidi et al. (2020); Zaeemzadeh et al. (2019). In this section, first a synthetic experiment is designed in order to estimate an oracle self-rank. Then, two new real-world applications are studied.

Self-rank Estimation on Synthetic Data
Since the computation of self-rank is a combinatorial problem, we need to synthesize a dataset with a known self-rank in order to evaluate sampling algorithms in Alg. 1 and Alg. 2. The known self-rank is assumed as a lower bound on the estimated self-rank using Alg. 2. A synthetic full-rank dataset is generated with M samples in 200 dimensional space. First, we generate 20 linearly-independent samples whose target rank is equal to 15 with parameter E K = 0.1 2 We call these 20 samples the base samples. Then, M−20 samples are generated by linear combination of base samples. Finally, all M − 20 generated samples (all except base samples) are contaminated with noise in the null-space of base samples to result in a full rank dataset. Since base samples can be approximated by a rank-15 subspace, the whole dataset can be approximated by a rank-15 decomposition. However, those 20 base samples correspond to the self-rank of dataset. We test this dataset using different selection algorithms. An efficient algorithm summarizes the dataset to the base samples. Fig. 3 shows the probability of selection from the underlying base samples using different algorithms. We let all selection algorithms to sample 20 data. As can be seen, SP-RD algorithm successfully finds all base samples for up to 1000 generated samples. Increasing P (number of correlated samples) in the SP-RD algorithm results in a more accurate solution to Problem 3. SP-RD is performed for 200 iterations. Note that SP algorithm is equivalent to the SP-RD algorithm with P = 1. Table 2 shows the estimated value for the self-rank of the generated synthetic dataset. Setting P ≥ 8 results in estimating the oracle value for the underlying self-rank. The desired E K is set to 0.1. It is worthwhile to mention that random selection on average needs 67 samples to have the same accuracy as the projection error of the 20 underlying base samples, while SP-RD with P = 8 only needs 20 samples, i.e., base samples are selected intelligently. Table 2: Estimated self rank for the described synthetic data using SP-RD algorithm with different values for parameter P. The expected estimated self-rank using random selection is computed as 67. It means on average, 67 samples are needed to span the dataset as accurately as 20 samples found by SP-RD algorithm. P P = 1 (SP) P = 3 P = 5 P = 7 P = 8 P = 10 Estimated self-rank 54 43 34 26 20 20

Representative Selection from CMU Multi-Pie Dataset
Here, CMU Multi-PIE Face Database is considered for representative selection Gross et al. (2010). A cropped version of this dataset is available online at 3 , which contains 249 subjects with 13 poses, 20 illuminations, and 2 expressions. Thus, there are 520 images for each subject. Fig. 4 compares the performance of different state-of-the-art selection algorithms in terms of CSSP normalized projection error, which is defined as the CSSP cost function in (3) for a given selection method normalized by A − A K . The normalized error is averaged for all 249 subjects. Parameter P is set as 10 for SP-RD algorithm and in order to select K samples, 5K iterations are performed. As it can be seen, the obtained approximation ratios are much better than the tightest theoretical bound in the literature which is √ 1 + K. In practice, for multi-pie face dataset, the achieved AR using different algorithms is better than 1.42 for K ≤ 15. Moreover, the proposed SP-RD reaches an AR around 1.15 which is the closest to the best possible AR, i. e., 1.

Adaptive Sampling from MNIST Dataset for Classification
To evaluate the effectiveness of our algorithm, we apply our algorithm on MNIST dataset classification task. In order to apply adaptive sampling, first MNIST data points are transferred to a feature space. A simple back-bone architecture is utilized for both feature selection and classification tasks to prevent architecture-specific bias. We train our feature net for classification task on Omniglot dataset Lake et al. (2011) and pick the last convolution layer features. We transform every image in MNIST training set into a vector of length 256 with the trained feature net. Our classifier also has the same architecture as our feature net and the only difference is that the last layer consists of only 10 neurons which is the number of classes in MNIST. Figure 5 shows the result of these experiments. In adaptive sampling scenario using SP-RD algorithm, we consider the complexity of each class in MNIST training data determined by the introduced self-rank for each class. We compute the self-rank of each class using Alg. 2 alongside with optimal Figure 5: Comparison between the accuracy of MNIST classification task when a portion of each class is sampled. In this scenario, K-mediods performs even worse than random selection and the performance of SMRS and dual volume sampling are close to that of random selection. The vertical lines show a 95% confidence interval. samples. Therefore, the number of selected samples are class dependent. This forms a non-uniform sampling scenario where data are sampled for learning from each class in an adaptive fashion. We compare this to random selection, and the selection carried out by SP-RD with a fixed number of selected samples from each class. Moreover, comparison to K-mediods, SMRS, and dual volume sampling (DVS) Li et al. (2017) algorithms is provided. Splitting the sampling budget adaptively according to the introduced self-rank results in improvement of classification performance. This experiment is designed based on a generic architecture with a typical dataset. However, other selection algorithms do not provide a gain over random selection.

Adaptive Graph Summarization
A graph network is characterized by structural features such as degree distribution, average shortestpath length (ASPL) and the clustering coefficient. Calculating the APSL of a large graph is memory space and computation intensive. Hence, we propose an alternative efficient method for computing the ASPL with a reasonable error. Instead of the shortest path between all pairs of vertices we compute the ASPL between all the vertices and some selected vertices. Given a graph G we first select a subset of the vertices and then we exploit the measure of ASPL to evaluate the accuracy. Let G be the graph of the real world citation graph datasets, Cora with the set of vertices V 4 . Further, let dist(v 1 , v 2 ) denote the shortest distance between v 1 and v 2 (v 1 , v 2 ∈ V ). Then, the error is defined as where |S| is the selected samples by the algorithms specified in the table. The more selected vertices chosen as these representatives, the less error is obtained. More importantly, the less selected samples are adopted by each algorithm the faster the ASPL will be obtained. For a fair comparison between our algorithm and the state-of-the-art we consider a pre-defined threshold (err th = 0.15) on the error of the approximation based on the topology of the graph and we observe each algorithm to see how many data each algorithm needs to satisfy that precision of error on the shortest path. The results of these experiments are shown in Table 3.

Conclusion
In this paper, we study the problem of subset selection, which has many applications in machine learning. The general goal is to select a subset from a large set of data such that their linear combination is able to target rank-K approximation. We have propounded a novel adaptive sampling technique from a given dataset for machine learning tasks. This sampling method is based on considering the spectral properties of the given data matrix. Our algorithm delivers a tight theoretical bound on the approximation ratio. The proposed algorithm enjoys a linear computational complexity and at the same time provides an outstanding practical performance. It also comes with theoretical approximation guarantee. These favorable properties make our proposed algorithm a new paradigm in the literature of data sampling. The elongated proof is included in our supplementary material. We also present experiments on synthetic and real world datasets to demonstrate significant performance superiority to other sampling methods in different learning tasks.

Supplemantary Material
In this supplementary document, we will show a constructive intuition about our proposed sampling. Then, the details of spectrum pursuit with residual descent (SP-RD) algorithm are presented. Finally, the proofs of the mentioned theoretical results are presented.

Relation to the Conventional Spectral-based Sampling
Fifty years after Shannon work, in 90s researchers revisited Shannon theory for signals with special spectral patterns Unser (2000). A popular investigated spectral structure is union of low-rank subspaces Eldar & Mishali (2009) which resulted in compressed sensing, a low-rate sensing technique Davenport et al. (2011). In compressed sensing we measure a combination (sense) of samples and it is not appropriate for selecting a small number of actual samples. On the other hand, in the recent 20 years information systems are extensively developed based on recent advances in machine learning and big datasets. The present paper links the conventional spectral sampling approaches to sampling from now-days big datasets represented in a finite space for a machine learning task.
To get a constructive intuition about the Shannon sampling theorem, we cast the optimal Shannon sampling in terms of an optimization problem as follows, The sampling period is the average of |t k − t k−1 | for all k. This problem aims to find sampling epochs t k such that function f is able to recover the original signal accurately. Shannon theorem solves this problem using an assumption on the bandwidth of signal x(t). Assume that the decomposition of signal x(t) in terms of bases {exp( jwt)} is non-zero only for |w| ≤ B, then the cost function of the first term in (5) is 0 using t k = k/(2B) as the sampled points and f (t,t k ) = sinc(2B(t − t k )) as the mixer function. Thus the achieved sampling period is 1/(2B). Problem (5) holds for signals in an infinite-dimension space.
However, practical data measurements result in datasets which are composed of a large number of vectors in a finite-dimensional space. Assume M data samples in an N-dimensional space are organized in matrix A = [a m ] ∈ R N×M . Variable t in the continuous signal representation is substituted with integer variable m for m = 1 up to M. The goal is to sample few columns such that their mixture reconstructs all columns. Inspired by (5), Problem (1) in the main paper is achieved as follows, where λ is a parameter that regularizes the rate of sampling. Matrix V projects back selected samples to the original dataset similar to function f in Eq. (5). Similar to the spectral conditions for a perfect signal reconstruction, we establish an upper bound for the performance of data sampling according to the spectral properties of matrix A. In the continuous signal reconstruction we need 2B samples for each ∆t = 1. However, Self-rank indicates the minimum required samples for a dataset in a finite dimensional space.
Shannon sampling theorem requires a minimum number of samples which is a function of spectral properties of the original function. As it is shown in the main paper, we also pursuit the same fashion.

Spectrum Pursuit with Residual Descent (SP-RD) Algorithm
Projection of all the data onto the subspace spanned by the K columns of A, indexed by T, i.e., π T (A), can be expressed by a rank-K factorization, UV T . In this factorization, U ∈ R N×K , V T ∈ R K×M , and U includes the K columns of A, indexed by T which are normalized to have unit length. Therefore, optimization problem (6) and the problem in Eq. (2) of the main paper can be restated as where, A = {ã 1 ,ã 2 , . . . ,ã M },ã m = a m / a m 2 , and u k is the k th column of U. It should be noted that U is restricted to be a collection of K normalized columns of A, while there is no constraint on V . Since Problem (7) involves a combinatorial search and is not easy to tackle, we modify (7) into two consecutive problems. The first sub-problem relaxes the constraint u k ∈ A in (7) to a moderate constraint u = 1, and the second sub-problem reimposes the underlying constraint. These sub-problems are formulated as Here S k is a singleton containing the index of the k th selected data point. Matrix U k is obtained by removing the k th column of U. Matrix V k is defined in a similar manner. Subproblem (8a) is equivalent to finding the first left singular vector (LSV) of E k A −U k V T k . The constraint u = 1 keeps u on the unit sphere to remove scale ambiguity between u and v. Moreover, the unit sphere is a superset for A and keeps the modified problem close to the recast problem (7). After solving for u (which is not necessarily one of our data points), we find the data point that matches u the most (makes the smallest angle with u) in (8b).
The best minimizer for (8a) w.r.t. u is the first LSV of E k . However, restriction to data samples does not imply that the most correlated sample with the first LSV is necessarily the best minimizer for (8a), although in most cases the most correlated sample with the first LSV is the best minimizer. In order to find the best sample that minimizes (8a) at each iteration, we propose to collect a few samples that are correlated with the first LSV and compute residual error for these samples. Let Ω denote the set of P most correlated samples with the first LSV. The modified version of the above problem can be written as follows.
To make the error monotonically decreasing, as a sufficient condition for convergence, we can include the index of the previously selected sample in for Ω for updating the k th selected sample. This step is performed in Line 8 of SP-RD algorithm. In order to select K samples, the lower bound for selection cost function is A − A K in which A K is the best rank-K approximation of A. Since the cost function is monotonically decreasing, and it is lower bounded, the modified algorithm is convergent. Alg. 1 in the main paper and Fig. 6 shows the steps in SP-RD algorithm.
Approximation ratio is a constant greater than 1 since it is normalized by the best (lowest) possible projection error. In Fig. 7 errors are normalized by projection error of random selection. Since random selection is a rough selection method, the ratio is usually (not necessarily) smaller than 1. As it can be seen increasing P improves the accuracy of selection. However, the gain after P = 10 is marginal.

Convergence Analysis
At each iteration of SP-RD, a new sample is selected only if the resulted residual error decreases (Alg. 1 (SP-RD), line 9). This way the error is non-increasing. The error is also lower bounded by A − A K 2 F since it is the least error that K vectors can achieve as the projection error. These two conditions guarantee that the algorithm converges and quality of the selected subset always improves or remains the same over iterations. As you can see in Fig. 8 the cost function for SP algorithm is not necessarily decreasing over iterations. However, our proposed SP-RD algorithm is decreasing, since convergence is guaranteed.

Proofs of Propositions and Theorems
Lemma 3. Zaeemzadeh et al. (2019) Let A denote an N × M matrix. Let σ 1 and u denote the first singular value and the corresponding left singular vector of A, respectively. Then, there exists at least one column in A such that the absolute value of its inner product with u is greater than or equal to The following proposition states a lower bound on the maximum of the absolute value of the correlation between columns of a matrix and u, when data are normalized on the unit sphere. There exists at least one data point, a i , such that the correlation coefficient between a i and the first left singular vector of A is greater than or equal to σ 1 A F .
Rank-oneness measure of a rank R matrix A with singular values σ 1 , σ 2 , . . . , σ R is defined as Proof of Theorem 1: Given the SVD of a full-rank data matrix A is represented as A = UΣV T . The most correlated column of A with u 1 , which is denoted by s 1 and its normalized version can be written as follows: where u i is ith column of U. The projection error for the first normalized selected sample s 1 can be cast as follows: Now, assume L = 1 + ∑ i α 2 i . In order to compute the desired bound for ε in the format presented in Theorem 1, it only suffices to show that the last term in the above equation is upper bounded by E 1 = ∑ i>1 σ 2 i (1 + ε). Rewriting and cancelling common terms from both sides leads to, To establish inequality (13), it is enough to relax it to a more tractable looser inequality. To this end, we shrink the right hand side by substituting all σ i s for i > 1 with σ N , and since i L 2 ≥ 0 from right hand side. Next, we enlarge the left hand side by ignoring the power factor 2 on the left hand side of inequality (13) as 1 − 1 L < 1. Therefore, the inequality to be satisfied which guarantees the inequality (13) will be held can be written as follows: Knowing that ∑ i>1 α 2 i L = 1 − 1 L , the inequality () can be reduced to: According to lemma 3, we observe that 1 − 1 L ≤ 1 − ρ(A) 2 . Again this means we lift the left hand side and in order to satisfy the original upper bound, it suffices that the following inequality to be held: First, we need the following lemma. Lemma 5. Let us define κ * = max k κ(E¯k). κ * is upper bounded as κ * ≤ κ(A).
Proof : It is known that removing columns of a data matrix, the largest singular value decreases and the smallest singular value increases and as a result, the condition number of a given matrix reduces Queiró (1987). In addition, throughout iterations of Alg. 1, the data passes through projections on null-space of selected samples. Projection does not affect singular values but may make some zero. As a result, null-space projections are non-increasing operators on condition number. A series of column deletion and null-space projection therefore, lead to decreasing (non-increasing) condition number. Thus, κ * ≤ κ(A).
Proof of Theorem 2: The subspace of interest E¯k in Alg. 1 Line 5 (also as defined in Eq. (8)) is obtained by projection on the null-space of the previously selected samples. Therefore, due to null-space projection property at each iteration, every vector in the resulted sub-spaces after projection is orthogonal to the previously selected samples. In other words, if s ∈ Range(E¯k), then s ⊥ span{s 1 , ..., s k−1 }. The residual-based progression of Alg. 1 selects a novel sample to capture the highest information in projected version of dataset to the null-space of previously selected samples. Thus, representation errors in subsequent iterations fall in orthogonal sub-spaces and this fruitful linear algebraic property of Alg. 1 permits us to sum over error terms as in Line 9 of Alg. 1 in each iteration. The upper bound for such error terms is established in Theorem 1. Owing to orthogonality property, the Frobenius norm of selected subset representation error can be written as sum of iteration error terms to yield: where Z k 2 F is the minimum error value achieved in Line 9 of Alg. 1. Moreover, Ẽ K 2 F = A − π S (A) 2 F . Let EK show the best rank-1 representation error for Ek in Eq. (4). Z k can be upper bounded as (1 + ε k )EK following the same approach as in Theorem 1 where, To watch for the zero singular values appearing in subsequent iterations making the matrix E k singular, we truncate the nonzero singular values so that the condition number remains finite and meaningful. This requires a reduction in the number of columns in matrices used to select sample indices U¯k and E¯k. This can be done by deleting the index of previously selected sample leading to singularity after null-space projection.
Also, the denominator appearing in the lower bound for ε k is N − k (equal to the new reduced dimension which can be derived in a similar manner as observed in Theorem 1 for the k-th iteration).
Inserting κ * in (18), one can immediately find that the smallest possible coefficient ε K holds in: Using (19) and (17) the proof is completed immediately.
Two asymptotic cases are discussed in the main paper in Remark 1 and Remark 2. Fig. 9 shows the geometrical interpretation of these two special conditions. As it can be seen, for an isometric dataset, there is no preference for samples to be selected. In this scenario, projection error of selecting 1 sample is equal to projection error of the best rank-1 approximation. Thus, the tightest bound will be reached and = 0. Similarly, when the dataset is rank-1, there will be the same story since span of any sample will be equal to the span of best rank-1 approximation. Thus, = 0 which the tightest upper bound is touched. : Assume three points in a two dimensional space represented by a 2 × 3 matrix A = [a 1 ; a 2 ; a 3 ]. Two extreme cases of eigenvectors' orientation are shown where the self-representative approximation error tightly sticks to the best rank-1 approximation error according to Theorem 1 (ε = 0). In both cases, there is no preference on samples to be selected and any random selection provides the same projection error.