RobOMP: Robust variants of Orthogonal Matching Pursuit for sparse representations

Carlos A. Loza
Department of Mathematics, Universidad San Francisco de Quito, Quito, Ecuador

Submitted 5 January 2019. Accepted 12 April 2019. Published 13 May 2019.
Corresponding author: Carlos A. Loza, cloza@usfq.edu.ec. Academic editor: Mikael Skoglund.
DOI 10.7717/peerj-cs.192. Copyright 2019 Loza. Distributed under Creative Commons CC-BY 4.0.

ABSTRACT
Sparse coding aims to find a parsimonious representation of an example given an observation matrix or dictionary. In this regard, Orthogonal Matching Pursuit (OMP) provides an intuitive, simple and fast approximation of the optimal solution. However, its main building block is anchored on the minimization of the Mean Squared Error cost function (MSE). This approach is only optimal if the errors are distributed according to a Gaussian distribution without samples that strongly deviate from the main mode, i.e., outliers. If this assumption is violated, the sparse code will likely be biased and performance will degrade accordingly. In this paper, we introduce five robust variants of OMP (RobOMP) fully based on the theory of M-Estimators under a linear model. The proposed framework exploits efficient Iteratively Reweighted Least Squares (IRLS) techniques to mitigate the effect of outliers and emphasize the samples corresponding to the main mode of the data. This is done adaptively via a learned weight vector that models the distribution of the data in a robust manner. Experiments on synthetic data under several noise distributions and image recognition under different combinations of occlusion and missing pixels thoroughly detail the superiority of RobOMP over MSE-based approaches and similar robust alternatives. We also introduce a denoising framework based on robust, sparse and redundant representations that opens the door to potential further applications of the proposed techniques. The five different variants of RobOMP do not require parameter tuning from the user and, hence, constitute principled alternatives to OMP.

Subjects: Algorithms and Analysis of Algorithms, Artificial Intelligence, Computer Vision, Data Mining and Machine Learning, Data Science
Keywords: M-Estimation, Matching Pursuit, Representation-based classifier, Robust classification, Sparse representation, Outliers

INTRODUCTION
Sparse modeling is a learning framework with relevant applications in areas where parsimonious representations are considered advantageous, such as signal processing, machine learning, and computer vision. Dictionary learning, image denoising, image super-resolution, visual tracking and image classification constitute some of the most celebrated applications of sparse modeling (Aharon, Elad & Bruckstein, 2006; Elad & Aharon, 2006; Mallat, 2008; Wright et al., 2009; Elad, Figueiredo & Ma, 2010; Xu et al., 2011). Strictly speaking, sparse modeling refers to the entire process of designing and learning a model, while sparse coding, sparse representation, or sparse decomposition is an inference process: the estimation of the latent variables of such a model.
The latter formally emerged as a machine learning adaptation of the sparse coding scheme found in the mammalian primary visual cortex (Olshausen & Field, 1996). The sparse coding problem is inherently combinatorial and, therefore, intractable in practice. Thus, classic solutions involve either greedy approximations or relaxations of the original ℓ0-pseudonorm. Examples of the former family of algorithms include Matching Pursuit (MP) and all of its variants (Mallat & Zhang, 1993), while Basis Pursuit (Chen, Donoho & Saunders, 2001) and Lasso (Tibshirani, 1996) are the archetypes of the latter techniques. In particular, Orthogonal Matching Pursuit (OMP) is usually regarded as more appealing due to its efficiency, convergence properties, and simple, intuitive implementation based on iterative selection of the predictor most correlated with the current residual and batch update of the entire active set (Tropp & Gilbert, 2007).

The success of OMP is confirmed by the many variants proposed in the literature. Wang, Kwon & Shim (2012) introduced Generalized OMP (GOMP), where more than one predictor or atom (i.e., column of the measurement matrix or dictionary) is selected per iteration. Regularized OMP (ROMP) exploits a predefined regularization rule (Needell & Vershynin, 2010), while CoSaMP incorporates additional pruning steps to refine the estimate recursively (Needell & Tropp, 2009). The implicit foundation of the aforementioned variants—and, hence, of the original OMP—is optimization based on Ordinary Least Squares (OLS), which is optimal under a Mean Squared Error (MSE) cost function or, equivalently, a Gaussian distribution of the errors. Any deviation from such assumptions, e.g., outliers, impulsive noise or non-Gaussian additive noise, would result in biased estimations and performance degradation in general.

Wang, Tang & Li (2017) proposed Correntropy Matching Pursuit (CMP) to mitigate the detrimental effect of outliers in the sparse coding process. Basically, the Correntropy Induced Metric replaces the MSE as the cost function of the iterative active set update of OMP. Consequently, the framework becomes robust to outliers and impulsive noise by weighing the input samples according to a Gaussian kernel. The resulting non-convex performance surface is optimized via the Half-Quadratic (HQ) technique to yield fast, iterative approximations of local optima (Geman & Yang, 1995; Nikolova & Ng, 2005). Even though the algorithm is successful in alleviating the effect of outliers in practical applications, the main hyperparameter—the Gaussian kernel bandwidth—is chosen empirically with no theoretical validation. With this in mind, we propose a generalization of CMP by reformulating the active set update under the lens of robust linear regression; specifically, we exploit the well-known and developed theory of M-Estimators (Andersen, 2008; Huber, 2011) to devise five different robust variants of OMP: RobOMP. Each one utilizes validated hyperparameters that guarantee robustness up to theoretical limits.
In addition, the HQ optimization technique is reduced to the Iteratively Reweighted Least Squares (IRLS) algorithm in order to provide an intuitive and effective implementation while still enjoying the weighing nature introduced in CMP. For instance, Fig. 1 illustrates the estimated error in a 50-dimensional observation vector with a 10% rate of missing samples (set equal to zero). While Tukey-Estimator-based-OMP practically collapses the error distribution after 10 decompositions, the OMP counterpart still leaves a remnant that results in suboptimal sparse coding. Moreover, RobOMP provides an additional output that effectively weighs the components of the input space on a [0,1] scale. In particular, the missing samples are indeed assigned weights close to zero in order to alleviate their effect in the estimation of the sparse decomposition.

Figure 1: Illustration of the robustness of the proposed method. (A) y ∈ ℝ^50 constitutes an observation vector with five missing samples (set to zero, marked in red). (B) e_OMP^1 and (C) e_OMP^10 are the resulting errors after the first and tenth iteration of OMP (with corresponding box plots as insets), respectively. (D) x_OMP^10 is the final estimated sparse decomposition after 10 OMP iterations. Their RobOMP counterparts (Tukey estimator) (F–G) reduce the dynamic range of the errors more aggressively until almost collapsing to a delta distribution; this results in optimal sparse coding (H). (E) w_Tukey is the learned weight vector that assigns values close to one to samples around the main mode of the data and small weights to potential outliers (red marks). Number of iterative sparse decompositions equal to ground truth cardinality of the sparse active set, i.e., K = K0 = 10. Full-size DOI: 10.7717/peerjcs.192/fig-1

We present three different sets of results to validate the proposed robust, sparse inference framework. First, synthetic data with access to ground truth (support of the representation) highlights the robustness of the estimators under several types of noise, such as additive non-Gaussian densities and instance-based degradation (e.g., missing samples and impulsive noise). Then, a robust sparse representation-based classifier (RSRC) is developed for image recognition under missing pixels and occlusion scenarios. The results outperform the OMP-based variants and the CMP-based classifier (CMPC) for several cases. Lastly, preliminary results on image denoising via sparse and redundant representations over overcomplete dictionaries are presented with the hope of exploiting RobOMP in the future for image denoising under non-Gaussian additive noise.

The rest of the paper is organized as follows: Section 2 details the state of the art and related work concerning greedy approximations to the sparse coding problem. Section 3 introduces the theory, rationale, and algorithms regarding M-estimation-based Robust OMP: RobOMP. Section 4 details the results using synthetic data and popular digital image databases, while Section 5 discusses more in-depth technical concepts, analyzes the implications of the proposed framework, and offers potential further work. Lastly, Section 6 concludes the paper.
STATE OF THE ART AND RELATED WORK
Let y ∈ ℝ^m be a measurement vector with an ideal, noiseless, sparse representation, x0 ∈ ℝ^n, with respect to the measurement matrix (also known as dictionary), D ∈ ℝ^{m×n}. The matrix D is usually overcomplete, i.e., m < n, to promote sparse decompositions. In practice, y is affected by a noise component, n ∈ ℝ^m. This results in the following constrained, linear, additive model:

y = D x0 + n   s.t.   ||x0||_0 = K0        (1)

where K0 indicates the cardinality of the support of the sparse decomposition and ||·||_0 represents the ℓ0-pseudonorm, i.e., the number of non-zero components in x0. The sparse coding framework aims to estimate x0 given the measurement vector and matrix plus a sparsity constraint.

MSE-based OMP
Orthogonal Matching Pursuit (Tropp & Gilbert, 2007) attempts to find the locally optimal solution by iteratively estimating the atom in D most correlated with the current residual. In particular, OMP initializes the residual r_0 = y, the set containing the indices of the atoms that are part of the decomposition (the active set) Λ_0 = ∅, and the iteration k = 1. In the kth iteration, the algorithm finds the predictor most correlated with the current residual:

λ_k = argmax_{i∈Ω} |⟨r_{k−1}, d_i⟩|        (2)

where ⟨·,·⟩ denotes the inner product operator, d_i represents the ith column of D, and Ω = {1,2,...,n}. The resulting atom is added to the active set via Λ, i.e.:

Λ_k = Λ_{k−1} ∪ {λ_k}.        (3)

The next step is the major refinement of the original Matching Pursuit algorithm (Mallat & Zhang, 1993)—instead of updating the sparse decomposition one component at a time, OMP updates all the coefficients corresponding to the active set at once according to an MSE criterion:

x_k = argmin_{x∈ℝ^n, supp(x)⊂Λ_k} ||y − Dx||_2        (4)

where supp(x) is the support set of the vector x. Equation (4) can be readily solved via OLS or Linear Regression where the predictors are the columns of D indexed by Λ_k and the response is the measurement vector y. Stopping criteria for OMP typically include a set number of iterations or compliance with a set minimum error of the residual. In the end, the estimated sparse code, x, is set as the last x_k obtained. In practice, the true sparsity level, K0, is unknown and the total number of OMP iterations, K, is treated as a hyperparameter. For a detailed analysis regarding convergence and recovery error bounds of OMP, see Donoho, Elad & Temlyakov (2006).

A potential drawback of OMP is the extra computational complexity added by the OLS solver. Specifically, each incremental update of the active set affects the time complexity of the algorithm in a polynomial fashion: O(k²n + k³), where k is the current iteration. Generalized Orthogonal Matching Pursuit (Wang, Kwon & Shim, 2012) refines OMP by selecting N0 atoms per iteration. If the indices of the active set columns in the kth iteration are denoted as J_k[1], J_k[2], ..., J_k[N0], then J_k[j] can be defined recursively:

J_k[j] = argmax_{i∈Ω∖{J_k[1],...,J_k[j−1]}} |⟨r_{k−1}, d_i⟩|,   1 ≤ j ≤ N0        (5)

The index set {J_k[j]}_{j=1}^{N0} is then added to Λ_k and, like OMP, GOMP exploits an OLS solver to update the current active set. Both OMP and GOMP obtain locally optimal solutions under the assumption of Gaussianity (or Normality) of the errors. Yet, if such a restriction is violated (e.g., by the presence of outliers), the estimated sparse code, x, will most likely be biased.
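For reference, the following is a minimal sketch of the OMP iteration of Eqs. (2)–(4), using a fixed number of iterations K as the stopping criterion. It is not the authors' released implementation; function and variable names are illustrative.

```python
import numpy as np

def omp(y, D, K):
    """Minimal OMP sketch: K iterations with the MSE-based (OLS) active-set update."""
    m, n = D.shape
    residual = y.copy()
    active = []                                  # Lambda_k: indices of selected atoms
    x = np.zeros(n)
    for _ in range(K):
        # Eq. (2): atom most correlated with the current residual
        correlations = np.abs(D.T @ residual)
        correlations[active] = -np.inf           # guard against reselecting active atoms
        lam = int(np.argmax(correlations))
        active.append(lam)                       # Eq. (3)
        # Eq. (4): OLS update of all active coefficients at once
        Phi = D[:, active]
        beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        x = np.zeros(n)
        x[active] = beta
        residual = y - Phi @ beta
    return x
```

The RobOMP variants introduced below keep this greedy structure and replace the OLS update (and the plain residual computation) with a robust, IRLS-based M-estimate.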
CMP
The main drawback of MSE-based cost functions is their weighing nature in terms of the influence and importance assigned to the available samples. In particular, MSE considers every sample as equally important and assigns a constant weight equal to one to all the inputs. Wang, Tang & Li (2017) proposed exploiting Correntropy (Liu, Pokharel & Príncipe, 2007) instead of MSE as an alternative cost function in the greedy sparse coding framework. Basically, the novel loss function utilizes the Correntropy Induced Metric (CIM) to weigh samples according to a Gaussian kernel g_σ(t) = exp(−t²/2σ²), where σ, the kernel bandwidth, modulates the norm the CIM will mimic; e.g., for small σ, the CIM behaves similarly to the ℓ0-pseudonorm (aggressive non-linear weighing), if σ increases, the CIM will mimic the ℓ1-norm (moderate linear weighing), and, lastly, for large σ, the resulting cost function defaults to MSE, i.e., constant weighing for all inputs. The main conclusion here is that the CIM, unlike MSE, is robust to outliers for a principled choice of σ. This outcome easily generalizes to non-Gaussian environments with long-tailed distributions of the errors.

Correntropy Matching Pursuit (CMP) exploits the CIM robustness to update the active set in the sparse coding solver. The algorithm begins in a similar fashion as OMP, i.e., r_0 = y, Λ_0 = ∅, and k = 1. Then, instead of the MSE-based update of Eq. (4), CMP proceeds to minimize the following CIM-based expression:

x_k = argmin_{x∈ℝ^n, supp(x)⊂Λ_k} L_σ(y − Dx)        (6)

where L_σ(e) = (1/m) Σ_{i=1}^m σ²(1 − g_σ(e[i])) is the simplified version (without constant terms independent of e) of the CIM loss function and e[i] is the ith entry of the vector e. The Half-Quadratic (HQ) technique is utilized to efficiently optimize the invex CIM cost function (Geman & Yang, 1995; Nikolova & Ng, 2005). The result is a local minimum of Eq. (6) alongside a weight vector w that indicates the importance of the components of the observation vector y:

w^(t+1)[i] = g_σ(y[i] − (Dx^(t))[i]),   i = 1,2,...,m        (7)

where t is the iteration in the HQ subroutine. In short, the HQ optimization performs block coordinate descent to separately optimize the sparse code, x, and the weight vector, w, in order to find local optima. The hyperparameter σ is iteratively updated without manual selection according to the following heuristic:

σ^(t+1) = ( (1/2m) ||y − Dx^(t+1)||_2^2 )^{1/2}.        (8)

In Wang, Tang & Li (2017), the authors thoroughly illustrate the advantage of CMP over many MSE-based variants of OMP when dealing with non-Gaussian error distributions and outliers in computer vision applications. And even though they mention the improved performance of the algorithm when σ is iteratively updated versus manual selection scenarios, they fail to explain the particular heuristic behind Eq. (8) or its statistical validity. In addition, the HQ optimization technique is succinctly reduced to a weighted Least Squares problem that can be solved explicitly. Therefore, more principled techniques that exploit weighted Least Squares and robust estimators for linear regression can readily provide the needed statistical validity while, at the same time, generalizing the concepts of CMP under the umbrella of M-estimators.
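The reweighting step of CMP can be summarized in a few lines. The sketch below, assuming the current sparse code estimate x is available from the previous HQ iteration, computes the Gaussian-kernel weights of Eq. (7) with the bandwidth heuristic of Eq. (8); names are illustrative and do not correspond to any released implementation.

```python
import numpy as np

def cmp_weights(y, D, x):
    """One HQ reweighting step of CMP: kernel bandwidth via the heuristic of
    Eq. (8) and per-sample Gaussian-kernel weights via Eq. (7)."""
    e = y - D @ x                                    # current error vector
    m = y.shape[0]
    sigma = np.sqrt(np.sum(e ** 2) / (2.0 * m))      # Eq. (8)
    sigma = max(sigma, 1e-12)                        # guard against a zero bandwidth
    w = np.exp(-(e ** 2) / (2.0 * sigma ** 2))       # Eq. (7): g_sigma(e[i])
    return w, sigma
```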
ROBUST ORTHOGONAL MATCHING PURSUIT
MSE-based OMP appeals to OLS solvers to optimize Eq. (4). In particular, let Φ ∈ ℝ^{m×k} correspond to the active atoms in the dictionary D at iteration k, i.e., Φ = [d_{Λ_k[1]}, d_{Λ_k[2]}, ..., d_{Λ_k[k]}], and let β ∈ ℝ^k be the vector of coefficients that solves the following regression problem:

y = Φβ + e        (9)

where e is an error vector with independent components identically distributed according to a zero-mean Normal density (e[i] ∼ N(0,σ²)). Then, the least squares regression estimator, β̂, is the maximum likelihood estimator for β under a Gaussian density prior, i.e.:

β̂ = argmax_β ∏_{i=1}^m (1/√(2πσ²)) exp(−e[i]²/(2σ²)) = argmax_β ∏_{i=1}^m (1/√(2πσ²)) exp(−(y[i] − (Φβ)[i])²/(2σ²))        (10)

which is equivalent to maximizing the logarithm of Eq. (10) over β:

β̂ = argmax_β Σ_{i=1}^m ( −(1/2) ln(2πσ²) − e[i]²/(2σ²) ) = argmin_β Σ_{i=1}^m e[i]²/2.        (11)

Since σ is assumed constant, β̂ is the estimator that minimizes the sum of squares of the errors, or residuals. Hence, the optimal solution is derived by classic optimization theory, giving rise to the well-known normal equations and OLS estimator:

Σ_{i=1}^m e[i]² = eᵀe = (y − Φβ)ᵀ(y − Φβ) = yᵀy − yᵀΦβ − βᵀΦᵀy + βᵀΦᵀΦβ.

At the minimum:

∂/∂β Σ_{i=1}^m e[i]² = 0 = ∂/∂β (yᵀy − yᵀΦβ − βᵀΦᵀy + βᵀΦᵀΦβ) = 0 − Φᵀy − Φᵀy + 2(ΦᵀΦ)β.

Consequently, when ΦᵀΦ is non-singular, the optimal estimated coefficient vector has a closed-form solution equal to:

β̂_OLS = β̂ = (ΦᵀΦ)^{-1}Φᵀy        (12)

which is optimal under a Gaussian distribution of the errors. If such an assumption is no longer valid due to outliers or non-Gaussian environments, M-Estimators provide a suitable alternative to the estimation problem.

M-Estimators
If the errors are not normally distributed, the estimator of Eq. (12) will be suboptimal. Hence, a different function is utilized to model the statistical properties of the errors. Following the same premises of independence and equivalence of the optimum under the log-transform, the following estimator arises:

β̂_M-Est = argmin_β Σ_{i=1}^m ρ(e[i]/s) = argmin_β Σ_{i=1}^m ρ((y[i] − (Φβ)[i])/s)        (13)

where ρ(e) is a continuous, symmetric function (also known as the objective function) with a unique minimum at e = 0 (Andersen, 2008). Clearly, ρ(e) reduces to half the sum of squared errors for the Gaussian case. s is an estimate of the scale of the errors in order to guarantee scale-invariance of the solution. The usual standard deviation cannot be used for s due to its non-robustness; thus, a suitable alternative is usually the "re-scaled MAD":

s = 1.4826 × MAD        (14)

where the MAD (median absolute deviation) is highly resistant to outliers with a breakdown point (BDP) of 50%, as it is based on the median of the errors (ẽ) rather than their mean (Andersen, 2008):

MAD = median|e[i] − ẽ|.        (15)

The re-scaling factor of 1.4826 guarantees that, for large sample sizes and e[i] ∼ N(0,σ²), s reduces to the population standard deviation (Hogg, 1979).
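As a small sketch of Eqs. (14)–(15), the re-scaled MAD can be computed as follows (illustrative code, not taken from the paper's implementation):

```python
import numpy as np

def rescaled_mad(e):
    """Robust scale estimate s of Eqs. (14)-(15): 1.4826 times the median
    absolute deviation of the residuals around their median."""
    med = np.median(e)                 # e~: median of the errors
    mad = np.median(np.abs(e - med))   # Eq. (15)
    return 1.4826 * mad                # Eq. (14)
```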
M-Estimation then, like OLS, finds the optimal coefficient vector via partial differentiation of Eq. (13) with respect to each of the k parameters in question, resulting in a system of k equations:

Σ_{i=1}^m Φ_{ij} ψ((y[i] − φᵢᵀβ)/s) = Σ_{i=1}^m Φ_{ij} ψ(e[i]/s) = 0,   j = 1,2,...,k        (16)

where φᵢ represents the ith row of the matrix Φ, while Φ_{ij} accesses the jth component of the ith row of Φ. ψ(e[i]/s) = ∂ρ/∂(e[i]/s) is known as the score function, while the weight function is derived from it as:

w[i] = w(e[i]/s) = ψ(e[i]/s) / (e[i]/s).        (17)

Substituting Eq. (17) into Eq. (16) results in:

Σ_{i=1}^m Φ_{ij} w[i] e[i]/s = Σ_{i=1}^m Φ_{ij} w[i](y[i] − φᵢᵀβ)(1/s) = 0,   j = 1,2,...,k
Σ_{i=1}^m Φ_{ij} w[i] φᵢᵀβ = Σ_{i=1}^m Φ_{ij} w[i] y[i],   j = 1,2,...,k        (18)

which can be succinctly reduced in matrix form as:

ΦᵀWΦβ = ΦᵀWy        (19)

by defining the weight matrix, W, as a square diagonal matrix with non-zero elements equal to the entries in w, i.e., W = diag({w[i] : i = 1,2,...,m}). Lastly, if ΦᵀWΦ is well-conditioned, the closed-form solution of the robust M-Estimator is equal to:

β̂_M-Est = (ΦᵀWΦ)^{-1}ΦᵀWy.        (20)

Equation (20) resembles its OLS counterpart (Eq. (12)), except for the addition of the matrix W that weights the entries of the observation vector and mitigates the effect of outliers according to a linear fit. A wide variety of objective functions (and, in turn, weight functions) have been proposed in the literature (for a thorough review, see Zhang (1997)). For the present study, we will focus on the five different variants detailed in Table 1.

Table 1: Comparison between the OLS estimator and M-Estimators. Objective ρ(e) and weight w(e) functions of the OLS solution and five different M-Estimators. For M-Estimators, error entries are standardized, i.e., divided by the scale estimate, s. Each robust variant comes with a hyperparameter c. The exemplary plots in the last column of the original table use the optimal hyperparameters detailed in Table 2.

  Type     ρ(e)                                                       w(e)
  OLS      e²/2                                                       1
  Cauchy   (c²/2) log(1 + (e/c)²)                                     1/(1 + (e/c)²)
  Fair     c²(|e|/c − log(1 + |e|/c))                                 1/(1 + |e|/c)
  Huber    e²/2 if |e| ≤ c;   c(|e| − c/2) if |e| ≥ c                 1 if |e| ≤ c;   c/|e| if |e| ≥ c
  Tukey    (c²/6)(1 − (1 − (e/c)²)³) if |e| ≤ c;   c²/6 if |e| > c    (1 − (e/c)²)² if |e| ≤ c;   0 if |e| > c
  Welsch   (c²/2)(1 − exp(−(e/c)²))                                   exp(−(e/c)²)

Every M-Estimator weighs its entries according to a symmetric, decaying function that assigns large weights to errors in the vicinity of zero and small coefficients to gross contributions. Consequently, the estimators downplay the effect of outliers and samples, in general, that deviate from the main mode of the residuals.

Solving the M-Estimation problem is not as straightforward as in the OLS counterpart. In particular, Eq. (20) assumes the optimal W is readily available, which, in turn, depends on the residuals, which, again, depend on the coefficient vector. In short, the optimization problem for M-Estimators can be posed as finding both β̂_M-Est and ŵ_M-Est that minimize Eq. (13). Like Half-Quadratic, the common approach is to perform block coordinate descent on the cost function with respect to each variable individually until local optima are found. In the robust regression literature, this optimization procedure is the well-known Iteratively Reweighted Least Squares or IRLS (Andersen, 2008). The procedure is detailed in Algorithm 1. In particular, the routine runs either for a fixed number of iterations or until the estimates change by less than a selected threshold between iterations.
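To make the weight functions of Table 1 and the IRLS loop concrete, the sketch below implements them with the 95%-efficiency constants of Table 2 as defaults and a fixed iteration budget in place of a convergence test. It mirrors Algorithm 1 (reproduced further below) under those assumptions; it is not the authors' implementation, and the names are illustrative.

```python
import numpy as np

# Weight functions w(e) from Table 1 (inputs are residuals already divided by
# the scale s), with the 95%-efficiency constants of Table 2 as defaults.
def w_cauchy(e, c=2.385):
    return 1.0 / (1.0 + (e / c) ** 2)

def w_fair(e, c=1.4):
    return 1.0 / (1.0 + np.abs(e) / c)

def w_huber(e, c=1.345):
    return np.where(np.abs(e) <= c, 1.0, c / np.maximum(np.abs(e), 1e-12))

def w_tukey(e, c=4.685):
    return np.where(np.abs(e) <= c, (1.0 - (e / c) ** 2) ** 2, 0.0)

def w_welsch(e, c=2.985):
    return np.exp(-(e / c) ** 2)

def irls(y, Phi, w_fn=w_tukey, n_iter=25):
    """IRLS sketch in the spirit of Algorithm 1: alternate the weighted
    least-squares estimate of Eq. (20) with the M-estimator weights of
    Eq. (17), using the re-scaled MAD of Eq. (14) as the scale."""
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # OLS initialization
    w = np.ones_like(y)
    for _ in range(n_iter):
        e = y - Phi @ beta
        s = 1.4826 * np.median(np.abs(e - np.median(e)))   # re-scaled MAD
        s = max(s, 1e-12)                                   # guard against zero scale
        w = w_fn(e / s)
        W = np.diag(w)
        beta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)   # Eq. (20)
    return beta, w
```

In RobOMP, a routine of this kind replaces the OLS update of Eq. (4) inside each greedy iteration, as described next.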
The main hyperparameter is the choice of the robust M-Estimator alongside its corresponding parameter c. However, it is conventional to select the value that provides a 95% asymptotic efficiency on the standard Normal distribution (Zhang, 1997). Throughout this work, we exploit such optimal values to avoid parameter tuning by the user (see Table 2). In this way, the influence of outliers and non-Gaussian errors is expected to be diminished in the OMP update stage of the coefficients corresponding to the active set.

Table 2: Optimal hyperparameter c of M-Estimators according to a 95% asymptotic efficiency on the standard Normal distribution.

  Cauchy   Fair   Huber   Tukey   Welsch
  2.385    1.4    1.345   4.685   2.985

Algorithm 1: IRLS-based M-Estimation

  function IRLS(y ∈ ℝ^m, Φ ∈ ℝ^{m×k}, w_c(u))            ▷ weight function w(u) with hyperparameter c
      t ← 0
      β^(0) = β_OLS ← (ΦᵀΦ)^{-1}Φᵀy                       ▷ OLS initialization
      e^(0) ← y − Φβ^(0)
      MAD ← median|e^(0)[i] − ẽ^(0)|
      s^(0) ← 1.4826 × MAD
      w^(0)[i] ← w_c(e^(0)[i]/s^(0)),  i = 1,2,...,m      ▷ initial weight vector
      W^(0) ← diag(w^(0))
      t ← 1
      while NO CONVERGENCE do
          β^(t) ← (ΦᵀW^(t−1)Φ)^{-1}ΦᵀW^(t−1)y             ▷ block coordinate descent
          e^(t) ← y − Φβ^(t)
          MAD ← median|e^(t)[i] − ẽ^(t)|
          s^(t) ← 1.4826 × MAD
          w^(t)[i] ← w_c(e^(t)[i]/s^(t)),  i = 1,2,...,m  ▷ block coordinate descent
          W^(t) ← diag(w^(t))
          t ← t + 1
      return β̂_M-Est ← β^(t),  ŵ_M-Est ← w^(t)            ▷ final estimates

M-Estimators-based OMP
Here, we combine the ideas behind greedy approximations to the sparse coding problem and robust M-Estimators; the result is RobOMP or Robust Orthogonal Matching Pursuit. We propose five variants based on five different M-Estimators (Table 1). We refer to each RobOMP alternative according to its underlying M-Estimator; for instance, Fair-Estimator-based-OMP is simply referred to as Fair. As with OMP, the only hyperparameter is the stopping criterion: either K, the maximum number of iterations (i.e., sparseness of the solution), or ε, defined as a threshold on the error norm. For completeness, Algorithm 2 details the RobOMP routine for the case of a set maximum number of iterations (the case involving ε is straightforward). Three major differences are noted with respect to OMP:

1. The robust M-Estimator-based update stage of the active set is performed via IRLS,
2. The updated residuals are computed considering the weight vector ŵ_k from IRLS, and
3. The weight vector constitutes an additional output of RobOMP.

The last two differences are key for convergence and interpretability, respectively. The former guarantees shrinkage of the weighted error in its first and second moments, while the latter provides an intuitive, bounded, m-dimensional vector capable of discriminating between samples from the main mode and potential outliers at the tails of the density.

Algorithm 2: RobOMP

  function RobOMP(y ∈ ℝ^m, D ∈ ℝ^{m×n}, w_c(u), K)
      k ← 1                                               ▷ initializations
      r_0 ← y
      Λ_0 ← ∅
      while k