- Email: [email protected]

Contents lists available at SciVerse ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Classiﬁcation approach based on non-negative least squares Yifeng Li n, Alioune Ngom School of Computer Science, 5115 Lambton Tower, University of Windsor, 401 Sunset Avenue, Windsor, Ontario, Canada N9B 3P4

a r t i c l e i n f o

abstract

Article history: Received 21 June 2012 Received in revised form 22 November 2012 Accepted 6 February 2013 Communicated by G. Thimm

A non-negative least squares classiﬁer is proposed in this paper for classifying under-complete data. The idea is that unknown samples can be approximated by sparse non-negative linear combinations of few training samples. Based on sparse coefﬁcient vectors representing the training data, a sparse interpreter can then be used to predict the class label. We have devised new sparse methods which can learn data containing missing value, which can be trained on over-complete data, and which also apply to tensor data and to multi-class data. Permutation test shows that our approach requires a small number of training samples to obtain signiﬁcant accuracy. Statistical comparisons on various data shows that our methods perform as well as support vector machines while being faster. Our approach is very robust to missing values and noise. We also show that with appropriate kernel functions, our methods perform very well on three-dimensional tensor data and run fairly fast. & 2013 Elsevier B.V. All rights reserved.

Keywords: Sparse representation Non-negative matrix factorization Non-negative least squares Classiﬁcation Kernel approach Missing values

1. Introduction Various large-scale biological data have been generated within the last decade. Typical such high-throughput data are microarray data [1] and mass spectrometry data [2]. Their common issue is that they contains very few number of samples compared to the large features; that is, they are under-complete data as opposed to overcomplete data where the number of samples is (much) larger than the number of features. In medical research, the expression levels of genes with respect to biological samples can now be monitored synchronically over a series of time-points. This corresponds to a three-dimensional (3D) dataset, termed gene-sample-time (GST) microarray data [1,3], which can be viewed as a collection of gene-sample data over a series of time-points, or a collection of gene-time data across some samples. These are examples of 3-order tensor data [4]. An order-n tensor data is a vector having n directions, for example, a vector, a matrix, and a cube are tensor of orders 1, 2, and 3, respectively. Analyzing (i.e., classifying or clustering) tensor data of order three or more is more challenging due to the extra number of dimensions. High through-put data provide much richer information for biologists, however they are quite challenging to mine and analyze due to their low sample size and their large number of (redundant or irrelevant) features. The issues derived from

n

Corresponding author. Tel.: þ1 519 253 3000 x3789 E-mail addresses: [email protected], [email protected] (Y. Li), [email protected] (A. Ngom). URLS: http://cs.uwindsor.ca/~li11112c (Y. Li), http://cs.uwindsor.ca/~angom (A. Ngom).

analyzing the undercomplete data are called ‘‘the curse of dimensionality’’. Many classiﬁers confront with problems when classifying such data. Some classiﬁers do not theoretically work at all, for example, Gaussian Bayesian classiﬁer [5] might fail to calculate the inverse of singular covariance matrix. Some are practically expensive to execute, such as decision tree [6] and neural networks [7]. If an inductive model is not consistent, then it would result in overﬁtting, that is poor capability of generalization [8]. One solution to the curse of dimensionality is dimension reduction which includes feature extraction and feature selection. Principle component analysis (PCA) [9] and independent component analysis (ICA) [10] are among the most popular dimension reduction techniques. Non-negative matrix factorization (NMF) has been intensively studied since the work of Lee and Seung [11]. For example, NMF is studied as feature extraction and feature selection methods in Li and Ngom [12]. Alternatively, sophisticated models have been directly applied to classifying high-dimensional data, of which the most famous family being the basis-expanded linear models which will be introduced in Section 3. Noise is another issue which cannot be avoided completely and can result in overﬁtting issues if a model is not robust. Sparse representation (SR) [13,14] has been proven very robust to noise and redundance. It has been used to classify under-complete and overcomplete data. The sparse representation classiﬁcation (SRC) approach has been reported to be robust to random pixel corruption [15] in image processing. For instance, on a facial dataset where even the test images are 70% corrupted, SRC surprisingly obtained an accuracy of over 90.7%. Besides, sparse representation has been also applied successfully as a de-noising method [16]. Please see Section 3 for details. High-throughput

0925-2312/$ - see front matter & 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.neucom.2013.02.012

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

2

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

data often contain missing values and many classiﬁers are not devised for such to handle incomplete data well. Bayesian network based prediction methods become extremely slow when dealing with missing values since expectation maximization (EM) strategy is used [17,18] to deal with the missing values. Decision trees can deal with missing values, but may fail beyond a certain percentage of missing values in the data. In this paper, we propose a novel non-parametric classiﬁcation method which directly classiﬁes the original under-complete data. The sparse codes of new samples are generated via nonnegative constraints instead of the l1-norm minimization. Our approach can efﬁciently predict the class labels of a set of unknown samples in parallel. We propose a robust weighting strategy to handle incomplete data (i.e., data containing missing values). Our approach can naturally work on multi-class data. Our optimization algorithms use only the inner products of samples and therefore are dimension-free. Applying this property, kernel version of our approach is proposed to classify over-complete and tensor data. The rest of this paper is organized as follows: The mathematical notations are given in Section 2. Related works are brieﬂy introduced in Section 3. Our approach is described in Section 4. Computational experiments over various datasets and discussions are reported in Section 5. Finally, conclusions and suggestions for future research are given in Section 6.

2. Notations Hereafter we shall use the notations as listed below, unless otherwise noted.

(1) A scalar is denoted by an italic lowercase or upper case letter, for example, a and A. (2) A matrix is denoted by a bold capital letter, for example, Y. (3) A (column) vector is denoted by a bold lowercase letter, for example, y. A bold lowercase letter with a subscript denotes a column vector of a matrix. The colon notation, ‘‘:’’, in the subscript of a matrix denotes taking all elements in the corresponding order. For example, yi ¼ Y :,i is the i-th column vector in matrix Y. The j-th row of Y is Y j,: . An italic lowercase letter with two subscripts represents a scalar element of the corresponding matrix. For example, yij is the (i,j)-th scalar element of matrix Y. 4. A tensor is denoted by a boldface Euler script [4], for example, X. 5. X ðnÞ denotes the matrix obtained through the mode-n matricization of the tensor X [4]. Columns of X ðnÞ are the mode-n ﬁbers of tensor X . A mode-n ﬁber is a vector deﬁned through ﬁxing every index but the n-th index. This is the extension of matrix row and column in tensor algebra. X ð1Þ therefore denotes the matrix of size I JK, unfolded in mode-1 of X A RIJK , that is, X ð1Þ ¼ ½X ð1Þ1 ,X ð1Þ2 , . . . ,X ð1ÞK . 6. The mode-n product of a tensor X and a matrix A [4], written P as X n A, is: ðX n AÞi1 in1 jin þ 1 iN ¼ Iinn ¼ 1 xi1 i2 iN ajin , where I1 I2 IN JIn X AR and A A R . This results in a tensor Y A RI1 In1 JIn þ 1 IN . 7. A set is denoted by an uppercase calligraphic letter, for example, S. If the set elements are also sets, bold font is used, for example, S ¼ ff1,2g,f3,4gg. 8. A set containing ﬁnite number of continuous integers is deﬁned by colon notation. For example, N ¼ f1,2, . . . ,ng ¼ f1 : ng. 9. A matrix can be indexed by a set. For example, Y C ¼ Y :,C denotes the columns of Y indexed by set C and Y R,: denotes the rows of Y indexed by R.

P 10. l1 vector norm is deﬁned as JxJ1 ¼ m i ¼ 1 9xi 9. l2 vector norm is qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ﬃ pﬃﬃﬃﬃﬃﬃﬃﬃ Pm 2¼ T x. And the Frobenius x deﬁned as JxJ2 ¼ x i¼1 i norm of a m n matrix qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ Pm Pn 2 trðX T XÞ. i¼1 j ¼ 1 xij ¼

is

deﬁned

as

JXJF ¼

3. Related works In this section, we survey existing approaches addressing part of the issues mentioned in Section 1. First, we shall introduce NMF and the classiﬁcation approaches devised for undercomplete data. We then review sparse representation methods which are robust to noise. Finally, we compared three strategies for handling missing values. 3.1. Non-negative matrix factorization for dimension reduction Non-negative matrix factorization (NMF) has been intensively studied since the work of of Lee and Seung [11]. NMF decomposes a non-negative matrix A into two low-rank non-negative factors: Amn W mk H kn , where k rminfm,ng. In the Euclidean space, its optimization can be formulated as 1 min JAWHJ2F W,H 2

s:t: W,H Z0:

ð1Þ

In machine learning, if each column of A is a training sample, then each training sample is approximated by a non-negative linear P combination of the columns of W, that is, ai kj ¼ 1 wj hji ¼ Whi . Each column of W is called a basis vector or metasample. Thus W is called the basis matrix and H is the coefﬁcient matrix. Semi-NMF [19] allows negative values in A and drops the non-negative constraints on W. NMF has been applied as feature extraction method as well as a clustering approach. The key difference between NMF and techniques like PCA and ICA is that, its basis vectors are non-orthogonal and its coefﬁcient matrix is non-negative and usually sparse. The largest coefﬁcient helps us to identify the basis vector (or cluster prototype) most related to a sample. 3.2. Basis-expanded linear model and kernel trick The general linear model for classiﬁcation can be formulated as f ðxÞ ¼ signðwT fðxÞ þ bÞ where fðÞ is a mapping function and b is a bias. Given n training samples D A Rmn and the corresponding class information c, its risk minimization can be expressed as minw rðwÞ þ C xðw,D,cÞ, where rðwÞ is the regularization term, xðw,D,cÞ is a loss function, and C is the trade-off parameter. The ﬁrst term is to increase the generalization of the model, whereas the second term is to decrease the empirical classiﬁcation error. Most linear models can be kernelized as long as the optimization and prediction use only the inner products between the samples. The kernel trick can make the optimization dimension-free. An appropriate kernel can linearize complex patterns. When rðwÞ is the l2-norm and the loss function is the hinge loss, then this is the well-known support vector machine (SVM) [20]. It is well-known that SVM works well on both under-complete and over-complete data given proper kernels. One of the potential weaknesses of linear models is that they reduce the multi-class problem into many binary-class ones, which becomes inefﬁcient as the number of class increases. 3.3. Independent modelling for multi-class data Soft independent modelling of class analogies (SIMCA) [21] is a multi-class method that measures the (dis)similarity between a

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

new sample to each class independently and assigns the new sample to the closest class. A popular dissimilarity measure is the regression residual. Unlike SIMCA which uses PCA to ﬁnd the basis vectors of each class, linear regression classiﬁcation (LRC) [22] directly uses all the training samples within a class to compute the ordinary-least-squares regression residual. LRC only works under the condition that each classes must be under-complete. Moreover, due to independence, it may not work well in the situation that the class distributions are entangled. 3.4. Sparse representation Sparse representation [13] is a principle of parsimony which states that a new signal is approximated by a sparse superposition of basis functions. Suppose b is a m-dimensional signal, sparse P representation can be formulated as b ki ¼ 1 ai xi ¼ Ax, where A ¼ ½a1 , . . . ,ak is called a dictionary and ai is a dictionary atom (basis function or basis vector). Sparse representation includes sparse coding and dictionary learning. Given a new signal b and the dictionary A, learning the sparse coefﬁcient x is a procedure of sparse coding. Given training data B, learning the dictionary from data is called dictionary learning. If m 5k, A is an over-complete dictionary. This means that b ¼ Ax is under-determined and may have inﬁnite number of solutions. The optimal sparse coefﬁcient vector x is the optimal solution to the problem, minJxJ0 x

s:t: b ¼ Ax,

P where JxJ0 ¼ ni¼ 1 dðxi Þ and dðxÞ is 1 if x a0, and 0 otherwise. The constraint can be relaxed to JbAxJ22 r E, where E is a small positive tolerance value. Unfortunately, this optimization is NP-hard [23]. Basis pursuit (BP) [24] is one of the heuristic approximation methods in which l1-norm minimization is performed as minJxJ1 x

s:t: b ¼ Ax,

or the version allowing small relaxation minJxJ1 x

s:t:JbAxJ22 r E:

ð2Þ

An example of solvers to this problem is l1-magic [25]. If m bk, then A is under-complete and there is no solution to the over-determined b ¼ Ax. The optimization task reduces then to ﬁnd a sparsest coefﬁcient while minimizing the error JbAxJ22 . The l1 -regularized approximation of this problem is transformed into a linear combination of the objectives, 1 min JbAxJ22 þ lJxJ1 : x 2

ð3Þ

Sparse representation has been applied in some ﬁelds such as computer vision [26,27], machine learning [28], and bioinformatics [29,30]. Sparse representation classiﬁcation (SRC1, we added the numerical postﬁx to indicate that there exists other SRC approaches) approach is proposed for face recognition [15]. SRC1 constructs over-complete dictionary from the training samples after dimension reduction, then applies l1-norm minimization, as in Eq. (2), to pursue sparse coefﬁcient vectors. It has been shown that SRC1 has state-of-the-art classiﬁcation performance and is robust to noise. One of the beneﬁts of sparsity is that it encodes a good semantic and discriminative information which facilitates interpretation and classiﬁcation. SRC1 is a multi-class method. The way of pooling all training samples from all classes into a dictionary may works better than the independent modelling in the case of cluttered distributions. However, in its application to classifying high-dimensional data, some dimension reduction techniques (for example, downsampling) has to be employed to make sure the dictionary is over-complete.

3

SRC2 based on Eq. (3) has been applied to classify highdimensional gene expression data by Hang and Wu [29]. SRC2 works on under-complete dictionary. SRC2 can be solved by l1_ls [31]. One of the drawback of this algorithm is that the running time grows dramatically as the number of dictionary atoms increases [29]. Also, the value of parameter l depends on speciﬁc tasks and hence has to be optimized. This is an expensive procedure. Moreover, given a set of new samples, the queries are sequential, which also aggravates the computational burden. Dictionary learning [13] is an interesting topic in sparse representation to extract hidden patterns from data. The beneﬁts of dictionary learning are that (1) the number of dictionary elements learned can be much smaller than the number of training samples, which expedites the sparse coding, and (2) prediction performance might be improved on the learned dictionary. Aiming at reducing the running time while increasing the classiﬁcation performance of SRC2, Zheng et al. [30] proposed a metasample-based sparse representation classiﬁcation (MSRC) approach. A metasample is a basis vector. A set of metasamples is extracted for each class. Metasamples can be extracted by NMF or SVD. These sets of metasamples of all classes are concatenated as a holistic under-complete dictionary. After learning the dictionary A, for a new sample, MSRC optimizes Eq. (3) to obtain the corresponding sparse coefﬁcient vector. This sample is assigned to a class using the nearest-subspace rule. Besides selecting the trade-off between the least squares error and the l1-norm regularization in the objective, the model selection introduced in the dictionary learning step is also time-consuming and degrades dramatically as the number of classes increases. 3.5. Handling incomplete data There are three strategies to handle the issue of missing values. First, we can remove the features or samples with missing values. However, the main drawback is that the already small sample size becomes smaller and we may run the risk of having not enough data for learning a satisfactory and consistent model. The second strategy is to impute (that is, to ﬁll-in) the missing values in the data [32]. Missing values can either be imputed by a constant value (the most common value is 0 which corresponds to 0-imputation) or features averages, or alternatively they can be estimated by some statistical, or supervised or unsupervised learning methods. Expectation maximization (EM) and k-nearest neighbor (k-NN) are two of the most well-known estimation methods [17,32]. This strategy essentially completes an incomplete dataset, and hence, avoids deleting features or samples. The time complexity of making data complete before learning through this way is usually much lower than that of hybridizing EM and a learning method. Nevertheless, the issue of imputation is that false data can be introduced. The situation can even be worse if the missing data do not actually exist. Moreover, the larger the missing rate is, the more difﬁcult for an estimation method to execute. The third strategy is that the present values are only allowed to be used in learning and prediction. 3.6. Single-left-hand-side non-negative least squares We will apply an active-set algorithm to solve a variant of the multiple-left-hand-side non-negative least squares (MLHS-NNLS) in Section 4.3. For the sake of understanding, we ﬁrst describe the active-set algorithm for the single-left-hand-side non-negative least squares (SLHS-NNLS, Eq. (5)). Active-set algorithms solving NNLS problems have been developing since 1970s. The standard algorithm was proposed by Lawson and Hanson [33] to solve the SLHS-NNLS problem which is shown in Algorithm 1. For narrative convenience, we call this algorithm LH-NNLS algorithm.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

4

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

Algorithm 1. LH-NNLS optimization algorithm. Input: matrix Amn , vector bm1 Output: vector x ¼ arg min 12 JbAxJ22 s.t. x Z0 x’0; R’fi9i ¼ 1 : ng; initialize the active set P’|; initialize the passive set

l’AT bAT Ax; the lagrange multiplier E’ð0,1; small positive tolerance value while R a | and maxi A R ðmi Þ 4 E do j’arg maxi A R ðmi Þ; index with maximal positive multiplier P’P [ fjg; R’R\fjg; t P ’ðATP AP Þ1 ATP b; t R ’0; while mint P r0 do a’mini A P,ti r 0 xixti i ; i ; many indices for a K’arg mini A P,ti r 0 xi xt i

x’x þ aðtxÞ; P’P\K; R’R [ K; t P ’ðATP AP Þ1 ATP b; t R ’0; end while x’t;

l’AT bAT Ax; end while

This algorithm iteratively searches the true active set and the passive set. The true active set and the passive set are, respectively, the set of variables whose values lie on the border of a feasible region and the set of variables whose values are inside the feasible region. LH-NNLS is a rigorous algorithm which can converge to optimal solution. The convergence is proved in the following. The Lagrangian function for the LH-NNLS problem in Algorithm 1 is Lðx, lÞ ¼ 1=2JbAxJ22 þ lT x: The corresponding Karush–Kuhn–Tucker (KKT) conditions are 8 T > < A ðAxbÞ þ l ¼ 0 lnx¼0 , ð4Þ > : l r 0, x Z0 where n is the component-wise multiplication operator. When the algorithm terminates, lR o0 and lP ¼ 0, where R and P are the true active set and passive set, respectively. That is, 8 < l ¼ ATR bR ATR AR xR o 0, R : lP ¼ ATP bP ATP AP xP ¼ 0: This means that the KKT conditions in Eq. (4) are satisﬁed. The Hessian of Lðx, lÞ is AT A which is positive deﬁnite. Also because lR o0, the solution returned by LH-NNLS is a strict local minimum. Since the objective function and the feasible region are convex, this solution is thus a global minimum [34]. In the optimal solution, the values corresponding to the active set are set to 0 and the remaining values are the optimal solution to the unconstrained least squares subproblem.

4. Proposed method In this section, we ﬁrst propose our classiﬁcation approach based on non-negative sparse coding. We then give a numerical example to demonstrate how our method works. Next, we address the

optimization of our method. Thereafter, we investigate the relation between sparsity and non-negativity constraints. We also propose a weighting strategy for handling missing values. Finally, the kernel version of our approach and its application for tensor data are described.

4.1. Non-negative least-squares classiﬁcation approach We propose non-negative-least-squares (NNLS) classiﬁcation approach in which any new sample with unknown class label is approximated by a non-negative linear combination of the training samples. The combination coefﬁcients are sparse due to the non-negativity constraints. Suppose bm1 is a new sample with unknown label and Amn , m 4n, is the training set with n samples P in columns and m features in rows. We have b ni¼ 1 ai xi ¼ Ax, where the coefﬁcient vector x Z 0. Because A is under-complete, b ¼ Ax is usually overdetermined. This means there does not exist a non-negative solution x. Therefore, obtaining x is a single-lefthand-side NNLS (SLHS-NNLS) problem formulated as 1 min JbAxJ22 x 2

s:t: x Z 0:

ð5Þ

If there are p Z 2 new samples, there is no need to solve the NNLS problem one by one as in Eq. (5). Instead, we can solve the NNLS problems in batch. The approximation becomes B AX, where B is of size m p, X is of size n p, and X Z0. Each column of B is a new sample. The j-th column of X, xj , is the coefﬁcient vector corresponding to the j-th new sample, bj , that is bj Axj . And the optimization task becomes the multiple-left-hand-side NNLS (MLHS-NNLS) problem, 1 min JBAXJ2F X 2

s:t: X Z0:

ð6Þ

This problem is equivalent to min X

p X 1 i¼1

2

Jbi Axi J2F

s:t: X Z 0:

ð7Þ

After obtaining the sparse coefﬁcients, we need to employ a sparse interpreter to predict the class of unknown samples. We shall give two interpreters (or rules), in the following. The rule that assigns the class label of the training sample with the largest coefﬁcient to the new sample is called the MAX rule. This rule is inspired by the rule used in NMF clustering method. The relation between our NNLS method using the MAX rule and 1-NN is that the MAX rule is actually 1-NN in the column space of A, where the representation of the i-th training sample is a vector with the i-th component to be 1 and the rest 0. The representation of a new sample in this space is the corresponding sparse coefﬁcient vector. Identifying the largest coefﬁcient corresponds to ﬁnding the closest training sample in the column space of A. The sparsity may decrease as the noise increases in the data. There may not exist dominantly large coefﬁcients for very noisy data. In such case, incorrect decision may be made by the MAX rule. The nearest-subspace (NS) rule is proposed by Wright et al. [15] to interpret the sparse coding. This rule takes the advantage of the property of discrimination of sparse coefﬁcients and is more robust to noise than the MAX rule. We can also include this rule in our NNLS approach as the sparse interpreter. Suppose there are C classes with labels l1 , . . . ,lC , for a given new sample b, after obtaining its non-negative coefﬁcient vector x, the regression residual corresponding to the k-th class is computed as r k ðbÞ ¼ 12JbAdk ðxÞJ22 ,

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

where dk ðxÞ : Rn -Rn returns the coefﬁcients for class lk. Its j-th element is deﬁned by ( xj if aj in class lk , ðdk ðxÞÞj ¼ 0 otherwise: Finally, class label lK is assigned to b, where K ¼ arg min r k ðbÞ: 1rkrC

ð8Þ

Our NNLS based classiﬁcation simply involves two steps that are listed in Algorithm 2. First of all, we need a fast optimization algorithm (see Section 4.3) to obtain the non-negative coefﬁcient matrix X. Then the class label of each new sample is predicted by a sparse interpreter as explained above. For narrative convenience, we denote the NNLS approach with MAX rule by NNLS-MAX and that with the nearest-subspace rule by NNLS-NS. Algorithm 2. Non-negative least squares classiﬁer. Input: Amn : training set including m features and n samples, c: class labels of the training samples, Bmp : p new samples Output: p: predicted class labels of the p new samples (1) sparse coding: solve the NNLS optimization problem in Eq. (6); (2) sparse interpreter: use MAX or NS rule to predict the class labels of the new samples: p ¼ MAXðXÞ or p ¼ NSðXÞ;

In Algorithm 2, function MAX returns a vector p containing the predicted class labels of the new samples from B. The i-th component of p is deﬁned as pi ¼cK, if XKi is the largest value in the i-th column of X, that is K ¼ arg max1 r j r n ðX ji Þ. Suppose there are C classes with labels l1 , . . . ,lC . The i-th component of p returned by function NS is deﬁned as pi ¼lK, where K ¼ arg min1 r k r C r k ðbi Þ. Our NNLS classiﬁer is an example of instance-based learning method and it is well-known that instance-based learning algorithms perform well for complicated distributions [7]. NNLS-NS can be also viewed as a locally weighted linear regression approach [7,35] except that (1) NNLS-NS ﬁnds the approximation coefﬁcients through nonnegative least squares, while locally weighted regression usually involves a distance-based weighting and (2) NNLS-NS ﬁnds the local training instances adaptively with respect to the contribution to reduce the regression residual further, whereas locally weighted regression employs distance-based k-NN to ﬁnd local instances surrounding the new query. Furthermore, our NNLSMAX classiﬁer is a locally weighted classiﬁcation method, because it actually uses a decision function over the locally weighted regression as deﬁned by gðarg minx Z 0 1=2JbAxJ22 Þ, where the decision function is gðxÞ ¼ MAXðxÞ. Finally, we should note that NNLS-MAX is not equivalent to 1-NN in the input space. 4.2. Numerical example Now let us have three simulation examples to demonstrate how this method works for both non-noisy and noisy data. First, suppose A contains 6 samples with 8 features (see Eq. (9)). And the ﬁrst 3 samples belong to class 1 and the last 3 are from class 2. Also suppose there is no intergroup variation and noise. Suppose we have 4 new samples in B. We can obtain its coefﬁcient matrix X by a NNLS algorithm. The largest values in the columns of X exactly indicate that the ﬁrst 2 new samples have the same class labels with the ﬁrst sample in A and the last 2 new samples have the same class labels with the forth sample in A. Second, if we add a little portion of Gaussian noise to A (through A ¼ A þ 0:1 randnð8,6Þ, where randnð8,6Þ returns a 8 6 matrix, each element of which is generated by standard Gaussian generator) and B

5

(through B ¼ B þ 0:1 randnð8,4Þ), we would obtain Eq. (10). From the dominantly largest values in the columns of X, we can predict that the ﬁrst two new samples fall into class 1, the last two into class 2. Third, if we add more noise through A ¼ A þ randnð8,6Þ and B ¼ B þ randnð8,4Þ, we obtain Eq. (11). Although the largest coefﬁcients for the second and third new samples are not obvious, we are still able to make the correct decision. As the noise increases, the sparsity decreases in our examples. Therefore, incorrect decision may be made by the MAX rule. If we use the nearest-subspace rule instead, the residuals of B in Eq. (11) is shown in Table 1. According to the residuals, the ﬁrst two samples are predicted in Class 1 and the last two samples are in Class 2. 0

1

B 1 B B B 5 B B 3 B B B 1 B B 0 B B @ 4

1

2

1

1

5

0

3

2

1

1

0 4

0 1

2

3 3 3 0 1 1 0 B0 0 0 B B B0 0 0 B B0 0 1 B B @0 0 0 0

0

0

0

1

0

1

1

1

2

2

B 1 C C B 1 C B 0 C B 5 C B B 2 C C B 3 C¼B 1 C B 1 C B B 0 C C B 0 C B 1 A @ 4

1

1

1

1

5

5

0

0

3

3

2

2

1

1

1

1

0 4

0 4

0 1

0 1

3 1 0 0C C C 0C C 1C C C 0A

3

3

3

3

3

1

1 C C C 0 C C 2 C C C 1 C C 0 C C C 1 A 3

ð9Þ

0

2:0702 B 1:4899 B B B 4:3718 B B 4:1365 B B B 0:2267 B B 0:2735 B B @ 4:8281

1:4154 0:8002

1:9287

0:2012 0:5951

6:9345

0:9424

2:9243

1:7334

0:1533

2:3851

0:1955

0:0560

6:2445

1:8495

2:8046

2:4307

0

1 3:4851 1:3299 C C C 0:8629 C C 0:6178 C C C 0:2490 C C 0:9819 C C C 1:1942 A 3:2563

0:9680 B 0:9874 B B B 5:0937 B B 3:1575 B B B 1:0624 B B 0:0342 B B @ 3:9404

1:0423 0:8560

1:0079 0:9018

5:0841

4:8661

0:0138

0:0766

3:0348

2:8721

2:1275

1:8499

1:0656

0:7186

1:0896

0:9697

0:0912 4:0698

0:0435 4:0739

0:1052 0:9814

0:0229 1:0922

2:9902

2:9878

3:1125

2:8825

2:8433

0

2

0:9514 B 0 B B B 0:0470 B B 0:0203 B B @ 0

0:7280

0

0:2664

0

0

0

0:0135

0:1349

0

0

0

0

0:9056

1:9519 0:9420

0

1:9980 0:9971

1 2:0600 1:0871 C C C 0:0989 C C 1:8937 C C C 0:9328 C C 0:0057 C C C 1:0815 A 2:8863

1

C C C C 0 C 0:0796 C C C 0:2984 A 0

ð10Þ

0:6630

Table 1 The residuals of B in Eq. (11). Sample

Class 1

Class 2

b1 b2 b3 b4

2.5370 1.7087 4.4854 5.2291

8.3597 10.4016 2.2324 3.0576

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

6

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

0

2:0702 B 1:4899 B B B 4:3718 B B 4:1365 B B B 0:2267 B B 0:2735 B B @ 4:8281 1:9287 0

1:4154 0:8002 6:9345

0:5951 0:9424

2:9243

1:7334

0:1533

2:3851

0:1955

0:0560

6:2445

1:8495

2:8046

2:4307

2:7206

B 1:9554 B B B 5:5972 B B 1:5594 B B B 0:1129 B B 0:9253 B B @ 3:8031 1:7990

0

0:2012

0:2500

B 0:5863 B B B 0:0230 B B 0 B B @ 0 0

3:4851

1

products AT A and AT b, and the computations ATP AP and ATP b are equivalently replaced with ðAT AÞP and ðAT bÞP , respectively. Second, predeﬁned active set and passive set based on the previous iteration are given instead of starting from null passive set. Van Benthem and Keenan [37] proposed their FC-NNLS algorithm to further relieve the burden of FNNLS. For the convenience of discussion, we also reformulated the pseudocode of FC-NNLS which is given in Algorithm 3.

1:3299 C C C 0:8629 C C 0:6178 C C C 0:2490 C C 0:9819 C C C 1:1942 A 3:2563

2:1430

0:1428

2:0088

0:0913

0:3454

1:9641

0:4381

1:2974

7:0132

6:7965

1:4534

0:1980

3:7783 0:1157

2:7041 0:1575

3:9547 1:7358

2:1347 0:9402

0:1314

0:7121

0:4539

0:4043

4:9755

4:3159

0:3239

0:8172

2:6933

3:7117

1:7895

2:8704

0

0

0

0

0:1586

0:1551

0:2586

0:6216

0 0:0336

0:0286

0:3624

1

0:2921 C C C 0:6267 C C 2:2222 C C C 1:0519 C C 0:1376 C C C 2:5853 A 5:2992

ð11Þ

Now, Let us take Eq. (11) for example to demonstrate the difference between our NNLS sparse coding and locally weighted regression, the Euclidean distance between the six training samples and four unknown samples is in Eq. (12), 0 1 3:3768 3:8625 9:4569 8:4278 B 3:3714 2:0757 11:8114 10:5107 C B C B C B 4:0535 2:8996 11:4309 11:2606 C C, ð12Þ D¼B B 11:7900 13:3134 3:4360 5:3789 C B C B C @ 10:4943 11:8442 2:9320 4:1921 A 10:8042

12:1536

4:1262

Algorithm 3. FCNNLS optimization algorithm. Input: matrices Amn , matrix Bmq Output: matrix X nq which is an optimal solution to min12JBAXJ2F s.t. X Z0 Q ¼ f1 : qg; % set of indices of columns of B N ¼ f1 : ng; % set of indices of rows of X K ¼ AT A; C ¼ AT B; X ¼ ½K 1 C þ ; % use the unconstrained least squares solution to initialize X, and X ¼ ½Y þ is deﬁned as xij ¼ yij if yij 4 0, otherwise xij ¼ 0

1

0:1091 C C C C 0 C 0:0572 C C C A 0

0:5568 0:5175

1:9284

3:8268

where dij is the distance between the i-th training sample and the j-th unknown sample. Take its last column for example, the ﬁrst and the last three training samples should be selected by 4-NN, while NNLS picks up the second, fourth, and last training samples as shown in Eq. (11). For demonstrating the difference between NNLS-MAX and the nearest neighbor approach, we need Eq. (12) again. It suggests that the ﬁfth training sample is the nearest neighbor of the third unknown sample in input space, while Eq. (11) indicates that the fourth training sample is the most meaningful one for NNLS. 4.3. NNLS optimization algorithms In the following, we shall introduce two algorithms to optimize, in parallel, the MLHS-NNLS problem as expressed in Eq. (6). The ﬁrst is an active-set algorithm and the second is a gradient– descent based algorithm.

R ¼ fRi 9i A Q, Ri ¼ fj9xji ¼ 0, j A N gg; P ¼ fP i 9i A Q, P i ¼ fj9xji 4 0, j A N gg; F ¼ fi9Ri a |, maxk A Ri lki 4 E, li ¼ CKxi , i A Qg; % set of indices of columns that to be optimized while F a| do J ¼ fJ i 9Ji ¼ arg maxk A Ri lki , iA F g; RF ¼ fRi 9Ri ¼ Ri J i , i A F g; P F ¼ fP i 9P i ¼ P i þ J i , iA F g; T ¼ XF ; T R ¼ 0; T P ¼ arg minY 12 JBF AP YJ22 ; % solved by Algorithm 4 through calling CSSLSðK,C F , P F Þ H ¼ fi9minðt i Þ o0, iA F g; % subset of columns of T with negative values while H a| % remove a variable with negative value from each passive set

a ¼ fai 9ai ¼ minj A P i ,i A H,xij r 0 xijxtij ij g; x

ij g; K ¼ fKi 9Ki ¼ arg minj A P i ,i A H,xij r 0 xij t ij

X P,H ¼ X P,H þðT P,H X P,H Þ diagðaÞ; RH ¼ fRi 9Ri ¼ Ri þ Ki ,i A Hg; P H ¼ fP i 9P i ¼ P i Ki ,i A Hg; T R ¼ 0; T P ¼ arg minY 12 JBH AP YJ22 ; % solved through calling CSSLSðK,C H ,P H Þ in Algorithm 4 H ¼ fi9minðt i Þ o0,iA F g; % subset of columns of T having negative values end while X F ¼ T; F ¼ fi9Ri a |,maxk A Ri lki 4 E, li ¼ CKxi ,ıA Qg; end while

Algorithm 4. CSSLS. 4.3.1. Active-set algorithm LH-NNLS in Algorithm 1 only solves the SLHS-NNLS problem. Its simple extension to solve the MLHS-NNLS problem could be that LH-NNLS is called separately to solve the SLHS-NNLS subproblem for each individual column vector of S. This is, however, computationally expensive as matrix inner products and inverses have to be calculated for every iteration of LH-NNLS and the same matrix computations might be recalculated for multiple times. Bro and De Jong [36] proposed the fast NNLS (FNNLS) algorithm to improve the speed of LH-NNLS for MLHS-NNLS in two aspects. First, the inputs are replaced with the precomputed inner

Input: matrices K ¼ AT A and C ¼ AT B, where A is of size m n and B is of size m l, l passive sets P Output: set Y ¼ fyi 981 ri r l, yi ¼ arg min 12 Jbi AP i zJ22 g which can be denoted concisely by Y ¼ arg minZ 12 JBAP ZJ22 Y ¼ |; % initialize L ¼ f1 : lg; % set of indices of columns of X U ¼ uniqueðPÞ; E ¼ f1 : sizeðUÞg; % set of indices of unique passive sets S ¼ fS i 9S i ¼ fj9j A L,P j ¼ U i g,i A Eg; % set of indices of columns of B sharing the identical passive sets

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

We note that both FC-NNLS and UR-NNLS only use matrix inner products, AT A, AT B, and BT B, in our formulation. This is because least squares is essentially a problem of quadratic programming. This property provides us two beneﬁts. First, it helps us to handle missing values. Second, The NNLS based classiﬁcation approaches can be extended to kernel version. We will introduce these two beneﬁts in Sections 4.5 and 4.6.

for 8U i A U do YSi ¼ ðK U i Þ1 C U i ,Si ; end for Algorithm 5. UR-NNLS optimization algorithm. Input: matrices Amn and Bmp Output: matrix X which is a solution to minX 12 JBAXJ2F

4.4. Relation between sparsity and non-negativity

s:t: X Z 0

K ¼ AT A; C ¼ AT B; R ¼ BT B; X ¼ K 1 C; % initialized by the unconstrained LS solution XðX o 0Þ ¼ 0; % replace the negative values in X by 0 r prev ¼ 12 JBAXJ2F ¼ 12 trðRX T CC T X þ X T KXÞ; for i ¼ 1 : maxIter do rﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ðCÞ þ þ ½K Xij 9M 9 þ M X ij ¼ X ij C ijþ ½K þ X ; % deﬁne M ijþ ¼ ij 2 ij ; M ij ¼ ij

ij

9M ij 9M ij ; 2

if i ¼ ¼ maxIter or i mod l ¼ 0 then % check every l iterations if the termination conditions are met 2 1 2 JBAXJF

1 2

T

T

7

The optimal solution of the MLHS-NNLS problem is sparse, because in an optimal solution the components corresponding to the active set are 0, while the reminder corresponding to the passive set are positive. If the optimal solution to an unconstrained LS contains negative values, the optimal solution to the corresponding NNLS problem sits on the edge of the feasible regions. Hence large size of active set means the optimal solution is sparse. We can deﬁne a measure of sparsity as sparsityðXÞ ¼

T

r cur ¼ ¼ trðRX CC X þX KXÞ; if r prev r cur r E or r cur r E then break; end if r prev ¼ r cur ; end if end for

FC-NNLS has two improvements. First, the solution is initialized by the optimal-unconstrained-least-squares solution whose negative values are set to 0’s and positive variables are collected into passive sets. This initial solution approximates the optimal solution. Second, instead of solving the SLHS-NNLS problems sequentially, they are solved in a column-parallel strategy so as to share the common matrix inverses at each iteration. Eqs. (9) and (10), shown above, are two examples solved by FC-NNLS algorithm.

4.3.2. Iterative update rule algorithm The second algorithm is based on multiplicative update rule. Multiple-update-rules algorithm has been popular for NMF since it was ﬁrst developed by of Lee and Seung [11,38]. The update rules are essentially gradient–descent rules. Two update rules are proposed by Ding et al. [19] to solve semi-NMF: S 7 W 7 H þ . Semi-NMF is a relaxed NMF. Through semi-NMF, matrix S can be decomposed into basis matrix W and coefﬁcient matrix H. Only H is constrained by non-negativity. W and H are alternatingly updated by update rules. The update rule for H, ﬁxing W, can be modiﬁed to solve our MLHS-NNLS problem. This is illustrated in Algorithm 5 where W is ﬁxed by A, and H is redenoted by X. Ding et al. [19] have proven that, when ﬁxing A and updating X iteratively, the term JBAXJ2F monotonically decreases. Based on this theory, we propose the update-rule-based NNLS (UR-NNLS) optimization algorithm which is shown in Algorithm 5. For indirect proof of convergence and time complexity of this algorithm, please refer to the work of Ding et al. [19]. The solution is also initialized by the unconstrained least squares solution as in Algorithm 3. Algorithm 5 does not always reach the optimal solutions, it provides a high-quality approximation to this problem and the classiﬁcation task. Also, this algorithm is much easier to implement than active-set algorithms. Hereafter, we shall use the FC-NNLS algorithm as the default solver to our MLHS-NNLS problem.

dð9X9 o EÞ , rowðXÞ columnðXÞ

ð13Þ

where dðMÞ is a function counting the number of non-zeros in M, E is a small positive number (the default can be 104 ). For example, the sparsity of the coefﬁcient matrices in Eqs. (10) and (11) are 0.5417 and 0.4167, respectively. It is well-known that the unconstrained least squares solution, which minimizes l2-norm, is not sparse. Therefore, the high sparsity of our solution results from the non-negative constraint. Another example of this relation is SVM where the normal vector w of separating hyperplane is a non-negative linear combination of all training samples. The coefﬁcients are Lagrangian multipliers which are usually sparse. And the training samples corresponding to the non-zero coefﬁcients are called support vectors. 4.5. Weighting strategy for missing values We note that the two optimization algorithms described above only use the inner product of samples. Two inner product matrices are needed: K ¼ AT A and C ¼ AT B by both of Algorithms 3 and 5, where A and B represent the training data and unknown data with samples in columns, respectively. Additionally, Algorithm 5 also needs the inner product R ¼ BT B. If the samples are normalized to have unit l2-norm, the inner product of two samples is actually the cosine similarity between them. Suppose we have two unnormalized samples a,b A Rm , their normalized inner product can be formulated as 0

a0T b ¼

aT b : JaJ2 JbJ2

This feature of these optimization algorithms is quite useful to handle missing values. If a and b have missing values, we can use the features without missing values in both of them to calculate the cosine similarity, 0

a0T b ¼

aTI bI , JaI J2 JbI J2

ð14Þ

where I is the set of indices of the features whose values are observed in both a and b, that is I ¼ I a \ I b , where I a is the set of indices of features whose values are present in a. The difference between this weighting strategy with 0-imputation is that the inner product of a pair of 0-imputed samples is not the cosine similarity. If a and b are original samples with missing values imputed by 0, the normalized inner product of them is 0 a0T b ¼ ðaTI =JaI a J2 ÞðbI =JbI b J2 Þ. The performance of our weighting strategy and 0-imputation will be investigated in the experimental part.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

8

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

Through this way, we can calculate the inner product matrices K, C, and R in a pair-wise fashion, if A and B have missing values. The difference between this weighting strategy with the strategy of removing features with missing values as a preprocessing is that our strategy utilizes more information than the later. This is because our strategy computes the inner product of two samples based on their available features, while the feature pre-removal strategy does so based on the global available features. In some cases, the missing rate can be so large that all features are removed by the later strategy. But our pair-wise weighting strategy might still work normally. Therefore, the advantage of our strategy is that it can work in the cases of large missing rate. If the NS rule is used after getting the sparse coefﬁcient of each sample, the regression residual of each class can also be computed in this weighting strategy. From the following equation: 1 2

T

JbAi xi J22 ¼ 12 ðbAi xi ÞT ðbAi xi Þ ¼ 12 b bxTi ATi b þ 12xTi ATi Ai xi ,

ð15Þ

we can ﬁnd that the residual of b regressed by the i-th class is a T function of b b, ATi b and ATi Ai which can be computed by the weighting strategy in the case of missing values. Ai is the training samples of the i-th class. 4.6. Kernel extension Our NNLS approach described above may not work well on overcomplete data due to the following reasons. First, the systems of linear equations, B ¼ AX, may have one or multiple nonnegative solutions. In this case, the solutions may not be necessarily sparse. Even if the solutions might be sparse, the unique solution returned by the optimization algorithms may not have the best semantic information. Second, the optimization algorithms need to compute the inverses of K and its submatrices, but they are singular matrices in this case. In the experimental section, we will show the performance of our NNLS approach on overcomplete data to verify the claim above. Because of the potential singularity problem when computing the matrix inverses, we can use two ways to approximate this. The ﬁrst way is to add a very small identity matrices to the target matrix before computing the inverse, K 1 ðK þ gIÞ1 ,

ð16Þ 32

where g is a very small positive number, for example, 2 second way is to apply SVD, K 1 Uf ðSÞV T ,

. The ð17Þ

T

1 f ðli Þ ¼ li

if li Z g, otherwhere K ¼ USV and f ðSÞ is deﬁned as wise, f ðli Þ ¼ 0. g is also a very small positive number. In fact, the solution using the ﬁrst way is a solution to ridge regression. Ridge regression is a popular remedy to deal with the ﬁrst reason above. In order to improve the performance of our approach for overcomplete data, we have two options. One is to use l1 regularization on the coefﬁcient vector in order to obtain more sparse result. Another is to kernelize them. Though the former is interesting, it has a problem of model selection. Therefore, we leave it for future investigation and focus on the later one. Let fðbÞ be a mapping function which casts sample b to a feature space, we can deﬁne fðBÞ ¼ ½fðb1 Þ, , fðbp Þ. And the NNLS optimization in the feature space is 1 min JfðBÞfðAÞXJ2F X 2

s:t: X Z0,

ð18Þ

As has been mentioned in Section 4.5, the learning and prediction of our NNLS approach only involves the inner products K ¼ ðfðAÞÞT fðAÞ, C ¼ ðfðAÞÞT fðBÞ, and R ¼ ðfðBÞÞT fðBÞ. Therefore, we can use kernel trick to compute these three matrices: K ¼ kðA,AÞ, C ¼ kðA,BÞ, and R ¼ kðB,BÞ, where ½kðA,BÞij ¼ kðai ,bj Þ ¼ ðfðai ÞÞT fðbj Þ.

Through this way, FC-NNLS and UR-NNLS are extended to kernel FCNNLS (FC-KNNLS) and kernel UR-NNLS (UR-KNNLS). The corresponding extension of classiﬁcation algorithms are KNNLS-MAX and KNNLS-NS. The performance of them are explored in the experimental part. In order to build robust optimization algorithms for kernel NNLS methods, Eqs. (16) and (17) are also applied to compute matrix inverses. Model selection for kernel approaches based on k-fold CV accuracy can be expensive in practice, because for each point of parameter, a classiﬁer has to run for k times. Therefore, we propose a randomized KNNLS method to avoid model selection. The idea is in the following. Bootstrapping is rerun for b times to generate b training sets. At each execution, about 66.7% training samples are selected as input of our kernel NNLS classiﬁer, and the kernel parameter is randomly selected from a speciﬁc range. Finally, the class labels are voted over the b results by the majority rule. This method is coined bootstrapping NNLS (BNNLS) method. If the MAX (or nearest-subspace) rule is used by each kernel NNLS classiﬁer, it is denoted as BNNLS-MAX (or BNNLS-NS). The algorithm is shown in Algorithm 6. Algorithm 6. Bootstrapping kernel non-negative least squares classiﬁer. Input: Amn : training set including m features and n samples, c: class labels of the training samples, Bmp : p new samples, b: the number of classiﬁers Output: p: predicted class labels of the p new samples for t ¼1 to b do T ¼ bootstrapðAÞ; take kernel parameter randomly: s ¼ randðrÞ; % r is a speciﬁed set of real numbers or an interval p ¼ KNNLSðT,c,B, sÞ; Pð: ,tÞ ¼ p; end for p ¼ voteðPÞ;

The function bootstrapðAÞ randomly selects 66:7% samples from A. rand ðrÞ choose a value as kernel parameter from the speciﬁc range r. vote ðPÞ is a function that makes the ﬁnal decision based on the decisions of the classiﬁer committee by majority rule. 4.7. Kernel NNLS for tensor data Our NNLS classiﬁer can be extended to classify tensor data. Interested readers are referred to the review of Kolda and Bader [4] for tensor algebra. Without the loss of generality, we suppose there are I3 training samples represented by an order-3 tensor A I1 I2 I3 . The third mode is the class axis. Each sample is a matrix of size I1 I2 . Therefore, Að: , : ,iÞ is the i-th training sample. Suppose we have P new samples in BI1 I2 P . Assume that each of such new samples can be regressed by a non-negative linear combination of the training samples. We need to solve the following NNLS problem: min12JBA3 X T J2F X

s:t: X Z 0,

ð19Þ

where 3 is the mode-3 product [4] as deﬁned at the end of Section 1. Through matricizing tensors to matrices, we can convert the above optimization task into the equivalent formula: min12JBð3Þ X T Að3Þ J2F X

s:t: Y Z0,

ð20Þ

where Að3Þ is a matrix of I3 ðI1 I2 Þ, unfolded from tensor A in mode 3 which is also deﬁned at the end of Section 1.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

GST data. Dynamical systems kernel accepts matrix inputs and takes into account the temporal information.

Using transposition, we have min12JBTð3Þ ATð3Þ XJ2F X

9

s:t: X Z 0:

ð21Þ

Now Eq. (21) can be solved by FC-NNLS and Algorithm 5. Our NNLS classiﬁer, described in Algorithm 2, can now be generalized for tensor data. The drawback of this generalization is that the structural information within a sample is not considered. The objective (Eq. (19)) uses the Euclidean distance, hence the samples are actually vectorized in Eqs. (20) and (21). If we use other dissimilarity or similarity metric which takes the temporal information into account when classifying gene-sample-time data, the performance is expected to be increased. We propose to apply dynamical-systems kernel [39] in our kernel NNLS method for

5. Experiments In this section, we explore the performance of NNLS methods in various aspects including the prediction capability, the robustness to missing values and noise, time-complexity, and sparsity of coefﬁcient generated by our method. We conducted experiments over 28 datasets which are brieﬂy described in Table 2. We compared our methods with diverse benchmark methods including sparse representation, instance-based, and kernel methods. They are summarized in Table 3. 5.1. Estimation of data size requirement

Table 2 Datasets. The sample size of each class and the total data size are in the last column. Data

#Class #Feature

#Sample

Adenoma [40] Breast [41] Colon [42] Leukemia [43] ALL [44] Breast5 [45] MLL [46] SRBCT [47]

2 2 2 2 6 5 3 4

7457 24 481 2000 7129 12 625 14 460 12 582 2308

18 þ 18¼ 36 44 þ 34¼ 74 40 þ22 ¼ 62 47 þ 25¼ 72 15 þ 27þ 64þ 20þ 43 þ79 ¼248 13 þ 22þ 53þ 31 þ13 ¼158 24 þ 20þ 28¼ 72 23 þ 8þ 12þ 20¼ 63

Ovarian8-7-02 [48] Ovarian4/3/02 [48] OvarianCD_PostQAQC [49] Prostate7-3-02 [50] PC-IMAC-Cu [51]

2 2 2

15 154 15 154 15 000

91 þ 162¼ 253 100 þ100 ¼200 95 þ 121¼ 216

2 2

15 154 16 382

253 þ 69¼ 322 159 þ 168 ¼327

face95 [52] face96 [52] grimace [52]

20 20 18

200 180 20 20¼ 400 200 180 20 20¼ 400 200 180 18 20 ¼360

BCWD [53] Ionosphere [53] Parkinsons [54] Pima [53] Sonar [53] SPECT [53] BSTT [53] Glass [53] Iris [53] LM [53] Wine [53]

2 2 2 2 2 2 6 6 3 15 3

9 34 22 8 60 22 9 9 4 90 13

458 þ 241 ¼699 225 þ 126 ¼351 48 þ 147¼ 195 500 þ268 ¼ 768 97 þ 111¼ 208 55 þ 212¼ 267 22 þ 21þ 14þ 15 þ16 þ18 ¼ 106 70 þ76 þ 17þ 13þ 9þ 29 ¼214 3 50¼ 150 15 24 ¼360 59 þ 71þ 48¼ 178

53 7

18 þ 9¼ 27

2

INF b [55]

The statistical methodology using learning curve [57] is to use available classiﬁcation results to estimate sample-size requirement for future experiments and evaluate the beneﬁt in performance with extra samples. It involves two tasks. First of all, the minimum sample size, where signiﬁcant accuracy is obtained, is estimated. Second, larger sample sizes and the corresponding classiﬁcation error rates are used to ﬁt the learning curve. The learning curve is an inverse power-law model expressed as eðnÞ ¼ ana þ b, where a is the learning rate, a is the decay rate, and b is an asymptotical Bayes error. In order to know how the performance of NNLS changes with the size of training samples, we estimated (1) the minimum data sizes required for signiﬁcant accuracy and (2) the learning curves of our NNLS over eight microarray gene expression datasets. We implemented the method proposed in [57]. For each sample size, we repeated the partition of training and test set 50 times. For each split, the number of permutation is 50. We set the signiﬁcant level to 0.05. The comparison with SVM are summarized in Table 4. As an example, the learning curves of NNLS and SVM over ALL are plotted in Fig. 1. First of all, from Table 4, it can be clearly observed that, the minimum sample size (required to obtain signiﬁcant accuracy) of NNLS is consistently smaller than SVM. It may be because NNLS belongs to instance-based learning while SVM needs sufﬁcient training samples to learn the model parameter (that is the normal vector of hyperplane) for all future prediction. Second, from Table 4 and Fig. 1, we can ﬁnd that the learning rate of SVM is consistently faster than NNLS. Third, the average error rate of NNLS is smaller than SVM on ﬁve datasets. Considering both error rate and p value, it is hard to ﬁnd the difference of their

Table 3 The summary of our NNLS methods and the benchmark methods. Category Method

Description

Our

NNLS classiﬁer with MAX rule as sparse interpreter. Linear kernel is used for overcomplete data, and radial basis function (RBF) kernel is used for undercomplete data NNLS classiﬁer with nearest-subspace rule as sparse interpreter. Linear kernel is used for overcomplete data, and RBF kernel is used for undercomplete data Bootstrapping NNLS classiﬁer with nearest-subspace rule as sparse interpreter. RBF kernel is used

NNLSMAX NNLS-NS BNNLSNS

SR

SRC MSRC

SRC2 (Eq. (3)) is used for undercomplete data, SRC1 (Eq. (2)) is used for overcomplete data The metasample-based sparse representation classiﬁcation approach proposed in [30]

Instance

LRC 1-NN

Linear regression classiﬁcation approach, an independent modelling method proposed in [22] 1-nearest neighbor approach

Kernel

SVM ELM

Linear kernel is used for high-dimensional data and RBF kernel is used for low-dimensional data Extreme learning machine proposed in [56]

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

10

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

Table 4 Dataset size estimation. The third column is the minimum sample size to obtain the signiﬁcant accuracy with p-value r 0:05. The forth column is the maximum sample size used to ﬁt the learning curve. The ﬁfth column are the error rate and p-value when using the maximum sample size in the forth column. The last three columns are the ﬁtted learning curves of mean, 25% and 75% quantiles accuracies. The learning curves of 25% and 75% quantiles are measures of variation. Data

Method

Min. size

Max. size

Error rate (p-value)

(learning rate: a, decay rate: a, Asymp. Bayes error: b) 25th quantile

Mean

75th quantile

Adenoma

NNLS-NS SVM

6 13

25 25

0.02(1.12e 2) 0.03(2.12e 2)

(0.15,0.73,0.00) (99.80,2.99,0.00)

(0.60,1.31,0.02) (78.41,2.61,0.00)

(1.85,1.80,0.02) (5.79,2.28,0.07)

Breast

NNLS-NS SVM

10 11

60 60

0.14(1.20e 3) 0.25(1.16e 2)

(1.24,0.59,0.00) (2.37,0.53,0.00)

(1.15,0.50,0.00) (1.87,0.44,0.00)

(1.16,0.45,0.00) (1.41,0.34,0.00)

Colon

NNLS-NS SVM

8 14

55 55

0.11(2.00e 2) 0.08(9.60e 3)

(0.45, 0.40, 0.00) (2.30, 0.97, 0.00)

(0.39, 0.25, 0.00) (2.15, 0.77, 0.00)

(0.45, 0.26, 0.00) (9.26, 1.30, 0.08)

Leukemia

NNLS-NS SVM

6 8

60 60

0.05(8.00e 4) 0.02(0)

(1.38, 1.13, 0.00) (16.36, 2.07, 0.00)

(2.33, 1.25, 0.03) (42.54, 2.28, 0.03)

(4.01, 1.54, 0.07) (100.00, 2.45, 0.05)

ALL

NNLS-NS SVM

6 13

100 100

0.10(0) 0.15(0)

(2.49,0.85,0.04) (5.30,0.77,0.00)

(2.31,0.77,0.03) (4.46,0.69,0.00)

(2.19,0.69,0.03) (4.09,0.65,0.00)

Breast5

NNLS-NS SVM

5 16

100 100

0.22(0) 0.18(0)

(0.51,0.39,0.10) (2.50,0.63,0.01)

(0.77,0.69,0.18) (2.51,0.61,0.01)

(1.48,1.07,0.23) (2.51,0.63,0.05)

MLL

NNLS-NS SVM

4 16

60 60

0.07(0) 0.09(1.60e 3)

(0.50,0.62,0.00) (1.78,0.56,0.00)

(0.87,1.07,0.07) (3.16,0.66,0.00)

(1.69,1.52,0.10) (4.88,0.74,0.00)

SRBCT

NNLS-NS SVM

4 19

60 60

0.01(2.40e 2) 0.03(8.00e 4)

(2.10, 1.39, 0.00) (100.00, 1.87, 0.00)

(1.63, 1.06, 0.00) (100.00, 1.82, 0.00)

(1.66, 0.96, 0.00) (100.00, 1.79, 0.00)

0.5 NNLS Average NNLS 25% Quantile NNLS 75 Quantitle SVM Mean SVM 25% Quantitle SVM 75% Quantitle

0.45 0.4

Error Rate

0.35 0.3 0.25 0.2 0.15 0.1 0.05 0

0

100

200

300

400

500

Sample Size Fig. 1. The learning curve of NNLS and SVM on ALL. The sample sizes to ﬁt the learning curves are [6, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100] and [13, 20, 30, 40, 50, 60, 70, 80, 90, 100], respectively for NNLS and SVM.

ranks of classiﬁers over all datasets are computed, and Friedman test is conducted to verify the null-hypothesis that all classiﬁers are equivalent in the respect of classiﬁcation performance. If the null hypothesis is rejected, a post hoc test (for example, Nemenyi test) is continued. If the average ranks of two classiﬁers differ by at least the critical difference (CD), then it can be concluded that their performance are signiﬁcantly different. In our experiment, the average accuracies (of 4-fold cross-validation) of all classiﬁers over each dataset is obtained. All classiﬁers ran on the same partition of training and test sets. 4-fold cross-validation was repeated for 50 and 20 times on microarray and mass spectrometry data, respectively. We set the signiﬁcant level to a ¼ 0:05. The CD diagram is illustrated in Fig. 2. The bar plot of the average accuracy over ﬁve datasets is shown in Fig. 3, from which we can see that good result can be obtained by our NNLS approaches. In Fig. 2, we can identify two groups. The ﬁrst group includes NNLS-NS, SVM, and SRC. The secondary group is composed of MSRC, NNLS-MAX, 1-NN, and ELM. Through statistical comparison, we can ﬁnd that: (i) NNLSNS is signiﬁcantly better than NNLS-MAX on undercomplete data; (ii) NNLS-NS can obtain state-of-the-art result, and there is no signiﬁcant difference among NNLS-NS, SVM, and SRC in such case. 5.3. Handle missing values

performances. We will resort to the statistical comparison based on Friedman test to further investigate this in the subsequent sections. 5.2. On undercomplete data In order to know the performance of our NNLS approaches on undercomplete data, we compared our NNLS approaches with SRC, MSRC, LRC, 1-NN, SVM, and ELM on eight microarray data and ﬁve mass spectrometry data. We used the Friedman test with Nemenyi test as post hoc test [58]. This non-parametric statistical methodology is recommended in Demˇsar [58], for comparing multiple classiﬁers ( 4 5) over multiple datasets ( 410), because it is simple, safe, and robust. It involves two steps. First, the average

For the purpose of exploring the capability of our weighting strategy of handling missing values, we did experiment on all microarray data with randomly and artiﬁcially generated missing values. The missing rate ranges from 0.1 to 0.7. We used 4-fold cross-validation for 50 times at each missing rate for each dataset. We compared our strategy with 0-imputation. 0-imputation is the strategy to ﬁll the missing values by constant 0’s. We did not conduct k-NN imputation [32], because for large missing rate, there is no complete feature and sample, and k-NN imputation can thus fail for this case. As an example, the average accuracies over SRBCT are plotted in Fig. 4. Furthermore, in order to investigate the robustness of our weighting strategy in the case of severe missing rate, we

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

11

9

ELM

CD

8

1–NN

7

NNLS–MAX

6

MSRC

5

LRC

4

SRC

3

NNLS–NS

2

SVM

1 0

1

2

3

4

5

6

7

8

Rank Fig. 2. Graphical representation of Nemenyi test over undercomplete data.

1

NNLS–MAX NNLS–NS SRC MSRC LRC 1–NN SVM ELM

0.95 0.9 0.85

Accuracy

0.8 0.75 0.7 0.65 0.6 0.55

02

C

7– te ta os Pr

O

va

ria

nC

D

_P os

tQ

3–

AQ

S D RB at C as T et

m ei uk Le

Br

ea

st

a

0.5

Fig. 3. Classiﬁcation performance on some undercomplete datasets.

employed Friedman test on all microarray data with missing rates 60% and 70% (therefore 16 datasets). We set the signiﬁcant level to a ¼ 0:05. In our experiment, the null hypothesis was rejected, and the result of Nemenyi test is shown in Fig. 5. First of all, we can see that NNLS-MAX using weighting strategy is signiﬁcantly better than its counterpart using 0-imputation. If we increase the threshold of Type-I error slightly, then NNLS-NS using weighting strategy also performs signiﬁcantly better than its counterpart using 0-imputation. Therefore it can be concluded that the weighting strategy is signiﬁcantly better than 0-imputation when using the same classiﬁcation technique. Second, from Fig. 4 we can see that NNLS-NS using weighting strategy and SVM using 0-imputation still maintain high accuracy at severe missing rate. And from Figure 5, we can ﬁnd both of them obtained top ranks. It hence can prove the robustness of our weighting strategy.

5.4. On facial data Now, we investigate the robustness of NNLS to corruption on facial data. SRC1 and LRC were proposed speciﬁcally for face recognition. Our proposed approach competed with them and others over three facial datasets (face95, face96, and grimace) collected by Spacek [52]. It was mentioned by Spacek [52] that it is difﬁcult to classify face96 and grimace due to variation of background and scale, and extreme variation of expression. We randomly picked up 20 individuals for face95 and face96, respectively, and used the whole grimace data in our experiment. As was done by Wright et al. [15] and Naseem et al. [22], downsampling was employed for SRC1 and LRC with downsampling rate 1=16. We artiﬁcially replaced 10%–70% of the pixel values with random values from the range of [0,255]. The comparison with respect to the mean accuracy over 20 runs of 4-fold cross-validation on face95 is shown in Fig. 6.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

12

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

1

0.9

Accuracy

0.8

0.7 NNLS–MAX NNLS–NS NNLS–MAX–Impute NNLS–NS–Impute SRC–Impute MSRC–Impute LRC–Impute 1–NN–Impute SVM–Impute ELM–Impute

0.6

0.5

0.4 Complete

10%

20%

30%

40%

50%

60%

70%

Missing Rate Fig. 4. Classiﬁcation performance on SRBCT as missing rate increases.

1–NN–Impute

CD

10

NNLS–MAX–Impute ELM–Impute

8

LRC–Impute NNLS–NS–Impute

6

NNLS–MAX MSRC–Impute

4

SRC–Impute NNLS–NS

2

0

SVM–Impute

1

2

3

4

5

6

7

8

9

10

Rank Fig. 5. Graphical representation of Nemenyi test over microarray data with 60% and 70% missing rates.

It can be see that NNLS, SRC, and MSRC look very robust to random corruption. In order to verify this intuition, we applied Friedman test on these datasets with corruption rates from 10% to 70% (thus 21 datasets). The signiﬁcant level is again 0.05. The null hypothesis is rejected. The result of Nemenyi test is given in Fig. 7. We can identify two groups of classiﬁers: the robust one including SRC, MSRC, and NNLS; and the sensitive one including LRC, SVM, and ELM. 5.5. On overcomplete data We did two parts of experiments on overcomplete data. First, in order to see the performance of our NNLS method (with linear kernel) on overcomplete data, we compared it with SRC on LM dataset. The result is in Table 5. We can see that the performance of our NNLS method is worse than SRC. The accuracies of NNLS-NS are 0.7760 and 0.7969 using FC-NNLS and UR-NNLS algorithms, respectively, while SRC obtained 0.8665. This result corroborates our claim that the NNLS methods with linear kernel may not work well on overcomplete data. Next, for the purpose of exploring the performance of kernel NNLS methods (NNLS-MAX, NNLS-NS, and BNNLS-NS), we compared them

with benchmark methods on 6 binary-class and 5 multi-class low dimensional datasets. These data are described in the second last block of Table 2, where BCWD stands for Breast Cancer Wisconsin Diagnostic, BSTT for Breast Tissue, and LM for Libras Movement. Our NNLS methods used the well-known RBF kernel. The benchmark methods includes SRC, LRC, 1-NN, SVM, and ELM using RBF as activation function. In order to maximize the capability of the methods, model selection was employed. The optimal kernel parameters of NNLS-MAX, NNLS-NS were obtained by line search with the average accuracy of inner 3-fold cross-validation on training data (a higher folds of cross-validation is not computationally affordable). The superparameters and kernel parameters of SVM were attained through grid search with inner 3-fold crossvalidation on training set. It was claimed by Wang and Huang [56] that the performance of ELM is not sensitive to the number of hidden nodes. Thus it was set to 3 times as the number of training samples. Using outer 4-fold cross-validation, all the algorithms were executed on the same partition of training and test sets, and it is repeated for 20 times. The average accuracies of ﬁve out of eleven datasets are illustrated in Fig. 8. Comparing the results on LM in Table 5 and Fig. 8, we can see that the accuracy of NNLS is increased by around 10% using RBF kernel.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

13

1 0.9 0.8

Accuracy

0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 No Corruption

NNLS–MAX NNLS–NS SRC MSRC LRC 1–NN SVM ELM

10%

20%

30%

40%

50%

60%

70%

Corruption Rate Fig. 6. Classiﬁcation performance on face95.

9

ELM

CD

8

SVM

7

LRC

6

1–NN

5

NNLS–MAX

4

NNLS–NS

3

MSRC

2

SRC

1 0

1

2

3

4

5

6

7

8

Rank Fig. 7. Graphical representation of Nemenyi test on corrupted facial data.

NNLS-NS for signiﬁcant accuracy is usually very small. The Nemenyi test also suggests no critical difference between NNLSs and SVM.

Table 5 Mean accuracy on LM data of 20 runs of 4-fold cross-validation. Optimization

Method

Accuracy

FC-NNLS

NNLS-MAX NNLS-NS

0.6725 0.7760

UR-NNLS

NNLS-MAX NNLS-NS

0.7425 0.7969

l1-magic

SRC

0.8665

We resorted to Friedman test (a ¼ 0:05) again to compare the classiﬁers over these eleven datasets. The null hypothes is rejected, and the post hoc Nemenyi test is presented graphically in Fig. 9. From this diagram, we can see that NNLS-NS obtained the best average rank and BNNLS-NS has very close average rank as SVM. Our statistical test implies that there is no signiﬁcant difference between NNLS-NS and BNNLS-NS. This convinces us that BNNLSNS is an effective method to avoid model selection. The reason for this may be because the minimum sample size required by

5.6. On all datasets We have investigated the performance of NNLS methods on several situations separably so far. Now we compared them with benchmark approaches using the results of all 27 datasets including microarray data, mass spectrometry data, facial data, and low dimensional data. We conducted Friedman test with signiﬁcant level a ¼ 0:05. The null hypothesis is rejected and the Nemenyi test is presented in Fig. 10. Two groups of classiﬁers can be identiﬁed. NNLS-NS, SVM, SRC form the ﬁrst group which obtain top ranks. And they do not differ signiﬁcantly. The second group is composed of NNLS-MAX, 1-NN, LRC, ELM, and MSRC. 5.7. Compare computing time In order to investigate the time-complexity of our method on undercomplete data, we record the averaged computing time

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

14

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

1

NNLS–MAX NNLS–NS BNNLS–NS SRC MSRC LRC 1–NN SVM ELM

0.95 0.9

Accuracy

0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5

BCWD

Parkinsons

Sonar Dataset

Iris

LM

Fig. 8. Classiﬁcation performance on some overcomplete data.

10

LRC

CD

9

MSRC

8

ELM

7

SRC

6

1–NN

5

NNLS–MAX

4

BNNLS–NS

3

SVM

2

NNLS–NS

1 0

1

2

3

4

5 Rank

6

7

8

9

Fig. 9. Graphical representation of Nemenyi test on overcomplete data.

including training time on 3=4 of data and test time on the rest. We compared our method with others on one binary-class and one multi-class microarray datasets (Colon and Breast5), one mass spectrometry dataset (PC-IMAC-Cu), and one facial dataset (face95). The result is shown in Fig. 11. First, we can see that the NNLS approaches run faster than SRC. On face5, SRC has similar running time because features are downsampled when using SRC. Second, SVM is slower than NNLS methods over face95. This implies that SVM is slowed down, while the NNLS approaches are more efﬁcient, in the case of large number of classes. We also compared the computing time of BNNLS-NS with NNLS with model selection over Parkinsons and BSTT which are overcomplete data. The average of computing time, including model selection, training and test, is illustrated in Table 6. From Table 6 we can observe that our BNNLS approach is slightly efﬁcient than NNLS with model selection. In fact, the efﬁciency of BNNLS is determined by the number of repetition of bootstrapping. In current study, we ﬁx it to 100. In the future, we will investigate the smallest number of repetition in BNNLS where equivalent accuracy can be obtained compared with NNLS using model selection.

5.8. On tensor data We tested our approach on a tensor dataset of size 53 (genes) 7 (time-points) 27 (samples). which is denoted as INFb (Table 2). We compared our NNLS methods with hidden Markov models (HMMs) [59], SVM with dynamical-systems kernel [39], and high-order NMF [3]. We applied 9-fold crossvalidation to split the dataset into training set and test set (because there are only 27 samples and in each fold there are right 3 samples in the test set). It was repeated 50 times for NNLS methods. And the results of HMMs, SVM, and HONMF are obtained from Li and Ngom [3]. 9-fold cross-validation was repeated 20 times for these benchmark methods, as a large number of repetition is computationally unaffordable. The preliminary result are shown in Table 7. Instead of conducting strict statistical test, we would rather treat these results just as example of applying NNLS on tensor data, because we do not have enough tensor data. First, from the comparison between NNLS using linear kernel and its counterpart using dynamical-systems kernel, we can see that the later obtained better result, because it considers the temporal

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

9

MSRC

CD

8

15

ELM

7

LRC

6

1–NN

5

NNLS–MAX

4

SRC

3

NNLS–NS

2

SVM

1 0

1

2

3

4

5

6

7

8

Rank Fig. 10. Graphical representation of Nemenyi test on all datasets.

8

NNLS–MAX NNLS–NS SRC MSRC LRC 1–NN SVM ELM

6

log2 (computing time)

4 2 0 –2 –4 –6 –8 Colon

Breast5

PC–IMAC–Cu Dataset

face95

Fig. 11. Computing time on diverse datasets. The unit is second. These are mean results of 50 and 20 runs of 4-fold cross-validation for microarray and other data, respectively. Logarithm of base 2 was taken for better comparison.

Table 6 Computing Time (in Seconds) With Model Selection versus Without Model Selection. These are mean results of 20 runs of 4-fold cross-validation. Method

NNLS-MAX NNLS-NS BNNLS-NS

Table 7 Classiﬁcation Performance on INFb Data. BACC stands for balanced accuracy, that is 0.5(sensitivity þspeciﬁcity). STD stands for standard deviation. The unit of time is second. The values in this table are the average results over 20 runs of 9-fold cross-validation.

With Model Selection

Without Model Selection

Parkinsons

BSTT

Parkinsons

BSTT

Data

Methods

Accuracy (STD)

BACC (STD)

Time

4.319 4.431 –

1.498 1.546 –

0.369 0.370 3.010

0.122 0.120 1.064

INFb

NNLS-MAX NNLS-NS NNLS-MAX kernel NNLS-NS kernel BNNLS-NS HMMs SVM HONMF

0.588 0.744 0.664 0.761 0.788 0.761 0.789 0.815

0.554 0.663 0.603 0.754 0.777 0.711 0.697 0.828

0.005 0.005 0.475 0.470 10.571 2117 93 1662

information within the samples. Hence, the structural information within samples does contribute the discrimination. Second, using dynamical-systems kernel, BNNLS-NS obtained comparable accuracy with NNLS-NS. It corroborates that BNNLS-NS is an effective method avoiding model selection. Third, kernel NNLSNS has comparable accuracy with HMMs and SVM, however it runs much faster than these benchmark methods. These preliminary results convince us that NNLS methods with appropriative

(0.039) (0.024) (0.030) (0.043) (0.029) (0.047) (0.023) (0.040)

(0.043) (0.022) (0.031) (0.043) (0.030) (0.038) (0.019) (0.065)

kernels, which take the speciﬁc structural information into account, perform well in the aspect of prediction capability and time-complexity. Thus they are worthy of further investigation for tensor data.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

16

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

1

NNLS SRC

0.9 0.8 0.7

Sparsity

0.6 0.5 0.4 0.3 0.2 0.1 0

Breast

Breast5

face95

Ionosphere

LM

INFBeta

Dataset Fig. 12. Sparsity on some datasets.

5.9. Sparsity We recorded the sparsity of NNLS and SRC on each dataset. And the sparsity of some datasets are given in Fig. 12. It can be seen that NNLS can obtain very sparse coefﬁcients. So far, we have not yet ﬁnd any case where NNLS fails to obtain sparse result.

6. Conclusion and future work In this study, we propose a novel classiﬁer based on nonnegative least squares to classify undercomplete data. For a new sample, our approach pursues a non-negative and sparse linear combination of training samples. Two rules are applied as sparse interpreters in our approach to predict new samples according to the sparse coding. A weighting strategy is proposed to handle missing values. And our approach is kernelized to classify overcomplete data. An ensemble approach based on bootstrapping is proposed to avoid model selection. And our approach is also applied to classify tensor data by an appropriate kernel. Intensive experiments using statistical test have shown that our approach and its extensions perform very well in various cases. Using learning curves, it revealed that NNLS has a smaller minimum sample size than SVM where signiﬁcant accuracy is obtained. Friedman test with post hoc Nemenyi test indicates that there is no signiﬁcant difference between NNLS and SVM. We also show that our NNLS using weighting strategy is very robust to missing values. Our NNLS approach is also as robust as SRC to random corruption over facial data. Comparison result over overcomplete data proves that the bootstrapping based NNLS is an effective methods to avoid model selection. Comparison over various data shows that our NNLS approaches are very efﬁcient. Furthermore, preliminary result shows that our approach works very well and efﬁciently on tensor data. Experimental result shows that very sparse coefﬁcient coding can be obtained by the NNLS optimization. It could be beneﬁcial to apply our NNLS approach to both large-scale undercomplete and overcomplete data which may be subject to serious missing values or noise. It is also promising to classify tensor data using our approach with appropriate kernels. The high sparsity can help users to interpret the result well.

We have developed the Sparse Representation Toolbox in MATLAB. This toolbox includes our NNLS approach as well as SRC1, SRC2, MSRC, and LRC. The implementation of statistical tests used in this study are also included. It is available at http:// cs.uwindsor.ca/ li11112c/sr. The necessity of combining nonnegative constraint with l1-norm minimization will be explored. The NNLS optimization algorithms will also be improved to classify data with large number of samples. Dictionary learning methods might be considered for this case. Since NNLS has a very small minimum data size requirement for signiﬁcant accuracy, we will explore its performance for imbalanced data. The effect of normalization will be studied for NNLS. New kernels will be investigated to measure the similarity of two samples for tensor data.

Acknowledgments We would like to thank the reviewers who gave us valuable suggestions on the presentation and statistical comparison in this article. This study has been supported by IEEE CIS Summer Research Grant 2010, Ontario Graduate Scholarship 2011–2013, and Canadian NSERC Grants #RGPIN228117-2011.

References [1] A. Zhang, Advanced Analysis of Gene Expression Microarray Data, World Scientiﬁc, Singapore, 2009. [2] I. Levner, Feature selection and nearest centroid classiﬁcation for protein mass spectrometry, BMC Bioinformatics 6 (68) (2005) 20, http://dx.doi.org/ 10.1186/1471-2105-6-68. [3] Y. Li, A. Ngom, Classiﬁcation of clinical gene-sample-time microarray expression data via tensor decomposition methods, Lect. Notes Bioinformatics/Lect. Notes Comput. Sci. 6685 (2011) 275–286. [4] T.G. Kolda, B.W. Bader, Tensor decompositions and applications, SIAM Rev. 51 (3) (2009) 455–500. [5] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classiﬁcation, second ed., John Wiley, New York, 2000. [6] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106. [7] T. Mitchell, Machine Learning, McGraw Hill, Columbus, Ohio, 1997. [8] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 2000. [9] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 2002. [10] A. Hyaˇrinen, J. Karhunen, E. Oja, Independent Component Analysis, John Wiley, New York, 2001.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i

Y. Li, A. Ngom / Neurocomputing ] (]]]]) ]]]–]]]

[11] D.D. Lee, S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999) 788–791. [12] Y. Li, A. Ngom, Non-negative matrix and tensor factorization based classiﬁcation of clinical microarray gene expression data, in: Proceedings of the International Conference on Bioinformatics and Biomedicine, Hong Kong, 2010, pp. 438–443. [13] A.M. Bruckstein, D.L. Donoho, M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. 51 (1) (2009) 34–81. [14] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing, Springer, New York, 2010. [15] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227. [16] M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, IEEE Trans. Image Process. 54 (12) (2006) 3736–3745. [17] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B 39 (1) (1977) 1–38. [18] N. Friedman, Learning belief networks in the presence of missing values and hidden variables, in: Proceedings of the International Conference on Machine Learning, Nashville, 1997, pp. 125–133. [19] C. Ding, T. Li, M.I. Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Trans. Pattern Anal. Mach. Intell. 32 (1) (2010) 45–55. [20] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998. [21] S. Wold, Pattern recognition by means of disjoint principal component models, Pattern Recognition 8 (1996) 127–139. [22] I. Naseem, R. Togneri, M. Bennamoun, Linear regression for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 32 (11) (2010) 2106–2112. [23] B.K. Natarajan, Sparse approximate solutions to linear systems, SIAM J. Comput. 24 (1995) 227–234. [24] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Rev. 43 (1) (2001) 129–159. [25] E. Cande s, J. Romberg, l1-MAGIC: recovery of sparse signals via convex programming, California Institute of Technology, Pasadena, California, 2005. Available at /http://users.ece.gatech.edu/ justin/l1magicS. [26] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T.S. Hung, S. Yan, Sparse representation for computer vision and pattern recognition, Proc. IEEE 98 (6) (2010) 1031–1044. [27] X. Mei, H. Ling, Robust visual tracking and vehicle classiﬁcation via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (11) (2011) 2259–2272. [28] P. Karsmakers, K. Pelckmans, K. De Brabanter, H. Van Hamme, J.A.K. Suykens, Sparse conjugate directions pursuit with application for ﬁxed-sized kernel models, Mach. Learn. 85 (1–2) (2011) 109–148. [29] X. Hang, F.-X. Wu, Sparse representation for classiﬁcation of tumors using gene expression data, J. Biomed. Biotechnol. (2009), http://dx.doi.org/ 10.1155/2009/403689. [30] C.-H. Zheng, L. Zhang, T.-Y. Ng, S.C.K. Shiu, D.-S. Huang, Metasample-based sparse representation for tumor classiﬁcation, IEEE Trans. Comput. Biol. Bioinformatics 8 (5) (2011) 1273–1282. [31] S.J. Kim, K. Koh, M. Lustig, S. Boyd, D. Gorinevsky, An interior-point method for large-scale l1-regularized least squares, IEEE J. Sel. Top. Signal Process. 1 (4) (2007) 606–617. Available at /http://www.stanford.edu/ boyd/papers/ l1_ls.htmlS. [32] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R.B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics 17 (6) (2001) 520–525. [33] C.L. Lawson, R.J. Hanson, Solving Least Squares Problems, SIAM, Piladelphia, 1995. [34] D.P. Bertsekas, Nonlinear Programming, second ed., third printing, Athena Scientiﬁc, Belmont, Massachusetts, 2008. [35] C.G. Atkeson, A.W. Moore, S. Schaal, Locally weighted learning, Artif. Intell. Rev. 11 (1997) 11–73. [36] R. Bro, S. De Jong, A fast non-negative constrained least squares algorithm, J. Chemometr. 11 (1997) 393–401. [37] M.H. Van Benthem, M.R. Keenan, Fast algorithm for the solution of largescale non-negative constrained least squares problems, J. Chemometr. 18 (2004) 441–450. [38] D.D. Lee, S. Seung, Algorithms for non-negative matrix factorization, in: Proceedings of the Conference on Neural Information Processing Systems, Denver, 2000, pp. 556–562. [39] K.M. Borgwardt, S.V.N. Vishwanathan, H.P. Kriegel, Class prediction from time series gene expression proﬁles using dynamical systems kernels, in: Proceedings of the Paciﬁc Symposium on Biocomputing, Maui, Hawaii, 2006, pp. 547–558. [40] D.A. Notterman, et al., Transcriptional gene expression proﬁles of colorectal adenoma, adenocarcinoma, and normal tissue examined by oligonucleotide arrays, Cancer Res. 61 (2001) 3124–3130. [41] L.J. Van’t Veer, et al., Gene expression proﬁling predicts clinical outcome of breast cancer, Nature 415 (6871) (2002) 530–536.

17

[42] U. Alon, et al., Broad patterns of gene expression revealed by clustering of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. U.S.A. 96 (12) (1999) 6745–6750. [43] T.R. Golub, et al., Molecular classiﬁcation of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (15) (1999) 531–537. [44] E.J. Yeoh, et al., Classiﬁcation, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression proﬁling, Cancer Cell 1 (2002) 133–143. [45] Z. Hu, et al., The molecular portraits of breast tumors are conserved across microarray platforms, BMC Genomics 7 (96) (2006), http://dx.doi.org/ 10.1186/1471-2164-7-96. [46] S.A. Armstrong, et al., MLL translocations specify a distinct gene expression proﬁle that distinguishes a unique leukemia, Nat. Genet. 30 (2002) 41–47. [47] J. Khan, et al., Classiﬁcation and diagnostic prediction of cancers using gene expression proﬁling and artiﬁcial neural networks, Nat. Med. 7 (6) (2001) 673–679. [48] Clinical proteomics program, Center for Cancer Research, National Cancer Institute. Available at /http://home.ccr.cancer.gov/ncifdaproteomics/ppat terns.aspS. [49] T.P. Conrads, et al., High-resolution serum proteomic features for ovarian cancer detection, Endocr.-Relat. Cancer 11 (2004) 163–178. [50] E.F. Petricoin III, et al., Serum proteomic patterns for detection of prostate cancer, J. Nat. Cancer Inst. 94 (20) (2002) 1576–1578. [51] B. Adam, et al., Serum protein ﬁngerprinting coupled with a patternmatching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men, Cancer Res. 62 (13) (2002) 3609–3614. [52] L. Spacek, Face recognition data, University of Essex, Essex, UK, 2008. Available at /http://cswww.essex.ac.uk/mv/allfacesS. [53] A. Frank, A. Asuncion, UCI machine learning repository, School of Information and Computer Science, University of California, Irvine, California, 2010. Available at /http://archive.ics.uci.edu/mlS. [54] M.A. Little, P.E. McSharry, S.J. Roberts, D.A.E. Costello, I.M. Moroz, Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection, Biomed. Eng. Online 6 (23) (2007), http://dx.doi.org/10.1186/1475925X-6-23. [55] S.E. Baranzini, et al., Transcription-based prediction of response to INFb using supervised computational methods, PLOS Biol. 3 (1) (2005) e2. [56] D. Wang, G.-B. Huang, Protein sequence classiﬁcation using extreme learning machine, in: Proceedings of the International Conference on Joint Neural Networks, Montreal, 2005, vol. 3, pp. 1406–1411. [57] S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T.R. Golub, J.P. Mesirov, Estimating dataset size requirements for classifying DNA microarray data, J. Comput. Biol. 10 (2) (2003) 119–142. [58] J. Demˇsar, Statistical comparisons of classiﬁers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30. [59] T. Lin, N. Kaminski, Z. Bar-Joseph, Alignment and classiﬁcation of time series gene expression in clinical studies, Bioinformatics 24 (ISMB 2008) (2008) i147–i155.

Yifeng Li received his Bachelor and Master Degrees from Shandong Polytechnic University, China, in 2006 and 2009, respectively. Now he is a Ph.D. candidate with School of Computer Science, University of Windsor, Canada. His research interests include machine learning, numerical optimization, and bioinformatics. He is particularly interested in sparse models in the above areas. He is a member of ACM and IEEE.

Alioune Ngom is an Associate Professor at the University of Windsor, Ontario, Canada. Prior to joining the University of Windsor, he held the position of an Assistant Professor at the Department of Mathematics and Computer Science at Lakehead University, Thunder Bay, Ontario, Canada, from 1998 to 2000. His main research interests include but are not limited to computational intelligence and machine learning methods and their applications in computational biology and bioinformatics problems such as microarray analysis, protein analysis, oligonucleotide selection, bioimage analysis, and gene regulatory network analysis. He is a member of IEEE.

Please cite this article as: Y. Li, A. Ngom, Classiﬁcation approach based on non-negative least squares, Neurocomputing (2013), http:// dx.doi.org/10.1016/j.neucom.2013.02.012i