
Neurocomputing 000 (2018) 1–9


Robust jointly sparse embedding for dimensionality reduction

Zhihui Lai a,∗, Yudong Chen a, Dongmei Mo a, Jiajun Wen a, Heng Kong b

a College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
b Shenzhen University General Hospital, Shenzhen University, Shenzhen 518060, China

Article history: Received 13 August 2016; Revised 1 March 2018; Accepted 19 June 2018; Available online xxx. Communicated by Feiping Nie.

Keywords: Dimensionality reduction; Manifold learning; Robustness

Abstract

As a famous linear manifold learning method, orthogonal neighborhood preserving projections (ONPP) is able to provide a set of orthogonal projections for dimensionality reduction. However, ONPP takes the L2-norm as the basic measurement and therefore tends to be sensitive to outliers or variations of the data. Aiming at strengthening the robustness of the conventional ONPP, in this paper a robust and sparse dimensionality reduction method based on linear reconstruction, called Robust Jointly Sparse Embedding (RJSE), is proposed by introducing the L2,1-norm as both the basic measurement and the regularization term. We design a simple iterative algorithm to obtain the optimal solution of the proposed robust and sparse dimensionality reduction model. Experiments on four benchmark data sets demonstrate the competitive performance of the proposed method compared with state-of-the-art methods. © 2018 Elsevier B.V. All rights reserved.

1. Introduction

Dimensionality reduction is a well-known data processing technique that can simultaneously reduce redundant information and preserve important information in high-dimensional data. The conventional dimensionality reduction methods include principal component analysis (PCA) [1], linear discriminant analysis (LDA) [2–5] and sparse discriminant analysis (SDA) [6]. Unlike these methods, the non-linear manifold learning method locally linear embedding (LLE) [7,8] can preserve the local geometric structure of the data set. The related techniques have been well studied in the fields of data mining and pattern recognition. To preserve the local geometric structure of the data with an explicit linear mapping, linearizations of LLE and of other manifold learning methods were developed, among which orthogonal neighborhood preserving projections (ONPP) [9] is one of the most well known. ONPP is a linear approximation to the nonlinear method LLE, which allows it to preserve the manifold structure in a linear subspace. In [10], the authors proposed a sparse extension of ONPP. In [11,12], the authors further extended ONPP to process tensor data. Moreover, [13] combined supervised learning and manifold learning for dimensionality reduction. Since these manifold learning methods have shown promising performance in pattern recognition, they have been widely used in face recognition [14–16], disease classification

∗ Corresponding author. E-mail addresses: [email protected] (Z. Lai), [email protected] (D. Mo).

[17], finger vein recognition [18], financial data analysis [19], clustering [20,21] and human age estimation [22]. However, the traditional manifold learning methods still have some drawbacks. First, these methods use the L2-norm as the measurement, which is sensitive to outliers in the data. Second, they lack a sparse feature selection mechanism for obtaining more reliable low-dimensional features. An effective way to enhance robustness is to use the L2,1-norm as the measurement instead of the L2-norm in the traditional manifold learning methods. The L2,1-norm metric has become a popular technique in recent years because of its competitive robustness to outliers compared with the L2-norm. Since no square operation needs to be performed during the optimization of an L2,1-norm-based model, the noise or reconstruction error is not over-emphasized, which reduces the sensitivity of the model. Therefore, many L2,1-norm-based methods have been proposed for improving model robustness, including robust feature selection [23–25], robust PCA [26,27] and robust classification [28]. In addition, L2,1-norm regularization is an effective way to obtain joint sparsity and thus improve classification performance. In recent years, sparse learning has become more and more popular [29–33]. The so-called sparsity means generating a sparse representation matrix or projection matrix in which most elements are zero, so that the main features of the images are emphasized and the feature extraction results are more interpretable. In summary, the conventional methods, i.e., PCA, LDA and their extensions, ignore the local geometric structure of the data set, and manifold learning methods such as LLE and ONPP lack

https://doi.org/10.1016/j.neucom.2018.06.051 0925-2312/© 2018 Elsevier B.V. All rights reserved.

Please cite this article as: Z. Lai et al., Robust jointly sparse embedding for dimensionality reduction, Neurocomputing (2018), https://doi.org/10.1016/j.neucom.2018.06.051


robustness, or do not obtain joint sparsity for discriminative feature selection. Therefore, in this paper, we propose a robust and sparse subspace learning method that uses the L2,1-norm as the main metric and as the regularization term, while taking the local geometric structure of the data into account. The proposed L2,1-norm based model can be solved by a simple iterative algorithm. The contributions of this paper are as follows:

(1) We construct a robust regression model based on ONPP and prove that the optimal solution space of the model is the same as that of ONPP in some special cases.
(2) By adding the L2,1-norm as the regularization term, joint sparsity is easily obtained, which improves the performance of feature selection. The proposed L2,1-norm based model is shown to be more robust than other L1-norm or L2-norm based methods on four benchmark data sets.
(3) We provide the convergence proof of the proposed iterative algorithm. Extensive experimental results with figures and tables illustrate the effectiveness of our method.

Third, all the sample points x_i are projected to a low-dimensional space Y = [y_1, y_2, ..., y_n]. The optimization problem is as follows:

min_Y Σ_i || y_i − Σ_{j∈C_k(x_i)} W_ij y_j ||^2   s.t. Y^T Y = I   (2)

where I is the identity matrix. Finally, the optimal solution of (2) is given by the following eigenvalue problem:

(I − W)^T (I − W) y = λ y   (3)
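As a concrete illustration (not the authors' code), the embedding step in (2)–(3) can be sketched with numpy. Discarding the eigenvector of eigenvalue zero (the constant vector, which exists because the rows of W sum to one) is standard LLE practice and is our assumption here:

```python
import numpy as np

def lle_embedding(W, d):
    """Solve Eqs. (2)-(3): the eigenvectors of (I - W)^T (I - W)
    with the smallest eigenvalues give the low-dimensional
    coordinates Y; the trivial constant eigenvector is skipped."""
    n = W.shape[0]
    E = np.eye(n) - W
    M = E.T @ E
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    return vecs[:, 1:d + 1]          # n x d embedding

# toy weight matrix: each point reconstructed from its two ring neighbors
n = 6
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
Y = lle_embedding(W, d=2)
```

Because `np.linalg.eigh` returns orthonormal eigenvectors, the constraint Y^T Y = I of (2) holds by construction.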

The solution space of LLE is spanned by the eigenvectors corresponding to the first d minimal eigenvalues λ [34].

2.3. Orthogonal neighborhood preserving projections

ONPP is the linearization of LLE: it introduces a linear projection p ∈ R^m into LLE. After dimensionality reduction by the linear projection, each sample point and its neighboring points should have the minimum reconstruction error. Therefore, ONPP aims to solve the following optimization problem:


The rest of the paper consists of five sections. Section 2 introduces some notations used in this paper and reviews LLE and its linear extension ONPP. Section 3 introduces the regression form of ONPP. Section 4 presents the robust and sparse subspace learning method, including the designed iterative algorithm and its convergence proof. Experimental results on four different data sets are shown in Section 5, and Section 6 summarizes the paper.


2. Related work

3. L2, 1 -norm based regression model

In this section, we first introduce the notations and definitions used in this paper. Then we review the non-linear manifold learning method LLE and its linearization, ONPP.

In this section, we propose an L2,1-norm based regression model for linear reconstruction which can be easily solved for the optimal solution. Moreover, we discuss the connection between the proposed model and the traditional ONPP.

min_p Σ_i || p^T x_i − Σ_{j∈C_k(x_i)} W_ij p^T x_j ||^2   s.t. p^T p = 1   (4)

Similar to LLE, problem (4) can be transformed into an eigenvalue problem:

X (I − W)^T (I − W) X^T p = λ̃ p   (5)

Then the solution P = [p_1, p_2, ..., p_d] consists of the eigenvectors corresponding to the d minimal eigenvalues λ̃ [34].

2.1. Notations and definition of the L2,1-norm

We first introduce some notations used in this paper. Matrices are represented by uppercase letters such as X and Y, and vectors by lowercase letters such as a and p. For the data matrix X ∈ R^{m×n}, m denotes the dimensionality of the data points and n denotes the number of data points. Besides, d is the desired dimensionality of the low-dimensional feature vectors. The L2,1-norm plays an important role in jointly sparse feature selection. For a general matrix X, x^i and x_j denote its ith row and jth column, respectively. The L2,1-norm of the matrix X is defined as

||X||_{2,1} = Σ_{i=1}^m sqrt( Σ_{j=1}^n X_ij^2 ) = Σ_{i=1}^m ||x^i||_2

where X_ij denotes the element in the ith row and jth column of X. The L2,1-norm satisfies the three requirements of a valid norm [23].
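The definition above amounts to summing the L2-norms of the rows, which a minimal numpy sketch (ours, for illustration) makes explicit:

```python
import numpy as np

def l21_norm(X):
    """L2,1-norm: the sum of the L2-norms of the rows of X,
    i.e. ||X||_{2,1} = sum_i sqrt(sum_j X_ij^2)."""
    return float(np.sqrt((X ** 2).sum(axis=1)).sum())

# row norms are 5, 0 and 13, so the L2,1-norm is 18
X = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])
```

Because each row contributes its norm without squaring, a single large-residual row (an outlier) does not dominate the objective the way it would under a squared L2 loss.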

The LLE algorithm supposes that every data point can be reconstructed by its neighbors, and it is divided into three steps. First, find the k nearest neighbor points of x_i based on the Euclidean distance. Second, compute the weight matrix W from the k nearest neighbor points by solving the following optimization problem:

min_W Σ_i || x_i − Σ_{j∈C_k(x_i)} W_ij x_j ||^2   s.t. Σ_j W_ij = 1   (1)
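For illustration, the constrained least-squares problem (1) has a well-known per-sample closed form: solve the local Gram system and rescale the weights to sum to one. The small regularizer for a singular Gram matrix is a standard stabilizer and our addition, not part of (1):

```python
import numpy as np

def lle_weights(xi, neighbors, reg=1e-3):
    """Reconstruction weights of Eq. (1) for one sample xi from the
    rows of `neighbors`, with the sum-to-one constraint: solve the
    (regularized) local Gram system G w = 1, then normalize w."""
    Z = neighbors - xi                        # neighbors centered at xi
    G = Z @ Z.T                               # k x k local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))   # stabilize a singular G
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()

# a point midway between two neighbors gets equal weights
w = lle_weights(np.array([0.5, 0.5]),
                np.array([[1.0, 0.0], [0.0, 1.0]]))
```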

min_a || M X^T a a^T − M X^T ||_{2,1}   s.t. a^T a = 1   (6)

where a ∈ R^m is the projection and M = I − W. For the optimization problem (6), we have the following Theorem 1.

Theorem 1. The optimization problem (6) derives the same solution space as that of ONPP if the diagonal elements of the matrix D derived by the L2,1-norm optimization problem are the same non-zero constant.

Proof. We use an iteratively reweighted algorithm to solve the L2,1-norm based model, as in [23]. From (6), we can derive

2.2. Locally linear embedding

where W_ij is the weight coefficient and C_k(x_i) is the index set of the k nearest neighbor points of x_i.

3.1. Robust regression model

In order to increase the robustness of ONPP, we use the L2,1-norm as the measurement of the loss function. Besides, unlike ONPP, which minimizes the information loss in the low-dimensional feature space, we aim to reduce the reconstruction error over different neighbors. Using the L2,1-norm to increase the robustness, we obtain model (6).

|| M X^T a a^T − M X^T ||_{2,1}
= tr( (M X^T a a^T − M X^T)^T D (M X^T a a^T − M X^T) )
= tr( a^T X M^T D M X^T a ) − a^T X M^T D M X^T a − a^T X M^T D M X^T a + tr( X M^T D M X^T )   (7)

where D is a diagonal matrix defined as

D_ii = 1 / ( 2 ||h^i||_2 )   (8)
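The reweighting step (8) can be sketched directly in numpy. The `eps` guard against all-zero residual rows is our assumption; the paper implicitly assumes non-zero row norms:

```python
import numpy as np

def reweight_matrix(H, eps=1e-8):
    """Diagonal reweighting matrix of Eq. (8): D_ii = 1 / (2 ||h_i||_2),
    where h_i is the i-th row of the residual H = M X^T a a^T - M X^T."""
    row_norms = np.linalg.norm(H, axis=1)
    return np.diag(0.5 / np.maximum(row_norms, eps))

# rows with norms 5 and 1 give diagonal entries 1/10 and 1/2
D = reweight_matrix(np.array([[3.0, 4.0],
                              [0.6, 0.8]]))
```

Note the inverse relationship: the larger a row's reconstruction error, the smaller its weight D_ii, which is exactly the mechanism the paper credits for robustness to outliers.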


where h^i is the ith row of H = M X^T a a^T − M X^T. Since D is fixed within each iteration (it is updated by (8) between iterations and initialized as an identity matrix in the first iteration), the term tr(X M^T D M X^T) does not depend on a. We can therefore obtain the following optimization problem from (7):

max_a tr( a^T X M^T D M X^T a )   s.t. a^T a = 1   (9)

From (9) we have the following eigenvalue problem:

X M^T D M X^T a = γ a   (10)

Then the solution space of (10) is spanned by [a_1, a_2, ..., a_d], where a_i is the ith eigenvector of (10) with the eigenvalues sorted in increasing order. Comparing (10) with (5), the only difference is the diagonal matrix D in (10). Suppose the diagonal elements of D are the same non-zero constant v; then X M^T D M X^T a = γ a derives X M^T M X^T a = γ_1 a, where γ_1 = γ/v. Therefore, the diagonal matrix D does not change the solution space.

Theorem 1 proves that the optimal projection of optimization problem (6) is the same as that of ONPP if the diagonal elements of the matrix D are equal to a non-zero constant. However, if the reconstruction errors are not constant, the new model will not derive the same solution as ONPP. From the definition of the diagonal matrix D, we can find that if the reconstruction error is large for the ith data point, then the ith diagonal element of D will be very small (i.e., an inverse relationship), which suppresses the contribution of the ith reconstruction error to the objective function. This is the essential reason for the robustness of the proposed method.

3.2. Relaxed regression model

The new model (6) is still not in sparse regression form. We would like to add an L2,1-norm regularization term to model (6) so that it can perform jointly sparse feature extraction. However, the reconstruction error will increase once the projection matrix becomes sparse. To address this problem, we first replace a in the term M X^T a a^T of (6) with a new variable b to relax the model, so that it can better fit the data and obtain better generalization ability. As a result, the new model takes a regression form to which a regularization term can be added to obtain sparse projections. Let us consider the relaxed form of (6):

min_{a,b} || M X^T b a^T − M X^T ||_{2,1}   s.t. a^T a = 1   (11)

For the above model, we have the following conclusion.

Theorem 2. Assume X M^T M X^T is full-rank and initialize D̃ as an identity matrix in the first step of the iterative algorithm; then the projection p of (4) is exactly the same as the projection b of (11).

Proof. From (11), we have

|| M X^T b a^T − M X^T ||_{2,1}
= tr( (M X^T b a^T − M X^T)^T D̃ (M X^T b a^T − M X^T) )
= tr( a b^T X M^T D̃ M X^T b a^T − a b^T X M^T D̃ M X^T − X M^T D̃ M X^T b a^T + X M^T D̃ M X^T )   (12)

where D̃ is a diagonal matrix defined as D̃_ii = 1 / ( 2 ||g^i||_2 ), and g^i is the ith row of G = M X^T b a^T − M X^T. For fixed a, setting the derivative of (12) with respect to b to zero, we get

2 X M^T D̃ M X^T b − X M^T D̃ M X^T a − X M^T D̃^T M X^T a = 0

If we initialize D̃ as an identity matrix, then we get

b = ( X M^T M X^T )^{−1} X M^T M X^T a = a   (13)

For fixed b, substituting (13) into (12), we get (9). Since we initialized D̃ as an identity matrix, (9) degrades to (5). Hence the projection p in (4) is exactly the same as the projection b in (11) if D̃ is an identity matrix (or if the elements of the diagonal matrix D̃ are the same non-zero constant).

Theorem 2 proves that (11) has the same projection as ONPP when X M^T M X^T is full-rank and D̃ is initialized as an identity matrix in the first step of the iterative algorithm. As a regression model based on ONPP, (11) is easy to solve for the optimal solutions. However, the optimal solution of (11) is still not sparse. Therefore, in Section 4, we propose a robust and sparse model to obtain sparse projections for feature selection.

4. Robust jointly sparse embedding

In this section, by adding an L2,1-norm regularization term to (11), we obtain a robust and sparse subspace for feature selection which preserves the manifold structure of the data set. An iterative algorithm is then designed to solve the model.

4.1. Robust and sparse regularized model

In order to perform feature selection, we need to add a jointly sparse regularization term to the regression model. Besides, the objective function in vector form can learn only one projection vector, which is not enough for the proposed method to obtain good performance in such a low-dimensional (i.e., one-dimensional) subspace. In order to learn a set of projection vectors (i.e., a projection matrix) for feature extraction while minimizing the reconstruction error, we extend the vector-based model (11) to a matrix-based model and obtain the following matrix-form formulation:

min_{A,B} || M X^T B A^T − M X^T ||_{2,1} + β ||B||_{2,1}   s.t. A^T A = I   (14)

where β > 0 and A ∈ R^{m×d}, B ∈ R^{m×d}. The above model can obtain a sparse discriminative projection matrix for feature selection. In the next subsection, we propose a simple iterative algorithm to compute the optimal solution.

4.2. Solution of the problem

We first construct the diagonal matrices Q and Z:

Q_ii = 1 / ( 2 ||c^i||_2 )   (15)

Z_ii = 1 / ( 2 ||b^i||_2 )   (16)

where c^i and b^i denote the ith rows of C = M X^T B A^T − M X^T and B, respectively. From (14), we have

|| M X^T B A^T − M X^T ||_{2,1} + β ||B||_{2,1}
= tr( (M X^T B A^T − M X^T)^T Q (M X^T B A^T − M X^T) ) + β tr( B^T Z B )
= tr( A B^T X M^T Q M X^T B A^T − A B^T X M^T Q M X^T − X M^T Q M X^T B A^T + X M^T Q M X^T ) + β tr( B^T Z B )
s.t. A^T A = I   (17)


Fig. 1. Convergence of the iterative algorithm on four data sets: (a) AR, (b) ORL, (c) FERET, (d) COIL-20.

We alternately optimize the matrices A and B in each iteration. For fixed A, X M^T Q M X^T is a constant. Setting the derivative of (17) with respect to B to zero, we have

B = ( X M^T Q M X^T + β Z )^{−1} X M^T Q M X^T A   (18)

For fixed B, the terms A B^T X M^T Q M X^T B A^T, X M^T Q M X^T and the regularization term are constant, so (17) becomes the following maximization problem:

max_A tr( A^T X M^T Q M X^T B )   s.t. A^T A = I   (19)

From Theorem 4 in [35], with the singular value decomposition X M^T Q M X^T B = U Σ V^T, A = U V^T is the optimal projection matrix of (19). More details of the iterative algorithm are described in Table 1.

Table 1
Robust jointly sparse embedding algorithm.
Input: data matrix X = [x_1, x_2, ..., x_n], number of iterations T_m, d (d << m), β.
Step 1: Initialize M = I − W, where W is the local reconstruction matrix.
Step 2: Initialize A as a random column-orthogonal m × d matrix; initialize Q and Z as identity matrices.
Step 3: For j = 1 : T_m do
    Given A, B = (X M^T Q M X^T + β Z)^{−1} (X M^T Q M X^T A).
    Given B, compute X M^T Q M X^T B = U Σ V^T; then A = U V^T.
    Update Q and Z with (15) and (16).
Step 4: Normalize B.
Output: low-dimensional data matrix Y = B^T X.

4.3. Convergence

In this section, we prove that the iterative algorithm in Table 1 is convergent.
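To make the procedure concrete, the steps of Table 1 can be sketched in numpy as follows. This is our reading of the algorithm, not the authors' released code; the `eps` guard for zero-norm rows and the column-wise normalization of B in Step 4 are our assumptions:

```python
import numpy as np

def rjse(X, W, d, beta=1.0, n_iter=15, eps=1e-8):
    """Sketch of the RJSE iterations (Table 1). X is m x n, W is the
    n x n LLE reconstruction matrix, d the target dimension, and
    beta the sparsity trade-off of Eq. (14)."""
    m, n = X.shape
    M = np.eye(n) - W                            # Step 1
    A = np.linalg.qr(np.random.randn(m, d))[0]   # Step 2: random orthogonal A
    Q, Z = np.eye(n), np.eye(m)
    S = X @ M.T                                  # so S Q S^T = X M^T Q M X^T
    for _ in range(n_iter):                      # Step 3
        K = S @ Q @ S.T
        B = np.linalg.solve(K + beta * Z, K @ A)              # Eq. (18)
        U, _, Vt = np.linalg.svd(K @ B, full_matrices=False)
        A = U @ Vt                               # Eq. (19) via Procrustes
        C = M @ (X.T @ B @ A.T - X.T)            # residual M X^T B A^T - M X^T
        Q = np.diag(0.5 / np.maximum(np.linalg.norm(C, axis=1), eps))  # (15)
        Z = np.diag(0.5 / np.maximum(np.linalg.norm(B, axis=1), eps))  # (16)
    return B / np.maximum(np.linalg.norm(B, axis=0), eps)     # Step 4

# toy run: 5-dimensional data, 8 samples, ring-neighbor weights
np.random.seed(0)
X = np.random.randn(5, 8)
W = np.zeros((8, 8))
for i in range(8):
    W[i, (i - 1) % 8] = W[i, (i + 1) % 8] = 0.5
B = rjse(X, W, d=2, n_iter=5)
Y = B.T @ X   # low-dimensional features
```

Note that K + β Z is always invertible, since K is positive semi-definite and β Z is a positive diagonal matrix.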


Fig. 2. Sample images from AR data set.

Fig. 5. Average recognition rates on ORL data set without block.

Theorem 3. The objective function value of (14) monotonically decreases in each iteration.

Proof. In each iteration, we obtain the optimal A and B by solving (18) and (19). Model (14) can be rewritten as

min_{C,B} tr( C^T Q C ) + β tr( B^T Z B )   (20)

where C = M X^T B A^T − M X^T. In the (t+1)th iteration, let

<C_{t+1}, B_{t+1}> = arg min_{C,B} tr( C^T Q_t C ) + β tr( B^T Z_t B ),

which means

tr( C_{t+1}^T Q_t C_{t+1} ) + β tr( B_{t+1}^T Z_t B_{t+1} ) ≤ tr( C_t^T Q_t C_t ) + β tr( B_t^T Z_t B_t )

or, equivalently,

Σ_{i=1}^n ||c^i_{t+1}||_2^2 / ( 2 ||c^i_t||_2 ) + β Σ_{i=1}^m ||b^i_{t+1}||_2^2 / ( 2 ||b^i_t||_2 ) ≤ Σ_{i=1}^n ||c^i_t||_2^2 / ( 2 ||c^i_t||_2 ) + β Σ_{i=1}^m ||b^i_t||_2^2 / ( 2 ||b^i_t||_2 )

From Lemma 1 in [23], for c^i and b^i we have

Σ_{i=1}^n ( ||c^i_{t+1}||_2 − ||c^i_{t+1}||_2^2 / ( 2 ||c^i_t||_2 ) ) + β Σ_{i=1}^m ( ||b^i_{t+1}||_2 − ||b^i_{t+1}||_2^2 / ( 2 ||b^i_t||_2 ) )   (21)

≤ Σ_{i=1}^n ( ||c^i_t||_2 − ||c^i_t||_2^2 / ( 2 ||c^i_t||_2 ) ) + β Σ_{i=1}^m ( ||b^i_t||_2 − ||b^i_t||_2^2 / ( 2 ||b^i_t||_2 ) )   (22)

Combining (21) and (22) with the inequality above, we derive

Σ_{i=1}^n ||c^i_{t+1}||_2 + β Σ_{i=1}^m ||b^i_{t+1}||_2 ≤ Σ_{i=1}^n ||c^i_t||_2 + β Σ_{i=1}^m ||b^i_t||_2,

which means

||C_{t+1}||_{2,1} + β ||B_{t+1}||_{2,1} ≤ ||C_t||_{2,1} + β ||B_t||_{2,1}

Theorem 3 proves that in each iteration of the algorithm the objective value of (14) monotonically decreases. Since (14) is convex, a global optimal solution of the optimization problem is guaranteed. Fig. 1(a)–(d) show the fast convergence of the iterative algorithm on the AR, ORL, FERET and COIL-20 data sets; in the experiments the algorithm basically converges within 10–15 iterations.

Fig. 3. Average recognition rates on AR data set without block.

5. Experiment

In this section, we compare ten different dimensionality reduction algorithms, including traditional algorithms, state-of-the-art algorithms and the proposed algorithm. The traditional algorithms include the classical discriminant method LDA and the manifold-learning-based methods neighborhood preserving embedding (NPE) [36], ISOMAP [37] and ONPP [9]. The state-of-the-art algorithms include the L1-norm-based sparse methods unified sparse subspace learning (USSL) [38] and sparse linear embedding (SLE) [10], and the L2,1-norm-based methods SAIR [39], joint embedding learning and sparse regression (JELSR) [40,41] and structured optimal graph feature selection (SOGFS) [42]. The proposed algorithm is called

Fig. 4. Sample images from ORL data set.


Fig. 6. Sample images from FERET data set.

Table 2
Best average recognition rates and corresponding dimensions on AR (%).

Method  | Without block |  5 × 5 block  |  7 × 7 block
        | Rate    Dim   | Rate    Dim   | Rate    Dim
LDA     | 86.08   115   | 82.58   115   | 81.47   115
NPE     | 85.35   145   | 82.21   145   | 81.13   145
ISOMAP  | 81.92   120   | 78.66   125   | 77.38   120
ONPP    | 88.85   145   | 83.45   145   | 79.98   145
USSL    | 90.85   150   | 89.68   150   | 87.23   145
SAIR    | 88.48   150   | 86.42   150   | 82.57   150
JELSR   | 72.98   150   | 71.30   150   | 66.74   150
SOGFS   | 76.29   145   | 76.48   150   | 74.73   140
SLE     | 89.93   150   | 88.17   150   | 86.48   145
RJSE    | 91.63   135   | 90.50   110   | 88.86   110

Fig. 7. Average recognition rates on FERET data set without block.

RJSE. We use four standard data sets, i.e., AR, ORL, FERET and COIL-20, to test the algorithms' performance under different conditions. The code of this paper can be downloaded from http://www.scholat.com/laizhihui.

Table 3
Best average recognition rates and corresponding dimensions on ORL (%).

Method  | Without block |  5 × 5 block  |  7 × 7 block
        | Rate    Dim   | Rate    Dim   | Rate    Dim
LDA     | 91.25   35    | 87.42   40    | 86.58   35
NPE     | 92.71   35    | 89.79   35    | 87.29   35
ISOMAP  | 91.88   40    | 90.04   40    | 86.29   40
ONPP    | 76.50   75    | 69.96   75    | 67.33   75
USSL    | 91.25   60    | 86.29   35    | 81.96   35
SAIR    | 89.25   75    | 85.46   75    | 80.00   75
JELSR   | 89.92   75    | 85.46   70    | 83.04   75
SOGFS   | 90.67   75    | 89.42   70    | 86.33   75
SLE     | 93.96   75    | 92.19   75    | 88.17   75
RJSE    | 94.88   75    | 92.17   70    | 87.63   60

5.1. Experiment settings

In each experiment, we randomly chose T sample points from each class to form the training set, and the rest were used as testing samples. After extracting features from the training and testing samples, we used the nearest-neighbor (NN) classifier to perform classification. On the AR, FERET and COIL-20 data sets, the number of desired low-dimensional features was varied from 5 to 150 at intervals of 5; on the ORL data set it was varied from 5 to 75 at intervals of 5. To ensure the reliability of the experimental results, each algorithm was run 10 times on each data set and the average recognition rates were computed. Moreover, to verify the robustness of the different algorithms, we randomly added a square block of size 5 × 5 or 7 × 7 to each image. For the manifold learning methods, the number of neighborhood points was selected from 1 to 20. For USSL, SLE, SAIR, JELSR, SOGFS and RJSE, the regularization coefficients were selected from 10^{−3}, 10^{−2}, ..., 10^3.
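The evaluation protocol described above reduces to 1-nearest-neighbor classification of the extracted features; a minimal numpy sketch (ours, for illustration):

```python
import numpy as np

def nn_accuracy(train_x, train_y, test_x, test_y):
    """Recognition rate with a 1-NN classifier under Euclidean
    distance: each test feature vector takes the label of its
    single nearest training feature vector."""
    # pairwise squared Euclidean distances, shape (n_test, n_train)
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    pred = train_y[np.argmin(d2, axis=1)]
    return float((pred == test_y).mean())

# toy check: each test point is closest to the training point of its class
train_x = np.array([[0.0, 0.0], [1.0, 1.0]])
train_y = np.array([0, 1])
test_x = np.array([[0.1, -0.1], [0.9, 1.2]])
test_y = np.array([0, 1])
acc = nn_accuracy(train_x, train_y, test_x, test_y)
```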

5.2. Data set descriptions and experimental results

The AR face data set [43] includes 2400 face images of 120 people; each person has 20 images. Each face image was resized to 25 × 20 pixels to relieve the computational burden. In this experiment, 6 images of each person were randomly selected as the training set and the remaining images were used as the test set. Fig. 2 displays sample images from the AR data set, and the experimental results on AR are shown in Fig. 3 and Table 2.

The ORL data set (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) includes 400 images of 40 distinct subjects; each subject has 10 images. All the images were resized to 23 × 23 pixels to relieve the computational burden. In this experiment, the number of training samples T is 4. Fig. 4 displays sample images from the ORL data set, and the experimental results on ORL are shown in Fig. 5 and Table 3.

Fig. 8. Sample images from COIL-20 data set.


Table 4
Best average recognition rates and corresponding dimensions on FERET (%).

Method  | Without block |  5 × 5 block  |  7 × 7 block
        | Rate    Dim   | Rate    Dim   | Rate    Dim
LDA     | 76.10   145   | 71.93   125   | 70.65   125
NPE     | 75.57   145   | 71.70   120   | 70.73   115
ISOMAP  | 74.37   110   | 70.78   105   | 69.87   110
ONPP    | 77.13   150   | 70.02   150   | 69.32   150
USSL    | 82.93   125   | 77.77   115   | 72.03   85
SAIR    | 84.32   140   | 79.20   145   | 74.07   150
JELSR   | 79.37   145   | 78.55   55    | 75.20   65
SOGFS   | 80.00   150   | 78.48   145   | 74.12   150
SLE     | 84.37   150   | 80.25   150   | 74.92   150
RJSE    | 84.57   150   | 80.05   150   | 76.08   65

Fig. 9. Average recognition rates on COIL-20 data set without block.

The FERET data set [44] includes 1400 facial images of 200 individuals; each person has 7 images. Every facial image was resized to 20 × 20 pixels to relieve the computational burden. In this experiment, the number of training samples T is 4. Fig. 6 displays sample images from the FERET data set, and the experimental results on FERET are shown in Fig. 7 and Table 4.

The COIL-20 data set (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php) includes 1440 images from 20 different subjects; each subject has 72 images. Every image was resized to a vector of dimension 256 to speed up the computation. In this experiment, we set T = 10. Fig. 8 displays sample images from the COIL-20 data set, and the experimental results on COIL-20 are shown in Fig. 9 and Table 5.

Table 5
Best average recognition rates and corresponding dimensions on COIL-20 (%).

Method  | Without block |  5 × 5 block  |  7 × 7 block
        | Rate    Dim   | Rate    Dim   | Rate    Dim
LDA     | 75.65   20    | 73.82   15    | 73.02   20
NPE     | 75.96   60    | 73.86   60    | 73.31   55
ISOMAP  | 76.45   45    | 74.61   40    | 74.27   45
ONPP    | 90.53   145   | 88.32   145   | 82.44   145
USSL    | 92.05   60    | 91.13   45    | 88.82   35
SAIR    | 92.10   145   | 90.64   140   | 88.95   150
JELSR   | 92.98   150   | 92.17   150   | 90.48   150
SOGFS   | 92.86   145   | 91.44   130   | 89.98   150
SLE     | 92.88   150   | 92.81   150   | 91.35   150
RJSE    | 93.55   150   | 92.85   150   | 91.63   150

5.3. Discussion of the experimental results

The recognition performance of the ten dimensionality reduction algorithms is shown in Figs. 3, 5, 7 and 9 and Tables 2–5. We analyze the experimental results from the following three aspects: (1) Among the traditional manifold learning methods, ONPP performs much better than NPE and ISOMAP, because ONPP obtains orthogonal projections, which remove trivial features. Besides, the newer sparse methods USSL, JELSR,

Fig. 10. Variations of recognition rates on ORL when parameter β and k changed.


SOGFS, SLE and RJSE further extend manifold learning to the sparse case and thus can outperform ONPP. (2) As shown in the experimental results on the ORL, FERET and COIL-20 data sets, SLE and RJSE are more effective than USSL, JELSR and SOGFS. Compared to these methods, the advantage of RJSE and SLE is that they both consider the reconstruction error in the loss function while preserving the manifold structure. (3) To test the robustness of the different algorithms, we randomly added square blocks to each image. The experimental results in Tables 2–5 show that our method is more robust than the other algorithms. Overall, the experimental results illustrate that the proposed method improves on the traditional ONPP by using the L2,1-norm as both metric and regularization term.

5.4. Parameter selection of RJSE

RJSE obtains different performance as the parameters β and k change. Fig. 10 presents the case on the ORL data set; similar observations can be made on the other data sets. Fig. 10(a) shows that when β was varied from 10^{−8} to 10^2, RJSE obtained its best performance; when β was varied from 10^4 to 10^8, the performance of RJSE decreased; and when β was set to 10^3, RJSE obtained its worst performance. Fig. 10(b) shows that RJSE performed best when k was set to 3 or 4, while its worst performance was obtained when k was set to 2 or ranged from 9 to 16.

6. Conclusions

In this paper, we propose a robust and sparse subspace learning method by imposing the L2,1-norm on both the loss function and the regularization term. The proposed model obtains jointly sparse projections thanks to the L2,1-norm regularization term. Besides, we prove that the optimal projections of the regression model degenerate to those of ONPP when certain conditions are satisfied. We present an iterative algorithm to solve the optimization problem and thus obtain a set of jointly sparse projections.
The experimental results on four standard data sets indicate that RJSE is superior to the compared dimensionality reduction algorithms.

Acknowledgments

This work was supported by the Natural Science Foundation of China under Grants 61573248, 61773328, 61703283, 61732011 and 61672358, the China Postdoctoral Science Foundation under Grants 2016M590812 and 2017T100645, the Natural Science Foundation of Guangdong Province under Grants 2017A030313367 and 2017A030310067, the Shenzhen Municipal Science and Technology Innovation Council under Grants JCYJ20160429182058044 and JCYJ20170302153434048, and the Shenzhen Nanshan District Science and Technology Project under Grant 2015009.

References

[1] I.T. Jolliffe, Principal Component Analysis, 87, Springer, Berlin, 2010, pp. 41–64.
[2] M. Li, B. Yuan, 2D-LDA: a statistical linear discriminant analysis for image matrix, Pattern Recognit. Lett. 26 (5) (2005) 527–532.
[3] Q. Liu, H. Lu, S. Ma, Improving kernel Fisher discriminant analysis for face recognition, IEEE Trans. Circuits Syst. Video Technol. 14 (1) (2004) 42–49.
[4] J. Yang, J.Y. Yang, Why can LDA be performed in PCA transformed space? Pattern Recognit. 36 (2) (2003) 563–566.
[5] E.I. Altman, G. Marco, F. Varetto, Corporate distress diagnosis: comparisons using linear discriminant analysis and neural networks (the Italian experience), J. Bank. Finance 18 (3) (1994) 505–529.

[6] L. Clemmensen, T. Hastie, D. Witten, B. Ersbøll, Sparse discriminant analysis, Technometrics 53 (4) (2011) 406–413.
[7] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[8] W. Chojnacki, M.J. Brooks, A note on the locally linear embedding algorithm, Int. J. Pattern Recognit. Artif. Intell. 23 (8) (2009) 1739–1752.
[9] E. Kokiopoulou, Y. Saad, Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique, IEEE Trans. Pattern Anal. Mach. Intell. 29 (12) (2008) 2143–2156.
[10] Z. Lai, W.K. Wong, Y. Xu, J. Yang, D. Zhang, Approximate orthogonal sparse embedding for dimensionality reduction, IEEE Trans. Neural Netw. Learn. Syst. 27 (4) (2016) 723–735.
[11] A.A.M. Al-Shiha, W.L. Woo, S.S. Dlay, Multi-linear neighborhood preserving projection for face recognition, Pattern Recognit. 47 (2) (2014) 544–555.
[12] S. Liu, Q. Ruan, Orthogonal tensor neighborhood preserving embedding for facial expression recognition, Pattern Recognit. 44 (7) (2011) 1497–1513.
[13] D. Bouzas, N. Arvanitopoulos, A. Tefas, Graph embedded nonparametric mutual information for supervised dimensionality reduction, IEEE Trans. Neural Netw. Learn. Syst. 26 (5) (2015) 951–963.
[14] F. Nie, S. Xiang, Y. Song, C. Zhang, Orthogonal locality minimizing globality maximizing projections for feature extraction, Opt. Eng. 48 (1) (2009) 17202.
[15] C. Hou, C. Zhang, Y. Wu, Y. Jiao, Stable local dimensionality reduction approaches, Pattern Recognit. 42 (9) (2009) 2054–2066.
[16] C. Hou, J. Wang, Y. Wu, D. Yi, Local linear transformation embedding, Neurocomputing 72 (2009) 2368–2378.
[17] X. Liu, D. Tosun, M.W. Weiner, N. Schuff, Locally linear embedding (LLE) for MRI based Alzheimer's disease classification, Neuroimage 83 (2013) 148–157.
[18] Z. Liu, Y. Yin, H. Wang, S. Song, Q. Li, Finger vein recognition with manifold learning, J. Netw. Comput. Appl. 33 (3) (2010) 275–282.
[19] Y. Huang, G. Kou, A kernel entropy manifold learning approach for financial data analysis, Decis. Support Syst. 64 (8) (2014) 31–42.
[20] X. Wang, Y. Liu, F. Nie, H. Huang, Discriminative unsupervised dimensionality reduction, in: Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, 2015, pp. 3925–3931.
[21] F. Nie, X. Wang, H. Huang, Clustering and projected clustering with adaptive neighbors, in: Proceedings of the Twentieth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 977–986.
[22] G. Guo, Y. Fu, C.R. Dyer, T.S. Huang, Image-based human age estimation by manifold learning and locally adjusted robust regression, IEEE Trans. Image Process. 17 (7) (2008) 1178–1188.
[23] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint L2,1-norms minimization, in: Proceedings of the Advances in Neural Information Processing Systems, 2010, pp. 1813–1821.
[24] Y. Yang, H.T. Shen, Z. Ma, Z. Huang, X. Zhou, L2,1-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.
[25] X. Zhu, X. Li, S. Zhang, C. Ju, X. Wu, Robust joint graph sparse coding for unsupervised spectral feature selection, IEEE Trans. Neural Netw. Learn. Syst. 28 (6) (2017) 1263–1275.
[26] Z. Lai, Y. Xu, J. Yang, L. Shen, D. Zhang, Rotational invariant dimensionality reduction algorithms, IEEE Trans. Cybern. 47 (11) (2017) 3733–3746.
[27] F. Nie, J. Yuan, H. Huang, Optimal mean robust principal component analysis, in: Proceedings of the Thirty-First International Conference on Machine Learning (ICML-14), 2014, pp. 1062–1070.
[28] C.-X. Ren, D.-Q. Dai, H. Yan, Robust classification using L2,1-norm based regression model, Pattern Recognit. 45 (7) (2012) 2708–2718.
[29] Z. Lai, M. Wan, Z. Jin, J. Yang, Sparse two-dimensional local discriminant projections for feature extraction, Neurocomputing 74 (4) (2011) 629–637.
[30] X. Shi, Y. Yang, Z. Guo, Z. Lai, Face recognition by sparse discriminant analysis via joint L2,1-norm minimization, Pattern Recognit. 47 (7) (2014) 2447–2453.
[31] J. Wen, Z. Lai, Y. Zhan, J. Cui, The L2,1-norm-based unsupervised optimal feature selection with applications to action recognition, Pattern Recognit. 60 (2016) 515–530.
[32] S. Yi, Z. Lai, Z. He, Y. Cheung, Y. Liu, Joint sparse principal component analysis, Pattern Recognit. 61 (2017) 524–536.
[33] Y. Feng, J. Xiao, K. Zhou, Y. Zhuang, A locally weighted sparse graph regularized non-negative matrix factorization method, Neurocomputing 169 (2015) 68–76.
[34] T. De Bie, N. Cristianini, R. Rosipal, Eigenproblems in pattern recognition, Handb. Geom. Comput. 10 (2005).
[35] H. Zou, T. Hastie, R. Tibshirani, Sparse principal component analysis, J. Comput. Gr. Stat. 15 (2) (2006) 265–286.
[36] X. He, D. Cai, S. Yan, H.J. Zhang, Neighborhood preserving embedding, in: Proceedings of the Tenth IEEE International Conference on Computer Vision, 2005, pp. 1208–1213.
[37] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[38] D. Cai, X. He, J. Han, A unified approach for sparse subspace learning, in: Proceedings of the International Conference on Data Mining, 2007.
[39] Z. Ma, Y. Yang, N. Sebe, K. Zheng, A.G. Hauptmann, Multimedia event detection using a classifier-specific intermediate representation, IEEE Trans. Multimed. 15 (15) (2013) 1628–1637.
[40] C. Hou, F. Nie, X. Li, D. Yi, Y. Wu, Joint embedding learning and sparse regression: a framework for unsupervised feature selection, IEEE Trans. Cybern. 44 (6) (2014) 793–804.

Please cite this article as: Z. Lai et al., Robust jointly sparse embedding for dimensionality reduction, Neurocomputing (2018), https://doi.org/10.1016/j.neucom.2018.06.051


[41] C. Hou, F. Nie, D. Yi, Y. Wu, Feature selection via joint embedding learning and sparse regression, in: Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, 2011, p. 1324.
[42] F. Nie, W. Zhu, X. Li, Unsupervised feature selection with structured graph optimization, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 1302–1308.
[43] A.M. Martinez, The AR face database, CVC Technical Report 24, 1998.
[44] P.J. Phillips, H. Moon, S.A. Rizvi, P.J. Rauss, The FERET evaluation methodology for face-recognition algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (10) (2000) 1090–1104.

Zhihui Lai received the B.S. degree in mathematics from South China Normal University, the M.S. degree from Jinan University, and the Ph.D. degree in pattern recognition and intelligence system from Nanjing University of Science and Technology (NUST), China, in 2002, 2007 and 2011, respectively. He has been a Research Associate, Postdoctoral Fellow and Research Fellow at The Hong Kong Polytechnic University. His research interests include face recognition, image processing and content-based image retrieval, pattern recognition, compressive sensing, human vision modeling and applications in the field of intelligent robot research. He has published over 60 scientific articles. He is currently an associate editor of the International Journal of Machine Learning and Cybernetics. For more information, including all papers and related codes, readers are referred to the website (http://www.scholat.com/laizhihui).

Yudong Chen is currently pursuing the M.S. degree at Shenzhen University. He is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

Dongmei Mo received the B.S. degree from Zhaoqing University and is currently pursuing the M.S. degree at Shenzhen University. She is with the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China.

Jiajun Wen received the Ph.D. degree in computer science and technology from Harbin Institute of Technology, China, in 2015. He has been a Research Associate with the Hong Kong Polytechnic University, Hong Kong, since 2013. He is currently a Postdoctoral Fellow with the College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, China. His research interests include pattern recognition and video analysis.

Heng Kong received the M.D. and B.S. degrees from Chongqing Medical University, the M.S. degree from Guangzhou Medical University, and the Ph.D. degree from Southern Medical University, China, in 2000, 2005 and 2008, respectively. She was a visiting scholar at the Cancer Center of Georgia Regents University, Augusta, USA, from 2014 to 2016. She is a professor in the Department of Thyroid and Breast, Affiliated Nanshan District People's Hospital, Shenzhen University. She also conducts basic and clinical research on breast cancer. Her research interests include gene therapy, immunotherapy, early diagnosis and prognosis analysis of breast cancer, and tumor image processing and recognition using machine learning and/or artificial intelligence methods.
