- Email: [email protected]

Contents lists available at SciVerse ScienceDirect

Neural Networks journal homepage: www.elsevier.com/locate/neunet

Structured sparse linear graph embedding Haixian Wang ∗ Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, Jiangsu 210096, PR China

article

info

Article history: Received 4 April 2011 Received in revised form 10 August 2011 Accepted 9 November 2011 Keywords: Structured sparsity Linear graph embedding Face recognition Overlapping grouped lasso penalty

abstract Subspace learning is a core issue in pattern recognition and machine learning. Linear graph embedding (LGE) is a general framework for subspace learning. In this paper, we propose a structured sparse extension to LGE (SSLGE) by introducing a structured sparsity-inducing norm into LGE. Specifically, SSLGE casts the projection bases learning into a regression-type optimization problem, and then the structured sparsity regularization is applied to the regression coefficients. The regularization selects a subset of features and meanwhile encodes high-order information reflecting a priori structure information of the data. The SSLGE technique provides a unified framework for discovering structured sparse subspace. Computationally, by using a variational equality and the Procrustes transformation, SSLGE is efficiently solved with closed-form updates. Experimental results on face image show the effectiveness of the proposed method. © 2011 Elsevier Ltd. All rights reserved.

1. Introduction Recognition of high-dimensional data is usually confronted in real applications such as face recognition and biomedical data classification. To address this challenging issue, many useful dimensionality reduction methods have been developed, which potentially circumvent the ‘‘curse of dimensionality’’ (Jain, Duin, & Mao, 2000). The well-known dimensionality reduction technique is the family of subspace learning approaches represented by principal component analysis (PCA) (Lu & Pandolfo, 2011) and linear discriminant analysis (LDA) (Duda, Hart, & Stork, 2001). It is commonly believed that the intrinsic dimensionality of the object space is much lower. The objects may lie on or close to a lowdimensional manifold hidden in the high-dimensional ambient space (Mhaskar, 2011). The popular manifold learning algorithms include Laplacian eigenmaps (Belkin & Niyogi, 2003), locally linear embedding (LLE) (Roweis & Saul, 2000), and isometric feature mapping (Isomap) (Tenenbaum, de Silva, & Langford, 2000). Their linearization versions were also developed and demonstrated encouraging performance for pattern recognition problem, i.e., locality preserving projection (LPP) (He, Yan, Hu, Niyogi, & Zhang, 2005b), neighborhood preserving embedding (NPE) (He, Cai, Yan, & Zhang, 2005a), and isometric projection (Isop) (Cai, He, & Han, 2007b). Recently, another subspace learning approach, namely sparsity preserving projection (SPP) (Qiao, Chen, & Tan, 2010), was proposed based on sparse representation of samples rather than locality. All the above mentioned algorithms, and other potential

∗

Tel.: +86 25 83795664; fax: +86 25 83795929. E-mail addresses: [email protected], [email protected]

0893-6080/$ – see front matter © 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.neunet.2011.11.003

algorithms, can be reformulated in a unified framework, called (linear) graph embedding (GE) (Yan et al., 2007). One critical drawback of the above algorithms is that the projection bases involve all the original features, which makes the results difficult to be interpreted. The number of original features is typically very large in practice, and irrelevant or redundant features often exist for recognition purpose. It is useful to learn sparse bases so as to find a small number of significant or important features. In the past few years, an increasing interest in learning a sparse subspace has been witnessed. Sparse PCA (SPCA) was proposed by penalizing the regression coefficients of principal components using the ℓ1 lasso regularization (Tibshirani, 1996; Zou, Hastie, & Tibshirani, 2006). Sparse LDA (SLDA) applied the ℓ1 regularization to LDA (Qiao, Zhou, & Huang, 2009). Manifold elastic net incorporated the sparse learning and the manifold learning based dimensionality reduction (Zhou, Tao, & Wu, 2011). Cai, He, and Han (2007a, 2008) proposed a unified sparse subspace learning (USSL) approach by imposing the ℓ1 constraint on the embedding obtained by the GE framework. The USSL approach works in two-stage exploratory procedure: first carry out GE to obtain responses, then use least-squares regression and the ℓ1 norm (or elastic net) to find sparse approximations. This twostage exploratory analysis was earlier used in Zou et al. (2006) for directly approximating principal components (Cai et al., 2007a). In fact, the structure of the final solution has been dominated by the first step, and the second step only approximates the previous result in which ℓ1 or ℓ2 norms are used to tackle the ill-posed problem (Cai et al., 2007a). The ℓ1 sparsity is used as approximation rather than learning from the original samples. Zou et al. (2006) claimed that the two-stage analysis was not a genuine sparse alternative. Although USSL is computationally convenient

H. Wang / Neural Networks 27 (2012) 38–44

and demonstrates effectiveness on some benchmark data sets, we would like to derive a ‘‘self-contained’’ model to discover discriminative subspace, incorporating structure information of the samples. The sparsity occurs as a natural result rather than as an intermediate approximation. In many applications, we know some certain structure of the data. It is useful to encode a priori structure information of the data when learning the projection bases. For example, in face recognition, the supports (nonzero elements) of discriminative features are expected to take parts of the face corresponding to eyes, mouth, or nose, or have some other regularity. The grouped lasso with ℓ1 /ℓ2 penalty is such an example (Yuan & Lin, 2006), in which the variables in the same group are selected to be zero or nonzero simultaneously. The groups, however, are assumed to be non-overlapping and static. The more sophisticated structured sparsity has recently been highlighted in the context of statistical regression and machine learning (Huang, Zhang, & Metaxas, 2009; Jacob, Obozinski, & Vert, 2009; Jenatton, Obozinski, & Bach, 2010b), as well as in compressed sensing (Baraniuk, 2007; Baraniuk, Cevher, Duarte, & Hegde, 2010). In this paper, we propose a structured sparse linear graph embedding (SSLGE) as a general framework for subspace leaning. By reformulating LGE as a regression-type optimization problem, we incorporate the structured sparsity-inducing norm to constrain the regression coefficients. Unlike USSL that considers only sparsity performed in two steps, SSLEG models structured sparsity in a single step. We use the structured sparsity-inducing norm of Jenatton et al. (2010b) in the SSLEG formulation. Nevertheless, they have different loss functions. SSLGE combines the advantages of the unified framework of LGE and the discriminative features of structured sparsity. SSLGE includes the structured sparse LDA, LPP, NPE, SPP, and other potential subspace methods. Computationally, by using a variational equality and the Procrustes transformation, SSLGE is efficiently solved with closed-form updates. Experiments on face image confirm the effectiveness of the proposed method. In short, the contribution of this paper is that we extend LGE to its structured sparsity version, giving a unified framework for linear subspace learning with structured sparsity constraint. The remainder of this paper is organized as follows. In Section 2, the LGE framework and some popular subspace learning methods under this framework are briefly reviewed. In Section 3, we propose the SSLGE framework. The experimental results are presented in Section 4. Finally, Section 5 concludes the paper. 2. Brief review of graph embedding The graph embedding seeks a meaningful low-dimensional representation of a set of typically high-dimensional sample points. The obtained low-dimensional space captures discriminative information or preserves some geometric similarity of the data. Many dimensionality reduction algorithms can be unified under the graph embedding umbrella. Suppose that x1 , . . . , xn are a set of p-dimensional samples, where n is number of samples. The samples are assumed to be divided into cc classes π1 , . . . , πc . The lth class contains nl samples, i.e., = n. From the perspective of graph, l = 1 nl the n samples x1 , . . . , xn can be represented by a weighted undirected graph (V , E ), where V denotes a set of nodes that correspond to the n data points, and E denotes the edges that connect pairwise nodes i and j with weight Wij . Note that W = (Wij )i,j=1,...,n is not necessarily symmetric. The low-dimensional embedding y1 , . . . , yn , which are assumed to be one-dimensional for simplicity, are solved as

y∗ = arg min (yi − yj )2 Wij = arg min yT Ly, yT By=1

i̸=j

yT By=1

(1)

39

where y = [y1 , . . . , yn ]T , the matrix B is a constraint matrix, the Laplacian matrix L is defined as L = D − W ′ , the matrix W ′ is W ′ = (W + W T )/2, and D is a diagonal matrix whose entries n ′ are row sums of W ′ , i.e., Dii = j=1 Wij . Many manifold learning algorithms can be derived from the graph embedding framework by using different W and B (Yan et al., 2007), such as Laplacian eigenmap (Belkin & Niyogi, 2003), LLE (Roweis & Saul, 2000), and Isomap (Tenenbaum et al., 2000). For classification purpose, the linearization of graph embedding (LGE) is preferred, since the out-of-sample issue can be effectively handled. A transformation vector v is chosen such that yi = vT xi , i.e., y = X v, where X is the data matrix X = [x1 , . . . , xn ]T . Then, the projection v is solved as v∗ = arg

= arg

min

vT XBX T v=1

min

vT X T BX v=1

(vT xi − vT xj )2 Wij i̸=j

vT X T LX v.

(2)

Likewise, the LGE framework yields many popular linear dimensionality reduction algorithms (Cai et al., 2007a, 2008; Yan et al., 2007). Specifically, the LDA arises when taking Wij as 1/nl if both xi and xj belong to class πl and zero otherwise, and B as B = I − 11T /n, where I denotes the n-dimensional identity matrix and 1 denotes the n-dimensional vector with all the entries being ones. The LPP is to define W as the heat kernel with k-nearest neighbors, and B as D. The NPE is recovered when adopting W as W = M + M T − M T M, and B as I, where the local reconstruction coefficient ma trix M is obtained by minimizing ∥xi − M ij xj ∥ subject to j∈Nk (i) M = 1. Here N ( i ) denotes the k-nearest neighbors of xi . ij k j∈Nk (i) For j ̸∈ Nk (i), Mij = 0. We point out that the recently developed SPP approach (Qiao et al., 2010) can also be expressed under the LGE framework by setting W as W = S + S T − S T S, and B as I, where the sparse reconstruction coefficient matrix S is obtained by minimizing | S | subject to x = S x and S = 1. Note that ij i ij j ij j̸=i j̸=i j̸=i the diagonal elements of S are zeros. 3. Structured sparse linear graph embedding Our goal is to introduce structured sparsity in the projection vector v. In other words, we need to make structured variable selection and feature extraction by the structured sparsity of v. We first relates GE with regression coefficients by formulating the generalized eigenvalue problem as a regression-type optimization problem. Then, we impose a structured sparsity-inducing norm during linearizing the embedding. 3.1. Reformulating graph embedding as regression The problem of solving the embedding y in (1) can be rewritten as y∗ = arg min y

yT Ly yT By

= arg max y

yT By yT Ly

,

(3)

which is solved as a generalized eigenvalue equation By = γ Ly

(4)

with the leading eigenvectors. Note that the Laplacian matrix L is positive semi-definite. We add a small positive value on its δ tr(L) diagonal such that L˜ = L + n I is positive definite, where δ is a small value. The term tr(L) I n

tr(L) n

is to scale the identity matrix I such that

has the same trace with L. Let L˜ = L˜ 1/2 L˜ 1/2 , and B = B1/2 B1/2 .

For example, we can let L˜ 1/2 = Γ Λ1/2 Γ T , where the eigenvalue decomposition of L˜ is L˜ = Γ ΛΓ T , and the power 1/2 in Λ1/2

40

H. Wang / Neural Networks 27 (2012) 38–44

works on the diagonal elements of the diagonal eigenvalue matrix Λ. Likewise, let L˜ −1/2 = Γ Λ−1/2 Γ T . Like in Qiao et al. (2009) and Zou et al. (2006), we have the following theorem. Theorem 1. Suppose we are considering the q leading eigenvectors of (4). Let µ = [µ1 , . . . , µq ] and η = [η1 , . . . , ηq ] be n × q matrices. For any positive λ, let 1

1

1

(µ, ˆ η) ˆ = arg min ∥B 2 L˜ − 2 − B 2 ηµT ∥2F + λ µ,η

q

T ˜ ηm Lηm

(5)

m=1

subject to µT µ = I, where ∥·∥F denotes the Frobenius norm. Then, the columns of ηˆ span the same linear space with the q leading generalized eigenvectors ym (m = 1, . . . , q) of (4). The proof is presented in Appendix. As shown in the proof, λ is essentially to recover the generalized eigenvectors rather than to penalize the regression coefficients. This can be seen from (A.2), in which the role of λ is to remedy the singularity of B, which is like in ridge regression. In case that B is positive definite, by (A.2) and (A.6), we see that the problem of solving η can be carried out with any non-negative value of λ. And from (A.6), we see that the obtained ηˆ m (m = 1, . . . , q) are scaled versions of the q leading generalized eigenvectors of (4), in which the role of λ is only to scale the eigenvectors. As a result, the recovered ηˆ spans the same subspace with the q leading generalized eigenvectors, no matter the value of λ. If normalizing ηˆ m having unit length, then the resulting vectors are independent of λ. Therefore, λ is not to penalize the regression coefficients. Rather, it is used to recover the generalized eigenvectors. LGE uses y = X v to obtain the linearization of GE. It is beneficial to constrain v by encoding a priori structured information of the data. In this paper, we penalize the projection vectors v by a (quasi-)norm. The norm we considered adjusts not only the sparsity but also the supports of the projection vector.

min ∥B L˜ µ,V

− B XV µT ∥2F

+λ

+ 2λ1

q

1

vTm X T BX vm − 2µTm L˜ − 2 BX vm .

(9)

m=1

We proceed to the third term of (6). The norm ΩGα is non-convex and non-differentiable with respect to vm . The following lemma in (Jenatton et al., 2010b) yields the variational equality. α Lemma 2 (Jenatton et al., 2010b). Let α ∈ (0, 2) and β = 2−α . For G |G | vector (∥d ◦ v∥2 )G∈G ∈ R , one has the following equality

∥(∥dG ◦ v∥2 )G∈G ∥α = min

z G >0

∥dG ◦ v∥2 2

G∈G

2z G

+

∥(z G )G∈G ∥β 2

α

ΩG (vm )

(10)

Therefore, for m = 1, . . . , q, one has (11)

where Θm = diag

G 2 G G 2 G (( d ) / z , . . . , ( d ) / z ) . m p m 1 G∈G

Substituting (9) and (11) into (6), and ignoring the constant term tr(L˜ −1 B), we arrive at min

q

˜ + λ1 Θm )vm [vTm (X T BX + λX T LX

1

.

The minimum is uniquely reached at z G = (∥dG ◦ v∥2 )2−α ∥(∥dG ◦ v∥2 )G∈G ∥αα−1 for G ∈ G.

− 2µTm L˜ − 2 BX vm + λ1 ∥(zmG )G∈G ∥β ].

q

vTm X T LX vm

tr(L˜ −1 B) +

G µ,V ,zm >0 m=1

1 2

q

We consider the optimization of (6). The first term of (6) is expanded as

G zm >0

The formulation of embedding as a regression-type problem allows us to produce structured sparsity. Specifically, we plug ηm = X vm into (5), and meanwhile constrain the (quasi-)norm of vm , i.e., add a penalty term to the regression function. Let V = [v1 , . . . , vq ] be the p × q projection matrix. Based on Theorem 1, we present the optimization problem − 21

3.3. Optimization procedure

ΩGα (vm ) = min (vTm Θm vm + ∥(zmG )G∈G ∥β )/2,

3.2. Structured sparsity regularized projection

1 2

The (quasi-)norm ΩGα combines the ℓ2 norm of variables and ℓα norm of groups. Particularly, if α = 1, ΩGα becomes the standard ℓ1 /ℓ2 norm as in variable selection (Jenatton, Audibert, & Bach, 2010a). As a result, at the group level, some vectors dG ◦ v will be shrunk to exact zero if λ1 is large enough due to the ℓ1 penalty (Tibshirani, 1996). By contrast, within the group G, the ℓ2 norm does not produce sparsity. For α ∈ (0, 1), the ℓα quasinorm is closer to the ℓ0 quasi-norm (number of nonzero elements of a vector), and penalizes regression coefficients more heavily (Jenatton et al., 2010b). It has shown advantage in sparsity problem (Zou & Li, 2008). So, ℓα /ℓ2 leads to sparsity at the level of group more greedily.

(12)

(6)

The optimization involves iterative computation of three variables, i.e., z , µ, and V , which are updated as follows.

subject to µT µ = I, where λ and λ1 are two parameters, ΩGα is an ℓα /ℓ2 (quasi-)norm defined as follows. Let class G be a subset of the power set of {1, . . . , p} such that ∪G∈G G = {1, . . . , p}, i.e., G is a coverage of {1, . . . , p}. Note that different G may overlap. The ΩGα is defined as (Jenatton et al., 2010b)

3.3.1. Update of z The update of z is directly obtained by Lemma 2. That is, given V , for G ∈ G and m = 1, . . . , q,

˜

m=1

m=1

α

ΩG (v) =

p (dGr vr )2 G∈G

α α1 2

,

(7)

r =1

where vr is the rth entry of v, and dGr > 0 if r ∈ G and zero p otherwise for r = 1, . . . , p. For any vector p v ∈ R , denote the ℓθ (quasi-)norm of v by ∥v∥θ = ( r =1 |vr |θ )1/θ . Let the p-dimensional vector dG = [dG1 , . . . , dGp ]T , and let dG ◦ v =

G 1 zm ← (∥dG ◦ vm ∥2 )2−α ∥(∥dG ◦ vm ∥2 )G∈G ∥α− α .

(13)

G For numerical stability, when zm is near zero, a small positive value is added.

3.3.2. Update of µ 1

The update of matrix µ is to maximize tr(µT L˜ − 2 BXV ) subject to the orthonormal columns of µ. This is the Procrustes problem with inner version (Gower & Dijksterhuis, 2004; Zou et al., 2006). Given 1

v ] be the element-wise product of d and v. Then (7) can be rewritten as

V , perform singular value decomposition (SVD): L˜ − 2 BXV = PEQ T , where n × q matrix P has orthonormal columns and q × q matrix Q is orthonormal. Then, the update of µ is given by

ΩGα (v) = ∥(∥dG ◦ v∥2 )G∈G ∥α .

µ ← PQ T .

[ v ,..., dG1 1

dGp p T

G

(8)

(14)

H. Wang / Neural Networks 27 (2012) 38–44

41

3.3.3. Update of V The update of V is separable in its columns, resulting in q independent ridge-regression problem. Given z and µ, for m = 1, . . . , q, the vector vm is updated as 1

˜ + λ1 Θm )−1 X T BL˜ − 2 µm . vm ← (X T BX + λX T LX

(15)

The above three steps are repeated until convergence. Theoretical proof of the convergence of the iterative procedure is not easy. The convergence, however, is observed in practice, as stated in Section 4. We point out that the alternate optimization technique (i.e., optimize one variable while keeping the other ones fixed) is usually adopted in the literature (Shi, Ren, Dai, Wang, & Zhang, 2011). Note that the whole approach is not convex, and thus careful initialization is needed in implementation. 3.4. Computational complexity The computation involves two parts: the generation of matrices W and L˜ −1/2 , and the regularized regression solved by the iterative updates of z , µ, and V . Note that we do not need to compute B1/2 . The computation of L˜ −1/2 needs O(n3 ) operations. The formulation of W in LDA is straightforward, and in LPP consumes O(n2 p + n2 log n). The formulation of M in NPE consumes O(npk3 ). The update of z requires O(qp|G|) operations where |G| denotes 1

the cardinality of G. In the update of µ, the product of µT L˜ − 2 BXV costs O(n2 + n2 p + npq), and its SVD followed by the formulation of µ is of order O(nq2 ). The update of V is of order O(q(p|G|+ pn2 + p2 n + p3 )). 4. Experiments Like LGE, if taking different W and B, SSLGE becomes structured sparse extensions to various subspace learning methods. In this section, we investigate the problem of face recognition on two benchmark facial image databases using structured sparse LDA (SSLDA), structured sparse LPP (SSLPP), structured sparse NPE (SSNPE) and structured sparse SPP (SSSPP). 4.1. Data sets The two databases used in this experiment are the ORL (Olivetti Research Laboratory) database and the FERET database. The ORL database consists of 400 face images, corresponding to 40 individuals. For each person, there are 10 different images varying in facial expressions and facial details. The tolerance for some side movement is of about 20 degrees. For computational convenience, the original images are resized into 32 × 32 pixels. The FERET database (Phillips, Moon, Rizvi, & Rauss, 2000) includes a total of 11 338 facial images grouped into 994 distinct subjects. We select 246 subjects who have ‘‘fa’’ images in both the gallery ‘‘fa’’ and prob ‘‘dup1’’ sets and ‘‘fb’’ images in both the ‘‘fb’’ and ‘‘dup1’’ sets. Each individual has four (two ‘‘fa’’ images plus two ‘‘fb’’ images) frontal images showing variations of facial expression, aging, with/without glasses. The original images are cropped and then resized into 32 × 32 pixels. Some sample images of the two databases are shown in Fig. 1. In our experiment, the gray scale is linearly rescaled into [0, 1] for the whole data set. To perform face recognition, we first obtain the face subspaces by learning projection vectors. Then facial images are projected onto the face subspaces. Finally, the nearest-neighbor classifier is used to predict the class labels of new facial images. 4.2. Parameter selection and experimental setting The settings of G, dGr , and α are same with Jenatton et al. (2010b). The class G is designed for two-dimensional face images. We include in G all horizontal and vertical half-planes, as well as

Fig. 1. Face examples from two face databases: (a) ORL and (b) FERET.

planes with angles that are multiples of π /4. Suppose the size of an image I is a × a. Then, the horizontal planes contain the first ς rows and the remainder (a − ς ) rows for ς = 1, . . . , (a − 1). The vertical planes contain the first ς columns and the remainder (a − ς ) columns for ς = 1, . . . , (a − 1). Define the first π /4 line as {I(1, 1)}, the second line as {I(1, 2), I(2, 1)}, the third line as {I(1, 3), I(2, 2), I(3, 1)}, and so on. The last π /4 line is {I(a, a)}. Here, I(1, 1) denotes the (1, 1)-pixel of I. Then, the π /4 planes contain the first τ lines and remainder (2a − τ − 1) lines for τ = 1, . . . , (2a − 2). The 3π /4 planes are likely defined as the π /4 planes but in perpendicular orientation. The 2π /4 and 4π /4 planes are in fact the vertical and horizontal planes, respectively. The weights dGr are chosen to take overlapping groups into account, i.e., dGr = 0.5|H ∈G;r ∈H ,H ⊂G and H ̸=G| . We set α = 0.5 in the norm Ω for sparsity generation and numerical stability. The parameter λ is set as a small value 1e−3. The sparsity regularization parameter log10 (λ1 ) is searched in {−5, −4.5, −4, . . . , 0} for producing the best results. The small value δ is set as 1e−5. The recognition accuracy is evaluated using a cross-validation strategy. The databases are randomly divided into two folds. One fold is used as training data (five images per subject for ORL and three images for FERET) and the remaining fold is used as testing data. The training samples are used to learn the projection matrix V , which is to transform the samples. The testing data are used to estimate the generalizability with the nearest-neighbor classifier. This procedure is repeated four times with different partitions of the databases. The average recognition rate of the four-round experiments is recorded as final recognition accuracy. In LPP and NPE, the neighborhood size k is set as k = t −1, where t is number of training samples per person. For LPP, the parameter of the heat kernel is specified as the variance of the squared norms of the training samples. 4.3. Experimental results For visual perception, Fig. 2 shows examples of learned projection vectors vm , m = 1, . . . , 20, by SSLDA, SSLPP, SSNPE, and SSSPP on the ORL face database. It is observed that the structured sparse approaches reveal groups of features localized on the face. The face-like segments seem intuitively meaningful in capturing some facial features or parts of the face. To verify the effectiveness of the structured sparse approaches, we compare the performances of LDA, LPP, NPE, and SPP (four popular subspace learning approaches in the LGE framework) and their structured sparse counterparts (in the SSLGE framework) for face recognition. For varying q and λ1 , we compute the recognition accuracies by the cross-validation. The recognition accuracy versus the reduced dimension q is shown in Fig. 3, where λ1 is optimized in the structured sparse approaches. For comparison, we also evaluate the performance of the baseline method that the raw data without any dimensionality reduction are used to perform recognition. The best results together with the optimal combinations of q and λ1 are summarized in Table 1, where the degree of sparsity of the projection vectors is computed. The degree of sparsity is defined as the ratio of the number of zeros and the total number of elements. The experiments are implemented using personal computer in Matlab platform. The proposed three-step alternate algorithm for

42

H. Wang / Neural Networks 27 (2012) 38–44

Fig. 2. Learned projections of faces with q = 20 on the ORL face database: (a) SSLDA, (b) SSLPP, (c) SSNPE, and (d) SSSPP. The projection vectors are sorted in decreasing order of generalized eigenvalues. Table 1

˜ 1 = log10 (λ1 ), and degree of Comparison of the best recognition rates (%) associated with the corresponding reduced-dimensions q, sparsity regularization parameters λ sparsity on the ORL and FERET face databases. Method

Baseline LDA LPP NPE SPP SSLDA SSLPP SSNPE SSSPP

ORL

FERET

Recog.

q

λ˜ 1

Sparsity

Recog.

q

λ˜ 1

Sparsity

79.0 93.0 93.6 93.9 94.7 95.0 95.5 97.7 98.3

1024 39 35 35 40 65 55 65 45

N.A. N.A. N.A. N.A. N.A. −0.5 −3.0 −2.5 −2.0

0 0 0 0 0 79.2 86.9 81.5 82.2

59.0 65.5 67.0 70.9 73.0 86.0 85.5 85.5 92.2

1024 50 45 50 40 40 45 70 30

N.A. N.A. N.A. N.A. N.A. −1.0 −2.5 −3.0 −2.5

0 0 0 0 0 77.5 85.3 83.7 79.1

SSLGE converges in less than 100 iterations. We have estimated the computational complexity of the iterative algorithm, which depends on many factors such as n, p, q and |G|. The computational time for the two data sets is several minutes per run. 4.4. Discussions All the subspace learning methods outperform the baseline method when subspace dimensions arrive at some values, which implies that dimensionality reduction is a necessary step for the face recognition problem. The structured sparse approaches have much better performance than their linear counterparts. This shows that by encoding structured sparsity of the images, the recognition performances can be improved. The structured sparsity effectively combines the advantages of feature extraction and feature selection, and meanwhile takes the image structure into account. The performances of LDA, LPP, NPE, and SPP are competitive. LDA is a supervised learning while the other three methods are unsupervised. The locality of manifold seems to contain much discriminative information. LPP and NPE have local parameters that need to be tuned, which may depend on the training data

and are somewhat intractable. The appropriate setting of the parameters results in good performance. SPP is parameter free. The solving of sparse reconstruction coefficients, however, is computationally intensive especially for large scale data set. The goal of this paper is to demonstrate the effectiveness of the structured sparse extension to LGE, and provides a unified framework of generating structured sparse subspace for many existing and other potential subspace methods. The recognition performance may be improved if using other sophisticated classifiers. 5. Conclusion In this paper, we propose a general framework, called SSLGE, for subspace learning that incorporates structured sparsity in the projection vectors. The SSLGE technique provides a unified model for various structured sparse subspace learning algorithms, for example, SSLDA, SSLPP, SSNPE, and SSSPP. The advantage of the unified formulation is, clearly, inherited from LGE. Moreover, SSLGE can accommodate other potential structured sparse linear methods, provided the methods can be interpreted under the LGE umbrella. The structured sparsity not only considers sparsity but

H. Wang / Neural Networks 27 (2012) 38–44

= tr(L˜ −1 B) +

43 q

1 T ηm Bηm − 2µTm L˜ − 2 Bηm .

(A.1)

m=1

If µ is fixed, then η that optimizes (5) is solved as 1 ηm = (B + λL˜ )−1 BL˜ − 2 µm ,

(m = 1, . . . , q).

(A.2)

Substituting (A.2) into (5) reads

1

1

tr L˜ −1 B − µT L˜ − 2 B(B + λL˜ )−1 BL˜ − 2 µ .

(A.3)

Therefore,

1 1 µ ˆ = arg max tr µT L˜ − 2 B(B + λL˜ )−1 BL˜ − 2 µ µ

(A.4)

subject to µT µ = I. Suppose the eigenvalue decomposition of L˜ −1/2 BL˜ −1/2 is L˜ −1/2 BL˜ −1/2 = Υ 1Υ T , where the eigenvalues are sorted in descending order in the diagonal matrix ∆ with the corresponding eigenvectors in the columns of Υ . It follows that 1

1

L˜ − 2 B(B + λL˜ )−1 BL˜ − 2

= L˜ −1/2 BL˜ −1/2 (L˜ −1/2 BL˜ −1/2 + λI )−1 L˜ −1/2 BL˜ −1/2 = Υ ∆(∆ + λI )−1 1Υ T ,

(A.5)

which implies that µ ˆ of (A.4) is the q leading eigenvectors of L˜ −1/2 BL˜ −1/2 , i.e., µ ˆ m = sΥm with s = 1 or −1, where Υm is the mth column of Υ . By (A.2), we have that

ηˆ m = L˜ −1/2 (L˜ −1/2 BL˜ −1/2 + λI )−1 L˜ −1/2 BL˜ −1/2 µ ˆm −1/2 −1 T ˜ =L Υ (∆ + λI ) 1Υ µ ˆm s∆mm = L˜ −1/2 Υm , (A.6) ∆mm + λ where ∆mm is the mth diagonal element of ∆. Note that the quantities L˜ −1/2 Υm , m = 1, . . . , q, are the q leading generalized eigenvectors of (4). The proof thus is completed. Fig. 3. Recognition rate versus reduced dimension on: (a) ORL and (b) FERET face databases, using various linear and structured sparse methods.

also encodes a priori structure of the data, which is meaningful for high-dimensional feature extraction. By using the variational equality and the Procrustes transformation, SSLGE is efficiently solved with closed-form updates. The effectiveness of SSLGE is shown on face image recognition. In future work, we would like to investigate shared structures across projection vectors. The parameter λ1 , as well as α , may depend on the data set at hand, and need to be tuned further. We also intend to apply the SSLGE technique for multi-task learning and to high-dimensional biomedical data, say fMRI data. Acknowledgments The author would like to thank the anonymous referees for constructive recommendations, which greatly improve the paper. This work was supported in part by the National Natural Science Foundation of China under grants 61075009 and 60803059, in part by the Natural Science Foundation of Jiangsu Province under grant BK2011595, and in part by the Qing Lan Project of Jiangsu Province. Appendix. Proof of Theorem 1 The proof is similarly established with Qiao et al. (2009) and Zou et al. (2006). We give the proof sketch as follows. We have that 1

1

1

∥B 2 L˜ − 2 − B 2 ηµT ∥2F 1 1 1 = tr(L˜ − 2 BL˜ − 2 + ηT Bη − 2µT L˜ − 2 Bη)

References Baraniuk, R. G. (2007). Compressive sensing. IEEE Signal Processing Magazine, 24(4), 118–120. Baraniuk, R. G., Cevher, V., Duarte, M. F., & Hegde, C. (2010). Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4), 1982–2001. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396. Cai, D., He, X., & Han, J. (2007a). Spectral regression: a unified approach for sparse subspace learning. In Proceedings of the seventh IEEE international conference on data mining (pp. 73–82). Cai, D., He, X., & Han, J. (2007b). Isometric projection. In Proceedings of the artificial intelligence. Cai, D., He, X., & Han, J. (2008). Sparse projections over graph. In Proceedings of the artificial intelligence. Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley. Gower, J. C., & Dijksterhuis, G. B. (2004). Procrustes problems. New York: Oxford University Press. He, X., Cai, D., Yan, S., & Zhang, H.-J. (2005a). Neighborhood preserving embedding. In Proceedings of the tenth IEEE international conference on computer vision. Vol. 2 (pp. 1208–1213). He, X., Yan, S., Hu, Y., Niyogi, P., & Zhang, H.-J. (2005b). Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3), 328–340. Huang, J., Zhang, T., & Metaxas, D. (2009). Learning with structured sparsity. In Proceedings of the 26th international conference on machine learning. Jacob, L., Obozinski, G., & Vert, J.P. (2009). Group lasso with overlap and graph lasso. In Proceedings of the 26th international conference on machine learning. Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1), 4–37. Jenatton, R., Audibert, J.-Y., & Bach, F. (2010a). Structured variable selection with sparsity-inducing norms. Technical report. arXiv: 0904.3523v3. Jenatton, R., Obozinski, G., & Bach, F. (2010b). Structured sparse principal component analysis. In International conference on artificial intelligence and statistics. Lu, B.-W., & Pandolfo, L. (2011). Quasi-objective nonlinear principal component analysis. Neural network, 24(2), 159–170. Mhaskar, H. N. (2011). A generalized diffusion frame for parsimonious representation of functions on data defined manifolds. Neural network, 24(4), 345–359.

44

H. Wang / Neural Networks 27 (2012) 38–44

Phillips, P. J., Moon, H., Rizvi, S. A., & Rauss, P. J. (2000). The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1090–1104. Qiao, L., Chen, S., & Tan, X. (2010). Sparsity preserving projections with applications to face recognition. Pattern Recognition, 43, 331–341. Qiao, Z., Zhou, L., & Huang, J. Z. (2009). Sparse linear discriminant analysis with applications to high dimensional low sample size data. IAENG International Journal of Applied Mathematics, 39(1), 48–60. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326. Shi, J., Ren, X., Dai, G., Wang, J., & Zhang, Z. (2011). A non-convex relaxation approach to sparse dictionary learning. In Proceedings of the IEEE international conference on computer vision (pp. 1809–1816). Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2323.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B, 58, 267–288. Yan, S., Xu, D., Zhang, B., Zhang, H.-J., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 40–51. Yuan, M., & Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1), 49–67. Zhou, T., Tao, D., & Wu, X. (2011). Manifold elastic net: a unified framework for sparse dimension reduction. Data Mining and Knowledge Discovery, 22, 340–371. Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286. Zou, H., & Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4), 1509–1533.