- Email: [email protected]

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Discriminant similarity and variance preserving projection for feature extraction Pu Huang a,b,n, Caikou Chen a, Zhenmin Tang b, Zhangjing Yang b a b

College of Information Engineering, Yangzhou University, Yangzhou 225009, China School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

art ic l e i nf o

a b s t r a c t

Article history: Received 1 September 2013 Received in revised form 23 February 2014 Accepted 26 February 2014 Communicated by Haowei Liu Available online 8 April 2014

In this paper, a novel supervised dimensionality reduction algorithm called discriminant similarity and variance preserving projection (DSVPP) is presented for feature extraction and recognition. More speciﬁcally, we redeﬁne the intrinsic graph and penalty graph to model the intra-class compactness and inter-class separability of data points, where the intrinsic graph characterizes the similarity information of the same-class points and the penalty graph characterizes the variance information of the not-same-class points. Using the two graphs, the within-class scatter and the between-class scatter are computed, and then a concise feature extraction criterion is raised via minimizing the difference between them. Experimental results on the Wine data set, ORL, FERET and AR face databases show the effectiveness of the proposed method. & 2014 Elsevier B.V. All rights reserved.

Keywords: Dimensionality reduction Manifold-based learning Intrinsic graph Penalty graph Variance

1. Introduction Techniques for dimensionality reduction in supervised or unsupervised learning tasks have attracted much attention in computer vision and pattern recognition. Dimensionality reduction is to construct a meaningful low dimensional representation of high-dimensional data. With respect to pattern recognition, dimensionality reduction is an effective approach to reveal the distinctive features from the original data and improve the computational efﬁciency of pattern matching [1]. Over the past few years, a variety of dimensionality reduction methods have been developed. Among them, principal component analysis (PCA) [2] and linear discriminant analysis (LDA) [3] are two most well-known linear subspace learning methods. PCA is an unsupervised learning algorithm, which performs dimensionality reduction by projecting the original N-dimensional data onto the d (d⪡N)-dimensional linear subspace spanned by the leading eigenvectors of the data's covariance matrix. LDA is a completely supervised method which takes full consideration of the class information which is different from PCA which has nothing to do with the class information. The purpose of LDA is to seek a set of projection vectors such that the ﬁsher criterion (i.e. the ratio of the between-class scatter to the within-class scatter) is maximized

n Corresponding author at: College of Information Engineering, Yangzhou University, Yangzhou 225009, China. E-mail address: [email protected] (P. Huang).

http://dx.doi.org/10.1016/j.neucom.2014.02.047 0925-2312/& 2014 Elsevier B.V. All rights reserved.

after projection of samples. It is generally believed that the class information can enhance the discriminative ability. Thus LDA often achieves signiﬁcantly better performance than PCA in many realworld applications such as face recognition. Despite the success of LDA, its effectiveness is still limited because the number of the available projection directions is lower than the class number. Furthermore, LDA cannot be applied to small sample size (SSS) problems [4] because the within-class scatter matrix is singular. In the past, many extensions [5–15] of LDA have been developed to deal with this problem. Li et al. [15] used the difference of the between-class scatter and the within-class scatter as the discriminant criterion, called maximum margin criterion (MMC), in which the SSS problem in traditional LDA is naturally avoided since the inverse matrix does not need to be computed. Recent studies [16–18] have shown that large volumes of high dimensional data possibly reside on a nonlinear sub-manifold. The two classical methods PCA and LDA may fail to discover the intrinsic manifold structure of the data since they can see only the global linear Euclidean structure of the data. To discover the intrinsic structure of the manifold, many manifold-based learning algorithms have been developed. Among them, the most representative ones are locally linear embedding (LLE) [16], isometric feature mapping (Isomap) [17] and Laplacian Eigenmap [18]. In [19], Yan et al. proposed a general framework called graph embedding for dimensionality reduction. It has been shown that all the aforementioned methods can be integrated in such a framework and their differences only lie in the strategy used to construct graphs and embedding types. Manifold learning

P. Huang et al. / Neurocomputing 139 (2014) 180–188

algorithms do produce impressive visualization results on some benchmark data (e.g., facial images and handwritten digits). However, they are deﬁned only on the training data points; how to evaluate the map for novel testing points remains unclear. He et al. [20] proposed locality preserving projection (LPP), which builds a graph incorporating neighborhood information of the data set and aims to optimally preserve the local structure of the data in a certain sense. In contrast to most manifold-based learning algorithms such as LLE, Isomap and Laplacian Eigenmap, LPP can yield explicit maps both effective for training and testing data points, and the objective is linear and can be easily computed like PCA and LDA. Xu et al. [21] proposed some solutions to overcome the SSS problem when LPP is applied in face recognition. Cai et al. [22] proposed orthogonal LPP to boost the locality preserving ability of LPP. Hu et al. [23] proposed the 2DLPP method for palmprint recognition and Feng et al. [24] proposed a framework for matrix-based feature extraction, based on which a method called 2-dimensional discriminant embedding analysis (2DDEA) was developed. In [25], Yang et al. proposed unsupervised discriminant projection (UDP), which can be viewed as simpliﬁed LPP on the assumption that the local density is uniform [26]. UDP characterizes the nonlocal scatter as well as the local scatter, seeking for a projection that simultaneously maximizes the nonlocal scatter and minimizes the local scatter. This property makes UDP more intuitive and more powerful than LPP for classiﬁcations tasks. Neither LPP nor UDP uses the label information and they are unsupervised methods in nature. Many previous studies [19,27–32] have demonstrated that the recognition performance can be improved signiﬁcantly via manifold-based learning discriminant approaches, which are straightforward in preserving the intrinsic structure and the discriminant structure of the data. One prevalent approach is marginal ﬁsher analysis (MFA) [19]. In [27], Chen et al. proposed the local discriminant embedding (LDE) algorithm. The two methods are very similar in formulation and both of them combine the locality and label information to characterize the intra-class compactness and inter-class separability. MFA uses one adjacency graph to model the intrinsic geometrical structure of the manifold and another adjacency graph for the discriminant structure of data points. Then the discriminant structure and geometrical structure are incorporated into the dimensionality reduction objective function for data classiﬁcation. In this way, MFA can discover the discriminant structure and simultaneously preserve the intrinsic geometrical structure of the data. Unfortunately, there are still some limitations that exist in MFA. First, the performance of MFA depends on two parameters k1 and k2 used to construct the adjacency graphs. However, they are difﬁcult to determine in real-world applications. Second, the intrinsic geometrical structure characterizes only the similarity among nearby points and ignores the variance among points from different classes, which characterizes the diversity of pattern and is important for image classiﬁcation [33]. It indicates that MFA cannot best preserve the information of images, and the generalization capability and robustness are not good enough. Finally, in many practical applications, especially in the face recognition, MFA often encounters the SSS problem. To overcome the limitations of MFA, we propose a novel efﬁcient approach, called discriminant similarity and variance preserving projection (DSVPP), for feature extraction and recognition which explicitly considers the information including similarity and variance between data points. To be speciﬁc, the intrinsic graph and the penalty graph are constructed to model the intraclass compactness and the inter-class separablity of data points, where the intrinsic graph characterizes the similarity information of the same-class points and the penalty graph characterizes the variance information of the not-same-class points. Based on the two graphs, the within-class scatter and the between-class scatter

181

are computed and then the feature criterion via minimizing the difference between them is raised to overcome the SSS problem. The reminder of this paper is organized as follows. In Section 2, we brieﬂy review LPP and MFA. In Section 3, we introduce the new method in detail. In Section 4, experiments on Wine data set, ORL, FERET and AR face databases are presented to demonstrate the effectiveness of the new method. Conclusions are made in Section 5.

2. Outline of LPP and MFA Let X¼[x1, x2, …, xn] denote the sample set, xi A RN (i¼1,2, …, n), and Let A A RNd(d⪡N) denote the transformation matrix. In this section, we brieﬂy review LPP and MFA. 2.1. LPP Let G ¼{V, E, W} denote the adjacency graph, where V is the set of all data points in the data set X, E is the set of edges connecting data points, and W is a weighted matrix with elements characterizing the likelihood of two points. The elements of weighted matrix W are deﬁned as follows: 8 < ‖xi xj ‖2 t ; if xi A N k ðjÞ or xj A Nk ðiÞ W ij ¼ e ð1Þ : 0; otherwise where t40 is a suitable parameter, Nk(i) denotes the set of k nearest neighbors of xi. LPP aims to ﬁnd a projection that preserves the local structure of the data after projection of samples. To fulﬁll such an objective, the criterion function of LPP is deﬁned by J LPP ¼

min

AΤ XDX Τ A ¼ I

AΤ XLX Τ A

ð2Þ

where I is an identity matrix, L ¼D W is the Laplacian matrix, D is a diagonal matrix and Dii ¼ ∑j W ij . The transformation matrix ALPP can be obtained by solving the generalized eigenvalue problem: XLXTA¼ λXDXTA. 2.2. MFA Different from LPP which pays no attention to the discriminant structure of the data, MFA characterizes both the intrinsic geometrical structure and the discriminant structure of the data. MFA needs to design the intrinsic graph and penalty graph; the intrinsic graph characterizes the intra-class compactness and connects each data point with its neighboring points of the same class, while the penalty graph connects the marginal points and characterizes the inter-class separability. Let Gi ¼{V, E, Ww} and Gp ¼ {V, E, Wb} denote the intrinsic graph and penalty graph, respectively. The elements of matrices Ww and Wb are deﬁned by ( 1 if xj A N kþ1 ðiÞ or xi A N kþ1 ðjÞ ¼ Ww ð3Þ ij 0 otherwise and W bij ¼

1

if xj A P k2 ðiÞ or xi A P k2 ðjÞ

0

otherwise

ð4Þ

where N kþ1 ðiÞ and P k2 ðiÞ indicate the set of k1 intra-class nearest neighbors of xi and the set of k2 inter-class nearest neighbors of xi, respectively. The aim of MFA is to ﬁnd a discriminative projection that characterizes both the intra-class compactness and inter-class separability. Thus the feature extraction criterion of MFA can be

182

P. Huang et al. / Neurocomputing 139 (2014) 180–188

expressed as follows: J MFA ¼ min A

AΤ Sw A AΤ Sb A

ð5Þ

where Sw ¼XLwXT is the within-class scatter matrix, Sb ¼XLbXT is the between-class scatter matrix, Lw ¼Dw Ww and Lb ¼Db Wb are Laplacian matrices, Dw and Db are diagonal matrices whose w b b entries are respectively calculated as: Dw ii ¼ ∑ W ij and Dii ¼ ∑ W ij . j The optimal transformation matrix AMFA jcan be obtained by solving the following generalized eigenvalue problem SwA¼λSbA.

3. DSVPP From the deﬁnition of MFA, we can ﬁnd that the performance of MFA depends on the parameters k1 and k2. However, they are difﬁcult to determine in practical applications. For example, for a data point xi from the given data set X which belongs to the jth class, the number of possible choices for determining k1 and k2 is (nj 1)(n nj), where nj is the number of data points in class j and n is the total number of data points in X; obviously, it is hard to achieve an appropriate choice. Furthermore, MFA mainly focuses on characterizing the similarity between nearby data points: if one point is close to another point, then their similarity weight is set to 1, otherwise their similarity weight is set to 0. That indicates that MFA neglects the variance information between points which is beneﬁcial for classiﬁcation. Based on these considerations, we introduce discriminant similarity and variance preserving projection (DSVPP) as follows. We also use the intrinsic graph and penalty graph to characterize the intra-class compactness and inter-class separability. Let Gw ¼ fV ; E; Sgbe the intrinsic graph and Gb ¼ fV ; E; Bg be the penalty graph, where S is a weighted matrix characterizing the similarity between data points and B is a weighted matrix characterizing the variance between data points. Using the projection matrix A, the intra-class compactness can be characterized from the intrinsic graph by the within-class scatter Jw: J w ¼ ∑ J AΤ xi AΤ xj J 2 Sij

ð6Þ

ij

while the similarity between point ‘○’ and point ‘◇’ is smaller than the similarity between point ‘○’ and point ‘Δ’. Thus the elements of the variance matrix B can be deﬁned as follows: 8 Jx x J2 < i t j ; if li a lj Bij ¼ 1 e ð9Þ : 0; otherwise From the deﬁnition of Bij, we know that 1 e ‖xi xj ‖ =t is a strictly monotonic increasing function with respect to the distance between two points xi and xj. Bij indicates the variance information of xi and xj (in the not-same-class) to the penalty graph, the larger the distance J xi xj J 2 is, the larger the variance weight would be; otherwise, the variance weight is zero. To gain more insight into Eq. (6), we rewrite the square of the norm in the form of the matrix trace: 2

J w ¼ ∑ J AΤ xi AΤ xj J 2 Sij ij

¼ ∑trfðAΤ xi AΤ xj ÞðAΤ xi AΤ xj ÞΤ Sij g ij

¼ ∑trfAΤ ðxi xj Þðxi xj ÞΤ ASij g

ð10Þ

ij

Since the operation of trace is linear and Sij is a scalar, Eq. (10) can be easily simpliﬁed as J w ¼ ∑trfðAΤ xi AΤ xj ÞSij ðAΤ xi AΤ xj ÞΤ g ij

(

) Τ

Τ

Τ

Τ

¼ 2tr ∑A xi Sij xi A ∑A xi Sij xj A ij Τ

ij Τ

Τ

¼ 2trfA ðXDX XSX ÞAg

where the elements of the similarity matrix S are deﬁned as follows: 8 < J xi xj J 2 t ; if li ¼ lj ð7Þ Sij ¼ e : 0; otherwise where li denotes the class label of xi. From the deﬁnition of Sij, we 2 know that e jjxi xj jj =t is a strictly monotonic decreasing function with respect to the distance between two points xi and xj. Sij indicates the similarity information of xi and xj (in the same-class) to the intrinsic graph, the smaller the distance J xi xj J 2 is, the larger the similarity weight would be, otherwise the similarity weight is zero. The inter-class separability can be characterized from the penalty graph by the between-class scatter Jb: J b ¼ ∑ J AΤ xi AΤ xj J 2 Bij

Fig. 1. An example for illustrating the variance and similarity between data points.

ð8Þ

ij

where B is a weighted matrix with elements characterizing the variance between points from different classes. From the viewpoint of statistics, if two points xi and xj are very close to each other, i.e. J xi xj J 2 is small, then the variance between them is also small and the similarity should be large. On the contrary, if two points xi and xj are far apart, i.e. J xi xj J 2 is large, then the variance between them is also large and the similarity should be small [34]. Taking the points shown in Fig. 1 as an example, for the point ‘○’, it is easy to see that the variance between point ‘○’ and point ‘◇’ is larger than the variance between point ‘○’ and point ‘Δ’,

¼ 2trfAΤ XðD SÞX Τ Ag ¼ 2trfAΤ XLX Τ Ag ¼ 2trfAΤ Sw Ag

ð11Þ

where D ¼ diagðD11 ; D22 ; …; Dnn Þ, Dii ¼ ∑ Sij (i¼1, …, n) and Τ j L ¼ D S is the Laplacian matrix, Sw ¼ XLX is the within-class scatter matrix. Similar to Eqs. (10) and (11), we can simplify Eq. (8) as follows: J b ¼ ∑ J AΤ xi AΤ xj J 2 Bij ij

¼ 2trfAΤ XðD~ BÞX Τ Ag ~ Τ AÞ ¼ 2trðAΤ X LX ¼ 2trðAΤ Sb AÞ

ð12Þ

~ 22 ; …; D ~ nn Þ, D ~ ii ¼ ∑ Bij (i¼1, …, n) and where D~ ¼ diagðD~ 11 ; D ~ B is the Laplacian matrix, S ¼ X LX ~ j Τ is the between-class L~ ¼ D b scatter matrix. To consider the intra-class compactness and inter-class separability, we need to minimize the within-class scatter and maximize the between-class scatter simultaneously. In MFA, the objective function is a ratio of the within-class scatter to the between-class scatter. If the matrix XLbXT is non-singular, the optimal matrix can be obtained by solving the eigen-equation Sb 1SwA ¼λA. However, in many applications such as face recognition, a ratio often makes the algorithm suffer from the SSS

P. Huang et al. / Neurocomputing 139 (2014) 180–188

problem when the sample number is less than the dimensions of the samples. Under such circumstance, the between-class scatter matrix Sb is invertible. So the optimal solutions cannot be obtained by solving the eigen-equation Sb 1SwA¼ λA. To avoid the SSS problem, we change the objective function into a difference form of the within-class scatter and the between-class scatter, since the objective function min(Sb 1Sw) and the objective function min (Sw Sb) have the same motivation that it is to minimize the within-class scatter and maximize the between-class scatter at the same time. Thus the objective function of the proposed method is expressed as follows: J DSVPP ¼ min AΤ ðSw μSb ÞA Τ A A¼I

ð13Þ

where I is an identity matrix, and μ40 is a tradeoff parameter which balances the two terms of the objective function. Then we can obtain the optimal transformation matrix ADSVPP, which consists of the eigenvectors corresponding to the ﬁrst d non-zero smallest eigenvalues of Sw μSb . The criterion in Eq. (13) considers all samples and contains similarity and variance information between points in the graph construction in comparison with MFA. Minimizing the criterion in Eq. (13) attempts to ensure that the same-class points are mapped together while simultaneously the not-same-class points are mapped far away in the feature space. The technique of LPP seeks to ﬁnd a linear subspace which can preserve the local scatter of data. The value Wij characterizing the similarity weight between xi and xj is a strictly monotonic decreasing function with respect to the distance between xi and xj. If the distance between xi and xj is smaller, Wij gives a larger similarity weight value to them, which means the projection direction determined by LPP can ensure that if xi and xj are close, their projections yi and yj are close as well, where yi ¼ATxi. However, LPP cannot ensure that if xi and xj are distant from each other, their projections yi and yj are also distant from each other. Since the similarity weight Sij is a strictly monotonic decreasing function with respect to the distance between xi and xj and the variance weight Bij is a monotonic increasing function with respect to the distance between xi and xj, if the distance between xi and xj with the same class label is smaller, Sij gives a larger similarity weight value to them, and if the distance between xi and xj with different class labels is larger, Bij gives a larger variance weight value to them. This indicates that to minimize the criterion in Eq. (13) is to make sure that if xi and xj from the same class are close, their projections yi and yj are close as well, and if xi and xj from different classes are distant from each other, their projections yi and yj are still distant from each other. That is to say, the similarity information of points in the same class and the variance information of points in different classes are preserved in the projected feature space. Thus, we call the algorithm proposed in this paper as DSVPP. As a result, the outline for the proposed algorithm can be summarized as follows: Step 1: construct the adjacency matrices: for the given data set X, constructing the adjacency matrix S of the same class samples by Eq. (7); constructing the adjacency matrix B of the different class samples by Eq. (9); Step 2: construct the Laplacian matrices L and L~ form an n n diagonal matrix D, whose elements on the diagonal are given by D ¼ diagðD11 ; D22 ; …; Dnn Þ and Dii ¼ ∑ Sij (i ¼1, …, n). Then j the Laplacian matrix L is L ¼ D S; similarly, calculate the ~ ~ Laplacian matrix L ¼ D B; ~ Τ; Step 3: construct the two matrices XLX Τ and X LX Step 4: compute the optimized resolutions by solving the generalized eigenvalue problem based on Eq. (13);

183

Step 5: let a1, a2, …, ad be the optimal solutions of Eq. (13), then the transformation matrix of DSVPP is ADSVPP ¼ [a1, a2, …, ad]; for a sample x, its low-dimensional embedding can be obtained by y¼ ATx; and Step 6: adopt a suitable classiﬁer to recognize the features obtained in Step 5.

4. Experiments In this section, in order to evaluate the performance of the proposed DSVPP, we systematically compare it with PCA, LDA, LPP, UDP and MFA on several publicly available data sets. More speciﬁcally, we evaluate the performance of DSVPP through 2D (2-dimensional) visualization and face recognition. What deserves to be mentioned is that LPP, UDP, MFA and DSVPP all involve the parameter selection problem. In this paper the optimal parameters for the above methods are set by exhaustive search, since theoretically the parameter selection problem remains unsolved. Furthermore, when LDA, LPP, UDP and MFA are applied for face recognition, the PCA method is adopted to preprocess the image data to avoid the SSS problem. Different from algorithms, e.g., LDA, LPP, UDP and MFA, which lead to the SSS problem, DSVPP successfully avoids the matrix singularity problem since it has no inverse operation over a matrix. However, the PCA step is still recommended to reduce noise and for fair comparisons. After all the methods have been adopted to extract the face image features, the nearest neighbor classiﬁer with the Euclidean metric as the distance measure is employed to perform classiﬁcation for its simplicity. All the experiments are performed on a Core 2 Duo CPU 2.2 GHz with 2G RAM and programmed in the MATLAB language (version 7.0). 4.1. 2D data visualization In this selection, we perform visualization on Wine data set from the UCI machine learning repository (http://archive.ics.uci. edu/ml). The Wine data set consists of 178 samples of 3 classes. Each sample has 13 features. We select 48 samples per class in our experiments. Here, we apply PCA, LDA, LPP, UDP, MFA and DSVPP for feature extraction. All the samples are projected onto the 2D subspace by different algorithms, which are shown in Fig. 2. From the results shown in Fig. 2, we can clearly see that the points from 3 classes learned by the DSVPP algorithm have been well separated from each other, while the other algorithms tend to mix the 3 classes. This indicates that DSVPP captures a more reasonable structure of the data in comparison with other algorithms.

4.2. Face recognition In this section, we conduct experiments to evaluate the recognition performance of the proposed algorithm DSVPP and other algorithms on ORL, FERET and AR face databases. The ORL database is used to evaluate the performance of DSVPP under conditions where the poses and sample size are varied. The FERET database is used to examine the system performance when facial expressions, poses and illumination are varied. The AR database is employed to test the performance of the system under conditions where there is a variation on facial expressions and lighting conditions. 4.2.1. Experiments on ORL database The ORL database (http://www.cam-orl.co.uk) contains 400 different images from 40 subjects: each subject has 10 images.

184

P. Huang et al. / Neurocomputing 139 (2014) 180–188 150

0

140

-0.1

130

-0.2

120

-0.3

110

-0.4

100

-0.5

90

-0.6

80

-0.7

70

-0.8

60 -1800

-1600

-1400

-1200

-1000

-800

-600

-400

-200

-0.9 -0.7

-0.6

-0.5

-0.4

PCA

-0.3

-0.2

-0.1

0

0.1

110

120

5

5.5

LDA -0.5

0.015 0.01 0.005

-1

0 -0.005 -0.01

-1.5

-0.015 -0.02 -0.025 0.938

0.94

0.942 0.944 0.946 0.948

0.95

0.952 0.954 0.956

-2 20

30

40

50

60

LPP

70

80

90

100

UDP 3.5

-1 -2

3

-3 -4

2.5

-5 2

-6 -7

1.5 -8 -9

0

1

2

3

4

5

6

1 1.5

2

2.5

3

3.5

4

4.5

DSVPP

MFA

Fig. 2. 2D visualization results on Wine data set by PCA, LDA, LPP, UDP, MFA and DSVPP, where ‘ o ’,' ∗ ’ and ‘ + ’ denote three different classes.

Fig. 3. Some sample images of one person from the ORL database.

For some subjects, the images were taken at different times, with varying the lighting, facial expressions (open/closed eyes, smiling/ not smiling) and facial details (glasses/no glasses). The heads in images are slightly tilted or rotated. All images are grayscale and

normalized to a resolution of 112 92 pixels. Fig. 3 shows some sample images of one person from the ORL database. In the experiments, we select the ﬁrst l ( ¼2, 3, 4) images per person to form the training set and rest of the images to form the

P. Huang et al. / Neurocomputing 139 (2014) 180–188

testing set. In order to avoid the singularity in LDA, LPP, UDP, MFA and DSVPP, PCA is introduced as a preprocessing, by which the original face images are projected to a subspace at 40 dimensions. After features have been extracted by performing PCA, LDA, LPP, UDP, MFA and DSVPP, the nearest neighbor classiﬁer is adopted to predict the labels of the test data. Shown in Table 1 are the optimized recognition rates at corresponding dimensions for different dimensionality reduction methods. From Table 1 we can see that maximal recognition rates of all the methods increase with the increase of the training sample size. It is found that the proposed method outperforms the other techniques no matter what the sample size is. One reason is that DSVPP takes not only the label information and similarity information but also the variance information of the data into account which makes DSVPP more discriminative than other methods.

4.2.2. Experiments on FERET database The FERET [35,36] face image database has become a standard database for testing and evaluating state-of-the-art face recognition algorithms. The proposed method is tested on a subset of the Table 1 The maximal recognition rates (%) and the corresponding dimensions (shown in parentheses) using PCA, LDA, LPP, UDP, MFA and DSVPP on the ORL database. l

PCA

LDA

LPP

UDP

MFA

DSVPP

2 3 4

75.94 (34) 83.57 (34) 87.50 (40)

80.00 (28) 86.43 (38) 90.42 (20)

76.88 (34) 83.93 (36) 89.58 (35)

76.56 (40) 84.64 (38) 91.25 (33)

80.31 (30) 85.36 (34) 90.83 (21)

85.00 (36) 88.93 (35) 92.92 (36)

185

FERET database. This subset includes 1400 images of 200 individuals (each one has seven images). It is composed of the images whose names are marked with two-character strings: “ba”, “bd”, “be”, “bf”, “bg”, “bj” and “bk”. This subset involves two facial expression images, two left pose images, two right pose images and an illumination image. In our experiment, the facial portion of each original image was automatically cropped based on the location of eyes and mouth, and the cropped images were resized to 80 80 pixels. Some sample images are shown in Fig. 4. On the FERET database, we conduct experiments to examine the performance of the proposed algorithm DSVPP when facial poses, expressions and illuminations are varied. In the experiments, we select l ( ¼2, 3) images of each person to form the training set and rest of the images for testing. For each l, the training set and the testing set are constructed in three cases as shown in Table 2. The experiments are carried out with different training sets for each l, and thus we can see how the variation of images affects the recognition performance. For LDA, LPP, UDP, MFA and DSVPP, the dimension of preprocessed data obtained by PCA is 120. The best recognition rates for all the methods and corresponding dimensions are reported in Table 3, and the recognition curves with varied dimensions are shown in Fig. 5 and Fig. 6. As we can see, in spite of the variation on the facial expressions, poses and illuminations our algorithm outperforms other competitors. The main reason is that our algorithm could deal with the similarity information and variation information simultaneously. Furthermore, the strong structure preserving and discriminating ability makes the algorithm more suitable for the recognition tasks. In addition, we can ﬁnd that variations of the images have inﬂuence on the recognition rates of all algorithms

Fig. 4. Some sample images of one person from the FERET database.

Table 2 Training set, the variations of the training images (shown in parentheses) and testing set on the FERET database. Case 1

Case 2

Case 3

l ¼2 Training set (variation) Testing set

1#, 3# (Pose) 2#, 4#, 5#, 6#, 7#

1#, 6# (Expression) 2#, 3#, 4#, 5#, 7#

1#, 7# (Illumination) 1#, 2#, 3#, 4#, 5#

l ¼3 Training set (variation) Testing set

1#, 3#, 5# (Pose) 2#, 4#, 6#, 7#

1#, 5#, 6# (Pose, expression) 2#, 3#, 4#, 7#

1#, 6#, 7# (Pose, expression and illumination) 2#, 3#, 4#, 5#

Table 3 The maximal recognition rates (%) and the corresponding dimensions (shown in parentheses) using PCA, LDA, LPP, UDP, MFA and DSVPP on the FERET database. PCA

LDA

LPP

UDP

MFA

DSVPP

l¼2 Case 1 Case 2 Case 3 Average

39.30 (55) 28.90 (35) 34.25 (60) 34.15

58.00 (40) 37.80 (30) 38.90 (70) 44.90

40.20 (50) 30.10 (40) 31.90 (25) 34.07

44.40 (100) 36.30 (120) 36.70 (115) 39.13

57.60 (25) 39.10 (35) 40.50 (60) 46.03

60.30 (55) 44.70 (65) 47.90 (25) 50.97

l¼3 Case 1 Case 2 Case 3 Average

49.50 (90) 34.25 (60) 37.00 (35) 40.25

72.88 (40) 51.25 (25) 41.25 (25) 55.13

50.88 (120) 36.25 (75) 33.12 (40) 42.79

53.37 (105) 36.88 (85) 39.88 (80) 43.38

74.62 (40) 50.25 (25) 40.25 (25) 55.04

78.25 (40) 55.00 (75) 49.00 (35) 60.75

186

P. Huang et al. / Neurocomputing 139 (2014) 180–188

1

1

0.9

Recognition accuracy

0.8 0.7 0.6 0.5

0.9 0.8

Recognition accuracy

PCA LDA LPP UDP MFA DSVPP

0.7 0.6 0.5

0.4

0.4

0.3

0.3

0.2

10

20

30

40

50

60

70

80

90

100

110

0.2

120

PCA LDA LPP UDP MFA DSVPP 10

20

30

40

50

Dimension

1

Recognition accuracy

0.8 0.7 0.6 0.5

50

60

70

80

90

100

110

0.5

0.2

120

10

20

30

40

50

60

70

80

90

100

110

120

Dimension

1

0.8 0.7 0.6 0.5 0.4

1 PCA LDA LPP UDP MFA DSVPP

0.9 0.8

Recognition accuracy

PCA LDA LPP UDP MFA DSVPP

0.9

Recognition accuracy

120

PCA LDA LPP UDP MFA DSVPP

Dimension

0.7 0.6 0.5 0.4

0.3 0.2

110

0.6

0.3

40

100

0.7

0.3

30

90

0.8

0.4

20

80

0.9

0.4

10

70

1

Recognition accuracy

PCA LDA LPP UDP MFA DSVPP

0.9

0.2

60

Dimension

0.3 10

20

30

40

50

60

70

80

90

100

110

120

Dimension Fig. 5. Recognition rate curves for different algorithms with varied dimensions on the FERET database when l ¼ 2: (a) case 1, (b) case 2, and (c) case 3.

and the average recognition rate of DSVPP when l ¼3 is larger than that of DSVPP when l ¼2. Besides the recognition rates, the time complexity is also important for practical applications. We list the average computational time of all the methods for three cases in Table 4. From Table 4 we can see that DSVPP is less time consuming than the other manifold-based learning algorithms such as LPP, UDP and MFA.

0.2

10

20

30

40

50

60

70

80

90

100

110

120

Dimension Fig. 6. Recognition rate curves for different algorithms with varied dimensions on the FERET database when l ¼ 3: (a) case 1, (b) case 2, and (c) case 3.

4.2.3. Experiments on AR database The AR face database [37,38] contains over 4000 face images of 126 people (70 men and 56 women), including frontal views of faces with different facial expressions, lighting conditions and occlusions. For each individual, 26 images were taken in two sessions (separated by two weeks) and each session contains 13 images. In our experiment, we only use the images without

P. Huang et al. / Neurocomputing 139 (2014) 180–188

occlusion in the preprocessed AR face database. The subset includes 1680 face images corresponding to 120 persons (65 men and 55 women) where each person has 14 different images taken in two sessions. All the images were in grayscale and manually cropped to 50 40 pixels. Some sample images of one person are shown in Fig. 7. On the AR database, we conduct experiments to evaluate the performance of our algorithm with varied facial expressions and illumination. The experiments are carried out in two cases: in case 1, we select the ﬁrst 7 images of each person for training and rest Table 4 Computational time (s) of all the methods on the FERET database. Method

l¼2

l ¼3

PCA LDA LPP UDP MFA DSVPP

5.8 6.7 16.2 20.6 19.3 13.4

7.6 8.9 33.2 43.4 46.5 27.1

187

of the images for testing, and in case 2, we use the last 7 images of each person for training and the rest images for testing. Note that LDA, LPP, UDP, MFA and DSVPP all involve a PCA phase. In the phase, the dimension of the image data preprocessed by PCA is 150. Shown in Table 5 are the maximal recognition rates and the corresponding dimensions for all the methods. Shown in Fig. 8 are the recognition curves of different methods with varied dimensions. From Table 5, we can see that DSVPP performs the best among all the methods, which further conﬁrms the effectiveness of the proposed method. From Fig. 8, it can be found that, with the increase of the dimensions, our proposed DSVPP outperforms other methods and shows particularly better performance when the size of the dimensions is large in the second case.

5. Conclusions In this paper, we propose a novel supervised algorithm, namely discriminant similarity and variance preserving projection (DSVPP), for feature extraction and recognition. DSVPP can effectively discover the intrinsic manifold structure and the discriminant

Fig. 7. Some sample images of one person from the AR database.

Table 5 The maximal recognition rates (%) and the corresponding dimensions (shown in parentheses) using PCA, LDA, LPP, UDP, MFA and DSVPP on the AR database. LDA

LPP

UDP

MFA

DSVPP

54.29 (100) 54.53 (150)

66.67 (113) 65.00 (104)

61.67 (150) 60.83 (140)

64.05 (150) 64.40 (140)

67.38 (120) 67.86 (140)

69.52 (125) 72.62 (130)

1

1

0.9

0.9

0.8

0.8

0.7 0.6 0.5 PCA LDA LPP UDP MFA DSVPP

0.4 0.3 0.2

20

40

60

80

Dimension

100

120

140

Recognition accuracy

Recognition accuracy

Case 1 Case 2

PCA

0.7 0.6 0.5 PCA LDA LPP UDP MFA DSVPP

0.4 0.3 0.2

20

40

60

80

100

120

Dimension

Fig. 8. Recognition rate curves for different algorithms with varied dimensions on the AR database: (a) case 1 and (b) case 2.

140

188

P. Huang et al. / Neurocomputing 139 (2014) 180–188

structure of the data. By taking advantage of the label information, DSVPP constructs the intrinsic graph and the penalty graph to model the intra-class compactness and inter-class separability of data points. Different from other manifold-based learning algorithms, DSVPP not only considers the similarity information of points but also the variance information of points which is beneﬁcial for classiﬁcation tasks. Experiments carried out on the Wine data set, ORL, FERET and AR face databases demonstrate that our proposed algorithm DSVPP are more suitable for classiﬁcation than other algorithms such as PCA, LDA, LPP, UDP and MFA. References [1] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, Boston, MA, 1990. [2] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71–86. [3] S. Mikat, G. Fitsch, Fisher discriminant analysis with kernels, in: Proceedings of the 1999 IEEE Signal Processing Society Workshop, 1999, pp. 41–48. [4] S.J. Raudys, A.K. Jain, Small sample size effects in statistical pattern recognition: recommendations for practitioners, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991) 252–264. [5] S.W. Ji, J.P. Ye, Generalized linear discriminant analysis: a uniﬁed framework and efﬁcient model selection, IEEE Trans. Neural Netw. 19 (10) (2008) 1768–1782. [6] J.W. Lu, K. Plataniotis, A. Venetsanopoulos, Regularization studies of linear discriminant analysis in small sample size scenarios with application to face recognition, Pattern Recognit. Lett. 26 (2005) 181–191. [7] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Stat. 23 (1) (1995) 73–102. [8] P. Howland, H. Park, Generalizing discriminant analysis using the generalized singular value decomposition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (8) (2004) 995–1006. [9] J.P. Ye, Q. Li, A two-stage linear discriminant analysis via QR-decomposition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (6) (2005) 929–941. [10] J.P. Ye, Computational and theoretical analysis of null space and orthogonal linear discriminant analysis, J. Mach. Learn. Res. 7 (2006) 1183–1204. [11] L.F. Chen, H.Y. Mark Liao, M.T. Ko, J.C. Lin, G.J. Yu, A new LDA-based face recognition system which can solve the small sample size problem, Pattern Recognit. 33 (2000) 1713–1726. [12] H. Yu, J. Yang, A direct LDA algorithm for high-dimensional data-with application to face recognition, Pattern Recognit. 34 (2001) 2067–2070. [13] J. Yang, J.Y. Yang, Why can LDA be performed in PCA transformed space? Pattern Recognit. 36 (2003) 563–566. [14] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. ﬁsherfaces: recognition using class speciﬁc linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720. [15] H. Li, T. Jiang, K. Zhang, Efﬁcient and robust feature extraction by maximum margin criterion, IEEE Trans. Neural Netw. 17 (1) (2006) 1157–1165. [16] S.T. Roweis, L.K. Saul, Nonlinear dimension reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326. [17] J.B. Tenenbaum, V.d. Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323. [18] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396. [19] S. Yan, D. Xu, B. Zhang, H.J. Zhang, Q. Yang, S. Lin, Graph embedding and extension: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007). [20] X. He, S. Yan, Y. Hu, P. Niyogi, H. Zhang, Face recognition using Laplacian faces, IEEE Trans. Pattern Anal. Mach. Intell. 27 (3) (2005) 328–340. [21] Y. Xu, A. Zhong, J. Yang, D. Zhang, LPP solution schemes for use with face recognition, Pattern Recognit. 43 (2010) 4165–4175. [22] D. Cai, X. He, J.W. Han, et al., Orthogonal Laplacianfaces for face recognition, IEEE Trans. Image Process 15 (11) (2006) 3609–3614. [23] D. Hu, G. Feng, Z. Zhou, Two-dimensional locality preserving projections (2DLPP) with its application to palmprint recognition, Pattern Recognit. 40 (1) (2007) 339–342. [24] G. Feng, D. Zhang, J. Yang, D. Hu, A theoretical framework for matrix-based feature extraction algorithms with its application to image recognition, Int. J. Image Graph. 8 (1) (2008) 1–23. [25] J. Yang, D. Zhang, J.Y. Yang, B. Niu, Globally maximizing, locally minimizing: unsupervised discriminant projection with application to face and palm biometrics, IEEE Trans. Pattern Anal. Mach. Intell. 29 (4) (2007) 650–664. [26] W. Deng, J. Hu, J. Guo, H. Zhang, C. Zhang, Comments on globally maximizing, locally minimizing: unsupervised discriminant projection with application to face and palm biometrics, IEEE Trans. Pattern Anal. Mach. Intell. 30 (8) (2008) 1503–1504. [27] H. Chen, H. Chang, T. Liu, Local discriminant embedding and its variants, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR2005), vol. 2, pp. 846–853. [28] P. Huang, Z.M. Tang, C.K. Chen, X.T. Cheng, Nearest-neighbor classiﬁer motivated marginal discriminant projections for face recognition, Front. Comput. Sci. China 5 (4) (2011) 419–428.

[29] B. Li, C. Wang, D.S. Huang, Supervised feature extraction based on orthogonal discriminant projection, Neurocomputing 73 (2009) 191–196. [30] W.K. Yang, C.Y. Sun, L. Zhang, A multi-manifold discriminant analysis for image feature extraction, Pattern Recognit. 44 (2011) 1649–1657. [31] B. Li, D. Huang, C. Wang, K. Liu, Feature extraction using constrained maximum variance mapping, Pattern Recognit. 41 (2008) 3287–3294. [32] M. Sugiyama, Local ﬁsher discriminant analysis for supervised dimensionality reduction, in: Proceedings of the 23rd International Conference on Machine Learning, AMC Press, 2006. [33] Q. Gao, H. Xu, Y. Li, D. Xie, Two-dimensional supervised similarity and diversity projection, Pattern Recognit. 43 (10) (2010) 3359–3363. [34] Q. Gao, H. Zhang, J. Liu, Two-dimensional margin, similarity and variation embedding, Neurocomputing 86 (2012) 179–183. [35] P.J. Phillips, H. Moon, S.A. Rizvi, P.J. Rauss, The FERET evaluation methodology for face recognition algorithms, IEEE Trans. Pattern Ana1. Mach. Intell. 22 (10) (2000) 1090–1104. [36] P.J. Phillips, The facial recognition technology (FERET) database. 〈http://www. itl.nist.gov/iad/humanid/feret/feret_master.html〉, 2004. [37] A.M. Martinez, R. Benavente, The AR face database, 1998. 〈http://cobweb.ecn. purdue.edu/ aleix/aleix_face_DB.html〉. [38] A.M. Martinez, R. Benavente, The AR face database, CVC Technical Report, #24, June 1998.

Pu Huang received his B.S. and M.S. degrees in computer applications from Yangzhou University, PR China, in 2007 and 2010, respectively. He is currently pursuing the Ph.D. degree in Pattern Recognition and Intelligent Systems at Nanjing University of Science and Technology (NUST), PR China. His research interests include pattern recognition, computer vision and machine learning.

Caikou Chen received his B.S. degree and M.S. degree in 1991 and 2000, respectively, and his Ph.D. at the Nanjing University of Science and Technology (NUST) in the Department of Computer Science on the subject of Pattern Recognition and Intelligence Systems in 2004. From 2005 to 2007, he was a postdoctoral researcher at NUST. From 2009 to 2010, he was a visiting professor at Robotics Institute of Carneige Mellon University. Now, he is a professor in the College of Information Engineering of Yangzhou University. He is the author of more than 30 scientiﬁc papers in pattern recognition and computer vision. His current research interests include pattern recognition, computer vision and machine learning.

Zhenmin Tang received his B.S. degree in computer software from Harbin Engineering University in 1982, M.S. degree in computer application and Ph.D. degree in Pattern Recognition and Intelligent System from Nanjing University of Science and Technology (NUST), PR China, in 1988 and 2002, respectively. He is a professor in the School of Computer Science and Engineering of NUST. His current interests are in the areas of pattern recognition, computer vision, machine learning and intelligent robot.

Zhangjing Yang received his B.S. degree in computer science and education from Nanjing Normal University (NJNU) in 2004, and his M.S. degree in computer applications from Nanjing University of Science and Technology (NUST) in 2009. He is currently pursuing the Ph.D. degree in computer applications at Nanjing University of Science and Technology, China. His current research interests include pattern recognition, computer vision and machine learning.