An incremental dimensionality reduction method on discriminant information for pattern classification


Pattern Recognition Letters 30 (2009) 1416–1423


Xiaoqin Hu (a), Zhixia Yang (b,c), Ling Jing (a,*)

a College of Science, China Agricultural University, 100083 Beijing, PR China
b College of Mathematics and Systems Science, Xinjiang University, 830046 Urumqi, PR China
c Academy of Mathematics and Systems Science, CAS, 100190 Beijing, PR China

* Corresponding author. Tel./fax: +86 10 62736511. E-mail addresses: [email protected], [email protected] (L. Jing).

Article history: Received 18 July 2008; received in revised form 15 May 2009; available online 7 July 2009. Communicated by R.P.W. Duin.

Keywords: Dimensionality reduction; Pattern classification; Discriminant mapping

Abstract

From the viewpoint of classification, linear discriminant analysis (LDA) is a proper dimensionality reduction method: it finds an optimal linear transformation that maximizes class separability. However, it is difficult to apply LDA to undersampled problems, where the number of data samples is smaller than the dimensionality of the data space, because the high dimensionality makes the scatter matrices singular. In order to make LDA applicable, we propose a new dimensionality reduction algorithm called discriminant multidimensional mapping (DMM), which combines the advantages of multidimensional scaling (MDS) and LDA. DMM is effective for small sample datasets with high dimensionality, and its superiority is established from a theoretical point of view. We then extend DMM to large datasets and to datasets with non-linear manifolds, obtaining two further algorithms: landmark DMM (LDMM) and geodesic–metric discriminant mapping (GDM). The performance of these algorithms is also demonstrated by preliminary numerical experiments.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Pattern classification problems in which the data objects have a large number of features are becoming more prevalent in pattern recognition. In this situation it is often beneficial to reduce the dimensionality of the data in order to improve the efficiency and accuracy of data analysis. Traditional algorithms used in machine learning and pattern recognition applications are often susceptible to the well-known curse of dimensionality (Bellman, 1961), which refers to the degradation in the performance of a given algorithm as the number of features increases. To deal with this issue, dimensionality reduction techniques are often applied as a data pre-processing step or as part of the data analysis to simplify the data model.

Existing dimensionality reduction techniques can be divided into two classes by their learning process: unsupervised and supervised. Well-known unsupervised dimensionality reduction techniques include multidimensional scaling (MDS) (Torgerson, 1952), principal component analysis (PCA), locally linear embedding (LLE) (Roweis and Saul, 2000), isometric feature mapping (Isomap) (Tenenbaum et al., 2000), Laplacian eigenmaps (Belkin and Niyogi, 2002) and kernel principal component analysis (KPCA) (Scholkopf et al., 1998). They are effective in finding compact representations and useful for data interpolation and visualization. However, they may be only weakly effective for pattern classification because class labels are not considered. One of the important supervised dimensionality reduction techniques is linear discriminant analysis (LDA) (Belhumeur et al., 1997), whose success in pattern classification has been demonstrated: the features extracted by LDA are the most discriminating features (Swets and Weng, 1996). Compared with unsupervised methods, however, LDA is too expensive for high-dimensional datasets because its computational complexity is determined by the dimensionality of the data points.

In this paper, we first consider supervised dimensionality reduction for small sample datasets with high dimensionality. Extracting discriminating features from such datasets is challenging. Noticing the efficiency of LDA on low-dimensional datasets, it is natural to put MDS before LDA to reduce the dimensionality, which gives an algorithm MDS + LDA. By theoretical analysis, we find that the same results as MDS + LDA can be obtained by a simpler process. This leads to our main algorithm, discriminant multidimensional mapping (DMM). DMM produces the same result as MDS + LDA, but at a lower computational cost. Some variations are also studied here. DMM mainly deals with small sample datasets with high dimensionality.


For large datasets with high dimensionality, based on the idea of landmark MDS, we develop a variation of DMM called landmark DMM (LDMM). Furthermore, for datasets with a non-linear manifold, DMM is modified by using the geodesic distance to estimate the intrinsic geometry of the underlying manifold; the resulting method is called geodesic–metric discriminant mapping (GDM).

The rest of this paper is organized as follows. In Section 2, LDA, classical MDS and landmark MDS are briefly introduced. Our methods (DMM, LDMM and GDM) are discussed in detail in Section 3. Experimental results are presented in Section 4, followed by the conclusions in Section 5.

2. LDA, classical MDS and landmark MDS

In this section, we briefly introduce linear discriminant analysis (LDA), classical multidimensional scaling (MDS) and landmark MDS, and discuss their respective advantages and disadvantages.

2.1. LDA (Belhumeur et al., 1997)

Given a training set

S = \{(x_1, c_1), (x_2, c_2), \ldots, (x_N, c_N)\},    (1)

where x_i \in R^m is the object and c_i \in C = \{1, 2, \ldots, K\} is the class label of the object, i = 1, 2, \ldots, N, the objective is to uncover a transformation W \in R^{m \times p} (p < m) such that

Y = W^T X,    (2)

where X = [x_1, \ldots, x_N], x_i \in R^m, Y = [y_1, \ldots, y_N], y_i \in R^p, i = 1, 2, \ldots, N.

LDA seeks the transformation that maximizes the between-class separation while minimizing the within-class separation. To this end two scatter matrices are introduced, S_B for between-class separation and S_W for within-class separation:

S_B = \sum_{c \in C} N_c (\mu_c - \mu)(\mu_c - \mu)^T,    (3)

S_W = \sum_{c \in C} \sum_{i: c_i = c} (x_i - \mu_c)(x_i - \mu_c)^T,    (4)

where N_c is the number of sample points in class c, \mu is the mean of all examples and \mu_c is the mean of all examples in class c. The components within these summations (\mu, \mu_c, x_i) are m-dimensional vectors, so both S_B and S_W are m \times m matrices. The two objectives of maximizing between-class separation and minimizing within-class separation can be combined into a single maximization called the Fisher criterion (Fisher, 1936; Fukunaga, 1990):

W = [w_1, w_2, \ldots, w_p] = \arg\max_{\widehat{W}} \frac{|\widehat{W}^T S_B \widehat{W}|}{|\widehat{W}^T S_W \widehat{W}|},  \widehat{W} \in R^{m \times p}.    (5)

In fact, finding the optimal projection W is equivalent to solving for the generalized eigenvectors [w_1, w_2, \ldots, w_p] of S_W^{-1} S_B corresponding to the p largest generalized eigenvalues \{\lambda_i\}_{i=1}^p. Note that the rank of S_B is not greater than K - 1, because S_B is the sum of K matrices of rank one or zero. Therefore, S_W^{-1} S_B has at most K - 1 eigenvectors with nonzero eigenvalues (Richard et al., 2001), and p is sometimes selected to be K - 1.

It should be pointed out that LDA has two computational bottlenecks: one is the calculation of the inverse S_W^{-1} of an m \times m matrix; the other is the calculation of the p largest generalized eigenvalues and the corresponding generalized eigenvectors of the m \times m matrix S_W^{-1} S_B. LDA is therefore computationally expensive when m is very large, which is precisely the situation where dimensionality reduction is called for.
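As a concrete illustration of criterion (5), here is a minimal NumPy/SciPy sketch (not the authors' code) that builds S_B and S_W and solves the generalized eigenproblem S_B w = λ S_W w for the p = K − 1 leading directions. The small ridge added to S_W to keep it invertible in undersampled cases is our own assumption, not part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def lda_fit(X, y, p=None, ridge=1e-8):
    """Fisher LDA. X is (m, N) with samples as columns, y holds the N class labels.
    Returns W (m, p) maximizing |W^T S_B W| / |W^T S_W W| via the generalized
    eigenproblem S_B w = lambda * S_W w."""
    y = np.asarray(y)
    m, N = X.shape
    classes = np.unique(y)
    mu = X.mean(axis=1, keepdims=True)
    S_B = np.zeros((m, m))
    S_W = np.zeros((m, m))
    for c in classes:
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        S_B += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T   # eq. (3)
        S_W += (Xc - mu_c) @ (Xc - mu_c).T                 # eq. (4)
    if p is None:
        p = len(classes) - 1              # rank(S_B) <= K - 1
    # The ridge keeps S_W positive definite in undersampled cases (our assumption).
    evals, evecs = eigh(S_B, S_W + ridge * np.eye(m))
    order = np.argsort(evals)[::-1]       # eigh returns eigenvalues in ascending order
    return evecs[:, order[:p]]            # columns w_1, ..., w_p

# Usage: Y = W.T @ X projects the data into the p-dimensional discriminant space.
```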

2.2. Classical MDS (Zha and Zhang, 2003)

Classical MDS is unsupervised in that it does not take class labels into account. Given a set of N objects with m-dimensional data X = [x_1, \ldots, x_N], the task is to embed the objects as points in a p-dimensional (p < m) space while preserving the geometry as faithfully as possible.

Let the squared distance matrix be M = [d^2(x_i, x_j)]_{N \times N}, where d(x_i, x_j) is the Euclidean distance between x_i and x_j, i, j = 1, 2, \ldots, N. The mean-centered "inner-product" matrix is defined as

H \equiv -JMJ/2,    (6)

where J = I - ee^T/N is the mean-centering matrix. Decompose the matrix H as follows:

H = U \,\mathrm{diag}(\lambda_1, \ldots, \lambda_N)\, U^T,  \lambda_1 \geq \cdots \geq \lambda_p \geq \lambda_{p+1} \geq \cdots \geq \lambda_N,  U = [v_1, v_2, \ldots, v_p, v_{p+1}, \ldots, v_N],    (7)

where \lambda_i is the ith largest positive eigenvalue and v_i is the corresponding orthonormal eigenvector. Then the required p-dimensional embedding vectors Y are given by the columns of the matrix

Y = [y_1, \ldots, y_N] = \mathrm{diag}(\lambda_1^{1/2}, \ldots, \lambda_p^{1/2})\, U^{*T},    (8)

where U^* = [v_1, v_2, \ldots, v_p].

Classical MDS has many appealing traits. For example, it behaves very well when the objects truly have a low-dimensional Euclidean structure: classical MDS is guaranteed to find a Euclidean embedding which exactly preserves the interpoint distances. In addition, the required storage is O(N^2) and the computational complexity is O(N^3 + pN^2), so it is computationally efficient for small sample datasets even when the dimensionality m is high.
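Formulas (6)-(8) translate almost line by line into code. The following NumPy sketch (ours, assuming the data are given as the columns of X) computes the double-centered matrix H, its leading eigenpairs and the embedding Y.

```python
import numpy as np

def classical_mds(X, p):
    """Classical MDS of the columns of X (shape (m, N)) into R^p, following (6)-(8)."""
    N = X.shape[1]
    # Squared Euclidean distance matrix M
    sq = np.sum(X**2, axis=0)
    M = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    J = np.eye(N) - np.ones((N, N)) / N              # mean-centering matrix
    H = -J @ M @ J / 2.0                             # eq. (6)
    evals, evecs = np.linalg.eigh(H)                 # eq. (7), ascending order
    idx = np.argsort(evals)[::-1][:p]                # keep the p largest eigenvalues
    lam, V = evals[idx], evecs[:, idx]
    # eq. (8): Y is (p, N), row i scaled by sqrt(lambda_i)
    return np.sqrt(np.maximum(lam, 0.0))[:, None] * V.T
```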

2.3. Landmark MDS (de Silva and Tenenbaum, 2004)

When the number N of objects is very large, classical MDS may be too expensive. To solve this problem, de Silva and Tenenbaum (2004) proposed a variation of classical MDS, called landmark MDS (LMDS), which preserves almost all of the attractive properties of classical MDS but is much more efficient: its cost is essentially linear in the number of objects. From the set of N objects with m-dimensional data X = [x_1, \ldots, x_N], a set of n distinguished points referred to as "landmarks" is selected. LMDS then proceeds in two steps. The first step applies classical MDS to the landmark points. The second step is a distance-based triangulation procedure, which uses the distances to the already-embedded landmark points to determine where the remaining points should go.

LMDS is very effective in practice, as it requires O(nN) storage and O(2mnN + n^3) computation time. It has been proved that LMDS achieves the correct solution if the data really have a low-dimensional Euclidean structure, and it is also stable under small perturbations of the input distances.
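The following sketch illustrates the two LMDS steps just described, assuming NumPy. The triangulation formula follows our reading of the pseudoinverse construction in de Silva and Tenenbaum (2004); the helper names are ours and nothing here is code from the paper.

```python
import numpy as np

def sq_dists(A, B):
    """Squared Euclidean distances between the columns of A (m, a) and B (m, b)."""
    return (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
            - 2.0 * A.T @ B)

def landmark_mds(X, landmark_idx, p):
    """Landmark MDS sketch: classical MDS on the landmarks, then distance-based
    triangulation of every point (de Silva and Tenenbaum, 2004)."""
    Xl = X[:, landmark_idx]                        # (m, n) landmark points
    n = Xl.shape[1]
    Dn = sq_dists(Xl, Xl)                          # landmark-landmark squared distances
    J = np.eye(n) - np.ones((n, n)) / n
    H = -J @ Dn @ J / 2.0
    evals, evecs = np.linalg.eigh(H)
    idx = np.argsort(evals)[::-1][:p]
    lam, V = evals[idx], evecs[:, idx]             # p largest eigenpairs of the landmarks
    # Triangulation matrix with rows v_i^T / sqrt(lambda_i)
    L_sharp = V.T / np.sqrt(np.maximum(lam, 1e-12))[:, None]
    delta = sq_dists(Xl, X)                        # (n, N) squared distances to landmarks
    delta_mean = Dn.mean(axis=1, keepdims=True)    # column mean of the landmark matrix
    return -0.5 * L_sharp @ (delta - delta_mean)   # (p, N) embedding of all points
```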

3. New methods

In this section, we propose three supervised dimensionality reduction methods for datasets with different traits. The first, discriminant multidimensional mapping (DMM), is designed for small sample datasets with high dimensionality. The second, landmark DMM (LDMM), and the third, geodesic–metric discriminant mapping (GDM), are suited to large datasets with high dimensionality and to datasets with non-linear manifolds, respectively.


3.1. Discriminant multidimensional mapping (DMM)

In this subsection, for small sample datasets with high dimensionality, we propose a supervised dimensionality reduction method called discriminant multidimensional mapping (DMM). From the previous section, we know that LDA is a supervised method, but it is too expensive when the original objects are high-dimensional. The computational cost of classical MDS, on the other hand, is determined by the number of samples, but MDS is an isometric projection and hence less appropriate for pattern classification. So it is natural to put MDS before LDA to reduce the dimensionality, giving an algorithm MDS + LDA: the features we obtain are suitable for classification, and the computational cost is determined mainly by the number of samples. If we proceed this way, however, the computational cost is still high because there are two steps, and another disadvantage is that a new parameter p (the dimensionality of the space MDS projects into) needs to be selected. To solve these problems, we give two theorems below on which our algorithm DMM is based. Without selecting any parameters, DMM projects the points into a low-dimensional space directly. Besides this, it has many other attractive properties.

Theorem 1. Given a set of N objects with m-dimensional data X = [x_1, \ldots, x_N], x_i \in R^m, i = 1, 2, \ldots, N, which is to be embedded into the Euclidean space R^p, let D = [d_1, \ldots, d_N] with d_i = [d^2(x_i, x_1), d^2(x_i, x_2), \ldots, d^2(x_i, x_N)]^T, i = 1, \ldots, N (so that D = M by the symmetry of M), and let L be the p \times N matrix whose ith row is v_i^T/\sqrt{\lambda_i}, where \lambda_i is the ith largest positive eigenvalue and v_i the corresponding orthonormal eigenvector of the mean-centered "inner-product" matrix H \equiv -JMJ/2. Then the embedding Y obtained by classical MDS can be written as

Y = -\frac{1}{2} LDJ,    (9)

where J = I - ee^T/N.

Proof. Because the v_i are orthonormal eigenvectors of H with positive eigenvalues (and hence orthogonal to e), we have LJ = L. Thus we only need to prove that

Y = -\frac{1}{2} LJMJ = LH.    (10)

For each eigenvector v_i of H we have

\frac{v_i^T H}{\sqrt{\lambda_i}} = \frac{\lambda_i v_i^T}{\sqrt{\lambda_i}} = \sqrt{\lambda_i}\, v_i^T.    (11)

According to (8),

Y = [\sqrt{\lambda_1}\, v_1^T; \ldots; \sqrt{\lambda_p}\, v_p^T],    (12)

that is, the ith row of Y is \sqrt{\lambda_i}\, v_i^T. By (11), the ith row of LH is also \sqrt{\lambda_i}\, v_i^T, so the ith row of LH equals the ith row of Y. □

Let \bar{d} = \frac{1}{N} \sum_i d_i; then DJ = [d_1 - \bar{d}, \ldots, d_N - \bar{d}]. From Theorem 1 we have y_i = -\frac{1}{2} L (d_i - \bar{d}), that is, y_i is a linear transformation of d_i - \bar{d}. A similar theorem was proved in (de Silva and Tenenbaum, 2004).
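Theorem 1 is easy to check numerically. The following self-contained NumPy snippet (a sketch, not part of the paper) builds M, J, H and L for random data and confirms that −½LDJ coincides with the classical MDS embedding of (8).

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, p = 50, 20, 3
X = rng.standard_normal((m, N))

# Squared distance matrix; D = M since d_i collects the squared distances from x_i.
sq = np.sum(X**2, axis=0)
M = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
J = np.eye(N) - np.ones((N, N)) / N
H = -J @ M @ J / 2.0

evals, evecs = np.linalg.eigh(H)
idx = np.argsort(evals)[::-1][:p]
lam, V = evals[idx], evecs[:, idx]

Y_mds = np.sqrt(lam)[:, None] * V.T      # classical MDS embedding, eq. (8)
L = V.T / np.sqrt(lam)[:, None]          # rows v_i^T / sqrt(lambda_i)
Y_thm1 = -0.5 * L @ M @ J                # eq. (9): Y = -1/2 L D J

print(np.allclose(Y_mds, Y_thm1))        # expected output: True
```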

Theorem 2. Given two datasets with the same class labels, X = [x_1, \ldots, x_N] and Y = [y_1, \ldots, y_N], x_i \in R^m, y_i \in R^p, i = 1, 2, \ldots, N, suppose there exists a matrix W_1 such that Y = W_1^T X. Apply LDA to Y and to X respectively, obtaining the optimal projections W_2 and W_3. Then

W_3 = W_1 W_2.    (13)

Proof. Let S_B^1 and S_W^1 be the between-class and within-class scatter matrices of X, respectively, and let S_B^2 and S_W^2 be the corresponding matrices of Y. It is easy to see that

S_B^2 = W_1^T S_B^1 W_1,  S_W^2 = W_1^T S_W^1 W_1.

According to (5), W_2 and W_3 are defined as

W_2 = \arg\max_{W} \frac{|W^T S_B^2 W|}{|W^T S_W^2 W|},    (14)

W_3 = \arg\max_{W} \frac{|W^T S_B^1 W|}{|W^T S_W^1 W|}.    (15)

By (14), W_2 can also be written as

W_2 = \arg\max_{W} \frac{|(W_1 W)^T S_B^1 (W_1 W)|}{|(W_1 W)^T S_W^1 (W_1 W)|}.    (16)

Comparing (15) and (16), it is obvious that W_3 = W_1 W_2. □

We are now in a position to derive our algorithm DMM. Consider the data given by (1) and assume that N \ll m. Let the embedding vectors be Z = [z_1, \ldots, z_N], z_i \in R^p, i = 1, 2, \ldots, N. Applying classical MDS to X, by Theorem 1 we first have

Y = -\frac{1}{2} LDJ,    (17)

where D and J are defined as in Theorem 1. Letting -\frac{1}{2} L = W_1^T, we obtain

Y = W_1^T DJ.    (18)

Applying LDA to Y, we get the embedding vectors

Z = W_2^T Y,    (19)

where W_2 is obtained from the optimization objective (5). Combining the above, we get

Z = W_2^T W_1^T DJ.    (20)

Let W^* = W_1 W_2; then

Z = W^{*T} DJ.    (21)

From Theorem 2 we can see that

W^* = \arg\max_{W} \frac{|W^T \tilde{S}_B W|}{|W^T \tilde{S}_W W|},    (22)

where \tilde{S}_B and \tilde{S}_W are the between-class and within-class scatter matrices of DJ = [d_1 - \bar{d}, \ldots, d_N - \bar{d}].

From the above discussion, the embedding vectors Z can be obtained directly by (21); DMM thus avoids the explicit calculation of W_1 in (18) and of W_2 in (19). Because the rank of \tilde{S}_B is K - 1 or less, there are at most K - 1 eigenvectors with nonzero eigenvalues, so p \leq K - 1. We often select p = K - 1 when K is small. The DMM algorithm is given as follows.

Algorithm 1 (DMM)

1. Compute the squared distance matrix M = [d_1, \ldots, d_N], where d_i = [d^2(x_i, x_1), d^2(x_i, x_2), \ldots, d^2(x_i, x_N)]^T, i = 1, \ldots, N, and let \bar{d} = \frac{1}{N} \sum_i d_i.
2. Compute the within-class and between-class scatter matrices \tilde{S}_W and \tilde{S}_B of DJ = [d_1 - \bar{d}, \ldots, d_N - \bar{d}]. Eigendecomposing \tilde{S}_W^{-1} \tilde{S}_B, obtain the p largest eigenvalues \lambda_1, \ldots, \lambda_p and the corresponding eigenvectors v_1, \ldots, v_p.
3. Let W = [v_1, v_2, \ldots, v_p]; the embedding vectors Z are then obtained by

Z = W^T DJ.    (23)

DMM can conveniently embed a new data point x_a online by the formula

z_a = W^T (d_a - \bar{d}).    (24)
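A possible NumPy/SciPy rendering of Algorithm 1 and of the online embedding (24) is sketched below. It reflects our reading of the steps, and the ridge term used to keep the within-class scatter matrix invertible is an assumption rather than part of the algorithm as stated.

```python
import numpy as np
from scipy.linalg import eigh

def dmm_fit(X, y, ridge=1e-8):
    """DMM (Algorithm 1). X is (m, N), y holds the N class labels.
    Returns (W, d_mean) so that a point x_a is embedded as W.T @ (d_a - d_mean)."""
    y = np.asarray(y)
    N = X.shape[1]
    sq = np.sum(X**2, axis=0)
    M = sq[:, None] + sq[None, :] - 2.0 * X.T @ X    # columns d_1, ..., d_N
    d_mean = M.mean(axis=1, keepdims=True)           # d_bar
    DJ = M - d_mean                                  # [d_1 - d_bar, ..., d_N - d_bar]
    # Scatter matrices of the columns of DJ (they live in R^N, not R^m)
    classes = np.unique(y)
    mu = DJ.mean(axis=1, keepdims=True)
    S_B = np.zeros((N, N)); S_W = np.zeros((N, N))
    for c in classes:
        Dc = DJ[:, y == c]
        mu_c = Dc.mean(axis=1, keepdims=True)
        S_B += Dc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
        S_W += (Dc - mu_c) @ (Dc - mu_c).T
    p = len(classes) - 1                             # p <= K - 1
    evals, evecs = eigh(S_B, S_W + ridge * np.eye(N))
    W = evecs[:, np.argsort(evals)[::-1][:p]]
    return W, d_mean                                 # training embedding: W.T @ DJ

def dmm_embed(W, d_mean, X_train, X_new):
    """Out-of-sample embedding, eq. (24): z_a = W^T (d_a - d_bar)."""
    sq_tr = np.sum(X_train**2, axis=0)[:, None]
    sq_new = np.sum(X_new**2, axis=0)[None, :]
    D_new = sq_tr + sq_new - 2.0 * X_train.T @ X_new   # (N, N_new) squared distances
    return W.T @ (D_new - d_mean)
```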

In DMM, the within-class scatter matrix \tilde{S}_W is computed from N points in N-dimensional space, whereas in LDA it is computed from N points in m-dimensional space (N \ll m), so \tilde{S}_W is less likely to be singular in DMM than S_W is in LDA. The cost of DMM is dominated by the O(N^3) eigenvalue problem on a dense N \times N matrix, and the memory requirement is O(N^2). When N is small, DMM is very efficient.

3.2. Landmark DMM (LDMM)

If both the number and the dimensionality of the data are very large, it is worth considering how to make DMM computationally efficient. To make DMM suitable for large sample datasets, we propose a variation of DMM, called landmark DMM (LDMM), based on the idea of the well-known LMDS. de Silva and Tenenbaum (2004) show that if the data really have an exactly or approximately low-dimensional Euclidean structure, LMDS recovers, up to an isometry, the projection of the points onto this subspace, provided that all the data points are contained in the affine span of the landmark points. If the landmarks span a lower-dimensional affine subspace, then LMDS recovers the orthogonal projection of the points onto this subspace.

According to the above discussion, for the training set (1) we can first select n points as landmarks such that the dimensionality of their affine span is not smaller than p. In this affine space, the embedding vectors can be given by an affine linear transformation:

Y = -\frac{1}{2} L D_{nN} J,    (25)

where L is a p \times n matrix, D_{nN} = [d_1, \ldots, d_N] is the matrix of squared distances between the N data points and the set of n landmarks, and J = I - ee^T/N. Letting -\frac{1}{2} L = W_1^T, we have

Y = W_1^T D_{nN} J.    (26)

Applying LDA to Y, we get the discriminating feature set

Z = W_2^T Y = W_2^T W_1^T D_{nN} J.    (27)

Let W^* = W_1 W_2; then

Z = W^{*T} D_{nN} J.    (28)

The optimal projection W^* is chosen as

W^* = \arg\max_{W} \frac{|W^T \tilde{S}_B W|}{|W^T \tilde{S}_W W|},    (29)

where \tilde{S}_B and \tilde{S}_W are, respectively, the between-class and within-class scatter matrices of D_{nN} J = [d_1 - \bar{d}, \ldots, d_N - \bar{d}], d_i \in R^n, with the mean \bar{d} = \frac{1}{n} \sum_{i=1}^{n} d_i. The LDMM algorithm is summarized as follows.

Algorithm 2 (LDMM). Given the training set (1), which is to be embedded in a p-dimensional Euclidean space:

1. Designate a set of n landmark points, where n \geq p + 1.
2. Compute the centered squared distance matrix \hat{M} J = [d_1 - \bar{d}, \ldots, d_N - \bar{d}], where d_i denotes the squared distances between the ith point and the n landmark points.
3. Compute the between-class matrix \tilde{S}_B and within-class matrix \tilde{S}_W of \hat{M} J. Eigendecomposing \tilde{S}_W^{-1} \tilde{S}_B, obtain the p largest eigenvalues \lambda_1, \ldots, \lambda_p and the corresponding eigenvectors v_1, \ldots, v_p.
4. Let W^* = [v_1, v_2, \ldots, v_p]; the embedding vectors Z are then obtained by

Z = W^{*T} \hat{M} J.    (30)

The landmark points may be chosen by any reasonable method. For a p-dimensional embedding, we require at least p + 1 landmarks, and for stability it is better to avoid configurations which lie close to a p- or lower-dimensional affine subspace. We often embed the points into a (K - 1)-dimensional space (K is the number of classes), so n (the number of landmark points) is selected to be greater than K; it is generally advisable to choose rather more landmark points than the strict minimum. Two ways are usually used to select the landmark points: random choice and MaxMin (greedy optimization). It should be pointed out that selecting the landmark points in LDMM differs from LMDS, because LDMM is a supervised method: the landmark points should be selected evenly from each class. In MaxMin, the landmark points are selected as follows:

(1) Given the ratio \rho of landmark points among all data points, the total number of landmark points is n = \rho N. Calculate the number n_c of landmark points in class c as n_c = \rho N_c, where N_c is the number of points in class c, so that n = \sum_{c=1}^{K} n_c.
(2) Calculate the mean point \bar{x}^c of all examples in class c, then select the l_c nearest neighbors of \bar{x}^c as landmark points of class c; denote them by \tilde{x}_i^c (i = 1, 2, \ldots, l_c).
(3) If l_c < n_c, then the (l_c + 1)th landmark point in class c is obtained by

\tilde{x}_{l_c+1}^c = \arg\max_{x_i^c} \min_{\tilde{x}_j^c} d(x_i^c, \tilde{x}_j^c),  i = 1, 2, \ldots, N_c - l_c,  j = 1, 2, \ldots, l_c,    (31)

where \{\tilde{x}_j^c\}_{j=1}^{l_c} are the landmark points already selected in class c and the remaining points in class c are denoted by \{x_i^c\}_{i=1}^{N_c - l_c}.

LDMM is an improvement on DMM for large datasets. The distance matrix of DMM is D_{NN}, as compared with D_{nN} for LDMM; DMM requires O(N^2) storage while LDMM requires O(nN). Since d_i \in R^n, \tilde{S}_B and \tilde{S}_W are both n \times n matrices, so the inversion of \tilde{S}_W and the eigendecomposition involve n \times n matrices in LDMM rather than the N \times N matrices of DMM.
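The sketch below condenses Algorithm 2 together with a simplified reading of the MaxMin selection steps (1)-(3). It assumes NumPy/SciPy, the helper names are ours, and the centering vector is taken as the mean over all N columns (consistent with J = I − ee^T/N in (25)), which is one possible reading of the definition above.

```python
import numpy as np
from scipy.linalg import eigh

def fisher_directions(F, y, p, ridge=1e-8):
    """LDA directions for feature columns F (n, N) with labels y."""
    y = np.asarray(y); n = F.shape[0]
    mu = F.mean(axis=1, keepdims=True)
    S_B = np.zeros((n, n)); S_W = np.zeros((n, n))
    for c in np.unique(y):
        Fc = F[:, y == c]
        mc = Fc.mean(axis=1, keepdims=True)
        S_B += Fc.shape[1] * (mc - mu) @ (mc - mu).T
        S_W += (Fc - mc) @ (Fc - mc).T
    evals, evecs = eigh(S_B, S_W + ridge * np.eye(n))
    return evecs[:, np.argsort(evals)[::-1][:p]]

def maxmin_landmarks(X, y, ratio):
    """Pick roughly ratio*N landmarks, spread evenly over the classes: start from the
    nearest neighbour of the class mean, then greedily add the point farthest from the
    landmarks already chosen (a simplified MaxMin)."""
    y = np.asarray(y); chosen = []
    for c in np.unique(y):
        idx_c = np.flatnonzero(y == c)
        Xc = X[:, idx_c]
        n_c = max(1, int(round(ratio * idx_c.size)))
        d_to_mean = np.linalg.norm(Xc - Xc.mean(axis=1, keepdims=True), axis=0)
        picked = [int(np.argmin(d_to_mean))]
        while len(picked) < n_c:
            d = np.min([np.linalg.norm(Xc - Xc[:, [j]], axis=0) for j in picked], axis=0)
            d[picked] = -np.inf
            picked.append(int(np.argmax(d)))
        chosen.extend(idx_c[picked])
    return np.array(chosen)

def ldmm_fit(X, y, ratio=0.1, ridge=1e-8):
    """LDMM (Algorithm 2): squared distances to landmarks replace full pairwise distances."""
    lm = maxmin_landmarks(X, y, ratio)
    Xl = X[:, lm]                                           # (m, n) landmarks
    sq_l = np.sum(Xl**2, axis=0)[:, None]
    D_nN = sq_l + np.sum(X**2, axis=0)[None, :] - 2.0 * Xl.T @ X   # (n, N)
    d_mean = D_nN.mean(axis=1, keepdims=True)
    F = D_nN - d_mean                                       # \hat{M} J
    p = len(np.unique(y)) - 1
    W = fisher_directions(F, y, p, ridge)
    return W, d_mean, Xl        # embed a new point a as W.T @ (delta_a - d_mean)
```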

3.3. Geodesic–metric discriminant mapping (GDM)

DMM and LDMM are designed for the case where the manifold is linear or almost linear in the observation space, but the underlying manifold is generally non-linear. Inspired by the Isomap method, we use the geodesic distance instead of the Euclidean distance to estimate the intrinsic geometry of the underlying manifold (Tenenbaum et al., 2000) and develop a new non-linear dimensionality reduction method, called geodesic–metric discriminant mapping (GDM). The only difference between GDM and DMM is the dissimilarity measure: we simply replace the vector of squared Euclidean distances d_i in DMM by the vector of squared geodesic distances, and GDM is obtained. The geodesic distances are estimated as in Isomap: for neighboring points, the input-space distance usually provides a good approximation to their geodesic distance; for faraway points, the shortest path connecting them in the neighborhood graph is computed and used as an estimate of the true geodesic distance. Assuming that the data in the high-dimensional observation space are sampled from an unknown p-dimensional manifold, the GDM algorithm is given below.


Algorithm 3 (GDM)

1. Construct a weighted undirected neighborhood graph G on the observed data \{x_i\}_{i=1}^N: connect x_i and x_j by an edge if they are closer than \epsilon (as measured by d(x_i, x_j)), or if x_i (respectively x_j) is one of the k nearest neighbors of x_j (respectively x_i). Set the edge length equal to d(x_i, x_j).
2. Estimate the geodesic distance between all pairs of points on the manifold by computing their shortest-path distance d_G(x_i, x_j) in the graph G, giving the geodesic distance matrix D_G = (d_G(x_i, x_j)).
3. Apply DMM to the shortest-path distance matrix D_G, where the squared distance matrix is D = [d_1, d_2, \ldots, d_N], d_i = [d_G^2(x_i, x_1), d_G^2(x_i, x_2), \ldots, d_G^2(x_i, x_N)]^T, i = 1, \ldots, N, and \bar{d} = \frac{1}{N} \sum_{i=1}^{N} d_i.
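Steps 1-2 of Algorithm 3 can be sketched with SciPy's graph shortest-path routine as follows; the resulting squared geodesic distances are then used wherever DMM uses squared Euclidean distances. The k-nearest-neighbour construction (rather than the ε-ball) and the function name are our own choices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_sq_distances(X, k=10):
    """Approximate squared geodesic distances for the columns of X (m, N):
    k-nearest-neighbour graph weighted by Euclidean distance, then graph
    shortest paths, as in Isomap."""
    N = X.shape[1]
    sq = np.sum(X**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    D = np.sqrt(D2)
    # Keep only each point's k nearest neighbours; the graph is symmetrised below.
    W = np.zeros((N, N))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(N), k)
    W[rows, nn.ravel()] = D[rows, nn.ravel()]
    G = csr_matrix(np.maximum(W, W.T))
    DG = shortest_path(G, method="D", directed=False)
    # Disconnected components yield inf entries; increase k if that happens.
    return DG ** 2   # plug these into the DMM steps in place of squared Euclidean distances
```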

4. Numerical experiments

In this section, two classical pattern classification problems, face recognition and handwritten digit recognition, are considered in order to analyze the performance of our three algorithms. After computing a transformation matrix from the training data, both the training data and the test data are represented in the reduced-dimensional space, where a nearest neighbor classifier is applied to compute the prediction accuracies. For each item in the test set, the classifier finds its nearest neighbor in the training set and predicts the class label of the test item as the class label of that nearest neighbor. In our experiments, the leave-one-out method is used: one image at a time is taken as the test set and the remaining images are used as the training set. Each image serves as a test datum in turn, and the ratio of the number of incorrectly classified cases to the total number of data is reported as the error rate.
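The leave-one-out protocol can be summarized in a few lines. In the sketch below, embed_fit is a hypothetical stand-in for any of the methods compared in this section (for instance a wrapper around the DMM sketch in Section 3.1); it is assumed to return a projection function for data given as columns.

```python
import numpy as np

def loo_error_rate(embed_fit, X, y):
    """Leave-one-out protocol used in the experiments: refit the embedding with one
    sample held out, project everything, and classify the held-out sample with 1-NN.
    `embed_fit(X_train, y_train)` is assumed to return a function mapping (m, k)
    data to the reduced space."""
    y = np.asarray(y); N = X.shape[1]; errors = 0
    for i in range(N):
        mask = np.arange(N) != i
        project = embed_fit(X[:, mask], y[mask])
        Z_train = project(X[:, mask])            # (p, N-1) projected training data
        z_test = project(X[:, [i]])              # (p, 1) projected held-out sample
        d = np.linalg.norm(Z_train - z_test, axis=0)
        errors += int(y[mask][np.argmin(d)] != y[i])
    return errors / N
```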

4.1. Small sample datasets

In order to evaluate the performance of DMM for small sample datasets with high dimensionality, two datasets, the AT&T (formerly ORL) face database and the UMIST face database, are used. The experiments are performed using the leave-one-out strategy: each image is left out from the training set of N images and the projection matrix is computed; then all N images are projected into the reduced-dimensional space and recognition of the left-out image is performed using a nearest neighbor classifier.

The AT&T face database contains 400 images of 40 people. Fig. 1 shows images of a few subjects. The images contain facial contours and vary in pose as well as scale; their original size is 112 × 92 = 10,304 pixels. We test the DMM method against the Eigenface (Turk and Pentland, 1991), Fisherface (Belhumeur et al., 1997), MDS (Torgerson, 1952), LLE (Roweis and Saul, 2000) and Isomap (Tenenbaum et al., 2000) methods. Because of the computational complexity of Fisherface (Belhumeur et al., 1997) and Eigenface (Turk and Pentland, 1991), we have to reduce the images to 23 × 28 = 644 pixels. The parameters, such as the number of principal components in Eigenface, the number of neighbors in LLE and Isomap, and the dimension of the embedded space in MDS, are empirically determined to achieve the lowest error rate for each method.

Fig. 1. Face images in the AT&T database.

Table 1
Results with the AT&T dataset (leave-one-out error).

Method                    Reduced space    Error rate (%)
Eigenface                 40               1.75
Fisherface                39               2.75
MDS                       28               2.75
Isomap, #neighbor = 110   30               2.0
LLE, #neighbor = 40       70               1.75
DMM                       39               0.5

Table 2
Prediction accuracies (%) with the AT&T dataset.

Dataset    RLDA    LDA/GSVD    TO-R(Sw)    TO-R(Sb)    TO-NR(Sw)    DMM
AT&T       98.0    93.5        98.0        99.0        98.8         99.5

Table 4
The error rates of LDMM on digits "0" and "8".

Landmark (%)     100     50      40      30      20      10      2       1
Random choice    0.08    0.16    0.16    0.14    0.24    0.26    1.4     3.8
MaxMin           0.08    0.08    0.08    0.08    0.08    0.08    0.16    1.8

For Fisherface, the points are projected automatically onto a (K - 1)-dimensional space, as in DMM. The experimental results are shown in Table 1.

Besides our DMM algorithm, several generalizations of LDA have been proposed. In (Cheong and Haesun, 2008), the AT&T dataset is used to evaluate the performance of different LDA-based methods, where after the leave-one-out process the ratio of the number of correctly classified cases to the total number of data is reported as the prediction accuracy. Here we quote those results directly in order to compare DMM with them. Table 2 reports the prediction accuracies.

The UMIST face database consists of 575 images of 20 people. Each covers a range of poses from profile to frontal views, and the subjects cover a range of race, sex and appearance. The original images have 112 × 92 = 10,304 pixels. The experiments are performed in the same way as for AT&T. Several images of some subjects are displayed in Fig. 2. The experimental results are shown in Table 3.

Table 5
The error rates of other methods on digits "0" and "8".

Method    Reduced space    Error rate (%)    Reduced space    Lowest error rate (%)
LDA       1                0.8               1                0.8
MDS       1                31                28               0.08
PCA       1                20                20               0.08

To explain why DMM performs better in these experiments, the two datasets are projected onto the first two eigenvectors extracted by DMM in Fig. 3, where different markers and colors refer to different classes. From Fig. 3, it is obvious that the projected samples of different classes are well separated.

The above experiments reveal a number of interesting points:

1. Most methods need to select parameters, while DMM avoids this step by setting the embedding dimension p to K - 1 (K is the number of classes).
2. In the experiments, DMM performs better than the other methods even though their parameters are selected to achieve the lowest error rates.
3. DMM has a formula for embedding a new sample into the low-dimensional space, so it can handle new data points effectively online.

Fig. 2. Face images in UMIST database.

Table 3
Results with the UMIST database (leave-one-out error).

Method                    Reduced space    Error rate (%)
Eigenface                 19               0.87
Fisherface                19               0.87
MDS                       40               0.70
Isomap, #neighbor = 70    19               1.57
LLE, #neighbor = 40       40               1.04
DMM                       19               0.35

Fig. 3. DMM projects the two databases onto the first two eigenvectors: (left) AT&T database, (right) UMIST database.


Fig. 4. Some images of digits: (left) "1", "2" and "7"; (right) "3", "5" and "8".

Table 6
The error rates of methods on digits "1", "2" and "7".

Method                    Reduced space    Error rate (%)    Reduced space    Lowest error rate (%)
Isomap, #neighbor = 10    2                4.00              6                2.67
LLE, #neighbor = 10       2                7.33              6                6.00
GDM, #neighbor = 20       2                2.33              2                2.33

Table 7
The error rates of methods on digits "3", "5" and "8".

Method                    Reduced space    Error rate (%)    Reduced space    Lowest error rate (%)
Isomap, #neighbor = 30    2                11.33             4                9.67
LLE, #neighbor = 20       2                19.67             10               11.67
GDM, #neighbor = 20       2                9.67              2                9.67


4.2. Large datasets

To verify the dimensionality reduction capability of LDMM for large datasets, we consider the handwritten digit database (MNIST dataset). Each digit set in the database consists of about 1100 instances of a given numeral ("0" to "9") scanned as 16 × 16 pixel gray-scale images. The handwritten digits vary in slope, angle and stroke thickness. Here we compare the linear methods and select the digits "0" and "8", which are almost linearly distributed (de Silva and Tenenbaum, 2004); the other digits will be classified in the next subsection. We select the first 500 points of each digit set as the training set and the next 600 points of each digit as the test set, so there is a training set of 1000 points and a test set of 1200 points.

In LDMM, we designate the set of landmark points in two ways: random choice and MaxMin. Random choice means that the landmarks are chosen arbitrarily; MaxMin means that the landmark points are chosen as described in Section 3.2. Table 4 shows the error rates of LDMM for different numbers of landmark points.

The second row of Table 4 (random choice) reports the mean error rates of LDMM over 10 random selections of the landmarks. The results of the other methods are given in Table 5; there, the dimensionality of the reduced space in column 2 is set to the same value as in LDMM, while column 4 gives the dimensionality that achieves the lowest error rate for each method. Looking into the results, we find:

1. When the embedding dimensionality is very low, the error rates of MDS and PCA are poor; only when the dimensionality is increased to more than 20 are reasonable results achieved. The results of LDMM are good even when the dimensionality is very low.
2. LDMM works better using MaxMin than random choice of the landmark points.
3. LDMM still performs well when the number of landmarks decreases to 2%, but the error rate increases drastically when the landmarks decrease to as little as 1%. Generally, the landmarks may be chosen by any reasonable method. For a p-dimensional embedding, we require at least p + 1 landmarks; specifically, the affine span of the landmarks must be p-dimensional. For stability reasons, it is better to avoid configurations which lie close to some (p - 1)- or lower-dimensional affine subspace, and it is generally advisable to choose rather more landmark points than the strict minimum.

4.3. Non-linear manifolds

In the previous subsection we classified the handwritten digits "0" and "8". From Fig. 4 we can see that the digits in the other two groups, "1", "2" and "7", and "3", "5" and "8", are similar in shape; they are the most difficult to recognize (Kouropteva et al., 2003). We therefore use these two groups to assess the performance of the non-linear methods Isomap, LLE and GDM for pattern classification. For each digit, we randomly select 100 points as training points and a non-overlapping set of 100 points as test points, so we have a training set of 300 points and a test set of 300 points in each group. The experimental results are shown in Tables 6 and 7. The parameters, such as the number of neighbors and the reduced space in the Isomap and LLE methods, are empirically determined to achieve the lowest error rate for each method.

Among these methods, GDM consistently outperforms Isomap and LLE from the classification viewpoint. Although the error rate of Isomap approaches that of GDM when the dimension of the embedding space increases, Isomap still has to select appropriate parameters.

Fig. 5. The error rate varies with the parameter k: (left) "1", "2" and "7"; (right) "3", "5" and "8".


Considering k from 10 to 50 in our experiments, we find that the error rate of GDM is insensitive to the number of neighbors, while the other methods do not have this property; Fig. 5 shows how the error rate varies with the number of neighbors. This behaviour is worth investigating further in the future.


5. Conclusion

In this paper, we address dimensionality reduction for high-dimensional datasets and develop three supervised dimensionality reduction methods, DMM, LDMM and GDM, which are suitable for small sample datasets, large datasets and datasets with non-linear manifolds, respectively. Our methods have the following features:

1. Our methods have no parameters that need to be selected. The dimension of the embedding space can be automatically set to K - 1 (K is the number of classes). The error rate of LDMM is insensitive to the number of landmark points when MaxMin selection is used, and the error rate of GDM is insensitive to the number of neighbors, so these two parameters are easy to select.
2. In the experiments, our methods perform better than the other methods even when their parameters are determined by achieving the lowest error rates.
3. Our methods have formulas for embedding a new sample into the low-dimensional space, so they handle new data points effectively online.

Our future work will focus on further improvements for large and non-linear datasets.

Acknowledgements

The authors would like to thank the editor Robert P.W. Duin and the anonymous reviewers for their valuable suggestions. This work is supported by the Key Project of the National Natural Science Foundation of China (No. 10631070), the National Natural Science Foundation of China (No. 10801112), the China Postdoctoral Science Foundation funded project (No. 20080432573), and the Ph.D. Graduate Start Research Foundation of Xinjiang University funded project (No. BS080101).

References

Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J., 1997. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 711–720.
Belkin, M., Niyogi, P., 2002. Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, vol. 14. MIT Press, Cambridge, MA.
Bellman, R., 1961. Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton, New Jersey.
Cheong, H.P., Haesun, P., 2008. A comparison of generalized linear discriminant analysis algorithms. Pattern Recognition 41, 1083–1097.
de Silva, V., Tenenbaum, J.B., 2004. Sparse multidimensional scaling using landmark points. Technical Report, Stanford University.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, 179–188.
Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition, second ed. Academic Press.
Kouropteva, O., Okun, O., Pietikäinen, M., 2003. Classification of handwritten digits using supervised locally linear embedding algorithm and support vector machine. In: Proc. 11th European Symposium on Artificial Neural Networks (ESANN 2003), April 23–25, Bruges, Belgium, pp. 229–234.
Richard, O.D., Peter, E.H., David, G.S., 2001. Pattern Classification, second ed. Wiley-Interscience, New York.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), 2323–2326.
Scholkopf, B., Smola, A., Muller, K., 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10 (5).
Swets, D., Weng, J., 1996. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Anal. Machine Intell. 18 (8), 831–836.
Tenenbaum, J.B., de Silva, V., Langford, J.C., 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319–2323.
Torgerson, W.S., 1952. Multidimensional scaling: Theory and method. Psychometrika 17 (4), 401–419.
Turk, M., Pentland, A., 1991. Eigenfaces for recognition. J. Cognitive Neurosci. 3 (1), 71–86.
Zha, H.Y., Zhang, Z.Y., 2003. Isometric embedding and continuum ISOMAP. In: Int. Conf. on Machine Learning.