

Semi-supervised double sparse graphs based discriminant analysis for dimensionality reduction

Puhua Chen a,⁎, Licheng Jiao a, Fang Liu b, Jiaqi Zhao a, Zhiqiang Zhao a, Shuai Liu a

a The Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an, Shaanxi Province 710071, China
b The School of Computer Science and Technology, and the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an, Shaanxi Province 710071, China

Article info

Article history: Received 15 July 2015; Received in revised form 10 August 2016; Accepted 11 August 2016; Available online 13 August 2016

Abstract

Discriminant analysis (DA) is a well-known dimensionality reduction tool in pattern classification. With sufficient labeled samples, the optimal projections can be found by maximizing the between-class scatter while minimizing the within-class scatter. However, the acquisition of label information is difficult in practice. Hence, semi-supervised discriminant analysis, in which a few labeled samples and many unlabeled samples are utilized during the learning process, has attracted much attention in recent years. A sparse graph learned by sparse representation captures local structure information about the data and is widely employed in dimensionality reduction. In this paper, semi-supervised double sparse graphs (sDSG) based dimensionality reduction is proposed, which considers both the positive and negative structural relationships of data points by using double sparse graphs. Aiming to explore the discriminant information among unlabeled samples, a joint k-nearest-neighbor selection strategy is proposed to select pseudo-labeled samples that carry fairly precise discriminant information. In the following procedures, the data subset consisting of labeled and pseudo-labeled samples is used instead of the original data. Based on two different criteria, two sDSG based discriminant analysis methods are designed, denoted sDSG-dDA (distance-based DA) and sDSG-rDA (reconstruction-based DA), which also use different strategies to reduce the effect of pseudo-label inaccuracy. Finally, experimental results on both UCI datasets and hyperspectral images validate the effectiveness and advantages of the proposed methods compared with several classical dimensionality reduction methods. © 2016 Elsevier Ltd. All rights reserved.

Keywords: Semi-supervised learning; Discriminant analysis; Dimensionality reduction; Sparse graph; Graph embedding

1. Introduction

Dimensionality reduction (DR) is one of the most active topics in the field of information processing, and numerous valuable research results have been published in this field. With the development of information acquisition techniques, the curse of dimensionality [1] becomes more serious. In model-based analysis, the stability of models may be affected by interference brought by the redundancy of data [2]. Besides, the high time consumption and storage requirements of high-dimensional data also cannot be ignored in applications. Dimensionality reduction is a direct tool for dealing with these problems, aiming to find a low-dimensional space with less information loss and enhanced discriminant power. Linear projection (LP) is a common type of DR method, which aims

⁎ Corresponding author.
E-mail addresses: [email protected] (P. Chen), [email protected] (L. Jiao), [email protected] (F. Liu), [email protected] (J. Zhao), [email protected] (Z. Zhao), [email protected] (S. Liu).
http://dx.doi.org/10.1016/j.patcog.2016.08.010
0031-3203/© 2016 Elsevier Ltd. All rights reserved.

to find a projection matrix mapping from the high-dimensional space to the low-dimensional space under some criterion. The most typical LP methods are principal component analysis (PCA) [3–5] and linear discriminant analysis (LDA) [6,7]. However, because of the linearity assumption, LP runs into trouble when handling nonlinear data. The kernel strategy can be used to extend LP to nonlinear data: nonlinear data samples are projected into a linear space by a kernel function, and LP methods can then be employed in this new space. Kernel PCA [8–10] and kernel LDA [11,12] are typical kernel based LP methods. Manifold based methods also achieve good performance on nonlinear data because they consider the local manifold structure of the data. By preserving the local structure learned in the original space, the low-dimensional embedding can be made more suitable for classification or recognition. There are many manifold based methods, such as local linear embedding (LLE) [13], isometric feature mapping (ISOMAP) [14], the Laplacian eigenmap [15] and local tangent space alignment (LTSA) [16]. Because the low-dimensional embedding is learned directly, without any mapping, manifold based methods usually meet the


P. Chen et al. / Pattern Recognition 61 (2017) 361–378

out-of-sample problem. To handle this problem, manifold based LP methods have been proposed, which assume there exists a mapping function from the original high-dimensional space to the low-dimensional space. For instance, locality preserving projection (LPP) [17,18] is the linear extension of the Laplacian eigenmap, and neighborhood preserving projection (NPP) [19] corresponds to LLE. The sparse graph (or L1-graph) is a newer descriptor of the data's intrinsic structure, constructed by sparse representation, and it has received much attention in recent years. Because of the success of the sparse representation classifier (SRC) [20–23], it is reasonable to enhance the discriminant power of the low-dimensional embedding by preserving the sparse representation relationships among samples. In addition, compared with the common k-nearest-neighbor graph, the sparse graph has other advantages such as robustness to data noise, automatic sparsity and adaptive neighborhoods [24]. The sparse graph has also been applied to the DR problem to learn the low-dimensional embedding or the projection matrix [24,25]. It can be utilized in two different ways: in one, the sparse graph is treated as a similarity relationship between samples, like the k-nearest-neighbor graph [26]; in the other, it is considered a representation relationship, as in sparsity preserving projection (SPP) [25]. Moreover, label information can also be included in the sparse graph. In [27], sparse graph based discriminant analysis (SGDA) is proposed, which constructs the within-class and between-class scatter matrices from the learned sparse graph. When just a small number of labeled samples are available, linear discriminant analysis usually suffers from a serious overfitting problem [28]. Regularized LDA [29,30] can alleviate this problem without increasing the number of labeled samples.
Furthermore, semi-supervised learning methods have been studied to deal with the problems caused by the limited number of labeled samples [31–36]. Besides label information, pairwise constraints (must-link and cannot-link constraints) have also been used in semi-supervised learning as prior knowledge, for example in semi-supervised dimensionality reduction (SSDR) [37], marginal semi-supervised sub-manifold projections (MS3MP) [38] and constrained large margin local projection (CMLP) [39]. Commonly, most semi-supervised methods utilize unlabeled samples to model the geometric structure and incorporate it into classic supervised algorithms as a regularization term. Nevertheless, the potential discriminant information of the unlabeled samples is not considered. To deal with this problem, another type of method has been proposed [40–43], which first investigates the class relationship of the unlabeled samples and then uses the pseudo-labels of unlabeled samples together with the true labels of labeled samples to build the objective function. In [44,42], the class membership probability is estimated and inserted into the objective function of discriminant analysis. In [41,43], soft labels of unlabeled samples are computed by label propagation and utilized in the computation of the between-class and within-class scatter matrices. In both cases, all unlabeled samples are used for training. Although labeled samples are difficult to obtain, unlabeled samples are very abundant in real applications, so it is possible to select the most valuable unlabeled samples instead of using the whole dataset. In this paper, semi-supervised double sparse graphs based dimensionality reduction is proposed. Different from the methods mentioned above, only the part of the unlabeled samples selected by the joint k-nearest-neighbor selection strategy is used in the subsequent learning process. These selected unlabeled samples are called pseudo-labeled samples, and pseudo-labels are assigned to them during the selection process. After pseudo-labeled sample selection, the labeled and pseudo-labeled samples are employed to model the geometric structure of the data while considering the discriminant information among samples. Then, double sparse graphs are built, of which the positive sparse graph contains the intra-class

geometric structure of the samples and the negative sparse graph reflects the inter-class geometric structure. Based on different discriminant criteria, two semi-supervised double sparse graphs based discriminant analysis algorithms are designed, denoted sDSG-dDA and sDSG-rDA. Although pseudo-labeled samples are treated as real labeled samples, their difference from truly labeled samples is also considered during projection learning. Both sDSG-dDA and sDSG-rDA can be transformed into generalized eigenvalue problems. The classification results in the learned low-dimensional space are used to evaluate the performance of the dimensionality reduction methods. The experimental results on UCI datasets and hyperspectral images validate the effectiveness and advantages of the proposed methods. The remainder of this paper is organized as follows: related works are introduced in Section 2; Section 3 describes the integrated procedure of sDSG-dDA and sDSG-rDA; Section 4 shows the classification results on UCI datasets and hyperspectral images; a conclusion is given in Section 5.

2. Related works

2.1. Formulation of the dimensionality reduction problem

The dimensionality reduction problem can be formulated as follows. The high-dimensional data are denoted by X = [x_1, x_2, ..., x_N] ∈ R^{P×N}, where N is the number of samples and P is the number of features. The objective of dimensionality reduction is to find a projection matrix W ∈ R^{P×K} (K ≪ P) mapping from the P-dimensional space to the K-dimensional space. The low-dimensional representation of X is written as X_low = W^T X ∈ R^{K×N}. In supervised methods, all samples used for learning are labeled. The label vector is denoted by y^label = [y_1^label, y_2^label, ..., y_N^label] ∈ R^{1×N}, with y_i^label = c (c = 1, 2, ..., C), where C is the number of classes. In semi-supervised methods, the data X contain two parts: labeled samples X^label = {x_i}_{i=1}^{N^label} with label vector y^label = [y_1^label, ..., y_{N^label}^label], and unlabeled samples X^ulabel = {x_i}_{i=N^label+1}^{N}, where N^label is the number of labeled samples.

2.2. Sparse graph based discriminant analysis (SGDA)

Algorithm 1. Sparse graph construction.
Require: training samples X = [x_1, x_2, ..., x_N].
Ensure: weight matrix of the sparse graph Q ∈ R^{N×N}.
1: normalize all samples to the same L2 norm.
2: for x_i, let Dict = [x_1, ..., x_{i−1}, x_{i+1}, ..., x_N]; θ_i with N − 1 elements is obtained by solving Eq. (1). Then rearrange θ_i into θ̃_i = [θ_i^1, ..., θ_i^{i−1}, 0, θ_i^i, ..., θ_i^{N−1}] ∈ R^N:

min_{θ_i} || x_i − Dict θ_i ||_2^2  s.t. || θ_i ||_1 < s    (1)

3: let Q_{ij} = θ̃_i^j for j = 1, ..., N, where θ̃_i^j is the j-th element of θ̃_i.
4: repeat steps 2 and 3 until all samples are processed.
5: return Q.
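Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it replaces the constrained problem of Eq. (1) with the equivalent penalized (lasso) form and solves it by plain iterative soft-thresholding; the function names and the regularization weight `lam` are our assumptions.

```python
import numpy as np

def ista_lasso(D, x, lam=0.1, n_iter=500):
    """Solve min_theta ||x - D theta||_2^2 + lam*||theta||_1 by iterative
    soft-thresholding (the penalized equivalent of Eq. (1))."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
    theta = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ theta - x)
        z = theta - grad / L
        theta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return theta

def sparse_graph(X, lam=0.1):
    """Algorithm 1: code each sample over all remaining samples; return
    the N x N weight matrix Q with a zero diagonal."""
    P, N = X.shape
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)  # step 1: normalize
    Q = np.zeros((N, N))
    for i in range(N):
        idx = [j for j in range(N) if j != i]        # dictionary excludes x_i
        Q[i, idx] = ista_lasso(Xn[:, idx], Xn[:, i], lam)  # re-insert the 0 at i
    return Q
```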

In [27], a discriminant analysis method based on sparse graph is proposed, which is an extension of sparsity preserving projection (SPP) [25]. SPP is an unsupervised method aiming to ﬁnd a


low-dimensional space where the sparse reconstruction relations learned in the original space can be preserved. SGDA utilizes label information to enhance the discriminant power of the sparse graph. Here, the sparse graph is denoted as G(X, Q), where each vertex corresponds to a sample point in X and Q is the weight matrix produced by sparse representation. The procedure of the sparse graph's construction is given in Algorithm 1. By dividing the integrated graph into two parts, the objective function of SGDA can be defined as follows:

min_W ( W^T S_w W ) / ( W^T S_b W )    (2)

where

S_w = Σ_{c=1}^{C} Σ_{i=1}^{N} S_c(x_i) · || W^T x_i − Σ_{m=1}^{N} S_c(x_m) Q_{i,m} W^T x_m ||^2
    = Tr( W^T X (I_{N×N} − Q′)^T (I_{N×N} − Q′) X^T W )    (3)

S_b = Σ_{c=1}^{C} Σ_{i=1}^{N} S_c(x_i) · || W^T x_i − Σ_{m=1}^{N} [1 − S_c(x_m)] Q_{i,m} W^T x_m ||^2
    = Tr( W^T X (I_{N×N} − Q″)^T (I_{N×N} − Q″) X^T W )    (4)

Here S_c(x_m) is a class-selector function, and Q′ and Q″ are two matrices derived from Q:

S_c(x_m) = 1 if y_m = c, and 0 otherwise    (5)

Q′_{i,m} = Q_{i,m} if y_i = y_m, and 0 otherwise    (6)

Q″_{i,m} = Q_{i,m} if y_i ≠ y_m, and 0 otherwise    (7)

The weight matrix Q is divided into two portions, Q′ and Q″, according to the samples' labels, satisfying Q = Q′ + Q″. Q′ indicates the intra-class relations of samples and Q″ contains the inter-class relations. An example is shown in Fig. 1. If a sample is represented only by samples with the same label, or only by samples with different labels, one of the two portions (Q′ or Q″) will be zero for this sample. Fig. 1(d) and (e) show the distributions of the number of nonzero atoms in each row of Q′ and Q″. It can be observed that many rows of Q′ and Q″ are zero. If a sample is reconstructed by samples from different classes, its relationship with the samples sharing its label is abandoned. Furthermore, experimental results show that block-SGDA, in which samples are represented only by samples with the same label, obtains better results than SGDA. In our semi-supervised discriminant analysis method, two sparse graphs are constructed to reflect the intra-class and inter-class relationships of samples respectively, which avoids ignoring any useful information.
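The label-based splitting of Q into Q′ and Q″ in Eqs. (6) and (7) is a simple masking operation. A NumPy sketch (the function name is ours):

```python
import numpy as np

def split_graph(Q, y):
    """Split the sparse-graph weight matrix Q into the intra-class part Q1
    (Eq. (6)) and the inter-class part Q2 (Eq. (7)), so that Q = Q1 + Q2."""
    y = np.asarray(y)
    same = (y[:, None] == y[None, :])   # same[i, m] is True iff y_i == y_m
    Q1 = np.where(same, Q, 0.0)
    Q2 = np.where(~same, Q, 0.0)
    return Q1, Q2
```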

Fig. 1. An example of the weight matrix used in SGDA. (a) is the matrix Q, (b) is the matrix Q ′, (c) is the matrix Q ″, (d) is the distribution of the number of each row's nonzero atoms in Q ′, the abscissa is the number of nonzero atoms, the ordinate is the number of rows, (e) is the distribution of the number of each row's nonzero atoms in Q ″.


3. Semi-supervised double sparse graphs based discriminant analysis

In this section, semi-supervised double sparse graphs based discriminant analysis is introduced. The content is divided into three parts. The first part discusses how to select reliable pseudo-labeled samples from the unlabeled samples. The second part describes the construction of the semi-supervised double sparse graphs. The whole procedures of sDSG-dDA and sDSG-rDA are given in the third part.

3.1. Pseudo-labeled sample selection based on the joint k-nearest-neighbor selection strategy

Generally, the distribution of the data in each class is hard to estimate from only a small set of labeled samples. Hence, a soft assignment measured by class reconstruction error [44,42] is not very precise, and the effectiveness of label propagation may also suffer. Supposing abundant unlabeled samples are available besides the few labeled ones, it becomes feasible to select some unlabeled samples as pseudo-labeled samples by considering the relationship between labeled and unlabeled samples. This selection reduces the number of samples used in the following process and explores the discriminant information hidden in the unlabeled samples. Although pseudo-labels are not real labels, their positive influence should not be ignored because of their relationship with the labeled samples. The k-nearest-neighbor method has been successfully applied in many applications such as classification [45] and graph construction [15,17]. Because of its simplicity and effectiveness, its idea is employed here to select pseudo-labeled samples from the unlabeled samples. Unlike the k-NN classifier, which focuses on the neighbors of each unlabeled sample, the selection of pseudo-labeled samples investigates the neighbors of the labeled samples to pick some unlabeled samples out as pseudo-labeled samples.

The selected pseudo-labeled samples are expected to be close to the center of the corresponding class and far from those of the other classes. Aiming to enhance the accuracy of the pseudo-labels, a new selection strategy called the joint k-nearest-neighbor selection strategy is proposed, which considers not only the neighborhood of each labeled sample but also the intersections of neighborhoods of different classes. The selection is not carried out on each labeled sample separately, but on all labeled samples jointly. A simple diagram of the joint k-nearest-neighbor selection strategy is given in Fig. 2. In Fig. 2(a), red and green points indicate labeled samples, and the color represents the class. Unlabeled samples are represented by black points. For each labeled sample, the k nearest neighbors are found by searching the whole unlabeled sample set; they are enclosed in the black circles. Here, the

Fig. 3. A simple sketch about the construction of double sparse graphs.

parameter k is set to 5. Next, pseudo-labeled samples are selected from those nearest neighbors and assigned appropriate pseudo-labels. The final result of the selection is shown in Fig. 2(b). The class of a neighborhood region is decided by the label of its center point. Samples in a single neighborhood region are selected as pseudo-labeled samples and assigned the same label as the central labeled sample. Besides single neighborhood regions, there are intersection regions where two or more regions overlap. If a region is the intersection of neighborhood regions of the same class, the unlabeled samples in it are selected as pseudo-labeled samples and assigned the centers' common label. Conversely, if a region is the intersection of neighborhood regions of different classes, the unlabeled samples in it are discarded. By considering all neighbors of the labeled samples jointly, samples close to the class centers and far from the boundaries between classes are selected as pseudo-labeled samples.

In the proposed method, the k nearest neighbors of each labeled sample are selected from the unlabeled samples and denoted by X^knn ⊂ X^ulabel. Then, examining each sample in X^knn with the joint k-nearest-neighbor selection strategy, samples lying in neighborhood regions of different classes are discarded and the rest are assigned appropriate pseudo-labels. These pseudo-labeled samples are denoted by X^pseudo ∈ R^{P×N^pseudo} (X^pseudo ⊂ X^ulabel), and their pseudo-labels by y^pseudo ∈ R^{1×N^pseudo}. After this selection process, X is reduced to X^new = [X^label, X^pseudo] ∈ R^{P×N^new} with the new label vector y^new = [y^label, y^pseudo] ∈ R^{1×N^new}, where N^new = N^label + N^pseudo. To distinguish labeled samples from pseudo-labeled samples, an indication vector ID ∈ R^{N^new} is defined, where ID_i = 1 when x_i^new is a labeled sample and ID_i = 0 otherwise. X^new is used in the following processing instead of X. However, it cannot be guaranteed that the pseudo-labels of pseudo-labeled samples equal their true labels, so the inaccuracy of pseudo-labels must be considered in the following procedures. The size of X^pseudo is related to the parameter k, which can be chosen according to the data's dimension and number of classes. Commonly, a suitable k should make the rank of X^new equal to the data's dimension.

3.2. Semi-supervised double sparse graphs construction

Fig. 2. Diagram of joint k-nearest-neighbor selection. Red points and green points are labeled samples, black points are unlabeled samples. (For interpretation of the references to color in this ﬁgure legend, the reader is referred to the web version of this article.)
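The joint selection illustrated in Fig. 2 can be sketched as follows. This is our illustrative reading of Section 3.1, not the authors' implementation: the function name, the column-wise sample layout, and the handling of intersections via claimed-class sets are assumptions.

```python
import numpy as np

def joint_knn_select(X_lab, y_lab, X_unlab, k=5):
    """Joint k-NN selection: each labeled sample claims its k nearest
    unlabeled samples; unlabeled samples claimed by a single class get
    that pseudo-label, while those lying in the intersection of
    different-class neighborhoods are discarded."""
    claims = {}  # unlabeled-sample index -> set of claiming classes
    for xi, yi in zip(X_lab.T, y_lab):
        d = np.linalg.norm(X_unlab - xi[:, None], axis=0)  # distances to x_i
        for j in np.argsort(d)[:k]:
            claims.setdefault(j, set()).add(yi)
    keep = [(j, cls.pop()) for j, cls in claims.items() if len(cls) == 1]
    idx = np.array([j for j, _ in keep], dtype=int)
    y_pseudo = np.array([c for _, c in keep])
    return X_unlab[:, idx], y_pseudo, idx
```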

After the pseudo-labeled sample selection, an adequate set of labeled samples is obtained. Aiming to investigate the discriminant information brought by the labels, two kinds of relations are considered here: the positive relation, which should be preserved in the low-dimensional space, and the negative relation, which should be removed. Here, the relations of each


Fig. 4. An example of the weight matrices of the double sparse graphs. (a) is the weight matrix of the positive graph, (b) is the weight matrix of the negative graph, (c) is the distribution of the number of nonzero atoms in each row of the positive graph, (d) is the corresponding distribution for the negative graph.

sample with the samples in the same class are regarded as positive relations, and the relations with samples in different classes are regarded as negative relations. Double sparse graphs, comprising a positive sparse graph and a negative sparse graph, are introduced to describe these two kinds of relations. For simplicity, the samples in the same class are called the positive samples of the center sample, and the samples in different classes are called its negative samples. For the i-th sample x_i^new in X^new, its positive samples constitute the positive dictionary D_i^pos and its negative samples constitute the negative dictionary D_i^neg. The sparse representation problem in Eq. (1) can be transformed into the lasso problem [46] by integrating the sparsity constraint into the objective function as a regularization term. The sparse representation coefficients of x_i^new on both D_i^pos and D_i^neg can be obtained by solving the lasso problems in Eqs. (8) and (9), respectively:

min_{θ_i^pos} || x_i^new − D_i^pos θ_i^pos ||_2^2 + λ || θ_i^pos ||_1    (8)

min_{θ_i^neg} || x_i^new − D_i^neg θ_i^neg ||_2^2 + λ || θ_i^neg ||_1    (9)

In the above equations, x_i^new is expected to be reconstructed by the samples in the dictionaries. Because of the sparsity regularization term in the objective function, only a few atoms of θ_i^pos and θ_i^neg are nonzero, which means that only a few samples are connected with x_i^new in the graphs. Due to this sparsity, the influence of pseudo-label inaccuracy can also be limited. Next, θ_i^pos and θ_i^neg are rearranged according to the samples' coordinates in X^new, with zeros inserted into the blank positions. The rearranged coefficients are denoted by θ̃_i^pos and θ̃_i^neg, both of length N^new. After the sparse representation coefficients {θ̃_i^pos, θ̃_i^neg}_{i=1}^{N^new} are obtained, the weight matrices of G^pos and G^neg are formed as Q^pos = [θ̃_1^pos, θ̃_2^pos, ..., θ̃_{N^new}^pos]^T and Q^neg = [θ̃_1^neg, θ̃_2^neg, ..., θ̃_{N^new}^neg]^T, of which each row represents the relationship of one sample with the other samples. The integral process of double sparse graph construction is shown in Fig. 3. If all samples are arranged in label order, the weight matrix of the positive sparse graph has a conspicuous block structure. An example of the weight matrices of the double sparse graphs is given in Fig. 4, where the gray value represents the value of the weight matrix. Fig. 4(a) shows the weight matrix of graph G^pos; it appears block diagonal because samples are represented only by their positive samples. Fig. 4(b) shows the weight matrix of graph G^neg; nonzero atoms appear outside the diagonal regions, where negative samples lie. Fig. 4(c) and (d) give the statistics of the nonzero atoms in each row of Q^pos and Q^neg. Different from the single sparse graph shown in Fig. 1, the number of all-zero rows is zero, which means that no information about any sample is discarded in the graphs.

3.3. Semi-supervised double sparse graphs based discriminant analysis

In this part, two semi-supervised double sparse graphs based discriminant analysis methods are introduced. The first is the distance based discriminant analysis (sDSG-dDA) and the second is the reconstruction based discriminant analysis (sDSG-rDA).

3.3.1. sDSG-dDA

Double sparse graphs contain not only positive discriminant information but also negative discriminant information. In common graph embedding, the local neighborhood relationship based on Euclidean distance is expected to be preserved in the low-dimensional space. Analogously, the structure information contained in a sparse graph can also be treated as the similarity between


neighbors, which should be preserved by minimizing the weighted Euclidean distances between samples in the low-dimensional space. Based on this idea, sDSG-dDA is proposed, where the double sparse graphs are treated as similarity matrices of the data. The positive graph G^pos reflects the similarity of each sample with its positive samples, which is expected to be preserved in the low-dimensional space. The intra-class distance Dist_w in the low-dimensional space is constructed as the weighted Euclidean distance with weight matrix Q^pos, given in Eq. (10). Aiming to reduce the influence of pseudo-label inaccuracy, another weight matrix M^pos is involved in Eq. (10), which indicates the probability that connected sample pairs belong to the same class:

Dist_w = Σ_{i,j=1}^{N^new} || W^T x_i^new − W^T x_j^new ||_2^2 Q^pos_{i,j} M^pos_{i,j}
       = Σ_{i,j=1}^{N^new} || W^T x_i^new − W^T x_j^new ||_2^2 T^pos_{i,j}
       = Tr( W^T X^new ( Σ_{i,j=1}^{N^new} (e_i − e_j)(e_i − e_j)^T T^pos_{i,j} ) X^new^T W )
       = Tr( W^T X^new ( D^pos_col + D^pos_row − (T^pos)^T − T^pos ) X^new^T W )    (10)

where T^pos = Q^pos .* M^pos (elementwise product), (D^pos_col)_{i,i} = Σ_{j=1}^{N^new} T^pos_{i,j} and (D^pos_row)_{j,j} = Σ_{i=1}^{N^new} T^pos_{i,j}. For each sample, the samples connected to it by sparse representation coefficients, together with the sample itself, build a subclass set. Instead of using the samples directly to compute the probability of a sample pair's class consistency, the means of the subclass sets are utilized: when either of x_i^new and x_j^new is a pseudo-labeled sample, M^pos_{i,j} is computed from the distance between the means of the subclass sets around x_i^new and x_j^new. The definition of M^pos is given by:

M^pos_{i,j} = 0, if Q^pos_{i,j} = 0;
M^pos_{i,j} = 1, if ID_i = ID_j = 1;
M^pos_{i,j} = exp( − || mean(X^new(Q^pos_{i,:} ≠ 0), x_i^new) − mean(X^new(Q^pos_{j,:} ≠ 0), x_j^new) ||_2^2 / σ_1^2 ), if ID_i = 0 or ID_j = 0    (11)

The negative graph G^neg contains the similarity of samples with their negative samples. In the weight matrix Q^neg of graph G^neg, nonzero atoms indicate that the corresponding samples with different labels are similar. Each sample is expected to move as far away from its negative samples as possible, especially those with high similarity. The inter-class distance Dist_b is defined with the weight matrix Q^neg; its formulation is similar to that of Dist_w and is described in Eq. (12). Similar to M^pos, M^neg is defined to reduce the influence of pseudo-label inaccuracy in Dist_b:

Dist_b = Σ_{i,j=1}^{N^new} || W^T x_i^new − W^T x_j^new ||_2^2 Q^neg_{i,j} M^neg_{i,j}
       = Σ_{i,j=1}^{N^new} || W^T x_i^new − W^T x_j^new ||_2^2 T^neg_{i,j}
       = Tr( W^T X^new ( Σ_{i,j=1}^{N^new} (e_i − e_j)(e_i − e_j)^T T^neg_{i,j} ) X^new^T W )
       = Tr( W^T X^new ( D^neg_col + D^neg_row − (T^neg)^T − T^neg ) X^new^T W )    (12)

where T^neg = Q^neg .* M^neg, (D^neg_col)_{i,i} = Σ_{j=1}^{N^new} T^neg_{i,j} and (D^neg_row)_{j,j} = Σ_{i=1}^{N^new} T^neg_{i,j}. The definition of M^neg is given by:

M^neg_{i,j} = 0, if Q^neg_{i,j} = 0;
M^neg_{i,j} = 1, if ID_i = ID_j = 1;
M^neg_{i,j} = 1 − exp( − || mean(X^new(Q^neg_{i,:} ≠ 0), x_i^new) − mean(X^new(Q^neg_{j,:} ≠ 0), x_j^new) ||_2^2 / σ_2^2 ), if ID_i = 0 or ID_j = 0    (13)

Different from M^pos, the value of M^neg indicates the probability that a sample pair belongs to different classes, which is inversely proportional to the distance between the means of the subclass sets. In our opinion, the similarity relationship in the negative graph G^neg should be penalized during learning and eliminated in the low-dimensional space, so the inter-class distance Dist_b in Eq. (12) should be maximized to reduce the negative similarity. Optimizing Eqs. (10) and (12) together, the objective function of sDSG-dDA is formulated as Eq. (14). By minimizing the ratio of Dist_w to Dist_b, the optimal projection matrix can be obtained. In the low-dimensional space, similar samples with the same label stay close to each other as in the original space, and similar samples with different labels are expected to be far away:

min_W  Dist_w / Dist_b
  = [ Σ_{i,j=1}^{N^new} || W^T x_i^new − W^T x_j^new ||_2^2 T^pos_{i,j} ] / [ Σ_{i,j=1}^{N^new} || W^T x_i^new − W^T x_j^new ||_2^2 T^neg_{i,j} ]
  = Tr( W^T X^new L^pos_(1) X^new^T W ) / Tr( W^T X^new L^neg_(1) X^new^T W )    (14)

where

L^pos_(1) = D^pos_col + D^pos_row − T^pos − (T^pos)^T,  L^neg_(1) = D^neg_col + D^neg_row − T^neg − (T^neg)^T    (15)

The optimization problem in Eq. (14) can be solved as a generalized eigenvalue problem. After the generalized eigenvalue decomposition of the pair (X^new L^pos_(1) X^new^T, X^new L^neg_(1) X^new^T), the K eigenvectors corresponding to the K smallest eigenvalues are selected to construct the projection matrix W = [p_1, p_2, ..., p_K], where p_i is the eigenvector corresponding to eigenvalue s_i and s_1 < s_2 < ... < s_K. The integral algorithm of sDSG-dDA is given in Algorithm 2.
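Eq. (15) and the generalized eigen-decomposition reduce to building two Laplacian-like matrices and solving one matrix problem. A minimal NumPy sketch, with our function names: the small ridge `reg` is our addition to keep the right-hand matrix invertible, and `scipy.linalg.eigh(A, B)` would be a more standard solver for the symmetric case.

```python
import numpy as np

def laplacian_like(T):
    """D_col + D_row - T - T^T from Eq. (15); T is the elementwise
    product Q .* M."""
    D_col = np.diag(T.sum(axis=1))   # row sums on the diagonal
    D_row = np.diag(T.sum(axis=0))   # column sums on the diagonal
    return D_col + D_row - T - T.T

def sdsg_dda_projection(X_new, T_pos, T_neg, K, reg=1e-6):
    """Solve X L_pos X^T p = s X L_neg X^T p and keep the K eigenvectors
    with the smallest eigenvalues, per Eq. (14)."""
    A = X_new @ laplacian_like(T_pos) @ X_new.T
    B = X_new @ laplacian_like(T_neg) @ X_new.T + reg * np.eye(X_new.shape[0])
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))  # generalized problem via B^-1 A
    order = np.argsort(vals.real)
    return vecs[:, order[:K]].real   # W = [p_1, ..., p_K]
```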

Algorithm 2. sDSG-dDA.
Require: labeled samples and label vector: X^label, y^label; unlabeled samples: X^ulabel; neighborhood parameter: k; dimension of the projection matrix: K.
Ensure: the projection matrix W ∈ R^{P×K}.
1: select pseudo-labeled samples X^pseudo from the unlabeled samples using the joint k-nearest-neighbor selection strategy.
2: construct the new data: X^new = [X^label, X^pseudo], y^new = [y^label, y^pseudo].
3: compute sparse representation coefficients. For each sample in X^new, solve the following two problems:
   subproblem (1): min_{θ_i^pos} || x_i^new − D_i^pos θ_i^pos ||_2^2 + λ || θ_i^pos ||_1;
   subproblem (2): min_{θ_i^neg} || x_i^new − D_i^neg θ_i^neg ||_2^2 + λ || θ_i^neg ||_1.
4: rearrange the sparse representation coefficients {θ_i^pos, θ_i^neg}_{i=1}^{N^new} into {θ̃_i^pos, θ̃_i^neg}_{i=1}^{N^new} of length N^new.
5: construct the double sparse graphs: Q^pos = [θ̃_1^pos, θ̃_2^pos, ..., θ̃_{N^new}^pos]^T and Q^neg = [θ̃_1^neg, θ̃_2^neg, ..., θ̃_{N^new}^neg]^T.
6: compute M^pos and M^neg using Eqs. (11) and (13).
7: compute T^pos = Q^pos .* M^pos and T^neg = Q^neg .* M^neg.
8: compute L^pos_(1) and L^neg_(1) using Eq. (15).
9: solve the generalized eigenvalue problem X^new L^pos_(1) X^new^T p_s = λ_s X^new L^neg_(1) X^new^T p_s, where p_s ∈ R^P is the eigenvector corresponding to the s-th smallest eigenvalue λ_s.
10: form the projection matrix W = [p_1, p_2, ..., p_K] ∈ R^{P×K}.
11: return W.

3.3.2. sDSG-rDA

In graph embedding there is another kind of method, different from the distance based ones, denoted as reconstruction based methods. Without label information, each sample can be represented by its neighbors with appropriate coefficients, and this representation relation is expected to be preserved in the low-dimensional space. The typical reconstruction based method is NPP [19], where k nearest neighbors or ϵ-nearest neighbors are used. Without an explicit determination of neighbors, SPP preserves in the low-dimensional space the representation relation obtained by sparse representation. In sparse representation classification [20], the reconstruction error is used to determine the class of a sample: each sample is classified to the class with the smallest reconstruction error among all classes. This can be seen as the intrinsic discriminant property of sparse representation. Here, both the intra-class reconstruction error and the inter-class reconstruction error are investigated. A sample with a smaller intra-class reconstruction error and a bigger inter-class reconstruction error is easier to classify. In the low-dimensional space, the distinction between these two kinds of reconstruction error should be large, which is beneficial for classification. Similar to M^pos and M^neg in sDSG-dDA, two diagonal matrices, C^pos and C^neg, are inserted into the objective function of sDSG-rDA to bring down the effect of pseudo-label inaccuracy. However, because the representation relationship is one-to-many rather than a similarity between two samples, it cannot be treated as in sDSG-dDA. C^pos and C^neg are determined by the class relationship of each sample with its connected samples. The intra-class reconstruction error is denoted by CE_w:

CE_w = Σ_{i=1}^{N^new} C^pos_{i,i} || W^T x_i^new − Σ_{j=1}^{N^new} Q^pos_{i,j} W^T x_j^new ||_2^2
     = Tr( W^T ( Σ_{i=1}^{N^new} C^pos_{i,i} ( X^new e_i − X^new (Q^pos_{i,:})^T )( X^new e_i − X^new (Q^pos_{i,:})^T )^T ) W )    (16)
     = Tr( W^T X^new ( C^pos − (Q^pos)^T C^pos − C^pos Q^pos + (Q^pos)^T C^pos Q^pos ) X^new^T W )    (17)

where
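As a sanity check on the derivation, the summation form of CE_w and the trace form of Eq. (17) can be compared numerically; note that the bracketed matrix in Eq. (17) equals (I − Q^pos)^T C^pos (I − Q^pos). The sketch below is illustrative only, with random data standing in for X^new, Q^pos and C^pos, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, K = 6, 10, 3

X = rng.standard_normal((P, N))            # data matrix, columns are samples
W = rng.standard_normal((P, K))            # projection matrix
Q = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.3)
np.fill_diagonal(Q, 0.0)                   # a sample does not represent itself
c = rng.random(N)                          # diagonal weights C_{i,i}
C = np.diag(c)

# summation form of CE_w (first line of Eq. 16)
ce_sum = sum(
    c[i] * np.sum((W.T @ (X[:, i] - X @ Q[i, :])) ** 2) for i in range(N)
)

# matrix form of CE_w (Eq. 17); the bracket equals (I - Q)^T C (I - Q)
L = C - Q.T @ C - C @ Q + Q.T @ C @ Q
ce_mat = np.trace(W.T @ X @ L @ X.T @ W)

assert np.allclose(ce_sum, ce_mat)
```

The same identity underlies the L_(1) matrices of sDSG-dDA, so the check applies to both methods.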

Table 1
UCI datasets.

No. | Name       | Number | Dimension | Cluster
1   | Wine       | 178    | 13        | 3
2   | Vehicle    | 846    | 18        | 4
3   | Sonar      | 208    | 60        | 2
4   | Pima       | 768    | 8         | 2
5   | Letter     | 5000   | 16        | 26
6   | Heart      | 303    | 13        | 2
7   | German     | 1000   | 24        | 2
8   | Clean      | 476    | 166       | 2
9   | Breast     | 277    | 9         | 2
10  | Balance    | 624    | 4         | 3
11  | Australian | 690    | 14        | 2
12  | Air        | 359    | 64        | 3

P. Chen et al. / Pattern Recognition 61 (2017) 361–378

Fig. 5. The effect of different k on UCI datasets.

Fig. 6. Experimental results on Wine dataset with different s1 and s2.

C^pos_{i,i} = exp( − ‖ x_i^new − (1/(n_i+1)) ( x_i^new + Σ_{j: Q^pos_{i,j} ≠ 0} x_j^new ) ‖₂² / σ₁² )    (18)

x_i^new and its connected samples with nonzero coefficients in Q^pos_{i,:} constitute a sample subset, which can be treated as a subclass. The distance between x_i^new and the center of this subclass is used to determine C^pos_{i,i}, which represents the probability of x_i^new belonging to this subclass. C^pos controls the effect of each sample in the learning process. If x_i^new is near the center of the subclass, a large C^pos_{i,i} is given, meaning that x_i^new is likely represented by samples of the same class; a small C^pos_{i,i} means that the samples connected with x_i^new may belong to different classes. Hence, the sparse representation relationship of samples represented by samples of the same class should be encouraged to be preserved in the low-dimensional space. In contrast, the representation relationship of samples with negative samples is not expected to be preserved in the low-dimensional space. So the inter-class reconstruction error CE_b in the low-dimensional space should be maximized, which is defined in the following equation:

CE_b = Σ_{i=1}^{N_new} C^neg_{i,i} ‖ W^T x_i^new − Σ_{j=1}^{N_new} Q^neg_{i,j} W^T x_j^new ‖₂²
     = Tr( W^T ( Σ_{i=1}^{N_new} C^neg_{i,i} (X^new e_i − X^new (Q^neg_{i,:})^T)(X^new e_i − X^new (Q^neg_{i,:})^T)^T ) W )
     = Tr( W^T X^new ( Σ_{i=1}^{N_new} [ C^neg_{i,i} e_i e_i^T − C^neg_{i,i} (Q^neg_{i,:})^T e_i^T − C^neg_{i,i} e_i Q^neg_{i,:} + C^neg_{i,i} (Q^neg_{i,:})^T Q^neg_{i,:} ] ) (X^new)^T W )
     = Tr( W^T X^new ( C^neg − (Q^neg)^T C^neg − C^neg Q^neg + (Q^neg)^T C^neg Q^neg ) (X^new)^T W )    (19)

where

C^neg_{i,i} = 1 − exp( − ‖ x_i^new − (1/(n_i+1)) ( x_i^new + Σ_{j: Q^neg_{i,j} ≠ 0} x_j^new ) ‖₂² / σ₂² )    (20)
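The weights of Eqs. (18) and (20) can be sketched as below; the function names `subclass_weight` and `c_weights` and the toy data are illustrative, not from the paper, and the same coefficient matrix stands in for both Q^pos and Q^neg:

```python
import numpy as np

def subclass_weight(X, Q, i, sigma):
    """exp(-d2/sigma^2), where d2 is the squared distance of x_i to the
    mean of itself and its connected samples (nonzero entries in row i of Q),
    following the form of Eqs. (18)/(20)."""
    nbr = np.flatnonzero(Q[i] != 0)
    center = (X[:, i] + X[:, nbr].sum(axis=1)) / (nbr.size + 1)
    d2 = np.sum((X[:, i] - center) ** 2)
    return np.exp(-d2 / sigma ** 2)

def c_weights(X, Q_pos, Q_neg, sigma1, sigma2):
    """Diagonal entries of C^pos (Eq. 18) and C^neg (Eq. 20)."""
    N = Q_pos.shape[0]
    c_pos = np.array([subclass_weight(X, Q_pos, i, sigma1) for i in range(N)])
    c_neg = 1.0 - np.array([subclass_weight(X, Q_neg, i, sigma2) for i in range(N)])
    return c_pos, c_neg

# toy check: x_0 sits exactly at the mean of {x_0, x_1, x_2}, so c_pos[0] = 1
X = np.array([[0.0, 1.0, -1.0],
              [0.0, 1.0, -1.0]])
Q = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
c_pos, c_neg = c_weights(X, Q, Q, sigma1=1.0, sigma2=1.0)
assert np.isclose(c_pos[0], 1.0)
```

As the text explains, c_pos is large when a sample is near the center of the subset that represents it, while c_neg is the complementary quantity for the negative graph.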

Fig. 7. The influence of M and C on sDSG-dDA and sDSG-rDA, respectively.

In contrast to C^pos_{i,i}, C^neg_{i,i} is proportional to the possibility that x_i^new and its connected samples belong to different classes. In the negative graph G^neg, if the sample x_i^new is represented by samples of the same class, a small C^neg_{i,i} is given to reduce its effect on learning. Conversely, if the samples used to represent it do not belong to the same class as x_i^new, a large C^neg_{i,i} should be given to preserve its effect. As in sDSG-dDA, to optimize the problems in Eqs. (17) and (19) simultaneously, the formulation of sDSG-rDA is written as Eq. (21). By minimizing Eq. (21), the difference between a sample's intra-class and inter-class reconstruction errors becomes larger in the low-dimensional space, which is beneficial for classification:

min_W CE_w / CE_b = [ Σ_{i=1}^{N_new} C^pos_{i,i} ‖ W^T x_i^new − Σ_{j=1}^{N_new} Q^pos_{i,j} W^T x_j^new ‖₂² ] / [ Σ_{i=1}^{N_new} C^neg_{i,i} ‖ W^T x_i^new − Σ_{j=1}^{N_new} Q^neg_{i,j} W^T x_j^new ‖₂² ]
                 = Tr( W^T X^new L_(2)^pos (X^new)^T W ) / Tr( W^T X^new L_(2)^neg (X^new)^T W )    (21)

where

L_(2)^pos = C^pos − (Q^pos)^T C^pos − C^pos Q^pos + (Q^pos)^T C^pos Q^pos,
L_(2)^neg = C^neg − (Q^neg)^T C^neg − C^neg Q^neg + (Q^neg)^T C^neg Q^neg.    (22)

The optimization problem in Eq. (21) can be transformed into a generalized eigenvalue problem similar to sDSG-dDA. By solving the generalized eigenvalue problem ( X^new L_(2)^pos (X^new)^T, X^new L_(2)^neg (X^new)^T ), the projection matrix W is obtained. The procedure of sDSG-rDA, shown in Algorithm 3, is the same as that of sDSG-dDA except for steps 6 and 7, where the weighting matrices C^pos, C^neg and the matrices L_(2)^pos, L_(2)^neg replace M^pos, M^neg and L_(1)^pos, L_(1)^neg. sDSG-dDA and sDSG-rDA are derived from different viewpoints but share the same ultimate goal: preserving or enhancing the discriminant ability of the data in the final low-dimensional space, which in turn promotes classification accuracy. In the experimental section, experiments are devised to demonstrate the effectiveness of the proposed methods on classification problems.

Algorithm 3. sDSG-rDA.
Require: labeled samples and label vector: X_label, Y_label; unlabeled samples: X_ulabel; neighborhood parameter: k; the dimension of the projection matrix: K.
Ensure: the projection matrix W ∈ R^{P×K}.
1: select pseudo-labeled samples X_pseudo from the unlabeled samples using the joint k nearest neighbor selection strategy.
2: construct the new data: X^new = [X_label, X_pseudo], y^new = [y_label, y_pseudo].
3: compute sparse representation coefficients. For each sample in X^new, solve:
   subproblem (1): min_{θ_i^pos} ‖x_i^new − D_i^pos θ_i^pos‖₂² + λ‖θ_i^pos‖₁,
   subproblem (2): min_{θ_i^neg} ‖x_i^new − D_i^neg θ_i^neg‖₂² + λ‖θ_i^neg‖₁.
4: rearrange {θ_i^pos, θ_i^neg}_{i=1}^{N_new} to {θ̃_i^pos, θ̃_i^neg}_{i=1}^{N_new} with length N_new.
5: construct the double sparse graphs: Q^pos = [θ̃_1^pos, …, θ̃_{N_new}^pos]^T and Q^neg = [θ̃_1^neg, …, θ̃_{N_new}^neg]^T.
6: compute C^pos and C^neg by using Eqs. (18) and (20).
7: compute L_(2)^pos and L_(2)^neg by using Eq. (22).
8: solve the generalized eigenvalue problem:
   X^new L_(2)^pos (X^new)^T p_s = λ_s X^new L_(2)^neg (X^new)^T p_s,    (23)
   where p_s ∈ R^P is the eigenvector corresponding to the s-th smallest eigenvalue λ_s.
9: obtain the projection matrix W = [p_1, p_2, …, p_K] ∈ R^{P×K}.
10: return W.
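Steps 6-9 of Algorithm 3 can be sketched as a generalized symmetric eigenproblem. The sketch below assumes NumPy/SciPy; the small ridge added to keep the right-hand matrix positive definite, and all variable names, are our assumptions rather than details from the paper:

```python
import numpy as np
from scipy.linalg import eigh

def sdsg_rda_projection(X, Q_pos, Q_neg, c_pos, c_neg, K, ridge=1e-6):
    """Build L_(2)^pos/L_(2)^neg via (I - Q)^T C (I - Q), the factored form
    of Eq. (22), then solve Eq. (23) for the K smallest eigenvalues."""
    N = Q_pos.shape[0]
    I = np.eye(N)
    L2_pos = (I - Q_pos).T @ np.diag(c_pos) @ (I - Q_pos)
    L2_neg = (I - Q_neg).T @ np.diag(c_neg) @ (I - Q_neg)
    A = X @ L2_pos @ X.T
    B = X @ L2_neg @ X.T + ridge * np.eye(X.shape[0])  # ridge keeps B positive definite
    vals, vecs = eigh(A, B)              # generalized eigenvalues, ascending
    return vecs[:, :K], vals[:K]

# illustrative run on random data
rng = np.random.default_rng(1)
P, N, K = 5, 12, 2
X = rng.standard_normal((P, N))
Q_pos = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.25)
Q_neg = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.25)
np.fill_diagonal(Q_pos, 0.0)
np.fill_diagonal(Q_neg, 0.0)
W, lam = sdsg_rda_projection(X, Q_pos, Q_neg, rng.random(N), rng.random(N), K)
assert W.shape == (P, K)
```

Because C^pos and C^neg are nonnegative diagonal matrices, both L_(2) matrices are symmetric positive semidefinite, so `scipy.linalg.eigh` applies directly.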

4. Experimental results

In this section, two different kinds of data are used to validate the performance of the proposed methods. The first part reports results on UCI datasets, and the second part reports results on two hyperspectral images. Each experiment is divided into two

Table 2
The classification result of UCI data with 2 labeled samples per class. Each cell gives OA(best number of projections)/AA.

Dataset    | Original          | PCA              | LDA             | SPP              | SDA             | SSDR             | SELF             | SL-LDA          | GPSDA            | sDSG-dDA         | sDSG-rDA
Wine       | 0.6712(13)/0.6692 | 0.6712(5)/0.6692 | 0.7324(2)/0.7394 | 0.6759(13)/0.6922 | 0.7079(3)/0.7101 | 0.6999(13)/0.7100 | 0.6886(13)/0.6876 | 0.6733(3)/0.6851 | 0.4140(8)/0.4269 | 0.8510(6)/0.8542 | 0.8057(6)/0.8050
Vehicle    | 0.3937(18)/0.3953 | 0.3926(10)/0.3943 | 0.4053(3)/0.4072 | 0.4294(15)/0.4376 | 0.4267(4)/0.4294 | 0.4094(7)/0.4114 | 0.3977(18)/0.3992 | 0.3757(4)/0.3872 | 0.3977(6)/0.3977 | 0.4454(9)/0.4582 | 0.4419(9)/0.4495
Sonar      | 0.5550(60)/0.5508 | 0.5548(15)/0.5516 | 0.5626(1)/0.5610 | 0.5719(40)/0.5737 | 0.5624(2)/0.5632 | 0.5733(25)/0.5689 | 0.5557(60)/0.5556 | 0.5836(2)/0.5813 | 0.5387(30)/0.5256 | 0.5754(20)/0.5768 | 0.5819(20)/0.5710
Pima       | 0.6128(8)/0.5943 | 0.6117(2)/0.5892 | 0.6142(1)/0.5910 | 0.5514(7)/0.5268 | 0.6075(2)/0.5863 | 0.6164(8)/0.5878 | 0.6140(7)/0.5869 | 0.5959(2)/0.5800 | 0.6277(6)/0.5916 | 0.6255(8)/0.6074 | 0.6110(6)/0.5801
Letter     | 0.3178(16)/0.3188 | 0.2933(8)/0.2941 | 0.3621(25)/0.3636 | 0.3385(16)/0.3280 | 0.3785(26)/0.3800 | 0.3950(13)/0.3964 | 0.3281(15)/0.3294 | 0.3519(26)/0.3527 | 0.3378(20)/0.3378 | 0.3957(12)/0.3790 | 0.3765(12)/0.3450
Heart      | 0.6485(13)/0.6471 | 0.6768(5)/0.6742 | 0.6540(1)/0.6526 | 0.6345(12)/0.6336 | 0.6237(2)/0.6245 | 0.6546(6)/0.6576 | 0.6639(1)/0.6326 | 0.5935(2)/0.5948 | 0.6453(5)/0.6435 | 0.6878(12)/0.6820 | 0.6824(13)/0.6741
German     | 0.5504(24)/0.5496 | 0.5506(8)/0.5487 | 0.5483(1)/0.5360 | 0.5656(24)/0.5475 | 0.5487(2)/0.5341 | 0.5592(12)/0.5430 | 0.5497(24)/0.5324 | 0.5367(2)/0.5300 | 0.5649(6)/0.5578 | 0.5621(20)/0.5618 | 0.5671(18)/0.5642
Clean      | 0.5328(166)/0.5383 | 0.5386(50)/0.5424 | 0.5348(1)/0.5387 | 0.5871(5)/0.5822 | 0.5207(2)/0.5231 | 0.5384(60)/0.5398 | 0.5338(60)/0.5380 | 0.5429(2)/0.5479 | 0.5360(30)/0.5360 | 0.5357(55)/0.5389 | 0.5302(55)/0.5284
Breast     | 0.5936(9)/0.5656 | 0.5945(2)/0.5725 | 0.5986(1)/0.5691 | 0.5944(8)/0.5667 | 0.5833(2)/0.5620 | 0.5897(8)/0.5673 | 0.5899(9)/0.5656 | 0.5560(2)/0.5579 | 0.5815(4)/0.5632 | 0.5998(8)/0.5726 | 0.6007(8)/0.5730
Balance    | 0.5294(4)/0.5112 | 0.5394(3)/0.5149 | 0.4766(4)/0.4534 | 0.5816(2)/0.5573 | 0.5398(3)/0.5256 | 0.5562(4)/0.5268 | 0.5229(4)/0.5093 | 0.5269(3)/0.5316 | 0.5296(4)/0.4938 | 0.5819(4)/0.5547 | 0.5791(4)/0.5510
Australian | 0.5890(14)/0.5726 | 0.5890(6)/0.5726 | 0.5730(1)/0.5733 | 0.6002(14)/0.5849 | 0.6128(2)/0.6012 | 0.5894(14)/0.5730 | 0.6110(5)/0.5996 | 0.5842(2)/0.5716 | 0.5641(4)/0.5585 | 0.6141(12)/0.6065 | 0.6151(12)/0.6084
Air        | 0.4653(64)/0.4665 | 0.4654(20)/0.4666 | 0.4693(2)/0.4742 | 0.4678(10)/0.4825 | 0.4644(3)/0.4687 | 0.4814(55)/0.4981 | 0.4446(60)/0.4497 | 0.4918(3)/0.4966 | 0.4780(30)/0.4761 | 0.4805(30)/0.4739 | 0.4875(35)/0.4802

independent procedures: the first is dimensionality reduction and the second is classification. In the first procedure, the whole dataset together with a few labeled samples is used to learn the projection matrix. After that, in the low-dimensional space obtained with the learned projection matrix, the few labeled samples are used to train the classifier and the remaining samples are used for testing. The labeled samples used in dimensionality reduction are the same as those used in classification. Classification is employed to evaluate the performance of dimensionality reduction indirectly. To reduce the impact of the classifier, the nearest neighbor (NN) classifier is used, and overall accuracy (OA) and average accuracy (AA) measure the final results. In the experiments, several classical dimensionality reduction methods are compared with the proposed ones: unsupervised methods (PCA, SPP [25]), a supervised method (LDA) and semi-supervised methods (SDA [31], SSDR [37], SELF [36], SL-LDA [43] and GPSDA [42]). SSDR preserves the intrinsic structure of the unlabeled samples as well as the must-link and cannot-link constraints defined on the labeled samples in the projected space. SELF also preserves the local structure of unlabeled samples, captured by a k-NN graph, while separating labeled samples of different classes from each other. SL-LDA uses label propagation to estimate a soft label for each unlabeled sample before discriminant analysis. GPSDA is based on probabilistic semi-supervised discriminant analysis and uses a fuzzy-set similarity measurement to exploit the inexact class information of unlabeled data, in addition to preserving the adjacency structure of the whole data in the projected space. Note that regularized LDA is employed in the experiments instead of common LDA to avoid the singularity of the scatter matrix.
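The evaluation protocol above (1-NN classification scored by OA and AA) can be sketched as follows; the function names are illustrative, assuming NumPy:

```python
import numpy as np

def nn_classify(X_train, y_train, X_test):
    """1-nearest-neighbor classifier with Euclidean distance."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[np.argmin(d2, axis=1)]

def oa_aa(y_true, y_pred):
    """Overall accuracy (fraction correct) and average per-class accuracy."""
    oa = float(np.mean(y_true == y_pred))
    aa = float(np.mean([np.mean(y_pred[y_true == c] == c)
                        for c in np.unique(y_true)]))
    return oa, aa

pred = nn_classify(np.array([[0.0], [10.0]]), np.array([0, 1]),
                   np.array([[1.0], [9.0]]))
oa, aa = oa_aa(np.array([0, 0, 0, 1]), np.array([0, 0, 1, 1]))
# OA = 3/4; AA = (2/3 + 1)/2 = 5/6
assert pred.tolist() == [0, 1]
assert np.isclose(oa, 0.75) and np.isclose(aa, 5.0 / 6.0)
```

AA averages the per-class recalls, so it is sensitive to minority classes in a way that OA is not, which is why both are reported.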
The sparse representation problems involved in the experiments are solved using the SPAMS toolbox [47].

4.1. Experiments on UCI datasets

4.1.1. Introduction of UCI datasets
The UCI repository is a public collection of datasets that is often used to test algorithms on classification, clustering, detection and recognition problems [48]. In the experiments, 12 UCI datasets are used to test the proposed methods' performance on classification.

A brief introduction of these datasets is given in Table 1.

4.1.2. Parameter analysis
The parameter k in the joint k nearest neighbor selection is very important, as it decides the number of pseudo-labeled samples (Num_ps) and the accuracy rate (Rate_acc) of X^new. In this section, different values of k are applied to each dataset to investigate the influence of k on the final results. In each run, two labeled samples per class are selected randomly to learn the projection matrix and train the classifier. The optimal λ is selected in the range from 0.001 to 0.1 for each dataset. For simplicity, (σ1, σ2) of sDSG-dDA and sDSG-rDA are set to (1e−0, 1e−4). Fig. 5 shows the variation of overall accuracy with k, together with Num_ps and Rate_acc for fixed k. As k increases, Num_ps first increases and then decreases. In the beginning phase, the neighborhoods of the labeled samples are far away from each other, so many new samples can be selected with a large k. Beyond a certain point, the neighborhoods of labeled samples with different labels overlap, and samples selected in the overlapped regions are discarded. Along with the variation of Num_ps, Rate_acc also varies with k. However, the classification result is not decided by Num_ps or Rate_acc alone: many pseudo-labeled samples combined with a high Rate_acc bring good results. For most datasets, such as Vehicle, Pima, Sonar, German, Breast, Balance, Australian and Air, the influence of k is very slight. Note that the number of selected pseudo-labeled samples is much smaller than the size of the dataset. Two parameters (σ1 and σ2) are involved in the definition of the matrices (M^pos, M^neg, C^pos and C^neg) to control the weights added to each edge or each sample set. Experiments on Wine have been conducted to show the influence of these parameters, using two randomly selected labeled samples per class, with σ1 and σ2 varying from 0.0001 to 1.
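Under the interpretation described above (each labeled sample claims its k nearest unlabeled samples, and candidates claimed by differently labeled neighborhoods are discarded), the joint k nearest neighbor selection can be sketched as below; the exact distance and tie-breaking details of the authors' procedure may differ:

```python
import numpy as np

def joint_knn_select(X_label, y_label, X_unlabel, k):
    """Each labeled sample claims its k nearest unlabeled samples; candidates
    claimed with more than one distinct label (overlapped regions) are dropped."""
    d2 = ((X_label[:, None, :] - X_unlabel[None, :, :]) ** 2).sum(axis=2)
    claims = {}                                  # unlabeled index -> set of labels
    for i, yi in enumerate(y_label):
        for j in np.argsort(d2[i])[:k]:
            claims.setdefault(int(j), set()).add(int(yi))
    keep = [j for j, labs in claims.items() if len(labs) == 1]
    labels = [next(iter(claims[j])) for j in keep]
    return np.array(keep), np.array(labels)

# toy 1-D example: two labeled points, four unlabeled candidates
X_l = np.array([[0.0], [10.0]])
y_l = np.array([0, 1])
X_u = np.array([[1.0], [2.0], [9.0], [11.0]])
idx, lab = joint_knn_select(X_l, y_l, X_u, k=2)
```

With k large enough that the two neighborhoods overlap, candidates claimed by both classes would be discarded, matching the Num_ps behavior described in the text.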
The results are shown in Fig. 6. When k = 40, the overall accuracy is noticeably influenced by σ1 and σ2: the best result of sDSG-dDA is reached when σ1 = 0.01 and σ2 = 0.0001, and the best result of sDSG-rDA when σ1 = 0.001 and σ2 = 1. However, when k = 100, the improvement due to adjusting σ1 and σ2 becomes slight; because the accuracy of the pseudo-labels at k = 100 is very high, it is not necessary to add M and C to the objective function. In Fig. 7, sDSG-

Table 3
The classification result of UCI data with 4 training samples per class. Each cell gives OA(best number of projections)/AA.

Dataset    | Original          | PCA              | LDA              | SPP             | SDA             | SSDR             | SELF              | SL-LDA          | GPSDA            | sDSG-dDA         | sDSG-rDA
Wine       | 0.6718(13)/0.6746 | 0.6722(6)/0.6749 | 0.7524(13)/0.7575 | 0.6945(2)/0.7253 | 0.7786(3)/0.7786 | 0.7512(6)/0.7562 | 0.7255(12)/0.7270 | 0.7648(3)/0.7625 | 0.8063(4)/0.8145 | 0.8823(12)/0.8898 | 0.8804(10)/0.8376
Vehicle    | 0.4620(18)/0.4635 | 0.4620(18)/0.4635 | 0.4853(15)/0.4876 | 0.5220(3)/0.5243 | 0.5005(4)/0.5030 | 0.4740(12)/0.4760 | 0.4757(18)/0.4773 | 0.4343(4)/0.4370 | 0.4742(12)/0.4761 | 0.5428(9)/0.5520 | 0.5265(9)/0.5241
Sonar      | 0.5702(60)/0.5675 | 0.5730(5)/0.5725 | 0.6180(5)/0.5737 | 0.6088(1)/0.6077 | 0.6102(2)/0.6079 | 0.6016(30)/0.5974 | 0.5819(60)/0.5780 | 0.6028(2)/0.6040 | 0.5565(5)/0.5545 | 0.6205(20)/0.6158 | 0.6323(20)/0.6230
Pima       | 0.6349(8)/0.6140 | 0.6377(7)/0.6206 | 0.5514(7)/0.5268 | 0.6640(1)/0.6404 | 0.6349(2)/0.6135 | 0.6335(8)/0.6117 | 0.6278(8)/0.6097 | 0.5903(2)/0.5893 | 0.6317(6)/0.6148 | 0.6653(5)/0.6426 | 0.6451(8)/0.6296
Letter     | 0.4188(16)/0.4199 | 0.4193(16)/0.4204 | 0.4368(16)/0.4385 | 0.5025(1)/0.5048 | 0.5143(2)/0.5152 | 0.5155(14)/0.5170 | 0.4482(16)/0.4497 | 0.4703(26)/0.4718 | 0.4647(8)/0.4647 | 0.5194(10)/0.5211 | 0.4969(8)/0.5021
Heart      | 0.6936(13)/0.6930 | 0.6940(6)/0.6934 | 0.6733(12)/0.6336 | 0.7011(1)/0.6770 | 0.6747(2)/0.6741 | 0.7053(12)/0.7045 | 0.6789(12)/0.6781 | 0.6142(2)/0.6171 | 0.6595(4)/0.6566 | 0.7194(10)/0.7193 | 0.7225(12)/0.7219
German     | 0.5447(24)/0.5348 | 0.5449(18)/0.5350 | 0.5590(24)/0.5475 | 0.5565(1)/0.5485 | 0.5463(2)/0.5380 | 0.5648(12)/0.5552 | 0.5517(24)/0.5321 | 0.5395(2)/0.5510 | 0.5547(6)/0.5360 | 0.5662(24)/0.5487 | 0.5500(18)/0.5391
Clean      | 0.5711(166)/0.5857 | 0.5769(10)/0.5871 | 0.6019(5)/0.6066 | 0.5844(1)/0.5893 | 0.5638(2)/0.5650 | 0.5667(70)/0.5919 | 0.5542(120)/0.5634 | 0.5587(2)/0.5597 | 0.5528(30)/0.5630 | 0.5733(10)/0.5744 | 0.5789(10)/0.5796
Breast     | 0.5708(9)/0.5701 | 0.5810(4)/0.5792 | 0.5636(8)/0.5632 | 0.5619(1)/0.5610 | 0.5766(2)/0.5750 | 0.5698(9)/0.5635 | 0.5746(8)/0.5589 | 0.5573(2)/0.5579 | 0.5716(4)/0.5632 | 0.6120(3)/0.5943 | 0.6005(3)/0.5932
Balance    | 0.5961(4)/0.5806 | 0.6048(2)/0.6033 | 0.6768(2)/0.6684 | 0.5650(4)/0.5453 | 0.6257(3)/0.6155 | 0.6292(2)/0.6133 | 0.6004(4)/0.5808 | 0.6474(3)/0.6269 | 0.5923(4)/0.5679 | 0.6534(3)/0.6454 | 0.6785(2)/0.6573
Australian | 0.5816(14)/0.5752 | 0.5816(10)/0.5752 | 0.5933(12)/0.5867 | 0.6192(1)/0.6111 | 0.6330(2)/0.6195 | 0.5916(4)/0.5858 | 0.5800(14)/0.5678 | 0.6012(2)/0.5987 | 0.6111(4)/0.6120 | 0.6521(4)/0.6340 | 0.6498(4)/0.6356
Air        | 0.5214(64)/0.5281 | 0.5214(45)/0.5282 | 0.5390(60)/0.5455 | 0.5213(2)/0.5295 | 0.4961(3)/0.5051 | 0.5372(50)/0.5532 | 0.5213(64)/0.5281 | 0.4918(3)/0.4966 | 0.4780(30)/0.4761 | 0.5413(20)/0.5419 | 0.5481(60)/0.5362

Table 4
Categories of the Indian Pines image.

No. | Class                       | Number
1   | Alfalfa                     | 46
2   | Corn-notill                 | 1428
3   | Corn-min                    | 830
4   | Corn                        | 237
5   | Grass/Pasture               | 483
6   | Grass/Trees                 | 730
7   | Grass-pasture-mowed         | 28
8   | Hay-windrowed               | 478
9   | Oats                        | 20
10  | Soybeans-notill             | 972
11  | Soybeans-min                | 2455
12  | Soybeans-clean              | 593
13  | Wheat                       | 205
14  | Woods                       | 1265
15  | Building-Grass-Trees-Drives | 386
16  | Stone-steel Towers          | 39
Total number: 10,249

dDA without M and sDSG-rDA without C are also tested and compared with sDSG-dDA and sDSG-rDA under the optimal σ1 and σ2. It is obvious that adding weights reflecting the class relationship of a sample pair or sample set affects the final results. If the accuracy rate of the pseudo-labels is very low, the effect of M and C becomes slight; of course, if the accuracy rate is very high, the matrices M and C are not necessary for sDSG-dDA and sDSG-rDA.

4.1.3. Comparison with other methods
In this part, two experiments on UCI datasets with different numbers of labeled samples are reported. The parameter k in sDSG-dDA and sDSG-rDA is selected from the range [10, 60] for most datasets, except that k is selected from [2, 6] for Letter and from [50, 90] for Clean. k is set to the lowest value in the range, and may be increased if the rank of the selected training samples is less than the dimensionality of the dataset. σ1 and σ2 are chosen from [0.0001, 1]. The parameters of the comparison methods are likewise selected within certain ranges for the best results. In the first experiment, two samples per class are selected randomly as labeled samples for learning and classification; the results are shown in Table 2. In the second experiment, four samples per class are used as labeled samples; the results are given in Table 3. The results obtained using the original data

without dimensionality reduction can be treated as reference values. The performance of PCA on most datasets is close to that of the original data. LDA reaches better results on some datasets when four labeled samples are used. On Clean, SPP obtains surprisingly good results in both experiments. Among the comparison algorithms, SDA, SSDR, SELF, SL-LDA and GPSDA achieve comparable results on most datasets. Because SL-LDA and GPSDA investigate the class relation of unlabeled samples by considering their relationship with labeled samples, their results in the second experiment are much better than in the first. In the first experiment, SPP, SL-LDA and GPSDA each achieve the best result on one dataset, while sDSG-dDA achieves it on five datasets and sDSG-rDA on four. In the second experiment, it is one dataset for LDA, seven for sDSG-dDA and four for sDSG-rDA. sDSG-dDA and sDSG-rDA are thus competitive with the comparison methods, and sDSG-dDA has a slight advantage over sDSG-rDA.

4.2. Experiments on hyperspectral images

4.2.1. Introduction of hyperspectral images
In this part, two hyperspectral images are used: the Indian Pines image and the Pavia University image. A brief introduction of them follows.
Indian Pines: The Indian Pines image was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Northwestern Indiana [2]. It contains 145 × 145 pixels with a spatial resolution of 20 × 20 m per pixel. The AVIRIS sensor generates 220 spectral bands in the wavelength range of 0.4-2.5 μm; after removing 20 water absorption bands, 200 bands are used in the experiments. This image contains 16 categories. The number of samples in each category is shown in Table 4, and the distribution of samples is shown in Fig. 8.
Pavia University: The Pavia University image, gathered by the Reflective Optics System Imaging Spectrometer (ROSIS) optical sensor, depicts the University of Pavia, Italy.
The ROSIS sensor generates 115 bands in the range of 0.43-0.86 μm; after removing 12 noisy bands, 103 bands are retained in the experiments. This image consists of 610 × 340 pixels and has a spatial resolution of 1.3 × 1.3 m per pixel. There are nine

categories, as shown in Table 5; the distribution is shown in Fig. 9.

Fig. 8. Distribution of Indian Pines image.

Table 5
Categories of the Pavia University image.

No. | Class                | Number
1   | Asphalt              | 6631
2   | Meadows              | 18,649
3   | Gravel               | 2099
4   | Trees                | 3064
5   | Painted metal sheets | 1345
6   | Bare soil            | 5029
7   | Bitumen              | 1330
8   | Self-blocking bricks | 3682
9   | Shadows              | 947
Total number: 42,776

Fig. 9. Distribution of Pavia University image.

4.2.2. Experimental settings
Different from the UCI datasets, the spatial relationship existing in a hyperspectral image is beneficial for classification and other applications. Here, the spatial information of the hyperspectral image is utilized to select pseudo-labeled samples in combination with the joint k nearest neighbor selection. In the spatial domain, pixels within a squared window centered at x_i are treated as its spatial neighbors. First, the spatial neighbors of each sample in X_label are selected to form a new sample set; second, joint k nearest neighbor selection is applied to select pseudo-labeled samples from this set. The size of the spatial window is denoted by ks × ks, which can influence the final results; 3 × 3 and 5 × 5 windows are tested in the following experiments (Fig. 10). For some methods, such as SPP, SDA, SSDR, SELF, SL-LDA and GPSDA, too many training samples may exceed the computational capacity of the computer, so a sample subset is used in the experiments instead of the whole dataset. For the Indian Pines image, 20% of the samples (about 2046) are randomly selected as unlabeled samples for dimensionality reduction; for the Pavia University image, 4% (about 1711) are used. The average over 20 runs is recorded as the final result.

4.2.3. Parameter analysis
Because of the combination of the spatial neighborhood with joint k nearest neighbor selection, the accuracy rate of the pseudo-labels is close to 100%. Therefore, it is not necessary to utilize the matrices M (M^pos and M^neg) and C (C^pos and C^neg). Based on this consideration, σ1 and σ2 are set to 1 and 0.0001 respectively, which makes the nonzero elements of M and C close to 1. Here, we focus on analyzing the influence of k and ks. Under the two settings ks = 3 and ks = 5, k varies from 5 to 40. Table 6 shows Num_ps, Rate_acc and the classification results for different k and ks. For the Indian Pines image, sDSG-dDA reaches a stable and satisfactory result when ks = 3 and k > 10, and sDSG-rDA does so when ks = 3 and k > 15; when ks = 5 and k is small, the results are not satisfactory. For the Pavia University image, sDSG-dDA and sDSG-rDA reach their best results when ks = 5 and k ≥ 10. Rate_acc is equal or close to 100% for all tested ks and k, which underpins the superiority of sDSG-dDA and sDSG-rDA.
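The ks × ks spatial window used to gather candidate neighbors can be sketched as follows; clipping the window at the image border is our assumption about how border pixels are handled:

```python
def spatial_window_indices(r, c, ks, H, W):
    """Pixel coordinates inside a ks x ks window centered at (r, c),
    clipped at the image border; the center pixel itself is excluded."""
    half = ks // 2
    return [(i, j)
            for i in range(max(0, r - half), min(H, r + half + 1))
            for j in range(max(0, c - half), min(W, c + half + 1))
            if (i, j) != (r, c)]

# interior pixel of a 5x5 image with a 3x3 window: 8 neighbors
assert len(spatial_window_indices(2, 2, 3, 5, 5)) == 8
# corner pixel: the window is clipped, leaving 3 neighbors
assert len(spatial_window_indices(0, 0, 3, 5, 5)) == 3
```

Joint k nearest neighbor selection is then run on the union of these windows over all labeled pixels, which is why Num_ps grows with ks in Table 6.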

4.2.4. Investigation of the influence of the number of projections
The number of projections (K) is an important parameter for dimensionality reduction methods and is closely related to their performance. Here, the effect of K is investigated through experiments on the two hyperspectral datasets, with 10 labeled samples per class. The parameters of sDSG-dDA and sDSG-rDA are predetermined: ks = 3, k = 20 and (σ1 = 1, σ2 = 0.0001) for the Indian Pines image, and ks = 5, k = 20 and (σ1 = 1, σ2 = 0.0001) for the Pavia University image. Because the

Fig. 10. Different neighborhoods in hyperspectral image.

Table 6
Results on the Indian Pines and Pavia University images with different values of k and ks. For each class, 4 labeled samples are selected randomly for learning.

Indian Pines image
                       k=5     k=10    k=15    k=20    k=25    k=30    k=35    k=40
ks=3  Num_ps           248     353     388     398     400     402     400     400
      Rate_acc         100%    100%    99.97%  99.96%  99.91%  99.89%  99.89%  99.89%
      sDSG-dDA OA      0.5324  0.5613  0.5712  0.5710  0.5705  0.5704  0.5638  0.5638
      sDSG-dDA AA      0.6685  0.6952  0.7090  0.7102  0.7092  0.7140  0.7030  0.7030
      sDSG-rDA OA      0.5051  0.5380  0.5397  0.5429  0.5525  0.5504  0.5511  0.5511
      sDSG-rDA AA      0.6411  0.6726  0.6769  0.6787  0.6843  0.6826  0.6844  0.6844
ks=5  Num_ps           283     495     669     790     864     933     964     970
      Rate_acc         99.38%  99.08%  99.97%  99.96%  99.91%  99.89%  98.05%  98.07%
      sDSG-dDA OA      0.5126  0.5522  0.5670  0.5692  0.5670  0.5848  0.5812  0.5779
      sDSG-dDA AA      0.6476  0.6813  0.6891  0.6950  0.7021  0.7089  0.7066  0.7118
      sDSG-rDA OA      0.4918  0.5161  0.5284  0.5274  0.5401  0.5459  0.5431  0.5497
      sDSG-rDA AA      0.6231  0.6483  0.6568  0.6668  0.6648  0.6643  0.6693  0.6740

Pavia University image
                       k=5     k=10    k=15    k=20    k=25    k=30    k=35    k=40
ks=3  Num_ps           139     199     214     224     231     228     232     232
      Rate_acc         100%    100%    100%    100%    100%    100%    100%    100%
      sDSG-dDA OA      0.5587  0.5902  0.6027  0.6038  0.5808  0.5678  0.5961  0.5966
      sDSG-dDA AA      0.6534  0.6881  0.6905  0.6862  0.6765  0.6723  0.6800  0.6841
      sDSG-rDA OA      0.5673  0.5841  0.5855  0.5896  0.5908  0.5899  0.5899  0.5899
      sDSG-rDA AA      0.6591  0.6821  0.6876  0.6915  0.6926  0.6910  0.6911  0.6911
ks=5  Num_ps           162     288     383     458     489     542     549     563
      Rate_acc         100%    99.97%  99.99%  99.99%  99.98%  99.98%  99.98%  99.98%
      sDSG-dDA OA      0.5939  0.6127  0.6087  0.6187  0.6172  0.6170  0.6148  0.6171
      sDSG-dDA AA      0.6706  0.7068  0.7052  0.7141  0.7114  0.7114  0.7130  0.7138
      sDSG-rDA OA      0.5752  0.5883  0.5900  0.5969  0.6018  0.6038  0.6017  0.6017
      sDSG-rDA AA      0.6735  0.6991  0.6871  0.6985  0.7053  0.7085  0.7075  0.7075

projection number of LDA, SDA and SL-LDA is fixed, their curves do not change with K. For the Indian Pines image, K varies from 5 to 70 with an interval of 5; the results are shown in Fig. 11(a). Fig. 11(b) gives the results on the Pavia University image with K varying from 5 to 50 with an interval of 5. The results of the different methods increase noticeably while K is smaller than a certain value and stabilize afterwards; when K is too large, the results may drop slightly. The results of sDSG-dDA and sDSG-rDA stay steady once K is larger than 10. On the Indian Pines image, sDSG-dDA and sDSG-rDA surpass the comparison methods regardless of K. On the Pavia University image, sDSG-dDA and sDSG-rDA have a slight advantage over the others, especially SL-LDA and GPSDA, which also explore the discriminant information among unlabeled samples.

4.2.5. Comparison with other methods
In this section, experiments with different numbers of labeled samples are reported for all methods. The number of labeled samples per class (T) varies from 2 to 18 with an interval of 2. Parameters ks and k are selected from [3, 5] and [10, 30]. When the number of labeled samples is large, ks and k can be reduced to control the size of the pseudo-labeled sample set, which limits the computational cost without apparently affecting performance. The results on the Indian Pines image are listed in Table 7 and the corresponding curves are given in Fig. 12. As Fig. 12 and Table 7 show, sDSG-dDA and sDSG-rDA outperform the other methods by almost 10 percentage points in both OA and AA; between the two, sDSG-dDA works better on this dataset. The performance of

Fig. 11. The influence of the number of projections on two hyperspectral images: (a) Indian Pines image and (b) Pavia University image.

Table 7
The classification results on the Indian Pines image. Each cell gives OA/AA.

Methods   | T=2           | T=4           | T=6           | T=8           | T=10          | T=12          | T=14          | T=16          | T=18
Original  | 0.3814/0.5457 | 0.4233/0.6016 | 0.4666/0.6328 | 0.4871/0.6582 | 0.5124/0.6791 | 0.5266/0.6926 | 0.5323/0.6982 | 0.5424/0.7072 | 0.5521/0.7155
PCA       | 0.3876/0.5473 | 0.4332/0.6014 | 0.4678/0.6343 | 0.4868/0.6588 | 0.5141/0.6800 | 0.5277/0.6930 | 0.5329/0.7004 | 0.5436/0.7090 | 0.5532/0.7171
LDA       | 0.3751/0.4818 | 0.4022/0.5039 | 0.4747/0.6024 | 0.4983/0.6300 | 0.5195/0.6459 | 0.5339/0.6611 | 0.5457/0.6692 | 0.5577/0.6752 | 0.5659/0.6829
SPP       | 0.3655/0.5059 | 0.4045/0.5608 | 0.4364/0.5918 | 0.4562/0.6172 | 0.4789/0.6375 | 0.4923/0.6494 | 0.4956/0.6564 | 0.5083/0.6679 | 0.5197/0.6700
SDA       | 0.3894/0.5490 | 0.4440/0.6193 | 0.4844/0.6512 | 0.5047/0.6774 | 0.5233/0.6854 | 0.5381/0.7096 | 0.5469/0.7133 | 0.5592/0.7244 | 0.5701/0.7385
SSDR      | 0.3860/0.5195 | 0.4471/0.5790 | 0.4807/0.6070 | 0.4982/0.6257 | 0.5236/0.6491 | 0.5315/0.6592 | 0.5437/0.6689 | 0.5579/0.6789 | 0.5659/0.6800
SELF      | 0.3707/0.4717 | 0.3880/0.5125 | 0.3971/0.5463 | 0.4367/0.6034 | 0.4718/0.6439 | 0.4789/0.6502 | 0.4738/0.6505 | 0.4834/0.6533 | 0.4681/0.6424
SL-LDA    | 0.3618/0.4789 | 0.4462/0.5547 | 0.4804/0.6011 | 0.5001/0.6204 | 0.5240/0.6556 | 0.5358/0.6605 | 0.5416/0.6754 | 0.5562/0.6871 | 0.5663/0.6955
GPSDA     | 0.3742/0.4802 | 0.4523/0.5592 | 0.4818/0.5981 | 0.4899/0.6192 | 0.5020/0.6315 | 0.5138/0.6509 | 0.5291/0.6664 | 0.5334/0.6703 | 0.5345/0.6776
sDSG-dDA  | 0.5364/0.7165 | 0.5711/0.7646 | 0.6130/0.8032 | 0.6523/0.8330 | 0.6639/0.8441 | 0.6748/0.8538 | 0.6894/0.8634 | 0.6975/0.8746 | 0.7078/0.8809
sDSG-rDA  | 0.5086/0.6689 | 0.5576/0.7429 | 0.5936/0.7763 | 0.6337/0.8072 | 0.6454/0.8176 | 0.6492/0.8229 | 0.6614/0.8328 | 0.6691/0.8413 | 0.6749/0.8461

Fig. 12. The classiﬁcation results on Indian Pines image.

Table 8
The classification results on the Pavia University image. Each cell gives OA/AA.

Methods   | T=2           | T=4           | T=6           | T=8           | T=10          | T=12          | T=14          | T=16          | T=18
Original  | 0.4931/0.6276 | 0.5474/0.6629 | 0.5792/0.6874 | 0.6039/0.7069 | 0.6135/0.7102 | 0.6262/0.7188 | 0.6376/0.7276 | 0.6418/0.7312 | 0.6505/0.7378
PCA       | 0.4933/0.6583 | 0.5477/0.6637 | 0.5796/0.6884 | 0.6042/0.7077 | 0.6136/0.7111 | 0.6262/0.7196 | 0.6372/0.7281 | 0.6418/0.7322 | 0.6506/0.7387
LDA       | 0.4993/0.6283 | 0.5434/0.6699 | 0.5736/0.6849 | 0.5907/0.7006 | 0.5996/0.7062 | 0.6081/0.7147 | 0.6172/0.7173 | 0.6243/0.7223 | 0.6380/0.7290
SPP       | 0.5048/0.6388 | 0.5622/0.6735 | 0.5966/0.7024 | 0.6191/0.7202 | 0.6338/0.7275 | 0.6458/0.7339 | 0.6585/0.7425 | 0.6636/0.7482 | 0.6724/0.7531
SDA       | 0.5240/0.6312 | 0.5589/0.6743 | 0.5956/0.7044 | 0.6135/0.7156 | 0.6341/0.7264 | 0.6466/0.7387 | 0.6599/0.7500 | 0.6756/0.7575 | 0.6795/0.7636
SSDR      | 0.4794/0.5152 | 0.5158/0.5557 | 0.5327/0.5735 | 0.5630/0.5981 | 0.5737/0.6036 | 0.5825/0.6102 | 0.5736/0.6158 | 0.6007/0.6240 | 0.6170/0.6263
SELF      | 0.4989/0.6085 | 0.5118/0.6181 | 0.5340/0.6460 | 0.5665/0.6740 | 0.5935/0.6932 | 0.6090/0.7108 | 0.5842/0.6708 | 0.5962/0.6892 | 0.6102/0.7063
SL-LDA    | 0.5228/0.6192 | 0.5844/0.6779 | 0.6231/0.7067 | 0.6134/0.7226 | 0.6512/0.7399 | 0.6624/0.7423 | 0.6744/0.7376 | 0.6613/0.7548 | 0.6827/0.7590
GPSDA     | 0.5422/0.6603 | 0.5701/0.6783 | 0.6187/0.7044 | 0.6106/0.6971 | 0.6593/0.7272 | 0.6712/0.7337 | 0.6817/0.7399 | 0.6745/0.7707 | 0.6843/0.7400
sDSG-dDA  | 0.5436/0.6541 | 0.5889/0.6914 | 0.6268/0.7177 | 0.6542/0.7333 | 0.6815/0.7574 | 0.6766/0.7540 | 0.6935/0.7640 | 0.7118/0.7704 | 0.7159/0.7749
sDSG-rDA  | 0.5370/0.6508 | 0.5946/0.6975 | 0.6068/0.7143 | 0.6513/0.7345 | 0.6536/0.7469 | 0.6649/0.7515 | 0.6849/0.7612 | 0.6851/0.7595 | 0.6952/0.7684

Fig. 13. The classiﬁcation results on Pavia University image.

Fig. 14. Comparison of single graph and double graphs.


PCA performs close to the original data and serves as a reference. LDA falls below PCA when few labeled samples are available and improves as T increases. Because SPP ignores discriminant information and the spectral similarity between different land covers, it cannot achieve good results. The results of SDA, SSDR and SELF also improve with T, but they are only slightly better than PCA, since only the internal structure of the unlabeled samples is exploited during learning, without any additional discriminant information. Although SL-LDA and GSPDA explore and utilize the discriminant information of unlabeled samples, data noise and the complex distribution make precise discriminant information hard to estimate, so they also fail to reach the expected results. By considering the spatial neighborhood through joint k nearest neighbor selection, many pseudo-labeled samples are selected, and almost all pseudo-labels agree with the real labels, which brings more precise discriminant information into the learning process.

The results on the Pavia University image are listed in Table 8, and the corresponding curves are shown in Fig. 13. Because the sample distribution of each category in the Pavia University image is more balanced than in the Indian Pines image, and the spectral responses of different categories are easier to distinguish, the performance gap between the different methods is not large. SPP and SDA work better than LDA, SSDR, SELF and PCA. SL-LDA and GSPDA reach the best results among the comparison methods because they utilize the discriminant information of the unlabeled samples. Nevertheless, sDSG-dDA and sDSG-rDA still hold a small advantage over the other methods owing to the high accuracy of the pseudo-labels.

sDSG-dDA and sDSG-rDA each contain two graphs, a positive graph and a negative graph. When the negative graph is ignored, they reduce to single-dDA and single-rDA; for comparability, sDSG-dDA and sDSG-rDA are denoted as double-dDA and double-rDA. The comparison results are given in Fig. 14, which demonstrates that the negative graph also brings useful information into the learning process.
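The pseudo-label selection idea discussed above can be sketched as follows. This is a simplified stand-in for the paper's joint k nearest neighbor criterion (which also accounts for neighborhoods across different classes): here an unlabeled sample receives a pseudo-label only when all of its k nearest labeled neighbors agree. The function name and the unanimity rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def select_pseudo_labeled(X_lab, y_lab, X_unlab, k=3):
    """Assign a pseudo-label to an unlabeled sample only when all of its
    k nearest labeled neighbors share one class; otherwise reject it.
    Simplified sketch of a pseudo-label selection rule."""
    idx_kept, pseudo = [], []
    for i, x in enumerate(X_unlab):
        d = np.linalg.norm(X_lab - x, axis=1)   # Euclidean distances to labeled set
        nn = np.argsort(d)[:k]                  # indices of k nearest labeled samples
        labels = y_lab[nn]
        if np.all(labels == labels[0]):         # unanimous neighborhood -> accept
            idx_kept.append(i)
            pseudo.append(labels[0])
    return np.array(idx_kept), np.array(pseudo)
```

Samples rejected by the rule simply stay unlabeled; the accepted subset is then merged with the labeled samples for the subsequent graph construction.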

5. Conclusion

The discriminant information of unlabeled samples is very useful for semi-supervised dimensionality reduction. In this paper, we proposed two new semi-supervised discriminant analysis methods that build on previous research on sparse representation and discriminant analysis. To explore the discriminant information among unlabeled samples, joint k nearest neighbor selection is used to pick pseudo-labeled samples from the unlabeled set; it considers not only the interaction of neighborhoods within the same class but also neighborhoods across different classes. Double sparse graphs are then used to exploit the discriminative local structure of the new data subset consisting of pseudo-labeled and labeled samples. Although some pseudo-labels differ from the real labels, the sparsity of sparse representation limits this influence to a certain extent. In addition, to further reduce the influence of inaccurate pseudo-labels, two different strategies are applied in the two proposed methods. When abundant unlabeled samples are available in real applications, it is meaningful to select a subset of them, rather than the whole data, for semi-supervised learning. In the experimental section, extensive experiments on UCI datasets and hyperspectral images validate the feasibility and advantage of the proposed methods. In the future, we are interested in incremental feature learning based on sparse graphs in order to handle large-scale data.
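The sparse-graph construction underlying the method can be illustrated with a minimal sketch: each sample is reconstructed from all other samples by l1-regularized least squares, and the absolute coefficients become the graph weights. The paper uses a dedicated sparse-coding solver (SPAMS, ref. [47]); the plain ISTA loop below is an illustrative substitute, and the function name and parameters are assumptions.

```python
import numpy as np

def sparse_graph(X, lam=0.1, n_iter=200):
    """Build a sparse affinity graph over the rows of X.
    Each sample x_i is sparsely coded over the remaining samples by solving
    min_a 0.5*||D a - x_i||^2 + lam*||a||_1 with ISTA; |a| gives row i of W."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        D = np.delete(X, i, axis=0).T      # dictionary: all other samples as columns
        x = X[i]
        L = np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
        a = np.zeros(D.shape[1])
        for _ in range(n_iter):            # ISTA: gradient step + soft-threshold
            g = D.T @ (D @ a - x)
            a = a - g / L
            a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)
        W[i, np.arange(n) != i] = np.abs(a)
    return W
```

On well-separated data the nonzero weights concentrate on samples similar to the one being coded, which is the local structure property that sparse graphs exploit for dimensionality reduction.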


References

[1] A. Jain, R. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell. 22 (1) (2000) 4–37.
[2] J.A. Benediktsson, J.R. Sveinsson, K. Arnason, Classification and feature extraction of AVIRIS data, IEEE Trans. Geosci. Remote Sens. 33 (5) (1995) 1194–1205.
[3] M. Farrell, R. Mersereau, On the impact of PCA dimension reduction for hyperspectral detection of difficult targets, IEEE Geosci. Remote Sens. Lett. 2 (2) (2005) 192–195.
[4] D.L. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831–836.
[5] P. Bajorski, Statistical inference in PCA for hyperspectral images, IEEE J. Sel. Top. Signal Process. 5 (3) (2011) 438–445.
[6] K. Fukunaga, Statistical Pattern Recognition, Academic Press, San Diego, CA, USA, 1990.
[7] T. Hastie, A. Buja, R. Tibshirani, Penalized discriminant analysis, Ann. Stat. 23 (1) (1995) 73–102.
[8] C. Liu, Gabor-based kernel PCA with fractional power polynomial models for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 26 (5) (2004) 572–581.
[9] A. Papaioannou, S. Zafeiriou, Principal component analysis with complex kernel: the widely linear model, IEEE Trans. Neural Netw. Learn. Syst. 25 (9) (2014) 1719–1726.
[10] P. Honeine, Online kernel principal component analysis: a reduced-order model, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1814–1826.
[11] W. Zheng, Z. Lin, H. Wang, L1-norm kernel discriminant analysis via Bayes error bound optimization for robust feature extraction, IEEE Trans. Neural Netw. Learn. Syst. 25 (4) (2014) 793–805.
[12] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Comput. 12 (10) (2000) 2385–2404.
[13] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[14] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[15] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (6) (2003) 1373–1396.
[16] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, SIAM J. Sci. Comput. 26 (1) (2004) 313–338.
[17] X. He, P. Niyogi, Locality preserving projections, Neural Inf. Process. Syst. 45 (1) (2004) 186–197.
[18] W. Li, S. Prasad, J.E. Fowler, L.M. Bruce, Locality-preserving dimensionality reduction and classification for hyperspectral image analysis, IEEE Trans. Geosci. Remote Sens. 50 (4) (2012) 1185–1198.
[19] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2, 2005, pp. 1208–1213.
[20] J. Wright, A. Yang, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[21] Y. Chen, N.M. Nasrabadi, T.D. Tran, Hyperspectral image classification using dictionary-based sparse representation, IEEE Trans. Geosci. Remote Sens. 49 (10) (2011) 3973–3985.
[22] D. Xu, Y. Huang, Z. Zeng, X. Xu, Human gait recognition using patch distribution feature and locality-constrained group sparse representation, IEEE Trans. Image Process. 21 (1) (2012) 316–326.
[23] X.-T. Yuan, X. Liu, S. Yan, Visual classification with multitask joint sparse representation, IEEE Trans. Image Process. 21 (10) (2012) 4349–4360.
[24] B. Cheng, J. Yang, S. Yan, Y. Fu, T.S. Huang, Learning with l1-graph for image analysis, IEEE Trans. Image Process. 19 (4) (2010) 858–866.
[25] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. 43 (1) (2010) 331–341.
[26] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. 29 (1) (2007) 40–51.
[27] N.H. Ly, Q. Du, J.E. Fowler, Sparse graph-based discriminant analysis for hyperspectral imagery, IEEE Trans. Geosci. Remote Sens. 52 (7) (2014) 3872–3884.
[28] Y. Pang, S. Wang, Y. Yuan, Learning regularized LDA by clustering, IEEE Trans. Neural Netw. Learn. Syst. 25 (12) (2014) 2191–2201.
[29] J. Friedman, Regularized discriminant analysis, J. Am. Stat. Assoc. 84 (405) (1989) 165–175.
[30] T.V. Bandos, L. Bruzzone, G. Camps-Valls, Classification of hyperspectral images with regularized linear discriminant analysis, IEEE Trans. Geosci. Remote Sens. 47 (3) (2009) 862–873.
[31] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: Proceedings of the IEEE International Conference on Computer Vision, 2007, pp. 1–7.
[32] Y. Song, F. Nie, C. Zhang, S. Xiang, A unified framework for semi-supervised dimensionality reduction, Pattern Recognit. 41 (9) (2008) 2789–2799.
[33] J. Wei, H. Peng, Neighbourhood preserving based semi-supervised dimensionality reduction, Electron. Lett. 44 (20) (2008) 1190–1191.
[34] S. Yang, P. Jin, B. Li, L. Yang, W. Xu, Semisupervised dual geometric subspace projection for dimensionality reduction of hyperspectral image data, IEEE Trans. Geosci. Remote Sens. 52 (6) (2014) 3587–3593.
[35] R. Chatpatanasiri, B. Kijsirikul, A unified semi-supervised dimensionality reduction framework for manifold learning, Neurocomputing 73 (10–12) (2010) 1631–1640.



[36] M. Sugiyama, T. Ide, S. Nakajima, J. Sese, Semi-supervised local Fisher discriminant analysis for dimensionality reduction, Mach. Learn. 78 (1–2) (2010) 35–61.
[37] D. Zhang, Z.-H. Zhou, S. Chen, Semi-supervised dimensionality reduction, in: Proceedings of the 2007 SIAM International Conference on Data Mining, 2007.
[38] Z. Zhang, M. Zhao, T.W. Chow, Marginal semi-supervised sub-manifold projections with informative constraints for dimensionality reduction and recognition, Neural Netw. 36 (2012) 97–111.
[39] Z. Zhang, M.B. Zhao, T.W. Chow, Constrained large margin local projection algorithms and extensions for multimodal dimensionality reduction, Pattern Recognit. 45 (12) (2012) 4466–4493.
[40] J. Xia, J. Chanussot, P. Du, X. He, (Semi-)supervised probabilistic principal component analysis for hyperspectral remote sensing image classification, IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 7 (6) (2014) 2224–2236.
[41] F. Nie, S. Xiang, Y. Jia, C. Zhang, Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recognit. 42 (11) (2009) 2615–2627.
[42] W. Li, Q. Ruan, J. Wan, Dimensionality reduction using graph-embedded probability-based semi-supervised discriminant analysis, Neurocomputing 138 (2014) 283–296.
[43] M. Zhao, Z. Zhang, T.W. Chow, A general soft label based linear discriminant analysis for semi-supervised dimensionality reduction, Neural Netw. 55 (2014) 83–97.
[44] W. Li, Q. Ruan, J. Wan, Semi-supervised dimensionality reduction using estimated class membership probabilities, J. Electron. Imag. 12 (4) (2012) 255–262.
[45] J. Amores, N. Sebe, P. Radeva, Boosting the distance estimation: application to the k-nearest neighbor classifier, Pattern Recognit. Lett. 27 (3) (2006) 201–209.
[46] R. He, W.-S. Zheng, B.-G. Hu, X.-W. Kong, Two-stage nonnegative sparse representation for large-scale face recognition, IEEE Trans. Neural Netw. Learn. Syst. 24 (1) (2013) 35–46.
[47] SPAMS: SPArse Modeling Software, online: 〈http://spams-devel.gforge.inria.fr/downloads.html〉.
[48] M. Lichman, UCI Machine Learning Repository, online, 2013.

Puhua Chen received the B.S. degree from the University of Electronic Science and Technology of China, Chengdu, China, in 2009. She is currently working toward the Ph.D. degree with the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Electronic Engineering, Xidian University, Xi'an, China. Her current research interests include machine learning, pattern recognition, and synthetic aperture radar image interpretation.

Licheng Jiao received the B.S. degree from Shanghai Jiaotong University, Shanghai, China, in 1982 and the M.S. and Ph.D. degrees from Xi'an Jiaotong University, Xi'an, China, in 1984 and 1990, respectively. Since 1992, he has been a Professor with the School of Electronic Engineering, Xidian University, Xi'an, where he is currently the Director of the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education of China. In 1992, Dr. Jiao was awarded the Youth Science and Technology Award. In 1996, he received the Cross-Century Specialists Fund from the Ministry of Education of China and was selected into the first level of the Millions of Talents Project of China. In 2006, he was awarded the First Prize of the Young Teacher Award of High School by the Fok Ying Tung Education Foundation, and from 2006 he was selected as an Expert with Special Contribution of Shaanxi Province. Dr. Jiao is a member of the IEEE Xi'an Section Execution Committee, the Chairman of its Awards and Recognition Committee and of its Computational Intelligence Society chapter, the Chairman of the IET Xi'an Section, the Vice Board Chairperson of the Chinese Association of Artificial Intelligence, a Councilor of the Chinese Institute of Electronics, a committee member of the Chinese Committee of Neural Networks, and an expert of the Academic Degrees Committee of the State Council. He has been in charge of about 40 major scientific research projects and has published more than 20 monographs and over a hundred papers in international journals and conferences. His research interests include image processing, natural computation, machine learning, and intelligent information processing.

Fang Liu received the B.S. degree in computer science and technology from Xi'an Jiaotong University, Xi'an, China, in 1984 and the M.S. degree in computer science and technology from Xidian University, Xi'an, in 1995. She is currently a Professor with the School of Computer Science, Xidian University. She is the author or coauthor of 5 books and more than 80 papers in journals and conferences. Her research interests include signal and image processing, synthetic aperture radar image processing, multiscale geometry analysis, learning theory and algorithms, optimization problems, and data mining.

Jiaqi Zhao received the B.Eng. degree in intelligence science and technology from Xidian University, Xi'an, China, in 2010. He is currently pursuing the Ph.D. degree in circuits and systems at Xidian University, Xi'an, China. Between 2013 and 2014, he was an exchange Ph.D. student with the Leiden Institute of Advanced Computer Science (LIACS), Leiden University, The Netherlands. He is currently a member of the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education and the International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an, China. His current research interests include multiobjective optimization, machine learning and image processing.

Zhiqiang Zhao received the B.S. and M.S. degrees from Huaqiao University, Xiamen, China, in 2007 and 2010, respectively. He is currently pursuing the Ph.D. degree in pattern recognition and intelligent systems at Xidian University, Xi'an, China. He is currently a member of the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education and the International Research Center for Intelligent Perception and Computation, Xidian University, Xi'an, China. His current research interests include machine learning, image processing, and high-resolution synthetic aperture radar image processing.

Shuai Liu received the B.S. degree in Observation & Control Engineering and Instrumentation from Xidian University, Xi'an, China, in 2005. She is currently working toward the Ph.D. degree in the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, School of Electronic Engineering, Xidian University, Xi'an, China. Her major research interests include statistical machine learning and deep learning.