Discriminative globality and locality preserving graph embedding for dimensionality reduction


Expert Systems With Applications 144 (2020) 113079


Jianping Gou a,∗, Yuanyuan Yang a, Zhang Yi b, Jiancheng Lv b, Qirong Mao a, Yongzhao Zhan a

a School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu, 212013, PR China
b School of Computer Science, Sichuan University, Chengdu, 610065, PR China


Article history: Received 28 January 2019 Revised 7 November 2019 Accepted 7 November 2019

Keywords: Dimensionality reduction Graph embedding Graph construction Pattern classification

Abstract

Graph embedding in dimensionality reduction has attracted much attention in high-dimensional data analysis. Graph construction in graph embedding plays an important role in the quality of dimensionality reduction. However, the discrimination information and the geometrical distributions of data samples are not fully exploited for discovering the essential geometrical and discriminant structures of data and for strengthening the pattern discrimination in the graph constructions of graph embedding. To overcome the limitations of these graph constructions, in this article we propose a novel graph-based dimensionality reduction method entitled discriminative globality and locality preserving graph embedding (DGLPGE), which is built on informative globality and locality preserving graph constructions. In the constructed graphs, bidirectional weights of edges are newly defined by considering both the geometrical distributions of the two end points of each edge and the class discrimination. Using the adjacent weights of the graphs, we characterize the intra-class globality preserving scatter, the inter-class globality preserving scatter and the locality preserving scatter to formulate the objective function of DGLPGE and optimize the projection for dimensionality reduction. Extensive experiments demonstrate that the proposed DGLPGE often outperforms state-of-the-art dimensionality reduction methods.

1. Introduction

In many real-world applications, a large number of high-dimensional data need to be processed, so dimensionality reduction, which transforms the original high-dimensional data into new low-dimensional data, is widely applied. Recently, graph embedding in dimensionality reduction has attracted much attention in the fields of computer vision and pattern recognition (Cai, Zheng, & Chang, 2017; Goyal & Ferrara, 2018). Through graph embedding, not only can the issues of the "curse of dimensionality" (Jain, Duin, & Mao, 2000) and computational cost be alleviated, but the discriminant and geometrical structures of high-dimensional data can also be preserved in the low-dimensional subspace for various practical applications, such as pattern classification and analysis (Fang, Han, Wu, & Xu, 2018; Houari, Bounceur, Kechadi, Tari, & Euler, 2016; Kim, 2018; Lai et al., 2018; Lai, Xu, Yang, Shen, & Zhang, 2017; Li & Du, 2016; Li, Feng, Li, & Du, 2018). Moreover, as argued




in Yan et al. (2007), most dimensionality reduction methods can be unified into the general framework of graph embedding. Nowadays, dimensionality reduction methods can in general be divided into linear and nonlinear ones. Among the linear methods, PCA and LDA are the representative ones (Martinez & Kak, 2001). PCA learns the global intrinsic structure of data by preserving the maximum variance of the data, and LDA learns the global discriminative structure of data by maximizing the margin between inter-class and intra-class samples. To better reflect the nonlinear structure of data, the linear PCA and LDA have been extended to kernel versions with kernel tricks (Yang, Frangi, Yang, Zhang, & Jin, 2005). Due to the simplicity and effectiveness of PCA and LDA, their extensions were recently proposed in Lai, Xu, Chen, Yang, and Zhang (2014), Wu et al. (2018), Fan et al. (2014), Kan, Shan, Zhang, Lao, and Chen (2016), Li, Chen, Nie, and Wang (2017) and Xu, Iosifidis, and Gabbouj (2018). In fact, manifold-based dimensionality reduction methods can well discover the essential structure of high-dimensional data, because manifold learning, as a type of nonlinear dimensionality reduction, can always reflect the local intrinsic manifold structure of nonlinear complex data (Belkin & Niyogi, 2003; Roweis & Saul, 2000). Since manifold learning has the out-of-sample problem (Bengio et al., 2004), a great many linear manifold-based dimensionality reduction methods have been proposed recently.


Among the manifold-based dimensionality reduction methods, locality preserving projections (LPP) (He, Yan, Hu, Niyogi, & Zhang, 2005) is a very well-known one. LPP is a linear extension of the Laplacian eigenmap (LE) manifold learning method (Belkin & Niyogi, 2003). Since LPP can well reflect the local geometry of high-dimensional data, many variants of it have been introduced, such as in Yang, Zhang, Yang, and Niu (2007), Zhang, Xue, Lu, and Guo (2006), Xu, Zhong, Yang, and Zhang (2010), Gao, Liu, Zhang, Gao, and Li (2013), Gou and Zhang (2013), Wang et al. (2017b), Ding and Zhang (2015), Wang, Ding, Hsu, and Yang (2018), Lu, Ding, Xu, and Wang (2018), Yang, Liu, Yu, and Wang (2018) and Wang et al. (2017a). Yang et al. proposed unsupervised discriminant projections as a simplified version of LPP (Yang et al., 2007). Considering the class information of data, discriminant neighborhood embedding (DNE) was proposed for classification (Zhang et al., 2006). Gou and Zhang extended DNE and proposed locality-based discriminant neighborhood embedding (LDNE) by fully utilizing the local information of data (Gou & Zhang, 2013). Fast and orthogonal locality preserving projections (FOLPP) was proposed as a fast, orthogonal variant of LPP that exploits the locality of data under an orthogonality constraint (Wang et al., 2017b). In addition, hierarchical discriminant analysis was proposed by using a two-phase dimensionality reduction (Lu et al., 2018). Unlike traditional graph embedding, a dimensionality reduction method was proposed that simultaneously learns the graph and the projections (Wang et al., 2017a). Also within graph embedding, a local similarity preserving function was designed for graph construction (Wang et al., 2018). Thus, as stated above, the extensions of LPP can well reflect the local structure of data based on manifold learning, in comparison with the variants of PCA and LDA. Using the good property that sparse representation captures the natural discrimination and essential geometry of data (Wright, Yang, Ganesh, Sastry, & Ma, 2009), some sparse representation-based dimensionality reduction methods were recently proposed in Gou, Du, Cheng, and Cai (2016), Gou et al. (2018), Qiao, Chen, and Tan (2010), Wen, Xu, Li, Ma, and Xu (2018), Yin, Lai, Zeng, and Wei (2018), Yuan, Mao, and Chen (2019) and Zhang, Wang, and Cai (2017). Qiao et al. first proposed sparsity preserving projections (SPP), which uses sparse representation to automatically construct the graph for graph embedding (Qiao et al., 2010). To overcome the limitation that SPP ignores the locality of data, three variants of SPP were proposed by considering the locality of data in the automatic sparse construction (Yin et al., 2018; Yuan et al., 2019; Zhang et al., 2017). Inspired by the ideas of both LPP and SPP, Gou et al. proposed discriminative sparsity preserving graph embedding (DSPGE) by using sparse reconstructions and developing a completely new informative graph construction (Gou et al., 2016). Subsequently, Gou et al. gave a full extension of DSPGE by considering the local and global geometry of data (Gou et al., 2018). Moreover, using the general version of sparse representation named collaborative representation (Zhang, Yang, & Feng, 2012), collaborative representation-based projections was proposed, which uses collaborative representation to construct the graph (Yang, Wang, & Sun, 2015).
To fully reflect the locality of data, extensions of collaborative representation-based projections were also proposed (Huang, Li, Gao, Yao, & Yang, 2018; Yin, Wei, Song, & Zeng, 2016). Although the representation-based graph embedding methods above have the natural property of uncovering the discriminative and geometrical information of data, the geometry of high-dimensional data may not be directly or fully taken into account. In most graph embedding methods, the graph construction plays a key role in the performance of dimensionality reduction. However, the existing graph constructions do not fully consider the geometrical and discriminative information of data in

how to assign the weights to the edges of the graphs. To address this issue, in this article we propose a novel discriminative globality and locality preserving graph embedding (DGLPGE). In the proposed method, the global and local geometries of high-dimensional data are considered simultaneously by designing new global and local graph constructions. As is well known, the two points of an edge of a graph have their own different geometrical distributions in the data. We therefore naturally assume that the weight of each edge in our constructed adjacent graphs is determined by the geometrical distributions of the two points of that edge. Moreover, to enhance the pattern discrimination in the low-dimensional subspace, we simultaneously employ class discrimination information to design the weight functions for the edges of the graphs. The effectiveness of the proposed DGLPGE is verified by conducting comparative experiments on six public face databases, four real benchmark data sets and two synthetic data sets. The experimental results show that the proposed method often performs better than the competing state-of-the-art dimensionality reduction methods.

The rest of this article is organized as follows. Section 2 reviews the related works. Section 3 describes the proposed method in detail. Section 4 analyzes the superior properties of the proposed method. Section 5 presents the extensive experiments. Finally, Section 6 gives the conclusions and future works.

2. Related works

In this section, we briefly review both the LPP and DSPGE methods that are related to our work. First of all, the common notations are given here for the following descriptions. Suppose that a set of high-dimensional data is denoted as $X = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{d \times n}$ with $n$ training samples in a $d$-dimensional feature space. These training samples come from $C$ classes with class set $\{1, 2, \ldots, C\}$, and the class label of training sample $x_i$ is $c_i \in \{1, 2, \ldots, C\}$. Linear feature extraction for dimensionality reduction transforms a high-dimensional sample $x_i$ into the corresponding low-dimensional one $y_i$ with $y_i = P^T x_i$, where $P = [p_1, p_2, \cdots, p_r] \in \mathbb{R}^{d \times r}$ is the projection matrix and $d \gg r$. Moreover, we assume that $p \in \mathbb{R}^d$ is any one column vector of the projection matrix $P$, i.e. $p \in P$, in order to clearly describe the following graph embedding methods, whose aim is to solve for $P$.

2.1. LPP

LPP (He et al., 2005) is a typical manifold-based graph embedding method that preserves the local manifold structure of high-dimensional data by using a k-NN graph construction. It keeps samples that are similar in the original high-dimensional space close in the low-dimensional subspace. The weight $w_{ij}$ of the edge between $x_i$ and $x_j$ in the k-NN adjacent graph is defined as follows:

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{t}\right), & x_i \in N_k(x_j) \ \text{or} \ x_j \in N_k(x_i) \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where $N_k(x_i)$ represents the set of $k$ nearest neighbors of $x_i$, and $t$ is an adjustable parameter. Using the adjacent weights, the objective function of LPP is defined as follows:



$$\min_{p} \ \frac{1}{2}\sum_{ij} \|y_i - y_j\|^2 w_{ij} \quad \text{s.t.} \ p^T X D X^T p = 1. \tag{2}$$

The model of LPP can finally be reformulated as the generalized eigenvalue problem

$$X L X^T p = \lambda X D X^T p \tag{3}$$

where $L = D - W$ is a Laplacian matrix, $D$ is a diagonal matrix with $D_{ii} = \sum_j w_{ij}$, and $(W)_{ij} = w_{ij}$. The optimized projection matrix $P$ is composed of the $r$ eigenvectors that correspond to the $r$ smallest eigenvalues obtained by solving Eq. (3).
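As a rough illustration of Eqs. (1)-(3), the following Python sketch (ours, not part of the original paper) builds the heat-kernel k-NN weights and solves the generalized eigenvalue problem with SciPy; the names X, n_neighbors and t are our own, and the small ridge added for numerical stability is an assumption.

```python
import numpy as np
from scipy.linalg import eigh

def lpp(X, n_neighbors=5, t=1.0, r=10):
    """Minimal LPP sketch: X is d x n (columns are samples); returns a d x r projection."""
    d, n = X.shape
    sq = np.sum(X**2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X       # pairwise squared distances
    # heat-kernel k-NN weights of Eq. (1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(dist2[i])[1:n_neighbors + 1]      # skip the point itself
        W[i, idx] = np.exp(-dist2[i, idx] / t)
    W = np.maximum(W, W.T)                                  # xi in Nk(xj) or xj in Nk(xi)
    D = np.diag(W.sum(axis=1))
    L = D - W
    # generalized eigenvalue problem of Eq. (3): X L X^T p = lambda X D X^T p
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-8 * np.eye(d)                      # small ridge for stability (assumption)
    vals, vecs = eigh(A, B)
    return vecs[:, :r]                                      # r smallest eigenvalues
```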

2.2. DSPGE

DSPGE (Gou et al., 2016; Gou et al., 2018) is a newer graph embedding method that is based on sparse representation and LPP. DSPGE assumes that similar samples in the high-dimensional space should have correspondingly similar sparse reconstructions, and that these reconstructions should also have similar projections in the low-dimensional space. First of all, each training sample is reconstructed by the other training samples through the sparse model

$$\min_{z_i} \ \|z_i\|_1 \quad \text{s.t.} \ x_i = X z_i, \ 1 = \mathbf{1}^T z_i \tag{4}$$

where $z_i = [z_{i,1}, \cdots, z_{i,i-1}, 0, z_{i,i+1}, \cdots, z_{i,n}]^T \in \mathbb{R}^n$ is the vector of sparse coefficients for $x_i$ and $\mathbf{1} = [1, 1, \cdots, 1] \in \mathbb{R}^n$ is a vector of all ones. $z_{i,i} = 0$ in $z_i$ indicates that $x_i$ is reconstructed by all the samples except itself, and $z_{i,j}$, $j \neq i$, is regarded as the weight of $x_j$ in reconstructing $x_i$. All the sparse coefficient vectors corresponding to all the training samples are denoted as $Z = [z_1, z_2, \ldots, z_n]$. Note that $\|\cdot\|_1$ in Eq. (4) denotes the $l_1$-norm, i.e. the sum of the absolute values of all elements of a vector. Then, sample $x_i$ can be reconstructed as $\bar{x}_i = X z_i$. DSPGE constructs global intra-class and inter-class adjacent graphs of the high-dimensional data. In the intra-class graph, the weight $w^+_{ij}$ of the edge between $x_i$ and $x_j$ from the same class is defined as follows:

$$w^+_{ij} = \begin{cases} \dfrac{1}{2}\left[\exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^+}\right) + \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^+}\right)\right], & c_i = c_j \\ 0, & c_i \neq c_j \end{cases} \tag{5}$$

where $t_i^+$ and $t_j^+$ reflect the geometric distributions of $x_i$ and $x_j$ in the intra-class high-dimensional feature space, respectively. In the inter-class graph, the weight $w^-_{ij}$ of the edge between $x_i$ and $x_j$ from different classes is defined as follows:

$$w^-_{ij} = \begin{cases} \dfrac{1}{2}\left[\exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^-}\right) + \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^-}\right)\right], & c_i \neq c_j \\ 0, & c_i = c_j \end{cases} \tag{6}$$

where $t_i^-$ and $t_j^-$ reflect the geometric distributions of $x_i$ and $x_j$ in the inter-class high-dimensional feature space, respectively. Note that the formulations of $t_i^+$, $t_j^+$, $t_i^-$ and $t_j^-$ are given in Section 3 in order to conveniently analyze the proposed graph constructions. Using the intra-class and inter-class graph constructions, DSPGE simultaneously minimizes the intra-class globality preserving scatter $\sum_{ij} \|\bar{y}_i - \bar{y}_j\|^2 w^+_{ij}$ and maximizes the inter-class globality preserving scatter $\sum_{ij} \|\bar{y}_i - \bar{y}_j\|^2 w^-_{ij}$. Thus, the objective function of DSPGE with the constraint $p^T p = 1$ is defined as follows:

$$\begin{cases} \min \ \dfrac{1}{2}\sum_{ij} \|\bar{y}_i - \bar{y}_j\|^2 w^+_{ij} \\ \max \ \dfrac{1}{2}\sum_{ij} \|\bar{y}_i - \bar{y}_j\|^2 w^-_{ij} \\ \text{s.t.} \ p^T p = 1 \end{cases} \tag{7}$$

where $\bar{y}_i = P^T \bar{x}_i$ and $\bar{x}_i = X z_i$. After some simple algebra, Eq. (7) can be reformulated as the following generalized eigenvalue problem

$$X Z (L^- - L^+) Z^T X^T p = \lambda p \tag{8}$$

where $L^+ = D^+ - W^+$ is a Laplacian matrix, $D^+$ is a diagonal matrix with $D^+_{ii} = \sum_j w^+_{ij}$ and $(W^+)_{ij} = w^+_{ij}$, and $L^- = D^- - W^-$ is a Laplacian matrix, $D^-$ is a diagonal matrix with $D^-_{ii} = \sum_j w^-_{ij}$ and $(W^-)_{ij} = w^-_{ij}$. The optimized projection matrix $P$ can be obtained by solving Eq. (8) and is composed of the $r$ eigenvectors that correspond to the $r$ largest eigenvalues of Eq. (8).

3. The proposed DGLPGE

In this section, we describe our proposed DGLPGE method in detail, including the idea of DGLPGE, the global and local graph constructions of DGLPGE, and the optimization of DGLPGE.

3.1. The idea of DGLPGE

In graph embedding, the existing graph constructions hardly ever directly consider the geometrical distribution of each point when assigning the weights to the edges of the adjacent graphs. In fact, each point in the feature space of the data has a different geometrical distribution. Thus, the two points of each edge make their own geometrical contributions to determining the weight of the edge. This property is considered in our graph constructions. To fully discover the geometry of the data, we perform both global and local graph constructions. To further enhance the pattern discrimination among the different classes, the adjacent weights of the edges in the constructed intra-class and inter-class graphs are defined discriminatively. By characterizing the geometry preserving scatters with the adjacent weights, we use the maximum margin criterion to formulate the objective function, in order to make the global and local geometry preserving scatters more discriminative. Fig. 1 intuitively illustrates the idea of the proposed method. Note that in Fig. 1 the points from three classes are denoted by three shapes with different colors, and k is the number of nearest neighbors in the k-NN graph construction.

3.2. Intra-class globality preserving scatter

To globally reflect the intra-class intrinsic geometric structures of high-dimensional data, we construct the corresponding graph of the training samples from each class. Suppose that $G^+$ represents the intra-class adjacent graph. The weight $W^+_{ij}$ of an edge between $x_i$ and $x_j$ in graph $G^+$ is determined by the weight $w^+_{ij}$ of the edge from $x_i$ to $x_j$ and the weight $w^+_{ji}$ of the edge from $x_j$ to $x_i$. Both $w^+_{ij}$ and $w^+_{ji}$ are defined as

$$w^+_{ij} = \begin{cases} \dfrac{1}{2}\exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^+}\right)\left[1 + \exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^+}\right)\right], & c_i = c_j \\ 0, & \text{otherwise} \end{cases} \tag{9}$$

and

$$w^+_{ji} = \begin{cases} \dfrac{1}{2}\exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^+}\right)\left[1 + \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^+}\right)\right], & c_i = c_j \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

where $t_i^+$ and $t_j^+$ are the positive intra-class regulators of $x_i$ and $x_j$, respectively. Both $t_i^+$ and $t_j^+$ are defined as

$$t_i^+ = \frac{1}{\left(n^+(c_i)\right)^q} \sum_{l=1}^{n^+(c_i)} \|x_i - x_l\|^2 \tag{11}$$

and

$$t_j^+ = \frac{1}{\left(n^+(c_j)\right)^q} \sum_{l=1}^{n^+(c_j)} \|x_j - x_l\|^2 \tag{12}$$


Fig. 1. The illustrated example of the idea of DGLPGE.

where $n^+(c_i)$ is the number of all samples that have the same class as $x_i$, $n^+(c_j)$ is the number of all samples that have the same class as $x_j$, and $q$ is a positive parameter used to further adjust the weights of the graph. Using the weights $w^+_{ij}$ and $w^+_{ji}$, the weight $W^+_{ij}$ of the edge between $x_i$ and $x_j$ is defined as

$$W^+_{ij} = W^+_{ji} = \begin{cases} \dfrac{1}{2}\left(w^+_{ij} + w^+_{ji}\right), & c_i = c_j \\ 0, & \text{otherwise.} \end{cases} \tag{13}$$
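A small sketch (our own, not the paper's code) of how the intra-class weights of Eqs. (9)-(13) can be computed; labels gives the class of each column of X, and q is the regulator exponent.

```python
import numpy as np

def intra_class_weights(X, labels, q=1.5):
    """Sketch of Eqs. (9)-(13): bidirectional intra-class weights and their average W+."""
    labels = np.asarray(labels)
    d, n = X.shape
    sq = np.sum(X**2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2 * X.T @ X          # ||xi - xj||^2
    # intra-class regulators t_i^+ of Eqs. (11)-(12)
    t_plus = np.zeros(n)
    for i in range(n):
        same = np.where(labels == labels[i])[0]
        t_plus[i] = dist2[i, same].sum() / (len(same) ** q)
    W_plus = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and labels[i] == labels[j]:
                wij = 0.5 * np.exp(-dist2[i, j] / t_plus[i]) * (1 + np.exp(-dist2[i, j] / t_plus[i]))
                wji = 0.5 * np.exp(-dist2[j, i] / t_plus[j]) * (1 + np.exp(-dist2[j, i] / t_plus[j]))
                W_plus[i, j] = 0.5 * (wij + wji)              # Eq. (13)
    return W_plus
```

The inter-class weights of Eqs. (16)-(20) below follow the same pattern with the regulators taken over the samples of other classes and the factor (1 - exp(...)) instead of (1 + exp(...)).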

Using the intra-class graph construction, the essential geometric and discriminant structures of the intra-class samples are preserved in the low-dimensional subspace by defining the intra-class globality preserving scatter

$$J_W(p) = \frac{1}{2}\sum_{ij} \|y_i - y_j\|^2 W^+_{ij} = \frac{1}{2}\sum_{ij} \|p^T x_i - p^T x_j\|^2 W^+_{ij} = \frac{1}{2}\sum_{ij} p^T (x_i - x_j) W^+_{ij} (x_i - x_j)^T p = p^T S_W p \tag{14}$$

where $S_W = \frac{1}{2}\sum_{ij} (x_i - x_j) W^+_{ij} (x_i - x_j)^T$ is called the intra-class globality preserving scatter matrix. $S_W$ can be rewritten as

$$S_W = \frac{1}{2}\sum_{ij} (x_i - x_j) W^+_{ij} (x_i - x_j)^T = \frac{1}{2}\sum_{ij}\left(2 W^+_{ij} x_i x_i^T - 2 W^+_{ij} x_i x_j^T\right) = \sum_{i} D^+_{ii} x_i x_i^T - \sum_{ij} W^+_{ij} x_i x_j^T = X D^+ X^T - X W^+ X^T = X (D^+ - W^+) X^T = X L^+ X^T \tag{15}$$

where $L^+ = D^+ - W^+$ is the intra-class Laplacian matrix, and $D^+$ is a diagonal matrix with $D^+_{ii} = \sum_j W^+_{ij}$.

3.3. Inter-class globality preserving scatter

To globally reflect the inter-class intrinsic geometric structures of high-dimensional data, we construct the corresponding graphs of the training samples from different classes. The inter-class adjacent graph is denoted as $G^-$. Just like the intra-class graph construction, the weight $W^-_{ij}$ of an edge between $x_i$ and $x_j$ in graph $G^-$ is determined by the weight $w^-_{ij}$ of the edge from $x_i$ to $x_j$ and the weight $w^-_{ji}$ of the edge from $x_j$ to $x_i$. Both $w^-_{ij}$ and $w^-_{ji}$ are defined as

$$w^-_{ij} = \begin{cases} \dfrac{1}{2}\exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^-}\right)\left[1 - \exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^-}\right)\right], & c_i \neq c_j \\ 0, & \text{otherwise} \end{cases} \tag{16}$$

and

$$w^-_{ji} = \begin{cases} \dfrac{1}{2}\exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^-}\right)\left[1 - \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^-}\right)\right], & c_i \neq c_j \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

where $t_i^-$ and $t_j^-$ are the positive inter-class regulators of $x_i$ and $x_j$, respectively. Both $t_i^-$ and $t_j^-$ are defined as

$$t_i^- = \frac{1}{\left(n^-(c_i)\right)^q} \sum_{l=1}^{n^-(c_i)} \|x_i - x_l\|^2 \tag{18}$$

and

$$t_j^- = \frac{1}{\left(n^-(c_j)\right)^q} \sum_{l=1}^{n^-(c_j)} \|x_j - x_l\|^2 \tag{19}$$

where $n^-(c_i)$ is the number of all samples whose class labels are different from the class label $c_i$ of $x_i$, and $n^-(c_j)$ is the number of all samples whose class labels are different from the class label $c_j$ of $x_j$. Then, the weight $W^-_{ij}$ of the edge between $x_i$ and $x_j$ is defined as

$$W^-_{ij} = W^-_{ji} = \begin{cases} \dfrac{1}{2}\left(w^-_{ij} + w^-_{ji}\right), & c_i \neq c_j \\ 0, & \text{otherwise.} \end{cases} \tag{20}$$

To preserve the inter-class intrinsic geometric and discriminant structures of high-dimensional data, we define the inter-class globality preserving scatter in the low-dimensional subspace as

$$J_B(p) = \frac{1}{2}\sum_{ij} \|y_i - y_j\|^2 W^-_{ij} = p^T S_B p \tag{21}$$

where $S_B = \frac{1}{2}\sum_{ij} (x_i - x_j) W^-_{ij} (x_i - x_j)^T$ is called the inter-class globality preserving scatter matrix. $S_B$ can be rewritten as

$$S_B = \frac{1}{2}\sum_{ij} (x_i - x_j) W^-_{ij} (x_i - x_j)^T = X L^- X^T \tag{22}$$

where $L^- = D^- - W^-$ is the inter-class Laplacian matrix, and $D^-$ is a diagonal matrix with $D^-_{ii} = \sum_j W^-_{ij}$. Note that the reformulations of Eqs. (21) and (22) are deduced in a way similar to Eqs. (14) and (15), respectively.

3.4. Locality preserving scatter

To well reflect the local geometrical and discriminant structures of high-dimensional data, we also perform a local graph construction. The local adjacent graph is denoted as $G^k$ and is constructed by k-NN. Just like the ways of assigning weights to edges in the global graph constructions above, the weight $W^k_{ij}$ of an edge between $x_i$ and $x_j$ in graph $G^k$ is determined by the weight $w^k_{ij}$ of the edge from $x_i$ to $x_j$ and the weight $w^k_{ji}$ of the edge from $x_j$ to $x_i$. Considering the class information of the data, the weight of an edge from one point to another point is defined as follows:

$$w^{k+}_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^k}\right)\left[1 + \exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^k}\right)\right], & x_j \in N_k(x_i) \ \text{and} \ c_i = c_j \\ 0, & \text{otherwise} \end{cases} \tag{23}$$

$$w^{k+}_{ji} = \begin{cases} \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^k}\right)\left[1 + \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^k}\right)\right], & x_i \in N_k(x_j) \ \text{and} \ c_j = c_i \\ 0, & \text{otherwise} \end{cases} \tag{24}$$

$$w^{k-}_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^k}\right)\left[1 - \exp\left(-\dfrac{\|x_i - x_j\|^2}{t_i^k}\right)\right], & x_j \in N_k(x_i) \ \text{and} \ c_i \neq c_j \\ 0, & \text{otherwise} \end{cases} \tag{25}$$

and

$$w^{k-}_{ji} = \begin{cases} \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^k}\right)\left[1 - \exp\left(-\dfrac{\|x_j - x_i\|^2}{t_j^k}\right)\right], & x_i \in N_k(x_j) \ \text{and} \ c_j \neq c_i \\ 0, & \text{otherwise} \end{cases} \tag{26}$$

where $N_k(x_i)$ is the set of k-nearest neighbors of $x_i$, and $t_i^k$ and $t_j^k$ are the local positive regulators in the k-neighborhood regions of $x_i$ and $x_j$, respectively. Both $t_i^k$ and $t_j^k$ are defined as

$$t_i^k = \frac{1}{k^2} \sum_{j=1}^{k} \|x_i - x_j\|^2, \quad x_j \in N_k(x_i) \tag{27}$$

and

$$t_j^k = \frac{1}{k^2} \sum_{i=1}^{k} \|x_j - x_i\|^2, \quad x_i \in N_k(x_j). \tag{28}$$

Then, the weight $W^k_{ij}$ of the edge between $x_i$ and $x_j$ is defined as

$$W^k_{ij} = W^k_{ji} = \begin{cases} +\dfrac{1}{2}\left(w^{k+}_{ij} + w^{k+}_{ji}\right), & \left(x_j \in N_k(x_i) \ \text{or} \ x_i \in N_k(x_j)\right) \ \text{and} \ c_i = c_j \\ -\dfrac{1}{2}\left(w^{k-}_{ij} + w^{k-}_{ji}\right), & \left(x_j \in N_k(x_i) \ \text{or} \ x_i \in N_k(x_j)\right) \ \text{and} \ c_i \neq c_j \\ 0, & \text{otherwise.} \end{cases} \tag{29}$$

To preserve the local geometrical and discriminant structures of high-dimensional data in the low-dimensional subspace, we define the locality preserving scatter as follows:

$$J_L(p) = \frac{1}{2}\sum_{ij} \|y_i - y_j\|^2 W^k_{ij} = p^T S_L p \tag{30}$$

where $S_L = \frac{1}{2}\sum_{ij} (x_i - x_j) W^k_{ij} (x_i - x_j)^T$ is called the locality preserving scatter matrix. According to simple algebra, $S_L$ can be rewritten as

$$S_L = \frac{1}{2}\sum_{ij} (x_i - x_j) W^k_{ij} (x_i - x_j)^T = X L^k X^T \tag{31}$$

where $L^k = D^k - W^k$ is the local Laplacian matrix, and $D^k$ is a diagonal matrix with $D^k_{ii} = \sum_j W^k_{ij}$. Note that the reformulations of Eqs. (30) and (31) are deduced in a way similar to Eqs. (14) and (15), respectively.

3.5. The objective function of DGLPGE

To well preserve the geometrical and discriminant information of high-dimensional data in the low-dimensional subspace, we employ the maximum margin criterion to simultaneously minimize the intra-class globality preserving scatter, maximize the inter-class globality preserving scatter and minimize the locality preserving scatter. Thus, the objective function of the proposed method with the constraint $p^T p = 1$ is formulated as follows:

$$\begin{cases} \min \ \sum_{ij} W^+_{ij} \|p^T x_i - p^T x_j\|^2 \\ \max \ \sum_{ij} W^-_{ij} \|p^T x_i - p^T x_j\|^2 \\ \min \ \sum_{ij} W^k_{ij} \|p^T x_i - p^T x_j\|^2 \\ \text{s.t.} \ p^T p = 1. \end{cases} \tag{32}$$

Using Eqs. (14), (21) and (30), the objective function is rewritten as

$$\begin{cases} \min \ p^T S_W p \\ \max \ p^T S_B p \\ \min \ p^T S_L p \\ \text{s.t.} \ p^T p = 1. \end{cases} \tag{33}$$

Through Eqs. (15), (22) and (31), the objective function of DGLPGE can be further reformulated as

$$\max \ p^T X \left[\alpha (L^- - L^+) + (1 - \alpha) L^k\right] X^T p \quad \text{s.t.} \ p^T p = 1 \tag{34}$$

where $\alpha$ is a positive parameter that balances the globality and locality preserving scatters of the data, and the value of $\alpha$ belongs to (0, 1).


Fig. 2. Examples of the geometrical distributions of different points. Note that the points from the same class are denoted by the same shapes.

Here, it should be noted that the proposed method only preserves the globality of the data in graph embedding when $\alpha = 1$ and only the locality of the data when $\alpha = 0$. The corresponding variants of the proposed method are called discriminative globality preserving graph embedding (DGPGE) when $\alpha = 1$ and discriminative locality preserving graph embedding (DLPGE) when $\alpha = 0$. Using a Lagrangian multiplier, the objective function of Eq. (34) can be reformulated as the following generalized eigenvalue problem

$$X \left[\alpha (L^- - L^+) + (1 - \alpha) L^k\right] X^T p = \lambda p. \tag{35}$$

Finally, we choose the first $r$ eigenvectors corresponding to the $r$ largest eigenvalues to compose the optimal transformation matrix $P$, i.e. $\lambda_1 > \lambda_2 > \ldots > \lambda_r > 0$ and $P = [p_1, p_2, \ldots, p_r]$. As discussed above, we briefly summarize the proposed method in Algorithm 1.

Algorithm 1 The proposed DGLPGE method.
1: Calculate the weight matrix $W^+$ of the global intra-class graph $G^+$ by Eq. (13) and the intra-class globality preserving scatter $S_W$ by Eq. (15).
2: Calculate the weight matrix $W^-$ of the global inter-class graph $G^-$ by Eq. (20) and the inter-class globality preserving scatter $S_B$ by Eq. (22).
3: Calculate the weight matrix $W^k$ of the local graph $G^k$ by Eq. (29) and the locality preserving scatter $S_L$ by Eq. (31).
4: Solve Eq. (35) to form the optimal projection matrix $P$.
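Putting Algorithm 1 together, a compact sketch of the final solve of Eq. (35) might look as follows. The helpers build_Wplus, build_Wminus and build_Wk are hypothetical names standing for the graph constructions of Eqs. (13), (20) and (29) (the intra-class one is sketched after Eq. (13)); the symmetrization step is our own numerical precaution.

```python
import numpy as np

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def dglpge(X, labels, r, alpha=0.5, q=1.5, k=5):
    """Sketch of Eq. (35): top-r eigenvectors of X [alpha(L- - L+) + (1 - alpha) Lk] X^T."""
    W_plus = build_Wplus(X, labels, q)       # Eq. (13), intra-class graph (hypothetical helper)
    W_minus = build_Wminus(X, labels, q)     # Eq. (20), inter-class graph (hypothetical helper)
    W_k = build_Wk(X, labels, k)             # Eq. (29), local k-NN graph (hypothetical helper)
    M = X @ (alpha * (laplacian(W_minus) - laplacian(W_plus))
             + (1 - alpha) * laplacian(W_k)) @ X.T
    M = 0.5 * (M + M.T)                      # enforce symmetry before the eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]           # keep the r largest eigenvalues, as in Eq. (35)
    return vecs[:, order[:r]]                # columns of P
```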

4. Analysis of DGLPGE

In this section, we analyze the characteristics of the graph constructions designed in the proposed method to illustrate why the geometrical and discriminant structures of high-dimensional data can be well preserved in the low-dimensional subspace.

From the perspective of the geometrical information of the data, we consider the different geometrical distribution of each point to determine the weights of the edges of the constructed adjacent graphs. That is to say, the weight of an edge should be determined by the geometrical distributions of the two points at the ends of that edge. Fig. 2 gives examples of the geometrical distributions of points that are situated in the same class or in different classes. As shown in Fig. 2(a) and (b), the similarity weight $w^+_{ij}$ of the edge from $x_i$ to $x_j$ in Fig. 2(a) and the similarity weight $w^+_{ji}$ of the edge from $x_j$ to $x_i$ in Fig. 2(b) should be different and should be reasonably determined by considering their own geometrical distributions in the intra-class feature space. For this reason, the weight $W^+_{ij}$ in Eq. (13) between $x_i$ and $x_j$ in the global intra-class graph is designed as the average of the weight $w^+_{ij}$ of the edge from $x_i$ to $x_j$ and the weight $w^+_{ji}$ of the edge from $x_j$ to $x_i$. To directly reflect the geometrical distribution of each point in the graph constructions, we define the geometrical regulator of each point according to its geometrical distribution. The geometrical intra-class regulators $t_i^+$ of $x_i$ and $t_j^+$ of $x_j$ are defined in Eqs. (11) and (12), respectively. Obviously, $\sum_{l=1}^{n^+(c_i)} \|x_i - x_l\|^2$ and $\sum_{l=1}^{n^+(c_j)} \|x_j - x_l\|^2$ can well reflect the geometrical distributions of $x_i$ and $x_j$ in the intra-class feature spaces, respectively. Similarly, in the inter-class feature space shown in Fig. 2(c) and 2(d), the weight $W^-_{ij}$ in Eq. (20) between $x_i$ and $x_j$ in the global inter-class graph is the average of the weight $w^-_{ij}$ of the edge from $x_i$ to $x_j$ and the weight $w^-_{ji}$ of the edge from $x_j$ to $x_i$, and the inter-class geometrical distributions of $x_i$ and $x_j$ are reflected in the regulators $t_i^-$ in Eq. (18) and $t_j^-$ in Eq. (19), respectively. In a similar way, the weights of the edges in the local graph construction are also determined by the local geometrical distribution of each point in its k-neighborhood region, as shown in Section 3.4.

From the point of view of pattern discrimination, the discrimination information of the data is fully considered in the graph constructions. On the one hand, to enhance the power of pattern discrimination among the different classes, the weight of an edge from $x_i$ to $x_j$ is defined as $\frac{1}{2}\exp\left(-\frac{\|x_i - x_j\|^2}{t_i^+}\right)\left[1 + \exp\left(-\frac{\|x_i - x_j\|^2}{t_i^+}\right)\right]$ in the global intra-class graph and as $\frac{1}{2}\exp\left(-\frac{\|x_i - x_j\|^2}{t_i^-}\right)\left[1 - \exp\left(-\frac{\|x_i - x_j\|^2}{t_i^-}\right)\right]$ in the global inter-class graph. In the intra-class graph construction, each weight includes two parts: $\exp\left(-\frac{\|x_i - x_j\|^2}{t_i^+}\right)$ and $\left[1 + \exp\left(-\frac{\|x_i - x_j\|^2}{t_i^+}\right)\right]$. The factor $\exp\left(-\frac{\|x_i - x_j\|^2}{t_i^+}\right)$ represents the similarity from $x_i$ to $x_j$, and the factor $\left[1 + \exp\left(-\frac{\|x_i - x_j\|^2}{t_i^+}\right)\right]$ is used to increase this similarity. In the inter-class graph construction, the weight includes the similarity $\exp\left(-\frac{\|x_i - x_j\|^2}{t_i^-}\right)$ from $x_i$ to $x_j$ and the enhanced dissimilarity $\left[1 - \exp\left(-\frac{\|x_i - x_j\|^2}{t_i^-}\right)\right]$. A similar way of enhancing the pattern discrimination is used in the local graph construction in Section 3.4. On the other hand, the geometrical distribution of each point in the constructed graphs can also enhance the power of pattern discrimination. For example, in Figs. 2(a) and 2(b), $x_i$ could make a different discriminative contribution to classifying $x_j$, and $x_j$ could make a different discriminative contribution to classifying $x_i$, because $x_i$ and $x_j$ have their own different geometrical distributions. Therefore, the graph constructions designed in the proposed method have the good property of preserving the geometrical and discriminant structures of high-dimensional data in graph embedding.

To visually illustrate this good property of preserving the geometry and discrimination of the data in the proposed DGLPGE, Fig. 3 shows examples in which one high-dimensional data set is projected into a two-dimensional subspace by LPP, LDNE, DSPGE and the proposed DGLPGE. Here, we choose LPP, LDNE and DSPGE as the comparative methods because LPP is a typical unsupervised locality preserving graph embedding method (He et al., 2005), LDNE is a supervised locality preserving graph embedding method (Gou & Zhang, 2013) and DSPGE is a sparsity and geometry preserving graph embedding method (Gou et al., 2016). As shown in Fig. 3, the proposed DGLPGE has more power to discover the geometrical and discriminant information of high-dimensional data. A possible reason for this good property is that the proposed method fully considers the geometrical distribution of each point and the discriminant information in its graph constructions.

Fig. 3. Examples of one high-dimensional data in a two-dimensional subspace.

5. Experimental results

In this section, we experimentally verify the proposed DGLPGE method in comparison with the competing feature extraction methods PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE. Meanwhile, we also compare DGLPGE with three classical feature selection methods: Fast Correlation-Based Filter (FCBF) (Yu & Liu, 2004), Predominant Group-Based Variable Neighborhood Search (PGVNS) (García-Torres, Gómez-Vela, Melián-Batista, & Moreno-Vega, 2016) and Discriminative Least Squares Regression (DLSR) (Xiang, Nie, Meng, Pan, & Zhang, 2012). The comparative experiments are conducted on six popular face databases (the ORL, AR, GT, IMM, FEI and Yale face data sets), four real benchmark data sets taken from the UC Irvine Machine Learning Repository (UCI, http://archive.ics.uci.edu/ml) and the UCR Time Series Classification Archive (UCR, www.cs.ucr.edu/~eamonn/time_series_data), and two synthetic data sets. In the experiments, l samples from each class of each database are randomly chosen as the training samples and the remaining


Fig. 4. The face images from one subject on each used face data set.

ones are the testing samples. The random divisions of training and testing samples on each database are carried out ten times, and the experimental results of each method are the average recognition accuracies over the ten runs with 95% confidence. In the low-dimensional subspaces learned by all competing methods, we adopt the nearest neighbor method as the classification decision for simplicity. Since FOLPP, LPP, LDNE and DNE use k-NN graph constructions, the values of k are set as k = l − 1 (Yang et al., 2007). In addition, the sparse models with the l1-norm in SPP and DSPGE are solved by the l1-magic toolbox (http://dsp.rice.edu/cs).

5.1. Data sets

In this subsection, we briefly describe all the data sets used in the experiments. First of all, we describe the six public face databases ORL, AR, GT, IMM, FEI and Yale. The ORL face data set (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html) contains 400 face image samples taken from 40 subjects, each of which has 10 face images. The face images per subject were taken by varying the lighting, facial expressions and facial details at different times. The AR face data set (http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html) contains about 4000 face images from 126 subjects, 70 men and 56 women. We use a subset of AR with 1400 face images from 50 men and 50 women. Each subject has 14 face images taken under different facial expressions and illumination conditions. The FEI face data set (https://fei.edu.br/~cet/facedatabase.html) contains 1400 face images from 200 subjects, each of which has 14 samples. The face

images were taken against a white homogeneous background in an upright frontal position with profile rotation of up to about 180°. The IMM face data set (http://www2.imm.dtu.dk/~aam/datasets/datasets.html) contains 240 face images from 40 subjects, each of which has 6 face images. The GT face data set (http://www.anean.com/research/face.htm) contains 750 face images from 50 subjects, each of which has 15 face images. The face images were taken under varying illumination, different facial expressions and different facial details. The Yale face data set (http://cvc.yale.edu/projects/yalefaces/yalefaces.html) contains 165 face images from 15 subjects, each of which has 11 face images. The face images were taken under varying lighting conditions and facial expressions. Moreover, each image in the experiments is cropped and resized to 32 × 32, and the gray-level values of each image are rescaled to (0, 1). That is, the dimensionality of each image sample is 1024. As an example, Fig. 4 shows the face images of one subject from each face data set. It should be noted that the numbers of training samples per class are set as l = 5 on ORL, l = 7 on AR, l = 7 on GT, l = 2 on IMM, l = 5 on FEI and l = 5 on Yale.

The four real benchmark data sets include two UCI data sets, 'Parkinson's Disease' and 'Smartphone', and two UCR data sets, 'Strawberry' and 'Swedish Leaf'. The main information about the numbers of total samples, training and testing samples, classes and attributes of the four real data sets is given in Table 1. The artificial data sets, each of which has two classes, are called I-10I and Ness (Van Ness, 1980), respectively. Note that the I-10I data set is similar to the I-4I data set (Fukunaga, 2013). The two artificial data sets are randomly generated with singular normal density functions and each class has the same prior probability.
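A minimal sketch of the evaluation protocol stated at the start of this section (random per-class training/testing split, projection to the learned subspace, nearest-neighbor classification, averaged over ten runs); fit_projection is a hypothetical placeholder for any of the compared methods, and the variable names are our own assumptions.

```python
import numpy as np

def evaluate(X, labels, l_per_class, r, runs=10, seed=0):
    """Average nearest-neighbor accuracy over random per-class splits."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(runs):
        train, test = [], []
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            train.extend(idx[:l_per_class])
            test.extend(idx[l_per_class:])
        train, test = np.array(train), np.array(test)
        P = fit_projection(X[:, train], labels[train], r)   # hypothetical: any compared method
        Ytr, Yte = P.T @ X[:, train], P.T @ X[:, test]
        # nearest-neighbor decision in the learned low-dimensional subspace
        d2 = np.sum(Yte**2, axis=0)[:, None] + np.sum(Ytr**2, axis=0)[None, :] - 2 * Yte.T @ Ytr
        pred = labels[train][np.argmin(d2, axis=1)]
        accs.append(np.mean(pred == labels[test]))
    return np.mean(accs), np.std(accs)
```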


Table 1. The used real UCI and UCR data sets.

Data set              Database   Samples   Training samples   Testing samples   Classes   Attributes
Parkinson's Disease   UCI        756       530                226               2         752
Smartphone            UCI        5744      4020               1724              6         561
Strawberry            UCR        983       370                613               2         235
Swedish Leaf          UCR        1125      500                625               15        128

Table 2. The used synthetic data sets.

Data set   Samples   Training samples   Testing samples   Classes   Attributes
I-10I      600       420                180               2         300
Ness       800       560                240               2         400

Specifically, both are generated by using p-dimensional Gaussian distributions as follows. The I-10I data set is formed by the 300-dimensional Gaussian data

$$u_1 = u_2 = 0, \quad \Sigma_1 = I_{300}, \quad \Sigma_2 = 10\, I_{300} \tag{36}$$

where $u_i$ is the mean vector, $\Sigma_i$ is the covariance matrix belonging to class $c_i$, and $I_{300}$ is a $300 \times 300$ identity matrix. The Ness data set is formed by the p-dimensional Gaussian data



$$u_1 = 0, \quad u_2 = \left(\frac{\Delta}{2},\, 0,\, \ldots,\, 0,\, \frac{\Delta}{2}\right)^T, \quad \Sigma_1 = I_p, \quad \Sigma_2 = \begin{bmatrix} I_{p/2} & 0 \\ 0 & \frac{1}{2} I_{p/2} \end{bmatrix} \tag{37}$$

where the parameters $\Delta$ and $p$ can be adjusted, and $I_p$ is a $p \times p$ identity matrix. $\Delta$ is set to 6 in our experiments. The numbers of classes, attributes, total samples, and training and testing samples are displayed in Table 2.

5.2. Experiment 1

This experiment investigates the performance of the proposed method in terms of parameter selection on the six face databases. The proposed DGLPGE method has three parameters q, k and α. When α = 1, DGPGE, as a variant of the proposed DGLPGE, has one parameter q, and when α = 0, DLPGE, as a variant of the proposed DGLPGE, has one parameter k. The values of q range from 1 to 5 with a step of 0.5 in DGLPGE and DGPGE, the values of α range from 0 to 1 with a step of 0.1, and the values of k range from 1 to 13 in DGLPGE and from 1 to 11 in DLPGE with a step of 2. Note that the best classification results in the low-dimensional subspace for each value of q, k and α are reported in the experiments. The recognition accuracies of DGPGE with varying values of the parameter q are displayed in Fig. 5. As can be seen in Fig. 5, DGPGE achieves the best classification performance when q = 1.5 on ORL, GT, AR and FEI, q = 2 on IMM and q = 2.5 on Yale. The recognition accuracies of DLPGE with varying values of the parameter k are shown in Fig. 6. We can see that DLPGE obtains the best classification results when k = 3 on ORL, GT, AR, FEI and Yale, and k = 11 on IMM. This suggests that in DLPGE, using more nearest neighbors could always include more outliers from different classes and thus degrade the classification performance. The recognition accuracies of DGLPGE with varying values of the parameters k and q are displayed in Fig. 7, with the optimal value of α fixed in advance. We can observe from Fig. 7 that the classification performance of the proposed method is influenced by the different values of q and k. The classification performance of DGLPGE is always improved by increasing the value of k

at the different values of q except q = 1. This implies that the locality of the data is very important in graph embedding when the globality of the data is also considered. Surprisingly, when q = 1, the performance of DGLPGE is hardly influenced by the value of k. The reason may be that the global geometry of the data makes the dominant contribution to dimensionality reduction. From Fig. 7, the best classification results of DGLPGE are achieved when q = 1.5 and k = 5 on ORL, GT, AR, FEI and Yale, and q = 2 and k = 7 on IMM. Besides, we can see that good classification performance is nearly always achieved when the value of q is 1.5.

5.3. Experiment 2

This experiment studies the performance of the proposed DGLPGE and its two variants by varying the dimensionality of the low-dimensional subspace and by comparing them with PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE on the six face databases. Note that it is unnecessary to compare the proposed methods with the FCBF, PGVNS and DLSR feature selection methods, since the proposed methods are feature extraction methods. All the classification results are obtained under the optimal values of q and k found in Section 5.2. The recognition accuracies of all the competing methods with varying dimensionalities are shown in Fig. 8. Note that the dimensionalities corresponding to the classification results in Fig. 8 only range from 1 to 40 with a step of 1, as an example. We can see that the recognition accuracy of each competing method quickly increases at small dimensionalities, and then increases slowly or nearly stabilizes as the dimensionality becomes larger on all data sets. Moreover, from the classification results in Fig. 8, we can see that the proposed DGLPGE nearly always obtains the best classification accuracy among all the competing methods, and its accuracy rapidly increases at small dimensionalities and then stays nearly stable when the dimensionality is larger than 20. Besides, the proposed DLPGE and DGPGE, as the two variants of DGLPGE, also achieve competitive classification performance.

5.4. Experiment 3

This experiment studies the performance of the proposed methods on the six face databases in comparison with the competing dimensionality reduction methods PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP, DSPGE, FCBF, PGVNS and DLSR in terms of classification accuracy. Note that the classification result of each method is the highest classification accuracy over the range of dimensionalities, with the regularization parameters needed by each competing method empirically set to be optimal. The best classification results of the competing methods in the learned low-dimensional subspaces are shown in Table 3.
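The parameter study above amounts to a grid search over q, k and α. A hypothetical sketch (the ranges are copied from the text; evaluate_with_params is an assumed wrapper around the evaluation protocol sketched in Section 5.1) is:

```python
import numpy as np
from itertools import product

# hypothetical grid search over the parameter ranges stated in Experiment 1
q_grid = np.arange(1.0, 5.01, 0.5)
k_grid = range(1, 14, 2)
alpha_grid = np.arange(0.0, 1.01, 0.1)

best_params, best_acc = None, -1.0
for q, k, alpha in product(q_grid, k_grid, alpha_grid):
    acc = evaluate_with_params(q=q, k=k, alpha=alpha)   # assumed wrapper, not from the paper
    if acc > best_acc:
        best_params, best_acc = (q, k, alpha), acc
print("best (q, k, alpha):", best_params, "accuracy:", best_acc)
```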


Fig. 5. The classification results of DGPGE with variations of parameter q on each face database.

Fig. 6. The classification results of DLPGE with variations of parameter k on each face database.


Fig. 7. The classification results of DGLPGE with variations of parameters k and q on each face database when the best value of parameter α is given on each face database.

Table 3. The best classification results (%) of the comparative methods with the corresponding standard deviations (stds) and dimensionalities in parentheses on each face data set.

Method   ORL                 GT                  AR                  FEI                 IMM                 Yale
PCA      93.15 ± 1.83 (41)   68.45 ± 2.49 (50)   60.83 ± 1.45 (50)   73.33 ± 1.06 (50)   41.13 ± 1.62 (48)   62.22 ± 2.83 (27)
SPP      89.55 ± 2.49 (34)   62.65 ± 3.11 (21)   86.57 ± 0.90 (49)   76.69 ± 1.68 (50)   48.62 ± 2.77 (49)   67.33 ± 6.97 (34)
LDNE     93.55 ± 2.33 (31)   70.00 ± 1.71 (48)   64.29 ± 1.83 (50)   74.56 ± 1.15 (50)   53.25 ± 3.91 (39)   64.67 ± 1.83
DNE      93.10 ± 1.06 (41)   68.45 ± 1.44 (50)   60.83 ± 1.45 (50)   73.33 ± 1.85 (24)   41.25 ± 1.59 (49)   62.44 ± 2.41 (26)
LDA      94.55 ± 1.69 (50)   73.35 ± 1.05 (49)   71.37 ± 0.93 (50)   83.58 ± 0.65 (10)   46.00 ± 1.22 (42)   70.44 ± 2.9 (17)
LPP      93.10 ± 2.07 (49)   66.50 ± 1.26 (47)   58.09 ± 2.03 (50)   73.80 ± 1.07 (50)   40.63 ± 2.76 (49)   63.33 ± 2.61 (39)
DSPGE    94.20 ± 0.95 (47)   68.50 ± 1.79 (42)   94.03 ± 0.70 (50)   71.23 ± 1.05 (50)   61.63 ± 2.64 (10)   86.00 ± 2.3 (25)
FOLPP    93.50 ± 2.09 (36)   74.15 ± 2.18 (40)   57.20 ± 0.20 (40)   59.61 ± 0.62 (39)   40.75 ± 3.17 (34)   62.44 ± 3.63 (20)
DGPGE    96.35 ± 1.16 (29)   67.00 ± 3.62 (45)   97.37 ± 0.51 (46)   79.37 ± 0.2 (50)    66.63 ± 2.71 (48)   79.11 ± 3.55 (40)
DLPGE    95.10 ± 2.20 (31)   76.55 ± 1.82 (46)   92.06 ± 1.59 (48)   73.10 ± 0.76 (50)   55.10 ± 2.1 (37)    63.33 ± 2.08 (26)
DGLPGE   97.05 ± 1.42 (25)   84.00 ± 0.31 (45)   98.77 ± 0.33 (46)   96.08 ± 0.58 (47)   73.00 ± 2.74 (13)   87.11 ± 2.02 (16)
FCBF     95.69 ± 1.48 (23)   82.02 ± 1.50 (50)   97.36 ± 0.52 (50)   90.15 ± 1.51 (43)   75.86 ± 2.16 (21)   85.41 ± 0.45 (25)
PGVNS    96.12 ± 1.40 (32)   84.72 ± 1.11 (42)   97.30 ± 1.13 (49)   96.98 ± 0.95 (40)   73.79 ± 1.78 (15)   84.39 ± 1.33 (19)
DLSR     94.17 ± 2.58 (25)   85.15 ± 2.06 (45)   96.96 ± 1.65 (50)   92.17 ± 1.03 (50)   73.63 ± 2.32 (20)   86.06 ± 1.65 (23)

It is clear that the proposed DGLPGE always performs best among all the competing feature extraction methods, and DLPGE and DGPGE are often better than PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE. Moreover, the proposed DGLPGE achieves its best classification performance at lower dimensionalities than the other competing feature extraction methods. We can also observe from

Fig. 8 and Table 3 that DGPGE often performs better than DLPGE, and DGLPGE is better than both DGPGE and DLPGE. This implies that the global and local geometries of data are very important in graph embedding. Meanwhile, we can see from Table 3 that the proposed DGLPGE outperforms the FCBF, PGVNS and DLSR feature selection methods on six face databases in most cases.


Fig. 8. The classification results of each method with variations of dimensionalities on each face database.

5.5. Experiment 4

This experiment studies the performance of the proposed methods on the four real benchmark data sets and the two synthetic data sets in comparison with the competing feature extraction methods PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE. The performance of each competing method is measured by the best classification accuracy over the range of dimensionalities. It should be noted that the parameters q, k and α involved in the proposed methods are determined in the same way as in Section 5.2. In the experiments, the three parameters of DGLPGE are set as q = 1.5, k = 9 and α = 0.5 on Parkinson's Disease, q = 1.5, k = 5 and α = 0.4 on Smartphone, q = 2, k = 5 and α = 0.4 on Strawberry, q = 1.5, k = 7 and α = 0.5 on Swedish Leaf, q = 1.5, k = 5 and α = 0.4 on I-10I, and q = 2.5, k = 5 and α = 0.7 on Ness. The parameter q of DGPGE is set as q = 1.5 on Parkinson's Disease, Smartphone, Swedish Leaf, I-10I and Ness, and q = 1 on Strawberry. The parameter k of DLPGE is set as k = 5 on


Table 4. The best classification results (%) of the comparative methods with the corresponding standard deviations (stds) and dimensionalities in parentheses on the real and synthetic data sets.

Method   Parkinson's Disease   Smartphone          Strawberry          Swedish Leaf        I-10I               Ness
PCA      72.23 ± 0.76 (38)     83.18 ± 1.36 (43)   89.18 ± 1.59 (50)   90.32 ± 2.39 (52)   52.32 ± 1.48 (45)   95.00 ± 1.81 (37)
SPP      71.62 ± 1.05 (41)     82.31 ± 2.17 (40)   87.16 ± 2.17 (48)   91.93 ± 1.53 (55)   57.22 ± 2.12 (42)   96.13 ± 2.14 (44)
LDNE     74.82 ± 1.46 (32)     85.97 ± 0.99 (45)   92.23 ± 1.55 (50)   88.38 ± 2.19 (46)   54.45 ± 2.29 (45)   97.91 ± 2.97 (43)
DNE      67.06 ± 2.17 (40)     81.43 ± 1.40 (50)   90.87 ± 0.89 (55)   89.67 ± 1.05 (51)   51.11 ± 1.90 (50)   97.08 ± 1.96 (38)
LDA      70.15 ± 1.41 (45)     87.65 ± 1.62 (50)   93.24 ± 1.36 (52)   87.42 ± 1.78 (46)   57.36 ± 2.82 (51)   98.75 ± 2.78 (40)
LPP      73.61 ± 1.55 (40)     79.26 ± 0.88 (50)   83.51 ± 0.76 (50)   91.61 ± 1.46 (50)   56.67 ± 1.20 (49)   96.34 ± 1.53 (46)
DSPGE    74.20 ± 0.84 (31)     80.75 ± 1.13 (46)   88.17 ± 2.02 (50)   92.20 ± 1.71 (50)   57.15 ± 1.61 (48)   98.33 ± 2.80 (45)
FOLPP    74.00 ± 1.59 (40)     82.78 ± 1.19 (43)   84.58 ± 1.21 (47)   92.54 ± 2.07 (53)   57.79 ± 2.11 (44)   99.27 ± 2.27 (40)
DGPGE    73.37 ± 1.49 (45)     78.54 ± 2.40 (45)   84.12 ± 1.26 (56)   85.76 ± 1.62 (50)   53.18 ± 1.85 (50)   98.78 ± 1.72 (42)
DLPGE    68.11 ± 1.26 (41)     81.59 ± 1.92 (48)   80.33 ± 0.67 (49)   77.63 ± 1.21 (50)   55.43 ± 2.16 (47)   91.42 ± 2.41 (38)
DGLPGE   75.05 ± 1.30 (40)     90.23 ± 1.61 (45)   94.59 ± 1.48 (50)   93.66 ± 1.43 (48)   58.64 ± 1.50 (53)   99.12 ± 2.59 (39)

Table 5. The running time (seconds) of each method for feature extraction on ORL, AR and GT.

Data set   PCA    SPP      LDNE    DNE     LDA    LPP    DSPGE    FOLPP   DGLPGE
ORL        0.01   5.77     3.99    3.15    0.04   0.25   9.27     0.17    6.31
AR         0.08   349.41   39.10   38.20   0.06   7.14   398.41   6.65    74.41
GT         0.02   30.54    9.20    11.07   0.05   0.63   39.20    0.60    21.56

Parkinson's Disease and Swedish Leaf, k = 3 on Smartphone, Strawberry and Ness, and k = 9 on I-10I. The best classification accuracies (%) of the comparative feature extraction methods, with the corresponding standard deviations (stds) and dimensionalities in parentheses, on the real and synthetic data sets are shown in Table 4. It is obvious that the proposed DGLPGE nearly always performs best among all the competing methods, and its variants DGPGE and DLPGE achieve competitive classification performance compared with PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE. Furthermore, we can see that the proposed DGLPGE is better than DGPGE and DLPGE. This experimental fact also means that the globality and locality of the data play an important role in graph embedding.

5.6. Experiment 5

This experiment studies only the efficiency of the competing feature extraction methods DGLPGE, PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE on ORL, GT and AR, because the proposed DGLPGE is a feature extraction method. The efficiency of each method is measured by the running time of the feature extraction phase. Note that the numbers of samples of the three face databases used for feature extraction are the numbers of the corresponding training samples preset as in Section 5.1. Our experiments are carried out on a Windows 7 64-bit operating system with 16 GB of memory, an Intel(R) Xeon(R) CPU E3-1231 v3 @ 3.40 GHz (8 CPUs) processor, and Matlab 2016b. The running time of the feature extraction processes of DGLPGE, PCA, LDA, LPP, DNE, LDNE, SPP, FOLPP and DSPGE on the ORL, GT and AR face databases is shown in Table 5. We can see that both SPP and DSPGE need considerable time for feature extraction, because both have to solve the l1-norm minimization problem in Eq. (4), for the sparse graph construction in SPP and for the sparse reconstruction in DSPGE.

Moreover, since the graph constructions have to be carried out, the graph embedding methods LPP, DNE, LDNE and DGLPGE spend more time. Also, since the proposed DGLPGE has two global graph constructions and one local graph construction, its running time is larger than that of LPP, DNE and LDNE.

As illustrated by the experimental results above, we can conclude that the proposed DGLPGE method has more power to preserve the geometrical and discriminant structures of high-dimensional data in the low-dimensional subspace, thanks to the newly designed informative graph constructions. Furthermore, from the theoretical and experimental analyses, we can summarize the following important observations: 1) the geometrical and discriminant distribution of each data point considered in the graph constructions is very useful for graph embedding in dimensionality reduction; 2) integrating the global and local geometrical and discriminant structures of data can improve the performance of graph embedding; 3) fully utilizing the class discrimination information in the graph constructions can enhance the power of pattern discrimination in the low-dimensional subspace.

6. Conclusions

In this article, we propose a novel graph embedding method called discriminative globality and locality preserving graph embedding (DGLPGE). In the proposed method, we fully consider the locality, globality and class discrimination information of high-dimensional data with the purpose of improving the power of preserving the geometry of the data and the pattern discrimination in the low-dimensional subspace. In the constructed local and global graphs, we consider the geometrical distribution of each data point to determine the weight of each edge. In the graph constructions, the class information is fully employed in assigning the weight of each edge. To demonstrate the effectiveness of the proposed DGLPGE,


we give theoretical and experimental analyses in detail. The experiments on six popular face databases, four real benchmark data sets and two synthetic data sets show that the proposed method outperforms the state-of-the-art competing dimensionality reduction methods. In future work, we plan to extend other graph embedding methods using the graph constructions of the proposed method, to use kernel tricks to extend the proposed method to nonlinear complex data, and to apply the proposed method in practical applications such as image classification and object recognition.

Conflict of interests

All the authors declare that there are no conflicts of interest regarding the publication of this article.

Credit authorship contribution statement

Jianping Gou: Methodology, Writing - review & editing. Yuanyuan Yang: Writing - review & editing. Zhang Yi: Writing - review & editing. Jiancheng Lv: Writing - review & editing. Qirong Mao: Writing - review & editing. Yongzhao Zhan: Writing - review & editing.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China [Grant Nos. 61976107, 61502208, U1836220 and 61672268], the Natural Science Foundation of Jiangsu Province of China [Grant No. BK20150522], the International Postdoctoral Exchange Fellowship Program of the China Postdoctoral Council [Grant No. 20180051], the Research Foundation for Talented Scholars of Jiangsu University [Grant No. 14JDG037], the China Postdoctoral Science Foundation [Grant No. 2015M570411], and the Open Foundation of the Artificial Intelligence Key Laboratory of Sichuan Province [Grant No. 2017RYJ04].

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373–1396. doi:10.1162/089976603321780317.
Bengio, Y., Paiement, J., Vincent, P., Dellallaeu, O., Roux, L., & Quimet, M. (2004). Out-of-sample extensions for LLE, isomap, MDS, eigenmaps, and spectral clustering. In Proceedings of the 16th international conference on neural information processing systems, Whistler, British Columbia, Canada (pp. 177–184).
Cai, H., Zheng, V., & Chang, K. (2017). A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge & Data Engineering, 30(9), 1616–1637. doi:10.1109/TKDE.2018.2807452.
Ding, C., & Zhang, L. (2015). Double adjacency graphs-based discriminant neighborhood embedding. Pattern Recognition, 48, 1734–1742. doi:10.1016/j.patcog.2014.08.025.
Fan, Z., Xu, Y., Zuo, W., Yang, J., Tang, J., Lai, Z., et al. (2014). Modified principal component analysis: An integration of multiple similarity subspace models. IEEE Transactions on Neural Networks & Learning Systems, 25(8), 1538–1552. doi:10.1109/TNNLS.2013.2294492.
Fang, X., Han, N., Wu, J., & Xu, Y. (2018). Approximate low-rank projection learning for feature extraction. IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5228–5241. doi:10.1109/TNNLS.2018.2796133.
Fukunaga, K. (2013). Introduction to statistical pattern recognition. Elsevier. doi:10.1002/0470854774.ch1.
Gao, Q., Liu, J., Zhang, H., Gao, X., & Li, K. (2013). Joint global and local structure discriminant analysis. IEEE Transactions on Information Forensics and Security, 8(4), 626–635. doi:10.1109/TIFS.2013.2246786.
García-Torres, M., Gómez-Vela, F., Melián-Batista, B., & Moreno-Vega, J. M. (2016). High-dimensional feature selection via feature grouping: A variable neighborhood search approach. Information Sciences, 326, 102–118.
doi:10.1016/j.ins.2015. 07.041. Gou, J., Du, L., Cheng, K., & Cai, Y. (2016). Discriminative sparsity preserving graph embedding. evolutionary computation. In IEEE Congress on Evolutionary Computation, Vancouver, BC, Canada (pp. 4250–4257). doi:10.1109/CEC.2016.7744330. 2016 Gou, J., & Zhang, Y. (2013). Locality-based discriminant neighborhood embedding. The Computer Journal, 56(9), 1063–1082. doi:10.1093/comjnl/bxs113.

Gou, J., Zhang, Y., Zhang, D., Zhan, Y., Shen, X., & Du, L. (2018). Sparsity and geometry preserving graph embedding for dimensionality reduction. IEEE Access, 6, 75748–75766. doi:10.1109/ACCESS.2018.2884027.
Goyal, P., & Ferrara, E. (2018). Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151, 78–94. doi:10.1016/j.knosys.2018.03.022.
He, X., Yan, S., Hu, Y., Niyogi, P., & Zhang, H. (2005). Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(3), 328–340. doi:10.1109/TPAMI.2005.55.
Houari, R., Bounceur, A., Kechadi, M. T., Tari, A. K., & Euler, R. (2016). Dimensionality reduction in data mining: A copula approach. Expert Systems with Applications, 64, 247–260. doi:10.1016/j.eswa.2016.07.041.
Huang, P., Li, T., Gao, G., Yao, Y., & Yang, G. (2018). Collaborative representation based local discriminant projection for feature extraction. Digital Signal Processing, 76, 84–93. doi:10.1016/j.dsp.2018.02.009.
Jain, A. K., Duin, R. P. W., & Mao, J. (2000). Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis & Machine Intelligence, 22(1), 4–37. doi:10.1109/34.824819.
Kan, M., Shan, S., Zhang, H., Lao, S., & Chen, X. (2016). Multi-view discriminant analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1), 188–194. doi:10.1109/TPAMI.2015.2435740.
Kim, K. (2018). An improved semi-supervised dimensionality reduction using feature weighting: Application to sentiment analysis. Expert Systems with Applications, 109, 49–65. doi:10.1016/j.eswa.2018.05.023.
Lai, Z., Mo, D., Wong, W., Xu, Y., Miao, D., & Zhang, D. (2018). Robust discriminant regression for feature extraction. IEEE Transactions on Cybernetics, 1–13. doi:10.1109/TCYB.2017.2740949.
Lai, Z., Xu, Y., Chen, Q., Yang, J., & Zhang, D. (2014). Multilinear sparse principal component analysis. IEEE Transactions on Neural Networks and Learning Systems, 25(10), 1942–1950. doi:10.1109/TNNLS.2013.2297381.
Lai, Z., Xu, Y., Yang, J., Shen, L., & Zhang, D. (2017). Rotational invariant dimensionality reduction algorithms. IEEE Transactions on Cybernetics, 47(11), 3733–3746. doi:10.1109/TCYB.2016.2578642.
Li, W., & Du, Q. (2016). Laplacian regularized collaborative graph for discriminant analysis of hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing, 54(12), 7066–7076. doi:10.1109/TGRS.2016.2594848.
Li, W., Feng, F., Li, H., & Du, Q. (2018). Discriminant analysis-based dimension reduction for hyperspectral image classification: A survey of the most recent advances and an experimental comparison of different techniques. IEEE Geoscience and Remote Sensing Magazine, 6(1), 15–34. doi:10.1109/MGRS.2018.2793873.
Li, X., Chen, M., Nie, F., & Wang, Q. (2017). Locality adaptive discriminant analysis. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia (pp. 2201–2207). doi:10.24963/ijcai.2017/306.
Lu, D., Ding, C., Xu, J., & Wang, S. (2018). Hierarchical discriminant analysis. Sensors, 18(1), 279. doi:10.3390/s18010279.
Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis & Machine Intelligence, 23(2), 228–233. doi:10.1109/34.908974.
Qiao, L., Chen, S., & Tan, X. (2010). Sparsity preserving projections with applications to face recognition. Pattern Recognition, 43(1), 331–341. doi:10.1016/j.patcog.2009.05.005.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323–2326. doi:10.1126/science.290.5500.2323.
Van Ness, J. (1980). On the dominance of non-parametric Bayes rule discriminant algorithms in high dimensions. Pattern Recognition, 12(6), 355–368. doi:10.1016/0031-3203(80)90012-6.
Wang, J., Zhao, R., Wang, Y., Zheng, C., Kong, J., & Yi, Y. (2017a). Locality constrained graph optimization for dimensionality reduction. Neurocomputing, 245, 55–67. doi:10.1016/j.neucom.2017.03.046.
Wang, R., Nie, F., Hong, R., Chang, X., Yang, X., & Yu, W. (2017b). Fast and orthogonal locality preserving projections for dimensionality reduction. IEEE Transactions on Image Processing, 26(10), 5019–5030. doi:10.1109/TIP.2017.2726188.
Wang, S., Ding, C., Hsu, C. H., & Yang, F. (2018). Dimensionality reduction via preserving local information. Future Generation Computer Systems. doi:10.1016/j.future.2018.01.016.
Wen, J., Xu, Y., Li, Z., Ma, Z., & Xu, Y. (2018). Inter-class sparsity based discriminative least square regression. Neural Networks, 102, 36–47. doi:10.1016/j.neunet.2018.02.002.
Wright, J., Yang, A., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227. doi:10.1109/TPAMI.2008.79.
Wu, J., Qiu, S., Kong, Y., Jiang, L., Chen, Y., Yang, W., et al. (2018). PCANet: An energy perspective. Neurocomputing, 313, 271–287. doi:10.1016/j.neucom.2018.06.025.
Xiang, S., Nie, F., Meng, G., Pan, C., & Zhang, C. (2012). Discriminative least squares regression for multiclass classification and feature selection. IEEE Transactions on Neural Networks and Learning Systems, 23(11), 1738–1754. doi:10.1109/TNNLS.2012.2212721.
Xu, L., Iosifidis, A., & Gabbouj, M. (2018). Weighted linear discriminant analysis based on class saliency information. In 25th IEEE International Conference on Image Processing, Athens, Greece (pp. 2306–2310). doi:10.1109/ICIP.2018.8451614.
Xu, Y., Zhong, A., Yang, J., & Zhang, D. (2010). LPP solution schemes for use with face recognition. Pattern Recognition, 43(12), 4165–4176. doi:10.1016/j.patcog.2010.06.016.
Yan, S., Xu, D., Zhang, B., Zhang, H., Yang, Q., & Lin, S. (2007). Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis & Machine Intelligence, 29(1), 40–51. doi:10.1109/TPAMI.2007.12.

Yang, J., Frangi, A. F., Yang, J. Y., Zhang, D. D., & Jin, Z. (2005). KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(2), 230–244. doi:10.1109/TPAMI.2005.33.
Yang, J., Zhang, D., Yang, J., & Niu, B. (2007). Globally maximizing, locally minimizing: Unsupervised discriminant projection with applications to face and palm biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4), 650–664. doi:10.1109/TPAMI.2007.1008.
Yang, W., Wang, Z., & Sun, C. (2015). A collaborative representation based projections method for feature extraction. Pattern Recognition, 48(1), 20–27. doi:10.1016/j.patcog.2014.07.009.
Yang, X., Liu, G., Yu, Q., & Wang, R. (2018). Stable and orthogonal local discriminant embedding using trace ratio criterion for dimensionality reduction. Multimedia Tools and Applications, 77(3), 3071–3081. doi:10.1007/s11042-017-5022-1.
Yin, J., Lai, Z., Zeng, W., & Wei, L. (2018). Local sparsity preserving projection and its application to biometric recognition. Multimedia Tools and Applications, 77(1), 1069–1092. doi:10.1007/s11042-016-4338-6.


Yin, J., Wei, L., Song, M., & Zeng, W. (2016). Optimized projection for collaborative representation based classification and its applications to face recognition. Pattern Recognition Letters, 73, 83–90. doi:10.1016/j.patrec.2016.01.012.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 1205–1224. doi:10.1023/B:JODS.0000045365.56394.b4.
Yuan, S., Mao, X., & Chen, L. (2019). Sparsity regularization discriminant projection for feature extraction. Neural Processing Letters, 49(2), 539–553. doi:10.1007/s11063-018-9842-4.
Zhang, J., Wang, J., & Cai, X. (2017). Sparse locality preserving discriminative projections for face recognition. Neurocomputing, 260, 321–330. doi:10.1016/j.neucom.2017.04.051.
Zhang, L., Yang, M., & Feng, X. (2012). Sparse representation or collaborative representation: Which helps face recognition? In IEEE International Conference on Computer Vision, Barcelona, Spain (pp. 471–478). doi:10.1109/ICCV.2011.6126277.
Zhang, W., Xue, X., Lu, H., & Guo, Y. (2006). Discriminant neighborhood embedding for classification. Pattern Recognition, 39(11), 2240–2243. doi:10.1016/j.patcog.2006.05.011.