- Email: [email protected]

PII: DOI: Reference:

S0167-739X(17)31969-6 https://doi.org/10.1016/j.future.2018.01.016 FUTURE 3923

To appear in:

Future Generation Computer Systems

Received date : 18 September 2017 Revised date : 5 December 2017 Accepted date : 5 January 2018 Please cite this article as: S. Wang, C. Ding, C. Hsu, F. Yang, Dimensionality reduction via preserving local information, Future Generation Computer Systems (2018), https://doi.org/10.1016/j.future.2018.01.016 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

FUTURE GENERATION COMPUTER SYSTEMS

1

Dimensionality Reduction via Preserving Local Information Shangguang Wang, Chuntao Ding, Ching-Hsien Hsu, and Fangchun Yang Abstract—Dimensionality reduction is an important technique for multimedia data in smart cities. As a significant type of multimedia data, images are being processed frequently using such technique. It is worth noting that finding neighbors is an integral part of graph embedding algorithms in dimensionality reduction. In this paper, we present an effective and efficient metric function, named local similarity preserving (LSP), which can preserve the similarity information of each sample to its homogeneous and heterogeneous neighbors. The proposed LSP function is helpful to enhance the discriminative capability of subsequent feature extractors. Based on LSP function, we also propose two novel algorithms, local similarity preserving discriminant (LSPD) algorithm, which can preserve the local similarity information and LSPD+ algorithm, which can preserve the local similarity and geometric structure information, respectively. The experimental results on digit visualization and face recognition demonstrate the effectiveness and efficiency of the proposed algorithms. Index Terms—Multimedia data, Dimensionality reduction, graph embedding, local similarity, geometric structure.

F

1

I NTRODUCTION

I

N many applications, many multimedia data in smart cities are characterized by hundreds or even thousands of features, such as images. However, directly using of high dimensional data may significantly degrade the performance of many applications. Dimensionality reduction, as a crucial technique, has yielded a surge of interest research. It is worth noting that finding neighbors is an essential step of graph embedding algorithms for dimensionality reduction. Since Roweis et al. [1] find k nearest neighbors to reconstruct each sample xi to preserve the local geometric structure, which could seek a low-dimensional embedding of high-dimensional inputs, there are many proposed graph embedding algorithms for dimensionality reduction, which reconstruct the samples by its neighbors or preserve the local structure by finding nearest neighbors, such as [2], [3], [4], [5], [6], [7]. However, they are all unsupervised algorithms, which cannot work well in the task of classification. To tackle the unsupervised problem, many algorithms have been proposed [8], [9], [10], [11], [12], [13], such as [13] proposed a new multi-view discriminative manifold embedding method. They all can preserve geometric structure of samples. However, they ignore the similarity information, which is important in the task of visualization and classification. Fig. 1 gives a vivid diagram to illustrate the procedure of graph embedding. In Fig. 1, there are two ways finding appropriate neighbors to construct adjacency graphs: Fig. 1(a1), which is the traditional way constructing adjacency graph; and Fig. 1(b1), which is the ideal way constructing

•

•

S. Wang, C. Ding and F. Yang are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China. E-mail: {sgwang, ctding, fcyang}@bupt.edu.cn Ching-Hsien Hsu is with the Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan, 40724, R.O.C.. Ching-Hsien Hsu is the corresponding author. E-mail: robe[email protected]

Traditional Way

Result

(a1)

(a2)

Original data

Ideal Way

How to find appropriate neighbors? (b1)

Result

(b2)

Fig. 1. The procedure of graph embedding. (a1) traditional way, (a2) the result of constructing the adjacency graph with the traditional way. (b1) ideal way, (b2) the result of constructing the adjacency graph with the ideal way. However, the result of the traditional way makes the samples close if and only if they are close in the high-dimensional space, and the ideal result should be that if the samples are same class, they should be as close as possible in the subspace and whether they are close or not in the high-dimensional space.

adjacency graph. And we also have two results: Fig. 1(a2), which is the result of constructing the adjacency graph with the traditional way; and Fig. 1(b2), which is the result of constructing the adjacency graph with the ideal way. Our purpose is to make the representations of these two samples as close as possible in the new space whether they are close or not in the high-dimensional space. However, traditional way finds nearest neighbors, measured by Euclidean, constructing adjacency graph. Thus, if two same class samples have a far distance, they cannot establish a relationship, so that they will not be close in the subspace. In the task of classification, we should give priority to the distant samples in the same class, which will make the classification ability of the algorithm seriously deteriorate if we ignore them. The fundamental challenge is how to find the appropriate

FUTURE GENERATION COMPUTER SYSTEMS

2

R ELATED W ORK

Our work is inspired by graph embedding for dimensionality reduction. Therefore, we briefly review some classical algorithms, which are closely related to our work. Many graph embedding algorithms are proposed in recent years, such as discriminant neighborhood embedding [22], exponential local discriminant embedding [25], orthogonal neighborhood preserving projections [28], locality sensitive discriminant analysis [32], marginal fisher analysis [23], [24], local feature discriminant projection [30], double adjacency graphs-based discriminant neighborhood embedding [26] and discriminant neighborhood structure graphs [31]. Here, we analyze three typical algorithms, [22], [23], [24] and [26]. Zhang et al. [22] proposed the discriminant neighborhood embedding (DNE) supposing that multi-class samples in high-dimensional space tend to move due to local intraclass attraction or inter-class repulsion, and the optimal embedding from the sample of view of classification is discovered consequently. Yan et al. [23], [24] proposed the marginal fisher analysis (MFA) adopting two adjacency graphs to preserve the geometric structure. The algorithm firstly constructs two adjacency graphs, the intrinsic graph and the penalty graph. The intrinsic graph characterizes the intra-class compactness and connects each sample with its neighbors of the same class, and the penalty graph connects the marginal samples and characterizes the inter-class separability. Ding and Zhang [26] proposed the double adjacency graphs-based discriminant neighborhood embedding (DAG), which makes each sample linked to its homogeneous and heterogeneous neighbors respectively by constructing two adjacency graphs. As a consequence, because of the balance links, neighbors belonging to same class are compact while neighbors belonging to different classes become separable in the subspace. Thus, DAG can keep the local structure of a given data and find a good projection matrix.

Before constructing adjacency graphs, these algorithms ignore the category information when finding neighbors and only use Euclidean (traditional metrics) as a measure to find the nearest neighbors. Therefore, these algorithms just preserve the local structure and ignore the similarity information. Nevertheless, in the task of visualization or classification, similarity information is also important.

3

LSP FOR G RAPH E MBEDDING A LGORITHMS

In this section, a local similarity preserving (LSP) function for finding appropriate neighbors is proposed to preserve the local similarity information, and we also present the details of the proposed algorithms. 3.1

Local Similarity Preserving Function

In this section, we introduce our LSP Function. The basic idea of our function is designing an efficient similarity measure function for choosing the appropriate neighbors to construct the adjacency graphs. Given a set of samples X=[x1 ,...,xn ]∈Rd×n , where d is the dimensionality of samples, n is the number of samples, the corresponding similarity measure function is given as follows: ρ × exp (ρ + 1) , yi =yj G(xi , xj ) = (1) −ρ × exp (ρ − 1) , yi 6= yj −||xi −xj ||2 where ρ = exp , Eq. 1 illustrates the similarity β measure function for preserving local similarity information. For more intuitive display, we give a vivid display in Fig. 2.

8

yi=yj

7

yi ≠ yj

6 5

G(xi , xj )

neighbors to establish the relationship to achieve the same result of ideal way. Inspired by recent advances [14], [15], [16], [17], [18], [19], [20], [21], in this paper, we propose a local similarity preserving (LSP) function for finding appropriate neighbors, which can preserve the similarity information of the samples. Based on LSP function, we also propose two algorithms, LSPD algorithm, which finds the appropriate neighbors and preserves the similarity information by using LSP function; and LSPD+ algorithm, which also preserves the geometric structure of data based on LSPD. Extensive experiments demonstrate that the effectiveness and efficiency of our algorithms comparing with the state-of-the-art algorithms. The main contribution of our work is that we propose a local similarity preserving function and two algorithms, LSPD and LSPD+ based on LSP function. Both LSPD and LSPD+ outperform the related state-of-the-art algorithms in digits and face data sets. Moreover, we analyze the experiment results, which demonstrate the effectiveness and efficiency of our proposed algorithms.

2

4 3 2 1 0 −1

0

0.5

1

1.5

2

2.5

3

||xi − xj ||2 /β

Fig. 2. Similarity degrees of the intra-class yi = yj and the inter-class yi 6= yj .

We construct adjacency graphs by using LSP function as a measure to find appropriate neighbors. There are three main advantages as follows: • Taking advantage of category labels. In the tasks of visualization or classification, we need to make neighbors from the same class compact while neighbors from different classes become separable in the subspace. Different classes samples are treated differently by using LSP function. With regard to heterogeneous samples, if they are close, they are negative about classification. In

FUTURE GENERATION COMPUTER SYSTEMS

3

this case, our priority is to choose those samples. As the same purpose, our priority is to choose those samples which have distant homogeneous samples. • Different weight functions for different categories. Fig. 2 illustrates that the maximum value of similarity is e2 , and the similarity value decreases with the increase of Euclidean distance, and the minimum value is zero when they belong to the same class. The maximum value of similarity is 0, and the similarity value increases with the increase of distance, and the minimum value is -1 when they belong to the different classes. • Preserving the similarity information. Our purpose is that samples are close in the high-dimensional space, and the representations of these two samples in a new space are close to each other if they belong to the same class, and separable if they belong to different classes. To achieve this purpose, we need to preserve similarity information between samples rather than preserving geometric structure simply. We use Euclidean and label information building the LSP function, which not only exploits the label information, but also preserves the similarity information among samples. 3.2

How to find neighbors based on LSP?

Fig. 3 displays two graphs, the intra-class graph, which illustrates the intra-class sample adjacency relationship and characterizes the intra-class compactness; and the interclass, which illustrates the inter-class sample adjacency relationship and characterizes the inter-class separability. From Fig. 3, we can see that in the intra-class graph, the sample establishes relationships with the samples that are far away from it, and not establishes relationships with its close samples. This is different from the traditional way, which establishes a relationship between two same class samples if they are very close. However, in the intra-class samples, those distant samples are more important, which determine the quality of the final performance. Intra-class Compactness

Inter-class Separability Class-1

Class-2 Intra-class Graph

Inter-class Graph

Fig. 3. The adjacency relationships of the LSP function. Note that we choose those have a maximum distance samples for homogeneous samples before constructing the left adjacency graph, and we choose those have a minimum distance samples for heterogeneous samples before constructing the right adjacency graph.

3.3

Graph Embedding Algorithms Using LSP

Based on LSP function, in this section, two novel graph embedding algorithms for dimensionality reduction are proposed, which find the appropriate neighbors by LSP function. Under the LSP function, first, we propose a local similarity preserving discriminant algorithm (LSPD), which

is an extension of DAG algorithm. Second, based on LSPD algorithm, in order to achieve the purpose of preserving the geometric structure, we also propose an enhanced algorithm of LSPD, called LSPD+ , which not only preserves local similarity information, but also keeps the geometric structure of data. 3.3.1

LSPD

Let {(xi , yj )}N i=1 be a set of training samples of a given multiclass classification task, where xi ∈ Rd and yi ∈ {1, 2, · · · , c} is the class label of xi . Similar to DAG, the goal of LSPD is to find a projection matrix that can project samples from high-dimensional space into low-dimensional space, and in the low-dimensional space, samples from the same class are compact while samples from different classes become separable. Our goal is to improve the classification performance by finding the appropriate neighbors according to the LSP function to learn a projection matrix. Then, we construct two adjacency graphs to preserve the similarity information. Let Fw and Fb be the intra-class and inter-class adjacency matrices, respectively. For a sample xi , let the set of its k1 homogeneous and k2 heterogeneous neighbors (which be computed by the LSP function) be NSkw1 (xi ) and NSkb2 (xi ), respectively. Intra-class adjacency matrix Fw is defined as: +1, xi ∈ NSkw1 (xj ) or xj ∈ NSkw1 (xi ) . (2) Fijw = 0, otherwise Inter-class adjacency matrix Fb is +1, xi ∈ NSkb2 (xj ) or xj ∈ NSkb2 (xi ) Fijb = . 0, otherwise

(3)

By following the graph embedding formulation, intraclass scatter is characterized by the term P P W (Ps ) = ||Ps T xi − Ps T xj ||2 Fijw

=

xj ∈NSkw (xi ) xi ∈NSkw (xj ) 1 1 2Ps T X(Dw −Fw )XT Ps ,

(4) where Dw is a diagonal matrix and its entries are column P w sum of Fw , i.e., Dii = j Fijw , Ps is the projection matrix. Inter-class scatter is characterized by the term P P B(Ps ) = ||Ps T xi − Ps T xj ||2 Fijb xj ∈NSkb (xi ) xi ∈NSkb (xj ) , 2 2 = 2Ps T X(Db −Fb )XT Ps (5) where Db is a diagonal matrix and its entries are column P b sum of Fb , i.e., Dii = j Fijb . To implement the intra-class attraction and inter-class repulsion, we need to maximize the margin between interclass scatter and intra-class scatter, i.e., max T(Ps ) = B(Ps ) − W (Ps ) , (6) s.t. Ps T Ps =I where T(Ps ) measures the total differences from xi to its inter-class neighbors and intra-class neighbors. Given the constraint Ps T Ps =I, the columns of Ps are orthogonal, which can enhance the discriminant ability. We can rewrite

FUTURE GENERATION COMPUTER SYSTEMS

4

the objective function (6) in the form of trace by following some simple algebraic steps:

T(Ps ) = B(Ps ) − W (Ps ) = 2tr{Ps T X(Db − Fb − Dw + Fw )XT Ps } , (7) = 2tr{Ps T XMs XT Ps } where Ms = Db −Fb −Dw +Fw . The optimization problem (7) can be rewritten as ( max tr{Ps T XMs XT Ps } Ps , (8) s.t. Ps T Ps =I In order to solve the formulation (8), we give a theorem. Theorem 1. if Ms is a real symmetric matrix, the optimization problem (8) is equivalent to the eigendecomposition problem of the matrix XMs XT . Assume that λ1 ≥ λi−1 ≥ λi ≥ λd are the eigenvalues of XMs XT and Psi is the corresponding eigenvectors of the eigenvalue λi . The optimal projection matrix Ps is only composed of eigenvectors corresponding to the top r largest positive eigenvalues, or (9)

Ps = [Psi , · · · , Psr ].

Proof. Since Db , Fb , Dw , Fw are all real symmetric matrices, according to [33], Ms = Db − Fb − Dw + Fw is also a real symmetric matrix. To maximize the problem (8), the Lagrange Multiplier technique is adopted. First, we construct the following Lagrangian function:

L(Psi , λi ) =

r X i=1

(PTsi XMs XT Psi − λi (PTsi Psi − 1)),

(10) where λi , i = 1, . . . , r are Lagrange Multipliers. Then, the partial derivative of L with respect to Psi is obtained by

∂L(Psi , λi ) = XMs XT Psi − λi Psi , ∂Psi

(11)

Algorithm 1 LSPD. N

Input: A training set {(xi , yi )}i=1 , and the dimensionality of discriminant subspace r; Output: Projection matrix Ps ; 1: Construct the similarity measure matrix G according to Eq.(1); 2: Construct the intra-class graph Fw by Eq.(2) and interclass graph Fb by Eq.(3) based on step 1; 3: Eigendecompose the matrix, where XMs XT ; 4: Choose the r largest eigenvalues corresponding eigenvectors Ps = [Psi , · · · , Psr ]. 3.3.2 LSPD+ In order to achieve the purpose of preserving the geometric structure, we propose a novel dimensionality reduction algorithm, named LSPD+ . Similar to LSPD, LSPD+ finds neighbors by LSP function, and also constructs two adjacency graphs. Let Fgw and Fgb be the intra-class and inter-class adjacency matrices, respectively. Intra-class adjacency matrix Fgw is defined as G(xi , xj ), xi ∈ NSkw1 (xj ) or xj ∈ NSkw1 (xi ) . Fijgw = 0, otherwise (15) Inter-class adjacency matrix Fgb is −G(xi , xj ), xi ∈ NSkb2 (xj ) or xj ∈ NSkb2 (xi ) gb . Fij = 0, otherwise (16) Similar to LSPD algorithm, the corresponding problem is: d X max PTgi XMg XT Pgi , (17) i=1

where Mg is similar to Ms , and Pgi is the projection matrix. The details about LSPD+ are given in Algorithm 2. Algorithm 2 LSPD+ .

let (11) be zero, we yield

N

T

(12)

XMs X Pi = λi Pi , so the objective function of (8) could be rewritten as

tr{PTs XMs XT Ps } = =

r P

i=1 r P

i=1

PTsi XMs XT Psi PTsi λi Psi =

r P

i=1

(13)

λi .

Since the matrix XMs X is non-positive definite, the eigenvalues of XMs XT could be positive, negative or zero. T In order to maximize tr{XM Pr s X }, we should choose Tall positive eigenvalues, or i=1 λ . Thus, when tr{XMs X } Pr i achieves its maximal value i=1 λi , the optimal solution to (8) must be Ps = [Psi , · · · , Psr ]. (14) T

This completes the proof. From Theorem 1, we know that the optimal projection matrix Ps is only composed of eigenvectors corresponding to the top r largest positive eigenvalues. The details about LSPD are given in Algorithm 1.

Input: A training set {(xi , yi )}i=1 , and the dimensionality of discriminant subspace r; Output: Projection matrix Pg ; 1: Construct the similarity measure matrix G according to Eq.(1); 2: Construct the intra-class graph Fgw by Eq.(15) and interclass graph Fgb by Eq.(16) based on step 1; 3: Eigendecompose the matrix, where XMg XT ; 4: Choose the r largest eigenvalues corresponding eigenvectors Pg = [Pgi , · · · , Pgr ]. 3.4

Discussion

In this section, we discuss the importance of the LSP function and the shortcomings of traditional way in the graph embedding framework. From Eq. 6, we know that,

T(Ps ) = B(Ps ) − W (Ps )

(18)

We obtain the best projection Ps by optimizing the objective function Eq. 18. B(Ps ) is the inter-class scatter,

FUTURE GENERATION COMPUTER SYSTEMS

which is the sum of the distances of the samples in the interclass neighbors; and W (Ps ) is the intra-class scatter, which is the sum of the distances of the samples in the intra-class neighbors. In traditional way, finding nearest neighbors to construct adjacency graphs will ignores the relationships between distant samples in the same class as illustrated in Fig. 1. And it also brings the following shortcomings: • Comparing with B(Ps ) and W (Ps ), the value of B(Ps ) is much bigger than the value of W (Ps ) under normal circumstances. Thus, this will result in the small influence of W (Ps ) in the process of solving the objective function T(Ps ), so that homogeneous samples will not be aggregated. • In heterogeneous samples, if the distance of the two heterogeneous samples is very close, these two samples are harmful to classification. Therefore, we give priority to choosing the close neighbors. But in similar samples, traditional way ignores the relationships of distant samples which has the similar way to deal with them. Thus, this situation will make the classification ability of the algorithm seriously deteriorate (e.g., [23], [24], [26]). Nevertheless, finding neighbors using the LSP function can solve these two shortcomings. On one hand, the value of W (Ps ) and the influence becomes larger by finding distant neighbors in similar samples. On the other hand, we consider the relationships between distant samples if they belong to the same class, because in the homogeneous samples, distant samples are harmful to classification. Therefore, we should give priority to choosing the more distant neighbors. We also use the same way as the traditional algorithm to deal with the heterogeneous samples, in that we give priority to choosing the close neighbors, which is harmful to classification.

4

E XPERIMENTS

In this section, several experiments are conducted to validate the effectiveness and efficiency of the proposed algorithms. 4.1

Experiment Setup

4.1.1 Data Sets We have conducted experiments on three real-world large data sets, including MNIST 1 data set, Yale2 and UMIST3 face data sets. • MNIST. The MNIST data set of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. • Yale. The data set contains 165 grayscale images of 15 individuals in GIF format. There are 11 images per subject, and one for each type of facial expression or configuration: center-light, with glasses, happy, leftlight, without glasses, normal, right-light, sad, sleepy, surprised, and wink. 1. http://yann.lecun.com/exdb/mnist/ 2. http://vision.ucsd.edu/content/yale-face-database 3. https://www.sheffield.ac.uk/eee/research/iel/research/face

5

• UMIST. The UMIST face data set consists of 564 images of 20 individuals, taking into account race, sex and appearance. Images of subjects are taken for a range of poses from profile to frontal views. All samples are normalized to have the unit norm, and Yale and UMIST faces are manually aligned, cropped and resized to 32 × 32 pixels. We consider the visualization on MNIST data set and accuracy on the Yale and UMIST data set. 4.1.2 Evaluation Metric We use the Accuracy to measure the classification performance. Given a sample xi , let yi and zi be the obtained classification label and the label provided by the data set originally, respectively. The Accuracy is defined as follows: Pn δ(yi , zi ) Accuracy = i=1 , (19) N where N is the total number of samples and δ(yi , zi ) equals 1 if yi = zi and equals 0 otherwise. We performed all the experiments on a PC with a 2.30 GHz Intel(R) Core(TM) I5-6200U CPU and 8 GB of memory. This computer runs on Windows 7, and has a MATLAB 2012b compiler installed.

4.1.3 Compared Algorithms To validate the effectiveness and efficiency of our proposed algorithms, we compare them with a number of related state-of-art algorithms, which are enumerated as follows. • PCA [27]. It is an unsupervised algorithm, and discovers the best projection directions with maximal variances. • DNE [22]. It learns projection matrix by constructing an adjacency graph to keep local structure. • MFA [24]. The underlying latent space is learned by minimizing the ratio between the intra-class scatter and inter-class scatter. • DAG [26]. It seeks projection matrix by maximizing the margin between the between-class scatter and withinclass scatter. In the experiment, without prior knowledge and for reducing the difficulty of adjusting parameters, we set equal value of k1 and k2 , where we use k to represent k1 and k2 simultaneously. DNE, MFA, DAG, LSPD and LSPD+ all require parameter k , here, we set the same for each algorithm. In the LSP function, the value of β which adjusts the weight Throughout our experiments, we empirically set β = 5. Since the optimization process of MFA algorithm differs from others, we compare others with recognition accuracy. For simplicity, the nearest neighbor classifier [29] is used for classifying test images. Ten independent trials are conducted to reduce the effect of randomness and the mean performance is reported. 4.2

Experimental Results

4.2.1 Digit Visualization In MNIST data set, we randomly select 50 samples from the original training set as our training samples, and 50 samples

FUTURE GENERATION COMPUTER SYSTEMS

6

PCA,digits:0−4

PCA,digits:5−9

6

4

2

4

0 2 −2 0 −4 −2 −6 −4

−6 −12

−8

−10

−8

−6

−4

−2

0

2

−10 −6

−4

−2

0

(a)

2

4

6

(b)

Fig. 4. Two-dimensional projections of digits by using PCA. (a) “+” denotes 0, “×” denotes 1, “◦” denotes 2, “” denotes 3, “” denotes 4. (b) “+” denotes 5, “×” denotes 6, “◦” denotes 7, “” denotes 8, “” denotes 9.

DNE,digits:0−4

DAG,digits:0−4

4

2

3

1

2

0

1

−1

0

−2

−1

−3

−2

−4

−3

−5

−4

−6

−5 −4

−3

−2

−1

0

1

2

3

4

−7 −10

−8

−6

LBPD,digits:0−4

−4

−2

0

2

4

1.5

2

LBPD+,digits:0−4

3

1.5

2

1

1 0

0.5

−1

0 −2 −3

−0.5

−4

−1 −5 −6 −3

−2

−1

0

1

2

3

4

5

6

−1.5 −2

−1.5

−1

−0.5

0

0.5

1

Fig. 5. Two-dimensional projections of digits by using four related algorithms, where “+” denotes 0, “×” denotes 1, “◦” denotes 2, “” denotes 3, “” denotes 4.

FUTURE GENERATION COMPUTER SYSTEMS

7

DAG,digits:5−9

DNE,digits:5−9

4

2

3

1

2 0 1 −1 0 −2 −1 −3

−2

−4

−5 −4

−3

−3

−2

−1

0

1

2

−4 −8

3

−6

LBPD,digits:5−9

−4

−2

0

2

LBPD+,digits:5−9

3

2

2

1.5 1

1

0

−1

0.5 −2

0 −3

−4 −6

−5

−4

−3

−2

−1

0

1

2

3

−0.5 −2

−1.5

−1

−0.5

0

0.5

1

Fig. 6. Two-dimensional projections of digits by using four related algorithms, where “+” denotes 5, “×” denotes 6, “◦” denotes 7, “” denotes 8, “” denotes 9.

from the original test set as our test samples for each class. For the purpose of comparison with PCA, we first project the data set in the 2D space by using PCA, and the results are depicted in Fig. 4. In the sequel, we project the data set in two dimensions by using all four algorithms (i.e., DNE, DAG, LSPD and LSPD+ ). The results are illustrated in Figs. 5 (digits ”0”-”4”) and 6 (digits ”5”-”9”), respectively. Here, PCA is utilized as a baseline, since it aims at maximizing the variance and can keep the global structure of data. From Fig. 4, we observe that the classes of different digits seem to heavily overlap. We also observe that the other four algorithms yield more meaningful projections from Figs. 5 and 6, since samples from the same class are mapped close while samples from different classes become separable. Finally, LSPD and LSPD+ seems to provide slightly better projections than other algorithms. 4.2.2

Accuracy

In this section, we test our algorithms on two face data sets, i.e., the Yale data set and the UMIST data set. In this simulation, we mainly focus on the effect of the dimensionality of subspace under different choice of the nearest neighbor parameters k and we prepare three settings under different numbers of the nearest neighbor (i.e., 1, 3 and 5). We randomly select 60% training samples from each class and the remaining images are used for testing in Yale data set and randomly select 30% training samples from each class

and the remaining images are used for testing in UMIST data set. In order to eliminate the null space, PCA [27] is utilized to reduce dimensionality. For each setting, the numbers of projection vectors are regulated from 1 to 80 with interval 3 for each fixed k values. Figs. 7 and 8 show the mean accuracy of five algorithms versus dimensionality of subspace with different k on Yale and UMIST face data sets, respectively. We can find that LSPD and LSPD+ perform better than DNE, MFA and DAG across a wide range of dimensionality. For more intuitive and clearer analysis of data, we summarize some statistics, include the mean accuracy (i.e., Mean), best record (Best) and the optimal subspace dimension (i.e., Dim), where the optimal subspace corresponds to the highest recognition accuracy of each algorithm in each setting. We also can see that LSPD and LSPD+ perform better than MFA and DAG in each case on Yale and UMIST face data sets. 4.3

Results Analysis

The performance of the six algorithms on all the three data sets are illustrated in Figs. 4-8. These results reveal a number of interesting points. 1. Considering the visualization, Figs. 4-6 illustrate the results of five algorithms in the digit data set. LSPD is better than DNE and DAG, which shows that in the task of visualization, preserving local similarity information is important.

FUTURE GENERATION COMPUTER SYSTEMS

8

Accuracy

Accuracy

0.8 0.6 DNE MFA DAG LSPD

0.4 0.2

1

1

0.8

0.8

0.6 DNE MFA DAG LSPD

0.4 0.2

LSPD+ 0 0

20

40

60

Accuracy

1

0.6 DNE MFA DAG LSPD

0.4 0.2

LSPD+

LSPD+ 0 0

80

20

40

60

0 0

80

Number of Projection Vectors

Number of Projection Vectors

(a) k = 1

20

40

60

80

Number of Projection Vectors

(b) k = 3

(c) k = 5

Fig. 7. Accuracy vs. Number of Projection Vectors on the Yale data set under different k. TABLE 1 Result with different numbers of neighbors on Yale data set Yale (k = 1) Mean Best Dim 0.6800 0.7556 27 0.6533 0.7778 22 0.7822 0.8667 59 0.8622 0.9333 28 0.8978 0.9556 22

Algorithm/Result DNE MFA DAG LSPD LSPD+

1

1

0.8

DNE MFA DAG LSPD

0.4

0.8

0.6 DNE MFA DAG LSPD

0.4 0.2

40

60

Number of Projection Vectors

(a) k = 1

0.6 DNE MFA DAG LSPD

0.4

LSPD+

LSPD+

LSPD+ 20

Accuracy

Accuracy

0.6

0.2 0

Yale (k = 5) Mean Best Dim 0.7222 0.8444 12 0.7800 0.9111 29 0.7822 0.8667 14 0.8256 0.8889 15 0.8911 0.9556 17

1

0.8

Accuracy

Yale (k = 3) Mean Best Dim 0.6800 0.7333 16 0.6978 0.7778 17 0.7467 0.8667 21 0.7956 0.8667 64 0.8267 0.8889 20

80

0 0

20

40

60

80

0.2 0

Number of Projection Vectors

20

40

60

80

Number of Projection Vectors

(b) k = 3

(c) k = 5

Fig. 8. Accuracy vs. Number of Projection Vectors on the UMIST data set under different k. TABLE 2 Result with different numbers of neighbors on UMIST data set Algorithm/Result DNE MFA DAG LSPD LSPD+

UMIST (k = 1) Mean Best Dim 0.9198 0.9392 29 0.9105 0.9291 39 0.9561 0.9722 17 0.9688 0.9797 22 0.9814 0.9924 20

Furthermore, LSPD+ outperforms all of its competitors, which shows that preserving the geometric structure based on preserving similarity information, is better than preserve only the local similarity information. 2. Considering the classification accuracy, LSPD+ outperforms all of its competitors for the Yale and UMIST data sets. From Figs. 7 and 8, regardless of the data sets or different neighbors, LSPD+ and LSPD are better than DNE, DAG and MFA. To some extent, the primary reason for the superiority of LSPD+ and LSPD is because of the advantages of the LSP function, that is, by using the distance function LSP, the

UMIST (k = 3) Mean Best Dim 0.9411 0.9696 29 0.9327 0.9570 20 0.9538 0.9823 15 0.9722 0.9899 15 0.9848 0.9975 14

UMIST (k = 5) Mean Best Dim 0.9462 0.9696 16 0.9436 0.9797 18 0.9504 0.9772 16 0.9646 0.9797 12 0.9814 0.9949 21

samples that are harmful to the classification are optimized and the local similarity information is preserved, so that neighbors from the same class are compact while neighbors from different classes become separable in the subspace to improve the accuracy. Comparing to LSPD, LSPD+ achieves a higher accuracy in most cases. That is because LSPD+ not only preserves the local similarity information of samples, but also retains the geometric structure. 3. Considering the dimension of subspace. From Tables 1-2, we know that DAG and LSPD have the similar dimensions. Though, the optimal subspace of LSPD is higher than

FUTURE GENERATION COMPUTER SYSTEMS

9

DAG in some cases. e.g., in Yale data set, when k = 3, the dimensions of DAG and LSPD are 21 and 64, respectively, DAG gets the highest recognition accuracy when the subspace dimension is 21, but LSPD achieves the highest recognition accuracy when the subspace dimension is 64. Although the subspace dimension of LSPD is higher than DAG when they all get the highest classification accuracy, when the subspace dimension of LSPD is less than or equal to 21, LSPD also has a higher accuracy than DAG. 4.4

Sensitivity of the Proposed algorithms to β

To show the sensitivity of the proposed algorithms to parameter β , the mean Accuracy versus the value of β , under different k , for data set “UMIST”, are shown in Fig. 9 where β ranges 1 to 50. It is worth noting that estimating an appropriate value for achieving a good performance is generally a challenging issue. This is usually estimated using a validation set or based on prior knowledge. Both of LSPD+ and LSPD find appropriate neighbors by LSP function, so they all need the parameter β to adjust the relative position of samples in LSP function. Moreover, the role of parameter β in LSPD+ and LSPD is different, in LSPD, β adjusts the relative position of the samples, so there is no effect on the Accuracy (i.e., Fig. 9: LSPD). In LSPD+ , parameter β has two roles, one is to adjust the relative position of the samples and the other is to control the geometry of the sample (in this experiment, since the relative position of the sample does not affect the accuracy, we use the parameter β to play two roles). Fig. 9 shows how the Accuracy changes as β varies from 1 to 50 on UMIST data set. As we can be seen, the proposed LSPD and LSPD+ algorithms are not too sensitive to this parameter. As mentioned previously, throughout all our experiments, β is set to 5 without tuning. This value is seen to work well over all data sets. 0.88 0.87 0.86

Accuracy

0.85 0.84 0.83

LSPD (k = 1) LSPD+ (k = 1) LSPD (k = 3) LSPD+ (k = 3) LSPD (k = 5) LSPD+ (k = 5)

0.82 0.81 0.8 0.79 0.78

0

10

20

30

40

50

β

Fig. 9. Parameter tuning on UMIST face data set.

5

C ONCLUSION

In this paper, we present a novel local similarity preserving function for preserving the local similarity information in

the process of dimensionality reduction. On the basis of LSP function, we further propose two algorithms, LSPD, which preserves the local similarity information of data, and LSPD+ , which not only preserves the locality similarity information of samples, but also keeps the geometric structure, respectively. Extensive experiments on visualization and classification show the effectiveness of our algorithms comparing with the state-of-the-art algorithms. It is also potentially applicable for other graph embedding algorithms(e.g., linear discriminant embedding [8] , marginal fisher analysis [24]). For the future work, we will continue to focus on improving the performance of the graph embedding algorithm. Shaw [34] and Gibert et al. [35] proposed many very interesting topics of graph embedding. How to effectively combine [34],[35] and our work to improve the performance of graph embedding algorithms is also an interesting direction for future study.

6

ACKNOWLEDGEMENTS

This work is supported in part by the National Science Foundation of China (Grant No. 61472047).

R EFERENCES [1] S. Roweis, and L. Saul, “nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 22, pp. 2323-2326, Dec. 2000. [2] J. Tenenbaum, V. Silva, and J. Langford, “a global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 22, pp. 2319-2323, Dec. 2000. [3] M. Belkin, and P. Niyogi, “laplacian eigenmaps and spectral techniques for embedding and clustering,” Advances in Neural Information Processing System, vol. 12, pp. 585-591, 2001. [4] X. F. He, and P. Niyogi, “locality-preserving projections,” Advances in Neural Information Processing Systems, vol. 16, pp. 153-160, 2003. [5] X. F. He, S. C. Yan, Y. X. Hu, and H. J. Zhang, “learning a locality preserving subspace for visual recognition,” in Proc. the 9th IEEE International Conference on Computer Vision, pp. 385-392, 2003. [6] X. F. He, D. Cai, S. C. Yan, and H. J. Zhang, “neighborhoodpreserving embedding,” in Proc. the 10th IEEE International Conference on Computer Vision, vol. 2, pp. 1208-1213. [7] J. Yang, D. Zhang, J. Y. Yang, and B. Niu, “globally maximizing, locally minimizing: unsupervised discriminant projection with applications to face and palm biometrics,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 4, April. 2007. [8] H. T. Chen, H. Wei, and T. L. Liu, “local discriminant embedding and its variants,” in Proc. the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 846-853, 2005. [9] F. P. Nie, S. M. Xiang, C. S. Zhang, “neighborhood minMax projection,” in Proc. the 20th international Joint Conference on Artificial Intelligence, pp. 993-998, 2007. [10] Z. H. Zhou, M. L. Zhang, S. J. Huang, and Y. F. Li, “multi-instance multi-label learning,” Artificial Intelligence, vol. 176, no. 1, pp. 22912320, Jan. 2012. [11] S. M. Xiang, F. P. Nie, C. S. Zhang, and C. X. Zhang, “nonlinear dimensionality reduction with local spline embedding,” IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, pp. 1285-1298, Sep. 2009. [12] S. Nikitidis, A. Tefas, and I. Pitas, “maximum margin projection subspace learning for visual data analysis,” IEEE Trans. Image Processing, vol. 23, no. 10, pp. 4413-4425, Oct. 2014. [13] X. Z. Wang, Y. X. Guo, Z. Q. Wang, “Multi-view Discriminative Manifold Embedding for Pattern Classification,” Journal of Intelligent Computing, vol. 8, no. 2, pp. 58-63, June, 2017. [14] H. F. Li, T. Jiang, and K. S. Zhang, “efficient and robust feature extraction by maximum margin criterion,” IEEE Trans. Neural Networks, vol. 17, no. 1, pp. 157-165, Feb. 2006.

FUTURE GENERATION COMPUTER SYSTEMS

[15] X. Y. Gao, S. C. Hoi, Y. D. Zhang, J. Wang, and J. T. Li, “SOML: sparse online metric learning with application to image retrieval,” in Proc. the 20th AAAI Conference on Artificial Intelligence, pp. 12061212, 2014. [16] X. J. Chang, F. P. Nie, Y. Yang, and H. Huang, “a convex formulation for semi-supervised multi-label feature selection,” in Proc. the 28th AAAI Conference on Artificial Intelligence, pp. 1171-1177. 2014. [17] B. Kim, K. Patel, A. Rostamizadeh, and J. Shah, “scalable and interpretable data representation for high-dimensional, complex data,” in Proc. the 29th AAAI Conference on Artificial Intelligence, pp. 1763-1769, 2015. [18] C. Zhang, and Y. F. Liu, “multicategory large-margin unified machines,” Journal of Machine Learning Research, vol. 14, no. 1, pp. 1349-1386, 2013. [19] P. Fox, and E. Rosten, “unbiased generative semi-supervised learning,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 367-443, 2014. [20] M. Cutrui, and D. Avis, “ground metric learning,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 533-564, 2014. [21] M. N. Volkovs, R. S. Zemel, “new learning methods for supervised and unsupervised perference aggregation,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1135-1176, 2014. [22] W. Zhang, X. Y. Xue, H. Lu, and Y. F. Guo, “discriminant neighborhood embedding for classification,” Pattern Recognition, vol. 39, no. 11, pp. 2240-2243, Nov. 2006. [23] S. C. Yan, D. Xu, B. Y. Zhang, and H. J. Zhang, “graph embedding: A general framework for dimensionality reduction,” in Proc. the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 830-837, 2005. [24] S. C. Yan, D. Xu, B. Y. Zhang, H. J. Zhang, Q. Yang, and S. Lin, “graph embedding and extensions: a general framework for dimensionality reduction,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, Jan. 2007. [25] F. Dornaika, and A. Bosaghzadeh, “exponential local discriminant embedding and its application to face recognition,” IEEE Trans. Cybernetics, vol. 43, no. 3, pp. 921-934, JUN. 2013. [26] C. T. Ding, and L. Zhang, “double adjacency graphs-based discriminant neighborhood embedding,” Pattern Recognition, vol. 48, no. 5, pp. 1734-1742, May. 2015. [27] M. Turk, and A. Pentland, “face recognition using eigenfaces,“ in Proc. the 1991 IEEE Computer Vision and Pattern Recognition, pp. 586591, 1991. [28] E. Kokiopoulou, and Y. Saad, “orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique,” IEEE Trans. Pattern Analysis and Machine intelligence, vol. 29, no. 12, pp. 2143-2156, Nov. 2007. [29] T. Cover, and P. Hart, “nearest neighbor pattern classification,” IEEE Trans. Information Theory, vol. 13, no. 1, pp. 21-27, Jan. 2003. [30] M. Y. Yu, L. Shao, X. T. Zhen, and X. F. He, “local feature discriminant projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1908-1914, Nov. 2016. [31] S. Miao, J. Wang, Q. X. Gao, F. Chen, and Y. Wang, “discriminant structure embedding for image recognition,” Neurocomputing, vol. 174, pp. 850-857, 2016. [32] D. Cai, X. F. He, K. Zhou, J. W. Han, and H. J. Bao, “locality sensitive discriminant analysis,” in Proc. the 28th AAAI Conference on Aritificial Intelligence, pp. 708-713, 2007. [33] C. H. Golub, and C. F. Van, Matrix Computations, 3rd edition, The Johns Hopkins University, USA, 1996. [34] B. Shaw, “Graph Embedding and Nonlinear Dimensionality Reduction,” Columbia University, 2011. [35] J. Gibert, E. Valveny, H. Bunke, “Graph embedding in vector spaces by node attribute statistics,” Pattern Recognition, vol. 45, no. 9, pp. 3072-3083, Sep, 2012.

10

Shangguang Wang received his PhD degree at Beijing University of Posts and Telecommunications in 2011. He is an associate professor at the State Key Laboratory of Networking and Switching Technology (BUPT). He has published more than 100 papers, and played a key role at many international conferences, such as general chair and PC chair. His research interests include service computing, cloud computing, and mobile edge computing. He is a senior member of the IEEE, and the Editor-in-Chief of the International Journal of Web Science.

Chuntao Ding received the B.S. and M.S. degrees from SIAS international University in 2012 and Soochow University in 2015, respectively, both in software engineering. He is currently a Ph.D. at the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications (BUPT). His research interests include machine learning, mobile edge computing.

Ching-Hsien Hsu is a professor in the Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan; He was distinguished chair professor at Tianjin University of Technology, China, during 2012-2016. His research includes high performance computing, cloud computing, parallel and distributed systems, big data analytics and intelligence. He has published 200 papers in these areas, including top journals such as IEEE TPDS, IEEE TSC, IEEE TCC, IEEE TETC, IEEE T-SUSC, IEEE Systems, IEEE Network, IEEE Communications, ACM TOMM. Dr. Hsu is serving as editorial board for a number of prestigious journals, including IEEE TSC, IEEE TCC, IJCS, JoCS. He has been acting as an author/co-author or an editor/co-editor of 10 books from Elsevier, Springer, IGI Global, World Scientific and McGraw-Hill. Dr. Hsu was awarded nine times distinguished award for excellence in research from Chung Hua University. He is vice chair of IEEE TCCLD, executive committee of IEEE TCSC, Taiwan Association of Cloud Computing and an IEEE senior member.

Fangchun Yang received his Ph.D.in communications and electronic systems from the Beijing University of Posts and Telecommunication in 1990. He is currently professor at the Beijing University of Posts and Telecommunication, China. He has published six books and more than 80 papers. His current research interests include network intelligence, service computing, communications software, soft-switching technology, and network security. He is a Fellow of the IET.

Shangguang Wang

Chun ntao Ding

Robe ert C.H. Hsu

Fang gchun Yang g

This paper proposes a local similarity preserving (LSP) function for finding appropriate neighbors, which can preserve the similarity information of the samples. Based on LSP function, this paper proposes two algorithms, LSPD algorithm, which finds the appropriate neighbors and preserves the similarity information by using LSP function; and LSPD+ algorithm, which also preserves the geometric structure of data based on LSPD. Extensive experiments demonstrate that the effectiveness and efficiency of our algorithms comparing with the state‐of‐the‐art algorithms based on digits and face data sets.