Multi-Graph Feature Level Fusion for Person Re-identification

Le An (a), Xiaojing Chen (b), Songfan Yang (c)

(a) Department of Electrical and Computer Engineering, University of California, Riverside, CA 92521, USA
(b) Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
(c) College of Electronics and Information Engineering, Sichuan University, Chengdu 610064, China

Abstract

Person re-identification refers to the task of matching people across non-overlapping cameras. As concerns for public safety keep rising, the ability to accurately identify a subject in surveillance cameras is in high demand. In practice, person re-identification is challenging due to the substantial appearance shift caused by view change. Many factors, such as illumination, pose, and image quality, can affect the matching accuracy. In the past, many feature descriptors have been engineered for more robust matching in certain cases. In this paper, we propose a graph-based feature fusion scheme to effectively leverage different feature descriptors. Moreover, instead of determining the matching results by computing the pairwise distance between an unknown probe and a gallery subject in the database, we learn the similarity scores between a probe and all the gallery subjects simultaneously in a graph learning framework. We use off-the-shelf features and test our method on popular benchmark datasets for person re-identification. Experimental results show that different feature descriptors can be effectively combined through this graph learning scheme, and that superior results are achieved compared with rival approaches.

Keywords: Person re-identification, multi-camera, feature fusion, graph learning

Email address: [email protected] (Xiaojing Chen)

Preprint submitted to Neurocomputing, February 7, 2017. doi: 10.1016/j.neucom.2016.08.127

1. Introduction


Recent years have witnessed the wide deployment of surveillance cameras to protect public safety all over the world. In security and law enforcement applications, the ability to recognize a person in surveillance cameras is in high demand. Conventionally, identifying a subject is performed by human operators. However, the feasibility of manual screening diminishes as the size of the database increases. Therefore, automatic and accurate matching of people across different cameras, commonly referred to as person re-identification, is desired. In this task, the goal is to recognize a probe, i.e., an unknown subject in one view, from a gallery of known subjects in different views. In the past few years, rapid advances have been made in person re-identification [1, 2, 3]. Most approaches for matching people in different cameras are based on appearance, assuming that the appearance of the same subject does not change across cameras. Although this assumption is reasonable, appearance consistency is inevitably degraded by view changes. As shown in Fig. 1, the appearance of the same subject varies significantly in different views due to pose and illumination changes. The difficulty in person re-identification is that the intra-class (same person) difference can often be as large as, or even larger than, the inter-class (different persons) difference.


Figure 1: The appearance of the same person in different camera views. The images are from the VIPeR dataset [4] and the GRID dataset [5].

To improve the person re-identification accuracy, many feature descriptors have been designed. Some representative examples include illumination-invariant color descriptors [6], biologically inspired features [7], semantic color names [8], reference descriptors [9, 10], local discriminative features [11],


local maximal occurrence features [12], and salient patterns [13, 14]. Besides handcrafted features, deep learning has also been shown to be an effective tool for automatically extracting meaningful features [15, 16]. The aforementioned feature descriptors focus on various aspects, such as more stable color description [8] or more discriminative pattern discovery [17]. In different cases, features may have different levels of importance, as empirically studied in [18]. However, very limited work has focused on feature fusion that leverages multiple feature descriptors to improve the re-identification accuracy. In this paper, we propose a feature-level fusion method to harness different feature descriptors in order to achieve improved matching results. Specifically, a graph is built for each feature descriptor, and the different graphs are combined and jointly optimized to derive the similarities between a probe and the gallery subjects. This is a unified framework that performs multi-modal feature fusion and similarity learning at the same time. Graph fusion of multi-modal data has been shown to be effective in other fields, such as disease diagnosis [19, 20] and tag-based social image search [21]. As later shown in the experiments, the proposed graph fusion of multi-modal features is effective in our application, and superior performance is achieved compared to the most recent methods. In the following, related work is discussed in Section 2. The proposed method is explained in detail in Section 3. Experimental results and analysis are included in Section 4, and conclusions with pointers to future work are given in Section 5.


2. Related Works


Many feature descriptors have been proposed specifically for the task of person re-identification. Instead of using standard color histograms, an intra-distribution color structure was discovered in [6], which was found to be invariant under different imaging conditions. In [8], semantic color names were used to replace color histograms for better representation. Another color descriptor based on salient color names was proposed in [22]. These color names were shown to be effective, and they can be computed efficiently in advance. Biologically inspired features and covariance descriptors were combined in [23, 7] and were found to be robust against background and illumination changes. In [24], multi-resolution color histograms were extracted, and local structural sparse coding was used to complement the color features. Besides, advanced features, such as local energy patterns [25]


and continuous rotation invariant local descriptors [26], which were originally proposed and shown to be effective for tasks in other domains, can also be employed in person re-identification. Instead of using low-level color or texture features, higher-level representations have been proposed. In [17], mid-level filters were proposed to capture distinctive patterns on the subjects. Salient features that distinguish one subject from another were exploited in [13, 14]; these salient features were learned in an unsupervised framework. Attributes, such as gender, were utilized in [27] to improve the ranking for re-identification, and similar attributes were also studied in [28]. In [29], gait was utilized to complement the appearance features and improve the matching accuracy for person re-identification. Attributes can also be learned from data in another domain: in [30], a semantic attribute model was learned from fashion photographs and then applied to person re-identification. The authors of [31] proposed an attribute relation learning model, which can automatically discover the relations between attributes from data, and achieved promising results in object recognition. A local maximal occurrence feature descriptor was introduced in [12], where stability against view changes was pursued by analyzing the horizontal occurrence of local features. A deep neural network was constructed in [15] to jointly handle image misalignment, pose difference, occlusions, and background clutter, and an improved deep learning architecture was later reported in [16]. Multiple low-level and high-level features were combined in [32] using a structured learning based approach. Features from different body parts can be extracted and matched separately. In [33], pictorial structures were adopted to localize human parts, and part-to-part correspondences were searched for matching. In [34], two coupled dictionaries were jointly learned to represent the gallery and probe in a semi-supervised manner. In [9, 10], a reference descriptor was generated by computing the similarities between a probe or gallery subject and an independent set of subjects with diverse appearance; this descriptor was shown to be more robust than the raw image features. In [35], a multi-level adaptive correspondence model was proposed, which describes a person based on horizontal stripes at multiple levels to capture rich visual cues as well as implicit spatial structure. After feature extraction, feature selection by covariance learning [36] or AdaBoost [37] can be performed. For feature selection, a pairwise constraint based sparse model [38], which achieved better results than state-of-the-art feature selection methods, can be employed. Features extracted from images can be further projected onto different


feature spaces for better discrimination. In [39], the image spaces of two camera views were jointly partitioned into different configurations, and image pairs with similar transforms were projected onto a common feature space for matching. A pairwise constrained component analysis (PCCA) was proposed in [40] to learn a feature projection from a set of sparse pairwise constraints. Local Fisher discriminant analysis was employed in [41] to match subjects in a discriminative space. A modified canonical correlation analysis with robust covariance estimation was proposed and applied to person re-identification [42]. A cross-view quadratic discriminant analysis [12] was introduced recently, achieving competitive results on several benchmark datasets. The distance between a probe and a gallery subject can be computed using the Mahalanobis distance, and there have been various methods for learning a distance metric. Based on the idea of "keep it simple and straightforward", a simple yet effective distance metric learning was proposed in [43]. This metric was later improved by regularization and smoothing techniques [44, 45]. A data-specific adaptive metric method applying cross-view support and projection consistencies was proposed in [46]. In [47], re-identification was formulated as a relative distance comparison problem, and this formulation was later improved in [48] by incorporating attribute and feature weighting. By taking advantage of the data structure, a relaxed pairwise learned metric was proposed in [49]. Multiple kernel-based metrics were used in conjunction with histogram-based features in [50], achieving better results than using a single metric. In [51], a set-label model was proposed to improve the performance under a multi-shot setting, and a deep non-linear metric learning was introduced to overcome the limitations of traditional linear metric learning. After initial matching results are obtained, further refinement can be achieved through human intervention [52]. In [53], a post-rank optimization method was proposed in which a human operator selects negative samples to improve accuracy and reduce search time. Besides, some new aspects of person re-identification have been explored recently. For example, low-resolution probes were matched with a high-resolution gallery in [54]. Taking possible occlusion into account, a patch-level matching model was proposed in [55] to match partial bodies. A large-scale person re-identification dataset was introduced in [56], which is substantially larger than the existing ones and better simulates real-world cases. Recently, Zheng et al. [57] proposed a novel transfer local relative distance comparison model, which tackles the open-world person re-identification problem by mining and transferring useful information from a labeled open-world non-target dataset. In [58], the tasks of person re-identification and tracking were jointly considered, and a hypergraph representation was used to link related objects for search and re-identification.

Compared to the aforementioned approaches, which mainly address feature engineering or distance metric learning, we focus on feature-level fusion to improve the person re-identification performance by exploiting complementary and mutually informative feature representations. In addition, the similarity scores between a probe and the gallery subjects are learned through graph optimization, which yields more accurate results than direct pairwise distance computation. As later shown in the experiments, our approach leads to superior matching results.

Figure 2: Pipeline of the proposed method. Different features from probe and gallery are first extracted. Then, a graph is built for each feature representation. Multiple graphs are jointly optimized to obtain the similarity scores between the probe and subjects in the gallery.

3. Technical Approach


In the proposed framework, multi-modal feature fusion and similarity learning are performed jointly by multi-graph optimization. Fig. 2 shows the pipeline of our method. Given the images of a probe and the gallery subjects, features are first extracted using different feature descriptors. Then, for each feature representation, a graph is built whose node connections are determined by pairwise similarity. The multiple graphs are jointly optimized to obtain the similarity scores between the probe and the gallery subjects, and these scores are ranked to determine the matching results. Details of each step are explained in the following.


3.1. Feature Extraction

For each probe or gallery subject, different features are extracted from the image. We choose the following two off-the-shelf feature descriptors in our implementation:


1. The LOMO feature descriptor [12] (code: http://www.cbsr.ia.ac.cn/users/scliao/projects/lomo_xqda/index.html). This descriptor, short for LOcal Maximal Occurrence, combines an HSV color histogram for color representation and the scale invariant local ternary pattern (SILTP) for texture description. A three-level pyramid is adopted to extract features from images at different resolutions. The LOMO features from the entire image are concatenated, and a log transform is applied to suppress large bin values.
2. The gBiCov feature descriptor [7] (code: http://vipl.ict.ac.cn/members/bpma). This descriptor combines biologically inspired features (BIF) and covariance descriptors. The BIF is designed to emulate the human visual system; in particular, the S1 layer (Gabor filters) and the C1 layer (MAX operator) of the BIF are used to filter the original image. Afterwards, covariance features are extracted from the BIF magnitude images to encode shape, location, and color information.


Before feature extraction, each image is resized to 128 × 48 pixels. For dimensionality reduction, we employ the cross-view quadratic discriminant analysis (XQDA) [12] to project the image features onto a discriminative subspace. In this way, the feature discrepancy due to view change is alleviated. For details regarding feature extraction and dimensionality reduction, we refer the interested reader to [12, 7].
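To make this step concrete, the following is a minimal sketch of the per-image preprocessing in Python/NumPy. The extractor callables and the learned XQDA projection matrices are hypothetical placeholders for the third-party implementations cited above; only the overall flow (descriptor extraction on a pre-resized 128 × 48 image, then linear projection onto the discriminative subspace) follows the text.

```python
import numpy as np

def project_features(x, W_xqda):
    """Project a raw descriptor onto the XQDA discriminative subspace.

    x      : (d,) raw feature vector (e.g., from LOMO or gBiCov)
    W_xqda : (d, r) projection matrix learned by XQDA on training data
    returns: (r,) low-dimensional feature used for graph construction
    """
    return W_xqda.T @ x

def describe_image(image, extractors, projections):
    """Compute one projected descriptor per feature type.

    image       : array already resized to 128 x 48 pixels
    extractors  : dict name -> callable(image) -> raw descriptor
                  (placeholders standing in for the LOMO/gBiCov code)
    projections : dict name -> XQDA projection matrix for that feature type
    """
    return {name: project_features(fn(image), projections[name])
            for name, fn in extractors.items()}
```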


3.2. Graph Construction

For each feature descriptor, a graph is built whose nodes are the features extracted from the probe and the gallery. Specifically, denoting the probe as p and the gallery as Q = {q1, q2, ..., qM}, where M is the size of the gallery, a graph G = (V, E, W) is constructed for the M gallery subjects and the probe p. V is the vertex set containing M + 1 nodes, E is the edge set, and W contains the weights of the edges in E.


The weight of the edge linking two vertices $v_s$ and $v_t$ in the weight matrix $W$ is given by

$$W(v_s, v_t) = \exp\!\left(-\frac{d(v_s, v_t)^2}{\sigma^2}\right), \qquad (1)$$

in which $\sigma$ is a controlling parameter. The distance $d(v_s, v_t)$ is the Mahalanobis distance defined by

$$d(v_s, v_t)^2 = (v_s - v_t)^\top \mathbf{M} (v_s - v_t), \qquad (2)$$

where the kernel matrix $\mathbf{M}$ is computed by XQDA [12]. We define a diagonal matrix $D$ whose diagonal elements

$$D(s, s) = \sum_{v_t} W(v_s, v_t) \qquad (3)$$

are the degrees of the vertices $v_s$. Then, a normalized weight matrix $\Theta$ of the graph is defined as

$$\Theta = D^{-1/2}\, W\, D^{-1/2}. \qquad (4)$$
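As a concrete illustration, the normalized weight matrix of one graph can be computed from the projected features as follows. This is a minimal NumPy sketch under Eqs. (1)-(4); the kernel matrix (here `M_kernel`) is assumed to be the one learned by XQDA, and the quadratic form of Eq. (2) is used directly as the squared distance in Eq. (1).

```python
import numpy as np

def build_normalized_graph(X, M_kernel, sigma=0.05):
    """Build the normalized weight matrix Theta for one feature type.

    X        : (M+1, r) projected features; row 0 is the probe,
               rows 1..M are the gallery subjects
    M_kernel : (r, r) Mahalanobis kernel matrix learned by XQDA
    sigma    : controlling parameter of the Gaussian edge weight
    """
    diff = X[:, None, :] - X[None, :, :]                  # pairwise differences
    d2 = np.einsum('ijk,kl,ijl->ij', diff, M_kernel, diff)  # squared distances, Eq. (2)
    W = np.exp(-d2 / sigma**2)                            # edge weights, Eq. (1)
    deg = W.sum(axis=1)                                   # vertex degrees, Eq. (3)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    Theta = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # Eq. (4)
    return Theta
```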

3.3. Graph Optimization

To optimize the graph, we first define the graph regularizer $\Omega(\mathbf{f})$ as

$$\Omega(\mathbf{f}) = \frac{1}{2} \sum_{v_s, v_t \in V} W(v_s, v_t) \left( \frac{f(v_s)}{\sqrt{D(v_s, v_s)}} - \frac{f(v_t)}{\sqrt{D(v_t, v_t)}} \right)^2 = \mathbf{f}^\top (I - \Theta)\, \mathbf{f}, \qquad (5)$$

where $\mathbf{f} \in \mathbb{R}^{M+1}$ is a relevance vector [59]. Each graph has its own regularizer, and the optimization objective for multiple graphs is defined as

$$\operatorname*{argmin}_{\mathbf{f}} \left\{ R_{\mathrm{emp}}(\mathbf{f}) + \lambda \sum_{i=1}^{n_f} \Omega_i(\mathbf{f}) \right\}, \qquad (6)$$

where $n_f$ is the number of feature descriptors and $\lambda$ is a regularization parameter. In this way, multiple graphs, representing features in different modalities (i.e., extracted by different feature descriptors), are combined into a unified optimization problem.

The empirical loss term $R_{\mathrm{emp}}(\mathbf{f})$ in Eq. (6) can be written as

$$R_{\mathrm{emp}}(\mathbf{f}) = \|\mathbf{f} - \mathbf{y}\|^2 = \sum_{v_s \in V} \left( f(v_s) - y(v_s) \right)^2, \qquad (7)$$

where $\mathbf{y}$ is a binary label vector. Matching a probe with the gallery can be considered a one-class classification problem [60]. Thus, in the label vector $\mathbf{y} \in \mathbb{R}^{M+1}$, the first element, corresponding to the probe, is set to 1 and the rest are set to 0. Therefore, the optimization problem in Eq. (6) can be written as

$$
\begin{aligned}
&\operatorname*{argmin}_{\mathbf{f}} \left\{ R_{\mathrm{emp}}(\mathbf{f}) + \lambda \sum_{i=1}^{n_f} \Omega_i(\mathbf{f}) \right\} \\
&= \operatorname*{argmin}_{\mathbf{f}} \; \|\mathbf{f} - \mathbf{y}\|^2 + \lambda \sum_{i=1}^{n_f} \mathbf{f}^\top (I - \Theta_i)\, \mathbf{f} \\
&= \operatorname*{argmin}_{\mathbf{f}} \; \|\mathbf{f} - \mathbf{y}\|^2 + \lambda\, \mathbf{f}^\top \Big( I - \sum_{i=1}^{n_f} \Theta_i \Big) \mathbf{f} \\
&= \operatorname*{argmin}_{\mathbf{f}} \; \|\mathbf{f} - \mathbf{y}\|^2 + \lambda\, \mathbf{f}^\top \Delta\, \mathbf{f},
\end{aligned} \qquad (8)
$$

where $\Delta = I - \sum_{i=1}^{n_f} \Theta_i$ is the multi-graph Laplacian. The relevance vector $\mathbf{f}$ to be computed contains the learned similarity scores between the probe and the gallery subjects. Taking the derivative of Eq. (8) with respect to $\mathbf{f}$, an analytical solution for $\mathbf{f}$ is given by

$$\mathbf{f} = \frac{1}{1+\lambda}\, \Delta^{-1} \mathbf{y}. \qquad (9)$$

Alternatively, when $\Delta$ is large and computing its inverse is inefficient, the solution can be obtained by the iterative process

$$\mathbf{f}^{(t+1)} = \frac{\lambda}{1+\lambda} (I - \Delta)\, \mathbf{f}^{(t)} + \frac{1}{1+\lambda}\, \mathbf{y}, \qquad (10)$$

where $t$ is the iteration index; the convergence of this iterative process is guaranteed [59]. After obtaining $\mathbf{f}$, the matching result is determined by ranking the last $M$ elements of $\mathbf{f}$ in descending order.
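The complete fusion and ranking step then reduces to a few lines of linear algebra. Below is a minimal sketch, assuming the per-feature Theta matrices from the previous step; it applies the closed form of Eq. (9) as printed (in practice a small ridge term can be added if the Laplacian is ill-conditioned).

```python
import numpy as np

def rank_gallery(thetas, lam=0.1):
    """Fuse multiple feature graphs and rank the gallery for one probe.

    thetas : list of (M+1, M+1) normalized weight matrices, one per
             feature descriptor; node 0 is the probe
    lam    : regularization parameter lambda in Eq. (6)
    returns: 0-based gallery indices sorted by learned similarity
    """
    n = thetas[0].shape[0]
    y = np.zeros(n)
    y[0] = 1.0                                   # one-class label vector, Eq. (7)
    Delta = np.eye(n) - sum(thetas)              # multi-graph Laplacian, Eq. (8)
    f = np.linalg.solve(Delta, y) / (1.0 + lam)  # closed-form solution, Eq. (9)
    scores = f[1:]                               # similarities probe -> gallery
    return np.argsort(-scores)                   # rank in descending order
```

For larger galleries the update of Eq. (10) can be iterated instead; both variants involve only dense matrix-vector operations, which is consistent with the small per-probe costs reported in Section 4.5.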


4. Experiments


4.1. Datasets

Our method is validated under a single-shot setting, meaning that for each subject one image is available in each view for matching. We evaluate our method on two widely used person re-identification datasets:


1. The VIPeR dataset [4] (https://vision.soe.ucsc.edu/node/178). The VIPeR dataset includes images of 632 subjects from two cameras, with one image per subject in each camera. Even in the same camera, the pose change can be up to 90 degrees. The dataset is evenly divided into disjoint training and testing sets. Following the standard evaluation protocol, the experiments are conducted 10 times and the average results are reported.
2. The GRID dataset [5] (http://www.eecs.qmul.ac.uk/~ccloy/downloads_qmul_underground_reid.html). The GRID dataset contains 250 image pairs of pedestrians from a busy underground station. In addition, there are images of 775 pedestrians whose identities are not included in the 250 image pairs. Images from this dataset have very low resolution and are poorly illuminated. For training, 125 image pairs are selected, and the remaining 125 image pairs are used for testing. The additional 775 images that do not belong to any of the probes are used to expand the gallery. For this dataset, 10 random trials are performed, and the average results are reported for comparison.

Fig. 3 shows some sample images from the VIPeR and the GRID datasets.


4.2. Parameter Setting

For extracting features and learning the distance kernel matrix M in Eq. (2), the default parameters of the original implementations are used. The controlling parameter σ in Eq. (1) is set to 0.05, and the regularization parameter in Eq. (6) is set to λ = 0.1. The values of these parameters are determined by cross-validation on the training data. Since the dataset size in our experiments is not large, we directly use the analytical solution in Eq. (9) for graph optimization.


Figure 3: Sample images from the VIPeR dataset [4] (top) and the GRID dataset [5] (bottom).


4.3. Effects of Multi-Graph Fusion

The proposed method leverages different types of feature descriptors. To examine the effectiveness of using multiple feature descriptors, we compare the matching rates at different ranks between our method and baseline methods that use only one feature descriptor. In the baseline methods, only one graph is built, and the similarity scores are learned by optimizing Eq. (8) with n_f = 1. The Cumulative Matching Characteristic (CMC) curves on the VIPeR and GRID datasets are shown in Fig. 4 and Fig. 5, respectively. For both datasets, the proposed method with two feature descriptors provides the best results at all ranks. On the VIPeR dataset, LOMO features are notably more effective than the gBiCov features, with a rank-1 matching rate of 40.28%. When both LOMO and gBiCov features are used, a rank-1 matching rate of 44.56% is achieved. On the GRID dataset, a rank-1 matching rate of 19.84% is achieved by our method, which is 2.32% and 9.04% higher than the baselines using only LOMO or gBiCov features, respectively. These results suggest that multiple feature descriptors are effectively utilized by our method, and that improved performance is achieved by fusing information from different sources through multi-graph learning.
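As an aside, the rank-k matching rates that make up a CMC curve can be computed from a matrix of learned similarity scores as follows. This is a small sketch, assuming one row of scores per probe and the ground-truth gallery index of each probe; it illustrates the evaluation protocol, not the method itself.

```python
import numpy as np

def cmc(scores, gt_index):
    """Cumulative Matching Characteristic from a score matrix.

    scores   : (num_probes, gallery_size) similarity scores
    gt_index : (num_probes,) index of the true match in the gallery
    returns  : (gallery_size,) curve; cmc[k-1] is the rank-k matching rate
    """
    order = np.argsort(-scores, axis=1)                    # best match first
    ranks = np.argmax(order == gt_index[:, None], axis=1)  # 0-based rank of true match
    counts = np.bincount(ranks, minlength=scores.shape[1])
    return np.cumsum(counts) / len(gt_index)
```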


Figure 4: CMC curves for comparison with single feature descriptor based matching on the VIPeR dataset.


Figure 5: CMC curves for comparison with single feature descriptor based matching on the GRID dataset.

4.4. Comparison with State-of-the-Art Methods

To evaluate the performance of our method against rival approaches, we list the matching accuracy at different ranks for our method and other recent methods. For a fair comparison, we use the same experimental settings as the competing methods, and the results of the other methods are those provided by their authors. For the VIPeR dataset, the matching rates at different ranks are listed in Table 1, and the corresponding CMC curves are plotted in Fig. 6. Compared to the second best method, our method improves the rank-1 matching rate by 4.56%; compared to the rest of the methods, our method outperforms by a large margin. At all of the listed ranks, the results of our method are consistently better than the others. For the GRID dataset, the results are shown in Table 2, which is congruent with the result reporting format on the dataset's website. Given the challenges in this dataset, the matching rates are relatively low for all methods. Nevertheless, our method achieves a rank-1 matching rate of 19.84%, which is higher than the results of the other competing methods, and its superior performance is maintained at higher ranks.


4.5. Computation Cost

Our method mainly consists of three parts: feature extraction, graph construction, and graph optimization. For feature extraction, given an input image of size 128 × 48, extracting LOMO features takes about 0.15 seconds, and extracting gBiCov features takes about 7.8 seconds. The time cost for graph construction and optimization varies with the size of the dataset. Specifically, for the VIPeR dataset, graph construction takes about 0.12 seconds and graph optimization about 0.07 seconds; for the GRID dataset, these costs are 0.16 and 0.09 seconds, respectively. The reported costs are averages per probe, and the algorithm was implemented in Matlab on a laptop with a 2.4 GHz CPU and 8 GB of RAM. The graph optimization is very efficient owing to the analytical solution in Eq. (9).

Table 1: Matching rates at different ranks (in %) on the VIPeR dataset

| Method | r = 1 | r = 10 | r = 20 | r = 50 | r = 100 |
|---|---|---|---|---|---|
| Proposed | 44.56 | 83.03 | 92.40 | 99.18 | 99.91 |
| LOMO+XQDA [12] | 40.00 | 80.51 | 91.08 | 98.54 | 99.72 |
| RD [10] | 33.29 | 78.35 | 88.48 | 97.53 | 99.36 |
| kLFDA [50] | 32.33 | 79.72 | 90.95 | — | — |
| kBiCov [7] | 31.11 | 70.71 | 82.45 | 94.92 | 98.77 |
| SalMatch [14] | 30.16 | 65.54 | 79.15 | 91.49 | 98.10 |
| LADF [61] | 29.33 | 75.98 | 88.10 | 97.03 | 99.34 |
| Mid-level Filter [17] | 29.11 | 65.95 | 79.87 | 92.47 | 98.04 |
| RPLM [49] | 27.34 | 69.02 | 82.69 | 94.56 | 98.54 |
| SSCDL [34] | 25.60 | 68.10 | 83.60 | 95.20 | — |
| LFDA [41] | 24.18 | 67.12 | 82.00 | 94.12 | 96.48 |
| KISSME [43] | 20.03 | 62.39 | 77.46 | 92.81 | 98.19 |

Figure 6: CMC curves for comparison with other methods on the VIPeR dataset.


Table 2: Matching rates at different ranks (in %) on the GRID dataset

| Method | r = 1 | r = 5 | r = 10 | r = 15 | r = 20 |
|---|---|---|---|---|---|
| Proposed | 19.84 | 40.48 | 49.76 | 56.88 | 62.32 |
| LOMO+XQDA [12] | 16.56 | 33.84 | 41.84 | 47.68 | 52.40 |
| PolyMap [62] | 16.30 | 35.80 | 46.00 | 52.80 | 57.60 |
| MtMCML [63] | 14.08 | 34.64 | 45.84 | 52.88 | 59.84 |
| MRank-PRDC [64] | 11.12 | 26.08 | 35.76 | 41.76 | 46.56 |
| MRank-RankSVM [64] | 12.24 | 27.84 | 36.32 | 42.24 | 46.56 |
| LCRML [65] | 10.68 | 25.76 | 35.04 | 42.08 | 46.48 |
| RankSVM [66] | 10.24 | 24.56 | 33.28 | 39.44 | 43.68 |
| PRDC [47] | 9.68 | 22.00 | 32.96 | 38.96 | 44.32 |
| L1-norm | 4.40 | 11.68 | 16.24 | 19.12 | 24.80 |



4.6. Discussion

The results on both datasets suggest that the proposed method is effective and competitive. Compared to the baseline methods in which only a single feature descriptor is used, feature-level fusion through graph learning leads to superior results. Compared to other recent methods, more accurate matching is achieved by our method. In our current implementation, only two feature descriptors are considered. It is worth mentioning that any feature descriptor can be utilized by our method, regardless of its type or dimension. In addition, more than two feature descriptors can be combined by constructing and learning multiple graphs, each of which corresponds to a specific feature representation. By comparing the results in Table 1 and Table 2, we observe that, on both datasets, the performance gain of our method is more significant at lower ranks than at higher ranks. Since the gallery subjects at lower ranks are less distinguishable from the probe, larger improvement at lower ranks indicates better discriminative ability. From a practical point of view, subjects at lower ranks are of more interest, so successful identification of a potential match at lower ranks is favorable.

The reasons for the improved performance of our method are mainly twofold: 1) different feature descriptors, which are complementary and mutually informative, are combined for better discrimination; and 2) instead of using a predefined or learned distance metric to compute pairwise distances, the similarity scores between the probe and gallery subjects are learned through graph optimization, and as validated by the experiments, the learned similarity scores can be more accurate. In our current method, the graphs for the different feature types are fused with equal weights, as shown in Eq. (8). Imposing weights on the different graphs could help combine feature representations according to their contributions. To do so, a weight α_i can be applied to each feature type (i.e., to Θ_i) in Eq. (8), and the new objective function, involving both the relevance vector f and the weights α_i, can be solved through alternating optimization. In addition, conventional graphs are currently constructed; instead, hypergraphs, which capture not only pairwise but also higher-order relationships among the subjects, could be utilized to compute the similarity scores between the probe and gallery subjects more accurately.
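As a rough sketch of this weighted extension (an outlook, not part of the evaluated method), the alternating scheme could be structured as follows. The weight update used here, proportional to the inverse of each graph's regularizer value Ω_i(f) with a simple normalization, is one plausible heuristic and is our assumption rather than a prescription from the paper.

```python
import numpy as np

def weighted_fusion(thetas, y, lam=0.1, n_iters=10):
    """Alternate between solving for f and re-weighting the graphs.

    thetas : list of (M+1, M+1) normalized weight matrices Theta_i
    y      : (M+1,) one-class label vector (probe entry set to 1)
    """
    n_graphs = len(thetas)
    alpha = np.ones(n_graphs) / n_graphs            # start from equal weights
    I = np.eye(y.shape[0])
    f = y.copy()
    for _ in range(n_iters):
        # f-step: closed form of Eq. (9) with the weighted Laplacian
        Delta = I - sum(a * T for a, T in zip(alpha, thetas))
        f = np.linalg.solve(Delta, y) / (1.0 + lam)
        # alpha-step: favor graphs on which f is smooth, i.e., small
        # regularizer Omega_i(f) = f^T (I - Theta_i) f (Eq. (5))
        omegas = np.array([f @ (I - T) @ f for T in thetas])
        alpha = 1.0 / np.maximum(omegas, 1e-12)
        alpha /= alpha.sum()
    return f, alpha
```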


5. Conclusions


In this paper, we proposed a feature-level fusion method for person re-identification. In our approach, different features, which are complementary to each other and mutually informative, are effectively combined in a graph learning framework. The similarity scores between a probe and the gallery subjects are learned through graph optimization. We conducted experiments on two commonly used person re-identification datasets and showed that more accurate results are achieved compared to state-of-the-art techniques. In the future, we plan to extend our framework to incorporate more feature descriptors for further improvement. Currently, we assume that a probe's identity is included in the gallery, which is referred to as closed-set re-identification; extending our method to open-set re-identification is our ongoing work.

References

[1] R. Vezzani, D. Baltieri, R. Cucchiara, People re-identification in surveillance and forensics: a survey, ACM Computing Surveys 46 (2) (2013) 29:1–29:37.


[2] G. Doretto, T. Sebastian, P. H. Tu, J. Rittscher, Appearance-based person reidentification in camera networks: problem overview and current approaches, Journal of Ambient Intelligence and Humanized Computing 2 (2) (2011) 127–151.

[3] A. Bedagkar-Gala, S. K. Shah, A survey of approaches and trends in person re-identification, Image and Vision Computing 32 (4) (2014) 270–286.


[4] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in: IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), 2007, pp. 1–7.

[5] C. C. Loy, T. Xiang, S. Gong, Multi-camera activity correlation analysis, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1988–1995.


[6] I. Kviatkovsky, A. Adam, E. Rivlin, Color invariants for person reidentification, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (7) (2013) 1622–1634.


[7] B. Ma, Y. Su, F. Jurie, Covariance descriptor based on bio-inspired features for person re-identification and face verification, Image and Vision Computing 32 (6-7) (2014) 379 – 390.


[8] C.-H. Kuo, S. Khamis, V. Shet, Person re-identification using semantic color names and rankboost, in: IEEE Workshop on Applications of Computer Vision (WACV), 2013, pp. 281–287.


[9] L. An, M. Kafai, S. Yang, B. Bhanu, Reference-based person reidentification, in: IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2013, pp. 244–249.


[10] L. An, M. Kafai, S. Yang, B. Bhanu, Person re-identification with reference descriptor, IEEE Transactions on Circuits and Systems for Video Technology 26 (4) (2016) 776–787.

[11] N. Martinel, C. Micheloni, Re-identify people in wide area camera network, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2012, pp. 31–36.


[12] S. Liao, Y. Hu, X. Zhu, S. Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2197–2206.


[13] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3586–3593. [14] R. Zhao, W. Ouyang, X. Wang, Person re-identification by salience matching, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2528–2535.

AN US

[15] W. Li, R. Zhao, T. Xiao, X. Wang, DeepReID: Deep filter pairing neural network for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 152–159. [16] E. Ahmed, M. Jones, T. Marks, An improved deep learning architecture for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3908–3916.

ED

M

[17] R. Zhao, W. Ouyang, X. Wang, Learning mid-level filters for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 144–151. [18] C. Liu, S. Gong, C. C. Loy, On-the-fly feature importance mining for person re-identification, Pattern Recognition 47 (4) (2014) 1602 – 1615.

CE

PT

[19] Y. Gao, E. Adeli-M., M. Kim, P. Giannakopoulos, S. Haller, D. Shen, Medical image retrieval using multi-graph learning for mci diagnostic assistance, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 86–93.

AC

[20] M. Liu, D. Zhang, D. Shen, the Alzheimer’s Disease Neuroimaging Initiative, View-centralized multi-atlas classification for Alzheimer’s disease diagnosis, Human Brain Mapping 36 (5) (2015) 1847–1865. [21] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, X. Wu, Visual-textual joint relevance learning for tag-based social image search, IEEE Transactions on Image Processing 22 (1) (2013) 363–376.

18


[22] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S. Li, Salient color names for person re-identification, in: European Conference on Computer Vision (ECCV), 2014, pp. 536–551.

CR IP T

[23] B. Ma, Y. Su, F. Jurie, BiCov: a novel image representation for person re-identification and face verification, in: British Machine Vision Conference (BMVC), 2012, pp. 57.1–57.11.

AN US

[24] D. Xu, H. Zheng, Person re-identification by multi-resolution saliencyweighted color histograms and local structural sparse coding, in: Seventh International Conference on Image and Graphics (ICIG), 2013, pp. 477– 482. [25] J. Zhang, J. Liang, H. Zhao, Local energy pattern for texture classification using self-adaptive quantization thresholds, IEEE Transactions on Image Processing 22 (1) (2013) 31–42.

M

[26] J. Zhang, H. Zhao, J. Liang, Continuous rotation invariant local descriptors for texton dictionary-based texture classification, Computer Vision and Image Understanding 117 (1) (2013) 56 – 75.

ED

[27] L. An, X. Chen, M. Kafai, S. Yang, B. Bhanu, Improving person reidentification by soft biometrics based reranking, in: International Conference on Distributed Smart Cameras (ICDSC), 2013, pp. 1–6.

PT

[28] R. Layne, T. Hospedales, S. Gong, Person re-identification by attributes, in: British Machine Vision Conference (BMVC), 2012, pp. 24.1–24.11.

CE

[29] Z. Liu, Z. Zhang, Q. Wu, Y. Wang, Enhancing person re-identification by integrating gait biometric, Neurocomputing 168 (2015) 1144 – 1156.

AC

[30] Z. Shi, T. M. Hospedales, T. Xiang, Transferring a semantic representation for person re-identification and search, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4184 – 4193. [31] M. Liu, D. Zhang, S. Chen, Attribute relation learning for zero-shot classification, Neurocomputing 139 (2014) 34 – 46. [32] S. Paisitkriangkrai, C. Shen, A. van den Hengel, Learning to rank in person re-identification with metric ensembles, in: IEEE Conference on 19

ACCEPTED MANUSCRIPT

Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1846 – 1855.

CR IP T

[33] D. S. Cheng, M. Cristan, M. Stoppa, L. Bazzani, V. Murino, Custom pictorial structures for re-identification, in: British Machine Vision Conference (BMVC), 2011, pp. 68.1–68.11. [34] X. Liu, M. Song, D. Tao, X. Zhou, C. Chen, J. Bu, Semi-supervised coupled dictionary learning for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3550–3557.

AN US

[35] S.-C. Shi, C.-C. Guo, J.-H. Lai, S.-Z. Chen, X.-J. Hu, Person reidentification with multi-level adaptive correspondence models, Neurocomputing 168 (2015) 550 – 559. [36] S. Bak, G. Charpiat, E. Corvee, F. Bremond, M. Thonnat, Learning to match appearances by correlations in a covariance metric space, in: European Conference on Computer Vision (ECCV), 2012, pp. 806–820.

ED

M

[37] D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: European Conference on Computer Vision (ECCV), 2008, pp. 262–275. [38] M. Liu, D. Zhang, Pairwise constraint-guided sparse learning for feature selection, IEEE Transactions on Cybernetics 46 (1) (2016) 298–310.

CE

PT

[39] W. Li, X. Wang, Locally aligned feature transforms across views, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3594–3601.

AC

[40] A. Mignon, F. Jurie, PCCA: A new approach for distance learning from sparse pairwise constraints, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2666–2672. [41] S. Pedagadi, J. Orwell, S. Velastin, B. Boghossian, Local fisher discriminant analysis for pedestrian re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3318– 3325.

20

ACCEPTED MANUSCRIPT

[42] L. An, S. Yang, B. Bhanu, Person re-identification by robust canonical correlation analysis, IEEE Signal Processing Letters 22 (8) (2015) 1103– 1107.

CR IP T

[43] M. K¨ostinger, M. Hirzer, P. Wohlhart, P. Roth, H. Bischof, Large scale metric learning from equivalence constraints, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2288– 2295.

AN US

[44] D. Tao, L. Jin, Y. Wang, Y. Yuan, X. Li, Person re-identification by regularized smoothing KISS metric learning, IEEE Transactions on Circuits and Systems for Video Technology 23 (10) (2013) 1675–1685.

[45] D. Tao, L. Jin, Y. Wang, X. Li, Person reidentification by minimum classification error-based KISS metric learning, IEEE Transactions on Cybernetics 45 (2) (2015) 242–252.

M

[46] Z. Wang, R. Hu, C. Liang, Y. Yu, J. Jiang, M. Ye, J. Chen, Q. Leng, Zero-shot person re-identification via cross-view consistency, IEEE Transactions on Multimedia 18 (2) (2016) 260 – 272.

ED

[47] W.-S. Zheng, S. Gong, T. Xiang, Reidentification by relative distance comparison, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) 653–668.

PT

[48] C. Liu, S. Gong, C. Loy, X. Lin, Person re-identification: What features are important?, in: European Conference on Computer Vision Workshops and Demonstrations, 2012, pp. 391–401.

CE

[49] M. Hirzer, P. M. Roth, M. K¨ostinger, H. Bischof, Relaxed pairwise learned metric for person re-identification, in: European Conference on Computer Vision (ECCV), 2012, pp. 780–793.

AC

[50] F. Xiong, M. Gou, O. Camps, M. Sznaier, Person re-identification using kernel-based metric learning methods, in: European Conference on Computer Vision (ECCV), 2014, pp. 1–16. [51] H. Liu, B. Ma, L. Qin, J. Pang, C. Zhang, Q. Huang, Set-label modeling and deep metric learning on person re-identification, Neurocomputing 151, Part 3 (2015) 1283 – 1292. 21

ACCEPTED MANUSCRIPT

[52] M. Hirzer, C. Beleznai, P. M. Roth, H. Bischof, Person re-identification by descriptive and discriminative classification, in: Scandinavian Conference on Image analysis (SCIA), 2011, pp. 91–102.

CR IP T

[53] C. Liu, C. Loy, S. Gong, G. Wang, POP: Person re-identification postrank optimisation, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 441–448.

AN US

[54] X.-Y. Jing, X. Zhu, F. Wu, X. You, Q. Liu, D. Yue, R. Hu, B. Xu, Superresolution person re-identification with semi-coupled low-rank discriminant dictionary learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 695–704. [55] W.-S. Zheng, X. Li, T. Xiang, S. Liao, J. Lai, S. Gong, Partial person reidentification, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4678 – 4686.

M

[56] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, Q. Tian, Scalable person re-identification: A benchmark, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116 – 1124.

ED

[57] W. S. Zheng, S. Gong, T. Xiang, Towards open-world person reidentification by one-shot group-based verification, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (3) (2016) 591–606.

PT

[58] S. Sunderrajan, B. S. Manjunath, Context-aware hypergraph modeling for re-identification and summarization, IEEE Transactions on Multimedia 18 (1) (2016) 51–63.

CE

[59] Y. Gao, M. Wang, D. Tao, R. Ji, Q. Dai, 3-D object retrieval and recognition with hypergraph analysis, IEEE Transactions on Image Processing 21 (9) (2012) 4290–4303.

AC

[60] Y. Huang, Q. Liu, S. Zhang, D. Metaxas, Image retrieval via probabilistic hypergraph ranking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3376–3383. [61] Z. Li, S. Chang, F. Liang, T. Huang, L. Cao, J. Smith, Learning locallyadaptive decision functions for person verification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3610– 3617. 22

ACCEPTED MANUSCRIPT

CR IP T

[62] D. Chen, Z. Yuan, G. Hua, N. Zheng, J. Wang, Similarity learning on an explicit polynomial kernel feature map for person re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1565 – 1573. [63] L. Ma, X. Yang, D. Tao, Person re-identification over camera networks using multi-task distance metric learning, IEEE Transactions on Image Processing 23 (8) (2014) 3656–3670.

AN US

[64] C. C. Loy, C. Liu, S. Gong, Person re-identification by manifold ranking, in: IEEE International Conference on Image Processing (ICIP), 2013, pp. 3567–3571.

[65] J. Chen, Z. Zhang, Y. Wang, Relevance metric learning for person reidentification by exploiting global similarities, in: International Conference on Pattern Recognition (ICPR), 2014, pp. 1657–1662.


[66] B. Prosser, W.-S. Zheng, S. Gong, T. Xiang, Person re-identification by support vector ranking, in: British Machine Vision Conference (BMVC), 2010, pp. 21.1–21.11.
