Perceptual hash-based feature description for person re-identification


Wen Fang (a), Hai-Miao Hu (a,b,∗), Zihao Hu (a), Shengcai Liao (c), Bo Li (a,b)

a Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
b State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
c Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Article history: Received 11 April 2017; revised 7 July 2017; accepted 9 July 2017. Communicated by Jiwen Lu.

Keywords: Person re-identification; image pyramid; regional statistics; hierarchical feature description; perceptual hash algorithm (PHA)

Abstract: Person re-identification is one of the most important and challenging problems in video surveillance systems. For person re-identification, feature description is a fundamental problem. While many approaches focus on exploiting low-level features to describe person images, most of them are not robust enough to illumination and viewpoint changes. In this paper, we propose a simple yet effective feature description method for person re-identification. Starting from low-level features, the proposed method uses perceptual hashing to binarize low-level feature maps and combines several feature channels for feature encoding. Then, an image pyramid is built, and three regional statistics are computed for hierarchical feature description. To some extent, the perceptual hash algorithm (PHA) can encode invariant macro structures of person images, making the representation robust to both illumination and viewpoint changes. On the other hand, while a rough hashing may not be discriminative enough, the combination of several different feature channels and regional statistics is able to exploit complementary information and enhance the discriminability. The proposed approach is evaluated on seven major person re-identification datasets. The results of comprehensive experiments show the effectiveness of the proposed method and notable improvements over the state-of-the-art approaches. © 2017 Elsevier B.V. All rights reserved.

1. Introduction

Person re-identification is the task of matching pedestrian images observed across camera views at different locations and times based on visual features [1]. Appearance-based person re-identification is a very challenging task because the appearance of an individual undergoes significant changes due to variations of illumination, pose and viewpoint across non-overlapping camera views. Differences in image resolution, camera settings and background clutter further increase the difficulty of person re-identification. It can be observed from Fig. 1 that the appearance of an individual varies greatly between different camera views. Re-identification usually involves two stages. First, a reliable and distinctive descriptor is constructed to describe both the query and the gallery images. Second, adapted distance measures are used to calculate the similarity between the query and each of the gallery images, and the similarity is used to find the correct match among a large number of gallery images.

☆ This work was partially supported by the National Key Research and Development Program of China (Grant no. 2016YFC0801003) and the National Natural Science Foundation of China (No. 61370121). ∗ Corresponding author at: Beijing Key Laboratory of Digital Media, School of Computer Science and Engineering, Beihang University, Beijing 100191, China. E-mail address: [email protected] (H.-M. Hu).

Constructing a representation is a critical and challenging problem in person re-identification, and a great deal of work has focused on descriptor design. Methods in the recent person re-identification literature exploit low-level features such as color [41], shape and filter responses [6,22], spatial structure [49] or combinations thereof [6,32], because they can be obtained relatively easily from the image. Usually, the constructed descriptor should be robust to various changes, such as changes in illumination, viewpoint, background clutter, occlusion and image resolution. Nevertheless, despite extensive research, constructing a descriptor for person re-identification remains a largely unsolved problem. This is because most low-level feature representations are either insufficiently robust to illumination changes, especially with noisy backgrounds, or insufficiently discriminative under viewpoint variations. For example, color is sensitive to lighting variations, and texture features are subject to variations in viewpoint and pose. Thus, such low-level features alone are not effective for re-identification. In this article, we propose a simple yet effective feature description method to address this issue. Our descriptor is named MSHF, which is short for multi-statistics on hash feature map. The proposed MSHF descriptor involves two steps. In the first step, low-level color and gradient features of the image are extracted, and each low-level feature is quantized into two levels by the perceptual hash algorithm (PHA) [58].


Fig. 1. Images of the same person from two different camera views.

Since a rough hashing may not be discriminative enough, the combination of various feature channels is able to exploit complementary information and enhance the discriminability. A hash feature map, in which each pixel is represented as a binary number, is then generated from the combined quantized low-level features of each pixel. In the second step, three complementary regional statistical features are extracted from the center area of this hash feature map: the histogram, the mean vector and the co-occurrence matrix. The simplest way to describe a hash feature map is by its raw pixel values, which were used for many years in computer vision. However, this representation is not robust to various changes and non-rigid motion, and its dimensionality is high [36]. To describe the image more effectively, we extract three regional statistical features from the hash feature map to describe people. Furthermore, only the center area of the hash feature map is used, so noisy background information can be discarded. Finally, since the MSHF descriptor and low-level features provide very different types of information, we combine the MSHF descriptor with other low-level features to enhance the performance. Additionally, since metric learning algorithms have improved the performance of person re-identification in the recent literature, we also use metric learning to further enhance performance.

Compared with state-of-the-art descriptors, the proposed descriptor differs in the following three ways. First, the computational complexity of our descriptor is relatively low, since the perceptual hash algorithm (PHA) quantizes each low-level feature into two levels. Second, the proposed MSHF descriptor is tolerant to both illumination and viewpoint changes; to some extent, we use PHA to encode some invariant macro structures of person images. Furthermore, MSHF is also robust to background variations, because we extract the center area of the hash feature map and discard the noisy background information.

It is also important to note that the MSHF descriptor makes very different use of the perceptual hash algorithm than image retrieval does. First, in image retrieval, the perceptual hash algorithm only uses the gray information of the image, while the MSHF descriptor employs different types of low-level features for person re-identification. Because a rough hashing may not be discriminative enough, the combination of several different feature channels and regional statistics is used to exploit complementary information and enhance the discriminability. Second, the perceptual-hash-based

similarity is defined as the difference between the string descriptors of two different images, whereas the MSHF descriptor extracts three complementary features from the hash feature map. These three complementary descriptors are concatenated to form the image signature, and the similarity of two different images is obtained by simply computing the l1 vector distance between their descriptors.

The remainder of this paper is organized as follows. In the next section, we review related work on person re-identification. The proposed descriptor is presented in detail in Section 3. Section 4 compares the performance of our strategy with those of state-of-the-art algorithms on benchmark datasets, including VIPeR, CAVIAR4REID, i-LIDS, ETHZ, GRID, CUHK01 and Market1501. Finally, Section 5 concludes the paper.

2. Related works

Early published work on re-identification dates back to 2003 [1]. With the development of pattern recognition, computer vision [37,26,59] and machine learning, person re-identification became a hot topic in academia and has received extensive attention from researchers since 2008 [14,33,34]. Person re-identification algorithms generally fall into two categories [1], namely unsupervised algorithms and supervised algorithms. The unsupervised algorithms [2–5] rely on a robust feature description, that is, a set of distinguishing characteristics that describe the appearance of pedestrians from various camera views. Typical descriptors that have been proposed include color, texture, shape, edges and semantic attributes. The supervised algorithms [6–11] employ learning techniques for descriptor extraction and matching; they require labeled samples for training.

2.1. Unsupervised methods

Bazzani et al. [2] proposed a descriptor called Symmetry-Driven Accumulation of Local Features (SDALF). This method combines three features into a human signature: maximally stable color regions (MSCR), weighted color histograms (WCH) and recurrent high-structured patches (RHSP). However, Cheng et al. [22] observed that the performance is not heavily degraded if the RHSP are removed. Bazzani et al. [18], through epitomic analysis, extracted the Histogram Plus Epitome (HPE) to describe the appearance of a person.


However, this multi-shot descriptor cannot be well applied in the single-shot case. In [12], Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF) and spin image interest point descriptors were applied to person re-identification; however, these features are limited to high-definition videos. In [13], Gabor features and Local Binary Patterns (LBP) were combined to form a covariance descriptor to handle the difficulty of viewpoint and illumination changes. However, the covariance matrix does not lie in Euclidean space, and most common machine-learning methods work in Euclidean spaces, so they are not suitable for the covariance matrix. Ma et al. [24] proposed a novel image representation called BiCov. This descriptor, which can properly handle both background and illumination variations, is formed from Biologically Inspired Features (BIF) and the covariance descriptor. Zhao et al. [16] argue that humans often rely on salient features to distinguish one person from another, and thus that salient regions are more distinctive and reliable when matching two persons; the salience of a patch is estimated through unsupervised learning and incorporated into person matching. Liao et al. [38] proposed an effective feature representation called Local Maximal Occurrence (LOMO). In LOMO, HSV color histograms and Scale Invariant Local Ternary Pattern (SILTP) features are extracted from an image that has been processed by a multi-scale Retinex transform, which is designed to handle illumination variations [15]. To handle viewpoint changes, LOMO maximizes the horizontal occurrence of local features.


2.2. Supervised methods

Some researchers think that certain features play more important roles than others in matching different individuals. Gray and Tao [6] used the AdaBoost algorithm to find an ensemble of localized features (ELF). Schwartz and Davis [19] employed the Partial Least Squares (PLS) technique to weight feature descriptors based on color, texture and edges according to their discriminative power for each different appearance. Metric learning has also been widely adopted in person re-identification [40]. Prosser et al. [32] reformulated the person re-identification problem as a ranking problem and used Ranking Support Vector Machines (RankSVM) to learn a subspace in which the potential true match is given the highest ranking, rather than a direct distance measure. The relative distance comparison (RDC) [7] aims to maximize the likelihood that a pair of true matches has a smaller distance than a wrongly matched pair. In [27], the authors introduced Pairwise Constrained Component Analysis (PCCA) and proposed a new distance-metric learning approach. A simple though effective strategy was introduced in [23] to learn a distance metric from equivalence constraints, referred to as KISSME (Keep It Simple and Straightforward Metric). In [10], the re-identification problem was formulated as a block sparse recovery problem, and the associated optimization problem was solved using the alternating-direction framework. Chen et al. [29] presented a novel similarity learning approach to person re-identification, which uses a more robust explicit polynomial kernel feature map instead of greedily keeping only the best-matched patch. To learn a discriminant metric, Liao et al. [38] proposed a subspace and metric learning method called Cross-view Quadratic Discriminant Analysis (XQDA) and presented a practical computation method for XQDA, as well as its regularization. However, existing metric learning methods face the classic small sample size (SSS) problem. Zhang et al. [48] overcame the SSS problem by matching people in a discriminative null space in which images of the same person are collapsed into a single point. Tao et al. [51] presented dual-regularized KISS (DR-KISS) metric learning to address the SSS problem by regularizing the two covariance matrices.

In recent years, Convolutional Neural Networks (CNNs) have shown great potential in person re-identification for extracting features hierarchically. Li et al. [44] designed a deep filter pairing neural network (FPNN) to learn feature representations of a pair of images and their corresponding metric. Hu et al. [46] proposed a new deep transfer metric learning (DTML) method. Xiao et al. [45] presented a pipeline for learning generic feature representations from multiple domains that are effective on all of them simultaneously. Wu et al. [47] proposed a novel feature extraction model called Feature Fusion Net (FFN) for pedestrian image representation. Complex CNN models require a vast amount of training data to find good representations and task-dependent parameters [62]. To avoid this problem, Wu et al. [52] introduced a hybrid architecture that combines Fisher vectors and deep neural networks to learn nonlinear representations of person images in a space where the data are linearly separable.

3. Proposed descriptor

The proposed MSHF descriptor is a two-stage representation: low-level features are quantized into two levels by the perceptual hash algorithm, and a hash feature map is generated (illustrated in Fig. 2); then, three complementary features are extracted from this hash feature map and concatenated to describe persons. In the following, the technical details of the two stages are presented.

3.1. Hash feature map

The perceptual hash algorithm (PHA) has been successfully used in image retrieval. Its main steps are as follows: (1) calculate the average gray value of all pixels; (2) label each pixel 1 if its gray value is greater than or equal to the average gray value, and 0 otherwise; (3) combine the results of the previous step to form the descriptor of the image. This algorithm is robust to illumination and viewpoint changes and, most importantly, is very efficient. To avoid high dimensionality and high computational complexity, motivated by the perceptual hash algorithm, each low-level feature is quantized into two levels.
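As a concrete illustration, here is a minimal Python sketch of this classic average-hash scheme; the function name and the 8×8 hash size are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Classic perceptual (average) hash: downsample, then threshold
    every cell at the global mean gray value (steps 1-3 above)."""
    h, w = gray.shape
    # crude block-average downsampling to hash_size x hash_size
    gray = gray[:h - h % hash_size, :w - w % hash_size].astype(float)
    blocks = gray.reshape(hash_size, gray.shape[0] // hash_size,
                          hash_size, gray.shape[1] // hash_size)
    small = blocks.mean(axis=(1, 3))
    return (small >= small.mean()).astype(np.uint8).ravel()  # 64-bit binary code

# Similarity of two images = Hamming distance of their hashes, e.g.:
# dist = np.count_nonzero(average_hash(img_a) != average_hash(img_b))
```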

The first step is to extract the low-level features of the image. For each pixel of the pedestrian image I, a nine-dimensional feature vector is computed to capture the color and gradient information:

\[ f = [\,nR,\ nG,\ nB,\ H,\ S,\ Y,\ \nabla Y,\ \nabla^2 Y,\ \theta^2 Y\,], \tag{1} \]

where nR, nG and nB stand for the normalized RGB (nRGB) channels, H and S are the hue and saturation channels, Y is the intensity channel, ∇Y is the first-order gradient magnitude of the intensity channel, and ∇²Y and θ²Y are the second-order gradient magnitude and orientation of the intensity channel, respectively. The normalized RGB is

\[ nR = \frac{R}{R+G+B}, \quad nG = \frac{G}{R+G+B}, \quad nB = \frac{B}{R+G+B}. \tag{2} \]

For the sake of clarity, f is re-represented as follows:

\[ f(x, y) = [\,f_1(x, y),\ f_2(x, y),\ f_3(x, y),\ \ldots,\ f_9(x, y)\,], \tag{3} \]

where x and y are the pixel coordinates. Color is a widely used feature for person re-identification [2,18,19] because the color of clothing constitutes a simple but efficient visual signature. Since no single color space is robust to all types of lighting variations, various complementary color spaces are exploited, such as HSV, RGB and YCbCr [4,6,28]. In our work, three different color models are used: normalized RGB (nRGB), Y (from YCbCr) and HS (from HSV). Then, the average value of each feature is computed over the image.


Fig. 2. Flowchart of the hash feature map: (1) a nine-dimensional feature vector is extracted for each pixel to capture the color and gradient information; (2) for each feature channel, the feature value is compared to the average value of the same feature and labeled 1 if it is greater than or equal to the average, and 0 otherwise; (3) the hash feature map of the original image is produced by concatenating the nine comparison results.

For each pixel of the pedestrian image, each feature value is compared to the average value of the same feature: if the value is greater than or equal to the average, it is labeled 1; otherwise, it is labeled 0. The hash feature map of the original image is formed by concatenating the nine comparison results. We denote the hash feature map by HF. A pixel of the hash feature map can be represented as a nine-bit binary number hf:

\[ hf(x, y) = (hf_1(x, y),\ hf_2(x, y),\ hf_3(x, y),\ \ldots,\ hf_9(x, y)), \tag{4} \]

\[ hf_i(x, y) = \begin{cases} 1 & \text{if } f_i(x, y) \ge \bar{f}_i \\ 0 & \text{otherwise} \end{cases}, \quad i = 1, 2, \ldots, 9, \tag{5} \]

where x and y are the pixel coordinates, \( \bar{f}_i \) is the mean of \( f_i(x, y) \) over the pedestrian image, and \( hf_i(x, y) \) is the i-th bit of the binary number \( hf(x, y) \).
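For illustration, the construction of Eqs. (1)–(5) can be sketched in a few lines of Python. The finite-difference gradients and the HSV conversion below are simplifications of our own; the paper does not prescribe a particular implementation:

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def hash_feature_map(img):
    """Hash feature map of Eqs. (4)-(5): 9 feature channels (Eq. (1)),
    each binarized at its image-wide mean, packed into a 9-bit code."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]     # RGB in [0, 1]
    s = R + G + B + 1e-8
    hsv = rgb_to_hsv(img)
    H, S = hsv[..., 0], hsv[..., 1]
    Y = 0.299 * R + 0.587 * G + 0.114 * B               # intensity channel
    gy, gx = np.gradient(Y)
    g1 = np.hypot(gx, gy)                               # first-order magnitude
    g2y, g2x = np.gradient(gy, axis=0), np.gradient(gx, axis=1)
    g2 = np.hypot(g2x, g2y)                             # second-order magnitude
    th2 = np.arctan2(g2y, g2x)                          # second-order orientation
    feats = np.stack([R / s, G / s, B / s, H, S, Y, g1, g2, th2], axis=-1)
    bits = (feats >= feats.mean(axis=(0, 1))).astype(np.int32)   # Eq. (5)
    return (bits << np.arange(9)).sum(axis=-1)          # 9-bit code in [0, 511], Eq. (4)
```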

3.2. MSHF descriptor (multi-statistics on hash feature map)

In the second stage, three complementary features are extracted from this hash feature map. The first feature is the histogram. Because spatial information about the layout is an important cue, we use the generic human body partitioning widely adopted in existing methods and divide the center area of the hash feature map into six horizontal strips of equal size. Choosing the center area of the image reduces the distraction of a noisy background. Since a pixel of the hash feature map is represented as a nine-bit binary number, the number of possible gray levels is 512 (i.e., 2^9). The number of statistical bins for each strip is accordingly set to 512; thus, the dimension of the histogram is 3072 bins.

The second feature is the mean vector. To incorporate spatial information, a four-layer spatial pyramid is built by dividing the center area of the hash feature map into non-overlapping horizontal strips, as presented in Fig. 3. Level 0 corresponds to the center area of the hash feature map, level 1 is described by two horizontal strips, level 2 by four horizontal strips and, finally, level 3 by eight horizontal strips. The mean vector of strip R is derived by decimal computation:

\[ \overline{hf}_R = \frac{1}{n} \sum_{(x, y) \in \text{strip } R} hf(x, y), \tag{6} \]

where \( \overline{hf}_R \) is the nine-dimensional mean vector of strip R, and n is the total number of pixels in strip R.
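A minimal sketch of the first two statistics follows, assuming `hmap` is the integer code map from the previous sketch, already cropped to the center area (function names are ours):

```python
import numpy as np

def pyramid_strips(arr):
    """Horizontal strips of the four-level pyramid in Fig. 3
    (1 + 2 + 4 + 8 = 15 strips)."""
    for level in range(4):
        yield from np.array_split(arr, 2 ** level, axis=0)

def strip_histograms(hmap):
    """Feature 1: a 512-bin histogram per strip over 6 strips (3072-D)."""
    return np.concatenate([np.bincount(s.ravel(), minlength=512) / s.size
                           for s in np.array_split(hmap, 6, axis=0)])

def pyramid_mean_vectors(hmap):
    """Feature 2 (Eq. (6)): a nine-dimensional mean vector per pyramid strip
    (fraction of 1-bits in each bit plane), 9 x 15 = 135-D in total."""
    bits = (hmap[..., None] >> np.arange(9)) & 1        # H x W x 9 bit planes
    return np.concatenate([s.reshape(-1, 9).mean(axis=0)
                           for s in pyramid_strips(bits)])
```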


Fig. 3. The four-layer spatial pyramid (the original hash feature map and its levels 0–3).

Using decimal operations makes the mean vector easy to combine with other features. The length of the mean vector of the hash feature map is 135 (i.e., 135 = 9 × (1 + 2 + 4 + 8)). The third feature is the co-occurrence matrix. The co-occurrence matrix is defined as the distribution of co-occurring pixel values (grayscale values or colors) over an image at a given offset. It is a second-order statistical feature and, to a certain extent, reflects the spatial position relations of the pixels. To preserve spatial information, the center area of the hash feature map is divided into non-overlapping horizontal strips, as shown in Fig. 3, and the co-occurrence matrix is computed for each strip:

\[ CM_R = (CM_{1,R},\ CM_{2,R},\ CM_{3,R},\ \ldots,\ CM_{9,R}), \tag{7} \]

where \( CM_R \) is the set of co-occurrence matrices of strip R, and \( CM_{i,R} \) (i = 1, 2, ..., 9) is the 2×2 normalized co-occurrence matrix of \( hf_i(x, y) \) over strip R. In a real scene, we can observe that the vertical change of a pedestrian is smaller than the horizontal change. Therefore, in our case, the orientation parameter of the co-occurrence matrix \( CM_{i,R} \) is set to vertical, and the offset parameter is fixed to 1.
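A sketch of this statistic for a single strip, assuming `bits_strip` is that strip's H × W × 9 bit-plane array as unpacked in the previous sketch:

```python
import numpy as np

def strip_cooccurrence(bits_strip):
    """Feature 3 (Eq. (7)): one 2x2 normalized co-occurrence matrix per bit
    plane, vertical orientation, offset 1 (i.e., vertically adjacent pixels)."""
    top, bottom = bits_strip[:-1], bits_strip[1:]
    cms = []
    for i in range(9):
        pair = 2 * top[..., i].ravel() + bottom[..., i].ravel()   # values 0..3
        cm = np.bincount(pair, minlength=4).reshape(2, 2).astype(float)
        cms.append((cm / cm.sum()).ravel())                        # normalize
    return np.concatenate(cms)                                     # 36-D per strip
```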

We use the set of co-occurrence matrices of \( hf_i(x, y) \) rather than the co-occurrence matrix of \( hf(x, y) \) because the latter would be of size 512×512, which is too large. Suppose an image region is of size 16×48 (for instance, when a 128 × 48-pixel image is divided into 8 horizontal strips); then the total number of pixels in this region is 768. The sum of the element values of the co-occurrence matrix is less than 768, while the matrix has 512×512 entries, so most of its elements would be zero; that is, the co-occurrence matrix of \( hf(x, y) \) has no statistical significance. The above three complementary descriptors are concatenated to form a simple but discriminative descriptor, called multi-statistics on hash feature map (MSHF) in this paper.

3.3. The extensions of MSHF

3.3.1. Fusion with Gabor filters

Since the MSHF descriptor only exploits color and gradient information, it is not able to distinguish among a large number of pedestrians on its own. Therefore, to achieve good performance in person matching, we combine the MSHF descriptor with a low-level texture feature to produce a richer signature. Many texture descriptors have been used in the person re-identification literature, such as the Gabor filter and other filter banks, SIFT and LBP. Since the Gabor filter is tolerant to illumination changes, we choose it as the fused texture feature.

The Gabor filter is often computed on the intensity image. Because a single color space lacks robustness, we consider several color spaces, namely normalized RGB (nRGB), YCbCr and HS. For an image I, we compute its convolution with Gabor filters on the resulting eight color channels according to the following equations:

\[ G(\mu, \nu) = I(x, y) * \varphi_{\mu,\nu}(x, y), \tag{8} \]

where \( \varphi_{\mu,\nu}(x, y) \) is the 2-D Gabor kernel,

\[ \varphi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2}\, e^{-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}} \left( e^{\,i k_{\mu,\nu} \cdot z} - e^{-\frac{\sigma^2}{2}} \right), \tag{9} \]

where μ and ν are the scale and orientation parameters of the kernel, respectively. Similar to the orientation parameter of the co-occurrence matrix, ν is set to π/2, while μ is quantized into four scales. The convolution images are divided into non-overlapping horizontal strips, as in Fig. 3. For each layer, the numbers of histogram bins from top to bottom are set to 32, 24, 16 and 8 per strip. By accumulating the histograms over all strips of the eight channels, each person image is represented by a 6656-dimensional Gabor histogram vector (i.e., 6656 = 4 × 8 × (8×8 + 4×16 + 2×24 + 1×32)). The MSHF descriptor and the Gabor histogram descriptor are fused to represent the human signature; we denote this fused descriptor as the eMSHF (enriched MSHF) descriptor.
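The sketch below illustrates this branch with OpenCV's Gabor kernels; the kernel size, σ and the four wavelengths are illustrative assumptions of our own, as the paper does not list its exact filter parameters:

```python
import cv2
import numpy as np

def gabor_histograms(channels, bins_per_level=(32, 24, 16, 8)):
    """Gabor histogram vector: 4 scales at the vertical orientation
    (theta = pi/2), applied to each of the 8 color channels, with
    pyramid strip histograms as in Fig. 3 (4 * 8 * 208 = 6656-D)."""
    feats = []
    for lam in (2.0, 4.0, 8.0, 16.0):                 # four assumed scales (mu)
        kern = cv2.getGaborKernel((15, 15), sigma=lam / 2.0,
                                  theta=np.pi / 2, lambd=lam, gamma=1.0)
        for ch in channels:                           # nR, nG, nB, Y, Cb, Cr, H, S
            resp = np.abs(cv2.filter2D(ch.astype(np.float32), cv2.CV_32F, kern))
            top = resp.max() + 1e-8
            for level, bins in enumerate(bins_per_level):
                for strip in np.array_split(resp, 2 ** level, axis=0):
                    hist, _ = np.histogram(strip, bins=bins, range=(0.0, top))
                    feats.append(hist / strip.size)
    return np.concatenate(feats)
```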

The distance between two person images, \( I_p \) and \( I_q \), is obtained by weighting the two components of eMSHF:

\[ d_{eMSHF}(I_p, I_q) = \beta_{MSHF} \cdot d_{MSHF}(I_p, I_q) + \beta_{Gabor} \cdot d_{Gabor}(I_p, I_q). \tag{10} \]

Here, \( d_{MSHF} \) and \( d_{Gabor} \) are the distance measures of the two descriptors, and the β's are their weights; both \( d_{MSHF} \) and \( d_{Gabor} \) are computed as the l1 vector distance between the corresponding representations.

3.3.2. Combining with metric learning

In the recent computer vision literature, metric learning has gained considerable interest and has improved the performance of person re-identification. Accordingly, we also use metric learning to obtain better performance. In this paper, we use XQDA [38], which has been applied successfully in the context of person re-identification, to learn the metric. The learning process of XQDA is simple and efficient. Under the zero-mean Gaussian distribution assumption, the distance in the XQDA model is computed as follows:

\[ d_W(x_i, x_j) = (x_i - x_j)^T W \left( \Sigma_I^{-1} - \Sigma_E^{-1} \right) W^T (x_i - x_j), \tag{11} \]


where

\[ \Sigma_I = \frac{1}{n_I} \sum_{y_i = y_j} W^T (x_i - x_j)(x_i - x_j)^T W, \tag{12} \]

\[ \Sigma_E = \frac{1}{n_E} \sum_{y_i \ne y_j} W^T (x_i - x_j)(x_i - x_j)^T W. \tag{13} \]

Here, \( n_I \) and \( n_E \) denote the numbers of intra-class and extra-class sample pairs, respectively. Consider a sample pair \( (x_i, x_j) \): \( y_i = y_j \) if the samples share the same class label; otherwise, \( y_i \ne y_j \). Let W be a subspace. The kernel matrix M is computed by

\[ M = W \left( \Sigma_I^{-1} - \Sigma_E^{-1} \right) W^T, \tag{14} \]

where W is obtained by maximizing the Generalized Rayleigh Quotient:

\[ J(w) = \frac{w^T \Sigma_E w}{w^T \Sigma_I w}. \tag{15} \]
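A condensed sketch of this learning procedure (Eqs. (11)–(15)) is given below; the all-pairs enumeration and the ridge term `reg` are simplifications of our own, and a faithful implementation should follow Liao et al. [38]:

```python
import numpy as np
from scipy.linalg import eigh

def xqda_like(X, y, dim=50, reg=1e-3):
    """Learn W from the generalized Rayleigh quotient of Eq. (15) and
    form the kernel matrix of Eq. (14). X: n x d features, y: n labels."""
    n, d = X.shape
    intra, extra = [], []
    for a in range(n):                       # O(n^2) pair enumeration: sketch only
        for b in range(a + 1, n):
            (intra if y[a] == y[b] else extra).append(X[a] - X[b])
    Si = np.cov(np.asarray(intra).T) + reg * np.eye(d)    # ~Eq. (12)
    Se = np.cov(np.asarray(extra).T) + reg * np.eye(d)    # ~Eq. (13)
    vals, vecs = eigh(Se, Si)                # maximize w^T Se w / w^T Si w
    W = vecs[:, np.argsort(vals)[::-1][:dim]]
    Sip, Sep = W.T @ Si @ W, W.T @ Se @ W    # projected covariances
    return W @ (np.linalg.inv(Sip) - np.linalg.inv(Sep)) @ W.T   # M, Eq. (14)

def xqda_distance(M, xi, xj):
    d = xi - xj
    return float(d @ M @ d)                  # Eq. (11)
```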

4. Experiments

In this section, we compare the performance of our approach, implemented with the techniques described in Section 3, with those of the state-of-the-art approaches. The main standard protocol for evaluating a person re-identification algorithm is the Cumulative Matching Characteristic (CMC) curve [6], which measures the cumulative expectation of finding the true match within the top n ranks.

4.1. Datasets

To evaluate re-identification technologies, multiple benchmark datasets have been published in recent years. The characteristics of the datasets used in our experiments for person re-identification are summarized in Table 1, and some details are given below.

Table 1. Datasets used in our evaluation: VIPeR, i-LIDS, CAVIA4REID, ETHZ1, ETHZ2, ETHZ3, GRID, CUHK01 and Market1501, annotated for illumination, viewpoint, pose and scale variations and occlusions (all nine datasets exhibit illumination variations).

VIPeR dataset [21]: VIPeR (Viewpoint Invariant Pedestrian Recognition), built by Gray et al., is one of the most popular and challenging datasets. It contains 632 pedestrians taken from two non-overlapping camera views, and each camera captured only one image per person. The two cameras were placed at different locations; therefore, the two images of the same person contain high degrees of viewpoint and illumination variation, and most images exhibit viewpoint changes of 90° or more. After cropping and scaling, all images are normalized to 128 × 48 pixels.

i-LIDS dataset [20]: The i-LIDS Multiple-Camera Tracking Scenario (MCTS) dataset, created by Zheng et al., was collected from multiple camera views. The images come from a busy airport arrival hall, and most of the persons are carrying baggage, so the dataset is subject to serious occlusions.

CAVIA4REID dataset [22]: The images were captured by two different cameras in an outdoor shopping center scenario of the original CAVIAR dataset, which was initially created to evaluate people tracking and detection algorithms. A total of 50 pedestrians appear in two camera views, and the remaining 22 appear in only one camera view. Compared with pre-existing databases, the typical characteristic of this dataset is that the images have large differences in resolution: the minimum and maximum sizes of the images are 17 × 39 and 72 × 144, respectively.

ETHZ dataset [19]: The ETHZ dataset consists of three sub-datasets, which were taken by a moving camera in a busy street scene. All images are captured from a single camera, so variations of viewpoint and pose are rather few.

GRID dataset [39]: The QMUL underGround Re-IDentification (GRID) dataset contains images that were captured from eight disjoint camera views in a crowded underground train station. It is divided into probe and gallery sets and contains 250 pedestrian image pairs. The probe set contains 250 persons, while the gallery set contains 1025 persons; therefore, an additional 775 persons in the gallery set have no matched images in the probe set.

CUHK01 dataset [54]: The images in the CUHK01 dataset were obtained from two disjoint camera views in a campus environment. The dataset contains 971 individuals, and each person has four images. The back and frontal views of individuals come from camera view A, while the side views are captured with camera view B. Each camera view contributes two images per person.

Market1501 dataset [55]: This is the largest and most realistic of these datasets, collected in front of a busy supermarket. It contains 32,643 detected person bounding boxes of 1501 pedestrians, obtained by running the Deformable Part Model (DPM) detector.

Example images of the abovementioned datasets are shown in Fig. 4.

4.2. Performance evaluation

Since not all approaches provide results for all of the abovementioned datasets, we use "–" to indicate that an approach does not report a result for the corresponding dataset, and supervised methods are marked with "∗".

4.2.1. Evaluation of MSHF descriptor

Even though some methods report performance on the same datasets, their descriptors are usually combinations of different image descriptors, making it hard to conclude whether the differences in the reported CMC scores are due to the authors' presented descriptors or to the other fused features. Therefore, to fairly compare the performance of each descriptor, we evaluate each method using only the authors' proposed descriptor, without combining it with other descriptors. The performance of our descriptor, described in Section 3, and of some of the most commonly used descriptors in the re-identification literature is evaluated in the following. We provide the comparison results in Tables 2–7, with the best scores given in boldface. For the VIPeR database, we follow the same experimental protocol as in [2] and define images from Cam B as the gallery set and those from Cam A as the probe set.
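For reference, a minimal sketch of how the single-shot CMC curve is computed from a probe-by-gallery distance matrix (function and variable names are ours):

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """CMC score at rank n: fraction of probes whose correct match
    appears among the n nearest gallery entries."""
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for p, pid in enumerate(probe_ids):
        order = np.argsort(dist[p])                       # ascending distance
        rank = np.where(gallery_ids[order] == pid)[0][0]  # position of true match
        if rank < max_rank:
            hits[rank:] += 1
    return hits / len(probe_ids)
```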


Fig. 4. Example images taken from different datasets used in our evaluation. The images in each column show the same person.

Table 2. Comparison with other low-level features on VIPeR (highest scores in boldface).
Method        r=1     r=5     r=10    r=20
SDC_knn [3]   22.15   38.61   49.68   62.97
WCH [2]       13.88   32.53   44.23   57.01
MSCR [2]       9.29   21.00   31.39   42.57
MSHF          20.01   43.67   55.43   68.65

Table 4. Comparison with descriptors used in learning algorithms on VIPeR (best scores in boldface).
Method                r=1     r=5     r=10    r=20
ELF18 (L1) [42]       12.15   26.01   32.82   42.47
descriptor (L1) [43]   4.18   11.65   16.52   22.37
LOMO (L1) [38]         3.35   13.31   20.81   34.17
descriptor (L1) [47]  12.15   26.01   32.09   34.72
MSHF                  20.01   43.67   55.43   68.65

Table 3. Comparison with other descriptors extracted from low-level features on VIPeR (highest scores in boldface).
Method         r=1     r=5     r=10    r=20
AIR∗ [30]       5.56   15.76   24.72   –
W. AIR∗ [30]    4.84   17.44   29.24   –
MLA∗ [31]       5.06   19.24   29.06   –
BiCov [24]      9.01   23.59   33.59   45.95
MSHF           20.01   43.67   55.43   68.65

The results of our method are obtained by averaging over 10 different random sets of 316 pedestrians. On the VIPeR dataset, we first compared the proposed MSHF descriptor with other low-level features that are widely used in person re-identification. From Table 2, we can see that MSHF consistently outperforms MSCR, WCH (weighted color histograms) and SDC_knn: although we obtain lower performance than SDC_knn at ranks 1–4, MSHF yields the best matching rates from rank 5 onward.

Table 5. CMC scores on ETHZ1 (highest scores in boldface).
Method       r=1     r=2     r=3     r=4     r=5     r=6     r=7
MSCR [2]     39.20   49.70   55.83   60.11   63.45   66.21   68.46
SDALF [2]    66.63   74.50   78.44   80.95   82.79   84.27   85.48
BiCov [24]   71.08   77.48   80.61   82.82   84.40   85.67   86.78
MSHF         72.92   79.41   82.82   85.17   86.95   88.36   89.44

In this case, our MSHF descriptor outperforms the other low-level visual descriptors, which shows that it is robust to viewpoint and illumination variations compared with low-level visual features. Next, we compared the performance of the proposed MSHF descriptor with that of other descriptors built from low-level features. The first is an attribute descriptor, proposed in [30,31]: Layne et al. proposed to learn mid-level semantic attributes from low-level features to describe people. The second, called BiCov, relies on the combination of Biologically Inspired Features (BIF) and covariance descriptors.


Fig. 5. Performances on ETHZ dataset.

Table 6. CMC scores on ETHZ2 (highest scores in boldface).
Method       r=1     r=2     r=3     r=4     r=5     r=6     r=7
MSCR [2]     37.46   49.22   56.51   61.77   66.10   69.65   72.73
SDALF [2]    63.16   73.46   78.85   82.50   85.23   87.35   89.06
BiCov [24]   71.76   79.10   82.93   85.41   87.39   88.92   90.05
MSHF         73.81   80.79   84.26   86.56   88.24   89.64   90.83

Table 7. CMC scores on ETHZ3 (highest scores in boldface).
Method       r=1     r=2     r=3     r=4     r=5     r=6     r=7
MSCR [2]     50.72   62.25   69.00   73.70   77.27   80.15   82.55
SDALF [2]    74.46   82.86   86.88   89.45   91.22   92.50   93.53
RbFS [17]    75.93   –       –       –       91.38   –       –
BiCov [24]   85.05   89.62   92.07   93.53   94.54   95.21   95.75
MSHF         85.23   89.92   92.82   94.73   96.00   96.83   97.42

From Table 3, we can see that our MSHF descriptor significantly outperforms the other descriptors built from low-level features. This suggests that our descriptor more accurately represents re-identification images in most cases. Finally, we compared our MSHF descriptor with other common descriptors used in learning algorithms for person re-identification. In [42], each image is equally partitioned into 18 horizontal strips. For each strip, RGB, HSV, YCbCr, Lab, YIQ and 16 Gabor texture features are extracted, each feature channel is represented by a 16-D histogram normalized by the L1-norm, and all histograms are concatenated into a single feature vector; this feature is denoted as ELF18. The descriptor in [43] is constructed almost identically to that in [42], except that the person image is divided into six horizontal strips of equal size and RGB, YCbCr, HS and Local Binary Patterns (LBP) are computed for each strip. In [38], researchers proposed an efficient descriptor called Local Maximal Occurrence (LOMO). In [47], researchers proposed a deep Feature Fusion Network (FFN) to extract features, which complements hand-crafted features for pedestrian image representation. Table 4 shows the performance of our descriptor compared with these descriptors under the L1-norm. Our MSHF descriptor outperforms the other descriptors, which shows that it has better discriminative and invariance capabilities. In addition to these experiments, we have also tested the MSHF representation on the ETHZ database. For this dataset, the gallery set is built by randomly selecting one image of each person, while the other images form the probe set. The independent trial is repeated 100 times. All images are scaled to 64 × 32 pixels.

Fig. 6. Performances on VIPeR dataset.

The CMC curves for the ETHZ datasets are displayed in Fig. 5. The results for ETHZ1, ETHZ2 and ETHZ3 show that MSHF clearly improves on the other descriptors. In particular, on ETHZ1, MSHF is 1.96% better than BiCov at rank 1. On ETHZ2, the matching rates at rank 1 are approximately 73.81% for MSHF and 71.76% for BiCov. On ETHZ3, the matching rate at rank 1 is approximately 85.23% for MSHF.

4.2.2. Evaluation of eMSHF

First, we show the results for VIPeR. In Fig. 6, we display a comparison among SDALF, eSDC_knn, eSDC_ocsvm, gBiCov, eBiCov, ISR and eMSHF in terms of CMC curves. It can be observed that our approach outperforms SDALF, eSDC_knn, eSDC_ocsvm, gBiCov, eBiCov and ISR, even though ISR is slightly superior to our method in the first positions of the CMC curve (ranks 1–4). More comparison results are shown in Table 8. In particular, the matching rates at rank 1 are approximately 26.23% for eMSHF, 18.45% for SDALF, 17.01% for gBiCov, 24.34% for eBiCov, 23.10% for eSDC_knn, 26.27% for eSDC_ocsvm and 26.99% for ISR. From these data, we see that eMSHF is about one percentage point below the state-of-the-art performance at rank 1. ISR is based on sparse reconstruction, which has been demonstrated to be a powerful tool for face recognition. Although it leads to higher performance at the first rank, its time complexity is too high due to iteration and re-weighting. Nevertheless, we achieve 51.11% accuracy at rank 5 and 64.93% accuracy at rank 10, which are much better than the accuracies of the other methods.


Table 8. VIPeR—comparison with state-of-the-art methods (best scores in boldface).
Method           r=1     r=5     r=10    r=20
SDALF [2]        18.45   37.59   50.76   66.36
gBiCov [24]      17.01   33.67   46.84   58.72
eBiCov [24]      24.34   46.75   58.48   71.17
CPS [22]         21.84   44.00   57.21   71.00
eSDC_knn [3]     23.10   41.14   53.48   65.19
eLDFV [25]       22.34   47.00   60.04   71.00
ISR [5]          26.99   49.84   61.20   73.04
RbFS [17]        22.47   46.84   56.65   73.73
eSDC_ocsvm [3]   26.27   45.89   56.96   67.09
eMSHF            26.23   51.11   64.93   74.96

Fig. 7. Comparisons on i-LIDS dataset.

Fig. 8. CMC curves on CAVIA4REID dataset.

Table 9. i-LIDS—comparison with state-of-the-art methods (highest scores in boldface).
Method           r=1     r=5     r=10    r=20
SDALF [2]        25.13   44.12   55.21   70.25
eSDC_knn [3]     34.31   50.87   58.85   68.66
eSDC_ocsvm [3]   33.98   50.06   58.07   68.85
PLS∗ [19]        18.32   38.23   49.68   64.95
ELF∗ [6]         22.79   44.41   57.16   70.55
MFL opt.∗ [35]   30.76   50.59   58.74   70.42
eBiCov [24]      29.62   51.55   60.30   71.63
eMSHF            34.46   53.73   61.92   72.57

Now let us analyze the results on the i-LIDS dataset. We randomly select image pairs for each person to form the gallery and probe sets, and the whole procedure is repeated 10 times. Images are normalized to a size of 128 × 64 pixels. The average CMC curves are depicted in Fig. 7. The experiments show that eMSHF outperforms SDALF, eSDC_knn, eSDC_ocsvm and eBiCov: the rank-1 matching rate of eMSHF reaches 34.46%, versus 25.13%, 29.62%, 33.98% and 34.31% for SDALF, eBiCov, eSDC_ocsvm and eSDC_knn, respectively. Meanwhile, we obtain better performance than some supervised algorithms, such as PLS, ELF and MFL. More experimental results are shown in Table 9. Next, we consider the CAVIA4REID dataset. The settings for the CAVIA4REID dataset are the same as those for i-LIDS, except that the normalized size of the images is 128 × 48 pixels. The average CMC curves for the different approaches are plotted in Fig. 8. Moreover, we provide the corresponding CMC scores in Table 10.

Table 10. CAVIA4REID—comparison with state-of-the-art methods (best scores in boldface).
Method           r=1     r=5     r=10    r=20
SDALF [2]        21.47   36.86   46.91   60.55
eSDC_knn [3]     22.56   37.54   46.80   58.70
eSDC_ocsvm [3]   23.18   37.92   47.24   59.26
ISR [5]          29.00   –       –       –
RbFS [17]        26.56   41.34   50.42   62.28
eMSHF            31.44   47.01   56.99   69.35

The CMC curves and scores reveal that the performance of eMSHF is much better than those of SDALF, RbFS, eSDC_knn and eSDC_ocsvm. The results demonstrate that eMSHF is robust to variations in image resolution. More specifically, the rank-1 correct-match rate of our proposed method is approximately 31.44%, while it is 22.56% for eSDC_knn, 23.18% for eSDC_ocsvm and 29.00% for ISR. The above experimental results indicate that our MSHF descriptor can complement low-level features to enhance the performance of person re-identification.

4.2.3. Evaluation of the proposed method

For fair comparison, we follow the standard experimental protocol for the VIPeR dataset in our experiments. We randomly select half of the persons out of the 632 pairs of images for the training set and reserve the remaining persons for the test set. This procedure is repeated 10 times, and the average performance is calculated. The dimension of the LOMO descriptor is 26,960, while the dimension of our MSHF descriptor is 3747 (3072 histogram bins + 135 mean-vector entries + 540 co-occurrence entries). This indicates that the XQDA learning algorithm can make better use of the available information, and the learned kernel matrix is more robust. Therefore, we combine the MSHF descriptor with low-level features to obtain better performance. From the results shown in Table 11, we can observe that our method performs much better than any of the other approaches. In particular, the rank-1 matching rate of our method (MSHF + LOMO + XQDA) is approximately 45.76%, while that of LOMO + XQDA is 40.00%. The method proposed by Lin et al. [50] addresses the problem of handling spatial misalignments due to camera-view changes or human-pose variations in person re-identification. The other approaches require training data with identity labels to learn a correspondence structure, but our descriptor does not rely on training data. This comparison indicates that, despite its simplicity, our descriptor helps to complement low-level features.


Table 11. VIPeR—ranked matching rates (%) for p = 316 pedestrians in the probe testing set (best scores in boldface).
Method                     r=1     r=5     r=10    r=20
kBiCov∗ [24]               31.11   58.33   70.71   91.08
Method∗ [50]               34.80   68.70   82.30   91.80
LOMO + XQDA∗ [38]          40.00   68.13   80.51   91.08
NullReid∗ [48]             42.28   71.47   82.94   92.06
Hybrid∗ [52]               44.11   72.59   81.66   91.47
DeepRanking∗ [53]          38.37   69.22   81.33   90.42
SSDAL∗ [56]                37.90   65.50   75.60   88.40
PCT [34] + LOMO + XQDA∗    38.86   68.35   80.69   91.29
SSDAL∗ + XQDA∗ [56]        43.50   71.80   81.50   89.00
MSHF + LOMO + XQDA∗        45.76   74.87   85.32   94.15
eMSHF + LOMO + XQDA∗       47.31   75.92   86.39   94.21

Table 14. Method comparison on the Market-1501 dataset.
Method                     r=1     mAP
Baseline∗ [55]             34.38   14.10
LOMO + XQDA∗ [38]          42.28   21.55
TMA∗ [61]                  47.92   22.31
G-All∗ [57]                51.10   25.47
L-All∗ [57]                49.55   23.83
Hybrid∗ [52]               48.15   29.94
MSHF + LOMO + XQDA∗        48.90   26.16
eMSHF + LOMO + XQDA∗       51.60   28.00

Table 12. GRID—ranked matching rates (%) for p = 125 pedestrians in the probe testing set (best scores in boldface).
Method                       r=1     r=5     r=10    r=20
RDC∗ [7]                      9.68   22.00   32.96   44.32
ELF6 [7] + XQDA∗ [38]        10.48   28.08   38.64   52.56
ELF6 [7] + RMLLC(R)∗ [43]    11.68   27.04   37.20   48.88
LOMO + XQDA∗ [38]            16.56   33.84   41.84   52.40
SSDAL∗ [56]                  19.10   35.60   45.80   58.10
DR-KISS∗ [51]                20.60   39.30   51.40   62.60
G-All∗ [57]                  19.20   39.84   49.44   59.36
LOMO + SSM∗ [14]             18.96   –       44.16   55.92
MSHF + LOMO + XQDA∗          21.28   41.44   51.28   61.76
eMSHF + LOMO + XQDA∗         23.20   44.48   53.04   64.16

Table 13. CUHK01—ranked matching rates (%) for p = 486 pedestrians in the probe testing set (best scores in boldface).
Method                         r=1     r=5     r=10    r=20
LOMO + XQDA∗ [38]              63.21   83.89   90.04   94.16
MLAPG∗ [60]                    64.24   85.41   90.84   94.92
NullReid∗ [48]                 64.98   84.96   89.92   94.36
DeepRanking∗ [53]              50.41   75.93   84.07   91.32
DeepRanking∗ + kLFDA [53]      57.28   81.07   88.44   93.46
Enhanced Deep Feature∗ [47]    55.51   78.40   83.68   92.59
MSHF + LOMO + XQDA∗            65.56   84.79   90.97   94.69
eMSHF + LOMO + XQDA∗           66.75   85.80   91.75   95.16

The widely adopted experimental setting of 10 random trials is utilized on the GRID database, with the provided fixed training and test sets. For each trial, 125 pairs of images are used for training, and the other 125 image pairs, as well as the 775 background images, are used for testing. We report the comparison results at ranks 1, 5, 10 and 20 in Table 12. As can be observed, our method achieves the best matching rates at ranks 1–20, except that the rank-10 and rank-20 accuracies of MSHF + LOMO + XQDA are slightly worse than those of DR-KISS. This is because DR-KISS is designed to overcome the SSS problem in distance-metric learning, and the SSS problem is severe in the GRID database, which has only 125 pairs of training samples. For the CUHK01 dataset, we report results for the setting in which 485 identities are used for training and the remaining 486 for testing (multi-shot), following [38]. Our approach again obtains superior results, as shown in Table 13. The results indicate that the proposed descriptor, combined with a metric learning approach, outperforms other distance learning algorithms such as NullReid [48] and convolutional neural network algorithms such as DeepRanking [53] and Enhanced Deep Feature [47]. We used the fixed training and test sets provided by the Market1501 dataset; Table 14 presents the results. The rank-1 matching rate and mean average precision (mAP) [55] are used to evaluate the performance.
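For reference, a short sketch of the single-query mAP computation (names are ours; the official Market-1501 evaluation additionally filters junk and same-camera images):

```python
import numpy as np

def mean_average_precision(dist, probe_ids, gallery_ids):
    """Single-query mAP: average precision of each probe's ranked
    gallery list, then the mean over all probes."""
    gallery_ids = np.asarray(gallery_ids)
    aps = []
    for p, pid in enumerate(probe_ids):
        matches = (gallery_ids[np.argsort(dist[p])] == pid).astype(float)
        if matches.sum() == 0:
            continue                                   # probe has no true match
        precision = np.cumsum(matches) / np.arange(1, matches.size + 1)
        aps.append(float((precision * matches).sum() / matches.sum()))
    return float(np.mean(aps))
```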

However, we find that Hybrid performs better than eMSHF + LOMO + XQDA in terms of mAP. This is because Hybrid combines Fisher vectors and deep neural networks to learn non-linear representations of person images: with 12,936 training samples, the deep neural networks can learn robust feature representations.

5. Conclusion

In this paper, we propose a simple yet effective feature description method for person re-identification. This descriptor is named MSHF, which is short for multi-statistics on hash feature map. To make the descriptor robust against illumination and viewpoint variations, we use PHA to encode some invariant macro structures of person images. While a rough hashing may not be discriminative enough, the combination of several different feature channels and regional statistics is able to exploit complementary information and enhance the discriminability. Our method outperforms several state-of-the-art algorithms on seven public benchmark datasets.

References

[1] X. Wang, R. Zhao, Person re-identification: system design and evaluation overview, in: Person Re-Identification, Springer, 2014, pp. 351–370.
[2] L. Bazzani, M. Cristani, V. Murino, Symmetry-driven accumulation of local features for human characterization and re-identification, Comput. Vis. Image Underst. 117 (2) (2013) 130–144.
[3] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3586–3593.
[4] M. Zeng, Z. Wu, C. Tian, et al., Efficient person re-identification by hybrid spatiogram and covariance descriptor, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 48–56.
[5] G. Lisanti, I. Masi, A. Bagdanov, et al., Person re-identification by iterative re-weighted sparse ranking, IEEE Trans. Pattern Anal. Mach. Intell. 37 (8) (2015) 1629–1642.
[6] D. Gray, H. Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: Proceedings of European Conference on Computer Vision (ECCV), 2008, pp. 262–275.
[7] W. Zheng, S. Gong, T. Xiang, Reidentification by relative distance comparison, IEEE Trans. Pattern Anal. Mach. Intell. 35 (3) (2013) 653–668.
[8] F. Xiong, M. Gou, O. Camps, Person re-identification using kernel-based metric learning methods, in: Proceedings of European Conference on Computer Vision (ECCV), 2014, pp. 1–16.
[9] Y. Yang, J. Yang, J. Yan, S. Liao, D. Yi, S.Z. Li, Salient color names for person re-identification, in: Proceedings of European Conference on Computer Vision (ECCV), 2014, pp. 536–551.
[10] S. Karanam, Y. Li, R. Radke, Sparse Re-Id: block sparsity for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2015, pp. 33–40.
[11] Z. Wu, Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features, IEEE Trans. Pattern Anal. Mach. Intell. 37 (5) (2015) 1095–1108.
[12] K. Jungling, M. Arens, View-invariant person re-identification with an implicit shape model, in: Proceedings of IEEE Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2011, pp. 197–202.
[13] Y. Zhang, S. Li, Gabor-LBP based region covariance descriptor for person re-identification, in: Proceedings of IEEE Conference on Image and Graphics (ICIG), 2011, pp. 368–371.
[14] S. Bai, X. Bai, Q. Tian, Scalable person re-identification on supervised smoothed manifold, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.


[15] H.M. Hu, J. Wu, B. Li, Q. Guo, An adaptive fusion algorithm for visible and infrared videos based on entropy and the cumulative distribution of gray levels, IEEE Trans. Multimed. (2017), doi:10.1109/TMM.2017.2711422.
[16] R. Zhao, W. Ouyang, X. Wang, Unsupervised salience learning for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3586–3593.
[17] Y. Geng, H.M. Hu, G. Zeng, A person re-identification algorithm by exploiting region-based feature salience, J. Vis. Commun. Image Represent. 29 (2015) 89–102.
[18] L. Bazzani, M. Cristani, A. Perina, V. Murino, Multiple-shot person re-identification by chromatic and epitomic analyses, Pattern Recognit. Lett. 33 (2012) 898–903.
[19] W.R. Schwartz, L.S. Davis, Learning discriminative appearance-based models using partial least squares, in: Proceedings of 2009 XXII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI), IEEE, 2009, pp. 322–329.
[20] W. Zheng, S. Gong, T. Xiang, Associating groups of people, in: Proceedings of British Machine Vision Conference (BMVC), 2009, pp. 23.1–23.11.
[21] D. Gray, S. Brennan, H. Tao, Evaluating appearance models for recognition, reacquisition, and tracking, in: Proceedings of IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), 2007, pp. 41–48.
[22] D.S. Cheng, M. Cristani, M. Stoppa, et al., Custom pictorial structures for re-identification, in: Proceedings of British Machine Vision Conference (BMVC), 2011, pp. 1–11.
[23] M. Koestinger, M. Hirzer, P. Wohlhart, et al., Large scale metric learning from equivalence constraints, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2288–2295.
[24] B. Ma, Y. Su, F. Jurie, Covariance descriptor based on bio-inspired features for person re-identification and face verification, Image Vis. Comput. 32 (6–7) (2014) 379–390.
[25] B. Ma, Y. Su, F. Jurie, Local descriptors encoded by Fisher vectors for person re-identification, in: Proceedings of European Conference on Computer Vision (ECCV), 2012, pp. 413–422.
[26] W. Lin, Y. Mi, W. Wang, J. Wu, J. Wang, T. Mei, A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes, IEEE Trans. Image Process. 25 (4) (2016) 1674–1687.
[27] A. Mignon, F. Jurie, PCCA: a new approach for distance learning from sparse pairwise constraints, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2666–2672.
[28] M. Hirzer, P.M. Roth, M. Köstinger, et al., Relaxed pairwise learned metric for person re-identification, in: Proceedings of European Conference on Computer Vision (ECCV), 2012, pp. 780–793.
[29] D. Chen, Z. Yuan, G. Hua, et al., Similarity learning on an explicit polynomial kernel feature map for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1565–1573.
[30] R. Layne, T.M. Hospedales, S. Gong, et al., Person re-identification by attributes, in: Proceedings of British Machine Vision Conference (BMVC), 2012, p. 8.
[31] R. Layne, T.M. Hospedales, S. Gong, Towards person identification and re-identification with attributes, in: Proceedings of European Conference on Computer Vision (ECCV), 2012, pp. 402–412.
[32] B. Prosser, W. Zheng, S. Gong, et al., Person re-identification by support vector ranking, in: Proceedings of British Machine Vision Conference (BMVC), 2010, pp. 1–11.
[33] S. Bak, E. Corvee, F. Brémond, M. Thonnat, Person re-identification using spatial covariance regions of human body parts, in: Proceedings of IEEE Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2010, pp. 435–440.
[34] H.M. Hu, W. Fang, G. Zeng, et al., A person re-identification algorithm based on pyramid color topology feature, Multimed. Tools Appl. (2017) 1–14, doi:10.1007/s11042-016-4188-2.
[35] D. Figueira, L. Bazzani, H.Q. Minh, et al., Semi-supervised multi-feature learning for person re-identification, in: Proceedings of IEEE Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2013, pp. 111–116.
[36] O. Tuzel, F. Porikli, P. Meer, Region covariance: a fast descriptor for detection and classification, in: Proceedings of European Conference on Computer Vision (ECCV), 2006, pp. 589–600.
[37] W. Lin, Y. Zhou, H. Xu, J. Yan, M. Xu, J. Wu, Z. Liu, A tube-and-droplet-based approach for representing and analyzing motion trajectories, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2016) 1489–1503.
[38] S. Liao, Y. Hu, X. Zhu, et al., Person re-identification by local maximal occurrence representation and metric learning, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2197–2206.
[39] C.L. Chen, X. Tao, S. Gong, Time-delayed correlation analysis for multi-camera activity understanding, Int. J. Comput. Vis. 90 (1) (2010) 106–129.
[40] W. Li, R. Zhao, X. Wang, Human reidentification with transferred metric learning, in: Proceedings of Asian Conference on Computer Vision (ACCV), 2012, pp. 31–44.
[41] C. Madden, E.D. Cheng, M. Piccardi, Tracking people across disjoint camera views by an illumination-tolerant appearance representation, Mach. Vis. Appl. 18 (3) (2007) 233–247.
[42] Y.C. Chen, W.S. Zheng, J. Lai, Mirror representation for modeling view-specific transform in person re-identification, in: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 2015.
[43] J. Chen, Z. Zhang, Y. Wang, Relevance metric learning for person re-identification by exploiting listwise similarities, IEEE Trans. Image Process. 24 (12) (2015) 4741–4755.

[m5G;July 20, 2017;4:2] 11

[44] W. Li, R. Zhao, T. Xiao, et al., DeepReID: deep filter pairing neural network for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 152–159.
[45] T. Xiao, H. Li, W. Ouyang, X. Wang, Learning deep feature representations with domain guided dropout for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[46] J. Hu, J. Lu, Y.P. Tan, Deep transfer metric learning, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 325–333.
[47] S. Wu, Y.C. Chen, X. Li, et al., An enhanced deep feature representation for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1–8.
[48] L. Zhang, T. Xiang, S. Gong, Learning a discriminative null space for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[49] W. Lin, Y. Shen, J. Yan, M. Xu, J. Wu, J. Wang, K. Lu, Learning correspondence structures for person re-identification, IEEE Trans. Image Process. 26 (5) (2017) 2438–2453.
[50] Y. Shen, W. Lin, J. Yan, M. Xu, J. Wu, J. Wang, Person re-identification with correspondence structure learning, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.
[51] D. Tao, Y. Guo, M. Song, et al., Person re-identification by dual-regularized KISS metric learning, IEEE Trans. Image Process. 25 (6) (2016) 2726–2738.
[52] L. Wu, C. Shen, A. van den Hengel, Deep linear discriminant analysis on Fisher networks: a hybrid architecture for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[53] S.Z. Chen, C.C. Guo, J.H. Lai, Deep ranking for person re-identification via joint representation learning, IEEE Trans. Image Process. 25 (5) (2016) 2353–2367.
[54] W. Li, X. Wang, Locally aligned feature transforms across views, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3594–3601.
[55] L. Zheng, L. Shen, L. Tian, et al., Scalable person re-identification: a benchmark, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1116–1124.
[56] C. Su, S. Zhang, J. Xing, et al., Deep attributes driven multi-camera person re-identification, in: Proceedings of European Conference on Computer Vision (ECCV), 2016.
[57] D.P. Chen, Z.J. Yuan, B.D. Chen, et al., Similarity learning with spatial constraints for person re-identification, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1268–1277.
[58] J.D. Wang, T. Zhang, J.K. Song, et al., A survey on learning to hash, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[59] W. Lin, Y. Zhang, J. Lu, B. Zhou, J. Wang, Y. Zhou, Summarizing surveillance videos with local-patch-learning-based abnormality detection, blob sequence optimization, and type-based synopsis, Neurocomputing 155 (2015) 84–98.
[60] S. Liao, S.Z. Li, Efficient PSD constrained asymmetric metric learning for person re-identification, in: Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3685–3693.
[61] N. Martinel, A. Das, C. Micheloni, A.K. Roy-Chowdhury, Temporal model adaptation for person re-identification, in: Proceedings of European Conference on Computer Vision (ECCV), 2016, pp. 858–877.
[62] S. Ge, J. Li, Q. Ye, Z. Luo, Detecting masked faces in the wild with LLE-CNNs, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Wen Fang received the B.S. degree in computer science and technology from Wuhan University of Science and Technology, Hubei, China, in 2009, and the M.S. degree in mathematics and computer science from Fuzhou University, Fujian, China, in 2013. She is currently pursuing the Ph.D. degree in computer science and engineering at Beihang University, Beijing, China. Her current research interests include person re-identification, image enhancement, and video analysis and understanding.

Hai-Miao Hu received the B.S. degree from Central South University, Changsha, China, in 2005, and the Ph.D. degree from Beihang University, Beijing, China, in 2012, both in computer science. He was a visiting student at the University of Washington from 2008 to 2009. He is currently an associate professor of Computer Science and Engineering at Beihang University. His research interests include video coding and networking, image/video processing, and video analysis and understanding.


Zihao Hu received the B.S. degree from China University of Geosciences, Wuhan, China, in 2016, and is currently pursuing the M.S. degree at Beihang University, Beijing, China (expected 2019), both in computer science. His current research interests include video analysis and understanding.

Bo Li received the B.S. degree in computer science from Chongqing University in 1986, the M.S. degree in computer science from Xi’an Jiaotong University in 1989, and the Ph.D. degree in computer science from Beihang University in 1993. He is currently a professor of Computer Science and Engineering at Beihang University and the Director of the Beijing Key Laboratory of Digital Media. He has published over 100 conference and journal papers in research fields including digital video and image compression, video analysis and understanding, remote sensing image fusion, and embedded digital image processors.

Shengcai Liao received the B.S. degree in mathematics and applied mathematics from Sun Yat-sen University, Guangzhou, China, in 2005 and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China, in 2010. He was a Post-Doctoral Fellow with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA, from 2010 to 2012. He is currently an Associate Professor with CASIA. His research interests include face recognition and video analysis. Dr. Liao received the Motorola Best Student Paper Award and the 1st Place Best Biometrics Paper Award at the International Conference on Biometrics in 2006 and 2007, respectively, for his work on face recognition. He also received the Best Reviewer Award at IJCB 2014.
