Real-time robust background subtraction under rapidly changing illumination conditions


Image and Vision Computing 30 (2012) 1004–1015


Luc Vosters (Eindhoven University of Technology, The Netherlands), Caifeng Shan (Philips Research, Eindhoven, The Netherlands) and Tommaso Gritti (Philips Research, Eindhoven, The Netherlands)

Article history: Received 10 February 2012; Received in revised form 6 June 2012; Accepted 23 August 2012.

Keywords: Background subtraction, EM, Eigenbackground, Illumination changes

Abstract: Fast robust background subtraction under sudden lighting changes is a challenging problem in many applications. In this paper, we propose a real-time approach which combines the Eigenbackground and Statistical Illumination methods to address this issue. The first algorithm is used to reconstruct the background frame, while the latter improves the foreground segmentation. In addition, we introduce an online spatial likelihood model by detecting reliable background pixels. Extensive quantitative experiments illustrate that our approach consistently achieves significantly higher precision at high recall rates, compared to several state-of-the-art illumination invariant background subtraction methods.

1. Introduction

In many computer vision and video analysis applications, foreground detection is an important component for segmenting foreground objects, people or events of interest from a background scene for higher-level processing and interpretation. Therefore many of these systems contain a background subtraction module. Such a module evaluates the difference between an input image and a background model and produces a foreground mask in which all foreground pixels are marked white and background pixels are marked black. In this paper we limit our scope to cameras with a fixed viewpoint. Toyama et al. [1] have identified the following challenges in background subtraction:
• moving background objects and the new parts of the background they reveal should always be detected as background;
• lighting changes, shadows and reflections;
• dynamic backgrounds such as waterfalls or waving trees;
• camouflage, in which the foreground object characteristics are almost indistinguishable from those of the background;
• motionless foreground;
• homogeneous moving foreground objects, whose interior pixels do not undergo any change and may therefore not appear as foreground.


Out of all these challenges, lighting changes remain a difficult problem for background subtraction in a real-life home or office setting. Lighting changes can be gradual but also sudden. The former are mainly caused by the changing time of day. The latter can originate from various events such as switching a light on or off, the opening and closing of curtains, shades and windows, television screens, people blocking light sources, and the headlights of passing cars. Lighting changes caused by these events can be either global, in that they affect all pixels in the image, or local, in that they only affect a specific portion of the image. For example, the input image in Fig. 2 row 3 shows that a portion of the background pixels is clipped and another portion is highlighted, while the input image of Fig. 2 column 1 shows that all pixels are affected by the same global lighting change. State-of-the-art algorithms can handle progressive illumination changes but remain vulnerable to sudden changes [2]. Therefore, in this paper, we propose an algorithm for background subtraction under rapidly changing lighting conditions for video sequences captured by a camera with a fixed viewpoint. The proposed approach requires a training stage in which we train an Eigenbackground (EB) model [3] from a training sequence that contains the expected illumination changes of the background as seen by a camera with a fixed viewpoint. In online operation mode this background model is then used to reconstruct, for each input image, the corresponding background image. EB allows for instantaneous adaptation to changes in the background, making it robust to sudden local and global light changes and to dynamic backgrounds. However, shadows created by foreground objects cannot be learned by EB due to the absence of foreground objects in the training sequence. Therefore, we adopt a Statistical Illumination (SI) method to model the remaining background variations caused by shadows and highlights of


foreground objects. In contrast with the SI algorithm in [2], which uses a pre-learned spatial likelihood model, we propose to update the spatial likelihood model online by detecting reliable background pixels. In a previous publication [4], we briefly demonstrated that our method outperforms state-of-the-art methods on several challenging test sequences. The major contributions of this paper include:
1. a comprehensive evaluation of our algorithm and other state-of-the-art algorithms on a set of challenging sequences that we recorded ourselves, and on a publicly available benchmark set;
2. an analysis of the sensitivity to the most influential parameters, and a discussion of their effect;
3. an analysis of the computational complexity of our algorithm.
These extensive experimental results illustrate that our approach consistently provides better segmentation accuracy at significantly higher recall levels compared to other state-of-the-art approaches. The remainder of the paper is organized as follows. We give a brief review of related work in Section 2. In Section 3, we present a self-contained description of EB and SI. In Section 4, we present the proposed approach. Then Section 5 contains an experimental validation of the proposed approach. Finally Section 6 concludes the paper.

2. Related work

Background subtraction is a widely studied topic in computer vision, and many techniques have been proposed (see surveys [5,6]). In this paper we limit our scope to methods that address lighting changes. A popular approach is the pixel-based adaptive background mixture model (ABMM) of Stauffer and Grimson [7,8]. In this method the different appearances of a single pixel are modeled by a number of Gaussian Mixture Model (GMM) components, which are updated adaptively over time. This method can cope with gradual illumination changes as well as moving background elements. However, foreground objects could be integrated into the background model if they remain static for too long. Furthermore the method is not robust to rapidly changing lighting conditions, and the spatial dependency among pixels is not taken into account. Di Stefano et al. [9] proposed to use Tonal Alignment (TA) to deal with lighting changes. For each input image, the change detection algorithm of [10] is applied to reliably detect background pixels. These background pixels are employed to compute the histogram specification transformation [11] that tonally aligns the current input image with a background image. The foreground mask is obtained by evaluating the difference between the input and the tonally aligned background image. The histogram specification method can reconstruct global lighting changes very well, but it cannot reconstruct local lighting changes such as shadows and highlights in the background. Therefore it only works for sudden global lighting changes. Eigenbackground (EB) [3,12] is a training-based approach that computes an Eigenspace model from a set of background images recorded from a camera with a fixed viewpoint under various lighting conditions. EB projects each input image onto the learned Eigenspace model to reconstruct the background as a weighted sum of Eigenvectors. Therefore, in addition to global lighting changes, EB can, unlike TA, also reconstruct local lighting changes in the background. Section 3.1 describes the mechanics of Eigenbackground in more detail. More recently Pilet et al. [2] introduced Statistical Illumination (SI), which is based upon the shading model that assumes the ratio between the same pixel in the input and background model image remains unchanged in the presence of illumination changes. SI models the ratios of intensities between a stored background image and an input image by a GMM. Section 3.2 looks at SI in more detail.

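To make the histogram-specification step behind TA concrete, the following is a minimal sketch of tonally aligning a background image to the current frame using a set of detected background pixels. It only illustrates the general technique described above: the function and variable names are ours, and the simple differencing at the end merely stands in for the change-detection logic of [9,10].

```python
import numpy as np

def tonal_align(background, frame, bg_mask, levels=256):
    """Remap the gray levels of `background` so that, on the pixels in
    `bg_mask`, its histogram matches the histogram of `frame`
    (classic histogram specification). Inputs are uint8, single channel."""
    bg_vals = background[bg_mask]
    in_vals = frame[bg_mask]

    # Cumulative distributions of both images on the reliable background pixels.
    bg_hist = np.bincount(bg_vals, minlength=levels).astype(np.float64)
    in_hist = np.bincount(in_vals, minlength=levels).astype(np.float64)
    bg_cdf = np.cumsum(bg_hist) / max(bg_hist.sum(), 1.0)
    in_cdf = np.cumsum(in_hist) / max(in_hist.sum(), 1.0)

    # For each background gray level, pick the frame gray level with the closest CDF value.
    lut = np.interp(bg_cdf, in_cdf, np.arange(levels)).astype(np.uint8)
    return lut[background]

# Foreground mask by simple differencing against the tonally aligned background
# (illustrative threshold only):
# aligned = tonal_align(background, frame, bg_mask)
# foreground = np.abs(frame.astype(int) - aligned.astype(int)) > 50
```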

3. Eigenbackground and Statistical Illumination

3.1. Eigenbackground

The Eigenbackground approach in [3,12,13] computes an Eigenspace model from a training set of background images that are recorded under various different lighting conditions from a fixed viewpoint. We write the training set, which consists of N background images, as an n by N matrix

$$M_{nN} = \left[ \mathbf{m}^{(0)} - \bar{\mathbf{m}}, \; \ldots, \; \mathbf{m}^{(N-1)} - \bar{\mathbf{m}} \right], \qquad (1)$$

where $\bar{\mathbf{m}} = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{m}^{(i)}$ denotes the mean of all images in the training set, $\mathbf{m}^{(i)} \in \mathbb{R}^n$ is a column vector which denotes the i-th background image of width w and height h, and $n = h \cdot w$. Here $\mathbf{m}^{(i)}$ traverses the image pixels in row-major order. Next we write the Eigenvalue decomposition (EVD) of the covariance matrix $C_{nn}$ as

$$C_{nn} = M_{nN} M_{nN}^{T} = U_{nn} \,\mathrm{diag}(\lambda_0, \ldots, \lambda_{n-1})\, U_{nn}^{T}, \qquad (2)$$

where $\mathrm{diag}(\lambda_0, \ldots, \lambda_{n-1})$ is a diagonal matrix of size n by n with Eigenvalues $\lambda_i$ on the diagonal, and the columns of $U_{nn}$ denote the Eigenvectors. The larger the Eigenvalue $\lambda_i$, the more variance in the training set its corresponding Eigenvector captures. Typically, only $p \leq \min(n, N)$ Eigenvalues capture most of the energy in the Eigenspectrum (the energy in the Eigenspectrum is computed as the sum of all Eigenvalues). Therefore, like Hall et al. [13], we retain only the p Eigenvectors corresponding to the p largest Eigenvalues. Finally we write the Eigenspace model as

$$\Omega = \left\{ \bar{\mathbf{m}}, \; U_{np}, \; \mathrm{diag}(\lambda_0, \ldots, \lambda_{p-1}), \; N \right\}. \qquad (3)$$

Instead of computing the Eigenspace model from the n × n matrix $C_{nn} = M_{nN} M_{nN}^{T}$, we can compute the same Eigenspace model from the N × N matrix $M_{nN}^{T} M_{nN}$ (Hall et al. [13]). This results in a substantial reduction of computational complexity since typically N ≪ n. With this Eigenspace model we can, for each input image $\mathbf{u} \in \mathbb{R}^n$, reconstruct the corresponding background image $\hat{\mathbf{m}} \in \mathbb{R}^n$ by projecting it onto the Eigenspace as follows:

$$\hat{\mathbf{m}} = U_{np} U_{np}^{T} \left( \mathbf{u} - \bar{\mathbf{m}} \right) + \bar{\mathbf{m}}. \qquad (4)$$

Then we classify pixel i as foreground if

$$\mathrm{EBDIF}_i = \left\| \left( u_i^r, u_i^g, u_i^b \right) - \left( \hat{m}_i^r, \hat{m}_i^g, \hat{m}_i^b \right) \right\|_2 > \xi_{\mathrm{EBfix}}, \qquad (5)$$

where, for example, $u_i^r$ and $\hat{m}_i^r$ denote the intensity of pixel i for channel R in the input and reconstructed background model image respectively, the superscripts r, g, b denote the R, G and B channels respectively, and $\xi_{\mathrm{EBfix}}$ is a threshold. Han and Jain [12] propose a dynamic threshold $\xi_i^{\mathrm{EBdyn}}$ that is given by

$$\begin{aligned} \xi_i^{\mathrm{EBdyn}} &= \max\left( \xi_{\min}, \; \min\left( \xi_{\max}, \; m_i + \kappa \sigma_i \right) \right), \\ m_i &\leftarrow (1 - \alpha)\, m_i + \alpha \, \mathrm{EBDIF}_i, \\ \sigma_i^2 &\leftarrow (1 - \alpha)\, \sigma_i^2 + \alpha (1 - \alpha) \left( \mathrm{EBDIF}_i - m_i \right)^2, \end{aligned} \qquad (6)$$

where $m_i$ and $\sigma_i$ denote the mean and standard deviation of a Gaussian distribution that models $\mathrm{EBDIF}_i$ for each pixel individually. $m_i$ and $\sigma_i$ are updated for each new frame. It is imperative that the background images in the training set do not contain any foreground objects, since they would cause significant errors in the reconstructed background. Acquiring such a training set for a fixed camera is fairly easy for our target application in a home or office setting.

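As an illustration of Eqs. (1)–(5), the following is a minimal single-channel NumPy sketch of Eigenbackground training and reconstruction, using the N × N trick of Hall et al. [13] and the fixed threshold of Eq. (5). It is our own hedged reading of the method, not the authors' implementation; class, function and parameter names are illustrative.

```python
import numpy as np

class Eigenbackground:
    """Eigenspace background model for a single channel (Eqs. (1)-(4))."""

    def __init__(self, training_images, p=10):
        # Stack the N training images (h x w each) as columns of an n x N matrix.
        M = np.stack([img.ravel().astype(np.float64) for img in training_images], axis=1)
        self.mean = M.mean(axis=1, keepdims=True)
        M -= self.mean
        # Eigen-decompose the small N x N matrix M^T M instead of the n x n covariance.
        evals, V = np.linalg.eigh(M.T @ M)
        order = np.argsort(evals)[::-1][:p]
        # Map the N-dimensional eigenvectors back to image space and normalize them.
        U = M @ V[:, order]
        U /= np.linalg.norm(U, axis=0, keepdims=True)
        self.U = U                      # n x p basis of the Eigenspace model (Eq. (3))
        self.shape = training_images[0].shape

    def reconstruct(self, frame):
        """Project the input onto the Eigenspace and back (Eq. (4))."""
        u = frame.ravel().astype(np.float64)[:, None]
        m_hat = self.U @ (self.U.T @ (u - self.mean)) + self.mean
        return m_hat.reshape(self.shape)

def eb_foreground_mask(frame_rgb, recon_rgb, xi_ebfix=25.0):
    """Fixed-threshold classification of Eq. (5): per-pixel RGB Euclidean distance.
    The default threshold is illustrative only."""
    diff = frame_rgb.astype(np.float64) - recon_rgb.astype(np.float64)
    ebdif = np.sqrt((diff ** 2).sum(axis=2))
    return ebdif > xi_ebfix
```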

3.1.1. Multichannel background reconstruction

Eqs. (1)–(4) hold for a single channel intensity image. To reconstruct the background for a multi-channel color image, Han and Jain [12] make no distinction among channels. They treat a 3-channel RGB color image as 3 individual intensity images, for which they compute a common Eigenspace model. Then they reconstruct the background model image for each R, G and B channel independently from the common Eigenspace model. Conversely, we propose to construct 3 Eigenspace models, one for each individual color channel. Then we compute a background model image by reconstructing the background's R, G and B channels from the individual R, G and B Eigenspace models respectively. Because each channel gets its own Eigenspace model, we argue that the proposed approach for coping with multichannel images in Eigenbackground is more accurate than Han and Jain's approach. Furthermore the complexity for computing the Eigenspace model of Han and Jain is O((d·N)³), versus O(d·N³) for the proposed Eigenspace model, where d is the number of channels in a multichannel image.

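A short sketch of the proposed per-channel variant, reusing the single-channel Eigenbackground class from the previous listing (again our own illustration, not the authors' code):

```python
import numpy as np

def train_per_channel_models(training_frames_rgb, p=10):
    # One Eigenspace model per color channel, trained on that channel only.
    return [Eigenbackground([f[:, :, c] for f in training_frames_rgb], p=p)
            for c in range(3)]

def reconstruct_background_rgb(models, frame_rgb):
    # Reconstruct R, G and B independently from their own Eigenspace models.
    channels = [models[c].reconstruct(frame_rgb[:, :, c]) for c in range(3)]
    return np.stack(channels, axis=2)
```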


3.2. Statistical Illumination

SI is based upon the shading model, which assumes that the ratios of the R, G and B channel intensities between a background pixel in the input image and the same pixel in the background model image are constant. Following the same notation as in Pilet et al. [2], we write the illumination ratio for pixel i as

$$\mathbf{l}_i = \left( \frac{u_i^r + C}{m_i^r + C}, \; \frac{u_i^g + C}{m_i^g + C}, \; \frac{u_i^b + C}{m_i^b + C} \right), \qquad (7)$$

where u is the input image, m is the background image, C is a constant that prevents instability when $m_i$ is zero, and the superscripts r, g and b indicate the R, G and B channel intensities respectively. SI models illumination effects in the background by $K_{BG}$ illumination ratios. We now write the probability of illumination ratio $\mathbf{l}_i$ as

$$p\left( \mathbf{l}_i \mid x_i, \mu, \Sigma \right) = \prod_{k=1}^{K_{BG}} \left[ \pi_k \, \mathcal{N}\!\left( \mathbf{l}_i \mid \mu_k, \Sigma_k \right) \right]^{x_i^{(k)}}, \qquad (8)$$

where $\mu_k$, $\Sigma_k$ and $\pi_k$ denote the mean, variance and relative importance of the k-th mixture component/illumination ratio respectively, and $x_i^{(k)}$ is a set of binary latent variables that take value 1 if pixel i belongs to mixture component k and 0 otherwise. Pilet et al. model the foreground as a mixture of $K_{FG}$ colors, represented by the normal distributions $\mathcal{N}(\mathbf{u}_i \mid \mu_k, \Sigma_k)$, and refer to foreground pixels that do not have a normal distribution as suspicious pixels. These pixels occur with probability $1/256^3$. Then the probability of observing pixel $\mathbf{u}_i = \left( u_i^r, u_i^g, u_i^b \right)$ is

$$p\left( \mathbf{u}_i \mid x_i, \mu, \Sigma \right) = \left( \frac{\pi_{K_{BG}+K_{FG}+1}}{256^3} \right)^{x_i^{(K_{BG}+K_{FG}+1)}} \prod_{k=K_{BG}+1}^{K_{BG}+K_{FG}} \left[ \pi_k \, \mathcal{N}\!\left( \mathbf{u}_i \mid \mu_k, \Sigma_k \right) \right]^{x_i^{(k)}}. \qquad (9)$$

Here μ, Σ and π are the mean and variance of the color intensities. The background model (Eq. (8)) is expressed in terms of illumination ratios, while the foreground model (Eq. (9)) is expressed in terms of color. Therefore we express Eq. (8) as a function of color:

$$p\left( \mathbf{u}_i \mid x_i, \mu_k, \Sigma_k \right) = p\left( \mathbf{l}_i \mid x_i, \mu_k, \Sigma_k \right) / J_i, \quad \forall k \in \{1, \ldots, K_{BG}\}, \qquad (10)$$

where $J_i = m_i^r \cdot m_i^g \cdot m_i^b$ is the determinant of the Jacobian of the function $\mathbf{l}_i(\mathbf{u}_i)$ ([2]). To lower the computational burden, we assume that all pixels are independent identically distributed (i.i.d.). Then from Eqs. (9) and (10), we can write the joint probability distribution over all pixels as

$$p\left( \mathbf{u}, \mathbf{x} \mid \mu, \Sigma, \pi \right) = \prod_{i,k} \left[ p\left( \mathbf{u}_i \mid x_i, \mu_k, \Sigma_k \right) \cdot \pi_k \right]^{x_i^{(k)}}. \qquad (11)$$

Eqs. (8) and (9) assume that all pixels are independent. This entails the loss of the relation between a pixel and its neighbors. To model the dependence among neighboring pixels, a spatial likelihood model is learned offline from ground-truth data. Two features that can be computed efficiently are considered in the spatial likelihood model. The first feature, $f_i^1$, is the zero mean normalized cross correlation (ZNCC) between a window around pixel i in the input and background model image:

$$f_i^1 = \frac{ n \sum_{j \in w_i} u_j^{gs} m_j^{gs} - \left( \sum_{j \in w_i} u_j^{gs} \right) \left( \sum_{j \in w_i} m_j^{gs} \right) }{ \sqrt{ n \sum_{j \in w_i} \left( u_j^{gs} \right)^2 - \left( \sum_{j \in w_i} u_j^{gs} \right)^2 } \; \sqrt{ n \sum_{j \in w_i} \left( m_j^{gs} \right)^2 - \left( \sum_{j \in w_i} m_j^{gs} \right)^2 } }, \qquad (12)$$

where $u_i^{gs}$ and $m_i^{gs}$ denote the gray-scale intensity of pixel i in the input and background model image respectively, $w_i$ is a window centered in pixel i, and n is the number of pixels in $w_i$. The second feature, $f_i^2$, is a measure for the amount of texture:

$$f_i^2 = \sqrt{ n \sum_{j \in w_i} \left( u_j^{gs} \right)^2 - \left( \sum_{j \in w_i} u_j^{gs} \right)^2 } + \sqrt{ n \sum_{j \in w_i} \left( m_j^{gs} \right)^2 - \left( \sum_{j \in w_i} m_j^{gs} \right)^2 }. \qquad (13)$$

The spatial likelihood is then independently modeled for the background (BG) and foreground (FG) as

$$p_{SLM}\left( f_i^1, f_i^2 \mid k \right) = \begin{cases} p_{BG}\left( f_i^1, f_i^2 \right), & k = 1, \ldots, K_{BG} \\ p_{FG}\left( f_i^1, f_i^2 \right), & k = K_{BG}+1, \ldots, K_{BG}+K_{FG} \end{cases}. \qquad (14)$$

Both $p_{BG}$ and $p_{FG}$ are computed from the 2D histograms $h_{BG}(\mathbf{f}^1, \mathbf{f}^2)$ and $h_{FG}(\mathbf{f}^1, \mathbf{f}^2)$, which are computed from background and foreground pixels respectively. These background and foreground pixels are manually annotated in a training set comprising various test images and a single background model image. Finally the joint probability distribution extended with the spatial likelihood model can be written from Eqs. (11)–(14) as

$$p\left( \mathbf{u}, \mathbf{x}, \mathbf{f}^1, \mathbf{f}^2 \mid \mu, \Sigma, \pi \right) = \prod_{i,k} \left[ \pi_k \cdot p\left( \mathbf{u}_i \mid x_i, \mu_k, \Sigma_k \right) \cdot p_{SLM}\left( f_i^1, f_i^2 \mid k \right) \right]^{x_i^{(k)}}. \qquad (15)$$

This joint probability can be optimized with the EM algorithm, of which the update equations can be found in [2].

4. Proposed approach

4.1. ESI

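To make the quantities SI operates on concrete, the sketch below computes the illumination ratios of Eq. (7) and the spatial-likelihood features of Eqs. (12) and (13) using integral-image box sums. It is a hedged illustration of the formulas above, not the authors' code; the helper names and the default window radius are our own choices.

```python
import numpy as np

def box_sum(img, r):
    """Sum of `img` over a (2r+1) x (2r+1) window centered at every pixel,
    computed with an integral image. Pixels outside the image count as zero."""
    p = np.pad(img.astype(np.float64), r, mode="constant")
    ii = np.zeros((p.shape[0] + 1, p.shape[1] + 1))
    ii[1:, 1:] = p.cumsum(axis=0).cumsum(axis=1)
    k = 2 * r + 1
    return ii[k:, k:] - ii[:-k, k:] - ii[k:, :-k] + ii[:-k, :-k]

def illumination_ratios(frame_rgb, bg_rgb, C=25.0):
    """Per-pixel, per-channel illumination ratios of Eq. (7)."""
    return (frame_rgb.astype(np.float64) + C) / (bg_rgb.astype(np.float64) + C)

def spatial_features(frame_gs, bg_gs, r=6):
    """ZNCC feature f1 (Eq. (12)) and texture feature f2 (Eq. (13))."""
    u, m = frame_gs.astype(np.float64), bg_gs.astype(np.float64)
    n = (2 * r + 1) ** 2
    su, sm = box_sum(u, r), box_sum(m, r)
    var_u = n * box_sum(u * u, r) - su ** 2
    var_m = n * box_sum(m * m, r) - sm ** 2
    cov = n * box_sum(u * m, r) - su * sm
    eps = 1e-9
    std_u = np.sqrt(np.maximum(var_u, 0.0))
    std_m = np.sqrt(np.maximum(var_m, 0.0))
    f1 = cov / (std_u * std_m + eps)   # Eq. (12)
    f2 = std_u + std_m                 # Eq. (13)
    return f1, f2
```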
Background subtraction methods that use only a single background model image, like SI and TA, will produce many false positive foreground detections. This is due to texture mismatches that occur when the dynamic range of the input and background model image differs significantly, for example due to clipping (clipping is caused by the camera's limited dynamic range and the slow adaptation rate of its exposure settings). EB, however, is robust to both dark and bright input images of many different dynamic ranges. This is because EB leverages multiple images from a training


set that contains expected illumination and dynamic range variations of the background. Furthermore EB is particularly robust to sudden local and global lighting changes, because it can instantaneously (within a single frame) respond to background variations. In addition, unlike SI and TA, EB can represent dynamic backgrounds such as waving trees and waterfalls very well. Unfortunately shadows or highlights created by foreground objects cannot be learned by EB since foreground objects are not permitted in the training set. SI, however, is robust to these shadows and highlights, because unlike EB, it models light changes by illumination ratio rather than pixel intensity. SI fails nonetheless when the background scene significantly changes due to local illumination effects like highlights and clipping. This is illustrated in Fig. 1, where the input image has local illumination effects in the background. According to SI, all illumination effects in the input image could be modeled by a few (i.e. $K_{BG} \in \{2, 3\}$) background mixture components (Eq. (8)). However, Fig. 1 shows many more mixture components are required when various local illumination effects are present. The performance of EB strongly depends on the pixel difference threshold, the setting of which is ad hoc. Furthermore, pixel dependencies are not considered in the thresholding. In contrast SI does not require an ad hoc threshold. Instead it computes for each pixel the probability of belonging to the foreground by solving an optimization problem that takes into account interpixel dependencies. Based on the analysis above we propose to apply SI on the input image and the reconstructed background image obtained from EB (Eq. (4)), instead of thresholding $\mathrm{EBDIF}_i$ (Eq. (5)) as in EB. In this way, given an input image, EB can reconstruct the corresponding background image with global and local illumination effects, while SI models the remaining background variations. The proposed Eigenbackground based Statistical Illumination (ESI) method thus combines the benefits of the high dynamic range Eigenspace model of EB with the modeling of illumination effects by SI. Fig. 1 illustrates the advantage of the proposed approach. Fig. 2 shows that our proposal obtains a more accurate foreground segmentation in the presence of rapidly changing illumination conditions, cast shadows, highlights and dynamic backgrounds. Compared to SI, ESI can deal better with dynamic backgrounds and large changes in dynamic range. Compared to EB, ESI achieves better foreground segmentation because it can handle cast shadows and highlights and because of the EM optimization.

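The overall ESI flow described above can be summarized in a few lines. The sketch below reuses the helpers from the earlier listings; run_si_em is only a placeholder for the SI EM optimization of Eqs. (8)–(15), whose actual update equations are those of [2]. All names are ours.

```python
import numpy as np

def esi_segment(frame_rgb, eb_models, si_state, r=6, C=25.0):
    """One ESI step: EB reconstructs the background, SI segments against it."""
    # 1. Eigenbackground: reconstruct the background under the current lighting (Eq. (4)).
    bg_rgb = reconstruct_background_rgb(eb_models, frame_rgb)

    # 2. SI inputs: illumination ratios (Eq. (7)) and spatial features (Eqs. (12)-(13)).
    ratios = illumination_ratios(frame_rgb, bg_rgb, C=C)
    gray = frame_rgb.mean(axis=2)       # simple luminance proxy, our choice
    bg_gray = bg_rgb.mean(axis=2)
    f1, f2 = spatial_features(gray, bg_gray, r=r)

    # 3. SI: a few EM iterations over the GMM of Eqs. (8)-(15); the GMM parameters in
    #    `si_state` are carried over from the previous frame. Placeholder call only.
    fg_prob, si_state = run_si_em(ratios, frame_rgb, f1, f2, si_state, n_iters=2)

    return fg_prob > 0.5, bg_rgb, si_state
```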

Fig. 1. (1st column) input frame, (2nd column top row) background image, (2nd column bottom row) background image reconstructed by EB, (3rd column) intensity ratio between the input frame and the reconstructed background, and (4th column) foreground segmentation based on the ratio image. Rows: SI (top) and ESI (bottom).

Fig. 2. Segmentation results of SI, EB and ESI on some test sequences affected by sudden illumination changes, cast shadows and vacillating background. (1st column) input frame, (2nd column) SI, (3rd column) EB fixed threshold and (4th column) ESI.

4.2. Online spatial likelihood model

Pilet et al. learn the spatial likelihood model offline on manually annotated ground-truth data. However, the pre-learned model cannot adapt to new situations, and may fail if drastic illumination changes occur. In our approach, we propose to build the spatial likelihood model online from the input image. More precisely, for each input image, some reliable background pixels are first detected; the background distribution is then updated with the detected background pixels. To reduce computational complexity, we discard the reliable foreground pixel detection described in [4], and keep the foreground distribution ($p_{FG}$ in Eq. (14)) uniform. We have empirically found that this did not noticeably affect segmentation performance.

4.2.1. Reliable background pixel detection

As mentioned above, the ordering of pixels in textured image areas will be preserved even in the presence of heavy photometric distortions [14]. In [10], a measure was proposed for robust visual correspondence, which examines the amount of order preservation in neighboring pixels. Here we exploit a similar concept to detect reliable background pixels under sudden illumination changes. The order preservation can be determined by comparing the sign of the pixel differences in a neighborhood: if the sign of the pixel difference does not change, the order is preserved. In order to evaluate the degree of order preservation for pixel i, we compute a matching score

$$M_F(x_i, y_i) = \frac{ \sum_{j \in w_i} \vec{\delta}_{IN}(x_j, y_j)^{T} \vec{\delta}_{BG}(x_j, y_j) }{ \sqrt{ \sum_{j \in w_i} \vec{\delta}_{IN}(x_j, y_j)^{T} \vec{\delta}_{IN}(x_j, y_j) } \; \sqrt{ \sum_{j \in w_i} \vec{\delta}_{BG}(x_j, y_j)^{T} \vec{\delta}_{BG}(x_j, y_j) } }. \qquad (16)$$

Here $(x_i, y_i)$ are the horizontal and vertical coordinates of pixel i, $w_i$ is a window around i, and the feature vectors $\vec{\delta}_{IN}(x_i, y_i)$ and $\vec{\delta}_{BG}(x_i, y_i)$ of the input and background image respectively are computed as

$$\vec{\delta}_a(x_i, y_i) = \begin{bmatrix} F_a(x_i - 1, y_i) - F_a(x_i, y_i) \\ F_a(x_i, y_i - 1) - F_a(x_i, y_i) \\ F_a(x_i - 1, y_i) - F_a(x_i + 1, y_i) \\ F_a(x_i, y_i - 1) - F_a(x_i, y_i + 1) \end{bmatrix}, \qquad (17)$$

where $F_a(x_i, y_i)$ is the luminance of $(x_i, y_i)$, and $a \in \{IN, BG\}$. This NCC calculation can be implemented efficiently using integral image tables [15]. Pixel $(x_i, y_i)$ is considered a reliable background pixel if $M_F(x_i, y_i)$ is higher than the reliable background detection threshold (RBDT) $T_{RBDT}$. Image portions with high gradients are correctly detected as background with very few false positives. In contrast, uniform or clipped regions have a low matching score and thus will not be detected as reliable background. Fig. 3 shows examples of reliable background pixels that are detected under various sudden illumination changes.

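A direct, unoptimized sketch of the order-consistency matching score of Eqs. (16) and (17). The paper computes it with integral image tables [15]; here we simply reuse the box_sum helper from the earlier sketch. The names are ours, and the default window radius and threshold follow the settings reported later in the paper (Table 2, Section 5.2).

```python
import numpy as np

def order_features(lum):
    """The 4-dimensional sign-preserving difference vector of Eq. (17),
    stacked as a (4, H, W) array; borders are handled by edge replication."""
    p = np.pad(lum.astype(np.float64), 1, mode="edge")
    c = p[1:-1, 1:-1]
    return np.stack([
        p[1:-1, :-2] - c,            # F(x-1, y) - F(x, y)
        p[:-2, 1:-1] - c,            # F(x, y-1) - F(x, y)
        p[1:-1, :-2] - p[1:-1, 2:],  # F(x-1, y) - F(x+1, y)
        p[:-2, 1:-1] - p[2:, 1:-1],  # F(x, y-1) - F(x, y+1)
    ])

def reliable_background(lum_in, lum_bg, r=9, t_rbdt=0.2):
    """Matching score M_F of Eq. (16) over a (2r+1) x (2r+1) window, thresholded."""
    d_in, d_bg = order_features(lum_in), order_features(lum_bg)
    dot = sum(box_sum(d_in[k] * d_bg[k], r) for k in range(4))
    n_in = sum(box_sum(d_in[k] ** 2, r) for k in range(4))
    n_bg = sum(box_sum(d_bg[k] ** 2, r) for k in range(4))
    mf = dot / (np.sqrt(n_in) * np.sqrt(n_bg) + 1e-9)
    return mf > t_rbdt
```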
4.2.2. Spatial likelihood model update

For each input image, we calculate a histogram $h_{BG}^{*}$ from the reliably detected background pixels (Eqs. (16) and (17)) and the features $f^1$ (Eq. (12)) and $f^2$ (Eq. (13)). The bins $b_l$ of histogram $h_{BG}$ are recursively updated according to

$$h_{BG}(b_l) = \alpha_{BG} \cdot h_{BG}(b_l) + (1 - \alpha_{BG}) \cdot h_{BG}^{*}(b_l), \quad \forall b_l \in h_{BG}, \qquad (18)$$

where $\alpha_{BG}$ is the learning rate parameter of the background. It indicates the update speed of the histogram. Initially, for the first input image, $h_{BG}$ is uniform.

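A small sketch of the recursive histogram update of Eq. (18). The way the (f1, f2) features are quantized into a fixed 2D grid, and the f2 range, are our own illustrative choices; the bin count and learning rate follow the settings listed later in Table 2.

```python
import numpy as np

def update_spatial_likelihood(h_bg, f1, f2, reliable_mask, alpha_bg=0.2,
                              bins=128, f2_max=1000.0):
    """Blend the running background histogram h_bg (bins x bins, normalized) with
    the histogram h_bg_star built from this frame's reliable pixels (Eq. (18))."""
    # 2D histogram of the features over reliably detected background pixels only.
    i = np.clip(((f1[reliable_mask] + 1.0) * 0.5 * (bins - 1)).astype(int), 0, bins - 1)
    j = np.clip((f2[reliable_mask] / f2_max * (bins - 1)).astype(int), 0, bins - 1)
    h_star = np.zeros((bins, bins))
    np.add.at(h_star, (i, j), 1.0)
    h_star /= max(h_star.sum(), 1.0)

    # Recursive update: keep a fraction alpha_bg of the old model (Eq. (18)).
    return alpha_bg * h_bg + (1.0 - alpha_bg) * h_star

# For the first frame h_bg starts uniform:
# h_bg = np.full((128, 128), 1.0 / (128 * 128))
```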
5. Experiments

5.1. Dataset and evaluation methodology

To test the proposed algorithm, we recorded 5 challenging test sequences with various sudden lighting changes that occur continuously throughout each sequence.

Fig. 3. Reliable background pixel detection under rapidly changing illumination conditions. The green pixels are the detected background pixels.

Table 1. Performance comparisons of the offline spatial likelihood model and the proposed online spatial likelihood model applied to SI.

            Recall               Precision
            Offline    Online    Offline    Online
Scene1      15.7       32.5      65.2       52.3
Scene2       8.8       39.5       5.4       30.5
Chair       16.0       43.9      20.0       30.7
Sofa        50.0       73.4      69.9       71.5
Walking     30.8       66.2      55.5       61.0

We created the lighting changes by randomly turning on and off different light sources and by opening and closing shades. We refer to each individual sequence as Scene1, Scene2, Chair, Sofa, and Walking. In addition we recorded the training sequences in a similar way as the test sequences. The only difference is that the training sequences do not contain any foreground object. Furthermore each training sequence contains approximately 500 frames. Since Chair, Sofa, and Walking have been recorded in the same room with the same fixed camera, they have a common training sequence. All test and training sequences are publicly available for benchmarking at our webpage [16]. To quantitatively evaluate the segmentation performance, we manually labeled the ground-truth foreground mask for each test sequence. For Scene1, Scene2, Chair, Sofa, and Walking, we labeled 750,


154, 573, 382 and 734 frames respectively. In our experiments, the segmentation performance is measured by precision and recall, defined by

$$\mathrm{Recall} = \frac{\#tp}{\#tp + \#fn} \cdot 100\%, \qquad \mathrm{Precision} = \frac{\#tp}{\#tp + \#fp} \cdot 100\%, \qquad (19)$$

where #tp stands for the number of true positives, and is calculated by counting the number of pixels that are correctly classified as foreground. Similarly #fn (number of false negatives) and #fp (number of false positives) are the number of pixels that are incorrectly classified as background and foreground respectively. We calculate Precision and Recall for a whole sequence as the arithmetic mean of the Precision and Recall over all frames.

5.2. Parameter sensitivity

Before comparing our method with the state-of-the-art methods, we present a systematic study of its components and parameters. We demonstrate that overall the behavior of the proposed ESI method with respect to the different parameters is easily identifiable. Therefore it is fairly intuitive to choose an optimal set of parameters.

5.2.1. Spatial likelihood model

In the proposed ESI method, the spatial likelihood model is updated online. We verify the effectiveness of this online model by comparing two versions of SI, one with an online and the other with an offline spatial likelihood model. For both versions all common parameters are kept equal. Table 1 and Fig. 4 show that the online spatial likelihood model obtains a more reliable and more accurate segmentation than the offline model.

Fig. 4. (top) input frames, (middle) offline spatial likelihood, and (bottom) online spatial likelihood.



Fig. 5. Precision and recall when varying the reliable background pixel detection threshold.

Next, in Fig. 5, we demonstrate what happens when we sweep the reliable background detection threshold $T_{RBDT}$ from −0.3 to 0.9. For low $T_{RBDT}$ the BG histogram is updated with more BG pixels, and consequently also with more pixels that are incorrectly detected as background. Therefore the SI component in ESI will be biased more towards classifying a pixel as background rather than foreground, and thus the Recall will be low. On the other hand, a smaller percentage of the pixels classified as foreground will be incorrect, hence the Precision will be high. It is clear that $T_{RBDT}$ can be used to tune the operating point of ESI and SI in an ROC curve. We adopt $T_{RBDT} = 0.2$ as the default setting in our experiments.

5.2.2. Number of EM iterations

In Fig. 6, we plot Precision and Recall for each sequence as a function of the number of EM iterations per frame. From this figure we see for Scene1, Chair, Sofa, and Walking that, when the number of EM iterations increases, both Precision and Recall remain approximately constant. The only exception is Scene2, which requires 4 EM iterations per frame. Because Scene2 contains relatively more cast shadows and reflections than the other test sequences, and since these illumination effects can only be modeled by SI, we require more EM iterations. Fig. 6 shows that the number of EM iterations per frame is not critical. Because we recursively pass on the GMM parameters from the previous frame to the next frame, we can already get an accurate segmentation with as little as 1 EM iteration per frame. In the following experiments we set the number of EM iterations per frame to 2.

Fig. 6. Precision and recall when varying the number of EM iterations.

5.2.3. Number of Eigenvectors

Another important parameter in the EB based approaches is the number of Eigenvectors that are used in the reconstruction of the background model image (Eq. (4)). Fig. 7 plots the Precision and Recall curves for each sequence as a function of the number of Eigenvectors for ESI. It is clear that when we increase the number of Eigenvectors beyond 8, the Precision hardly increases while the Recall slowly decreases. Thus we conclude that most lighting conditions in the background of the test sequences can be reconstructed in the background model image by a linear combination of just the 8 most dominant Eigenvectors. In the following experiments we set the number of Eigenvectors to 10.

Fig. 7. Precision and recall when varying the number of Eigenvectors.

5.2.4. Shading model constant C

The shading model constant C in Eq. (7) prevents instability when the denominator is zero. To reveal the sensitivity of ESI to changes in C, we plot the Precision and Recall as a function of C in Fig. 8. When C increases, the Precision also slightly increases while the Recall clearly drops. This can be explained as follows: C limits the dynamic range of the illumination ratios and makes them more linear as a function of the background pixel intensity. This results in smaller fluctuations of the illumination ratio in the background image portions. Therefore fewer pixels will be erroneously classified as foreground, and more pixels will be classified as background. The former leads to an increase in Precision, while the latter leads to an increase in #fn that in turn will result in a lower Recall (Eq. (19)). We adopt C = 25 as the default setting in our experiments.

Fig. 8. Precision and recall when varying the shading model constant C.

5.2.5. $K_{BG}$ and $K_{FG}$

In Fig. 9, we plot the Precision and Recall as a function of $K_{BG}$ for $K_{FG} \in \{1, \ldots, 5\}$. The figure shows that when going from $K_{BG} = 1$ to $K_{BG} = 2$, the Recall significantly drops while the Precision significantly increases. When $K_{BG} > 2$ the Precision remains almost constant, while the Recall continues to decrease but at a slower rate. In general, by increasing $K_{BG}$, the SI component in ESI is biased towards classifying pixels as background rather than foreground. Hence #fn increases and #fp decreases. The former results in a decreased Recall, while the latter causes an increased Precision. Clearly for $K_{BG} > 2$ we overfit the actual distribution of the illumination ratios of the background image portions, since the Recall continues to drop while the Precision hardly increases. Therefore we set $K_{BG} = 2$. Conversely, by analyzing the effect of increasing $K_{FG}$ for $K_{FG} \in \{1, \ldots, 5\}$, we see that the Precision clearly drops while the Recall slightly increases and remains approximately constant for $K_{FG} > 2$. This is the opposite effect of increasing $K_{BG}$. Increasing $K_{FG}$ biases the SI component in ESI towards classifying pixels as foreground rather than background. Hence #fp increases, which results in lower Precision, and at the same time #tp also increases, leading to a higher Recall. Therefore we set $K_{FG} = 2$.

Fig. 9. Precision and recall when varying the number of BG and FG components. #FGC denotes the number of foreground components.



Fig. 10. The ROC curves for each method. Please note that, for the SI method, we adopt the proposed online spatial likelihood model, which improves performance compared to the originally proposed method.


Table 2. List of parameter settings for each method in the benchmark of Fig. 10. The parameter set for ESI is the union of the parameter sets of SI and EBfix. We use the same number of Eigenvectors in ESI, EBfix and EBdyn. T_TA is the image differencing threshold for TA. For ABMM, the first B Gaussian components that account for T% of the most recent pixels are selected as background model (Kaewtrakulpong and Bowden [17]). In the notation a:b:c, a denotes the start value, b the step, and c the stop value.

SI:    K_BG = 2;  K_FG = 2;  no. of EM iterations = 2;  T_RBDT = −0.3:0.1:0.2;
       window size M_F(x_i, y_i) = 19;  window size f1, f2 = 13;
       no. of bins h_BG = 128;  α_BG = 0.2;  no. of bins h_BG = 128;  C = 25
EBfix: no. of Eigenvectors = 10;  ξ_EBfix = 5:5:100
EBdyn: α = {0.01, 0.05, 0.1, 0.2–0.5};  ξ_min = 25;  ξ_max = 70;  κ = 0.1:0.1:0.7
TA:    T_TA = {5, 10, …, 100}
ABMM:  no. of Gaussian components = 1:8;  T = 70%


Fig. 11. Comparison of foreground segmentation results on several example frames. From top to bottom: Walking, Scene1, and Chair.

5.3. Benchmarking

5.3.1. Benchmark on our dataset

We benchmark our algorithm against several state-of-the-art approaches on the dataset presented in Section 5.1. In the benchmark we include EB with both fixed (EBfix) and dynamic (EBdyn) thresholds, SI, TA and the pixel-wise adaptive background mixture model (ABMM) of [17] (the software implementations for TA and ABMM are obtained from the authors' website and from OpenCV respectively). In our benchmark we use SI with the proposed online spatial likelihood model, because it performs better than the pre-learned model

(Table 1). Furthermore, we train the ABMM on the same training set as the one we use to train the Eigenspace models. We plot the precision as a function of recall for each sequence in the receiver operating characteristic (ROC) curves of Fig. 10. Table 2 gives an overview of the parameter values that we have used in ESI, EBfix, EBdyn, SI, TA and ABMM to obtain this figure. This table also shows for each method the parameter that we sweep in order to construct the ROC curves. Fig. 10 clearly shows that ESI is superior to the other methods for all test sequences. Then comes EBfix, followed by EBdyn, TA and SI. ABMM has the worst performance: the complex and rapidly changing illumination conditions simply cannot be handled by a pixel-wise GMM with limited components and a slow adaptation rate.


When we compare ESI to EBfix we see that for Scene1, ESI attains a higher overall Precision and Recall. For Scene2, EBfix obtains a Precision of 70% for a Recall of 42%, while ESI obtains the same Precision for a significantly higher Recall of 57%. In sequences Chair, Sofa and Walking, ESI obtains the same Precision as EBfix for a Recall that is at least 10% higher. In conclusion, with ESI we achieve a significantly higher precision at high recall rates compared to the other methods. Compared to SI, TA and ABMM, EBfix performs well. This is because EBfix can instantly reconstruct local illumination effects in the


background image. From Fig. 10, it is clear that SI and TA do not work well on the test sequences. This is because both methods exploit just a single background model image with limited dynamic range. Hence, local illumination effects cannot be modeled well, and will result in classification errors. The qualitative comparison on some snapshots depicted in Fig. 11 clearly illustrates this. For a complete comparison we refer the reader to our webpage [16], to which we uploaded for each test sequence a demo sequence containing a side-by-side comparison of the segmentation results of ESI, EBfix, EBdyn, SI, TA and ABMM.

Fig. 12. ABMM, TA, SI, EB and ESI applied on the wallflower dataset. Each row shows a snapshot of one of the wallflower sequences (camouflage, foreground aperture, bootstrap, light switch, moved object, time-of-day and waving trees); the columns show the input frame, the ground truth, and the results of ABMM, TA, SI, EBfix and ESI.


5.3.2. Performance on the wallflower dataset

We also test our algorithm on the wallflower dataset. The wallflower dataset is designed to test background subtraction algorithms on 7 canonical problems: moved objects, time of day, light switch, waving trees, camouflage, bootstrapping and foreground aperture [1]. We use the 200 frames in each test sequence that are marked as training frames to train the Eigenspace model in EBfix and ESI, and the GMM in ABMM. For ESI and SI we set T_RBDT = 0.2, for EB and TA we set ξ_EBfix = T_TA = 50, and for ABMM we use 5 Gaussian components. All other parameters are tuned to obtain the best performance for each test sequence. The results are shown in Fig. 12. As expected, ESI performs particularly well on the light switch, waving trees, camouflage and time-of-day problems. In the light switch problem, the background can be well reconstructed by EBfix and ESI after the light is switched on. The dynamic backgrounds in the waving trees sequence are handled well by the Eigenbackground based approaches EBfix and ESI. Furthermore, the pixel-wise ABMM copes well with dynamic backgrounds, gradual illumination changes and moved objects; however, it completely fails on the light switch, camouflage and foreground aperture sequences. TA, SI, EBfix and ESI cannot cope well with moved background objects, unlike ABMM, whose background model is updated at every frame. SI is robust to sudden and gradual illumination changes, however it cannot handle the dynamic background in waving trees. Compared to EBfix and TA, ESI and SI achieve denser segmentations with fewer holes in FG objects on the waving trees, light switch, camouflage and foreground aperture sequences, due to their incorporated spatial likelihood model. In conclusion, the wallflower dataset demonstrates well that ESI combines the benefits of both EB and SI.

5.4. Computational complexity

We implemented ESI in C++ on an Intel i7-950, 3.06 GHz quad core CPU. We calculate M_F(x_i, y_i), f1 and f2 in constant time per pixel with integral image tables [15]. Furthermore we calculate the latent variable expectations, or beliefs, and the GMM parameter updates in multiple parallel threads by adding OpenMP compiler directives to the source code. In addition we also parallelized all vector additions, matrix vector multiplications, and spatial likelihood model updates with OpenMP. With these optimizations, we achieve frame rates of 20 and 18 frames per second (fps) for 1 and 2 EM iterations per frame respectively, at a resolution of 320 by 240. Thus our implementation of ESI easily meets the real-time requirements of many computer vision and video analysis applications, which require a minimum frame rate of 15 fps.

6. Conclusions

In this paper, we presented an approach to background subtraction under rapidly changing lighting conditions. The proposed method combines the Eigenbackground algorithm and the Statistical Illumination model. We introduce an online updated spatial likelihood model by exploiting reliably detected background pixels, and show that the online spatial likelihood model provides more accurate segmentation compared to the offline spatial likelihood model. The extensive experimental results on two challenging data sets demonstrate that our method is superior to other state-of-the-art background subtraction methods when rapidly and abruptly changing illumination conditions are involved.


Better segmentation accuracy at significantly higher recall rates has been obtained compared to several state-of-the-art approaches. Furthermore, the proposed algorithm runs in real time on a general purpose CPU. The most obvious weakness of ESI compared to ABMM is that moved background objects will be detected as foreground forever after the movement. This is because the object's new location is not incorporated into the Eigenspace background model. In ABMM these objects will slowly disappear into the background again. Currently, we are investigating ways to tackle this problem with incremental Eigenspace background model updates [13]. Input frames with large foreground objects cause degradation in the Eigenbackground reconstruction. These degradations occur as errors which are spread over the entire reconstructed background image, resulting in inaccurate segmentation [18]. This problem can be circumvented by extending ESI with recursive error compensation as described in [18].

References

[1] K. Toyama, J. Krumm, B. Brumitt, B. Meyers, Wallflower: principles and practice of background maintenance, in: IEEE International Conference on Computer Vision (ICCV), vol. 1, 1999, pp. 255–261.
[2] J. Pilet, C. Strecha, P. Fua, Making background subtraction robust to sudden illumination changes, in: European Conference on Computer Vision (ECCV), 2008, pp. 567–580.
[3] N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian computer vision system for modeling human interactions, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 831–843.
[4] L. Vosters, C. Shan, T. Gritti, Background subtraction under sudden illumination changes, in: IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), September 2010, pp. 384–391.
[5] M. Piccardi, Background subtraction techniques: a review, in: IEEE International Conference on Systems, Man, and Cybernetics, 2004, pp. 3099–3104.
[6] R.J. Radke, S. Andra, O. Al-Kofahi, B. Roysam, Image change detection algorithms: a systematic survey, IEEE Transactions on Image Processing 14 (3) (2005) 294–307.
[7] C. Stauffer, W. Grimson, Learning patterns of activity using real-time tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 747–757.
[8] N. Friedman, S. Russell, Image segmentation in video sequences: a probabilistic approach, in: International Conference on Uncertainty in Artificial Intelligence, 1997, pp. 175–181.
[9] L. Di Stefano, F. Tombari, S. Mattoccia, Robust and accurate change detection under sudden illumination variations, in: ACCV Workshop on Multi-dimensional and Multi-view Image Processing, November 2007.
[10] F. Tombari, L. Di Stefano, S. Mattoccia, A robust measure for visual correspondence, in: IEEE International Conference on Image Analysis and Processing (ICIAP), September 2007, pp. 376–381.
[11] R.C. Gonzalez, R.E. Woods, Digital Image Processing (3rd Edition), Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006.
[12] B. Han, R. Jain, Real-time subspace-based background modeling using multi-channel data, in: International Symposium on Visual Computing, 2007, pp. 162–172.
[13] P. Hall, D. Marshall, R. Martin, Merging and splitting eigenspace models, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (9) (1998) 1042–1049.
[14] B. Xie, V. Ramesh, T. Boult, Sudden illumination change detection using order consistency, Image and Vision Computing 22 (2) (2004) 117–125.
[15] M.J. McDonnell, Box-filtering techniques, Computer Graphics and Image Processing 17 (1) (September 1981) 65–70.
[16] L. Vosters, C. Shan, T. Gritti, Dataset and Demo Sequences for Robust Background Subtraction under Rapidly Varying Illumination Conditions, 2011. (https://sites.google.com/site/tommasogritti/publications/background-subtraction-data).
[17] P. Kaewtrakulpong, R. Bowden, An improved adaptive background mixture model for realtime tracking with shadow detection, in: European Workshop on Advanced Video Based Surveillance Systems (AVBS), September 2001.
[18] Z. Xu, P. Shi, I.Y.-H. Gu, An eigenbackground subtraction method using recursive error compensation, in: Pacific Rim Conference on Multimedia, 2006, pp. 779–787.