Optik 127 (2016) 795–801
Recognizing violent activity without decoding video streams

Jianbin Xie a, Wei Yan a, Chundi Mu a, Tong Liu a,∗, Peiqin Li a, Shuicheng Yan b

a National University of Defense Technology, College of Electronic Science and Engineering, Kaifu District, Yan W Pond #47, Changsha 410073, Hunan, China
b National University of Singapore, Singapore, Singapore
Article info

Article history: Received 26 January 2015; Accepted 27 October 2015

Keywords: Activity recognition; Violent activity; Motion vectors
Abstract

Most traditional activity recognition methods rely on moving-target detection and tracking, which are usually complicated and limit the application of these methods. In this paper, we propose a fast violent activity recognition method based on motion vectors. First, we extract the motion vectors directly from the compressed video data. Then, we analyze the features of the motion vectors within each frame and between frames to obtain the Region Motion Vectors descriptor (RMV). Finally, we use a Support Vector Machine (SVM) with a radial basis kernel function to classify the RMV and determine whether violent activity exists in the video. Experimental results on several datasets show that the proposed method detects 96.1% of the violent activities in videos (the false probability is about 5.1%). The computation is also very fast, which means the method can be used in embedded systems.

© 2015 Elsevier GmbH. All rights reserved.
∗ Corresponding author. Tel.: +86 13574870542.
http://dx.doi.org/10.1016/j.ijleo.2015.10.165
0030-4026/© 2015 Elsevier GmbH. All rights reserved.

1. Introduction

Violent activity is a very harmful kind of action for people and society: it takes the human body or property as its target and uses violent means to endanger people’s life, health and personal freedom. Violent activity recognition based on video analysis refers to analyzing the motion features of the targets in a video to determine whether violent activity exists in it. In a video surveillance system, the earlier violent activity is detected, the lower the harm will be.

Violent activity is neither a fixed activity (e.g., raising a hand or stooping) nor a simple activity (e.g., walking or riding a bike), but a complicated space–time interactive activity that has no fixed model or style and is very hard to define and recognize exactly. Existing video analysis algorithms include target detection and target recognition. Separating the target from the background, tracking the target and extracting key points all require expensive computation, which limits the use of those algorithms in real applications. For example, most video capture and coding devices do not have enough resources to run those algorithms in real time, and the software and hardware that use those algorithms to analyze huge amounts of recorded video are usually too expensive to be widely used.

In video monitoring systems, many techniques, such as motion detection and motion compensation, have been proposed to reduce the transmission bandwidth. Therefore, once we have the compressed video data, the motion vectors of the video can be obtained without additional calculation. The motion vectors represent the relative motion between the current macro-blocks and the reference macro-blocks, which is not exactly equivalent to the actual movement of the target. However, we can still get useful information from them. By analyzing the motion vectors, the complexity and processing time of the algorithm can be reduced significantly, and the hardware requirements can also be reduced.

To facilitate the study of violent activity recognition, we first collect positive video clips (e.g., boxing and fighting) and negative video clips (e.g., walking and jumping) from public video sources such as UCF sports, UCF50, HMDB51 and YouTube to construct the Video streams for Violent Activity Recognition dataset v1.0 (VVAR10). Then, we extract the motion vectors of the video clips in VVAR10, analyze their features and use the Region Motion Vectors descriptor (RMV) to describe the motion features of each clip. Finally, we use a Support Vector Machine (SVM) to determine whether violent activity exists in the video. Extensive experiments on VVAR10 demonstrate that the average precision of our method is about 96.3% and the false probability is about 5%. The computation is also very fast, which means the method can be implemented in embedded systems.
J. Xie et al. / Optik 127 (2016) 795–801
Fig. 1. We select video clips that contain four types of violent activities, together with many non-violent video clips, to form the VVAR10 dataset (Section 3). We then use the Region Motion Vectors descriptor to describe the features of the motion vectors of the videos and use an SVM to classify them (Section 4). Motion vectors can easily be obtained from compressed video data, and the SVM is well suited to non-linear classification.
Fig. 1 illustrates the proposed framework for violent activity recognition by motion vectors. The main contributions of this work can be summarized as follows:
• We propose a new method for violent activity recognition, which requires less calculation and can be widely used in smart video surveillance systems.
• The proposed method takes full advantage of the motion vectors already present in compressed video data and avoids the time needed to calculate them.
• The proposed method uses the RMV to describe the motion features of the video data.
The rest of the paper is organized as follows. Section 2 discusses the related work. Sections 3 and 4 describe the construction of VVAR10 and the proposed framework for violent activity recognition, respectively. Experiments and discussions are presented in Section 5. Section 6 concludes this work.

2. Related work

Violent activity recognition is an important part of human activity recognition due to its potential for a multitude of applications. “Fighting between two persons” is an interaction between two persons, so analyzing interactions means analyzing the activities between persons. For activity recognition, early works focus on classifying video sequences of a single person in controlled environments, where the background is simple and uniform [3,11,12]. With the development of activity recognition technologies, researchers have tried to introduce more natural and unconstrained videos. For instance, Laptev et al. have studied sequences from feature films, and some researchers have focused on the recognition of “wild videos”, such as videos from YouTube [14,18].

The approaches to action recognition are varied, and so are the ways to classify them [1,16,19]. Some approaches analyze the video directly. Messing et al. presented an activity recognition feature inspired by human psychophysical performance, which is based on the velocity history of tracked key points. Schuldt et al. constructed video representations in terms of local space–time features and integrated such representations with SVM classification schemes for recognition. Wang et al. introduced a novel descriptor based on motion boundary histograms, which has shown good performance in classifying many kinds of actions. Sadanand et al. presented Action Bank, a new high-level representation of video comprised of many individual action detectors sampled broadly in the semantic space as well as the viewpoint space. Ryoo et al. recognized high-level human activities such as “fight” and “assault” by defining complex human activities in terms of simpler activities or movements using context-free grammars. Jingen Liu et al. used high-level semantic concepts to recognize human actions. Sun and Nevatia used Fisher vectors to classify large-scale web video events.

With the development of the sensor industry, sensors are widely used in computer vision, and some studies recognize activities with the help of sensor data. Ravi et al. used a triaxial accelerometer worn near the pelvic region as a motion detector to recognize activities. Ravi et al. reported their efforts to recognize user activities from accelerometer data. Spriggs et al. explored first-person sensing through a wearable camera and Inertial Measurement Units. Maekawa et al. introduced a way to recognize actions with sensors on the wrist.

In summary, we find that previous studies mostly focus on the recognition of basic actions, and in order to get accurate space–time features and trajectories, many of them need the help of accessories worn on the body, such as accelerometers and GPS. Hierarchical approaches, such as Markov Networks [6,22,29], Conditional Random Fields and others [20,28,31], are used for the recognition of complex activities. These approaches perform well, but due to their complexity they are not suitable for real-time monitoring and warning systems. Xie et al. proposed a fast and robust algorithm for fighting behavior detection based on motion vectors, but the motion vectors were extracted by an improved Three-Step Search method. In this paper, we attempt to recognize violent activity by analyzing motion vectors that can be extracted directly from the video stream; the analysis method is described in Section 4.

3. VVAR10 dataset
There are several datasets for activity recognition, e.g., Weizmann, KTH, UCF sports and Hollywood2 actions. However, none of them is directly suitable, as they usually focus on the recognition of simple individual actions. To study our proposed problem, we need a large dataset of violent actions that contains fighting, boxing, hammering and pursuing. Hence, we have built the VVAR10 dataset, which contains 296 positive samples and 277 negative samples. The positive and negative samples come from UCF sports, UCF50, HMDB51 and YouTube. In order to save experimental time and test the effectiveness of our algorithm, we cut the videos into clips shorter than five seconds. To diversify the dataset, we select violent activity videos from various situations: day and night, single and multiple participants, with and without tools, and so on. Fig. 2 shows some examples in VVAR10.

Fig. 2. Examples of VVAR10 dataset. The motion vector images show the distribution of motion vectors in the areas surrounded by red or green rectangles, and these images of motion vectors are scaled for clear display. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. The proposed framework

To determine whether violent activity exists in a video, two key issues need to be addressed. First, most video coding algorithms use the macro-block as the basic processing unit, and often use deformable macro-block technology (for example, there are 7 different macro-block sizes in H.264) and multiple reference frame technology to improve coding efficiency, so the motion vectors in the same frame often have different spatial and temporal scales. Second, the motion vectors in compressed video data only represent the relative positional relationship between the current macro-block and the reference macro-block, and do not represent the actual movement features of the target.

For the first issue, we use spatial and temporal interpolation to normalize the motion vectors to a uniform minimum spatial and temporal scale. For the second issue, we use the RMV to represent the motion features of the video and use the SVM to classify the videos. The whole flow diagram of the proposed framework is shown in Fig. 7, and the details of each key step are discussed in the following sections.

4.1. Motion vector normalization

By analyzing the structure of compressed video data, we can easily obtain the original motion vector data from a live video stream or a recorded video file. Because of the deformable macro-block and multi-reference frame technology, the original motion vectors have different spatial and temporal scales, and so do the motion features they represent. For example, among the motion vectors of the nth frame, there may be one for a macro-block of size 16 × 16 whose reference macro-block is located in the (n − 3)th frame, and another for a macro-block of size 8 × 8 whose reference macro-block is located in the (n − 1)th frame. To unify the spatial and temporal features, we first define the following structure to describe a motion vector together with its spatial and temporal scales.

\[
\{ ts_f,\ ts_t,\ ts_x,\ ts_y,\ bc_x,\ bc_y,\ mv_x,\ mv_y \} \tag{1}
\]

In this structure, $ts_f$ is the frame rate of the video clip, $ts_t$ is the frame interval between the target macro-block and the reference macro-block, $ts_x$ is the width of the macro-block, $ts_y$ is the height of the macro-block, $bc_x$ and $bc_y$ are the center coordinates of the target macro-block, and $mv_x$ and $mv_y$ are the horizontal and vertical offsets of the target macro-block.

Normalizing the motion vectors means making $ts_t$, $ts_x$ and $ts_y$ equal across all motion vectors. The normalization is based on two principles: first, the size of the macro-block does not influence the value of the motion vector; second, the motion is continuous in adjacent frames. The calculation is shown in Eq. (2), where primed symbols denote the normalized values:

\[
\begin{aligned}
ts_f' &= ts_f, \qquad ts_t' = ts_{t,\min}, \qquad ts_x' = ts_{x,\min}, \qquad ts_y' = ts_{y,\min},\\
mv_x' &= \frac{mv_x \times ts_{t,\min}}{ts_t}, \qquad mv_y' = \frac{mv_y \times ts_{t,\min}}{ts_t},\\
bc_{x_i} &= bc_x + \frac{(i \times 2 - 1) \times ts_{x,\min}}{2}, \qquad i = \pm 1, \pm 2, \ldots, \pm \frac{ts_x}{ts_{x,\min}},\\
bc_{y_j} &= bc_y + \frac{(j \times 2 - 1) \times ts_{y,\min}}{2}, \qquad j = \pm 1, \pm 2, \ldots, \pm \frac{ts_y}{ts_{y,\min}}
\end{aligned}
\tag{2}
\]

where $ts_{t,\min}$ is the minimum frame interval, usually set to 1, and $ts_{x,\min}$ and $ts_{y,\min}$ are the minimum macro-block sizes used by the current coding algorithm. The effect of this process is shown in Fig. 3. In this paper, we set $ts_{x,\min}$ and $ts_{y,\min}$ to 4, which preserves more information than setting them to 8 or 16.
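The normalization of Section 4.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names are hypothetical, and the way a large macro-block is tiled into sub-blocks of the minimum size is an assumption (a plain left-to-right, top-to-bottom tiling around the original center), because the index range of the center-coordinate formula is only partly legible in the source. The temporal scaling follows the stated principle that motion is continuous across adjacent frames.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MotionVector:
    tsf: float  # frame rate of the video clip
    tst: int    # frame interval between target and reference macro-block
    tsx: int    # macro-block width in pixels
    tsy: int    # macro-block height in pixels
    bcx: float  # x coordinate of the macro-block center
    bcy: float  # y coordinate of the macro-block center
    mvx: float  # horizontal offset of the target macro-block
    mvy: float  # vertical offset of the target macro-block


def normalize(mv: MotionVector, tsx_min: int = 4, tsy_min: int = 4,
              tst_min: int = 1) -> List[MotionVector]:
    """Split one motion vector into vectors of the minimum scale."""
    # Temporal normalization: rescale the offset to the minimum frame interval.
    mvx = mv.mvx * tst_min / mv.tst
    mvy = mv.mvy * tst_min / mv.tst
    # Spatial normalization (ASSUMPTION: simple tiling of the original block).
    left = mv.bcx - mv.tsx / 2
    top = mv.bcy - mv.tsy / 2
    out = []
    for j in range(mv.tsy // tsy_min):
        for i in range(mv.tsx // tsx_min):
            out.append(MotionVector(
                mv.tsf, tst_min, tsx_min, tsy_min,
                left + (2 * i + 1) * tsx_min / 2,  # center of sub-block i
                top + (2 * j + 1) * tsy_min / 2,   # center of sub-block j
                mvx, mvy))
    return out
```

For example, a 16 × 16 macro-block whose reference lies three frames back is turned into sixteen 4 × 4 entries, each carrying one third of the original offset.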
4.2. Motion vector features in frames

The macro-block is the basic unit of video encoding, and the region that contains the target motion typically covers multiple macro-blocks. The features of the target motion must therefore be analyzed with the motion region as the basic unit. All macro-blocks that have nonzero amplitude and are connected to each other are classified into the same motion region. If there is more than one motion region, the motion regions are numbered in sequence.

The histogram is a very useful tool in feature analysis. The motion vector of a macro-block has two parameters, which should not be analyzed separately. So in this paper, we use a three-dimensional histogram to express the motion vectors of all macro-blocks in the motion region. In order to eliminate the influence of differences in region area, we normalize the three-dimensional histogram by the total number of macro-blocks and get the Three Dimensional Histogram of Regional Motion Vector (TDH RMV), as Fig. 4 shows. In order to investigate the motion features of the area expressed by the TDH RMV, we define the following variables:
Fig. 3. Example of motion vector normalization. (a) The size of the macro-block has a small influence on the value of the motion vector. (b) The position of the reference frame has a large influence on the value of the motion vector.
Fig. 4. The Three Dimensional Histogram of Regional Motion Vector (TDH RMV) shows the distribution of motion vectors in a region. Each pillar represents one motion vector value that exists in the current region: the value equals the center coordinate (mvx, mvy) of the pillar's bottom rectangle, and the pillar height h equals the distribution probability of that motion vector.
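The grouping of macro-blocks into motion regions and the normalized histogram described in Section 4.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' code: the normalized motion vectors are assumed to sit on a regular grid of integer (mvx, mvy) pairs, "interconnected" is assumed to mean 4-connectivity (the text does not specify), and the function names are hypothetical.

```python
from collections import Counter, deque
from typing import Dict, List, Tuple

Vec = Tuple[int, int]
Cell = Tuple[int, int]


def label_motion_regions(grid: List[List[Vec]]) -> List[List[Cell]]:
    """Group connected cells with nonzero motion vectors into regions."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] == (0, 0):
                continue
            # Breadth-first flood fill from an unvisited moving cell.
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                # ASSUMPTION: 4-connectivity between macro-blocks.
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] != (0, 0)):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(cells)  # regions are numbered in scan order
    return regions


def tdh_rmv(grid: List[List[Vec]], cells: List[Cell]) -> Dict[Vec, float]:
    """Histogram of motion vectors in one region, normalized by its area."""
    counts = Counter(grid[y][x] for y, x in cells)
    return {vec: n / len(cells) for vec, n in counts.items()}
```

Each returned histogram maps a motion vector value to its pillar height, so the heights of one region sum to 1.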
Fig. 5. The Three Dimensional Histogram of Regional Motion Vector (TDH RMV) is divided into eight equal sub-regions by direction of motion vector. By analyzing the motion vectors in each sub-region, we can measure the complexity of motion vectors’ direction.
(1) Complexity of Sort (CoS)
CoS investigates the variety of motion vector types and their proportions. The key point of CoS is that the more types of motion vectors a region contains, and the more balanced their distribution, the greater the complexity of the region motion. This is similar to information entropy, so we borrow the definition of entropy to calculate CoS as Eq. (3), where $h_{xy}$ is the pillar height and pillars with zero height are not counted.

\[
\mathrm{CoS} = -\sum_{h_{xy} > 0} h_{xy} \times \log_2 h_{xy} \tag{3}
\]
(2) Complexity of direction (CoD)
CoD investigates the distribution of motion vector directions. The key point of CoD is that the more different directions the motion vectors have, the greater the complexity of the region motion. To measure this indicator, we first divide the TDH RMV into eight sub-regions by direction (removing the center zero-vector area), as shown in Fig. 5. Then we use the sub-region as a unit to calculate CoD as Eq. (4), in which $h_i$ is the sum of the pillar heights in sub-region i, and sub-regions whose sum is 0 are not counted.

\[
\mathrm{CoD} = -\sum_{h_i > 0} \left( h_i \times \log_2 (h_i) \right) \tag{4}
\]
(3) Complexity of amplitude (CoA)
CoA investigates the distribution of motion vector amplitudes. The key point of CoA is that the more different amplitudes the motion vectors have, the greater the complexity of the region motion. To measure this indicator, we first divide the TDH RMV into seven sub-regions by amplitude (removing the center zero-vector area), as shown in Fig. 6. Then, we use the sub-region as a unit to calculate CoA as Eq. (5), in which $h_j$ is the sum of the pillar heights in sub-region j, and sub-regions whose sum is 0 are not counted.

\[
\mathrm{CoA} = -\sum_{h_j > 0} h_j \times \log_2 h_j \tag{5}
\]

Fig. 6. The Three Dimensional Histogram of Regional Motion Vector (TDH RMV) is divided into seven sub-regions by amplitude of motion vector. By analyzing the motion vectors in each sub-region, we can measure the complexity of motion vectors’ amplitude.
(4) Intensity of motion (IoM)
IoM investigates the intensity of motion vector amplitudes. The key point of IoM is that when a violent activity happens, some parts of the participants (e.g., hands and legs) move fast while other parts (e.g., the torso) move slowly, so the amplitudes of the motion vectors of these parts are remarkably different. To measure this indicator, we first divide the TDH RMV into seven sub-regions by amplitude (removing the center zero-vector area), as shown in Fig. 6. Then, we use Eq. (6) to calculate IoM, in which $j_{\max}$ is the largest index among the nonzero sub-regions and $h_{j_{\max}}$ is the sum of the pillar heights in that sub-region.

\[
\mathrm{IoM} =
\begin{cases}
\text{[…]}, & j_{\max} = 1\\
j_{\max} \times h_{j_{\max}}, & 1 < j_{\max} \le 7
\end{cases}
\tag{6}
\]

Fig. 7. Procedure of violent activity recognition based on motion vectors. (1) The VVAR10 dataset is divided into two parts, and each part includes both positive samples and negative samples. One part is used to train the SVM, and the other part is used to test the SVM. (2) The TDHs are used to describe the features of the MV distribution. The RMV (Regional Motion Vector) expresses the features within frames and the RMVD (Regional Motion Vector Difference) expresses the features between two adjacent frames. (3) The RMV descriptor includes 8 variables to describe the motion features of the region. The 4 values in the first line are extracted from the RMV, and the 4 values in the second line are extracted from the RMVD. (4) We train a non-linear SVM to classify the RMV.
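The four in-frame features can be computed from a normalized histogram as in the following minimal Python sketch. Several details are assumptions, because the source does not state them: the eight direction sectors are taken as 45° wide starting at the positive x axis, the amplitude boundaries `edges` are hypothetical, and the IoM value for the case j_max = 1 is not legible in the source, so the sketch exposes it as a parameter with a placeholder default of 0.0. The function names are also hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

Vec = Tuple[int, int]


def entropy(heights: Iterable[float]) -> float:
    """Entropy-style sum used by CoS, CoD and CoA; zero heights are skipped."""
    return -sum(h * math.log2(h) for h in heights if h > 0)


def direction_bin(v: Vec) -> int:
    """Sector 0..7 of a nonzero vector (ASSUMPTION: 45-degree sectors from +x)."""
    x, y = v
    if x > 0 and y >= 0:
        return 0 if y < x else 1
    if x <= 0 and y > 0:
        return 2 if -x < y else 3
    if x < 0 and y <= 0:
        return 4 if -y < -x else 5
    return 6 if x < -y else 7


def amplitude_bin(v: Vec, edges=(1, 2, 4, 8, 16, 32)) -> int:
    """Amplitude sub-region 1..7 (ASSUMPTION: hypothetical bin boundaries)."""
    return 1 + sum(math.hypot(*v) > e for e in edges)


def amplitude_sums(hist: Dict[Vec, float]) -> List[float]:
    sums = [0.0] * 7
    for v, h in hist.items():
        if v != (0, 0):  # the center zero-vector area is removed
            sums[amplitude_bin(v) - 1] += h
    return sums


def complexity_of_sort(hist: Dict[Vec, float]) -> float:
    return entropy(hist.values())


def complexity_of_direction(hist: Dict[Vec, float]) -> float:
    sums = [0.0] * 8
    for v, h in hist.items():
        if v != (0, 0):
            sums[direction_bin(v)] += h
    return entropy(sums)


def complexity_of_amplitude(hist: Dict[Vec, float]) -> float:
    return entropy(amplitude_sums(hist))


def intensity_of_motion(hist: Dict[Vec, float], value_at_one: float = 0.0) -> float:
    sums = amplitude_sums(hist)
    nonzero = [j for j, h in enumerate(sums, start=1) if h > 0]
    if not nonzero:
        return 0.0
    j_max = max(nonzero)
    if j_max == 1:
        return value_at_one  # ASSUMPTION: placeholder, not from the source
    return j_max * sums[j_max - 1]
```

The direction sectors are decided with integer comparisons rather than an angle, so vectors that lie exactly on a sector boundary are binned deterministically.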
4.3. Motion vector features between frames

A violent activity is usually continuous over a certain period, and the motion regions and their motion vectors change constantly between adjacent frames, which reflects the disorder of the violent activity. We calculate the absolute value of the difference of the TDH RMVs of two adjacent frames to get the Three Dimensional Histogram of Region Motion Vector Difference (TDH RMVD). In the TDH RMVD, each pillar height indicates the difference of motion vectors between two adjacent frames. In order to investigate the motion features of the area expressed by the TDH RMVD, we define the following variables:

(1) Intensity of Difference (IoD)
IoD investigates the absolute difference of motion vectors between two adjacent frames. The key point of IoD is that the larger the difference of motion vectors between two adjacent frames, the larger the complexity of the region motion. We use Eq. (7) to calculate IoD, in which $h_{xy}$ is the pillar height.

\[
\mathrm{IoD} = \text{[…]} \tag{7}
\]

(2) Complexity of Difference Sort (CoDS), Complexity of Difference Direction (CoDD), Complexity of Difference Amplitude (CoDA)
CoDS, CoDD and CoDA have meanings similar to CoS, CoD and CoA defined in Section 4.2, and can be calculated in the same way. However, because the pillar height in the TDH RMVD is not equal to the probability of the pillar, the pillar heights in the TDH RMVD must first be normalized by the sum of all pillar heights.

4.4. Region motion vectors descriptor

As mentioned above, for each frame of the input video sequence we can get four in-frame features and four inter-frame features. Let the time length of the detection window be N. Within this time window, we calculate the mean of each feature. These 8 means and the detection window length N form the Region Motion Vectors descriptor (RMV), as Eq. (8) shows, in which the overline denotes the mean operator. The RMV represents the space–time complexity of the motion in the video, and can be used to measure the possibility of violent activity.

\[
U_{RVD} = \{ \overline{\mathrm{CoS}},\ \overline{\mathrm{CoD}},\ \overline{\mathrm{CoA}},\ \overline{\mathrm{IoM}},\ \overline{\mathrm{IoD}},\ \overline{\mathrm{CoDS}},\ \overline{\mathrm{CoDD}},\ \overline{\mathrm{CoDA}},\ N \} \tag{8}
\]

4.5. Violent activity judgment

Violent activity has strong randomness, and can hardly be classified by methods based on minimum distance or template matching. In this paper, we select the SVM method to recognize violent activity. SVM is a learning method developed on the basis of statistical learning theory; it can effectively handle small sample sizes, model selection and nonlinearity, and has good generalization performance. The decision function of the SVM is shown in Eq. (9).

\[
f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} y_i a_i^{*} K(x_i \cdot x) + b^{*} \right) \tag{9}
\]

Here x is the vector of an unclassified sample, n is the number of support vectors in the trained SVM, and $x_i$ is the ith support vector selected from the training samples. $y_i$ is the type of $x_i$, where “+1” represents a positive sample and “−1” represents a negative sample. $a_i^{*}$ is the parameter corresponding to $x_i$, and $b^{*}$ is the global parameter. $K(x_i \cdot x)$ is the function that calculates the inner product of x and $x_i$, also called the kernel function. Before using the SVM to classify the input data, two things must be done. First, a proper kernel function should be selected. Second, the SVM must be trained on prepared samples to get the required support vectors and corresponding parameters.
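Looking back at Sections 4.3 and 4.4, the inter-frame histogram and the window-averaged descriptor can be sketched as follows. This is a minimal Python illustration with hypothetical function names, not the authors' implementation. The body of Eq. (7) is not legible in the source, so the sketch assumes IoD is the sum of the TDH RMVD pillar heights; only CoDS is shown for the entropy-style features, since CoDD and CoDA apply the same sum to direction and amplitude sub-regions of the normalized heights.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[int, int]
Hist = Dict[Vec, float]


def tdh_rmvd(prev: Hist, curr: Hist) -> Hist:
    """Absolute per-pillar difference of two adjacent-frame histograms."""
    keys = set(prev) | set(curr)
    return {k: abs(curr.get(k, 0.0) - prev.get(k, 0.0)) for k in keys}


def intensity_of_difference(diff: Hist) -> float:
    # ASSUMPTION: Eq. (7) is unreadable in the source; a plain sum is used here.
    return sum(diff.values())


def normalize_heights(diff: Hist) -> Hist:
    """Rescale pillar heights so they sum to 1 before the entropy features."""
    total = sum(diff.values())
    if total == 0:
        return dict(diff)
    return {k: h / total for k, h in diff.items()}


def complexity_of_difference_sort(diff: Hist) -> float:
    heights = normalize_heights(diff).values()
    return -sum(h * math.log2(h) for h in heights if h > 0)


def window_means(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean of each per-frame feature over a detection window of N frames."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]
```

In use, each row passed to `window_means` would hold the eight per-frame features of one frame, and the resulting eight means together with the window length form the descriptor.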
The RMV is a vector with eight dimensions (the parameter N of the RMV is not used for the SVM), which can hardly be classified by a linear classifier. So we select the Radial Basis Function (RBF), which is the most popular kernel function and is well suited to non-linear classification problems, as the kernel function of the SVM. Eq. (10) is the expression of the RBF, where $\sigma$ is the width parameter, often set according to the distances between training samples.

\[
K(x_i \cdot x) = \exp\left( -\frac{\lVert x_i - x \rVert^{2}}{\sigma^{2}} \right) \tag{10}
\]
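Evaluating the decision function of Eq. (9) with the RBF kernel of Eq. (10) can be sketched as follows. This is a hedged Python illustration, not the authors' code: the function names are hypothetical, the kernel denominator is assumed to be σ² (it is not fully legible in the source), and a sum of exactly zero is mapped to +1 by convention. The support vectors, labels, coefficients and bias are assumed to come from an SVM that has already been trained.

```python
import math
from typing import Sequence


def rbf_kernel(xi: Sequence[float], x: Sequence[float], sigma: float) -> float:
    """RBF kernel (ASSUMPTION: denominator sigma squared)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-sq_dist / sigma ** 2)


def svm_decide(x: Sequence[float],
               support_vectors: Sequence[Sequence[float]],
               labels: Sequence[int],
               alphas: Sequence[float],
               b: float,
               sigma: float) -> int:
    """Return +1 (violent) or -1 (non-violent) for one descriptor x."""
    s = sum(y * a * rbf_kernel(xi, x, sigma)
            for xi, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s >= 0 else -1
```

With one negative support vector at the origin and one positive support vector at (1, 1), a sample near (1, 1) is classified as +1 and a sample near the origin as −1, which matches the intuition that the kernel weights nearby support vectors most heavily.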
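Training an SVM means solving the following equation to get the optimal parameters $a_i^{*}$ and $b^{*}$ of the SVM, as Eq. (11) shows.

\[
\arg\max_{W,\,b^{*}} \left\{ \min_{a} \frac{W^{T} X a^{*} + b^{*}}{\lVert W \rVert} \right\}, \qquad W = \{ a_i^{*} \}, \quad X = \{ x_i^{*} \} \tag{11}
\]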
In the training stage, we select many positive and negative samples from VVAR10 and calculate the corresponding RMVs. The RMV and the type of each sample form the vector $x_s$ for SVM training, as Eq. (12) shows, where s is the serial number of the training sample.

\[
x_s = \{ U_{RVD_s},\ y_s \}, \qquad y_s \in \{ +1, -1 \} \tag{12}
\]
In the recognition stage, we use the other videos in VVAR10 as testing samples, calculate the corresponding RMVs, and then use the trained SVM to determine whether violent activity exists in each video.

5. Experiments

In this section, extensive experiments on a large collected video dataset are conducted to evaluate the effectiveness of the proposed violent activity recognition method.

5.1. Detection results of our algorithm

Using the algorithm proposed in Section 4 and setting the time length of the detection window to 50, we calculate the RMV of all videos in VVAR10. Then we select 50 fighting videos, 50 pursuing videos, 50 exchanging-blows videos, 50 dragging videos and 277 negative samples from VVAR10 as training samples, and use their RMVs to train the SVM. We then classify the RMVs of the other videos in VVAR10 with the trained SVM. Table 1 shows the results, in which Accuracy is the rate of recognizing samples as their actual type, MAR (Miss Alarm Rate) is the rate of recognizing violent videos as non-violent videos, and FAR (False Alarm Rate) is the rate of recognizing non-violent videos as violent videos.

Table 1
Experiment result on VVAR10. (Column header: Method; table values not recoverable.)
Table 2
Detection results of different violent activities. (Column header: Violent activity; table values not recoverable.)

Table 3
Detection results of different scenes. (Column header: Scene feature; recoverable rows: Use tools, Non tools; table values not recoverable.)

Table 4
The performance comparison.

Method             UCF sports (%)   UCF 50 (%)   HMDB 51 (%)
Yao et al.         86.6             –            –
Wang et al.        88.2             –            –
Sadanand et al.    95.0             76.4         26.9

Table 5
The performance on the streamlined dataset (“*” means that the dataset used here is streamlined from the dataset in Table 4). (Method: Ours; rows: Violent activities, Nonviolent activities, Average; columns: UCF sports* (%), UCF 50* (%), HMDB 51* (%); table values not recoverable.)
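The three measures defined in Section 5.1 (Accuracy, MAR and FAR) can be computed from ground-truth and predicted labels as in this minimal Python sketch. The function name is hypothetical, and the denominators are an assumption: MAR is taken relative to the number of violent samples and FAR relative to the number of non-violent samples, which is the usual reading of these rates but is not spelled out in the text.

```python
from typing import Sequence, Tuple


def alarm_rates(y_true: Sequence[int],
                y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (accuracy, MAR, FAR) for labels +1 (violent) / -1 (non-violent).

    Assumes both classes are present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    violent = [p for t, p in pairs if t == 1]
    nonviolent = [p for t, p in pairs if t == -1]
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # MAR: violent videos recognized as non-violent.
    mar = sum(p == -1 for p in violent) / len(violent)
    # FAR: non-violent videos recognized as violent.
    far = sum(p == 1 for p in nonviolent) / len(nonviolent)
    return accuracy, mar, far
```

Reporting MAR and FAR separately, rather than accuracy alone, makes it visible whether errors come from missed violent clips or from false alarms on non-violent ones.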
5.1.1. Performance for different violent activities

By analyzing the experimental results, we get the accuracy rates for different kinds of violent activities, which are shown in Table 2. We find that the accuracy rate of fighting detection is the highest, and that of pursuing is the lowest. Taking the situations in the videos into account, we find that activities that last a long time and are visually obvious are much easier to detect than short activities.

5.1.2. Performance for different scenes

For a further study of our algorithm, we analyze the experiments separately for different scenes. The scene features include using tools or not, day or night, and single or multiple participants, and the recognition results are given in Table 3. We find that the detection accuracy rates of violent activities with and without tools are similar, while for the other situations the differences are obvious. This is mainly because the motion vectors are obtained from blocks of at least 4 × 4 pixels, so they cannot clearly capture the motion of tiny objects and are greatly influenced by noise.

5.2. Performance comparison

In order to measure the performance of our method in depth, we test it on UCF sports, UCF50 and HMDB51. Because these datasets are not designed for violent activity recognition, we divide each dataset into two parts: the violent activities part and the non-violent activities part. Table 4 shows the results, in which the performance of our method is obviously different from the results in Table 1. By analyzing the results, we find the reason is that many videos in these datasets contain global motion caused by camera motion, and in these videos the motion vectors that belong to background areas have non-zero values. So we remove the videos that contain obvious global motion from the test datasets, and run the test again. Table 5 shows the results, which indicate that our method performs better when dealing with videos that do not contain global motion. This is partly because dividing a dataset into two parts is easier than dividing it into multiple parts.
5.3. Computational cost

Most of the computation of our method is spent on motion vector normalization and on extracting motion features from the motion vectors. In the first step, the number of motion vectors of a video frame with N × N pixels is no more than N × N/16, so no more than N × N/16 additions and divisions are used. In the second step, the cost of computing the RMV is independent of the video frame size because the TDH RMV and the TDH RMVD are normalized by the region area, and no more than 2655 additions, 868 multiplications and 540 logarithms are used. The other steps of our method, such as getting motion vectors from video streams and using the trained SVM to classify the RMV, need little calculation.

6. Conclusions and future work

In this paper, we have proposed a violent activity recognition method for surveillance videos based on motion vectors. The method first extracts motion vectors from compressed video data, then analyzes the space–time distribution of the motion vectors’ amplitudes and directions to get the RMV, and finally uses an SVM with a radial basis kernel function to classify the RMV and determine whether violent activity exists in surveillance videos. The proposed method is very suitable for front-end video encoders based on embedded DSP platforms, because it avoids the moving-target detection and tracking steps that are usually used in traditional activity analysis methods. It can improve the performance of video surveillance systems on real-time violent activity recognition, and improve the performance of mass video retrieval systems on locating violent activity in historical videos. Using motion vectors to describe the motion features of the target takes full advantage of the information already present in compressed video data, and has great significance for improving the performance of smart video surveillance systems.
In the future, we will focus on extending the RMV to classify the input video more accurately, and on reducing the adverse effects of camera motion on the recognition performance.

Acknowledgments

The research described in this paper has been supported by National Natural Science Foundation of China (Grant no. 61303188) and National Standards Project of China (Grant no. TC100-SC2GA201410).

References

[1] J.K. Aggarwal, M.S. Ryoo, Human activity analysis: a review, in: ACM, 2011.
[2] Ross Messing, Chris Pal, Henry Kautz, Activity recognition using the velocity histories of tracked keypoints, in: ICCV, 2009.
[3] Christian Schuldt, Ivan Laptev, Barbara Caputo, Recognizing human actions: a local SVM approach, in: ICPR, 2004.
[4] Nishkam Ravi, Nikhil Dandekar, Preetham Mysore, Michael L. Littman, Activity recognition from accelerometer data, in: AAAI, 2005.
[5] Heng Wang, Alexander Kläser, Cordelia Schmid, Cheng-Lin Liu, Action recognition by dense trajectories, in: CVPR, 2011.
[6] Zhenhua Wang, Qinfeng Shi, Chunhua Shen, Anton van den Hengel, Bilinear programming for human activity recognition with unknown MRF graphs, in: CVPR, 2013.
[7] Sreemanananth Sadanand, Jason J. Corso, Action bank: a high-level representation of activity in video, in: CVPR, 2012.
[8] M.S. Ryoo, J.K. Aggarwal, Semantic representation and recognition of continued and recursive human activities, in: IJCV, 2009.
[9] Ekaterina H. Spriggs, Fernando De La Torre, Martial Hebert, Temporal segmentation and activity classification from first-person sensing, in: CVPR, 2009.
[10] Angela Yao, Juergen Gall, Luc Van Gool, A Hough transform-based voting framework for action recognition, in: CVPR, 2010.
[11] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space–time shapes, in: ICCV, 2005.
[12] Wei Niu, Jiao Long, Dan Han, Yuan-Fang Wang, Human activity detection and recognition for video surveillance, in: ICME, 2004.
[13] I. Laptev, M. Marszałek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: CVPR, 2008.
[14] J. Liu, J. Luo, M. Shah, Recognizing realistic actions from videos ‘in the wild’, in: CVPR, 2009.
[15] Emmanuel Munguia Tapia, Stephen S. Intille, Kent Larson, Activity recognition in the home using simple and ubiquitous sensors, in: PC, 2004.
[16] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Comput. Vision Image Understanding 73 (3) (1999) 428–440.
[17] Takuya Maekawa, Yutaka Yanagisawa, Yasue Kishino, Katsuhiko Ishiguro, Koji Kamei, Yasushi Sakurai, Takeshi Okadome, Object-based activity recognition with heterogeneous sensors on wrist, in: ICPC, 2010.
[18] Du Tran, Alexander Sorokin, Human activity recognition with metric learning, in: ECCV, 2008.
[19] Heng Wang, Muhammad Muneeb Ullah, Alexander Kläser, Ivan Laptev, Cordelia Schmid, Evaluation of local spatio-temporal features for action recognition, in: BMVC, 2009.
[20] Boštjan Kaluža, Gal A. Kaminka, Milind Tambe, Detection of suspicious behavior from a sparse set of multiagent interactions, in: AAMAS, 2012.
[21] Douglas L. Vail, Manuela M. Veloso, John D. Lafferty, Conditional random fields for activity recognition, in: AAMAS, 2007.
[22] Lin Liao, Dieter Fox, Henry Kautz, Location-based activity recognition using relational Markov networks, in: IJCAI, 2005.
[23] YouTube. http://www.youtube.com.
[24] Mohamed R. Amer, Dan Xie, Mingtian Zhao, Cost-sensitive top-down/bottom-up inference for multiscale activity recognition, in: ECCV, 2012, pp. 187–200.
[25] Benjamin Yao, Bruce Nie, Zicheng Liu, Animated pose templates for modelling and detecting human actions, IEEE Trans. PAMI (2013).
[26] Jiang Wang, Zicheng Liu, Ying Wu, Junsong Yuan, Mining actionlet ensemble for action recognition with depth cameras, in: CVPR, 2012.
[27] Chen Sun, Ram Nevatia, Large-scale web video event classification by use of Fisher vectors, in: WACV, 2013.
[28] Jianbin Xie, Tong Liu, Wei Yan, Peiqin Li, Zhaowen Zhuang, A fast and robust algorithm for fighting behavior detection based on motion vectors, TIIS 5 (11) (2011) 2191–2203.