- Email: [email protected]

PII: S0925-2312(16)30594-X
DOI: http://dx.doi.org/10.1016/j.neucom.2015.11.126
Reference: NEUCOM17206

To appear in: Neurocomputing
Received date: 15 April 2015
Revised date: 17 October 2015
Accepted date: 2 November 2015

Cite this article as: Yen-Lun Chen, Xinyu Wu, Teng Li, Jun Cheng, Yongsheng Ou and Mingliang Xu, Dimensionality Reduction of Data Sequences for Human Activity Recognition, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2015.11.126

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Dimensionality Reduction of Data Sequences for Human Activity Recognition$

Yen-Lun Chen (a), Xinyu Wu (a,∗), Teng Li (b), Jun Cheng (a), Yongsheng Ou (a), Mingliang Xu (c)

(a) Guangdong Provincial Key Laboratory of Robotics and Intelligent System, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
(b) College of Electrical Engineering and Automation, Anhui University
(c) School of Information Engineering, Zhengzhou University

Abstract

Although current human activity recognition can achieve high accuracy, it requires high-dimensional data sequences to reach a reliable decision about an entire activity. Traditional dimensionality reduction methods do not exploit the local geometry of classification information. In this paper, we introduce the manifold elastic net framework, which encodes the local geometry to find an aligned coordinate system for data representation. The method is efficient because it uses the classification error minimization criterion to link the classification error directly with the selected subspace. In the experimental section, a human activity recognition dataset recorded from wearable, object, and ambient sensors is studied.

Keywords: Dimensionality reduction, human locomotion, gesture detection.

$ The work described in this paper is partially supported by National Natural Science Foundation of China (61403364, 61473277, 61572029, 61472370), Shenzhen Fundamental Research Program (JCYJ20140901003939022), SIAT Innovation Program for Excellent Young Researchers (201315), Guangdong Public Welfare Research and Capacity Building Project (2014A010103020), and Guangdong Innovative Research Team Program (201001D0104648280).
∗ Corresponding author: [email protected] (Xinyu Wu)

Preprint submitted to Neurocomputing, June 9, 2016

1. INTRODUCTION

Human action understanding [1] is one of the promising research areas of behavior analysis in social networks [2]. However, modeling human actions through learning [3] is a challenging research topic because of diversity, ambiguity, and uncertainty. For example, in Fig. 1, detecting various forms of human locomotion, including standing, walking, sitting, and lying, may not be easy for machines under different illumination and cluttered backgrounds.

Figure 1: Different locomotion (standing, walking, sitting, and lying) of four subjects in the human activity dataset [14] under varying illumination and cluttered backgrounds. The depth images were captured by a Kinect sensor, and the skeleton plots were obtained by the visualization tool.

An important area of social network research is exploring new methods for behavior analysis [4][5][6]. However, the explosion in popularity of social network services leads to the problem of "information overload" [7]. To let users connect and communicate more efficiently, it is necessary to introduce effective information extraction mechanisms [8][9][10][11][12] that identify the information most interesting to users from every network [13].

Multi-mode intelligent research has developed quickly and has been widely applied in education, entertainment, art, sports, and other areas, with ambient intelligence applied in ubiquitous systems of highly miniaturized wireless sensor nodes that assist us in all aspects of human life [15]. A multi-mode intelligent system fully utilizes image processing [16][17], motion tracking, action recognition, automatic speech recognition, virtual reality, video annotation [18][19], scene analysis [20], and other techniques [21], so that computers imitate human perception and create more harmonious human-machine interactions. Several universities and companies conduct this research in dedicated research centers, and a number of other institutions have also actively committed to it.

Intelligent systems that combine smart homes, residential community information, medical monitoring, and security management based on social networks have begun to take shape, and they have broad development prospects. The ubiquity and portability of intelligent wireless devices have become a new direction for developing innovative services and applications. Unlike traditional systems, the design of these sensors and applications focuses on the user's experience. Based on the ubiquitous and portable characteristics of wireless devices, combined with behavioral and physiological information analysis, a portable management and service platform could provide independent status management in cooperation with external quality services that offer interactive and unique experiences. In response to this trend, automatic daily-life wellness management services have been developed based on service engineering, which integrates the technology of behavioral and physiological information analysis. Moreover, links to social networks facilitate services with more varied and diversified interactions [22].

In the field of sensing research, mutual collaboration based on multi-sensor information fusion is the key to resolving pressing issues. An intelligent system in its normal state not only monitors its own position, posture, and speed but also perceives the work environment, which allows the system to carry out its tasks in order and to adapt smoothly to changes in the environment.
Internal and external perceptive sensors are integrated into the system, constituting multi-sensor information fusion perception. Mutual synergy based on multi-sensor information fusion requires making full use of the information from multiple sensors over time and space: chronological multi-sensor observations are automatically analyzed, synthesized, and processed under certain guidelines to obtain a consistent interpretation and description of the measured object. At present, the intelligent decision-making ability of machines in the cognitive process is still relatively low. Cognitive systems can process the perceived information, refine it, make the right decision, and take the appropriate actions in a timely manner. The reliability and validity of the decisions should be enhanced; specifically, systems should be capable of effectively extracting information, of accessing critical information about their surroundings, and of making inferences in search of certainty and predictability. Intelligent systems should explore the regularities in the obtained information and should continuously improve to attain a good learning ability. Highly intelligent systems should also have good human-machine communication skills to enhance coherence and human interaction.

The main objective of this paper is to achieve flexible and efficient feature extraction for dynamic action recognition and its application in social networks, so that a system can better "understand" and anticipate the different behaviors of people. The successful implementation of this technology will have positive impacts on activity detection, and systems will become increasingly effective and applicable to various situations, including game control, sensor-enabled robot control, virtual reality, and smart home systems.

The remainder of this paper is organized as follows. Section 2 gives an overview of previous work in the field. Section 3 details the dimensionality reduction approaches of the system. In Section 4, a simulation study is presented based on a human activity recognition dataset recorded from wearable, object, and ambient sensors. Finally, the results are summarized in Section 5.

2. PREVIOUS WORK

Many researchers have investigated this area, and a considerable amount of previous work can be found in the literature. In [23], the authors introduce a novel video representation, termed spatial-temporal pyramid sparse coding, which characterizes both the spatial and temporal aspects of the video. In [24], the advantages of both dense and sparse sampling are combined, and descriptors are extracted on a dense grid pruned either randomly or based on a sparse saliency mask of the underlying video. In [25], the authors compare two different representation schemes, raw multivariate time-series data and the covariance descriptors of the trajectories, and apply sparse representation techniques for classifying the various actions.
The features are sparse coded using the orthogonal matching pursuit algorithm, and the gestures and actions are classified based on the reconstruction residuals. In [26], a video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. In [27], the potential of recent machine learning methods for discovering universal features is investigated for context-aware applications of activity recognition. In [28], the authors utilize action primitives that can be extracted from data collected by sensors worn on the human body and embedded in different objects and in the environment, in order to identify how various types of action primitives influence the performance of high-level activity recognition systems. In [29], a technique is presented for using on-body accelerometers to assist in automated classification of problem behavior during such direct observation. In [30], the authors introduce a novel method for activity recognition which leverages the predictability of human behavior to conserve energy by dynamically selecting sensors.

3. DIMENSIONALITY REDUCTION APPROACHES

In this section, we discuss two approaches to data representation. First, we discuss the dimensionality reduction algorithm of principal component analysis (PCA). Second, we explain how the manifold elastic net (MEN) [31] encodes the local geometry of a set of samples and finds an aligned coordinate system in its optimal solution; a framework overview is illustrated in Fig. 2.

3.1. Principal Component Analysis

The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation contained in the data set [32]. Suppose that $\{x_1, x_2, \ldots, x_n\}$ is a set of $p$-dimensional vectors, where $x_i \in \mathbb{R}^p$. Then the covariance matrix $S \in \mathbb{R}^{p \times p}$ can be described as

$$S = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T, \tag{1}$$

where $m = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the mean of all data points. Since $S$ is real and symmetric, there exists a real orthogonal matrix $U$ such that $U^T S U = \Lambda$, where $\Lambda$ is a real diagonal matrix. The diagonal elements of $\Lambda$ are the eigenvalues of $S$, and the columns of $U$ are the corresponding eigenvectors of $S$. Let $q_1, q_2, \ldots, q_d$ be the $d$ largest eigenvalues, sorted from large to small, and $u_1, u_2, \ldots, u_d$ be the corresponding eigenvectors. A data point $x_i$ is projected on eigenvector $u_j$ by $z_{ij} = u_j^T (x_i - m)$.

PCA is used in many disciplines for structure detection and dimensionality reduction. It finds the best angle from which to observe the most variation of the data points by detecting the structure in the relationships between variables and then reducing the number of variables. Since $u_1$ is the eigenvector with the largest eigenvalue, it is called the first principal component. If only one variable (dimension) can be retained, $u_1$ provides the best direction on which to project the data points and to see their variation. Similarly, if $d$ variables can be retained, then $u_1, \ldots, u_d$ are used for projection and the dimensionality is reduced from $p$ to $d$.
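The PCA steps of this subsection can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names (`covariance`, `leading_eigenpair`, `pca`, `project`) are hypothetical, and power iteration with deflation is swapped in as a simple stand-in for a full eigendecomposition of $S$.

```python
import math

def covariance(X):
    """Return (S, m): S = (1/n) * sum_i (x_i - m)(x_i - m)^T as in Eq. (1), and the mean m."""
    n, p = len(X), len(X[0])
    m = [sum(x[j] for x in X) / n for j in range(p)]
    S = [[0.0] * p for _ in range(p)]
    for x in X:
        d = [x[j] - m[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                S[a][b] += d[a] * d[b] / n
    return S, m

def leading_eigenpair(S, iters=500):
    """Power iteration (assumed stand-in for an eigensolver).

    The uniform start vector is an arbitrary choice; it would fail if it were
    orthogonal to the leading eigenvector.
    """
    p = len(S)
    u = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        v = [sum(S[a][b] * u[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            break
        u = [c / norm for c in v]
    # Rayleigh quotient u^T S u gives the eigenvalue estimate.
    q = sum(u[a] * sum(S[a][b] * u[b] for b in range(p)) for a in range(p))
    return q, u

def pca(X, d):
    """Return the mean m and the d leading (eigenvalue, eigenvector) pairs of S."""
    S, m = covariance(X)
    p = len(S)
    components = []
    for _ in range(d):
        q, u = leading_eigenpair(S)
        components.append((q, u))
        # Deflation: subtract the component just found so the next one can emerge.
        S = [[S[a][b] - q * u[a] * u[b] for b in range(p)] for a in range(p)]
    return m, components

def project(x, m, components):
    """z_j = u_j^T (x - m) for each retained eigenvector u_j."""
    return [sum(u[j] * (x[j] - m[j]) for j in range(len(x))) for _, u in components]

# Toy data lying on the line x1 = x2, so one component captures all variation.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
m, comps = pca(X, d=1)
z = project([3.0, 3.0], m, comps)
```

In practice a library eigensolver would replace the power iteration; the sketch only makes the roles of $S$, $m$, $u_j$, and $z_{ij}$ concrete.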

Figure 2: Framework overview of the dimensionality reduction approach. (Flowchart blocks: data source with training set X and class label C; normalization; manifold elastic net (MEN), comprising the weighted covariance matrix S, the indicator matrix Y constructed from the eigenvectors with the d largest eigenvalues of S, the construction of X* and Y*, and the lasso penalized least squares problem whose solution W* is obtained by the least angle regression (LARS) algorithm; dimensionality reduction with projection matrix W and feature vector Z; test data input, extracted feature vector, classifier, and detection output.)

3.2. Manifold elastic net (MEN)

To encode the local geometry of a set of samples and find an aligned coordinate system for data representation under the patch alignment framework, MEN utilizes the classification error minimization criterion to directly link the classification error with the selected subspace. In addition, MEN incorporates the elastic net regularization to sparsify the projection matrix [31].

Let $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times p}$ be a given training set in a high-dimensional space with column-wise zero empirical mean, that is, the sample mean of each column has been shifted to zero. Let $C = [c_1, c_2, \ldots, c_n]^T \in \mathbb{R}^n$ be the corresponding class label vector, where $c_i \in \{1, 2, \ldots, k\}$ and $k$ is the number of classes. The linear approximation of the mapping $X \to Z$ from the high-dimensional space to a low-dimensional subspace is performed by minimizing $\|Z - XW\|_2^2$, and the objective is to find a projection matrix $W = [w_1, w_2, \ldots, w_d] \in \mathbb{R}^{p \times d}$ that projects samples $x^T \in \mathbb{R}^p$ to $z^T \in \mathbb{R}^d$.

An explicit way to achieve classification error minimization in the objective function is to minimize $\|Y - XW\|_2^2$ to further improve the performance on classification problems, where $Y$ represents the indicator matrix, whose construction is flexible. The construction of $Y$ can adopt a weighted PCA of class centers, where the class centers are $m_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i$ for $j = 1, \ldots, k$, with the sum taken over the $n_j$ samples of class $j$. The weighted covariance matrix is defined as

$$S_w = \frac{1}{n} \sum_{j=1}^{k} n_j m_j^T m_j. \tag{2}$$

The projected center $\hat{m}$ is obtained by $\hat{m} = U_d^T m_j$, where $U_d = [u_1, u_2, \ldots, u_d]$ contains the eigenvectors with the $d$ largest eigenvalues of $S_w$.

To obtain a sparse projection matrix with the grouping effect, MEN adds the combination of the lasso penalty and the $l_2$-norm penalty directly to the objective function of a discriminative manifold learning dimensionality reduction algorithm:

$$\arg\min_{Z, W} \left\{ \|Y - XW\|_2^2 + \alpha\, \mathrm{tr}(Z^T L Z) + \beta \|Z - XW\|_2^2 + \lambda_1 \|W\|_1 + \lambda_2 \|W\|_2^2 \right\}, \tag{3}$$

where the entire alignment matrix $L$ is given after summing over all of the partial optimizations, which maximize the distances between a given sample and the group of related samples from different classes while minimizing the distances between the sample and the group of related samples in the same class. Using a series of equivalent linear algebra transformations as described in [31], the objective function of MEN can be rewritten as a lasso penalized least squares problem

$$\arg\min_{W^*} \|Y^* - X^* W^*\|_2^2 + \lambda \|W^*\|_1, \tag{4}$$

where

$$X^* = (1 + \lambda_2)^{-1/2} \left[ (D^{1/2} V^T) X,\ \lambda_2 I_{p \times p} \right]^T \in \mathbb{R}^{(n+p) \times p}, \qquad Y^* = \left[ ((D^{1/2} V^T)^T)^{-1} Y,\ 0_{p \times 1} \right]^T \in \mathbb{R}^{(n+p) \times 1}, \tag{5}$$

$W^* = (1 + \lambda_2) W$, and $\lambda = \lambda_1 / (1 + \lambda_2)$. In Eq. (5), $D$ and $V$ are obtained from the eigenvalue decomposition of $(A + A^T)/2$, where $A$ is an asymmetric matrix computed from $L$ as

$$A = \alpha (\beta (\alpha L + \beta I)^{-1})^T L (\beta (\alpha L + I)) + \beta (\beta (\alpha L + \beta I) - I)^T (\beta (\alpha L + \beta I) - I) + I. \tag{6}$$
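The weighted covariance matrix of class centers in Eq. (2), which is the starting point for constructing the indicator matrix $Y$, can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the helper names `class_centers` and `weighted_covariance` are hypothetical, each class center $m_j$ is treated as a row vector so that $m_j^T m_j$ is a $p \times p$ outer product, and the toy data already has zero column means.

```python
def class_centers(X, labels):
    """Return {class label: (n_j, m_j)} where m_j is the mean of the rows in class j."""
    sums, counts = {}, {}
    for x, c in zip(X, labels):
        if c not in sums:
            sums[c] = [0.0] * len(x)
            counts[c] = 0
        sums[c] = [s + v for s, v in zip(sums[c], x)]
        counts[c] += 1
    return {c: (counts[c], [s / counts[c] for s in sums[c]]) for c in sums}

def weighted_covariance(X, labels):
    """S_w = (1/n) * sum_j n_j * m_j^T m_j, with m_j taken as a row vector (Eq. (2))."""
    n, p = len(X), len(X[0])
    Sw = [[0.0] * p for _ in range(p)]
    for n_j, m_j in class_centers(X, labels).values():
        for a in range(p):
            for b in range(p):
                Sw[a][b] += n_j * m_j[a] * m_j[b] / n
    return Sw

# Toy training set with zero column means and two classes.
X = [[1.0, 0.0], [3.0, 0.0], [-2.0, 0.0], [-2.0, 0.0]]
C = [1, 1, 2, 2]
Sw = weighted_covariance(X, C)
```

The eigenvectors of `Sw` with the largest eigenvalues would then give $U_d$; any symmetric eigensolver can be used for that step.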

The optimal solution of Eq. (4) can be obtained by the least angle regression (LARS) algorithm [33], which continuously shrinks the entries of the projection matrix $W^*$ toward zero while simultaneously preserving a high prediction accuracy. As suggested in [31], $\alpha$ and $\beta$ are always assigned the same value. $\lambda_1$ and $\lambda_2$ are the weights on the $l_1$-norm and $l_2$-norm of the projection matrix $W$, which act as the grouping-effect penalties in the objective function. The values of $\lambda_1$ and $\lambda_2$ are determined according to the given data: they should be large when the features are strongly correlated, and small otherwise. In practical algorithms, the grouping-effect weight $\lambda_2 / \lambda_1$ is used in place of the separate values of $\lambda_1$ and $\lambda_2$.

4. EXPERIMENTAL RESULTS

The OPPORTUNITY Activity Recognition Dataset [34] is used to evaluate human activity recognition from wearable, object, and ambient sensors. This dataset comprises two sets of readings from motion sensors that were recorded under different conditions while users executed typical daily activities in a simulated studio room; it is available on the website of the UCI machine learning repository [35]. The first set of data contains different modes of locomotion (e.g., sitting, standing, walking), and the numbers of instances are shown in Table 1. Samples from four subjects are collected in the dataset. There are significantly more samples of standing and walking than of sitting and lying. Six different runs were recorded for each subject. Five of the runs were termed activity of daily living (ADL) runs and followed a given scenario, which consisted of temporally unfolding situations. The remaining run was called a drill run and was designed to generate a large number of activity instances. The number of instances for the 17 meaningful actions in the second set, the Mid-Level gestures, is shown in Fig. 3.

The drill run consists of 20 repetitions of the following sequence of activities: open then close the fridge, open then close the dishwasher, open then close three drawers (at different heights), open then close door I, open then close door II, toggle the lights on then off, clean the table, drink while standing, and drink while seated. For all subjects, there were more samples of open/close the fridge, drink from cup, and toggle switch than of the other actions.

Figure 3: Number of instances in the Mid-Level gesture set which contains the following 17 meaningful actions: (1) open door I; (2) open door II; (3) close door I; (4) close door II; (5) open fridge; (6) close fridge; (7) open dishwasher; (8) close dishwasher; (9) open drawer I; (10) close drawer I; (11) open drawer II; (12) close drawer II; (13) open drawer III; (14) close drawer III; (15) clean table; (16) drink from cup; (17) toggle switch. (Four bar-chart panels, one per subject S1 to S4, each showing ADLs and Drills; horizontal axis: Actions.)

Table 1: Number of instances in the Locomotion set.

Data                          Stand   Walk   Sit   Lie
ADLs (all subjects)            1094   1095    90    40
ADLs + Drills (all subjects)   1711   1733   169    40
S1 ADLs                         252    271    11    10
S1 ADLs + Drill                 424    463    31    10
S2 ADLs                         286    285    23    10
S2 ADLs + Drill                 437    434    43    10
S3 ADLs                         259    245    32    10
S3 ADLs + Drill                 406    394    50    10
S4 ADLs                         297    294    24    10
S4 ADLs + Drill                 444    442    45    10

4.1. Data Dimensionality Reduction

Three types of sensors are used during data acquisition: body-worn sensors, object sensors, and ambient sensors, as shown in Fig. 4. In total, there are 243 attributes for each sampled data point. Among the body-worn sensors, the inertial measurement units had the highest data quality and nearly no data loss. Although the placement of jacket-integrated sensors is highly repeatable among subjects and recording runs, the tags that provide indoor 3D localization are extremely noisy because the 3D acceleration sensors suffer from wireless data losses. Similarly, the sensors in the objects suffer from wireless data loss. Objects placed in the dishwasher suffer more data loss from occlusion when the dishwasher is closed. The ambient sensors suffer from little to no data loss because their data were acquired by a wired system.

Figure 4: Dataset acquisition sensors include: (a) the body-worn sensors; (b) object sensors; (c)(d) ambient sensors. (Images courtesy of the OPPORTUNITY activity recognition dataset, available in [35].)

All algorithms follow the same procedure in all experiments to eliminate the noisy attributes and reduce data dimensionality. First, the database is randomly divided into two separate sets: the training set and the testing set. Then, the training set is used to learn the low-dimensional subspace and the corresponding projection matrix through a given algorithm. Next, samples in the testing set are projected to the low-dimensional subspace via the projection matrix. Finally, the nearest neighbor (NN) classifier is used to recognize the testing samples in the subspace. The experimental results are shown in Figs. 5 and 6, where the linked lines across the dimensions indicate the average accuracy over ten different runs. The accuracy measure is conventionally defined as Accuracy = trace(M)/n, where M is the confusion matrix and n is the total number of samples. Each column of M represents the instances in a predicted class, and each row represents the instances in an actual class.

Figure 5: Boxplot of data representation in the Locomotion set. The linked lines across dimensions indicate the average accuracy over ten different runs. (Vertical axis: Accuracy; horizontal axis: Dimension, 1 to 4; methods: MEN, PCA, LDA.)

The boxplots, which graphically depict groups of numerical data using their quartiles, allow one to visually estimate data statistics such as the interquartile range, mid-hinge, range, mid-range, and tri-mean. On each box, the central mark is the median, and the edges of the box are the first and third quartiles. The spacings between the different parts of the box indicate the degree of spread and skewness in the data. With the default settings, k = n, κ = 3, α = 1, β = 1, λ1 = 1, and λ2 = 0.3. From the results, the MEN algorithm consistently outperforms Linear Discriminant Analysis (LDA) [36] and PCA for both the locomotion and mid-level gesture datasets. However, the boxplots overlap when the dimension equals three for PCA in the locomotion dataset, and when the dimension is larger than 12 for LDA in the gesture dataset. MEN maintains higher accuracy rates when the dimension of the selected subspace is low. This result supports the robustness of the MEN algorithm in low-dimensionality scenarios.
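The evaluation step of this subsection (nearest neighbor classification in the subspace, followed by Accuracy = trace(M)/n) can be sketched in plain Python. This is an illustrative sketch only: the function names and the one-dimensional toy features are hypothetical, and the projected features are assumed to have been computed already by one of the dimensionality reduction methods.

```python
def nearest_neighbor(train_Z, train_labels, z):
    """Return the label of the training sample closest to z (squared Euclidean distance)."""
    best = min(range(len(train_Z)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_Z[i], z)))
    return train_labels[best]

def confusion_matrix(actual, predicted, k):
    """M[r][c] counts samples of actual class r+1 predicted as class c+1 (classes 1..k)."""
    M = [[0] * k for _ in range(k)]
    for a, p in zip(actual, predicted):
        M[a - 1][p - 1] += 1
    return M

def accuracy(M):
    """Accuracy = trace(M) / n, where n is the total number of samples."""
    n = sum(sum(row) for row in M)
    return sum(M[i][i] for i in range(len(M))) / n

# Hypothetical projected training and testing features (d = 1) with two classes.
train_Z = [[0.0], [1.0], [5.0], [6.0]]
train_labels = [1, 1, 2, 2]
test_Z = [[0.4], [5.5], [2.9], [3.2]]
test_labels = [1, 2, 2, 2]

predicted = [nearest_neighbor(train_Z, train_labels, z) for z in test_Z]
M = confusion_matrix(test_labels, predicted, k=2)
acc = accuracy(M)
```

In this toy example the third test sample is misclassified, so three of the four test samples fall on the diagonal of M.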

Figure 6: Boxplot of data representation in the Mid-level gesture set. The linked lines across dimensions indicate the average accuracy over ten different runs. (Vertical axis: Accuracy; horizontal axis: Dimension, 1 to 17; methods: MEN, LDA, PCA.)

4.2. Event Performance

The results are displayed quantitatively for different subjects in Fig. 7 using the event measures proposed by Ward [34][37], which provide an objective, unambiguous method of scoring event recognition, include event merges and fragmentation in the error summary, and account for the timing error as a separate category. True positive (TP), true negative (TN), overfill/underfill (O/U), insertion (I), fragmentation/deletion (F/D), and substitution (S) are stacked in a column to illustrate the percentage of the various measures, and a serious error line indicates the separation of event errors from TP and TN.

Three categories of segment scores for event scoring are utilized in the experiments. The first category includes event errors such as insertion, deletion, merge, and fragmentation. The second category includes event correctness such as TP and TN. The third category includes event timing such as underfill (e.g., delay, shortening) and overfill (e.g., preemption, prolongation). The segment scores are further divided into two kinds of errors, negative and positive. Negative errors include deletion, underfill, and fragmentation; positive errors include insertion, overfill, and merge. The SVM classifiers achieve lower event errors than the 1-NN and 3-NN classifiers for both datasets except for subject 3 of the Locomotion dataset. As observed in Fig. 7, there is a large portion of TN in each column because the Null activity occupies nearly 4/5 of the samples in the mid-level gesture dataset.

Figure 7: Percentage of various measures on the Locomotion and the Mid-level gesture datasets. Each group of three columns denotes the accuracy of 1-NN, 3-NN, and SVM, respectively. TP: true positive, TN: true negative, O/U: overfill or underfill, I: insertion, F/D: fragmentation or deletion, S: substitution. (Two stacked-bar panels, Locomotion and Gestures; vertical axis: Percentage of various measures; horizontal axis: Different Classifiers grouped by Subjects, Subject 1 to Subject 4.)

The positive predictive value (Precision), sensitivity (Recall), and F1-scores are utilized to evaluate the system. They are computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Precision is the fraction of retrieved instances that are relevant and is defined as Precision = TP/(TP + FP). Recall is the fraction of relevant instances that are retrieved and is defined as Recall = TP/(TP + FN). There is typically a trade-off between precision and recall, in which precision is lost in return for a gain in recall. The balanced F-score, the harmonic mean of precision and recall, combines the two measures. Furthermore, the per-class F-scores can be weighted according to the sample proportion of each class to counter the class imbalance. The weighted F1-score is defined as the sum of the per-class harmonic means of precision and recall, weighted by class proportion:

$$F_1 = \sum_i w_i \frac{2 \times \mathrm{Precision}_i \times \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i},$$

where $i$ is the class index and $w_i$ is the sample proportion of class $i$, defined as $w_i = n_i / n$ with $n_i$ being the number of samples of the $i$th class.
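The weighted F1-score defined above can be sketched directly from a confusion matrix. This is a minimal illustration in plain Python; the function name `weighted_f1` is hypothetical, rows are taken as actual classes and columns as predicted classes, and returning 0 for a class with an undefined precision or recall is an assumption of this sketch rather than something stated in the text.

```python
def weighted_f1(M):
    """Weighted F1 = sum_i w_i * 2 * P_i * R_i / (P_i + R_i), with w_i = n_i / n.

    M[r][c] counts samples of actual class r predicted as class c.
    """
    k = len(M)
    n = sum(sum(row) for row in M)
    total = 0.0
    for i in range(k):
        tp = M[i][i]
        fp = sum(M[r][i] for r in range(k)) - tp   # predicted as i, actually another class
        fn = sum(M[i]) - tp                        # actually i, predicted as another class
        # Assumption: undefined ratios (zero denominators) are scored as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += (sum(M[i]) / n) * f1              # weight by class proportion n_i / n
    return total

# Toy two-class confusion matrix: 1 sample of class 1, 3 samples of class 2.
M = [[1, 0],
     [1, 2]]
score = weighted_f1(M)
```

For this toy matrix the per-class F1 values are 2/3 and 0.8 with weights 1/4 and 3/4, so the weighted score is 1/6 + 0.6.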

Figure 8: Accuracy of the different classifiers grouped by subjects in the Locomotion and Mid-level gesture datasets. (Two panels, Locomotion and Gesture; vertical axis: Accuracy (%); horizontal axis: subjects S1 to S4; classifiers: 1NN, 3NN, SVM.)

In Fig. 8, the accuracy rates of the locomotion and the mid-level gesture dataset are illustrated for different classiﬁers and grouped by subjects. In Figs. 9 and 10, 15

Precision

(%)

(%)

Recall 92 90 88 86 84 82 80 78 76 74

1NN 3NN SVM

S1

S2

S3

92 90 88 86 84 82 80 78 76 74 72 70

S4

1NN 3NN SVM

S1

Locomotion

S2

1NN 3NN SVM

S2

S4

F_null

(%)

(%)

F_act 92 90 88 86 84 82 80 78 76 74 72 S1

S3

Locomotion

S3

S4

92 90 88 86 84 82 80 78 76 74 72

1NN 3NN SVM

S1

Locomotion

S2

S3

S4

Locomotion

Figure 9: Recall, Precision, and F1 -scores in the Locomotion dataset. Fact represents the weighted F1 score without the null activity, and Fnull represents the weighted F1 score with the null activity.

several measures such as recall, precision, Fact , and Fnull are illustrated, where Fact is the weighted F1 score of the confusion matrix without the Null class, and Fnull is the weighted F1 score of the confusion matrix with the Null class. As can be observed, SVM performs better by incorporating learning in classiﬁcation. The training set is divided into ﬁve subsets of equal size. One subset is tested using the classiﬁer that is trained on the remaining four subsets. Thus, each instance of the entire training set is predicted once. Therefore, the cross-validation accuracy is the percentage of data that are correctly classiﬁed. After training by cross-validation, the optimal parameter is then applied to the test samples to obtain the detection measures. 5. CONCLUSIONS In this paper, data representations of time sequences are investigated for human activity detection. Classiﬁcation information of local geometry is utilized 16

Recall

Precision

70 60

40

(%)

(%)

50 1NN

30

3NN

20

SVM

10 0 S1

S2

S3

90 80 70 60 50 40 30 20 10 0

S4

1NN 3NN SVM

S1

S2

Gesture

F_act

S4

F_null

70

90

60

89

50

88 87

40

(%)

(%)

S3 Gesture

1NN

30 20

86

1NN

3NN

85

3NN

SVM

84

SVM

10

83

0

82 S1

S2

S3

S4

S1

Gesture

S2

S3

S4

Gesture

Figure 10: Recall, Precision, and F1 -scores in the Mid-level gesture dataset. Fact represents the weighted F1 score without the null activity, and Fnull represents the weighted F1 score with the null activity.

to reduce data dimensionality. Experimental studies are given to validate the approach. In future work, optimal learning of sequence information in the detection process will be investigated. [1] Q. Zhou, G. Wang, K. Jia, Q. Zhao, Learning to share latent tasks for action recognition, in: IEEE International Conference on Computer Vision (ICCV), 2013. [2] S. Si, D. Tao, M. Wang, Social image annotation via cross-domain subspace learning, Multimedia Tools and Applications 50 (3). [3] M. Wang, X. Liu, X. Wu, Visual classiﬁcation by l1-hypergraph modeling, IEEE Trans. Knowl. Data Eng. 27 (9) (2015) 2564–2574.

17

[4] K. Wang, X. Wang, L. Lin, M. Wang, W. Zuo, 3D Human Activity Recognition with Reconﬁgurable Convolutional Neural Networks, in: ACM International Conference on Multimedia (ACM MM), 2014. [5] A.-A. Samadani, A. Ghodsi, D. Kulic, Discriminative functional analysis of human movements, Pattern Recognition Letters 34 (2013) 1829–1839. ´ [6] J. Valencia-Aguirre, A. M. Alvarez-Meza, G. Daza-Santacoloma, C. D. Acosta-Medina, G. Castellanos-Dom´ınguez, Human activity recognition by class label LLE, in: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, Vol. 7441 of Lecture Notes in Computer Science, Springer, 2012, pp. 260–267. [7] J. Bian, Y. Chang, Y. Fu, W. Chen, Learning to blend vitality rankings from heterogeneous social networks, Neurocomputing 97 (2012) 390–397. [8] M. Wang, H. Li, D. Tao, K. Lu, X. Wu, Multimodal graph-based reranking for web image search, IEEE Transactions on Image Processing 21 (11) (2012) 4649–4661. [9] T. Li, T. Mei, I. Kweon, X. Hua, Contextual bag-of-words for visual categorization, IEEE Trans. Circuits Syst. Video Techn. 21 (4) (2011) 381–392. [10] F. Zheng, Z. Song, L. Shao, R. Chung, K. Jia, X. Wu, A semi-supervised approach for dimensionality reduction with distributional similarity, Neurocomputing 103 (2013) 210–221. [11] M. Wang, B. Liu, J. Tang, X.-S. Hua, Metric learning with feature decomposition for image categorization, Neurocomputing 73 (10-12) (2010) 1562– 1569. [12] J. Wang, Z.-Q. Zhao, X. Hu, Y.-M. Cheung, M. Wang, X. Wu, Online group feature selection, in: International Joint Conference on Artiﬁcial Intelligence (IJCAI), 2013. [13] Z. Liu, W. K. Ng, E. Lim, F. Li, Towards building logical views of websites, Data Knowl. Eng. 49 (2) (2004) 197–222. [14] J. Sung, C. Ponce, B. Selman, A. Saxena, Human activity detection from RGBD images, in: Association for the Advancement of Artiﬁcial Intelligence (AAAI) workshop on Pattern, Activity and Intent Recognition, 2011. 18

[15] D. Roggen, G. Tröster, P. Lukowicz, A. Ferscha, J. del R. Millán, R. Chavarriaga, Opportunistic human activity and context recognition, IEEE Computer 46 (2) (2013) 36–45.
[16] T. Li, B. Ni, M. Xu, M. Wang, Q. Gao, S. Yan, Data-driven affective filtering for images and videos, IEEE T. Cybernetics 45 (10) (2015) 2336–2349.
[17] T. Li, S. Yan, T. Mei, X. Hua, I. Kweon, Image decomposition with multilabel context: Algorithms and applications, IEEE Transactions on Image Processing 20 (8) (2011) 2301–2314.
[18] M. Wang, X.-S. Hua, J. Tang, R. Hong, Beyond distance measurement: Constructing neighborhood similarity for video annotation, IEEE Transactions on Multimedia 11 (3) (2009) 465–476.
[19] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, Y. Song, Unified video annotation via multi-graph learning, IEEE Transactions on Circuits and Systems for Video Technology 19 (5) (2009) 733–746.
[20] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, S. Yan, Crowded scene analysis: A survey, IEEE Transactions on Circuits and Systems for Video Technology 25 (3) (2015) 367–386.
[21] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, T.-S. Chua, Event driven web video summarization by tag localization and key-shot identification, IEEE Transactions on Multimedia 14 (4) (2012) 975–985.
[22] L. Li, Y. Wang, E.-P. Lim, Trust-oriented composite service selection and discovery, in: L. Baresi, C.-H. Chi, J. Suzuki (Eds.), Service-Oriented Computing, Vol. 5900 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2009, pp. 50–67.
[23] X. Zhang, H. Zhang, X. Cao, Action recognition based on spatial-temporal pyramid sparse coding, in: ICPR, IEEE, 2012, pp. 1455–1458.
[24] E. Vig, M. Dorr, D. D. Cox, Saliency-based selection of sparse descriptors for action recognition, in: ICIP, IEEE, 2012, pp. 1405–1408.
[25] R. Sivalingam, G. Somasundaram, V. Bhatawadekar, V. Morellas, N. Papanikolopoulos, Sparse representation of point trajectories for action classification, in: ICRA, IEEE, 2012, pp. 3601–3606.

[26] J. C. Niebles, H. Wang, F.-F. Li, Unsupervised learning of human action categories using spatial-temporal words, International Journal of Computer Vision 79 (3) (2008) 299–318.
[27] T. Plötz, N. Y. Hammerla, P. Olivier, Feature learning for activity recognition in ubiquitous computing, in: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11, AAAI Press, 2011, pp. 1729–1734.
[28] A. Manzoor, C. Villalonga, A. Calatroni, H.-L. Truong, D. Roggen, S. Dustdar, G. Tröster, Identifying important action primitives for high level activity recognition, in: P. Lukowicz, K. Kunze, G. Kortuem (Eds.), Smart Sensing and Context, Vol. 6446 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2010, pp. 149–162.
[29] T. Plötz, N. Y. Hammerla, A. Rozga, A. Reavis, N. Call, G. D. Abowd, Automatic assessment of problem behavior in individuals with developmental disabilities, in: Proceedings of the 2012 ACM Conference on Ubiquitous Computing, UbiComp '12, ACM, New York, NY, USA, 2012, pp. 391–400.
[30] D. Gordon, J. Czerny, M. Beigl, Activity recognition for creatures of habit, Personal and Ubiquitous Computing (2013) 1–17.
[31] T. Zhou, D. Tao, X. Wu, Manifold elastic net: a unified framework for sparse dimension reduction, Data Min. Knowl. Discov. 22 (3) (2011) 340–371.
[32] I. Jolliffe, Principal component analysis, second edition, New York: Springer-Verlag, 2002.
[33] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals of Statistics 32 (2004) 407–499.
[34] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. del R. Millán, D. Roggen, The opportunity challenge: A benchmark database for on-body sensor-based activity recognition, Pattern Recognition Letters.
[35] A. Frank, A. Asuncion, UCI machine learning repository (2010). URL http://archive.ics.uci.edu/ml
[36] L. Van der Maaten, E. Postma, H. van den Herik, Matlab toolbox for dimensionality reduction, MICC, Maastricht University.

[37] J. A. Ward, P. Lukowicz, H.-W. Gellersen, Performance metrics for activity recognition, ACM TIST 2 (1) (2011) 6.
