Accepted Manuscript

deepGesture: Deep Learning-based Gesture Recognition Scheme using Motion Sensors

Ji-Hae Kim, Gwang-Soo Hong, Byung-Gyu Kim, Debi P. Dogra

To appear in: Displays
DOI: https://doi.org/10.1016/j.displa.2018.08.001
Received 26 December 2017; revised 24 February 2018; accepted 23 August 2018
deepGesture: Deep Learning-based Gesture Recognition Scheme using Motion Sensors

Ji-Hae Kim¹, Gwang-Soo Hong², Byung-Gyu Kim¹*, Debi P. Dogra³

¹ Department of IT Engineering, Sookmyung Women's University, Seoul, Rep. of Korea
² Department of Computer Engineering, SunMoon University, A-san, Rep. of Korea
³ School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India
Abstract

Recent advancements in smartphone and sensor technology have promoted research in gesture recognition and made it easier to design efficient gesture interfaces. However, human activity recognition (HAR) through gestures is not trivial, since each person may pose the same gesture differently. In this paper, we propose the deepGesture algorithm, a new arm-gesture recognition method based on gyroscope and accelerometer sensors using deep convolutional and recurrent neural networks. The method uses four deep convolution layers to automate feature learning from raw sensor data. The features from the convolution layers are used as input to gated recurrent unit (GRU) layers, a state-of-the-art recurrent neural network (RNN) structure, to capture long-term dependencies and model sequential data. The input data of the proposed algorithm is motion sequence data extracted using a wrist-type smart band equipped with gyroscope and accelerometer sensors. The data is first segmented into fixed-length segments; the segmented data is labeled to construct the database, and the labeled data is then used by our learning algorithm. To verify the applicability of the algorithm, several experiments have been performed to measure the accuracy of gesture classification. Compared to a human activity recognition method, our experimental results show that the proposed deepGesture algorithm can increase the average F1-score for recognition of nine defined arm gestures by 6%.

Keywords: Human Activity Recognition, Wearable Sensors, Deep Learning, GRU, Neural Network.

* Corresponding author. Email: [email protected]
1. Introduction

People communicate quickly and easily with each other through actions of body parts in daily life (e.g., simple hand gestures such as greetings and OK signs, or complicated gestures such as those involved in cooking). Communicating through gestures is convenient because it is more expressive and intuitive than other interaction methods. Recently, with the advancement of wearable devices such as smart bands, equipped with sensors capable of detecting biometric signals, phase, and position, gesture-based recognition technology has become very popular.
Gesture recognition can be classified into touch-based and touchless approaches. Touch-based gesture recognition uses posture and motion information obtained by attaching a sensor or device to a part of the user's body. Touchless gesture recognition mainly acquires human motion information through visual analysis.
Wearable devices capable of touch-based gesture recognition using gyroscopes and accelerometers have been developed for detecting emergency situations, such as falls among the elderly. In addition, motion sensors are used in games on smartphones and consoles such as the Nintendo Wii. Touchless gesture recognition is used for motion control enabled by visual motion recognition sensors in games (e.g., Xtion), smartphone gesture functions based on left or right swiping, and smart TV interfaces. Although gesture recognition can be highly complex due to varying features, behavior patterns, and differences among individual users, good accuracy can still be achieved by application-specific designs.
With the availability of efficient hardware and big-data frameworks, it is now possible to handle large quantities of data without much difficulty. Deep learning has the advantage of automatically acquiring good features using general-purpose learning procedures. As a result, the performance of many algorithms has improved with the application of deep learning; it is especially widely used in speech recognition, sentence recognition, object recognition, and gesture/action recognition. Deep learning provides the following benefits for gesture recognition: it makes handling a large amount of data easy, and it does not require expertise in feature design. In addition, when sufficient data and suitable algorithms are applied, it is possible to achieve high performance by reducing the feature gaps between individual components. As a representative example, deep learning algorithms have played a key role in performance improvements in audio classification and sentence analysis.

In this paper, we propose a gesture recognition method based on gyroscope
and accelerometer sensors using convolutional and gated recurrent unit (GRU) layers. The recurrent neural network (RNN) is a type of artificial neural network in which hidden nodes are connected by directed edges to form a cyclic structure. This model is effective for gesture data with continuous values received from gyroscope and accelerometer sensors, since gesture data are time-series events, like voice or other sequential data. The GRU module is a modified RNN unit similar to long short-term memory (LSTM), but with a simpler structure. In particular, it takes less time to train because it has fewer parameters than LSTM, and it can be trained with a smaller amount of data. Based on these advantages, the contributions of this paper can be summarized as follows:

• We propose a gesture recognition algorithm based on deep convolutional and recurrent neural networks: the deep learning framework consists of convolutional and GRU layers. It automatically creates and learns features to model the relationship between the patterns of the gesture data and the defined gesture patterns.
• We demonstrate the proposed algorithm on wearable sensor data, achieving a higher recognition rate for a wide range of gesture patterns and specific motions.

The rest of this paper is organized as follows. In Section 2, we discuss various existing gesture recognition algorithms. Section 3 presents the new gesture recognition algorithm using deep learning based on data from gyroscope and accelerometer sensors. Experimental results are reported in Section 4. Concluding remarks are presented in Section 5.
2. Related works
Sensors for recognizing gestures have been reported in several studies. Liu et al. proposed a Markov-model framework for classification and showed that hand gestures can be recognized by fusing inertial and depth sensor data from two different sensing modalities; with this fusion of visual depth and inertial sensor data, robust recognition results were obtained for five kinds of hand gestures. Jing et al. proposed a gesture-based remote control system for TVs, audio equipment, and other home appliances. Their algorithm uses a finger-ring gesture controller called Magic Ring (MR), based on accelerometer sensors, to remotely control an Electric Appliance Node (EANode) by recognizing postures such as finger up, finger down, and finger rotation. Kim et al. developed a gesture recognition algorithm based on the Fast Fourier Transform (FFT) using accelerometer and electromyography (EMG) sensors; it recognizes seven word-level symbol vocabularies of German Sign Language (GSL). Based on acceleration signals and bio-signals (e.g., heart rate), Maguire et al. proposed recognizing activities such as moving up and down stairs, running, and brushing using k-Nearest Neighbor (k-NN) and J48 classifiers.
Recently, gesture recognition systems using deep learning have become popular. In particular, since it is difficult to obtain high success rates for complex patterns with existing systems due to the limitations of classifiers such as Support Vector Machine (SVM) or k-NN, the advantages of deep learning can be exploited to increase recognition rates for continuous time-series data of complicated motions.
Ordóñez et al. performed gesture recognition on data received from a multimodal wearable device using an RNN with LSTM modules and a deep convolutional network. Yang et al. focused on the HAR problem and proposed a method to automate feature extraction. Their method builds a deep convolutional neural network (CNN) to investigate time-series data with multiple channels, largely automating feature learning on an activity dataset of daily activities, such as cooking, and other commonly practiced hand gestures. In particular, hand gestures are classified into eight daily-life movement classes and three tennis movement classes, extracted by sensors with 3-axis accelerometer and 2-axis gyroscope sensors. Zeng et al. proposed recognizing activities in three public datasets, namely Skoda (assembly-line activities), Opportunity (kitchen activities), and Actitracker (jogging, walking, etc.), using a CNN based on the 3-axis accelerometer of a smartphone. Ahsan et al. introduced an ANN to detect predefined hand gestures (left, right, top, bottom) using EMG signals, which are biological signals. Since ANNs are useful for complex pattern recognition and classification tasks, they can be used to classify EMG signals, which in turn can help design efficient computer-interaction interfaces for the handicapped. Other studies on recognizing complex gestures and handwriting have also reported the use of 3-D sensors such as Leap Motion and Kinect. Although many techniques for improving sensor-based gesture recognition systems have been developed, and recognition accuracy of up to about 88% has been reported for complex gestures, further improvement is still required. This motivates us to propose and evaluate a gesture recognition algorithm for complex patterns based on gyroscope and accelerometer data using a deep learning framework.
3. Proposed Algorithm

We propose a new deep learning algorithm for recognizing gestures based on data from gyroscope and accelerometer sensors. The new model has a structure in which four convolution layers and four GRU layers are combined. The proposed deep learning structure makes the convolution filters robust and insensitive to preprocessing (which extracts a good feature map), and this has a significant impact on performance. It also achieves high performance by using GRU layers, which have already shown good performance on continuous sequential sensor data. For the input sensor data, four levels of convolution layers extract features from the sequences of a given gesture and generate a feature map. Next, four GRU layers allow long sequences to be learned by computing gradient components efficiently. This is explained in more detail in Section 3.1.
3.1. Deep Convolution and GRU Neural Network

The proposed algorithm consists of four parts, as shown in Fig. 1: an input layer composed of gyroscope and accelerometer sensor data, convolution layers for extracting feature maps, GRU layers, and a fully connected layer.

First, the input node, capable of handling 288 samples, receives input signals of shape (6, 256), consisting of a total of 6 axes (x, y, z from the 3-axis accelerometer and x, y, z from the 3-axis gyroscope) and 256 time steps. In Fig. 1, the red rectangular region in the sensor-data section corresponds to the input layer.

The second part is the convolution layers (layers 2-5). A CNN is a network
that extracts features from the original data using convolution filters to form a feature map. A feature map is an array of units containing the results of convolving the input with learned kernels. In a CNN, the kernels are optimized as part of a supervised training process to maximize the activation level. The extraction of a feature map by convolution is expressed in (1):

$$ x_{ij}^{l,d} = \sigma\!\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} w_{ijm}^{p}\, x_{(i-1)m}^{l+p,d} \right), \quad \forall d = 1, \dots, D, \qquad (1) $$
where $x_{ij}^{l,d}$ represents the feature map of the $j$th unit in layer $l$, $\sigma$ is a nonlinear function, used here as tanh (hyperbolic tangent), $b_{ij}$ is the bias value for this feature map, $m$ indexes the feature-map set of the $(i-1)$th layer, $w_{ijm}^{p}$ is the kernel value convolved over the feature map to create feature map $j$ at the next layer, and $P_i$ is the length of the convolution kernel.

The proposed algorithm has four two-dimensional convolution layers. Each layer performs a convolution with stride (1, 2) using a kernel of size (1, 3), as shown in Fig. 2. The result of convolution with stride $S_{i-1}$ is described in (2):

$$ N_i = (N_{i-1} - K)/S_{i-1}, \qquad M_i = M_{i-1} \cdot P_{i-1}, \qquad (2) $$
where $K$ is the kernel size and $M_i$ is the number of kernels, obtained by multiplying the number in the previous layer by the number of kernels $P_{i-1}$.

After the convolution operation, batch normalization is used to stabilize the training process and accelerate learning so that vanishing gradients do not occur. Ioffe and Szegedy argue that this instability is caused by internal covariate shift, i.e., the change in the distribution of inputs to each layer or activation during training. Batch normalization prevents this phenomenon by processing data in mini-batch units: the mean and standard deviation of each feature are normalized, and a new value is created using a learned scale factor and shift factor. When applied to a network, the batch normalization layer is added before the hidden layer to modify its input; the normalized value is then inserted into the activation function.
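A minimal NumPy sketch of the per-feature batch normalization step just described (the scale factor `gamma` and shift factor `beta` are the learned parameters; `eps` is the usual numerical-stability constant, an implementation detail not fixed by the text):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: array of shape (batch, features); gamma, beta: learned scale/shift.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta
```

At training time the statistics come from the current mini-batch; at inference, running averages collected during training are typically used instead.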
The third part is the recurrent GRU layers (layers 6-9). These four layers consist of GRUs, a modified form of the RNN unit. Using at least two recurrent layers allows higher-dimensional information to be captured, according to previously published results. The RNN starts from the idea of processing sequential information: unlike traditional neural networks, where all inputs and outputs are assumed to be independent of each other, the same operation is applied to every element of a sequence, so that past data can affect the next result.

Figure 3 shows the recurrent connection of the classic RNN. The recurrent structure on the right side of the figure is the unfolded form of the left structure. $x_t$ is the input value at time step $t$. $h_t$ is the hidden state at time step $t$, acting as the memory of the network; it is calculated as $h_t = f(x_t U + h_{t-1} W)$ from the hidden state $h_{t-1}$ of the previous step $(t-1)$ with its weight $W$, and the input $x_t$ of the current step with its weight $U$. The nonlinear function $f$ is usually tanh or ReLU, and each layer shares its parameter values across all time steps. $y_t$ is the output value at time step $t$. Although the RNN has been successfully applied to many natural language processing problems, training a simple RNN on long sequences suffers from the vanishing gradient problem, in which learning ability is significantly degraded because the gradient shrinks during backpropagation.
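The vanishing gradient effect described above can be illustrated numerically: backpropagating through many tanh steps multiplies the gradient by a factor $w \cdot \tanh'(\cdot)$ at every step, which drives it toward zero when these factors are below 1. A toy scalar recurrence (not the paper's network) makes this visible:

```python
import numpy as np

# Scalar RNN h_t = tanh(w * h_{t-1}); by the chain rule,
# dh_T/dh_0 = product over t of w * (1 - h_t^2).
w = 0.9
h, grad = 0.5, 1.0
grads = []
for _ in range(100):
    h = np.tanh(w * h)
    grad *= w * (1.0 - h ** 2)   # one backprop factor per time step
    grads.append(grad)
```

After 100 steps the accumulated gradient is vanishingly small, which is exactly why plain RNNs struggle to learn long-range dependencies.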
To solve this problem, extended RNN models, LSTM and GRU, have been adopted. In this paper, we solve the vanishing gradient problem by using a state-of-the-art RNN model with GRUs. The reasons for using this model, especially in comparison with LSTM, are:

• The model has a shorter training time because it has fewer parameters.

• Even with a small amount of data, training yields good performance.

The GRU is a model first used in 2014, and its structure is similar to that of LSTM. The GRU has a simpler structure, achieved by the way it calculates hidden states, and it allows long sequences to be learned well via a gating
mechanism. The GRU's formulas are given in (3):

$$ z = \sigma(x_t U^z + h_{t-1} W^z), $$
$$ r = \sigma(x_t U^r + h_{t-1} W^r), $$
$$ \tilde{h}_t = \tanh\!\left(x_t U^h + (h_{t-1} \odot r) W^h\right), $$
$$ h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t, \qquad (3) $$

where $z$ denotes the update gate and $r$ the reset gate. As shown in Fig. 4, unlike the basic RNN, the GRU has two gates: each gate uses the sigmoid function to limit the values of its vector to between 0 and 1, and then passes them through an element-wise multiplication with another vector. The reset gate determines how the new input is merged with the previous memory, and the update gate decides how much of the previous memory to keep. The basic RNN model corresponds to a GRU in which all the reset gates and all the update gates are set to 1. In this
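The update equations in (3) can be written directly as a single NumPy time step; the weight shapes follow the $x_t U + h_{t-1} W$ convention used above (a sketch, not the paper's implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Uz, Wz, Ur, Wr, Uh, Wh):
    """One GRU time step, following eq. (3)."""
    z = sigmoid(x_t @ Uz + h_prev @ Wz)              # update gate
    r = sigmoid(x_t @ Ur + h_prev @ Wr)              # reset gate
    h_tilde = np.tanh(x_t @ Uh + (h_prev * r) @ Wh)  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde          # interpolate old/new state
```

Setting $z = 1$ everywhere recovers the vanilla RNN update, which is the reduction noted in the text.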
study, we use four GRU layers to capture higher-level information. Finally, the fully connected layer determines the final output value through a softmax, making the output of each neuron comparable with the outputs of the other neurons.

3.2. Model Implementation and Training

In this paper, motion data from gyroscope and accelerometer sensors are used
for training the proposed model. The data and the training process of the model are shown in Fig. 5. First, the wearable sensor is worn on the wrist, and the motion of a specific gesture is performed repeatedly to collect and store the 6-axis sequence data in a CSV file. Second, the onset of gesture activation is identified by detecting peaks in the measured signals; the identified signal segment is then selected as the data of interest and divided into fixed-length sequences as input samples (256 samples per sequence have been used in the experiments). Third, the noise in the data is reduced by low-pass filtering. Finally, the filtered data is labeled and stored as training data. Section 3.2.1 describes the gesture sensor data, Section 3.2.2 the filtering process that turns the collected data into training data, and Section 3.2.3 the training of the model on the filtered data.
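The preprocessing steps above (activation detection, fixed-length segmentation, and low-pass smoothing) can be sketched roughly as follows; the sliding-window variance detector and the Savitzky-Golay parameters (window 9, order 3) are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(signal, seg_len=256, win=32):
    """signal: (T, 6) array of 6-axis accelerometer/gyroscope samples.

    Returns one smoothed (seg_len, 6) segment centred on the most active
    region, located by a sliding-window variance peak.
    """
    # 1) activation detection: variance of each window across all axes
    energy = np.array([signal[i:i + win].var()
                       for i in range(len(signal) - win)])
    peak = int(np.argmax(energy)) + win // 2
    # 2) fixed-length segmentation around the activation peak
    start = max(0, min(peak - seg_len // 2, len(signal) - seg_len))
    segment = signal[start:start + seg_len]
    # 3) low-pass smoothing with a Savitzky-Golay filter, per axis
    return savgol_filter(segment, window_length=9, polyorder=3, axis=0)
```

In the actual pipeline, each smoothed segment would then be paired with its gesture label and appended to the training set.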
3.2.1. Gyroscope and Accelerometer sensor data

This section describes the gesture data used to train the proposed model. Sensor data for gesture recognition have been collected using a wrist-type wearable sensor integrating a 3-axis accelerometer and a 3-axis gyroscope. Figure 6 shows the sensor board used to obtain input data in this study. The BMI055 sensor is indicated by the red rectangle in the figure; it measures the three orthogonal (x, y, z) data components from a 3-axis accelerometer and the three orthogonal (x, y, z) data components from a 3-axis gyroscope. Therefore, 6-axis time-sequential sensor data are used as input data.

3.2.2. Filtering of Sensor data

Since the sensor data is a sequence, we detect gesture activation before recognition by processing the data. To confirm activation, peaks in the variance of the accelerometer and gyroscope data are identified. After identifying the peaks, the sequence is searched in a sliding-window manner, and the minimum and maximum values are detected; the smallest difference is taken as the start point, and the largest one is assumed to be the end point. The captured sequence is segmented into sequences of 256 samples. Each segmented 6-axis sensor sequence is matched with the corresponding motion label, and then it
becomes an input to our framework.

However, if the deep learning model is trained without filtering, learning performance may be degraded due to noise caused by environmental factors and individual patterns. Therefore, the segmented data is low-pass filtered to remove the noise. For this, a Savitzky-Golay filter is employed. At each point, it finds the k-th order polynomial that best fits the adjacent data points using the least-squares method, and replaces the current data value with the value of that polynomial. This filter preserves the width and the maximum and minimum points of the peaks in the given data. The filter is illustrated in Fig. 7: the new value for $x_i$ is obtained by fitting a polynomial $s_i$ to the $2n + 1$ points around $x_i$ and replacing $x_i$ with $s_i(x_i)$. In this way, we obtain a smoothed version of the captured sequence.

3.2.3. Model Training
The proposed model has been implemented in Python and trained using the Keras framework, with Theano and TensorFlow as back-ends. Model training and classification run on an NVIDIA GeForce GTX 1070 graphics card with 1920 cores, a clock speed of 1506 MHz, and 8 GB of memory.
The input signal of the model consists of 256-step sequences of 6-axis data from the accelerometer and the gyroscope, with labels saved per motion as discussed earlier. The model is trained in a supervised way with a learning rate of 0.001. Network parameters are optimized by minimizing the cross-entropy loss function using mini-batch gradient descent with a batch size of 128. First, the four convolution layers (layers 2-5) perform convolutions with a kernel of size (1, 3). Layers 6-9 are the GRU layers; 288 input nodes are connected from the output of the previous convolution layer, and 256 hidden nodes exist in each GRU layer.

In order to increase efficiency and overcome the problem of overfitting, we apply the dropout technique for regularization, in addition to batch normalization, at the last convolution layer. During training, parts of the network are randomly skipped with probability p = 0.25. This also helps avoid co-adaptation phenomena, in which weights become mutually dependent through the training data.
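The training setup in this section can be sketched in Keras roughly as follows. The filter count (16), the way the six axes are folded into the GRU feature dimension, and the use of plain SGD for the mini-batch gradient descent are assumptions not fixed by the text; note also that Keras's 'valid' convolution adds 1 to the size given by eq. (2), so the time axis shrinks 256 → 127 → 63 → 31 → 15:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_deepgesture(n_classes=9, n_filters=16):
    """Sketch of the structure: input, 4 conv, 4 GRU, dense softmax."""
    model = keras.Sequential([keras.Input(shape=(6, 256, 1))])
    for _ in range(4):
        # (1, 3) kernels with stride (1, 2) slide along the time axis only
        model.add(layers.Conv2D(n_filters, (1, 3), strides=(1, 2),
                                activation='tanh'))
        model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.25))  # regularization after the last conv layer
    # fold the 6 axes into the feature dimension: (6, T, F) -> (T, 6 * F)
    model.add(layers.Permute((2, 1, 3)))
    model.add(layers.Reshape((-1, 6 * n_filters)))
    for _ in range(3):
        model.add(layers.GRU(256, return_sequences=True))
    model.add(layers.GRU(256))
    model.add(layers.Dense(n_classes, activation='softmax'))
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.001),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```

Training would then call `model.fit(x, y, batch_size=128)` on the labeled, filtered segments described above.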
4. Experimental results

We define a set of arm gestures to evaluate the proposed algorithm, and compare its performance and results with a human activity recognition algorithm on the same dataset.

4.1. Description of Dataset
For the experiments, we have defined nine gestures, shown in Fig. 8, based on the Microsoft MSR dataset: stand-by (A), clockwise-draw circle (B), counterclockwise-draw circle (C), upper-right-draw X (D), upper-left-draw X (E), straight-draw from left to right (F), straight-draw from right to left (G), straight-draw from bottom to top (H), and straight-draw from top to bottom (I).

This dataset has been collected by wearing the wrist-type wearable band shown in Fig. 5, which carries the sensor board with accelerometer and gyroscope sensors shown in Fig. 6. This sensor board, which includes the BMI055 sensor, is embedded in the same kind of wearable wristband used for training, as described in Section 3.2.1. The 6-axis signals are received over Bluetooth by the platform with an NVIDIA graphics processing unit and stored as comma-separated value (CSV) files: about 180,000 sets in total, composed of 20,000 sets for each of the 9 gestures. Ten subjects were asked to perform the gestures at different speeds, timed to last between 1 and 3 seconds; each subject performed each of the nine gestures 200 times, and data collection lasted 10 days. The dataset is divided into a training set (80%) and a test set (20%) for learning and validation, respectively.

4.2. Results and Discussion
The recognition performance for the nine gestures is compared in Table 1, which shows the precision, recall, and F1-score of the proposed algorithm and the human activity recognition algorithm. Precision is defined using True Positives (TP) and False Positives (FP) as

$$ \text{Precision} = \frac{TP}{TP + FP}. $$

This is the ratio of actual positives among the values predicted as positive; that is, it indicates how many of the detection results are actually correct. Recall is defined using True Positives (TP) and False Negatives (FN) as

$$ \text{Recall} = \frac{TP}{TP + FN}. $$

This indicates how many of the actual positive results were not missed.
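Both measures, together with their harmonic mean (the F1-score used in Table 1), can be computed per class directly from a confusion matrix; a minimal sketch:

```python
import numpy as np

def per_class_metrics(conf):
    """conf[i, j] counts samples of true class i predicted as class j."""
    conf = conf.astype(float)
    tp = np.diag(conf)                 # correct predictions per class
    precision = tp / conf.sum(axis=0)  # TP / (TP + FP), per predicted column
    recall = tp / conf.sum(axis=1)     # TP / (TP + FN), per true row
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Applied to the matrices in Tables 2 and 3, this reproduces the per-gesture scores reported in Table 1.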
The F1-score is calculated as the harmonic mean of these two measures, as given in (4):

$$ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \qquad (4) $$

It weights the correct classification of each class equally, treating precision and recall as equally important, so it can be used as a normalized measure of performance on natural and irregular human activity datasets.

The results show that the mean F1-score is improved by 6% over the human activity recognition algorithm. In the counterclockwise-draw circle gesture (B),
the human activity recognition algorithm shows a relatively low precision of about 0.79, whereas the proposed algorithm achieves a precision of 0.93, an improvement of about 14%. In the human activity recognition algorithm, the clockwise-draw circle (C), upper-left-draw X (E), and straight-draw from right to left (G) gestures have a relatively low recall score of about 0.88, while the proposed algorithm yields an average of 0.95. For the upper-right-draw X (D), the human activity recognition algorithm shows a good precision of 0.93634, while the proposed algorithm produces a precision of 0.97143, about 0.04 higher.

In Tables 2 and 3, the confusion matrices are presented, where the horizontal
axis represents the predicted label and the vertical axis represents the true label. The confusion matrix shows the proportions of correct predictions and classification errors when the data is applied to each algorithm. Table 2 shows the confusion matrix for the human activity recognition algorithm, and Table 3 shows the confusion matrix for our proposed algorithm. Looking at the whole matrix, the values on the main diagonal, running from top-left to bottom-right, correspond to correct predictions. We can see that our algorithm has larger diagonal values than the human activity recognition algorithm for all nine gestures. For misclassifications such as A to G, E to I, or F to I, comparison of the results in Tables 2 and 3 shows that our algorithm reduces the errors from 0.022 down to 0, indicating that classification error can be further reduced by our algorithm. As a result, we observe a significant improvement in recognition performance. However, continuous and complex patterns such as the Opportunity and Skoda datasets have not been tested yet.
Future studies will focus on improving the gesture recognition rate by adding the latest classification algorithms or by changing the network configuration, such as the kernel size. We will also explore advanced algorithms that can achieve high performance on more complex and diverse gestures (such as physical activity and cooking). If recognition performance is improved for complex movements, gesture recognition through a wrist-type wearable band can be applied to various fields such as games and health care, by exchanging motion information in real time with other display technologies such as smart TVs and Virtual Reality (VR).
5. Conclusions
In this paper, we have proposed an efficient arm gesture recognition algorithm consisting of convolution layers and recurrent GRU layers, using input data acquired from a wrist-type wearable sensor equipped with gyroscope and accelerometer sensors. The proposed deepGesture algorithm extracts a higher-dimensional feature map with four deep convolutional layers and then uses four recurrent GRU layers to obtain more efficient and accurate gesture recognition on sequential sensor data. We have compared the proposed algorithm with a human activity recognition algorithm that combines deep convolutional and recurrent (LSTM) layers, and have achieved satisfactorily high accuracy. In our experiments, the mean F1-score increases by more than 6% for the nine defined gestures, and by 9% for the counterclockwise-draw circle. The confusion matrix also shows that the per-class prediction accuracy has improved by 6%. Our model has produced promising results for the nine defined patterns; however, it will be more meaningful to extend it to more complex and versatile gestures (e.g., interaction with smart TVs, games, etc.) in future work.
6. References

[1] A. Pantelopoulos, N.G. Bourbakis, A survey on wearable sensor-based systems for health monitoring and prognosis, IEEE Trans. Syst., Man, Cybern. Part C: Appl. Rev. 40 (2010) 1-12.
[2] T. Schlömer, B. Poppinga, N. Henze, S. Boll, Gesture recognition with a Wii controller, in: Proc. 2nd Int. Conf. Tangible Embed. Interact. (2008) 11-14.
[3] D. Hong, W. Woo, Recent research trend of gesture-based user interfaces, Telecommun. Rev. 18 (2008) 403-413.
[4] C. Lee, G. Park, A study on touch and touchless gesture recognition technology and products, J. Packag. Cult. Des. Res. 31 (2012) 21-31.
[5] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nat. 521 (2015) 436-444.
[6] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Netw. 61 (2015) 85-117.
[7] L. Deng, D. Yu, Deep learning: Methods and applications, Found. Trends Signal Process. 7 (2014) 197-387.
[8] O. Abdel-Hamid, L. Deng, D. Yu, Exploring convolutional neural network structures and optimization for speech recognition, Interspeech (2013) 3366-3370.
[9] E. Arisoy, T. Sainath, B. Kingsbury, B. Ramabhadran, Deep neural network language models, in: Proc. NAACL-HLT Workshop, Assoc. Comput. Linguist. (ACL) (2012) 20-28.
[10] V. Nair, G. Hinton, 3D object recognition with deep belief nets, Adv. Neural Inf. Process. Syst. (2009) 1339-1347.
[11] J. Wang, Y. Chen, S. Hao, X. Peng, L. Hu, Deep learning for sensor-based activity recognition: A survey, arXiv:1707 (2017).
[12] S. Mukherjee, R. Saini, P. Kumar, P.P. Roy, D.P. Dogra, B.G. Kim, Fight detection in hockey videos using deep network, J. Multimed. Inf. Syst. 4 (2017) 225-232.
[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735-1780.
[14] H. Lee, P. Pham, Y. Largman, A. Ng, Unsupervised feature learning for audio classification using convolutional deep belief networks, Adv. Neural Inf. Process. Syst. (2008) 1096-1104.
[15] Y. Kim, Convolutional neural networks for sentence classification, arXiv:1408 (2017).
[16] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv:1412 (2014).
[17] K. Liu, C. Chen, R. Jafari, N. Kehtarnavaz, Fusion of inertial and depth sensor data for robust hand gesture recognition, IEEE Sens. J. 14 (2014) 1898-1903.
[18] L. Jing, K. Yamagishi, J. Wang, Y. Zho, T. Huang, Z. Cheng, A unified method for multiple home appliances control through static finger gestures, IEEE/IPSJ 11th Int. Symp. Appl. Internet (SAINT) (2011) 82-90.
[19] J. Kim, J. Wagner, M. Rehm, E. André, Bi-channel sensor fusion for automatic sign language recognition, IEEE Int. Conf. Automat. Face Gesture Recog. (2008) 1-6.
[20] D. Maguire, R. Frisby, Comparison of feature classification algorithm for activity recognition based on accelerometer and heart rate data, 9th IT & T Conf. (2009) 11.
[21] F.J. Ordóñez, D. Roggen, Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition, Sens. 16 (2016) 115.
[22] J. Yang, M.N. Nguyen, P.P. San, X. Li, S. Krishnaswamy, Deep convolutional neural networks on multichannel time series for human activity recognition, Int. Conf. Artif. Intell. (2015) 3995-4001.
[23] R. Chavarriaga, H. Sagha, A. Calatroni, S.T. Digumarti, G. Tröster, J.D.R. Millán, D. Roggen, The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition, Pattern Recognit. Lett. 34 (2013) 2033-2042.
[24] M. Zeng, L.T. Nguyen, B. Yu, O.J. Mengshoel, J. Zhu, P. Wu, J. Zhang, Convolutional neural networks for human activity recognition using mobile sensors, 6th Int. Conf. Mob. Comput. Appl. Serv. (2014) 197-205.
[25] M.R. Ahsan, M.I. Ibrahimy, O.O. Khalifa, Electromyography (EMG) signal based hand gesture recognition using artificial neural network (ANN), 4th Int. Conf. Mechatron. (2011) 1-6.
[26] P. Kumar, R. Saini, P.P. Roy, U. Pal, A lexicon-free approach for 3D handwriting recognition using classifier combination, Pattern Recognit. Lett. 103 (2018) 1-7.
[27] P. Kumar, H. Gauba, P.P. Roy, D.P. Dogra, Coupled HMM-based multi-sensor data fusion for sign language recognition, Pattern Recognit. Lett. 86 (2017) 1-8.
[28] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: M.A. Arbib (Ed.), The Handb. of Brain Theory Neural Netw., The MIT Press, Cambridge, MA, 2002, pp. 255-258.
[29] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Int. Conf. Mach. Learn. (2015) 448-456.
[30] A. Karpathy, J. Johnson, L. Fei-Fei, Visualizing and understanding recurrent networks, arXiv:1506 (2015).
[31] A. Savitzky, M.J. Golay, Smoothing and differentiation of data by simplified least squares procedures, Anal. Chem. 36 (1964) 1627-1639.
least squares procedures, Anal. Chem. 36 (1964) 1627-1639.  W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization, arXiv 1409 (2014).  N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overﬁtting, J. Mach.
Learn. Res. 15 (2014) 1929-1958.  L. Miranda, T. Vieira, D. Martinez, T. Lewiner, A.W. Vieira, M.F. Campos, Real-time gesture recognition from depth data through key poses learning and decision forests, 25th SIBGRAPI Conf. on Graph., Patterns and Images (2012) 268-275.
 R. Zhao, J. Wang, R. Yan, K. Mao, Machine health monitoring with LSTM networks, 10th Int. Conf. Sens. Technol. (ICST) (2016) 1-6.  D.M. Powers, Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation, J. Mach. Learn. Technol. 2 (2011) 37-63.
 T. Stiefmeier, D. Roggen, G. Ogris, P. Lukowicz, G. Trster, Wearable activity tracking in car manufacturing, IEEE Pervasive Comput. Mag. 7 (2008) 42-50.  J.T. Park, H.S. Hwang, I.Y. Moon, Study of wearable smart band for a user motion recognition system, Int. J. Smart Home, 8 (2014) 33-44.
Figure 1: Architecture of the proposed algorithm.
Figure 2: Architecture of Deep Convolutional layers in the proposed algorithm.
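The convolution layers of Figure 2 slide one-dimensional kernels along the time axis of the multichannel sensor stream. A minimal NumPy sketch of one such layer, under assumed (illustrative) kernel sizes and channel counts, not the paper's trained weights:

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """Valid 1-D convolution over time followed by ReLU.
    x: (channels, time); kernels: (filters, channels, width)."""
    n_f, n_c, w = kernels.shape
    t_out = x.shape[1] - w + 1
    out = np.empty((n_f, t_out))
    for f in range(n_f):
        for t in range(t_out):
            # Correlate every input channel with this filter at offset t.
            out[f, t] = np.sum(kernels[f] * x[:, t:t + w]) + bias[f]
    return np.maximum(out, 0.0)  # ReLU non-linearity

# Hypothetical segment: 6 sensor channels (3-axis gyro + 3-axis accel),
# 128 samples; 16 filters of width 5.
rng = np.random.default_rng(2)
x = rng.normal(size=(6, 128))
k = rng.normal(scale=0.1, size=(16, 6, 5))
y = conv1d_relu(x, k, np.zeros(16))  # shape (16, 124)
```

Stacking four such layers, as the paper describes, progressively turns raw sensor samples into learned feature sequences for the recurrent stage.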
Figure 3: Structure of classical RNN.
Figure 4: Structure of GRU gate.
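The GRU gate of Figure 4 combines an update gate and a reset gate in the standard formulation of Cho et al. A minimal NumPy sketch of one GRU step; all dimensions and parameter values below are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    """One step of a gated recurrent unit (standard formulation)."""
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])              # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])              # reset gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                          # interpolated new state

# Tiny demo: 6-D sensor frame (gyro xyz + accel xyz), 8-D hidden state.
rng = np.random.default_rng(0)
n_in, n_hid = 6, 8
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in "zrh"}
U = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "zrh"}
b = {k: np.zeros(n_hid) for k in "zrh"}

h = np.zeros(n_hid)
for _ in range(10):  # run over a short random sequence
    h = gru_step(rng.normal(size=n_in), h, W, U, b)
```

Because the new state is a gated interpolation between the previous state and the candidate, gradients flow through long sequences more easily than in the classical RNN of Figure 3.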
Figure 5: Flow chart of building gesture learning data.
Figure 6: Wearable sensor board based on gyroscope and accelerometer sensors.
Figure 7: Representation of a 5-point moving polynomial Savitzky-Golay filter.
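The 5-point moving polynomial filter of Figure 7 is available as SciPy's savgol_filter. A sketch on a synthetic accelerometer-like trace (the signal values are illustrative, not the paper's dataset):

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic trace: a sine wave plus Gaussian sensor noise.
t = np.linspace(0.0, 1.0, 50)
clean = np.sin(2.0 * np.pi * t)
rng = np.random.default_rng(1)
noisy = clean + rng.normal(scale=0.05, size=t.size)

# 5-point window with a quadratic fit, matching the 5-point moving
# polynomial Savitzky-Golay filter of Figure 7.
smoothed = savgol_filter(noisy, window_length=5, polyorder=2)

# Smoothing should reduce the mean squared error against the clean signal.
mse_noisy = np.mean((noisy - clean) ** 2)
mse_smooth = np.mean((smoothed - clean) ** 2)
```

Unlike a plain moving average, the Savitzky-Golay filter preserves the height and width of local peaks, which matters when the turning points of an arm gesture carry the discriminative information.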
Figure 8: Nine-arm gestures: from top to bottom, stand-by (A), counterclockwise-draw circle (B), clockwise-draw circle (C), upper-right-draw X (D), upper-left-draw X (E), straight-draw from left to right (F), straight-draw from right to left (G), straight-draw from bottom to top (H), and straight-draw from top to bottom (I).
Table 1: Experimental results of human activity recognition and the proposed algorithm.

Gesture                     A        B        C        D        E        F        G        H        I        avg/total
human activity recognition  0.95528  0.79688  0.84645  0.93634  0.93067  0.91481  0.92872  0.89980  0.92693  0.90399
(the row of values for the proposed algorithm is not recoverable from the extracted text)
Table 2: Confusion matrix of nine-class gesture pattern for human activity recognition (HAR).

Gestures  A      B      C      D      E      F      G      H      I
A         0.940  0.014  0.006  0.000  0.004  0.006  0.014  0.000  0.000
(only the row for gesture A is recoverable from the extracted text)
Table 3: Confusion matrix of nine-class gesture pattern of the proposed algorithm.

Gestures  A      B      C      D      E      F      G      H      I
A         0.984  0.000  0.006  0.000  0.000  0.000  0.002  0.000  0.000
(only the row for gesture A is recoverable from the extracted text)
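The per-class scores of Table 1 can be read off a row-normalized confusion matrix like those in Tables 2 and 3: the diagonal entries are per-class recognition rates, and their unweighted mean gives the macro average. A small sketch with hypothetical numbers (not the paper's full nine-class matrices):

```python
import numpy as np

# Hypothetical 3-class, row-normalized confusion matrix
# (rows = true gesture, columns = predicted gesture).
cm = np.array([
    [0.94, 0.04, 0.02],
    [0.05, 0.90, 0.05],
    [0.01, 0.01, 0.98],
])

per_class = np.diag(cm)       # fraction of each gesture recognized correctly
macro_avg = per_class.mean()  # unweighted average over gesture classes
```

Off-diagonal entries show which gesture pairs the classifier confuses, e.g. cm[0, 1] is the fraction of class-A samples predicted as class B.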
List of Figures

1  Architecture of the proposed algorithm.
2  Architecture of Deep Convolutional layers in the proposed algorithm.
3  Structure of classical RNN.
4  Structure of GRU gate.
5  Flow chart of building gesture learning data.
6  Wearable sensor board based on gyroscope and accelerometer sensors.
7  Representation of a 5-point moving polynomial Savitzky-Golay filter.
8  Nine-arm gestures: from top to bottom, stand-by (A), counterclockwise-draw circle (B), clockwise-draw circle (C), upper-right-draw X (D), upper-left-draw X (E), straight-draw from left to right (F), straight-draw from right to left (G), straight-draw from bottom to top (H), and straight-draw from top to bottom (I).

List of Tables

1  Experimental results of human activity recognition and the proposed algorithm.
2  Confusion matrix of nine-class gesture pattern for human activity recognition (HAR).
3  Confusion matrix of nine-class gesture pattern of the proposed algorithm.
Field: Arm gesture classification and recognition based on motion sensors
Algorithm Description: deepGesture, a new arm gesture recognition method based on gyroscope and accelerometer sensors using deep convolution and recurrent neural networks
Implementation: Wrist-type smart band device and deep learning framework
Applications: Smart TV-based entertainment and fitness service platforms