deepGesture: Deep learning-based gesture recognition scheme using motion sensors


Accepted Manuscript

Ji-Hae Kim, Gwang-Soo Hong, Byung-Gyu Kim, Debi P. Dogra

PII: S0141-9382(17)30203-2
DOI: https://doi.org/10.1016/j.displa.2018.08.001
Reference: DISPLA 1881

To appear in: Displays

Received Date: 26 December 2017
Revised Date: 24 February 2018
Accepted Date: 23 August 2018

Please cite this article as: J-H. Kim, G-S. Hong, B-G. Kim, D.P. Dogra, deepGesture: Deep Learning-based Gesture Recognition Scheme using Motion Sensors, Displays (2018), doi: https://doi.org/10.1016/j.displa.2018.08.001

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

deepGesture: Deep Learning-based Gesture Recognition Scheme using Motion Sensors

Ji-Hae Kim¹, Gwang-Soo Hong², Byung-Gyu Kim¹*, and Debi P. Dogra³

¹ Department of IT Engineering, Sookmyung Women's University, Seoul, Rep. of Korea
² Department of Computer Engineering, SunMoon University, A-san, Rep. of Korea
³ School of Electrical Sciences, Indian Institute of Technology Bhubaneswar, India

* Corresponding author. Email address: [email protected]

Abstract

Recent advances in smartphones and sensor technology have promoted research in gesture recognition and made it easier to design efficient gesture interfaces. However, human activity recognition (HAR) through gestures is not trivial, since each person may pose the same gesture differently. In this paper, we propose the deepGesture algorithm, a new arm gesture recognition method based on gyroscope and accelerometer sensors that uses deep convolutional and recurrent neural networks. The method uses four deep convolution layers to automate feature learning from raw sensor data. The features from the convolution layers are fed into gated recurrent unit (GRU) layers, a state-of-the-art recurrent neural network (RNN) structure, to capture long-term dependencies and model sequential data. The input to the proposed algorithm is motion sequence data captured with a wrist-type smart band equipped with gyroscope and accelerometer sensors. The data is first segmented into fixed-length segments, which are then labeled to construct the database used by our learning algorithm. To verify the applicability of the algorithm, several experiments have been performed to measure the accuracy of gesture classification. Our experimental results show that, compared to a human activity recognition method, the proposed deepGesture algorithm increases the average F1-score for recognition of nine defined arm gestures by 6%.

Keywords: Human Activity Recognition, Wearable Sensors, Deep Learning, GRU, Neural Network.

1. Introduction

People communicate quickly and easily with each other through body movements in daily life (e.g., simple hand gestures such as greetings or OK signs, and complicated gestures such as those used while cooking). Communicating through gestures is convenient because it is more expressive and intuitive than other interaction methods. Recently, with the advancement of new wearable devices such as smart bands, equipped with sensors capable of detecting biometric signals, orientation, and position, gesture-based recognition technology has become very popular [1], [2].

Gesture recognition can be classified into touch-based and touchless approaches [3]. Touch-based gesture recognition uses posture and motion information obtained by attaching a sensor or device to a part of the user's body. Touchless gesture recognition mainly acquires human motion information through visual analysis.

Wearable devices capable of touch-based gesture recognition using gyroscopes and accelerometers have been developed, for example to detect emergency situations such as falls among the elderly. Touchscreen-enabled smartphones are also used in games with motion sensors, as in the Nintendo Wii. Touchless gesture recognition is used for motion control through visual motion recognition sensors in games (e.g., Xtion), for smartphone gesture functions based on left or right swiping, and for smart TV interfaces [4]. Although gesture recognition can be highly complex because of feature variability, varying behavior patterns, and differences between individual users, good accuracy can still be achieved with application-specific designs.

With the availability of efficient hardware and big-data frameworks, it is now possible to handle large quantities of data without much difficulty. Deep learning has the advantage of automatically learning good features using general-purpose learning procedures [5]. Therefore, the performance of many algorithms has been improved by applying deep learning [6], [7]. It is widely used in speech recognition [8], sentence recognition [9], object recognition [10], and gesture/action recognition [11], [12].

Deep learning provides the following benefits for gesture recognition: it makes handling a large amount of data easy, and it does not require expertise in feature design. In addition, when sufficient data and suitable algorithms are applied, it is possible to achieve high performance by reducing the feature gaps between the individual components [13]. As representative examples, deep learning played a key role in improving the performance of audio classification [14] and sentence analysis [15].

In this paper, we propose a gesture recognition method based on gyroscope and accelerometer sensors using convolutional and gated recurrent unit (GRU) layers. The recurrent neural network (RNN) is a type of artificial neural network in which hidden nodes are connected by directional edges to form a recurrent structure [5]. This model is effective for gesture data consisting of continuous values received from gyroscope and accelerometer sensors, since gestures are time-series events like voice or other sequential data. The GRU, a modified module within the RNN family, is similar to long short-term memory (LSTM) but has a simpler structure [16]. In particular, it takes less time to train because it has fewer parameters than LSTM, and it can be trained with a small amount of data.

Based on the aforementioned advantages, the contributions of this paper can be summarized as follows:

• We propose a gesture recognition algorithm based on deep convolutional and recurrent neural networks: the deep learning framework consists of convolutional and GRU layers. It automatically creates and learns features to model the relationship between the patterns in the gesture data and the defined gesture classes.

• We demonstrate that the proposed algorithm, using wearable sensor data, achieves a higher recognition rate for a wider range of gesture patterns and specific motions.

The rest of this paper is organized as follows: In Section 2, we discuss various existing gesture recognition algorithms. Section 3 presents a new gesture recognition algorithm using deep learning based on data from gyroscope and accelerometer sensors. Experimental results are reported in Section 4. Concluding remarks are presented in Section 5.

2. Related works

Sensors for recognizing gestures have been reported in several studies. Liu et al. proposed a Markov model framework for classification and showed that hand gestures can be recognized by fusing inertial and depth sensor data from two different sensing modalities [17]. With this fusion of visual depth and inertial sensor data, robust recognition results were obtained for five kinds of hand gestures. Jing et al. proposed a gesture-based remote control system to control TVs, audio equipment, and other home appliances [18]. Their system uses a finger-ring gesture controller called Magic Ring (MR), based on accelerometer sensors, to remotely control an Electric Appliance Node (EANode) by recognizing postures such as finger up, finger down, and finger rotation. Kim et al. developed a gesture recognition algorithm based on the Fast Fourier Transform (FFT) using accelerometer and electromyography (EMG) sensors [19]; it recognizes seven word-level sign vocabularies of German Sign Language (GSL). Based on acceleration signals and bio-signals (e.g., heart rate), Maguire et al. proposed recognition of activities such as walking up and down stairs, running, and brushing using k-Nearest Neighbor (k-NN) and J48 classifiers [20].

Recently, gesture recognition systems using deep learning have become popular. In particular, since it is difficult to obtain high success rates for complex patterns with existing systems due to the limitations of classifiers such as Support Vector Machines (SVM) or k-NN, the advantages of deep learning can be exploited to increase recognition rates for continuous time-series data of complicated movements.

Ordóñez et al. performed gesture recognition on data received from a multimodal wearable device using an RNN with LSTM modules and a deep convolutional network [21]. Yang et al. focused on the HAR problem and proposed a method to automate feature extraction [22]. Their method builds a Convolutional Neural Network (CNN) structure to investigate time-series data with multiple channels. It largely automates feature learning on an activity dataset that defines daily activities such as cooking and other commonly practiced hand gestures using a deep convolutional network [23]. In particular, hand gestures are classified into eight daily-life movement classes and three tennis movement classes, extracted by sensors with a 3-axis accelerometer and a 2-axis gyroscope. Zeng et al. proposed to recognize three public datasets, namely Skoda (assembly-line activities), Opportunity (activities in a kitchen), and Actitracker (jogging, walking, etc.), using a CNN based on the 3-axis accelerometer of a smartphone [24]. Ahsan et al. introduced an ANN to detect predefined hand gestures (left, right, top, bottom) using EMG signals, which are biological signals [25]. Since ANNs are useful for complex pattern recognition and classification tasks, they can be used to classify EMG signals, which in turn can be used to design efficient computer-interaction interfaces for the handicapped. Other studies on recognizing complex gestures and handwriting have also reported the use of 3D sensors such as the Leap Motion and Kinect [26], [27].

Although many techniques for improving sensor-based gesture recognition systems have been developed, and recognition accuracy of up to about 88% has been reported for complex gestures, further improvement is still required [21]. Therefore, we are motivated to propose and evaluate a gesture recognition algorithm for complex patterns based on gyroscope and accelerometer data using a deep learning framework.

3. Proposed Algorithm

We propose a new deep learning algorithm for recognizing gestures based on data from gyroscope and accelerometer sensors. The new model has a structure in which four convolution layers and GRU layers are combined. The proposed deep learning structure makes the convolution filters robust and insensitive to pre-processing (they extract a good feature map), which has a significant impact on performance. It also achieves high performance by using GRU layers, which have already shown good performance on continuous sequential sensor data [16], [22]. For the input sensor data, four levels of convolution layers extract features from the sequences of a given gesture and generate a feature map. Next, four GRU layers allow long sequences to be learned by computing gradient components efficiently. This is explained in more detail in Section 3.1.

3.1. Deep Convolution and GRU Neural Network

The proposed algorithm consists of four parts, as shown in Fig. 1: an input layer composed of gyroscope and accelerometer sensor data, convolution layers for extracting feature maps, GRU layers, and a fully connected layer.

First, the input node, capable of handling 288 samples, receives input signals of shape (6, 256), consisting of six axes (x, y, z from the 3D accelerometer and x, y, z from the 3D gyroscope) and 256 time steps. In Fig. 1, the red rectangle in the sensor data section corresponds to the input layer.

The second part is the convolution layers (layers 2-5). The CNN is a network that extracts features from the original data using convolution filters to form feature maps [28]. A feature map is an array of units consisting of convolution results obtained through learned kernels. In a CNN, the kernels are optimized as part of a supervised training process to maximize the activation level [21]. Extraction of a feature map using convolution operations is expressed in (1):

x_{ij}^{l,d} = \sigma\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} w_{ijm}^{p} \, x_{(i-1)m}^{l+p,d} \right), \quad \forall d = 1, \ldots, D,    (1)

where x_{ij}^{l,d} represents the feature map of the j-th sample in layer l, \sigma is a nonlinear function (tanh, the hyperbolic tangent, is used in this paper), b_{ij} is the bias value for this feature map, and m is the index over the feature map set of the (i-1)-th layer. w_{ijm}^{p} is the kernel value convolved over the feature map to create feature map j at the next layer, and P_i is the length of the convolution kernel.

The proposed algorithm has four two-dimensional convolution layers. Each layer performs a convolution with stride (1, 2) using a kernel of size (1, 3), as shown in Fig. 2. The output size after convolution with stride S_{i-1} is described in (2):

N_i = (N_{i-1} - K) / S_{i-1}, \quad M_i = M_{i-1} \cdot P_{i-1},    (2)

where K is the kernel size and M_i is the number of feature maps, obtained by multiplying the number of maps in the previous layer by the number of kernels P_{i-1}.

After each convolution, batch normalization is used to stabilize the training process and accelerate learning so that vanishing gradients do not occur [29]. In [29], the authors argue that this instability is caused by internal covariate shift, meaning that the distribution of the inputs to each layer or activation changes during training. Batch normalization is used to prevent this phenomenon. The method processes data in mini-batch units: the mean and standard deviation are normalized for each feature, and a new value is created using a learned scale factor and shift factor [29]. When applied to a network, the batch normalization layer is added before the hidden layer to modify the input; the normalized value is then passed into the activation function.
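To make the convolutional stage concrete, the following is a minimal Keras sketch (the framework named in Section 3.2.3) of a stack of convolution layers with batch normalization as described around Eqs. (1)-(2). The number of kernels per layer, the 'valid' padding, and the helper name build_conv_stack are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch (not the authors' exact configuration): four 2-D convolution
# blocks with kernel (1, 3) and stride (1, 2), each followed by batch
# normalization and a tanh activation, matching the description of Eq. (1)-(2).
from tensorflow.keras import layers, models

def build_conv_stack(input_shape=(6, 256, 1), n_kernels=32):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))            # 6 axes x 256 samples x 1 channel
    for _ in range(4):                                     # four convolution layers (layers 2-5)
        model.add(layers.Conv2D(n_kernels, kernel_size=(1, 3),
                                strides=(1, 2), padding="valid"))
        model.add(layers.BatchNormalization())             # normalize before the activation [29]
        model.add(layers.Activation("tanh"))               # sigma in Eq. (1) is tanh
    return model

# The temporal dimension shrinks roughly as N_i = (N_{i-1} - K) / S per Eq. (2):
# 256 -> 127 -> 63 -> 31 -> 15 with K = 3 and stride 2 (Keras' floor rounding).
print(build_conv_stack().output_shape)
```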

The third part is the recurrent GRU layers (layers 6-9). These four layers consist of GRUs, a modified form of the RNN. Using at least two recurrent layers allows higher-dimensional information to be captured, according to the results presented in [30]. The RNN starts from the idea of processing sequential information, unlike traditional neural networks in which all inputs and outputs are assumed to be independent of each other [5]. The same operation is therefore applied to every element of a sequence, and past data can affect the next result.

Figure 3 shows the recurrent connection of the classic RNN. The recurrent connection on the right side of the figure is the unfolded version of the structure on the left. x_t is the input value at time step t. h_t is the hidden state at time step t and acts as the memory of the network; it is calculated as h_t = f(x_t U_t + h_{t-1} W_{t-1}) from the hidden state h_{t-1} and weight W_{t-1} of the previous step (t - 1) and the input value x_t and weight U_t of the current time step. The nonlinear function f is usually tanh or ReLU. In addition, each layer shares its parameter values across all time steps. y_t is the output value at time step t. Although the RNN has been successfully applied to many natural language processing problems, training a simple RNN on long sequences suffers from the vanishing gradient problem, in which learning ability is significantly degraded because the gradient shrinks during back-propagation.

To solve this problem, extended RNN models, LSTM and GRU, have been adopted [16]. In this paper, we address the vanishing gradient problem by using the state-of-the-art RNN model with GRUs [16]. The reasons for using this model, especially in comparison with LSTM, are:

• This model has a shorter training time because it has fewer parameters (U and W).

• Even with a small amount of data, training yields good performance.

The GRU was first used in 2014 and its structure is similar to that of LSTM [13]. The GRU computes its hidden state with a simpler structure and allows long sequences to be learned well via a gating mechanism. The GRU formulas are given in (3):

z = \sigma(x_t U^z + h_{t-1} W^z),
r = \sigma(x_t U^r + h_{t-1} W^r),
\tilde{h}_t = \tanh(x_t U^h + (h_{t-1} \odot r) W^h),
h_t = (1 - z) \odot h_{t-1} + z \odot \tilde{h}_t,    (3)

where z denotes the update gate and r the reset gate. As shown in Fig. 4, unlike the basic RNN, the GRU has two gates; each gate uses the sigmoid function to limit the values of a vector to between 0 and 1 and passes them through an element-wise multiplication with another vector. The reset gate determines how the new input is merged with the previous memory, and the update gate decides how much of the previous memory to keep. The basic RNN model has the form of a GRU model in which all the reset gates are set to 1 and the update gates are all set to zero. In this study, we use four GRU layers to obtain higher-level information. Finally, the fully connected layer determines the final output value through softmax, which makes a relative comparison with the output values of the other neurons.
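As a concrete reading of Eq. (3), the following is a small numpy sketch of a single GRU time step. The weight shapes, the omission of bias terms, and the helper name gru_step are illustrative assumptions, not code from the paper.

```python
# Minimal numpy sketch of one GRU time step, following Eq. (3). Biases omitted.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Uz, Wz, Ur, Wr, Uh, Wh):
    z = sigmoid(x_t @ Uz + h_prev @ Wz)                # update gate
    r = sigmoid(x_t @ Ur + h_prev @ Wr)                # reset gate
    h_tilde = np.tanh(x_t @ Uh + (h_prev * r) @ Wh)    # candidate state
    return (1.0 - z) * h_prev + z * h_tilde            # new hidden state h_t

# Example with a 6-dimensional input (one 6-axis sample) and 256 hidden units.
rng = np.random.default_rng(0)
x_t, h_prev = rng.standard_normal(6), np.zeros(256)
weights = [rng.standard_normal(s) * 0.01 for s in [(6, 256), (256, 256)] * 3]
h_t = gru_step(x_t, h_prev, *weights)
print(h_t.shape)  # (256,)
```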

3.2. Model Implementation and Training

In this paper, motion data from gyroscope and accelerometer sensors are used to train the proposed model. The data and the training process of the model are shown in Fig. 5. First, the wearable sensor is worn on the wrist, and the motion of a specific gesture is performed repetitively; the 6-axis sequence data is collected and stored in a csv file. Second, the onset of gesture activation is identified by recognizing peaks in the measured signals; the identified signal segment is then selected as the data of interest and divided into fixed-length sequences as input samples (256 samples per sequence have been used in the experiments). Third, the noise in the data is reduced by low-pass filtering. Finally, the filtered data is labeled and stored as training data. Section 3.2.1 describes the gesture sensor data, Section 3.2.2 the filtering process that turns the collected data into training data, and Section 3.2.3 the training of the model on the filtered data.
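The first two steps of this pipeline (loading the stored 6-axis csv data and locating the gesture activation from signal peaks) could look roughly like the sketch below. The file name, column names, rolling-variance window, and threshold rule are assumptions made for illustration; the paper does not specify these details.

```python
# Illustrative sketch of pipeline steps 1-2: load the recorded 6-axis sequence
# and find the onset of gesture activation from the peak of the short-term
# variance. Column names and the threshold are assumptions.
import numpy as np
import pandas as pd

def detect_onset(csv_path="gesture_record.csv", win=32, rel_threshold=0.5):
    cols = ["acc_x", "acc_y", "acc_z", "gyr_x", "gyr_y", "gyr_z"]
    data = pd.read_csv(csv_path)[cols].to_numpy()          # shape: (T, 6)

    # Short-term variance summed over the 6 axes, as a rough activation measure.
    var = (pd.DataFrame(data).rolling(win, min_periods=1).var()
             .fillna(0.0).sum(axis=1).to_numpy())

    peak = int(np.argmax(var))                              # strongest activation
    active = np.flatnonzero(var >= rel_threshold * var[peak])
    start, end = int(active[0]), int(active[-1])            # candidate gesture span
    return data, start, end
```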

3.2.1. Gyroscope and Accelerometer Sensor Data

This section describes the gesture data used to train the proposed model. Sensor data for gesture recognition have been collected using a wrist-type wearable sensor integrated with a 3-axis accelerometer and a 3-axis gyroscope. Figure 6 shows the sensor board used to obtain the input data in this study. The BMI055 sensor, indicated by the red rectangle in the figure, measures the three orthogonal (x, y, z) components from a 3-axis gravity accelerometer and the three orthogonal (x, y, z) components from a 3-axis gyroscope. Therefore, 6-axis time-sequential sensor data are used as input.

3.2.2. Filtering of Sensor Data

Since the sensor data form a sequence, we detect gesture activation before recognition by processing the data. To confirm activation, peaks in the variance of the accelerometer and gyroscope data are identified. After identifying the peaks, the sequence is searched in a sliding-window manner and the minimum and maximum values are detected; the smallest difference is taken as the start point and the largest as the end point. The captured sequence is segmented into sequences of 256 samples. Each segmented 6-axis sensor sequence is matched with the corresponding motion label and then becomes an input to our framework.

However, if the deep learning model is trained without filtering, the learning rate may be degraded by noise caused by environmental factors and individual patterns. Therefore, the segmented data is low-pass filtered to remove the noise. For this purpose, the Savitzky-Golay filter is employed [31]. At each point it finds the k-th order polynomial that best fits the adjacent data points using the least-squares method and replaces the current value with the value of this polynomial. This filter preserves the width and the maximum and minimum points of the peaks in the given data; its operation is illustrated in Fig. 7. That is, for the new value of x_i, a polynomial s_i is fitted using the 2n + 1 points on both sides of x_i, and x_i is replaced by s_i(x_i). In this way we obtain a smoothed version of the captured sequence.
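A minimal sketch of the segmentation and smoothing steps is shown below, using scipy's savgol_filter for the Savitzky-Golay smoothing. The window length, polynomial order, and helper name segment_and_smooth are assumptions rather than the authors' exact settings.

```python
# Illustrative sketch: cut a 256-sample window around the detected gesture span
# and smooth each of the 6 axes with a Savitzky-Golay (polynomial least-squares)
# filter [31]. Window length 15 and polynomial order 3 are assumed values.
import numpy as np
from scipy.signal import savgol_filter

SEQ_LEN = 256

def segment_and_smooth(data, start, end, window_length=15, polyorder=3):
    # Center a fixed-length window on the detected gesture span.
    center = (start + end) // 2
    lo = max(0, min(center - SEQ_LEN // 2, len(data) - SEQ_LEN))
    segment = data[lo:lo + SEQ_LEN]                     # shape: (256, 6)

    # Savitzky-Golay smoothing applied independently to each axis (axis=0: time).
    smoothed = savgol_filter(segment, window_length, polyorder, axis=0)
    return smoothed.astype(np.float32)
```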

3.2.3. Model Training

The proposed model has been implemented in Python and trained using the Keras framework, with Theano and TensorFlow as back-ends. Model training and classification run on an NVIDIA GeForce GTX 1070 graphics card with 1920 cores, a clock speed of 1506 MHz, and 8 GB of RAM.

The input signal of the model consists of 256-step sequences of 6-axis data from the accelerometer and the gyroscope, with labels stored per gesture as discussed earlier. The model is trained in a supervised way with a learning rate of 0.001. Network parameters are optimized by minimizing the cross-entropy loss function using mini-batch gradient descent with a batch size of 128. First, layers 2-5 are the four convolution layers with kernel size (1, 3). Layers 6-9 are the GRU layers; 288 input nodes are connected from the output of the previous convolution layer, and 256 hidden nodes exist in each GRU layer.

To increase efficiency and overcome overfitting, we apply the dropout technique for regularization, in addition to batch normalization, in the last convolution layer. Dropout randomly skips part of the network during training with probability p = 0.25. This also helps avoid the co-adaptation phenomenon, in which weights become mutually dependent on the training data [32], [33].
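Putting the descriptions of Sections 3.1 and 3.2.3 together, a hedged Keras sketch of the overall architecture and training setup might look as follows. The reshaping between the convolution output and the GRU input, the optimizer choice (plain SGD for "mini-batch gradient descent"), and the exact layer sizes are assumptions for illustration; the paper does not publish its code.

```python
# Hedged sketch of a deepGesture-style network: four Conv2D blocks with batch
# normalization (Section 3.1), dropout of 0.25 after the last convolution
# (Section 3.2.3), four GRU layers with 256 units, and a softmax over 9 classes.
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 9

def build_deepgesture_like(input_shape=(6, 256, 1), n_kernels=32):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(4):                                     # layers 2-5: convolutions
        model.add(layers.Conv2D(n_kernels, (1, 3), strides=(1, 2)))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("tanh"))
    model.add(layers.Dropout(0.25))                        # dropout after the last conv block
    # Turn the (6, T', n_kernels) feature map into a T'-step sequence of features.
    model.add(layers.Permute((2, 1, 3)))                   # (T', 6, n_kernels)
    model.add(layers.Reshape((-1, 6 * n_kernels)))         # (T', 6 * n_kernels)
    for _ in range(3):                                     # layers 6-8: GRU layers
        model.add(layers.GRU(256, return_sequences=True))
    model.add(layers.GRU(256))                             # layer 9: final state only
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model

model = build_deepgesture_like()
model.compile(optimizer=optimizers.SGD(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=128, epochs=50,
#           validation_data=(x_test, y_test))
```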

4. Experimental results

To evaluate the proposed algorithm, we define a set of arm gestures among human gestures and compare the performance and results with a human activity recognition algorithm [35] on the same dataset.

4.1. Description of Dataset

To experiment with the proposed algorithm, we have defined nine gestures, shown in Fig. 8, based on the Microsoft MSR gesture dataset [17], [34]: stand-by (A), clockwise-draw circle (B), counterclockwise-draw circle (C), upper-right-draw X (D), upper-left-draw X (E), straight-draw from left to right (F), straight-draw from right to left (G), straight-draw from bottom to top (H), and straight-draw from top to bottom (I).

This dataset has been collected while wearing the wrist-type wearable band shown in Fig. 5, which contains the wearable sensor board with the accelerometer and gyroscope sensors shown in Fig. 6. This sensor board, which includes the BMI055 sensor, is embedded in the same kind of wearable wrist-band used for training, as described in Section 3.2.1. The dataset consists of 6-axis signals received over a Bluetooth-based connection on a platform with an NVIDIA graphics processing unit. It is stored as comma-separated value (csv) files, in a total of about 180,000 sets composed of 20,000 sets for each of the nine gestures. Ten subjects were asked to perform the gestures at different speeds, timed to last between 1 and 3 seconds; each of the nine gestures was performed 200 times by each subject, and this collection lasted for 10 days. The dataset is divided into a training set (80%) and a test set (20%) for learning and validation, respectively.
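Assuming the labeled 256-sample windows are collected into arrays, the 80/20 split and one-hot labels described above could be prepared along the following lines; scikit-learn's train_test_split, the stratification, and the helper name prepare_splits are used purely for illustration.

```python
# Illustrative preparation of the dataset of Section 4.1:
# X holds the filtered (256, 6) windows, y holds integer gesture labels 0-8.
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

def prepare_splits(X, y, test_size=0.2, seed=42):
    X = X.transpose(0, 2, 1)[..., np.newaxis]   # (N, 256, 6) -> (N, 6, 256, 1)
    y = to_categorical(y, num_classes=9)        # one-hot labels for the 9 gestures
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y.argmax(axis=1))

# x_train, x_test, y_train, y_test = prepare_splits(X, y)
```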

4.2. Results and Discussion

The recognition performance for the nine gestures is compared in Table 1, which shows the precision, recall, and F1-score obtained by testing the proposed algorithm and the human activity recognition algorithm [35] on the nine gestures. Precision is defined using True Positives (TP) and False Positives (FP) as TP / (TP + FP); it is the ratio of actual positives among the predicted positives, i.e., it indicates how many of the detection results are actually correct. Recall is defined using True Positives (TP) and False Negatives (FN) as TP / (TP + FN); it indicates how many of the actual positives were not missed [36].

F1 is the harmonic mean of these two measures, as given in (4):

F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.    (4)

It gives equal importance to the correct classification of each class, weighting precision and recall equally, so it can be used as a normalized measure of performance on natural and irregular human activity datasets.
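For reference, the per-class precision, recall, and F1-scores of Table 1 and the row-normalized confusion matrices of Tables 2 and 3 correspond to standard scikit-learn computations such as the sketch below; the variable names and the printing format are illustrative.

```python
# Illustrative computation of the evaluation measures used in Tables 1-3:
# per-class precision, recall, F1-score, and a row-normalized confusion matrix.
from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

GESTURES = list("ABCDEFGHI")

def evaluate(y_true, y_pred):
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(range(9)))
    cm = confusion_matrix(y_true, y_pred, labels=list(range(9)))
    cm_normalized = cm / cm.sum(axis=1, keepdims=True)   # rows: true label
    for g, p, r, f in zip(GESTURES, precision, recall, f1):
        print(f"{g}: precision={p:.5f} recall={r:.5f} F1={f:.5f}")
    print(f"avg / total: F1={f1.mean():.5f}")
    return cm_normalized
```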

The results show that the mean F1-score is improved by 6% over the human activity recognition algorithm. For the counterclockwise-draw circle gesture (B), the human activity recognition algorithm shows a relatively low precision of about 0.79, whereas the proposed algorithm reaches a precision of 0.93, an improvement of about 14%. In the human activity recognition algorithm, the clockwise-draw circle (C), upper-left-draw X (E), and straight-draw from right to left (G) gestures have relatively low recall scores of about 0.88, while the proposed algorithm yields an average of about 0.95 for these gestures. For the upper-right-draw X (D), the human activity recognition algorithm shows a good precision of 0.93634, while the proposed algorithm produces a precision of 0.97143, about 0.04 higher.

In Tables 2 and 3, the confusion matrices are presented, where the horizontal axis represents the predicted label and the vertical axis represents the true label. A confusion matrix records the proportion of correct predictions and the classification errors when the data is applied to the algorithm. Table 2 shows the confusion matrix for the human activity recognition algorithm, and Table 3 shows the confusion matrix for our proposed algorithm. Looking at the whole matrix, the values on the diagonal running from top left to bottom right correspond to correct predictions [36]. Therefore, we can see that our algorithm has larger diagonal values than the human activity recognition algorithm for all nine gestures. For erroneous predictions such as A to G, E to I, or F to I, comparing Tables 2 and 3 shows that our algorithm reduces the errors from 0.022 down to 0, indicating that the error can be further reduced by our algorithm. As a result, we observe a significant improvement in recognition performance. However, continuous and complex patterns such as the Opportunity [23] and Skoda [37] datasets have not been tested yet.

Future studies will focus on improving the gesture recognition rate by adding the latest classification algorithms or by changing the network size, such as the kernel. We will also explore advanced algorithms that can achieve high performance on more complex and diverse gestures (such as physical activities and cooking). If the recognition performance for complex movements is improved, gesture recognition through a wrist-type wearable band can be applied to various fields such as games and health care by exchanging motion information in real time with other display technologies such as smart TVs and Virtual Reality (VR) [38].

5. Conclusions

In this paper, we have proposed an efficient arm gesture recognition algorithm consisting of convolution layers and GRU recurrent layers, using input data acquired from a wrist-type wearable sensor equipped with gyroscope and accelerometer sensors. The proposed deepGesture algorithm extracts a higher-dimensional feature map with four deep convolution layers and then uses four recurrent GRU layers to obtain more efficient and accurate gesture recognition on sequential sensor data. We have compared the proposed algorithm with a human activity recognition algorithm [35] that combines deep convolutional and LSTM recurrent layers, and have achieved satisfactorily high accuracy with the proposed algorithm. In the experiments, the mean F1-score increases by more than 6% for the nine defined gestures, and by 9% for the counterclockwise-draw circle. The confusion matrices also show that the prediction accuracy for each class improves by 6%. Our model has produced promising results for the nine defined patterns; however, it will be more meaningful to extend it to more complex and versatile gestures (e.g., interaction with smart TVs, games, etc.) in the future.

6. References

[1] A. Pantelopoulos, N.G. Bourbakis, A survey on wearable sensor-based systems for health monitoring and prognosis, IEEE Trans. Syst., Man, Cybern. Part C: Appl. Rev. 40 (2010) 1-12.
[2] T. Schlömer, B. Poppinga, N. Henze, S. Boll, Gesture recognition with a Wii controller, in: Proc. 2nd Int. Conf. Tangible Embed. Interact. (2008) 11-14.
[3] D. Hong, W. Woo, Recent research trend of gesture-based user interfaces, Telecommun. Rev. 18 (2008) 403-413.
[4] C. Lee, G. Park, A study on touch and touchless gesture recognition technology and products, J. Packag. Cult. Des. Res. 31 (2012) 21-31.
[5] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436-444.
[6] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Netw. 61 (2015) 85-117.
[7] L. Deng, D. Yu, Deep learning: Methods and applications, Found. Trends Signal Process. 7 (2014) 197-387.
[8] O. Abdel-Hamid, L. Deng, D. Yu, Exploring convolutional neural network structures and optimization for speech recognition, Interspeech (2013) 3366-3370.
[9] E. Arisoy, T. Sainath, B. Kingsbury, B. Ramabhadran, Deep neural network language models, in: Proc. NAACL-HLT Workshop, Assoc. Comput. Linguist. (ACL) (2012) 20-28.
[10] V. Nair, G. Hinton, 3D object recognition with deep belief nets, Adv. Neural Inf. Process. Syst. (2009) 1339-1347.
[11] J. Wang, Y. Chen, S. Hao, X. Peng, L. Hu, Deep learning for sensor-based activity recognition: A survey, arXiv:1707 (2017).
[12] S. Mukherjee, R. Saini, P. Kumar, P.P. Roy, D.P. Dogra, B.G. Kim, Fight detection in hockey videos using deep network, J. Multimed. Inf. Syst. 4 (2017) 225-232.
[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735-1780.
[14] H. Lee, P. Pham, Y. Largman, A. Ng, Unsupervised feature learning for audio classification using convolutional deep belief networks, Adv. Neural Inf. Process. Syst. (2008) 1096-1104.
[15] Y. Kim, Convolutional neural networks for sentence classification, arXiv:1408 (2014).
[16] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv:1412 (2014).
[17] K. Liu, C. Chen, R. Jafari, N. Kehtarnavaz, Fusion of inertial and depth sensor data for robust hand gesture recognition, IEEE Sens. J. 14 (2014) 1898-1903.
[18] L. Jing, K. Yamagishi, J. Wang, Y. Zho, T. Huang, Z. Cheng, A unified method for multiple home appliances control through static finger gestures, IEEE/IPSJ 11th Int. Symp. Appl. Internet (SAINT) (2011) 82-90.
[19] J. Kim, J. Wagner, M. Rehm, E. André, Bi-channel sensor fusion for automatic sign language recognition, IEEE Int. Conf. Automat. Face Gesture Recog. (2008) 1-6.
[20] D. Maguire, R. Frisby, Comparison of feature classification algorithm for activity recognition based on accelerometer and heart rate data, 9th IT & T Conf. (2009) 11.
[21] F.J. Ordóñez, D. Roggen, Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition, Sensors 16 (2016) 115.
[22] J. Yang, M.N. Nguyen, P.P. San, X. Li, S. Krishnaswamy, Deep convolutional neural networks on multichannel time series for human activity recognition, Int. Conf. Artif. Intell. (2015) 3995-4001.
[23] R. Chavarriaga, H. Sagha, A. Calatroni, S.T. Digumarti, G. Tröster, J.D.R. Millán, D. Roggen, The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition, Pattern Recognit. Lett. 34 (2013) 2033-2042.
[24] M. Zeng, L.T. Nguyen, B. Yu, O.J. Mengshoel, J. Zhu, P. Wu, J. Zhang, Convolutional neural networks for human activity recognition using mobile sensors, 6th Int. Conf. Mob. Comput. Appl. Serv. (2014) 197-205.
[25] M.R. Ahsan, M.I. Ibrahimy, O.O. Khalifa, Electromyography (EMG) signal based hand gesture recognition using artificial neural network (ANN), 4th Int. Conf. Mechatron. (2011) 1-6.
[26] P. Kumar, R. Saini, P.P. Roy, U. Pal, A lexicon-free approach for 3D handwriting recognition using classifier combination, Pattern Recognit. Lett. 103 (2018) 1-7.
[27] P. Kumar, H. Gauba, P.P. Roy, D.P. Dogra, Coupled HMM-based multi-sensor data fusion for sign language recognition, Pattern Recognit. Lett. 86 (2017) 1-8.
[28] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, The MIT Press, Cambridge, MA, 2002, pp. 255-258.
[29] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, Int. Conf. Mach. Learn. (2015) 448-456.
[30] A. Karpathy, J. Johnson, L. Fei-Fei, Visualizing and understanding recurrent networks, arXiv:1506 (2015).
[31] A. Savitzky, M.J. Golay, Smoothing and differentiation of data by simplified least squares procedures, Anal. Chem. 36 (1964) 1627-1639.
[32] W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization, arXiv:1409 (2014).
[33] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929-1958.
[34] L. Miranda, T. Vieira, D. Martinez, T. Lewiner, A.W. Vieira, M.F. Campos, Real-time gesture recognition from depth data through key poses learning and decision forests, 25th SIBGRAPI Conf. on Graphics, Patterns and Images (2012) 268-275.
[35] R. Zhao, J. Wang, R. Yan, K. Mao, Machine health monitoring with LSTM networks, 10th Int. Conf. Sens. Technol. (ICST) (2016) 1-6.
[36] D.M. Powers, Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation, J. Mach. Learn. Technol. 2 (2011) 37-63.
[37] T. Stiefmeier, D. Roggen, G. Ogris, P. Lukowicz, G. Tröster, Wearable activity tracking in car manufacturing, IEEE Pervasive Comput. Mag. 7 (2008) 42-50.
[38] J.T. Park, H.S. Hwang, I.Y. Moon, Study of wearable smart band for a user motion recognition system, Int. J. Smart Home 8 (2014) 33-44.

Figure 1: Architecture of the proposed algorithm.

Figure 2: Architecture of Deep Convolutional layers in the proposed algorithm.

Figure 3: Structure of classical RNN.

Figure 4: Structure of GRU gate.

Figure 5: Flow chart of building gesture learning data.

Figure 6: Wearable sensor board based on gyroscope and accelerometer sensors.

Figure 7: Representation of a 5-point moving polynomial Savitzky-Golay filter.

Figure 8: Nine-arm gestures: from top to bottom, stand-by (A), counterclockwise-draw circle (B), clockwise-draw circle (C), upper-right-draw X (D), upper-left-draw X (E), straight-draw from left to right (F), straight-draw from right to left (G), straight-draw from bottom to top (H), and straight-draw from top to bottom (I).

Table 1: Experimental results of human activity recognition and the proposed algorithm.

Patterns     | human activity recognition [35]     | proposed
             | precision   recall    F1-score      | precision   recall    F1-score
A            | 0.95528     0.94000   0.94758       | 0.99194     0.98400   0.98795
B            | 0.79688     0.91800   0.85316       | 0.93555     0.95800   0.94664
C            | 0.84645     0.88200   0.86386       | 0.94012     0.94200   0.94106
D            | 0.93634     0.91200   0.92401       | 0.97143     0.95200   0.96162
E            | 0.93067     0.88600   0.90779       | 0.95174     0.98600   0.96857
F            | 0.91481     0.90200   0.90836       | 0.98163     0.96200   0.97172
G            | 0.92872     0.88600   0.90686       | 0.95833     0.96600   0.96215
H            | 0.89980     0.89800   0.89890       | 0.96378     0.95800   0.96088
I            | 0.92693     0.88800   0.90705       | 0.96341     0.94800   0.95565
avg / total  | 0.90399     0.90133   0.90195       | 0.96199     0.96178   0.96180

Table 2: Confusion matrix of nine-class gesture pattern for human activity recognition (HAR) [35]. Rows are the true labels and columns are the predicted labels.

Gestures |   A      B      C      D      E      F      G      H      I
A        | 0.94   0.016  0.018  0.     0.002  0.002  0.004  0.018  0.
B        | 0.014  0.918  0.008  0.01   0.008  0.008  0.01   0.008  0.016
C        | 0.006  0.062  0.882  0.004  0.008  0.     0.     0.028  0.01
D        | 0.     0.022  0.036  0.912  0.014  0.008  0.008  0.     0.
E        | 0.004  0.03   0.028  0.028  0.886  0.006  0.008  0.004  0.006
F        | 0.006  0.028  0.006  0.004  0.008  0.902  0.004  0.02   0.022
G        | 0.014  0.032  0.024  0.006  0.002  0.018  0.886  0.004  0.014
H        | 0.     0.024  0.024  0.006  0.01   0.02   0.016  0.898  0.002
I        | 0.     0.02   0.016  0.004  0.014  0.022  0.018  0.018  0.888

Table 3: Confusion matrix of nine-class gesture pattern of the proposed algorithm. Rows are the true labels and columns are the predicted labels.

Gestures |   A      B      C      D      E      F      G      H      I
A        | 0.984  0.004  0.006  0.     0.002  0.002  0.     0.002  0.
B        | 0.     0.958  0.008  0.006  0.002  0.002  0.01   0.008  0.006
C        | 0.006  0.016  0.942  0.004  0.008  0.     0.     0.014  0.01
D        | 0.     0.01   0.008  0.952  0.014  0.008  0.008  0.     0.
E        | 0.     0.     0.006  0.002  0.986  0.002  0.002  0.     0.002
F        | 0.     0.008  0.006  0.004  0.008  0.962  0.004  0.     0.008
G        | 0.002  0.008  0.006  0.002  0.002  0.002  0.966  0.004  0.008
H        | 0.     0.004  0.004  0.006  0.01   0.     0.016  0.958  0.002
I        | 0.     0.016  0.016  0.004  0.004  0.002  0.002  0.008  0.948

List of Figures

1  Architecture of the proposed algorithm.
2  Architecture of Deep Convolutional layers in the proposed algorithm.
3  Structure of classical RNN.
4  Structure of GRU gate.
5  Flow chart of building gesture learning data.
6  Wearable sensor board based on gyroscope and accelerometer sensors.
7  Representation of a 5-point moving polynomial Savitzky-Golay filter.
8  Nine-arm gestures: from top to bottom, stand-by (A), counterclockwise-draw circle (B), clockwise-draw circle (C), upper-right-draw X (D), upper-left-draw X (E), straight-draw from left to right (F), straight-draw from right to left (G), straight-draw from bottom to top (H), and straight-draw from top to bottom (I).

List of Tables

1  Experimental results of human activity recognition and the proposed algorithm.
2  Confusion matrix of nine-class gesture pattern for human activity recognition (HAR) [35].
3  Confusion matrix of nine-class gesture pattern of the proposed algorithm.

Research Highlights:

• Field: Arm gesture classification and recognition based on motion sensors
• Algorithm Description: deepGesture, a new arm gesture recognition method based on gyroscope and accelerometer sensors using deep convolution and recurrent neural networks
• Implementation: Wrist-type smart band device and deep learning framework
• Applications: Smart TV based entertainment and fitness service platform