HSFE Network and Fusion Model based Dynamic Hand Gesture Recognition

  • Tai, Do Nhu (Department of Computer Science, Chonnam National University) ;
  • Na, In Seop (SW Convergence Education Institute, Chosun University) ;
  • Kim, Soo Hyung (Department of Computer Science, Chonnam National University)
  • Received : 2019.04.18
  • Accepted : 2020.09.30
  • Published : 2020.09.30

Abstract

Dynamic hand gesture recognition (d-HGR) plays an important role in human-computer interaction (HCI) systems. With the growth of hand-pose estimation and 3D depth sensors, depth and hand-skeleton datasets have been proposed, spurring much research on depth and 3D hand-skeleton approaches. However, d-HGR remains challenging due to the low resolution, high complexity, and self-occlusion of the hand. In this paper, we propose a hand-shape feature extraction (HSFE) network to produce robust hand-shape features. We build LSTM-based hand-shape and hand-skeleton models to exploit the temporal information in hand-shape and motion changes. Fusing the two models yields the best accuracy on the dynamic hand gesture (DHG) dataset.

Keywords

1. Introduction

Due to the growth of low-cost 3D depth sensors, dynamic hand gesture recognition (d-HGR) has emerged as an important step in Human-Computer Interaction (HCI) applications such as sign language recognition, robotics, and interactive gaming. These sensors provide 3D information, which makes it easy to extract the hand region and 3D hand skeleton in complex environments with background clutter, occlusion, and lighting variation [1]. d-HGR is considered a typical pattern recognition problem with two steps: feature extraction and classification.

Nevertheless, d-HGR remains a challenging task because of the hand's small size, complexity, and self-occlusion. Moreover, recognition is difficult due to intra-class dissimilarities and inter-class similarities among gestures. Intra-class dissimilarities come from cultural or individual factors that lead to differences in the position, speed, and style of a hand gesture. Inter-class similarities appear when some hand gestures are nearly identical. Therefore, the spatial and temporal information of hand gestures must be exploited to deal with these problems as well as with sensor noise.

Traditional handcrafted methods focus on building robust feature descriptors in the spatial and temporal dimensions to encode the changes of hand motion and hand shape, such as the Histogram of 3D Facets [2], the Spatio-Temporal HOG2 descriptor [3], and the Histogram of Oriented 4D Normals (HON4D) [4].

Besides, with the success of the convolutional neural network (CNN) in image classification [5] and image segmentation [6], as well as the large ImageNet dataset [7], deep learning has also been applied to dynamic action/gesture recognition with 2D CNN models [8], 2D CNNs integrated with motion features [9], 3D CNN models [10], and temporal models such as Long Short-Term Memory (LSTM) [11].

Color and depth streams are often used in previous methods [12]. Some methods additionally use infrared and audio streams, as well as the body skeleton [13,14]. Besides, the rapid development of hand pose estimation [15] calls for datasets and methods that process hand-skeleton data. De Smedt et al. [16] built the Dynamic Hand Gesture (DHG) dataset with depth and hand-skeleton sequences along with a handcrafted method. Since then, many deep learning methods [17-21] have been proposed that exploit multi-modal input from depth and hand-skeleton sequences.

HandSegNet [22] and a 3D CNN encoder-decoder [23] show that encoder-decoder architectures are successful in hand pose estimation. Zimmermann et al. [22] use HandSegNet for hand localization and an encoder-decoder model named PoseNet for estimating keypoint score maps. Recently, Moon et al. [23] designed a 3D CNN encoder-decoder to predict the 3D pose from a single depth image. Consequently, the latent codes of encoder-decoder models are robust features for dynamic hand recognition. However, most works focus only on powerful 2D/3D CNNs for feature extraction on a specific dataset. De Smedt [21] proposes features from the last layer of a 2D CNN hand-pose model for hand-shape representation in dynamic hand-gesture recognition; the input and output of the model are a depth image and fingertip positions, respectively. Such a hand-shape representation is therefore not robust enough for efficient classification, owing to the small size, high complexity, and self-occlusion of the hand.

In this paper, we propose the HSFE network to solve the d-HGR problem. We build LSTM-based hand-shape and hand-skeleton models to exploit the temporal information in hand-shape and motion changes. Robust hand-shape features are extracted by training the hand-shape feature network on an available hand-pose dataset. Our method can handle the complex changes of depth hand sequences caused by the hand's small size and self-occlusion.

The remainder of the paper consists of four sections. In Section 2, we review the recent related works. In Section 3, we describe our proposed method and its analysis. In Section 4, experiments and discussion are described along with the related methods for comparison. Finally, we conclude our results and discuss further works in Section 5.

2. Related Works

2.1 Overview of d-HGR

Hand gesture recognition (HGR) has developed rapidly in HCI applications in recent years for the following reasons. First, hand gestures are intuitive and effective in expressing human feelings. Second, the development of sensor technology has brought new ways of capturing hand gestures, such as accelerometer-based sensors that accurately capture the movement of the hand and fingertips [25], multi-touch screen sensors widely available in tablets and phones [26], and vision-based sensors [27] for hand recognition from color images.

Low-cost 3D depth sensors such as the Microsoft Kinect and Intel RealSense bring many benefits to HGR compared with traditional sensors. First, they are robust to lighting variation, background clutter, and occlusion, which makes hand detection and segmentation easier. Second, depth sensors capture 3D information about the scene, which has sped up the development of hand-pose and human-pose estimation for determining the skeleton of the human body or hand. Therefore, HGR systems have many sources of information to choose from, such as depth, color images, and body/hand skeletons [28].

There are two main categories of HGR: static and dynamic. Unlike static hand gesture recognition (s-HGR), which detects the hand region and extracts hand features from a segmentation at a specific time, dynamic hand gesture recognition (d-HGR) must also exploit temporal features from the hand-shape sequence. It is treated as a pattern recognition problem consisting of feature extraction and classification.

Traditional well-known handcrafted features such as HOG and SIFT have been extended to depth-based image sequences to describe hand-shape as well as motion features. Zhang et al. [2] proposed the Histogram of 3D Facets as a depth-based descriptor for s-HGR; they also use the Edge Enhanced Depth Motion Map [29] to encode shape and motion in d-HGR. The Spatio-Temporal HOG2 descriptor [3] was introduced by Ohn-Bar et al. and applied to the MSR Hand Gesture dataset [30]. Oreifej et al. proposed HON4D [4], which integrates time, depth, and spatial coordinates into a 4D space via a histogram of the surface-normal orientation distribution. Devanne et al. [31] represent skeleton sequences as trajectories in a Riemannian manifold and recognize actions with a kNN classifier.

A survey by Asadi-Aghbolaghi et al. [33] depicts the good performance of deep learning approaches in action and gesture recognition. Methods of the first approach [8] use transfer learning from 2D CNN architectures pre-trained on ImageNet [7] and classify actions/gestures by averaging over sampled frames. Methods of the second approach [9] often use pre-computed motion features for temporal information, while methods of the third approach [10] rely on 3D convolution and 3D pooling. Methods of the final approach [11] combine 3D CNNs with temporal sequence models such as the Recurrent Neural Network (RNN) and LSTM.

For d-HGR, Molchanov et al. [12] use volumes of image gradients and depth values in multi-scale 3D CNN models on the VIVA dataset. They later extended this into a recurrent 3D convolutional neural network [13] using depth, color, and stereo-IR sensor data, with success on the ChaLearn dataset.

2.2 Depth and 3D skeleton d-HGR

With the rapid development of hand pose estimation [15] and the support of depth cameras such as the Intel RealSense and Microsoft Kinect [34], hand-skeleton features have attracted interest in recent HGR work. Lu et al. [35] use the palm direction, palm normal, fingertip positions, and palm center position from the Leap Motion controller to extract features such as fingertip distances, fingertip angles, fingertip elevations, and adjacent fingertip angles for d-HGR. Garcia-Hernando et al. [14] collected RGB-D sequences with hand-pose annotations for first-person hand action recognition; their best baseline merges color, depth, and pose data. A multi-modal deep learning framework proposed by Neverova et al. [36] uses color, depth, and audio streams as well as the body skeleton, and the final label of a sequence is computed by voting over every frame.

Most recently, De Smedt et al. [16] published the DHG dataset with depth and 2D/3D skeleton information to address the lack of benchmarks and comparison methods for depth and 3D hand-joint d-HGR. They introduce the Shape of Connected Joints (SoCJ) descriptor to represent the hand shape; a Fisher Vector computed from the SoCJ descriptor, together with histograms of the hand direction and the wrist orientation, is then used for classification. Their method is also the state of the art among handcrafted methods such as HOG2 and HON4D. Moreover, many deep learning methods have been proposed for depth and skeleton d-HGR. Guerry et al. [20] concatenate randomly selected depth frames into three key-frames and use VGG11 [37] with ImageNet [7] pre-trained weights for classification. Chen et al. [17] extract finger and global motion features from skeleton sequences and feed them into a bidirectional recurrent neural network. Devineau et al. [19] build parallel convolutions using only hand-skeleton data.

3. Proposed Method

In this section, we describe the proposed d-HGR pipeline shown in Fig. 1, which consists of hand-depth image normalization, hand-shape feature learning, d-HGR with a hand-shape model and a hand-skeleton model, and fusion techniques to combine these models.


Fig. 1. The pipeline of the proposed method

The input of our problem is a hand-depth image sequence \(D=\{D_t \in R^{h \times w} \}^T_{t=1}\) and a hand-skeleton sequence \(S = \{S_t\}^T_{t=1}\), where \(D_t\) is the hand-depth image at frame t, \(S_t = \{x^t_i, y^t_i, z^t_i\}^J_{i=1}\) denotes the hand skeleton at time t, T is the length of the sequence, and J is the number of hand-skeleton joints. The goal is to classify {D, S} into a gesture \(c_i\) of \(C=\{c_i\}^K_{i=1}\), where C denotes the set of gesture classes in dynamic hand gesture recognition and K is the number of classes.
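For concreteness, the sketch below shows one way the input pair {D, S} could be stored as arrays; the frame resolution is an arbitrary assumption, while the 22-joint skeleton and the sequence length of 75 anticipate the DHG dataset and the choice made in Section 3.3.

```python
import numpy as np

# Illustrative shapes only (not prescribed by the paper).
T, h, w, J = 75, 128, 128, 22                # sequence length, frame size, joints
D = np.zeros((T, h, w), dtype=np.float32)    # hand-depth sequence D = {D_t}
S = np.zeros((T, J, 3), dtype=np.float32)    # hand-skeleton sequence S = {S_t}, (x, y, z) per joint
label = 0                                    # gesture class index in [0, K)
```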

- We first train the hand-shape feature network HShape-Feature(\(D_t\)) using the SegNet [6] encoder-decoder model on the FingerPaint dataset [24].

- The hand-depth images in the dataset are normalized before being fed to the network. We use HShape-Feature(\(D_t\)) to extract the hand-shape feature of every depth hand image in a depth sequence.

- After that, the hand-shape features become the input of the hand-shape network for training and classifying gestures. Due to the arbitrary lengths of the hand-depth and skeleton sequences, we normalize the length of the hand-depth sequence input.

- Besides, we also build a hand-skeleton network HSkeleton(S) that receives the hand-skeleton sequences.

- Finally, the results of the two models are integrated by fusion techniques to enhance recognition performance.

3.1 Hand-depth image normalization

For a depth image I, we extract the hand-depth image \(I_H\), where H = (x, y, w, h) is the bounding box of the hand region. We use a morphology operator to eliminate isolated depth pixels. We then sort the depth values in the hand region in ascending order and pick the values in the position range [dmin, dmax], where dmin and dmax are chosen experimentally. The average dcenter of the picked depth values serves as the center of mass of the hand region. Given a threshold t, the depth values in the hand region H outside the range [dcenter - t, dcenter + t] are set to 0. Finally, all depth values are normalized to [0, 255].
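As a rough illustration of this procedure, the sketch below implements the described steps with NumPy and OpenCV. The slice of sorted depth values used to locate the center of mass and the threshold t are placeholder values that, as noted above, have to be chosen experimentally.

```python
import cv2
import numpy as np

def normalize_hand_depth(depth, bbox, d_range=(0.1, 0.5), t=150):
    """Crop and normalize a hand-depth region (Section 3.1 sketch).

    depth   : raw depth image (e.g. uint16, millimetres)
    bbox    : (x, y, w, h) bounding box of the hand region
    d_range : assumed fraction range of the sorted depth values used
              to estimate the center of mass of the hand
    t       : assumed depth threshold around the center of mass
    """
    x, y, w, h = bbox
    hand = depth[y:y + h, x:x + w].astype(np.float32)

    # remove isolated depth pixels with a morphological opening
    mask = (hand > 0).astype(np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    hand *= mask

    # sort the depth values and take a slice to estimate the center of mass
    vals = np.sort(hand[hand > 0])
    if vals.size == 0:
        return np.zeros((h, w), dtype=np.uint8)
    lo = int(d_range[0] * vals.size)
    hi = max(int(d_range[1] * vals.size), lo + 1)
    d_center = vals[lo:hi].mean()

    # keep only depth values within [d_center - t, d_center + t]
    hand[np.abs(hand - d_center) > t] = 0

    # normalize the remaining depth values to [0, 255]
    if hand.max() > 0:
        hand = 255.0 * hand / hand.max()
    return hand.astype(np.uint8)
```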

3.2 HSFE network

The HSFE network HShape-Feature(\(D_t\)) is built from the SegNet [6] network, which solves the image segmentation problem, as shown in Fig. 2. It is a symmetric network consisting of encoder and decoder parts. The encoder is modified from the VGG16 network and aims to encode the object into a latent representation; it consists of blocks of convolution, batch normalization, ReLU, and pooling layers. The decoder is responsible for mapping the latent representation of the objects to their semantic tags. It differs from the encoder in that the pooling layers are replaced by up-sampling layers, and the pooling indices saved by the encoder's pooling layers are used by the corresponding up-sampling layers to recover the locations of the maxima.


Fig. 2. HSFE network based on SegNet
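The following Keras sketch is a simplified approximation of this architecture, not the exact SegNet: Keras has no built-in max-unpooling layer, so the index-based up-sampling is replaced by plain UpSampling2D, and the number of blocks and filters is illustrative rather than the full VGG16-derived configuration.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters, n_convs=2):
    # convolution + batch normalization + ReLU, repeated n_convs times
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_hsfe_segnet(input_shape=(128, 128, 1), n_classes=7):
    # n_classes = background + palm + five fingertips (assumed labelling)
    inp = layers.Input(shape=input_shape)

    # encoder (VGG-like): encode the hand into a latent representation
    x = conv_block(inp, 64)
    x = layers.MaxPooling2D()(x)
    x = conv_block(x, 128)
    x = layers.MaxPooling2D()(x)
    x = conv_block(x, 256)
    latent = layers.MaxPooling2D(name="hand_shape_feature")(x)

    # decoder: map the latent representation back to per-pixel semantic tags
    # (plain up-sampling here instead of SegNet's pooling-index unpooling)
    x = layers.UpSampling2D()(latent)
    x = conv_block(x, 256)
    x = layers.UpSampling2D()(x)
    x = conv_block(x, 128)
    x = layers.UpSampling2D()(x)
    x = conv_block(x, 64)
    out = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return models.Model(inp, out)
```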

To extract robust hand-shape features, the hand-pose dataset used to train the hand-shape feature network must be chosen carefully. Given the complex attributes of the hand, such as its small size and self-occlusion, a suitable dataset should provide semantic tags for the five fingertips, the background, and the palm region, and cover most hand poses in the wild.

In this paper, we use the FingerPaint dataset [24], illustrated in Fig. 3, to train the hand-shape feature network and transfer the pre-trained weights to the hand-shape network. The FingerPaint dataset consists of captures of five subjects (A, B, C, D, E), with three captures per subject: 'global' for large global movements with relatively static fingers, 'poses' for a relatively static global position with moving fingers, and 'combined' for more challenging combinations. It provides high-precision pixel segmentation for pose estimation. In the training step, we split every subject into 70% training data and 30% testing data, and we use rotation for data augmentation.


Fig. 3. Example of the FingerPaint dataset [24]

3.3 Hand-Shape and Hand-Skeleton Network

In the hand-shape model, the input is a depth hand sequence. Every hand-image region is normalized as described in Section 3.1. After that, we use the last convolution layer in the encoder of the hand-shape feature network to extract the latent hand representation. From this, we create the hand-shape feature sequences that the hand-shape network uses to exploit the changes of hand shape over time.

As shown in Fig. 4, the hand-shape network consists of hand-image normalization, hand-shape feature extraction with weights pre-trained on the FingerPaint dataset, and a block of LSTM, dense, and soft-max layers responsible for exploiting the temporal information of hand-pose changes.


Fig. 4. Hand-Shape Network
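A minimal Keras sketch of such a hand-shape network is given below, reusing the HSFE model sketched above. The pooled encoder output (named hand_shape_feature there) stands in for the last convolution layer of the encoder, and the LSTM and dense sizes are illustrative assumptions rather than the paper's exact configuration.

```python
from tensorflow.keras import layers, models

def build_hand_shape_network(segnet, seq_len=75, n_classes=14):
    # feature extractor: encoder part of the trained HSFE network
    encoder = models.Model(segnet.input,
                           segnet.get_layer("hand_shape_feature").output)
    encoder.trainable = False                    # keep the pre-trained weights fixed

    seq_in = layers.Input(shape=(seq_len,) + encoder.input_shape[1:])
    # apply the encoder to every frame, then pool each feature map to a vector
    x = layers.TimeDistributed(encoder)(seq_in)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)
    # temporal modelling of hand-shape changes, then classification
    x = layers.LSTM(256, dropout=0.3)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(seq_in, out)
```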

Because the input sequences have different lengths, we normalize them. First, we choose the mean length L of the sequences in the DHG dataset as the input sequence length. If the current sequence is shorter than L, we pad it using the data at the start and end of the sequence. Otherwise, we randomly select frames so that the length equals L.
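One possible implementation of this length normalization is sketched below; padding by repeating the first and last frames is our reading of "using the data at the start and end of the sequence".

```python
import numpy as np

def normalize_length(seq, L=75):
    """Pad or sub-sample a per-frame sequence (features or skeletons) to length L."""
    seq = np.asarray(seq)
    T = seq.shape[0]
    if T == L:
        return seq
    if T < L:
        # pad by repeating data from the start and end of the sequence
        pad = L - T
        head = np.repeat(seq[:1], pad // 2, axis=0)
        tail = np.repeat(seq[-1:], pad - pad // 2, axis=0)
        return np.concatenate([head, seq, tail], axis=0)
    # otherwise randomly keep L frames, preserving temporal order
    idx = np.sort(np.random.choice(T, size=L, replace=False))
    return seq[idx]
```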

Skeleton sequences are normalized in the same way as the hand-shape feature sequences in the hand-shape network. The hand-skeleton network then receives them and uses a block of two stacked LSTMs, dense layers, and a soft-max layer for classification, as described in Fig. 5.


Fig. 5. Hand-Skeleton Network
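A corresponding sketch of the hand-skeleton network (two stacked LSTMs, dense layers, and a soft-max) could look as follows; the hidden sizes and the 22-joint, flattened (x, y, z) input are assumptions.

```python
from tensorflow.keras import layers, models

def build_hand_skeleton_network(seq_len=75, n_joints=22, n_classes=14):
    inp = layers.Input(shape=(seq_len, n_joints * 3))    # flattened (x, y, z) per frame
    x = layers.LSTM(128, return_sequences=True, dropout=0.3)(inp)
    x = layers.LSTM(128, dropout=0.3)(x)                 # two stacked LSTMs
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)
```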

In both models, we use Dropout [38] in the LSTMs as well as in front of the soft-max layer. The input sequence length is set to 75 based on the histogram of sequence lengths in Fig. 6.


Fig. 6. Histogram of the sequence lengths in the DHG dataset. The mean length is 75.

3.4 Fusion Techniques

To enhance the performance of the two models, we use fusion techniques, which exploit the complementary and redundant information between the models. In this paper, we use three fusion techniques: late fusion, early fusion, and joint fine-tuning, as described in [21]. The late fusion technique combines the probability outputs of the two deep learning models by a weighted sum, as in Fig. 7. Given ŷshape and ŷskeleton, the output probabilities of the hand-shape network and the hand-skeleton network respectively, the final predicted label is computed as:

\(\hat{y}_{\text{final}}=\underset{i}{\arg\max}\left(\alpha \hat{y}_{\text{shape}}+(1-\alpha) \hat{y}_{\text{skeleton}}\right)\)       (1)


Fig. 7. (Left) weighted-sum-based fusion technique and (right) concatenation-based fusion technique

where \(\alpha \in [0,1]\) is a parameter depending on the performance of each network, and \(i = \overline{1, K}\), with K the number of gesture classes. In practice, we sweep α from zero to one with a step size of 0.001 to find the value that gives the best classification result.
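The weighted-sum late fusion of Eq. (1) and the sweep over α can be written compactly as below, assuming the validation probabilities of both networks are already available as NumPy arrays.

```python
import numpy as np

def late_fusion_predict(p_shape, p_skeleton, alpha):
    # Eq. (1): arg max over a weighted sum of the two probability vectors
    return np.argmax(alpha * p_shape + (1.0 - alpha) * p_skeleton, axis=-1)

def search_alpha(p_shape, p_skeleton, y_true, step=0.001):
    # sweep alpha over [0, 1] and keep the value with the best accuracy
    best_alpha, best_acc = 0.0, 0.0
    for alpha in np.arange(0.0, 1.0 + step, step):
        acc = np.mean(late_fusion_predict(p_shape, p_skeleton, alpha) == y_true)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```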

On the other hand, the early fusion technique learns an intermediate feature space by merging the feature spaces produced by the hand-shape model and the hand-skeleton model, as in Fig. 7. Finally, joint fine-tuning fusion [39], described in Fig. 8, integrates the two trained models by retraining the last fully connected layers before the soft-max with a new cost defined as follows:

\(L_{fusion} = \lambda_1 L_{skeleton} + \lambda_2 L_{shape} + \lambda_3 L_{joint}\)       (2)

where \(L_{skeleton}\), \(L_{shape}\), and \(L_{joint}\) are the loss functions computed by the hand-skeleton network, the hand-shape network, and the joint of the two networks, respectively. The three parameters \(\lambda_1\), \(\lambda_2\), and \(\lambda_3\) are control parameters, where \(\lambda_1\) and \(\lambda_2\) are equal and \(\lambda_3\) is smaller than \(\lambda_1\) and \(\lambda_2\).


\(L_{skeleton}\) and \(L_{shape}\) are cross-entropy loss functions, retrained on a linear fully connected network in which the last feature layer of the corresponding network is connected to a soft-max layer. \(L_{joint}\) is also a cross-entropy loss, computed by the network constructed from the concatenation of the two feature layers of the two models followed by a soft-max layer.
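A hedged Keras sketch of this joint fine-tuning objective is shown below: the two trained networks are frozen, new soft-max heads are attached to their last feature layers and to the concatenation of the two, and the three cross-entropy losses are weighted as in Eq. (2). The λ values, and the choice to train only the new heads rather than the original fully connected layers, are simplifying assumptions.

```python
from tensorflow.keras import layers, models

def build_joint_finetune(shape_net, skel_net, n_classes=14,
                         lambdas=(1.0, 1.0, 0.5)):      # assumed weights, λ3 < λ1 = λ2
    # freeze both trained networks and reuse their last feature layers
    shape_net.trainable = False
    skel_net.trainable = False
    feat_shape = shape_net.layers[-2].output   # feature layer just before soft-max
    feat_skel = skel_net.layers[-2].output

    # per-branch soft-max heads and a joint head on the concatenated features
    y_shape = layers.Dense(n_classes, activation="softmax", name="shape")(feat_shape)
    y_skel = layers.Dense(n_classes, activation="softmax", name="skeleton")(feat_skel)
    y_joint = layers.Dense(n_classes, activation="softmax", name="joint")(
        layers.Concatenate()([feat_shape, feat_skel]))

    model = models.Model([shape_net.input, skel_net.input],
                         [y_skel, y_shape, y_joint])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  loss_weights={"skeleton": lambdas[0], "shape": lambdas[1],
                                "joint": lambdas[2]})
    return model
```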

4. Experiments and Results

4.1 Environments and Implementation

We built the program in a Windows environment with Python 3.5, using the Keras library with the TensorFlow backend to develop the deep learning models. The experiments were run on a machine with the following configuration: Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz and a GTX 1080Ti GPU with 11 GB of memory.

We have three main models for training and validation: the HSFE network, the hand-shape network, and the hand-skeleton network. Besides, we train two fusion models: early fusion and joint fine-tuning fusion. The models are first trained with the Adam algorithm using a mini-batch size of 32 and a learning rate of 0.001. After that, we use Stochastic Gradient Descent (SGD) with a learning rate of 0.001 to further improve performance.
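For reference, this two-stage schedule could be expressed in Keras as follows; the epoch counts are assumptions not stated in the paper.

```python
from tensorflow.keras.optimizers import Adam, SGD

def train_two_stage(model, x_train, y_train, x_val, y_val, epochs=(50, 30)):
    # stage 1: Adam with mini-batch size 32 and learning rate 0.001
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=32, epochs=epochs[0],
              validation_data=(x_val, y_val))
    # stage 2: continue with SGD at learning rate 0.001 to refine the model
    model.compile(optimizer=SGD(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=32, epochs=epochs[1],
              validation_data=(x_val, y_val))
    return model
```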

4.2 Datasets and comparison methods

Two datasets are used in this paper: the FingerPaint dataset and the DHG dataset. The FingerPaint dataset is used for the HSFE network; the trained weights are transferred to the hand-shape network for feature extraction. We split every subject and category of the FingerPaint dataset into 70% for training and 30% for validation. The DHG dataset, with 14 or 28 gesture classes as listed in Table 1, is used for d-HGR.

Table 1. Gesture list in the DHG dataset


The DHG dataset contains 2800 sequences from 20 participants performing 5 trials in 2 ways, depending on the number of fingers used: one finger or the whole hand. The depth images and hand skeletons were captured with an Intel RealSense camera. We also split the DHG dataset into 70% for training and 30% for validation.

4.3 HSFE

We compare our method with Taylor et al. [40], who use a smooth model of the hand, Sharp et al. [24], who use a pipelined hand tracker, and Tan et al. [41], who use 5 different shape models to personalize to each subject. We quantify the classification error as the percentage of the dataset whose average pixel classification error rate is below a specific threshold; this measures how fully and accurately every pixel is segmented.
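This metric can be computed as in the short sketch below, which reports, for each threshold, the percentage of test images whose average per-pixel classification error falls below it.

```python
import numpy as np

def fraction_below_threshold(pred_labels, gt_labels, thresholds):
    """pred_labels, gt_labels: integer label maps of shape (N, H, W)."""
    # average pixel classification error rate of each image
    per_image_error = (pred_labels != gt_labels).mean(axis=(1, 2))
    # percentage of images whose error rate is below each threshold
    return [100.0 * np.mean(per_image_error < t) for t in thresholds]
```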

Our results in Fig. 9 show that our hand-segmentation model performs better than the methods of Tan et al. and Sharp et al. The strength of our model is that it can be applied as the HSFE used in d-HGR. We experimented with the HSFE network on the FingerPaint dataset, with prediction results shown in Table 2. The hand segmentation compares well with the ground truth. In the difficult case in row 3 of Table 2, the hand-arm and palm regions are segmented exactly, while some self-occluded fingers are not segmented as well.


Fig. 9. Classification error on FingerPaint dataset

Table 2. Prediction results on the FingerPaint dataset


4.4 d-HGR

We evaluated the hand-shape and hand-skeleton models as well as the three fusion techniques: late fusion, early fusion, and joint fine-tuning fusion. The accuracy of every model per gesture class is shown in Table 3.

Table 3. The accuracy of the models for classifying 14 gesture classes on the DHG dataset


For 14-class classification, the hand-shape model achieves 88.39% accuracy, better than the hand-skeleton model at 87.32%. The Grab and Pinch gestures show wide confusion in Fig. 10, because they are very similar and differ only in the amplitude of the hand movement. The Swipe Right and Swipe V gestures also show high confusion, up to 12%. The accuracy of almost all gestures is greater than 80%.


Fig. 10. Confusion matrix of the hand-shape model with 14 gesture classes (accuracy 88.39%)

We also experimented on the DHG dataset with 28 gesture classes, with the accuracy shown in Table 4. The best accuracy is achieved by the early fusion technique.

Table 4. The accuracy of the models for classifying 28 gesture classes on the DHG dataset


The hand-shape model also gives good accuracy on 28 gesture classes, at 89.29%, which shows that the hand-shape features hold up well under hand-shape changes. The accuracy of the hand-skeleton model is 79.46%, a decrease of about 8% caused by the loss of information about the number of fingers used in the gestures. The confusion matrices of the best methods for 14 and 28 gesture classes, shown in Fig. 11(a) and Fig. 11(b) respectively, confirm that fusing the two models gives the best accuracy. The early fusion and joint fine-tuning techniques exploit the complementarity between the hand-shape and hand-skeleton feature representations.


Fig. 11. (a) Confusion matrix of the joint fine-tuning model with 14 gesture classes (accuracy 95.36%); (b) confusion matrix of the early fusion model with 28 gesture classes (accuracy 94.1%)

Finally, we compare our proposed method with related traditional and deep learning methods. As shown in Table 5, our proposed method achieves the best accuracy among all compared methods.

Table 5. Gesture recognition accuracy on the DHG dataset


5. Conclusion 

In this paper, we proposed a d-HGR method for the depth and skeleton classification approach, based on the HSFE network and fusion between a hand-shape model and a hand-skeleton model.

First, the system trains the hand-shape feature model on the FingerPaint dataset to extract features from every depth hand image. Afterward, the hand-shape model exploits temporal information from the hand-shape changes of the depth sequence. Our experimental results corroborate that the hand-shape features can cope with the complexity, low resolution, and self-occlusion of hand-shape changes in the gestures. The accuracy is 88.39% for classifying 14 gesture classes and 89.29% for classifying 28 gesture classes, showing that the hand-shape model maintains good accuracy as the number of gesture classes increases.

Besides, the system builds the hand-skeleton model to exploit temporal information from hand-pose changes. Its accuracy is 87.32% for classifying 14 gesture classes and 79.46% for classifying 28 gesture classes. The roughly 8% decrease in accuracy is caused by the loss of information about the number of fingers used in the gestures.

To boost the accuracy of the overall system, the hand-shape and hand-skeleton models are integrated by fusion techniques: the weighted sum (late fusion), early fusion, and joint fine-tuning fusion. The accuracy is 90.54% (90.36%), 94.64% (94.11%), and 95.36% (93.75%) for the weighted sum, early fusion, and joint fine-tuning fusion, respectively, when classifying 14 (28) gesture classes. With these results, our proposed method achieves the best accuracy on the DHG dataset compared with both traditional handcrafted methods and deep learning methods.

In future work, we aim to improve the hand-skeleton model so that its accuracy holds up when the information about the number of fingers used in the gestures decreases.

References

  1. F. Coleca, T. Martinetz, and E. Barth, "Gesture interfaces with depth sensors," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8200 LNCS, pp. 207-227, 2013.
  2. C. Zhang and Y. Tian, "Histogram of 3D Facets: A depth descriptor for human action and hand gesture recognition," Computer Vision and Image Understanding., vol. 139, pp. 29-39, 2015. https://doi.org/10.1016/j.cviu.2015.05.010
  3. E. Ohn-Bar and M. M. Trivedi, "Joint Angles Similarities and HOG2 for Action Recognition," in Proc. of the IEEE conference on computer vision and pattern recognition workshops, pp. 465-470, 2013.
  4. O. Oreifej and Z. Liu, "HON4D: Histogram of oriented 4D normals for activity recognition from depth sequences," in Proc. of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 716-723, 2013.
  5. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017. https://doi.org/10.1145/3065386
  6. V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation," IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481-2495, 2017. https://doi.org/10.1109/TPAMI.2016.2644615
  7. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, and A. Karpathy, "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211-252, 2015. https://doi.org/10.1007/s11263-015-0816-y
  8. L. Sun, K. Jia, D. Yeung, and B. E. Shi, "Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 4597-4605, 2015.
  9. G. Varol, I. Laptev, and C. Schmid, "Long-Term Temporal Convolutions for Action Recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 6, pp. 1510-1517, 2018. https://doi.org/10.1109/TPAMI.2017.2712608
  10. Z. Liu, C. Zhang, and Y. Tian, "3D-based Deep Convolutional Neural Network for action recognition with depth sequences," Image and Vision Computing, vol. 55, pp. 93-100, 2016. https://doi.org/10.1016/j.imavis.2016.04.004
  11. Y. Du, W. Wang, and L. Wang, "Hierarchical recurrent neural network for skeleton based action recognition," in Proc. of the IEEE conference on computer vision and pattern recognition, vol. 07-12-June, pp. 1110-1118, 2015.
  12. P. Molchanov, S. Gupta, K. Kihwan, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 1-7, 2015.
  13. P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz, "Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4207-4215, 2016.
  14. G. Garcia-Hernando, S. Yuan, S. Baek, and T.-K. Kim, "First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 409-419, 2017.
  15. J. S. Supancic, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan, "Depth-Based Hand Pose Estimation: Data, Methods, and Challenges," in Proc. of the IEEE international conference on computer vision, pp.1868-1876, 2015.
  16. Q. De Smedt, H. Wannous, and J. P. Vandeborre, "Skeleton-Based Dynamic Hand Gesture Recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1206-1214, 2016.
  17. X. Chen, H. Guo, G. Wang, and L. Zhang, "Motion feature augmented recurrent neural network for skeleton-based dynamic hand gesture recognition," in Proc. of IEEE International Conference on Image Processing (ICIP), pp. 2881-2885, 2017.
  18. J. C. Nunez, R. Cabido, J. J. Pantrigo, A. S. Montemayor, and J. F. Velez, "Convolutional Neural Networks and Long Short-Term Memory for skeleton-based human activity and hand gesture recognition," Pattern Recognition, vol. 76, pp. 80-94, 2018. https://doi.org/10.1016/j.patcog.2017.10.033
  19. G. Devineau, F. Moutarde, W. Xi, and J. Yang, "Deep learning for hand gesture recognition on skeletal data," in Proc. of IEEE International Conference on Automatic Face & Gesture Recognition (FG), pp. 106-113, 2018.
  20. Q. De Smedt, H. Wannous, J. Vandeborre, J. Guerry, B. Le Saux, and D. Filliat, "SHREC'17 Track : 3D Hand Gesture Recognition Using a Depth and Skeletal Dataset," 3DOR - 10th Eurographics Workshop on 3D Object Retrieval, pp. 1-6, 2017.
  21. Q. De Smedt, "Dynamic hand gesture recognition - From traditional handcrafted to recent deep learning approaches," Computer Vision and Pattern Recognition [cs.CV], Universite de Lille 1, Sciences et Technologies; CRIStAL UMR 9189, 2017.
  22. C. Zimmermann and T. Brox, "Learning to Estimate 3D Hand Pose from Single RGB Images," in Proc. of the IEEE International Conference on Computer Vision, pp. 4903-4911, 2017.
  23. J. Y. Chang, G. Moon, and K. M. Lee, "V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map," in Proc. of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5079-5088, 2018.
  24. T. Sharp et al., "Accurate, Robust, and Flexible Real-time Hand Tracking," in Proc. of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3633-3642, 2015.
  25. X. Zhang, X. Chen, Y. Li, V. Lantz, K. Wang, and J. Yang, "A framework for hand gesture recognition based on accelerometer and EMG sensors," IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, vol. 41, no. 6, pp. 1064-1076, 2011. https://doi.org/10.1109/TSMCA.2011.2116004
  26. H. Olafsdottir and C. Appert, "Multi-touch gestures for discrete and continuous control," in Proc. of the 2014 International Working Conference on Advanced Visual Interfaces, pp.177-184, 2014.
  27. S. S. Rautaray and A. Agrawal, "Vision based hand gesture recognition for human computer interaction: a survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 1-54, 2015. https://doi.org/10.1007/s10462-012-9356-9
  28. J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, "Deep learning for sensor-based activity recognition: A Survey," Pattern Recognition Letters, vol. 119, pp. 3-11, 2019. https://doi.org/10.1016/j.patrec.2018.02.010
  29. C. Zhang and Y. Tian, "Edge enhanced depth motion map for dynamic hand gesture recognition," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 500-505, 2013.
  30. J. Wang, Z. Liu, J. Chorowski, Z. Chen, and Y. Wu, "Robust 3D action recognition with random occupancy patterns," in Proc. of European Conference on Computer Vision, pp. 872-885, 2012.
  31. M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo, "3-D Human Action Recognition by Shape Analysis of Motion Trajectories on Riemannian Manifold," IEEE transactions on cybernetics, vol. 45, no. 7, pp. 1340-1352, 2015. https://doi.org/10.1109/TCYB.2014.2350774
  32. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp.1137-1149, 2017. https://doi.org/10.1109/TPAMI.2016.2577031
  33. M. Asadi-Aghbolaghi et al., "A Survey on Deep Learning Based Approaches for Action and Gesture Recognition in Image Sequences," in Proc. of IEEE international conference on automatic face & gesture recognition (FG), pp. 476-483, 2017.
  34. L. A. Anonymous, E. Krupka, N. Bloom, D. Freedman, A. Vinnikov, and A. B. Hillel, "Toward realistic hands gesture interface : Keeping it simple for developers and machines," in Proc. of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 1887-1898, 2017.
  35. W. Lu, Z. Tong, and J. Chu, "Dynamic hand gesture recognition with leap motion controller," IEEE Signal Processing Letters, vol. 23, no. 9, pp. 1188-1192, 2016. https://doi.org/10.1109/LSP.2016.2590470
  36. N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "ModDrop: Adaptive Multi-Modal Gesture Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692-1706, 2016. https://doi.org/10.1109/TPAMI.2015.2461544
  37. C. Szegedy et al., "Going deeper with convolutions," in Proc. of the IEEE conference on computer vision and pattern recognition, pp.1-9, 2015.
  38. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," The Journal of Machine Learning Research, vol. 15, pp. 1929-1958, 2014.
  39. H. Jung, S. Lee, J. Yim, S. Park, and J. Kim, "Joint fine-tuning in deep neural networks for facial expression recognition," in Proc. of the IEEE international conference on computer vision, pp. 2983-2991, 2015.
  40. J. Taylor et al., "Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences," ACM Transactions on Graphics (TOG), vol. 35, no. 4, pp. 1-12, 2016. https://doi.org/10.1145/2897824.2925965
  41. D. J. Tan et al., "Fits Like a Glove: Rapid and Reliable Hand Shape Personalization," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5610-5619, 2016.