1. INTRODUCTION
The goal of all slide presentation software is to provide visual aids that help speakers convey their message and persuade the audience effectively. Used wisely, slides can make a complex scientific talk more understandable or even make slide-based math classes more enjoyable. That is why slide presentations have become so common today.
During slide presentations, we frequently point to and pick out objects, keywords, and even pictures in the slides. To animate objects or move the slides back and forth, we need additional peripherals, either manually operated (such as a mouse or keyboard) or remotely operated (such as an IR/RF-based remote), as triggers to control slide navigation. However, these devices not only distract the audience during the presentation but also hinder eye contact and the natural hand gestures used to emphasize key points on the slide.
There are many solutions for enhancing the effectiveness of slide presentations [1,2,3]. Gesture-based slide control is among the most natural forms of Human Computer Interaction (HCI) [4], but most conventional solutions require special devices that are not very human-friendly.
This paper presents the design and analysis of a single-camera hand tracking system that controls a finger pointer and navigates slides back and forth with hand gestures. The experiments were conducted in a small presentation room of 7×5 meters, with the only light coming from a projector. The system was installed on a laptop (Windows 7, 2 GB RAM, Intel® Core i3 M370) with a Microsoft LifeCam VX-5000 webcam attached. By using computer vision based hand gesture recognition as the interface to a slide presentation system, we believe the proposed application can provide a highly convenient environment for presentation. This kind of interface has the potential to better achieve the goal of an effective and successful presentation.
2. RELATED WORKS
Hand gesture recognition has gained a lot of attention over the past decades as a new way of interacting with a computer, because it is more intuitive and much simpler than using common input devices like a keyboard and a mouse. Several methods for hand gesture recognition have been developed to date.
There are numerous devices for the task: wearable devices such as a ring [5], an armband [6], or a glove [7], the Leap Motion [8], controller-based motion recognition such as the Wii-mote [9], an ordinary web camera [3], a stereo camera [10], and even radar [11]. Every technique has its pros and cons. By using one or a combination of these techniques, hand gestures can control various applications and devices. Slide presentation software is one of the many applications that can be controlled by hand gestures.
Gesture recognition in the context of slide presentations comes in two forms: controlling slide navigation and finger pointing. For both tasks, finding skin color is a popular way to detect hands in images. In the literature, Jadhav et al. [12] and Rajesh et al. [13] applied similar techniques as an alternative interface to desktop machines, but did not go so far as to aid a slide-based presentation that employs a beam projector.
Jadhav et al. used the HSL color model to extract the hand region in video frames and used the centroid of the hand to navigate slides and control the mouse pointer. Rajesh et al., on the other hand, used the number of fingers to control the slide presentation. They detected skin color to find the hand region, just as Jadhav et al. did, but chose the HSV color model. A comparison between our work and these closely related methods is given in the section on the slide navigation control experiment.
3. PROPOSED METHOD
3.1 System Organization
The main goal of this research is to aid speakers in slide presentations by allowing them to use their hands and fingers to control the screen and navigate through the slides back and forth. The proposed method is designed to capture hand gestures inside the light pyramid of a beam projector. Hand gestures outside the scene are ignored. The camera is placed in front of the screen as shown in Fig. 1.
Fig. 1. Camera position.
The system can be divided into four functional modules: skin color processing, hand detection, fingertip detection, and gesture analysis. Fig. 2 shows the detailed cascade organization of the proposed system, starting with conversion to the HSL color space, which facilitates the hand (skin) detection in the next step. We then find the hand contour, compute the convex hull, and measure the convexity defects. The convex hull and the convexity defects allow us to detect fingers, estimate a smooth trajectory in the finger pointing step, and define gestures via the motion of the centroid to control slide navigation.
Fig. 2. System organization.
3.2 Color Model
The HSL color model has three components, hue, saturation, and luminosity, which correspond closely to human vision. Hue represents the gradation of color within the spectrum of visible light, saturation specifies the purity of the color, and luminosity the overall brightness. The HSL color model works well in a very luminous or noisy environment such as the inside of a projector beam.
Both the skin color sampling and the subsequent image processing module start with an input image from a camera. The color space of an image is converted from RGB to HSL, according to equation (1).
where H, S, and L are the three channel values of a color in HSL, Vmax and Vmin are the maximum and minimum of the R, G, and B values, and R, G, and B are the values of the color in the RGB space.
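For illustration, a minimal sketch of this conversion as it would typically be implemented with OpenCV (our assumed library throughout these sketches; OpenCV names the color space HLS and, for 8-bit images, scales H to [0, 179] and L and S to [0, 255]):

import cv2

def to_hls(frame_bgr):
    # OpenCV captures camera frames in BGR order; convert them to HLS
    # (hue, luminosity, saturation), the working color space of the system.
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HLS)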
Fig. 3. HSL color model.
Since input images from a camera contain random noise, it is desirable to reduce it with a smoothing filter such as the one defined in equation (2), to improve the chances of identifying the target objects. Any smoothing technique helps reduce edge complexity so that the transitions between colors become smoother.
where w and h are the width and height of the filter K, respectively.
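Such a normalized box filter (every kernel coefficient equal to 1/(w·h)) can be applied, for example, as follows; the 5×5 kernel size is an illustrative choice, not a value taken from the paper:

import cv2

def smooth(image, w=5, h=5):
    # Normalized box filter: each output pixel is the mean of its w-by-h
    # neighborhood, which softens color transitions and suppresses random noise.
    return cv2.blur(image, (w, h))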
3.3 Skin Color Sampling
Skin color detection is often complicated by changes in slide color, which in turn affect hand detection. We use skin color detection because it can detect the hand every time it enters the projector beam, whereas other methods such as background subtraction detect a hand only when it moves and return a region that includes its solid shadow. In this research we employ an adaptive skin color detection method to locate the hand. It starts with a color calibration step called skin color sampling [14], which we use to cope with skin discoloration caused by the projector beam. We pick a set of pixel samples in order to estimate the skin color threshold. To compute a threshold between skin pixels and non-skin pixels, we spread six square probes to sample skin pixels, as shown in Fig. 4.
Fig. 4. Sampling skin color pixels from a set of probe rectangles.
The skin color sampling takes about thirty initial frames from which the system captures the color samples. In each frame we take the median values of the probe rectangles and then compute their average. We distinguish the skin color by setting upper and lower bounds for each of the HSL channels. Algorithm 1 gives the detailed procedure for the skin color sampling.
Algorithm 1. Color Sampling (input: color image, output: binary image)
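The following is only a minimal sketch of how such a sampling procedure could be realized; the probe rectangles given as hypothetical (x, y, size) triples and the fixed per-channel tolerance are illustrative assumptions, not values taken from Algorithm 1:

import numpy as np
import cv2

def sample_skin_bounds(hls_frames, probes, tol=(10, 40, 40)):
    # Estimate lower/upper HLS bounds from six probe squares over the
    # roughly thirty calibration frames.
    per_frame = []
    for frame in hls_frames:
        medians = [np.median(frame[y:y + s, x:x + s].reshape(-1, 3), axis=0)
                   for (x, y, s) in probes]          # median color of each probe
        per_frame.append(np.mean(medians, axis=0))   # average of the probe medians
    center = np.mean(per_frame, axis=0)
    lower = np.clip(center - tol, 0, 255).astype(np.uint8)
    upper = np.clip(center + tol, 0, 255).astype(np.uint8)
    return lower, upper

def skin_mask(hls_frame, lower, upper):
    # Binary image: white where a pixel falls inside the sampled skin color bounds.
    return cv2.inRange(hls_frame, lower, upper)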
Fig. 5. A result of skin color sampling.
The values obtained from skin color sampling may be fine-tuned manually using track bars to improve skin detection for each channel of the HSL image.
3.4 Contour Finding
The skin detection module produces binary images as shown in Fig. 6. The contour finding process then extracts a complete boundary between skin pixels and non-skin pixels, as in Fig. 7.
Fig. 6. A binary skin image.
Fig. 7. Extracted contour.
To find the complete contour of the hand blob, we employ the border-following algorithm developed by Suzuki and Abe [15], which is simple and fast at finding reasonable contours. The algorithm returns the boundary of the skin region in the form of a closed curve, from which we derive a set of points that represent the edge of the region. We keep only the biggest blob and ignore the rest, where the biggest area is simply assumed to be the one with the longest boundary.
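In OpenCV terms (assuming that library and its 4.x return signature), the Suzuki-Abe algorithm is what cv2.findContours implements; a sketch of selecting the blob with the longest boundary:

import cv2

def largest_hand_contour(binary_mask):
    # cv2.findContours implements the Suzuki-Abe border-following algorithm.
    contours, _ = cv2.findContours(binary_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    # Keep only the blob whose closed boundary is the longest; ignore the rest.
    return max(contours, key=lambda c: cv2.arcLength(c, True))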
4. FINGER POINTING PROCESSOR
Given a hand contour, we compute the overall shape of the hand to decide whether it is a pointing finger or a hand.
4.1 Convex Hull
The convex hull is a construct from computational geometry: the smallest convex set that contains all of the given points on the contour. In this research a convex hull is drawn around the hand. Given the set of points from the contour extraction, we create the convex hull, i.e., the smallest enclosing convex polygon, using Sklansky's algorithm [16,17]. The convex hull is the final result of hand segmentation and the starting point of finger pointing and gesture analysis.
4.2 Convexity Defect
Boundary points on a convex hull are commonly used to describe the shape of a two-dimensional object such as the hand in Fig. 7. In this study we employ a derived concept, the convexity defect, to analyze the silhouette shape. A convexity defect is defined as the space between the hull and the original contour of the object.
Fig. 8. Contour points for a convex hull and convexity defects.
From the convex hull of the hand contour and its convexity defects, we form a vector for each triangular defect, consisting of two neighboring hull vertices and a depth point that lies inside the hull, away from the hull edge.
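A minimal sketch, again in OpenCV terms, of obtaining the hull with Sklansky's algorithm and then the defect triples (hull vertex, neighboring hull vertex, depth point):

import cv2

def hull_and_defects(contour):
    # Convex hull as indices into the contour (the form convexityDefects expects).
    hull_idx = cv2.convexHull(contour, returnPoints=False)
    defects = cv2.convexityDefects(contour, hull_idx)  # Nx1x4: start, end, farthest, depth
    triangles = []
    if defects is not None:
        for s, e, f, d in defects[:, 0]:
            start = tuple(contour[s][0])      # hull vertex
            end = tuple(contour[e][0])        # neighboring hull vertex
            depth_pt = tuple(contour[f][0])   # deepest point between them, inside the hull
            triangles.append((start, end, depth_pt, d / 256.0))  # depth is fixed-point, *256
    return triangles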
4.3 Fingertip Detection
Fingertips carry some of the most critical information in a hand gesture. We can locate the fingertips and estimate the number of fingers using the convex hull and convexity defects computed above. This number is used to trigger the finger pointing activity: if the number of fingers is one, the pointing module is activated.
Fingertip localization in a low resolution image is often ambiguous and noisy. We solve this problem by assuming that the fingertip is the contour point farthest from the centroid of the hand, and we keep using the farthest point as the fingertip as long as the number of fingers does not change. This result becomes the basis of the next step, estimating the fingertip motion trajectory.
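A possible realization of this farthest-point rule, assuming the centroid has already been computed from the hand contour (see Section 5.1):

import numpy as np

def fingertip(contour, centroid):
    # The fingertip is assumed to be the contour point farthest from the hand centroid.
    pts = contour.reshape(-1, 2).astype(np.float32)
    dists = np.linalg.norm(pts - np.asarray(centroid, dtype=np.float32), axis=1)
    return tuple(pts[int(np.argmax(dists))].astype(int))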
4.4 Estimating Fingertip Trajectory
Tracking a fingertip over time presents an additional dimension of difficulty, so we introduce the Kalman filter to estimate a smooth trajectory of the pointing finger despite occasional detection failures and location variability due to noise and illumination changes. Consider a sequence of fingertip measurements y_t ∈ R^m, modeled by the measurement equation

y_t = B x_t + v_t,

where B ∈ R^(m×n) is the observation matrix applied to the unknown target state vector x_t ∈ R^n, and v_t ∈ R^m is the measurement noise at time t. The observation vector y_t is the noisy finger position, from which we want to estimate the unknown hidden state x_t corresponding to a smoothed trajectory of the pointing finger. The filtered state vectors are subject to the state equation

x_(t+1) = A x_t + w_t,

where A ∈ R^(n×n) denotes the dynamics of the state process. The predictive model A x_t plus a Gaussian white noise process w_t ∈ R^n defines the hidden state at time t+1. The result is a smooth sequence of x_t's filtered from the noisy sequence of y_t's.
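A minimal constant-velocity sketch with OpenCV's KalmanFilter, taking the state as (x, y, dx, dy) and the measurement as the detected fingertip position; the noise covariances below are illustrative values, not parameters reported in the paper:

import numpy as np
import cv2

kf = cv2.KalmanFilter(4, 2)  # state (x, y, dx, dy), measurement (x, y)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)   # dynamics A
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)  # observation B
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def track(measured_xy):
    # Predict the next state, then correct it with the noisy fingertip measurement y_t.
    kf.predict()
    estimate = kf.correct(np.array(measured_xy, dtype=np.float32).reshape(2, 1))
    return float(estimate[0]), float(estimate[1])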
Fig. 9. A fingertip trajectory measured and the corresponding trajectory filtered.
5. GESTURE ANALYSIS MODULE
The gesture analysis module interprets the motion of the hand and generates a sequence of commands to move the slide back and forth. Just like the finger pointing module, this module uses the number of fingers as a trigger: if all five fingers are detected, the system calculates the centroid of the hand and then initiates hand tracking.
5.1 Centroid of the Hand
We define the centroid of the hand using statistical moments, which are robust to noise and easy to compute. After hand detection, we calculate the centroid of the hand [18] as (x_c, y_c) = (M10/M00, M01/M00), where the two coordinates are the averages of the x and y coordinates of the pixels in the hand region, and M00, M10, and M01 are the zeroth- and first-order moments of the region.
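In code, assuming OpenCV once more, the centroid follows directly from the contour moments:

import cv2

def hand_centroid(contour):
    # Centroid from the zeroth- and first-order moments: (M10/M00, M01/M00).
    m = cv2.moments(contour)
    if m["m00"] == 0:
        return None
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])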
5.2 Gesture Classification
Once triggered by five fingers, gesture analysis starts by tracking the motion of the hand. When the hand moves laterally while facing the camera, the last tracked coordinates are taken as the current coordinates. By comparing them with the initial coordinates, the system interprets the gesture as a command to move the slide forward or backward according to the relative distance between the two coordinates. To move the slide forward, the initial x-coordinate should be smaller than the current x-coordinate; otherwise, the slide moves backward.
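A minimal sketch of this decision rule; the jitter threshold of 40 pixels is an illustrative assumption, not a value from the paper:

def classify_swipe(initial_x, current_x, min_shift=40):
    # Compare the initial and the last tracked centroid x-coordinates.
    if current_x - initial_x > min_shift:
        return "NEXT_SLIDE"       # the initial x-coordinate is the smaller one
    if initial_x - current_x > min_shift:
        return "PREVIOUS_SLIDE"
    return None                   # movement too small to count as a gesture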
Fig. 10. Move the slide with the centroid of the hand.
Fig. 11. Interfacing the estimated coordinates to the desktop.
5.3 Interfacing to Presentation
After image processing and gesture analysis, the remaining step is to interface with the PowerPoint software. The gesture commands from the vision module are sent to the presentation software using SendKeys, a Microsoft Windows function that delivers keystrokes to the operating system [19].
To synchronize the estimated finger pointing trajectory with the Windows desktop application, we compute the coordinates in the following way:
where the pointing finger's xy-coordinates p = (px, py) are divided by the screen size d = (dx, dy) and then multiplied by the camera frame size (W for width, H for height).
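A possible realization on Windows, sketched with the WScript.Shell SendKeys call (via pywin32) and the Win32 SetCursorPos function; the mapping below scales camera frame coordinates (W, H) to screen coordinates (dx, dy), which is our assumption about the intended form of the equation:

import ctypes
import win32com.client

shell = win32com.client.Dispatch("WScript.Shell")

def send_slide_command(command):
    # SendKeys delivers keystrokes to the focused PowerPoint window.
    if command == "NEXT_SLIDE":
        shell.SendKeys("{RIGHT}")
    elif command == "PREVIOUS_SLIDE":
        shell.SendKeys("{LEFT}")

def move_pointer(px, py, W, H, dx, dy):
    # Map the fingertip position from the camera frame (W x H) to the desktop (dx x dy).
    screen_x = int(px / W * dx)
    screen_y = int(py / H * dy)
    ctypes.windll.user32.SetCursorPos(screen_x, screen_y)  # Windows-only call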
Algorithm 2. Gesture and finger pointing analysis
The overall algorithm for the entire system is given in Algorithm 2.
6. EXPERIMENTAL RESULTS
To evaluate the functionality of the system and validate the proposed approach to building a finger pointing and gesture analysis system, we conducted a set of tests covering skin color detection, single fingertip detection, slide navigation control, and finger pointer control.
We carried out the experiments in a small meeting room arranged for a slide presentation. The video stream was captured at 10 fps with a frame size of 640 by 480 pixels. We tested five subjects with a variety of skin colors.
6.1 Skin Color Detection
In the skin color detection experiment, we measured the number of hands detected successfully in a one-minute video of a slide presentation with occasional gestures. The subjects were five people, three males and two females: two Indonesian females with fair skin, one black male from Kenya, and two Indonesian males with different skin colors, one dark and one light. Each subject used three slide presentations with simple backgrounds, one presentation per trial. To test the skin color detection, each person held his or her palm up to the camera to take the skin sample and then continued with unconstrained gestures at normal speed. The results are counted as the number of successful hand segmentations within the fifty seconds of gesturing that follow the skin color sampling in each one-minute session. Table 1 shows the detailed result of the test, with the overall success rate being around 72.4 percent.
Table 1. Skin color detection results (in number of frames)
A hand is counted as successfully segmented when the detected shape clearly looks like a hand, i.e., it has fingers according to the hull and defect technique. By counting such shapes in each frame, we obtain the performance figures.
6.2 Fingertip Detection
The second set of tests measured the success rate of single fingertip detection by counting the number of frames with or without a detected fingertip in a one-minute video with occasional gestures; the number of fingertips detected is counted per frame. Table 2 shows the summary. The input from the webcam is not as good as that of a phone or pocket camera, and the footage can be categorized as low-grade video. Because fingertip detection in low-grade video is difficult, it is not easy to detect fingers reliably, and we consider the accuracy in the table to be near the practical limit in our experimental context.
Table 2. Single fingertip detection (number of frames)
Adding farthest-point detection, instead of relying only on the hull and defect technique, prevents failures in single-finger detection: even when the detected shape does not look like a hand, it is accepted as long as it has a pointed shape that is not pointing downward. To count single fingertips, we use the point that corresponds to the single fingertip; by counting this point in each frame we measure the accuracy of single fingertip detection.
6.3 Slide Navigation Control
The success rate in moving slides using hand gestures was measured by the number of successes in moving the slides backward fifty times and forward fifty times. Table 3 shows the detailed counts, with a success rate of around 77%.
Table 3. Slide navigation control (number of frames)
Jadhav et al. used the centroid of the hand to navigate the slide presentation, with certain rectangular regions defining the commands. Rajesh et al. tested their software with 10 subjects under various illumination conditions such as fluorescent lamps, incandescent lamps, and sunlight; they count the number of fingers to navigate the slides, one finger to move to the next slide and two fingers to return to the previous one.
6.4 Control the Laser Pointer
The final set of tests concerns the performance of the finger pointing task, carried out under the same conditions as the preceding experiments. The result was measured as the success rate in controlling the finger pointer without losing the finger for fifty seconds. Table 4 summarizes the result. The success rate of 80% is not very high, but we believe it can be complemented by a dynamical model that introduces context and helps boost the performance.
Table 4. The result of the finger pointer test (number of frames)
7. CONCLUSION AND FUTURE WORK
Through a series of experiments, we have confirmed that the proposed method of finger pointing and gesture analysis could well replace the common peripherals used in slide-based presentations for navigation and screen pointing. With this kind of interface, a user experiences a natural and intuitive interaction and can comfortably focus on the presentation.
As is usually the case in many vision systems, skin color detection is very sensitive, and it is hard to obtain reliable results. One way of enhancing the reliability of the system is the common technique of background subtraction, which may cover the weakness of the system on presentations with rich colors. We are also considering embedding Dynamic Time Warping and/or the Haar cascade method into the proposed system to enhance the recognition accuracy. In addition, other interesting gesture-based scenarios under consideration include drawing shapes, free-hand drawing, and even handwriting in space as a new input method.
References
- C. Yiqiang, L. Mingjie, L. Junfa, S. Zhiqi, and P. Wei, "Slideshow: Gesture-Aware PPT Presentation," Proceeding of 2011 International Conference on IEEE Multimedia and Expo, pp. 1-4, 2011.
- D. Martinovikj and N. Ackovska, "Gesture Recognition Solution for Presentation Control," Proceeding of 10th Informatics and Information Technology Conference, pp. 187-191, 2013.
- G. Chang, J. Park, C. Oh, and C. Lee, "A Decision Tree Based Real-time Hand Gesture Recognition Method using Kinect," Journal of Korean Multimedia Society, Vol. 16, No. 13, pp. 1393-1402, 2013. https://doi.org/10.9717/kmms.2013.16.12.1393
- P. Trigueiros, F. Ribeiro, and L.P. Reis, "Generic System for Human-Computer Gesture Interaction," Proceeding of Conference 2014 Autonomous Robot Systems and Competitions, pp. 175-180, 2014.
- Logbar Labs, http://logbar.jp/ring/en/ (Accessed 17, Aug., 2015).
- Thalmic Labs, https://www.myo.com (Accessed 17, Aug., 2015).
- Y. Huang, D. Monekosso, H. Wang, and J.C. Augusto, "A Concept Grounding Approach for Glove-Based Gesture Recognition," Proceeding of 7th International Conference on IEEE Intelligent Environments, Vol. 7, No. Doctoral Colloquium, pp. 358-361, 2011.
- X. Dang, W. Wang, K. Wang, M. Dong, and L. Yin, "A User-independent Sensor Gesture Interface for Embedded Device," Proceeding of IEEE Sensors Journal, Vol. 2, No. Multiaxis Sensors, pp. 1465-1468, 2011.
- Nintendo, http://www.nintendo.com.au/wii-accessories-page1 (Accessed 17, Aug., 2015).
- H. Hongo, M. Ohya, M. Yasumoto, and Y. Niwa, "Focus of Attention for Face and Hand Gesture Recognition Using Multiple Camera," Proceeding of IEEE Automatic Face and Gesture Recognition, pp. 156-161, 2000.
- Google Developer, https://www.youtube.com/watch?v=mpbWQbkl8_g (Accessed 17, Aug., 2015).
- D.R. Jadhav and L. Lobo, "Navigation of PowerPoint Using Hand Gestures," International Journal of Science and Research, Vol. 4, No. 1, pp. 833-837, 2015.
- R. Rajesh, D. Nagarjunan, R. Arunachalam, and R. Aarthi, "Distance Transform Based Hand Gestures Recognition for PowerPoint Presentation Navigation," Advanced Computing: An International Journal, Vol. 3, No. 3, pp. 41-48, 2012. https://doi.org/10.5121/acij.2012.3304
- M. Xu, B. Raytchev, K. Sakaue, O. Hasegawa, A. Koizumi, M. Takeuchi, et al., "A Vision-Based Method for Recognizing Non-manual Information in Japanese Sign Language," Journal of Lecture Notes in Computer Science, Vol. 1948, pp. 572-581, 2000. https://doi.org/10.1007/3-540-40063-X_75
- S. Suzuki and K. Abe, "Topological Structural Analysis of Digitized Binary Images by Border Following," Journal of Computer Vision, Graphics, and Image Processing, Vol. 30, pp. 32-46, 1985. https://doi.org/10.1016/0734-189X(85)90016-7
- J. Sklansky, "Measuring Concavity on a Rectangular Mosaic," IEEE Transactions on Computers, Vol. C-21, No.12, pp. 1355-1364, 1972. https://doi.org/10.1109/T-C.1972.223507
- J. Sklansky and V. Gonzalez, "Fast Polygonal Approximation of Digitized Curves," Journal of Pattern Recognition, Vol. 12, pp. 327-331, 1980. https://doi.org/10.1016/0031-3203(80)90031-X
- D.B. Marghitu and M. Dupac, Advanced Dynamics: Analytical and Numerical Calculations with MATLAB, Springer, New York, 2012.
- E. Carter and E. Lippert, Visual Studio Tools for Office: Using Visual Basic 2005 with Excel, Word, Outlook, and InfoPath, ebook Edition, Addison-Wesley, Massachusetts, 2006.