
A Fast and Accurate Face Tracking Scheme by using Depth Information in Addition to Texture Information

  • Kim, Dong-Wook (Dept. of Electronic Materials Engineering, Kwangwoon University) ;
  • Kim, Woo-Youl (Dept. of Electronic Materials Engineering, Kwangwoon University) ;
  • Yoo, Jisang (Dept. of Electronic Engineering, Kwangwoon University) ;
  • Seo, Young-Ho (College of Liberal Arts, Kwangwoon University)
  • Received : 2013.04.12
  • Accepted : 2013.07.22
  • Published : 2014.03.01

Abstract

This paper proposes a face tracking scheme that is a combination of a face detection algorithm and a face tracking algorithm. The proposed face detection algorithm basically uses the Adaboost algorithm, but the search area is dramatically reduced by using skin color and motion information from the depth map. Also, we propose a face tracking algorithm that uses a template matching method with depth information only. It includes an early termination scheme, based on a spiral search for template matching, which reduces the operation time with a small loss in accuracy. It also incorporates an additional simple refinement process to make the loss in accuracy smaller. When the face tracking scheme fails to track the face, it automatically goes back to the face detection scheme to find a new face to track. The two schemes were tested with several home-made test sequences and some public ones. The experimental results are compared to show that they outperform the existing methods in accuracy and speed. Also, we show some trade-offs between tracking accuracy and execution time for broader applicability.


1. Introduction

Detecting and/or tracking one or more objects (especially parts of the human body) has been researched for a long time. The application areas have expanded widely, from computer vision to security or surveillance systems, visual systems for robots, video conferencing, etc. One of the biggest applications is the HCI (human-computer interface) realized by detecting and tracking human hand(s), body, face, or eyes, which is used in various areas such as smart home systems [1]. In this paper, the target object is restricted to the human face(s).

Many previous works included both face detection and tracking, such that face detection serves as preprocessing for face tracking, as in this paper, but they are reviewed separately here. For face detection, the most frequently used or referenced method is the so-called Adaboost algorithm [2-4]. This method includes a training process, Haar-like feature extraction and classification, and cascaded application of the classifiers, although it only applies to gray images. Many subsequent studies have built on it [5-10]. [5, 6], and [9] proposed new classifiers to apply to color images, and [7] designed a classifier that included skin color and eye-mouth features, as well as Haar-like features. [8] focused on asymmetric features, to design a classifier for them. In [10], local normalization and a Gabor wavelet transform were applied before Adaboost to solve the color variation problem. Some also used Haar-like features, but designed different classifiers and refined them with a support vector machine [11]. Many other methods have used features of the face, such as the nose and eyes [12], skin-color histograms [13], and edges of the components of the face [14]. Also, many works used skin color to detect the face, most of them using the chrominance components. [15] and [16] proposed chrominance distribution models of the face, and [17] proposed a statistical model of skin color. In addition, [18] focused on detecting various poses of the face, and [19] proposed a method to detect the face at any angle, with multi-view images.

Most face tracking methods so far have also used the factors used in face detection, such as component features of the face [20-23], appearance of the face [24-29], and skin color [30-34]. Among the feature-based tracking methods, [20] used the eyes, mouth, and chin as the landmarks, [21] tracked features individually characterized by Gabor wavelets, and [22] used an inter-frame motion inference algorithm to track the features. [23] used silhouette features, as well as semantic features, to track the features online. Appearance-based tracking basically uses face shape or appearance [24, 25]. But [26] additionally used the contour of the face to cover large motions of the face, and [27] proposed constraints to temporally match the face. [28] proposed a method to learn the appearance models online, and [29] used a condensation method for efficiency. Many face tracking schemes also used skin color [30-34]. [30] proposed a modeling method based on skin color, and [31] constructed a condensation algorithm based on skin color. [32] used color distribution models to overcome the problem caused by varying illumination. Some methods used other facial factors in addition to skin color, such as facial shape [33]. Also, some methods used a statistical model of skin color, adopting a neural network to calculate the probability of skin color, with an adaptive mean shift method for condensation [34]. Besides, [35] tracked various poses of the face by using a statistical model. [36] used a template matching method to track the face, using depth as the template; it first found the hands, and then found the face to be tracked, using the hands as support information.

This paper proposes a combination of a face detection scheme and a face tracking scheme, to find and track face(s) seamlessly, even when the human goes out of the image or the scene changes. In our scheme, face detection is performed at the beginning of face tracking, or when the tracked face disappears. It is a hybrid scheme that uses features, skin color, and motion. It basically uses the method in [2-4], the Adaboost or Viola & Jones method, but we reduce the search area for the Adaboost method by using motion and skin color. Our face tracking scheme uses a template matching method with only depth information, as in [36], but it directly tracks the face without any auxiliary information. Also, it includes a template resizing scheme, to adapt to changes in the distance of the face. In addition, it includes an early termination scheme, to reduce the execution time so that it runs faster than real time, while minimizing the loss in tracking accuracy. Thus, we will show that it can be used adaptively, by considering the trade-off between execution time and tracking error.

This paper consists of six chapters. The next chapter explains the overall operation of our scheme. The proposed face detection scheme and face tracking scheme are explained in more detail in Chapters 3 and 4, respectively. Chapter 5 is devoted to finding the necessary parameters and experimenting with the proposed schemes. Finally, Chapter 6 concludes this paper, based on the experimental results.

 

2. Overall Operation

The global operation of the face tracking algorithm proposed in this paper is shown as a flow graph in Fig. 1. As mentioned before, the main proposal is the face tracking algorithm, but we also propose a scheme to reduce the calculation time of face detection. That is, the proposed face tracking method uses a template of the face(s), which consists of the position and depth information of the face, and the first template is extracted by the face detection process. Afterward, the template is updated by the tracking scheme itself, except when the face being tracked disappears from the image.

Fig. 1.Process flow of global operation

At the very beginning, the face detection process is performed. It uses both the RGB image and depth information. Basically, it uses an existing method, the Adaboost algorithm [2-4], but the search area is reduced by finding the movement of the human and the skin color. Once a face is detected, its position (x-y coordinates) and the corresponding depth information (a segment of the depth map) are taken as the template to be used in the tracking process.

The tracking process uses a template matching method that finds a block matching the template. Here, we use only depth information for this process. It also includes a scheme to resize the template, as well as the actual calculation for template matching. Also, an early termination process is incorporated to reduce the execution time.

The result of the tracking process, if it is successful, is the template of the current frame, which is used as the template for the next frame. But if it fails, the process goes back to the detection process to find a new face. This happens when the scene changes, or when the face being tracked disappears from the scope of the image. In most cases, the scene does not change and the tracked person remains within the image, so the detection process is performed only once, at the very start.

The image data we need for face detection and tracking are RGB images and depth images. Here, we assume that the two kinds of images are given with the same resolution, such as the ones captured by Kinect from Microsoft.

 

3. Face Detection Algorithm

The processing flow of the proposed face detection scheme is shown in Fig. 2. Basically, it uses the Adaboost algorithm [2-4], but the area to search for the human face(s) is restricted by using two consecutive depth images (the (i-1)th and ith) and the current RGB image (the ith).

Fig. 2.The proposed face detection procedure

To do this, the ith RGB image is examined with Eq. (1), which was taken from the skin color study in [37], to determine whether it contains a skin color region. Here, only the Cb and Cr components of the YCbCr color format are used.
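Eq. (1) itself is not reproduced here. As an illustrative sketch only, the following thresholds the chrominance channels with commonly used skin-color bounds in the spirit of [37]; the exact bounds of Eq. (1) may differ.

    import cv2
    import numpy as np

    def skin_image(bgr):
        """Binary skin image from Cb/Cr thresholds.
        The bounds below are illustrative, not necessarily the Eq. (1) values."""
        ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
        cr = ycrcb[:, :, 1]
        cb = ycrcb[:, :, 2]
        skin = (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)
        return skin.astype(np.uint8)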

The result of Eq. (1) is a binary image (we call it a skin image), in which the pixels with value '1' indicate the human skin region. But a skin region may not be found even if it exists (a false negative error), mostly due to the illumination conditions. Thus, in this case, we try once more after adjusting the color distribution by a histogram equalization method [38]. If the first attempt, or the re-trial of skin color detection, finds any skin color pixel, we define the skin region as in Fig. 3.

Fig. 3.Procedure to define a skin region

From the skin image Si(x, y), the vertical skin region image SVi(x, y) (horizontal skin region image SHi(x, y)) is obtained such that all the pixels in a column j (row k) are set to '1' if any pixel in that column j (row k) of Si(x, y) has the value '1'. Then, the final skin region image SRi(x, y) is obtained by taking the common parts of SVi(x, y) and SHi(x, y). Fig. 4 (a) shows the scheme to find the skin region image, where the horizontal and vertical gray regions correspond to SHi(x, y) and SVi(x, y), respectively, and the red-boxed regions are the defined skin region.

Fig. 4.Examples of results from face detection processes: (a) skin region; (b) movement region; (c) detected face
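A minimal sketch of the region definition in Fig. 3, assuming the row/column projection interpretation described above (the function and variable names are ours, used for illustration only):

    import numpy as np

    def region_from_projections(binary_map):
        """Fig. 3-style region definition: SV marks every column containing a
        '1' pixel, SH marks every row containing one, and the region SR is
        their intersection (bounding boxes around the '1' pixels)."""
        h, w = binary_map.shape
        col_has = binary_map.any(axis=0)                 # per-column flag
        row_has = binary_map.any(axis=1)                 # per-row flag
        s_v = np.tile(col_has, (h, 1))                   # vertical stripes
        s_h = np.tile(row_has[:, None], (1, w))          # horizontal stripes
        return (s_v & s_h).astype(np.uint8)              # SR: common parts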

Meanwhile, for the depth information, a depth difference image DDi(x, y) between the (i-1)th and ith depth images is found as in Eq. (2), where Di(x, y) is the depth value at (x, y) in depth image i. The resulting image DDi(x, y) is also a binary one, where the region with '1' defines the region with movement.

From the extracted depth difference image, the movement region map MRi(x, y) can be found in the same way as in Fig. 3, with DDi(x, y), DDVi(x, y), and DDHi(x, y) used instead of Si(x, y), SVi(x, y), and SHi(x, y), respectively. An example of finding the movement region corresponding to Fig. 4 (a) is shown in Fig. 4 (b), where the horizontal and vertical gray regions correspond to DDHi(x, y) and DDVi(x, y), respectively, and the white-boxed region is the defined movement region.
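Eq. (2) is not reproduced here; a plausible sketch, assuming the inter-frame depth difference is thresholded to a binary map (the threshold value is an assumption, not a value from the paper):

    import numpy as np

    def movement_region(depth_prev, depth_cur, diff_threshold=5):
        """Movement-region map MRi: threshold the inter-frame depth difference
        (Eq. (2)-style), then apply the same row/column projection as Fig. 3.
        The difference threshold here is an illustrative assumption."""
        dd = (np.abs(depth_cur.astype(np.int32) -
                     depth_prev.astype(np.int32)) > diff_threshold)
        h, w = dd.shape
        dd_v = np.tile(dd.any(axis=0), (h, 1))           # columns with motion
        dd_h = np.tile(dd.any(axis=1)[:, None], (1, w))  # rows with motion
        return (dd_v & dd_h).astype(np.uint8)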

Depending on the existence of the skin color region and the movement region, there are four cases for defining the area in which the Adaboost algorithm searches for face(s). When both the skin color region and the movement region exist, the search area is defined as the common part of the skin region and the movement region. When only the skin region (movement region) exists, the skin region (movement region) itself is defined as the search area. When neither the skin region nor the movement region exists, the process takes the next image frame and performs the whole process again. Finally, the Adaboost algorithm is applied to the defined search area. An example of an RGB image segment of the finally detected face is shown in Fig. 4 (c), which corresponds to Figs. 4 (a) and (b).
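The four cases reduce to a simple combination rule; a sketch (names are ours):

    def detection_search_area(skin_region, movement_region):
        """Adaboost search area: both regions -> their intersection; only one
        -> that region; neither -> None (process the next frame instead)."""
        has_skin = skin_region is not None and skin_region.any()
        has_move = movement_region is not None and movement_region.any()
        if has_skin and has_move:
            return skin_region & movement_region
        if has_skin:
            return skin_region
        if has_move:
            return movement_region
        return None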

The data from the face detection process are the coordinates and the size of the detected face, and its corresponding depth image segment. Because our purpose is face tracking rather than face detection itself, when more than one face is detected, only the information for the face nearest to the camera is sent to the face tracking process.

 

4. Face Tracking Algorithm

Taking the information of the detected face from the face detection process, or from the previous face tracking process, as the template, the face tracking process is performed as in Fig. 5. Because the proposed tracking process uses only depth information, it takes the next frame of the depth image as its other input. Each step is explained in the following.

Fig. 5.The proposed face tracking procedure

4.1 Template and search area re-sizing

The first step in the proposed face tracking scheme is to resize the template and the search area, which is the area in which to find the face being tracked. Of the two, the template is resized first, because the size of the search area depends on the resized template.

4.1.1 Template Re-sizing

If a human moves horizontally or vertically, the size of the face remains nearly the same, but for back-and-forth movement it changes. So, the template detected or updated in the detection process or the previous tracking process needs to be resized to fit the face in the current frame.

(1) Relationship between depth and size

Because the size of an object in an image depends totally on its depth, the change of the size of an object according to its depth is explained first. To do this, it is necessary to define the way to express a depth. In general, a depth camera provides a real depth value in floating-point form. Also, it usually has a distance range within which the estimated value is reliable. Let's define the range as (zR,min, zR,max), and let the depth value be expressed digitally by an n-bit word. Then, a real depth value z corresponds to a digital word Z', as in Eq. (3)

But in a typical depth map, a closer point has a larger value, obtained by converting Z' as in Eq. (4) or (5); this paper also uses this value, Z.

Now, when an object, whose real size is s and depth is z, has the size Ssensor in the image sensor of a camera whose focal length is f, the relationship between Ssensor and z or Z is as Eq. (6).

If we assume that the pixel pitch of the image sensor is Psensor, and the number of pixels corresponding to Ssensor is N, the number of pixels in the real image is the same as N, and is found as in Eq. (7).
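Eqs. (3)-(7) are not reproduced here; the following sketch captures the relationships the text describes, using the notation above (the exact rounding and scaling in the paper may differ):

    def depth_to_word(z, z_min, z_max, n_bits=8):
        """Eq. (3)-(5)-style sketch: digitize a metric depth z in (z_min, z_max)
        to an n-bit word Z', then invert it so closer points get larger values Z."""
        levels = (1 << n_bits) - 1
        z_prime = round((z - z_min) / (z_max - z_min) * levels)
        return levels - z_prime

    def object_size_in_pixels(real_size, z, focal_length, pixel_pitch):
        """Eq. (6)/(7)-style pinhole relation: the object covers roughly f*s/z
        on the sensor, i.e. N = f*s / (z * P_sensor) pixels
        (all lengths in the same metric unit)."""
        return (focal_length * real_size) / (z * pixel_pitch)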

Fig. 6 shows example plots of the relationship in Eq. (7), together with measured results. Here, the dashed line is the plot from Eq. (7), the dots are the measured values, and the solid line is the trend line. The error between the dashed line and the measured values or the trend line is due to the digitizing error (a function removing the fractional parts should be applied to the above equations) and measurement errors.

Fig. 6.Plotting for the relationship between size and depth

(2) Face depth estimation and template re-sizing

To resize the template according to Fig. 6 or Eq. (7), the depth of the face in the current frame must be re-determined. For this, we define the depth template area DTAi as all the pixels in the ith depth frame Di(x, y) corresponding to the ones in the previous template Ti-1, as Eq. (8).

The scheme is shown in Fig. 7. The template Ti-1 and the depth template area DTAi are divided into p×q blocks (each block has a×b resolution), and the average values of each block (j, k) in the template and the depth template area, respectively, are calculated to find the maximum values TAi-1max and DTAimax, as in Eqs. (9) and (10). In this paper, p and q are determined empirically.

Fig. 7.Defining the search area

Then, the size of the updated template is calculated with the size of the previous template as Eq. (11), and the template is resized accordingly (X is hor or ver, representing horizontal and vertical, respectively).
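A minimal sketch of this resizing step, assuming the template size scales with the ratio of the metric depths recovered from the two maximum block averages (size is roughly inversely proportional to metric depth, per Eq. (7)); the exact form of Eq. (11) may differ:

    import numpy as np

    def block_max_average(depth_region, p=3, q=3):
        """Maximum of the block-average depths over a p x q partition
        (the role of Eqs. (9) and (10); 3x3 is chosen in Section 5.1.1)."""
        h, w = depth_region.shape
        bh, bw = h // p, w // q
        return max(depth_region[j * bh:(j + 1) * bh, k * bw:(k + 1) * bw].mean()
                   for j in range(p) for k in range(q))

    def resized_template_size(prev_size, ta_max, dta_max, z_min, z_max, n_bits=8):
        """Eq. (11)-style sketch: convert TAi-1max and DTAimax back to metric
        depths and scale the previous (hor, ver) template size by their ratio."""
        levels = (1 << n_bits) - 1

        def to_metric(Z):
            return z_min + (levels - Z) / levels * (z_max - z_min)

        ratio = to_metric(ta_max) / to_metric(dta_max)
        return tuple(max(1, round(s * ratio)) for s in prev_size)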

4.1.2 Search area re-sizing

Although the template size has been updated, it is still necessary to find the exact location of the face in the current frame by searching an appropriate area, which we call the search area. The search area must be determined by considering the depth value of the face and the maximum amount of face movement. The first has just been considered above. For the second, we have measured the maximum movement empirically with proper test sequences, which will be explained in the experiment chapter. By considering both factors, we determine some extension of the template size as the size of the search area, as shown in Fig. 7, where ⌈x⌉ means the smallest integer not less than x, and X is hor (horizontal) or ver (vertical).
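The extension rule itself appears only in Fig. 7; a plausible sketch, assuming the search area extends the resized template by a fraction α on each side (α = 0.41 is the value chosen in Section 5.1.2):

    import math

    def search_area_size(template_size, alpha=0.41):
        """Search-area sizing sketch: extend each (hor, ver) template dimension
        by alpha per side, with the ceiling operator noted in the text.
        Whether Fig. 7 applies alpha per side or in total is an assumption."""
        return tuple(math.ceil((1 + 2 * alpha) * s) for s in template_size)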

4.2 Template matching

Once the resized template (we call it just ‘template’, from now on) and the corresponding search area are determined, the exact face location is found by a template matching. For this, we use the SAD (sum of absolute differences) value per pixel (PSAD) as the cost value, as Eq. (12), where K is the number of pixels in the template, (cx ,cy ) is the current position to be examined in the search area, and DT (i, j) ( DSA (i, j) ) is the pixel value at (i, j) in the template (search area).

The final location of the matched face template SPopt is determined as the pixel location satisfying Eq. (13).
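A minimal sketch of the PSAD cost and the full search of Eqs. (12) and (13), assuming (cx, cy) denotes the top-left corner of the examined block:

    import numpy as np

    def psad(template, search_area, cx, cy):
        """PSAD of Eq. (12): sum of absolute depth differences between the
        template and the equally sized block at (cx, cy), divided by the
        number of template pixels K (i.e. the mean absolute difference)."""
        h, w = template.shape
        block = search_area[cy:cy + h, cx:cx + w]
        return np.abs(template.astype(np.int32) - block.astype(np.int32)).mean()

    def full_search(template, search_area):
        """Eq. (13): the position SPopt minimizing PSAD over the search area."""
        h, w = template.shape
        H, W = search_area.shape
        costs = {(cx, cy): psad(template, search_area, cx, cy)
                 for cy in range(H - h + 1) for cx in range(W - w + 1)}
        return min(costs, key=costs.get)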

4.2.1 Early termination

The process of Eq. (13) requires repeating the calculation of Eq. (12) as many times as the number of pixels in the search area, which might take too much time. Our means of reducing this search time is an early termination scheme, whereby the search process is terminated when a certain criterion is satisfied. Because the amount of face movement in the assumed circumstances is usually small or none, it is more appropriate to search from smaller movements to larger ones. So we take a spiral search scheme, as in Fig. 8, which shows an example with a search area of 5×5 [pixel2]. The dashed arrows show the direction of the search, and the numbers in the blocks (pixels) are the search order. Thus, if the early termination scheme is not applied, each element of the full search sequence SSFS is examined to find the pixel SPopt, as in Eq. (14), where the size of the search area is assumed to be m×n [pixel2].

Fig. 8.Spiral search and early termination scheme

To terminate the examination earlier than at the last pixel, the search is stopped at the first pixel whose PSAD value satisfies Eq. (15),

where TET is the threshold value of PSAD for which we assume that the pixel SPET satisfying Eq. (15) is close enough to SPopt, and first{x} means the first position satisfying x. TET is determined empirically.
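A sketch of the spiral search with early termination, under the assumptions above (the spiral generator and the top-left-corner convention are ours; the value TET = 2 is the one chosen in Section 5.1.3):

    import numpy as np

    def spiral_offsets(max_radius):
        """Offsets from the search-area center in spiral order (Fig. 8),
        moving outward ring by ring."""
        yield (0, 0)
        x = y = 0
        dx, dy = 1, 0
        step = 1
        while step <= 2 * max_radius + 1:
            for _ in range(2):                      # two legs per step length
                for _ in range(step):
                    x, y = x + dx, y + dy
                    if max(abs(x), abs(y)) <= max_radius:
                        yield (x, y)
                dx, dy = -dy, dx                    # turn 90 degrees
            step += 1

    def spiral_match(template, search_area, t_et=2.0):
        """Examine positions in spiral order and stop at the first one whose
        PSAD falls below T_ET (Eq. (15)); otherwise return the best found."""
        h, w = template.shape
        H, W = search_area.shape
        cx0, cy0 = (W - w) // 2, (H - h) // 2       # spiral center
        best, best_cost = (cx0, cy0), float("inf")
        for ox, oy in spiral_offsets(max((W - w) // 2, (H - h) // 2)):
            cx, cy = cx0 + ox, cy0 + oy
            if not (0 <= cx <= W - w and 0 <= cy <= H - h):
                continue
            block = search_area[cy:cy + h, cx:cx + w]
            cost = np.abs(template.astype(np.int32) -
                          block.astype(np.int32)).mean()
            if cost < best_cost:
                best, best_cost = (cx, cy), cost
            if cost < t_et:                         # early termination
                break
        return best, best_cost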

4.2.2 Sparse search and refinement

Even though the early termination scheme reduces the search time, it could be reduced further if we are willing to sacrifice a little more accuracy. This can be done by hopping over several pixels from the currently examined pixel to the next one to be examined. In this case, the search sequence is as in Eq. (16), where the number of intervals between the current and the next pixel is p, which is called the 'hopping distance'.

If the early termination scheme is not applied, and p=1, it is the same as SSFS. If the early termination scheme is applied, and p=1, it is the same as when this sparse search scheme is not applied. Fig. 8 shows a case of p=3, where the dark pixels are the ones in the SS3.

One more step is included in this sparse search. When p > 1 and a pixel (SPqp) in the sequence satisfies Eq. (15) (early terminated), the immediate neighbor pixels are additionally examined. We call this a refinement process, and two cases are considered. The first is called 2-pixel refinement, and additionally examines the two pixels (SPqp−1, SPqp+1) just before and just after SPqp (horizontally striped in Fig. 8); the second is called 4-pixel refinement, in which two more pixels just outside of SPqp (vertically striped) are also examined. In either case, the pixel with the lowest PSAD value becomes the final pixel SPopt. If a sparse search does not find a pixel satisfying Eq. (15), the one with the minimum PSAD value is selected as the final result.
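One reading of this refinement step, assuming the extra pixels are the immediate neighbors of the early-terminated position along the spiral sequence (this indexing is an assumption for illustration):

    def refine(psad_at, stop_index, four_pixel=False):
        """After early termination at spiral index stop_index, also evaluate the
        neighboring indices (-1/+1 for 2-pixel refinement, additionally -2/+2
        for 4-pixel refinement) and keep the index with the lowest PSAD.
        psad_at(i) is assumed to return the PSAD of the i-th spiral position."""
        offsets = (-2, -1, 0, 1, 2) if four_pixel else (-1, 0, 1)
        candidates = [stop_index + o for o in offsets if stop_index + o >= 0]
        return min(candidates, key=psad_at)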

4.2.3 Feedback to face detection

If the face can no longer be tracked, for example when the human goes out of the screen or is hidden by another object, our algorithm goes back to the face detection algorithm to find another face. This is decided by a PSAD threshold value TFB, as in Eq. (17).

Eq. (17) is applied to any search sequence, no matter whether the early termination scheme or the sparse search is applied or not.

 

5. Experiments and Results

We have implemented the proposed face detection algorithm and the face tracking algorithm, and conducted experiments with various test sequences. In these experiments, we used Microsoft Visual Studio 2010, and OpenCV Library 2.4.3 in the Windows operating system. The computer used in the experiments has an Intel Core i7 3.4GHz CPU, with 16GB RAM.

5.1 Determining parameters for face tracking

First, we empirically determined the parameters defined in the previous chapter. For this, we used three home-made contents, whose names indicate the directions of movement. Their information is in Table 1, and two representative images from each sequence are shown in Fig. 9. The movements in each content were made as fast as possible, to cover more than enough movement speed compared with the assumed circumstances. All three contents were captured by Kinect® from Microsoft, so the resolution of both the RGB image and the depth image is 640×480.

Table 1.Contents used for parameter determination

Fig. 9.Representative images in each sequence: LR: (a) 18th frame; (b) 119th frame; UD: (c) 36th frame; (d) 166th frame; BF: (e) 96th frame; (f) 130th frame

5.1.1 Template segmentation for template resizing

In Fig. 7, we segmented the template and the corresponding region of the current frame into p×q blocks. Because there can be many combinations of p and q, we only take the cases where p = q is an odd number. The purpose of this segmentation is to resize the template to match the current face size. So, we estimated the relative error of the resized template. In this experiment, we used the faces extracted by applying the face detection algorithm as the reference templates. The relative error, named the template resizing error, was calculated as Eq. (18).

Fig. 10 shows the experimental results, where (a) shows the average values of the template resizing errors for various segmentations and for the three test sequences. Considering all three test sequences, 3×3 segmentation is the best. Fig. 10 (b) shows the change of the template resizing error throughout the sequences for 3×3 segmentation. The error does not exceed 3% in any frame of any sequence. Other experimental results showed that all the segmentations have very similar execution times. So, we decided on 3×3 segmentation as our segmentation scheme.

Fig. 10.Template resizing error for template segmentation: (a) average values for various segmentations, and (b) values for 3×3 segmentation

5.1.2 Size of the search area

The next parameter is the extension ratio α in Fig. 7, which determines the size of the search area. To do this, we performed two experiments. The first was to measure the actual maximum amount of movement between two consecutive frames. In this experiment, we obtained 25.8%, 23.9%, and 18.2% of the size of the template as the maximum movement for the UD, LR, and BF sequences, respectively. From this experiment, the value of α should be 0.258.

The second experiment was to measure the amount of displacement between the template Tfound found within the given search area and the best template Tbest found in the whole image. Here, both templates were found by a full search, without early termination or sparse search. We converted it to the displacement error, which is calculated by Eq. (19), where size(T) and position(T) mean the size and the position of T, respectively.
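Eq. (19) itself is not visible in this copy; a plausible form, assuming the displacement between the two template positions is normalized by the template size, would be

    displacement error [%] = 100 × ‖position(Tfound) − position(Tbest)‖ / size(Tbest)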

The result is shown in Fig. 11. The values for α < 0.21 in sequence UD were more than 20%, because the face could not be tracked properly. From this experiment, it is clear that the result obtained by estimating the maximum movement between two consecutive frames (α = 0.258) is not enough. To make the displacement error almost 0 (less than 0.1%), α ≥ 0.41 should be maintained. Therefore, we decided on α = 0.41.

Fig. 11.Displacement errors for the three test sequences to the value of α

5.1.3 Threshold value for early termination

The PSAD threshold value TET for early termination was also determined by an empirical method. Fig. 12 is the result, where both displacement errors (D-Error) by Eq. (19), and execution times (Time), are shown with average values. The BF sequence shows the lowest displacement errors on almost all the threshold values, but for the execution time, the UD sequence shows the lowest. Because our aim was to track the face in real time, with appropriately low displacement error, we have chosen TET as 2, which makes the displacement error lower than 4%, with execution time lower than 40ms.

Fig. 12.Experimental results to determine TET

5.1.4 Hopping distance and refinement search for sparse search

Another scheme to reduce the execution time is to hop from the current search pixel to the next in the spiral sequence. The hopping distance was also determined empirically. We measured the displacement error and execution time while increasing the hopping distance, with TET fixed at 2. The result is shown in Fig. 13 (a), where all the displacement errors and execution times are average values. As the hopping distance increases, the displacement error increases and the execution time decreases, as expected. Considering our aim, we took 3 as the hopping distance, which makes the displacement error less than 5% and the execution time less than 30ms.

Fig. 13.Experimental results to determine the hopping distance and refinement process: (a) hopping distance; (b) refinement process.

One more thing to decide was the refinement process that examines the pixels neighboring the early-terminated one (refer to Fig. 8). This also has two options, 2-pixel refinement and 4-pixel refinement. Because we had chosen TET=2 and a hopping distance of 3, we used them in this experiment. The result is shown in Fig. 13 (b). The displacement error of UD decreased dramatically with 2-pixel refinement and 4-pixel refinement, while the other sequences did not change much with either refinement. Also, the increase in execution time is negligible for all the sequences and all the refinement processes. So, we will apply 2-pixel refinement or 4-pixel refinement.

5.1.5 Threshold value to return to the face detection process

The final parameter to be decided is the PSAD threshold for the tracking process to return to the face detection process, on the assumption that the tracking process can no longer track the face correctly. For this, we prepared a few special sequences, where a human walks out of the screen, or behind an object, etc. Fig. 14 (a) shows an example of the experimental result of the PSAD values when the human being tracked walks out of the screen. As can be seen in the graph, the PSAD value changes dramatically from the 75th frame to the 76th frame. Images for the two frames are shown in Figs. 14 (b) and (c), respectively. Other sequences showed similar results, so it is quite reasonable to choose TFB = 15[dB].

Fig. 14.Experimental result to determine TFB: (a) PSAD value change; (b) 75th frame image; (c) 76th frame image.

5.2 Experimental results and comparison

We have experimented with the proposed face detection scheme and face tracking scheme on several test sequences. They are listed in Table 2, where the first three sequences were tested only for the detection scheme, because they had been used to extract the parameters for the tracking scheme. Lovebird1 is a multi-view test sequence from MPEG, and its resolution is 1,920×1,080. The last two were home-made with Kinect®. In Lovebird1, two persons walk side-by-side, from far away to very near the camera. The WL sequence has only one person, who sits in the scene and moves in various ways. In the S&J sequence, two persons appear at first, but the one near the camera walks out of the screen; then, the other moves around in various ways while sitting on a chair. Three representative images from each sequence are shown in Fig. 15.

Table 2.Applying test sequences

Fig. 15.The representative images for Lovebird1: (a) 18th frame; (b) 90th frame; (c) 119th frame; WL: (d) 116th frame; (e) 303rd frame; (f) 416th frame; and S&J: (g) 144th frame; (h) 182nd frame; (i) 328th frame.

5.2.1 Face detection

The experimental results for the proposed face detection scheme are summarized in Table 3, which includes the true positive rate (TP), false positive rate (FP), false negative rate (FN), and execution time per frame for each test sequence. In this table, our results are compared with the Adaboost method in [3] (V&J). TP, FP, and FN were calculated as in Eqs. (20-1), (20-2), and (20-3), respectively, where a true positive face is a face that was detected and truly resides in the image, a false positive face is one that was detected but is not truly a face, and a false negative face is a face that resides in the image but was not detected.

Table 3.The experimental results and comparison of the face detection methods

As can be seen in the table, V&J is a little better in FN, but ours is much better in TP. In particular, ours showed 0% FP. Also, the execution time of ours was about 1/10 that of V&J. This means that our scheme reduces the search area to about 1/10 of the whole image, with a little sacrifice in FN.

Table 4 compares some previous methods with ours. Note that the test sequences for each method are different: because ours uses depth information, and the others' test sequences do not contain depth information, those sequences could not be used. Also, we could not obtain implementations of the existing methods. Thus, the comparison may not be entirely fair; but we still think it is worthwhile, because it makes an indirect comparison possible. Also note that a '-' in a cell indicates that there is no information in the corresponding paper. As can be seen in the table, ours outperforms the existing methods in TP, FP, FN, and even in execution time.

Table 4.Comparison with existing methods for face detection

5.2.2 Face tracking

The experimental results of the proposed face tracking scheme for the last three sequences in Table 2 are shown in Fig. 16, where the displacement error rate and the execution time per frame are graphed over the frames of each sequence. Also, the average values are shown in Table 5. Fig. 16 and Table 5 include two refinement schemes, 2-pixel refinement and 4-pixel refinement. In this experiment, we let the scheme track the person nearest to the camera when more than one person resided in the screen (the Lovebird1 sequence, and the front part of the S&J sequence).

Fig. 16.Face tracking experimental results for displacement and execution time for the sequence of: (a) Lovebird 1; (b) WL; (c) S&J.

Table 5.Experimental results for the proposed face tracking scheme

Because the man in Lovebird1 walks toward the camera, his face becomes larger while he moves left and right repeatedly, and whenever the direction changes, he remains still for a couple of frames. This is reflected exactly in the execution time, as shown in Fig. 16 (a). Also, the execution time becomes larger as the sequence progresses, because the size of the face becomes larger. That is, as the size of the face increases, the execution time increases. This can also be seen by comparing the times before and after about the 190th frame of S&J, since the woman is nearer to the camera than the man. For reference, the distances from the camera to the man in WL, and to the woman and the man in S&J, are about 120cm, 110cm, and 160cm, respectively.

As shown in Table 5, the average displacement error was less than 3%, with less than 5ms of execution time per frame. For reference, Fig. 17 shows three example texture images corresponding to the depth template images, with 0%, 4%, and 7.8% displacement error, respectively. Considering them, an average displacement error of 3% is quite acceptable. Also, the proposed tracking scheme takes about 5ms on average to track the face per frame, and even for the Lovebird1 sequence it takes less than 8ms, which is more than enough speed for real-time tracking.

Fig. 17.Texture images corresponding to the depth templates, with displacement error ratios of: (a) 0%; (b) 4%; (c) 7.8%

Table 6 compares our method with some existing methods. Because they do not provide enough information for a fair and clear comparison, we have used the data from those papers without modification or recalculation; instead, we fitted our data to theirs as closely as possible. Most tracking methods used sequences made by their respective authors as the test sequences, and so did we. Among the three schemes in Table 5, we have chosen the 2-pixel refinement scheme, because both its error rate and its execution time are in the middle. In the table, we entered depth+RGB as the property of our sequences, but actually only depth was used. For the tracking rate, ours showed 100%, because ours changes the execution mode to face detection when it fails to track the face; this happened only once, in the S&J sequence, as explained before. For the tracking error, [29] and [36] provided the displacement amount and the root mean square error (RMSE), respectively, as the measure of their accuracy, while we provided the displacement error rate in the previous table. So, we converted our data correspondingly, and show it in parentheses. From the data in the table, it is clear that our method outperforms them in tracking accuracy and execution time.

Table 6.Comparison with existing methods for face tracking

 

6. Conclusion

In this paper, we have proposed a combination of a face detection scheme and a face tracking scheme to track a human face. The face detection scheme basically uses the Viola & Jones method [2-4], but we reduce the area to be searched by using skin color and depth information. The proposed tracking scheme basically uses a template matching method, but it only uses depth information. It includes a template resizing scheme, to adapt to the change in the size or depth of the face being tracked. It also incorporates an early termination scheme, using a spiral search for template matching with a threshold, and a refinement scheme using the neighboring pixels. If it decides, based on a threshold, that it has failed to track the face, it automatically returns to the face detection scheme to find a new face to track.

Experimental results for the face detection scheme showed a true positive detection rate of 97.6% on average, with a false positive rate of 0% and a false negative rate of 2.1%. The execution time was about 44[ms] per frame at 640×480 resolution. From the comparison with the existing methods, it was clear that ours is better in both detection accuracy and execution time.

The experimental results for the proposed face tracking scheme showed that the displacement error rate is about 2.5%, with an almost 100% tracking rate. Also, the tracking time per frame at 640×480 resolution was as low as about 2.5[ms]. These results far outperform the previous face tracking schemes. Also, we have shown some trade-offs between tracking accuracy and execution time with respect to the size of the search area, the PSAD threshold value for early termination, the hopping distance, and the refinement scheme.

Therefore, we can conclude that the proposed face detection scheme and face tracking scheme, or their combination, can be used in applications that need fast and accurate face detection and/or tracking. Because our tracking scheme can also provide higher speed at the cost of a little tracking accuracy, by increasing the early termination threshold value and the hopping distance, or by decreasing the search area and taking a simpler refinement scheme, its range of application is much broader.

References

  1. M.-H. Yang, D. J. Kriegman and N. Ahuja, "Detecting Faces in Images: A Survey," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 24, No. 1, pp. 34-58, Jan. 2002. https://doi.org/10.1109/34.982883
  2. P. Viola and M. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," Conf. on Computer Vision and Pattern Recognition, pp. 1-9, 2001.
  3. P. Viola and M. Jones, "Robust Real-time Object Detection," Intl. Workshop on Statistical and computational Theories of Vision-Modeling, Learning, Computing, and Sampling, pp. 1-25, 2001.
  4. P. Viola and M. Jones, "Robust Real-Time Face Detection," J. of Compter Vision, 57(2), pp. 137-154, 2004. https://doi.org/10.1023/B:VISI.0000013087.49260.fb
  5. C. E. Erdem, S. Ulukaya, A. Karaali and A. T. Erdem, "Combining Haar Feature and Skin Color Based Classifiers for Face Detection," Conf. Acoustics, Speech and Signal Processing, pp. 1497-1500, 2011.
  6. S. A. Inalou and S. Kasaei, "AdaBoost-based Face Detection in Color Images with Low False Alarm," Conf. Computer Modeling and Simulation, pp. 107-111, 2010.
  7. Y. Tu, F. Yi, G. Chen, S. Jiang and Z. Huang, "Fast Rotation Invariant Face Detection in Color Image Using Multi-Classifier Combination Method," Conf. EDT, pp. 211-218, 2010.
  8. J. Wu, S. C. Brubaker, M. D. Mullin and J. M. Rehg, "Fast Asymmetric Learning for Cascade Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 30, No. 3, pp. 369-382, March 2008. https://doi.org/10.1109/TPAMI.2007.1181
  9. Y.-W. Wu and X.-Y. Ai, "Face detection in color images using AdaBoost algorithm based on skin color information," Workshop on Knowledge Discovery and Data Mining, pp. 339-342, 2008.
  10. Y. Tie and L. Guan, "Automatic face detection in video sequence using local normalization and optimal adaptive correlation techniques," J. Pattern Recognition 42, pp. 1859-1868, 2009. https://doi.org/10.1016/j.patcog.2008.11.026
  11. P. Shih and C. Liu, "Face detection using discriminating feature analysis and support vector machine," J. Pattern Recognition 39, pp. 260-276, 2006. https://doi.org/10.1016/j.patcog.2005.07.003
  12. A. Colombo, C. Cusano and R. Schettini, "3D face detection using curvature analysis", J. Pattern Recognition 39, pp. 444-455, 2006. https://doi.org/10.1016/j.patcog.2005.09.009
  13. C. A. Waring and X. Liu, "Face Detection Using Spectral Histograms and SVMs," IEEE trans. Syst., Man, And Cybernetics, Vol. 35, No. 3, pp. 467-476, Jun. 2005. https://doi.org/10.1109/TSMCB.2005.846655
  14. W.-K. Tsao, A. J. T. Lee, Y.-H. Liu, T.-W. Chang and H.-H. Lin, "A Data mining Approach to Face Detection," J. Pattern Recognition 43, pp. 1039-1049, 2010. https://doi.org/10.1016/j.patcog.2009.09.005
  15. A. Sagheer and S. Aly, "An Effective Face Detection Algorithm based on Skin Color Information," Conf. Signal Image Technology and Internet Based Systems, pp. 90-96, 2012.
  16. H. Fan, D. Zhou, R. Nie and D. Zhao, "Target Face Detection using Pulse Coupled Neural Network and Skin Color Model," Conf. Computer Science & Service System, pp. 2185-2188, 2012.
  17. S. Kherchaoui and A. Houacine, "Face Detection Based on a Model of the Skin Color with Constraints and Template Matching," Conf. Machine and Web Intelligence, pp. 469-472, 2010.
  18. H.-Y. Chen, C.-L Huang and C.-M. Fu, "Hybridboost Learning for Multi-pose Face Detection and Facial Expression Recognition," J. Pattern Recognition 41, pp. 1173-1185, 2008. https://doi.org/10.1016/j.patcog.2007.08.010
  19. C. Huang, H. Ai, Y. Li and S. Lao, "High-Performance Rotation Invariant Multiview Face Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 29, No. 4, pp. 671-686, April 2007. https://doi.org/10.1109/TPAMI.2007.1011
  20. T. Maurer and C. Malsburg, "Tracking and Learning Graphs and Pose on Image Sequences of Faces," International Conf. Automatic Face and Gesture Recognition, pp. 176-181, 1996.
  21. S. McKenna, S. Gong, R. Wurtz, J. Tanner and D. Banin, "Tracking Facial Feature Points with Gabor Wavelets and Shape Models," International Conf. on Audio and Video-based Biometric Person Authentication, pp. 35-42, 1997.
  22. Q. Wang, W. Zhang, X. Tang, and H.-Y. Shum. "Realtime Bayesian 3-D Pose Tracking," IEEE Trans. Circuits Syst. Video Techn., Vol. 16, No. 12, pp. 1533-1541, Dec. 2006. https://doi.org/10.1109/TCSVT.2006.885727
  23. W. Zhang, Q. Wang, and X. Tang. "Real Time Feature Based 3-D Deformable Face Tracking," ECCV, pp. 720-732, 2008.
  24. T. Cootes and G. Edwards, "Active Appearance Models," IEEE Trans. Pattern Anal. Mach. Intell., Vol. 23, No. 6, pp. 681-685, June 2001. https://doi.org/10.1109/34.927467
  25. T. Cootes, G. Wheeler, K. Walker and C. Taylor, "View-based Active Appearance Models," Image Vision Comput. Vol. 20, No. 9, pp. 657-664, 2002. https://doi.org/10.1016/S0262-8856(02)00055-0
  26. J. -W. Sung and D. Kim, "Large Motion Object Tracking using Active Contour Combined Active Appearance Model," International Conference on Computer Vision Systems, p. 31, 2006.
  27. M. Zhou, L. Liang, J. Sun and Y. Wang, "AAM based Face Tracking with Temporal Matching and Face Segmentation," IEEE Conf. on Computer Vision and Pattern Recognition, pp. 701-708, 2010.
  28. P. Wang and Q. Ji, "Robust Face Tracking via Collaboration of Generic and Specific Models," IEEE Trans. Image Processing, Vol. 17, No. 7, pp. 1189-1199, July 2008. https://doi.org/10.1109/TIP.2008.924287
  29. Y. Lui, J. Beveridge and L. Whitley, "Adaptive Appearance Model and Condensation Algorithm for Robust Face Tracking," IEEE Tran. Syst., Man, and Cybernetics, Vol. 40, No. 3, pp. 437-448, May 2010. https://doi.org/10.1109/TSMCA.2010.2041655
  30. Y. Raja, S. McKenna, and S. Gong, "Colour Model Selection and Adaptation in Dynamic Scenes," European Conference on Computer Vision, pp. 460-474, 1998.
  31. G. Jang and I. Kweon, "Robust Real-time Face Tracking using Adaptive Color Model," International Symposium on Mechatronics and Intelligent Mechanical System for 21 Century, 2000.
  32. H. Stern and B. Efros, "Adaptive Color Space Switching for Tracking under Varying Illumination," Image Vision Computation, Vol. 23, No. 3, pp. 353-364, 2005. https://doi.org/10.1016/j.imavis.2004.09.005
  33. H. Lee and D. Kim, "Robust Face Tracking by Integration of Two Separate Trackers: Skin Color and Facial Shape," J. pattern recognition, Vol. 40, pp. 3225-3235, March 2007. https://doi.org/10.1016/j.patcog.2007.03.003
  34. P. Vadakkepat, P. Lim, L. Silva, L. Jing and L. Ling, "Multimodal Approach to Human-Face Detection and Tracking" IEEE Trans. Industrial Electronics, Vol. 55, No. 3, pp. 1385-1393, March 2008. https://doi.org/10.1109/TIE.2007.903993
  35. R. Qian, M. Sezan, K. Matthews, "A Robust Realtime Face Tracking Algorithm," International Conference on Image Processing, pp. 131-135, 1998.
  36. X. Suau, J. Ruiz-Hidalgo and J. Casas, "Real-Time Head and Hand Tracking Based on 2.5D Data", IEEE Trans. Multimedia, Vol. 14, No. 3, pp. 575-585, June 2012. https://doi.org/10.1109/TMM.2012.2189853
  37. D. Chai, et al., "Locating Facial Region of a Head-and-Shoulders Color Image," Int'l Conf. Automatic Face and Gesture Recognition, pp. 124-129, April 1998.
  38. R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd Ed., Pearson Ed. Inc., Upper Saddle River, NJ, 2008.