DOI QR코드

DOI QR Code

Low-Complexity Sub-Pixel Motion Estimation Utilizing Shifting Matrix in Transform Domain

  • Ryu, Chul (Dept. of Information and Communication Engineering, Dougguk University) ;
  • Shin, Jae-Young (Dept. of Information and Communication Engineering, Dougguk University) ;
  • Park, Eun-Chan (Dept. of Information and Communication Engineering, Dougguk University)
  • Received : 2015.09.24
  • Accepted : 2016.02.03
  • Published : 2016.07.01

Abstract

Motion estimation (ME) algorithms supporting quarter-pixel accuracy have been recently introduced to retain detailed motion information for high quality of video in the state-of-the-art video compression standard of H.264/AVC. Conventional sub-pixel ME algorithms in the spatial domain are faced with a common problem of computational complexity because of embedded interpolation schemes. This paper proposes a low-complexity sub-pixel motion estimation algorithm in the transform domain utilizing shifting matrix. Simulations are performed to compare the performances of spatial-domain ME algorithms and transform-domain ME algorithms in terms of peak signal-to-noise ratio (PSNR) and the number of bits per frame. Simulation results confirm that the transform-domain approach not only improves the video quality and the compression efficiency, but also remarkably alleviates the computational complexity, compared to the spatial-domain approach.

Keywords

1. Introduction

The recent advances in mobile communication technologies facilitate diverse multimedia services in a wireless environment, and thus, the demands of high-quality video services in mobile devices are continuously increasing. In order to satisfy these demands, video compression is an imperative technique because high-quality video requires a great amount of information to be transmitted and received over a bandwidth-limited wireless channel. ITU-T and ISO jointly developed and proposed H.264/AVC aiming to provide high-quality video at substantially lower bit rates than the previous standards (e.g., MPEG-2 and H.263) without increasing the design complexity substantially. Therefore, H.264/ AVC became one of the most prevailing video codecs for recording, compression, and distribution of video contents, e.g., Internet video streaming and HDTV broadcasting, and it is widely embedded in various consumer electronics such as digital cameras/camcorders, digital TV set-top boxes, mobile phones, and CCTVs, as well desktops and laptops.

The essence of video compression is to reduce spatial and temporal redundancies, and a motion estimation (ME) algorithm is the key part of video compression to remove temporal redundancy among consecutive frames in the video sequence. In literature, ME algorithms have been continuously and widely studied by many researchers to improve the compression efficiency and/or to lower the computational complexity, because the computational complexity of ME takes up about 60%-80% of total computational complexity of video compression [1]. For the purpose of motion estimation, a block matching algorithm is usually used to estimate a motion vector by finding a location that gives the minimum matching error within the search area in the previous frame. A video encoder transfers the motion vectors found by the ME algorithm and the errors between the current and the best-matched blocks using entropy coding. More precise motion vectors can reduce the temporal redundancy effectively, because they lead to less amount of error between the current and the target block found by motion estimation.

Motion vector does not always occur in integer-pixels, it may also occur in sub-pixels due to changes of brightness among frames, nonlinear motion, or objects being appeared or disappeared in the boundary of pixel. Therefore, it cannot be expected to obtain precise motion information with a ME algorithm supporting integer-pixel accuracy, and it is necessary to adopt a sub-pixel ME algorithm for achieving more accurate motion vectors. For this reason, MPEG-2 and H.263 utilize half-pixel accuracy ME and H.264 further expands it to the level of quarter-pixel accuracy to achieve better video quality than the existing video coding standards.

Conventional ME algorithms are usually performed in the spatial domain. However, the sub-pixel ME algorithms in the spatial domain result in a serious increment of computational complexity of video encoders due to embedded interpolation algorithms. To overcome this drawback, this paper proposes a low-complexity ME algorithm based on a transform-domain approach. By using shifting matrix in the transform domain and predefined coefficient of interpolation, the proposed ME algorithm significantly reduces the computational complexity without impairing the video quality. The simulation results confirm that the proposed transform-domain approach outperforms the conventional spatial-domain approach in terms of the peak signal-to-noise ratio (PSNR) and the number of bits per frame, which are effective measures for the video quality, compression efficiency, and computational complexity.

The paper is organized as follows: Section 2 describes the ME algorithm in the transform domain. In Section 3, a shifting matrix algorithm is proposed for quarter-pixel accuracy ME. Section 4 presents simulation results comparing the performance of spatial-domain ME algorithms with that of transform-domain ME algorithms. Section 5 concludes and addresses the utilization of the proposed algorithm.

 

2. Motion Estimation in Transform Domain

Since video encoding and decoding are performed in a unit of block, block-based ME algorithms are preferred. Block-based ME algorithms can be classified as spatial-domain algorithms and transform-domain algorithms. The common spatial-domain ME algorithms are block matching algorithm and gradient-based algorithm, while the typical transform-domain ME algorithms are discrete cosine transform (DCT) and wavelet-based algorithm [2-6].

Among the transform-domain algorithms, the wavelet-based algorithm spends less time in searching and matching however, it incurs a high complexity in estimating motion vectors and increases the bit rate to transmit motion vectors in all the sub-bands. On the other hand, DCT-based algorithm is widely used in various video compression standards, and thus, it maintains a high compatibility to many existing standards.

DCT transforms the spatial-domain data into the transform-domain coefficients. One of its desirable features is that the energy of transformed coefficients is concentrated on the low frequency bands, which can be a basis for high efficiency of video compression. In general, the spatial-domain motion estimation requires the process of inverse DCT (IDCT) in order to utilize the spatial-domain information of the restored frame which is used as a reference frame in the estimation process, as shown in Fig. 1. Here, Q and IQ denote quantization and inverse quantization, respectively. On the other hand, IDCT can be omitted in the transform-domain ME since the motion estimation can be performed using coefficients transformed by DCT, as illustrated in Fig. 2 [7]. One of the key advantages of transform-domain ME is the low computational complexity, which is originated from the simple encoder structure without IDCT.

Fig. 1.Motion estimation and compensation of video encoder in the spatial domain

Fig. 2.Motion estimation and compensation of video encoder in the transform domain

The shifting matrix algorithm proposed by Plompen showed that a prediction block fpred can be represented with the horizontal and vertical conversion of neighboring blocks [8]. Fig. 3 shows an example that four neighboring blocks in the previous frame, f1, f2, f3, and f4, are used to estimate the fpred which is the best-matched block to the current block fcu rr .

Fig. 3.Estimation of current block from the best-matched block composed of neighboring blocks in the previous frame

The prediction block fpred can be expressed with a vertical shifting matrix hi and a horizontal shifting matrix vi, i.e.,

where, hi and vi in Eq. (1) are defined as follows.

Here, the displacement matrix Dn is defined as

where, In is a n × n unit matrix. For example, Fig. 4 demonstrates a process of predicting fpred with ∆x = 2, ∆y = 0, and Eq. (5) and (6) represent hi used in this example.

Fig. 4.Example of estimation process of fpred with ∆x = 2, ∆y = 0

Applying orthogonality and separability of DCT, fpred , DCT coefficient of the prediction block, can be expressed from Eq. (1) as follows:

Note that fpred can be obtained by carrying out eight operations of matrix multiplication and three operations of matrix subtraction. However, the amount of computation can be greatly reduced by utilizing the characteristic of energy concentration of DCT. Accordingly, Eq. (7) can be derived as follows, in order to relieve the computational complexity by taking the advantage of DCT transformed block Fi, i.e.,

All Fi(m, n) with zero in Eq. (8) can be excluded from computations so that the efficiency of prediction can be further improved using the scarcity of prediction block Fi. Considering that the value of coefficients in high frequency band are almost zero, it is necessary to redistribute the coefficients of Fi in the sequence of zigzag scanned order as

The recursive equation derived from Eq. (9) provides an efficient way to calculate Fi by adjusting non-zero factor of Fi. Furthermore, the computational complexity can be efficiently reduced according to the block size and the number of non-zero factors of Fi . From Eq. (9), the operation of outer product of Vi and Hi needs to be performed only with non-zero factors of Fi, and Eq. (9) can be expressed in a recursive form as

Here, is a 4 × 4 zero matrix and k increases from one to the total number of non-zero DCT factors in the block.

 

3. Shifting Matrix for Sub-Pixel Accuracy ME

The spatial-domain ME supporting sub-pixel accuracy utilizes interpolation algorithms which often up-sample more than four times of its samples. Therefore, the performance of motion estimation depends on the interpolation algorithm applied to it. For example, the performance of bilinear interpolation incorporated in the ME with half-pixel accuracy is degraded, compared to the higher-order interpolation algorithms [9]. Meanwhile, the higher-order interpolation algorithms, such as b-spline interpolation or cubic interpolation, each of which is given in Eq. (11) and Eq. (12), respectively, provide better prediction performance than the bilinear interpolation at the cost of large computational complexity. Accordingly, H.264 can enhance the image quality by utilizing a ME with quarter-pixel accuracy; however it exacerbates the computational complexity at the same time.

This section presents a shifting matrix algorithm supporting quarter-pixel accuracy of ME and proposes a transform-domain ME with b-spline or cubic interpolation schemes, with which the performance can be enhanced compared to the bilinear interpolation. In order to apply b-spline or cubic interpolation, the shifting matrix should be represented as follows.

Considering the quarter-pixel accuracy of ME, . If n < 0, Dn and use 4 × 4 zero matrices, otherwise if n > 4, and . In Eqs. (13) and (14), ⌊d⌋ and ⌈d⌉ denote the rounding-up and rounding-down values of d, respectively.

In the case of bilinear interpolation, the frequency information in both sides of interpolated location is used to estimate the motion at the sub-pixel accuracy. On the other hand, in the case of b-spline or cubic interpolation, the frequency information in four locations is required. Therefore, when |d| < 1 or |d| > 3 , the frequency information beyond the boundary of reference block can be utilized. However, in order to obtain frequency information in a specific location from the block converted to the transform domain, all of the block information including the specific location is required, which causes an unnecessary access of memory. To resolve this problem, the vertical or horizontal intra prediction mode is adopted in predicting a specific location [10].

For instance, if the current frame fcurr moves in y-axis at the amount of −4 < d < −3, the information beyond boundary of the search area is required and the problem of memory allocation happens. However, if the intra prediction mode is applied, the information beyond boundary can be estimated by duplicating information in the search area upwards as indicated in Fig. 5. In this way, the problem of unnecessary memory allocation can be effectively avoided. The following Eqs. (16) and (17) show the vertical shifting matrix in the ME with quarter-pixel accuracy for the example illustrated in Fig. 5.

Fig. 5.Estimation of information beyond boundary using intra prediction mode

It is worthwhile to note that the proposed mechanism does not require IDCT and alleviates the computational complexity by utilizing the recursive equation. Table 1 compares the number of addition and multiplication operations used in the spatial-domain ME and the transform-domain ME when the block size is N × N and the search area is p. For the typical values of N = 4 and p = 8, the transform-domain ME decreases the number of addition and multiplication operations by about 27 and 3.5 times, respectively, compared to the spatial-domain ME, which confirms the outstanding decrease of computational complexity due to the transform-domain approach.

Table 1.Comparison of computational complexity in spatial-domain ME and transform-domain ME

In order to compare the computational complexities of spatial-domain ME and transform-domain ME, two indices of γ+ and γ* are introduced as

where, and are the numbers of addition operations in the spatial-domain ME and transform-domain ME, respectively, and and are the corresponding numbers of multiplication operations. Fig. 6(a) and 6(b) show γ+ and γ* when N ranges from 2 to 8 and p ranges from 2 to 10, respectively. In these ranges of N and p, γ+ lies between 0.024 and 0.063, while γ* lies between 0.14 and 0.57; that is, the transform-domain ME remarkably decreases the numbers of addition and multiplication operations by about 16~42 times and 2~7 times, respectively, compared to the spatial-domain ME. Moreover, as shown in Fig. 6, both γ+ and γ* almost linearly increase with respect to the increase of block size N, but they are hardly affected by search area p.

Fig. 6.Relative computational complexity of transform-domain ME to spatial-domain ME

 

4. Simulation and Discussion

In this section, the performances of several ME algorithms are evaluated and compared via simulations. In the simulations, H.264/AVC video coder is used, the quarter-pixel accuracy ME is considered, and the sum of absolute difference is used as a criterion to determine the motion vector. The performance is evaluated with two indices, PSNR and the number of bits per frame (BPF), which are widely used to measure video quality and compression efficiency. The performance is compared by considering the following several ME algorithms.

• H.264: This is a baseline algorithm as standardized by ITU-T and ISO. It is the most typical spatial-domain ME algorithm and uses a 4-tap FIR filter as the interpolation scheme.

• CSD/BSD: These two are the spatial-domain ME algorithms with the cubic and b-spline interpolation schemes, respectively.

• CTD/BTD: These two are the proposed transform-domain ME algorithms with the cubic and b-spline interpolation schemes, respectively.

• HCL/HBL: These two correspond to the enhanced versions of CTD and BTD, respectively. In order to further mitigate the computational complexity, a hybrid approach is adopted, i.e., the bilinear interpolation is applied to half-pixel and the cubic (HCL) or b-spline (HBL) interpolation is applied to quarter-pixel.

Note that H.264, CSD, and BSD are the spatial-domain ME algorithms while the others are the transform-domain ME algorithm as proposed in this paper.

Table 2 and 3 demonstrate the performance of several ME algorithms using akiyo and football sequences in CIF format with different value of quantization parameter (QP), respectively. In these tables, the values in bold type indicate the highest PSNR and lowest BPF for each QP. In akiyo sequence (see Table 2), the transform-domain ME algorithms.

Table 2.Performance comparison of several motion estimation algorithms for akiyo sequence

Table 3.Performance comparison of several motion estimation algorithms for football sequence

However, when QP≥21, the transform-domain algorithms require slightly higher bit-rate than the spatial-domain algorithms, which results from the increase of quantization error with respect to the increase of QP.

In the case of football sequence, as shown in Table 3, the transform-domain algorithms have higher PSNR than the spatial-domain algorithms, regardless of QP values. In terms of BPF, CTD requires more bits than the spatial- domain algorithms for some specific values of QP (e.g., QP=3 and QP≥21). This is because the quantization errors are accumulated when estimating locations at half-pixel and quarter-pixel. When estimating the location of certain sub-pixels such as 1.25, 1.5, 1.75, 3.25, 3.5 and 3.75 by using the shifting matrix, some reference locations are beyond the boundary and are forced to be duplicated as shown in Fig. 5. Interpolating these locations causes the estimation errors, leading to the decrease of PSNR. This problem can be effectively dealt with the hybrid algorithms, HCL and HBL. By applying the differentiated interpolation scheme in half-pixel location (bilinear interpolation) and quarter-pixel location (cubic interpolation in HCL and b-spline interpolation in HBL), the quantization error is reduced, which contributes to the increase of PSNR accordingly.

Fig. 7(a) and 7(b) show the values of PSNR for 50 frames with akiyo and football sequences, respectively. Here, the value of QP is fixed to 18. As shown in Fig. 7, the transform-domain algorithms achieve higher PSNR than the spatial-domain algorithms, on the whole. Also, it is shown that HCL outperforms the others for the most frames in the case of akiyo sequence (see Fig. 7(a)), while HBL and BTD attain the highest value of PSNR in the case of football sequence (see Fig. 7(b)). The value of PSNR is high in the order of CSD, BSD, and H.264 among the spatial-domain algorithms, while it is high in the order of the hybrid algorithms (HCL and HBL), BTD and CTD among the transform-domain algorithms. When the interpolation is performed in the spatial domain, the surrounding pixels of target location are used to produce an interpolated pixel; however, the proposed transform-domain algorithms use the DCT coefficient instead. In the case of block-based 2D DCT, most of the information in the block is used to obtain the specific frequency component. This is the reason why the performance of specific spatial-domain ME does not retain a consistently in the transform domain.

Fig. 7.Performance comparison of motion estimation in spatial domain and transform domain (QP =18)

 

5. Conclusion

This paper proposed the low-complexity transform-domain ME algorithm supporting quarter-pixel accuracy. By utilizing the shifting matrix in the transform domain, the proposed algorithm can attain high quality of video and high efficiency of compression, while reducing the computational complexity substantially. The simulation results verified that the proposed algorithm not only provides higher PSNR but also requires smaller number of bits per frame compared to the conventional ME algorithms in the spatial domain. Moreover, due to the utilization of recursive equation and lack of IDCT block, the proposed transform-domain ME algorithm significantly relieved the computational complexity, which is the critical problem of the spatial-domain ME algorithms. As a result, the proposed algorithm is quite suitable for the real-time or streaming video service in mobile devices or embedded systems that have the limited capability of computing power, memory space, battery lifetime, and communication bandwidth. The proposed algorithm is also applicable to the next-generation video compression standard of H.265 by revising the shifting matrix to cope with the variable block size from 4 × 4 to 32 × 32. The gain of compression efficiency that can be achieved by the proposed algorithm is further intensified with a large block size such as 32 × 32, by utilizing the low-complexity recursive equation [11].

References

  1. S. Agha and V. M Dwyer, "Algorithm and VLSI architectures for MPEG-4 motion estimation," Electronic System and Control Division Research, 2003.
  2. J. Oh, “Improved sub-block matching algorithm,” Journal of Korean Institute of Communications and Information Sciences, vol. 33, no. 7, pp. 628-633, Jul. 2010.
  3. Y. Keller and A. Averbuch, “Fast gradient methods based on global motion estimation for video compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 4, pp. 300-309, Apr. 2003. https://doi.org/10.1109/TCSVT.2003.811360
  4. Q. H. Nguyen, V. A. Nguyen, C. V. Trinh, K. Q. Dinh, Y. Park, and B. Jeon, “Block-based transform-domain measurement coding for compressive sensing of images,” Journal of Korean Institute of Communications and Information Sciences, vol. 39A, no. 12, pp. 746-755, Dec. 2014. https://doi.org/10.7840/kics.2014.39A.12.746
  5. C. M. Yang and K. C. hung, “Embedded image compression scheme using rate-distortion optimized block coding of wavelet coefficients,” Journal of Korean Institute of Communications and Information Sciences, vol. 39A, no. 11, pp. 625-636, Nov.2014. https://doi.org/10.7840/kics.2014.39A.11.625
  6. J. Song and B. Yeo, “A fast algorithm for DCT-domain inverse motion compensation based on shared information a macroblock,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, no. 5, pp. 767-775, Aug. 2000. https://doi.org/10.1109/76.856453
  7. R. Kleihorst and F. Caberera, “Implement of DCT-domain motion estimation and compensation,” in Proceedings of IEEE Workshop on Signal Processing Systems (SIPS), pp. 53-62, Oct. 1998.
  8. R.H.J.M. Plompen, B. F. Schuurink, and J. Biemond, “A new motion-compensated transform coding scheme,” IEEE Transactions on Acoutstics, Speech and Signal Process, vol. 1, pp. 371-374, 1985.
  9. M. Kang, J. Heo, and C. Ryu, “Half-pixel accuracy motion estimation algorithm in the transform domain for H.264,” Journal of Korean Institute of Communi- cations and Information Sciences, vol. 33, no. 11, pp. 917-924, Nov. 2011.
  10. T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560-576, Jul. 2003. https://doi.org/10.1109/TCSVT.2003.815165
  11. D. Yoon and Y. Ho, “Fast mode decision method for HEVC in depth video,” Journal of Korean Institute of Communications and Information Sciences, vol. 37, no. 1A, pp. 51-56, Jan. 2012. https://doi.org/10.7840/KICS.2012.37A.1.51