Adaptive Importance Channel Selection for Perceptual Image Compression

  • He, Yifan (Institute of Information Science, Beijing Jiaotong University) ;
  • Li, Feng (Institute of Information Science, Beijing Jiaotong University) ;
  • Bai, Huihui (Institute of Information Science, Beijing Jiaotong University) ;
  • Zhao, Yao (Institute of Information Science, Beijing Jiaotong University)
  • Received : 2020.01.02
  • Accepted : 2020.08.27
  • Published : 2020.09.30

Abstract

Recently, the auto-encoder has emerged as the most popular method in convolutional neural network (CNN) based image compression and has achieved impressive performance. In the traditional auto-encoder based image compression model, the encoder simply sends the features of the last layer to the decoder, which cannot allocate bits over different spatial regions in an efficient way. Besides, these methods do not fully exploit the contextual information under different receptive fields for better reconstruction performance. In this paper, to solve these issues, a novel auto-encoder model is designed for image compression, which can effectively transmit the hierarchical features of the encoder to the decoder. Specifically, we first propose an adaptive bit-allocation strategy, which can adaptively select an importance channel. Then, we multiply the generated importance mask with the features of the last layer of our proposed encoder to achieve efficient bit allocation. Moreover, we present an additional novel perceptual loss function for more accurate image details. Extensive experiments demonstrate that the proposed model achieves significant superiority over JPEG and JPEG2000 in both subjective and objective quality. Besides, our model shows better performance than state-of-the-art CNN-based image compression methods in terms of PSNR.

Keywords

1. Introduction

In the past 20 years, digital media technology has achieved great progress. Since everyone can share their daily life with friends on the internet by taking pictures and videos, and the resolution of pictures and videos taken by mobile phones increases year by year, the amount of stored data is growing at an enormous rate. Due to the limitations of data transmission technology, image compression has become a key technique for various multimedia transmission services. Image compression typically starts from obtaining a description of an image, then quantizing the description, and finally recovering the image from the obtained description. A general image compression system mainly includes three components, i.e., an encoder, a quantizer, and a decoder, which form a codec. The image compression codecs in typical coding standards, such as JPEG [1], JPEG2000 [2], and BPG [3] using intra-coded HEVC [4], rely on hand-crafted image transformations and separately optimized codec components, which is not optimal for compression performance.

An image compression system needs to deal with quantization and to control the trade-off between the reconstruction error d and the bitrate R. To minimize d + λR, there are two directions. On the one hand, the rate term, which is determined by the entropy H of the latent image representation, can be optimized with an accurate entropy rate estimator. On the other hand, the distortion term, which measures the difference between the input image of the encoder and the image reconstructed by the decoder, can be minimized by designing a better encoder and decoder.

Recently, inspired by the powerful learning ability of deep convolutional neural networks (CNNs) in image restoration tasks [5,6], many methods [7-11] adopt CNNs to form different frameworks for lossless image compression [7] or lossy image compression [8-11], which have achieved significant improvement over many traditional image compression codecs. Mentzer et al. [7] propose a practical learning-based lossless image compression framework, named L3C, which introduces a fully parallelizable hierarchical probabilistic model for entropy coding that can be optimized in an end-to-end way. This pioneering method shows significant superiority over many popular engineered codecs, such as PNG, WebP, and JPEG2000.

Besides lossless image compression, some CNN-based methods [8-10] have been proposed for learned lossy compression. In [8], Ballé et al. propose an end-to-end lossy image compression network, which consists of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation, and produces better compression performance than the standard JPEG and JPEG2000 methods. However, this method treats the transmitted features equally and directly feeds the output of the encoder to the corresponding decoder, so it cannot focus on the important information across spatial locations under limited bits for effective image compression. In [9], Mentzer et al. focus on the rate-distortion (R-D) behavior of the latent image representation and present a conditional probability model to optimize the R-D trade-off. The authors formulate a spatially aware network, which uses an importance map to help the network spatially attend to the most important regions of the image with different numbers of bits. However, this approach always chooses the first feature map of the last layer in the encoder as the importance map, which cannot adaptively emphasize informative spatial regions for various inputs. In [10], motivated by the observation that the information content is highly variable across different areas of an image, Li et al. develop a CNN-based end-to-end system for content-weighted image compression, which can allocate content-aware bits under the guidance of a content-weighted importance map. The importance map is produced by a convolutional neural network, and its sum serves as a continuous alternative to discrete entropy estimation to control the compression rate.

However, such auto-encoder based image compression methods regard image compression at different bit rates as independent tasks, which leads to large storage requirements. To address this issue, in [11], Toderici et al. propose an LSTM recurrent network for variable-rate image compression, which can provide variable compression rates during deployment without retraining the network. Although this method provides variable compression rates in one model, it has no bit-allocation strategy and leads to less accurate reconstruction. In addition, the frameworks in [8-10] cannot make full use of the hierarchical features extracted by the encoder, and thus cannot produce a reconstructed image with better quality and detail.

In this paper, to solve the problems mentioned above, we implement a novel auto-encoder framework for learned lossy image compression. Instead of directly sending the features of the last layer of the encoder to the corresponding decoder, we aggregate the features from each downsampling layer of our encoder to exploit the features under different receptive fields and obtain a more accurate feature representation. Besides, different from choosing the first channel as the importance map in [9], we put forward an adaptive importance channel selection strategy by comparing the sum of each channel. Then a multiply operation is conducted on the importance mask from the selected channel and the last-layer features of the encoder to achieve efficient bit allocation. Furthermore, previous deep learning-based image compression models simply minimize the mean square error (MSE) between the reconstructed image and the original input, which generates overly smooth compressed results. Therefore, we present an additional novel perceptual loss function and combine it with the reconstruction loss to optimize our network, which can produce compressed images with visually pleasant details.

Our main contributions are summarized as:

• To utilize the hierarchical features extracted by the encoder, a novel auto-encoder framework is proposed, which transmits the hierarchical features from the encoder to the decoder to reconstruct images with better quality.

• We propose an adaptive importance channel selection strategy to achieve efficient bit allocation.

• The perceptual loss is generated by the proposed encoder rather than an extra pre-trained network to improve visual details.

• Extensive experiments on the Kodak PhotoCD image dataset demonstrate that the proposed method performs favorably against state-of-the-art compression approaches in terms of PSNR.

The remainder of this paper is organized as follows. In Section 2, the related work is introduced. The proposed method is presented in Section 3, including the hierarchical auto-encoder framework, the adaptive importance map, and the encoder perceptual loss. The experimental results and comparisons with other methods are demonstrated in Section 4. The conclusion of this paper is presented in Section 5.

2. Related Work

In early years, the auto-encoder model was first proposed by Hinton to address unsupervised learning problems [12]. Traditionally, it is composed of two or three neural network layers and applies the back-propagation (BP) [13] technique to learn a nonlinear transformation for compressing and reconstructing the input data. It aims at learning an identity mapping:

\(\mathrm{D}(E(x))=\tilde{x}\)       (1)

which makes the output approximately equal to the input.

Therefore, the auto-encoder is naturally suitable for the data compression task, and many works [8, 14, 15] use the traditional auto-encoder model to compress images and have made great progress. On the other hand, other works [9, 10, 16] modify the traditional auto-encoder structure or propose new network structures for better compression quality.

2.1 Bit allocation strategy

The conventional encoder assigns the same number of bit symbols to each spatial area of the original image. However, in practice, the image information at different spatial locations is highly variable. The importance map tries to automatically allocate fewer symbols to smooth regions (e.g., clouds) and more symbols to complex regions (e.g., a house with an exquisite and complicated pattern). In [9], an importance mask, produced from the first feature map of the encoder's last layer, is applied after the last layer of the encoder for spatial bit allocation. In [10], a content-aware convolutional neural network is used to learn an importance map to achieve different bit rates.

2.2 Multi-scale structure for image compression

In [16], a new auto-encoder structure is presented that exploits the multi-scale features of input images. The proposed model consists of two components: a multi-scale lossy auto-encoder and a multi-scale lossless coder for entropy coding. The lossy auto-encoder directly connects the encoder and decoder at different depths to encode multi-scale image features, and the encoder sends part of each layer's output to the corresponding layer of the decoder. The lossless coder simultaneously encodes the quantized multi-scale features to produce the transmitted symbols, which decreases the encoding time.

2.3 Variable compression rate

In deep learning-based image compression methods, a remaining problem is how to compress an image at different bit rates. Several options have been explored, including training multiple models [8], learning quantization-scaling parameters [14], and transmitting a subset of the encoded representation with a recurrent structure [11,17].

The architecture of [11] consists of an encoder and decoder based on recurrent neural networks (RNNs), a binarizer, and a neural network that models the distribution of the latent variables. It addresses variable-rate compression from two aspects: designing a residual encoder with a powerful feature extraction ability, and designing a probability estimation model to capture long-term dependencies between the patches of the input image. In [17], three improvements over previous research are introduced. First, a new recurrent architecture is proposed, which propagates spatial information more effectively between the network's hidden layers. Second, besides lossless entropy coding, a bit allocation algorithm is adopted to adequately exploit the limited number of bits in complex image regions. Finally, the results demonstrate that training with a combination of pixel-wise loss and structural similarity (SSIM) can improve the compression performance according to multiple metrics. These RNN-based methods provide a way to realize variable compression rates. However, they produce less accurate reconstructions at each compression rate.

2.4 Generative compression

In [18], the concept of generative compression is described as the compression of data using generative models. For generative image modeling, the authors use variational auto-encoders [19] as an alternative to Generative Adversarial Networks (GANs). Their results show that generative compression is more resilient to bit error rates than traditional image compression methods at very low bitrates. However, their model has only demonstrated the effectiveness of generative compression on small images below 64 × 64 and has limited effect on larger images.

In [20], a new GAN-based network for extreme learned image compression is proposed, which works on full-resolution images, targets bitrates below 0.1 bpp, and obtains visually pleasing images at significantly lower bitrates than previous methods. The proposed method consists of unconditional and conditional GANs. The unconditional GANs generate the overall image content with lower image quality, and the conditional GANs utilize the corresponding semantic label map to reconstruct parts of the image with better detail. Their results show that at extremely low bitrates, the proposed method can reconstruct the original image with better visual quality.

3. Proposed Method

In this section, the proposed network architecture is first introduced. Then we describe each of the three main techniques used in our model: the adaptive importance map, the multi-scale auto-encoder, and the encoder perceptual loss.

3.1 Overview

Given an original input image \(x \in \mathbb{R}^{H \times W \times C}\), we wish to design an image compression system that compresses the image into as few bits as possible while keeping the restored image as close as possible to the original. In an image compression system, the procedure of obtaining the compressed bitstream of the input can be described as follows:

\(s=H_{e}\{Q[E(x ; \varphi)]\}\)       (2)

where \(E(x ; \varphi): \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}\) represents the encoder, which maps the input to a latent representation \(z=E(x ; \varphi)\). The quantizer \(Q: \mathbb{R} \rightarrow B\) discretizes the coordinates of \(z\) to \(N=|B|\) centers, obtaining \(\hat{z}\) with \(\hat{z}_{i}=Q(z_{i}) \in B\), which takes a limited number of values and can be losslessly encoded into a bitstream \(s\) by an entropy encoder \(H_{e}(\cdot)\). When the decoder receives the bitstream, the process of restoring the final image can be formulated as

\(\tilde x = D[H_d(s) ; \theta]\)       (3)

Here, \(\tilde x\) is the corresponding reconstructed image recovered from the compressed binary symbols. The decoder \(D(\cdot\,; \theta)\) forms the reconstructed image \(\tilde x\) from the quantized latent representation \(\hat{z}\), which is in turn losslessly decoded from the bitstream by the entropy decoder \(H_{d}(\cdot)\).
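To make the data flow concrete, the following minimal Python sketch mirrors Eqs. (2) and (3); encoder, decoder, quantize, entropy_encode, and entropy_decode are hypothetical stand-ins for the components defined above, not the authors' implementation.

def compress(x, encoder, quantize, entropy_encode):
    z = encoder(x)                 # z = E(x; phi): latent representation
    z_hat = quantize(z)            # z_hat = Q(z): discretize to the centers in B
    return entropy_encode(z_hat)   # s = He(z_hat): lossless bitstream

def decompress(s, decoder, entropy_decode):
    z_hat = entropy_decode(s)      # z_hat = Hd(s): lossless recovery
    return decoder(z_hat)          # x_tilde = D(z_hat; theta)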

3.2 Hierarchical Auto-encoder Structure

In this subsection, we describe the proposed hierarchical structure of the encoder-decoder. As shown in Fig. 1, the proposed network is composed of four parts: an encoder-decoder for hierarchical feature extraction, a bit allocation module, a quantizer, and an entropy encoder-decoder. The encoder takes an image as input and produces four outputs \(f_{k}^{H_{k} \times W_{k} \times N}\) (\(k=1,2,3,4\)) at different scales. Next, for \(k=1,2,3\), three convolutional layers are employed to reduce these outputs to one channel each, and the results are downsampled to the same size as \(f_{4}^{H_{4} \times W_{4} \times N}\). These features are concatenated together as \(z=\left[\tilde{f}_{1}^{H_{4} \times W_{4} \times 1}, \tilde{f}_{2}^{H_{4} \times W_{4} \times 1}, \tilde{f}_{3}^{H_{4} \times W_{4} \times 1}, f_{4}^{H_{4} \times W_{4} \times N}\right]\). After that, the concatenated hierarchical features \(z\) are sent to the bit allocation module to produce an importance mask \(m\), and we multiply \(m\) with \(z\) to achieve efficient bit allocation.


Fig. 1. The proposed multi-scale auto-encoder architecture for image compression.

\(\tilde{\mathbf{z}}=z \otimes m\)       (4)

where \(z, m \in \mathbb{R}^{H_{4} \times W_{4} \times(N+3)}\) and \(m_{i,j,k} \in[0,1]\). Then, the generated feature \(\tilde z\) is quantized and arithmetic encoded (AE) to obtain \(s\).
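As a concrete illustration, a minimal PyTorch sketch of the encoder-side aggregation follows. The layer widths, kernel sizes, and the stride-1 first layer are our assumptions (chosen so that \(f_4\) lands on the \(H/8 \times W/8\) grid of Section 3.3), and ReLU stands in for the GDN activation used in the actual model (Section 4.1).

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalEncoder(nn.Module):
    # Sketch of the hierarchical aggregation in Section 3.2:
    # 1x1 projections of f1..f3, downsampling to the f4 grid, concatenation.
    def __init__(self, n=128):
        super().__init__()
        strides = [1, 2, 2, 2]
        chans = [3, n, n, n]
        self.down = nn.ModuleList(
            nn.Conv2d(c, n, 5, stride=s, padding=2) for c, s in zip(chans, strides))
        # 1x1 convolutions projecting f1, f2, f3 to a single channel each
        self.proj = nn.ModuleList(nn.Conv2d(n, 1, 1) for _ in range(3))

    def forward(self, x):
        feats = []
        for layer in self.down:
            x = F.relu(layer(x))   # ReLU here; the paper uses GDN
            feats.append(x)
        h4, w4 = feats[3].shape[-2:]
        # project f1..f3 to one channel and downsample to the f4 grid
        side = [F.interpolate(p(f), size=(h4, w4), mode='bilinear',
                              align_corners=False)
                for p, f in zip(self.proj, feats[:3])]
        # z = [f1~, f2~, f3~, f4]: an (N+3)-channel bottleneck
        return torch.cat(side + [feats[3]], dim=1)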

When the binary symbols are transmitted to the decoder, the arithmetic decoder (AD) first decodes them. Then, the first, second, and third channels of the decoded features are separately upsampled to the sizes of the corresponding decoder layers to obtain the decoder inputs at different scales \(\left(\hat{f}_{1}^{H_{1} \times W_{1} \times 1}, \hat{f}_{2}^{H_{2} \times W_{2} \times 1}, \hat{f}_{3}^{H_{3} \times W_{3} \times 1}, \hat{f}_{4}^{H_{4} \times W_{4} \times N}\right)\). To reconstruct the original image, the remaining channels of \(\hat{z}\) are directly sent to the last layer of our decoder, and at the same time the upsampled features are concatenated into the corresponding decoder layers to provide multi-scale information:

\(l_{D}^{3}=l_{D}^{4}\left(\hat{f}_{4}^{H_{4} \times W_{4} \times N}\right)\)       (5)

\(\tilde{x}=l_{D}^{s}\left(\operatorname{concat}\left(l_{D}^{s+1}, \hat{f}_{s}^{H_{s} \times W_{s} \times 1}\right)\right)\)       (6)

where \(l^s_D\) represents the output of the s-th decoder layer; Eq. (6) is applied recursively for \(s = 3, 2, 1\), and \(\tilde x\) is the final reconstructed image obtained at \(s = 1\). \(\operatorname{concat}(\cdot)\) denotes the concatenation operation.
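A matching decoder sketch, under the same assumptions as the encoder sketch above (illustrative layer shapes, ReLU in place of IGDN), shows how Eqs. (5) and (6) unroll.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDecoder(nn.Module):
    # Sketch of Eqs. (5)-(6): the three 1-channel side features are upsampled
    # and concatenated into the matching decoder layers.
    def __init__(self, n=128):
        super().__init__()
        self.up4 = nn.ConvTranspose2d(n, n, 5, stride=2, padding=2, output_padding=1)
        self.up3 = nn.ConvTranspose2d(n + 1, n, 5, stride=2, padding=2, output_padding=1)
        self.up2 = nn.ConvTranspose2d(n + 1, n, 5, stride=2, padding=2, output_padding=1)
        self.up1 = nn.Conv2d(n + 1, 3, 5, padding=2)

    def forward(self, z_hat):
        side, f4 = z_hat[:, :3], z_hat[:, 3:]          # [f1~, f2~, f3~] and f4
        l4 = F.relu(self.up4(f4))                      # Eq. (5): deepest layer, f4 only

        def with_side(l, k):                           # Eq. (6): upsample fk~, concat
            fk = F.interpolate(side[:, k:k + 1], size=l.shape[-2:],
                               mode='bilinear', align_corners=False)
            return torch.cat([l, fk], dim=1)

        l3 = F.relu(self.up3(with_side(l4, 2)))        # s = 3 uses f3~
        l2 = F.relu(self.up2(with_side(l3, 1)))        # s = 2 uses f2~
        return torch.sigmoid(self.up1(with_side(l2, 0)))  # s = 1 uses f1~, gives x~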

3.3 Adaptive Importance Map

In the previous method [9], the importance mask is added at the end of the encoder for spatial bit allocation. The authors choose the first feature map of the final output features generated by the encoder as the importance map, which cannot adaptively emphasize informative spatial regions for various inputs.

For the features produced by the encoder, we observe that a large number of high values causes more bits to be allocated during transmission. Since the feature maps produced by different convolutional kernels take different values, a feature map containing many large values differs considerably from those containing only small values. In the entropy coding stage, a feature map with many high values consumes more bits, which indicates that it contains more information than the others. At the same time, it is crucial to choose a feature map that contains abundant information as the importance map for more effective bit allocation. As a result, in our network, we select the feature map with the largest sum of all its values as our importance map.

The process of choosing the importance map can be described as follows. Given an input image x of size \(H \times W \times 3\), the encoder E has three stride-2 convolution layers and the bottleneck \(z\) has C channels. The dimension of \(z\) and \(\hat{z}\) is therefore \(\frac H 8 \times \frac W 8 \times C\).

We adaptively choose as the importance map the n-th channel of the encoder's last layer, i.e., the channel with the largest sum of values:

\(n=\underset{k}{\operatorname{argmax}} \sum_{i, j} f_{i, j, k}\)       (7)

In practice, the range of the importance map values \(f_n\) cannot be used directly to produce the importance mask. We need to transform the range of the \(f_n\) values, that is:

\(\tilde{f}_{n}=C \times \operatorname{sigmoid}\left(f_{n}\right), \quad \tilde{f}_{n} \in(0, C)\)       (8)

Finally, the importance map has size \(\frac H 8 \times \frac W 8 \times 1\) and is expanded into a mask m of the same size as z by a simple function:

\(m_{i, j, k}=\left\{\begin{array}{cl} 1, & \text { if } k+1<\tilde{f}_{i, j, n} \\ \tilde{f}_{i, j, n}-k, & \text { if } k \leq \tilde{f}_{i, j, n} \leq k+1 \\ 0, & \text { if } \tilde{f}_{i, j, n}<k \end{array}\right.\)       (9)

where \(\tilde f _{i,j,n}\) denotes the value of the importance map at spatial location \((i,j)\), and k denotes the channel index of the mask m.
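A compact PyTorch sketch of Eqs. (7)-(9) follows; selecting one channel per sample and the parameter name num_channels (the C of Eq. (8), i.e., the number of channels to be masked) are our framing. Note that Eq. (9) collapses to a single clamp.

import torch

def importance_mask(f, num_channels):
    # f: encoder last-layer features, shape (B, K, H', W').
    b = f.shape[0]
    n = f.sum(dim=(2, 3)).argmax(dim=1)            # Eq. (7): channel with largest sum
    f_n = f[torch.arange(b), n].unsqueeze(1)       # selected map, shape (B, 1, H', W')
    f_n = num_channels * torch.sigmoid(f_n)        # Eq. (8): values in (0, C)
    k = torch.arange(num_channels, device=f.device,
                     dtype=f.dtype).view(1, -1, 1, 1)
    # Eq. (9) as a clamp: 1 where f_n >= k+1, f_n - k in between, 0 where f_n <= k
    return (f_n - k).clamp(0.0, 1.0)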

3.4 Encoding Perceptual Loss

Recently, many perceptual computation algorithms [21-23] have been proposed. Work on perceptual losses shows that visually high-quality images can be generated by defining and optimizing a perceptual loss function based on high-level features.

The traditional loss function calculates the pixel-level distance between the ground-truth image and the generated image, which makes each pixel of the generated image as similar as possible to the original image. If the generated image has a few pixel offsets from the original image, the pixel-level loss function will report a large discrepancy, even though the generated image is visually very similar to the original.

Considering that humans cannot perceive slight pixel offsets between two images, a perceptual loss based on high-level features can better evaluate the visual similarity between two images. The perceptual loss typically requires a pre-trained network to extract high-level features, and a common choice is a VGG-Net trained on the ImageNet dataset. That is:

\(d_{\text {feat }}^{l}(x, \widetilde{x})=\frac{1}{C_{l} W_{l} H_{l}}\left\|V G G_{l}(x)-V G G_{l}(\widetilde{x})\right\|_{2}^{2}\)       (10)

where \(l\) indexes the l-th layer of the VGG-Net, whose output is a feature map of shape \(C_{l} \times W_{l} \times H_{l}\). However, in image or video compression tasks, the auto-encoder structure restricts the depth of the encoder and decoder, because the encoder and decoder generally need to be mirror-symmetric. If the number of encoder layers is increased by N layers, the total number of auto-encoder layers increases by 2N layers. Therefore, under memory limitations, adding a pre-trained model would make the encoder and decoder shallower, which leads to degradation of the reconstructed image quality.

Instead of using a pre-trained model, we use the encoder itself to compute the perceptual loss. In our proposed auto-encoder framework, the encoder extracts features from the input, which can be used to compute the perceptual loss, that is:

\(d_{\text {feat }}(x, \tilde{x})=\frac{1}{C W H}\|E(x)-E(\tilde{x})\|_{2}^{2}\)        (11)

where the encoder output is a feature map of shape \(H \times W \times C\), and \(E(x)\) represents the final encoder output for an input image \(x\). \(\tilde x\) is the reconstruction of \(x\) produced by the decoder.
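As a sketch, Eq. (11) can be computed directly from the encoder. Whether gradients should flow through \(E(x)\) is not specified here, so detaching the target is our assumption.

import torch
import torch.nn.functional as F

def encoder_perceptual_loss(encoder, x, x_tilde, detach_target=True):
    # Eq. (11): MSE between encoder features of the original and the
    # reconstruction; mse_loss's 'mean' reduction supplies the 1/(CWH)
    # normalization.
    target = encoder(x)
    if detach_target:
        target = target.detach()   # assumption: treat E(x) as a fixed target
    return F.mse_loss(encoder(x_tilde), target)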

3.5 Loss Function

In this section, we describe the loss function used to train our model. Optimizing the trade-off between image reconstruction distortion and the bit rate is a central objective in image compression, and we adopt it as part of our loss function to learn the compression and reconstruction of an image.

However, Section 3.4 has analyzed the disadvantage of pixel-level loss functions: the traditional MSE distortion makes the decoder reconstruct overly smooth images. Therefore, we propose the new perceptual loss as another part of the distortion term to enhance the detail of the reconstructed image. Suppose the mini-batch input images are \(x = \{x^{(1)}, x^{(2)}, \ldots, x^{(B)}\}\) and the masked encoder outputs are \(\tilde{z} = \{\tilde{z}^{(1)}, \tilde{z}^{(2)}, \ldots, \tilde{z}^{(B)}\}\); our objective function can be described as follows:

\(\mathrm{L}=\frac{1}{B} \sum_{b}\left(\lambda_{2}\left(d\left(x^{(b)}, \tilde{x}^{(b)}\right)+\lambda_{1} d_{f e a t}\left(x^{(b)}, \tilde{x}^{(b)}\right)\right)+L_{R}\left(Q\left(\tilde{z}^{(b)}\right)\right)\right)\)       (12)

where \(L_R\) is the rate loss, which describes the entropy of the compressed representation. In our model, we adopt the MSE (Mean Squared Error) as \(d(\cdot,\cdot)\), and \(d_{feat}(\cdot,\cdot)\) is the encoder perceptual loss described in Section 3.4.
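A sketch of the full objective, reusing the encoder_perceptual_loss function above; rate_loss is a hypothetical stand-in for \(L_R(Q(\tilde{z}))\), and the default weights follow Section 4.1.

import torch.nn.functional as F

def compression_loss(x, x_tilde, z_tilde, encoder, rate_loss,
                     lam1=10.0, lam2=0.01):
    # Eq. (12) for one mini-batch: lam2 * (d + lam1 * d_feat) + L_R.
    # lam1 = 10 follows Section 4.1; lam2 is swept over [0.001, 0.02]
    # to reach different bitrates.
    d = F.mse_loss(x_tilde, x)                             # pixel-level d(x, x~)
    d_feat = encoder_perceptual_loss(encoder, x, x_tilde)  # sketch from Sec. 3.4
    return lam2 * (d + lam1 * d_feat) + rate_loss(z_tilde)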

 

4. Experiments

Our hierarchical auto-encoder image compression models are trained on a subset of ImageNet [24], which includes 33,600 images larger than 128 × 128. During training, these images are cropped into 128 × 128 patches, and these patches are fed to our network as the original inputs. After training, we evaluate the performance of our network on the image compression task on the Kodak PhotoCD [25] image dataset, which consists of 24 natural images of size 512 × 768 or 768 × 512.

4.1 Parameter Setting

In our experiments, we set the number of convolutional kernel output channels N according to the bitrate: 128 when the bitrate is lower than 0.5 bpp and 192 otherwise. Then, different values of the trade-off parameter λ2 in the range [0.001, 0.02] are chosen to obtain different bitrates. The weight λ1 of the encoder perceptual loss term is set to 10, and the other network parameters are shown in Fig. 1.

The generalized divisive normalization (GDN) proposed in [26] is chosen as the activation function in the encoder, and its inverse (IGDN) is used in the decoder. In the entropy coding stage, the probability distribution of the latent representation is modeled in the same way as in [27]. During training, in order to backpropagate gradients through the non-differentiable quantizer, we add uniform noise to the latent representation to replace the quantizer, as in [8].
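The noise-based quantizer surrogate can be sketched as follows; rounding to integer centers at test time is our assumption about the center set B.

import torch

def quantize(z, training):
    # Surrogate as in [8]: additive uniform noise in (-0.5, 0.5) keeps
    # gradients flowing during training; hard rounding at test time.
    if training:
        return z + torch.empty_like(z).uniform_(-0.5, 0.5)
    return torch.round(z)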

4.2 Performance Evaluation

We compare the performance of our proposed method with the existing image compression standard formats JPEG and JPEG2000 and with state-of-the-art CNN-based image compression methods. In this paper, image distortion is evaluated by the Multi-Scale Structural Similarity (MS-SSIM) [28] and the Peak Signal-to-Noise Ratio (PSNR), while the compression ratio is evaluated in bits per pixel (bpp), which is calculated as the number of bits used to code the image divided by the number of pixels.
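For reference, the bpp and PSNR quantities reduce to a few lines; this sketch assumes intensities normalized to [0, 1] and that the coded bitstream length in bits is known. MS-SSIM requires the multi-scale procedure of [28] and is omitted here.

import math
import torch

def bpp(num_coded_bits, image):
    # Bits per pixel: coded bits divided by the pixel count H * W.
    h, w = image.shape[-2:]
    return num_coded_bits / (h * w)

def psnr(x, x_tilde, max_val=1.0):
    # Peak signal-to-noise ratio in dB between original and reconstruction.
    mse = torch.mean((x - x_tilde) ** 2).item()
    return 10.0 * math.log10(max_val ** 2 / mse)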

Fig. 2 shows the R-D curves of different compression methods on the Kodak dataset. In terms of MS-SSIM, our proposed method achieves performance superior to the existing image compression standard formats (JPEG, JPEG2000) and the deep learning-based methods ([8,11]). Moreover, when PSNR is used for evaluation, the deep learning-based methods ([9, 11, 29]) perform poorly, but our model still maintains a good level of performance.


Fig. 2. Comparison of the rate-distortion curves by different methods: (left) PSNR, (right) MS-SSIM.

Finally, we provide subjective comparisons between our compression results and those of popular codecs in Fig. 3. Because each codec can only compress an image to a coarse set of output bit rates, we choose for the other codecs bitrates that are the same as or larger than those produced by our model, so as to give the other image compression methods an advantage in terms of bitrate. The results in Fig. 3 indicate that images compressed by the standard compression methods usually perform well when evaluated with PSNR but poorly when evaluated with MS-SSIM, whereas our model enables the compressed images to perform well under both PSNR and MS-SSIM.


Fig. 3. Images produced by different compression systems at different compression rates. From left to right: ground truth, JPEG, JPEG2000, and Ours.

4.3 Ablation Study

4.3.1 Adaptive importance map

As described in detail in Section 3.3, we adaptively choose an importance map to dynamically adjust the bit allocation over different feature channels so that the spatial locations of an image are encoded effectively. To prove the advantage of the adaptive selection model, we trained three auto-encoders \(M\), \(M^1_I\), and \(M^*_I\), where \(M^*_I\) chooses the feature map of the encoder's last layer with the largest sum as the importance map, \(M^1_I\) uses the first feature map as the importance map, and \(M\) has no bit-allocation module. During training, \(M\), \(M^1_I\), and \(M^*_I\) are all set with \(N = 128\) and trained for the same number of iterations. Table 1 shows the MS-SSIM and PSNR results evaluated on the Kodak PhotoCD image dataset.

Table 1. Importance Channel Selection Experiments


These results show that, no matter which channel is chosen as the importance map, the addition of an importance map improves the compression performance. At the same time, our strategy of adaptively choosing the importance map gives the largest boost over model \(M\), improving MS-SSIM from 0.9593 to 0.9608 and PSNR from 29.96 to 30.18.

Furthermore, Fig. 4 shows the visualization of all channels of the latent representation of \(M\), which displays the information discrepancy between different candidate importance maps. Among these feature maps, the 26th channel has the largest sum and has been upsampled for better observation; it obviously contains more semantic information than the others.


Fig. 4. Visualization of the latent representation in model 𝑀 at a median-bpp operating point

This channel is therefore selected as the importance channel, which indicates that the proposed channel selection strategy favors channels with rich semantic information. Furthermore, the selected importance channel comes directly from the features of the encoder's last layer and needs no additional convolutional operator. Introducing an extra semantic segmentation network might provide more semantic information, but it would increase the computational complexity and the number of network parameters.

4.3.2 Encoder perceptual loss

The details of the encoder perceptual loss are described in Section 3.4. We choose the encoder perceptual loss as a feature-level constraint term in our loss function because it makes our model pay attention to the details of the reconstructed image and does not require extra memory beyond the auto-encoder structure during training. We trained two auto-encoders \(M\) and \(M_p\), where \(M\) only uses MSE as the distortion term and \(M_p\) uses the combination of MSE and the encoder perceptual loss. The entropy rate terms of the objective function are the same in \(M\) and \(M_p\).

Fig. 5 shows that this combination guides our image compression model to reconstruct images with better quality. Note that the proposed encoder perceptual loss does not require an additional pre-trained network to obtain high-level features, which reduces the GPU load during the training stage. Since the perceptual loss needs the encoder to extract high-level features, it cannot be used as the network loss alone. Therefore, we adopt the joint loss of the perceptual loss and the MSE loss as our reconstruction loss term.


Fig. 5. PSNR and MS-SSIM comparison between the model 𝑀 and the model 𝑀p

5. Conclusion

In this paper, we introduced three techniques: the adaptive importance channel, the multi-scale auto-encoder network, and the encoder perceptual loss. Our experiments show that these techniques boost performance. The proposed adaptive importance channel enables our model to allocate bits adaptively and improves its performance on MS-SSIM and PSNR. Training with the encoder perceptual loss and the multi-scale auto-encoder structure provides further improvements in reconstructing perceptual structures, such as sharp edges and detailed textures. Additionally, our methods are a worthwhile choice for other auto-encoder compression networks to boost their performance.

This work was supported in part by National Natural Science Foundation of China (No. 61972023) and Fundamental Research Funds for the Central Universities (2019JBZ102, 2019YJS031, 2018JBZ001).

References

  1. G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, 38(1), xviii-xxxiv, 1992. https://doi.org/10.1109/30.125072
  2. A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 still image compression standard," IEEE Signal Processing Magazine, 18(5), 36-58, 2001. https://doi.org/10.1109/79.952804
  3. F. Bellard, "BPG Image Format," 2014. Accessed: 2017-01-30.
  4. J. Lainema, F. Bossen, W.-J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1792-1801, Dec. 2012. https://doi.org/10.1109/TCSVT.2012.2221525
  5. F. Li, H. Bai, Y. Zhao, "FilterNet: Adaptive Information Filtering Network for Accurate and Fast Image Super-Resolution," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 6, pp. 1511-1523, 2019. https://doi.org/10.1109/tcsvt.2019.2906428
  6. J. Kim, J. Kwon Lee, K. Mu Lee, "Accurate image super-resolution using very deep convolutional networks," in Proc. of the IEEE conference on computer vision and pattern recognition, 1646-1654, 2016.
  7. F. Mentzer, E. Agustsson, M. Tschannen, et al., "Practical full resolution learned lossless image compression," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 10629-10638, 2019.
  8. J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in Proc. of International Conference on Learning Representations, 2017.
  9. F. Mentzer, E. Agustsson, M. Tschannen, et al., "Conditional probability models for deep image compression," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 4394-4402, 2018.
  10. M. Li, W. Zuo, S. Gu, et al., "Learning convolutional networks for content-weighted image compression," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 3214-3223, 2018.
  11. G. Toderici, D. Vincent, N. Johnston, et al., "Full resolution image compression with recurrent neural networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 5306-5314, 2017.
  12. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
  13. Y. LeCun, B. E. Boser, J. S. Denker, et al., "Handwritten digit recognition with a back-propagation network," in Proc. of Advances in neural information processing systems, 396-404, 1990.
  14. L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," in Proc. of International Conference on Learning Representations, 2017.
  15. T. Dumas, A. Roumy, and C. Guillemot, "Autoencoder based image compression: can the learning be quantization independent?," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, 1188-1192, 2018.
  16. K. M. Nakanishi, S. Maeda, T. Miyato, et al., "Neural multi-scale image compression," in Proc. of Asian Conference on Computer Vision. Springer, Cham, 718-732, 2018.
  17. N. Johnston, D. Vincent, D. Minnen, et al., "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 4385-4393, 2018.
  18. S. Santurkar, D. Budden, and N. Shavit, "Generative compression," in Proc. of Picture Coding Symposium, 258-262, 2018.
  19. D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. of International Conference on Learning Representations, 2014.
  20. E. Agustsson, M. Tschannen, F. Mentzer, et al., "Generative adversarial networks for extreme learned image compression," in Proc. of the IEEE International Conference on Computer Vision, 221-231, 2019.
  21. J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. of European Conference on Computer Vision, 694-711, 2016.
  22. M. Jian, K.-M. Lam, J. Dong, et al., "Visual-patch-attention-aware saliency detection," IEEE Transactions on Cybernetics, vol. 45, no. 8, pp. 1575-1586, 2015. https://doi.org/10.1109/TCYB.2014.2356200
  23. M. Jian, W. Zhang, Y. Hui, et al., "Saliency detection based on directional patches extraction and principal local color contrast," Journal of Visual Communication and Image Representation, vol. 57, pp. 1-11, 2018. https://doi.org/10.1016/j.jvcir.2018.10.008
  24. J. Deng, W. Dong, R. Socher, et al., "ImageNet: A large-scale hierarchical image database," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 248-255, 2009.
  25. Kodak PhotoCD dataset.
  26. J. Ballé, "Efficient nonlinear transforms for lossy image compression," in Proc. of Picture Coding Symposium, 248-252, 2018.
  27. J. Ballé, D. Minnen, S. Singh, et al., "Variational image compression with a scale hyperprior," in Proc. of International Conference on Learning Representations, 2018.
  28. Z. Wang, E. P. Simoncelli, A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. of The Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, 2, 1398-1402, 2003.
  29. J. Lee, S. Cho, S. K. Beack. "Context-adaptive entropy model for end-to-end optimized image compression," in Proc. of International Conference on Learning Representations, 2019.