Block Based Enhancement using Deep Learning for Conversion of Low Resolution AVS Video to High Resolution HEVC Video



I. INTRODUCTION

A. Background
Nowadays, ultra-high definition video has become a key demand in the multimedia industry, and various works have been proposed to meet this objective. High efficiency video coding (HEVC) was introduced by the Joint Collaborative Team on Video Coding (JCT-VC) [1]. On the other hand, the Audio Video coding Standard (AVS) workgroup of China has developed a new audio-video coding standard to fulfill the requirements of the multimedia industry [2]-[4]. AVS achieves visual quality similar to H.264/AVC at a different computational cost, and it presents a valuable substitute for H.264 within the video entertainment market in China. Two coding groups are especially prominent in the video transcoding industry: the Video Coding Experts Group (VCEG) and the Moving Picture Experts Group (MPEG); their main joint effort is HEVC. Converting the AVS format to the HEVC format is therefore a valuable step for the video coding industry. The demand for HEVC-based devices and applications is increasing with the fast-growing internet. According to Cisco's data-traffic forecasts, 80% to 90% of global internet traffic will consist of high definition video content in the future. With the rapid growth of multimedia content, the demand for better visual quality and higher-resolution video keeps increasing [5].
It is therefore necessary to reduce the large volume of video data via video compression coding standards such as HEVC [17], [19]-[22], [24], [25] and H.264/AVS. High resolution video has many applications in different fields, such as healthcare [6], computer vision [7], and wireless multimedia transmission [8].
The main goal of HEVC is to reduce the bitrate by up to 50% while providing visual quality comparable to its predecessor standard, AVC/H.264. HEVC introduces many prominent features; one of them is the extension of macroblock sizes. By extending these blocks, the visual quality of low resolution video can be improved more easily. Different types of transcoders have been used to convert one bitstream format into another [9]-[11]. The main purpose of transcoding is to bridge the gap caused by rapid changes in the digital world and the growing volume of video content in different formats such as AVI, VP8, MP4, and VP9. In the past, video transcoding was limited to moving video files from one digital device to another, but it has since become widespread and convenient.
Two basic types of video transcoder are commonly used. One is called a homogeneous transcoder and the other a heterogeneous transcoder. In a homogeneous transcoder, the source and target bitstreams use the same format; for example, a low resolution HEVC bitstream can be converted into a high resolution HEVC bitstream and vice versa. In heterogeneous transcoding, the bitstream formats differ; for example, a low resolution HEVC bitstream can be converted into a target bitstream in another standard (e.g., high resolution AVC/H.264) and vice versa [12], [13]. This digital-to-digital conversion is important when the target device supports only a lower video capacity than the original size. Transcoding is also sometimes needed for resolution conversion between coding standards and may depend on the resolution of different devices; for example, tablets or mobile devices may lack the capacity to display high definition video, so high spatial resolution videos are downsampled to low spatial resolution and vice versa. For spatial downscaling, the discrete cosine transform (DCT) scheme has shown significantly higher visual quality than pixel-domain low-pass filtering schemes [14]. However, some applications need to show a standard definition television (SDTV) bitstream on high definition mobile devices, which involves upsampling the original bitstream using spatial enhancement techniques [9], [10].
Our proposed algorithm uses heterogeneous transcoding because it provides conversion between various standards, such as H.264 to MPEG, H.264 to MPEG-4, and MPEG-4 to MPEG-2 transcoders. A heterogeneous transcoder requires a syntax conversion unit and must adapt the directionality of the motion vectors (MVs) and the picture characteristics such as resolution, type, and rate, so implementing this technique poses many challenges [15]. Because the output sequence has a different encoding format and undergoes spatio-temporal subsampling, the motion compensation loops of the encoder and decoder in a heterogeneous transcoder are very complex. The quadtree structure size is also a challenge in heterogeneous transcoding: the quadtree structure in HEVC uses a larger coding unit size (64 × 64) for estimating object movement in a video frame, while H.264/AVS uses a 16 × 16 unit size [16]. Upsampling a low resolution frame into a high resolution frame is likewise a well-known challenge in video transcoding.

II. LITERATURE REVIEW
Various techniques have been proposed in the last two decades to overcome such problems [17]-[21]. Video enhancement has many applications, such as in medicine [22], underwater video enhancement [23], entertainment [24], and crime investigation [25]. A lightweight MBLLEN network for image and video enhancement using convolutional neural networks is proposed in [26]. Its main idea is to extract rich features at different levels, apply enhancement via many subnets, and then construct the output image via multi-branch fusion; this improves image quality in several respects. Huang et al. [27] proposed a method that adds high frequency information to video frames; it does not perform well when the noise ratio in the frame increases. Chow-Sing et al. [14] proposed a fast intra transcoding method based on the discrete cosine transform (DCT) and prediction modes for converting H.264/AVC to high efficiency video coding. It uses the DCT coefficients and the intra-prediction information embedded in the H.264 bitstream to predict the coding depth map for depth limitation. Linwei et al. [16] proposed a fast H.264/AVC-to-HEVC transcoder based on machine learning that exploits the similarity of block partitions, using feature selection to predict the quadtree coding unit partition in HEVC. Johan et al. [28] proposed a spatially misaligned HEVC transcoder with computational scalability, which minimizes the complexity of the composition process and drops some information from the original bitstream. Mamoona et al. [29] proposed a transparent encryption technique with scalable video communication at lower latency, quantifying the gains in reduced delay and the distortion arising from transparent encryption. Chia-Hung et al.
[1] proposed a coding-unit-complexity-based method for efficient HEVC-to-SHVC transcoding with quality scalability, built on coding unit depth and mode predictions. A survey of video compression is given in [30], and a comprehensive survey of deep-learning-based video enhancement methods is presented in [31].
Assuncao et al. [15] proposed an open-loop transcoder that cascades the decoder and encoder directly. This transcoding architecture contains no feedback loop for the drift error; its aim is to minimize complexity and adjust the DCT coefficients to reduce the bitrate. Keesman et al. [32] proposed a closed-loop transcoder whose feedback loop minimizes the transcoding distortion by compensating the drift error. Wee et al. [10] proposed a spatial domain video transcoder that performs dynamic bitrate adaptation through rate control. Eleftheriadis et al. [33] proposed a frequency domain transcoder that uses variable-length decoding (VLD) and inverse quantization to extract each block's DCT values at the decoder end; in this technique, the motion compensation residual errors are encoded through variable length coding (VLC) and re-quantization [2], [34]. A scalable omni-directional video coding scheme for virtual reality applications is presented in [35]. Xie et al. [29] proposed a hybrid domain transcoder that trades off video quality against computational cost. In [36], an end-to-end convolutional neural network is proposed for video compression.

Fig. 2: An example of block partition similarity between H.264/AVC and HEVC [16].
The methods discussed above have limitations in terms of computational complexity and data visualization. In this paper, the downsampled AVS bitstream is classified into sub-blocks based on visual characteristics, motion vector (MV) features, and transform coefficients [37]. Blocks of most interest (BOMI) contain large motion in the video, blocks of less interest (BOLI) contain slight motion, and blocks of non-interest (BONI) are almost static. After classifying these blocks, we apply a super-resolution deep learning approach to the BOMI and BOLI blocks, which enhances the visual quality of the low resolution blocks. The BONI blocks attract no viewer interest, so they are treated as background blocks. Enhancing only the region of interest in a video saves transcoding time as well as super-resolution complexity compared with full-frame enhancement. The main contributions of our work are summarized as follows:
• We propose a block detection and classification technique that uses the visual characteristics, transform coefficients, and motion vector (MV) features of a video. It classifies all blocks into three types: blocks of most interest, blocks of less interest, and blocks of non-interest.
• A deep learning based enhancement technique is used to convert low resolution video to high resolution. It enhances only the region of interest in a video instead of the full frame, saving computational time.
• The multi-scale super-resolution technique is progressive in both design and training during frame upsampling. To make the training process smooth, we use curriculum learning in the proposed framework.
• We carry out detailed experiments on six different benchmarks to evaluate the efficiency of the proposed method.
The rest of the paper is organized as follows. Section III presents the classification of blocks. The super-resolution framework for the different blocks is described in Section IV. Section V presents the simulation results. Finally, the conclusion is given in Section VI.

III. CLASSIFICATION OF BLOCKS
Humans are most attentive to areas of motion in video content, so the sensitivity of the human visual system is significant for classifying the blocks in video frames. Blocks with richer texture or motion attract more viewer attention; these are known as blocks of most interest (BOMI). Blocks that are very near the blocks of most interest and have low motion are called blocks of less interest (BOLI). Background objects frequently have no motion and are less appealing to viewers; these are known as blocks of non-interest (BONI). The super-resolution algorithm can then preserve the video quality of BOMI and BOLI well. Regions containing newly appearing objects, camera motion, or fast-moving objects can be determined from the motion vectors (MVs) and transform coefficients, so the regions of interest (ROI) can be detected in a video frame using coding parameters.
In AVS, video frames are encoded using macroblocks (MBs), each with a fixed dimension of 16 × 16. We evaluate the coding information of the macroblocks to classify each coding tree unit (CTU). Each macroblock of a frame is used to estimate whether the frame is intra or inter coded. During block detection, each video frame is divided into the three block types shown in Fig. 1. The number of MBs encoded in intra mode in a frame is used to decide the intra or inter mode. Let $a_i$ denote the $i$-th MB in one frame.
Eq. (2) is used as the criterion for block classification; it also classifies a CTU as intra or inter coded, where $T$ is a threshold whose optimized value is 8. The motion vectors of the current frame are averaged as
$$FMV_{avg}(i) = \frac{1}{m} \sum_{k=1}^{m} BMV_{MB_k}(k),$$
where $FMV_{avg}$ is the frame-average motion vector magnitude, $i$ is the current frame index, and $MB_k$ denotes the $k$-th macroblock of the current frame. $BMV_{MB_k}(k)$ represents the magnitude of the motion vector, defined as
$$BMV_{MB_k}(k) = \sqrt{BMV_a^2(k) + BMV_b^2(k)},$$
where $BMV_a(k)$ and $BMV_b(k)$ are the horizontal and vertical motion vector (MV) components in a frame. In these calculations, $N_F$ denotes the total number of macroblocks and $m$ the total number of motion vectors in the frame.
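As a concrete sketch of the motion statistics above, the per-macroblock MV magnitude, the frame average, and the intra/inter threshold test can be written as follows. This is a minimal illustration; the function and variable names are ours, not taken from the original implementation:

```python
import math

def block_mv_magnitude(mv_a, mv_b):
    # |BMV(k)| = sqrt(BMV_a^2(k) + BMV_b^2(k)): magnitude of the
    # horizontal and vertical motion vector components of one MB.
    return math.sqrt(mv_a ** 2 + mv_b ** 2)

def frame_avg_mv(block_mvs):
    # FMV_avg: mean MV magnitude over the m motion vectors of a frame.
    mags = [block_mv_magnitude(a, b) for a, b in block_mvs]
    return sum(mags) / len(mags)

def is_intra_ctu(intra_mb_count, T=8):
    # A CTU is treated as intra coded when the number of intra-coded
    # MBs exceeds the threshold T (optimized value 8 in the text).
    return intra_mb_count > T
```

For example, a macroblock with MV components (3, 4) has magnitude 5, and a frame whose blocks carry MVs (3, 4) and (0, 0) has a frame average of 2.5.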
The background macroblocks are determined from the transform coefficient values of each frame. The average transform coefficient value of the current frame is estimated as
$$FTC_{avg}(i) = \frac{1}{N_F} \sum_{k=1}^{N_F} BTC_{MB_k}(k),$$
where $FTC_{avg}$ is the frame-average transform coefficient value, $i$ is the current frame index, and $MB_k$ denotes the $k$-th macroblock of the frame. $BTC_{MB_k}(k)$ is the transform coefficient value of the $k$-th macroblock in the current frame. Using these parameters, the required blocks, shown in Fig. 1, can easily be detected.

The blocks are labeled as:
1) Blocks of most interest (BOMI)
2) Blocks of less interest (BOLI)
3) Blocks of non-interest (BONI)
These conditions are very useful for separating the region of interest blocks. The detection and classification of three video sequences, Cactus, Jockey, and Bosphorus, is shown in Fig. 3; the resolution of each sequence is 1920 × 1080. Fig. 3(a) shows the AVS encoding information, which is used to classify the blocks into gray, purple, and black regions. In Fig. 3(c), the white blocks are the regions of most interest, the gray blocks are regions of less interest, and the black blocks are mostly static and least interesting to viewers, i.e., blocks of non-interest.
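The resulting three-way labeling can be sketched as below. The exact decision rule is not spelled out in the text, so comparing a macroblock's MV magnitude and transform-coefficient energy against the frame averages is an assumption made purely for illustration:

```python
def classify_block(mv_mag, tc_energy, fmv_avg, ftc_avg):
    # Assumed rule (hypothetical thresholds): strong motion -> BOMI;
    # slight motion or above-average texture energy -> BOLI;
    # otherwise a static background block -> BONI.
    if mv_mag > fmv_avg:
        return "BOMI"
    if mv_mag > 0 or tc_energy > ftc_avg:
        return "BOLI"
    return "BONI"
```

Only blocks labeled BOMI or BOLI are passed to the super-resolution stage; BONI blocks keep their original resolution.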

IV. SUPER RESOLUTION OF DIFFERENT BLOCKS
We implement a multi-scale super-resolution method to enhance the visual quality of the low resolution BOMI and BOLI blocks of a video frame. According to the visual features and motion vectors, the downsampled frames of a video are divided into three block types, BOMI, BOLI, and BONI, as described in Section III. As shown in Fig. 3, the AVS decoded frame is divided into the three block types, and the multi-scale SR method is then applied to obtain high quality blocks. The resolution of the BONI blocks remains unchanged because they attract less interest than BOMI and BOLI. The proposed technique thus focuses on the foreground area of the frame obtained from the block detection technique, as shown in Fig. 3.

A. Multi-Scale Super Resolution
We present a GAN-based SR approach that is progressive in both design and training. The network upsamples an image in intermediate phases, while curriculum learning orders the learning process from simple to difficult. We design a generative adversarial network (GAN), called GAN-SR, that uses the same progressive multi-scale design idea to achieve more credible outcomes. The approach not only scales well to large sampling factors, but its multi-scale design also simultaneously raises the reconstruction quality at each upsampling factor. The method can operate on all image channels (chrominance and luminance); here we use only the luminance channel of each block to increase the spatial resolution of the frame, leaving the other channels unchanged.
Suppose we have a set of $N$ low resolution (LR) input frames with corresponding high resolution (HR) target frames, $(x_1, y_1), \ldots, (x_N, y_N)$, and an upscale function $U: X \to Y$, where $X$ denotes the low resolution frames and $Y$ the high resolution frames. It is difficult to find appropriate parameters for the upscaling function $U$ at high resolution ratios: the bigger the ratio, the more complicated the required class of functions. For this reason, the upscaling function $U$ is constructed progressively. We apply a multi-scale super-resolution pyramidal network to the low resolution frames, as discussed below.
1) Pyramidal Division: A pyramidal division model is used which splits the upsampling function $U$ into a number of simpler functions $(U_0, \ldots, U_n)$. The task of each function is to perform the upscaling and to refine the features of its own input. A cascade model consisting of compact compression units (CCUs) and a sub-pixel convolution layer is used at each pyramid level. An asymmetric structure is obtained by assigning additional CCUs to the lower pyramid levels, as shown in Fig. 5. Concentrating computational power in the lower pyramid decreases memory consumption and enlarges the receptive field with respect to the original frame, outperforming the symmetric variant in both quality and execution time. The upscale function $U$ is thus divided across the pyramid levels. We additionally utilize two sub scaling networks, denoted $v_s$ and $r_s$, which translate between the scale-specific and standardized feature spaces of a frame. Figure 5 shows a schematic depiction of our progressive upsampling network. The network uses a fixed upsampling of the input $x$, e.g. bicubic interpolation, so that it only has to learn a residual output, which eases training.
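The pyramidal division can be pictured as a composition of simple per-level upscalers, each doubling the resolution and exposing its intermediate output for supervision. The sketch below uses nearest-neighbor doubling on nested lists as a stand-in for a real level $U_s$ (the actual levels are CCU stacks with sub-pixel convolutions); the names are ours:

```python
def level_upscale(x):
    # Stand-in for one pyramid level U_s: a 2x nearest-neighbor
    # upsample of a 2-D grid given as a list of rows. A real level
    # would also refine the features with CCUs.
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # duplicate columns
        out.append(wide)
        out.append(list(wide))                     # duplicate rows
    return out

def progressive_upscale(x, levels):
    # U = U_n o ... o U_0: apply the levels in sequence and keep
    # every intermediate scale so each one can be supervised.
    outputs = []
    for _ in range(levels):
        x = level_upscale(x)
        outputs.append(x)
    return outputs
```

A 4x upscale is then two chained 2x levels, with the 2x intermediate output available for curriculum supervision.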
The estimated HR frame at a certain scaling factor $s$ can then be written as the fixed upsampling plus the learned residual:
$$\hat{y}_s = \mathrm{Bicubic}_s(x) + R_s(x),$$
where $R_s$ denotes the residual predicted by the pyramid levels up to scale $s$. Notably, this network does not follow the Laplacian pyramid [38] concept: in a Laplacian pyramid based network, the intermediate sub-network outputs are neither supervised nor monitored as base frames. Our network performs better than the Laplacian alternative, simplifies the backward pass, and eases the optimization problem. Moreover, the ground truth is not downsampled for the intermediate supervision, which helps prevent artifacts that may arise from sub-sampling.
2) Compact Compression Units (CCUs): Each pyramid level is built on the previously proposed DenseNet architecture [39]. Similarly to skip connections, we use compact connections that increase gradient flow and mitigate vanishing gradients [40]. A compact compression unit (CCU) is the central component of each pyramid level and includes a modified compactly connected block with a 1 × 1 convolution, CONV(1,1). The initial compact layer uses the sequence BN-ReLU-CONV(1,1)-BN-ReLU-CONV(3,3). Following recent practice in SR methods [11], [33], we eliminate all batch normalization in super-resolution. However, since the characteristics of prior layers may differ, the first ReLU used to rescale the CONV(1,1) features can also be removed, which leads to a dense layer of the form CONV(1,1)-ReLU-CONV(3,3).
In the compact network, the compact connection is terminated at the end of each CCU by a compression layer CONV(1,1), which quickly reassembles the information from the compact connections and increases the efficiency gain. Pyramid-level and local residual connections are used to improve gradient propagation in this deep model, as illustrated in Fig. 5.
3) Progressive GAN: Generative adversarial networks [41] have developed into an effective way to improve the perceptual quality of upsampled frames in SR. Training GANs is notoriously hard, and prior successes of GANs for SR are limited to a single sampling factor on very small targets. The proposed transcoder network is based on a multi-scale SR network similar to the generator shown in the second part of Fig. 5. The reverse pyramid structure $u_2, u_1, u_0$ is indicated in the second part of Fig. 6, where each level progressively shrinks the input frame spatially using average pooling (AvgPOOL).
A scale-specific frame transformation layer $v_s$ is used before each pyramid, as in the generator. The network is fully convolutional and produces a small patch of features, similar to PatchGAN [42], to handle multifaceted outputs from the generator. As in the generator network, the discriminator operates on the residual between the bicubic upsampled frame and the original frame. This lets both the generator and the discriminator focus solely on the major source of variation, namely the upsampling procedure; it can also be seen as subtracting a baseline from the discriminator input, which reduces variance. We employ the least-squares loss rather than the original cross-entropy loss as our training target. The discriminator and generator losses for training scale $s$ are expressed in terms of the real residual $r$ and the predicted residual $\hat{r}$, with perceptual terms computed from $p_k$, the $k$-th pooling layer of the VGG-16 model [12].
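The least-squares objectives mentioned above can be sketched as follows. Since the original loss equations did not survive extraction, this is the standard LSGAN form applied to discriminator scores on real and predicted residuals, given as an illustrative reconstruction rather than the paper's exact formulation:

```python
def lsgan_d_loss(d_real, d_fake):
    # Discriminator objective: push scores on real residuals toward 1
    # and scores on generated residuals toward 0.
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term

def lsgan_g_loss(d_fake):
    # Generator objective: push discriminator scores on its
    # predicted residuals toward 1.
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)
```

A perfectly fooled discriminator (scores of 1 on fakes) drives the generator loss to zero, while a perfect discriminator (1 on real, 0 on fake) drives its own loss to zero.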
4) Curriculum Learning: Curriculum learning [43] is a training approach that increases the complexity of the learning task progressively. It is widely used for sequence prediction and sequential decision making, where it can greatly speed up training and improve performance. The pyramidal division of $U$ allows us to implement curriculum learning naturally. The loss for a scaled exemplar $(x_i^s, z_i)$ is defined on the network output at scale $s$ and its target, where $x_i^s$ corresponds to the $s\times$ downsampled version of $z_i$ and the target at scale $s$ is the correspondingly scaled ground truth. The parameter set $\theta_s$ parameterizes all functions at and below the current scale, $(u_0, v_0, r_0, \ldots, u_s, v_s, r_s)$, of the pyramid network described in Fig. 5.
Our method starts by training a 2× network only. When entering a new phase (e.g. 4×), the new pyramid level is included progressively to limit its influence on the previously learned levels, as shown in Fig. 6. The prediction $r_s$ of the generator at scale $s$ is a linear combination of the outputs of levels $s$ and $s-1$. For the discriminator, the output features of the new pyramid are combined with the scale-specific input of the preceding level $v_{s-1}$ before entering the already-trained pyramids $\{u_{s-1}, \ldots, u_0\}$.
Bilinear interpolation and AvgPOOL are employed before merging to match spatial dimensions. In both cases, a factor $\beta$ determines the effect of the new pyramid and is ramped from 0 to 1 during the blending process. As a consequence, the training pairs of the next scale are added gradually. Finally, to construct the batches we choose one scale $s$ at random, preventing batch statistics from being mixed, as recommended in [2].
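The fade-in of a new pyramid level reduces to a simple convex blend controlled by β; a minimal sketch (the function name is ours):

```python
def blend_level(new_out, old_out, beta):
    # beta ramps from 0 to 1 while the new pyramid level is mixed in,
    # so the previously trained levels are disturbed only gradually.
    return beta * new_out + (1.0 - beta) * old_out
```

At β = 0 the network behaves exactly like the previously trained stack; at β = 1 the new level is fully active.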
This progressive training strategy considerably reduces overall training time compared with plain multi-scale training, in which training examples from multiple scales are fed to the network simultaneously. Compared with single-scale and plain multi-scale training, it offers a further efficiency improvement at all included scales and relieves instabilities in GAN training.

V. EXPERIMENTS
The proposed SR technique is implemented using Python and MATLAB 2018a. All experiments are performed on an Intel(R) Core™ i5-4590 CPU @ 3.3 GHz with 8 GB RAM, at a spatial resolution of 3840 × 2160, to evaluate the performance. The frames are encoded with the low-delay-P-main configuration file. The performance of the proposed method is evaluated with different metrics: image quality (ΔPSNR), bitrate (ΔBit-rate), and computational complexity (ΔT). These are calculated as
$$\Delta PSNR = PSNR_{proposed} - PSNR_{anchor},$$
$$\Delta T = \frac{T_{proposed} - T_{anchor}}{T_{anchor}} \times 100\%,$$
$$\Delta Bitrate = \frac{Bitrate_{proposed} - Bitrate_{anchor}}{Bitrate_{anchor}} \times 100\%,$$
where $PSNR_{proposed}$, $T_{proposed}$, and $Bitrate_{proposed}$ are the PSNR, encoding time, and bitrate of the proposed algorithm, and $PSNR_{anchor}$, $T_{anchor}$, and $Bitrate_{anchor}$ are those of the anchor software HM10.1.
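The evaluation metrics reduce to a simple difference and two percentage changes relative to the HM10.1 anchor; a minimal sketch:

```python
def delta_psnr(psnr_proposed, psnr_anchor):
    # Delta PSNR in dB (positive = proposed is better).
    return psnr_proposed - psnr_anchor

def delta_percent(value_proposed, value_anchor):
    # Percentage change, used for both Delta T and Delta Bitrate
    # (negative = the proposed method reduces the quantity).
    return (value_proposed - value_anchor) / value_anchor * 100.0
```

For example, an encoding time of 63 s against a 100 s anchor gives ΔT = -37%.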
To evaluate the performance of the proposed method, we used six different classes of videos, each video belonging to a particular class; the per-class results are reported in Table I and Table II. The difference between the frame based and block based results is shown in Table III. The HoneyBee upsampling results show that the proposed method saves 37% of the time compared with the frame based method at similar picture quality in PSNR (dB) and SSIM (index). In the block based setting, Bosphorus saves 26% of the time during 1K to 4K upsampling, while for 1K to 2K upsampling, Kimono takes 0.006% more SR time. Overall, the experiments show that the proposed method achieves good results on 2K and 4K video with low computational time. In Table III, a negative sign indicates a decrease and a positive sign an increase: the HoneyBee PSNR drops by 0.0091 dB compared with the frame based value, and the Cactus PSNR decreases by 0.0022 dB. Regarding bitrate, BQTerrace shows the maximum bitrate increase of 1.77% over the frame based value, and Bosphorus achieves the minimum bitrate reduction of 0.745% relative to the frame based value. Fig. 7 shows the HEVC encoding time analysis for six different videos at different resolutions; it can be seen that the Bosphorus video consumes more HEVC encoding time at the 2K and 1K resolutions.

VI. CONCLUSION

This paper introduces a novel approach in which a deep learning based enhancement technique converts low resolution video to high resolution. The technique enhances only the region of interest in a video instead of the full frame and reduces the SR complexity of the system by 20% to 30%. The results validate the efficiency of this approach: the proposed block based transcoder outperforms the frame based super-resolution method while providing comparable visual quality in SSIM (index) and PSNR (dB).
In the future, this transcoder can be extended to other video formats, e.g. H.264 to MPEG, H.264 to MPEG-4, and MPEG-4 to MPEG-2 transcoders. This field is growing rapidly, and continued effort will bring significant improvements in video compression and quality.

Fig. 3: Block detection and classification: (a) shows the original video sequences Cactus, Jockey, and Bosphorus; (b) shows the AVS encoding information results, where grey regions are skip macroblocks, purple regions are intra MBs, and blue regions are inter MBs; and (c) shows the regions of interest, where white indicates blocks of most interest, gray blocks of less interest, and black blocks of non-interest (best viewed in color).

Fig. 4: Architecture of the deep learning based super-resolution method, where compact compression units (CCUs) are allocated to the lower pyramid levels to reduce memory consumption and improve the reconstruction accuracy of high quality blocks in the video frame.

Fig. 5: Flow chart of the proposed video transcoder with deep learning based super resolution.

Fig. 6: The top part shows the blending procedure for the generator and the bottom part the blending process for the discriminator in curriculum training.
For example, Class A contains the PeopleOnStreet and Traffic videos, Class B contains Cactus and Kimono, Class C contains BasketBallDrill and PartyScene, Class D contains BlowingBubbles and BQSquare, Class E contains FourPeople and KristenAndSara, and Class F contains ChinaSpeed and SlideEditing. The experimental results of the block based super-resolution method and the frame based method are shown in Table I and Table II.

Fig. 7(a) shows the HEVC encoding time of six different videos. The experiments show that all block based videos have higher HEVC encoding time; on average, the block based results take 15.98% more HEVC encoding time than the frame based ones. The average difference between block based and frame based for the PeopleOnStreet video is -45.80 seconds, which indicates that the frame based transcoder has a

Fig. 7(c) illustrates the visual quality of the frame based and block based SR approaches for the six videos used in the experiments. In most cases, the peak signal-to-noise ratio (PSNR) of the frame based and block based videos is similar; the visual quality difference grows gradually for low resolution videos, while for 4K videos the visual quality of both is mostly comparable. During upsampling from 1K to 2K or 2K to 4K, the pixel density decreases, which lowers the visual quality of all video frames. Fig. 7(d) shows a comparison of bitrates between the frame based and block based transcoding methods. The bitrate is the amount of video data per second, usually expressed in kilobits per second (kbps) or megabits per second (Mbps), and is a vital factor in video streaming quality.

Fig. 7: (a) shows the ROI based and full frame-based HEVC-encoding Time (sec), (b) shows the ROI based and full frame-based SR Time (sec), (c) shows the ROI based and full frame-based PSNR (dB), and (d) shows the ROI based and full frame-based Bitrate (kb).

Fig. 8: Qualitative results for 1K, 2K, and 4K videos. (a) shows the results for the 4K Bosphorus, Jockey, HoneyBee, and ShakeNDry videos; there is no significant difference in picture quality between the frame based and block based methods. (b) shows the results for the Class A videos, where the visual quality of PeopleOnStreet and Traffic is similar. (c) illustrates the results for the Class B videos (BasketballDrive, BQTerrace, Cactus, and Kimono). (d) shows the visual quality of the PartyScene and RaceHorses videos of Class C, (e) that of the BlowingBubbles and BQSquare videos of Class D, (f) that of the FourPeople and KristenAndSara videos of Class E, and (g) the results for the ChinaSpeed and SlideEditing videos of Class F.

TABLE I: Block Based Results of the Proposed Method

TABLE II: Frame Based Results of the Proposed Method

TABLE III: Difference between Block Based and Frame Based Results

TABLE IV: Image quality (PSNR) and bitrate (kb) comparison between the proposed method and other methods.

The low resolution (1280 × 800) images of PeopleOnStreet and Traffic are upsampled to 2K resolution (2560 × 1600) by the proposed SR method, just as the low resolution (960 × 544) images of Cactus and Kimono are upsampled to 1K high resolution (1920 × 1080) frames using the same SR technique.