Textual Deblurring using Convolutional Neural Network

The problem of blind textual deblurring of images is a classical problem in image restoration. Here we propose a new Convolutional Neural Network (CNN) formulation for solving this problem. Our architecture has been carefully designed after thorough empirical evaluation and detailed experimentation with recent advances in the domain of deep learning (DL). Our method outperforms the previous state-of-the-art CNN-based blind deblurring method while reducing training time, memory and depth. These traits make our method a potential candidate for integration as a de facto method for textual deblurring on mobile devices. Precisely, our contributions are: (i) a CNN model that outperforms the state-of-the-art method for blind textual deblurring, while being resource friendly and taking less time to train; (ii) an analysis of the learning behaviour of CNNs for textual deblurring by encoding structural information of images through image gradients; (iii) recommendations based on detailed empirical validation of different design choices for building an image deblurring network.


Introduction
Rapid digitisation has resulted in corporations adopting paperless management systems. However, a large amount of data still exists in paper-based textual form, such as documents, notes, financial ledgers, books and religious writings. Digitising them requires special equipment such as high-resolution cameras and scanners that capture such data with high pixel quality and low noise. Google has made efforts in this direction with its high-resolution Art Camera, which it uses to preserve artwork [20]. However, such equipment is expensive and not readily available.
In contrast, almost everybody now has a low-end mobile phone equipped with a camera. Especially during the COVID era, online sharing of written documents scanned through hand-held devices became the norm. However, capturing textual images with these cameras requires paying special attention to the capturing process: poor lighting conditions or a simple shake of the hand can lead to a poor-quality, blurred image [8].
Traditional blind image deblurring methods utilise manually perceived, hand-crafted features, employing edges or gradients with local image statistics to recover the blur kernel [18,4,3,12,19]. Recent CNN-based methods have removed this limitation [8,1,16,15]. To this end, through careful analysis and evaluation of different CNN architectures and model design choices, we develop an architecture that gives state-of-the-art performance on the benchmark dataset for text deblurring. Additionally, we propose a new loss function for textual deblurring of images, called the "Structured Loss". In this work we show that the structured loss is better at deblurring text in images than simply using the L2 loss, due to its ability to focus more on edges. In this loss function, a first-order derivative term is added to the L2 loss, which forces the CNN model to generate clear, sharp, human-readable textual images.
We would like to emphasise that we have achieved all these performance gains while reducing the memory footprint of the model. This makes our model a suitable candidate for performing textual deblurring on small hand-held devices such as mobile phones or tablets. It should also be noted that our model's spatial support is not limited by its kernel size; rather, it processes whole (300 × 300 × 3) images. Furthermore, in contrast to other methods, during comparison we compute the evaluation metrics Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) on whole images rather than the (160 × 160) central patch, since we believe that in a real-life setting text can appear in any part of the image.
The main contributions of the paper are as follows: 1. We propose a new loss function for textual deblurring, called the "Structured Loss". This loss function adds the first-order derivative of the image to the L2 component of the loss, which encourages the CNN to focus more on edges and thus generate sharp textual images.

The remainder of the paper is organised as follows. Section 2 provides a brief overview of the problem domain. Section 3 provides an extensive literature review of both traditional and deep-learning-based methods. Section 4 describes our approach, the dataset used in the experimentation, the evaluation metrics, the proposed loss function and the model's architecture. Section 5 elaborates on the effects of different design choices, which aided in designing and finalising our CNN model. Section 6 provides visual and empirical analysis of the achieved performance. Section 7 discusses some promising research directions. Finally, Section 7.1 concludes the paper.

Background
To evaluate the deblurring quality of a method, the closeness of the restored image to the originally captured scene is measured. Specifically, image quality can be defined as the presence of desirable characteristics such as sharpness, tone reproduction, contrast and colour, and the absence of undesirable characteristics such as blur, noise and incorrect exposure. We can address these problems either by being careful during image capture and using specialised hardware such as image stabilisers, hybrid cameras and coded shutters, or by post-processing the captured image with specialised algorithms.
In image processing, to recover a high-quality image from a poor-quality one, we apply image restoration algorithms. Image restoration is an interesting and classical image processing problem [14]. Image deblurring is a special case of the restoration problem in which a blurred image is processed to produce a sharpened one. Like the majority of other degradations, blur is introduced by physical and manual errors such as motion and defocus; other degradations include noise, incorrect exposure and distortion effects.
Image processing algorithms formalise image restoration as the recovery of a blurring kernel K: if we can recover K, we can quasi-recover the original image by deconvolving the input with it. The blur formation process is modelled as

I_in = K * I_out + n,    (1)

where I_out is the deblurred (sharp) image, I_in is the blurred image given as input, K is the kernel, n is the random noise added by the image capturing system, and * denotes convolution, as shown in Equation 1. Generally speaking, deblurring algorithms can be categorised into two broad categories.
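As an illustration, the blur formation in Equation 1 can be simulated directly. The sketch below is ours, not from the paper: the function name, noise handling and box PSF are illustrative choices.

```python
import numpy as np

def apply_blur(sharp, kernel, noise_sigma=0.0, seed=0):
    """Simulate I_in = K * I_out + n with a valid-region 2D convolution."""
    kh, kw = kernel.shape
    h, w = sharp.shape
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(sharp[i:i + kh, j:j + kw] * flipped)
    if noise_sigma > 0:
        out += np.random.default_rng(seed).normal(0.0, noise_sigma, out.shape)
    return out

sharp = np.ones((5, 5))
psf = np.full((3, 3), 1.0 / 9.0)   # 3x3 box blur
blurred = apply_blur(sharp, psf)
# A box PSF averaging a constant image leaves its values unchanged.
```

Blind deblurring amounts to inverting this process without knowing `psf`.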
Non-blind deblurring: Non-blind deblurring methods explicitly take the kernel K, or point spread function (PSF), together with the blurred image as input to generate the deblurred image (c.f. Figure 1). Most widely available tools (Photoshop, GIMP, etc.) use non-blind deblurring techniques to recover images. However, these are limited in their use, as the user has to specify hand-crafted kernels to generate the final deblurred image. Furthermore, these hand-crafted kernels remove blur only to the extent for which they are built.
Blind deblurring: For blind deblurring methods, neither the PSF nor any other prior information needed to generate the deblurred image is known. This is why blind deblurring is difficult and considered an ill-posed problem [8,18,3].
A problem which admits multiple solutions is called an ill-posed problem.

Literature review
Previous deblurring methods typically try to approximate the functional form of the PSF used to generate the blurred image. Once an approximation of the PSF is available, any non-blind deblurring method can be used to deblur the image. The performance of all such methods depends on how good the approximation is. Many such methods have been proposed for blind deblurring [1,16,15,3,4,18,12].

Hand-crafted feature methods.
Prior to Deep Learning (DL), traditional methods designed for vision tasks utilised manually perceived, hand-crafted features to extract the information relevant for PSF approximation. Furthermore, they also imposed constraints on the specific structure of the underlying documents, such as requiring the document to be bi-modal (black and white without any complex background or graphics), in a specific orientation, or under specific lighting conditions. Thus these models do not generalise to the real-life setting of mobile cameras, where users take pictures without any predefined constraints. For instance, the method of Li et al. [12] works very well on strictly two-tone images but fails on images with complex background or noise. It is neither robust nor scalable, and works only for a subset of blur kernels. They used an inverse mapping function to deblur images, and evaluated their performance on the total count of misclassified pixels.
Cho et al. [4] use text-image-specific properties together with optimisation techniques to produce their final output. These properties include the contrast between text and its nearby background and the uniformity in colour of printed characters. Clearly, these assumptions do not hold for real-world images. The method fails to deblur textual images when the text is in a small font or the characters are connected; moreover, it works only on a subset of blur kernels and also fails to remove motion blur. Within its assumed settings, however, the method performs well on both text and natural images.
Chen et al. [3] designed a very good method for blind deblurring, as it successfully suppresses the ringing effect and achieves a sufficient PSNR improvement. However, its performance depends on which text segmentation method is used. Because of this, their method is not generalisable, as the text segmentation method has to be selected based on the type of images.
Pan et al. [18] generated perceptually good deblurred images, and their method was also less complex than previous methods. However, for the method to work, textual images had to be horizontally aligned, bi-modal and uniform; otherwise it produced poor deblurred images.
All these manually hand-crafted feature selection methods fail to be generalisable, robust and scalable. Furthermore, they are limited to certain deblurring kernels.

Deep Learning methods
DL methods have seen many advancements in recent times, primarily due to their ability to automatically learn feature representations from data. As a result, DL methods are state-of-the-art in object detection [23,5], image segmentation [13,6,11], depth perception [2,25,21], etc.
Naturally, many DL methods have been proposed for blind deblurring of images. Nah et al. [15] proposed a multi-scale Convolutional Neural Network (CNN) and a new dataset captured using a high-speed camera. Their method works its way up from coarse to fine, and finally produces a high-resolution deblurred image. Further, according to them, removing the batch normalisation layer helps the network cater for range flexibility, thus improving deblurring performance. Noroozi et al. [17] also proposed a multi-scale network, DeblurNet_Wild, trained on blurry and sharp image pairs, with skip connections to produce the residual image.
Recurrent networks have also been used for the deblurring task. Zhang et al. [28] reconstruct deblurred images by passing them through a series of convolutional and Recurrent Neural Network (RNN) based deconvolution networks. They show that deblurring can be modelled as an infinite impulse response (IIR) model. Instead of deblurring the whole image, Zhang et al. [27] perform patch-wise deblurring. Motivated by Spatial Pyramid Matching, their Deep Multi-Patch Hierarchical Network (DMPHN) performs deblurring in a fine-to-coarse manner with a novel stacking approach.
Ren et al. [22] propose an autoencoder with skip connections and sigmoid non-linearity, combined with a fully connected network and softmax. Their method, called SelfDeblur, captures a deep prior to produce the blur kernels.
The majority of DL methods are designed for scenic images. Hardis et al. [8] proposed a method specifically for deblurring textual images, investigating how CNNs perform textual deblurring. They made two major contributions. Firstly, they introduced a new dataset for textual image deblurring. Secondly, they proposed a CNN model that gives state-of-the-art performance on this dataset, which they compared against classical text deblurring techniques.

Our Approach
We have developed a CNN-based algorithm that gives better performance than the previous state of the art [8], while being resource friendly and faster to train. We explored different design choices and empirically established their effects on the final performance: which combinations provide a better learning opportunity and which cause a loss in performance. From this we have developed a road map of practices to follow for problems of a similar nature.
We also thoroughly and empirically evaluated different loss functions to judge which simplify the learning process for the CNN model, both by directly minimising loss or maximising PSNR and by implementing custom loss functions which aid the quick convergence of the CNN.

Dataset
For our deblurring experiments we used the publicly available dataset of [8]. This dataset consists of 66,741 images, each with a dimensionality of (300 × 300 × 3), totalling approximately 16.5 GB in size.
The dataset was uniformly divided into chunks of 3,000 images each to ease training. 17 chunks were dedicated to training, 3 chunks to validation and 2 chunks to testing. The test set was used only once, after complete configuration and tuning on the validation set, once training was finished.
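The split described above can be sketched as follows. The paper does not publish its data pipeline, so the helper and placeholder file names below are our illustrative assumptions.

```python
CHUNK_SIZE = 3000  # images per chunk, as described above

def split_into_chunks(image_paths):
    """Partition the dataset into 3000-image chunks, then assign
    17 chunks to training, 3 to validation and 2 to testing."""
    chunks = [image_paths[i:i + CHUNK_SIZE]
              for i in range(0, len(image_paths), CHUNK_SIZE)]
    return chunks[:17], chunks[17:20], chunks[20:22]

paths = [f"img_{i:05d}.png" for i in range(66741)]  # placeholder names
train, val, test = split_into_chunks(paths)
# 17 training, 3 validation and 2 test chunks of 3000 images each
```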

Evaluation Metrics
For judging the quality of image restoration algorithms, two types of evaluation metric are generally used: subjective and objective [7]. Subjective metrics are based on human judgment, comparing how close the images are perceptually, whereas objective metrics rely on numeric criteria computed from pixel values [7]. To compare image deblurring performance we use two evaluation metrics, one objective and one subjective.
PSNR is poor at discriminating structural differences between images: many different transformations or degradations can be applied to an image and yield the same PSNR even though the generated images are perceptually and structurally different [24]. We therefore use both PSNR and SSIM, because we want our model to generate deblurring results that are both numerically and perceptually good.
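For reference, PSNR is defined from the mean squared error, while SSIM compares luminance, contrast and structure [24]. The sketch below is illustrative: `psnr` follows the standard definition, whereas `ssim_global` is a simplified single-window variant (the reference SSIM uses a sliding Gaussian window).

```python
import numpy as np

def psnr(ref, test_img, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test_img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Simplified SSIM computed over the whole image as one window."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

a = np.full((8, 8), 100.0)
b = a + 10.0  # a uniform intensity offset of 10
# psnr(a, b) = 10 * log10(255^2 / 100) ≈ 28.13 dB
```

A uniform offset like `b` illustrates the limitation: it costs PSNR even though the structure is unchanged, which is exactly why SSIM is reported alongside it.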

Structured Loss
We implemented the structured loss function to analyse its effect on the CNN's learning, and whether it guides the CNN to produce sharp images. The magnitude of the first-order derivative of the image is added to the L2 loss, with the participation of the two components controlled by the alpha and beta hyper-parameters. Here mag∇I represents the magnitude of the gradient image produced by applying a first-order derivative filter.
The structured loss is computed by convolving the output images with a (3 × 1) and a (1 × 3) kernel ([1, 0, −1]). In essence, we add a first-order derivative component to the loss so that the CNN produces sharp images, as shown in Equation 8. The goal of this loss function is to enable the CNN to quickly distinguish edges and generate a sharp image by encoding the structure of font shapes.
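A sketch consistent with this description is given below. Equation 8 itself is not reproduced in this text, so the mean reduction and the default alpha/beta values here are our assumptions.

```python
import numpy as np

def grad_mag(img):
    """mag∇I: gradient magnitude via the (1x3)/(3x1) kernel [1, 0, -1],
    evaluated on the valid interior region."""
    gx = img[1:-1, :-2] - img[1:-1, 2:]   # horizontal [1, 0, -1]
    gy = img[:-2, 1:-1] - img[2:, 1:-1]   # vertical [1, 0, -1]^T
    return np.sqrt(gx ** 2 + gy ** 2)

def structured_loss(pred, target, alpha=1.0, beta=1.0):
    """alpha * L2 term + beta * first-order-derivative term."""
    l2_term = np.mean((pred - target) ** 2)
    grad_term = np.mean((grad_mag(pred) - grad_mag(target)) ** 2)
    return alpha * l2_term + beta * grad_term
```

Note that a uniform intensity offset leaves the gradient term at zero, so the extra term penalises only structural (edge) mismatch, which is the intended text prior.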

Our Architecture
Our proposed architecture is a 7-layer-deep CNN with (11 × 11) convolutional kernels and 128 channels per layer, excluding the last layer, which is a (1 × 1) convolution with 3 output channels, as shown in Figure 8. This model outperforms the current state-of-the-art PSNR achieved by Hardis et al. [8] for textual deblurring.
All of this was achieved with a reduced memory footprint, owing to the pooling and unpooling layers in our model. For the pooling operation we used a max pooling layer, which picks the maximum pixel value while reducing dimensionality. For unpooling, a deconvolution layer was used; this turned out to be the better unpooling layer [26] in our experiments, as it learns which features to pick while upsampling the image. Because of this, we have greatly reduced the memory footprint of our model. The model requires 3 days to train on a single K40 GPU, which is also less than [8], whose model takes one week to train. This initial model was, however, slow to train because it required a very low learning rate of 1e−6. To speed up convergence we also experimented with batch and instance normalisation. Both did fine during training, and we were able to raise the learning rate as high as 1e−1. Batch normalisation, however, failed during validation: this layer learns the mean and variance of the whole dataset during training and applies the learned parameters during testing and validation, which caused the output images to be distorted, as the combined mean and variance of the whole dataset were being used to normalise each neuron output.
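A PyTorch sketch of this architecture is given below. The exact placement of the pooling/unpooling stages, the padding and the activation are not fully specified in the text, so those choices are our assumptions; the 11 × 11 kernels, 128 channels, max pooling, deconvolution-based unpooling, instance normalisation and final 1 × 1 layer follow the description.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    """11x11 convolution + instance norm + ReLU (layout assumed)."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=11, padding=5),
        nn.InstanceNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class TextDeblurNet(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3, channels),
            conv_block(channels, channels),
            nn.MaxPool2d(2),                                      # downsample
            conv_block(channels, channels),
            conv_block(channels, channels),
            nn.ConvTranspose2d(channels, channels, 2, stride=2),  # learned unpooling
            conv_block(channels, channels),
            conv_block(channels, channels),
            nn.Conv2d(channels, 3, kernel_size=1),                # 1x1 output layer
        )

    def forward(self, x):
        return self.net(x)

model = TextDeblurNet(channels=16)       # reduced width for a quick smoke test
out = model(torch.zeros(1, 3, 64, 64))   # same spatial size in and out
```

Since every layer is convolutional, the network accepts whole images of any size, including the (300 × 300 × 3) inputs used in the paper.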
Instance normalisation worked better, as it normalises each image individually rather than with respect to a batch, and is applied identically during both training and validation/testing. We therefore chose instance normalisation over batch normalisation.
We also used the structured loss function, which simplifies the learning criterion for the CNN model, so that it quickly focuses on the textual information present in the images. This loss function concentrates the CNN's learning capacity on the text prior, through which we achieve good textual deblurring. Many variants were then derived from this model to further fine-tune and squeeze out performance.

Design choices
Here we discuss in detail the experiments performed by fine-tuning and testing different configurations of our architecture. We discuss the effect each configuration has on performance, and shed light on why a given configuration works well or poorly for this problem of blind deblurring. We also empirically show why the selected configuration is the best trade-off between performance, resource consumption and time, and compare our results with the existing state-of-the-art method, showing the superiority of our method (see Table 2).

Effects of Down and Up-Sampling
We trained two models to test different upsampling and downsampling techniques. For downsampling we experimented with max, average and L2-norm pooling. Max pooling performed far better than the other downsampling techniques, as it helped suppress unnecessary information; in conjunction with convolution it produced better deblurring.
For upsampling, we experimented with fixed unpooling and transposed convolution. Transposed convolution proved the better upsampling technique, because it can learn which features to give importance to while upsampling; the images produced this way were better deblurred, whereas the fixed unpooling layer produced images of poorer perceptual quality. With the addition of downsampling, the image dimensionality was reduced, which greatly reduced the computational complexity. The model became faster to converge but still took one week to train. We named this model Variant-2.
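The relationship between the two upsampling choices can be shown with a minimal numpy sketch (function names are ours): a stride-2 transposed convolution with a 2 × 2 kernel generalises fixed 2× unpooling, reproducing it exactly when its weights are all ones, so the learned version can only match or improve on the fixed one.

```python
import numpy as np

def fixed_unpool(x):
    """Fixed (non-learned) 2x nearest-neighbour unpooling."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def transposed_conv2x(x, w):
    """Minimal stride-2 transposed convolution with a 2x2 kernel w:
    each input pixel 'paints' a weighted 2x2 patch into the output.
    Unlike fixed_unpool, w is learnable, so a network can decide which
    features to emphasise while upsampling."""
    h, ww = x.shape
    out = np.zeros((h * 2, ww * 2))
    for i in range(h):
        for j in range(ww):
            out[2 * i:2 * i + 2, 2 * j:2 * j + 2] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.ones((2, 2))
# With all-ones weights, transposed_conv2x(x, w) equals fixed_unpool(x).
```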

Effects of Normalization
Although the results produced by Variant-2 were promising, the convergence speed was still very slow and training took approximately one week. This can be attributed to the low learning rate set in our base model. This slow convergence was a serious bottleneck in our experimentation. In order to explore different design choices quickly with the available resources and time, we needed a higher learning rate, for which we had to add a normalisation layer to our CNN. We therefore tried batch and instance normalisation. This model was named Variant-4.
With both normalisation techniques we were able to set learning rates as high as 1e−1, and both worked fine during the training phase. The problem arose during validation: batch normalisation caused image distortion because it tries to normalise the whole batch using the mean and variance learned during training. This supports the claim made by Nah et al. [15] that removing batch normalisation improves deblurring performance.
Instance normalisation, however, did the trick, as it normalises each image with respect to its own mean and variance rather than across the batch, thereby fixing the range flexibility problem in deblurring [15]. We conclude that instance normalisation is the better normalisation technique for textual deblurring.
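The property that matters here can be shown in a few lines. This is a simplified numpy sketch (the real layer also has learnable affine parameters): instance normalisation uses only per-image statistics, so its behaviour is identical at training and test time, unlike batch normalisation's running estimates.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalise each (image, channel) plane by its own mean/variance.
    x has shape (batch, channels, height, width); no statistics are
    shared across the batch or remembered between calls."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(5.0, 2.0, size=(2, 3, 8, 8))
y = instance_norm(x)
# Every image/channel plane now has ~zero mean and ~unit variance,
# regardless of what the rest of the batch looks like.
```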

Effects of Loss Function
We trained another variant, Variant-3, which was exactly like Variant-4 except for the loss function: we replaced our custom loss with the de facto loss function used in image restoration problems, the L2 loss. With this change we wanted to see exactly how our custom loss function compares to the L2 loss at deblurring textual images, and what effect a loss function can have on the learning process.
After evaluation we found that our custom loss function was better at deblurring textual images (c.f. Table 2). Even though the PSNR gain is small, we argue that it can be further improved simply by changing the ratio of participation of the structural and L2 parts. This avenue has not yet been fully explored and needs further experimentation.

Effects of Increasing Data Dimensionality
We increased the depth of Variant-3 from 7 to 9 layers, adding a convolution layer after each deconvolution layer in the hope that these layers would consolidate the upsampled images by restoring image features. We call this Variant-5.
Although we previously argued that downsampling reduces computational complexity, sampling also has a drawback: when we downsample an image we reduce its dimensionality, which lowers the computational cost but also reduces the amount of information passed to the next convolution layer.
Keeping this in mind, we ran another variant, Variant-6, to check whether the sampling operations were hindering the performance of the model. This model achieved a PSNR of 18.57 dB, better than Variant-3 and Variant-4; however, the improvement comes at a higher computational cost. This shows that downsampling does slightly reduce performance, but at the benefit of saved computation. After this experiment we decided to keep downsampling in our final variant, because the performance gain was not significant relative to the increase in computational complexity.

Experimentation Discussion
All the above variants used the same initialisation technique and were identical in all other parameters except those explicitly mentioned (c.f. Table 2). Kernel size and the number of channels were the same, except in Variant-4.1, in which the number of channels was reduced to 80. Overall, Variant-4.1 is the model configuration that offers the best trade-off between computational complexity, training time, forward-pass time, and image deblurring PSNR and SSIM, as shown in Table 1.
The variant configurations are summarised below (the normalisation and pooling/unpooling entries were not recoverable from the extracted table):

            Depth   Learning rate   Normalization   Pooling/unpooling   Loss function
Variant-1   7       1e−6                                                structure loss
Variant-2   7       1e−6                                                structure loss
Variant-3   7       1e−2                                                l2 loss
Variant-4   7       1e−2                                                structure loss
Variant-5   9       1e−2                                                l2 loss
Variant-6   7       1e−2                                                l2 loss

After experimenting with different design choices, we recommend the following: pooling operations are a must to reduce computational complexity; instance normalisation boosts convergence, making training faster; increasing depth makes a CNN's learning long and difficult, and better results can be achieved by relatively smaller, smartly designed networks; and adding a structural component helps the CNN produce sharp deblurred images.

Figure 3 shows sample images for Variant-2, which has the highest PSNR of all models. Here we can see that the images are clearly deblurred, with distinguishable text. The PSNR achieved is nevertheless still low: if we compare the variant's output images closely against the original images, the intensity of the pixel values in the variant's output is lower than in the label images. Therefore the PSNR could be further improved by some kind of thresholding technique. However, we argue that the main objective of textual deblurring is to deblur the image such that the text is human readable. In Figure 3 it is clearly visible that the output images are nearly identical in terms of textual information, although individual letters within a word are sometimes distorted into odd representations. On manual inspection of the text, vowels and letters with similar shapes are often confused, for example "a", "e", "o", "i", "t" and "l". Even with these inconsistencies, however, the human brain is able to correctly guess the word.

Future work
We have evaluated many new architectural choices; however, we believe the performance can still be stretched further. Adding skip connections is one possible way forward to improve results. We are still evaluating our custom-designed loss function, trying to find the right mixture of the structural and L2 loss components with which to focus on deblurring the text in images. We believe this loss function is key to solving this problem, and that it has the potential to become a standard loss for deblurring. The problem of blind deblurring is very hard, as high amounts of blur and noise can degrade an image to such an extent that the model fails. Even after achieving good image restoration, we believe this problem is far from solved and can be helped by a guided approach to blind deconvolution. As the blur space we are trying to model is huge, one possible solution is a guided approach to learning anti-PSF functions: a CNN that successfully predicts the anti-PSF function can feed it to a non-blind method for deblurring. Further, we could also train, in conjunction with this network, a second network that minimises the difference between the original image and the output image; doing so would build a network able to fill in the missing information left behind by the generative network. In these two ways we believe further performance gains in image restoration can be achieved.

Conclusion
We have proposed a CNN model for the problem of blind text deblurring. It outperforms the previous state-of-the-art method while remaining resource friendly and computationally fast. Precisely, our approach takes less time to converge/train, has a reduced forward-pass time, thus enabling real-time response, and has a reduced memory footprint, making it possible to deploy on small hand-held devices.
We have proposed a new custom-designed loss function which simplifies learning for textual deblurring and has the potential to become a standard loss function for computer vision problems involving image-to-image transfer tasks.
Furthermore, we have demonstrated the effects that recent advancements in CNNs exhibit on this problem of textual deblurring. We have empirically evaluated different design choices to find which are better and how they affect the performance of the architecture in terms of memory consumption, convergence speed and image restoration quality.