Differentiable Image Data Augmentation and Its Applications: A Survey

Data augmentation is an effective method to improve model robustness and generalization. Conventional data augmentation pipelines are commonly used as preprocessing modules for neural networks, with predefined heuristics and restricted differentiability. Recent works indicated that differentiable data augmentation (DDA) could effectively contribute to the training of neural networks and to augmentation policy search strategies. This survey provides a comprehensive and structured overview of the advances in DDA. Specifically, we focus on fundamental elements including differentiable operations, operation relaxations, and gradient estimations, then categorize existing DDA works accordingly, and investigate the utilization of DDA in selected practical applications, specifically neural augmentation networks and differentiable augmentation search. Finally, we discuss current challenges of DDA and future research directions.


Jian Shi, Hakim Ghazzai, Senior Member, IEEE, and Yehia Massoud, Fellow, IEEE
Index Terms-Computer vision, data augmentation, differentiability.

I. INTRODUCTION
Modern deep neural networks heavily rely on data augmentation techniques, which are widely applied to increase the amount of data by transforming the original data or synthesizing from the existing data. Conventionally, data augmentation can effectively alleviate overfitting to improve model generalization in a low-data regime. For optimal model performance, curated data augmentation strategies are preferred as they may affect downstream tasks [1]. In computer vision, transformation-based augmentation includes color space transformations that modify pixel intensity values (e.g., brightness and contrast adjustment) and geometric transformations that update the spatial locations of pixels (e.g., affine transformations), while synthesis-based augmentation resorts to generative methods (e.g., neural style transfer [2]), adversarial methods (e.g., adversarial training [3]), and mixing augmentations (e.g., MixUp [4], CutMix [5]). These data augmentation methods are developed to effectively improve model performance on vision tasks. Furthermore, recent advances may perform data augmentation by synthesizing new images through learnt decoupled feature representations (e.g., InfoGAN [6]) and neural rendering [7]. Additionally, latent-space augmentations (e.g., Manifold MixUp [8], MODALS [9]) performed data augmentation in feature spaces, while negative data augmentation [10] incorporated the prior knowledge of bad examples by creating out-of-distribution data. Meanwhile, studies on self-supervised and semi-supervised learning focused on leveraging the content similarities between weakly-augmented and strongly-augmented samples to learn the invariant feature representations of given images. Studies such as MixMatch [11], ReMixMatch [12], and FixMatch [13] are semi-supervised frameworks that focus on augmentation anchoring, a method that leverages augmentation consistency to improve the performance of the target model. Additionally, many contrastive learning
paradigms (e.g., SimCLR [14], MoCo [15]) are designed to learn general feature representations by closing a positive pair (e.g., an image with different augmentations) and distancing a negative pair (e.g., different images), while BYOL [16] and SimSiam [17] further simplify the paradigm to learn without negative pairs. Despite their effectiveness, most of these common data augmentation techniques rely on non-differentiable functions executed outside the computation graphs.
Gradient-based optimization methods have brought dramatic advances in applications such as deep learning, yet image transformation gradients remain one of the most commonly neglected ingredients during the training of neural networks. The incorporation of DDA into neural networks has been introduced to maintain the augmentation gradient flow, so that the computed augmentation gradients can be further integrated into the neural network optimization graph. Essentially, differentiable data augmentation (DDA) refers to image transformations that are differentiable with respect to (w.r.t.) the input image; a transformation is further optimizable if it is also differentiable w.r.t. the transformation parameters. For example, affine transformations such as scaling, rotation, and translation are differentiable operations, so we can compute gradients of the output image w.r.t. both the input image and the transformation parameters. To enable gradients for image transformation operators, Kornia [18] implemented most differentiable image operators in PyTorch, while Halide [19] proposed a domain-specific language for differentiable image processing that claims higher performance than PyTorch and manual CUDA programming. In this work, a DDA module that contains one or more DDA operations is referred to as a neural augmentation network.

0162-8828 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

There are two types: the static neural augmentation network, which does not update augmentation parameters, and the optimizable neural augmentation network, which does. Previous works tended to use static neural augmentation networks to replace traditional image operations [20], [21], [22] as a network layer for GANs, or to replace routine non-differentiable data augmentation [23], [24], [25] for data-efficient learning. Meanwhile, several studies investigated augmentation policy searching [26], [27], [28], [29] with optimizable neural augmentation networks. Apart from improving model training, various applications have been explored, such as mimicking camera sensors [30] and generating underwater-like images from non-underwater images [31]. Also, an interesting study [32] integrated DDA to trace gradients in order to detect whether a particular image dataset has been used in training. This survey is an attempt to provide a structured and broad overview of the recent works on DDA, spanning many research and application domains. Previous surveys, Shorten et al. [33] and Xu et al. [34], have comprehensively investigated augmentation techniques, including traditional augmentation transformations, GAN-based synthetic methods, and automatic augmentation algorithms. Yang et al. [35] further elaborated automatic augmentation methods in detail. Naveed [36] focused on image mixing and deleting methods. To the best of our knowledge, this is the first survey specifically focusing on DDA. In summary, our contributions are as follows:
- Application: We classify the current DDA applications into two principled paradigms: differentiable neural augmentation and differentiable augmentation search.
- Challenges and future opportunities: We discuss the challenges and future opportunities of DDA.

The paper is organized as follows. Section II introduces the concept of DDA, along with the differentiable image transformation operators as well as the common gradient approximation methods. Afterwards, in Sections III and IV, we elaborate the applications of differentiable neural augmentation and differentiable augmentation search, respectively. In Section V, we discuss the challenges and opportunities of DDA. Finally, Section VI concludes our paper.

II. DIFFERENTIABLE DATA AUGMENTATION
This section presents the concept of DDA and summarizes common image augmentation operations.

A. Prerequisites
Image data augmentation is a composition of image processing operators to increase the diversity of a given dataset. Apart from the neural-network-based augmentations, most common data augmentation techniques are non-differentiable and executed outside the computation graphs. Thus, to formally introduce differentiable data augmentation, we borrow the definition of stochastic computation graph and the requirements of differentiability from Schulman et al. [37]:

Definition 1 (Stochastic Computation Graph): A directed, acyclic graph, with three types of nodes: 1) Input nodes, which are set externally, including the parameters we differentiate w.r.t.; 2) Deterministic nodes, which are functions of their parents; 3) Stochastic nodes, which are distributed conditionally on their parents. Each parent v of a non-input node ω is connected to it by a directed edge (v, ω).
Within a stochastic computation graph N, if the path from an input node n to a deterministic node v passes through stochastic nodes, then v may be a non-differentiable function of its parent nodes, since the stochastic nodes can introduce discontinuities that break derivability, as illustrated in Fig. 2. Formally, given an input node n ∈ N, for all edges (v, ω) which satisfy n ≺_D v and n ≺_D ω, the following condition holds: if ω is deterministic, the Jacobian ∂ω/∂v exists, and if ω is stochastic, the derivative of the probability mass function ∂p(ω|PARENTS_ω)/∂v exists. Note that the notation n ≺_D v (n deterministically influences v, where D represents deterministic nodes) means that a deterministic path from n to v exists. In summary, DDA refers to a set of differentiable image processing algorithms that can be interpreted as a stochastic computation graph to compute gradients w.r.t. the image. Essentially, as illustrated in Fig. 1, DDA enables the gradient flow through augmentation operations.

B. Differentiable Image Transformation
As fundamental ingredients of DDA, differentiable image transformations have been widely applied in many 2D and 3D computer vision tasks, including image restoration [23], colorization [24], [25], and differentiable rasterization [38], [39], [40]. This section categorizes common image transformation operators, with a focus on elaborating their differentiabilities. Specifically, we cover 1) geometric transformations, 2) intensity transformations, and 3) mix transformations. Table I summarizes common operations for image data augmentation, noting their differentiability. Additionally, we include neural transformations, which are neural network-based transformation operators.
1) Geometric Transformations: In general, geometric transformations map each pixel coordinate (x, y) in an image to a new location (x̂, ŷ). In order to perform differentiable geometric transformations, Jaderberg et al. [41] introduced a differentiable image sampling method. To transform a given image of shape H × W × C to the target shape Ĥ × Ŵ × C:

V_i^c = Σ_{n=1}^{H} Σ_{m=1}^{W} V_{nm}^c k(x_i − m | Φ) k(y_i − n | Φ), (2.1)

where k(•|Φ) denotes the image interpolation kernel (e.g., bilinear) parameterized by Φ, while V_{nm}^c and V_i^c correspond to the input value at location (n, m, c) and the i-th output value in channel c, respectively. Note that not all interpolation kernels are differentiable; for example, bilinear interpolation is differentiable but nearest-neighbor interpolation is not. With bilinear interpolation, (2.1) becomes

V_i^c = Σ_{n=1}^{H} Σ_{m=1}^{W} V_{nm}^c max(0, 1 − |x_i − m|) max(0, 1 − |y_i − n|), (2.2)

and the partial derivatives are

∂V_i^c / ∂V_{nm}^c = max(0, 1 − |x_i − m|) max(0, 1 − |y_i − n|),
∂V_i^c / ∂x_i = Σ_{n=1}^{H} Σ_{m=1}^{W} V_{nm}^c max(0, 1 − |y_i − n|) · {0 if |m − x_i| ≥ 1; 1 if m ≥ x_i; −1 if m < x_i}, (2.3)

and the same expression as in (2.3) can be derived for ∂V_i^c / ∂y_i. The whole pipeline backpropagates the gradients from the transformed images to the input images via the sampled grid coordinates.
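To make (2.1)-(2.3) concrete, the following is a minimal NumPy sketch of bilinear sampling for a single-channel image, together with the analytical coordinate gradient from (2.3), checked against finite differences. The function names are ours for illustration, not Kornia's API, and practical implementations vectorize the double loop.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample img (H x W) at a continuous location (x, y) with the
    bilinear kernel k(d) = max(0, 1 - |d|), as in (2.2)."""
    H, W = img.shape
    val = 0.0
    for n in range(H):
        ky = max(0.0, 1.0 - abs(y - n))
        if ky == 0.0:
            continue
        for m in range(W):
            kx = max(0.0, 1.0 - abs(x - m))
            if kx:
                val += img[n, m] * kx * ky
    return val

def dval_dx(img, x, y):
    """Analytical partial derivative of the sampled value w.r.t. x (2.3):
    the bilinear kernel is piecewise linear, so its slope is +1 or -1."""
    H, W = img.shape
    g = 0.0
    for n in range(H):
        ky = max(0.0, 1.0 - abs(y - n))
        if ky == 0.0:
            continue
        for m in range(W):
            if abs(m - x) >= 1:
                continue
            slope = 1.0 if m >= x else -1.0
            g += img[n, m] * ky * slope
    return g
```

Because the kernel is piecewise linear, the analytical gradient agrees exactly with a central finite difference away from the kinks at integer coordinates.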
With the rationale of computing the target transformed coordinates, we cover common geometric transformations that are used as data augmentation techniques, such as affine transformation, perspective transformation, and elastic transformation. These transformations can be implemented in a differentiable manner by taking advantage of (2.1). Fig. 3 summarizes common affine and perspective transformation matrices. For further details regarding the differentiable implementation of these transformations, refer to [18].
Affine Transformation is a geometric transformation that preserves points, straight lines, and planes by linear mappings with affine transformation matrices. Formally, the affine transformation matrix M for image transformations can be defined as:

M = [a_11 a_12 t_x; a_21 a_22 t_y; 0 0 1], (2.4)

where the 2 × 2 block encodes rotation, scaling, and shearing, and (t_x, t_y) is the translation. Perspective Transformation uses homogeneous coordinates in the projective space and, unlike affine transformations, does not preserve line parallelism. Formally, a homography is a more generic transform representation with 8 degrees of freedom (DOF), compared to the constrained 6-DOF affine transformation. It can be expressed as:

H = [h_11 h_12 h_13; h_21 h_22 h_23; h_31 h_32 1]. (2.5)

Notably, cropping is a transformation that extracts a sub-area of a given image by removing the pixels at the sides. It is a non-differentiable index-based operation that consumes the start and end coordinates to select desired regions. Thus, perspective transformations come to the rescue to obtain meaningful gradients for the non-differentiable cropping [42].
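A small sketch of (2.4) and (2.5) in NumPy: composing a 3 × 3 affine matrix and warping points in homogeneous coordinates, dividing by the last coordinate so the same routine also handles homographies. Names are ours; a library such as Kornia provides batched, autograd-aware equivalents.

```python
import numpy as np

def affine_matrix(scale=1.0, theta=0.0, tx=0.0, ty=0.0):
    """3x3 affine matrix (scale/rotation/translation subset of the 6 DOF)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

def warp_points(M, pts):
    """Apply a 3x3 transform to N x 2 points in homogeneous coordinates.
    The division by the last coordinate is what distinguishes a
    homography from an affine map (whose last row is [0, 0, 1])."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ M.T
    return out[:, :2] / out[:, 2:3]
```

Cropping a region can then be expressed as sampling the image through such a matrix (scale plus translation), which keeps the crop coordinates inside the computation graph.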
Elastic Deformation is introduced to create deformations that mimic uncontrolled oscillations [43], which have been reported as a useful augmentation in areas like medical imaging [44], [45]. Rather than performing transformations with transformation matrices, Simard et al. [43] proposed to generate a randomized location for each pixel coordinate w.r.t. the original location with:

Δx = Gaussian(rand(−1, 1) | k, σ), Δy = Gaussian(rand(−1, 1) | k, σ), (2.6)

where rand(•) is a uniform sampler and Gaussian(•|k, σ) is a Gaussian filter, parameterized by the kernel size k and elasticity coefficient σ, that smooths the randomly generated new coordinates.
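A minimal NumPy sketch of (2.6), assuming a separable Gaussian filter implemented with 1-D convolutions (function names and the scaling parameter `alpha` are ours); the resulting (Δx, Δy) fields would be added to the sampling grid of the differentiable sampler above.

```python
import numpy as np

def gaussian_kernel1d(k, sigma):
    """Normalized 1-D Gaussian kernel of length k."""
    ax = np.arange(k) - (k - 1) / 2.0
    w = np.exp(-0.5 * (ax / sigma) ** 2)
    return w / w.sum()

def elastic_displacement(shape, k=7, sigma=2.0, alpha=1.0, rng=None):
    """Per-pixel offsets a la Simard et al.: uniform noise in [-1, 1],
    smoothed by a separable Gaussian filter, scaled by alpha."""
    rng = np.random.default_rng(rng)
    w = gaussian_kernel1d(k, sigma)

    def smooth(field):
        # separable convolution: filter rows, then columns ('same' size)
        field = np.apply_along_axis(lambda r: np.convolve(r, w, mode='same'), 1, field)
        field = np.apply_along_axis(lambda c: np.convolve(c, w, mode='same'), 0, field)
        return field

    dx = alpha * smooth(rng.uniform(-1, 1, shape))
    dy = alpha * smooth(rng.uniform(-1, 1, shape))
    return dx, dy
```

The Gaussian smoothing is a convolution, so the whole displacement generation stays differentiable w.r.t. alpha and the noise field.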
2) Intensity Transformation: Intensity transformation refers to image processing operations that modify pixel intensities without changing their geometric locations. In general, intensity transformation includes point operations, which transform each pixel independently, and local operations, which transform each pixel with reference to its neighbors. This survey follows [46] regarding the definition of each operation, since divergent versions of color enhancement transformations exist.
Point Operation transforms each pixel w.r.t. its previous state. Let f(x_i, y_i) and g(x_i, y_i) represent the input and output pixel values at location (x_i, y_i), and let T be the transformation function. The generic point operation can be expressed as:

g(x_i, y_i) = T[f(x_i, y_i)]. (2.7)

In the following, we list the transformations that use the naive 1 × 1 neighborhood size, where a, b, g, k, and p are the magnitudes for each corresponding transformation, and values are rounded to 8-bit unsigned integers (uint8, 0 to 255). Posterize is non-differentiable since it only accepts uint8 inputs for non-linear bit shift operations. Another operation, identified as Equalize, processes images in order to adjust their contrast by modifying the intensity distribution of the image histogram, transforming each pixel to a high-contrast value according to the calculated image histogram. The general histogram equalization formula is:

h(v) = round( (cdf(v) − cdf_min) / (h · w − cdf_min) × L ), (2.8)

where cdf refers to the cumulative distribution function, cdf_min is the minimum non-zero value of the cumulative distribution function, L is the maximum intensity value (255 if uint8), and h and w are the image height and width, respectively. Technically, equalize is implemented with non-differentiable Look-Up-Tables (LUTs).
To the best of our knowledge, no valid differentiable equalize methods have been implemented, though some differentiable histogram calculation methods have been proposed [47], [48], [49].
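For concreteness, here is a minimal (and, as discussed above, non-differentiable) reference implementation of equalize following the cdf formula, along with posterize, whose integer bit shift is the source of its zero gradient. Function names are ours.

```python
import numpy as np

def equalize(img):
    """Histogram equalization of a uint8 image via a look-up table
    built from the cumulative distribution (non-differentiable)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist)
    cdf_min = cdf[cdf > 0].min()
    h, w = img.shape
    lut = np.round((cdf - cdf_min) / (h * w - cdf_min) * 255.0)
    return lut.astype(np.uint8)[img]

def posterize(img, bits):
    """Posterize by keeping only the top `bits` bits of each uint8 value;
    the shift is defined only on integers, hence no useful gradient."""
    shift = 8 - bits
    return (img >> shift) << shift
```

The LUT indexing `lut[img]` is the discrete sampling step that breaks differentiability, which is why the differentiable-histogram works cited above approximate the binning instead.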
Local Operation refers to operations that transform each pixel by considering a collection of pixel values. Its differentiability depends on the differentiability of the targeted local operators. For example, an intuitive local operation that uses linear filters to update each pixel by a weighted sum of pixel values within a neighborhood can be expressed as follows:

g(x_i, x_j) = Σ_{k} Σ_{l} h(x_k, x_l) f(x_i + x_k, x_j + x_l), (2.9)

where h(x_k, x_l) represents the weight for each neighbor pixel around f(x_i, x_j). Similarly, linear filtering operations perform convolution operations, which are apparently differentiable. Contrarily, non-continuous local operations are non-differentiable since they involve discontinuities or singularities. For instance, the median filter is a typical non-differentiable operation that operates in a discrete manner. Fig. 4 illustrates common filtering operations. Morphological transformation contains two fundamental operators: erosion and dilation. The erosion operation can diminish the size of the foreground object, while the dilation operation expands shapes in the images. Particularly, a structuring element is a key concept in morphological transformations: a shape (e.g., squares, circles, diamonds) used to probe an image to selectively expand or shrink features based on brightness, shape, and size constraints. Given an input image X and a structuring element B, the dilation and erosion operators can be written as follows:

Dilation: X ⊕ B = {z | B_z ∩ X ≠ ∅}, Erosion: X ⊖ B = {z | B_z ⊆ X}, (2.10)

where B_z is the translation of B by the vector z, allowing the structuring element B to be translated or shifted to different positions across the image. Technically, the erosion operation computes the pointwise minimum of the given image and structuring element, while the dilation takes the pointwise maximum. In terms of differentiability, the intuitive approach is to perform the minimum/maximum comparison directly with less meaningful gradients, whereas one may
work around with convolution operations for more meaningful gradients, at the cost of numerical stability. For detailed implementations, refer to Kornia [18], as well as a CUDA version (https://github.com/Manza12/nnMorpho.git). Furthermore, the morphological opening operator ∘ and closing operator • are differentiable when built by stacking the erosion and dilation processes, and can be expressed as follows:

X ∘ B = (X ⊖ B) ⊕ B, X • B = (X ⊕ B) ⊖ B. (2.11)

3) Mix Transformation: Mix transformations transform each image with a collection of images, improving model generalization by generating "out-of-distribution" examples to prevent overfitting on the training distribution. Mix augmentations have been extensively reviewed by [50]. We hereby cover some representative methods, showcased in Fig. 6, and discuss their differentiability.
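Returning to the morphological operators above, a minimal grayscale sketch using the pointwise min/max formulation with a flat square structuring element (function names ours; Kornia's versions are batched and autograd-aware):

```python
import numpy as np

def dilate(img, k=3):
    """Grayscale dilation with a flat k x k structuring element:
    pointwise maximum over each pixel's neighborhood."""
    H, W = img.shape
    r = k // 2
    pad = np.pad(img, r, mode='edge')
    out = np.empty_like(img)
    for i in range(H):
        for j in range(W):
            out[i, j] = pad[i:i + k, j:j + k].max()
    return out

def erode(img, k=3):
    """Grayscale erosion: pointwise minimum over each neighborhood."""
    H, W = img.shape
    r = k // 2
    pad = np.pad(img, r, mode='edge')
    out = np.empty_like(img)
    for i in range(H):
        for j in range(W):
            out[i, j] = pad[i:i + k, j:j + k].min()
    return out

def opening(img, k=3):
    """Opening: erosion followed by dilation, per (2.11)."""
    return dilate(erode(img, k), k)

def closing(img, k=3):
    """Closing: dilation followed by erosion, per (2.11)."""
    return erode(dilate(img, k), k)
```

In an autograd framework, max/min propagate the gradient only to the winning pixel of each neighborhood, which is the "less meaningful gradients" issue mentioned above.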
SamplePairing [51] mixes two cropped patches from two randomly selected images (A and B), then averages the two patches without modifying the label. This technique aims at improving model robustness by confusing the network, since the ground truth of 0.5 × A + 0.5 × B can be either A or B, but the confusion also complicates the training procedure as stated by the authors. MixUp [4] applies a similar idea by interpolating the pixel values between two images by a weighted sum, and it further manipulates the label accordingly for Empirical Risk Minimization (ERM). These linear methods are obviously differentiable, as they apply a simple I_out = α × A + β × B, where α and β are the coefficients for each image. CutMix [5] randomly cuts out portions of an image and places them over another. Mosaic [52] randomly cuts out and mixes four training images at one time. It also proposes to modify the label representations similarly to MixUp [4]. Additionally, it is worth mentioning that we also include image-noise mixing in this category, which mixes up images and noise. CutOut [53] erases a square region of the input image by masking the corresponding pixel values with zeros. RandomErase [54] and PatchGaussian [55] were proposed to replace the masked area with random values (e.g., noise).

4) Neural Transformation: Generative models [56], [57], [58], [59], [60] are the most common neural transformation methods, which can be used as static neural augmentation networks (as in Section III-A1) to enlarge the dataset. In this survey, we particularly focus on works that take advantage of the differentiability of neural networks that can be used as data augmentation operators to generate meaningful variants of a given image. Fig. 7 presents a visual illustration of neural transformations in image editing and the generation of adversarial examples.
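The linear mixing operations described above can be sketched in a few lines of NumPy; MixUp is differentiable everywhere, while CutOut's hard mask is the non-differentiable counterpart. Function names and parameters are ours.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """MixUp: convex combination of two images and their one-hot labels.
    Purely linear, hence differentiable in both inputs and lambda."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def cutout(x, cy, cx, size):
    """CutOut: zero a square region centered at (cy, cx).
    The hard 0/1 mask has no useful gradient w.r.t. its location."""
    out = x.copy()
    h = size // 2
    out[max(0, cy - h):cy + h + 1, max(0, cx - h):cx + h + 1] = 0.0
    return out
```

Note that the mixed label carries the same coefficient lambda as the mixed image, which is what makes MixUp compatible with ERM.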
Image Editing: Generative models like GANs learn a latent space that offers the possibility to control the generation outcome [6], [61]. Also, simple transformations like adding small noise to the latent codes can result in realistic augmented data; since the generator is differentiable, these small perturbations can be backpropagated through. For instance, DAGAN [62] and MineGAN [63] proposed to manipulate the latent space of generative models to generate augmented samples that significantly improved model performance under a low-data regime. In particular, this area of research involves the inversion problem [64], [65], [66], which investigates how to invert a given image back into the latent space of a pretrained generative model so that the image can be faithfully reconstructed from the inverted code by the generator. Many existing works [67], [68] utilized inversion techniques for data augmentation in various domains. Additionally, BlobGAN [69] proposed an interesting unsupervised representation learning method that decomposes a scene into blobs (i.e., the latent space introduced by BlobGAN), in which all the blobs are differentiably placed onto a feature grid that is decoded into an image by a generative adversarial network. Though not widely explored yet, recent advances in guided image editing techniques [60], [70], [71], [72] may potentially be seen as data augmentation operators.
Adversarial Examples: The process of generating adversarial examples is fully differentiable with respect to the model parameters [74], [75], where the generated adversarial examples confuse the target model with small image perturbations. Adversarial training provides an elegant way to directly increase model robustness by learning on targeted adversarial examples generated through differentiation [76], [77], [78]. Classic works include projected gradient descent (PGD) adversarial training, proposed by Madry et al. [3], which uses PGD as a reliable first-order adversary to improve model robustness. Additionally, virtual adversarial training (VAT) [79] adds small perturbations to the input that maximize the change in the output distribution, to improve model performance and robustness; the perturbation generation process is differentiable and performed along with model training. Adversarial attacks and robustness are an active area of research; interested readers may refer to [80], [81].
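As a minimal illustration of gradient-based perturbation (a single fast-gradient-sign step, the building block that PGD iterates), here is a sketch on a toy logistic model; the model and function names are ours, not from the cited works.

```python
import numpy as np

def fgsm(x, w, b, y, eps):
    """One fast gradient sign step on a logistic model p = sigmoid(w.x + b):
    perturb the input along the sign of the loss gradient w.r.t. x."""
    z = w @ x + b
    p = 1.0 / (1.0 + np.exp(-z))
    # d(cross-entropy)/dz = p - y, and dz/dx = w, so dL/dx = (p - y) * w
    grad_x = (p - y) * w
    return x + eps * np.sign(grad_x)
```

PGD repeats this step with a projection back into an epsilon-ball around the clean input; because every step is differentiable, the whole generation process can sit inside a training graph.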

C. Gradient Approximation
As indicated in Table I, non-differentiable transformations (e.g., posterize) exist in common data augmentation routines. Most non-differentiable transformations create discontinuities with discrete sampling operations, resulting in zero gradients w.r.t. their parameters. An intuitive approach is to redesign the computation graph in a differentiable manner, as in the cropping transformation example introduced in Section II-B1. However, not all transformations can be redesigned to be differentiable. Thus, in order to maintain the gradient flow, recent studies tend to construct gradient estimators that create optimizable surrogate functions to replace the operation gradients with estimated ones. Formally, in reinforcement learning, gradient estimation aims at choosing the parameters θ of a distribution π(a|s, θ) to maximize the expected reward over state-action trajectories τ. By using the reparameterization trick [37] or other gradient estimators such as straight-through [82] or RELAX [83], meaningful augmentation gradients can be computed and integrated into the neural network optimization graph. Next, we summarize common gradient estimators applied in DDA-related literature, including backpropagation-based methods and sampling-based methods.
1) Backpropagation-Based Methods: The backpropagation-based methods directly backpropagate gradients through non-differentiable operations, essentially "pretending" they are differentiable. Reparameterization computes analytical gradients and is more mathematically principled with a lower variance, while STE is a simpler heuristic that can be unstable.
Reparameterization often refers to the reparameterization trick, which reconstructs a parameterized random variable into a parameterized deterministic function of a noise variable that is independent of the initial parameter. As an example, we cite the Gumbel-Softmax [84], a categorical reparameterization method that uses softmax as a differentiable approximation to the argmax operation of the Gumbel-Max trick [85]. Let z be a categorical variable with class probabilities θ_1, . . ., θ_K; the Gumbel-Softmax is then given by:

σ_η(θ)_k = exp((log θ_k + g_k) / η) / Σ_{j=1}^{K} exp((log θ_j + g_j) / η), (2.12)

where g_1, . . ., g_K are random samples from the Gumbel distribution and η is a temperature parameter. When η is small, σ_η(θ)_k would be close to a one-hot-like vector but the variance of gradients would be large; when η is large, σ_η(θ)_k would be smoother but the variance of gradients would be small. Additionally, the Bernoulli distribution, as a typical non-differentiable distribution, can be interpreted as a special binary case of Gumbel-Softmax. Hence, Gumbel-Softmax can also serve as an approximation to Bernoulli distributions, resulting in the Relaxed Bernoulli, which can be expressed as:

b = σ((log θ − log(1 − θ) + log u − log(1 − u)) / η), (2.13)

where σ(•) is a sigmoid function and u is a value sampled from the uniform distribution U(0, 1). The distribution will be similar to a Bernoulli with a lower temperature η → 0.
Straight-Through Estimator (STE) [82] is a simple gradient estimator that bypasses a non-differentiable operation f(•) by forcing the operator gradient to df(x)/dx = 1, where x refers to whatever input is fed into f(•) during the forward pass. Though not mathematically rigorous, STE maintains the gradient flow by approximating the function gradient with the identity function, as if the non-differentiable operation did not exist during backpropagation. Notably, though other curated gradient values were explored [86], [87], the gradient of STE is commonly set to 1 to avoid erroneously scaling the gradient up or down, as experimented in [82]. In practice, this simple estimator is efficient to implement and works well for estimating the gradients of data augmentation [28]. However, since higher gradient variance might be introduced due to the discrepancies between the forward and backward passes, as pointed out by [84], recent works are more inclined to use sampling-based gradient estimation methods like REINFORCE [88], [89], [90] and RELAX [26], [27].
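The STE idea fits in a few lines: the forward pass applies the hard operation, while the backward pass passes the incoming gradient through unchanged. The example below (names ours) applies the estimator by hand to y = round(x)^2, whose true gradient is zero almost everywhere.

```python
import numpy as np

def round_ste_forward(x):
    """Forward pass: the hard, non-differentiable rounding."""
    return np.round(x)

def round_ste_backward(grad_out):
    """Backward pass: treat round() as the identity, so the incoming
    gradient passes through unchanged (d round(x)/dx := 1)."""
    return grad_out

def quantized_square_grad(x):
    """STE gradient of round(x)**2: the chain rule gives
    2 * round(x) * d round(x)/dx, and STE sets the last factor to 1."""
    return 2.0 * np.round(x)
```

In an autograd framework the same effect is usually obtained with a stop-gradient trick such as `x + stop_gradient(round(x) - x)`.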
2) Sampling-Based Methods: The sampling-based methods use randomized sampling and stochastic gradients to estimate gradients for non-differentiable functions. Both RELAX and REINFORCE produce unbiased gradients, while REINFORCE has higher variance. A key difference is that REINFORCE is used for estimating gradients through non-differentiable reward functions (e.g., losses), while RELAX is used for gradients through non-differentiable operations.
REINFORCE [91] is known as a score-function estimator or a Monte Carlo policy gradient method. It computes the likelihood ratio, taking advantage of the differentiable log-derivative, to estimate the gradients of the augmentation policy search. Some studies [88], [89], [90] utilized the REINFORCE algorithm (a.k.a. Monte Carlo stochastic relaxation, as in [88]) to estimate the gradients of the non-differentiable augmentation policy search process as:

∇_θ E_τ[R(τ)] ≈ (1/N) Σ_{n=1}^{N} R(τ_n) ∇_θ log p(τ_n), (2.14)

where p(τ_n) represents the probability of the policy τ_n and R(τ_n) is its reward. Note that the REINFORCE estimator may provide high-variance gradients.
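A self-contained toy sketch of the score-function estimator (names and the Bernoulli setup are ours): estimating d/dθ E_{b~Bernoulli(θ)}[f(b)], whose exact value is f(1) − f(0), by averaging f(b) times the score d log p(b|θ)/dθ.

```python
import numpy as np

def reinforce_grad(f, theta, n_samples=200000, rng=None):
    """Score-function (REINFORCE) estimate of
    d/d theta E_{b ~ Bernoulli(theta)}[f(b)]."""
    rng = np.random.default_rng(rng)
    b = (rng.uniform(size=n_samples) < theta).astype(float)
    # d log p(b|theta) / d theta for a Bernoulli:
    score = b / theta - (1 - b) / (1 - theta)
    return np.mean(f(b) * score)
```

The estimate is unbiased but noisy, which is exactly the high-variance behavior noted above; control variates (as in RELAX) reduce this variance.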
RELAX [83] is similar to REINFORCE in that it uses Monte Carlo sampling, but adds a continuous relaxation to obtain unbiased and low-variance gradient estimates, since reparameterization tricks may introduce biased gradients [84], [92]. Given a discrete random variable b and an introduced "relaxed" continuous variable z, the RELAX estimator can be expressed as:

ĝ = [f(b) − c_φ(z̃)] ∇_θ log p(b|θ) + ∇_θ c_φ(z) − ∇_θ c_φ(z̃), (2.15)

where a differentiable neural network c_φ is used as the surrogate function of the target non-differentiable operator f(•), i.e., c_φ(•) is encouraged to approximate f(•). Notably, RELAX requires a continuous, reparameterizable distribution π(z|θ) (e.g., Gumbel-Softmax, Relaxed Bernoulli) and a deterministic mapping H(z) such that H(z) = b ∼ p(b|θ) when z ∼ π(z|θ); it applies the control variate both at a relaxed input z ∼ π(z|θ) and at a relaxed input conditioned on the discrete variable b, denoted z̃ ∼ π(z|b, θ).

III. NEURAL AUGMENTATION NETWORKS
Gradient-based optimization has been used in many areas of image processing [93], [94], [95]. With the development of deep learning, an intuitive intention is to integrate the augmentation pipeline into neural network computation graphs, so as to enable gradient computation and optimization for image transformation operations. Many applications (e.g., image restoration [23], colorization [24], [25]) optimized image processing parameters with stochastic gradient descent. Moreover, Wang and Perez [2] trained a 5-layer CNN to perform augmentation without any traditional augmentation methods, which implies the potential of neural augmentation methods. Essentially, neural augmentation directly embeds DDA into the model as part of the neural network, taking advantage of the maintained gradient flows of differentiable image transformations. This section introduces neural augmentation networks, then summarizes current studies as static neural augmentation and optimizable neural augmentation, depending on the optimization state, as illustrated in Fig. 8(a) and (b).

A. Neural Augmentation Networks
Earlier works such as the spatial transformer network (STN) [41] integrated differentiable transformation layers into neural networks, applying an attention mechanism through spatial transformations. As shown in Fig. 9, an STN boils down to three components: a localization network, a grid generator, and a sampler.
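A schematic sketch of the grid generator and sampler stages in NumPy (names ours; the localization network, which would predict theta, is omitted, and nearest-neighbor sampling stands in for the differentiable bilinear kernel of Section II-B1 for brevity):

```python
import numpy as np

def affine_grid(theta, H, W):
    """Grid generator: map each output pixel (i, j) through the 2x3
    matrix theta to a source sampling location (x, y)."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    src = theta @ pts                      # shape 2 x (H*W)
    return src[0].reshape(H, W), src[1].reshape(H, W)

def stn(img, theta):
    """Sampler: read the source image at the generated grid locations;
    out-of-bounds locations read as 0."""
    H, W = img.shape
    gx, gy = affine_grid(theta, H, W)
    out = np.zeros_like(img)
    xi = np.round(gx).astype(int)
    yi = np.round(gy).astype(int)
    ok = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
    out[ok] = img[yi[ok], xi[ok]]
    return out
```

In a real STN, theta comes from the localization network and the bilinear sampler makes the whole pipeline differentiable w.r.t. both the image and theta.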
More recently, OnlineAugment [96] used STN and variational auto-encoder (VAE) [97] networks to construct a series of augmentation networks: an affine transformation network, a deformation network, and an intensity perturbation network. Gao et al. [98] followed a similar design with three convolutional neural networks that learn to generate transformations adaptively. Interestingly, the affine transformation network of OnlineAugment computed gradients w.r.t. either generic feature maps or 1-D Gaussian noise but not the images, as the authors indicated that the same spatial transformation should be applicable to different images regardless of their contents. Tang et al. [96] and Gao et al. [98] characterized data variations into three differentiable parts: affine transformations, local deformations, and appearance perturbations. Differently, TeachAugment [99] proposed differentiable neural augmentation models consisting of a color augmentation model and a geometric augmentation model.
Meanwhile, apart from routine image transformations (e.g., brightness and rotation adjustments), domain-specific image transformations were developed to best fit different domain areas. RenderGAN [100] defined four optimizable augmentation models, covering blurriness, lighting, background, and details, to generate realistic data given labels for analyzing the complex social behavior of honey bees. Similarly, WaterGAN [31] defined an attenuation model, a back-scattering model, and a camera model to generate underwater-like images from non-underwater images. Tseng et al. [30] leveraged differentiable proxy models covering seven common image signal processor (ISP) stages to reproduce the entire ISP image transformation process.
1) Static Neural Augmentation Network: Static neural augmentation networks are essentially non-optimizable neural networks that allow gradients to flow through augmentation layers without optimizing any parameters. For instance, Sharan et al. [101] employed a differentiable Gaussian filter layer as part of a UNet-based generator to improve the robustness of various tasks (e.g., landmark detection, semantic segmentation). Sun et al. [102] adopted differentiable Sobel filters to maintain the gradient information from edge detection to enhance image edge clarity. Recent studies [103], [104], [105] applied data augmentation to the discriminator of GANs to avoid discriminator overfitting. These GAN training techniques used a set of differentiable image transformations to augment samples before they are fed into the discriminator, which significantly improved the data efficiency of GAN training. Zhao et al. [106] further stated that better performance would be obtained by applying the exact same transformation to both training data and generated samples at the same time. Additionally, Sablayrolles et al. [32] investigated data marking strategies to detect whether a marked dataset has been used to train a model, which integrated DDA to trace the augmentation gradients. Furthermore, whilst training GANs, Tran et al. [105] performed a theoretical analysis stating that the preservation of the Jensen-Shannon (JS) divergence between the input data and generated data can be guaranteed if differentiable and invertible (bijective) transformations are performed. With the development of image synthesis methods, StableRep [60] directly employed diffusion models as static neural augmentation networks to generate positive image pairs by prompting the same text description for visual representation learning. Similarly, DIA [107] also used diffusion models as static neural augmentation networks to construct the negative counterpart of a given image for fine-grained representation learning.
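To make the GAN-side idea concrete, the sketch below (a toy NumPy example with hypothetical names, not code from the cited works) applies one shared differentiable transform, a random brightness shift, to both real and generated batches before they are scored by the discriminator; with autograd, generator gradients would flow back through the augmentation:

```python
import numpy as np

rng = np.random.default_rng(0)

def diff_augment(batch, shift):
    # a differentiable brightness shift; the SAME shift is reused for the
    # real and fake batches, following the observation of Zhao et al. [106]
    return np.clip(batch + shift, 0.0, 1.0)

def discriminator_gap(d_score, real, fake):
    shift = rng.uniform(-0.2, 0.2)  # one shared transform per training step
    return d_score(diff_augment(real, shift)) - d_score(diff_augment(fake, shift))
```

Here `d_score` stands in for any discriminator returning a scalar score; because the transform is differentiable and shared, the augmentation does not leak into the generated distribution.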
2) Optimizable Neural Augmentation Network: Optimizable neural augmentation networks optimize designated augmentation parameters via backpropagation. Typically, they are predefined differentiable neural augmentation models that can be integrated as part of the network optimization graph, where the gradient comes from either differentiable operators or gradient estimators.
An earlier work on neural augmentation is Smart Augmentation [108], which aims at learning suitable augmentations while training deep neural networks. Notably, augmentation optimization can lead to augmentation avoidance. As observed by Benton et al. [109], affine transformation parameters tend to be optimized to not perform any augmentation under the standard cross-entropy loss. The authors hereby proposed to broaden the augmentation distribution with an extra regularization term R(μ) = −||μ||², which rewards larger augmentation magnitudes μ to avoid collapse. Since this formulation considers a single transformation, it neglects the other degrees of freedom of policy selection, where a new hyperparameter could lead to augmentation avoidance again. Thus, Rommel et al. [110] added a selection weight ω and penalized the product of weights and magnitudes with R(ω, μ) = −||ω ⊙ μ||² to avoid collapse.
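A minimal sketch of these two regularizers (our own NumPy toy, with the elementwise product written explicitly):

```python
import numpy as np

def augerino_reg(mu):
    # R(mu) = -||mu||^2: adding lambda * R to the loss rewards larger
    # augmentation magnitudes, counteracting augmentation avoidance
    return -float(np.sum(np.asarray(mu, float) ** 2))

def augnet_reg(omega, mu):
    # R(omega, mu) = -||omega * mu||^2 also accounts for the selection
    # weights omega, so the new hyperparameter cannot cause collapse
    return -float(np.sum((np.asarray(omega, float) * np.asarray(mu, float)) ** 2))
```

In practice the term is weighted, e.g. `total_loss = task_loss + lam * augerino_reg(mu)` with a small `lam`.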
Inspired by adversarial training [3], another branch of studies [79], [111] tends to augment examples with adversarial methods. Unlike direct end-to-end gradient descent optimization, adversarial examples are considered "harder" to learn from, which could lead to better model generalization. However, adversarial training is unstable without any constraint, due to the risk of destroying the inherent meanings of images and producing out-of-distribution data. Ratner et al. [112] introduced the concept of null class mapping, representing data points that have been destructively transformed without preserving the original labels. Formally, the null class mapping pitfall is given as m(τ(x)) = m_φ, where m(·) denotes the label mapping function for obtaining the ground truth of any given data, τ represents the augmentation, and m_φ represents an out-of-distribution null class. To avoid generating out-of-distribution examples, the authors introduced a generative adversarial objective and a diversity objective to minimize the probability of a generated transformation sequence mapping augmented data τ(x) to m_φ. Meanwhile, different regularizers were proposed to constrain the transformations and avoid null class mapping. VAT [79] performs adversarial training that regularizes the generated examples with local distributional smoothness (LDS). RAT [111] further extended VAT with richer data transformations. OnlineAugment [96] added a regularization term to constrain the augmented data within reasonable distributions. TeachAugment [99] leveraged a teacher model to guarantee that the generated images are meaningful, relaxing the complexity of parameter tuning without any prior knowledge. Note that invertible operations refer to image operations that can fully restore the original images without information loss (e.g., rotations by 90, 180, or 270 degrees).

B. Relation to Physics-Informed Neural Networks
Another research trend focuses on physics-informed neural networks (PINNs) [113], which incorporate prior knowledge of physical principles into neural networks. Specifically, physical laws are interpreted in a differentiable manner as differentiable transformations. For example, Dubois et al. [114] utilized a Fourier-space transformation matrix to map the Fourier-transformed 2D magnetisation vector V to the Fourier-transformed magnetic field B, where k_x and k_y are the Fourier-space coordinates, k = (k_x² + k_y²)^{1/2}, and the detailed approximation of the involved coefficient α can be found in [114]. In a way, PINNs can be seen as a specially constrained case of differentiable transformations, or as neural augmentation embedded with particular prior knowledge.

IV. DIFFERENTIABLE AUGMENTATION SEARCH
Among optimizable neural augmentation networks, differentiable automatic data augmentation is of particular interest. Automatic data augmentation refers to the process of finding the best data augmentation policies for a specific dataset. As illustrated in Fig. 10(a), it attempts to search through a predefined search space of augmentation policies to find the optimal augmentation strategy, where the search space often contains a policy matrix that enumerates different compositions of augmentation strategies. Namely, with an augmentation policy τ applying a sequence of transformations to produce the augmented data τ(x), and an evaluation function f(·) to assess the learnt policy, the general task of automatic data augmentation is to find the best augmentation policy for the best evaluation result:

τ* = arg max_τ f(τ(x)),

where the evaluation function f normally denotes the target neural network model in this context. In [35], a comprehensive survey on automatic data augmentation methods is provided. This section introduces different objectives of automatic data augmentation, with a focus on their combination with differentiable methods. Benefiting from DDA, one may optimize augmentation parameters directly for the optimal augmentation strategy with gradient-based optimization methods. Fig. 10 presents a visual demonstration of the typical automatic augmentation search methods used in the literature.

A. Objectives
The objective of obtaining the best augmentation policy, a.k.a. policy optimization, can be described as a bilevel optimization problem, while some works recast it as a matching problem to reduce the computational cost. For training neural networks, one common assumption is that hard examples make training more effective and efficient [115]. Thus, some studies applied adversarial techniques that focus on augmenting examples to hinder the learning. This section introduces four common optimization objectives in the literature: bilevel optimization, matching, min-max, and empirical risk minimization.
1) Bilevel Optimization Problem: Many augmentation policy searching methods [27], [88], [90], [116], [117] aim at solving a bilevel optimization problem [118], in which an inner (lower-level) optimization task is embedded within an outer (upper-level) optimization task. Commonly, the inner level is the model weight optimization that optimizes network parameters θ on the training data using a given augmentation policy candidate τ, while the outer level is the augmentation policy optimization that optimizes the policy τ given the result of the inner-level problem. More precisely, the bilevel optimization problem is formulated as

τ* = arg min_τ L(θ*(τ) | D_val),  s.t.  θ*(τ) = arg min_θ L(θ | τ(D_train)),

where L(θ|D) represents the loss over dataset D given θ. Bilevel optimization is a typical split-optimization formulation that iteratively optimizes θ and τ, respectively. It has gained popularity since the work of AutoAugment [116], which consists of two components: a search algorithm and a search space. AutoAugment designed a policy search space with sub-policies, in which each sub-policy is parametrized into an image processing function, a probability, and a magnitude. It relies on the validation data to evaluate the augmentation performance while searching in the discrete policy space, resulting in inefficient bilevel optimization. The complete bilevel optimization pipeline is computationally expensive, as stated in AutoAugment [116]. PBA [119] proposed to learn the optimal augmentation schedule rather than transformations. Tian et al. [117] proposed augmentation-wise weight sharing to pretrain the network weights θ with a shared augmentation before searching the policy τ, which could significantly speed up the optimization. OHL-Auto-Aug [90] proposed an online bilevel optimization approach for improving the search efficiency and the final classification accuracy, which adopted the REINFORCE [91] estimator to approximate the gradients of the validation accuracy w.r.t. augmentation parameters.
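The alternating structure can be illustrated with a toy regression problem (all names and the noise-based "augmentation" are our own simplifications; real methods search over image transformations): the inner loop fits θ on augmented training data for each candidate τ, and the outer step keeps the candidate with the lowest validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train, x_val = rng.normal(size=50), rng.normal(size=50)
y_train, y_val = 3 * x_train, 3 * x_val  # ground-truth slope is 3

def inner_train(tau, steps=100, lr=0.1):
    # lower level: fit theta on training data "augmented" with noise of scale tau
    theta = 0.0
    for _ in range(steps):
        x_aug = x_train + tau * rng.normal(size=x_train.shape)
        theta -= lr * np.mean(2 * (theta * x_aug - y_train) * x_aug)
    return theta

def val_loss(tau):
    # upper level objective: validation loss of the inner-level solution
    return float(np.mean((inner_train(tau) * x_val - y_val) ** 2))

# upper level: discrete search over candidate policies, as in AutoAugment-style methods
best_tau = min([0.0, 0.1, 0.5, 1.0], key=val_loss)
```

The expense is visible even in this toy: every outer candidate triggers a full inner training run, which is exactly what the differentiable relaxations below try to avoid.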
Inspired by DARTS [120], a differentiable neural architecture search framework, faster search can be achieved with gradient-based optimization by relaxing the discrete search space to a continuous one. DADA [26] and ADDA [27] relaxed the augmentation policy spaces to obtain optimal policies with reduced computational complexity. Furthermore, DDAS [121] approximated the inner optimization with a one-step gradient to avoid the full training process. As pointed out by [96], offline-searched policies are isolated from the downstream tasks and may fail to remain class-preserving across domains, while online augmentation search could mitigate these problems. DABO [122] adopted DDA to solve the bilevel optimization problem in an online manner whilst training, while MADAO [29] proposed to solve the bilevel optimization problem by optimizing θ and τ simultaneously with a Neumann series approximation in an online manner, reducing the search time from days to hours.
2) Matching Problem: Density Matching: Since bilevel optimization is computationally demanding, Fast AutoAugment [123] and Faster AutoAugment [28] considered augmentation policy searching as a density matching problem, i.e., matching the density of a dataset D_1 with the density of an augmented dataset D_2, which works as a surrogate of bilevel optimization. In particular, density matching views data augmentation as a process that fills in missing data points of the training data. Fast AutoAugment [123] proposed to match the density of the training set D_train with the density of the augmented validation set D_val. Specifically, the authors divided the training set D_train into k subsets {D¹_train, ..., D^k_train}, and each subset is split into two subsets D^n_M and D^n_A, where n = 1, ..., k. The subset D^n_M was used to learn the model parameters θ, while D^n_A was used for searching the augmentation policy τ. During training, k models are first trained on D^n_M without any augmentation; then each model is frozen to search for the best augmentation policy τ_n via Bayesian optimization for each pair of D^n_M and D^n_A. Intuitively, this requires each D^n_M to be a representative sample of D_train, so that a policy matching the densities of the augmented D^n_A and D^n_M will also match the densities of the augmented D_train and D_val. With the expected model performance R(θ|D), density matching optimizes the following objective:

τ*_n = arg max_τ R(θ_n | τ(D^n_A)),

where τ*_n approximately minimizes the density distance between D^n_M and τ(D^n_A) by maximizing the model performance with the same parameters θ_n. The final τ* is obtained by merging all the τ*_n. More recently, Faster AutoAugment [28] solved density matching with DDA techniques, directly minimizing the Wasserstein distance [124] between D^n_M and τ_n(D^n_A), where τ_n is differentiable. Additionally, a classification loss L(f(τ(x)), y), where f is the image classifier and L is the loss function, is deployed for preserving image labels to avoid images being overly transformed. However, both methods adopt shallow augmentation layers (i.e., 2 sub-policies), since they are based on brute-force searching that prohibits the use of multiple sub-policies.

Gradient Matching: Deep AutoAugment [125] formulated the problem as a regularized gradient matching problem, which steers the gradient of the augmented training data towards the gradient of the validation batch by maximizing their cosine similarity. With g(x, μ) representing the gradient of the augmented training data, where μ is the parameter of τ, and g_v being the gradient of the validation data, gradient matching can be formulated as

μ* = arg max_μ cos(g(x, μ), g_v).

This method also tackled the exponential growth of the search space's dimensionality as more augmentation layers are added, by progressively stacking augmentation layers and optimizing each layer on the data distribution transformed by the previous augmentation layers.
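At its core, the gradient matching objective is just a cosine similarity between two gradient vectors, as in this toy sketch (function names are ours; real implementations compute g over the network parameters):

```python
import numpy as np

def cosine_sim(g_aug, g_val):
    # cosine similarity between the augmented-training and validation gradients
    g_aug, g_val = np.asarray(g_aug, float), np.asarray(g_val, float)
    return float(g_aug @ g_val / (np.linalg.norm(g_aug) * np.linalg.norm(g_val) + 1e-12))

def pick_augmentation(candidate_grads, g_val):
    # prefer the augmentation whose training gradient best aligns with
    # the validation gradient (a discrete stand-in for optimizing mu)
    return max(candidate_grads, key=lambda k: cosine_sim(candidate_grads[k], g_val))
```

In the differentiable setting, `mu` is updated by gradient ascent on this similarity rather than by discrete selection.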

3) Min-Max Adversarial Problem:
The joint optimization of the target network parameters θ and the augmentation policy network parameters τ is assumed to be superior, since the augmentation networks are updated according to the learning state of the target network model [29], [89], [126]. Many adversarial learning-based methods play min-max games that perform augmentations to maximize the loss of the target model, thereby improving model generalization. Particularly, non-differentiable augmentation operations tend to be avoided to maintain the gradient flow from the target network to the augmentation policy network. Earlier works performed adaptive transformation selection to maximize the end-model loss [127], or learned augmentations with adversarial objectives [112]. Adversarial AutoAugment [89] applied the REINFORCE algorithm to relax the non-differentiable augmentation policy network and play min-max games towards generating adaptive augmentation policies in an online manner. It formulated the augmentation search problem as a min-max game, in which the target network θ minimizes the training loss while the policy network τ maximizes it. The generic formulation is

min_θ max_{τ∈S} E_{(x,y)∼D} [L(f_θ(τ(x)), y)],

where x and y are the image and label data sampled from the training set D, f_θ(·) is the model parameterized by θ, L(·) is the loss function, and S is the set of all available augmentations. Gao et al. [98] further combined adversarial training techniques and Adversarial AutoAugment to form a regularized adversarial training framework with additional min-max objectives, along with an augmentation magnitude regularization term that penalizes large augmentations to avoid out-of-distribution data. OnlineAugment [96] utilized a meta-learned augmentation network along with a differentiable online data augmentation scheme based on adversarial training. To reduce the many regularization hyperparameters of OnlineAugment, TeachAugment [99] leveraged a teacher model to avoid careful parameter tuning.
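A toy NumPy sketch of this min-max dynamic (our own simplification: the "policy" is a single additive shift, clipped to stay in-distribution, whereas real methods update a policy network):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(64, 2))
w_true = np.array([1.0, -2.0])
y = x @ w_true  # toy regression targets

theta, shift = np.zeros(2), np.zeros(2)
for _ in range(200):
    residual = (x + shift) @ theta - y
    # inner step: the model DESCENDS on the loss over augmented data
    theta -= 0.05 * 2 * np.mean(residual[:, None] * (x + shift), axis=0)
    # outer step: the policy ASCENDS on the same loss (min-max game) ...
    shift += 0.05 * 2 * np.mean(residual) * theta
    # ... under a magnitude constraint keeping augmented data in-distribution
    shift = np.clip(shift, -0.5, 0.5)
```

The clipping step plays the role of the magnitude regularizers discussed above: without it, the adversarial policy would drive the data out of distribution.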
Additionally, UADA [126] applied adaptive adversarial data augmentation without a policy search space, updating augmentation parameters along the direction of the loss gradient (i.e., akin to an adversarial attack). 4) Empirical Risk Minimization Problem: Empirical risk minimization (ERM) is another joint optimization approach that commonly offers an integral loss function to minimize the training loss while maximizing the augmentation diversity. As introduced in Section III-A2, Smart Augmentation [108], Augerino [109], AugNet [110], and InstaAug [128] are typical end-to-end learning methods based on differentiable augmentation operators, which minimize the empirical risk whilst maintaining a high variance of available augmentations without augmentation avoidance. Technically, with model parameters θ and augmentation parameters μ, for any image x and label y, the loss function of Augerino [109] is defined as

L(f(x; θ, μ), y) + λR(μ),

where L is a loss function (e.g., cross-entropy) and λR(μ) is a regularization term that encourages large transformations, with R(μ) = −||μ||². AugNet [110] extended Augerino with hierarchical differentiable data augmentation layers and introduced augmentation weights ω to select the transformations that encode the strongest data invariances. Additionally, AugNet revised Augerino's regularization term −||μ||² to −||ω ⊙ μ||² to prevent the potential augmentation avoidance caused by the introduced hyperparameter ω. Apart from learning global augmentations, InstaAug [128] proposed to perform instance-specific augmentation, aiming at learning augmentations that are more diverse yet still label-preserving compared to global augmentations. Meanwhile, GA3N [129] attempted to include GAN operations as additional policies in the policy space, and then applied adversarial training between the policy and target networks to find optimal policies: the target network minimizes the loss on real and augmented samples, while the policy network maximizes the loss to generate harder augmented samples.

B. Differentiable Relaxation
The aforementioned methods mainly involve a non-differentiable policy space and a number of predefined augmentation policies. This subsection highlights the differentiable relaxation techniques for these components.
1) Policy Space Relaxation: Sub-policies are scattered in the search space in a discrete manner, posing optimization difficulties due to the discontinuity. DARTS [120] is a gradient-based neural architecture search method for solving the bilevel optimization problem. Essentially, it relaxes the categorical selection of a particular operation into a softmax over all possible operations. Most of the studies adopted this approach to relax the discrete space into a continuous one. Specifically, Faster AutoAugment [28] utilized a softmax function σ(z)_i = exp(z_i/η) / Σ_j exp(z_j/η) with temperature η, so that σ(z) becomes a one-hot-like vector when η is small. MADAO [29] replaced the softmax function with the Gumbel-softmax (2.12) to improve the differentiability. DADA [26] and ADDA [27] pointed out that Gumbel-softmax gradients are biased, so the search efficiency can be further improved by adopting the unbiased RELAX gradient estimator [83]. Meanwhile, this relaxation can be used to sample whole sub-policies rather than operations within sub-policies, as in DADA [26].
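Both relaxations are easy to sketch in NumPy (function names are ours): a low temperature η sharpens the distribution toward a one-hot vector, and adding Gumbel noise turns the deterministic softmax into a reparameterized sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_with_temperature(z, eta):
    # sigma(z)_i = exp(z_i / eta) / sum_j exp(z_j / eta)
    z = np.asarray(z, float) / eta
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def gumbel_softmax(logits, eta):
    # reparameterized sample: softmax((logits + Gumbel noise) / eta)
    g = -np.log(-np.log(rng.uniform(size=len(logits))))
    return softmax_with_temperature(np.asarray(logits, float) + g, eta)

z = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(z, eta=0.05)  # near one-hot
smooth = softmax_with_temperature(z, eta=5.0)  # near uniform
sample = gumbel_softmax(z, eta=0.5)            # stochastic, still differentiable
```

Because the Gumbel noise enters additively, gradients w.r.t. the logits flow through the sample, which is what makes the relaxed policy selection trainable.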
Notably, although the relaxation improved the search efficiency, some practitioners claim it introduces discrepancies between the real and relaxed augmentation spaces [125]. Thus, instead of relaxation, Deep AutoAugment [125] proposed neural policy layers, where each layer is a 139-dimensional categorical distribution matching the 139 {operation, magnitude} pairs. GA3N [129] modeled the search space using an LSTM network that sequentially selects operations and magnitudes. Additionally, DDAS [121] directly applied a one-step gradient update over the expectation of the training loss, instead of reparameterization tricks or gradient estimators.
2) Operation Relaxation: Most augmentation search methods [116], [119] treated image augmentation as a gradient-free preprocessor, which is not backpropagatable through the network computation graph. In order to take advantage of gradient-based optimization, DADA [26] and CADDA [27] utilized the RELAX algorithm to approximate gradients for all the augmentation operations, neglecting the operations that are naturally differentiable. Hence, Hataya et al. [28], [29] integrated DDA to avoid inaccurate gradient approximations, while adopting STE to approximate non-differentiable operations. The integrated DDA operation O(·; p, μ) would be differentiable w.r.t. the probability p and the magnitude μ as

O(x; p, μ) = b · O(x; μ) + (1 − b) · x,  b ∼ Ber_relaxed(p),

where Ber_relaxed relaxes each operation to be differentiable w.r.t. the probability p, and the gradient of the resulting image is obtained through the linear combination. Notably, the differentiable operations can naturally be differentiated w.r.t. the magnitude μ, and STE is mostly applied to the non-differentiable operations under DDA settings [26], [27], [28], [29].
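A minimal sketch of this relaxed operation (NumPy, names ours; here b is a hard Bernoulli sample, whereas in an autograd framework one would use a relaxed or straight-through sample so that the gradient w.r.t. p exists):

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_op(x, op, p, mu):
    # O(x; p, mu) = b * op(x; mu) + (1 - b) * x,  b ~ Ber(p):
    # the linear combination lets the image gradient pass through either branch
    b = float(rng.uniform() < p)
    return b * op(x, mu) + (1.0 - b) * x

def brightness(x, mu):
    # a naturally differentiable operation w.r.t. its magnitude mu
    return np.clip(x + mu, 0.0, 1.0)
```

Non-differentiable operations (e.g., posterization) would keep this same skeleton but route their magnitude gradient through an STE surrogate.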
V. DISCUSSION AND FUTURE OPPORTUNITIES

This section discusses the challenges and potential opportunities in the area of DDA. In Table II, we provide a summary of the DDA methods existing in the literature. Furthermore, we discuss the potential research directions from three perspectives: modeling, optimization, and evaluation.
Class-Dependent Augmentation: Balestriero et al. [133] demonstrated the class-dependent regularization effects of data augmentation, which create class-dependent model biases. For example, on the MNIST dataset, we aim at obtaining a model that is invariant to rotations of a '6' up until it looks more like a '9' and vice versa [109], while the invariant rotation interval differs for an '8'. A similar situation applies to individual image instances. Apparently, class-dependent or even instance-dependent data augmentation can hardly be designed manually. Differentiable augmentation fits this problem well: Rommel et al. [27] proposed a class-dependent augmentation for 1-D EEG signals with DDA techniques, while Miao et al. [128] presented an instance-dependent differentiable image augmentation method. Further modeling of class-dependent or instance-dependent image data augmentation might be promising for reducing class/instance-dependent biases.
Applications: Current research trends in DDA mostly target finding the best augmentation policies to improve model performance on downstream tasks. Nevertheless, previous studies showed that traditional data augmentation methods are effective for other tasks, such as passive defenses against adversarial attacks [134], [135], [136] and mitigating membership inference attacks towards differential privacy [137]. A recent study [32] integrated DDA to trace the training dataset. Moreover, curated differentiable transformations have also contributed to PINN applications [114]. It would be interesting to see more DDA applications, such as integrating DDA into adversarial defenses and membership inference defenses.

B. Optimization
Robust Gradient Estimation: As aforementioned, gradient estimation is effective and practical in many DDA applications. Beyond the bias and variance introduced by different gradient estimators [83], the approximation itself might be problematic. For example, some studies [28], [29] directly adopted DDA and applied STE to bypass the non-differentiable augmentations, which clearly introduces extra bias during the optimization. Meanwhile, DADA [26] and ADDA [27] used the unbiased RELAX estimator to replace the biased Gumbel-softmax for better performance. However, some studies [121], [132] have further pointed out that the estimated second-order gradients of DADA may lead to an inaccurate gradient direction. A recent study [138] also investigated the automatic selection of gradient estimators. Besides developing more robust gradient estimation algorithms, another research direction is automatic gradient estimator selection.
Variance Regularization: Typical augmentation methods provide limited invariances, so model-based augmentation is preferred in terms of a larger invariance space [62], while Ravuri et al. [139] claimed that generative augmentation methods may fail to improve model performance. Thus, more restricted neural augmentation might come as an aid, with curated optimizable transformation routines. As discussed in Section III-A2, DDA optimization might fall into the two extremes of augmentation avoidance and null class mapping. In this case, several problems need to be addressed, such as i) regularizing the augmentation variance to perform label-preserving transformations, and ii) regularizing the post-transform label variance (e.g., the loss computation in MixUp [4]) to maintain meaningful supervision.
However, with the aim of improving the validation-set performance, a theoretical analysis is needed to explain the optimized augmentation policies. Dao et al. [1] proposed a kernel alignment metric to analyze whether a transformation would improve the generalization performance without performing end-to-end training. Additionally, Wu et al. [140] presented an uncertainty-based theoretical analysis to demonstrate the rationale behind various augmentations, regarding the abilities of different transformation methods to reduce estimation errors. Following the gradient-based formulation in [125], a gradient-enhanced theoretical analysis framework might help explain the optimization rationale, though this remains challenging.

VI. CONCLUSION
In this survey, we aimed at presenting a structured yet extensive overview of the current research directions in differentiable data augmentation applied to image processing, which could effectively improve model performance in many downstream applications, including object detection, segmentation, GANs, and automatic augmentation. We first introduced the key elements of DDA (including its concept, basic operations, and gradient approximations), then summarized current application paradigms, along with relevant approaches and works that we categorized by the types of neural augmentation layers, specifically static and optimizable neural augmentation. Finally, we discussed the current challenges and highlighted future potentials from the three perspectives of modeling, optimization, and evaluation. Data augmentation plays an important role in dealing with overfitting and in modern contrastive learning. Differentiable data augmentation, as a trending area, has already demonstrated its potential in data-efficient learning, dataset tracking, and efficient augmentation policy searching. With this survey, we provide a guide for researchers to better grasp the rationale behind differentiable data augmentation and hope to inspire further research in this field.

• Comprehensive literature review: We introduce DDA and provide a comprehensive literature review covering the recent and relevant studies to assess the current state of the art of DDA.
• Categorization: We categorize the different DDA techniques and present their different operations and ways of implementation.

Fig. 1. High-level comparison between traditional and differentiable augmentation methods. Purple and orange arrows denote forward and backward operations, respectively.

Fig. 3. Summary of the commonly used transformation matrices.

Fig. 4. Demonstration of the effect of common filtering operations. The kernel size is set to 13 for all filters.

Fig. 6. Demonstration of the effect of common mixing operations. Note that SamplePairing does not change the label information whilst computing losses, but MixUp does.

Fig. 9. The STN module [41] contains three parts: (a) a localization network, a regular CNN that regresses the transformation parameters; (b) a grid generator, which generates a grid of coordinates in the input image corresponding to each pixel of the output image; (c) a sampler, which uses the transformation parameters and applies the transformation to the input image.

Fig. 10. Typical automatic augmentation search methods. (b) Bilevel optimization is a brute-force method that selects τ, trains f_θ multiple times, and then selects the combination that results in the lowest validation loss. (c) In density matching, f_θ(·) only needs to be trained once without augmenting the training data; the best τ is then searched to minimize the validation loss. (d) A min-max game aims to find the augmentation policy τ that maximizes the loss while training the model f_θ to minimize it. (e) Empirical risk minimization methods normally add a regularization term to regularize the augmentation policy τ.

Jian Shi received the master's degree from the University of Leicester, U.K. He was an associate researcher with NEC Laboratories, Beijing, China, working on medical computer vision technologies. Meanwhile, he is also a primary contributor to Kornia, a widely used open-source repository for computer vision and image processing for deep learning. He is currently working toward the PhD degree with the King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. His research interests include deep learning, neural networks, computer vision, and image processing.

Hakim Ghazzai (Senior Member, IEEE) received the PhD degree in electrical engineering from the King Abdullah University of Science and Technology (KAUST), Saudi Arabia, in 2015, and the diplome d'Ingenieur (Hons.) degree in telecommunication engineering and the master's degree in high-rate transmission systems from the Ecole Superieure des Communications de Tunis (SUP'COM), Tunis, Tunisia, in 2010 and 2011, respectively. He was a research scholar with the Qatar Mobility Innovations Center (QMIC), Qatar, Karlstad University, Sweden, and Stevens Institute of Technology, NJ, USA. He is currently a research scientist with KAUST. Since 2019, he has been on the Editorial Board of IEEE Communications Letters and IEEE Open Journal of the Communications Society. He is the author or co-author of more than 170 publications. His research interests include artificial intelligence-enabled applications, the Internet of Things, intelligent transportation systems (ITS), and mobile and wireless networks. Since 2020, he has been on the Board of IoT and Sensor Networks (a specialty section of Frontiers in Communications and Networks) as an associate editor. He was the recipient of an appreciation for being an exemplary reviewer for IEEE Wireless Communications Letters in 2016 and IEEE Communications Letters in 2017.

Yehia Massoud (Fellow, IEEE) received the PhD degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. He is currently the director of the Innovative Technologies Laboratories (ITL), King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. He has held several positions at leading institutions of higher education and industry, including Rice University, Houston, TX, USA, Stevens Institute of Technology, Hoboken, NJ, USA, WPI, UAB, the SLAC National Accelerator Laboratory, and Synopsys Inc. From 2018 to 2021, he was the dean of the School of Systems and Enterprises (SSE), Stevens Institute of Technology. Prior to Stevens, he was the head of the Department of Electrical and Computer Engineering (ECE), Worcester Polytechnic Institute, Worcester, MA, between 2012 and 2017. In 2003, he joined Rice University as an assistant professor, where he became one of the fastest Rice faculty to be granted tenure in electrical and computer engineering and computer science, in 2007. His research interests include the design of state-of-the-art innovative technological solutions spanning a broad range of technical areas, including smart cities, autonomy, smart health, smart mobility, embedded systems, nanophotonics, and spintronics.

TABLE I
SUMMARY OF COMMON IMAGE TRANSFORMATION OPERATIONS AND THEIR DIFFERENTIABILITY

(x̂_i, ŷ_i, 1)^T = H(x_i, y_i, 1)^T, where (x_i, y_i) and (x̂_i, ŷ_i) represent the coordinates in an image before and after transformation, respectively, and H represents a homography. Affine operations apply restricted 6-DOF homographies to preserve the parallelism of lines. To be specific, a homography is decomposed into matrices A and b, in which A is a 2 × 2 matrix that relocates pixels and b is a two-element translation vector. Affine transformations can be inverted by applying the inverted matrix H⁻¹. To chain n affine transformations, one may simply apply H = H_n · · · H_1.

TABLE II
SUMMARY OF DDA LITERATURE. BY THE TYPES OF NEURAL AUGMENTATION LAYERS, WE CATEGORIZE CURRENT WORKS AS STATIC (I.E., STATIC NEURAL AUGMENTATION) AND OPTIMIZABLE (I.E., OPTIMIZABLE NEURAL AUGMENTATION)