Information-Theoretic Limits for Steganography in Multimedia

Steganography is the art and science of hiding data within innocent-looking objects (cover objects). Multimedia objects such as images and videos are an attractive type of cover objects due to their high embedding rates. There exist many techniques for performing steganography in both the literature and the practical world. Meanwhile, the definition of the steganographic capacity for multimedia and how to be calculated has not taken full attention. In this paper, for multivariate quantized-Gaussian-distributed multimedia, we study the maximum achievable embedding rate with respect to the statistical properties of cover objects against the maximum achievable performance by any steganalytic detector. Toward this goal, we evaluate the maximum allowed entropy of the hidden message source subject to the maximum probability of error of the steganalytic detector which is bounded by the KL-divergence between the statistical distributions for the cover and the stego objects. We give the exact scaling constant that governs the relationship between the entropies of the hidden message and the cover object.


I. INTRODUCTION
Steganography aims to embed data within innocent-looking cover objects such as images, audio, video, and even text [1], to produce a stego-object that looks similar to the original cover object but with hidden data embedded.The embedding process implies some form of distortion to the cover object and this distortion is utilized by a warden to detect if there is hidden data or not using different steganalysis techniques.Consequently, the optimal target for steganography is to hide the maximum amount of data subject to achieving a minimum probability of detection by a warden.In other words, steganography has two competing goals: undetectability and embedding rate.
Undetectability is concerned with determining how to alter the cover object to embed data without making a notable distortion to the cover.The distortion is measured by a cost function which is either modeled as the difference between the cover and stego-objects [2], or as the expected warden detector sensitivity of changing certain features within the cover object [3].On the other hand, the embedding rate is defined as the ratio between the number of hidden data bits to the number of cover data bits with an acceptable undetectability margin [4,Ch.4].Maximizing the embedding rate with an acceptable undetectability margin is fundamental to any steganographic algorithm.Multimedia objects are excellent choices for steganography due to their rich structural features and their ubiquity over industry and daily life [5], [6].
To evaluate the performance of a multimedia steganographic approach, the relation between the two contradicting goals (undetectability and embedding rate) is a vital measure.Specifically, an accurate definition of the cost function and its relation to the maximum embedding rate is the keystone for proper evaluation of the approach.Additionally, dealing with multimedia incurs additional complexity as the statistical distribution of multimedia sources generally follows the Gibbs distribution [7]- [9], which is a complex statistical model.The complexity of the Gibbs distribution prohibits a thorough mathematical analysis, especially when the definition of the cost function requires an accurate definition of the statistical model of the cover object as in [4,Ch.13] in which the Kullback-Leibler divergence (KL-divergence or the relative entropy), is utilized as a global measure of cover distortion.
Previous approaches of steganography concerned with the definition of the cost function are either not globally accurate for all multimedia types or accurate with limited scope for a certain type of multimedia for a certain class of embedding or detection techniques [10].Additionally, due to the complexity of Gibbs distribution, previous approaches either use more relaxed mathematical distributions such as Gaussian distribution [10]- [17] in the statistical modeling or presume a preposition on the maximum embedding rate then try to show the correctness of the preposition [11], [12].Thus the estimated performance may not reflect the real one due to the model relaxation.
The work in 1  [18] partially overcomes the aforementioned problems and proposes a constrained optimization problem that models the undetectability by the KL-divergence and the embedding rate by the mutual information between statistical distributions of the cover and stego-objects.The optimization problem is solved through a rigorous mathematical procedure to present an analytical form for the relation between the embedding rate and undetectability.However, [18] utilized a relaxed statistical model for cover and stego-objects (Multivariate-Quantized-Gaussian-Distribution) and does not provide an achievability proof that verifies the reliable extraction of the embedded message at the receiver side when the exact cover realization is not present at the receiver side, in which there exists a probability of decoding error.
In this paper, we present several contributions beyond our preliminary work in [18].Specifically, we propose a constrained optimization problem that calculates the upper limit that any multimedia steganographic method can achieve reliably (with an achievability proof), for a prespecified acceptable level of detectability by an adversary.Unlike [18], in this paper, we model the cover object by the Gibbs statistical distribution by establishing an equivalence relation between the Gibbs and the correlated-multivariate-quantized-Gaussian distribution (CMQGD) for rigid mathematical analysis.The solution to our optimization problem provides an analytical form for the maximum embedding rate in terms of the WrightOmega function.Additionally, we prove that the achieved maximum embedding rate comes in agreement with the well-known square root law (SRL) of steganography [4,Ch.13].Finally, we introduce comparisons between our theoretically achieved maximum embedding rate with other practically calculated rates provided by [10], [16], and [17].The results demonstrate that our calculated upper bound is relatively very small compared to the referenced practical stenographic methods.The reason is that our theoretical upper bound is calculated against the optimal detector, which may not be achieved yet for the referenced steganographic methods.
The paper is organized as follows.Section II briefly discusses the previous publications related to our work.Section III describes the main definitions and assumptions with clear mathematical representation for the proposed constrained optimization problem formulation.Section IV provides an equivalence relation between the Gibbs distribution and CMQGD.The solution to the proposed constrained optimization problem is introduced in Section V and the achievability proof is provided in Section VI.Section VII discusses the relation between the SRL and our results.Practical interpretation of our results is presented in Section VIII followed by a discussion in Section IX.The final conclusion is introduced in Section X.

II. PREVIOUS WORK
This section summarizes the most notable prior work related to the scope of this paper.For more details, readers can refer to [4], [10]- [17]. 1 A preliminary version of this paper with the same authors.Some of the previous approaches employ the KL-divergence as a statistical cost function and introduce a theoretical limit (under different assumptions) for the maximum number of bits that can be embedded in a designated class of multimedia cover objects under a specified probability of detection by a warden.For instance, [11] utilized a special case of Continuous-Gaussian-distributed covers (AWGN channels), and introduce additional proofs of achievability and converse.Additionally, [12] generalizes the technique used in [11] for Multivariate-Continuous-Gaussian-distributed covers (MIMO channels).Although these approaches utilized an accurate statistical model for the proposed cover (AWGN channels) with a rigorously defined cost function (KL-divergence), their approaches follow a pre-assumed preposition on the maximum embedding rate then try to prove the correctness of the preposition.
The work in [15] provides a systematic technique for constructing the distortion function based on the relation between steganographic Fisher information and KL-divergence.In [15], the additive distortion approach is utilized to model the distortion for digital images, then adaptively calculating the individual pixel costs that minimize the KL-divergence when embedding using the least-significant bit matching approach.Although [15] utilizes a simple cover model (zero-mean independent multivariate quantized Gaussian distribution), the achieved security outperforms the state-of-the-art algorithm HUGO [19].The authors improved their work in [13] by replacing their previous model in [15] with the generalized multivariate Gaussian (referred to as MVGG) combining it with an improved variance estimator.These improvements enhanced the steganographic performance by allowing embedding changes with larger amplitudes in complex (highly textured) regions, thanks to a thicker-tail MVGG model.The MVGG provides comparable performance with respect to pentary coded HiLL [20] and S-UNIWARD [21] against maxSRMd2 [22] and SRM [23] feature-based steganalysis methods.Although these techniques provide enhanced practical outcomes and an accurate analytical formulation for the relation between the embedding rate and the cost function, their results are based on a relaxed statistical model for the cover.
In [16], the proposed approach follows the non-additive model assumption and utilizes the Gaussian Markov Random Field (GMRF) with four-elements cross neighborhood.The proposed GMRF with low-dimensional clique structures provides the ability to capture the interdependencies among spatially contiguous pixels.The cost function design approach is formulated as the minimization of KL-divergence between the original cover image and the stego one.With GMRF, the cover image is split into two disjoint sub-images, which are conditionally independent.Then, an alternating iterative optimization technique is applied to achieve an efficient embedding incorporated with minimizing the total KL-divergence.The paper provides experiments demonstrating that the performance of GMRF outperforms state-of-the-art steganography MiPOD [14] and HiLL techniques in terms of secure payload against SRM and maxSRMd2.This work is expanded in [17] with higher-dimension clique structure (eight-elements TABLE I: List of commonly used symbols through this paper.

A
The sender B The receiver E The eavesdropper V Alphabet of cover and stego-elements X Alphabet of the original message M Alphabet of the coded message c The cover object s The stego-object x The original message m The coded message P D The probability of steganalyzer correct detection at E Pe The Total probability of steganalyzer error at E P E The average probability of of steganalyzer error at E P B The probability of decoding error at B side Pc The joint probability distribution of the cover object Ps The joint probability distribution of the stego-object Pm The joint probability distribution of the coded message c i The i th cover element s i The i th stego-element Pc i The probability distribution of the i th cover object Ps i The probability distribution of the i th stego-object GMRF).Although [16], [17] provided improved results and an analytical formulation for the relation between the embedding rate and the cost function, the provided results are based on a relaxed statistical model for the cover.Another approach in [10] focused on gray-scale images as cover objects with an initial assumption that the cover, the message, and the stego-object to be statistically modelled as a Multivariate-Quantized-Gaussian-Distribution.This approach provides a theoretical limit for three popular detection strategies for the likelihood ratio test: Bayes, Minimax, and Neyman-Pearson.This approach maximizes the detection error of these three detectors related to a specified payload.Although this approach provides an accurate analytical form of the relation between the embedding rate and the cost function, the scope of the defined cost function is limited (only three types of detectors) with a relaxed statistical model (Gaussian distribution).
The Square Root Law (SRL) [4,Ch.13] is regarded as the first successful work that mathematically models the complex statistical relation between the embedding rate and undetectability in a global framework for multimedia.The SRL states that the embedding rate for any steganographic technique is proportional to the square root of the number of cover elements.For example, assuming cover A contains x elements and cover B contains 4x elements.If we can embed y bits of a secret message within A with a probability of detection P by a certain steganalyzer, then for the same steganalyzer with the same probability of detection P, we can only embed 2y bits.Also, for cover A, the embedding rate per each cover element will be y x , but for cover B will be y 2x .The SRL is valid for any steganographic approach regardless of the embedding and the steganalysis methods.This approach produces only relative approximated order results for any defined cost function without calculating the scaling constant.

A. Communication Model
Elements of a cover object are assumed to be the communication channel between two entities: sender A and its corre- with probability distributions P c and P s , respectively, where n is the number of elements of the cover or the stego-object.Both c, s ∈ V n , where V is the set of allowed cover values.For example, when the cover is a gray-scale image, then V = {0, 1, . . ., 255} is the set of all available pixel values.It should be noted that V is not limited to pixels, but can be considered for DC-coefficients [24], video motion vectors [2], [25], ...etc.Similarly, we consider an original message x ∈ X k , where k is the number of elements of the message and X is the set of allowed message values (alphabet).The stego-object s is obtained by embedding a coded message m ∈ M n via an addition process, i.e., s = c + m.The coded message m is obtained from x using a certain codebook that maps x → m.The coded message m is statistically distributed as P m .

B. Problem Statement
Consider the scenario in which A has access to a multimedia cover-object c, that is drawn from some distribution P c which is known by both B and E .To conceal secret information within c, A needs to transmit a slightly perturbed version of c called the stego-object s that only B can reveal the context of these perturbations.Meanwhile, A needs to keep his probability of being detected by E as minimum as possible.
The fundamental problem here is to quantify, for a given P c , the information-theoretic limits that govern how much information can be exchanged reliably between A and B while at the same time, maintaining a certain level of probability of their communication being detected by E, namely P D .Since A does not know E's detection strategy, the transmission strategy should be designed against the optimal detector that E can deploy.In addition, we assume no prior knowledge of the exact realization of the cover-object neither at B nor at E. In other words, the cover-object realization is selected at random by A at the time of communication.
We assume that E will perform a binary hypothesis test for H 0 and H 1 , where H 0 means A is not embedding any secret message whereas H 1 means A does.Thus, we have two types of errors: • Type I error: Deciding H 1 when H 0 is true, also called false positive.We denote the probability of this type of errors by α. • Type II error: Deciding H 0 when H 1 is true, also called false negative (miss-detection) and we denote the probability of this type of errors by β.
Assuming equal priors for E's optimal hypothesis test, the total probability of error P e will be where P E is the average probability of error for E under equal priors.According to [26], P e can be calculated as where V(P s , P c ) is the total variation distance between P s and P c , defined in [26] as where . 1 is the L 1 norm.As L 1 norm's analytic calculations are not wieldy, V(P s , P c ) is related to the KL-divergence as [27,Ch.11]: where the KL-divergence is given by: where P sj and P cj are the bins of probability mass function for cover and stego-objects, respectively.For A to guarantee a low detection probability at E's optimal detector, A needs to bound V(P s , P c ) by some ǫ chosen according to the desired probability of E's detection.Consequently, A ensures that the sum of error probabilities at E is bounded as α + β = 1 − V(P s , P c ).Using (4), A can achieve this goal by designing a steganographic technique such that On the other hand, B must be able to correctly decode the message x from m sent by A. Thus the chosen codebook must be selected such that P m maximizes the mutual information between both P s and P m , which is defined as where In this section, a mathematical illustration for our basic assumptions about the general statistical distribution model for the multimedia (P c ) is introduced.Consequently, for a clear illustration of the multimedia stochastic model, we need first to define the Markov-Random-Field (MRF) and Gibbs distribution.
MRF is a multidimensional random process, which generalizes the single dimensional Markov random process [9].Let Ẍ be a coordination system in R N and ρ(i) is a function representing the neighbourhood for each element i ∈ Ẍ, such that i / ∈ ρ(i) and i ∈ ρ(j) iff j ∈ ρ(i).For example, the neighbourhood may be defined as the immediate left, right, top and bottom neighbours of i.Let Ÿ be the neighbourhood system representing the set of neighbourhoods of all elements i, j, . . .∈ Ẍ.
A random field ξ over Ẍ is a multidimensional random process where each element i ∈ Ẍ is assigned a random variable ξ i with f i be its associated realization for all i ∈ Ẍ. ξ is called a MRF if it achieves the following conditions: • Positivity property: P (ξ = f) > 0, ∀f ∈ S, and • Markovianity property: , ∀i ∈ Ẍ, ∀f ∈ S, where P is the probability measure and S is the state space for the MRF ξ.
According to [9], Hammersley-Clifford theorem states that ξ is a MRF over Ẍ with respect to Ÿ if and only if its probability distribution follows the Gibbs distribution with respect to Ẍ and Ÿ .To explain the Gibbs distribution, the idea of clique must to be clarified.A clique ω is a correlated group of neighboured elements, where ω ⊂ Ẍ with respect to Ÿ (i.e.: ω consists of a single element i or multiple elements i, j, . . .which are neighbours).Also, ω ∈ Ω, where Ω denotes the set of all cliques.
According to [7]- [9], it is accurate to statistically model the multimedia objects by the Gibbs distribution, which implies the above described properties (positivity and markovianity), in which all elements for each multimedia object are organized in cliques.Gibbs distribution quantifies the probability P (ξ = f) through an energy function U f and a temperature constant T as where is the normalization denominator that is called the partition function and M is the number of all allowed states of the system (number of elements within S).The energy function is modelled as where V f (ω) is the potential function for f calculated only within the clique ω ∈ Ω.
Gibbs distribution involves computing the partitioning function which is intractable for most models.Therefore, in this paper, we propose an efficient approximation for the Gibbs distribution by modelling it with the correlated-multivariatequantized-Gaussian-distribution (CMQGD) for more thorough mathematical analysis.One possible approximation to (11) is to model V f (ω) in (13) as where • f ω is the realization of the random field ξ within the clique ω (i.e.: f i , f j ∈ f ω ∀i, j, . . .∈ ω and f ω ⊂ f ).• µ ω and Σ ω are the mean vector for ξ ω ⊂ ξ and its covariance matrix within the clique ω, respectively.This approximation is valid for the following reasons: • V f (ω) can be mapped to any function depends only on the elements within the clique ω [9].• The multiplication of (f ω − µ ω ) and (f ω − µ ω ) T , which is the distance from the mean calculated locally within the clique ω, controls the probability of certain realization (state) f ω (and hence P (ξ = f)) within the clique ω for certain Σ ω .In other words, the local coherency condition of multimedia elements: This allows lower energy states to be always have a higher probability than the higher energy ones, which is one of the main Gibbs properties [7].
It should be noted that other valid assumptions for the potential function V ω (f) can be used to approximate the Gibbs distribution with other distributions.The approximation in ( 14) is one of them.

V. MAIN RESULTS
In this section, we discuss the solution and results of the optimization problem (10) in Sec.III in subsection V-B using the assumptions in subsection V-A.

A. The main assumptions
To solve the optimization problem (10), we have the following assumptions: • From here on, each clique within the cover will be considered as a single cover element c i with unknown distribution P ci .

• Having established the equivalence relation between
Gibbs and Gaussian distributions as discussed in section IV, we model the whole cover object (i.e.: the joint distribution for the cover object P c ) as a CMQGD with mean µ c and covariance matrix Σ c .Rather, we have no assumptions on the marginal distribution P ci of each cover element c i .• The cover statistical parameters ( µ c , Σ c ) are known to all A, B, and E. • Although we have no assumptions for both P s and P m , according to [28], to minimize the KL-divergence for any Gaussian distributed cover object, the corresponding stego-object distribution must be Gaussian.Thus, the embedding operation will produce another CMQGD P s with parameters ( µ s , Σ s ).• The realization of codebook from P m (the exact codewords used between A and B) are shared only between A and B and not known to E. (10) To solve (10), we need to determine H(P c ), H(P s ) and the KL-Divergence D(P s P c ).Using lemma 3 in appendix (A), H(P c ) and H(P s ) are computed as demonstrated in ( 46) and (47), respectively.The KL-Divergence between two quantized distributions can be evaluated using (17) alongside with the following lemma.

B. The Solution to the optimization problem in
Lemma 1.According to [27,Ch.11],KL-Divergence between any two uniformly quantized distributions F ( Ẋ) and G( Ẋ) is bounded as: Where f and g are the continuous versions of F and G, respectively, and Ẋ is the uniformly quantized version of the continuous random variable ẋ.
From lemma 1 and the definition of the KL-divergence for multivariate-continuous-Gaussian-distributions in [12], the KL-Divergence for CMQGD can be bounded as where n is number of cover elements.For more conservative evaluation for our optimization problem in (10), we set D(P s P c ) to its upper bound in the mathematical relations in the rest of the paper.Specifically, Having established the upper bound of of the KL-Divergence for CMQGD in (17), the solution of the optimization problem in (10) as provided in the following theorem.
Theorem 1.With the assumptions stated in subsection V-A, the solution of (10) can be achieved by setting: 1) The distribution of P m must be CMQGD (same as cover and stego-objects), with the parameters (mean µ m and variance Σ m ) as ≤ ǫ), where W (.) is the WrightOmega function [29].
2) The maximum achievable embedding rate for sufficiently large n.

Proof :
The Lagrangian L of (10) will be where λ is the Lagrange multiplier.To find the optimized parameters for (18), we set ∂ ∂ µs L = 0 and ∂ ∂Σs L = 0.With the definitions of H(P s ) and D(P s P c ) in ( 47) and (17), respectively, we have As Σ c and Σ s are symmetric matrices, ( 21) can be rewritten as From ( 19) and ( 23), as ∂ ∂ µs L = 0, we have From ( 20) and ( 22), as ∂ ∂Σs L = 0, we have Through the paper, we name the factor a = λ+1 λ as the embedding factor.From (25), it is clear that Σ s has the exact eigenvectors as Σ c and the eigenvalues of Σ s are a scaled version of the eigenvalues of Σ c with the embedding factor a. Let ψ c ( t) and ψ s ( t) be the characteristic functions for both the cover and the stego-objects where As s = c+m, this implies P s = P c ⊕P m , where ⊕ represents the convolution operation.Thus, the characteristic function for the distribution of codebook of the message (ψ m ( t)) will be Then from (24), we have Thus This means that, the codebook of the message must be modelled as CMQGD with zero mean and variance The embedding factor a can be calculated as follows.Starting from (25), we have and Substituting from ( 24), ( 29) and ( 30) into ( 17), we get The solution of ( 31) can be obtained from [29] as From ( 32) and according to [29], a is monotonically increasing with D(P s P c ). From ( 25) and ( 47), a is monotonically increasing with H(P s ).Thus, using the constraint in (10) by substituting D(P s P c ) with 2ǫ2 in ( 18) we have where a * ∈ R is the embedding factor 2 associated with the design parameter ǫ.Thus, ( 27), ( 28) and (33) complete the proof of the first part of Theorem 1. From ( 9), we have: Then from ( 46) and (47): Then, from ( 28) and (33): Thus Equation ( 35) 3 completes the proof of the second part of Theorem 1. Please note that, despite D(P s P c ) = D(P c P s ), the results obtained from Theorem 1 are also valid for D(P c P s ).The proof of this claim in provided in the following lemma.
Lemma 2. µ m and Σ m obtained from Theorem 1 are also valid for D(P c P s ).

VI. ACHIEVABILITY
In this section, we prove that the maximum embedding rate I(P s ; P m ) in ( 35) is achievable with a low probability of decoding error P B at B. We couldn't include the achievablity constraint in the optimization problem in ( 10) and (18) as P s and P m will not be obtained till solving the optimization problem.
We utilize the standard random coding argument [27,Ch.7]as a base for our proof.Although the minimum distance decoder (maximum likelihood estimator) is the optimal detector [27,Ch.7],there is technical difficulties in using this detector to calculate P B as we do not have the marginal distribution for each cover element; only the joint distribution for the whole cover.Thus, we utilize the jointly typical decoder [27,Ch.7]at B instead.
In our achievability proof, we utilize the same procedures for the Channel Capacity Theorem (Theorem (9.1.1)in [27]) for calculating P B .Although this theorem mandates the constraint for the average transmission power, in our case we do not need this power constraint as Σ m is very small due to the constraint of low probability of detection at E in (10).Without loss of generality, assuming transmitting the i th codeword, we have two types of errors • The transmitted codeword and the received one are not jointly typical.We denote this error by Êi .• The received codeword is jointly typical with another wrong non-intended codeword.This error is denoted by K j=1,j =i E j , where K is the number of all codewords.Thus, equation (9.26) in [27] will be modified as where P (x) is the probability of an event x.Continuing the same procedures for the proof of Theorem (9.1.1)in [27]: where δ > 0 [27, Ch.3] is an arbitrary small number.Thus, (40) prove that when n is sufficiently large and for an actual embedding rate R ≤ I(P s ; P m ) − 2ǫ, then P B is sufficiently small.This proves the existence of a codebook (K,n,ǫ) that achieves an embedding rate R ≤ I(P s ; P m ) − 2ǫ at B with low P B .
It should be noted that although related approaches such as [11] and [12] utilize a converse proof, in our case we have dropped the converse proof as we have got the maximum embedding rate as a solution for the optimization problem in (10), which guarantees that there is no other codebook generated by any other distribution can achieve higher rate than (35).

STEGANOGRAPHY
In this section, we establish the relation between the obtained expressions in (33) and (35) in terms of the WrightOmega function and the well known previously obtained results of the SRL in [4], [11], [12].To do so, we state the following facts about the WrightOmega function: • As the WrightOmega function in the form −W (−γ − 1 − iπ) is monotonically increasing with γ; where γ is a positive constant; −W (−γ −1−iπ) can be approximated by polynomial regression in the form where θ i is an arbitrary constant.
• As γ = 4ǫ 2 n , then for large n we can ignore the higher order terms in (41) and use the following approximation4 • From Lemma 5, as a * ≥ 1, then √ a * ≥ 1.Also, as 2 and ln(1 + x) ≤ x, then the following approximation of (34) is valid Thus, (43) establishes the relation between (35) and the SRL.Hence, (43) proves that our results come in agreement with previously established SRL in [4] in the context of image steganography and [11], [12] in the context of covert communication.Using the justifications in section IV for approximating the Gibbs distribution with CMQGD and the proved relation to the SRL in (43), equation ( 35) can be considered as a rigorous analytic form for the SRL.To illustrate the above results, and to prove the relation between P E and both (35) and (43), we plot in Fig. 2 the maximum embedding rate I(P s ; P m ) against the number of Fig. 2: Maximum achievable embedding rate I(P s ; P m ) compared to the number of cover elements (n) for P E = 0.1 and P E = 0.2.Note that, we use the lower bound for a * in (52) to compute I(P s ; P m ) in (34).
the cover elements n using (34) and (52).In the figure, we use the lower bound for a * in (52), i.e., we set to compute I(P s ; P m ) in (34).We provided the plots for P E = 0.1 and P E = 0.2 and we used the implementation of the wrightOmega function in [30] to generate the plots.From the figure, we can conclude that (41), (42), and (43) are not over-approximated form of (35) as we plot the actual analytic form expression in (35) against √ n.
VIII.PRACTICAL INTERPRETATION In this section, we compare the practical experimental results for the stenographic methods in [10], [16], and [17] with our theoretically calculated limits.Specifically, we get the published results of each steganographic method that present the payload (bits per pixel) of the method against different P E , where P E is obtained using different steganalysis methods.Then, we plot these published results against our theoretical limit of the payload calculated using equation (35).In our comparison, we used the BOSSbase 1.01 dataset [31] where the size of each image is 512 × 512 pixels.We set n = 512×512 = 2 18 in (35) to calculate the maximum achievable embedding rate that occurs in the case of independent cover elements, i.e. when each clique represents only a single pixel.We give more clarification for the relation between the achievable embedding rate and the clique size in Section IX.
Figures 3 to 6 demonstrate the experimental results obtained from [16] plotted against the theoretical upper limit for the payload in (34) and (52).We set a * to its lower bound in (52) to compute the theoretical upper limit for the payload against P E and we plot the payload in log-scale.We also demonstrate in the supplementary material figures ???????? and figures ?????????????? for the methods in [17] and [10], respectively.
It can be concluded from the figures that our theoretically calculated upper limit is relatively very small compared to the referenced practical stenographic methods.The reason for this is that there exist other steganalysis methods that can be more optimum than the methods utilized by referenced stenographic methods.In other words, the steganalysis methods used in [10], [16], [17] can be regarded as non-optimal compared to the KL divergence D(P s P c ) that is used in our proof and can be regarded as the upper bound for any steganalysis detector.Consequently, using non-optimal steganographic detectors may be misleading as it can achieve a lower probability of detection error (i.e.P E ) when using higher embedding rates than our theoretically calculated limit.Likewise, using non-optimal embedding techniques may also be misleading as it can achieve lower embedding rates than the theoretically calculated limit with the same detection probability.

IX. DISCUSSION
As I(P s ; P m ) ∝ √ n as shown in (43), the upper limit for embedding rate occurs when each clique contains a single element, i.e. when n is large.This occurs in cases such as a highly textured noisy image or a motion field of a sand storm video.Meanwhile, the lowest embedding rate occurs when all the elements of the cover are fully correlated (i.e.: the whole cover can be considered as a single clique where n = 1), such as an image of a clear sky or a motion field of a video of the stationary scene with global camera motion.These cases imply relatively a small number of cliques.In other words, the upper limit occurs in the situation when all elements of the cover are independent, whereas the lowest one occurs in the case of the cover contains only a single clique.Another important point is that using the informationtheoretic upper bound for I(P s ; P m ) in (35) is not a practical constructive approach due to the random coding argument assumption.In other words, it does not introduce concrete guidelines for how to design steganographic methods or codebooks to achieve this theoretic upper bound.Our work introduces a proof of existence in analytical form, implying that work is still needed to construct steganographic methods to be more closed to the calculated bounds.

X. CONCLUSION
Through this paper, we have calculated the informationtheoretic upper bound for a predefined level of security for data embedding within a generalized case for multimedia covers.Our mathematical analysis has several advantages over previous ones.First, we introduced a mathematical justification for using CMQGD as a replacement for the accurate and mathematically intractable Gibbs distribution.Second, we provided a rigorous analytic form for the SRL, not an order result.Third, the provided analytic form is calculated not only in the case when the cover is known for both sender A and receiver B, but also in the real-world case when B doesn't know the exact cover, only knows the cover distribution, thanks to the achievability proof in section VI.Forth, our calculated limit is applied to all types of stegnanalytic detectors (statistical, deeplearning, feature-based, ... etc) for any type of multimedia.Five, we have introduced the model parameters for the optimal message's codebook in an analytic form.It should be noted that for continuously distributed covers, our mathematical solutions will be exact closed-form solutions.

APPENDIX
A. The Entropy of Multivariate Quantized Gaussian Distribution Lemma 3. Theorem (8.3.1) in [27] defines the entropy of quantized distribution as where p is any continuous distribution, P is a quantized version of p with b quantization bits, H(P ) is the entropy of the quantized distribution and h(p) is the differential entropy of the continuous distribution.Thus, as the entropy of CMQGD p g is defined in [12] as the entropy of both P c and P s can be defined w.r.t their continuous versions P c , P s as B. Additional properties about the embedding factor a * Lemma 4. a * cannot be a complex number: It should be noted that a * ∈ R despite W (.) ∈ C. From [29], the relation between the Lambert W Function W(.) and the WrightOmega function W (.) can be modelled as If we set x = − 4ǫ 2 n − 1 − iπ as in (33), then, Im(x) = π and therefore (48) can be re-written as which is a special case of Lambert W Function where W −1 (.) ∈ R [32].Thus, a * ∈ R. The same applies for a in (32).

Fig. 1 :
Fig. 1: General communication model for steganography.sponding receiver B. An eavesdropper E is fully monitoring this communication channel between A and B. Fig.1 introduces the general communication model for steganography.Through this paper, we consider an innocent original cover c = [c 1 , c 2 , . . ., c n ] and a stego-object s = [s 1 , s 2 , . . .s n ]with probability distributions P c and P s , respectively, where n is the number of elements of the cover or the stego-object.Both c, s ∈ V n , where V is the set of allowed cover values.For example, when the cover is a gray-scale image, then V = {0, 1, . . ., 255} is the set of all available pixel values.It should be noted that V is not limited to pixels, but can be considered for DC-coefficients[24], video motion vectors[2],[25], ...etc.Similarly, we consider an original message x ∈ X k , where k is the number of elements of the message and X is the set of allowed message values (alphabet).The stego-object s is obtained by embedding a coded message m ∈ M n via an addition process, i.e., s = c + m.The coded message m is obtained from x using a certain codebook that maps x → m.The coded message m is statistically distributed as P m .

Fig. S- 10 :
Fig. S-10: Results from [2] comparing HILL steganographic method and its modified Gaussian version with two different batching strategies, IM S with batch size 128 and AdaBIM with adaptive batch size against steganalyzer utilizing maxS-RMd2 feature.

Fig. S- 11 :
Fig. S-11: Results from [2] comparing HILL steganographic method and its modified Gaussian version with two different batching strategies, IM S with batch size 128 and AdaBIM with adaptive batch size against steganalyzer utilizing maxS-RMd2 feature.
is the entropy of P m and H(P m | P s ) is the conditional entropy of P m given P s .Thus, our constrained optimization problem can be written as (8) P s ) in(7)can be written as [27, Ch.2]I(P m ; P s ) = H(P s ) − H(P s | P m ) = H(P s ) − H(P c ),(9)and because H(P c ) is given, maximizing I(P m ; P s ) is equivalent to maximizing H(P s ).Thus, the optimization problem in(8)can be restated as argmax Pm H(P s ) s.t.D(P s P c ) ≤ 2ǫ 2 .(10) IV.MULTIMEDIA STOCHASTIC MODEL )