A New Global Measure to Simultaneously Evaluate Data Utility and Privacy Risk

Measuring the data utility and privacy risk embedded in synthetic or otherwise de-identified datasets is an increasingly important research area. Existing measures in the data privacy literature, however, are one-sided in that they measure either utility or privacy risk only. In this paper we propose a new measure that can evaluate both data utility and privacy, a well-known trade-off in data synthesis. The proposed measure employs the notion of relative distance between the synthetic and original datasets at the dataset level, and can identify the optimally balanced position of the synthetic data in terms of both utility and privacy. In addition, we devise a graphical tool that visually reveals the current utility-privacy trade-off position of the synthetic data. Numerical studies show that our new measure consistently performs better and offers richer interpretations than existing global data utility measures, for both simulated and real datasets, confirming its distinctive advantages.


I. INTRODUCTION
COLLECTING, storing and publishing datasets are becoming increasingly important in many research disciplines and applications. Publishing datasets in particular is of great interest in data privacy research, as datasets often contain sensitive private information about individuals. When a data collector needs to publish or release such datasets, therefore, certain safety measures must be taken to protect privacy, while preserving the utility of the data so that various data analyses done on the dataset remain valid for end users.
Common methods of mitigating or eliminating privacy leakage typically involve some modification of individual records in the dataset, including adding noise, reshuffling, aggregating, and top or bottom coding, to name a few. These methods are collectively known as statistical disclosure control or the data de-identification process, as introduced based on the definition of privacy in [1]. Under this definition, access to the published dataset should not allow the adversary to learn anything extra about any target individual or unit, even in the presence of the attacker's background knowledge. Preventing such disclosure, known as re-identification, however, is generally impossible because background knowledge obtained from other sources is beyond the data publisher's control. Instead, [2] used a different notion, under which data privacy is said to be protected if no outputs become significantly more or less likely even when the adversary removes someone's data from the dataset. Another direction of privacy protection is to synthetically generate a new dataset from the original one. In synthetic datasets some or all of the original data are replaced by synthetic values by means of some suitable mechanism. The idea is that a well-balanced synthetic dataset can be seen as another independent sample from the same population from which the original data have been drawn. In other words, a good synthetic dataset must be similar enough to the original data to preserve its utility but, at the same time, different enough that the privacy disclosure risk is minimized. See [3], [4], [5], [6] for various synthetic data generation methods.
Regardless of specific choices for de-identification, a related pivotal question is how to quantify the utility and privacy risk of the de-identified or masked dataset compared to the original dataset, which is our main topic of interest. Maximizing data utility and minimizing privacy risk obviously contradict each other by nature. For example, when extreme values exist in the sample, data synthesis methods focusing on imitating the original data often fail to protect privacy; conversely, excessive privacy protection leads to a substantial information loss, destroying the data utility. Thus it is of interest and importance to properly quantify the quality of the de-identified dataset in terms of utility and privacy by means of some index or measure in an objective manner [7], [8]. Developing such measures has been widely discussed in the literature, including k-anonymity, l-diversity, t-closeness, and differential privacy, to name a few keywords. All these measures, however, focus on assessing the degree of privacy risk only. For the data utility side, there are measures that quantify the degree of utility at either the analysis-specific or the more general sample level; see [9], [10] and references therein. These measures, in turn, deal only with the utility side, ignoring the disclosure risk.
To this end we propose in this paper a new framework that can measure both data utility and privacy risk at the distribution level, an aspect that has hardly been researched. The new index coherently accommodates the trade-off between the two contradicting aspects of a de-identified dataset. More details of our contribution are presented in Section II-C.

A. Measures of Data Utility
Measurement of data utility has been discussed at two different levels in the literature. First, at the analysis level, data summaries or a fitted model's estimated parameters are compared between the synthetic and original datasets. If the two values of interest are sufficiently close, the synthetic data is deemed to have high utility [11]. These analysis-specific measures, however, are somewhat limited in their applications in that the type or nature of the analysis carried out by the end user is generally unknown. To overcome this disadvantage, measuring data utility at the sample level has also been studied. Unlike analysis-specific measures, these global measures attempt to assess the utility for all possible analyses by summarizing the difference in the shape of the sample or distribution [9]. An important recent work on the global utility measure is [10], where the propensity score mean-squared error (pMSE) is extensively studied. Their model, however, has several limitations in practice. First, even if both the original and synthetic datasets follow the same distribution, the asymptotic results on the pMSE differ depending on whether the original dataset is used when the synthetic datasets are constructed. This indicates that their proposed method depends on the synthetic data generation method. Second, the propensity score-based approach is sensitive to the choice of classifier, as mentioned in [9]. Last, their method does not give any information about data privacy; this drawback in fact applies to most existing papers on data utility.

B. Measures of Data Privacy Risk
Traditional measures that evaluate the level of privacy disclosure risk include k-anonymity, l-diversity, and t-closeness. k-anonymity requires that there be at least k > 1 records in the dataset that share the same set of quasi-identifiers, the attributes that may be used to identify the owner of a record; see [12], [13], [14], and [15].
While k-anonymity ensures that a group of k records has the same values of the quasi-identifiers, it is still open to attribute disclosure risk. To tackle this, l-diversity requires that there be l > 1 different sensitive values in each group of k records [16], [17]. This way a particular sensitive attribute value, e.g., a specific medical illness or condition, cannot be associated with a specific record in the data, protecting privacy. Recognizing that l-diversity is still subject to skewness and similarity attacks, the method of t-closeness has emerged. A group of k records is said to have t-closeness if the distance between the distribution of a sensitive attribute in the group and the distribution of the attribute in the entire data table is smaller than some threshold value t; see [18] and [19] for details. While all these privacy measures capture privacy risk, they ignore how much information has been lost compared to the original data. In this regard, these measures are highly asymmetric, and it is difficult for the data publisher to select specific values, leaving the choice of k, l, and t essentially subjective. Another recent class of privacy models is based on the notion of differential privacy (DP) in [2], where privacy models seek to achieve the uninformative principle of [20], whose goal is to keep the difference between the prior and posterior beliefs small enough to prevent private information leakage. Various extensions and variations of DP have been made in the literature; see [21], [22], [23], and [24]. Recent works include a noise-added gradient descent procedure in deep learning [25], differentially private generative adversarial networks [26], noising before model aggregation in federated learning [27], a differentially private alternating direction method of multipliers [28], a differentially private game theoretic approach [29], and the PATE mechanism in deep learning [30], [31].
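To make the first two notions concrete, minimal checks of k-anonymity and l-diversity on a toy table can be sketched as follows (the record layout and attribute names are hypothetical, chosen purely for illustration):

```python
from collections import Counter, defaultdict

def k_anonymity(records, quasi_ids):
    """Largest k for which the table is k-anonymous: the size of the
    smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values())

def l_diversity(records, quasi_ids, sensitive):
    """Largest l for which the table is l-diverse: the smallest number of
    distinct sensitive values within any quasi-identifier group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].add(r[sensitive])
    return min(len(vals) for vals in groups.values())

# Toy generalized table: "zip" and "age" are quasi-identifiers,
# "dx" (diagnosis) is the sensitive attribute.
table = [
    {"zip": "130**", "age": "<30", "dx": "flu"},
    {"zip": "130**", "age": "<30", "dx": "cold"},
    {"zip": "148**", "age": "30+", "dx": "flu"},
    {"zip": "148**", "age": "30+", "dx": "flu"},
]
```

Here the table is 2-anonymous but only 1-diverse: the second group shares a single diagnosis, which is exactly the attribute disclosure that l-diversity is designed to flag.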
Though DP has been widely accepted in data privacy research, its limitations have also been argued. For example, it is well known that DP becomes increasingly less efficient for repeated queries. Also, [32] shows that choosing the DP parameter is a non-trivial matter, as it depends on the queries to be made by the users. It is demonstrated in [33] that a given parameter value can result in very different levels of confidentiality and utility, advising against using DP to anonymize microdata. Furthermore, [34] argues that DP cannot be a universal solution to all privacy problems, and that DP methods have been used beyond the setting they were designed for, without due care, in many applications including data collection, data release, and machine learning.

C. Our Contribution
We propose a new framework to measure the data utility and privacy embedded in synthetic datasets, or in datasets created via other de-identification methods. As previously mentioned, most existing measures focus on either utility or privacy risk. Though in some past works privacy risk has been measured by looking at the departure from a benchmark utility measure, such approaches lack the ability to make formal comparisons. Hence developing a tool that can simultaneously measure utility and privacy is a valuable contribution to the data privacy literature.
Our framework can be classified as a global measure, because it does not depend on the queries or types of analysis made by the end users. In this sense it is similar to the propensity score-based general measure in [10], but our approach is fundamentally different in that it is motivated by the statistical notion of the distance between two datasets, the synthetic dataset and the originally observed one, at the distributional level. By properly measuring the degree of similarity and difference between the two datasets, we evaluate how much data utility is preserved and, at the same time, how much privacy protection is offered in the synthetic or de-identified dataset, compared to the original one. Our contribution in this paper is two-fold. First, we define a relative measure of dataset distance and create a numerical index that can evaluate both the data utility and the disclosure risk embedded in the synthetic or otherwise de-identified dataset. This allows us to identify the optimally balanced position of any synthetic data and examine whether the data utility has been maximized within a specified data privacy level. Second, we devise a graphical tool that visually reveals the current utility-privacy trade-off position of the synthetic dataset, along with the optimal position. We emphasize that our method is both distribution-free and distance-free in that it can be applied to any type of data, including both numerical and categorical, under any distance function defined in a metric space. As our focus is on measuring data utility and privacy, we will assume throughout the paper that a suitable de-identification process, either masking or synthesis, is already in place.
The rest of the paper is organized as follows. In Section III we explain how the distance of a dataset from a fixed point can be measured and put forth related theoretical findings. These results are extended in Section IV where measuring the distance between two datasets is discussed and a visual tool to depict the utility and privacy is introduced. Section V carries out numerical experiments to confirm our theoretical findings, and Section VI concludes the paper.

A. Idea and Notation
Consider a dataset X randomly drawn from a multivariate population represented by the distribution function F_X. In our discussion we will treat X as the original dataset that contains useful information but, at the same time, is exposed to privacy disclosure risk due to sensitive attributes. Now suppose that the data collector obtains Y, a new synthetic or differentially private dataset, by means of some statistical disclosure control technique. The aim is for Y to be similar enough to X that it contains (almost) the same amount of information or utility, while being less subject to privacy risk. An ideal method to create Y is simply to randomly draw another sample from the same F_X without referencing the original X. This of course is practically impossible because the true generative model F_X is unknown. Nonetheless this argument conceptually shows that such a synthetic Y would contain an equivalent amount of utility as X, and yet is much less prone to privacy risk because it is independent of the original X. In light of this argument, we may view the datasets X and Y as independent random samples from the common F_X, and the current practice of anonymizing or synthesizing the original X can be understood as creating Y using a surrogate of F_X.

B. Distance of a Dataset From a Fixed Point
We start our discussion by proposing a measure of distance between a dataset and a fixed point. In what follows we use subscripts to make the sample sizes explicit, so that X_n ∼ F_X and Y_m ∼ F_Y are random samples of size n and m, respectively. The populations F_X and F_Y are generally multivariate and can be continuous, categorical, or both.
In this section we develop a distance measure between X_n and a fixed point c. For our developments, we make several assumptions.
(a) Both F_X and F_Y share the same support, denoted by Ω. This makes sense as Y_m is a synthetic dataset similar to X_n, and should be defined on the same dimension and support.
(b) Each element in X_n is unique with no duplicates. This requirement is easily met for datasets of any dimension with at least one continuous marginal variable. The condition is also practically satisfied for categorical datasets as the dimension gets larger. As a general solution, one can always append an additional variable to X_n and fill it with small random numbers from a continuous distribution, such as N(0, ε²) with ε negligibly small. This way, each X_i ∈ X_n is ensured to be unique with at least one continuous marginal variable. Essentially, this assumption can be omitted after computing the number of duplicates of a fixed point c and adjusting the theoretical results according to that number.
(c) The fixed point c, from which the distance of X_n is measured, is in the support of F_X, that is, c ∈ Ω. Note that P(c ∈ X_n) = 0 at the random variable level because F_X has a continuous marginal variable. However, this probability becomes non-zero when we have actually observed X_n as a sample; this case will be discussed in later developments.
Definition 1: For a suitable metric function d on Ω, we define the kth shortest distance of the random sample X_n ∼ F_X from a fixed point c ∈ Ω as
d^{<k>}_{X_n}(c) = {d(X_i, c) : X_i ∈ X_n}^{<k>},
where {·}^{<k>} represents the kth smallest element in the set, when measured in terms of d.
Example 1: Suppose X_4 (n = 4) has been observed. We denote this 'observed' sample again by X_4, with the understanding that it is a realization of the random sample (see Section IV for further details). Let us pick c = (1, 0) ∈ Ω. With the Euclidean metric d, we compute d(x, c) for each x in X_4 and arrange the four distances in ascending order to obtain d^{<1>}_{X_4}(c) ≤ d^{<2>}_{X_4}(c) ≤ d^{<3>}_{X_4}(c) ≤ d^{<4>}_{X_4}(c).
In words, d^{<k>}_{X_n}(c) identifies the kth nearest point in X_n from c. We now extend this concept to compare the distances of two datasets from a common fixed point.
Definition 2: For X_n ∼ F_X and Y_m ∼ F_Y defined on the common support Ω, the kth relative distance of Y_m to X_n from a fixed point c ∈ Ω is defined as
P( d^{<k>}_{Y_m}(c) ≤ d^{<k>}_{X_n}(c) ).   (2)
This relative distance is stated in terms of probability because both d^{<k>}_{X_n}(c) and d^{<k>}_{Y_m}(c) are random variables. To explain the meaning of the relative distance, let us assume that both datasets X_n and Y_m are drawn from the same population F.
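The kth shortest distance and the relative distance in (2) are straightforward to approximate numerically. The sketch below is illustrative only: it uses the Euclidean metric and a uniform toy population, both arbitrary choices, and estimates (2) by Monte Carlo.

```python
import math
import random

def kth_shortest_distance(X, c, k):
    """d^{<k>}_{X_n}(c): the kth smallest distance from the points of X
    to the fixed point c, here under the Euclidean metric."""
    return sorted(math.dist(x, c) for x in X)[k - 1]

def relative_distance_mc(n, m, c, k=1, reps=4000, seed=0):
    """Monte Carlo estimate of the kth relative distance (2),
    P(d^{<k>}_{Y_m}(c) <= d^{<k>}_{X_n}(c)), where X_n and Y_m are drawn
    independently from the same toy population (uniform on [0, 1]^2)."""
    rng = random.Random(seed)
    draw = lambda size: [(rng.random(), rng.random()) for _ in range(size)]
    hits = sum(
        kth_shortest_distance(draw(m), c, k) <= kth_shortest_distance(draw(n), c, k)
        for _ in range(reps)
    )
    return hits / reps
```

With m = n the estimate hovers around 0.5, and with m > n it exceeds 0.5, matching the discussion that follows.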
In this case we have P( d^{<k>}_{Y_m}(c) ≤ d^{<k>}_{X_n}(c) ) = 1/2 for any c so long as m = n; the proof is trivial using the symmetry of the two samples. Thus in this case the kth relative distance of Y_m to X_n is 50%, indicating that both datasets exhibit similar characteristics, being equally distant from c. If m > n, however, the probability will be greater than 0.5 because Y_m has more points on the same support than X_n does, and thus is generally closer than X_n to c. The argument is reversed when m < n. Therefore, when n = m and (2) is close to 0.5, we may say that Y_m is well embedded in the sense that it is neither easily re-identified nor too different from the original X_n.
Actually, a more important usage of (2) in this paper is to examine and measure the quality of the synthetic dataset Y_m. Suppose that Y_m has been drawn from F_Y, which typically represents some synthetic population such as F̂_X estimated from X_n, and let c be a new data point from the true population F_X. In order to examine whether Y_m is suitable as a replacement for the original X_n, we compare the distances of X_n and Y_m from c as in (2). More specifically, assuming n = m, if (2) is less than 0.5, it implies that X_n is closer than Y_m to c, better capturing the characteristics of F_X. This means that Y_m represents F_X less effectively than X_n does for the given c, losing the data utility of the original X_n. At the same time, however, the de-identification or privacy protection is deemed stronger because Y_m is materially different from X_n. Hence by computing (2) over various c values generated from F_X and taking the mean, we may measure the data utility and privacy of Y_m relative to the original X_n. We will take up this discussion later in more detail.

C. Gaussian Data Case
To illustrate how the kth relative distance works, we carry out a simulation study in a simple setting. We employ the Euclidean norm ‖·‖_2 as the metric function d and set k = 1. Throughout this subsection, the dataset X_n is assumed to be from the standard Gaussian distribution, N(0, 1). Under this setting, each squared distance (X_i − c)² follows the noncentral chi-square distribution with one degree of freedom and non-centrality parameter c². That is, (X_i − c)² ∼ χ²₁(c²). We can then calculate the kth relative distance P( d^{<1>}_{Y_m}(c) ≤ d^{<1>}_{X_n}(c) ) for various m, n and c. The results are presented in Table I.
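A quick Monte Carlo check of this Gaussian setting can be sketched as follows (illustrative only; the entries of Table I come from the authors' own computation):

```python
import random

def rel_dist_gauss(m, n, c, reps=3000, seed=1):
    """Monte Carlo estimate of P(d^{<1>}_{Y_m}(c) <= d^{<1>}_{X_n}(c))
    for X_n and Y_m drawn independently from N(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x_min = min(abs(rng.gauss(0, 1) - c) for _ in range(n))
        y_min = min(abs(rng.gauss(0, 1) - c) for _ in range(m))
        hits += y_min <= x_min
    return hits / reps
```

With (m, n) = (100, 100) the estimate is close to 0.5 for any c, while with (m, n) = (300, 100) it is well above 0.5, in line with the discussion of Table I below.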
As pointed out previously, the values are 0.5 when n = m for any choice of c. From the data synthesis viewpoint, this Y_m is an ideal candidate for the synthetic dataset because it is directly generated from the identical population F ≡ N(0, 1) without referencing X_n, which of course is impossible in practice. Nonetheless we emphasize that the value 0.5 is generally the main benchmark that any ideal synthetic dataset would aim for. When n ≠ m, the values in Table I depart from 0.5. In particular, when (m, n) = (300, 100), the probability is larger than 0.5 because Y_m has more data points than X_n on the real line and is thus generally closer to any c. The next example considers a case where the mean of the distribution of Y_m is misspecified.
As presented in Table II, when the synthetic data are generated under an incorrect assumption about the location parameter, things are more complicated but give us new insights. For instance, when μ = 1 and the sample sizes are (200, 200), the relative distance is 0.377 at c = 0, less than 0.5. This implies that X_n is closer than Y_m to c = 0. In contrast, the value is 0.623 at c = 1; being larger than 0.5, this indicates that X_n is farther than Y_m when measured from c = 1. The following example concerns a case where the scale parameter of Y_m is misspecified. The results are given in Table III, which shows that the relative distance measure varies over different σ and c values.
A popular technique for generating synthetic datasets is to fit the original X_n to a parametric F and draw the synthetic Y_m from the estimated F̂, as shown in the following example.
Example 5: Let Y_m be a random sample generated from a Gaussian distribution whose parameters are the maximum likelihood estimates (MLE) obtained from X_n, where X̄ denotes the sample mean of X_n. The setting of Example 5 is more practical in that the population parameters are estimated from the original sample, and the synthetic dataset is drawn from this estimated population. If we let n → ∞, the result will be the same as in Example 2 because X̄ converges to the true mean. The numerical results presented in Table IV are consistent with, and quite similar to, those in Table I.

D. Exact Value of the Relative Distance
We computed the kth relative distance measure by simulation in the previous subsection, but these values can in fact be obtained analytically when both datasets are from the same population, an important setting for evaluating the utility and privacy of synthetic datasets.
To elaborate, let us start with a simple case where X_n and Y_m are independent univariate random samples drawn from a common population F that is continuous, so as to prevent duplicates. For this case we determine P( Y^{<k>} ≤ X^{<k>} ), where X^{<k>} and Y^{<k>} denote the kth smallest elements of X_n and Y_m, respectively. The event Y^{<k>} ≤ X^{<k>} occurs if and only if at least k of the m elements of Y_m are no greater than X^{<k>}. We can rewrite this set of mutually exclusive conditions as: exactly j elements of Y_m are no greater than X^{<k>}, for j = k, . . . , m. Therefore we obtain, using the standard theory of the order statistics,
P( Y^{<k>} ≤ X^{<k>} ) = Σ_{j=k}^{m} C(m, j) ∫ F(y)^j (1 − F(y))^{m−j} f_{X^{<k>}}(y) dy,   (4)
where f_{X^{<k>}}(y) = k C(n, k) F(y)^{k−1} (1 − F(y))^{n−k} f(y) is the density of the kth order statistic of X_n. To further simplify (4), we use integration by substitution. Letting F(y) = u, 0 ≤ u ≤ 1, and f(y) dy = du, (4) reduces to
k C(n, k) Σ_{j=k}^{m} C(m, j) ∫₀¹ u^{j+k−1} (1 − u)^{n+m−j−k} du = k C(n, k) Σ_{j=k}^{m} C(m, j) B(j + k, n + m − j − k + 1),
where the last expression uses the definition of the beta function.
We note that this result is non-parametric and distribution-free in that it does not depend on the form of F.
Moving to the general case, we now assume that X_n and Y_m are multivariate random samples drawn from the common population F of any dimension. While both datasets are multivariate, the distances d^{<k>}_{Y_m}(c) and d^{<k>}_{X_n}(c) are univariate, and can be considered as random samples from a common continuous univariate population. Hence, by the same logic as above, we obtain P( d^{<k>}_{Y_m}(c) ≤ d^{<k>}_{X_n}(c) ) as follows.
Theorem 1: For X_n and Y_m drawn from the same population F, the kth relative distance of Y_m to X_n from a fixed c ∈ Ω defined in (2) is
P( d^{<k>}_{Y_m}(c) ≤ d^{<k>}_{X_n}(c) ) = k C(n, k) Σ_{j=k}^{m} C(m, j) B(j + k, n + m − j − k + 1).   (6)
All proofs of the theoretical findings in the present paper can be found in the Appendix. Several comments on this result are in order:
• The relative distance measure is distribution-free in that (6) holds regardless of the form of F (numerical, categorical, or both) as long as it is the common population of both X_n and Y_m. Hence we can compute this quantity for any distribution in a straightforward manner.
• It is also independent of the choice of c. Later we will use this property to investigate the data utility and privacy over different c values across the original sample, which again is easily done.
• As mentioned in Section III-B, the continuity of F ensures that P(c ∈ X_n) = 0. In Section IV we will consider a situation where X_n has been observed as actual numbers, in which case the result becomes slightly different.
An alternative expression for the last expression in (6) is possible and is presented as Lemma 1 below, which can be useful for some (m, n) choices.
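The closed form can be evaluated directly. The sketch below implements our reading of the expression in (6), computing the beta function via log-gamma for numerical stability; by the symmetry of the two samples it returns exactly 0.5 whenever n = m.

```python
from math import comb, exp, lgamma

def beta_fn(a, b):
    # Beta function B(a, b) computed via log-gamma for numerical stability.
    return exp(lgamma(a) + lgamma(b) - lgamma(a + b))

def relative_distance(n, m, k):
    """kth relative distance of Y_m to X_n under a common population F,
    following the reconstructed form of (6):
    k C(n, k) * sum_{j=k}^{m} C(m, j) B(j + k, n + m - j - k + 1)."""
    return k * comb(n, k) * sum(
        comb(m, j) * beta_fn(j + k, n + m - j - k + 1) for j in range(k, m + 1)
    )
```

For k = 1 the sum collapses to m/(n + m), which serves as a convenient consistency check against Corollary 1.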
Lemma 1: An alternative expression of Theorem 1 is
P( d^{<k>}_{Y_m}(c) ≤ d^{<k>}_{X_n}(c) ) = 1 − k C(m, k) Σ_{j=k}^{n} C(n, j) B(j + k, n + m − j − k + 1),
obtained by exchanging the roles of the two samples. According to Theorem 1 and Lemma 1, the following corollaries are immediate.
Corollary 1: When k = 1, Theorem 1 reduces to P( d^{<1>}_{Y_m}(c) ≤ d^{<1>}_{X_n}(c) ) = m/(n + m). Next, we consider the case where the order of distance, that is, the value of k, differs for each dataset. This is an extension of Theorem 1 and the resulting formula is again analytic, but in a more complicated form.
Theorem 2: Consider two random samples X_n and Y_m independently drawn from a common distribution F. Then, for c ∈ Ω, k₁ = 1, . . . , n, and k₂ = 1, . . . , m, we have
P( d^{<k₂>}_{Y_m}(c) ≤ d^{<k₁>}_{X_n}(c) ) = k₁ C(n, k₁) Σ_{j=k₂}^{m} C(m, j) B(j + k₁, n + m − j − k₁ + 1).
The following lemma is a generalization of the hockey-stick identity.
Lemma 2: When n ≥ k 2 and m ≥ k 1 , the following equations hold.

A. Distance of Synthesized Data From Original Data
As mentioned in Section III, a natural way to measure the distance between the synthetic Y_m and the original X_n is to compute the relative distance in Definition 2 over various c values drawn from F_X. However, as F_X is unknown in practice, the best realistic solution is to use the elements of X_n as the candidate c values, provided that X_n has been realized or observed. To reflect this shift in perspective, we let X_n = {x_i}_{i=1}^{n} be the 'observed' value of X_n and restrict c to be an element of X_n.
One caveat is that when c ∈ X_n the definition of the kth relative distance in Definition 2 should be slightly revised. To elaborate, let us assume that c = x_i ∈ X_n. Then one of the distances between X_n and c must be 0, and thus
d^{<1>}_{X_n}(c) = 0 and d^{<k>}_{X_n}(c) = d^{<k−1>}_{X_{n\i}}(c) for k ≥ 2,
where X_{n\i} = X_n \ {x_i}. This implies in particular that the first relative distance is always zero because
P( d^{<1>}_{Y_m}(c) ≤ d^{<1>}_{X_n}(c) ) = P( d^{<1>}_{Y_m}(c) ≤ 0 ) = 0.
To avoid this discrepancy, we define the kth relative distance of Y_m to X_n from c = x_i ∈ X_n as
P( d^{<k>}_{Y_m}(x_i) ≤ d^{<k>}_{X_{n\i}}(x_i) ).
We now present our main theoretical result.
Theorem 3: For X_n, Y_m ∼ F, the kth relative distance of Y_m to X_n from x_i ∈ X_n is
P( d^{<k>}_{Y_m}(x_i) ≤ d^{<k>}_{X_{n\i}}(x_i) ) = k C(n − 1, k) Σ_{j=k}^{m} C(m, j) B(j + k, n + m − j − k),
for any i = 1, . . . , n.
If we apply the above theorem to every x_i, i = 1, . . . , n, and take the average, we obtain the distance of the synthesized data from the original data without reference to the anchoring point c; the result is presented as follows.
Theorem 4: For X_n, Y_m ∼ F, the kth relative distance of Y_m from X_n is defined as, for k = 1, . . . , min{n − 1, m},
(1/n) Σ_{i=1}^{n} P( d^{<k>}_{Y_m}(x_i) ≤ d^{<k>}_{X_{n\i}}(x_i) ) = k C(n − 1, k) Σ_{j=k}^{m} C(m, j) B(j + k, n + m − j − k).   (10)
The proof is immediate from Theorem 3. As a special case, the result (10) reduces to m/(n + m − 1) when k = 1.

B. Data Utility and Privacy Index (DUPI)
We propose the empirical version of equation (10) as our index to measure data utility and privacy.

Definition 3: The kth order Data Utility and Privacy Index (DUPI^{<k>}) of a synthetic dataset Y_m against the original dataset X_n is defined as
DUPI^{<k>} = (1/n) Σ_{i=1}^{n} I( d^{<k>}_{Y_m}(x_i) ≤ d^{<k>}_{X_{n\i}}(x_i) ),   (11)
where I(·) is an indicator function. By investigating the value of DUPI in (11) we can gain important insights into the quality of Y_m as a synthetic dataset derived from X_n. In particular, the DUPI may be compared against three specific benchmark numbers: 0, 1, and the theoretical value in (10).
• If DUPI is close to 1, each x_i tends to be closer to Y_m than to the other data points in X_n. This happens when Y_m and X_n are too close in value. Thus a DUPI value near 1 is interpreted as the two datasets having a similar structure and being highly correlated, in the sense that most of the nearest points of x_i ∈ X_n belong to Y_m. In this case the utility of the synthetic dataset Y_m is maximized as there has been no information loss, but the privacy protection is poor because Y_m carries virtually the same sensitive information as X_n does.
• In contrast, the DUPI value gets close to zero when Y_m and X_n are far away from each other. This implies that the synthesizing procedure is well implemented but the data information is not preserved. In this case Y_m offers greater privacy protection, but poor utility compared to the original X_n, due to considerable information loss during the synthesizing procedure.
• The optimal distance between Y_m and X_n is the theoretical benchmark (10). By optimal, we mean that the synthetic Y_m is neither so different from the original X_n as to cause substantial information loss, nor so similar to X_n as to face unacceptable privacy disclosure risk. Therefore a DUPI value close to the benchmark (10) indicates that Y_m and X_n behave as independent samples from the same population, as assumed in Theorem 4, ensuring that Y_m is a well-balanced synthetic dataset.
Any synthesizing procedure inevitably involves tension between information preservation (data utility) and privacy protection (data privacy), two contradicting goals in data synthesis. The DUPI is a numerical compromise between the two, and its value can be used to show which of them is more influential on Y_m.
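Definition 3 translates directly into code. The following sketch computes the empirical DUPI under the Euclidean metric (an illustrative choice; any metric d could be substituted):

```python
import math

def dupi(X, Y, k=1):
    """Empirical DUPI^{<k>} of Definition 3: the fraction of original points
    x_i whose kth nearest neighbour in Y is at least as close as the kth
    nearest neighbour in X \\ {x_i}."""
    hits = 0
    for i, x in enumerate(X):
        d_Y = sorted(math.dist(x, y) for y in Y)[k - 1]
        d_X = sorted(math.dist(x, z) for j, z in enumerate(X) if j != i)[k - 1]
        hits += d_Y <= d_X
    return hits / len(X)
```

As a sanity check on the two extreme benchmarks above: taking Y identical to X gives DUPI = 1 (maximal utility, no privacy), while a Y shifted far away from X gives DUPI = 0 (maximal privacy, no utility).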
Before closing this subsection, we examine the computational cost of DUPI, which involves many combinatorial calculations. For this numerical experiment we set m = n, sampled both X_n and Y_n independently from the 10-dimensional standard Gaussian distribution, and calculated DUPI. The sample sizes tested are n = 1000, 5000, 10000, 20000, 40000, and 60000, and the Euclidean distance was used for the metric. We repeated the whole procedure with the 20-dimensional standard Gaussian distribution for comparison. The numerical results are presented in Table V and also depicted in Figure 1, from which we see that the computational cost increases quadratically with the sample size for the 10-dimensional dataset. The same pattern is exhibited for the 20-dimensional data with a faster growth rate. Overall, the experiment with our setup suggests that the computational time is O(n²p), where p is the dimension of the dataset.

C. Visualizing DUPI
Based on the trade-off relationship between data utility and privacy, we decompose the DUPI into two sub-indices, the Utility Index (UI) and the Privacy Index (PI), and propose new graphical tools to visually present each component so that we can assess the quality of the synthetic data from two different perspectives. We consider the first relative distance (k = 1) for the plot even though other k values are equally valid, because the DUPI value tends to be sensitive to outliers as k gets larger, and our extensive experiments show that the first-order DUPI captures the main characteristics of the synthetic data at a sufficient level with more robust performance. Thus we suppress the superscript <k> in what follows.
We design the plot to meet two requirements. First, the plot should map the DUPI onto a two dimensional plane so that we can examine the influence of PI and UI separately in the unit square [0, 1] × [0, 1]. Second, the plot must be invariant to sample sizes of X n and Y n as well as their values, so that the same plot can be consistently used for different cases.
To construct such a plot we first apply a rescaling function g that normalizes the DUPI over the two regimes separated by the theoretical benchmark (10), which will be denoted by DUPI₀:
g(DUPI) = DUPI / (2 DUPI₀) if DUPI ≤ DUPI₀, and g(DUPI) = 1 − (1 − DUPI) / {2(1 − DUPI₀)} if DUPI > DUPI₀,
so that g maps the benchmark DUPI₀ to 1/2 while fixing the endpoints 0 and 1. As the second step, we define UI and PI in terms of the inverse tangent function so that they form proper curves in [0, 1] × [0, 1]. That is,
UI(DUPI) = arctan{ τ g(DUPI) } / arctan(τ),   (12)
PI(DUPI) = arctan{ τ (1 − g(DUPI)) } / arctan(τ).   (13)
Here τ > 0 is a nuisance parameter that determines the degree of concavity inside the plot, with a more concave curve at larger τ. The default value is set at τ = 5 in this paper.
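The mapping can be sketched in a few lines. The functional forms of g, UI, and PI below are our reconstruction of this construction, chosen so that they reproduce the (UI, PI) values reported for the wine example discussed next; treat the sketch as illustrative rather than the authors' implementation.

```python
import math

TAU = 5.0  # default concavity parameter tau

def g(dupi, dupi0):
    # Piecewise-linear rescaling: sends 0 -> 0, the benchmark DUPI_0 -> 1/2,
    # and 1 -> 1 (reconstructed from g(0.25) = 0.25 / (2 x 0.5) in the text).
    if dupi <= dupi0:
        return dupi / (2.0 * dupi0)
    return 1.0 - (1.0 - dupi) / (2.0 * (1.0 - dupi0))

def ui_pi(dupi, dupi0, tau=TAU):
    """Utility Index (12) and Privacy Index (13) as reconstructed here."""
    u = g(dupi, dupi0)
    scale = math.atan(tau)
    return math.atan(tau * u) / scale, math.atan(tau * (1.0 - u)) / scale
```

For instance, ui_pi(0.25, 0.5) gives approximately (0.652, 0.954), and ui_pi(0.5, 0.5) gives the optimal point (0.867, 0.867), matching the wine-data figures quoted below.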
Presented in Figure 2 is the result for the wine dataset analysis considered in Section V. A synthetic dataset has been obtained from this dataset using the CART-based algorithm [35]. The sample size of both datasets is m = n = 6497. The value of DUPI computed from (11) is 0.25, and the benchmark is DUPI₀ = 0.5 from (10). Thus we have g(DUPI) = g(0.25) = 0.25/(2 × 0.5) = 0.25. Hence, using (12) and (13), this DUPI is mapped to (UI, PI) = (0.652, 0.954) and marked as a black solid circle on the curve in the figure. The curve itself consists of the points mapped from all possible DUPI values from 0 to 1 with DUPI₀ fixed at 0.5. Next, we need to find the optimal point (UI(DUPI₀), PI(DUPI₀)) corresponding to DUPI₀. As this special case occurs when the DUPI coincides with the benchmark DUPI₀, we set DUPI = DUPI₀ and obtain g(DUPI₀) = 1/2 using the definition of g, from which we can readily compute (UI(DUPI₀), PI(DUPI₀)) = (0.867, 0.867) from (12) and (13). This optimal point is shown as the crossing point of the blue and red dashed lines in the figure. Thus we observe that the synthetic dataset exhibits lower utility in exchange for higher privacy protection. The optimal position in fact corresponds to the point that yields the maximum area of the rectangle [0, UI] × [0, PI] among all (UI, PI) points on the curve. This property is presented in Theorem 5.

A. Simulated Datasets
For our numerical study we set the original dataset as X_2000 = {x_1, . . . , x_2000}, where each x_i is a vector-valued observation from MVN_5(0, I), the 5-dimensional standard multivariate Gaussian distribution. A series of synthetic datasets derived from X_2000 are considered as follows. All synthetic datasets are of the same sample size 2,000, denoted by Y_2000. The DUPI is then computed from Y_2000 and X_2000. To reflect the sampling variation we repeat this 1,000 times for both X and Y. S1: Y_i ∼ MVN_5(0, I), for i = 1, . . . , 2000, so that the synthetic S1 is generated from the same population as the original dataset. The noise-addition scenarios S5a-S5e, in contrast, essentially generate synthetic data by adding noises to the original dataset.

Presented in Figure 3 are the distributions of the DUPI for each synthetic dataset. With m = n = 2000, the theoretical benchmark is 2000/3999 ≈ 0.5, drawn as the horizontal line across all datasets. As expected, the results of S1 and S4 show average DUPI values very close to the theoretical benchmark. This indicates that both synthesis methods produce well-balanced synthetic datasets that are similar to the original data with an appropriate level of privacy protection. When the location or dispersion parameters are misspecified, as in S2 and S3a-S3d, the distributions of the DUPI lie below the benchmark, closer to zero, suggesting that these synthetic datasets have lost pertinent information of the original dataset. In light of the location difference between S2 and S3a-S3d, the amount of information loss is more substantial for the location misspecification than for the dispersion misspecification, which is consistent with our intuition as most statistical analyses are more sensitive to a difference in the mean than in the variance. The results of the synthetic datasets S5a-S5e illustrate the effect of the adding-noise synthesis method. From the figure, the DUPI values gradually decrease from 1 towards 0 as the noise level θ gets larger.
This pattern again is intuitive: when very small noise (e.g., θ = 0.01) is added, the synthesis hardly provides any privacy protection as there is hardly any change from the original dataset; in contrast, if the added noise is very large, the synthetic dataset loses a considerable amount of the original information, as confirmed by the value below the benchmark for S5e (i.e., at θ = 1).
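The simulation setup above can be sketched as follows. The exact misspecification levels for S2-S3 and the full grid of noise levels for S5 are not reproduced in this section, so apart from the quoted values θ = 0.01 and θ = 1 the settings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 5

# Original dataset: 2,000 draws from the 5-dimensional standard Gaussian.
X = rng.standard_normal((n, d))

# S1: an independent sample from the same population as the original.
Y_s1 = rng.standard_normal((n, d))

# S5-style synthesis: add Gaussian noise of level theta to the original
# records (the paper quotes theta = 0.01 for S5a and theta = 1 for S5e).
def add_noise(X, theta, rng):
    return X + theta * rng.standard_normal(X.shape)

Y_s5a = add_noise(X, 0.01, rng)  # tiny noise: barely any privacy protection
Y_s5e = add_noise(X, 1.0, rng)   # large noise: substantial information loss
```

With θ = 0.01 each synthetic record sits almost on top of an original record, while θ = 1 doubles the marginal variance, matching the qualitative DUPI behaviour described above.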
Next we depict the UI-PI plots for these synthetic datasets in Figure 4. For this figure we have computed the average of the DUPI values over the 1,000 simulations for each synthetic dataset. The overall conclusions from this figure reaffirm our discussion of Figure 3. In panels (a) and (c), we see that the synthetic datasets S1 and S4 are located almost at the optimal point, meaning that they are well balanced, enjoying the optimal level of data utility and privacy. Panel (b) contains S2 and S3a-S3d, where S3b and S3c almost overlap. The locations of these datasets, lying above the benchmark point, suggest poor data utility with excessive privacy protection. Datasets S5a-S5e in panel (d) confirm the gradual upward shift in the UI-PI trade-off as the noise level gets larger, as previously mentioned.

B. Comparison With Existing Measures
To compare the performance of the proposed DUPI as a utility and privacy measure, we apply two other existing global utility measures to the synthetic datasets S1 to S5. The first alternative is the propensity score-based method of [9] and [10], which we call the PSM. We employ the logistic regression approach in the PSM with two different feature settings, where each feature corresponds to one dimension of the dataset: (1) all features are incorporated into the logistic regression up to the first order, and (2) only quadratic terms are incorporated into the logistic regression. The second measure is the Clustering Analysis Measure (CAM) used in [9], which is based on a clustering algorithm from unsupervised learning. We try two different choices for the number of clusters, G = 2 and 4, for the CAM. As the synthetic dataset departs from the original dataset, both the PSM and the CAM become larger.
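A minimal sketch of the propensity-score idea behind the PSM, assuming the standard pMSE construction from the propensity-score literature; the exact implementation in [9] and [10] (e.g., feature expansions) is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(original, synthetic):
    """Propensity-score mean-squared error (pMSE).

    Stack the original and synthetic records, fit a classifier to
    distinguish them, and measure how far the fitted propensities
    stray from the constant c = n_syn / (n_orig + n_syn).  A value
    near 0 means the two datasets are hard to tell apart.
    """
    Z = np.vstack([original, synthetic])
    t = np.r_[np.zeros(len(original)), np.ones(len(synthetic))]
    c = len(synthetic) / len(Z)
    # First-order logistic regression, as in feature setting (1).
    p = LogisticRegression(max_iter=1000).fit(Z, t).predict_proba(Z)[:, 1]
    return float(np.mean((p - c) ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 5))       # original
Y_good = rng.standard_normal((2000, 5))  # S1-style: same population
Y_bad = X + 2.0                          # S2-style: shifted location
```

As the text notes, a larger pMSE indicates a synthetic dataset that departs further from the original, so `pmse(X, Y_bad)` comes out much larger than `pmse(X, Y_good)`.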
Table VI presents the comparison results. We have already discussed the DUPI values; they correspond to the averages of the distributions in Figure 3. To discuss the results in the table, we treat S4, the MLE-based synthetic data, as the benchmark case because it is practically the best possible synthesis method. For the DUPI, the value for S4 is very close to that of the theoretically best synthetic data S1, which is intuitive and natural. For the other datasets, we can evaluate their utility and privacy by comparing against this number along with 0 and 1, and all the numbers look reasonable and consistent with the earlier discussion. For the other two measures, the PSM and CAM, the results are more erratic, and we make the following comments.
1) The results for S1 and S4 are considerably different under both measures in relative terms. On the PSM side especially, the numbers for S1 are roughly double those for S4. This is due to the fact that the PSM index depends on the knowledge of the observed dataset; the expected pMSE doubles if the observed dataset is unseen. Though this can be theoretically justified (see Appendix A.3 in [10]), it is not practically intuitive and is difficult for practitioners to understand. This discrepancy also appears for the CAM, though no rationale seems available in the literature to explain it.
2) Both the PSM and CAM are sensitive to the user input: the order of the regression equation in the PSM, and the number of clusters in the CAM. As these inputs are essentially subjective decisions, the resulting measures are more difficult to interpret and prone to instability. For the noise-added datasets, moreover, the PSM and CAM values are generally small; at small noise levels (e.g., S5a and S5b) they are even smaller than those of the practically best case S4. This indicates that the PSM and CAM, being utility-only measures, cannot flag the poor privacy protection of such lightly perturbed synthetic datasets.

C. Actual Datasets Analysis
We now compare the DUPI, PSM, and CAM on synthetic datasets obtained from actual datasets. As we do not know the true population, our conclusions will be somewhat limited compared to the simulation study in Section V-B. However, the advantages of the DUPI over the other measures still stand. For our analysis, we consider three real datasets: the German credit dataset [36], the Wine quality dataset [37], and the Bank marketing dataset [38]; basic information on these datasets is presented in Table VII. For the Bank data we dropped the pdays variable as it contains missing values. A synthetic dataset of a similar size has been created for each real dataset using three popular synthesis techniques: the CART-based algorithm [35] (implemented in synthpop), the TGAN method [39], and CTABGAN [40]. As shown in Table VII, the German and Bank datasets contain both categorical and numerical attributes. We used the Heterogeneous Euclidean-Overlap Metric (HEOM) for these two datasets to compute the DUPI. The HEOM, initially proposed by [41], is a popular metric that can handle both numerical and categorical attributes present in a given dataset. For the Wine dataset we simply used the Euclidean distance as it contains only numerical attributes.
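A minimal sketch of the HEOM distance used above, following the usual definition from [41]: range-normalized absolute difference for numerical attributes and 0/1 overlap for categorical ones, combined in Euclidean fashion. Missing-value handling is omitted here (the pdays variable was dropped for that reason), and in practice the per-attribute ranges would be taken from the combined original and synthetic data.

```python
import math

def heom(a, b, is_categorical, ranges):
    """Heterogeneous Euclidean-Overlap Metric between two records.

    a, b           : sequences of attribute values (mixed types)
    is_categorical : boolean flag per attribute
    ranges         : max - min per numerical attribute (ignored for
                     categorical ones); zero/None ranges are treated as 1
    """
    total = 0.0
    for x, y, cat, r in zip(a, b, is_categorical, ranges):
        if cat:
            d = 0.0 if x == y else 1.0          # overlap metric
        else:
            d = abs(x - y) / (r if r else 1.0)  # range-normalized difference
        total += d * d
    return math.sqrt(total)

# Mixed-type toy records (hypothetical attributes): (age, job, balance)
a = (30, "student", 1000.0)
b = (40, "teacher", 1500.0)
d = heom(a, b, is_categorical=(False, True, False), ranges=(50.0, None, 2000.0))
```

Here d = sqrt((10/50)^2 + 1 + (500/2000)^2) = sqrt(1.1025) = 1.05, mixing the categorical mismatch penalty with the two normalized numerical gaps.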
The results are given in Table VIII. In the table, the sizes of the synthetic datasets are exactly equal to those of the original datasets under synthpop, but not under TGAN, because the package rounds down the size for synthesis. We make several comments on the results in Table VIII. First, we can see how the three synthesis methods fare by examining the DUPI against the benchmark DUPI_0 defined in (10). For the German and Bank datasets, synthpop produces fair, well-balanced synthetic datasets, as their DUPI values, 0.495 and 0.549, are relatively close to the benchmarks. This type of statement is not possible with the PSM or CAM. In contrast, under TGAN and CTABGAN things get worse in that the German and Bank datasets are now further away from the corresponding benchmarks, indicating that their synthesis ability may not be as good (note that the DUPI for the German data is much better under TGAN than under CTABGAN, but still worse than under synthpop). More specifically, having DUPI values close to 0, TGAN and CTABGAN tend to emphasize privacy protection too much in exchange for excessive utility loss. Second, the quality of the synthetic Wine data is considerably poor under all three synthesis algorithms. In the case of synthpop, this may be explained by the fact that the CART algorithm tends to have difficulties in finding optimal partitions for high-dimensional data with all numerical variables. We are, however, unable to explain why TGAN and CTABGAN produce such mediocre synthetic datasets (with DUPI values of 0.113 and 0.054, respectively) for the Wine dataset.
Finally, all three measures, the DUPI, PSM and CAM, consistently show that synthpop delivers synthetic datasets of higher utility, by having smaller values than TGAN or CTABGAN. This suggests that all these measures work reasonably well for these real datasets, agreeing on data utility, though we stress again that only the DUPI can also measure the privacy risk and thus offers richer interpretations.
We split the DUPI numbers in Table VIII into UI and PI and present them in Table IX, with their graphical representation in Figure 5. The optimal values of both UI and PI are 0.8667, computed from (12) and (13) with DUPI_0. The maximum area under the plot is 0.7511 from (14). These benchmark numbers apply to all datasets by construction. For the synthetic Wine dataset, UI is smaller and PI is larger than the optimal value under all three synthesis methods (synthpop, TGAN and CTABGAN), confirming our earlier conclusion. In contrast, the synthetic German and Bank datasets under synthpop are well balanced. Clearly, for each real dataset, synthpop provides a better data synthesis than the other two methods, being located closer to the benchmark point on the curve.

VI. CONCLUSION
Measuring the data utility and privacy risk embedded in a synthetic or otherwise masked dataset is an important topic in the data privacy literature. Most existing measures focus on only one side of these two contradicting aspects, ignoring the other. This paper proposes a new measure, called the DUPI (Data Utility and Privacy Index), that can simultaneously evaluate data utility and privacy risk, accounting for their trade-off relationship. Based on the notion of a probabilistic distance between the synthetic and original datasets, the DUPI has several advantages over the other existing measures. First, as a global measure, it assesses utility and privacy at the dataset level, independently of specific queries. Second, no user parameter needs to be chosen, a choice that is often a non-trivial matter in other existing measures. Third, it is distribution-free and distance-free, so that it is easily applicable to both numerical and categorical datasets. We also introduce a graphical plot to visualize the trade-off between data privacy and utility.

Proof of Theorem 1
Proof: From (4) and (5), we know that the claimed equality holds when Y_m and X_n are from a univariate distribution F. Now assume that F is a multivariate distribution of any dimension. Even in this case, each d(x, c) for x ∈ X_n is still univariate, and the same holds for d(y, c) with each y ∈ Y_m. Therefore, when Y_m, X_n ∼ F, both d(x, c) and d(y, c) follow some common univariate distribution. As a result, we can repeat the above equality for d^{<k>}_{X_n}(c) and d^{<k>}_{Y_m}(c) to establish the result.

Proof of Lemma 1
Proof: It is a special case of Theorem 2 with k_1 = k_2 = k. Theorem 2 will be proven later.

Proof of Corollary 1
Proof: Corollary 1 is a special case of Theorem 1 with k = 1, from which the stated equality follows directly.

Proof of Corollary 2
Proof: We start with a relevant equation to be used later in this proof. From the fact that $(-1)^k\binom{-r}{k} = \binom{k+r-1}{k}$, we obtain a first equation. By the Chu-Vandermonde identity $\sum_{j=0}^{r}\binom{n}{j}\binom{m}{r-j} = \binom{n+m}{r}$, the second equality of this equation is transformed, and can then be expressed in a simpler form; when m = n, this reduces further. Next we split the left-hand side of the resulting equation into two pieces to get (17). Because the two terms on the left-hand side of (17) are the same by Lemma 1, (17) becomes (18). Therefore, using (18), we finally obtain the result.

Proof of Theorem 2
Proof: First, consider the event in question. We can rewrite it as a set of mutually exclusive conditions on Y_m(c) indexed by s = k_2, k_2 + 1, . . . , k_1 + k_2 − 1, and therefore express its probability as a sum over s. Now, noting that the summand can be written via (4) and (5), (19) becomes (20). Proving (20) is rather long and involves somewhat heavy combinatorics, so we divide it into two steps for convenience. In the first step, we prove a main combinatorial equation. The second step then proves (20) based on this main equation.
Step 1: We denote the following equation by E_t, a function of t. This equation holds for any t = 1, . . . , k_1 − 1, and proving this equality completes Step 1. For this we first rewrite the left side of E_t. By Pascal's rule $\binom{n}{r} + \binom{n}{r+1} = \binom{n+1}{r+1}$, we replace the corresponding terms. The left side of (26) is the same as the left side of (20), so we only need to show that the right side of (26) matches that of (20). Using the hockey-stick identity $\sum_{i=r}^{n}\binom{i}{r} = \binom{n+1}{r+1}$, we can rewrite the last term of (26), which transforms the right side of (26) into a new expression. This last expression is indeed the same as the right side of (20), because $\binom{s-1}{s-k_2} = \binom{s-1}{k_2-1}$ and $\binom{n+m-s}{n-s+k_2} = \binom{n+m-s}{m-k_2}$.
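The three binomial identities invoked in these proofs (Pascal's rule, the hockey-stick identity, and the Chu-Vandermonde identity from the proof of Corollary 2) can be spot-checked numerically:

```python
from math import comb  # comb(n, k) returns 0 when k > n

# Pascal's rule: C(n, r) + C(n, r+1) = C(n+1, r+1)
for n in range(1, 20):
    for r in range(n):
        assert comb(n, r) + comb(n, r + 1) == comb(n + 1, r + 1)

# Hockey-stick identity: sum_{i=r}^{n} C(i, r) = C(n+1, r+1)
for n in range(1, 20):
    for r in range(n + 1):
        assert sum(comb(i, r) for i in range(r, n + 1)) == comb(n + 1, r + 1)

# Chu-Vandermonde: sum_{j=0}^{r} C(n, j) C(m, r-j) = C(n+m, r)
for n in range(8):
    for m in range(8):
        for r in range(n + m + 1):
            assert sum(comb(n, j) * comb(m, r - j)
                       for j in range(r + 1)) == comb(n + m, r)
```

The checks exploit the convention comb(n, k) = 0 for k > n, which matches how the truncated binomial sums behave in the combinatorial arguments above.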

Proof of Lemma 2
Proof: We use the equality (16) from the proof of Corollary 2. First, substituting k = k_2 in (16), we obtain a chain of equalities, where the third equality comes from applying Theorem 1 with the original dataset size set at n − 1, reduced by 1 from the original size n.

Proof of Theorem 5
Proof: Let h(x) be a monotone and concave function, and invoke two inequalities. First, Jensen's inequality applied to h at the points x and t − x, for a constant t, bounds the average of h(x) and h(t − x) by h(t/2). Next, the arithmetic-geometric mean inequality bounds the geometric mean of h(x) and h(t − x) by their arithmetic mean. Combining the two inequalities, we obtain an upper bound on the product h(x)h(t − x), where the equalities hold if and only if h(x) = h(t − x). This condition is equivalent to x = t − x, or x = t/2, because h^{-1}(x) exists uniquely due to the monotonicity of h. Consequently, equality holds when x = t/2, in which case the product attains its maximum. Now consider the product of the UI and PI functions in (12) and (13):
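The elided displays in this proof can be reconstructed from the surrounding argument; a sketch of the chain of inequalities, assuming only concavity and monotonicity of h:

```latex
% Jensen's inequality for the concave h at x and t - x:
\frac{h(x) + h(t - x)}{2} \;\le\; h\!\left(\frac{x + (t - x)}{2}\right) = h\!\left(\frac{t}{2}\right).
% Arithmetic-geometric mean inequality:
\sqrt{h(x)\,h(t - x)} \;\le\; \frac{h(x) + h(t - x)}{2}.
% Combining the two:
h(x)\,h(t - x) \;\le\; h\!\left(\frac{t}{2}\right)^{2},
% with equality iff h(x) = h(t - x), i.e. x = t/2 by monotonicity of h.
```

Applied with the product UI · PI, this is what identifies the benchmark point as the maximizer of the rectangle area [0, UI] × [0, PI] claimed in Theorem 5.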