Standardised Metrics and Methods for Synthetic Tabular Data Evaluation

Synthetic Tabular Data Generation (STDG) is a potentially valuable technology with great promise to augment real data and preserve privacy. However, prior to adoption, an empirical assessment of synthetic tabular data (STD) is required across the three dimensions of resemblance, utility, and privacy, seeking a trade-off between them. The literature lacks standardised and objective metrics and methods for this assessment, and no organised pipeline or process for coordinating the evaluation has been identified. Therefore, in this work we propose a collection of metrics and methods to evaluate STD in the previously defined dimensions, presenting a meaningful orchestration of them and a pipeline unifying all of them. Additionally, we present a methodology to categorise the performance of STDG approaches for each dimension. Finally, we conducted an extensive analysis and evaluation to verify the usability of the proposed pipeline across six healthcare-related datasets, using four STDG approaches. The results of these analyses showed that the proposed pipeline can effectively be used to evaluate and benchmark the STD generated with one or more STDG approaches, helping the scientific community to select the most suitable approaches for their data and application of interest.


Introduction
We are in an era where data are increasingly being generated daily, motivating a paradigm shift from traditional and manual processes towards artificial intelligence (AI) applications in different contexts. However, many AI developments are being slowed down by data protection and regulation laws or imbalanced data. Therefore, research [...] (4) Based on the obtained evaluation results, we benchmark the STDG approaches used to generate STD and discuss the veracity and efficiency of the proposed STD evaluation metrics and methods. We demonstrate that the proposed pipeline can effectively be used to evaluate and benchmark different approaches for STDG, helping the scientific community to select the most suitable approaches for their data and application of interest.
The remainder of this article is organised as follows: Section 2 presents related work in STD generation and evaluation for the healthcare context. Section 3 describes the proposed pipeline of metrics and methods for STD evaluation. Section 4 presents the experimental results and they are discussed in Section 5. Section 6 states the conclusions of this work.

Related Work
This section presents some of the existing STDG approaches for tabular healthcare data (Section 2.1) and the most commonly used metrics and methods for evaluating STD dimensions (Section 2.2).

STDG Approaches
To generate STD in the healthcare context, many approaches can be found in the literature. The simplest STDG approaches include Gaussian Multivariate (GM) [8,9], Bayesian Networks (BN) [10,3,11], the Categorical maximum entropy model (CMEM) [12] and Movement-based kernel density estimation (MKDE) [13]. These approaches employ a statistical model to learn the multivariate distributions of the RD and sample a set of SD. They are typically used for small amounts of data and are not very scalable.
Due to the efficiency and popularity of Generative Adversarial Networks (GAN) in other areas and applications of healthcare, there is an interest in finding out whether they have promising potential to synthesise tabular healthcare data. GANs principally consist of two neural networks (a generator and a discriminator) that learn to generate high quality STD through an adversarial training process. This approach has been used by several different authors, presenting improvements, tuning some hyper-parameters, or adding new features. While ehrGAN [14,2] and medGAN [15,8,16,17,2] were proposed to synthesise mainly numerical and binary data, the Wasserstein GAN (WGAN) with Gradient Penalty [18,16,17,19,2,8], healthGAN [8] and Conditional Tabular GAN (CTGAN) [20] synthesise numerical, binary and categorical data efficiently.
Furthermore, the Synthetic Data Vault (SDV) consists of an ensemble approach which combines several probabilistic graphical modelling and Deep Learning (DL) based techniques [21].

STD evaluation metrics and methods
The metrics and methods used to evaluate the resemblance, utility and privacy of STD in the literature are diverse. Most of the studies related to STDG in the healthcare context evaluate the resemblance and utility dimensions, but only a few evaluate the privacy dimension. The most relevant metrics and methods for STD reported in the literature are now presented.

Resemblance evaluation
The first step in resemblance evaluation is to analyse whether the distribution of SD attributes is equivalent to the distribution of the RD. Che et al. [14] and Chin-Cheong et al. [19] compared the distributions of the attributes of RD against STD. Yang et al. [2] compared the frequency of the attributes. Additionally, Choi et al. [15], Wang et al. [22], Abay et al. [11], Baowaly et al. [16] and Yale et al. [8] compared the dimensional probability or probability distributions of RD and STD.
For distributions comparison, Yang et al. [2] and Rashidian et al. [18] analysed the mean absolute error (MAE) between the mean and standard deviation values of RD and STD. Some authors also use statistical tests to analyse the univariate resemblance of SD. Baowaly et al. [16] used Kolmogorov-Smirnov (KS) tests to compare distributions, Dash et al. [23] applied Welch t-tests to compare the mean values of the attributes, and Yoon et al. [17] performed Student t-tests to compare mean values of the attributes and Chi-squared tests to compare the independence of categorical attributes. In these studies, they analysed the p-values obtained from the statistical tests to determine whether the STD attributes preserved the properties of the RD attributes.
In the evaluation of multivariate relationships, Rankin et al. [3], Yale et al. [8], Wang et al. [22] and Rashidian et al. [18] visually compared the Pairwise Pearson Correlation (PPC) matrices to see whether correlations between attributes of RD are maintained in STD. Additionally, Principal Component Analysis (PCA) transformation has been used by Yale et al. [8] to compare the dimensional properties of STD and RD.
To analyse if the semantics or significance of RD is maintained in STD, Choi et al. [15], Wang et al. [22], Beaulieu-Jones et al. [24] and Lee et al. [25] asked clinical experts to evaluate STD qualitatively, giving a score between 1 and 10. This score indicated how real the STD records appeared to them, where a score of 10 is most realistic. Another method that can be used if access to clinical experts is not available is to train some ML classifiers to label records as real or synthetic, as Lee et al. [25] proposed in their study.

Utility evaluation
The evaluation of the utility dimension has mainly been performed using STD in ML models, by training and analysing the performance of these models.
Train on Real Test on Real (TRTR) and Train on Synthetic Test on Real (TSTR) methods were used by Park et al. [26], Wang et al. [22], Beaulieu-Jones et al. [24], Chin-Cheong et al. [19], Baowaly et al. [16] and Rashidian et al. [18]. These authors trained ML models with RD and with STD separately and then tested them with held-out RD not used for training. They use different classification metrics (accuracy, F1-score, ROC, AUC-ROC, etc.) to evaluate and analyse the differences in the models' performance when training the models with RD and with STD.
On the other hand, Che et al. [14], Wang et al. [22] and Yang et al. [2] augmented the training set of RD with STD. The authors analysed whether the trained ML models show only a slight difference when training only with RD versus training with mixed RD and STD.

Privacy evaluation
The few metrics and methods that authors have used for privacy assessment of STD are based on distance and similarity metrics and re-identification risk evaluation.
Regarding distance and similarity based metrics, Park et al. [26] used the distance to the closest record (DCR), computing the pairwise Euclidean distance between real and synthetic records, where the closer the mean distance value is to 0, the more the privacy is preserved. According to Norgaard et al. [27], the maximum real to synthetic (RTS) similarity value, computed with the cosine similarity, indicates whether the model has memorised and stored RD, i.e., whether it is really generating data rather than copying it. Other distance based metrics, used by Yoon et al. [17], are the Jensen-Shannon divergence (JSD) and the Wasserstein distance. The authors used them to compute the balance between the identifiability and quality of STD.
To assess the re-identification disclosure risk of RD through STD, a number of disclosure simulation attacks have been proposed in the literature. On the one hand, Choi et al. [15], Park et al. [26], Yale et al. [28] and Mendelevitch et al. [29] simulated membership inference attacks to analyse the disclosure risk of a complete record in RD, by computing distance metrics between RD and STD records and using accuracy and precision metrics to quantify the membership risk. On the other hand, Choi et al. [15] and Mendelevitch et al. [29] additionally simulated attribute inference attacks to quantify the disclosure risk of some attributes of the dataset. Defining quasi-identifier (QID) attributes and training some ML models with STD to predict the rest of the attributes, they analyse how accurately an attacker could predict some RD attributes if they obtained access to STD.

STD Evaluation Metrics and Methods
The proposed metrics and methods for the evaluation of STD can be clustered into three different dimensions: resemblance, utility, and privacy. Within each dimension, different metrics and methods from the literature have been selected and configured in an organised way. Figure 1 shows a taxonomy of the selected methods, which can be used within the defined pipeline to evaluate STD generated with one STDG approach or to compare the STD generated by different STDG approaches. This pipeline was developed due to the lack of a complete STD evaluation method that covers the resemblance, utility, and privacy dimensions. Although the included metrics and methods are not new, their orchestration in an organised way and the calculation of overall scores for each dimension are the novelty of this study. Furthermore, to the best of our knowledge this work is the first attempt to propose a complete and universal STD evaluation process that is generalisable to any kind of STD, since the metrics and methods have been selected according to the most used and standardised STD evaluation metrics and methods reported in the literature.

Resemblance evaluation
In the resemblance dimension, the capacity of SD to represent RD is evaluated. Statistical, distribution and interpretability characteristics are analysed using four analyses: univariate resemblance analysis (URA), multivariate relationships analysis (MRA), dimensional resemblance analysis (DRA) and data labelling analysis (DLA).

Univariate Resemblance Analysis (URA)
This analysis proposes to analyse the attributes of RD and SD independently, to see if the univariate statistical characteristics of RD are preserved in SD. Statistical tests, distance calculations and visual comparisons are proposed.
Statistical tests can be used to compare the attributes of RD and SD. They should be performed independently for each attribute with a proposed significance level of α = 0.05, meaning that if the p-value obtained from the test is higher than this value, the null hypothesis (h0) is accepted; otherwise, the alternative hypothesis (h1) is accepted. The properties analysed in each test are preserved in SD if h0 is accepted. For numerical attributes, the following tests are proposed:
• Student t-test for the comparison of means.
h0: Means of the RD attribute and SD attribute are equal.
h1: Means of the RD attribute and SD attribute are different.
• Mann-Whitney U-test for population comparison.
h0: The RD attribute and SD attribute come from the same population.
h1: The RD attribute and SD attribute do not come from the same population.
• Kolmogorov-Smirnov test for distributions comparison.
h0: The RD attribute distribution and SD attribute distribution are equal.
h1: The RD attribute distribution and SD attribute distribution are not equal.
For categorical attributes, the Chi-squared test (χ2 test) is proposed to analyse the independence between real and synthetic categorical attributes. In this case, if h0 is accepted, statistical properties are not preserved. h0 and h1 are defined as:
• h0: There is no statistical relationship between the real categorical variable and the synthetic categorical variable.
• h1: There is a statistical relationship between the real categorical variable and the synthetic categorical variable.
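As an illustration, the URA statistical tests above can be run with SciPy. This is a minimal sketch: the function name is ours, and for the categorical case the Chi-squared test is interpreted here as a test on a table of value counts of the two variables, which is one possible reading of the description above.

```python
import numpy as np
from scipy import stats

def ura_statistical_tests(real_col, synth_col, categorical=False, alpha=0.05):
    """Return {test_name: h0_accepted} for one attribute pair."""
    if categorical:
        # Build a 2 x n_categories table of value counts and test it.
        cats = sorted(set(real_col) | set(synth_col))
        table = np.array([[list(real_col).count(c) for c in cats],
                          [list(synth_col).count(c) for c in cats]])
        _, p_value, _, _ = stats.chi2_contingency(table)
        return {"chi_squared": p_value > alpha}
    return {
        "student_t": stats.ttest_ind(real_col, synth_col).pvalue > alpha,
        "mann_whitney_u": stats.mannwhitneyu(real_col, synth_col).pvalue > alpha,
        "kolmogorov_smirnov": stats.ks_2samp(real_col, synth_col).pvalue > alpha,
    }
```

Each returned boolean states whether h0 was accepted at α = 0.05 for that test.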
Some distance metrics can also be computed between the RD and SD attributes for URA. The lower the distance values are, the better the univariate resemblance is preserved in SD. In total, three distance metrics are proposed: cosine distance, Jensen-Shannon distance and Wasserstein distance. Before computing the distances, RD and SD need to be scaled. In the following equations (1)-(5), r is the attribute of RD and s is the attribute of SD.
Cosine distance is defined using the cosine similarity, which is the cosine of the angle between two n-dimensional vectors in n-dimensional space; the dot product of the two vectors is divided by the product of the two vectors' lengths (Equation 1). Using this distance, a threshold of 0.3 has been proposed to indicate that the SD attribute resembles the RD attribute.
Jensen-Shannon distance is the square root of the Jensen-Shannon divergence, which measures the similarity between two probability distributions (Equation 2; m is the pointwise mean of p and q and D is the Kullback-Leibler divergence, defined in Equation 3). To compute these distances the probability distributions of the attributes have been used: p is the probability distribution of the RD attribute and q is the probability distribution of the SD attribute. A value lower than 0.1 represents perfect resemblance.
Wasserstein distance can be seen as the minimum amount of work required to transform a vector (r) into another vector (s), where the work is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved (Equation 4; R and S are the cumulative distribution functions of the RD and SD attributes respectively).
As with the cosine distance, a threshold of 0.3 is proposed to assure the resemblance of the attribute.
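The three proposed distances are all available in SciPy. The following is a minimal sketch: the helper name and the histogram binning used to obtain the probability distributions p and q are illustrative assumptions, and the attributes are assumed to be equal-length, already scaled vectors.

```python
import numpy as np
from scipy.spatial.distance import cosine, jensenshannon
from scipy.stats import wasserstein_distance

def ura_distances(r, s, bins=20):
    """Cosine, Jensen-Shannon and Wasserstein distances for one attribute pair."""
    # Histogram both attributes over a shared range to obtain p and q.
    lo, hi = min(r.min(), s.min()), max(r.max(), s.max())
    p, _ = np.histogram(r, bins=bins, range=(lo, hi))
    q, _ = np.histogram(s, bins=bins, range=(lo, hi))
    return {
        "cosine": cosine(r, s),                         # proposed threshold: 0.3
        "jensen_shannon": jensenshannon(p, q, base=2),  # proposed threshold: 0.1
        "wasserstein": wasserstein_distance(r, s),      # proposed threshold: 0.3
    }
```

`jensenshannon` normalises the histograms internally, so raw counts can be passed as p and q.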
For both the statistical tests and the distance calculations, the number of SD attributes that fulfil the requisite to resemble the RD attributes has been considered to categorise the STDG approach performance. If more than half of the attributes maintain resemblance, the approach is classified as "Excellent"; if less than half maintain resemblance, it is classified as "Good"; and if none of the attributes fulfil it, as "Poor".
Finally, in the last step of URA, it is proposed to visually compare the values of each attribute (RD vs. SD). For numerical and categorical attributes, distribution plots and histograms can be used, respectively. The STDG approaches have been categorised depending on the number of attributes that maintain resemblance with the RD attributes.
To get an overall URA score, the performances for all previously presented metrics can be combined according to the results obtained from them. For that, first each categorisation is translated to a numerical value ("Excellent" = 3, "Good" = 2 and "Poor" = 1), and then the same weight (0.25) is given to the four methods presented (statistical tests of numerical attributes, statistical tests of categorical attributes, distance calculations and visual comparisons). The resulting score, rounded to the nearest integer, gives a value between 1 and 3 that indicates the URA score of the STD.
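The overall URA score described above reduces to a small weighted average. A sketch, assuming each of the four methods has already been categorised (the function name is ours):

```python
def ura_score(stat_numerical, stat_categorical, distances, visual):
    """Overall URA score: equal 0.25 weights, rounded to the nearest integer."""
    levels = {"Excellent": 3, "Good": 2, "Poor": 1}
    total = sum(0.25 * levels[c] for c in
                (stat_numerical, stat_categorical, distances, visual))
    return round(total)
```

For example, three "Excellent" categorisations and one "Good" yield 0.25 × 11 = 2.75, which rounds to a URA score of 3.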

Multivariate Relationship Analysis (MRA)
This analysis proposes to analyse whether the multivariate relationships of RD are preserved in SD. To do that, the computation of two correlation matrices is defined for RD and SD: Pairwise Pearson Correlation (PPC) matrices for numerical variables (Equation 5 for each pair of attributes x and y; x̄ and ȳ are the mean values of the attributes) and normalised contingency tables for categorical variables. The correlation matrices of RD can be visually compared with the matrices of SD using heatmaps.
Additionally, for each matrix, the differences between the correlations of RD and SD are calculated, to then compute the percentage of relationships maintained in SD (those that have a difference value lower than 0.1). This way, two values between 0 and 1 are obtained, one expressing the percentage of numerical attribute relationships maintained and the other the percentage of categorical attribute relationships. If the values are higher than 0.6, the STDG approach is categorised as "Excellent", since more than half of the relationships are preserved in SD. If they are between 0.4 and 0.6 (inclusive), it is categorised as "Good", representing that about half of the relationships are preserved. Finally, if the values are lower than 0.4, the performance of the STDG approach is "Poor", as it has preserved less than half of the relationships. After doing this categorisation for each matrix, a total MRA performance is obtained by calculating the mean performance of both analyses.
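A sketch of the MRA categorisation for one pair of correlation matrices, using the thresholds above (the function name is ours, and counting only off-diagonal entries is an assumption, since the diagonal carries no relationship information):

```python
import numpy as np

def mra_categorise(real_corr, synth_corr, diff_threshold=0.1):
    """Fraction of off-diagonal correlations preserved, mapped to the bands."""
    diff = np.abs(np.asarray(real_corr) - np.asarray(synth_corr))
    off_diag = ~np.eye(diff.shape[0], dtype=bool)
    preserved = float(np.mean(diff[off_diag] < diff_threshold))
    if preserved > 0.6:
        return preserved, "Excellent"
    if preserved >= 0.4:
        return preserved, "Good"
    return preserved, "Poor"
```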

Dimensional Resemblance Analysis (DRA)
To analyse if the dimensional properties of RD are preserved in SD, it is proposed to analyse the performance of a linear (PCA) and a non-linear (Isomap) dimensionality reduction method on RD and SD. For both methods, the transformation should be computed independently for RD and SD, after scaling the numerical attributes and one-hot encoding the categorical attributes. After computing the transformations, the results can be visually analysed with scatter plots for each method: the more similar the shapes of the RD and SD plots, the more resemblance is maintained.
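With scikit-learn, the two proposed transformations can be computed independently on RD and SD as described. A sketch, where the helper name is ours and the inputs are assumed to be already scaled and one-hot encoded:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

def dra_projections(real, synth, n_components=2):
    """Fit PCA and Isomap independently on RD and SD; return the projections."""
    out = {}
    for name, Reducer in (("pca", PCA), ("isomap", Isomap)):
        out[name] = (Reducer(n_components=n_components).fit_transform(real),
                     Reducer(n_components=n_components).fit_transform(synth))
    return out
```

The two projections per method can then be drawn as the scatter plots described above.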
Figure 2 shows examples that indicate "Excellent", "Good" and "Poor" resemblance respectively. Additionally, to complement the visual results with a numerical value, a distance metric between the RD and SD dimensionality reduction plots is proposed. This distance metric is the joint distance of the barycentre distance and the spread distance of both plots (Equation 6). The barycentre distance is the distance between the mean values of the RD and SD dimensionality reduction matrices, while the spread distance is the distance between the standard deviation values of the same matrices. α is a regularisation parameter that gives different weights to each of the two distances.
As this distance metric cannot be normalised for comparison, no methodology has been defined to classify the resemblance of STD generated with one or more STDG approaches into "Excellent", "Good" and "Poor". Nevertheless, the lower this distance value is, the more similar the dimensionality reduction plots of RD and SD are. For this reason, this analysis has not been considered for the total resemblance calculation, but the results of both the plots and the distance metric (calculated with α = 0.05) are provided and discussed in the supplementary material.
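Equation 6 itself is not reproduced here; the sketch below assumes the joint distance is a convex combination of the barycentre and spread distances weighted by α, which is our reading of the description above, not the paper's exact formula.

```python
import numpy as np

def dra_distance(real_proj, synth_proj, alpha=0.05):
    """Joint distance: barycentre distance plus alpha-weighted spread distance."""
    barycentre = np.linalg.norm(real_proj.mean(axis=0) - synth_proj.mean(axis=0))
    spread = np.linalg.norm(real_proj.std(axis=0) - synth_proj.std(axis=0))
    return (1 - alpha) * barycentre + alpha * spread
```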

Data Labelling Analysis (DLA)
The final step proposed for resemblance evaluation can be used to evaluate the semantics of SD. For this method, it is proposed to analyse the performance of some classifiers when labelling records as real or synthetic. Firstly, the real and synthetic datasets should be mixed and labelled (0 for RD and 1 for SD) in a single dataset. Secondly, this dataset must be split into train and test sets, e.g. an 80:20 split. Thirdly, the train and test data should be pre-processed, i.e. numerical attributes standardised and categorical attributes one-hot encoded. Finally, some ML classifiers can be trained (with the training data) and evaluated (with the test data) to analyse their performance in labelling records as real or synthetic. The following commonly used and diverse ML classifiers are proposed for these analyses, with the specified parameter values:
• RF (n_estimators = 100, random_state = 9, verbose = True, n_jobs = 3)
• KNN (n_neighbors = 10, n_jobs = 3)
• DT (random_state = 9)
• SVM (C = 100, max_iter = 300, kernel = "linear", probability = True, random_state = 9, verbose = 1)
• MLP (hidden_layer_sizes = (128, 64, 32), max_iter = 300, random_state = 9, verbose = 1)
After training and testing the models, the classification performance metrics (accuracy, precision, recall and F1-score) can be analysed visually with box plots. To indicate that the semantics of RD are preserved in SD, a classifier should not be able to distinguish whether a record is synthetic or real. Thus, the classification metrics should be lower than or equal to 0.6 for "Excellent" resemblance, meaning that the models have classified most of the synthetic records as real. Metric values higher than 0.6 and lower than 0.8 indicate "Good" resemblance, while values equal to or greater than 0.8 indicate "Poor" resemblance.
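A condensed DLA sketch with one of the proposed classifiers (RF) and accuracy only; the other four classifiers and the remaining metrics would be added the same way, and the band thresholds follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def dla_accuracy(real, synth, seed=9):
    """Real-vs-synthetic classification; lower accuracy = better resemblance."""
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    if acc <= 0.6:
        return acc, "Excellent"
    return (acc, "Good") if acc < 0.8 else (acc, "Poor")
```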

Total resemblance performance
After completing the four analyses proposed for resemblance evaluation, a total resemblance score can be calculated by weighting the different analyses and giving more importance to the one of interest. In our experimental work, DRA has been excluded due to the difficulty of comparing the proposed metric, and the following weights have been applied to the other analyses: URA = 0.4, MRA = 0.4 and DLA = 0.2. Thus, according to the total score obtained when applying these weights, the resemblance has been categorised as "Excellent" if the resulting value is 3, "Good" if it is 2 or "Poor" if it is 1.

Utility evaluation
In the utility dimension, the ability of SD, instead of RD, to train ML models is analysed, to determine if ML models trained with SD produce similar results to ML models trained with RD. To do that, TRTR and TSTR analyses are proposed.
A number of ML classifiers should be trained with RD and, separately, with SD. The same ML classifiers as for the DLA are proposed for this analysis due to their simplicity, scalability, and training efficiency.
Before training and testing the models, the data must be pre-processed, i.e. numerical attributes standardised and categorical attributes one-hot encoded. All trained models should be tested with the same RD (20% of the real dataset held out before training the STDG approaches). To analyse the classification results, the accuracy, precision, recall and F1-score classification metrics are proposed, as well as their absolute differences between TRTR and TSTR. To assure that SD utility is "Excellent", the metric differences should not exceed a proposed threshold of 0.4. If differences between 0.4 and 0.8 are obtained, the performance in the utility dimension should be categorised as "Good", and if they are higher than 0.8, the performance is "Poor".
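The TRTR/TSTR comparison can be sketched as follows, shown with a single RF classifier and accuracy only; the full pipeline uses the five classifiers and four metrics listed above, and the function name is ours.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def trtr_tstr(X_real, y_real, X_synth, y_synth, X_test, y_test):
    """Train one model on RD and one on SD, test both on held-out RD."""
    model_r = RandomForestClassifier(random_state=9).fit(X_real, y_real)
    model_s = RandomForestClassifier(random_state=9).fit(X_synth, y_synth)
    trtr = accuracy_score(y_test, model_r.predict(X_test))
    tstr = accuracy_score(y_test, model_s.predict(X_test))
    # "diff" is compared against the 0.4 / 0.8 utility thresholds.
    return {"trtr": trtr, "tstr": tstr, "diff": abs(trtr - tstr)}
```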

Privacy evaluation
For the privacy dimension, it is proposed to evaluate the similarity of RD and SD and the re-identification risk of real patients or records.

Similarity Evaluation Analysis (SEA)
In the SEA it is proposed to evaluate how private SD is when compared to RD. Three metrics are proposed for this: the Euclidean distance between each pair of records, the Hausdorff distance between SD and RD, and the Synthetic To Real (STR) similarity.
The Euclidean distance is defined as the square root of the sum of squares of the differences between RD features and SD features, as defined in Equation 7. In this case, the Euclidean distance can be computed for each pair of records. Then, the mean and standard deviation of all distances should be analysed. The higher the mean distance and the lower the standard deviation are, the more the privacy is preserved. Thus, mean values higher than 0.8 and standard deviation values lower than or equal to 0.3 indicate that privacy is preserved.
To compute the STR similarity, the cosine similarity metric is proposed, which computes similarity as the normalised dot product of two datasets (Equation 8; R is a record from RD and S is a record from SD). The pairwise similarity value can be computed for each pair of records, and the mean and maximum values of those pairwise similarity values should be analysed. If the mean value is higher than 0.5, the SD is very close to the RD, so the privacy is not preserved. In all other cases, it can be said that privacy is preserved.
The Hausdorff distance measures how far two subsets of a metric space are from each other, as it is the greatest of all the distances from a point in one set to the closest point in the other set (Equation 9; R is the real dataset, S is the synthetic dataset and d(s_i, r_i) is a metric between points s_i and r_i, in this case the Euclidean distance). Two sets are considered to be close in the Hausdorff distance if every point of either set is close to some point in the other set. Thus, the higher this distance value is, the better the privacy is preserved in SD, as a high value indicates that SD is far from RD. Since this metric is not bounded between 0 and 1, a value higher than 1 has been considered to assure that the privacy is preserved.

haus_dist(S, R) = max{h(S, R), h(R, S)}     (9)

For these three distance based metrics, if all three fulfil the condition to preserve privacy, the categorisation for privacy preservation is "Excellent". If one or two of the metrics fulfil the condition, it is "Good", and in all other cases it is "Poor".
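The three SEA metrics can be computed with SciPy as follows. This is a sketch: the function name is ours and the inputs are assumed to be scaled numeric matrices with no zero rows.

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def sea_metrics(real, synth):
    """Pairwise Euclidean distances, STR cosine similarity, Hausdorff distance."""
    dist = cdist(real, synth)                            # Euclidean by default
    cos_sim = 1.0 - cdist(real, synth, metric="cosine")  # similarity = 1 - distance
    # Symmetric Hausdorff distance, Equation 9.
    hausdorff = max(directed_hausdorff(synth, real)[0],
                    directed_hausdorff(real, synth)[0])
    return {"mean_dist": dist.mean(), "std_dist": dist.std(),
            "mean_str": cos_sim.mean(), "max_str": cos_sim.max(),
            "hausdorff": hausdorff}
```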

Re-Identification Risk Analysis (RIRA)
In this analysis, it is proposed to evaluate the level of disclosure risk if an attacker or adversary obtains access to SD and a subset of RD. For this, two simulations are proposed: (1) a membership inference attack (MIA), where an attacker tries to identify whether real patient records have been used to train the STDG approach (Figure 3), and (2) an attribute inference attack (AIA), where an attacker has access to some attributes of the RD and tries to guess the value of an unknown attribute of a patient from the SD (Figure 4) [29].

Membership Inference Attack (MIA)
In a MIA, if the attacker determines that real records were used to train the STDG approach, it could be said that they have re-identified the patient from the SD. Figure 3 illustrates this attack, where a hypothetical attacker has access to all records of the SD and a randomly distributed subset of the RD. Using a patient record (r) from the RD subset, the attacker will try to identify the closest records in the SD with a distance metric calculation. If there is any distance lower than some threshold, the attacker determines that there is at least one row in the SD close enough to the RD, meaning that r has been used to generate SD. The success rate of the attacker is proposed to be evaluated by simulating this kind of attack, calculating the Hamming distance (proportion of non-equal attributes between two records) between each row of the RD subset and the SD rows. Reasonable thresholds to assure that r is close enough to a record in SD are 0.4, 0.3, 0.2 and 0.1. Since it is known whether r belongs to the training data or not, accuracy (proportion of correct predictions made by the attacker) and precision (proportion of records used for training the STDG approach that were identified as such by the attacker) values have been calculated. Accuracy and precision values should be 0.5 or lower for all thresholds to obtain an "Excellent" privacy preservation categorisation. Any value above 0.5 indicates increasing levels of disclosure risk, obtaining a "Good" or "Poor" privacy preservation categorisation depending on the number of thresholds where the values are higher than 0.5. To analyse and interpret these results, the accuracy and precision values have been plotted as a function of the proportion of records in the SD that are present in the RD subset known by the attacker, for each threshold.
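A minimal MIA simulation under one threshold, following the description above. The names are ours, and the records are assumed to be encoded so that attribute-wise inequality (the basis of the Hamming distance) is meaningful.

```python
import numpy as np

def mia_success(real_subset, member_flags, synth, threshold=0.3):
    """Attacker claims membership when some synthetic record is within the
    Hamming-distance threshold; returns attacker accuracy and precision."""
    claims = np.array([(synth != r).mean(axis=1).min() < threshold
                       for r in real_subset])
    flags = np.asarray(member_flags, dtype=bool)
    accuracy = float((claims == flags).mean())
    precision = float((claims & flags).sum() / max(claims.sum(), 1))
    return accuracy, precision
```

Values of 0.5 or lower across all thresholds correspond to the "Excellent" privacy categorisation above.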

Attribute Inference Attack (AIA)
In an AIA, an attacker has access to the SD and to a subset of attributes for some RD records (generally QIDs such as age, gender, height, weight, etc.). As shown in Figure 4, the attacker will use ML models trained with SD to predict the values of the rest of the attributes of the RD records. The success of this attack can be measured by defining the QIDs of each dataset and then using them to train ML models, e.g. Decision Tree (DT) models, with SD to predict the rest of the attributes. Next, 50% of the RD (randomly distributed) can be used to evaluate the performance of the models, generating batches of data with each QID combination. This way, the predictions made by the models trained on SD for each data batch combination can be evaluated using accuracy for categorical attributes and the Root-Mean-Squared-Error (RMSE) for numerical attributes. Higher accuracy values (close to 1) indicate higher disclosure risk, whilst lower RMSE values (close to 0) reflect higher disclosure risk. The metric values obtained for each QID combination and all attributes are evaluated visually with a box plot for each risk attribute. Additionally, the percentage of correctly predicted attributes has been calculated to categorise the re-identification risk. The mode of the analysed metrics has been considered to determine if an attribute has been predicted: for categorical attributes an accuracy mode equal to 1, and for numerical attributes an RMSE mode equal to 0. If a high percentage (higher than 0.6) is obtained, as more than half of the attributes have been re-identified, the AIA result is categorised as "Poor". For a percentage between 0.4 and 0.6, the result is categorised as "Good", since approximately half of the attributes have been re-identified. A percentage lower than 0.4 indicates that less than half of the attributes have been re-identified, demonstrating "Excellent" privacy preservation for this attack.
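A minimal AIA sketch for one categorical target attribute, using the DT model mentioned above; the names and the single-attribute simplification are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def aia_accuracy(synth_qid, synth_target, real_qid, real_target):
    """DT trained on SD QIDs predicts a hidden categorical attribute of RD;
    accuracy close to 1 means high disclosure risk."""
    model = DecisionTreeClassifier(random_state=9).fit(synth_qid, synth_target)
    predictions = model.predict(real_qid)
    return float((predictions == np.asarray(real_target)).mean())
```

The same loop would be repeated per attribute and per QID combination, with RMSE replacing accuracy for numerical targets.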

Total privacy performance
After computing the results of the three privacy evaluation methods proposed, a total privacy score should be calculated by weighting the results from all of them. In this study, the following weights have been given to the three analyses for the privacy dimension: SEA = 0.4, MIA = 0.3 and AIA = 0.3. Therefore, the privacy of SD has been categorised as "Excellent" (3), "Good" (2) or "Poor" (1) based on the total score obtained when applying these weights.

STD Evaluation
To demonstrate the efficiency and usability of the proposed pipeline for STD evaluation, a number of different datasets have been selected and synthesised with different STDG approaches. The STD generated for the selected datasets is then evaluated and benchmarked.

Selected data
Six open-source healthcare-related datasets have been selected for synthesis. A brief description of these datasets, together with the data types and the number of attributes and records, is presented in Table 1. Each dataset has first been pre-processed, deleting missing values and splitting the data into two subsets: 80% of the records have been used for training the STDG approaches and 20% of the records for utility dimension evaluation and RIRA simulations.

STDG approaches
To generate STD using the previously presented datasets four STDG approaches have been used, two of which are GANs and the other two are classical approaches.These approaches are as follows:

GM
A classical STDG approach based on statistical modelling that implements a multivariate distribution by using a Gaussian Copula to combine marginal probabilities estimated using univariate distributions [36].

SDV
This approach is an STDG ecosystem of libraries that uses several probabilistic graphical modelling and DL-based techniques. To enable a variety of data storage structures, it employs unique hierarchical generative modelling and recursive sampling techniques [37].

CTGAN
An STDG approach proposed by Xu et al. in 2019 [20], defined as a collection of DL models based on GANs for single data tables. It can learn from real data and generate synthetic clones with high fidelity.

WGANGP
A GAN proposed by Yale et al. in 2020 [8], composed of a generator and a discriminator. The generator learns to produce better SD based on the feedback received from the discriminator, using the Wasserstein distance with gradient penalty as the optimisation function.

Results
Using the datasets and STDG approaches described in Sections 4.1 and 4.2, the metrics and methods proposed in Section 3 for STD evaluation in the resemblance, utility and privacy dimensions have been applied to the generated STD, to demonstrate their efficiency and usability. Additionally, a comparison and benchmarking of the STDG approaches has been performed based on the proposed strategy for STD evaluation.
Table 2 shows the results obtained from applying the proposed STD evaluation pipeline to the STD synthesised with each STDG approach for each dataset. The detailed and complete description of all results is available in the Supplementary Material. Based on the results of each evaluation analysis in the pipeline, the performance of each synthesised dataset has been categorised as "Excellent", "Good" or "Poor" for each dimension (resemblance, utility and privacy).
The resemblance dimension has been perfectly maintained with GM for five of the six datasets (A, B, D, E and F), whilst for Dataset C, as with the other STDG approaches, it has been fairly well maintained. SDV has resembled the RD excellently only for Dataset E and fairly well for the other datasets (A, B, C, D and F). CTGAN has generally not performed well in retaining resemblance, with acceptable results for three datasets (A, B and C) and poor results for the other three (D, E and F). Finally, WGANGP has been the worst approach at resembling the RD, doing so fairly well for only two datasets (D and E) and poorly for the other four (A, B, C and F).
The utility dimension has been perfectly maintained with all approaches for five datasets. For Dataset C, the utility of the STD has been poorly maintained with SDV and CTGAN, whilst with GM and WGANGP it has been maintained, although not perfectly.
The privacy of the STD has been fairly well maintained for four datasets (A, D, E and F) with all STDG approaches. For Datasets B and C, privacy preservation has been poor for the STD generated with GM and CTGAN, but with SDV and WGANGP it has been maintained fairly well. Hence, under the proposed privacy dimension evaluation metrics and methods, none of the STDG approaches used has achieved perfect privacy preservation for any dataset, although in most cases privacy has been maintained to a certain extent.

Discussion
Overall, the results have shown that the proposed pipeline for STD evaluation in the three defined dimensions (resemblance, utility and privacy) can be used to assess and benchmark STD generated with different approaches. In the experiments carried out, no STDG approach was better than the others across all of the STD dimensions considered. Therefore, it can be said that it is difficult to achieve a trade-off between the resemblance, utility and privacy scores. However, the categorisation system provided and the overall STD score calculated for each dimension can help to select the most appropriate STDG approaches, by looking at individual dimension scores and computing weighted overall scores according to the priorities defined for each specific application.
A collection of different metrics and methods has been proposed to evaluate the resemblance of SD at different levels: univariate, multivariate, dimensional and semantic. Among the metrics used in the URA, the statistical tests and distance calculations, which provide quantitative results, have been more trustworthy for assessing how well STD attributes resemble the RD than visual comparisons of the attribute distributions, which provide a more qualitative view. The MRA has proved useful to analyse whether the STD maintains the multivariate relationships between attributes, and the DRA is useful to see how well the dimensional properties of the RD are preserved. The DLA has been less effective due to the lack of medical specialists. However, the analysis developed, composed of different ML models trained to label records as real or synthetic, has simplified this process, giving an approximation of how a medical expert would label the records.
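Such a discriminator-based labelling analysis can be sketched as below; the toy feature distributions and the choice of a random forest are illustrative assumptions, not the paper's exact setup. Discriminator accuracy near 0.5 indicates the classifier cannot tell real from synthetic records, i.e. good resemblance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy data: real records and a synthetic set drawn from a similar distribution.
real = rng.normal(0.0, 1.0, size=(300, 5))
synthetic = rng.normal(0.1, 1.0, size=(300, 5))  # slightly shifted, illustrative

X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))  # 0 = real, 1 = synthetic

# Cross-validated accuracy of a classifier trained to separate the two sets:
# values near 0.5 mean the synthetic records are hard to distinguish.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
accuracy = cross_val_score(clf, X, y, cv=5).mean()
print(f"Discriminator accuracy: {accuracy:.2f}")
```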
Regarding the privacy of SD, the similarity between SD and RD has first been evaluated in the SEA, and then a pair of data inference attacks (MIA and AIA) have been simulated in the RIRA. Although these simulations are relatively simple, they can be useful to estimate the re-identification risk of SD. However, these metrics and methods must be improved to quantify the re-identification risk of SD more reliably. Thus, future work might include the definition of a strategy that helps identify which attributes are more prone to re-identification and considers the real consequences of potential personal data disclosure.
Although the proposed STD evaluation pipeline has been used to evaluate STD generated with different approaches and in different contexts, the datasets used for the evaluation have been limited due to the lack of quality health-related open-source datasets. Of the six datasets selected, only two (A and B) comprised enough records to be considered representative of real health-related data, with the remaining four containing a limited number of entries. Moreover, these open-source datasets might have been anonymised or synthesised before, introducing new bias into the STD and the analysis. Therefore, further work is required to judge and benchmark the proposed pipeline with more datasets, in other contexts, and with RD that comes directly from hospitals or laboratories without any anonymisation or other modification processes applied to it. Furthermore, the proposed methodology for STD evaluation and the obtained results must be compared with the methodologies that other authors have followed to evaluate STD quality.
Another important finding from this work is the lack of a trade-off between resemblance, utility, and privacy dimensions in STD generated with the different approaches in the evaluation section.Thus, further work on improving the STDG approaches to generate more quality STD that maintain a trade-off between the defined and analysed dimensions is foreseen.
Apart from that, metrics and methods should also be proposed and developed to evaluate the performance of the STDG approaches themselves in terms of training time and footprint. This way, STDG approaches could be compared by analysing the computational resources they require. Furthermore, a web-based application could be developed unifying the proposed metrics and methods, to help researchers working on STDG to evaluate the generated SD. This tool could let them focus on improving or proposing STDG approaches without having to invest time in defining and developing an SD evaluation process.

Conclusion and Future Work
In this work, we proposed a comprehensive and universal STD evaluation pipeline covering the resemblance, utility and privacy dimensions, together with a methodology to categorise the performance of STDG approaches across each dimension. Additionally, we conducted an extensive analysis and evaluation of the pipeline using six different healthcare-related open-source datasets and four STDG approaches to prove its efficiency and veracity. This analysis has shown that the proposed pipeline can effectively be used to evaluate and benchmark different approaches for STDG, helping the scientific community to select the most suitable approaches for their data and application of interest. Although other authors have proposed metrics or methods to evaluate STD, none of them have used a complete pipeline that covers all dimensions of STD evaluation.
Regarding the limitations of this work, we have found that (1) some metrics and methods are not as trustworthy as initially considered, (2) it is difficult to find a trade-off between resemblance, utility and privacy, (3) previously synthesised or anonymised data has been used and (4) the pipeline has not been compared with other methods used in the literature for STD evaluation.
Future work includes: (1) judging and benchmarking the metrics and methods in the proposed pipeline with more datasets, in other contexts and with real data from healthcare authorities; (2) improving the proposed RIRA in the privacy dimension; (3) proposing new metrics and methods to evaluate the performance of STDG approaches in terms of time and footprint; (4) improving the STDG approaches to strengthen the trade-off between the dimensions; and (5) unifying all the proposed metrics and methods into a web-based application.

Figure 1 :
Figure 1: Taxonomy of the proposed pipeline of metrics and methods to evaluate STD in three different dimensions: resemblance, utility and privacy.

Figure 2 :
Figure 2: DRA categorisation example. (a) Result for Excellent resemblance. (b) Result for Good resemblance. (c) Result for Poor resemblance.

Table 1 :
Brief description of the selected datasets