Standardised Metrics and Methods for Synthetic Tabular Data Evaluation
Preprint posted on 21.09.2021, 19:25 by Mikel Hernandez, Gorka Epelde, Ane Alberdi, Rodrigo Cilla, Debbie Rankin
Synthetic Tabular Data Generation (STDG) is a promising technology for augmenting real data and preserving privacy. However, prior to adoption, synthetic tabular data (STD) must be assessed empirically across three dimensions (resemblance, utility, and privacy), seeking a trade-off between them. The literature lacks standardised, objective metrics and methods for this assessment, and no organised pipeline or process for coordinating the evaluation has been identified. Therefore, in this work we propose a collection of metrics and methods to evaluate STD in these three dimensions, together with a meaningful orchestration of them in a single unifying pipeline. Additionally, we present a methodology to categorise the performance of STDG approaches in each dimension. Finally, we conducted an extensive analysis and evaluation to verify the usability of the proposed pipeline across six healthcare-related datasets, using four STDG approaches. The results showed that the proposed pipeline can be used effectively to evaluate and benchmark the STD generated by one or more STDG approaches, helping the scientific community select the most suitable approaches for their data and application of interest.
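To make the three dimensions concrete, the following is a minimal sketch of one possible measure per dimension. The specific choices here (a two-sample Kolmogorov-Smirnov statistic for resemblance, a nearest-centroid classifier trained on synthetic data and tested on real data for utility, and distance to the closest real record for privacy) are illustrative assumptions and are not taken from the abstract, which does not name the metrics the authors propose. All function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from math import dist


def ks_statistic(real_col, synth_col):
    """Resemblance (assumed metric): largest gap between the two empirical CDFs.

    0.0 means identical empirical distributions, 1.0 means fully separated ones.
    """
    real_sorted, synth_sorted = sorted(real_col), sorted(synth_col)
    gaps = []
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gaps.append(abs(cdf_real - cdf_synth))
    return max(gaps)


def train_synth_test_real_accuracy(synth_X, synth_y, real_X, real_y):
    """Utility (assumed metric): fit a nearest-centroid classifier on synthetic
    rows, then report its accuracy on real rows."""
    centroids = {}
    for label in set(synth_y):
        rows = [x for x, y in zip(synth_X, synth_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    preds = [min(centroids, key=lambda c: dist(x, centroids[c])) for x in real_X]
    return sum(p == y for p, y in zip(preds, real_y)) / len(real_y)


def distance_to_closest_record(real_rows, synth_rows):
    """Privacy (assumed metric): for each synthetic row, the Euclidean distance
    to its nearest real row. Values near zero suggest a near-copy of a real record."""
    return [min(dist(s, r) for r in real_rows) for s in synth_rows]


if __name__ == "__main__":
    # Toy numeric data, invented purely for illustration.
    real_X = [[1.0, 2.0], [1.5, 1.8], [8.0, 9.0], [8.5, 9.5]]
    real_y = [0, 0, 1, 1]
    synth_X = [[1.2, 2.1], [1.4, 1.7], [7.9, 9.2], [8.6, 9.1]]
    synth_y = [0, 0, 1, 1]

    first_real = [row[0] for row in real_X]
    first_synth = [row[0] for row in synth_X]
    print("resemblance (KS, column 0):", ks_statistic(first_real, first_synth))
    print("utility (accuracy):", train_synth_test_real_accuracy(synth_X, synth_y, real_X, real_y))
    print("privacy (min distances):", distance_to_closest_record(real_X, synth_X))
```

The three measures pull in different directions: synthetic rows that copy real rows would score well on resemblance and utility but poorly on privacy, which is the trade-off the abstract refers to.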