TechRxiv

Are Classifiers Trained on Synthetic Data Reliable? An XAI Study

preprint
posted on 2022-12-30, 19:04 authored by ASIF IQBAL, Biplab Sikdar

Machine learning (ML) solutions are being applied in many areas of our daily lives, but they often require high-quality, balanced datasets in order to perform well.

However, datasets for real-world problems are often imbalanced, requiring the use of special-purpose ML algorithms or synthetic data to address the class imbalance.

Traditional techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) are commonly used to generate minority class samples.
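As a rough illustration of the interpolation idea behind SMOTE, the sketch below creates new minority-class points on the line segments between existing minority samples and their nearest minority neighbours. It is a simplified stand-in written in plain Python, not the implementation used in the preprint; the function name `smote_sketch`, its parameters, and the toy data are all assumptions made for this example.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Hypothetical helper for illustration only."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real samples, it stays inside the region spanned by the minority class, which is one reason such samples can look plausible while still shifting what a downstream classifier learns.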

Evaluating the quality of synthetic samples can be challenging, and researchers often rely on improved classifier performance as justification for their use.  

However, simply performing well on a test set is not sufficient to ensure the trustworthiness of a model, and further analysis of model predictions is necessary. 

To address this, we trained multiple classifiers on synthetic data generated by various methods and analyzed their predictions using SHAP (SHapley Additive exPlanations), an explainable AI (XAI) technique.

Our in-depth analysis showed that these classifiers used different features for making predictions and placed different levels of importance on commonly used features. Therefore, we conclude that classification models trained on synthetic data must be carefully analyzed by human experts before being deployed in the real world.
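The kind of comparison described above can be sketched with exact Shapley values, the quantity that SHAP approximates. The example below is a minimal plain-Python illustration, not the SHAP library and not the authors' pipeline: `shapley_values`, `model_a`, `model_b`, the input `x`, and the all-zero baseline are hypothetical, with the two linear scorers standing in for classifiers trained on differently generated synthetic data.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model at x. Features outside a coalition
    are replaced by the corresponding baseline value."""
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Two hypothetical scorers that reach similar outputs from different features.
def model_a(z):
    return 2.0 * z[0] + 0.5 * z[1]


def model_b(z):
    return 0.5 * z[0] + 2.0 * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi_a = shapley_values(model_a, x, baseline)
phi_b = shapley_values(model_b, x, baseline)

# Rank features from most to least important by absolute attribution.
rank_a = sorted(range(len(x)), key=lambda i: -abs(phi_a[i]))
rank_b = sorted(range(len(x)), key=lambda i: -abs(phi_b[i]))
```

Here both scorers produce the same output for `x`, yet `rank_a` and `rank_b` differ: one relies mostly on feature 0 and the other on feature 2. That is the pattern test-set accuracy alone cannot reveal and that attribution analysis can surface.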

Funding

Ministry of Education, Singapore

History

Email Address of Submitting Author

aiqbal@nus.edu.sg

ORCID of Submitting Author

0000-0002-4657-4451

Submitting Author's Institution

National University of Singapore

Submitting Author's Country

  • Singapore