Dataset Quality Assessment with Permutation Testing Showcased on Network Traffic Datasets
  • Katarzyna Wasielewska,
  • Dominik Soukup,
  • Tomáš Čejka,
  • Jose Camacho
Katarzyna Wasielewska
University of Granada

Corresponding Author:[email protected]


Abstract

Intelligent and autonomous networks require precise and fast mechanisms that ensure error-free and efficient operation. Modern solutions increasingly rely on artificial intelligence, and on machine learning in particular, to reliably process huge amounts of data. High-quality datasets are therefore essential for training machine learning models. Unfortunately, assessing dataset quality is very challenging and often overlooked. This paper proposes a method for assessing dataset quality in the context of binary classification. It is based on permutation testing and examines the strength of the relationship between observations and labels. Experiments carried out on simulated and real network datasets show that the method is sensitive enough to detect errors and mislabels in a labeled dataset. We also present theoretical considerations justifying our results.
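
To make the idea of a label permutation test concrete, the following is a minimal illustrative sketch and not the procedure proposed in the paper. It assumes a simple test statistic (the distance between the two class centroids), a hypothetical helper name `label_permutation_test`, and synthetic Gaussian data; the abstract does not specify any of these choices. The observed statistic is compared against its distribution under randomly permuted labels, so a weak observation-label relationship (for example, one caused by mislabeling) shows up as a smaller statistic.

```python
import numpy as np


def label_permutation_test(X, y, n_permutations=500, seed=0):
    """Hypothetical helper: permutation test of the link between X and binary labels y."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).astype(int)

    def statistic(labels):
        # Assumed statistic: Euclidean distance between the two class centroids.
        # Larger values indicate a stronger observation-label relationship.
        return np.linalg.norm(
            X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
        )

    observed = statistic(y)
    # Null distribution: the same statistic computed on shuffled labels.
    null = np.array(
        [statistic(rng.permutation(y)) for _ in range(n_permutations)]
    )
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, p_value


# Synthetic two-class data (an assumption made for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(2.0, 1.0, (200, 5))])
y = np.repeat([0, 1], 200)

# Simulate mislabeling by flipping 30% of the labels.
flip = rng.choice(y.size, size=120, replace=False)
y_noisy = y.copy()
y_noisy[flip] = 1 - y_noisy[flip]

obs_clean, p_clean = label_permutation_test(X, y)
obs_noisy, p_noisy = label_permutation_test(X, y_noisy)
print(f"clean labels: statistic={obs_clean:.3f}, p-value={p_clean:.4f}")
print(f"noisy labels: statistic={obs_noisy:.3f}, p-value={p_noisy:.4f}")
```

In this sketch the flipped labels shrink the centroid distance, which is one simple way a permutation-based check can react to mislabeled observations; a real assessment would use the statistic and permutation scheme defined in the paper itself.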