Dataset Quality Assessment with Permutation Testing Showcased on Network
Traffic Datasets
Abstract
Intelligent and autonomous networks require precise and fast mechanisms
that ensure error-free and efficient operation. Modern solutions are
increasingly based on artificial intelligence, in particular on machine
learning, to reliably process huge amounts of data. Therefore,
high-quality datasets are essential to train machine learning models.
Unfortunately, assessing dataset quality is challenging and often
overlooked. This paper proposes a method for assessing dataset quality
in the context of binary classification.
It is based on permutation testing and examines the strength of the
relationship between observations and labels. Experiments carried out on
simulated and real network datasets show that the method is sensitive
enough to detect labeling errors in a labeled dataset. We also present
theoretical considerations justifying our results.
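
To make the idea of a label permutation test concrete, the following is
a minimal sketch of a generic version, not the exact procedure of this
paper. The test statistic (distance between class centroids), the
function names, the synthetic Gaussian data, and the number of
permutations are all assumptions chosen for illustration. A small
p-value suggests that the labels carry information about the
observations; labels unrelated to the observations should not
systematically produce one.

```python
import numpy as np


def centroid_distance(X, y):
    """Illustrative test statistic: distance between the two class centroids."""
    return float(np.linalg.norm(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)))


def permutation_p_value(X, y, n_permutations=500, seed=0):
    """Estimate how often shuffled labels match or beat the observed statistic."""
    rng = np.random.default_rng(seed)
    observed = centroid_distance(X, y)
    # Null distribution: shuffling y breaks any observation-label relationship
    # while keeping the class sizes unchanged.
    null = np.array(
        [centroid_distance(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    # Add-one correction keeps the estimated p-value strictly positive.
    return (1 + np.sum(null >= observed)) / (1 + n_permutations)


# Synthetic binary dataset (assumption): class 1 is shifted by 1 in each feature.
rng = np.random.default_rng(42)
n = 200
y_clean = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 5)) + 1.0 * y_clean[:, None]

# Labels with no relationship to the observations, for comparison.
y_random = rng.permutation(y_clean)

p_clean = permutation_p_value(X, y_clean)
p_random = permutation_p_value(X, y_random)
print(f"p-value, clean labels:  {p_clean:.4f}")
print(f"p-value, random labels: {p_random:.4f}")
```

With well-separated classes the clean labels should give a p-value near
the smallest attainable value, 1 / (n_permutations + 1). Any statistic
that grows with the strength of the observation-label relationship, such
as the held-out accuracy of a classifier, could be substituted for the
centroid distance.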