TechRxiv

Dataset Quality Assessment with Permutation Testing Showcased on Network Traffic Datasets

preprint
posted on 2022-06-29, 12:54 authored by Katarzyna Wasielewska, Dominik Soukup, Tomáš Čejka, Jose Camacho

Intelligent and autonomous networks require precise and fast mechanisms that ensure error-free and efficient operation. Modern solutions are increasingly based on artificial intelligence, in particular on machine learning, to reliably process huge amounts of data. High-quality datasets are therefore essential for training machine learning models. Unfortunately, assessing the quality of datasets is very challenging and often overlooked. This paper proposes a method for assessing dataset quality in the context of binary classification. It is based on permutation testing and examines the strength of the relationship between observations and labels. Experiments carried out on simulated and real network datasets show that the method is sensitive enough to detect errors and mislabels in a labeled dataset. We also present theoretical considerations justifying our results.
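
To make the idea of permutation testing on labels concrete, the following is a minimal sketch of a generic label-permutation test for a binary-labeled dataset. It is not the procedure from the paper, whose details are not given on this page. The test statistic (distance between the two class centroids), the toy data generator, and all function names (`centroid_gap`, `permutation_p_value`, `make_toy_data`) are assumptions chosen for illustration only.

```python
import numpy as np


def centroid_gap(X, y):
    """Euclidean distance between the two class means.

    Assumed statistic: a simple proxy for how strongly the labels
    relate to the observations (larger gap = stronger relationship).
    """
    return float(np.linalg.norm(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)))


def permutation_p_value(X, y, n_permutations=500, seed=0):
    """Compare the observed statistic with its distribution under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = centroid_gap(X, y)
    # Shuffling y breaks any link between observations and labels
    # while keeping the class sizes unchanged.
    null = np.array(
        [centroid_gap(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    # The +1 terms count the observed labeling as one of the permutations.
    p_value = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, float(p_value)


def make_toy_data(n_per_class=100, shift=1.0, flip_fraction=0.0, seed=0):
    """Hypothetical two-class Gaussian data with a fraction of labels flipped."""
    rng = np.random.default_rng(seed)
    X = np.vstack(
        [
            rng.normal(0.0, 1.0, size=(n_per_class, 5)),
            rng.normal(shift, 1.0, size=(n_per_class, 5)),
        ]
    )
    y = np.repeat([0, 1], n_per_class)
    # Simulate mislabels by flipping a random subset of the labels.
    flipped = rng.choice(y.size, size=int(flip_fraction * y.size), replace=False)
    y[flipped] = 1 - y[flipped]
    return X, y


if __name__ == "__main__":
    for fraction in (0.0, 0.2, 0.4):
        X, y = make_toy_data(flip_fraction=fraction)
        gap, p = permutation_p_value(X, y)
        print(f"flipped={fraction:.0%}  centroid gap={gap:.3f}  p-value={p:.4f}")
```

In this sketch, a small p-value indicates that the observed labels are tied to the observations far more strongly than shuffled labels are, and the gap between the class centroids shrinks as more labels are flipped. A classifier-based statistic could be substituted for `centroid_gap` without changing the rest of the test.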

Funding

Agencia Estatal de Investigación in Spain, grant No PID2020-113462RB-I00

European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 893146

Ministry of Interior of the Czech Republic (Flow-Based Encrypted Traffic Analysis) under grant number VJ02010024

Email Address of Submitting Author

k.wasielewska@ugr.es

ORCID of Submitting Author

0000-0001-8087-790X

Submitting Author's Institution

University of Granada

Submitting Author's Country

  • Spain