Effects of Feature Selection and Normalization on Network Intrusion Detection
Preprints are manuscripts made publicly available before they have been submitted for formal peer review and publication. They might contain new research findings or data. Preprints can be a draft or final version of an author's research but must not have been accepted for publication at the time of submission.
The rapid rise of cyberattacks and the gradual failing of traditional defense systems and approaches led to the use of Machine Learning (ML) techniques aiming to build more efficient and reliable Intrusion Detection Systems (IDSs). However, the advent of larger IDS datasets brought about negative impacts on the performance and computational time of ML-based IDSs. To overcome such issues, many researchers utilized data preprocessing techniques such as feature selection and normalization. While most of these researchers reported the success of these preprocessing techniques on a shallow level, very few studies are performed on their effects on a wider scale. Furthermore, the performance of an IDS model is subject to not only the preprocessing techniques used but also the dataset and the ML algorithm used, which most of the existing studies on preprocessing techniques give little emphasis on. Thus, this study provides an in-depth analysis of the effects of feature selection and normalization on various IDS models built using four separate IDS datasets and five different ML algorithms. Wrapper-based decision tree and min-max are used in feature selection and normalization respectively. The models are evaluated and compared using popular evaluation metrics in IDS. The study found normalization to be more important than feature selection in improving performance and computational time of models on both datasets, while feature selection on UNSW-NB15 failed to reduce models computational time, and in the case of models built using NSL-KDD, it decreases their performance. The study also reveals that, compared to the UNSW-NB15 dataset, the NSL-KDD dataset is less complex and unsuitable for building reliable modern-day IDS models. Furthermore, the best performance on both datasets is achieved by Random Forest with accuracy of 99.75% and 98.51% on NSL-KDD and UNSW-NB15 respectively.