Effects of Feature Selection and Normalization on Network Intrusion Detection

Mubarak Albarka Umar; Chen Zhanfang

doi:10.36227/techrxiv.12480425.v2

Mubarak Albarka Umar, Zhanfang Chen, Khaled Shuaib, Yan Liu

Abstract

The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches have led to the use of Machine Learning (ML) techniques to build more efficient and reliable Intrusion Detection Systems (IDSs). However, the advent of larger IDS datasets has negatively affected the performance and computational time of ML-based IDSs. To overcome these issues, many researchers have used data preprocessing techniques such as feature selection and normalization. While most of these researchers reported the success of these preprocessing techniques at a shallow level, very few studies have examined their effects on a wider scale. Furthermore, the performance of an IDS model depends not only on the preprocessing techniques used but also on the dataset and the ML algorithm, which most existing studies on preprocessing techniques give little emphasis to. Thus, this study provides an in-depth analysis of the effects of feature selection and normalization on various IDS models built using two separate IDS datasets, NSL-KDD and UNSW-NB15, and five different ML algorithms. A wrapper-based decision tree method and min-max scaling are used for feature selection and normalization, respectively. The models are evaluated and compared using popular IDS evaluation metrics. The study found normalization to be more important than feature selection in improving the performance and computational time of models on both datasets, whereas feature selection failed to reduce the computational time of models on UNSW-NB15 and decreased the performance of models built using NSL-KDD. The study also reveals that, compared to the UNSW-NB15 dataset, the NSL-KDD dataset is less complex and unsuitable for building reliable modern-day IDS models. Furthermore, the best performance on both datasets is achieved by Random Forest, with accuracies of 99.75% and 98.51% on NSL-KDD and UNSW-NB15, respectively.
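
The min-max normalization named in the abstract rescales each feature to the range [0, 1] using x' = (x - min) / (max - min). The following is a minimal sketch of that idea only, not the authors' implementation: the function names `fit_min_max` and `apply_min_max`, the toy data, and the choice to map constant columns to 0.0 are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-column minima and maxima of a non-empty numeric table."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(
    rows: Sequence[Sequence[float]], mins: Sequence[float], maxs: Sequence[float]
) -> List[List[float]]:
    """Scale each value with (x - min) / (max - min), column by column.

    Assumption: a constant column (max == min) is mapped to 0.0 to avoid
    division by zero.
    """
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Hypothetical toy data: two numeric features on very different scales.
train = [[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]]
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
```

In this sketch the minima and maxima are computed once and then reused, so the same statistics could be applied to held-out data; whether the study fits them on the training split only is not stated in the abstract.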

29 Jan 2024: Submitted to TechRxiv

12 Feb 2024: Published in TechRxiv
