
Effects of Feature Selection and Normalization on Network Intrusion Detection
  • Mubarak Albarka Umar,
  • Zhanfang Chen,
  • Khaled Shuaib,
  • Yan Liu
Mubarak Albarka Umar
Changchun University of Science and Technology; United Arab Emirates University
Corresponding Author: [email protected]

Zhanfang Chen
Changchun University of Science and Technology

Khaled Shuaib
United Arab Emirates University

Yan Liu
Shantou University

Abstract

The rapid rise of cyberattacks and the gradual failure of traditional defense systems and approaches have led to the use of Machine Learning (ML) techniques to build more efficient and reliable Intrusion Detection Systems (IDSs). However, the advent of larger IDS datasets has negatively impacted the performance and computational complexity of ML-based IDSs. Many researchers use data preprocessing techniques such as feature selection and normalization to overcome such issues. While most of these researchers report the success of these preprocessing techniques at a shallow level, very few studies have examined their effects on a wider scale. Furthermore, the performance of an IDS model depends not only on the preprocessing techniques used but also on the dataset and the ML algorithm, a point to which most existing studies give little emphasis. This study therefore provides an in-depth analysis of the effects of feature selection and normalization on various IDS models built using two IDS datasets, namely NSL-KDD and UNSW-NB15, and five ML algorithms: support vector machine, k-nearest neighbor, random forest, naive Bayes, and artificial neural network. For feature selection, a decision tree wrapper-based approach, which tends to give superior model performance, was used; for normalization, the min-max method was used. A total of 30 unique IDS models were implemented using the full and feature-selected copies of the datasets. The models were evaluated using popular evaluation metrics in IDS modeling, and intra- and inter-model comparisons were performed among the models and against state-of-the-art works. Random forest achieved the best performance on both NSL-KDD and UNSW-NB15, with prediction accuracies of 99.87% and 98.5% and detection rates of 99.79% and 99.17%, respectively; it also performed excellently in comparison with recent works. The results show that both normalization and feature selection positively affect IDS modeling, with normalization proving more important than feature selection in improving performance and computational time. The study also found that the UNSW-NB15 dataset is more complex and more suitable than NSL-KDD for building and evaluating modern-day IDSs.
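To make the described pipeline concrete, the sketch below illustrates one possible way to combine min-max normalization, decision tree wrapper-based feature selection, and a random forest classifier evaluated with accuracy and detection rate. This is not the authors' code: scikit-learn, sequential forward selection as the wrapper search strategy, the synthetic data standing in for NSL-KDD/UNSW-NB15, and all parameter values (e.g. 15 selected features, 100 trees) are assumptions made purely for illustration.

```python
# Illustrative sketch only: normalization + wrapper feature selection + RF,
# with synthetic data in place of the NSL-KDD / UNSW-NB15 datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

# Placeholder data: rows are network flows, label 1 = attack, 0 = normal.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=15,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Min-max normalization: fit on the training split only, apply to both splits.
scaler = MinMaxScaler()
X_train_n = scaler.fit_transform(X_train)
X_test_n = scaler.transform(X_test)

# Wrapper-based feature selection using a decision tree as the search estimator.
# (Sequential forward selection is one wrapper strategy; the paper's exact
# search procedure is not specified in the abstract.)
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=42),
    n_features_to_select=15, direction="forward", cv=3, n_jobs=-1)
selector.fit(X_train_n, y_train)
X_train_fs = selector.transform(X_train_n)
X_test_fs = selector.transform(X_test_n)

# Random forest trained on the normalized, feature-selected data.
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train_fs, y_train)
y_pred = model.predict(X_test_fs)

print("Accuracy:       %.4f" % accuracy_score(y_test, y_pred))
print("Detection rate: %.4f" % recall_score(y_test, y_pred))  # recall on attacks
```

The same structure can be repeated for the other four classifiers and for the full (non-feature-selected) copies of the data to reproduce the kind of intra- and inter-model comparison the abstract describes.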
Submitted to TechRxiv: 29 Jan 2024
Published in TechRxiv: 12 Feb 2024