Abstract
The rapid rise of cyberattacks and the gradual failure of traditional
defense systems and approaches have led to the use of Machine Learning (ML)
techniques to build more efficient and reliable Intrusion
Detection Systems (IDSs). However, the advent of larger IDS datasets
has adversely affected the performance and computational time
of ML-based IDSs. To overcome such issues, many researchers have utilized
data preprocessing techniques such as feature selection and
normalization. While most of these studies report the success of
such preprocessing techniques at a shallow level, very few examine
their effects on a wider scale. Furthermore, the
performance of an IDS model depends not only on the preprocessing
techniques applied but also on the dataset and the ML algorithm used,
a point to which most existing studies on preprocessing give little
attention. Thus, this study provides an in-depth analysis of the
effects of feature selection and normalization on various IDS models
built using two separate IDS datasets and five different ML algorithms.
Wrapper-based feature selection using a decision tree and min-max
normalization are employed. The models are evaluated and compared
using popular IDS evaluation metrics. The study found normalization
to be more important than feature selection in improving the performance
and computational time of the models on both datasets: feature selection
failed to reduce the computational time of the models built on UNSW-NB15
and decreased the performance of those built on NSL-KDD. The study
also reveals that the NSL-KDD dataset is less complex than UNSW-NB15
and is unsuitable for building reliable modern-day IDS models.
Furthermore, the best performance on both datasets is achieved by
Random Forest, with accuracies of 99.75% on NSL-KDD and 98.51% on
UNSW-NB15.
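
A minimal sketch of the preprocessing pipeline named above (min-max normalization and wrapper-based feature selection with a decision tree, followed by a Random Forest classifier), assuming scikit-learn. The CSV file name, the binary 0/1 "label" column, and all parameter values are illustrative assumptions, not the study's exact configuration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical CSV export of an IDS dataset with a binary 0/1 "label" column.
df = pd.read_csv("ids_dataset.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Min-max normalization: fit on the training split only to avoid data leakage.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Wrapper-based feature selection: a decision tree scores candidate feature
# subsets via cross-validated forward selection.
selector = SequentialFeatureSelector(
    DecisionTreeClassifier(random_state=42),
    n_features_to_select=20,  # illustrative subset size
    direction="forward",
    cv=3,
)
X_train_fs = selector.fit_transform(X_train_s, y_train)
X_test_fs = selector.transform(X_test_s)

# Random Forest, reported in the abstract as the best-performing model.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_fs, y_train)
pred = clf.predict(X_test_fs)
print("accuracy:", accuracy_score(y_test, pred))
print("f1-score:", f1_score(y_test, pred))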