Dimensionality Reduction by Machine Learning for Cost-Effective Data Analysis
  • Abu Asaduzzaman (Corresponding Author: [email protected])
  • Md R Uddin
  • Fadi N Sibai

Abstract

Processing a large amount of data with many input features is time consuming and expensive. In machine learning (ML), the number of input features plays a crucial role in determining the performance of ML models. Studies show that ML has potential for dimensionality reduction. This work proposes a methodology that uses ML to reduce the number of input features and thereby facilitate cost-effective data analysis. Two water-quality prediction datasets from Kaggle are used to run the ML models. First, we use Recursive Feature Elimination with Cross-Validation (RFECV), Permutation Importance (PI), and Random Forest (RF) models to assess the impact of the input features on predicting water quality. Second, we conduct experiments applying seven ML models: RF, Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and Deep Neural Network (DNN) to predict water quality using the original and reduced datasets. Third, we evaluate the impact of the optimized feature sets on the computation and cost of testing water quality. Experimental results show that reducing the number of features from nine to five for Dataset 1 reduces computations by up to 59% and cost by up to 65%. Similarly, reducing the number of features from 20 to 16 for Dataset 2 reduces computations by up to 20% and cost by up to 14%. This study may help mitigate the curse of dimensionality by improving the performance of ML models through enhanced data generalization.
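The feature-selection step described above can be sketched with scikit-learn's RFECV wrapped around a Random Forest estimator. This is a minimal illustration, not the authors' exact pipeline: the Kaggle water-quality data is not bundled here, so a synthetic classification dataset with nine features (matching Dataset 1) stands in, and all hyperparameters shown are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for Dataset 1: nine input features, binary target
# (e.g., potable vs. non-potable water). Real column names would differ.
X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           n_redundant=2, random_state=42)

# RFECV recursively drops the least important feature and keeps the
# feature count that maximizes cross-validated accuracy.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=1,                    # eliminate one feature per iteration
    cv=StratifiedKFold(5),     # 5-fold stratified cross-validation
    scoring="accuracy",
)
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Selected feature mask:", selector.support_)
```

A reduced dataset is then obtained with `selector.transform(X)` and fed to the downstream classifiers (RF, DT, LR, KNN, GNB, SVM, DNN) for the comparison described in the abstract.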
Submitted to TechRxiv: 11 Apr 2024
Published in TechRxiv: 17 Apr 2024