Raking and Relabeling for Imbalanced Data
preprintposted on 2022-04-27, 03:59 authored by Seunghwan Park, Hae-Hwan Lee, Jongho ImJongho Im
We consider the binary classification of imbalanced data. A dataset is imbalanced if the proportion of classes are heavily skewed. Imbalanced data classification is often challengeable, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Under sampling the majority class or oversampling the minority class are popular methods to construct balanced samples, facilitating classification performance improvement. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce comparable performance as existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not bothered by mixed-type variables. In addition, the proposed sampling strategy is robust to classifiers in the sense that classification performance is not sensitive to choosing the classifiers.