TechRxiv

Raking and Relabeling for Imbalanced Data

preprint
posted on 2022-04-27, 03:59 authored by Seunghwan Park, Hae-Hwan Lee, Jongho Im
We consider the binary classification of imbalanced data. A dataset is imbalanced if its class proportions are heavily skewed. Classifying imbalanced data is often challenging, especially for high-dimensional data, because unequal class sizes degrade classifier performance. Undersampling the majority class and oversampling the minority class are popular methods for constructing balanced samples, which helps improve classification performance. However, many existing sampling methods cannot be easily extended to high-dimensional data or to mixed data that include categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms achieve performance comparable to that of existing popular methods but are more flexible with respect to data shape and the number of attributes. The sampling algorithm is attractive in practice because it does not require density estimation for synthetic data generation in oversampling and handles mixed-type variables without difficulty. In addition, the proposed sampling strategy is robust to the choice of classifier, in the sense that classification performance is not sensitive to which classifier is used.
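
For readers unfamiliar with the term, the sketch below illustrates raking in its generic survey-statistics sense (iterative proportional fitting): record weights are rescaled, one categorical attribute at a time, until the weighted category totals match given target margins. The abstract does not spell out the authors' procedure, so this should not be read as their algorithm. The function name `rake_weights`, the attributes `sex` and `region`, and the target totals are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rake_weights(rows, targets, n_iter=50, tol=1e-9):
    """Generic raking (iterative proportional fitting); a sketch only.

    rows    : list of dicts, one per record, mapping attribute -> category
    targets : dict mapping attribute -> {category: target weighted total}
    Returns one weight per record.
    """
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        max_change = 0.0
        for attr, target in targets.items():
            # Current weighted total of each category of this attribute.
            totals = defaultdict(float)
            for row, w in zip(rows, weights):
                totals[row[attr]] += w
            # Rescale each record so this attribute's margin hits its target.
            for i, row in enumerate(rows):
                factor = target[row[attr]] / totals[row[attr]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        # Stop once a full pass over all attributes changes nothing.
        if max_change < tol:
            break
    return weights


# Hypothetical records with two categorical attributes.
rows = [
    {"sex": "F", "region": "N"},
    {"sex": "F", "region": "S"},
    {"sex": "M", "region": "N"},
    {"sex": "M", "region": "S"},
]
# Hypothetical target margins; both attributes sum to the same total (4.0).
targets = {
    "sex": {"F": 3.0, "M": 1.0},
    "region": {"N": 2.0, "S": 2.0},
}
weights = rake_weights(rows, targets)
print(weights)
```

Because the update only needs category counts, a scheme of this kind works directly on categorical attributes and involves no density estimation, which is in the spirit of the flexibility the abstract claims. How the paper combines raking with relabeling is described in the preprint itself.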

Funding

NRF-2021R1C1C1014407

NRF-2019R1G1A1002232

History

Email Address of Submitting Author

ijh38@yonsei.ac.kr

ORCID of Submitting Author

0000-0001-8362-4756

Submitting Author's Institution

Yonsei University

Submitting Author's Country

  • South Korea
