An Ordered Sample Consensus (ORSAC) Method for Data Cleaning Inspired by
RANSAC: Identifying Probable Mislabeled Data
Abstract
In classification problems, mislabeled data can substantially degrade
the performance of a trained model. The traditional remedy is expert
review of the data. However, this is often impractical, both because
many classification datasets are large, as with the image datasets that
support deep learning models, and because few human experts are
available to review the data. Here we propose an Ordered Sample
Consensus (ORSAC) method to
support data cleaning by flagging mislabeled data. This method is
inspired by the Random Sample Consensus (RANSAC) method for outlier
detection. In short, the method iteratively trains and tests a model on
different splits of the dataset, records the misclassifications, and
flags data that are frequently misclassified as probable mislabels. We
evaluate the method by deliberately
mislabeling subsets of the data and assessing the method’s ability to
find them. Using three datasets (a mosquito image dataset, CIFAR-10, and
CIFAR-100), we demonstrate that the method reliably identifies
mislabeled data with a high degree of accuracy across these diverse
datasets and across different mislabeling frequencies.
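
The train, test, record, and flag loop summarized above, together with the evaluation by deliberate mislabeling, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the nearest-centroid classifier stands in for whatever model is actually trained, the splits are plain random shuffles (the abstract does not specify how ORSAC orders its splits), and the names flag_probable_mislabels, make_toy_data, n_rounds, test_fraction, and threshold are hypothetical.

```python
import random
from collections import defaultdict


def fit_centroids(X, y):
    """Stand-in model (assumption): per-class mean of the feature vectors."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], x))
    return min(centroids, key=sq_dist)


def flag_probable_mislabels(X, y, n_rounds=50, test_fraction=0.2,
                            threshold=0.5, seed=0):
    """Flag indices misclassified in more than `threshold` of their tests.

    The random splits and the threshold rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(X)
    tested, missed = [0] * n, [0] * n
    n_test = max(1, int(n * test_fraction))
    for _ in range(n_rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit_centroids([X[i] for i in train_idx],
                              [y[i] for i in train_idx])
        # Record which held-out points disagree with their given label.
        for i in test_idx:
            tested[i] += 1
            if predict(model, X[i]) != y[i]:
                missed[i] += 1
    return [i for i in range(n)
            if tested[i] > 0 and missed[i] / tested[i] > threshold]


def make_toy_data(n_per_class=100, flip_fraction=0.1, seed=1):
    """Two well-separated 2-D Gaussian classes with deliberately flipped labels."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, -3.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    flipped = set(rng.sample(range(len(y)), int(len(y) * flip_fraction)))
    noisy = [1 - label if i in flipped else label for i, label in enumerate(y)]
    return X, noisy, flipped


# Evaluation in the spirit of the abstract: inject known mislabels,
# then measure how many of them the procedure recovers.
X, noisy_labels, flipped = make_toy_data()
flagged = set(flag_probable_mislabels(X, noisy_labels))
hits = len(flagged & flipped)
precision = hits / len(flagged) if flagged else 0.0
recall = hits / len(flipped)
print(f"flagged={len(flagged)} precision={precision:.2f} recall={recall:.2f}")
```

Because the flipped indices are known, precision and recall of the flagged set can be computed directly, and varying flip_fraction mimics assessing performance at different mislabeling frequencies. On real image data the stand-in classifier would be replaced by the model of interest, and the threshold would need tuning.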