Self-Refine Learning for Data-Centric Text Classification

In industry NLP applications, manually labeled data contains a certain amount of noise. We present a simple method to find the noisy data and re-label them with the model's predictions. We select as noisy the data whose human label is not contained in the model's top-K predictions, where the model is trained on the original dataset. The experimental results show that our method works: for our industry deep learning application, it improves text classification accuracy from 80.5% to 90.6% on the dev dataset, and improves human-evaluation accuracy from 83.2% to 90.5%.


Introduction
In recent years, deep learning [2] and BERT-based [1] models have shown significant improvements on almost all NLP tasks. However, the most important factors for the performance of a deep learning application are data quantity and quality. We improve the performance of an industry NLP application by correcting the noisy data using the rest of the (mostly clean) data.
Previous work [11] first finds the noisy data on which the human label and the model prediction disagree, and then re-labels the noisy data manually; during correction, the previous human label and the model prediction are shown to the annotator. However, this requires additional human labeling. In this work, we instead directly re-label the noisy data whose human label is not contained in the model's top-K (K=1,2,3...10) predictions, setting each noisy label to the model's top-1 prediction.
Our key contribution is: on our industry dataset, we first find the noisy data whose human label is not in the model's top-K (K=1,2,3...10) predictions. Then we re-label each noisy example with the model's top-1 prediction. The experimental results show that our idea works for our large industry dataset. BERT [1] is pre-trained on large unlabeled text and fine-tuned for supervised downstream tasks; it achieved state-of-the-art results on many sentence-level tasks on the GLUE [3] and CLUE [12] benchmarks.
Our method is different from semi-supervised learning, which addresses the problem of making the best use of a large amount of unlabeled data; such works include UDA [6], MixMatch [7], FixMatch [8], and ReMixMatch [9]. Our work is fully supervised.

Our method
In this section, we describe our method in detail. Our method is shown in Fig 1. It includes 5 steps: Step 1, to solve our industry text classification problem, we manually label 2,790,000 examples and split them into 2,700,000 training examples and 90,000 dev examples.
Step 2, we train / fine-tune the BERT model on the 2,700,000 training examples. We call the resulting model Model-A.
Step 3, we use Model-A to predict on all 2,790,000 examples. We then find all examples whose human label is not in Model-A's top-K (K=1,2,3...10) predictions; we consider these the noisy data.
Step 4, we replace each noisy example's human label with Model-A's top-1 prediction. Then we apply the same 2,700,000 : 90,000 training/dev split.
Step 5, we train and evaluate on the dataset from step 4 to obtain Model-B. Because the dev dataset is also re-labeled by Model-A's top-1 predictions, we additionally evaluate the performance of our method manually.
We use BERT as our model. The training steps in our method correspond to the fine-tuning stage of BERT, and we follow the BERT convention to encode the input text.
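Steps 3 and 4 above can be sketched as a small function over Model-A's per-class probabilities. This is a minimal illustration, not the paper's released code; the function name and the list-of-lists probability format are our own assumptions.

```python
def relabel_noisy(labels, probs, k=3):
    """Flag examples whose human label is outside the model's top-k
    predictions (step 3) and replace it with the top-1 class (step 4).

    labels: list of int class ids (human labels)
    probs:  per-example lists of class probabilities from Model-A
    Returns (new_labels, noisy_indices).
    """
    new_labels, noisy = [], []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # class ids sorted by descending Model-A probability
        ranked = sorted(range(len(p)), key=lambda c: p[c], reverse=True)
        if y in ranked[:k]:
            new_labels.append(y)          # human label survives the top-k check
        else:
            new_labels.append(ranked[0])  # re-label with the top-1 prediction
            noisy.append(i)
    return new_labels, noisy


# Toy usage with 3 examples and 3 classes, k=2:
labels = [0, 2, 1]
probs = [[0.7, 0.2, 0.1],   # human label 0 is top-1, kept
         [0.6, 0.3, 0.1],   # human label 2 not in top-2, re-labeled to 0
         [0.1, 0.8, 0.1]]   # human label 1 is top-1, kept
print(relabel_noisy(labels, probs, k=2))
```

The same selection with K=1 reduces to the earlier disagreement-based criterion of [11]; larger K keeps more of the original human labels.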

Experiments
In this section we describe the experiment parameters in detail and show the results. The detailed results are shown in Table 2; the data sizes in our experiments are shown in Table 1.
In fine-tuning, we use Adam [4] with a learning rate of 1e-5 and a dropout [5] probability of 0.1 on all layers. We use BERT-Base (12 layers, 768 hidden size) as our pre-trained model.

Table 1. Data sizes in our experiments.

  Data Size    Description
  11,000,000   All the data in our application database.
  2,790,000    All the data we label in step 1 of Fig 1.
  2,700,000    The training dataset split from the 2,790,000 labeled data.
  180,000      All the noisy data selected from the 2,790,000 data for re-labeling.
  90,000       The dev dataset split from the 2,790,000 labeled data.
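The fine-tuning configuration above could be reproduced roughly as follows with the Hugging Face transformers library. This is a sketch under assumptions: the paper does not name its checkpoint or label count, so "bert-base-chinese" and NUM_LABELS are placeholders, and dropout 0.1 is the BERT-Base default made explicit.

```python
import torch
from transformers import BertForSequenceClassification

NUM_LABELS = 100  # placeholder; the paper does not state the label count

model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese",               # assumption: any BERT-Base checkpoint (12 layers, 768 hidden)
    num_labels=NUM_LABELS,
    hidden_dropout_prob=0.1,           # dropout 0.1 on all layers, per the paper
    attention_probs_dropout_prob=0.1,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # Adam [4], lr 1e-5
```

From here, a standard training loop (or the transformers Trainer) over the 2,700,000 training examples yields Model-A; the same setup on the re-labeled data yields Model-B.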

Analysis
As the whole dataset is large enough, we can re-label the ground truth of the noisy data using the rest of the (mostly clean) data.