Student Loss: Towards the Probability Assumption in Inaccurate Supervision

Noisy labels are often encountered in datasets, but learning with them is challenging. Although natural discrepancies between clean and mislabeled samples in a noisy category exist, most techniques in this field still gather them indiscriminately, which leaves their performance only partially robust. In this paper, we reveal both empirically and theoretically that learning robustness can be improved by assuming that deep features with the same labels follow a student distribution, resulting in a more intuitive method called student loss. By embedding the student distribution and exploiting the sharpness of its curve, our method is naturally data-selective and can offer extra strength to resist mislabeled samples. This ability makes clean samples aggregate tightly in the center, while mislabeled samples scatter, even if they share the same label. Additionally, we employ the metric learning strategy and develop a large-margin student (LT) loss for better capability. It should be noted that our approach is the first work that adopts a prior probability assumption in feature representation to decrease the contributions of mislabeled samples. This strategy can enhance various losses to join the student loss family, even if they are already robust losses. Experiments demonstrate that our approach is more effective in inaccurate supervision. Enhanced LT losses significantly outperform various state-of-the-art methods in most cases, and huge improvements of over 50% can be obtained under some conditions.

Shuo Zhang, Graduate Student Member, IEEE, Jian-Qing Li, Hamido Fujita, Life Senior Member, IEEE, Yu-Wen Li, Deng-Bao Wang, Ting-Ting Zhu, Min-Ling Zhang, Senior Member, IEEE, and Cheng-Yu Liu, Senior Member, IEEE

I. INTRODUCTION

Recent developments in supervised deep neural networks (DNNs) have considerably increased the performance of state-of-the-art (SOTA) models in various applications. These successes are highly dependent on the emergence of large-scale datasets that have been carefully labeled. Nevertheless, labeling precise annotations for training is time-consuming and prone to mistakes [1]. Therefore, inaccurate supervision, particularly learning with noisy labels, is a critical issue in practical deep learning tasks [2]. Numerous approaches have been suggested to address this issue, including: 1) Robust Architecture [3], [4], [5], [6], [7], [8], [9], [10], in which novel DNN structures are designed to limit the influence of mislabeled samples. 2) Robust Regularization [11], [12], [13], [14], [15], [16], in which additional constraints must be met during convergence. 3) Sample Selection [17], [18], [19], [20], [21], [22], [23], [24], [25], in which clean samples are picked out as much as possible for training. 4) Robust Loss Design [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], in which new robust loss functions are proposed to learn with noisy labels. In comparison to alternative techniques that may suffer from imprecise noise estimates or complicated training procedures, applying a robust loss is simpler and more effective, so it is also the main focus of this paper.
Generally speaking, a discriminative loss is encouraged to congregate the samples with the same labels, which can be achieved by evaluating sample similarity on certain metrics (such as the cosine angle distance, the Mahalanobis distance, etc.). However, this thinking, while largely valid, falls short in inaccurate supervision. Traditional discriminative losses usually have only one strength, which comes from the supervised classification information carried by the sample labels, even when some mislabeled samples appear. As shown in Fig. 1, if we apply a traditional loss to learn, not only does the angle between the clean sample and the learned weight vector decrease, but so does that between the mislabeled sample and the learned weight vector. This generates discrepancies and finally leads to an inseparable categorical cluster.

Fig. 1. Schematic of the traditional loss and our student loss in inaccurate supervision. The traditional loss has only one strength for pulling the samples with the same label to the weight embedding in the feature space, even if mislabeled samples exist. Since we introduce a long-tail student distribution to feature representation and retain some prior probabilities at the edge, our student loss can produce extra strength for resistance and obtain distinguishable categorical clusters.

To overcome the defects of losses in inaccurate supervision, researchers have made many attempts. Ghosh et al. [26] compared Categorical Cross-Entropy (CCE) with the Mean Absolute Error (MAE) loss and concluded that MAE is more noise-resistant due to its data-equal characteristic. This result prompted Zhang et al. [27] to propose Generalized Cross-Entropy (GCE), which can be seen as a combination of CCE and MAE. Furthermore, Kim et al. [28] offered the Negative Learning for Noisy Labels (NLNL) strategy, which introduced a three-stage pipeline for filtering the noisy data. Wang et al. [29] proposed Symmetric Cross-Entropy (SCE), a robust variant of CCE that combines CCE with Reverse Cross-Entropy (RCE). Extensively, Ma et al. [30] categorized current losses as "Active" or "Passive" and proposed the Active Passive Loss (APL). This technique combines active losses that induce overfitting (such as CCE) with a passive loss that causes underfitting (such as MAE) to achieve optimal performance. Recently, Kim et al. [32] reported Joint Negative and Positive Learning (JNPL) and claimed it can be regarded as an improved approach to NLNL. Englesson et al. [33] practiced the Jensen-Shannon Divergence (JS) loss and its generalized version, which trains the samples by the Jensen-Shannon divergence. Zhou et al. [34] illustrated that not only symmetric losses but also asymmetric losses can improve the robustness of learning with noisy labels and proposed a new class of asymmetric loss functions (ALFs). They then expanded this work and illustrated that ALFs can also be employed for regression tasks [35].
Since the labels are noisy, these SOTA methods actually attempt to discover a way to reduce the strength of the supervised classification while ensuring that clean samples can still be correctly categorized. Despite numerous efforts, most of them are still only partially robust. In fact, the natural discrepancies between clean and mislabeled samples of the same label are ideal materials to produce another strength of resistance. This motivates us to achieve the reduction through an unsupervised distinction with prior assumptions. Specifically, by considering deep features under the same label to follow a long-tail student distribution, in this paper we propose a more intuitive and effective method called student loss. As shown in Fig. 1, since the curve of the student distribution is extremely steep in a particular region, the edge probabilities (P_1 and P_2 in Fig. 1) are reserved. This produces extra strength to push a mislabeled sample outside the decision boundary adaptively. As a result, our approach has a naturally data-selective capability and can be applied to fight against the inconsistency produced by labeling errors. Additionally, we introduce a hyperparameter to encourage wider inter-class distances and further propose a large-margin student (LT) loss. Following our approach, intra-class clean samples aggregate tightly in the center, while mislabeled samples scatter at the edge, achieving an unsupervised clean/mislabeled sample partition. It should be noted that our student loss is the first research that introduces an assumption of the prior probability distribution in the hidden space to improve performance in inaccurate supervision. Moreover, various losses, even SOTA robust losses, can be further strengthened by our method. Our major contributions can be summarized as follows:

• We provide an insight into the probability distribution of deep features and point out, not only empirically but also theoretically, that the robustness of learning with noisy labels can be improved by assuming the samples with the same label to follow the student distribution.
• Based on this perspective, we propose the student loss. It is data-selective by embedding the student distribution, causing clean intra-class samples to concentrate neatly while mislabeled samples disperse, even if their labels are uniform. Furthermore, we employ some strategies from metric learning and develop its large-margin version.

• Various losses can be enhanced by our approach. Experiments on both benchmark and real-world datasets demonstrate that LT losses can achieve better performances than SOTA approaches in inaccurate supervision.

II. RELATED WORK
We briefly review existing approaches for robust learning with noisy labels.
1) Robust Architecture: These approaches employ a noise adaptation layer on top of a DNN to learn the label transition process, or create a dedicated architecture to support more varied types of label noise. As such, Webly learning [3] first taught the underlying DNN to retrieve only simple instances. The confusion matrices of all training instances were utilized as the initial weight W of the noise adaptation layer. In [5], Dropout regularization was applied to the adaptation layer, whose output was normalized by softmax to implicitly diffuse W. Similarly, the s-model [6] was proposed as the dropout noise model but without dropout. The c-model [6] was regarded as an extension of the s-model, which was more realistic than the symmetric and asymmetric noises. Additionally, several research projects built specific noise-handling structures. Masking [8] was a human-aided method of communicating the human understanding of erroneous label transitions. The faulty transitions explored by humans were used to confine noise modeling. On the other hand, to anticipate the noise type and label transition probability, probabilistic noise modeling [9] controlled two separate networks. The contrastive-additive noise network [10] was recently presented to compensate for inaccurately estimated label transition probabilities. This network introduced the novel notion of quality embedding to characterize the reliability of noisy labels.
2) Robust Regularization: These approaches aim to prevent a DNN from overfitting false-labeled instances by creating some training restrictions. The primary benefit of this group is that it can easily adapt to new contexts by incorporating very few changes. As such, Bilevel learning [11] offered a different tactic by presenting a bilevel optimization strategy to regularize the overfitting of a model using a clean validation dataset. This approach varied from the standard one in that the regularization constraint was itself an optimization issue. Mini-batch-level weight adjustments and validation-set error minimization were used to rein in overfitting. Equally, in [12], it was assumed that several annotators existed, and a regularized EM-based technique was introduced to model the label transition probability. Besides, fine-tuning a pre-trained model yielded a large gain in resilience compared with models trained from scratch [13]. This was because the universal representations learned during pre-training prevented the model parameters from being updated in the incorrect direction by noisy labels. For adapting to clean and noisy labels, respectively, robust early learning [14] categorized parameters as either critical or non-critical, and only non-critical updates were penalized. To increase resistance to label noise, PHuber [15] suggested a composite loss-based gradient clipping as an alternative to traditional gradient clipping.
3) Sample Selection: These approaches aim to isolate the most likely clean samples for optimization. As such, [17] offered MentorNet, a curriculum-based approach for learning the most likely correct samples. Decouple [18] recommended uncoupling update frequency from update methodology. Hence, two DNNs were kept in parallel and only modified when the examples were judged to have a disagreement. Similarly, in Co-teaching [19] and Co-teaching+ [20], two DNNs were kept. One DNN chose a predetermined number of low-loss samples and fed them to the other DNN for training. Co-teaching+ added the decoupling disagreement to Co-teaching. INCV [21] divided noisy training data at random and then applied cross-validation to classify clean examples while getting rid of mislabeled examples in training. JoCoR [22] practiced co-regularization to lower the diversity between two DNNs, bringing together their predictions. DivideMix [23] employed a two-component, one-dimensional Gaussian mixture model to fit the loss values of samples and divided noisy samples into labeled and unlabeled sets. Then, a semi-supervised technique called MixMatch [24] was introduced for classification. RoCL [25] similarly followed a two-stage learning strategy: first, supervised training on selected clean examples, and second, semi-supervised learning on relabeled noisy samples under self-supervision. It computed the exponential moving average of the training loss for selection and relabeling. Although learning through sample selection is effective in most cases, it produces a large amount of accumulated error when there are numerous ambiguous labels in the training data.
4) Robust Loss Design: These approaches aim to adjust the loss value according to the confidence of a given loss (or label) by some strategies, or to design a new loss function for inaccurate supervision. Robust losses typically include a constraint to penalize predictions with a low degree of confidence that are most likely caused by noisy samples. This subject is the most pertinent to our work. Since some robust losses have been discussed in Section I, here we only report some loss adjustment techniques. They can be categorized into three groups: 1) Loss correction. This approach multiplies the loss by the predicted label transition probability to adjust it. Backward [36] first approximated the noise transition matrix by employing the softmax output of the DNN trained without loss correction. Subsequently, it refreshed the DNN with a revised loss based on the estimated matrix. Forward [36] corrected the softmax output of a DNN with the transition matrix before applying the loss function. T-revision [37] offered a technique that infers the transition matrix without anchor points. 2) Loss reweighting. This approach gives smaller weights to mislabeled examples and greater weights to clean examples. Active bias [39] prioritized uncertain examples with inconsistent label predictions by applying their prediction variances as training weights. DualGraph [40] practiced graph neural networks, reweighted the examples by the structural relations among labels, and eliminated the abnormal noise examples.
3) Label refurbishment. This approach repairs a noisy label to avoid overfitting incorrect labels. Bootstrapping [41] was the first method to provide the concept of label refurbishment for updating the labels of training examples. SELFIE [43] proposed a paradigm of refurbishable examples that can be revised with high precision. The main notion was to regard examples with consistent label predictions as refurbishable because of the learner's perceptual constancy.

III. METHODOLOGY

A. Preliminaries
We consider a K-class classification task. In general, the classifier aims to seek a mapping function f_k from samples to the label of a certain class k ∈ [1, K]. It is usually realized as a DNN ending with a softmax layer to generate a posterior probability. As such, let z ∈ [1, K] represent the labeled class; the posterior probability can be obtained as

p(z = k \mid x_i) = \frac{e^{f_k(x_i)}}{\sum_{j=1}^{K} e^{f_j(x_i)}},

where x_i represents the ith sample in the training set. Specifically, if we only discuss the penultimate representation space, f_k(x_i) generates a linear transform written as

f_k(x_i) = w_k^\top x_i + b_k,

where w_k and b_k represent the weights and bias corresponding to class k. Then, if we employ CCE for training, the loss L can be written as

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}(z_i = k) \log p(z = k \mid x_i),

where the indicator function \mathbb{I}(\cdot) equals 1 if z_i equals k, or 0 otherwise.
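To make the preliminaries concrete, the following is a minimal PyTorch sketch of the softmax posterior and the CCE loss computed from penultimate features; all shapes and names are illustrative rather than the paper's.

    import torch
    import torch.nn.functional as F

    N, l, K = 8, 128, 10                      # samples, feature dimensions, classes (illustrative)
    x = torch.randn(N, l)                     # penultimate representations x_i
    W = torch.randn(l, K)                     # column w_k holds the weights of class k
    b = torch.zeros(K)                        # biases b_k
    z = torch.randint(0, K, (N,))             # labeled classes z_i

    logits = x @ W + b                        # f_k(x_i) = w_k^T x_i + b_k
    log_post = F.log_softmax(logits, dim=1)   # log p(z = k | x_i)
    loss = -log_post[torch.arange(N), z].mean()
    # Identical to F.cross_entropy(logits, z), the standard CCE.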

B. Robust Student Loss
Generally, let p(x_i) represent the probability that sample x_i occurs; it can be written as

p(x_i) = \sum_{k=1}^{K} D(x_i; \theta_k)\, \Delta x \; p(k),

where D(x_i; \theta_k) is the value of the specific probability density function of class k at x_i, \theta_k is the parameter of the distribution function, and \Delta x is the tiny increment of the random variable x. p(k) represents the prior probability of class k. After that, the posterior probability p(z = k \mid x_i) can be written as

p(z = k \mid x_i) = \frac{D(x_i; \theta_k)\, p(k)}{\sum_{j=1}^{K} D(x_i; \theta_j)\, p(j)}.

In order to obtain a more tolerant representation, we employ a long-tail student distribution T to exhibit the deep features. Under this line, D generates into

D(x_i; \theta_k) = \frac{\Gamma\big(\frac{n_k + l}{2}\big)}{\Gamma\big(\frac{n_k}{2}\big)\, (n_k \pi)^{l/2}\, |\Sigma_k|^{1/2}} \Big(1 + \frac{d_k}{n_k}\Big)^{-\frac{n_k + l}{2}}, \qquad d_k = (x_i - \mu_k)^\top \Sigma_k^{-1} (x_i - \mu_k),

where n_k, \mu_k, and \Sigma_k represent the freedom degree, the mean, and the covariance matrix of class k, respectively. They are three trainable parameters introduced by the student distribution assumption. l represents the number of feature dimensions, and d_k represents the Mahalanobis distance between deep features of class k. \Gamma(\cdot) is the Gamma function, which can be written as

\Gamma(s) = \int_{0}^{\infty} t^{s-1} e^{-t}\, dt.

According to [44], \Gamma(\cdot) can be defined on the whole complex plane. Since the inputs of \Gamma(\cdot) in the student distribution (\Gamma(\frac{n}{2}) and \Gamma(\frac{n+l}{2})) are positive, it is continuous and differentiable [44]. Therefore, \Gamma(\cdot) is capable of conducting forward inference and backpropagation during training, although an integral is included. Besides, we rewrite (6) and (7) as follows: 1) We make x and \mu lie on the ℓ2-norm ball to eliminate the influence of norms. 2) We introduce a hyperparameter ϕ to confine the lower bound of the freedom degree and ensure that the probability at the curve edge is not excessively large; in this paper, it is set to 0.1. 3) We eliminate the constant \pi^{l/2} and replace l with \ln l to ensure computational correctness for extremely high-dimensional inputs. 4) We assume \Sigma is the identity matrix I (the Mahalanobis distance degenerates to the euclidean distance) and the prior probability p(k) = 1/K. As such, the revised class probability function D and the posterior probability p_t(z \mid x_i) can be written as

D(x_i; k) = \frac{\Gamma\big(\frac{n_k + \ln l}{2}\big)}{\Gamma\big(\frac{n_k}{2}\big)\, n_k^{\ln l / 2}} \Big(1 + \frac{d_k}{n_k}\Big)^{-\frac{n_k + \ln l}{2}}, \qquad p_t(z = k \mid x_i) = \frac{D(x_i; k)}{\sum_{j=1}^{K} D(x_i; j)},

where the equal priors cancel. According to the above distribution embedding strategy, we propose our student loss L_T. Overall, it includes two parts:

L_T = L_{tds} + \lambda L_C.

The first term L_{tds} represents a discriminative loss; various losses can be further employed as L_{tds}. The second term L_C represents the center loss [45], which can be written as

L_C = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert_2^2.

Attributed to L_C, the student loss can detect the mislabeled samples. \lambda represents a weighting hyperparameter.
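The following is a minimal sketch of this distribution embedding, assuming one plausible reading of the modifications above: the softplus-based lower bound on the freedom degree, the batch-averaged center loss, and all shapes are our assumptions rather than the paper's verbatim formulas.

    import torch
    import torch.nn.functional as F

    def student_log_density(x, mu, n_raw, l_eff, phi=0.1):
        # Log of the class density D with Sigma = I, the constant dropped, and
        # l replaced by ln(l); d_k is the squared euclidean distance between
        # normalized features and class means.
        n = phi + F.softplus(n_raw)                  # one way to keep the freedom degree above phi
        d = torch.cdist(x, mu) ** 2                  # (N, K) distances d_k
        return (torch.lgamma((n + l_eff) / 2) - torch.lgamma(n / 2)
                - 0.5 * l_eff * torch.log(n)
                - 0.5 * (n + l_eff) * torch.log1p(d / n))

    N, l, K = 8, 128, 10
    x = F.normalize(torch.randn(N, l), dim=1)        # features on the l2-norm ball
    mu = F.normalize(torch.randn(K, l), dim=1)       # class means mu_k (made trainable in practice)
    n_raw = torch.zeros(K, requires_grad=True)       # trainable freedom degrees n_k
    z = torch.randint(0, K, (N,))
    l_eff = torch.log(torch.tensor(float(l)))        # ln(l) in place of l

    logD = student_log_density(x, mu, n_raw, l_eff)  # (N, K)
    log_post = logD - torch.logsumexp(logD, 1, keepdim=True)  # p_t(z|x_i); equal priors cancel
    L_tds = -log_post[torch.arange(N), z].mean()     # CCE on the student posterior
    L_C = 0.5 * ((x - mu[z]) ** 2).sum(-1).mean()    # center loss (averaged over the batch)
    L_T = L_tds + 0.05 * L_C                         # lambda = 0.05, one setting used in the paper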
In other words, our strategy is universal. Many popular losses, even the SOTA robust losses, can be strengthened to join the student loss family. For example, a student loss L_{T\text{-}CCE} based on CCE can be written as

L_{T\text{-}CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{I}(z_i = k) \log p_t(z = k \mid x_i) + \lambda L_C.

Furthermore, from a metric learning standpoint, the generalization of a deep model can be enhanced by increasing the inter-class margin [46], [47]. Similar to [48], an extra margin hyperparameter is introduced to restrain d_k of the labeled class, so that D generates into a margin-modified density G. We replace D with G to further develop the large-margin versions.
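Continuing the sketch above, the margin modification can be written as follows. We follow the margin trick of the GM loss [48] and enlarge d_k for the labeled class only; the exact form of G and the margin symbol m are assumptions of this sketch, not the paper's verbatim formula.

    m = 0.1                                          # the margin hyperparameter (symbol assumed)
    d = torch.cdist(x, mu) ** 2                      # distances d_k as above
    d_margin = d * (1 + m * F.one_hot(z, K).float()) # enlarge only the ground-truth class
    # Evaluating the density on d_margin in place of d yields the
    # large-margin student (LT) loss.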
Besides, since the class probability distribution function is redesigned, we give the decision function used at inference. Assuming a sample x from the test set appears, it is classified by the mapping

f(x) = \arg\max_{k \in [1, K]} p_t(z = k \mid x).
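In the sketch above, inference therefore reduces to an argmax over the student posterior (equivalently, over the class densities, since the priors are equal):

    y_hat = log_post.argmax(dim=1)                   # f(x) = argmax_k p_t(z = k | x)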

C. Theoretical Illustration
In this part, we attempt to explain the effect of our approach from a gradient perspective. Let x_m represent a mislabeled sample in the training set, and suppose we employ CCE to train a classifier of a neural network F. According to [49], its gradient \nabla L(\Theta_F) can be written as

\nabla L(\Theta_F) = (p_{x_m} - y_{x_m})^\top \nabla_{\Theta_F} f(x_m),

where p_{x_m} and y_{x_m} denote the prediction and the label of x_m, respectively, and \Theta_F denotes the trainable parameters in F. As can be seen, p_{x_m} - y_{x_m} is a significant term in inaccurate supervision. It presents a small value when the label is correct but a large one when it is incorrect. Accordingly, a mislabeled sample provides a much larger gradient than a clean sample during the convergence process, which leads to poor performance. According to the above analysis, the design of this scale term is the key to learning with noisy labels. As such, if we assume the deep features to follow the Gaussian distribution N(\mu, I), the class probability function D can be written as

D(x_m) = \frac{1}{(2\pi)^{l/2}} \exp\Big(-\frac{d(x_m)}{2}\Big), \qquad d(x_m) = \lVert F(x_m) - \mu \rVert_2^2,

where F denotes the projection before the distribution embedding, so that the class score further generates as N(F(x_m)). The gradient \nabla L(\Theta_F) can then be written as

\nabla L(\Theta_F) = \underbrace{s(x_m)}_{\text{small scale term}} (p_{x_m} - y_{x_m})^\top \nabla_{\Theta_F} F(x_m),

where the scale term s(x_m) shrinks as d(x_m) grows. We observe that p_{x_m} - y_{x_m} is limited somewhat, since the mislabeled sample x_m produces a large d(x_m) and makes s(x_m) a small value. Actually, this strategy is widely recognized as the Gaussian Mixture (GM) loss, which has already been demonstrated to be outstanding on clean samples [48]. Nevertheless, the GM loss has bottlenecks in learning with noisy labels. Although the scale term is relatively small, it is non-adaptive to various label noises. This defect makes the performance of the GM loss still unsatisfying under label noise, especially in hard-to-learn tasks or when disturbed by high label noise rates (see Section III-D for details).

Encouraged by the GM loss [48], we discover that this contradiction can be solved by assuming the deep features to follow the student distribution. According to Section III-B, the class score further generates as T(F(x_m)), and the gradient \nabla L(\Theta_F) becomes

\nabla L(\Theta_F) = \underbrace{H(n)\, s(x_m)}_{\text{small adaptive scale term}} (p_{x_m} - y_{x_m})^\top \nabla_{\Theta_F} F(x_m).

Equally, p_{x_m} - y_{x_m} is also limited by d(x_m) in the student distribution, making the scale term small. Furthermore, the term H(n) generates an adaptive scale: shifting n can dynamically adjust the gradient, providing a more tolerant convergence.
This proposition theoretically demonstrates the effectiveness of our approach from a gradient weighting perspective. It is also consistent with our motivation for introducing the long tail of the student distribution to hold the mislabeled samples.
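The damping effect can be checked numerically. The toy below is our illustration, not the paper's derivation: it uses a one-dimensional effective dimension, drops all constants, and compares the feature-gradient norm of a mislabeled sample when CCE is computed on a Gaussian embedding versus a student embedding.

    import torch
    import torch.nn.functional as F

    def grad_norm(x0, mu, z, log_density):
        # Gradient of CCE w.r.t. the feature when class scores are log-densities.
        x = x0.clone().requires_grad_(True)
        logD = log_density(x, mu)                    # (K,) class log-densities
        loss = F.cross_entropy(logD.unsqueeze(0), torch.tensor([z]))
        loss.backward()
        return x.grad.norm().item()

    gauss = lambda x, mu: -0.5 * ((x - mu) ** 2).sum(-1)      # N(mu, I), constants dropped
    student = lambda x, mu, n=1.0, l=1.0: \
        -0.5 * (n + l) * torch.log1p(((x - mu) ** 2).sum(-1) / n)

    mu = torch.tensor([[0.0, 0.0], [4.0, 0.0]])      # two class means
    x_wrong = torch.tensor([4.1, 0.0])               # feature near class 1, but labeled 0
    print(grad_norm(x_wrong, mu, 0, gauss))          # ~4.0: the mislabeled sample dominates
    print(grad_norm(x_wrong, mu, 0, student))        # ~0.25: the long tail damps its gradient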

D. Discussions
Why the Student Distribution? It is commonly documented that discriminative loss functions encourage gathering naturally similar samples while dispersing dissimilar ones. Usually, traditional losses achieve this purpose by learning a categorical template (weight vectors) and directly maximizing the cosine angle similarity between the sample and the template. Since categorical information can only be transmitted by the label, mislabeled samples produce intra-class inconsistency and finally result in the messy cluster shown in Fig. 1. In contrast to previous approaches, we rethink inaccurate supervision from the perspective of probability distributions and define the deep features of one noisy category to follow the student distribution. Under this assumption, the prior probability retained far from the center can tolerate this inconsistency. In other words, the long-tail property of the student distribution can generate extra strength to "absorb" most mislabeled samples and make different categories recognizable, even if they share the same label. Therefore, embedding the student distribution obtains outstanding performance in inaccurate supervision tasks.
Why Attach L_C? We design L_{tds} with the student distribution embedding to resist label noise. As for L_C in the formulation, two impacts are considered: 1) Similar to [48], it acts as a likelihood regularization term to limit the distance between an outlier and the class centroid \mu_k. This regularization is not unduly affected by the mislabeled samples since its influence can be adjusted by the hyperparameter \lambda. Actually, properly introducing the center loss into our student loss accelerates convergence and slightly enhances performance (see Section IV-E). 2) More importantly, as a mislabeled sample lies closer to the centroid of its natural category, the L_C of a clean sample and that of a mislabeled sample are extremely distinctive, which can be utilized to identify and even revise incorrect labels (see Section IV-A). Therefore, L_C is also indispensable in our strategy.
Why Use the Euclidean Distance? In our strategy, we normalize the features x and the means \mu to let them lie on the ℓ2-norm ball. The cosine angle distance therefore seems more suitable for our student loss in terms of decreasing computational expense. However, we still select the euclidean distance to measure similarity, for two reasons: 1) As shown in (6), (7), and (8), the euclidean distance (the Mahalanobis distance when the covariance matrix is the identity matrix) itself appears in the formula of the student distribution, which means that calculating it is unavoidable. In other words, even if we used the cosine angle distance to measure the difference, we would still need to calculate the euclidean distance to achieve the student distribution embedding, increasing the computational expense. Directly changing the metric inside the student distribution from the euclidean distance to the cosine angle distance could be statistically risky; it is safer to use the common and proven formula of the student distribution. 2) The euclidean distance and the cosine angle distance are consistent in their changes. In other words, they have a one-to-one relationship on the ℓ2-norm ball, so their effects are equivalent. Therefore, we believe that using the euclidean distance is a compromised but appropriate option, although the features have been normalized. This trade-off does not affect the validity of the student loss in learning with noisy labels. We hope to further explore introducing the cosine angle distance into our student distribution embedding to achieve simpler calculations in the future.
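The claimed one-to-one relationship is elementary: for unit-norm a and b, ||a − b||² = 2 − 2 cos(a, b), so the two metrics induce the same ordering on the ℓ2-norm ball. A quick numerical check (our illustration):

    import torch
    import torch.nn.functional as F

    a = F.normalize(torch.randn(5, 16), dim=1)
    b = F.normalize(torch.randn(5, 16), dim=1)
    sq_euc = ((a - b) ** 2).sum(-1)                  # squared euclidean distance
    cos = (a * b).sum(-1)                            # cosine similarity of unit vectors
    print(torch.allclose(sq_euc, 2 - 2 * cos, atol=1e-6))   # True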
GM Loss versus LT Loss: The GM loss [48] and the LT loss both consider feature representation from a probability perspective, and our LT loss draws on the ideas of the GM loss in some ways. Nonetheless, three items should be highlighted: 1) The authors in [48] make efforts in formula transformation for a better optimal solution and produce the final GM loss. Therefore, although the student distribution converges to the Gaussian distribution when n → +∞ [50], [51], [52], the GM loss [48] and our LT loss are different techniques in different fields: the GM loss is an outstanding method for learning with clean labels, while the LT loss is for learning with noisy labels. 2) In Section III-C, we theoretically reveal that the GM loss can limit the term p_{x_m} − y_{x_m} but is still weak in learning with noisy labels. Here, we provide some experimental evidence to further support this. As shown in Fig. 2(c) and (e), when the GM loss is used, some robustness can be observed when η = 0.2: mislabeled samples are distributed around clean samples with the same label. But it vanishes when η = 0.6. Nevertheless, as shown in Fig. 2(d) and (f), our LT loss remains data-selective under the same settings. 3) As shown in Table I, even when the covariance matrix is changed to adjust the sharpness of the Gaussian distribution, the GM loss underperforms our LT loss under label noise. Strikingly, the Gaussian distribution changes its sharpness only by adjusting the covariance matrix. This limits all samples with the same label (not only the clean but also the mislabeled) to disperse within three times the variance surrounding the mean (shown in Fig. 2(a)). Adjusting the covariance matrix also changes the tightness of clean samples and leads to chaos within the cluster, especially in hard-to-learn or high-noise tasks. On the contrary, owing to the introduction of the freedom degree, the student function can keep some probability at the edge while maintaining tightness in the center [50], [51], [52] (shown in Fig. 2(b)), making it data-selective. Therefore, the LT loss outperforms the GM loss in learning with noisy labels.

TABLE I TEST ACCURACY (%) OF THE GM LOSS, THE ASYMMETRIC LOSS, AND OUR LT LOSS ON BENCHMARK DATASETS WITH VARIOUS RATES η OF SYMMETRIC NOISY LABELS

Asymmetric Loss versus LT Loss: As mentioned above, designing ALFs [34], [35] can also be regarded as a general approach to resist noisy labels. Here, we attempt to specifically illustrate the difference from the LT loss. It is documented that ALFs are constructed to satisfy the Bayes-optimal condition. In [34], [35], the authors hope to find a simple and elegant way to reduce the risk, leading to a classifier with the same probability of misclassification as in the noise-free case. Following this consideration, several commonly used losses are transformed to be more asymmetric. Different from this, we do not focus on whether the structure of the loss function is asymmetric or not but directly assume the class probability distribution to be the student distribution. Owing to the trainable prior probability under the tail, the student loss harvests another strength to adaptively overcome the incorrect supervision by mislabeled samples. We have compared the improvements of both strategies, taking GCE as the base method. As shown in Table I, the improvements from these two general strategies are close in total. Our method could be superior on certain datasets (such as CIFAR10 in our experiments).
Generalization of the LT Loss: Usually, robust loss design refers to generating a specific loss to address the problems in learning with noisy labels. Different from them, we make assumptions about the feature representation and directly employ the prior distribution to construct the loss function. This strategy turns many losses previously sensitive to label noise into robust ones. In other words, we not only propose one robust loss in this paper; more importantly, we propose a paradigm for making common losses robust. In our experiments, we demonstrate that our approach can enhance many losses, even robust losses, to suppress label noise (see Section IV-B). This generalization is regarded as a piece of evidence for the advancement of our method.

IV. EXPERIMENTS
In this section, we first discuss various empirical understandings of our LT losses, using CCE and LT-CCE as examples, and compare the performance of our approach against noisy labels with other SOTA methods. Then, some ablation studies are further conducted. Our experiments are supported by six datasets, including MNIST [53], CIFAR-10 [54], and CIFAR-100 [54], and three real-world datasets: ANIMAL-10N [43], WebVision-50 [55], and the validation set of ImageNet ILSVRC12 [56].
Noise Setting: We analyze both symmetric and asymmetric noise. Symmetric noise is generated by uniformly translating a true label to a random label with probability η, and asymmetric noise is generated using rules that convert a true label to a given label with probability η. In our experiments, we produce asymmetric noise following [27], [29], [30]: translating 2 → 7, 3 → 8, 5 ↔ 6, and 7 → 1 for MNIST, and TRUCK → AUTOMOBILE, BIRD → AIRPLANE, DEER → HORSE, and CAT ↔ DOG for CIFAR-10. As for CIFAR-100, we first group the 100 classes into 20 super-classes, each containing 5 sub-classes, and then translate each class within the same super-class into the next in a circular fashion. For the empirical interpretations, symmetric noise with η ∈ [0.2, 0.8] is selected for testing. For the robustness evaluation, both symmetric noise with η ∈ [0.2, 0.8] and asymmetric noise with η ∈ [0.2, 0.4] are selected. It should be noted that the test sets are not injected with the same noise. The purpose of learning with noisy labels is to improve accuracy as much as possible when the training set contains many noisy labels; the test set should be high-quality and less error-prone to prove it.
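For reference, the following is a sketch of the two noise generators described above. This is our implementation; in particular, whether a symmetric flip may keep the original label varies across papers and is an assumption here.

    import numpy as np

    rng = np.random.default_rng(0)

    def symmetric_noise(labels, eta, K):
        # Flip each label to a uniformly drawn class with probability eta.
        noisy = labels.copy()
        flip = rng.random(len(labels)) < eta
        noisy[flip] = rng.integers(0, K, flip.sum())
        return noisy

    def asymmetric_noise(labels, eta, mapping):
        # Flip a label to its rule-based target with probability eta, e.g.
        # mapping = {2: 7, 3: 8, 5: 6, 6: 5, 7: 1} for MNIST.
        noisy = labels.copy()
        for src, dst in mapping.items():
            idx = np.where(labels == src)[0]         # masks taken on the clean labels
            noisy[idx[rng.random(len(idx)) < eta]] = dst
        return noisy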

A. Empirical Understandings
Experimental Setup: We build a toy model with two convolutional layers and two fully connected layers and evaluate some empirical understandings on MNIST. The experiment consists of two parts. First, we explore the feature representation of the LT loss in the penultimate layer; the dimension of the penultimate output is set to two for better visualization. Then, the convergence during training and the effect of L_C in the LT loss are further evaluated, with the dimension of the penultimate output set to 128. For our LT loss, the margin hyperparameter and λ are set to 0.3 and 0.05 under η ≤ 0.6, and to 0.1 and 0.01 under η > 0.6, respectively. All networks are trained using the Adam optimizer with a learning rate of 0.001, a weight decay of 5 × 10^-4, a batch size of 128, and cosine learning rate annealing. The total number of epochs is set to 50. The situations under various symmetric noises are picked for illustration.
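A sketch of the described toy network is given below. The exact widths and kernel sizes are not specified in the paper, so these are assumptions; with the LT loss, the final linear classifier is replaced by the distance-based student posterior.

    import torch.nn as nn

    toy = nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 2),      # penultimate output; 2-d for visualization
        nn.Linear(2, 10),              # classification head (dropped under the LT loss)
    )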
More Tolerant and Distinguishable Representation: The feature representation of the training set in the penultimate layer is shown in Fig. 3. As can be seen, the output features from various categories are dispersed according to their respective projected angles. When using CCE, the clusters are separable and clear under η = 0, while the areas of different clusters become more imbalanced as η increases, leading to the omission of some categories under η ≥ 0.4 (shown in Fig. 3(a), (b), (c), (d), and (e)). Inversely, the harvested representation of LT-CCE is obviously more tolerant and distinguishable, allowing the development of a complete and more acceptable representation even under extreme noise rates (shown in Fig. 3(f), (g), (h), and (i)).
To better illustrate, we take the noisy category "1" as an example and exhibit the distribution of mislabeled samples in the cluster. As shown in Fig. 4, following CCE training, the mislabeled samples present obvious overlap with clean samples under η ≤ 0.4 (shown in Fig. 4(a) and (b)), whereas larger overlap occurs under η > 0.4 (shown in Fig. 4(c) and (d)). Nevertheless, when using LT-CCE for training, nearly all clean samples gather tightly while the mislabeled samples scatter, and there is little overlap under all tested noise rates (shown in Fig. 4(e), (f), (g), and (h)). These results adequately reveal the validity of our strategy.
As a matter of fact, by introducing the prior hypothesis of the student distribution in feature representation to resist the chaos of noisy labels, LT-CCE produces more tolerant and clearer clusters than the original CCE. This is essential for improving performance in inaccurate supervision.

More Appropriate Convergence During Training: The accuracy curves of the training/test sets during training are shown in Fig. 5. It can be seen that CCE suffers from a serious overfitting problem: although the accuracy on the training set can reach a high level (nearly 100%) under all tested noise rates, the accuracy on the test set gradually decreases during convergence. This phenomenon is also mentioned in most of the literature [26], [27], [28], [29] and regarded as one of the main challenges in inaccurate supervision. As for our strategy, we observe that although the accuracy of LT-CCE on the training set is lower than that of CCE, it reaches a high level on the test set in all tested situations. Additionally, the training accuracy almost coincides with the rate of clean samples in the noisy cluster (1 − η), as shown in Fig. 5(a). This result reveals that the contributions of mislabeled samples towards convergence can be few. As deduced in Section III-C, our method diminishes the weights of mislabeled samples in the gradient by introducing the prior probabilistic assumption; the long-tail characteristic naturally resists the convergence of mislabeled samples. Therefore, the performance of LT-CCE is much greater than that of CCE at all tested noise rates (shown in Fig. 5(b)).

In other words, we not only theoretically but also empirically demonstrate that LT losses have strong convergence abilities on clean samples but weak ones on mislabeled samples. This reflects the outstanding data-selective characteristic of our approach, which is especially significant in inaccurate supervision.
Different Distribution Patterns on L_C: In Section III-D, we illustrate that incorrect labels can be detected by our LT loss. In our experiments, the densities of clean samples and mislabeled samples on L_C are further explored. Of note, we discover that the distribution of clean samples and that of mislabeled samples are entirely different. As shown in Fig. 6, the mislabeled sample presents a larger L_C. The distance between the two distributions is large under the low noise rate of 0.2, and there are clear boundaries in all tested situations. Apparently, with our LT losses, mislabeled samples lie close to the clusters of their natural categories but far away from others, which provides an opportunity to automatically identify and even relabel them according to L_C. This further reveals that L_C is indispensable to our LT losses. Since relabeling samples based on L_C requires constructing various rules that are not our focus, the results reported in this paper do not consider label correction by L_C; it is left to our future work. Additionally, it should be noted that since the density curves in Fig. 6 are obtained by filtering with a kernel function, they have some response on the negative axis of L_C; the values of L_C are non-negative in all experiments.

Fig. 6. Densities of clean and mislabeled samples on L_C under various noise rates η of symmetric noise. The distribution of clean samples and that of mislabeled samples are different, which reflects that our approach can automatically detect and even relabel mislabeled samples by L_C.
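The L_C-based detection described above can be sketched as follows; the paper does not fix a relabeling rule, so the cutoff below is purely illustrative.

    import torch
    import torch.nn.functional as F

    x = F.normalize(torch.randn(256, 128), dim=1)    # deep features (illustrative)
    mu = F.normalize(torch.randn(10, 128), dim=1)    # learned class means
    z = torch.randint(0, 10, (256,))                 # possibly noisy labels

    per_sample_LC = 0.5 * ((x - mu[z]) ** 2).sum(-1)     # per-sample center loss
    tau = per_sample_LC.median() + 2 * per_sample_LC.std()   # illustrative cutoff (our assumption)
    suspect = per_sample_LC > tau                    # candidates for inspection or relabeling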

B. Robustness Evaluation With Other Robust Losses
Baselines: We compare our LT loss with five SOTA robust losses as well as the CCE loss: (1) GCE [27]: a training loss combining the MAE and CCE losses; (2) SCE [29]: a training loss combining the RCE and CCE losses, inspired by the symmetry of the Kullback-Leibler divergence; (3) APL [30]: a training loss combining active losses with one passive loss; we select the best-reported combination, Normalized Cross-Entropy (NCE) + RCE, as our baseline; (4) JNPL [32]: a training loss regarded as an enhanced version of NLNL [28]; (5) JS [33]: a training loss adopting the generalized Jensen-Shannon divergence.

Experimental Setup: We attempt to observe the variations when the baselines are strengthened by our method on the MNIST, CIFAR10, and CIFAR100 datasets. Since the reported results of the baselines in the literature are generated in different experimental environments (different models, different noise settings, etc.), we reproduce them in our experiments for fairness of comparison. The hyperparameter settings of the baselines are consistent with their literature and released repositories [27], [29], [30], [32], [33]. Experiments are conducted with a two-layer CNN for MNIST, a six-layer CNN for CIFAR10 (used in the experiments on empirical understandings), and ResNet34 for CIFAR100. The epochs are set to 50, 120, and 200, respectively. All networks are trained using the Adam optimizer with a learning rate of 0.001, a weight decay of 5 × 10^-4, a batch size of 128, a gradient clip of five, and cosine learning rate annealing in all experiments. For MNIST, the margin hyperparameter and λ are set to 0.3 and 0.05 under η ∈ [0.2, 0.6], and to 0.1 and 0.01 under η = 0.8 in the symmetric noise experiments, respectively; they are set to 0.3 and 0.05 in the asymmetric noise experiments. For CIFAR10, the margin hyperparameter and λ are set to 0.1 and 0.05 under η ∈ [0.2, 0.6], and to 0.01 and 0.001 under η = 0.8 in the symmetric noise experiments, respectively; they are set to 0.1 and 0.05 in the asymmetric noise experiments. For CIFAR100, the margin hyperparameter and λ are set to 0.05 and 0.05 under η ∈ [0.2, 0.4], and to 0.01 and 0.001 under η ∈ [0.6, 0.8] in the symmetric noise experiments, respectively; in the asymmetric noise experiments, they are set to 0.01 and 0.005 when using CCE, GCE, SCE, and JNPL for training, and to 0.05 and 0.05 when using APL and JS. Similar to [34], [35], the hyperparameter settings of the base methods are adjusted to adapt to our enhanced strategy for better performance. As such, let λ_J represent λ in [32]. For LT-GCE, q in GCE [27] is set to 0.01 on MNIST and CIFAR10, while it is set to 0.001 on CIFAR100. For LT-SCE, the original hyperparameter setting is consistent with [29] on all three datasets. For LT-APL, we set α and β in APL [30] to one and ten on all three datasets. For LT-JNPL, we set λ_J in JNPL [32] to one on all three datasets. For LT-JS, we set w_1 in JS [33] to 0.1 on all three datasets.
Results: The classification accuracy is reported in Table II. As can be seen, the LT losses outperform the baselines in most situations, with various improvements larger than 20%. On MNIST, the largest gap is 62.79% (85.80% − 23.01%), appearing when employing CCE and LT-CCE with η = 0.8.

TABLE II TEST ACCURACY (%) OF VARIOUS ROBUST LOSSES AND THEIR ENHANCED VERSIONS ON BENCHMARK DATASETS UNDER VARIOUS RATES η OF SYMMETRIC AND ASYMMETRIC NOISE
On CIFAR10, the largest gap is 37.87% (79.76% − 41.89%), appearing when employing CCE and LT-CCE with η = 0.6. On CIFAR100, the largest gap is 23.31% (46.02% − 22.71%), appearing when employing CCE and LT-CCE with η = 0.6. These results demonstrate the validity of our approach. Furthermore, the effect of the LT loss under symmetric noise is on average better than that under asymmetric noise; negative influences are even observed under asymmetric noise in a few cases (such as some experiments on CIFAR100). We conjecture that the randomness of symmetric noise is more in line with the unbiased characteristic of the student distribution, which makes our approach better at handling symmetric noise. Specific solutions for this degeneration will be further explored in the future.
Meanwhile, with different base methods, we discover that the influence of their hyperparameter settings on our approach differs. As shown in Table III, the original hyperparameter settings of JNPL have little impact on the outcomes in all tested situations, but the settings of the other tested methods (GCE, APL, and JS) affect the final performance of our approach. Furthermore, the influences appear to differ under various noise rates and datasets. We attribute this phenomenon to cooperative interaction: since the original robust loss is employed in tandem with our approach, the optimal hyperparameter settings change and are no longer the values reported in their literature. The authors in [34], [35] focused on this issue and changed the original settings when using their approaches. We also offer empirical settings of the tested base methods in the setup paragraphs to better utilize our strategy. Besides, recent advances in explainable deep learning push the development of feature visualization for classification decisions. We employ Grad-CAM [65] to explore the differences in feature extraction between the baseline and our method (using CCE and LT-CCE as an example) under various noise rates on CIFAR10. The experimental conditions are as above, and the visualized results are shown in Fig. 7. As can be seen, label noise undoubtedly affects the accuracy of feature extraction, especially for the baseline: the areas of interest for the baseline are mostly incorrect and unstable under noisy labels. However, the features extracted by our method are relatively exact and stable under various noise rates. These results also illustrate the advancement of our approach from the perspective of explainable feature extraction.

C. Robustness Evaluation With Other SOTA Methods
Baselines: To further illustrate the generalization of our approach, we also select three approaches that do not include a robust loss strategy as baselines and employ our LT-CCE to enhance them: (1) Bootstrapping [41]: a label refurbishment strategy based on the prediction output; we select its soft version in the experiments; (2) Co-teaching [19]: a sample selection strategy using two DNNs following the small-loss trick; (3) SPR [57]: a scalable penalized regression strategy to detect label noise.
Experimental Setup: We use the same experimental settings as in Section IV-B and observe the influence of symmetric noise with η ∈ [0.2, 0.6] and asymmetric noise with η ∈ [0.2, 0.4]. The hyperparameter settings of the baselines are consistent with their literature and released repositories [19], [41], [57]. Since sparse regularization [66] can be regarded as an independent technique for noisy labels, in the experiments of SPR and its enhanced version, we do not employ the related code in the repository of SPR, for a clearer comparison. As for the hyperparameter settings of our LT-CCE, on MNIST, we set the margin hyperparameter and λ to 0.3 and 0.05 in all experiments. On CIFAR10, when using Bootstrapping, we set them to 0.01 under both symmetric and asymmetric noise. When using Co-teaching, we set them to 0.1 and 0.05 under symmetric noise, and to 1e-8 and 0.5 under asymmetric noise. When using SPR, we set them to 0.01 under symmetric noise of η < 0.6, and to 0.001 and 0.5 under η = 0.6; we set the margin hyperparameter to 0.001, 0.01, and 0.001 and λ to one, 0.001, and 0.05 under asymmetric noise of η = 0.2, η = 0.3, and η = 0.4, respectively. On CIFAR100, when using Bootstrapping, we set them to 0.01 under symmetric noise of η < 0.6, and to 0.01 and 0.001 under η = 0.6; they are set to 0.01 under asymmetric noise. When using Co-teaching, we set them to 0.05 under symmetric noise of η < 0.6, and to 0.01 and 1e-4 under η = 0.6; they are set to 1e-4 and 0.05 under asymmetric noise. When using SPR, we set them to 0.01 and 0.005 under symmetric noise of both η < 0.6 and η = 0.6, as well as under asymmetric noise.
Results: The classification accuracy is reported in Table IV. It is obvious that the enhanced approaches harvest better results in most cases. The largest gap is 47.12% (97.75% − 50.63%), appearing when employing CCE and LT-CCE with η = 0.6. Similar to the previous comparisons in Table II, the performance seemingly degrades in some asymmetric experiments (such as the experiments of SPR on CIFAR10). We again suspect that the reason is the unbiased characteristic of the student distribution; its solution is one of our main future works.
To sum up, the LT loss family produces better performance than the original SOTA versions on benchmark datasets in most cases, which demonstrates the generalization and advancement of our approach in inaccurate supervision.

D. Experiments on Real-World Noisy Dataset
Next, we evaluate the performance of our LT losses on some real-world noisy datasets. Specifically, ANIMAL-10N, WebVision-50, and ImageNet ILSVRC12 are explored. ANIMAL-10N [43] contains 10 animals with confusing appearances; the estimated label noise rate is around 8%, with 50,000 training images and 5,000 testing images. WebVision [55] contains 2.4 million images in 1,000 categories, the same as ImageNet ILSVRC12; the estimated label noise rate is around 20%. Similar to [30], [57], [61], [62], [63], the first 50 categories of the Google image subset (WebVision-50) are selected as the training data, and we evaluate on both the WebVision and ILSVRC12 validation sets of the same categories. In our experiments, 20 SOTA approaches for learning with noisy labels, as well as CCE, are employed as baselines for comparison.
Experimental Setup: We employ LT-CCE, LT-GCE, and LT-SCE as examples in this part. When using GCE, we set q to 0.001. When using SCE, we set α = 6 and β = 0.1. For ANIMAL-10N, we additionally conduct the combined method SPR + LT-CCE for comparison. The VGG19-BN backbone is applied for training, with the batch size and epochs set to 128 and 200, respectively. The margin hyperparameter and λ are set to 0.05 and 1.5 when using SPR + LT-CCE for training, and to 0.01 and 0.05 in the other experiments. For WebVision-50, the Inception-ResNet backbone is utilized, with the batch size and epochs set to 32 and 200, respectively; the margin hyperparameter and λ are set to 0.01 and 0.001. All networks are trained using the Stochastic Gradient Descent (SGD) optimizer with cosine learning rate annealing. The weight decay is set to 1 × 10^-3 for ANIMAL-10N and 5 × 10^-4 for WebVision-50. The learning rate is set to 0.1 for ANIMAL-10N and 0.01 for WebVision-50. Besides, Random Crop, Random Horizontal Flip, and CutMix are picked as data augmentation strategies.
Results: The classification accuracy on the real-world datasets is reported in Tables V and VI. As can be seen, compared to their original versions, LT-CCE, LT-GCE, and LT-SCE obtain much greater performance. The gap between CCE and LT-CCE on ANIMAL-10N is 6.37% (85.77% − 79.40%). The gaps in top-1 accuracy between CCE, GCE, and SCE and their enhanced versions on ILSVRC12 are 13.80% (72.68% − 58.88%), 19.34% (73.02% − 53.68%), and 11.32% (73.08% − 61.76%), respectively. These improvements demonstrate the effectiveness and generalization of our approach in strengthening other SOTA methods. Moreover, the LT losses outperform the other SOTA strategies in most cases. On ANIMAL-10N, SPR + LT-CCE is better than all other tested baselines. On WebVision-50 and ILSVRC12, except for the top-1 accuracy on ILSVRC12, our approach yields the best result.
Additionally, some findings should be further discussed. In Table VI, we observe that the tested LT losses obtain the best performances on WebVision-50 in most cases, especially LT-SCE, whose results are significantly better than the best of the baselines after a single-sample t-test. However, the improvement of our method seems ambiguous compared to the best baseline result on ILSVRC12. The possible reasons are as follows: 1) The performance of our LT loss highly depends on the base method, and the tested base methods of CCE, GCE, and SCE in this experiment would be regarded as early solutions in this field, with some bottlenecks of their own. 2) Following other literature [30], [57], [61], [62], [63], we train our models on WebVision-50 and directly test them on the ILSVRC12 validation set; their samples actually belong to two sub-distributions, and the ability of our approach to accommodate various data distributions may be inadequate. More in-depth work and solutions will be conducted in the future.
As a result, the performance of our approach on these datasets also reveals that our LT losses remain resistant to label noise from the real world and can be a competitive solution compared to other SOTA approaches.

E. Ablation Studies
Finally, to further explore the influence of hyperparameter settings, some ablation studies are conducted. We train the model using LT-CCE on CIFAR10 and take it as an example for illustration. Additionally, in our approach, features and means are normalized to overcome the influence of norms. We also notice that some works [67], [68] empirically set the radius of the ℓ2-norm ball to 64. To explore the effect of the radius, we select LT-CCE as an example and proceed with some ablation studies on MNIST, CIFAR10, and CIFAR100.
Experimental Setup: For the studies of hyperparameter settings, we observe the variations under a low symmetric noise rate of 0.2 and a high symmetric noise rate of 0.6. When changing the margin hyperparameter, we set λ to 0.05 and vary the margin over zero, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5. When changing λ, we set the margin hyperparameter to 0.1 and vary λ over zero, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, and 0.5. For the studies of the radius, we also observe η = 0.2 and η = 0.6, with the radius set to 4, 8, 16, 32, 64, and 128, respectively. Since the works in [67], [68] do not employ a metric learning strategy to enhance training, we set λ to a small value of 1e-8 for consistency. The margin hyperparameter is set to 0.3 and 0.1 for MNIST and CIFAR10, respectively; for CIFAR100, we set it to 0.05 when η = 0.2 and 0.01 when η = 0.6. All networks are trained using the Adam optimizer with a learning rate of 0.001, a weight decay of 5 × 10^-4, a batch size of 128, and a gradient clip of five in all experiments. The network architectures are consistent with those in Section IV-B.
Influences of the Margin Hyperparameter and λ: As shown in Fig. 8(a) and (c), we observe that setting the margin to a large value causes underfitting, which is more obvious under the high noise rate of 0.6. In fact, employing the metric learning strategy increases the difficulty of classification, especially under inaccurate supervision; however, with a suitable setting it can improve performance under high-noise conditions. We discover slight overfitting under η = 0.6 when the margin is set to zero, while setting it to a small value successfully fights the degradation. Of note, the overfitting problem is common in inaccurate supervision and becomes more acute as η increases [29], [30]; introducing metric learning with a small weight can effectively restrain it under high noise rates. As for λ, shown in Fig. 8(b) and (d), we observe that L_C can improve the speed of convergence and slightly enhance performance under small label noise. These results support our description in Section III-D. However, the overfitting problem is also exposed when λ is set to a large value. Therefore, we recommend setting the margin hyperparameter and λ to relatively small values according to the noise rate.
Influences of the Hyperspherical Radius: As shown in Table VII, the performances are very close under the various radius settings on all tested datasets. In addition, the best results in different situations appear under different radius settings, without any clear regularity. Therefore, we do not offer an empirical value for the radius; in our experiments, it is set to one. Enlarging the radius yields little improvement in our approach.
V. CONCLUSION

This study set out to improve the robustness of deep learning with noisy labels and indicated, not only empirically but also theoretically, that assuming identically labeled deep features to follow the student distribution can yield promising performance. The analysis of the feature representation undertaken here has extended the knowledge of existing robust losses and allowed us to create a family of new losses called student losses. Strikingly, the sharp shift in the probability distribution makes the student loss naturally data-selective, and various losses can be strengthened into student losses. We further introduced some metric learning strategies and developed the LT loss. Experiments on both benchmark and real-world datasets demonstrated that the LT loss outperforms the tested baselines in most cases. Overall, we believe the LT loss is an up-and-coming perspective in inaccurate supervision and will become a popular technique to deal with noisy labels.

Fig. 2. Differences between the GM loss and our LT loss. (a) and (b) are computer simulations of the Gaussian distribution and the student distribution; a redder contour means a higher probability density. (c), (d), (e), and (f) show feature representations of the noisy category '1' in MNIST under various noise rates η of symmetric noise; red and black points denote mislabeled and clean samples, respectively. We do not normalize features and means in the LT loss for intuitive comparison.

Fig. 3. Feature representations using CCE and LT-CCE under various noise rates η of symmetric noise. Our approach can harvest more robust clusters under η > 0.

Fig. 4. Feature representations of the noisy category '1' in MNIST using CCE and LT-CCE under various noise rates η of symmetric noise. It is obvious that LT-CCE can distinguish between clean samples and mislabeled samples even if they share the same label.

Fig. 5. Accuracy curves during training. (a) shows the accuracy on the training set under various noise rates η as the epochs increase, and (b) shows the accuracy on the test set. As can be seen, our strategy can effectively overcome the overfitting caused by label errors.

Fig. 7. Feature visualizations of model predictions using Grad-CAM on CIFAR10. The red/blue areas have larger/smaller weights for the predictions. It is obvious that our LT loss can obtain more robust and exact features than the baseline under various noise rates η.

Fig. 8. Accuracy curves on the test set of CIFAR10 during training under various noise rates η of symmetric noise with different hyperparameter settings.

TABLE III TEST ACCURACY (%) OF OUR LT LOSSES ON BENCHMARK DATASETS WITH VARIOUS HYPERPARAMETER SETTINGS OF BASE METHODS

TABLE IV TEST ACCURACY (%) OF OTHER ROBUST LEARNING METHODS AND THEIR ENHANCED VERSIONS ON BENCHMARK DATASETS UNDER VARIOUS RATES η OF SYMMETRIC AND ASYMMETRIC NOISE

TABLE V TEST ACCURACY (%) OF DIFFERENT METHODS ON THE ANIMAL-10N DATASET

TABLE VI TEST ACCURACY (%) OF DIFFERENT METHODS ON THE WEBVISION-50 AND ILSVRC12 VALIDATION SETS

TABLE VII TEST ACCURACY (%) OF OUR LT LOSS ON BENCHMARK DATASETS WITH VARIOUS HYPERSPHERE RADII