# Difference between revisions of "User:Yktan"

(→Introduction) |
|||

Line 2: | Line 2: | ||

== Introduction == | == Introduction == | ||

− | Much of the success in training deep neural networks (DNNs) is | + | Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is a both time-consuming and expensive task, especially for the data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators. |

There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. | There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples. |

## Revision as of 16:46, 29 November 2020

## Contents

## Introduction

Much of the success in training deep neural networks (DNNs) is due to the collection of large datasets with human-annotated labels. However, human annotation is a both time-consuming and expensive task, especially for the data that requires expertise such as medical data. Furthermore, certain datasets will be noisy due to the biases introduced by different annotators.

There are a few existing approaches to use datasets with noisy labels. In learning with noisy labels (LNL), most methods take a loss correction approach. Other LNL methods estimate a noise transition matrix and employ it to correct the loss function. An example of a popular loss correction approach is the bootstrapping loss approach. Another approach to reduce annotation cost is semi-supervised learning (SSL), where the training data consists of labeled and unlabeled samples.

This paper introduces DivideMix, which combines approaches from LNL and SSL. One unique thing about DivideMix is that it discards sample labels that are highly likely to be noisy and leverages these noisy samples as unlabeled data instead. This prevents the model from overfitting and improves generalization performance. Key contributions of this work are: 1) Co-divide, which trains two networks simultaneously, aims to improve generalization and avoiding confirmation bias. 2) During the SSL phase, an improvement is made on an existing method (MixMatch) by combining it with another method (MixUp). 3) Significant improvements to state-of-the-art results on multiple conditions are experimentally shown while using DivideMix. Extensive ablation study and qualitative results are also shown to examine the effect of different components.

## Motivation

While much has been achieved in training DNNs with noisy labels and SSL methods individually, not much progress has been made in exploring their underlying connections and building on top of the two approaches simultaneously.

Existing LNL methods aim to correct the loss function by:

- Treating all samples equally and correcting loss explicitly or implicitly through relabeling of the noisy samples
- Reweighting training samples or separating clean and noisy samples, which results in correction of the loss function

A few examples of LNL methods include:

- Estimating the noise transition matrix, which denote the probability of clean labels flipping to noisy labels, to correct the loss function
- Leveraging DNNs to correct labels and using them to modify the loss
- Reweighting samples so that noisy labels contribute less to the loss

However, these methods each have some downsides. For example, it is very challenging to correctly estimate the noise transition matrix in the first method; for the second method, DNNs tend to overfit to datasets with high noise ratio; for the third method, we need to be able to identify clean samples, which has also proven to be challenging.

On the other hand, SSL methods mostly leverage unlabeled data using regularization to improve model performance. A recently proposed method, MixMatch incorporates the two classes of regularization – consistency regularization which enforces the model to produce consistent predictions on augmented input data, and entropy minimization which encourages the model to give high-confidence predictions on unlabeled data, as well as MixUp regularization.

DivideMix partially adopts LNL in that it removes the labels that are highly likely to be noisy by using co-divide to avoid the confirmation bias problem. It then utilizes the noisy samples as unlabeled data and adopts an improved version of MixMatch (SSL) which accounts for the label noise during the label co-refinement and co-guessing phase. By incorporating SSL techniques into LNL and taking the best of both worlds, DivideMix aims to produce highly promising results in training DNNs by better addressing the confirmation bias problem, more accurately distinguishing and utilizing noisy samples, and performing well under high levels of noise.

## Model Architecture and Algorithm

DivideMix leverages semi-supervised learning to achieve effective modeling. The sample is first split into a labeled set and an unlabeled set. This is achieved by fitting a Gaussian Mixture Model as a per-sample loss distribution. The unlabeled set is made up of data points with discarded labels deemed noisy. Then, to avoid confirmation bias, which is typical when a model is self-training, two models are being trained simultaneously to filter error for each other. This is done by dividing the data using one model and then training the other model. This algorithm, known as Co-divide, keeps the two networks from converging when training, which avoids the bias from occurring. Being diverged also offers the two networks distrinct abilities to filter different types of error, making the model more robust to noise. Figure 1 describes the algorithm in graphical form.

For each epoch, the network divides the dataset into a labeled set consisting of clean data, and an unlabeled set consisting of noisy data, which is then used as training data for the other network, where training is done in mini-batches. For each batch of the labelled samples, co-refinement is performed by using the ground truth label [math] y_b [/math], the predicted label [math] p_b [/math], and the posterior is used as the weight, [math] w_b [/math].

Then, a sharpening function is implemented on this weighted sum to produce the estimate, [math] \hat{y}_b [/math] Using all these predicted labels, the unlabeled samples will then be assigned a "co-guessed" label, which should produce a more accurate prediction. [math] \hat{y}_b=Sharpen(\bar{y}_b,T)={\bar{y}^{c{\frac{1}{T}}}_b}/{\sum_{c=1}^C\bar{y}^{c{\frac{1}{T}}}_b} [/math], for c = 1, 2,....., C. Having calculated all these labels, MixMatch is applied to the combined mini-batch of labeled, [math] \hat{X} [/math] and unlabeled data, [math] \hat{U} [/math], where, for a pair of samples and their labels, one new sample and new label is produced. More specifically, for a pair of samples [math] (x_1,x_2) [/math] and their labels [math] (p_1,p_2) [/math], the mixed sample [math] (x',p') [/math] is:

[math] \begin{alignat}{2} \lambda &\sim Beta(\alpha, \alpha) \\ \lambda ' &= max(\lambda, 1 - \lambda) \\ x' &= \lambda ' x_1 + (1 - \lambda ' ) x_2 \\ p' &= \lambda ' p_1 + (1 - \lambda ' ) p_2 \\ \end{alignat} [/math]

MixMatch transforms [math] \hat{X} [/math] and [math] \hat{U} [/math] into [math] X' [/math] and [math] U' [/math]. Then, the loss on [math] X' [/math], [math] L_X [/math] (Cross-entropy loss) and the loss on [math] U' [/math], [math] L_U [/math] (Mean Squared Error) are calculated. A regularization term, [math] L_{reg} [/math], is introduced to regularize the model's average output across all samples in the mini-batch. Then, the total loss is calculated as:

where [math] \lambda_r [/math] is set to 1, and [math] \lambda_u [/math] is used to control the unsupervised loss.

Lastly, the stochastic gradient descent formula is updated with the calculated loss, [math] L [/math], and the estimated parameters, [math] \boldsymbol{ \theta } [/math].

The full algorithm is shown below.The when the model is warmed up, it is trained on all data using standard cross-entropy to initially converge the model, but with a regulatory negative entropy term [math]\mathcal{H} = -\sum_{c}\text{p}^\text{c}_\text{model}(x;\theta)\log(\text{p}^\text{c}_\text{model}(x;\theta))[/math], where [math]\text{p}^\text{c}_\text{model}[/math] is the softmax output probability for class c. This term penalizes confident predictions during the warm up to prevent overfitting to noise during the warm up, which can happen when there is asymmetric noise.

## Results

**Applications**

The method was validated using four benchmark datasets: CIFAR-10, CIFAR100 (Krizhevsky & Hinton, 2009) which contain 50K training images and 10K test images of size 32 × 32), Clothing1M (Xiao et al., 2015), and WebVision (Li et al., 2017a). Two types of label noise are used in the experiments: symmetric and asymmetric. An 18-layer PreAct Resnet (He et al., 2016) is trained using SGD with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 128. The network is trained for 300 epochs. The initial learning rate was set to 0.02, and reduced by a factor of 10 after 150 epochs. Before applying the Co-divide and MixMatch strategies, the models were first independently trained over the entire dataset using cross-entropy loss during a "warm-up" period. Initially, training the models in this way prepares a more regular distribution of losses to improve upon in subsequent epochs. The warm-up period is 10 epochs for CIFAR-10 and 30 epochs for CIFAR-100. For all CIFAR experiments, we use the same hyperparameters M = 2, T = 0.5, and α = 4. τ is set as 0.5 except for 90% noise ratio when it is set as 0.6.

**Comparison of State-of-the-Art Methods**

The effectiveness of DivideMix was shown by comparing the test accuracy with the most recent state-of-the-art methods: Meta-Learning (Li et al., 2019) proposes a gradient-based method to find model parameters that are more noise-tolerant; Joint-Optim (Tanaka et al., 2018) and P-correction (Yi & Wu, 2019) jointly optimize the sample labels and the network parameters; M-correction (Arazo et al., 2019) models sample loss with BMM and apply MixUp. The following are the results on CIFAR-10 and CIFAR-100 with different levels of symmetric label noise ranging from 20% to 90%. Both the best test accuracy across all epochs and the averaged test accuracy over the last 10 epochs were recorded in the following table:

From table 1, the author noticed that none of these methods can consistently outperform others across different datasets. M-correction excels at symmetric noise, whereas Meta-Learning performs better for asymmetric noise. DivideMix outperforms state-of-the-art methods by a large margin across all noise ratios. The improvement is substantial (∼10% of accuracy) for the more challenging CIFAR-100 with high noise ratios.

DivideMix was compared with the state-of-the-art methods with the other two datasets: Clothing1M and WebVision. It also shows that DivideMix consistently outperforms state-of-the-art methods across all datasets with different types of label noise. For WebVision, DivideMix achieves more than 12% improvement in top-1 accuracy.

**Ablation Study**

The effect of removing different components to provide insights into what makes DivideMix successful. We analyze the results in Table 5 as follows.

The authors find that both label refinement and input augmentation are beneficial for DivideMix. *Label refinement* is important for high noise ratio due because samples that are more noisy would be incorrectly divided into the labeled set. *Augmentation* upgrades model performance by creating more reliable predictions and by achieving consistency regularization. In addition, the performance drop seen in the *DivideMix w/o co-training* highlights the disadvantage of self-training; the model still has dataset division, label refinement and label guessing, but they are all performed by the same model.

## Conclusion

This paper provides a new and effective algorithm for learning with noisy labels by leveraging SSL. The DivideMix method trains two networks simultaneously and utilizes co-guessing and co-labeling effectively, therefore it is a robust approach to dealing with noise in datasets. DivideMix has also been tested using various datasets with the results consistently being one of the best when compared to other advanced methods.

Future work of DivideMix is to create an adaptation for other applications such as Natural Language Processing, and incorporating the ideas of SSL and LNL into DivideMix architecture.

## Critiques/ Insights

1. While combining both models makes the result better, the author did not show the relative time increase using this new combined methodology, which is very crucial considering training a large amount of data, especially for images. In addition, it seems that the author did not perform much on hyperparameters tuning for the combined model.

2. There is an interesting insight, which is when the noise ratio increases from 80% to 90%, the accuracy of DivideMix drops dramatically in both datasets.

3. There should be a further explanation of why the learning rate drops by a factor of 10 after 150 epochs.

4. It would be interesting to see the effectiveness of this method in other domains such as NLP. I am not aware of noisy training datasets available in NLP, but surely this is an important area to focus on, as much of the available data is collected from noisy sources from the web.

5. The paper implicitly assumes that a Gaussian mixture model (GMM) is sufficiently capable of identifying noise. Given the nature of a GMM, it would work well for noise that is distributed by a Gaussian distribution but for all other noise, it would probably be only asymptotic. The paper should present theoretical results on the noise that are Exponential, Rayleigh, etc. This is particularly important because the experiments were done on massive datasets, but they do not directly address the case when there are not many data points.

6. Comparing the training result on these benchmark datasets makes the algorithm quite comprehensive. This is a very insightful idea to maintain two networks to avoid bias from occurring.

## References

Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In ICML, pp. 312–321, 2019.

David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. NeurIPS, 2019.

Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach to learning from noisy labels. In WACV, pp. 1215–1224, 2018.