Learning with not Enough Data Part 1: Semi-Supervised Learning
The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data in conjunction with a small amount of labeled data.
When facing a limited amount of labeled data for supervised learning tasks, four approaches are commonly discussed.
 Pre-training + fine-tuning: Pre-train a powerful task-agnostic model on a large unsupervised data corpus, e.g. pre-training LMs on free text, or pre-training vision models on unlabelled images via self-supervised learning, and then fine-tune it on the downstream task with a small set of labeled samples.
 Semi-supervised learning: Learn from the labelled and unlabeled samples together. A lot of research has happened on vision tasks within this approach.
 Active learning: Labeling is expensive, but we still want to collect more given a cost budget. Active learning learns to select the most valuable unlabeled samples to be labeled next and helps us act smartly with a limited budget.
 Pre-training + dataset auto-generation: Given a capable pre-trained model, we can utilize it to auto-generate a lot more labeled samples. This has been especially popular within the language domain, driven by the success of few-shot learning.
I plan to write a series of posts on the topic of “Learning with not enough data”. Part 1 is on Semi-Supervised Learning.
 What is semi-supervised learning?
 Notations
 Hypotheses
 Consistency Regularization
 Pseudo Labeling
 Pseudo Labeling with Consistency Regularization
 Combined with Powerful Pre-Training
 References
What is semi-supervised learning?
Semi-supervised learning uses both labeled and unlabeled data to train a model.
Interestingly, most existing literature on semi-supervised learning focuses on vision tasks, whereas pre-training + fine-tuning is the more common paradigm for language tasks.
All the methods introduced in this post have a loss combining two parts: \(\mathcal{L} = \mathcal{L}_s + \mu(t) \mathcal{L}_u\). The supervised loss \(\mathcal{L}_s\) is easy to get given all the labeled examples. We will focus on how the unsupervised loss \(\mathcal{L}_u\) is designed. A common choice of the weighting term \(\mu(t)\) is a ramp function increasing the importance of \(\mathcal{L}_u\) in time, where \(t\) is the training step.
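As one illustration of such a ramp, the sketch below uses a Gaussian-shaped ramp-up \(\exp(-5(1 - t/t_\text{ramp})^2)\), a shape often seen in consistency-training code. The names `unsup_weight`, `ramp_steps` and `max_weight` are assumptions made for this example, not settings from any specific paper.

```python
import math

def unsup_weight(step: int, ramp_steps: int = 1000, max_weight: float = 1.0) -> float:
    """Weight mu(t) for the unsupervised loss, rising smoothly towards max_weight."""
    if ramp_steps <= 0:
        return max_weight
    # Fraction of the ramp completed, clipped to [0, 1].
    progress = min(max(step / ramp_steps, 0.0), 1.0)
    return max_weight * math.exp(-5.0 * (1.0 - progress) ** 2)

def total_loss(sup_loss: float, unsup_loss: float, step: int) -> float:
    """L = L_s + mu(t) * L_u."""
    return sup_loss + unsup_weight(step) * unsup_loss
```

Early in training the unsupervised term is almost switched off, which keeps noisy targets from an untrained model from dominating the gradient.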
Disclaimer: The post is not gonna cover semi-supervised methods with a focus on model architecture modification. Check this survey for how to use generative models and graph-based methods in semi-supervised learning.
Notations
| Symbol | Meaning |
| --- | --- |
| \(L\) | Number of unique labels. |
| \((\mathbf{x}^l, y) \sim \mathcal{X}, y \in \{0, 1\}^L\) | Labeled dataset. \(y\) is a one-hot representation of the true label. |
| \(\mathbf{u} \sim \mathcal{U}\) | Unlabeled dataset. |
| \(\mathcal{D} = \mathcal{X} \cup \mathcal{U}\) | The entire dataset, including both labeled and unlabeled examples. |
| \(\mathbf{x}\) | Any sample which can be either labeled or unlabeled. |
| \(\bar{\mathbf{x}}\) | \(\mathbf{x}\) with augmentation applied. |
| \(\mathbf{x}_i\) | The \(i\)-th sample. |
| \(\mathcal{L}\), \(\mathcal{L}_s\), \(\mathcal{L}_u\) | Loss, supervised loss, and unsupervised loss. |
| \(\mu(t)\) | The unsupervised loss weight, increasing in time. |
| \(p(y \vert \mathbf{x}), p_\theta(y \vert \mathbf{x})\) | The conditional probability over the label set given the input. |
| \(f_\theta(.)\) | The implemented neural network with weights \(\theta\), the model that we want to train. |
| \(\mathbf{z} = f_\theta(\mathbf{x})\) | A vector of logits output by \(f\). |
| \(\hat{y} = \text{softmax}(\mathbf{z})\) | The predicted label distribution. |
| \(D[.,.]\) | A distance function between two distributions, such as MSE, cross entropy, KL divergence, etc. |
| \(\beta\) | EMA weighting hyperparameter for teacher model weights. |
| \(\alpha, \lambda\) | Parameters for MixUp, \(\lambda \sim \text{Beta}(\alpha, \alpha)\). |
| \(T\) | Temperature for sharpening the predicted distribution. |
| \(\tau\) | A confidence threshold for selecting the qualified prediction. |
Hypotheses
Several hypotheses have been discussed in the literature to support certain design decisions in semi-supervised learning methods.

H1: Smoothness Assumptions: If two data samples are close in a high-density region of the feature space, their labels should be the same or very similar.

H2: Cluster Assumptions: The feature space has both dense regions and sparse regions. Densely grouped data points naturally form a cluster. Samples in the same cluster are expected to have the same label. This is a small extension of H1.

H3: Low-density Separation Assumptions: The decision boundary between classes tends to be located in the sparse, low-density regions, because otherwise the decision boundary would cut a high-density cluster into two classes, corresponding to two clusters, which invalidates H1 and H2.

H4: Manifold Assumptions: High-dimensional data tends to lie on a low-dimensional manifold. Even though real-world data might be observed in very high dimensions (e.g. images of real-world objects/scenes), it can actually be captured by a lower-dimensional manifold where certain attributes are captured and similar points are grouped closely (e.g. images of real-world objects/scenes are not drawn from a uniform distribution over all pixel combinations). This enables us to learn a more efficient representation for discovering and measuring similarity between unlabeled data points. This is also the foundation for representation learning.
Consistency Regularization
Consistency Regularization, also known as Consistency Training, assumes that randomness within the neural network (e.g. with Dropout) or data augmentation transformations should not modify model predictions given the same input. Every method in this section has a consistency regularization loss as \(\mathcal{L}_u\).
This idea has been adopted in several self-supervised learning methods, such as SimCLR, BYOL, SimCSE, etc. Different augmented versions of the same sample should result in the same representation. Cross-view training in language modeling and multi-view learning in self-supervised learning all share the same motivation.
Π-model
Sajjadi et al. (2016) proposed an unsupervised learning loss to minimize the difference between two passes through the network with stochastic transformations (e.g. dropout, random max-pooling) for the same data point. The label is not explicitly used, so the loss can be applied to an unlabeled dataset. Laine & Aila (2017) later coined the name, Π-Model, for such a setup.
\[\mathcal{L}_u^\Pi = \sum_{\mathbf{x} \in \mathcal{D}} \text{MSE}(f_\theta(\mathbf{x}), f'_\theta(\mathbf{x}))\]where \(f'\) is the same neural network with different stochastic augmentation or dropout masks applied. This loss utilizes the entire dataset.
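To make the two-pass idea concrete, here is a minimal NumPy sketch. The "network" is a hypothetical single linear layer with input dropout, and the MSE is taken between softmax outputs. Both choices are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, W, drop_rate=0.5):
    """One stochastic pass: inverted dropout on the input, a linear layer, then softmax."""
    mask = rng.random(x.shape) >= drop_rate
    h = x * mask / (1.0 - drop_rate)
    z = h @ W
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pi_model_loss(x, W):
    """Mean squared difference between two stochastic passes over the same batch."""
    p1 = forward(x, W)   # first pass, one dropout mask
    p2 = forward(x, W)   # second pass, a different dropout mask
    return np.mean((p1 - p2) ** 2)

x = rng.normal(size=(8, 5))   # batch of 8 samples; labels are never used
W = rng.normal(size=(5, 3))   # toy weights for a 3-class model
loss = pi_model_loss(x, W)
```

Since no label enters `pi_model_loss`, the same function can be applied to labeled and unlabeled batches alike.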
Temporal ensembling
Π-model requires the network to run two passes per sample, doubling the computation cost. To reduce the cost, Temporal Ensembling (Laine & Aila 2017) maintains an exponential moving average (EMA) of the model prediction in time per training sample \(\tilde{\mathbf{z}}_i\) as the learning target, which is only evaluated and updated once per epoch. Because the EMA accumulator \(\mathbf{Z}_i\) is initialized to \(\mathbf{0}\), it is normalized by \((1-\alpha^t)\) to correct this startup bias. Adam optimizer has such bias correction terms for the same reason.
\[\mathbf{Z}^{(t)}_i = \alpha \mathbf{Z}^{(t-1)}_i + (1-\alpha) \mathbf{z}_i, \qquad \tilde{\mathbf{z}}^{(t)}_i = \frac{\mathbf{Z}^{(t)}_i}{1-\alpha^t}\]where \(\tilde{\mathbf{z}}^{(t)}_i\) is the ensemble prediction at epoch \(t\) and \(\mathbf{z}_i\) is the model prediction in the current round. Note that since \(\mathbf{Z}^{(0)}_i = \mathbf{0}\), with correction, \(\tilde{\mathbf{z}}^{(1)}_i\) is simply equivalent to \(\mathbf{z}_i\) at epoch 1.
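A minimal sketch of this per-sample EMA follows. It assumes an uncorrected accumulator (called `Z` here, a name chosen for this example) is stored separately from the bias-corrected target, the way Adam stores its moment estimates. The class name and the default `alpha` are illustrative.

```python
import numpy as np

class TemporalEnsemble:
    def __init__(self, num_samples, num_classes, alpha=0.6):
        self.alpha = alpha
        self.Z = np.zeros((num_samples, num_classes))  # uncorrected EMA, starts at 0
        self.epoch = 0

    def update(self, z):
        """Fold in this epoch's predictions z (one row per training sample)."""
        self.epoch += 1
        self.Z = self.alpha * self.Z + (1.0 - self.alpha) * z
        # Startup bias correction: divide by (1 - alpha^t).
        return self.Z / (1.0 - self.alpha ** self.epoch)
```

With this form the target after the first epoch equals the first prediction, and a prediction that never changes yields a target that never changes.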
Mean teachers
Temporal Ensembling keeps track of an EMA of label predictions for each training sample as a learning target. However, this label prediction only changes every epoch, making the approach clumsy when the training dataset is large. Mean Teacher (Tarvainen & Valpola, 2017) is proposed to overcome the slowness of target update by tracking the moving average of model weights instead of model outputs. Let’s call the original model with weights \(\theta\) the student model and the model with moving averaged weights \(\theta'\) across consecutive student models the mean teacher: \(\theta' \gets \beta \theta' + (1-\beta)\theta\)
The consistency regularization loss is the distance between the predictions of the student and the teacher, and this student-teacher gap should be minimized. The mean teacher is expected to provide more accurate predictions than the student, which was confirmed empirically, as shown in Fig. 4.
According to their ablation studies,
 Input augmentation (e.g. random flips of input images, Gaussian noise) or student model dropout is necessary for good performance. Dropout is not needed on the teacher model.
 The performance is sensitive to the EMA decay hyperparameter \(\beta\). A good strategy is to use a smaller \(\beta=0.99\) during the ramp-up stage and a larger \(\beta=0.999\) in the later stage when the student model improvement slows down.
 They found that MSE as the consistency cost function performs better than other cost functions like KL divergence.
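The weight averaging and the two-stage \(\beta\) schedule can be sketched as follows. Parameters are stored as a plain dict of NumPy arrays, and the function names and the `ramp_up_steps` switch point are assumptions for this example.

```python
import numpy as np

def ema_update(teacher, student, beta):
    """theta' <- beta * theta' + (1 - beta) * theta, applied to every parameter array."""
    return {name: beta * teacher[name] + (1.0 - beta) * student[name] for name in teacher}

def ema_decay(step, ramp_up_steps, beta_early=0.99, beta_late=0.999):
    """Smaller beta while the student changes quickly, larger beta afterwards."""
    return beta_early if step < ramp_up_steps else beta_late

def consistency_loss(student_probs, teacher_probs):
    """MSE between student and teacher predictions; the teacher acts as a fixed target."""
    return np.mean((student_probs - teacher_probs) ** 2)

student = {"w": np.ones((2, 2))}
teacher = {"w": np.zeros((2, 2))}
teacher = ema_update(teacher, student, ema_decay(step=0, ramp_up_steps=100))
```

In a real training loop `ema_update` would run after every student optimizer step, so the teacher is refreshed far more often than once per epoch.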
Noisy samples as learning targets
Several recent consistency training methods learn to minimize prediction difference between the original unlabeled sample and its corresponding augmented version. It is quite similar to the Π-model but the consistency regularization loss is only applied to the unlabeled data.
Adversarial Training (Goodfellow et al. 2014) applies adversarial noise onto the input and trains the model to be robust to such adversarial attack. The setup works in supervised learning,
\[\begin{aligned} \mathcal{L}_\text{adv}(\mathbf{x}^l, \theta) &= D[q(y\mid \mathbf{x}^l), p_\theta(y\mid \mathbf{x}^l + r_\text{adv})] \\ r_\text{adv} &= {\arg\max}_{r; \|r\| \leq \epsilon} D[q(y\mid \mathbf{x}^l), p_\theta(y\mid \mathbf{x}^l + r)] \\ r_\text{adv} &\approx \epsilon \frac{g}{\|g\|_2} \approx \epsilon\,\text{sign}(g)\quad\text{where }g = \nabla_{r} D[y, p_\theta(y\mid \mathbf{x}^l + r)] \end{aligned}\]where \(q(y \mid \mathbf{x}^l)\) is the true distribution, approximated by the one-hot encoding of the ground truth label, \(y\). \(p_\theta(y \mid \mathbf{x}^l)\) is the model prediction. \(D[.,.]\) is a distance function measuring the divergence between two distributions.
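For intuition, the approximation of \(r_\text{adv}\) can be worked out by hand for a linear softmax classifier, where the gradient of the cross entropy w.r.t. the input is \(W^\top(p - y)\). The model, the shapes and the function names below are assumptions for this sketch. A deep network would obtain \(g\) from automatic differentiation instead.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, p):
    return -np.sum(y * np.log(p + 1e-12))

def adversarial_perturbation(x, y, W, b, eps, norm="l2"):
    """Approximate r_adv for a linear softmax model with logits z = W @ x + b."""
    p = softmax(W @ x + b)
    # Gradient of CE[y, softmax(W(x + r) + b)] w.r.t. r, evaluated at r = 0.
    g = W.T @ (p - y)
    if norm == "l2":
        return eps * g / (np.linalg.norm(g) + 1e-12)
    return eps * np.sign(g)   # L-infinity variant (fast gradient sign)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
x = rng.normal(size=4)
y = np.array([1.0, 0.0, 0.0])
r_adv = adversarial_perturbation(x, y, W, b, eps=0.1)
clean_loss = cross_entropy(y, softmax(W @ x + b))
adv_loss = cross_entropy(y, softmax(W @ (x + r_adv) + b))
```

Adversarial training then minimizes `adv_loss` in place of (or in addition to) `clean_loss`.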
Virtual Adversarial Training (VAT; Miyato et al. 2018) extends the idea to work in semi-supervised learning. Because \(q(y \mid \mathbf{x}^l)\) is unknown, VAT replaces it with the current model prediction for the original input with the current weights \(\hat{\theta}\). Note that \(\hat{\theta}\) is a fixed copy of model weights, so there is no gradient update on \(\hat{\theta}\).
\[\begin{aligned} \mathcal{L}_u^\text{VAT}(\mathbf{x}, \theta) &= D[p_{\hat{\theta}}(y\mid \mathbf{x}), p_\theta(y\mid \mathbf{x} + r_\text{vadv})] \\ r_\text{vadv} &= {\arg\max}_{r; \|r\| \leq \epsilon} D[p_{\hat{\theta}}(y\mid \mathbf{x}), p_\theta(y\mid \mathbf{x} + r)] \end{aligned}\]The VAT loss applies to both labeled and unlabeled samples. It is a negative smoothness measure of the current model’s prediction manifold at each data point. The optimization of such loss motivates the manifold to be smoother.
Interpolation Consistency Training (ICT; Verma et al. 2019) enhances the dataset by adding more interpolations of data points and expects the model prediction to be consistent with interpolations of the corresponding labels. The MixUp (Zhang et al. 2018) operation mixes two images via a simple weighted sum and combines it with label smoothing. Following the idea of MixUp, ICT expects the prediction model to produce a label on a mixup sample that matches the interpolation of the predictions of the corresponding inputs:
\[\begin{aligned} \text{mixup}_\lambda (\mathbf{x}_i, \mathbf{x}_j) &= \lambda \mathbf{x}_i + (1-\lambda)\mathbf{x}_j \\ p(y \mid \text{mixup}_\lambda (\mathbf{x}_i, \mathbf{x}_j)) &\approx \lambda p(y \mid \mathbf{x}_i) + (1-\lambda) p(y \mid \mathbf{x}_j) \end{aligned}\]
Because the probability of two randomly selected unlabeled samples belonging to different classes is high (e.g. there are 1000 object classes in ImageNet), the interpolation by applying a mixup between two random unlabeled samples is likely to happen around the decision boundary. According to the low-density separation assumptions, the decision boundary tends to locate in the low-density regions.
\[\mathcal{L}^\text{ICT}_{u} = \mathbb{E}_{\mathbf{u}_i, \mathbf{u}_j \sim \mathcal{U}} \mathbb{E}_{\lambda \sim \text{Beta}(\alpha, \alpha)} D[p_\theta(y \mid \text{mixup}_\lambda (\mathbf{u}_i, \mathbf{u}_j)), \text{mixup}_\lambda(p_{\theta'}(y \mid \mathbf{u}_i), p_{\theta'}(y \mid \mathbf{u}_j))]\]where \(\theta'\) is a moving average of \(\theta\), i.e. a mean teacher.
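A rough sketch of the ICT loss on one batch is given below, with MSE as \(D\) and a hypothetical linear softmax model standing in for both student and teacher. `W_student` and `W_teacher` are placeholders for \(\theta\) and \(\theta'\).

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mixup(a, b, lam):
    return lam * a + (1.0 - lam) * b

def ict_loss(u_i, u_j, W_student, W_teacher, alpha, rng):
    """MSE between the student's prediction on mixed inputs and the mixed teacher predictions."""
    lam = rng.beta(alpha, alpha)
    student_pred = softmax(mixup(u_i, u_j, lam) @ W_student)
    # Teacher predictions are targets only; no gradient would flow through them.
    target = mixup(softmax(u_i @ W_teacher), softmax(u_j @ W_teacher), lam)
    return np.mean((student_pred - target) ** 2)

rng = np.random.default_rng(0)
u = rng.normal(size=(16, 5))        # a batch of unlabeled samples
perm = rng.permutation(16)          # pair each sample with a random partner
W = rng.normal(size=(5, 3))
loss = ict_loss(u, u[perm], W, W, alpha=1.0, rng=rng)
```

Pairing each batch with a shuffled copy of itself is a cheap way to draw the random pairs \((\mathbf{u}_i, \mathbf{u}_j)\).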
Similar to VAT, Unsupervised Data Augmentation (UDA; Xie et al. 2020) learns to predict the same output for an unlabeled example and the augmented one. UDA especially focuses on studying how the “quality” of noise can impact the semi-supervised learning performance with consistency training. It is crucial to use advanced data augmentation methods for producing meaningful and effective noisy samples. Good data augmentation should produce valid (i.e. does not change the label) and diverse noise, and carry targeted inductive biases.
For images, UDA adopts RandAugment (Cubuk et al. 2019), which uniformly samples augmentation operations available in PIL without any learning or optimization, so it is much cheaper than AutoAugment.
For language, UDA combines back-translation and TF-IDF based word replacement. Back-translation preserves the high-level meaning but may not retain certain words, while TF-IDF based word replacement drops uninformative words with low TF-IDF scores. In the experiments on language tasks, they found UDA to be complementary to transfer learning and representation learning; for example, BERT fine-tuned (i.e. \(\text{BERT}_\text{FINETUNE}\) in Fig. 8) on in-domain unlabeled data can further improve the performance.
When calculating \(\mathcal{L}_u\), UDA found three training techniques to help improve the results.
 Low confidence masking: Mask out examples whose prediction confidence is lower than a threshold \(\tau\).
 Sharpening prediction distribution: Use a low temperature \(T\) in softmax to sharpen the predicted probability distribution.
 In-domain data filtration: In order to extract more in-domain data from a large out-of-domain dataset, they trained a classifier to predict in-domain labels and then retain samples with high-confidence predictions as in-domain candidates.
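\[\mathcal{L}_u^\text{UDA} = \mathbb{1}\big[\max_{y'} p_{\hat{\theta}}(y' \mid \mathbf{x}) > \tau\big] \cdot D[p^\text{(sharp)}_{\hat{\theta}}(y \mid \mathbf{x}; T), p_\theta(y \mid \bar{\mathbf{x}})]\]where \(\hat{\theta}\) is a fixed copy of model weights, same as in VAT, so no gradient update, and \(\bar{\mathbf{x}}\) is the augmented data point. \(\tau\) is the prediction confidence threshold and \(T\) is the distribution sharpening temperature.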
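The confidence masking and sharpening steps can be sketched on raw logits as follows. KL divergence is used as \(D\), the defaults `tau=0.8` and `T=0.4` are only illustrative, and averaging over the kept examples is one possible normalization, not necessarily the one used by the authors.

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Softmax with temperature T; T < 1 sharpens the distribution."""
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def uda_loss(logits_orig, logits_aug, tau=0.8, T=0.4):
    p_orig = softmax_T(logits_orig)          # plain prediction, used for masking
    target = softmax_T(logits_orig, T)       # sharpened target from the fixed copy
    p_aug = softmax_T(logits_aug)            # prediction on the augmented input
    mask = p_orig.max(axis=1) > tau          # low confidence masking
    kl = np.sum(target * (np.log(target + 1e-12) - np.log(p_aug + 1e-12)), axis=1)
    return np.sum(mask * kl) / max(mask.sum(), 1), mask

logits_orig = np.array([[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]])
logits_aug = np.array([[3.0, 0.5, 0.0], [0.0, 0.3, 0.1]])
loss, mask = uda_loss(logits_orig, logits_aug)
```

Only the first example passes the threshold here, so the second one contributes nothing to the loss.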
Pseudo Labeling
Pseudo Labeling (Lee 2013) assigns fake labels to unlabeled samples based on the maximum softmax probabilities predicted by the current model and then trains the model on both labeled and unlabeled samples simultaneously in a pure supervised setup.
Why could pseudo labels work? Pseudo label is in effect equivalent to Entropy Regularization (Grandvalet & Bengio 2004), which minimizes the conditional entropy of class probabilities for unlabeled data to favor low-density separation between classes. In other words, the predicted class probabilities are in fact a measure of class overlap, so minimizing the entropy is equivalent to reducing class overlap and thus favors low-density separation.
Training with pseudo labeling naturally comes as an iterative process. We refer to the model that produces pseudo labels as teacher and the model that learns with pseudo labels as student.
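The link between hard pseudo labels and entropy can be checked numerically. In this small sketch (function names are made up for illustration), both the cross entropy against the model's own argmax label and the conditional entropy shrink as the prediction becomes more confident.

```python
import numpy as np

def pseudo_labels(probs):
    """Hard one-hot pseudo labels from the maximum predicted probability."""
    return np.eye(probs.shape[1])[probs.argmax(axis=1)]

def conditional_entropy(probs):
    """Average entropy of the predicted class distributions."""
    return -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))

def pseudo_label_ce(probs):
    """Cross entropy against the model's own hard pseudo labels."""
    return -np.mean(np.sum(pseudo_labels(probs) * np.log(probs + 1e-12), axis=1))

confident = np.array([[0.98, 0.01, 0.01]])
uncertain = np.array([[0.40, 0.30, 0.30]])
```

Minimizing either quantity on unlabeled data pushes predictions away from the uncertain case, which is the region where classes overlap.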
Label propagation
Label Propagation (Iscen et al. 2019) is an idea to construct a similarity graph among samples based on feature embedding. Then the pseudo labels are “diffused” from known samples to unlabeled ones where the propagation weights are proportional to pairwise similarity scores in the graph. Conceptually it is similar to a kNN classifier and both suffer from the problem of not scaling up well with a large dataset.
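As a toy illustration of the diffusion step, the sketch below builds a kNN graph from cosine similarities and spreads labels with the closed-form solution \((I - \alpha S)^{-1} Y\), which is one common formulation of label propagation. The function name, `k` and `alpha` are assumptions, and the dense linear solve is exactly the part that does not scale to large datasets.

```python
import numpy as np

def label_propagation(features, labels, num_classes, k=3, alpha=0.99):
    """labels[i] is a class index for labeled samples and -1 for unlabeled ones."""
    n = len(features)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = np.clip(f @ f.T, 0.0, None)            # non-negative cosine similarity
    np.fill_diagonal(sim, 0.0)
    # Keep the k nearest neighbours of every node, then symmetrize the graph.
    A = np.zeros_like(sim)
    nn = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    A[rows, nn.ravel()] = sim[rows, nn.ravel()]
    W = A + A.T
    d = W.sum(axis=1) + 1e-12
    S = W / np.sqrt(np.outer(d, d))              # D^{-1/2} W D^{-1/2}
    Y = np.zeros((n, num_classes))
    idx = np.flatnonzero(labels >= 0)
    Y[idx, labels[idx]] = 1.0                    # one-hot rows for the known labels
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)   # diffuse labels over the graph
    return F.argmax(axis=1)

features = np.array([[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.8, 0.1],
                     [0.1, 1.0], [0.2, 0.9], [0.0, 1.0], [0.1, 0.8]])
labels = np.array([0, -1, -1, -1, 1, -1, -1, -1])   # one labeled sample per cluster
pred = label_propagation(features, labels, num_classes=2)
```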
Self-Training
Self-Training is not a new concept (Scudder 1965, Nigam & Ghani CIKM 2000). It is an iterative algorithm, alternating between the following two steps until every unlabeled sample has a label assigned:
 Initially it builds a classifier on labeled data.
 Then it uses this classifier to predict labels for the unlabeled data and converts the most confident ones into labeled samples.
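The two steps above can be sketched with any classifier. To keep the example dependency-free, a nearest-centroid classifier is used here as a stand-in, and `per_round` (how many confident samples get promoted in each iteration) is a hypothetical knob.

```python
import numpy as np

def fit_centroids(x, y, num_classes):
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])

def predict_proba(x, centroids):
    # Softmax over negative squared distances to each class centroid.
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    z = -d - (-d).max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_train(x_lab, y_lab, x_unlab, num_classes, per_round=5):
    x_lab, y_lab, pool = x_lab.copy(), y_lab.copy(), x_unlab.copy()
    while len(pool) > 0:
        centroids = fit_centroids(x_lab, y_lab, num_classes)   # step 1: fit on labeled data
        probs = predict_proba(pool, centroids)                 # step 2: predict unlabeled data
        top = np.argsort(-probs.max(axis=1))[:per_round]       # most confident predictions
        x_lab = np.concatenate([x_lab, pool[top]])
        y_lab = np.concatenate([y_lab, probs[top].argmax(axis=1)])
        pool = np.delete(pool, top, axis=0)
    return fit_centroids(x_lab, y_lab, num_classes), x_lab, y_lab

rng = np.random.default_rng(0)
x0 = rng.normal(loc=-5.0, size=(20, 2))
x1 = rng.normal(loc=5.0, size=(20, 2))
x_lab, y_lab = np.stack([x0[0], x1[0]]), np.array([0, 1])      # one label per class
x_unlab = np.concatenate([x0[1:], x1[1:]])
centroids, x_all, y_all = self_train(x_lab, y_lab, x_unlab, num_classes=2)
```

On well-separated blobs like these every sample ends up with the right label. On harder data, early mistakes get baked into later rounds, which is the confirmation bias problem discussed further down.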
Xie et al. (2020) applied self-training in deep learning and achieved great results. On the ImageNet classification task, they first trained an EfficientNet (Tan & Le 2019) model as teacher to generate pseudo labels for 300M unlabeled images and then trained a larger EfficientNet as student to learn with both true labeled and pseudo labeled images. One critical element in their setup is to have noise during student model training but have no noise for the teacher to produce pseudo labels. Thus their method is called Noisy Student. They applied stochastic depth (Huang et al. 2016), dropout and RandAugment to noise the student. Noise is important for the student to perform better than the teacher. The added noise has a compound effect to encourage the model’s decision making frontier to be smooth, on both labeled and unlabeled data.
A few other important technical configs in noisy student self-training are:
 The student model should be sufficiently large (i.e. larger than the teacher) to fit more data.
 Noisy student should be paired with data balancing, especially important to balance the number of pseudo labeled images in each class.
 Soft pseudo labels work better than hard ones.
Noisy student also improves adversarial robustness against an FGSM attack (Fast Gradient Sign Method: the attack uses the gradient of the loss w.r.t. the input data and adjusts the input data to maximize the loss), even though the model is not optimized for adversarial robustness.
SentAugment, proposed by Du et al. (2020), aims to solve the problem when there is not enough in-domain unlabeled data for self-training in the language domain. It relies on sentence embedding to find unlabeled in-domain samples from a large corpus and uses the retrieved sentences for self-training.
Reducing confirmation bias
Confirmation bias is a problem with incorrect pseudo labels provided by an imperfect teacher model. Overfitting to wrong labels may not give us a better student model.
To reduce confirmation bias, Arazo et al. (2019) proposed two techniques. One is to adopt MixUp with soft labels. Given two samples, \((\mathbf{x}_i, \mathbf{x}_j)\) and their corresponding true or pseudo labels \((y_i, y_j)\), the interpolated label equation can be translated to a cross entropy loss with softmax outputs:
\[\begin{aligned} &\bar{\mathbf{x}} = \lambda \mathbf{x}_i + (1-\lambda) \mathbf{x}_j \\ &\bar{y} = \lambda y_i + (1-\lambda) y_j \Leftrightarrow \mathcal{L} = -\lambda [y_i^\top \log f_\theta(\bar{\mathbf{x}})] - (1-\lambda) [y_j^\top \log f_\theta(\bar{\mathbf{x}})] \end{aligned}\]Mixup is insufficient if there are too few labeled samples. They further set a minimum number of labeled samples in every mini-batch by oversampling the labeled samples. This works better than upweighting labeled samples, because it leads to more frequent updates rather than few updates of larger magnitude which could be less stable. Like consistency regularization, data augmentation and dropout are also important for pseudo labeling to work well.
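The equivalence between the interpolated label and the weighted sum of two cross entropy terms follows from the linearity of cross entropy in the target, and is easy to verify numerically. The snippet uses the usual negative log-likelihood convention for cross entropy. The vector `p` is just a stand-in for \(f_\theta(\bar{\mathbf{x}})\), and the Beta parameter 0.75 is an arbitrary choice.

```python
import numpy as np

def cross_entropy(y, p):
    return -np.sum(y * np.log(p))

def mixup_pair(x_i, y_i, x_j, y_j, lam):
    x_bar = lam * x_i + (1.0 - lam) * x_j
    y_bar = lam * y_i + (1.0 - lam) * y_j
    return x_bar, y_bar

rng = np.random.default_rng(0)
lam = rng.beta(0.75, 0.75)
x_i, x_j = rng.normal(size=4), rng.normal(size=4)
y_i = np.array([1.0, 0.0, 0.0])     # a true one-hot label
y_j = np.array([0.0, 0.7, 0.3])     # a soft pseudo label
x_bar, y_bar = mixup_pair(x_i, y_i, x_j, y_j, lam)

p = np.array([0.5, 0.3, 0.2])       # stand-in for the model output on x_bar
loss_interpolated_label = cross_entropy(y_bar, p)
loss_weighted_sum = lam * cross_entropy(y_i, p) + (1.0 - lam) * cross_entropy(y_j, p)
```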
Meta Pseudo Labels (Pham et al. 2021) adapts the teacher model constantly with the feedback of how well the student performs on the labeled dataset. The teacher and the student are trained in parallel, where the teacher learns to generate better pseudo labels and the student learns from the pseudo labels.
Let the teacher and student model weights be \(\theta_T\) and \(\theta_S\), respectively. The student model’s loss on the labeled samples is defined as a function \(\theta^\text{PL}_S(.)\) of \(\theta_T\) and we would like to minimize this loss by optimizing the teacher model accordingly.
\[\begin{aligned} \min_{\theta_T} &\mathcal{L}_s(\theta^\text{PL}_S(\theta_T)) = \min_{\theta_T} \mathbb{E}_{(\mathbf{x}^l, y) \in \mathcal{X}} \text{CE}[y, f_{\theta^\text{PL}_S(\theta_T)}(\mathbf{x}^l)] \\ \text{where } &\theta^\text{PL}_S(\theta_T) = \arg\min_{\theta_S} \mathcal{L}_u (\theta_T, \theta_S) = \arg\min_{\theta_S} \mathbb{E}_{\mathbf{u} \sim \mathcal{U}} \text{CE}[f_{\theta_T}(\mathbf{u}), f_{\theta_S}(\mathbf{u})] \end{aligned}\]However, it is not trivial to optimize the above equation. Borrowing the idea of MAML, it approximates the multi-step \(\arg\min_{\theta_S}\) with the one-step gradient update of \(\theta_S\),
\[\begin{aligned} \theta^\text{PL}_S(\theta_T) &\approx \theta_S - \eta_S \cdot \nabla_{\theta_S} \mathcal{L}_u(\theta_T, \theta_S) \\ \min_{\theta_T} \mathcal{L}_s (\theta^\text{PL}_S(\theta_T)) &\approx \min_{\theta_T} \mathcal{L}_s \big( \theta_S - \eta_S \cdot \nabla_{\theta_S} \mathcal{L}_u(\theta_T, \theta_S) \big) \end{aligned}\]With soft pseudo labels, the above objective is differentiable. But if using hard pseudo labels, it is not differentiable and thus we need to use RL, e.g. REINFORCE.
The optimization procedure alternates between training two models:
 Student model update: Given a batch of unlabeled samples \(\{ \mathbf{u} \}\), we generate pseudo labels by \(f_{\theta_T}(\mathbf{u})\) and optimize \(\theta_S\) with one-step SGD: \(\theta'_S = \color{green}{\theta_S - \eta_S \cdot \nabla_{\theta_S} \mathcal{L}_u(\theta_T, \theta_S)}\).
 Teacher model update: Given a batch of labeled samples \(\{(\mathbf{x}^l, y)\}\), we reuse the student’s update to optimize \(\theta_T\): \(\theta'_T = \theta_T - \eta_T \cdot \nabla_{\theta_T} \mathcal{L}_s ( \color{green}{\theta_S - \eta_S \cdot \nabla_{\theta_S} \mathcal{L}_u(\theta_T, \theta_S)} )\). In addition, the UDA objective is applied to the teacher model to incorporate consistency regularization.
Pseudo Labeling with Consistency Regularization
It is possible to combine the above two approaches together, running semi-supervised learning with both pseudo labeling and consistency training.
MixMatch
MixMatch (Berthelot et al. 2019), as a holistic approach to semi-supervised learning, utilizes unlabeled data by merging the following techniques:
 Consistency regularization: Encourage the model to output the same predictions on perturbed unlabeled samples.
 Entropy minimization: Encourage the model to output confident predictions on unlabeled data.
 MixUp augmentation: Encourage the model to have linear behaviour between samples.
Given a batch of labeled data \(\mathcal{X}\) and unlabeled data \(\mathcal{U}\), we create augmented versions of them via \(\text{MixMatch}(.)\), \(\bar{\mathcal{X}}\) and \(\bar{\mathcal{U}}\), containing augmented samples and guessed labels for unlabeled examples.
\[\begin{aligned} \bar{\mathcal{X}}, \bar{\mathcal{U}} &= \text{MixMatch}(\mathcal{X}, \mathcal{U}, T, K, \alpha) \\ \mathcal{L}^\text{MM}_s &= \frac{1}{\vert \bar{\mathcal{X}} \vert} \sum_{(\bar{\mathbf{x}}^l, y)\in \bar{\mathcal{X}}} D[y, p_\theta(y \mid \bar{\mathbf{x}}^l)] \\ \mathcal{L}^\text{MM}_u &= \frac{1}{L\vert \bar{\mathcal{U}} \vert} \sum_{(\bar{\mathbf{u}}, \hat{y})\in \bar{\mathcal{U}}} \| \hat{y} - p_\theta(y \mid \bar{\mathbf{u}}) \|^2_2 \\ \end{aligned}\]where \(T\) is the sharpening temperature to reduce the guessed label overlap; \(K\) is the number of augmentations generated per unlabeled example; \(\alpha\) is the parameter in MixUp.
For each \(\mathbf{u}\), MixMatch generates \(K\) augmentations, \(\bar{\mathbf{u}}^{(k)} = \text{Augment}(\mathbf{u})\) for \(k=1, \dots, K\) and the pseudo label is guessed based on the average: \(\hat{y} = \frac{1}{K} \sum_{k=1}^K p_\theta(y \mid \bar{\mathbf{u}}^{(k)})\).
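Label guessing followed by temperature sharpening can be sketched in a few lines. The sharpening used here raises each probability to the power \(1/T\) and renormalizes, and the default `T=0.5` is only an example value.

```python
import numpy as np

def sharpen(p, T):
    """Raise probabilities to 1/T and renormalize; T < 1 lowers the entropy."""
    p = p ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

def guess_label(probs_per_aug, T=0.5):
    """probs_per_aug has shape (K, L): predictions for K augmentations of one sample."""
    return sharpen(probs_per_aug.mean(axis=0), T)

probs = np.array([[0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1]])     # K = 2 augmentations, L = 3 classes
y_hat = guess_label(probs)
```

The averaged prediction is `[0.55, 0.35, 0.10]`. After sharpening, more of the mass moves to the first class while the result remains a valid distribution.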
According to their ablation studies, it is critical to have MixUp, especially on the unlabeled data. Removing temperature sharpening on the pseudo label distribution hurts the performance quite a lot. Averaging over multiple augmentations for label guessing is also necessary.
ReMixMatch (Berthelot et al. 2020) improves MixMatch by introducing two new mechanisms:
 Distribution alignment. It encourages the marginal distribution of the model predictions on unlabeled data to be close to the marginal distribution of the ground truth labels. Let \(p(y)\) be the class distribution in the true labels and \(\tilde{p}(\hat{y})\) be a running average of the predicted class distribution among the unlabeled data. The model prediction on an unlabeled sample \(p_\theta(y \vert \mathbf{u})\) is normalized to be \(\text{Normalize}\big( \frac{p_\theta(y \vert \mathbf{u}) p(y)}{\tilde{p}(\hat{y})} \big)\) to match the true marginal distribution.
 Note that entropy minimization is not a useful objective if the marginal distribution is not uniform.
 I do feel the assumption that the class distributions on the labeled and unlabeled data should match is too strong and not necessarily true in real-world settings.
 Augmentation anchoring. Given an unlabeled sample, it first generates an “anchor” version with weak augmentation and then averages \(K\) strongly augmented versions using CTAugment (Control Theory Augment). CTAugment only samples augmentations that keep the model predictions within the network tolerance.
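The distribution alignment step amounts to a per-class rescaling followed by renormalization. In the sketch below the running average is kept as an exponential moving average with a hypothetical `momentum`, which is a simplifying assumption rather than the exact bookkeeping used in the paper.

```python
import numpy as np

def align(p_model, p_true, p_running):
    """Rescale a prediction by p(y) / running average of predictions, then renormalize."""
    q = p_model * p_true / (p_running + 1e-12)
    return q / q.sum(axis=-1, keepdims=True)

def update_running(p_running, batch_probs, momentum=0.99):
    """Track the average predicted class distribution on unlabeled batches."""
    return momentum * p_running + (1.0 - momentum) * batch_probs.mean(axis=0)

p_true = np.array([0.5, 0.5])        # class distribution of the labeled set
p_running = np.array([0.8, 0.2])     # the model currently over-predicts class 0
p_model = np.array([0.6, 0.4])
aligned = align(p_model, p_true, p_running)
```

Because the model has been over-predicting class 0, the aligned prediction shifts towards class 1.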
The ReMixMatch loss is a combination of several terms,
 a supervised loss with data augmentation and MixUp applied;
 an unsupervised loss with data augmentation and MixUp applied, using pseudo labels as targets;
 a CE loss on a single heavily-augmented unlabeled image without MixUp;
 a rotation loss as in self-supervised learning.
DivideMix
DivideMix (Junnan Li et al. 2020) combines semi-supervised learning with learning with noisy labels (LNL). It models the per-sample loss distribution via a GMM to dynamically divide the training data into a labeled set with clean examples and an unlabeled set with noisy ones. Following the idea in Arazo et al. 2019, they fit a two-component GMM on the per-sample cross entropy loss \(\ell_i = -y_i^\top \log f_\theta(\mathbf{x}_i)\). Clean samples are expected to get lower loss faster than noisy samples. The component with the smaller mean is the cluster corresponding to clean labels; let’s denote it as \(c\). If the GMM posterior probability \(w_i = p_\text{GMM}(c \mid \ell_i)\) (i.e. the probability of the sample belonging to the clean sample set) is larger than the threshold \(\tau\), this sample is considered a clean sample, and otherwise a noisy one.
The data clustering step is named co-divide. To avoid confirmation bias, DivideMix simultaneously trains two diverged networks where each network uses the dataset division from the other network, similar to how Double Q-Learning works.
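A rough sketch of the clean/noisy split on synthetic per-sample losses is given below. The two-component 1-D GMM is fitted with a small hand-written EM loop to avoid extra dependencies; the initialization, iteration count and the threshold of 0.5 are all assumptions for illustration.

```python
import numpy as np

def responsibilities(x, mu, var, pi):
    """Posterior probability of each mixture component for every sample."""
    log_p = -0.5 * (x[:, None] - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(pi)
    log_p -= log_p.max(axis=1, keepdims=True)
    r = np.exp(log_p)
    return r / r.sum(axis=1, keepdims=True)

def fit_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture; returns means, variances, weights."""
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        r = responsibilities(x, mu, var, pi)              # E-step
        n_k = r.sum(axis=0) + 1e-12                       # M-step
        mu = (r * x[:, None]).sum(axis=0) / n_k
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
        pi = n_k / len(x)
    return mu, var, pi

def clean_probability(losses, mu, var, pi):
    """w_i: posterior of belonging to the component with the smaller mean loss."""
    return responsibilities(losses, mu, var, pi)[:, np.argmin(mu)]

rng = np.random.default_rng(0)
losses = np.concatenate([rng.normal(0.2, 0.05, 80),    # low loss: likely clean labels
                         rng.normal(2.0, 0.30, 20)])   # high loss: likely noisy labels
mu, var, pi = fit_gmm_1d(losses)
w = clean_probability(losses, mu, var, pi)
is_clean = w > 0.5                                      # threshold tau
```

In the two-network setup, the `is_clean` mask computed from one network's losses would be handed to the other network.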
Compared to MixMatch, DivideMix has an additional co-divide stage for handling noisy samples, as well as the following improvements during training:
 Label co-refinement: It linearly combines the ground-truth label \(y_i\) with the network’s prediction \(\hat{y}_i\), which is averaged across multiple augmentations of \(\mathbf{x}_i\), guided by the clean set probability \(w_i\) produced by the other network.
 Label co-guessing: It averages the predictions from two models for unlabelled data samples.
FixMatch
FixMatch (Sohn et al. 2020) generates pseudo labels on unlabeled samples with weak augmentation and only keeps predictions with high confidence. Here both weak augmentation and high confidence filtering help produce high-quality trustworthy pseudo label targets. Then FixMatch learns to predict these pseudo labels given a heavily-augmented sample.
\[\begin{aligned} \mathcal{L}_s &= \frac{1}{B} \sum^B_{b=1} \text{CE}[y_b, p_\theta(y \mid \mathcal{A}_\text{weak}(\mathbf{x}_b))] \\ \mathcal{L}_u &= \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}[\max(\hat{y}_b) \geq \tau]\;\text{CE}(\hat{y}_b, p_\theta(y \mid \mathcal{A}_\text{strong}(\mathbf{u}_b))) \end{aligned}\]where \(\hat{y}_b\) is the pseudo label for an unlabeled example; \(\mu\) is a hyperparameter that determines the relative sizes of \(\mathcal{X}\) and \(\mathcal{U}\).
 Weak augmentation \(\mathcal{A}_\text{weak}(.)\): A standard flip-and-shift augmentation
 Strong augmentation \(\mathcal{A}_\text{strong}(.)\) : AutoAugment, Cutout, RandAugment, CTAugment
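The unsupervised term can be sketched directly from predicted probabilities. The two arrays below stand in for model outputs on weakly and strongly augmented views of the same unlabeled batch, and `tau=0.95` is just an example threshold.

```python
import numpy as np

def fixmatch_unsup_loss(probs_weak, probs_strong, tau=0.95):
    """probs_weak / probs_strong: (N, L) predictions on weak / strong views."""
    confidence = probs_weak.max(axis=1)
    pseudo = probs_weak.argmax(axis=1)       # hard pseudo labels from the weak view
    mask = confidence >= tau                 # keep only confident pseudo labels
    ce = -np.log(probs_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    # Average over the whole unlabeled batch, as in the 1 / (mu * B) factor.
    return np.mean(mask * ce), mask

probs_weak = np.array([[0.97, 0.02, 0.01],
                       [0.50, 0.30, 0.20]])
probs_strong = np.array([[0.70, 0.20, 0.10],
                         [0.10, 0.80, 0.10]])
loss, mask = fixmatch_unsup_loss(probs_weak, probs_strong)
```

The second sample is dropped because its weak-view confidence (0.50) is below the threshold, even though the strong view is confident about a different class.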
According to the ablation studies of FixMatch,
 Sharpening the predicted distribution with a temperature parameter \(T\) does not have a significant impact when the threshold \(\tau\) is used.
 Cutout and CTAugment as part of strong augmentations are necessary for good performance.
 When the weak augmentation for label guessing is replaced with strong augmentation, the model diverges early in training. If weak augmentation is discarded completely, the model overfits the guessed labels.
 Using weak instead of strong augmentation for pseudo label prediction leads to unstable performance. Strong data augmentation is critical.
Combined with Powerful Pre-Training
It is a common paradigm, especially in language tasks, to first pre-train a task-agnostic model on a large unsupervised data corpus via self-supervised learning and then fine-tune it on the downstream task with a small labeled dataset. Research has shown that we can obtain extra gain if combining semi-supervised learning with pre-training.
Zoph et al. (2020) studied to what degree self-training can work better than pre-training. Their experiment setup was to use ImageNet for pre-training or self-training to improve COCO. Note that when using ImageNet for self-training, it discards labels and only uses ImageNet samples as unlabeled data points. He et al. (2018) have demonstrated that ImageNet classification pre-training does not work well if the downstream task is very different, such as object detection.
Their experiments demonstrated a series of interesting findings:
 The effectiveness of pre-training diminishes with more labeled samples available for the downstream task. Pre-training is helpful in the low-data regimes (20%) but neutral or harmful in the high-data regime.
 Self-training helps in high data/strong augmentation regimes, even when pre-training hurts.
 Self-training can bring in additive improvement on top of pre-training, even using the same data source.
 Self-supervised pre-training (e.g. via SimCLR) hurts the performance in a high data regime, similar to how supervised pre-training does.
 Joint-training supervised and self-supervised objectives helps resolve the mismatch between the pre-training and downstream tasks. Pre-training, joint-training and self-training are all additive.
 Noisy labels or un-targeted labeling (i.e. pre-training labels are not aligned with downstream task labels) is worse than targeted pseudo labeling.
 Self-training is computationally more expensive than fine-tuning on a pre-trained model.
Chen et al. (2020) proposed a three-step procedure to merge the benefits of self-supervised pre-training, supervised fine-tuning and self-training together:
 Unsupervised or self-supervised pre-train a big model.
 Supervised fine-tune it on a few labeled examples. It is important to use a big (deep and wide) neural network. Bigger models yield better performance with fewer labeled samples.
 Distillation with unlabeled examples by adopting pseudo labels in self-training.
 It is possible to distill the knowledge from a large model into a small one because the task-specific use does not require extra capacity of the learned representation.

The distillation loss is formatted as the following, where the teacher network is fixed with weights \(\hat{\theta}_T\).
\[\mathcal{L}_\text{distill} = - (1-\alpha) \underbrace{\sum_{(\mathbf{x}^l_i, y_i) \in \mathcal{X}} \big[ \log p_{\theta_S}(y_i \mid \mathbf{x}^l_i) \big]}_\text{Supervised loss} - \alpha \underbrace{\sum_{\mathbf{u}_i \in \mathcal{U}} \Big[ \sum_{j=1}^L p_{\hat{\theta}_T}(y^{(j)} \mid \mathbf{u}_i; T) \log p_{\theta_S}(y^{(j)} \mid \mathbf{u}_i; T) \Big]}_\text{Distillation loss using unlabeled data}\]
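A minimal sketch of this loss on logits follows, with both terms written as negative log-likelihoods so that the total is something to minimize. The argument names are made up for this example, and the teacher logits are treated as constants.

```python
import numpy as np

def softmax_T(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distill_loss(student_logits_lab, y_lab, student_logits_unlab, teacher_logits_unlab,
                 alpha=0.5, T=1.0):
    # Supervised term: log-likelihood of the true class under the student.
    log_p = np.log(softmax_T(student_logits_lab) + 1e-12)
    sup = np.sum(log_p[np.arange(len(y_lab)), y_lab])
    # Distillation term: fixed teacher soft labels against student log-probabilities.
    p_teacher = softmax_T(teacher_logits_unlab, T)
    log_p_student = np.log(softmax_T(student_logits_unlab, T) + 1e-12)
    distill = np.sum(p_teacher * log_p_student)
    return -(1.0 - alpha) * sup - alpha * distill

# One labeled and one unlabeled sample, two classes, all logits zero.
loss = distill_loss(np.zeros((1, 2)), np.array([0]), np.zeros((1, 2)), np.zeros((1, 2)))
```

With uniform predictions everywhere, both terms equal \(\log 2\), so the weighted sum is \(\log 2\) regardless of \(\alpha\).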
They experimented on the ImageNet classification task. The self-supervised pre-training uses SimCLRv2, a directly improved version of SimCLR. Observations in their empirical studies confirmed several learnings, aligned with Zoph et al. 2020:
 Bigger models are more label-efficient;
 Bigger/deeper projection heads in SimCLR improve representation learning;
 Distillation using unlabeled data improves semi-supervised learning.
💡 Quick summary of common themes among recent semi-supervised learning methods, many aiming to reduce confirmation bias:
 Apply valid and diverse noise to samples by advanced data augmentation methods.
 When dealing with images, MixUp is an effective augmentation. Mixup could work on language too, resulting in a small incremental improvement (Guo et al. 2019).
 Set a threshold and discard pseudo labels with low confidence.
 Set a minimum number of labeled samples per mini-batch.
 Sharpen the pseudo label distribution to reduce the class overlap.
References
[1] Ouali, Hudelot & Tami. “An Overview of Deep Semi-Supervised Learning.” arXiv preprint arXiv:2006.05278 (2020).
[2] Sajjadi, Javanmardi & Tasdizen. “Regularization With Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning.” arXiv preprint arXiv:1606.04586 (2016).
[3] Pham et al. “Meta Pseudo Labels.” CVPR 2021.
[4] Laine & Aila. “Temporal Ensembling for Semi-Supervised Learning.” ICLR 2017.
[5] Tarvainen & Valpola. “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.” NeurIPS 2017.
[6] Xie et al. “Unsupervised Data Augmentation for Consistency Training.” NeurIPS 2020.
[7] Miyato et al. “Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning.” IEEE transactions on pattern analysis and machine intelligence 41.8 (2018).
[8] Verma et al. “Interpolation consistency training for semi-supervised learning.” IJCAI 2019.
[9] Lee. “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.” ICML 2013 Workshop: Challenges in Representation Learning.
[10] Iscen et al. “Label propagation for deep semi-supervised learning.” CVPR 2019.
[11] Xie et al. “Self-training with Noisy Student improves ImageNet classification.” CVPR 2020.
[12] Jingfei Du et al. “Self-training Improves Pre-training for Natural Language Understanding.” 2020.
[13] Arazo et al. “Pseudo-labeling and confirmation bias in deep semi-supervised learning.” IJCNN 2020.
[14] Berthelot et al. “MixMatch: A holistic approach to semi-supervised learning.” NeurIPS 2019.
[15] Berthelot et al. “ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring.” ICLR 2020.
[16] Sohn et al. “FixMatch: Simplifying semi-supervised learning with consistency and confidence.” NeurIPS 2020.
[17] Junnan Li et al. “DivideMix: Learning with Noisy Labels as Semi-supervised Learning.” 2020.
[18] Zoph et al. “Rethinking pre-training and self-training.” 2020.
[19] Chen et al. “Big Self-Supervised Models are Strong Semi-Supervised Learners.” 2020.