Here comes Part 3 on learning with not enough data (previous posts: Part 1 and Part 2). Let’s consider two approaches to generating synthetic data for training.
- Augmented data. Given a set of existing training samples, we can apply a variety of augmentation, distortion and transformation operations to derive new data points without losing the key attributes. We have covered a bunch of augmentation methods for text and images in a previous post on contrastive learning. For the sake of completeness, I duplicate the section on data augmentation here with some edits.
- New data. Given few or even no data points, we can rely on powerful pretrained models to generate new data points. This has become especially promising in recent years given the fast progress in large pretrained language models (LMs). Few-shot prompting has been shown to be effective for LMs to learn within context without extra training.
Data Augmentation
The goal of data augmentation is to modify the input format (e.g. text wording, visual appearance) while the semantic meaning stays unchanged.
Image Augmentation
Basic Image Processing Operations
There are several ways to modify an image while retaining its semantic information. We can use any one of the following augmentations or a composition of multiple operations.
- Random cropping and then resize back to the original size.
- Random color distortions
- Random Gaussian blur
- Random color jittering
- Random horizontal flip
- Random grayscale conversion
- And many more. Check PIL.ImageOps for inspiration.
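For a concrete picture, here is a minimal sketch of composing several of the operations above. It assumes torchvision as a convenience (not something prescribed by the text); any image library with similar ops works.

```python
# A sketch of composing the augmentation ops listed above with torchvision.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                    # random crop, then resize back
    transforms.RandomHorizontalFlip(p=0.5),               # random horizontal flip
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # color jitter / distortion
    transforms.RandomGrayscale(p=0.2),                    # random grayscale conversion
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)], p=0.5),     # random Gaussian blur
    transforms.ToTensor(),
])

# Usage: augmented = augment(pil_image)  # a new view that keeps the original label
```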
Task-Specific Augmentation Strategies
If the downstream task is known, it is possible to learn the optimal augmentation strategies (i.e. what processing operations to use and how to combine them in sequence) to maximize the downstream task performance.
- AutoAugment (Cubuk, et al. 2018): Inspired by neural architecture search, AutoAugment frames the problem of learning the best data augmentation operations (i.e. shearing, rotation, invert, etc.) for image classification as an RL problem and looks for the combination that leads to the highest accuracy on the evaluation set. AutoAugment can also be executed in an adversarial fashion (Zhang, et al 2019).
- RandAugment (Cubuk et al., 2019) greatly reduces the search space of AutoAugment by controlling the magnitudes of different transformation operations with a single magnitude parameter.
- Population based augmentation (PBA; Ho et al., 2019) combines PBT (“population based training”; Jaderberg et al, 2017) with AutoAugment, using the evolutionary algorithm to train a population of children models in parallel to evolve the best augmentation strategies.
- Unsupervised Data Augmentation (UDA; Xie et al., 2019), among a set of possible augmentation strategies, selects a subset to minimize the KL divergence between the predicted distribution over an unlabelled example and the predicted distribution over its augmented version.
Image Mixture
Image mixture methods can construct new training examples from existing data points.
- Mixup (Zhang et al., 2018) runs global-level mixture by creating a weighted pixel-wise combination of two existing images $I_1$ and $I_2$: $I_\text{mixup} \gets \alpha I_1 + (1-\alpha) I_2$ with $\alpha \in [0, 1]$.
- Cutmix (Yun et al., 2019) does region-level mixture by generating a new example that combines a local region of one image with the rest of the other image: $I_\text{cutmix} \gets \mathbf{M}_b \odot I_1 + (1-\mathbf{M}_b) \odot I_2$, where $\mathbf{M}_b \in \{0, 1\}^I$ is a binary mask and $\odot$ is element-wise multiplication. It is equivalent to filling the cutout (DeVries & Taylor 2017) region with the same region from another image.
- Given a query $\mathbf{q}$, MoCHi (“mixing of contrastive hard negatives”; Kalantidis et al. 2020) maintains a queue of $K$ negative features $Q = \{\mathbf{n}_1, \dots, \mathbf{n}_K\}$ and sorts these negative features by similarity to the query, $\mathbf{q}^\top \mathbf{n}$, in descending order. The first $N$ items in the queue are considered the hardest negatives, $Q^N$. Then synthetic hard examples can be generated by $\mathbf{h} = \tilde{\mathbf{h}} / \|\tilde{\mathbf{h}}\|_2$ where $\tilde{\mathbf{h}} = \alpha \mathbf{n}_i + (1-\alpha) \mathbf{n}_j$ with $\mathbf{n}_i, \mathbf{n}_j \in Q^N$ and $\alpha \in (0, 1)$. Even harder examples can be created by mixing with the query feature: $\mathbf{h}' = \tilde{\mathbf{h}'} / \|\tilde{\mathbf{h}'}\|_2$ where $\tilde{\mathbf{h}'} = \beta \mathbf{q} + (1-\beta) \mathbf{n}_j$ and $\beta \in (0, 0.5)$.
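As a rough sketch of the first two mixture operations, here is a NumPy version; how $\alpha$ and the cut region are sampled below are simple illustrative choices, not requirements.

```python
# A sketch of mixup and cutmix on image arrays of shape (H, W, C).
import numpy as np

def mixup(img1, img2, alpha):
    """Pixel-wise weighted combination; labels are mixed with the same alpha."""
    return alpha * img1 + (1 - alpha) * img2

def cutmix(img1, img2, rng=np.random):
    """Paste a random rectangle of img1 onto img2; label weights follow the area ratio."""
    h, w = img1.shape[:2]
    cut_h, cut_w = rng.randint(1, h), rng.randint(1, w)
    top, left = rng.randint(0, h - cut_h + 1), rng.randint(0, w - cut_w + 1)
    mixed = img2.copy()
    mixed[top:top + cut_h, left:left + cut_w] = img1[top:top + cut_h, left:left + cut_w]
    area_ratio = (cut_h * cut_w) / (h * w)   # weight of img1's label in the mixed label
    return mixed, area_ratio
```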
Text Augmentation
Lexical Edits
Easy Data Augmentation (EDA; Wei & Zou 2019) defines a set of simple but powerful operations for text augmentation. Given a sentence, EDA randomly chooses and applies one of four simple operations:
- Synonym replacement (SR): Replace $n$ random non-stop words with their synonyms.
- Random insertion (RI): Place a random synonym of a randomly selected non-stop word in the sentence at a random position.
- Random swap (RS): Randomly swap two words and repeat $n$ times.
- Random deletion (RD): Randomly delete each word in the sentence with probability $p$.

where $p = \alpha$ and $n = \alpha \times \text{sentence length}$, with the intuition that longer sentences can absorb more noise while keeping the original label. The hyperparameter $\alpha$ roughly indicates the percent of words in one sentence that may be changed by one augmentation.

EDA is shown to improve the classification accuracy on several classification benchmark datasets compared to the baseline without EDA. The performance lift is more significant on a smaller training set. All four operations in EDA help improve the classification accuracy, but they reach their optimum at different $\alpha$’s.
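As a toy sketch, two of the EDA operations (random swap and random deletion) are shown below; synonym replacement and insertion would additionally need a thesaurus such as WordNet and are omitted here.

```python
# A toy sketch of EDA's random swap and random deletion.
import random

def random_swap(words, n):
    """Randomly swap two words, repeated n times."""
    words = words[:]
    for _ in range(n):
        i, j = random.randrange(len(words)), random.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p):
    """Delete each word independently with probability p; keep at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = "a quick brown fox jumps over the lazy dog".split()
print(random_swap(sentence, n=2))
print(random_deletion(sentence, p=0.1))
```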

Contextual Augmentation (Kobayashi, 2018) replaces a word $w_i$ at position $i$ with another word sampled from the probability distribution predicted by a bidirectional language model at that position, so that the replacement is compatible with the surrounding context; conditioning the LM on the label (e.g. conditional BERT; Wu et al. 2018) helps keep the augmented sentence compatible with the original label.
Back-translation
Back-translation produces augmented data by translating text samples to another language and then translating them back. The translation happens in two ways and both directions should have decent enough performance to avoid significant loss of semantic meaning.
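Here is a hedged sketch of back-translation with off-the-shelf MarianMT checkpoints from HuggingFace; the specific model names are just examples of an English-French round trip, not part of any particular method described above.

```python
# A sketch of back-translation: translate to a pivot language and back.
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

def back_translate(texts):
    # English -> French -> English; both directions need decent quality.
    french = translate(texts, "Helsinki-NLP/opus-mt-en-fr")
    return translate(french, "Helsinki-NLP/opus-mt-fr-en")

print(back_translate(["The movie was surprisingly good."]))
```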
Mix-up
It is also possible to apply Mixup to text (Guo et al. 2019), operating in the embedding space, to obtain some performance gain. The proposed method relies on a specially designed model architecture to make predictions on word or sentence embeddings. Adding adversarial noise in the embedding space as a form of data augmentation has also been shown to improve the generalization of model training (Zhu et al. 2019).
Audio Augmentation
Here is a list of several commonly used audio data augmentation methods, operated on raw audio or spectrograms, summarized by Wang & van den Oord (2021).
- Audio mixup. Given two audio clips $\mathbf{x}_1$ and $\mathbf{x}_2$, the mixed version $\hat{\mathbf{x}} = \alpha \mathbf{x}_1 + (1-\alpha)\mathbf{x}_2$ should be associated with the label of the more dominant input.
- Time masking. A small consecutive chunk of the audio can be masked without losing semantic information.
- Frequency masking. A small amount of frequency components on the spectrogram can be dropped off and it should not change the associated label.
- Frequency shift. The spectrogram can be shifted by an integer between $-F$ and $F$, where $F$ is the maximum shift size. It is a cheap way to change the pitch of the audio.
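Below is a simple sketch of time and frequency masking on a spectrogram array; the mask widths and zero fill value are illustrative choices rather than values from any specific paper.

```python
# Masking a spectrogram of shape (num_freq_bins, num_time_steps).
import numpy as np

def time_mask(spec, max_width=10, rng=np.random):
    spec = spec.copy()
    width = rng.randint(1, max_width + 1)
    start = rng.randint(0, spec.shape[1] - width + 1)
    spec[:, start:start + width] = 0.0   # zero out a small consecutive chunk in time
    return spec

def freq_mask(spec, max_width=8, rng=np.random):
    spec = spec.copy()
    width = rng.randint(1, max_width + 1)
    start = rng.randint(0, spec.shape[0] - width + 1)
    spec[start:start + width, :] = 0.0   # drop a small band of frequency components
    return spec

spec = np.random.rand(80, 200)           # fake log-mel spectrogram
augmented = freq_mask(time_mask(spec))
```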
Architectural Augmentation
Models with dropout layers can create augmented samples by applying different dropout masks to the same input sample. For example, in the contrastive learning method SimCSE (Gao et al. 2021), a sample is simply fed into the encoder twice with different dropout masks and these two versions form a positive pair, while the other in-batch samples are treated as negatives.
Dropout augments data by adding noise onto the internal representation of the model. It can be applied in a more structured way, such as in cutoff (Shen et al. (2020)), where random chunks of the token embedding matrix are removed.
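A minimal sketch of dropout-as-augmentation in this spirit is shown below; the encoder is a stand-in module, not the actual SimCSE or cutoff implementation.

```python
# Feed the same input twice through an encoder in train mode so dropout masks differ.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.1), nn.Linear(128, 64))
encoder.train()                      # keep dropout active

x = torch.randn(32, 128)             # a batch of input features
z1 = encoder(x)                      # first view
z2 = encoder(x)                      # second view, under a different dropout mask
# (z1[i], z2[i]) form a positive pair; other in-batch samples act as negatives,
# e.g. via an InfoNCE loss over cosine similarities.
```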
Data Synthesis
Given that generating high-quality, photorealistic images is much more difficult than generating human-like natural language text, and given the recent success of large pretrained language models, this section focuses only on text generation. To read more on how to synthesize realistic images, check the posts on GAN, VAE, flow and diffusion models.
Language Model as Noisy Annotator
Wang et al. (2021) explored ways to leverage GPT-3 as a weak annotator via few-shot prompting, finding it to be roughly 10x cheaper than human labeling. The paper argues that using data labeled by GPT-3 essentially performs self-training: the predictions on unlabeled samples apply entropy regularization on the model to avoid high class overlaps and thus help improve the model performance.

GPT-3-labeled samples selected by active learning with the highest uncertainty are sent to human labelers to be re-annotated. The few-shot prompt contains only a small number of human-labeled examples, so the labeling cost stays limited. Synthetic samples are ranked by the predicted logit of the assigned label, and those with the lowest scores go through human re-labeling.
GPT-3 labeling achieves better results in the low-cost regime, but still has a gap with human labeling when enough money is spent on data collection. This implies the following inequality, although what exactly counts as “a lot” or “noisy” depends on the task details.
A lot of high-quality data > A lot of noisy data > A little high quality data.

Language Model as Data Generator
If a sufficiently large training dataset for a text classification task is available, we can fine-tune language models to synthesize more training samples conditioned on labels (Anaby-Tavor et al. 2019, Kumar et al. 2021).
Language-model-based data augmentation (LAMBADA; Anaby-Tavor et al. 2019) takes such an idea, where the process involves fine-tuning both a classifier and a sample generation model.
- Train a baseline classifier $h$ using the existing training dataset $\mathcal{D}_\text{train}$.
- Independently of step 1, a LM $\mathcal{M}$ is fine-tuned on $\mathcal{D}_\text{train}$ to obtain $\mathcal{M}_\text{tuned}$.
- Synthesize a labeled dataset $\mathcal{D}_\text{syn}$ by prompting $\mathcal{M}_\text{tuned}$ with `y[SEP]` and generating the continuation of the sequence until `EOS`.
- Filter the synthesized dataset by
  - (1) Verifying that the predicted label is correct, $h(x) = y$;
  - (2) Selecting the top ranked samples when they are ranked by the classifier probability of the label.
  They generate 10x more samples than needed for augmentation and keep only the top 10% of synthesized samples with the highest confidence scores.

The final classifier is trained on the union of $\mathcal{D}_\text{syn}$ and $\mathcal{D}_\text{train}$.
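To make the generate-then-filter recipe concrete, here is a compressed sketch that uses an off-the-shelf HuggingFace GPT-2 as a stand-in for the fine-tuned generator and a hypothetical classifier object `clf`; the prompt format and filtering thresholds follow the description above only loosely and none of this mirrors the paper's actual implementation.

```python
# A rough sketch of LAMBADA-style synthesis and filtering.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")        # stand-in for the fine-tuned LM
generator = GPT2LMHeadModel.from_pretrained("gpt2")

def synthesize(label, num_samples):
    prompt = f"{label} [SEP]"                            # "y[SEP]" style prompt
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = generator.generate(**inputs, do_sample=True, max_new_tokens=40,
                                 num_return_sequences=num_samples,
                                 pad_token_id=tokenizer.eos_token_id)
    texts = [tokenizer.decode(o, skip_special_tokens=True).split("[SEP]", 1)[-1].strip()
             for o in outputs]
    return [(t, label) for t in texts]

def lambada_filter(samples, clf, keep_ratio=0.1):
    # `clf` is a hypothetical baseline classifier with predict(text) -> label
    # and confidence(text, label) -> probability of that label.
    kept = [(x, y, clf.confidence(x, y)) for x, y in samples if clf.predict(x) == y]
    kept.sort(key=lambda item: item[2], reverse=True)
    return [(x, y) for x, y, _ in kept[:max(1, int(keep_ratio * len(samples)))]]
```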

To simplify LAMBADA, we can actually remove the dependency on a fine-tuned generation model and on an existing training dataset of a decent size (step 2 above). Unsupervised data generation (UDG; Wang et al. 2021) relies on few-shot prompting of a large pretrained language model to generate high-quality synthetic data for training. Opposite to the approach above, where the LM is asked to predict the label for a given input, UDG asks the LM to generate the input sample given a desired label.
Schick & Schutze (2021) proposed a similar idea but on the NLI task instead of classification, asking a PLM to write sentence pairs that are similar or different while the model is prompted with task-specific instructions.

The few-shot prompts of UDG contain a small number of unlabeled examples, as well as a task-specific natural language description of the desired label. Because some generated examples are noisy, they implemented a noisy label annealing (NLA) technique to filter potentially misaligned samples out during the training process. NLA gradually removes noisy training signals as training progresses, once the model starts to disagree with a pseudo label with high confidence. At each training step $t$, an example $(\mathbf{x}_i, \hat{y}_i)$ is considered noisy and removed if both of the following hold:
- The model's predicted probability is higher than a threshold $\mu_t$: $p(\bar{y}_i \mid \mathbf{x}_i) > \mu_t$, where $\bar{y}_i = \arg\max_y p(y \mid \mathbf{x}_i)$;
- And the predicted label is different from the synthetic label, $\bar{y}_i \neq \hat{y}_i$.
Note that the threshold $\mu_t$ starts high (0.9) and is gradually annealed toward $1/\#\text{classes}$ as training progresses, allowing more aggressive filtering once the model becomes more reliable.
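A toy sketch of the NLA check for a single example is below; the linear annealing schedule from 0.9 toward $1/\#\text{classes}$ is an illustrative choice, not the exact schedule from the paper.

```python
# Decide whether one synthetic training example should be dropped at a given step.
import numpy as np

def nla_threshold(step, total_steps, num_classes, start=0.9):
    frac = step / total_steps
    return start + frac * (1.0 / num_classes - start)    # anneal from 0.9 toward 1/K

def is_noisy(pred_probs, synthetic_label, step, total_steps):
    pred_label = int(np.argmax(pred_probs))
    confident = pred_probs[pred_label] > nla_threshold(step, total_steps, len(pred_probs))
    return confident and pred_label != synthetic_label

# Example: the model confidently predicts class 0 while the synthetic label says 2.
print(is_noisy(np.array([0.95, 0.03, 0.02]), synthetic_label=2, step=100, total_steps=1000))
```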
As shown in their experiments, the improvement of UDG over few-shot inference is quite significant, and NLA brings in some extra boost. The results are even comparable with supervised fine-tuning in several cases.

Han et al. (2021) achieved SOTA results on translation tasks using few-shot data generation, distillation and back-translation. The proposed method contains the following steps, assuming no access to paired translation data:
- Zero-shot Generation. First use the zero-shot translation ability of a pre-trained LM to generate translations for a small set of unlabeled sentences.
- Few-shot Generation. Then amplify these zero-shot translations by using them as few-shot demonstrations to gather an even larger synthetic dataset.
- Distillation. Fine-tune the model on this dataset. The translation task is formulated as a language modeling task `[L1] <seq1> [[TRANSLATE]] [L2] <seq2>` given a pair of two sequences `<seq1, seq2>` in two different languages. At test time, the LM is prompted with `[L1] <seq> [[TRANSLATE]] [L2]` and a candidate translation `<sampledSeq>` is parsed from the sampled completion.
- Back-translation. Continue fine-tuning on the back-translation dataset where the order of each pair is reversed, `<sampledSeq, seq>`.
- Steps 1-4 can be repeated.

The success of the above method depends on a good pretrained LM to kick off the initial translation dataset. Iterative few-shot generation and distillation with back-translation is an effective way to extract and refine the translation capability of a pretrained LM and further distill it into a new model.

How to Quantify Generated Data Quality?
Given all the generated data, either by data augmentation or data synthesis, how can we quantify its quality in terms of how it improves model generalization? Gontijo-Lopes et al. (2020) introduced two dimensions to track, affinity and diversity.
- Affinity is a model-sensitive metric for distribution shift, quantifying how much an augmentation shifts the training data distribution from what a model learned.
- Definition: The performance difference between the model tested on clean data vs augmented data, while the model is trained on clean data.
- As a comparison, KL divergence can also measure distribution shift, but it does not take the model performance into account.
- Diversity is a measure of augmentation complexity, measuring the complexity of the augmented data with respect to the model and learning procedure.
- Definition: The final training loss of a model trained with a given augmentation.
- Another potential diversity measure is the entropy of the transformed data.
- A third potential diversity measure is the training time needed for a model to reach a given training accuracy threshold.
- All three metrics above are correlated.
The final model performance depends on both metrics being sufficiently high.
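As a rough sketch of how these definitions translate to code (assuming a PyTorch classifier trained on clean data, plus one clean and one augmented validation loader; names are placeholders):

```python
# Affinity as defined above: accuracy gap between clean and augmented evaluation data
# for a model trained on clean data.
import torch

@torch.no_grad()
def accuracy(model, loader):
    correct = total = 0
    model.eval()
    for x, y in loader:
        correct += (model(x).argmax(dim=-1) == y).sum().item()
        total += y.numel()
    return correct / total

def affinity(model_trained_on_clean, clean_loader, augmented_loader):
    # How much the augmentation shifts data away from what the clean-trained model learned.
    return accuracy(model_trained_on_clean, clean_loader) - accuracy(model_trained_on_clean, augmented_loader)

# Diversity, under the first definition, is simply the final training loss of a model
# trained *with* the augmentation, so it falls out of the training loop itself
# (e.g. the average loss over the last epoch).
```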

There are many quantitative metrics on relevance and diversity, in different formulations depending on whether a reference is available, such as perplexity and BLEU for text or the Inception score for images. I’m skipping the full list of concrete quantitative quality metrics here, given it could be very long.
Training with Noisy Data
It is convenient to collect a large amount of noisy data via model generation or data augmentation, but it is hard to guarantee that augmented and generated data are 100% accurate. Knowing that deep neural networks can easily overfit and memorize corrupted labels, we can apply techniques for training with noisy labels (noise-robust training) when using generated data, to stabilize and optimize the performance. Please check this survey paper (Song et al. 2021) on learning from noisy labels for a more thorough coverage of related work.
Regularization and Robust Architecture
Generally speaking, mechanisms designed to avoid overfitting, such as weight decay, dropout and batch normalization, should help improve training robustness when working with moderately noisy data. In fact, good data augmentation (i.e. modifying only non-essential attributes) can be considered a form of regularization as well.
A different approach is to enhance the network with a dedicated noisy adaptation layer to approximate the unknown projection of label corruption (Sukhbaatar et al. 2015, Goldberger & Ben-Reuven, 2017).
Sukhbaatar et al. (2015) introduced an extra linear layer $Q$ into the network architecture to adapt the predictions to match the noisy label distribution: the base model’s softmax output is multiplied by $Q$, which is meant to approximate the label corruption (noise transition) matrix, while the base model itself learns to predict the true labels.

However, it is hard to guarantee that such a noise matrix layer would only capture the noise transition distribution, and it is actually non-trivial to learn. Goldberger & Ben-Reuven (2017) proposed to add an additional softmax layer trained end-to-end with the base model and to apply the EM algorithm, treating the correct labels as a latent random variable and the noise process as a communication channel with unknown parameters.
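Here is a minimal sketch in the spirit of a noise adaptation layer (not the exact architecture from either paper): the base model's class probabilities are multiplied by a learnable row-stochastic matrix to produce noisy-label predictions.

```python
# Base model predicts clean labels; a learnable transition matrix maps them to noisy labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAdaptation(nn.Module):
    def __init__(self, base_model, num_classes):
        super().__init__()
        self.base_model = base_model
        # Initialize near the identity so the noise layer starts close to a no-op.
        self.transition_logits = nn.Parameter(torch.eye(num_classes) * 5.0)

    def forward(self, x):
        clean_probs = F.softmax(self.base_model(x), dim=-1)       # p(y | x)
        transition = F.softmax(self.transition_logits, dim=-1)    # rows ~ p(noisy | true)
        noisy_probs = clean_probs @ transition                    # p(noisy label | x)
        return noisy_probs

# Train with NLL on noisy labels: loss = F.nll_loss(torch.log(noisy_probs + 1e-8), noisy_y)
# At test time, use self.base_model's output directly (the clean prediction head).
```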
Robust Learning Objective
Besides the most commonly used cross entropy loss, some other choices of learning objectives are shown to be more robust to noisy labels.
For example, MAE (mean absolute error) is more robust to noisy labels than CCE (categorical cross entropy), as it treats every sample equally (Ghosh et al. 2017). However, the lack of different weighting among training samples causes MAE training to take significantly longer. Motivated by the tradeoff between MAE and CCE, Zhang & Sabuncu (2018) proposed generalized cross entropy (GCE), a generalization of the CCE loss that is robust to noisy data.
To exploit the benefits of both the noise-robustness of MAE and the implicit weighting scheme of CCE, GCE adopts the negative Box-Cox transformation as a loss function:

$$\mathcal{L}_q(f(\mathbf{x}), y = j) = \frac{1 - f_j(\mathbf{x})^q}{q}$$

where $f_j(\mathbf{x})$ is the predicted probability of class $j$ and $q \in (0, 1]$. $\mathcal{L}_q$ is equivalent to CCE when $q \to 0$ and becomes MAE when $q = 1$.
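A short PyTorch sketch of the GCE loss above is shown below; `q=0.7` is just a commonly used default, not something prescribed by this post.

```python
# Generalized cross entropy (GCE) loss.
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    probs = F.softmax(logits, dim=-1)
    p_true = probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1)  # f_j(x) for the labeled class
    return ((1.0 - p_true.clamp_min(1e-7).pow(q)) / q).mean()

logits = torch.randn(8, 5, requires_grad=True)
targets = torch.randint(0, 5, (8,))
loss = gce_loss(logits, targets)
loss.backward()
```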
Given true and predicted labels $y_i, \hat{y}_i \in \{-1, +1\}$, the zero-one loss, $\mathcal{L}_{01}(y_i, \hat{y}_i) = \mathbb{1}[y_i \hat{y}_i < 0]$, is another learning objective that is robust to label noise, but it is non-differentiable and hard to optimize directly.
The hinge loss, $\mathcal{L}_\text{hinge}(y_i, \hat{y}_i) = \max(0, 1 - y_i \hat{y}_i)$, is a common differentiable surrogate, but only a loose upper bound of the zero-one loss. Lyu & Tsang (2020) proposed the curriculum loss (CL), a tighter upper bound of the zero-one loss that implicitly selects which samples contribute to the objective.
Given a label corruption rate $\tau$, the Noise Pruned Curriculum Loss (NPCL) extends CL by pruning roughly a $\tau$ fraction of the samples with the largest losses, treating them as likely corrupted.
When experimenting on CIFAR-10, NPCL is comparable with GCE and performs better when the noise rate increases.
Label Correction
Since it is known some labels are incorrect, noise-robust training can explicitly take the label correction into consideration.
One approach is to rely on the estimation of a noise transition matrix and use that to correct the forward or backward loss, named F-correction (Patrini et al. 2017). Let’s first assume that there are $k$ classes and the noise transition matrix $C \in [0, 1]^{k \times k}$ is class-dependent but independent of the input, $C_{ij} = p(\tilde{y} = j \mid y = i)$, where $y$ is the true label and $\tilde{y}$ is the observed noisy label.
Then we can proceed with a forward label correction procedure to incorporate the prior knowledge of the noise transition matrix into the prediction:

$$\ell(h(\mathbf{x}), \tilde{y}) = -\log \hat{p}(\tilde{y} \mid \mathbf{x}) = -\log \sum_{i=1}^k p(\tilde{y} \mid y = i)\, \hat{p}(y = i \mid \mathbf{x})$$

In matrix form, we have $\ell_F(h(\mathbf{x})) = \ell(C^\top h(\mathbf{x}))$, where $h(\mathbf{x})$ is the vector of predicted class probabilities.
In practice, however, the true noise transition matrix is unknown. If a small set of trusted, cleanly labeled data is available, $C$ can be estimated by averaging the model’s predicted noisy-label distribution over the trusted samples of each true class (Hendrycks et al. 2018), and the estimate then replaces $C$ in the corrected loss.
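A small sketch of the forward correction under the notation above, assuming the (estimated) transition matrix is supplied as a `(k, k)` tensor:

```python
# Forward loss correction: mix predicted class probabilities through the noise
# transition matrix before taking the NLL against the noisy labels.
import torch

def forward_corrected_loss(logits, noisy_targets, transition):
    """transition[i, j] ~ p(noisy label = j | true label = i), shape (k, k)."""
    clean_probs = torch.softmax(logits, dim=-1)              # \hat{p}(y | x)
    noisy_probs = clean_probs @ transition                   # \hat{p}(noisy y | x)
    picked = noisy_probs.gather(-1, noisy_targets.unsqueeze(-1)).squeeze(-1)
    return -torch.log(picked.clamp_min(1e-8)).mean()
```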

If a trusted training dataset $\mathcal{D}_c$ is available, we can also correct labels via distillation (Li et al. 2017): a model trained on the small clean dataset guides the training of the primary model on the large noisy dataset.
The label correction distillation works as follows:
- First train an auxiliary model $f_c$ on the small clean dataset $\mathcal{D}_c$ to provide a soft label for each sample $\mathbf{x}_i$, $s_i = \delta(f_c(\mathbf{x}_i)/T)$, where $\delta(\cdot)$ is the sigmoid activation with temperature $T$.
- Because the clean dataset is not large, $f_c$ is likely to overfit, so Li et al. (2017) turn to a knowledge graph $\mathcal{G}$ that defines the relations in the label space and propagate the predictions among labels accordingly. The new soft label is denoted as $\hat{s}_i$.
- The primary model $f$ is trained with predictions from $f_c$ to imitate, using a combination of the noisy label and the propagated soft label, $\lambda y_i + (1-\lambda)\hat{s}_i$, as its target.
Sample Reweighting and Selection
Some samples may be more likely to have inaccurate labels than others. Such estimation gives us intuition on which samples should be weighted less or more in the loss function. However, considering two types of biases in training data, class imbalance and noisy labels, there is actually a contradictory preference — we would prefer samples with larger loss to balance the label distribution but samples with smaller loss to mitigate the potential noise. Some work (Ren et al. 2018) thus argues that in order to learn general forms of training data biases, it is necessary to have a small unbiased validation set to guide training. The sample reweighting methods presented in this section all assume access to a small trusted set of clean data.
Considering a binary classification task with random classification noise, $y, \tilde{y} \in \{-1, +1\}$, where the noisy label $\tilde{y}$ flips the true label $y$ with class-conditional probabilities $\rho_{+1} = p(\tilde{y} = -1 \mid y = +1)$ and $\rho_{-1} = p(\tilde{y} = +1 \mid y = -1)$,
Liu & Tao (2015) applies importance reweighting to adjust the weighted distribution of the observed noisy labels to match the distribution of the unobservable true labels.
Because

$$p_D(\mathbf{x}, y) = \frac{p_D(y \mid \mathbf{x})}{p_{D_\rho}(\tilde{y} = y \mid \mathbf{x})}\, p_{D_\rho}(\mathbf{x}, \tilde{y} = y)$$

the weight assigned to a noisy sample is

$$w(\mathbf{x}, \tilde{y}) = \frac{p_D(y = \tilde{y} \mid \mathbf{x})}{p_{D_\rho}(\tilde{y} \mid \mathbf{x})}$$

where the noisy posterior $p_{D_\rho}(\tilde{y} \mid \mathbf{x})$ can be estimated directly from the noisy data, and the clean posterior can then be recovered from it given estimates of the flip rates $\rho_{+1}$ and $\rho_{-1}$.
Sample reweighting schemes can also be learned by a separate network. Learning to reweight (L2R; Ren et al. 2018) is a meta-learning approach that directly optimizes the weights in pursuit of the best validation performance on a known set of clean data. Each example gets assigned a weight based on its gradient direction. The weighted loss to minimize is $\theta^{*}(\mathbf{w}) = \arg\min_\theta \sum_{i=1}^N w_i f_i(\theta)$, while the optimal weights are learned by minimizing the loss on the clean validation set, $\mathbf{w}^{*} = \arg\min_{\mathbf{w}} \frac{1}{M}\sum_{j=1}^M f^{\text{val}}_j\big(\theta^{*}(\mathbf{w})\big)$.
The learning process involves two nested loops of optimization, so it is pretty expensive, costing roughly 3x the training time.

They ran experiments on (1) two-class MNIST to test the robustness of L2R when the class distribution is imbalanced and (2) CIFAR-10 with noisy labels. L2R is shown to be better than other baseline methods at the time on both tasks.

MentorNet (Jiang et al. 2018) uses teacher-student curriculum learning to weight data. It incorporates two different networks, a mentor and a student. The mentor network provides a data-driven curriculum (i.e. a sample weighting scheme) for the student to focus on learning likely correct labels.
Let $g_\psi$ denote the MentorNet parameterized by $\psi$ and $f_\theta$ denote the StudentNet parameterized by $\theta$. For each sample, the mentor outputs a weight $w_i \in [0, 1]$ based on features such as the student’s current loss on that sample and the training progress.
StudentNet learns to minimize a weighted training loss, $\frac{1}{n}\sum_{i=1}^n w_i\, \ell(y_i, f_\theta(\mathbf{x}_i))$, plus a regularization term $G(\mathbf{w}; \lambda)$ on the weights that encodes the curriculum.
The mentor network $g_\psi$ can either approximate a predefined curriculum (e.g. a self-paced weighting rule) or be learned from a small dataset with known correct labels, so that it assigns low weights to samples that are likely mislabeled.

Different from MentorNet, where one network explicitly learns the weighting scheme and curriculum for the other network, Co-teaching (Han et al. 2018) trains two neural networks, $f_1$ and $f_2$, simultaneously and lets each network teach the other by selecting which samples to train on:
- First, each network feeds forward the current mini-batch and selects samples with potentially clean labels;
- Then the two networks exchange information on which samples in the batch should be used for training. Small-loss instances are selected as they are more likely to be associated with correct labels. The percentage of the batch to select is determined by a time-dependent function $R(T)$. The value of $R(T)$ decreases in time because the network is more likely to overfit and memorize noisy labels as training progresses, and thus we use a smaller sampling percentage to keep the selected data quality high.
- Finally, each network runs back-propagation updates on the data selected by its peer.
According to their experiments, co-teaching performs better than F-correction when the noise rate is high or the corruption transition matrix is not symmetric.
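A compact sketch of one co-teaching update step under the procedure above is shown below; `keep_ratio` plays the role of $R(T)$ and the model/optimizer objects are assumed to exist.

```python
# One co-teaching step: each network ranks samples by loss and its peer trains on the
# small-loss selection.
import torch
import torch.nn.functional as F

def coteaching_step(f1, f2, opt1, opt2, x, y, keep_ratio):
    with torch.no_grad():                      # rank samples by loss without building graphs
        loss1 = F.cross_entropy(f1(x), y, reduction="none")
        loss2 = F.cross_entropy(f2(x), y, reduction="none")
    k = max(1, int(keep_ratio * len(y)))
    idx1 = torch.argsort(loss1)[:k]            # samples f1 believes are clean
    idx2 = torch.argsort(loss2)[:k]            # samples f2 believes are clean

    # Each network updates on the samples selected by its peer.
    opt1.zero_grad(); F.cross_entropy(f1(x[idx2]), y[idx2]).backward(); opt1.step()
    opt2.zero_grad(); F.cross_entropy(f2(x[idx1]), y[idx1]).backward(); opt2.step()
```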

Citation
Cited as:
Weng, Lilian. (Apr 2022). Learning with not enough data part 3: data generation. Lil’Log. https://lilianweng.github.io/posts/2022-04-15-data-gen/.
Or
@article{weng2022datagen,
title = "Learning with not Enough Data Part 3: Data Generation",
author = "Weng, Lilian",
journal = "Lil'Log",
year = "2022",
month = "Apr",
url = "https://lilianweng.github.io/posts/2022-04-15-data-gen/"
}
Reference
[1] Zhang et al. “Adversarial AutoAugment” ICLR 2020.
[2] Kumar et al. “Data Augmentation using Pre-trained Transformer Models.” AACL 2020 Workshop.
[3] Anaby-Tavor et al. “Not enough data? Deep learning to rescue!” AAAI 2020.
[4] Wang et al. “Want To Reduce Labeling Cost? GPT-3 Can Help.” EMNLP 2021.
[5] Wang et al. “Towards Zero-Label Language Learning.” arXiv preprint arXiv:2109.09193 (2021).
[6] Schick & Schutze. “Generating Datasets with Pretrained Language Models.” EMNLP 2021.
[7] Han et al. “Unsupervised Neural Machine Translation with Generative Language Models Only.” arXiv preprint arXiv:2110.05448 (2021).
[8] Guo et al. “Augmenting data with mixup for sentence classification: An empirical study.” arXiv preprint arXiv:1905.08941 (2019).
[9] Ekin D. Cubuk et al. “AutoAugment: Learning augmentation policies from data.” arXiv preprint arXiv:1805.09501 (2018).
[10] Daniel Ho et al. “Population Based Augmentation: Efficient Learning of Augmentation Policy Schedules.” ICML 2019.
[11] Cubuk & Zoph et al. “RandAugment: Practical automated data augmentation with a reduced search space.” arXiv preprint arXiv:1909.13719 (2019).
[12] Zhang et al. “mixup: Beyond Empirical Risk Minimization.” ICLR 2018.
[13] Yun et al. “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.” ICCV 2019.
[14] Kalantidis et al. “Hard Negative Mixing for Contrastive Learning.” NeurIPS 2020.
[15] Wei & Zou. “EDA: Easy data augmentation techniques for boosting performance on text classification tasks.” EMNLP-IJCNLP 2019.
[16] Kobayashi. “Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations.” NAACL 2018
[17] Fang et al. “CERT: Contrastive self-supervised learning for language understanding.” arXiv preprint arXiv:2005.12766 (2020).
[18] Gao et al. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv preprint arXiv:2104.08821 (2021). [code]
[19] Shen et al. “A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation.” arXiv preprint arXiv:2009.13818 (2020) [code]
[20] Wang & van den Oord. “Multi-Format Contrastive Learning of Audio Representations.” NeurIPS Workshop 2020.
[21] Wu et al. “Conditional BERT Contextual Augmentation” arXiv preprint arXiv:1812.06705 (2018).
[22] Zhu et al. “FreeLB: Enhanced Adversarial Training for Natural Language Understanding.” ICLR 2020.
[23] Gontijo-Lopes et al. “Affinity and Diversity: Quantifying Mechanisms of Data Augmentation.” arXiv preprint arXiv:2002.08973 (2020).
[24] Song et al. “Learning from Noisy Labels with Deep Neural Networks: A Survey.” TNNLS 2020.
[25] Zhang & Sabuncu. “Generalized cross entropy loss for training deep neural networks with noisy labels.” NeurIPS 2018.
[26] Goldberger & Ben-Reuven. “Training deep neural-networks using a noise adaptation layer.” ICLR 2017.
[27] Sukhbaatar et al. “Training convolutional networks with noisy labels.” ICLR Workshop 2015.
[28] Patrini et al. “Making Deep Neural Networks Robust to Label Noise: a Loss Correction Approach” CVPR 2017.
[29] Hendrycks et al. “Using trusted data to train deep networks on labels corrupted by severe noise.” NeurIPS 2018.
[30] Zhang & Sabuncu. “Generalized cross entropy loss for training deep neural networks with noisy labels.” NeurIPS 2018.
[31] Lyu & Tsang. “Curriculum loss: Robust learning and generalization against label corruption.” ICLR 2020.
[32] Han et al. “Co-teaching: Robust training of deep neural networks with extremely noisy labels.” NeurIPS 2018. (code)
[33] Ren et al. “Learning to reweight examples for robust deep learning.” ICML 2018.
[34] Jiang et al. “MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels.” ICML 2018.
[35] Li et al. “Learning from noisy labels with distillation.” ICCV 2017.
[36] Liu & Tao. “Classification with noisy labels by importance reweighting.” TPAMI 2015.
[37] Ghosh, et al. “Robust loss functions under label noise for deep neural networks.” AAAI 2017.
[38] Hu et al. “Does Distributionally Robust Supervised Learning Give Robust Classifiers? “ ICML 2018.
Writing a label as a scalar value (e.g. $y = j$) is not a technically correct way to annotate a label being a certain value, since we usually use one-hot encoding (i.e. a one-hot vector with a single 1 at position $j$). We use this form for simplicity. ↩︎