Language models for low resource text classification

The 2019 paper on zero shot classification via entailment (Yin et al.1) had a big impact on the community - it was intuitive and promised to enable zero shot classification simply by using large pretrained models. HuggingFace Transformers supported it natively soon after, and many articles demonstrated how easy it was to implement.

However, applying it to real world tasks is still difficult, as achieving production level accuracy in many cases still requires a lot of training data. Richard Socher makes this point2: “the cool thing with language models is that they will very quickly do something; without any training, they will produce something that kind of works; but once you really care about that particular output, you’re going to want to finetune that model to be really good at that task”.

In this post I will review text classification with large language models, in particular techniques used in low resource conditions.

Yin et al. define 2 types of zero shot: one where labels are partially seen during training, and another where labels are fully unseen. They frame zero shot classification as a textual entailment problem, training BERT on 3 entailment datasets (MNLI, GLUE RTE and FEVER) and evaluating against 3 types of classification datasets - topic categorization, emotion detection, and situation frame detection.
For fully unseen zero shot, the model is applied directly to the test set; for partially seen, the model is first finetuned on the training set before evaluating on the test set.

In the partially seen case, the NLI approach beats BERT finetuned on the training set by 5 to 10.4 points, showing the strength of the entailment formulation. In the fully unseen case, the NLI approach does better than Wikipedia-based BERT for emotion and situation detection but is weaker for topic categorization, the likely reason being that Wikipedia-based training is much more similar to the topic detection task.
They also test an ensemble approach that sums the post-softmax probabilities of each input pair and applies softmax again to get new probabilities - this ensemble performs better than the 3 individual models trained on MNLI, GLUE RTE and FEVER.
The best performance achieved for the fully unseen case is 45.7 accuracy for topic categorization, 25.2 label-wise weighted F1 for emotion detection, and 38.0 label-wise weighted F1 for situation detection.

HuggingFace zero shot

In 2020 HuggingFace published3 an implementation of this technique in their framework. Using any model trained on NLI sequence pair classification, one can reformulate the input text and candidate labels as premise and hypothesis, feed the pairs to the model, and obtain probability scores for the label(s).

There are 2 ways of calling the sequence classifier in HuggingFace: 1, by manually initializing the model and tokenizer and handling the model input and output yourself; 2, by using the HuggingFace pipeline abstraction, which wraps all initialization, configuration and processing - only the model name needs to be provided.
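
For the first approach, a minimal sketch might look like the following (it assumes the facebook/bart-large-mnli checkpoint, whose logits are ordered contradiction, neutral, entailment - other NLI checkpoints may order them differently):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The new phone promises a two day battery life."
hypothesis = "This example is about technology."  # candidate label rendered into a template

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits[0]  # (contradiction, neutral, entailment)

# score the label by softmaxing the entailment logit against the contradiction logit
prob = logits[[0, 2]].softmax(dim=0)[1].item()
print(f"probability that the label applies: {prob:.3f}")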

For the second approach, the default pipeline can be subclassed to customize its behaviour, for example:

from transformers import ZeroShotClassificationPipeline

class MyPipeline(ZeroShotClassificationPipeline):
    def preprocess(self, inputs, candidate_labels=None, hypothesis_template='This example is {}.'):
        # override input params, e.g. supply a task specific hypothesis template
        return super().preprocess(inputs, candidate_labels, hypothesis_template)

    def postprocess(self, model_outputs, multi_label=False):
        # do own postprocessing, e.g. thresholding or reordering of scores
        return super().postprocess(model_outputs, multi_label)
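
The subclassed pipeline can then be passed to the pipeline factory (a sketch - the model name is just an example):

from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    pipeline_class=MyPipeline,  # use the subclass instead of the default pipeline
)
result = classifier(
    "The new phone promises a two day battery life.",
    candidate_labels=["technology", "sports", "politics"],
)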

In preprocess, the placeholder token {} in the hypothesis template is replaced with a label and concatenated with the input text to form a premise-hypothesis pair, for every label in the candidate label list.
In postprocess, the conversion from logits to probabilities is handled. The forward pass produces logits in 3 dimensions (contradiction, neutral, entailment) for each pair.
In the single label or multilabel case, the probability score for each label is the softmax of its entailment vs contradiction logits, computed independently. In the multiclass case, the probability scores are the softmax over the entailment logits of all labels. Note that in Yin et al., training on the entailment datasets is actually done in the binary case, with the neutral label changed to non-entailment.
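
As a toy illustration of the two postprocessing modes (again assuming logits ordered as contradiction, neutral, entailment, with one row per candidate label):

import torch

# one row of (contradiction, neutral, entailment) logits per candidate label
logits = torch.tensor([[ 1.2, 0.1, 3.4],
                       [ 2.5, 0.3, 0.7],
                       [-0.4, 0.2, 1.9]])

# multilabel: each label is scored independently, entailment vs contradiction
multi_label_scores = logits[:, [0, 2]].softmax(dim=1)[:, 1]

# multiclass: softmax over the entailment logits of all labels
multi_class_scores = logits[:, 2].softmax(dim=0)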

The pipeline batches the input internally; when multiple labels are provided, the model essentially performs a forward pass for each premise-hypothesis pair. By default, the premise sequence is truncated if it exceeds the model's maximum sequence length.

A few practical points to note:

Harder datasets for better performance

Following the paper and the HuggingFace release, initial models were trained mostly on MNLI, using RoBERTa, BART, and BERT bases.
ANLI (Nie et al.4) is a more challenging benchmark created via a human-and-model-in-the-loop process, consisting of 3 rounds of data, each incorporating adversarial examples composed against the model from the previous round.
The authors explore the types of inference that fool models, finding that round 1 and 2 examples rely heavily on numerical and quantitative reasoning, which drops off in round 3. Adversarial examples exploiting standard inferences (conjunctions, negations, cause and effect, comparatives etc.), as well as lexical inferences (requiring knowledge of synonyms, antonyms etc.), increase as the rounds progress. Examples exploiting outside knowledge or additional facts maintain a high proportion throughout all rounds.
This dataset has been incorporated into many new models on the HuggingFace model hub, improving performance on some harder natural language understanding tasks.

Finetuning vs prompting

There have been many broad trends in the area of zero shot transfer learning. For the NLI approach, finetuning with NLI data allows the model to perform zero shot sequence pair classification with the same template. Before looking at other trends, let’s first take a look at finetuning in general.

In classical finetuning, the pretrained model is exposed to new data points which are used to update the model weights across all layers. Finetuning has led to SOTA results for a large variety of tasks, but importantly it requires a large amount of labeled data to be effective, and it incurs other practical costs such as being more expensive to train and to run at inference time.

The conceptual workflow for finetuning large language models is:

  1. Start with a strong pre-training objective, e.g. masked language modelling (MLM) on large amounts of unlabelled data
    • Domain adaptive finetuning can also be performed here to specialize the model closer to the distribution of the target data
  2. Apply finetuning on one or more tasks, by replacing or appending a task specific head on the model’s last output layer and training on task data. This step is often performed with publicly available labeled datasets that are large but not specific to the target task, as is the case with the Yin et al. approach. The idea is to generalize the model’s representations to related tasks, or to strengthen the model’s natural language reasoning abilities.
  3. To further improve on the actual target task, the model can be fed in-domain training examples and tuned with a cross entropy loss (see the sketch after this list). For the NLI approach, if the in-domain data is small (few-shot) there is unlikely to be much improvement. The partially seen definition of zero shot in the NLI paper refers to large amounts of training data for some of the labels; in that case the performance is likely to be better.
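
As a minimal sketch of steps 2 and 3 - the model name, label count and toy data below are placeholders rather than a recommended setup:

import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# replaces the pretraining head with a freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

texts = ["the service was great", "my flight got cancelled"]  # toy in-domain examples
labels = torch.tensor([0, 1])

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)  # cross entropy loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()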

In step 1, the pre-training objective does affect the performance of downstream tasks. Talmor et al.6 investigate what reasoning abilities are captured, and find clear differences between language models with similar architectures. Even when a model has high performance on a task, small changes to the input can degrade performance significantly.

In step 2, multi-task learning is one approach to increase the generalizability of the intermediate model, and knowledge distillation can be used to compress the resulting ensemble into a single student model (Liu et al.7).
An increasingly popular tangent is the use of prompts, or input patterns, to adapt a single model architecture to different tasks.

There are 2 well-known approaches to this:

  1. In autoregressive/generation focused models, e.g. GPT-38, the prompt is a natural language description of the task that is included with the input and conditions the language model to predict continuations that solve the task. A core premise is that by increasing the size of the model and the quality and diversity of pretraining, the model learns true language understanding, which can then be applied naturally to different tasks.
    GPT-3 appears to be weak in the few or one shot setting at tasks that involve comparing 2 sentences or snippets, for example whether a word is used in the same way in 2 sentences, whether one sentence is a paraphrase of another, or whether one sentence implies another (i.e. ANLI and RTE). One explanation for GPT-3’s lagging performance is that these tasks empirically benefit from bidirectionality, which an autoregressive model is unable to provide.
  2. In MLM pretrained models, e.g. PET9 (https://github.com/timoschick/pet), task descriptions are cast as fill-in-the-blank templates and input examples are reformulated as cloze-style phrases, allowing the model to solve different tasks.
    Specifically, for PET,
    • A pattern is defined as a function P that takes an input x and outputs a phrase P(x) containing a mask token
    • A verbalizer v is defined as an injective function that maps each label to a word in the language model’s vocabulary
    • (P, v) is defined as a pattern-verbalizer pair (PVP); a toy sketch of a PVP follows after this list
    • A set of PVPs that make sense for a given task is defined
    • For each pattern P, a separate language model is finetuned
    • An ensemble of the finetuned models is used to annotate examples from unlabelled data with soft labels
    • A pretrained language model C with a standard sequence classification head is then finetuned on this soft-labeled dataset, and serves as the final classifier
    • The authors also provide an iterative approach (iPET), which can be applied in a zero shot manner:
      • Start with an ensemble of untrained models (zero shot) or trained models (few shot)
      • For each model, generate a new training set from the unlabelled data using a random subset of the other models
      • Train a new generation of models on the larger, model specific datasets
      • The final generation of models is used to create a soft-labeled dataset, and a classifier is trained on this data
      • At each iteration, when selecting newly labeled examples, those for which the ensemble is confident in its prediction are preferred, to avoid training future generations on mislabeled data
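
As a toy sketch of a single PVP, scored here with an off-the-shelf MLM (in PET the MLM would additionally be finetuned on the pattern; the pattern, verbalizer and model below are illustrative only):

from transformers import pipeline

def pattern(x: str) -> str:
    # P(x): reformulate the input as a cloze phrase containing a mask token
    return f"{x} All in all, it was [MASK]."

verbalizer = {"great": "positive", "terrible": "negative"}  # v: vocabulary word -> label

mlm = pipeline("fill-mask", model="bert-base-uncased")
predictions = mlm(pattern("I really enjoyed this movie!"), targets=list(verbalizer))
best = max(predictions, key=lambda p: p["score"])
print(verbalizer[best["token_str"]])  # predicted label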

With no examples, iterative PET (iPET) starts out at average accuracies of 53.6 to 87.5 across the 4 datasets, compared to 33.8 to 69.5 for the base RoBERTa model. With 100 examples, iPET reaches average accuracies of 62.9 to 89.6, including a 25 point improvement on the MNLI dataset, while supervised RoBERTa achieves 47.9 to 86.0.
A subsequent10 version of PET/iPET uses ALBERT and adapts PET to multiple mask tokens, by performing k consecutive predictions in which the next token is selected based on the MLM’s confidence. With these settings, PET performs 18 points better than GPT-3 Med, a model of similar size, on SuperGLUE.

To summarize, prompt based formulations in both the generative and cloze settings allow a single model to adapt to different tasks, and can be applied to the target task in a zero shot manner or finetuned using the same template. This is in contrast to classical finetuning, which uses a task specific head and large amounts of labeled data.

Advances in prompting

How many data points is a prompt worth?11 A 2021 paper addresses this question. The authors compare the head based approach with a prompt approach similar to PET, starting from 10 data points and increasing exponentially, and find that prompting has a distinct advantage in low resource conditions for 5 out of 6 SuperGLUE tasks, with an average advantage worth hundreds of data points.
For MNLI, prompting yields an advantage of approximately 3500 data points, although when compared to a null verbalizer control (where the task descriptions, e.g. yes, no, maybe, are replaced with random first names), the advantage is much smaller. This suggests that having a prompt is important, but the actual verbalization of the prompt much less so; in experiments comparing different prompts, prompt choice does not appear to be a dominant hyperparameter.

As a counterpoint to this, many other lines of research suggest that prompt design and optimisation are important considerations.
One major research direction is automatic prompt generation, a technique we can also apply to our original NLI classification formulation, in the construction of the premise and hypothesis.

Some techniques explored include:

  • Mining- and paraphrasing-based generation of prompt candidates, combined with prompt ensembling (Jiang et al.12)
  • Automatic generation and selection of templates and label words using a generative model, together with demonstrations in context (Gao et al.13)
  • Gradient-guided search for prompt trigger tokens (AutoPrompt, Shin et al.14)

Taken together, these findings reinforce the following points: 1, prompting has widely varying impact depending on the task, model, and data; 2, model task performance is sensitive to prompt design, especially the pattern template; 3, prompt ensembling yields improvements in most cases, similar to model ensembling; 4, in general, prompting performs better than finetuning in low resource conditions and under domain shift.
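
Coming back to the NLI formulation, a lightweight version of this idea is to score a handful of candidate hypothesis templates on a small labeled dev set and keep the best one - a sketch, with placeholder templates, labels and data:

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

dev_set = [("The battery lasts two days on a single charge.", "technology"),
           ("The match was decided on penalties.", "sports")]
labels = ["technology", "sports", "politics"]
templates = ["This example is {}.",
             "This text is about {}.",
             "The topic of this document is {}."]

def accuracy(template):
    # fraction of dev examples where the top ranked label matches the gold label
    hits = 0
    for text, gold in dev_set:
        result = classifier(text, candidate_labels=labels, hypothesis_template=template)
        hits += result["labels"][0] == gold
    return hits / len(dev_set)

best_template = max(templates, key=accuracy)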

A further series of related techniques focuses on continuous prompts (as opposed to discrete tokens) to improve prompt design. In continuous prompting, free parameters which do not correspond to real tokens are constructed as vectors and injected directly into the model’s input or intermediate layers.

Some key approaches include:

  • Prefix-tuning, which prepends trainable continuous vectors to the activations at every layer while keeping the language model frozen (Li and Liang15)
  • Prompt tuning, which learns soft prompt embeddings prepended only to the input, and finds that the gap to full finetuning closes as model size grows (Lester et al.16)
  • P-tuning, which learns continuous prompt embeddings through a small prompt encoder and mixes them with discrete tokens (Liu et al.17)
  • P-tuning v2, which applies deep prompt tuning across all layers, reaching performance comparable to finetuning across scales and tasks (Liu et al.18)

In summary: 1, the ideal prompts do not necessarily correspond to real tokens or to what a human evaluator would judge sensible; 2, prompting scales well with model size and is strongly affected by the pretraining objective.
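
A minimal sketch of the soft prompt idea, in the spirit of prompt tuning but simplified to an encoder classifier - the backbone, prompt length and classification head are assumptions, not the published setups:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"  # assumed backbone
n_prompt_tokens = 20         # length of the continuous prompt

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
for p in model.base_model.parameters():  # freeze the backbone; only the small head stays trainable
    p.requires_grad = False

embed = model.get_input_embeddings()
# free parameters that do not correspond to real tokens
soft_prompt = torch.nn.Parameter(torch.randn(n_prompt_tokens, embed.embedding_dim) * 0.02)

def forward_with_prompt(texts, labels=None):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    token_embeds = embed(batch["input_ids"])                  # (batch, seq, hidden)
    prompt = soft_prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)  # prepend the soft prompt
    attention_mask = torch.cat(
        [torch.ones(token_embeds.size(0), n_prompt_tokens, dtype=torch.long),
         batch["attention_mask"]], dim=1)
    return model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)

# an optimizer would be given soft_prompt plus the classification head parameters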

There are still improvements to be made before we can apply text classification with language models in production in a truly general, zero shot way. Prompting is one approach that leverages the scale and representational ability gained in pretraining to generalize better. Nonetheless, we still need to find and design templates or architectures that better fit specific tasks or data, which is less straightforward to evaluate than classical finetuning. The research space is moving very rapidly though, and new improvements will likely keep challenging the limits of low resource text classification.


Thanks to Shantanu for your comments and feedback.



References

  1. Yin, Wenpeng et al. “Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach.” ArXiv abs/1909.00161 (2019)
  2. https://www.youtube.com/watch?v=Aqa_lj5HiBE
  3. https://joeddav.github.io/blog/2020/05/29/ZSL.html
  4. Nie, Yixin et al. “Adversarial NLI: A New Benchmark for Natural Language Understanding.” ArXiv abs/1910.14599 (2020)
  5. https://ruder.io/recent-advances-lm-fine-tuning/
  6. Talmor, Alon et al. “oLMpics-On What Language Model Pre-training Captures.” Transactions of the Association for Computational Linguistics 8 (2020): 743-758.
  7. Liu, Xiaodong et al. “Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding.” ArXiv abs/1904.09482 (2019)
  8. Brown, Tom B. et al. “Language Models are Few-Shot Learners.” ArXiv abs/2005.14165 (2020)
  9. Schick, Timo and Hinrich Schütze. “Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference.” EACL (2021)
  10. Schick, Timo and Hinrich Schütze. “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners.” ArXiv abs/2009.07118 (2021)
  11. Scao, Teven Le and Alexander M. Rush. “How many data points is a prompt worth?” NAACL (2021).
  12. Jiang, Zhengbao et al. “How Can We Know What Language Models Know?” Transactions of the Association for Computational Linguistics 8 (2020): 423-438.
  13. Gao, Tianyu et al. “Making Pre-trained Language Models Better Few-shot Learners.” ArXiv abs/2012.15723 (2021)
  14. Shin, Taylor et al. “Eliciting Knowledge from Language Models Using Automatically Generated Prompts.” ArXiv abs/2010.15980 (2020)
  15. Li, Xiang Lisa and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation.” ACL-IJCNLP (2021)
  16. Lester, Brian et al. “The Power of Scale for Parameter-Efficient Prompt Tuning.” ArXiv abs/2104.08691 (2021)
  17. Liu, Xiao et al. “GPT Understands, Too.” ArXiv abs/2103.10385 (2021)
  18. Liu, Xiao et al. “P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks.” ArXiv abs/2110.07602 (2021)