Language models for low resource text classification
The 2019 paper framing text classification as an MNLI-style entailment problem (Yin et al.1) had a big impact on the community - it was intuitive, and promised zero shot classification simply by using large pretrained models. HuggingFace Transformers supported it natively soon after, and many articles demonstrated how easy it was to implement.
However, applying it to real world tasks is still difficult, as achieving production level accuracy in many cases still requires a lot of training data. Richard Socher talks about this2 - “the cool thing with language models is that they will very quickly do something; without any training, they will produce something that kind of works; but once you really care about that particular output, you’re going to want to finetune that model to be really good at that task”.
In this post I will review text classification with large language models, in particular techniques used in low resource conditions.
Yin et al. define 2 types of zero shot: one where labels are partially seen during training, and another where labels are fully unseen. They frame zero shot as a textual entailment problem, training BERT on 3 datasets (MNLI, GLUE RTE, FEVER) and evaluating against 3 types of classification datasets - topic categorization, emotion detection, and situation frame detection.
For fully unseen zero shot, they directly apply the model on the test set; for partially seen, the model is finetuned on the training set first before evaluating on the test set.
In the partially seen case, the NLI approach beats BERT finetuned on the training set by 5 to 10.4 points, showing the strength of the entailment formulation. In the fully unseen case, the NLI approach does better than Wikipedia-based BERT for emotion and situation detection but worse for topic categorization, the reasoning being that Wikipedia-based training is much more similar to the topic categorization task.
They also test an ensemble approach that sums the post-softmax probabilities of each input pair across models and applies softmax again to get new probabilities - this ensemble performs better than the 3 individual models trained on MNLI, GLUE RTE and FEVER.
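As a small illustration of that ensembling step (a sketch with made-up probabilities; the entailment datasets are treated as binary entailment vs non-entailment, as described later):
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up (entailment, non-entailment) probabilities from the 3 models
# for a single premise-hypothesis pair
p_mnli = np.array([0.70, 0.30])
p_rte = np.array([0.55, 0.45])
p_fever = np.array([0.60, 0.40])

# Sum the post-softmax probabilities, then apply softmax again
ensemble = softmax(p_mnli + p_rte + p_fever)
print(ensemble)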
The maximum performance achieved for the fully unseen case is 45.7 accuracy for topic categorization, 25.2 label wise weighted F1 for emotion detection, and 38.0 label wise weighted F1 for situation detection.
HuggingFace zero shot
In 2020 HuggingFace published3 an implementation of this technique in their framework. Using any model trained on NLI sequence pair classification, one can feed the input text and labels, reformulated as premise and hypothesis, to the model to obtain probability scores for the label(s).
There are 2 ways of calling the sequence classifier in HuggingFace: 1, by manually initializing the model and tokenizer and handling the model input and output yourself; 2, by using the HuggingFace pipeline abstraction, which wraps all initialization, configuration and processing - only the model name needs to be provided.
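A minimal sketch of the second approach (the checkpoint, input text and labels here are only illustrative; any NLI finetuned model from the hub can be used):
from transformers import pipeline

# facebook/bart-large-mnli is a commonly used NLI checkpoint for zero shot classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new GPU delivers twice the performance of its predecessor.",
    candidate_labels=["technology", "sports", "politics"],
    hypothesis_template="This example is about {}.",
)
print(result["labels"])  # labels sorted by descending score
print(result["scores"])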
To configure the second approach further, subclass the default pipeline, for example:
from transformers import ZeroShotClassificationPipeline

class MyPipeline(ZeroShotClassificationPipeline):
    def preprocess(self, inputs, candidate_labels=None, hypothesis_template='This example is {}.'):
        # override input params, e.g. swap in a task specific hypothesis template
        return super().preprocess(inputs, candidate_labels, hypothesis_template)

    def postprocess(self, model_outputs, multi_label=False):
        # do own postprocessing on the entailment/contradiction logits
        return super().postprocess(model_outputs, multi_label)

# plug in via: pipeline("zero-shot-classification", model=..., pipeline_class=MyPipeline)
In preprocess, the placeholder tokens {} in the hypothesis template will be replaced with a label and concatenated with the input text to form a premise-hypothesis pair, for every label in the label sequence.
In postprocess, the conversion from logits to probabilities is handled. Logits are generated in 3 dimensions (contradiction, neutral, entailment) at the end of the forward pass.
In the single label or multilabel case, probability scores are the softmax over the entailment and contradiction logits for each label independently. In the multiclass case, probability scores are the softmax over the entailment logits of all labels. In Yin et al., training on the entailment datasets is actually done in the binary case, by changing the neutral label to non-entailment.
The pipeline batches the input automatically; when multiple labels are provided, the model essentially does one forward pass per premise-hypothesis pair. By default, the premise sequence is truncated if it exceeds the model's sequence length limit.
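A small sketch of the two scoring schemes, assuming we have already collected the entailment and contradiction logits for each candidate label from one forward pass per premise-hypothesis pair (the numbers are made up):
import torch

# Illustrative logits for 3 candidate labels
entailment_logits = torch.tensor([2.1, -0.3, 0.8])
contradiction_logits = torch.tensor([-1.5, 1.0, 0.2])

# Single label / multilabel: softmax over (contradiction, entailment) per label,
# keeping the entailment probability - each label is scored independently
per_label = torch.softmax(
    torch.stack([contradiction_logits, entailment_logits], dim=-1), dim=-1
)[..., 1]

# Multiclass: softmax over the entailment logits of all labels
multiclass = torch.softmax(entailment_logits, dim=-1)

print(per_label)    # independent probabilities, do not sum to 1
print(multiclass)   # probabilities sum to 1 across labels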
A few practical points to note:
Firstly, the model is very sensitive to punctuation, in both the premise and hypothesis.
Secondly, because of the differences between the single label, multiclass and multilabel settings, there are multiple ways of constructing the classification definition. For example, in the single label case i.e. binary classification, we can instead use the negation of the original label as a second class in a multiclass setting: rather than taking the contradiction of the original label as the negative class, the entailment of the negated label is used:
Hypothesis A: This is about technology.
Hypothesis B: This is not about technology.
Single label (A only): Softmax(A_entailment, A_contradiction)
Multiclass (A and B): Softmax(A_entailment, B_entailment)
This construction can even be applied to other multiclass scenarios, providing a way to group minority classes that do not need to be separated individually.
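A sketch of this construction with the zero shot pipeline (the text and labels are illustrative):
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
text = "The startup released a new machine learning framework."

# Single label: entailment vs contradiction of hypothesis A only
single = classifier(text, candidate_labels=["about technology"],
                    hypothesis_template="This example is {}.")

# Multiclass with an explicit negated label: softmax over the two entailment scores
multiclass = classifier(text,
                        candidate_labels=["about technology", "not about technology"],
                        hypothesis_template="This example is {}.")

print(single["scores"])
print(multiclass["labels"], multiclass["scores"])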
Harder datasets for better performance
Following the paper and HuggingFace release, initial models were trained mostly on MNLI, using RoBERTa, BART, and BERT bases.
ANLI (Nie et al.4) was a more challenging benchmark created via a human-and-model-in-the-loop process, consisting of 3 rounds of data, each incorporating adversarial examples composed against the model from the previous round.
They explore the types of inference that fool models, finding that round 1 and 2 examples rely heavily on numerical and quantitative reasoning, which drops off in round 3. Adversarial examples that exploit standard inferences (conjunctions, negations, cause and effect, comparatives etc.), as well as lexical inferences (requiring knowledge of synonyms, antonyms etc.), increase with each round. Examples that exploit outside knowledge or additional facts maintain a high proportion throughout all rounds.
This dataset was incorporated into many new models in HuggingFace model hub, improving performance on some harder natural language understanding tasks.
Finetuning vs prompting
There are several broad trends in the area of zero shot transfer learning. For the NLI approach, finetuning with NLI data allows the model to perform zero shot sequence pair classification with the same template. Before looking at other trends, let’s first take a look at finetuning in general.
In classical finetuning, the pretrained model is exposed to new data points which are used to update the model weights across all layers. Finetuning has led to SOTA results for a large variety of tasks, but importantly requires a large amount of labeled data to be effective, and incurs other practical costs, such as being more expensive to train and to run at inference time.
The conceptual workflow for finetuning large language models is:
1. Start with a strong pre-training objective, e.g. masked language modelling (MLM) on large amounts of unlabelled data. Domain adaptive finetuning can also be performed here to specialize the model closer to the distribution of the target data.
2. Apply finetuning on a task (or tasks) by replacing or appending a task specific head on the model’s last output layer and training on task data. Often this step is performed with publicly available labeled datasets that are large but not specific to the target task, as is the case with the Yin et al. approach. The idea is to generalize the model’s representations to related tasks, or to strengthen the model’s natural language reasoning abilities.
3. To further improve on the actual target task, the model can be fed in-domain training examples and tuned with cross entropy loss (a rough sketch of this step follows the list). For the NLI approach, if the in-domain data is small (few-shot) there is unlikely to be improvement. The partially seen definition of zero shot in the NLI paper refers to large amounts of training data for some of the labels; in that case performance is likely to be better.
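A rough sketch of step 3 for the NLI formulation, assuming in-domain examples have been recast as premise-hypothesis pairs (the checkpoint and examples below are illustrative; the label ids follow the checkpoint's config, here assumed to be 0 = contradiction, 2 = entailment):
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "facebook/bart-large-mnli"  # illustrative NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Hypothetical in-domain examples recast as (premise, hypothesis, label id)
train_pairs = [
    ("The new GPU doubles training throughput.", "This example is about technology.", 2),
    ("The new GPU doubles training throughput.", "This example is about sports.", 0),
]

def collate(batch):
    premises, hypotheses, labels = zip(*batch)
    enc = tokenizer(list(premises), list(hypotheses), truncation=True,
                    padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor(labels)
    return enc

loader = DataLoader(train_pairs, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for batch in loader:
    outputs = model(**batch)   # cross entropy loss is computed internally from labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()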
In step 1, the pre-training objective does affect the performance of downstream tasks. Talmor et al.6 investigate what reasoning abilities are captured, and find clear differences between language models with similar architectures. Even when a model performs well on a task, small changes to the input can degrade performance significantly.
In step 2, multi-task learning is one approach to increasing the generalizability of the intermediate model. Knowledge distillation can then be used to compress the ensemble into a single student model (Liu et al.7).
An increasingly popular tangent is the usage of prompts or input patterns to adapt a single model architecture to different tasks.
There are 2 well known approaches to this:
- In autoregressive/generation focused models e.g. GPT-38, the prompt is a natural language description appended to the input, which conditions the language model to predict continuations that solve the task. One core premise is that by increasing the size of the model and the quality and diversity of pretraining, the model learns more general language understanding that can be naturally applied to different tasks.
GPT-3 appears to be weak in the few or one shot setting at tasks that involve comparing 2 sentences or snippets, for example whether a word is used in the same way in 2 sentences, whether one sentence is a paraphrase of another, or whether one sentence implies another (i.e. ANLI and RTE). One explanation for GPT-3’s lagging performance is that these tasks empirically benefit from bidirectionality, which an autoregressive model is unable to provide.
- In MLM pretrained models e.g. PET9 (https://github.com/timoschick/pet), task descriptions are cast as fill-in-the-blank templates and input examples are reformulated as cloze-style phrases, allowing the model to solve different tasks.
Specifically, for PET (a minimal sketch of a single PvP follows after the lists below):
- A pattern is defined as a function P that takes an input x and outputs a phrase P(x) containing a mask token
- A verbalizer v is defined as an injective function that maps each label to a word from the MLM’s vocabulary
- Together, (P, v) is called a pattern-verbalizer pair (PvP)
- A set of PvPs that make sense for a given task is defined
- For each PvP, a separate language model is finetuned
- The ensemble M of finetuned models is used to annotate examples from unlabelled data with soft labels
- A final classifier C - a pretrained language model with a standard sequence classification head - is finetuned on this soft labeled training set
- The authors also provide an iterative approach (iPET) that can be applied in a zero shot manner:
- Starting with an ensemble of untrained models (zero shot), or trained (few shot),
- For each model, generate a new training set from D using a random subset of other models
- Train a new set of models using the larger, model specific datasets
- The final set of models is used to create a soft labeled dataset, and a classifier is trained on this data
- At each iteration, when drawing from the labeled dataset, examples for which the ensemble is confident in its prediction are preferred, to avoid training future generations on mislabeled data
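A minimal sketch of a single pattern-verbalizer pair scored with an MLM (the pattern, verbalizer and checkpoint are illustrative assumptions for a sentiment task, not PET’s actual training code):
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")

def pattern(x):
    # P(x): wrap the input in a cloze phrase containing one mask token
    return f"{x} All in all, it was {tokenizer.mask_token}."

# Verbalizer v: map each label to a single word from the vocabulary
verbalizer = {"positive": "great", "negative": "terrible"}

def score(x):
    enc = tokenizer(pattern(x), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]
    # Compare the MLM's scores for the verbalized labels only
    # (first subword id of each verbalizer word, with a leading space for the BPE tokenizer)
    word_ids = {label: tokenizer(" " + word, add_special_tokens=False)["input_ids"][0]
                for label, word in verbalizer.items()}
    return {label: logits[i].item() for label, i in word_ids.items()}

print(score("This movie was a waste of two hours."))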
With no examples, iPET starts out at an average accuracy of 53.6 to 87.5 across 4 datasets, compared to the base RoBERTa model which achieves 33.8 to 69.5. With 100 examples, iPET reaches an average accuracy of 62.9 to 89.6 across the 4 datasets, including a 25 point improvement on the MNLI dataset; supervised RoBERTa achieves 47.9 to 86.0.
A subsequent10 version of PET/iPET uses ALBERT and adapts PET for multiple tokens, by performing k consecutive predictions where the next token is selected based on the MLM’s confidence. With these settings, PET performs 18 points better than GPT-3 Med, a model of similar size, on SuperGLUE.
To summarize, prompt based formulations in both the generative and cloze settings allow a single model to adapt to different tasks, and can be applied to the target task in a zero shot manner or finetuned using the same template. This is in contrast to classical finetuning of a task specific head on large amounts of labeled data.
Advances in prompting
How many data points is a prompt worth?11 A paper in 2021 addresses this question. The authors compare the head based approach with a prompt approach similar to PET, starting from 10 data points and increasing exponentially, and find that prompting has a distinct advantage in low resource conditions for 5 out of 6 SuperGLUE tasks, with an average advantage of hundreds of data points.
For MNLI, prompting yields an approximate advantage of 3500 data points, although compared to a null verbalizer control (where the verbalizer words e.g. yes, no, maybe are replaced with random first names), the advantage is much smaller. This seems to suggest that prompts are important, but the actual verbalization of the prompt less so; in experiments comparing different prompts, prompt choice does not appear to be a dominant hyperparameter.
As a counterpoint to this, many other areas of research suggest that prompt design and optimisation is an important consideration.
One major research direction is automatic prompt generation, a technique we can apply to our original NLI classification formulation as well, in the construction of the premise and hypothesis.
Some techniques explored include:
- Jiang et al.12 (https://github.com/jzbjyb/LPAQA) apply mining based and paraphrasing based generation. In mining based generation, Wikipedia sentences that contain both the subject and object of a specific relation are used to derive prompt templates. In paraphrasing based generation, back translation is used to transform the initial prompt into candidates in another language and back again, ranked by their round trip probability (a rough sketch of this round trip idea follows the list). The best ensemble methods were found to improve over manual prompting by 8-11 points in micro averaged accuracy, and 4-5 points in macro averaged accuracy.
- Gao et al.13 (https://github.com/princeton-nlp/LM-BFF) apply brute force search with pruning and use T5 for template generation. For each class, they construct a pruned set of top k vocabulary words based on conditional likelihood using the initial MLM, and further refine the assignments that maximize zero shot accuracy on the training set. Using a generative T5 model, they take input sentences concatenated with labels and a mask token, and decode the output into multiple prompt templates, choosing either the best candidate on the dev set, or the top k templates as an ensemble. With the same number of prompts as PET, the model achieves improved performance on the RTE dataset, and extends the performance gain with a greater number of prompts in the ensemble. They also measure a more than 10% change in accuracy depending on the template.
- Shin et al.14 propose AutoPrompt (https://github.com/ucinlp/autoprompt), a method that combines the original task inputs with a collection of trigger tokens to create a prompt. Trigger tokens are initialized as mask tokens, and updated at each step by computing a first order approximation of the change in log-likelihood from swapping the jth trigger token with another token from the vocabulary. Label tokens are selected in a 2 step process: first a logistic classifier is trained to predict the class label, then its weights are combined with the MLM’s output word embeddings to obtain a score, and the k highest scoring words are selected. In their results, they find that AutoPrompt can perform better than finetuning on NLI (SICK-E) in low resource scenarios (10 to 1000 data points). On other datasets such as QQP and RTE however, prompts generated manually and with AutoPrompt did not perform considerably better than chance.
As another illustration of the trickiness of the natural language inference task, they find that MLMs are more interpretable for contradiction than for entailment or neutral.
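A rough sketch of the paraphrasing idea from Jiang et al. via round trip translation (the MarianMT checkpoints and the seed hypothesis are assumptions; the ranking by round trip probability from the paper is omitted here):
from transformers import pipeline

# English -> German -> English round trip to paraphrase a seed prompt/hypothesis
en_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

seed = "This example is about technology."  # label slot filled in for illustration

def paraphrase(prompt, num_candidates=5):
    # Beam search with several returned sequences gives diverse German translations
    german = en_de(prompt, num_beams=num_candidates, num_return_sequences=num_candidates)
    candidates = set()
    for g in german:
        back = de_en(g["translation_text"], num_beams=4)[0]["translation_text"]
        candidates.add(back)
    return sorted(candidates)

print(paraphrase(seed))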
These findings together reinforce the following points: 1, that prompting has widely varying impact depending on the task, model, and data. 2, model task performance is sensitive to prompt design, especially the pattern template. 3, prompt ensembling yields improvement in most cases, similar to model ensembling. 4, in general, prompting performs better than finetuning in low resource conditions and domain shift problems.
A further series of related techniques focus on continuous prompting (as opposed to discrete tokens), to improve the prompt design. In continuous prompting, free parameters which do not correspond to real tokens are constructed as vectors and embedded directly into the model architecture.
Some key approaches include:
- Li and Liang15 use an approach called Prefix-tuning (https://github.com/XiangLi1999/PrefixTuning), which keeps the model parameters frozen and updates only the prompt vectors prepended to every transformer layer.
- Lester et al.16 (https://github.com/google-research/prompt-tuning) provide a simplified version of Prefix-tuning, using a single prompt prepended to the input and learned through back propagation (a minimal sketch of this idea follows the list). In addition, T5 is used as the base language model with different pretraining objective settings, including LM adaptation, to be more similar to the generative procedure of GPT-3. The authors conclude that prompt tuning alone is competitive with finetuning for very large language models (10 billion parameters), but for medium sized models (100M to 1B) finetuning performs better.
- Liu et al.17 propose P-tuning (https://github.com/THUDM/P-tuning), a combination of prompt tuning and finetuning, where continuous prompts are added throughout the embedded input, and tuning jointly updates the prompt vectors and the main model parameters. In the authors’ findings, P-tuning outperforms PET on all tasks, and iPET on 4 out of 7 tasks.
- P-tuning v218 (https://github.com/THUDM/P-tuning-v2) further optimizes for all model sizes and tasks by applying continuous prompts at every layer of the pretrained model. The main optimisation techniques are multi-task learning, removing verbalizers, and in some cases a reparameterization encoder.
- The authors find that prompt length is an influential hyperparameter whose optimal value varies from task to task (e.g. for harder sequence tasks, prompts longer than 100 tokens work better), and that it also interacts with reparameterization.
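A minimal sketch of the soft prompt idea in the spirit of Lester et al., with learned prompt vectors inserted into the input embeddings of a frozen backbone (the checkpoint, prompt length and label id are illustrative assumptions; the paper itself uses T5):
import torch
from torch import nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # illustrative frozen backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Freeze every backbone parameter; only the soft prompt receives gradients
for p in model.parameters():
    p.requires_grad = False

num_prompt_tokens = 20
soft_prompt = nn.Parameter(torch.randn(num_prompt_tokens, model.config.hidden_size) * 0.02)

def forward_with_prompt(premise, hypothesis, labels):
    enc = tokenizer(premise, hypothesis, return_tensors="pt")
    token_embeds = model.get_input_embeddings()(enc["input_ids"])
    # Insert the learned prompt vectors after the start token's embedding
    inputs_embeds = torch.cat(
        [token_embeds[:, :1], soft_prompt.unsqueeze(0), token_embeds[:, 1:]], dim=1)
    attention_mask = torch.cat(
        [torch.ones(1, num_prompt_tokens, dtype=enc["attention_mask"].dtype),
         enc["attention_mask"]], dim=1)
    return model(inputs_embeds=inputs_embeds, attention_mask=attention_mask, labels=labels)

optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)
out = forward_with_prompt("A man is playing a guitar.",
                          "This example is about music.",
                          torch.tensor([2]))   # assumed entailment label id
out.loss.backward()   # gradients flow only into soft_prompt
optimizer.step()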
In summary: 1, ideal prompts do not necessarily correspond to real tokens or to the judgments of a human evaluator; 2, prompting scales well with model size and is strongly affected by the pretraining objective.
There are still improvements to be made before we can apply text classification with language models in production in a truly general, zero shot way. Prompting is one approach that leverages scale and the representation ability gained in pretraining to generalize better. Nonetheless, we still need to find and design templates or architectures that better fit specific tasks or data, which is less straightforward to evaluate than classical finetuning. The research space is moving very rapidly though, and new improvements will likely keep pushing the limits of low resource text classification.
Thanks to Shantanu for your comments and feedback.
References
- Yin, Wenpeng et al. “Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach.” ArXiv abs/1909.00161 (2019)
- https://www.youtube.com/watch?v=Aqa_lj5HiBE
- https://joeddav.github.io/blog/2020/05/29/ZSL.html
- Nie, Yixin et al. “Adversarial NLI: A New Benchmark for Natural Language Understanding.” ArXiv abs/1910.14599 (2020)
- https://ruder.io/recent-advances-lm-fine-tuning/
- Talmor, Alon et al. “oLMpics-On What Language Model Pre-training Captures.” Transactions of the Association for Computational Linguistics 8 (2020): 743-758.
- Liu, Xiaodong et al. “Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding.” ArXiv abs/1904.09482 (2019)
- Brown, Tom B. et al. “Language Models are Few-Shot Learners.” ArXiv abs/2005.14165 (2020)
- Schick, Timo and Hinrich Schütze. “Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference.” EACL (2021)
- Schick, Timo and Hinrich Schütze. “It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners.” ArXiv abs/2009.07118 (2021)
- Scao, Teven Le and Alexander M. Rush. “How many data points is a prompt worth?” NAACL (2021).
- Jiang, Zhengbao et al. “How Can We Know What Language Models Know?” Transactions of the Association for Computational Linguistics 8 (2020): 423-438.
- Gao, Tianyu et al. “Making Pre-trained Language Models Better Few-shot Learners.” ArXiv abs/2012.15723 (2021)
- Shin, Taylor et al. “Eliciting Knowledge from Language Models Using Automatically Generated Prompts.” ArXiv abs/2010.15980 (2020)
- Li, Xiang Lisa and Percy Liang. “Prefix-Tuning: Optimizing Continuous Prompts for Generation.” ACL-IJCNLP (2021)
- Lester, Brian et al. “The Power of Scale for Parameter-Efficient Prompt Tuning.” ArXiv abs/2104.08691 (2021)
- Liu, Xiao et al. “GPT Understands, Too.” ArXiv abs/2103.10385 (2021)
- Liu, Xiao et al. “P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks.” ArXiv abs/2110.07602 (2021)