
NLP models have recently achieved outstanding performance and have therefore seen widespread real-world adoption. With this popularity, it is important to ensure that these models adapt well to dynamic circumstances. In particular, robustness to domain shift should be considered when developing models, because the same large pretrained language models are often applied to many different tasks and fields. Retraining a model on new in-domain data every time it is applied to a different domain would be inefficient and impractical; we want large models to be easily reusable. Improving models so that they are robust to changes in their inputs has therefore become an active area of study.

Introduction

Prompt tuning and prefix tuning are two effective mechanisms for leveraging frozen language models to perform downstream tasks. Robustness reflects how resilient a model's output is to changes or noise in its input. In this project, we analyze the robustness of natural language models trained with these tuning methods under a domain shift (i.e., training on one domain but evaluating on out-of-domain data). We apply both prompt tuning and prefix tuning to T5 models for reading comprehension (i.e., question answering) and to GPT-2 models for table-to-text generation.
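As a concrete reference point, here is a minimal sketch of the prompt tuning setup, assuming the Hugging Face transformers and PyTorch APIs: a small matrix of trainable "soft prompt" embeddings is prepended to the input embeddings, while every weight of the pretrained model stays frozen. (Prefix tuning instead prepends trainable key/value vectors at every attention layer, which we omit here for brevity.) The model name, prompt length, and learning rate are illustrative choices, not our exact configuration.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration

# Load a frozen T5 backbone; only the soft prompt below will be trained.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
for p in model.parameters():
    p.requires_grad = False

prompt_len = 20                                   # number of soft-prompt tokens
d_model = model.config.d_model
soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.5)

def forward_with_soft_prompt(input_ids, attention_mask, labels):
    # Look up the usual token embeddings, then prepend the trainable prompt.
    token_embeds = model.get_input_embeddings()(input_ids)        # (B, L, d)
    batch_size = token_embeds.size(0)
    prompt = soft_prompt.unsqueeze(0).expand(batch_size, -1, -1)  # (B, P, d)
    inputs_embeds = torch.cat([prompt, token_embeds], dim=1)
    prompt_mask = torch.ones(batch_size, prompt_len, dtype=attention_mask.dtype)
    attention_mask = torch.cat([prompt_mask, attention_mask], dim=1)
    return model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask,
                 labels=labels)

# Only the soft prompt receives gradient updates.
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)
```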

Datasets & Evaluation Metrics

Methods & Hyperparameters

In our work, we apply both prompt tuning and prefix tuning to T5 and GPT-2 models. Our experimental design spans two dimensions for each model and tuning method.

  1. We measure the robustness of tuning with respect to different model sizes, given the same prompt length.
  2. We measure the robustness of tuning with respect to different prompt lengths, given the same model size.

We train both T5 and GPT-2 models at the small, base, and large sizes, with prompt lengths of 1, 5, 10, 20, and 50 tokens. The prompts and prefixes are initialized from vocabulary embeddings.
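Initializing from vocabulary embeddings means the soft prompt starts from the embedding vectors of real tokens rather than from random noise. Below is a minimal sketch, reusing `model` and `prompt_len` from the earlier sketch; the particular range of sampled token ids is an illustrative assumption, not necessarily our exact setting.

```python
import torch
import torch.nn as nn

# Copy embeddings of sampled vocabulary tokens into the soft prompt so that
# training starts from points the frozen model already "understands".
with torch.no_grad():
    vocab_embeds = model.get_input_embeddings().weight        # (vocab_size, d)
    sampled_ids = torch.randint(0, 5000, (prompt_len,))       # assumed sampling range
    soft_prompt = nn.Parameter(vocab_embeds[sampled_ids].clone())
```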

Model Setup

Results

T5 & Question Answering

The table below contains our experimental results for prompt/prefix tuning of T5 on the question-answering task.

| Size  | #Tokens | In-Domain EM | In-Domain F1 | Out-of-Domain EM | Out-of-Domain F1 |
|-------|---------|--------------|--------------|------------------|------------------|
| Small | 1       | 17.86        | 56.88        | 2.27             | 25.17            |
| Small | 5       | 21.52        | 55.61        | 2.40             | 21.48            |
| Small | 10      | 21.97        | 57.19        | 3.06             | 23.60            |
| Small | 20      | 27.72        | 61.08        | 3.53             | 24.32            |
| Small | 50      | 24.34        | 60.05        | 3.60             | 24.76            |
| Base  | 1       | 55.29        | 79.84        | 30.71            | 49.74            |
| Base  | 5       | 47.70        | 72.44        | 18.79            | 36.13            |
| Base  | 10      | 50.09        | 73.32        | 21.99            | 39.44            |
| Base  | 20      | 55.73        | 75.95        | 25.98            | 42.38            |
| Base  | 50      | 49.29        | 74.23        | 16.06            | 38.11            |
| Large | 1       | 55.65        | 82.01        | 49.43            | 63.77            |
| Large | 5       | 49.72        | 78.89        | 43.84            | 61.08            |
| Large | 10      | 46.87        | 78.33        | 46.77            | 62.11            |
| Large | 20      | 33.67        | 73.61        | 32.91            | 56.63            |
| Large | 50      | 40.90        | 76.35        | 38.97            | 59.32            |

We have four findings:

  1. Prefix tuning is generally better for smaller models.
  2. Prompt tuning appears to become superior to prefix tuning as the model size grows.
  3. Prompt tuning's token choices are model-size agnostic for T5 on question answering.
  4. Prompt tuning's performance diverges significantly across different numbers of tokens, while prefix tuning's performance remains consistent across token counts for the same-sized model. This suggests that choosing the prompt length in prompt tuning matters more than choosing the prefix length in prefix tuning.
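For reference, the EM and F1 numbers above are SQuAD-style answer metrics. Below is a minimal sketch of how they are commonly computed; the official evaluation script handles multiple references and a few more edge cases.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Standard SQuAD-style normalization: lowercase, strip punctuation and articles.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```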

GPT-2 & Table-to-Text Generation

The tables below contain our experimental results for prompt/prefix tuning of GPT-2 on the table-to-text generation task, in-domain and out-of-domain.

In-domain:

| Size  | #Tokens | Prompt B(S) | Prompt B(U) | Prefix B(S) | Prefix B(U) |
|-------|---------|-------------|-------------|-------------|-------------|
| Base  | 1       | 0           | 0           | 60.69       | 42.14       |
| Base  | 5       | 30.01       | 24.16       | 62.51       | 45.53       |
| Base  | 10      | 31.91       | 26.18       | 63.07       | 43.16       |
| Base  | 20      | 37.17       | 33.8        | 63.25       | 44.9        |
| Base  | 50      | 38.27       | 31.07       | 62.6        | 44.33       |
| Large | 1       | 0.69        | 0.88        | 64.02       | 45.91       |
| Large | 5       | 32.01       | 28.07       | 63.75       | 45.73       |
| Large | 10      | 35.86       | 32.25       | 63.97       | 47.27       |
| Large | 20      | 37.69       | 33.57       | 64.44       | 46.35       |
| Large | 50      | 40.17       | 36.85       | 64.23       | 46.43       |

Out-of-domain:

| Size  | #Tokens | Prompt B | Prompt T | Prompt M | Prefix B | Prefix T | Prefix M |
|-------|---------|----------|----------|----------|----------|----------|----------|
| Base  | 1       | 0        | 0.95     | 0.04     | 19.45    | 0.96     | 0.26     |
| Base  | 5       | 28.32    | 0.66     | 0.2      | 29.02    | 0.75     | 0.32     |
| Base  | 10      | 26.6     | 0.65     | 0.25     | 28.09    | 0.76     | 0.32     |
| Base  | 20      | 27.91    | 0.62     | 0.27     | 16.45    | 1.63     | 0.31     |
| Base  | 50      | 27       | 0.61     | 0.26     | 20.51    | 1.15     | 0.32     |
| Large | 1       | 0.44     | 0.97     | 0.04     | 22.7     | 1        | 0.32     |
| Large | 5       | 19.77    | 0.7      | 0.2      | 30.35    | 0.71     | 0.34     |
| Large | 10      | 20.67    | 0.8      | 0.21     | 30.23    | 0.71     | 0.34     |
| Large | 20      | 27.22    | 0.67     | 0.29     | 29.98    | 0.71     | 0.33     |
| Large | 50      | 24.61    | 0.79     | 0.3      | 31.68    | 0.65     | 0.34     |

We have four findings:

  1. Prefix tuning is superior across all model sizes.
  2. No particular number of tokens consistently performs better than the others.
  3. Although prompt tuning performs better in-domain as the model size increases, it performs worse out-of-domain with the larger model.
  4. We observe a pattern similar to T5: prompt tuning's performance on GPT-2 diverges significantly across different prompt lengths, while prefix tuning's performance is relatively stable.

Common Patterns

The experiments on different token lengths show that prompt tuning's performance diverges across different numbers of tokens. In contrast, prefix tuning's performance remains consistent across token counts for the same-sized model. This shows that the prompt-length hyperparameter in prompt tuning is critical.

Furthermore, prefix tuning tends to perform better during training and on in-domain data. However, it unexpectedly yields worse results than prompt tuning on out-of-domain data as the model size grows. In contrast, prompt tuning performs comparably out-of-domain and in-domain as the model size increases. This could be caused by overfitting in prefix tuning, since it trains more parameters than prompt tuning. Still, both methods outperform fine-tuning on out-of-domain data.
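A rough back-of-the-envelope count illustrates the parameter gap. The sketch below assumes T5-Base-like dimensions (d_model = 768, 12 encoder layers) and a prefix stored as key and value vectors at every attention layer; the actual prefix tuning implementation additionally trains an MLP reparameterization, so the real gap is larger still.

```python
# Trainable parameters for a 20-token prompt vs. prefix under the assumptions above.
d_model = 768        # hidden size of a T5-Base-like model
num_layers = 12      # attention layers touched by the prefix (assumption)
prompt_len = 20      # number of prompt/prefix tokens

prompt_params = prompt_len * d_model                   # one embedding per prompt token
prefix_params = prompt_len * num_layers * 2 * d_model  # a key and a value per layer

print(f"prompt tuning: {prompt_params:,} trainable parameters")  # 15,360
print(f"prefix tuning: {prefix_params:,} trainable parameters")  # 368,640
```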

Discussion

Advantages of Prefix/Prompt Tuning

Prefix and prompt tuning require much less time and fewer resources to train while still achieving results comparable to fine-tuning. Both methods train only a small set of parameters and freeze the rest, significantly reducing training costs. The smaller number of trainable parameters in prompt tuning may allow it to generalize even better to unseen and out-of-domain data; for example, its performance on the training and validation sets is very close.

Prefix and prompt tuning are also meaningful in real-life applications. For example, suppose we have many individual tasks that share the same underlying model. Prefix and prompt tuning preserve modularity and save time and space, since supporting a task only requires adding or deleting its prefix/prompt tokens. Beyond that, inference is more efficient in the prefix/prompt setting: instead of maintaining different models and calling forward multiple times, we can serve mixed-task batches with a single forward pass.
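A hypothetical sketch of this serving pattern: each example in a mixed-task batch gets its own task's trained soft prompt prepended, so one frozen model handles all tasks in a single forward pass. The task names, dimensions, and random prompts below are placeholders for illustration, not our trained artifacts.

```python
import torch

# One trained soft prompt per task (random placeholders here, dimension 768 assumed).
task_prompts = {
    "question_answering": torch.randn(20, 768),
    "table_to_text": torch.randn(20, 768),
}

def build_mixed_task_batch(examples, embed):
    """examples: list of (task_name, input_ids); embed: the frozen embedding layer."""
    rows = []
    for task, input_ids in examples:
        token_embeds = embed(input_ids)                               # (seq_len, d)
        rows.append(torch.cat([task_prompts[task], token_embeds], dim=0))
    # Pad to a rectangular batch so everything goes through one forward pass.
    return torch.nn.utils.rnn.pad_sequence(rows, batch_first=True)    # (B, max_len, d)

# inputs_embeds = build_mixed_task_batch(batch, model.get_input_embeddings())
# outputs = model(inputs_embeds=inputs_embeds, attention_mask=..., labels=...)
```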

Limitations of Our Project

Our report has several limitations in scope. The direct comparison between prompt and prefix tuning is not fully convincing: the hyperparameters for prompt tuning were not tuned, while the hyperparameters for the prefix tuning experiments follow Li and Liang, 2021. This directly causes prefix tuning to outperform prompt tuning on in-domain data. The implementation details of the two methods also differ slightly. The implementation released with the prefix tuning paper does not work with T5, so we modified the codebase, which may lead to minor discrepancies. The official implementation of prompt tuning had not been released when we started this project, so we built our own pipeline, which differs from the official codebase. Our pretrained T5 model is also different from the one used in Lester et al., 2021.

We also did not perform ablation tests to examine the internal representations of the prefix/prompt tokens. This is another exciting direction we want to explore in the future. For example, if we find patterns in the space of prefix/prompt tokens, we could directly attach a prefix/prompt to a pretrained model when a new task arrives. This would give us a model with performance comparable to fine-tuned models, but at no extra training cost.
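One concrete analysis we have in mind is interpreting each learned prompt vector by its nearest neighbours in the frozen model's vocabulary embedding space. The sketch below is hypothetical and reuses `soft_prompt`, `model`, and a tokenizer from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def nearest_vocab_tokens(soft_prompt, model, tokenizer, k=5):
    # Cosine similarity between each prompt vector and every vocabulary embedding.
    vocab = F.normalize(model.get_input_embeddings().weight, dim=-1)   # (V, d)
    prompt = F.normalize(soft_prompt.detach(), dim=-1)                 # (P, d)
    sims = prompt @ vocab.T                                            # (P, V)
    top_ids = sims.topk(k, dim=-1).indices
    # Map ids back to readable tokens for manual inspection.
    return [tokenizer.convert_ids_to_tokens(row.tolist()) for row in top_ids]
```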

Conclusion

We conclude that prompt tuning is more robust under domain shift. However, the prompt length is an important hyperparameter and needs to be tuned for each task. Because of time and resource limitations, our hyperparameters are not fully tuned and the results are not optimal. We would like to further optimize in-domain performance and see whether the out-of-domain scores also increase and reach the same level.

On the other hand, prefix tuning does not generalize as well as prompt tuning to out-of-domain data, but its in-domain performance is close to that of state-of-the-art fine-tuning. Furthermore, the prefix length has only a small effect across tasks and model sizes. Hence, prefix tuning can approach fine-tuning performance with far fewer trainable parameters, less training time, and less hyperparameter tuning.