Neural Tech Daily

Prompt Tuning, Prefix-Tuning and P-Tuning v2: A Multi-Paper Review of Soft-Prompt PEFT

Multi-paper review of soft-prompt PEFT — Lester prompt tuning, Li and Liang prefix-tuning, P-Tuning v2 deep prompts — across NLG, NLU and sequence labelling.

Section 1: Paper identity and scope

This review covers three artefacts that together define the soft-prompt branch of parameter-efficient fine-tuning (PEFT). Each frames the same idea — keep the pretrained model frozen and learn only a small continuous prefix — at a different scale, task family, and architectural depth.

Primary papers (full venue):

  1. Lester, Al-Rfou and Constant, The Power of Scale for Parameter-Efficient Prompt Tuning (EMNLP 2021, arXiv:2104.08691). 1
  2. Li and Liang, Prefix-Tuning: Optimizing Continuous Prompts for Generation (ACL 2021, arXiv:2101.00190). 2
  3. Liu, Ji, Fu, Tam, Du, Yang, Tang, P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks (ACL 2022, arXiv:2110.07602). 3

Retrieval confirmation. All three primary papers were fetched at writer-time from arXiv abstract pages and ar5iv HTML renders on 2026-05-19; the multi-source rule was satisfied because the ACL Anthology versions and the arXiv preprints carry identical method content. Supplementary appendices were retrieved from the same source.

Paper classification (multi-paper cluster): Training method · Architecture proposal · Representation learning · LLM-based · Efficiency · Benchmark.

One-paragraph technical abstract (publication voice). Soft-prompt tuning is the family of methods that adapts a frozen pretrained language model by inserting continuous, trainable vectors into its input — never updating any of the model’s own weights. Li and Liang’s prefix-tuning prepends a learned activation matrix to every transformer layer of GPT-2 / BART and demonstrates that 0.1% of the parameters can match fine-tuning on table-to-text generation. Lester, Al-Rfou and Constant strip the method to its minimum, learning only an input-layer prompt embedding matrix for T5, and show that the gap to fine-tuning closes as model scale crosses roughly the 10B-parameter mark. Liu et al.’s P-Tuning v2 then re-introduces per-layer prefix prompts (with several optimisation tricks) to make the approach competitive at smaller scales and on sequence-labelling tasks where Lester-style prompts had struggled.

Primary research question (cluster-level). Can a frozen pretrained language model be conditioned to perform a downstream task using only a small set of continuous prompt parameters, while matching the quality of full fine-tuning across model scales and task families?

Core technical claim (cluster-level). Three coupled claims emerge: (a) input-layer soft prompts suffice when the frozen model is large enough (Lester); (b) per-layer prefix activations are required at smaller scale and for generation tasks (Li and Liang); (c) per-layer prefix activations with a small set of training tricks (reparameterisation, prompt length, multi-task pretraining, classification head) close the gap on natural-language-understanding tasks universally, including sequence labelling (P-Tuning v2).

Core technical domains. Transformer architectures (deep), PEFT (deep), natural-language generation (moderate), natural-language understanding (moderate), optimisation of low-parameter objectives (moderate).

Reader prerequisites. High-school algebra; familiarity with neural-network basics and the transformer encoder-decoder helpful but not required because the Glossary covers them. Readers comfortable with backpropagation and embedding matrices will move fastest through Section 6.

Section 2: TL;DR and executive overview

Three-sentence TL;DR. A pretrained language model is a giant function that turns text into more text; fine-tuning it for a new task usually means copying every one of its billions of weights and nudging them all, which is expensive. Soft-prompt tuning instead leaves the model frozen and learns just a tiny strip of new “prompt” numbers that are stitched into the model’s input — like writing a magic preamble that steers the frozen brain. The three papers in this review show that as the frozen model gets bigger, this magic preamble gets cheaper and more competitive, and that adding the preamble to every layer (not just the input) makes it work at smaller scales and on more task types.

Executive summary (~100 words). Fine-tuning a 10-billion-parameter language model for every new task is wasteful — each task gets its own full copy. The three papers reviewed here propose a cheaper alternative: freeze the model, prepend a learnable vector sequence to its input, and train only that vector sequence. Prefix-tuning (Li and Liang, 2021) attaches the vectors to every transformer layer for generation tasks; prompt tuning (Lester et al., 2021) shows that the input layer alone suffices once the model is large; P-Tuning v2 (Liu et al., 2022) puts the per-layer prefix back to make the method work at smaller scales and on sequence labelling. The evidence: 0.1%–3% of parameters, fine-tuning-equivalent accuracy on T5-XXL, BART, GPT-2 and several BERT / RoBERTa / GLM scales.

Five practitioner-relevant takeaways.

  • Soft prompts are the cheapest PEFT method to store. Lester et al. report 20,480 parameters per task on T5-XXL, roughly five orders of magnitude smaller than the 11B-parameter model copy. 4 Per-task storage is measured in kilobytes.
  • Input-layer-only prompt tuning needs scale. The Lester paper shows the gap to fine-tuning is large at T5-Small / Base, closes at T5-XL, and effectively vanishes at T5-XXL. 1 Below roughly 1B parameters, prefer prefix-tuning or P-Tuning v2.
  • Prefix-tuning’s reparameterisation trick is load-bearing. Li and Liang train an auxiliary matrix $P'_\theta$ that is mapped through a small MLP to produce the actual prefix; direct optimisation of the prefix is unstable. 2 P-Tuning v2 reports the same trick helps on some tasks and hurts on others. 3
  • Sequence-labelling tasks need per-layer prompts. Lester-style input-only prompts collapse on extractive question answering: Liu et al. report 1.2 EM on SQuAD 1.1 for the baseline versus 88.0 EM with P-Tuning v2. 3 The depth of the prompt insertion is what fixes this.
  • The trained prompt is portable. A single frozen base model serves multiple tasks by swapping prompts at inference time; Lester et al. demonstrate batched mixed-task inference with one model copy, which is operationally cheaper than serving N fine-tuned models. 1

Pipeline overview (training vs inference). At training time, the base model’s parameters $\theta$ are frozen; only the prompt parameters $\phi$ (an input-layer embedding matrix for Lester, a per-layer activation matrix for Li and Liang and Liu et al.) receive gradient updates. Loss is the standard next-token cross-entropy for generation papers or task-specific cross-entropy for classification / span extraction. At inference time, the prompt is concatenated to the input (Lester) or injected as past-key-value tensors at every layer (Li and Liang, Liu et al.), then a single forward pass through the frozen model produces the output.

Section 2.5: Glossary

| Term | Plain-English explanation | First appears in |
| --- | --- | --- |
| Pretrained language model (PLM) | A neural network already trained on a huge text corpus (think GPT-2, T5, BERT); soft-prompt tuning never changes its weights. | Section 1 |
| Fine-tuning | The traditional way of adapting a pretrained model: update every one of its parameters using task-specific data. | Section 1 |
| Soft prompt / continuous prompt | A small matrix of trainable real-valued numbers that lives in the same vector space as token embeddings but is not tied to any actual word. | Section 1 |
| Prefix | The same concept as soft prompt; used when the trainable vectors are prepended at every transformer layer (not just input). | Section 5 |
| Embedding | The vector representation of a token; a length-$d$ real-valued vector that the model uses internally. | Section 6 |
| Transformer layer | One block of the model containing self-attention + feed-forward computation; large language models stack tens to hundreds of these. | Section 5 |
| PEFT | Parameter-efficient fine-tuning; the umbrella term for methods that adapt a frozen PLM by training only a small number of new parameters. | Section 4 |
| Reparameterisation | A training trick: instead of directly optimising the prompt matrix, optimise a smaller hidden matrix and map it through an MLP to produce the prompt. | Section 6 |
| LM adaptation | An extra round of language-model-style pretraining applied to T5 by Lester et al. before any prompt tuning starts; it changes T5’s objective from span-corruption to standard left-to-right LM. | Section 5 |
| Verbalizer / LM head | A way of doing classification by mapping each class to a vocabulary token and reading the model’s next-token probability for that token; an alternative to a randomly-initialised classification head. | Section 8 |
| [Analysis] label | The publication’s own reasoned assessment, distinct from what the paper itself claims. | Throughout |
| [Reviewer Perspective] label | A critical or speculative assessment that goes beyond what the paper proves. | Sections 11–12 |
| [Reconstructed] label | Content the publication faithfully reconstructed because the paper only partially disclosed it. | Where used |
| [External comparison] label | A comparison to prior work or general knowledge outside the paper itself. | Sections 4, 11 |
| “From the paper:” prefix | Content directly supported by the paper’s text, equations, tables, or figures. | Throughout |

Section 3: Problem formalisation

Notation table (cluster-wide).

| Symbol | Type | Meaning | First appears in |
| --- | --- | --- | --- |
| $\theta$ | Parameter set | Frozen pretrained-LM parameters (never updated). | Section 5 |
| $\phi$ | Parameter set | Trainable soft-prompt parameters (Lester input layer; Li-Liang and Liu et al. per layer). | Section 5 |
| $X = (x_1, \ldots, x_n)$ | Token sequence | Input tokens to the model. | Section 6 |
| $X_e \in \mathbb{R}^{n \times e}$ | Matrix | Token-embedding lookup of $X$; row $i$ is the embedding of $x_i$. | Section 6 |
| $P_e \in \mathbb{R}^{p \times e}$ | Matrix | Lester soft-prompt matrix; $p$ prompt slots, $e$ embedding dim. | Section 6 |
| $[P_e; X_e]$ | Matrix | Concatenation of prompt and input embeddings along sequence axis. | Section 6 |
| $P_\theta \in \mathbb{R}^{p \times d}$ | Matrix | Li-Liang per-layer prefix activations; $d$ is layer activation dim. | Section 6 |
| $P'_\theta \in \mathbb{R}^{p \times k}$ | Matrix | Reparameterisation source matrix; $k < d$ (e.g. $k=512$ for GPT-2 medium). | Section 6 |
| $h_i$ | Vector | Activation at sequence position $i$ in the transformer; the prefix mechanism overrides $h_i$ for $i \in P_{\text{idx}}$. | Section 6 |
| $P_{\text{idx}}$ | Index set | Positions in the sequence occupied by prefix tokens. | Section 6 |
| $\mathcal{L}$ | Scalar | Loss; per-token cross-entropy for generation, task cross-entropy for NLU. | Section 6 |
| $p$ | Scalar | Prompt length in tokens; tested values 1, 5, 10, 20, 100, 150, 200 across the cluster. | Section 6 |

Formal problem statement. Given a frozen pretrained language model $f_\theta : \mathcal{X} \to \mathcal{Y}$ where $\mathcal{X}$ is the space of token sequences and $\mathcal{Y}$ is the model’s output distribution, and given a labelled task dataset $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}_{i=1}^N$, the goal is to find a small parameter set $\phi$ that conditions $f_\theta$ on a task such that $f_\theta(\cdot \mid \phi)$ approximates the task posterior $p(Y \mid X)$, while $|\phi| \ll |\theta|$.

The three papers all instantiate this with $|\phi|$ at or below roughly 3% of $|\theta|$ (well under 0.01% for input-layer prompts on the largest models), and all keep $\theta$ literally untouched.

Assumption list.

  1. The frozen PLM is expressive enough. From the paper (Lester Section 4.1): T5 must already encode the linguistic competence the downstream task requires; the prompt only re-routes that competence. [Analysis] Potentially strong assumption: this is exactly what the scale-curve evidence interrogates, and at the smallest scales (T5-Small / Base) the assumption visibly weakens.
  2. The pretrained objective is compatible with prompt conditioning. From the paper (Lester Section 3.1): T5’s span-corruption objective leaves sentinel tokens in the model’s most-likely-output distribution, which breaks prompt tuning until LM adaptation is applied for 100K extra steps. 1 [Analysis] Potentially strong assumption for any PLM trained with a fill-in-blanks rather than left-to-right objective.
  3. Gradients can flow through the frozen model. Implicit: the prompt parameters are optimised by ordinary backpropagation through every transformer layer; the model’s parameters do not update, but the forward and backward passes still traverse them. Memory is therefore proportional to the full model size during training, not to $|\phi|$.
  4. Task data is sufficient to learn the prompt. Li and Liang explicitly test the low-data regime (500 examples) and find prefix-tuning still works; Lester et al. test on SuperGLUE-scale data.

Why the problem is hard. Soft-prompt parameters are continuous and lie in the same numerical space as token embeddings, but they are not constrained to any particular subspace. Direct optimisation of a $p \times e$ matrix from random initialisation lands in a high-loss region; Lester et al. document that random uniform initialisation underperforms class-label initialisation by a wide margin at T5-Small, and Li and Liang report that reparameterisation through an auxiliary MLP is required for stable training. The frozen model also acts as a fixed nonlinearity around the trainable parameters, so the optimisation landscape is not the convex-in-the-parameters regime of linear probing.

LLM-based role. All three methods operate on top of a frozen LLM. The LLM is the substrate; the prompt is a small overlay. No paper trains the LLM from scratch.

Section 4: Motivation and gap

Real-world problem. A multi-tenant production system needs to support N downstream tasks (say, 50 customer-specific classification heads). Fine-tuning an 11B-parameter T5 model gives 50 full copies, about 1.1 TB of weights at 16-bit precision (2.2 TB at fp32), to serve and store. Per-task GPU memory at inference is 22 GB at 16-bit precision just to hold the model.

Existing approaches and failure modes.

  • Full fine-tuning. Accurate but full-model copies per task.
  • Adapter modules (Houlsby et al., 2019). 5 Insert small bottleneck modules between transformer layers; train only those. From the paper (Li and Liang Section 5): adapter at 0.1% parameter budget reaches 66.3 BLEU on E2E versus prefix-tuning at 69.7. 2 Higher parameter counts close the gap.
  • Discrete prompt engineering / “prompt design”. Write the prompt as actual English. Lester et al. Section 6 frames this as the most restrictive design space and the weakest baseline. 1
  • P-Tuning v1 (Liu et al., 2021). 6 Soft prompts at input layer with an LSTM encoder; works on selected NLU benchmarks but fails at smaller model scales and on hard sequence-labelling tasks.

Gap the papers claim to fill. A method that (a) requires no model copy per task, (b) keeps inference cost identical to the base model plus a small prompt overhead, (c) matches full fine-tuning quality across enough scales and tasks to be the default in a multi-tenant deployment.

Why prior methods were insufficient. Adapters change model architecture and require a custom inference path. Discrete prompt design tops out well below fine-tuning quality. P-Tuning v1 was constrained to input-layer prompts and a specific set of NLU benchmarks. None of these solved the universal-scale-and-task claim that P-Tuning v2 explicitly targets.

Practical stakes. Storing 50 task-specific 22 GB model copies versus 50 task-specific 80 KB prompt files is a five-orders-of-magnitude operational reduction. The same logic applies to per-user personalisation: Li and Liang explicitly motivate prefix-tuning by the “personal prefix per user” deployment scenario.
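
The storage comparison above can be checked with a few lines of arithmetic. This is a back-of-envelope sketch, not a measurement: it assumes 2 bytes per weight for the model copies (which is what a 22 GB copy of an 11B-parameter model implies) and 4 bytes per value for the 20,480-parameter prompt (which is what an 80 KB prompt file implies). The constant names are illustrative.

```python
# Back-of-envelope storage comparison; parameter counts are taken from the text,
# byte widths are assumptions chosen to match the 22 GB and 80 KB figures.
PARAMS_T5_XXL = 11_000_000_000   # frozen model parameters
PROMPT_PARAMS = 20_480           # Lester et al. prompt parameters per task
NUM_TASKS = 50

BYTES_PER_WEIGHT = 2             # assumed 16-bit model weights
BYTES_PER_PROMPT_VALUE = 4       # assumed fp32 prompt values

model_copy_gb = PARAMS_T5_XXL * BYTES_PER_WEIGHT / 1e9
prompt_kb = PROMPT_PARAMS * BYTES_PER_PROMPT_VALUE / 1e3

fleet_models_tb = NUM_TASKS * model_copy_gb / 1e3
fleet_prompts_mb = NUM_TASKS * prompt_kb / 1e3
ratio = (PARAMS_T5_XXL * BYTES_PER_WEIGHT) / (PROMPT_PARAMS * BYTES_PER_PROMPT_VALUE)

print(f"one model copy: {model_copy_gb:.1f} GB, one prompt: {prompt_kb:.2f} KB")
print(f"{NUM_TASKS} model copies: {fleet_models_tb:.2f} TB, {NUM_TASKS} prompts: {fleet_prompts_mb:.2f} MB")
print(f"per-task storage ratio: {ratio:,.0f}x")
```

Under these assumptions the per-task ratio lands a little above 10^5, which is where the "five orders of magnitude" shorthand comes from.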

[External comparison] Position in broader PEFT research. The cluster sits inside the wider PEFT family that also includes LoRA 7 (low-rank weight-delta matrices), BitFit, IA3, and various adapter variants. As of 2026, the dominant production PEFT method for instruction-tuning open-weight LLMs is LoRA; the prompt-tuning lineage remains influential conceptually (the trained-soft-prompt idea has been picked up in vision-language models and multimodal prompt design) but is no longer the default for fine-tuning open base models.

Section 5: Method overview

5.1 Lester et al. (2021) — input-layer prompt tuning

Plain-English intuition. Imagine teaching T5 to answer SuperGLUE tasks not by changing the model but by writing a magic preamble. Lester et al. ask: what if the preamble is not English at all, but a tiny strip of arbitrary numbers chosen by gradient descent? The strip lives where word embeddings live (so dimensionally it is a $p \times e$ matrix) and gets prepended to the input embeddings before the encoder runs.

Exact mechanism.

  1. The input token sequence $X = (x_1, \ldots, x_n)$ goes through the model’s standard token-embedding lookup to give $X_e \in \mathbb{R}^{n \times e}$.
  2. A trainable matrix $P_e \in \mathbb{R}^{p \times e}$ is concatenated along the sequence axis: $[P_e; X_e] \in \mathbb{R}^{(p+n) \times e}$.
  3. The combined matrix is fed to the frozen T5 encoder. The decoder runs unmodified.
  4. Standard cross-entropy loss is back-propagated; only $P_e$ accumulates gradients.

Design rationale and tradeoffs. Lester et al. deliberately chose the simplest possible scheme to test the scale-conditioned conjecture. The trainable parameter count is $p \times e$; for $p=100$ and $e=4096$ (T5-XXL) this is 409,600. The 20,480-parameter figure quoted elsewhere in this review corresponds to a much shorter prompt, $p=5$ with $e=4096$. The cost is no per-layer signal and limited expressivity at small scale.

What breaks if removed. Removing the prompt entirely reduces the method to zero-shot evaluation. Removing LM adaptation while keeping prompt tuning yields unstable training because T5’s span-corruption objective makes the model output sentinel tokens. 1

Classification: [New] for the input-layer-only formulation at T5 scale; [Adapted] from prior work on continuous prompts (AutoPrompt and P-Tuning v1 explored similar ideas at smaller scale).

5.2 Li and Liang (2021) — per-layer prefix-tuning

Plain-English intuition. Instead of writing a magic preamble that only the input layer sees, write a different magic preamble for every layer of the model. Each layer’s preamble is a $p \times d$ activation matrix that gets concatenated to the layer’s key-value cache, so attention at every layer can read from it.

Exact mechanism.

  1. For each transformer layer $\ell$, define a trainable activation matrix $P_\theta^{(\ell)} \in \mathbb{R}^{p \times d}$.
  2. The activation at position $i$ in the transformer becomes $h_i = P_\theta[i, :]$ if $i \in P_{\text{idx}}$, else $h_i = \text{LM}_\phi(z_i, h_{<i})$, that is, the unmodified transformer computation conditioned on prefix history.
  3. Reparameterisation: $P_\theta = \text{MLP}_\theta(P'_\theta)$ where $P'_\theta \in \mathbb{R}^{p \times k}$ with $k=512$ for GPT-2 medium and $k=800$ for BART-large; the MLP is discarded at inference. 2
  4. Back-propagate cross-entropy through all layers; only $P'_\theta$ and the MLP receive updates.
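
To make the "prefix as extra key-value rows" reading concrete, here is a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions, not Li and Liang's implementation: there are no query/key/value projections, no causal mask, and only one layer, and the function names and toy numbers are hypothetical. It shows only the structural point that the prefix adds $p$ attendable positions while the number of output positions stays at $n$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention (no masking, no projections)."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

def attend_with_prefix(queries, keys, values, prefix_keys, prefix_values):
    """Prepend p prefix key/value rows; queries and the frozen rows are untouched."""
    return attend(queries, prefix_keys + keys, prefix_values + values)

# Toy values (hypothetical): n = 3 input positions, d = 2, p = 2 prefix rows.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stands in for frozen activations
prefix_k = [[0.5, -0.5], [-1.0, 2.0]]      # would be trainable in prefix-tuning
prefix_v = [[0.1, 0.2], [0.3, 0.4]]        # would be trainable in prefix-tuning

out, w = attend_with_prefix(x, x, x, prefix_k, prefix_v)
print(len(out), len(out[0]), len(w[0]))    # 3 2 5
```

Each of the 3 input positions now distributes attention over 5 positions (2 prefix rows plus 3 inputs). In the per-layer scheme a separate pair of prefix key/value matrices would be supplied at every layer.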

Design rationale. Direct optimisation of $P_\theta$ is unstable per Li and Liang Section 4.2; the MLP factorisation gives a smoother optimisation landscape, at the price of extra trainable parameters during training. At inference the MLP is folded down: only the resulting $P_\theta$ is stored.

What breaks if removed. The Li and Liang embedding-only ablation reports 62.2 BLEU on E2E vs 69.7 with the full prefix, a 7.5-point gap demonstrating the value of per-layer prompts. 2 Without reparameterisation, training diverges or stalls on several datasets.

Classification: [New] for the per-layer prefix formulation and the reparameterisation trick; [Adapted] for the broader idea of continuous prompts from concurrent work.

5.3 Liu et al. (2022) — P-Tuning v2

Plain-English intuition. Take Li and Liang’s per-layer prefix idea, port it from generation to natural-language understanding (BERT-family encoders), and add four optimisation tricks so it works at small scales and on sequence labelling.

Exact mechanism.

  1. Deep prompts. A trainable prefix is inserted at every transformer layer, not just at the input — identical mechanism to Li and Liang Section 4.1.
  2. No reparameterisation by default. Liu et al. Section 4.1 report that MLP reparameterisation helps on some tasks (RTE, CoNLL04) and hurts on others (BoolQ, CoNLL12), so the method ships without a default reparameterisation and instructs the practitioner to tune per task. 3
  3. Prompt length tuned per task. Simple classification tasks use under 20 tokens; sequence labelling uses around 100.
  4. Classification head, not verbalizer. Liu et al. replace the LM-head + verbalizer pattern of P-Tuning v1 with a randomly-initialised linear classification head on top of the [CLS] token (or per-token for sequence labelling). This makes the method straightforward to apply to extractive QA and NER, which were the prior gap.
  5. Optional multi-task pretraining. Train shared prompt parameters across a family of tasks, then fine-tune per task. Liu et al. find this helps on most tasks except QA.

Design rationale. The four tricks address the four observed failure modes of Lester-style and v1-style prompt tuning at sub-10B scale. Per-layer prompts add capacity. The classification head removes the verbalizer constraint that limited NER and QA.

What breaks if removed. Drop per-layer prompts and the method degrades to Lester-style input-only prompts; SQuAD 1.1 EM falls from 88.0 to roughly 1.2 per the comparison Liu et al. report. 3 Drop the classification head and verbalizer constraints limit task coverage.

Classification: [Adapted] from Li and Liang per-layer prefix; [New] for the optimisation-trick combination and the universality claim across NLU tasks and BERT-family encoders.

Section 6: Mathematical contributions

MATH ENTRY 1: Lester input-layer prompt concatenation.

  • Source: Lester et al. (2021), Section 2.
  • What it is: The trainable prompt matrix is concatenated to the input-embedding matrix along the sequence dimension; the rest of T5 is untouched.
  • Formal definition:

$$\text{model-input} = [P_e; X_e] \in \mathbb{R}^{(p+n) \times e}$$

  • Each term explained, with dimensional analysis:
    • $X_e \in \mathbb{R}^{n \times e}$ is the input-token embedding matrix; $n$ is the input length, $e$ is T5’s embedding dimension (e.g. $e=4096$ for T5-XXL).
    • $P_e \in \mathbb{R}^{p \times e}$ is the soft-prompt matrix; $p$ is prompt length (default 100 in Lester’s main experiments).
    • $[\cdot ; \cdot]$ denotes sequence-axis concatenation; the resulting matrix has $p+n$ rows and $e$ columns.
  • Worked numerical example. Take T5 with $e=8$ (toy dimension), prompt length $p=4$, input length $n=5$. Then $P_e$ is a $4 \times 8$ matrix of trainable real numbers and $X_e$ is a $5 \times 8$ matrix of fixed token embeddings. Concatenation gives a $9 \times 8$ matrix. If T5’s first attention head produces query / key / value via three $8 \times 8$ projection matrices (frozen), every row of the prompt becomes a queryable / attendable position in the model, indistinguishable in shape from a token-embedded position. Gradient w.r.t. $P_e$ flows back through the entire frozen encoder; gradient w.r.t. the frozen matrices is computed but discarded.
  • Role: This is the entire architectural contribution of Lester et al.; the rest of the paper is the empirical scale-conditioning study.
  • Edge cases: $p=0$ degenerates to zero-shot; $p$ greater than the context length minus $n$ exceeds T5’s positional budget.
  • Novelty: [Adapted] from prior continuous-prompt work; [New] in the T5-scale empirical regime.
  • Transferability. [Analysis] Reusable on any transformer encoder-decoder or decoder-only model; on encoder-only BERT it works for classification but loses sequence-labelling performance, which is the gap P-Tuning v2 closes.
  • Why it matters. The minimal-mechanism claim — input concatenation only, no per-layer signal — is what isolates the scale variable.
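
The toy dimensions from the worked example ($e=8$, $p=4$, $n=5$) can be reproduced in a few lines. This is a shape-only sketch in plain Python: the random values stand in for real embeddings, and `concat_prompt` is a hypothetical helper name, not a function from any T5 codebase.

```python
import random

random.seed(0)
e, p, n = 8, 4, 5   # toy values from the worked example

# P_e: trainable prompt rows; X_e: stand-in for the frozen embedding lookup output.
P_e = [[random.uniform(-0.5, 0.5) for _ in range(e)] for _ in range(p)]
X_e = [[random.gauss(0, 1) for _ in range(e)] for _ in range(n)]

def concat_prompt(prompt_rows, input_rows):
    """Sequence-axis concatenation [P_e; X_e]: prompt rows first, then input rows."""
    return [row[:] for row in prompt_rows] + [row[:] for row in input_rows]

model_input = concat_prompt(P_e, X_e)
shape = (len(model_input), len(model_input[0]))
trainable = p * e

print(shape, trainable)   # (9, 8) 32
```

Only the 32 values in `P_e` would receive updates; the remaining rows are fixed outputs of the embedding table.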

MATH ENTRY 2: Li and Liang reparameterised prefix activations.

  • Source: Li and Liang (2021), Section 4.2.
  • What it is: Trainable activations are inserted at every layer, but they are produced by mapping a smaller hidden matrix through an MLP rather than being trained directly.
  • Formal definition:

$$h_i = \begin{cases} P_\theta[i, :], & i \in P_{\text{idx}} \\ \text{LM}_\phi(z_i, h_{<i}), & i \notin P_{\text{idx}} \end{cases}$$

with reparameterisation:

$$P_\theta[i, :] = \text{MLP}_\theta(P'_\theta[i, :])$$

  • Each term explained, with dimensional analysis:
    • $h_i \in \mathbb{R}^d$ is the activation at position $i$ in a given layer; $d$ is the model’s hidden dimension (1024 for GPT-2 medium).
    • $P_\theta \in \mathbb{R}^{p \times d}$ holds the per-layer prefix activations; for an $L$-layer model the full prefix is $L$ such matrices (or equivalently, a $p \times (2 \cdot L \cdot d)$ flat representation when stored as past-key-value tensors).
    • $P'_\theta \in \mathbb{R}^{p \times k}$ is the smaller hidden matrix; $k = 512$ in the table-to-text setup.
    • $\text{MLP}_\theta : \mathbb{R}^k \to \mathbb{R}^d$ is a feedforward network; discarded at inference time.
    • $P_{\text{idx}}$ is the set of sequence indices occupied by the prefix.
    • $\text{LM}_\phi$ denotes the transformer layer’s standard computation; $\phi$ here means the frozen LM parameters (notation differs from $\theta$ which Li and Liang reserve for trainable prefix parameters).
  • Worked numerical example. Take $p=4$ prefix tokens, $d=8$ hidden dim, $k=4$ inner dim, $L=2$ layers (toy values). Without reparameterisation, $P_\theta$ has $L \times p \times d = 2 \times 4 \times 8 = 64$ trainable scalars for the keys and another 64 for the values (128 in total). With reparameterisation, $P'_\theta$ has $p \times k = 16$ scalars and a $k \to d$ MLP adds roughly $k \cdot d + d = 40$ scalars per output projection. The trainable count at training time is therefore dominated by the MLP rather than by $P'_\theta$, but the MLP is discarded at inference, leaving only the final $P_\theta$ tensor.
  • Role: The reparameterisation is an optimiser stabiliser. The per-layer prefix is the expressivity contribution.
  • Edge cases: Excessively long prefixes ($p > 200$ for summarisation) cause overfitting per Li and Liang’s prompt-length ablation.
  • Novelty: [New] for the per-layer + reparameterisation combination.
  • Transferability. [Analysis] The per-layer prefix idea transfers to any decoder-only or encoder-decoder transformer; the reparameterisation is empirically dataset-dependent (Liu et al. confirm this on NLU tasks).
  • Why it matters. This is the core of “deep prompt tuning”; every per-layer-prefix method since reuses the same activation-override mechanism.
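
The toy counts in the worked example ($p=4$, $d=8$, $k=4$, $L=2$) can be reproduced with a small sketch. It makes one loud simplification: the reparameterisation network is a single linear layer from $k$ to the flat $2 \cdot L \cdot d$ prefix row, whereas the paper uses an MLP. The helper name `reparameterise` and all numeric values are illustrative.

```python
import random

random.seed(0)
p, d, k, L = 4, 8, 4, 2     # toy values from the worked example
out_dim = 2 * L * d         # one key and one value vector per layer, flattened

# Trainable during training: the small source matrix and one linear map.
# A single linear layer is an assumption made for brevity; the paper uses an MLP.
P_prime = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(p)]
W = [[random.gauss(0, 0.1) for _ in range(out_dim)] for _ in range(k)]
b = [0.0] * out_dim

def reparameterise(source, weight, bias):
    """Map each row of the p x k source matrix to a flat 2*L*d prefix row."""
    return [[sum(row[i] * weight[i][j] for i in range(len(row))) + bias[j]
             for j in range(len(bias))] for row in source]

prefix = reparameterise(P_prime, W, b)   # the only tensor kept for inference

train_params = p * k + k * out_dim + out_dim   # P' plus linear weights and biases
stored_params = p * out_dim                    # final prefix only

print(len(prefix), len(prefix[0]), train_params, stored_params)   # 4 32 176 128
```

The 176 training-time parameters split into 16 for the source matrix and 160 for the linear map (four output projections of 40 each), while only the 128-value prefix is stored once training ends.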

MATH ENTRY 3: Standard prompt-tuning loss.

  • Source: All three papers; identical formulation.
  • What it is: Cross-entropy between the model’s predicted distribution and the gold output, with gradients restricted to prompt parameters.
  • Formal definition:

$$\mathcal{L}(\phi) = -\sum_{(X, Y) \in \mathcal{D}} \sum_{t=1}^{|Y|} \log p_\theta(y_t \mid y_{<t}, X, \phi)$$

where $\theta$ is frozen and only $\phi$ receives gradient updates.

  • Each term explained:
    • $\phi$ denotes the trainable prompt parameters ($P_e$ for Lester, $\{P_\theta^{(\ell)}\}_{\ell=1}^L$ for Li and Liang and Liu et al.).
    • $\theta$ denotes the frozen LM parameters.
    • $p_\theta(\cdot \mid \cdot, \phi)$ is the LM’s next-token distribution conditioned on the prompt.
    • The inner sum runs over the output sequence (length $|Y|$) for generation; for classification it reduces to a single class log-probability.
  • Worked numerical example. Suppose the task is binary classification, the input is one short sentence (5 tokens), the prompt is 4 tokens, and the model’s “yes / no” verbalizer probability after the full forward pass is $p(\text{yes}) = 0.72$. Cross-entropy for a positive-class example is $-\log(0.72) \approx 0.329$ nats. Backpropagation computes $\partial \mathcal{L} / \partial P_e$ by chaining through every transformer layer; for T5-XXL this is several hundred matrix products. Memory cost during training is essentially the full forward pass plus activation storage; parameter-update cost is tiny.
  • Role: The objective is unchanged from standard fine-tuning; only the set of trainable parameters is restricted.
  • Edge cases: Sentinel-token outputs from T5’s pretraining objective bias $p_\theta$ away from valid task outputs at small scales unless LM adaptation is applied (Lester Section 3.1).
  • Novelty: [Adopted] the cross-entropy form; [Adapted] in the constraint that gradients stop at $\phi$.
  • Transferability. [Analysis] Loss form is universal across PEFT methods that don’t modify the model output; the constraint set $\phi$ varies.
  • Why it matters. It is the gradient mask that makes the method PEFT.
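
The per-example loss in this entry is just a sum of negative log-probabilities of the gold tokens. A minimal sketch follows; the function name is hypothetical and the three-token probabilities are made-up illustration values, while 0.72 is the verbalizer probability from the worked example.

```python
import math

def sequence_cross_entropy(gold_token_probs):
    """Sum of -log p over the gold tokens of one example, in nats."""
    return -sum(math.log(p) for p in gold_token_probs)

# Classification with a one-token verbalizer: the inner sum has a single term.
loss_single = sequence_cross_entropy([0.72])

# Generation: hypothetical probabilities for a three-token target sequence.
loss_sequence = sequence_cross_entropy([0.72, 0.9, 0.5])

print(round(loss_single, 4))   # 0.3285
```

The loss itself is indifferent to which parameters are trainable; the PEFT behaviour comes entirely from applying the resulting gradient to the prompt parameters and nothing else.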

MATH ENTRY 4: Parameter-count comparison.

  • Source: Lester Section 4.4, Li and Liang Section 4.2.
  • What it is: The ratio between trainable prompt parameters and total frozen model parameters across the cluster.
  • Formal definition:

$$\rho = \frac{|\phi|}{|\theta|}$$

with concrete values: $\rho \approx p \cdot e / |\theta|$ for Lester (input-only), $\rho \approx L \cdot p \cdot 2 \cdot d / |\theta|$ for Li and Liang and Liu et al. (per-layer, the factor of 2 accounts for key and value tensors).

  • Each term explained:
    • $p$: prompt length (5–200 in practice).
    • $e$: token embedding dimension.
    • $d$: hidden dimension at each transformer layer.
    • $L$: number of layers.
    • The factor 2 for per-layer methods reflects that the prefix is inserted into both the key and value projections of each attention block.
  • Worked numerical example. T5-XXL has $e=4096$, $|\theta| \approx 11$ billion. Lester at $p=5$ gives $|\phi| = 5 \cdot 4096 = 20{,}480$ and $\rho \approx 1.86 \cdot 10^{-6}$. 4 For Li and Liang on GPT-2 medium ($d=1024$, $L=24$, $|\theta| \approx 345$M) at $p=10$, $|\phi| = 24 \cdot 10 \cdot 2 \cdot 1024 = 491{,}520$, giving $\rho \approx 0.14\%$. P-Tuning v2 on BERT-large ($d=1024$, $L=24$, $|\theta| \approx 335$M) at $p=20$ gives $|\phi| = 24 \cdot 20 \cdot 2 \cdot 1024 = 983{,}040$, giving $\rho \approx 0.29\%$. The two per-layer configurations sit inside the 0.01%–3% band the papers advertise; the input-layer Lester configuration falls far below it, at roughly 0.0002%.
  • Role: The PEFT claim is operationalised by these ratios.
  • Edge cases: Naive parameter counting at training time should include the reparameterisation MLP for Li and Liang; counting at inference time should not, since the MLP is folded down.
  • Novelty: [Adopted] the metric.
  • Transferability. [Analysis] $\rho$ is the standard PEFT yardstick across LoRA, adapters, BitFit, and prompt methods.
  • Why it matters. This is the operational case for soft-prompt PEFT: storage scales with $|\phi|$, not $|\theta|$.
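
The three ratios in the worked example follow directly from the two counting formulas. The sketch below recomputes them; the function names are hypothetical, and the total model sizes are the approximate figures quoted in the text rather than exact parameter counts.

```python
def input_prompt_params(p, e):
    """Input-layer prompt: p rows of dimension e."""
    return p * e

def per_layer_prefix_params(num_layers, p, d):
    """Per-layer prefix: p key rows and p value rows of dimension d at every layer."""
    return num_layers * p * 2 * d

# (trainable prompt parameters, approximate frozen model parameters)
configs = {
    "Lester, T5-XXL, p=5": (input_prompt_params(5, 4096), 11_000_000_000),
    "Li and Liang, GPT-2 medium, p=10": (per_layer_prefix_params(24, 10, 1024), 345_000_000),
    "P-Tuning v2, BERT-large, p=20": (per_layer_prefix_params(24, 20, 1024), 335_000_000),
}

ratios = {name: phi / theta for name, (phi, theta) in configs.items()}
for name, (phi, theta) in configs.items():
    print(f"{name}: |phi| = {phi:,}, rho = {ratios[name]:.2e}")
```

These counts describe what is stored after training; any reparameterisation network used during training would add to the training-time total.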

Section 7: Algorithmic contributions

ALGORITHM ENTRY 1: Lester prompt-tuning training loop (headline algorithm of the cluster).

  • Source: Lester et al. (2021), Section 2; faithful reconstruction with explicit gradient masking.
  • Purpose: Learn $P_e$ such that the frozen T5 outputs the gold target under a task-specific objective.
  • Inputs:
    • Frozen pretrained model $f_\theta$ (T5-Small through T5-XXL).
    • Dataset $\mathcal{D} = \{(X^{(i)}, Y^{(i)})\}$.
    • Hyperparameters: prompt length $p$, learning rate $\eta$, initialisation strategy.
  • Outputs: Trained prompt matrix $P_e \in \mathbb{R}^{p \times e}$.
  • Pseudocode (headline algorithm; rendered as a code-block image for the article):
# Inputs: frozen T5 model, dataset D, prompt length p, learning rate eta
# Output: trained prompt matrix P_e

# 1. Initialise prompt
if init == "class_label":
    P_e = embed(class_label_tokens)  # repeat to length p
elif init == "sampled_vocab":
    P_e = embed(sample_vocab_tokens(size=p))
else:  # random uniform
    P_e = uniform(-0.5, 0.5, shape=(p, e))

freeze(f_theta)

for step in range(num_steps):
    X, Y = sample_batch(D)
    X_e = embed(X)                              # frozen lookup
    input_seq = concat([P_e, X_e], axis=0)      # (p + n) x e
    logits = f_theta.forward(input_seq)         # frozen forward
    loss = cross_entropy(logits, Y)
    grad_P_e = backprop(loss, wrt=P_e)          # only P_e
    P_e = P_e - eta * grad_P_e                  # shown as plain SGD; Lester uses AdaFactor

return P_e
  • Hand-traced example on minimal input. Take T5-Small ($e = 512$) with prompt length $p = 4$, single training example “Is this movie good? <positive>”, batch size 1, learning rate $\eta = 0.3$ (Lester default with AdaFactor). Initialise $P_e$ from the class-label embeddings (“positive”, “negative” repeated twice). Step 1: concatenate $P_e$ (shape $4 \times 512$) with $X_e$ (shape $7 \times 512$) to give an $11 \times 512$ matrix. The forward pass through the frozen encoder produces an $11 \times 512$ encoded sequence; the decoder cross-attends to it; the next-token distribution at decoder step 1 puts probability 0.41 on “positive” (the class label). Cross-entropy is $-\log(0.41) \approx 0.89$. Backprop computes $\partial \mathcal{L} / \partial P_e$ by routing gradients back through all 6 encoder layers and the decoder cross-attention. The 4 prompt rows each receive a 512-dim gradient. AdaFactor scales the update; $P_e$ shifts toward a position where “positive” probability rises. By step 1000, the same forward pass typically gives “positive” probability above 0.85 on the same input.
  • Complexity. Forward + backward through the full frozen model per step; bottleneck is the standard transformer self-attention. Parameter-update cost is $O(p \cdot e)$ per step. Training-time GPU memory is dominated by activations of the full model, not by $|P_e|$.
  • Hyperparameters.
    • Prompt length $p$: default 100; tested 1, 5, 20, 100, 150 per Lester Section 4.2.
    • Initialisation: class-label > sampled-vocab > random; gap closes at XXL scale.
    • Learning rate: 0.3 with AdaFactor.
    • LM-adaptation steps before prompt tuning: 100,000 (the default; tested 10K, 50K, 100K).
  • Failure modes. Random init on small T5 collapses; span-corruption pretraining without LM adaptation produces sentinel-token outputs.
  • Novelty: [Adapted] from prior continuous-prompt work; [New] at T5-XXL scale.
  • Transferability. [Analysis] Direct port to any encoder-decoder with LM-style pretraining; encoder-only models need P-Tuning v2 instead.
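The core of the loop (frozen weights, gradient flowing only into the prompt rows) can be illustrated without a real T5. The sketch below is an assumption-heavy toy: the "model" is a mean-pool followed by one frozen linear layer, the update is plain gradient descent rather than AdaFactor, and every name and number is hypothetical. It only shows that the loss can fall while the frozen weights stay untouched.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def forward(W, P, X):
    """Toy frozen model: mean-pool the (p + n) rows, then apply linear layer W."""
    rows = P + X                                   # concat prompt rows and input rows
    e = len(rows[0])
    pooled = [sum(r[j] for r in rows) / len(rows) for j in range(e)]
    logits = [sum(w[j] * pooled[j] for j in range(e)) for w in W]
    return softmax(logits)


def prompt_grad(W, P, X, y):
    """Cross-entropy gradient w.r.t. each prompt row (W receives no gradient)."""
    probs = forward(W, P, X)
    dlogits = [probs[c] - (1.0 if c == y else 0.0) for c in range(len(W))]
    e = len(W[0])
    dpooled = [sum(dlogits[c] * W[c][j] for c in range(len(W))) for j in range(e)]
    n_rows = len(P) + len(X)
    # Mean pooling spreads the gradient equally over all rows.
    return [[g / n_rows for g in dpooled] for _ in P]


def train_prompt(W, P, X, y, eta=0.3, steps=200):
    P = [row[:] for row in P]                      # work on a copy of the prompt
    losses = []
    for _ in range(steps):
        losses.append(-math.log(forward(W, P, X)[y]))
        grads = prompt_grad(W, P, X, y)
        P = [[v - eta * g for v, g in zip(row, grow)] for row, grow in zip(P, grads)]
    return P, losses


W = [[0.5, -0.2, 0.1, 0.0], [-0.3, 0.4, 0.0, 0.2]]   # frozen weights, 2 classes
W_before = [row[:] for row in W]
X = [[0.1, 0.3, -0.2, 0.5], [0.0, -0.1, 0.4, 0.2], [0.2, 0.2, 0.1, -0.3]]
P0 = [[0.0] * 4, [0.0] * 4]                          # p = 2 prompt rows, e = 4

P_trained, losses = train_prompt(W, P0, X, y=0)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In a real setup the same separation is usually obtained by disabling gradients on the model parameters and registering only the prompt matrix with the optimiser.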

ALGORITHM ENTRY 2: Li and Liang prefix-tuning training loop.

  • Source: Li and Liang (2021), Section 4.
  • Purpose: Learn per-layer prefix activations $\{P_\theta^{(\ell)}\}$ for generation tasks with GPT-2 or BART.
  • Inputs: Frozen GPT-2 / BART; generation dataset (E2E, WebNLG, DART, XSum); prompt length $p$; reparameterisation rank $k$.
  • Outputs: Trained $P_\theta \in \mathbb{R}^{L \times p \times 2d}$ (key + value at each of $L$ layers).
  • Pseudocode:
# Inputs: frozen GPT-2/BART, dataset D, prompt length p, inner rank k
# Output: prefix activations P_theta of shape (L, p, 2*d)

# 1. Initialise reparameterisation
P_prime = randn(p, k)                   # small hidden matrix
MLP = FeedForward(k, d, 2 * d * L)      # k -> d -> 2*d*L

freeze(f_theta)

for step in range(num_steps):
    X, Y = sample_batch(D)
    P_theta = MLP(P_prime).reshape(L, p, 2 * d)
    logits = f_theta.forward(X, prefix=P_theta)  # prefix injected
                                                  # at every layer's K, V
    loss = cross_entropy(logits, Y)
    update(P_prime, MLP, loss, lr=5e-5)

# At inference: discard MLP, keep only the final P_theta
P_theta_final = MLP(P_prime).reshape(L, p, 2 * d)
return P_theta_final
  • Hand-traced example. Take GPT-2 medium ($L = 24$, $d = 1024$, $p = 10$, $k = 512$). $P'_\theta$ is $10 \times 512 = 5120$ params. Counted as a single linear map from 512 to $2 \cdot 1024 \cdot 24 = 49{,}152$ outputs, the MLP adds about $512 \cdot 49{,}152 + 49{,}152 \approx 25.2$M params at training time; the two-layer form written in the pseudocode (512 to 1024 to 49,152) is larger, at roughly 50.9M. Step 1: $P'_\theta$ passes through the MLP to produce $P_\theta$ of shape $(24, 10, 2048)$, which is then split into key and value tensors per layer. Forward through GPT-2 with $P_\theta$ injected at every layer’s attention computes the next-token probability over the GPT-2 vocab; cross-entropy against the next gold token gives a scalar. Backprop updates $P'_\theta$ and the MLP only. After training, the MLP is discarded and only the resulting $P_\theta$ (about 0.5M params) is stored per task.
  • Complexity. Identical forward / backward through the frozen LM per step; trainable count at training time is dominated by the MLP, but at inference only $P_\theta$ remains.
  • Hyperparameters.
    • Prompt length: optimal 10 for table-to-text, 200 for summarisation per Li and Liang prefix-length ablation.
    • Reparameterisation rank $k$: 512 for GPT-2, 800 for BART.
    • Initialisation: real-word activations beat random; task-relevant (“summarize”) beat task-irrelevant (“elephant”) slightly.
  • Failure modes. Training without reparameterisation is unstable; embedding-only ablation (no per-layer signal) drops 7.5 BLEU on E2E.
  • Novelty: [New] for the per-layer + reparameterisation construction.
  • Transferability. [Analysis] Directly portable to any decoder-only or encoder-decoder model; P-Tuning v2 demonstrated portability to BERT encoders with minor changes.
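The gap between training-time and stored parameter counts is easy to get wrong, so a short tally helps. This is a sketch under stated assumptions: dense layers are counted with biases, and both a single-projection MLP and the two-layer form from the pseudocode are shown, since the exact MLP shape is not pinned down here. The function names are illustrative.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one dense layer."""
    return n_in * n_out + n_out


def prefix_param_counts(num_layers, p, d, k, hidden=None):
    """Return (training-time trainable params, params stored after folding the MLP).

    hidden=None counts a single projection k -> 2*d*L;
    otherwise a two-layer MLP k -> hidden -> 2*d*L is counted.
    """
    out_dim = 2 * d * num_layers
    p_prime = p * k
    if hidden is None:
        mlp = linear_params(k, out_dim)
    else:
        mlp = linear_params(k, hidden) + linear_params(hidden, out_dim)
    stored = num_layers * p * 2 * d       # only P_theta survives training
    return p_prime + mlp, stored


# GPT-2 medium values from the hand-traced example.
single_train, stored = prefix_param_counts(24, 10, 1024, 512)
two_layer_train, _ = prefix_param_counts(24, 10, 1024, 512, hidden=1024)

print(f"single projection: {single_train:,} trainable, {stored:,} stored")
print(f"two-layer MLP:     {two_layer_train:,} trainable, {stored:,} stored")
```

Either way the stored prefix is the same 491,520 values, which is why the inference-time ratio ignores the MLP.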

ALGORITHM ENTRY 3: P-Tuning v2 training loop.

  • Source: Liu et al. (2022), Section 4.
  • Purpose: Universal PEFT across NLU model scales (BERT-base through GLM-xxlarge) and task families (classification, NER, QA, SRL).
  • Inputs: Frozen BERT / RoBERTa / GLM; NLU dataset; prompt length $p$ (under 20 for classification, around 100 for sequence labelling); reparameterisation toggle.
  • Outputs: Per-layer prefix $P_\theta$ + a randomly-initialised classification head $W_{\text{cls}}$.
  • Pseudocode:
# Inputs: frozen BERT-family encoder, NLU dataset D, prompt length p,
# reparameterisation toggle r, multi-task pretraining toggle m
# Output: P_theta (per-layer prefix) + W_cls (linear head)

if m:
    # Optional: multi-task pretraining on related task family
    P_theta = train_multi_task(D_family, frozen_encoder)

if r:
    MLP = FeedForward(k, d, 2 * d * L)
    P_prime = randn(p, k)
elif not m:
    P_theta = randn(L, p, 2 * d)                 # random init only without a warm start

W_cls = randn(d, num_classes)

freeze(encoder)

for step in range(num_steps):
    X, Y = sample_batch(D)
    if r:
        P_theta = MLP(P_prime).reshape(L, p, 2 * d)
    h = encoder.forward(X, prefix=P_theta)       # frozen
    if task_type == "classification":
        logits = h[CLS] @ W_cls                  # (d,) @ (d, num_classes)
    elif task_type == "sequence_labelling":
        logits = h @ W_cls                       # per-token
    elif task_type == "extractive_qa":
        logits = h @ W_cls                       # num_classes = 2: start / end score per token
    loss = cross_entropy(logits, Y)
    update(P_theta or (P_prime, MLP), W_cls, loss)

return P_theta, W_cls
  • Hand-traced example. Take BERT-large ($L = 24$, $d = 1024$), CoNLL03 NER, prompt length $p = 100$, reparameterisation off. $P_\theta$ has shape $(24, 100, 2048)$, total about 4.9M params, which is roughly 1.5% of BERT-large’s 335M. $W_{\text{cls}}$ is $1024 \times 9$ (9 NER tags) = 9216 params. Step 1: prefix $P_\theta$ is injected at every BERT layer’s attention key-value cache; forward over a 30-token input produces per-token contextual embeddings $h$ of shape $(30, 1024)$; $W_{\text{cls}}$ projects each to 9-class logits. Cross-entropy summed over tokens gives the loss; backprop updates $P_\theta$ and $W_{\text{cls}}$. After 5K steps, F1 on CoNLL03 dev climbs to roughly the 90+ band reported in Table 3. 3
  • Complexity. Same forward / backward cost as Li and Liang; small extra cost for the linear head.
  • Hyperparameters. Prompt length per task (under 20 for SuperGLUE, around 100 for NER / QA / SRL); reparameterisation per task; classification head replaces verbalizer.
  • Failure modes. Reparameterisation off when it should be on (or vice versa) costs 1–3 F1 per task per the Liu et al. ablation.
  • Novelty: [Adapted] per-layer prefix; [New] for the universality claim and the classification-head + multi-task combination.
  • Transferability. [Analysis] The most directly reproducible of the three for production NLU; the THUDM/P-tuning-v2 reference implementation is the canonical entry point.
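The three task heads differ only in which rows of $h$ are projected and how the output columns are read. A minimal sketch with tiny, made-up shapes follows; the head is stored as $d \times$ classes, so the product is written as h @ W. All names and values are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def classification_logits(h, W_cls):
    """Sentence-level task: project only the first ([CLS]) position."""
    return matmul([h[0]], W_cls)[0]


def token_logits(h, W_cls):
    """Sequence labelling: one logit vector per token."""
    return matmul(h, W_cls)


def span_logits(h, W_span):
    """Extractive QA: W_span has two columns, read as start and end scores."""
    scores = matmul(h, W_span)
    start = [row[0] for row in scores]
    end = [row[1] for row in scores]
    return start, end


# Tiny shapes for illustration: n = 3 tokens, d = 2 hidden units.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_cls = [[0.5, -0.5, 0.0], [0.0, 1.0, 2.0]]      # d x 3 classes
W_span = [[1.0, 0.0], [0.0, 1.0]]                # d x 2 (start, end)

print(classification_logits(h, W_cls))
print(token_logits(h, W_cls))
print(span_logits(h, W_span))
```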

Section 8: Specialised design contributions

Subsection 8A — LLM / prompt design. Not applicable to this paper.

The cluster’s “prompts” are continuous vectors, not natural-language prompts, so there is no string template to document.

Subsection 8B — Architecture-specific details.

  • Lester et al. No architectural changes to T5. Only the input embedding sequence is extended by $p$ trainable rows. LM adaptation is a pretraining-objective change, not an architectural one.
  • Li and Liang. The mechanism injects prefix activations into every transformer layer’s attention key-value cache; on encoder-decoder BART, the prefix can attach to encoder, decoder, or both, with both producing best results. The reparameterisation MLP is a small two-layer feed-forward network.
  • Liu et al. The mechanism is the same per-layer-prefix activation injection as Li and Liang. The classification head is a linear layer of shape $(d, |\text{classes}|)$ initialised from $\mathcal{N}(0, 0.02)$.
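Per-layer prefix injection amounts to prepending extra key and value vectors before attention is computed, while the projections that produce the real queries, keys and values stay frozen. The sketch below shows this for one query in one head, in plain Python; the prefix values are hypothetical stand-ins for learned activations.

```python
import math


def attend(q, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights


def attend_with_prefix(q, keys, values, prefix_k, prefix_v):
    """Prepend prefix keys/values; q, keys and values themselves are unchanged."""
    return attend(q, prefix_k + keys, prefix_v + values)


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
prefix_k = [[2.0, 0.0]]        # hypothetical learned prefix key
prefix_v = [[0.0, 5.0]]        # hypothetical learned prefix value

plain, _ = attend(q, keys, values)
steered, w = attend_with_prefix(q, keys, values, prefix_k, prefix_v)
print("without prefix:", plain)
print("with prefix:   ", steered)
```

With an empty prefix the function reduces to ordinary attention, which is one way to see that the method adds context rather than changing weights.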

Subsection 8C — Training specifics.

  • Lester et al. AdaFactor optimiser, learning rate 0.3, batch size 32, 30,000 steps for SuperGLUE; T5 v1.1 checkpoint with LM adaptation pretraining for 100K extra steps.
  • Li and Liang. AdamW, learning rate 5e-5, batch size 5 (table-to-text) or 6 (summarisation), 10 epochs.
  • Liu et al. AdamW, learning rate selected from 0.02, batch sizes 16–32 depending on task; per-task hyperparameter search documented in their GitHub release.

Subsection 8D — Inference / deployment specifics.

  • Lester et al. Inference cost is one forward pass through the frozen T5 with a $(p + n)$-length input rather than $n$; the only overhead is the $p$ extra positions. Multi-task batched inference is supported by mixing prompts across batch examples, so a single frozen model serves N tasks.
  • Li and Liang. Inference cost: prefix activations are injected as past-key-value tensors at every layer; this adds $L \cdot p \cdot 2d$ floats to the model’s context but no extra compute beyond the standard prefix-attention cost.
  • Liu et al. Same as Li and Liang inference profile, plus the linear classification head.
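Mixed-task batching for input-only prompts can be sketched as a lookup plus a concatenation per example. The helper names, the task names and the padding scheme below are assumptions for illustration, not the released implementation.

```python
def build_mixed_batch(examples, prompts):
    """Prepend each example's task prompt to its embedded input.

    examples: list of (task_name, X_e), X_e being a list of embedding rows.
    prompts:  dict mapping task_name to that task's trained prompt rows.
    """
    return [prompts[task] + x_e for task, x_e in examples]


def pad_batch(batch, e):
    """Right-pad with zero rows to a common length; returns (padded, mask)."""
    max_len = max(len(seq) for seq in batch)
    padded, mask = [], []
    for seq in batch:
        pad = max_len - len(seq)
        padded.append(seq + [[0.0] * e for _ in range(pad)])
        mask.append([1] * len(seq) + [0] * pad)
    return padded, mask


# Two hypothetical tasks sharing one frozen model, e = 2, p = 2.
prompts = {
    "sentiment": [[0.1, 0.2], [0.3, 0.4]],
    "nli": [[0.9, 0.8], [0.7, 0.6]],
}
examples = [
    ("sentiment", [[1.0, 1.0]]),
    ("nli", [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]),
]

batch = build_mixed_batch(examples, prompts)
padded, mask = pad_batch(batch, e=2)
print([len(seq) for seq in batch], mask)
```

The frozen model then runs once over the padded batch; only the prompt rows differ between tasks.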

Section 9: Experiments and results

9.1 Datasets

  • Lester et al. SuperGLUE (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC). Domain-shift study uses SQuAD and MRQA out-of-domain sets (TextbookQA, BioASQ, RACE, RE, DuoRC, DROP). Zero-shot transfer study uses QQP and MRPC. 1
  • Li and Liang. Table-to-text generation: E2E, WebNLG, DART. Abstractive summarisation: XSum. 2
  • Liu et al. SuperGLUE (NLU subset); NER (CoNLL03, OntoNotes 5.0, CoNLL04); extractive QA (SQuAD 1.1, SQuAD 2.0); SRL (CoNLL05 WSJ, CoNLL05 Brown, CoNLL12). 3

9.2 Baselines

  • Lester et al. Full T5 model tuning (per-task and multi-task variants); discrete prompt design; few-shot GPT-3. 1
  • Li and Liang. Full fine-tuning of GPT-2 / BART; adapter-tuning at matched 0.1% and 3% parameter budgets; embedding-only ablation; infix-tuning ablation; discrete-prompt baseline. 2
  • Liu et al. Full fine-tuning; Lester-style prompt tuning (PT); P-Tuning v1; LoRA results referenced in some tables for context. 3

9.3 Evaluation metrics

SuperGLUE: per-task metric (accuracy / F1 / EM depending on task), then aggregate. Generation: BLEU, NIST, METEOR, ROUGE-L (E2E); BLEU (WebNLG / DART); ROUGE-1 / 2 / L (XSum). Sequence labelling: F1 (NER, SRL); EM / F1 (QA).
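For the QA rows, EM and F1 are answer-string metrics. A simplified sketch follows; it lowercases and splits on whitespace only, whereas the official SQuAD evaluation script also strips punctuation and articles, so scores from this version are not directly comparable to published numbers.

```python
from collections import Counter


def normalise(text: str) -> list:
    """Simplified normalisation: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalise(prediction) == normalise(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalise(prediction), normalise(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))
print(token_f1("the city of Paris", "Paris"))
```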

9.4 Selected results reproduced with attribution

The cluster’s headline numerical claims, reproduced from the cited tables:

| Method | Model | Task | Metric | Score | Source |
| --- | --- | --- | --- | --- | --- |
| Full fine-tune | T5-XXL (11B) | SuperGLUE | aggregate | competitive baseline | Lester Figure 1 1 |
| Prompt tuning | T5-XXL (11B) | SuperGLUE | aggregate | matches model-tune; 20,480 params/task | Lester Section 4 4 |
| Prompt tuning | T5-Small | SuperGLUE | aggregate | large gap vs model-tune | Lester Figure 1 1 |
| Full fine-tune | GPT-2 Medium | E2E | BLEU | 68.2 | Li-Liang Table 1 2 |
| Prefix-tuning (0.1%) | GPT-2 Medium | E2E | BLEU | 69.7 | Li-Liang Table 1 2 |
| Adapter (0.1%) | GPT-2 Medium | E2E | BLEU | 66.3 | Li-Liang Table 1 2 |
| Prefix-tuning (0.1%) | GPT-2 Medium | WebNLG seen | BLEU | 62.9 | Li-Liang Table 1 2 |
| Prefix-tuning (0.1%) | GPT-2 Medium | DART | BLEU | 46.4 | Li-Liang Table 1 2 |
| Full fine-tune | BART-large | XSum | ROUGE-1 | 45.14 | Li-Liang Table 2 2 |
| Prefix-tuning (0.1%) | BART-large | XSum | ROUGE-1 | 42.92 | Li-Liang Table 2 2 |
| Lester PT | BERT-large | SQuAD 1.1 | EM | 1.2 | Liu et al. Table 3 3 |
| P-Tuning v2 | RoBERTa-large | SQuAD 1.1 | EM | 88.0 | Liu et al. Table 3 3 |
| Full fine-tune | RoBERTa-large | SQuAD 1.1 | F1 | 88.9 | Liu et al. Table 3 3 |
| P-Tuning v2 | RoBERTa-large | SQuAD 1.1 | F1 | 88.5 | Liu et al. Table 3 3 |
| Lester PT | BERT-large | CoNLL03 NER | F1 | 81.9 | Liu et al. Table 3 3 |
| P-Tuning v2 | BERT-large | CoNLL03 NER | F1 | 90.2 | Liu et al. Table 3 3 |

Numerical values reproduced from the three papers; rows attribute the table / figure of origin. Some Lester figures aggregate across SuperGLUE tasks and are described qualitatively rather than as single numbers per the paper’s own framing.

9.5 Main quantitative results

Lester et al. Figure 1 of the paper shows the average SuperGLUE score as a function of T5 size for full fine-tuning, model tuning (multi-task), and prompt tuning. The prompt-tuning curve climbs from a large gap at T5-Small to parity at T5-XXL. From the paper: prompt-tuned T5-Small matches the few-shot performance of GPT-3 XL, a model over 16 times larger; at T5-XXL prompt tuning matches multi-task model tuning while using 20,480 parameters per task rather than 11 billion. 1

Li and Liang. Table 1 of the paper (table-to-text generation) shows prefix-tuning at 0.1% parameters beating both full fine-tuning and adapter-tuning at the same parameter budget on E2E, WebNLG (seen and unseen) and DART. The 69.7 BLEU on E2E is 1.5 above full fine-tuning’s 68.2 and 3.4 above adapter at the same budget. 2 Table 2 (XSum summarisation) shows prefix-tuning slightly under-performing fine-tuning (42.92 vs 45.14 ROUGE-1) — the gap is attributed to longer input sequences and greater task complexity. 2

Liu et al. Table 2 of the paper (SuperGLUE) shows P-Tuning v2 matching or exceeding fine-tuning across BERT-large, RoBERTa-large, GLM-xlarge and GLM-xxlarge. The biggest improvement vs Lester-style PT is on smaller scales: BoolQ on BERT-large is 75.8 with P-Tuning v2 vs 67.2 with PT. 3 Table 3 (sequence labelling) is where the universality claim is loudest: SQuAD 1.1 EM goes from 1.2 (Lester PT) to 88.0 (P-Tuning v2), a difference of 86.8 points. As reproduced in Section 9.4, the two rows use different backbones (BERT-large and RoBERTa-large), so the gap is indicative rather than a controlled comparison, but it shows that input-only soft prompts fall far short on extractive QA at this scale. 3

9.6 Supplementary results

Lester ablations. Figure 3 of the paper presents four ablations. Prompt-length sensitivity: 1, 5, 20, 100, 150 tokens; gains plateau by 20 and diminish by 100; at T5-XXL even prompt length 1 is competitive. Initialisation: class-label > sampled-vocab > random; gap closes at XXL. Pre-training objective: span-corruption alone fails; LM adaptation at 100K steps recovers full performance. LM-adaptation step count: 10K << 50K < 100K. 1

Li and Liang ablations. Prefix-length: optimal 10 for table-to-text, 200 for summarisation; lower training loss but higher test loss beyond optimal (overfit signature). Embedding-only ablation: 62.2 BLEU vs 69.7 with full prefix on E2E. Infix-tuning ablation: 67.2 BLEU vs 69.7. Initialisation: real-word activations beat random; task-relevant words marginally beat task-irrelevant. 2

Liu et al. ablations. Reparameterisation toggle per task (helps RTE / CoNLL04, hurts BoolQ / CoNLL12). Prompt length per task family (under 20 for classification, around 100 for sequence labelling). Multi-task pretraining: helps most tasks except QA. Classification head vs verbalizer: comparable in supervised setting, but classification head simpler for sequence labelling. 3

9.7 Robustness / domain shift

Lester Table 1. Out-of-domain F1 on MRQA datasets: prompt tuning beats model tuning by +12.5 F1 on TextbookQA and shows marginal advantages on most other out-of-domain sets, supporting the claim that frozen pretrained parameters generalise better than fine-tuned ones. 1

Li and Liang extrapolation. WebNLG unseen categories: prefix-tuning maintains better relative performance than fine-tuning. XSum news to sports / within-news splits: prefix-tuning 39.23 vs fine-tuning 38.15 ROUGE-1 (news to sports), 39.41 vs 39.20 (within-news). 2

9.8 Independent benchmark cross-checks for SOTA claims

[Reviewer Perspective] The cluster does not claim SOTA on any specific leaderboard; the central claim is fine-tuning-equivalence at PEFT cost. As of 2026, the broader PEFT field has moved decisively toward LoRA 7 and its variants (QLoRA, DoRA) for open-weight LLM adaptation, particularly because LoRA’s low-rank weight-delta form integrates cleanly with existing inference frameworks and serves better at the per-task adapter density that production LLMs require. Soft-prompt methods retain niche relevance for: (a) closed-API black-box adaptation where weight access is impossible, (b) multimodal prefix conditioning (vision-language models), and (c) controlled-generation research. The papers’ fine-tuning-equivalence claim within the chosen task suites holds up in subsequent replication efforts; the broader market choice of LoRA over soft prompts is an architectural preference rather than a refutation.

9.9 Evidence audit

[Analysis] Strongly supported: scale-conditioned competitiveness of input-only prompt tuning (Lester); per-layer prefix outperforming input-only at sub-10B scale (Li and Liang + Liu et al. ablations); domain robustness of frozen-model methods (Lester Table 1, Li and Liang extrapolation). Partially supported: universality of P-Tuning v2 across all NLU tasks; the Liu et al. evidence is comprehensive across NER / QA / SRL families but does not extend to generation. Narrow evidence: prompt-ensembling claim (Lester only on T5-XXL with $N = 5$ prompts on SuperGLUE); multi-task pretraining gains (Liu et al. specific to NLU task families).

Section 10: Technical novelty summary

| Component | Type | Novelty level | Justification | Source |
| --- | --- | --- | --- | --- |
| Input-layer-only soft prompt at T5 scale | Method | Combination novel | Continuous prompts existed; T5-XXL competitiveness was new | Lester Section 2 |
| LM adaptation pretraining | Training tweak | Fully novel | T5 span-corruption to LM-style adaptation specifically for prompt tuning | Lester Section 3.1 |
| Per-layer prefix activation injection | Method | Fully novel | The mechanism of overriding every layer’s K/V cache | Li and Liang Section 4.1 |
| Reparameterisation through MLP | Optimisation trick | Fully novel | Direct training was unstable; the MLP factor unlocks it | Li and Liang Section 4.2 |
| Deep prompts on BERT-family encoders | Method | Combination novel | Li-Liang on decoder/encoder-decoder + BERT port | Liu et al. Section 4.1 |
| Classification head replacing verbalizer | Design choice | Incrementally novel | Verbalizer was the prior default for prompt-based NLU | Liu et al. Section 4.2 |
| Universality claim across scales and tasks | Empirical | Combination novel | The cluster-defining claim that closes the sub-10B gap | Liu et al. Section 5 |

Single most novel contribution. Across the cluster, the per-layer prefix-activation injection (Li and Liang Section 4.1) is the most consequential mechanism: it is the load-bearing structural choice that the entire subsequent prompt-tuning literature reuses, and it is what differentiates prefix-tuning / P-Tuning v2 from input-only prompt tuning. The reparameterisation trick is the load-bearing optimisation choice. Together they make soft-prompt PEFT work at scales below 10B parameters.

What the papers do NOT claim to be novel. Continuous prompt vectors (predated by AutoPrompt and P-Tuning v1); cross-entropy training objective (standard); transformer architecture (frozen, untouched); SuperGLUE / GLUE / E2E / WebNLG / XSum benchmark choices.

Section 11: Situating the work

Prior work. Discrete prompt design (Brown et al. GPT-3); AutoPrompt (gradient-based discrete prompt search); P-Tuning v1 (continuous prompts at input layer with LSTM encoder); adapters (Houlsby et al.).

What this cluster changes conceptually. Adaptation does not need to modify model parameters. A small continuous vector injected at the right depth steers the frozen model. Once the model is large enough, even the simplest input-only injection works. This is the conceptual seed of “PEFT”: adaptation is a small overlay, not a re-training.

Contemporaneous related work.

  • LoRA (Hu et al., 2021, arXiv:2106.09685). 7 Same year as Lester; orthogonal mechanism. LoRA learns low-rank weight-delta matrices $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with $r \ll d$, added to the model’s frozen weight matrices. It differs from soft-prompt PEFT in that the adapted model computes with $W + \Delta W$ rather than with the untouched $W$ plus extra inputs; it sits in the same parameter-efficiency band and integrates better with modern inference frameworks. As of 2026 LoRA is the dominant open-weight PEFT.
  • Adapters (Houlsby et al., 2019, arXiv:1902.00751). 5 Inserts bottleneck modules between transformer layers. Li and Liang treat adapters as the strongest parameter-efficient baseline; prefix-tuning beats them at matched parameter budget on table-to-text.
  • P-Tuning v1 (Liu et al., 2021, arXiv:2103.10385). 8 Same lead author as P-Tuning v2; input-only soft prompts with an LSTM encoder; restricted to NLU classification tasks on selected benchmarks. P-Tuning v2 explicitly subsumes v1.
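The contrast with LoRA is clearest in code: the low-rank product is added to the frozen weight matrix, and the trainable count is $2dr$ instead of $d^2$. A minimal sketch with hypothetical 3 × 3 values and rank 1, using plain nested lists:

```python
def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(W, B, A):
    """Return W + B @ A; W itself is left unmodified."""
    delta = matmul(B, A)
    return [[w + dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


def lora_params(d: int, r: int) -> int:
    """Trainable parameters for one square d x d matrix at rank r."""
    return 2 * d * r


# Hypothetical frozen weight (identity) and rank-1 factors, d = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]          # d x r
A = [[0.5, 0.0, -1.0]]             # r x d

merged = lora_merge(W, B, A)
print(merged)
print(lora_params(1024, 8), "trainable vs", 1024 * 1024, "frozen")
```

Because the delta can be merged into $W$ after training, inference sees a single dense matrix, whereas a soft prompt or prefix remains a separate input that lengthens the attention context.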

[Reviewer Perspective] Strongest skeptical objection. The Lester paper’s central scale-conditioned claim relies on T5 v1.1 + LM adaptation; whether the same scale curve holds for other architectures (decoder-only LLaMA-style, mixture-of-experts) is an extrapolation. Li and Liang’s prefix-tuning depends on reparameterisation in ways the paper does not fully theorise; the choice of $k$ (512 vs 800) is a free hyperparameter the practitioner must tune. P-Tuning v2’s universality claim covers NLU but not generation, and the multi-task pretraining gains are not isolated from the underlying task data overlap.

[Reviewer Perspective] Strongest author-side rebuttal. All three papers operate within published, reproducible benchmark suites with code releases; the central claims survive the ablations they report; the universality claim of P-Tuning v2 is explicitly task-family-bounded in the abstract. The papers do not over-reach.

What remains unsolved. (a) Soft-prompt PEFT for instruction-tuned LLMs at the 70B+ scale (the practical PEFT default of 2026 is LoRA, not soft prompts). (b) Generalisation of the universality claim to generation tasks at sub-10B scale. (c) Theoretical characterisation of why per-layer prefixes outperform input-only prefixes at small scale (the empirical fact is robust; the mechanism is opaque).

Three future research directions. [Analysis] (1) Hybrid LoRA + prefix-tuning composability: do the two methods stack? (2) Cross-task prompt transfer: can a prompt trained on task A initialise a prompt for task B better than random? Lester’s class-label initialisation gestures at this. (3) Prompt-only adaptation for vision-language and multimodal models, where soft prompts have already shown promise in CoOp / CoCoOp and remain an active sub-area.

Section 12: Critical analysis

12.1 Strengths

  • Cleanly isolated mechanism. Each paper varies one architectural choice (input-only vs per-layer; with or without reparameterisation; classification head vs verbalizer) and reports ablations that justify the choice.
  • Strong empirical breadth at PEFT scale. Together the cluster covers SuperGLUE, NER, QA, SRL, table-to-text generation and abstractive summarisation across BERT-family, T5, GPT-2 and BART.
  • Operational efficiency claim is concrete. Twenty-thousand-parameters-per-task at T5-XXL is a reduction of more than five orders of magnitude (roughly 540,000×) relative to the 11B-parameter model copy; the operational case is unambiguous. 4
  • Reusable code releases. Lester (Google Research repo) 10 , Li and Liang (PrefixTuning repo) 11 , Liu et al. (THUDM/P-tuning-v2 repo) 9 .

12.2 Weaknesses stated by authors

  • Lester: Acknowledge that input-only prompt tuning underperforms at sub-XL T5 scales; explicitly recommend deeper methods for smaller models.
  • Li and Liang: Acknowledge XSum performance gap vs fine-tuning; acknowledge embedding-only ablation underperforms; acknowledge the reparameterisation as a training-time-only trick whose role is empirical rather than theoretical.
  • Liu et al.: Acknowledge reparameterisation is task-dependent; multi-task pretraining does not help QA; universality is scoped to NLU.

12.3 Weaknesses not stated or understated

[Reviewer Perspective]

  • The “20,480 params per task” headline is a misleadingly small number. It hides the fact that at training time, activation memory and backward compute stay close to those of full fine-tuning because backprop traverses the entire frozen model; the savings are confined to weight gradients and optimiser state. The storage win is real; the training-compute win is much smaller than the parameter-count headline suggests. Independent commentary from the broader PEFT literature (the LoRA paper itself, in its related-work section) 7 notes the training-cost parity.
  • Soft prompts are operationally fragile. Per-token training scale, sensitivity to initialisation, and inference-time prompt-length budget all impose constraints that LoRA does not. The market has voted with its feet; LoRA dominates open-weight PEFT as of 2026 despite the soft-prompt literature being chronologically prior.
  • The “universality” of P-Tuning v2 is bounded by the benchmark choice. The paper does not test on instruction-following / chat-style tasks that dominate modern LLM evaluation; the bounded NLU framing was reasonable in 2022 and is dated in 2026.

12.4 Reproducibility check

| Artefact | Status | Source |
| --- | --- | --- |
| Lester code | Released | google-research/prompt-tuning 10 |
| Li and Liang code | Released | XiangLi1999/PrefixTuning 11 |
| Liu et al. code | Released | THUDM/P-tuning-v2 9 |
| Data | Publicly available (SuperGLUE, E2E, WebNLG, DART, XSum, CoNLL03, SQuAD) | All standard benchmarks |
| Hyperparameters | Fully disclosed in each paper’s appendix | All three papers |
| Compute | Reported per paper (Google TPU pods for Lester; GPU days for Li and Liang and Liu et al.) | All three papers |
| Trained prompts / weights | Lester releases prompt checkpoints for some configurations; Li and Liang and Liu et al. release training scripts but not all trained prompts | Repos cited above |
| Evaluation set | Standard public benchmarks | N/A |
| Overall | Fully reproducible | Three independent groups with active code releases |

12.5 Methodology

  • Sample size: SuperGLUE-scale (thousands to tens of thousands of examples per task) for Lester and Liu et al.; ~22K (E2E), ~22K (WebNLG), ~62K (DART), ~204K (XSum) for Li and Liang.
  • Evaluation set: Standard held-out splits per benchmark; no contamination check reported (the benchmarks predate the modern instruction-LLM era).
  • Baselines: Full fine-tuning (all three); adapters (Li and Liang); Lester PT (Liu et al.); GPT-3 few-shot (Lester); P-Tuning v1 (Liu et al.).
  • Hardware/compute: Lester used Google TPU pods (specific count not centrally reported in the EMNLP version); Li and Liang used GPU compute (specific count documented per task); Liu et al. used GPU compute documented in the THUDM repository README.

12.6 Generalisability

The cluster’s methods transfer to: any frozen pretrained transformer with attention K/V cache (per-layer prefix), any encoder-decoder or decoder-only LM (input-only prompt), and any NLU task that can be cast as classification, span extraction, or sequence labelling. They do not transfer to: tasks requiring weight-level adaptation (the model genuinely needs to learn new knowledge, not just be re-routed); tasks where the frozen model is fundamentally mis-capable; modalities where the input embedding space is not text.

12.7 Assumption audit

The four assumptions enumerated in Section 3 are all empirically validated within each paper’s scope. The most fragile is “the frozen PLM is expressive enough” — this fails at T5-Small per Lester’s own evidence, and the entire P-Tuning v2 paper exists because the assumption also fails for input-only prompts at sub-10B scale.

12.8 What would make the cluster stronger

[Analysis] (1) Direct comparison to LoRA at matched parameter budgets across the same task suite; the cluster predates LoRA’s dominance and does not benchmark against it. (2) Theoretical analysis of why per-layer prefixes work better than input-only at small scale. (3) Extension to instruction-tuned chat-style LLM evaluation suites (MT-Bench, AlpacaEval) that were not standard at the time of publication.

Section 13: What is reusable for a new study

REUSABLE COMPONENT 1: Lester input-layer prompt mechanism.

  • What it is: Trainable embedding matrix $P_e \in \mathbb{R}^{p \times e}$ concatenated to input embeddings.
  • Why worth reusing: Simplest soft-prompt formulation; minimal code change; cheapest storage.
  • Preconditions: Frozen model at roughly the 1B+ parameter scale with LM-style pretraining (or LM-adapted T5).
  • What would need to change in a different setting: Below 1B scale, switch to per-layer prefix; for encoder-only models, switch to P-Tuning v2.
  • Risks: Underperforms at small scale; brittle to pretraining objective mismatch.
  • Interaction effects: None significant with quantisation or distillation pipelines; orthogonal to LoRA in principle though rarely composed.

REUSABLE COMPONENT 2: Li and Liang per-layer prefix + reparameterisation.

  • What it is: Per-layer trainable activation matrix produced by mapping a smaller hidden matrix through an MLP at training time.
  • Why worth reusing: Best-known soft-prompt formulation for generation tasks; standard reference implementation.
  • Preconditions: Frozen decoder-only or encoder-decoder LM; per-task data on the order of 10^3 to 10^5 examples.
  • What would need to change: Hyperparameter $k$ tuned per architecture; reparameterisation may or may not help on NLU tasks per Liu et al.
  • Risks: Per-layer storage scales with $L$; for very deep models (for example LLaMA-65B, with 80 layers), per-task storage is no longer the kilobyte regime Lester advertised.
  • Interaction effects: Composes with quantised inference if prefix activations match precision; not yet a standard PEFT path in 2026 production.
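How quickly per-task storage leaves the kilobyte regime can be estimated directly. The sketch below assumes 4-byte floats and, for the deep case, a hypothetical 80-layer decoder with $d = 8192$ and prefix length 10; those dimensions are illustrative assumptions chosen to resemble a 65B-class model, not figures from the papers.

```python
def storage_bytes(num_params: int, bytes_per_param: int = 4) -> int:
    """Per-task storage for a trained prompt or prefix."""
    return num_params * bytes_per_param


def per_layer_params(num_layers: int, p: int, d: int) -> int:
    """Per-layer prefix size: L * p * 2 * d (key + value)."""
    return num_layers * p * 2 * d


lester_xxl = 5 * 4096                            # input-only prompt, T5-XXL, p = 5
prefix_gpt2m = per_layer_params(24, 10, 1024)    # GPT-2 medium, p = 10
prefix_deep = per_layer_params(80, 10, 8192)     # hypothetical deep decoder

for name, n in [
    ("input-only, T5-XXL", lester_xxl),
    ("per-layer, GPT-2 medium", prefix_gpt2m),
    ("per-layer, 80 layers (assumed)", prefix_deep),
]:
    print(f"{name}: {n:,} params, {storage_bytes(n) / 1024:,.0f} KiB")
```

Under these assumptions the deep prefix is tens of mebibytes per task: still small next to a full model copy, but several hundred times larger than the input-only prompt.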

REUSABLE COMPONENT 3: P-Tuning v2 universal-NLU recipe.

  • What it is: Per-layer prefix + classification head + per-task prompt-length + per-task reparameterisation toggle.
  • Why worth reusing: Best-documented PEFT recipe for NLU on BERT-family encoders.
  • Preconditions: BERT / RoBERTa / GLM family or comparable encoder; NLU task definable as classification, span, or sequence labelling.
  • What would need to change: Hyperparameter sweep per task; not directly portable to generation without the prefix-tuning generation-side machinery.
  • Risks: Reparameterisation choice can lose 1–3 F1 if mis-set; multi-task pretraining helps most NLU tasks but not QA.
  • Interaction effects: Composes with the THUDM reference implementation directly; no quantisation interactions reported.

Dependency map. Component 1 (Lester) depends on a large frozen LM and LM-style pretraining objective. Component 2 (Li and Liang) depends on a frozen LM and the per-layer K/V-cache injection capability of the inference framework. Component 3 (Liu et al.) depends on Component 2 plus an explicit classification head, per-task hyperparameter tuning, and the THUDM training-loop scaffolding.

Recommendation. [Analysis] For a new study in 2026, the highest-value reusable artefact is Component 3 (P-Tuning v2) if the goal is BERT-family NLU PEFT, and Component 2 (Li-Liang) if the goal is decoder-only generation PEFT. For instruction-tuned LLM adaptation at the modern 7B–70B scale, the field has moved to LoRA-family methods; the cluster’s relevance there is mostly as conceptual antecedent.

[Analysis] What type of new study benefits most. A study that (a) cannot access model weights (closed-API adaptation), (b) needs per-tenant adaptation at very low storage cost, or (c) explores multimodal prefix conditioning where soft prompts retain an active research foothold.

Section 14: Known limitations and open problems

Limitations explicitly stated by the authors.

  • Lester: Input-only prompts fail below T5-XL scale; LM adaptation required; SuperGLUE-only evaluation does not cover generation.
  • Li and Liang: XSum performance below fine-tuning; reparameterisation rank empirically chosen; embedding-only and infix-tuning fail.
  • Liu et al.: Universality bounded by NLU task families; reparameterisation task-dependent; multi-task pretraining hurts QA.

Limitations not stated.

  • [Analysis] Training-time memory: GPU memory during training is close to that of full fine-tuning despite the tiny parameter count, because gradients still have to flow back through the frozen model; the storage win is not a training-compute win.
  • [Reviewer Perspective] LoRA-era dominance: the broader PEFT field has moved past soft prompts for open-weight LLM adaptation, leaving the cluster as a conceptual rather than operational default in 2026.
  • [Reviewer Perspective] Brittle initialisation: Lester’s class-label initialisation works because labels are short English tokens; for more abstract task definitions, the initialisation heuristic does not transfer.

Technical root cause. All three limitations trace to the same architectural fact: soft prompts inject conditioning at a position the model treats as a token-like input, which is operationally clean but does not scale capacity in the way LoRA’s weight-delta does. Writing $L$ for the number of layers, $p$ for the prompt length, $d$ for the hidden width and $e$ for the input-embedding width, prefix capacity is $O(L \cdot p \cdot d)$ for per-layer methods and $O(p \cdot e)$ for input-only methods; LoRA capacity is $O(L \cdot r \cdot d)$, where $r$ is the tunable rank of the low-rank decomposition. The two methods sit at comparable parameter budgets but trade differently against inference-framework integration.
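
The capacity expressions above can be turned into a rough parameter count. The sketch below is illustrative only: the constant factors (one key and one value prefix per layer, two LoRA-adapted square matrices per layer) and the 24-layer, width-4096 configuration are assumptions, not figures from any of the three papers. The one number taken from the sources is Lester's 20,480 parameters per task on T5-XXL, which is consistent with a 5-token prompt at an embedding width of 4096.

```python
def input_prompt_params(p, e):
    """Input-only soft prompt: p trainable vectors of embedding width e."""
    return p * e


def per_layer_prefix_params(num_layers, p, d, kv=2):
    """Per-layer prefix: p vectors of width d for keys and values in every layer."""
    return kv * num_layers * p * d


def lora_params(num_layers, r, d, matrices_per_layer=2):
    """LoRA on square d x d matrices: each adapted matrix adds A (r x d) and B (d x r)."""
    return num_layers * matrices_per_layer * 2 * r * d


def storage_bytes(params, bytes_per_param=2):
    """Stored size, assuming 16-bit weights by default."""
    return params * bytes_per_param


# Lester's reported figure, reproduced as 5 tokens x 4096 dimensions.
lester = input_prompt_params(p=5, e=4096)                        # 20,480

# Hypothetical 24-layer model of width 4096.
prefix = per_layer_prefix_params(num_layers=24, p=20, d=4096)    # 3,932,160
lora = lora_params(num_layers=24, r=8, d=4096)                   # 3,145,728

for name, n in [("input-only prompt", lester),
                ("per-layer prefix", prefix),
                ("LoRA", lora)]:
    print(f"{name:>18}: {n:>9,} params, {storage_bytes(n) / 1024:,.0f} KiB at 16-bit")
```

Under these assumptions the per-layer prefix and LoRA land within the same order of magnitude, while the input-only prompt is roughly two orders smaller, which is the sense in which the two per-layer methods occupy comparable budgets.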

Open problems. Theoretical characterisation of soft-prompt expressivity; principled selection of prompt length and depth per task; cross-task prompt transfer; composability with LoRA / adapters.

What a follow-up paper would need to solve. A unified PEFT method that (a) matches LoRA on open-weight LLM adaptation, (b) matches soft prompts on closed-API black-box adaptation, and (c) provides a principled depth-versus-input prompt-allocation rule. As of 2026 no single method does all three.

How this article reads at three depths

For the curious high-school reader. Three papers ask: can a frozen language model be steered to do new tasks by writing a tiny “magic preamble” of trained numbers instead of changing the model itself? The answer is yes, with three twists: (1) if the model is huge, the preamble only needs to live at the input; (2) for smaller models or trickier tasks like extracting answer spans from a paragraph, the preamble needs to be repeated at every layer of the model; (3) a few training tricks make this universal across model sizes. The takeaway is that adaptation can be cheap and modular: one giant frozen model, many tiny prompts.

For the working developer or ML engineer. Soft-prompt PEFT buys small per-task storage and modularity at the cost of a more finicky training process. The trained prompt files are kilobyte-scale per task for input-only prompts, the inference path is the base model plus a small prefix, and a single frozen model can serve many tasks. The cost is sensitivity to scale (Lester-style input-only prompts are competitive only at T5-XL scale and above), sensitivity to initialisation and prompt length (per-task hyperparameter search is a real expense), and training-time memory close to that of full fine-tuning. For decoder-only generation, Li and Liang’s prefix-tuning is the canonical recipe. For BERT-family NLU including NER / QA / SRL, P-Tuning v2 is the canonical recipe. For open-weight instruction-tuned LLM adaptation in 2026, the field has moved to LoRA; soft prompts retain niche relevance for closed-API black-box adaptation and multimodal prefix conditioning.

For the ML researcher. The cluster’s load-bearing novelty is the per-layer prefix-activation injection (Li and Liang) plus the reparameterisation trick that stabilises its training. Lester’s contribution is the scale-conditioning empirical claim that input-only prompts close the gap at 10B+ parameters with LM-adaptation pretraining. P-Tuning v2’s contribution is the optimisation recipe (per-layer prefix + classification head + per-task length + multi-task pretraining) that closes the universality gap on NLU. The strongest objection is that the cluster does not benchmark against LoRA (which post-dates Lester slightly but predates Liu et al.), so the practical positioning relative to LoRA is left to the reader. A follow-up paper would need to deliver a unified soft-prompt + LoRA composition with theoretical capacity analysis and instruction-tuned LLM evaluation.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. Lester, Al-Rfou and Constant — The Power of Scale for Parameter-Efficient Prompt Tuning (arXiv:2104.08691, EMNLP 2021)
  2. Li and Liang — Prefix-Tuning: Optimizing Continuous Prompts for Generation (arXiv:2101.00190, ACL 2021)
  3. Liu, Ji, Fu, Tam, Du, Yang, Tang — P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks (arXiv:2110.07602, ACL 2022)
  4. ar5iv HTML render of Lester et al. — 20,480 parameters per task on T5-XXL (Section 4.4); over 20,000x parameter reduction vs full model tuning
  5. Houlsby et al. — Parameter-Efficient Transfer Learning for NLP (arXiv:1902.00751, ICML 2019); the adapter-tuning baseline
  6. Liu et al. — GPT Understands, Too (P-Tuning v1, arXiv:2103.10385); LSTM-encoded input-layer soft prompts predating P-Tuning v2
  7. Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685); contemporaneous PEFT with low-rank weight-delta matrices, dominant in 2026 production
  8. P-Tuning v1 reference (same arXiv ID as note 6); cited in Section 11 as the immediate predecessor of P-Tuning v2
  9. P-Tuning v2 official code repository (THUDM); reference implementation with hyperparameter configs per task
  10. Google Research — official prompt-tuning code release for Lester et al.
  11. Li and Liang — official PrefixTuning code repository
