What is fine-tuning, in 2026? A plain-English explainer of LoRA, QLoRA, and full-parameter tuning
Fine-tuning adapts a pretrained LLM to a narrower task by continuing training on new data. LoRA, QLoRA, full-parameter — the three 2026 flavours.
The short answer
Fine-tuning is what happens when a pretrained language model is shown more data, with the goal of pushing it towards a narrower behaviour than the general-purpose one its base training produced. The base model already speaks the language and knows a great deal of the world; fine-tuning teaches it the format, tone, or task you want it to settle into.
In 2026, three flavours of fine-tuning matter in practice. Full-parameter fine-tuning updates every weight in the model and is the most expressive but also the most expensive. LoRA (Low-Rank Adaptation) freezes the base weights and trains only two small matrices inserted alongside each weight matrix, which cuts the number of trainable parameters by orders of magnitude. 1 QLoRA is LoRA stacked on top of a 4-bit quantised base model, which compresses the frozen weights into roughly a quarter of their original memory footprint so a large model can be fine-tuned on a single consumer GPU. 2
The aggregated source consensus across the LoRA paper, the QLoRA paper, and the Hugging Face PEFT documentation is that LoRA and QLoRA reach within a few benchmark points of full-parameter fine-tuning on most adaptation tasks, while training in a small fraction of the compute. That is why the parameter-efficient flavours have become the default for anyone fine-tuning an open-weight model. 3
How fine-tuning sits next to pretraining
The training of a frontier language model happens in stages. Pretraining is the long, expensive phase in which the model reads a very large corpus and learns to predict the next token; this is where the base capabilities (grammar, world knowledge, code, multilingual understanding) come from. Post-training is the broad name for what happens after pretraining: supervised fine-tuning on instruction-following examples, then preference optimisation through RLHF or methods like DPO. Fine-tuning in the user-facing sense is what a downstream consumer of the model does to specialise it further: same base architecture, same tokeniser, but more training on a smaller, narrower dataset.
The thing that distinguishes fine-tuning from training-from-scratch is that the model starts from learned weights, not random ones. Gradient descent on a small fine-tuning dataset nudges those learned weights; it does not have to relearn what a noun is.
Full-parameter fine-tuning, the original recipe
Full-parameter fine-tuning is exactly what its name suggests. Every weight in the model is unfrozen, gradients flow through the whole network, and the optimiser updates every parameter. The procedure is the same one used during pretraining; only the dataset is smaller and the number of training steps is fewer.
The cost is dictated by the model size. Storing the optimiser state for a 7-billion-parameter model in mixed precision typically requires somewhere between two and four times the model’s own memory footprint, because optimisers like Adam keep first- and second-moment estimates per parameter. Even for an open-weight 7B model, full-parameter fine-tuning on commodity hardware is not always feasible. A single 80 GB H100 is normally needed to hold the gradients and optimiser state alongside the model.
Full-parameter fine-tuning is the most expressive flavour, since the gradient signal can in principle reshape any weight in the network. Per the LoRA paper’s own framing, this is also the baseline that parameter-efficient methods are measured against. 1
LoRA, the rank-decomposition trick
LoRA, introduced by Hu and colleagues at Microsoft in 2021, makes a structural assumption about how fine-tuning actually changes a pretrained model. 1 The assumption is that the update to each weight matrix lives in a low-dimensional subspace, even though the matrix itself is high-dimensional. If that is true, then the update can be approximated by the product of two much smaller matrices.
Concretely, for a weight matrix of shape , LoRA writes the fine-tuned weights as
where has shape , has shape , and is the rank of the adapter, typically a small number like 4, 8, or 16. The frozen weight is never updated. Only and train. For and , the number of trainable parameters drops from about 16.8 million per matrix to about 65,000 per matrix, a reduction of roughly 250 times for that one layer. The LoRA paper reports the overall reduction in trainable parameters as approximately 10,000 times for a GPT-3-scale model when LoRA is applied only to the attention layers. 1
At inference, the trained product can be merged back into to produce a single weight matrix of the original shape. That means LoRA adds no extra latency at inference time, unlike adapter layers that sit in the forward pass. 1
The trade-off is expressivity. LoRA can only represent updates that are themselves low-rank. For tasks where the fine-tuning signal needs to substantially restructure the model, LoRA may underfit relative to full-parameter tuning. In practice, the LoRA paper’s own ablations and a long line of follow-up work show that for instruction-following and style-adaptation tasks the gap is small.
QLoRA, LoRA on a quantised base
QLoRA, introduced by Dettmers and colleagues at the University of Washington in 2023, layers two further compressions on top of LoRA. 2
The first is quantisation of the frozen base model to 4 bits per weight, using a custom data type the paper calls 4-bit NormalFloat (NF4). Quantisation is the process of representing a weight that was originally stored in 16 or 32 bits using a smaller number of bits, accepting some loss of precision. NF4 is designed to be information-theoretically optimal for weights that follow a normal distribution, which neural-network weights approximately do. 2
The second is double quantisation, where the quantisation constants (the small floating-point numbers that record the scale of each quantised block) are themselves quantised. This trims further memory at the cost of a small additional precision loss. 2
Crucially, the LoRA adapter weights and in QLoRA are kept in full precision (typically 16-bit BFloat). Gradients flow through the 4-bit base into the 16-bit adapter, which is where the updates live.
The headline practical claim from the QLoRA paper is that fine-tuning a 65-billion-parameter model becomes possible on a single 48 GB GPU, and that the resulting model recovers the full 16-bit fine-tuning performance. 2 Whether or not the recovery is exact on every task, the paper’s evaluation on a broad suite of benchmarks reaches that conclusion within the noise of the experiments.
For a developer with a single consumer or workstation GPU, this is the change that made fine-tuning of large open-weight models a desktop activity rather than a cloud activity.
What gets fine-tuned in 2026
Three workflows account for most of the activity.
The first is format and tone adaptation. A small dataset of input-output pairs in a particular style (customer-support replies, legal-document summaries, code in a specific framework) is enough for LoRA on a 7B-to-13B open-weight base to reliably hit the desired format. Hugging Face’s PEFT library is the dominant open-source toolkit for this workflow, and the TRL library’s SFTTrainer is the standard wrapper for supervised fine-tuning. 3 4
The second is domain adaptation on closed-vocabulary tasks. A medical-records model, a legal-research model, a code-completion model for a specific in-house framework. These benefit from longer training runs and larger LoRA ranks (32 or 64 rather than 8), and the QLoRA recipe is what makes them affordable to run.
The third is preference optimisation layered on top of supervised fine-tuning, where the goal is to push the model towards outputs that satisfy a preference criterion rather than a labelled target. Direct Preference Optimization and its variants are the dominant frame for this; the underlying parameter updates are still LoRA-style adapters in most open-source recipes.
Closed-weight providers also expose fine-tuning interfaces. OpenAI’s fine-tuning API allows uploading a JSONL dataset of message exchanges and returns a fine-tuned model accessible by its own model ID; the underlying training method is not disclosed in the public documentation but is consistent with a parameter-efficient adapter approach given the latency and cost profile. 5
Common misconceptions
Three things are worth correcting because they appear often in posts on the topic.
First, fine-tuning is not the same as in-context learning. Showing a base model a few examples in the prompt and asking it to follow the pattern is few-shot prompting, which does not change any weights. Fine-tuning permanently updates the model’s parameters (or its adapter parameters). The two techniques can complement each other and the choice between them depends on dataset size, latency budget, and how often the task changes.
Second, fine-tuning is not the same as RAG (retrieval-augmented generation). RAG retrieves text from an external store at inference time and inserts it into the prompt; fine-tuning bakes the change into the model. RAG is the right tool when the knowledge changes faster than the model can be retrained. Fine-tuning is the right tool when the format, tone, or task structure needs to change.
Third, LoRA adapters are not interchangeable across base models. An adapter trained on Llama 3 8B does not load into Mistral 7B; the weight shapes and tokeniser indices are different. Treat adapters as paired with a specific base.
Honest caveats
This explainer covers the three flavours of fine-tuning that dominate 2026 open-weight workflows. It does not cover every parameter-efficient method. IA3, adapters in the BERT-era sense, prefix tuning, prompt tuning, and BitFit each have their use cases, and the Hugging Face PEFT library implements them, but LoRA and QLoRA have absorbed the bulk of practitioner attention. 3
The exact memory footprint of any fine-tuning run depends on sequence length, batch size, gradient checkpointing, and the specific framework’s implementation choices. The numbers in this article are illustrative ranges drawn from the source papers, not guarantees for any specific configuration. Verify your own hardware budget with a short test run before committing to a long training job.
Closed-weight fine-tuning APIs (OpenAI, Anthropic, Google) expose different training options at different times and price points; the OpenAI guide cited above is the canonical reference for the current state of its surface. 5 Always check the vendor’s pricing page on the day of writing.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen — LoRA: Low-Rank Adaptation of Large Language Models (arXiv:2106.09685, ICLR 2022). Sections 4 and 5 cover the rank decomposition and the parameter-count reductions reported on GPT-3 scale. (accessed ) ↩
- 2. Dettmers, Pagnoni, Holtzman, Zettlemoyer — QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314, NeurIPS 2023). NF4 data type, double quantisation, paged optimisers, and the single-48GB-GPU 65B fine-tuning result. (accessed ) ↩
- 3. Hugging Face PEFT documentation — library reference for LoRA, QLoRA, IA3, prefix-tuning, prompt-tuning, and other parameter-efficient methods. (accessed ) ↩
- 4. Hugging Face TRL — SFTTrainer reference for supervised fine-tuning on instruction-formatted datasets. (accessed ) ↩
- 5. OpenAI — fine-tuning guide describing the JSONL training format, supported base models, and the closed-weight fine-tuning API surface. (accessed ) ↩
Anonymous · no cookies set