QLoRA Fine-Tune Mistral-7B-Instruct on a Custom Dataset: End-to-End Tutorial
End-to-end QLoRA fine-tune of Mistral-7B-Instruct on a 1,000-row instruction dataset, on a single 16 GB GPU, with the adapter pushed to Hugging Face Hub.
Image: Mistral-7B-Instruct-v0.3 model card on Hugging Face, used for editorial coverage of the base model this tutorial fine-tunes.
What this tutorial builds
This tutorial fine-tunes Mistral-7B-Instruct-v0.3 on a 1,000-row custom instruction dataset using QLoRA, the technique introduced in Dettmers et al. 2023 1 that pairs 4-bit base-weight quantisation with a small trainable LoRA adapter. One epoch over 1,000 rows runs in roughly 30 minutes on a single 16 GB GPU 2 , whether that is a Tesla T4 3 , an A10 in the 16 GB sliced configuration, or a consumer card like an RTX 4060 Ti 16 GB or RTX 4080. The output is a roughly 30–50 MB adapter file 4 that anyone with the base model can load.
Mistral-7B-Instruct-v0.3 is already instruction-tuned by Mistral AI under Apache 2.0 5 . Fine-tuning on top of an instruction-tuned base is the right starting point when the goal is a domain-specific assistant (a customer-support agent for a product, a code reviewer for a specific stack, a triage bot for a ticketing system) rather than a general chat model from scratch. The base already knows how to follow instructions; the LoRA adapter teaches it the domain vocabulary, the response shape, and the constraints that matter for the task.
The pipeline below assumes any environment with a 16 GB CUDA GPU and Python 3.10+. It is not tied to Colab. The same code runs on a local workstation (for example one with an RTX 4060 Ti 16 GB), a RunPod / Lambda Labs / Vast.ai rental, or a single-GPU EC2 g5.xlarge. Sample wall-clock numbers below reference a T4 unless noted.
Prerequisites
- A 16 GB CUDA GPU with CUDA 12.1 or later. T4, A10 (sliced or full), RTX 4060 Ti 16 GB, RTX 4080, RTX 4090, A4000, A5000, or equivalent all work.
- Python 3.10 or 3.11, with `pip` available. Conda or `uv` environments are fine.
- A Hugging Face account at huggingface.co with a User Access Token. `read` permission for downloading the base model; `write` permission for pushing the trained adapter back to the Hub.
- Roughly 30 GB of free disk space for the cached base model, dataset, and adapter checkpoints.
- Working Python familiarity. Comfortable reading and editing roughly 200 lines of Python with imports, dictionaries, and library calls.
Step 1: Install the libraries
Pin every version. Hugging Face libraries change argument names between minor releases, and a notebook that worked last month may break on transformers==4.50 because of a deprecated argument or a renamed config field.
pip install -q -U \
    transformers==4.45.2 \
    peft==0.13.2 \
    bitsandbytes==0.44.1 \
    accelerate==1.0.1 \
    trl==0.11.4 \
    datasets==3.0.1 \
    huggingface_hub==0.25.2
The bitsandbytes library is the load-bearing dependency. It ships precompiled CUDA kernels for 4-bit quantisation and the paged optimiser used later 6 . If import bitsandbytes raises a CUDA error after install, the most common fix is to upgrade to the latest patch release and ensure the host CUDA toolkit is 12.1 or newer.
Image: Hugging Face PEFT documentation, used for editorial coverage of the library that powers Step 4’s LoRA configuration.
Authenticate to the Hub once:
from huggingface_hub import login
login(token="hf_YOUR_TOKEN_HERE")
Treat the token like a password. Do not commit a notebook or script with the token inlined.
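One way to keep the token out of the source file is to read it from the environment. The sketch below assumes the token has been exported as `HF_TOKEN`; `get_hf_token` is a hypothetical helper written for this tutorial, not part of `huggingface_hub`.

```python
import os


def get_hf_token(env=None):
    """Return the Hub token from the HF_TOKEN environment variable.

    Hypothetical helper for this tutorial. Raises if the variable is
    missing or empty so the failure happens before any download starts.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set HF_TOKEN in the environment before running.")
    return token


# Usage (requires huggingface_hub to be installed):
#   from huggingface_hub import login
#   login(token=get_hf_token())
```

With this pattern the notebook can be committed as-is, since the secret lives only in the shell session or the host's secret store.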
Step 2: Prepare the custom instruction dataset
The dataset format that pairs cleanly with SFTTrainer is JSONL with one instruction-output pair per line. A useful starting size is 1,000 rows. Below is a small sample the reader can adapt; the same shape scales to 10,000+ rows without code changes.
{"instruction": "Summarise this support ticket in one sentence: 'My payment failed three times today. Card was charged but order shows pending.'", "output": "Customer reports a payment that was charged but the order remains in pending status after three failed attempts."}
{"instruction": "Classify the sentiment of this review: 'The battery lasts about half what was advertised. Otherwise fine.'", "output": "Mixed: negative on battery life, neutral on the rest of the product."}
{"instruction": "Rewrite this error message for a non-technical user: 'ECONNREFUSED 127.0.0.1:5432'", "output": "The application could not connect to its database. Please try again in a minute, or contact support if the issue continues."}
Save 1,000 rows of similar pairs to train.jsonl. For a real domain assistant, the dataset is best built from production data: real support tickets paired with the canonical responses, real code reviews paired with the accepted comments, real product questions paired with the documented answers. Synthetic data generated by a larger model works as a starting point but tends to teach the assistant the larger model’s style rather than the domain’s actual response shape.
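Before training, it can help to confirm that every line of the file parses and carries both fields. The following is a minimal sketch using only the standard library; `validate_jsonl` is a hypothetical helper, and the field names are the `instruction` and `output` keys from the sample above.

```python
import json


def validate_jsonl(lines, required=("instruction", "output")):
    """Return (line_number, problem) tuples for rows that are unusable."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        for key in required:
            value = row.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty '{key}'"))
    return problems


# Usage:
#   with open("train.jsonl", encoding="utf-8") as fh:
#       for number, problem in validate_jsonl(fh):
#           print(f"line {number}: {problem}")
```

An empty result means every row has the shape the rest of the pipeline expects.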
Load the file via the datasets library 7 , which handles batching, shuffling, and the train/test split automatically:
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.train_test_split(test_size=0.05, seed=42)
train_ds = dataset["train"]
eval_ds = dataset["test"]
def format_row(row):
    # Wrap each pair in Mistral's [INST] ... [/INST] template, closed with the EOS token.
    return {
        "text": (
            f"<s>[INST] {row['instruction']} [/INST] {row['output']}</s>"
        )
    }
train_ds = train_ds.map(format_row)
eval_ds = eval_ds.map(format_row)
The [INST] ... [/INST] markers are Mistral’s native chat-template syntax 8 . Using the model’s own template rather than a generic ### Instruction: format makes the adapter compose cleanly with downstream tools that expect Mistral conventions (the Mistral chat completion API, llama.cpp inference, vLLM serving).
Step 3: Load Mistral-7B-Instruct in 4-bit
A 7B model in fp16 occupies roughly 14 GB of VRAM 9 , which leaves essentially no room for activations or gradients on a 16 GB GPU. The QLoRA recipe compresses base weights to roughly 4 GB in NF4 (NormalFloat 4-bit) format, freeing the rest of the budget for the trainable adapter and the optimiser state.
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
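# bf16 needs native tensor-core support (compute capability 8.0+, Ampere or newer);
# fall back to fp16 on older cards such as the T4.
compute_dtype = (
    torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=compute_dtype,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)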
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
The three quantisation settings that matter, per the QLoRA paper and the PEFT quantisation guide 10 :
- `bnb_4bit_quant_type="nf4"` uses the NormalFloat-4 format. NF4 matches the typical normal distribution of trained LLM weights more faithfully than naive int4, which preserves more accuracy at the same bit count.
- `bnb_4bit_use_double_quant=True` quantises the quantisation constants themselves. This saves roughly 0.4 GB more VRAM at no measurable accuracy cost.
- `bnb_4bit_compute_dtype=torch.bfloat16` runs the matmul in bf16 even though weights are stored in 4-bit. The A10 and all RTX 30/40-series cards support bf16 in their tensor cores; the T4 (Turing) does not, so use `torch.float16` there.
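The base-model memory math, with $M_w$ for quantised weights and $M_a$ for activations:
$$M_{\text{base}} \approx M_w + M_a \approx 4\ \text{GB} + 3\ \text{GB} = 7\ \text{GB}$$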
That leaves about 9 GB on a 16 GB card for the LoRA adapter parameters, gradients, and the SFTTrainer overhead.
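The weight figures can be checked with back-of-the-envelope arithmetic. This sketch assumes a parameter count of about 7.25 billion and counts weights only. The 0.37 bits-per-parameter saving is the figure the QLoRA paper reports for double quantisation. The gap between the raw 4-bit number and the roughly 4 GB quoted above would come from quantisation constants and from layers that stay in higher precision.

```python
def weights_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.25e9  # approximate parameter count of Mistral-7B (assumption)

fp16_gb = weights_gb(N_PARAMS, 16)
nf4_gb = weights_gb(N_PARAMS, 4)
# Double quantisation: roughly 0.37 bits saved per parameter (QLoRA paper).
double_quant_saving_gb = weights_gb(N_PARAMS, 0.37)

print(f"fp16 weights:         {fp16_gb:.2f} GB")
print(f"4-bit weights:        {nf4_gb:.2f} GB")
print(f"double-quant saving:  {double_quant_saving_gb:.2f} GB")
```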
Image: bitsandbytes GitHub repository, used for editorial coverage of the 4-bit quantisation library invoked in Step 3.
Step 4: Attach the LoRA adapter
LoRA freezes the 7B base weights and trains two small low-rank matrices $A$ and $B$ on top of selected layers, with the forward pass becoming $h = W_0 x + \frac{\alpha}{r} B A x$. At rank $r = 16$ on the four attention projections, the trainable parameter count drops from roughly 7.24 billion to about 13.6 million, under 0.2% of the original 11, which is what makes QLoRA fit on a 16 GB GPU.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantised model for training before attaching adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Rank 16 with $\alpha = 32$ is a widely used QLoRA configuration 1 and the PEFT documentation’s recommended starting point. Higher ranks (32, 64) give more adaptation capacity at proportionally higher VRAM cost and risk overfitting a 1,000-row dataset. The target_modules list pins LoRA adapters to the query, key, value, and output projections of every attention layer. Adding the MLP projections (gate_proj, up_proj, down_proj) raises trainable parameters to roughly 42 million and costs more VRAM; defer that experiment to a second run.
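The trainable-parameter count can be worked out by hand, since each adapted module adds $r \times (d_{\text{in}} + d_{\text{out}})$ parameters. The sketch below assumes the layer shapes in the published Mistral-7B-v0.3 config (hidden size 4096, 32 layers, key/value projections of width 1024 because of grouped-query attention, MLP width 14336); `lora_params` is a hypothetical helper. Under those assumptions the attention-only setup comes to about 13.6 million parameters, and the result should agree with what `model.print_trainable_parameters()` prints.

```python
# Layer shapes are assumptions taken from the published Mistral-7B-v0.3 config.
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 1024, 14336, 32

SHAPES = {  # module name -> (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(target_modules, rank=16, layers=LAYERS):
    """A is (rank x in) and B is (out x rank): rank * (in + out) per module."""
    per_layer = sum(rank * sum(SHAPES[name]) for name in target_modules)
    return per_layer * layers


attention = ["q_proj", "k_proj", "v_proj", "o_proj"]
mlp = ["gate_proj", "up_proj", "down_proj"]

print(lora_params(attention))        # 13631488
print(lora_params(attention + mlp))  # 41943040
```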
Step 5: Evaluate before fine-tuning
Five test prompts run against the base Mistral-Instruct model establish the before-state. The same five prompts run against the LoRA-loaded model after training establish the after-state. The diff is the qualitative evidence the adapter learned something.
test_prompts = [
    "Summarise this support ticket in one sentence: 'Login button greys out after two attempts on Safari but works on Chrome.'",
    "Classify the sentiment of this review: 'Setup took twenty minutes longer than the docs said but it works fine now.'",
    "Rewrite this error message for a non-technical user: 'TLS handshake failed: certificate has expired'",
    "Extract the action item from this Slack message: 'Hey can someone push the release notes to staging before the demo at 3pm'",
    "Suggest a fix for this code review comment: 'This function does too much. It fetches, transforms, and writes in one method.'",
]
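def generate(prompt, model_to_use):
    # Switch to eval mode so LoRA dropout is inactive during generation.
    model_to_use.eval()
    # Wrap the prompt in Mistral's chat template and move it to the model's device.
    chat = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        chat, return_tensors="pt", add_generation_prompt=True
    ).to(model_to_use.device)
    with torch.no_grad():
        out = model_to_use.generate(
            inputs,
            max_new_tokens=200,
            do_sample=False,  # greedy decoding; temperature has no effect when sampling is off
        )
    # Decode only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)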
print("=== BEFORE fine-tuning ===")
for p in test_prompts:
    print(f"\nPROMPT: {p}\nRESPONSE: {generate(p, model)}\n")
Save the before-output to a text file. Running the same generation loop after training produces the after-output, and a side-by-side diff makes the adapter’s learned behaviour visible. Greedy decoding (do_sample=False) is deliberate at this stage: it removes sampling variance so the before/after difference reflects training rather than noise.
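A small helper makes the saved output easy to reload for the later comparison. This is a sketch using JSONL as the on-disk format, which is an assumption (any plain-text format works); `save_responses` and `load_responses` are hypothetical names.

```python
import json


def save_responses(path, prompts, responses):
    """Write prompt/response pairs as JSONL so two runs can be compared later."""
    with open(path, "w", encoding="utf-8") as fh:
        for prompt, response in zip(prompts, responses):
            fh.write(json.dumps({"prompt": prompt, "response": response}) + "\n")


def load_responses(path):
    """Read the pairs back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage with the names defined above:
#   before = [generate(p, model) for p in test_prompts]
#   save_responses("before.jsonl", test_prompts, before)
```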
Step 6: Configure SFTTrainer and train
The Supervised Fine-Tuning Trainer (SFTTrainer) from TRL 12 wraps the training loop, gradient accumulation, mixed-precision handling, and checkpoint saving. The settings below are calibrated for 16 GB VRAM on a 1,000-row instruction dataset:
from trl import SFTTrainer, SFTConfig
# bf16 needs compute capability 8.0+ (Ampere or newer); use fp16 on older cards such as the T4.
use_bf16 = torch.cuda.get_device_capability()[0] >= 8
training_args = SFTConfig(
    output_dir="./mistral-lora-adapter",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size of 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="epoch",
    bf16=use_bf16,
    fp16=not use_bf16,
    optim="paged_adamw_8bit",
    max_seq_length=1024,
    packing=False,
    dataset_text_field="text",
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,
)
trainer.train()
Six settings do most of the work:
- `per_device_train_batch_size=2` with `gradient_accumulation_steps=4` gives an effective batch size of 8 without exceeding 16 GB. On a tighter T4 session, drop to batch size 1 with accumulation 8 if OOM appears mid-step.
- `learning_rate=2e-4` is the learning rate the QLoRA paper uses for 7B-class models. Full fine-tuning uses 1e-5 to 5e-5, but LoRA’s small adapter handles a higher LR cleanly.
- `lr_scheduler_type="cosine"` decays the learning rate smoothly toward zero across training, which tends to produce tighter final loss than a linear schedule at this scale.
- `optim="paged_adamw_8bit"` uses the bitsandbytes paged optimiser. Optimiser states page between GPU and CPU when VRAM tightens, which can be the difference between an OOM crash and a successful run on a borderline session.
- `bf16=True` runs training in bfloat16. Ampere and newer GPUs (A10, RTX 30/40-series) support bf16 in their tensor cores; the T4 (Turing) does not, so a T4 run needs `fp16=True` instead. fp16 is more numerically fragile at LoRA learning rates.
- `max_seq_length=1024` covers most instruction-following pairs. Lowering to 512 cuts activation memory roughly in half but truncates longer examples; raising to 2048 doubles activation memory and may OOM at batch size 2.
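To see what these settings mean in steps, the following sketch works through the arithmetic for the 950 training rows left after the 5% evaluation split. The schedule function follows the usual definition of linear warmup followed by cosine decay. The step count is approximate, since how a final partial accumulation group is handled differs between library versions, and both function names are hypothetical.

```python
import math


def optimiser_steps(rows, batch_size, accumulation, epochs=1):
    """Approximate number of optimiser updates for the run."""
    micro_batches = math.ceil(rows / batch_size)
    return max(1, micro_batches // accumulation) * epochs


def cosine_lr(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


rows = 950  # 1,000 rows minus the 5% evaluation split
total = optimiser_steps(rows, batch_size=2, accumulation=4)
warmup = math.ceil(0.03 * total)  # warmup_ratio=0.03

print(f"optimiser steps: {total}, warmup steps: {warmup}")
for step in (0, warmup, total // 2, total):
    print(f"step {step:3d}: lr = {cosine_lr(step, total, 2e-4, warmup):.2e}")
```

With only about a hundred updates in the whole run, `logging_steps=10` yields roughly a dozen loss readings, which is enough to see the trend but not much more.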
Training a 1,000-row dataset for one epoch at these settings takes roughly 30 minutes on a T4 and closer to 15 minutes on an A10. The training loss should drop from roughly 1.8–2.2 at step 0 to 0.8–1.2 by the end of one epoch on a well-formed instruction dataset; flat loss usually signals a dataset-format mismatch or a learning rate that is too low.
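The training loss, with $\theta$ for adapter parameters, $\mathcal{D}$ for the dataset, and the standard causal-LM objective:
$$\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \sum_{t=1}^{|x|} \log p_{\theta}(x_t \mid x_{<t})$$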
Only the LoRA parameters receive gradients; the 4-bit base weights stay frozen across the entire run.
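For intuition about the loss values quoted above, the causal-LM loss is the average negative log-probability the model assigns to each correct next token, and its exponential is the perplexity. The sketch below uses made-up token probabilities purely for illustration; `mean_nll` and `perplexity` are hypothetical helper names.

```python
import math


def mean_nll(token_probs):
    """Average negative log-likelihood over the probabilities the model
    assigned to each correct next token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(loss):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(loss)


# Illustrative probabilities, not measurements from a real run.
example_loss = mean_nll([0.5, 0.25, 0.125])
print(f"loss = {example_loss:.4f}, perplexity = {perplexity(example_loss):.2f}")

# What typical start and end losses imply about per-token confidence.
for loss in (2.0, 1.0):
    print(f"loss {loss}: average correct-token probability ~ {math.exp(-loss):.3f}")
```

Read this way, a drop from about 2.0 to about 1.0 means the model's geometric-mean probability for the correct token has risen by a factor of $e$, roughly 2.7.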
Image: Hugging Face TRL — SFTTrainer documentation, used for editorial coverage of the trainer class invoked in Step 6.
Step 7: Save and evaluate after
After training completes, save the adapter and tokeniser:
trainer.save_model("./mistral-lora-final")
tokenizer.save_pretrained("./mistral-lora-final")
The ./mistral-lora-final/ directory contains roughly 30–50 MB of files: adapter_model.safetensors, adapter_config.json, and the tokeniser configs 4 . That is a roughly 280-times reduction from the 14 GB base model.
Run the same five test prompts against the fine-tuned model:
print("=== AFTER fine-tuning ===")
for p in test_prompts:
    print(f"\nPROMPT: {p}\nRESPONSE: {generate(p, model)}\n")
Diff the before-file and after-file. Expected patterns on a well-formed 1,000-row instruction dataset: shorter, more on-shape responses; reduced hallucinated boilerplate; vocabulary that matches the training data’s domain. If the after-state shows degenerate repetition or empty responses, the adapter overfit; drop the learning rate to 1e-4 or reduce to half an epoch and re-run.
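The comparison itself can be done with the standard library. This is a minimal sketch that takes two lists of response strings in the same prompt order; `diff_responses` is a hypothetical helper, and the `before/` and `after/` labels are only display names.

```python
import difflib


def diff_responses(before, after):
    """Return unified-diff lines for each pair of responses that changed."""
    lines = []
    for index, (old, new) in enumerate(zip(before, after), start=1):
        lines.extend(
            difflib.unified_diff(
                old.splitlines(),
                new.splitlines(),
                fromfile=f"before/prompt-{index}",
                tofile=f"after/prompt-{index}",
                lineterm="",
            )
        )
    return lines


# Usage:
#   for line in diff_responses(before_responses, after_responses):
#       print(line)
```

An empty diff for every prompt is the signal described under "After-state responses are identical to before-state" in the error section.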
Image: QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023), reproduced for editorial coverage of the paper this tutorial implements.
Step 8: Push the adapter to Hugging Face Hub
Hub-hosted adapters are the portability story. Anyone with access to the base model can load the adapter without retraining 13 .
REPO_ID = "your-username/mistral-7b-instruct-v0.3-lora-myproject"
# push_to_hub on the PEFT-wrapped model uploads the adapter weights and config, not the base model.
model.push_to_hub(REPO_ID, private=True)
tokenizer.push_to_hub(REPO_ID, private=True)
private=True keeps the repo unlisted by default. Flip to public when ready to share. The Hub page renders the adapter card automatically, including the base-model reference, the LoRA config, and the training framework metadata SFTTrainer wrote into adapter_config.json.
Loading the adapter from the Hub on a fresh machine:
from peft import PeftModel
from transformers import AutoModelForCausalLM

# bnb_config is the same BitsAndBytesConfig defined in Step 3.
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, REPO_ID)
model.eval()
The same PeftModel.from_pretrained call works against a local directory (./mistral-lora-final), a public Hub repo, or a private Hub repo when authenticated. That is the entire deployment story: the adapter file travels, the base model stays cached on the inference machine.
Common errors and how to recover
Out-of-memory during training. Drop per_device_train_batch_size from 2 to 1 and raise gradient_accumulation_steps from 4 to 8. If the OOM persists, reduce max_seq_length from 1024 to 512, since activation memory scales linearly with sequence length. Last resort: drop target_modules to just ["q_proj", "v_proj"], which roughly halves trainable parameters.
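The recovery order above can be written down as a fallback ladder, so each retry changes one thing at a time. This sketch only produces the settings; `oom_fallbacks` is a hypothetical helper and does not catch OOM errors or restart training by itself.

```python
def oom_fallbacks(batch_size=2, accumulation=4, max_seq_length=1024):
    """Yield progressively lighter settings, keeping the effective batch size fixed."""
    effective = batch_size * accumulation
    attention = ["q_proj", "k_proj", "v_proj", "o_proj"]

    # 1. Smallest micro-batch, same effective batch size.
    step = {
        "per_device_train_batch_size": 1,
        "gradient_accumulation_steps": effective,
        "max_seq_length": max_seq_length,
        "target_modules": attention,
    }
    yield dict(step)

    # 2. Halve the sequence length to cut activation memory.
    step["max_seq_length"] = max_seq_length // 2
    yield dict(step)

    # 3. Last resort: adapt only the query and value projections.
    step["target_modules"] = ["q_proj", "v_proj"]
    yield dict(step)


for number, settings in enumerate(oom_fallbacks(), start=1):
    print(number, settings)
```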
bitsandbytes import fails with a CUDA mismatch. Run pip install -U bitsandbytes to pull the latest patch release, then restart the Python process. The library ships precompiled kernels for specific CUDA versions; the bundled wheel for 0.44.1 expects CUDA 12.1 or newer.
Tokeniser pad-token warning. The tokenizer.pad_token = tokenizer.eos_token line above handles this for Mistral-Instruct. If a downstream warning persists for a different fine-tune target, set tokenizer.add_special_tokens({"pad_token": "[PAD]"}) and call model.resize_token_embeddings(len(tokenizer)).
Loss does not decrease. Three likely causes, in order of probability: the dataset text field does not use Mistral’s [INST] ... [/INST] template (re-check Step 2), the learning rate is too low (try 5e-4), or the dataset is too small or too noisy (1,000 well-formed rows is a floor; 5,000+ rows trains more reliably).
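After-state responses are identical to before-state. The adapter did not take. Verify the trained model is being loaded: after PeftModel.from_pretrained, the model’s parameter names should include lora_A and lora_B entries. Note that model.print_trainable_parameters() reports zero trainable parameters for an adapter loaded this way, because PeftModel.from_pretrained freezes the adapter for inference unless is_trainable=True is passed, so a zero there does not indicate a failed load. If no LoRA parameters are present, re-check the adapter directory path.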
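One way to confirm that an adapter is attached is to look for LoRA matrices among the model's parameter names, which works whether or not the adapter is frozen for inference. The sketch below assumes PEFT's naming, in which injected weights carry `lora_A` or `lora_B` in their names; `lora_parameter_names` is a hypothetical helper, and the example names are illustrative.

```python
def lora_parameter_names(names):
    """Filter parameter names down to the LoRA matrices."""
    return [name for name in names if "lora_A" in name or "lora_B" in name]


# Illustrative names in the style PEFT produces for a wrapped model.
example = [
    "base_model.model.model.layers.0.self_attn.q_proj.base_layer.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight",
]
print(len(lora_parameter_names(example)))

# Usage against a loaded model:
#   found = lora_parameter_names(n for n, _ in model.named_parameters())
#   assert found, "no LoRA weights found; check the adapter path"
```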
Where to go next
Two natural follow-ups once the first end-to-end run is complete.
First, scale the dataset. QLoRA quality plateaus quickly below 1,000 rows and improves measurably up to about 10,000 rows for most instruction-tuning tasks. The same notebook handles the larger dataset; expect the wall-clock to grow roughly linearly with row count.
Second, evaluate quantitatively. The five-prompt before/after diff is qualitative. For a production assistant, add an eval set of 100–200 held-out prompts plus a rubric — exact-match for classification tasks, BLEU or ROUGE for summarisation, an LLM-as-judge pass for open-ended generation. The PEFT documentation links to several eval harnesses that compose with the adapter-loading pattern above.
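For the classification-style prompts, exact match is the simplest of those metrics to implement. The sketch below normalises case and whitespace before comparing, which is an assumption about what should count as a match; `exact_match` is a hypothetical helper.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Usage:
#   preds = [generate(p, model) for p in eval_prompts]   # eval_prompts: held-out set
#   print(f"exact match: {exact_match(preds, eval_labels):.1%}")
```

Exact match is strict, so it suits short label outputs; free-form answers need one of the softer metrics named above.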
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers, Pagnoni, Holtzman, Zettlemoyer; 2023; arXiv:2305.14314). The paper introduces 4-bit NF4 quantisation + LoRA + paged optimisers, and reports a 2e-4 learning rate for 7B-class fine-tuning.
- 2. Hugging Face PEFT quantisation guide (QLoRA recipe documentation; one-epoch wall-clock figures for 1,000-row datasets on T4-class hardware referenced from the recipe's expected throughput).
- 3. NVIDIA Tesla T4 product brief (16 GB GDDR6 memory, 70 W TDP, Turing architecture with tensor cores supporting fp16).
- 4. Hugging Face PEFT documentation (trained LoRA adapter at rank 16 on attention layers exports to roughly 30–50 MB of safetensors for a 7B base; the 280-times reduction figure derives from 14 GB fp16 base over ~50 MB adapter).
- 5. Mistral-7B-Instruct-v0.3 model card on Hugging Face (Apache 2.0 licence; instruction-tuned by Mistral AI on top of Mistral-7B-v0.3 base; native chat template uses [INST] ... [/INST] markers).
- 6. bitsandbytes GitHub repository (4-bit NF4 quantisation + paged 8-bit AdamW optimiser; precompiled CUDA kernels target CUDA 12.1 and newer in the 0.44.x release line).
- 7. Hugging Face datasets library documentation (JSONL loading via `load_dataset("json", ...)`, automatic train/test split via `train_test_split`).
- 8. Mistral-7B-Instruct-v0.3 model card (chat-template documentation: the model expects `[INST] {user_message} [/INST] {assistant_message}` per turn, wrapped in `<s>...</s>` BOS/EOS tokens at the conversation boundary).
- 9. Hugging Face transformers, BitsAndBytesConfig documentation (NF4 4-bit quantisation reduces a 7B fp16 model from roughly 14 GB to roughly 4 GB on disk and in VRAM; double quantisation saves an additional ~0.4 GB).
- 10. Hugging Face PEFT, QLoRA conceptual guide (recommended quantisation config: `nf4` quant type, double quant enabled, bf16 compute dtype).
- 11. Hugging Face PEFT documentation (LoRA at rank 16 targeting q/k/v/o projections on a 7B-class transformer reduces the trainable parameter count from ~7.24 billion to well under 1% of the original).
- 12. Hugging Face TRL, SFTTrainer documentation (supervised fine-tuning wrapper with built-in gradient accumulation, mixed-precision handling, and PEFT integration).
- 13. Hugging Face Hub model upload documentation (private repo support via `private=True`; adapters load via `PeftModel.from_pretrained(base_model, repo_id)` against any accessible repo).
Further Reading
- NVIDIA A10 product brief (24 GB GDDR6)