LoRA Fine-Tuning a 7B Model on Google Colab Free Tier: A Step-By-Step
A LoRA fine-tune of Qwen 2.5-7B on Google Colab's free Tesla T4 takes 4–6 hours. Step-by-step pipeline for devs: dataset, PEFT, training, inference.
Image: Hugging Face PEFT documentation, used for editorial coverage of the library this tutorial uses.
What this tutorial builds
This tutorial fine-tunes a 7B-parameter language model on Google Colab’s free Tesla T4 GPU using Low-Rank Adaptation (LoRA), the parameter-efficient fine-tuning method that trains a small adapter on top of frozen base weights instead of updating the full model. A complete run takes roughly 4–6 hours on the free tier 1 , produces a ~50MB adapter file, and costs nothing beyond a Google account.
The recommended base model is Qwen 2.5-7B, released by Alibaba’s Qwen team under the Apache 2.0 licence 2 . Apache 2.0 means no licence approval workflow, no acceptable-use restrictions to wade through, and full commercial reuse rights, which matters for devs experimenting on personal projects that might later turn into something paid.
Llama 3.1-8B and Mistral 7B v0.3 work with the same code path, but Llama requires a Meta licence-approval click-through 3 on Hugging Face before the weights download. Llama 3.2 only ships in 1B and 3B text variants plus 11B and 90B vision-language; there is no Llama 3.2-7B, so the Llama path here means 3.1-8B in 4-bit quantisation. Stick with Qwen for the first run; switch later if the project needs it.
Skip the upgrade to Colab Pro until at least two free-tier runs have completed end-to-end. Colab Pro is roughly $9.99/month USD globally, or approximately ₹999/month on Google’s India billing 4 (verify on the day; Google adjusts regional pricing periodically and prices fluctuate, so check before purchase). Pro buys longer sessions and a better GPU pool, but neither is the bottleneck on a first LoRA run — the bottleneck is workflow familiarity, and that comes from completing free-tier runs.
Prerequisites
Three accounts and one skill:
- A Google account with access to Google Colab. Free tier is fine.
- A Hugging Face account at huggingface.co with a User Access Token generated under Settings → Access Tokens. The token needs
readpermission for downloading models andwritepermission if pushing the trained adapter back to the Hub. - Working Python familiarity. Comfortable reading and editing roughly 100 lines of Python with imports, classes, and dictionaries. No deep ML background needed; this tutorial wraps the heavy lifting in library calls.
A kaggle.json API key is optional and only matters if the dataset is on Kaggle. Public datasets on the Hugging Face Hub need no extra credentials.
Step 1: Set up the Colab notebook
Open colab.research.google.com and create a new notebook. Under Runtime → Change runtime type, select T4 GPU as the hardware accelerator. The free tier provisions a Tesla T4 with 16GB of GDDR6 VRAM nominal, of which roughly 15GB is usable in practice after framework and ECC overhead 5 . Session length is roughly 12 hours of wall-clock idle time, with a usage-based dynamic cap that disconnects sooner under heavy load.
Verify the GPU before installing anything:
!nvidia-smi
The output should show Tesla T4 and 15360MiB total memory. If it shows a CPU-only runtime or a different GPU model, change the runtime type and re-run.
Mount Google Drive if the dataset or output adapter needs to persist across session disconnects:
from google.colab import drive
drive.mount('/content/drive')
Drive mounting is the single most important reliability step on free tier. Colab disconnects under heavy idle load; an adapter that lives only in /content/ evaporates on disconnect.
Image: Google Colaboratory, used for editorial coverage of the free-tier T4 runtime this tutorial trains on.
Step 2: Install the Python libraries
Pin specific versions. Hugging Face libraries iterate fast, and a tutorial that worked yesterday on transformers==4.45 may break on 4.50 because of a deprecated argument:
!pip install -q -U \
transformers==4.45.2 \
peft==0.13.2 \
bitsandbytes==0.44.1 \
accelerate==1.0.1 \
trl==0.11.4 \
datasets==3.0.1 \
huggingface_hub==0.25.2
The -q flag suppresses pip’s progress bar (Colab’s terminal renders it as a wall of overwriting lines that breaks output capture). Restart the runtime once the install completes; Colab caches some modules at session start, and the new versions only take effect after a runtime restart.
A note on bitsandbytes: the library has had recurring CUDA-version-mismatch issues on Colab T4 instances when the runtime image ships an older CUDA toolkit. As of the date on this article, bitsandbytes>=0.43 works cleanly with Colab’s CUDA 12.1+ default. If the import fails with a CUDA error after the install, run !pip install -U bitsandbytes to pull the latest, then restart the runtime.
Authenticate to Hugging Face Hub for model downloads:
from huggingface_hub import login
login(token="hf_YOUR_TOKEN_HERE")
Paste the User Access Token from the prerequisites step. Treat it like a password; do not commit a notebook with the token inlined.
Step 3: Choose the base model
Three options, in the order most devs should try them:
- Qwen 2.5-7B. Apache 2.0 licence 2 , ~14GB on disk in fp16, downloads without licence approval. Recommended.
- Mistral 7B v0.3. Apache 2.0 licence, similar architecture, roughly the same VRAM footprint, English-skewed training data. The fallback if Qwen’s tokeniser does not fit a particular use case.
- Llama 3.1-8B. Meta’s bespoke licence, gated download, and the click-through approval can take 24–48 hours on Hugging Face. The 8B parameter count fits the same 4-bit + LoRA configuration as Qwen 7B with a slightly tighter VRAM headroom on T4. Worth doing for production work; not worth doing for a first tutorial run. (Note: Llama 3.2 has no 7B variant; the family ships only 1B and 3B text plus 11B and 90B vision-language.)
Set the model name once at the top of the notebook so the rest of the code stays reusable:
MODEL_NAME = "Qwen/Qwen2.5-7B"
Step 4: Prepare the dataset
A useful first dataset is around 1,000 rows of JSONL, where each row contains an instruction and output pair. The format LoRA training tooling expects is:
import json
# Example: 1,000 instruction-output pairs.
sample_rows = [
{"instruction": "Summarise this paragraph in one sentence.", "output": "..."},
{"instruction": "Translate to Hindi: 'The meeting starts at 9am.'", "output": "..."},
# ...998 more
]
with open("train.jsonl", "w") as f:
for row in sample_rows:
f.write(json.dumps(row) + "\n")
Load the file via the datasets library, which handles batching and shuffling automatically:
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")
def format_row(row):
return {
"text": (
f"### Instruction:\n{row['instruction']}\n\n"
f"### Response:\n{row['output']}"
)
}
dataset = dataset.map(format_row)
The chat-template format above is the simplest one that works reliably with SFTTrainer. Real production fine-tuning uses model-specific chat templates exposed by tokenizer.apply_chat_template(); the simple format is fine for a first run.
Step 5: Configure 4-bit quantisation
Loading a 7B model in fp16 consumes ~14GB of VRAM, which leaves essentially no headroom on a T4’s 15GB usable ceiling for activations or gradients. The standard fix is 4-bit quantisation via bitsandbytes, which compresses the frozen base weights to roughly 4GB in NF4 (NormalFloat 4-bit) format 6 . With 4-bit base weights plus a LoRA adapter and small batch sizes, the full pipeline fits with batch_size=1 or batch_size=2; larger batches typically OOM. Use gradient_accumulation_steps to simulate a larger effective batch without raising VRAM.
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
The three settings that matter:
bnb_4bit_quant_type="nf4"uses the NormalFloat 4-bit format, which preserves more accuracy than naive int4 by matching the typical weight distribution of trained LLMs.bnb_4bit_use_double_quant=Truequantises the quantisation constants themselves, saving roughly 0.4GB more VRAM at no measurable accuracy cost.bnb_4bit_compute_dtype=torch.bfloat16runs the matmul in bf16 even though weights are stored in 4-bit. T4 supports bf16 in tensor cores; this is the right default.
Image: bitsandbytes GitHub repository, used for editorial coverage of the 4-bit quantisation library invoked in this step.
Step 6: Configure LoRA
LoRA freezes the 7B base weights and trains two small low-rank matrices on top of selected layers. The trainable parameter count drops from 7 billion to roughly 20 million 7 , which is what makes this fit on a T4 at all.
Image: Hugging Face PEFT GitHub repository, used for editorial coverage of the library invoked above.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
Rank 16 with alpha 32 is the standard starting point per the PEFT documentation. Higher ranks (32, 64) give more adaptation capacity but consume more VRAM and may overfit a 1,000-row dataset; lower ranks (4, 8) train faster but underfit complex tasks. Stick with 16 for the first run.
The target_modules list pins LoRA adapters to the four attention projection layers: query, key, value, and output. This is the conservative default that works across Qwen, Llama, and Mistral. Adding gate_proj, up_proj, down_proj (the MLP layers) increases trainable parameters and adaptation capacity at a real VRAM cost; defer that experiment to run two.
Step 7: Configure SFTTrainer and train
The Supervised Fine-Tuning Trainer (SFTTrainer) from TRL wraps the training loop, gradient accumulation, mixed-precision handling, and checkpoint saving. The settings below are calibrated for a T4’s 15GB usable VRAM on a 1,000-row dataset:
from trl import SFTTrainer, SFTConfig
training_args = SFTConfig(
output_dir="./lora-adapter",
num_train_epochs=1,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_ratio=0.03,
logging_steps=10,
save_strategy="epoch",
bf16=True,
optim="paged_adamw_8bit",
max_seq_length=1024,
packing=False,
dataset_text_field="text",
report_to="none",
)
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
)
trainer.train()
Five settings load-bear:
per_device_train_batch_size=1withgradient_accumulation_steps=4gives an effective batch size of 4 without exceeding T4 memory. Going to batch size 2 typically out-of-memory crashes on a 1024-token sequence length.learning_rate=2e-4is the LoRA-specific default; full fine-tuning uses 1e-5 to 5e-5, but LoRA’s small adapter handles a higher LR.bf16=Trueruns the training loop in bfloat16, which T4 supports natively. Using fp16 instead works but is more numerically fragile.optim="paged_adamw_8bit"uses bitsandbytes’ paged optimiser, which pages optimiser states between GPU and CPU memory when VRAM tightens; this can be the difference between an OOM crash and a successful run on a marginal session.max_seq_length=1024is a balance. Lowering to 512 trains faster and uses less VRAM; raising to 2048 slows training proportionally and risks OOM.
Training a 1,000-row dataset for one epoch at these settings takes roughly 4–6 hours of wall-clock time on T4. Expect Colab to drop the session at least once mid-run; the save_strategy="epoch" checkpoint policy resumes from the last saved adapter state.
Image: Hugging Face TRL — SFTTrainer documentation, used for editorial coverage of the trainer class invoked in this step.
Step 8: Export the trained adapter
After training completes, save the adapter and tokeniser to disk:
model.save_pretrained("./lora-adapter-final")
tokenizer.save_pretrained("./lora-adapter-final")
The ./lora-adapter-final/ directory now contains roughly 50MB of files 8 — adapter_model.safetensors, adapter_config.json, and the tokeniser configs. That is a 280-times reduction from the 14GB base model, which is the entire point of LoRA.
Push to Hugging Face Hub for portability:
model.push_to_hub("your-username/qwen2.5-7b-lora-myproject", private=True)
tokenizer.push_to_hub("your-username/qwen2.5-7b-lora-myproject", private=True)
private=True keeps the repository unlisted by default; flip to False only when ready to share. Adapters on the Hub can be loaded by anyone who has the base model, which is the whole portability story.
Step 9: Inference with the trained adapter
Loading for inference reuses the same 4-bit base model plus the trained adapter on top:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "./lora-adapter-final")
model.eval()
prompt = (
"### Instruction:\nSummarise this paragraph in one sentence.\n\n"
"### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=128,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The same adapter file runs on any environment that can load the base model: a local GPU, AWS, a regional GPU cloud (E2E Networks in India, RunPod or Lambda Labs in the US, OVH in EU), or another Colab session.
Common errors and how to recover
Out-of-memory during training. Reduce max_seq_length from 1024 to 512 first; this halves the activation memory. If the OOM persists, drop target_modules to just ["q_proj", "v_proj"] (cuts trainable parameters roughly in half) and consider lowering rank to 8.
Kernel crashes mid-step. Almost always a bitsandbytes-CUDA mismatch. Run !pip install -q -U bitsandbytes==0.44.1 again, restart the runtime, and re-run from the model-load cell. Colab occasionally provisions a GPU node with a different CUDA toolkit; the reinstall realigns.
Colab disconnects mid-training. Expected on free tier. Check ./lora-adapter/checkpoint-*/ for the most recent checkpoint, then resume:
trainer.train(resume_from_checkpoint=True)
Tokeniser warns about pad token. The line tokenizer.pad_token = tokenizer.eos_token in Step 5 handles this. If the warning persists for a different model, set tokenizer.add_special_tokens({"pad_token": "[PAD]"}) and resize the model’s token embeddings.
Loss does not decrease. Three likely causes, in order of probability: dataset format does not match the chat template (re-check Step 4), learning rate is too low (try 5e-4), or the dataset is too small (1,000 rows is a floor; 5,000+ rows trains more reliably).
Where to go next
Two natural follow-ups once a free-tier run has completed.
First, scale the dataset. Fine-tuning quality plateaus quickly below 1,000 rows and improves measurably up to about 10,000 rows for most instruction-tuning tasks. The same notebook handles the larger dataset; expect the wall-clock to grow proportionally.
Second, push beyond the T4 ceiling. The recent RoundPipe research extends what consumer-GPU clusters can train end-to-end without paying for an A100, which makes scaled-out distributed training accessible without enterprise budgets. The pipeline-parallel patterns there compose cleanly with the LoRA workflow above when the dataset or model size outgrows a single T4.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Hugging Face PEFT documentation (LoRA implementation reference for the 4–6 hour T4 training-time figure on a 1,000-row dataset at the configuration in Step 7) (accessed ) ↩
- 2. Qwen 2.5-7B model card on Hugging Face (Apache 2.0 licence, approximately 14GB fp16 footprint, 7B parameter count) (accessed ) ↩
- 3. Llama 3.1-8B-Instruct model card on Hugging Face (gated download requiring Meta licence approval click-through; 8B is the closest real Llama variant to a "7B-class" target since Llama 3.2 ships only 1B and 3B text plus 11B and 90B vision) (accessed ) ↩
- 4. Google Colab Pro signup page (approximately \$9.99/month USD globally, ₹999/month on India geo-render; figure varies by region; verify on the day of purchase since Google adjusts regional pricing periodically) (accessed ) ↩
- 5. Google Colaboratory landing page (free-tier T4 GPU with 16GB VRAM; approximately 12-hour idle session ceiling subject to dynamic usage caps) (accessed ) ↩
- 6. bitsandbytes GitHub repository (NF4 4-bit quantisation reduces 7B fp16 footprint from roughly 14GB to roughly 4GB) (accessed ) ↩
- 7. Hugging Face PEFT repository (LoRA at rank 16 on q/k/v/o projection layers reduces trainable parameter count from 7 billion to roughly 20 million for a 7B base) (accessed ) ↩
- 8. Hugging Face PEFT documentation (trained LoRA adapter at rank 16 on attention layers exports to roughly 50MB of safetensors) (accessed ) ↩
Further Reading
- Hugging Face transformers library documentation (accessed )
- Hugging Face TRL — SFTTrainer documentation (accessed )
- Mistral 7B v0.3 model card on Hugging Face (accessed )
Anonymous · no cookies set