Neural Tech Daily

LoRA Fine-Tuning a 7B Model on Google Colab Free Tier: A Step-By-Step

A LoRA fine-tune of Qwen 2.5-7B on Google Colab's free Tesla T4 takes 4–6 hours. Step-by-step pipeline for devs: dataset, PEFT, training, inference.

Updated ~14 min read

Image: Hugging Face PEFT documentation, used for editorial coverage of the library this tutorial uses.

What this tutorial builds

This tutorial fine-tunes a 7B-parameter language model on Google Colab’s free Tesla T4 GPU using Low-Rank Adaptation (LoRA), the parameter-efficient fine-tuning method that trains a small adapter on top of frozen base weights instead of updating the full model. A complete run takes roughly 4–6 hours on the free tier [1], produces a ~50MB adapter file, and costs nothing beyond a Google account.

The recommended base model is Qwen 2.5-7B, released by Alibaba’s Qwen team under the Apache 2.0 licence [2]. Apache 2.0 means no licence approval workflow, no acceptable-use restrictions to wade through, and full commercial reuse rights, which matters for devs experimenting on personal projects that might later turn into something paid.

Llama 3.1-8B and Mistral 7B v0.3 work with the same code path, but Llama requires a Meta licence-approval click-through [3] on Hugging Face before the weights download. Llama 3.2 only ships in 1B and 3B text variants plus 11B and 90B vision-language; there is no Llama 3.2-7B, so the Llama path here means 3.1-8B in 4-bit quantisation. Stick with Qwen for the first run; switch later if the project needs it.

Skip the upgrade to Colab Pro until at least two free-tier runs have completed end-to-end. Colab Pro is roughly $9.99/month USD globally, or approximately ₹999/month on Google’s India billing [4]; Google adjusts regional pricing periodically, so verify the figure before purchase. Pro buys longer sessions and a better GPU pool, but neither is the bottleneck on a first LoRA run: the bottleneck is workflow familiarity, and that comes from completing free-tier runs.

Prerequisites

Three accounts and one skill:

  • A Google account with access to Google Colab. Free tier is fine.
  • A Hugging Face account at huggingface.co with a User Access Token generated under Settings → Access Tokens. The token needs read permission for downloading models and write permission if pushing the trained adapter back to the Hub.
  • Working Python familiarity. Comfortable reading and editing roughly 100 lines of Python with imports, classes, and dictionaries. No deep ML background needed; this tutorial wraps the heavy lifting in library calls.

A kaggle.json API key is optional and only matters if the dataset is on Kaggle. Public datasets on the Hugging Face Hub need no extra credentials.

Step 1: Set up the Colab notebook

Open colab.research.google.com and create a new notebook. Under Runtime → Change runtime type, select T4 GPU as the hardware accelerator. The free tier provisions a Tesla T4 with a nominal 16GB of GDDR6 VRAM, of which roughly 15GB is usable in practice after framework and ECC overhead [5]. Sessions run for at most roughly 12 hours of wall-clock time, with idle timeouts and a usage-based dynamic cap that can disconnect sooner under heavy load.

Verify the GPU before installing anything:

!nvidia-smi

The output should show Tesla T4 and 15360MiB total memory. If it shows a CPU-only runtime or a different GPU model, change the runtime type and re-run.

Mount Google Drive if the dataset or output adapter needs to persist across session disconnects:

from google.colab import drive
drive.mount('/content/drive')

Drive mounting is the single most important reliability step on free tier. Colab disconnects idle or heavily used sessions without warning; an adapter that lives only in /content/ evaporates on disconnect.
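One way to make that rule hard to forget is to choose the output directory based on whether Drive is actually mounted. The sketch below is illustrative only: pick_output_dir and the lora-runs folder name are hypothetical, and /content/drive/MyDrive is assumed to be where the drive.mount call above exposes the user's files.

```python
import os

def pick_output_dir(drive_root="/content/drive/MyDrive",
                    local_root="/content",
                    run_name="lora-runs"):
    """Return a directory on Drive when it is mounted, else a local one."""
    # Drive counts as mounted only if the mount point exists as a directory.
    base = drive_root if os.path.isdir(drive_root) else local_root
    return os.path.join(base, run_name)

print(pick_output_dir())
```

If the printed path starts with /content/ rather than /content/drive/, anything written there is lost when the session ends.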

Image: Google Colaboratory, used for editorial coverage of the free-tier T4 runtime this tutorial trains on.

Step 2: Install the Python libraries

Pin specific versions. Hugging Face libraries iterate fast, and a tutorial that worked yesterday on transformers==4.45 may break on 4.50 because of a deprecated argument:

!pip install -q -U \
 transformers==4.45.2 \
 peft==0.13.2 \
 bitsandbytes==0.44.1 \
 accelerate==1.0.1 \
 trl==0.11.4 \
 datasets==3.0.1 \
 huggingface_hub==0.25.2

The -q flag suppresses pip’s progress bar (Colab’s terminal renders it as a wall of overwriting lines that breaks output capture). Restart the runtime once the install completes; Colab caches some modules at session start, and the new versions only take effect after a runtime restart.

A note on bitsandbytes: the library has had recurring CUDA-version-mismatch issues on Colab T4 instances when the runtime image ships an older CUDA toolkit. As of the date on this article, bitsandbytes>=0.43 works cleanly with Colab’s CUDA 12.1+ default. If the import fails with a CUDA error after the install, run !pip install -U bitsandbytes to pull the latest, then restart the runtime.
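After the runtime restart, it is worth confirming that the pinned versions are the ones actually loaded. A minimal sketch using only the standard library follows; find_mismatches is a hypothetical helper name, and the PINS table simply mirrors the install command above.

```python
from importlib import metadata

# Pins copied from the install command above.
PINS = {
    "transformers": "4.45.2",
    "peft": "0.13.2",
    "bitsandbytes": "0.44.1",
    "accelerate": "1.0.1",
    "trl": "0.11.4",
    "datasets": "3.0.1",
    "huggingface_hub": "0.25.2",
}

def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches

print(find_mismatches(PINS) or "all pins satisfied")
```

A bitsandbytes entry in the output is expected if the unpinned upgrade from the note above was applied.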

Authenticate to Hugging Face Hub for model downloads:

from huggingface_hub import login
login(token="hf_YOUR_TOKEN_HERE")

Paste the User Access Token from the prerequisites step. Treat it like a password; do not commit a notebook with the token inlined.
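To keep the token out of the saved notebook, one option is to read it from an environment variable and fall back to a hidden prompt. This is a sketch, not the only approach: read_hf_token is a hypothetical helper, and HF_TOKEN is assumed as the variable name.

```python
import os
from getpass import getpass

def read_hf_token(env=os.environ, prompt=getpass, var_name="HF_TOKEN"):
    """Return the token from the environment, or ask for it without echoing."""
    token = env.get(var_name, "").strip()
    if not token:
        # getpass hides the typed value, so it never lands in cell output.
        token = prompt("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token

# Usage in the notebook (not executed here):
#   login(token=read_hf_token())
```

Either way, the notebook file itself never contains the secret.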

Step 3: Choose the base model

Three options, in the order most devs should try them:

  1. Qwen 2.5-7B. Apache 2.0 licence [2], ~14GB on disk in fp16, downloads without licence approval. Recommended.
  2. Mistral 7B v0.3. Apache 2.0 licence, similar architecture, roughly the same VRAM footprint, English-skewed training data. The fallback if Qwen’s tokeniser does not fit a particular use case.
  3. Llama 3.1-8B. Meta’s bespoke licence, gated download, and the click-through approval can take 24–48 hours on Hugging Face. The 8B parameter count fits the same 4-bit + LoRA configuration as Qwen 7B with a slightly tighter VRAM headroom on T4. Worth doing for production work; not worth doing for a first tutorial run. (Note: Llama 3.2 has no 7B variant; the family ships only 1B and 3B text plus 11B and 90B vision-language.)

Set the model name once at the top of the notebook so the rest of the code stays reusable:

MODEL_NAME = "Qwen/Qwen2.5-7B"

Step 4: Prepare the dataset

A useful first dataset is around 1,000 rows of JSONL, where each row contains an instruction and output pair. The format LoRA training tooling expects is:

import json

# Example: 1,000 instruction-output pairs.
sample_rows = [
    {"instruction": "Summarise this paragraph in one sentence.", "output": "..."},
    {"instruction": "Translate to Hindi: 'The meeting starts at 9am.'", "output": "..."},
    # ...998 more
]

with open("train.jsonl", "w") as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")
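Malformed rows are a common cause of confusing training failures, so a quick check of the file before loading it can save a session. The sketch below assumes the instruction/output schema shown above; validate_jsonl is a hypothetical helper, not part of any library.

```python
import json

REQUIRED_KEYS = ("instruction", "output")

def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means the file is clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((line_number, "blank line"))
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "invalid JSON"))
                continue
            if not isinstance(row, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            # Every required field must be a non-empty string.
            for key in REQUIRED_KEYS:
                value = row.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((line_number, f"missing or empty '{key}'"))
    return problems
```

Calling validate_jsonl("train.jsonl") and fixing whatever it reports takes seconds; a bad row discovered mid-epoch costs far more.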

Load the file via the datasets library, which handles batching and shuffling automatically:

from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")

def format_row(row):
    return {
        "text": (
            f"### Instruction:\n{row['instruction']}\n\n"
            f"### Response:\n{row['output']}"
        )
    }

dataset = dataset.map(format_row)

The instruction/response prompt format above is not a true chat template, but it is the simplest format that works reliably with SFTTrainer. Production fine-tuning typically uses the model-specific chat template exposed by tokenizer.apply_chat_template(); the simple format is fine for a first run.

Step 5: Configure 4-bit quantisation

Loading a 7B model in fp16 consumes ~14GB of VRAM, which leaves essentially no headroom on a T4’s 15GB usable ceiling for activations or gradients. The standard fix is 4-bit quantisation via bitsandbytes, which compresses the frozen base weights to roughly 4GB in NF4 (NormalFloat 4-bit) format [6]. With 4-bit base weights plus a LoRA adapter, the full pipeline fits at batch_size=1, and at batch_size=2 only with shorter sequences; larger batches typically run out of memory. Use gradient_accumulation_steps to simulate a larger effective batch without raising VRAM.

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # fp16: the T4 has no native bf16 support
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

The three settings that matter:

  • bnb_4bit_quant_type="nf4" uses the NormalFloat 4-bit format, which preserves more accuracy than naive int4 by matching the typical weight distribution of trained LLMs.
  • bnb_4bit_use_double_quant=True quantises the quantisation constants themselves, saving roughly 0.4GB more VRAM at no measurable accuracy cost.
  • bnb_4bit_compute_dtype=torch.float16 runs the matmul in fp16 even though weights are stored in 4-bit. The T4 is a Turing-generation GPU whose tensor cores accelerate fp16 but not bf16 (native bf16 arrived with Ampere), so fp16 is the right default here.
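The memory figures in this step can be reproduced with back-of-envelope arithmetic. The sketch below counts weight storage only and assumes a nominal 7 billion parameters plus the block sizes commonly described for QLoRA-style double quantisation (one constant per 64 weights, a second-level constant per 256 of those); treat these as assumptions, not as measured values from bitsandbytes.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # nominal 7B parameter count (assumption)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

# Assumed block sizes: one 32-bit constant per 64 weights without double
# quantisation; with it, an 8-bit constant per 64 weights plus one 32-bit
# constant per 256 of those blocks.
plain_constants_bits = 32 / 64
double_quant_bits = 8 / 64 + 32 / (64 * 256)
saved_gb = weight_memory_gb(N_PARAMS, plain_constants_bits - double_quant_bits)

print(f"fp16 weights: {fp16_gb:.1f} GB")
print(f"NF4 weights:  {nf4_gb:.1f} GB (before quantisation constants)")
print(f"double quantisation saves about {saved_gb:.2f} GB")
```

Under these assumptions the raw NF4 weights come to 3.5 GB and double quantisation saves a few hundred megabytes. Layers that stay unquantised and the quantisation constants push the real footprint towards the roughly 4GB quoted above.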

Image: bitsandbytes GitHub repository, used for editorial coverage of the 4-bit quantisation library invoked in this step.

Step 6: Configure LoRA

LoRA freezes the 7B base weights and trains a pair of small low-rank matrices on each selected layer. The trainable parameter count drops from 7 billion to the order of 10–20 million [7], depending on the model’s projection shapes, which is what makes this fit on a T4 at all.

Image: Hugging Face PEFT GitHub repository, used for editorial coverage of the library invoked above.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
 r=16,
 lora_alpha=32,
 target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
 lora_dropout=0.05,
 bias="none",
 task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

Rank 16 with alpha 32 is the standard starting point per the PEFT documentation. Higher ranks (32, 64) give more adaptation capacity but consume more VRAM and may overfit a 1,000-row dataset; lower ranks (4, 8) train faster but underfit complex tasks. Stick with 16 for the first run.

The target_modules list pins LoRA adapters to the four attention projection layers: query, key, value, and output. This is the conservative default that works across Qwen, Llama, and Mistral. Adding gate_proj, up_proj, down_proj (the MLP layers) increases trainable parameters and adaptation capacity at a real VRAM cost; defer that experiment to run two.
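The trainable-parameter count follows directly from the layer shapes: each adapted layer gains rank × (in_features + out_features) parameters. The sketch below works this through for shapes believed to match Qwen2.5-7B's published config (an assumption to verify against config.json); the authoritative number is whatever model.print_trainable_parameters() reports.

```python
def lora_params(in_features, out_features, rank):
    """A LoRA pair is A (rank x in_features) plus B (out_features x rank)."""
    return rank * (in_features + out_features)

# Assumed shapes; check them against the model's config.json.
HIDDEN = 3584     # hidden_size
KV_DIM = 512      # num_key_value_heads (4) * head_dim (128)
MLP_DIM = 18944   # intermediate_size
N_LAYERS = 28     # num_hidden_layers
RANK = 16

ATTENTION = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}
MLP = {
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def total_trainable(modules, rank=RANK, n_layers=N_LAYERS):
    per_layer = sum(lora_params(i, o, rank) for i, o in modules.values())
    return per_layer * n_layers

attention_only = total_trainable(ATTENTION)
with_mlp = total_trainable({**ATTENTION, **MLP})
print(f"attention only:  {attention_only:,}")
print(f"attention + MLP: {with_mlp:,}")
```

Under these assumed shapes the attention-only adapter comes to about 10 million parameters, and adding the MLP projections roughly quadruples that. Models with wider key/value projections land closer to 20 million for the same rank.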

Step 7: Configure SFTTrainer and train

The Supervised Fine-Tuning Trainer (SFTTrainer) from TRL wraps the training loop, gradient accumulation, mixed-precision handling, and checkpoint saving. The settings below are calibrated for a T4’s 15GB usable VRAM on a 1,000-row dataset:

from trl import SFTTrainer, SFTConfig

training_args = SFTConfig(
    output_dir="/content/drive/MyDrive/lora-adapter",  # on the mounted Drive so checkpoints survive a disconnect
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.03,
    logging_steps=10,
    save_strategy="steps",  # with a single epoch, "epoch" would save only at the very end
    save_steps=50,
    save_total_limit=2,
    fp16=True,  # the T4 has no native bf16 support
    optim="paged_adamw_8bit",
    max_seq_length=1024,
    packing=False,
    dataset_text_field="text",
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()

Five settings carry most of the weight:

  • per_device_train_batch_size=1 with gradient_accumulation_steps=4 gives an effective batch size of 4 without exceeding T4 memory. Going to batch size 2 typically crashes with an out-of-memory error at a 1024-token sequence length.
  • learning_rate=2e-4 is the LoRA-specific default; full fine-tuning uses 1e-5 to 5e-5, but LoRA’s small adapter handles a higher LR.
  • fp16=True runs mixed-precision training in float16, which the T4’s tensor cores accelerate. bf16 is more numerically robust but needs Ampere or newer hardware, so it is not the right choice on a T4.
  • optim="paged_adamw_8bit" uses bitsandbytes’ paged optimiser, which pages optimiser states between GPU and CPU memory when VRAM tightens; this can be the difference between an OOM crash and a successful run on a marginal session.
  • max_seq_length=1024 is a balance. Lowering to 512 trains faster and uses less VRAM; raising to 2048 slows training proportionally and risks OOM.

Training a 1,000-row dataset for one epoch at these settings takes roughly 4–6 hours of wall-clock time on T4. Expect Colab to drop the session at least once mid-run; because checkpoints are written to Drive every 50 optimiser steps, the run can resume from the last saved adapter state rather than from scratch. The output_dir above assumes Drive was mounted in Step 1.
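The relationship between rows, batch size, and optimiser steps is simple arithmetic, and working it out helps when choosing logging or checkpoint intervals. The sketch below is an approximation (the trainer's own rounding may differ slightly when the row count is not a multiple of the effective batch), and training_plan is a hypothetical helper.

```python
import math

def training_plan(n_rows, per_device_batch, grad_accum, epochs=1):
    """Return (effective_batch, forward_passes, optimizer_steps)."""
    effective_batch = per_device_batch * grad_accum
    forward_passes = math.ceil(n_rows / per_device_batch) * epochs
    optimizer_steps = math.ceil(n_rows / effective_batch) * epochs
    return effective_batch, forward_passes, optimizer_steps

effective_batch, forward_passes, optimizer_steps = training_plan(1000, 1, 4)

# Seconds per row implied by the 4–6 hour estimate quoted above.
low_s, high_s = (hours * 3600 / forward_passes for hours in (4, 6))

print(f"effective batch: {effective_batch}")
print(f"optimiser steps: {optimizer_steps}")
print(f"implied time per row: {low_s:.1f}-{high_s:.1f} s")
```

For 1,000 rows this gives 250 optimiser steps, so logging_steps=10 produces 25 log lines over the run. If the first few steps run much slower than the implied per-row time, the session will likely not finish within the estimate.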

Image: Hugging Face TRL — SFTTrainer documentation, used for editorial coverage of the trainer class invoked in this step.

Step 8: Export the trained adapter

After training completes, save the adapter and tokeniser to disk:

model.save_pretrained("./lora-adapter-final")
tokenizer.save_pretrained("./lora-adapter-final")

The ./lora-adapter-final/ directory now contains roughly 50MB of files [8]: adapter_model.safetensors, adapter_config.json, and the tokeniser configs. That is a roughly 280-fold reduction from the 14GB base model, which is the entire point of LoRA.

Push to Hugging Face Hub for portability:

model.push_to_hub("your-username/qwen2.5-7b-lora-myproject", private=True)
tokenizer.push_to_hub("your-username/qwen2.5-7b-lora-myproject", private=True)

private=True keeps the repository visible only to its owner; flip to False only when ready to share. A public adapter can be loaded by anyone who can also load the base model, which is the whole portability story.

Step 9: Inference with the trained adapter

Loading for inference reuses the same 4-bit base model plus the trained adapter on top:

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
 MODEL_NAME,
 quantization_config=bnb_config,
 device_map="auto",
 trust_remote_code=True,
)

model = PeftModel.from_pretrained(base_model, "./lora-adapter-final")
model.eval()

prompt = (
 "### Instruction:\nSummarise this paragraph in one sentence.\n\n"
 "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
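The decoded string includes the prompt as well as the model's continuation. A small post-processing step can return only the answer; the sketch below assumes the "### Response:" marker from the Step 4 prompt format, and extract_response is a hypothetical helper.

```python
RESPONSE_MARKER = "### Response:\n"

def extract_response(decoded, marker=RESPONSE_MARKER):
    """Return the text after the first response marker, or the whole text if absent."""
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()

example = "### Instruction:\nSay hello.\n\n### Response:\nHello there. "
print(extract_response(example))
```

In the notebook, the same function can wrap the tokenizer.decode(...) call above.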

The same adapter file runs on any environment that can load the base model: a local GPU, AWS, a regional GPU cloud (E2E Networks in India, RunPod or Lambda Labs in the US, OVH in EU), or another Colab session.

Common errors and how to recover

Out-of-memory during training. Reduce max_seq_length from 1024 to 512 first; this halves the activation memory. If the OOM persists, drop target_modules to just ["q_proj", "v_proj"] (cuts trainable parameters roughly in half) and consider lowering rank to 8.

Kernel crashes mid-step. Almost always a bitsandbytes-CUDA mismatch. Run !pip install -q -U bitsandbytes==0.44.1 again, restart the runtime, and re-run from the model-load cell. Colab occasionally provisions a GPU node with a different CUDA toolkit; the reinstall realigns.

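Colab disconnects mid-training. Expected on free tier. After reconnecting, re-run the setup cells up to and including the SFTTrainer construction, confirm that the output_dir from Step 7 contains a checkpoint-* directory, then resume: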

trainer.train(resume_from_checkpoint=True)
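Before resuming, it can help to confirm that a checkpoint actually exists and see which step it corresponds to. The sketch below uses only the standard library and assumes the checkpoint-<step> directory naming the trainer writes into output_dir; latest_checkpoint is a hypothetical helper.

```python
import os
import re

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers numerically so checkpoint-200 beats checkpoint-50.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

If this returns None, there is nothing to resume from and the run has to start over. Otherwise the returned path can also be passed explicitly as resume_from_checkpoint.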

Tokeniser warns about pad token. The line tokenizer.pad_token = tokenizer.eos_token in Step 5 handles this. If the warning persists for a different model, call tokenizer.add_special_tokens({"pad_token": "[PAD]"}) and then resize the model’s token embeddings with model.resize_token_embeddings(len(tokenizer)).

Loss does not decrease. Three likely causes, in order of probability: dataset format does not match the chat template (re-check Step 4), learning rate is too low (try 5e-4), or the dataset is too small (1,000 rows is a floor; 5,000+ rows trains more reliably).

Where to go next

Two natural follow-ups once a free-tier run has completed.

First, scale the dataset. Fine-tuning quality drops off quickly below 1,000 rows and improves measurably up to about 10,000 rows for most instruction-tuning tasks. The same notebook handles the larger dataset; expect the wall-clock time to grow proportionally.

Second, push beyond the T4 ceiling. The recent RoundPipe research extends what consumer-GPU clusters can train end-to-end without paying for an A100, which makes scaled-out distributed training accessible without enterprise budgets. The pipeline-parallel patterns there compose cleanly with the LoRA workflow above when the dataset or model size outgrows a single T4.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. Hugging Face PEFT documentation (LoRA implementation reference for the 4–6 hour T4 training-time figure on a 1,000-row dataset at the configuration in Step 7)
  2. Qwen 2.5-7B model card on Hugging Face (Apache 2.0 licence, approximately 14GB fp16 footprint, 7B parameter count)
  3. Llama 3.1-8B-Instruct model card on Hugging Face (gated download requiring Meta licence approval click-through; 8B is the closest real Llama variant to a "7B-class" target since Llama 3.2 ships only 1B and 3B text plus 11B and 90B vision)
  4. Google Colab Pro signup page (approximately $9.99/month USD globally, ₹999/month on India geo-render; figure varies by region; verify on the day of purchase since Google adjusts regional pricing periodically)
  5. Google Colaboratory landing page (free-tier T4 GPU with 16GB VRAM; approximately 12-hour idle session ceiling subject to dynamic usage caps)
  6. bitsandbytes GitHub repository (NF4 4-bit quantisation reduces 7B fp16 footprint from roughly 14GB to roughly 4GB)
  7. Hugging Face PEFT repository (LoRA at rank 16 on q/k/v/o projection layers reduces trainable parameter count from 7 billion to roughly 20 million for a 7B base)
  8. Hugging Face PEFT documentation (trained LoRA adapter at rank 16 on attention layers exports to roughly 50MB of safetensors)
