Fine-Tune Stable Diffusion XL with Dreambooth + LoRA: End-to-End Project

Train an SDXL LoRA on a custom subject in 60-90 minutes on a 16GB GPU. Covers dataset preparation, training with diffusers (or kohya_ss), inference, and a 5-prompt evaluation.

20 May 2026 · Updated 20 May 2026 · ~14 min read

Image: Hugging Face diffusers documentation showing the SDXL DreamBooth LoRA training script, used for editorial coverage of the training script this tutorial walks through.

What this tutorial builds

This tutorial fine-tunes Stable Diffusion XL base 1.0 on a custom subject — a person, pet, product, or art style — using DreamBooth combined with Low-Rank Adaptation (LoRA). The training run takes 60-90 minutes on a 16GB consumer GPU¹, produces a ~200MB adapter file, and lets the resulting model render the trained subject in any scene you can describe in a text prompt.

The Hugging Face diffusers documentation, the kohya-ss/sd-scripts repository, and the DreamBooth paper together support two viable training paths: the diffusers train_dreambooth_lora_sdxl.py reference script (simpler, fewer knobs, official Hugging Face support) and the kohya_ss training UI (more knobs, broader community LoRA ecosystem). This walkthrough uses the diffusers script as the primary path because the diffusers maintainers treat it as the canonical reference for SDXL DreamBooth + LoRA; kohya_ss is covered in Step 5 as the alternative for readers who want a GUI.

The four ingredients are 20-30 reference images of one consistent subject, a 16GB+ VRAM GPU (RTX 3090, RTX 4070 Ti Super, RTX 4080, or rented A10G / L4 / A100 on a cloud), roughly 12GB of disk for the SDXL base weights, and a Hugging Face account with an access token.

Prerequisites

Python 3.10 or 3.11. Newer Python releases sometimes lag behind PyTorch wheels; 3.10 is the safest target for the current diffusers + accelerate + bitsandbytes combination.
CUDA 12.1+ with a 16GB+ VRAM GPU. SDXL DreamBooth + LoRA needs roughly 12-14GB at batch size 1 with 8-bit Adam and gradient checkpointing per the diffusers training documentation². Cards below 16GB (RTX 3060 12GB, RTX 4060 Ti 8GB) cannot fit the run without aggressive offloading, which slows training to a crawl.
A Hugging Face account with a User Access Token. Read scope is enough to download the SDXL base weights.
Working comfort with the terminal. You will run shell commands, edit a config file, and read Python tracebacks.
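
As a quick sanity check before installing anything, the sketch below tests whether an interpreter version falls in the 3.10-3.11 range recommended above. The function name is_supported_python is hypothetical, and the range only encodes this tutorial's recommendation; it is not a limit enforced by diffusers.

```python
import sys

# Versions recommended by this tutorial (an assumption, not a diffusers requirement).
SUPPORTED_MINORS = {(3, 10), (3, 11)}


def is_supported_python(version=None):
    """Return True if the (major, minor) pair is one of the recommended versions."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_MINORS


# Report on the interpreter that is running this snippet.
status = "OK" if is_supported_python() else "outside the recommended range"
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```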

If you are on a Mac (Apple Silicon) or an 8GB-12GB GPU, this tutorial will not run end-to-end on your machine; rent an A10G from Lambda Labs, RunPod, or Vast.ai for the training run, then download the adapter and run inference locally on the smaller GPU.

Step 1: Prepare the reference dataset

DreamBooth works from a small set of images of one consistent subject. This tutorial uses 20-30³, although the original DreamBooth paper reports results from as few as 3-5. The DreamBooth paper’s dataset guidance and the Civitai community training guide agree that quality beats quantity: twenty sharp, well-lit, varied images outperform fifty noisy phone snaps.

Curation rules that the cited sources agree on:

One subject only. No friends, no other dogs, no second products in frame. The model will learn whatever it sees consistently.
Vary the pose, angle, and lighting. Close-ups, mid shots, full-body or full-product shots. Indoor and outdoor. Different times of day. The model needs to learn the subject across contexts, not memorise one photo.
Avoid heavy filters, watermarks, and text overlays. The model will learn those too.
Crop to a square or near-square aspect ratio. SDXL trains at 1024x1024 by default. Off-aspect images get cropped or padded by the data loader, which wastes signal.
Use the same subject identity throughout. If training a person, do not mix childhood and adult photos. If training a product, do not mix the v1 and v2 of the same SKU.
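
The square-crop rule above comes down to a small piece of arithmetic. The sketch below computes the largest centred square for a given image size; the helper name center_square_box is hypothetical. The returned tuple uses the (left, top, right, bottom) order that Pillow's Image.crop accepts, but Pillow itself is not needed to run it.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) for the largest centred square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


# A 4032x3024 landscape photo loses an equal strip from the left and right edges.
print(center_square_box(4032, 3024))
```

A centred crop is only a starting point: if the subject sits off-centre in a photo, crop by hand instead.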

Image: kohya-ss/sd-scripts GitHub repository README showing the SDXL DreamBooth LoRA training scripts and configuration options, used for editorial coverage of the alternative training path mentioned in this tutorial.

Place the curated images in a directory layout that the diffusers script expects:

training_data/
    instance_images/
        img_01.jpg
        img_02.jpg
        ...
        img_25.jpg
    class_images/        # optional, used for prior preservation

The instance_images folder holds the 20-30 reference photos of the subject. The optional class_images folder holds 100-200 generic images of the broader class (e.g., generic dogs if the subject is a specific dog), which DreamBooth’s prior-preservation loss uses to stop the model from forgetting the class. Skip class images on a first run; if the trained adapter overfits, add them for a second training cycle and enable prior preservation with the script’s --with_prior_preservation, --class_data_dir, and --class_prompt flags.
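
Before training, it can help to confirm that the folder really contains the expected number of images. The following is a minimal sketch using only the standard library; the function names and the set of accepted file extensions are assumptions made for illustration, not something the diffusers script defines.

```python
from pathlib import Path

# Assumed set of image formats; adjust to whatever your photos use.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def list_instance_images(root="training_data"):
    """Return the sorted image files found in <root>/instance_images."""
    folder = Path(root) / "instance_images"
    if not folder.is_dir():
        raise FileNotFoundError(f"missing folder: {folder}")
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def check_dataset(root="training_data", low=20, high=30):
    """Return (count, in_range) against the 20-30 image window used in this tutorial."""
    count = len(list_instance_images(root))
    return count, low <= count <= high
```

Calling check_dataset("training_data") from the project root returns the image count and whether it falls inside the recommended window. Files with other extensions are ignored.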

Step 2: Install the training stack

Create a fresh virtual environment and install the dependencies:

python3.10 -m venv sdxl-lora-env
source sdxl-lora-env/bin/activate    # Windows: sdxl-lora-env\Scripts\activate

pip install --upgrade pip
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate peft
pip install bitsandbytes
pip install datasets safetensors xformers

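Clone the diffusers repository for the training examples (the reference script is not packaged in the pip wheel) and install the library from source, because example scripts on the main branch check for a matching development version of diffusers:

git clone https://github.com/huggingface/diffusers.git
cd diffusers
pip install -e .    # source install so the library version matches the example script
cd examples/dreambooth
pip install -r requirements_sdxl.txt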

Initialise accelerate for single-GPU training:

accelerate config default

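The default config targets single-GPU training without DeepSpeed, which matches the starting point in the diffusers SDXL DreamBooth LoRA documentation⁴. Mixed precision is set explicitly by --mixed_precision="fp16" in the Step 4 launch command, so the run does not depend on the precision value stored in the accelerate config.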

Then authenticate with the Hugging Face Hub so the script can download the SDXL base weights:

huggingface-cli login

Paste the read token from huggingface.co/settings/tokens.

Step 3: Choose the trigger token and prompt template

DreamBooth uses a rare token as the “name” for the trained subject. The diffusers tutorial recommends sks or ohwx as the instance token because they have minimal prior meaning in the SDXL tokenizer⁵. Pair the token with the class noun (dog, person, mug, sneaker) so the model anchors the subject inside its existing class prior.

Decide three strings before kicking off training:

Instance prompt: the prompt that describes the training images. Example: "a photo of sks dog" for a specific dog, "a photo of ohwx man" for a specific person, "a photo of sks mug" for a product.
Class prompt: the generic version used for prior preservation. Example: "a photo of a dog".
Validation prompt: what the script renders every --validation_epochs epochs (25 in the Step 4 command) so you can eyeball progress. Example: "a photo of sks dog wearing a red scarf".

Pick tokens you will remember; the trained adapter is useless without them.
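
One way to keep the three strings consistent is to derive them from the token and class noun in one place. The helper below is a hypothetical sketch (build_prompts is not part of diffusers); it reproduces the example prompts from this step.

```python
def build_prompts(token, class_noun, scene="wearing a red scarf"):
    """Derive the instance, class, and validation prompts from one token/class pair."""
    instance = f"a photo of {token} {class_noun}"
    return {
        "instance_prompt": instance,
        "class_prompt": f"a photo of a {class_noun}",
        "validation_prompt": f"{instance} {scene}",
    }


for name, text in build_prompts("sks", "dog").items():
    print(f"{name}: {text}")
```

Copy the printed strings into the launch command and the inference script so a typo in the token cannot creep in between training and inference.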

Step 4: Run the training script

The full launch command for diffusers train_dreambooth_lora_sdxl.py:

accelerate launch train_dreambooth_lora_sdxl.py \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
    --instance_data_dir="./training_data/instance_images" \
    --output_dir="./sdxl-lora-output" \
    --instance_prompt="a photo of sks dog" \
    --resolution=1024 \
    --train_batch_size=1 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --use_8bit_adam \
    --learning_rate=1e-4 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --max_train_steps=1000 \
    --mixed_precision="fp16" \
    --rank=16 \
    --validation_prompt="a photo of sks dog in a snowy field" \
    --validation_epochs=25 \
    --seed=42

What the key flags do:

--pretrained_vae_model_name_or_path points at the fp16-fixed VAE; the default SDXL VAE has known numerical issues in fp16 per the diffusers documentation, and the fix-VAE swap is the recommended workaround⁶.
--rank=16 sets the LoRA rank. Higher ranks (32, 64) capture more subject detail at the cost of a bigger adapter file and slower training. Rank 16 is a common starting point for subject LoRAs; the script’s built-in default is lower (4), so set the flag explicitly.
--use_8bit_adam swaps the standard Adam optimiser for the bitsandbytes 8-bit version⁷, which stores optimiser state in 8 bits instead of 32 and so shrinks that state to roughly a quarter of its usual size. On a 16GB card that saving helps keep the run within memory.
--gradient_checkpointing trades compute for VRAM by recomputing activations on the backward pass. It adds roughly 20-30% to training time but is mandatory under 24GB.
--max_train_steps=1000 is the diffusers tutorial’s starting point for a 20-30-image dataset. Under-training shows up as an unrecognisable subject; over-training shows up as every output looking like a literal copy of one training photo.
--mixed_precision="fp16" halves VRAM versus fp32 with negligible quality loss at this scale.
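
To put the 8-bit Adam flag in numbers, the sketch below estimates optimiser-state size from first principles: Adam keeps two running statistics per trainable parameter. The parameter count is a made-up figure for illustration, and the estimate ignores the small quantisation constants that an 8-bit optimiser also stores.

```python
def adam_state_bytes(num_params, bits_per_value):
    """Adam stores two values (first and second moment) per trainable parameter."""
    return num_params * 2 * bits_per_value // 8


params = 50_000_000  # hypothetical trainable-parameter count, not a measured value
full = adam_state_bytes(params, 32)
eight_bit = adam_state_bytes(params, 8)

print(f"32-bit Adam state: {full / 2**20:.0f} MiB")
print(f"8-bit Adam state:  {eight_bit / 2**20:.0f} MiB")
```

Only the optimiser state shrinks this way; weights, gradients, and activations are governed by the other flags in the list.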

Kick off the run. A 1000-step training pass on an RTX 3090 takes roughly 60-90 minutes per the diffusers training documentation’s benchmark guidance⁸; an A10G is in the same band. The script writes intermediate validation images into ./sdxl-lora-output/ every 25 epochs so you can watch the subject emerge.
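
The relationship between steps and epochs is easy to lose track of once gradient accumulation is involved. The sketch below works it out for the command above, assuming the usual accounting in which one optimiser step consumes train_batch_size x gradient_accumulation_steps images and no prior-preservation images are added. The function name training_schedule is hypothetical.

```python
import math


def training_schedule(num_images, batch_size, grad_accum, max_steps):
    """Translate a step budget into epochs for a small DreamBooth dataset."""
    batches_per_epoch = math.ceil(num_images / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return {
        "effective_batch_size": batch_size * grad_accum,
        "optimizer_steps_per_epoch": steps_per_epoch,
        "epochs": math.ceil(max_steps / steps_per_epoch),
    }


# 25 images with the flags from the launch command above.
print(training_schedule(num_images=25, batch_size=1, grad_accum=4, max_steps=1000))
```

With 25 images the 1000-step budget spans well over a hundred epochs, so --validation_epochs=25 produces only a handful of validation renders across the run.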

Image: Hugging Face SDXL base 1.0 model card showing the model specs and licence terms, used for editorial coverage of the base model this tutorial fine-tunes.

The script saves the final LoRA adapter as pytorch_lora_weights.safetensors (~180-220MB at rank 16) in the output directory. Back it up; it is the artefact the rest of this tutorial loads.

Step 5: Alternative path — kohya_ss training UI

The kohya-ss/sd-scripts repository is the community-favoured alternative, particularly for readers who prefer a GUI over command-line flags. The kohya_ss wrapper at bmaltais/kohya_ss packages the same scripts behind a Gradio interface.

Per the kohya-ss/sd-scripts README, the SDXL DreamBooth LoRA path lives in the sdxl_train_network.py script with equivalent flags to the diffusers reference script. The Civitai community guide documents the GUI workflow step-by-step. Both paths produce a .safetensors LoRA file that loads into the same diffusers inference code.

When to pick kohya_ss over diffusers:

You want a GUI rather than a launch command.
You want sample-image grids generated mid-training with multiple prompts per checkpoint.
You plan to share the adapter on Civitai, where kohya_ss naming conventions are the de facto standard.

When to stick with diffusers:

You want the canonical Hugging Face reference path with first-party support.
You plan to publish the adapter on the Hugging Face Hub.
You want fewer moving parts to debug.

Step 6: Run inference with the trained LoRA

Once training finishes, load the adapter into a standard SDXL pipeline:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

pipe.load_lora_weights(
    "./sdxl-lora-output",
    weight_name="pytorch_lora_weights.safetensors",
)

prompt = "a photo of sks dog wearing a red scarf, sitting on a park bench, autumn leaves"

image = pipe(
    prompt=prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    cross_attention_kwargs={"scale": 0.8},
).images[0]

image.save("output.png")

The cross_attention_kwargs={"scale": 0.8} argument controls LoRA strength at inference time. A scale of 1.0 applies the full adapter, 0.0 disables it, and values in between blend the adapter with the base model. The diffusers documentation flags 0.7-0.9 as the typical range for DreamBooth subjects; at 1.0 the subject often looks overcooked and the model tends to ignore scene-context tokens⁹.
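
Conceptually, the scale multiplies the adapter's low-rank update before it is added to the base weight. The toy sketch below shows that blend on a 2x2 matrix with a rank-1 update, in plain Python. It is an illustration of the idea only: real implementations apply it per attention layer and usually include an additional alpha/rank factor, and the names apply_lora, lora_up, and lora_down are chosen here for readability.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_up, lora_down, scale):
    """Return weight + scale * (lora_up @ lora_down)."""
    delta = matmul(lora_up, lora_down)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight
lora_up = [[1.0], [2.0]]           # 2x1 factor (rank 1)
lora_down = [[0.5, 0.5]]           # 1x2 factor

for scale in (0.0, 0.8, 1.0):
    print(scale, apply_lora(weight, lora_up, lora_down, scale))
```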

Image: Civitai community LoRA training guide showing the GUI workflow and sample-image grids from the kohya_ss path, used for editorial coverage of the community resources referenced in this tutorial.

VRAM during inference is roughly 8-10GB at fp16 with the standard SDXL pipeline, well below training requirements. A trained LoRA runs comfortably on a 12GB card even though training needed 16GB.

Step 7: Evaluate on 5 sample prompts

The diffusers DreamBooth tutorial and the DreamBooth paper agree on the basic evaluation rubric: the trained subject should appear recognisably in scenes that were never in the training data, while still responding to scene-context tokens. Run the same 5 prompts before and after training to see the delta.

Prompt class	Example prompt (replace `sks dog` with your token + class)	What to check
Same context as training	`"a photo of sks dog"`	Subject identity is sharp; no melted features.
Novel scene	`"a photo of sks dog on a beach at sunset"`	Subject identity holds; scene tokens render.
Stylised	`"oil painting of sks dog in renaissance style"`	Subject identity survives style transfer.
With companion	`"a photo of sks dog next to a red bicycle"`	Subject and companion both render; no fusion.
Action / pose	`"sks dog jumping over a fence in a meadow"`	Subject identity holds in motion.
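
The five prompts in the table can be generated from the token and class noun so that the before and after runs use identical text. This is a small sketch; the dictionary keys and the build_eval_prompts helper are hypothetical names, and the templates are copied from the table.

```python
# Templates taken from the evaluation table; {subject} is "<token> <class noun>".
EVAL_TEMPLATES = {
    "same_context": "a photo of {subject}",
    "novel_scene": "a photo of {subject} on a beach at sunset",
    "stylised": "oil painting of {subject} in renaissance style",
    "with_companion": "a photo of {subject} next to a red bicycle",
    "action_pose": "{subject} jumping over a fence in a meadow",
}


def build_eval_prompts(token, class_noun):
    """Fill every template with the chosen token and class noun."""
    subject = f"{token} {class_noun}"
    return {name: tpl.format(subject=subject) for name, tpl in EVAL_TEMPLATES.items()}


for name, prompt in build_eval_prompts("sks", "dog").items():
    print(f"{name}: {prompt}")
```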

Per the DreamBooth paper’s evaluation discussion, failure modes to watch for:

Underfit: the subject does not appear; outputs look like the generic class. Train for more steps or raise the learning rate.
Overfit: every output looks like one specific training image regardless of the prompt. Train for fewer steps, lower the learning rate, or add prior preservation with class images.
Class drift: the model forgets that “dog” can be anything but sks dog. Add prior-preservation class images and re-train.
Mode collapse: the model renders the subject in one pose only. Vary the training data more.

Score each of the 5 prompts on a 1-5 scale for subject identity and prompt adherence. An average of 3.5-4.5 is a typical first-cycle result per the Civitai guide’s community calibration; getting to 4.5+ usually takes a second training pass with adjusted hyperparameters.
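
Tallying the scores takes only a few lines. The sketch below averages the two criteria and applies the 3.5 threshold on subject identity used in this tutorial; the summarise_scores helper and the example ratings are hypothetical.

```python
from statistics import mean


def summarise_scores(scores, threshold=3.5):
    """scores maps prompt name -> (identity, adherence), each rated 1-5."""
    for name, pair in scores.items():
        if not all(1 <= s <= 5 for s in pair):
            raise ValueError(f"scores for {name!r} must be between 1 and 5")
    identity = mean(pair[0] for pair in scores.values())
    adherence = mean(pair[1] for pair in scores.values())
    return {
        "identity_avg": identity,
        "adherence_avg": adherence,
        "passes": identity >= threshold,
    }


# Made-up ratings for illustration only.
example = {
    "same_context": (4, 3),
    "novel_scene": (3, 4),
    "stylised": (4, 4),
    "with_companion": (3, 3),
    "action_pose": (4, 5),
}
print(summarise_scores(example))
```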

Image: Hugging Face LoRA blog post describing the low-rank adaptation pattern, used for editorial coverage of the parameter-efficient method this tutorial applies to SDXL.

Step 8: Troubleshooting common failure modes

The diffusers documentation, the kohya-ss/sd-scripts repository issues, and the Civitai community guide converge on the same recurring problems:

CUDA out-of-memory at step 1. Drop --train_batch_size to 1 if not already, raise --gradient_accumulation_steps to 4 or 8, enable --gradient_checkpointing, and ensure --use_8bit_adam is set. If still failing, training at --resolution=768 instead of 1024 cuts VRAM by roughly 40% at the cost of less-sharp 1024 outputs.
The subject never appears in validation images. The instance prompt may be wrong (typo in sks token), the learning rate may be too low, or the dataset images may be too varied to find one consistent subject. Drop to 15-20 images with stricter curation and retry.
Validation images are pure noise. The fp16 VAE is unstable; ensure --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" is set. The default SDXL VAE is the culprit per the diffusers documentation.
Training crashes with bitsandbytes errors on Windows. Bitsandbytes Windows wheels lag the Linux release. The Civitai guide recommends WSL2 with Ubuntu 22.04 as the workaround. Alternatively, remove --use_8bit_adam from the launch command (it is an on/off switch, so there is no =False form) and fall back to standard AdamW, at the cost of needing a 24GB card.
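
The VRAM saving from a lower resolution, mentioned in the out-of-memory item above, can be sanity-checked with pixel counts. The sketch assumes activation memory scales roughly with the number of pixels; model weights and optimiser state do not shrink, so the overall saving on a real run will differ from this figure. The function names are hypothetical.

```python
def pixel_ratio(new_res, base_res=1024):
    """Fraction of pixels left when training at new_res instead of base_res."""
    return (new_res / base_res) ** 2


def activation_reduction(new_res, base_res=1024):
    """Approximate fractional drop in activation memory (assumes linear scaling with pixels)."""
    return 1 - pixel_ratio(new_res, base_res)


print(f"768 vs 1024: {pixel_ratio(768):.2%} of the pixels, "
      f"about {activation_reduction(768):.0%} fewer")
```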

Step 9: Where to go next

The trained LoRA loads into any SDXL-compatible inference UI:

ComfyUI — the most flexible node-based interface; load the .safetensors adapter via the “Load LoRA” node.
Automatic1111 / Forge WebUI — drop the file into the models/Lora/ directory and reference it with <lora:filename:0.8> in the prompt.
Hugging Face diffusers in production — wrap the inference snippet above in a FastAPI or Modal endpoint.

Push the adapter to the Hugging Face Hub for sharing:

huggingface-cli upload <your-username>/sdxl-sks-dog-lora ./sdxl-lora-output

Or upload to Civitai if the trained subject fits its model categories (people, characters, styles, concepts).

Per the LoRA paper’s framing, the adapter you trained captures the subject in low-rank weight deltas applied to the base model’s attention layers — the base SDXL weights themselves are untouched. That separation is why a single 16GB GPU can train the adapter while full SDXL fine-tuning needs a multi-GPU rig.
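
The size of those low-rank deltas follows directly from the LoRA construction: a d_out x d_in weight gets two factors of shape d_out x r and r x d_in. The sketch below compares the trainable parameters with the full matrix; the layer width of 1280 is an illustrative assumption, not a figure read from the SDXL configuration.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors: (rank x d_in) plus (d_out x rank)."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


d = 1280  # hypothetical projection width, for illustration
for rank in (4, 16, 64):
    share = lora_params(d, d, rank) / full_params(d, d)
    print(f"rank {rank}: {lora_params(d, d, rank):,} trainable vs "
          f"{full_params(d, d):,} frozen ({share:.2%})")
```

The trainable share grows linearly with rank, which is why higher ranks produce proportionally larger adapter files.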

What ships, what to verify, what to check

This tutorial covers the SDXL DreamBooth + LoRA training path that the diffusers maintainers, the kohya-ss repository, and the DreamBooth + LoRA papers collectively support. The diffusers train_dreambooth_lora_sdxl.py script is the canonical reference; kohya_ss is the GUI alternative; both produce interchangeable .safetensors adapters.

Verify before kicking off your first run:

The GPU has 16GB+ VRAM and CUDA 12.1+ is installed.
The Hugging Face token has read scope and is configured via huggingface-cli login.
The dataset has 20-30 curated images of one consistent subject, cropped near-square.
The instance prompt token (sks or ohwx) is consistent across the training command and inference.

Check after the first run finishes:

The output directory contains pytorch_lora_weights.safetensors of roughly 180-220MB.
Validation images saved during training show the subject emerging by epoch 25 onward.
The 5-prompt evaluation rubric scores 3.5+ average on subject identity. Below that, retrain with adjusted hyperparameters per the troubleshooting section.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Hugging Face diffusers — DreamBooth LoRA training guide, hardware section (accessed 2026-05-20)
2. Hugging Face diffusers — DreamBooth memory-optimisation guidance (accessed 2026-05-20)
3. DreamBooth (Ruiz et al., 2022) — dataset size guidance in Section 3 (accessed 2026-05-20)
4. diffusers train_dreambooth_lora_sdxl.py — default accelerate config (accessed 2026-05-20)
5. diffusers DreamBooth guide — rare-token recommendation (accessed 2026-05-20)
6. madebyollin/sdxl-vae-fp16-fix model card — fp16 stability workaround (accessed 2026-05-20)
7. bitsandbytes — 8-bit Adam optimiser implementation (accessed 2026-05-20)
8. diffusers DreamBooth guide — training time on consumer GPUs (accessed 2026-05-20)
9. diffusers DreamBooth guide — LoRA scale at inference (accessed 2026-05-20)