What Is Quantization? Running LLMs on Commodity Hardware (GGUF, AWQ, GPTQ, INT4/INT8)
A plain-English explainer on LLM weight quantization: what 4-bit and 8-bit actually mean, how GGUF, AWQ, and GPTQ differ, and which format fits which hardware.
The short answer
Quantization shrinks a large language model by storing each weight in fewer bits. A 7-billion-parameter model held in 16-bit half-precision weighs roughly 14 GB on disk; the same model at 4 bits per weight lands near 3.5 GB, small enough to run on a single consumer GPU or a recent laptop CPU. Taken together, the Hugging Face and llama.cpp documentation and the AWQ and GPTQ papers point to 4-bit weight-only quantization as the default for local inference today, with 8-bit as a safer fallback when accuracy matters more than memory.
Three names dominate the format landscape. GGUF is the file format llama.cpp uses; it bundles quantized weights with metadata and supports a wide range of bit widths from 2 to 8. 1 AWQ and GPTQ are two different algorithms for choosing how to round full-precision weights into a smaller integer grid; both target roughly 4-bit weights with low accuracy loss. 2 3 The right pick depends on hardware. A Mac mini or a CPU-only laptop runs GGUF through llama.cpp. An NVIDIA GPU runs AWQ or GPTQ through vLLM, Hugging Face Transformers, or AutoAWQ. 4
What quantization actually is
A model’s weights are numbers. Trained models usually store each weight as a 16-bit floating-point number, either FP16 (IEEE half-precision) or BF16 (the “brain float” variant favoured by Google’s TPUs). Quantization replaces that 16-bit number with a much smaller integer plus a scaling factor shared across a group of weights.
The memory saving follows directly from the bit width. For a model with $N$ parameters at $b$ bits per weight, the storage cost is:
$$\text{bytes} = \frac{N \cdot b}{8}$$
A 7B model at FP16 ($b = 16$) needs about $7 \times 10^9 \cdot 2 = 14 \times 10^9$ bytes, or 14 GB; the same model at 4 bits ($b = 4$) drops to roughly 3.5 GB before the small per-group scaling factors are added back in. The Hugging Face Transformers documentation states the practical version of this in one line: quantizing a model in 4-bit reduces memory usage by 4x relative to FP16, and quantizing in 8-bit halves it. 4
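The arithmetic above (parameters times bits per weight, divided by 8 to get bytes) is easy to check in a few lines. This is a minimal sketch: the function name is made up for illustration, and it ignores per-group scaling factors and file metadata, so real files come out slightly larger.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at three common precisions.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

The 16-bit and 4-bit rows reproduce the 14 GB and 3.5 GB figures quoted above.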
The catch is that integers cannot represent every floating-point value. Rounding introduces error, and the error compounds across billions of parameters. The whole research programme behind GPTQ, AWQ, and the GGUF “K-quants” is about choosing the rounding such that the model’s outputs stay close to the original.
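To make the rounding error concrete, here is a small sketch of the simplest scheme: symmetric round-to-nearest quantization of one group of weights with a single shared scale. It is a baseline for illustration only, not what GPTQ, AWQ, or the K-quants do, and the weight values and function names are invented for the example.

```python
def quantize_group(weights, bits=4):
    """Round one group of weights to signed integers that share one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [v * scale for v in q]


def roundtrip_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    q, scale = quantize_group(weights, bits)
    restored = dequantize_group(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Illustrative values only; real groups typically hold 32 to 128 weights.
weights = [0.03, -0.5, 0.8, 0.12, -0.27, 0.44, -0.61, 0.05]
for bits in (8, 4, 2):
    print(f"{bits}-bit: max abs error = {roundtrip_error(weights, bits):.4f}")
```

The error grows as the bit width shrinks, which is the gap the more careful rounding methods try to close.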
GGUF: the file format llama.cpp uses
GGUF is the binary file format used by llama.cpp and the broader ggml ecosystem. Per the Hugging Face Hub documentation, it was developed by Georgi Gerganov, the author of llama.cpp, and is designed for quick loading and saving of models on CPU and consumer hardware. 1 The format encodes both tensors and a standardised set of metadata in a single file, which is why a .gguf download from Hugging Face is self-contained.
GGUF supports many quantization variants. The most common are the K-quants (Q4_K, Q5_K, Q6_K, Q8_K) and their IQ (“importance-matrix”) cousins, which pack weights into super-blocks with shared scales. Per the Hugging Face GGUF reference, Q4_K lands at an effective 4.5 bits per weight; Q5_K at 5.5; Q6_K at 6.5625; Q2_K at 2.625. 1
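The fractional bits-per-weight figures come from the overhead of storing scales alongside the packed weights. The sketch below reproduces three of them by counting bits in one super-block. The layouts used (256 weights per super-block, 6-bit or 8-bit per-block scales, one or two FP16 super-block fields) are an assumption based on my reading of the GGUF type descriptions and should be checked against the reference before relying on them.

```python
def k_quant_bits_per_weight(weight_bits, n_blocks, block_size,
                            scale_bits, min_bits, n_fp16_fields):
    """Effective bits per weight for one super-block under the assumed layout."""
    n_weights = n_blocks * block_size
    total_bits = (
        n_weights * weight_bits               # packed quantized weights
        + n_blocks * (scale_bits + min_bits)  # per-block scale and optional minimum
        + n_fp16_fields * 16                  # FP16 super-block scale(s)
    )
    return total_bits / n_weights


# Assumed layouts; not copied from the ggml source.
LAYOUTS = {
    "Q4_K": dict(weight_bits=4, n_blocks=8, block_size=32,
                 scale_bits=6, min_bits=6, n_fp16_fields=2),
    "Q5_K": dict(weight_bits=5, n_blocks=8, block_size=32,
                 scale_bits=6, min_bits=6, n_fp16_fields=2),
    "Q6_K": dict(weight_bits=6, n_blocks=16, block_size=16,
                 scale_bits=8, min_bits=0, n_fp16_fields=1),
}

for name, layout in LAYOUTS.items():
    bpw = k_quant_bits_per_weight(**layout)
    size_gb = 7e9 * bpw / 8 / 1e9
    print(f"{name}: {bpw} bits/weight, about {size_gb:.2f} GB for 7B weights")
```

Under these assumptions the results match the 4.5, 5.5, and 6.5625 figures quoted above. Real GGUF files mix tensor types, so actual download sizes will differ somewhat.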
The naming convention Q4_K_M versus Q4_K_S is community shorthand layered on top of the GGUF type: _S is “small”, _M is “medium”, _L is “large”, encoding how aggressively the quantization is applied across different layers of the model. The Q4_K_M variant is the de-facto default for local inference and is what most Hugging Face model-card READMEs recommend when no preference is stated.
AWQ: protect the salient weights
AWQ, Activation-aware Weight Quantization, is the algorithm proposed in a 2023 paper by Ji Lin and collaborators at MIT, NVIDIA, and Tsinghua, later published at MLSys 2024. The paper’s abstract states the method “protects only 1% salient weights” identified via activation statistics, then applies per-channel scaling to reduce quantization error on the remaining weights. 2 The reported headline result: more than 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs, with 4-bit weights. 2
The intuition is that not all weights matter equally. A small fraction of a layer’s weights account for most of its output range; quantizing those aggressively destroys accuracy. AWQ finds those salient channels by looking at activation magnitudes and applies a per-channel scale that gives them extra precision without breaking the integer storage format. The paper won the MLSys 2024 Best Paper award. 2
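A toy example can show why per-channel scaling helps. Multiplying a weight by a factor and dividing its matching activation by the same factor leaves the full-precision output unchanged, but it moves a small, important weight onto a finer part of the integer grid. The sketch below uses invented numbers and a hand-picked scale; it illustrates the scaling idea only and does not implement the search procedure from the AWQ paper.

```python
def quantize_dequantize(weights, bits=4):
    """Symmetric round-to-nearest with one shared scale, then back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Channel 0 has a tiny weight but a large activation, so it is "salient".
weights = [0.03, -0.5, 0.8, 0.12]
activations = [10.0, 0.1, 0.1, 0.1]
channel_scales = [4.0, 1.0, 1.0, 1.0]  # hand-picked for this illustration

reference = dot(weights, activations)

# Plain quantization: the salient weight rounds to zero.
plain_error = abs(dot(quantize_dequantize(weights), activations) - reference)

# Scale the weight up and the activation down before quantizing.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [x / s for x, s in zip(activations, channel_scales)]
scaled_error = abs(
    dot(quantize_dequantize(scaled_weights), scaled_activations) - reference
)

print(f"output error without scaling: {plain_error:.4f}")
print(f"output error with scaling:    {scaled_error:.4f}")
```

In this example the group maximum is unchanged by the scaling, so the other weights quantize exactly as before while the salient one gains resolution.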
GPTQ: layer-by-layer second-order rounding
GPTQ, the method published by Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh in 2022, takes a different route. Per the paper’s abstract, GPTQ uses approximate second-order information (the curvature of each layer’s reconstruction error, rather than of the full training loss) to quantize one layer at a time, choosing the rounding that minimises the layer’s output error against a small calibration set. 3 The reported results: 175-billion-parameter GPT models quantized to 3 or 4 bits per weight in approximately four GPU hours, with negligible accuracy degradation; inference speedups of around 3.25x on NVIDIA A100s and 4.5x on NVIDIA A6000s; reasonable accuracy in the 2-bit and even ternary regime. 3
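The per-layer goal described above can be written compactly. The notation here is chosen for this sketch: $W$ is a layer’s full-precision weight matrix, $X$ holds the layer inputs collected from the calibration set, and $\widehat{W}$ is constrained to the quantization grid.

$$\widehat{W}^{*} = \arg\min_{\widehat{W}} \; \lVert W X - \widehat{W} X \rVert_2^2$$

Because this objective is quadratic in $\widehat{W}$, its second-order structure is determined by the calibration inputs alone (a Hessian proportional to $X X^{\top}$), which is what makes the second-order information cheap enough to approximate one layer at a time.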
GPTQ predates AWQ by several months and was the first method to make 4-bit inference on a 175B model practical on a single GPU. The two target the same bit width with different accuracy and throughput tradeoffs, and most modern inference stacks (vLLM, Hugging Face Transformers, AutoAWQ, AutoGPTQ) support both. 5
bitsandbytes, NF4, and QLoRA
The other name a builder encounters is bitsandbytes, the library Hugging Face Transformers uses for on-the-fly 8-bit and 4-bit quantization. Per the Transformers documentation, bitsandbytes offers two flagship features: LLM.int8(), an 8-bit method that preserves higher precision for outlier activations rather than quantizing the entire forward pass uniformly; and the NF4 / FP4 data types used by QLoRA for memory-efficient fine-tuning. 6
NF4, “Normal Float 4,” is a 4-bit data type from the QLoRA paper that is designed for weights drawn from a normal distribution, which is what transformer weights look like after training. 6 7 The Hugging Face documentation flags that NF4 is the recommended 4-bit type when the goal is parameter-efficient fine-tuning rather than pure inference; for inference, the choice between NF4 and FP4 has little impact on output quality. 6
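In Transformers, the NF4 choice is expressed through a `BitsAndBytesConfig` passed at load time. The following is a minimal sketch, not a tested recipe: the helper names are hypothetical, the model identifier is left to the caller, and the loader assumes `torch`, `transformers`, `bitsandbytes`, and `accelerate` are installed on a supported device.

```python
def nf4_config_kwargs():
    """Keyword arguments selecting 4-bit NF4 for transformers.BitsAndBytesConfig."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",  # "fp4" is the alternative 4-bit type
    }


def load_nf4_model(model_id):
    """Load a causal LM with NF4 weights; model_id is supplied by the caller."""
    # Imported lazily so the helper above works without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls (assumed choice)
        **nf4_config_kwargs(),
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # relies on the accelerate package
    )
```

Switching `bnb_4bit_quant_type` to `"fp4"` selects the other 4-bit data type mentioned above.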
Which format goes with which hardware
The choice of format follows the hardware more than personal preference. The Hugging Face Transformers quantization overview lists per-backend hardware support: GGUF runs on CPU, CUDA, Metal (Apple Silicon), and Intel GPU; AWQ runs on CPU, CUDA, ROCm, and Intel GPU; bitsandbytes runs on CUDA, with newer Intel XPU, Intel Gaudi, and CPU paths in active development. 4 vLLM supports AWQ, GPTQ, GGUF, bitsandbytes, INT4 W4A16, INT8 W8A8, and FP8 W8A8 among others. 5
| Backend | CPU | NVIDIA GPU | Apple Silicon | Typical bit widths | Common use |
|---|---|---|---|---|---|
| GGUF / llama.cpp | Yes | Yes | Yes | 2 to 8 | Local inference on laptops and consumer GPUs |
| AWQ | Yes | Yes | No | 4 | Production inference on NVIDIA GPUs |
| GPTQ | Limited | Yes | Limited | 2 / 3 / 4 / 8 | Production inference on NVIDIA GPUs |
| bitsandbytes | Limited | Yes | No | 4 / 8 | On-the-fly quantization in Hugging Face Transformers, QLoRA fine-tuning |
For a Mac mini, GGUF through llama.cpp or LM Studio is the consensus default. For a single NVIDIA workstation card, AWQ or GPTQ through vLLM is the common production choice. For training-time memory savings during fine-tuning, bitsandbytes NF4 plus a LoRA adapter is the QLoRA recipe.
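The rules of thumb in this section can be captured as a small lookup. This is only a restatement of the paragraph above in code; the function name and the hardware and task labels are invented for illustration.

```python
def suggest_format(hardware: str, task: str = "inference") -> str:
    """Return the article's default suggestion for a hardware and task pair."""
    hardware = hardware.lower()
    if task == "fine-tuning":
        return "bitsandbytes NF4 + LoRA adapter (QLoRA)"
    if hardware in {"apple-silicon", "cpu"}:
        return "GGUF via llama.cpp"
    if hardware == "nvidia-gpu":
        return "AWQ or GPTQ via vLLM"
    raise ValueError(f"no default suggestion for hardware={hardware!r}")


print(suggest_format("apple-silicon"))
print(suggest_format("nvidia-gpu"))
print(suggest_format("nvidia-gpu", task="fine-tuning"))
```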
What quantization does not solve
Quantization shrinks weights. It does not shrink the KV cache, the per-token attention state that grows with sequence length, and on long contexts the KV cache can dominate memory regardless of weight precision. vLLM, llama.cpp, and most modern stacks now also support quantized KV cache, which compresses that runtime state separately. 5
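A rough estimate shows how quickly the KV cache grows. The sketch below uses the standard accounting of one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical 7B-class configuration chosen for illustration, not a figure from the sources.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size; the leading 2 counts keys and values."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Hypothetical 7B-class shape with an FP16 cache (2 bytes per value).
for seq_len in (4_096, 32_768):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len} tokens: {gib:.1f} GiB of FP16 KV cache")
```

Under these assumptions a 32,768-token context needs several times more memory for the cache than the 3.5 GB of 4-bit weights, which is why quantizing the cache is handled as a separate setting.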
Quantization also does not eliminate accuracy loss. The AWQ and GPTQ papers report that the loss is small at 4 bits for most benchmarks, but it is not zero, and the gap widens at 2 or 3 bits. Cited evaluations on the source backends (AutoAWQ’s GitHub README, the GPTQ paper’s tables, llama.cpp’s perplexity comparisons) are the right place to look before committing a production system to a specific quantization variant.
The aggregated source consensus is straightforward: start at 4 bits with AWQ, GPTQ, or Q4_K_M depending on hardware; move to 8 bits if accuracy regressions show up in evaluation; reserve 2-bit and 3-bit quantization for cases where the model is too large to fit otherwise, and budget the time to measure quality loss against a held-out evaluation set before shipping.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Hugging Face Hub: GGUF format documentation (Q4_K bits-per-weight tables; K-quant super-block structure; GGUF format origin)
- 2. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023; MLSys 2024 Best Paper). 1% salient-weight protection; 3x speedup over FP16
- 3. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022). 175B in ~4 GPU hours; 3.25x on A100, 4.5x on A6000
- 4. Hugging Face Transformers: Quantization overview (4x memory reduction at 4-bit; 2x at 8-bit; backend hardware-support matrix)
- 5. vLLM: Quantization methods (AWQ / GPTQ / GGUF / bitsandbytes / INT4 W4A16 / INT8 W8A8 / FP8 W8A8 support)
- 6. Hugging Face Transformers: bitsandbytes integration (LLM.int8 outlier preservation; NF4 / FP4 data types; QLoRA recipe)
- 7. QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023). NF4 data type origin