
Is 8‑bit, 4‑bit, or 2‑bit quantization best for a 7–9B GPU LLM?

Running a 7–9B “fast small LLM” on a consumer GPU often comes down to one blunt constraint: how many gigabytes of VRAM you have, and how much latency you can tolerate before the model stops feeling “fast.” Weight quantization—storing model weights in fewer bits—seems like an easy win, but the real trade‑off isn’t just “lower bits = better.” Latency, memory, and benchmark accuracy move in different directions depending on quantization level, kernel quality, and where your workload spends time (prefill vs decode).

Published on March 30, 2026

The baseline: what changes when you quantize weights?

Weight quantization reduces the memory footprint and bandwidth cost of reading weights during inference. For transformer LLMs, inference time often leans heavily on matrix multiplies where weights are repeatedly read and multiplied with activations.

Three things matter most:

  • VRAM footprint: Can the model and its KV cache fit on the GPU?
  • Memory bandwidth vs compute: Is your inference limited by how fast weights can be fetched (bandwidth-bound) or by math throughput (compute-bound)?
  • Quantization error: Lower-bit weights introduce more approximation error, which tends to show up as worse benchmark scores and sometimes more brittle behavior.

For 7–9B models, consumer GPUs typically fall into a “bandwidth matters a lot” category, especially during token generation (decode), where each new token triggers weight reads across many layers.

Memory math: what you actually save

Weights dominate the static footprint; KV cache dominates the dynamic footprint during longer contexts.

Weight memory (rough intuition)

Assume a 7–9B model has ~7–9 billion parameters.

  • 8‑bit (INT8): ~1 byte/param → ~7–9 GB for weights (plus overhead)
  • 4‑bit (INT4): ~0.5 bytes/param → ~3.5–4.5 GB (plus scales/zeros/group metadata)
  • 2‑bit (INT2): ~0.25 bytes/param → ~1.75–2.25 GB (but metadata and packing overhead become more noticeable)

Real implementations add per‑group scales (and sometimes zero points), so actual VRAM isn’t exactly the ideal number. Still, the ranking holds: 4‑bit roughly halves 8‑bit, and 2‑bit halves 4‑bit.
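The arithmetic above is easy to sketch directly. The 5% metadata overhead below is an assumed figure for illustration; real overhead depends on group size and whether zero points are stored:

```python
# Rough VRAM estimate for quantized weights (illustrative only).
def weight_vram_gb(n_params_b: float, bits: int, metadata_overhead: float = 0.05) -> float:
    """n_params_b: parameters in billions; bits: bits per weight.
    metadata_overhead: assumed fraction for scales/zero points (not a measured value)."""
    bytes_per_param = bits / 8
    raw_gb = n_params_b * bytes_per_param  # 1e9 params * bytes/param ≈ GB
    return raw_gb * (1 + metadata_overhead)

for bits in (8, 4, 2):
    print(f"7B @ {bits}-bit: ~{weight_vram_gb(7, bits):.1f} GB")
```

Running this reproduces the ranking in the list: each halving of bits roughly halves weight VRAM, with metadata nudging the totals up slightly.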

The overlooked piece: KV cache

Even with heavily quantized weights, long context can blow up memory because KV cache is typically stored in FP16/BF16 (though some stacks offer KV quantization).

For a 7–9B model, KV cache for a few thousand tokens can take multiple GB. This means:

  • Quantizing weights helps you fit the model.
  • Quantizing the KV cache helps you fit long context.
  • For short prompts and generations, weight quantization dominates your VRAM story; for long contexts, the KV cache can dominate.
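The KV cache size can be estimated the same way. The shape parameters below assume a Llama-7B-style layout (32 layers, 32 KV heads, head dim 128, FP16 cache); actual models vary, and grouped-query attention shrinks this considerably:

```python
# Rough KV cache estimate (illustrative; assumes a Llama-7B-like shape).
def kv_cache_gb(seq_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    # K and V each store n_layers * n_kv_heads * head_dim values per token;
    # bytes_per_elem = 2 corresponds to an FP16/BF16 cache.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token_bytes / 1e9

print(f"4k tokens: ~{kv_cache_gb(4096):.1f} GB")
```

With these assumed dimensions, a 4k-token context already costs around 2 GB, which is why long-context workloads can dwarf the savings from weight quantization alone.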

Latency: prefill vs decode (and why lower bits don’t always win)

LLM inference has two very different phases:

  • Prefill: processing the prompt (many tokens in parallel)
  • Decode: generating one token at a time (little parallelism)

Prefill latency

Prefill can use more GPU parallelism and often runs closer to compute limits. In this phase:

  • 8‑bit can be competitive because INT8 kernels are often mature and fast.
  • 4‑bit can be faster if kernels are good and your GPU is bandwidth-limited.
  • 2‑bit may not improve much because overhead (unpacking, dequant, more complex kernels) can eat gains.

In other words, prefill may not scale linearly with bits. Dropping from 8→4 can help; dropping 4→2 may disappoint.

Decode latency (tokens/sec)

Decode is typically where “fast small LLM” lives or dies. Each token requires repeated weight reads, and the GPU is frequently underutilized. In this phase:

  • 4‑bit often gives the best speed-per-quality on consumer GPUs because it cuts weight bandwidth significantly while keeping kernels manageable.
  • 8‑bit is usually slower than 4‑bit for decode when bandwidth-bound, though it can be steadier and easier to run across tools.
  • 2‑bit can be a mixed bag: it reduces bandwidth further, but the extra work to dequantize and the lower arithmetic intensity can limit gains. Some setups see speedups; others see similar or even worse tokens/sec than 4‑bit.

A practical rule: 8→4 frequently yields a clear decode speedup; 4→2 is far less predictable and depends heavily on kernel quality and GPU architecture.
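A back-of-envelope roofline makes the rule concrete: if every generated token must stream all weight bytes from VRAM, tokens/sec is bounded by bandwidth divided by weight size. The 360 GB/s bandwidth below is an assumed consumer-GPU figure, and the model deliberately ignores dequantization cost and KV-cache traffic:

```python
# Bandwidth-bound ceiling on decode speed: one full weight read per token.
# 360 GB/s is an assumed consumer-GPU memory bandwidth, not a measurement.
def decode_tok_s_upper_bound(weight_gb: float, bandwidth_gb_s: float = 360.0) -> float:
    return bandwidth_gb_s / weight_gb

for bits, gb in ((8, 7.0), (4, 3.5), (2, 1.75)):
    print(f"{bits}-bit: <= {decode_tok_s_upper_bound(gb):.0f} tok/s")
```

The ceiling doubles at each halving of bits, but real 2-bit decode often lands well below its ceiling because dequant overhead is exactly what this toy model leaves out.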

Accuracy and benchmark behavior: where the quality drops show up

Quantization hurts accuracy through added noise in weight values. The severity depends on:

  • Quantization method (per-channel vs per-group, symmetric vs asymmetric)
  • Group size (smaller groups usually preserve quality better but add overhead)
  • Model “quantization-friendliness” (some checkpoints tolerate low bits better)
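The noise these factors control can be seen in a minimal sketch of symmetric per-group quantization. This is a toy round-to-nearest scheme on random weights, not any specific method like GPTQ or AWQ:

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 4, group_size: int = 64):
    """Toy symmetric per-group quantization: each group shares one scale."""
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for signed 4-bit
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(w.shape), scale    # dequantized weights + metadata

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for bits in (8, 4, 2):
    wq, _ = quantize_groupwise(w, bits=bits)
    print(f"{bits}-bit mean abs error: {np.abs(w - wq).mean():.4f}")
```

Even this crude version shows error growing sharply from 8 to 4 to 2 bits; shrinking `group_size` tightens the scales and reduces error, at the cost of more metadata per weight.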

8‑bit: usually “safe”

  • Benchmark accuracy is often close to FP16 for many tasks.
  • Output style and instruction-following remain stable.
  • Good default when you can afford VRAM and want minimal risk.

4‑bit: the current sweet spot

  • Many 7–9B models stay surprisingly strong at 4‑bit.
  • Typical benchmark drops are modest, and for chat-style usage the difference may be subtle.
  • Failures tend to appear in edge cases: long-chain reasoning, exactness (math, code), or strict formatting.

2‑bit: quality becomes the main cost

  • Expect noticeable benchmark declines unless the method is unusually strong and tuned.
  • More frequent weirdness: repetitions, missed constraints, shallow reasoning, degraded tool-use reliability.
  • Some tasks (simple Q&A, summarization) may remain usable; others (coding, multi-step reasoning) often suffer.

The key point: 2‑bit is no longer “free compression.” It’s a real capability trade.

The real-world trade table (typical outcomes)

If your GPU has enough VRAM (e.g., 12–16 GB) and you want stability

  • Pick 8‑bit when you care about benchmark integrity and predictable behavior.
  • Latency may be slightly worse than 4‑bit, but quality is steady.

If you want the best “fast small LLM” feel on a single consumer GPU

  • Pick 4‑bit as the default.
  • You’ll usually get better tokens/sec than 8‑bit, fit larger context or larger batch, and keep most of the model’s skill.

If VRAM is tight or you want to fit longer context at any cost

  • Consider 2‑bit, but treat it as a specialization.
  • It can be the difference between “runs” and “doesn’t run,” yet accuracy loss is often visible and speed gains aren’t guaranteed.

A practical decision checklist

  • Is VRAM the blocker?
    If you can’t fit weights + KV cache, go lower-bit or reduce context. 4‑bit is often enough; 2‑bit is a last resort.

  • Do you care about benchmarks or reliable coding?
    Favor 8‑bit or a high-quality 4‑bit setup.

  • Are you optimizing tokens/sec for chat?
    4‑bit usually wins. If 2‑bit doesn’t clearly outperform 4‑bit on your GPU, don’t pay the quality cost.

  • Is your workload long-context heavy?
    Weight quantization helps, but KV cache is the silent VRAM consumer. If your stack supports KV quantization, it can matter more than going from 4‑bit to 2‑bit weights.
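The checklist can be condensed into a toy selection heuristic. The fit formula and headroom values below are assumptions for illustration, not tuned thresholds:

```python
# Toy heuristic following the checklist above; thresholds are assumptions.
def pick_bits(vram_gb: float, n_params_b: float,
              needs_exactness: bool = False, long_context: bool = False) -> int:
    # Assumed headroom for KV cache and activations, in GB.
    headroom = 4.0 if long_context else 2.0
    def fits(bits: int) -> bool:
        # Weight bytes with ~5% metadata overhead, plus headroom.
        return n_params_b * (bits / 8) * 1.05 + headroom <= vram_gb
    if needs_exactness and fits(8):
        return 8
    if fits(4):
        return 4
    return 2  # last resort: expect visible accuracy loss

print(pick_bits(12, 8, needs_exactness=True))
```

The shape of the logic mirrors the text: prefer 8-bit only when quality matters and it fits, default to 4-bit, and fall through to 2-bit only when nothing else does.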

For a 7–9B “fast small LLM” on a consumer GPU, 4‑bit weight quantization is the most reliable balance of latency, memory, and benchmark accuracy. 8‑bit is the safer choice when you can afford the VRAM and want near-baseline quality. 2‑bit is primarily a last-resort “make it fit” option: sometimes useful, often costly in accuracy, and not always faster in real decode latency once kernel overhead is included.
