LLM VRAM Calculator

Estimate how much GPU VRAM you need to run any LLM locally. Covers quantization levels, context length, KV cache overhead, and runtime memory, then maps the result to real GPUs from our comparison data.

Configure your workload

Parameter count. For MoE models, we use the total parameter count (VRAM must hold all weights).

Lower-bit quantization = smaller VRAM footprint, at some cost to quality. Q4_K_M is the most common default.

KV cache grows linearly with context. Doubling context roughly doubles the cache size.

Q8 KV cache halves memory with minor quality impact. Supported in llama.cpp and newer Ollama.

CUDA context + framework overhead. Ollama and llama.cpp are leanest.

Total VRAM needed

8.0 GB

Model weights 4.5 GB
KV cache 2.0 GB
Runtime overhead 1.5 GB

Recommended VRAM tier

12GB card

A 12GB GPU will handle this workload with ~4GB headroom for longer context or batching.

GPUs that fit this workload

How the math works

The total VRAM needed for running an LLM is the sum of three components: model weights, KV cache, and runtime overhead. Each scales differently based on your configuration.

1. Model weights

Weight memory scales with parameter count and quantization bits per parameter. For a 7B model at Q4_K_M (~4.5 bits/param), the weights occupy roughly 4 GB. At FP16 the same model takes ~14 GB. See our how much VRAM for AI guide for a full breakdown by workload type.

weights_GB = params × bits_per_param / 8 / 1024³
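As a sketch in Python (the 7B parameter count and bits-per-parameter values here are illustrative examples, not outputs of the calculator):

```python
def weights_gb(params: float, bits_per_param: float) -> float:
    """Weight memory in GiB: parameters × bits → bytes → GiB."""
    return params * bits_per_param / 8 / 1024**3

# 7B model: Q4_K_M (~4.5 bits/param) vs FP16 (16 bits/param)
print(f"Q4_K_M: {weights_gb(7e9, 4.5):.1f} GB")  # ~3.7 GB
print(f"FP16:   {weights_gb(7e9, 16):.1f} GB")   # ~13.0 GB
```

Quantizing from FP16 to Q4_K_M cuts weight memory by roughly 3.5×, which is why a 7B model that needs a 16GB card at FP16 fits comfortably on an 8GB card at Q4.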

2. KV cache

Every token in your context window adds to the KV cache, and larger models have bigger caches per token because they have more layers and wider hidden dimensions. A 70B-class model with full multi-head attention caches roughly 2.5 MB per token at FP16, so 32K context costs ~80 GB just for the cache; models that use grouped-query attention cache substantially less per token.

kv_GB = 2 × num_layers × hidden_dim × context × precision_bytes / 1024³
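The cache formula can be sketched in Python; the 80-layer / 8192-hidden dimensions below are representative of a 70B-class model with full multi-head attention (grouped-query-attention models replace hidden_dim with a smaller kv_heads × head_dim product):

```python
def kv_cache_gb(num_layers: int, hidden_dim: int, context: int,
                precision_bytes: int = 2) -> float:
    """KV cache in GiB: 2× for the K and V tensors, per layer per token."""
    return 2 * num_layers * hidden_dim * context * precision_bytes / 1024**3

# 70B-class MHA dims, 32K context, FP16 cache
print(kv_cache_gb(80, 8192, 32768))     # 80.0
# Q8 KV cache (1 byte per value) halves it
print(kv_cache_gb(80, 8192, 32768, 1))  # 40.0
```

The last line is where the Q8 KV cache option in the calculator comes from: dropping the cache from 2 bytes to 1 byte per value halves this entire term.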

3. Runtime overhead

Runtime overhead covers the CUDA context, framework buffers, and activation memory used during inference. Ollama and llama.cpp are leanest (~1-1.5 GB). vLLM with heavy batching can consume 2-3 GB more, and Text Generation WebUI with extensions loaded is heaviest.
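Putting the three components together gives a minimal end-to-end estimate. The 1.5 GB overhead, the 7B-class dimensions (32 layers × 4096 hidden), and the 8K context below are illustrative assumptions:

```python
def total_vram_gb(params: float, bits_per_param: float,
                  num_layers: int, hidden_dim: int, context: int,
                  kv_bytes: int = 2, overhead_gb: float = 1.5) -> float:
    """Sum of weights, KV cache, and runtime overhead, in GiB."""
    weights = params * bits_per_param / 8 / 1024**3
    kv = 2 * num_layers * hidden_dim * context * kv_bytes / 1024**3
    return weights + kv + overhead_gb

# 7B at Q4_K_M, 8K context, FP16 KV cache, ~1.5 GB runtime overhead
print(round(total_vram_gb(7e9, 4.5, 32, 4096, 8192), 1))  # 9.2
```

Note how the KV cache (4.0 GB here) already rivals the quantized weights (3.7 GB) at 8K context; doubling the context to 16K would add another 4 GB.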

Why "recommended" differs from "fits"

A workload that needs 15.8 GB will technically fit in a 16GB card, but there's no headroom. Any extra context, a second user, or a driver update can push it over and cause OOM errors. We recommend the next VRAM tier up when headroom is under 2 GB. This is why we usually point readers toward the RTX 4090 for AI or a used RTX 3090 once workloads push past 16 GB.
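The tier rule above can be sketched as a simple search over common VRAM sizes (the tier list and the 2 GB headroom threshold are the assumptions here, chosen to match the text):

```python
VRAM_TIERS_GB = [8, 12, 16, 24, 32, 48]  # common consumer/prosumer cards

def recommend_tier(total_gb: float, min_headroom_gb: float = 2.0):
    """Smallest tier that leaves at least min_headroom_gb of headroom."""
    for tier in VRAM_TIERS_GB:
        if tier - total_gb >= min_headroom_gb:
            return tier
    return None  # beyond a single consumer GPU

print(recommend_tier(15.8))  # 24: a 16GB card fits, but with only 0.2 GB spare
```

This is why the 15.8 GB workload from the example gets bumped past the 16GB tier: it technically fits, but falls under the 2 GB headroom threshold.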

This calculator gives representative estimates based on typical Transformer architectures. Actual VRAM usage varies by model family (Llama, Mistral, Qwen have slightly different layer counts and hidden sizes), driver, framework, and batch size. Treat results as ±10% accurate. See our methodology for how we derive these numbers.