LLM Tokens Per Second Calculator
Will this GPU be fast enough? Pick a GPU and a model size to get a realistic tok/s estimate for interactive chat, batch processing, or production serving. Complements the VRAM Calculator — VRAM tells you if it fits, this tells you if it's fast enough.
Configure
Token speed scales primarily with memory bandwidth, not compute. GDDR7 cards (RTX 5090/5080) lead the pack.
Quantization assumption: Q4_K_M. Longer contexts and higher-precision quants (Q5, Q6, Q8) mean more bytes read per token, slowing output roughly linearly.
Long contexts slow down generation because the KV cache occupies more memory bandwidth.
Different use cases need different speed thresholds. Autocomplete is the strictest.
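The bandwidth intuition behind these tips can be put into a rough formula. A minimal sketch, with an assumed real-world efficiency factor (actual efficiency varies by framework and kernel quality) and illustrative model sizes:

```python
def estimate_tok_s(bandwidth_gb_s, model_size_gb, kv_cache_gb=0.0, efficiency=0.6):
    """Rough bandwidth-bound tok/s estimate.

    Every generated token reads the full model weights (plus the active
    KV cache) from VRAM, so throughput is roughly
    bandwidth / bytes-read-per-token, scaled by an assumed efficiency.
    """
    bytes_per_token_gb = model_size_gb + kv_cache_gb
    return efficiency * bandwidth_gb_s / bytes_per_token_gb

# e.g. RTX 4090 (1008 GB/s) with an 8B model at Q4_K_M (~4.9 GB on disk):
print(round(estimate_tok_s(1008, 4.9)))  # 123 — an optimistic upper bound

# A large KV cache from a long context eats into the same bandwidth budget:
print(round(estimate_tok_s(1008, 4.9, kv_cache_gb=2.0)))  # 88
```

Note how the long-context case slows down not because of extra compute but because the KV cache adds to the bytes read per token, which is exactly the effect described above.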
Estimated throughput
65 tok/s
Roughly 0.8s to generate a short paragraph.
Great for interactive chat
Well above the 20 tok/s threshold for natural-feeling conversation.
How to read tok/s numbers
Under 5 tok/s
Unusable for interactive work
Slower than a person reads. Fine for overnight batch jobs; painful for chat.
5-15 tok/s
Works but frustrating
Readable in real-time but you'll notice the wait. Fine for occasional queries.
15-30 tok/s
Usable for daily chat
Comfortable reading speed. Most users are satisfied at this tier.
30-60 tok/s
Great interactive experience
Feels responsive. Good for code autocomplete and iteration.
60+ tok/s
Professional / production
Fast enough for real-time translation, multi-user serving, or production APIs.
What drives tok/s?
Token generation speed is memory-bandwidth-limited, not compute-limited, for most LLM inference. Every token requires reading the entire model weights from VRAM. That's why:
- The RTX 5090 (GDDR7, 1792 GB/s) is ~40-50% faster than the RTX 4090 (GDDR6X, 1008 GB/s) at the same model size — see our RTX 4090 vs 5090 comparison for the full breakdown
- A used RTX 3090 (GDDR6X, 936 GB/s) matches the 4090's speed more closely than gaming benchmarks suggest. The used GPU buyer's guide covers why it's still the best value in 2026
- AMD RX 7900 XTX (GDDR6, 960 GB/s) competes closely with NVIDIA on raw bandwidth, though software overhead drags actual numbers. Our NVIDIA vs AMD guide has the honest verdict
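Because generation is bandwidth-bound, the relative speeds of the cards above can be sketched directly from their bandwidth ratios (figures as cited in the list). Note the raw 5090/4090 ratio (~1.78x) exceeds the observed ~40-50% gain; software overhead eats the rest:

```python
# Memory bandwidth in GB/s, as cited above.
gpus = {
    "RTX 5090 (GDDR7)": 1792,
    "RTX 4090 (GDDR6X)": 1008,
    "RTX 3090 (GDDR6X)": 936,
    "RX 7900 XTX (GDDR6)": 960,
}

baseline = gpus["RTX 4090 (GDDR6X)"]
for name, bw in gpus.items():
    # Bandwidth ratio is an upper bound on the tok/s ratio at a fixed model size.
    print(f"{name}: {bw / baseline:.2f}x the 4090's bandwidth")
```

This is why the 3090 (0.93x) lands much closer to the 4090 in inference than in gaming benchmarks, where compute matters more.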
Why our numbers are approximate
Real tok/s varies by framework (Ollama, llama.cpp, vLLM), driver version, quantization (Q4 vs Q5 vs Q6), context length, batch size, and specific model architecture. Our estimates are representative ranges for a single-user, Q4_K_M, short-context, Ollama/llama.cpp setup. Production vLLM serving with batching can be 2-3x higher per GPU-hour. See our methodology for data sources.
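Rather than relying on estimates, you can measure your own setup: Ollama's `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` (generation time in nanoseconds), from which throughput follows directly. A minimal sketch of the arithmetic, using an illustrative response fragment:

```python
def ollama_tok_s(stats: dict) -> float:
    """Compute generation tok/s from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is
    the time spent generating them, in nanoseconds.
    """
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

# Example response fragment (values illustrative, not a real benchmark):
stats = {"eval_count": 128, "eval_duration": 2_000_000_000}
print(ollama_tok_s(stats))  # 64.0
```

Run a prompt a few times and average the result; the first run is often slower while the model loads into VRAM.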