LLM Tokens Per Second Calculator

Will this GPU be fast enough? Pick a GPU and a model size to get a realistic tok/s estimate for interactive chat, batch processing, or production serving. Complements the VRAM Calculator — VRAM tells you if it fits, this tells you if it's fast enough.

Configure

Token speed scales primarily with memory bandwidth, not compute. GDDR7 cards (RTX 5090/5080) lead the pack.

Quantization assumption: Q4_K_M. Larger contexts and higher-precision quantization levels slow output roughly linearly, since both increase the bytes read per token.

Long contexts slow down generation because the KV cache occupies more memory bandwidth.

Different use cases need different speed thresholds. Autocomplete is the strictest.

Estimated throughput

65 tok/s

Roughly 0.8s to generate a short paragraph.

Great for interactive chat

Well above the 20 tok/s threshold for natural-feeling conversation.

vs other GPUs at this model size

How to read tok/s numbers

Under 5 tok/s

Unusable for interactive work

Slower than a person reads. Fine for overnight batch jobs; painful for chat.

5-15 tok/s

Works but frustrating

Readable in real-time but you'll notice the wait. Fine for occasional queries.

15-30 tok/s

Usable for daily chat

Comfortable reading speed. Most users are satisfied at this tier.

30-60 tok/s

Great interactive experience

Feels responsive. Good for code autocomplete and iteration.

60+ tok/s

Professional / production

Fast enough for real-time translation, multi-user serving, or production APIs.

What drives tok/s?

Token generation is memory-bandwidth-limited, not compute-limited, for most LLM inference: generating each token requires streaming the model's entire weights from VRAM. That's why:

  • The RTX 5090 (GDDR7, 1792 GB/s) is ~40-50% faster than the RTX 4090 (GDDR6X, 1008 GB/s) at the same model size — see our RTX 4090 vs 5090 comparison for the full breakdown
  • A used RTX 3090 (GDDR6X, 936 GB/s) matches the 4090's speed more closely than gaming benchmarks suggest. The used GPU buyer's guide covers why it's still the best value in 2026
  • AMD RX 7900 XTX (GDDR6, 960 GB/s) competes closely with NVIDIA on raw bandwidth, though software overhead drags its real-world numbers below NVIDIA's. Our NVIDIA vs AMD guide has the honest verdict

Why our numbers are approximate

Real tok/s varies by framework (Ollama, llama.cpp, vLLM), driver version, quantization (Q4 vs Q5 vs Q6), context length, batch size, and specific model architecture. Our estimates are representative ranges for a single-user, Q4_K_M, short-context, Ollama/llama.cpp setup. Production vLLM serving with batching can be 2-3x higher per GPU-hour. See our methodology for data sources.
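The context-length effect mentioned above can be sketched the same way: each decoded token also reads the whole KV cache, so its size adds to the bytes streamed per token. The model dimensions below are illustrative (a Llama-3-8B-like layout with grouped-query attention), not tied to any preset in the calculator.

```python
def kv_cache_gb(context_tokens: int, layers: int = 32, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache size in GB; the leading 2 counts the K and V tensors per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_tokens / 1e9

# At 32k tokens of context:
print(round(kv_cache_gb(32768), 2))
```

At 32k tokens this works out to roughly 4.3 GB of KV cache, comparable to the ~4.8 GB of Q4_K_M weights for an 8B model, which is why a long context can nearly halve decode speed versus a short one.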