VRAM Calculator
Calculate VRAM requirements for running LLMs locally. Check if Llama, Mistral, Qwen, Phi, or Gemma fits on your GPU at different quantization levels.
Fits on GPU: Llama 3.1 7B at Q4_K_M requires 5.9 GB on an RTX 4090 (24 GB).
| Component | Usage |
|---|---|
| Model Weights | 3.9 GB |
| KV Cache | 479 MB |
| Overhead | 1.5 GB |
| Total VRAM | 5.9 GB |
| GPU Utilization | 24.6% |
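The breakdown above can be approximated with a back-of-the-envelope formula: weight memory is parameter count times bits per weight, plus the KV cache and a fixed runtime overhead. The constants below (an effective ~4.5 bits/weight for Q4_K_M and a flat 1.5 GB overhead) are assumptions for illustration, not the calculator's actual internals.

```python
def model_weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB: parameters x bits-per-weight / 8 bits-per-byte."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Assumed effective rate for Q4_K_M (~4.5 bits/weight is a common estimate).
weights = model_weights_gb(7, 4.5)   # ~3.9 GB for a 7B model
kv_cache = 0.479                     # KV cache at 4K context, from the table
overhead = 1.5                       # assumed fixed runtime overhead
total = weights + kv_cache + overhead
utilization = total / 24 * 100       # RTX 4090 has 24 GB
print(f"{total:.1f} GB total, {utilization:.1f}% of 24 GB")
```

With these assumed constants the sketch lands on the same 5.9 GB total shown above; a real calculator would use per-quantization size tables rather than a single bits-per-weight figure.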
VRAM Usage
Context Length Impact on VRAM
| Context | KV Cache | Total VRAM | Fits? |
|---|---|---|---|
| 0.5K | 60 MB | 5.5 GB | Yes |
| 1K | 120 MB | 5.6 GB | Yes |
| 2K | 239 MB | 5.7 GB | Yes |
| 4K | 479 MB | 5.9 GB | Yes |
| 8K | 958 MB | 6.4 GB | Yes |
| 16K | 1.9 GB | 7.3 GB | Yes |
| 32K | 3.7 GB | 9.2 GB | Yes |
| 64K | 7.5 GB | 12.9 GB | Yes |
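The doubling pattern in the table follows from the KV cache growing linearly with context length: every layer caches a K and a V tensor for every token. A minimal sketch, using assumed architecture constants (32 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache); the calculator's 479 MB figure at 4K implies slightly different assumptions:

```python
def kv_cache_bytes(ctx_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # 2x for the separate K and V tensors, cached per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

for ctx in (4096, 8192, 32768):
    print(f"{ctx:>6} tokens: {kv_cache_bytes(ctx) / 2**20:.0f} MiB")
```

Because every term except `ctx_len` is fixed, doubling the context exactly doubles the cache, which is the pattern visible in the table.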
Recommended quantization for the RTX 4090: FP16 (half precision, 16-bit float). All seven quantization levels fit on this GPU: FP16, Q8, Q6_K, Q5_K_M, Q4_K_M, Q3_K, Q2_K.
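Picking a recommendation amounts to walking the quantization levels from highest quality to lowest and returning the first whose total fits in VRAM. A sketch under assumed bits-per-weight values (the `QUANT_BPW` figures are illustrative estimates, not the calculator's table):

```python
# Hypothetical effective bits-per-weight for common GGUF quantizations,
# ordered from highest quality to lowest.
QUANT_BPW = {"FP16": 16.0, "Q8": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
             "Q4_K_M": 4.8, "Q3_K": 3.9, "Q2_K": 3.0}

def best_quant(params_b: float, vram_gb: float, kv_gb: float,
               overhead_gb: float = 1.5):
    """Return the highest-quality quantization whose total fits in VRAM."""
    for name, bpw in QUANT_BPW.items():  # dicts preserve insertion order
        total = params_b * bpw / 8 + kv_gb + overhead_gb
        if total <= vram_gb:
            return name, round(total, 1)
    return None, None  # nothing fits, even Q2_K

print(best_quant(7, 24, 0.479))   # a 7B model on a 24 GB card
print(best_quant(70, 24, 2.0))    # a 70B model does not fit at any level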
How to Use VRAM Calculator
1. Select your GPU: choose from NVIDIA consumer/data-center cards or Apple Silicon Macs.
2. Pick a model: select the LLM you want to run locally, from 3.8B to 405B parameters.
3. Choose quantization: lower-bit quantization uses less VRAM but reduces output quality.
4. Check the results: see whether the model fits, the VRAM breakdown, and the recommended quantization for your GPU.
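The four steps above reduce to one check: look up the GPU's VRAM, estimate the model's footprint at the chosen quantization, and compare. The lookup tables here are hypothetical subsets with assumed values (GPU VRAM in GB, model sizes in billions of parameters, bits per weight), purely to show the shape of the workflow:

```python
GPUS = {"RTX 4090": 24.0, "M3 Max": 36.0}           # assumed entries (GB VRAM)
MODELS = {"Llama 3.1 7B": 7.0, "Phi-3 3.8B": 3.8}   # billions of parameters
QUANT_BPW = {"FP16": 16.0, "Q4_K_M": 4.8}           # assumed bits per weight

def check(gpu: str, model: str, quant: str,
          kv_gb: float = 0.5, overhead_gb: float = 1.5) -> str:
    """Step 4: total the footprint and compare against the GPU's VRAM."""
    total = MODELS[model] * QUANT_BPW[quant] / 8 + kv_gb + overhead_gb
    vram = GPUS[gpu]
    verdict = "fits" if total <= vram else "does not fit"
    return f"{model} at {quant}: {total:.1f} GB on {gpu} ({vram:.0f} GB), {verdict}"

print(check("RTX 4090", "Llama 3.1 7B", "Q4_K_M"))
```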