
VRAM Calculator

Calculate VRAM requirements for running LLMs locally. Check if Llama, Mistral, Qwen, Phi, or Gemma fits on your GPU at different quantization levels.

Free · No Signup · No Server Uploads · Zero Tracking

Fits on GPU

Llama 3.1 7B at Q4_K_M requires 5.9 GB on RTX 4090 (24GB)

Model Weights: 3.9 GB
KV Cache: 479 MB
Overhead: 1.5 GB
Total VRAM: 5.9 GB
GPU Utilization: 24.6%


Context Length Impact on VRAM

Context   KV Cache   Total VRAM   Fits?
0.5K      60 MB      5.5 GB       Yes
1K        120 MB     5.6 GB       Yes
2K        239 MB     5.7 GB       Yes
4K        479 MB     5.9 GB       Yes
8K        958 MB     6.4 GB       Yes
16K       1.9 GB     7.3 GB       Yes
32K       3.7 GB     9.2 GB       Yes
64K       7.5 GB     12.9 GB      Yes
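Because the per-token KV cost is constant, the whole table can be regenerated from a single reference point. A minimal sketch, using the 479 MB at 4K context figure from above (rounding may differ by a megabyte or so from the table's own rounding):

```python
# Sketch: KV cache scales linearly with context length, so one measured
# point (479 MB at 4K context, per the table above) determines every row.
KV_MB_AT_4K = 479

for ctx_k in (0.5, 1, 2, 4, 8, 16, 32, 64):
    kv_mb = KV_MB_AT_4K * ctx_k / 4  # linear in context length
    print(f"{ctx_k:>4}K context -> {kv_mb:.0f} MB KV cache")
```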

Recommended quantization for RTX 4090: FP16 (Half precision, 16-bit float). 7 quantizations fit on this GPU: FP16, Q8, Q6_K, Q5_K_M, Q4_K_M, Q3_K, Q2_K.
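The "which quantizations fit" check above can be sketched as a simple filter. This is an illustrative helper, not the calculator's actual code; the bytes-per-parameter values are approximate GGUF averages, and the KV cache and overhead defaults mirror this page's example:

```python
# Hypothetical helper: list the quantization levels whose total VRAM estimate
# fits a given GPU. Bytes-per-parameter values are approximate (assumption);
# kv_gb and overhead_gb defaults follow this page's example numbers.
QUANTS = [("FP16", 2.0), ("Q8", 1.0625), ("Q6_K", 0.8125), ("Q5_K_M", 0.6875),
          ("Q4_K_M", 0.5625), ("Q3_K", 0.4375), ("Q2_K", 0.3125)]

def fitting_quants(vram_gb, params_b, kv_gb=0.5, overhead_gb=1.5):
    """Return quantization names, best quality first, that fit in vram_gb."""
    return [name for name, bpp in QUANTS
            if params_b * bpp + kv_gb + overhead_gb <= vram_gb]

print(fitting_quants(24, 7))  # all 7 levels fit a 24 GB card
```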


How to Use VRAM Calculator

  1. Select your GPU

     Choose your GPU from NVIDIA consumer/data center cards or Apple Silicon Macs.

  2. Pick a model

     Select the LLM you want to run locally, from 3.8B to 405B parameters.

  3. Choose quantization

     Pick a quantization level. Lower quantization uses less VRAM but reduces quality.

  4. Check the results

     See if the model fits, the VRAM breakdown, and recommended quantization for your GPU.

Frequently Asked Questions

How is VRAM usage calculated?

VRAM = (parameters_billions × bytes_per_parameter) + KV_cache + overhead. For example, a 7B model at Q4_K_M (0.5625 bytes/param) uses about 3.9 GB for weights, plus roughly 1.5 GB of overhead and a KV cache that depends on context length.
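In code, the estimate reads as follows. A minimal sketch: the function name is mine, and the fixed 1.5 GB overhead default is an assumption taken from this page's example:

```python
def estimate_vram_gb(params_billions, bytes_per_param, kv_cache_gb=0.0,
                     overhead_gb=1.5):
    """Estimate total VRAM in GB: weights + KV cache + fixed overhead."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params x bytes ~= GB
    return weights_gb + kv_cache_gb + overhead_gb

# 7B model at Q4_K_M (0.5625 bytes/param) with a ~0.479 GB KV cache (4K context)
print(round(estimate_vram_gb(7, 0.5625, kv_cache_gb=0.479), 1))  # 5.9
```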

What is quantization and how much quality does it cost?

Quantization reduces model weights from full precision (FP32, 4 bytes per parameter) to lower precision (e.g., Q4_K_M, ~0.56 bytes per parameter). This dramatically reduces VRAM usage with minimal quality loss; Q4_K_M is the most popular balance of size and quality.
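A rough lookup of bytes per parameter makes the savings concrete. The values below are approximate averages for common formats (an assumption; exact size varies with a model's tensor mix), with Q4_K_M matching the 0.5625 figure this page uses:

```python
# Approximate bytes per parameter for common precision/quantization levels.
# Assumption: typical GGUF averages; exact size varies by model.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8_0": 1.0625,    # ~8.5 bits/weight
    "Q6_K": 0.8125,
    "Q5_K_M": 0.6875,
    "Q4_K_M": 0.5625,  # ~4.5 bits/weight, the value this page uses
    "Q2_K": 0.3125,
}

def weights_gb(params_billions, quant):
    """Weight memory alone, before KV cache and overhead."""
    return params_billions * BYTES_PER_PARAM[quant]

print(round(weights_gb(7, "Q4_K_M"), 1))  # 3.9
```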

How does context length affect VRAM?

Longer contexts require more KV cache memory: the cache grows linearly with context length. For large models at long contexts, this can add several GB of VRAM usage.
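The linear growth follows from the standard transformer KV-cache size formula: keys and values stored for every layer at every cached position. A sketch, where the config values (32 layers, 8 KV heads, head dim 128) are assumptions for a Llama-style model with grouped-query attention and will not exactly reproduce the calculator's 479 MB figure:

```python
def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Transformer KV-cache size: keys and values (hence the factor of 2)
    for every layer at every cached position, FP16 elements by default."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1e9

# Assumed Llama-style config: 32 layers, 8 KV heads (GQA), head_dim 128.
print(round(kv_cache_gb(4096, 32, 8, 128), 2))  # 0.54
```

Doubling `context_len` doubles the result, which is exactly the pattern in the context-length table above.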

Can I run a model that doesn't fit entirely in VRAM?

Some frameworks support offloading layers to system RAM or disk, but this dramatically slows inference. For practical use, the model should fit entirely in VRAM. Apple Silicon Macs can use unified memory more flexibly.

How does this apply to Apple Silicon?

Apple Silicon uses unified memory shared between the CPU and GPU. The numbers shown are for model memory; the system also needs memory for macOS and other apps, so leave at least 4-8 GB free for the system.