
VRAM Calculator

Calculate VRAM requirements for running LLMs locally. Check if Llama, Mistral, Qwen, Phi, or Gemma fits on your GPU at different quantization levels.

Free · No Signup · No Server Uploads · Zero Tracking

Fits on GPU

Llama 3.1 7B at Q4_K_M requires 5.9 GB on RTX 4090 (24GB)

Model Weights: 3.9 GB
KV Cache: 479 MB
Overhead: 1.5 GB
Total VRAM: 5.9 GB
GPU Utilization: 24.6%


Context Length Impact on VRAM

Context   KV Cache   Total VRAM   Fits?
0.5K      60 MB      5.5 GB       Yes
1K        120 MB     5.6 GB       Yes
2K        239 MB     5.7 GB       Yes
4K        479 MB     5.9 GB       Yes
8K        958 MB     6.4 GB       Yes
16K       1.9 GB     7.3 GB       Yes
32K       3.7 GB     9.2 GB       Yes
64K       7.5 GB     12.9 GB      Yes
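Because the per-token KV cost is constant, the whole table can be regenerated from a single reference point. A minimal sketch, using the 479 MB at 4K context figure from above (rounding may differ by a megabyte or so from the table's own rounding):

```python
# Sketch: KV cache scales linearly with context length, so one measured
# point (479 MB at 4K context, per the table above) determines every row.
KV_MB_AT_4K = 479

for ctx_k in (0.5, 1, 2, 4, 8, 16, 32, 64):
    kv_mb = KV_MB_AT_4K * ctx_k / 4  # linear in context length
    print(f"{ctx_k:>4}K context -> {kv_mb:.0f} MB KV cache")
```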

Recommended quantization for RTX 4090: FP16 (Half precision, 16-bit float). 7 quantizations fit on this GPU: FP16, Q8, Q6_K, Q5_K_M, Q4_K_M, Q3_K, Q2_K.
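The "which quantizations fit" check above can be sketched as a simple filter. This is an illustrative helper, not the calculator's actual code; the bytes-per-parameter values are approximate GGUF averages, and the KV cache and overhead defaults mirror this page's example:

```python
# Hypothetical helper: list the quantization levels whose total VRAM estimate
# fits a given GPU. Bytes-per-parameter values are approximate (assumption);
# kv_gb and overhead_gb defaults follow this page's example numbers.
QUANTS = [("FP16", 2.0), ("Q8", 1.0625), ("Q6_K", 0.8125), ("Q5_K_M", 0.6875),
          ("Q4_K_M", 0.5625), ("Q3_K", 0.4375), ("Q2_K", 0.3125)]

def fitting_quants(vram_gb, params_b, kv_gb=0.5, overhead_gb=1.5):
    """Return quantization names, best quality first, that fit in vram_gb."""
    return [name for name, bpp in QUANTS
            if params_b * bpp + kv_gb + overhead_gb <= vram_gb]

print(fitting_quants(24, 7))  # all 7 levels fit a 24 GB card
```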


How to Use VRAM Calculator

  1. Select your GPU

     Choose your GPU from NVIDIA consumer/data center cards or Apple Silicon Macs.

  2. Pick a model

     Select the LLM you want to run locally, from 3.8B to 405B parameters.

  3. Choose quantization

     Pick a quantization level. Lower quantization uses less VRAM but reduces quality.

  4. Check the results

     See if the model fits, the VRAM breakdown, and recommended quantization for your GPU.

Frequently Asked Questions

How is VRAM usage calculated?

VRAM = (parameters_billions × bytes_per_parameter) + KV_cache + overhead. For example, a 7B model at Q4_K_M (0.5625 bytes/param) uses about 3.9 GB for weights, plus roughly 1.5 GB of overhead and a KV cache that depends on context length.
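In code, the estimate reads as follows. A minimal sketch: the function name is mine, and the fixed 1.5 GB overhead default is an assumption taken from this page's example:

```python
def estimate_vram_gb(params_billions, bytes_per_param, kv_cache_gb=0.0,
                     overhead_gb=1.5):
    """Estimate total VRAM in GB: weights + KV cache + fixed overhead."""
    weights_gb = params_billions * bytes_per_param  # 1e9 params x bytes ~= GB
    return weights_gb + kv_cache_gb + overhead_gb

# 7B model at Q4_K_M (0.5625 bytes/param) with a ~0.479 GB KV cache (4K context)
print(round(estimate_vram_gb(7, 0.5625, kv_cache_gb=0.479), 1))  # 5.9
```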

What is quantization and how much quality does it cost?

Quantization reduces model weights from full precision (FP32, 4 bytes per parameter) to lower precision (e.g., Q4_K_M, ~0.56 bytes per parameter). This dramatically reduces VRAM usage with minimal quality loss; Q4_K_M is the most popular balance of size and quality.
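A rough lookup of bytes per parameter makes the savings concrete. The values below are approximate averages for common formats (an assumption; exact size varies with a model's tensor mix), with Q4_K_M matching the 0.5625 figure this page uses:

```python
# Approximate bytes per parameter for common precision/quantization levels.
# Assumption: typical GGUF averages; exact size varies by model.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8_0": 1.0625,    # ~8.5 bits/weight
    "Q6_K": 0.8125,
    "Q5_K_M": 0.6875,
    "Q4_K_M": 0.5625,  # ~4.5 bits/weight, the value this page uses
    "Q2_K": 0.3125,
}

def weights_gb(params_billions, quant):
    """Weight memory alone, before KV cache and overhead."""
    return params_billions * BYTES_PER_PARAM[quant]

print(round(weights_gb(7, "Q4_K_M"), 1))  # 3.9
```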

How does context length affect VRAM?

Longer contexts require more KV cache memory: the cache grows linearly with context length. For large models at long contexts, this can add several GB of VRAM usage.
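The linear growth follows from the standard transformer KV-cache size formula: keys and values stored for every layer at every cached position. A sketch, where the config values (32 layers, 8 KV heads, head dim 128) are assumptions for a Llama-style model with grouped-query attention and will not exactly reproduce the calculator's 479 MB figure:

```python
def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Transformer KV-cache size: keys and values (hence the factor of 2)
    for every layer at every cached position, FP16 elements by default."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1e9

# Assumed Llama-style config: 32 layers, 8 KV heads (GQA), head_dim 128.
print(round(kv_cache_gb(4096, 32, 8, 128), 2))  # 0.54
```

Doubling `context_len` doubles the result, which is exactly the pattern in the context-length table above.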

Can I run a model that doesn't fit entirely in VRAM?

Some frameworks support offloading layers to system RAM or disk, but this dramatically slows inference. For practical use, the model should fit entirely in VRAM. Apple Silicon Macs can use unified memory more flexibly.

How does this apply to Apple Silicon?

Apple Silicon uses unified memory shared between the CPU and GPU. The numbers shown are for model memory; the system also needs memory for macOS and other apps, so leave at least 4-8 GB free for the system.