
Multi-Modal Token Calculator

Calculate token costs for images, audio, and video across GPT-4o, Claude, and Gemini. See how each model tokenizes multi-modal inputs.

Free · No Signup · No Server Uploads · Zero Tracking

Image dimensions: 1,920 x 1,080 (2,073,600 pixels)

Model               Tokens/Image   Total Tokens   Cost
GPT-4o              1,105          1,105          $0.00276
GPT-4o Mini         1,105          1,105          $0.000166
Claude 4 Sonnet     2,765          2,765          $0.00830
Claude 3.5 Haiku    2,765          2,765          $0.00221
Gemini 2.5 Pro      258            258            $0.000322
Gemini 2.0 Flash    258            258            $0.000026

How to Use Multi-Modal Token Calculator

  1. Select input type

    Switch between the Text, Image, Audio, and Video tabs to calculate tokens for that modality.

  2. Configure your input

    For images, select dimensions or a preset. For audio and video, enter the duration and settings.

  3. Compare across models

    See token counts and costs for each model that supports that input type.

  4. Estimate total cost

    Multiply the per-item cost by your expected volume to project API costs for multi-modal workloads.
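The projection in step 4 is a single multiplication. Here is a minimal sketch; the function name and the $2.50 per million input tokens used in the example are assumptions (the price is inferred from the GPT-4o row of the table, not quoted from a price list), so substitute current pricing before relying on the result.

```python
def projected_cost(tokens_per_item: int, items: int, usd_per_million_tokens: float) -> float:
    """Project input cost in USD for a batch of identical multi-modal inputs.

    usd_per_million_tokens is an assumed input price; check the provider's
    current price list rather than trusting the example value below.
    """
    return tokens_per_item * items * usd_per_million_tokens / 1_000_000


# 10,000 images at 1,105 tokens each, assuming $2.50 per million input tokens.
print(f"${projected_cost(1105, 10_000, 2.50):.2f}")  # $27.63
```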

Frequently Asked Questions

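For high-detail images, GPT-4o first scales the image to fit within 2048x2048, then scales it so the shortest side is 768 pixels, and only then divides it into 512x512 tiles. Token count = 85 + (tiles x 170), where tiles = ceil(width/512) x ceil(height/512) on the resized dimensions. A 1920x1080 image is resized to about 1365x768, which gives 3x2 = 6 tiles and 85 + (6x170) = 1,105 tokens, matching the table above. Applying the formula to the unresized image would give 4x3 = 12 tiles and 2,125 tokens, which overstates the cost.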
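The tile formula can be sketched in a few lines. The function name, the `resize` flag, and the rounding of resized dimensions are assumptions for illustration. The resize step (fit within 2048x2048, then shrink so the shortest side is at most 768) follows OpenAI's description of high-detail image handling. With it, 1920x1080 yields the 1,105 tokens shown in the table; without it, the raw tile count yields 2,125.

```python
import math


def gpt4o_image_tokens(width: int, height: int, resize: bool = True) -> int:
    """Estimate GPT-4o high-detail image tokens: 85 + 170 per 512x512 tile."""
    if resize:
        # Step 1 (assumed): scale down to fit within a 2048x2048 square.
        scale = min(1.0, 2048 / max(width, height))
        width, height = round(width * scale), round(height * scale)
        # Step 2 (assumed): scale down so the shortest side is at most 768 px.
        scale = min(1.0, 768 / min(width, height))
        width, height = round(width * scale), round(height * scale)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(gpt4o_image_tokens(1920, 1080))                # 1105 (6 tiles after resizing)
print(gpt4o_image_tokens(1920, 1080, resize=False))  # 2125 (12 tiles, no resizing)
```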

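Claude's image cost scales with image area rather than with a tile count. Anthropic's published approximation is tokens ≈ (width x height) / 750, so a 1920x1080 image comes to 2,073,600 / 750 ≈ 2,765 tokens, as shown in the table above. Anthropic also downscales images whose long edge exceeds 1,568 pixels, so the billed tokens for very large images may be lower than this estimate.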
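The Claude figure in the table (2,765 tokens for 1920x1080) matches Anthropic's published area-based approximation, tokens ≈ (width x height) / 750. The sketch below reproduces that number; the function name is hypothetical, and it deliberately ignores any server-side downscaling of large images, so treat the result as an upper-end estimate.

```python
def claude_image_tokens(width: int, height: int) -> int:
    """Approximate Claude image tokens as pixel area divided by 750.

    Assumption: no server-side resizing is modeled, so large images
    may be billed fewer tokens than this returns.
    """
    return round(width * height / 750)


print(claude_image_tokens(1920, 1080))  # 2765
```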

Gemini uses a flat rate of 258 tokens per image regardless of size. This makes it predictable for budgeting multi-image workloads.

Gemini accepts audio natively and counts it at 32 tokens per second. OpenAI's Whisper instead transcribes audio to text and is billed per minute of audio; if you then send the transcript to a model, those text tokens are billed separately from the transcription.

Gemini processes video as individual frames (258 tokens each) at a configurable FPS (default 1 FPS) plus audio (32 tokens/second). A 1-minute video at 1 FPS costs 258x60 + 32x60 = 17,400 tokens.
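The Gemini rates quoted above (258 tokens per image or frame, 32 tokens per second of audio) combine into a short estimator. This is a sketch under those stated rates only; the function names and the choice to round partial frames and partial seconds up are assumptions.

```python
import math

IMAGE_TOKENS = 258            # per image or sampled video frame, as quoted above
AUDIO_TOKENS_PER_SECOND = 32  # as quoted above


def gemini_audio_tokens(seconds: float) -> int:
    """Audio tokens for a clip; partial seconds are rounded up (assumption)."""
    return math.ceil(seconds * AUDIO_TOKENS_PER_SECOND)


def gemini_video_tokens(seconds: float, fps: float = 1.0, include_audio: bool = True) -> int:
    """Video tokens: sampled frames at 258 each, plus the audio track if present."""
    frames = math.ceil(seconds * fps)
    tokens = frames * IMAGE_TOKENS
    if include_audio:
        tokens += gemini_audio_tokens(seconds)
    return tokens


print(gemini_video_tokens(60))                       # 17400 (15,480 frame + 1,920 audio)
print(gemini_video_tokens(60, include_audio=False))  # 15480
```

Raising `fps` scales only the frame term, so a 2 FPS sample of the same minute would cost 258x120 + 32x60 = 32,880 tokens.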