id as the model field on /chat/completions, and Overshoot routes the request to a healthy endpoint.
List available models
Availability changes as endpoints come online and go offline. Always query/models before starting a stream. No auth required.
listModels returns, with one extra status field per entry.
Sample response
Sample response
Active models
Snapshot as of 2026-05-01. The
/models endpoint is the source of truth — treat these tables as a quick reference, not a guarantee.Overshoot-hosted
These are the fast path. We run them on our own GPUs, sized for sub-second time-to-first-token on single-frame inputs and high-throughput video.| Model | Provider | Context | Tokens / frame | Max frames |
|---|---|---|---|---|
Qwen/Qwen3.6-27B-FP8 | Qwen | 32K | ~200 @ 480p | Capped by context |
Qwen/Qwen3.6-35B-A3B-FP8 | Qwen | 16K | ~200 @ 480p | Capped by context (16K) |
google/gemma-4-31B-it | 256K | 70 / 140 / 280 / 560 / 1120 | ~60 (1 fps × 60 s) | |
google/gemma-4-26B-A4B-it | 256K | 70 / 140 / 280 / 560 / 1120 | ~60 (1 fps × 60 s) | |
Hcompany/Holo3-35B-A3B | H Company | 16K | ~200 @ 480p | Capped by context (16K) |
Proprietary passthrough
These are upstream APIs we expose through the same OpenAI-compatible surface for convenience. They are not part of Overshoot’s real-time path.| Model | Upstream | Modalities | Notes |
|---|---|---|---|
gemini-3-flash-preview | Google Gemini | image, video | Fast Gemini tier |
gemini-3.1-pro-preview | Google Gemini | image, video | Frontier reasoning, lowest RPM quota |
claude-haiku-4-5-20251001 | Anthropic | image only | Fastest Claude tier (no video) |
claude-sonnet-4-6 | Anthropic | image only | |
claude-opus-4-6 | Anthropic | image only | Highest capability, highest latency |
gpt-5.4-nano | OpenAI | image only | Cheapest GPT-5 tier |
gpt-5.4-mini | OpenAI | image only | |
gpt-5.4 | OpenAI | image only |
How to read the columns
Context — served vs native
Context — served vs native
Served is the context length we run the model with.
Tokens / frame — Qwen models
Tokens / frame — Qwen models
Qwen3.6 uses the same image processor as the Qwen3 line: patch 16,
Numbers in the table assume 480p — the resolution our benchmark suite uses. Higher resolutions consume context faster.
temporal_patch_size=2, spatial_merge_size=2. The formula:| Resolution | Tokens / frame |
|---|---|
| 480p (854×480) | ~200 |
| 720p (1280×720) | ~450 |
| 1080p (1920×1080) | ~1010 |
Tokens / frame — Gemma 4
Tokens / frame — Gemma 4
You pick the visual-token budget per request —
70, 140, 280, 560, or 1120:- 70–280 — classification, captioning, video understanding.
- 560–1120 — OCR, document parsing, small text.
Max frames
Max frames
- Qwen / Holo3 — no hard model-side cap. Frame count is bounded by context. The practical limit is
(context − text_input − text_output) / tokens_per_frame. - Gemma 4 — Google documents 60 s at 1 fps as the supported envelope (~60 frames).
Interleaved text + video
Interleaved text + video
The model can mix text segments between visual tokens inside a single message — instead of forcing all visual content into one block followed by text. Every active model supports this.
Use a model
Pass theid from /models straight into /chat/completions: