# Models
Overshoot runs models optimized for low-latency real-time inference.
Model availability can change. Use the /models endpoint to check current status before starting a stream:
```bash
curl https://api.overshoot.ai/v0.2/models
```

See the API Reference for details on the `status` field.
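If you select a model programmatically, you can filter the `/models` response by status. A minimal sketch — the response shape (`{ data: [{ id, status }] }`) and the `'available'` status value are assumptions for illustration; verify the actual schema in the API Reference:

```javascript
// Pick the first preferred model that is currently available.
// NOTE: the response shape and the 'available' status value are
// assumptions — check the API Reference for the real schema.
function pickAvailableModel(modelsResponse, preferredIds) {
  const available = new Set(
    modelsResponse.data
      .filter((m) => m.status === 'available')
      .map((m) => m.id)
  )
  return preferredIds.find((id) => available.has(id)) ?? null
}
```

This gives you a graceful fallback chain: list your models in order of preference and stream with whichever one is currently up.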
## Picking a Model
| If you need... | Use | Why |
|---|---|---|
| Best overall quality | Qwen3.5-27B | Leads on video, coding, instruction following |
| High-throughput vision | Qwen3.5-35B-A3B | Near-27B vision at a fraction of compute (MoE, 3B active) |
| Strong vision at moderate size | Qwen3.5-9B | Beats last-gen 30B on all vision benchmarks |
| Video understanding at lowest latency | Qwen3.5-4B | 96% of 27B's video score at a fraction of the size |
| OCR and document scanning | Qwen3.5-2B | OCR beats models 2x its size, fastest response time |
For detailed benchmarks and per-model analysis, see our blog post: Qwen 3.5 on Overshoot.
## Available Models

### Large (27B+)

- `Qwen/Qwen3.5-35B-A3B` — MoE, best for throughput-heavy vision and UI agents
- `Qwen/Qwen3.5-27B` — dense, best all-rounder
- `Qwen/Qwen3-VL-32B-Instruct-FP8`
- `Qwen/Qwen3-VL-30B-A3B-Instruct`
- `OpenGVLab/InternVL3_5-30B-A3B`

### Medium (8-9B)

- `Qwen/Qwen3.5-9B` — recommended starting point for most developers
- `Qwen/Qwen3-VL-8B-Instruct`
- `allenai/Molmo2-8B`
- `Kwai-Keye/Keye-VL-1_5-8B`
- `openbmb/MiniCPM-V-4_5`

### Small (2-4B)

- `Qwen/Qwen3.5-4B` — strong video understanding for its size
- `Qwen/Qwen3.5-2B` — specialist for OCR and document scanning
- `Qwen/Qwen3-VL-4B-Instruct`
## Usage

```js
const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read any visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  onResult: (result) => {
    console.log(result.result)
  }
})
```

## Thinking Mode
Qwen 3.5 models support thinking mode (`<think>` tags), which lets the model reason before responding. Thinking mode is disabled by default to preserve real-time performance. If you'd like thinking mode enabled for your use case, contact us.
Note: On the 2B and 4B models, thinking mode can get stuck in reasoning loops. Non-thinking mode is more reliable for production and actually scores better on OCR tasks.
## Performance Notes
Inference latency can be as low as 200 ms. Latency grows slowly with input size (number of frames) and quickly with output size (number of output tokens); it does not scale linearly with model size.
Tip: Use `maxOutputTokens` (JS) or `max_output_tokens` (Python) to keep responses short and latency low. See Output Token Limits.
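For example, capping output length in the JS SDK uses the same constructor as the Usage example above; the token limit of 48 here is an illustrative value, not a recommendation:

```js
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-4B',
  prompt: 'In one short sentence, describe what the camera sees',
  maxOutputTokens: 48, // cap response length to keep latency low
  source: { type: 'camera', cameraFacing: 'environment' },
  onResult: (result) => console.log(result.result)
})
```

Since latency scales quickly with output tokens, pairing a tight prompt ("in one short sentence") with a low token cap is the most direct lever for faster responses.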