Models

Overshoot runs models optimized for low-latency real-time inference.

Model availability can change. Use the /models endpoint to check current status before starting a stream:

curl https://api.overshoot.ai/v0.2/models

See API Reference for status field details.
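Availability can also be checked programmatically before picking a model. A minimal sketch in JavaScript — the response shape here (`id` and `status` fields, with `'available'` as a status value) is an assumption for illustration; see the API Reference for the authoritative fields:

```javascript
// Pick the first preferred model whose status is 'available'.
// NOTE: the payload shape below is assumed for illustration only;
// check the API Reference for the actual /models response format.
function firstAvailable(models, preferred) {
  for (const id of preferred) {
    const entry = models.find((m) => m.id === id && m.status === 'available');
    if (entry) return entry.id;
  }
  return null;
}

// Example payload shaped like a hypothetical /models response:
const models = [
  { id: 'Qwen/Qwen3.5-27B', status: 'offline' },
  { id: 'Qwen/Qwen3.5-9B', status: 'available' },
];

const choice = firstAvailable(models, ['Qwen/Qwen3.5-27B', 'Qwen/Qwen3.5-9B']);
console.log(choice); // falls back to Qwen/Qwen3.5-9B since 27B is offline
```

This lets a client prefer the best model but degrade gracefully to a smaller one when the first choice is unavailable.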

Picking a Model

| If you need...                        | Use             | Why                                                        |
| ------------------------------------- | --------------- | ---------------------------------------------------------- |
| Best overall quality                  | Qwen3.5-27B     | Leads on video, coding, instruction following              |
| High-throughput vision                | Qwen3.5-35B-A3B | Near-27B vision at a fraction of compute (MoE, 3B active)  |
| Strong vision at moderate size        | Qwen3.5-9B      | Beats last-gen 30B on all vision benchmarks                |
| Video understanding at lowest latency | Qwen3.5-4B      | 96% of 27B's video score at a fraction of the size         |
| OCR and document scanning             | Qwen3.5-2B      | OCR beats models 2x its size, fastest response time        |

For detailed benchmarks and per-model analysis, see our blog post: Qwen 3.5 on Overshoot.

Available Models

Large (27B+)

  • Qwen/Qwen3.5-35B-A3B — MoE, best for throughput-heavy vision and UI agents
  • Qwen/Qwen3.5-27B — dense, best all-rounder
  • Qwen/Qwen3-VL-32B-Instruct-FP8
  • Qwen/Qwen3-VL-30B-A3B-Instruct
  • OpenGVLab/InternVL3_5-30B-A3B

Medium (8-9B)

  • Qwen/Qwen3.5-9B — recommended starting point for most developers
  • Qwen/Qwen3-VL-8B-Instruct
  • allenai/Molmo2-8B
  • Kwai-Keye/Keye-VL-1_5-8B
  • openbmb/MiniCPM-V-4_5

Small (2-4B)

  • Qwen/Qwen3.5-4B — strong video understanding for its size
  • Qwen/Qwen3.5-2B — specialist for OCR and document scanning
  • Qwen/Qwen3-VL-4B-Instruct

Usage

const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read any visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  onResult: (result) => {
    console.log(result.result)
  }
})

Thinking Mode

Qwen 3.5 models support thinking mode (<think> tags), which allows the model to reason before responding. Thinking mode is disabled by default to ensure real-time performance. If you'd like thinking mode enabled for your use case, contact us.

Note: On the 2B and 4B models, thinking mode can get stuck in repetition loops. Non-thinking mode is more reliable in production and actually scores better on OCR tasks.

Performance Notes

Inference latency can be as low as 200ms. Latency grows slowly with input size (number of frames) and quickly with output size (number of output tokens). It does not scale linearly with model size.

Tip: Use maxOutputTokens (JS) or max_output_tokens (Python) to keep responses short and latency low. See Output Token Limits.
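For example, the Usage snippet above can be extended with an output cap. This is a sketch; the value `50` is illustrative, and the right limit depends on how long your prompt's answers need to be:

```javascript
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-2B',
  prompt: 'Read any visible text. Reply with the text only.',
  source: { type: 'camera', cameraFacing: 'environment' },
  maxOutputTokens: 50, // caps output length; shorter responses mean lower latency
  onResult: (result) => {
    console.log(result.result)
  }
})
```

Since latency scales quickly with output tokens, a tight cap plus a prompt that asks for a terse answer is the most effective latency lever.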