Latency & Performance

What Determines Latency

Two factors have the most impact on how fast you get results:

  1. Output tokens — the biggest lever. Tokens are generated one at a time, so more output tokens mean proportionally more time. If you only need a short answer, cap it with maxOutputTokens (JS) or max_output_tokens (Python). See Output Token Limits.

  2. Input size — more frames per inference means more to process. Frame mode (single image) is faster than clip mode (multiple frames). Use clip mode only when you need temporal context.
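Both levers show up directly in how you build a request. A minimal sketch — the `max_output_tokens` parameter name comes from the docs above, but the surrounding payload structure and helper are illustrative assumptions, not the SDK's actual API:

```python
# Illustrative only: a hypothetical request payload exercising both latency
# levers. Only max_output_tokens is a documented name; the rest is assumed.

def build_request(prompt: str, mode: str, max_output_tokens: int) -> dict:
    """Build a hypothetical inference request with a hard output-token cap."""
    if mode not in ("frame", "clip"):
        raise ValueError("mode must be 'frame' or 'clip'")
    return {
        "prompt": prompt,
        "mode": mode,                            # "frame" = 1 image, faster
        "max_output_tokens": max_output_tokens,  # biggest latency lever
    }

# A short yes/no check needs neither temporal context nor a long answer:
req = build_request("Is the door open? Answer yes or no.", "frame", 10)
```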

Model size matters, but not in the way you might expect. Some larger models use Mixture of Experts (MoE) architectures that activate only a fraction of their parameters per token, making them faster than smaller dense models. Models are tagged as fast or slow in the Playground.

How to Optimize

  • Cap output tokens — a "yes/no" answer doesn't need 128 tokens. Set maxOutputTokens: 10 and save hundreds of milliseconds.
  • Use frame mode for static tasks — OCR, object detection, and document scanning don't need motion context. A single frame is faster than a 6-frame clip.
  • Increase your interval for longer responses — a 2s interval allows up to 256 output tokens per request. A 0.5s interval allows only 64. See Output Token Limits for the formula.
  • Write shorter prompts — the model reads your prompt on every inference. A concise prompt means less input processing.
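The interval/token relationship in the third bullet can be sketched as arithmetic. The authoritative formula is in Output Token Limits; from the two data points given here (2s → 256 tokens, 0.5s → 64) the budget appears to scale at 128 tokens per second of interval, so treat that constant as an inferred assumption:

```python
# Assumed from the examples above: 2s -> 256 tokens, 0.5s -> 64 tokens,
# i.e. roughly 128 output tokens per second of interval.
TOKENS_PER_SECOND = 128

def max_output_tokens(interval_seconds: float) -> int:
    """Output-token budget implied by the result interval."""
    return int(TOKENS_PER_SECOND * interval_seconds)

print(max_output_tokens(2.0))   # 256
print(max_output_tokens(0.5))   # 64
```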

Iterate in the Playground

The Playground shows real-time latency for every result. Use it to:

  • Compare models — some larger models are faster than smaller ones
  • Test different prompts — see how prompt length and specificity affect output quality and speed
  • Tune processing parameters — adjust interval_seconds, clip_length_seconds, and target_fps before writing code
  • Find your tradeoff — the right balance of quality and speed depends on your use case

Start in the Playground, then move to the SDK once you know what works.
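The Playground values carry over directly as SDK configuration. A minimal sketch — the parameter names (`interval_seconds`, `clip_length_seconds`, `target_fps`) are the documented ones, but the enclosing dict shape and the sample values are assumptions for illustration:

```python
# Hypothetical config carrying tuned Playground values into code.
# Parameter names come from the docs; the structure is illustrative.
processing_config = {
    "mode": "clip",              # multiple frames, for temporal context
    "interval_seconds": 2.0,     # request a new inference every 2s
    "clip_length_seconds": 1.0,  # how much video each clip covers
    "target_fps": 6,             # frames sampled per second of clip
}

# Frames per inference = clip length x sampling rate; fewer frames = faster.
frames_per_clip = int(processing_config["clip_length_seconds"]
                      * processing_config["target_fps"])
```

Lowering either `clip_length_seconds` or `target_fps` shrinks the input per inference, which is the second latency lever described above.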

Result Interval vs Inference Latency

These are two different things:

  • Inference latency (inference_latency_ms in the result) — how long the model took to process one frame or clip and produce a response. This is the actual model speed.
  • Result interval (delay_seconds in clip mode, interval_seconds in frame mode) — how often you request a new inference. This controls how frequently results arrive.

Total latency (total_latency_ms in the result) is end-to-end: from frame capture to result delivery, including network time.

If your inference latency is 300ms but your interval is 2s, you'll get a result every 2 seconds — the model is idle between requests. If your interval is 0.3s but inference takes 500ms, results will queue up and arrive as fast as the model can process them.

For most applications, set your interval slightly above your expected inference latency.
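The two scenarios above reduce to one rule: results arrive at whichever rate is slower, your interval or the model itself. A sketch in pure arithmetic, with no SDK calls:

```python
def result_period_s(interval_s: float, inference_latency_s: float) -> float:
    """How often results actually arrive: bounded by the slower of the two."""
    return max(interval_s, inference_latency_s)

# Interval 2s, inference 0.3s: the model idles, a result every 2s.
print(result_period_s(2.0, 0.3))   # 2.0
# Interval 0.3s, inference 0.5s: requests queue, a result every 0.5s.
print(result_period_s(0.3, 0.5))   # 0.5
```

Setting the interval slightly above expected inference latency keeps the model busy without building a queue.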