Latency & Performance
What Determines Latency
Two factors have the most impact on how fast you get results:
- Output tokens — the biggest lever. More tokens means more time. If you only need a short answer, cap it with `maxOutputTokens` (JS) or `max_output_tokens` (Python). See Output Token Limits.
- Input size — more frames per inference means more to process. Frame mode (single image) is faster than clip mode (multiple frames). Use clip mode only when you need temporal context.
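As a rough sketch of the two levers above, here are two illustrative configurations. The dictionary keys mirror the parameter names mentioned in this doc, but the config shape itself is hypothetical, not the actual SDK surface:

```python
# Hypothetical session configurations (the dict shape is illustrative,
# not the real SDK API). The two biggest latency levers are output
# tokens and input size.

fast_config = {
    "mode": "frame",           # single image per inference: less input to process
    "max_output_tokens": 10,   # short answers return sooner
}

slow_config = {
    "mode": "clip",            # multiple frames: use only for temporal context
    "clip_length_seconds": 2,
    "max_output_tokens": 128,  # longer answers take longer to generate
}
```

The fast configuration trades temporal context and answer length for speed; pick it when a single frame and a short answer are enough.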
Model size matters, but not in the way you might expect. Some larger models use Mixture of Experts (MoE) architectures that activate only a fraction of their parameters per token, making them faster than smaller dense models. Models are tagged as fast or slow in the Playground.
How to Optimize
- Cap output tokens — a "yes/no" answer doesn't need 128 tokens. Set `maxOutputTokens: 10` and save hundreds of milliseconds.
- Use frame mode for static tasks — OCR, object detection, and document scanning don't need motion context. A single frame is faster than a 6-frame clip.
- Increase your interval for longer responses — a 2s interval allows up to 256 output tokens per request. A 0.5s interval allows only 64. See Output Token Limits for the formula.
- Write shorter prompts — the model reads your prompt on every inference. A concise prompt means less input processing.
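The interval-to-token numbers quoted above (2s → 256 tokens, 0.5s → 64 tokens) imply a linear budget of roughly 128 output tokens per second of interval. A minimal sketch, assuming that linear rule holds (the authoritative formula is in Output Token Limits):

```python
def max_tokens_for_interval(interval_seconds: float) -> int:
    """Approximate output-token budget for a given result interval.

    Derived from the examples in the text (2s -> 256 tokens,
    0.5s -> 64 tokens), i.e. ~128 tokens per second of interval.
    This constant is inferred, not an official SDK value.
    """
    TOKENS_PER_SECOND = 128
    return int(TOKENS_PER_SECOND * interval_seconds)

print(max_tokens_for_interval(2.0))  # 256
print(max_tokens_for_interval(0.5))  # 64
```

If your prompt needs longer answers than the budget allows, increase the interval rather than truncating mid-response.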
Iterate in the Playground
The Playground shows real-time latency for every result. Use it to:
- Compare models — some larger models are faster than smaller ones
- Test different prompts — see how prompt length and specificity affect output quality and speed
- Tune processing parameters — adjust `interval_seconds`, `clip_length_seconds`, and `target_fps` before writing code
- Find your tradeoff — the right balance of quality and speed depends on your use case
Start in the Playground, then move to the SDK once you know what works.
Result Interval vs Inference Latency
These are two different things:
- Inference latency (`inference_latency_ms` in the result) — how long the model took to process one frame or clip and produce a response. This is the actual model speed.
- Result interval (`delay_seconds` in clip mode, `interval_seconds` in frame mode) — how often you request a new inference. This controls how frequently results arrive.
Total latency (`total_latency_ms` in the result) is end-to-end: from frame capture to result delivery, including network time.
If your inference latency is 300ms but your interval is 2s, you'll get a result every 2 seconds — the model is idle between requests. If your interval is 0.3s but inference takes 500ms, results will queue up and arrive as fast as the model can process them.
For most applications, set your interval slightly above your expected inference latency.
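The relationship described above can be sketched as a simple model (pure arithmetic, no SDK calls): the effective time between results is whichever of the two quantities is larger.

```python
def effective_result_interval(interval_s: float, inference_latency_s: float) -> float:
    """Approximate time between consecutive results.

    If the model finishes before the next scheduled request, the
    interval dominates (the model idles between requests). If
    inference is slower than the interval, requests queue and
    results arrive at model speed instead.
    """
    return max(interval_s, inference_latency_s)

# 300ms inference, 2s interval: a result every 2s, model idle in between
print(effective_result_interval(2.0, 0.3))  # 2.0
# 0.3s interval, 500ms inference: results arrive every ~0.5s (queueing)
print(effective_result_interval(0.3, 0.5))  # 0.5
```

This is why setting the interval slightly above your measured inference latency keeps results flowing steadily without building a queue.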