Output Token Limits

TL;DR

  • Most users don't need to set maxOutputTokens — the system picks the optimal value automatically based on your processing interval.
  • The default limit is 128 effective tokens/second per stream. If you need more, reach out at founders@overshoot.ai and we'll increase it.
  • Longer intervals = more tokens per request. A 2s interval allows up to 256 tokens per response.
  • Set once at stream creation — cannot be changed mid-stream.
  • Does not affect billing. Billing is based on stream duration, not token output.

How It Works

The maxOutputTokens parameter controls how many tokens the model can produce for each individual inference request. The system validates this against your stream's request rate:

effective_tokens_per_second = maxOutputTokens / interval

Where interval is delay_seconds (clip mode) or interval_seconds (frame mode). This value must not exceed 128 tok/s.

If You Don't Set maxOutputTokens

The system automatically picks the highest value that fits within the rate limit:

maxOutputTokens = floor(128 × interval)

This is the recommended approach for most users — you get the maximum possible output length for your chosen interval without doing any math.
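The auto-default is just the rate limit times the interval, rounded down. A minimal sketch of that computation (the function name and constant are ours for illustration, not SDK exports):

```typescript
const MAX_TOKENS_PER_SECOND = 128;

// Highest maxOutputTokens that keeps the stream within the rate limit.
// `interval` is delay_seconds (clip mode) or interval_seconds (frame mode).
function defaultMaxOutputTokens(interval: number): number {
  return Math.floor(MAX_TOKENS_PER_SECOND * interval);
}

defaultMaxOutputTokens(2);    // 256
defaultMaxOutputTokens(0.2);  // 25 (floor of 25.6)
```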

If You Set maxOutputTokens Explicitly

The API validates your value. If it would exceed the rate limit, the request is rejected with HTTP 422:

Effective output token rate (150.0 tok/s) exceeds maximum of 128 tok/s.
max_output_tokens (300) / interval (2s) must be <= 128.
Reduce max_output_tokens to at most 256.

The error message is self-documenting — it shows the math and tells you the maximum allowed value.
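Because the check is pure arithmetic, you can reproduce it client-side before creating a stream and fail fast instead of waiting for the 422. A sketch of such a pre-check (`validateMaxOutputTokens` is our name, not an SDK export; the message format loosely mirrors the server's):

```typescript
const MAX_TOKENS_PER_SECOND = 128;

// Mirrors the server-side validation: rejects values whose effective
// token rate would exceed the per-stream limit.
function validateMaxOutputTokens(maxOutputTokens: number, interval: number): void {
  const rate = maxOutputTokens / interval;
  if (rate > MAX_TOKENS_PER_SECOND) {
    const allowed = Math.floor(MAX_TOKENS_PER_SECOND * interval);
    throw new Error(
      `Effective output token rate (${rate.toFixed(1)} tok/s) exceeds maximum of ` +
        `${MAX_TOKENS_PER_SECOND} tok/s. Reduce max_output_tokens to at most ${allowed}.`
    );
  }
}

validateMaxOutputTokens(200, 2); // 100 tok/s — passes silently
// validateMaxOutputTokens(300, 2) would throw: 150.0 tok/s > 128
```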

The Formula

Term                        | Definition
maxOutputTokens             | Max tokens per single inference request
requests_per_second         | 1 / delay_seconds (clip) or 1 / interval_seconds (frame)
effective_tokens_per_second | maxOutputTokens × requests_per_second

Constraint: effective_tokens_per_second ≤ 128

Quick Reference

Mode  | Interval      | Requests/sec | Auto-defaulted maxOutputTokens
Clip  | 0.5s delay    | 2            | 64
Clip  | 1.0s delay    | 1            | 128
Clip  | 5.0s delay    | 0.2          | 640
Frame | 0.2s interval | 5            | 25
Frame | 0.5s interval | 2            | 64
Frame | 1.0s interval | 1            | 128
Frame | 2.0s interval | 0.5          | 256

Key insight: Longer intervals = more tokens per request. If you need longer model responses, increase your processing interval.

Examples

Auto-default (recommended)

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what you see',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 2 },
  onResult: (result) => {
    console.log(result.result)
  }
})
// maxOutputTokens auto-set to 256 (floor(128 × 2))

Explicit value — accepted

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Is there a person? Answer yes or no.',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 2 },
  maxOutputTokens: 200,  // 200 / 2 = 100 tok/s ≤ 128 ✓
  onResult: (result) => {
    console.log(result.result)
  }
})

Explicit value — rejected

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe everything in detail',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 2 },
  maxOutputTokens: 300,  // 300 / 2 = 150 tok/s > 128 ✗ → 422 error
  onResult: (result) => {
    console.log(result.result)
  }
})
// Error: Reduce max_output_tokens to at most 256

Detecting Truncation

When the model's response exceeds maxOutputTokens, it gets cut off. You can detect this via the finish_reason field:

onResult: (result) => {
  if (result.finish_reason === 'length') {
    console.warn('Output was truncated')
    // Consider increasing maxOutputTokens or using a longer interval
  }
}
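If truncation happens on nearly every result, the stream's token budget is probably too small for the prompt. One option is to track consecutive truncations and flag when a reconfiguration (larger maxOutputTokens or a longer interval) is worth considering. A sketch of such a tracker, assuming the `onResult` shape shown above (`makeTruncationTracker` is our helper, not part of the SDK):

```typescript
// Returns a function that counts consecutive 'length' finishes and
// reports true once a threshold is reached.
function makeTruncationTracker(threshold = 3) {
  let consecutive = 0;
  return (finishReason: string): boolean => {
    consecutive = finishReason === 'length' ? consecutive + 1 : 0;
    return consecutive >= threshold; // true → consider a longer interval
  };
}

const truncated = makeTruncationTracker();
// Inside onResult:
//   if (truncated(result.finish_reason)) { /* recreate stream with a longer interval */ }
```

Since maxOutputTokens cannot be changed mid-stream, acting on this signal means creating a new stream with the adjusted settings.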

See Output — Finish Reason for all possible values.