Output Token Limits
TL;DR
- Most users don't need to set `maxOutputTokens` — the system picks the optimal value automatically based on your processing interval.
- The default limit is 128 effective tokens/second per stream. If you need more, reach out at founders@overshoot.ai and we'll increase it.
- Longer intervals = more tokens per request. A 2s interval allows up to 256 tokens per response.
- Set once at stream creation — cannot be changed mid-stream.
- Does not affect billing. Billing is based on stream duration, not token output.
How It Works
The `maxOutputTokens` parameter controls how many tokens the model can produce for each individual inference request. The system validates this against your stream's request rate:

```
effective_tokens_per_second = maxOutputTokens / interval
```

Where `interval` is `delay_seconds` (clip mode) or `interval_seconds` (frame mode). This value must not exceed 128 tok/s.
If You Don't Set maxOutputTokens
The system automatically picks the highest value that fits within the rate limit:

```
maxOutputTokens = floor(128 × interval)
```

This is the recommended approach for most users — you get the maximum possible output length for your chosen interval without doing any math.
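The auto-default rule above can be sketched as a small helper. (`defaultMaxOutputTokens` is a hypothetical name for illustration, not part of the SDK.)

```ts
const MAX_TOKENS_PER_SECOND = 128;

// Highest per-request token budget that keeps the stream within the
// 128 tok/s rate limit for a given processing interval (in seconds).
function defaultMaxOutputTokens(intervalSeconds: number): number {
  return Math.floor(MAX_TOKENS_PER_SECOND * intervalSeconds);
}

defaultMaxOutputTokens(2)   // 256 — a 2s interval allows 256 tokens per response
defaultMaxOutputTokens(0.2) // 25  — 128 × 0.2 = 25.6, floored
```

Note the floor: fractional intervals like 0.2s round the budget down, which is why the Quick Reference table below shows 25 rather than 25.6.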
If You Set maxOutputTokens Explicitly
The API validates your value. If it would exceed the rate limit, the request is rejected with HTTP 422:
```
Effective output token rate (150.0 tok/s) exceeds maximum of 128 tok/s.
max_output_tokens (300) / interval (2s) must be <= 128.
Reduce max_output_tokens to at most 256.
```

The error message is self-documenting — it shows the math and tells you the maximum allowed value.
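If you want to catch this before opening a stream, the server-side check can be mirrored client-side. (`validateMaxOutputTokens` is a hypothetical helper sketched here for illustration, not an SDK function; the API remains the source of truth.)

```ts
const MAX_TOKENS_PER_SECOND = 128;

// Mirrors the 422 validation: maxOutputTokens / interval must be <= 128 tok/s.
function validateMaxOutputTokens(
  maxOutputTokens: number,
  intervalSeconds: number
): { ok: boolean; message?: string } {
  const rate = maxOutputTokens / intervalSeconds;
  if (rate <= MAX_TOKENS_PER_SECOND) return { ok: true };
  const allowed = Math.floor(MAX_TOKENS_PER_SECOND * intervalSeconds);
  return {
    ok: false,
    message:
      `Effective output token rate (${rate.toFixed(1)} tok/s) exceeds maximum of ` +
      `${MAX_TOKENS_PER_SECOND} tok/s. Reduce max_output_tokens to at most ${allowed}.`,
  };
}

validateMaxOutputTokens(200, 2) // { ok: true } — 100 tok/s
validateMaxOutputTokens(300, 2) // rejected — 150 tok/s, at most 256 allowed
```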
The Formula
| Term | Definition |
|---|---|
| `maxOutputTokens` | Max tokens per single inference request |
| `requests_per_second` | `1 / delay_seconds` (clip) or `1 / interval_seconds` (frame) |
| `effective_tokens_per_second` | `maxOutputTokens × requests_per_second` |
Constraint: effective_tokens_per_second ≤ 128
Quick Reference
| Mode | Interval | Requests/sec | Auto-defaulted maxOutputTokens |
|---|---|---|---|
| Clip | 0.5s delay | 2 | 64 |
| Clip | 1.0s delay | 1 | 128 |
| Clip | 5.0s delay | 0.2 | 640 |
| Frame | 0.2s interval | 5 | 25 |
| Frame | 0.5s interval | 2 | 64 |
| Frame | 1.0s interval | 1 | 128 |
| Frame | 2.0s interval | 0.5 | 256 |
Key insight: Longer intervals = more tokens per request. If you need longer model responses, increase your processing interval.
Examples
Auto-default (recommended)
```ts
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what you see',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 2 },
  onResult: (result) => {
    console.log(result.result)
  }
})
// maxOutputTokens auto-set to 256 (floor(128 × 2))
```

Explicit value — accepted
```ts
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Is there a person? Answer yes or no.',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 2 },
  maxOutputTokens: 200, // 200 / 2 = 100 tok/s ≤ 128 ✓
  onResult: (result) => {
    console.log(result.result)
  }
})
```

Explicit value — rejected
```ts
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe everything in detail',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 2 },
  maxOutputTokens: 300, // 300 / 2 = 150 tok/s > 128 ✗ → 422 error
  onResult: (result) => {
    console.log(result.result)
  }
})
// Error: Reduce max_output_tokens to at most 256
```

Detecting Truncation
When the model's response exceeds `maxOutputTokens`, the output is cut off mid-generation. You can detect this via the `finish_reason` field:
```ts
onResult: (result) => {
  if (result.finish_reason === 'length') {
    console.warn('Output was truncated')
    // Consider increasing maxOutputTokens or using a longer interval
  }
}
```

See Output — Finish Reason for all possible values.