JavaScript SDK
Configuration

Most apps only need a few options:

import { RealtimeVision } from 'overshoot'
 
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read any visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  onResult: (result) => {
    console.log(result.result)
  }
})

Updating the Prompt

You can change the prompt while the stream is running. This is useful when you want to ask different questions about the video without restarting.

vision.updatePrompt('Count the number of people')

The next result will use the new prompt.
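For example, you could cycle through a set of questions on a timer. The sketch below is illustrative: the helper and prompt strings are our own, and `vision` stands for an already-running RealtimeVision instance.

```javascript
// Illustrative prompts; the stream starts with the first one (set in the config)
const prompts = [
  'Read any visible text',
  'Count the number of people',
  'Describe the scene in one sentence'
]

// Returns a function that yields the next prompt in the list,
// wrapping around after the last one.
function makePromptRotator(promptList) {
  let index = 0
  return () => {
    index = (index + 1) % promptList.length
    return promptList[index]
  }
}

const nextPrompt = makePromptRotator(prompts)
// Switch prompts every 5 seconds without restarting the stream:
// setInterval(() => vision.updatePrompt(nextPrompt()), 5000)
```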

Max Output Tokens

Use maxOutputTokens to cap how many tokens the model generates per inference. This is useful when you only need short responses (e.g., a single word or a small JSON object).

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Is there a person? Answer yes or no.',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 0.5 },
  maxOutputTokens: 10,
  onResult: (result) => {
    console.log(result.result)
  }
})

If you don't set maxOutputTokens, the server automatically picks the optimal value based on your processing interval. The default rate limit is 128 effective tokens per second per stream -- if you need more, reach out at founders@overshoot.ai.
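As a rough illustration of what that rate limit implies, a simple rate × interval estimate gives the per-inference budget. This is only a back-of-the-envelope sketch; the server's actual formula is documented in Output Token Limits.

```javascript
// Illustrative only: approximate the per-inference token budget as
// rate limit (tokens/sec) x processing interval (sec).
function approxTokenBudget(tokensPerSecond, intervalSeconds) {
  return Math.floor(tokensPerSecond * intervalSeconds)
}

// At the default 128 effective tokens/sec with a 0.5 s frame interval:
approxTokenBudget(128, 0.5)  // → 64 tokens per inference
```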

If output gets truncated, the result's finish_reason will be "length" -- see Output.
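A minimal onResult handler that flags truncation might look like this. It assumes only the fields described above (`result` for the text, `finish_reason` for the stop condition):

```javascript
// Log a warning when the model's output was cut off by maxOutputTokens,
// then return the (possibly truncated) text.
function handleResult(result) {
  if (result.finish_reason === 'length') {
    console.warn('Output truncated; consider raising maxOutputTokens')
  }
  return result.result
}

handleResult({ result: 'yes', finish_reason: 'stop' })  // → 'yes'
```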

For the full breakdown of how token limits work, including the formula, reference table, and examples, see Output Token Limits.

Processing Parameters

The SDK supports two processing modes: frame mode (default, analyzes individual frames as static images) and clip mode (analyzes video clips with temporal context). For detailed information about choosing the right mode, see Processing Modes.

Quick Reference

Frame mode (default) -- for static image analysis:

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read all visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: {
    interval_seconds: 0.5  // Capture a frame every 0.5 seconds (default)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Clip mode -- for motion and temporal understanding:

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what the person is doing',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1,    // Duration of each clip
    delay_seconds: 1,          // Time between results
    target_fps: 6              // Frames per second to sample (1-30)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Deprecated Parameters

fps and sampling_ratio are deprecated -- use target_fps instead. The processing parameter is also deprecated in favor of clipProcessing and frameProcessing. This is a JS SDK naming change; the API wire format still uses processing.
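A before/after sketch of the rename, with illustrative values (only the parameter names above are from the SDK):

```javascript
// Before (deprecated JS SDK shape):
const deprecatedOptions = {
  mode: 'clip',
  processing: {                // deprecated in favor of clipProcessing
    clip_length_seconds: 1,
    fps: 6                     // deprecated in favor of target_fps
  }
}

// After (current JS SDK shape):
const currentOptions = {
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1,
    target_fps: 6
  }
}
```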

Processing Visualization

Play with the sliders below to see how processing parameters affect frame sampling.

[Interactive "Stream Processing" visualization: live cursor, processing window, and sampled frames, with sliders for frames per clip and effective FPS.]