JavaScript SDK
Configuration

Most apps only need a few options:

import { RealtimeVision } from 'overshoot'
 
const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read any visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  onResult: (result) => {
    console.log(result.result)
  }
})

Updating the Prompt

You can change the prompt while the stream is running. This is useful when you want to ask different questions about the video without restarting.

vision.updatePrompt('Count the number of people')

The next result will use the new prompt.
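For example, you could cycle through a set of questions on a timer. The sketch below is illustrative: the helper and prompt strings are our own, and `vision` stands for an already-running RealtimeVision instance.

```javascript
// Illustrative prompts; the stream starts with the first one (set in the config)
const prompts = [
  'Read any visible text',
  'Count the number of people',
  'Describe the scene in one sentence'
]

// Returns a function that yields the next prompt in the list,
// wrapping around after the last one.
function makePromptRotator(promptList) {
  let index = 0
  return () => {
    index = (index + 1) % promptList.length
    return promptList[index]
  }
}

const nextPrompt = makePromptRotator(prompts)
// Switch prompts every 5 seconds without restarting the stream:
// setInterval(() => vision.updatePrompt(nextPrompt()), 5000)
```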

Max Output Tokens

Use maxOutputTokens to cap how many tokens the model generates per inference. This is useful when you only need short responses (e.g., a single word or a small JSON object).

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Is there a person? Answer yes or no.',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: { interval_seconds: 0.5 },
  maxOutputTokens: 10,
  onResult: (result) => {
    console.log(result.result)
  }
})

If you don't set maxOutputTokens, the server automatically picks the optimal value based on your processing interval. The default rate limit is 128 effective tokens per second per stream -- if you need more, reach out at founders@overshoot.ai.
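As a rough illustration of what that rate limit implies, a simple rate × interval estimate gives the per-inference budget. This is only a back-of-the-envelope sketch; the server's actual formula is documented in Output Token Limits.

```javascript
// Illustrative only: approximate the per-inference token budget as
// rate limit (tokens/sec) x processing interval (sec).
function approxTokenBudget(tokensPerSecond, intervalSeconds) {
  return Math.floor(tokensPerSecond * intervalSeconds)
}

// At the default 128 effective tokens/sec with a 0.5 s frame interval:
approxTokenBudget(128, 0.5)  // → 64 tokens per inference
```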

If output gets truncated, the result's finish_reason will be "length" -- see Output.
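A minimal onResult handler that flags truncation might look like this. It assumes only the fields described above (`result` for the text, `finish_reason` for the stop condition):

```javascript
// Log a warning when the model's output was cut off by maxOutputTokens,
// then return the (possibly truncated) text.
function handleResult(result) {
  if (result.finish_reason === 'length') {
    console.warn('Output truncated; consider raising maxOutputTokens')
  }
  return result.result
}

handleResult({ result: 'yes', finish_reason: 'stop' })  // → 'yes'
```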

For the full breakdown of how token limits work, including the formula, reference table, and examples, see Output Token Limits.

Processing Parameters

The SDK supports two processing modes: frame mode (default, analyzes individual frames as static images) and clip mode (analyzes video clips with temporal context). For detailed information about choosing the right mode, see Processing Modes.

Quick Reference

Frame mode (default) -- for static image analysis:

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read all visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: {
    interval_seconds: 0.5  // Capture a frame every 0.5 seconds (default)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Clip mode -- for motion and temporal understanding:

const vision = new RealtimeVision({
  apiKey: 'your-api-key',
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what the person is doing',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1,    // Duration of each clip
    delay_seconds: 1,          // Time between results
    target_fps: 6              // Frames per second to sample (1-30)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Deprecated Parameters

fps and sampling_ratio are deprecated -- use target_fps instead. The processing parameter is also deprecated in favor of clipProcessing and frameProcessing. This is a JS SDK naming change; the API wire format still uses processing.
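A before/after sketch of the rename, with illustrative values (only the parameter names above are from the SDK):

```javascript
// Before (deprecated JS SDK shape):
const deprecatedOptions = {
  mode: 'clip',
  processing: {                // deprecated in favor of clipProcessing
    clip_length_seconds: 1,
    fps: 6                     // deprecated in favor of target_fps
  }
}

// After (current JS SDK shape):
const currentOptions = {
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1,
    target_fps: 6
  }
}
```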

Processing Visualization

Play with the sliders below to see how processing parameters affect frame sampling.

[Interactive "Stream Processing" visualization: live cursor, processing window, and sampled frames, with sliders for frames per clip and effective FPS.]