Frame vs Clip Mode

For SDK-specific configuration, see JavaScript SDK or Python SDK.

The Overshoot SDK supports two processing modes that determine how video frames are analyzed:

  • Frame Mode (default): Analyzes individual frames as static images — ideal for reading text, detecting objects, or analyzing still content
  • Clip Mode: Analyzes short video clips with temporal context — ideal for understanding motion, actions, and events

Frame Mode

Frame mode captures and analyzes individual frames at regular intervals. Each frame is treated as a static image with no temporal context. This is the default mode.

Best for:

  • Reading text, signs, and labels
  • Document scanning and OCR
  • Object detection in static scenes
  • QR code and barcode scanning
  • Analyzing dashboards or monitoring displays

const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read all visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: {
    interval_seconds: 2.0  // Capture and analyze a frame every 2 seconds
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Frame Processing Parameters

  • interval_seconds (0.1-60, default: 0.5): How often to capture and analyze a frame. Shorter intervals give more frequent results but use more resources.
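To make the interval's effect concrete, here is a small illustrative helper (not part of the SDK) that clamps a requested interval to the documented 0.1–60 s range and reports the resulting update rate:

```javascript
// Illustrative only — not an SDK API. Clamps interval_seconds to the
// documented 0.1–60 range and computes how many frame analyses per
// minute that interval yields.
function frameInterval(requestedSeconds) {
  const interval = Math.min(60, Math.max(0.1, requestedSeconds));
  return {
    interval_seconds: interval,
    resultsPerMinute: 60 / interval, // analyses per minute at this interval
  };
}
```

For example, the `interval_seconds: 2.0` setting above yields about 30 results per minute, while the default of 0.5 yields 120.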

Clip Mode

Clip mode bundles multiple frames into short video clips before sending them to the AI. This gives the model temporal context to understand motion and events.

Best for:

  • Sports and fitness form analysis
  • Action recognition and event detection
  • Gesture recognition
  • Video content understanding
  • Anything requiring motion or temporal context

const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what the person is doing',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1.0, // Duration of each clip
    delay_seconds: 1.0,       // Time between results
    target_fps: 6             // Frames per second to sample (1-30)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Clip Processing Parameters

  • target_fps (1-30, default: 6): How many frames per second the server samples from the video stream. Must satisfy target_fps × clip_length_seconds >= 3 (at least 3 frames per clip).
  • clip_length_seconds (0.1-60, default: 0.5): How long each video clip is. Longer clips give more context but take longer to process.
  • delay_seconds (0-60, default: 0.5): How often you get a new result. Smaller delays mean more frequent updates.
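The three parameters interact, so it can help to check a configuration before sending it. This sketch (illustrative only, not an SDK function) validates a clip config against the documented ranges and the at-least-3-frames-per-clip rule:

```javascript
// Illustrative only — not an SDK API. Returns a list of problems with a
// clipProcessing config; an empty list means the config is valid.
// Defaults mirror the documented ones (target_fps 6, clip_length 0.5 s,
// delay 0.5 s).
function validateClipProcessing({
  target_fps = 6,
  clip_length_seconds = 0.5,
  delay_seconds = 0.5,
} = {}) {
  const errors = [];
  if (target_fps < 1 || target_fps > 30)
    errors.push('target_fps must be between 1 and 30');
  if (clip_length_seconds < 0.1 || clip_length_seconds > 60)
    errors.push('clip_length_seconds must be between 0.1 and 60');
  if (delay_seconds < 0 || delay_seconds > 60)
    errors.push('delay_seconds must be between 0 and 60');
  if (target_fps * clip_length_seconds < 3)
    errors.push('target_fps * clip_length_seconds must be >= 3 (at least 3 frames per clip)');
  return errors;
}
```

Note that the defaults sit exactly at the limit: 6 fps × 0.5 s = 3 frames per clip. Lowering either value without raising the other breaks the constraint.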

Note (JS SDK): fps, sampling_ratio, and the processing parameter are deprecated in the JavaScript SDK — use target_fps and clipProcessing/frameProcessing instead. The API wire format still uses processing as the field name.

Choosing the Right Mode

Use Case                 Mode   Why
Read text from camera    Frame  Text doesn't require motion context
Analyze workout form     Clip   Need to see movement over time
Detect gestures          Clip   Gestures are temporal actions
Read document            Frame  Static image, faster and cheaper
Analyze sports action    Clip   Need temporal context for actions

Trade-off: Frame mode is faster and cheaper but can't understand motion. Clip mode understands temporal events but uses more bandwidth and compute.
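The trade-off reduces to one question: does the task depend on how things change over time? A hypothetical helper (not part of the SDK) capturing that decision:

```javascript
// Hypothetical helper — not an SDK API. Picks the cheaper frame mode
// unless the task depends on motion or events unfolding over time.
function pickMode(needsTemporalContext) {
  return needsTemporalContext ? 'clip' : 'frame';
}
```

For example, reading a label would use `pickMode(false)` → `'frame'`, while analyzing a squat would use `pickMode(true)` → `'clip'`.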

Note: Your processing interval also determines how many output tokens the model can produce per request. Shorter intervals mean shorter allowed responses. See Output Token Limits for the full breakdown.