Frame vs Clip Mode

For SDK-specific configuration, see JavaScript SDK or Python SDK.

The Overshoot SDK supports two processing modes that determine how video frames are analyzed:

  • Frame Mode (default): Analyzes individual frames as static images — ideal for reading text, detecting objects, or analyzing still content
  • Clip Mode: Analyzes short video clips with temporal context — ideal for understanding motion, actions, and events

Frame Mode

Frame mode captures and analyzes individual frames at regular intervals. Each frame is treated as a static image with no temporal context. This is the default mode.

Best for:

  • Reading text, signs, and labels
  • Document scanning and OCR
  • Object detection in static scenes
  • QR code and barcode scanning
  • Analyzing dashboards or monitoring displays

const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read all visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: {
    interval_seconds: 2.0  // Capture and analyze a frame every 2 seconds
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Frame Processing Parameters

  • interval_seconds (0.1-60, default: 0.5): How often to capture and analyze a frame. Shorter intervals give more frequent results but use more resources.
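To make the interval's effect concrete, here is a small illustrative helper (not part of the SDK) that clamps a requested interval to the documented 0.1–60 s range and reports the resulting update rate:

```javascript
// Illustrative only — not an SDK API. Clamps interval_seconds to the
// documented 0.1–60 range and computes how many frame analyses per
// minute that interval yields.
function frameInterval(requestedSeconds) {
  const interval = Math.min(60, Math.max(0.1, requestedSeconds));
  return {
    interval_seconds: interval,
    resultsPerMinute: 60 / interval, // analyses per minute at this interval
  };
}
```

For example, the `interval_seconds: 2.0` setting above yields about 30 results per minute, while the default of 0.5 yields 120.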

Clip Mode

Clip mode bundles multiple frames into short video clips before sending them to the AI. This gives the model temporal context to understand motion and events.

Best for:

  • Sports and fitness form analysis
  • Action recognition and event detection
  • Gesture recognition
  • Video content understanding
  • Anything requiring motion or temporal context

const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what the person is doing',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1.0, // Duration of each clip
    delay_seconds: 1.0,       // Time between results
    target_fps: 6             // Frames per second to sample (1-30)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})

Clip Processing Parameters

  • target_fps (1-30, default: 6): How many frames per second the server samples from the video stream. Must satisfy target_fps × clip_length_seconds >= 3 (at least 3 frames per clip).
  • clip_length_seconds (0.1-60, default: 0.5): How long each video clip is. Longer clips give more context but take longer to process.
  • delay_seconds (0-60, default: 0.5): How often you get a new result. Smaller delays mean more frequent updates.
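The three parameters interact, so it can help to check a configuration before sending it. This sketch (illustrative only, not an SDK function) validates a clip config against the documented ranges and the at-least-3-frames-per-clip rule:

```javascript
// Illustrative only — not an SDK API. Returns a list of problems with a
// clipProcessing config; an empty list means the config is valid.
// Defaults mirror the documented ones (target_fps 6, clip_length 0.5 s,
// delay 0.5 s).
function validateClipProcessing({
  target_fps = 6,
  clip_length_seconds = 0.5,
  delay_seconds = 0.5,
} = {}) {
  const errors = [];
  if (target_fps < 1 || target_fps > 30)
    errors.push('target_fps must be between 1 and 30');
  if (clip_length_seconds < 0.1 || clip_length_seconds > 60)
    errors.push('clip_length_seconds must be between 0.1 and 60');
  if (delay_seconds < 0 || delay_seconds > 60)
    errors.push('delay_seconds must be between 0 and 60');
  if (target_fps * clip_length_seconds < 3)
    errors.push('target_fps * clip_length_seconds must be >= 3 (at least 3 frames per clip)');
  return errors;
}
```

Note that the defaults sit exactly at the limit: 6 fps × 0.5 s = 3 frames per clip. Lowering either value without raising the other breaks the constraint.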

Note (JS SDK): fps, sampling_ratio, and the processing parameter are deprecated in the JavaScript SDK — use target_fps and clipProcessing/frameProcessing instead. The API wire format still uses processing as the field name.

Choosing the Right Mode

Use Case                 Mode   Why
Read text from camera    Frame  Text doesn't require motion context
Analyze workout form     Clip   Need to see movement over time
Detect gestures          Clip   Gestures are temporal actions
Read document            Frame  Static image, faster and cheaper
Analyze sports action    Clip   Need temporal context for actions

Trade-off: Frame mode is faster and cheaper but can't understand motion. Clip mode understands temporal events but uses more bandwidth and compute.
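The trade-off reduces to one question: does the task depend on how things change over time? A hypothetical helper (not part of the SDK) capturing that decision:

```javascript
// Hypothetical helper — not an SDK API. Picks the cheaper frame mode
// unless the task depends on motion or events unfolding over time.
function pickMode(needsTemporalContext) {
  return needsTemporalContext ? 'clip' : 'frame';
}
```

For example, reading a label would use `pickMode(false)` → `'frame'`, while analyzing a squat would use `pickMode(true)` → `'clip'`.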

Note: Your processing interval also determines how many output tokens the model can produce per request. Shorter intervals mean shorter allowed responses. See Output Token Limits for the full breakdown.