Frame vs Clip Mode
For SDK-specific configuration, see JavaScript SDK or Python SDK.
The Overshoot SDK supports two processing modes that determine how video frames are analyzed:
- Frame Mode (default): Analyzes individual frames as static images — ideal for reading text, detecting objects, or analyzing still content
- Clip Mode: Analyzes short video clips with temporal context — ideal for understanding motion, actions, and events
Frame Mode
Frame mode captures and analyzes individual frames at regular intervals. Each frame is treated as a static image with no temporal context. This is the default mode.
Best for:
- Reading text, signs, and labels
- Document scanning and OCR
- Object detection in static scenes
- QR code and barcode scanning
- Analyzing dashboards or monitoring displays
```javascript
const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Read all visible text',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'frame',
  frameProcessing: {
    interval_seconds: 2.0 // Capture and analyze a frame every 2 seconds
  },
  onResult: (result) => {
    console.log(result.result)
  }
})
```
Frame Processing Parameters
- `interval_seconds` (0.1-60, default: 0.5): How often to capture and analyze a frame. Shorter intervals give more frequent results but use more resources.
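To make the trade-off concrete, here is a small sketch (not part of the SDK; the helper name is ours) that checks `interval_seconds` against the documented 0.1-60 range and estimates the resulting request rate:

```javascript
// Hypothetical helper: validates the documented interval_seconds range
// and estimates how many frames are analyzed per minute at that setting.
function frameIntervalInfo(intervalSeconds) {
  if (intervalSeconds < 0.1 || intervalSeconds > 60) {
    throw new RangeError('interval_seconds must be between 0.1 and 60')
  }
  return {
    interval_seconds: intervalSeconds,
    approxRequestsPerMinute: 60 / intervalSeconds
  }
}
```

At the default of 0.5 s this works out to roughly 120 analyzed frames per minute, which is why longer intervals are a simple way to cut resource usage.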
Clip Mode
Clip mode bundles multiple frames into short video clips before sending them to the AI. This gives the model temporal context to understand motion and events.
Best for:
- Sports and fitness form analysis
- Action recognition and event detection
- Gesture recognition
- Video content understanding
- Anything requiring motion or temporal context
```javascript
const vision = new RealtimeVision({
  apiKey: 'your-api-key', // Get yours at platform.overshoot.ai/api-keys
  model: 'Qwen/Qwen3.5-9B',
  prompt: 'Describe what the person is doing',
  source: { type: 'camera', cameraFacing: 'environment' },
  mode: 'clip',
  clipProcessing: {
    clip_length_seconds: 1.0, // Duration of each clip
    delay_seconds: 1.0,       // Time between results
    target_fps: 6             // Frames per second to sample (1-30)
  },
  onResult: (result) => {
    console.log(result.result)
  }
})
```
Clip Processing Parameters
- `target_fps` (1-30, default: 6): How many frames per second the server samples from the video stream. Must satisfy `target_fps × clip_length_seconds >= 3` (at least 3 frames per clip).
- `clip_length_seconds` (0.1-60, default: 0.5): How long each video clip is. Longer clips give more context but take longer to process.
- `delay_seconds` (0-60, default: 0.5): How often you get a new result. Smaller delays mean more frequent updates.
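The minimum-frames rule can be easy to trip over when shortening clips. This sketch (a hypothetical helper, not an SDK function) validates a `clipProcessing` config against the documented ranges and the `target_fps × clip_length_seconds >= 3` constraint:

```javascript
// Hypothetical validator for a clipProcessing config object.
// Defaults mirror the documented ones; returns the frames per clip.
function validateClipProcessing({
  target_fps = 6,
  clip_length_seconds = 0.5,
  delay_seconds = 0.5
} = {}) {
  if (target_fps < 1 || target_fps > 30) {
    throw new RangeError('target_fps must be between 1 and 30')
  }
  if (clip_length_seconds < 0.1 || clip_length_seconds > 60) {
    throw new RangeError('clip_length_seconds must be between 0.1 and 60')
  }
  if (delay_seconds < 0 || delay_seconds > 60) {
    throw new RangeError('delay_seconds must be between 0 and 60')
  }
  const framesPerClip = target_fps * clip_length_seconds
  if (framesPerClip < 3) {
    throw new RangeError(
      `target_fps × clip_length_seconds = ${framesPerClip}; need at least 3 frames per clip`
    )
  }
  return framesPerClip
}
```

Note that the defaults (6 fps × 0.5 s) land exactly on the 3-frame minimum, so lowering either value without raising the other will be rejected.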
Note (JS SDK): `fps`, `sampling_ratio`, and the `processing` parameter are deprecated in the JavaScript SDK — use `target_fps` and `clipProcessing`/`frameProcessing` instead. The API wire format still uses `processing` as the field name.
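For readers migrating older code, here is a before/after config sketch. The exact shape of the deprecated fields is an assumption made for contrast (the note above only names the fields), so treat the commented-out "before" as illustrative:

```javascript
// Before (deprecated JS SDK fields; old shape assumed for illustration):
// { mode: 'clip', processing: { fps: 6 /* , sampling_ratio: ... */ } }

// After (current JS SDK fields):
const config = {
  mode: 'clip',
  clipProcessing: {
    target_fps: 6,            // replaces the deprecated fps / sampling_ratio
    clip_length_seconds: 1.0,
    delay_seconds: 1.0
  }
}
```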
Choosing the Right Mode
| Use Case | Mode | Why |
|---|---|---|
| Read text from camera | Frame | Text doesn't require motion context |
| Analyze workout form | Clip | Need to see movement over time |
| Detect gestures | Clip | Gestures are temporal actions |
| Read document | Frame | Static image, faster and cheaper |
| Analyze sports action | Clip | Need temporal context for actions |
Trade-off: Frame mode is faster and cheaper but can't understand motion. Clip mode understands temporal events but uses more bandwidth and compute.
Note: Your processing interval also determines how many output tokens the model can produce per request. Shorter intervals mean shorter allowed responses. See Output Token Limits for the full breakdown.