Intro

If you’re familiar with the Chat Completion API, feel free to skip to the next section. The Chat Completion API is the building block of almost every AI application today. Quick refresher: a Chat Completion Request contains two primary attributes:
  • model: The model the user would like to run
  • messages: A structured, chat-form representation of the prompt: a list of system messages, user messages ("user questions"), assistant messages, and tool calls.
request.json
{
  "model": "google/gemma-4-E4B-it",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What is the capital of France?"
        }
      ]
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "The capital of France is Paris."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "How about Morocco?"
        }
      ]
    }
  ]
}
Almost all inference providers today expose Chat Completion endpoints. You can call them over raw HTTP or via the openai SDK.
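For example, a raw HTTP call with Python's requests library might look like this (a sketch; the payload mirrors request.json above, and the path follows the OpenAI /v1/chat/completions convention):
import os

import requests

# Same payload shape as request.json above.
resp = requests.post(
    "https://api.overshoot.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OVERSHOOT_API_KEY']}"},
    json={
        "model": "google/gemma-4-E4B-it",
        "messages": [
            {"role": "user", "content": "What is the capital of France?"},
        ],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
Or, equivalently, with the openai SDK: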
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.overshoot.ai/v1",
    api_key=os.environ["OVERSHOOT_API_KEY"],
)

response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "what is the capital of france?"},
        {"role": "assistant", "content": "the capital of france is paris."},
        {"role": "user", "content": "how about morocco?"},
    ],
)

print(response.choices[0].message.content)
When the inference provider receives the request, it uses the model’s chat template to turn the messages into a single string and tokenizes it. In the example above, the inference provider will respond with “Rabat”.
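You can see what that rendered string looks like with Hugging Face transformers' apply_chat_template (a sketch; the model id here is illustrative, and any chat model's tokenizer with a chat template works):
from transformers import AutoTokenizer

# Illustrative model id; any model with a chat template works.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "How about Morocco?"},
]

# Render the chat into the single prompt string the model was trained on.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
For Vision Language Models (VLMs), image and video data can be passed in the message content in the form of bytes or URLs.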
{
	"role": "user",
	"content": [
		{
			"type": "text",
			"text": "how many people appeared in this video?"
		},
		{
			"type": "video_url",
			"video_url": {
				"url": "https://example.com/video.mp4"
			}
		}
	]
}
If a message contains an image or video, it is converted into tokens and interleaved with the text. Note: some models don’t allow you to interleave text and images.
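With the openai SDK (reusing the client from the earlier example), such a request might look like this (a sketch; the image URL is a placeholder):
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder URL; raw bytes can be passed as a data: URL instead.
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)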

Using Overshoot with Chat Completion

When you create a stream and connect it to your camera source, Overshoot allows you to reference parts (or all) of your stream in the message content of your Chat Completion Request with a simple URL.
  • Use type=image_url to refer to a particular frame in the video. For example, the last frame can be referenced as follows:
{
	"type": "image_url", 
	"image_url": {"url": "ovs://streams/{stream_id}?frame_index=-1"}
}
  • Use type=video_url to refer to a segment of the video. Unlike a frame, a video segment must have a start and an end. If no end is specified, we default to now. For example, to refer to the last 5 seconds of the stream:
{"type": "video_url", "video_url": {"url": "ovs://streams/{stream_id}?start_offset_ms=-5000"}}
  • To look at the first 5 minutes of the stream, use offsets measured from the start of the stream (non-negative values) and an explicit end_offset_ms:
{"type": "video_url", "video_url": {"url": "ovs://streams/{stream_id}?start_offset_ms=0&end_offset_ms=300000"}}
The mental model here is that when you create a Stream, Overshoot gives you a handle to reference any part (or all) of the stream in any conversation with any model. That is, you can:
  • Have a model look at multiple segments of the same stream in the same conversation
  • Have a model watch segments from different streams at the same time
  • Have a small model watch the stream for a specific event and escalate it to a bigger model (see the sketch below)
  • Have multiple models watch the same stream for a given event and escalate only if the event is triggered
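For instance, a minimal sketch of the escalation pattern, reusing the client from earlier (the model names, the polling loop, and the YES/NO protocol are all assumptions, not a prescribed API):
import time

SMALL_MODEL = "google/gemma-4-E4B-it"  # assumption: cheap watcher model
BIG_MODEL = "your-org/bigger-vlm"      # assumption: illustrative model id

def watch_and_escalate(stream_id: str) -> None:
    # Reference the last 5 seconds of the stream; end defaults to now.
    segment = {
        "type": "video_url",
        "video_url": {"url": f"ovs://streams/{stream_id}?start_offset_ms=-5000"},
    }
    while True:
        # Small model watches the stream for the event.
        verdict = client.chat.completions.create(
            model=SMALL_MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "Is a person visible? Answer YES or NO."},
                    segment,
                ],
            }],
        ).choices[0].message.content
        if "YES" in verdict.upper():
            # Escalate the same segment to the bigger model for a detailed answer.
            detail = client.chat.completions.create(
                model=BIG_MODEL,
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": "Describe what the person is doing."},
                        segment,
                    ],
                }],
            ).choices[0].message.content
            print(detail)
        time.sleep(5)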