-+= Update 26 Feb 2026 | AI Language wins top prize in Daydream Scope Plugin contest! =+-
Real-time AI plugins that close the loop between seeing and generating. The system watches a video stream, reasons about what it sees in real time, and continuously steers the AI image generation based on that understanding.
A vision language model (VLM) produces semantic descriptions: the mood of a crowd, the species of an animal, the weather in a landscape, the emotional tone of a scene. Those descriptions can optionally feed into a second preprocessor backed by a large language model (LLM), which rewrites them as rich diffusion prompts, shaping what the AI generates, frame by frame, in real time.
4.mp4
Example: Point the camera at a cat. Ask the VLM "what are the natural predators of what you see in three words?". It answers "eagles, foxes, coyotes". That response becomes the live diffusion prompt. The AI no longer renders a cat; it renders whatever is hunting it, morphing dynamically as the VLM's answers evolve with each new inference.
The generation doesn't follow a fixed script. It follows the scene. Prompt state changes smoothly via temporal interpolation rather than cutting abruptly between semantic states. Multiple plugins can run in parallel, chained, or driven from external tools (OSC, UDP) for live performance and installation contexts.
Example: By drawing live into the feed (e.g., using local Spout streaming), the VLM can drive a visual auto-complete. While initial strokes are ambiguous, the VLM's inferences converge as detail accumulates, providing increasingly accurate interpretations. Those are fed directly into the live video generation, serving both as a live autocomplete and as a means of creating animated drawings.
streamdiffusion2_compressed.mp4
- Test server A: RTX PRO 4050
- Test server B: RTX PRO 6000
- Test server C: RTX PRO 6000, latest
- Pipeline ID: streamdiffusionv2
- Preprocessor: vlm-ollama-pre
- Ollama URL: http://157.157.221.29:23058
- Model: llava:7b
- Postprocessor: vlm-ollama-post
The plugins slot into Daydream Scope's preprocessor / postprocessor pipeline architecture. A typical split chain:
```
Camera → [VLM Pre] ──────────────────────→ [AI Model] → [VLM Post] → Output
             │ UDP multicast 239.255.42.99      ↑
             └──→ [UDP Prompt] ─── prompts ─────┘
```
- scope-vlm-ollama queries an Ollama vision model on each frame at a configurable interval. Runs as a preprocessor (queries the raw feed, injects the VLM response as a diffusion prompt and broadcasts it via UDP), a postprocessor (receives the UDP text and overlays it on the AI output), or as a combined main pipeline.
- scope-llm-ollama sends text to an Ollama LLM and injects the rewritten response as a diffusion prompt. Use it to transform a short observation into an elaborate scene description, style directive, or creative prompt.
- scope-udp-prompt / scope-osc-prompt receive prompts from any external source via UDP multicast or OSC and inject them into the pipeline, bridging Python scripts, TouchDesigner, Ableton Live, Max/MSP, or any custom controller.
Semantic responses are broadcast over UDP multicast so any number of downstream plugins receive them simultaneously: fan-out with no additional routing. The port number acts as a channel: any plugin listening on the same port gets every message.
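For reference, listening on that channel needs only the standard library. A minimal sketch of a receiver (the group and port match the defaults used elsewhere in this document; `make_receiver` is an illustrative name, not part of the plugin API):

```python
import socket
import struct

MULTICAST_GROUP = "239.255.42.99"
PORT = 9400

def make_receiver(group: str = MULTICAST_GROUP, port: int = PORT) -> socket.socket:
    """Join the multicast group; every receiver on this port sees every message."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # IP_ADD_MEMBERSHIP takes the group address plus the local interface (any)
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(5.0)
    return sock

receiver = make_receiver()
# text, _ = receiver.recvfrom(4096)  # blocks until something broadcasts on this port
```

Any number of such receivers can run on the same port; multicast delivers each message to all of them, which is what makes the fan-out free.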
Prompt transitions use temporal interpolation (slerp or linear) to blend smoothly between semantic states over a configurable number of frames, rather than snapping abruptly when the VLM's description changes.
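The blending itself happens inside Scope's transition API; to illustrate the idea in isolation, a minimal numpy sketch of slerp between two hypothetical prompt embeddings (shapes and names here are illustrative, not Scope's internals):

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical interpolation from a (t=0) to b (t=1)."""
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    dot = np.clip(np.dot(a_n, b_n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < 1e-6:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

# Blend from the old prompt embedding to the new one over a fixed number of frames
old, new = np.random.randn(768), np.random.randn(768)
steps = 10
blended = [slerp(old, new, k / (steps - 1)) for k in range(steps)]
```

Each frame then uses the next element of `blended`, so the semantic state glides rather than snaps when the VLM's description changes.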
Built on Ollama for local or remote VLM/LLM inference. Shared libraries handle all transport, frame conversion, text rendering, and prompt injection (scope-bus, scope-language), so each plugin stays focused on its single role in the chain.
3.mp4
2.mp4
1.mp4
5.mp4
moondream.mp4
Queries an Ollama vision model on live video. Available as three variants:
| Pipeline | Role | Description |
|---|---|---|
| VLM Ollama | Main | Query VLM + overlay response + inject prompt |
| VLM Ollama (Pre) | Preprocessor | Query VLM + inject prompt + broadcast via UDP |
| VLM Ollama (Post) | Postprocessor | Receive UDP text + overlay on AI output |
Typical chain: [VLM Pre] → [AI Model] → [VLM Post]
The Pre queries the raw camera feed; the Post overlays the description on the AI-processed output.
Key settings:
- `ollama_url` / `ollama_model` – load-time connection config
- `vlm_prompt` – question sent to the VLM with each frame
- `send_interval` – seconds between VLM queries (VLM is slow; 3–10 s typical)
- `inject_prompt` / `prompt_weight` – whether to use the VLM response as a diffusion prompt
- `transition_steps` – frames to blend from current to new prompt (0 = instant)
- `udp_port` – channel for Pre→Post communication (Pre/Post only)
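Under the hood, these settings drive requests against Ollama's HTTP API. A standalone sketch of a one-off VLM query via `/api/generate` (a documented Ollama endpoint that accepts base64 images for vision models; the helper names are illustrative, not part of the plugin):

```python
import base64
import json
from urllib import request

def build_vlm_payload(image_bytes: bytes, question: str,
                      model: str = "llava:7b") -> dict:
    """Request body for Ollama's /api/generate with an attached image."""
    return {
        "model": model,
        "prompt": question,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,
    }

def query_vlm(ollama_url: str, payload: dict) -> str:
    req = request.Request(
        f"{ollama_url}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with a vision model pulled):
# payload = build_vlm_payload(open("frame.jpg", "rb").read(),
#                             "What are the natural predators of what you see, in three words?")
# print(query_vlm("http://localhost:11434", payload))
```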
Sends text to an Ollama LLM and injects the response as a diffusion prompt.
Role: Preprocessor
Use case: Transform a simple input phrase into an elaborate scene description, style directive, or creative prompt. Works well chained before any image generation model.
Key settings:
- `system_prompt` – LLM personality / rewriting instruction
- `input_prompt` – the text fed to the LLM each interval
- `send_interval` – query frequency
- `inject_prompt` – send the LLM response downstream as a diffusion prompt
- `udp_enabled` / `udp_port` – optionally broadcast the LLM response to other plugins
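The text-only case uses the same Ollama endpoint: `/api/generate` accepts a `system` field, which plays the role of `system_prompt` above. A stdlib-only sketch (helper names are mine, not the plugin's):

```python
import json
from urllib import request

def build_rewrite_payload(observation: str, model: str = "llama3.2:3b") -> dict:
    """Request body asking the LLM to expand a short observation."""
    return {
        "model": model,
        "system": ("You expand short observations into rich diffusion prompts. "
                   "Answer with one vivid sentence only."),
        "prompt": observation,
        "stream": False,
    }

def rewrite(ollama_url: str, observation: str) -> str:
    req = request.Request(
        f"{ollama_url}/api/generate",
        data=json.dumps(build_rewrite_payload(observation)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server):
# rewrite("http://localhost:11434", "eagles, foxes, coyotes")
```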
Receives text via UDP and injects it as a diffusion prompt.
Role: Preprocessor
Use case: Bridge any external application into Scope's prompt chain. Send prompts from a Python script, a custom controller, or any other tool that can send UDP packets.
Key settings:
- `udp_port` – channel to listen on (load-time)
- `prompt_weight` – weight of the injected prompt
- `transition_steps` – frames to blend from current to new prompt (0 = instant)
- `overlay_enabled` – show received text on video (yellow, top-left) for monitoring
Sending from Python:
```python
import socket

MULTICAST_GROUP = "239.255.42.99"
PORT = 9400

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
sock.sendto("a moonlit forest, painterly".encode(), (MULTICAST_GROUP, PORT))
```

Receives OSC /prompt messages and injects the text as a diffusion prompt.
Role: Preprocessor
Use case: Integrate Scope with TouchDesigner, Ableton Live, Max/MSP, or any other tool that sends OSC. Send a string to /prompt on the configured port and it becomes the active diffusion prompt.
Key settings:
- `osc_port` – UDP port to listen for OSC messages (load-time, default 9000)
- `prompt_weight` – weight of the injected prompt
- `transition_steps` – frames to blend from current to new prompt (0 = instant)
- `overlay_enabled` – show received text on video (yellow, top-left) for monitoring
Sending from TouchDesigner / Python:
```python
from pythonosc.udp_client import SimpleUDPClient

client = SimpleUDPClient("127.0.0.1", 9000)
client.send_message("/prompt", "a misty forest at dawn, painterly")
```

Debug postprocessor that overlays all pipeline kwargs on the video and prints them to stdout. Shows video shape, prompts, UDP messages, and any extra kwargs flowing through the chain.
Role: Postprocessor
Use case: Drop this at the end of any chain to inspect exactly what's flowing between stages.
Dependencies must be installed before the plugins that use them. Via the Scope UI (installs into the correct venv):
1. `scope-bus`
2. `scope-language`
3. `scope-vlm-ollama`, `scope-llm-ollama`, `scope-udp-prompt`, `scope-osc-prompt`
4. `scope-test-text-log`
After installing scope-bus and scope-language, they appear in the Scope UI pipeline list as passthrough pipelines, confirming installation and allowing uninstall via the UI.
Tested configuration for running Daydream Scope on RunPod:
| Setting | Value |
|---|---|
| GPU | RTX PRO 6000 (1Γ) |
| vCPU / Memory | 16 vCPU / 188 GB |
| Container disk | 20 GB |
| Network volume | 80 GB (daydream_scope, mounted at /workspace) |
| On-demand price | ~$1.69/hr compute + $0.003/hr storage |
| Base template | daydream-scope (aca8mw9ivw) |
The network volume at /workspace persists across pod restarts; use it for model weights and checkpoints.
Ollama VLM/LLM queries are slow (1β5s each) and run in background threads, so Ollama doesn't need to share the GPU with diffusion inference. Running Ollama on a separate, cheaper pod frees the main GPU for full-speed diffusion:
[Scope pod: RTX PRO 6000] [Ollama pod: CPU-only or cheapest GPU]
StreamDiffusion / other models ollama serve
scope-vlm-ollama (pre/post) ──→ http://<ollama-pod-ip>:11434
scope-llm-ollama
In each VLM/LLM plugin, set ollama_url (load-time) to the Ollama pod's public IP:
http://213.x.x.x:11434
RunPod exposes port 11434 via the pod's public IP when you add it under Expose TCP Ports in the pod settings.
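Before loading the plugins, it can help to confirm the pod is actually reachable. Ollama's `/api/tags` endpoint lists the pulled models, so a successful response doubles as a check that the model download finished. A stdlib-only sketch (the helper name is illustrative):

```python
import json
from urllib import request

def ollama_reachable(url: str) -> bool:
    """True if an Ollama server answers on /api/tags (its model-listing endpoint)."""
    try:
        with request.urlopen(f"{url}/api/tags", timeout=5) as resp:
            return "models" in json.loads(resp.read())
    except (OSError, ValueError):  # connection refused, timeout, or non-JSON reply
        return False

# Example against the Ollama pod's exposed port:
# ollama_reachable("http://<ollama-pod-ip>:11434")
```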
For the Ollama-only pod, use the cheapest CPU pod (or any GPU pod). Paste this as the Container Start Command when creating the pod or template:
```bash
curl -fsSL https://raw.githubusercontent.com/olwal/scope-ai-language/main/scripts/setup-ollama-pod.sh | sh
```

This installs Ollama, pulls `qwen3-vl:2b`, and starts the server bound to 0.0.0.0:11434. `OLLAMA_HOST=0.0.0.0` is required so Ollama is reachable via RunPod's TCP port forwarding; without it, Ollama only listens on 127.0.0.1.
To use a different model, set the OLLAMA_MODEL environment variable on the pod before running the script, or add it inline:
```bash
OLLAMA_MODEL=llava:7b curl -fsSL https://raw.githubusercontent.com/olwal/scope-ai-language/main/scripts/setup-ollama-pod.sh | sh
```

The model is downloaded on first boot; subsequent restarts skip the pull if the model is cached on a network volume.
Recommended model: `qwen3-vl:2b` (fast, small, capable vision model). For higher quality at the cost of speed: `llava:7b` or `llava:13b`.
To save this as a reusable template in the RunPod console:
- Go to Manage → Templates → New Template
- Set Container Image to any base image with CUDA or a plain Ubuntu image (e.g. `runpod/base:0.4.0-cuda11.8.0`)
- Under Container Start Command, paste the Ollama install script above
- Under Expose TCP Ports, add `11434` (Ollama API)
- Set Container Disk to 5–10 GB (Ollama binary + small model overhead if no volume)
- Optionally attach a Network Volume at `/root/.ollama` to cache pulled models across restarts
- Save as a private template; it will appear in your pod creation flow
For the network volume approach, change the pull line to check first:
```bash
ollama pull qwen3-vl:2b 2>/dev/null || true
```

So re-pulling an already-cached model is a no-op.
- `scope-bus` – shared transport + rendering library
- `scope-language` – Ollama VLM/LLM clients (depends on scope-bus)
- `scope-vlm-ollama` – vision language model pipeline (depends on scope-language)
- `scope-llm-ollama` – text language model pipeline (depends on scope-language)
- `scope-udp-prompt` – receives UDP text and injects it as a prompt (depends on scope-bus)
- `scope-osc-prompt` – receives OSC /prompt and injects it as a prompt (depends on scope-bus)
- `scope-test-text-log` – debug overlay postprocessor
Scope supports three pipeline roles, declared in each plugin's `schema.py`:

| Role | `usage =` | Runs | Typical use |
|---|---|---|---|
| Main | (omit) | In the AI model slot | Full processing pipelines |
| Preprocessor | `[UsageType.PREPROCESSOR]` | Before the AI model | Prompt injection, signal routing |
| Postprocessor | `[UsageType.POSTPROCESSOR]` | After the AI model | Overlays, logging, routing |
Plugins communicate at runtime using UDP multicast on 239.255.42.99. The port number acts as a channel: sender and receiver must use the same port. Multiple receivers on the same port all receive every message (fan-out).
```
[VLM Pre] ──UDP:9400──→ [VLM Post]   (overlay on AI output)
              ├──→ [UDP Prompt]      (forward VLM text as prompt)
              └──→ [Text Log]        (debug display)
```
Transport, rendering, and frame utilities. All other plugins depend on this.
```python
from scope_bus import (
    UDPSender,                  # send text/dict via UDP multicast
    UDPReceiver,                # receive text/dict via UDP multicast
    render_text_overlay,        # draw text onto (T, H, W, C) tensors
    apply_overlay_from_kwargs,  # render_text_overlay reading from pipeline kwargs dict
    normalize_input,            # list[Tensor] → (T, H, W, C) float32 [0, 1]
    tensor_to_pil,              # (H, W, C) tensor → PIL Image
    PromptInjector,             # dedup-inject prompts into the output dict
    OverlayMixin,               # Pydantic mixin: overlay appearance fields for schemas
    FontFamily,                 # Enum: arial | courier | times | helvetica
    TextPosition,               # Enum: top-left | top-center | bottom-left | bottom-center
)
```

`UDPSender` – multicast sender with debounced port changes. Accepts strings or dicts (serialised as JSON):
```python
sender = UDPSender(port=9400)
sender.send("a sunset over mountains")             # plain text
sender.send({"prompt": "...", "response": "..."})  # JSON dict
sender.update_port(9401)  # debounced 3 s: call every frame, applies after stable
```

`UDPReceiver` – multicast receiver, non-blocking poll. Auto-parses JSON:
```python
receiver = UDPReceiver(port=9400)
msg = receiver.poll()  # str, dict (if JSON), or None
```

`render_text_overlay` – composites text onto video frames:
```python
frames = render_text_overlay(
    frames,
    text="VLM response here",
    font_family="arial",         # arial | courier | times | helvetica
    font_size=24,
    font_color=(1.0, 1.0, 1.0),  # RGB [0, 1]
    opacity=1.0,
    position="bottom-left",      # top-left | top-center | bottom-left | bottom-center
    word_wrap=True,
    bg_opacity=0.5,
)
```

`PromptInjector` – injects prompts only when the text changes. Supports instant or smooth transitions:
```python
injector = PromptInjector()

# Instant change (default)
injector.inject_if_new(output, text="a cat on a couch", weight=100.0)
# output["prompts"] is set only when text differs from the last call

# Smooth temporal blend (uses Scope's transition API)
injector.inject_if_new(output, text="a stormy sea", weight=100.0,
                       transition_steps=10, interpolation_method="slerp")
# output["transition"] is set with target_prompts + num_steps
```

`normalize_input` – converts Scope's raw video list to a usable tensor:
```python
frames = normalize_input(video, device)
# video: list of (1, H, W, C) uint8 tensors from Scope
# returns: (T, H, W, C) float32 on device, values in [0, 1]
```

Async Ollama clients for vision and text models.
```python
from scope_language import OllamaVLM, OllamaLLM
```

`OllamaVLM` – sends video frames to a vision model in a background thread:
```python
vlm = OllamaVLM(url="http://localhost:11434", model="llava:7b")

# In __call__ (runs every frame):
if vlm.should_send(interval=3.0):  # time-throttled
    vlm.query_async(
        frames[0],                                # single (H, W, C) tensor
        prompt="Describe what you see",
        callback=lambda text: sender.send(text),  # optional
    )
description = vlm.get_last_response()  # returns the last completed response
```

`OllamaLLM` – text-to-text, same async pattern:
```python
llm = OllamaLLM(url="http://localhost:11434", model="llama3.2:3b")

if llm.should_send(interval=5.0):
    llm.query_async(
        prompt="a foggy forest",
        system="Rewrite as a cinematic scene description in one sentence.",
    )
response = llm.get_last_response()
```

| Concept | Documentation |
|---|---|
| Pipeline interface (`Pipeline`, `Requirements`) | scope/src/scope/core/pipelines/interface.py |
| Schema base class (`BasePipelineConfig`, `ui_field_config`) | scope/src/scope/core/pipelines/base_schema.py |
| Plugin registration (`hookimpl`, `register_pipelines`) | scope/src/scope/core/plugins/hookspecs.py |
| Preprocessor → main pipeline parameter forwarding | scope/src/scope/server/pipeline_processor.py |
| Prompt format (`{"text": str, "weight": float}`) | Consumed by the main diffusion pipeline |