Commit a41681c

Local audio/video transport implementation

1 parent 837f64c · commit a41681c
21 files changed · 4309 additions & 2285 deletions

.github/workflows/run_tests.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -84,6 +84,8 @@ jobs:
           sudo rm -rf "/usr/local/lib/android" || true
           sudo rm -rf "/usr/local/share/boost" || true
           sudo rm -rf "$AGENT_TOOLSDIRECTORY" || true
+      - name: Install system dependencies
+        run: sudo apt-get update && sudo apt-get install -y libportaudio2
       - name: Install dependencies
         uses: ./.github/actions/python-uv-setup
       - name: Run core tests
```
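The new CI step installs `libportaudio2` because the `sounddevice` package pulled in by the local plugin is a PortAudio binding and fails at import time when the shared library is missing. A quick way to check for the library on a dev machine (a stdlib-only sketch, independent of the plugin itself):

```python
import ctypes.util

# Look up the PortAudio shared library by its conventional name.
lib = ctypes.util.find_library("portaudio")
if lib is None:
    print("PortAudio not found; install it, e.g. sudo apt-get install -y libportaudio2")
else:
    print(f"PortAudio available as: {lib}")
```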

agents-core/pyproject.toml

Lines changed: 1 addition & 0 deletions

```diff
@@ -73,6 +73,7 @@ turbopuffer = ["vision-agents-plugins-turbopuffer"]
 mistral = ["vision-agents-plugins-mistral"]
 assemblyai = ["vision-agents-plugins-assemblyai"]
 redis = ["redis[hiredis]>=5.0.0"]
+local = ["vision-agents-plugins-local"]

 all-plugins = [
     "vision-agents-plugins-anthropic",
```
Lines changed: 127 additions & 0 deletions

@@ -0,0 +1,127 @@

# Local Transport Example

This example demonstrates how to run a vision agent using local audio/video I/O (microphone, speakers, and camera) instead of a cloud-based edge network.

## Overview

The LocalEdge provides:

- **Microphone input**: Captures audio from your microphone
- **Speaker output**: Plays AI responses on your speakers
- **Camera input**: Captures video from your camera (optional)
- **No cloud media infrastructure**: Audio and video are handled locally (the LLM, STT, and TTS services are still cloud-based)
## Examples

There are two example scripts:

### 1. Basic Voice Agent (`local_transport_example.py`)

Uses the Gemini LLM with Deepgram STT and ElevenLabs TTS for a voice-only experience.

```bash
uv run python local_transport_example.py
```

### 2. Vision Agent with Gemini Realtime (`local_transport_realtime_example.py`)

Uses Gemini Realtime for native audio/video understanding. This lets Gemini see through your camera!

```bash
uv run python local_transport_realtime_example.py
```

Try asking: "What do you see?" or "Describe what's in front of you"
## Prerequisites

1. A working microphone and speakers
2. A camera (optional for the basic example, recommended for the realtime example)
3. API keys:

### For the basic example:

- Google AI (for the Gemini LLM)
- Deepgram (for STT)
- ElevenLabs (for TTS)

### For the realtime example:

- Google AI (for Gemini Realtime), which handles audio and video natively
## Setup

1. Create a `.env` file with your API keys:

```bash
GOOGLE_API_KEY=your_google_api_key
DEEPGRAM_API_KEY=your_deepgram_api_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
```

2. Install dependencies:

```bash
cd examples/10_local_transport_example
uv sync
```
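The `.env` file above is plain `KEY=value` lines; the examples load it with `python-dotenv`'s `load_dotenv()`. As a rough illustration of what that loading does (a minimal stdlib sketch, not the library's actual implementation):

```python
import os

def load_env_file(path: str) -> None:
    """Minimal sketch of .env loading: put KEY=value lines into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            # setdefault mirrors dotenv's default of not overriding existing vars
            os.environ.setdefault(key.strip(), value.strip())
```

In practice, just call `load_dotenv()` as the example scripts do; this sketch only shows the shape of the file format.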
## Device Selection

Both examples will prompt you to select:

1. **Input device** (microphone)
2. **Output device** (speakers)
3. **Video device** (camera) - can be skipped by entering 'n'

Press Enter to use the default device, or enter a number to select a specific device.

Press `Ctrl+C` to stop the agent.
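The prompt behavior described above (Enter for the default, a number for a specific device, 'n' to skip) can be sketched as a small pure function. The function name and the use of plain device-name strings here are hypothetical stand-ins, not the plugin's actual API:

```python
from typing import Optional, Sequence

def resolve_choice(devices: Sequence[str], entry: str,
                   default_index: int = 0) -> Optional[str]:
    """Map a prompt entry to a device: '' -> default, digits -> index, 'n' -> skip."""
    entry = entry.strip().lower()
    if entry == "n":
        return None  # user skipped (e.g. no camera)
    if entry == "":
        return devices[default_index]
    return devices[int(entry)]
```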
## Listing Audio Devices

To see available audio devices on your system:

```python
from vision_agents.plugins.local.devices import list_audio_input_devices, list_audio_output_devices

list_audio_input_devices()
list_audio_output_devices()
```
## Configuration

You can customize the audio settings when creating the LocalEdge:

```python
from vision_agents.plugins.local import LocalEdge
from vision_agents.plugins.local.devices import (
    select_audio_input_device,
    select_audio_output_device,
)

input_device = select_audio_input_device()
output_device = select_audio_output_device()

edge = LocalEdge(
    audio_input=input_device,  # AudioInputDevice (microphone)
    audio_output=output_device,  # AudioOutputDevice (speakers)
)
```
## Troubleshooting

### No audio input/output

1. Check that your microphone and speakers are properly connected
2. Run `list_audio_input_devices()` or `list_audio_output_devices()` to see the available devices
3. Try specifying explicit device indices in the LocalEdge constructor

### Audio quality issues

- Try increasing the `blocksize` parameter for smoother audio
- Ensure your microphone isn't picking up too much background noise

### Permission errors

On macOS, you may need to grant microphone permissions to your terminal application.
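A larger `blocksize` smooths playback at the cost of latency: each block of audio frames spans `blocksize / sample_rate` seconds, so bigger blocks tolerate scheduling hiccups but delay the audio more. A quick way to see the trade-off (plain arithmetic, independent of the plugin):

```python
def block_latency_ms(blocksize: int, sample_rate: int) -> float:
    """Duration of one audio block in milliseconds."""
    return blocksize / sample_rate * 1000.0

# At 48 kHz: 256 frames is ~5 ms, 1024 is ~21 ms, 4096 is ~85 ms per block.
for blocksize in (256, 1024, 4096):
    print(f"{blocksize} frames @ 48000 Hz -> {block_latency_ms(blocksize, 48000):.1f} ms")
```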
Lines changed: 1 addition & 0 deletions

```diff
@@ -0,0 +1 @@
+# Local Transport Example
```
Lines changed: 106 additions & 0 deletions

@@ -0,0 +1,106 @@

```python
"""
Local Transport Example

Demonstrates using LocalEdge for local audio/video I/O with vision agents.
This enables running agents using your microphone, speakers, and camera without
cloud-based edge infrastructure.

Usage:
    uv run python local_transport_example.py run

Requirements:
    - Working microphone and speakers
    - Optional: Camera for video input
    - API keys for Gemini, Deepgram, and ElevenLabs in .env file
"""

import logging
from typing import Any

from dotenv import load_dotenv
from vision_agents.core import Agent, AgentLauncher, Runner, User
from vision_agents.core.utils.examples import get_weather_by_location
from vision_agents.plugins import deepgram, gemini
from vision_agents.plugins.local import LocalEdge
from vision_agents.plugins.local.devices import (
    select_audio_input_device,
    select_audio_output_device,
    select_video_device,
)

logger = logging.getLogger(__name__)

load_dotenv()

INSTRUCTIONS = (
    "You're a helpful voice AI assistant running on the user's local machine. "
    "Keep responses short and conversational. Don't use special characters or "
    "formatting. Be friendly and helpful."
)


def setup_llm(model: str = "gemini-3.1-flash-lite-preview") -> gemini.LLM:
    llm = gemini.LLM(model)

    @llm.register_function(description="Get current weather for a location")
    async def get_weather(location: str) -> dict[str, Any]:
        return await get_weather_by_location(location)

    return llm


async def create_agent() -> Agent:
    llm = setup_llm()

    # input_device/output_device/video_device are module-level globals,
    # set by the device-selection prompts in the __main__ block below.
    if input_device is None:
        raise RuntimeError("No audio input device available")
    if output_device is None:
        raise RuntimeError("No audio output device available")

    logger.info(f"Using input: {input_device.name} ({input_device.sample_rate}Hz)")
    logger.info(f"Using output: {output_device.name} ({output_device.sample_rate}Hz)")
    if video_device:
        logger.info(f"Using video device: {video_device.name}")

    transport = LocalEdge(
        audio_input=input_device,
        audio_output=output_device,
        video_input=video_device,
    )

    agent = Agent(
        edge=transport,
        agent_user=User(name="Local AI Assistant", id="local-agent"),
        instructions=INSTRUCTIONS,
        processors=[],
        llm=llm,
        tts=deepgram.TTS(),
        stt=deepgram.STT(eager_turn_detection=True),
    )

    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs: Any) -> None:
    call = await agent.edge.create_call(call_id)
    async with agent.join(call=call, participant_wait_timeout=0):
        await agent.simple_response("Greet the user briefly")
        await agent.finish()


if __name__ == "__main__":
    print("\n" + "=" * 60)
    print("Local Transport Voice Agent")
    print("=" * 60)
    print("\nThis agent uses your local microphone, speakers, and optionally camera.")

    input_device = select_audio_input_device()
    output_device = select_audio_output_device()
    video_device = select_video_device()

    print("Speak into your microphone to interact with the AI.")
    if video_device:
        print("Camera is enabled for video input.")
    print("Press Ctrl+C to stop.\n")

    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
Lines changed: 20 additions & 0 deletions

@@ -0,0 +1,20 @@

```toml
[project]
name = "local-transport-example"
version = "0.0.0"
requires-python = ">=3.10"

# Dependencies for local audio transport
dependencies = [
    "python-dotenv>=1.0",
    "vision-agents-plugins-deepgram",
    "vision-agents-plugins-elevenlabs",
    "vision-agents-plugins-gemini",
    "vision-agents-plugins-local",
]

[tool.uv.sources]
"vision-agents-plugins-deepgram" = { path = "../../plugins/deepgram", editable = true }
"vision-agents-plugins-elevenlabs" = { path = "../../plugins/elevenlabs", editable = true }
"vision-agents-plugins-gemini" = { path = "../../plugins/gemini", editable = true }
"vision-agents-plugins-local" = { path = "../../plugins/local", editable = true }
"vision-agents" = { path = "../../agents-core", editable = true }
```

plugins/local/README.md

Whitespace-only changes.

plugins/local/py.typed

Whitespace-only changes.

plugins/local/pyproject.toml

Lines changed: 42 additions & 0 deletions

@@ -0,0 +1,42 @@

```toml
[build-system]
requires = ["hatchling", "hatch-vcs"]
build-backend = "hatchling.build"

[project]
name = "vision-agents-plugins-local"
dynamic = ["version"]
description = "Local audio & video integration for Vision Agents"
readme = "README.md"
keywords = ["local", "AI", "voice agents", "agents"]
requires-python = ">=3.10"
license = "MIT"
dependencies = [
    "vision-agents",
    "sounddevice>=0.5.0",
    "aiortc>=1.14.0, <1.15.0",
    "av>=14.2.0, <17",
]

[project.urls]
Documentation = "https://visionagents.ai/"
Website = "https://visionagents.ai/"
Source = "https://github.com/GetStream/Vision-Agents"

[tool.hatch.version]
source = "vcs"
raw-options = { root = "..", search_parent_directories = true, fallback_version = "0.0.0" }

[tool.hatch.build.targets.wheel]
packages = ["."]

[tool.hatch.build.targets.sdist]
include = ["/vision_agents"]

[tool.uv.sources]
vision-agents = { workspace = true }

[dependency-groups]
dev = [
    "pytest>=8.4.1",
    "pytest-asyncio>=1.0.0",
]
```

plugins/local/tests/__init__.py

Whitespace-only changes.
