74 changes: 54 additions & 20 deletions DEVELOPMENT.md

```bash
pre-commit install
```

To set up your `.env`:

```bash
cp env.example .env
```

## Running

```bash
uv run examples/01_simple_agent_example/simple_agent_example.py run
```

### Running with a video file as input

```bash
uv run <path-to-example> run --video-track-override <path-to-video>
```

### Running as an HTTP server

```bash
uv run <path-to-example> serve --host=<host> --port=<port>
```

## Tests
```
uv run py.test -m "not integration" -n auto
```

Integration tests (require secrets in place; see the `.env` setup):

```
uv run py.test -m "integration" -n auto
```
```
uv run ruff check --fix
```

### Mypy type checks

```
uv run mypy --install-types --non-interactive -p vision_agents
```
Expand Down Expand Up @@ -119,8 +128,10 @@ To see how the agent work open up agents.py
Some important things about audio inside the library:

1. WebRTC uses Opus 48khz stereo but inside the library audio is always in PCM format
2. Plugins / AI models work with different PCM formats; passing bytes around without a container type leads to chaos and is forbidden
3. PCM data is always passed around using the `PcmData` object, which contains information about sample rate, channels and format
4. Audio resampling can be done using `PcmData.resample` method
5. Adjusting from stereo to mono and vice versa can be done using the `PcmData.resample` method
6. `PcmData` comes with convenience constructor methods to build from bytes, iterators, ndarray, ...
```python
import asyncio
from getstream.video.rtc.track_util import PcmData
from openai import AsyncOpenAI


async def example():
    client = AsyncOpenAI(api_key="sk-42")

    # ...

    await play_pcm_with_ffplay(resampled_pcm)


if __name__ == "__main__":
    asyncio.run(example())
```
Sometimes you need to test audio manually; here are some tips:
## Creating PcmData

### from_bytes

Build from raw PCM bytes

```python
PcmData.from_bytes(audio_bytes, sample_rate=16000, format=AudioFormat.S16, channels=1)
```
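For instance, suitable raw bytes can be generated with numpy. This is an illustrative sketch only; the sine-wave input and the variable names are assumptions, not library output:

```python
import numpy as np

# 100 ms of a 440 Hz tone as raw native-endian int16 PCM at 16 kHz, mono.
sample_rate = 16000
t = np.arange(int(sample_rate * 0.1)) / sample_rate
samples = (0.5 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)
audio_bytes = samples.tobytes()  # 1600 samples * 2 bytes = 3200 bytes
```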

### from_numpy

Build from numpy arrays with automatic dtype/shape conversion

```python
PcmData.from_numpy(np.array([1, 2], np.int16), sample_rate=16000, format=AudioFormat.S16)
```

### from_response

Construct from API response (bytes, iterators, async iterators, objects with .data)

```python
PcmData.from_response(
    # ...
```

### from_av_frame

Create from PyAV AudioFrame

```python
PcmData.from_av_frame(frame)
```
## Converting Format

### to_float32

Convert samples to float32 in [-1, 1]

```python
pcm_f32 = pcm.to_float32()
```
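As a back-of-envelope illustration of that scaling in plain numpy (not the library's actual implementation):

```python
import numpy as np

# int16 spans [-32768, 32767]; dividing by 32768.0 maps samples into [-1, 1).
s16 = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
f32 = s16.astype(np.float32) / 32768.0
# f32 is [-1.0, 0.0, 0.5, ~0.99997]
```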

### to_int16

Convert samples to int16 PCM format

```python
pcm_s16 = pcm.to_int16()
```
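A sketch of the reverse direction (illustrative only; the library's own conversion may differ in rounding and clipping details):

```python
import numpy as np

# Clip to [-1, 1] before scaling, otherwise out-of-range floats wrap on the cast.
f32 = np.array([-1.5, -1.0, 0.0, 0.5, 1.5], dtype=np.float32)
s16 = (np.clip(f32, -1.0, 1.0) * 32767.0).astype(np.int16)
# s16 is [-32767, -32767, 0, 16383, 32767]
```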

### to_bytes

Return interleaved PCM bytes

```python
audio_bytes = pcm.to_bytes()
```
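"Interleaved" here means stereo samples alternate L, R, L, R, and so on. A plain-numpy sketch of that layout (variable names are illustrative, not library API):

```python
import numpy as np

# Interleave two mono int16 channels and serialize to bytes.
left = np.array([1, 2, 3], dtype=np.int16)
right = np.array([4, 5, 6], dtype=np.int16)
interleaved = np.empty(left.size + right.size, dtype=np.int16)
interleaved[0::2] = left   # even indices: left channel
interleaved[1::2] = right  # odd indices: right channel
audio_bytes = interleaved.tobytes()  # 6 samples * 2 bytes = 12 bytes
```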

### to_wav_bytes

Return WAV file bytes (header + frames)

```python
wav_bytes = pcm.to_wav_bytes()
```

## Resampling

```python
pcm = pcm.resample(16000, target_channels=1)  # to 16khz, mono
```
## Manipulating Audio

### append

Append another PcmData in-place (adjusts format/rate automatically)

```python
pcm.append(other_pcm)
```

### copy

Create a deep copy

```python
pcm_copy = pcm.copy()
```

### clear

Clear all samples in-place (keeps metadata)

```python
pcm.clear()
```
## Slicing and Chunking

### head

Keep only the first N seconds

```python
pcm_head = pcm.head(duration_s=3.0)
```
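Under the hood this is just duration-to-sample arithmetic. A mono-audio sketch with assumed names (not the library's implementation):

```python
import numpy as np

# Keeping the first 3 seconds of 16 kHz mono audio means keeping the first
# duration_s * sample_rate samples.
sample_rate = 16000
samples = np.zeros(5 * sample_rate, dtype=np.int16)  # 5 s of silence
n = int(3.0 * sample_rate)
head = samples[:n]  # 48000 samples == 3.0 s
```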

### tail

Keep only the last N seconds

```python
pcm_tail = pcm.tail(duration_s=5.0)
```

### chunks

Iterate over fixed-size chunks with optional overlap

```python
# ...
pcm = await queue.get_duration(100)
```

# AudioTrack

Use `getstream.video.rtc.AudioTrack` if you need to publish audio using PyAV; this class ensures that `recv` paces audio correctly every 20ms.

- Use `.write()` method to enqueue audio (PcmData)
- Use `.flush()` to empty all the enqueued audio (e.g. on a barge-in event)
This prevents mistakes related to handling audio with different formats, sample rates, and channel counts.
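The 20 ms pacing implies a fixed frame size. The arithmetic for WebRTC's 48 kHz stereo (a back-of-envelope sketch, not `AudioTrack` internals):

```python
# Each 20 ms frame at 48 kHz carries a fixed number of samples per channel.
sample_rate = 48000
frame_ms = 20
samples_per_frame = sample_rate * frame_ms // 1000  # 960 per channel
bytes_per_frame = samples_per_frame * 2 * 2         # s16 stereo: 3840 bytes
```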

### Testing

Many of the underlying APIs change daily. To ensure things keep working we maintain two sets of tests: integration tests and unit tests. Integration tests run once a day to verify that changes to the underlying APIs didn't break the framework. Some testing guidelines:

- Every plugin needs an integration test
- Limit usage of response-capturing style tests, since they diverge from reality
```python
metrics.set_meter_provider(
    # ...
)
start_http_server(port=9464)
```

You can now see the metrics at `http://localhost:9464/metrics` (make sure that your Python program keeps running). After this you can set up your Prometheus server to scrape this endpoint.

### Profiling

The `Profiler` class uses `pyinstrument` to profile your agent's performance and generate an HTML report showing where time is spent during execution.

#### Example usage:

```python
from vision_agents.core import User, Agent
from vision_agents.core.profiling import Profiler
from vision_agents.plugins import getstream, gemini, deepgram, elevenlabs, vogent


async def start_agent() -> None:
    agent = Agent(
        edge=getstream.Edge(),
        # ...
    )
```

The profiler automatically:

- Starts profiling when the agent is created
- Stops profiling when the agent finishes (on `AgentFinishEvent`)
- Saves an HTML report to the specified output path (default: `./profile.html`)

You can open the generated HTML file in a browser to view the performance profile, which shows a timeline of function calls and where time is spent during agent execution.

### Queuing


### Video Frames & Tracks

- `Track.recv` errors will fail silently. The API is to return a frame; never return `None`, and wait until the next frame is available
- When using `frame.to_ndarray(format="rgb24")`, specify the format. Typically you want `rgb24` when connecting/sending to YOLO etc.
- QueuedVideoTrack is a writable/queued video track implementation which is useful when forwarding video
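For reference, an `rgb24` frame converts to a channel-last `(height, width, 3)` uint8 array. A synthetic example (not taken from the library):

```python
import numpy as np

# A synthetic 640x480 rgb24 frame: height x width x 3 channels, one byte each.
height, width = 480, 640
frame_rgb = np.zeros((height, width, 3), dtype=np.uint8)
# Models such as YOLO typically expect this channel-last uint8 layout.
```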


### Loading Resources in Plugins (aka "warmup")

Some plugins need to download and use external resources, such as models, in order to work.

For example:

- `TurnDetection` plugins using a Silero VAD model to detect voice activity in the audio track.
- Video processors using `YOLO` models

In order to standardise how these resources are loaded and to make it performant, the framework provides a special ABC
`vision_agents.core.warmup.Warmable`.

To use it, simply subclass it and define the required methods.
Note that `Warmable` supports generics to leverage type checking.
```python
class FasterWhisperSTT(STT, Warmable[WhisperModel]):
    # ...
    # This method will be called every time a new agent is initialized.
    # The warmup process is now complete.
    self._whisper_model = whisper

    ...
```



## Onboarding Plan for new contributors

**Audio Formats**