94 changes: 83 additions & 11 deletions README.md
@@ -218,7 +218,7 @@ sequenceDiagram
Server->>Client: binary Types 8-11 (artwork channels 0-3)
end
alt Visualizer role
Server->>Client: binary Type 16 (visualization data)
Server->>Client: binary Types 16-20 (loudness, beat, f_peak, spectrum, peak)
end
end

@@ -395,10 +395,11 @@ Instructs clients to clear buffers without ending the stream. Used for seek oper

### Client → Server: `stream/request-format`

Request different stream format (upgrade or downgrade). Available for clients with the `player` or `artwork` role.
Request different stream format (upgrade or downgrade). Available for clients with the `player`, `artwork`, or `visualizer` role.

- `player?`: object - only for clients with the `player` role ([see player object details](#client--server-streamrequest-format-player-object))
- `artwork?`: object - only for clients with the `artwork` role ([see artwork object details](#client--server-streamrequest-format-artwork-object))
- `visualizer?`: object - only for clients with the `visualizer` role ([see visualizer object details](#client--server-streamrequest-format-visualizer-object))

[Application-specific roles](#application-specific-roles) may also include objects in this message (keys starting with `_`).

@@ -704,36 +705,107 @@ The timestamp indicates when this artwork should be displayed. Clients must tran
**Clearing artwork:** To clear the currently displayed artwork on a specific channel, the server sends an empty binary message (only the message type byte and timestamp, with no image data) for that channel.

## Visualizer messages
This section describes messages specific to clients with the `visualizer` role, which create visual representations of the audio being played. Visualizer clients receive audio analysis data like FFT information that corresponds to the current audio timeline.
This section describes messages specific to clients with the `visualizer` role, which create visual representations of the audio being played. Visualizer clients receive audio analysis data computed from the audio currently playing in the group.

Each visualizer binary message carries exactly one frame. The server emits messages in non-decreasing timestamp order so clients can process them in arrival order. Types the server cannot stream for the current source are silently omitted from the set echoed in [`stream/start`](#server--client-streamstart-visualizer-object). `beat` and `peak` are event-driven and not throttled by `rate_max`; all other types are periodic.

**`beat` vs `peak`:** `beat` is a musical pulse derived from tempo/beat tracking, landing on the rhythmic grid with downbeats marking bar starts. Accurate beat detection often relies on offline analysis (e.g. neural beat trackers); servers without such analysis omit the type. `peak` is an energy onset detected live from the audio stream and fires on any transient (drum hits, cymbal crashes, attacks), independent of the rhythmic grid. A `beat` and a `peak` can fire on the same hit, or a `peak` can fire mid-bar with no `beat`.

### Client → Server: `client/hello` visualizer@v1 support object

The `visualizer@v1_support` object in [`client/hello`](#client--server-clienthello) has this structure:

- `visualizer@v1_support`: object
  - Desired FFT details (to be determined)
  - `buffer_capacity`: integer - max size in bytes of visualization data messages in the buffer that are yet to be displayed
  - `types`: string[] - visualization data types requested by the client: 'beat', 'loudness', 'f_peak', 'peak', 'pitch', 'spectrum'
  - `buffer_capacity`: integer - max total size in bytes of buffered visualizer binary messages, counting each message's full wire size (message-type byte + timestamp + data)
  - `rate_max`: integer - maximum periodic visualization frames per second (applies to `loudness`, `f_peak`, `pitch`, `spectrum`). The event-driven types `beat` and `peak` are not throttled; beat events are bounded by tempo. Clients should set this to their display refresh rate
  - `spectrum?`: object - spectrum configuration, required if `types` includes 'spectrum'
    - `n_disp_bins`: integer - number of display bins (i.e. bars on a graphical equalizer)
    - `scale`: 'mel' | 'log' | 'lin' - mapping from FFT frequencies to display bins. 'mel' uses the HTK mel formula (`m = 2595 * log10(1 + f/700)`), 'log' uses base-10 logarithm of frequency, 'lin' uses linear frequency spacing
    - `f_min`: integer - lowest frequency in Hz to bin
    - `f_max`: integer - highest frequency in Hz to bin
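
As an illustration of the 'mel' scale, the following minimal sketch shows one way a server could derive display bin edges from `n_disp_bins`, `f_min`, and `f_max`. The function names are hypothetical, and spacing the edges evenly in mel between `f_min` and `f_max` is an assumption: the text above only fixes the HTK formula, not how bins are laid out.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """HTK mel formula quoted in the spec: m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of the HTK mel formula."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_bin_edges(n_disp_bins: int, f_min: float, f_max: float) -> list[float]:
    """Return n_disp_bins + 1 edge frequencies in Hz.

    Assumption (not stated in the spec): edges are equally spaced on the
    mel axis between f_min and f_max.
    """
    if n_disp_bins < 1 or not 0 <= f_min < f_max:
        raise ValueError("need n_disp_bins >= 1 and 0 <= f_min < f_max")
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / n_disp_bins
    return [mel_to_hz(m_lo + i * step) for i in range(n_disp_bins + 1)]
```

For example, `mel_bin_edges(8, 20, 16000)` yields nine edges that start at about 20 Hz, end at about 16000 Hz, and grow wider toward high frequencies.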

### Server → Client: `stream/start` visualizer object

The `visualizer` object in [`stream/start`](#server--client-streamstart) has this structure:

- `visualizer`: object
  - FFT details (to be determined)
  - `types`: string[] - visualization data types the server will stream
  - `rate_max`: integer - periodic frames per second the server will emit
  - `tracks_downbeats`: boolean - only if `types` includes 'beat'. True if the server's beat tracker also identifies bar starts (downbeats). When false, the downbeat flag on `beat` messages is always 0
  - `spectrum?`: object - spectrum configuration, only if `types` includes 'spectrum'
    - `n_disp_bins`: integer - number of display bins
    - `scale`: 'mel' | 'log' | 'lin' - mapping from FFT frequencies to display bins
    - `f_min`: integer - lowest frequency in Hz
    - `f_max`: integer - highest frequency in Hz

### Client → Server: `stream/request-format` visualizer object

The `visualizer` object in [`stream/request-format`](#client--server-streamrequest-format) has this structure:

- `visualizer`: object
  - `types?`: string[] - new set of visualization data types
  - `rate_max?`: integer - new periodic frames-per-second cap
  - `buffer_capacity?`: integer - new buffer capacity in bytes
  - `spectrum?`: object - new spectrum configuration ([see spectrum object details](#client--server-clienthello-visualizerv1-support-object))

All fields are optional; omitted fields keep their current value.

Response: [`stream/start`](#server--client-streamstart) with the new visualizer configuration.
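
The "omitted fields keep their current value" rule can be sketched as a simple merge on the server side. This is only an illustration: `apply_request_format` is a hypothetical helper, and treating a supplied `spectrum` object as a wholesale replacement (rather than a field-by-field merge) is an assumption the text above does not settle.

```python
# Fields a client may change via stream/request-format (from the list above).
VISUALIZER_FIELDS = ("types", "rate_max", "buffer_capacity", "spectrum")


def apply_request_format(current: dict, request: dict) -> dict:
    """Return the new visualizer configuration after a request.

    Fields present in `request` override the current value; omitted
    fields keep their current value. Unknown keys are ignored here,
    which is an assumption of this sketch.
    """
    updated = dict(current)  # shallow copy so `current` is not mutated
    for key in VISUALIZER_FIELDS:
        if key in request:
            updated[key] = request[key]
    return updated
```

A request of `{"rate_max": 30}` would then change only the frame-rate cap and leave `types`, `buffer_capacity`, and `spectrum` untouched.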

### Server → Client: `stream/clear` visualizer

When [`stream/clear`](#server--client-streamclear) includes the visualizer role, clients should clear all buffered visualization data and continue with data received after this message.

### Server → Client: Visualization Data (Binary)
Member Author

Another potential concern is the number of messages we are sending.
I don't think the overhead of WebSocket messages is too large, though, so this just needs testing on lower-powered hardware.

There are two reasons why messages are completely split now:

  • Consistency with other roles: all other roles already have one message per datum.
  • Difficulty of defining batching behavior. Requiring batching of multiple messages is difficult since it's always a compromise between latency and message count. But leaving batching open to the server would cause most implementations to never use it, defeating the whole purpose.

Member Author

In case this is really a problem, we could also later release a visualizer@v2 if the overhead turns out to be bigger than expected.

This just needs to be tested (with encryption) on an ESP8266 or similar.


Binary messages should be rejected if there is no active stream.
Binary messages should be rejected if there is no active stream. Each visualization `type` has its own binary message type. Every message carries exactly one frame of `[timestamp:8][data]`:

- Byte 0: message type (uint8, one of the types listed below)
- Bytes 1-8: timestamp (big-endian int64) - server clock time in microseconds when this data should be displayed. Clients must translate this server timestamp to their local clock using the offset computed from clock synchronization
- Remaining bytes: data, layout per type below
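
The common header above can be split off with Python's `struct` module. The sketch below is a minimal client-side illustration; `parse_visualizer_frame` is a hypothetical name, and rejecting short messages with `ValueError` is a choice of this example rather than something the spec mandates.

```python
import struct

# Byte 0: uint8 message type; bytes 1-8: big-endian int64 timestamp (µs).
_HEADER = struct.Struct(">Bq")


def parse_visualizer_frame(message: bytes) -> tuple[int, int, bytes]:
    """Split one binary message into (message_type, timestamp_us, data).

    The timestamp is still in server clock time; translating it to the
    local clock is left to the caller's clock-synchronization code.
    """
    if len(message) < _HEADER.size:
        raise ValueError("message shorter than the 9-byte header")
    message_type, timestamp_us = _HEADER.unpack_from(message)
    return message_type, timestamp_us, message[_HEADER.size:]
```

The returned `data` is then decoded according to the per-type layouts listed below.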

`loudness`, `spectrum` bins, and the `f_peak` amplitude use the full `uint16` range 0-65535, where 0 = silence and 65535 = full scale. Values are A-weighted and dB-scaled: -60 dB → 0, 0 dB → 65535, mapped linearly across that range.
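
The dB-to-`uint16` mapping can be written out as follows. This is a sketch under two assumptions that the paragraph above does not state: values outside the -60 dB to 0 dB range are clamped, and the result is rounded to the nearest integer. The function names are hypothetical.

```python
DB_FLOOR = -60.0   # maps to 0
DB_CEIL = 0.0      # maps to 65535
U16_MAX = 65535


def db_to_u16(level_db: float) -> int:
    """Map a dB level linearly onto 0..65535 (assumed: clamp, then round)."""
    clamped = min(max(level_db, DB_FLOOR), DB_CEIL)
    return round((clamped - DB_FLOOR) / (DB_CEIL - DB_FLOOR) * U16_MAX)


def u16_to_db(value: int) -> float:
    """Inverse mapping for clients that want the level back in dB."""
    return value / U16_MAX * (DB_CEIL - DB_FLOOR) + DB_FLOOR
```

With this mapping one `uint16` step corresponds to roughly 0.0009 dB, so the rounding choice has no visible effect on a display.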

Message types `22` and `23` are reserved for future visualizer types within the role's 16-23 allocation and must not be used by implementations.

#### `loudness` — message type `16`

- 2 bytes: `uint16` value

Overall A-weighted loudness in dB (see scaling above).

#### `beat` — message type `17`

- 1 byte: `uint8` flags. Bit 0 = downbeat (bar start). Bits 1-7 are reserved: the server must set them to zero and the client must ignore them

Musical beat event. Bit 0 is only meaningful when [`stream/start`](#server--client-streamstart-visualizer-object) sets `tracks_downbeats: true`; otherwise it is always 0.
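
A client might read the flags byte as in the sketch below. `decode_beat` is a hypothetical helper; it takes the 1-byte data portion of a type `17` message plus the `tracks_downbeats` value from `stream/start`, and masks out the reserved bits as the rule above requires.

```python
DOWNBEAT_FLAG = 0x01  # bit 0 of the flags byte


def decode_beat(data: bytes, tracks_downbeats: bool) -> bool:
    """Return True if this beat event is a downbeat (bar start).

    Reserved bits 1-7 are ignored. When the server did not announce
    tracks_downbeats, bit 0 carries no meaning and is treated as unset.
    """
    if len(data) != 1:
        raise ValueError("beat data must be exactly 1 byte")
    return tracks_downbeats and bool(data[0] & DOWNBEAT_FLAG)
```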

#### `f_peak` — message type `18`

- 2 bytes: `uint16` freq - dominant frequency in Hz (0 = no peak detected, amp must also be 0)
- 2 bytes: `uint16` amp - amplitude (see scaling above)

Tracks the dominant FFT bin. For pitched sources strong harmonics can dominate the fundamental, so `f_peak` is not a substitute for `pitch`.

#### `spectrum` — message type `19`

- 2*n bytes: `uint16[n]` bins from low to high frequency. `n` = `n_disp_bins` in [`stream/start`](#server--client-streamstart-visualizer-object)

Magnitude per display bin. Servers may impose an implementation-defined upper bound on `n_disp_bins` to keep per-frame size sensible.
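
Decoding a `spectrum` frame could look like the sketch below. Note the assumption: the byte order of the `uint16` bins is not stated in this section, so big-endian is assumed here to match the timestamp field. `decode_spectrum` is a hypothetical name.

```python
import struct


def decode_spectrum(data: bytes, n_disp_bins: int) -> list[int]:
    """Unpack the data portion of a type 19 message into bin magnitudes.

    Assumption: bins are big-endian uint16, like the header timestamp.
    `n_disp_bins` is the value echoed by the server in stream/start.
    """
    if len(data) != 2 * n_disp_bins:
        raise ValueError("spectrum data length does not match n_disp_bins")
    return list(struct.unpack(f">{n_disp_bins}H", data))
```

Checking the length against `n_disp_bins` lets a client drop frames that were produced for a previous spectrum configuration.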

#### `peak` — message type `20`

- 1 byte: `uint8` strength

Energy onset event. Fires on any transient (drum hits, cymbal crashes, attacks), independent of musical timing. `strength` 0-255 lets clients scale flash intensity.

#### `pitch` — message type `21`
Member Author

I have a couple of concerns with pitch that came up while implementing this in sendspin-cli.

  • First of all, the pitch given by aiosendspin isn't very precise or useful (it is disabled in Music Assistant for this reason), but that's more of an implementation issue.
  • Secondly, how long is a pitch supposed to be valid? If the server stops emitting when there's nothing tonal, the last value just sticks until the next track or so. One could interpret "ignore below your own threshold" as "clear when confidence is below your threshold or 0", but that's not defined in the specification.
  • And then the confidence scale itself: with every client picking its own threshold and no defined meaning for the confidence values, behavior across server and client implementations becomes inconsistent.

But if there is no reliable way to get a single useful pitch value, we could also just consider removing pitch from the specification.


- Byte 0: message type `16` (uint8)
- Bytes 1-8: timestamp (big-endian int64) - server clock time in microseconds when the visualization should be displayed by the device
- Rest of bytes: visualization data
- 2 bytes: `uint16` midi (8.8 fixed-point) - fractional MIDI note (integer part = MIDI note number, e.g. 69 = A4; fractional part = sub-semitone for vibrato/glissando)
- 1 byte: `uint8` confidence - 0-255. Clients should ignore pitches below their own threshold

The timestamp indicates when this visualization data should be displayed, corresponding to the audio timeline. Clients must translate this server timestamp to their local clock using the offset computed from clock synchronization.
Perceived pitch. Emitted periodically up to `rate_max`. Distinct from `f_peak`, which tracks the dominant FFT bin.
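
The 8.8 fixed-point `midi` field can be decoded as in the following sketch. Two assumptions are made that this section does not state: the `uint16` is big-endian like the timestamp, and the conversion to Hz uses twelve-tone equal temperament with A4 = 440 Hz. Both function names are hypothetical.

```python
import struct


def decode_pitch(data: bytes) -> tuple[float, int]:
    """Return (fractional MIDI note, confidence 0-255) from type 21 data.

    The upper 8 bits of the uint16 are the MIDI note number and the
    lower 8 bits are the sub-semitone fraction in 1/256 steps.
    Assumption: the uint16 is big-endian.
    """
    raw, confidence = struct.unpack(">HB", data)
    return raw / 256.0, confidence


def midi_to_hz(midi_note: float) -> float:
    """Equal-temperament conversion, assuming A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69.0) / 12.0)
```

For instance, a raw value of `0x4580` decodes to MIDI note 69.5, a quarter tone above A4.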

## Color messages
This section describes messages specific to clients with the `color` role, which receive colors derived from the current audio. Colors may be extracted from album artwork, provided by the music source, or manually programmed by the server.