Skip to content

Fix/audio skip redundant go connect#10

Open
ibiltari wants to merge 469 commits into
mainfrom
fix/audio-skip-redundant-go-connect
Open

Fix/audio skip redundant go connect#10
ibiltari wants to merge 469 commits into
mainfrom
fix/audio-skip-redundant-go-connect

Conversation

@ibiltari
Copy link
Copy Markdown
Member

@ibiltari ibiltari commented May 4, 2026

Summary

  • Adds AudioMixer.player_connections_correct(player_name, prefix, outputs) — graph-equivalence check that mirrors connect_player_to_outputs (same output→input map, alternating L/R fan-out, mono branch).
  • Replaces unconditional connect at GO inside run_audioCue with 3-outcome guard: subprocess crashed (return), graph correct (skip), graph drifted (repair + warning).

Why

Tier-0 instrumentation measured the unconditional reconnect costing 21–28 ms inside run_audioCue, sitting between MTC snapshot and audioplayer's first OSC observation, directly inflating file-position skew (the audible "first-cue cut").

Cold-load + 4 GO trace post-fix:

Metric Baseline Post-A1
before→after_redundant_connect bracket 21–28 ms 0.96–1.04 ms (mean 0.99)
File-position skew (cold) 62–82 ms 32–61 ms
Skew with chain drift = 0 n/a 32 ms (output-latency comp floor by-design)

Tests

8 new unit tests pin routing equivalence (stereo, mono, mono x4 outputs, missing edge, wrong destination, crashed subprocess, mono regression guard, linear query-count). All pass. Pre-existing failures in TestAudioMixer/MixerClient/StartAudioMixer/test_cues_dmx are stale (removed node_uuid/client_name API), unrelated.

Test plan

  • Unit tests TestPlayerConnectionsCorrect pass
  • Engine import smoke
  • Cold-load + 4 GOs on dev box: bracket sub-ms, all graph_skipped, skew tracks by-design floor
  • Production cold-boot trace (gate per plan §4)
  • Manual repair-path verification (jack_disconnect mid-arm)
  • Manual dead-port verification (pkill -9 between arm and GO)
  • Mono cue verification

Out of scope

  • §5.1.1 latency-comp first-play (cross-binary, pending approval)
  • §5.1.2 per-cue MTC snapshot — chain audit found dispatch is multi-threaded with 3-50 ms gaps; would break A/V sync; needs dispatch redesign first

ibiltari and others added 30 commits March 27, 2026 17:20
After engine restart, audioplayer processes from the previous instance
survive as independent subprocesses. The new engine has no reference to
them, so they steal JACK client names causing new players to get a -01
suffix and routing audio into the dead orphan.

Add kill_orphaned_audio_processes() that pgrep's for audioplayer-cuems
processes not tracked by the current engine and SIGKILL's them. Called
during project load cleanup before arming new cues.
Swap the order of OSC commands sent to the videocomposer during GO:
offset → mtcfollow → visible (was: offset → visible → mtcfollow).

This ensures the videocomposer starts MTC sync and loads the correct
frame while the layer is still invisible, preventing a stale frame
from previous playback from flashing on screen.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes for multi-output video layer positioning:

1. Add VideoOutput.get_layer_scale() — returns uniform scale factor
   based on output region height vs canvas height, so video layers
   fit within smaller output regions (e.g. 1080p on a 4K canvas).

2. Send /scale OSC to videocomposer in arm_cue and run_cue when the
   output region is smaller than the canvas.

3. Fix Y-axis inversion in get_layer_placement() — the OpenGL FBO has
   Y=0 at the bottom, and the renderer negates Y.  The old formula
   (output_cy - canvas_cy) placed layers at the wrong vertical
   position for outputs not centered on the canvas.  Changed to
   (canvas_cy - output_cy) to compensate.
arm_videoCue() now sends /layer/load for the first output (driver) and
/layer/load_shared for subsequent outputs, enabling the videocomposer
to share a single decoder across all outputs of the same cue.
UI sliderToFloat() already sends 0.0-1.0 gain values; the engine was
dividing by 100 again, which would clamp any realtime cue volume to
the 0.0-0.01 range.
The per-cue /mtcfollow OSC control was accidentally removed during the
blackout race-condition refactor (c858705). Without it the dmxplayer
runs on its own internal clock and ignores MTC stopping, so queued
scenes keep firing after the user hits STOP.

- DmxPlayer: add _mtcfollow_param and enable/disable_mtcfollow()
- run_cue: send /mtcfollow 1 before each DMX scene
- NodeEngine: send /mtcfollow 0 on STOP before ola_set_dmx blackout
Replace "arm all at load" strategy with a duration-aware sliding window
that arms 2 cues with meaningful duration ahead in the target chain.
Short/zero-duration cues (ActionCue, short DMX) are armed but don't
count toward the limit — the window walks past them.

Key changes:
- arm() rewritten with _loading sentinel to prevent race conditions
  (double arm_cue) and infinite recursion (non-local/disabled cycles)
- ActionCue(play) + its target treated as 1 unit — arming an ActionCue
  with action_type='play' also arms _action_target_object
- _effective_duration_ms calculates prewait + body + postwait per type
- _arm_ahead walks the target chain, counting only cues >= 1000ms
- go() and go_threaded() use _arm_ahead instead of fixed lookahead
- initial_cuelist_process arms first cue + _arm_ahead (not all cues)
- set_next_cue extends the arm window when user selects a cue
- Safety net in go() re-arms with Logger.warning if pre-arm missed

Fixes ActionCue not re-firing on cuelist loop (ClickUp 869cqzr67).
Also reduces VRAM usage for large go_at_end projects (~3-5 layers
loaded instead of all).
CuemsDeploy._create_deploy_log() failed with Permission denied when
/tmp/cuems was missing or owned by root. Add os.makedirs() to create
the directory with correct ownership before writing.
connect_to_jack() had no retry mechanism unlike connect_player_to_mixer().
Add port_exists() checks with retries so it survives race conditions
when jack-volume hasn't registered its ports yet.
DmxCue and CueList types are armed but never create player entries,
so remove_cue_player() logged spurious ERRORs on every disarm. Change
to DEBUG and return early when no player is found.
Send an early ADD notification before the pre-wait sleep so the
controller sets status=1 and the UI colors the cue immediately.
Guard with _stop_requested checks to handle stop-during-prewait
cleanly and bail out if stop arrives while sleeping.
- ossia: fix Impulse push_value(None) crash by sending True instead
- ossia: force RepetitionFilter OFF for Impulse parameters (prevents
  silent suppression of repeated sends — root cause of reset not
  reaching videocomposer)
- stop_playback: call stop_all_cues() before cleanup so loop_cue
  threads exit before DMX blackout and video reset
- load_project: add video reset and DMX mtcfollow disable (were
  missing — stale layers survived project reload)
- PlayerHandler: upgrade reset error log from DEBUG to WARNING
- DmxPlayer: send /blackout impulse to dmxplayer to clear its internal
  fade engine (scenes queue + active transitions), fixing HTP merge
  issue where dmxplayer's fading values overrode ola_set_dmx zeros
- endpoints: add /blackout Impulse endpoint to OSC_DMXPLAYER_CONF
- CueHandler: check _stop_requested before firing post_go and
  go_at_end chains, preventing DMX cue cascade after STOP
- NodeEngine: call stop_all_cues() before cleanup in load_project
…ccess

When two threads try to arm the same cue simultaneously (e.g. _arm_ahead
from one loop and _handle_play from another), the _loading sentinel caused
the second caller to return False immediately, breaking the loop chain
permanently with "Target could not be armed".

Replace the boolean _loading sentinel with a threading.Event so concurrent
callers block until the in-progress arm completes, then return the result.
Allow the UI to toggle cue enabled/disabled at show time via
/engine/command/cue_enabled. Engine broadcasts confirmation at
/engine/status/cue_enabled/<uuid>. Show-time overrides are in-memory
only and reset to XML values on project reload.

Includes:
- WebSocket on_connect callback for late-join state dump
- NNG feedback from NodeEngine to sync ActionCue-driven toggles
- Project generation counter to abort stale daemon arm threads
- cue.enabled check in go_script to prevent firing disabled cues
- Async re-arm in daemon thread to avoid blocking GO command
Disabled cues are now invisible to the execution engine:

- go_script: advances next_cue_pointer past disabled cues on GO
- ready_script: sets initial pointer to first enabled cue
- _handle_cue_enabled: recalculates pointer when nextcue is disabled
- CueHandler.go: enabled gate rejects disabled cues (returns None)
- go_threaded: guards wait_for_cue against None from disabled targets
- _arm_ahead: skips disabled cues in lookahead window
- arm: skips recursive arm of disabled post_go='go' targets
- initial_cuelist_process: arms first enabled cue, not contents[0]
- run_cueList: dispatches run_cue() on first enabled child instead
  of calling nonexistent .go() method
- is_playing: checks loaded instead of not _stop_requested, matching
  comment intent (loaded=False after disarm means not playing)

All checks happen at execution time for real-time disable/enable
during a running show. A disabled ActionCue will not execute its
action target. All 50 tests pass.
…s-dmxplayer

Track the player-binary rename happening across the ecosystem.
Runtime-critical updates:
 - PlayerHandler.kill_orphaned_audio_processes pgrep filter
 - scripts/kill_cuems.sh and kill_cuems_processes.py patterns
 - dev/test_xml_files/settings.xml <path> values
Docstrings and comments in AudioPlayer, AudioMixer, DmxPlayer,
PlayerHandler, mock_audioplayer, mock_dmxplayer updated to
match.
The dev copy had drifted far from the production unit in
cuems-common and was never installed on any system. Specifically:

 - pkill -u stagelab (prod uses -u cuems)
 - ExecStart=/home/ion/.pyenv/…/python3 /home/ion/src/… (hardcoded
   to a different developer's machine; also points at an obsolete
   scripts/node_engine.py entrypoint)
 - no User=/Group=, no hardening directives, missing
   jackd-cuems.service dependency

The production unit at
cuems-common/etc/systemd/system/cuems-node-engine.service is the
single source of truth. Developers can read that file or install
the cuems-common package.
loop_audioCue and loop_videoCue rebuilt _start_mtc via
CTimecode(start_seconds=_end_mtc.milliseconds/1000), losing one frame
of the target framerate each iteration (40ms at 25fps MTC) through
the ms->s->frames round-trip. Replace with direct frame-count
assignment to skip the lossy conversion. Zero drift verified across
24/25/30fps over 10+ iterations.

Keep Phase 1 diagnostic _dbg instrumentation at /offset send points
as permanent infrastructure. Add tests/test_loop_rebase.py pinning
both the fix (zero drift) and the pre-fix behaviour (-40ms/iter at
25fps) as a regression sentinel.
Fixture was missing required <id> elements on every audio/video/dmx
output and <canvas_region> on every video output. project_mappings.xsd
has required these since the canvas_region addition, but the fixture
had drifted. When engine tests set CUEMS_CONF_PATH to this directory
and ConfigManager.load_net_and_node_mappings() falls back to the
default mappings, BaseEngine would exit(-1) on schema validation.
settings.xsd VideoPlayerType requires an <outputs> child (positive
integer). The dev fixture was stale and caused BaseEngine.load_config()
to exit(-1) during test startup, cascading into 7 SystemExit errors
across test_core_baseengine_status.
Adds required <id> on every put, <canvas_region> on every video
output, and the required (possibly empty) <new_nodes> element —
all required by project_mappings.xsd but missing from the fixture.
Parameterized XSD check over every config fixture under
dev/test_xml_files/ that BaseEngine loads when tests set
CUEMS_CONF_PATH there. Catches drift the moment a schema
gains a required element and a fixture is not updated to match.
NodeEngine reads output_latency_ms from node_conf for audioplayer and
dmxplayer. If the value is a Python int (operator set a specific ms
value in settings.xml), appends --output-latency-ms <int> to the args
string when spawning the process. If the value is "auto" or absent,
no flag is emitted and the binary uses its built-in default/auto path
(audioplayer: JACK query; dmxplayer: 35 ms Phase-5A default).

Videocomposer is not a subprocess of the engine (systemd service);
its args come from its systemd unit, so its --output-latency-ms flag
is wired in cuems-common instead of here.

isinstance(value, int) distinguishes int from the "auto" string
returned by xmlschema's union decoder — the typing contract pinned
by test_output_latency_ms_type_round_trip in cuems-utils.
Seeds the dev settings.xml template with output_latency_ms values and
inline comments documenting per-component semantics:
- videoplayer: "auto" (systemd drop-in documents override path)
- audioplayer: "auto" (JACK query)
- dmxplayer: 35 ms with per-adapter starting-point table (ENTTEC/
  DMXKing, OpenDMX, ArtNet)

Operators use this file as a starting point for /etc/cuems/settings.xml
on each node. Semantics mirror cuems-utils settings.xsd.
…flag

When <args></args> is empty in settings.xml, xmlschema decodes it as
Python None (not ''). The original f-string
  f'{args} --output-latency-ms {value}'
produced a literal "None --output-latency-ms 42" token sequence which
leaked into the spawned argv of dmxplayer (dmx has empty default args
in the shipped settings.xml).

Normalize args to '' via `args = args or ''` before concatenation and
strip trailing whitespace on the returned value. Verified against 7
shapes of (args, output_latency_ms) including the None/int combination
that triggered this on dev node 02 during Phase-5B manual test today.
Undoes pollution from my previous commit (79a9a9e) which accidentally
used `git add -A` and slurped in ~96 files of debuild output plus
local IDE/Claude state. Adds .gitignore entries so a future `git add
-A` in a freshly-built tree stays clean.

Files untracked (still present on disk):
- debian/.debhelper/ — dh build stamps
- debian/cuems-engine/ + debian/cuems-engine-mock/ — staging dirs
- debian/*.substvars, debian/*.debhelper, debian/debhelper-build-stamp,
  debian/files — per-build artifacts
- .claude/ + CLAUDE.md — local IDE state

The useful change from 79a9a9e (NodeEngine.py None-args fix) stays
intact.
Closes the "audioplayer not observed live" gap from the 2026-04-23
Phase-5 manual round-trip test. Audio is spawned on-demand per cue,
so no audio process was running during the dev-node test — only
video and dmx were observed receiving --output-latency-ms 42.

Two test classes, 11 tests:

- TestAppendOutputLatencyFlag: helper under every cross product of
  (args: string | '' | None, output_latency_ms: int | 'auto' | absent).
  Includes the None-args + int combination that produced a literal
  "None" token on dev node 02 (fixed in 79a9a9e) — asserts 'None' is
  not in the result as a regression gate.

- TestSubprocessArgvComposition: mirrors the `args.split()` loop in
  DmxPlayer.run() / AudioPlayer.run() to prove the helper's output
  survives intact into the final subprocess argv. Covers the
  audioplayer `-w -1` shape end-to-end.

Audio's spawn path is structurally identical to dmx's (same helper,
same split loop, same argv extension), so these tests guarantee by
construction that a live audio cue will land `--output-latency-ms`
in the correct argv position. Faster than waiting for the next
audio cue to fire on a live node and covers edge cases that
happen-to-work manual tests wouldn't hit.
Verifies the player's outport ports are wired exactly the way
connect_player_to_outputs would wire them, so callers can skip a
disconnect+reconnect cycle when the graph is already correct.

Mirrors the routing of connect_player_to_outputs (same output_to_input
map, same alternating L/R fan-out, same mono branch where outport 0
serves both pair members). Returns False if outport 0 is missing
(subprocess gone), if any expected edge is missing, or if an edge
points elsewhere.

Tests pin the routing equivalence: stereo, mono with 2 and 4 outputs,
missing edge, wrong destination, crashed subprocess, and a linear
query-count guard against quadratic blowup under future refactors.
run_audioCue called connect_player_to_outputs unconditionally at GO,
even though new_audio_output had already wired the graph at arm. The
redundant call always disconnects and reconnects every output, costing
21-28 ms inside the GO path on the dev box and pushing the audio
file's first-sample read 21-28 ms further into the file (audible as
the start of the cue being clipped).

GO path now has three outcomes:

- Player ports missing -> subprocess crashed between arm and GO; log
  and return cleanly instead of blocking 15 s in the port-wait retry
  loop inside connect_player_to_outputs.
- Graph already wired (the common path): skip the connect, log debug.
- Graph drifted: log a warning and call connect_player_to_outputs to
  repair. Same behaviour as before, with visibility.

Validation helper player_connections_correct is the only new edge
probe; it issues at most one is_connected call per selected output,
sub-millisecond in practice.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants