[scheduler/cuebot] Bulk resource accounting #2198

Draft
DiegoTavares wants to merge 13 commits into AcademySoftwareFoundation:master from DiegoTavares:sched_subs_lock

Conversation


@DiegoTavares DiegoTavares commented Mar 6, 2026

This change shifts resource accounting (subscription, layer_resource, job_resource, folder_resource,
point tables) from incremental delta updates at dispatch/release time to periodic bulk re-computation
from the proc table. This affects both the Java Cuebot and the Rust scheduler.

Key changes:

  1. Cuebot: Wraps existing incremental resource updates behind a dispatcher.scheduler_manages_resources feature flag
  2. Scheduler: Replaces the delta-accumulate-and-flush pattern with periodic recompute_all_from_proc() and recalculate_subs()
  3. New ResourceAccountingService: Periodic loop recomputing layer/job/folder/point resource tables
  4. Simplified AllocationService: Removes pending_deltas mutex, DeltaKey/DeltaValue types, retry logic, and delta re-application after cache refresh

Attention: when dispatcher.scheduler_manages_resources=true, the scheduler service must be running so that the resource tables (subscription, layer_resource, job_resource, folder_resource, point) are periodically refreshed from the proc table.

Motivation

Each frame dispatch currently triggers updates across 5 resource-accounting tables, where concurrent dispatches contend for the same rows. This creates lock contention on the database that scales with dispatch volume. During crunch times, this contention has led to instability (deadlocks, slow dispatches, cascading timeouts).

Multiple frames starting on the same subscription contend for row locks when updating the
subscription table. The scheduler already had a cache for reads, but writes were still issued
on every frame update.

To prevent this contention, the cache is now updated on each dispatch and a flush happens on each
cache-update tick (defaults to every 3 seconds).
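The accumulate-then-flush behavior described above can be sketched as follows. This is an illustrative in-memory model, not the actual scheduler code; `SubscriptionCache`, `record_delta`, and `flush` are hypothetical names:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative sketch: dispatches record core deltas in memory, and a periodic
// tick drains them so only one write per subscription reaches the database,
// instead of one contended UPDATE per frame.
#[derive(Default)]
struct SubscriptionCache {
    // subscription id -> pending core delta (positive on dispatch, negative on release)
    pending_deltas: Mutex<HashMap<String, i64>>,
}

impl SubscriptionCache {
    // Called per dispatch/release: cheap in-memory accumulation, no DB row lock.
    fn record_delta(&self, sub_id: &str, cores: i64) {
        let mut deltas = self.pending_deltas.lock().unwrap();
        *deltas.entry(sub_id.to_string()).or_insert(0) += cores;
    }

    // Called on each cache tick (~3s): drains the accumulated deltas, returning
    // the batch that a single write per subscription would flush.
    fn flush(&self) -> HashMap<String, i64> {
        let mut deltas = self.pending_deltas.lock().unwrap();
        std::mem::take(&mut *deltas)
    }
}
```

Many per-frame deltas against the same subscription thus collapse into one value per flush interval, which is the contention win the description refers to.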

When running multiple instances, this can lead to running slightly above allocation limits, but the
recalculate_subs scheduled function and the trigger__verify_subscription trigger should prevent
large drift in the long run.

Entire-Checkpoint: 059ff47f5f92
Entire-Checkpoint: 57fcdec6f3b4
Signed-off-by: Diego Tavares <dtavares@imageworks.com>
@DiegoTavares DiegoTavares changed the title [scheduler] Accumulate subscription updates to avoid locks [scheduler/cuebot] Bulk resource accounting Mar 10, 2026
@DiegoTavares (Collaborator, Author)

PR Assessment using Claude Code

PR Evaluation: Bulk Resource Accounting Effectiveness

Context

This PR replaces per-frame resource accounting updates with periodic bulk recomputation. The scheduler_manages_resources flag is global — when true, Cuebot skips ALL resource accounting updates (subscription, layer_resource, job_resource, folder_resource, point).

What the PR Removes from the Dispatch Path

Each frame dispatch previously executed 4-5 inline UPDATEs within the dispatch transaction:

  • UPDATE subscription SET int_cores = int_cores + ?
  • UPDATE layer_resource SET int_cores = int_cores + ?
  • UPDATE job_resource SET int_cores = int_cores + ?
  • UPDATE folder_resource SET int_cores = int_cores + ?
  • UPDATE point SET int_cores = int_cores + ?

These are now gone from the per-frame transaction. This is the core fix — it eliminates the primary source of row contention that scales with dispatch throughput.
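The replacement bulk-recompute idea can be sketched as an in-memory simulation. This is not the actual resource_accounting_dao.rs code; the `Proc` struct and `recompute_from_procs` function are hypothetical, but they show the principle of deriving the counters from the proc table rather than incrementing them per frame:

```rust
use std::collections::HashMap;

// A running proc, reduced to the fields needed for resource accounting.
struct Proc {
    layer_id: String,
    job_id: String,
    cores: i64,
}

// One pass over the procs yields the running-core totals per layer and per
// job; these totals are what a periodic bulk UPDATE would write into
// layer_resource and job_resource (folder_resource and point follow the same
// pattern). Because the totals are recomputed from the source of truth, any
// drift from missed increments or decrements is self-healing.
fn recompute_from_procs(procs: &[Proc]) -> (HashMap<String, i64>, HashMap<String, i64>) {
    let mut layer_cores = HashMap::new();
    let mut job_cores = HashMap::new();
    for p in procs {
        *layer_cores.entry(p.layer_id.clone()).or_insert(0) += p.cores;
        *job_cores.entry(p.job_id.clone()).or_insert(0) += p.cores;
    }
    (layer_cores, job_cores)
}
```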

What Remains in the Per-Frame Transaction

  1. INSERT proc — no row contention (new rows)
  2. UPDATE host — idle cores/memory decrement (mitigated by advisory locks per host)
  3. UPDATE frame SET state='RUNNING' — fires trigger that updates layer_stat and job_stat

The layer_stat/job_stat trigger is the remaining hotspot: every frame in the same layer/job contends on these rows. However, Cuebot has the same trigger and handles 10 shows without issues, so this alone isn't the destabilizing factor — it was the combination of trigger contention + resource accounting contention that overwhelmed the database.

Effectiveness Assessment

Since scheduler_manages_resources is global, cross-system contention on resource tables is eliminated — Cuebot won't be doing inline updates.

| Aspect | Before PR | After PR | Verdict |
| --- | --- | --- | --- |
| Per-frame resource UPDATEs | 4-5 per frame, O(dispatch_rate) | 0 | Fixed |
| Transaction duration | Long (5+ UPDATEs + trigger) | Short (proc INSERT + host UPDATE + frame UPDATE) | Fixed |
| Subscription contention | Per-frame deltas, O(dispatch_rate) | Bulk recompute every 3s (no inline contention) | Fixed |
| Resource table contention | Per-frame, O(dispatch_rate) | Bulk recompute every 10s (no inline contention) | Fixed |
| layer_stat/job_stat triggers | Contended during long transactions | Contended during shorter transactions | Improved |

Overall: The PR is effective. It addresses the root cause (high-concurrency dispatch generating O(dispatch_rate) contention on resource tables) and the shorter transactions also reduce lock hold time on layer_stat/job_stat.

Remaining Concerns (Minor)

1. recalculate_subs() Efficiency (LOW PRIORITY)

The scheduler calls recalculate_subs() every ~3s. This PL/pgSQL function:

  1. Zeros ALL subscription rows (UPDATE subscription SET int_cores = 0)
  2. Loops through proc aggregates, doing 3 queries per subscription (SELECT burst, UPDATE with burst bypass, UPDATE to restore burst)

This is not a contention issue (Cuebot won't be updating subscriptions), but it's unnecessarily expensive: it touches every subscription row even if nothing changed, and the row-by-row loop doesn't scale well with many subscriptions.

Optional improvement: Replace with a single bulk UPDATE (same pattern as resource_accounting_dao.rs) that also handles the burst-bypass. This would be cleaner and faster but is not blocking.
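One way to see why the single-pass alternative is cheaper: instead of zeroing every row and looping, compute the target value per subscription from the proc aggregates and write only the rows that actually changed. The sketch below is a hypothetical in-memory model of that diffing step, not the proposed SQL itself:

```rust
use std::collections::HashMap;

// Given the stored int_cores per subscription and the totals recomputed from
// proc, return only the (id, new_value) pairs that differ. Rows whose value is
// unchanged are never touched, unlike the zero-everything-then-restore loop.
// Subscriptions absent from `recomputed` have no running procs, so their
// target value is 0.
fn changed_rows(
    current: &HashMap<String, i64>,
    recomputed: &HashMap<String, i64>,
) -> Vec<(String, i64)> {
    current
        .iter()
        .map(|(id, cur)| (id, *cur, *recomputed.get(id).unwrap_or(&0)))
        .filter(|(_, cur, new)| cur != new)
        .map(|(id, _, new)| (id.clone(), new))
        .collect()
}
```

In SQL this would correspond to a single UPDATE joined against the proc aggregate with a `WHERE int_cores IS DISTINCT FROM <new value>` style guard; the burst-bypass handling mentioned above would still need to be folded into that statement.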

2. layer_stat/job_stat Trigger Contention (MONITOR)

This remains the only per-dispatch hot-row contention. The PR's shorter transactions help (trigger locks held for less time), but under extreme dispatch rates this could still surface. Worth monitoring but unlikely to be a problem given Cuebot handles 10 shows with the same triggers.

If it surfaces: Consider reducing cluster_buffer_size or job_buffer_size to throttle concurrent dispatch streams, or batch frame status updates per layer.

When a show or show.allocation is being served to the scheduler, only resources for that show should
be recomputed on a schedule.
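The per-show scoping suggested here amounts to filtering the proc set before aggregating. A minimal sketch, with hypothetical names (`ShowProc`, `show_running_cores`):

```rust
// A running proc reduced to the fields relevant for per-show recompute.
struct ShowProc {
    show: String,
    cores: i64,
}

// Restrict the scheduled recompute to procs belonging to the show (or
// show.allocation) currently served by the scheduler, instead of aggregating
// over every proc in the database.
fn show_running_cores(procs: &[ShowProc], show: &str) -> i64 {
    procs.iter().filter(|p| p.show == show).map(|p| p.cores).sum()
}
```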

Refactor allocation_dao into resource_accounting as both serve a similar purpose.