feat(BA-3212): Split up devices between multi agents #7047

hhoikoo · 2025-12-02T07:13:55Z

Overview

Implements BEP-1016-compliant device partitioning for multi-agent resource allocation. Devices are now assigned to agents as whole units using divmod distribution, so each physical device belongs to exactly one agent (device mutual exclusivity).

Problem Statement

Device contention: Multiple agents on the same host could allocate kernels using the same physical device, causing oversubscription
BEP-1016 violation: The previous "fill-from-front" approach allowed partial device sharing, where the same DeviceId could appear in multiple agents' partitions
Resource isolation: Without a clear boundary between agents' device ownership, debugging and capacity planning were difficult

Architecture

flowchart TB
    subgraph "Device Pool (N=5 devices)"
        D0[cuda0]
        D1[cuda1]
        D2[cuda2]
        D3[cuda3]
        D4[cuda4]
    end

    subgraph "divmod(5, 3) = (1, 2)"
        direction LR
        Q["q=1 base devices"]
        R["r=2 remainder"]
    end

    subgraph "Agent Partitions (M=3 agents)"
        A1[Agent1<br/>q+1 = 2 devices]
        A2[Agent2<br/>q+1 = 2 devices]
        A3[Agent3<br/>q = 1 device]
    end

    D0 --> A1
    D1 --> A1
    D2 --> A2
    D3 --> A2
    D4 --> A3

Distribution algorithm:

For N devices across M agents: q, r = divmod(N, M)
First r agents receive q + 1 devices each
Remaining M - r agents receive q devices each
Edge case: When M > N, first N agents get 1 device each, remaining agents get empty device masks
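The counting rule above can be sketched in a few lines of Python. This is a minimal illustration and not the code from this PR: the function name `device_counts` is hypothetical, and it only computes how many whole devices each agent receives, not which ones.

```python
def device_counts(num_devices: int, num_agents: int) -> list[int]:
    """Return how many whole devices each agent receives, in agent order.

    Hypothetical helper for illustration; not part of the PR.
    """
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    if num_devices < 0:
        raise ValueError("num_devices must not be negative")
    q, r = divmod(num_devices, num_agents)
    # The first r agents absorb the remainder, one extra device each.
    return [q + 1 if index < r else q for index in range(num_agents)]


counts_5_3 = device_counts(5, 3)  # N=5, M=3, as in the diagram
counts_3_5 = device_counts(3, 5)  # M > N edge case
```

The M > N edge case needs no special branch in this sketch: `divmod` yields q = 0 and r = N, so the first N agents get one device and the rest get none.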

Implementation Notes

_compute_device_partitions() now assigns whole devices instead of partial slot amounts
_compute_device_partition() calculates per-agent device count using divmod
SHARED mode unchanged - all agents still see all devices
AUTO_SPLIT and MANUAL modes now enforce device mutual exclusivity

Checklist: (if applicable)

Milestone metadata specifying the target backport version
Mention to the original issue
Installer updates including:
- Fixtures for db schema changes
- New mandatory config options
Update of end-to-end CLI integration tests in ai.backend.test
API server-client counterparts (e.g., manager API -> client SDK)
Test case(s) to:
- Demonstrate the difference of before/after
- Demonstrate the flow of abstract/conceptual models with a concrete implementation
Documentation
- Contents in the docs directory
- docstrings in public interfaces and type annotations

hhoikoo · 2026-01-26T07:39:29Z

BEP-1016 Compatibility Review

After reviewing this PR against BEP-1016 (Accelerator Interface v2), I found one key fix that is needed so that the V1 implementation does not violate V2 assumptions.

Issue: Device Mutual Exclusivity

BEP-1016 Requirement:

"(NOTE: partitioning here means per-device, not within-device!)"

"Decision: Cross-agent device sharing is not supported. Devices are always mutually exclusive between agents within a node."

In BEP-1016, each agent's allowed devices will be expressed as a device_mask: frozenset[DeviceId] — a simple allowlist. This design fundamentally requires that each DeviceId belongs to exactly ONE agent.

Current PR Behavior:
The fill-from-front algorithm in _compute_device_partition() currently allows the same DeviceId to appear in multiple agents' partitions when devices have fractional capacity. For example, the test shows:

# 3 agents all accessing cuda0 with amount=1 each
for i in range(1, 4):
    ctx = allocator.get_computers(AgentId(f"agent{i}"))[DeviceName("cuda")]
    assert ctx.alloc_map.device_slots[DeviceId("cuda0")].amount == Decimal("1")

This means cuda0 is shared across all 3 agents, which cannot be expressed as a simple frozenset[DeviceId] allowlist per agent (since the same device would need to be in multiple sets).
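To make the conflict concrete, here is a small sketch of a mutual-exclusivity check over per-agent allowlists. It is an illustration under assumptions: plain strings stand in for `AgentId` and `DeviceId`, and `find_shared_devices` is a hypothetical helper, not something defined in this PR or in BEP-1016.

```python
from collections import Counter
from collections.abc import Mapping


def find_shared_devices(masks: Mapping[str, frozenset[str]]) -> set[str]:
    """Return device IDs that appear in more than one agent's mask.

    Hypothetical helper; strings stand in for AgentId and DeviceId.
    """
    owners = Counter(device for mask in masks.values() for device in mask)
    return {device for device, count in owners.items() if count > 1}


# Situation described by the test above: all three agents hold cuda0.
shared_masks = {f"agent{i}": frozenset({"cuda0"}) for i in range(1, 4)}

# A mutually exclusive layout: every device has exactly one owner.
exclusive_masks = {
    "agent1": frozenset({"cuda0"}),
    "agent2": frozenset({"cuda1"}),
    "agent3": frozenset(),
}

shared_conflicts = find_shared_devices(shared_masks)
exclusive_conflicts = find_shared_devices(exclusive_masks)
```

An empty result means the masks are pairwise disjoint, which is the property the allowlist design relies on.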

Recommended Fix

Modify _compute_device_partition() to assign whole devices to agents rather than fractional shares. Use divmod distribution:

For N devices across M agents: q, r = divmod(N, M)
First r agents get q + 1 devices each
Remaining agents get q devices each

Edge case (more agents than devices):
When M > N, the first N agents get 1 device each, and the remaining M - N agents get an empty device mask. This is acceptable — better to have some agents with no devices than to violate mutual exclusivity.

Example

5 devices, 3 agents → divmod(5, 3) = (1, 2)

Agent 1: devices 0, 1 (1 + 1 = 2 devices)
Agent 2: devices 2, 3 (1 + 1 = 2 devices)
Agent 3: device 4 (1 device)

3 devices, 5 agents → divmod(3, 5) = (0, 3)

Agent 1: device 0
Agent 2: device 1
Agent 3: device 2
Agents 4, 5: empty device mask

This ensures each agent's device allocation can be cleanly represented as frozenset[DeviceId] in the future BEP-1016 implementation.
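The two examples above can be reproduced with a short sketch that builds one `frozenset` per agent. This is not the implementation of `_compute_device_partition()`: the function name `build_device_masks` is hypothetical, strings stand in for `AgentId` and `DeviceId`, and assigning devices in contiguous order is an assumption taken from the example (devices 0 and 1 to Agent 1, and so on).

```python
from collections.abc import Sequence


def build_device_masks(
    device_ids: Sequence[str],
    agent_ids: Sequence[str],
) -> dict[str, frozenset[str]]:
    """Assign whole devices to agents in order; each device gets one owner.

    Hypothetical helper for illustration; not part of the PR.
    """
    if not agent_ids:
        raise ValueError("at least one agent is required")
    q, r = divmod(len(device_ids), len(agent_ids))
    masks: dict[str, frozenset[str]] = {}
    start = 0
    for index, agent_id in enumerate(agent_ids):
        # The first r agents take q + 1 devices, the rest take q.
        count = q + 1 if index < r else q
        masks[agent_id] = frozenset(device_ids[start:start + count])
        start += count
    return masks


five_devices = [f"cuda{i}" for i in range(5)]
three_devices = [f"cuda{i}" for i in range(3)]

masks_5_3 = build_device_masks(five_devices, ["agent1", "agent2", "agent3"])
masks_3_5 = build_device_masks(three_devices, [f"agent{i}" for i in range(1, 6)])
```

Because each device index falls in exactly one slice, the resulting masks are disjoint by construction, and agents beyond the device count receive an empty mask.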

hhoikoo

Generally your change is extremely bloated with unnecessarily long implementations of simple algorithms.

src/ai/backend/agent/config/unified.py

src/ai/backend/agent/resources.py

tests/unit/agent/test_resource_allocation.py

hhoikoo

Submitting pending review

src/ai/backend/agent/resources.py

Add device partitioning logic to allocate GPU resources across multiple agents using a fill-from-front assignment strategy. Each agent receives a calculated share of device slots based on the resource split configuration. Key changes include adding the set_device_slot_amounts method to AbstractAllocMap for per-device slot updates, implementing the DevicePartition class to track device shares per agent, and updating ResourceAllocator to calculate and distribute device allocations across agents. Tests verify multi-agent scenarios.

Tests were using old SlotName + Decimal format but devices now use DeviceName + Sequence[DeviceId] format.

Update resource allocation mode tests to use the new DeviceName/Sequence[DeviceId] format instead of the legacy SlotName/Decimal format. This includes renaming slot-related test methods to device-related names and updating error message assertions. Changes include test method renames from "slots" to "device names" terminology and adding assertions for the second agent in the empty device list test to ensure consistent validation across all agents.

hhoikoo · 2026-02-02T01:00:47Z

Superseded by

github-actions bot assigned hhoikoo Dec 2, 2025

github-actions bot added the size:XL (500~ LoC) and comp:agent (Related to Agent component) labels Dec 2, 2025

hhoikoo force-pushed the feat/BA-3212/multi-agent-device-split branch 2 times, most recently from 0a23c75 to 9c8fb64 Compare December 2, 2025 08:08

Base automatically changed from fix/BA-3199 to main December 2, 2025 14:25

HyeockJinKim force-pushed the main branch 2 times, most recently from 9552aac to 4af738e Compare December 31, 2025 15:41

hhoikoo marked this pull request as draft January 19, 2026 04:16

hhoikoo force-pushed the feat/BA-3212/multi-agent-device-split branch from 9c8fb64 to 136430a Compare January 27, 2026 09:14

hhoikoo commented Jan 28, 2026


hhoikoo force-pushed the feat/BA-3212/multi-agent-device-split branch from 7725685 to 396472a Compare January 29, 2026 06:39

hhoikoo added 2 commits January 29, 2026 16:07

fix(BA-3212): Update device allocation tests to use DeviceId lists

e1f316e

Tests were using old SlotName + Decimal format but devices now use DeviceName + Sequence[DeviceId] format.

hhoikoo closed this Feb 2, 2026

hhoikoo deleted the feat/BA-3212/multi-agent-device-split branch February 2, 2026 01:00
