feat(BA-3212): Split up devices between multi agents #7047
Conversation
BEP-1016 Compatibility Review

After reviewing this PR against BEP-1016 (Accelerator Interface v2), there is one key fix needed to ensure the V1 implementation doesn't violate V2 assumptions.

Issue: Device Mutual Exclusivity

BEP-1016 requirement: each agent's allowed devices will be expressed as a device mask, so every physical device must belong to exactly one agent.

Current PR behavior:

```python
# 3 agents all accessing cuda0 with amount=1 each
for i in range(1, 4):
    ctx = allocator.get_computers(AgentId(f"agent{i}"))[DeviceName("cuda")]
    assert ctx.alloc_map.device_slots[DeviceId("cuda0")].amount == Decimal("1")
```

This means the same physical device is visible to, and allocatable by, multiple agents at once, violating device mutual exclusivity.

Recommended fix: change the partitioning logic so that each whole device is assigned to exactly one agent, instead of giving every agent the full slot amount on every device.

Edge case (more agents than devices): the first N agents get one device each and the remaining agents get an empty device set.

Example:
- 5 devices, 3 agents → agents receive 2, 2, and 1 devices
- 3 devices, 5 agents → the first 3 agents receive 1 device each, the last 2 receive none

This ensures each agent's device allocation can be cleanly represented as a device mask under BEP-1016.
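To make the required property concrete, here is a small, self-contained sketch of the mutual-exclusivity check the fix should satisfy; the agent and device names are illustrative and this is not code from the PR's test suite:

```python
from collections import Counter
from typing import Mapping, Sequence


def is_mutually_exclusive(partitions: Mapping[str, Sequence[str]]) -> bool:
    """Return True iff every device ID is owned by at most one agent."""
    ownership = Counter(dev for devices in partitions.values() for dev in devices)
    return all(count == 1 for count in ownership.values())


# The behavior shown above (all three agents can allocate cuda0) fails the check...
assert not is_mutually_exclusive(
    {"agent1": ["cuda0"], "agent2": ["cuda0"], "agent3": ["cuda0"]}
)
# ...while a whole-device split passes it.
assert is_mutually_exclusive({"agent1": ["cuda0", "cuda1"], "agent2": ["cuda2"]})
```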
hhoikoo left a comment
Generally your change is extremely bloated with unnecessarily long implementations of simple algorithms.
hhoikoo left a comment
Submitting pending review
Add device partitioning logic to allocate GPU resources across multiple agents using a fill-from-front assignment strategy. Each agent receives a calculated share of device slots based on the resource split configuration. Key changes include adding a set_device_slot_amounts method to AbstractAllocMap for per-device slot updates, implementing a DevicePartition class to track device shares per agent, and updating ResourceAllocator to calculate and distribute device allocations across agents. Tests verify multi-agent scenarios.
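As a rough illustration of how these pieces could fit together, here is a hedged sketch; the actual DevicePartition fields and the set_device_slot_amounts() signature in this PR are not reproduced here and may differ:

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Sequence


@dataclass
class DevicePartition:
    """One agent's share of whole devices for a single device name."""
    device_name: str
    device_ids: Sequence[str]


def apply_partition(
    alloc_map_slots: dict[str, Decimal],
    partition: DevicePartition,
    per_device_amount: Decimal,
) -> None:
    """Stand-in for an AbstractAllocMap.set_device_slot_amounts()-style update:
    give the agent the full amount on devices it owns and zero elsewhere."""
    for device_id in list(alloc_map_slots):
        owned = device_id in partition.device_ids
        alloc_map_slots[device_id] = per_device_amount if owned else Decimal("0")


slots = {"cuda0": Decimal("0"), "cuda1": Decimal("0"), "cuda2": Decimal("0")}
apply_partition(slots, DevicePartition("cuda", ["cuda0", "cuda1"]), Decimal("1"))
assert slots == {"cuda0": Decimal("1"), "cuda1": Decimal("1"), "cuda2": Decimal("0")}
```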
Tests were using the old SlotName + Decimal format, but devices now use the DeviceName + Sequence[DeviceId] format.
Update the resource allocation mode tests to use the new DeviceName/Sequence[DeviceId] format instead of the legacy SlotName/Decimal format. This includes renaming slot-related test methods to device-related names (moving from "slots" to "device names" terminology), updating error message assertions, and adding assertions for the second agent in the empty-device-list test to ensure consistent validation across all agents.
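For illustration, a hypothetical before/after of the fixture format; SlotName, DeviceName, and DeviceId here are local stand-ins for the project's own types, and the exact test fixtures in the repository may differ:

```python
from decimal import Decimal
from typing import NewType, Sequence

SlotName = NewType("SlotName", str)
DeviceName = NewType("DeviceName", str)
DeviceId = NewType("DeviceId", str)

# Legacy format: a per-slot Decimal amount keyed by SlotName.
legacy_devices: dict[SlotName, Decimal] = {
    SlotName("cuda.shares"): Decimal("2"),
}

# New format: a sequence of whole device IDs keyed by DeviceName.
new_devices: dict[DeviceName, Sequence[DeviceId]] = {
    DeviceName("cuda"): [DeviceId("cuda0"), DeviceId("cuda1")],
}
```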
resolves #7046 (BA-3212)
Overview
Implements BEP-1016-compliant device partitioning for multi-agent resource allocation. Devices are now assigned as whole units to agents using divmod distribution, ensuring each physical device belongs to exactly one agent (device mutual exclusivity).
Problem Statement
When multiple agents share a single resource allocator, each agent previously saw every device with its full slot amount, so the same physical device could be allocated by more than one agent, violating BEP-1016's device mutual exclusivity requirement.
Architecture
```mermaid
flowchart TB
    subgraph "Device Pool (N=5 devices)"
        D0[cuda0]
        D1[cuda1]
        D2[cuda2]
        D3[cuda3]
        D4[cuda4]
    end
    subgraph "divmod(5, 3) = (1, 2)"
        direction LR
        Q["q=1 base devices"]
        R["r=2 remainder"]
    end
    subgraph "Agent Partitions (M=3 agents)"
        A1[Agent1<br/>q+1 = 2 devices]
        A2[Agent2<br/>q+1 = 2 devices]
        A3[Agent3<br/>q = 1 device]
    end
    D0 --> A1
    D1 --> A1
    D2 --> A2
    D3 --> A2
    D4 --> A3
```

Distribution algorithm:
- `q, r = divmod(N, M)`
- `r` agents receive `q + 1` devices each
- `M - r` agents receive `q` devices each
- If `M > N`, the first N agents get 1 device each and the remaining agents get empty device masks

Implementation Notes
- `_compute_device_partitions()` now assigns whole devices instead of partial slot amounts
- `_compute_device_partition()` calculates the per-agent device count using divmod (see the sketch below)
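The following minimal sketch shows the divmod distribution under the rules above; the real `_compute_device_partition()`/`_compute_device_partitions()` helpers in this PR likely differ in signature and return types:

```python
from typing import Sequence


def split_devices(device_ids: Sequence[str], num_agents: int) -> list[list[str]]:
    """Fill-from-front: the first r agents get q+1 devices, the rest get q."""
    q, r = divmod(len(device_ids), num_agents)
    partitions: list[list[str]] = []
    cursor = 0
    for agent_idx in range(num_agents):
        count = q + 1 if agent_idx < r else q
        partitions.append(list(device_ids[cursor:cursor + count]))
        cursor += count
    return partitions


# 5 devices, 3 agents -> 2 + 2 + 1
assert split_devices(["cuda0", "cuda1", "cuda2", "cuda3", "cuda4"], 3) == [
    ["cuda0", "cuda1"], ["cuda2", "cuda3"], ["cuda4"],
]
# 3 devices, 5 agents -> first 3 agents get one device, the last 2 get none
assert split_devices(["cuda0", "cuda1", "cuda2"], 5) == [
    ["cuda0"], ["cuda1"], ["cuda2"], [], [],
]
```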