docs: Add BEP-1043 for fixing multi agent device allocation #8501
### Slot Calculation (Before)

The `ResourceAllocator` uses `_calculate_device_slot_*` methods to compute slot amounts:

```python
def _calculate_device_slot(self, slot_name, agent_idx, agent_config, alloc_map_type):
    match agent_config.resource.allocation_mode:
        case ResourceAllocationMode.SHARED:
            return self._calculate_device_slot_shared(slot_name)
        case ResourceAllocationMode.AUTO_SPLIT:
            return self._calculate_device_slot_auto_split(slot_name, alloc_map_type, agent_idx)
        case ResourceAllocationMode.MANUAL:
            return self._calculate_device_slot_manual(slot_name, agent_config)
```
Could you briefly explain what each existing mode does? It's difficult to understand without prior knowledge of the current implementation.
I rewrote quite a bit of the document to address this feedback
1. **No device exclusivity**: Slot-based allocation divides abstract quantities (e.g., "0.5 of cuda.shares") but cannot enforce that agents use mutually exclusive physical devices. For CPU cores and accelerators, multiple agents must not share the same device; slot amounts alone cannot express or enforce this.

2. **Abstraction mismatch**: The actual resource being partitioned is a set of physical devices (cuda0, cuda1, etc.), not slot amounts. Slot names like `cuda.mem` and `cuda.shares` are derived properties; we should configure at the device level and derive slots from that, not the other way around.
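The "configure devices, derive slots" direction in point 2 can be illustrated with a small sketch (the `Device` dataclass and its field names are hypothetical stand-ins for the actual `AbstractComputeDevice` API):

```python
from dataclasses import dataclass

@dataclass
class Device:
    device_id: str
    memory: int     # bytes
    shares: float

def derive_slots(assigned: list[Device]) -> dict[str, float]:
    """Slot amounts become derived properties of an explicit device set."""
    return {
        "cuda.mem": sum(d.memory for d in assigned),
        "cuda.shares": sum(d.shares for d in assigned),
    }

# Exclusivity is now structural: cuda0 and cuda1 belong to exactly one agent,
# and the slot amounts fall out of that assignment rather than configuring it.
agent_devices = [Device("cuda0", 16 * 2**30, 1.0), Device("cuda1", 16 * 2**30, 1.0)]
slots = derive_slots(agent_devices)
```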
That's technically correct, but I suspect this implementation arose because of an existing use case. Do you happen to know anything about that?
No, there is no good reason why this was done. It was simply a wrong implementation of the multi-agent feature, stemming from my misunderstanding of how multi-agent resource splitting was supposed to work. There is no need for backwards compatibility here, especially considering that no production (or even dev) system is using the multi-agent feature.
### Device Partitioner Abstraction

```python
class DevicePartitioner(Protocol):
    """Protocol for generating device assignments in AUTO_SPLIT mode."""

    device_name: DeviceName

    def generate_assignments(
        self,
        devices: Sequence[AbstractComputeDevice],
        agent_ids: Sequence[AgentId],
    ) -> Mapping[AgentId, Sequence[DeviceId]]: ...
```
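To make the protocol concrete, here is a hypothetical implementation that satisfies it, with device and agent types simplified to `str` (the BEP above only defines the protocol; this contiguous-chunk strategy is an assumption for illustration):

```python
from collections.abc import Mapping, Sequence

class EvenSplitPartitioner:
    """Assign devices to agents in contiguous, mutually exclusive chunks."""

    def __init__(self, device_name: str) -> None:
        self.device_name = device_name

    def generate_assignments(
        self,
        devices: Sequence[str],
        agent_ids: Sequence[str],
    ) -> Mapping[str, Sequence[str]]:
        # Distribute len(devices) across agents as evenly as possible;
        # the first `extra` agents each receive one additional device.
        per_agent, extra = divmod(len(devices), len(agent_ids))
        assignments: dict[str, list[str]] = {}
        start = 0
        for i, agent_id in enumerate(agent_ids):
            count = per_agent + (1 if i < extra else 0)
            assignments[agent_id] = list(devices[start:start + count])
            start += count
        return assignments
```

Because each device index range is consumed exactly once, the resulting assignments are disjoint by construction, which is the exclusivity guarantee slot amounts could not provide.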
I don't have a view of the whole picture right now, so it's hard to understand. This BEP reads as a list of implementations rather than feeling organically connected. Could you add context about these implementations?
I rewrote quite a bit of the document to address this feedback
```toml
# Before (slot-based)
[agent.resource.allocations]
devices = { "cuda.mem" = 0.5, "cuda.shares" = 0.5 }

# After (device-based)
[agent.resource.allocations.devices]
cuda = ["cuda0", "cuda1"]
```
Users may still request the existing (user-friendly) allocation of 0.5, etc. In that case, please specify how the settings should be configured.
As mentioned above, no current production system uses the multi agent feature. I don't think we need to be backwards compatible here, but I could be wrong. Could you please tell me?
Force-pushed from `20fbf2b` to `45f3328`.
Pull request overview
Adds a new Backend.AI Enhancement Proposal (BEP-1043) describing a shift from slot-based to device-based allocation to support multi-agent device partitioning.
Changes:
- Register BEP-1043 in the BEP number registry.
- Add new proposal document outlining motivation, proposed design, migration, and an implementation plan for device-based multi-agent allocation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| proposals/README.md | Adds BEP-1043 entry to the registry table. |
| proposals/BEP-1043-multi-agent-device-split.md | Introduces the BEP-1043 draft describing device-based multi-agent allocation and rollout plan. |
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>