docs: Add BEP-1043 for fixing multi agent device allocation #8501
### Slot Calculation (Before)

The `ResourceAllocator` uses `_calculate_device_slot_*` methods to compute slot amounts:

```python
def _calculate_device_slot(self, slot_name, agent_idx, agent_config, alloc_map_type):
    match agent_config.resource.allocation_mode:
        case ResourceAllocationMode.SHARED:
            return self._calculate_device_slot_shared(slot_name)
        case ResourceAllocationMode.AUTO_SPLIT:
            return self._calculate_device_slot_auto_split(slot_name, alloc_map_type, agent_idx)
        case ResourceAllocationMode.MANUAL:
            return self._calculate_device_slot_manual(slot_name, agent_config)
```
Could you briefly explain what each existing mode does? It's difficult to understand without prior knowledge of the current implementation.
I rewrote quite a bit of the document to address this feedback
1. **No device exclusivity**: Slot-based allocation divides abstract quantities (e.g., "0.5 of cuda.shares") but cannot enforce that agents use mutually exclusive physical devices. For CPU cores and accelerators, multiple agents must not share the same device; slot amounts alone cannot express or enforce this.

2. **Abstraction mismatch**: The actual resource being partitioned is a set of physical devices (cuda0, cuda1, etc.), not slot amounts. Slot names like `cuda.mem` and `cuda.shares` are derived properties; we should configure at the device level and derive slots from that, not the other way around.
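The "configure devices, derive slots" direction in point 2 can be illustrated with a small sketch (the `Device` dataclass and its field names are hypothetical stand-ins for the actual `AbstractComputeDevice` API):

```python
from dataclasses import dataclass

@dataclass
class Device:
    device_id: str
    memory: int     # bytes
    shares: float

def derive_slots(assigned: list[Device]) -> dict[str, float]:
    """Slot amounts become derived properties of an explicit device set."""
    return {
        "cuda.mem": sum(d.memory for d in assigned),
        "cuda.shares": sum(d.shares for d in assigned),
    }

# Exclusivity is now structural: cuda0 and cuda1 belong to exactly one agent,
# and the slot amounts fall out of that assignment rather than configuring it.
agent_devices = [Device("cuda0", 16 * 2**30, 1.0), Device("cuda1", 16 * 2**30, 1.0)]
slots = derive_slots(agent_devices)
```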
That's technically correct, but I suspect this implementation arose because of an existing use case. Do you happen to know anything about that?
No, there is no good reason why this was done. It was simply a wrong implementation of the multi-agent feature, stemming from my misunderstanding of how multi-agent resource splitting was supposed to work. There is no need for backwards compatibility here, especially considering that no production (or even dev) system is using the multi-agent feature.
### Device Partitioner Abstraction

```python
class DevicePartitioner(Protocol):
    """Protocol for generating device assignments in AUTO_SPLIT mode."""

    device_name: DeviceName

    def generate_assignments(
        self,
        devices: Sequence[AbstractComputeDevice],
        agent_ids: Sequence[AgentId],
    ) -> Mapping[AgentId, Sequence[DeviceId]]: ...
```
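To make the protocol concrete, here is a hypothetical implementation that satisfies it, with device and agent types simplified to `str` (the BEP above only defines the protocol; this contiguous-chunk strategy is an assumption for illustration):

```python
from collections.abc import Mapping, Sequence

class EvenSplitPartitioner:
    """Assign devices to agents in contiguous, mutually exclusive chunks."""

    def __init__(self, device_name: str) -> None:
        self.device_name = device_name

    def generate_assignments(
        self,
        devices: Sequence[str],
        agent_ids: Sequence[str],
    ) -> Mapping[str, Sequence[str]]:
        # Distribute len(devices) across agents as evenly as possible;
        # the first `extra` agents each receive one additional device.
        per_agent, extra = divmod(len(devices), len(agent_ids))
        assignments: dict[str, list[str]] = {}
        start = 0
        for i, agent_id in enumerate(agent_ids):
            count = per_agent + (1 if i < extra else 0)
            assignments[agent_id] = list(devices[start:start + count])
            start += count
        return assignments
```

Because each device index range is consumed exactly once, the resulting assignments are disjoint by construction, which is the exclusivity guarantee slot amounts could not provide.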
I don't have a view of the whole picture right now, so it's hard to understand. This BEP reads as a list of implementations rather than feeling organically connected. Could you add context about these implementations?
I rewrote quite a bit of the document to address this feedback
```toml
# Before (slot-based)
[agent.resource.allocations]
devices = { "cuda.mem" = 0.5, "cuda.shares" = 0.5 }

# After (device-based)
[agent.resource.allocations.devices]
cuda = ["cuda0", "cuda1"]
```
Users may still request the existing (user-friendly) allocation of 0.5, etc. In that case, please specify how the settings should be configured.
As mentioned above, no current production system uses the multi agent feature. I don't think we need to be backwards compatible here, but I could be wrong. Could you please tell me?
Force-pushed from `20fbf2b` to `45f3328`.
Pull request overview
Adds a new Backend.AI Enhancement Proposal (BEP-1043) describing a shift from slot-based to device-based allocation to support multi-agent device partitioning.
Changes:
- Register BEP-1043 in the BEP number registry.
- Add new proposal document outlining motivation, proposed design, migration, and an implementation plan for device-based multi-agent allocation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| proposals/README.md | Adds BEP-1043 entry to the registry table. |
| proposals/BEP-1043-multi-agent-device-split.md | Introduces the BEP-1043 draft describing device-based multi-agent allocation and rollout plan. |
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>