Skip to content

Conversation

@hhoikoo
Copy link
Member

@hhoikoo hhoikoo commented Feb 2, 2026

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@hhoikoo hhoikoo self-assigned this Feb 2, 2026
Copilot AI review requested due to automatic review settings February 2, 2026 02:06
@hhoikoo hhoikoo added the skip:changelog Make the action workflow to skip towncrier check label Feb 2, 2026
@github-actions github-actions bot added the size:L 100~500 LoC label Feb 2, 2026
@hhoikoo hhoikoo requested review from HyeockJinKim and removed request for Copilot February 2, 2026 02:07
Comment on lines 44 to 57
### Slot Calculation (Before)

The `ResourceAllocator` uses `_calculate_device_slot_*` methods to compute slot amounts:

```python
def _calculate_device_slot(self, slot_name, agent_idx, agent_config, alloc_map_type):
match agent_config.resource.allocation_mode:
case ResourceAllocationMode.SHARED:
return self._calculate_device_slot_shared(slot_name)
case ResourceAllocationMode.AUTO_SPLIT:
return self._calculate_device_slot_auto_split(slot_name, alloc_map_type, agent_idx)
case ResourceAllocationMode.MANUAL:
return self._calculate_device_slot_manual(slot_name, agent_config)
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you briefly explain what each existing mode does? It's difficult to understand without prior knowledge of the current implementation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote quite a bit of the document to address this feedback


1. **No device exclusivity**: Slot-based allocation divides abstract quantities (e.g., "0.5 of cuda.shares") but cannot enforce that agents use mutually exclusive physical devices. For CPU cores and accelerators, multiple agents must not share the same device - slot amounts alone cannot express or enforce this.

2. **Abstraction mismatch**: The actual resource being partitioned is a set of physical devices (cuda0, cuda1, etc.), not slot amounts. Slot names like `cuda.mem` and `cuda.shares` are derived properties - we should configure at the device level and derive slots from that, not the other way around.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's technically correct, but I suspect this implementation arose because of an existing use case. Do you happen to know anything about that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No there is no good reason why this was done. This was simply a wrong implementation of the multi agent feature, which was done because of my misunderstanding on how the multi agent resource splitting was supposed to be implemented. There is no need for backwards compatibility here, especially considering how no production (or even dev system) is using the multi agent feature

Comment on lines 79 to 91
### Device Partitioner Abstraction

```python
class DevicePartitioner(Protocol):
"""Protocol for generating device assignments in AUTO_SPLIT mode."""
device_name: DeviceName

def generate_assignments(
self,
devices: Sequence[AbstractComputeDevice],
agent_ids: Sequence[AgentId],
) -> Mapping[AgentId, Sequence[DeviceId]]: ...
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a view of the whole picture right now, so it's hard to understand. This BEP appears as a list of implementations rather than feeling organically connected. Could you add context about this implementations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote quite a bit of the document to address this feedback

Comment on lines 142 to 150
```toml
# Before (slot-based)
[agent.resource.allocations]
devices = { "cuda.mem" = 0.5, "cuda.shares" = 0.5 }

# After (device-based)
[agent.resource.allocations.devices]
cuda = ["cuda0", "cuda1"]
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users may still request the existing (user-friendly) allocation of 0.5, etc. In that case, please specify how the settings should be configured.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, no current production system uses the multi agent feature. I don't think we need to be backwards compatible here, but I could be wrong. Could you please tell me?

Copilot AI review requested due to automatic review settings February 2, 2026 05:19
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new Backend.AI Enhancement Proposal (BEP-1043) describing a shift from slot-based to device-based allocation to support multi-agent device partitioning.

Changes:

  • Register BEP-1043 in the BEP number registry.
  • Add new proposal document outlining motivation, proposed design, migration, and an implementation plan for device-based multi-agent allocation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
proposals/README.md Adds BEP-1043 entry to the registry table.
proposals/BEP-1043-multi-agent-device-split.md Introduces the BEP-1043 draft describing device-based multi-agent allocation and rollout plan.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L 100~500 LoC skip:changelog Make the action workflow to skip towncrier check

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants