Member

@hhoikoo hhoikoo commented Jan 30, 2026

resolves #8428 (BA-4146)

Overview

Changes MANUAL mode resource allocation from slot-based format ({"cuda.mem": 0.5}) to device-based format (cuda = ["cuda0", "cuda1"]), enabling explicit device-to-agent mapping in multi-agent configurations.

Problem Statement

  • MANUAL mode used slot-based decimal allocation, which did not align with the device-centric resource model
  • There was no explicit control over which physical devices were assigned to which agents
  • The BEP-1041 migration to device-centric resource allocation was incomplete without MANUAL mode support

Architecture

flowchart TB
    subgraph "Configuration (unified.py)"
        RC[ResourceAllocationConfig]
        RC -->|"devices: Mapping[DeviceName, Sequence[DeviceId]]"| DV[Device Validator]
        DV -->|"Detect old format"| ERR[Validation Error]
        DV -->|"Parse new format"| OK[Valid Config]
    end

    subgraph "Resource Allocation (resources.py)"
        RA[ResourceAllocator.__ainit__]
        RA -->|MANUAL mode| GMA[_generate_manual_assignments]
        RA -->|AUTO_SPLIT| GAS[_generate_auto_split_assignments]
        RA -->|SHARED| GSA[_generate_shared_assignments]
        GMA --> ADA[_apply_device_assignments]
        GAS --> ADA
        GSA --> ADA
    end

    OK --> GMA
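
The flow in the diagram, where three mode-specific generators feed one shared application step, can be sketched as below. This is a minimal illustration rather than the code in resources.py: the `Mode` enum, the `Assignments` shape, the `generate_*` functions, and the round-robin policy used for AUTO_SPLIT are all assumptions made for the example.

```python
from enum import Enum
from typing import Mapping, Optional, Sequence

# Hypothetical shape: agent id -> device name -> list of device ids.
Assignments = dict[str, dict[str, list[str]]]
Devices = Mapping[str, Sequence[str]]


class Mode(Enum):
    SHARED = "shared"
    AUTO_SPLIT = "auto-split"
    MANUAL = "manual"


def generate_shared(agents: Sequence[str], devices: Devices) -> Assignments:
    # Every agent sees every discovered device.
    return {a: {name: list(ids) for name, ids in devices.items()} for a in agents}


def generate_auto_split(agents: Sequence[str], devices: Devices) -> Assignments:
    # Placeholder policy (an assumption): deal devices out round-robin.
    n = len(agents)
    return {
        a: {name: list(ids[i::n]) for name, ids in devices.items()}
        for i, a in enumerate(agents)
    }


def generate_manual(agents: Sequence[str], manual: Mapping[str, Devices]) -> Assignments:
    # Take each agent's list straight from its (already validated) config.
    return {a: {name: list(ids) for name, ids in manual[a].items()} for a in agents}


def generate_assignments(
    mode: Mode,
    agents: Sequence[str],
    devices: Devices,
    manual: Optional[Mapping[str, Devices]] = None,
) -> Assignments:
    # All three branches return the same structure, so a single
    # "apply" step can consume the result regardless of mode.
    if mode is Mode.MANUAL:
        if manual is None:
            raise ValueError("MANUAL mode requires per-agent device lists")
        return generate_manual(agents, manual)
    if mode is Mode.AUTO_SPLIT:
        return generate_auto_split(agents, devices)
    return generate_shared(agents, devices)
```

Returning one common structure from every generator is what lets a single application step serve all three modes.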

Configuration Format Change

Old slot-based format (rejected):

[resource.allocations]
devices = {"cuda.mem" = 0.5, "cuda.shares" = 2}

New device-based format:

[resource.allocations]
devices = {cuda = ["cuda0", "cuda1"], rocm = ["rocm0"]}
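
A rough sketch of how the two formats might be told apart after TOML parsing is shown below. It assumes the old format can be recognized by its numeric values; the function name `validate_devices` and the error messages are hypothetical and do not come from unified.py.

```python
from collections.abc import Mapping, Sequence
from numbers import Number


def validate_devices(devices: Mapping[str, object]) -> dict[str, list[str]]:
    """Accept {device name: [device ids]} and reject slot-based numeric values."""
    validated: dict[str, list[str]] = {}
    for name, value in devices.items():
        # Old format: a number keyed by a slot name such as "cuda.mem".
        if isinstance(value, Number):
            raise ValueError(
                f"{name!r}: slot-based values are no longer accepted; "
                "list device IDs instead, e.g. cuda = ['cuda0', 'cuda1']"
            )
        # A bare string is a Sequence too, so rule it out explicitly.
        if isinstance(value, str) or not isinstance(value, Sequence):
            raise ValueError(f"{name!r}: expected a list of device IDs")
        if not all(isinstance(item, str) for item in value):
            raise ValueError(f"{name!r}: device IDs must be strings")
        validated[name] = list(value)
    return validated
```

Called with the parsed new-format table, the function returns the mapping unchanged; called with the old-format table, it raises on the first numeric value.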

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

Copilot AI review requested due to automatic review settings January 30, 2026 06:31
@github-actions github-actions bot added the size:XL (500~ LoC) and comp:agent (Related to Agent component) labels Jan 30, 2026
Copilot AI left a comment

Pull request overview

Updates agent MANUAL-mode resource allocation to use explicit device ID lists (device-centric) and refactors resource discovery/allocation to operate on per-device assignments across agents.

Changes:

  • Switch MANUAL mode allocations.devices from slot-based decimals (cuda.mem, cuda.shares) to device-based mappings (cuda = ["cuda0", ...]).
  • Refactor ResourceAllocator to separate global device discovery from per-agent device assignment (AUTO_SPLIT / SHARED / MANUAL).
  • Expand/update unit tests for config validation and resource allocation behaviors (including partitioning helpers like natural sort and device distribution).
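
The review mentions natural sort and device distribution helpers without showing them. The sketch below illustrates what such helpers could look like; the names `natural_sort_key` and `distribute_evenly` and the contiguous-block policy are assumptions for illustration, not the helpers in the pull request.

```python
import re


def natural_sort_key(device_id: str) -> list:
    # Split into alternating text and digit runs so that "cuda10"
    # sorts after "cuda2" instead of between "cuda1" and "cuda2".
    parts = re.split(r"(\d+)", device_id)
    return [int(p) if p.isdigit() else p for p in parts]


def distribute_evenly(device_ids: list[str], num_agents: int) -> list[list[str]]:
    # Hand out contiguous blocks in natural order; when the count does
    # not divide evenly, the first agents receive one extra device.
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    ordered = sorted(device_ids, key=natural_sort_key)
    base, extra = divmod(len(ordered), num_agents)
    chunks: list[list[str]] = []
    start = 0
    for i in range(num_agents):
        size = base + (1 if i < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks
```

Sorting before splitting keeps the result deterministic regardless of the order in which devices were discovered.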

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/ai/backend/agent/config/unified.py | Changes MANUAL allocations.devices schema to device-based mapping and adds format validation. |
| src/ai/backend/agent/resources.py | Introduces global device discovery structures and implements device assignment generation/application for allocation modes. |
| src/ai/backend/agent/types.py | Adds typing for device assignment mappings and a device partitioner protocol. |
| tests/unit/agent/test_resource_allocation.py | Updates/extends tests to reflect device-based allocation and new partitioning helpers. |
| tests/unit/agent/test_resources.py | Adds tests for new GlobalDeviceInfo and device discovery helper behavior. |
| tests/unit/agent/test_config_validation.py | Updates validation tests to reflect device-based MANUAL allocations and old-format rejection. |
| changes/8440.feature.md | Changelog entry for the device discovery infrastructure refactor. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

@hhoikoo hhoikoo changed the base branch from main to feat/BA-4145 January 30, 2026 16:55
@hhoikoo hhoikoo requested a review from Copilot January 30, 2026 17:22
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

tests/unit/agent/test_resource_allocation.py:759

  • MANUAL mode behavior is now implemented in ResourceAllocator/ResourcePartitioner, but there are no ResourceAllocator MANUAL-mode integration tests in this file anymore (the previous class was removed). Add integration tests that assert per-agent CPU/mem reservations and device visibility in MANUAL mode, and cover failure cases like duplicate device IDs across agents.

        # Memory: shared device, both see same device with divided slots
        assert len(agent1_computers[DeviceName("mem")].alloc_map.device_slots) == 1
        assert len(agent2_computers[DeviceName("mem")].alloc_map.device_slots) == 1

        await allocator.__aexit__(None, None, None)

@hhoikoo hhoikoo force-pushed the feat/BA-4146 branch 5 times, most recently from d442cc1 to 88f13c1 February 2, 2026 04:22
hhoikoo added a commit that referenced this pull request Feb 2, 2026
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add `cpu` field as Sequence[DeviceId] for explicit core assignment
- Change `devices` type from Mapping[SlotName, Decimal] to
  Mapping[DeviceName, Sequence[DeviceId]]
- Add validators to detect and reject old slot-based/integer formats
- Implement generate_manual_assignments() in ResourcePartitioner
- Validate device names and IDs exist in discovered hardware
- Update tests for new format, add error case coverage
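
The check that configured device names and IDs exist in discovered hardware could look roughly like the sketch below. The function name `validate_against_hardware` and the plain-mapping inputs are assumptions made for the example; the actual implementation lives in ResourcePartitioner and may be structured differently.

```python
from collections.abc import Mapping, Sequence


def validate_against_hardware(
    configured: Mapping[str, Sequence[str]],
    discovered: Mapping[str, Sequence[str]],
) -> None:
    """Raise ValueError if the config names a device the host does not have."""
    # First pass: device names (e.g. "cuda") that were never discovered.
    unknown_names = sorted(set(configured) - set(discovered))
    if unknown_names:
        raise ValueError(f"unknown device names: {unknown_names}")
    # Second pass: device IDs that do not exist under a known name.
    for name, ids in configured.items():
        available = set(discovered[name])
        missing = [dev for dev in ids if dev not in available]
        if missing:
            raise ValueError(f"{name}: unknown device IDs {missing}")
```

Checking names before IDs keeps the error message focused on the most basic mistake first.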

Refactor generate_manual_assignments to properly handle shared
devices like memory. Previously, MANUAL mode incorrectly created
empty DevicePartitions for memory devices, resulting in zero
memory allocation.

Add helper methods to separate device partition logic:
- _parse_manual_device_partition for discrete devices (cpu, cuda)
- _parse_manual_slot_partition for shared devices (mem)
- _validate_configured_device_names for device name validation
- _validate_device_exclusivity for exclusivity checks

The refactored implementation iterates over global_devices and
selects the appropriate helper based on device type, ensuring
SlotPartitions are created with configured memory amounts.
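
The split described in this commit message, with discrete devices receiving ID lists, shared devices receiving an amount, and an exclusivity check across agents, can be sketched as follows. The dataclasses reuse the names DevicePartition and SlotPartition from the message, but their fields, the `SHARED_DEVICE_NAMES` set, and both function names are assumptions for illustration only.

```python
from collections.abc import Iterable, Mapping, Sequence
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class DevicePartition:  # discrete devices such as cpu or cuda
    device_ids: tuple[str, ...]


@dataclass(frozen=True)
class SlotPartition:  # shared devices such as mem
    amount: int


SHARED_DEVICE_NAMES = frozenset({"mem"})  # assumption for this sketch


def build_partitions(
    global_devices: Iterable[str],
    cpu: Sequence[str],
    mem: int,
    devices: Mapping[str, Sequence[str]],
) -> dict[str, Union[DevicePartition, SlotPartition]]:
    # Iterate over discovered devices and pick the partition type per
    # device, so memory gets its configured amount instead of an
    # empty DevicePartition.
    partitions: dict[str, Union[DevicePartition, SlotPartition]] = {}
    for name in global_devices:
        if name in SHARED_DEVICE_NAMES:
            partitions[name] = SlotPartition(amount=mem)
        elif name == "cpu":
            partitions[name] = DevicePartition(tuple(cpu))
        else:
            partitions[name] = DevicePartition(tuple(devices.get(name, ())))
    return partitions


def check_exclusivity(per_agent: Mapping[str, Mapping[str, Sequence[str]]]) -> None:
    # A discrete device may belong to at most one agent.
    owners: dict[tuple[str, str], str] = {}
    for agent, assigned in per_agent.items():
        for name, ids in assigned.items():
            for dev in ids:
                key = (name, dev)
                if key in owners:
                    raise ValueError(
                        f"{name}/{dev} is assigned to both {owners[key]} and {agent}"
                    )
                owners[key] = agent
```

Driving the loop from the discovered device list rather than from the config is what guarantees that every device, including memory, ends up with some partition.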