@hhoikoo hhoikoo commented Jan 30, 2026

resolves #8427 (BA-4145)

Overview

Implements device-based partitioning for multi-agent AUTO_SPLIT mode, enabling proper resource
isolation where devices (CPU/GPU) are exclusively assigned to agents while memory is shared.

Problem Statement

  • Multi-agent configurations need exclusive device assignment to prevent resource contention
  • Previous slot-based approach didn't partition physical devices between agents
  • AUTO_SPLIT mode needed device-centric implementation per BEP-1041 design

Architecture

flowchart TB
    subgraph "Device Discovery"
        LD[_load_resources] --> GDM[GlobalDeviceMap]
        GDM --> GDI[GlobalDeviceInfo per device type]
    end

    subgraph "Partition Types"
        DP[DevicePartition<br/>Exclusive device IDs]
        SP[SlotPartition<br/>Shared slot amounts]
    end

    subgraph "ResourcePartitioner"
        GSA[generate_shared_assignments]
        GASA[generate_autosplit_assignments]
        CDP[_calculate_device_partitions<br/>DiscretePropertyAllocMap]
        CSP[_calculate_slot_partitions<br/>FractionAllocMap]
        GASA --> CDP --> DP
        GASA --> CSP --> SP
    end

    subgraph "Agent Configuration"
        DP --> ADA[_apply_device_assignments]
        SP --> ADA
        ADA --> AC1[Agent1 ComputerContext]
        ADA --> AC2[Agent2 ComputerContext]
    end

Key Components

ResourcePartitioner: Static methods for generating device assignments

  • generate_shared_assignments: All agents see all devices (SHARED mode)
  • generate_autosplit_assignments: Exclusive partitioning (AUTO_SPLIT mode)
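The difference between the two modes can be illustrated with a minimal sketch. This is not the PR's code: the function names, the `Assignments` shape (agent ID to device name to device IDs), and the `SHARED_DEVICE_NAMES` set are assumptions made for illustration, and device IDs are assumed to arrive already ordered.

```python
# Hypothetical sketch: names and data shapes are assumptions, not the PR's API.
SHARED_DEVICE_NAMES = frozenset({"mem"})  # assumed: memory stays visible to every agent

Assignments = dict[str, dict[str, list[str]]]


def shared_assignments(devices: dict[str, list[str]], agent_ids: list[str]) -> Assignments:
    """SHARED mode: every agent sees every device."""
    return {agent: {name: list(ids) for name, ids in devices.items()} for agent in agent_ids}


def autosplit_assignments(devices: dict[str, list[str]], agent_ids: list[str]) -> Assignments:
    """AUTO_SPLIT mode: each non-shared device ID goes to exactly one agent."""
    if not agent_ids:
        return {}
    result: Assignments = {agent: {} for agent in agent_ids}
    for name, ids in devices.items():
        if name in SHARED_DEVICE_NAMES:
            # Shared devices are copied to all agents unchanged.
            for agent in agent_ids:
                result[agent][name] = list(ids)
            continue
        # Contiguous chunks; earlier agents absorb the remainder.
        base, extra = divmod(len(ids), len(agent_ids))
        start = 0
        for i, agent in enumerate(agent_ids):
            size = base + (1 if i < extra else 0)
            result[agent][name] = list(ids[start:start + size])
            start += size
    return result
```

With two agents and three GPUs, the first agent would receive two GPUs and the second one, while both keep the same view of the memory device.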

Partition Types:

  • DevicePartition: Exclusive device ID assignment (CPU cores, GPUs)
  • SlotPartition: Shared slot amount division (memory)

Partitioning Logic:

  • DiscretePropertyAllocMap → DevicePartition: Devices distributed via fill-from-front
  • FractionAllocMap → SlotPartition: Slots divided evenly among agents
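As a rough illustration of the two partition shapes and of the even slot split, consider the sketch below. The field names (`device_ids`, `amount`) and the helper `split_slots_evenly` are hypothetical; the actual dataclasses in the PR may carry different fields.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class DevicePartition:
    """Assumed shape: the device IDs owned exclusively by one agent."""
    device_ids: tuple[str, ...]


@dataclass(frozen=True)
class SlotPartition:
    """Assumed shape: the slot amount granted to one agent."""
    amount: Decimal


def split_slots_evenly(total: Decimal, num_agents: int) -> list[SlotPartition]:
    """Divide a slot amount into equal shares, one per agent (hypothetical helper)."""
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    share = total / num_agents
    return [SlotPartition(share) for _ in range(num_agents)]
```

For example, a total of 64 slots split across 4 agents gives each agent a `SlotPartition` of 16.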

distribute_devices Function: Implements fill-from-front algorithm with natural sorting

  • Given N devices and M agents, with r = N mod M: the first r agents get ⌈N/M⌉ devices, and the remaining M − r agents get ⌊N/M⌋
  • Natural sort ensures the order cuda0, cuda1, cuda2, cuda10 (not the lexicographic cuda0, cuda1, cuda10, cuda2)
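The fill-from-front rule and the natural ordering can be sketched as follows. This is an illustrative reimplementation under assumptions, not the function from the PR: the signature, the return type, and the `natural_sort_key` helper name are guesses.

```python
import re
from collections.abc import Sequence


def natural_sort_key(device_id: str) -> list[tuple]:
    """Split an ID into text and number chunks so 'cuda2' sorts before 'cuda10'."""
    chunks = re.split(r"(\d+)", device_id)
    # Odd indices hold the captured digit runs. Tagging each chunk keeps
    # ints and strs from ever being compared with each other.
    return [(0, int(c)) if i % 2 else (1, c) for i, c in enumerate(chunks) if c]


def distribute_devices(device_ids: Sequence[str], num_agents: int) -> list[list[str]]:
    """Fill from the front: the first (N mod M) agents get one extra device."""
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    ordered = sorted(device_ids, key=natural_sort_key)
    base, extra = divmod(len(ordered), num_agents)
    result: list[list[str]] = []
    start = 0
    for i in range(num_agents):
        size = base + (1 if i < extra else 0)
        result.append(ordered[start:start + size])
        start += size
    return result
```

Sorting before slicing is what makes the outcome deterministic: the same set of device IDs always yields the same per-agent lists regardless of discovery order.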

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

Copilot AI review requested due to automatic review settings January 30, 2026 04:14
@github-actions github-actions bot added size:XL 500~ LoC comp:agent Related to Agent component labels Jan 30, 2026
Copilot AI left a comment

Pull request overview

This PR refactors the resource allocation pipeline to implement device-based partitioning for AUTO_SPLIT mode and introduce a clearer separation between device discovery and allocation, aligning with BEP-1041.

Changes:

  • Introduces device partitioning abstractions (DevicePartitioner, WholeDevicePartitioner, SharedDevicePartitioner), a natural device ID sorter, and helpers (distribute_devices, GlobalDeviceInfo, GlobalDeviceMap) to support device-centric allocation.
  • Reworks ResourceAllocator.__ainit__ to perform device discovery, compute total/available slots via available_slots(), generate device assignments for agents based on allocation mode, and derive reserved slots from those assignments.
  • Extensively updates and extends unit tests around resources and allocation behavior (including AUTO_SPLIT, MANUAL-as-SHARED behavior, device distribution, and the new helpers) to validate the new semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/ai/backend/agent/types.py | Adds DeviceAssignments alias and DevicePartitioner protocol to formalize the device partitioning strategy surface. |
| src/ai/backend/agent/resources.py | Implements natural sorting, device distribution, whole/shared device partitioners, GlobalDeviceInfo/GlobalDeviceMap, reworks ResourceAllocator initialization, total slot calculation via available_slots(), device assignment generation and application, and reserved-slot computation based on assignments. |
| tests/unit/agent/test_resources.py | Adds tests for GlobalDeviceInfo, _create_global_devices, handling plugins with no devices, and the new _calculate_total_slots behavior using plugin available_slots(). |
| tests/unit/agent/test_resource_allocation.py | Updates existing AUTO_SPLIT tests to device-based semantics, adds coverage for new allocation behaviors (AUTO_SPLIT device exclusivity, MANUAL behaving like SHARED, scaling factor invariants), and tests _natural_sort_key, distribute_devices, and the partitioner classes. |

@hhoikoo hhoikoo changed the base branch from main to feat/BA-4144 January 30, 2026 04:24
@hhoikoo hhoikoo force-pushed the feat/BA-4145 branch 5 times, most recently from 8e3b040 to 35fe7a6 Compare January 30, 2026 09:05
@hhoikoo hhoikoo requested a review from Copilot January 30, 2026 09:07
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@hhoikoo hhoikoo force-pushed the feat/BA-4145 branch 3 times, most recently from edd6803 to 5042f12 Compare January 30, 2026 09:38
@hhoikoo hhoikoo changed the title from "feat(BA-4145): Implement DevicePartitioner for AUTO_SPLIT" to "feat(BA-4145): Add device partitioning for AUTO_SPLIT + SHARED" Feb 2, 2026
hhoikoo added a commit that referenced this pull request Feb 2, 2026
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
device partitioning

Implement device-based partitioning for multi-agent AUTO_SPLIT
mode, replacing the previous slot-based approach. Devices are
now exclusively assigned to agents, enabling proper resource
isolation in multi-agent configurations.

Add DevicePartitioner protocol with WholeDevicePartitioner for
exclusive device assignment (CPU/GPU) and SharedDevicePartitioner
for shared access (memory devices). The distribute_devices
function implements fill-from-front allocation with natural
sorting to ensure deterministic device distribution.

Update ResourceAllocator initialization to generate device
assignments based on allocation mode. In AUTO_SPLIT mode with
multiple agents, devices are partitioned exclusively. Memory
devices (root/mem) remain shared across all agents to preserve
slot-based memory allocation semantics.

Scaling factors are now calculated as (allocated / total) for each slot
type instead of naive 1/n. This correctly handles uneven device
distributions where some agents get more devices than others.
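To see why (allocated / total) differs from 1/n, here is a small sketch. The function name, the dict-based inputs, and the zero-total fallback are assumptions for illustration only; `Fraction` is used here purely to keep the arithmetic exact.

```python
from fractions import Fraction


def scaling_factors(allocated: dict[str, int], total: dict[str, int]) -> dict[str, Fraction]:
    """Per-slot share of one agent: allocated / total (hypothetical helper)."""
    factors: dict[str, Fraction] = {}
    for slot, total_amount in total.items():
        if total_amount == 0:
            # Assumed fallback: nothing to share, so the factor is zero.
            factors[slot] = Fraction(0)
        else:
            factors[slot] = Fraction(allocated.get(slot, 0), total_amount)
    return factors
```

With 3 GPUs split across 2 agents, the first agent holds 2 devices and gets a factor of 2/3, the second holds 1 and gets 1/3. A naive 1/n would report 1/2 for both, which matches neither agent's real share.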

Move _SHARED_DEVICE_NAMES constant from inside ResourcePartitioner
class to module level for better encapsulation and reusability.

Add AllocationType enum and get_allocation_type() method to
AbstractAllocMap to replace isinstance type checking. This provides
a cleaner, more extensible approach for distinguishing allocation
strategies without relying on runtime type inspection.
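The enum-based dispatch described in this commit message might look roughly like the sketch below. The classes here are stand-ins that only borrow the names from the message; the enum members (`DISCRETE`, `FRACTIONAL`) and the `partition_kind` helper are hypothetical.

```python
import enum
from abc import ABC, abstractmethod


class AllocationType(enum.Enum):
    # Member names are assumptions; the PR may define different ones.
    DISCRETE = "discrete"
    FRACTIONAL = "fractional"


class AbstractAllocMap(ABC):
    """Stand-in for the real base class, reduced to the one method of interest."""

    @abstractmethod
    def get_allocation_type(self) -> AllocationType: ...


class DiscretePropertyAllocMap(AbstractAllocMap):
    def get_allocation_type(self) -> AllocationType:
        return AllocationType.DISCRETE


class FractionAllocMap(AbstractAllocMap):
    def get_allocation_type(self) -> AllocationType:
        return AllocationType.FRACTIONAL


def partition_kind(alloc_map: AbstractAllocMap) -> str:
    """Choose a partitioning strategy without isinstance checks."""
    if alloc_map.get_allocation_type() is AllocationType.DISCRETE:
        return "device"  # exclusive device IDs
    return "slot"  # divided slot amounts
```

A new alloc map class then only needs to report its allocation type, and the dispatch code does not have to learn about the class itself.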
hhoikoo added a commit that referenced this pull request Feb 3, 2026

Labels

comp:agent Related to Agent component size:XL 500~ LoC

3 participants