Member

@hhoikoo hhoikoo commented Jan 30, 2026

resolves #8426 (BA-4144)

Overview

Introduces the GlobalDeviceInfo dataclass and device discovery infrastructure to separate device discovery from allocation map creation in ResourceAllocator. This is the foundation for the device-based allocation approach in subsequent tickets.

Problem Statement

  • Device information was previously intertwined with allocation maps
  • There was no clear separation between discovering which devices exist and deciding how to allocate them
  • The previous approach limited flexibility for device-based partitioning strategies

Architecture

flowchart TB
    subgraph "Phase 1: Discovery"
        LP[_load_resources] --> CGD[_create_global_devices]
        CGD --> GDM[GlobalDeviceMap]
    end

    subgraph "Phase 2: Allocation"
        GDM --> CAM[create_alloc_map per plugin]
        CAM --> CC[ComputerContext]
    end

    subgraph "Phase 3: Slots"
        CC --> CTS[_calculate_total_slots]
        CTS --> ATS[available_total_slots]
    end

The refactored ResourceAllocator.__ainit__() now follows a 3-phase initialization:

  1. Device Discovery: _create_global_devices() iterates plugins and calls list_devices()
  2. Allocation Maps: Creates ComputerContext with allocation maps from GlobalDeviceInfo
  3. Slot Calculation: _calculate_total_slots() uses plugin.available_slots() directly

Key Changes

New Types:

  • GlobalDeviceInfo: Dataclass with plugin and devices fields (no alloc_map)
  • GlobalDeviceMap: Type alias for Mapping[DeviceName, GlobalDeviceInfo]

New Methods:

  • _create_global_devices(): Discovers devices from all plugins, returns GlobalDeviceMap

Refactored Methods:

  • __ainit__(): Split into 3 distinct phases with clear separation
  • _calculate_total_slots(): Now async, uses plugin.available_slots() directly
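To illustrate the async slot calculation, here is a hedged sketch that sums available_slots() across plugins. The Protocol, StubPlugin, the slot names, and the use of Decimal amounts are assumptions for the example; only the idea of querying plugin.available_slots() directly comes from this PR.

```python
import asyncio
from collections import defaultdict
from collections.abc import Iterable, Mapping
from decimal import Decimal
from typing import Protocol


class SlotReportingPlugin(Protocol):
    """Assumed minimal shape of a plugin for this example."""

    async def available_slots(self) -> Mapping[str, Decimal]: ...


async def calculate_total_slots(
    plugins: Iterable[SlotReportingPlugin],
) -> dict[str, Decimal]:
    # Decimal() is Decimal("0"), so unseen slot names start from zero.
    totals: defaultdict[str, Decimal] = defaultdict(Decimal)
    for plugin in plugins:
        slots = await plugin.available_slots()
        for slot_name, amount in slots.items():
            totals[slot_name] += amount
    return dict(totals)


class StubPlugin:
    """Hypothetical plugin that reports a fixed set of slots."""

    def __init__(self, slots: Mapping[str, Decimal]) -> None:
        self._slots = slots

    async def available_slots(self) -> Mapping[str, Decimal]:
        return self._slots


totals = asyncio.run(
    calculate_total_slots(
        [
            StubPlugin({"cpu": Decimal(4)}),
            StubPlugin({"cuda.device": Decimal(2)}),
            StubPlugin({"cuda.device": Decimal(1)}),
        ]
    )
)
```

Two plugins reporting the same slot name are summed, so the function never has to look inside an allocation map.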

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

@github-actions github-actions bot added size:L 100~500 LoC comp:agent Related to Agent component labels Jan 30, 2026
@hhoikoo hhoikoo requested a review from Copilot January 30, 2026 02:25
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces GlobalDeviceInfo dataclass and device discovery infrastructure to separate device discovery from allocation map creation in the ResourceAllocator. This refactoring establishes a cleaner 3-phase initialization process and sets the foundation for more flexible device-based allocation strategies.

Changes:

  • Added GlobalDeviceInfo dataclass to store plugin references and discovered devices without allocation maps
  • Introduced _create_global_devices() method for Phase 1 device discovery across all plugins
  • Refactored ResourceAllocator.__ainit__() into 3 distinct phases: device discovery, allocation map creation, and slot calculation
  • Made _calculate_total_slots() async to query plugins directly via available_slots() instead of reading from allocation maps

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
src/ai/backend/agent/resources.py | Introduces GlobalDeviceInfo dataclass, _create_global_devices() method, refactors __ainit__() into 3 phases, and makes _calculate_total_slots() async
tests/unit/agent/test_resources.py | Adds comprehensive unit tests for GlobalDeviceInfo, _create_global_devices(), and the async _calculate_total_slots() method

Introduce GlobalDeviceInfo dataclass to separate device discovery
from allocation map creation in ResourceAllocator. This enables
cleaner separation of concerns and more flexible device-based
allocation strategies in the future.

Key changes include splitting __ainit__() into three distinct
phases: device discovery from plugins, allocation map creation,
and slot calculation. The _calculate_total_slots() method now
uses plugin.available_slots() directly instead of reading from
allocation maps, providing cleaner abstraction boundaries.

Added comprehensive unit tests covering GlobalDeviceInfo
initialization, _create_global_devices() with single and
multiple plugins, empty device handling, and slot calculation
with aggregation.

Refactor device discovery infrastructure to align with downstream
PR 8447 (BA-4145) changes. The GlobalDeviceInfo class now includes
the alloc_map created during device discovery, enabling better
separation of device discovery and computer context initialization.

Key changes include adding alloc_map field and device_ids property
to GlobalDeviceInfo, moving type definitions after the
AbstractComputePlugin class for proper ordering, extracting
_create_computers() method from __ainit__() for clearer separation
of concerns, and adding update_device_slots() to AbstractAllocMap
for dynamic slot updates.

Keep ComputerContext at its original location using attrs.define
to maintain consistency with the downstream PR changes.


@dataclass(kw_only=True, frozen=True)
class GlobalDeviceInfo:
Member

I think people who are seeing this code for the first time might find it hard to understand what GlobalDevice is. Could you add comments explaining GlobalDevice and how it differs from a non-global Device?

Member Author

I added a docstring explaining what a global device is.

hhoikoo added a commit that referenced this pull request Feb 2, 2026
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Address PR review comment explaining what "Global" means and how
GlobalDeviceInfo differs from agent-specific ComputerContext.
hhoikoo added two more commits that referenced this pull request Feb 3, 2026, each repeating the change list of the Feb 2 commit above.