hhoikoo (Member) commented Jan 30, 2026

resolves #8425 (BA-4143)

Overview

Simplify ResourceAllocator by removing mode-specific partitioning logic. All allocation modes (SHARED, AUTO_SPLIT, MANUAL) now behave like SHARED, where agents see all devices with no reserved slots between them. This establishes a clean baseline before implementing BEP-1041's device-centric partitioning design.

Problem Statement

  • The current multi-agent resource allocation uses slot-based configuration with complex _calculate_device_slot_* methods that tightly couple slot calculation with allocation modes
  • Before implementing the new device-based approach from BEP-1041, we need to establish a clean baseline by removing this complexity
  • This allows us to incrementally rebuild the partitioning logic with the new device-centric design

Implementation

Removed methods from ResourceAllocator:

  • _calculate_device_slot() - the mode dispatch method
  • _calculate_device_slot_shared() - SHARED mode calculation
  • _calculate_device_slot_auto_split() - AUTO_SPLIT mode calculation
  • _calculate_device_slot_manual() - MANUAL mode calculation
  • _ensure_slots_are_not_overallocated() - MANUAL mode validation

Simplified methods:

  • _calculate_device_slots() - now always returns full available slots (SHARED behavior)
  • _calculate_resource_scaling_factor() - now always returns 1.0
  • _calculate_agent_partition() - removed unused parameters
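The simplified behavior described above can be sketched as follows. This is a minimal illustration, not the code in src/ai/backend/agent/resources.py: the `AllocationMode` enum values, the `SimplifiedAllocator` class, its public method names, and the use of `Decimal` for slot amounts are all assumptions made for the example. Only the behavior (every mode yields the full available slots and a scaling factor of 1.0) comes from this PR.

```python
from decimal import Decimal
from enum import Enum
from typing import Mapping


class AllocationMode(Enum):
    # Hypothetical stand-in for the agent's allocation-mode enum.
    SHARED = "shared"
    AUTO_SPLIT = "auto-split"
    MANUAL = "manual"


class SimplifiedAllocator:
    """Illustrative only: every allocation mode is treated as SHARED."""

    def __init__(
        self,
        mode: AllocationMode,
        available_slots: Mapping[str, Decimal],
    ) -> None:
        # The mode is stored but deliberately never consulted.
        self.mode = mode
        self._available_slots = dict(available_slots)

    def calculate_device_slots(self) -> dict[str, Decimal]:
        # No per-mode dispatch: each agent sees the full available slots.
        return dict(self._available_slots)

    def calculate_resource_scaling_factor(self) -> Decimal:
        # Nothing is partitioned, so nothing is scaled down.
        return Decimal("1.0")
```

Because the mode is ignored, constructing the allocator with SHARED, AUTO_SPLIT, or MANUAL gives the same slots and the same factor, which is the baseline the later partitioning work can build on.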

Test changes:

  • Skipped the TestAutoSplitMode and TestManualMode classes, with skip reasons referencing BA-4143/BEP-1041
  • Skipped partitioning tests in TestMultiDeviceScenarios
  • Added TestAllocationModesFallbackToShared class with 3 tests verifying all modes behave identically
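A rough sketch of what these test changes amount to is given below. It uses the standard-library `unittest` module so that it is self-contained, whereas the PR's tests use pytest. The `visible_slots` helper, the enum values, the test method names, and the skip reason text are hypothetical; only the class names come from this PR.

```python
import unittest
from enum import Enum


class AllocationMode(Enum):
    # Hypothetical stand-in for the agent's allocation-mode enum.
    SHARED = "shared"
    AUTO_SPLIT = "auto-split"
    MANUAL = "manual"


def visible_slots(mode: AllocationMode, available: dict[str, int]) -> dict[str, int]:
    # Hypothetical stand-in for the allocator: the mode is ignored.
    return dict(available)


# Illustrative skip reason; the real wording in the PR is not shown here.
@unittest.skip("BA-4143: partitioning removed; to be rebuilt under BEP-1041")
class TestAutoSplitMode(unittest.TestCase):
    def test_devices_are_split_between_agents(self) -> None:
        self.fail("not reached while the class is skipped")


class TestAllocationModesFallbackToShared(unittest.TestCase):
    def test_all_modes_match_shared(self) -> None:
        available = {"cpu": 8, "cuda.device": 2}
        expected = visible_slots(AllocationMode.SHARED, available)
        # Each mode is checked as its own sub-test against the SHARED result.
        for mode in AllocationMode:
            with self.subTest(mode=mode):
                self.assertEqual(visible_slots(mode, available), expected)
```

Skipping the old classes rather than deleting them keeps the partitioning expectations in the file, so they can be re-enabled or rewritten once the device-centric design lands.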

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue
  • Installer updates including:
    • Fixtures for db schema changes
    • New mandatory config options
  • Update of end-to-end CLI integration tests in ai.backend.test
  • API server-client counterparts (e.g., manager API -> client SDK)
  • Test case(s) to:
    • Demonstrate the difference of before/after
    • Demonstrate the flow of abstract/conceptual models with a concrete implementation
  • Documentation
    • Contents in the docs directory
    • docstrings in public interfaces and type annotations

Copilot AI review requested due to automatic review settings January 30, 2026 00:49
github-actions bot added the size:L (100~500 LoC) and comp:agent (Related to Agent component) labels Jan 30, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Refactors multi-agent ResourceAllocator to remove slot-based/mode-specific device partitioning so that SHARED, AUTO_SPLIT, and MANUAL all expose full device resources as a baseline ahead of BEP-1041.

Changes:

  • Removes AUTO_SPLIT/MANUAL partitioning logic in ResourceAllocator, making all modes behave like SHARED.
  • Simplifies agent partition/scaling-factor computation accordingly.
  • Skips mode-specific tests and adds new tests asserting all modes fall back to SHARED behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/ai/backend/agent/resources.py | Removes mode-specific slot partitioning and reduces partition and scaling-factor computation to a SHARED-only baseline. |
| tests/unit/agent/test_resource_allocation.py | Skips AUTO_SPLIT/MANUAL partitioning tests and adds fallback-to-SHARED tests. |

hhoikoo added the skip:changelog (Make the action workflow to skip towncrier check) label Jan 30, 2026
Simplify ResourceAllocator by removing mode-specific partitioning
logic. All allocation modes (SHARED, AUTO_SPLIT, MANUAL) now behave
like SHARED, where agents see all devices with no reserved slots
between them. This establishes a baseline for implementing BEP-1041's
device-centric partitioning design.

Removed methods: _calculate_device_slot,
_calculate_device_slot_shared, _calculate_device_slot_auto_split,
_calculate_device_slot_manual, and
_ensure_slots_are_not_overallocated. The scaling factor is now always
1.0 instead of a mode-specific value.

Test changes: Skipped tests for AUTO_SPLIT and MANUAL modes with
explanations. Added TestAllocationModesFallbackToShared class to
verify all modes behave identically to SHARED.
- Remove docstrings from private methods
- Inline _calculate_device_slots() and _calculate_resource_scaling_factor()
- Fix implicit string concatenation in pytest skip reasons
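For context on the last item: Python joins adjacent string literals at parse time, which makes a missing space or comma easy to overlook. The sketch below illustrates the general pitfall only. The reason strings are invented for the example, and the actual change made in this commit is not shown here.

```python
# Adjacent literals are joined by the parser; the missing space after ";"
# goes unnoticed until the message is read.
implicit_reason = (
    "AUTO_SPLIT partitioning was removed;"
    "to be reimplemented"
)

# A single literal makes the intended text explicit.
explicit_reason = "AUTO_SPLIT partitioning was removed; to be reimplemented"

# In a sequence, a forgotten comma silently merges two items into one.
reasons = [
    "removed in BA-4143"
    "rebuilt under BEP-1041",
]
```

Writing each skip reason as one literal, or joining parts with an explicit `+`, keeps the intent visible to both readers and linters.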
hhoikoo added a commit that referenced this pull request Feb 2, 2026
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
hhoikoo added a commit that referenced this pull request Feb 3, 2026
hhoikoo added a commit that referenced this pull request Feb 3, 2026
Labels

comp:agent (Related to Agent component), size:L (100~500 LoC), skip:changelog (Make the action workflow to skip towncrier check)

Development

Successfully merging this pull request may close these issues.

Remove slot-based device partitioning methods to establish SHARED-only baseline
