feat(BA-4145): Add device partitioning for AUTO_SPLIT + SHARED #8447
base: feat/BA-4144
Pull request overview
This PR refactors the resource allocation pipeline to implement device-based partitioning for AUTO_SPLIT mode and introduce a clearer separation between device discovery and allocation, aligning with BEP-1041.
Changes:
- Introduces device partitioning abstractions (`DevicePartitioner`, `WholeDevicePartitioner`, `SharedDevicePartitioner`), a natural device ID sorter, and helpers (`distribute_devices`, `GlobalDeviceInfo`, `GlobalDeviceMap`) to support device-centric allocation.
- Reworks `ResourceAllocator.__ainit__` to perform device discovery, compute total/available slots via `available_slots()`, generate device assignments for agents based on allocation mode, and derive reserved slots from those assignments.
- Extensively updates and extends unit tests around resources and allocation behavior (including AUTO_SPLIT, MANUAL-as-SHARED behavior, device distribution, and the new helpers) to validate the new semantics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/ai/backend/agent/types.py` | Adds `DeviceAssignments` alias and `DevicePartitioner` protocol to formalize the device partitioning strategy surface. |
| `src/ai/backend/agent/resources.py` | Implements natural sorting, device distribution, whole/shared device partitioners, `GlobalDeviceInfo`/`GlobalDeviceMap`, reworks `ResourceAllocator` initialization, total slot calculation via `available_slots()`, device assignment generation and application, and reserved-slot computation based on assignments. |
| `tests/unit/agent/test_resources.py` | Adds tests for `GlobalDeviceInfo`, `_create_global_devices`, handling plugins with no devices, and the new `_calculate_total_slots` behavior using plugin `available_slots()`. |
| `tests/unit/agent/test_resource_allocation.py` | Updates existing AUTO_SPLIT tests to device-based semantics, adds coverage for new allocation behaviors (AUTO_SPLIT device exclusivity, MANUAL behaving like SHARED, scaling factor invariants), and tests `_natural_sort_key`, `distribute_devices`, and the partitioner classes. |
8e3b040 to 35fe7a6
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
edd6803 to 5042f12
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
device partitioning

Implement device-based partitioning for multi-agent AUTO_SPLIT mode, replacing the previous slot-based approach. Devices are now exclusively assigned to agents, enabling proper resource isolation in multi-agent configurations.

Add DevicePartitioner protocol with WholeDevicePartitioner for exclusive device assignment (CPU/GPU) and SharedDevicePartitioner for shared access (memory devices). The distribute_devices function implements fill-from-front allocation with natural sorting to ensure deterministic device distribution.

Update ResourceAllocator initialization to generate device assignments based on allocation mode. In AUTO_SPLIT mode with multiple agents, devices are partitioned exclusively. Memory devices (root/mem) remain shared across all agents to preserve slot-based memory allocation semantics.
Scaling factors are now calculated as (allocated / total) for each slot type instead of naive 1/n. This correctly handles uneven device distributions where some agents get more devices than others.
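The difference between the (allocated / total) rule and the naive 1/n rule is easiest to see with an uneven split. The following is a minimal sketch, not the code from this PR: the function name `scaling_factors`, the plain `dict` inputs, and the slot names `cpu` and `gpu` are all assumptions made for illustration.

```python
from fractions import Fraction


def scaling_factors(
    allocated: dict[str, int],
    total: dict[str, int],
) -> dict[str, Fraction]:
    """Return allocated / total per slot type (hypothetical helper).

    Slot types with a zero total get a factor of 0 to avoid division by zero.
    """
    factors: dict[str, Fraction] = {}
    for slot_name, total_amount in total.items():
        if total_amount == 0:
            factors[slot_name] = Fraction(0)
            continue
        factors[slot_name] = Fraction(allocated.get(slot_name, 0), total_amount)
    return factors


# Illustrative host: 8 CPU cores and 5 GPUs split between two agents.
total = {"cpu": 8, "gpu": 5}
agent1 = scaling_factors({"cpu": 4, "gpu": 3}, total)
agent2 = scaling_factors({"cpu": 4, "gpu": 2}, total)
```

With five GPUs split 3/2, the per-agent GPU factors are 3/5 and 2/5, whereas a naive 1/n rule would report 1/2 for both agents.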
Move _SHARED_DEVICE_NAMES constant from inside ResourcePartitioner class to module level for better encapsulation and reusability.

Add AllocationType enum and get_allocation_type() method to AbstractAllocMap to replace isinstance type checking. This provides a cleaner, more extensible approach for distinguishing allocation strategies without relying on runtime type inspection.
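A rough sketch of the enum-based dispatch described above follows. The class and method names come from this PR, but the enum members (`DISCRETE`, `FRACTIONAL`), the stripped-down class bodies, and the helper `partition_kind` are assumptions for illustration only.

```python
from abc import ABC, abstractmethod
from enum import Enum


class AllocationType(Enum):
    # Member names are assumed; the PR does not list them here.
    DISCRETE = "discrete"
    FRACTIONAL = "fractional"


class AbstractAllocMap(ABC):
    @abstractmethod
    def get_allocation_type(self) -> AllocationType:
        """Report the allocation strategy without exposing the concrete class."""


class DiscretePropertyAllocMap(AbstractAllocMap):
    def get_allocation_type(self) -> AllocationType:
        return AllocationType.DISCRETE


class FractionAllocMap(AbstractAllocMap):
    def get_allocation_type(self) -> AllocationType:
        return AllocationType.FRACTIONAL


def partition_kind(alloc_map: AbstractAllocMap) -> str:
    """Pick a partition type by enum value instead of isinstance() (hypothetical)."""
    if alloc_map.get_allocation_type() is AllocationType.DISCRETE:
        return "DevicePartition"
    return "SlotPartition"
```

A caller written this way keeps working when a new alloc map subclass is added, as long as the subclass reports one of the existing allocation types.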
resolves #8427 (BA-4145)
Overview
Implements device-based partitioning for multi-agent AUTO_SPLIT mode, enabling proper resource
isolation where devices (CPU/GPU) are exclusively assigned to agents while memory is shared.
Problem Statement
Architecture
    flowchart TB
        subgraph "Device Discovery"
            LD[_load_resources] --> GDM[GlobalDeviceMap]
            GDM --> GDI[GlobalDeviceInfo per device type]
        end
        subgraph "Partition Types"
            DP[DevicePartition<br/>Exclusive device IDs]
            SP[SlotPartition<br/>Shared slot amounts]
        end
        subgraph "ResourcePartitioner"
            GSA[generate_shared_assignments]
            GASA[generate_autosplit_assignments]
            CDP[_calculate_device_partitions<br/>DiscretePropertyAllocMap]
            CSP[_calculate_slot_partitions<br/>FractionAllocMap]
            GASA --> CDP --> DP
            GASA --> CSP --> SP
        end
        subgraph "Agent Configuration"
            DP --> ADA[_apply_device_assignments]
            SP --> ADA
            ADA --> AC1[Agent1 ComputerContext]
            ADA --> AC2[Agent2 ComputerContext]
        end

Key Components

ResourcePartitioner: Static methods for generating device assignments
- `generate_shared_assignments`: All agents see all devices (SHARED mode)
- `generate_autosplit_assignments`: Exclusive partitioning (AUTO_SPLIT mode)

Partition Types:
- `DevicePartition`: Exclusive device ID assignment (CPU cores, GPUs)
- `SlotPartition`: Shared slot amount division (memory)

Partitioning Logic:
- `DiscretePropertyAllocMap` → `DevicePartition`: Devices distributed via fill-from-front
- `FractionAllocMap` → `SlotPartition`: Slots divided evenly among agents

distribute_devices Function: Implements fill-from-front algorithm with natural sorting