feat(BA-4146): Device-based MANUAL mode config #8463
base: feat/BA-4145
Pull request overview
Updates agent MANUAL-mode resource allocation to use explicit device ID lists (device-centric) and refactors resource discovery/allocation to operate on per-device assignments across agents.
Changes:
- Switch MANUAL mode `allocations.devices` from slot-based decimals (`cuda.mem`, `cuda.shares`) to device-based mappings (`cuda = ["cuda0", ...]`).
- Refactor `ResourceAllocator` to separate global device discovery from per-agent device assignment (AUTO_SPLIT / SHARED / MANUAL).
- Expand/update unit tests for config validation and resource allocation behaviors (including partitioning helpers like natural sort and device distribution).
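The natural sort and device distribution helpers mentioned above could look roughly like the following. This is a minimal sketch, not the code from this PR: the names `natural_sort_key` and `distribute_devices` are hypothetical, and the contiguous near-equal split is an assumed distribution policy.

```python
from __future__ import annotations

import re
from typing import Sequence


def natural_sort_key(device_id: str) -> list:
    # Split into alternating text and digit runs so "cuda10" sorts after "cuda2".
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", device_id)]


def distribute_devices(device_ids: Sequence[str], num_agents: int) -> list[list[str]]:
    """Split naturally sorted device IDs into contiguous, near-equal chunks.

    Assumed policy: earlier agents receive one extra device when the count
    does not divide evenly.
    """
    if num_agents <= 0:
        raise ValueError("num_agents must be positive")
    ordered = sorted(device_ids, key=natural_sort_key)
    base, extra = divmod(len(ordered), num_agents)
    chunks: list[list[str]] = []
    start = 0
    for index in range(num_agents):
        size = base + (1 if index < extra else 0)
        chunks.append(ordered[start:start + size])
        start += size
    return chunks
```

A plain lexicographic sort would order `cuda10` before `cuda2`, which is why the key converts digit runs to integers before comparing.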
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/ai/backend/agent/config/unified.py | Changes MANUAL `allocations.devices` schema to device-based mapping and adds format validation. |
| src/ai/backend/agent/resources.py | Introduces global device discovery structures and implements device assignment generation/application for allocation modes. |
| src/ai/backend/agent/types.py | Adds typing for device assignment mappings and a device partitioner protocol. |
| tests/unit/agent/test_resource_allocation.py | Updates/extends tests to reflect device-based allocation and new partitioning helpers. |
| tests/unit/agent/test_resources.py | Adds tests for new `GlobalDeviceInfo` and device discovery helper behavior. |
| tests/unit/agent/test_config_validation.py | Updates validation tests to reflect device-based MANUAL allocations and old-format rejection. |
| changes/8440.feature.md | Changelog entry for the device discovery infrastructure refactor. |
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
tests/unit/agent/test_resource_allocation.py:759
- MANUAL mode behavior is now implemented in ResourceAllocator/ResourcePartitioner, but there are no ResourceAllocator MANUAL-mode integration tests in this file anymore (the previous class was removed). Add integration tests that assert per-agent CPU/mem reservations and device visibility in MANUAL mode, and cover failure cases like duplicate device IDs across agents.
# Memory: shared device, both see same device with divided slots
assert len(agent1_computers[DeviceName("mem")].alloc_map.device_slots) == 1
assert len(agent2_computers[DeviceName("mem")].alloc_map.device_slots) == 1
await allocator.__aexit__(None, None, None)
d442cc1 to 88f13c1
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add `cpu` field as Sequence[DeviceId] for explicit core assignment
- Change `devices` type from Mapping[SlotName, Decimal] to Mapping[DeviceName, Sequence[DeviceId]]
- Add validators to detect and reject old slot-based/integer formats
- Implement generate_manual_assignments() in ResourcePartitioner
- Validate device names and IDs exist in discovered hardware
- Update tests for new format, add error case coverage
Refactor generate_manual_assignments to properly handle shared devices like memory. Previously, MANUAL mode incorrectly created empty DevicePartitions for memory devices, resulting in zero memory allocation.

Add helper methods to separate device partition logic:
- _parse_manual_device_partition for discrete devices (cpu, cuda)
- _parse_manual_slot_partition for shared devices (mem)
- _validate_configured_device_names for device name validation
- _validate_device_exclusivity for exclusivity checks

The refactored implementation iterates over global_devices and selects the appropriate helper based on device type, ensuring SlotPartitions are created with configured memory amounts.
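The name and exclusivity checks described in these commits can be illustrated with a small standalone sketch. It is an assumption-laden approximation, not the PR's implementation: the function names, the plain `str` keys, and the `ValueError` messages are all hypothetical, and it covers only discrete devices (cpu, cuda), not shared devices such as mem.

```python
from __future__ import annotations

from typing import Mapping, Sequence

# Hypothetical shape: agent ID -> device name -> configured device IDs.
AgentDevices = Mapping[str, Mapping[str, Sequence[str]]]


def validate_configured_devices(
    discovered: Mapping[str, Sequence[str]],
    configured: Mapping[str, Sequence[str]],
) -> None:
    """Check that every configured device name and ID exists in discovered hardware."""
    for device_name, device_ids in configured.items():
        if device_name not in discovered:
            raise ValueError(f"unknown device name: {device_name!r}")
        unknown = sorted(set(device_ids) - set(discovered[device_name]))
        if unknown:
            raise ValueError(f"unknown {device_name} device IDs: {unknown}")


def validate_device_exclusivity(per_agent: AgentDevices) -> None:
    """Reject configurations where a discrete device ID is claimed more than once."""
    owners: dict[tuple[str, str], str] = {}
    for agent_id, devices in per_agent.items():
        for device_name, device_ids in devices.items():
            for device_id in device_ids:
                key = (device_name, device_id)
                if key in owners:
                    # Also triggers when one agent lists the same ID twice.
                    raise ValueError(
                        f"{device_id!r} is assigned to both {owners[key]!r} and {agent_id!r}"
                    )
                owners[key] = agent_id
```

For example, two agents that both list `cuda0` fail the exclusivity check, which is the duplicate-device failure case the suppressed review comment asks to be covered by tests.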
resolves #8428 (BA-4146)
Overview
Changes MANUAL mode resource allocation from slot-based format (`{"cuda.mem": 0.5}`) to device-based format (`cuda = ["cuda0", "cuda1"]`), enabling explicit device-to-agent mapping in multi-agent configurations.
Problem Statement
Architecture
flowchart TB
    subgraph "Configuration (unified.py)"
        RC[ResourceAllocationConfig]
        RC -->|"devices: Mapping[DeviceName, Sequence[DeviceId]]"| DV[Device Validator]
        DV -->|"Detect old format"| ERR[Validation Error]
        DV -->|"Parse new format"| OK[Valid Config]
    end
    subgraph "Resource Allocation (resources.py)"
        RA[ResourceAllocator.__ainit__]
        RA -->|MANUAL mode| GMA[_generate_manual_assignments]
        RA -->|AUTO_SPLIT| GAS[_generate_auto_split_assignments]
        RA -->|SHARED| GSA[_generate_shared_assignments]
        GMA --> ADA[_apply_device_assignments]
        GAS --> ADA
        GSA --> ADA
    end
    OK --> GMA
Configuration Format Change
Old slot-based format (rejected):
New device-based format:
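As a rough illustration of the two shapes, the sketch below uses only the examples given in the Overview (`{"cuda.mem": 0.5}` and `cuda = ["cuda0", "cuda1"]`), expressed as Python mappings. The `check_devices_format` function is hypothetical; in particular, treating a dotted key or a numeric value as the signal for the old format is an assumption about how the real validator might detect it.

```python
from __future__ import annotations

from decimal import Decimal
from typing import Any, Mapping

# Shapes taken from the PR overview, written as Python mappings.
OLD_FORMAT = {"cuda.mem": 0.5}               # slot name -> fractional amount (rejected)
NEW_FORMAT = {"cuda": ["cuda0", "cuda1"]}    # device name -> list of device IDs


def check_devices_format(devices: Mapping[str, Any]) -> dict[str, list[str]]:
    """Accept the device-based mapping and reject slot-based or integer entries."""
    checked: dict[str, list[str]] = {}
    for name, ids in devices.items():
        # Assumed heuristic: slot names contain a dot and map to a number.
        if "." in name or isinstance(ids, (int, float, Decimal)):
            raise ValueError(
                f"old slot-based entry {name!r}: expected a list of device IDs"
            )
        if not isinstance(ids, (list, tuple)):
            raise ValueError(f"{name!r} must map to a list of device IDs")
        for device_id in ids:
            if not isinstance(device_id, str):
                raise ValueError(
                    f"{name!r} contains non-string device ID {device_id!r}"
                )
        checked[name] = list(ids)
    return checked
```

With these definitions, `check_devices_format(NEW_FORMAT)` returns the mapping unchanged, while `check_devices_format(OLD_FORMAT)` raises `ValueError`.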