feat(BA-4144): Add GlobalDeviceInfo and device discovery infrastructure #8440
base: feat/BA-4143
Pull request overview
This PR introduces GlobalDeviceInfo dataclass and device discovery infrastructure to separate device discovery from allocation map creation in the ResourceAllocator. This refactoring establishes a cleaner 3-phase initialization process and sets the foundation for more flexible device-based allocation strategies.
Changes:
- Added GlobalDeviceInfo dataclass to store plugin references and discovered devices without allocation maps
- Introduced _create_global_devices() method for Phase 1 device discovery across all plugins
- Refactored ResourceAllocator.__ainit__() into 3 distinct phases: device discovery, allocation map creation, and slot calculation
- Made _calculate_total_slots() async to query plugins directly via available_slots() instead of reading from allocation maps
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/ai/backend/agent/resources.py | Introduces GlobalDeviceInfo dataclass, _create_global_devices() method, refactors __ainit__() into 3 phases, and makes _calculate_total_slots() async |
| tests/unit/agent/test_resources.py | Adds comprehensive unit tests for GlobalDeviceInfo, _create_global_devices(), and the async _calculate_total_slots() method |
02251f6 to 9415478
Introduce GlobalDeviceInfo dataclass to separate device discovery from allocation map creation in ResourceAllocator. This enables cleaner separation of concerns and more flexible device-based allocation strategies in the future. Key changes include splitting __ainit__() into three distinct phases: device discovery from plugins, allocation map creation, and slot calculation. The _calculate_total_slots() method now uses plugin.available_slots() directly instead of reading from allocation maps, providing cleaner abstraction boundaries. Added comprehensive unit tests covering GlobalDeviceInfo initialization, _create_global_devices() with single and multiple plugins, empty device handling, and slot calculation with aggregation.
Refactor device discovery infrastructure to align with downstream PR 8447 (BA-4145) changes. The GlobalDeviceInfo class now includes the alloc_map created during device discovery, enabling better separation of device discovery and computer context initialization. Key changes include adding alloc_map field and device_ids property to GlobalDeviceInfo, moving type definitions after the AbstractComputePlugin class for proper ordering, extracting _create_computers() method from __ainit__() for clearer separation of concerns, and adding update_device_slots() to AbstractAllocMap for dynamic slot updates.
Keep ComputerContext at its original location using attrs.define to maintain consistency with the downstream PR changes.
@dataclass(kw_only=True, frozen=True)
class GlobalDeviceInfo:
I think people who are seeing this code for the first time might find it hard to understand what GlobalDevice is. Could you add comments explaining GlobalDevice and how it differs from a non-global Device?
I added a docstring explaining what a global device is.
- Add Background section explaining SHARED/AUTO_SPLIT/MANUAL modes
- Add Design Overview section for high-level narrative flow
- Restructure Proposed Design for organic flow instead of feature list
- Update to match actual implementation (ResourcePartitioner, Partition types)
- Update GitHub PR numbers to correct values (#8433, #8440, #8447, #8463)
- Add Implementation Notes section (scaling factors, memory handling)
- Clarify slot-based design was incorrect implementation, not deliberate
- Update config examples to show actual format (cpu, mem, devices fields)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address PR review comment explaining what "Global" means and how GlobalDeviceInfo differs from agent-specific ComputerContext.
resolves #8426 (BA-4144)
Overview
Introduces GlobalDeviceInfo dataclass and device discovery infrastructure to separate device discovery from allocation map creation in ResourceAllocator. This is foundational for the device-based allocation approach in subsequent tickets.

Problem Statement

Architecture

    flowchart TB
        subgraph "Phase 1: Discovery"
            LP[_load_resources] --> CGD[_create_global_devices]
            CGD --> GDM[GlobalDeviceMap]
        end
        subgraph "Phase 2: Allocation"
            GDM --> CAM[create_alloc_map per plugin]
            CAM --> CC[ComputerContext]
        end
        subgraph "Phase 3: Slots"
            CC --> CTS[_calculate_total_slots]
            CTS --> ATS[available_total_slots]
        end

The refactored ResourceAllocator.__ainit__() now follows a 3-phase initialization:
- Phase 1 (Discovery): _create_global_devices() iterates plugins and calls list_devices()
- Phase 2 (Allocation): ComputerContext is built with allocation maps from GlobalDeviceInfo
- Phase 3 (Slots): _calculate_total_slots() uses plugin.available_slots() directly

Key Changes
New Types:
- GlobalDeviceInfo: Dataclass with plugin and devices fields (no alloc_map)
- GlobalDeviceMap: Type alias for Mapping[DeviceName, GlobalDeviceInfo]

New Methods:
- _create_global_devices(): Discovers devices from all plugins, returns GlobalDeviceMap

Refactored Methods:
- __ainit__(): Split into 3 distinct phases with clear separation
- _calculate_total_slots(): Now async, uses plugin.available_slots() directly

Checklist: (if applicable)
- ai.backend.test
- docs directory
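The three-phase ordering in the description can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the actual `__ainit__()`: `FakePlugin`, `init_allocator()`, the dict-based stand-in for ComputerContext, and the slot names are hypothetical. The parts taken from the description are the phase order and the fact that slot totals come from awaiting `available_slots()` on each plugin rather than from the allocation maps.

```python
import asyncio
from decimal import Decimal


class FakePlugin:
    """Hypothetical plugin exposing only the two calls named in the description."""

    def __init__(self, device_ids, slots):
        self._device_ids = list(device_ids)
        self._slots = dict(slots)

    async def list_devices(self):
        return list(self._device_ids)

    async def available_slots(self):
        return dict(self._slots)


async def calculate_total_slots(plugins):
    """Phase 3: sum the slots each plugin reports, keyed by slot name."""
    total = {}
    for plugin in plugins.values():
        for slot_name, amount in (await plugin.available_slots()).items():
            total[slot_name] = total.get(slot_name, Decimal(0)) + amount
    return total


async def init_allocator(plugins):
    # Phase 1: discovery only, no allocation state is created here.
    global_devices = {}
    for name, plugin in plugins.items():
        global_devices[name] = await plugin.list_devices()

    # Phase 2: build a per-plugin context with an (empty) allocation map.
    # A plain dict stands in for ComputerContext in this sketch.
    computers = {
        name: {"devices": devs, "alloc_map": {dev: Decimal(0) for dev in devs}}
        for name, devs in global_devices.items()
    }

    # Phase 3: slot totals come straight from the plugins.
    total_slots = await calculate_total_slots(plugins)
    return global_devices, computers, total_slots


plugins = {
    "cpu": FakePlugin(["cpu0", "cpu1"], {"cpu": Decimal(2)}),
    "cuda": FakePlugin(["gpu0"], {"cuda.device": Decimal(1)}),
}
global_devices, computers, total_slots = asyncio.run(init_allocator(plugins))
```

Because phase 3 never reads `computers`, the slot calculation in this sketch can be tested with plugin fakes alone, which is one plausible reading of the "cleaner abstraction boundaries" the commit message mentions.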