feat(deployments): implement DockerDeploymentBackend (AIRCORE-756)#399
Conversation
Add Docker backend for nemo-deployments-plugin with single-container v1 lifecycle, entity-store config fetch, port allocation, exec/http probes, plugin-local GPU pool, and unit plus integration test coverage. Stacks on #315 (AIRCORE-758 reconciler prerequisite DAG). Signed-off-by: Tyler Bray <tbray@nvidia.com>
|
@coderabbitai full review |
✅ Action performedFull review finished. |
📝 WalkthroughWalkthroughIntroduces ChangesDocker Deployment Backend
Sequence Diagram(s)sequenceDiagram
participant Reconciler
participant DockerDeploymentBackend
participant DockerGPUPool
participant DockerClient
participant check_readiness_probe
Reconciler->>DockerDeploymentBackend: create_deployment(workspace, name, config_name, ...)
DockerDeploymentBackend->>DockerClient: containers.get(container_name)
DockerClient-->>DockerDeploymentBackend: NotFound
DockerDeploymentBackend->>DockerGPUPool: allocate_gpu(workload_id, n)
DockerDeploymentBackend->>DockerClient: find_available_port → containers.run(image, run_kwargs)
DockerDeploymentBackend-->>Reconciler: BackendStatusUpdate(STARTING, endpoints)
Reconciler->>DockerDeploymentBackend: read_status(workspace, name)
DockerDeploymentBackend->>DockerClient: container.reload()
DockerClient-->>DockerDeploymentBackend: state=running
DockerDeploymentBackend->>check_readiness_probe: check_readiness_probe(container, probe, host_url)
check_readiness_probe-->>DockerDeploymentBackend: (ready=True, "")
DockerDeploymentBackend-->>Reconciler: BackendStatusUpdate(READY, endpoints)
Reconciler->>DockerDeploymentBackend: delete_deployment(workspace, name)
DockerDeploymentBackend->>DockerClient: container.stop() → container.remove()
DockerDeploymentBackend->>DockerGPUPool: release_gpu(workload_id)
DockerDeploymentBackend-->>Reconciler: BackendStatusUpdate(SUCCEEDED)
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (7)
plugins/nemo-deployments/tests/unit/backends/docker_helpers.py (1)
13-26: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse a concrete restart-policy type and drop the ignore.
restart_policy: stris too broad and forces# type: ignore[arg-type]. Type this as the concrete policy union/Literal used byDeploymentConfigand remove the ignore.As per coding guidelines, "
**/*.py: Always prefer concrete type hints over string-based ones in Python code."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py` around lines 13 - 26, The sample_config function has a restart_policy parameter typed as str which is too broad and requires a type ignore comment. Identify the concrete Literal type or union type that DeploymentConfig expects for its restart_policy field, update the function signature to use that concrete type instead of str for the restart_policy parameter, and remove the # type: ignore[arg-type] comment on the restart_policy argument passed to DeploymentConfig.Source: Coding guidelines
plugins/nemo-deployments/tests/integration/conftest.py (1)
10-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAdd type hints to module-level variables.
Per coding guidelines, prefer concrete type hints. Add annotations:
DOCKER_AVAILABLE: bool = Trueand
skip_without_docker: pytest.MarkDecorator = pytest.mark.skipif(...)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/conftest.py` around lines 10 - 18, Add explicit type hints to the module-level variables DOCKER_AVAILABLE and skip_without_docker. For DOCKER_AVAILABLE, add a bool type annotation since it is assigned boolean values (True or False) in the try-except block. For skip_without_docker, add a pytest.MarkDecorator type annotation since it is assigned the result of pytest.mark.skipif(). These type hints should be added directly at the point of variable assignment to comply with coding guidelines.Source: Coding guidelines
plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py (2)
162-164: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winNarrow exception handling in cleanup.
Catching broad
Exceptionhides potential issues. UseNotFoundfromdocker.errors(already imported) for consistency with other cleanup blocks.♻️ Proposed fix
try: docker_backend._client.volumes.remove(docker_volume_name("itest", "data")) - except Exception: + except NotFound: pass🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py` around lines 162 - 164, The exception handling in the volume cleanup block is catching a broad Exception which can mask real issues. Replace the generic Exception catch with docker.errors.NotFound, which is already imported and used consistently in other cleanup blocks throughout the test. This will ensure only the expected "volume not found" error is silently ignored while other unexpected errors will properly surface for debugging.
126-128: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAccessing private attribute in test cleanup.
Test cleanup directly accesses
docker_backend._client, which couples tests to implementation details. Consider adding a public cleanup method to the backend or using the Docker SDK directly viadocker.from_env().🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py` around lines 126 - 128, The test cleanup code directly accesses the private `_client` attribute from docker_backend, creating coupling to implementation details. Either add a public cleanup method to the docker_backend class that encapsulates the container removal logic (containers.get(c_name).remove(force=True)) and call that method from the test, or alternatively, use docker.from_env() directly in the test cleanup code to access the Docker client without relying on private implementation details. Remove the direct access to the underscore-prefixed _client attribute.plugins/nemo-deployments/tests/integration/test_reconcile_docker.py (2)
160-174: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueOptimize loop by constructing dictionaries once.
The
by_nameandby_configdictionaries are reconstructed on every iteration. Move construction outside the loop for clarity and minor efficiency gain.♻️ Proposed refactor
+ by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep} + by_config = { + ("itest", "puller-cfg"): puller_dep, + ("itest", "server-cfg"): server_dep, + } for _ in range(40): - by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep} - by_config = { - ("itest", "puller-cfg"): puller_dep, - ("itest", "server-cfg"): server_dep, - } await deployment_reconciler.reconcile_one(Apply the same pattern to the server reconciliation loop at lines 177-192.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around lines 160 - 174, The by_name and by_config dictionaries are being reconstructed on every iteration of the loop in the puller deployment reconciliation block. Move the construction of both dictionaries outside the for loop (before the `for _ in range(40):` line) since they contain the same values on each iteration. Apply this same optimization pattern to the server reconciliation loop that follows, moving the dictionary construction for that loop outside as well to avoid redundant object creation.
128-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd type hints to get_side_effect.
Per coding guidelines, prefer concrete type hints. This function should annotate parameters and return type:
async def get_side_effect(entity_type: type, name: str, workspace: str | None = None) -> Volume | Deployment | DeploymentConfig:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around lines 128 - 138, The get_side_effect function is missing type hints on its parameters and return type. Add type annotations to the function signature: annotate the entity_type parameter as type, the name parameter as str, the workspace parameter as str or None with a default value of None, and the return type as a union of Volume, Deployment, and DeploymentConfig.Source: Coding guidelines
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py (1)
28-33: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRestart policy parameter has no effect.
Both branches in
map_exited_statusreturn"FAILED"regardless ofrestart_policy. Consider removing the parameter if not needed, or document the future intent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py` around lines 28 - 33, In the map_exited_status function, the restart_policy parameter is not being utilized effectively as both the if condition checking for restart_policy == "Always" and the final return statement return the same "FAILED" status. Either remove the restart_policy parameter entirely if it is not needed for this function's logic, or modify the conditional logic to return different DeploymentStatus values based on different restart_policy values so that the parameter actually influences the function's behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`:
- Around line 148-157: The loop iterating through container_spec.ports does not
track previously assigned host ports within the current iteration, allowing
find_available_port to potentially assign the same host port multiple times.
Modify the code to collect already-assigned ports from the host_ports dictionary
in each loop iteration and pass them to find_available_port so it can exclude
those ports when searching for available options, ensuring no duplicate port
assignments occur within a single multi-port deployment.
- Line 292: Replace hardcoded label key strings with imported constants to
maintain consistency with canonical definitions. First, add imports for
BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY from labels.py at the top of the file.
Then replace the hardcoded string "nmp.nvidia.com/backoff-limit" in the
backoff_limit variable assignment with the BACKOFF_LIMIT_LABEL constant, and
replace the hardcoded "managed-by" strings used in the container_labels.get()
call and labels.get() comparison with the MANAGED_BY_KEY constant.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`:
- Around line 17-18: Remove the TYPE_CHECKING conditional guard around the
docker import. Convert the conditional import of docker (currently under if
TYPE_CHECKING:) to a regular import statement at the module level, so that
docker is imported unconditionally rather than only for type checking purposes.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`:
- Around line 90-99: The probe target resolution incorrectly hardcodes 127.0.0.1
instead of using the actual host from the host_url parameter, which breaks
functionality for remapped host ports and remote Docker daemons. Replace all
instances of 127.0.0.1 in the base URL construction within the port resolution
logic (in the branches checking isinstance(port, int), the named_ports lookup,
and the host_ports loop) with the appropriate host extracted from the host_url
parameter. Apply the same fix to the additional probe target resolution code
mentioned in lines 111-127 to ensure consistency across all probe endpoint
constructions.
- Around line 11-19: The docker import statement is unnecessarily wrapped in a
TYPE_CHECKING conditional block. Since the file already uses from __future__
import annotations which stringifies all type hints at runtime, there is no
performance cost to importing DockerContainer directly. Remove the if
TYPE_CHECKING: conditional block that wraps the docker import statement and move
the from docker.models.containers import Container as DockerContainer import to
the top-level imports section with the other import statements like httpx and
the nemo_deployments_plugin imports.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py`:
- Around line 15-16: The docker import is currently guarded by the TYPE_CHECKING
conditional block. Remove the TYPE_CHECKING guard and import statement around
the docker module, and instead add docker as a regular import statement at the
top of the file with the other imports. This ensures docker is imported as a
concrete type hint rather than a string-based one, following the coding
guidelines.
In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 26-32: The try-except block checking Docker availability and
setting _DOCKER_AVAILABLE is duplicated from the parent conftest.py file. Remove
lines 26-32 containing the duplicate Docker availability check (the try block
with docker.from_env().ping() and the _DOCKER_AVAILABLE assignment), and instead
use the existing DOCKER_AVAILABLE fixture or skip_without_docker marker that is
already defined in the parent conftest.py. Pytest automatically discovers and
uses parent conftest fixtures and markers, so this duplicate check should be
completely removed from this file.
In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 33-39: Remove the duplicated Docker availability check in the
test_reconcile_docker.py file. Delete the entire try-except block that attempts
to import docker and calls docker.from_env().ping() to set DOCKER_AVAILABLE.
Instead, rely on the Docker availability check already defined in the parent
conftest.py file, which Pytest discovers automatically through its marker
system. This eliminates code duplication and ensures consistency across all test
files in the project.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`:
- Line 11: The import statement at line 11 using `from backends.docker_helpers
import container_attrs, sample_config` is using an absolute import path that
does not resolve correctly for the test package layout. Change this to use a
relative import that properly navigates from the test file's location
(plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py) to
the helpers module, or adjust the import to use the correct absolute path that
matches your package structure so that pytest can resolve the module correctly.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py`:
- Line 11: The import statement at the top of test_idempotency.py uses an
incorrect module path. The line `from backends.docker_helpers import
container_attrs, sample_config` needs to be updated to use the correct relative
or absolute import path that matches your repository's actual package structure.
Determine the correct path by checking where the docker_helpers module actually
exists in your repo (likely in the backends directory relative to the tests
directory), and adjust the import to use either a relative import (like from
..docker_helpers or similar) or the correct absolute path from the repo root.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py`:
- Around line 21-24: The assertion in test_is_port_free_local is a tautology
that always evaluates to true since it checks if something is true OR not true,
which provides no meaningful test coverage. Replace this assertion with a
deterministic test by either occupying a port and verifying is_port_free returns
False for that port, or using a known free port and verifying is_port_free
returns True. Set up a specific port state beforehand (such as binding to a port
or using a port known to be available) and then assert that is_port_free returns
the expected boolean result for that deterministic condition.
- Around line 26-36: The test_find_available_port_skips_used test depends on
real host port availability which makes it flaky and environment-dependent. Mock
the is_port_free function that is called within find_available_port to return
deterministic values (such as returning False for port 9000 to simulate it being
in use, and True for ports 9001 and 9002 to simulate them being free). Then
update the assertion to check for the exact expected port (9001) instead of
using an assertion with multiple possible values, ensuring the test behavior is
reproducible and not dependent on the actual system port state.
---
Nitpick comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`:
- Around line 28-33: In the map_exited_status function, the restart_policy
parameter is not being utilized effectively as both the if condition checking
for restart_policy == "Always" and the final return statement return the same
"FAILED" status. Either remove the restart_policy parameter entirely if it is
not needed for this function's logic, or modify the conditional logic to return
different DeploymentStatus values based on different restart_policy values so
that the parameter actually influences the function's behavior.
In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 162-164: The exception handling in the volume cleanup block is
catching a broad Exception which can mask real issues. Replace the generic
Exception catch with docker.errors.NotFound, which is already imported and used
consistently in other cleanup blocks throughout the test. This will ensure only
the expected "volume not found" error is silently ignored while other unexpected
errors will properly surface for debugging.
- Around line 126-128: The test cleanup code directly accesses the private
`_client` attribute from docker_backend, creating coupling to implementation
details. Either add a public cleanup method to the docker_backend class that
encapsulates the container removal logic
(containers.get(c_name).remove(force=True)) and call that method from the test,
or alternatively, use docker.from_env() directly in the test cleanup code to
access the Docker client without relying on private implementation details.
Remove the direct access to the underscore-prefixed _client attribute.
In `@plugins/nemo-deployments/tests/integration/conftest.py`:
- Around line 10-18: Add explicit type hints to the module-level variables
DOCKER_AVAILABLE and skip_without_docker. For DOCKER_AVAILABLE, add a bool type
annotation since it is assigned boolean values (True or False) in the try-except
block. For skip_without_docker, add a pytest.MarkDecorator type annotation since
it is assigned the result of pytest.mark.skipif(). These type hints should be
added directly at the point of variable assignment to comply with coding
guidelines.
In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 160-174: The by_name and by_config dictionaries are being
reconstructed on every iteration of the loop in the puller deployment
reconciliation block. Move the construction of both dictionaries outside the for
loop (before the `for _ in range(40):` line) since they contain the same values
on each iteration. Apply this same optimization pattern to the server
reconciliation loop that follows, moving the dictionary construction for that
loop outside as well to avoid redundant object creation.
- Around line 128-138: The get_side_effect function is missing type hints on its
parameters and return type. Add type annotations to the function signature:
annotate the entity_type parameter as type, the name parameter as str, the
workspace parameter as str or None with a default value of None, and the return
type as a union of Volume, Deployment, and DeploymentConfig.
In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py`:
- Around line 13-26: The sample_config function has a restart_policy parameter
typed as str which is too broad and requires a type ignore comment. Identify the
concrete Literal type or union type that DeploymentConfig expects for its
restart_policy field, update the function signature to use that concrete type
instead of str for the restart_policy parameter, and remove the # type:
ignore[arg-type] comment on the restart_policy argument passed to
DeploymentConfig.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5d5a85f9-3645-431f-b8df-57e8281266da
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (22)
plugins/nemo-deployments/pyproject.tomlplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.pyplugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.pyplugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.pyplugins/nemo-deployments/tests/integration/conftest.pyplugins/nemo-deployments/tests/integration/test_reconcile_docker.pyplugins/nemo-deployments/tests/unit/backends/docker/conftest.pyplugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.pyplugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.pyplugins/nemo-deployments/tests/unit/backends/docker/test_labels.pyplugins/nemo-deployments/tests/unit/backends/docker/test_ports.pyplugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.pyplugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.pyplugins/nemo-deployments/tests/unit/backends/docker_helpers.py
| host_ports: dict[int, int] = {} | ||
| for port_spec in container_spec.ports: | ||
| host_port = await find_available_port(self._client, docker_cfg) | ||
| if host_port is None: | ||
| if gpu_ids: | ||
| self._gpu_pool.release_gpu(dep_key) # type: ignore[union-attr] | ||
| return BackendStatusUpdate( | ||
| status="FAILED", status_message="No host ports available in configured range" | ||
| ) | ||
| host_ports[port_spec.container_port] = host_port |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -t f 'ports.py' -p 'backends/docker' --exec cat {}Repository: NVIDIA-NeMo/nemo-platform
Length of output: 266
🏁 Script executed:
fd -t f 'ports.py' --exec cat {}Repository: NVIDIA-NeMo/nemo-platform
Length of output: 16215
🏁 Script executed:
find . -type f -name 'backend.py' -path '*backends/docker*' | head -5Repository: NVIDIA-NeMo/nemo-platform
Length of output: 327
🏁 Script executed:
fd -t f -name '*.py' -path '*backends/docker*' | head -20Repository: NVIDIA-NeMo/nemo-platform
Length of output: 394
🏁 Script executed:
sed -n '140,165p' ./plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 1535
Multi-port containers can collide on the same host port.
find_available_port queries running containers to find used ports, but only checks containers that already exist. Since the container being deployed doesn't exist yet, ports assigned in earlier loop iterations won't be excluded on subsequent calls, allowing duplicate allocations.
Track already-selected ports and exclude them on each iteration.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`
around lines 148 - 157, The loop iterating through container_spec.ports does not
track previously assigned host ports within the current iteration, allowing
find_available_port to potentially assign the same host port multiple times.
Modify the code to collect already-assigned ports from the host_ports dictionary
in each loop iteration and pass them to find_available_port so it can exclude
those ports when searching for available options, ensuring no duplicate port
assignments occur within a single multi-port deployment.
| ) | ||
| if restart_policy == "OnFailure": | ||
| restart_count = int(container.attrs.get("RestartCount", 0)) | ||
| backoff_limit = int(labels.get("nmp.nvidia.com/backoff-limit", "6")) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
fd -t f 'labels.py' -p 'backends/docker' --exec cat {}Repository: NVIDIA-NeMo/nemo-platform
Length of output: 266
🏁 Script executed:
#!/bin/bash
# Find backend.py file
fd -t f 'backend.py' --path '*backends/docker*' | head -20Repository: NVIDIA-NeMo/nemo-platform
Length of output: 397
🏁 Script executed:
#!/bin/bash
# Search for MANAGED_BY_KEY and BACKOFF_LIMIT_LABEL constants
rg -i 'MANAGED_BY_KEY|BACKOFF_LIMIT_LABEL' --max-count 50Repository: NVIDIA-NeMo/nemo-platform
Length of output: 163
🏁 Script executed:
#!/bin/bash
# Search for label key definitions/constants in the codebase
rg '"managed-by"|"nmp.nvidia.com/backoff-limit"' --max-count 50Repository: NVIDIA-NeMo/nemo-platform
Length of output: 163
🏁 Script executed:
#!/bin/bash
# Find backend.py without path filter
git ls-files | grep -E 'backends.*docker.*backend\.py'Repository: NVIDIA-NeMo/nemo-platform
Length of output: 405
🏁 Script executed:
#!/bin/bash
# List all files in the plugin directory
git ls-files | grep 'nemo-deployments' | head -30Repository: NVIDIA-NeMo/nemo-platform
Length of output: 2350
🏁 Script executed:
#!/bin/bash
cat plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 2850
🏁 Script executed:
#!/bin/bash
# Read the specific lines from backend.py that are mentioned in the review
sed -n '290,295p;351,356p;418,422p' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 1022
🏁 Script executed:
#!/bin/bash
# Check the imports in backend.py
head -50 plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py | grep -E 'from|import'Repository: NVIDIA-NeMo/nemo-platform
Length of output: 958
🏁 Script executed:
#!/bin/bash
# Get more context around the imports
head -100 plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 3830
🏁 Script executed:
#!/bin/bash
# Check what MANAGED_BY_LABEL is in constants.py
rg 'MANAGED_BY_LABEL' plugins/nemo-deployments/src/nemo_deployments_plugin/constants.py -A 2Repository: NVIDIA-NeMo/nemo-platform
Length of output: 280
🏁 Script executed:
#!/bin/bash
# Get the full import statement from backend.py
sed -n '30,45p' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 764
🏁 Script executed:
#!/bin/bash
# Check if MANAGED_BY_KEY and BACKOFF_LIMIT_LABEL are imported in backend.py
rg 'MANAGED_BY_KEY|BACKOFF_LIMIT_LABEL' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 163
🏁 Script executed:
#!/bin/bash
# Get more context around line 353-354 and 420
sed -n '350,360p;415,425p' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 1132
🏁 Script executed:
#!/bin/bash
# Verify the exact line numbers by searching for the patterns
rg -n '"managed-by"' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 303
🏁 Script executed:
#!/bin/bash
# Verify backoff-limit line
rg -n '"nmp.nvidia.com/backoff-limit"' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 252
🏁 Script executed:
#!/bin/bash
# Check if BACKOFF_LIMIT_LABEL should be imported but isn't
grep -n 'BACKOFF_LIMIT_LABEL' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 163
Replace hardcoded label keys with imported constants.
Lines 292, 353, and 420 use hardcoded strings ("nmp.nvidia.com/backoff-limit", "managed-by") where corresponding constants are already defined in labels.py but not imported. Use BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY instead to maintain consistency with the canonical definitions.
Current hardcoded usage
Line 292:
backoff_limit = int(labels.get("nmp.nvidia.com/backoff-limit", "6"))
Lines 353, 420:
container_labels.get("managed-by") != MANAGED_BY_LABEL
labels.get("managed-by") == MANAGED_BY_LABEL
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`
at line 292, Replace hardcoded label key strings with imported constants to
maintain consistency with canonical definitions. First, add imports for
BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY from labels.py at the top of the file.
Then replace the hardcoded string "nmp.nvidia.com/backoff-limit" in the
backoff_limit variable assignment with the BACKOFF_LIMIT_LABEL constant, and
replace the hardcoded "managed-by" strings used in the container_labels.get()
call and labels.get() comparison with the MANAGED_BY_KEY constant.
| if TYPE_CHECKING: | ||
| import docker |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win
Remove TYPE_CHECKING guard for docker import.
The coding guidelines require concrete type hints over string-based ones. Import docker as a regular import instead of under TYPE_CHECKING.
🔧 Proposed fix
-from typing import TYPE_CHECKING, Any
+from typing import Any
+
+import docker
from nemo_deployments_plugin.backends.docker.labels import managed_by_filter
from nemo_deployments_plugin.entities import DockerDeploymentConfig
-
-if TYPE_CHECKING:
- import docker📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if TYPE_CHECKING: | |
| import docker | |
| from typing import Any | |
| import docker | |
| from nemo_deployments_plugin.backends.docker.labels import managed_by_filter | |
| from nemo_deployments_plugin.entities import DockerDeploymentConfig |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`
around lines 17 - 18, Remove the TYPE_CHECKING conditional guard around the
docker import. Convert the conditional import of docker (currently under if
TYPE_CHECKING:) to a regular import statement at the module level, so that
docker is imported unconditionally rather than only for type checking purposes.
Source: Coding guidelines
| from typing import TYPE_CHECKING | ||
| from urllib.parse import urljoin | ||
|
|
||
| import httpx | ||
| from nemo_deployments_plugin.entities import Probe | ||
|
|
||
| if TYPE_CHECKING: | ||
| from docker.models.containers import Container as DockerContainer | ||
|
|
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor
🧩 Analysis chain
🏁 Script executed:
cat -n plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.pyRepository: NVIDIA-NeMo/nemo-platform
Length of output: 5684
Import DockerContainer directly instead of using TYPE_CHECKING.
The file uses from __future__ import annotations, which stringifies all type hints at runtime. Move the docker import out of TYPE_CHECKING and import it as a regular import—there's no runtime cost and it aligns with the coding guideline to avoid TYPE_CHECKING imports when possible.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`
around lines 11 - 19, The docker import statement is unnecessarily wrapped in a
TYPE_CHECKING conditional block. Since the file already uses from __future__
import annotations which stringifies all type hints at runtime, there is no
performance cost to importing DockerContainer directly. Remove the if
TYPE_CHECKING: conditional block that wraps the docker import statement and move
the from docker.models.containers import Container as DockerContainer import to
the top-level imports section with the other import statements like httpx and
the nemo_deployments_plugin imports.
Source: Coding guidelines
| if isinstance(port, int): | ||
| base = f"{scheme}://127.0.0.1:{port}" | ||
| elif isinstance(port, str) and named_ports and port in named_ports: | ||
| base = f"{scheme}://127.0.0.1:{named_ports[port]}" | ||
| elif isinstance(port, str) and host_ports: | ||
| for container_port, host_port in host_ports.items(): | ||
| if str(container_port) == port: | ||
| base = f"{scheme}://127.0.0.1:{host_port}" | ||
| break | ||
| url = urljoin(f"{base.rstrip('/')}/", path.lstrip("/")) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Probe target resolution is incorrect for mapped ports/remote hosts.
Line 90/91 uses container port directly; Line 116 hardcodes localhost and ignores host_url. This breaks readiness for remapped host ports and remote Docker daemons.
Proposed fix
-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlparse
@@
- if probe.tcp_socket is not None and host_url is not None:
- return await _check_tcp_probe(host_url, probe)
+ if probe.tcp_socket is not None and host_url is not None:
+ return await _check_tcp_probe(host_url, probe, host_ports=host_ports, named_ports=named_ports)
@@
if isinstance(port, int):
- base = f"{scheme}://127.0.0.1:{port}"
+ resolved = host_ports.get(port, port) if host_ports else port
+ base = f"{scheme}://127.0.0.1:{resolved}"
@@
-async def _check_tcp_probe(host_url: str, probe: Probe) -> tuple[bool, str]:
+async def _check_tcp_probe(
+ host_url: str,
+ probe: Probe,
+ *,
+ host_ports: dict[int, int] | None = None,
+ named_ports: dict[str, int] | None = None,
+) -> tuple[bool, str]:
@@
- if not isinstance(port_value, int):
- return False, "tcp probe requires numeric port"
- host = "127.0.0.1"
+ parsed = urlparse(host_url)
+ host = parsed.hostname or "127.0.0.1"
+ if isinstance(port_value, int):
+ target_port = host_ports.get(port_value, port_value) if host_ports else port_value
+ elif isinstance(port_value, str) and named_ports and port_value in named_ports:
+ target_port = named_ports[port_value]
+ elif isinstance(port_value, str) and host_ports:
+ target_port = next((hp for cp, hp in host_ports.items() if str(cp) == port_value), None)
+ if target_port is None:
+ return False, f"tcp probe named/container port not mapped: {port_value}"
+ else:
+ return False, "tcp probe requires resolvable port"
@@
- with socket.create_connection((host, port_value), timeout=timeout):
+ with socket.create_connection((host, target_port), timeout=timeout):
returnAlso applies to: 111-127
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`
around lines 90 - 99, The probe target resolution incorrectly hardcodes
127.0.0.1 instead of using the actual host from the host_url parameter, which
breaks functionality for remapped host ports and remote Docker daemons. Replace
all instances of 127.0.0.1 in the base URL construction within the port
resolution logic (in the branches checking isinstance(port, int), the
named_ports lookup, and the host_ports loop) with the appropriate host extracted
from the host_url parameter. Apply the same fix to the additional probe target
resolution code mentioned in lines 111-127 to ensure consistency across all
probe endpoint constructions.
| try: | ||
| import docker | ||
|
|
||
| docker.from_env().ping() | ||
| DOCKER_AVAILABLE = True | ||
| except Exception: | ||
| DOCKER_AVAILABLE = False |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Remove duplicated Docker availability check.
Same issue as in test_docker_backend.py. Remove this duplicated check and rely on the parent conftest.py definition. Pytest discovers markers automatically.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around
lines 33 - 39, Remove the duplicated Docker availability check in the
test_reconcile_docker.py file. Delete the entire try-except block that attempts
to import docker and calls docker.from_env().ping() to set DOCKER_AVAILABLE.
Instead, rely on the Docker availability check already defined in the parent
conftest.py file, which Pytest discovers automatically through its marker
system. This eliminates code duplication and ensures consistency across all test
files in the project.
| from unittest.mock import AsyncMock, MagicMock | ||
|
|
||
| import pytest | ||
| from backends.docker_helpers import container_attrs, sample_config |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fix helper import path to match test package layout.
from backends.docker_helpers ... is likely unresolved for this test layout. Import from the actual tests package path (or use a relative import) so the module loads under pytest.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`
at line 11, The import statement at line 11 using `from backends.docker_helpers
import container_attrs, sample_config` is using an absolute import path that
does not resolve correctly for the test package layout. Change this to use a
relative import that properly navigates from the test file's location
(plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py) to
the helpers module, or adjust the import to use the correct absolute path that
matches your package structure so that pytest can resolve the module correctly.
| from unittest.mock import AsyncMock, MagicMock | ||
|
|
||
| import pytest | ||
| from backends.docker_helpers import container_attrs, sample_config |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fix helper import path; current module path is likely invalid.
from backends.docker_helpers ... should be switched to the real tests package/relative path used by this repo, otherwise this test module may fail to import.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py` at
line 11, The import statement at the top of test_idempotency.py uses an
incorrect module path. The line `from backends.docker_helpers import
container_attrs, sample_config` needs to be updated to use the correct relative
or absolute import path that matches your repository's actual package structure.
Determine the correct path by checking where the docker_helpers module actually
exists in your repo (likely in the backends directory relative to the tests
directory), and adjust the import to use either a relative import (like from
..docker_helpers or similar) or the correct absolute path from the repo root.
| def test_is_port_free_local() -> None: | ||
| # Ephemeral high port likely free in test environment | ||
| assert is_port_free(59999) or not is_port_free(59999) | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
This test is a tautology and provides zero signal.
assert is_port_free(p) or not is_port_free(p) is always true. Replace with a deterministic setup that asserts a known occupied/free port outcome.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py` around
lines 21 - 24, The assertion in test_is_port_free_local is a tautology that
always evaluates to true since it checks if something is true OR not true, which
provides no meaningful test coverage. Replace this assertion with a
deterministic test by either occupying a port and verifying is_port_free returns
False for that port, or using a known free port and verifying is_port_free
returns True. Set up a specific port state beforehand (such as binding to a port
or using a port known to be available) and then assert that is_port_free returns
the expected boolean result for that deterministic condition.
| @pytest.mark.asyncio | ||
| async def test_find_available_port_skips_used(mock_docker_client: MagicMock) -> None: | ||
| from nemo_deployments_plugin.backends.docker.ports import find_available_port | ||
|
|
||
| used = MagicMock() | ||
| used.ports = {"80/tcp": [{"HostPort": "9000"}]} | ||
| mock_docker_client.containers.list.return_value = [used] | ||
|
|
||
| cfg = DockerDeploymentConfig(port_range_start=9000, port_range_end=9002) | ||
| port = await find_available_port(mock_docker_client, cfg) | ||
| assert port in {9001, 9002} |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Port-allocation test is environment-dependent and flaky.
This test depends on real availability of host ports 9001/9002. Mock/monkeypatch is_port_free to deterministic values and assert the exact expected port.
Suggested deterministic test shape
`@pytest.mark.asyncio`
-async def test_find_available_port_skips_used(mock_docker_client: MagicMock) -> None:
- from nemo_deployments_plugin.backends.docker.ports import find_available_port
+async def test_find_available_port_skips_used(
+ mock_docker_client: MagicMock, monkeypatch: pytest.MonkeyPatch
+) -> None:
+ from nemo_deployments_plugin.backends.docker import ports as ports_mod
+ from nemo_deployments_plugin.backends.docker.ports import find_available_port
used = MagicMock()
used.ports = {"80/tcp": [{"HostPort": "9000"}]}
mock_docker_client.containers.list.return_value = [used]
+ monkeypatch.setattr(ports_mod, "is_port_free", lambda p: p != 9001)
cfg = DockerDeploymentConfig(port_range_start=9000, port_range_end=9002)
port = await find_available_port(mock_docker_client, cfg)
- assert port in {9001, 9002}
+ assert port == 9002🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py` around
lines 26 - 36, The test_find_available_port_skips_used test depends on real host
port availability which makes it flaky and environment-dependent. Mock the
is_port_free function that is called within find_available_port to return
deterministic values (such as returning False for port 9000 to simulate it being
in use, and True for ports 9001 and 9002 to simulate them being free). Then
update the assertion to check for the exact expected port (9001) instead of
using an assertion with multiple possible values, ensuring the test behavior is
reproducible and not dependent on the actual system port state.
…E-756) - Exclude already-assigned host ports during multi-port allocation - Use label constants; fix HTTP/TCP probe host/port resolution - Remove TYPE_CHECKING-only docker imports - Deduplicate Docker availability check via docker_availability module - Harden port unit tests; add type hints and cleanup improvements Signed-off-by: Tyler Bray <tbray@nvidia.com>
Summary
Implements
DockerDeploymentBackendfornemo-deployments-plugin, completing the Docker executor path started in #280 and unblocking reconciler E2E tests from #315.init_containersand multi-container specsconfig_nameonly; backend loads fullDeploymentConfigfrom the entity store viaNemoEntitiesClientDockerDeploymentConfigDockerGPUPool(consolidation tonemo_platform_plugintracked separately in AIRCORE-759)managed-by, anddeployment-configlabel matchingStacks on #315 (
758-deployments-reconciler-prerequisite-dag/tbray). Rebase ontomainafter #315 merges.Test plan
uv run pytest plugins/nemo-deployments/tests -q— 126 passeduv run ruff check plugins/nemo-deploymentsuv run ty check plugins/nemo-deploymentstest_puller_server_prerequisite_chain— puller→server prerequisite DAG via reconcilerSummary by CodeRabbit
Release Notes
New Features
Dependencies