Skip to content

feat(deployments): implement DockerDeploymentBackend (AIRCORE-756)#399

Open
tylersbray wants to merge 2 commits into
758-deployments-reconciler-prerequisite-dag/tbrayfrom
756-docker-deployment-backend/tbray
Open

feat(deployments): implement DockerDeploymentBackend (AIRCORE-756)#399
tylersbray wants to merge 2 commits into
758-deployments-reconciler-prerequisite-dag/tbrayfrom
756-docker-deployment-backend/tbray

Conversation

@tylersbray

@tylersbray tylersbray commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements DockerDeploymentBackend for nemo-deployments-plugin, completing the Docker executor path started in #280 and unblocking reconciler E2E tests from #315.

  • v1 scope: single-container deployments; rejects init_containers and multi-container specs
  • Entity contract: reconciler passes config_name only; backend loads full DeploymentConfig from the entity store via NemoEntitiesClient
  • Lifecycle: docker pull/run, inspect, stop/rm, volume create/delete, port allocation from entity DockerDeploymentConfig
  • Probes: exec and http_get readiness supported; grpc deferred
  • GPU: plugin-local DockerGPUPool (consolidation to nemo_platform_plugin tracked separately in AIRCORE-759)
  • Idempotency: workspace, name, managed-by, and deployment-config label matching

Stacks on #315 (758-deployments-reconciler-prerequisite-dag/tbray). Rebase onto main after #315 merges.

Test plan

  • uv run pytest plugins/nemo-deployments/tests -q — 126 passed
  • uv run ruff check plugins/nemo-deployments
  • uv run ty check plugins/nemo-deployments
  • Unit tests: status mapping, restart policy, ports, labels, CRUD, idempotency (mocked docker client)
  • Integration tests against real Docker daemon (when available): volume lifecycle, Never→SUCCEEDED, Always lost detection
  • test_puller_server_prerequisite_chain — puller→server prerequisite DAG via reconciler

Summary by CodeRabbit

Release Notes

  • New Features

    • Docker deployment backend is now available and registered by default, enabling container-based workload deployment with support for volume management, GPU allocation, and configurable restart policies.
  • Dependencies

    • Added Docker and httpx dependencies to support Docker deployment operations.

Add Docker backend for nemo-deployments-plugin with single-container v1
lifecycle, entity-store config fetch, port allocation, exec/http probes,
plugin-local GPU pool, and unit plus integration test coverage.

Stacks on #315 (AIRCORE-758 reconciler prerequisite DAG).

Signed-off-by: Tyler Bray <tbray@nvidia.com>
@tylersbray tylersbray requested review from a team as code owners June 23, 2026 01:52
@github-actions github-actions Bot added the feat label Jun 23, 2026
@tylersbray

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Introduces DockerDeploymentBackend for the nemo-deployments plugin: a full Docker container/volume lifecycle backend with GPU pool allocation, host port selection, readiness probes, identity labeling, status mapping, and a pyproject.toml dependency update. The backend is registered as the default "docker" entry. Unit and integration tests are included.

Changes

Docker Deployment Backend

Layer / File(s) Summary
Config model and identity labeling
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py, plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py, plugins/nemo-deployments/pyproject.toml
Defines DockerExecutorConfig (docker_host, docker_timeout, pull_images) and all label-key constants plus naming/labeling helpers (k8s_safe_name, container_name, deployment_identity_labels, etc.). Adds httpx>=0.27 and docker>=7.0 as dependencies.
Thread-safe GPU pool
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py
Implements DockerGPUPool with mutex-protected allocation/release, nvidia-smi-based GPU detection, and a lazy process-wide singleton via get_shared_gpu_pool().
Container spec helpers
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py
Adds config parsing/validation, restart-policy kwargs, env dict, volume mount merging, port binding generation, and GPU device-request spec construction for translating DeploymentConfig into Docker run arguments.
Host port allocation
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py
Implements find_available_port scanning the configured range while skipping ports used by existing managed containers, with remote-host and local-socket-bind checks.
Readiness probe evaluation
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py
Implements check_readiness_probe dispatching exec/HTTP/TCP probe checks with timeout enforcement and host URL resolution.
Status mapping and volume operations
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py, plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py
Defines Docker-to-backend status helpers (map_exited_status, missing_container_status, map_docker_state_to_starting) and async Docker volume lifecycle functions (create/read/delete).
DockerDeploymentBackend core lifecycle
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py
Implements DockerDeploymentBackend with init/shutdown, create_deployment, read_status, delete_deployment, list_managed_deployment_names, get_logs, volume delegation, and endpoint/identity helpers.
Backend registry wiring
plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
Registers DockerDeploymentBackend as the "docker" entry in BACKEND_CLASSES, replacing an empty map.
Unit tests
plugins/nemo-deployments/tests/unit/backends/docker/..., plugins/nemo-deployments/tests/unit/backends/docker_helpers.py
Adds fixtures (conftest.py, docker_helpers.py) and tests for create/read/delete idempotency, status mapping, labels, port allocation, restart policy, and volume creation.
Integration tests
plugins/nemo-deployments/tests/integration/conftest.py, plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py, plugins/nemo-deployments/tests/integration/test_reconcile_docker.py
Adds Docker availability detection and integration tests for volume lifecycle, never/always deployment flows, lost-container detection, and a puller-server prerequisite-chain reconciliation scenario.

Sequence Diagram(s)

sequenceDiagram
  participant Reconciler
  participant DockerDeploymentBackend
  participant DockerGPUPool
  participant DockerClient
  participant check_readiness_probe

  Reconciler->>DockerDeploymentBackend: create_deployment(workspace, name, config_name, ...)
  DockerDeploymentBackend->>DockerClient: containers.get(container_name)
  DockerClient-->>DockerDeploymentBackend: NotFound
  DockerDeploymentBackend->>DockerGPUPool: allocate_gpu(workload_id, n)
  DockerDeploymentBackend->>DockerClient: find_available_port → containers.run(image, run_kwargs)
  DockerDeploymentBackend-->>Reconciler: BackendStatusUpdate(STARTING, endpoints)

  Reconciler->>DockerDeploymentBackend: read_status(workspace, name)
  DockerDeploymentBackend->>DockerClient: container.reload()
  DockerClient-->>DockerDeploymentBackend: state=running
  DockerDeploymentBackend->>check_readiness_probe: check_readiness_probe(container, probe, host_url)
  check_readiness_probe-->>DockerDeploymentBackend: (ready=True, "")
  DockerDeploymentBackend-->>Reconciler: BackendStatusUpdate(READY, endpoints)

  Reconciler->>DockerDeploymentBackend: delete_deployment(workspace, name)
  DockerDeploymentBackend->>DockerClient: container.stop() → container.remove()
  DockerDeploymentBackend->>DockerGPUPool: release_gpu(workload_id)
  DockerDeploymentBackend-->>Reconciler: BackendStatusUpdate(SUCCEEDED)
Loading

Possibly related PRs

  • NVIDIA-NeMo/nemo-platform#280: Introduced the BACKEND_CLASSES registry and ExecutorRegistry scaffolding that this PR populates with the concrete DockerDeploymentBackend.

Suggested reviewers

  • benmccown
  • albcui
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: implementing DockerDeploymentBackend with an issue reference.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 756-docker-deployment-backend/tbray

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (7)
plugins/nemo-deployments/tests/unit/backends/docker_helpers.py (1)

13-26: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Use a concrete restart-policy type and drop the ignore.

restart_policy: str is too broad and forces # type: ignore[arg-type]. Type this as the concrete policy union/Literal used by DeploymentConfig and remove the ignore.

As per coding guidelines, "**/*.py: Always prefer concrete type hints over string-based ones in Python code."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py` around lines
13 - 26, The sample_config function has a restart_policy parameter typed as str
which is too broad and requires a type ignore comment. Identify the concrete
Literal type or union type that DeploymentConfig expects for its restart_policy
field, update the function signature to use that concrete type instead of str
for the restart_policy parameter, and remove the # type: ignore[arg-type]
comment on the restart_policy argument passed to DeploymentConfig.

Source: Coding guidelines

plugins/nemo-deployments/tests/integration/conftest.py (1)

10-18: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add type hints to module-level variables.

Per coding guidelines, prefer concrete type hints. Add annotations:

DOCKER_AVAILABLE: bool = True

and

skip_without_docker: pytest.MarkDecorator = pytest.mark.skipif(...)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/conftest.py` around lines 10 - 18,
Add explicit type hints to the module-level variables DOCKER_AVAILABLE and
skip_without_docker. For DOCKER_AVAILABLE, add a bool type annotation since it
is assigned boolean values (True or False) in the try-except block. For
skip_without_docker, add a pytest.MarkDecorator type annotation since it is
assigned the result of pytest.mark.skipif(). These type hints should be added
directly at the point of variable assignment to comply with coding guidelines.

Source: Coding guidelines

plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py (2)

162-164: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Narrow exception handling in cleanup.

Catching broad Exception hides potential issues. Use NotFound from docker.errors (already imported) for consistency with other cleanup blocks.

♻️ Proposed fix
         try:
             docker_backend._client.volumes.remove(docker_volume_name("itest", "data"))
-        except Exception:
+        except NotFound:
             pass
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`
around lines 162 - 164, The exception handling in the volume cleanup block is
catching a broad Exception which can mask real issues. Replace the generic
Exception catch with docker.errors.NotFound, which is already imported and used
consistently in other cleanup blocks throughout the test. This will ensure only
the expected "volume not found" error is silently ignored while other unexpected
errors will properly surface for debugging.

126-128: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Accessing private attribute in test cleanup.

Test cleanup directly accesses docker_backend._client, which couples tests to implementation details. Consider adding a public cleanup method to the backend or using the Docker SDK directly via docker.from_env().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`
around lines 126 - 128, The test cleanup code directly accesses the private
`_client` attribute from docker_backend, creating coupling to implementation
details. Either add a public cleanup method to the docker_backend class that
encapsulates the container removal logic
(containers.get(c_name).remove(force=True)) and call that method from the test,
or alternatively, use docker.from_env() directly in the test cleanup code to
access the Docker client without relying on private implementation details.
Remove the direct access to the underscore-prefixed _client attribute.
plugins/nemo-deployments/tests/integration/test_reconcile_docker.py (2)

160-174: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Optimize loop by constructing dictionaries once.

The by_name and by_config dictionaries are reconstructed on every iteration. Move construction outside the loop for clarity and minor efficiency gain.

♻️ Proposed refactor
+        by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep}
+        by_config = {
+            ("itest", "puller-cfg"): puller_dep,
+            ("itest", "server-cfg"): server_dep,
+        }
         for _ in range(40):
-            by_name = {("itest", "puller"): puller_dep, ("itest", "server"): server_dep}
-            by_config = {
-                ("itest", "puller-cfg"): puller_dep,
-                ("itest", "server-cfg"): server_dep,
-            }
             await deployment_reconciler.reconcile_one(

Apply the same pattern to the server reconciliation loop at lines 177-192.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around
lines 160 - 174, The by_name and by_config dictionaries are being reconstructed
on every iteration of the loop in the puller deployment reconciliation block.
Move the construction of both dictionaries outside the for loop (before the `for
_ in range(40):` line) since they contain the same values on each iteration.
Apply this same optimization pattern to the server reconciliation loop that
follows, moving the dictionary construction for that loop outside as well to
avoid redundant object creation.

128-138: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add type hints to get_side_effect.

Per coding guidelines, prefer concrete type hints. This function should annotate parameters and return type:

async def get_side_effect(entity_type: type, name: str, workspace: str | None = None) -> Volume | Deployment | DeploymentConfig:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around
lines 128 - 138, The get_side_effect function is missing type hints on its
parameters and return type. Add type annotations to the function signature:
annotate the entity_type parameter as type, the name parameter as str, the
workspace parameter as str or None with a default value of None, and the return
type as a union of Volume, Deployment, and DeploymentConfig.

Source: Coding guidelines

plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py (1)

28-33: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Restart policy parameter has no effect.

Both branches in map_exited_status return "FAILED" regardless of restart_policy. Consider removing the parameter if not needed, or document the future intent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`
around lines 28 - 33, In the map_exited_status function, the restart_policy
parameter is not being utilized effectively as both the if condition checking
for restart_policy == "Always" and the final return statement return the same
"FAILED" status. Either remove the restart_policy parameter entirely if it is
not needed for this function's logic, or modify the conditional logic to return
different DeploymentStatus values based on different restart_policy values so
that the parameter actually influences the function's behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`:
- Around line 148-157: The loop iterating through container_spec.ports does not
track previously assigned host ports within the current iteration, allowing
find_available_port to potentially assign the same host port multiple times.
Modify the code to collect already-assigned ports from the host_ports dictionary
in each loop iteration and pass them to find_available_port so it can exclude
those ports when searching for available options, ensuring no duplicate port
assignments occur within a single multi-port deployment.
- Line 292: Replace hardcoded label key strings with imported constants to
maintain consistency with canonical definitions. First, add imports for
BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY from labels.py at the top of the file.
Then replace the hardcoded string "nmp.nvidia.com/backoff-limit" in the
backoff_limit variable assignment with the BACKOFF_LIMIT_LABEL constant, and
replace the hardcoded "managed-by" strings used in the container_labels.get()
call and labels.get() comparison with the MANAGED_BY_KEY constant.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`:
- Around line 17-18: Remove the TYPE_CHECKING conditional guard around the
docker import. Convert the conditional import of docker (currently under if
TYPE_CHECKING:) to a regular import statement at the module level, so that
docker is imported unconditionally rather than only for type checking purposes.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`:
- Around line 90-99: The probe target resolution incorrectly hardcodes 127.0.0.1
instead of using the actual host from the host_url parameter, which breaks
functionality for remapped host ports and remote Docker daemons. Replace all
instances of 127.0.0.1 in the base URL construction within the port resolution
logic (in the branches checking isinstance(port, int), the named_ports lookup,
and the host_ports loop) with the appropriate host extracted from the host_url
parameter. Apply the same fix to the additional probe target resolution code
mentioned in lines 111-127 to ensure consistency across all probe endpoint
constructions.
- Around line 11-19: The docker import statement is unnecessarily wrapped in a
TYPE_CHECKING conditional block. Since the file already uses from __future__
import annotations which stringifies all type hints at runtime, there is no
performance cost to importing DockerContainer directly. Remove the if
TYPE_CHECKING: conditional block that wraps the docker import statement and move
the from docker.models.containers import Container as DockerContainer import to
the top-level imports section with the other import statements like httpx and
the nemo_deployments_plugin imports.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py`:
- Around line 15-16: The docker import is currently guarded by the TYPE_CHECKING
conditional block. Remove the TYPE_CHECKING guard and import statement around
the docker module, and instead add docker as a regular import statement at the
top of the file with the other imports. This ensures docker is imported as a
concrete type hint rather than a string-based one, following the coding
guidelines.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 26-32: The try-except block checking Docker availability and
setting _DOCKER_AVAILABLE is duplicated from the parent conftest.py file. Remove
lines 26-32 containing the duplicate Docker availability check (the try block
with docker.from_env().ping() and the _DOCKER_AVAILABLE assignment), and instead
use the existing DOCKER_AVAILABLE fixture or skip_without_docker marker that is
already defined in the parent conftest.py. Pytest automatically discovers and
uses parent conftest fixtures and markers, so this duplicate check should be
completely removed from this file.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 33-39: Remove the duplicated Docker availability check in the
test_reconcile_docker.py file. Delete the entire try-except block that attempts
to import docker and calls docker.from_env().ping() to set DOCKER_AVAILABLE.
Instead, rely on the Docker availability check already defined in the parent
conftest.py file, which Pytest discovers automatically through its marker
system. This eliminates code duplication and ensures consistency across all test
files in the project.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`:
- Line 11: The import statement at line 11 using `from backends.docker_helpers
import container_attrs, sample_config` is using an absolute import path that
does not resolve correctly for the test package layout. Change this to use a
relative import that properly navigates from the test file's location
(plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py) to
the helpers module, or adjust the import to use the correct absolute path that
matches your package structure so that pytest can resolve the module correctly.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py`:
- Line 11: The import statement at the top of test_idempotency.py uses an
incorrect module path. The line `from backends.docker_helpers import
container_attrs, sample_config` needs to be updated to use the correct relative
or absolute import path that matches your repository's actual package structure.
Determine the correct path by checking where the docker_helpers module actually
exists in your repo (likely in the backends directory relative to the tests
directory), and adjust the import to use either a relative import (like from
..docker_helpers or similar) or the correct absolute path from the repo root.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py`:
- Around line 21-24: The assertion in test_is_port_free_local is a tautology
that always evaluates to true since it checks if something is true OR not true,
which provides no meaningful test coverage. Replace this assertion with a
deterministic test by either occupying a port and verifying is_port_free returns
False for that port, or using a known free port and verifying is_port_free
returns True. Set up a specific port state beforehand (such as binding to a port
or using a port known to be available) and then assert that is_port_free returns
the expected boolean result for that deterministic condition.
- Around line 26-36: The test_find_available_port_skips_used test depends on
real host port availability which makes it flaky and environment-dependent. Mock
the is_port_free function that is called within find_available_port to return
deterministic values (such as returning False for port 9000 to simulate it being
in use, and True for ports 9001 and 9002 to simulate them being free). Then
update the assertion to check for the exact expected port (9001) instead of
using an assertion with multiple possible values, ensuring the test behavior is
reproducible and not dependent on the actual system port state.

---

Nitpick comments:
In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py`:
- Around line 28-33: In the map_exited_status function, the restart_policy
parameter is not being utilized effectively as both the if condition checking
for restart_policy == "Always" and the final return statement return the same
"FAILED" status. Either remove the restart_policy parameter entirely if it is
not needed for this function's logic, or modify the conditional logic to return
different DeploymentStatus values based on different restart_policy values so
that the parameter actually influences the function's behavior.

In
`@plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py`:
- Around line 162-164: The exception handling in the volume cleanup block is
catching a broad Exception which can mask real issues. Replace the generic
Exception catch with docker.errors.NotFound, which is already imported and used
consistently in other cleanup blocks throughout the test. This will ensure only
the expected "volume not found" error is silently ignored while other unexpected
errors will properly surface for debugging.
- Around line 126-128: The test cleanup code directly accesses the private
`_client` attribute from docker_backend, creating coupling to implementation
details. Either add a public cleanup method to the docker_backend class that
encapsulates the container removal logic
(containers.get(c_name).remove(force=True)) and call that method from the test,
or alternatively, use docker.from_env() directly in the test cleanup code to
access the Docker client without relying on private implementation details.
Remove the direct access to the underscore-prefixed _client attribute.

In `@plugins/nemo-deployments/tests/integration/conftest.py`:
- Around line 10-18: Add explicit type hints to the module-level variables
DOCKER_AVAILABLE and skip_without_docker. For DOCKER_AVAILABLE, add a bool type
annotation since it is assigned boolean values (True or False) in the try-except
block. For skip_without_docker, add a pytest.MarkDecorator type annotation since
it is assigned the result of pytest.mark.skipif(). These type hints should be
added directly at the point of variable assignment to comply with coding
guidelines.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py`:
- Around line 160-174: The by_name and by_config dictionaries are being
reconstructed on every iteration of the loop in the puller deployment
reconciliation block. Move the construction of both dictionaries outside the for
loop (before the `for _ in range(40):` line) since they contain the same values
on each iteration. Apply this same optimization pattern to the server
reconciliation loop that follows, moving the dictionary construction for that
loop outside as well to avoid redundant object creation.
- Around line 128-138: The get_side_effect function is missing type hints on its
parameters and return type. Add type annotations to the function signature:
annotate the entity_type parameter as type, the name parameter as str, the
workspace parameter as str or None with a default value of None, and the return
type as a union of Volume, Deployment, and DeploymentConfig.

In `@plugins/nemo-deployments/tests/unit/backends/docker_helpers.py`:
- Around line 13-26: The sample_config function has a restart_policy parameter
typed as str which is too broad and requires a type ignore comment. Identify the
concrete Literal type or union type that DeploymentConfig expects for its
restart_policy field, update the function signature to use that concrete type
instead of str for the restart_policy parameter, and remove the # type:
ignore[arg-type] comment on the restart_policy argument passed to
DeploymentConfig.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d5a85f9-3645-431f-b8df-57e8281266da

📥 Commits

Reviewing files that changed from the base of the PR and between 0b92da1 and f612b6a.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • plugins/nemo-deployments/pyproject.toml
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/config.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/containers.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/gpu.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/status.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/volumes.py
  • plugins/nemo-deployments/src/nemo_deployments_plugin/backends/registry.py
  • plugins/nemo-deployments/tests/integration/backends/docker/test_docker_backend.py
  • plugins/nemo-deployments/tests/integration/conftest.py
  • plugins/nemo-deployments/tests/integration/test_reconcile_docker.py
  • plugins/nemo-deployments/tests/unit/backends/docker/conftest.py
  • plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py
  • plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py
  • plugins/nemo-deployments/tests/unit/backends/docker/test_labels.py
  • plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py
  • plugins/nemo-deployments/tests/unit/backends/docker/test_restart_policy.py
  • plugins/nemo-deployments/tests/unit/backends/docker/test_status_mapping.py
  • plugins/nemo-deployments/tests/unit/backends/docker_helpers.py

Comment on lines +148 to +157
host_ports: dict[int, int] = {}
for port_spec in container_spec.ports:
host_port = await find_available_port(self._client, docker_cfg)
if host_port is None:
if gpu_ids:
self._gpu_pool.release_gpu(dep_key) # type: ignore[union-attr]
return BackendStatusUpdate(
status="FAILED", status_message="No host ports available in configured range"
)
host_ports[port_spec.container_port] = host_port

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -t f 'ports.py' -p 'backends/docker' --exec cat {}

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 266


🏁 Script executed:

fd -t f 'ports.py' --exec cat {}

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 16215


🏁 Script executed:

find . -type f -name 'backend.py' -path '*backends/docker*' | head -5

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 327


🏁 Script executed:

fd -t f -name '*.py' -path '*backends/docker*' | head -20

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 394


🏁 Script executed:

sed -n '140,165p' ./plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 1535


Multi-port containers can collide on the same host port.

find_available_port queries running containers to find used ports, but only checks containers that already exist. Since the container being deployed doesn't exist yet, ports assigned in earlier loop iterations won't be excluded on subsequent calls, allowing duplicate allocations.

Track already-selected ports and exclude them on each iteration.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`
around lines 148 - 157, The loop iterating through container_spec.ports does not
track previously assigned host ports within the current iteration, allowing
find_available_port to potentially assign the same host port multiple times.
Modify the code to collect already-assigned ports from the host_ports dictionary
in each loop iteration and pass them to find_available_port so it can exclude
those ports when searching for available options, ensuring no duplicate port
assignments occur within a single multi-port deployment.

)
if restart_policy == "OnFailure":
restart_count = int(container.attrs.get("RestartCount", 0))
backoff_limit = int(labels.get("nmp.nvidia.com/backoff-limit", "6"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -t f 'labels.py' -p 'backends/docker' --exec cat {}

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 266


🏁 Script executed:

#!/bin/bash
# Find backend.py file
fd -t f 'backend.py' --path '*backends/docker*' | head -20

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 397


🏁 Script executed:

#!/bin/bash
# Search for MANAGED_BY_KEY and BACKOFF_LIMIT_LABEL constants
rg -i 'MANAGED_BY_KEY|BACKOFF_LIMIT_LABEL' --max-count 50

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 163


🏁 Script executed:

#!/bin/bash
# Search for label key definitions/constants in the codebase
rg '"managed-by"|"nmp.nvidia.com/backoff-limit"' --max-count 50

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 163


🏁 Script executed:

#!/bin/bash
# Find backend.py without path filter
git ls-files | grep -E 'backends.*docker.*backend\.py'

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 405


🏁 Script executed:

#!/bin/bash
# List all files in the plugin directory
git ls-files | grep 'nemo-deployments' | head -30

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 2350


🏁 Script executed:

#!/bin/bash
cat plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/labels.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 2850


🏁 Script executed:

#!/bin/bash
# Read the specific lines from backend.py that are mentioned in the review
sed -n '290,295p;351,356p;418,422p' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 1022


🏁 Script executed:

#!/bin/bash
# Check the imports in backend.py
head -50 plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py | grep -E 'from|import'

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 958


🏁 Script executed:

#!/bin/bash
# Get more context around the imports
head -100 plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 3830


🏁 Script executed:

#!/bin/bash
# Check what MANAGED_BY_LABEL is in constants.py
rg 'MANAGED_BY_LABEL' plugins/nemo-deployments/src/nemo_deployments_plugin/constants.py -A 2

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 280


🏁 Script executed:

#!/bin/bash
# Get the full import statement from backend.py
sed -n '30,45p' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 764


🏁 Script executed:

#!/bin/bash
# Check if MANAGED_BY_KEY and BACKOFF_LIMIT_LABEL are imported in backend.py
rg 'MANAGED_BY_KEY|BACKOFF_LIMIT_LABEL' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 163


🏁 Script executed:

#!/bin/bash
# Get more context around line 353-354 and 420
sed -n '350,360p;415,425p' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 1132


🏁 Script executed:

#!/bin/bash
# Verify the exact line numbers by searching for the patterns
rg -n '"managed-by"' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 303


🏁 Script executed:

#!/bin/bash
# Verify backoff-limit line
rg -n '"nmp.nvidia.com/backoff-limit"' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 252


🏁 Script executed:

#!/bin/bash
# Check if BACKOFF_LIMIT_LABEL should be imported but isn't
grep -n 'BACKOFF_LIMIT_LABEL' plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 163


Replace hardcoded label keys with imported constants.

Lines 292, 353, and 420 use hardcoded strings ("nmp.nvidia.com/backoff-limit", "managed-by") where corresponding constants are already defined in labels.py but not imported. Use BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY instead to maintain consistency with the canonical definitions.

Current hardcoded usage

Line 292:

backoff_limit = int(labels.get("nmp.nvidia.com/backoff-limit", "6"))

Lines 353, 420:

container_labels.get("managed-by") != MANAGED_BY_LABEL
labels.get("managed-by") == MANAGED_BY_LABEL
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/backend.py`
at line 292, Replace hardcoded label key strings with imported constants to
maintain consistency with canonical definitions. First, add imports for
BACKOFF_LIMIT_LABEL and MANAGED_BY_KEY from labels.py at the top of the file.
Then replace the hardcoded string "nmp.nvidia.com/backoff-limit" in the
backoff_limit variable assignment with the BACKOFF_LIMIT_LABEL constant, and
replace the hardcoded "managed-by" strings used in the container_labels.get()
call and labels.get() comparison with the MANAGED_BY_KEY constant.

Comment on lines +17 to +18
if TYPE_CHECKING:
import docker

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔴 Critical | ⚡ Quick win

Remove TYPE_CHECKING guard for docker import.

The coding guidelines require concrete type hints over string-based ones. Import docker as a regular import instead of under TYPE_CHECKING.

🔧 Proposed fix
-from typing import TYPE_CHECKING, Any
+from typing import Any
+
+import docker

 from nemo_deployments_plugin.backends.docker.labels import managed_by_filter
 from nemo_deployments_plugin.entities import DockerDeploymentConfig
-
-if TYPE_CHECKING:
-    import docker
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if TYPE_CHECKING:
import docker
from typing import Any
import docker
from nemo_deployments_plugin.backends.docker.labels import managed_by_filter
from nemo_deployments_plugin.entities import DockerDeploymentConfig
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/ports.py`
around lines 17 - 18, Remove the TYPE_CHECKING conditional guard around the
docker import. Convert the conditional import of docker (currently under if
TYPE_CHECKING:) to a regular import statement at the module level, so that
docker is imported unconditionally rather than only for type checking purposes.

Source: Coding guidelines

Comment on lines +11 to +19
from typing import TYPE_CHECKING
from urllib.parse import urljoin

import httpx
from nemo_deployments_plugin.entities import Probe

if TYPE_CHECKING:
from docker.models.containers import Container as DockerContainer

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py

Repository: NVIDIA-NeMo/nemo-platform

Length of output: 5684


Import DockerContainer directly instead of using TYPE_CHECKING.

The file uses from __future__ import annotations, which stringifies all type hints at runtime. Move the docker import out of TYPE_CHECKING and import it as a regular import—there's no runtime cost and it aligns with the coding guideline to avoid TYPE_CHECKING imports when possible.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`
around lines 11 - 19, The docker import statement is unnecessarily wrapped in a
TYPE_CHECKING conditional block. Since the file already uses from __future__
import annotations which stringifies all type hints at runtime, there is no
performance cost to importing DockerContainer directly. Remove the if
TYPE_CHECKING: conditional block that wraps the docker import statement and move
the from docker.models.containers import Container as DockerContainer import to
the top-level imports section with the other import statements like httpx and
the nemo_deployments_plugin imports.

Source: Coding guidelines

Comment on lines +90 to +99
if isinstance(port, int):
base = f"{scheme}://127.0.0.1:{port}"
elif isinstance(port, str) and named_ports and port in named_ports:
base = f"{scheme}://127.0.0.1:{named_ports[port]}"
elif isinstance(port, str) and host_ports:
for container_port, host_port in host_ports.items():
if str(container_port) == port:
base = f"{scheme}://127.0.0.1:{host_port}"
break
url = urljoin(f"{base.rstrip('/')}/", path.lstrip("/"))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Probe target resolution is incorrect for mapped ports/remote hosts.

Line 90/91 uses container port directly; Line 116 hardcodes localhost and ignores host_url. This breaks readiness for remapped host ports and remote Docker daemons.

Proposed fix
-from urllib.parse import urljoin
+from urllib.parse import urljoin, urlparse
@@
-    if probe.tcp_socket is not None and host_url is not None:
-        return await _check_tcp_probe(host_url, probe)
+    if probe.tcp_socket is not None and host_url is not None:
+        return await _check_tcp_probe(host_url, probe, host_ports=host_ports, named_ports=named_ports)
@@
     if isinstance(port, int):
-        base = f"{scheme}://127.0.0.1:{port}"
+        resolved = host_ports.get(port, port) if host_ports else port
+        base = f"{scheme}://127.0.0.1:{resolved}"
@@
-async def _check_tcp_probe(host_url: str, probe: Probe) -> tuple[bool, str]:
+async def _check_tcp_probe(
+    host_url: str,
+    probe: Probe,
+    *,
+    host_ports: dict[int, int] | None = None,
+    named_ports: dict[str, int] | None = None,
+) -> tuple[bool, str]:
@@
-    if not isinstance(port_value, int):
-        return False, "tcp probe requires numeric port"
-    host = "127.0.0.1"
+    parsed = urlparse(host_url)
+    host = parsed.hostname or "127.0.0.1"
+    if isinstance(port_value, int):
+        target_port = host_ports.get(port_value, port_value) if host_ports else port_value
+    elif isinstance(port_value, str) and named_ports and port_value in named_ports:
+        target_port = named_ports[port_value]
+    elif isinstance(port_value, str) and host_ports:
+        target_port = next((hp for cp, hp in host_ports.items() if str(cp) == port_value), None)
+        if target_port is None:
+            return False, f"tcp probe named/container port not mapped: {port_value}"
+    else:
+        return False, "tcp probe requires resolvable port"
@@
-        with socket.create_connection((host, port_value), timeout=timeout):
+        with socket.create_connection((host, target_port), timeout=timeout):
             return

Also applies to: 111-127

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@plugins/nemo-deployments/src/nemo_deployments_plugin/backends/docker/probes.py`
around lines 90 - 99, The probe target resolution incorrectly hardcodes
127.0.0.1 instead of using the actual host from the host_url parameter, which
breaks functionality for remapped host ports and remote Docker daemons. Replace
all instances of 127.0.0.1 in the base URL construction within the port
resolution logic (in the branches checking isinstance(port, int), the
named_ports lookup, and the host_ports loop) with the appropriate host extracted
from the host_url parameter. Apply the same fix to the additional probe target
resolution code mentioned in lines 111-127 to ensure consistency across all
probe endpoint constructions.

Comment on lines +33 to +39
try:
import docker

docker.from_env().ping()
DOCKER_AVAILABLE = True
except Exception:
DOCKER_AVAILABLE = False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Remove duplicated Docker availability check.

Same issue as in test_docker_backend.py. Remove this duplicated check and rely on the parent conftest.py definition. Pytest discovers markers automatically.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/integration/test_reconcile_docker.py` around
lines 33 - 39, Remove the duplicated Docker availability check in the
test_reconcile_docker.py file. Delete the entire try-except block that attempts
to import docker and calls docker.from_env().ping() to set DOCKER_AVAILABLE.
Instead, rely on the Docker availability check already defined in the parent
conftest.py file, which Pytest discovers automatically through its marker
system. This eliminates code duplication and ensures consistency across all test
files in the project.

from unittest.mock import AsyncMock, MagicMock

import pytest
from backends.docker_helpers import container_attrs, sample_config

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fix helper import path to match test package layout.

from backends.docker_helpers ... is likely unresolved for this test layout. Import from the actual tests package path (or use a relative import) so the module loads under pytest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py`
at line 11, The import statement at line 11 using `from backends.docker_helpers
import container_attrs, sample_config` is using an absolute import path that
does not resolve correctly for the test package layout. Change this to use a
relative import that properly navigates from the test file's location
(plugins/nemo-deployments/tests/unit/backends/docker/test_backend_mocked.py) to
the helpers module, or adjust the import to use the correct absolute path that
matches your package structure so that pytest can resolve the module correctly.

from unittest.mock import AsyncMock, MagicMock

import pytest
from backends.docker_helpers import container_attrs, sample_config

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fix helper import path; current module path is likely invalid.

from backends.docker_helpers ... should be switched to the real tests package/relative path used by this repo, otherwise this test module may fail to import.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_idempotency.py` at
line 11, The import statement at the top of test_idempotency.py uses an
incorrect module path. The line `from backends.docker_helpers import
container_attrs, sample_config` needs to be updated to use the correct relative
or absolute import path that matches your repository's actual package structure.
Determine the correct path by checking where the docker_helpers module actually
exists in your repo (likely in the backends directory relative to the tests
directory), and adjust the import to use either a relative import (like from
..docker_helpers or similar) or the correct absolute path from the repo root.

Comment on lines +21 to +24
def test_is_port_free_local() -> None:
# Ephemeral high port likely free in test environment
assert is_port_free(59999) or not is_port_free(59999)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This test is a tautology and provides zero signal.

assert is_port_free(p) or not is_port_free(p) is always true. Replace with a deterministic setup that asserts a known occupied/free port outcome.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py` around
lines 21 - 24, The assertion in test_is_port_free_local is a tautology that
always evaluates to true since it checks if something is true OR not true, which
provides no meaningful test coverage. Replace this assertion with a
deterministic test by either occupying a port and verifying is_port_free returns
False for that port, or using a known free port and verifying is_port_free
returns True. Set up a specific port state beforehand (such as binding to a port
or using a port known to be available) and then assert that is_port_free returns
the expected boolean result for that deterministic condition.

Comment on lines +26 to +36
@pytest.mark.asyncio
async def test_find_available_port_skips_used(mock_docker_client: MagicMock) -> None:
from nemo_deployments_plugin.backends.docker.ports import find_available_port

used = MagicMock()
used.ports = {"80/tcp": [{"HostPort": "9000"}]}
mock_docker_client.containers.list.return_value = [used]

cfg = DockerDeploymentConfig(port_range_start=9000, port_range_end=9002)
port = await find_available_port(mock_docker_client, cfg)
assert port in {9001, 9002}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Port-allocation test is environment-dependent and flaky.

This test depends on real availability of host ports 9001/9002. Mock/monkeypatch is_port_free to deterministic values and assert the exact expected port.

Suggested deterministic test shape
 `@pytest.mark.asyncio`
-async def test_find_available_port_skips_used(mock_docker_client: MagicMock) -> None:
-    from nemo_deployments_plugin.backends.docker.ports import find_available_port
+async def test_find_available_port_skips_used(
+    mock_docker_client: MagicMock, monkeypatch: pytest.MonkeyPatch
+) -> None:
+    from nemo_deployments_plugin.backends.docker import ports as ports_mod
+    from nemo_deployments_plugin.backends.docker.ports import find_available_port
 
     used = MagicMock()
     used.ports = {"80/tcp": [{"HostPort": "9000"}]}
     mock_docker_client.containers.list.return_value = [used]
+    monkeypatch.setattr(ports_mod, "is_port_free", lambda p: p != 9001)
 
     cfg = DockerDeploymentConfig(port_range_start=9000, port_range_end=9002)
     port = await find_available_port(mock_docker_client, cfg)
-    assert port in {9001, 9002}
+    assert port == 9002
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-deployments/tests/unit/backends/docker/test_ports.py` around
lines 26 - 36, The test_find_available_port_skips_used test depends on real host
port availability which makes it flaky and environment-dependent. Mock the
is_port_free function that is called within find_available_port to return
deterministic values (such as returning False for port 9000 to simulate it being
in use, and True for ports 9001 and 9002 to simulate them being free). Then
update the assertion to check for the exact expected port (9001) instead of
using an assertion with multiple possible values, ensuring the test behavior is
reproducible and not dependent on the actual system port state.

…E-756)

- Exclude already-assigned host ports during multi-port allocation
- Use label constants; fix HTTP/TCP probe host/port resolution
- Remove TYPE_CHECKING-only docker imports
- Deduplicate Docker availability check via docker_availability module
- Harden port unit tests; add type hints and cleanup improvements

Signed-off-by: Tyler Bray <tbray@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant