diff --git a/commit.txt b/commit.txt new file mode 100644 index 0000000000..565d51bc2c --- /dev/null +++ b/commit.txt @@ -0,0 +1,15 @@ +feat(jobs): add auth-aware E2E tests and job diagnostics infrastructure + +Add a new E2E test suite (`test_jobs_auth.py`) that validates workspace +isolation and principal propagation under an auth-enabled platform config. +Introduce a reusable `diagnostics.py` module in the jobs controller layer +to collect and log structured job/step/task state on errors, and wire it +into the reconciler and scheduler for automatic debug-level diagnostics +when steps transition to ERROR or encounter unexpected exceptions. + +Refactor `e2e/conftest.py` to support multiple running-services instances +keyed by config hash, enabling per-test-module platform configs (e.g., +`local-subprocess.yaml` with auth enabled) to coexist in a single session. +Add a `local-subprocess.yaml` E2E config and extend `nmp_testing` utilities +with `grant_workspace_role`, `unique_email`, and `TEST_ADMIN_EMAIL` helpers +needed by the auth test scenarios. diff --git a/container.md b/container.md new file mode 100644 index 0000000000..135240cada --- /dev/null +++ b/container.md @@ -0,0 +1 @@ +Just talking out loud here, to try to wrap my own head around this. Conceptually, the jobs service has a "backend" (which is a type of execution) and an executor (a configured instance of a backend). "Profile" is a higher-level concept that does fancy selection, and arguably it's just a useless layer of abstraction that confuses everyone. The idea was that some jobs need cpu, and some jobs need gpu, so why not create an abstraction layer that does the hard work of choosing for you. Or you can just use an executor as part of the job. Profiles introduce a category of compute: cpu, gpu, gpu_distributed, which are really just shortcuts for the backend you're looking for. So for customization you would select gpu_distributed, for eval: gpu.
So I think the idea is that the job compiler chooses the category, and the platform config does the mapping to the executor. And you can pass a profile in with the job spec, which will tell the compiler which profile to use. So you end up having to map profiles to executors anyway. It makes me think that the services should be responsible for doing this mapping, not jobs. Ex: customizer configures the mapping. But having a central concept of "profile" means that every service does things the exact same way, which is nice. It just makes the responsibility unclear. In the future we might want to allow a plugin to define its own "provider", and allow other services to use it. Different providers would have different configuration requirements that map down to the execution backend's job format. At this level, we are talking about job specs, and each service compiles its own spec down to the provider spec, which then selects the executor, and the job is submitted to the executor. And as part of the platform config, we have defaults for some of these things, which might include containers. For example, customizer chooses the image to use, depending on the type of customization requested. But for customizer, we could imagine a subprocess executor, with the customizer task image, and the behavior would be to call docker run .... So this is an appropriate translation for the subprocess executor. So the job is running as a subprocess, but we require a container, so we use docker (or podman), depending on how the subprocess executor is configured to run containers. If the container is absent, then we just call the entrypoint in the workspace configured by the executor. The goal should be that we don't care what executor it's running on, the job spec is the same. But services will want some flexibility here to choose the right executor.
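To make the profile-to-executor mapping concrete, here is a rough sketch of what the selection step might look like. Everything in it is hypothetical: the `Executor` shape, the executor names, the `PROFILE_TO_EXECUTOR` table, and `select_executor` are made up for illustration and are not the platform's actual API. It only shows the idea that a job spec can either name an executor directly or name a profile category that config resolves to one.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Executor:
    """A configured instance of a backend (hypothetical shape)."""

    name: str
    backend: str  # e.g. "subprocess", "docker", "kubernetes"
    config: dict[str, Any] = field(default_factory=dict)


# Hypothetical executors that a platform config might register.
EXECUTORS = {
    "local-subprocess": Executor("local-subprocess", "subprocess"),
    "docker-gpu": Executor("docker-gpu", "docker", {"gpus": 1}),
    "k8s-multinode": Executor("k8s-multinode", "kubernetes", {"nodes": 2}),
}

# Hypothetical mapping from profile category to executor name.
PROFILE_TO_EXECUTOR = {
    "cpu": "local-subprocess",
    "gpu": "docker-gpu",
    "gpu_distributed": "k8s-multinode",
}


def select_executor(job_spec: dict[str, Any], default_profile: str = "cpu") -> Executor:
    """Pick an executor for a job spec.

    An explicit ``executor`` in the spec wins; otherwise the spec's
    ``profile`` (or the default) is resolved through the mapping table.
    """
    explicit = job_spec.get("executor")
    if explicit is not None:
        return EXECUTORS[explicit]
    profile = job_spec.get("profile", default_profile)
    return EXECUTORS[PROFILE_TO_EXECUTOR[profile]]
```

With this shape, a customization job would pass `{"profile": "gpu_distributed"}` and an eval job `{"profile": "gpu"}`, while a service that wants full control passes `{"executor": ...}` and skips the profile layer entirely.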
For example, imagine a plugin has some dependency on a specific version of python, so we might want to define an executor that executes commands in the context of a specific venv when using the subprocess executor. And we can also provide a container for that plugin, which would execute the same command in a container. The platform shouldn't need to know anything about the venv, or the container, as this would be specific to the plugin. The plugin just selects the "provider", and the platform maps that to the correct executor. Now a plugin could define its own provider, which will select the correct venv (either using subprocess or in a container). What this suggests to me is that the plugin needs more control over how jobs are mapped to executors, and which executors are configured beyond the rigid "provider" categories. For example, a plugin could define a venv for subprocess exec (typically for local dev), a container for production workloads, and a slurm script for a batch workflow specific to slurm backends. \ No newline at end of file diff --git a/docs/set-up/config-reference.mdx b/docs/set-up/config-reference.mdx index 74f9e0baf0..bfbb07510c 100644 --- a/docs/set-up/config-reference.mdx +++ b/docs/set-up/config-reference.mdx @@ -422,6 +422,8 @@ jobs: schedule_interval_seconds: 5 # Register the subprocess/default execution profile. When unset, defaults to true for docker/none runtimes and false for kubernetes. enable_subprocess_executor: + # Include raw job log lines in controller diagnostics snapshots. Disabled by default because job logs may contain secrets or PII. Enable only for local debugging or test environments. | default: False + include_job_logs_in_diagnostics: false ``` ### `models` diff --git a/e2e/configs/local-subprocess.yaml b/e2e/configs/local-subprocess.yaml new file mode 100644 index 0000000000..87dba1b342 --- /dev/null +++ b/e2e/configs/local-subprocess.yaml @@ -0,0 +1,68 @@ +# Local E2E config for hosts without Docker.
+# +# This keeps the explicit subprocess/default jobs profile required by +# translate_cpu_container_steps_to_subprocess(), while avoiding the default +# docker job backends that are derived from platform.runtime: "docker". + +platform: + runtime: "none" + base_url: "http://0.0.0.0:8080" + +service: {} + +auth: + enabled: false + allow_unsigned_jwt: true + policy_decision_point_provider: embedded + policy_decision_point_base_url: "http://localhost:8080" + policy_data_refresh_interval: 2 + bundle_cache_seconds: 15 + admin_email: "admin@example.com" + +entities: {} + +jobs: + # Local E2E-only debugging aid. This may leak secrets or PII from job output, + # so it must remain disabled in non-test configs. + include_job_logs_in_diagnostics: true + executors: + - provider: subprocess + profile: default + backend: subprocess + config: + working_directory: .tmp/e2e/subprocess-jobs + cleanup_completed_jobs_immediately: false + ttl_seconds_before_active: 60 + ttl_seconds_active: 3600 + ttl_seconds_after_finished: 300 + executor_defaults: + subprocess: + working_directory: .tmp/e2e/subprocess-jobs + cleanup_completed_jobs_immediately: false + ttl_seconds_before_active: 60 + ttl_seconds_active: 3600 + ttl_seconds_after_finished: 300 + +evaluator: + recreate_existing_system_entities: true + +safe_synthesizer: {} + +models: + controller: + interval_seconds: 5 + model_deployment_garbage_collection_ttl_seconds: 30 + +inference_gateway: {} + +secrets: + allow_key_creation: true + +files: + default_storage_config: + type: local + path: .tmp/e2e/files + +studio: + static_files_path: web/packages/studio/dist + sandbox_enabled: true diff --git a/e2e/conftest.py b/e2e/conftest.py index feb0a7efcd..3a49a635a2 100644 --- a/e2e/conftest.py +++ b/e2e/conftest.py @@ -15,25 +15,61 @@ connects to the given URL. Otherwise it spawns ``nemo services run`` as a child process on a free port, polls ``/status`` until ready, and terminates the process after the session. 
+ +Config selection:: + + # Default local platform config + pytestmark = [pytest.mark.e2e_config()] + + # Single repo-root-relative config file + pytestmark = [pytest.mark.e2e_config("e2e/configs/local-subprocess.yaml")] + + # Ordered config layers: files first, then inline overlays + pytestmark = [ + pytest.mark.e2e_config( + "e2e/configs/local-subprocess.yaml", + {"auth": {"enabled": True}}, + ) + ] + +Why this exists: + +- E2E modules should be able to declare the platform shape they need rather + than inheriting one global config from ``conftest.py``. +- Different modules can exercise different backends or auth modes in the same + pytest session. +- Identical effective configs are pooled and reused, so config selection does + not imply one fresh ``nemo services`` process per module. + +How pooling works: + +- The harness resolves the ordered ``e2e_config(...)`` layers into one + effective config dict. +- That config is normalized into a canonical form and hashed. +- Modules that resolve to the same hash share one running services instance for + the session. +- The pooled instance is shut down as soon as the last module using that hash + finishes, so mixed-config runs do not keep every started platform alive until + the end of the session. + +The pool implementation itself lives in ``e2e.services_pool`` so this file can +stay focused on pytest hooks and fixtures. 
""" -import contextlib import logging import os -import socket -import subprocess -import sys import tempfile -import time import uuid from collections.abc import Iterator from pathlib import Path -from typing import IO, Any -import httpx import pytest from nemo_platform import NeMoPlatform -from nmp.testing import NemoRun, get_repo_root + +from e2e.services_pool import E2EServicesPool, RunningServices, admin_headers + +_services_pool_manager_key = pytest.StashKey[E2EServicesPool]() +_services_metadata_key = pytest.StashKey[dict[str, str]]() def pytest_configure(config: pytest.Config) -> None: @@ -49,20 +85,25 @@ def pytest_configure(config: pytest.Config) -> None: from nemo_platform_plugin.config import Configuration Configuration.clear_cache() + config.stash[_services_pool_manager_key] = E2EServicesPool() + + +def pytest_collection_modifyitems(session: pytest.Session, config: pytest.Config, items: list[pytest.Item]) -> None: + """Register collected E2E modules with the services pool manager.""" + config.stash[_services_pool_manager_key].register_collected_items(items) logger = logging.getLogger(__name__) +_E2E_HARNESS_DEBUG = os.environ.get("E2E_HARNESS_DEBUG") == "1" -_HEALTH_TIMEOUT = 60 -_HEALTH_POLL_INTERVAL = 1.0 -_AUTH_READY_TIMEOUT = 60 -_E2E_ADMIN_EMAIL = "admin@example.com" _SERVICES_LOG = Path(os.environ.get("E2E_SERVICES_LOG", os.path.join(tempfile.gettempdir(), "services.log"))) # Number of log lines to dump from the services log on test failure. 
_TAIL_LINES_ON_FAILURE = 100 _services_log_key = pytest.StashKey[Path]() +_active_services_log_key = pytest.StashKey[Path]() +_active_services_metadata_key = pytest.StashKey[dict[str, str]]() NGC_API_KEY_ENV = "NGC_API_KEY" @@ -87,180 +128,6 @@ def ngc_secret(sdk: NeMoPlatform, workspace: str, ngc_api_key: str) -> Iterator[ pass # Best-effort cleanup; the workspace is deleted anyway -@pytest.fixture(scope="session") -def services_log_path(request: pytest.FixtureRequest, tmp_path_factory: pytest.TempPathFactory) -> Path: - """Return a unique services log path for this session. - - ``E2E_SERVICES_LOG_DIR`` (if set) is treated as a **directory**; in CI - the job uploads everything under it as artifacts. When unset we - fall back to a pytest-managed temp directory. Either way, each - session writes to a UUID-named file inside the directory so - parallel workers never clobber each other. - - The path is stashed on the session so the - ``pytest_runtest_makereport`` hook can read it without requesting - the fixture. - """ - log_dir = os.environ.get("E2E_SERVICES_LOG_DIR") - if log_dir: - directory = Path(log_dir) - directory.mkdir(parents=True, exist_ok=True) - else: - directory = tmp_path_factory.mktemp("e2e-services-logs") - path = directory / f"services-{uuid.uuid4().hex[:8]}.log" - request.session.stash[_services_log_key] = path - return path - - -_E2E_REPO_ROOT = Path(__file__).resolve().parents[1] -_E2E_PLATFORM_CONFIG = _E2E_REPO_ROOT / "packages/nmp_platform/config/local.yaml" - - -def _e2e_services_env() -> dict[str, str]: - """Environment for the ``nemo services run`` child process. - - ``pytest_configure`` sets ``NMP_INFERENCE_GATEWAY_MOCK_PROVIDER_PREFIX`` on the - pytest process so ``add_mock_provider()`` can build providers, but the IGW - must see the same value in *its* process or mock routing and cache refresh - behave differently from the test client. 
Mirror the Docker E2E backend - (``nmp.testing.e2e.docker``) by setting inference env vars explicitly here - rather than relying on inherited shell state. - - Use ``packages/nmp_platform/config/local.yaml`` (``inference_gateway: {}``) - so IGW polls the Models service on the background refresh interval instead - of the dev-only ``debug_model_providers`` block in - ``services/core/inference-gateway/config/local.yaml``, which disables that - loop. - """ - env = os.environ.copy() - env["NMP_SEED_ON_STARTUP"] = "true" - env["NMP_INFERENCE_GATEWAY_MOCK_PROVIDER_PREFIX"] = "igw-mock-" - env["NMP_CONFIG_FILE_PATH"] = str(_E2E_PLATFORM_CONFIG) - env["NMP_CONFIG_WARNINGS_DISABLED"] = "1" - if not _e2e_auth_enabled(): - env["NMP_AUTH_ENABLED"] = "false" - elif "NMP_AUTH_ENABLED" not in env: - env["NMP_AUTH_ENABLED"] = "true" - return env - - -def _e2e_auth_enabled() -> bool: - """Return whether the e2e harness should run with authorization enabled. - - Default is disabled so ``make test-e2e`` does not depend on platform-admin - seeding, PDP refresh, or role propagation timing. Opt in with - ``E2E_AUTH_ENABLED=true`` (see ``make test-e2e-docker-auth``). 
- """ - return os.environ.get("E2E_AUTH_ENABLED", "false").lower() in ("1", "true", "yes") - - -def _find_free_port() -> int: - """Bind to port 0 and let the OS assign a free port.""" - with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: - s.bind(("127.0.0.1", 0)) - return s.getsockname()[1] - - -def _wait_for_healthy(url: str, timeout: float = _HEALTH_TIMEOUT) -> bool: - """Poll /status until it returns 200 or timeout expires.""" - deadline = time.monotonic() + timeout - while time.monotonic() < deadline: - try: - resp = httpx.get(f"{url}/status", timeout=2.0) - if resp.status_code == 200: - return True - except httpx.RequestError: - pass # Server not up yet, keep polling - time.sleep(_HEALTH_POLL_INTERVAL) - return False - - -def _admin_headers() -> dict[str, str]: - return { - "X-NMP-Principal-Id": _E2E_ADMIN_EMAIL, - "X-NMP-Principal-Email": _E2E_ADMIN_EMAIL, - } - - -def _wait_for_auth_ready(url: str, timeout: float = _AUTH_READY_TIMEOUT) -> bool: - """Poll until platform admin can create entities in a fresh workspace. - - Workspace create/list alone is insufficient: entity CRUD requires - PlatformAdmin (or entities.create, which workspace Admin lacks). The first - entity e2e test was flaky when only workspace visibility was probed. 
- """ - deadline = time.monotonic() + timeout - while time.monotonic() < deadline: - probe_name = f"auth-probe-{uuid.uuid4().hex[:8]}" - entity_name = f"auth-probe-entity-{uuid.uuid4().hex[:8]}" - try: - create_resp = httpx.post( - f"{url}/apis/entities/v2/workspaces", - json={"name": probe_name}, - headers=_admin_headers(), - timeout=5.0, - ) - if create_resp.status_code != 201: - time.sleep(_HEALTH_POLL_INTERVAL) - continue - - entity_resp = httpx.post( - f"{url}/apis/entities/v2/workspaces/{probe_name}/entities/e2e-auth-probe", - json={"name": entity_name, "data": {"ready": True}}, - headers=_admin_headers(), - timeout=5.0, - ) - if entity_resp.status_code != 201: - httpx.delete( - f"{url}/apis/entities/v2/workspaces/{probe_name}", - headers=_admin_headers(), - timeout=5.0, - ) - time.sleep(_HEALTH_POLL_INTERVAL) - continue - - httpx.delete( - f"{url}/apis/entities/v2/workspaces/{probe_name}/entities/e2e-auth-probe/{entity_name}", - headers=_admin_headers(), - timeout=5.0, - ) - httpx.delete( - f"{url}/apis/entities/v2/workspaces/{probe_name}", - headers=_admin_headers(), - timeout=5.0, - ) - return True - except httpx.RequestError as exc: - logger.debug("Auth readiness probe failed; will retry: %s", exc) - time.sleep(_HEALTH_POLL_INTERVAL) - return False - - -@contextlib.contextmanager -def background_process( - args: list[str], - stdout: IO[Any] | None = None, - env: dict[str, str] | None = None, -) -> Iterator[subprocess.Popen]: - """Run a subprocess, yield the ``Popen``, and terminate on exit. - - Unlike ``Popen``'s built-in context manager (which only waits for the - process), this sends SIGTERM/SIGKILL so long-running servers are - cleaned up. 
- """ - proc = subprocess.Popen(args, stdout=stdout, stderr=subprocess.STDOUT, env=env) - try: - yield proc - finally: - proc.terminate() - try: - proc.wait(timeout=10) - except subprocess.TimeoutExpired: - logger.warning("Process %d did not exit after SIGTERM, sending SIGKILL", proc.pid) - proc.kill() - proc.wait(timeout=5) - - # ---- Services log tail on failure ------------------------------------------ @@ -278,72 +145,93 @@ def pytest_runtest_makereport(item: pytest.Item, call: pytest.CallInfo): # noqa if not report.failed: return - log_path = item.session.stash.get(_services_log_key, None) + log_path = item.stash.get(_active_services_log_key, None) or item.session.stash.get(_services_log_key, None) if log_path and log_path.exists(): lines = log_path.read_text().splitlines(keepends=True) tail = lines[-_TAIL_LINES_ON_FAILURE:] if tail: header = f"--- services log (last {len(tail)} lines) [{log_path}] ---" report.sections.append(("Services Log", f"{header}\n{''.join(tail)}")) + metadata = item.stash.get(_active_services_metadata_key, None) + if metadata: + report.sections.append( + ( + "E2E Services Binding", + "\n".join(f"{key}: {value}" for key, value in sorted(metadata.items())), + ) + ) # ---- Fixtures -------------------------------------------------------------- - - @pytest.fixture(scope="session") -def _services(services_log_path: Path) -> Iterator[str]: - """Spawn ``nemo services run`` and yield the base URL. +def _services_pool_manager( + request: pytest.FixtureRequest, + tmp_path_factory: pytest.TempPathFactory, +) -> Iterator[E2EServicesPool]: + manager = request.config.stash[_services_pool_manager_key] + manager.bind_tmp_path_factory(tmp_path_factory) + yield manager + manager.shutdown_all() + + +@pytest.fixture(scope="module") +def _services_instance( + request: pytest.FixtureRequest, + _services_pool_manager: E2EServicesPool, +) -> Iterator[RunningServices]: + """Return the running services instance for the current module's config. 
Skipped when ``NMP_BASE_URL`` is already set (external services). - This is the "subprocess" backend. When we add Docker and Kubernetes - backends, this fixture should be replaced by a backend-selection layer - (e.g. ``--docker`` / ``--kubernetes`` CLI flags) that dispatches to the - appropriate setup while yielding the same base URL interface. Tests - should remain agnostic to the backend. + Modules do not each get a dedicated services process. Instead, the harness + computes the effective config hash for the module and reuses any existing + process already started for that hash within the pytest session. A new + process is started only when the module resolves to a config that no prior + module has used. """ - external_url = os.environ.get("NMP_BASE_URL") - if external_url: - yield external_url - return + module = request.node.getparent(pytest.Module) + if module is None: + raise RuntimeError("Expected module-scoped E2E fixture to have a pytest module parent") + services = _services_pool_manager.acquire_for_module(module) + if services.log_path is not None: + request.session.stash[_services_log_key] = services.log_path + try: + yield services + finally: + _services_pool_manager.release_for_module(module) - port = _find_free_port() - url = f"http://127.0.0.1:{port}" - - nemo_bin = str(Path(sys.executable).parent / "nemo") - args = [ - nemo_bin, - "services", - "run", - "--service-group", - "all", - "--controller-group", - "all", - "--port", - str(port), - ] - env = _e2e_services_env() - logger.info("Starting nemo services on port %d", port) +@pytest.fixture(autouse=True) +def _bind_services_log_to_test(request: pytest.FixtureRequest, _services_instance: RunningServices) -> None: + if _services_instance.log_path is not None: + request.node.stash[_active_services_log_key] = _services_instance.log_path + module = request.node.getparent(pytest.Module) + if module is None: + return + manager = request.config.stash[_services_pool_manager_key] + metadata = { + key: 
str(value) + for key, value in manager.describe_module_binding(module.nodeid, _services_instance).items() + if value is not None + } + request.node.stash[_active_services_metadata_key] = metadata + if _E2E_HARNESS_DEBUG: + logger.info( + "E2E test binding", + extra={ + **metadata, + "test": request.node.nodeid, + }, + ) - log_path = services_log_path or _SERVICES_LOG - with open(log_path, "w") as log_file, background_process(args, stdout=log_file, env=env) as proc: - if not _wait_for_healthy(url): - pytest.fail( - f"nemo services run did not become healthy within {_HEALTH_TIMEOUT}s.\nlog:\n{log_path.read_text()}" - ) - if _e2e_auth_enabled() and not _wait_for_auth_ready(url): - pytest.fail( - f"Platform auth seed did not become ready within {_AUTH_READY_TIMEOUT}s.\nlog:\n{log_path.read_text()}" - ) - logger.info("Platform services ready on port %d (pid %d)", port, proc.pid) - yield url - logger.info("Terminating nemo services (pid %d)", proc.pid) +@pytest.fixture(scope="module") +def _services(_services_instance: RunningServices) -> Iterator[str]: + yield _services_instance.url -@pytest.fixture(scope="session") -def sdk(_services: str) -> NeMoPlatform: +@pytest.fixture(scope="module") +def sdk(_services: str, _services_instance: RunningServices) -> NeMoPlatform: """Provide an SDK client connected to the running platform. When connecting to an external cluster (via ``NMP_BASE_URL``), authentication @@ -351,12 +239,12 @@ def sdk(_services: str) -> NeMoPlatform: - ``NMP_ACCESS_TOKEN`` env var (e.g. from ``nemo auth token``) - ``NMP_CONTEXT_NAME`` env var (e.g. ``tot``) to read credentials from CLI config - For local auth-enabled deployments (``E2E_AUTH_ENABLED=true``), admin headers - are injected via ``default_headers``. + For local auth-enabled deployments, admin headers are injected via + ``default_headers`` based on the rendered platform config. 
""" access_token = os.environ.get("NMP_ACCESS_TOKEN") context_name = os.environ.get("NMP_CONTEXT_NAME") - headers = _admin_headers() if _e2e_auth_enabled() else {} + headers = admin_headers() if _services_instance.auth_enabled else {} return NeMoPlatform( base_url=_services, access_token=access_token, @@ -373,37 +261,3 @@ def workspace(sdk: NeMoPlatform) -> Iterator[str]: sdk.workspaces.create(name=name) yield name sdk.workspaces.delete(name) - - -@pytest.fixture(scope="session") -def nemo_run(_services: str) -> NemoRun: - """Run the NeMo CLI from the repo root with the E2E base URL and workspace env when set.""" - base_url = _services.rstrip("/") - repo_root = get_repo_root() - - def run( - *args: str, - workspace: str | None = None, - env_extra: dict[str, str] | None = None, - timeout: int | None = 60, - capture_output: bool = True, - stdin: int | None = None, - ) -> subprocess.CompletedProcess[str]: - env = os.environ.copy() - env["NMP_BASE_URL"] = base_url - if workspace is not None: - env["NMP_WORKSPACE"] = workspace - if env_extra: - env.update(env_extra) - cmd = ["uv", "run", "--project", str(repo_root), "--frozen", "nemo", "-f", "json", *args] - return subprocess.run( - cmd, - cwd=repo_root, - env=env, - timeout=timeout, - capture_output=capture_output, - stdin=stdin, - text=True, - ) - - return run diff --git a/e2e/services_pool.py b/e2e/services_pool.py new file mode 100644 index 0000000000..17637010c6 --- /dev/null +++ b/e2e/services_pool.py @@ -0,0 +1,526 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +"""Shared E2E services-pool implementation used by pytest fixtures.""" + +from __future__ import annotations + +import hashlib +import json +import logging +import os +import socket +import subprocess +import sys +import time +import uuid +from copy import deepcopy +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +import httpx +import pytest +import yaml +from _pytest.nodes import Node +from nmp.testing.e2e.config import deep_merge + +logger = logging.getLogger(__name__) +_E2E_HARNESS_DEBUG = os.environ.get("E2E_HARNESS_DEBUG") == "1" + +_HEALTH_TIMEOUT = 60 +_HEALTH_POLL_INTERVAL = 1.0 +_AUTH_READY_TIMEOUT = 60 +_E2E_ADMIN_EMAIL = "admin@example.com" +_E2E_REPO_ROOT = Path(__file__).resolve().parents[1] +_DEFAULT_E2E_PLATFORM_CONFIG = _E2E_REPO_ROOT / "packages/nmp_platform/config/local.yaml" + + +def admin_headers() -> dict[str, str]: + return { + "X-NMP-Principal-Id": _E2E_ADMIN_EMAIL, + "X-NMP-Principal-Email": _E2E_ADMIN_EMAIL, + } + + +@dataclass(frozen=True) +class ServicesPoolKey: + config_hash: str + + +@dataclass +class RunningServices: + url: str + log_path: Path | None + proc: subprocess.Popen[Any] | None + config_path: Path | None + auth_enabled: bool = False + key: ServicesPoolKey | None = None + + +@dataclass(frozen=True) +class ModuleConfigState: + module_id: str + key: ServicesPoolKey + config_path: Path | None + config_data: dict[str, Any] + config_layers: tuple[str, ...] 
+ auth_enabled: bool + + +class E2EServicesPool: + """Central manager for config-hash-based E2E service pooling.""" + + def __init__(self) -> None: + self._tmp_path_factory: pytest.TempPathFactory | None = None + self._module_states: dict[str, ModuleConfigState] = {} + self._remaining_modules_by_key: dict[ServicesPoolKey, set[str]] = {} + self._running_by_key: dict[ServicesPoolKey, RunningServices] = {} + self._active_service_key_by_module: dict[str, ServicesPoolKey] = {} + self._generated_config_dir: Path | None = None + self._log_dir: Path | None = None + + @staticmethod + def _log_debug(message: str, **extra: Any) -> None: + if _E2E_HARNESS_DEBUG: + logger.info(message, extra=extra) + + def bind_tmp_path_factory(self, tmp_path_factory: pytest.TempPathFactory) -> None: + if self._tmp_path_factory is None: + self._tmp_path_factory = tmp_path_factory + + def register_collected_items(self, items: list[pytest.Item]) -> None: + seen_modules: set[str] = set() + for item in items: + module = item.getparent(pytest.Module) + if module is None or module.nodeid in seen_modules: + continue + seen_modules.add(module.nodeid) + self._ensure_module_registered(module) + + def acquire_for_module(self, module: pytest.Module) -> RunningServices: + external_url = os.environ.get("NMP_BASE_URL") + if external_url: + return RunningServices(url=external_url, log_path=None, proc=None, config_path=None, auth_enabled=False) + + self._ensure_module_registered(module) + state = self._module_states[module.nodeid] + if state.config_path is None: + state = self._materialize_config_path(state) + self._module_states[module.nodeid] = state + assert state.config_path is not None + services = self._running_by_key.get(state.key) + if services is None: + log_path = self._get_log_dir() / f"services-{state.key.config_hash}-{uuid.uuid4().hex[:8]}.log" + services = _start_services(state.config_path, state.config_data, state.key.config_hash, log_path) + self._running_by_key[state.key] = services + 
previous_key = self._active_service_key_by_module.get(module.nodeid) + if previous_key is not None and previous_key != state.key: + logger.error( + "E2E module rebound to a different services pool key", + extra={ + "e2e_module": module.nodeid, + "previous_config_hash": previous_key.config_hash, + "new_config_hash": state.key.config_hash, + "new_url": services.url, + "new_pid": services.proc.pid if services.proc is not None else None, + }, + ) + self._active_service_key_by_module[module.nodeid] = state.key + self._log_debug("E2E services acquire", **self.describe_module_binding(module.nodeid, services)) + return services + + def release_for_module(self, module: pytest.Module) -> None: + if os.environ.get("NMP_BASE_URL"): + return + state = self._module_states.get(module.nodeid) + if state is None: + return + remaining = self._remaining_modules_by_key.get(state.key) + if remaining is None or module.nodeid not in remaining: + return + remaining.remove(module.nodeid) + self._log_debug( + "E2E services release", + **{ + **self.describe_module_binding(module.nodeid), + "remaining_modules_for_hash": sorted(remaining), + }, + ) + if remaining: + return + self._remaining_modules_by_key.pop(state.key, None) + self._active_service_key_by_module.pop(module.nodeid, None) + services = self._running_by_key.pop(state.key, None) + if services is not None: + self._terminate_services(services) + + def shutdown_all(self) -> None: + for services in list(self._running_by_key.values()): + self._terminate_services(services) + self._running_by_key.clear() + self._remaining_modules_by_key.clear() + + def _ensure_module_registered(self, module: pytest.Module) -> None: + if module.nodeid in self._module_states: + return + resolved_paths, config_data = _load_effective_e2e_config_from_node(module) + key = _services_pool_key(_canonical_config_hash(config_data)) + auth_enabled = _e2e_auth_enabled(config_data) + self._module_states[module.nodeid] = ModuleConfigState( + module_id=module.nodeid, + 
key=key, + config_path=None, + config_data=config_data, + config_layers=tuple(str(path) for path in resolved_paths), + auth_enabled=auth_enabled, + ) + self._remaining_modules_by_key.setdefault(key, set()).add(module.nodeid) + self._log_debug( + "Registered E2E module config", + e2e_module=module.nodeid, + config_hash=key.config_hash, + config_layers=list(self._module_states[module.nodeid].config_layers), + auth_enabled=auth_enabled, + ) + + def _materialize_config_path(self, state: ModuleConfigState) -> ModuleConfigState: + data_dir = e2e_services_data_dir(self._get_log_dir(), state.key.config_hash) + rendered_config_data = with_e2e_instance_paths(state.config_data, data_dir) + rendered_config = yaml.safe_dump(rendered_config_data, default_flow_style=False, sort_keys=True) + config_path = self._get_generated_config_dir() / f"platform-{state.key.config_hash}.yaml" + if not config_path.exists(): + config_path.write_text(rendered_config) + self._log_debug( + "Materialized generated E2E config", + e2e_module=state.module_id, + config_hash=state.key.config_hash, + config_path=str(config_path), + ) + return ModuleConfigState( + module_id=state.module_id, + key=state.key, + config_path=config_path, + config_data=state.config_data, + config_layers=state.config_layers, + auth_enabled=state.auth_enabled, + ) + + def _get_generated_config_dir(self) -> Path: + if self._generated_config_dir is None: + self._generated_config_dir = self._get_log_dir() / "generated-configs" + self._generated_config_dir.mkdir(parents=True, exist_ok=True) + return self._generated_config_dir + + def _get_log_dir(self) -> Path: + if self._tmp_path_factory is None: + raise RuntimeError("E2E services pool used before tmp_path_factory was bound") + if self._log_dir is None: + self._log_dir = _services_log_dir(self._tmp_path_factory) + return self._log_dir + + @staticmethod + def _terminate_services(services: RunningServices) -> None: + if services.proc is None: + return + if services.proc.poll() is not 
None: + E2EServicesPool._log_debug( + "Skipping E2E services terminate for already-exited process", + config_hash=services.key.config_hash if services.key is not None else None, + pid=services.proc.pid, + returncode=services.proc.returncode, + url=services.url, + ) + return + logger.info("Terminating nemo services (pid %d)", services.proc.pid) + services.proc.terminate() + try: + services.proc.wait(timeout=10) + except subprocess.TimeoutExpired: + logger.warning("Process %d did not exit after SIGTERM, sending SIGKILL", services.proc.pid) + services.proc.kill() + services.proc.wait(timeout=5) + + def describe_module_binding( + self, + module_id: str, + services: RunningServices | None = None, + ) -> dict[str, Any]: + state = self._module_states[module_id] + details: dict[str, Any] = { + "e2e_module": module_id, + "config_hash": state.key.config_hash, + "auth_enabled": state.auth_enabled, + "config_layers": list(state.config_layers), + "config_path": str(state.config_path) if state.config_path is not None else None, + } + if services is not None: + details.update( + { + "service_url": services.url, + "service_pid": services.proc.pid if services.proc is not None else None, + "service_log_path": str(services.log_path) if services.log_path is not None else None, + } + ) + return details + + +def _services_log_dir(tmp_path_factory: pytest.TempPathFactory) -> Path: + log_dir = os.environ.get("E2E_SERVICES_LOG_DIR") + if log_dir: + directory = Path(log_dir) + directory.mkdir(parents=True, exist_ok=True) + return directory + return tmp_path_factory.mktemp("e2e-services-logs") + + +def _resolve_e2e_config_layers_from_node(node: Node) -> list[str | dict[str, Any]]: + marker = node.get_closest_marker("e2e_config") + if marker is None or not marker.args: + return [str(_DEFAULT_E2E_PLATFORM_CONFIG)] + layers: list[str | dict[str, Any]] = [] + for layer in marker.args: + if isinstance(layer, (str, dict)): + layers.append(layer) + continue + raise 
pytest.UsageError("pytest.mark.e2e_config arguments must be strings or dicts") + return layers + + +def _resolve_config_path(config_ref: str) -> Path: + candidate = Path(config_ref) + if not candidate.is_absolute(): + candidate = _E2E_REPO_ROOT / config_ref + return candidate.resolve() + + +def _normalize_config(value: Any, path: tuple[str, ...] = ()) -> Any: + if isinstance(value, dict): + return {key: _normalize_config(value[key], (*path, key)) for key in sorted(value)} + if isinstance(value, list): + normalized = [_normalize_config(item, path) for item in value] + if path == ("jobs", "executors"): + return sorted( + normalized, + key=lambda item: ( + item.get("provider", "") if isinstance(item, dict) else "", + item.get("profile", "") if isinstance(item, dict) else "", + item.get("backend", "") if isinstance(item, dict) else "", + json.dumps(item, sort_keys=True, separators=(",", ":"), ensure_ascii=True), + ), + ) + return normalized + return value + + +def _canonical_config_hash(config_data: dict[str, Any]) -> str: + normalized = _normalize_config(config_data) + payload = json.dumps(normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=True) + return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12] + + +def _load_effective_e2e_config_from_node(node: Node) -> tuple[list[Path], dict[str, Any]]: + effective_config: dict[str, Any] = {} + resolved_paths: list[Path] = [] + + for layer in _resolve_e2e_config_layers_from_node(node): + if isinstance(layer, str): + config_path = _resolve_config_path(layer) + if not config_path.is_file(): + raise pytest.UsageError(f"E2E platform config not found: {config_path}") + layer_config = yaml.safe_load(config_path.read_text()) or {} + resolved_paths.append(config_path) + else: + layer_config = layer + effective_config = deep_merge(effective_config, layer_config) + + return resolved_paths, _normalize_config(effective_config) + + +def e2e_services_data_dir(log_dir: Path, config_hash: str) -> Path: + """Return the 
persistent data directory for one pooled services instance.""" + return log_dir / f"data-{config_hash}" + + +def with_e2e_instance_paths(config_data: dict[str, Any], data_dir: Path) -> dict[str, Any]: + """Return config data with per-instance filesystem paths rooted under ``data_dir``.""" + rendered = deepcopy(config_data) + subprocess_working_dir = str(data_dir / "subprocess-jobs") + files_root = str(data_dir / "files") + + jobs = rendered.get("jobs") + if isinstance(jobs, dict): + executors = jobs.get("executors") + if isinstance(executors, list): + for executor in executors: + if not isinstance(executor, dict): + continue + if executor.get("provider") != "subprocess": + continue + config = executor.setdefault("config", {}) + if isinstance(config, dict): + config["working_directory"] = subprocess_working_dir + + executor_defaults = jobs.get("executor_defaults") + if isinstance(executor_defaults, dict): + subprocess_defaults = executor_defaults.get("subprocess") + if isinstance(subprocess_defaults, dict): + subprocess_defaults["working_directory"] = subprocess_working_dir + + files = rendered.get("files") + if isinstance(files, dict): + default_storage_config = files.get("default_storage_config") + if isinstance(default_storage_config, dict) and default_storage_config.get("type") == "local": + default_storage_config["path"] = files_root + + return rendered + + +def e2e_services_env(config_path: Path, data_dir: Path) -> dict[str, str]: + """Environment for the ``nemo services run`` child process.""" + env = os.environ.copy() + env["NMP_SEED_ON_STARTUP"] = "true" + env["NMP_INFERENCE_GATEWAY_MOCK_PROVIDER_PREFIX"] = "igw-mock-" + env["NMP_CONFIG_FILE_PATH"] = str(config_path) + env["NMP_CONFIG_WARNINGS_DISABLED"] = "1" + env["NMP_DATA_DIR"] = str(data_dir) + return env + + +def _e2e_auth_enabled(config_data: dict[str, Any]) -> bool: + auth_cfg = config_data.get("auth") + return isinstance(auth_cfg, dict) and bool(auth_cfg.get("enabled", False)) + + +def 
_services_pool_key(config_hash: str) -> ServicesPoolKey: + return ServicesPoolKey(config_hash=config_hash) + + +def _find_free_port() -> int: + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock: + sock.bind(("127.0.0.1", 0)) + return sock.getsockname()[1] + + +def _wait_for_healthy(url: str, timeout: float = _HEALTH_TIMEOUT) -> bool: + deadline = time.monotonic() + timeout + while time.monotonic() < deadline: + try: + resp = httpx.get(f"{url}/status", timeout=2.0) + if resp.status_code == 200: + return True + except httpx.RequestError: + pass + time.sleep(_HEALTH_POLL_INTERVAL) + return False + + +def _wait_for_auth_ready(url: str, timeout: float = _AUTH_READY_TIMEOUT) -> bool: + deadline = time.monotonic() + timeout + while time.monotonic() < deadline: + probe_name = f"auth-probe-{uuid.uuid4().hex[:8]}" + entity_name = f"auth-probe-entity-{uuid.uuid4().hex[:8]}" + try: + create_resp = httpx.post( + f"{url}/apis/entities/v2/workspaces", + json={"name": probe_name}, + headers=admin_headers(), + timeout=5.0, + ) + if create_resp.status_code != 201: + time.sleep(_HEALTH_POLL_INTERVAL) + continue + + entity_resp = httpx.post( + f"{url}/apis/entities/v2/workspaces/{probe_name}/entities/e2e-auth-probe", + json={"name": entity_name, "data": {"ready": True}}, + headers=admin_headers(), + timeout=5.0, + ) + if entity_resp.status_code != 201: + httpx.delete( + f"{url}/apis/entities/v2/workspaces/{probe_name}", + headers=admin_headers(), + timeout=5.0, + ) + time.sleep(_HEALTH_POLL_INTERVAL) + continue + + httpx.delete( + f"{url}/apis/entities/v2/workspaces/{probe_name}/entities/e2e-auth-probe/{entity_name}", + headers=admin_headers(), + timeout=5.0, + ) + httpx.delete( + f"{url}/apis/entities/v2/workspaces/{probe_name}", + headers=admin_headers(), + timeout=5.0, + ) + return True + except httpx.RequestError as exc: + logger.debug("Auth readiness probe failed; will retry: %s", exc) + time.sleep(_HEALTH_POLL_INTERVAL) + return False + + +def _start_services( + 
config_path: Path, config_data: dict[str, Any], config_hash: str, log_path: Path +) -> RunningServices: + port = _find_free_port() + url = f"http://127.0.0.1:{port}" + + nemo_bin = str(Path(sys.executable).parent / "nemo") + args = [ + nemo_bin, + "services", + "run", + "--service-group", + "all", + "--controller-group", + "all", + "--port", + str(port), + ] + data_dir = e2e_services_data_dir(log_path.parent, config_hash) + data_dir.mkdir(parents=True, exist_ok=True) + env = e2e_services_env(config_path, data_dir) + + logger.info("Starting nemo services on port %d with config %s", port, config_path) + + log_file = open(log_path, "w") + try: + proc = subprocess.Popen(args, stdout=log_file, stderr=subprocess.STDOUT, env=env) + finally: + log_file.close() + + if not _wait_for_healthy(url): + try: + proc.terminate() + proc.wait(timeout=10) + except Exception: + proc.kill() + proc.wait(timeout=5) + pytest.fail( + f"nemo services run did not become healthy within {_HEALTH_TIMEOUT}s.\nlog:\n{log_path.read_text()}" + ) + auth_enabled = _e2e_auth_enabled(config_data) + if auth_enabled and not _wait_for_auth_ready(url): + try: + proc.terminate() + proc.wait(timeout=10) + except Exception: + proc.kill() + proc.wait(timeout=5) + pytest.fail( + f"Platform auth seed did not become ready within {_AUTH_READY_TIMEOUT}s.\nlog:\n{log_path.read_text()}" + ) + + logger.info("Platform services ready on port %d (pid %d)", port, proc.pid) + return RunningServices( + url=url, + log_path=log_path, + proc=proc, + config_path=config_path, + auth_enabled=auth_enabled, + key=_services_pool_key(config_hash), + ) diff --git a/e2e/test_data_designer.py b/e2e/test_data_designer.py index 0ba95a6ecb..8cb5b03269 100644 --- a/e2e/test_data_designer.py +++ b/e2e/test_data_designer.py @@ -12,9 +12,11 @@ from nemo_data_designer_plugin.sdk.errors import DataDesignerJobError from nemo_platform import NeMoPlatform, NotFoundError from nemo_platform.types.inference import ModelProvider -from nmp.testing import 
MockProviderResponse, NemoRun, add_mock_provider +from nmp.testing import MockProviderResponse, add_mock_provider, assert_exit_0, run_nemo_local from nmp.testing.pytest_outcomes import pytest_skip +pytestmark = [pytest.mark.e2e_config("e2e/configs/local-subprocess.yaml")] + PROVIDER_NAME = "test-provider" MODEL_A = "model-a" @@ -168,7 +170,7 @@ def test_fileset_seed_data(sdk: NeMoPlatform, workspace: str) -> None: @pytest.fixture -def nemotron_personas_locale(nemo_run: NemoRun, sdk: NeMoPlatform, workspace: str, ngc_secret: str) -> Generator[str]: +def nemotron_personas_locale(_services: str, sdk: NeMoPlatform, workspace: str, ngc_secret: str) -> Generator[str]: """Invokes the CLI to create a Fileset for Nemotron Personas data. This test does call out to NGC and downloads personas data. Use the smallest locale available @@ -184,7 +186,7 @@ def nemotron_personas_locale(nemo_run: NemoRun, sdk: NeMoPlatform, workspace: st with suppress(NotFoundError): sdk.files.filesets.delete(fileset_name, workspace=WORKSPACE) - nemo_run( + result = run_nemo_local( "data-designer", "personas", "make-fileset", @@ -192,7 +194,10 @@ def nemotron_personas_locale(nemo_run: NemoRun, sdk: NeMoPlatform, workspace: st locale, "--api-key-secret", f"{workspace}/{ngc_secret}", + base_url=_services, + workspace=workspace, ) + assert_exit_0(result, "Failed to create Nemotron Personas fileset via CLI") yield locale diff --git a/e2e/test_jobs.py b/e2e/test_jobs.py index 12a857ac4e..c78d101a9d 100644 --- a/e2e/test_jobs.py +++ b/e2e/test_jobs.py @@ -17,7 +17,10 @@ JOB_SOURCE = "e2e-test-jobs" -pytestmark = [pytest.mark.timeout(600)] +pytestmark = [ + pytest.mark.timeout(600), + pytest.mark.e2e_config("e2e/configs/local-subprocess.yaml"), +] def _job_diagnostic_message(sdk: NeMoPlatform, job, workspace: str, prefix: str) -> str: diff --git a/e2e/test_jobs_auth.py b/e2e/test_jobs_auth.py new file mode 100644 index 0000000000..c7ff2a329c --- /dev/null +++ b/e2e/test_jobs_auth.py @@ -0,0 +1,229 @@ 
+"""E2E tests for jobs with auth enabled. + +Local E2E runs translate ``cpu/default`` container steps to the subprocess +backend, so these tests intentionally omit ``container.image`` and rely only on +the command shape that subprocess consumes. +""" + +import logging + +import pytest +from nemo_platform import NeMoPlatform +from nemo_platform_ext.auth.helpers import generate_unsigned_jwt +from nemo_platform_plugin.jobs.api_factory import ( + ContainerSpec, + CPUExecutionProviderSpec, + EnvironmentVariable, + PlatformJobSpec, + PlatformJobStep, +) +from nmp.common.entities import ALL_WORKSPACES +from nmp.core.jobs.controllers.diagnostics import collect_job_diagnostics +from nmp.testing import TEST_ADMIN_EMAIL, grant_workspace_role, short_unique_name, unique_email +from nmp.testing.e2e import wait_for_platform_job + +JOB_SOURCE = "e2e-auth-test" +logger = logging.getLogger(__name__) + +pytestmark = [ + pytest.mark.e2e_config("e2e/configs/local-subprocess.yaml", {"auth": {"enabled": True}}), +] + + +def _as_bearer_user( + sdk: NeMoPlatform, + email: str, + *, + groups: list[str] | None = None, +) -> NeMoPlatform: + token = generate_unsigned_jwt( + principal_id=email, + email=email, + groups=groups, + ) + return sdk.with_options(set_default_headers={"Authorization": f"Bearer {token}"}) + + +def _log_auth_job_diagnostics( + sdk: NeMoPlatform, + *, + workspace: str, + job_name: str, + step_name: str, + context: str, +) -> None: + logger.error( + "Auth job diagnostics", + extra={ + "diagnostic_context": context, + "workspace": workspace, + "job_name": job_name, + "step_name": step_name, + "job_diagnostics": collect_job_diagnostics( + sdk, + workspace=workspace, + job_name=job_name, + step_name=step_name, + context=context, + ), + }, + ) + + +def test_job_principal_propagation(sdk: NeMoPlatform): + admin_sdk = _as_bearer_user(sdk, TEST_ADMIN_EMAIL, groups=["admin"]) + user_email = unique_email("job-creator") + workspace_name = short_unique_name("job-auth-test") + + 
admin_sdk.workspaces.create(name=workspace_name) + grant_workspace_role(admin_sdk, workspace=workspace_name, principal=user_email, roles=["Editor"]) + + user_sdk = _as_bearer_user(sdk, user_email) + job = user_sdk.jobs.create( + workspace=workspace_name, + source=JOB_SOURCE, + spec={"test": "auth-propagation"}, + platform_spec=PlatformJobSpec( + steps=[ + PlatformJobStep( + name="auth-test-step", + executor=CPUExecutionProviderSpec( + provider="cpu", + container=ContainerSpec( + entrypoint=["nemo-platform"], + command=["run", "task", "--task", "nmp.hello_world.tasks.hello_world"], + ), + ), + environment=[EnvironmentVariable(name="BUSY_LOOP_DURATION_SECONDS", value="0")], + config={"message": "auth propagation test"}, + ) + ] + ), + ) + + completed_job = wait_for_platform_job(user_sdk, job.name, workspace_name) + assert completed_job.status == "completed" + + fileset_name = f"hello-world-{job.name}" + fileset = user_sdk.files.filesets.retrieve(workspace=workspace_name, name=fileset_name) + assert fileset is not None + + file_content = user_sdk.files.download_content( + remote_path="message.txt", + fileset=fileset_name, + workspace=workspace_name, + ) + assert file_content == b"auth propagation test" + + +def test_job_cannot_access_unauthorized_workspace(sdk: NeMoPlatform): + admin_sdk = _as_bearer_user(sdk, TEST_ADMIN_EMAIL, groups=["admin"]) + owner_email = unique_email("owner") + other_email = unique_email("other") + + restricted_workspace = short_unique_name("restricted") + runner_workspace = short_unique_name("runner") + + admin_sdk.workspaces.create(name=restricted_workspace) + admin_sdk.workspaces.create(name=runner_workspace) + grant_workspace_role(admin_sdk, workspace=restricted_workspace, principal=owner_email, roles=["Editor"]) + grant_workspace_role(admin_sdk, workspace=runner_workspace, principal=other_email, roles=["Editor"]) + + owner_sdk = _as_bearer_user(sdk, owner_email) + other_sdk = _as_bearer_user(sdk, other_email) + + fileset_name = 
"private-data" + owner_sdk.files.filesets.create(workspace=restricted_workspace, name=fileset_name) + + job = other_sdk.jobs.create( + workspace=runner_workspace, + source=JOB_SOURCE, + spec={"test": "auth-denial"}, + platform_spec=PlatformJobSpec( + steps=[ + PlatformJobStep( + name="access-test-step", + executor=CPUExecutionProviderSpec( + provider="cpu", + container=ContainerSpec( + entrypoint=["nemo-platform"], + command=["run", "task", "--task", "nmp.hello_world.tasks.access_fileset"], + ), + ), + config={ + "workspace": restricted_workspace, + "fileset": fileset_name, + }, + ) + ] + ), + ) + + completed_job = wait_for_platform_job(other_sdk, job.name, runner_workspace) + if completed_job.status != "error": + _log_auth_job_diagnostics( + other_sdk, + workspace=runner_workspace, + job_name=job.name, + step_name="access-test-step", + context="expected job to fail with unauthorized workspace access", + ) + assert completed_job.status == "error" + + tasks_response = other_sdk.jobs.tasks.list("access-test-step", job=job.name, workspace=runner_workspace) + if not tasks_response.data: + _log_auth_job_diagnostics( + other_sdk, + workspace=runner_workspace, + job_name=job.name, + step_name="access-test-step", + context="expected task list to include failed access task", + ) + assert tasks_response.data + task = tasks_response.data[0] + if not task.error_stack or "403" not in task.error_stack or "Forbidden" not in task.error_stack: + _log_auth_job_diagnostics( + other_sdk, + workspace=runner_workspace, + job_name=job.name, + step_name="access-test-step", + context="expected task error stack to include 403 forbidden details", + ) + assert task.error_stack + assert "403" in task.error_stack and "Forbidden" in task.error_stack + + +def test_job_admin_can_list_jobs_in_all_workspaces(sdk: NeMoPlatform): + admin_sdk = _as_bearer_user(sdk, TEST_ADMIN_EMAIL, groups=["admin"]) + user_email = unique_email("member") + workspace_name = short_unique_name("admin-list-jobs") + + 
admin_sdk.workspaces.create(name=workspace_name) + grant_workspace_role(admin_sdk, workspace=workspace_name, principal=user_email, roles=["Editor"]) + + user_sdk = _as_bearer_user(sdk, user_email) + job = user_sdk.jobs.create( + workspace=workspace_name, + source=JOB_SOURCE, + spec={"test": "admin-list"}, + platform_spec=PlatformJobSpec( + steps=[ + PlatformJobStep( + name="admin-list-step", + executor=CPUExecutionProviderSpec( + provider="cpu", + container=ContainerSpec( + command=["echo", "admin list jobs"], + ), + ), + ) + ] + ), + ) + + completed_job = wait_for_platform_job(user_sdk, job.name, workspace_name) + assert completed_job.status == "completed" + + jobs = admin_sdk.jobs.list(workspace=ALL_WORKSPACES) + assert jobs.pagination is not None + assert any(item.name == job.name for item in jobs.data) diff --git a/ghcr-mirroring-issue.md b/ghcr-mirroring-issue.md new file mode 100644 index 0000000000..4aaffe6215 --- /dev/null +++ b/ghcr-mirroring-issue.md @@ -0,0 +1,38 @@ +# Mirror all public NeMo Platform images from NVCR to GHCR + +We already publish NeMo Platform images to `nvcr.io`. We should mirror all current public images to `ghcr.io` so GitHub becomes the primary public distribution path for OSS consumers. + +## Context + +- NeMo Platform is already hosted on GitHub, so GHCR provides the most straightforward OSS distribution experience. +- GHCR meets our current requirements for public distribution, including image size and acceptable rate-limit characteristics. +- GHCR avoids the extra operational and publishing-process friction associated with NVIDIA public NVCR distribution. +- This work does not replace NVCR. It adds GHCR as a mirrored public distribution path. +- NVCR public may still be useful later, but we want to avoid taking on additional process overhead at the initial stage. 
+ +## Scope + +- Identify all currently public NeMo Platform images published to `nvcr.io` +- Mirror those images to `ghcr.io` +- Preserve tags and digests where possible +- Evaluate and choose an implementation approach using either `skopeo` or `regsync` +- Document the mirroring workflow and the source-of-truth image list +- Validate that mirrored images can be pulled anonymously from GHCR + +## Out of scope + +- Replacing NVCR as an existing publish target +- Reworking image contents or build pipelines beyond what is needed to support mirroring +- Adding NVIDIA public-registry publishing workflow changes + +## Acceptance criteria + +- All currently public NeMo Platform images available in `nvcr.io` are also available in `ghcr.io` +- Tags are mirrored correctly +- Anonymous pull from GHCR works for all mirrored images +- The mirroring approach is documented, including how new images/tags should be synchronized going forward +- A clear decision is recorded on whether `skopeo` or `regsync` is the long-term sync mechanism + +## Suggested priority + +High diff --git a/packages/nmp_testing/src/nmp/testing/__init__.py b/packages/nmp_testing/src/nmp/testing/__init__.py index 1caee32932..788d913542 100644 --- a/packages/nmp_testing/src/nmp/testing/__init__.py +++ b/packages/nmp_testing/src/nmp/testing/__init__.py @@ -9,7 +9,7 @@ - NemoRun / NmpRun: Type alias for a callable that runs the NeMo CLI - assert_exit_0: Assert that a CLI invocation succeeded - get_repo_root: Return the repository root using git -- run_nemo_local / run_nmp_local: Run NeMo CLI from repo root without cluster URL injection +- run_nemo_local / run_nmp_local: Run NeMo CLI from repo root with optional platform URL injection API testing: - create_test_client: Helper for creating FastAPI test clients with in-memory storage diff --git a/packages/nmp_testing/src/nmp/testing/utils.py b/packages/nmp_testing/src/nmp/testing/utils.py index bdb4fce3be..ad2749c247 100644 --- 
a/packages/nmp_testing/src/nmp/testing/utils.py +++ b/packages/nmp_testing/src/nmp/testing/utils.py @@ -47,20 +47,27 @@ def get_repo_root() -> Path: def run_nemo_local( *args: str, + base_url: str | None = None, + workspace: str | None = None, env_extra: dict[str, str] | None = None, timeout: int = 120, cwd: Path | None = None, ) -> subprocess.CompletedProcess[str]: - """Run the NeMo CLI (`nemo`) from repo root without cluster URL injection. + """Run the NeMo CLI (``nemo``) from repo root. - Used by tests that target local CLI behavior (config, quickstart) and - don't need the E2E cluster URL injected. + Used by tests that target local CLI behavior as well as E2E flows that + need to point the CLI at a specific platform instance. - Pass ``cwd`` to run from a different directory (e.g. a ``tmp_path`` with a - fake ``.git`` marker so ``skills install`` writes there instead of the real - repo root). + Pass ``base_url`` and ``workspace`` to inject ``NMP_BASE_URL`` and + ``NMP_WORKSPACE`` directly. Pass ``cwd`` to run from a different directory + (e.g. a ``tmp_path`` with a fake ``.git`` marker so ``skills install`` + writes there instead of the real repo root). """ env = os.environ.copy() + if base_url is not None: + env["NMP_BASE_URL"] = base_url.rstrip("/") + if workspace is not None: + env["NMP_WORKSPACE"] = workspace if env_extra: env.update(env_extra) cmd = ["uv", "run", "--project", str(get_repo_root()), "--frozen", "nemo", *args] diff --git a/packages/nmp_testing/tests/unit/test_e2e_harness.py b/packages/nmp_testing/tests/unit/test_e2e_harness.py new file mode 100644 index 0000000000..95e28ff52d --- /dev/null +++ b/packages/nmp_testing/tests/unit/test_e2e_harness.py @@ -0,0 +1,66 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +"""Unit tests for the local pytest E2E harness helpers.""" + +from pathlib import Path +from typing import Any, cast + +import e2e.services_pool as services_pool + + +def test_e2e_services_env_sets_isolated_data_dir(tmp_path, monkeypatch): + monkeypatch.setenv("NMP_DATA_DIR", "/shell/value/should/not/leak") + + config_path = tmp_path / "platform.yaml" + data_dir = tmp_path / "isolated-data" + + env = services_pool.e2e_services_env(config_path, data_dir) + + assert env["NMP_CONFIG_FILE_PATH"] == str(config_path) + assert env["NMP_DATA_DIR"] == str(data_dir) + assert env["NMP_SEED_ON_STARTUP"] == "true" + assert env["NMP_INFERENCE_GATEWAY_MOCK_PROVIDER_PREFIX"] == "igw-mock-" + assert env["NMP_CONFIG_WARNINGS_DISABLED"] == "1" + + +def test_e2e_services_data_dir_is_stable_per_hash(tmp_path): + log_dir = tmp_path / "logs" + + path = services_pool.e2e_services_data_dir(log_dir, "abc123def456") + + assert path == Path(log_dir / "data-abc123def456") + + +def test_with_e2e_instance_paths_namespaces_local_filesystem_paths(tmp_path): + data_dir = tmp_path / "data-abc123def456" + config_data: dict[str, Any] = { + "jobs": { + "executors": [ + { + "provider": "subprocess", + "profile": "default", + "backend": "subprocess", + "config": {"working_directory": ".tmp/e2e/subprocess-jobs"}, + } + ], + "executor_defaults": { + "subprocess": {"working_directory": ".tmp/e2e/subprocess-jobs"}, + }, + }, + "files": { + "default_storage_config": { + "type": "local", + "path": ".tmp/e2e/files", + } + }, + } + + rendered = services_pool.with_e2e_instance_paths(config_data, data_dir) + + assert rendered["jobs"]["executors"][0]["config"]["working_directory"] == str(data_dir / "subprocess-jobs") + assert rendered["jobs"]["executor_defaults"]["subprocess"]["working_directory"] == str(data_dir / "subprocess-jobs") + assert rendered["files"]["default_storage_config"]["path"] == str(data_dir / "files") + jobs_config = cast(dict[str, Any], 
config_data["jobs"]) + executors = cast(list[dict[str, Any]], jobs_config["executors"]) + assert executors[0]["config"]["working_directory"] == ".tmp/e2e/subprocess-jobs" diff --git a/plans/2026-06-03-helm-chart-kube-e2e-plan.md b/plans/2026-06-03-helm-chart-kube-e2e-plan.md new file mode 100644 index 0000000000..9b9b37e390 --- /dev/null +++ b/plans/2026-06-03-helm-chart-kube-e2e-plan.md @@ -0,0 +1,372 @@ +# Helm Chart And Kubernetes E2E Revival Plan + +**Goal:** Bring the archived NeMo Platform Helm chart from `/Users/rsadler/src/Platform-Deploy` into this repo, make it installable in a minimal local Kubernetes setup, and restore a small but real Kubernetes-backed E2E path so we can iterate from a working baseline. + +**Current State:** +- The archived deployment repo contains a full chart at `/Users/rsadler/src/Platform-Deploy/helm/platform` plus helper scripts and `e2e/k8s` values. +- This repo still advertises Kubernetes E2E entrypoints in [Makefile](/Users/rsadler/src/nemo-platform/Makefile:438), but the current harness in [e2e/conftest.py](/Users/rsadler/src/nemo-platform/e2e/conftest.py:98) only implements the subprocess backend and explicitly says Docker/Kubernetes selection is not built yet. +- There is no live Helm chart checked into this repo today under `deploy/helm` or `helm/`. +- The current image topology is repo-native and should remain authoritative: `nmp-api`, `nmp-core`, and `nmp-cpu-tasks` are built from [docker-bake.hcl](/Users/rsadler/src/nemo-platform/docker-bake.hcl:34). + +**Non-Goals For The First Pass:** +- Do not revive every archived deploy feature up front. +- Do not block the import on GPU, auth, ingress, observability, OpenShift, or cloud-specific policy support. +- Do not treat the old `Platform-Deploy` layout as authoritative where it conflicts with the current monorepo. 
+ +--- + +## File Structure + +**Primary files and directories to add or modify:** +- Add: `deploy/helm/platform/**` or `helm/platform/**` after choosing the long-term chart location +- Add: `e2e/k8s/values/minimal-kind.yaml` +- Add: `e2e/k8s/scripts/install_nmp_k8s_minimal.sh` +- Add: `e2e/k8s/scripts/build_and_load_images.sh` +- Add: `e2e/k8s/scripts/wait_for_api.sh` +- Modify: [Makefile](/Users/rsadler/src/nemo-platform/Makefile:438) +- Modify: [e2e/conftest.py](/Users/rsadler/src/nemo-platform/e2e/conftest.py:98) +- Modify: [TESTING.md](/Users/rsadler/src/nemo-platform/TESTING.md:114) +- Modify: `.github/workflows/ci.yaml` or the relevant kube E2E workflow only after the local path is proven + +**Supporting sources to port selectively from the archive:** +- `/Users/rsadler/src/Platform-Deploy/helm/platform/**` +- `/Users/rsadler/src/Platform-Deploy/e2e/k8s/scripts/install_nmp_e2e.sh` +- `/Users/rsadler/src/Platform-Deploy/e2e/k8s/scripts/wait_for_api.sh` +- `/Users/rsadler/src/Platform-Deploy/e2e/k8s/values/default.yaml` +- `/Users/rsadler/src/Platform-Deploy/e2e/k8s/values/minikube.yaml` + +--- + +## Task 1: Decide The Import Boundary And Target Layout + +- [ ] **Step 1: Pick the chart home in this repo** + +Choose one canonical destination before copying files: +- `deploy/helm/platform/` if we want a deploy-artifacts home that matches the older docs language +- `helm/platform/` if we want the shortest path from the archived repo and the existing helper scripts + +Recommendation: +- Prefer `deploy/helm/platform/` if the team wants clear separation between product source and deploy packaging. +- Prefer `helm/platform/` if the priority is fastest low-risk import with minimal path rewriting. + +Whichever path is chosen, update all future scripts and docs to use only that path. 
+ +- [ ] **Step 2: Inventory what is essential for a minimal chart install** + +Split the archive into: +- required now: chart templates, values, helper templates, chart README, dependency metadata +- defer: observability stack, CI-only values, OpenShift route tuning, NCCL test hooks, cloud-specific Kyverno examples, release publishing scripts + +The first import should preserve enough to install: +- API service +- core/controller service +- embedded Postgres +- shared storage PVC +- platform config map and seed job if still required for a healthy API + +- [ ] **Step 3: Reconcile archived names with the current repo** + +Before copying, identify mismatches in: +- image names +- chart value names +- service names +- config rendering expectations +- required secrets + +Known item to resolve early: +- the archived install scripts set both `api.image.repository` and `core.image.repository` to `.../nmp-api`, but the current repo also builds `nmp-core` in [docker-bake.hcl](/Users/rsadler/src/nemo-platform/docker-bake.hcl:45). Decide whether the chart should run a separate `nmp-core` image now or intentionally keep using `nmp-api` for both components in the minimal phase. + +- [ ] **Step 4: Commit the import decision document** + +Create a short design note in this plan or a sibling doc that records: +- chosen chart location +- import scope +- intentional deferrals +- image topology decision for minimal Kubernetes bring-up + +--- + +## Task 2: Port The Chart Into This Repo Without Broad Refactoring + +- [ ] **Step 1: Copy the chart skeleton and keep it mechanically close to the archive** + +Bring in: +- `Chart.yaml` +- `values.yaml` +- `templates/**` +- `files/**` +- `README.md` +- any helm-docs template if we intend to keep generated docs current + +Avoid mixing cleanup with the initial copy. The first commit should make the provenance obvious. 
+ +- [ ] **Step 2: Remove or disable obviously non-minimal features in values, not templates, where possible** + +The minimal import should default off for: +- `k8s-nim-operator` +- ingress +- auth +- ServiceMonitor / observability extras +- cloud-specific networking policies +- GPU-only hooks and chart tests + +Prefer values-based disablement first. Template deletion should happen only if a feature is clearly dead and blocking comprehension. + +- [ ] **Step 3: Validate the chart renders against minimal local values** + +Run: +```bash +helm dependency build +helm template nemo-platform -f e2e/k8s/values/minimal-kind.yaml +``` + +Expected: +- render succeeds +- no unresolved template functions +- only the minimal resources appear + +- [ ] **Step 4: Add a chart-focused smoke check** + +Add a repeatable render/lint target, for example: +- `make helm-lint` +- `make helm-template-minimal` + +The goal is to make chart iteration cheap before any cluster install. + +--- + +## Task 3: Create A Minimal Local Kubernetes Install Path + +- [ ] **Step 1: Standardize on one local cluster target** + +Use `kind` first unless there is a hard blocker in storage or ingress behavior. + +Reason: +- the repo already references a kind helper in older docs +- kind is easier to automate than minikube +- the first milestone is CPU-only smoke coverage, not GPU or ingress fidelity + +If storage semantics force minikube for the first pass, record that explicitly and keep kind as the follow-up target. 
+ +- [ ] **Step 2: Add a minimal values file just for local smoke** + +Create `e2e/k8s/values/minimal-kind.yaml` with only the overrides needed for a local cluster: +- disable `k8s-nim-operator` +- disable ingress +- disable auth +- use embedded Postgres +- use the cluster default storage class or a known kind-friendly class +- set `platformConfig.platform.runtime: kubernetes` if the chart does not already do that +- point platform image registry and tag overrides at locally built images + +Do not start from the archived `default.yaml` unchanged; it assumes NVIDIA internal registries and storage classes. + +- [ ] **Step 3: Add a build-and-load script for local images** + +Create a thin script that: +1. builds `nmp-api`, `nmp-core`, and `nmp-cpu-tasks` +2. tags them consistently for the cluster run +3. loads them into kind + +Keep the contract simple: +- `NMP_E2E_REGISTRY` +- `NMP_E2E_TAG` +- maybe `KIND_CLUSTER_NAME` + +The script should use the repo’s current bake targets rather than reproducing the archived repo’s image logic. + +- [ ] **Step 4: Add a minimal install script** + +Create `e2e/k8s/scripts/install_nmp_k8s_minimal.sh` that: +1. verifies `kubectl`, `helm`, and cluster access +2. runs `helm dependency build` +3. installs or upgrades the chart with the minimal values file +4. waits for readiness +5. prints targeted diagnostics on failure + +Keep it local-first: +- no NGC auth unless a remaining dependency truly requires it +- no cloud-provider assumptions +- no internal registry defaults + +- [ ] **Step 5: Verify the API really comes up** + +Add a readiness check that proves more than pod existence: +- `kubectl wait` for deployments/statefulsets +- then poll `/health/ready` or `/cluster-info` through port-forward or a local service URL + +This should become the contract the E2E harness relies on. 
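The readiness contract described in Step 5 could look roughly like the sketch below. It is an illustration using only the Python standard library, not the harness's actual implementation: `wait_for_ready`, `http_probe`, and their parameters are hypothetical names, and the `/health/ready` path is taken from the step above on the assumption that the installed chart exposes it at the port-forwarded or service URL.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if a GET to ``url`` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and non-2xx responses all mean "not ready yet".
        return False


def wait_for_ready(
    base_url: str,
    path: str = "/health/ready",  # assumed endpoint, named in the plan
    timeout: float = 120.0,
    interval: float = 2.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll ``base_url + path`` until the probe succeeds or ``timeout`` elapses."""
    url = base_url.rstrip("/") + path
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        sleep(interval)
    return False
```

Keeping `probe` and `sleep` injectable means the polling logic can be unit-tested without a cluster, while the install script or harness would call it with the defaults once `kubectl wait` has reported the workloads as available.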
+ +--- + +## Task 4: Reintroduce A Kubernetes Backend To The E2E Harness + +- [ ] **Step 1: Replace the placeholder backend comment with real backend selection** + +Extend [e2e/conftest.py](/Users/rsadler/src/nemo-platform/e2e/conftest.py:98) so the session fixture can choose among: +- subprocess +- docker +- kubernetes + +The returned interface should stay the same: +- a base URL for the SDK + +- [ ] **Step 2: Implement the smallest useful Kubernetes mode** + +The first Kubernetes mode does not need full lifecycle automation inside pytest. + +A practical first cut: +- require `NMP_BASE_URL` or `NMP_E2E_CLUSTER_URL` +- assume the chart is already installed by the helper script +- connect the SDK to that external URL + +This gets kube E2E running again without hiding cluster setup inside pytest. + +- [ ] **Step 3: Make the CLI flags and docs match reality** + +Today `Makefile` calls `pytest e2e --kubernetes`, but [e2e/conftest.py](/Users/rsadler/src/nemo-platform/e2e/conftest.py:114) does not register that option. + +Add: +- `pytest_addoption` support for `--kubernetes` +- optional `--cluster-url` +- clear skip or error messages when required env vars are missing + +- [ ] **Step 4: Keep the first kube test set intentionally small** + +Do not aim for the whole suite immediately. + +Start by running only: +- [e2e/test_smoke.py](/Users/rsadler/src/nemo-platform/e2e/test_smoke.py:1) +- one low-risk jobs test if job execution works in the minimal cluster + +If jobs are not ready on the first pass, restore kube smoke coverage first and add jobs in the next milestone. + +--- + +## Task 5: Make Kubernetes E2E Runnable Via Repo Commands + +- [ ] **Step 1: Fix `Makefile` targets so they map to implemented behavior** + +Bring `test-e2e-kubernetes` into alignment with the real harness. + +For the first working version, the flow should be explicit: +1. build and load images +2. install chart +3. 
run selected tests against `NMP_E2E_CLUSTER_URL` + +If needed, add separate helpers instead of pretending `pytest --kubernetes` does everything by itself. + +- [ ] **Step 2: Add a narrow make target for the first milestone** + +Add one minimal target, for example: +```bash +make test-e2e-kubernetes-smoke +``` + +It should run only the subset we know how to support reliably. + +Defer `auth`, `gpu`, `kai-scheduler`, and `customizer` variants until the base path is real again. + +- [ ] **Step 3: Update `TESTING.md`** + +Document the exact local Kubernetes flow, including: +- prerequisites +- cluster choice +- image build and load step +- Helm install step +- smoke test command +- known unsupported variants + +This is important because [TESTING.md](/Users/rsadler/src/nemo-platform/TESTING.md:114) currently describes root-level E2E as subprocess-based and does not explain the Kubernetes mode at all. + +--- + +## Task 6: Expand From Smoke To Minimal Jobs Coverage + +- [ ] **Step 1: Prove one real SDK workflow on Kubernetes** + +After smoke passes, choose one representative operation: +- create a workspace +- run a trivial CPU job using `nmp-cpu-tasks` +- fetch logs or completion state + +This is the first meaningful kube E2E milestone because it validates: +- API reachability +- controller wiring +- image resolution +- shared config for launched tasks + +- [ ] **Step 2: Add or adapt one kube-safe jobs test** + +Prefer a very small test rather than turning on all of [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:1). + +If the existing suite assumes subprocess or Docker specifics, add a separate minimal kube smoke test instead of forcing conditionals through every test immediately. 
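
One way to keep a separate kube smoke test independent of the subprocess and Docker fixtures is to have it resolve the externally installed cluster URL from the environment and skip when none is set. The sketch below is an assumption-laden illustration: `resolve_cluster_url` is a hypothetical helper, and giving `NMP_E2E_CLUSTER_URL` precedence over `NMP_BASE_URL` is a guess, since the plan names both variables without ordering them.

```python
import os
from typing import Mapping, Optional


def resolve_cluster_url(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty cluster URL from the environment, or None.

    Assumed precedence: NMP_E2E_CLUSTER_URL first, then NMP_BASE_URL.
    """
    for name in ("NMP_E2E_CLUSTER_URL", "NMP_BASE_URL"):
        value = env.get(name, "").strip()
        if value:
            # Normalize so callers can append paths like "/health/ready".
            return value.rstrip("/")
    return None
```

A kube smoke test could call this in a fixture and skip with a clear message when it returns `None`, which matches the "clear skip or error messages" requirement in Task 4.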
+ +- [ ] **Step 3: Capture the next blockers explicitly** + +Once one jobs path works, classify remaining failures into: +- chart gaps +- runtime config gaps +- storage or PVC semantics +- image distribution issues +- auth or ingress dependencies + +That list should drive the next iteration rather than broad speculative porting. + +--- + +## Task 7: Only Then Wire It Back Into CI + +- [ ] **Step 1: Keep CI out of the critical path until the local flow is stable** + +Do not add CI before a contributor can run the local smoke path twice in a row successfully. + +- [ ] **Step 2: Add a single CPU-only Kubernetes smoke job** + +Once local is stable, add one CI job that: +- provisions the cluster +- builds or pulls the required images +- installs the chart +- runs the Kubernetes smoke subset +- uploads pod logs and Helm values on failure + +- [ ] **Step 3: Gate broader kube suites behind follow-up work** + +Keep these out of the first CI restoration: +- auth +- gpu +- kai-scheduler +- customizer +- cloud storage scenarios + +--- + +## Recommended Execution Order + +1. Choose chart location and import boundary. +2. Copy the chart with minimal changes. +3. Create `minimal-kind.yaml` and get `helm template` green. +4. Build and load local images. +5. Install the chart into a local cluster and verify `/health/ready`. +6. Implement `pytest --kubernetes` as an external-base-URL backend. +7. Restore one smoke target in `Makefile`. +8. Add one jobs-based kube E2E only after smoke is stable. +9. Reintroduce CI coverage last. + +--- + +## Exit Criteria For The First Milestone + +- [ ] The chart lives in this repo in one canonical location. +- [ ] `helm template` succeeds with a repo-owned minimal local values file. +- [ ] A local kind or minikube cluster can install the chart from this repo using repo-owned scripts. +- [ ] The platform API becomes healthy after install. +- [ ] `make test-e2e-kubernetes-smoke` passes against that cluster. 
+- [ ] At least one Kubernetes-backed E2E test is running again from this repo. + +## Follow-Up Milestones + +- [ ] Add a minimal jobs-on-kubernetes E2E. +- [ ] Re-enable broader Kubernetes variants in `Makefile`. +- [ ] Add CI smoke coverage. +- [ ] Evaluate which archived chart features should be deleted instead of maintained. diff --git a/plans/2026-06-03-jobs-pause-resume-plan.md b/plans/2026-06-03-jobs-pause-resume-plan.md new file mode 100644 index 0000000000..8849f5da69 --- /dev/null +++ b/plans/2026-06-03-jobs-pause-resume-plan.md @@ -0,0 +1,209 @@ +# Jobs Pause Resume Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make Docker-backed platform jobs pause and resume correctly so [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:264) reaches `paused` and then returns to `active` or `completed`. + +**Architecture:** The API and dispatcher already model `pausing`, `paused`, and `resuming`, and there is unit/API coverage for the abstract lifecycle. The failing path is specific to the live Docker backend, so the plan is to reproduce the runtime error, add a Docker backend regression around stop/resume behavior, and then make the minimum state-machine or container-handling fix needed. 
+ +**Tech Stack:** `pytest`, NeMo SDK, Docker jobs backend, Jobs dispatcher, quickstart `nmp-api`, Python + +--- + +## File Structure + +**Files to inspect or modify:** +- Modify: [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:264) +- Modify: [services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py:960) +- Modify: [services/core/jobs/src/nmp/core/jobs/app/dispatcher.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/app/dispatcher.py:1105) +- Test: [services/core/jobs/tests/api/test_pause_resume.py](/Users/rsadler/src/nemo-platform/services/core/jobs/tests/api/test_pause_resume.py:1) +- Test: [services/core/jobs/tests/controllers/test_docker_backend.py](/Users/rsadler/src/nemo-platform/services/core/jobs/tests/controllers/test_docker_backend.py:1318) +- Test: `services/core/jobs/tests/integration/test_jobs_pause_resume_docker.py` + +**Responsibilities:** +- `e2e/test_jobs.py`: external regression against quickstart. +- `dispatcher.py`: high-level job-step transition rules for pause/resume. +- `docker.py`: concrete pause/stop/container-state mapping behavior. +- `test_pause_resume.py`: API-level lifecycle expectations. +- `test_docker_backend.py`: backend-specific stop, paused, and resumed behavior. +- `test_jobs_pause_resume_docker.py`: narrower runtime regression for Docker-backed execution. + +### Task 1: Capture the Exact Pause/Resume Runtime Failure + +**Files:** +- Modify: [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:264) + +- [ ] **Step 1: Improve the E2E failure output** + +Update `test_job_pause_resume` so failures include: +- job `status_details` +- step `status_details` +- task `error_stack` +- current logs for the step task + +Keep the assertions the same; only improve diagnostics. 
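
The diagnostics described in Step 1 could be assembled by a small formatter like the one below. This is a sketch under stated assumptions: `format_job_diagnostics` is a hypothetical helper, job, step, and task records are treated as plain dicts, and the `name`, `id`, and `status` keys are illustrative. Only `status_details` and `error_stack` come from the plan, and the real SDK objects may expose them differently.

```python
from typing import Any, Iterable, Mapping


def format_job_diagnostics(
    job: Mapping[str, Any],
    steps: Iterable[Mapping[str, Any]] = (),
    tasks: Iterable[Mapping[str, Any]] = (),
    logs: str = "",
) -> str:
    """Build a multi-line failure message from dict-like job state."""
    lines = [
        f"job status={job.get('status')!r} "
        f"status_details={job.get('status_details')!r}"
    ]
    for step in steps:
        lines.append(
            f"step {step.get('name')!r} status={step.get('status')!r} "
            f"status_details={step.get('status_details')!r}"
        )
    for task in tasks:
        # Print the stack on its own lines so tracebacks stay readable.
        lines.append(f"task {task.get('id')!r} error_stack:")
        lines.append(str(task.get("error_stack") or "<none>"))
    if logs:
        lines.append("logs:")
        lines.append(logs)
    return "\n".join(lines)
```

The existing assertion can then stay unchanged and only gain a message, for example `assert status == "paused", format_job_diagnostics(job, steps, tasks, logs)`.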
+ +- [ ] **Step 2: Run the single failing test** + +Run: +```bash +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py::test_job_pause_resume -vv --run-e2e -s +``` + +Expected: +- FAIL +- output shows whether pause produced: + - non-zero container exit + - container missing during sync + - dispatcher never seeing `paused` + - resume returning to an invalid container state + +- [ ] **Step 3: Note the concrete failure mode in the test** + +Add a short inline comment documenting the current runtime symptom so future readers know what regression this test protects against. + +- [ ] **Step 4: Commit diagnostics-only changes** + +```bash +git add e2e/test_jobs.py +git commit -s -m "test: improve jobs pause resume diagnostics" +``` + +### Task 2: Add a Focused Docker Regression Test + +**Files:** +- Create: `services/core/jobs/tests/integration/test_jobs_pause_resume_docker.py` +- Modify: [services/core/jobs/tests/controllers/test_docker_backend.py](/Users/rsadler/src/nemo-platform/services/core/jobs/tests/controllers/test_docker_backend.py:1318) + +- [ ] **Step 1: Write a failing integration test for a real Docker-backed pause** + +Create a jobs integration test that: +1. creates a long-running Docker-backed CPU job +2. waits for `active` +3. calls `pause` +4. waits for `paused` +5. calls `resume` +6. waits for `active` or `completed` + +Use a long-running command that is pause-safe and deterministic. Avoid using a fast task that can complete before the state transition is observed. 
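
The wait steps in the integration test above could share one polling helper that fails fast when the job lands in `error` instead of the expected state. The following is a minimal sketch, not the project's helper: `wait_for_status` and its parameters are hypothetical, and `get_status` stands in for whatever SDK call returns the current job status string.

```python
import time
from typing import Callable, Collection, List


def wait_for_status(
    get_status: Callable[[], str],
    wanted: Collection[str],
    fail_fast: Collection[str] = ("error",),
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is in `wanted`; raise AssertionError otherwise."""
    history: List[str] = []
    give_up_at = clock() + timeout_s
    while True:
        status = get_status()
        # Record only transitions so the failure message stays short.
        if not history or history[-1] != status:
            history.append(status)
        if status in wanted:
            return status
        if status in fail_fast:
            raise AssertionError(
                f"reached {status!r} while waiting for {sorted(wanted)}; "
                f"history={history}"
            )
        if clock() >= give_up_at:
            raise AssertionError(
                f"timed out waiting for {sorted(wanted)}; history={history}"
            )
        sleep(interval_s)
```

With such a helper the test body reads as the numbered list does: wait for `{"active"}`, pause, wait for `{"paused"}`, resume, then wait for `{"active", "completed"}`. The recorded history also shows whether `pausing` was ever observed.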
+ +- [ ] **Step 2: Run the integration test to verify it fails** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/integration/test_jobs_pause_resume_docker.py -vv +``` + +Expected: +- FAIL with the same state transition error as the E2E test + +- [ ] **Step 3: Add one Docker backend unit test for the failing edge** + +Extend [test_docker_backend.py](/Users/rsadler/src/nemo-platform/services/core/jobs/tests/controllers/test_docker_backend.py:1318) to cover the concrete failure from Task 1, for example: +- `container.stop()` during pausing yields `PAUSED` rather than `ERROR` +- a stopped container with pause intent maps to `PAUSED` +- resumed scheduling creates or reuses the right container state + +- [ ] **Step 4: Run the focused backend/API tests** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/controllers/test_docker_backend.py -k paused -vv +uv run --frozen pytest services/core/jobs/tests/api/test_pause_resume.py -vv +``` + +Expected: +- the new unit test fails first +- API tests continue to pass unless the bug is higher up in dispatcher logic + +- [ ] **Step 5: Commit the failing regression coverage** + +```bash +git add services/core/jobs/tests/integration/test_jobs_pause_resume_docker.py services/core/jobs/tests/controllers/test_docker_backend.py +git commit -s -m "test: add docker pause resume regression coverage" +``` + +### Task 3: Fix Docker Pause/Resume State Handling + +**Files:** +- Modify: [services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py:960) +- Modify: [services/core/jobs/src/nmp/core/jobs/app/dispatcher.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/app/dispatcher.py:1105) if needed + +- [ ] **Step 1: Verify the intended Docker behavior** + +Inspect: +- `sync_stop_container()` +- `map_docker_container_status_to_platform_status()` +- dispatcher `pause_job()` and 
`resume_job()` + +Confirm whether Docker pause is implemented as: +- graceful container stop plus `PAUSED` state +- later resume by re-entering scheduling with `RESUMING` + +Do not change API semantics unless the current Docker implementation truly contradicts the existing API tests. + +- [ ] **Step 2: Implement the smallest correct fix** + +Likely implementation areas: +- preserve pause intent across the stop/exited transition +- avoid mapping a paused container stop to generic `ERROR` +- ensure `resume_job()` can find a paused step and move it back into schedulable state +- ensure the backend handles “container already gone because pause succeeded” as `PAUSED`, not `ERROR` + +- [ ] **Step 3: Re-run focused tests** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/controllers/test_docker_backend.py -k paused -vv +uv run --frozen pytest services/core/jobs/tests/api/test_pause_resume.py -vv +uv run --frozen pytest services/core/jobs/tests/integration/test_jobs_pause_resume_docker.py -vv +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py::test_job_pause_resume -vv --run-e2e -s +``` + +Expected: +- all four pass + +- [ ] **Step 4: Check cancel-vs-pause regressions** + +Run: +```bash +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py::test_job_cancel_immediately e2e/test_jobs.py::test_job_cancel_once_active -vv --run-e2e -s +``` + +Expected: +- PASS +- no regression in cancellation behavior while fixing pause + +- [ ] **Step 5: Commit the fix** + +```bash +git add services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py services/core/jobs/src/nmp/core/jobs/app/dispatcher.py services/core/jobs/tests/controllers/test_docker_backend.py services/core/jobs/tests/integration/test_jobs_pause_resume_docker.py e2e/test_jobs.py +git commit -s -m "fix: support docker job pause and resume" +``` + +### Task 4: Final Validation + +**Files:** +- Modify: 
[e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:264) only if temporary diagnostics need cleanup + +- [ ] **Step 1: Re-run the full non-auth jobs E2E suite** + +Run: +```bash +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py -v --run-e2e +``` + +Expected: +- pause/resume passes +- cancel tests still pass + +- [ ] **Step 2: Remove or trim any temporary debug-only assertions** + +Retain useful failure context, but remove any excessive noise added solely for diagnosis. + +- [ ] **Step 3: Commit cleanup** + +```bash +git add e2e/test_jobs.py +git commit -s -m "test: clean up jobs pause resume e2e assertions" +``` diff --git a/plans/2026-06-03-jobs-persistent-storage-plan.md b/plans/2026-06-03-jobs-persistent-storage-plan.md new file mode 100644 index 0000000000..475299ebf4 --- /dev/null +++ b/plans/2026-06-03-jobs-persistent-storage-plan.md @@ -0,0 +1,202 @@ +# Jobs Persistent Storage Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make Docker-backed platform jobs preserve and share persistent storage correctly across sequential job steps so [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:121) passes reliably against quickstart. + +**Architecture:** The failing path spans job-spec validation, Docker volume/init-container setup, and runtime task behavior. The safest fix is to first capture the exact task failure and then add one focused integration or controller-level regression test around the shared persistent-storage mount before changing Docker backend behavior. 
+ +**Tech Stack:** `pytest`, NeMo SDK, Docker jobs backend, Jobs dispatcher/controller, quickstart `nmp-api`, Python + +--- + +## File Structure + +**Files to inspect or modify:** +- Modify: [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:121) +- Modify: [services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py:426) +- Modify: [services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py:75) +- Modify: [services/core/jobs/src/nmp/core/jobs/app/schemas.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/app/schemas.py:79) +- Test: [services/core/jobs/tests/controllers/test_docker_backend.py](/Users/rsadler/src/nemo-platform/services/core/jobs/tests/controllers/test_docker_backend.py:1211) +- Test: `services/core/jobs/tests/integration/test_jobs_persistent_storage.py` + +**Responsibilities:** +- `e2e/test_jobs.py`: external regression proving end-to-end behavior across real Docker quickstart. +- `docker.py`: volume creation, mount wiring, job init container, cleanup, and runtime state transitions for Docker jobs. +- `endpoints.py` and `schemas.py`: job-spec validation and feature gating for persistent storage. +- `test_docker_backend.py`: backend unit coverage for mount/label/config behavior. +- `test_jobs_persistent_storage.py`: a narrower live-service regression that proves the platform contract independently of the broad E2E suite. 
+ +### Task 1: Capture the Actual Persistent-Storage Failure + +**Files:** +- Modify: [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:121) + +- [ ] **Step 1: Tighten the failing E2E assertion to surface task error details** + +Update `test_job_passing_data_between_steps` so that when the job status is not `completed`, the assertion includes: +- job `status_details` +- job `error_details` +- task `error_stack` +- job logs + +The change should follow the same pattern already used for job diagnostics in the old `Platform-Deploy` suite: fail with actionable details, not just `status == error`. + +- [ ] **Step 2: Run the single failing test and capture the concrete backend error** + +Run: +```bash +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py::test_job_passing_data_between_steps -vv --run-e2e -s +``` + +Expected: +- test fails +- output contains the precise runtime error from the second step or from job init/mount setup + +- [ ] **Step 3: Record the failure mode in the test comment** + +Add a short comment in `test_job_passing_data_between_steps` explaining the current failure shape, for example: +- missing file in mounted persistent path +- mount path not shared across steps +- init container path mismatch + +- [ ] **Step 4: Commit the diagnostics-only change** + +```bash +git add e2e/test_jobs.py +git commit -s -m "test: improve jobs persistent storage diagnostics" +``` + +### Task 2: Add a Narrow Regression Test Below E2E + +**Files:** +- Create: `services/core/jobs/tests/integration/test_jobs_persistent_storage.py` +- Test: `services/core/jobs/tests/integration/test_jobs_persistent_storage.py` + +- [ ] **Step 1: Write the failing integration test** + +Add a live service integration test that: +1. creates a two-step platform job +2. writes `data.txt` in step 1 using `NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH` +3. reads the same file in step 2 +4. 
asserts final job status is `completed` + +Use the same job shape as [e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:121), but keep the test local to the jobs service so failure analysis is faster than full quickstart E2E. + +- [ ] **Step 2: Run the new integration test to verify it fails** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/integration/test_jobs_persistent_storage.py -vv +``` + +Expected: +- FAIL with the same storage-sharing symptom seen in E2E + +- [ ] **Step 3: Add one Docker backend unit test for mount intent** + +In [test_docker_backend.py](/Users/rsadler/src/nemo-platform/services/core/jobs/tests/controllers/test_docker_backend.py:1211), add a failing unit test that asserts: +- both steps targeting the same job get the same shared job volume name +- the persistent mount target uses the explicit `NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH` +- the mount includes the expected subpath `jobs//` + +- [ ] **Step 4: Run the focused backend unit tests** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/controllers/test_docker_backend.py -k persistent_storage -vv +``` + +Expected: +- new unit test fails before implementation changes + +- [ ] **Step 5: Commit the failing test additions** + +```bash +git add services/core/jobs/tests/integration/test_jobs_persistent_storage.py services/core/jobs/tests/controllers/test_docker_backend.py +git commit -s -m "test: add jobs persistent storage regression coverage" +``` + +### Task 3: Fix Docker Persistent-Storage Wiring + +**Files:** +- Modify: [services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py:426) +- Modify: [services/core/jobs/src/nmp/core/jobs/app/schemas.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/app/schemas.py:79) + +- [ ] **Step 1: Verify whether the persistent storage contract is mount-path based or env-var 
based** + +Inspect: +- `schedule_single_container()` +- `ensure_job_storage()` +- `get_mounts()` + +Confirm that the init container prepares `/job-vol/jobs//` while the runtime mount targets the user-requested path (for example `/mnt/persistent_storage`) with Docker `Subpath`. + +If that contract is already correct, do not redesign it. Limit the fix to the broken edge. + +- [ ] **Step 2: Implement the minimal backend change** + +Possible implementation sites, depending on the failure captured in Task 1: +- normalize the persistent mount target before volume creation +- ensure the shared volume subpath is created before the second step runs +- correct `VolumeOptions["Subpath"]` usage +- preserve mount/env consistency when `NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH` is overridden + +Do not broaden scope into Kubernetes or subprocess backends. + +- [ ] **Step 3: Re-run the focused tests** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/controllers/test_docker_backend.py -k persistent_storage -vv +uv run --frozen pytest services/core/jobs/tests/integration/test_jobs_persistent_storage.py -vv +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py::test_job_passing_data_between_steps -vv --run-e2e -s +``` + +Expected: +- all three pass + +- [ ] **Step 4: Check for cleanup regressions** + +Run: +```bash +uv run --frozen pytest services/core/jobs/tests/controllers/test_docker_backend.py -k cleanup -vv +``` + +Expected: +- PASS +- no new failures in persistent-storage cleanup logic + +- [ ] **Step 5: Commit the fix** + +```bash +git add services/core/jobs/src/nmp/core/jobs/controllers/backends/docker.py services/core/jobs/src/nmp/core/jobs/app/schemas.py services/core/jobs/tests/controllers/test_docker_backend.py services/core/jobs/tests/integration/test_jobs_persistent_storage.py e2e/test_jobs.py +git commit -s -m "fix: share persistent storage across docker job steps" +``` + +### Task 4: Final Validation + +**Files:** +- Modify: 
[e2e/test_jobs.py](/Users/rsadler/src/nemo-platform/e2e/test_jobs.py:121) if diagnostics added in Task 1 can now be simplified + +- [ ] **Step 1: Re-run the full non-auth jobs E2E suite** + +Run: +```bash +env NMP_BASE_URL=http://localhost:8080 uv run --frozen pytest e2e/test_jobs.py -v --run-e2e +``` + +Expected: +- persistent-storage test passes +- no regressions in the previously passing jobs tests + +- [ ] **Step 2: Simplify temporary diagnostics if they are no longer needed** + +If Task 1 added very noisy debug-only assertions or comments, keep the useful failure context but remove excess noise. + +- [ ] **Step 3: Commit cleanup** + +```bash +git add e2e/test_jobs.py +git commit -s -m "test: clean up jobs persistent storage e2e assertions" +``` diff --git a/process.md b/process.md new file mode 100644 index 0000000000..fa5999ddb3 --- /dev/null +++ b/process.md @@ -0,0 +1,25 @@ +I know you took the action item to draft a release process, and I also know your plate is pretty full right now. + +I’ve been thinking about our delivery headaches and jotted down a lightweight framework that I think could help us get to predictable, quality releases every two weeks. I’ve used a similar setup in the past to help teams get out of this kind of delivery rut, and it worked well. + +The main idea is that shipping every two weeks does not mean every piece of work fits neatly into a two-week window. It means work needs to be phased, visible, and managed in a way that lets us reliably ship quality on that cadence, even when larger efforts span multiple cycles. + +None of this is especially novel, and a lot of it is probably obvious at a high level. We are already doing most of this in some form today. I’m mostly spelling it out so the expectations are explicit, everyone is operating from the same assumptions, and we add a bit more structure and rigor to make our delivery commitments more reliable. 
+ +This is roughly what I had in mind: + +- Filter on deliverability: use something like RICE scoring for incoming work, with heavy emphasis on confidence. If a feature lacks enough product or technical clarity to estimate or execute confidently, it should not enter the delivery track yet. Instead, we create explicit follow-up work to clarify it first. + +- Timeline and milestones: use Linear’s roadmap timeline as the shared source of truth for release-bound work. If it is not on the timeline, there should be no delivery expectation around it. Larger efforts can span multiple weeks, but they should be broken into clear milestone phases like Discovery/Spike, Implementation, Docs or Migration, and Validation. + +- Cycle planning: use cycles and t-shirt estimates to make developer load and progress visible. That gives us a better way to spread work appropriately and avoid overloading individual engineers. + +- Developer-owned visibility: for deadline-bearing work, I think developers should be responsible for keeping their cycles, milestones, tickets, and upcoming availability up to date. That includes accounting for PTO and other capacity constraints. The goal is not extra bureaucracy, it is making coordination and planning possible. + +- Process support: Veena, in her TPM role, could help keep the process honest by tracking slippage, surfacing communication gaps, and helping make sure updates happen early enough to adjust. Standups and weekly scrums can provide a regular cadence for surfacing issues early and keeping plans current. More broadly, I think the sprint cadence itself can and should be run primarily by developers and the TPM, with managers joining mainly to stay informed, ask questions, and provide input as needed rather than driving the process directly. + +- Shift quality left: redefine “Dev Complete” so work does not move into validation until automated tests are verified locally and in k8s. 
That should help reduce the pattern of throwing partially validated work over the wall to QA. + +I do not think this really slows us down so much as spreads the work out more appropriately over time. Throughput may not change much, but it should help us build trust in our estimates and in our ability to ship quality on a predictable cadence. + +If this is helpful, I’d be happy to take a first pass at turning it into a short proposal. If you already have a different direction in mind, no worries at all. diff --git a/public-images.md b/public-images.md new file mode 100644 index 0000000000..ff7dd4989d --- /dev/null +++ b/public-images.md @@ -0,0 +1,90 @@ +## Problem Statement + +NeMo Platform needs a distribution strategy for large public OSS container images. The core problem is that the preferred distribution path should allow community users to pull images with minimal friction, while still supporting image sizes that are common for AI workloads. + +This document compares candidate registries and distribution models against the constraints that matter for public OSS distribution: + +- Support for large images and layers +- Anonymous or low-friction public pulls +- Public pull rate limits and throttling risk +- Operational overhead for publishing and maintenance +- Fit for NVIDIA-owned versus community-oriented distribution + +One important clarification up front: **GitHub Container Registry (`ghcr.io`) does not impose a 10 GB total image size limit for public OSS images.** The practical constraint is a **10 GB per-layer limit**, plus upload timeout behavior. That means GHCR remains viable if the image can be structured into multiple smaller layers. + +The sections below outline the main options and the tradeoffs each one introduces. + +--- + +## 1. 
Optimize GitHub Container Registry (GHCR) + +If your total image is over 10 GB, but no *single* layer is over 10 GB, **GHCR is free and unlimited for public open-source projects.** The trick to working around the 10 GB layer cap and the 10-minute timeout is to modify your `Dockerfile` structure to force multi-layer chunking: + +- **Avoid single-layer giants:** Don't do `RUN wget && pip install && apt install ` in one command. +- **Break up heavy copy/run operations:** Group your dependencies, weights, or datasets into separate `RUN` or `COPY` steps so Docker naturally creates smaller individual layers. +- **Rate-limit profile:** GitHub's public documentation for `ghcr.io` does not prominently document a Docker Hub-style anonymous pull cap for public container pulls. The practical constraints called out in the reviewed docs are cost/bandwidth policy for public packages and the 10 GB per-layer limit, rather than a published public pull quota. + +--- + +## 2. Docker Hub (Public / Open Source Program) + +Docker Hub is still the default for most developers. It allows **anonymous, keyless public pulls**, meaning your users just type `docker pull your-org/image` and it works out of the box. + +- **Size Limits:** There is no hard cap on overall image size, though a single layer is typically capped around 10 GB (similar to GitHub). +- **Rate-limit profile:** Docker Hub is the clearest case where public pull rate limits matter. Anonymous users are rate-limited, which can become a real problem for shared NATs, CI fleets, classrooms, and enterprise users pulling from the same egress IP range. The **Docker Open Source Program** materially improves this story by reducing that friction for OSS consumers. + +--- + +## 3. Hugging Face Spaces / Registry + +Hugging Face has become the definitive home for open-source AI, and they fully support custom Docker containers. + +- **How it works:** Instead of standard OCI registries, you can use Hugging Face **Docker Spaces**. 
You provide a `Dockerfile`, and Hugging Face handles the building and hosting. +- **The AI Advantage:** If your image is massive because it contains model weights, Hugging Face lets you easily decouple the container logic from the data. You can keep the Docker image small and use the `huggingface_hub` cache to pull weights from an HF Model Repo seamlessly at runtime. This removes the pull-key requirement for end users. +- **Rate-limit profile:** Hugging Face does enforce documented Hub rate limits, including anonymous-user limits, but it also distinguishes between request classes and gives much higher limits to optimized file-resolution traffic than to general API usage. For model and artifact delivery this is generally more AI-friendly than Docker Hub's anonymous pull throttling, but it is still an explicit quota system. + +--- + +## 4. Quay.io (By Red Hat) + +Quay is highly resilient, supports incredibly large images, and is a popular alternative to Docker Hub for large enterprise open-source projects (like many CNCF projects). + +- **Pros:** Public repositories are completely free, unmetered, and allow anonymous public pulling without any keys. +- **Cons:** The UI feels a bit dated compared to GitHub or Hugging Face, but its backend handles massive OCI images flawlessly. +- **Rate-limit profile:** Quay is attractive partly because it is commonly used as a public OSS registry without the same well-known anonymous pull caps that shape Docker Hub decisions. For this document, the key point is that Quay is generally positioned as the lower-friction option when rate-limit sensitivity is a concern. + +--- + +## 5. NVCR / NGC + +NVCR is the most natural NVIDIA-native option, but it really splits into two different distribution models: + +- **Private NVCR registry:** This works well if you are distributing images to known internal or partner users, but it requires consumers to authenticate with an NGC API key. 
That makes it a poor fit for frictionless public OSS distribution, because every user has to clear the NGC account + key setup hurdle before they can even pull the image. +- **NVIDIA public registry path:** NVIDIA can publish public images without requiring end users to bring a key, but getting there means going through NVIDIA's public publishing process. In practice, that process is much more extensive than pushing to GHCR, Docker Hub, or Quay, so it adds significant operational overhead for a community-facing OSS image. +- **Rate-limit profile:** Rate limiting is less central than access model here. The private path is already gated behind NGC authentication, while the public path is governed more by NVIDIA's publishing workflow and policy overhead than by a community-friendly self-serve pull model. + +--- + +## Summary: Which should you choose? + + +| Registry | Max Image Size | Needs Pull Key? | Public Pull Rate Limits | Best For... | +| ---------------------------- | -------------------------------- | -------------------------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------- | +| **GHCR (GitHub)** | Unlimited (Max 10GB *per layer*) | **No** (for public images) | No prominently documented anonymous pull cap reviewed | Keeping code and containers in the exact same GitHub Org. | +| **Docker Hub** | Unlimited | **No** | Significant for anonymous users | Maximum community discoverability, if you can tolerate or mitigate anonymous pull throttling. | +| **Hugging Face** | High / Flexible | **No** | Yes, explicit Hub quotas; more favorable for file fetches | AI-native workflows where you want to split infrastructure from model weights. | +| **Quay.io** | Unlimited for OSS | **No** | Lower-friction OSS posture; no comparable cap highlighted | Heavy-duty, keyless enterprise open-source hosting. 
| +| **NVCR (Private)** | Unlimited | **Yes** (NGC API key) | Less relevant than auth gating | NVIDIA-internal or controlled distribution where authenticated pulls are acceptable. | +| **NVCR (Public Publishing)** | Unlimited | **No** (for end users) | Not the primary issue; process overhead dominates | NVIDIA-managed public distribution, if you are willing to go through the full public publishing process. | + + +### The Recommendation + +NeMo Platform should use **GitHub Container Registry (`ghcr.io`) as the primary public distribution path** for OSS images. + +This is the strongest default choice for the current stage of the project because: + +- NeMo Platform is already hosted on GitHub, so GHCR keeps source and container distribution in the same ecosystem. +- It provides the most straightforward public OSS user experience: users can discover the project on GitHub and pull the corresponding images without additional NVIDIA-specific account setup. +- It meets the key technical criteria outlined in this document, including image-size viability and an acceptable public rate-limit profile for OSS distribution. +- It avoids the additional publishing-process friction associated with NVIDIA's public NVCR path. + +This recommendation does **not** rule out also publishing through **NVCR public** in the future. That path may still be useful if there is a strategic reason to maintain a public NVIDIA-native distribution channel. However, at the initial stage, NeMo Platform should avoid taking on the additional operational overhead of the NVCR public publishing process when GHCR already satisfies the project’s functional and distribution requirements. 
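Illustrative note (not part of the commit): the per-layer size constraint described in the registry comparison above could be checked before pushing with a small script. The sketch below is a minimal example under stated assumptions. It assumes the image manifest has already been fetched and parsed into a dict with an OCI-style `layers` array of `size`/`digest` entries, it reads the document's "10 GB" as 10 GiB, and the names `oversized_layers`, `total_size_bytes`, and the placeholder digests are hypothetical.

```python
from typing import Any

# Assumption: the "10 GB" per-layer cap from the document, read here as 10 GiB.
LAYER_LIMIT_BYTES = 10 * 1024**3


def oversized_layers(
    manifest: dict[str, Any], limit: int = LAYER_LIMIT_BYTES
) -> list[tuple[str, int]]:
    """Return (digest, size) for every layer whose size exceeds the limit."""
    return [
        (layer["digest"], layer["size"])
        for layer in manifest.get("layers", [])
        if layer["size"] > limit
    ]


def total_size_bytes(manifest: dict[str, Any]) -> int:
    """Sum the sizes of all layers listed in the manifest."""
    return sum(layer["size"] for layer in manifest.get("layers", []))


# Hypothetical manifest: placeholder digests, not real content hashes.
example_manifest = {
    "layers": [
        {"digest": "sha256:base", "size": 500 * 1024**2},    # base OS layer
        {"digest": "sha256:deps", "size": 2 * 1024**3},      # dependencies
        {"digest": "sha256:weights", "size": 12 * 1024**3},  # model weights
    ]
}

for digest, size in oversized_layers(example_manifest):
    print(f"{digest}: {size / 1024**3:.1f} GiB exceeds the per-layer limit")
```

Here only the weights layer would be flagged, which is the case the document suggests splitting into several `RUN` or `COPY` steps. Sizes in a registry manifest describe the stored layer blobs, so they may differ from the sizes reported by local tooling.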
diff --git a/pytest.ini b/pytest.ini index 25a75bdb6f..fe374a65fb 100644 --- a/pytest.ini +++ b/pytest.ini @@ -59,6 +59,7 @@ markers = smoke_nmp_automodel_tasks: Import smoke tests for the nmp-automodel-tasks image smoke_nmp_automodel_training: Import smoke tests for the nmp-automodel-training image e2e: End-to-end tests - test complete customer workflows on deployed infrastructure (Helm/Docker Compose) + e2e_config(*layers): Ordered list of repo-root-relative config paths and/or inline dict overlays; empty means default local config regression: Regression tests - test individual functional microservices for baseline functionality infrastructure: Infrastructure tests - ensure services are compatible with customer infrastructure canary: Canary tests - test deployed integration environments like top of tree diff --git a/script/Untitled b/script/Untitled new file mode 100644 index 0000000000..e69de29bb2 diff --git a/script/run-e2e-linux.sh b/script/run-e2e-linux.sh new file mode 100644 index 0000000000..250f51a063 --- /dev/null +++ b/script/run-e2e-linux.sh @@ -0,0 +1,41 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Run NeMo Platform E2E tests inside a local Linux container to compare +# behavior with GitHub Actions' ubuntu-latest environment. +# +# Examples: +# script/run-e2e-linux.sh +# script/run-e2e-linux.sh e2e/test_jobs_auth.py -vv -s --run-e2e +# +# Requirements: +# - local Docker daemon +# - outbound network access from the container for uv package sync + +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." 
&& pwd)" +IMAGE="${NMP_E2E_LINUX_IMAGE:-python:3.13-slim-bookworm}" +CONTAINER_NAME="nmp-e2e-linux-$(date +%s)" +PYTEST_ARGS=("$@") + +if [ "${#PYTEST_ARGS[@]}" -eq 0 ]; then + PYTEST_ARGS=(e2e/test_jobs_auth.py -vv -s --run-e2e) +fi + +docker run --rm --name "${CONTAINER_NAME}" \ + -e _TYPER_FORCE_DISABLE_TERMINAL=1 \ + -e E2E_SERVICES_LOG_DIR=/tmp/e2e-services-logs \ + -e UV_PROJECT_ENVIRONMENT=/tmp/nmp-e2e-linux-venv \ + -e NGC_API_KEY="${NGC_API_KEY:-not-set}" \ + -e HF_TOKEN="${HF_TOKEN:-}" \ + -v "${ROOT_DIR}:/workspace" \ + -w /workspace \ + "${IMAGE}" \ + bash -lc ' + set -euo pipefail + apt-get update + apt-get install -y --no-install-recommends curl git build-essential + rm -rf /var/lib/apt/lists/* + python -m pip install --no-cache-dir "uv>=0.9.14,<0.10.0" + uv sync --frozen --all-packages + uv run --frozen pytest '"$(printf '%q ' "${PYTEST_ARGS[@]}")"' + ' diff --git a/script/run-hello-world-jobs.sh b/script/run-hello-world-jobs.sh new file mode 100755 index 0000000000..90bd560637 --- /dev/null +++ b/script/run-hello-world-jobs.sh @@ -0,0 +1,79 @@ +#!/usr/bin/env bash + +set -euo pipefail + +ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)" +cd "${ROOT_DIR}" + +ENABLE_AUTH=false +while [[ $# -gt 0 ]]; do + case "$1" in + --auth) + ENABLE_AUTH=true + shift + ;; + -h|--help) + cat <<'EOF' +Usage: script/run-hello-world-jobs.sh [--auth] + +Options: + --auth Enable local auth with unsigned JWTs for development. + +With --auth: + Start the platform with unsigned JWTs enabled and seed default auth role + bindings, then log in with: + .venv/bin/nemo auth login --unsigned-token --email +EOF + exit 0 + ;; + *) + echo "Unknown argument: $1" >&2 + echo "Use --help for usage." 
>&2 + exit 2 + ;; + esac +done + +export NMP_CONFIG_FILE_PATH="${NMP_CONFIG_FILE_PATH:-packages/nmp_platform/config/local.yaml}" +export NMP_IMAGE_REGISTRY="${NMP_IMAGE_REGISTRY:-my-registry}" +export NMP_IMAGE_TAG="${NMP_IMAGE_TAG:-local}" + +if [[ "${ENABLE_AUTH}" == "true" ]]; then + export NMP_AUTH_ENABLED="${NMP_AUTH_ENABLED:-true}" + export NMP_AUTH_ALLOW_UNSIGNED_JWT="${NMP_AUTH_ALLOW_UNSIGNED_JWT:-true}" + # local.yaml sets auth.bundle_cache_seconds=0 for fast permission-propagation + # feedback in tests. In this one-process local services setup, that causes + # every PDP evaluation to reload policy data through the entities API, which + # recursively triggers more PDP checks and can deadlock into timeout/retry + # loops. Keep a nonzero cache when enabling auth from this helper script. + export NMP_AUTH_BUNDLE_CACHE_SECONDS="${NMP_AUTH_BUNDLE_CACHE_SECONDS:-30}" + # Seed the default local auth role bindings on startup so the unsigned-token + # dev principal (admin@example.com by default) can perform platform actions + # instead of failing every request with an immediate 403. + export NMP_SEED_ON_STARTUP="${NMP_SEED_ON_STARTUP:-true}" + # This helper starts a minimal service set without the models service. The + # generic platform seed task waits for models readiness by default, which + # prevents any seeding from running here. Limit startup seeding to the auth + # role bindings we actually need for local unsigned-JWT testing. 
+ export NMP_PLATFORM_SEED_WAIT_FOR_READY_ENABLED="${NMP_PLATFORM_SEED_WAIT_FOR_READY_ENABLED:-false}" + export NMP_PLATFORM_SEED_AUTH_ENABLED="${NMP_PLATFORM_SEED_AUTH_ENABLED:-true}" + export NMP_PLATFORM_SEED_GUARDRAILS_ENABLED="${NMP_PLATFORM_SEED_GUARDRAILS_ENABLED:-false}" + export NMP_PLATFORM_SEED_EVALUATOR_ENABLED="${NMP_PLATFORM_SEED_EVALUATOR_ENABLED:-false}" + export NMP_PLATFORM_SEED_MODEL_PROVIDER_ENABLED="${NMP_PLATFORM_SEED_MODEL_PROVIDER_ENABLED:-false}" +fi + +NEMO_BIN="${NEMO_BIN:-}" +if [[ -z "${NEMO_BIN}" ]]; then + if [[ -x ".venv/bin/nemo" ]]; then + NEMO_BIN=".venv/bin/nemo" + elif command -v nemo >/dev/null 2>&1; then + NEMO_BIN="nemo" + else + echo "Could not find nemo CLI. Set NEMO_BIN or create .venv/bin/nemo." >&2 + exit 127 + fi +fi + +exec "${NEMO_BIN}" services run \ + --services jobs,hello-world,files,auth,entities \ + --controllers jobs diff --git a/script/submit-hello-world-job.sh b/script/submit-hello-world-job.sh new file mode 100755 index 0000000000..e061188016 --- /dev/null +++ b/script/submit-hello-world-job.sh @@ -0,0 +1,89 @@ +#!/usr/bin/env bash + +set -euo pipefail + +# If the local platform was started with `script/run-hello-world-jobs.sh --auth`, +# authenticate first with: +# .venv/bin/nemo auth login --unsigned-token --email + +WORKSPACE="${NMP_WORKSPACE:-default}" +JOB_NAME="${1:-hello-world-cli-job}" +MESSAGE="${2:-hello from cli}" +PROJECT="${NMP_PROJECT:-}" +IMAGE_REGISTRY="${NMP_IMAGE_REGISTRY:-my-registry}" +IMAGE_TAG="${NMP_IMAGE_TAG:-local}" +EXECUTION_PROFILE="${NMP_JOB_PROFILE:-docker}" +IMAGE="${NMP_CPU_TASKS_IMAGE:-${IMAGE_REGISTRY}/nmp-cpu-tasks:${IMAGE_TAG}}" +NEMO_BIN="${NEMO_BIN:-}" + +if [[ -z "${NEMO_BIN}" ]]; then + if [[ -x ".venv/bin/nemo" ]]; then + NEMO_BIN=".venv/bin/nemo" + elif command -v nemo >/dev/null 2>&1; then + NEMO_BIN="nemo" + else + echo "Could not find nemo CLI. Set NEMO_BIN or create .venv/bin/nemo." 
>&2 + exit 127 + fi +fi + +payload_file="$(mktemp)" +trap 'rm -f "${payload_file}"' EXIT + +cat > "${payload_file}" < Option except EntityNotFoundError: return None - async def list_tasks(self, step_id: str) -> list[PlatformJobTask]: + async def list_tasks(self, step_id: str, workspace: str) -> list[PlatformJobTask]: """List all platform job tasks for a specific step.""" response = await self.store.list( PlatformJobTask, + workspace=workspace, filter_obj={"step_id": step_id}, page_size=1000, ) diff --git a/services/core/jobs/src/nmp/core/jobs/config.py b/services/core/jobs/src/nmp/core/jobs/config.py index 3c71a0fbd6..e0f3146c7c 100644 --- a/services/core/jobs/src/nmp/core/jobs/config.py +++ b/services/core/jobs/src/nmp/core/jobs/config.py @@ -37,6 +37,13 @@ class JobsServiceConfig(create_service_config_class("jobs")): # type: ignore "docker/none runtimes and false for kubernetes." ), ) + include_job_logs_in_diagnostics: bool = Field( + default=False, + description=( + "Include raw job log lines in controller diagnostics snapshots. Disabled by default because " + "job logs may contain secrets or PII. Enable only for local debugging or test environments." 
+ ), + ) def resolved_enable_subprocess_executor(self) -> bool: """Whether host subprocess execution is registered for default profiles.""" diff --git a/services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py b/services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py index 35610fd19a..e9139a94cc 100644 --- a/services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py +++ b/services/core/jobs/src/nmp/core/jobs/controllers/backends/subprocess.py @@ -57,6 +57,7 @@ SUBPROCESS_PID_STATUS_KEY = "pid" SUBPROCESS_PGID_STATUS_KEY = "pgid" SUBPROCESS_PERSISTENT_STORAGE_STATUS_KEY = "subprocess_persistent_storage_path" +_MISSING_METADATA_PENDING_GRACE_SECONDS = 5 SUBPROCESS_INHERITED_ENV_ALLOWLIST = frozenset( { "PATH", @@ -281,6 +282,24 @@ def sync(self, step: PlatformJobStepWithContext) -> JobUpdate: status=PlatformJobStatus.CANCELLED.value, status_details={"message": "Subprocess not found, job cancelled"}, ) + task_fallback = self._get_task_fallback_update(step) + if task_fallback is not None: + return task_fallback + if step.status == PlatformJobStatus.PENDING and not self._pending_step_missing_metadata_is_stale(step): + # Stopgap only: the subprocess backend keeps execution metadata in controller + # memory, while step/task state is persisted in the jobs database. Those two + # sources of truth are not fully synchronized today because subprocess was not + # originally designed around durable jobs-backed execution state. That means a + # step can already be visible in the database as pending before this backend has + # registered local subprocess metadata for it. Keep the step pending briefly + # instead of failing it. The real fix is to move subprocess onto properly + # serialized, jobs-backed state so reconciliation does not depend on process-local + # memory. 
+ return JobUpdate( + status=PlatformJobStatus.PENDING.value, + status_details=step.status_details or {"message": "Awaiting subprocess metadata"}, + error_details=step.error_details or {}, + ) return JobUpdate( status=PlatformJobStatus.ERROR.value, error_details={"message": "Local subprocess metadata not found"}, @@ -324,8 +343,6 @@ def cleanup_steps(self) -> None: should_cleanup = self._execution_profile_config.cleanup_completed_jobs_immediately if not should_cleanup and step.updated_at is not None: updated_at = step.updated_at - if isinstance(updated_at, str): - updated_at = datetime.datetime.fromisoformat(updated_at) if updated_at.tzinfo is None: updated_at = updated_at.replace(tzinfo=datetime.timezone.utc) expires_at = updated_at + datetime.timedelta( @@ -337,6 +354,41 @@ def cleanup_steps(self) -> None: shutil.rmtree(metadata.work_dir, ignore_errors=True) self._process_registry.pop(key) + def _get_task_fallback_update(self, step: PlatformJobStepWithContext) -> JobUpdate | None: + try: + tasks = self._nmp_sdk.jobs.tasks.list( + name=step.name, + job=step.job, + workspace=step.workspace, + ) + except Exception: + logger.warning( + "Failed to fetch tasks for subprocess metadata fallback", + extra={"job": step.job, "step": step.name, "workspace": step.workspace}, + ) + return None + + if not tasks.data: + return None + + latest_task = max(tasks.data, key=lambda task: task.updated_at or task.created_at or datetime.datetime.min) + return JobUpdate( + status=latest_task.status, + status_details=latest_task.status_details, + error_details=latest_task.error_details or {}, + ) + + @staticmethod + def _pending_step_missing_metadata_is_stale(step: PlatformJobStepWithContext) -> bool: + anchor = step.updated_at or step.created_at + if anchor is None: + return True + if anchor.tzinfo is None: + anchor = anchor.replace(tzinfo=datetime.timezone.utc) + return (anchor + datetime.timedelta(seconds=_MISSING_METADATA_PENDING_GRACE_SECONDS)) < datetime.datetime.now( + 
datetime.timezone.utc + ) + @staticmethod def _cleanup_failed_startup_dirs(work_dir: Path, persistent_dir: Path) -> None: shutil.rmtree(work_dir, ignore_errors=True) diff --git a/services/core/jobs/src/nmp/core/jobs/controllers/diagnostics.py b/services/core/jobs/src/nmp/core/jobs/controllers/diagnostics.py new file mode 100644 index 0000000000..01fe22bc50 --- /dev/null +++ b/services/core/jobs/src/nmp/core/jobs/controllers/diagnostics.py @@ -0,0 +1,163 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +from __future__ import annotations + +import logging +from dataclasses import dataclass +from typing import Any, Protocol + +from nemo_platform import NeMoPlatform +from nmp.core.jobs.config import config + +_MAX_LOG_ENTRIES = 20 +_MAX_ERROR_STACK_CHARS = 2048 + + +class JobDiagnosticTarget(Protocol): + workspace: str + job: str + name: str + + +@dataclass(frozen=True) +class _JobDiagnosticRef: + workspace: str + job: str + name: str + + +def _trim_error_stack(value: str | None) -> str | None: + if value is None or len(value) <= _MAX_ERROR_STACK_CHARS: + return value + return value[-_MAX_ERROR_STACK_CHARS:] + + +def _task_dict(task: Any) -> dict[str, Any]: + return { + "name": task.name, + "status": task.status, + "status_details": task.status_details, + "error_details": task.error_details, + "error_stack": _trim_error_stack(task.error_stack), + } + + +def _step_dict(step: Any) -> dict[str, Any]: + return { + "name": step.name, + "status": step.status, + "status_details": step.status_details, + "error_details": step.error_details, + } + + +def _job_dict(job: Any) -> dict[str, Any]: + return { + "name": job.name, + "status": job.status, + "status_details": job.status_details, + "error_details": job.error_details, + } + + +def collect_job_diagnostics( + sdk: NeMoPlatform, + step: JobDiagnosticTarget | None = None, + *, + workspace: str | None = None, + job_name: str | 
None = None, + step_name: str | None = None, + context: str, +) -> dict[str, Any]: + step_ref: JobDiagnosticTarget + if step is None: + if workspace is None or job_name is None or step_name is None: + raise ValueError("Either step or workspace/job_name/step_name must be provided") + step_ref = _JobDiagnosticRef(workspace=workspace, job=job_name, name=step_name) + else: + step_ref = step + + diagnostics: dict[str, Any] = { + "diagnostic_context": context, + "workspace": step_ref.workspace, + "job_name": step_ref.job, + "step_name": step_ref.name, + } + + try: + job = sdk.jobs.retrieve(step_ref.job, workspace=step_ref.workspace) + diagnostics["job"] = _job_dict(job) + except Exception as exc: + diagnostics["job_error"] = str(exc) + + try: + status = sdk.jobs.get_status(step_ref.job, workspace=step_ref.workspace) + diagnostics["status_api"] = { + "status": status.status, + "status_details": status.status_details, + "error_details": status.error_details, + "steps": [ + { + **_step_dict(status_step), + "tasks": [_task_dict(task) for task in status_step.tasks], + } + for status_step in status.steps + ], + } + except Exception as exc: + diagnostics["status_api_error"] = str(exc) + + try: + refreshed_step = sdk.jobs.steps.retrieve(step_ref.name, job=step_ref.job, workspace=step_ref.workspace) + diagnostics["step"] = _step_dict(refreshed_step) + except Exception as exc: + diagnostics["step_error"] = str(exc) + + try: + tasks = sdk.jobs.tasks.list(step_ref.name, job=step_ref.job, workspace=step_ref.workspace) + diagnostics["tasks_api"] = [_task_dict(task) for task in tasks.data] + except Exception as exc: + diagnostics["tasks_api_error"] = str(exc) + + try: + if config.include_job_logs_in_diagnostics: + logs = sdk.jobs.get_logs(workspace=step_ref.workspace, name=step_ref.job) + diagnostics["job_logs"] = [entry.message for entry in logs.data[-_MAX_LOG_ENTRIES:]] + except Exception as exc: + diagnostics["job_logs_error"] = str(exc) + + return diagnostics + + +def 
log_job_diagnostics_if_debug( + sdk: NeMoPlatform, + step: JobDiagnosticTarget | None = None, + *, + logger: logging.Logger, + workspace: str | None = None, + job_name: str | None = None, + step_name: str | None = None, + context: str, +) -> None: + if not logger.isEnabledFor(logging.DEBUG): + return + + step_ref: JobDiagnosticTarget + if step is None: + if workspace is None or job_name is None or step_name is None: + raise ValueError("Either step or workspace/job_name/step_name must be provided") + step_ref = _JobDiagnosticRef(workspace=workspace, job=job_name, name=step_name) + else: + step_ref = step + + logger.debug( + "Job diagnostics snapshot", + extra={ + "diagnostic_context": context, + "workspace": step_ref.workspace, + "job_name": step_ref.job, + "step_name": step_ref.name, + "job_diagnostics": collect_job_diagnostics(sdk, step_ref, context=context), + }, + ) diff --git a/services/core/jobs/src/nmp/core/jobs/controllers/reconciler.py b/services/core/jobs/src/nmp/core/jobs/controllers/reconciler.py index 66ceca1e75..8dab47f099 100644 --- a/services/core/jobs/src/nmp/core/jobs/controllers/reconciler.py +++ b/services/core/jobs/src/nmp/core/jobs/controllers/reconciler.py @@ -6,12 +6,15 @@ from nemo_platform import APIError, APIStatusError, NeMoPlatform from nemo_platform.types.jobs import PlatformJobStepWithContext +from nemo_platform.types.jobs.platform_job_steps_list_filter_param import PlatformJobStepsListFilterParam +from nemo_platform.types.shared.platform_job_status import PlatformJobStatus as SDKPlatformJobStatus from nmp.common.controller import Controller from nmp.common.jobs.schemas import PlatformJobStatus from nmp.common.observability import scoped_app_ctx, start_span_with_ctx from nmp.core.jobs.app.ctx import JobBackendContext, JobContext from nmp.core.jobs.controllers.backends import extract_provider_profile from nmp.core.jobs.controllers.backends.registry import BackendRegistry +from nmp.core.jobs.controllers.diagnostics import 
log_job_diagnostics_if_debug from opentelemetry import metrics, trace tracer = trace.get_tracer(__name__) @@ -30,6 +33,7 @@ def __init__( self._nmp_sdk = nmp_sdk self._stop_signal = stop_signal self._is_healthy = False + self._logger = logger self._step_reconciliation_total = meter.create_counter( name="nmp.jobs.reconciler.step.reconciliation.total", @@ -52,7 +56,7 @@ def step(self): with tracer.start_as_current_span("jobs_reconciler/fetch_steps_for_reconciliation"): try: - statuses = [ + statuses: list[SDKPlatformJobStatus] = [ PlatformJobStatus.PENDING.value, PlatformJobStatus.ACTIVE.value, PlatformJobStatus.CANCELLING.value, @@ -84,6 +88,16 @@ def step(self): ): job_update = backend.sync(step) logger.info(f"Updating job step status from '{step.status}' to '{job_update.status}'") + if ( + job_update.status == PlatformJobStatus.ERROR.value + and step.status != PlatformJobStatus.ERROR + ): + log_job_diagnostics_if_debug( + self._nmp_sdk, + step, + logger=self._logger, + context="step transitioned to error during reconciliation", + ) self._nmp_sdk.jobs.steps.update_status( step.name, workspace=step.workspace, @@ -113,6 +127,12 @@ def step(self): ) except Exception: logger.exception("Unexpected error when reconciling job step") + log_job_diagnostics_if_debug( + self._nmp_sdk, + step, + logger=self._logger, + context="unexpected reconciliation error", + ) self._step_reconciliation_errors.add( 1, attributes={ @@ -129,16 +149,17 @@ def step(self): except Exception: logger.exception("Could not complete cleanup steps for backend", exc_info=True) - def get_steps_for_reconciliation(self, statuses: list[str]) -> list[PlatformJobStepWithContext]: + def get_steps_for_reconciliation(self, statuses: list[SDKPlatformJobStatus]) -> list[PlatformJobStepWithContext]: """ Return the list of steps to reconcile. 
""" # Iterate through all pages to get all steps steps = [] + filter_params: PlatformJobStepsListFilterParam = {"status": statuses} for step in self._nmp_sdk.jobs.steps.list( name="-", # Use "-" to indicate all jobs workspace="-", # Cross-workspace query - filter={"status": statuses}, + filter=filter_params, sort="updated_at", ): steps.append(step) diff --git a/services/core/jobs/src/nmp/core/jobs/controllers/scheduler.py b/services/core/jobs/src/nmp/core/jobs/controllers/scheduler.py index 77cd8f576c..4d9b5a8e98 100644 --- a/services/core/jobs/src/nmp/core/jobs/controllers/scheduler.py +++ b/services/core/jobs/src/nmp/core/jobs/controllers/scheduler.py @@ -6,8 +6,9 @@ import traceback import nemo_platform -from nemo_platform import NeMoPlatform +from nemo_platform import APIStatusError, NeMoPlatform from nemo_platform.types.jobs import PlatformJobStepWithContext +from nemo_platform.types.jobs.platform_job_steps_list_filter_param import PlatformJobStepsListFilterParam from nmp.common.controller import Controller from nmp.common.jobs.schemas import PlatformJobStatus from nmp.common.observability import start_span_with_ctx @@ -15,6 +16,7 @@ from nmp.core.jobs.controllers.backends import JobUpdate, extract_provider_profile from nmp.core.jobs.controllers.backends.exceptions import ResourceAllocationError from nmp.core.jobs.controllers.backends.registry import BackendRegistry +from nmp.core.jobs.controllers.diagnostics import log_job_diagnostics_if_debug from opentelemetry import metrics, trace tracer = trace.get_tracer(__name__) @@ -36,6 +38,7 @@ def __init__( self._nmp_sdk = nmp_sdk self._stop_signal = stop_signal self._is_healthy = False + self._logger = logger self._step_scheduled_total = meter.create_counter( name="nmp.jobs.scheduler.step.scheduled.total", @@ -77,37 +80,68 @@ def step(self): try: update = self.schedule_step(step) logger.info("Scheduled job step") - self._nmp_sdk.jobs.steps.update_status( - step.name, - workspace=step.workspace, - job=step.job, - 
status=update.status, - status_details=update.status_details, # type: ignore - error_details=update.error_details, # type: ignore - ) + try: + self._nmp_sdk.jobs.steps.update_status( + step.name, + workspace=step.workspace, + job=step.job, + status=update.status, + status_details=update.status_details, # type: ignore + error_details=update.error_details, # type: ignore + ) + except APIStatusError as e: + # Stopgap for a scheduler/reconciler race: by the time the scheduler persists + # CREATED -> PENDING, another controller pass may already have advanced the + # step to ACTIVE (or later). In that case, treating the stale PENDING write + # as fatal incorrectly marks a healthy job as ERROR. The real fix is to + # properly serialize step state transitions so stale controller writes do not + # happen in the first place. + if self._should_ignore_conflicting_pending_update(step, update, e): + logger.info( + "Ignoring stale pending update for job step that already advanced", + extra={ + "job": step.job, + "step": step.name, + "workspace": step.workspace, + }, + ) + continue + raise except ResourceAllocationError as e: logger.info( f"Could not schedule job '{step.job}' step '{step.name}' due to resource constraints: {e.message}. Marking step as error." 
) + log_job_diagnostics_if_debug( + self._nmp_sdk, + step, + logger=self._logger, + context="resource allocation error during scheduling", + ) self._step_scheduling_errors.add(1, attributes={"error_type": "resource_allocation"}) self._nmp_sdk.jobs.steps.update_status( step.name, workspace=step.workspace, job=step.job, - status=PlatformJobStatus.ERROR, - status_details={"message": e.message}, # type: ignore - error_details={"message": e.message}, # type: ignore + status=PlatformJobStatus.ERROR.value, + status_details={"message": e.message}, + error_details={"message": e.message}, ) except Exception as e: logger.exception("Could not schedule job step", exc_info=True) + log_job_diagnostics_if_debug( + self._nmp_sdk, + step, + logger=self._logger, + context="unexpected scheduling error", + ) self._step_scheduling_errors.add(1, attributes={"error_type": "unknown"}) self._nmp_sdk.jobs.steps.update_status( step.name, workspace=step.workspace, job=step.job, - status=PlatformJobStatus.ERROR, - status_details={"message": str(e)}, # type: ignore + status=PlatformJobStatus.ERROR.value, + status_details={"message": str(e)}, error_details={"message": str(e), "error": traceback.format_exc()}, ) @@ -118,10 +152,13 @@ def get_steps_for_scheduling(self) -> list[PlatformJobStepWithContext]: """ # Iterate through all pages to get all steps steps = [] + filter_params: PlatformJobStepsListFilterParam = { + "status": [PlatformJobStatus.CREATED.value, PlatformJobStatus.RESUMING.value] + } for step in self._nmp_sdk.jobs.steps.list( name="-", # Use "-" to indicate all jobs workspace="-", # Cross-workspace query - filter={"status": [PlatformJobStatus.CREATED.value, PlatformJobStatus.RESUMING.value]}, + filter=filter_params, sort="-created_at", ): steps.append(step) @@ -136,4 +173,23 @@ def schedule_step(self, step: PlatformJobStepWithContext) -> JobUpdate: "job_scheduler/schedule_step_with_backend", JobBackendContext(provider=provider, profile=profile, name=str(backend)), ): + assert 
step.step_spec is not None return backend.schedule(step.step_spec.executor, step) + + def _should_ignore_conflicting_pending_update( + self, + step: PlatformJobStepWithContext, + update: JobUpdate, + error: APIStatusError, + ) -> bool: + if error.status_code != 409 or update.status != PlatformJobStatus.PENDING.value: + return False + + current_step = self._nmp_sdk.jobs.steps.retrieve( + step.name, + workspace=step.workspace, + job=step.job, + ) + original_status = PlatformJobStatus(step.status) + current_status = PlatformJobStatus(current_step.status) + return current_status != original_status and original_status.can_transition_to(current_status) diff --git a/services/core/jobs/tests/controllers/test_diagnostics.py b/services/core/jobs/tests/controllers/test_diagnostics.py new file mode 100644 index 0000000000..f79e3da86c --- /dev/null +++ b/services/core/jobs/tests/controllers/test_diagnostics.py @@ -0,0 +1,66 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +from types import SimpleNamespace +from unittest.mock import Mock, patch + +from nmp.core.jobs.controllers.diagnostics import collect_job_diagnostics + + +def _make_sdk_with_logs() -> Mock: + sdk = Mock() + sdk.jobs.retrieve.return_value = SimpleNamespace( + name="job-1", + status="error", + status_details={}, + error_details={}, + ) + sdk.jobs.get_status.return_value = SimpleNamespace( + status="error", + status_details={}, + error_details={}, + steps=[], + ) + sdk.jobs.steps.retrieve.return_value = SimpleNamespace( + name="step-1", + status="error", + status_details={}, + error_details={}, + ) + sdk.jobs.tasks.list.return_value = SimpleNamespace(data=[]) + sdk.jobs.get_logs.return_value = SimpleNamespace( + data=[SimpleNamespace(message="secret-token=abc123"), SimpleNamespace(message="another line")] + ) + return sdk + + +def test_collect_job_diagnostics_omits_raw_job_logs_by_default() -> None: + sdk = _make_sdk_with_logs() + + with patch("nmp.core.jobs.controllers.diagnostics.config.include_job_logs_in_diagnostics", False): + diagnostics = collect_job_diagnostics( + sdk, + workspace="default", + job_name="job-1", + step_name="step-1", + context="test", + ) + + assert "job_logs" not in diagnostics + sdk.jobs.get_logs.assert_not_called() + + +def test_collect_job_diagnostics_includes_raw_job_logs_when_enabled() -> None: + sdk = _make_sdk_with_logs() + + with patch("nmp.core.jobs.controllers.diagnostics.config.include_job_logs_in_diagnostics", True): + diagnostics = collect_job_diagnostics( + sdk, + workspace="default", + job_name="job-1", + step_name="step-1", + context="test", + ) + + assert diagnostics["job_logs"] == ["secret-token=abc123", "another line"] + sdk.jobs.get_logs.assert_called_once_with(workspace="default", name="job-1") diff --git a/services/core/jobs/tests/controllers/test_reconciler.py b/services/core/jobs/tests/controllers/test_reconciler.py index 3a854eed27..94c04e19bd 100644 --- 
a/services/core/jobs/tests/controllers/test_reconciler.py +++ b/services/core/jobs/tests/controllers/test_reconciler.py @@ -1,9 +1,11 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 -from unittest.mock import MagicMock, call +from unittest.mock import MagicMock, call, patch +from nmp.common.jobs.schemas import PlatformJobStatus from nmp.core.jobs.api.v2.jobs.schemas import PlatformJobStepWithContext +from nmp.core.jobs.controllers.backends import JobUpdate from nmp.core.jobs.controllers.backends.registry import BackendRegistry from nmp.core.jobs.controllers.backends.test import MockDockerCPUJobBackend from nmp.core.jobs.controllers.reconciler import JobReconciler @@ -58,3 +60,30 @@ def test_job_reconciler_syncs_active_job( error_details=None, status_details=None, ) + + +def test_job_reconciler_logs_diagnostics_for_error_transition_in_debug_mode( + backend_registry: BackendRegistry, + test_step_active: PlatformJobStepWithContext, +): + mock_client = MagicMock() + mock_client.jobs = MagicMock() + mock_client.jobs.steps = MagicMock() + mock_client.jobs.steps.list.return_value = [test_step_active] + + job_reconciler = JobReconciler(backend_registry, mock_client) + test_backend = job_reconciler._backend_registry.get_backend(provider="cpu", profile="default") + assert isinstance(test_backend, MockDockerCPUJobBackend) + + with ( + patch.object(test_backend, "sync", return_value=JobUpdate(status=PlatformJobStatus.ERROR.value)), + patch("nmp.core.jobs.controllers.reconciler.log_job_diagnostics_if_debug") as log_diagnostics, + ): + job_reconciler.step() + + log_diagnostics.assert_called_once_with( + mock_client, + test_step_active, + logger=job_reconciler._logger, + context="step transitioned to error during reconciliation", + ) diff --git a/services/core/jobs/tests/controllers/test_scheduler.py b/services/core/jobs/tests/controllers/test_scheduler.py index 05a02a6dd1..5da34188e0 
100644 --- a/services/core/jobs/tests/controllers/test_scheduler.py +++ b/services/core/jobs/tests/controllers/test_scheduler.py @@ -3,6 +3,8 @@ from unittest.mock import patch +import httpx +from nemo_platform import ConflictError from nmp.common.jobs.schemas import PlatformJobStatus from nmp.core.jobs.api.v2.jobs.schemas import PlatformJobStepWithContext from nmp.core.jobs.controllers.backends.exceptions import ResourceAllocationError @@ -69,3 +71,115 @@ def test_resource_allocation_error_marks_step_as_error( status_details={"message": error_message}, error_details={"message": error_message}, ) + + +def test_scheduler_logs_diagnostics_for_unexpected_schedule_error_in_debug_mode( + job_scheduler: JobScheduler, + mock_nmp_client, + test_step_pending: PlatformJobStepWithContext, +): + mock_nmp_client.jobs.steps.list.return_value = [test_step_pending] + + with ( + patch.object(job_scheduler, "schedule_step", side_effect=RuntimeError("boom")), + patch("nmp.core.jobs.controllers.scheduler.logger.isEnabledFor", return_value=True), + patch("nmp.core.jobs.controllers.scheduler.log_job_diagnostics_if_debug") as log_diagnostics, + ): + job_scheduler.step() + + log_diagnostics.assert_called_once_with( + mock_nmp_client, + test_step_pending, + logger=job_scheduler._logger, + context="unexpected scheduling error", + ) + + +def test_scheduler_does_not_mark_step_error_when_pending_update_conflicts_with_concurrent_advance( + job_scheduler: JobScheduler, + mock_nmp_client, + test_step_pending: PlatformJobStepWithContext, +): + mock_nmp_client.jobs.steps.list.return_value = [test_step_pending] + + request = httpx.Request( + "PATCH", "http://localhost/apis/jobs/v2/workspaces/default/jobs/test-job-id/steps/test-step/status" + ) + response = httpx.Response( + 409, + request=request, + json={ + "detail": ( + "Invalid status transition from PlatformJobStatus.ACTIVE to " + "PlatformJobStatus.PENDING for step test-step-id" + ) + }, + ) + conflict = ConflictError( + "Error code: 409 - 
{'detail': 'Invalid status transition from PlatformJobStatus.ACTIVE " + "to PlatformJobStatus.PENDING for step test-step-id'}", + response=response, + body=response.json(), + ) + active_step = test_step_pending.model_copy(update={"status": PlatformJobStatus.ACTIVE}) + mock_nmp_client.jobs.steps.update_status.side_effect = [conflict] + mock_nmp_client.jobs.steps.retrieve.return_value = active_step + + job_scheduler.step() + + mock_nmp_client.jobs.steps.update_status.assert_called_once_with( + test_step_pending.name, + workspace=test_step_pending.workspace, + job=test_step_pending.job, + status=PlatformJobStatus.PENDING, + status_details=None, + error_details=None, + ) + mock_nmp_client.jobs.steps.retrieve.assert_called_once_with( + test_step_pending.name, + workspace=test_step_pending.workspace, + job=test_step_pending.job, + ) + + +def test_scheduler_does_not_ignore_pending_update_conflict_when_step_remains_resuming( + job_scheduler: JobScheduler, + mock_nmp_client, + test_step_pending: PlatformJobStepWithContext, +): + resuming_step = test_step_pending.model_copy(update={"status": PlatformJobStatus.RESUMING}) + mock_nmp_client.jobs.steps.list.return_value = [resuming_step] + + request = httpx.Request( + "PATCH", "http://localhost/apis/jobs/v2/workspaces/default/jobs/test-job-id/steps/test-step/status" + ) + response = httpx.Response( + 409, + request=request, + json={ + "detail": ( + "Invalid status transition from PlatformJobStatus.RESUMING to " + "PlatformJobStatus.PENDING for step test-step-id" + ) + }, + ) + conflict = ConflictError( + "Error code: 409 - {'detail': 'Invalid status transition from PlatformJobStatus.RESUMING " + "to PlatformJobStatus.PENDING for step test-step-id'}", + response=response, + body=response.json(), + ) + mock_nmp_client.jobs.steps.update_status.side_effect = [conflict, None] + mock_nmp_client.jobs.steps.retrieve.return_value = resuming_step + + job_scheduler.step() + + assert mock_nmp_client.jobs.steps.update_status.call_count == 2 
+ error_call = mock_nmp_client.jobs.steps.update_status.call_args_list[1] + assert error_call.kwargs["status"] == PlatformJobStatus.ERROR.value + assert "409" in error_call.kwargs["status_details"]["message"] + mock_nmp_client.jobs.steps.retrieve.assert_called_once_with( + resuming_step.name, + workspace=resuming_step.workspace, + job=resuming_step.job, + ) diff --git a/services/core/jobs/tests/controllers/test_subprocess_backend.py b/services/core/jobs/tests/controllers/test_subprocess_backend.py index 2f40410326..3138609a29 100644 --- a/services/core/jobs/tests/controllers/test_subprocess_backend.py +++ b/services/core/jobs/tests/controllers/test_subprocess_backend.py @@ -4,6 +4,8 @@ import os import sys import time +from datetime import datetime, timezone +from types import SimpleNamespace from unittest.mock import patch from nmp.common.jobs.schemas import PlatformJobStatus @@ -307,3 +309,52 @@ def test_cancelling_terminates_running_process(mock_nmp_client, tmp_path, mock_p break time.sleep(0.05) assert metadata.process.poll() is not None + + +def test_sync_uses_persisted_task_when_local_metadata_is_missing( + mock_nmp_client, tmp_path, mock_platform_config, test_step_active +): + backend = _subprocess_backend(mock_nmp_client, tmp_path, mock_platform_config) + step = _step_with_command(test_step_active, ["/bin/sh", "-c", "sleep 10"]) + + _schedule_without_otel_export(backend, step) + key = SubprocessProcessKey(step.workspace, step.job, str(step.attempt_id), step.name) + backend._process_registry.pop(key) + mock_nmp_client.jobs.tasks.list.return_value = SimpleNamespace( + data=[ + SimpleNamespace( + status=PlatformJobStatus.ACTIVE.value, + status_details={"message": "Job is running"}, + error_details={}, + created_at=step.created_at, + updated_at=step.updated_at, + ) + ] + ) + + update = backend.sync(step) + + assert update.status == PlatformJobStatus.ACTIVE + assert update.status_details == {"message": "Job is running"} + assert update.error_details == {} + + 
+def test_sync_keeps_recent_pending_step_pending_when_local_metadata_is_missing( + mock_nmp_client, tmp_path, mock_platform_config, test_step_pending +): + backend = _subprocess_backend(mock_nmp_client, tmp_path, mock_platform_config) + step = _step_with_command(test_step_pending, ["/bin/sh", "-c", "true"]) + mock_nmp_client.jobs.tasks.list.return_value = SimpleNamespace(data=[]) + + update = backend.sync(step) + + assert update.status == PlatformJobStatus.PENDING + assert update.error_details == {} + + +def test_pending_step_missing_metadata_stale_check_accepts_typed_timestamp(test_step_pending): + step = _step_with_command(test_step_pending, ["/bin/sh", "-c", "true"]) + step.updated_at = datetime.now(timezone.utc) + step.created_at = None + + assert SubprocessJobBackend._pending_step_missing_metadata_is_stale(step) is False diff --git a/services/core/jobs/tests/integration/test_task_auth_runtime.py b/services/core/jobs/tests/integration/test_task_auth_runtime.py index 022e689722..c2d9a612fa 100644 --- a/services/core/jobs/tests/integration/test_task_auth_runtime.py +++ b/services/core/jobs/tests/integration/test_task_auth_runtime.py @@ -14,9 +14,7 @@ import json import os -from contextlib import redirect_stdout -from io import StringIO -from types import ModuleType +from typing import Protocol import pytest from nemo_platform import PermissionDeniedError @@ -32,24 +30,26 @@ ) -def _secret_access_task_module() -> ModuleType: - module = ModuleType("task_auth_runtime_test_module") +class _SecretAccessTask(Protocol): + def run(self, *, http_client) -> str: ... 
- def run(*, http_client) -> int: - from nmp.common.sdk_factory import get_task_sdk - workspace = os.environ["NEMO_JOB_WORKSPACE"] - secret_name = os.environ["NEMO_TEST_SECRET_NAME"] +def _secret_access_task_module() -> _SecretAccessTask: + class _Task: + @staticmethod + def run(*, http_client) -> str: + from nmp.common.sdk_factory import get_task_sdk - result = get_task_sdk(as_service="jobs", http_client=http_client).secrets.access( - workspace=workspace, - name=secret_name, - ) - print(result.value) - return 0 + workspace = os.environ["NEMO_JOB_WORKSPACE"] + secret_name = os.environ["NEMO_TEST_SECRET_NAME"] - module.run = run - return module + result = get_task_sdk(as_service="jobs", http_client=http_client).secrets.access( + workspace=workspace, + name=secret_name, + ) + return result.value + + return _Task() class TestTaskRuntimeAuthPropagation: @@ -76,9 +76,7 @@ def test_task_sdk_accesses_secret_on_behalf_of_creator(self): ) ctx.access_log.clear() - stdout = StringIO() with ( - redirect_stdout(stdout), pytest.MonkeyPatch.context() as monkeypatch, ): monkeypatch.setenv("NEMO_JOB_WORKSPACE", workspace) @@ -93,10 +91,9 @@ def test_task_sdk_accesses_secret_on_behalf_of_creator(self): } ), ) - exit_code = _secret_access_task_module().run(http_client=ctx.test_client) + secret = _secret_access_task_module().run(http_client=ctx.test_client) - assert exit_code == 0 - assert secret_value in stdout.getvalue() + assert secret == secret_value request = ctx.access_log.assert_has_request( method="GET", diff --git a/services/core/jobs/tests/test_dispatcher_cross_workspace.py b/services/core/jobs/tests/test_dispatcher_cross_workspace.py index 093e639906..e00e68d062 100644 --- a/services/core/jobs/tests/test_dispatcher_cross_workspace.py +++ b/services/core/jobs/tests/test_dispatcher_cross_workspace.py @@ -280,11 +280,43 @@ async def test_create_or_update_task_finds_existing_task_by_name( assert task2.status == PlatformJobStatus.COMPLETED # Verify only one task exists for this 
step - tasks = await multi_workspace_dispatcher.list_tasks(step.id) + tasks = await multi_workspace_dispatcher.list_tasks(step.id, workspace="default") task_names = [t.name for t in tasks] assert task_names.count(task_name) == 1, f"Should have exactly one task named '{task_name}', found {task_names}" +@pytest.mark.asyncio +@pytest.mark.integration +async def test_list_tasks_uses_step_workspace_for_non_default_jobs( + multi_workspace_dispatcher: JobDispatcher, + multi_workspace_store: EntityClient, +): + """Test that list_tasks returns tasks for steps in non-default workspaces. + + This reproduces a bug where list_tasks omitted the workspace and fell back + to the entity client's default workspace, so jobs.tasks.list(...) returned + an empty task list for jobs outside ``default`` even though the task existed. + """ + custom_workspace = "custom-workspace" + job = await multi_workspace_dispatcher.create_job(create_job_request("custom-task-list-job"), custom_workspace) + step = await multi_workspace_dispatcher.get_current_job_step_by_name(job.name, "basic", custom_workspace) + assert step is not None + + task = await multi_workspace_store.add( + PlatformJobTask( + name="custom-task", + workspace=custom_workspace, + step_id=step.id, + status=PlatformJobStatus.ERROR, + ) + ) + assert task.id is not None + + tasks = await multi_workspace_dispatcher.list_tasks(step.id, workspace=custom_workspace) + + assert [listed_task.id for listed_task in tasks] == [task.id] + + @pytest.mark.asyncio @pytest.mark.integration async def test_get_task_finds_task_by_name_not_entity_id( diff --git a/services/core/jobs/tests/test_timestamp_contracts.py b/services/core/jobs/tests/test_timestamp_contracts.py new file mode 100644 index 0000000000..6f3e69cf6e --- /dev/null +++ b/services/core/jobs/tests/test_timestamp_contracts.py @@ -0,0 +1,27 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: Apache-2.0 + +from datetime import datetime +from typing import assert_type + +from nemo_platform.types.jobs import PlatformJobStepWithContext + + +def test_platform_job_step_with_context_parses_wire_timestamps_to_datetime() -> None: + step = PlatformJobStepWithContext.model_validate( + { + "id": "test-step-id", + "attempt_id": "test-attempt-id", + "fileset": "test-fileset", + "job": "test-job-id", + "name": "test-step", + "workspace": "default", + "created_at": "2026-06-23T19:00:00Z", + "updated_at": "2026-06-23T19:00:05Z", + } + ) + + assert_type(step.created_at, datetime | None) + assert_type(step.updated_at, datetime | None) + assert isinstance(step.created_at, datetime) + assert isinstance(step.updated_at, datetime) diff --git a/spec/brian-auth-call-summary.md b/spec/brian-auth-call-summary.md new file mode 100644 index 0000000000..b4824556b2 --- /dev/null +++ b/spec/brian-auth-call-summary.md @@ -0,0 +1,226 @@ +# Brian Auth Call + +## Executive Summary + +This conversation clarified the emerging authentication direction for NeMo Platform and surfaced the major gaps that still need to be closed. The architectural preference is to keep authentication primarily outside the platform by relying on customer-managed identity providers (IDPs), while NeMo Platform focuses on authorization, internal role handling, and plugin-extensible permissions. That direction is considered technically sound, but the current user experience is still incomplete. + +The most important immediate gap is that there is no translation layer today between external OIDC scopes and NeMo Platform scopes. That makes enterprise identity integration conceptually viable but operationally incomplete. In parallel, documentation is missing for the recommended path, especially for enterprise service accounts, API keys, and common IDP setups. Several people see that as the fastest way to reduce friction in the short term. 
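To make the scope translation gap described above concrete, here is a minimal sketch of what such a mapping layer could look like. Everything in it is an assumption for illustration: the external scope strings, the platform scope names (`jobs:read`, `jobs:write`), and the helpers `translate_scopes` and `scopes_from_claim` are hypothetical and do not exist in NeMo Platform today. The only fixed point is that an OAuth 2.0 `scope` value is a space-delimited string.

```python
# Hypothetical sketch only: none of these scope names or helpers exist in NeMo Platform.
from collections.abc import Iterable

# Assumed mapping from external IDP (OIDC) scopes to platform-internal scopes.
EXTERNAL_TO_PLATFORM_SCOPES: dict[str, frozenset[str]] = {
    "api://example/jobs.read": frozenset({"jobs:read"}),
    "api://example/jobs.write": frozenset({"jobs:read", "jobs:write"}),
}


def translate_scopes(external_scopes: Iterable[str]) -> set[str]:
    """Return the platform scopes granted by a token's external scopes.

    External scopes with no mapping are ignored, so the default is to deny.
    """
    granted: set[str] = set()
    for scope in external_scopes:
        granted |= EXTERNAL_TO_PLATFORM_SCOPES.get(scope, frozenset())
    return granted


def scopes_from_claim(scope_claim: str) -> set[str]:
    """Translate a space-delimited OAuth 2.0 `scope` claim string."""
    return translate_scopes(scope_claim.split())
```

Keeping the mapping as plain data would let it live in platform configuration rather than code, so each customer could describe their own IDP's scopes without a platform change. Whether that is the right home for it is one of the open questions from the call.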
+ +The discussion also placed authentication in a broader strategic frame. NeMo Platform may become part of a larger "One NeMo" platform story that connects NVIDIA products, plugins, and agent workflows. Within that vision, Agent Optimizer appears likely to be a major product driver and power-user entry point, which means its needs may materially influence platform priorities, including auth. + +The practical next step coming out of the call is to validate the enterprise path from a customer perspective using Entra, document what is required, and use that as input for both platform documentation and future product work. + +## Meeting Context + +The discussion was driven by real friction experienced while trying to use the platform in practice. That experience created concern that NeMo Platform is still too difficult for teams that want to move quickly, especially those trying to get from experimentation to working product behavior without a large amount of platform-specific setup. + +There was broad alignment that the platform needs a clearer authentication story, stronger developer experience, and better written guidance. At the same time, the participants acknowledged that the product and platform have been changing rapidly, which has contributed both to real gaps and to inconsistent understanding across teams. + +## Key Themes + +### 1. Preferred Authentication Model + +The dominant architectural position in the conversation was that NeMo Platform should avoid owning full authentication if possible. Instead, customers should be able to bring their own identity provider, and the platform should accept identity information from that upstream system. + +In this model: + +- Authentication is handled externally by an IDP or a gateway. +- NeMo Platform consumes authenticated identity context rather than becoming the identity system itself. +- The platform's core responsibility is authorization: roles, permissions, policy boundaries, and plugin-aware access control. 
+ +This approach was viewed as cleaner from an architecture standpoint and more aligned with enterprise deployment expectations. + +### 2. Current Experience Is Still Too Hard + +Even though the high-level architecture makes sense, the current experience was described as difficult. The pain point is not theoretical. It comes from teams trying to use the platform directly and finding it hard to get running in a straightforward way. + +The conversation highlighted that users who want to move quickly should not need to solve a complicated enterprise authentication integration before they can get value out of the platform. That is especially relevant for startup-style users or internal builders trying to experiment fast. + +### 3. Documentation Is a Critical Gap + +One of the strongest points of agreement was that the platform needs better authentication documentation immediately. Several forms of documentation were implied or explicitly requested: + +- A written authentication story that explains the intended model. +- Recipes for integrating common IDPs. +- Quickstarts for practical setup. +- Enterprise-oriented guidance for service accounts and API key flows. +- Blueprint-style docs that shorten time to first success. + +The group viewed this as one of the fastest ways to reduce friction while deeper product gaps are still being worked out. + +### 4. Authorization Is More Central Than Authentication + +A meaningful distinction was made between authentication and authorization. Authentication is something the platform would ideally consume from upstream systems. Authorization, by contrast, is deeply tied to the platform itself. + +This is particularly important because NeMo Platform needs to support: + +- plugin-defined roles, +- internal platform scopes, +- role bindings, +- and future extensibility for product-specific permission models. 
+ +The current work appears to be more focused on making authorization extensible, especially so plugins can define and use their own custom roles. + +### 5. Missing Scope Translation Layer + +The most concrete technical gap identified in the call was the lack of a mapping layer between external OIDC scopes and NeMo Platform scopes. + +That means the platform may be able to authenticate a user or service principal through an upstream IDP, but it still lacks the internal translation needed to turn that identity data into meaningful platform permissions in a clean and supported way. + +This gap was treated as one of the most obvious missing pieces in the current design. + +## Target Users and Product Tension + +The conversation identified three especially relevant user groups: + +- Enterprise customers who want a robust IDP-integrated deployment model. +- Startup-style builders who want to move quickly and avoid heavy setup. +- ML engineers who want a local-first workflow for experimentation. + +These audiences create competing pressures: + +- Enterprise customers increase the need for mature auth and authorization integration. +- Startup users increase pressure for minimal-friction onboarding. +- Local ML engineers reinforce the need for fast setup and smooth experimentation. + +The discussion suggested that enterprise needs are clearly on the roadmap, while startup-friendly support is understood as valuable but may not yet be formally prioritized. That creates some risk that simpler auth paths remain underdeveloped unless they are tied to broader strategic goals. + +## Startup-Friendly Auth Remains an Open Question + +The conversation explored the possibility of a middle-ground experience for smaller teams. 
Rather than requiring a full enterprise identity integration, the platform might eventually support a lighter-weight setup such as: + +- social login, +- a minimal IDP path, +- a native API key experience, +- or a dedicated auth plugin that can be enabled when needed. + +This idea was seen as appealing from a usability standpoint because many products offer simple login or API key flows without forcing a full identity rollout. At the same time, there was concern that such a feature could easily add product clutter if it is not carefully scoped. + +So the concept has support, but it is still exploratory rather than committed. + +## Enterprise Alignment + +The conversation made clear that enterprise auth is already a known gap in the platform's broader product offering. Enterprise stakeholders want to improve the situation, and auth is considered part of that gap. + +That creates useful alignment: + +- The platform already needs a stronger enterprise auth story. +- The Agent Optimizer team and other platform users can provide concrete pressure and real use cases. +- Improvements in this area are likely to have value beyond a single team. + +The timing was therefore seen as favorable for formalizing the auth approach. + +## Agent Optimizer as a Strategic Driver + +A major product theme in the conversation was that Agent Optimizer may become a central use case for NeMo Platform. The team working on it was described as likely to become a major power user of the platform, with significant influence over what gets prioritized. + +The implication is that: + +- if Agent Optimizer needs better auth support, the platform may respond quickly, +- product gaps exposed by that team are likely to matter, +- and their requests may help unlock additional investment or execution focus. + +This matters because the authentication discussion is not happening in isolation. It is taking place inside a broader shift in how NeMo Platform may be positioned and adopted. 
+ +## The "One NeMo" Platform Vision + +The conversation also surfaced a larger strategic narrative. Across NVIDIA, the word "NeMo" is used in many places, but there is not yet a coherent story tying those efforts together. One view expressed in the meeting is that NeMo Platform could become part of that unifying layer. + +In this framing, NeMo Platform could serve as a path that connects: + +- research outputs, +- product capabilities, +- plugins, +- CLIs, +- and customer-facing delivery. + +Agent Optimizer was discussed as a possible entry point through which customers begin discovering and using a broader set of NVIDIA capabilities. If that becomes true, then authentication and authorization become foundational platform concerns rather than isolated infrastructure details. + +## Current State of Deployments + +There was visible uncertainty during the conversation about what is actually running in development environments and how far authentication support has progressed. + +Some believed that Entra integration had regressed or was no longer meaningfully working. During the call, however, a working login flow appeared to exist in at least one recent deployment. This led to the realization that deployment status, environment freshness, and current feature state are not consistently understood across the team. + +That confusion matters because it suggests two parallel problems: + +- some platform capabilities may exist but be poorly communicated, +- and some gaps may be worsened by weak shared visibility into what is deployed and supported. + +This reinforced the need for better written documentation and stronger internal alignment. + +## Entra as the Immediate Enterprise Path + +By the end of the call, the most practical enterprise path appeared to be Entra-based authentication. The working assumption became: + +- Entra login appears to be functioning in at least some deployment flow. +- A customer may be able to create a service account or API key in Entra. 
+- That identity can potentially be used with NeMo Platform once role bindings and permissions are configured. +- The missing piece is the translation from external scopes to internal NeMo permissions. + +So the challenge is no longer just "can login work?" but "what is the full supported machine and service identity story?" + +## The Current State: Customer Responsibility + +One of the clearest summary statements in the meeting was that, today, authentication is effectively the customer's responsibility. + +That was treated as an accurate description of the current state, but not a sufficient end state. The desired future is not necessarily that NeMo Platform owns all auth directly. Rather, the goal is to reduce burden on the customer by providing: + +- clearer options, +- better docs, +- supported patterns, +- easier integration, +- and possibly selective built-in helpers where they add real value. + +## Possible Deliverables Discussed + +Several potential outputs were implied by the conversation: + +- an auth spec or RFC, +- a formal written authentication story, +- enterprise setup documentation, +- customer-style walkthroughs for Entra, +- quickstart recipes for common IDPs, +- and possibly a proposal for native API key support through a plugin or optional built-in path. + +These were not all formal commitments, but they represent the most concrete work products suggested in the discussion. + +## Agreed Near-Term Next Step + +The clearest next step was practical rather than abstract. Brian planned to try to solve the auth problem in Entra the way a real customer would, then document what was required. + +That was seen as valuable for several reasons: + +- It tests the recommended enterprise path in reality. +- It produces a concrete example rather than an architectural abstraction. +- It generates material that can become documentation. +- It reveals where additional product work is truly needed. 
+ +The call ended with a sense that this path is likely enough to unblock immediate internal needs, even if broader product improvements still remain ahead. + +## Risks and Open Questions + +The meeting surfaced several unresolved tensions and risks: + +- The platform may remain too tactical if it does not anchor on a clearer long-term product direction. +- Startup-friendly auth may continue to lag if it is not tied to roadmap priorities. +- The lack of a shared mental model across teams may continue to create confusion about what works today. +- Enterprise auth may be directionally correct but still hard to operate until scope mapping and documentation are complete. +- Agent Optimizer may become the main driver of platform evolution, which is helpful for momentum but could create tension with other customer needs. + +There is also a more strategic uncertainty: customers are not yet clearly asking for Agent Optimizer as a product category, but that may simply be because the product has not yet been clearly delivered or positioned. + +## Overall Assessment + +The meeting produced a coherent directional answer, even if it did not produce a finished auth plan. The direction is to rely on external identity providers for authentication, invest inside the platform on authorization and plugin-extensible roles, and close the near-term usability gaps with documentation and a missing scope translation layer. + +This is a reasonable enterprise-oriented architecture, but it is not yet a complete customer experience. The platform still needs better written guidance, validated integration paths, and clearer support for service accounts, API keys, and scope mapping. The next useful move is to document the Entra path from the perspective of an actual customer and use that exercise to sharpen both the product and the docs. + +## Action Items + +- Validate the Entra-based customer path for service accounts and API-key-style access. 
+- Document every required step, dependency, and role-binding assumption in that flow. +- Turn that walkthrough into internal documentation or a draft quickstart. +- Write down the formal NeMo Platform authentication story as a spec or RFC. +- Define the missing OIDC-scope-to-NeMo-scope translation layer. +- Continue evaluating whether a lighter-weight startup-friendly auth option is worth supporting. diff --git a/spec/brian-auth-call.md b/spec/brian-auth-call.md new file mode 100644 index 0000000000..dc36cca446 --- /dev/null +++ b/spec/brian-auth-call.md @@ -0,0 +1 @@ +The gap between no auth and the "I know everything about an IDP" crew. Um, I think if we can document a blueprint that makes that easy for someone coming out with speed, that's an okay answer. I do think it's going to be a point of friction. And ultimately, to be clear, like, you all are the deciders on this, not me. But, uh, I just, I feel like, uh, I have a certain opinion from, like, basically our context was, we were building our own product. We were told like, hey, this platform might be where the NVIDIA team is interested in us helping out. So we tried to use it and like, it was hard to use. And I think a lot of that now, knowing internally, was seeing, uh, all of the changes that you were sort of alluding to. And, um, so I'm probably feeling particularly sensitive to not, uh, or to trying to work towards addressing that problem. Whether that's in documentation blueprints or like actual application code. But, um, but yeah, like technically, what you're describing totally makes sense to me. Um, use the IDP for what it's good at, keep the concern out of our system. Like, I totally buy that from an architecture perspective. Yeah. And we should absolutely have recipes on how to do this. Right. That would be a gap in the developer experience, for sure. And I think there could be like a middle ground, like where you set up like a quick social login or something like that. 
So, I don't see why that would be hard. I mean, you can do this on Grafana, for example. Yeah, it is kind of one of those things that every product I use has this option, right? Um, and they're not asking you to integrate a full IDP, but we should be able to implement a minimal IDP. There are some gaps in auth, for sure. I actually have like a long list of gaps because I took this on last week to look at this. Yeah, more broadly, right? It's like at the user level and everything's cool. Yeah. But, um, but, like, I hear you. I think there could be, I think it would make sense. And the target audience that I see for that is, uh, a startup, startup is like one of the target audiences I see for the NeMo Platform. Like they want to build on the latest, greatest whatevers. But they're just a startup, they want to get up and running. They want to, like, have users and be off to the races, right? They shouldn't have to do complicated things. Um, or, or they could, that, that could be the experience. I don't know if we're gonna go in that direction or not, but, um, like maybe it is just gonna be a little piece of the puzzle, or maybe it's gonna be a full, it could be a full application. No one really knows, but the 2 use cases that are really, really, the 3 use cases that are really, really strong are Agent Optimizer, which is what you guys are going to be working on. So whatever you guys ask for, you're basically going to get it. But my only concern is, is it going to override the other people that also want things and stuff like that? Yeah, yeah, yeah. And so Yomini and Santiago, they are very much on the, um, enterprise side of things. They want to sell an enterprise product. And we have not done a very good job of this. There's huge gaps in our enterprise offering. People kick the tires, but no one's really buying, they want to close the gap. And auth is actually part of that gap. So, so there's alignment here. 
So this is a good time. The 3rd gap is, or the 3rd customer that I see that we want to serve is the ML engineer, the open source ML engineer, who just wants to do some experiments locally. And I think there's some overlap with what you guys are going to be doing with Agent Optimizer because we absolutely want that local-first experience. Yeah, totally. Yeah, because you want to tie your code and what you're actually doing, and be able to make changes and run experiments, that sort of thing. So yeah, exactly. Okay. Um, that makes sense, saying those are both relevant to us. Can you, you can just send me a doc so we don't have to spend too much time if there is one. What else is happening with auth, like what else is in the bucket? Because it's like, if we support IDPs and we don't want to handle authentication, I'm curious what else is actually on the list of, like, needs to be done. Let me bring up the list here. Amazing. I don't, um, I was thinking that, so we're currently making it easier for plug-ins to like plug into the authentication mechanism, so that they can define custom roles and all that sort of stuff, which would include you as one. So that would include you as well. So you're going to be making plug-ins. We want you to be able to define custom roles, and... That's all authorizing, right? That's part of authz. Yeah, that's actually, and that's actually mostly what we're focused on. In terms of, like, integrating, like, all the core pieces, authz is fairly central to a number of different problems. Authentication, on the other hand, we're really kind of hoping to kind of have it be like a really simple, like, header-based layer system. Maybe there's a gateway, um, that we, that we provide, that you can basically bring your own IDP and plug right in, and there's a, there's a story for every single IDP. I kind of see that as that is where things are going. 
I think in the short term, um, probably just writing docs on how to integrate these things and maybe having like a quick start that, like, find a base quick start where you can just integrate GitLab or GitHub or whatever, and it just works. Uh, might be helpful, but, uh, I don't see any, uh, like, this is the pushback we often get, is like, who's the customer? Who wants this? Enterprise doesn't want that. Uh, ML engineers don't want it. So who actually wants it? So that stuff, well, might get deprioritized. I mean, startups is an answer, right? What you mentioned already. Yes. But it may not be a target customer. Startups are not on anyone's radar right now. I have it as a customer because I feel like that fills out the product. Like in terms of NeMo platform. And to give you some context, this year, they came up with this One NeMo idea. You've probably heard it? And no one knows what that means. But everyone knows what the problem is. And the problem is, the word NeMo is used everywhere across different teams, silos, everything. It's everywhere, and there's no coherent NeMo story. Everyone just kind of uses NeMo randomly and there's no coherent vision for One NeMo. There's an idea that One NeMo could become, or the NeMo platform could become, the One NeMo story, and the way I see it, and this is just my perspective, is there's a place for all the research, all the products, everything everyone was doing to kind of find a, uh, a pathway to the customer. To me, that's what One NeMo could be, but no one really knows what it is. So we're defining it as we go. But to me, that's kind of where I see NeMo platform fitting in, is we're like a hosting service for NVDF CLIs. On a very simplistic level, right? And so you can kind of see where, you know, Agent Optimizer really fits in here. Um, but we don't want to just do Agent Optimizer.
We want to do Agent Optimizer that talks to all the NVIDIA products and we want them all talking to each other. No one knows how to do this yet, but this is the brave new world we're kind of marching into. Yeah, I mean, makes sense to me. I don't hear anyone talking about this, by the way. Um, which is concerning. Yeah, everything seems to be pretty tactical. Like, what do we need to do to get to the next step, the next step? Um, but I think we need to keep, uh, our eye on, like, okay, where is it going? Yeah, let's be tactical, but with an eye to where we are going. And, um, because I think it makes a big difference, especially for what you're doing, because I think they see Agent Optimizer as, like, the core entry point for customers to discover NVIDIA products. Possibly. And so having deep integration with Agent Optimizer and all the other possible services and plug-ins that could be available with One NeMo is kind of critical. I think that's what John Cohen is thinking, but I can't read his mind. I'm just going on what I've heard. Totally. I mean, I think it's like an overarching problem that most people building agents have, right? So I totally can see how that might be the common thread that starts piecing it together, and then, like, my perception of the platform is, like, this has the pieces, which you may or may not use all of, that enable that use case and many others, but, uh, if that one is, like, the most compelling, we think, in market, then amazing. Yeah. So, yeah. Your, um, take on the platform, uh, I kind of look at it as, like, you guys are gonna be, like, star power users of the platform. Um, and you're gonna be making lots of requests and, like, if there's function that's missing, if things don't serve the user experience, you know, I would expect you to be like, hey, this doesn't work. And, um, and to push hard on us.
And then when you push hard on us, we're going to get the resources to make that happen because I'm pretty sure your effort is going to drive everything. That's my sense. You're the 1st person to tell me that, but, uh, it's encouraging. I feel like there's that. Uh, yeah, there definitely seems to be momentum. Yeah, yeah. This is coming straight down from John Cohen. When John Cohen wants something done, it happens really, really quickly. Um, and I think the pivot towards Agent Optimizer is coming straight from him and Carrie. They're both aligned. I think they both see the potential here. I don't think product really sees the potential for customers. Customers aren't asking for this yet. Um, they are asking for customizer and fine tuning, but they're not asking for agent optimizer, but that's because we don't have a product yet. They don't know what to ask for. They got agents. People view it as their job right now, right? Like as their product, which hopefully we aren't, like, cannibalizing, but, like, they want to own the agent and make it good as their product offering. So they wouldn't ask you to do that because that's, like, core to their business. Maybe. I don't know. Yeah, I'm hoping that you won't have the same pressures that product has, but you'll have a bit of a runway to, like, deliver something before the pressures, like, set in, um, and, uh, as soon as you have something, I'm hoping that we have a delivery platform that can get it immediately into customers' hands. And you shouldn't have to worry about it because that's an enormous amount of effort that goes into that, but other than, like, getting it ready as a plug-in, everything else, the whole deployment, build, publishing process should be taken care of for you by other teams that are working side by side. Awesome. That's great. Um... Thank you for all the context. Super helpful to hear.
Uh, then I, I've talked to a couple folks, but it hasn't always been, uh, I think just an FSA motion. It's not basically where things stand or where we're proud either. Um, some of what I'm saying might be a bit aspirational too. Well, sure. No, I get it. I, uh, I've been on the other side of this conversation too. But it's, uh, definitely an appealing vision, and I feel like that is necessary, so that we're not changing what we're doing every couple months because we shift the customer focus or whatever. So for auth specifically, I'd love to get your take. So the next steps that came out of our previous meeting were basically, like, okay, go... I guess you mentioned some sort of spec or RFC. I don't exactly know what that would be if we're just saying rely on the IDP. Can you speak to that a little bit? Um, what you were thinking there? Yeah, I think it would just probably put it towards, um, because it doesn't exist. Yeah, it exists in people's heads, it just needs to be written down. Uh, just formalizing, uh, this is our authentication story. Um, these are the use cases that it handles. Here are the different cases we want to handle, such as API keys and VAPI keys, all that sort of stuff. Here are the possible ways that you could support it. Here's the recommendation, here are the gaps in the product. Um, and, uh, so we can start filling the gaps, uh, which could be documentation, or any glaring issues. There is a big gap, and the gap is the mapping between, um, OIDC scopes and NeMo platform scopes. There's no translation layer yet today. To me, that seems like an obvious, like, no-brainer. It just doesn't exist. So I see that as, like, the one gap that I can think of. Um, but if we have that, then I think it's fairly straightforward. Yeah. Okay. Cool. Um, what would be most helpful from, like, a customer, like, me as a customer perspective? Because I don't expect we'll be driving this.
Uh, I am certainly eager to see, even internally, like, the recommended Entra path. If someone goes through that, maybe it's me or maybe it's someone else. Uh, I think we'll find what this actually looks like at an enterprise company who are target customers to, like, set up a, uh, actual, like, service-account-type principal, and then, I mean, hopefully that turns into docs, but that may also turn into, like, hey, this necessitates, you know, multiple IDP support because it turns out we need Keycloak to handle API keys because Entra doesn't, or whatever that might be, right? Yeah, and I don't think that would necessarily be the case for Entra, but. Yeah. Is that tracking? Like, I guess I'm trying to figure out what's next because it's like, oh, I guess we just support it, but it's just a pain. That's kind of my perception of the current state. Is that fair? The current state would be, um, it's a customer responsibility. Right. That's not necessarily a great state. I think we want to close the gap there so that it's not all on the customer. Um, I think in terms of optionality, uh, you know, what are the options we can give customers, uh, to make their job easier, that they can easily opt out of if they don't need it? Um, and so I see an easy way forward as having just a, uh, an auth plug-in that supports, uh, API keys natively within the platform, if that's what they want. But I can very much see this being a burdensome feature that just clutters the product, like, very easily. But it could have its own domain and, uh, yeah, sort of plug into the FastAPI middleware, and, uh, that was one of the possible solutions that I was thinking of proposing. Oh, just so you know, like, we had, um, Azure, like, Entra working in April, um, we just don't have it working now because we, uh, took one step forward, 2 steps back. Okay, silly question.
How am I logging in then? Right? Like, you don't... don't need to worry about logging in right now. The dev instance... Does it have no auth? No auth. Well, that's one of the deployment possibilities. It's just, or like, which are hosted in Kubernetes or? Yeah, in Kubernetes. Yeah, you can port forward. That's one option. You just start it in no-auth mode. The idea is we should be able to add auth at some point and it shouldn't affect anything that you're doing. It should be an optional thing that you could just disable. Totally. If I go to studio or I try to, like, visit studio right now and then. Yeah. I'm with your... Hit with a login, right? Like, what is this if it's not Entra? It's definitely Entra, yeah. Okay, cool. And so what is it you're logging into? I didn't see what you were. This is the, like, dev instance. Yeah. I believe this is, uh, probably April. This is a version, like a stale deployment. Yeah, it's a little old. So I think we stopped releasing at the end of April. Uh, we published 2603. That went out to some customers, mostly internal, I think. Um, and then we did this big pivot towards plug-ins, and so we're still recovering from that. That's what's happening. We have not deployed to dev probably in over, yeah, 2 months. I think that this is more recent. Um, so like this deployment, this is just another dev deployment. I don't need to get bogged down by the details, but this was a Helm chart, like, from yesterday. Built within the platform and it has changes from last week on it. Oh, okay. And then, too... Okay, because I'm the one working on Helm charts, so I'd love to. Okay, next. Well, you tell me if it's a super stale Helm chart, and somehow I magically have the changes that I made last week. Because weird things could totally be happening. But, yeah. Entra is working on this deployment. Okay. In this, like, and this is, like, managed within the NeMo dev.
Or it's like PRDUOCI cluster. I can't remember any of the names. Sorry, but it's, you know, Phil's team's Kubernetes setup. Okay. So maybe Entra is more working than we thought it was. It's possible. Let me, I was just talking to Phil. I mean, some of that message. And this has definitely been a moving target, what you said about dev was true. And I have no idea who's working on anything, so that's part of the challenge. Yeah, you want to log in and see what it says? I mean, so I have it like, okay, yeah, I remember. Erin. This is just a 1000 and Gargita window. But, like, this is roughly working. Oh wow. This auth flow is happy. Okay, this is a stale link, but the point is, I can view stuff. Did you base off of 2603 or? Uh, no, this is like... Uh, this is more recent. These changes were made within the last week. Oh, and, uh, you made the change to the NeMo platform repository, the open source one or? Okay. I gotta reach out to... Yeah, it's okay. They may be way further than I thought. Yeah. Cool. Yeah, I only say this in case it's helpful as for like... I need to know this. Amazing. Uh, so that's great news. Like, in theory, we're hooked up to Entra, and I can now create a service account with some IT team that I don't know about. And there's my machine auth. Is that, well, from your perspective? Uh, yeah, whatever, uh, Entra offers, I think would, yeah, so if you create an API key with Entra and then associate it, you might have to do some role bindings on it. Which is kind of, you know, it would be nice to have kind of an integrated experience. But I think the idea is what you do is you define API keys in Entra. They would have a specific OIDC scope, and that would get mapped to something in NeMo platform. So using some mapping layer that doesn't exist yet. I'm imagining it would be something like that. Yeah. Okay.
Why don't I take a next step of trying to figure out how to solve this problem in Entra the way I would as a customer? And that can be, like, an input into the work you're doing if it's helpful, or you can ignore it. Um, I think that would go to some most possible thing we could do. And then as long as that works, um, that gives us at least one data point and we could do variations on that if customers get stuck. Yeah. Um, great. What's the best way for me to follow along on the broader auth work you've started leading here? Um. Good question. Are you on SWDL Air Colab? Yes. Um, that's a big one. Basically just the public channels. If that's the answer, that's totally fine. And SWDL, NeMo platform development. Uh. Things would get posted there. Those are the 2 big ones. Yeah. If you have trouble with anything on Phil's side of things, like with the deployment or something like that, you'd go to SWDL, NeMo LLM platform. I mean, which is a terrible name, but. Yeah, they're all kind of the same name in my mind, but I just usually check all of them. Um, amazing. That's helpful. I'm feeling a little hazy on, like, what happens next as far as, like, timelines of that documentation, but I do think we're probably unblocked for our internal use cases, just to be transparent, like, how I feel coming out of this. Yeah, if you want to document what you do to get it working on Entra, that would be super fun. Like, just whatever you had to do to get it to work, that would be helpful. And then we have a documentation point we can draw. Yeah, totally. Okay. We'll start there and then I'm sure we'll talk more. I think the, like, ongoing, how can we make this easier for the, uh, startup-type crew is probably going to be relevant. It's got the interest of both Eric and I think Ian as well from the product perspective. Um, and so, I'm sure I'll bug you. Something about that, we can talk. Sorry, yeah. Door's always open
to you, and yeah, if you ever need clarification, happy to clarify whatever. Like, one of the challenges we've had is we lack a kind of shared model of how things work, because there just hasn't been enough communication, like, broadly. And so if you feel like you don't understand or you're, like, alone or you don't know what's going on, absolutely reach out. Like, it's, uh, yeah, door's always open, schedule time, um, or just ping me, and if I have nothing on my schedule, I can usually jump on a call. Amazing. I really appreciate it. I'm sure I'll be in touch. Thank you for your time. No problem. Take care. Bye. \ No newline at end of file diff --git a/spec/caller-execution-hints-and-profile-plumbing-spec.md b/spec/caller-execution-hints-and-profile-plumbing-spec.md new file mode 100644 index 0000000000..58fb574e26 --- /dev/null +++ b/spec/caller-execution-hints-and-profile-plumbing-spec.md @@ -0,0 +1,90 @@ +# Caller Execution Hints And Profile Plumbing Spec + +## Summary + +This spec captures a separate platform problem from provider resolution itself: + +- how caller-supplied execution preferences are represented +- how those preferences are transported through plugin APIs +- how consistently those preferences are honored today + +The immediate focus is profile plumbing, but this document also provides a place to discuss whether caller-visible APIs should eventually expose more explicit execution hints such as provider or mode preferences. + +## Problem + +Today, caller-facing execution controls are inconsistent across layers. + +At a high level: + +- plugin-facing submit surfaces are largely profile-oriented +- the lower-level Jobs API is provider-and-profile oriented through `platform_spec` +- some plugin routes do not yet consistently honor caller-supplied `profile` and `options` + +This makes it unclear what a caller can actually control and how reliably those controls propagate through the stack.
+ +## Current Behavior + +### Plugin-Facing Submission + +The plugin CLI and scheduler expose caller-facing inputs such as: + +- `--profile` +- `-o` / `--options-file` + +The submit path sends `spec`, `profile`, `options`, and metadata in [packages/nemo_platform_plugin/src/nemo_platform_plugin/scheduler.py](/Users/rsadler/src/nemo-platform/packages/nemo_platform_plugin/src/nemo_platform_plugin/scheduler.py:143). + +However, the newer `add_job_routes(...)` path explicitly notes that `profile` and `options` are not yet fully threaded through the request model and may currently be silently dropped server-side in [packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/routes.py](/Users/rsadler/src/nemo-platform/packages/nemo_platform_plugin/src/nemo_platform_plugin/jobs/routes.py:30). + +So today, even the simpler profile-oriented contract is not fully consistent. + +### Jobs Service Submission + +The lower-level Jobs API accepts a `platform_spec` directly in [services/core/jobs/src/nmp/core/jobs/api/v2/jobs/schemas.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/api/v2/jobs/schemas.py:121). + +At that level, the caller can effectively control both: + +- provider +- profile + +because each step executor in `platform_spec` carries those fields explicitly. + +This means the plugin-facing and Jobs-facing caller models are not the same. + +## Why This Is A Problem + +- callers do not have one clearly documented execution-control contract +- profile plumbing is not consistent across plugin submission paths +- the platform does not have a settled answer for whether caller-visible APIs should remain profile-only or allow more explicit execution hints +- this ambiguity makes it harder to reason about what should be resolved automatically versus what should be caller-directed + +## Goals + +- Define one clear caller-facing execution-control contract for plugin submission APIs. 
+- Make caller-supplied `profile` handling consistent across plugin routes. +- Decide whether caller-visible APIs should remain profile-oriented or also support explicit provider or mode hints. +- Separate caller intent from plugin/provider resolution in a way that remains understandable to users. + +## Non-Goals + +- This spec does not define the provider/profile resolution algorithm itself. +- This spec does not define runtime availability detection or the Jobs-owned availability API. +- This spec does not define long-term capability-versus-provider data modeling. + +## Key Questions + +- Should plugin-facing APIs expose only `profile`? +- Should plugin-facing APIs also expose provider hints? +- Should plugin-facing APIs expose execution-mode hints such as host subprocess versus container preference? +- How should caller hints interact with plugin-supported providers and Jobs-reported availability? +- Which caller controls are required for advanced use cases, and which should remain internal platform details? + +## Recommendation + +Treat this as a separate design problem from execution resolution itself. + +The current execution-resolution proposal should assume only that caller intent may exist. 
This document should decide: + +- what exact caller-visible controls exist +- how they are transported +- how they are validated +- how they interact with plugin and Jobs behavior diff --git a/spec/capability-vs-provider-execution-model-spec.md b/spec/capability-vs-provider-execution-model-spec.md new file mode 100644 index 0000000000..1ea18578d3 --- /dev/null +++ b/spec/capability-vs-provider-execution-model-spec.md @@ -0,0 +1,155 @@ +# Capability Versus Provider Execution Model Spec + +## Summary + +This spec captures a future architectural question that is intentionally out of scope for the current execution-resolution proposal: + +- are `cpu`, `gpu`, and `gpu_distributed` best modeled as providers +- or are they more accurately modeled as capabilities that can be satisfied by different providers and backends + +The current repository often treats `subprocess`, `cpu`, `gpu`, and `gpu_distributed` as peers in one selection space. That is useful for near-term cleanup, but it may not be the correct long-term data model. + +This document records the follow-up design problem so the platform can return to it later without blocking the current work. + +## Problem + +There is a modeling mismatch in the current terminology. + +- `subprocess` describes an execution mechanism +- `docker` and `kubernetes_job` describe backend implementations +- `cpu`, `gpu`, and `gpu_distributed` often read more like workload requirements or capabilities than execution mechanisms + +That mismatch becomes more obvious when a single backend instance may be able to satisfy multiple capabilities. 
+ +Examples: + +- a host subprocess executor may satisfy `cpu` +- that same host subprocess executor may satisfy `gpu` if the host has GPU access +- a cluster backend may satisfy `cpu`, `gpu`, and `gpu_distributed` +- one backend controller may manage all three without implying three different controllers + +If that is the real architecture, then `cpu`, `gpu`, and `gpu_distributed` should not necessarily be modeled as the same kind of thing as `subprocess`. + +## Why This Matters + +If capabilities and providers are conflated, several problems follow. + +- the resolver has to treat requirements and mechanisms as if they were interchangeable +- the platform may appear to need separate controllers for each top-level category when one backend instance can satisfy several of them +- plugin intent becomes harder to express precisely +- it becomes harder to represent cases like "GPU required, subprocess acceptable if the host has GPU capability" + +The immediate example is: + +- a job requires GPU capability +- Docker is unavailable +- the host can still run GPU work directly +- a subprocess-based executor with GPU capability should be able to satisfy the requirement +- if no available executor satisfies GPU capability, the job should fail immediately + +That is easier to express if GPU is modeled as a capability requirement rather than as the executor type itself. + +## Candidate Model + +A more explicit long-term model may separate at least four concepts. + +### Capability + +What the workload requires. + +Examples: + +- `cpu` +- `gpu` +- `gpu_distributed` + +### Provider Or Execution Mode + +How the workload is meant to run. + +Examples: + +- `subprocess` +- `container` +- `batch` + +The exact vocabulary is open. The important thing is separating mechanism from requirement. + +### Backend + +The implementation that actually runs the job.
+ +Examples: + +- host subprocess launcher +- Docker runtime +- Kubernetes job controller +- Volcano or Slurm batch backend + +### Profile + +A named configured instance or policy that binds the above concepts to concrete runtime configuration. + +Examples: + +- a local subprocess profile with GPU access +- a Docker CPU profile +- a Kubernetes GPU profile +- a distributed batch profile + +## Resolver Implications + +Under a capability-oriented model, resolution becomes a match across: + +- caller constraints +- workload capability requirements +- execution-mode preferences +- available profiles +- backend capability advertisements + +That would allow the platform to express cases such as: + +- require GPU capability +- prefer subprocess if available +- otherwise use a containerized backend +- fail if no executor with GPU capability exists + +This is different from the current simpler model, where top-level selection is done directly among `subprocess`, `cpu`, `gpu`, and `gpu_distributed`. + +## Controller Implications + +This modeling question also affects controller design. + +If one backend instance can satisfy multiple capabilities, then the platform should not be forced into a one-controller-per-capability shape. + +A more natural model may be: + +- one backend instance +- one controller +- multiple advertised capabilities +- multiple profiles or policies bound to that backend + +This would avoid duplicating control surfaces when the runtime is actually shared. + +## Scope Of This Spec + +This spec is exploratory and intentionally separate from the current execution-resolution cleanup. + +It does not propose immediate code changes. 
+ +It exists to preserve an important architectural question: + +- whether the long-term model should separate capability, provider, backend, and profile more explicitly than the current repository does + +## Recommendation + +Keep this question out of the near-term subprocess-resolution work so that the current cleanup can stay focused. + +Return to it later when the platform is ready to revisit: + +- execution data modeling +- controller ownership +- backend capability advertisement +- profile semantics + +At that point, the platform can decide whether `cpu`, `gpu`, and `gpu_distributed` should remain top-level providers or become capability classes matched against more general execution providers and backends. diff --git a/spec/core-role-default-grants-spec.md b/spec/core-role-default-grants-spec.md new file mode 100644 index 0000000000..f59c87e007 --- /dev/null +++ b/spec/core-role-default-grants-spec.md @@ -0,0 +1,377 @@ +# Core Role Default Grants Spec + +## Summary + +This spec explores three closely related follow-up questions for plugin-defined authorization data: + +- how NeMo Platform should replace the current heuristic that automatically grants plugin-defined permissions to core roles based on permission suffix +- how plugin-defined roles should be surfaced in IAM, CLI, UI, and docs +- whether plugin-defined roles should remain global or become service-scoped in a future design + +This is intentionally separate from `plugin-service-authz-spec.md`. + +The plugin service authz spec preserves current behavior and keeps role surfacing out of scope. This document explores better long-term alternatives. + +## Context + +This follow-up work needs to be understood in the context of the current NeMo Platform auth model. + +### Current Auth Flow + +Today the platform separates identity, scopes, roles, and permissions. + +1. An OAuth/OIDC provider authenticates the caller. +2. The platform extracts a NeMo principal from the request.
+ - principal id + - principal email + - principal groups +3. The auth client sends the request method, path, principal, and token scopes to the Policy Decision Point (PDP). +4. The PDP evaluates: + - endpoint permissions from `authz.endpoints` + - endpoint scopes from `authz.endpoints` + - role-derived permissions for the principal +5. The PDP returns allow/deny. + +### Where Permissions Come From + +OAuth/OIDC does not directly grant NeMo Platform permissions. + +Instead: + +- OAuth/OIDC provides identity and optionally groups/scopes. +- NeMo Platform binds that identity to roles. +- Roles grant permissions. + +Role bindings are stored as platform data and loaded into the auth bundle at runtime. + +This means: + +- users do not receive platform permissions directly from the OAuth provider +- users receive platform permissions indirectly through NeMo role bindings +- those role bindings may target a principal id, email, or group + +In other words, directory attributes can be inputs to authorization, but the actual permission grant is owned by NeMo Platform. + +### Current Scope Behavior + +The current system does use scopes, but only as an optional coarse-grained gate layered on top of the main permission model. + +To avoid ambiguity: + +- token-provided scopes: scopes extracted from OAuth/OIDC claims or equivalent request auth context +- endpoint-required scopes: scopes declared on a NeMo endpoint rule + +Current PDP behavior is: + +- if token-provided platform scopes are absent, the scope gate is skipped +- if token-provided platform scopes are present, they must satisfy the endpoint-required scopes +- if endpoint-required scopes are empty, the scope gate passes regardless of token-provided scopes + +### Current State Of Plugin-Defined Roles + +The backend can already store and evaluate arbitrary role names. + +However, plugin-defined roles are not currently supported as a first-class end-to-end product feature. 
+ +Today: + +- backend authz data can contain arbitrary global role names +- role bindings can target arbitrary role strings +- Studio workspace-member flows only understand the core workspace roles +- CLI role arguments are mostly free-form, but the surrounding UX and docs still center on the core roles + +So this follow-up spec is not only about policy shape. It is also about whether plugin-defined roles should become a real surfaced platform concept. + +## Current Behavior + +Today the platform automatically grants plugin-defined permissions to core roles using a suffix heuristic. + +Current heuristic: + +- permissions ending in `.read` or `.list` are granted to `Viewer` and `Editor` +- all other permissions are granted to `Editor` + +This behavior exists today in the auth merge logic and is preserved by `plugin-service-authz-spec.md`. + +## Problem + +The current heuristic is simple, but it is also brittle and semantically weak. + +Problems: + +- permission suffix does not always reflect real sensitivity +- different services may use different naming patterns +- some permissions should not be granted to core roles automatically at all +- plugin authors may not realize they are implicitly granting access to core roles +- security review becomes harder because grants are inferred rather than declared + +Examples: + +- a permission ending in `.read` may still expose sensitive operational state +- a permission ending in `.exec` might not belong in `Editor`, but the heuristic would grant it there +- different plugin teams may invent new permission suffixes that do not fit the model + +## Goals + +- Replace suffix-based inference with a more explicit and reviewable model. +- Preserve a simple authoring experience for common services. +- Allow services to declare core-role grants intentionally. +- Avoid surprising implicit privilege expansion. + +## Non-Goals + +- Redesigning the plugin path-rule model. +- Replacing service-scoped roles. 
+- Changing how role bindings are stored. +- Changing the initial plugin authz implementation plan. + +## Design Principles + +### Principle 1: Core-Role Grants Should Be Explicit + +If a service wants a permission to be granted to `Viewer`, `Editor`, or `Admin`, that should be visible in the service-owned authz definition. + +### Principle 2: Safe Defaults Matter + +It should be hard to accidentally expose a new permission broadly through inference. + +### Principle 3: Common Cases Should Stay Ergonomic + +Services will often want straightforward defaults for read/list vs mutating operations. The replacement should not force every service to write large repetitive grant maps unless necessary. + +### Principle 4: Role Surfacing Should Follow an Explicit Product Model + +If plugin-defined roles appear in IAM, CLI, UI, or docs, that exposure should come from an intentional platform model rather than falling out of backend permissiveness. + +### Principle 5: Role Scope Should Be an Explicit Policy Choice + +The platform should make an explicit decision about whether plugin-defined roles remain global or become service-scoped later. That choice should not be implied accidentally by naming conventions or backend permissiveness alone. + +## Additional Problems: Role Surfacing and Role Scope + +The current backend can already store and evaluate arbitrary role names, but the user-facing surfaces are still strongly oriented around the platform core roles. 
+ +Examples: + +- workspace member flows assume `Viewer`, `Editor`, and `Admin` +- role-selection UX uses hard-coded core-role labels and descriptions +- documentation is written around the core role hierarchy + +So there are really three separate follow-up questions: + +- how core roles should receive plugin-defined permissions by default +- how plugin-defined roles should become visible and manageable across product surfaces +- whether plugin-defined roles should remain global or become service-scoped + +These questions are related because all three determine what role model administrators actually see and use. + +## Role Scope Options + +### Option 1: Keep Plugin-Defined Roles Global + +Pros: + +- preserves current backend behavior +- simplest migration path +- no additional namespace or validation rules + +Cons: + +- weaker isolation between plugin-defined role sets +- role names from different plugins may collide semantically +- harder to reason about ownership boundaries later + +### Option 2: Require Service-Scoped Role Names and Grants + +Examples: + +- `agents.Reviewer` +- `customization.Approver` + +Pros: + +- clearer ownership +- easier to validate role-to-permission boundaries +- reduces accidental cross-service privilege expansion + +Cons: + +- changes current behavior +- adds validation and UX complexity +- may be unnecessary for the first implementation + +## Role Surfacing Options + +### Option A: Surface Plugin-Defined Roles Immediately Everywhere + +Plugin-defined roles would appear in IAM, CLI, UI, and docs as soon as they exist in the normalized authz model. 
+ +Pros: + +- consistent with backend behavior +- no hidden role model +- administrators can use plugin-defined roles directly + +Cons: + +- requires a role catalog and metadata model +- role-management UX becomes more complex immediately +- documentation and workspace-member flows need redesign + +### Option B: Keep Plugin-Defined Roles Backend-Only Initially + +Plugin-defined roles would participate in authorization but would not immediately be surfaced in all user-facing management flows. + +Pros: + +- smaller initial product-surface change +- allows the authz redesign to ship without redesigning role UX + +Cons: + +- creates a gap between backend capability and visible platform behavior +- makes plugin-defined roles harder to adopt intentionally + +### Option C: Surface Plugin-Defined Roles Through an Explicit Role Catalog + +Introduce a dedicated role catalog model and API that describes: + +- role name +- description +- owning service +- scope +- whether the role is intended for user-facing assignment + +Then let IAM, CLI, UI, and docs consume that model explicitly. + +Pros: + +- clean long-term structure +- avoids hard-coded core-role assumptions +- separates role evaluation from role presentation + +Cons: + +- requires additional platform work +- larger change than simply preserving current backend behavior + +## Options + +### Option 1: Keep the Current Heuristic + +Pros: + +- no extra authoring burden +- preserves current behavior exactly + +Cons: + +- remains implicit and brittle +- hard to review +- poor fit for security-sensitive services + +### Option 2: Explicit Core-Role Grants Per Permission + +Each permission definition declares which core roles receive it. 
+ +Conceptual example: + +```python +PermissionDef( + id="agents.deployments.read", + description="Read agent deployments", + core_roles=["Viewer", "Editor"], +) +``` + +Pros: + +- explicit +- reviewable +- predictable + +Cons: + +- more verbose +- repetitive for services with many standard CRUD permissions + +### Option 3: Service-Level Default Grant Policy With Explicit Overrides + +Each service declares a default policy for core-role grants, and individual permissions may override it. + +Conceptual example: + +```python +CoreRoleGrantPolicy( + read_like_roles=["Viewer", "Editor"], + write_like_roles=["Editor"], +) +``` + +With explicit override: + +```python +PermissionDef( + id="agents.internal.read", + description="Read internal agent state", + core_roles=[], +) +``` + +Pros: + +- keeps authoring ergonomic +- allows service-level consistency +- allows sensitive permissions to opt out + +Cons: + +- still partly inferential +- requires defining what counts as "read-like" or "write-like" + +### Option 4: No Automatic Core-Role Grants + +Plugin-defined permissions never go to core roles unless explicitly declared. + +Pros: + +- safest +- simplest to reason about +- no hidden behavior + +Cons: + +- more authoring overhead +- makes simple services more verbose + +## Recommendation + +For core-role default grants, recommend Option 3 as the likely best long-term balance: + +- remove the global suffix heuristic +- allow each service to declare an explicit core-role default grant policy +- allow each permission to override that default + +For plugin-role surfacing, recommend Option C: + +- introduce an explicit role catalog model +- use that model to decide what appears in IAM, CLI, UI, and docs +- avoid treating backend role permissiveness as sufficient product-surface design + +This keeps common cases simple while making policy ownership and product exposure much clearer. 
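To make the recommended core-role grant model concrete, the sketch below contrasts the suffix heuristic from Current Behavior with an Option 3 style resolution. It is illustrative only: the dataclass shapes, the explicit `kind` field, and the convention that `core_roles=None` means "use the service default" are assumptions made for this example, not the platform's actual types.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class CoreRoleGrantPolicy:
    """Service-level default grants (hypothetical shape)."""

    read_like_roles: tuple[str, ...] = ("Viewer", "Editor")
    write_like_roles: tuple[str, ...] = ("Editor",)


@dataclass(frozen=True)
class PermissionDef:
    """Permission definition (hypothetical shape)."""

    id: str
    description: str
    # Assumption: the author declares the kind explicitly, so the platform
    # never has to infer it from the permission suffix.
    kind: Literal["read", "write"] = "write"
    # None means "use the service default"; an empty tuple opts out entirely.
    core_roles: Optional[tuple[str, ...]] = None


def legacy_suffix_grants(permission_id: str) -> tuple[str, ...]:
    """The suffix heuristic as described under Current Behavior."""
    if permission_id.endswith((".read", ".list")):
        return ("Viewer", "Editor")
    return ("Editor",)


def resolve_core_roles(
    perm: PermissionDef, policy: CoreRoleGrantPolicy
) -> tuple[str, ...]:
    """Explicit per-permission override first, then the service default."""
    if perm.core_roles is not None:
        return perm.core_roles
    if perm.kind == "read":
        return policy.read_like_roles
    return policy.write_like_roles
```

Under these assumptions, a sensitive permission such as `agents.internal.read` declared with `core_roles=()` resolves to no core roles, whereas the suffix heuristic would have granted it to `Viewer` and `Editor`.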
+ +For role scope, preserve the current global-role behavior in the near term and treat service-scoping as optional future work to be evaluated deliberately rather than introduced implicitly. + +## Relationship To Plugin Service Authz + +This spec is downstream of `plugin-service-authz-spec.md`. + +That spec should preserve current behavior and avoid changing the existing core-role grant semantics during the decorator/path-rule work. + +If this spec is adopted later, it should be implemented as a focused follow-up change to: + +- core-role default grant behavior +- plugin-role surfacing and management +- plugin-role scope rules + +rather than bundled into the initial plugin authz redesign. diff --git a/spec/daemon-group-local-jobs-follow-on-spec.md b/spec/daemon-group-local-jobs-follow-on-spec.md new file mode 100644 index 0000000000..1a50229999 --- /dev/null +++ b/spec/daemon-group-local-jobs-follow-on-spec.md @@ -0,0 +1,95 @@ +# Daemon-Group Local Jobs Follow-On Spec + +## Summary + +This document captures what is currently known about a future `daemon-group` backend for local jobs. + +It is intentionally separate from `jobs-local-remote-unification-spec.md`. + +The local/remote unification spec standardizes only the local daemon control plane plus the `subprocess` backend. + +`daemon-group` remains follow-on work because it introduces a substantially larger lifecycle and control-surface design. + +## Why This Is Separate + +`daemon-group` is not just another way to launch a local process. + +It implies that the managed daemons themselves need a durable control interface and runtime contract. + +That adds complexity well beyond the scope of the core jobs local/remote unification work. + +In Orchard, this was a substantial design and implementation surface. + +The same is likely true here. + +## What Daemon-Group Means + +`daemon-group` would be a backend for long-lived, supervised local processes where simple child-process execution is not enough. 
+ +The intended value is: + +- durable local process ownership +- discovery across CLI invocations +- restart and recovery behavior +- better support for long-lived local workloads than one-shot subprocess execution + +## What Makes It Hard + +The main complexity is that daemon-managed processes need their own contract. + +That likely includes: + +- daemon identity +- process-group identity +- status and health +- readiness +- control operations +- logs +- recovery state + +This is different from the simpler subprocess model where the local jobs daemon can own lifecycle directly without another daemon-facing API layer. + +## Relationship To The Local Daemon + +The local daemon from the jobs local/remote unification spec would remain the top-level control plane. + +If `daemon-group` is added later, the likely shape is: + +- CLI talks to the local jobs daemon +- local jobs daemon schedules a job onto the `daemon-group` backend +- `daemon-group` then talks to one or more managed process daemons or daemon wrappers + +That means `daemon-group` likely introduces a second control interface below the local jobs daemon. + +That extra layer is the main reason it is deferred. + +## What Should Stay True + +Even if `daemon-group` is added later, the following rules from the main jobs spec should remain true: + +- `run` and `submit` are interaction modes only +- all execution still flows through the jobs service contract +- backend choice is explicit and honest +- local log and status access should still go through daemon interfaces rather than direct file inspection from the CLI + +## Likely Design Questions + +- What exact control interface should daemon-managed processes expose? +- Should that interface also use HTTP/REST over UDS, or a different transport? +- How should readiness and health propagate from managed daemons up to the local jobs daemon? +- How should job logs be streamed through the local jobs daemon when the underlying runtime is daemon-managed? 
+- How should restart, shutdown, and orphan recovery work? +- How should version and capability checks work between the local jobs daemon and daemon-managed runtimes? +- How should daemon-group state surface through existing jobs APIs? + +## Recommendation + +Keep `daemon-group` out of the first local/remote jobs unification implementation. + +Ship the simpler architecture first: + +- local daemon control plane +- REST over UDS local daemon interface +- subprocess as the first-class local backend + +Then design `daemon-group` as a dedicated follow-on backend with its own explicit runtime contract. diff --git a/spec/first-class-subprocess-provider-slack-draft.md b/spec/first-class-subprocess-provider-slack-draft.md new file mode 100644 index 0000000000..8ec8ecc4c4 --- /dev/null +++ b/spec/first-class-subprocess-provider-slack-draft.md @@ -0,0 +1,2 @@ +# Slack Draft: First-Class Subprocess Provider + diff --git a/spec/jobs-backed-local-run-ux-and-progressive-start-spec.md b/spec/jobs-backed-local-run-ux-and-progressive-start-spec.md new file mode 100644 index 0000000000..d60d324f0a --- /dev/null +++ b/spec/jobs-backed-local-run-ux-and-progressive-start-spec.md @@ -0,0 +1,93 @@ +# Jobs-Backed Local Run UX And Progressive Start Spec + +## Summary + +This spec captures a separate long-term product and platform question: + +- when local execution becomes jobs-backed, what should happen to today's `run_local(...)` user experience +- how does that transition relate to progressive service start + +This is intentionally separate from provider resolution and subprocess-first-class work. The current execution-resolution proposal can preserve existing local behavior during migration, while this document defines the longer-term direction. 
+ +## Problem + +Today, `run_local(...)` is an in-process execution path in [packages/nemo_platform_plugin/src/nemo_platform_plugin/scheduler.py](/Users/rsadler/src/nemo-platform/packages/nemo_platform_plugin/src/nemo_platform_plugin/scheduler.py:79). + +That gives local execution a lightweight experience: + +- spec validation happens locally +- `to_spec()` runs locally +- a local `JobContext` is constructed +- `job.run(...)` is invoked directly +- the command behaves like a simple synchronous local action + +If local execution moves behind Jobs, the underlying architecture changes materially: + +- local runs become jobs-backed +- subprocess becomes the local execution provider +- Jobs owns persistence, lifecycle, logs, and reconciliation + +That creates a product question: + +- should `run` continue to preserve the current lightweight local UX +- or should it eventually become a thin synchronous wrapper over jobs submission and waiting + +## Current Direction + +For the near-term migration, the platform should minimize disruption. + +That means preserving existing local functionality and user expectations as much as possible while moving the underlying execution model toward Jobs. + +However, that preservation should be treated as transitional compatibility, not the long-term architectural goal. + +## Long-Term Direction + +Long term, the platform should not preserve today's separate `run_local(...)` execution model. + +Instead: + +- local execution should become fully jobs-backed +- the old in-process local execution path should eventually be removed +- any remaining local UX sugar should be justified explicitly as product behavior, not as a separate execution architecture + +The platform should not carry two fundamentally different local execution models forever. + +## Why This Depends On Progressive Start + +The main blocker to removing today's local-only behavior is startup and control-plane overhead. 
+ +If local execution becomes jobs-backed before the platform has a good progressive-start story, users may experience: + +- slower startup +- more visible control-plane machinery +- more operational complexity for simple local runs + +That would make the architecture cleaner internally while making the local user experience worse. + +So the long-term removal of `run_local(...)` should be addressed together with progressive service start and related UX work, not inside the provider-resolution spec. + +## Scope Of This Spec + +This spec is about: + +- local run UX during and after jobs-backing +- the long-term removal of the separate in-process local execution path +- how that interacts with progressive service start + +This spec is not about: + +- provider resolution +- runtime availability ownership +- capability-versus-provider modeling + +## Recommendation + +Treat today's local `run_local(...)` behavior as transitional compatibility during the migration to jobs-backed local execution. + +Do not make preserving that behavior a requirement of the execution-resolution spec. + +Instead: + +- keep current local behavior intact in the short term to minimize migration impact +- plan to remove the separate in-process local execution path in the long term +- resolve the user-facing transition as part of progressive service start and local Jobs UX design diff --git a/spec/jobs-local-remote-unification-spec.md b/spec/jobs-local-remote-unification-spec.md new file mode 100644 index 0000000000..3959f6a394 --- /dev/null +++ b/spec/jobs-local-remote-unification-spec.md @@ -0,0 +1,1880 @@ +# Jobs Local/Remote Unification Spec + +## Summary + +This spec proposes an end-state jobs model where NeMo Platform removes the architectural distinction between "local" and "remote" job execution. 
+ +The key change is: + +- `run` and `submit` remain as interaction modes +- all execution goes through the jobs service contract +- local execution becomes a normal jobs deployment shape, not a separate code path +- `subprocess` becomes the first-class jobs backend for local operation in this spec + +The critical product goal is full API parity between local and remote jobs operation. + +This keeps the platform close to what it already has today while removing the current "jobs" versus "no jobs" split. + +The guiding UX principle for this work is the principle of least surprise. + +The platform should become more consistent and more powerful without becoming more complicated for the common case. + +At the top level, the user should still experience only two deployment modes: + +- local +- remote + +Everything else should support those two modes without forcing extra complexity into the default workflow. + +This spec is broad because local/remote jobs parity is not a single backend change. + +To make local jobs behave like remote jobs, the platform must define all of the following together: + +- how jobs are addressed +- how the local control plane is discovered and reused +- how local and remote targets are selected +- how subprocess inherits code, interpreter, and virtual environment identity +- how task and job storage are represented +- how working directory is defined +- how logs are accessed +- how supporting services are activated +- how lifecycle operations such as pause, resume, and retention behave + +If these pieces are not specified together, the platform will still have hidden local-only behavior even if subprocess becomes a first-class backend. + +So the purpose of this spec is not only to say "use subprocess through jobs." + +It is to define the minimum surrounding control-plane, runtime, and UX contract needed for that statement to actually produce full local/remote jobs parity. 
+ +There is also an important historical reason for the existing split between `run` and `submit`. + +Previously, the platform often treated: + +- `run` as the lightweight local path +- `submit` as the jobs-backed path + +because starting the full platform control plane was seen as expensive and unnecessary for simple local work. + +That concern is valid and remains one of the biggest risks for this project. + +If adopting the unified jobs model makes simple local execution feel heavier, slower, or more operationally confusing than the old `run` experience, then the architecture will have failed a key product requirement even if it is internally cleaner. + +The two primary product risks are: + +1. Progressive platform startup fails to feel lightweight. + + If progressive activation exists architecturally but still feels heavy or slow to the user, then the design has not solved the original problem. + +2. The getting-started story becomes more complicated. + + If users have to learn more control-plane concepts, startup choices, or routing mechanics than they do today for the common local and remote workflows, then the design violates the principle of least surprise. + +So the spec should be evaluated against a simple bar: + +- it must support the needed power and flexibility +- it must not make the common local and remote workflows more complicated than they are today + +## Why This Spec Covers Multiple Areas + +The changes in this document are coupled for structural reasons. + +### Jobs API Parity Requires Control-Plane Parity + +If the jobs API is unified but local daemon discovery, target selection, and reuse are left implicit, local mode still behaves differently in practice. 
+ +That is why this spec covers: + +- local daemon discovery +- foreground/background lifecycle +- target selection +- fail-fast reuse checks + +### Backend Parity Requires Runtime Contract Parity + +If subprocess becomes a first-class backend but continues to expose different working-directory, storage, logging, or interpreter behavior than remote backends, users still experience a split system. + +That is why this spec covers: + +- logical task/job/config storage +- working-directory semantics +- code-root and interpreter identity +- logging access through APIs +- backend-owned retention policy + +### Lightweight Local Mode Requires Service-Activation Semantics + +If local mode is supposed to be lightweight and on-demand, the platform must define how required services are discovered, activated, and observed. + +That is why this spec covers: + +- daemon control interfaces +- readiness and availability reporting +- progressive service activation +- service-level logs and failure reporting + +### Simple UX Requires Explicit Target And Mode Rules + +If the user experience is supposed to remain simple, the spec must say what the defaults are and when users need to think about targets, daemon keys, or endpoints. + +That is why this spec covers: + +- local as the default target +- remote as an explicit named target +- cluster as the main target abstraction +- background versus foreground as lifecycle modes of the same local daemon +- advanced overrides as secondary escape hatches + +In short, the document is intentionally larger than a backend-only spec because the architectural problem is larger than a backend-only change. + +## Primary UX Risk + +The main product risk for this work is unnecessary friction in the local-first workflow. 
+ +Historically, local execution avoided jobs because: + +- users wanted the fastest path to "just run this now" +- many local jobs did not need the full platform runtime +- startup overhead made jobs feel heavyweight for simple development tasks + +This spec must therefore preserve the benefits users expected from the old lightweight local path while still moving everything onto the jobs contract. + +The design should be judged against a simple standard: + +- a user should be able to run a local job synchronously without needing to understand daemon lifecycle, target routing, or service activation details +- if nothing suitable is running, the CLI should be able to start what it needs automatically and transparently +- the default local experience should feel local-first, lightweight, and obvious +- remote operation should also remain straightforward and explicit + +The main failure modes to avoid are: + +- forcing users to learn too much control-plane vocabulary for common local workflows +- making users manually reason about whether jobs services are running +- making local startup feel slow or operationally heavy for simple use cases +- exposing too many knobs in the default path +- preserving hidden differences between local and remote while also increasing perceived complexity + +This is why the spec emphasizes: + +- local as the default target +- transparent daemon acquisition +- progressive service activation +- one jobs API contract +- one target-selection model +- advanced options only when users explicitly need them + +The intended default user experience is: + +- `nemo jobs run ...` should just work locally +- if needed, the platform starts the local daemon in the background automatically +- the user does not need to decide up front between "jobs" and "no jobs" +- switching to remote should also be simple and explicit when desired + +## Acceptance Criteria + +- `nemo jobs run ...` should work in the default local-first workflow without requiring the 
user to pre-start services manually. +- If a suitable local daemon is not already running, the CLI should be able to start it automatically and transparently. +- Local and remote jobs execution should expose the same jobs API contract to clients. +- Subprocess should be a first-class jobs backend rather than a separate local execution path. +- The default local workflow should not require the user to understand daemon keys, service activation, or transport details. +- Remote targeting should remain explicit and straightforward. +- CLI commands that do not require any services should incur no service-startup overhead. +- Jobs unification should not impose daemon acquisition or service startup on CLI flows that can run entirely client-side. + +### Parity Checklist + +The spec should be considered a failure if local and remote still diverge in any major user-visible way across these areas. + +- same jobs API contract +- same top-level target-selection model +- same profile-discovery model +- same logical task/job/config storage contract +- same working-directory contract +- same log-access contract +- same lifecycle-state contract +- explicit and stable local runtime identity +- explicit service-activation and dependency-failure model +- daemon-to-daemon storage and state isolation +- foreground and background local modes differ only by lifecycle attachment, not by jobs semantics + +## Current State + +Today the repo has two different execution models for jobs. + +### Plugin CLI / Scheduler Split + +At the plugin layer, `NemoJobScheduler` still exposes two materially different paths: + +- `run_local(...)` executes a job in-process +- `submit_remote(...)` POSTs to the jobs API + +This means `run` is not just synchronous interaction. It is a different execution architecture. + +### Core Jobs Already Has A Subprocess Backend + +The core jobs service already supports `provider: subprocess` as a real backend. 
+ +That backend schedules a persisted job step, launches a host process, captures logs, manages lifecycle, and reconciles status through the jobs service. + +This is already much closer to the desired model than `run_local(...)`. + +### Current Subprocess Rewrite + +The jobs API currently contains a compatibility rewrite that converts some CPU container steps into subprocess steps when the selected profile is configured as a subprocess profile. + +That rewrite: + +- happens at jobs API ingress +- rewrites a `CPUExecutionProvider` step into `SubprocessExecutionProvider` +- derives the subprocess command from `container.entrypoint + container.command` +- drops container semantics in the process + +This works as a bridge, but it is not an honest execution contract. + +## Problem + +The current split creates avoidable complexity and weakens local/remote parity. + +### Problem 1: `run` And `submit` Mix Interaction Mode With Execution Placement + +Users currently have to learn: + +- `run` means local and in-process +- `submit` means remote and jobs-backed + +That is the wrong abstraction boundary. Whether a user wants synchronous output or an async handle should be independent from where the job runs. + +### Problem 2: Local Execution Bypasses Jobs Semantics + +When `run_local(...)` is used, the platform bypasses the jobs scheduler, reconciliation, persistence, logs, and status lifecycle that the same workload encounters through the jobs service. + +This means local execution does not exercise the same platform semantics as jobs-backed execution. + +### Problem 3: Subprocess Exists Twice + +The platform currently has: + +- subprocess-like local execution via `run_local(...)` +- a real subprocess backend inside core jobs + +Those are overlapping concepts with different semantics and different code paths. 
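As a reference point for the compatibility rewrite described under Current State, the following sketch shows the shape of that translation. The two dataclasses are hypothetical stand-ins written for this example; they are not the real `CPUExecutionProvider` or `SubprocessExecutionProvider` types, and the real rewrite also depends on the selected profile, which is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerStep:
    """Hypothetical stand-in for a container-shaped CPU step."""

    image: str
    entrypoint: list[str] = field(default_factory=list)
    command: list[str] = field(default_factory=list)


@dataclass
class SubprocessStep:
    """Hypothetical stand-in for a subprocess step."""

    command: list[str]


def rewrite_to_subprocess(step: ContainerStep) -> SubprocessStep:
    """Derive a host command from entrypoint + command.

    The image is not carried over, which is how container semantics are
    lost in the translation.
    """
    return SubprocessStep(command=[*step.entrypoint, *step.command])
```

The result has no notion of an image at all, so nothing downstream can tell that the step was originally meant to run with container semantics.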
+ +### Problem 4: Current Rewrite Is A Compatibility Hack + +Rewriting a container-shaped CPU step into a subprocess step based on profile name is useful for migration, but it hides a real contract change: + +- container execution means "run this image with container semantics" +- subprocess execution means "run this host command in the local environment" + +Those are not equivalent. + +### Problem 5: Lightweight Local Development Still Needs Persistence And Discovery + +The desired local experience is lightweight, but it still needs: + +- persistence +- daemon discovery +- duplicate-service avoidance +- explicit start/stop behavior +- log visibility +- clear reuse of an already-running local endpoint + +Direct multi-process SQLite access from arbitrary client processes is not a sufficient control-plane model. + +## Goals + +- Preserve `run` and `submit` as user-facing verbs. +- Redefine `run` as synchronous jobs interaction, not in-process execution. +- Make all execution flow through the jobs service contract. +- Preserve full API parity between local and remote jobs operation. +- Make `subprocess` a first-class jobs backend rather than a side path. +- Allow the local daemon to host `subprocess` in this spec. +- Preserve profile-driven execution selection in the near term. +- Make local daemon reuse and bootstrap transparent and predictable. +- Ensure CLI commands that do not require services incur no service-startup overhead. 
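The goal of redefining `run` as synchronous jobs interaction can be sketched as a thin wrapper over submission. This is a minimal illustration under stated assumptions: the `JobsClient` protocol, its method names, and the terminal state names are hypothetical and are not taken from the real jobs API.

```python
import time
from typing import Protocol


class JobsClient(Protocol):
    """Hypothetical minimal jobs API surface used by both verbs."""

    def create_job(self, spec: dict) -> str: ...

    def get_status(self, job_id: str) -> str: ...


# Assumption: illustrative terminal state names, not the real lifecycle states.
TERMINAL_STATES = frozenset({"completed", "error", "cancelled"})


def submit(client: JobsClient, spec: dict) -> str:
    """Create a job and return its handle immediately."""
    return client.create_job(spec)


def run(client: JobsClient, spec: dict, poll_interval: float = 0.5) -> tuple[str, str]:
    """Create a job, then follow it until it reaches a terminal state."""
    job_id = submit(client, spec)
    while True:
        status = client.get_status(job_id)
        if status in TERMINAL_STATES:
            return job_id, status
        time.sleep(poll_interval)


class FakeJobsClient:
    """In-memory client that replays a scripted status sequence."""

    def __init__(self, statuses: list[str]) -> None:
        self._statuses = iter(statuses)
        self.created: list[dict] = []

    def create_job(self, spec: dict) -> str:
        self.created.append(spec)
        return f"job-{len(self.created)}"

    def get_status(self, job_id: str) -> str:
        return next(self._statuses)
```

Because both verbs share `submit`, the same client works against a local daemon or a remote cluster; only the client's target differs, not the verb semantics.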
+ +## Terminology + +This spec uses the following terms consistently: + +- `cluster target`: a named jobs-routing target selected by the CLI +- `local target`: a cluster target whose control plane is a local daemon +- `remote target`: a cluster target whose control plane is a remote jobs API +- `daemon key`: the identity of a specific local daemon target +- `state directory`: the daemon-private state location associated with a daemon key +- `daemon-private root`: the daemon-scoped runtime/storage root under that state directory + +## Non-Goals + +- Unifying jobs with agent deployments or model deployments. +- Replacing profile resolution in this iteration. +- Writing a migration plan in this spec. +- Defining `daemon-group` behavior in this spec. +- Defining Orchard-specific implementation details as required architecture. +- Requiring direct client access to SQLite. + +## End-State Model + +### Core Principle + +There is one jobs contract. + +The CLI always talks to a jobs API. There are only two control-plane targets: + +- local daemon +- remote cluster + +Those are two ways to host the same jobs contract. They are not different job models. + +The local and remote paths should behave as close to identically as possible. + +The default target should be as simple as possible: + +- local daemon is the default target +- no target selection should be required for the default local workflow +- remote targeting should be explicit + +There should be a built-in default cluster target named `local`. + +That target should always be present and should not require the user to create or register it. + +The built-in `local` target should resolve to the default local daemon key for the current local development context. + +In this spec, `cluster` should be understood as the primary named target abstraction for jobs routing. + +It should not be treated as merely a convenient alias for a raw URL. 
+ +A cluster target should represent the saved routing and identity information needed to talk to a jobs control plane. + +The only local-only interface should be the daemon control surface used for discovery and coordination. + +The jobs API itself should remain the same between local and remote modes. + +### Interaction Modes + +The user-facing verbs keep their names but change meaning: + +- `submit` means create a job and return a handle +- `run` means create a job and follow it until terminal state + +`run` is therefore "submit + follow", not "execute locally in-process". + +### Backend Model + +Backend choice is part of jobs execution, not part of the CLI verb. + +The local-first backends in scope for this spec are: + +- `subprocess` + +This backend should be scheduled, reconciled, logged, and cancelled through the same jobs lifecycle. + +### Profile Resolution + +Near-term backend selection continues to use the existing profile model. + +That means: + +- compilers keep producing profile-driven job specs +- configured execution profiles determine what backend contract is actually available +- local and remote hosting differ mainly in which profiles are present and valid + +This keeps the architecture close to the current platform while removing the `run_local(...)` side path. + +Profiles should be treated as properties of the selected control plane. + +That means: + +- remote cluster mode uses the profiles exposed by the remote jobs API +- local daemon mode uses the profiles exposed by the local daemon over its local API + +In local daemon mode, the CLI should not read execution profile definitions directly from local config files as its source of truth. It should query the local daemon for the profile set that is actually active for that daemon instance. + +### Working Directory And Storage Parity + +Full local/remote parity also requires a consistent task/job storage contract. 
+ +Docker and Kubernetes already expose a mostly logical storage contract: + +- task-scoped ephemeral storage via `NEMO_JOB_EPHEMERAL_TASK_STORAGE_PATH` +- job-scoped persistent storage via `NEMO_JOB_PERSISTENT_JOB_STORAGE_PATH` +- task-scoped config storage via `NEMO_JOB_STEP_CONFIG_STORAGE_PATH` + +Those map to stable in-task paths such as: + +- `/var/run/scratch/task` +- `/var/run/scratch/job` +- `/var/run/scratch/config` + +for containerized backends, while the actual implementation may be Docker volumes, Kubernetes `emptyDir`, PVC subpaths, or other backend-owned storage mechanisms. + +Subprocess currently differs in two important ways: + +- it sets `cwd` directly to a host-side task working directory +- it exposes backend-owned host paths as the actual runtime locations for task, config, and job storage + +That is acceptable as an implementation detail, but it should not remain the client-visible or compiler-visible contract. + +The target contract for this spec is: + +- jobs should continue to consume the logical task/job/config storage env vars as the primary storage contract +- subprocess should follow the same logical storage conventions already used by Docker and Kubernetes +- backend-specific filesystem layout should remain an internal implementation detail +- the jobs API should expose enough metadata to identify task-level and job-level storage locations logically, without requiring clients to know backend-private host paths +- different local daemons must have fully isolated backend-private storage roots + +For subprocess specifically: + +- the backend may still materialize task and job storage on the host filesystem +- the backend may still choose an internal host-side working directory layout for process execution +- but the stable contract seen by job code should be the same logical task/job/config storage environment already used by Docker and Kubernetes +- and that host-side storage must be namespaced by daemon identity so multiple local daemons 
cannot conflict + +The current subprocess host layout is a reasonable internal implementation: + +- `/////` for task work +- `////job-storage` for job-persistent storage + +but that layout should be treated as daemon-owned state, not as the public job contract. + +The spec should therefore distinguish clearly between: + +- logical task storage +- logical job-persistent storage +- backend-private host implementation paths +- daemon-private host implementation roots + +### Daemon Storage Isolation + +Multiple local daemons must be fully isolated from one another. + +That isolation must cover at least: + +- task working directories +- task scratch directories +- task config directories +- job-persistent storage +- backend-private logs +- daemon-private state + +This means one local daemon must not be able to accidentally read, reuse, or delete another daemon's task or job runtime state simply because the workspace, job name, or attempt id happen to match. + +For subprocess, the backend-private host layout should therefore be scoped under a daemon-private root before any workspace/job/attempt/task nesting is applied. + +Conceptually: + +- `/////` for task work +- `////job-storage` for job-persistent storage + +where `` is unique to the local daemon target rather than globally shared across all local daemons. + +The daemon root should be derived from the daemon key and the daemon's own state directory, not only from the workspace or job identifiers. 
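A minimal sketch of how a daemon-private root could be derived from the daemon key and the daemon's state directory. The function names, the `daemons/<key>/runtime` naming, and the workspace/job/attempt/task nesting order are illustrative assumptions, not the existing subprocess implementation.

```python
from pathlib import PurePosixPath


def daemon_private_root(state_root: PurePosixPath, daemon_key: str) -> PurePosixPath:
    """Return a runtime root that embeds the daemon key (hypothetical layout)."""
    # Reject keys that could escape or alias another daemon's directory.
    if not daemon_key or "/" in daemon_key or daemon_key in {".", ".."}:
        raise ValueError(f"invalid daemon key: {daemon_key!r}")
    return state_root / "daemons" / daemon_key / "runtime"


def task_work_dir(root: PurePosixPath, workspace: str, job: str,
                  attempt: str, task: str) -> PurePosixPath:
    """Nest task work under the daemon-private root (assumed nesting order)."""
    return root / workspace / job / attempt / task


def job_storage_dir(root: PurePosixPath, workspace: str, job: str) -> PurePosixPath:
    """Nest job-persistent storage under the daemon-private root."""
    return root / workspace / job / "job-storage"
```

Because the daemon key is applied before any workspace or job nesting, two daemons that run the same workspace, job, attempt, and task identifiers still resolve to different host paths.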
+ +The spec should make this explicit: + +- every local daemon key has its own daemon-private state directory +- every local daemon key has its own daemon-private storage root +- daemon-private task, scratch, config, log, and job-storage paths must all live under that daemon-private root + +So if the daemon key is `local`, one daemon-private root might conceptually look like: + +- `/daemons/local/runtime/...` + +and if the daemon key is `exp-b`, a separate daemon-private root might conceptually look like: + +- `/daemons/exp-b/runtime/...` + +The exact directory naming may vary, but the required property is: + +- the daemon key must be part of the daemon-private root layout + +This is what guarantees that two daemons using the same checkout or virtual environment still do not collide with each other. + +This is required for: + +- side-by-side local daemons for testing +- foreground and background instances of different daemon keys +- version-isolated local development +- fail-fast discovery and safe cleanup + +Cleanup and retention must also respect daemon isolation. + +That means: + +- one daemon must only clean up its own backend-private storage +- retention logic must never assume a single global subprocess storage root shared by all local daemons + +### Current Working Directory Gap + +#### Current Behavior + +Today, backends do not all handle the task working directory the same way. + +- `subprocess` explicitly launches the process with `cwd` set to a backend-owned task work directory +- Docker and Kubernetes standardize task/job/config storage mounts and env vars +- Docker and Kubernetes do not currently standardize the task working directory +- the jobs launcher does not currently set `cwd` +- so in Docker and Kubernetes, the working directory is whatever the container runtime or image default happens to be + +#### Problem + +This means local and remote do not have the same working-directory contract. 
+ +Today: + +- local subprocess jobs start in an explicit task directory +- Docker and Kubernetes jobs do not necessarily start in that same logical task directory + +That is a parity gap. + +Job code should not have to guess whether: + +- the current working directory is the task directory +- the current working directory is the image default +- it should use the task storage env vars instead of `cwd` + +#### Change In This Spec + +This spec should define one explicit jobs working-directory contract for all backends. + +The contract is: + +- every task has a logical task working directory +- by default, that working directory is the task-ephemeral storage root +- job specs may override working directory at job scope +- job steps may override working directory at step scope + +The spec should add explicit working-directory fields to the jobs schema: + +- `PlatformJobSpec.working_directory` +- `PlatformJobStepSpec.working_directory` + +These fields represent the logical working directory seen by job code, not a backend-private host path. + +These values should be absolute runtime paths. + +Relative working-directory values should be rejected. + +For the backends in scope here, those values should be runtime paths consistent with the existing Docker/Kubernetes storage conventions, for example: + +- `/var/run/scratch/task` +- `/var/run/scratch/job` +- subdirectories under those roots + +Backends may reject values they cannot safely materialize. + +The precedence, from highest to lowest, should be: + +- step-level working directory override +- job-level working directory override +- backend default + +The backend default should be the logical task storage root. 
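The precedence and validation rules above can be sketched as a small resolver. This is a sketch under stated assumptions: the function name `resolve_working_directory` and the constant `TASK_STORAGE_ROOT` are hypothetical, and the default path is taken from the Docker/Kubernetes convention quoted in this spec.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed logical task-ephemeral storage root, per the convention cited above.
TASK_STORAGE_ROOT = "/var/run/scratch/task"


def resolve_working_directory(
    step_override: Optional[str],
    job_override: Optional[str],
    backend_default: str = TASK_STORAGE_ROOT,
) -> str:
    """Pick the logical working directory: step, then job, then backend default."""
    # Validate every provided override, even one that loses on precedence.
    for value in (step_override, job_override):
        if value is not None and not PurePosixPath(value).is_absolute():
            raise ValueError(f"working directory must be absolute: {value!r}")
    if step_override is not None:
        return step_override
    if job_override is not None:
        return job_override
    return backend_default
```

With neither override set, the sketch returns the task-ephemeral storage root as the backend default.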
+ +That means: + +- subprocess should default `cwd` to the task-ephemeral storage root +- Docker and Kubernetes should also default the container working directory to the task-ephemeral storage root + +So the intended end state is simple: + +- if no override is set, every task starts in its logical task directory +- if a job-level override is set, tasks use that unless a step overrides it +- if a step-level override is set, that step uses its own value + +Backend-specific storage layout remains private. + +For subprocess: + +- the backend maps the logical working directory onto its daemon-private host layout +- the process is launched with `cwd` set to that resolved logical directory + +For Docker and Kubernetes: + +- the backend sets the container working directory to the resolved logical path +- task/config/job storage mounts continue to work as they do today + +#### Potential Impact And Risk + +This part of the spec changes behavior for Docker and Kubernetes, because today they do not consistently force the task working directory. + +There is one real compatibility risk here: + +- some existing workloads may implicitly rely on the current container-image `WORKDIR` + +Examples of concrete breakage would be: + +- code that reads or writes relative paths assuming the image's existing `WORKDIR` +- wrapper scripts that expect to start from a repository root or image-specific home directory +- commands that rely on relative paths without using the provided task/job storage paths + +If a workload already uses the task/job/config storage paths explicitly, the practical impact should be low. 
+ +So this should be treated as a genuine but narrow compatibility risk: + +- narrow, because it only affects workloads that rely on implicit image-default `WORKDIR` +- genuine, because this spec intentionally replaces that accidental behavior with an explicit jobs contract + +### Retention Parity + +#### Current Behavior + +Today, jobs backends already share some controller-level retention settings such as: + +- `cleanup_completed_jobs_immediately` +- `ttl_seconds_after_finished` + +but each backend applies retention differently. + +Today: + +- subprocess removes backend-owned task working directories according to subprocess cleanup policy +- Docker removes containers and task volumes, and may separately clean job-persistent storage +- Kubernetes relies on Kubernetes job TTL plus explicit cleanup behavior + +So there is already no single identical retention mechanism across backends. + +#### Change In This Spec + +This spec should preserve that general architecture. + +Retention should remain backend-owned policy rather than becoming a special local-only concern. + +That means the retention model should distinguish at least: + +- task-ephemeral storage retention +- job-persistent storage retention +- job metadata retention in the jobs API +- log retention + +The intended parity is not identical deletion behavior across backends. 
+ +The intended parity is: + +- retention is a normal backend concern +- retention is configured and enforced through the backend/profile model +- local subprocess follows that same architectural convention instead of behaving like a special non-jobs path + +For this spec: + +- subprocess should have its own backend retention policy, just like Docker and Kubernetes do +- task-ephemeral storage may be cleaned up according to subprocess backend/profile retention policy +- job-persistent storage retention should be defined by the subprocess backend/profile, not by accidental host-directory lifetime +- jobs API records and status lifecycle should still be exposed uniformly even if backend cleanup behavior differs +- log access should remain API-based even if underlying local task directories are removed + +So the change here is mostly architectural clarity: + +- subprocess cleanup policy becomes an explicit backend concern +- local jobs retention is brought under the same conceptual model as Docker and Kubernetes + +#### Potential Impact And Risk + +There is no significant known compatibility risk here. + +This part of the spec is mostly clarifying architecture: + +- subprocess retention is treated as an explicit backend policy +- local subprocess is brought under the same conceptual model as Docker and Kubernetes + +Because local subprocess retention is not understood to be something existing deployed environments materially depend on, this should not be treated as a major product risk. + +## Local Control Plane + +### Principle + +Local execution is not special because it is local. + +It is special only in how the jobs control plane is hosted. 
+ +### Local Daemon + +The local option should be a real daemon-backed jobs endpoint with: + +- persistent local state +- a well-defined daemon control socket +- a discoverable TCP/IP jobs API listener +- daemon discovery +- reuse of an already-running usable daemon +- clear visibility into status and logs + +If the local endpoint is already running, the CLI should connect to it instead of starting a duplicate service. + +If it is not running, the CLI may bootstrap it and then submit through the same jobs API contract. + +The local daemon should not assume a fixed all-or-nothing service bundle. It should be able to activate additional local services and plugins on demand as requirements become known. + +The jobs API exposed by the local daemon should use TCP/IP, not UDS, so local and remote request paths remain as similar as possible. + +### Foreground And Background Local Modes + +The local daemon should support two operational modes: + +- foreground mode +- background mode + +These should not be treated as different architectures. + +They are two lifecycle modes for the same local control plane and should be functionally equivalent. + +In this framing: + +- `nemo services run` is the foreground mode +- daemon mode is the background mode + +Both should expose the same jobs API behavior, the same daemon identity model, and the same progressive activation behavior. + +The difference is only lifecycle ownership and terminal attachment: + +- foreground mode stays attached to the invoking terminal +- background mode detaches and continues running after the invoking command exits + +The spec should make it explicit that either mode may be used for the same local target. 
+ +That means: + +- a user should be able to start the local daemon explicitly in foreground mode +- a user should be able to start the same local daemon explicitly in background mode +- the CLI may also start the local daemon implicitly through progressive activation when no suitable daemon is already running + +Those should all converge on the same effective runtime shape. + +One reasonable command model is: + +- `nemo services run` for foreground local mode +- `nemo services run --daemon` for background local mode + +or an equivalent spelling under the existing services command family. + +The important requirement is not the exact flag name. The important requirement is: + +- there is one command family for starting local services +- foreground and background are explicit lifecycle modes of that same command family +- progressive activation and explicit startup are functionally equivalent ways to obtain the same local daemon target + +This avoids creating a second conceptual split such as: + +- explicit `services run` +- separate unrelated `daemon start` + +The platform should instead present a single local control-plane model with multiple lifecycle entry paths. + +The preferred CLI spelling for background mode in this spec is: + +- `--daemon` +- `-d` + +The same local daemon key selection mechanism should be available in both foreground and background mode. + +That is required because local daemon discovery must be able to identify a specific local target regardless of whether that target is attached to a terminal. + +Example: + +- `nemo services run --instance exp-b` +- `nemo services run --daemon --instance exp-b` +- `nemo services run --daemon --instance exp-b --state-dir /tmp/nmp-exp-b` + +All three commands refer to the same logical local daemon target `exp-b`. + +The first two differ only in whether the process remains in the foreground or detaches into background operation; the third additionally supplies an explicit state directory. 
+ +### Local Daemon Code Root And Interpreter Contract + +The local daemon must also have a well-defined relationship to the source tree and Python environment it is running from. + +This matters especially for subprocess because local execution should be unambiguous about: + +- which checkout of the code is being used +- which Python interpreter is being used +- which virtual environment is being used +- whether an existing daemon can be safely reused for the caller's intended local development context + +Current subprocess behavior already points in this direction: + +- subprocess inherits a narrow allowlist from the daemon process environment, including `PATH` and `VIRTUAL_ENV` +- if a subprocess command begins with `python` or `python3`, the backend rewrites it to use the daemon's interpreter resolution +- if `VIRTUAL_ENV` is set and contains an executable `bin/python`, that interpreter is used +- otherwise, the backend falls back to the daemon process `sys.executable` + +This means local subprocess execution already effectively runs in the daemon's Python environment rather than in a separate task-specific environment. + +That behavior should be made explicit and contractual. + +For local daemon mode, each daemon instance should therefore have explicit identity fields for: + +- daemon key +- code root +- Python executable +- virtual environment path, if any +- daemon version +- jobs API version + +The daemon key should be the selector used to distinguish multiple local daemons for testing or versioned development. + +The code root should identify which checkout the daemon is associated with for local development purposes. + +The Python executable and virtual environment should identify exactly which runtime environment local subprocess jobs will inherit when they use `python` or `python3`. + +The daemon control interface should expose these values directly. 
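The interpreter-resolution behavior described in this section can be sketched as follows. This is an illustrative approximation, not the backend's actual code: `resolve_python` and `rewrite_command` are hypothetical names, and the environment mapping stands in for the allowlisted daemon environment.

```python
import os
import sys
from pathlib import Path
from typing import List, Mapping, Sequence


def resolve_python(env: Mapping[str, str], fallback: str = sys.executable) -> str:
    """Prefer $VIRTUAL_ENV/bin/python when it is executable, else the fallback."""
    venv = env.get("VIRTUAL_ENV")
    if venv:
        candidate = Path(venv) / "bin" / "python"
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    # Otherwise use the daemon process interpreter.
    return fallback


def rewrite_command(
    command: Sequence[str],
    env: Mapping[str, str],
    fallback: str = sys.executable,
) -> List[str]:
    """Rewrite a leading `python`/`python3` to the daemon-resolved interpreter."""
    if command and command[0] in ("python", "python3"):
        return [resolve_python(env, fallback), *command[1:]]
    return list(command)
```

Exposing the resolved interpreter path in the daemon's identity fields would let a client compare it with its own expectation before reusing the daemon.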
+ +### Non-Default Daemon Key Example + +The default local workflow should not require users to think about daemon keys. + +By default: + +- the CLI targets the default local daemon for the current local development context +- if no such daemon exists, the CLI starts one + +The built-in cluster target for that workflow should be `local`. + +Non-default daemon keys are mainly for testing, side-by-side development, or explicit version isolation. + +Users should not need to pass a daemon key on every jobs command. + +The local daemon selection model should mirror cluster selection as closely as possible: + +- there should be a current selected local daemon target +- jobs commands in local mode should use that selected daemon by default +- a per-command daemon-key override may still exist for testing or debugging, but it should not be the primary workflow + +A concrete example: + +1. A developer is working on the current checkout and uses the default local daemon: + `nemo jobs run ...` +2. The same developer wants to compare behavior against a second local daemon started from a different checkout or virtual environment. +3. They start or target that daemon with a non-default key such as `exp-b`: + `nemo services run --instance exp-b ...` +4. They then select that daemon through the same target-selection model used for remote clusters: + `nemo cluster use exp-b` +5. Subsequent local jobs commands use that selected daemon automatically: + `nemo jobs run ...` +6. 
The CLI performs daemon discovery against `exp-b`, checks the daemon metadata, and either: + - connects if the daemon key, version, code root, interpreter, and capability requirements match + - fails explicitly if they do not match + - or starts a new local daemon for `exp-b` if none is running and startup is allowed + +In this example, the user can keep two local daemons distinct: + +- default daemon for the main checkout +- `exp-b` daemon for an alternate checkout or alternate virtual environment + +This makes the behavior explicit and testable: + +- daemon selection is intentional +- daemon identity is inspectable +- reuse is fail-fast rather than best-effort + +The spec does not require a per-command `--daemon-key` to be a broadly advertised end-user workflow. + +It is acceptable for per-command daemon-key override to exist primarily as an advanced local-development and test capability, as long as the behavior is explicit and stable. + +### Local Daemon Selection + +The platform should support explicit selection of the current local daemon target through the existing cluster-selection model. + +That means: + +- local daemon targets should be named by daemon key +- the CLI should persist the currently selected target +- local jobs commands should use the selected local daemon target by default when operating in local mode +- users should be able to switch between local daemons without repeating the daemon key on every command +- the built-in `local` target should always exist and should map to the default local daemon key + +A conceptual workflow is: + +- `nemo cluster ls` +- `nemo cluster use exp-b` +- `nemo jobs run ...` + +The current repo does not yet have a complete generic cluster-selection command surface for this exact purpose, so the examples below should be read as proposed CLI behavior for this spec. 
+ +This keeps the local and remote selection stories aligned: + +- remote mode selects a remote cluster target +- local mode selects a local daemon target +- both should use the same top-level target-selection workflow + +The selection rules should be: + +- if `--cluster` or `--base-url` is provided, use that explicit target +- otherwise, default to the local daemon target for the current local development context + +The daemon control interface should expose enough metadata for the CLI to render and select available local daemons clearly. + +The spec should not introduce a new top-level `daemon` command family for this purpose. + +Instead, the existing cluster/target selection system should be extended so that it can represent both: + +- remote cluster targets +- local daemon targets + +This preserves a single selection model and avoids creating a parallel command surface. + +### Local Daemon Discovery + +The spec should make local daemon discovery explicit. + +Discovery should not be based on guesswork such as: + +- probing arbitrary ports +- looking only for a PID +- assuming one global singleton daemon + +Instead, discovery should be keyed by the local daemon target identity. + +At minimum, discovery should use: + +- daemon key +- instance descriptor/state directory entry +- explicit daemon control socket location + +The daemon control socket should be the authoritative discovery rendezvous point for a local daemon target. + +The local services CLI should also be able to accept an explicit daemon state directory. + +That is useful for: + +- testing +- isolated local sandboxes +- side-by-side daemon instances with different roots + +However, this must be validated strictly. + +The platform should reject configurations where two distinct daemon identities would share the same effective state directory. 
+ +At minimum: + +- the daemon key must map to one daemon-private state directory +- the daemon-private root must include the daemon key +- starting a daemon with a state directory that would collide with another daemon's effective root should fail explicitly +- the CLI should not silently alias two daemon keys onto one shared state directory + +One reasonable model is: + +- each daemon key maps to a deterministic instance state directory +- that directory contains the daemon descriptor +- that directory also contains the well-known daemon control socket for that key +- the CLI uses that descriptor and socket to determine whether the daemon exists, is alive, and is usable + +This is close to the current `nemo services` instance model and should remain the basis for local discovery. + +The discovery flow should be: + +1. Determine the target daemon key. + This may come from: + - the selected local cluster target + - an explicit command-line override + - the default local development-context target +2. Resolve the instance state directory for that daemon key. +3. Read the daemon descriptor if present. +4. Check daemon liveness through the authoritative discovery mechanism for that key. + A lock or equivalent liveness primitive should remain the source of truth, not the descriptor alone. +5. If alive, connect to the daemon control socket for that key. +6. Query status and metadata from the daemon control interface. +7. Verify safe reuse checks such as: + - daemon key + - code root + - Python executable + - virtual environment + - daemon version + - jobs API version + - required capabilities +8. If the daemon is not present, not alive, or not reusable: + - fail explicitly + - or start a new daemon for that same key if the current workflow allows startup + +This should be clear in both foreground and background mode. + +That means the daemon key is not only a background concern. 
+ +It is part of the fundamental local identity and discovery contract for: + +- foreground `nemo services run --instance ` +- background `nemo services run --daemon --instance ` +- implicit CLI-driven local daemon acquisition for jobs commands + +### Per-Command Override + +The selected target should remain the default, not an exclusive routing mechanism. + +Per-command override should continue to be supported. + +That means: + +- the user may override the currently selected target for a single jobs command +- `--cluster` remains a valid per-command override for remote routing +- `--base-url` remains a valid per-command override when the user wants to target a specific endpoint directly + +This keeps the current operational flexibility while still allowing the CLI to maintain a stable default target. + +However, `--base-url` should be treated as an advanced escape hatch rather than the primary user workflow. + +The normal workflow should be: + +- create or discover a named target +- select that target +- run jobs against the selected target + +This keeps users thinking in terms of stable named control-plane targets rather than transport details. + +### Concrete Target Examples + +The spec should make local and remote targets look structurally similar. + +One reasonable target model is: + +- remote cluster targets have a name, kind, and endpoint metadata +- local daemon targets have a name, kind, and local discovery metadata + +Example remote target: + +- name: `dev-usw2` +- kind: `remote` +- base URL: `https://dev-usw2.example.nvidia.com` +- optional additional target metadata for auth or routing as needed + +Example local daemon target: + +- name: `exp-b` +- kind: `local-daemon` +- daemon key: `exp-b` +- discovered jobs API endpoint: `http://127.0.0.1:43123` +- optional additional target metadata for compatibility checks as needed + +Example workflows: + +1. 
Default local workflow + `nemo jobs run ...` + Result: + - the CLI uses the default local daemon target + - if needed, it discovers or starts the default local daemon + - jobs then run against the local jobs API +2. Create a named remote target + Proposed CLI shape: + `nemo cluster add dev-usw2 --base-url https://dev-usw2.example.nvidia.com` + Result: + - a named remote target `dev-usw2` is created + - it can later be selected or used as a per-command override +3. Switch to a remote target + Proposed CLI shape: + `nemo cluster use dev-usw2` + Result: + - jobs commands go to `https://dev-usw2.example.nvidia.com` +4. Switch back to the default local target + Proposed CLI shape: + `nemo cluster use local` + Then: + `nemo jobs run ...` + Result: + - jobs commands resolve `local` as the default local-daemon target + - the CLI uses the daemon control interface to discover the daemon + - the CLI discovers the local jobs API endpoint, for example `http://127.0.0.1:43123` + - jobs commands then talk to that TCP/IP jobs API endpoint +5. Create and use a non-default local target for testing + Proposed CLI shape: + `nemo cluster add exp-b --local-daemon-key exp-b` + Then: + `nemo cluster use exp-b` + Then: + `nemo jobs run ...` + Result: + - the CLI resolves `exp-b` as a named local-daemon target + - it discovers or starts the `exp-b` daemon instance + - jobs run against that daemon's TCP/IP jobs API +6. Override the selected target for a single command + `nemo jobs run --cluster dev-usw2 ...` + Result: + - this command goes to remote cluster `dev-usw2` + - the current selected target remains unchanged +7. Override with an explicit endpoint + `nemo jobs run --base-url http://127.0.0.1:43123 ...` + Result: + - this command talks directly to that endpoint + - the current selected target remains unchanged + +The important behavior is not the exact command spelling. 
The important behavior is: + +- local is the simple default +- `local` is a built-in target that is always available +- remote is explicit +- both local and remote targets can be named and selected +- local targets resolve indirectly through daemon discovery +- remote targets resolve directly through configured base URLs +- per-command override remains available without mutating the selected default target + +The key UX point is: + +- `cluster` is the primary abstraction +- URL is just one field inside some cluster targets +- users should usually select a named target rather than supplying raw endpoints + +### Fail-Fast Reuse Policy + +Local daemon reuse should be fail-fast. + +There should be no silent fallback from one local development context to another. + +That means: + +- if the CLI is targeting a specific daemon key, it should connect only to that daemon key +- if the CLI expects a particular code root, Python executable, virtual environment, version, or capability set, and the daemon does not match, the operation should fail explicitly +- the CLI may then choose to start another daemon, but it should not silently reuse the wrong one + +This should be treated as part of safe reuse checks, not as best-effort convenience behavior. + +In particular, local daemon mode should not silently: + +- connect to a daemon associated with a different checkout +- connect to a daemon associated with a different virtual environment +- connect to a daemon associated with a different Python executable +- downgrade to a different local runtime than the caller requested + +### Current Development Behavior + +Today, changes made in the source tree are picked up by newly launched local subprocess jobs only to the extent that the daemon's Python environment resolves them at runtime. 
+ +In practice that means: + +- subprocess jobs launched with `python -m ...` or `python3 -m ...` use the daemon's interpreter selection +- imports come from whatever that interpreter and environment resolve at execution time +- if the local development environment uses an editable install, new subprocess jobs will generally observe source-tree changes immediately +- if the local development environment uses a non-editable installed package, they may instead continue to use installed code + +This is another reason the daemon must expose code-root and interpreter identity explicitly rather than leaving behavior implicit. + +The spec should not rely on accidental editable-install behavior as the architectural contract. + +### Why A Daemon Is Required For Local Mode + +The local mode still needs cross-process coordination, persistence, and discovery. + +That rules out a design where each CLI invocation directly opens and manages SQLite state on its own. For local mode, a daemon-backed control plane is the correct primitive. + +### Remote Cluster + +The remote option means an externally hosted NeMo jobs API, such as a configured cluster endpoint. + +In remote mode: + +- the environment is assumed to be provisioned already +- required services are checked, not progressively bootstrapped by the CLI +- dependency unavailability should surface as a normal error + +### Local Daemon Interface + +Local mode should expose a local-only daemon interface for reporting status, requirements, readiness, and errors. + +This interface is not responsible for starting or stopping the local daemon itself. 
+ +Daemon lifecycle remains CLI-owned code: + +- find an existing local daemon +- decide whether one must be started +- start it if needed +- stop it if the CLI owns that lifecycle + +The local daemon interface is responsible only for answering questions such as: + +- which local daemon instance is being addressed +- which TCP/IP port its jobs API is listening on +- what services and backends are currently running +- what services are required for a given request +- whether the required services are ready +- whether the request cannot be satisfied and why +- what daemon and service version is currently running +- how to stream logs relevant to startup, convergence, and failures + +The interface should be: + +- asynchronous +- readiness-aware +- cross-process safe +- able to support polling or long-poll readiness waits +- able to support log streaming for connected local clients + +Availability should be an explicit part of the contract. In particular, local mode needs a clear and reliable way to determine whether a daemon or service is: + +- not running +- starting but not yet ready +- running and reusable +- running but unhealthy +- running but not usable for this request + +The daemon interface should not guess based only on process existence. It should expose an explicit availability check that can answer whether the target runtime is present, healthy, and usable for the request. + +The interface should expose explicit state such as: + +- started or not started +- ready or not ready +- healthy or unhealthy +- failed, with error details + +That state is required so "ensure service" style logic can actually determine whether to wait, proceed, or fail. + +### Local Daemon API + +The local daemon interface should be specified as a small local-only control API separate from the normal jobs API. + +The recommended transport is HTTP/REST over a Unix domain socket. 
+ +This keeps the daemon control interface local-only while allowing the actual jobs API to remain TCP/IP-based. + +Using HTTP over UDS is a good fit because it: + +- reuses the existing REST and FastAPI-style platform patterns +- keeps daemon discovery and coordination local-only +- avoids forcing the main jobs API onto a different transport than remote mode +- preserves room for streaming, long-poll, and structured status responses + +The contract should expose four concrete capabilities: + +- daemon status +- requirement planning +- wait for readiness +- log streaming + +#### 1. Status + +The daemon should expose a status call that returns the current daemon view without mutating state. + +The status response should include at minimum: + +- daemon key +- daemon identity +- daemon version +- protocol or API version +- jobs API host and port +- started state +- readiness state +- health state +- state-root or storage identity +- effective config identity or fingerprint +- supported services +- supported backends +- supported profiles +- per-service status +- active errors + +Per-service status should distinguish at least: + +- not started +- starting +- ready +- unhealthy +- failed + +This status call is the basis for availability and safe-reuse checks. + +#### 1a. Profiles + +The local daemon should expose the effective execution profiles that belong to that daemon instance. + +The daemon control interface may report profile metadata, but the actual profile query for jobs behavior should go through the local daemon's TCP/IP jobs API so that local and remote paths remain aligned. + +The CLI should use the daemon-resolved profile set in local mode rather than reading profile configuration directly from local files. + +#### 2. Requirement Planning + +The daemon should expose a call that accepts request context and returns what local supporting services are required for that request. 
+ +The input should allow the daemon to consider: + +- platform configuration +- job family or compiler-declared requirements +- backend or profile requirements +- current daemon state + +The response should include: + +- the required service set +- which required services are already ready +- which required services are still converging +- which required services failed +- whether the request is satisfiable in local mode +- why it is unsatisfiable when it fails + +This makes the progressive activation contract explicit instead of implicit. + +#### 3. Wait For Readiness + +The daemon should expose a wait call for local mode that lets a client wait until the daemon is ready for a specific request. + +This call should support polling or long-poll semantics. + +The client should be able to provide: + +- the request context or requirement key +- a timeout +- an optional cursor or last-seen status version for efficient waiting + +The daemon should respond with one of these outcomes: + +- ready: all required services are ready for the request +- pending: requirements are still converging +- failed: the request cannot become ready under current conditions +- timeout: readiness was not reached before the requested deadline + +The response should also include current service state and active errors so the caller can decide whether to keep waiting or surface a failure. + +#### 4. Log Streaming + +The daemon should expose a log stream for local clients so startup and convergence remain visible. + +The stream should be able to include: + +- daemon bootstrap logs +- service activation logs +- readiness or failure transition messages +- logs for specific required services + +The client should be able to scope the stream by: + +- daemon-wide startup +- a specific request or requirement resolution flow +- one or more services + +The log stream is diagnostic and observational. 
It should not be required for correctness, but it should be available so local startup and failure modes remain transparent. + +### Progressive Service Activation + +Progressive activation should be built on top of the local daemon interface rather than on fixed startup presets. + +Required services may come from multiple sources: + +- platform configuration requirements +- job or compiler-declared requirements +- profile or backend requirements + +Examples: + +- if auth is enabled in platform configuration, local auth-related services may also need to run +- a particular job family may require files, secrets, models, or a plugin-owned service +- a chosen backend or profile may require more local runtime support than another + +The local control plane should therefore: + +- start with the smallest useful local jobs runtime +- compute the union of currently required services from all known requirement sources +- activate missing services incrementally inside the daemon +- reuse already-running compatible services instead of restarting them + +This implies that plugin-owned or service-owned local capabilities should be startable on demand based on usage and configuration, not only through fixed CLI startup flags. + +The daemon interface should support a wait pattern where a caller can poll or long-poll until the required services become ready or until an error or timeout occurs. + +### Responsibility Split + +The CLI should be responsible only for daemon lifecycle and for ensuring access to a usable local jobs control plane. + +That means the CLI should: + +- find or start the local daemon using CLI-owned lifecycle code +- connect to the daemon through the local-only daemon interface +- submit or follow jobs through the normal jobs API + +The local daemon should be responsible for resolving and activating supporting services needed to handle those requests. 
+ +That means the daemon should: + +- inspect configuration and request context +- determine which supporting services are required +- activate supporting services internally as needed +- reuse running compatible services where possible + +This keeps duplicate prevention and lazy activation inside explicit runtime contracts instead of scattering that logic across CLI startup paths. + +### Auto-Start Failure UX + +Transparent local daemon acquisition should have an explicit user-facing error contract. + +If the CLI attempts to acquire a local daemon automatically and fails, the error should report at minimum: + +- the local target name +- the daemon key +- the failure phase +- a short human-readable cause +- how to inspect relevant logs +- how to run explicitly in foreground mode +- how to run explicitly in `--daemon` mode + +The failure phase should be one of: + +- `discovery` +- `reuse` +- `startup` +- `readiness` + +The CLI should use those phases consistently so users can tell whether: + +- no daemon was found +- an existing daemon was found but could not be reused +- startup failed +- startup succeeded but readiness was not reached + +Additional expected details: + +- readiness failures should include timeout information when relevant +- reuse failures should include the mismatched property or requirement when relevant +- startup failures should include enough information to locate daemon-control logs quickly + +The purpose of this contract is to keep transparent startup from feeling opaque when it fails. + +### Availability And Safe Reuse + +For local mode, the contract should define a clear method for determining whether a daemon or service is available. + +That check should be strong enough to distinguish: + +- process exists but runtime is not ready +- runtime is healthy and reusable +- runtime is unhealthy +- runtime is alive but attached to the wrong state, configuration, or capability set for the request + +This is especially important for daemon reuse. 
"Available" should therefore mean more than "a process is running" or "a socket exists." It should mean the runtime has passed an explicit availability check and is safe for the CLI to treat as the active local control plane for the request. + +At minimum, the availability and reuse contract should surface: + +- daemon key +- daemon identity +- daemon version +- protocol or API version +- jobs API host and port +- health state +- readiness state +- state-root or storage identity +- effective config identity or fingerprint +- supported services, backends, and profiles +- per-service status +- active error details when unavailable + +### Log Streaming + +The local daemon interface should also support streaming logs to the connecting local client. + +This is especially useful when: + +- the CLI has started or attached to a local daemon +- the daemon is progressively activating required services +- a required service is slow to become ready +- activation fails and the user needs immediate diagnostic context + +The local runtime does not have to rely on log streaming for its own internal correctness, but the interface should make log streaming available so local startup and convergence remain transparent to the user. + +### Remote Behavior + +The local-only daemon interface is specific to local mode. + +It should not be treated as part of the remote jobs contract. + +For remote, Kubernetes, or externally hosted platform endpoints: + +- the environment is assumed to be provisioned already +- required services are checked, not progressively bootstrapped by the CLI +- dependency unavailability should surface as a normal error + +For example, if a job requires `files` and the remote `files` service is unavailable, jobs should fail with a dependency error rather than attempting local-style activation behavior. + +### API Parity Requirement + +Full API parity between local and remote is a core requirement of this design. 
+ +That means: + +- the same jobs REST API should be used in both local and remote modes +- local mode should not invent a separate jobs API shape +- differences between local and remote should be limited to control-plane discovery and environment capability, not the jobs API contract itself +- the daemon control API exists only to discover, identify, and coordinate the local daemon; it is not a replacement for the jobs API + +In practice, this means a client that knows how to talk to the remote jobs API should also be able to talk to the local daemon's jobs API once the CLI has discovered its TCP/IP endpoint. + +### Required Platform Changes + +To support this design, NeMo Platform should make the following concrete changes. + +#### 1. Add A Local-Only Daemon Control API + +Add a local daemon API, separate from the normal jobs API, that exposes: + +- status +- requirement planning +- wait for readiness +- log streaming + +This API is local-only and exists for local daemon operation. It is not part of the remote platform jobs contract. + +#### 2. Keep CLI-Owned Daemon Lifecycle Separate + +Keep daemon lifecycle out of the daemon control API. + +The CLI should own: + +- daemon discovery +- daemon startup +- daemon shutdown when appropriate + +The daemon control API should only answer: + +- what is running +- what is required +- whether the request is ready +- why the request failed + +#### 3. Support Request-Scoped Requirement Resolution + +Add a request-scoped requirement resolution path inside the local daemon. + +That resolver should combine requirements from: + +- platform configuration +- service or plugin declarations +- job or compiler declarations +- backend or profile declarations + +The result should be one resolved requirement set for the current request. + +#### 4. Support Nested Dependency Activation From One Request + +A single requirement request should be able to activate the full nested dependency graph needed for a service. 
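A minimal sketch of how one request could expand into a full activation order, assuming each service declares its dependencies as a simple name-to-list mapping. The mapping shape and the function name are illustrative, not an existing platform API.

```python
from typing import Dict, Iterable, List


def resolve_activation_order(
    requested: Iterable[str],
    dependencies: Dict[str, List[str]],
) -> List[str]:
    """Return every service needed for `requested`, dependencies first."""
    order: List[str] = []
    done = set()
    in_progress = set()

    def visit(service: str) -> None:
        if service in done:
            return
        if service in in_progress:
            raise ValueError(f"dependency cycle detected at {service!r}")
        if service not in dependencies:
            raise KeyError(f"unknown service {service!r}")
        in_progress.add(service)
        # Recurse into declared dependencies before adding the service itself.
        for dep in dependencies[service]:
            visit(dep)
        in_progress.discard(service)
        done.add(service)
        order.append(service)

    for service in requested:
        visit(service)
    return order
```

The caller names only the service it needs; the daemon derives everything else from the declared dependencies.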
+ +That means if a plugin-owned service is required, and that service depends on `entities`, `auth`, or another service, one daemon-side activation flow should be able to resolve and activate the entire dependency chain. + +This should follow declared dependencies recursively rather than requiring the CLI or caller to activate each service manually. + +#### 5. Extend Plugin And Service Declarations For Local Activation + +Each core service and plugin-owned service should be able to declare the information needed for local activation. + +At minimum this should include: + +- service identity +- declared service dependencies +- whether the service is eligible for local activation +- any additional local activation requirements that differ from simple startup order + +Existing `dependencies` declarations are a strong starting point and should remain part of this model. + +#### 6. Add Job Or Compiler Requirement Declarations + +Jobs or compilers should be able to declare which supporting services are required for local execution. + +This is distinct from service startup dependencies. + +Examples: + +- a job may require `files` even if the jobs daemon itself does not +- a specific job family may require a plugin-owned service +- a specific backend or profile may require additional services + +This declaration should feed into daemon-side requirement planning. + +#### 7. Preserve Boolean Readiness In The Short Term + +Short term, a boolean readiness signal from individual services is acceptable. + +That means existing `Service.is_ready() -> bool` behavior can remain the base readiness primitive for now. 
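One possible way to layer richer per-service states on top of that boolean primitive is sketched here. The input names (`started`, `was_ready_before`, `error`) and the rule that a previously ready service which stops reporting ready counts as unhealthy are assumptions of this sketch, not decisions made by the spec.

```python
from enum import Enum
from typing import Optional


class ServiceState(str, Enum):
    NOT_STARTED = "not_started"
    STARTING = "starting"
    READY = "ready"
    UNHEALTHY = "unhealthy"
    FAILED = "failed"


def synthesize_state(
    started: bool,
    is_ready: bool,
    was_ready_before: bool,
    error: Optional[str] = None,
) -> ServiceState:
    """Combine daemon-known lifecycle facts with a boolean readiness signal."""
    if error is not None:
        # A recorded startup or activation error wins over everything else.
        return ServiceState.FAILED
    if not started:
        return ServiceState.NOT_STARTED
    if is_ready:
        return ServiceState.READY
    # Started but not ready: distinguish first-time startup from regression.
    return ServiceState.UNHEALTHY if was_ready_before else ServiceState.STARTING
```

Here `is_ready` would come from the existing `Service.is_ready()` call, while the other inputs are facts the daemon tracks itself.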
+ +The local daemon should synthesize richer daemon-interface states such as: + +- not started +- starting +- ready +- unhealthy +- failed + +from: + +- lifecycle state the daemon already knows +- boolean service readiness +- startup and activation errors + +This avoids blocking the design on an immediate repo-wide readiness refactor while still giving the daemon interface the richer states it needs. + +#### 8. Add Request-Aware Readiness Waiting + +Add daemon-side wait-for-readiness behavior that is specific to the current request, not just platform-wide readiness. + +The daemon should be able to answer: + +- are all services required for this request ready yet +- which required service is still pending +- which required service failed +- whether the request can never become ready under current conditions + +This wait path should support polling or long-poll behavior with timeout. + +#### 9. Add Versioned Safe-Reuse Metadata + +The daemon status contract should explicitly include version and reuse metadata so the CLI can safely decide whether an already-running daemon is reusable for the current request. + +At minimum this should include: + +- daemon key +- daemon version +- protocol or API version +- jobs API host and port +- state-root or storage identity +- effective config identity or fingerprint +- supported services, backends, and profiles + +The CLI should also be able to send explicit version requirements when talking to the local daemon interface. + +That means a local CLI request should be able to express requirements such as: + +- minimum daemon version +- exact protocol or API version +- required capabilities or backend support + +If those requirements are not met, the daemon interface should fail the request explicitly rather than allowing the CLI to proceed against an incompatible runtime. + +#### 10. Add Standard Startup And Convergence Log Streaming + +Add a standard daemon log streaming path for local clients. 
+ +This should support streaming: + +- daemon startup logs +- service activation logs +- readiness transition events +- failure diagnostics + +This log access should go through the local daemon interface rather than through direct local file reads from the CLI. + +That is the correct boundary because it: + +- keeps the CLI talking to one interface instead of inspecting daemon-owned files directly +- allows the daemon to choose how logs are stored internally +- supports future multi-daemon operation without changing the CLI log access model +- keeps local log access aligned with the same control-plane contract used for readiness and status + +This keeps local startup observable without requiring the user to inspect daemon state out of band. + +#### 11. Reuse Existing `--cluster` As The Remote Selector + +Reuse the existing `--cluster` option as the explicit selector for remote control planes. + +Control-plane selection should work like this: + +- if `--cluster` is provided, use remote cluster mode +- if `--base-url` is provided, use remote cluster mode +- otherwise, use local daemon mode + +This means the current implicit fallback chain for jobs submission should change. + +Today the submit path can resolve remote host selection through active CLI context even when `--cluster` is not provided. That behavior is not compatible with a clean two-option model, because absence of `--cluster` would no longer reliably mean local mode. + +For jobs under this design, remote selection should therefore be explicit. The CLI should not silently choose a remote control plane from active context when neither `--cluster` nor `--base-url` was supplied. + +#### 12. Query Profiles Through The Selected Control Plane + +Execution profiles should be queried through the selected control plane rather than inferred directly by the CLI from local config files. 
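Which control plane counts as "selected" follows the `--cluster` / `--base-url` rule from the previous subsection. The sketch below shows that rule and the resulting jobs API address. The type and function names are hypothetical, the two resolver callables stand in for cluster lookup and daemon discovery, and preferring `base_url` when both options are given is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ControlPlane:
    mode: str  # "remote" or "local"
    cluster: Optional[str] = None
    base_url: Optional[str] = None


def select_control_plane(
    cluster: Optional[str] = None,
    base_url: Optional[str] = None,
) -> ControlPlane:
    """Remote only when explicitly requested; active CLI context is ignored."""
    if cluster or base_url:
        return ControlPlane(mode="remote", cluster=cluster, base_url=base_url)
    return ControlPlane(mode="local")


def jobs_api_base_url(
    plane: ControlPlane,
    resolve_cluster_url: Callable[[str], str],
    discover_local_endpoint: Callable[[], Tuple[str, int]],
) -> str:
    """Return the jobs API address that profile queries should target."""
    if plane.mode == "remote":
        # Assumption: an explicit base URL takes precedence over a cluster name.
        return plane.base_url or resolve_cluster_url(plane.cluster)
    # Local mode: host and port come from the daemon control interface.
    host, port = discover_local_endpoint()
    return f"http://{host}:{port}"
```

Whichever mode is selected, the profile query then goes to that control plane's jobs API.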
+ +That means: + +- in remote cluster mode, query the remote jobs API for execution profiles +- in local daemon mode, use the daemon control API over UDS to discover the selected daemon and its jobs API port, then query the local jobs API over TCP/IP for execution profiles + +This keeps profile discovery aligned with the actual runtime that will execute the job and avoids a split where the CLI believes one profile set is active while the daemon is using another. + +#### 13. Support Multiple Local Daemons By Key + +The local daemon control model should support multiple daemon instances, each identified by a daemon key. + +This is primarily useful for testing, development, and version-isolated local runs rather than as a standard end-user workflow. + +The daemon key should allow the CLI to: + +- select which local daemon to discover or start +- resolve which daemon control socket to talk to +- discover which TCP/IP jobs API port that daemon is using +- run multiple daemon versions or configurations side-by-side when needed + +The daemon key and discovered port should come from the daemon control interface rather than from fixed local assumptions. + +## Backend Semantics + +### Subprocess + +`subprocess` is the explicit contract for running a host command in the local environment. + +It should remain suitable for the lightest-weight local developer loop. + +### Pause And Resume + +Pause and resume must be supported in local daemon mode to preserve jobs API parity with remote backends. + +For this spec, parity is required at the API and lifecycle-contract level, not at the low-level process-control level. + +That means local subprocess mode must support: + +- the same pause API +- the same resume API +- the same lifecycle states and transitions expected by the jobs contract + +For the local subprocess backend, the expected implementation is restart-based rather than true in-memory process suspension. 
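A minimal sketch of restart-based pause and resume, under several assumptions: the class name is hypothetical, the process launcher is injected (in practice it could wrap `subprocess.Popen`), only `ACTIVE` is treated as pausable, and repeated `pause` or `resume` calls are treated as no-ops.

```python
from enum import Enum
from typing import Callable, List, Optional, Sequence


class JobState(str, Enum):
    PENDING = "PENDING"
    ACTIVE = "ACTIVE"
    PAUSING = "PAUSING"
    PAUSED = "PAUSED"
    RESUMING = "RESUMING"
    ERROR = "ERROR"


class RestartBasedJob:
    def __init__(self, command: Sequence[str], start_process: Callable[[List[str]], object]):
        self.command = list(command)  # stands in for the persisted step definition
        self._start_process = start_process
        self._process = None
        self.state = JobState.PENDING
        self.attempts = 0
        self.error: Optional[str] = None
        self.transitions: List[JobState] = [self.state]

    def _set(self, state: JobState) -> None:
        self.state = state
        self.transitions.append(state)

    def _launch(self) -> None:
        # Every launch is a fresh process built from the persisted command.
        self._process = self._start_process(self.command)
        self.attempts += 1
        self._set(JobState.ACTIVE)

    def start(self) -> None:
        if self.state is JobState.PENDING:
            self._launch()

    def pause(self) -> None:
        if self.state in (JobState.PAUSING, JobState.PAUSED):
            return  # repeated pause is a no-op
        if self.state is not JobState.ACTIVE:
            raise ValueError(f"cannot pause a job in state {self.state.value}")
        self._set(JobState.PAUSING)
        try:
            self._process.terminate()
        except Exception as exc:
            self.error = f"could not stop subprocess during pause: {exc}"
            self._set(JobState.ERROR)
            return
        self._process = None
        self._set(JobState.PAUSED)

    def resume(self) -> None:
        if self.state in (JobState.RESUMING, JobState.ACTIVE):
            return  # repeated resume is a no-op
        if self.state is not JobState.PAUSED:
            raise ValueError(f"cannot resume a job in state {self.state.value}")
        self._set(JobState.RESUMING)
        self._launch()
```

Pausing ends the current process and resuming starts a new one; nothing held in memory carries over.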
+ +That means: + +- `pause` may terminate the running subprocess and transition the job to `PAUSED` +- `resume` may schedule a fresh subprocess execution from the persisted job step definition + +This is acceptable for this spec because it preserves API parity and lifecycle parity without requiring the local subprocess backend to implement a more complex suspend-and-continue runtime model. + +The spec should not imply that local subprocess pause/resume preserves in-memory process state. If true suspend/resume semantics are required later, that should be designed as separate follow-on work. + +The client-visible pause/resume contract should also be explicit. + +From a user and API perspective: + +- `pause` should be accepted only for jobs in a pausable non-terminal state +- `pause` should transition the job through `PAUSING` and then to `PAUSED` +- `resume` should be accepted only for jobs in `PAUSED` +- `resume` should transition the job through `RESUMING` and then back into normal scheduling states such as `PENDING` or `ACTIVE` +- `pause` and `resume` should be idempotent at the API level + +For local subprocess, the backend behavior should be: + +- when a running step is paused, the backend may terminate the process group for the current task attempt +- the step definition, job state, and any job-persistent storage must remain available for later resume +- task-ephemeral storage may or may not survive pause depending on backend policy, but that policy must be explicit and not accidental +- resuming should create a new subprocess task execution from persisted job state rather than pretending the original process was frozen in place + +The user-visible limitation should be explicit: + +- pause/resume for local subprocess is stop-and-restart from persisted job state +- it should only be relied on by workloads that can tolerate restart-based semantics +- workloads that require in-memory suspension are out of scope for this backend + +Failure handling should also be 
defined: + +- if `pause` is requested for a terminal job, the API should return a clear no-op or validation error according to the normal jobs contract +- if the backend cannot successfully stop the subprocess during pause, the job should transition to `ERROR` with a clear reason +- if `resume` is requested but the persisted job state is no longer runnable, the job should transition to `ERROR` with a clear reason +- restart-based resume must respect the same daemon reuse, interpreter, profile, and service-availability checks as an initial run + +### Logging + +Logging must preserve client-visible API parity between local and remote modes. + +For this spec: + +- local subprocess logging may use local capture and OTLP export internally +- those mechanisms are implementation details of the local backend +- clients should not depend on direct access to local log files + +From the client or user point of view, log access should be API-based in the same way it is for remote jobs. + +That means: + +- job logs should be retrieved through the jobs/logs API surface +- daemon startup and convergence logs should be retrieved through the local daemon control interface +- the CLI should not special-case local subprocess logs by reading daemon-owned files directly + +This keeps local and remote logging behavior aligned at the product surface even if their internal log collection paths differ. + +The client-visible logging contract should be more explicit. 
+ +For job logs: + +- the same jobs/logs API surface should be used in local and remote modes +- log retrieval should be keyed by the normal jobs identifiers such as job, step, and task +- the API should support the same kinds of reads the CLI expects remotely, including tailing recent logs and following active logs when available +- clients should not need to know whether logs originated from local file capture, OTLP export, container logs, or pod logs + +For local daemon logs: + +- daemon startup logs should be available through the local daemon control interface +- service activation and readiness logs should be available through the local daemon control interface +- failure logs for daemon bootstrap or service activation should be available through the local daemon control interface +- log streaming should work in both foreground and background local modes + +The relationship between backend-private logs and client-visible logs should also be explicit. + +For subprocess: + +- local file capture remains an internal implementation detail +- OTLP export remains an internal implementation detail +- neither internal mechanism defines the public client contract +- the daemon is responsible for making sure job logs are queryable through the jobs API regardless of how they are stored locally + +The expected behavior for an active local job should be: + +- a user runs `nemo jobs run ...` +- the CLI submits the job through the jobs API +- the CLI follows logs through the same jobs/logs API contract it would use remotely +- if the local daemon had to start or progressively activate services first, the CLI may also surface daemon-control logs during that phase +- once the job is running, task logs come from the jobs API rather than from daemon bootstrap channels + +The transition between these two log sources should be clear: + +- daemon control logs explain local control-plane bootstrap and readiness +- jobs API logs explain job execution + +Retention and cleanup should 
also be clarified: + +- backend-private log files may be rotated or deleted according to backend retention policy +- client-visible log retention should be defined at the jobs/logs API level rather than by direct access to those files +- local cleanup must not break the API contract for logs more aggressively than the configured backend retention policy allows + +Failure behavior should be explicit: + +- if job execution starts but log ingestion or export fails, the runtime should report that failure clearly rather than silently dropping logs +- if daemon bootstrap fails before the jobs API is available, the daemon control interface should expose enough logs to diagnose that failure +- if a user attaches to a foreground local daemon, that terminal output may be convenient, but it should not be the only supported way to observe daemon behavior + +### Future Local Backends + +This spec intentionally standardizes only the local subprocess path. + +Additional local backends such as Docker or daemon-managed long-lived process groups may be added later, but they are not required to achieve the jobs local/remote unification described here. + +## What Must Change Conceptually + +### `run_local(...)` Stops Being Real + +`run_local(...)` should not survive as a real execution architecture. + +If retained temporarily, it should be a thin compatibility shim that: + +- creates a jobs request +- submits it through a jobs API +- follows the result + +It should not continue to instantiate and run jobs in-process. + +### The CPU-Container-To-Subprocess Rewrite Stops Being A Target Behavior + +The current rewrite may remain as compatibility logic for a transition period, but it should not define the target model. + +Long term: + +- jobs that mean `subprocess` should compile to `subprocess` + +The platform should stop silently changing execution contract based on profile name. + +## Operational Requirements + +The local control-plane story must be explicit and user-visible. 
+ +At minimum the platform should make it clear: + +- whether the CLI connected to an existing local jobs endpoint or started one +- how the local endpoint is identified +- how to inspect its logs +- how to stop it +- how duplicate daemons are prevented +- how reuse eligibility is determined when reusing an existing daemon + +This is required for trust in local daemon mode. + +## Recommendation + +Adopt a single jobs-backed execution model with these properties: + +- `run` and `submit` stay as interaction modes +- all execution flows through the jobs service contract +- local is a jobs hosting shape, not a separate execution path +- `subprocess` is the first-class local jobs backend in this spec +- profile-driven backend selection stays in place for now +- the current subprocess rewrite is treated as migration-only compatibility logic, not target architecture + +This is a materially better model than the current split because it removes the fake distinction between "jobs" and "no jobs" without forcing a large redesign of the current jobs service. diff --git a/spec/jobs-runtime-availability-and-capabilities-spec.md b/spec/jobs-runtime-availability-and-capabilities-spec.md new file mode 100644 index 0000000000..beec852279 --- /dev/null +++ b/spec/jobs-runtime-availability-and-capabilities-spec.md @@ -0,0 +1,285 @@ +# Jobs Runtime Availability Spec + +## Summary + +This spec defines how NeMo Platform should determine which job execution providers and profiles are actually available at runtime, and how that information should be exposed to other services and plugins. 
+ +The key architectural change is: + +- the Jobs service becomes the authoritative source of truth for execution availability +- availability is determined dynamically from both configuration and runtime checks +- plugins and other services query Jobs for the available providers and profiles instead of inferring them independently + +This spec is intentionally separate from execution-selection and subprocess-first-class work. Its purpose is to define how the platform knows what is actually available to select from. + +## Problem + +Today the platform does not have one clear, runtime-authoritative source of truth for execution availability. + +### What Happens Today + +Today, availability is inferred indirectly rather than owned explicitly by Jobs. + +- platform config expresses intended runtime and configured executors +- startup-time config validation may mutate that view based on runtime checks +- Jobs derives its default profiles mostly from runtime and config +- some plugins still perform their own direct availability checks before compile or submit + +Examples in committed code: + +- platform config can downgrade `platform.runtime: docker` to `Runtime.NONE` if Docker is unreachable in [packages/nemo_platform_plugin/src/nemo_platform_plugin/config.py](/Users/rsadler/src/nemo-platform/packages/nemo_platform_plugin/src/nemo_platform_plugin/config.py:605) +- Docker reachability is checked by `validate_docker_available()` in [packages/nemo_platform_plugin/src/nemo_platform_plugin/config.py](/Users/rsadler/src/nemo-platform/packages/nemo_platform_plugin/src/nemo_platform_plugin/config.py:344) +- Jobs builds its `profiles` list from runtime and config in [services/core/jobs/src/nmp/core/jobs/config.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/config.py:64) +- default Jobs profiles are selected from runtime in 
[services/core/jobs/src/nmp/core/jobs/controllers/backends/config.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/controllers/backends/config.py:44) +- customization code independently checks Docker runtime and reachability in [packages/nmp_customization_common/src/nmp/customization_common/contributor/jobs.py](/Users/rsadler/src/nemo-platform/packages/nmp_customization_common/src/nmp/customization_common/contributor/jobs.py:17) + +### Why This Is A Problem + +This creates a few structural problems. + +- static configuration is treated as if it were the same thing as actual runtime availability +- plugin behavior can drift because plugins are pushed to implement their own reachability and environment checks +- other services cannot reliably know what Jobs can actually execute without reproducing Jobs assumptions +- in a microservice deployment, only Jobs is in the right position to know which execution backends are actually usable + +Configuration describes intent. It does not necessarily describe reality. + +A profile may be configured but unusable because: + +- Docker is not reachable +- the process is not running in the expected runtime environment +- a backend dependency is missing +- a control surface exists in config but is not actually available + +The platform needs one place where those questions are answered definitively. + +## Goals + +- Make Jobs the authoritative source of truth for execution availability. +- Distinguish configured execution profiles from actually available execution profiles. +- Perform dynamic runtime checks in Jobs so availability reflects the real execution environment. +- Expose a Jobs-owned API that other services and plugins can query once and then use as the basis for shared resolution. +- Eliminate plugin-specific availability detection logic over time. +- Keep availability determination centralized even when Jobs runs as a separate service. 
+- Make availability reporting explicit enough to support diagnostics and fast failure. + +## Non-Goals + +- This spec does not define the provider/profile resolution algorithm itself. +- This spec does not define how plugins compile jobs once a provider/profile has been selected. +- This spec does not define platform startup or service-loading behavior beyond what is necessary to explain availability ownership. +- This spec does not define long-term capability-versus-provider data modeling. + +## Architectural Principle + +The key rule in this spec is: + +- Jobs owns runtime execution availability because Jobs is the service that actually dispatches execution + +Other services and plugins may cache or consume Jobs-reported availability, but they should not be the authority for deciding what execution backends are truly usable. + +This matters especially in a microservice architecture. + +If Jobs is a separate service, then: + +- plugin processes do not necessarily run in the same runtime environment as Jobs +- plugin processes may not have direct visibility into Docker reachability, local subprocess enablement, or backend control surfaces available to Jobs +- reproducing Jobs environment checks in each plugin would be both brittle and inconsistent + +For those reasons, Jobs should own availability detection and publish the result. + +## Terminology + +### Configured Profile + +A provider/profile pair that exists in platform or Jobs configuration. + +This expresses intended support, not guaranteed runtime usability. + +### Available Profile + +A configured profile that Jobs has determined is actually usable in the current environment. + +This is the profile set that other services should resolve against. + +### Runtime Availability Check + +A check performed by Jobs to determine whether a configured backend/profile is actually usable. 
+ +Examples include: + +- Docker reachability +- runtime environment compatibility +- presence of required backend dependencies +- availability of required control surfaces + +## Proposed Model + +Jobs should maintain two distinct views: + +- configured execution profiles +- available execution profiles + +Configured profiles come from platform config and Jobs config. + +Available profiles are the subset that survive Jobs-owned runtime checks. + +Only the available set should be published as the source of truth for plugin resolution. + +## Availability Determination + +Jobs availability should be determined from two inputs. + +### 1. Configuration + +Configuration determines what is intended to be enabled. + +Examples: + +- whether subprocess execution is enabled +- which explicit provider/profile entries exist +- which backend defaults are enabled for the platform runtime + +Configuration answers: + +- what could exist if the environment is healthy + +### 2. Runtime Checks + +Runtime checks determine what is actually usable right now. + +Examples: + +- whether Docker is reachable +- whether the runtime environment matches the configured backend type +- whether backend-specific dependencies are present +- whether the required control plane or control surface is available + +Runtime checks answer: + +- what Jobs can actually dispatch right now + +### Combined Rule + +Jobs should expose only the intersection: + +- configured and enabled +- runtime-validated and usable + +That combined set is the availability contract. + +## Proposed Jobs API + +Jobs should expose an API for runtime execution availability. 
+ +The exact path and schema can be decided later, but the API should be able to answer at least: + +- which providers exist +- which profiles are available for each provider +- which backend each available profile maps to +- optionally, why a configured profile is unavailable + +A minimal shape might include entries like: + +- provider +- profile +- backend +- available: true/false +- reason, when unavailable + +The key architectural requirement is not the exact JSON schema. It is that Jobs owns and publishes the availability result. + +## Client Usage Model + +Plugins and other services should query Jobs once for availability, then run their shared deterministic resolver against that returned set. + +That means the typical flow becomes: + +1. plugin/service obtains available profiles from Jobs +2. plugin/service runs shared provider/profile resolution locally using that availability set +3. plugin compiles the final job spec for the selected provider/profile +4. Jobs validates and dispatches the same provider/profile that was resolved + +This preserves a shared deterministic algorithm while keeping availability ownership centralized in Jobs. + +## Why This Should Be Owned By Jobs + +Jobs is the correct owner for three reasons. + +### Jobs Dispatches Execution + +Jobs is the service that ultimately routes provider/profile selections to real backends. It is therefore the service most qualified to say whether those backends are usable. + +### Jobs Sees The Real Runtime Context + +Jobs runs in the environment that matters for dispatch. + +That environment may differ from: + +- the CLI process +- a plugin service process +- a code-generation or planning context + +If Jobs is remote, only Jobs truly knows what its own runtime can access. + +### Central Ownership Prevents Drift + +If plugins each decide for themselves whether Docker, subprocess, or another backend is available, drift is inevitable. 
+ +Centralizing availability in Jobs means: + +- one check implementation +- one source of truth +- one observable contract for the rest of the platform + +## What Changes Relative To Today + +Today: + +- availability is inferred indirectly from runtime/config +- some checks happen outside Jobs +- plugins may perform their own validation + +Proposed: + +- Jobs performs or owns the runtime availability checks +- Jobs distinguishes configured profiles from available profiles +- Jobs exposes availability through an API or equivalent service boundary +- plugins stop inventing their own availability logic and consume the Jobs result instead + +## Fast-Fail Implications + +A Jobs-owned availability API improves fast failure. + +Instead of plugins guessing from static config or partial runtime assumptions, they can fail using one explicit source of truth. + +That allows clearer failures such as: + +- profile configured but currently unavailable because Docker is unreachable +- provider unsupported in this deployment because it is not enabled +- no available profile satisfies the requested provider + +This is better than asking every plugin to construct these messages independently. + +## Open Questions + +- Should Jobs expose only available profiles, or both configured and available profiles? +- Should unavailability reasons be part of the public API, or just internal diagnostics? +- Should availability be computed once at Jobs startup, periodically refreshed, or both? +- How should Jobs represent transient availability loss after startup? +- Should clients be expected to cache availability for the duration of a request, process, or longer? +- What is the right API surface: direct Jobs HTTP endpoint, SDK call, or both? + +## Recommendation + +Adopt a Jobs-owned runtime availability model. + +The platform should stop treating static configuration as the authoritative source of truth for execution availability. 
+ +Instead: + +- Jobs should determine what is actually usable +- Jobs should publish that result +- plugins and other services should consume it once and resolve against it + +That keeps availability centralized in the only service that can truly know what execution backends are usable, especially in a multi-service deployment. diff --git a/spec/machine-auth-authentication-and-authorization-spec.md b/spec/machine-auth-authentication-and-authorization-spec.md new file mode 100644 index 0000000000..796b1f7471 --- /dev/null +++ b/spec/machine-auth-authentication-and-authorization-spec.md @@ -0,0 +1,513 @@ +# Machine Auth Authentication And Authorization Spec + +## Summary + +This spec defines a first-class machine-to-machine authentication and authorization model for NeMo Platform. + +Today NeMo Platform has two practical auth paths: + +- human callers authenticate through OAuth/OIDC Bearer tokens +- internal callers often identify themselves through trusted `X-NMP-Principal-*` headers using `service:` principal ids + +That leaves a gap for non-human callers that need real authentication without going through an interactive OAuth flow. + +This spec fills that gap by introducing machine authentication as a separate platform capability with its own identity verification path, while preserving the existing NeMo authorization model. + +The first recommended mechanism is Kubernetes service account JWT authentication, including k3s, because it matches the platform's deployment shape and avoids inventing a NeMo-specific secret distribution scheme. + +## Problem + +The current platform is OAuth-centric at the external authentication layer, but not all callers are humans or human-operated CLIs. 
+ +Examples: + +- NeMo platform services calling other NeMo platform services +- controllers, jobs, and workers running inside the cluster +- automation outside the browser and outside the CLI device flow +- infrastructure-adjacent workloads that need a narrow machine identity + +The current service-principal model is not enough for this. + +Current behavior is closer to "trusted transport plus trusted headers" than true machine authentication: + +- `Principal.is_privileged` is inferred from `principal.id.startswith("service:")` +- outbound SDK helpers synthesize `X-NMP-Principal-Id: service:` +- middleware accepts `X-NMP-Principal-*` headers as identity input +- policy defaults service principals to the `ServiceSystem` role with wildcard `*` permissions + +This creates several problems: + +- machine identity is not cryptographically verified by the platform +- the boundary between trusted internal traffic and authenticated machine callers is unclear +- all service principals are effectively equivalent today +- service-to-service authorization is much broader than it needs to be +- external non-OAuth automation has no first-class authentication path + +## Goals + +- Add a first-class authentication path for machine callers that does not require interactive OAuth. +- Preserve the separation between authentication, principal normalization, and NeMo-owned authorization. +- Replace implicit trust in `service:` header conventions with verifiable machine identity. +- Support Kubernetes-native workloads, including k3s, as the first deployment target. +- Allow route policy to distinguish user principals from machine principals without hardcoding a specific transport. +- Create a migration path away from broad wildcard access for all service principals. + +## Non-Goals + +- Replacing human OAuth/OIDC authentication for browsers, CLI users, or SDK users. +- Defining a full general-purpose secret-management system for every non-Kubernetes environment. 
+- Introducing provider-native OAuth scopes into endpoint policy. +- Solving mesh mTLS identity, SPIFFE, and probe identity in the first iteration. +- Redesigning all plugin path-rule surfaces in this spec. + +## Current State + +### Human Authentication + +Human auth is first-class today. + +- the platform exposes `/apis/auth/discovery` for CLI/SDK OIDC discovery +- middleware validates Bearer JWTs against configured OIDC settings +- token claims are normalized into a NeMo principal +- PDP evaluation uses: + - principal id + - principal email + - principal groups + - optional token scopes + +This path is documented and productized. + +### Machine Authentication + +Machine auth is not first-class today. + +What exists instead: + +- header-based principal propagation +- `service:` ids +- on-behalf-of forwarding +- policy defaults that treat service principals as highly privileged + +This is useful for internal plumbing, but it is not a full authentication design. + +### Authorization + +The core authorization model should remain intact: + +- authentication establishes caller identity +- NeMo normalizes that identity into a principal +- NeMo role bindings and endpoint policy determine authorization + +This is consistent with the existing OIDC scope/claim mapping direction and should remain true for machine auth. + +## Design Principles + +### Principle 1: Machine Auth Is Not Human OAuth + +Machine callers should not be forced through browser login, device flow, or refresh-token lifecycle just to call internal APIs. + +### Principle 2: Verified Identity Before Elevated Authorization + +The platform must verify how a machine caller proved its identity before accepting a privileged `service:`-style principal. + +### Principle 3: Authorization Stays In NeMo + +Upstream machine credentials identify the workload. They do not directly grant NeMo permissions. 
+ +NeMo still owns: + +- principal normalization +- role binding +- endpoint permission checks +- service-level authorization policy + +### Principle 4: Prefer Platform-Native Identity Over Shared Secrets + +For in-cluster machine auth, Kubernetes service account identity is preferable to static API keys because it is: + +- already present in the runtime +- audience-bound +- rotatable by the platform +- less operationally brittle than hand-managed shared secrets + +### Principle 5: Narrow Machine Identities + +`service:evaluator` and `service:guardrails` should not automatically mean the same thing. + +Machine identities should be individually attributable and authorizable. + +## Requirements + +### Requirement 1: Multiple Principal Kinds + +The platform should recognize at least these normalized principal classes: + +- user principal +- machine principal +- delegated machine principal acting on behalf of a user principal + +The existing `service:` naming convention may remain as a principal-id format, but it must no longer be the authentication mechanism. + +### Requirement 2: Credential-Type-Aware Authentication + +Middleware must distinguish at least: + +- human OIDC Bearer token +- machine Bearer token +- trusted propagated principal headers from an already-authenticated internal hop + +These are different authentication modes and should not be conflated. + +### Requirement 3: Machine Identity Projection + +Verified machine credentials must normalize into a stable NeMo principal id and optional machine attributes. + +Examples: + +- `service:auth` +- `service:evaluator` +- `service:workspace-controller` + +Optional attributes may include: + +- Kubernetes namespace +- Kubernetes service account name +- Kubernetes cluster issuer +- workload audience + +### Requirement 4: Least-Privilege Authorization + +The platform must support explicit authorization grants for machine principals instead of relying on universal wildcard service access. 
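
To illustrate Requirement 4, here is a minimal sketch of explicit, least-privilege grants for machine principals, in contrast to a wildcard default. The grant table shape, the example paths, and the `is_allowed` helper are hypothetical and are not taken from NeMo's existing policy model.

```python
from __future__ import annotations

from fnmatch import fnmatchcase

# Hypothetical explicit grants: principal id -> list of (method, path pattern).
# Neither the table shape nor the paths come from NeMo's real policy data.
MACHINE_GRANTS: dict[str, list[tuple[str, str]]] = {
    "service:evaluator": [("GET", "/v1/models/*"), ("POST", "/v1/evaluations")],
    "service:guardrails": [("POST", "/v1/guardrails/checks")],
}


def is_allowed(principal_id: str, method: str, path: str) -> bool:
    """Allow only requests covered by an explicit grant.

    A principal with no entry in the table gets nothing, so there is no
    wildcard fallback for unknown `service:` ids.
    """
    for granted_method, pattern in MACHINE_GRANTS.get(principal_id, []):
        # Note: fnmatch's `*` also matches `/`, so patterns here are prefix-like.
        if granted_method == method and fnmatchcase(path, pattern):
            return True
    return False
```

With this shape, `service:evaluator` and `service:guardrails` are authorized independently, which is the property Principle 5 asks for.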
+ +### Requirement 5: Delegation Must Stay Explicit + +If a machine acts on behalf of a user, that must remain explicit through the existing delegated-principal semantics. + +Machine auth alone must not silently imply user identity. + +## Options + +### Option 1: Kubernetes Service Account JWT Authentication + +Machine callers present a projected Kubernetes service account token as `Authorization: Bearer `. + +The platform validates the token as a machine credential and maps it to a NeMo machine principal. + +Validation approaches: + +- local JWT validation against a configured Kubernetes service-account issuer and JWKS +- Kubernetes TokenReview-based validation +- a deployment-selectable validator abstraction that can support either mode + +Additional constraints: + +- require expected audience such as `nemo-platform` +- require expected issuer +- require claim extraction for namespace and service account identity + +Pros: + +- fits Kubernetes and k3s deployment environments +- avoids distributing NeMo-specific long-lived shared secrets +- gives each workload a native identity +- supports rotation and audience scoping + +Cons: + +- Kubernetes-specific in the first iteration +- needs careful validator configuration across distributions +- external non-Kubernetes automation still needs a separate story later + +### Option 2: Static NeMo API Keys + +Machine callers authenticate with a platform-issued API key. + +Pros: + +- simple mental model +- works outside Kubernetes +- no dependency on Kubernetes identity + +Cons: + +- shared-secret lifecycle is harder +- rotation, storage, and leakage risks are worse +- weaker provenance than workload identity +- easy to overuse as a generic escape hatch + +### Option 3: mTLS / SPIFFE Workload Identity + +Machine callers authenticate through service mesh identity or mutual TLS and the platform maps that identity into a machine principal. 
+ +Pros: + +- strong workload identity +- good long-term story for service meshes + +Cons: + +- much larger deployment dependency surface +- not aligned with current platform implementation shape +- harder to make consistent across environments in the first iteration + +## Recommendation + +Adopt Option 1 as the first-class machine-auth mechanism: + +- Kubernetes service account JWT authentication for machine callers +- explicit normalization into NeMo machine principals +- explicit NeMo authorization grants for those principals + +Do not start with static API keys as the main model. + +API keys may be worth a later follow-up for external automation, but they should not be the foundation for in-cluster platform auth. + +## Proposed Design + +## Authentication Model + +Add a new machine-auth configuration block under `auth`. + +Conceptual shape: + +```yaml +auth: + enabled: true + oidc: + enabled: true + ... + machine_auth: + enabled: true + default_audience: "nemo-platform" + providers: + - type: kubernetes_service_account + issuer: "https://kubernetes.default.svc" + audiences: + - "nemo-platform" + principal_template: "service:{service_account_name}" + namespace_claim: "kubernetes.io/serviceaccount/namespace" + service_account_name_claim: "kubernetes.io/serviceaccount/service-account.name" +``` + +The final config shape may differ, but it needs these concepts: + +- machine auth enabled switch +- one or more machine identity providers +- issuer/audience validation +- claim mapping into normalized principal identity + +## Principal Normalization + +Introduce an explicit machine-principal normalization path. 
+ +Conceptual output: + +```python +Principal( + id="service:evaluator", + email=None, + groups=[], +) +``` + +But machine-specific metadata should also be available to authorization and logging, for example through an attached auth context: + +- principal kind: `machine` +- provider type: `kubernetes_service_account` +- kubernetes namespace +- kubernetes service account name +- token issuer + +This metadata should not force a redesign of the public `Principal` model in the first iteration if a side-channel auth context is simpler. + +## Middleware Behavior + +Update middleware authentication order conceptually as follows: + +1. health and public bypasses +2. trusted internal principal propagation path +3. Bearer token path + - try human OIDC validator + - try machine-auth validator + - reject if neither succeeds +4. auth-disabled behavior +5. anonymous/PDP path where allowed + +Important rule: + +- machine Bearer tokens must not be treated as human OIDC tokens +- trusted propagated `X-NMP-Principal-*` headers should only be accepted from already-authenticated internal hops, not as a substitute for first-hop machine authentication + +## Header Propagation + +The current principal propagation headers are still useful for downstream identity forwarding after the first hop authenticates. + +That means: + +- first hop into the platform may authenticate with a machine token +- platform normalizes that into a machine principal +- downstream internal requests may propagate normalized principal headers + +But the spec should tighten trust boundaries: + +- propagated principal headers are an internal propagation format +- they are not a standalone external authentication mechanism + +## Authorization Model + +Machine-authenticated principals should continue through the normal PDP flow. 
+ +The PDP input remains conceptually similar: + +- principal id +- method +- path +- optional delegated user identity +- optional normalized scopes if relevant + +But authorization behavior changes in one key way: + +- machine principals should no longer universally inherit wildcard access through `ServiceSystem` + +Instead, the platform should move toward explicit grants. + +## Role And Grant Model + +### Current Problem + +Today a `service:*` principal effectively gets broad access through the default `ServiceSystem` role. + +That is too broad for first-class machine auth. + +### Proposed Direction + +Add explicit support for machine-principal bindings. + +Examples: + +- bind `service:evaluator` to a narrow set of evaluator and model-read permissions +- bind `service:guardrails` to only the permissions it needs +- reserve a very small number of platform-internal break-glass principals for bootstrap paths if absolutely necessary + +This can be implemented in stages. + +#### Stage 1 + +- keep compatibility with existing `service:*` behavior for already-deployed internal services +- add the ability to create explicit machine-principal bindings +- prefer explicit bindings for new machine-authenticated callers + +#### Stage 2 + +- shrink the default `ServiceSystem` grant surface +- require explicit bindings for most machine principals + +#### Stage 3 + +- remove wildcard-by-default service-principal behavior entirely, or reduce it to a tightly scoped internal bootstrap set + +## Route Policy + +The existing caller distinction in `plugin-service-authz-spec.md` is compatible with this direction. + +`SERVICE_PRINCIPAL` can remain the route-level concept for machine callers, but it should mean: + +- a principal authenticated as a machine +- not merely a caller that presented `service:` in a trusted header + +This spec does not require immediate route-decorator redesign. + +It does require semantic tightening of what counts as a valid `SERVICE_PRINCIPAL`. 
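
The tightened meaning of `SERVICE_PRINCIPAL` described above could be sketched roughly as follows. `AuthContext`, `AuthMode`, and `is_service_principal` are hypothetical names for the side-channel auth context this spec mentions; they are not existing NeMo types.

```python
from dataclasses import dataclass
from enum import Enum


class AuthMode(Enum):
    """How the caller's identity reached this hop (hypothetical enumeration)."""

    HUMAN_OIDC = "human_oidc"
    MACHINE_TOKEN = "machine_token"
    PROPAGATED_HEADERS = "propagated_headers"


@dataclass(frozen=True)
class AuthContext:
    principal_id: str
    principal_kind: str  # "user" or "machine"
    auth_mode: AuthMode
    from_trusted_internal_hop: bool = False


def is_service_principal(ctx: AuthContext) -> bool:
    """True only for callers that were authenticated as machines.

    The `service:` prefix of the principal id is deliberately not consulted.
    """
    if ctx.principal_kind != "machine":
        return False
    if ctx.auth_mode is AuthMode.MACHINE_TOKEN:
        return True
    # Propagated headers count only when an already-authenticated internal hop sent them.
    return ctx.auth_mode is AuthMode.PROPAGATED_HEADERS and ctx.from_trusted_internal_hop
```

The point of the sketch is that a route check keyed on how the caller authenticated keeps working if the principal-id format changes later.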
+ +## Validation Strategy + +Support a pluggable machine-token validator interface. + +Conceptual interface: + +```python +class MachineTokenValidator(Protocol): + async def validate_token(self, token: str) -> MachineClaims | None: ... +``` + +Initial implementation: + +- `KubernetesServiceAccountTokenValidator` + +Returned claims should include enough data to: + +- prove token validity +- identify provider type +- map to a normalized principal id +- surface audit metadata + +## Auditing And Observability + +Machine auth must be visible in logs and traces. + +At minimum record: + +- authenticated principal id +- principal kind: user or machine +- auth provider type +- delegated user id if present +- authorization decision reason when denied + +This is important because the platform is moving from implicit trust to explicit machine identity. + +## Discovery And Client UX + +The current auth discovery endpoint is human-auth focused. + +Machine auth may eventually need discovery metadata, but the first iteration does not need to expose full machine-auth discovery publicly. + +For now: + +- human CLI/SDK discovery remains unchanged +- machine callers are configured operationally through Kubernetes service account projection and cluster config + +A follow-up may expose a minimal machine-auth capability advertisement if needed. 
+ +## Migration Plan + +### Phase 1: Add Machine Validation + +- add machine-auth config and validator abstraction +- support Kubernetes service account token validation +- normalize validated machine tokens into machine principals +- leave existing internal header propagation intact + +### Phase 2: Bind Explicit Machine Principals + +- add role-binding guidance and APIs for machine principals +- use explicit machine-principal grants for new services and automations + +### Phase 3: Tighten Trust Boundaries + +- restrict acceptance of raw `X-NMP-Principal-*` headers to trusted internal propagation contexts +- stop treating first-hop header injection as sufficient machine authentication + +### Phase 4: Reduce Wildcard Service Grants + +- narrow or remove default `ServiceSystem` wildcard authorization +- require explicit authorization for most machine principals + +## Open Questions + +1. Should Kubernetes service account validation use local JWKS validation, TokenReview, or both behind one provider abstraction? +2. How should internal first-hop trust be defined for header propagation during migration? +3. Should machine principal ids stay `service:` or grow a more structured format such as `service::`? +4. Which small set of bootstrap/internal services, if any, still need broad default access during migration? +5. Does the platform need a later API-key model for external automation that cannot use Kubernetes identity? + +## Recommended Decision + +NeMo Platform should add first-class machine authentication based on Kubernetes service account identity and treat that as the authoritative replacement for today's implicit service-principal trust model. 
+ +The core authorization model should remain NeMo-owned: + +- machine credential proves identity +- platform normalizes that identity into a machine principal +- NeMo roles and endpoint policy determine what that machine may do + +This gives the platform a real machine-auth story without forcing everything through OAuth and without preserving the current "trusted `service:` header means privileged caller" model as the long-term design. diff --git a/spec/nvapi-authentication-gateway-spec.md b/spec/nvapi-authentication-gateway-spec.md new file mode 100644 index 0000000000..16817d63e1 --- /dev/null +++ b/spec/nvapi-authentication-gateway-spec.md @@ -0,0 +1,481 @@ +# NVAPI Authentication Gateway Spec + +## Summary + +This spec proposes a way to let external callers authenticate to NeMo Platform using NVIDIA `nvapi-...` API keys. + +The recommended first iteration is **not** a normal NeMo plugin service. It is an **edge authentication translator** that sits in front of NeMo Platform, validates NVIDIA API keys against NVIDIA-hosted APIs, maps validated keys to NeMo principals, and forwards requests with the `X-NMP-Principal-*` headers that NeMo Platform already understands. + +This keeps authorization in NeMo Platform RBAC while avoiding a larger redesign of the in-process auth stack. 
+ +## Problem + +Today NeMo Platform has two practical authentication paths: + +- `Authorization: Bearer ` for OIDC JWTs validated by `packages/nmp_common/src/nmp/common/auth/jwt.py` +- trusted `X-NMP-Principal-*` headers inside the platform trust boundary + +That leaves no first-class path for NVIDIA API keys: + +- NVIDIA `nvapi-...` keys are not OIDC JWTs +- the current middleware has no pluggable external API-key authenticator +- plugin `NemoService` surfaces are too late in the request path to act as the primary authentication mechanism + +The user goal is to let a caller present an NVIDIA API key and have NeMo Platform treat that caller as an authenticated principal that can be authorized through existing workspace RBAC. + +## External Constraint: What NVIDIA Keys Are + +As of June 8, 2026, NVIDIA documentation describes `NVIDIA_API_KEY` values that start with `nvapi-` as opaque API keys used as Bearer credentials against NVIDIA-hosted APIs such as `integrate.api.nvidia.com` and `ai.api.nvidia.com`. + +Important implications: + +- they are documented as **Bearer API keys**, not as JWTs +- NeMo Platform cannot validate them locally with the existing OIDC/JWKS path +- NVIDIA docs reviewed for this spec do **not** document a general-purpose introspection or userinfo endpoint that would return stable user claims and groups for a presented key + +Because of that, any NeMo integration must separate: + +- **key possession validation** +- **NeMo principal projection** +- **NeMo authorization** + +## Goals + +- Allow external clients to authenticate to NeMo Platform with NVIDIA `nvapi-...` keys. +- Reuse NeMo Platform's existing RBAC and PDP authorization model after authentication. +- Avoid forcing the auth middleware to pretend NVIDIA API keys are OIDC JWTs. +- Minimize core-platform changes in the first iteration. +- Keep the design compatible with Envoy `ext_authz` or a similar gateway callout model.
+- Preserve fail-closed behavior if NVIDIA validation, mapping lookup, or downstream authorization fails. + +## Non-Goals + +- Replacing OIDC as the main human authentication story. +- Treating NVIDIA API keys as a source of NeMo roles or workspace grants. +- Inventing a general plugin-based authentication framework in the first iteration. +- Assuming NVIDIA exposes stable identity claims, groups, or workspace memberships for an API key. +- Storing raw NVIDIA API keys in NeMo unless a later workflow explicitly requires it. + +## Current NeMo State + +### What exists today + +- Bearer authentication is handled in `AuthorizationMiddleware`. +- JWT validation is OIDC-oriented and backed by issuer/JWKS discovery. +- Trusted `X-NMP-Principal-*` headers are accepted and normalized into a `Principal`. +- Authorization remains a PDP call using the normalized principal plus optional scopes. +- OPA policies already support both direct middleware input and Envoy-style `ext_authz` input. + +### What does not exist today + +- no `auth.providers[]` or equivalent authenticator chain +- no first-class NVIDIA API-key validator +- no first-class API-key-to-principal mapping store +- no implemented in-process fast path for a trusted `X-NMP-Authorized: true` gateway decision in the middleware path reviewed for this spec + +That last point matters: the docs describe gateway-level pre-authorization, but the current middleware still routes `X-NMP-Principal-*` requests through the normal PDP path when auth is enabled. + +## Why This Should Not Be A Normal `NemoService` Plugin + +A standard plugin service is mounted as application routers after the platform process is already accepting the request. 
+ +That is too late for primary authentication because: + +- authentication must happen before request routing reaches arbitrary services +- the auth middleware lives in shared core code, not in service routers +- any solution that depends on a plugin route still needs some earlier component to trust the caller first + +So the right boundary for v1 is not "auth as a plugin service". It is "auth as an edge translator/callout that produces NeMo-trusted identity headers". + +## Design Options + +### Option 1: Native In-Process NVAPI Auth Provider + +Add a new authenticator into `AuthorizationMiddleware`, for example: + +- inspect `Authorization: Bearer ` +- if token starts with `nvapi-`, call an NVIDIA validation routine +- map the validated key to a NeMo principal +- continue into the existing PDP flow + +Pros: + +- first-class internal implementation +- no extra gateway component +- clean UX for clients + +Cons: + +- larger core-auth redesign +- needs new config surface, new secrets/caching rules, and new tests in every service process +- still cannot derive claims from the key without a local mapping or NVIDIA introspection API +- pushes an external network dependency into every platform service + +### Option 2: Standalone Gateway / Translator In Front Of NeMo + +Put a small service or proxy in front of NeMo Platform: + +1. receive external request +2. strip all incoming `X-NMP-*` identity headers +3. validate `Authorization: Bearer nvapi-...` +4. map the key to a NeMo principal +5. forward request to NeMo with `X-NMP-Principal-*` headers
+ +Pros: + +- minimal NeMo core change +- clean trust boundary +- can be deployed independently +- lets NeMo continue using current principal-header path + +Cons: + +- one more deployable component +- requires strict network topology so callers cannot bypass the gateway +- NeMo still performs its own PDP check, so this is authentication translation rather than full authn+authz offload + +### Option 3: Envoy `ext_authz` Callout + +Use Envoy plus a custom ext_authz service: + +1. Envoy receives request +2. ext_authz service validates NVIDIA key and computes principal projection +3. Envoy injects `X-NMP-Principal-*` headers +4. request proceeds to NeMo + +Pros: + +- aligns with the existing auth docs and OPA input model +- operationally standard for production gateways +- easiest path if a team already runs Envoy + +Cons: + +- functionally similar to Option 2, but with Envoy-specific deployment complexity +- still subject to the current NeMo middleware gap around trusted pre-auth short-circuiting + +## Recommendation + +Recommend **Option 2 or Option 3**, depending on deployment preference: + +- **Option 2** if we want the simplest path that can be developed and tested quickly +- **Option 3** if the target deployment already uses Envoy and wants `ext_authz` + +In both cases, the NeMo-facing contract should be the same: + +- NeMo receives only trusted `X-NMP-Principal-*` headers from the gateway +- NeMo continues to own authorization through the existing PDP and workspace RBAC + +This is the least invasive v1 and does not require pretending NVIDIA API keys are JWTs.
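
The NeMo-facing header contract shared by Option 2 and Option 3 can be sketched as a pure function over headers. This is an illustration only: `translate_headers` is a hypothetical helper, and the comma-separated encoding for the groups header is an assumption rather than a documented NeMo format.

```python
from __future__ import annotations


def translate_headers(
    inbound: dict[str, str],
    principal_id: str,
    email: str | None = None,
    groups: list[str] | None = None,
) -> dict[str, str]:
    """Return the headers to forward to NeMo for an already-validated caller."""
    # Drop every caller-supplied X-NMP-* header (case-insensitive) so that
    # identity can only originate from the gateway's own mapping.
    forwarded = {
        name: value
        for name, value in inbound.items()
        if not name.lower().startswith("x-nmp-")
    }
    forwarded["X-NMP-Principal-Id"] = principal_id
    if email:
        forwarded["X-NMP-Principal-Email"] = email
    if groups:
        # Assumption: groups are joined with commas; the real encoding may differ.
        forwarded["X-NMP-Principal-Groups"] = ",".join(groups)
    return forwarded
```

Stripping happens before injection, so a caller who sends their own `X-NMP-Principal-Id` cannot influence the forwarded value. Whether the original `Authorization` header is forwarded is left out of this sketch.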
+ +## Proposed Architecture + +```text +external client + -> edge gateway / ext_authz service + -> NVIDIA key validation probe + -> local key-fingerprint -> principal mapping lookup + -> inject X-NMP-Principal-* headers + -> NeMo Platform + -> existing AuthorizationMiddleware + -> existing PDP / RBAC +``` + +## Request Flow + +### Phase 1 request flow + +1. Client sends `Authorization: Bearer nvapi-...`. +2. Edge translator strips any inbound `X-NMP-*` auth headers. +3. Translator checks that the token matches expected NVIDIA key shape. +4. Translator computes a **non-reversible fingerprint** of the raw key. +5. Translator looks up a local mapping for that fingerprint. +6. Translator validates the key with a configurable NVIDIA probe request. +7. If validation succeeds, translator injects: + - `X-NMP-Principal-Id` + - `X-NMP-Principal-Email` when available locally + - `X-NMP-Principal-Groups` when configured locally + - optional `X-NMP-Scopes` when local mapping assigns normalized NeMo scopes +8. Translator forwards to NeMo Platform. +9. NeMo middleware treats the request like any other trusted principal-header request and performs the normal PDP authorization check. + +If any step fails, the translator returns `401` or `403` and does not forward the request. + +## Principal Mapping Model + +Because NVIDIA API keys do not currently provide the claims NeMo needs, the system needs a local mapping layer. + +### Mapping record + +Conceptual shape: + +```yaml +nvapi_identity: + fingerprint: "hmac-sha256:..." 
+ principal_id: "user@example.com" + principal_email: "user@example.com" + principal_groups: + - "team-ml" + - "nvidia-build-users" + scopes: + - "platform:read" + - "platform:write" + status: "active" +``` + +### Why fingerprint, not raw key + +- avoids persisting raw NVIDIA API keys in the platform for normal request handling +- supports deterministic lookup +- limits blast radius if the mapping store is leaked + +The fingerprint should be derived with a server-side HMAC secret, not plain SHA256, to make offline guessing harder. + +### Identity source of truth + +The mapping store should be local to NeMo deployment operations, not inferred from NVIDIA at request time. + +Possible implementations: + +- YAML config for early prototypes +- entity-store backed records in a later iteration +- secret-backed registration flow if users self-enroll keys + +## NVIDIA Validation Strategy + +### Recommended v1 validation + +Use a configurable probe against a documented NVIDIA-hosted API that accepts the same Bearer key, for example a lightweight request to a models endpoint. + +The validator should treat a success response as "the key is live" and any auth failure as invalid. + +### Validation config + +Conceptual config: + +```yaml +auth: + nvapi: + enabled: true + validation_url: "https://integrate.api.nvidia.com/v1/models" + timeout_seconds: 3 + cache_ttl_seconds: 300 +``` + +### Cache behavior + +- cache positive validations for a short TTL +- cache negative validations for a much shorter TTL +- key cache key should be the local fingerprint, not the raw key +- all cache misses and validation failures fail closed + +### Important limitation + +This proves that the caller holds a currently valid NVIDIA API key. +It does **not** prove: + +- who the human is, unless the local mapping says so +- what groups they belong to in NVIDIA +- what NeMo workspaces they should access + +Those remain local NeMo concerns. 
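
The fingerprint and cache rules above can be sketched as follows. This is a minimal illustration under stated assumptions: `fingerprint`, `ValidationCache`, and `is_key_live` are hypothetical names, the 30-second negative TTL is an invented value, and `probe` stands in for whatever NVIDIA request the deployment configures.

```python
from __future__ import annotations

import hashlib
import hmac
import time
from typing import Callable


def fingerprint(raw_key: str, server_secret: bytes) -> str:
    """Derive a non-reversible lookup key using HMAC-SHA256 with a server-side secret."""
    digest = hmac.new(server_secret, raw_key.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"hmac-sha256:{digest}"


class ValidationCache:
    """TTL cache keyed by fingerprint, never by the raw key."""

    def __init__(
        self,
        positive_ttl: float = 300.0,
        negative_ttl: float = 30.0,  # assumed value; the spec only says "much shorter"
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._entries: dict[str, tuple[bool, float]] = {}

    def get(self, fp: str) -> bool | None:
        entry = self._entries.get(fp)
        if entry is None:
            return None
        valid, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[fp]
            return None
        return valid

    def put(self, fp: str, valid: bool) -> None:
        ttl = self._positive_ttl if valid else self._negative_ttl
        self._entries[fp] = (valid, self._clock() + ttl)


def is_key_live(
    raw_key: str,
    server_secret: bytes,
    cache: ValidationCache,
    probe: Callable[[str], bool],
) -> bool:
    """Check the cache first, then the probe; any probe error fails closed."""
    fp = fingerprint(raw_key, server_secret)
    cached = cache.get(fp)
    if cached is not None:
        return cached
    try:
        valid = probe(raw_key)
    except Exception:
        # Fail closed and do not cache, so the next request retries the probe.
        return False
    cache.put(fp, valid)
    return valid
```

A `True` result only means the key is live. Mapping the fingerprint to a principal is a separate local lookup.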
+ +## Authorization Behavior + +Authorization should stay exactly where it already lives: + +- NeMo PDP +- workspace role bindings +- endpoint permissions +- optional normalized scopes + +The NVIDIA key only gets the caller through authentication. +It must never directly grant platform permissions. + +## Registration / Enrollment Modes + +### Mode A: Operator-managed mapping + +An operator creates mapping entries manually. + +Pros: + +- simplest implementation +- no raw key storage required + +Cons: + +- operationally manual + +### Mode B: Self-service registration + +A user registers an NVIDIA key once through a dedicated enrollment workflow: + +1. present key +2. gateway validates it +3. system stores fingerprint and metadata +4. operator or automation binds it to a principal + +Pros: + +- better UX + +Cons: + +- needs a dedicated onboarding API and lifecycle management + +### Mode C: Full dynamic identity federation + +Only viable if NVIDIA eventually exposes a documented introspection or userinfo API that returns stable identity claims for a presented key. + +This spec does not assume that exists. + +## Security Requirements + +### Network boundary + +The translator or gateway must be the **only** externally reachable path to NeMo services that trust `X-NMP-Principal-*` headers. + +Direct access to platform services from untrusted networks must be blocked. + +### Header stripping + +The edge must remove inbound: + +- `X-NMP-Principal-Id` +- `X-NMP-Principal-Email` +- `X-NMP-Principal-Groups` +- `X-NMP-Principal-On-Behalf-Of` +- `X-NMP-Scopes` +- `X-NMP-Authorized` + +before any translation logic runs. + +### Storage + +- do not log raw NVIDIA API keys +- do not persist raw keys for normal request auth unless a separate enrollment feature explicitly requires it +- redact keys in traces and structured logs + +### Failure mode + +If NVIDIA validation is unavailable, the translator should fail closed by default. 
+ +This is stricter than best-effort auth and is the correct default for a primary authentication system. + +## Operational Concerns + +### Latency + +Per-request remote validation adds latency. +That is why short-lived positive caches are required. + +### Availability + +This design adds a dependency on NVIDIA API availability for uncached validations. + +### Rate limits + +The validation probe may consume NVIDIA API quota or hit rate limits. +The probe endpoint and cache TTL must be chosen accordingly. + +### Revocation window + +Positive caching creates a short window where a recently revoked key may still authenticate until cache expiry. + +## Compatibility With Current NeMo Code + +### Works today without large core changes + +The translator approach fits the existing principal-header path in: + +- `packages/nmp_common/src/nmp/common/auth/middleware.py` +- `packages/nmp_common/src/nmp/common/auth/models.py` +- the current PDP and OPA policies + +### Known gap + +The docs discuss gateway-level pre-authorization via `X-NMP-Authorized: true`, but the middleware path reviewed for this spec does not currently consume that as a skip-PDP fast path. + +So phase 1 should assume: + +- gateway translates authentication +- NeMo still performs authorization + +That is acceptable for v1. + +## Optional Phase 2: Trusted Pre-Authorization Fast Path + +After the translator is working, we may add a small core improvement: + +- if request arrives from a configured trusted proxy identity or network +- and `X-NMP-Authorized: true` is present +- and `X-NMP-Principal-*` headers are present and valid +- then middleware may skip its own PDP call + +This would turn the edge component into a full authn+authz offload point. + +This should be a separate change because it expands the trust surface and needs careful hardening. 
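The Phase 2 conditions listed above could be expressed as a single predicate along these lines. This is only a sketch under stated assumptions: `may_skip_pdp` and `TRUSTED_PROXY_NETWORKS` are hypothetical names, the trusted network is a placeholder, header names are assumed to arrive with the canonical casing shown, and "valid" principal headers are reduced here to "a non-empty principal id is present".

```python
import ipaddress
from typing import Mapping

# Placeholder network; a real deployment would load trusted proxies from config.
TRUSTED_PROXY_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def may_skip_pdp(peer_ip: str, headers: Mapping[str, str]) -> bool:
    """Return True only when every fast-path condition holds; otherwise run the PDP."""
    try:
        addr = ipaddress.ip_address(peer_ip)
    except ValueError:
        return False  # unparseable peer address is never trusted
    if not any(addr in net for net in TRUSTED_PROXY_NETWORKS):
        return False  # request did not arrive from a configured trusted proxy
    if headers.get("X-NMP-Authorized") != "true":
        return False  # gateway did not assert pre-authorization
    # Simplified validity check: require a non-empty principal id.
    return bool(headers.get("X-NMP-Principal-Id", "").strip())
```

Any `False` result falls back to the existing PDP call, so the fast path can only remove work, never widen access on its own.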
+ +## Suggested Implementation Plan + +### Phase 1 + +- build a small standalone translator service or Envoy ext_authz service +- add local key-fingerprint mapping +- add configurable NVIDIA validation probe +- forward mapped `X-NMP-Principal-*` headers to NeMo +- document required network and header-stripping constraints + +### Phase 1.5 + +- add entity-store backed mapping records +- add operator CRUD APIs or CLI for mapping management +- add audit events for mapping create/revoke/use + +### Phase 2 + +- add trusted gateway pre-auth support in middleware +- optionally support gateway-injected `X-NMP-Authorized: true` +- optionally add a generalized authenticator chain in core auth if multiple non-OIDC auth methods are needed + +## Open Questions + +- Which NVIDIA endpoint is the best long-lived validation probe for key liveness? +- Do we want mapping records to point at human principals, service principals, or both? +- Should NeMo-managed scopes be assignable in the mapping record, or should RBAC alone be sufficient? +- Is manual operator mapping enough for the first usable version? +- Do we want the translator to be repo-owned, or treated as a deployment-side reference implementation? + +## Recommendation + +Proceed with a **gateway/ext_authz translator** as v1. + +Do **not** start with a normal `NemoService` plugin and do **not** start by teaching the current JWT validator to parse NVIDIA API keys. + +The right first step is: + +- validate NVIDIA API key possession at the edge +- map the key to a local NeMo principal +- forward trusted principal headers +- let existing NeMo authorization decide access + +That matches the current platform architecture with the smallest amount of core churn. 
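The edge steps in this recommendation (validate, map, forward), combined with the header-stripping rule, might look roughly like the function below. Everything here is illustrative: `translate` and its three injected callables are hypothetical, the choice of `401` versus `403` per failure is an assumption (the spec only says one of the two is returned), and the comma and space separators for groups and scopes are guesses at a serialization the spec does not define.

```python
from typing import Callable, Dict, Mapping, Optional, Tuple

# Inbound headers the edge must remove before any translation logic runs.
STRIPPED_HEADERS = {
    "x-nmp-principal-id",
    "x-nmp-principal-email",
    "x-nmp-principal-groups",
    "x-nmp-principal-on-behalf-of",
    "x-nmp-scopes",
    "x-nmp-authorized",
}


def translate(
    headers: Mapping[str, str],
    fingerprint: Callable[[str], str],
    lookup: Callable[[str], Optional[dict]],
    probe: Callable[[str], bool],
) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers to forward); headers are empty on rejection."""
    # Strip spoofable auth headers first; names are compared case-insensitively.
    clean = {k.lower(): v for k, v in headers.items() if k.lower() not in STRIPPED_HEADERS}
    auth = clean.get("authorization", "")
    if not auth.startswith("Bearer nvapi-"):
        return 401, {}  # token does not match the expected key shape
    raw_key = auth[len("Bearer "):]
    record = lookup(fingerprint(raw_key))
    if record is None or record.get("status") != "active":
        return 403, {}  # no usable local mapping (status code is an assumption)
    if not probe(raw_key):
        return 401, {}  # remote validation failed or was unavailable: fail closed
    clean["X-NMP-Principal-Id"] = record["principal_id"]
    if record.get("principal_email"):
        clean["X-NMP-Principal-Email"] = record["principal_email"]
    if record.get("principal_groups"):
        clean["X-NMP-Principal-Groups"] = ",".join(record["principal_groups"])
    if record.get("scopes"):
        clean["X-NMP-Scopes"] = " ".join(record["scopes"])
    return 200, clean
```

Passing the fingerprint, lookup, and probe steps in as callables keeps the sketch free of network and storage details, and makes the fail-closed branches easy to exercise in tests.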
+ +## References + +- NeMo Platform middleware and principal model: + - `packages/nmp_common/src/nmp/common/auth/middleware.py` + - `packages/nmp_common/src/nmp/common/auth/models.py` + - `packages/nmp_common/src/nmp/common/auth/jwt.py` +- NeMo Platform auth docs: + - `docs/auth/deployment/gateway.md` + - `docs/auth/security-model.md` + - `docs/auth/authentication/oidc.md` +- Existing repo specs: + - `spec/machine-auth-authentication-and-authorization-spec.md` + - `spec/oidc-scope-and-claim-mapping-spec.md` +- NVIDIA docs reviewed on June 8, 2026: + - NeMo Retriever docs: `NVIDIA_API_KEY` authorizes HTTP calls to NVIDIA-hosted NIMs and keys typically start with `nvapi-` + - NeMo Evaluator docs: NVIDIA Build authentication example uses `Authorization: Bearer $NGC_API_KEY` against `https://integrate.api.nvidia.com/v1/models` + - NVIDIA Build model docs: gRPC examples pass `authorization: Bearer $NVIDIA_API_KEY` diff --git a/spec/oidc-scope-and-claim-mapping-spec.md b/spec/oidc-scope-and-claim-mapping-spec.md new file mode 100644 index 0000000000..fb031455d3 --- /dev/null +++ b/spec/oidc-scope-and-claim-mapping-spec.md @@ -0,0 +1,282 @@ +# OIDC Scope And Claim Mapping Spec + +## Summary + +This spec defines a normalization and mapping layer between upstream OAuth/OIDC provider claims and the NeMo Platform authorization model. + +It is intentionally separate from `plugin-service-authz-spec.md`. + +The plugin service authz spec covers: + +- permissions +- service-scoped roles +- path rules + +This spec covers: + +- how OAuth/OIDC claims become a NeMo principal +- how provider-native scopes become NeMo-understood scopes +- whether and how external claims should influence NeMo roles + +## Problem + +Different OAuth/OIDC providers emit different claims and scope formats. 
+ +Examples: + +- `openid profile email` +- `api://foo/read` +- `resource.read` +- provider-specific group claims +- provider-specific subject formats + +The current NeMo auth model expects: + +- a principal id +- optional principal email +- optional principal groups +- optional platform-understood scopes such as `models:read` + +Without normalization, plugin and platform authz policy becomes tightly coupled to whichever provider a deployment uses. + +## Goals + +- Normalize provider-native identity claims into a NeMo principal model. +- Normalize provider-native scopes into NeMo-understood scopes before PDP evaluation. +- Preserve the current separation between: + - OAuth/OIDC identity + - NeMo role bindings + - NeMo endpoint permissions +- Keep the plugin endpoint model provider-independent. + +## Non-Goals + +- Replacing NeMo role bindings with OAuth scopes. +- Defining plugin-owned path rules. +- Redesigning the PDP permission model. + +## Current System + +Today the platform: + +- extracts principal id/email/groups +- extracts token scopes +- sends both to the PDP + +Role bindings are loaded from the entity store and merged into the authorization data. + +Important current behavior: + +- there is no built-in facility that maps OIDC scopes directly to NeMo roles +- there is no built-in facility that maps provider-native scopes directly to NeMo permissions +- roles come from NeMo role bindings, not from OAuth scopes + +This means: + +- OAuth/OIDC provides identity and token metadata +- NeMo owns the authorization grants + +## Design Principles + +### Principle 1: Identity and Authorization Stay Separate + +The mapping layer may normalize: + +- subject +- email +- groups +- scopes + +But it must not collapse the platform permission model into provider-native claims. + +NeMo permissions should continue to come from NeMo role bindings and role definitions. 
+ +### Principle 2: Endpoint Policy Must Be Provider-Independent + +Plugin services and platform services should not encode provider-native scopes or claims in endpoint policy. + +Endpoint rules should only reference: + +- NeMo permissions +- optionally NeMo-normalized scopes + +### Principle 3: Mapping Must Be Deployment-Configurable + +Different deployments may use different providers and different claim conventions. + +The normalization layer should therefore be deployment-configurable rather than hardcoded into plugin/service definitions. + +## Scope Mapping + +### Input + +Provider-native token scopes, for example: + +- `openid` +- `profile` +- `api://foo/read` +- `resource.read` + +### Output + +NeMo-understood scopes, for example: + +- `models:read` +- `platform:write` + +### Proposed Behavior + +Before PDP evaluation: + +1. extract raw token scopes +2. apply configured scope mapping rules +3. produce normalized NeMo scopes +4. pass normalized scopes to the Policy Decision Point (PDP) + +The PDP should not need to know which upstream provider produced the token. + +### Mapping Shape + +At a minimum, the mapping layer should support: + +- exact scope mapping +- dropping irrelevant scopes +- passing through already-normalized NeMo scopes unchanged + +Conceptual example: + +```yaml +scope_mapping: + exact: + "api://foo/models.read": "models:read" + "api://foo/models.write": "models:write" + "api://foo/platform.admin": "platform:write" + passthrough_nemo_scopes: true + ignore: + - "openid" + - "profile" + - "email" + - "offline_access" +``` + +## Claim Mapping + +The mapping layer should also normalize identity claims into the NeMo principal model. 
+ +Possible sources: + +- `sub` +- `email` +- `groups` +- provider-specific custom claims + +Conceptual example: + +```yaml +claim_mapping: + principal_id: "sub" + principal_email: "email" + principal_groups: "groups" +``` + +This allows providers with different claim names to be normalized into the same internal principal structure. + +## Roles + +### Current Recommendation + +Do not map OIDC scopes directly to NeMo roles in the first iteration. + +Reason: + +- it mixes identity-provider policy with platform authorization state +- it makes roles provider-dependent +- it bypasses the existing NeMo role binding model + +Instead: + +- normalize identity claims +- normalize scopes +- keep roles granted by NeMo role bindings + +### Possible Future Extension + +If needed later, the platform could support external-claim-driven role projection as a separate feature. + +Examples: + +- map a directory group to a NeMo role +- map a provider-specific claim to a NeMo role + +But this should be modeled explicitly as external-role projection, not as the default meaning of scopes. 
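The scope and claim normalization described in the sections above can be illustrated with a small sketch. It reuses the conceptual `scope_mapping` and `claim_mapping` shapes as Python dicts. The function names, the `NEMO_SCOPES` allow-list used to decide what counts as "already normalized", and the choice to drop unmapped scopes are all assumptions made for this example, not behavior the spec defines.

```python
from typing import Any, Dict, List, Mapping

SCOPE_MAPPING = {
    "exact": {
        "api://foo/models.read": "models:read",
        "api://foo/models.write": "models:write",
        "api://foo/platform.admin": "platform:write",
    },
    "passthrough_nemo_scopes": True,
    "ignore": {"openid", "profile", "email", "offline_access"},
}
CLAIM_MAPPING = {"principal_id": "sub", "principal_email": "email", "principal_groups": "groups"}

# Assumed allow-list of scopes the platform already understands.
NEMO_SCOPES = {"models:read", "models:write", "platform:read", "platform:write"}


def normalize_scopes(raw_scope: str) -> List[str]:
    """Map a space-delimited provider scope string to NeMo scopes, preserving order."""
    normalized: List[str] = []
    for scope in raw_scope.split():
        if scope in SCOPE_MAPPING["ignore"]:
            continue
        mapped = SCOPE_MAPPING["exact"].get(scope)
        if mapped is None and SCOPE_MAPPING["passthrough_nemo_scopes"] and scope in NEMO_SCOPES:
            mapped = scope
        # Unmapped provider scopes are dropped (an assumption in this sketch).
        if mapped is not None and mapped not in normalized:
            normalized.append(mapped)
    return normalized


def normalize_claims(claims: Mapping[str, Any]) -> Dict[str, Any]:
    """Project provider claims onto the NeMo principal fields."""
    groups = claims.get(CLAIM_MAPPING["principal_groups"]) or []
    if isinstance(groups, str):
        groups = [groups]  # some providers emit a single group as a bare string
    return {
        "principal_id": claims.get(CLAIM_MAPPING["principal_id"]),
        "principal_email": claims.get(CLAIM_MAPPING["principal_email"]),
        "principal_groups": list(groups),
    }
```

Note that neither function produces roles: the output is only a principal and a scope list, which keeps role grants in NeMo role bindings as recommended above.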
+ +## Options + +### Option 1: Normalize Scopes Only + +- map provider scopes to NeMo scopes +- keep roles entirely in NeMo + +Pros: + +- minimal change +- stays close to the current system +- keeps permission grants under platform control + +Cons: + +- still requires separate role binding administration + +### Option 2: Normalize Scopes and Claims + +- map provider scopes to NeMo scopes +- map provider claims to principal id/email/groups +- keep roles entirely in NeMo + +Pros: + +- cleaner multi-provider support +- keeps endpoint policy provider-independent +- still preserves current role-binding model + +Cons: + +- slightly larger implementation surface than scope-only mapping + +### Option 3: Map External Claims to Roles + +- map scopes or claims directly to NeMo roles + +Pros: + +- can reduce manual role-binding administration in some deployments + +Cons: + +- more invasive change +- blurs ownership of authorization policy +- harder to reason about and audit + +## Recommendation + +Recommend Option 2: + +- normalize provider claims into the NeMo principal model +- normalize provider-native scopes into NeMo scopes +- keep NeMo roles and permissions granted through NeMo role bindings + +This minimizes change to the current authorization model while making the platform much easier to integrate with different OAuth/OIDC providers. + +## Relationship To Plugin Service Authz + +This mapping layer sits before plugin endpoint authorization. + +Order of operations: + +1. provider authenticates caller +2. mapping layer normalizes claims and scopes +3. platform constructs NeMo principal +4. PDP evaluates endpoint permissions/scopes/roles +5. plugin path rules are checked using normalized NeMo auth context + +Plugin path rules should not contain provider-specific logic. 
diff --git a/spec/options-for-multi-idp.md b/spec/options-for-multi-idp.md new file mode 100644 index 0000000000..719cb74aa5 --- /dev/null +++ b/spec/options-for-multi-idp.md @@ -0,0 +1,739 @@ +# Multi-IdP, SSO, and Tenant Isolation Research for NeMo Platform + +## Executive Summary + +NeMo Platform today supports a single OIDC configuration per deployment, plus a small `additional_issuers` escape hatch intended for issuer-format variance such as Azure AD v1/v2. It does **not** currently support: + +- multiple first-class IdPs in one deployment +- home realm discovery (HRD) or domain-based IdP routing +- user account linking across IdPs +- true multi-tenant isolation + +The repo also explicitly documents that workspaces are a logical authorization boundary, **not** a tenant-isolation boundary. + +If the product requirement is: + +1. one NeMo deployment that supports many customer IdPs, and +2. users from different organizations signing in through different IdPs, and +3. true tenant isolation where one tenant cannot affect or observe another, + +then these should be treated as **two separate architectures**: + +- **Multi-IdP federation for a single deployment**: solve with an identity broker / gateway in front of NeMo. +- **Tenant isolation**: solve with a separate NeMo Platform deployment per tenant. + +That split is the core recommendation of this document. + +## Current State in This Repo + +### 1. Auth config is single-provider + +`OIDCConfig` models one provider: one `issuer`, one `client_id`, optional endpoint overrides, and one claim-mapping profile. See [packages/nmp_common/src/nmp/common/config/base.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/config/base.py:89). 
+ +Relevant indicators: + +- single `issuer`: [base.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/config/base.py:97) +- single `client_id`: [base.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/config/base.py:110) +- single `email_claim`, `groups_claim`, `subject_claim`: [base.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/config/base.py:143) +- one nested `auth.oidc` object under `AuthConfig`: [base.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/config/base.py:234) + +### 2. JWT validation assumes one primary discovery source + +`JWTValidator` fetches discovery from `config.oidc.issuer`, builds one JWKS client, and validates tokens against one audience profile plus a flat list of allowed issuers. See [packages/nmp_common/src/nmp/common/auth/jwt.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/jwt.py:43). + +Important details: + +- discovery URL is built from a single issuer: [jwt.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/jwt.py:63) +- one JWKS client instance is cached: [jwt.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/jwt.py:71) +- allowed issuers are `[issuer] + additional_issuers`: [jwt.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/jwt.py:178) + +This is not a provider registry. It is a single-provider validator with a compatibility list. + +### 3. Auth discovery is single-provider + +The auth discovery endpoint exposes one `oidc` object with one issuer, one token endpoint, one device authorization endpoint, and one client ID. See [services/core/auth/src/nmp/core/auth/api/v2/discovery/endpoints.py](/Users/rsadler/src/nemo-platform/services/core/auth/src/nmp/core/auth/api/v2/discovery/endpoints.py:24). + +This matters because the CLI and SDK bootstrap from this shape today. + +### 4. 
Studio is single-authority + +Studio runtime env exposes a single `VITE_AUTH_AUTHORITY` and `VITE_AUTH_CLIENT_ID`. See [web/packages/studio/src/constants/environment.ts](/Users/rsadler/src/nemo-platform/web/packages/studio/src/constants/environment.ts:64). + +Auto-login also assumes exactly one authority and redirects there immediately when auth is enabled: [web/packages/studio/src/providers/auth/useAuthLogin.ts](/Users/rsadler/src/nemo-platform/web/packages/studio/src/providers/auth/useAuthLogin.ts:21). + +This means there is no native pre-login organization picker, no HRD screen, and no provider selection UX. + +### 5. Authorization is workspace-scoped, not tenant-isolated + +The docs are explicit: + +- principals are typically human users identified by email: [docs/auth/concepts.md](/Users/rsadler/src/nemo-platform/docs/auth/concepts.md:28) +- workspaces are the auth boundary: [docs/get-started/concepts/workspaces.md](/Users/rsadler/src/nemo-platform/docs/get-started/concepts/workspaces.md:4) +- the product does **not** provide database-isolated multi-tenancy: [docs/auth/security-model.md](/Users/rsadler/src/nemo-platform/docs/auth/security-model.md:172) + +This confirms the current model is: + +- one deployment +- many workspaces +- shared control plane and shared storage surfaces + +That is not equivalent to enterprise tenant isolation. + +### 6. There is at least one user-experience assumption tied to email-local-part + +Studio derives a default workspace name from the email local part: [web/packages/studio/src/providers/auth/useAuthProfile.ts](/Users/rsadler/src/nemo-platform/web/packages/studio/src/providers/auth/useAuthProfile.ts:18). + +That is fragile in any serious enterprise identity design and would need reconsideration even before multi-IdP. + +## Requirements Decomposition + +The original request actually bundles three different problems: + +### A. 
Multiple IdP types for one NeMo deployment + +Examples: + +- one customer uses Entra ID +- another uses Okta +- another uses Ping or Keycloak +- some enterprise connections are OIDC, others are SAML + +### B. SSO for users that may authenticate through different IdPs + +Examples: + +- same human can sign in with corporate IdP or a backup/social/provider-managed identity +- one user belongs to two organizations with different upstream IdPs +- one deployment needs email-domain routing or organization routing + +### C. True tenant isolation + +Your requirement is stronger than workspace isolation: + +> Multitenant should be a completely separate instance of nemo-platform isolated from every other tenant. + +That is not "multitenancy inside a deployment." That is a **fleet of isolated single-tenant deployments**. + +This is the right framing if the goal is enterprise-grade isolation. + +## Industry Patterns + +### Pattern 1: Identity broker in front of the app + +This is the dominant pattern for apps that want: + +- multiple upstream IdPs +- OIDC plus SAML support +- HRD / IdP routing +- JIT provisioning +- account linking +- one stable downstream OIDC integration for the app + +The app talks to exactly one downstream OIDC provider. The broker talks to many upstream IdPs. + +Examples from vendor docs: + +- Auth0 supports identifier-first login and HRD based on enterprise connection domains, redirecting users based on email domain. +- Okta supports external IdPs, IdP Discovery routing rules, account linking, and JIT provisioning. +- Keycloak supports identity brokering and can broker both OIDC and SAML IdPs. + +This matches NeMo well because NeMo already knows how to be an OIDC relying party. It does **not** know how to be a multi-protocol identity broker. 
+ +### Pattern 2: Native multi-provider auth inside the app + +The app itself owns: + +- provider registry +- routing rules +- login UX +- callback handling +- token validation for many issuers +- account linking +- provider-specific claim normalization + +This is possible, but expensive and security-sensitive. It turns NeMo from "OIDC consumer" into "identity platform." + +### Pattern 3: Separate deployment per tenant + +For strong isolation, each tenant gets: + +- separate NeMo deployment +- separate auth config +- separate DB and persistent stores +- separate secrets +- separate ingress / hostname +- separate admin surface + +This is the cleanest interpretation of your tenant-isolation requirement. + +## External Research + +### Auth0: identifier-first and HRD + +Auth0 documents an identifier-first flow where the user enters email first, and if the email domain matches a configured enterprise connection domain, the user is redirected to that enterprise IdP. Auth0 explicitly calls this HRD. 
+ +Source: + +- Auth0, "Configure Identifier First Authentication": https://auth0.com/docs/authenticate/login/auth0-universal-login/identifier-first + +Why it matters for NeMo: + +- this is the exact UX pattern Studio would need if one deployment supports many enterprise IdPs +- NeMo does not have this UX today + +### Okta: external IdPs, account linking, JIT, and IdP discovery + +Okta documents: + +- external OIDC and SAML IdPs +- account linking so many IdP identities map to one Okta user +- JIT provisioning +- IdP Discovery routing rules + +Sources: + +- Okta, "External Identity Providers": https://developer.okta.com/docs/concepts/identity-providers/ +- Okta, "Add an enterprise identity provider": https://developer.okta.com/docs/guides/add-an-external-idp/ + +Why it matters for NeMo: + +- this is the feature envelope enterprises will expect if we claim "multiple IdPs" +- it also shows why brokering is attractive: app speaks OIDC once, broker handles the rest + +### Auth0: account linking is not automatic and must be done carefully + +Auth0 documents that identities are separate by default and that account linking should require authentication for both accounts before linking. + +Source: + +- Auth0, "User Account Linking": https://auth0.com/docs/manage-users/user-accounts/user-account-linking + +Why it matters for NeMo: + +- "same email across providers" is not enough to merge identities safely +- if NeMo ever owns account linking, it must be deliberate, audited, and re-authenticated + +### Microsoft Entra ID: multitenant issuer handling is stricter than single-tenant + +Microsoft documents that multitenant apps must validate tokens differently because the issuer varies by tenant, and that `/common` or `/organizations` sign-in requires tenant-aware issuer validation. 
+ +Sources: + +- Microsoft Learn, "Convert single-tenant app to multitenant on Microsoft Entra ID": https://learn.microsoft.com/en-us/entra/identity-platform/howto-convert-app-to-be-multi-tenant +- Microsoft Learn, "OpenID Connect (OIDC) on the Microsoft identity platform": https://learn.microsoft.com/en-us/entra/identity-platform/v2-protocols-oidc + +Why it matters for NeMo: + +- the current `additional_issuers` pattern is not a general answer for arbitrary multi-tenant / multi-issuer acceptance +- any direct support for many Entra tenants needs issuer-aware key lookup and stricter tenant binding + +### OIDC core: `login_hint` + +OpenID Connect defines `login_hint` as a way for the RP to pass an email, phone number, or username to the OP as a hint. + +Source: + +- OpenID Foundation, "OpenID Connect Core 1.0": https://openid.net/specs/openid-connect-core-1_0.html + +Why it matters for NeMo: + +- if NeMo uses a broker or a provider that supports it, email-first discovery can hand off the identifier cleanly +- for Entra specifically, `domain_hint` can skip email-based discovery in federated setups + +### Keycloak: identity brokering for OIDC and SAML + +Keycloak documents: + +- Identity Provider Redirector in auth flows +- configuring generic OIDC IdPs +- configuring SAML IdPs + +Source: + +- Keycloak Server Administration Guide: https://www.keycloak.org/docs/latest/server_admin/ + +Why it matters for NeMo: + +- Keycloak is a realistic self-hosted broker option if NVIDIA does not want a managed CIAM dependency + +## Recommendation 1: keep NeMo as an OIDC consumer, not an identity broker + +This is the strongest recommendation in the document. + +For one NeMo deployment that needs many upstream IdPs, put a broker in front: + +- managed: Auth0, Okta, WorkOS, Microsoft Entra External ID +- self-hosted: Keycloak + +Then configure NeMo to trust **one** broker-issued OIDC issuer per deployment. 
+ +Benefits: + +- minimal changes to NeMo auth core +- OIDC and SAML can both be supported upstream +- HRD / IdP routing is delegated to software built for it +- account linking stays out of NeMo +- CLI and Studio stay downstream OIDC clients instead of becoming provider routers + +Tradeoff: + +- adds broker dependency and operational surface + +This tradeoff is still much cheaper than turning NeMo into a multi-provider IAM product. + +## Recommendation 2: tenant isolation should mean separate NeMo deployments + +If the tenant boundary must be strong, the right unit is a deployment, not a workspace. + +Each tenant should get: + +- separate NeMo Platform instance +- separate database +- separate files / buckets / storage prefixes with separate credentials +- separate secrets backend or namespace +- separate auth broker / auth app config +- separate ingress hostname +- separate telemetry labels and audit sinks + +Within each tenant deployment, keep using workspaces for teams, environments, and projects. + +That gives a clean layering: + +- **deployment** = tenant boundary +- **workspace** = team/project boundary inside a tenant + +## Recommendation 3: if native multi-IdP is still required, limit phase 1 to OIDC-only + +If product direction insists on native support inside NeMo, do not start with SAML. + +Start with: + +- multiple OIDC issuers +- explicit provider aliases +- domain / organization routing +- no account linking in phase 1 + +Then decide later whether SAML belongs in NeMo at all. + +## What Would Be Required for Native Multi-IdP in NeMo + +This section assumes we ignore Recommendation 1 and build it in-repo. + +### 1. New auth configuration model + +Replace: + +```yaml +auth: + oidc: + issuer: ... + client_id: ... +``` + +with something like: + +```yaml +auth: + providers: + - alias: entra-acme + protocol: oidc + issuer: https://login.microsoftonline.com//v2.0 + client_id: ... + audience: ... 
+ email_claim: upn + subject_claim: oid + groups_claim: groups + domains: + - acme.com + - alias: okta-foo + protocol: oidc + issuer: https://foo.okta.com/oauth2/default + client_id: ... + audience: ... + domains: + - foo.com +``` + +Likely future extension: + +- SAML providers with metadata URL / entity ID / ACS settings + +### 2. Provider registry and routing rules + +NeMo would need a first-class provider registry with: + +- alias +- protocol +- issuer / metadata +- client settings +- claim mapping +- allowed domains +- optional organization IDs +- enable/disable state + +And routing rules such as: + +- email domain -> provider alias +- explicit org slug -> provider alias +- default provider for unmanaged users + +### 3. Discovery API redesign + +Current discovery returns one `oidc` object. Native multi-IdP would require something more like: + +- list of providers +- whether provider selection is user-facing +- an HRD endpoint +- browser auth initiation endpoints per provider +- CLI-compatible provider metadata + +This would be a breaking change for CLI/SDK bootstrap unless versioned carefully. + +### 4. Studio login UX redesign + +Today Studio auto-redirects to one authority. Native multi-IdP would require: + +- pre-login page +- email-first flow, or org slug input, or provider buttons +- callback handling per provider +- support for `login_hint` and possibly provider-specific params like Entra `domain_hint` + +Because Studio uses `react-oidc-context` against one authority today, the cleanest implementation would likely be: + +- a NeMo-owned broker endpoint, or +- a broker outside Studio that exposes one stable downstream authority + +Directly teaching Studio to dynamically instantiate many OIDC authorities is possible, but it is still more complex than brokering. + +### 5. CLI login redesign + +Current CLI discovery assumes one client ID and one token endpoint. That works for a single downstream OIDC provider and for device flow. 
+ +Problems for native multi-IdP: + +- which provider should `nemo auth login` use? +- SAML does not map cleanly to device flow +- some enterprise IdPs do not expose device flow in the same way the current NeMo CLI expects + +CLI would need one or more of: + +- `nemo auth login --provider ` +- `nemo auth login --org ` +- browser-based code+PKCE flow instead of device flow for some providers + +This is another reason a downstream broker is attractive: the CLI still only needs one OIDC contract. + +### 6. Identity model changes + +This is one of the biggest design changes. + +Today the docs describe principals as typically identified by email. That is not strong enough for cross-IdP identity. + +Native multi-IdP needs: + +- canonical principal key: internal UUID +- external identity table: `(provider_alias, subject)` unique +- email stored as attribute, not identity key +- email verification state +- account status and link provenance + +Suggested shape: + +- `users` + - `id` + - `primary_email` + - `display_name` + - `status` +- `external_identities` + - `user_id` + - `provider_alias` + - `subject` + - `email` + - `email_verified` + - `raw_claims_snapshot` + - `linked_at` + +Without this, email collisions and account takeover risks become very real. + +### 7. Account linking policy + +If NeMo owns cross-IdP SSO for the same person, it needs explicit policy for: + +- user-initiated linking +- admin-assisted linking +- suggested linking when verified emails match +- forbidden automatic linking when email is unverified or provider is low trust + +Minimum safe rule: + +- no silent linking based only on same email +- require active authentication on both accounts before link creation +- audit every link / unlink action + +### 8. 
Group and claim normalization + +Group semantics differ by provider: + +- Entra groups +- Okta groups +- Keycloak realm roles / client roles +- SAML attributes + +NeMo would need: + +- per-provider claim mapping +- optional group transformation +- optional group-to-role binding automation +- size and overage handling for providers that emit many groups + +The current "one groups claim name" model is not enough for this. + +### 9. Provisioning model + +Native multi-IdP also raises user lifecycle questions: + +- JIT provisioning on first login? +- pre-provisioned users only? +- SCIM inbound sync later? +- deprovision behavior when upstream account is disabled? + +Phase 1 can likely survive with: + +- JIT user record creation +- no SCIM +- role assignment still handled by NeMo workspace membership + +But enterprise buyers will quickly ask for: + +- SCIM +- group sync +- org-scoped user inventory + +### 10. Authorization model adjustments + +Workspace RBAC can stay, but some additions become likely: + +- organization / tenant entity above workspace +- workspace ownership by organization +- provider-to-organization mapping +- org-scoped admin roles + +If one deployment hosts many organizations, you do not want global wildcard patterns like `*` to span all orgs unintentionally. + +This is a subtle but important risk in the current model. + +### 11. Security and audit requirements + +Native multi-IdP expands the security surface materially. + +NeMo would need: + +- secure routing rule evaluation +- callback CSRF/state validation per provider +- nonce / PKCE rigor +- account-link audit trail +- provider config change audit trail +- replay / confused-deputy protections +- stronger principal provenance in logs + +## What Would Be Required for SAML Support + +If "multiple types of IdPs" includes SAML, there are two realistic options: + +### Option A: use a broker that converts SAML upstream to OIDC downstream + +This is the recommended path. 
+ +NeMo continues to speak OIDC only. + +### Option B: teach NeMo to be both OIDC RP and SAML SP + +That requires: + +- SAML metadata handling +- certificate rollover +- ACS endpoints +- NameID / attribute mapping +- SP-initiated and maybe IdP-initiated flow policy +- logout behavior decisions +- more frontend and CLI complexity + +This is a large scope increase and does not appear justified while NeMo still has single-provider OIDC assumptions everywhere else. + +## What "Completely Separate Instance Per Tenant" Implies + +If the product requirement is hard isolation, the platform should model tenancy as deployment orchestration, not as shared runtime policy. + +Each tenant deployment should have: + +- dedicated auth configuration +- dedicated NeMo DB +- dedicated object/file storage namespace with dedicated credentials +- dedicated secrets namespace +- dedicated ingress hostname +- dedicated runtime config and feature flags +- dedicated audit and telemetry partitioning + +Recommended shape: + +- `tenant-a.nemo.example.com` +- `tenant-b.nemo.example.com` + +Each points to a different NeMo installation. + +Possible management-plane responsibilities: + +- tenant provisioning +- DNS / TLS issuance +- broker or IdP connection setup +- deployment rollout / upgrades +- tenant suspension / deletion +- fleet health dashboard + +This can be a control plane later. It does not need to exist before tenant isolation. 
+ +## Preferred Architecture Options + +### Option 1: external identity broker + one NeMo deployment per tenant + +Shape: + +- per tenant: one broker realm / org / auth app +- per tenant: one NeMo deployment +- upstream: many customer IdPs if needed +- downstream into NeMo: one OIDC issuer + +Pros: + +- strongest isolation +- smallest NeMo code change +- easiest enterprise story + +Cons: + +- most infrastructure footprint + +### Option 2: external identity broker + shared NeMo deployment for many orgs + +Shape: + +- one NeMo deployment +- broker handles many orgs and many IdPs +- NeMo uses workspace/org RBAC for isolation + +Pros: + +- cheaper footprint +- fastest path to multi-IdP + +Cons: + +- fails your strict tenant-isolation requirement +- shared blast radius + +### Option 3: native NeMo multi-IdP + shared deployment + +Pros: + +- fewer external dependencies on paper + +Cons: + +- large engineering and security surface +- still does not solve tenant isolation by itself +- forces Studio and CLI redesign + +This is the least attractive option. 
+ +## Recommended Incremental Roadmap + +### Phase 0: state the product boundary clearly + +Document: + +- workspaces are not tenants +- current auth is single-provider OIDC +- SAML and multi-IdP require a broker today + +### Phase 1: formalize "bring your own broker" + +Productize the current best path: + +- validate NeMo against one brokered OIDC issuer +- document supported brokers +- document claim mapping and group mapping recipes +- document HRD via broker + +Likely repo work: + +- tighten docs +- add tested examples for Auth0 / Okta / Keycloak / Entra External ID +- possibly add better `subject_claim` / `email_claim` guidance per provider + +### Phase 2: make one deployment broker-friendly + +Small NeMo improvements that help without making NeMo the broker: + +- richer claim mapping support +- explicit internal principal UUID instead of email-centric assumptions +- better audit metadata for source issuer / subject +- safer Studio defaulting than `email.split('@')[0]` + +### Phase 3: isolated tenant deployment model + +Build automation for: + +- per-tenant deployment provisioning +- per-tenant auth configuration +- per-tenant DNS / ingress / storage / secrets + +### Phase 4: revisit native multi-IdP only if still necessary + +Only after phases 1-3 should NeMo consider: + +- provider registry +- HRD UI +- account linking +- SAML SP support + +## Concrete Repo Touchpoints If Work Proceeds + +If NeMo implements any part of this, the main code surfaces are: + +- auth config model: + [packages/nmp_common/src/nmp/common/config/base.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/config/base.py:89) +- token validation: + [packages/nmp_common/src/nmp/common/auth/jwt.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/jwt.py:43) +- auth discovery: + [services/core/auth/src/nmp/core/auth/api/v2/discovery/endpoints.py](/Users/rsadler/src/nemo-platform/services/core/auth/src/nmp/core/auth/api/v2/discovery/endpoints.py:24) 
+- middleware and principal propagation: + [packages/nmp_common/src/nmp/common/auth/middleware.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/middleware.py:1) + [packages/nmp_common/src/nmp/common/auth/models.py](/Users/rsadler/src/nemo-platform/packages/nmp_common/src/nmp/common/auth/models.py:1) +- Studio auth runtime: + [web/packages/studio/src/constants/environment.ts](/Users/rsadler/src/nemo-platform/web/packages/studio/src/constants/environment.ts:64) + [web/packages/studio/src/providers/auth/useAuthLogin.ts](/Users/rsadler/src/nemo-platform/web/packages/studio/src/providers/auth/useAuthLogin.ts:16) + [web/packages/studio/src/providers/auth/useAuthProfile.ts](/Users/rsadler/src/nemo-platform/web/packages/studio/src/providers/auth/useAuthProfile.ts:18) +- docs that should be updated with product boundaries: + [docs/auth/security-model.md](/Users/rsadler/src/nemo-platform/docs/auth/security-model.md:172) + [docs/get-started/concepts/workspaces.md](/Users/rsadler/src/nemo-platform/docs/get-started/concepts/workspaces.md:4) + +## Open Questions + +These should be answered before implementation planning: + +1. Is the real target "many customer IdPs for one shared SaaS deployment", or "a deployment template that each customer installs separately"? +2. Is SAML a hard day-1 requirement, or is OIDC-first acceptable? +3. Must the CLI support all enterprise SSO paths, or is browser-based Studio login sufficient at first? +4. Do we want NeMo to own user lifecycle and account linking, or should that remain in an upstream broker? +5. Is a managed CIAM dependency acceptable, or must the solution be self-hosted? + +## Final Recommendation + +For NeMo Platform, the technically sound path is: + +1. **Do not implement native multi-IdP first.** +2. **Use an identity broker in front of NeMo for multi-IdP and SAML.** +3. **Treat strict tenant isolation as separate NeMo deployments, not shared-workspace multitenancy.** +4. 
**Keep workspaces as intra-tenant segmentation only.** + +That path aligns with the current repo architecture, minimizes auth risk, and matches the enterprise requirement more honestly than trying to stretch workspace RBAC into tenant isolation. diff --git a/spec/permissionless-authenticated-routes-spec.md b/spec/permissionless-authenticated-routes-spec.md new file mode 100644 index 0000000000..45ffc1dc5e --- /dev/null +++ b/spec/permissionless-authenticated-routes-spec.md @@ -0,0 +1,113 @@ +# Permissionless Authenticated Routes Spec + +## Summary + +This spec explores whether NeMo Platform should continue to allow authenticated routes that require no explicit platform permission. + +This is intentionally separate from `plugin-service-authz-spec.md`. + +The plugin service authz spec preserves the current behavior. This document questions whether that behavior should continue long-term. + +## Current Behavior + +Today the platform allows endpoints that are: + +- authenticated +- permissionless + +In practice this means an endpoint may be configured with: + +- `permissions: []` + +and, assuming the caller is authenticated, the route is allowed. + +This behavior exists today and is preserved by `plugin-service-authz-spec.md`. + +## Problem + +Permissionless authenticated routes are convenient, but they weaken the clarity of the authorization model. + +Problems: + +- they make "authenticated" and "authorized" easier to conflate +- they create endpoints that are broad by default for all authenticated callers +- they make it harder to audit access expectations +- they encourage endpoints whose security semantics are underspecified + +## Goals + +- Determine whether permissionless authenticated routes should remain a supported pattern. +- Improve clarity and auditability of protected endpoint behavior. +- Preserve an ergonomic path for simple authenticated-only endpoints if needed. + +## Non-Goals + +- Redesigning plugin path-rule decorators. 
+- Changing the current behavior immediately. + +## Options + +### Option 1: Preserve Current Behavior + +Keep allowing `USER` and `WORKLOAD` rules with empty `permissions_required`. + +Pros: + +- no behavioral change +- easy for simple authenticated-only endpoints +- closest to the current system + +Cons: + +- weaker security model +- less explicit review surface + +### Option 2: Require Explicit Permissions For All Non-Anonymous Routes + +Require every `USER` or `WORKLOAD` rule to name at least one permission. + +Pros: + +- strongest model +- easiest to audit +- every protected route is governed by a platform permission + +Cons: + +- more authoring overhead +- requires inventing permissions for lightweight endpoints + +### Option 3: Preserve The Pattern, But Make It Explicit + +Allow permissionless authenticated routes only through a distinct explicit marker. + +Examples: + +- `authenticated_only=True` +- `authorization_mode="authenticated_only"` + +Pros: + +- clearer than empty `permissions_required` +- keeps convenience for rare cases + +Cons: + +- adds another concept to the rule model + +## Recommendation + +Do not change current behavior as part of the first plugin authz redesign. + +Instead: + +- preserve the existing behavior in `plugin-service-authz-spec.md` +- revisit this separately after the decorator/path-rule model lands + +Long-term, Option 2 or Option 3 is likely preferable to the implicit empty-permissions pattern. + +## Relationship To Plugin Service Authz + +`plugin-service-authz-spec.md` should preserve the current behavior for compatibility with the existing platform authorization model. + +If the platform later decides to tighten or reformulate permissionless authenticated routes, that should happen as follow-up work under this spec rather than being bundled into the initial plugin authz implementation. 
diff --git a/spec/plugin-defined-scopes-spec.md b/spec/plugin-defined-scopes-spec.md new file mode 100644 index 0000000000..cc9ffe506c --- /dev/null +++ b/spec/plugin-defined-scopes-spec.md @@ -0,0 +1,179 @@ +# Plugin-Defined Scopes Spec + +## Summary + +This spec explores whether NeMo Platform should support plugin-defined scopes as a first-class concept. + +This is intentionally separate from `plugin-service-authz-spec.md`. + +That spec allows plugin endpoints to reference normalized platform scopes in path rules, but it does not define how plugins declare, validate, surface, or document scopes themselves. This document explores that missing scope model. + +## Current State + +Today, plugin-contributed endpoint authz can include `scopes`, but those scopes are just strings attached to endpoint rules. + +Current behavior: + +- plugin authz contributions may attach `scopes` to endpoint methods +- the PDP enforces those endpoint-required scopes when token-provided platform scopes are present +- platform scopes are recognized by the presence of `:` in the scope string +- standard OIDC scopes such as `openid`, `profile`, `email`, and `offline_access` are ignored for authorization + +What does not currently exist: + +- a plugin-defined scope registry +- a scope definition model with required metadata +- bundle-time validation beyond basic endpoint usage +- a first-class docs or discovery surface for plugin-contributed scopes + +## Problem + +Permissions are treated as a first-class platform concept with explicit definitions and descriptions. + +Scopes are not. 
+ +That creates several problems: + +- plugin authors can reference scope strings without defining them anywhere +- scopes may drift in spelling or naming conventions +- there is no canonical place to attach descriptions or documentation +- it is unclear whether a scope is platform-owned, plugin-owned, or provider-specific +- docs generation and discovery become inconsistent + +## Goals + +- Define whether plugin scopes should be first-class platform objects. +- Require a canonical declaration path if plugin-defined scopes are supported. +- Enforce scope naming conventions at bundle-validation time. +- Keep plugin endpoint rules able to reference normalized platform scopes. +- Separate provider-native scope handling from platform-defined scope handling. + +## Non-Goals + +- Redesigning the current optional scope-checking behavior in the PDP. +- Replacing permissions with scopes. +- Solving IdP-specific scope issuance or consent UX. +- Changing the current plugin-service authz implementation plan. + +## Design Questions + +This spec should answer at least these questions: + +1. Should plugins be allowed to define new normalized platform scopes? +2. If yes, where are those scopes declared? +3. Should scope definitions require descriptions, like permissions do? +4. Should plugin-defined scopes be namespaced by service or API area? +5. Should plugin-defined scopes appear in docs, discovery APIs, and UI? +6. How should plugin-defined scopes relate to provider-native OAuth/OIDC scopes? + +## Options + +### Option 1: Keep Scopes As Endpoint-Only Strings + +Plugins may continue to reference normalized scope strings in path rules, but there is no first-class scope definition model. + +Pros: + +- smallest change +- closest to current behavior +- no extra registry or docs work + +Cons: + +- weak validation +- no canonical descriptions +- scope drift remains easy + +### Option 2: Add A Plugin Scope Registry + +Plugins define scopes explicitly, similar to permissions. 
+ +Conceptual example: + +```python +@dataclass(frozen=True) +class ScopeDef: + id: str + description: str +``` + +And: + +```python +class NemoService(_NamedPlugin): + ... + + def get_scope_definition(self) -> ServiceScopeDefinition | None: + return None +``` + +Pros: + +- explicit +- validates well +- supports docs and discovery + +Cons: + +- more machinery +- needs a clear relationship to permissions + +### Option 3: Reuse The Existing Service Authz Definition Pattern + +Plugin-defined scopes become part of the same service-level authz definition shape used for permissions. + +Conceptual example: + +```python +@dataclass +class ServiceAuthzDefinition: + permissions: list[PermissionDef] = field(default_factory=list) + scopes: list[ScopeDef] = field(default_factory=list) +``` + +Pros: + +- one service-owned definition surface +- consistent with permission definitions +- keeps scope ownership close to other auth metadata + +Cons: + +- makes the authz definition surface broader +- requires the main authz model to grow later + +## Recommendation + +Recommend Option 3 if plugin-defined scopes are adopted later. + +If scopes become first-class, they should follow the same general pattern as permissions: + +- explicit service-level declaration +- required descriptions +- bundle-time validation +- normalized naming conventions + +Endpoint rules should reference previously defined scopes rather than inventing raw scope strings inline. 
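The four properties recommended above (explicit declaration, required descriptions, bundle-time validation, normalized naming) could be checked roughly as follows. This is a sketch under stated assumptions, not the platform's implementation: `validate_scopes` and its error strings are hypothetical, and `ScopeDef` mirrors the conceptual model shown under Option 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeDef:
    id: str
    description: str


def validate_scopes(defined: list[ScopeDef], referenced: list[str]) -> list[str]:
    """Hypothetical bundle-time check; an empty result means the bundle passes."""
    errors: list[str] = []
    seen: set[str] = set()
    for scope in defined:
        # Normalized platform scopes use ':' between non-empty segments.
        segments = scope.id.split(":")
        if len(segments) < 2 or not all(segments):
            errors.append(f"{scope.id!r}: must be ':'-separated segments")
        if not scope.description.strip():
            errors.append(f"{scope.id!r}: description is required")
        if scope.id in seen:
            errors.append(f"{scope.id!r}: duplicate definition")
        seen.add(scope.id)
    # Endpoint rules may only reference scopes the service has declared.
    for scope_id in referenced:
        if scope_id not in seen:
            errors.append(f"{scope_id!r}: referenced but not defined")
    return errors


# Usage: one declared scope, one undeclared reference.
defs = [ScopeDef("agents:read", "Read agent resources")]
print(validate_scopes(defs, ["agents:read", "agents:write"]))
```

Returning a list of errors rather than raising on the first one is a design choice in this sketch; it lets a bundle report every problem in a single validation pass.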
+ +## Validation Expectations + +If plugin-defined scopes are introduced later, bundle-time validation should enforce at least: + +- normalized platform scopes use `:` as the segment separator +- provider-native scopes are not used directly in plugin endpoint rules +- every referenced plugin-defined scope exists in the service scope registry +- every plugin-defined scope includes a description +- duplicate or conflicting scope definitions fail validation + +## Relationship To Other Specs + +- `plugin-service-authz-spec.md` + - keeps scopes as normalized strings referenced by path rules + - does not define plugin-defined scopes as a first-class model + +- `oidc-scope-and-claim-mapping-spec.md` + - covers mapping provider-native OAuth/OIDC scopes into normalized NeMo platform scopes + - should remain separate from plugin-owned scope definition + +- `core-role-default-grants-spec.md` + - permissions and roles remain a separate follow-up track from scope definition diff --git a/spec/plugin-service-authz-spec.md b/spec/plugin-service-authz-spec.md new file mode 100644 index 0000000000..96d53d6c31 --- /dev/null +++ b/spec/plugin-service-authz-spec.md @@ -0,0 +1,562 @@ +# Plugin Service Authz Spec + +## Summary + +This spec replaces the current split between `nemo.services` and `nemo.authz` for HTTP-facing plugin authorization. + +Plugins will define authorization entirely through the `NemoService` surface using two concepts: + +- permissions +- path rules + +The design is intentionally narrow. It does not introduce a general policy DSL. It only covers the data required to describe service-owned HTTP authorization. + +## Goals + +- Remove `nemo.authz` as a separate plugin discovery/configuration surface. +- Make `nemo.services` the sole source of plugin HTTP auth policy. +- Allow plugin authors to define authz close to the routes they own. +- Make plugin-owned authz definitions type-safe at the service authoring layer. 
+- Preserve compatibility with generated/programmatic routers such as job route factories. +- Make a clean design break before release rather than carrying forward transitional APIs. + +## Non-Goals + +- Defining a general-purpose authorization language. +- Replacing core platform roles with plugin-specific roles. +- Defining plugin-defined roles. +- Supporting non-HTTP policy surfaces in this iteration. +- Solving every possible cross-service permission composition case. +- Defining how plugin-defined roles are surfaced in IAM, CLI, UI, or docs. + +## Current Problems + +Today plugin HTTP authz is split across two mechanisms: + +- `nemo.services` defines routers and paths. +- `nemo.authz` or `NemoService.get_authz_contribution()` defines permissions and endpoint policy separately. + +This creates several problems: + +- Route definitions and authz definitions drift because they are authored separately. +- `get_authz_contribution()` must be a classmethod because service discovery loads classes, which is awkward and non-idiomatic. +- Plugin authors need to learn two registration/configuration paths for one HTTP service. +- Factory-generated routes need separate handwritten authz helpers instead of carrying their own auth metadata. + +## Design Overview + +HTTP plugin authz is defined entirely through `NemoService`. + +Each plugin service will define: + +- permission definitions +- path rules + +Path rules may be declared directly with `@path_rule(...)` on route handlers or attached programmatically by route factories. + +The platform will derive the service authz contribution by: + +1. discovering `nemo.services` +2. instantiating each service +3. reading its routers +4. reading authz metadata attached to routes +5. reading optional service-level authz definitions +6. building a normalized authz model used by runtime bundle generation and static sync tooling + +The normalized authz model described in this spec governs endpoint permissions and route rules. 
It does not replace the upstream identity extraction layer. + +## Core Concepts + +### Permissions + +Permissions are the canonical, service-owned identifiers used in path rules. + +Rules: + +- Permission ids must be namespaced to the declared `permission_namespace`. +- Format remains dot-separated. +- A plugin service may only define permissions under its declared permission namespace. +- Permission ids must use `.` as the segment separator. + +Examples: + +- `agents.deployments.read` +- `agents.deployments.create` +- `customization.jobs.read` + +Each permission definition includes: + +- id +- description + +Descriptions are required. + +Minimal model: + +```python +@dataclass(frozen=True) +class PermissionDef: + id: str + description: str +``` + +Permissions must be declared explicitly in `get_authz_definition()`. + +Path rules may reference permissions, but they do not define them. + +This means: + +- raw string permission definitions in endpoint decorators are not supported +- every permission referenced by a path rule must already exist in the service authz definition +- every permission definition must include a description +- every permission referenced by the service must begin with the declared `permission_namespace` + +### Path Rules + +Path rules define which callers and required permissions apply to a concrete HTTP method and mounted path. + +Rules: + +- Path rules are owned by the service that mounts the route. +- Path rules may only reference permissions in the declared `permission_namespace`. +- Path rules are normally authored at the route level using `@path_rule(...)`. +- Generated routers may attach the same metadata programmatically. +- Every plugin-owned route must have one or more normalized path rules. +- A missing path rule is a validation error. +- No implicit caller or access behavior is inferred for routes with missing path rules. +- Anonymous access must always be explicit. 
+- Multiple path rules on the same endpoint are alternative allow rules. + +Minimal model: + +```python +@dataclass(frozen=True) +class PathRule: + method: str + path: str + callers: list[CallerKind] + permissions_required: list[str] = field(default_factory=list) + scopes: list[str] | None = None +``` + +## Public Plugin API + +### Route Decorator + +Plugins define path rules primarily through a single decorator with explicit callers. + +Initial decorator: + +- `@path_rule(...)` + +Examples: + +```python +@router.get("/deployments/{name}") +@path_rule( + callers=[CallerKind.PRINCIPAL], + permissions_required=[AgentsPermission.DEPLOYMENTS_READ], + scopes=["agents:read", "platform:read"], +) +async def get_deployment(...): ... +``` + +```python +@router.get("/deployments/{name}") +@path_rule( + callers=[CallerKind.SERVICE_PRINCIPAL], + permissions_required=[AgentsPermission.INTERNAL_DEPLOYMENTS_READ], +) +async def get_deployment(...): ... +``` + +```python +@router.get("/docs") +@path_rule( + callers=[CallerKind.ANON], +) +async def docs(...): ... 
+``` + +Decorator behavior: + +- attach normalized authz metadata to the route handler +- do not perform authorization directly +- support both hand-authored and factory-authored routes +- allow one or more rules per endpoint +- interpret multiple rules on a single endpoint as OR, not AND + +Initial caller kinds: + +- `ANON` +- `PRINCIPAL` +- `SERVICE_PRINCIPAL` + +Proposed enum: + +```python +class CallerKind(StrEnum): + ANON = "anon" + PRINCIPAL = "principal" + SERVICE_PRINCIPAL = "service_principal" +``` + +Caller kind semantics: + +- `ANON` + - route is callable without authentication + +- `PRINCIPAL` + - route is intended for normal authenticated principal access + - this corresponds to the current model's non-anonymous, non-`service:` callers + +- `SERVICE_PRINCIPAL` + - route is intended for principals whose id is prefixed with `service:` + - this corresponds to the current model's service-principal convention + - how that identity was authenticated is out of scope for the path rule model + +Validation rules for decorator usage: + +- `ANON` rules must not specify `permissions_required` +- `PRINCIPAL` and `SERVICE_PRINCIPAL` rules may specify empty `permissions_required`, preserving the current behavior for authenticated-but-permissionless endpoints +- any specified `permissions_required` must belong to the declared `permission_namespace` + +Additional validation rules: + +- every plugin-owned route must have at least one path rule +- attaching no path rule is invalid +- attaching no path rule must fail validation before merge/startup +- `ANON` must always be explicit and may not be inferred as a fallback +- each rule on an endpoint must be valid on its own +- the final rule set for an endpoint is evaluated as OR over the rules + +Rule semantics: + +- within one rule, `callers` are OR'ed +- within one rule, `permissions_required` are AND'ed +- across multiple rules on the same endpoint, rules are OR'ed + +Scope semantics preserved in this iteration: + +- 
scopes remain supported as endpoint rule fields in this iteration +- plugin-defined scopes are out of scope for this spec +- endpoint rule scopes are normalized NeMo platform scopes +- provider-native OAuth/OIDC scopes are not used directly in endpoint rules +- token-provided scopes and endpoint-required scopes must be treated as distinct concepts in code and docs +- normalized NeMo platform scopes must use `:` as the segment separator + +### Service-Level Authz Definition + +Each `NemoService` may define service-owned permissions. + +Proposed shape: + +```python +class NemoService(_NamedPlugin): + ... + + def get_authz_definition(self) -> ServiceAuthzDefinition | None: + return None +``` + +Minimal model: + +```python +@dataclass +class ServiceAuthzDefinition: + permission_namespace: str + permissions: list[PermissionDef] = field(default_factory=list) +``` + +This model intentionally does not include endpoint/path data. Path rules come from routers. + +`permission_namespace` is the explicit source of truth for permission-prefix validation. + +Rules: + +- `permission_namespace` must use `.` as its segment separator +- every permission id defined or referenced by the service must start with the declared `permission_namespace` followed by `.` +- `permission_namespace` is service-owned metadata and does not need to be identical to `NemoService.name` + +This avoids ambiguity between URL/service naming and permission naming. + +This namespace boundary must be enforced during bundle-time validation. + +### Type-Safe Authoring + +Plugin services should be able to define permissions using typed enums or typed constants. + +Example: + +```python +class AgentsPermission(StrEnum): + DEPLOYMENTS_READ = "agents.deployments.read" + DEPLOYMENTS_CREATE = "agents.deployments.create" +``` + +The `@path_rule(...)` decorator and authz definition helpers should accept these typed values directly.
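A minimal sketch of how typed permission values could flow through the decorator, assuming a hypothetical `path_rule` that only attaches metadata to the handler and performs no authorization. The `__authz_rules__` attribute name and the `RuleMeta` shape are illustrative assumptions, not the platform's API, and `str, Enum` stands in for `StrEnum` so the sketch also runs on Python versions before 3.11.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class CallerKind(str, Enum):
    ANON = "anon"
    PRINCIPAL = "principal"
    SERVICE_PRINCIPAL = "service_principal"


class AgentsPermission(str, Enum):
    DEPLOYMENTS_READ = "agents.deployments.read"
    DEPLOYMENTS_CREATE = "agents.deployments.create"


@dataclass(frozen=True)
class RuleMeta:
    # Hypothetical normalized form; method and path would be added at mount time.
    callers: tuple[str, ...]
    permissions_required: tuple[str, ...]


def path_rule(
    *,
    callers: Sequence[CallerKind],
    permissions_required: Sequence[Enum] = (),
) -> Callable:
    """Attach authz metadata to a handler without enforcing anything."""
    if CallerKind.ANON in callers and permissions_required:
        raise ValueError("ANON rules must not specify permissions_required")

    meta = RuleMeta(
        callers=tuple(c.value for c in callers),
        permissions_required=tuple(p.value for p in permissions_required),
    )

    def decorate(handler: Callable) -> Callable:
        # Stacked decorators accumulate alternative (OR'ed) rules in source order.
        existing = getattr(handler, "__authz_rules__", ())
        handler.__authz_rules__ = (meta, *existing)
        return handler

    return decorate


@path_rule(
    callers=[CallerKind.PRINCIPAL],
    permissions_required=[AgentsPermission.DEPLOYMENTS_READ],
)
async def get_deployment(name: str) -> dict:
    return {"name": name}
```

Because the decorator takes enum members rather than raw strings, a misspelled permission fails at import time as an `AttributeError` on the enum instead of surfacing later as an undeclared permission during bundle validation.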
+ +Goals of type-safe authoring: + +- avoid string typos +- avoid cross-service permission leakage inside service code +- make service-owned permissions easy to refactor + +This type safety is local to the service authoring layer. Cross-plugin composition remains a runtime concern because plugins are discovered dynamically. + +## Derived Platform Behavior + +The platform derives normalized plugin authz from services as follows: + +1. Discover `nemo.services`. +2. Instantiate each service once. +3. Read `get_authz_definition()`. +4. Read `get_routers()`. +5. For each mounted route: + - compute the fully mounted path from service name, `RouterSpec.prefix`, and route path + - read attached authz metadata + - emit one or more normalized path rules +6. Validate the combined result. +7. Convert the normalized result into the existing runtime/static authz structures consumed by the auth service. + +This derived result replaces the need for a separate `nemo.authz` plugin surface. + +The PDP continues to receive request method, path, principal identity, and scopes as it does today. This spec changes how plugin-owned endpoint rules are authored and merged. It does not require replacing the current PDP structure. + +The implementation must preserve the current core-role grant behavior for plugin-defined permissions when converting the derived result into the existing runtime/static authz structures. + +Specifically: + +- `.read` and `.list` permissions continue to be granted to `Viewer` and `Editor` +- other plugin-defined permissions continue to be granted to `Editor` + +Redesigning that heuristic is explicitly out of scope for this spec and is handled by a separate follow-up spec. + +## Validation Rules + +The platform must validate plugin-owned authz before merge and bundle generation. + +Required checks: + +- `permission_namespace` is present on every `ServiceAuthzDefinition`. +- `permission_namespace` uses `.` as its segment separator. 
+- Every plugin-defined permission id starts with the declared `permission_namespace` followed by `.`. +- Every plugin-defined permission id uses `.` as its segment separator. +- Every path rule references only permissions under the declared `permission_namespace` in `permissions_required`. +- Every `permissions_required` entry exists in the service permission registry. +- Every plugin-defined permission includes a description. +- Every endpoint-required scope uses `:` as its segment separator. +- Every plugin-owned route has at least one path rule. +- No plugin-owned route is implicitly anonymous/public. + +Bundle-time validation must fail if a plugin contributes malformed permission ids, undeclared permissions referenced by path rules, missing permission descriptions, or malformed normalized scopes. +Merge-time validation must also verify rule-set correctness, not just individual rule correctness. + +Bundle-time validation must also enforce that a service cannot define or reference permissions outside its declared `permission_namespace`. + +Additional merge-time checks: + +- reject exact duplicate rules on the same endpoint +- reject semantically conflicting rules on the same endpoint +- reject semantically shadowed rules where one rule makes another meaningless +- reject ambiguous same-caller overlaps that cannot be explained as clear alternatives + +Merge behavior: + +- merging authz definitions is a validation boundary, not a best-effort concatenation step +- invalid or conflicting rules must fail merge +- plugin endpoints with no path rules must fail merge rather than receiving any implicit default behavior + +## Factory-Generated Routes + +This design must work for generated routers, not only handwritten endpoints.
+ +Examples include: + +- job route factories +- reusable CRUD/router builders +- helper functions that return `APIRouter` + +Requirement: + +- factories must be able to attach the same authz metadata that `@path_rule(...)` attaches + +### Current State + +The current plugin authz helpers already embed route-to-permission conventions for some generated routes. + +Example: `authz_for_workspace_job_collection(...)` effectively hard-codes the standard job route policy shape: + +- collection `POST` -> `.create` +- collection `GET` -> `.list` +- item `GET` -> `.read` +- item `DELETE` -> `.delete` + +It also pairs those routes with the current scope convention: + +- read routes -> `:read`, `platform:read` +- write routes -> `:write`, `platform:write` + +So the platform already has factory-local authz conventions today, but they are embedded in helpers rather than expressed as part of a normalized route-metadata model. + +### Desired Outcome + +The desired end state is: + +- plugin authors do not have to restate authz for generated routes outside the factory call +- factories do not hide authz behavior in a way that cannot be validated or overridden +- the final emitted route metadata is explicit and normalized before merge and bundle generation + +### Concrete Examples + +#### Example 1: Standard Job Collection Factory + +Current customization-style usage looks roughly like: + +```python +router = job_route_factory( + service_name="customization", + job_type="Customization", + job_input=CustomizationJobInput, + job_output=CustomizationJobOutput, + input_to_output=transform_input_to_output, + platform_job_config_compiler=platform_job_config_compiler, + generate_job_name=generate_customization_id, + route_options=[JobRouteOption.CORE], +) +``` + +In the model described by this spec: + +- the plugin defines permissions explicitly in `get_authz_definition()` +- the factory provides the default route-to-permission template +- the factory emits normalized path rules for the 
generated `POST`, `GET`, item `GET`, and item `DELETE` routes + +Conceptually: + +```python +def get_authz_definition(self) -> ServiceAuthzDefinition: + return ServiceAuthzDefinition( + permission_namespace="customization.jobs", + permissions=[ + PermissionDef("customization.jobs.create", "Create customization jobs"), + PermissionDef("customization.jobs.list", "List customization jobs"), + PermissionDef("customization.jobs.read", "Read customization jobs"), + PermissionDef("customization.jobs.delete", "Delete customization jobs"), + ], + ) +``` + +And the factory would internally emit rules equivalent to: + +- collection `POST` -> `customization.jobs.create` +- collection `GET` -> `customization.jobs.list` +- item `GET` -> `customization.jobs.read` +- item `DELETE` -> `customization.jobs.delete` + +#### Example 2: Rebasing A Generated Router + +Some services generate the standard job routes and then rebase them onto a different collection path, as evaluator does for metric jobs. + +Conceptually: + +```python +_jobs_router = job_route_factory( + service_name="evaluator-metrics", + job_type="MetricEvaluation", + job_input=MetricJob, + platform_job_config_compiler=platform_job_config_compiler, +) + +router.include_router(_metric_jobs_router, prefix="/v2/workspaces/{workspace}/metric-jobs") +``` + +In this case, the important requirement is that rebasing the route paths must not lose the attached authz metadata. + +That means: + +- the factory may emit metadata before rebasing +- the rebasing helper must preserve or restamp that metadata onto the final mounted routes +- bundle validation must operate on the final mounted paths, not the factory's temporary `/jobs` paths +- rebasing alone must not change permissions or other authz semantics + +This matters because some rebasing patterns rebuild routes by creating new `APIRoute` objects. 
+ +If authz metadata is attached only to the original route object, rebuilding may drop that metadata unless the rebasing helper explicitly preserves or restamps it. + +Validation must therefore inspect the final mounted route set, after any rebasing or remounting has occurred. + +### Ownership Model + +For factory-generated routes, authz ownership should be split clearly: + +- the factory owns the route shape + - which routes are generated + - which HTTP methods exist + - what authz template is applied by default + +- the plugin owns the concrete policy inputs + - permission definitions + - permission prefixes or ids referenced by the factory + - optional endpoint rule scopes + - caller kinds + - explicit overrides, where the factory allows them + +- the factory emits the final path-rule metadata + - generated endpoints must end up with the same normalized metadata shape as handwritten endpoints + +This spec standardizes the required outcome, not a single shared factory API shape. + +That means: + +- factories may expose their authz inputs through factory-specific parameters +- the core plugin API does not require a universal factory authz interface in this iteration +- regardless of factory signature, the emitted route metadata and referenced permissions must satisfy the same validation rules as handwritten routes + +### Validation Requirement + +Bundle-time validation must operate on the final emitted rule set, regardless of whether the rules came from: + +- handwritten decorators +- factory defaults +- plugin-supplied factory parameters +- explicit plugin overrides + +In other words: + +- factories may synthesize rules +- plugins may parameterize those rules +- but merge and bundle validation must only accept the final normalized route metadata +- any missing, conflicting, or malformed generated rules must fail validation before bundle generation + +This may be implemented by: + +- applying the same decorator internally, or +- attaching the normalized metadata 
directly to the generated endpoint callable/route object + +This is important for routes like service-owned job collections, where authz should be derived from the same route factory that creates the endpoints. + +## Compatibility + +This design assumes the plugin authz surface has not been released yet and therefore does not need a backward-compatibility layer. + +Requirements: + +- Do not preserve `nemo.authz` as a supported plugin surface. +- Do not preserve `NemoService.get_authz_contribution()` as a supported API. +- Do not add fallback merge logic that supports both the old and new models indefinitely. + +Implementation expectation: + +- internal code may be updated in one pass to the new service-owned model +- route factories such as job helpers should emit the new path-rule metadata directly +- auth runtime and static sync tooling should consume only the new derived service authz model + +## Decision + +Adopt a service-owned authz model for plugin HTTP authorization with exactly two plugin-defined concepts: + +- permissions +- path rules + +Use a single `@path_rule(...)` decorator for path rules, with `callers` and optional `permissions_required`, allow multiple rules per endpoint as explicit OR alternatives, validate rule correctness at merge time, and remove the separate `nemo.authz` surface entirely as part of the initial implementation. diff --git a/spec/plugin-service-authz-ticket-description.md b/spec/plugin-service-authz-ticket-description.md new file mode 100644 index 0000000000..40b099e373 --- /dev/null +++ b/spec/plugin-service-authz-ticket-description.md @@ -0,0 +1,60 @@ +# Unify plugin HTTP authz under `NemoService` with explicit permission definitions and route path rules + +## Description + +Implement the new plugin HTTP authorization model described in `spec/plugin-service-authz-spec.md`. + +Today plugin HTTP authz is split between `nemo.services` and `nemo.authz` / `get_authz_contribution()`. 
This work should remove that split and make `NemoService` the sole source of plugin HTTP auth policy. + +## Scope + +- Add service-owned authz definition support on `NemoService` +- Require plugin permissions to be declared explicitly, with required descriptions +- Add route-level path rule metadata via `@path_rule(...)` or equivalent programmatic stamping for generated routers +- Derive normalized plugin authz from: + - `get_authz_definition()` + - mounted routers / emitted route metadata +- Preserve current core-role grant behavior when converting derived plugin permissions into runtime/static authz +- Support factory-generated routes, including rebased routers +- Validate final emitted plugin authz before merge/bundle generation + +## Key Requirements + +- `nemo.authz` is removed as a supported plugin surface +- `NemoService.get_authz_contribution()` is removed as a supported API +- Permissions must be declared in `get_authz_definition()` +- Path rules may reference permissions but may not define them +- Every permission must include: + - id + - description +- Every service authz definition must declare `permission_namespace` +- Bundle-time validation must fail if a service: + - defines permissions outside its `permission_namespace` + - references undeclared permissions + - emits malformed permission ids + - emits malformed normalized scopes +- Every plugin-owned route must have at least one final path rule +- Validation must run against the final mounted route set, after any factory generation / rebasing +- Rebasing generated routers must preserve authz metadata and must not change permissions or authz semantics + +## Caller Model + +- `ANON` +- `PRINCIPAL` +- `SERVICE_PRINCIPAL` + +## Out Of Scope + +- Plugin-defined roles +- IAM/UI/CLI role surfacing +- Redesign of core-role default grant heuristics +- Plugin-defined scopes as a first-class model +- OIDC scope/claim mapping redesign + +## Acceptance Criteria + +- Plugin HTTP authz is derived only from 
`NemoService` +- Existing plugin permission behavior remains functional after migration +- Generated/rebased routes produce correct final path rules +- Bundle/merge validation fails closed on missing or invalid plugin authz metadata +- No plugin-owned route becomes implicitly public due to missing metadata diff --git a/spec/provider-backend-extensibility-spec.md b/spec/provider-backend-extensibility-spec.md new file mode 100644 index 0000000000..73a1b43f9e --- /dev/null +++ b/spec/provider-backend-extensibility-spec.md @@ -0,0 +1,124 @@ +# Provider And Backend Extensibility Spec + +## Summary + +This spec captures a long-term design question about how NeMo Platform should evolve its execution model as new backends are introduced. + +The key question is: + +- when should a new backend fit under an existing provider +- and when should it require a new provider with a new execution contract + +This is intentionally separate from the near-term subprocess and execution-resolution work. The current platform can move forward with built-in providers such as `subprocess`, `cpu`, `gpu`, and `gpu_distributed` without settling every future backend mapping question up front. + +## Problem + +The platform needs a rule for future backend growth. + +Today it is easy to talk about: + +- `subprocess` +- `cpu` +- `gpu` +- `gpu_distributed` + +But future backends may not fit cleanly into those existing categories. + +Examples include: + +- Slurm +- future distributed batch systems +- alternative cluster schedulers +- backend-specific runtimes with specialized submission contracts + +Some of these may preserve an existing provider contract. Others may require a materially different contract. + +The platform needs a clean extensibility rule so that provider vocabulary does not become either: + +- too broad to be meaningful +- or too specific and backend-leaky + +## Core Principle + +Providers should represent meaningful execution contracts, not just implementation names. 
+ +Backends may vary underneath a provider, but only as long as they preserve the same contract from the plugin and resolver point of view. + +That leads to a simple rule: + +- if a new backend preserves the same execution contract as an existing provider, it should map to that provider +- if a new backend requires a materially different execution contract, it should introduce a new provider + +## What Counts As The Same Contract + +Two backends can reasonably share a provider if they preserve the same high-level semantics that matter to plugins and resolution logic. + +Examples of contract-level behavior include: + +- what kind of command or container shape the plugin compiles +- what kinds of resources and topology the job requests +- what assumptions exist around storage, environment, and execution model +- what lifecycle and validation expectations are visible at compile time + +If those things remain meaningfully the same, the backend difference can stay below the provider layer. + +If those things diverge enough that plugins would need different compilation rules or different mental models, the platform probably needs a new provider. + +## Example: `gpu_distributed` + +`gpu_distributed` is a good example of a provider that may have multiple possible backend implementations. + +If several distributed GPU schedulers all preserve roughly the same execution contract, then they can all remain under `gpu_distributed`, with backend-specific differences expressed through: + +- profile +- backend mapping +- backend configuration + +That would keep the provider stable while allowing multiple implementations underneath it. + +## Example: Slurm + +Slurm is intentionally unresolved. + +There are two plausible futures: + +1. Slurm fits the existing `gpu_distributed` contract. + + In that case: + + - Slurm should remain a backend under `gpu_distributed` + - provider selection stays simple + - profile and backend mapping carry the backend-specific detail + +2. 
Slurm requires a materially different contract. + + In that case: + + - Slurm should become a new provider + - plugins and resolver logic should treat it as a distinct execution contract + +The platform should not force a decision before the actual Slurm design exists. + +## Why This Should Stay Separate + +This question is important, but it is not required to finish the current provider-resolution cleanup. + +The near-term work only needs: + +- a clear built-in provider set +- a shared resolution mechanism +- explicit subprocess support +- removal of the dishonest rewrite behavior + +Future backend extensibility can be addressed later once there is a concrete design for backends such as Slurm. + +## Recommendation + +Do not resolve speculative backend mappings in the current execution-resolution spec. + +Instead, adopt this rule for future work: + +- preserve an existing provider when a new backend preserves the same execution contract +- create a new provider when the backend introduces a materially different contract + +That gives the platform a clean long-term extensibility rule without forcing premature decisions about backends that do not yet have a defined runtime model. diff --git a/spec/service-role-visibility-and-bindability-spec.md b/spec/service-role-visibility-and-bindability-spec.md new file mode 100644 index 0000000000..c5ca703618 --- /dev/null +++ b/spec/service-role-visibility-and-bindability-spec.md @@ -0,0 +1,157 @@ +# Service Role Visibility And Bindability Spec + +## Summary + +This spec explores whether service-defined roles should carry metadata controlling whether they are visible and bindable. + +This is intentionally separate from `plugin-service-authz-spec.md`. + +The plugin service authz spec stays focused on: + +- permissions +- service-scoped roles +- path rules + +It does not require role visibility/bindability metadata in the first iteration. 
+ +## Current Assumption + +For the initial plugin authz design, the implicit default assumption is: + +- roles are visible +- roles are bindable + +This spec explores whether the platform should later make those properties explicit. + +## Problem + +Some service-defined roles may be appropriate as normal user/admin-assigned roles. + +Examples: + +- `agents.Reviewer` +- `customization.Approver` + +Other roles may be useful in policy but should not necessarily be exposed or directly granted. + +Examples: + +- `agents.SystemWorker` +- `guardrails.BackgroundSync` +- `customization.InternalRunner` + +Without explicit metadata, the platform may treat all service-defined roles as equally visible and equally assignable. + +## Goals + +- Explore whether service-defined roles need explicit metadata for visibility and bindability. +- Determine whether these concepts should affect IAM, UI, CLI, and docs behavior. +- Keep the main plugin authz redesign unblocked. + +## Non-Goals + +- Changing the first iteration of `plugin-service-authz-spec.md`. +- Redesigning the core role model. + +## Concepts + +### Visibility + +Visibility answers: + +- should this role be shown in UI/CLI/docs/IAM listing surfaces? + +Possible values: + +- visible +- hidden + +### Bindability + +Bindability answers: + +- may this role be directly granted to principals through normal role-binding APIs? 
+ +Possible values: + +- bindable +- non-bindable + +These are separate concerns: + +- a role could be hidden but bindable +- a role could be visible but non-bindable +- a role could be both hidden and non-bindable + +## Options + +### Option 1: Keep Roles As Simple Definitions Only + +Roles contain only: + +- name +- description +- permissions + +Pros: + +- simplest model +- no additional IAM/UI complexity + +Cons: + +- no way to distinguish customer-facing roles from internal policy roles + +### Option 2: Add Explicit Visibility And Bindability Flags + +Conceptual example: + +```python +RoleDef( + name="agents.SystemWorker", + description="Internal worker role", + permissions=[...], + visible=False, + bindable=False, +) +``` + +Pros: + +- explicit +- clear platform behavior for UI/CLI/IAM/doc surfaces +- supports internal-only roles cleanly + +Cons: + +- adds more surface area to role definitions +- more implementation work across APIs and presentation layers + +### Option 3: Add Bindability Only + +Only define whether a role can be bound directly. + +Visibility is handled by conventions or presentation-layer heuristics. + +Pros: + +- smaller change +- addresses the more security-sensitive concern first + +Cons: + +- still leaves UI/docs ambiguity + +## Recommendation + +Defer this from the initial plugin authz redesign. + +Do not block `plugin-service-authz-spec.md` on solving role visibility/bindability metadata. + +If the platform later needs internal-only service roles, Option 3 or Option 2 can be added as follow-up work. + +## Relationship To Plugin Service Authz + +`plugin-service-authz-spec.md` should assume the simple model for now and avoid taking on extra role metadata concerns in its first version. + +This spec exists so the question is captured for follow-up work without complicating the core decorator/path-rule redesign. 
diff --git a/spec/subprocess-first-class-execution-resolution-spec.md b/spec/subprocess-first-class-execution-resolution-spec.md new file mode 100644 index 0000000000..01a6373d54 --- /dev/null +++ b/spec/subprocess-first-class-execution-resolution-spec.md @@ -0,0 +1,515 @@ +# Subprocess First-Class Execution Resolution Spec + +## Summary + +This spec defines how NeMo Platform should select execution for jobs when the same logical workload may be able to run as: + +- a host subprocess +- a CPU container job +- a GPU container job +- a distributed GPU container job + +The key architectural change is: + +- `subprocess` becomes a first-class execution provider rather than a compatibility rewrite target +- plugin and service compilers declare which providers they support +- a shared resolution algorithm selects the provider and profile before the final `PlatformJobSpec` is compiled +- the Jobs service validates and dispatches an honest execution contract; it does not silently reinterpret one provider as another + +This spec assumes that local and remote execution should converge on a single jobs-backed architecture. It defines the execution-selection contract needed to make that architecture predictable across plugins. + +The guiding product principle is a single deterministic and well-documented platform mechanism for execution selection. The mechanism may be non-trivial, but it must be consistent across plugins, explainable, and free of plugin-specific surprises. + +A user submitting a job to one plugin should be able to expect the same execution-selection behavior they would get from another plugin unless that plugin has a clear, documented reason to behave differently. + +## Problem + +Today the repo mixes multiple architectural layers. + +### Current State + +At the plugin layer, many jobs compile directly to container-oriented providers such as `cpu`, `gpu`, or `gpu_distributed`. 
+ +At the Jobs API ingress layer, some CPU container steps are silently rewritten into subprocess steps when a subprocess profile is configured. The rewrite currently lives in: + +- [services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py](/Users/rsadler/src/nemo-platform/services/core/jobs/src/nmp/core/jobs/api/v2/jobs/endpoints.py:105) + +At the plugin CLI layer, `run_local(...)` still executes jobs in-process while `submit_remote(...)` posts to Jobs. That means local and remote still use materially different execution paths. + +### Why This Is A Problem + +The current behavior creates ambiguity in several places. + +- The same submitted `cpu` step may mean a real CPU container job in one environment and a host subprocess in another. +- The Jobs service is doing semantic translation, not just validation and dispatch. +- Plugins do not have a shared, explicit convention for how to choose among subprocess, CPU, GPU, and distributed GPU execution, so each plugin is pushed toward implementing its own fallback and selection logic. +- Submitters cannot reliably predict whether a plugin will run locally, in a container, or on a cluster. +- Resolution and validation logic are split across plugin compile logic, Jobs API validation, and backend-specific assumptions, which invites behavior drift across plugins instead of one consistent platform mechanism. + +This makes execution behavior harder to explain and less consistent for plugin authors, operators, and submitters because the same logical workload may be compiled, validated, and reinterpreted differently depending on the plugin and deployment configuration. Even if execution selection remains a non-trivial mechanism, it should still be one deterministic and well-documented platform behavior rather than something that drifts across plugins. + +## Goals + +- Make `subprocess` a first-class execution provider. +- Remove the dishonest CPU-to-subprocess rewrite from the Jobs service. 
+- Define a shared execution resolution process that all plugins use. +- Let plugins describe which providers they support without each plugin inventing its own fallback logic. +- Let callers optionally constrain or override execution choice, without forcing them to understand container-vs-subprocess details in the common case. +- Fast-fail before job creation when no compatible execution target exists. +- Preserve a single jobs-backed architecture for both local and remote execution. +- Move the platform toward eliminating the separate `run_local(...)` execution path in favor of one jobs-backed execution model, even if that full transition lands beyond the scope of this spec. +- Keep backend routing responsibility in Jobs while moving execution-selection policy above Jobs. +- Make execution selection deterministic, documented, and consistent across plugins so that it is a core platform feature rather than a per-plugin convention. + +## Non-Goals + +- This spec does not define platform startup, control-plane lifecycle, or service-loading behavior. +- This spec does not define how Jobs determines runtime execution availability. That remains a separate concern; this spec only assumes that Jobs is the authority for reporting what is available. +- This spec does not define arbitrary plugin-defined execution providers. It standardizes the built-in providers first. +- This spec does not attempt to preserve backward compatibility for the current silent rewrite as a permanent architectural feature. +- This spec does not change the current implementation of `run_local(...)` or the current local `run` command behavior. In scope, those remain the existing scheduler-managed in-process path; their eventual replacement belongs to separate follow-on work. + +## Terminology + +This area has accumulated overlapping terms. This spec standardizes them. 
+ +### Provider + +A small, platform-owned set of execution shapes that a plugin can target during compilation and encode into the final `PlatformJobSpec`. + +Initial providers: + +- `subprocess` +- `cpu` +- `gpu` +- `gpu_distributed` + +These are the choices made by the shared resolver and plugin compiler. + +### Profile + +A named operator-configured execution profile, selected in combination with a provider. + +Examples: + +- `subprocess/default` +- `cpu/default` +- `gpu/research` +- `gpu_distributed/slurm-a100` + +Profiles are how operator policy is surfaced into compilation and dispatch. + +### Backend + +The implementation that ultimately runs the job after a provider/profile pair is resolved. + +Examples: + +- `subprocess` +- `docker` +- `kubernetes_job` +- `volcano_job` + +Backends remain a Jobs concern. + +## Architectural Principle + +The single most important architectural rule in this spec is that Jobs dispatches the chosen execution contract; it does not reinterpret it after the plugin has compiled the final `PlatformJobSpec`. + +The point of this rule is to protect a few specific properties: + +- one selected provider and profile for the job or step +- one honest final `PlatformJobSpec` that reflects that choice +- provider-specific compilation and validation in the plugin or shared plugin-layer logic +- no late semantic rewrite where one provider shape is submitted and another provider shape is actually run + +In practice, a plugin may still have an earlier provider-agnostic phase that canonicalizes user input into a job-specific spec. What this rule forbids is compiling a final provider-specific step shape and then having Jobs reinterpret it as something else later. + +That implies the following responsibility split. 
+ +### Caller Responsibility + +The caller may provide: + +- an explicit profile +- an execution preference or constraint, if surfaced by the plugin API +- no execution hint at all + +The caller should not need to know whether the workload will run as a subprocess or in a container in the common case. + +### Plugin Responsibility + +The plugin or service compiler is responsible for: + +- declaring which providers a job supports +- declaring provider preference order for the canonical spec where applicable +- providing compilation logic for each supported provider +- expressing workload-specific constraints, such as whether a job can run only on GPU or only as a host subprocess + +Plugins do not need to implement every built-in provider. A job may support any subset of providers that makes sense for its workload, but it must support at least one provider in order to be executable. + +Provider preference is dynamic by design. A plugin may determine its preferred provider order as a function of the canonical spec rather than as one static list for the entire job type. + +### Shared Resolver Responsibility + +A single deterministic resolver in the plugin/platform layer is responsible for applying the same execution-selection algorithm across all plugins. 
+ +That shared resolver is responsible for: + +- reading caller intent +- reading plugin-supported providers +- reading the available execution profiles exposed by Jobs +- selecting a compatible provider and profile according to a shared convention +- producing a fast failure when no compatible choice exists + +### Jobs Responsibility + +The Jobs service is responsible for: + +- exposing configured execution profiles +- validating that the submitted `PlatformJobSpec` references valid provider/profile combinations +- routing the selected provider/profile to the configured backend +- dispatching, reconciling, logging, and lifecycle management + +Jobs must not silently change `cpu` into `subprocess`, or any equivalent semantic rewrite. + +## Container Ownership + +This spec makes container ownership explicit. + +### Container-Oriented Providers + +The following providers are container-oriented: + +- `cpu` +- `gpu` +- `gpu_distributed` + +For these providers, the plugin compiler is responsible for defining: + +- the container image +- the entrypoint and command +- resource requests and limits +- any family-specific environment or storage requirements + +The container field is part of the plugin-authored execution contract for these providers. + +### Subprocess Provider + +The `subprocess` provider is host-command-oriented. + +Like every provider in this spec, `subprocess` is optional at the plugin level. Jobs that can run as host commands may implement it; jobs that cannot or should not run that way do not need to support it. + +For this provider, the plugin compiler is responsible for defining: + +- the host command to run +- any required environment variables, secrets, and path validation rules + +The `subprocess` provider does not carry a container field. + +If a subprocess backend implementation later chooses to invoke Docker, Podman, a wrapper script, or a prepared virtual environment internally, that is backend configuration, not plugin-authored step semantics. 
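The ownership split described in this section can be sketched as two distinct step shapes. This is an illustration under assumptions, not the real `PlatformJobSpec` schema: `ContainerSpec`, `ContainerStep`, `SubprocessStep`, their field names, and the image name are all hypothetical.

```python
from dataclasses import dataclass, field

CONTAINER_PROVIDERS = {"cpu", "gpu", "gpu_distributed"}


@dataclass(frozen=True)
class ContainerSpec:
    """Plugin-authored container details (hypothetical field names)."""
    image: str
    command: tuple[str, ...]
    cpu: str = "1"
    memory: str = "1Gi"


@dataclass(frozen=True)
class ContainerStep:
    """Step for container-oriented providers; the container is mandatory."""
    provider: str
    profile: str
    container: ContainerSpec
    env: dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.provider not in CONTAINER_PROVIDERS:
            raise ValueError(f"{self.provider!r} is not a container-oriented provider")


@dataclass(frozen=True)
class SubprocessStep:
    """Host-command step; deliberately has no container field."""
    profile: str
    command: tuple[str, ...]
    env: dict[str, str] = field(default_factory=dict)
    # Fixed value, not settable by callers, so the shape cannot masquerade as cpu.
    provider: str = field(default="subprocess", init=False)


cpu_step = ContainerStep(
    provider="cpu",
    profile="default",
    container=ContainerSpec(image="example.invalid/evaluator:latest", command=("evaluate",)),
)
host_step = SubprocessStep(profile="default", command=("/usr/bin/env", "evaluate"))
```

Keeping the two shapes as separate types means a subprocess step cannot carry an image, and a container step cannot omit one, so the submitted contract says what will actually run.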
+ +This distinction is important because it keeps the submitted execution contract honest. + +## Shared Resolution Model + +The platform should standardize one resolution process across plugins. + +### Inputs To Resolution + +Resolution takes three categories of input. + +#### 1. Caller Intent + +Possible caller intent includes: + +- explicit profile selection +- explicit provider preference, if the plugin exposes one +- no preference + +Explicit caller choices take precedence over automatic fallback. + +#### 2. Plugin Support + +Each job type declares: + +- supported providers +- optional dynamic preference order among supported providers +- any workload-specific constraints + +Examples: + +- `evaluate-suite`: supports only `subprocess` +- evaluator: supports `subprocess` and `cpu` +- customization training: supports `gpu` and possibly `gpu_distributed` + +#### 3. Host Availability + +Availability comes from the execution profiles that Jobs reports as available. + +For the purposes of this spec, the important contract is simple: + +- Jobs is the authority for provider/profile availability +- plugins and other services resolve against what Jobs reports as available +- plugins should not implement their own ad hoc availability logic + +How Jobs determines that availability is intentionally out of scope for this document. Today that area is fragmented and partly inferred from configuration and plugin-specific checks, but this spec assumes a cleaner future state where Jobs publishes the authoritative availability set and the shared resolver consumes it. + +Examples: + +- local dev host: `subprocess/default`, maybe `cpu/default` +- Docker deployment: `cpu/default`, `gpu/default`, maybe `subprocess/default` +- Kubernetes production: `cpu/default`, `gpu/default`, `gpu_distributed/default`, no subprocess + +The plugin should not need to infer this indirectly from labels like "local" or "production" when Jobs can expose the actual configured capabilities. 
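As a small illustration of resolving against reported capabilities rather than environment labels, the example deployments above can be grouped by provider. The `provider/profile` string form follows the examples in this spec; the `group_profiles` helper and the idea that Jobs returns a flat list of such strings are assumptions made for the sketch.

```python
from collections import defaultdict


def group_profiles(profile_refs):
    """Group 'provider/profile' references by provider, preserving order."""
    grouped = defaultdict(list)
    for ref in profile_refs:
        provider, sep, profile = ref.partition("/")
        if not sep or not provider or not profile:
            raise ValueError(f"malformed profile reference: {ref!r}")
        grouped[provider].append(profile)
    return dict(grouped)


# Availability as Jobs might report it for two of the example deployments.
local_dev = group_profiles(["subprocess/default", "cpu/default"])
kubernetes_prod = group_profiles(
    ["cpu/default", "gpu/default", "gpu_distributed/default"]
)

# A plugin asks "is subprocess available?" instead of "is this production?".
subprocess_in_prod = "subprocess" in kubernetes_prod
```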
+ +## Resolution Algorithm + +The shared resolver should use the following algorithm. + +### Step 1: Validate Explicit Caller Constraints + +If the caller explicitly selected a profile: + +- determine that profile's provider +- verify that the plugin supports that provider for this job +- verify that the profile is actually available on the host +- fail immediately if either check fails + +If the caller explicitly selected a provider or mode: + +- verify that the plugin supports it +- intersect it with available profiles for that provider +- fail immediately if none are available + +### Step 2: Build Candidate Providers + +If no explicit caller constraint exists: + +- read the plugin's supported providers for the job +- order them according to the plugin's declared preference list for the canonical spec, or a shared default convention when no plugin-specific order is provided + +### Step 3: Intersect With Available Profiles + +For each candidate provider in order: + +- find the available execution profiles for that provider +- discard providers with zero compatible profiles +- keep providers with at least one compatible profile + +### Step 4: Select Provider And Profile + +Select the first compatible provider according to the shared ordering rules. + +Then select the profile according to one of the following: + +- explicit caller profile if given +- plugin-selected preferred profile if declared +- shared default profile selection rule, typically `default` + +### Step 5: Fast Fail On Empty Intersection + +If the final candidate set is empty, fail before job creation. + +The error should state: + +- what the caller requested, if anything +- what providers the plugin supports +- what profiles are available on the host +- why no intersection exists + +### Step 6: Compile For The Selected Provider + +Only after provider/profile resolution succeeds should the plugin run the provider-specific compiler. + +The plugin compiles once for the chosen target. 
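Steps 1 through 5 above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the shared resolver itself: the names `resolve`, `Selection`, and `ResolutionError` are hypothetical, an explicit profile is assumed to arrive as a `provider/profile` string, an explicit profile wins if both a profile and a provider are given, and falling back to the first listed profile when `default` is absent is an assumption the spec does not make.

```python
from dataclasses import dataclass


class ResolutionError(Exception):
    """Raised before job creation when no compatible target exists."""


@dataclass(frozen=True)
class Selection:
    provider: str
    profile: str


def resolve(supported, available, *, profile_ref=None, provider=None,
            default_profile="default"):
    """supported: providers in preference order; available: provider -> profiles."""

    def fail(reason):
        # Step 5: report the request, plugin support, and host availability.
        raise ResolutionError(
            f"{reason}; requested profile={profile_ref!r} provider={provider!r}; "
            f"plugin supports {list(supported)}; available {available}"
        )

    # Step 1: explicit caller constraints are validated first and never relaxed.
    if profile_ref is not None:
        prov, _, prof = profile_ref.partition("/")
        if prov not in supported:
            fail(f"plugin does not support provider {prov!r}")
        if prof not in available.get(prov, []):
            fail(f"profile {profile_ref!r} is not available")
        return Selection(prov, prof)

    if provider is not None:
        if provider not in supported:
            fail(f"plugin does not support provider {provider!r}")
        candidates = [provider]
    else:
        # Step 2: candidates come from the plugin's declared preference order.
        candidates = list(supported)

    # Steps 3 and 4: take the first candidate with at least one profile.
    for candidate in candidates:
        profiles = available.get(candidate, [])
        if profiles:
            chosen = default_profile if default_profile in profiles else profiles[0]
            return Selection(candidate, chosen)

    fail("no supported provider has an available profile")


available = {"cpu": ["default"], "gpu": ["default", "research"]}
fallback = resolve(["subprocess", "cpu"], available)
pinned = resolve(["gpu"], available, profile_ref="gpu/research")
```

Here a job that prefers `subprocess` falls back to `cpu/default` because the host exposes no subprocess profile, while a job that supports only `subprocess` would raise `ResolutionError` before any job is created.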
+ +This is intentionally different from compiling multiple variants and letting Jobs decide later. + +## Shared Default Convention + +To keep behavior uniform across plugins, the resolver should provide a platform-wide default convention. + +A reasonable initial convention is: + +- GPU-distributed workloads: require `gpu_distributed` +- GPU-only workloads: prefer `gpu` +- CPU-capable workloads: prefer `subprocess`, then `cpu` +- Host-only workloads: require `subprocess` + +Plugins may narrow this based on job semantics, but they should not invent new fallback rules unless the shared resolver supports them. + +The point of the convention is not to eliminate plugin intent. The point is to make the common case predictable. + +## Why Subprocess Must Be First-Class + +Raising `subprocess` to a first-class provider is not just an implementation cleanup. It fixes a correctness issue. + +### Honest Contracts + +A `cpu` step should mean a CPU container-oriented job. + +A `subprocess` step should mean a host subprocess job. + +Those two contracts have different semantics around: + +- working directory +- container image ownership +- command interpretation +- environment inheritance +- filesystem expectations +- runtime dependencies + +Treating one as a hidden rewrite of the other makes the contract dishonest. + +### Better Validation + +When `subprocess` is explicit, plugins can validate subprocess-specific invariants during compilation. + +Examples: + +- absolute path requirements +- required host-side tools +- command shape validation +- environment/secret injection needs + +The `evaluate-suite` job already demonstrates this pattern by compiling directly to subprocess and validating path assumptions at compile time. + +### Better Local/Remote Unification + +If local execution is supposed to be jobs-backed, then `subprocess` is the natural first-class local execution provider. 
+ +This lets the platform unify local and remote around one jobs architecture without pretending that a host process is a CPU container. + +## Local Versus Production + +This spec intentionally avoids making plugins branch directly on a vague "local vs production" flag unless absolutely necessary. + +The preferred rule is capability-driven selection. + +- if subprocess profiles are available, jobs that support subprocess may choose them according to the shared resolver +- if subprocess profiles are absent, those jobs fall back to their other supported providers or fail + +This means production policy is expressed by profile availability. + +- local deployments may expose `subprocess/default` +- production deployments should typically not expose subprocess profiles at all + +The plugin remains mostly environment-agnostic because it chooses from actual available capabilities. + +A platform runtime or deployment label may still be useful for diagnostics or edge cases, but it should not be the primary selector when profile availability already captures the real execution options. + +## Proposed Plugin API Shape + +The current repo uses a single `compile(...)` path per job. This spec proposes splitting the decision from the compilation. + +Conceptually, a job should provide: + +- supported providers +- optional provider preference order for a given spec +- one compiler per supported provider, or one dispatching compiler that compiles based on a selected provider + +The shared resolver then: + +- resolves the provider and profile +- passes that selection into the plugin compile path + +This can be represented in several concrete APIs. The exact method names are implementation detail. The architectural requirement is: + +- plugins express support and provider-specific compilation +- shared code performs selection +- Jobs receives only the final, already-honest `PlatformJobSpec` + +## Fast-Fail Requirements + +Fast failure is a core requirement of this spec. 
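One way to picture the plugin API shape proposed in the previous section is a small protocol with a dispatching compiler. The method names, the `needs_isolation` flag, the dictionary step shape, and the image name are all hypothetical; the spec leaves exact method names as implementation detail, so treat this as a sketch only.

```python
from typing import Protocol, Sequence


class JobCompiler(Protocol):
    """Hypothetical plugin-side contract consumed by the shared resolver."""

    def supported_providers(self) -> Sequence[str]: ...

    def preferred_providers(self, spec: dict) -> Sequence[str]: ...

    def compile(self, spec: dict, provider: str, profile: str) -> dict: ...


class EvaluatorCompiler:
    """Example job that supports both subprocess and cpu."""

    def supported_providers(self):
        return ("subprocess", "cpu")

    def preferred_providers(self, spec):
        # Preference may depend on the canonical spec, not only the job type.
        if spec.get("needs_isolation"):
            return ("cpu", "subprocess")
        return ("subprocess", "cpu")

    def compile(self, spec, provider, profile):
        # Compiles once, for the provider the resolver already selected.
        command = ["evaluate", spec["target"]]
        if provider == "subprocess":
            return {"provider": provider, "profile": profile, "command": command}
        if provider == "cpu":
            container = {"image": "example.invalid/evaluator:latest", "command": command}
            return {"provider": provider, "profile": profile, "container": container}
        raise ValueError(f"unsupported provider: {provider!r}")


compiler: JobCompiler = EvaluatorCompiler()
spec = {"target": "my-model"}
# Assume the shared resolver picked subprocess/default for this host.
step = compiler.compile(spec, "subprocess", "default")
```

The selection is passed in rather than computed inside `compile`, so the plugin never needs its own fallback logic and Jobs receives a step that already matches the chosen provider.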
+ +The system must fail before job creation when: + +- the caller selected a profile unsupported by the plugin +- the caller selected a profile not configured on the host +- the plugin supports only providers that are unavailable on the host +- the selected provider requires compile-time invariants that are not satisfied + +Failure messages should be structured enough to answer these questions immediately: + +- what did the caller ask for +- what does the plugin support +- what is available on this host +- what should the user or operator change to make it work + +This avoids partially compiled jobs and opaque runtime failures. + +## Migration Plan + +This spec can be adopted incrementally. + +### Phase 1: Standardize Resolver Inputs + +- define the built-in providers +- add shared plugin-layer helpers for declaring supported providers and preference order +- expose Jobs execution profiles as the source of host availability + +### Phase 2: Introduce Provider-Specific Compilation + +- update plugin jobs to compile explicitly for subprocess, CPU, GPU, or distributed GPU as appropriate +- allow plugins that support multiple providers to branch after shared resolution, not before + +### Phase 3: Remove Jobs Rewrite + +- delete the CPU-to-subprocess translation at Jobs API ingress +- require subprocess jobs to be compiled explicitly as subprocess + +### Phase 4: Unify Local Execution Through Jobs + +- migrate local `run_local(...)` flows toward a jobs-backed subprocess path +- keep synchronous vs asynchronous interaction mode separate from execution placement + +## Consequences For Existing Job Types + +### Evaluate-Suite Style Jobs + +Jobs like `evaluate-suite` are already close to the target architecture. + +They explicitly compile to `subprocess`, validate subprocess-specific assumptions up front, and do not rely on the Jobs API to reinterpret a container step. 
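The compile-time validation that evaluate-suite style jobs perform could look roughly like the sketch below. It is illustrative only: `SubprocessStep`, `SubprocessCompileError`, and `validate_subprocess_step` are hypothetical names rather than the repo's actual types, and the checks cover only a few of the invariants listed under "Better Validation" (command shape, absolute paths, required host-side tools).

```python
from __future__ import annotations

import os
import shutil
from dataclasses import dataclass, field


class SubprocessCompileError(ValueError):
    """Raised at compile time when a subprocess step violates an invariant."""


@dataclass(frozen=True)
class SubprocessStep:
    # Hypothetical step shape for illustration; not the real PlatformJobSpec schema.
    command: list[str]
    working_dir: str
    required_tools: list[str] = field(default_factory=list)


def validate_subprocess_step(step: SubprocessStep) -> None:
    """Collect every violated invariant and fail once, before any job is created."""
    problems: list[str] = []

    # Command shape: a host subprocess needs an explicit argv list.
    if not step.command:
        problems.append("command must be a non-empty argv list")

    # Absolute path requirement: no container working directory exists to anchor relative paths.
    if not os.path.isabs(step.working_dir):
        problems.append(f"working_dir must be absolute, got {step.working_dir!r}")

    # Required host-side tools must be resolvable on PATH.
    for tool in step.required_tools:
        if shutil.which(tool) is None:
            problems.append(f"required host tool not found on PATH: {tool}")

    if problems:
        raise SubprocessCompileError("; ".join(problems))
```

Reporting all problems in a single error, instead of stopping at the first, is a design choice in this sketch; it keeps the failure message useful to an operator who has to fix several things at once.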
+ +### Evaluator / Data-Designer Style Jobs + +Jobs that currently compile to `cpu` should either: + +- remain honest CPU container jobs, or +- gain explicit subprocess compilation support and let the shared resolver choose between subprocess and CPU + +They should not rely on Jobs to decide that a CPU job is actually subprocess. + +### Customization / Training Jobs + +GPU and distributed GPU training jobs should continue to compile explicitly for the appropriate GPU providers. + +If they do not support subprocess, they simply declare that they do not support it. The resolver will then fail or fall back accordingly. + +## Acceptance Criteria + +This spec should be considered successful only if all of the following are true. + +- `subprocess` is represented as an explicit first-class provider in plugin compilation and Jobs validation. +- The Jobs service no longer rewrites `cpu` container steps into subprocess steps. +- Plugins use a shared resolution algorithm rather than per-plugin ad hoc heuristics. +- Execution selection uses caller intent, plugin support, and host-available profiles as its inputs. +- Incompatibility produces a fast failure before job creation. +- The same logical resolution rules apply across plugins unless a plugin explicitly narrows its supported providers. +- Container ownership is explicit for `cpu`, `gpu`, and `gpu_distributed`, and absent from `subprocess`. +- Local jobs unification can treat subprocess as the normal first-class local execution provider. + +## Open Questions + +A few detailed decisions remain for implementation. + +## Recommendation + +- make `subprocess` explicit +- resolve execution before compile +- compile once for the selected provider +- let Jobs validate and dispatch, not reinterpret + +That provides a shared convention across plugins, honest execution semantics, fast failure, and a cleaner path toward full local/remote jobs unification. 
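
As a rough illustration of the shared resolution and fast-fail behavior this spec describes, the following sketch resolves a provider and profile from caller intent, plugin support, and host availability. All names here (`resolve_execution`, `Selection`, `ResolutionError`) are hypothetical, and two details are assumptions rather than part of the spec: profile identifiers are taken to have the form `provider/name` (as in `subprocess/default`), and the first available profile of the preferred provider is chosen.

```python
from __future__ import annotations

from dataclasses import dataclass


class ResolutionError(Exception):
    """Structured failure raised before job creation."""

    def __init__(self, requested, supported, available, hint):
        self.requested = requested    # what the caller asked for (None if nothing)
        self.supported = supported    # what the plugin supports, in preference order
        self.available = available    # what this host exposes
        self.hint = hint              # what to change to make it work
        super().__init__(
            f"requested={requested!r} supported={supported!r} "
            f"available={sorted(available)!r}: {hint}"
        )


@dataclass(frozen=True)
class Selection:
    provider: str
    profile: str


def resolve_execution(
    *,
    preference: list[str],
    available: dict[str, list[str]],
    requested_profile: str | None = None,
) -> Selection:
    """Pick one provider/profile, or fail fast with a structured error."""
    if requested_profile is not None:
        # Assumption: the provider is the part of the profile id before the slash.
        provider = requested_profile.split("/", 1)[0]
        if provider not in preference:
            raise ResolutionError(
                requested_profile, preference, available,
                "choose a profile whose provider the plugin supports",
            )
        if requested_profile not in available.get(provider, []):
            raise ResolutionError(
                requested_profile, preference, available,
                "configure this profile on the host or choose an available one",
            )
        return Selection(provider, requested_profile)

    # No explicit request: walk the plugin's preference order against host availability.
    for provider in preference:
        profiles = available.get(provider, [])
        if profiles:
            return Selection(provider, profiles[0])

    raise ResolutionError(
        None, preference, available,
        "expose a profile for at least one supported provider",
    )
```

With this shape, a CPU-capable plugin that prefers `subprocess` and then `cpu` would resolve to `subprocess/default` on a local host that exposes it, and to a CPU profile on a host that exposes no subprocess profiles, without the plugin branching on a local-versus-production flag.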
diff --git a/spec/trusted-probes-and-endpoints-spec.md b/spec/trusted-probes-and-endpoints-spec.md new file mode 100644 index 0000000000..73962daf51 --- /dev/null +++ b/spec/trusted-probes-and-endpoints-spec.md @@ -0,0 +1,154 @@ +# Trusted Probes And Endpoints Spec + +## Summary + +This spec explores whether NeMo Platform should add a first-class concept for trusted probes and other trusted endpoint access patterns. + +This is intentionally separate from `plugin-service-authz-spec.md`. + +The plugin service authz spec stays close to the current implementation and does not introduce a new probe or trusted-endpoint abstraction. + +## Current State + +The current repo does not appear to have a first-class platform concept for: + +- probe caller +- trusted internal endpoint +- mesh-authenticated internal audience + +What exists today is much closer to: + +- no principal present +- principal present +- service principal identified by `service:` prefix + +Some routes are handled through special-case policy logic, but there is no general route-level abstraction for unauthenticated trusted probes or trusted internal callers. + +## Problem + +Some endpoints, especially health/readiness/operational endpoints, may need semantics different from normal user-facing authorization. + +Examples: + +- Kubernetes health probes that do not present a principal +- internal services calling operational APIs through mTLS or trusted service identity +- infrastructure components that should be allowed to reach a narrow set of endpoints without following normal user-facing authorization rules + +The current authorization model does not provide a clear first-class way to describe these cases. + +## Goals + +- Explore whether trusted probes should become a platform concept. +- Explore whether trusted internal endpoint access should become a platform concept. +- Determine whether these concepts belong in plugin route policy or in separate transport/network configuration. 
+ +## Non-Goals + +- Changing `plugin-service-authz-spec.md` in the first iteration. +- Defining the final decorator/path-rule shape for trusted probes. + +## Key Question + +Should NeMo Platform represent trusted probes and trusted internal endpoint access inside the authorization model, or should those concerns remain outside route policy and be enforced at the transport/network layer? + +## Constraints + +### Constraint 1: Probes Often Have No Principal + +Kubernetes-style probes often look like plain HTTP calls with no NeMo principal attached. + +That means they do not naturally fit the current principal-based auth model. + +### Constraint 2: Trust May Come From Transport Or Topology + +Some "trusted" access may rely on: + +- separate port binding +- private network reachability +- service mesh identity +- ingress restrictions +- loopback-only access + +Those are not the same thing as route-level principal authorization. + +### Constraint 3: Trusted Access Should Not Accidentally Become Public Access + +If the platform adds a trusted probe concept, it must fail safely and avoid broadening access unintentionally. 
+ +## Options + +### Option 1: Keep Probes Outside The Plugin Authz Model + +Trusted probes are handled through: + +- separate port +- separate listener +- ingress/network policy +- platform-specific operational route exposure + +Pros: + +- aligns with how many platforms handle health probes +- avoids mixing transport trust with route authorization +- keeps plugin authz simpler + +Cons: + +- plugin/service authors cannot describe trusted probe behavior directly in route metadata +- requires more deployment/runtime coordination + +### Option 2: Add A First-Class Probe Caller Concept + +Add a normalized caller concept for something like: + +- `PROBE` + +Pros: + +- route policy can describe probe access explicitly +- easier to reason about from endpoint definitions alone + +Cons: + +- not obvious how probe identity is established when no principal exists +- may create false confidence if enforcement really depends on network topology + +### Option 3: Add A Trusted Endpoint Classification Separate From Callers + +Instead of treating probes as callers, endpoints could be classified as: + +- normal API endpoint +- operational/probe endpoint + +The platform would then apply different serving/exposure rules to those endpoints. + +Pros: + +- better matches the idea that trust may come from deployment/network shape +- avoids pretending probes are authenticated principals + +Cons: + +- creates another endpoint dimension +- still needs clear runtime enforcement + +## Recommendation + +Do not add trusted probes or trusted internal endpoint abstractions to the first plugin authz redesign. + +Keep the initial spec focused on: + +- callers +- permissions +- roles +- explicit path rules + +Explore probes and trusted endpoints separately here. + +Option 1 or Option 3 is more likely to fit the current platform model than inventing a principal-like `PROBE` caller immediately. 
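
To make Option 3 more concrete, here is one possible shape, offered only as a sketch. `EndpointClass`, `Listener`, and `requires_principal` are hypothetical names, and the idea of a separate internal listener is borrowed from the transport mechanisms listed under Constraint 2; nothing like this exists in the current repo. The sketch skips the principal check only when an operational endpoint is reached through the internal listener, and treats every other combination, including unclassified endpoints, as a normal authorized API call, in line with Constraint 3.

```python
from __future__ import annotations

from enum import Enum


class EndpointClass(Enum):
    API = "api"                  # normal API endpoint
    OPERATIONAL = "operational"  # operational/probe endpoint


class Listener(Enum):
    PUBLIC = "public"      # listener reachable by normal callers
    INTERNAL = "internal"  # hypothetical separate port or loopback-only listener


def requires_principal(endpoint_class: EndpointClass | None, listener: Listener) -> bool:
    """Return True when normal principal-based authorization must apply."""
    # Trust comes from where the request arrived, not from a probe "identity".
    if endpoint_class is EndpointClass.OPERATIONAL and listener is Listener.INTERNAL:
        return False
    # Fail closed: unclassified endpoints and anything on the public listener
    # go through the regular authorization path.
    return True
```

In this reading, classifying an endpoint as operational never widens access on its own; the exemption also depends on the deployment exposing the internal listener only to trusted infrastructure.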
+ +## Relationship To Plugin Service Authz + +`plugin-service-authz-spec.md` should remain fail-closed and require explicit path rules, but it should not attempt to solve trusted probe semantics in its first version. + +If the platform later introduces a trusted-probe or trusted-endpoint model, it should be designed and implemented as a focused follow-up using this spec. diff --git a/web/packages/common/src/components/UploadModal/SimpleFilesTable.test.tsx b/web/packages/common/src/components/UploadModal/SimpleFilesTable.test.tsx index f839793c4b..402a003dd9 100644 --- a/web/packages/common/src/components/UploadModal/SimpleFilesTable.test.tsx +++ b/web/packages/common/src/components/UploadModal/SimpleFilesTable.test.tsx @@ -44,6 +44,22 @@ const createWrapper = ( describe('SimpleFilesTable', () => { beforeEach(() => { vi.clearAllMocks(); + // @tanstack/react-virtual measures elements via getBoundingClientRect to determine + // the visible row range. JSDOM returns 0 for all dimensions, causing the virtualizer + // to compute an empty visible range and render no rows. Return a fixed 56px height + // (matching VirtualizedTableContent's default rowHeight) so the virtualizer renders + // all rows within the overscan window. + vi.spyOn(Element.prototype, 'getBoundingClientRect').mockReturnValue({ + height: 56, + width: 560, + top: 0, + left: 0, + bottom: 56, + right: 560, + x: 0, + y: 0, + toJSON: () => ({}), + }); }); const mockNewFiles: UploadFile[] = [ diff --git a/web/packages/common/src/components/UploadModal/SimpleFilesTable.tsx b/web/packages/common/src/components/UploadModal/SimpleFilesTable.tsx index 3b79156b79..912a01f699 100644 --- a/web/packages/common/src/components/UploadModal/SimpleFilesTable.tsx +++ b/web/packages/common/src/components/UploadModal/SimpleFilesTable.tsx @@ -1,7 +1,7 @@ // SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
// SPDX-License-Identifier: Apache-2.0 -import { ScrollTable } from '@nemo/common/src/components/ScrollTable'; +import * as DataView from '@nemo/common/src/components/DataView/internal'; import { useUploadModalContext } from '@nemo/common/src/components/UploadModal/Context/useUploadModalContext'; import { useInlinePickerSlot } from '@nemo/common/src/components/UploadModal/InlinePickerSlot'; import { UploadFile } from '@nemo/common/src/components/UploadModal/types'; @@ -9,17 +9,23 @@ import { formatFileSize } from '@nemo/common/src/components/UploadModal/utils'; import { Button, Checkbox, - Text, - TableColumnDefinition, - TableRowDefinition, Flex, - Stack, - RadioGroupRoot, - RadioGroupItem, RadioGroupInput, + RadioGroupItem, + RadioGroupRoot, + Stack, + Text, } from '@nvidia/foundations-react-core'; import { CircleAlert } from 'lucide-react'; -import { useCallback, useMemo } from 'react'; +import { type ComponentProps, useCallback, useMemo } from 'react'; + +type FileRow = { + id: string; + name: string; + size: number; + isDisabled: boolean; + uploadFile: UploadFile; +}; export const SimpleFilesTable = () => { const [state, dispatch] = useUploadModalContext(); @@ -48,44 +54,7 @@ export const SimpleFilesTable = () => { if (allowedExtensions.size === 0) return true; return allowedExtensions.has(fileExtension(uploadFile)); }; - const toggleFileSelection = useCallback( - (file: UploadFile) => { - dispatch({ - type: 'TOGGLE_FILE_SELECTION', - payload: file, - }); - }, - [dispatch] - ); - const handleSingleSelect = useCallback( - (id: string) => { - const file = files.find((f) => f.id === id); - if (!file) return; - dispatch({ type: 'TOGGLE_FILE_SELECTION', payload: file }); - }, - [dispatch, files] - ); - const handleFileChange = (event: React.ChangeEvent) => { - const files = event.target.files; - if (files) { - dispatch({ - type: 'SET_FILES', - payload: Array.from(files).map((file) => ({ id: file.name, type: 'new', file })), - }); - } - }; - - const columns: 
TableColumnDefinition[] = [ - { children: '' }, - { children: 'Name' }, - { children: 'Size' }, - ]; - // ``invalidFileMode`` controls how files whose extension isn't in - // ``acceptableFileTypes`` are rendered. ``'hide'`` filters them out so the - // user only sees pickable files; ``'disable'`` keeps them visible but - // marks the radio/checkbox as ``disabled``; ``'show'`` (default) keeps - // the prior behaviour and lets the parent validate after submit. const visibleFiles = useMemo(() => { if (invalidFileMode !== 'hide' || allowedExtensions.size === 0) return files; return files.filter(isFileAllowed); @@ -99,70 +68,85 @@ export const SimpleFilesTable = () => { ? `Only ${acceptableFileTypes.join(', ')} files can be selected. Upload a supported file or choose a different fileset.` : null; - const rows = useMemo( + const fileRows = useMemo( () => visibleFiles.map((uploadFile) => { - // In ``'disable'`` mode, mismatched-extension rows render but their - // selector control is ``disabled``. ``'hide'`` already filtered - // them; ``'show'`` keeps everything pickable. const isDisabled = invalidFileMode === 'disable' && !isFileAllowed(uploadFile); const name = uploadFile.type === 'existing' ? uploadFile.file.path : uploadFile.file.name; const size = uploadFile.type === 'existing' ? uploadFile.file.size : uploadFile.file.size; - return { - id: uploadFile.id, - cells: [ - { - children: allowMultipleFileSelection ? 
( - file.id === uploadFile.id)} - onCheckedChange={() => toggleFileSelection(uploadFile)} - disabled={isDisabled} - /> - ) : ( - - - - ), - }, - { children: name }, - { children: formatFileSize(size) }, - ], - }; + return { id: uploadFile.id, name, size, isDisabled, uploadFile }; }), // eslint-disable-next-line react-hooks/exhaustive-deps - [visibleFiles, selectedFiles, toggleFileSelection, allowMultipleFileSelection, invalidFileMode] + [visibleFiles, invalidFileMode, allowedExtensions] ); + const dataViewState = DataView.useDataViewState(); + + const makeColumns = useCallback>['makeColumns']>( + (col) => [ + col.display({ + id: 'select', + header: () => null, + size: 40, + maxSize: 40, + minSize: 40, + meta: { alignment: 'center' as const }, + cell: ({ row }) => + allowMultipleFileSelection ? ( + f.id === row.original.id)} + onCheckedChange={() => + dispatch({ type: 'TOGGLE_FILE_SELECTION', payload: row.original.uploadFile }) + } + disabled={row.original.isDisabled} + attributes={{ CheckboxInput: { 'aria-label': row.original.name } }} + /> + ) : ( + + + + ), + }), + col.accessor('name', { header: 'Name' }), + col.accessor('size', { + header: 'Size', + cell: (ctx) => formatFileSize(ctx.getValue()), + }), + ], + [allowMultipleFileSelection, selectedFiles, dispatch] + ); + + const handleFileChange = (event: React.ChangeEvent) => { + const newFiles = event.target.files; + if (newFiles) { + dispatch({ + type: 'SET_FILES', + payload: Array.from(newFiles).map((file) => ({ id: file.name, type: 'new', file })), + }); + } + }; + return ( - {allowMultipleFileSelection ? ( - - ) : ( - // ``RadioGroupRoot`` defaults to its content's natural width — force - // ``w-full`` so the inner ScrollTable fills the modal's width. +
{ + const file = fileRows.find((r) => r.id === id); + if (file) dispatch({ type: 'TOGGLE_FILE_SELECTION', payload: file.uploadFile }); + }} > - + row.id }} + > + + - )} +
{disabledFilesMessage ? ( diff --git a/web/packages/common/src/components/UploadModal/index.tsx b/web/packages/common/src/components/UploadModal/index.tsx index 463e2c8d99..3dc3c3c984 100644 --- a/web/packages/common/src/components/UploadModal/index.tsx +++ b/web/packages/common/src/components/UploadModal/index.tsx @@ -24,7 +24,6 @@ import { ModalHeading, ModalMain, ModalRoot, - Stack, } from '@nvidia/foundations-react-core'; import { FC, MouseEvent, useId, useMemo } from 'react'; @@ -69,35 +68,28 @@ const UploadModalContent: FC = ({ return ( - - - {title} - - - - - - - - - {submitButtonText} - - - + + {title} + + + + + + + {submitButtonText} + + diff --git a/web/packages/studio/src/index.css b/web/packages/studio/src/index.css index f05c0ee05f..692b6a30ed 100644 --- a/web/packages/studio/src/index.css +++ b/web/packages/studio/src/index.css @@ -67,6 +67,12 @@ body { } } +/* hack fix: sticky table header separator — box-shadow is the only reliable way to + paint a bottom border that stays with a position:sticky thead across browsers */ +.sticky-table-header { + box-shadow: 0 2px 0 var(--border-color-base); +} + /* hack fix: KUI ships Checkbox and RadioGroupItem without a pointer cursor */ .nv-checkbox-input, .nv-radio-group-item {