NVIDIA-NeMo · mckornfield · Jun 23, 2026
@@ -0,0 +1,15 @@
+feat(jobs): add auth-aware E2E tests and job diagnostics infrastructure
+
+Add a new E2E test suite (`test_jobs_auth.py`) that validates workspace
+isolation and principal propagation under an auth-enabled platform config.
+Introduce a reusable `diagnostics.py` module in the jobs controller layer
+to collect and log structured job/step/task state on errors, and wire it
+into the reconciler and scheduler for automatic debug-level diagnostics
+when steps transition to ERROR or encounter unexpected exceptions.
+
+Refactor `e2e/conftest.py` to support multiple running-services instances
+keyed by config hash, enabling per-test-module platform configs (e.g.,
+`local-subprocess.yaml` with auth enabled) to coexist in a single session.
+Add a `local-subprocess.yaml` E2E config and extend `nmp_testing` utilities
+with `grant_workspace_role`, `unique_email`, and `TEST_ADMIN_EMAIL` helpers
+needed by the auth test scenarios.
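To make the "keyed by config hash" idea concrete, here is a minimal sketch of how a session-scoped registry might hand out one running-services instance per distinct config. The names `RunningServices`, `ServicesRegistry`, `config_hash`, and `get_or_start` are hypothetical and not taken from `e2e/conftest.py`; hashing the raw file bytes with SHA-256 is also an assumption.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunningServices:
    """Hypothetical handle for one started platform instance."""
    config_path: Path
    config_hash: str


def config_hash(config_path: Path) -> str:
    """Hash the config file contents so identical configs map to the same key."""
    return hashlib.sha256(config_path.read_bytes()).hexdigest()[:16]


class ServicesRegistry:
    """Session-wide cache of running services, keyed by config hash (assumed design)."""

    def __init__(self) -> None:
        self._instances: dict[str, RunningServices] = {}

    def get_or_start(self, config_path: Path) -> RunningServices:
        key = config_hash(config_path)
        if key not in self._instances:
            # A real fixture would launch the platform here; this sketch only records it.
            self._instances[key] = RunningServices(config_path, key)
        return self._instances[key]
```

With this shape, a test module that needs an auth-enabled config would ask the registry for its own config path and get a separate instance, while modules sharing a config reuse the one already started.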
@@ -0,0 +1 @@
+Just talking out loud here to try to wrap my own head around this. Conceptually the jobs service has a "backend" (a type of execution) and an executor (a configured instance of a backend). "Profile" is a higher-level concept that does fancy selection, and arguably it's just a useless layer of abstraction that confuses everyone. The idea was that some jobs need cpu and some jobs need gpu, so why not create an abstraction layer that does the hard work of choosing for you? Or you can just specify an executor as part of the job. Profiles introduce a category of compute: cpu, gpu, gpu_distributed, which are really just shortcuts for the backend you're looking for. So for customization you would select gpu_distributed, and for eval, gpu. So I think the idea is that the job compiler chooses the category, and the platform config does the mapping to the executor. And you can pass a profile in with the job spec, which tells the compiler which profile to use. So you end up having to map profiles to executors anyway. It makes me think that the services should be responsible for doing this mapping, not jobs. Ex: customizer configures the mapping. But having a central concept of "profile" means that every service does things the exact same way, which is nice. It just makes the responsibility unclear. In the future we might want to allow a plugin to define its own "provider" and allow other services to use it. Different providers have different configuration requirements that map down to the execution backend's job format. At this level, we are talking about job specs: each service compiles its own spec down to the provider spec, which then selects the executor, and the job is submitted to the executor. And as part of the platform config, we have defaults for some of these things, which might include containers. For example, customizer chooses the image to use depending on the type of customization requested.
But for customizer, we could imagine a subprocess executor with the customizer task image, where the behavior would be to call docker run .... So this is an appropriate translation for the subprocess executor. The job is running as a subprocess, but we require a container, so we use docker (or podman), depending on how the subprocess executor is configured to run containers. If the container is absent, then we just call the entrypoint in the workspace configured by the executor. The goal should be that we don't care what executor it's running on; the job spec is the same. But services will want some flexibility here to choose the right executor. For example, imagine a plugin has a dependency on a specific version of python, so we might want to define an executor that executes commands in the context of a specific venv when using the subprocess executor. And we can also provide a container for that plugin, which would execute the same command in a container. The platform shouldn't need to know anything about the venv or the container, as this would be specific to the plugin. The plugin just selects the "provider", and the platform maps that to the correct executor. Now a plugin could define its own provider, which will select the correct venv (either using subprocess or in a container). What this suggests to me is that the plugin needs more control over how jobs are mapped to executors, and over which executors are configured, beyond the rigid "provider" categories. For example, a plugin could define a venv for subprocess exec (typically for local dev), a container for production workloads, and a slurm script for a batch workflow specific to slurm backends.
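The mapping discussed in these notes (the compiler picks a compute category, the platform config maps it to an executor, and a profile on the job spec can override the choice) could look roughly like the sketch below. This is only an illustration under assumptions: `Executor`, `DEFAULT_CATEGORY`, and `resolve_executor` are hypothetical names, treating the `provider` field as the compute category is a guess, and the backend values in the usage example are placeholders.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Executor:
    """One configured executor entry, mirroring the provider/profile/backend keys."""
    provider: str
    profile: str
    backend: str
    config: dict = field(default_factory=dict)


# From the notes: customization wants gpu_distributed, eval wants gpu.
DEFAULT_CATEGORY = {"customization": "gpu_distributed", "evaluation": "gpu"}


def resolve_executor(executors, job_kind: str, profile: Optional[str] = None) -> Executor:
    """Pick an executor: the job kind selects the category, the profile narrows it."""
    category = DEFAULT_CATEGORY[job_kind]
    wanted = profile or "default"  # a profile passed with the job spec wins
    for ex in executors:
        if ex.provider == category and ex.profile == wanted:
            return ex
    raise LookupError(f"no executor for provider={category!r} profile={wanted!r}")
```

A usage example, with placeholder backends:

```python
executors = [
    Executor("gpu", "default", "docker"),
    Executor("gpu_distributed", "default", "kubernetes"),
]
chosen = resolve_executor(executors, "customization")
```

Letting a plugin register its own entries in a table like `DEFAULT_CATEGORY` is one way the "plugin needs more control" point could be explored, though where that table should live (jobs or each service) is exactly the open question above.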
@@ -422,6 +422,8 @@ jobs:
   schedule_interval_seconds: 5
   # Register the subprocess/default execution profile. When unset, defaults to true for docker/none runtimes and false for kubernetes.
   enable_subprocess_executor:
+  # Include raw job log lines in controller diagnostics snapshots. Disabled by default because job logs may contain secrets or PII. Enable only for local debugging or test environments. | default: False
+  include_job_logs_in_diagnostics: false
 ```
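As a rough illustration of what `include_job_logs_in_diagnostics` gates, the sketch below builds a structured snapshot of job and step state and attaches raw log lines only when the flag is on. It is not the actual `diagnostics.py` module: `StepState`, `build_snapshot`, `log_diagnostics`, the logger name, and the snapshot layout are all assumptions made for the example.

```python
import json
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("jobs.diagnostics")  # hypothetical logger name


@dataclass
class StepState:
    """Hypothetical view of one job step at the time of the error."""
    name: str
    status: str
    log_lines: list[str] = field(default_factory=list)


def build_snapshot(job_id: str, steps: list[StepState], include_job_logs: bool = False) -> dict:
    """Collect structured state; raw log lines are opt-in because they may hold secrets or PII."""
    snapshot = {"job_id": job_id, "steps": []}
    for step in steps:
        entry = {"name": step.name, "status": step.status}
        if include_job_logs:
            entry["log_lines"] = list(step.log_lines)
        snapshot["steps"].append(entry)
    return snapshot


def log_diagnostics(job_id: str, steps: list[StepState], include_job_logs: bool = False) -> None:
    """Emit the snapshot at debug level, e.g. when a step transitions to ERROR."""
    snapshot = build_snapshot(job_id, steps, include_job_logs)
    logger.debug("job diagnostics: %s", json.dumps(snapshot, sort_keys=True))
```

Keeping the default at `False` means a misconfigured production deployment emits step names and statuses but never job output.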
 
 ### `models`

@@ -0,0 +1,68 @@
+# Local E2E config for hosts without Docker.
+#
+# This keeps the explicit subprocess/default jobs profile required by
+# translate_cpu_container_steps_to_subprocess(), while avoiding the default
+# docker job backends that are derived from platform.runtime: "docker".
+
+platform:
+  runtime: "none"
+  base_url: "http://0.0.0.0:8080"
+
+service: {}
+
+auth:
+  enabled: false
+  allow_unsigned_jwt: true
+  policy_decision_point_provider: embedded
+  policy_decision_point_base_url: "http://localhost:8080"
+  policy_data_refresh_interval: 2
+  bundle_cache_seconds: 15
+  admin_email: "admin@example.com"
+
+entities: {}
+
+jobs:
+  # Local E2E-only debugging aid. This may leak secrets or PII from job output,
+  # so it must remain disabled in non-test configs.
+  include_job_logs_in_diagnostics: true
+  executors:
+    - provider: subprocess
+      profile: default
+      backend: subprocess
+      config:
+        working_directory: .tmp/e2e/subprocess-jobs
+        cleanup_completed_jobs_immediately: false
+        ttl_seconds_before_active: 60
+        ttl_seconds_active: 3600
+        ttl_seconds_after_finished: 300
+  executor_defaults:
+    subprocess:
+      working_directory: .tmp/e2e/subprocess-jobs
+      cleanup_completed_jobs_immediately: false
+      ttl_seconds_before_active: 60
+      ttl_seconds_active: 3600
+      ttl_seconds_after_finished: 300
+
+evaluator:
+  recreate_existing_system_entities: true
+
+safe_synthesizer: {}
+
+models:
+  controller:
+    interval_seconds: 5
+    model_deployment_garbage_collection_ttl_seconds: 30
+
+inference_gateway: {}
+
+secrets:
+  allow_key_creation: true
+
+files:
+  default_storage_config:
+    type: local
+    path: .tmp/e2e/files
+
+studio:
+  static_files_path: web/packages/studio/dist
+  sandbox_enabled: true