47 changes: 0 additions & 47 deletions .github/workflows/docker-build-dispatch.yml

This file was deleted.

52 changes: 0 additions & 52 deletions .github/workflows/docker-build-liveness.yml

This file was deleted.

20 changes: 0 additions & 20 deletions .github/workflows/docker-build-trigger.yml

This file was deleted.

15 changes: 15 additions & 0 deletions commit.txt
@@ -0,0 +1,15 @@
feat(jobs): add auth-aware E2E tests and job diagnostics infrastructure

Add a new E2E test suite (`test_jobs_auth.py`) that validates workspace
isolation and principal propagation under an auth-enabled platform config.
Introduce a reusable `diagnostics.py` module in the jobs controller layer
to collect and log structured job/step/task state on errors, and wire it
into the reconciler and scheduler for automatic debug-level diagnostics
when steps transition to ERROR or encounter unexpected exceptions.

Refactor `e2e/conftest.py` to support multiple running-services instances
keyed by config hash, enabling per-test-module platform configs (e.g.,
`local-subprocess.yaml` with auth enabled) to coexist in a single session.
Add a `local-subprocess.yaml` E2E config and extend `nmp_testing` utilities
with `grant_workspace_role`, `unique_email`, and `TEST_ADMIN_EMAIL` helpers
needed by the auth test scenarios.
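
A minimal sketch of the config-hash keying described above. The `RunningServices` handle and the `get_running_services` helper are hypothetical names for illustration and are not taken from `e2e/conftest.py`; a real fixture would start the platform where this sketch only records the instance.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunningServices:
    """Hypothetical handle for one started platform instance."""

    config_path: Path
    config_hash: str


# Session-wide cache: one running-services instance per distinct config content.
_instances: dict[str, RunningServices] = {}


def config_hash(config_path: Path) -> str:
    """Hash the config file's bytes so identical configs share an instance."""
    return hashlib.sha256(config_path.read_bytes()).hexdigest()[:12]


def get_running_services(config_path: Path) -> RunningServices:
    """Return the cached instance for this config, creating one if needed."""
    key = config_hash(config_path)
    if key not in _instances:
        # Assumption: a real fixture would launch the platform here.
        _instances[key] = RunningServices(config_path=config_path, config_hash=key)
    return _instances[key]
```

Keying on the file content rather than the path means two test modules that point at byte-identical configs reuse one instance, while a module with a different config gets its own.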
1 change: 1 addition & 0 deletions container.md
@@ -0,0 +1 @@
Just talking out loud here, to try to wrap my own head around this. Conceptually the jobs service has a "backend" (which is a type of execution) and an executor (a configured instance of a backend). "Profile" is a higher-level concept that does fancy selection, and arguably it's just a useless layer of abstraction that confuses everyone. The idea was that some jobs need cpu, and some jobs need gpu, so why not create an abstraction layer that does the hard work of choosing for you. Or you can just use an executor as part of the job. Profiles introduce a category of compute: cpu, gpu, gpu_distributed, which are really just shortcuts for the backend you're looking for. So for customization you would select gpu_distributed, for eval: gpu. So I think the idea is that the job compiler chooses the category, and the platform config does the mapping to the executor. And you can pass a profile in with the job spec, which will tell the compiler which profile to use. So you end up having to map profiles to executors anyway. It makes me think that the services should be responsible for doing this mapping, not jobs. Ex: customizer configures the mapping. But having a central concept of "profile" means that every service does things the exact same way, which is nice. It just makes the responsibility unclear. In the future we might want to allow a plugin to define its own "provider", and allow other services to use it. Different providers would have different configuration requirements that map down to the execution backend job format. At this level, we are talking about job specs, and each service compiles its own spec down to the provider spec, which then selects the executor, and the job is submitted to the executor. And as part of the platform config, we have defaults for some of these things, which might include containers. For example, customizer chooses the image to use, depending on the type of customization requested.
But for customizer, we could imagine a subprocess executor, with the customizer task image, and the behavior would be to call docker run .... So this is an appropriate translation for the subprocess executor. So the job is running as a subprocess, but we require a container, so we use docker (or podman), depending on how the subprocess executor is configured to run containers. If the container is absent, then we just call the entrypoint in the workspace configured by the executor. The goal should be that we don't care what executor it's running on; the job spec is the same. But services will want some flexibility here to choose the right executor. For example, imagine a plugin has some dependency on a specific version of python, so we might want to define an executor that executes commands in the context of a specific venv when using the subprocess executor. And we can also provide a container for that plugin, which would execute the same command in a container. The platform shouldn't need to know anything about the venv, or the container, as this would be specific to the plugin. The plugin just selects the "provider", and the platform maps that to the correct executor. Now a plugin could define its own provider, which will select the correct venv (either using subprocess or in a container). What this suggests to me is that the plugin needs more control over how jobs are mapped to executors, and which executors are configured, beyond the rigid "provider" categories. For example, a plugin could define a venv for subprocess exec (typically for local dev), a container for production workloads, and a slurm script for a batch workflow specific to slurm backends.
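
The mapping discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Executor`, `JobSpec`, `PROFILE_TO_EXECUTOR`, `select_executor`, and `subprocess_argv` are hypothetical names, the executor names are made up, and the real jobs service may structure this quite differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Executor:
    """A configured instance of a backend (hypothetical shape)."""

    name: str
    backend: str  # e.g. "subprocess", "docker", "slurm"


@dataclass(frozen=True)
class JobSpec:
    command: list[str]
    profile: Optional[str] = None    # explicit profile passed with the job spec
    container: Optional[str] = None  # image; None means run the entrypoint directly


# Assumed platform config: profile (compute category) -> executor.
PROFILE_TO_EXECUTOR = {
    "cpu": Executor("local", "subprocess"),
    "gpu": Executor("gpu-pool", "docker"),
    "gpu_distributed": Executor("cluster", "slurm"),
}


def select_executor(spec: JobSpec, compiler_default: str) -> Executor:
    """Use the profile from the job spec, else the category the compiler chose."""
    profile = spec.profile or compiler_default
    try:
        return PROFILE_TO_EXECUTOR[profile]
    except KeyError:
        raise ValueError(f"no executor configured for profile {profile!r}") from None


def subprocess_argv(spec: JobSpec, container_runtime: str = "docker") -> list[str]:
    """Translate a job spec into the argv a subprocess executor would run."""
    if spec.container is None:
        # No container: call the entrypoint directly in the executor's workspace.
        return list(spec.command)
    # Container required: wrap the same command in `docker run` (or `podman run`).
    return [container_runtime, "run", "--rm", spec.container, *spec.command]
```

The job spec stays the same in both branches of `subprocess_argv`; only the translation differs, which is the property the note argues for.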
2 changes: 2 additions & 0 deletions docs/set-up/config-reference.mdx
@@ -422,6 +422,8 @@ jobs:
  schedule_interval_seconds: 5
  # Register the subprocess/default execution profile. When unset, defaults to true for docker/none runtimes and false for kubernetes.
  enable_subprocess_executor:
  # Include raw job log lines in controller diagnostics snapshots. Disabled by default because job logs may contain secrets or PII. Enable only for local debugging or test environments. | default: False
  include_job_logs_in_diagnostics: false
```
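
As a rough illustration of what `include_job_logs_in_diagnostics` gates, the sketch below builds a diagnostics snapshot that omits raw log lines unless the flag is set. The function name `build_diagnostics_snapshot` and the snapshot fields are assumptions made for this example; they are not the actual `diagnostics.py` API.

```python
from typing import Any


def build_diagnostics_snapshot(
    job_id: str,
    step_states: dict[str, str],
    log_lines: list[str],
    include_job_logs: bool = False,
) -> dict[str, Any]:
    """Collect structured job state; attach raw logs only when explicitly enabled."""
    snapshot: dict[str, Any] = {
        "job_id": job_id,
        "steps": dict(step_states),
        # The line count is safe to record even when the lines themselves are withheld.
        "log_line_count": len(log_lines),
    }
    if include_job_logs:
        # Raw job output may contain secrets or PII, so this stays opt-in.
        snapshot["log_lines"] = list(log_lines)
    return snapshot
```

With the default of `False`, a snapshot still records which steps are in `ERROR` and how much output exists, without copying job output into controller logs.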

### `models`
68 changes: 68 additions & 0 deletions e2e/configs/local-subprocess.yaml
@@ -0,0 +1,68 @@
# Local E2E config for hosts without Docker.
#
# This keeps the explicit subprocess/default jobs profile required by
# translate_cpu_container_steps_to_subprocess(), while avoiding the default
# docker job backends that are derived from platform.runtime: "docker".

platform:
  runtime: "none"
  base_url: "http://0.0.0.0:8080"

service: {}

auth:
  enabled: false
  allow_unsigned_jwt: true
  policy_decision_point_provider: embedded
  policy_decision_point_base_url: "http://localhost:8080"
  policy_data_refresh_interval: 2
  bundle_cache_seconds: 15
  admin_email: "admin@example.com"

entities: {}

jobs:
  # Local E2E-only debugging aid. This may leak secrets or PII from job output,
  # so it must remain disabled in non-test configs.
  include_job_logs_in_diagnostics: true
  executors:
    - provider: subprocess
      profile: default
      backend: subprocess
      config:
        working_directory: .tmp/e2e/subprocess-jobs
        cleanup_completed_jobs_immediately: false
        ttl_seconds_before_active: 60
        ttl_seconds_active: 3600
        ttl_seconds_after_finished: 300
  executor_defaults:
    subprocess:
      working_directory: .tmp/e2e/subprocess-jobs
      cleanup_completed_jobs_immediately: false
      ttl_seconds_before_active: 60
      ttl_seconds_active: 3600
      ttl_seconds_after_finished: 300

evaluator:
  recreate_existing_system_entities: true

safe_synthesizer: {}

models:
  controller:
    interval_seconds: 5
    model_deployment_garbage_collection_ttl_seconds: 30

inference_gateway: {}

secrets:
  allow_key_creation: true

files:
  default_storage_config:
    type: local
    path: .tmp/e2e/files

studio:
  static_files_path: web/packages/studio/dist
  sandbox_enabled: true