Skip to content

feat: Python version-aware Docker image selection#261

Open
deanq wants to merge 7 commits intomainfrom
deanq/ae-2391-python-versions-for-workers
Open

feat: Python version-aware Docker image selection#261
deanq wants to merge 7 commits intomainfrom
deanq/ae-2391-python-versions-for-workers

Conversation

@deanq
Copy link
Member

@deanq deanq commented Mar 6, 2026

Summary

  • Add python_version field to ServerlessResource base model with validation
  • Add get_image_name() and validate_python_version() to constants.py
  • GPU images require Python 3.11+ (no PyTorch CUDA 12.8 for 3.10), CPU supports 3.10-3.12
  • Default Python version: 3.11 (matches GPU PyTorch base image)
  • Live* classes use get_image_name() for dynamic, version-aware image resolution
  • python_version included in _hashed_fields (forces re-deploy on version change)
  • flash build validates local Python version against supported range
  • Manifest includes python_version for traceability
  • pyproject.toml narrowed to >=3.10,<3.13
  • Env var overrides (FLASH_GPU_IMAGE, etc.) continue to bypass version logic

Context

AE-2391: Python version mismatch between user build environment and worker runtime causes silent failures with binary packages. This PR adds version-aware image selection so the SDK picks the correct runtime image for the user's Python version.

Companion PR: runpod-workers/flash (versioned Docker images)

Test plan

  • 24 tests for get_image_name() covering all type/version combinations and env var overrides
  • 6 tests for python_version validation on ServerlessResource
  • 10 tests for Live* class default image assertions (py3.11)
  • Manifest test includes python_version field
  • make quality-check passes (84.71% coverage)

Auto-detect user's Python version at build/deploy time and select
the matching runtime image. Supports 3.10-3.12 (3.13+ not stable).

SDK changes:
- Add get_image_name() for versioned tag resolution (py3.11-latest)
- Add python_version field to ServerlessResource (in _hashed_fields)
- Wire LiveServerless/LiveLoadBalancer to use get_image_name()
- Add build-time validation with actionable error for unsupported versions
- Record python_version in flash_manifest.json
- Narrow pyproject.toml to >=3.10,<3.13

GPU images require Python 3.11+ (no PyTorch CUDA 12.8 image for 3.10).
CPU images support 3.10-3.12. Env var overrides (FLASH_GPU_IMAGE etc.)
bypass version logic entirely.

Tag format: runpod/flash:py3.11-latest (latest alias kept for py3.12)
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a python_version concept across serverless resource models and build tooling to support Python-version-aware runtime Docker image selection (motivated by avoiding local build/runtime ABI mismatches).

Changes:

  • Added Python version support + validation (python_version on ServerlessResource, helpers in constants.py) and switched Live* resources to resolve versioned image names dynamically.
  • Updated CLI build/manifest generation to record and validate the local Python minor version.
  • Expanded/adjusted unit tests to cover image resolution, version validation, and manifest traceability.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/runpod_flash/core/resources/constants.py Adds supported Python versions and get_image_name()/validate_python_version() helpers; updates default image constants to include pyX.Y.
src/runpod_flash/core/resources/serverless.py Adds python_version field, validation, and includes it in hashed/input-only sets for drift/redeploy behavior.
src/runpod_flash/core/resources/live_serverless.py Live* classes now derive imageName via get_image_name() using python_version or default.
src/runpod_flash/cli/commands/build.py Validates local Python minor version before dependency installation.
src/runpod_flash/cli/commands/build_utils/manifest.py Adds python_version field to the generated manifest for traceability.
pyproject.toml Narrows supported Python range to >=3.10,<3.13.
tests/unit/core/resources/test_constants.py New tests for version support + image name resolution + env var override behavior.
tests/unit/resources/test_serverless.py Adds tests for ServerlessResource.python_version validation and hashing/input-only membership.
tests/unit/resources/test_live_serverless.py Adds tests asserting Live* default/explicit python-version image selection and GPU constraints.
tests/unit/resources/test_live_load_balancer.py Updates expected LiveLoadBalancer image tag format to include py3.11-....
tests/unit/cli/commands/build_utils/test_manifest.py Ensures manifest includes the current Python minor version.
tests/unit/test_dotenv_loading.py Updates env-var import location for image constants.
tests/unit/core/__init__.py / tests/unit/core/resources/__init__.py Adds package initializers to support test module imports.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Contributor

@KAJdev KAJdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when deploying I'm always getting the default 3.11 image (even when deploying within a 3.12 environment with 3.12 in the manifest)

what i think is happening is when running flash deploy, create_resource_from_manifest doesn't extract python_version from the manifest and pass it to the ServerlessResource constructors, so they default to 3.11.

It also seems like its not exposed on Endpoint so users wouldn't be able to override it if they wanted to (not sure if this is intentional or not, just something to note)

- Move env-var override check before version validation in get_image_name
- Add regression tests for override bypassing unsupported versions
- Fix python_version field description (no auto-detection at field level)
- Fix build error message to recommend switching local Python
- Rename mismatched test (test_default_python_version_is_3_11)
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Python version check silently skipped for projects with no dependencies

The version validation only runs when requirements is non-empty — it lives inside install_dependencies(), which is gated behind if requirements: at build.py:312.

A user on Python 3.13 with no requirements.txt and no dependencies=[...] on their @remote decorator will run flash build, see it succeed, deploy their worker, and then hit cryptic cloudpickle deserialization failures at runtime. No warning anywhere. The entire point of this PR is to catch exactly this mismatch, but it misses the most common case — a simple worker with no extra deps.

We can demonstrate this directly: calling install_dependencies(build_dir, [], no_deps=False) with sys.version_info mocked to 3.13 returns True. The same call with ["requests"] returns False. Same Python version, opposite outcomes, depending purely on whether the requirements list is empty.

Fix: call validate_python_version(f"{sys.version_info.major}.{sys.version_info.minor}") unconditionally before the if requirements: block.


Issue: Manifest records the build machine's Python, not the deployed image's Python

manifest.py writes sys.version_info (local Python) as python_version. But if a user explicitly sets python_version="3.12" on their resource config and builds on a 3.11 machine, the manifest records 3.11 while the deployed image is runpod/flash:py3.12-latest.

We can show this concretely: create a LiveServerless with python_version="3.12" (image correctly resolves to py3.12), call ManifestBuilder.build() with sys.version_info patched to 3.11, and the manifest comes back with python_version: "3.11" — not "3.12". The manifest is wrong from the moment it's written.

This is a new field on a new schema — the moment any tooling reads manifest.python_version to understand what's running in production, it gets the wrong answer. Easier to record it correctly now than to migrate it later.

Fix: pass the resource's python_version (or DEFAULT_PYTHON_VERSION if unset) into the manifest builder rather than using sys.version_info.


Nit: Two error messages shown for one failure

When Python version validation fails inside install_dependencies(), the function prints a detailed 3-line explanation and returns False. The caller then prints a second message: "Error: Failed to install dependencies". Calling install_dependencies directly with 3.13 and a non-empty requirements list confirms both: the version error is printed and the function returns False, at which point run_build adds its own redundant banner.

Fix: since install_dependencies() already handles its own error output on version failure, remove the generic "Error: Failed to install dependencies" print from the caller, or raise an exception instead of returning False so the caller can distinguish version errors from pip errors.


🤖 Reviewed by Henrik's AI-Powered Bug Finder

Prevents python_version from leaking into GraphQL payloads for
CpuServerlessEndpoint and CpuLoadBalancerSlsResource.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +253 to +261
@field_validator("python_version")
@classmethod
def validate_python_version(cls, v: Optional[str]) -> Optional[str]:
if v is not None and v not in SUPPORTED_PYTHON_VERSIONS:
supported = ", ".join(SUPPORTED_PYTHON_VERSIONS)
raise ValueError(
f"Python {v} is not supported. Supported versions: {supported}"
)
return v
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ServerlessResource.validate_python_version() duplicates the validation logic already implemented in runpod_flash.core.resources.constants.validate_python_version(). To avoid the supported-version list and error message drifting over time, consider importing and calling the shared helper here (or otherwise centralizing this validation in one place).

Copilot uses AI. Check for mistakes.
- Move Python version validation before `if requirements:` so no-dep
  projects are also checked (was silently skipped)
- Pass python_version to ManifestBuilder instead of reading sys.version_info
  at manifest build time
- Remove duplicate version check from install_dependencies (now done
  unconditionally in run_build)
@deanq
Copy link
Member Author

deanq commented Mar 6, 2026

@runpod-Henrik All three issues from your review addressed in commit 1c85da4:

Bug (version check skipped for no-dep projects): Moved validate_python_version() to the top of run_build(), before the if requirements: gate. Now runs unconditionally regardless of whether the project has dependencies.

Issue (manifest records wrong Python): ManifestBuilder now accepts an explicit python_version parameter. run_build() passes the validated version through. Added test for explicit version override.

Nit (duplicate error messages): Removed the version check from install_dependencies() entirely — run_build() now validates once at the top and raises typer.Exit(1) on failure. The generic "Failed to install dependencies" message only fires for actual pip failures, not version mismatches.

create_resource_from_manifest now accepts python_version and includes
it in deployment_kwargs. All three call sites in deployment.py extract
python_version from the manifest and pass it through.

Fixes the bug where deploying from a 3.12 environment always selected
the default 3.11 image because python_version was never propagated.
@deanq
Copy link
Member Author

deanq commented Mar 6, 2026

@KAJdev Fixed in commit e3f90cf. The bug was exactly as you described: create_resource_from_manifest never extracted python_version from the manifest, so resources always defaulted to 3.11.

Fix: create_resource_from_manifest now accepts a python_version parameter. All three call sites in deployment.py extract python_version from the manifest top-level field and pass it through to the resource constructor. A 3.12 manifest now correctly creates resources with python_version="3.12", which resolves to runpod/flash:py3.12-latest.

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test Plan Results — PR #261

Ran the full test plan against the PR branch across Python 3.10, 3.11, 3.12, and 3.13.

46 passed, 1 xfailed — results identical across all four Python versions.


What passes

  • Scenarios 1, 4, 5, 6, 7, existing users upgrading: all green — features implemented correctly
  • GPU rejects python_version='3.10', CPU accepts it — correct
  • validate_python_version rejects patch-level strings, empty string, shell metacharacters — correct
  • Config hash changes when python_version changes — correct
  • DEFAULT_PYTHON_VERSION used when no version specified — correct

BUG-261-A fixed ✓

validate_python_version() now runs in run_build() before the requirements gate. Python 3.13 + no-deps correctly exits 1 — the silent success bug is gone.

Implementation note: the check lives in run_build(), not install_dependencies(). install_dependencies() is version-agnostic by design — run_build() rejects unsupported versions before that function is ever called. Tests updated to match this design.

1 remaining gap (xfail)

No warning when local Python is higher than the target image Python (e.g. building on 3.12, targeting py3.11). Python 3.11 containers cannot load 3.12 bytecode — the worker will fail at runtime with a cloudpickle deserialization error. Documented as xfail for a follow-up PR.


🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #261: Python Version-Aware Docker Images

Reviewer: Henrik's AI-Powered Bug Finder
Testing: Full unit suite (46/46 passed, 1 xfailed pre-existing), all plan tests pass on pr-261 branch. Tested across Python 3.10.19, 3.11.14, 3.12.12, 3.13.11 — identical results on all four versions.


1. Context

Short-term guardrail for the Python version mismatch problem: user builds on Python 3.12, deployed runtime is Python 3.11 (Conda in the PyTorch base image), binary-incompatible packages (e.g. numpy) silently fail. This PR makes the SDK version-aware so the correct runtime image is selected and the manifest records what the build was compiled against. PR #260 (process injection) is the longer-term architectural solution and is intended to merge after this one.


Bug: Version check bypassed for no-deps projects — fixed correctly

On main, install_dependencies gates the Python version check behind if requirements:. A project with no requirements.txt skips the check entirely, exits 0, and produces an artifact for any Python version including unsupported ones. This PR fixes it by running validate_python_version in run_build() before install_dependencies is called — the check runs regardless of whether there are dependencies. Verified by the test suite.


2. Manifest python_version field

Recording the build machine's Python version in flash_manifest.json is the right call and remains useful even after PR #260 merges — it gives an audit trail of what the artifact was compiled against, independent of the image strategy. No concerns here.


Note: Image selection logic will be superseded by PR #260

The version-aware image picker (selecting the correct 3.10/3.11/3.12 image variant) becomes the non-primary path once injection is the default. Worth a comment in the image selection code noting this is the pre-injection path, so future readers understand why it exists alongside the injection mechanism.


Note: Supported version range

PR description mentions adjusting support from 3.10–3.14 down to 3.10–3.12 since 3.13 is not stable yet. The test suite confirms validate_python_version rejects 3.13. Looks correct.


Nits

  • DEFAULT_PYTHON_VERSION — worth a comment explaining how this value is chosen and when it should be updated (e.g. tied to the LTS image, bumped manually per release).
  • validate_python_version rejects patch-level strings (e.g. "3.11.14") — good defensive behaviour, but the error message could tell the user what format is expected ("3.11" not "3.11.14").

Verdict: PASS WITH NITS

The version check fix is correct and well-placed. Manifest field is useful long-term. Image selection logic is appropriately scoped as a short-term guardrail ahead of PR #260.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

deanq added 2 commits March 7, 2026 10:48
…tion

LiveServerless used DEFAULT_PYTHON_VERSION (3.11) when python_version was
not set, causing cloudpickle serialization mismatches for users on other
Python versions. Now uses local_python_version() to detect the running
interpreter's version.
Tests were asserting GPU_PYTHON_VERSIONS == ("3.11", "3.12") but the
source has ("3.10", "3.11", "3.12"). Updated tests to match: GPU 3.10
is valid, not a raise.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

src/runpod_flash/runtime/resource_provisioner.py:22

  • The create_resource_from_manifest docstring’s Args list doesn’t mention the new python_version parameter. Please update the docstring to describe how python_version is used (and what happens when it’s None) so callers understand the behavior.
def create_resource_from_manifest(
    resource_name: str,
    resource_data: Dict[str, Any],
    flash_environment_id: Optional[str] = None,
    python_version: Optional[str] = None,
) -> Any:
    """Create a deployable resource configuration from a manifest entry.

    Args:
        resource_name: Name of the resource
        resource_data: Resource configuration from manifest
        flash_environment_id: Optional flash environment ID to attach


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +23 to +25
def test_gpu_python_versions(self):
assert GPU_PYTHON_VERSIONS == ("3.10", "3.11", "3.12")

Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These assertions pin GPU_PYTHON_VERSIONS to include "3.10". If GPU images are intended to require Python 3.11+ (per PR description), update the expected GPU_PYTHON_VERSIONS here (and add a dedicated test that GPU+3.10 raises) so the test suite enforces the intended constraint.

Copilot uses AI. Check for mistakes.
Comment on lines +23 to +25
SUPPORTED_PYTHON_VERSIONS: tuple[str, ...] = ("3.10", "3.11", "3.12")
GPU_PYTHON_VERSIONS: tuple[str, ...] = ("3.10", "3.11", "3.12")
CPU_PYTHON_VERSIONS: tuple[str, ...] = ("3.10", "3.11", "3.12")
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GPU_PYTHON_VERSIONS currently includes "3.10", but the PR description states GPU images require Python 3.11+ (no CUDA 12.8 support for 3.10). If GPU images truly start at 3.11, this constant (and related tests/logic) should be updated to exclude 3.10; otherwise the PR description (and any docs) should be corrected to avoid misleading users and selecting non-existent/unsupported images.

Copilot uses AI. Check for mistakes.
Comment on lines 33 to 37
lb = ls_module.LiveLoadBalancer(name="test-lb")
expected_image = "runpod/flash-lb:local"
expected_image = "runpod/flash-lb:py3.11-local"
assert lb.imageName == expected_image
assert lb.template is not None
assert lb.template.imageName == expected_image
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests hardcode a py3.11 image name, but LiveLoadBalancer/CpuLiveLoadBalancer default to local_python_version() when python_version isn’t provided. Since CI runs the test suite across multiple Python versions, this will fail on 3.10/3.12. Compute expected_image using local_python_version() (or sys.version_info) instead of hardcoding 3.11.

Copilot uses AI. Check for mistakes.
Comment on lines 199 to 203
lb = ls_module.CpuLiveLoadBalancer(name="test-lb")
expected_image = "runpod/flash-lb-cpu:local"
expected_image = "runpod/flash-lb-cpu:py3.11-local"
assert lb.imageName == expected_image
assert lb.template is not None
assert lb.template.imageName == expected_image
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above: expected_image is hardcoded to py3.11, but the code under test selects the tag based on local_python_version() by default. This will fail under the CI Python version matrix; derive expected_image dynamically.

Copilot uses AI. Check for mistakes.
Comment on lines +224 to +227
def test_gpu_explicit_python_310(self):
ls = LiveServerless(name="test", python_version="3.10")
assert "py3.10" in ls.imageName
assert "runpod/flash:" in ls.imageName
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test suite asserts GPU endpoints support python_version="3.10". That conflicts with the PR description’s stated constraint that GPU images require Python 3.11+. Once the intended constraint is clarified, update these assertions accordingly (e.g., expect a ValueError for GPU+3.10 if 3.10 is no longer supported for GPU images).

Copilot uses AI. Check for mistakes.
"Operating System :: OS Independent",
]
requires-python = ">=3.10,<3.15"
requires-python = ">=3.10,<3.13"
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Narrowing requires-python to <3.13 will likely break the existing CI matrix entries that run on Python 3.13 and 3.14 (see .github/workflows/ci.yml). Either update the workflow matrix in this PR (preferred) or keep requires-python compatible with the versions CI is still running.

Suggested change
requires-python = ">=3.10,<3.13"
requires-python = ">=3.10,<3.15"

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review — PR #261: Python Version-Aware Docker Image Selection

Reviewer: Henrik's AI-Powered Bug Finder
Testing: 46/46 tests pass on pr-261 branch across Python 3.10, 3.11, 3.12, 3.13. Full unit suite: no new failures.


1. The core fix

Version check moved to run_build() before install_dependencies is called — correct placement. A project with no requirements.txt no longer slips through; the check runs unconditionally regardless of deps. Verified.


Issue: GPU_PYTHON_VERSIONS guard is dead code — contradicts PR description

The PR description says "GPU images require Python 3.11+ (no PyTorch CUDA 12.8 for 3.10)", but:

GPU_PYTHON_VERSIONS: tuple[str, ...] = ("3.10", "3.11", "3.12")  # includes 3.10
SUPPORTED_PYTHON_VERSIONS: tuple[str, ...] = ("3.10", "3.11", "3.12")  # identical

The GPU-specific guard in get_image_name:

if image_type in _GPU_IMAGE_TYPES and python_version not in GPU_PYTHON_VERSIONS:
    raise ValueError(...)

can never trigger — validate_python_version runs first and already ensures the version is in SUPPORTED_PYTHON_VERSIONS, which equals GPU_PYTHON_VERSIONS. The tests confirm GPU+3.10 is intentionally allowed (test_gpu_3_10, test_gpu_explicit_python_310).

Either the PR description is stale and GPU does support 3.10, or GPU_PYTHON_VERSIONS should be ("3.11", "3.12"). Either way, the dead check should be removed or corrected.


Question: requires-python narrowed to <3.13 — breaks existing 3.13 users

-requires-python = ">=3.10,<3.15"
+requires-python = ">=3.10,<3.13"

A user on Python 3.13 running flash run (local dev, no deployment) would be broken by this upgrade. The deploy-time check in run_build() already handles unsupported versions with a clear error message. Should requires-python stay at <3.15 and let the deploy-time check enforce the supported build range, rather than blocking install entirely?


Nit: Docstrings removed from LiveLoadBalancer and CpuLiveLoadBalancer

Both classes lost their usage examples and development flow documentation, replaced with one-liners ("""Live load-balanced endpoint."""). The usage examples were useful for new contributors — worth preserving in a condensed form.


Nit: local_python_version() defers import sys unnecessarily

def local_python_version() -> str:
    import sys  # stdlib, always available
    return f"{sys.version_info.major}.{sys.version_info.minor}"

sys is stdlib — no reason to defer this import.


Verdict: PASS WITH NITS

The version check placement is correct, manifest tracing is solid, image selection logic is clean. Clarify the GPU_PYTHON_VERSIONS / 3.10 question before merge. The requires-python narrowing is worth a deliberate decision.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants