feat: process injection via LiveServerlessMixin by deanq · Pull Request #260 · runpod/flash

deanq · 2026-03-06T20:21:29Z

Summary

- Add injection.py with build_injection_cmd() for dockerArgs generation
- Add base image constants (FLASH_GPU_BASE_IMAGE, FLASH_CPU_BASE_IMAGE)
- Update LiveServerlessMixin to configure dockerArgs on templates for tarball injection
- Add _default_base_image and _legacy_image properties to all Live* resource classes
- Unify CPU base image to python:3.11-slim (matching GPU PyTorch runtime)
- Make worker tarball URL configurable via FLASH_WORKER_TARBALL_URL env var

Test plan

- make quality-check passes
- LiveServerless template includes dockerArgs with bootstrap command
- FLASH_WORKER_TARBALL_URL=<url> flash deploy provisions endpoint with injection
- Preview mode still works via legacy Docker images

Depends on: runpod-workers/flash#75 (tarball build pipeline)

Replace pre-built Docker images with runtime tarball injection. The LiveServerlessMixin now generates dockerArgs that download, extract, and bootstrap the flash-worker tarball at container start time.

- Add injection.py with build_injection_cmd() for dockerArgs generation
- Add base image constants (FLASH_GPU_BASE_IMAGE, FLASH_CPU_BASE_IMAGE)
- Update LiveServerlessMixin to configure dockerArgs on templates
- Add _default_base_image and _legacy_image properties to all Live* classes
- Update tests for injection-based template configuration
- Revert InjectableWorkerMixin rename back to LiveServerlessMixin

Match test expectation with unified Python 3.11 base image change in FLASH_CPU_BASE_IMAGE constant.

Copilot

Pull request overview

Adds “process injection” support for Live* serverless resources by switching the defaults to base images and generating a template.dockerArgs command that downloads, extracts, and execs a flash-worker tarball at container start.

Changes:

Introduces build_injection_cmd() and new constants for base images + worker tarball configuration.
Updates LiveServerlessMixin / Live* resources to set default base images and populate template.dockerArgs for injection.
Updates unit/integration tests to validate new base-image defaults and presence of injection dockerArgs.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| `src/runpod_flash/core/resources/injection.py` | New helper to generate the injection `dockerArgs` shell command. |
| `src/runpod_flash/core/resources/constants.py` | Adds base image constants and worker tarball URL/version configuration. |
| `src/runpod_flash/core/resources/live_serverless.py` | Live* resources now default to base images and set `template.dockerArgs` for injection. |
| `tests/unit/resources/test_injection.py` | New unit tests validating injection command format. |
| `tests/unit/resources/test_live_serverless.py` | Updates LiveServerless/CpuLiveServerless expectations + dockerArgs assertions. |
| `tests/unit/resources/test_live_load_balancer.py` | Updates LiveLoadBalancer expectations + dockerArgs assertions. |
| `tests/integration/test_lb_remote_execution.py` | Updates integration expectations for new image defaults and BYOI behavior. |
| `tests/integration/test_cpu_disk_sizing.py` | Updates integration expectations for base-image defaults and BYOI behavior. |

Copilot · 2026-03-06T21:39:19Z

src/runpod_flash/core/resources/injection.py

+        "| tar xz -C $FW_DIR --strip-components=1; "
+        # Cache to network volume if available
+        "if [ -d /runpod-volume ]; then "
+        'mkdir -p "$NV_CACHE" && cp -r "$FW_DIR"/* "$NV_CACHE/" 2>/dev/null || true; fi; '


The cache validity check requires $NV_CACHE/.version, but the cache write uses cp -r "$FW_DIR"/* "$NV_CACHE/", which will not copy dotfiles like .version. As a result, $NV_CACHE/.version will never exist and the network-volume cache branch will never be taken on subsequent starts. Ensure .version is created/updated and copied into $NV_CACHE (or copy $FW_DIR/. so dotfiles are included).

Suggested change

-        'mkdir -p "$NV_CACHE" && cp -r "$FW_DIR"/* "$NV_CACHE/" 2>/dev/null || true; fi; '
+        'mkdir -p "$NV_CACHE" && cp -r "$FW_DIR"/. "$NV_CACHE"/ 2>/dev/null || true; fi; '
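
The dotfile behaviour behind this comment can be reproduced in isolation. The sketch below is not part of the PR: it builds a throwaway directory holding bootstrap.sh and .version, runs both copy forms through sh, and lists what arrives in the destination. The helper name `copy_and_list` and the directory names are made up for illustration, and a POSIX sh and cp are assumed to be on PATH.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def copy_and_list(copy_cmd: str) -> list:
    """Run copy_cmd with SRC and DST exported, then return the sorted names in DST."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "fw_dir"
        dst = Path(tmp) / "nv_cache"
        src.mkdir()
        dst.mkdir()
        (src / "bootstrap.sh").write_text("#!/bin/sh\n")
        (src / ".version").write_text("1.1.1\n")
        env = {**os.environ, "SRC": str(src), "DST": str(dst)}
        subprocess.run(["sh", "-c", copy_cmd], env=env, check=True)
        return sorted(p.name for p in dst.iterdir())


# Form used in the PR: the shell expands * and skips names starting with a dot.
glob_copy = copy_and_list('cp -r "$SRC"/* "$DST"/')
# Suggested form: "/." makes cp copy the directory contents, dotfiles included.
dot_copy = copy_and_list('cp -r "$SRC"/. "$DST"/')

print("glob copy:", glob_copy)
print("dot copy: ", dot_copy)
```

Only the second form carries .version across, which is the file the cache gate tests for.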

Copilot · 2026-03-06T21:39:20Z

src/runpod_flash/core/resources/injection.py

+    return (
+        "bash -c '"
+        f"set -e; FW_DIR=/opt/flash-worker; FW_VER={worker_version}; "
+        # Network volume cache check
+        'NV_CACHE="/runpod-volume/.flash-worker/v$FW_VER"; '
+        'if [ -d "$NV_CACHE" ] && [ -f "$NV_CACHE/.version" ]; then '
+        'cp -r "$NV_CACHE" "$FW_DIR"; '
+        # Local cache check (container disk persistence between restarts)
+        'elif [ -f "$FW_DIR/.version" ] && [ "$(cat $FW_DIR/.version)" = "$FW_VER" ]; then '
+        "true; "
+        "else "
+        "mkdir -p $FW_DIR; "
+        f'DL_URL="{tarball_url}"; '
+        '(command -v curl >/dev/null 2>&1 && curl -sSL "$DL_URL" || wget -qO- "$DL_URL") '


worker_version and tarball_url are interpolated directly into a single-quoted bash -c string and then assigned with double quotes (e.g., FW_VER=..., DL_URL="..."). If either contains shell metacharacters (including $(...)), bash will perform expansion and can break the command or execute unintended code. Consider strict validation/allowlisting for URL schemes and version format and/or shell-escaping before interpolation.
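
One way to act on this comment is to reject unsafe values before they are interpolated. The following is a minimal sketch, not the PR's code: `validate_worker_version` and `validate_tarball_url` are hypothetical helpers, and the allowlists are assumptions that a real project would tune to its own version and URL formats.

```python
import re
from urllib.parse import urlsplit

# Assumed formats: versions like "1.1.1" or "1.2.0-rc1"; URLs restricted to
# characters that stay inert inside the double-quoted DL_URL assignment and
# cannot close the single-quoted bash -c string (no ', ", $, `, \ or spaces).
_VERSION_RE = re.compile(r"[0-9A-Za-z][0-9A-Za-z._-]{0,63}")
_URL_SAFE_RE = re.compile(r"[A-Za-z0-9._~:/?=&%+-]+")


def validate_worker_version(version: str) -> str:
    """Return version unchanged, or raise ValueError if it is not allowlisted."""
    if not _VERSION_RE.fullmatch(version):
        raise ValueError(f"unsafe worker version: {version!r}")
    return version


def validate_tarball_url(url: str) -> str:
    """Return url unchanged, or raise ValueError for bad schemes or characters."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.netloc:
        raise ValueError(f"unsupported tarball URL: {url!r}")
    if not _URL_SAFE_RE.fullmatch(url):
        raise ValueError(f"tarball URL contains shell metacharacters: {url!r}")
    return url


print(validate_worker_version("1.1.1"))
print(validate_tarball_url("https://example.com/flash-worker-1.1.1.tar.gz"))
```

Allowlisting is used here instead of shlex.quote because the values sit inside an outer single-quoted string, where quoting added by shlex.quote would itself terminate the string.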

Copilot · 2026-03-06T21:39:20Z

tests/integration/test_cpu_disk_sizing.py

+        # 1. Uses CPU base image (default)
        # 2. CPU utilities calculate minimum disk size
        # 3. Template creation with auto-sizing
        # 4. Validation passes
-        assert "flash-cpu:" in live_serverless.imageName
+        assert live_serverless.imageName == "python:3.11-slim"


This test hard-codes the CPU base image as python:3.11-slim, but the CPU base image is configurable via FLASH_CPU_BASE_IMAGE. If that env var is set in the test environment, this will fail despite correct behavior. Prefer asserting equality to FLASH_CPU_BASE_IMAGE from constants (or otherwise making the test independent of the default image string).
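
The suggestion can be illustrated with a self-contained sketch. It assumes the constant is resolved from the environment with a fallback, which may not match the real definition in constants.py, and `FakeCpuLiveServerless` is a hypothetical stand-in that models only imageName.

```python
import os
from typing import Optional

# Assumption: env-var lookup with a default; the real constants.py may differ.
FLASH_CPU_BASE_IMAGE = os.environ.get("FLASH_CPU_BASE_IMAGE", "python:3.11-slim")


class FakeCpuLiveServerless:
    """Hypothetical stand-in for CpuLiveServerless; only models imageName."""

    def __init__(self, image_name: Optional[str] = None):
        self.imageName = image_name or FLASH_CPU_BASE_IMAGE


def test_cpu_default_image_matches_constant():
    resource = FakeCpuLiveServerless()
    # Compare with the constant, not a literal, so the test still passes
    # when FLASH_CPU_BASE_IMAGE is overridden in the environment.
    assert resource.imageName == FLASH_CPU_BASE_IMAGE


test_cpu_default_image_matches_constant()
```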

Copilot · 2026-03-06T21:39:20Z

tests/integration/test_cpu_disk_sizing.py

+        # Verify different base images are used
        assert gpu_live.imageName != cpu_live.imageName
-        assert "flash:" in gpu_live.imageName
-        assert "flash-cpu:" in cpu_live.imageName
+        assert "pytorch" in gpu_live.imageName
+        assert "python" in cpu_live.imageName


These substring assertions ('pytorch' / 'python') assume particular defaults for the base image names, but both base images are configurable via environment variables. This can make the test fail in valid configurations. Prefer comparing to the exported constants (or asserting on functional behavior rather than image string contents).

Copilot · 2026-03-06T21:39:20Z

tests/unit/resources/test_injection.py

+    def test_default_remote_url(self):
+        """Test default remote URL generation."""
+        cmd = build_injection_cmd(worker_version="1.1.1")
+
+        assert cmd.startswith("bash -c '")
+        assert "FW_VER=1.1.1" in cmd
+        assert "flash-worker/releases/download/v1.1.1/" in cmd
+        assert "bootstrap.sh'" in cmd


This test assumes the default tarball URL template is the GitHub releases URL (asserting flash-worker/releases/download/...). Since FLASH_WORKER_TARBALL_URL is configurable via environment variable, the default template may differ in some test environments, causing a false failure. Consider asserting against FLASH_WORKER_TARBALL_URL_TEMPLATE.format(version=...) or passing an explicit tarball_url in the test.
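
Both options from this comment are sketched below against a heavily simplified, hypothetical stand-in, `fake_build_injection_cmd`. The template value is a placeholder rather than the project's default, and the assumption that the real function accepts a tarball_url argument comes only from the wording of the comment.

```python
from typing import Optional

# Placeholder template for illustration; not the project's real default.
FLASH_WORKER_TARBALL_URL_TEMPLATE = (
    "https://example.com/flash-worker/v{version}/worker.tar.gz"
)


def fake_build_injection_cmd(
    worker_version: str, tarball_url: Optional[str] = None
) -> str:
    """Hypothetical stand-in that only models the version and URL parts."""
    url = tarball_url or FLASH_WORKER_TARBALL_URL_TEMPLATE.format(
        version=worker_version
    )
    return f"bash -c 'FW_VER={worker_version}; DL_URL=\"{url}\"; echo \"$DL_URL\"'"


def test_default_url_follows_template():
    cmd = fake_build_injection_cmd(worker_version="1.1.1")
    # Derive the expectation from the template instead of a literal host path.
    expected = FLASH_WORKER_TARBALL_URL_TEMPLATE.format(version="1.1.1")
    assert expected in cmd


def test_explicit_url_is_used():
    # Passing the URL explicitly removes any dependence on the environment.
    cmd = fake_build_injection_cmd(
        "1.1.1", tarball_url="https://mirror.example/w.tar.gz"
    )
    assert 'DL_URL="https://mirror.example/w.tar.gz"' in cmd


test_default_url_follows_template()
test_explicit_url_is_used()
```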

Copilot · 2026-03-06T21:39:20Z

src/runpod_flash/core/resources/injection.py

+        # Network volume cache check
+        'NV_CACHE="/runpod-volume/.flash-worker/v$FW_VER"; '
+        'if [ -d "$NV_CACHE" ] && [ -f "$NV_CACHE/.version" ]; then '
+        'cp -r "$NV_CACHE" "$FW_DIR"; '


The network-volume cache restore path looks incorrect: cp -r "$NV_CACHE" "$FW_DIR" will copy the directory into $FW_DIR (e.g., $FW_DIR/v1.1.1/...) rather than populating $FW_DIR itself, so exec $FW_DIR/bootstrap.sh may fail even when the cache is present. Consider copying the contents into $FW_DIR (preserving permissions) instead of copying the directory as a nested subdir.

Suggested change

-        'cp -r "$NV_CACHE" "$FW_DIR"; '
+        'mkdir -p "$FW_DIR" && cp -a "$NV_CACHE"/. "$FW_DIR"/; '
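
The nesting described here depends on whether the destination already exists, which a short experiment makes visible. This sketch is illustrative only: `restore` and the directory names are invented, and it assumes sh and a cp that supports -a are on PATH.

```python
import os
import subprocess
import tempfile
from pathlib import Path

ORIGINAL = 'cp -r "$NV_CACHE" "$FW_DIR"'
SUGGESTED = 'mkdir -p "$FW_DIR" && cp -a "$NV_CACHE"/. "$FW_DIR"/'


def restore(copy_cmd: str, fw_dir_exists: bool) -> bool:
    """Return True if bootstrap.sh lands directly inside FW_DIR after copy_cmd."""
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "v1.1.1"
        fw_dir = Path(tmp) / "flash-worker"
        cache.mkdir()
        (cache / "bootstrap.sh").write_text("#!/bin/sh\n")
        if fw_dir_exists:
            fw_dir.mkdir()
        env = {**os.environ, "NV_CACHE": str(cache), "FW_DIR": str(fw_dir)}
        subprocess.run(["sh", "-c", copy_cmd], env=env, check=True)
        return (fw_dir / "bootstrap.sh").is_file()


for label, cmd in (("original", ORIGINAL), ("suggested", SUGGESTED)):
    for exists in (False, True):
        print(label, "FW_DIR exists:", exists, "->", restore(cmd, exists))
```

The original command works only when $FW_DIR is absent; once the directory exists, the cache is copied in as a subdirectory and bootstrap.sh is no longer at the expected path.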

Copilot · 2026-03-06T21:39:21Z

src/runpod_flash/core/resources/live_serverless.py

    @property
-    def _live_image(self) -> str:
-        """Override in subclasses to specify the locked image."""
-        raise NotImplementedError("Subclasses must define _live_image")
+    def _default_base_image(self) -> str:
+        raise NotImplementedError("Subclasses must define _default_base_image")

    @property
-    def imageName(self):
-        # Lock imageName to specific image
-        return self._live_image
+    def _legacy_image(self) -> str:
+        """Legacy Docker Hub image for preview mode."""
+        raise NotImplementedError("Subclasses must define _legacy_image")
+
+    def _create_new_template(self) -> PodTemplate:
+        """Create template with dockerArgs for process injection."""
+        template = super()._create_new_template()  # type: ignore[misc]
+        template.dockerArgs = build_injection_cmd()
+        return template

-    @imageName.setter
-    def imageName(self, value):
-        # Prevent manual setting of imageName
-        pass
+    def _configure_existing_template(self) -> None:
+        """Configure existing template, adding dockerArgs for injection if not user-set."""
+        super()._configure_existing_template()  # type: ignore[misc]
+        if self.template is not None and not self.template.dockerArgs:  # type: ignore[attr-defined]
+            self.template.dockerArgs = build_injection_cmd()  # type: ignore[attr-defined]


 class LiveServerless(LiveServerlessMixin, ServerlessEndpoint):
    """GPU-only live serverless endpoint."""

    @property
-    def _live_image(self) -> str:
+    def _default_base_image(self) -> str:
+        return FLASH_GPU_BASE_IMAGE
+
+    @property
+    def _legacy_image(self) -> str:
        return FLASH_GPU_IMAGE

    @model_validator(mode="before")
    @classmethod
    def set_live_serverless_template(cls, data: dict):
-        """Set default GPU image for Live Serverless."""
-        data["imageName"] = FLASH_GPU_IMAGE
+        """Set default GPU base image for Live Serverless."""
+        if not data.get("imageName"):
+            data["imageName"] = FLASH_GPU_BASE_IMAGE
        return data


 class CpuLiveServerless(LiveServerlessMixin, CpuServerlessEndpoint):
    """CPU-only live serverless endpoint with automatic disk sizing."""

    @property
-    def _live_image(self) -> str:
+    def _default_base_image(self) -> str:
+        return FLASH_CPU_BASE_IMAGE
+
+    @property
+    def _legacy_image(self) -> str:
        return FLASH_CPU_IMAGE


_default_base_image / _legacy_image are introduced on the mixin and subclasses, but they are not referenced anywhere (defaults are set via the model_validator blocks instead). This leaves dead API surface and duplicated sources of truth for the defaults. Either wire the validators/template logic to use these properties, or remove them until there’s a concrete call site (e.g., preview mode selection) to avoid drift.

Copilot · 2026-03-06T21:39:21Z

src/runpod_flash/core/resources/live_serverless.py

+    """Configures process injection via dockerArgs for any base image.
+
+    Sets a default base image (user can override via imageName) and generates
+    dockerArgs to download, extract, and run the flash-worker tarball at container
+    start time. QB vs LB mode is determined by FLASH_ENDPOINT_TYPE env var at
+    runtime, not by the Docker image.
+    """

    @property
-    def _live_image(self) -> str:
-        """Override in subclasses to specify the locked image."""
-        raise NotImplementedError("Subclasses must define _live_image")
+    def _default_base_image(self) -> str:
+        raise NotImplementedError("Subclasses must define _default_base_image")

    @property
-    def imageName(self):
-        # Lock imageName to specific image
-        return self._live_image
+    def _legacy_image(self) -> str:
+        """Legacy Docker Hub image for preview mode."""
+        raise NotImplementedError("Subclasses must define _legacy_image")
+
+    def _create_new_template(self) -> PodTemplate:
+        """Create template with dockerArgs for process injection."""
+        template = super()._create_new_template()  # type: ignore[misc]
+        template.dockerArgs = build_injection_cmd()
+        return template

-    @imageName.setter
-    def imageName(self, value):
-        # Prevent manual setting of imageName
-        pass
+    def _configure_existing_template(self) -> None:
+        """Configure existing template, adding dockerArgs for injection if not user-set."""
+        super()._configure_existing_template()  # type: ignore[misc]
+        if self.template is not None and not self.template.dockerArgs:  # type: ignore[attr-defined]
+            self.template.dockerArgs = build_injection_cmd()  # type: ignore[attr-defined]


These Live* resources now rely on template.dockerArgs to perform injection at container start, but local preview (flash deploy --preview) starts containers via docker run <image> and does not apply template dockerArgs. With the new default base images, preview containers may not start the flash worker at all unless the preview path explicitly uses the legacy images or executes the injection command. Please ensure preview mode uses _legacy_image (or otherwise applies the injection command) before switching Live* defaults to base images.
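
A possible shape for the preview-mode selection is sketched below. Everything in it is hypothetical: `LiveResourceSketch` stands in for a Live* resource, `image_for_run` is an invented helper, and the image names are placeholders. It only illustrates giving _legacy_image a concrete call site.

```python
from typing import Optional


class LiveResourceSketch:
    """Hypothetical stand-in for a Live* resource; attribute names mirror the PR."""

    def __init__(
        self,
        default_base_image: str,
        legacy_image: str,
        image_name: Optional[str] = None,
    ):
        self._default_base_image = default_base_image
        self._legacy_image = legacy_image
        self.imageName = image_name or default_base_image


def image_for_run(resource: LiveResourceSketch, preview: bool) -> str:
    """Pick the image to start, given whether dockerArgs will be applied."""
    # Preview starts the container with a plain `docker run <image>`, so the
    # injection command never runs; use the image that bundles the worker.
    if preview and resource.imageName == resource._default_base_image:
        return resource._legacy_image
    # Deployed endpoints, and user-supplied images, keep their imageName.
    return resource.imageName


gpu = LiveResourceSketch("example/base:gpu", "example/flash:legacy")
print(image_for_run(gpu, preview=False))
print(image_for_run(gpu, preview=True))
```

A user-supplied image is left untouched in preview here; whether that is the right policy is a design decision for the PR author.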

Copilot · 2026-03-06T21:39:21Z

tests/unit/resources/test_live_serverless.py

+        assert "pytorch" in live_serverless.imageName
+


This assertion hard-codes that the default GPU base image contains the substring pytorch, but FLASH_GPU_BASE_IMAGE is configurable via environment variable. If a different base image is configured, this test will fail even though behavior is correct. Prefer asserting equality to FLASH_GPU_BASE_IMAGE (already imported) or validating behavior independent of the image repo name.

Suggested change

assert "pytorch" in live_serverless.imageName

Copilot · 2026-03-06T21:39:21Z

tests/integration/test_lb_remote_execution.py

+        assert "pytorch" in lb.imageName  # GPU base image
        assert echo.__remote_config__["method"] == "POST"

-    def test_live_load_balancer_image_locked(self):
-        """Test that LiveLoadBalancer locks the image to Flash LB image."""
+    def test_live_load_balancer_default_image(self):
+        """Test that LiveLoadBalancer uses GPU base image by default."""
        lb = LiveLoadBalancer(name="test-api")
+        assert "pytorch" in lb.imageName


These assertions assume the default GPU base image string contains pytorch, but the base image is configurable (FLASH_GPU_BASE_IMAGE). This makes the integration tests environment-dependent. Consider asserting against the constant or avoiding repo-name substrings so the tests remain valid when users override the base image.

runpod-Henrik

Bug: Network volume cache never activates — dotfile skipped by glob

The NV cache write path uses cp -r "$FW_DIR"/* "$NV_CACHE/". By default the * glob does not match dotfiles in bash, so .version is never written to the NV cache. The read gate checks [ -f "$NV_CACHE/.version" ], which is always false. Every cold start re-downloads the tarball regardless of whether a warm NV cache exists.

We can confirm this directly: create $FW_DIR with both bootstrap.sh and .version, run the exact glob command from the generated script, and .version is absent from $NV_CACHE while bootstrap.sh is present. The gate check then fails.

Fix: cp -r "$FW_DIR"/. "$NV_CACHE/" — the trailing /. copies directory contents including dotfiles.

Issue: NV cache read creates a subdirectory when `$FW_DIR` already exists

The NV cache read path uses cp -r "$NV_CACHE" "$FW_DIR". When $FW_DIR already exists (version upgrade scenario: old version in place, new NV cache available), cp -r copies the cache directory into $FW_DIR as a subdirectory. bootstrap.sh ends up at $FW_DIR/v1.1.1/bootstrap.sh instead of $FW_DIR/bootstrap.sh. The following exec $FW_DIR/bootstrap.sh fails with "No such file or directory".

Tested with real shell execution: mkdir fw_dir && cp -r nv_cache fw_dir → fw_dir/nv_cache/bootstrap.sh exists, fw_dir/bootstrap.sh does not. Without the dotfile bug above, this would only surface during version upgrades. With that bug fixed (.version now present in NV cache), this would become the primary failure.

Fix: mkdir -p "$FW_DIR" && cp -r "$NV_CACHE"/. "$FW_DIR"/ (consistent with the fix for the dotfile bug above, and safe whether or not $FW_DIR exists).
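
The two fixes can be exercised together as a cache round trip. The sketch below is an illustration, not the generated script: it writes the cache from one directory using the corrected write command, then restores into a second directory that already exists, guarded by the same .version gate. Paths and the `round_trip` helper are invented, and sh and cp are assumed to be available.

```python
import os
import subprocess
import tempfile
from pathlib import Path

WRITE_CACHE = 'mkdir -p "$NV_CACHE" && cp -r "$FW_DIR"/. "$NV_CACHE"/'
READ_CACHE = (
    'if [ -d "$NV_CACHE" ] && [ -f "$NV_CACHE/.version" ]; then '
    'mkdir -p "$FW_DIR" && cp -r "$NV_CACHE"/. "$FW_DIR"/; '
    "else exit 3; fi"  # non-zero exit if the gate is not satisfied
)


def run(cmd: str, fw_dir: Path, nv_cache: Path) -> None:
    env = {**os.environ, "FW_DIR": str(fw_dir), "NV_CACHE": str(nv_cache)}
    subprocess.run(["sh", "-c", cmd], env=env, check=True)


def round_trip() -> list:
    """Populate the cache from one dir, restore into an existing one, list it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        nv_cache = root / "volume" / "v1.1.1"
        first = root / "container1"
        first.mkdir()
        (first / "bootstrap.sh").write_text("#!/bin/sh\n")
        (first / ".version").write_text("1.1.1\n")
        run(WRITE_CACHE, first, nv_cache)

        second = root / "container2"
        second.mkdir()  # models a $FW_DIR that already exists
        run(READ_CACHE, second, nv_cache)
        return sorted(p.name for p in second.iterdir())


restored = round_trip()
print(restored)
```

If either fix is reverted, the gate exits non-zero or bootstrap.sh ends up nested, so the listing no longer shows both files at the top level.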

🤖 Reviewed by Henrik's AI-Powered Bug Finder

- Fix tarball URL from runpod/flash-worker to runpod-workers/flash (matching the actual GitHub repo path)
- Add python3 urllib.request fallback in download chain for base images without curl or wget (e.g. pytorch/pytorch runtime images)
- Update test assertions for URL and fallback chain

deanq added 2 commits March 6, 2026 11:00

fix(test): update CPU base image assertion to python:3.11-slim

c1f28b8


deanq requested a review from Copilot March 6, 2026 21:32

Copilot started reviewing on behalf of deanq March 6, 2026 21:33

Copilot AI reviewed Mar 6, 2026

runpod-Henrik reviewed Mar 6, 2026

deanq wants to merge 3 commits into main from deanq/ae-2390-process-injectable-worker


3 participants
