feat(kubernetes): add agent execution mode, executor-agent, sidecar refactor, GKE Sandbox, image & deps upgrades#33

Closed
gafda wants to merge 14 commits into aron-muon:main from nosportugal:feat-taints

@gafda
Contributor

@gafda gafda commented Feb 18, 2026

This pull request updates the development and runtime environments for multiple languages, modernizes package dependencies, and enhances configuration flexibility, particularly for Redis and Kubernetes. The changes include updates to Dockerfiles for newer base images and language versions, significant dependency version bumps and additions, and expanded sample configuration for advanced deployment scenarios.

Key changes:

1. Dependency and Environment Updates

  • Dockerfiles for C/C++, D, Fortran, and R: Switched to trixie-debian13-dev base images to ensure compilers and development libraries are available at runtime, improving compatibility and security. (docker/c-cpp.Dockerfile, docker/d.Dockerfile, docker/fortran.Dockerfile, docker/r.Dockerfile)
  • Go Environment: Upgraded Go from 1.25 to 1.26, updated the Go Dockerfile stages, and refreshed Go module dependencies with new and reorganized packages. (docker/go.Dockerfile, docker/requirements/go.mod)
  • PHP: Bumped PHP version from 8.4.17 to 8.5.3 for the latest features and security patches. (docker/php.Dockerfile)
  • Python: Modernized and reorganized Python package requirements across core, analysis, documents, utilities, and visualization, with many version bumps and new packages for better functionality and compatibility. (docker/requirements/python-core.txt, docker/requirements/python-analysis.txt, docker/requirements/python-documents.txt, docker/requirements/python-utilities.txt, docker/requirements/python-visualization.txt)
  • Node.js: Expanded and reorganized the list of global packages for a more comprehensive JavaScript environment. (docker/requirements/nodejs.txt)
  • Rust: Updated Rust version from 1.92 to 1.93 and refreshed dependencies in Cargo.toml for new features and bug fixes. (docker/rust.Dockerfile, docker/requirements/rust-Cargo.toml)

2. Configuration Enhancements

  • Redis Configuration: The .env.example file now documents advanced Redis deployment options, including cluster and sentinel modes, TLS/SSL settings, and key prefixing, making it easier to configure Redis in complex environments.
  • Kubernetes Execution: Added detailed Kubernetes execution configuration options to .env.example, including support for agent and nsenter modes, sidecar image selection, image pull policies, and GKE Sandbox compatibility notes.

| Mode | Description |
|---|---|
| agent (default) | Executor agent runs inside the main container. No nsenter, no extra capabilities. Compatible with gVisor/GKE Sandbox. |
| nsenter (legacy) | Sidecar uses nsenter to enter the main container namespace. Requires SYS_PTRACE, SYS_ADMIN, SYS_CHROOT capabilities. |
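
The difference between the two modes can be sketched as the container securityContext each one would need. This is only an illustration built from the description above: the function name `build_security_context` is hypothetical, and the real manifest code in the PR may set additional fields or structure this differently.

```python
from typing import Any

# Capabilities the PR description lists as required by the legacy nsenter mode.
NSENTER_CAPABILITIES = ["SYS_PTRACE", "SYS_ADMIN", "SYS_CHROOT"]


def build_security_context(execution_mode: str) -> dict[str, Any]:
    """Return an illustrative securityContext fragment for an execution mode.

    Hypothetical helper: the name and exact field layout are assumptions,
    not the project's actual implementation.
    """
    if execution_mode == "agent":
        # Agent mode: no extra capabilities and no privilege escalation.
        return {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        }
    if execution_mode == "nsenter":
        # Legacy mode: the sidecar needs these capabilities to run nsenter.
        return {"capabilities": {"add": list(NSENTER_CAPABILITIES)}}
    raise ValueError(f"unknown execution mode: {execution_mode!r}")
```

Rejecting unknown modes up front keeps a typo in configuration from silently falling back to the more privileged path.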

3. Docker Image Naming and CI/CD

  • Sidecar Image Naming: Updated GitHub Actions workflow to use the new sidecar image name kubecoderun-sidecar-agent instead of kubecoderun-sidecar, aligning with the agent-based execution model. (.github/workflows/docker-publish.yml)

4. Python Runtime Optimization

  • Python Dockerfile: Cleaned up and reorganized installed runtime libraries, focusing on core utilities, image processing, XML/HTML processing, cryptography, and font support. Removed unnecessary tools from the final image for a leaner runtime. (docker/python.Dockerfile)

These changes collectively modernize the development environments, improve documentation and configuration for advanced deployments, and ensure compatibility with newer language and library versions.

- Add configuration options for GKE Sandbox in KubernetesConfig
- Update pod/job manifest creation to support:
  * runtimeClassName for gVisor runtime
  * sandbox.gke.io/runtime annotation
  * nodeSelector for sandbox-enabled nodes
  * tolerations for GKE sandbox and custom taints
- Add GKE Sandbox settings to Helm values.yaml and configmap
- Update KubernetesManager, PodSpec, and PoolConfig models
- Parse JSON configuration for node selectors and tolerations
- Enable easy activation/deactivation via configuration flags

GKE Sandbox provides additional kernel isolation using gVisor for
untrusted workloads. When enabled, execution pods will:
- Run with gVisor runtime (runtimeClassName: gvisor)
- Be scheduled on sandbox-enabled nodes
- Tolerate GKE sandbox taints automatically
- Support custom node pool taints for dedicated execution nodes

Configuration example in values.yaml:
  execution:
    gkeSandbox:
      enabled: true
      runtimeClassName: gvisor
      nodeSelector: {}
      customTolerations:
        - key: pool
          operator: Equal
          value: sandbox
          effect: NoSchedule
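
As a rough sketch of how the gkeSandbox settings above could be applied to a pod manifest: the helper name `apply_gke_sandbox` and the dict-based manifest are hypothetical, and the built-in toleration assumes the usual GKE Sandbox taint `sandbox.gke.io/runtime=gvisor:NoSchedule`. The actual client code in this PR may differ.

```python
import copy
import logging
from typing import Any

logger = logging.getLogger(__name__)


def apply_gke_sandbox(pod: dict[str, Any], sandbox: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `pod` with GKE Sandbox scheduling fields applied.

    Hypothetical helper; `sandbox` mirrors the gkeSandbox block in values.yaml.
    """
    if not sandbox.get("enabled", False):
        return pod  # feature disabled: leave the manifest untouched

    result = copy.deepcopy(pod)  # never mutate the caller's manifest
    runtime = sandbox.get("runtimeClassName", "gvisor")
    spec = result.setdefault("spec", {})
    spec["runtimeClassName"] = runtime

    annotations = result.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["sandbox.gke.io/runtime"] = runtime

    node_selector = sandbox.get("nodeSelector") or {}
    if node_selector:
        spec.setdefault("nodeSelector", {}).update(node_selector)

    tolerations = spec.setdefault("tolerations", [])
    # Assumed taint on sandbox-enabled node pools.
    tolerations.append({
        "key": "sandbox.gke.io/runtime",
        "operator": "Equal",
        "value": runtime,
        "effect": "NoSchedule",
    })
    for item in sandbox.get("customTolerations") or []:
        if not isinstance(item, dict) or not item.get("key"):
            logger.warning("skipping toleration without a key: %r", item)
            continue
        tolerations.append(dict(item))
    return result
```

Working on a copy keeps the shared base manifest and its annotations unchanged when the same template is reused for several pods.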
* Introduce  to support agent and nsenter modes.
* Implement agent mode with a lightweight executor agent running in the main container.
* Add  for configuring the executor agent's HTTP server port.
* Enhance security by dropping all capabilities in agent mode and ensuring no privilege escalation.
* Support image pull secrets for private registries via .
* Update documentation to reflect new execution modes and security configurations.
* Modify Helm chart to include image pull secrets configuration.
* Change default sidecar image from  to .
* Update environment variable names for executor port from  to .
* Add documentation for building sidecar images and configuring Helm charts for execution modes.
* Introduce GKE Sandbox support with configuration details and limitations.
* Update related code and tests to reflect changes in image names and environment variables.
- Updated Dockerfiles for C/C++, D, Fortran, R, and Sidecar to use the trixie-dev variant.
- Ensures compilers and development libraries are available at runtime.
* Updated base images to trixie-debian13-dev for C/C++, D, Fortran, R, and Rust.
* Upgraded PHP version to 8.5.3.
* Enhanced Node.js and Python requirements with new packages and versions.
* Improved Rust dependencies for better compatibility and performance.
* Updated Go version in executor-agent to 1.26.
* Introduced K8S_IMAGE_PULL_POLICY and K8S_IMAGE_PULL_SECRETS in configuration.
* Updated relevant classes and methods to handle new fields.
* Enhanced validation for execution mode and sidecar image consistency.
* Added unit tests to ensure correct handling of image pull settings.
- Add REDIS_MODE (standalone/cluster/sentinel) to RedisConfig
- Add TLS/SSL configuration (REDIS_TLS_ENABLED, certs, CA, insecure)
- Add Redis Cluster support (REDIS_CLUSTER_NODES) via RedisCluster client
- Add Redis Sentinel support (REDIS_SENTINEL_NODES/MASTER/PASSWORD)
- Update RedisPool to support all three modes with TLS
- Migrate FileService to use shared RedisPool instead of standalone client
- Update Settings class with all new Redis fields
- Update .env.example with new Redis configuration options
- Update docs/CONFIGURATION.md with Cluster, Sentinel, and TLS sections
- Update docs/SECURITY.md with TLS configuration reference
- Update Helm values.yaml, configmap.yaml, and _helpers.tpl
- Default remains standalone Redis for full backward compatibility

feat: add optional Redis key prefix support (REDIS_KEY_PREFIX)

- Add key_prefix field to RedisConfig and Settings
- Add make_key() helper to RedisPool for centralized key prefixing
- Update all services to use prefixed keys: session, state, file,
  health, api_key_manager, detailed_metrics, metrics
- Update .env.example, docs, and Helm chart with new setting
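
A minimal sketch of the centralized key prefixing described in this commit. The class name `KeyPrefixer` is hypothetical (the commit places make_key() on RedisPool), and the sketch assumes the prefix is prepended verbatim with no separator added.

```python
class KeyPrefixer:
    """Illustrative stand-in for the make_key() helper on RedisPool."""

    def __init__(self, key_prefix: str = "") -> None:
        self.key_prefix = key_prefix

    def make_key(self, key: str) -> str:
        """Prepend the configured prefix; return the key unchanged if unset."""
        if not self.key_prefix:
            return key
        return f"{self.key_prefix}{key}"


# Usage: services build every Redis key through the helper.
prefixer = KeyPrefixer("tenant-a:")
session_key = prefixer.make_key("session:123")
```

Routing all key construction through one helper means an empty prefix reproduces the previous key layout exactly.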
@gafda gafda marked this pull request as ready for review February 23, 2026 10:16
@gafda gafda requested a review from aron-muon as a code owner February 23, 2026 10:16
Copilot AI review requested due to automatic review settings February 23, 2026 10:16
Copilot AI left a comment

Pull request overview

This pull request introduces a significant architectural enhancement by adding an agent-based execution mode as the default, alongside comprehensive Redis deployment mode support and extensive dependency upgrades. The changes modernize the security model by eliminating the need for Linux capabilities in the default execution path, enable GKE Sandbox (gVisor) compatibility, and provide flexibility for Redis clustering and high-availability deployments.

Changes:

  • Introduced agent execution mode (default) that eliminates nsenter, Linux capabilities, and privilege escalation requirements, with nsenter mode retained for backward compatibility
  • Added comprehensive Redis deployment modes (standalone, cluster, sentinel) with TLS/SSL support and optional key prefixing for multi-tenant deployments
  • Implemented GKE Sandbox (gVisor) support with runtime class, node selectors, and tolerations for kernel-level isolation
  • Upgraded language runtimes and dependencies: Go 1.25→1.26, PHP 8.4.17→8.5.3, Rust 1.92→1.93, Python packages modernized
  • Refactored sidecar to multi-target Docker build producing both agent and nsenter variants from a single Dockerfile

Reviewed changes

Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| src/services/kubernetes/client.py | Enhanced pod manifest creation with agent/nsenter mode support, GKE Sandbox configuration, and image pull secrets |
| docker/sidecar/main.py | Added execute_via_agent() function and mode routing logic for dual execution mode support |
| docker/sidecar/executor-agent/main.go | New Go HTTP server for agent mode execution without nsenter or capabilities |
| docker/sidecar/Dockerfile | Refactored to multi-target build: sidecar-agent (default) and sidecar-nsenter (legacy) |
| src/core/pool.py | Complete rewrite supporting Redis standalone/cluster/sentinel modes with TLS and key prefixing |
| src/config/redis.py | New configuration model for Redis deployment modes, TLS, and advanced features |
| src/services/*.py | Updated all Redis-using services (state, session, metrics, api_key_manager) to use key prefixing |
| src/main.py | Added validation for execution mode/sidecar image consistency and image pull secrets parsing |
| tests/unit/test_kubernetes_client.py | Comprehensive tests for agent/nsenter modes and GKE Sandbox configuration |
| scripts/build-images.sh | Enhanced to support Docker multi-target builds with --target flag |
| .github/workflows/docker-publish.yml | Updated sidecar image name to kubecoderun-sidecar-agent |
| helm-deployments/kubecoderun/values.yaml | Added Redis mode configuration, GKE Sandbox settings, and execution mode options |
| docs/SECURITY.md, docs/CONFIGURATION.md, docs/ARCHITECTURE.md | Extensive documentation updates for new execution modes and Redis features |
Comments suppressed due to low confidence (1)

.github/workflows/docker-publish.yml:153

  • The CI/CD workflow only builds the kubecoderun-sidecar-agent image but not the kubecoderun-sidecar-nsenter variant. While the build script supports both targets and the Dockerfile defines both, the nsenter sidecar won't be available in the registry for users who need legacy nsenter mode. Consider adding a separate job to build the nsenter variant, or document that users must build it locally if needed.
  sidecar:
    needs: changes
    if: |
      needs.changes.outputs.is_cross_repo_pr != 'true' &&
      (needs.changes.outputs.sidecar == 'true' || needs.changes.outputs.force_all == 'true')
    uses: ./.github/workflows/docker-build-reusable.yml
    secrets: inherit
    with:
      image_name: kubecoderun-sidecar-agent
      dockerfile: docker/sidecar/Dockerfile
      context: docker/sidecar
      image_tag: ${{ needs.changes.outputs.image_tag }}
      is_release: ${{ needs.changes.outputs.is_release == 'true' }}
      version: ${{ needs.changes.outputs.version }}


* Validate GKE Sandbox compatibility with nsenter execution mode
* Log warnings when incompatible execution modes are used
* Update Dockerfile capabilities for nsenter
* Enhance RedisPool type hints for better type checking
* Add unit tests for GKE Sandbox and nsenter mode interactions
…me by default

Three issues prevented connecting to GCP Memorystore Redis with TLS:

1. _validate_redis_connection() used redis.from_url() without passing
   ssl_ca_certs / ssl_cert_reqs, so certificate verification fell back
   to the system CA bundle which doesn't include managed-service CAs.

2. get_tls_kwargs() set ssl_check_hostname=True when tls_insecure=False.
   Managed Redis services (GCP Memorystore, AWS ElastiCache) and Redis
   Cluster node discovery return IPs that don't match certificate CN/SAN,
   causing CERTIFICATE_VERIFY_FAILED.  Hostname checking is now off by
   default (matching redis-py) and controlled by the new
   REDIS_TLS_CHECK_HOSTNAME setting.

3. REDIS_HOST could contain a URL scheme (rediss://host) which was
   passed through to ClusterNode or URL construction.  A field
   validator now strips accidental schemes from the host value.
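
The second and third fixes can be sketched roughly as follows. The function names and parameters are hypothetical; only the keyword names `ssl`, `ssl_ca_certs`, `ssl_cert_reqs`, and `ssl_check_hostname` are redis-py connection arguments. Treat this as an illustration of the behaviour described above, not the PR's code.

```python
from typing import Any, Optional
from urllib.parse import urlsplit


def strip_scheme(host: str) -> str:
    """Drop an accidental URL scheme such as 'rediss://' from a host value."""
    host = host.strip()
    if "://" in host:
        # urlsplit also discards any port or path that followed the host.
        return urlsplit(host).hostname or host
    return host


def get_tls_kwargs(
    enabled: bool,
    ca_certs: Optional[str] = None,
    insecure: bool = False,
    check_hostname: bool = False,
) -> dict[str, Any]:
    """Build TLS keyword arguments to pass to a redis-py client."""
    if not enabled:
        return {}
    kwargs: dict[str, Any] = {
        "ssl": True,
        # Verify the certificate chain unless explicitly marked insecure.
        "ssl_cert_reqs": "none" if insecure else "required",
        # Hostname checking is opt-in: managed services and cluster node
        # discovery often return IPs that do not match the certificate.
        "ssl_check_hostname": check_hostname and not insecure,
    }
    if ca_certs:
        kwargs["ssl_ca_certs"] = ca_certs
    return kwargs
```

Passing the same kwargs to both the startup validator and the pool keeps the two code paths from disagreeing about which CA bundle is in use, which was the first issue listed above.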
Copilot AI review requested due to automatic review settings February 26, 2026 18:13
@gafda
Contributor Author

gafda commented Feb 26, 2026

@aron-muon

@aron-muon
Owner

@aron-muon

Hello, let me take a look here. Nice to see that Nos engineers are using Kubecoderun - I am a Nos customer myself.

Copilot AI left a comment

Pull request overview

Copilot reviewed 50 out of 51 changed files in this pull request and generated 9 comments.



…host fallback

The startup config validator always used redis.from_url() with a
standalone-style URL regardless of REDIS_MODE.  In cluster or sentinel
mode this connected to the wrong host (typically localhost:6379) and
failed, blocking startup.

- Rewrite _validate_redis_connection() to build the correct client
  type per mode: RedisCluster for cluster, Sentinel for sentinel, and
  redis.from_url() for standalone — all with proper TLS kwargs.
- Remove the silent localhost:6379 fallback in RedisPool._initialize()
  that masked real connection errors and caused confusing log messages.
- Update the corresponding unit test to expect the error to propagate.
- Shallow-copy annotations dict to prevent mutation by GKE Sandbox
- Validate 'key' field in custom tolerations, skip invalid with warning
- Log warning on invalid JSON for GKE_SANDBOX_NODE_SELECTOR/TOLERATIONS
- Align image_pull_policy default to 'Always' in KubernetesConfig
- Fix path traversal in executor-agent workdir validation (/mnt/data2)
- Replace env override append with key replacement in executor-agent
- Default gkeSandbox.enabled to false in Helm values
- Add security warning for REDIS_TLS_CHECK_HOSTNAME in docs and .env
- Add unit tests for all fixes (7 new tests)
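
Two of the executor-agent fixes listed above, sketched in Python for illustration (the agent itself is written in Go). The function names are hypothetical and `/mnt/data` as the allowed root is an assumption inferred from the `/mnt/data2` example. The point is that a plain string-prefix check accepts `/mnt/data2`, while a path-component check does not.

```python
import posixpath
from pathlib import PurePosixPath

ALLOWED_ROOT = PurePosixPath("/mnt/data")  # assumed working-directory root


def is_allowed_workdir(workdir: str, root: PurePosixPath = ALLOWED_ROOT) -> bool:
    """Accept only `root` itself or paths strictly inside it."""
    # normpath collapses '..' and duplicate slashes before comparing.
    normalized = PurePosixPath(posixpath.normpath(workdir))
    return normalized == root or root in normalized.parents


def merge_env(base: list[str], overrides: dict[str, str]) -> list[str]:
    """Replace KEY=value entries by key instead of appending duplicates."""
    merged: dict[str, str] = {}
    for entry in base:
        key, _, value = entry.partition("=")
        merged[key] = value
    merged.update(overrides)  # existing keys keep their original position
    return [f"{key}={value}" for key, value in merged.items()]
```

This sketch does not resolve symlinks; a real check would also need to consider them if the working directory can contain untrusted links.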
…tests

- Fix empty REDIS_PASSWORD  sending AUTH  by converting to None
- Fix empty REDIS_CLUSTER_NODES  treated as truthy, falling back to host:port
- Add missing REDIS_HOST/REDIS_PORT/REDIS_DB to Helm configmap
- Add REDIS_PASSWORD to Helm secret for cluster/sentinel modes
- Add 6-node Redis Cluster docker-compose (non-TLS and TLS variants)
- Add TLS cert generation/cleanup scripts for local testing
- Add 11 non-TLS + 14 TLS cluster integration tests (RedisPool,
  ConfigValidator, sync/async clients, key prefix operations)
- Add 20 new unit tests for Settings/RedisConfig validators
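
The first two fixes in this list come down to treating empty strings as "not set". A small sketch, with hypothetical function names and an assumed comma-separated `host:port` format for the node list:

```python
from typing import Optional


def normalize_password(value: Optional[str]) -> Optional[str]:
    """Convert an empty password to None so no AUTH command is sent."""
    return value if value else None


def parse_nodes(value: Optional[str], default_port: int = 6379) -> list[tuple[str, int]]:
    """Parse 'host:port,host:port' into (host, port) tuples.

    An empty or whitespace-only value yields an empty list, so callers can
    tell "no nodes configured" apart from a real node list.
    IPv6 literals are not handled in this sketch.
    """
    nodes: list[tuple[str, int]] = []
    for item in (value or "").split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.rpartition(":")
        if sep:
            nodes.append((host, int(port)))
        else:
            nodes.append((item, default_port))
    return nodes
```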
All Redis pipelines that operate on keys in different hash slots now use
transaction=False instead of transaction=True. Redis Cluster cannot wrap
MULTI/EXEC around keys on different nodes. ClusterPipeline with
transaction=False still batches commands but splits them by node.

Fixed files:
- session.py: create_session(), delete_session()
- api_key_manager.py: _ensure_single_env_key_record(), create_key(), revoke_key()
- state.py: save_state()

Also fixes version display showing '0.0.0.dev0' in production:
- build-images.sh: pass --build-arg VERSION=$TAG to docker build
- config: add SERVICE_VERSION env var for runtime version override
- main.py + logging.py: prefer SERVICE_VERSION over build-time _version.py

Added 8 unit tests (test_cluster_pipeline_compat.py) verifying:
- All 6 pipelines use transaction=False
- SERVICE_VERSION override and fallback behavior

Tested against standalone Redis, Cluster (no TLS), and Cluster (TLS).
All 1352 unit tests + 178 integration tests pass.
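
To see why pipelines touching several keys cannot be wrapped in MULTI/EXEC on Redis Cluster, it helps to compute the hash slot of each key. The sketch below follows the Redis Cluster key-slot rule (CRC16/XMODEM of the key, or of its hash tag, modulo 16384); the helper names are illustrative and not part of the PR.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:
            # Non-empty hash tag: only the text between the braces is hashed.
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is the CRC16/XMODEM variant.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def same_slot(*keys: str) -> bool:
    """True if every key maps to the same slot."""
    return len({key_slot(key) for key in keys}) <= 1
```

Keys that share a hash tag, such as `{user1}:session` and `{user1}:state`, always land in the same slot; keys without a common tag generally do not, which is the case the non-transactional pipelines above are meant to handle.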
Copilot AI review requested due to automatic review settings February 27, 2026 18:37
@gafda
Contributor Author

gafda commented Feb 27, 2026

@aron-muon
Made a few more fixes, mostly around REDIS in CLUSTER mode.

Copilot AI left a comment

Pull request overview

Copilot reviewed 61 out of 62 changed files in this pull request and generated 3 comments.



- Pass --port to executor-agent binary so configured executor_port is
  actually used (fixes agent-mode when port != 9090)
- Remove incorrect await on redis.pipeline() in SessionService to align
  with redis-py asyncio API (pipeline() is synchronous)
- Restrict CA private key to 600 in TLS cert generator (redis.key stays
  644 for container access)
Owner

@aron-muon aron-muon left a comment

Could you separate out the dockerfile changes (e.g. updating the python/go Dockerfiles and dependencies) from this PR? I'd like such a large change to be as isolated as possible

    secrets: inherit
    with:
-      image_name: kubecoderun-sidecar
+      image_name: kubecoderun-sidecar-agent
Owner

It appears you are no longer building sidecar-nsenter?

@@ -0,0 +1,12 @@
#!/usr/bin/env bash
Owner

I don't see these scripts being used anywhere in these changes - can we remove them?

@@ -0,0 +1,454 @@
"""Integration tests for Redis Cluster with TLS.

Mirrors the user's production GCP Memorystore configuration:
Owner

don't need this comment indicating an AI agent is referring to your setup

@gafda
Contributor Author

gafda commented Mar 2, 2026

> Could you separate out the dockerfile changes (e.g. updating the python/go Dockerfiles and dependencies) from this PR? I'd like such a large change to be as isolated as possible

So to confirm: do you want me to move all changes under the docker folder tree into a separate PR, leaving this PR focused on the agent execution mode, executor-agent/sidecar refactor, GKE Sandbox support, and Redis Cluster/Sentinel/TLS changes? Did I understand that correctly?

@aron-muon
Owner

> Could you separate out the dockerfile changes (e.g. updating the python/go Dockerfiles and dependencies) from this PR? I'd like such a large change to be as isolated as possible

> So to confirm: do you want me to move all changes under the docker folder tree into a separate PR, leaving this PR focused on the agent execution mode, executor-agent/sidecar refactor, GKE Sandbox support, and Redis Cluster/Sentinel/TLS changes? Did I understand that correctly?

Yes - if you have ideas for splitting it even further along logical lines, beyond just 2 PRs, that would be helpful as well. I am reviewing 4000+ lines of code manually (with an AI agent as well, but I do the review manually to make sure that the changes support the architectural vision for the project).

Also - don't forget to run linters/tests etc using the justfile https://github.com/aron-muon/KubeCodeRun/blob/main/justfile#L12C1-L28C56
e.g. just lint

@gafda
Contributor Author

gafda commented Mar 2, 2026

Hi @aron-muon,

Thank you for your feedback on this PR. Per our discussion about breaking this down into smaller chunks, I am closing PR #33.

I have split the functionality into the following three new pull requests. All feedback you previously provided has been incorporated.

  • PR #35 - task-docker-image-deps-upgrade: Focuses on upgrading all language runtimes to DHI base images and bumping dependency versions (Go 1.26, PHP 8.5.3, Rust 1.93, etc.).
  • PR #36 - feat-agent-execution-mode: Introduces agent-based execution, sidecar builds, GKE Sandbox support, and CI for both sidecar variants.
  • PR #37 - feat-redis-cluster-sentinel-tls: Adds support for Redis Cluster, Sentinel, TLS/SSL, key prefixing, and related integration test environments.

Important Note: This work was developed as a single, cohesive feature set. As such, these three PRs are interdependent and are intended to be reviewed and merged together to ensure full functionality.

Looking forward to your review of the new PRs.
