feat(kubernetes): add agent execution mode, executor-agent, sidecar refactor, GKE Sandbox, image & deps upgrades #33
gafda wants to merge 14 commits into aron-muon:main
- Add configuration options for GKE Sandbox in KubernetesConfig
- Update pod/job manifest creation to support:
  * runtimeClassName for gVisor runtime
  * sandbox.gke.io/runtime annotation
  * nodeSelector for sandbox-enabled nodes
  * tolerations for GKE sandbox and custom taints
- Add GKE Sandbox settings to Helm values.yaml and configmap
- Update KubernetesManager, PodSpec, and PoolConfig models
- Parse JSON configuration for node selectors and tolerations
- Enable easy activation/deactivation via configuration flags
GKE Sandbox provides additional kernel isolation using gVisor for
untrusted workloads. When enabled, execution pods will:
- Run with gVisor runtime (runtimeClassName: gvisor)
- Be scheduled on sandbox-enabled nodes
- Tolerate GKE sandbox taints automatically
- Support custom node pool taints for dedicated execution nodes
Configuration example in values.yaml:

    execution:
      gkeSandbox:
        enabled: true
        runtimeClassName: gvisor
        nodeSelector: {}
        customTolerations:
          - key: pool
            operator: Equal
            value: sandbox
            effect: NoSchedule
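
To make the manifest changes described above concrete, here is a minimal Python sketch of how the sandbox settings could be applied to a pod manifest. The function names `parse_tolerations` and `apply_gke_sandbox`, the annotation value, and the exact dictionary layout are assumptions for illustration; they are not taken from the PR's `client.py`.

```python
import copy
import json
from typing import Any, Optional

# Key GKE uses for the sandbox node label and taint (also the annotation key named above).
GVISOR_KEY = "sandbox.gke.io/runtime"


def parse_tolerations(raw: str) -> list[dict[str, Any]]:
    """Parse a JSON list of tolerations; entries without a 'key' are skipped."""
    try:
        items = json.loads(raw) if raw else []
    except json.JSONDecodeError:
        return []  # a real implementation would log a warning here
    if not isinstance(items, list):
        return []
    return [t for t in items if isinstance(t, dict) and t.get("key")]


def apply_gke_sandbox(
    pod: dict[str, Any],
    runtime_class: str = "gvisor",
    node_selector: Optional[dict[str, str]] = None,
    custom_tolerations: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Return a copy of `pod` with the sandbox fields set (hypothetical helper)."""
    pod = copy.deepcopy(pod)  # never mutate the caller's manifest or annotations
    meta = pod.setdefault("metadata", {})
    annotations = dict(meta.get("annotations") or {})
    annotations[GVISOR_KEY] = runtime_class  # assumed annotation value
    meta["annotations"] = annotations

    spec = pod.setdefault("spec", {})
    spec["runtimeClassName"] = runtime_class
    spec["nodeSelector"] = {GVISOR_KEY: runtime_class, **(node_selector or {})}
    spec["tolerations"] = [
        # Toleration for the taint on sandbox-enabled nodes.
        {"key": GVISOR_KEY, "operator": "Equal", "value": runtime_class, "effect": "NoSchedule"},
        *(custom_tolerations or []),
    ]
    return pod
```

Working on a deep copy keeps a shared base manifest (for example, one reused across a pod pool) free of sandbox-specific fields when the feature is toggled off.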
* Introduce to support agent and nsenter modes.
* Implement agent mode with a lightweight executor agent running in the main container.
* Add for configuring the executor agent's HTTP server port.
* Enhance security by dropping all capabilities in agent mode and ensuring no privilege escalation.
* Support image pull secrets for private registries via .
* Update documentation to reflect new execution modes and security configurations.
* Modify Helm chart to include image pull secrets configuration.

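
As a rough illustration of the agent-mode hardening listed above, the following Python sketch builds the relevant pieces of a container spec. The helper names, the container name `main`, and the comma-separated format for pull secrets are assumptions; the PR's actual parsing and manifest code may differ.

```python
from typing import Any


def agent_security_context() -> dict[str, Any]:
    """Security context for agent mode: no capabilities, no privilege escalation."""
    return {
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    }


def image_pull_secrets(raw: str) -> list[dict[str, str]]:
    """Turn 'a, b' into [{'name': 'a'}, {'name': 'b'}] (assumed input format)."""
    return [{"name": name.strip()} for name in raw.split(",") if name.strip()]


def main_container(image: str, executor_port: int) -> dict[str, Any]:
    """Container entry exposing the executor agent's HTTP port (hypothetical layout)."""
    return {
        "name": "main",
        "image": image,
        "ports": [{"containerPort": executor_port}],
        "securityContext": agent_security_context(),
    }
```

The `name`-keyed entries match the shape Kubernetes expects under `spec.imagePullSecrets`.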
* Change default sidecar image from to .
* Update environment variable names for executor port from to .
* Add documentation for building sidecar images and configuring Helm charts for execution modes.
* Introduce GKE Sandbox support with configuration details and limitations.
* Update related code and tests to reflect changes in image names and environment variables.

- Updated Dockerfiles for C/C++, D, Fortran, R, and Sidecar to use the trixie-dev variant.
- Ensures compilers and development libraries are available at runtime.

* Updated base images to trixie-debian13-dev for C/C++, D, Fortran, R, and Rust.
* Upgraded PHP version to 8.5.3.
* Enhanced Node.js and Python requirements with new packages and versions.
* Improved Rust dependencies for better compatibility and performance.
* Updated Go version in executor-agent to 1.26.

* Introduced K8S_IMAGE_PULL_POLICY and K8S_IMAGE_PULL_SECRETS in configuration.
* Updated relevant classes and methods to handle new fields.
* Enhanced validation for execution mode and sidecar image consistency.
* Added unit tests to ensure correct handling of image pull settings.

- Add REDIS_MODE (standalone/cluster/sentinel) to RedisConfig
- Add TLS/SSL configuration (REDIS_TLS_ENABLED, certs, CA, insecure)
- Add Redis Cluster support (REDIS_CLUSTER_NODES) via RedisCluster client
- Add Redis Sentinel support (REDIS_SENTINEL_NODES/MASTER/PASSWORD)
- Update RedisPool to support all three modes with TLS
- Migrate FileService to use shared RedisPool instead of standalone client
- Update Settings class with all new Redis fields
- Update .env.example with new Redis configuration options
- Update docs/CONFIGURATION.md with Cluster, Sentinel, and TLS sections
- Update docs/SECURITY.md with TLS configuration reference
- Update Helm values.yaml, configmap.yaml, and _helpers.tpl
- Default remains standalone Redis for full backward compatibility

feat: add optional Redis key prefix support (REDIS_KEY_PREFIX)
- Add key_prefix field to RedisConfig and Settings
- Add make_key() helper to RedisPool for centralized key prefixing
- Update all services to use prefixed keys: session, state, file, health, api_key_manager, detailed_metrics, metrics
- Update .env.example, docs, and Helm chart with new setting
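
The centralized prefixing described in the commit above might look roughly like the sketch below. The class name `KeyPrefixer` is hypothetical (the PR puts `make_key()` on `RedisPool`), and it is an assumption that the prefix is prepended verbatim with no separator added.

```python
class KeyPrefixer:
    """Illustrative stand-in for the key-prefixing part of RedisPool."""

    def __init__(self, key_prefix: str = "") -> None:
        # Corresponds to REDIS_KEY_PREFIX; empty means "no prefixing".
        self.key_prefix = key_prefix

    def make_key(self, key: str) -> str:
        """Return the key as stored in Redis, with the optional prefix applied."""
        return f"{self.key_prefix}{key}" if self.key_prefix else key


# Usage: every service builds its keys through the same helper.
pool = KeyPrefixer("tenant-a:")
session_key = pool.make_key("session:123")
```

Routing every key through one helper means a deployment that shares a Redis instance only has to set the prefix in one place.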
Pull request overview
This pull request introduces a significant architectural enhancement by adding an agent-based execution mode as the default, alongside comprehensive Redis deployment mode support and extensive dependency upgrades. The changes modernize the security model by eliminating the need for Linux capabilities in the default execution path, enable GKE Sandbox (gVisor) compatibility, and provide flexibility for Redis clustering and high-availability deployments.
Changes:
- Introduced agent execution mode (default) that eliminates nsenter, Linux capabilities, and privilege escalation requirements, with nsenter mode retained for backward compatibility
- Added comprehensive Redis deployment modes (standalone, cluster, sentinel) with TLS/SSL support and optional key prefixing for multi-tenant deployments
- Implemented GKE Sandbox (gVisor) support with runtime class, node selectors, and tolerations for kernel-level isolation
- Upgraded language runtimes and dependencies: Go 1.25→1.26, PHP 8.4.17→8.5.3, Rust 1.92→1.93, Python packages modernized
- Refactored sidecar to multi-target Docker build producing both agent and nsenter variants from a single Dockerfile
Reviewed changes
Copilot reviewed 49 out of 50 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/services/kubernetes/client.py | Enhanced pod manifest creation with agent/nsenter mode support, GKE Sandbox configuration, and image pull secrets |
| docker/sidecar/main.py | Added execute_via_agent() function and mode routing logic for dual execution mode support |
| docker/sidecar/executor-agent/main.go | New Go HTTP server for agent mode execution without nsenter or capabilities |
| docker/sidecar/Dockerfile | Refactored to multi-target build: sidecar-agent (default) and sidecar-nsenter (legacy) |
| src/core/pool.py | Complete rewrite supporting Redis standalone/cluster/sentinel modes with TLS and key prefixing |
| src/config/redis.py | New configuration model for Redis deployment modes, TLS, and advanced features |
| src/services/*.py | Updated all Redis-using services (state, session, metrics, api_key_manager) to use key prefixing |
| src/main.py | Added validation for execution mode/sidecar image consistency and image pull secrets parsing |
| tests/unit/test_kubernetes_client.py | Comprehensive tests for agent/nsenter modes and GKE Sandbox configuration |
| scripts/build-images.sh | Enhanced to support Docker multi-target builds with --target flag |
| .github/workflows/docker-publish.yml | Updated sidecar image name to kubecoderun-sidecar-agent |
| helm-deployments/kubecoderun/values.yaml | Added Redis mode configuration, GKE Sandbox settings, and execution mode options |
| docs/SECURITY.md, docs/CONFIGURATION.md, docs/ARCHITECTURE.md | Extensive documentation updates for new execution modes and Redis features |
Comments suppressed due to low confidence (1)
.github/workflows/docker-publish.yml:153
- The CI/CD workflow only builds the kubecoderun-sidecar-agent image but not the kubecoderun-sidecar-nsenter variant. While the build script supports both targets and the Dockerfile defines both, the nsenter sidecar won't be available in the registry for users who need legacy nsenter mode. Consider adding a separate job to build the nsenter variant, or document that users must build it locally if needed.

      sidecar:
        needs: changes
        if: |
          needs.changes.outputs.is_cross_repo_pr != 'true' &&
          (needs.changes.outputs.sidecar == 'true' || needs.changes.outputs.force_all == 'true')
        uses: ./.github/workflows/docker-build-reusable.yml
        secrets: inherit
        with:
          image_name: kubecoderun-sidecar-agent
          dockerfile: docker/sidecar/Dockerfile
          context: docker/sidecar
          image_tag: ${{ needs.changes.outputs.image_tag }}
          is_release: ${{ needs.changes.outputs.is_release == 'true' }}
          version: ${{ needs.changes.outputs.version }}
* Validate GKE Sandbox compatibility with nsenter execution mode
* Log warnings when incompatible execution modes are used
* Update Dockerfile capabilities for nsenter
* Enhance RedisPool type hints for better type checking
* Add unit tests for GKE Sandbox and nsenter mode interactions

…me by default

Three issues prevented connecting to GCP Memorystore Redis with TLS:

1. _validate_redis_connection() used redis.from_url() without passing ssl_ca_certs / ssl_cert_reqs, so certificate verification fell back to the system CA bundle, which doesn't include managed-service CAs.
2. get_tls_kwargs() set ssl_check_hostname=True when tls_insecure=False. Managed Redis services (GCP Memorystore, AWS ElastiCache) and Redis Cluster node discovery return IPs that don't match the certificate CN/SAN, causing CERTIFICATE_VERIFY_FAILED. Hostname checking is now off by default (matching redis-py) and controlled by the new REDIS_TLS_CHECK_HOSTNAME setting.
3. REDIS_HOST could contain a URL scheme (rediss://host), which was passed through to ClusterNode or URL construction. A field validator now strips accidental schemes from the host value.
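
A minimal sketch of fixes 2 and 3, assuming a settings object with the fields shown. `TlsSettings`, its field names, and `strip_scheme` are hypothetical; only the `ssl*` keyword names are real redis-py connection parameters.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional

# Matches a leading URL scheme such as "redis://" or "rediss://".
_SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def strip_scheme(host: str) -> str:
    """Drop an accidental URL scheme from a host value."""
    return _SCHEME_RE.sub("", host.strip())


@dataclass
class TlsSettings:
    """Hypothetical container for the REDIS_TLS_* settings."""
    enabled: bool = False
    ca_cert_file: Optional[str] = None
    cert_file: Optional[str] = None
    key_file: Optional[str] = None
    insecure: bool = False
    check_hostname: bool = False  # off unless explicitly requested


def get_tls_kwargs(tls: TlsSettings) -> dict[str, Any]:
    """Build keyword arguments suitable for a redis-py client constructor."""
    if not tls.enabled:
        return {}
    kwargs: dict[str, Any] = {
        "ssl": True,
        "ssl_cert_reqs": "none" if tls.insecure else "required",
        # Hostname checks make no sense without certificate verification.
        "ssl_check_hostname": tls.check_hostname and not tls.insecure,
    }
    if tls.ca_cert_file:
        kwargs["ssl_ca_certs"] = tls.ca_cert_file
    if tls.cert_file:
        kwargs["ssl_certfile"] = tls.cert_file
    if tls.key_file:
        kwargs["ssl_keyfile"] = tls.key_file
    return kwargs
```

Passing the same kwargs to both the startup validator and the pool avoids the mismatch described in issue 1, where one code path verified against the system CA bundle and the other against the configured CA.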
Hello, let me take a look here. Nice to see that Nos engineers are using Kubecoderun - I am a Nos customer myself.
Pull request overview
Copilot reviewed 50 out of 51 changed files in this pull request and generated 9 comments.
…host fallback

The startup config validator always used redis.from_url() with a standalone-style URL regardless of REDIS_MODE. In cluster or sentinel mode this connected to the wrong host (typically localhost:6379) and failed, blocking startup.

- Rewrite _validate_redis_connection() to build the correct client type per mode: RedisCluster for cluster, Sentinel for sentinel, and redis.from_url() for standalone, all with proper TLS kwargs.
- Remove the silent localhost:6379 fallback in RedisPool._initialize() that masked real connection errors and caused confusing log messages.
- Update the corresponding unit test to expect the error to propagate.

- Shallow-copy annotations dict to prevent mutation by GKE Sandbox
- Validate 'key' field in custom tolerations, skip invalid with warning
- Log warning on invalid JSON for GKE_SANDBOX_NODE_SELECTOR/TOLERATIONS
- Align image_pull_policy default to 'Always' in KubernetesConfig
- Fix path traversal in executor-agent workdir validation (/mnt/data2)
- Replace env override append with key replacement in executor-agent
- Default gkeSandbox.enabled to false in Helm values
- Add security warning for REDIS_TLS_CHECK_HOSTNAME in docs and .env
- Add unit tests for all fixes (7 new tests)
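
The two executor-agent fixes in this list can be illustrated with a short sketch. The real agent is written in Go; this is a Python illustration only, and the allowed root `/mnt/data` and both function names are assumptions inferred from the `/mnt/data2` example.

```python
import posixpath

ALLOWED_ROOT = "/mnt/data"  # assumed working-directory root


def is_allowed_workdir(path: str, root: str = ALLOWED_ROOT) -> bool:
    """Accept only `root` itself or paths strictly beneath it.

    A bare prefix check would accept "/mnt/data2" because it starts with
    "/mnt/data"; comparing against root + "/" after normalization closes that gap.
    """
    clean = posixpath.normpath(path)
    return clean == root or clean.startswith(root + "/")


def merge_env(base: list[str], overrides: dict[str, str]) -> list[str]:
    """Replace existing KEY=VALUE entries instead of appending duplicates."""
    merged: dict[str, str] = {}
    for entry in base:
        key, _, value = entry.partition("=")
        merged[key] = value
    merged.update(overrides)  # existing keys keep their position, new keys go last
    return [f"{key}={value}" for key, value in merged.items()]
```

Note that `normpath` is purely lexical and does not resolve symlinks, so a production check would likely also need to consider symlinked directories.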
…tests

- Fix empty REDIS_PASSWORD sending AUTH by converting to None
- Fix empty REDIS_CLUSTER_NODES treated as truthy, falling back to host:port
- Add missing REDIS_HOST/REDIS_PORT/REDIS_DB to Helm configmap
- Add REDIS_PASSWORD to Helm secret for cluster/sentinel modes
- Add 6-node Redis Cluster docker-compose (non-TLS and TLS variants)
- Add TLS cert generation/cleanup scripts for local testing
- Add 11 non-TLS + 14 TLS cluster integration tests (RedisPool, ConfigValidator, sync/async clients, key prefix operations)
- Add 20 new unit tests for Settings/RedisConfig validators
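
The first two fixes in this list concern how empty environment values are normalized. A sketch under stated assumptions: the function names are hypothetical, and the comma-separated `host:port` format for REDIS_CLUSTER_NODES is assumed rather than confirmed by the PR.

```python
from typing import Optional


def normalize_password(raw: Optional[str]) -> Optional[str]:
    """Treat an empty password as 'no password' so no AUTH command is sent."""
    return raw if raw else None


def parse_cluster_nodes(
    raw: Optional[str], fallback_host: str, fallback_port: int
) -> list[tuple[str, int]]:
    """Parse 'host:port,host:port'; fall back to a single node when nothing usable is given."""
    nodes: list[tuple[str, int]] = []
    for item in (raw or "").split(","):
        item = item.strip()
        if not item:
            continue  # skip blanks from "", " , ", or trailing commas
        host, sep, port = item.rpartition(":")
        if sep and port.isdigit():
            nodes.append((host, int(port)))
        else:
            nodes.append((item, fallback_port))  # bare host without a port
    return nodes or [(fallback_host, fallback_port)]
```

This sketch does not handle bracketed IPv6 addresses; it only shows the empty-value handling.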
All Redis pipelines that operate on keys in different hash slots now use transaction=False instead of transaction=True. Redis Cluster cannot wrap MULTI/EXEC around keys on different nodes. ClusterPipeline with transaction=False still batches commands but splits them by node.

Fixed files:
- session.py: create_session(), delete_session()
- api_key_manager.py: _ensure_single_env_key_record(), create_key(), revoke_key()
- state.py: save_state()

Also fixes version display showing '0.0.0.dev0' in production:
- build-images.sh: pass --build-arg VERSION=$TAG to docker build
- config: add SERVICE_VERSION env var for runtime version override
- main.py + logging.py: prefer SERVICE_VERSION over build-time _version.py

Added 8 unit tests (test_cluster_pipeline_compat.py) verifying:
- All 6 pipelines use transaction=False
- SERVICE_VERSION override and fallback behavior

Tested against standalone Redis, Cluster (no TLS), and Cluster (TLS). All 1352 unit tests + 178 integration tests pass.
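
To see why keys in one pipeline can land on different nodes, the sketch below computes Redis Cluster hash slots following the published algorithm (CRC16/XMODEM of the key, or of its hash tag, modulo 16384). The function name `key_slot` and the sample key names are illustrative; this is not code from the PR.

```python
import binascii

SLOTS = 16384  # number of hash slots in a Redis Cluster


def key_slot(key: str) -> int:
    """Return the cluster hash slot for a key, honoring {hash tags}."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty tag between the braces counts
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16 variant Redis uses.
    return binascii.crc_hqx(data, 0) % SLOTS


# Keys sharing a hash tag map to one slot; unrelated keys usually do not.
same = key_slot("{user1}:session") == key_slot("{user1}:state")
```

Hash tags are the usual alternative when atomic MULTI/EXEC across several keys is required; the commit above instead gives up the transaction, which keeps the existing key names unchanged.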
@aron-muon
Pull request overview
Copilot reviewed 61 out of 62 changed files in this pull request and generated 3 comments.
- Pass --port to executor-agent binary so configured executor_port is actually used (fixes agent-mode when port != 9090)
- Remove incorrect await on redis.pipeline() in SessionService to align with redis-py asyncio API (pipeline() is synchronous)
- Restrict CA private key to 600 in TLS cert generator (redis.key stays 644 for container access)
aron-muon left a comment:
Could you separate out the Dockerfile changes (e.g. updating the python/go Dockerfiles and dependencies) from this PR? I'd like such a large change to be as isolated as possible.
     secrets: inherit
     with:
-      image_name: kubecoderun-sidecar
+      image_name: kubecoderun-sidecar-agent
It appears you are no longer building sidecar-nsenter?
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
I don't see these scripts being used anywhere in these changes - can we remove them?
@@ -0,0 +1,454 @@
+"""Integration tests for Redis Cluster with TLS.
+
+Mirrors the user's production GCP Memorystore configuration:
We don't need this comment indicating that an AI agent is referring to your setup.
So to confirm, do you want me to move all changes under the
Yes - if you have ideas for making the split even larger based on logical silos, beyond just 2 PRs, that could be helpful as well. I am reviewing 4000+ lines of code manually (as well as with an AI agent, but I do the review manually to make sure that the changes support an architectural vision for the project). Also - don't forget to run linters/tests etc. using the
Hi @aron-muon,

Thank you for your feedback on this PR. Per our discussion about breaking this down into smaller chunks, I am closing PR #33. I have split the functionality into the following three new pull requests. All feedback you previously provided has been incorporated.

Important Note: This work was developed as a single, cohesive feature set. As such, these three PRs are interdependent and are intended to be reviewed and merged together to ensure full functionality.

Looking forward to your review of the new PRs.
This pull request updates the development and runtime environments for multiple languages, modernizes package dependencies, and enhances configuration flexibility, particularly for Redis and Kubernetes. The changes include updates to Dockerfiles for newer base images and language versions, significant dependency version bumps and additions, and expanded sample configuration for advanced deployment scenarios.
Key changes:

1. Dependency and Environment Updates
- … trixie-debian13-dev base images to ensure compilers and development libraries are available at runtime, improving compatibility and security. (docker/c-cpp.Dockerfile, docker/d.Dockerfile, docker/fortran.Dockerfile, docker/r.Dockerfile)
- … (docker/go.Dockerfile, docker/requirements/go.mod)
- … (docker/php.Dockerfile)
- … (docker/requirements/python-core.txt, docker/requirements/python-analysis.txt, docker/requirements/python-documents.txt, docker/requirements/python-utilities.txt, docker/requirements/python-visualization.txt)
- … (docker/requirements/nodejs.txt)
- … Cargo.toml for new features and bug fixes. (docker/rust.Dockerfile, docker/requirements/rust-Cargo.toml)

2. Configuration Enhancements
- … .env.example file now documents advanced Redis deployment options, including cluster and sentinel modes, TLS/SSL settings, and key prefixing, making it easier to configure Redis in complex environments.
- … .env.example, including support for agent and nsenter modes, sidecar image selection, image pull policies, and GKE Sandbox compatibility notes.
  - agent (default)
  - nsenter (legacy)

3. Docker Image Naming and CI/CD
- … kubecoderun-sidecar-agent instead of kubecoderun-sidecar, aligning with the agent-based execution model. (.github/workflows/docker-publish.yml)

4. Python Runtime Optimization
- … (docker/python.Dockerfile)

These changes collectively modernize the development environments, improve documentation and configuration for advanced deployments, and ensure compatibility with newer language and library versions.