The Code Factory platform is designed from the ground up for enterprise-scale deployments. This document describes the architectural patterns, technologies, and configurations that enable horizontal and vertical scaling, high availability, and performance optimization.
- Architecture Overview
- Horizontal Scaling
- Message Queue Architecture
- Load Balancing & Distribution
- Caching Strategies
- Database Scaling
- Auto-Scaling Mechanisms
- Backpressure Management
- Resilience & Fault Tolerance
- Distributed Coordination
- Resource Management
- Monitoring & Observability
- Performance Optimizations
- Configuration Guide
- Deployment Patterns
- Capacity Planning
The Code Factory implements a multi-tier scalable architecture with the following key characteristics:
┌─────────────────────────────────────────────────────────────┐
│ Load Balancer / Ingress │
└────────────────────┬────────────────────────────────────────┘
│
┌───────────┴───────────┐
│ │
┌────▼────┐ ┌────▼────┐ ┌────────┐
│ Pod 1 │ │ Pod 2 │ ... │ Pod N │
│ │ │ │ │ │
│ Sharded │ │ Sharded │ │Sharded │
│ Message │ │ Message │ │Message │
│ Bus │ │ Bus │ │ Bus │
└────┬────┘ └────┬────┘ └────┬───┘
│ │ │
└──────────┬───────────┴─────────────────────┘
│
┌──────────▼──────────┐
│ Redis (Locks, │
│ Cache, Pub/Sub) │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ PostgreSQL + Citus │
│ (Distributed SQL) │
└──────────┬──────────┘
│
┌──────────▼──────────┐
│ Kafka (Optional) │
│ Event Streaming │
└─────────────────────┘
- Stateless Application Tier: All pods are stateless and can be scaled independently
- Sharded Internal Queue: Each pod runs an internal sharded message bus for work distribution
- Distributed Coordination: Redis provides distributed locks and pub/sub for cross-pod coordination
- Optional External Queue: Kafka integration for event streaming and audit trail
- Horizontal Database Scaling: Citus extension for PostgreSQL distribution
The platform automatically scales based on CPU and memory metrics.
Configuration: helm/codefactory/templates/hpa.yaml, k8s/overlays/production/hpa.yaml
spec:
minReplicas: 3 # Production minimum for HA
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # Scale at 70% CPU
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Scale at 80% memory
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Percent
value: 100 # Double pods rapidly under load
periodSeconds: 30
scaleDown:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50 # Scale down conservatively
periodSeconds: 60
Key Features:
- Aggressive Scale Up: 100% increase every 30 seconds during load spikes
- Conservative Scale Down: 50% decrease every 60 seconds to avoid thrashing
- Dual Metrics: CPU and memory are evaluated independently; the HPA scales to whichever metric yields the higher desired replica count
- Minimum HA: 3 replicas in production ensures availability during pod failures
File References:
- Helm HPA: helm/codefactory/templates/hpa.yaml:1-47
- K8s HPA: k8s/overlays/production/hpa.yaml:1-42
Ensures high availability during cluster maintenance and updates.
Configuration: k8s/overlays/production/pdb.yaml
spec:
minAvailable: 1 # At least 1 pod always running
selector:
matchLabels:
app: codefactory-api
File Reference: k8s/overlays/production/pdb.yaml:1-12
Zero-downtime deployments with controlled rollout.
Configuration: k8s/base/api-deployment.yaml:13-17
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # One extra pod during updates
maxUnavailable: 0 # No downtime
The platform uses a three-tier message queue architecture:
- Internal Sharded Message Bus (always present)
- Redis Pub/Sub (for cross-pod coordination)
- Kafka (optional, for event streaming and audit)
The heart of the scalability architecture, running inside each pod.
File: omnicore_engine/message_bus/sharded_message_bus.py (2240 lines)
┌─────────────────────────────────────────────────┐
│ ShardedMessageBus │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Shard 0 │ │ Shard 1 │ │ Shard N │ │
│ │ │ │ │ │ │ │
│ │ Queue │ │ Queue │ │ Queue │ │
│ │ (10k) │ │ (10k) │ │ (10k) │ │
│ │ │ │ │ │ │ │
│ │ Workers │ │ Workers │ │ Workers │ │
│ │ (4x) │ │ (4x) │ │ (4x) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │
│ └──────────────┴──────────────┘ │
│ │ │
│ ┌─────────▼─────────┐ │
│ │ Consistent Hash │ │
│ │ Ring │ │
│ └───────────────────┘ │
└─────────────────────────────────────────────────┘
- Shard Count: 4-8 (configurable, default 8)
- Workers Per Shard: 2-4 (configurable, default 4)
- Max Queue Size: 10,000 messages per shard
- Total Capacity: 80,000 messages per pod (8 shards × 10k)
File References:
- Initialization: omnicore_engine/message_bus/sharded_message_bus.py:328-337
- Thread pools: omnicore_engine/message_bus/sharded_message_bus.py:343-363
- Hash ring: omnicore_engine/message_bus/sharded_message_bus.py:385-392
Messages are distributed to shards using a consistent hash ring based on topic name.
Benefits:
- Even distribution across shards
- Stable routing (same topic always goes to same shard)
- Efficient re-balancing on shard count changes
File Reference: omnicore_engine/message_bus/sharded_message_bus.py:909-916
Each shard maintains two queues:
- Normal Priority Queue: Standard messages
- High Priority Queue: Critical messages (priority ≥ 5)
High-priority messages are processed first within each shard.
File References:
- Queue setup: omnicore_engine/message_bus/sharded_message_bus.py:338-341
- Priority handling: omnicore_engine/message_bus/sharded_message_bus.py:942
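The dual-queue dispatch described above can be sketched as follows. This is a toy model, not the platform's actual API; the class name, threshold constant, and method names are illustrative.

```python
import asyncio

PRIORITY_THRESHOLD = 5  # messages at or above this go to the fast lane

class Shard:
    """Toy model of one shard with a normal and a high-priority queue."""

    def __init__(self, max_size: int = 10_000):
        self.normal = asyncio.Queue(maxsize=max_size)
        self.high = asyncio.Queue(maxsize=max_size)

    async def publish(self, message: dict):
        # Route by priority: >= threshold goes to the high-priority queue
        queue = self.high if message.get("priority", 0) >= PRIORITY_THRESHOLD else self.normal
        await queue.put(message)

    def next_message(self):
        """Drain the high-priority queue before touching the normal one."""
        if not self.high.empty():
            return self.high.get_nowait()
        if not self.normal.empty():
            return self.normal.get_nowait()
        return None
```

A high-priority message published after a normal one is still dequeued first, which is the property the two-queue design buys.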
Optional integration for event streaming, audit trail, and inter-service communication.
File: omnicore_engine/message_bus/integrations/kafka_bridge.py (732 lines)
Optimized for durability and throughput:
producer_config = {
"acks": "all", # Wait for all replicas
"linger_ms": 5, # Batch messages for 5ms
"batch_size": 16384, # 16KB batches
"compression_type": "gzip",
"retries": 5,
"enable_idempotence": True # Exactly-once semantics
}
File Reference: omnicore_engine/message_bus/integrations/kafka_bridge.py:202-210
Optimized for reliability:
consumer_config = {
"auto_offset_reset": "latest",
"enable_auto_commit": False, # Manual commit after processing
"max_poll_records": 100,
"session_timeout_ms": 30000
}
File Reference: omnicore_engine/message_bus/integrations/kafka_bridge.py:193-200
Kafka operations are protected by circuit breakers to prevent cascading failures.
File References: omnicore_engine/message_bus/integrations/kafka_bridge.py:335-337, 484-485, 499-500
High-performance adapter for sending audit events to Kafka.
File: omnicore_engine/message_bus/kafka_sink_adapter.py (371 lines)
Features:
- Graceful Degradation: Works without Kafka (logs locally instead)
- Bounded Concurrency: Max 64 concurrent operations (configurable)
- Circuit Breaker: Optional protection against Kafka failures
- Batch Emission: emit_many() for high-throughput scenarios
File References:
- Graceful degradation: omnicore_engine/message_bus/kafka_sink_adapter.py:39-44
- Circuit breaker: omnicore_engine/message_bus/kafka_sink_adapter.py:168-173
- Concurrency: omnicore_engine/message_bus/kafka_sink_adapter.py:152
Used for cross-pod coordination and caching.
Docker Configuration: docker-compose.production.yml:34-50
redis:
image: redis:7.4-alpine
volumes:
- redis-data:/data
command: redis-server --appendonly yes # AOF persistence
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
retries: 5
The ShardedMessageBus uses a consistent hash ring to distribute topics across shards.
Algorithm:
- Each shard is assigned a position on a 0-2^32 ring
- Topic names are hashed to a ring position
- Topics are assigned to the next shard clockwise on the ring
- Re-balancing is minimized when shards are added/removed
File References:
- Initialization: omnicore_engine/message_bus/sharded_message_bus.py:386-388
- Topic mapping: omnicore_engine/message_bus/sharded_message_bus.py:909-916
- Shard addition: omnicore_engine/message_bus/sharded_message_bus.py:1638-1642
- Shard removal: omnicore_engine/message_bus/sharded_message_bus.py:1678-1681
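The four steps above can be sketched with a minimal ring. This is illustrative only; the production implementation is the one referenced in sharded_message_bus.py. Virtual nodes (an assumption, not stated in this document) are a common way to get the even distribution the design calls for.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal topic-to-shard ring with virtual nodes for even spread."""

    def __init__(self, shard_count: int, vnodes: int = 64):
        # Each shard gets several positions on the 0..2^32 ring
        self.ring = []  # sorted list of (position, shard_index)
        for shard in range(shard_count):
            for v in range(vnodes):
                pos = self._hash(f"shard-{shard}-vnode-{v}")
                self.ring.append((pos, shard))
        self.ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        # First 4 bytes of MD5 -> integer in 0..2^32-1
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def shard_for(self, topic: str) -> int:
        """Assign the topic to the next shard clockwise on the ring."""
        pos = self._hash(topic)
        i = bisect.bisect_right(self.ring, (pos, float("inf")))
        return self.ring[i % len(self.ring)][1]  # wrap around at the top
```

Because a topic's hash never changes, the same topic always maps to the same shard, and adding or removing a shard only remaps the keys adjacent to its ring positions.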
Two-tier dispatch system:
- High-Priority Workers: Process messages with priority ≥ 5
- Normal Workers: Process all other messages
File References:
- High-priority queues: omnicore_engine/message_bus/sharded_message_bus.py:338-341
- Priority threshold: omnicore_engine/message_bus/sharded_message_bus.py:942
ClusterIP service with session affinity (optional).
Configuration: k8s/base/api-deployment.yaml:246
kind: Service
metadata:
name: codefactory-api
spec:
type: ClusterIP
ports:
- port: 80
targetPort: 8000
protocol: TCP
name: http
- port: 9090
targetPort: 9090
protocol: TCP
name: metrics
selector:
app: codefactory-api
Prevents duplicate processing of messages with the same idempotency key.
File: omnicore_engine/message_bus/sharded_message_bus.py:414-417
Configuration:
- TTL: Configurable (default: 3600 seconds)
- Storage: In-memory per pod, optionally backed by Redis for cross-pod deduplication
- Idempotency Keys: Based on message content hash or explicit key
File References:
- MessageCache: omnicore_engine/message_bus/sharded_message_bus.py:414-417
- Redis integration: omnicore_engine/message_bus/sharded_message_bus.py:1279-1282
Caches topic-to-shard assignments to avoid repeated hash calculations.
Cache Invalidation: Triggered on shard count changes or manual re-balance.
File References: omnicore_engine/message_bus/sharded_message_bus.py:389, 1643, 1694
Used for:
- Distributed locks (see Distributed Coordination)
- Cross-pod message deduplication
- Session state (if enabled)
- Rate limiting counters
Helm Configuration: helm/codefactory/values.yaml:255-260
redis:
host: codefactory-redis
port: 6379
password: changeme
persistence:
enabled: true
size: 10Gi
The platform uses Citus to scale PostgreSQL horizontally across multiple nodes.
Image: citusdata/citus:12.1
File: docker-compose.production.yml:55
┌────────────────────────────────────┐
│ Citus Coordinator │
│ (Query Planning & Distribution) │
└────────┬────────────────┬──────────┘
│ │
┌────▼────┐ ┌────▼────┐
│ Worker │ │ Worker │
│ Node 1 │ │ Node 2 │
│ │ │ │
│ Shard 1 │ │ Shard 2 │
│ Shard 3 │ │ Shard 4 │
└─────────┘ └─────────┘
- Transparent Sharding: Application queries unchanged
- Parallel Queries: Coordinator distributes queries across workers
- HA Replication: Worker nodes can be replicated
- Reference Tables: Small tables duplicated to all workers
Environment Variable: ENABLE_CITUS=1 (default in production)
Helm Values: helm/codefactory/values.yaml:188-191
database:
poolSize: 50 # Connection pool size
poolMaxOverflow: 20 # Additional connections under load
retryAttempts: 3
retryDelay: 1.0
Efficient database connection management:
- Pool Size: 50 connections per pod
- Max Overflow: 20 additional connections during spikes
- Timeout: 30 seconds
- Pool Recycle: 3600 seconds (prevents stale connections)
File Reference: helm/codefactory/values.yaml:188-191
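Assuming SQLAlchemy is the ORM layer (this document does not name it), the pool parameters above map onto an async engine roughly like this. The URL, credentials, and host are placeholders.

```python
from sqlalchemy.ext.asyncio import create_async_engine

# Illustrative only; driver and connection URL are assumptions
engine = create_async_engine(
    "postgresql+asyncpg://app:secret@postgres-primary:5432/codefactory",
    pool_size=50,        # Pool Size: steady-state connections per pod
    max_overflow=20,     # Max Overflow: extra connections during spikes
    pool_timeout=30,     # seconds to wait for a free connection
    pool_recycle=3600,   # recycle hourly to avoid stale connections
    pool_pre_ping=True,  # detect dead connections before handing them out
)
```

With these settings a pod never holds more than 70 connections (50 + 20 overflow), and any connection older than an hour is replaced on checkout.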
For read-heavy workloads, PostgreSQL read replicas can be configured:
database:
primary:
host: postgres-primary
port: 5432
replicas:
- host: postgres-replica-1
port: 5432
- host: postgres-replica-2
port: 5432
Note: Read replica support is available but not enabled by default.
The ShardedMessageBus automatically adjusts shard count and worker count based on load.
File: omnicore_engine/message_bus/sharded_message_bus.py:1849-1882
async def auto_scale_shards(self):
"""Automatically adjust shard count based on load metrics."""
total_queue_size = sum(q.qsize() for q in self.queues)
avg_queue_size = total_queue_size / len(self.queues)
# Scale up if average queue size > 80% of max
if avg_queue_size > (self.max_queue_size * 0.8):
new_count = min(self.shard_count + 1, 16) # Max 16 shards
logger.info(f"Scaling up to {new_count} shards")
await self._add_shard()
# Scale down if average queue size < 20% of max
elif avg_queue_size < (self.max_queue_size * 0.2) and self.shard_count > 2:
new_count = max(self.shard_count - 1, 2) # Min 2 shards
logger.info(f"Scaling down to {new_count} shards")
await self._remove_shard()
Parameters:
- Scale Up Threshold: 80% queue capacity
- Scale Down Threshold: 20% queue capacity
- Min Shards: 2
- Max Shards: 16
- Check Interval: 60 seconds
File Reference: omnicore_engine/message_bus/sharded_message_bus.py:1849-1882
Dynamically adjusts worker count per shard based on message backlog.
File: omnicore_engine/message_bus/sharded_message_bus.py:1804-1847
async def adjust_workers(self, shard_index: int):
"""Adjust worker count for a specific shard."""
queue_size = self.queues[shard_index].qsize()
current_workers = len(self.workers[shard_index])
# Scale up workers if queue is growing
if queue_size > 1000 and current_workers < 8:
# Add 2 workers
pass
# Scale down workers if queue is empty
elif queue_size < 100 and current_workers > 2:
# Remove 1 worker
pass
Parameters:
- Min Workers Per Shard: 2
- Max Workers Per Shard: 8
- Scale Up Threshold: 1000 messages queued
- Scale Down Threshold: 100 messages queued
See Horizontal Scaling section for Kubernetes HPA configuration.
The platform uses three-tier scaling:
- Worker Scaling (seconds): Adjust thread pool size within each shard
- Shard Scaling (minutes): Add/remove shards within each pod
- Pod Scaling (minutes): Kubernetes HPA adds/removes pods
This approach provides:
- Fast Response: Worker scaling handles micro-bursts
- Medium Response: Shard scaling handles sustained load increases
- Macro Response: Pod scaling handles cluster-wide load
Prevents queue overflow by pausing message publishing when queues reach capacity.
File: omnicore_engine/message_bus/backpressure.py (93 lines)
class BackpressureManager:
def check_queue_load(self, shard_index: int) -> bool:
"""Check if shard queue is over threshold (80%)."""
queue = self.message_bus.queues[shard_index]
utilization = queue.qsize() / self.message_bus.max_queue_size
return utilization > 0.8
async def should_pause_publishes(self) -> bool:
"""Check if any shard is over capacity."""
for shard_index in range(self.message_bus.shard_count):
if self.check_queue_load(shard_index):
return True
return False
File Reference: omnicore_engine/message_bus/backpressure.py:26-92
Individual shards can be paused without affecting other shards.
File References:
- Pause check:
omnicore_engine/message_bus/sharded_message_bus.py:932-934 - Pause/resume:
omnicore_engine/message_bus/sharded_message_bus.py:2018-2032
When backpressure is triggered:
- Pause Publishing: New messages are rejected with HTTP 429 (Too Many Requests)
- Drain Queue: Existing workers continue processing
- Resume Publishing: Once queue drops below 50%, publishing resumes
- Scale Up: Auto-scaler adds shards/workers to prevent future backpressure
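The pause-at-80% / resume-below-50% flow above is a hysteresis loop; the gap between the two thresholds is what prevents rapid flapping. A minimal sketch (class name hypothetical):

```python
PAUSE_THRESHOLD = 0.8   # pause publishing at 80% queue utilization
RESUME_THRESHOLD = 0.5  # resume once utilization falls below 50%

class PublishGate:
    """Hysteresis gate for backpressure: pause high, resume low."""

    def __init__(self):
        self.paused = False

    def update(self, queue_size: int, max_queue_size: int) -> bool:
        """Return True if publishing is currently allowed."""
        utilization = queue_size / max_queue_size
        if not self.paused and utilization > PAUSE_THRESHOLD:
            self.paused = True   # callers should answer HTTP 429 while paused
        elif self.paused and utilization < RESUME_THRESHOLD:
            self.paused = False  # queue drained enough; accept publishes again
        return not self.paused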
Prevents cascading failures when external dependencies fail.
File: omnicore_engine/message_bus/resilience.py (100+ lines)
- Closed: Normal operation, requests pass through
- Open: Dependency failing, requests fail fast
- Half-Open: Testing if dependency recovered
class CircuitBreaker:
def __init__(
self,
failure_threshold: int = 5, # Failures before opening
recovery_timeout: float = 60.0, # Seconds before half-open
success_threshold: int = 3 # Successes before closing
):
self.state = "closed"
- Kafka Circuit: failure_threshold=3, timeout=60s
- Redis Circuit: failure_threshold=5, timeout=60s
File Reference: omnicore_engine/message_bus/sharded_message_bus.py:420-427
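The three-state machine above can be sketched as a call wrapper. This is a simplified synchronous version for illustration; the platform's implementation in resilience.py is its own code and is async.

```python
import time

class CircuitBreaker:
    """Closed -> Open on repeated failure; Open -> Half-Open after a timeout;
    Half-Open -> Closed after enough successes."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, success_threshold=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"  # probe whether the dependency recovered
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.successes = 0
            # Any failure while probing, or too many while closed, opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # success in closed state resets the counter
        return result
```

Failing fast while open is the point: callers get an immediate error instead of piling up requests against a dependency that is already down.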
Automatic retry with exponential backoff for transient failures.
File: omnicore_engine/message_bus/resilience.py:28-35
class RetryPolicy:
max_retries: int = 3
backoff_factor: float = 0.01 # Exponential: 0.01, 0.02, 0.04, ...
Failed messages are captured for manual replay or analysis.
File: omnicore_engine/message_bus/dead_letter_queue.py
Messages sent to DLQ if:
- Priority ≥ 5 (high-priority messages)
- Retry count exceeded
- Permanent failure (e.g., invalid message format)
async def replay_failed_messages(self, max_age_seconds: int = 3600) -> int:
"""Replay messages from the DLQ."""
# Query DLQ messages younger than max_age
# Re-publish each message
# Remove successfully replayed messages
return replayed_count
File Reference: omnicore_engine/message_bus/sharded_message_bus.py:2034-2191
The platform continues operating with reduced functionality when dependencies fail:
- Without Kafka: Audit logs to file instead of streaming
- Without Redis: Single-pod operation (no distributed locks)
- Without Citus: Standard PostgreSQL mode
Redis-based distributed locks for cross-pod coordination.
File: server/distributed_lock.py (434 lines)
Uses Redis SET NX EX (atomic set-if-not-exists with expiration):
acquired = redis.set(
name="lock:agent_initialization",
value=uuid4().hex, # Unique owner ID
nx=True, # Only set if doesn't exist
ex=timeout # Auto-expire after timeout
)
File Reference: server/distributed_lock.py:165-169
Atomic check-and-delete using Lua script:
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end
File Reference: server/distributed_lock.py:320-326
- Agent Initialization: Only one pod initializes shared resources
- Database Migrations: Only one pod runs migrations
- Job Scheduling: Prevent duplicate job execution
- Resource Allocation: Coordinate access to shared resources
File References:
- Global startup lock: server/distributed_lock.py:389-411
- Lock acquisition with retry: server/distributed_lock.py:185-224
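Combining the SET NX EX acquire and the Lua release shown above, a minimal lock wrapper looks like this. It is a sketch against a redis-py-style client; the platform's real API lives in server/distributed_lock.py.

```python
import uuid
from contextlib import contextmanager

# Atomic check-and-delete: only the owner may release the lock
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""

@contextmanager
def distributed_lock(redis, name: str, timeout: int = 30):
    """Acquire a Redis lock, yield whether we own it, and release safely."""
    owner = uuid.uuid4().hex  # unique owner ID, as in the acquire snippet
    acquired = redis.set(name, owner, nx=True, ex=timeout)
    try:
        yield bool(acquired)
    finally:
        if acquired:
            redis.eval(RELEASE_SCRIPT, 1, name, owner)
```

The owner check in the release script matters: without it, a pod whose lock expired could delete a lock that another pod has since acquired.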
When Redis is unavailable:
- Single-pod operation allowed
- Multi-pod deployments require Redis
File Reference: server/distributed_lock.py:185-224
Prevent resource exhaustion and ensure QoS.
Configuration: k8s/base/api-deployment.yaml:198-204
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
Resource Classes:
- Requests: Guaranteed resources (Kubernetes schedules based on this)
- Limits: Maximum resources (pod throttled/killed if exceeded)
Customizable via values.yaml:
File: helm/codefactory/values.yaml:80-86
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
For non-Kubernetes deployments:
File: docker-compose.production.yml:237-244
deploy:
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
Per-client rate limiting to prevent abuse.
Configuration: omnicore_engine/message_bus/sharded_message_bus.py:394-397
self.rate_limiter = RateLimiter(
max_requests=1000, # Max requests per window
window_seconds=60 # Time window
)
Prevents oversized messages from breaking queues.
Configuration: omnicore_engine/message_bus/sharded_message_bus.py:195-203
MAX_MESSAGE_SIZE = 1 * 1024 * 1024 # 1MB
def validate_message_size(message: Dict[str, Any]) -> bool:
size = len(json.dumps(message).encode('utf-8'))
return size <= MAX_MESSAGE_SIZE
Comprehensive metrics for monitoring and alerting.
Configuration: monitoring/prometheus.yml (48 lines)
global:
scrape_interval: 15s # Scrape every 15 seconds
evaluation_interval: 15s
scrape_configs:
- job_name: 'codefactory-platform'
static_configs:
- targets: ['codefactory-api:9090']
- job_name: 'redis'
static_configs:
- targets: ['redis:9121']
- job_name: 'postgres'
static_configs:
- targets: ['postgres:9187']
Retention: 30 days
Detailed metrics for the ShardedMessageBus:
File: omnicore_engine/metrics.py
| Metric | Type | Description |
|---|---|---|
| message_bus_queue_size | Gauge | Queue size per shard |
| message_bus_message_age | Histogram | Message latency distribution |
| message_bus_dispatch_duration | Histogram | Dispatch time per message |
| message_bus_topic_throughput | Counter | Messages per topic |
| message_bus_callback_latency | Histogram | Callback execution time |
| message_bus_callback_errors | Counter | Callback failures per topic |
| message_bus_publish_retries | Counter | Retry attempts per shard |
| kafka_connection_failures | Counter | Kafka connection failures by reason |
| kafka_health_check_status | Gauge | Kafka availability (0/1) |
Enable automatic scraping:
File: helm/codefactory/values.yaml:31-34
podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
For Prometheus Operator:
File: helm/codefactory/templates/servicemonitor.yaml
spec:
selector:
matchLabels:
app: codefactory-api
endpoints:
- port: metrics
interval: 30s
scrapeTimeout: 10s
Pre-configured dashboards for:
- System health overview
- Message bus performance
- Database connections and query latency
- Kubernetes pod metrics
- Application-specific metrics
Docker Compose: docker-compose.production.yml:264-301
Access: http://localhost:3000 (default credentials in docker-compose)
Automated alerting for critical conditions:
Configuration: monitoring/alertmanager.yml
Alert Rules: monitoring/alerts.yml
Example alerts:
- High CPU/memory usage
- Queue saturation (>80%)
- Message processing latency >5s
- Circuit breaker opened
- Pod crash loops
Reduce pod startup time for faster scaling.
Configuration: helm/codefactory/values.yaml:180-185
performance:
parallelAgentLoading: true # Load agents in parallel
lazyLoadML: true # Defer ML model loading
skipImportValidation: true # Skip import-time checks
startupTimeout: 90 # Max startup time (seconds)
File References:
- Parallel loading: helm/codefactory/values.yaml:183
- Lazy ML: helm/codefactory/values.yaml:184
- Startup timeout: helm/codefactory/values.yaml:185
Efficient database connection management:
Configuration: helm/codefactory/values.yaml:188-191
database:
poolSize: 50 # Normal pool size
poolMaxOverflow: 20 # Additional connections under load
retryAttempts: 3
retryDelay: 1.0
Benefits:
- Reduced connection overhead
- Better resource utilization
- Graceful handling of connection spikes
Gunicorn worker processes for parallel request handling:
Configuration: docker-compose.production.yml:261
gunicorn server.main:app \
--workers 4 \
--worker-class uvicorn.workers.UvicornWorker \
--timeout 300 \
--graceful-timeout 30 \
--keep-alive 5 \
--max-requests 1000 \
--max-requests-jitter 50
Parameters:
- Workers: 4 (adjust based on CPU cores)
- Max Requests: 1000 (prevents memory leaks)
- Max Requests Jitter: 50 (prevents thundering herd)
- Keep-Alive: 5 seconds
Fully async architecture using uvloop for performance:
- ASGI Server: Uvicorn with UvicornWorker
- Event Loop: uvloop (if available) or asyncio
- Database: asyncpg for PostgreSQL
- Redis: aioredis for async operations
Optimized thread pool configuration:
File: omnicore_engine/message_bus/sharded_message_bus.py:343-363
# Normal priority executors (one per shard)
self.executors = [
ThreadPoolExecutor(max_workers=self.workers_per_shard)
for _ in range(self.shard_count)
]
# High priority executors (one per shard)
self.high_priority_executors = [
ThreadPoolExecutor(max_workers=self.workers_per_shard // 2)
for _ in range(self.shard_count)
]
# Callback executors (one per shard)
self.callback_executors = [
ThreadPoolExecutor(max_workers=8)
for _ in range(self.shard_count)
]
Total Threads Per Pod:
- Normal: 4 workers × 8 shards = 32 threads
- High-priority: 2 workers × 8 shards = 16 threads
- Callbacks: 8 workers × 8 shards = 64 threads
- Grand Total: ~112 threads per pod
# Sharding
MESSAGE_BUS_SHARD_COUNT=8
MESSAGE_BUS_WORKERS_PER_SHARD=4
MESSAGE_BUS_MAX_QUEUE_SIZE=10000
# Monitoring
ENABLE_MESSAGE_BUS_GUARDIAN=1
MESSAGE_BUS_GUARDIAN_INTERVAL=30
# Auto-scaling
MESSAGE_BUS_AUTO_SCALE=1
MESSAGE_BUS_SCALE_UP_THRESHOLD=0.8
MESSAGE_BUS_SCALE_DOWN_THRESHOLD=0.2
File Reference: helm/codefactory/values.yaml:194-198
# Connection pooling
DB_POOL_SIZE=50
DB_POOL_MAX_OVERFLOW=20
DB_POOL_TIMEOUT=30
DB_POOL_RECYCLE=3600
# Retry logic
DB_RETRY_ATTEMPTS=3
DB_RETRY_DELAY=1.0
# Citus
ENABLE_CITUS=1
File Reference: helm/codefactory/values.yaml:188-191
# Core features
ENABLE_DATABASE=1
ENABLE_REDIS=1
ENABLE_KAFKA=0 # Optional
# Distributed features
ENABLE_CITUS=1 # Production only
ENABLE_DISTRIBUTED_LOCKS=1
# Performance
PARALLEL_AGENT_LOADING=1
LAZY_LOAD_ML=1
STARTUP_TIMEOUT=90
File Reference: helm/codefactory/values.yaml:207-215
Complete scalability configuration via values.yaml:
# Replica configuration
replicaCount: 3
# HPA configuration
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 10
targetCPUUtilizationPercentage: 70
targetMemoryUtilizationPercentage: 80
# Resource limits
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
# Message bus
messageBus:
shardCount: 8
workersPerShard: 4
maxQueueSize: 10000
autoScale: true
# Database
database:
poolSize: 50
poolMaxOverflow: 20
citus:
enabled: true
# Redis
redis:
enabled: true
persistence:
enabled: true
size: 10Gi
# Kafka (optional)
kafka:
enabled: false
File: helm/codefactory/values.yaml
Configuration:
- Pods: 1-2
- Shards: 4
- Workers: 2 per shard
- Resources: 500m CPU, 1Gi RAM
Use Cases: Development, small teams, prototyping
Configuration:
- Pods: 2-5
- Shards: 6-8
- Workers: 3-4 per shard
- Resources: 1 CPU, 2Gi RAM
- Redis: Required
- HPA: Enabled
Use Cases: Small production, mid-sized teams
Configuration:
- Pods: 5-10 (HPA controlled)
- Shards: 8
- Workers: 4 per shard
- Resources: 2 CPU, 4Gi RAM
- Redis: Required with persistence
- Citus: Enabled
- Kafka: Recommended
- HPA: Enabled with aggressive scaling
- Monitoring: Prometheus + Grafana + AlertManager
Use Cases: Production, large teams, enterprise
Configuration:
- Pods: 10-20+ (HPA controlled)
- Shards: 12-16
- Workers: 6-8 per shard
- Resources: 4 CPU, 8Gi RAM
- Redis: Cluster mode (3+ nodes)
- Citus: Multi-node cluster
- Kafka: Multi-broker cluster
- HPA: Aggressive with custom metrics
- Monitoring: Full observability stack
- CDN: For static assets
- Load Balancer: Multi-region with health checks
Use Cases: Enterprise at scale, high-traffic production
Throughput (msg/s) = (Shards × Workers) / Avg_Processing_Time
Example:
- Shards: 8
- Workers: 4
- Avg Processing Time: 0.1s per message
Throughput = (8 × 4) / 0.1 = 320 msg/s per pod
Total_Capacity = Shards × Max_Queue_Size
Example:
- Shards: 8
- Max Queue Size: 10,000
Total Capacity = 8 × 10,000 = 80,000 messages per pod
Required_Pods = (Peak_Throughput / Pod_Throughput) × Safety_Margin
Example:
- Peak Throughput: 1000 msg/s
- Pod Throughput: 320 msg/s
- Safety Margin: 1.5x (50% headroom)
Required Pods = (1000 / 320) × 1.5 = 4.7 ≈ 5 pods
CPU_Per_Pod = (Workers × Avg_CPU_Per_Worker) + Overhead
Example:
- Workers: 32 (8 shards × 4 workers)
- Avg CPU per Worker: 0.05 cores
- Overhead: 0.5 cores (OS, logging, metrics)
CPU = (32 × 0.05) + 0.5 = 2.1 cores
Memory_Per_Pod = (Queue_Capacity × Avg_Message_Size) + Worker_Memory + Overhead
Example:
- Queue Capacity: 80,000 messages
- Avg Message Size: 1KB
- Worker Memory: 500MB
- Overhead: 500MB
Memory = (80,000 × 1KB) + 500MB + 500MB = 1.08GB ≈ 1.5GB
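The worked examples above can be reproduced with a small calculator. The formulas are taken straight from this section; function names are just for illustration.

```python
import math

def pod_throughput(shards: int, workers: int, avg_processing_s: float) -> float:
    """Throughput (msg/s) = (Shards x Workers) / Avg_Processing_Time."""
    return (shards * workers) / avg_processing_s

def required_pods(peak_msgs_per_s: float, per_pod: float, safety: float = 1.5) -> int:
    """Required_Pods = (Peak_Throughput / Pod_Throughput) x Safety_Margin."""
    return math.ceil((peak_msgs_per_s / per_pod) * safety)

def cpu_per_pod(total_workers: int, cpu_per_worker: float, overhead: float = 0.5) -> float:
    """CPU_Per_Pod = (Workers x Avg_CPU_Per_Worker) + Overhead."""
    return total_workers * cpu_per_worker + overhead

def memory_per_pod_mb(queue_capacity: int, avg_msg_kb: float,
                      worker_mb: float = 500, overhead_mb: float = 500) -> float:
    """Memory = (Queue_Capacity x Avg_Message_Size) + Worker_Memory + Overhead."""
    return queue_capacity * avg_msg_kb / 1024 + worker_mb + overhead_mb
```

Plugging in the section's numbers: 8 shards x 4 workers at 0.1 s/message gives 320 msg/s per pod, and a 1000 msg/s peak with a 1.5x safety margin rounds up to 5 pods.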
| Metric | Scale Up | Scale Down | Duration |
|---|---|---|---|
| CPU | >70% | <30% | 30s / 60s |
| Memory | >80% | <40% | 30s / 60s |
| Queue Size | >80% | <20% | 60s / 120s |
| Message Age | >5s | <1s | 60s / 120s |
Increase per-pod resources before horizontal scaling:
- Start: 1 CPU, 2GB RAM
- First Increase: 2 CPU, 4GB RAM (2x)
- Second Increase: 4 CPU, 8GB RAM (2x)
- Limit: 8 CPU, 16GB RAM (beyond this, scale horizontally)
Add pods once vertical scaling limits reached:
- Phase 1: 1-3 pods (development/small production)
- Phase 2: 3-5 pods (medium production)
- Phase 3: 5-10 pods (large production)
- Phase 4: 10+ pods (enterprise scale)
Scale supporting infrastructure with application:
| Component | Small | Medium | Large | Extra Large |
|---|---|---|---|---|
| Redis | Single | Single + Replica | Cluster (3 nodes) | Cluster (5+ nodes) |
| PostgreSQL | Single | Single + Replica | Citus (2 workers) | Citus (4+ workers) |
| Kafka | N/A | 1 broker | 3 brokers | 5+ brokers |
| Prometheus | Single | Single | HA (2 nodes) | Federated |
- ✅ Enable HPA in production for automatic scaling
- ✅ Use Redis for distributed locks in multi-pod deployments
- ✅ Enable Citus for large databases (>100GB)
- ✅ Monitor queue sizes and set alerts at 60% capacity
- ✅ Use circuit breakers for all external dependencies
- ✅ Configure resource limits to prevent resource exhaustion
- ✅ Enable graceful shutdown (SIGTERM handling)
- ✅ Use startup probes to handle slow agent loading
- ✅ Enable Prometheus metrics for all components
- ✅ Test scaling before production (load tests)
- ❌ Don't disable backpressure (can cause OOM)
- ❌ Don't run without resource limits (can starve other pods)
- ❌ Don't skip health checks (can cause routing to unhealthy pods)
- ❌ Don't use fixed replica counts in production (use HPA)
- ❌ Don't ignore metrics (set up alerts)
- ❌ Don't share Redis across environments (isolation)
- ❌ Don't exceed 80% capacity for extended periods (add resources)
- ❌ Don't deploy without testing (run integration tests first)
- ❌ Don't skip graceful shutdown (can lose in-flight messages)
- ❌ Don't use development configs in production
- HPA enabled with appropriate thresholds
- PDB configured (minAvailable: 1+)
- Resource requests and limits set
- Health checks configured (startup, liveness, readiness)
- Redis enabled with persistence
- Database connection pooling configured
- Citus enabled (for large databases)
- Prometheus metrics enabled
- Grafana dashboards imported
- Alert rules configured
- Circuit breakers enabled
- Backpressure enabled
- Graceful shutdown configured
- Distributed locks enabled
- Load testing completed
- Disaster recovery plan documented
- Kubernetes Deployment Guide - Detailed K8s deployment instructions
- Helm Deployment Guide - Helm chart configuration
- Architecture Improvements - Recent architecture enhancements
- OmniCore Architecture - Core engine architecture
- Deployment Guide - General deployment instructions
- Monitoring Guide - Kafka and monitoring setup
For scalability questions or issues:
- Architecture Review: Contact DevOps team
- Performance Issues: Check Grafana dashboards first
- Scaling Issues: Review HPA status and metrics
- Resource Issues: Check pod resource usage and limits
Email: support@novatraxlabs.com
Issues: /issues
Document Version: 1.0.0
Last Updated: 2026-02-11
Maintained By: Novatrax Labs DevOps Team