Production Hardening Guide

Complete guide to deploying Lynkr in production with 14 hardening features for reliability, observability, and security.

Overview

Lynkr includes 16 production-ready features:

Reliability: Circuit breakers, retries, load shedding, graceful shutdown
Observability: Prometheus metrics, structured logging, health checks, routing telemetry
Intelligence: Graphify code analysis, Distill compression, quality scoring
Security: Input validation, policy enforcement, sandboxing
Performance: Minimal overhead (~7μs), 140K req/sec throughput

Reliability Features

1. Circuit Breaker Pattern

Protects against cascading failures to external services.

States:

CLOSED - Normal operation
OPEN - Failing fast (provider down)
HALF_OPEN - Testing recovery

Configuration:

# Failures before opening circuit
CIRCUIT_BREAKER_FAILURE_THRESHOLD=5  # default: 5

# Successes needed to close from half-open
CIRCUIT_BREAKER_SUCCESS_THRESHOLD=2  # default: 2

# Time before attempting recovery (ms)
CIRCUIT_BREAKER_TIMEOUT=60000  # default: 60000 (1 min)

How it works:

5 failures → Circuit OPEN
Wait 60 seconds
Try 1 request → Circuit HALF_OPEN
2 successes → Circuit CLOSED

2. Exponential Backoff with Jitter

Automatic retries for transient failures.

Configuration:

# Max retry attempts
API_RETRY_MAX_RETRIES=3  # default: 3

# Initial retry delay (ms)
API_RETRY_INITIAL_DELAY=1000  # default: 1000

# Maximum retry delay (ms)
API_RETRY_MAX_DELAY=30000  # default: 30000

Retry schedule:

Attempt 1: Immediate
Attempt 2: 1s + jitter (±500ms)
Attempt 3: 2s + jitter (±1s)
Attempt 4: 4s + jitter (±2s)

Retryable errors:

5xx status codes
Network timeouts
Connection errors

Non-retryable errors:

4xx status codes
Authentication errors
Validation errors

3. Load Shedding

Proactive request rejection when system is overloaded.

Configuration:

# Memory usage threshold (0-1)
LOAD_SHEDDING_MEMORY_THRESHOLD=0.85  # default: 0.85 (85%)

# Heap usage threshold (0-1)
LOAD_SHEDDING_HEAP_THRESHOLD=0.90  # default: 0.90 (90%)

# Max concurrent requests
LOAD_SHEDDING_ACTIVE_REQUESTS_THRESHOLD=1000  # default: 1000

Behavior:

Returns HTTP 503 during overload
Includes Retry-After header
Cached state (1s) for performance

Monitoring:

curl http://localhost:8081/metrics | grep lynkr_load_shedding

4. Graceful Shutdown

Zero-downtime deployments.

Configuration:

# Shutdown timeout (ms)
GRACEFUL_SHUTDOWN_TIMEOUT=30000  # default: 30000 (30s)

Sequence:

Receive SIGTERM/SIGINT
Stop accepting new requests
Complete in-flight requests (max 30s)
Close database connections
Exit

Kubernetes:

spec:
  containers:
  - name: lynkr
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]
    terminationGracePeriodSeconds: 35

Observability

5. Prometheus Metrics

Comprehensive metrics collection.

Endpoint:

curl http://localhost:8081/metrics

Request Metrics:

# Request rate
lynkr_requests_total{provider="databricks",status="200"} 1234

# Latency histogram
lynkr_request_duration_seconds_bucket{provider="databricks",le="0.5"} 980
lynkr_request_duration_seconds_bucket{provider="databricks",le="1"} 1200
lynkr_request_duration_seconds_sum 1234.5
lynkr_request_duration_seconds_count 1234

# Error rate
lynkr_errors_total{provider="databricks",type="timeout"} 12

Token Metrics:

# Token usage
lynkr_tokens_input_total{provider="databricks"} 5000000
lynkr_tokens_output_total{provider="databricks"} 500000
lynkr_tokens_cached_total 2000000

# Cache hits
lynkr_cache_hits_total 850
lynkr_cache_misses_total 150

System Metrics:

# Memory usage
process_resident_memory_bytes 104857600
nodejs_heap_size_used_bytes 52428800

# Circuit breaker state
lynkr_circuit_breaker_state{provider="databricks",state="closed"} 1

# Active requests
lynkr_active_requests 42

Configuration:

METRICS_ENABLED=true  # default: true

6. Structured Logging

JSON logs with request ID correlation via Pino.

Log Level Philosophy:

info — Meaningful milestones: request received (minimal), request completed (duration + tokens), errors, retries, fallbacks
debug — Operational details: request body previews, tool injection, streaming chunks, intermediate conversions, tool mapping

Console Configuration:

LOG_LEVEL=info                  # options: error, warn, info, debug (default: info)
REQUEST_LOGGING_ENABLED=true    # default: true

In development mode (NODE_ENV=development), logs are pretty-printed via pino-pretty.

File Logging (optional):

Persistent log files with automatic daily rotation via pino-roll. Enable by setting LOG_FILE_ENABLED=true.

LOG_FILE_ENABLED=true           # default: false
LOG_FILE_PATH=./logs/lynkr.log  # default: <cwd>/logs/lynkr.log
LOG_FILE_LEVEL=debug            # default: debug (captures all levels)
LOG_FILE_FREQUENCY=daily        # options: daily, hourly, custom (default: daily)
LOG_FILE_MAX_FILES=14           # rotated files to keep (default: 14)

Rotated files are named with timestamps (e.g., lynkr.log.2025-07-12). The log directory is created automatically.

Log format (JSON):

{
  "level": "info",
  "time": 1705123456789,
  "msg": "Request processed",
  "requestId": "req_abc123",
  "provider": "databricks",
  "statusCode": 200,
  "duration": 1250,
  "tokens": {
    "input": 1250,
    "output": 234,
    "cached": 750
  }
}

Querying log files:

# Tail live logs
tail -f ./logs/lynkr.log | npx pino-pretty

# Find errors in the last 24 hours
cat ./logs/lynkr.log | jq 'select(.level >= 50)'

# Filter by provider
cat ./logs/lynkr.log | jq 'select(.provider == "databricks")'

# Search for slow requests (>2s)
cat ./logs/lynkr.log | jq 'select(.duration > 2000)'

Log aggregation:

Stdout — Captured by Docker/K8s log drivers
File rotation — For standalone deployments or local debugging
External — Forward JSON logs to Elasticsearch, Splunk, Grafana Loki, etc.

7. Health Checks

Kubernetes-ready health endpoints.

Liveness Probe:

curl http://localhost:8081/health/live

# Returns:
{
  "status": "ok",
  "provider": "databricks",
  "timestamp": "2026-01-12T00:00:00.000Z"
}

Readiness Probe:

curl http://localhost:8081/health/ready

# Returns:
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "provider": "ok"
  }
}

Deep Health Check:

curl "http://localhost:8081/health/ready?deep=true"

# Returns:
{
  "status": "ready",
  "checks": {
    "database": "ok",
    "provider": "ok",
    "memory": {"used": "50%", "status": "ok"},
    "circuit_breaker": {"state": "closed", "status": "ok"}
  }
}

Kubernetes:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 5

Configuration:

HEALTH_CHECK_ENABLED=true  # default: true

Security

8. Input Validation

Zero-dependency schema validation.

Validates:

Request body structure
Required fields
Field types
Value constraints

Example:

// Invalid request
{
  "model": 123,  // Should be string
  "max_tokens": -1  // Should be positive
}

// Returns 400 Bad Request
{
  "error": "Invalid request",
  "details": [
    "model must be string",
    "max_tokens must be positive"
  ]
}

9. Policy Enforcement

Environment-driven guardrails.

Git Policies:

# Allow git push (default: disabled)
POLICY_GIT_ALLOW_PUSH=false

# Require tests before commit (default: disabled)
POLICY_GIT_REQUIRE_TESTS=false

# Custom test command
POLICY_GIT_TEST_COMMAND="npm test"

Web Fetch Policies:

# Allowed hosts for web_fetch tool
WEB_SEARCH_ALLOWED_HOSTS=github.com,stackoverflow.com

# Web search endpoint
WEB_SEARCH_ENDPOINT=http://localhost:8888/search

Workspace Policies:

# Workspace root directory
WORKSPACE_ROOT=/path/to/projects

# Max agent loop iterations
POLICY_MAX_STEPS=8

10. Sandboxing

Optional Docker isolation for MCP tools.

Configuration:

# Enable MCP sandbox
MCP_SANDBOX_ENABLED=true  # default: true

# Docker image for sandbox
MCP_SANDBOX_IMAGE=ubuntu:22.04

How it works:

MCP tool invoked
Launch Docker container
Execute tool in container
Return result
Destroy container

Benefits:

Isolated execution
Resource limits
No host access
Safe for untrusted tools

Deployment

Kubernetes

deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lynkr
spec:
  replicas: 3
  selector:
    matchLabels:
      app: lynkr
  template:
    metadata:
      labels:
        app: lynkr
    spec:
      containers:
      - name: lynkr
        image: lynkr:latest
        ports:
        - containerPort: 8081
        env:
        - name: MODEL_PROVIDER
          value: "databricks"
        - name: DATABRICKS_API_KEY
          valueFrom:
            secretKeyRef:
              name: lynkr-secrets
              key: databricks-api-key
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "2"
            memory: "2Gi"
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8081
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: lynkr
spec:
  selector:
    app: lynkr
  ports:
  - port: 80
    targetPort: 8081
  type: LoadBalancer

Docker Compose

See Docker Deployment Guide for complete setup.

Systemd

lynkr.service:

[Unit]
Description=Lynkr Proxy
After=network.target

[Service]
Type=simple
User=lynkr
WorkingDirectory=/opt/lynkr
EnvironmentFile=/etc/lynkr/lynkr.env
ExecStart=/usr/bin/node /opt/lynkr/index.js
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

sudo systemctl enable lynkr
sudo systemctl start lynkr
sudo journalctl -u lynkr -f

Monitoring

Prometheus

prometheus.yml:

scrape_configs:
  - job_name: 'lynkr'
    static_configs:
      - targets: ['localhost:8081']
    metrics_path: '/metrics'
    scrape_interval: 15s

Grafana Dashboard

Key metrics to monitor:

Request rate (req/sec)
Latency percentiles (p50, p95, p99)
Error rate
Token usage
Cache hit rate
Circuit breaker state
Memory usage

Sample queries:

# Request rate
rate(lynkr_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(lynkr_request_duration_seconds_bucket[5m]))

# Error rate
rate(lynkr_errors_total[5m]) / rate(lynkr_requests_total[5m])

# Cache hit rate
lynkr_cache_hits_total / (lynkr_cache_hits_total + lynkr_cache_misses_total)

Best Practices

1. Use Reverse Proxy

server {
    listen 443 ssl;
    server_name lynkr.example.com;

    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location / {
        proxy_pass http://localhost:8081;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

2. Set Resource Limits

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "2Gi"

3. Enable All Hardening Features

CIRCUIT_BREAKER_FAILURE_THRESHOLD=5
LOAD_SHEDDING_MEMORY_THRESHOLD=0.85
GRACEFUL_SHUTDOWN_TIMEOUT=30000
METRICS_ENABLED=true
HEALTH_CHECK_ENABLED=true

4. Monitor Metrics

Set up Prometheus + Grafana
Alert on high error rates
Alert on high latency
Monitor token usage

5. Rotate Secrets

# Rotate API keys regularly
kubectl create secret generic lynkr-secrets \
  --from-literal=databricks-api-key=new-key \
  --dry-run=client -o yaml | kubectl apply -f -

# Rollout restart
kubectl rollout restart deployment/lynkr

Next Steps

Docker Deployment - Docker setup
API Reference - API endpoints
Troubleshooting - Common issues

Getting Help

GitHub Discussions - Ask questions
GitHub Issues - Report issues

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Production Hardening Guide

Overview

Reliability Features

1. Circuit Breaker Pattern

2. Exponential Backoff with Jitter

3. Load Shedding

4. Graceful Shutdown

Observability

5. Prometheus Metrics

6. Structured Logging

7. Health Checks

Security

8. Input Validation

9. Policy Enforcement

10. Sandboxing

Deployment

Kubernetes

Docker Compose

Systemd

Monitoring

Prometheus

Grafana Dashboard

Best Practices

1. Use Reverse Proxy

2. Set Resource Limits

3. Enable All Hardening Features

4. Monitor Metrics

5. Rotate Secrets

Next Steps

Getting Help

Uh oh!

FilesExpand file tree

production.md

Latest commit

History

production.md

File metadata and controls

Production Hardening Guide

Overview

Reliability Features

1. Circuit Breaker Pattern

2. Exponential Backoff with Jitter

3. Load Shedding

4. Graceful Shutdown

Observability

5. Prometheus Metrics

6. Structured Logging

7. Health Checks

Security

8. Input Validation

9. Policy Enforcement

10. Sandboxing

Deployment

Kubernetes

Docker Compose

Systemd

Monitoring

Prometheus

Grafana Dashboard

Best Practices

1. Use Reverse Proxy

2. Set Resource Limits

3. Enable All Hardening Features

4. Monitor Metrics

5. Rotate Secrets

Next Steps

Getting Help