[STORY] Auto-Updater: Fix ENOMEM Fork Failures via Memory Overcommit and Swap Configuration #356

@jsbattig

User Story

As a CIDX server operator,
I want the auto-updater to configure the Linux kernel's memory overcommit policy and provision a swap file,
So that subprocess.run() calls (git, pip, systemctl) no longer fail with ENOMEM when the Python process has large mmap'd virtual memory (HNSW indexes, SQLite DBs).

Context

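Production servers run with vm.overcommit_memory=0 (heuristic mode). In this mode the kernel refuses allocations it judges to be obvious overcommits, and fork() has to account for the parent's address space, so fork() is refused once the process's virtual size is far beyond what the host can back, even though most of that size is mmap'd, disk-backed data rather than resident RAM. (CommitLimit is strictly enforced only in mode 2, but it is a useful indicator of how little headroom the host has.) The CIDX server process reaches ~57GB VmPeak from mmap'd HNSW indexes and SQLite databases, while CommitLimit is ~8GB (no swap, overcommit_ratio=50). As a result, every subprocess.run() call fails with OSError: [Errno 12] Cannot allocate memory, which breaks git pull, pip install, systemctl restart, and all other deployment operations.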

Setting vm.overcommit_memory=1 (always overcommit) tells the kernel to allow fork() regardless of virtual memory size. This is safe for these calls because the child created by subprocess.run() calls exec() immediately and never touches the copy-on-write pages it inherits from the parent. The setting is system-wide, not per-process. Adding a 4GB swap file provides an additional safety net against the OOM killer under real memory pressure.
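
To check whether a host shows the condition described above, the relevant counters can be read from /proc. The following is a minimal diagnostic sketch and is not part of the planned change; the function names parse_kb_fields and memory_snapshot are hypothetical, and the script assumes a Linux host.

```python
from __future__ import annotations

import re
from pathlib import Path


def parse_kb_fields(text: str) -> dict[str, int]:
    """Parse 'Name:   12345 kB' lines into {name: value_in_kB}.

    Works for both /proc/meminfo and /proc/<pid>/status; lines that do not
    end in a kB value (for example HugePages_Total) are skipped.
    """
    fields: dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s+kB\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def memory_snapshot(pid: str = "self") -> dict[str, int | str]:
    """Collect the values discussed in this issue for one process."""
    meminfo = parse_kb_fields(Path("/proc/meminfo").read_text())
    status = parse_kb_fields(Path(f"/proc/{pid}/status").read_text())
    mode = Path("/proc/sys/vm/overcommit_memory").read_text().strip()
    return {
        "overcommit_memory": mode,
        "CommitLimit_kB": meminfo.get("CommitLimit", 0),
        "Committed_AS_kB": meminfo.get("Committed_AS", 0),
        "SwapTotal_kB": meminfo.get("SwapTotal", 0),
        "VmPeak_kB": status.get("VmPeak", 0),
        "VmSize_kB": status.get("VmSize", 0),
    }
```

Calling memory_snapshot() with the CIDX server's PID on an affected host would be expected to show VmPeak_kB far above CommitLimit_kB and SwapTotal_kB equal to 0.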

Conversation references: User confirmed root cause is vm.overcommit_memory=0 + high VmPeak from mmap. User explicitly stated "all we can do in auto-updater, let's do it there." Solution agreed: two new _ensure_* methods in DeploymentExecutor.

Implementation Status

  • 1. _ensure_memory_overcommit() method - New idempotent method in DeploymentExecutor that checks current vm.overcommit_memory value and configures persistent sysctl override to 1 if not already set
  • 2. _ensure_swap_file() method - New idempotent method in DeploymentExecutor that checks for existing swap via swapon --show and creates a 4GB /swapfile with fstab persistence if none exists
  • 3. Wire into execute() method - Add Step 9 (_ensure_memory_overcommit) and Step 10 (_ensure_swap_file) after existing Step 8 (_ensure_sudoers_restart), both as non-fatal warnings
  • 4. Error codes and logging - Assign unique DEPLOY-GENERAL-090 through DEPLOY-GENERAL-100 error codes following the established format_error_log pattern with get_correlation_id()
  • 5. Unit tests for _ensure_memory_overcommit() - Test all idempotent paths: already-configured skip, successful configuration, write failure, apply failure, exception handling
  • 6. Unit tests for _ensure_swap_file() - Test all idempotent paths: swap-exists skip, full creation sequence, each subprocess step failure, fstab already-contains check, fstab append failure (non-fatal)
  • 7. Unit tests for execute() wiring - Verify new steps called in correct order after Step 8, verify non-fatal continuation on failure
  • 8. Integration validation on staging - Deploy to staging server, verify both configurations active, verify idempotent re-run, verify reboot persistence

Algorithm

_ensure_memory_overcommit() - Idempotent sysctl configuration

1. Read current value: subprocess.run(["sysctl", "-n", "vm.overcommit_memory"])
2. If current value is "1":
   a. Log debug "Memory overcommit already configured"
   b. Return True (idempotent skip)
3. If current value is not "1":
   a. Write config file: subprocess.run(["sudo", "tee", "/etc/sysctl.d/99-cidx-memory.conf"],
      input="vm.overcommit_memory = 1\n")
   b. If write fails: log error DEPLOY-GENERAL-090, return False
   c. Apply immediately: subprocess.run(["sudo", "sysctl", "-p",
      "/etc/sysctl.d/99-cidx-memory.conf"])
   d. If apply fails: log error DEPLOY-GENERAL-091, return False
   e. Log info "Configured vm.overcommit_memory=1 for fork safety"
   f. Return True
4. On any exception: log error DEPLOY-GENERAL-092, return False
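
The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the DeploymentExecutor implementation: it is written as a free function named ensure_memory_overcommit, it takes an injectable run parameter (hypothetical, added to make the sketch testable), and it logs with the plain logging module instead of the project's format_error_log and get_correlation_id helpers.

```python
import logging
import subprocess
from typing import Callable

logger = logging.getLogger(__name__)

SYSCTL_CONF = "/etc/sysctl.d/99-cidx-memory.conf"


def ensure_memory_overcommit(run: Callable = subprocess.run) -> bool:
    """Set vm.overcommit_memory=1 persistently; return True on success or skip."""
    try:
        # Step 1: read the current runtime value.
        current = run(
            ["sysctl", "-n", "vm.overcommit_memory"],
            capture_output=True, text=True,
        )
        # Step 2: idempotent skip when already configured.
        if current.returncode == 0 and current.stdout.strip() == "1":
            logger.debug("Memory overcommit already configured")
            return True

        # Step 3a/3b: write the persistent drop-in file through sudo tee.
        write = run(
            ["sudo", "tee", SYSCTL_CONF],
            input="vm.overcommit_memory = 1\n",
            capture_output=True, text=True,
        )
        if write.returncode != 0:
            logger.error("DEPLOY-GENERAL-090: failed to write %s: %s",
                         SYSCTL_CONF, write.stderr)
            return False

        # Step 3c/3d: apply the file to the running kernel.
        apply = run(
            ["sudo", "sysctl", "-p", SYSCTL_CONF],
            capture_output=True, text=True,
        )
        if apply.returncode != 0:
            logger.error("DEPLOY-GENERAL-091: sysctl -p failed: %s", apply.stderr)
            return False

        logger.info("Configured vm.overcommit_memory=1 for fork safety")
        return True
    except Exception as exc:  # Step 4: never let this abort the deployment.
        logger.error("DEPLOY-GENERAL-092: unexpected error: %s", exc)
        return False
```

One consequence worth noting: because the initial sysctl call itself goes through fork(), it can raise ENOMEM on an affected host. In this sketch that lands in the except branch and returns False, which matches the non-fatal behavior required by the story.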

_ensure_swap_file() - Idempotent swap provisioning

1. Check existing swap: subprocess.run(["swapon", "--show", "--noheadings"])
2. If stdout is non-empty (swap exists):
   a. Log debug "Swap already configured: {stdout.strip()}"
   b. Return True (idempotent skip)
3. If no swap exists:
   a. Allocate: subprocess.run(["sudo", "fallocate", "-l", "4G", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-093, return False
   b. Permissions: subprocess.run(["sudo", "chmod", "600", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-094, return False
   c. Format: subprocess.run(["sudo", "mkswap", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-095, return False
   d. Enable: subprocess.run(["sudo", "swapon", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-096, return False
   e. Check fstab persistence:
      - Read /etc/fstab via subprocess.run(["cat", "/etc/fstab"])
      - If "/swapfile" not in content:
        * Append: subprocess.run(["sudo", "tee", "-a", "/etc/fstab"],
          input="/swapfile none swap sw 0 0\n")
        * If fails: log warning DEPLOY-GENERAL-097 (non-fatal - swap is active
          but will not survive reboot)
   f. Log info "Created and enabled 4GB swap file"
   g. Return True
4. On any exception: log error DEPLOY-GENERAL-098, return False
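
A possible Python rendering of the swap steps is shown below. As with any sketch of this story, the names are assumptions: ensure_swap_file is a free function rather than the real method, the run parameter is a hypothetical injection point for testing, and plain logging stands in for format_error_log. The four creation commands are kept in a table so that each maps to exactly one error code.

```python
import logging
import subprocess
from typing import Callable

logger = logging.getLogger(__name__)

SWAPFILE = "/swapfile"
FSTAB_ENTRY = "/swapfile none swap sw 0 0\n"

# (error code, command) pairs, executed in order; the first failure aborts.
CREATE_STEPS = [
    ("DEPLOY-GENERAL-093", ["sudo", "fallocate", "-l", "4G", SWAPFILE]),
    ("DEPLOY-GENERAL-094", ["sudo", "chmod", "600", SWAPFILE]),
    ("DEPLOY-GENERAL-095", ["sudo", "mkswap", SWAPFILE]),
    ("DEPLOY-GENERAL-096", ["sudo", "swapon", SWAPFILE]),
]


def ensure_swap_file(run: Callable = subprocess.run) -> bool:
    """Create and enable a 4GB swap file unless some swap is already active."""
    try:
        existing = run(["swapon", "--show", "--noheadings"],
                       capture_output=True, text=True)
        if existing.stdout.strip():
            logger.debug("Swap already configured: %s", existing.stdout.strip())
            return True

        for code, cmd in CREATE_STEPS:
            result = run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                logger.error("%s: %s failed: %s",
                             code, " ".join(cmd), result.stderr)
                return False

        # Persist across reboots; a failure here is only a warning because
        # the swap file is already active at this point.
        fstab = run(["cat", "/etc/fstab"], capture_output=True, text=True)
        if SWAPFILE not in fstab.stdout:
            append = run(["sudo", "tee", "-a", "/etc/fstab"],
                         input=FSTAB_ENTRY, capture_output=True, text=True)
            if append.returncode != 0:
                logger.warning("DEPLOY-GENERAL-097: fstab append failed; "
                               "swap will not survive reboot: %s", append.stderr)

        logger.info("Created and enabled 4GB swap file")
        return True
    except Exception as exc:
        logger.error("DEPLOY-GENERAL-098: unexpected error: %s", exc)
        return False
```

On some filesystems swapon rejects files created with fallocate; if DEPLOY-GENERAL-096 shows up on staging, that is one thing to check.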

Acceptance Criteria

AC1: Memory overcommit is configured idempotently

Scenario: First run on unconfigured server
  Given the auto-updater runs on a server with vm.overcommit_memory=0
  When DeploymentExecutor.execute() runs
  Then /etc/sysctl.d/99-cidx-memory.conf contains "vm.overcommit_memory = 1"
  And sysctl vm.overcommit_memory returns 1
  And the setting survives a server reboot

Scenario: Subsequent run on already-configured server
  Given the auto-updater runs on a server with vm.overcommit_memory=1 already
  When DeploymentExecutor.execute() runs
  Then the method returns True without writing any files
  And only a debug-level log message is emitted

AC2: Swap file is created idempotently

Scenario: First run on server with no swap
  Given the auto-updater runs on a server with no swap configured
  When DeploymentExecutor.execute() runs
  Then a 4GB /swapfile exists with permissions 0600
  And swapon --show lists /swapfile as active swap
  And /etc/fstab contains the /swapfile entry for reboot persistence
  And the swap survives a server reboot

Scenario: Subsequent run on server with existing swap
  Given the auto-updater runs on a server where swap already exists
  When DeploymentExecutor.execute() runs
  Then the method returns True without creating a new swap file
  And only a debug-level log message is emitted

AC3: Failures are non-fatal to deployment

Scenario: Memory overcommit configuration fails
  Given the auto-updater runs and _ensure_memory_overcommit() fails
  When DeploymentExecutor.execute() continues
  Then a warning is logged with DEPLOY-GENERAL-099 error code
  And deployment continues to the swap file step
  And the server restart still proceeds normally

Scenario: Swap file creation fails
  Given the auto-updater runs and _ensure_swap_file() fails
  When DeploymentExecutor.execute() continues
  Then a warning is logged with DEPLOY-GENERAL-100 error code
  And deployment completes successfully (server restart proceeds)

AC4: Methods follow established DeploymentExecutor patterns

Scenario: Code consistency with existing _ensure_* methods
  Given the new methods are implemented
  Then each uses subprocess.run() with capture_output=True and text=True
  And each wraps in try/except with format_error_log using unique DEPLOY-GENERAL-09x codes
  And each returns bool (True on success or already-configured, False on error)
  And each logs via the module logger with correlation_id from get_correlation_id()
  And each is called from execute() after Step 8 (_ensure_sudoers_restart)
  And neither method aborts the deployment on failure

Testing Requirements

Unit Tests

File: tests/unit/auto_update/test_deployment_executor_memory.py

Tests for _ensure_memory_overcommit():

  • test_memory_overcommit_already_configured - When sysctl returns "1", method returns True without writing config file
  • test_memory_overcommit_configures_successfully - When sysctl returns "0", writes config file via sudo tee, applies via sysctl -p, returns True
  • test_memory_overcommit_write_failure - When sudo tee returns non-zero, logs DEPLOY-GENERAL-090, returns False
  • test_memory_overcommit_apply_failure - When sysctl -p returns non-zero, logs DEPLOY-GENERAL-091, returns False
  • test_memory_overcommit_exception_handling - When subprocess raises unexpected exception, logs DEPLOY-GENERAL-092, returns False

Tests for _ensure_swap_file():

  • test_swap_already_exists - When swapon --show returns non-empty output, method returns True without any creation steps
  • test_swap_creates_full_sequence - When no swap exists, executes fallocate, chmod, mkswap, swapon, fstab check/append in order, returns True
  • test_swap_fallocate_failure - When fallocate returns non-zero, logs DEPLOY-GENERAL-093, returns False
  • test_swap_chmod_failure - When chmod returns non-zero, logs DEPLOY-GENERAL-094, returns False
  • test_swap_mkswap_failure - When mkswap returns non-zero, logs DEPLOY-GENERAL-095, returns False
  • test_swap_swapon_failure - When swapon returns non-zero, logs DEPLOY-GENERAL-096, returns False
  • test_swap_fstab_already_contains_entry - When /etc/fstab already contains "/swapfile", does not append duplicate entry
  • test_swap_fstab_append_failure_non_fatal - When fstab tee -a fails, logs warning DEPLOY-GENERAL-097, still returns True (swap is active, just not reboot-persistent)
  • test_swap_exception_handling - When subprocess raises unexpected exception, logs DEPLOY-GENERAL-098, returns False

Tests for execute() wiring:

  • test_execute_calls_memory_overcommit_after_sudoers - Verify _ensure_memory_overcommit is called in execute() after _ensure_sudoers_restart
  • test_execute_calls_swap_file_after_memory_overcommit - Verify _ensure_swap_file is called after _ensure_memory_overcommit
  • test_execute_continues_on_memory_overcommit_failure - When _ensure_memory_overcommit returns False, execute() still continues and returns True
  • test_execute_continues_on_swap_file_failure - When _ensure_swap_file returns False, execute() still continues and returns True
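
One way the wiring tests could verify call order is to attach the patched methods to a single recording mock and inspect the combined call list. The sketch below uses a hypothetical ExecutorStandIn class in place of the real DeploymentExecutor, whose constructor and remaining steps are not shown in this issue; only the unittest.mock usage is the point.

```python
from unittest import mock


class ExecutorStandIn:
    """Hypothetical stand-in that mimics only Steps 8-10 of execute()."""

    def _ensure_sudoers_restart(self) -> bool:
        return True

    def _ensure_memory_overcommit(self) -> bool:
        return True

    def _ensure_swap_file(self) -> bool:
        return True

    def execute(self) -> bool:
        self._ensure_sudoers_restart()
        if not self._ensure_memory_overcommit():
            pass  # the real code would log DEPLOY-GENERAL-099 here
        if not self._ensure_swap_file():
            pass  # the real code would log DEPLOY-GENERAL-100 here
        return True


def run_with_recorded_steps(executor, overcommit_ok: bool, swap_ok: bool):
    """Run execute() with the three steps mocked; return (result, call order)."""
    recorder = mock.Mock()
    with mock.patch.object(executor, "_ensure_sudoers_restart",
                           return_value=True) as sudoers, \
         mock.patch.object(executor, "_ensure_memory_overcommit",
                           return_value=overcommit_ok) as overcommit, \
         mock.patch.object(executor, "_ensure_swap_file",
                           return_value=swap_ok) as swap:
        # Attaching the mocks makes the recorder log their calls in order.
        recorder.attach_mock(sudoers, "sudoers")
        recorder.attach_mock(overcommit, "overcommit")
        recorder.attach_mock(swap, "swap")
        result = executor.execute()
    return result, [call[0] for call in recorder.mock_calls]
```

With both new steps forced to fail, the expected outcome is that execute() still returns True and the recorded order is sudoers, overcommit, swap.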

Integration / E2E Testing (Manual via staging server)

  1. Deploy to staging server (.20) via development -> staging merge
  2. Verify /etc/sysctl.d/99-cidx-memory.conf is created with content vm.overcommit_memory = 1
  3. Verify sysctl -n vm.overcommit_memory returns 1
  4. Verify swapon --show lists /swapfile as active swap with ~4GB size
  5. Verify /etc/fstab contains line /swapfile none swap sw 0 0
  6. Reboot staging server, verify both settings persist after reboot
  7. Run auto-updater again, verify idempotent skip in logs (debug messages only, no writes)
  8. Verify CIDX server subprocess operations (git pull, systemctl restart) complete without ENOMEM
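
Steps 4 and 5 of the manual checklist can be partly scripted. The helpers below are a sketch with hypothetical names (fstab_has_swap_entry, active_swap_names); they operate on text that the operator has already captured from cat /etc/fstab and swapon --show --noheadings. Unlike a plain substring check, the fstab helper ignores commented-out lines, so it can catch the case where a /swapfile entry exists but is disabled.

```python
from __future__ import annotations


def fstab_has_swap_entry(fstab_text: str, path: str = "/swapfile") -> bool:
    """Return True if an uncommented fstab line mounts `path` as swap."""
    for line in fstab_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields:
        # <device> <mount point> <type> <options> <dump> <pass>
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[0] == path and fields[2] == "swap":
            return True
    return False


def active_swap_names(swapon_output: str) -> list[str]:
    """Extract the NAME column from `swapon --show --noheadings` output."""
    return [line.split()[0] for line in swapon_output.splitlines() if line.strip()]
```

For example, active_swap_names applied to the swapon output should contain "/swapfile" after the first deployment, and fstab_has_swap_entry should return True both before and after the reboot in step 6.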

Technical Notes

Error Code Allocation

| Code | Method | Condition |
|---|---|---|
| DEPLOY-GENERAL-090 | _ensure_memory_overcommit | Failed to write /etc/sysctl.d/99-cidx-memory.conf |
| DEPLOY-GENERAL-091 | _ensure_memory_overcommit | Failed to apply sysctl config via sysctl -p |
| DEPLOY-GENERAL-092 | _ensure_memory_overcommit | Unexpected exception |
| DEPLOY-GENERAL-093 | _ensure_swap_file | fallocate -l 4G /swapfile failed |
| DEPLOY-GENERAL-094 | _ensure_swap_file | chmod 600 /swapfile failed |
| DEPLOY-GENERAL-095 | _ensure_swap_file | mkswap /swapfile failed |
| DEPLOY-GENERAL-096 | _ensure_swap_file | swapon /swapfile failed |
| DEPLOY-GENERAL-097 | _ensure_swap_file | fstab append failed (warning level, non-fatal) |
| DEPLOY-GENERAL-098 | _ensure_swap_file | Unexpected exception |
| DEPLOY-GENERAL-099 | execute | Warning when _ensure_memory_overcommit returns False |
| DEPLOY-GENERAL-100 | execute | Warning when _ensure_swap_file returns False |
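
Since the allocation above requires every code to be used at exactly one log site, a small scan over the source text can flag reuse or gaps. This is a sketch only: the helper names are hypothetical, and it assumes each code appears as a literal DEPLOY-GENERAL-NNN string in the file being checked.

```python
from __future__ import annotations

import re
from collections import Counter

CODE_PATTERN = re.compile(r"\bDEPLOY-GENERAL-(\d{3})\b")


def find_duplicate_codes(source: str) -> dict[str, int]:
    """Return {code: occurrences} for every code that appears more than once."""
    counts = Counter(m.group(0) for m in CODE_PATTERN.finditer(source))
    return {code: n for code, n in sorted(counts.items()) if n > 1}


def missing_codes(source: str, first: int = 90, last: int = 100) -> list[str]:
    """Return codes in the allocated range that never appear in the source."""
    found = {int(m.group(1)) for m in CODE_PATTERN.finditer(source)}
    return [
        f"DEPLOY-GENERAL-{n:03d}"
        for n in range(first, last + 1)
        if n not in found
    ]
```

Running both helpers over the text of deployment_executor.py after the change should ideally yield an empty dict and an empty list for the 090 to 100 range.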

execute() Wiring Pattern

The new steps follow the exact same pattern as Step 8 (_ensure_sudoers_restart) -- non-fatal warnings that do not abort the deployment:

# Step 9: Ensure vm.overcommit_memory=1 for fork safety
if not self._ensure_memory_overcommit():
    logger.warning(
        format_error_log(
            "DEPLOY-GENERAL-099",
            "Memory overcommit could not be configured - "
            "subprocess fork may fail with ENOMEM on high VmPeak",
            extra={"correlation_id": get_correlation_id()},
        )
    )

# Step 10: Ensure swap file exists as safety net
if not self._ensure_swap_file():
    logger.warning(
        format_error_log(
            "DEPLOY-GENERAL-100",
            "Swap file could not be created - "
            "no swap safety net for OOM conditions",
            extra={"correlation_id": get_correlation_id()},
        )
    )

Files Modified

| File | Change |
|---|---|
| src/code_indexer/server/auto_update/deployment_executor.py | Add _ensure_memory_overcommit() and _ensure_swap_file() methods; wire into execute() as Steps 9-10 |
| tests/unit/auto_update/test_deployment_executor_memory.py | New test file with ~18 unit tests covering all idempotent paths and failure modes |

Key Constraints (from conversation)

  • Both methods are non-fatal: deployment continues even if they fail (warning logged only)
  • Both methods are idempotent: safe to run on every deployment cycle with no side effects
  • Both methods use sudo: the auto-updater service runs as root or has sudo access
  • Swap file path is /swapfile (standard Linux convention)
  • Sysctl config file is /etc/sysctl.d/99-cidx-memory.conf (the 99 prefix makes it load late in lexical order, so it overrides earlier files; CIDX-namespaced)
  • All subprocess calls use capture_output=True, text=True per established pattern
  • Explicitly excluded from scope: No changes to subprocess.run() callers, no posix_spawn, no process launcher daemon, no fork health dashboard, no ENOMEM retry wrapper, no Python version changes, no worker concurrency reduction

Definition of Done

  • _ensure_memory_overcommit() implemented following _ensure_sudoers_restart() pattern
  • _ensure_swap_file() implemented following _ensure_sudoers_restart() pattern
  • Both methods wired into execute() as Step 9 and Step 10 after existing Step 8
  • All error codes DEPLOY-GENERAL-090 through DEPLOY-GENERAL-100 are unique and not duplicated elsewhere
  • Unit tests pass for all idempotent paths (already-configured, successful-config, each failure mode)
  • Unit tests pass for execute() wiring order and non-fatal continuation behavior
  • fast-automation.sh passes with zero failures in under 10 minutes
  • Deployed to staging server (.20), both configurations verified active
  • Idempotent re-run verified on staging (no duplicate writes, debug logs only)
  • Reboot persistence verified on staging (both sysctl and swap survive reboot)
  • CIDX server subprocess operations (git, pip, systemctl) confirmed working without ENOMEM
