Description
Auto-Updater: Fix ENOMEM Fork Failures via Memory Overcommit and Swap Configuration
User Story
As a CIDX server operator,
I want the auto-updater to configure the Linux kernel's memory overcommit policy and provision a swap file,
So that subprocess.run() calls (git, pip, systemctl) no longer fail with ENOMEM when the Python process has large mmap'd virtual memory (HNSW indexes, SQLite DBs).
Context
Production servers with vm.overcommit_memory=0 (heuristic mode) refuse fork() when VmPeak exceeds CommitLimit, even though mmap'd memory is disk-backed and not actual RAM usage. The CIDX server process reaches ~57GB VmPeak from mmap'd HNSW indexes and SQLite databases, while CommitLimit is ~8GB (no swap, overcommit_ratio=50%). This causes all subprocess.run() calls to fail with OSError: [Errno 12] Cannot allocate memory, breaking git pull, pip install, systemctl restart, and all deployment operations.
Setting vm.overcommit_memory=1 (always overcommit) tells the kernel to allow fork() regardless of virtual memory size. This is safe here because the child process exec's immediately via subprocess and never actually uses the parent's memory pages. Adding a 4GB swap file provides an additional safety net against OOM conditions.
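As a rough check on the figures above, the kernel documentation defines CommitLimit as (RAM minus huge pages) times overcommit_ratio / 100, plus swap. The sketch below applies that formula. The 16 GiB RAM figure is an assumption chosen to be consistent with the ~8GB CommitLimit quoted above; it is not stated in this issue. The helper names are hypothetical.

```python
from __future__ import annotations


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}, values in kiB."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_limit_kib(
    mem_total_kib: int,
    swap_total_kib: int,
    overcommit_ratio: int,
    hugetlb_kib: int = 0,
) -> int:
    """CommitLimit = (RAM - huge pages) * ratio / 100 + swap, all in kiB."""
    return (mem_total_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_total_kib


GIB = 1024 * 1024  # kiB per GiB

# Assumed 16 GiB RAM, no swap, overcommit_ratio=50 (illustrative only).
print(commit_limit_kib(16 * GIB, 0, 50) // GIB)        # 8
# Same host after adding the 4GB swap file.
print(commit_limit_kib(16 * GIB, 4 * GIB, 50) // GIB)  # 12
```

On a live host, `parse_meminfo(open("/proc/meminfo").read())` would supply `MemTotal`, `SwapTotal`, and the kernel's own `CommitLimit` for comparison.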
Conversation references: User confirmed root cause is vm.overcommit_memory=0 + high VmPeak from mmap. User explicitly stated "all we can do in auto-updater, let's do it there." Solution agreed: two new _ensure_* methods in DeploymentExecutor.
Implementation Status
1. `_ensure_memory_overcommit()` method - New idempotent method in `DeploymentExecutor` that checks the current `vm.overcommit_memory` value and configures a persistent sysctl override to `1` if not already set
2. `_ensure_swap_file()` method - New idempotent method in `DeploymentExecutor` that checks for existing swap via `swapon --show` and creates a 4GB `/swapfile` with fstab persistence if none exists
3. Wire into `execute()` method - Add Step 9 (`_ensure_memory_overcommit`) and Step 10 (`_ensure_swap_file`) after existing Step 8 (`_ensure_sudoers_restart`), both as non-fatal warnings
4. Error codes and logging - Assign unique `DEPLOY-GENERAL-090` through `DEPLOY-GENERAL-100` error codes following the established `format_error_log` pattern with `get_correlation_id()`
5. Unit tests for `_ensure_memory_overcommit()` - Test all idempotent paths: already-configured skip, successful configuration, write failure, apply failure, exception handling
6. Unit tests for `_ensure_swap_file()` - Test all idempotent paths: swap-exists skip, full creation sequence, each subprocess step failure, fstab already-contains check, fstab append failure (non-fatal)
7. Unit tests for `execute()` wiring - Verify new steps called in correct order after Step 8, verify non-fatal continuation on failure
8. Integration validation on staging - Deploy to staging server, verify both configurations active, verify idempotent re-run, verify reboot persistence
Algorithm
_ensure_memory_overcommit() - Idempotent sysctl configuration
1. Read current value: subprocess.run(["sysctl", "-n", "vm.overcommit_memory"])
2. If current value is "1":
   a. Log debug "Memory overcommit already configured"
   b. Return True (idempotent skip)
3. If current value is not "1":
   a. Write config file: subprocess.run(["sudo", "tee", "/etc/sysctl.d/99-cidx-memory.conf"], input="vm.overcommit_memory = 1\n")
   b. If write fails: log error DEPLOY-GENERAL-090, return False
   c. Apply immediately: subprocess.run(["sudo", "sysctl", "-p", "/etc/sysctl.d/99-cidx-memory.conf"])
   d. If apply fails: log error DEPLOY-GENERAL-091, return False
   e. Log info "Configured vm.overcommit_memory=1 for fork safety"
   f. Return True
4. On any exception: log error DEPLOY-GENERAL-092, return False
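The steps above could be sketched in Python roughly as follows. This is a standalone illustration, not the actual `DeploymentExecutor` method: it is written as a free function, the injectable `run` parameter is an assumption added so the logic can be exercised without root, and plain logger calls stand in for the `format_error_log` / `get_correlation_id()` pattern whose signatures are not shown here.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

SYSCTL_CONF = "/etc/sysctl.d/99-cidx-memory.conf"


def ensure_memory_overcommit(run=subprocess.run) -> bool:
    """Idempotently set vm.overcommit_memory=1; return False on any failure."""
    try:
        # Step 1: read the current value.
        current = run(
            ["sysctl", "-n", "vm.overcommit_memory"],
            capture_output=True, text=True,
        )
        # Step 2: already configured, nothing to write.
        if current.returncode == 0 and current.stdout.strip() == "1":
            logger.debug("Memory overcommit already configured")
            return True

        # Step 3a/3b: persist the setting via sudo tee.
        write = run(
            ["sudo", "tee", SYSCTL_CONF],
            input="vm.overcommit_memory = 1\n",
            capture_output=True, text=True,
        )
        if write.returncode != 0:
            logger.error("DEPLOY-GENERAL-090: failed to write %s: %s",
                         SYSCTL_CONF, write.stderr)
            return False

        # Step 3c/3d: apply the file immediately.
        applied = run(
            ["sudo", "sysctl", "-p", SYSCTL_CONF],
            capture_output=True, text=True,
        )
        if applied.returncode != 0:
            logger.error("DEPLOY-GENERAL-091: sysctl -p failed: %s",
                         applied.stderr)
            return False

        logger.info("Configured vm.overcommit_memory=1 for fork safety")
        return True
    except Exception as exc:  # Step 4: never propagate to the caller.
        logger.error("DEPLOY-GENERAL-092: unexpected error: %s", exc)
        return False
```

Passing a fake `run` that returns objects with `returncode`, `stdout`, and `stderr` attributes is enough to drive each branch in a unit test.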
_ensure_swap_file() - Idempotent swap provisioning
1. Check existing swap: subprocess.run(["swapon", "--show", "--noheadings"])
2. If stdout is non-empty (swap exists):
   a. Log debug "Swap already configured: {stdout.strip()}"
   b. Return True (idempotent skip)
3. If no swap exists:
   a. Allocate: subprocess.run(["sudo", "fallocate", "-l", "4G", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-093, return False
   b. Permissions: subprocess.run(["sudo", "chmod", "600", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-094, return False
   c. Format: subprocess.run(["sudo", "mkswap", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-095, return False
   d. Enable: subprocess.run(["sudo", "swapon", "/swapfile"])
      - If fails: log error DEPLOY-GENERAL-096, return False
   e. Check fstab persistence:
      - Read /etc/fstab via subprocess.run(["cat", "/etc/fstab"])
      - If "/swapfile" not in content:
        * Append: subprocess.run(["sudo", "tee", "-a", "/etc/fstab"], input="/swapfile none swap sw 0 0\n")
        * If fails: log warning DEPLOY-GENERAL-097 (non-fatal - swap is active but will not survive reboot)
   f. Log info "Created and enabled 4GB swap file"
   g. Return True
4. On any exception: log error DEPLOY-GENERAL-098, return False
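A minimal Python sketch of the swap provisioning steps above, under the same assumptions as before: a free function rather than the real `DeploymentExecutor` method, a hypothetical injectable `run` parameter for testing, and plain logger calls in place of `format_error_log`. The table-driven `_CREATE_STEPS` list is one possible layout, not necessarily the one the implementation will use.

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

SWAPFILE = "/swapfile"
FSTAB_ENTRY = "/swapfile none swap sw 0 0\n"

# (command, error code) pairs, executed in order; the first failure aborts.
_CREATE_STEPS = [
    (["sudo", "fallocate", "-l", "4G", SWAPFILE], "DEPLOY-GENERAL-093"),
    (["sudo", "chmod", "600", SWAPFILE], "DEPLOY-GENERAL-094"),
    (["sudo", "mkswap", SWAPFILE], "DEPLOY-GENERAL-095"),
    (["sudo", "swapon", SWAPFILE], "DEPLOY-GENERAL-096"),
]


def ensure_swap_file(run=subprocess.run) -> bool:
    """Idempotently provision a 4GB swap file; return False on hard failure."""
    try:
        # Steps 1-2: any active swap means there is nothing to do.
        existing = run(["swapon", "--show", "--noheadings"],
                       capture_output=True, text=True)
        if existing.stdout.strip():
            logger.debug("Swap already configured: %s", existing.stdout.strip())
            return True

        # Steps 3a-3d: allocate, restrict permissions, format, enable.
        for cmd, code in _CREATE_STEPS:
            result = run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                logger.error("%s: %s failed: %s",
                             code, " ".join(cmd), result.stderr)
                return False

        # Step 3e: add the fstab entry only if it is missing.
        fstab = run(["cat", "/etc/fstab"], capture_output=True, text=True)
        if SWAPFILE not in fstab.stdout:
            append = run(["sudo", "tee", "-a", "/etc/fstab"],
                         input=FSTAB_ENTRY, capture_output=True, text=True)
            if append.returncode != 0:
                # Non-fatal: swap is active but will not survive a reboot.
                logger.warning("DEPLOY-GENERAL-097: fstab append failed: %s",
                               append.stderr)

        logger.info("Created and enabled 4GB swap file")
        return True
    except Exception as exc:  # Step 4: never propagate to the caller.
        logger.error("DEPLOY-GENERAL-098: unexpected error: %s", exc)
        return False
```

Keeping the four creation commands in one list makes the one-error-code-per-step mapping easy to audit against the error code table.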
Acceptance Criteria
AC1: Memory overcommit is configured idempotently
Scenario: First run on unconfigured server
Given the auto-updater runs on a server with vm.overcommit_memory=0
When DeploymentExecutor.execute() runs
Then /etc/sysctl.d/99-cidx-memory.conf contains "vm.overcommit_memory = 1"
And sysctl vm.overcommit_memory returns 1
And the setting survives a server reboot
Scenario: Subsequent run on already-configured server
Given the auto-updater runs on a server with vm.overcommit_memory=1 already
When DeploymentExecutor.execute() runs
Then the method returns True without writing any files
And only a debug-level log message is emitted
AC2: Swap file is created idempotently
Scenario: First run on server with no swap
Given the auto-updater runs on a server with no swap configured
When DeploymentExecutor.execute() runs
Then a 4GB /swapfile exists with permissions 0600
And swapon --show lists /swapfile as active swap
And /etc/fstab contains the /swapfile entry for reboot persistence
And the swap survives a server reboot
Scenario: Subsequent run on server with existing swap
Given the auto-updater runs on a server where swap already exists
When DeploymentExecutor.execute() runs
Then the method returns True without creating a new swap file
And only a debug-level log message is emitted
AC3: Failures are non-fatal to deployment
Scenario: Memory overcommit configuration fails
Given the auto-updater runs and _ensure_memory_overcommit() fails
When DeploymentExecutor.execute() continues
Then a warning is logged with DEPLOY-GENERAL-099 error code
And deployment continues to the swap file step
And the server restart still proceeds normally
Scenario: Swap file creation fails
Given the auto-updater runs and _ensure_swap_file() fails
When DeploymentExecutor.execute() continues
Then a warning is logged with DEPLOY-GENERAL-100 error code
And deployment completes successfully (server restart proceeds)
AC4: Methods follow established DeploymentExecutor patterns
Scenario: Code consistency with existing _ensure_* methods
Given the new methods are implemented
Then each uses subprocess.run() with capture_output=True and text=True
And each wraps in try/except with format_error_log using unique DEPLOY-GENERAL-09x codes
And each returns bool (True on success or already-configured, False on error)
And each logs via the module logger with correlation_id from get_correlation_id()
And each is called from execute() after Step 8 (_ensure_sudoers_restart)
And neither method aborts the deployment on failure
Testing Requirements
Unit Tests
File: tests/unit/auto_update/test_deployment_executor_memory.py
Tests for _ensure_memory_overcommit():
- `test_memory_overcommit_already_configured` - When sysctl returns "1", method returns True without writing config file
- `test_memory_overcommit_configures_successfully` - When sysctl returns "0", writes config file via sudo tee, applies via sysctl -p, returns True
- `test_memory_overcommit_write_failure` - When sudo tee returns non-zero, logs DEPLOY-GENERAL-090, returns False
- `test_memory_overcommit_apply_failure` - When sysctl -p returns non-zero, logs DEPLOY-GENERAL-091, returns False
- `test_memory_overcommit_exception_handling` - When subprocess raises unexpected exception, logs DEPLOY-GENERAL-092, returns False
Tests for _ensure_swap_file():
- `test_swap_already_exists` - When swapon --show returns non-empty output, method returns True without any creation steps
- `test_swap_creates_full_sequence` - When no swap exists, executes fallocate, chmod, mkswap, swapon, fstab check/append in order, returns True
- `test_swap_fallocate_failure` - When fallocate returns non-zero, logs DEPLOY-GENERAL-093, returns False
- `test_swap_chmod_failure` - When chmod returns non-zero, logs DEPLOY-GENERAL-094, returns False
- `test_swap_mkswap_failure` - When mkswap returns non-zero, logs DEPLOY-GENERAL-095, returns False
- `test_swap_swapon_failure` - When swapon returns non-zero, logs DEPLOY-GENERAL-096, returns False
- `test_swap_fstab_already_contains_entry` - When /etc/fstab already contains "/swapfile", does not append duplicate entry
- `test_swap_fstab_append_failure_non_fatal` - When fstab tee -a fails, logs warning DEPLOY-GENERAL-097, still returns True (swap is active, just not reboot-persistent)
- `test_swap_exception_handling` - When subprocess raises unexpected exception, logs DEPLOY-GENERAL-098, returns False
Tests for execute() wiring:
- `test_execute_calls_memory_overcommit_after_sudoers` - Verify `_ensure_memory_overcommit` is called in execute() after `_ensure_sudoers_restart`
- `test_execute_calls_swap_file_after_memory_overcommit` - Verify `_ensure_swap_file` is called after `_ensure_memory_overcommit`
- `test_execute_continues_on_memory_overcommit_failure` - When `_ensure_memory_overcommit` returns False, execute() still continues and returns True
- `test_execute_continues_on_swap_file_failure` - When `_ensure_swap_file` returns False, execute() still continues and returns True
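One way the wiring tests could assert both call order and non-fatal continuation is to attach all three `_ensure_*` mocks to a single parent `Mock`, which records calls in sequence. The sketch below uses a hypothetical `StubExecutor` stand-in because the real `DeploymentExecutor` constructor and the earlier steps of `execute()` are not shown in this issue; a real test would patch the actual class instead.

```python
from unittest import mock


class StubExecutor:
    """Hypothetical stand-in for DeploymentExecutor (wiring only)."""

    def _ensure_sudoers_restart(self) -> bool:
        return True

    def _ensure_memory_overcommit(self) -> bool:
        return True

    def _ensure_swap_file(self) -> bool:
        return True

    def execute(self) -> bool:
        self._ensure_sudoers_restart()    # Step 8
        self._ensure_memory_overcommit()  # Step 9 (non-fatal)
        self._ensure_swap_file()          # Step 10 (non-fatal)
        return True


def test_new_steps_run_in_order_and_are_non_fatal():
    executor = StubExecutor()
    parent = mock.Mock()
    # Both new steps report failure; execute() must still succeed.
    parent.overcommit.return_value = False
    parent.swap.return_value = False

    with mock.patch.object(executor, "_ensure_sudoers_restart", parent.sudoers), \
         mock.patch.object(executor, "_ensure_memory_overcommit", parent.overcommit), \
         mock.patch.object(executor, "_ensure_swap_file", parent.swap):
        assert executor.execute() is True

    # The parent mock records child calls in the order they happened.
    assert parent.mock_calls == [
        mock.call.sudoers(),
        mock.call.overcommit(),
        mock.call.swap(),
    ]
```

The same parent-mock technique covers all four wiring tests listed above by varying which child returns False.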
Integration / E2E Testing (Manual via staging server)
- Deploy to staging server (.20) via development -> staging merge
- Verify `/etc/sysctl.d/99-cidx-memory.conf` is created with content `vm.overcommit_memory = 1`
- Verify `sysctl -n vm.overcommit_memory` returns `1`
- Verify `swapon --show` lists `/swapfile` as active swap with ~4GB size
- Verify `/etc/fstab` contains line `/swapfile none swap sw 0 0`
- Reboot staging server, verify both settings persist after reboot
- Run auto-updater again, verify idempotent skip in logs (debug messages only, no writes)
- Verify CIDX server subprocess operations (git pull, systemctl restart) complete without ENOMEM
Technical Notes
Error Code Allocation
| Code | Method | Condition |
|---|---|---|
| DEPLOY-GENERAL-090 | _ensure_memory_overcommit | Failed to write /etc/sysctl.d/99-cidx-memory.conf |
| DEPLOY-GENERAL-091 | _ensure_memory_overcommit | Failed to apply sysctl config via sysctl -p |
| DEPLOY-GENERAL-092 | _ensure_memory_overcommit | Unexpected exception |
| DEPLOY-GENERAL-093 | _ensure_swap_file | fallocate -l 4G /swapfile failed |
| DEPLOY-GENERAL-094 | _ensure_swap_file | chmod 600 /swapfile failed |
| DEPLOY-GENERAL-095 | _ensure_swap_file | mkswap /swapfile failed |
| DEPLOY-GENERAL-096 | _ensure_swap_file | swapon /swapfile failed |
| DEPLOY-GENERAL-097 | _ensure_swap_file | fstab append failed (warning level, non-fatal) |
| DEPLOY-GENERAL-098 | _ensure_swap_file | Unexpected exception |
| DEPLOY-GENERAL-099 | execute | Warning when _ensure_memory_overcommit returns False |
| DEPLOY-GENERAL-100 | execute | Warning when _ensure_swap_file returns False |
execute() Wiring Pattern
The new steps follow the exact same pattern as Step 8 (_ensure_sudoers_restart) -- non-fatal warnings that do not abort the deployment:
    # Step 9: Ensure vm.overcommit_memory=1 for fork safety
    if not self._ensure_memory_overcommit():
        logger.warning(
            format_error_log(
                "DEPLOY-GENERAL-099",
                "Memory overcommit could not be configured - "
                "subprocess fork may fail with ENOMEM on high VmPeak",
                extra={"correlation_id": get_correlation_id()},
            )
        )

    # Step 10: Ensure swap file exists as safety net
    if not self._ensure_swap_file():
        logger.warning(
            format_error_log(
                "DEPLOY-GENERAL-100",
                "Swap file could not be created - "
                "no swap safety net for OOM conditions",
                extra={"correlation_id": get_correlation_id()},
            )
        )

Files Modified
| File | Change |
|---|---|
| src/code_indexer/server/auto_update/deployment_executor.py | Add _ensure_memory_overcommit() and _ensure_swap_file() methods; wire into execute() as Steps 9-10 |
| tests/unit/auto_update/test_deployment_executor_memory.py | New test file with ~18 unit tests covering all idempotent paths and failure modes |
Key Constraints (from conversation)
- Both methods are non-fatal: deployment continues even if they fail (warning logged only)
- Both methods are idempotent: safe to run on every deployment cycle with no side effects
- Both methods use sudo: the auto-updater service runs as root or has sudo access
- Swap file path is `/swapfile` (standard Linux convention)
- Sysctl config file is `/etc/sysctl.d/99-cidx-memory.conf` (high priority number, CIDX-namespaced)
- All subprocess calls use `capture_output=True, text=True` per established pattern
- Explicitly excluded from scope: No changes to subprocess.run() callers, no posix_spawn, no process launcher daemon, no fork health dashboard, no ENOMEM retry wrapper, no Python version changes, no worker concurrency reduction
Definition of Done
- `_ensure_memory_overcommit()` implemented following `_ensure_sudoers_restart()` pattern
- `_ensure_swap_file()` implemented following `_ensure_sudoers_restart()` pattern
- Both methods wired into `execute()` as Step 9 and Step 10 after existing Step 8
- All error codes DEPLOY-GENERAL-090 through DEPLOY-GENERAL-100 are unique and not duplicated elsewhere
- Unit tests pass for all idempotent paths (already-configured, successful-config, each failure mode)
- Unit tests pass for execute() wiring order and non-fatal continuation behavior
- fast-automation.sh passes with zero failures under 10 minutes
- Deployed to staging server (.20), both configurations verified active
- Idempotent re-run verified on staging (no duplicate writes, debug logs only)
- Reboot persistence verified on staging (both sysctl and swap survive reboot)
- CIDX server subprocess operations (git, pip, systemctl) confirmed working without ENOMEM