[STORY] Signal-Based Server Restart via Auto-Updater #355

@jsbattig

Story: Signal-Based Server Restart via Auto-Updater

As a CIDX server administrator
I want to restart the server from the Diagnostics web UI via a signal file that the auto-updater monitors
So that server restarts work reliably without requiring the server process to have sudo/privilege escalation capabilities


Context

The current restart mechanism, _delayed_restart() in routes.py, runs sudo systemctl restart cidx-server when the server is running under systemd. This fails when the systemd service unit sets the NoNewPrivileges=true hardening option, because the kernel then prevents sudo (a setuid binary) from gaining elevated privileges.

The fix leverages the existing auto-updater infrastructure: the server writes a restart signal file to ~/.cidx-server/restart.signal, and the auto-updater (which already has restart capabilities and runs as the same user) picks up the signal, deletes it immediately, and executes the restart from the outside.

This follows the established PENDING_REDEPLOY_MARKER pattern already used in deployment_executor.py for the same shared directory.


Implementation Status

  • Signal file constant and format definition (~/.cidx-server/restart.signal, JSON with timestamp + reason)
  • Server: _delayed_restart() writes signal file in systemd mode instead of sudo systemctl restart
  • Auto-updater: restart signal detection in poll_once() (before existing redeploy marker check)
  • Auto-updater: delete-then-restart execution (delete file immediately, then systemctl restart)
  • Edge case: stale signal file cleanup (file exists but no restart needed, e.g., after crash)
  • Remove sudoers dependency from server restart path (server no longer needs sudo)
  • Unit tests (0/0 passing)
  • Integration tests (0/0 passing)
  • E2E manual testing on staging server (.20)

Completion: 0/9 tasks complete (0%)


Algorithm

Signal File Protocol:
  SIGNAL_PATH = Path.home() / ".cidx-server" / "restart.signal"
  SIGNAL_CONTENT = { "timestamp": ISO8601, "reason": string }
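
The signal file format above could be written and parsed roughly as follows. This is a minimal sketch, not the project's code: the function names write_restart_signal and read_restart_signal are hypothetical, and the use of a UTC timestamp is an assumption (the story only says ISO8601).

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Location taken from the story. Both helpers accept a path so tests can use a temp dir.
SIGNAL_PATH = Path.home() / ".cidx-server" / "restart.signal"


def write_restart_signal(reason: str, path: Path = SIGNAL_PATH) -> dict:
    """Write the signal file and return the payload that was written (hypothetical helper)."""
    payload = {
        # Assumption: timestamps are timezone-aware UTC in ISO 8601 form.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload), encoding="utf-8")
    return payload


def read_restart_signal(path: Path = SIGNAL_PATH) -> dict:
    """Parse the signal file. Raises if the file is missing or is not valid JSON."""
    return json.loads(path.read_text(encoding="utf-8"))
```

Keeping the path as a parameter means the same helpers can be exercised in unit tests without touching the real ~/.cidx-server directory.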

Server._delayed_restart(delay):
  SLEEP delay seconds (allow HTTP response to complete)
  IF running under systemd (INVOCATION_ID env var set):
    WRITE SIGNAL_PATH with JSON { timestamp, reason="diagnostics_restart" }
    LOG "Restart signal written, waiting for auto-updater"
    RESET _restart_in_progress flag
  ELSE:
    os.execv (existing dev mode logic, unchanged)
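
The server-side branch above might look like the following sketch. It is an illustration under assumptions, not the actual routes.py code: delayed_restart, write_signal, and exec_restart are hypothetical names, the side effects are injected as callables so the branch logic can be tested in isolation, and the reset of the _restart_in_progress flag is left out.

```python
import os
import time
from typing import Callable, Mapping


def running_under_systemd(env: Mapping[str, str] = os.environ) -> bool:
    """systemd sets INVOCATION_ID in the environment of processes it starts for a unit."""
    return bool(env.get("INVOCATION_ID"))


def delayed_restart(
    delay: float,
    write_signal: Callable[[str], object],
    exec_restart: Callable[[], object],
    env: Mapping[str, str] = os.environ,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "signal" if the signal file path was taken, "exec" for dev mode."""
    sleep(delay)  # give the HTTP response time to complete
    if running_under_systemd(env):
        # systemd mode: hand the restart to the auto-updater via the signal file
        write_signal("diagnostics_restart")
        return "signal"
    # dev mode: existing os.execv logic, represented here by an injected callable
    exec_restart()
    return "exec"
```

In this sketch the server never invokes sudo or systemctl itself; in systemd mode its only side effect is the signal write.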

AutoUpdater.poll_once() - restart signal check (before existing redeploy marker check):
  STALENESS_THRESHOLD = 120 seconds (2x poll interval)

  IF file_exists(SIGNAL_PATH):
    READ signal file JSON content
    signal_age = now() - signal.timestamp

    IF signal_age > STALENESS_THRESHOLD:
      LOG WARNING "Stale restart signal detected (age: {signal_age}s), deleting without restart"
      DELETE SIGNAL_PATH
      RETURN (skip normal update check for this cycle, no restart)

    DELETE SIGNAL_PATH immediately (before any restart attempt)
    LOG "Restart signal detected, file deleted, executing restart"
    EXECUTE deployment_executor.restart_server()
    IF restart fails:
      LOG error (signal already deleted, no retry loop)
    RETURN (skip normal update check for this cycle)
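
A possible shape for the auto-updater check is sketched below. Several details are assumptions rather than part of the story: the function name check_restart_signal, the string return values, the injected restart_server callable returning True on success (the real deployment_executor.restart_server() signature is not shown here), and treating an unreadable or malformed signal file as stale.

```python
import json
import logging
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)

STALENESS_THRESHOLD_SECONDS = 120  # 2x poll interval, per the story


def check_restart_signal(
    signal_path: Path,
    restart_server: Callable[[], bool],
    now: Optional[datetime] = None,
) -> str:
    """Return "none", "stale", "restarted", or "restart_failed"."""
    if not signal_path.exists():
        return "none"
    now = now or datetime.now(timezone.utc)
    try:
        signal = json.loads(signal_path.read_text(encoding="utf-8"))
        written_at = datetime.fromisoformat(signal["timestamp"])
        if written_at.tzinfo is None:
            # Assumption: naive timestamps are interpreted as UTC.
            written_at = written_at.replace(tzinfo=timezone.utc)
        age = (now - written_at).total_seconds()
    except (ValueError, KeyError, TypeError):
        # Assumption: a malformed signal is handled like a stale one.
        age = float("inf")

    # Delete first in every branch, so a failed restart cannot cause a loop.
    signal_path.unlink(missing_ok=True)

    if age > STALENESS_THRESHOLD_SECONDS:
        logger.warning("Stale restart signal detected (age: %ss), deleting without restart", age)
        return "stale"

    logger.info("Restart signal detected, file deleted, executing restart")
    try:
        ok = restart_server()
    except Exception:
        logger.exception("Restart failed after signal pickup")
        return "restart_failed"
    if not ok:
        logger.error("Restart failed after signal pickup")
        return "restart_failed"
    return "restarted"
```

A caller in poll_once() would skip the normal update check for the cycle whenever the result is anything other than "none".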

Key Design Decisions

  • Delete-first: Deleting the signal file before attempting the restart prevents restart loops, even if systemctl restart also restarts the auto-updater or the restart fails.
  • Stale file cleanup: If the signal's timestamp is older than the staleness threshold (120 seconds), the file is treated as a leftover from a crash or power loss. Delete it and log a warning; do NOT restart.
  • No retry: If restart after signal pickup fails, log and move on. Admin can click again.
  • Signal file is JSON: For debuggability (contains timestamp and reason)
  • Reuses existing patterns: Same directory as PENDING_REDEPLOY_MARKER, same restart_server() call

Acceptance Criteria

Scenario 1: Server writes restart signal file when restart requested
  Given the CIDX server is running under systemd (INVOCATION_ID is set)
  When an admin triggers a restart from the Diagnostics web UI
  Then a signal file is created at ~/.cidx-server/restart.signal
  And the file contains JSON with "timestamp" and "reason" fields
  And the HTTP response returns success before the file is written

Scenario 2: Auto-updater detects signal and restarts server
  Given the auto-updater polling loop is running
  And a restart.signal file exists at ~/.cidx-server/
  When the auto-updater executes its next poll cycle
  Then the signal file is deleted immediately (before restart attempt)
  And systemctl restart cidx-server is executed

Scenario 3: Signal file deleted even if restart fails
  Given a restart.signal file exists at ~/.cidx-server/
  When the auto-updater detects it and systemctl restart fails
  Then the signal file is still deleted (no retry loop)
  And the failure is logged

Scenario 4: Stale signal file cleaned up without restart
  Given a restart.signal file exists from a previous crash/power-loss
  And its timestamp is older than the 120-second staleness threshold
  When the auto-updater starts a new poll cycle
  Then the stale signal file is deleted
  And a warning is logged
  And no restart is triggered (server is already starting fresh)

Scenario 5: Dev mode restart unchanged
  Given the CIDX server is running in dev mode (no INVOCATION_ID)
  When an admin triggers a restart from the Diagnostics web UI
  Then the existing os.execv restart mechanism is used (no signal file)

Key Files

  • src/code_indexer/server/web/routes.py - _delayed_restart() function (line ~8743)
  • src/code_indexer/server/auto_update/service.py - AutoUpdateService.poll_once() method
  • src/code_indexer/server/auto_update/deployment_executor.py - PENDING_REDEPLOY_MARKER constant, restart_server() method
  • tests/unit/server/web/test_restart_endpoint.py - Existing restart tests
  • tests/unit/server/auto_update/ - Existing auto-update tests

Testing Requirements

  • Unit tests covering signal file write (JSON format, path, permissions)
  • Unit tests covering signal file detection and delete-before-restart ordering
  • Unit tests covering stale signal file cleanup
  • Unit tests covering dev mode unchanged behavior
  • Integration tests for signal file write/read/delete protocol
  • E2E manual testing on staging server (192.168.60.20)
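
One way to check the delete-before-restart ordering in a unit test is to attach both operations to a single parent Mock and assert on the recorded call order. The sketch below uses a stand-in function, handle_restart_signal, which is hypothetical and only mimics the handling step; the real tests would target the auto-updater code instead.

```python
from unittest.mock import Mock, call


def handle_restart_signal(delete_signal, restart_server):
    """Hypothetical stand-in for the auto-updater's handling step."""
    delete_signal()
    try:
        restart_server()
    except Exception:
        pass  # no retry: the signal is already gone, the failure would be logged


def test_signal_deleted_before_restart():
    parent = Mock()
    handle_restart_signal(parent.delete_signal, parent.restart_server)
    # Child mock calls are recorded on the parent in the order they happened.
    assert parent.mock_calls == [call.delete_signal(), call.restart_server()]


def test_signal_deleted_even_if_restart_fails():
    parent = Mock()
    parent.restart_server.side_effect = RuntimeError("systemctl failed")
    handle_restart_signal(parent.delete_signal, parent.restart_server)
    assert parent.mock_calls == [call.delete_signal(), call.restart_server()]
```

Both tests run under pytest or can be called directly as plain functions.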

Manual Testing Strategy (Staging .20)

  1. Deploy to staging via development -> staging merge
  2. SSH to staging, verify updated code deployed
  3. Trigger restart from Diagnostics web UI
  4. Verify signal file appears: ls -la ~/.cidx-server/restart.signal
  5. Verify auto-updater picks it up within 60s: journalctl -u cidx-auto-update --since "1 min ago"
  6. Verify signal file deleted: ls -la ~/.cidx-server/restart.signal (should be gone)
  7. Verify server back online: systemctl status cidx-server
  8. Verify Diagnostics page loads after restart

Definition of Done

  • All acceptance criteria satisfied
  • 90% unit test coverage achieved
  • Integration tests passing
  • E2E tests with zero mocking passing
  • Code review approved
  • Manual end-to-end testing on staging completed
  • No lint/type errors
  • Working software deployable to production
