fix(op): ensure op worker restarts on reboot and make script resilient to reinstalls #428

Draft
gwenaskell wants to merge 2 commits into main from yoenn.burban/OPA-5043-restart-worker-on-reboot

gwenaskell commented Apr 29, 2026

Summary

Two related improvements to install_script_op_worker2.sh:

  1. The OP Worker service is now explicitly enabled to start on host reboot.
    Refs: OPA-5043
  2. The install script is now safe to re-run on top of a previous install whose
    uninstall left state behind.
    Refs: OPA-3197

1. Enable the OP Worker on host reboot

Why

After a host reboot, an OPW installed via this script wasn't reliably coming
back up: the script only ran the package's runtime restart command, which
doesn't register the service for boot-time startup. Whether the service then
started on boot was at the mercy of the package's post-install hooks and
varied across distros.

What

A new enable_cmd is computed alongside the existing restart_cmd /
start_instructions / stop_instructions, picking the right tool per init
system:

| Init system | Enable command |
| --- | --- |
| systemd | systemctl enable observability-pipelines-worker.service |
| SysV on Debian/Ubuntu | update-rc.d observability-pipelines-worker defaults |
| SysV on RHEL/CentOS/Amazon | chkconfig observability-pipelines-worker on |
| Upstart | (no-op: package .conf is picked up automatically) |

service is intentionally not used for this: it only forwards runtime
actions (start/stop/restart) to the init script and can't register a service
for boot. The reasoning is captured in a comment in the script.

The enable step runs in the same branch as the existing restart_cmd,
gated on the same no_start flag so:
  • DD_INSTALL_ONLY=true skips both starting and enabling.
  • Missing DD_API_KEY / DD_OP_PIPELINE_ID still skips both.
  • A failure to enable is non-fatal — it prints a yellow warning and the
    install proceeds.
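
As a rough illustration of the selection and gating described above, here is a minimal sketch. The function names, the init-system labels, and the convention that a non-empty no_start means "skip" are assumptions made for this example; the actual script may structure this differently (for instance, it may prefix commands with sudo).

```shell
# Hypothetical sketch: names mirror the PR description, not the actual script.
service_name="observability-pipelines-worker"

# Print the command that registers the service for boot-time startup.
# $1 is an assumed init-system label: systemd | sysv-debian | sysv-rhel | upstart.
compute_enable_cmd() {
  case "$1" in
    systemd)     echo "systemctl enable ${service_name}.service" ;;
    sysv-debian) echo "update-rc.d ${service_name} defaults" ;;
    sysv-rhel)   echo "chkconfig ${service_name} on" ;;
    upstart)     echo "" ;;  # no-op: the package .conf is picked up automatically
    *)           return 1 ;; # unknown init system
  esac
}

# Run the enable command unless starting is suppressed.
# A failure to enable only prints a yellow warning; it never aborts the install.
enable_service() {
  local enable_cmd="$1" no_start="$2"
  if [ -n "$no_start" ] || [ -z "$enable_cmd" ]; then
    return 0
  fi
  # Unquoted on purpose so the command string splits into words.
  if ! $enable_cmd; then
    printf '\033[33mWarning: could not enable %s at boot\033[0m\n' "$service_name" >&2
  fi
  return 0
}
```

Wrapping the command in `if ! ...` keeps a failing enable from tripping set -e, which is what makes the step non-fatal.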

2. Make the install script resilient to a previous (incomplete) uninstall

Why

apt-get remove / apt-get purge leave several pieces of state behind that
the package manager doesn't own:

  • /etc/default/observability-pipelines-worker (created by this script)
  • /etc/observability-pipelines-worker/install_info (created by this script)
  • /etc/apt/sources.list.d/datadog-observability-pipelines-worker.list
  • /usr/share/keyrings/datadog-archive-keyring.gpg
  • /var/lib/observability-pipelines-worker/
  • the observability-pipelines-worker system user

Most of the steps in the script that touch this state are already idempotent (repo file overwrite,
GPG re-import, package re-install, install_info overwrite). Two were not:

  • The env file was reused as-is, silently dropping any DD_* values passed
    in the new invocation. Re-running the script with a new DD_API_KEY was a
    no-op.
  • chown $bootstrap_file could fail under set -e if a partial prior state
    meant the file or the system user was missing, aborting the whole install.

What

Env file behavior (/etc/default/observability-pipelines-worker):

| Scenario | Before | After |
| --- | --- | --- |
| File doesn't exist (fresh install) | Create + populate | Same outcome |
| File present, no DD_* env vars supplied | Kept verbatim | Kept verbatim (operator-set keys preserved) |
| File present, new DD_API_KEY supplied | Silently dropped | Upserted (existing line replaced, others untouched) |
| File present, new DD_OP_* not previously set | Silently dropped | Appended |
| File present, custom keys the operator added (not via DD_*) | Preserved | Preserved (only the keys we explicitly pass are touched) |

The implementation is a small upsert_env_var helper that does
sed -i "/^${key}=/d" then echo $key=$value >> $env_file.
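
A minimal sketch of what such a helper might look like, based only on the description above. It assumes GNU sed (for the argument-less -i) and a global env_file variable; the touch call and the quoting are additions for this example, and the real helper may also use sudo.

```shell
# Hypothetical sketch of the upsert helper; not the script's actual code.
env_file="${env_file:-/etc/default/observability-pipelines-worker}"

upsert_env_var() {
  local key="$1" value="$2"
  # Make sure the file exists so sed has something to edit.
  touch "$env_file"
  # Drop any existing line for this key (GNU sed in-place edit)...
  sed -i "/^${key}=/d" "$env_file"
  # ...then append the new value. Lines for other keys are left untouched.
  echo "${key}=${value}" >> "$env_file"
}
```

One consequence of delete-then-append is that an updated key moves to the end of the file, so line order is not preserved for replaced keys.
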

Bootstrap file (/etc/observability-pipelines-worker/bootstrap.yaml):

| Scenario | Before | After |
| --- | --- | --- |
| File present, system user present | chown succeeds | chown succeeds |
| File present, observability-pipelines-worker user missing | chown fails → script aborts | Yellow warning, install continues |
| File missing entirely | chown fails → script aborts | Yellow warning, install continues |
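
The guarded chown could be sketched roughly as follows. The function name, its parameters, and the warning text are hypothetical; the point is only that each failure path returns success so set -e does not abort the install.

```shell
# Hypothetical sketch of a chown that warns instead of aborting under set -e.
warn_yellow() {
  printf '\033[33mWarning: %s\033[0m\n' "$1" >&2
}

# $1: path to the bootstrap file, $2: expected owner (system user).
chown_bootstrap_file() {
  local file="$1" owner="$2"
  if [ ! -e "$file" ]; then
    warn_yellow "$file is missing, skipping chown"
    return 0
  fi
  if ! id -u "$owner" >/dev/null 2>&1; then
    warn_yellow "user '$owner' does not exist, skipping chown of $file"
    return 0
  fi
  # Even an unexpected chown failure is downgraded to a warning.
  chown "$owner" "$file" || warn_yellow "could not chown $file to $owner"
  return 0
}
```

Checking for the file and the user up front gives a more specific warning than relying on chown's own error message.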

Backward compatibility note

The env file change is a semantic change: previously the file was
inviolate on re-runs. Now any DD_* values passed in the new invocation will
overwrite their matching lines. I believe this matches operator expectations
("I re-ran with a new key, why didn't it apply?") and the current behavior
was an undocumented footgun, but it's worth flagging.
Operator-added keys (anything the script doesn't pass via DD_*) are still
preserved.

Test plan

  • Fresh install on Ubuntu (systemd) — service starts and systemctl is-enabled observability-pipelines-worker returns enabled.
  • Fresh install on RHEL/CentOS (systemd) — same as above.
  • Reboot a freshly installed host — OPW comes back up automatically.
  • Re-run the script over an existing install with a different DD_API_KEY — verify /etc/default/observability-pipelines-worker reflects the new key and the worker uses it.
  • Re-run over an existing install with no DD_* env vars — verify the existing env file is left intact and the worker keeps running with its prior config.
  • apt-get remove observability-pipelines-worker then re-run the script — install completes without aborting; service is enabled and starts.
  • apt-get purge observability-pipelines-worker && userdel observability-pipelines-worker then re-run — install completes (with warnings), package's postinst recreates the user, service starts.
  • DD_INSTALL_ONLY=true install — script neither enables nor starts the service.
  • Old/legacy SysV target (e.g. older Debian without systemd) — update-rc.d defaults runs, service comes up at boot.

Out of scope

  • Real cleanup of leftover state on uninstall (env file, install_info, repo
    file, keyring, system user) belongs in the OPW package's postrm /
    prerm (analogous to datadog-agent's agent-deb/postrm),
    not in this install script.

@gwenaskell

I will hold off on this for now because some of those operations could actually be bundled directly with the opw package.
