106 changes: 106 additions & 0 deletions docs/systemd-guidelines.md
@@ -0,0 +1,106 @@
# Terminology

"Targets" (.target files) define "stages", for example, the boot, test and report stages.<br>
"Barrier services" are closely related to targets, but allow targets to be decoupled from stage details. "Barriers" and "barrier services" are used interchangeably.<br>
"Worker services" are services that we create: both the .service file and the service code. "Workers" and "worker services" are used interchangeably.<br>

# Overview

## Activation versus ordering versus enrollment

Activation of a systemd unit happens via

- systemctl start (or restart)<br>
- Wants/Requires<br>
- WantedBy/RequiredBy plus enabling via enable directive in .preset file or systemctl enable<br>

For sev-certify, automating systemctl start makes less sense than using one of the other activation methods. Also, the "directions" of Wants/Requires and WantedBy/RequiredBy are opposite and it may only be appropriate/correct to change "one side". For example, it's inappropriate to change multi-user.target to have Wants/Requires=`<`one or more sev-certify units`>`. WantedBy/RequiredBy "enrolls" a unit with another unit; enrollment plus enabling is one way to activate the enrolled unit.<br>
Contributor

Also, the "directions" of Wants/Requires and WantedBy/RequiredBy are opposite and it may only be appropriate/correct to change "one side".

By this, do you mean that we should define the relationship in only one direction? That is, if unit A requires unit B, we would define in systemd either

  • (In A) Requires=B.service
  • (In B) RequiredBy=A.service

but we shouldn't have both? I don't disagree with this statement.

Maybe we should clarify which one is preferred. From my understanding, you're saying here that we probably prefer RequiredBy.


Ordering is only achieved via Before/After directives in unit files.
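
As a minimal sketch of the difference between activation and ordering, the fragment below uses hypothetical unit names (example-worker.service and example-dep.service are not sev-certify units). Wants= activates the dependency and After= orders it. With Wants= alone, systemd would start both units in parallel.

```ini
# example-worker.service (hypothetical unit, for illustration only)
[Unit]
# Activation: starting this unit also starts example-dep.service.
Wants=example-dep.service
# Ordering: wait until example-dep.service has finished starting.
After=example-dep.service
```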

## Value of having both targets and "barrier services"

(straight from Claude Code)
Contributor

Suggested change
(straight from Claude Code)

It's OK, what isn't these days :)


1. Stage chaining stays stable — targets give each stage a named boundary that subsequent stages reference. As workers are added or removed from a stage, only the barrier changes (Requires=); the target and the chain above it are untouched.
2. Intra-stage ordering without coupling — when workers within a stage must run in sequence, a started barrier gives them a common synchronization point without workers needing to reference each other directly. Without the barrier, you'd have to wire workers to each other, coupling units that conceptually belong to the same stage independently.
Comment on lines +25 to +26
Contributor

Another reason for using barrier services is to make failure handling more explicit and resilient. If a target directly depends on all of its worker services, a single worker failure can prevent the target from being reached and stop the rest of the flow. With a barrier, the target can remain the stable stage boundary, while the barrier owns the responsibility of running the stage services, collecting their status, and deciding how failure is represented. This allows later stages, especially reporting, to still be reached so failures can be captured instead of skipping directly to shutdown or leaving the run incomplete.


# The target - barrier service pattern

Each target Requires= and After= the previous target. After= the previous target helps provide inter-stage ordering. Each target also Wants= and After= its barrier service. For example, in report.target:<br>

Requires=test.target<br>
After=test.target<br>
Wants=report-done.service<br>
After=report-done.service<br>

A target has no ExecStart, so it becomes active as soon as its ordering dependencies are satisfied. Without this second After=, and even though each target has After=`<`previous target`>`, all the targets would activate at essentially the same time and, as a result, be out of sync with the stage workers.
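
Put together as a unit file, report.target might look like the sketch below. The four dependency directives come from the example above; Description= and DefaultDependencies=no are additions for illustration (the latter follows the DefaultDependencies guideline later in this document).

```ini
# report.target (sketch)
[Unit]
Description=sev-certify report stage
DefaultDependencies=no
# Inter-stage chaining: pull in the previous stage and wait for it.
Requires=test.target
After=test.target
# Stay in sync with this stage's workers through the barrier service.
Wants=report-done.service
After=report-done.service
```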

Each barrier service Requires= and After= all of its worker services. A barrier does have an ExecStart, but the simplest, most natural way for a barrier to stay in sync with its workers is to After= all of the workers. For example, in guest report-done.service:<br>

Requires=display-guest-logs.service sev-certificate-generator.service<br>
After=display-guest-logs.service sev-certificate-generator.service<br>
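
A complete barrier unit built from these two directives could look like the following sketch. The [Service] section is an assumption: the guidelines only say that a barrier has an ExecStart, so a no-op command (/usr/bin/true) is used here as a placeholder, and Type=oneshot with RemainAfterExit=yes follows the recommendations in the Type and RemainAfterExit sections.

```ini
# report-done.service (sketch; the [Service] values are assumptions)
[Unit]
Description=Barrier for the report stage
DefaultDependencies=no
# Pull in every worker of the stage and wait for all of them.
Requires=display-guest-logs.service sev-certificate-generator.service
After=display-guest-logs.service sev-certificate-generator.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Placeholder no-op; the barrier only marks that the stage is done.
ExecStart=/usr/bin/true
```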

This "synchronization process" alone doesn't work for non-oneshot systemd services. See Intra-stage ordering below for how to handle this. In short, two or more workers together "go outside" systemd to synchronize, so that everything stays in sync.

# Intra-stage ordering

In cases where intra-stage ordering is required, worker services use After= to achieve it. This works for oneshot services. For non-oneshot services, either<br>
Contributor

I don't know if we need the explanation, but the reason After= does not behave as expected with non-oneshot services is that After= only waits for initialization, not completion. Since long-running services do not have a completion state, once the service initializes successfully, systemd considers it safe to start dependent services.


1) have a oneshot service use a non-systemd mechanism to detect when the non-oneshot service is done, and have later workers use After= with this oneshot service, or
2) use OnSuccess= (and OnFailure=?).
Contributor

I like this suggestion


An example of 1) is the verify-guest service (Type=oneshot) checking logs to determine when the launch-guest service (Type=simple) is done.<br>
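
Option 1) could be sketched as below. The unit names verify-guest.service and launch-guest.service come from the example above; the script path is hypothetical, and how the script decides that the guest is done (for example, by polling a log) is left to the script.

```ini
# verify-guest.service (sketch; ExecStart path is hypothetical)
[Unit]
Description=Wait until launch-guest.service has finished its work
DefaultDependencies=no
# launch-guest.service is Type=simple, so this only waits for it to start.
After=launch-guest.service

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical script that blocks until the logs show the launch is done.
ExecStart=/usr/local/bin/verify-guest

# Workers that must run after the guest launch then order themselves
# on the oneshot service instead of on launch-guest.service:
#   After=verify-guest.service
```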

# Bootstrapping

Stop targets (guest and host) have WantedBy=multi-user.target. multi-user.target is the "system is ready" terminal target — always present, always reached on normal boot. This is the only use of WantedBy that's required in sev-certify. This "enrollment" is not enough to "activate" the stop targets. Activation of the stop targets requires enabling (or starting) them. Do this via an "enable stop.target" directive in a .preset file.<br>
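
A minimal sketch of the two pieces involved follows. WantedBy=multi-user.target and the "enable stop.target" preset line come from the text above; the preset file name and the Requires=/After= on report.target are assumptions made for the example.

```ini
# stop.target (sketch)
[Unit]
Description=sev-certify stop stage
DefaultDependencies=no
# Assumption: stop follows the report stage in the chain.
Requires=report.target
After=report.target

[Install]
# Enrollment: takes effect only once the unit is enabled.
WantedBy=multi-user.target

# 50-sev-certify.preset (hypothetical file name)
# Enabling: applied when presets are processed.
enable stop.target
```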
Contributor Author

@copilot everything else can and does use Wants or Requires. The sev-certify stop targets bootstrap this process.


# General systemd units

These are units for which we don't maintain either unit files (.service, .target, etc.) or "unit code". For simplicity and clarity, it's best to reference them in targets, via Requires/Wants/After. This should be done as early as possible, for example, Requires= and After= in boot.target.<br>
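
For illustration, a boot.target that references a general unit might look like the sketch below. The choice of network-online.target is an assumption made for the example; substitute whichever general units the stages actually need.

```ini
# boot.target (sketch; network-online.target is an assumed example)
[Unit]
Description=sev-certify boot stage
DefaultDependencies=no
# Reference general units as early as possible, here in the first stage.
Requires=network-online.target
After=network-online.target
```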

# Other directives

## Type

In sev-certify, use `Type=oneshot` with `RemainAfterExit=yes` when a suitable `TimeoutStartSec` value can be determined. Otherwise, use `Type=simple`.<br>

Comment on lines +67 to +68
Contributor Author

@copilot to me, RemainAfterExit=yes seems to fit sev-certify better and there doesn't seem to be a downside other than maybe salvaging more of a "bad boot".

You can't easily use Before/After with simple services since they satisfy Before/After as soon as they start. See Intra-stage ordering above. With oneshot services, Before/After isn't satisfied until the main process exits.<br>

With oneshot services, `TimeoutStartSec` is how long the main process has to exit/finish before systemd kills it. The kill can also affect subprocesses; whether it does depends on the `RemainAfterExit` and `KillMode` directives.<br>

default: simple<br>
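
A worker that follows this recommendation might look like the sketch below. The unit name, the script path and the 10min value are all placeholders, not values taken from sev-certify.

```ini
# example-worker.service (hypothetical worker, for illustration only)
[Unit]
Description=Example oneshot worker
DefaultDependencies=no

[Service]
Type=oneshot
RemainAfterExit=yes
# Upper bound for ExecStart to finish; 10min is a placeholder value.
TimeoutStartSec=10min
# Hypothetical path to the worker's code.
ExecStart=/usr/local/bin/example-worker
```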
Comment on lines +67 to +73
Contributor

I have a few comments on this section.

First, I don't think TimeoutStartSec is necessarily the right criterion for deciding between Type=oneshot and Type=simple. None of our current services rely on TimeoutStartSec, and I don't think service type selection should be driven by whether we can determine a timeout value.

Instead, I think the recommendation should be based on service behavior. For most of our services, Type=oneshot is the better fit because they are intended to perform finite work and run to completion once. I expect that to apply to the majority of services in this project.

Also, the main benefit of Type=oneshot in our architecture is ordering semantics. After= and Before= only wait for service activation, not completion. For long-running services (Type=simple), dependencies are satisfied as soon as the service initializes successfully. With Type=oneshot, ordering is only satisfied once the main process exits, which aligns better with our stage-based execution model.

RemainAfterExit=yes serves a different purpose. It keeps the service in the active state after execution has completed, allowing the unit to represent that a stage has finished. That is why barrier services use it by definition—they act as completion markers for all work associated with that target.

Without RemainAfterExit=yes, barrier services immediately transition to inactive after completion, which can lead to them being retriggered when revisiting target dependencies later in the flow.


## RemainAfterExit

In sev-certify, use `RemainAfterExit=yes` with oneshot services and `RemainAfterExit=no`, the default, with simple services.<br>
Contributor Author

@copilot It's possible that the code isn't compliant with the guidelines. The code that I believe you're thinking of here was committed and merged before this PR was opened.


`RemainAfterExit` has the same semantics for oneshot and simple services. `RemainAfterExit=no` (default) means the service will stop when the main process exits. `RemainAfterExit=yes` means the service will stay active after the main process exits.<br>
Contributor

Suggested change:
`RemainAfterExit` has the same semantics for oneshot and simple services. `RemainAfterExit=no` (default) means the service will become inactive when the main process exits. `RemainAfterExit=yes` means the service will stay active after the main process exits.<br>


Simple services are expected to keep running so `RemainAfterExit=yes` is much less common with them than with oneshot services. (For simple services, `RemainAfterExit=yes` normally has no effect and can mask exit-causing errors.)<br>

default: no<br>

## DefaultDependencies

In sev-certify, use `DefaultDependencies=no`.<br>

`DefaultDependencies=no` allows precise, self-contained placement of a unit in the dependency graph. systemd units in sev-certify aren't standard and default dependencies don't make sense for them.<br>

Comment on lines +87 to +90
Contributor Author

Same comment as above. While it's true that the guidelines are in flux, the sev-certify code being thought about here may not end up being compliant, and it isn't compliant with this version/commit of the guidelines.

default: yes<br>

## KillMode

`KillMode` controls which processes systemd will kill when a unit is stopped.<br>
Comment on lines +93 to +95
Contributor

Do you have an example of where we use KillMode?


For sev-certify, it's better to use `RemainAfterExit=yes` to avoid undesired process killing than to change `KillMode` from control-group, its default.<br>

default: control-group<br>

## TimeoutStartSec

See the Type section above. Also, `TimeoutStartSec=infinity` is how to express no timeout.<br>
Contributor

Same thing, I don't know if we use this at all


default: 90s (from `DefaultTimeoutStartSec`); for `Type=oneshot` services the timeout is disabled by default<br>
