feat: service restart support for file config plugin #3931

Merged
albinsuresh merged 10 commits into thin-edge:main from
albinsuresh:feat/file-config-plugin-service-restart
Mar 17, 2026

Conversation

Contributor

@albinsuresh albinsuresh commented Jan 15, 2026

Proposed changes

  • extract system_services module from tedge into a standalone crate to use from the plugin
  • service restart support for file config plugin
  • update config_update workflow with granular operation steps
  • handle plugins restarting the tedge-agent itself
  • detecting a "reload" (when the service action is reload instead of restart) is left to dedicated plugins, as generic logic for this is not possible in the file plugin

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s. You can activate automatic signing by running just prepare-dev once)
  • I ran just format as mentioned in CODING_GUIDELINES
  • I used just check as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Contributor

github-actions Bot commented Jan 15, 2026

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
| --- | --- | --- | --- | --- | --- |
| 863 | 0 | 3 | 863 | 100 | 2h41m5.955277s |

@albinsuresh albinsuresh marked this pull request as ready for review January 15, 2026 20:08
Comment thread crates/common/tedge_system_services/src/services.rs Outdated
Comment thread plugins/tedge_file_config_plugin/src/lib.rs Outdated
Contributor

@didier-wenzek didier-wenzek left a comment

Approved

Comment thread plugins/tedge_file_config_plugin/src/config.rs
didier-wenzek
didier-wenzek previously approved these changes Jan 19, 2026
@reubenmiller reubenmiller added the theme:configuration Theme: Configuration management label Jan 19, 2026
Contributor

reubenmiller commented Jan 19, 2026

A few missing pieces/open questions:

  1. Missing log statements indicating that the service is being restarted. No such logs were present in the tedge-agent logs or the workflow log (/var/log/tedge/agent/*.log).
  2. What happens if "tedge-agent" is used in the service property? Does this break the workflow? If so, we might need to use the background action to perform the restart of the tedge-agent.

@albinsuresh albinsuresh marked this pull request as draft January 21, 2026 09:13
on_success = "apply"

[apply]
background_script = "sudo /usr/share/tedge/config-plugins/file apply ${.payload.type}"
Contributor Author

Even when we convert this step into a builtin task, this plugin apply command should be executed as a background task, irrespective of the type of the config and whether that type restarts tedge-agent or not. That'll simplify this workflow.

Contributor

Though the downside of doing the background task is that the "verify" state will be executed before the "apply" command is done.

Contributor Author

Since we have a workdir available (introduced in 55c2d76), how about using that in the apply state to write a "completion marker file" to that directory, which the verify phase can check for afterwards with a timeout? We just need to formalize the "completion marker file" contract.
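
For illustration, the marker-file idea above could look roughly like the sketch below. It is not taken from the PR: the marker file name `apply.done`, the function names, and the polling approach are all assumptions, and the actual contract was still to be formalized at this point in the discussion.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical marker file name; the PR does not define one.
const MARKER_FILE: &str = "apply.done";

/// Location of the marker inside the operation's workdir.
fn marker_path(workdir: &Path) -> PathBuf {
    workdir.join(MARKER_FILE)
}

/// Would be called as the very last action of the background `apply` step.
fn write_completion_marker(workdir: &Path) -> io::Result<()> {
    fs::write(marker_path(workdir), b"done")
}

/// Would be called by the `verify` step: poll until the marker
/// appears or the timeout expires. Returns `true` if it was found.
fn wait_for_completion_marker(workdir: &Path, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if marker_path(workdir).exists() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
}
```

With something like this, `verify` would only proceed once `apply` has signalled completion, which addresses the ordering concern raised above at the cost of one more file-based convention.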

Contributor

Though can't we just tell people not to restart the tedge-agent from within a plugin? And if they want to restart a service then pass back some data via the workflow state exchange mechanism (e.g. :::begin-tedge:::)?

Contributor Author

@albinsuresh albinsuresh Jan 23, 2026

> And if they want to restart a service then pass back some data via the workflow state exchange mechanism (e.g. :::begin-tedge:::)?

Yeah, I had considered that option. But, these are the reasons why I tried to avoid this "restart request" exchange from the plugin to the agent via the command state:

  1. We'll have to introduce yet another state after apply for the agent to receive the restart request from the plugin and then process it by restarting the agent. It feels like an optional state in the workflow that is only there for one specific case (the agent restart).
  2. We'll be exposing the workflow state update contract (:::begin-tedge::: and :::end-tedge:::) to the plugin, which was otherwise unaware of the workflow contracts. That's why I was looking for a simple file-system based contract via the workdir.
  3. Consistency, in the sense that plugin authors don't need to care whether they are restarting the agent or any other service, and don't need to do things differently for one vs the other. It would have been different if all service restarts were done by the agent. But since the agent running as tedge can't do that, it brings that inconsistency. Though, we can get around this problem by making the agent use tedgectl.

But, I agree that this workdir approach introduces yet another mechanism for metadata exchange between states, when the MQTT state is already available. I thought this would be less of a problem as the workdir would anyway be used for config backups and even other things like service restart detection using persisted metadata (PIDs). So, this would be just one additional thing on that list.
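
For comparison, the alternative discussed in this thread (a plugin passing data back through its output, wrapped in the `:::begin-tedge:::` / `:::end-tedge:::` markers) could be sketched as follows. This is a minimal illustration only, not the agent's actual parser; the function name and the sample payload used in the note below are hypothetical.

```rust
const BEGIN: &str = ":::begin-tedge:::";
const END: &str = ":::end-tedge:::";

/// Return the text a script printed between the two markers,
/// or `None` if either marker is missing.
fn extract_tedge_payload(stdout: &str) -> Option<&str> {
    // Skip past the opening marker.
    let start = stdout.find(BEGIN)? + BEGIN.len();
    // Look for the closing marker only after the opening one.
    let len = stdout[start..].find(END)?;
    Some(stdout[start..start + len].trim())
}
```

A plugin that printed `:::begin-tedge:::`, then a JSON line such as `{"restart":"tedge-agent"}` (a made-up field), then `:::end-tedge:::` would yield that JSON line as the payload. This is exactly the coupling to the workflow contract that point 2 above tries to avoid.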

Contributor Author

Resolved by 7b83421

Comment thread tests/RobotFramework/tests/cumulocity/configuration/composite_config_update.toml Outdated
Comment thread tests/RobotFramework/tests/cumulocity/configuration/composite_config_update.toml Outdated
Comment thread tests/RobotFramework/tests/cumulocity/configuration/plugins/file Outdated
Comment thread tests/RobotFramework/tests/cumulocity/configuration/composite_config_update.toml Outdated
@didier-wenzek didier-wenzek dismissed their stale review January 22, 2026 10:06

We decided to explore a more flexible design, using workflows.

Comment thread tests/RobotFramework/tests/cumulocity/configuration/composite_config_update.toml Outdated
Comment thread tests/RobotFramework/tests/cumulocity/configuration/composite_config_update.toml Outdated
Comment thread crates/core/tedge_actors/src/errors.rs Outdated
Comment on lines +43 to +44
#[error("A shutdown has been requested")]
Shutdown,
Contributor

This doesn't look like an error.

Contributor

Suggested change
#[error("A shutdown has been requested")]
Shutdown,
#[error("A restart is required: not running latest version and configuration ")]
RestartRequired,
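
To make the intent of the rename concrete, here is a dependency-free sketch of how a caller might tell a restart request apart from a genuine failure. It is a simplified stand-in, not the crate's real type: the enum has only two variants, `ActorError` carries a plain string, `Display` is written by hand instead of using the derive attribute shown in the diff, and `is_restart_request` is a hypothetical helper.

```rust
use std::fmt;

/// Simplified stand-in for the runtime error type discussed above.
#[derive(Debug)]
enum RuntimeError {
    /// A genuine actor failure (message only, for brevity).
    ActorError(String),
    /// Not a failure: the process must restart to pick up
    /// a new version or configuration.
    RestartRequired,
}

impl fmt::Display for RuntimeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RuntimeError::ActorError(msg) => write!(f, "Actor error: {msg}"),
            RuntimeError::RestartRequired => write!(
                f,
                "A restart is required: not running latest version and configuration"
            ),
        }
    }
}

impl std::error::Error for RuntimeError {}

/// Hypothetical helper: did the runtime stop because a restart was requested?
fn is_restart_request(result: &Result<(), RuntimeError>) -> bool {
    matches!(result, Err(RuntimeError::RestartRequired))
}
```

Naming the variant after what the caller must do (restart) rather than what happened (shutdown) lets the top-level loop treat it as a control signal instead of logging it as an error.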

Comment thread crates/core/tedge_agent/src/operation_workflows/actor.rs Outdated
Contributor

@reubenmiller reubenmiller left a comment

I ran into an issue when trying to trigger a tedge-agent restart after applying a new tedge.toml file: the operation fails for some reason. At least this line was found in the operation log:

ERROR: builtin action 'builtin:config_update:verify' failed: No builtin operation step handler registered for config_update operation verify step

Below is the tedge-configuration-plugin.toml that I configured in my test:

file: /etc/tedge/plugins/tedge-configuration-plugin.toml

files = [
    { path = '/etc/tedge/system.toml', type = 'system.toml', service = 'tedge-agent' },
    { path = '/etc/tedge/tedge.toml', type = 'tedge.toml', service = 'tedge-agent' },
    { path = '/etc/lighttpd/lighttpd.conf', type = 'lighttpd-conf', service = 'lighttpd', service_action = 'restart' }
]

And the full operation log of the failed operation:
TST_actualize_shiny_bend_workflow-config_update-c8y-mapper-53005959.log

@albinsuresh
Contributor Author

tedge-configuration-plugin.toml

The issue was caused by using a tedge.toml that was intentionally or unintentionally disabling the config management feature of the tedge-agent with the following setting in it:

[agent.enable]
config_snapshot = false

So, the config manager actor did the first part of the config update until the agent restart, but could not resume it post-restart as the actor itself is not present.

That single config_snapshot flag determines whether the config manager actor is loaded or not. We should fix that logic to check both config_snapshot && config_update.

@albinsuresh albinsuresh temporarily deployed to Test Pull Request March 12, 2026 09:49 — with GitHub Actions Inactive
Resolved by 945642f

... tedge.toml
... tedge-log-plugin

Config update restarts service if configured in plugin configuration
Contributor Author

To be removed, as I can't remember why I added this back then and the Set Configuration Should Restart Service test above covers what was intended anyway.

Contributor

@reubenmiller reubenmiller left a comment

The problem I ran into earlier has been resolved, and it's working nicely now. There is a related followup action tracked in #4000, but as of now the PR is working as expected.

Contributor

@didier-wenzek didier-wenzek left a comment

Approved. Thank you for your perseverance on this long-standing work.

I really like the new design for the communications between the workflow engine and an operation actor.

This has also been an opportunity to fix a hack used to request an agent restart.

// Make sure the operation status is properly reported before the restart
tokio::time::sleep(Duration::from_secs(5)).await;
return Err(RuntimeError::ActorError(Box::new(SoftwareManagerError::NotRunningLatestVersion)));
return Err(RuntimeError::RestartRequired);
Contributor

Good point.

@albinsuresh albinsuresh force-pushed the feat/file-config-plugin-service-restart branch from 945642f to ab35950 Compare March 17, 2026 15:23
@albinsuresh albinsuresh temporarily deployed to Test Pull Request March 17, 2026 15:23 — with GitHub Actions Inactive
@albinsuresh albinsuresh added this pull request to the merge queue Mar 17, 2026
Merged via the queue into thin-edge:main with commit 853fb00 Mar 17, 2026
34 checks passed
@albinsuresh albinsuresh deleted the feat/file-config-plugin-service-restart branch March 18, 2026 05:24
Labels

theme:configuration Theme: Configuration management

3 participants