Skip to content

Comments

Implement Script Isolation#1803

Draft
rhysparry wants to merge 1 commit intomainfrom
rhys/eft-251/script-isolation
Draft

Implement Script Isolation#1803
rhysparry wants to merge 1 commit intomainfrom
rhys/eft-251/script-isolation

Conversation

@rhysparry
Copy link
Contributor

@rhysparry rhysparry commented Feb 25, 2026

Overview

This change provides an implementation to enforce script isolation within Calamari.

Currently, Tentacle provides script isolation functionality, by maintaining a series of in-process locks.

For SSH, this is achieved by performing the locking on the Octopus Instance running the script. This introduces issues for multi-node clusters as the lock is limited to a single node. Our execution scaling strategies rely on being able to create additional nodes to process service bus messages, so this needs to be addressed to support SSH targets and workers.

This implementation uses a file locking approach to share the lock across multiple processes.

Result

When the appropriate environment variables are supplied, Calamari will acquire an appropriate lock before running the relevant command.

The relevant environment variables are:

  • CalamariScriptIsolationLevel which defines the script isolation level Calamari should enfore. This must be one of two values (case insensitive): FullIsolation to run the command independent of all other commands and NoIsolation to allow the command to run concurrently with other NoIsolation commands.
  • CalamariScriptIsolationMutexName defines a mutex name to use. Only other commands running with the same mutex name will be isolated. This name should be valid to use in a file name.
  • CalamariScritpIsolationMutexTimeout defines a timeout to use when waiting for the mutex. A timeout value should be greater that 10ms and less than 1 day. Providing a value outside of this range will remove the timeout and either reduce the number of retries or rety indefinitely.

Additionally, TentacleHome is used as a location to store the files used for the lock. This directory resolves to something like /home/username/.octopus/machine-slug. This means that if the same machine is registered with a different slug it will use independent isolation locks.

Current Caveats

  • Logging is largely for diagnosing at this stage and will likely be reduced. There may also be a preferred alternative approach to doing the logging that we should switch to.
  • The entry point is a little bit unclear, so the static Isolation.Enforce() and Isolation.EnforceAsync() methods are used in three separate places. Ideally we could do some consolidation and there might be a spot missing.
  • This approach only isolates the Calamari component of the execution, not additional tasks, such as extracting and installing Calamari in the first place.
  • Testing so far has largely been focused on running the lock logic in a series of separate processes (independently of Calamari).

Miscellaneous

⚠️ Does this change require a corresponding Server Change?
⚠️ If so - please add a "Requires Server Change" label to this PR!

@rhysparry rhysparry force-pushed the rhys/eft-251/script-isolation branch from ca2cf06 to e49c2f6 Compare February 25, 2026 03:58
var fileStream = File.Open(lockOptions.LockFile, FileMode.OpenOrCreate, FileAccess.ReadWrite, fileShareMode);
return new LockHandle(fileStream);
}
catch (IOException e)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IOException is possibly a little bit broad here. We could potentially perform additional checks, but that could get fragile if we rely on the error message string.

Comment on lines +27 to +28
LockType.Exclusive => FileShare.None,
LockType.Shared => FileShare.ReadWrite,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted to use a custom enum to map to the appropriate FileShare value used in the underlying lock. This keeps the intended usage clearer.

Comment on lines +12 to +13
static readonly TimeSpan RetryInitialDelay = TimeSpan.FromMilliseconds(10);
static readonly TimeSpan RetryMaxDelay = TimeSpan.FromMilliseconds(500);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to use a polling strategy in Tentacle's ScriptIsolationMutex.

static ResiliencePipeline<ILockHandle> BuildLockAcquisitionPipeline(LockOptions lockOptions)
{
var builder = new ResiliencePipelineBuilder<ILockHandle>()
.AddFallback(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During testing I found that if the retry configuration was just right (or wrong, depending on your point-of-view) it would occasionally throw the exception from acquiring the lock, but mostly the timeout exception. This is intended to normalise the exceptions.

I believe it does have the side effect currently of hiding other types of exceptions.

{
try
{
using var _ = Isolation.Enforce();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add these in multiple spots (see the other occurrences below). There is a risk here if an additional entry point is added and this isn't included.

Maybe there is a better common entry point/wrapper we can use?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant