Description
Another one. My clanker diagnosed and wrote up the issue. All I know is my stuff is broke.
Swamp version: 20260325.134747.0
Environment: GitHub Actions ubuntu-latest, AWS eu-west-1
Summary
When using SWAMP_DATASTORE=s3:<bucket>/<prefix>, Swamp model method calls hang indefinitely due to stale lock files in S3 that are never released or expired. The ttlMs field in the lock payload is ignored — once a lock file exists in S3, subsequent operations that need to acquire the same lock block forever.
Reproduction
Setup
- S3 bucket with versioning enabled (required per Swamp docs for conditional writes)
- `SWAMP_DATASTORE=s3:claw-hosting-swamp-state/swamp-state`
- 10 models run sequentially via `swamp model method run <name> create --json`
- Models have CEL cross-references (e.g. route-table references vpc, subnet, igw)
Steps to reproduce
- Run a sequence of `swamp model method run` calls against models with the S3 datastore
- After 3-6 successful model operations, the next operation hangs indefinitely
- Inspecting S3 reveals `.lock` files that were never deleted after the previous operation completed
Observed behaviour
This was reproduced 3 times across different CI runs:
Occurrence 1 — Create sequence, hang on model 4 of 10:
```
Running create on bc01-vpc         # ~12s, succeeded
Running create on bc01-igw         # ~8s, succeeded
Running create on bc01-subnet      # ~8s, succeeded
Running create on bc01-route-table # HUNG — no output for 30+ minutes
```
S3 state at time of hang:
```
swamp-state/.datastore.lock (197 bytes, never released)
swamp-state/data/@claw/subnet/12c6b12c-.../.lock (197 bytes, never released)
```
Lock file contents:
```json
{
  "holder": "runner@runnervm46oaq",
  "hostname": "runnervm46oaq",
  "pid": 3652,
  "acquiredAt": "2026-03-25T15:31:31.513Z",
  "ttlMs": 30000,
  "nonce": "55f05fa4-07ef-4f9e-9b2c-30f4934f2daf"
}
```
The `ttlMs: 30000` (30 seconds) was set, but the lock was still blocking operations 30+ minutes later.
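For illustration, the staleness that payload implies can be computed from the lock file alone. This is a hypothetical sketch in Python of the check the datastore appears to be skipping, not Swamp's actual code:

```python
import json
from datetime import datetime, timedelta, timezone

def lock_is_stale(payload: str, now: datetime) -> bool:
    """Return True if the lock's own ttlMs has elapsed since acquiredAt."""
    lock = json.loads(payload)
    # acquiredAt uses a trailing "Z"; normalize for fromisoformat()
    acquired = datetime.fromisoformat(lock["acquiredAt"].replace("Z", "+00:00"))
    return now - acquired > timedelta(milliseconds=lock["ttlMs"])

payload = '{"acquiredAt": "2026-03-25T15:31:31.513Z", "ttlMs": 30000}'
# 30+ minutes after acquisition, the lock should read as stale:
lock_is_stale(payload, datetime(2026, 3, 25, 16, 5, tzinfo=timezone.utc))
```

By this arithmetic the observed lock expired 30 seconds after `acquiredAt`, yet operations half an hour later still blocked on it.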
Occurrence 2 — Create sequence, all 10 models succeeded:
After manually deleting the stale locks from occurrence 1 and clearing all S3 state, the full 10-model create sequence completed without hanging. This confirms the locks from a previous run were the cause, not a fundamental issue with the model operations.
Occurrence 3 — Destroy sequence, hang on model 8 of 10:
```
Running delete on bc01-instance         # succeeded
Running delete on bc01-key-pair         # failed (unrelated schema error)
Running delete on bc01-state            # succeeded
Running delete on bc01-sg               # succeeded
Running delete on bc01-instance-profile # failed (unrelated CEL error)
Running delete on bc01-instance-role    # succeeded
Running delete on bc01-route-table      # succeeded
Running delete on bc01-subnet           # HUNG
```
S3 state at time of hang:
```
swamp-state/.datastore.lock (197 bytes)
swamp-state/data/@claw/route-table/bc51e40d-.../.lock (197 bytes)
```
Same pattern: global .datastore.lock and a per-model .lock left behind from a previous operation, blocking the current one.
Expected behaviour
- Lock files should be deleted after the operation that acquired them completes (success or failure)
- If a lock file exists but its `ttlMs` has expired (based on `acquiredAt`), subsequent operations should treat it as stale and acquire a new lock
- If the process that held the lock has exited (different PID/hostname), the lock should be considered abandoned
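A minimal sketch of that expected acquire/release path, using an in-memory dict in place of S3 (all names hypothetical; a real implementation would need a conditional/atomic S3 write to avoid two processes stealing the same stale lock):

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

store = {}  # stands in for the S3 bucket: key -> lock payload JSON

def try_acquire(key: str, holder: str, ttl_ms: int = 30000) -> bool:
    """Acquire the lock at `key`, treating an expired lock as abandoned."""
    now = datetime.now(timezone.utc)
    existing = store.get(key)
    if existing is not None:
        lock = json.loads(existing)
        acquired = datetime.fromisoformat(lock["acquiredAt"].replace("Z", "+00:00"))
        if now - acquired <= timedelta(milliseconds=lock["ttlMs"]):
            return False  # live lock held by another process
        # ttlMs has lapsed: treat the lock as stale and take it over
    store[key] = json.dumps({
        "holder": holder,
        "acquiredAt": now.isoformat().replace("+00:00", "Z"),
        "ttlMs": ttl_ms,
        "nonce": str(uuid.uuid4()),
    })
    return True

def release(key: str) -> None:
    # must run on success AND failure, e.g. in a finally block
    store.pop(key, None)
```

The key property is that `try_acquire` never blocks forever: a lock older than its own `ttlMs` is overwritten rather than waited on.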
Workaround
Manually delete stale lock files from S3 before running operations:
```shell
aws s3 rm s3://<bucket>/<prefix>/.datastore.lock
aws s3 rm s3://<bucket>/<prefix>/ --recursive --exclude "*" --include "*/.lock"
```
This unblocks the next operation, but the locks accumulate again after a few operations.
Impact
This makes the S3 datastore unusable for sequential multi-model operations in CI pipelines. Each swamp model method run call is a separate process, and locks from process N are not released before process N+1 runs, eventually causing a hang.
The issue does NOT occur with the local .swamp/ datastore, only with SWAMP_DATASTORE=s3:....
Environment details
- Swamp version: `20260325.134747.0` (from the `swampVersion` field in `.swamp.yaml`)
- S3 bucket: versioning enabled, `eu-west-1`
- IAM permissions: full S3 access to the bucket (GetObject, PutObject, DeleteObject, ListBucket, GetBucketVersioning, ListBucketVersions)
- Each model method run is a separate OS process (called from bash via `swamp model method run <name> <method> --json`)
- Typically 3-8 operations succeed before the hang occurs