Swamp S3 Datastore: Stale Locks Cause Indefinite Hangs #872

@kief

Description

Another one. My AI assistant diagnosed and wrote up the issue below; all I know is that my stuff is broken.

Swamp version: 20260325.134747.0
Environment: GitHub Actions ubuntu-latest, AWS eu-west-1

Summary

When using SWAMP_DATASTORE=s3:<bucket>/<prefix>, Swamp model method calls hang indefinitely due to stale lock files in S3 that are never released or expired. The ttlMs field in the lock payload is ignored — once a lock file exists in S3, subsequent operations that need to acquire the same lock block forever.

Reproduction

Setup

  • S3 bucket with versioning enabled (required per Swamp docs for conditional writes)
  • SWAMP_DATASTORE=s3:claw-hosting-swamp-state/swamp-state
  • 10 models run sequentially via swamp model method run <name> create --json
  • Models have CEL cross-references (e.g. route-table references vpc, subnet, igw)

Steps to reproduce

  1. Run a sequence of swamp model method run calls against models with S3 datastore
  2. After 3-6 successful model operations, the next operation hangs indefinitely
  3. Inspecting S3 reveals .lock files that were never deleted after the previous operation completed

Observed behaviour

This was reproduced 3 times across different CI runs:

Occurrence 1 — Create sequence, hang on model 4 of 10:

Running create on bc01-vpc        # ~12s, succeeded
Running create on bc01-igw        # ~8s, succeeded
Running create on bc01-subnet     # ~8s, succeeded
Running create on bc01-route-table  # HUNG — no output for 30+ minutes

S3 state at time of hang:

swamp-state/.datastore.lock                              (197 bytes, never released)
swamp-state/data/@claw/subnet/12c6b12c-.../.lock         (197 bytes, never released)

Lock file contents:

{
  "holder": "runner@runnervm46oaq",
  "hostname": "runnervm46oaq",
  "pid": 3652,
  "acquiredAt": "2026-03-25T15:31:31.513Z",
  "ttlMs": 30000,
  "nonce": "55f05fa4-07ef-4f9e-9b2c-30f4934f2daf"
}

The ttlMs: 30000 (30 seconds) was set but the lock was still blocking operations 30+ minutes later.
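To make the expected semantics concrete, here is a minimal Python sketch (not Swamp's actual code) that computes the expiry deadline implied by the payload above, i.e. acquiredAt + ttlMs. Any reader of the lock can determine from these two fields alone that it went stale at 15:32:01.513Z:

```python
from datetime import datetime, timedelta, timezone
import json

# The lock payload observed in S3 (copied from this report).
payload = json.loads("""
{
  "holder": "runner@runnervm46oaq",
  "hostname": "runnervm46oaq",
  "pid": 3652,
  "acquiredAt": "2026-03-25T15:31:31.513Z",
  "ttlMs": 30000,
  "nonce": "55f05fa4-07ef-4f9e-9b2c-30f4934f2daf"
}
""")

def lock_deadline(lock: dict) -> datetime:
    """The moment the lock should be treated as stale: acquiredAt + ttlMs."""
    acquired = datetime.fromisoformat(lock["acquiredAt"].replace("Z", "+00:00"))
    return acquired + timedelta(milliseconds=lock["ttlMs"])

deadline = lock_deadline(payload)
print(deadline.isoformat())  # 2026-03-25T15:32:01.513000+00:00

# 30+ minutes after acquisition the lock is long past its deadline,
# yet Swamp still treats it as held.
now = datetime(2026, 3, 25, 16, 5, tzinfo=timezone.utc)
print(now > deadline)  # True
```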

Occurrence 2 — Create sequence, all 10 models succeeded:

After manually deleting the stale locks from occurrence 1 and clearing all S3 state, the full 10-model create sequence completed without hanging. This confirms the locks from a previous run were the cause, not a fundamental issue with the model operations.

Occurrence 3 — Destroy sequence, hang on model 8 of 10:

Running delete on bc01-instance       # succeeded
Running delete on bc01-key-pair       # failed (unrelated schema error)
Running delete on bc01-state          # succeeded
Running delete on bc01-sg             # succeeded
Running delete on bc01-instance-profile  # failed (unrelated CEL error)
Running delete on bc01-instance-role  # succeeded
Running delete on bc01-route-table    # succeeded
Running delete on bc01-subnet         # HUNG

S3 state at time of hang:

swamp-state/.datastore.lock                                    (197 bytes)
swamp-state/data/@claw/route-table/bc51e40d-.../.lock          (197 bytes)

Same pattern: global .datastore.lock and a per-model .lock left behind from a previous operation, blocking the current one.

Expected behaviour

  1. Lock files should be deleted after the operation that acquired them completes (success or failure)
  2. If a lock file exists but its ttlMs has expired (based on acquiredAt), subsequent operations should treat it as stale and acquire a new lock
  3. If the process that held the lock has exited (different PID/hostname), the lock should be considered abandoned
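The acquire path implied by points 1-2 can be sketched as follows. This is a hypothetical illustration, not Swamp's implementation: the `store` dict stands in for S3, and `put_if_absent` models a conditional write (S3 `PutObject` with `If-None-Match: *`, which is why the docs require versioning). The key behaviour is that an existing lock whose TTL has elapsed is deleted and re-acquired instead of blocking forever:

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

def put_if_absent(store: dict, key: str, body: bytes) -> bool:
    """Stand-in for an S3 conditional write: succeeds only if the key is new."""
    if key in store:
        return False
    store[key] = body
    return True

def is_stale(lock: dict, now: datetime) -> bool:
    """A lock is stale once now > acquiredAt + ttlMs."""
    acquired = datetime.fromisoformat(lock["acquiredAt"].replace("Z", "+00:00"))
    return now > acquired + timedelta(milliseconds=lock["ttlMs"])

def acquire(store: dict, key: str, holder: str, ttl_ms: int, now: datetime) -> bool:
    body = json.dumps({
        "holder": holder,
        "acquiredAt": now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z",
        "ttlMs": ttl_ms,
        "nonce": str(uuid.uuid4()),
    }).encode()
    if put_if_absent(store, key, body):
        return True                      # no existing lock: acquired
    existing = json.loads(store[key])
    if is_stale(existing, now):
        del store[key]                   # expired lock is abandoned: steal it
        return put_if_absent(store, key, body)
    return False                         # live lock: caller retries with backoff

# A lock left behind by a dead process stops blocking once its TTL passes:
store = {}
t0 = datetime(2026, 3, 25, 15, 31, 31, tzinfo=timezone.utc)
assert acquire(store, ".datastore.lock", "proc-1", 30000, t0)      # process 1 wins
assert not acquire(store, ".datastore.lock", "proc-2", 30000, t0)  # still held
t1 = t0 + timedelta(minutes=30)                                    # process 1 died
assert acquire(store, ".datastore.lock", "proc-2", 30000, t1)      # stale, stolen
```

With the current behaviour, the `is_stale` branch is effectively missing, so the final step blocks forever instead of succeeding.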

Workaround

Manually delete stale lock files from S3 before running operations:

aws s3 rm s3://<bucket>/<prefix>/.datastore.lock
aws s3 rm s3://<bucket>/<prefix>/ --recursive --exclude "*" --include "*/.lock"

This unblocks the next operation, but the locks will accumulate again after a few operations.

Impact

This makes the S3 datastore unusable for sequential multi-model operations in CI pipelines. Each swamp model method run call is a separate process, and locks from process N are not released before process N+1 runs, eventually causing a hang.

The issue does NOT occur with the local .swamp/ datastore, only with SWAMP_DATASTORE=s3:....

Environment details

  • Swamp version: 20260325.134747.0 (from .swamp.yaml swampVersion field)
  • S3 bucket: versioning enabled, eu-west-1
  • IAM permissions: full S3 access to the bucket (GetObject, PutObject, DeleteObject, ListBucket, GetBucketVersioning, ListBucketVersions)
  • Each model method run is a separate OS process (called from bash via swamp model method run <name> <method> --json)
  • Typically 3-8 operations succeed before the hang occurs

Metadata

Labels: bug (Something isn't working), lifecycle/implementing (Implementation is in progress)
