Swamp S3 Datastore: Stale Locks Cause Indefinite Hangs #872

@kief

Description

Another one. My AI assistant diagnosed and wrote up the issue below; all I know is that my stuff is broken.

Swamp version: 20260325.134747.0
Environment: GitHub Actions ubuntu-latest, AWS eu-west-1

Summary

When using SWAMP_DATASTORE=s3:<bucket>/<prefix>, Swamp model method calls hang indefinitely due to stale lock files in S3 that are never released or expired. The ttlMs field in the lock payload is ignored — once a lock file exists in S3, subsequent operations that need to acquire the same lock block forever.

Reproduction

Setup

  • S3 bucket with versioning enabled (required per Swamp docs for conditional writes)
  • SWAMP_DATASTORE=s3:claw-hosting-swamp-state/swamp-state
  • 10 models run sequentially via swamp model method run <name> create --json
  • Models have CEL cross-references (e.g. route-table references vpc, subnet, igw)

Steps to reproduce

  1. Run a sequence of swamp model method run calls against models with S3 datastore
  2. After 3-6 successful model operations, the next operation hangs indefinitely
  3. Inspecting S3 reveals .lock files that were never deleted after the previous operation completed

Observed behaviour

This was reproduced 3 times across different CI runs:

Occurrence 1 — Create sequence, hang on model 4 of 10:

Running create on bc01-vpc        # ~12s, succeeded
Running create on bc01-igw        # ~8s, succeeded
Running create on bc01-subnet     # ~8s, succeeded
Running create on bc01-route-table  # HUNG — no output for 30+ minutes

S3 state at time of hang:

swamp-state/.datastore.lock                              (197 bytes, never released)
swamp-state/data/@claw/subnet/12c6b12c-.../.lock         (197 bytes, never released)

Lock file contents:

{
  "holder": "runner@runnervm46oaq",
  "hostname": "runnervm46oaq",
  "pid": 3652,
  "acquiredAt": "2026-03-25T15:31:31.513Z",
  "ttlMs": 30000,
  "nonce": "55f05fa4-07ef-4f9e-9b2c-30f4934f2daf"
}

The ttlMs: 30000 (30 seconds) was set but the lock was still blocking operations 30+ minutes later.
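To make the expected semantics concrete, here is a minimal Python sketch (not Swamp's actual code) that computes the expiry deadline implied by the payload above, i.e. acquiredAt + ttlMs. Any reader of the lock can determine from these two fields alone that it went stale at 15:32:01.513Z:

```python
from datetime import datetime, timedelta, timezone
import json

# The lock payload observed in S3 (copied from this report).
payload = json.loads("""
{
  "holder": "runner@runnervm46oaq",
  "hostname": "runnervm46oaq",
  "pid": 3652,
  "acquiredAt": "2026-03-25T15:31:31.513Z",
  "ttlMs": 30000,
  "nonce": "55f05fa4-07ef-4f9e-9b2c-30f4934f2daf"
}
""")

def lock_deadline(lock: dict) -> datetime:
    """The moment the lock should be treated as stale: acquiredAt + ttlMs."""
    acquired = datetime.fromisoformat(lock["acquiredAt"].replace("Z", "+00:00"))
    return acquired + timedelta(milliseconds=lock["ttlMs"])

deadline = lock_deadline(payload)
print(deadline.isoformat())  # 2026-03-25T15:32:01.513000+00:00

# 30+ minutes after acquisition the lock is long past its deadline,
# yet Swamp still treats it as held.
now = datetime(2026, 3, 25, 16, 5, tzinfo=timezone.utc)
print(now > deadline)  # True
```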

Occurrence 2 — Create sequence, all 10 models succeeded:

After manually deleting the stale locks from occurrence 1 and clearing all S3 state, the full 10-model create sequence completed without hanging. This confirms the locks from a previous run were the cause, not a fundamental issue with the model operations.

Occurrence 3 — Destroy sequence, hang on model 8 of 10:

Running delete on bc01-instance       # succeeded
Running delete on bc01-key-pair       # failed (unrelated schema error)
Running delete on bc01-state          # succeeded
Running delete on bc01-sg             # succeeded
Running delete on bc01-instance-profile  # failed (unrelated CEL error)
Running delete on bc01-instance-role  # succeeded
Running delete on bc01-route-table    # succeeded
Running delete on bc01-subnet         # HUNG

S3 state at time of hang:

swamp-state/.datastore.lock                                    (197 bytes)
swamp-state/data/@claw/route-table/bc51e40d-.../.lock          (197 bytes)

Same pattern: global .datastore.lock and a per-model .lock left behind from a previous operation, blocking the current one.

Expected behaviour

  1. Lock files should be deleted after the operation that acquired them completes (success or failure)
  2. If a lock file exists but its ttlMs has expired (based on acquiredAt), subsequent operations should treat it as stale and acquire a new lock
  3. If the process that held the lock has exited (different PID/hostname), the lock should be considered abandoned
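The acquire path implied by points 1-2 can be sketched as follows. This is a hypothetical illustration, not Swamp's implementation: the `store` dict stands in for S3, and `put_if_absent` models a conditional write (S3 `PutObject` with `If-None-Match: *`, which is why the docs require versioning). The key behaviour is that an existing lock whose TTL has elapsed is deleted and re-acquired instead of blocking forever:

```python
import json
import uuid
from datetime import datetime, timedelta, timezone

def put_if_absent(store: dict, key: str, body: bytes) -> bool:
    """Stand-in for an S3 conditional write: succeeds only if the key is new."""
    if key in store:
        return False
    store[key] = body
    return True

def is_stale(lock: dict, now: datetime) -> bool:
    """A lock is stale once now > acquiredAt + ttlMs."""
    acquired = datetime.fromisoformat(lock["acquiredAt"].replace("Z", "+00:00"))
    return now > acquired + timedelta(milliseconds=lock["ttlMs"])

def acquire(store: dict, key: str, holder: str, ttl_ms: int, now: datetime) -> bool:
    body = json.dumps({
        "holder": holder,
        "acquiredAt": now.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z",
        "ttlMs": ttl_ms,
        "nonce": str(uuid.uuid4()),
    }).encode()
    if put_if_absent(store, key, body):
        return True                      # no existing lock: acquired
    existing = json.loads(store[key])
    if is_stale(existing, now):
        del store[key]                   # expired lock is abandoned: steal it
        return put_if_absent(store, key, body)
    return False                         # live lock: caller retries with backoff

# A lock left behind by a dead process stops blocking once its TTL passes:
store = {}
t0 = datetime(2026, 3, 25, 15, 31, 31, tzinfo=timezone.utc)
assert acquire(store, ".datastore.lock", "proc-1", 30000, t0)      # process 1 wins
assert not acquire(store, ".datastore.lock", "proc-2", 30000, t0)  # still held
t1 = t0 + timedelta(minutes=30)                                    # process 1 died
assert acquire(store, ".datastore.lock", "proc-2", 30000, t1)      # stale, stolen
```

With the current behaviour, the `is_stale` branch is effectively missing, so the final step blocks forever instead of succeeding.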

Workaround

Manually delete stale lock files from S3 before running operations:

aws s3 rm s3://<bucket>/<prefix>/.datastore.lock
aws s3 rm s3://<bucket>/<prefix>/ --recursive --exclude "*" --include "*/.lock"

This unblocks the next operation, but the locks will accumulate again after a few operations.

Impact

This makes the S3 datastore unusable for sequential multi-model operations in CI pipelines. Each swamp model method run call is a separate process, and locks from process N are not released before process N+1 runs, eventually causing a hang.

The issue does NOT occur with the local .swamp/ datastore, only with SWAMP_DATASTORE=s3:....

Environment details

  • Swamp version: 20260325.134747.0 (from .swamp.yaml swampVersion field)
  • S3 bucket: versioning enabled, eu-west-1
  • IAM permissions: full S3 access to the bucket (GetObject, PutObject, DeleteObject, ListBucket, GetBucketVersioning, ListBucketVersions)
  • Each model method run is a separate OS process (called from bash via swamp model method run <name> <method> --json)
  • Typically 3-8 operations succeed before the hang occurs

Metadata

Labels: bug (Something isn't working), lifecycle/implementing (Implementation is in progress)
