Five fixes for plugin reliability against current openclaw + non-trivial state #7

Open
tommyjoseph wants to merge 5 commits into backblaze-b2-samples:main from tommyjoseph:fix/scale-and-correctness

Conversation

@tommyjoseph

Five independent fixes, one per commit, each addressing a distinct bug discovered while bringing this plugin up against openclaw 2026.4 / 2026.5 with a real ~25k-file state directory. Tests grow from 70 → 93 (all green). End-to-end verified with a ~3.7 GB encrypted snapshot landing successfully in B2 after the patches.

Closes

Commits (in order)

  1. Fix import path for json-store helpers; bump openclaw peer range
`readJsonFileWithFallback` and `writeJsonFileAtomically` moved to `openclaw/plugin-sdk/json-store`; `OpenClawPluginService` and `OpenClawPluginServiceContext` moved to `openclaw/plugin-sdk/core`. Bumps the peer/dev openclaw range to `>=2026.4.0` to make the layout requirement explicit. Without this the plugin throws `TypeError: ... is not a function` on every push.

  2. URL-encode S3 object keys for SigV4 + wire URL
`putObject`/`getObject`/`deleteObject` now use a new `s3EncodePath` helper for both signing and the fetch URL. Without this, any object key containing a space, `+`, `'`, `(`, `)`, `*`, or `!` returns 403 AccessDenied. +9 unit tests for `s3EncodePath`.

  3. Add exponential-backoff retry around b2.putObject calls
New `src/retry.ts` module. 6 attempts, 500ms base, 15s cap, ±25% jitter. Retries `fetch failed`, HTTP 5xx, 408, 429, and 400 IncompleteBody; propagates other errors. Both `putObject` sites in `push.ts` use it. +13 unit tests covering retry decisions and behavior.

  4. Skip ENOENT during upload (file vanished between gather and read)
    Inside the per-file upload task, catch ENOENT from readFile and skip the file. Other read errors still propagate. The next push naturally picks up the new state.

  5. Skip files whose paths contain ASCII control characters
    shouldInclude now also rejects any path with a char code <32. Real workloads see this for filenames produced by upstream pipelines that didn't strip CR/LF. +1 test covering several control-char shapes.

Test plan

  • pnpm test → 93 passed, 0 failed (was 70 baseline)
  • pnpm build → clean
  • pnpm typecheck → no new errors (one pre-existing index.ts:69 tool signature error remains, unrelated to these patches)
  • End-to-end: a 25,407-file / ~3.7 GB encrypted push completed in ~57 minutes against B2 us-west-002, recovering automatically from one transient B2 500 mid-flight via the retry helper.

Notes for review

  • Each commit is intentionally atomic and individually reviewable. If any one of them is contentious it can be cherry-picked or dropped without affecting the others.
  • The retry timeouts and concurrency are conservative defaults; happy to tune them down if you'd like.
  • The control-char filter is positioned as a defensive workaround, not a fix — the right long-term fix lives in whatever upstream pipeline produces those filenames.
  • (This PR replaces #6, which was opened from a different fork; same commits, same content.)

Commit 1: Fix import path for json-store helpers; bump openclaw peer range

`readJsonFileWithFallback` and `writeJsonFileAtomically` were exported
from the top-level `openclaw/plugin-sdk` in openclaw 2026.2 (281 names
exported there) but moved into the `openclaw/plugin-sdk/json-store`
submodule in openclaw 2026.4 as part of a tree-shaking refactor (only
5 names left at top level).

Without this fix the plugin throws
`TypeError: ... is not a function`
on the first scheduled push when running against openclaw 2026.4+, and
the gateway never produces a backup.

The peer/dev dependency range is bumped to `>=2026.4.0` to make the
new SDK layout requirement explicit. Verified against openclaw 2026.5.7.

Tests: 70 pass.
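
The re-layout described above amounts to a before/after like the following. This is an illustrative fragment, not the plugin's exact source; the module paths and names are the ones stated in the commit message:

```typescript
// openclaw <= 2026.2 (old layout, now broken against 2026.4+):
// import {
//   readJsonFileWithFallback,
//   writeJsonFileAtomically,
//   OpenClawPluginService,
//   OpenClawPluginServiceContext,
// } from "openclaw/plugin-sdk";

// openclaw >= 2026.4 (post-refactor submodule layout):
import {
  readJsonFileWithFallback,
  writeJsonFileAtomically,
} from "openclaw/plugin-sdk/json-store";
import type {
  OpenClawPluginService,
  OpenClawPluginServiceContext,
} from "openclaw/plugin-sdk/core";
```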
Commit 2: URL-encode S3 object keys for SigV4 + wire URL

The b2 client built the request path as `/${bucket}/${key}` and used
that raw string both for SigV4 canonical-request signing and for the
fetch URL. Node's `fetch` URL-encodes spaces (and other reserved
characters) before transmission, but the signer signed the unencoded
path. B2 recomputes the canonical request from the encoded path it
receives, gets a different signature, and rejects the request with
HTTP 403 `AccessDenied — Signature validation failed`.

Symptom: any object key containing a space (or `+`, `'`, `(`, `)`,
`*`, `!`, etc.) failed with 403, while UUID-named keys succeeded.
In a real workload that means user-named files in `workspace/` like
`Marketing Image 1.png` cause the entire scheduled push to abort,
since the first 403 propagates out of `Promise.all`.

Fix: introduce `s3EncodePath` which URI-encodes each segment per the
S3 SigV4 rules (preserves `/` between segments, encodes the AWS-
extended set `' ( ) * !` on top of `encodeURIComponent`'s defaults).
Use the encoded form in both signing and the fetch URL so the two
strings stay in sync.

Tests: added 9 unit tests for `s3EncodePath` covering spaces, the
AWS-extended set, unicode, unreserved chars, slashes, and edge cases.
70 → 79 tests pass.
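
The encoding rule above can be sketched as follows — a minimal illustration assuming the helper simply wraps `encodeURIComponent` per path segment and additionally percent-encodes the AWS-extended set; the plugin's real `s3EncodePath` may differ in detail:

```typescript
// Encode an S3 object path for SigV4 signing and the wire URL.
// Each '/'-separated segment is encoded independently so the slashes
// between segments survive; on top of encodeURIComponent's defaults,
// the AWS-extended set ! ' ( ) * is also percent-encoded.
export function s3EncodePath(path: string): string {
  return path
    .split("/")
    .map((segment) =>
      encodeURIComponent(segment).replace(
        /[!'()*]/g,
        (ch) => `%${ch.charCodeAt(0).toString(16).toUpperCase()}`,
      ),
    )
    .join("/");
}
```

Because the same encoded string feeds both the canonical request and the fetch URL, the signature B2 recomputes from the received path matches the one the client produced.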
Commit 3: Add exponential-backoff retry around b2.putObject calls

A scheduled push of a non-trivial state directory makes thousands of
serial putObject calls. With the previous code, a single transient
failure (Node `fetch failed`, B2 5xx, B2 429, or B2 400 IncompleteBody
from a connection drop mid-upload) caused `Promise.all` to reject and
aborted the entire push — wasting all upload progress and falling back
to retrying everything from scratch on the next cron fire.

This commit introduces `putObjectWithRetry` (in a new `retry.ts`
module) which wraps b2.putObject with exponential-backoff-with-jitter:
- 6 attempts (initial + 5 retries)
- 500ms base delay, doubling, capped at 15s
- ±25% jitter
- Retries `fetch failed`, HTTP 5xx, 408, 429, and 400 IncompleteBody
- Propagates other errors (401/403/404/InvalidRequest/etc.) immediately

`push.ts` is updated to use this wrapper for both the per-file upload
loop and the final manifest upload.

Tests: 13 unit tests cover both `isRetryablePutError` (retry decision
matrix) and `putObjectWithRetry` (success path, retry-then-succeed,
non-retryable propagation, exhaustion, fetch failed, IncompleteBody
vs InvalidRequest disambiguation). 79 → 92 tests pass.
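
The backoff shape described above (6 attempts, 500ms base doubling to a 15s cap, ±25% jitter) can be sketched generically. `withRetry` and `isRetryable` here are illustrative stand-ins, not the plugin's actual `putObjectWithRetry` signature:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry an async operation with exponential backoff and full ±25% jitter.
// attempts = total tries (initial + retries); non-retryable errors and
// the final failure propagate to the caller.
export async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  attempts = 6,
  baseMs = 500,
  capMs = 15_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= attempts - 1 || !isRetryable(err)) throw err;
      const backoff = Math.min(capMs, baseMs * 2 ** attempt); // 500, 1000, 2000, ...
      const jitter = backoff * (Math.random() * 0.5 - 0.25); // ±25% of the backoff
      await sleep(backoff + jitter);
    }
  }
}
```

Jitter matters here because thousands of concurrent uploads hitting the same transient 5xx would otherwise retry in lockstep.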
Commit 4: Skip ENOENT during upload (file vanished between gather and read)

The gather phase walks the state directory and snapshots a list of
file paths. The upload phase then reads each file in turn. If another
subsystem (an LCM compactor, a session-cleanup hook, a user `rm`, etc.)
deletes a file between those two phases, the upload's `readFile`
rejects with `ENOENT` and aborts the whole push via `Promise.all`.

This is a real condition in any agent that maintains its own session
files in the background — losing 15+ minutes of upload progress to a
single deleted JSONL file is not acceptable.

Fix: catch ENOENT specifically inside the per-file upload task, log
a debug line, and skip the file. Other read errors (EACCES, EIO, etc.)
still propagate. The next push naturally picks up the new state.
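
The skip-on-ENOENT pattern looks roughly like this — an illustrative helper, not the plugin's actual upload-task code:

```typescript
import { readFile } from "node:fs/promises";

// Read a file snapshotted during the gather phase. If it vanished
// before the upload phase got to it, return null so the caller can
// skip it; every other read error (EACCES, EIO, ...) propagates.
export async function readOrSkip(path: string): Promise<Buffer | null> {
  try {
    return await readFile(path);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") {
      // Deleted between gather and read: log-and-skip, not abort.
      return null;
    }
    throw err;
  }
}
```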
Commit 5: Skip files whose paths contain ASCII control characters

B2 rejects object keys containing characters with codes <32 with HTTP
400 `InvalidRequest — File names must not contain unicode characters
with codes less than 32`. The plugin's previous behavior was to send
the file anyway, which aborted the entire push the moment B2 saw the
first such filename.

In practice this happened for paths produced by upstream document/
email-attachment pipelines that didn't strip trailing CR/LF from
attachment names. The underlying bug belongs to those pipelines, but
the backup plugin should not lose the rest of the snapshot just
because one filename is malformed.

Fix: filter at gather time inside `shouldInclude`. Tests cover trailing
CR, embedded newline, tab inside a path segment, NUL byte, and a
control char in an otherwise-included filename.

92 → 93 tests pass.
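
The check itself is a one-pass predicate over code units below 32. The real rejection lives inside the plugin's `shouldInclude`; this standalone predicate is illustrative:

```typescript
// True if any character in the path has a code below 32
// (NUL, TAB, LF, CR, and the rest of the ASCII control range),
// which B2 rejects with 400 InvalidRequest.
export function hasControlChars(path: string): boolean {
  for (const ch of path) {
    if (ch.charCodeAt(0) < 32) return true;
  }
  return false;
}
```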