Improve recovery APIs and machine identity resilience #3

@warengonzaga

Summary

IntegrityError recovery is currently too fragile for consumers that use @wgtechlabs/secrets-engine as a local machine-bound store.

While integrating it in Tiny Claw, we hit a Windows failure mode where:

  • an existing store becomes unreadable with INTEGRITY_ERROR
  • the caller cannot safely reset the store through the engine because recovery currently depends on a successful open()
  • a failed open() path can briefly leave SQLite handles open on Windows, which makes immediate cleanup harder
  • machine identity derivation appears brittle: drift in adapter ordering or MAC selection may invalidate otherwise healthy stores

This should be fixed in the engine rather than repeatedly patched in downstream consumers.

Observed behavior

A consumer opens the store and gets a generic IntegrityError when the computed HMAC does not match meta.json.

Today that leaves downstream apps with two weak options:

  1. tell users to manually delete the store directory
  2. bypass the engine and remove files directly

Neither option is ideal for a security-focused persistence library.
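
For context, the kind of check that produces this error can be sketched roughly as below. This is only an illustration: the function name, the key, and what exactly gets fed into the HMAC are assumptions on my side, not the engine's actual implementation. It only shows a constant-time comparison of a computed HMAC-SHA256 against a stored hex value such as the one in meta.json.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: returns true when the HMAC-SHA256 of `payload`
// under `key` equals the hex digest stored in the metadata file.
export function hmacMatches(key: Buffer, payload: Buffer, expectedHex: string): boolean {
  const actual = createHmac("sha256", key).update(payload).digest();
  const expected = Buffer.from(expectedHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === actual.length && timingSafeEqual(actual, expected);
}
```

If the key is derived from machine identity, any drift in that identity changes `key`, and a perfectly healthy store fails this comparison in the same way a tampered one would.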

Problems to address

1. Recovery requires successful open()

The engine has an instance destroy() method, but callers cannot use it if open() fails first.

Needed:

  • add a static recovery API such as SecretsEngine.destroyAtPath(path)
  • or add SecretsEngine.resetAtPath(path) if you want a distinction between full destroy and keep-directory reinit

That gives consumers a supported recovery path for corrupt or mismatched stores.
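
A minimal sketch of what such static helpers could look like, assuming the store is just a directory on disk. The names `destroyAtPath`, `resetAtPath`, and the `RecoveryOptions` shape are proposals from this issue, not existing engine APIs, and the retry values are arbitrary. The only real calls used are Node's `fs/promises` `rm`, `mkdir`, and `readdir`.

```typescript
import { mkdir, readdir, rm } from "node:fs/promises";
import { join } from "node:path";

// Hypothetical option shape for the proposed recovery APIs.
interface RecoveryOptions {
  path: string;
  preserveDirectory?: boolean;
}

// With recursive: true, Node retries on EBUSY/EPERM-style errors, which
// helps when Windows is still releasing a handle. Values are illustrative.
const RM_OPTIONS = { recursive: true, force: true, maxRetries: 5, retryDelay: 100 };

// Removes the whole store directory without ever opening the store.
export async function destroyAtPath(opts: RecoveryOptions): Promise<void> {
  await rm(opts.path, RM_OPTIONS);
}

// Empties the store; keeps the directory itself when preserveDirectory is set.
export async function resetAtPath(opts: RecoveryOptions): Promise<void> {
  if (!opts.preserveDirectory) {
    await destroyAtPath(opts);
    return;
  }
  await mkdir(opts.path, { recursive: true });
  const entries = await readdir(opts.path);
  await Promise.all(entries.map((name) => rm(join(opts.path, name), RM_OPTIONS)));
}
```

A real implementation would probably also want a guard that the target actually looks like a store before deleting anything, and a reset would likely reinitialize metadata rather than just empty the directory. Both are left out of this sketch.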

2. Machine identity is too brittle

Current derivation appears to depend on hostname, username, and a primary MAC selection.

On Windows especially, that can drift because of:

  • VPN adapters
  • docking / undocking
  • virtual NICs
  • adapter ordering changes

Needed:

  • make machine identity derivation stable across adapter ordering changes
  • avoid relying on a single "first non-internal MAC" choice
  • document the exact compatibility guarantees and failure modes
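
One possible direction, sketched under the assumption that identity is derived from hostname, username, and MAC addresses as described above: collect every non-internal MAC, normalize, de-duplicate, and sort before hashing, so enumeration order no longer matters. The function names and the choice of SHA-256 over a JSON-encoded tuple are illustrative only, not what the engine does today.

```typescript
import { createHash } from "node:crypto";
import { hostname, networkInterfaces, userInfo } from "node:os";

// Minimal subset of the entries returned by os.networkInterfaces().
interface NicInfo {
  mac: string;
  internal: boolean;
}

// Collects all usable MACs, then sorts them so adapter order is irrelevant.
export function stableMacs(nics: Record<string, NicInfo[] | undefined>): string[] {
  const macs = new Set<string>();
  for (const entries of Object.values(nics)) {
    for (const nic of entries ?? []) {
      const mac = nic.mac.toLowerCase();
      // Skip loopback-style entries and the all-zero placeholder address.
      if (!nic.internal && mac !== "00:00:00:00:00:00") macs.add(mac);
    }
  }
  return Array.from(macs).sort();
}

// Hashes a JSON-encoded tuple so field boundaries stay unambiguous.
export function machineFingerprint(host: string, user: string, macs: string[]): string {
  return createHash("sha256").update(JSON.stringify([host, user, macs])).digest("hex");
}

// Convenience wrapper that reads the current machine's values.
export function currentFingerprint(): string {
  return machineFingerprint(hostname(), userInfo().username, stableMacs(networkInterfaces()));
}
```

Sorting only addresses ordering drift. The fingerprint still changes when an adapter appears or disappears (VPN, docking, virtual NICs), so the engine would likely need an additional strategy for that case, which is part of why the compatibility guarantees should be documented.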

3. Error classification is too generic

Several distinct failure classes are all surfaced as generic integrity failures, or carry too little detail for consumers to act on.

Needed:

  • clearer error taxonomy or subcodes for:
    • metadata missing
    • metadata corrupted
    • unsupported version
    • integrity mismatch
    • machine identity changed / key derivation mismatch
  • preserve a machine-readable .code / subcode for downstream UX
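
To make the taxonomy concrete, here is a rough sketch of how subcodes could sit next to the existing INTEGRITY_ERROR code. The class name, subcode strings, and hint text are all hypothetical. The only thing taken from the current behavior is the top-level `INTEGRITY_ERROR` code.

```typescript
// Hypothetical subcodes mirroring the failure classes listed above.
export type IntegritySubcode =
  | "METADATA_MISSING"
  | "METADATA_CORRUPTED"
  | "UNSUPPORTED_VERSION"
  | "INTEGRITY_MISMATCH"
  | "MACHINE_IDENTITY_CHANGED";

// Hypothetical error class: keeps the existing top-level code and adds a subcode.
export class ClassifiedIntegrityError extends Error {
  readonly code = "INTEGRITY_ERROR";
  constructor(readonly subcode: IntegritySubcode, message: string) {
    super(message);
    this.name = "ClassifiedIntegrityError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Example of downstream UX driven by the subcode instead of message parsing.
export function recoveryHint(err: ClassifiedIntegrityError): string {
  switch (err.subcode) {
    case "METADATA_MISSING":
    case "METADATA_CORRUPTED":
      return "Store metadata is unusable; reset the store.";
    case "UNSUPPORTED_VERSION":
      return "Store was written by an incompatible version; upgrade or reset.";
    case "INTEGRITY_MISMATCH":
      return "Store contents failed verification; reset the store.";
    case "MACHINE_IDENTITY_CHANGED":
      return "Machine identity changed; re-enter secrets after a reset.";
  }
}
```

Keeping `.code` unchanged means existing consumers that only check for INTEGRITY_ERROR keep working, while newer ones can branch on `.subcode`.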

4. Failed open should release resources cleanly

On Windows, a failed integrity path can make immediate directory cleanup flaky due to transient handle retention.

Needed:

  • ensure any failed open() path fully releases DB/file handles before propagating the error
  • add regression coverage for Windows cleanup after failed open / failed verification
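
The pattern I have in mind is roughly the following. It is a generic sketch, not the engine's code: `Closable`, `openWithCleanup`, and the `acquire` / `verify` callbacks are made-up names standing in for "open the SQLite handle" and "run integrity verification".

```typescript
// Anything holding an OS-level resource, for example a database handle.
interface Closable {
  close(): void;
}

// Acquires a handle, verifies it, and guarantees the handle is closed
// before any verification error reaches the caller.
export function openWithCleanup<T extends Closable>(
  acquire: () => T,
  verify: (handle: T) => void,
): T {
  const handle = acquire();
  try {
    verify(handle);
    return handle;
  } catch (err) {
    try {
      handle.close();
    } catch {
      // A close failure must not mask the original verification error.
    }
    throw err;
  }
}
```

With this shape, by the time the consumer sees the integrity error the handle is already closed, so an immediate directory removal on Windows should not race against it.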

Suggested API direction

Example only:

await SecretsEngine.destroyAtPath({ path: storePath });
await SecretsEngine.resetAtPath({ path: storePath, preserveDirectory: true });

Or equivalent APIs if you prefer a different shape.

Acceptance criteria

  • consumers can destroy or reset a broken store without needing a successful open() first
  • machine identity derivation is resilient to common Windows adapter-order changes
  • integrity-related failures expose actionable machine-readable error information
  • failed open paths do not block immediate cleanup on Windows
  • tests cover the recovery path and Windows-oriented resource release behavior

Downstream impact

Tiny Claw had to add consumer-side workarounds to:

  • catch INTEGRITY_ERROR and print recovery instructions
  • prefer engine-aware destruction when possible
  • fall back to raw directory deletion for corrupt stores
  • add retry logic for Windows cleanup

Those are reasonable defenses, but the engine should provide the primary recovery contract.

Notes

If helpful, I can follow up with a companion issue or PR on the Tiny Claw side that references this engine issue and removes any workaround once the engine exposes the proper recovery API.

Labels: ready ([Status] Triaged and ready to be picked up)