sync run: failed status comes back with no error info — agents can't tell rate-limit from auth from network #139

@lucasygu

Summary

one --agent sync run <platform> reports status: "failed" for individual profiles without surfacing any error context. Agents have no way to distinguish a transient 429 from an auth failure from a not_in_channel failure from a malformed-profile failure. The CLI's documented error handling (Retry-After, exponential backoff, adaptive throttle) appears to work internally, but its outputs and on-disk artifacts don't reflect any of it.

Environment

  • @withone/cli 1.43.4
  • Platform: slack (16 per-channel conversations_<slug> profiles + 1 slack/users)
  • macOS 25.3.0 (darwin-arm64)

Concrete repro

We ran a one-time pull across 16 channels (sequential — single sync run slack over 17 profiles total). 13 profiles completed (~880 records across 8.5 minutes; the slowest were new_food_menu 3m33s and ordering 3m32s — clearly hitting backoffs internally). The last 4 alphabetically failed:

{
  "model": "conversations_receipts_dev",
  "recordsSynced": 0,
  "pagesProcessed": 0,
  "duration": "0s",
  "status": "failed",
  "deletedStale": 0,
  "statusCounts": {"active": 0, "archived": 0}
}

No error field. No HTTP status. No Retry-After value. No mention of which Slack API method failed. The same shape comes back for sync list slack: status: "failed" and nothing else.

A minute later, retrying just those 4 succeeded — sun_may_24_moneka_arabic_jazz took 1m31s for 21 records / 2 pages, which is consistent with several rounds of internal Retry-After backoff. The only way to even guess "rate limit" was per-row timing math.

Compounding evidence: sync test slack/conversations_receipts_dev against the same rate-limit window surfaces the real error cleanly:

{
  "name": "single-page fetch",
  "ok": false,
  "detail": "HTTP 429: {\"ok\":false,\"error\":\"ratelimited\"}"
}

So the underlying signal is reachable — sync test propagates it. sync run swallows it.
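Until sync run carries this itself, an agent could recover the signal from a sync test row. The following is a minimal sketch that assumes the detail string always has the shape shown above ("HTTP <status>: <body>"); the helper name parse_detail is hypothetical and not part of the CLI.

```python
import json
import re
from typing import Optional, TypedDict


class ParsedDetail(TypedDict):
    http_status: Optional[int]
    api_error: Optional[str]


# Assumed shape, taken from the sync test output above:
#   HTTP 429: {"ok":false,"error":"ratelimited"}
_DETAIL_RE = re.compile(r"^HTTP (\d{3}):\s*(.*)$", re.DOTALL)


def parse_detail(detail: str) -> ParsedDetail:
    """Split a `detail` string into an HTTP status and an API error code."""
    match = _DETAIL_RE.match(detail)
    if match is None:
        # Not an HTTP-shaped detail; nothing to extract.
        return {"http_status": None, "api_error": None}

    api_error: Optional[str] = None
    try:
        body = json.loads(match.group(2))
        if isinstance(body, dict) and isinstance(body.get("error"), str):
            api_error = body["error"]
    except json.JSONDecodeError:
        # Body was not JSON; keep the status and drop the error code.
        pass
    return {"http_status": int(match.group(1)), "api_error": api_error}


step = {
    "name": "single-page fetch",
    "ok": False,
    "detail": 'HTTP 429: {"ok":false,"error":"ratelimited"}',
}
print(parse_detail(step["detail"]))
# {'http_status': 429, 'api_error': 'ratelimited'}
```

This is a workaround rather than a fix: it costs an extra sync test call per failed profile, and that call lands in the same rate-limit window.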

What's missing — proposed fields on a failed result

{
  "model": "...",
  "status": "failed",
  "error": {
    "phase": "list_fetch | enrich | transform | upsert | hook",
    "message": "HTTP 429: ratelimited",
    "httpStatus": 429,
    "retryAfter": 60,
    "lastSuccessfulPage": null,
    "context": { "actionId": "...", "url": "..." }
  }
}

At minimum, populate error with whatever sync test already returns when it hits the same failure.
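To illustrate what the proposed error object would unlock, here is a rough sketch of agent-side triage. It assumes the field names proposed above (error.message, error.httpStatus, error.retryAfter), none of which exist in the CLI today. The triage function, the action labels, and the 60-second fallback are hypothetical, and the list of Slack error codes is not exhaustive.

```python
from typing import Any, Dict, Optional, Tuple

# Slack error codes treated as auth problems in this sketch (not exhaustive).
AUTH_ERRORS = (
    "invalid_auth",
    "token_expired",
    "token_revoked",
    "not_authed",
    "missing_scope",
)

# Assumption: fallback wait when a rate-limit error has no retryAfter value.
DEFAULT_RETRY_SECONDS = 60


def triage(result: Dict[str, Any]) -> Tuple[str, Optional[int]]:
    """Return (action, wait_seconds) for one sync run result row."""
    if result.get("status") != "failed":
        return ("ok", None)

    error = result.get("error")
    if not isinstance(error, dict):
        # Today's output shape: failed, with no error context at all.
        return ("unknown", None)

    message = str(error.get("message", ""))
    status = error.get("httpStatus")

    if status == 429 or "ratelimited" in message:
        retry_after = error.get("retryAfter")
        wait = retry_after if isinstance(retry_after, int) else DEFAULT_RETRY_SECONDS
        return ("retry", wait)
    if status in (401, 403) or any(code in message for code in AUTH_ERRORS):
        return ("reauthenticate", None)
    if "not_in_channel" in message:
        return ("join_channel", None)
    return ("escalate", None)


# A row in the proposed shape versus a row in today's shape.
proposed = {
    "model": "conversations_receipts_dev",
    "status": "failed",
    "error": {"message": "HTTP 429: ratelimited", "httpStatus": 429, "retryAfter": 60},
}
current = {"model": "conversations_receipts_dev", "status": "failed"}

print(triage(proposed))  # ('retry', 60)
print(triage(current))   # ('unknown', None)
```

The second call is the point of this issue: with today's output every failure lands in the "unknown" branch.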

Documented but missing on-disk artifacts

The mem/sync guide describes:

.one/sync/
  events/{platform}_{model}.jsonl     # change event logs (if onChange: "log")
  logs/{platform}.log                 # cron run logs

After ~30 minutes of repeated sync run slack calls (one cron-style and several manual runs that hit failures), .one/sync/logs/ does not exist on this machine. The cron run logs line in the docs implies this is only populated by cron-mode runs; if so, the docs are misleading — operators reading the doc will expect run logs to land there regardless. Either:

  • Have sync run write to .one/sync/logs/<platform>.log unconditionally, or
  • Update the docs to make it crystal clear that manual runs leave no trace.

Operational impact for our use case

We're building a per-channel mirror of Slack for an FDE prototype (one --agent sync run slack over ~16 channels, eventually scheduled --every 5m). Without:

  1. Per-failure error context, and
  2. dateFilter-driven incremental fetches (which we just added — but its absence by default in our manually-authored profiles meant every run was a full re-paginate, multiplying rate-limit pressure),

…sync schedules are operationally fragile. A scheduled tick that silently fails because of rate limits looks identical to a tick that fails because of an expired token, a renamed channel, or a withone outage.

Mitigations we applied locally

  1. Added dateFilter: {param: "oldest", location: "query", format: "unix"} to all conversation profiles so subsequent runs only fetch since last_synced.
  2. Sequenced our runs with ~60s cooldown between full re-fetches during the prototype to stay under Slack Tier-3 limits.
  3. Wrote a tail-based "watch the timing" heuristic to guess at rate-limit-induced failures (0 records with a duration of 0s = likely auth/scope; N records with a long duration = likely backoffs).
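For reference, the timing heuristic in item 3 amounts to something like the sketch below. It is not our exact script: the function names and the 60-second "slow" threshold are illustrative assumptions, and the duration format is inferred from the values seen in result rows ("0s", "1m31s", "3m33s").

```python
import re
from typing import Any, Dict

# Assumed duration format, inferred from observed values such as "0s" and "3m33s".
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?$")


def duration_seconds(text: str) -> int:
    """Convert a duration string like '3m33s' into whole seconds."""
    match = _DURATION_RE.match(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def guess_cause(row: Dict[str, Any], slow_threshold_s: int = 60) -> str:
    """Guess why a sync run row looks the way it does, from timing alone."""
    secs = duration_seconds(row["duration"])
    records = row.get("recordsSynced", 0)
    if row.get("status") == "failed" and records == 0 and secs == 0:
        return "immediate failure (likely auth/scope)"
    if secs >= slow_threshold_s:
        return "likely rate-limit backoff"
    return "no signal"


failed_row = {
    "model": "conversations_receipts_dev",
    "recordsSynced": 0,
    "duration": "0s",
    "status": "failed",
}
print(duration_seconds("3m33s"))  # 213
print(guess_cause(failed_row))    # immediate failure (likely auth/scope)
```

The repro shows why this is only a guess: the failed conversations_receipts_dev row gets labeled auth/scope by this rule, yet sync test against the same window reported HTTP 429.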

Without visibility into the actual error, every operational decision about the sync engine becomes a guess.


Related: I filed #138 earlier today for the four embedded-postgres plugin issues that block mem init on darwin-arm64. This issue is about runtime observability for the sync engine — independent surface.

Happy to send a small PR to plumb the error through sync run's result rows if a maintainer can point me at the right place in the bundled (or source) code.
