Summary
Arco authoritative run storage currently exposes canonical run objects only under a flat authoritative-runs/runs/ prefix. That forces downstream catalog readers to run a bucket-wide list/download scan on every cold refresh in order to answer org-scoped catalog and lineage reads.
Daxis has added a short-lived in-memory snapshot cache as a mitigation, so this no longer happens on every request, but the cold-refresh path is still fundamentally O(total authoritative runs) and not org-scoped.
This needs to be solved in Arco via storage-layout and/or indexing changes, not with more local read-side workarounds.
Problem
Today, a downstream catalog consumer can only build its view by:
- Listing every object under authoritative-runs/runs/
- Downloading/deserializing each run
- Filtering by org_id, target_namespace, and target_table in memory
That has a few problems:
- Cold refresh latency grows with total run volume, not tenant scope
- One tenant’s metadata read pays the cost of all tenants’ history
- GCS list and download costs scale with total run count rather than with the org being read
- Consumers have no stable Arco-native projection/index for catalog-oriented reads
- Read-side services have to invent caching and fallback behavior independently
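To make the scaling concrete, here is a minimal Rust sketch of the flat-prefix cold refresh described above. It is not Daxis or Arco code: the `Run` struct, the in-memory `Bucket` map standing in for GCS, the `{run_id}.json` object naming, and the `sample_bucket` helper are all assumptions for illustration. The point is only that the number of objects fetched equals the total run count, regardless of which org is being read.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down run record; field names follow the issue text.
#[derive(Debug, Clone, PartialEq)]
struct Run {
    run_id: String,
    org_id: String,
    target_namespace: String,
    target_table: String,
}

/// In-memory stand-in for the bucket: object key -> stored run.
type Bucket = BTreeMap<String, Run>;

/// Builds sample data. The `{run_id}.json` object naming is an assumption.
fn sample_bucket(orgs: usize, runs_per_org: usize) -> Bucket {
    let mut bucket = Bucket::new();
    for o in 0..orgs {
        for r in 0..runs_per_org {
            let run_id = format!("run-{}-{}", o, r);
            let key = format!("authoritative-runs/runs/{}.json", run_id);
            bucket.insert(
                key,
                Run {
                    run_id,
                    org_id: format!("org-{}", o),
                    target_namespace: "analytics".to_string(),
                    target_table: format!("table_{}", r % 2),
                },
            );
        }
    }
    bucket
}

/// Mimics the current cold refresh: visit every object under the flat prefix,
/// then filter by org, namespace, and table in memory.
/// Returns the matching runs and the number of objects that had to be fetched.
fn cold_refresh_flat(
    bucket: &Bucket,
    org_id: &str,
    namespace: &str,
    table: &str,
) -> (Vec<Run>, usize) {
    let prefix = "authoritative-runs/runs/";
    let mut fetched = 0;
    let mut matches = Vec::new();
    for (key, run) in bucket {
        if !key.starts_with(prefix) {
            continue;
        }
        fetched += 1; // every run is "downloaded", whichever org owns it
        if run.org_id == org_id
            && run.target_namespace == namespace
            && run.target_table == table
        {
            matches.push(run.clone());
        }
    }
    (matches, fetched)
}
```

With three orgs of four runs each, a read for a single table of one org returns two runs but still fetches all twelve objects.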
Current Downstream Mitigation
Daxis now caches a catalog snapshot in memory for 30s and coalesces refreshes, but its cold-refresh path still scans the full authoritative-run prefix.
Reference points from the current downstream workaround:
- prefix listing happens in GcsCatalogObjectBackend::list_objects(...)
- cold refresh still scans authoritative-runs/runs/ before caching
This issue is to remove the need for that behavior at the source.
Recommended Direction
Add an Arco-managed secondary projection/index for catalog-consumable authoritative runs, scoped at least by org.
Recommended shape:
- Keep canonical run objects as the source of truth
- On authoritative run create/update/retry/cancel/completion, also write/update a derived catalog projection or index entry
- Store those derived objects under an org-scoped prefix such as:
  - authoritative-runs/catalog/orgs/{org_id}/runs/{run_id}.json
  - or authoritative-runs/catalog/orgs/{org_id}/tables/{namespace}/{table}/{run_id}.json
  - or refs + projection objects, if that is easier to maintain
- Projection schema should include the fields needed by catalog/lineage readers:
  - run_id
  - org_id
  - kind
  - reference_id
  - status
  - updated_at
  - payload.target_namespace
  - payload.target_table
  - payload.source_type
  - latest task/result metadata needed for execution metadata and lineage
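The recommended shape could be sketched in Rust roughly as follows. This is an illustration under stated assumptions, not a proposed Arco API: the struct and function names (`CatalogRunProjection`, `ProjectionPayload`, `projection_key`, `org_prefix`, `list_org`, `sample`) are hypothetical, every field is typed as a plain `String` because the issue does not specify types, and a `BTreeMap` range scan stands in for a prefix-scoped object listing. The key layout is the first candidate listed above.

```rust
use std::collections::BTreeMap;

/// Hypothetical payload subset; mirrors the payload.* fields listed above.
#[derive(Debug, Clone, PartialEq)]
struct ProjectionPayload {
    target_namespace: String,
    target_table: String,
    source_type: String,
}

/// Hypothetical projection record. All field types are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct CatalogRunProjection {
    run_id: String,
    org_id: String,
    kind: String,
    reference_id: String,
    status: String,
    updated_at: String, // timestamp format is not specified in the issue
    payload: ProjectionPayload,
}

/// Object key for one projection, using the first candidate layout.
fn projection_key(org_id: &str, run_id: &str) -> String {
    format!("authoritative-runs/catalog/orgs/{}/runs/{}.json", org_id, run_id)
}

/// Prefix a reader would list to enumerate a single org's runs.
fn org_prefix(org_id: &str) -> String {
    format!("authoritative-runs/catalog/orgs/{}/runs/", org_id)
}

/// Returns only the projections under one org's prefix.
/// The sorted-map range scan touches that org's keys and nothing else.
fn list_org<'a>(
    store: &'a BTreeMap<String, CatalogRunProjection>,
    org_id: &str,
) -> Vec<&'a CatalogRunProjection> {
    let prefix = org_prefix(org_id);
    store
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(_, projection)| projection)
        .collect()
}

/// Builds a placeholder projection for examples; all values are made up.
fn sample(org_id: &str, run_id: &str) -> CatalogRunProjection {
    CatalogRunProjection {
        run_id: run_id.to_string(),
        org_id: org_id.to_string(),
        kind: "example-kind".to_string(),
        reference_id: "example-ref".to_string(),
        status: "example-status".to_string(),
        updated_at: "1970-01-01T00:00:00Z".to_string(),
        payload: ProjectionPayload {
            target_namespace: "analytics".to_string(),
            target_table: "events".to_string(),
            source_type: "example-source".to_string(),
        },
    }
}
```

Because the prefix ends in `/runs/`, an org ID that is a string prefix of another (for example `org-1` and `org-10`) does not leak into the other's listing.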
Acceptance Criteria
- A downstream catalog reader can enumerate authoritative catalog runs for a single org without listing the full authoritative-runs/runs/ prefix
- Cold refresh for org-scoped catalog reads does not require bucket-wide scanning
- Projection/index updates remain correct across:
  - create
  - started/heartbeat/completed callbacks
  - retry
  - cancel
  - forced failure/timeout paths
- Schema/versioning for the projection/index is documented
- There is a migration/backfill story for existing authoritative runs
- Tests cover projection/index consistency on all authoritative run lifecycle transitions
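A consistency test along these lines might look like the Rust sketch below. Everything in it is hypothetical: the `RunEvent` variants are named after the transitions listed above, the `Status` values and the event-to-status mapping are invented because the issue does not define the status vocabulary, and two in-memory maps stand in for the canonical objects and the derived projection. It only demonstrates the pattern of asserting consistency after every transition.

```rust
use std::collections::HashMap;

/// Hypothetical lifecycle events, named after the transitions in this issue.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunEvent {
    Create,
    Started,
    Heartbeat,
    Completed,
    Retry,
    Cancel,
    ForcedFailure,
    Timeout,
}

/// Hypothetical status values; the real vocabulary is not given here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Running,
    Succeeded,
    Cancelled,
    Failed,
}

/// Assumed mapping from event to resulting status, for illustration only.
fn next_status(event: RunEvent) -> Status {
    match event {
        RunEvent::Create | RunEvent::Retry => Status::Pending,
        RunEvent::Started | RunEvent::Heartbeat => Status::Running,
        RunEvent::Completed => Status::Succeeded,
        RunEvent::Cancel => Status::Cancelled,
        RunEvent::ForcedFailure | RunEvent::Timeout => Status::Failed,
    }
}

/// In-memory stand-ins for the canonical objects and the derived projection.
#[derive(Default)]
struct Stores {
    canonical: HashMap<String, Status>,
    projection: HashMap<String, Status>,
}

impl Stores {
    /// Writes the canonical record first, then mirrors it to the projection.
    fn apply(&mut self, run_id: &str, event: RunEvent) {
        let status = next_status(event);
        self.canonical.insert(run_id.to_string(), status);
        self.projection.insert(run_id.to_string(), status);
    }

    /// True when every run has the same status in both stores.
    fn is_consistent(&self) -> bool {
        self.canonical == self.projection
    }
}
```

A real test would also need to cover the case where the canonical write succeeds and the projection write fails, since two object-store writes are not atomic; this sketch does not model that.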
Non-Goals
- Replacing canonical authoritative run objects as the source of truth
- Pushing this problem permanently into downstream read-service caches
- Introducing a consumer-specific contract that only Daxis can use
Why This Matters
Arco is already the authoritative control-plane source for these runs. Catalog/lineage readers need an efficient, stable read path derived from that source. Without an Arco-native index/projection, every consumer will end up rebuilding the same bucket-scan workaround with different caching semantics and operational tradeoffs.