Add org-scoped catalog read indexing for authoritative runs to eliminate bucket-wide cold scans #132

@ethan-tyler

Summary

Arco authoritative run storage currently only exposes canonical run objects under a flat authoritative-runs/runs/ prefix. That forces downstream catalog readers to do a bucket-wide list/download scan on cold refreshes in order to answer org-scoped catalog and lineage reads.

Daxis has added a short-lived in-memory snapshot cache as a mitigation, so this no longer happens on every request, but the cold-refresh path is still fundamentally O(total authoritative runs) and not org-scoped.

This needs to be solved in Arco via storage-layout and/or indexing changes, not with more local read-side workarounds.

Problem

Today, a downstream catalog consumer can only build its view by:

  1. Listing every object under authoritative-runs/runs/
  2. Downloading/deserializing each run
  3. Filtering by org_id, target_namespace, and target_table in memory

That has a few problems:

  • Cold refresh latency grows with total run volume, not tenant scope
  • One tenant’s metadata read pays the cost of all tenants’ history
  • GCS list and download cost scales with total run count rather than with the runs a reader actually needs
  • Consumers have no stable Arco-native projection/index for catalog-oriented reads
  • Read-side services have to invent caching and fallback behavior independently
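To make the cost concrete, the scan described above can be sketched against an in-memory stand-in for the bucket. Everything here is illustrative: the `Run` struct, the `flat_key` layout (`authoritative-runs/runs/{run_id}.json`), and the `cold_scan` helper are assumptions for this sketch, not Arco or Daxis code, and only the org_id filter is shown.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down run record (the real schema has more fields).
#[derive(Debug, Clone, PartialEq)]
struct Run {
    run_id: String,
    org_id: String,
    target_namespace: String,
    target_table: String,
}

/// In-memory stand-in for the bucket: object key -> stored run.
type Bucket = BTreeMap<String, Run>;

/// Assumed flat key layout: authoritative-runs/runs/{run_id}.json
fn flat_key(run_id: &str) -> String {
    format!("authoritative-runs/runs/{}.json", run_id)
}

/// List the flat prefix, "download" every object, then filter in memory.
/// Returns the matching runs and the number of objects that had to be read.
fn cold_scan(bucket: &Bucket, org_id: &str) -> (Vec<Run>, usize) {
    let prefix = "authoritative-runs/runs/";
    let mut objects_read = 0;
    let mut matches = Vec::new();
    for (_key, run) in bucket.iter().filter(|(key, _)| key.starts_with(prefix)) {
        // Every object under the prefix is read, whatever org it belongs to.
        objects_read += 1;
        if run.org_id == org_id {
            matches.push(run.clone());
        }
    }
    (matches, objects_read)
}

/// Three runs across two orgs, purely as sample data.
fn sample_bucket() -> Bucket {
    let mut bucket = Bucket::new();
    for (run_id, org_id) in [("r1", "org-a"), ("r2", "org-b"), ("r3", "org-b")] {
        bucket.insert(
            flat_key(run_id),
            Run {
                run_id: run_id.to_string(),
                org_id: org_id.to_string(),
                target_namespace: "analytics".to_string(),
                target_table: "events".to_string(),
            },
        );
    }
    bucket
}
```

Calling `cold_scan(&sample_bucket(), "org-a")` returns one run but reads all three objects, which is the O(total authoritative runs) behavior this issue is about.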

Current Downstream Mitigation

Daxis now caches a catalog snapshot in-memory for 30s and coalesces refreshes, but its cold-refresh path still scans the full authoritative-run prefix.

Reference points from the current downstream workaround:

  • prefix listing happens in GcsCatalogObjectBackend::list_objects(...)
  • cold refresh still scans authoritative-runs/runs/ before caching

This issue is to remove the need for that behavior at the source.

Recommended Direction

Add an Arco-managed secondary projection/index for catalog-consumable authoritative runs, scoped at least by org.

Recommended shape:

  • Keep canonical run objects as the source of truth
  • On authoritative run create/update/retry/cancel/completion, also write/update a derived catalog projection or index entry
  • Store those derived objects under an org-scoped prefix such as:
    • authoritative-runs/catalog/orgs/{org_id}/runs/{run_id}.json
    • or authoritative-runs/catalog/orgs/{org_id}/tables/{namespace}/{table}/{run_id}.json
    • or refs + projection objects, if that is easier to maintain
  • Projection schema should include the fields needed by catalog/lineage readers:
    • run_id
    • org_id
    • kind
    • reference_id
    • status
    • updated_at
    • payload.target_namespace
    • payload.target_table
    • payload.source_type
    • latest task/result metadata needed for execution metadata and lineage
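A minimal sketch of the first key layout above, again against an in-memory stand-in for the bucket. The `CatalogProjection` struct carries only a subset of the listed fields, and `write_run`, `list_org`, and `sample` are hypothetical names for this sketch, not existing Arco APIs. A real implementation would also need to decide how the two writes are ordered and retried.

```rust
use std::collections::BTreeMap;

/// Hypothetical projection record with a subset of the fields listed above.
#[derive(Debug, Clone, PartialEq)]
struct CatalogProjection {
    run_id: String,
    org_id: String,
    status: String,
    target_namespace: String,
    target_table: String,
}

/// In-memory stand-in for the bucket: object key -> stored record.
type Bucket = BTreeMap<String, CatalogProjection>;

/// Assumed canonical key layout under the existing flat prefix.
fn canonical_key(run_id: &str) -> String {
    format!("authoritative-runs/runs/{}.json", run_id)
}

/// Org-scoped prefix from the recommended shape; the trailing slash keeps
/// org "a" from matching keys that belong to org "ab".
fn org_prefix(org_id: &str) -> String {
    format!("authoritative-runs/catalog/orgs/{}/runs/", org_id)
}

fn projection_key(org_id: &str, run_id: &str) -> String {
    format!("{}{}.json", org_prefix(org_id), run_id)
}

/// Write the canonical object, then the derived org-scoped entry.
fn write_run(bucket: &mut Bucket, run: &CatalogProjection) {
    bucket.insert(canonical_key(&run.run_id), run.clone());
    bucket.insert(projection_key(&run.org_id, &run.run_id), run.clone());
}

/// Enumerate one org's runs by seeking to its prefix and stopping at the
/// first key outside it. Returns the runs and the number of objects read.
fn list_org(bucket: &Bucket, org_id: &str) -> (Vec<CatalogProjection>, usize) {
    let prefix = org_prefix(org_id);
    let runs: Vec<CatalogProjection> = bucket
        .range(prefix.clone()..)
        .take_while(|(key, _)| key.starts_with(&prefix))
        .map(|(_, run)| run.clone())
        .collect();
    let objects_read = runs.len();
    (runs, objects_read)
}

/// Sample record builder used only for illustration.
fn sample(run_id: &str, org_id: &str, status: &str) -> CatalogProjection {
    CatalogProjection {
        run_id: run_id.to_string(),
        org_id: org_id.to_string(),
        status: status.to_string(),
        target_namespace: "analytics".to_string(),
        target_table: "events".to_string(),
    }
}
```

With one run for `org-a` and two for `org-b` written through `write_run`, `list_org(&bucket, "org-a")` reads a single object, so the read cost tracks the org's own run count instead of the bucket's.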

Acceptance Criteria

  • A downstream catalog reader can enumerate authoritative catalog runs for a single org without listing the full authoritative-runs/runs/ prefix
  • Cold refresh for org-scoped catalog reads does not require bucket-wide scanning
  • Projection/index updates remain correct across:
    • create
    • started/heartbeat/completed callbacks
    • retry
    • cancel
    • forced failure/timeout paths
  • Schema/versioning for the projection/index is documented
  • There is a migration/backfill story for existing authoritative runs
  • Tests cover projection/index consistency on all authoritative run lifecycle transitions
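One way the consistency and backfill criteria could be exercised is sketched below. The `Status` variants, the `Store` type, and its methods are all hypothetical; the real Arco lifecycle states and storage interfaces may differ. The sketch treats the canonical map as the source of truth and the projection map as derived state that can always be rebuilt from it.

```rust
use std::collections::BTreeMap;

/// Hypothetical lifecycle states; the real Arco status set may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Created,
    Started,
    Completed,
    Retrying,
    Cancelled,
    Failed,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Run {
    run_id: String,
    org_id: String,
    status: Status,
}

#[derive(Default)]
struct Store {
    /// Canonical runs keyed by run_id (source of truth).
    canonical: BTreeMap<String, Run>,
    /// Derived projection entries keyed by (org_id, run_id).
    projection: BTreeMap<(String, String), Run>,
}

impl Store {
    /// Write the canonical run first, then the derived projection entry.
    fn put(&mut self, run: Run) {
        self.canonical.insert(run.run_id.clone(), run.clone());
        self.projection
            .insert((run.org_id.clone(), run.run_id.clone()), run);
    }

    /// Apply a lifecycle transition through the same dual-write path.
    /// Returns false if the run does not exist.
    fn transition(&mut self, run_id: &str, status: Status) -> bool {
        match self.canonical.get(run_id).cloned() {
            Some(mut run) => {
                run.status = status;
                self.put(run);
                true
            }
            None => false,
        }
    }

    /// Idempotent backfill: (re)write projection entries that are missing
    /// or stale. Returns how many entries were written.
    fn backfill(&mut self) -> usize {
        let mut written = 0;
        for run in self.canonical.values() {
            let key = (run.org_id.clone(), run.run_id.clone());
            if self.projection.get(&key) != Some(run) {
                self.projection.insert(key, run.clone());
                written += 1;
            }
        }
        written
    }

    /// True when every canonical run has an identical projection entry
    /// and no extra projection entries exist.
    fn is_consistent(&self) -> bool {
        self.canonical.len() == self.projection.len()
            && self.canonical.values().all(|run| {
                let key = (run.org_id.clone(), run.run_id.clone());
                self.projection.get(&key) == Some(run)
            })
    }
}
```

A lifecycle test would then drive a run through each transition (including retry, cancel, and forced-failure paths) and assert `is_consistent()` after every step, while a migration test would start from canonical-only data and assert that a second `backfill()` call writes nothing.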

Non-Goals

  • Replacing canonical authoritative run objects as the source of truth
  • Pushing this problem permanently into downstream read-service caches
  • Introducing a consumer-specific contract that only Daxis can use

Why This Matters

Arco is already the authoritative control-plane source for these runs. Catalog/lineage readers need an efficient, stable read path derived from that source. Without an Arco-native index/projection, every consumer will end up rebuilding the same bucket-scan workaround with different caching semantics and operational tradeoffs.
