
Content-addressed artifact cache: reuse across pipelines and build dirs #50

@marklubin


Problem

Today, artifact reuse only works when two pipelines share the same build_dir and manifest.json. The cache key is semantic (artifact_id + input hashes + prompt + model config), but the lookup is scoped to a single manifest file. If you run the same transform with the same inputs in a different build directory, it rebuilds from scratch.

Vision

Fully content-addressed build cache. If we can determine that a unit of work is identical — same inputs, same prompt, same model config — we pull the result from cache regardless of which pipeline or build directory produced it.

Think: ccache for LLM transforms. The artifact store becomes a global (or per-user) content-addressed cache keyed on the hash of (input_hashes, prompt_id, model_config, transform_cache_key). Any pipeline that requests the same computation gets a cache hit.
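To make the idea concrete, here is a minimal sketch of what a content-addressed store could look like on disk. It is an illustration only, not the current synix implementation: the `ArtifactCache` class, its `get`/`put` methods, and the two-character directory sharding are all hypothetical choices made for this example.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


class ArtifactCache:
    """Hypothetical on-disk store: one file per cache key, sharded by key prefix."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, key: str) -> Path:
        # Shard by the first two hex characters to keep directories small.
        return self.root / key[:2] / key

    def get(self, key: str) -> Optional[bytes]:
        # A missing file is a cache miss; the caller then rebuilds the artifact.
        path = self._path(key)
        return path.read_bytes() if path.is_file() else None

    def put(self, key: str, data: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename, so that
        # concurrent pipelines never observe a partially written entry.
        fd, tmp = tempfile.mkstemp(dir=path.parent)
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
```

Because entries are addressed only by key, two pipelines that construct an `ArtifactCache` over the same root share results without sharing a build_dir or manifest.json.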

What This Enables

  • Cross-project reuse: Two unrelated pipelines processing the same source files share cached transcripts/episodes without needing to point at the same build_dir
  • CI caching: Warm cache from prior runs, even if build dirs differ per job
  • Deduplication: The same conversation exported from both a ChatGPT backup and a shared folder is parsed once and reused everywhere
  • Pipeline experimentation: Try different rollup strategies without re-running expensive lower layers, even from a fresh directory

Rough Design Notes

  • Global cache dir (e.g. ~/.cache/synix/ or configurable via SYNIX_CACHE_DIR)
  • Cache key: SHA256(sorted(input_hashes) + prompt_id + canonical(model_config) + transform_cache_key)
  • Build-local manifest still exists for fast lookups; global cache is the fallback
  • synix gc to prune old/unused cache entries
  • Opt-in initially (pipeline.cache_dir = "~/.cache/synix" or CLI flag)
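The notes above could be sketched roughly as follows. This is an assumption-laden illustration, not a committed design: the function names (`resolve_cache_dir`, `cache_key`, `lookup`) are hypothetical, the local manifest is modelled as a plain dict, and the global cache is modelled as any callable that maps a key to a result or `None`. One deliberate deviation from the formula in the notes: instead of concatenating the components with `+`, the sketch hashes a canonical JSON encoding, which avoids ambiguity about where one field ends and the next begins.

```python
import hashlib
import json
import os
from pathlib import Path


def resolve_cache_dir() -> Path:
    """Use SYNIX_CACHE_DIR if set, otherwise fall back to ~/.cache/synix."""
    return Path(os.environ.get("SYNIX_CACHE_DIR", "~/.cache/synix")).expanduser()


def cache_key(input_hashes, prompt_id, model_config, transform_cache_key) -> str:
    """SHA256 over a canonical encoding of the four key components."""
    payload = json.dumps(
        {
            "input_hashes": sorted(input_hashes),  # order of inputs must not matter
            "prompt_id": prompt_id,
            "model_config": model_config,  # sort_keys below canonicalizes this dict
            "transform_cache_key": transform_cache_key,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lookup(key, local_manifest, global_get):
    """Check the build-local manifest first; the global cache is the fallback."""
    if key in local_manifest:
        return local_manifest[key]
    return global_get(key)  # returns None on a miss
```

With this shape, reordering the inputs or the keys of `model_config` yields the same cache key, while changing any component (for example `prompt_id`) yields a different one.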

Not Now

This is a future enhancement. Current per-build-dir caching works fine for the primary use case (swapping pipelines in the same project). Tracked here for when cross-project reuse becomes a real need.


Labels: enhancement (New feature or request)
