fix(cll): include adapter_type in cache key to prevent cross-database poisoning #1285
Open
… poisoning

The CLL cache key was sha256(node_id + raw_code + parents + columns) which did not include the database adapter type. When a user switched between adapters (e.g. Snowflake → DuckDB), the cache would return lineage computed under the wrong dialect: Snowflake uppercases all identifiers while DuckDB preserves case, causing silent column mapping failures.

Add adapter_type as the first component of the content hash so that lineage computed under one dialect is never returned for another.

Also add a warning log when compiled_code is missing and the fragile Jinja fallback path is used.

Resolves DRC-3199

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
The warning said "has no compiled_code" but the fallback also fires for alias collisions where compiled_code exists. Use a generic message.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
Improve the cache key to use dbt-computed checksums instead of raw content:

- Remove node_id from content key (already in DB lookup key via make_node_key)
- Replace raw_code with manifest checksum (sha256 already computed by dbt)
- Replace parent ID list with parent checksums: cascading invalidation, so if any parent's SQL changes, children recompute

Sources/exposures/metrics have no checksum (no SQL), so their node ID is used as a stable placeholder.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
The test helper uses `hash(raw_code)` which returns an int, and some older dbt versions may also store non-string checksums. Wrap with `str()` to ensure the content key always receives a string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
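The two checksum commits above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `checksum_or_placeholder` is a hypothetical helper, and the flat `checksum` field is a simplification (real dbt manifests typically nest the digest inside a checksum object).

```python
from typing import Any, Mapping


def checksum_or_placeholder(node_id: str, node: Mapping[str, Any]) -> str:
    """Return a string component for the content key (hypothetical helper).

    Nodes without a checksum (sources, exposures, metrics) fall back to
    their node ID; everything else is coerced with str() so that an int
    checksum cannot break the later string join.
    """
    checksum = node.get("checksum")
    if checksum is None:
        return node_id
    return str(checksum)


# Hypothetical manifest fragment; all values are made up for illustration.
nodes = {
    "model.shop.orders": {"checksum": "9f2c"},   # ordinary string digest
    "model.shop.legacy": {"checksum": 1234567},  # int, as hash(raw_code) would give
    "source.shop.raw_orders": {},                # no SQL, so no checksum
}

parts = [checksum_or_placeholder(node_id, node) for node_id, node in nodes.items()]
joined = "\x00".join(parts)  # would raise TypeError if any part were an int
```

Without the `str()` call, the join in the last line fails as soon as one checksum is not a string, which is the failure mode the commit message describes.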
288611f to df05c81
PR checklist
What type of PR is this?
Bug fix + refactor
What this PR does / why we need it:
Fixes cross-database cache poisoning in the CLL SQLite cache and redesigns the cache key for better invalidation.
Bug: The CLL cache key did not include the database adapter type. When a user switched adapters (e.g. Snowflake → DuckDB), the cache silently returned lineage computed under the wrong dialect — Snowflake uppercases all identifiers while DuckDB preserves case, causing column mapping failures.
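A minimal sketch of why the old key was adapter-blind, assuming the components are joined with a separator and hashed (the real concatenation scheme is not shown in this PR description). The function names `legacy_key` and `adapter_aware_key` and the sample model are hypothetical.

```python
import hashlib


def legacy_key(node_id, raw_code, parent_ids, column_names):
    """Old-style content hash: no input depends on the adapter."""
    parts = [node_id, raw_code, *parent_ids, *column_names]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


def adapter_aware_key(adapter_type, node_id, raw_code, parent_ids, column_names):
    """Same hash with adapter_type prepended as the first component."""
    parts = [adapter_type, node_id, raw_code, *parent_ids, *column_names]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


# Hypothetical model; the same project files are used with both adapters.
MODEL = (
    "model.shop.orders",
    "select id from raw_orders",
    ["source.shop.raw_orders"],
    ["id"],
)

# The legacy key cannot tell the two sessions apart, so an entry written
# under Snowflake is served back under DuckDB.
snowflake_legacy = legacy_key(*MODEL)
duckdb_legacy = legacy_key(*MODEL)

# With the adapter in the hash, the two sessions get distinct keys.
snowflake_key = adapter_aware_key("snowflake", *MODEL)
duckdb_key = adapter_aware_key("duckdb", *MODEL)
```

Because the legacy function takes no adapter argument at all, the collision is structural rather than accidental: any two sessions over the same project produce the same key.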
Redesign: While investigating, we found the cache key used raw SQL strings and parent node IDs as proxies. Replaced with a more principled design using dbt-computed checksums:
Old key: `sha256(node_id + raw_code + parent_ids + column_names)`
New key: `sha256(adapter_type + checksum + parent_checksums + column_names)`
Key improvements:
- `adapter_type`: prevents cross-database poisoning (the original bug)
- `checksum`: uses dbt's pre-computed `sha256(raw_code)` instead of re-hashing large SQL strings
- `parent_checksums`: cascading invalidation; if any parent's SQL changes, children recompute (previously only tracked parent IDs, not content)
- `node_id` removed from content key: already included in the DB lookup key (`CllCache.make_node_key`), was redundant
- `logger.warning` when `compiled_code` is missing and the fragile Jinja rendering path is used
Verified with jaffle-shop-expand (1060 models):
- `adapter_type: snowflake` → 0 cache hits, 973 recomputed (fix confirmed)
Which issue(s) this PR fixes:
Resolves DRC-3199
Special notes for your reviewer:
Backward compatible: old cache entries (hashed with the old key format) become natural cache misses and recompute on first access. No migration or `recce cache clear` needed.
Sources, exposures, and metrics have no `checksum` in the manifest (no SQL), so their node ID is used as a stable placeholder.
Does this PR introduce a user-facing change?:
Users who switch between database adapters (e.g. DuckDB ↔ Snowflake) with CLL cache enabled will no longer get silently incorrect lineage from stale cache entries. Additionally, changes to parent model SQL now correctly invalidate downstream cache entries.
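The parent-invalidation behaviour described above can be illustrated with a toy cache. This is a sketch under assumptions: `ToyCache` is a hypothetical in-memory stand-in for the SQLite cache, the separator-joined hash layout is assumed, and the checksum strings are placeholders rather than real digests.

```python
import hashlib


def content_key(adapter_type, checksum, parent_checksums, column_names):
    """Hash the four components named in the PR description (layout assumed)."""
    parts = [adapter_type, checksum, *parent_checksums, *column_names]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()


class ToyCache:
    """Hypothetical stand-in for the cache: node_id -> (content_key, lineage)."""

    def __init__(self):
        self._rows = {}

    def put(self, node_id, key, lineage):
        self._rows[node_id] = (key, lineage)

    def get(self, node_id, key):
        row = self._rows.get(node_id)
        if row is not None and row[0] == key:
            return row[1]  # hit: stored content key matches
        return None  # miss: caller recomputes lineage


cache = ToyCache()
before = content_key("duckdb", "child-v1", ["parent-v1"], ["id", "total"])
cache.put("model.shop.orders", before, "lineage computed against parent-v1")

# The parent's SQL changes, so it gets a new checksum; the child's own SQL
# is untouched, yet the child's content key still changes.
after = content_key("duckdb", "child-v1", ["parent-v2"], ["id", "total"])

unchanged_hit = cache.get("model.shop.orders", before)
stale_lookup = cache.get("model.shop.orders", after)
```

Under the old scheme only the parent's ID entered the hash, so the second lookup would have been a hit on stale lineage; here it is a miss and the child recomputes. The sketch covers direct parents only, which is what the description claims.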
🤖 Generated with Claude Code