feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI#1998
feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI#1998blarghmatey wants to merge 7 commits intomainfrom
Conversation
…is CLI
Adds a new `ol-dbt` CLI package at `src/ol_dbt_cli/` with two commands:
## `ol-dbt validate`
Structural consistency checks for dbt models:
1. **YAML/SQL sync** — warns when columns declared in a model's YAML schema
are absent from its compiled SELECT output, or vice-versa.
2. **Missing upstream refs** — flags `ref()` or `source()` calls that
resolve to models/sources not present in the project.
3. **Upstream column existence** — validates that columns referenced from
an upstream model actually exist in that model's output.
4. **Duplicate column aliases** — detects accidental repeated alias names
within a single model.
5. **SELECT \*** — INFO-level notice for unresolved star expansions;
`--warn-select-star` promotes these to warnings.
Options:
- `--model/-m` (repeatable): filter to one or more model names, comma-separated
lists, or directory paths under `models/`
- `--errors-only`: suppress INFO/WARN output, show only errors
- `--warn-select-star`: treat unresolved SELECT * as warnings
- `--compiled-dir`: point at `target/compiled/` for Jinja-heavy models
- `--output json`: machine-readable output
## `ol-dbt impact`
Column-level downstream impact analysis driven by git diffs:
- Reads `git diff [--cached] [base]` to find changed/removed SQL column aliases
- Traverses the manifest or parsed SQL to find all downstream models that
reference each changed column
- Reports broken references, risky usages, and models needing review
## Jinja pre-processing
Robust regex-based Jinja stripping so sqlglot can parse the raw `.sql` files
without a running dbt environment:
- `{{ ref() }}` / `{{ source() }}` → stable SQL identifiers with reverse maps
- `{{ var() }}` and bare `{{ variable }}` → unquoted placeholders (avoids
the `''__jinja__''` doubled-quote bug when templates surround with SQL quotes)
- Block-level macro calls alone on a line → SQL comment
- Broken column expressions where a macro splits a SQL column boundary
(e.g. `{{ array_join('partial SQL', "path") }}`) → collapsed to
`__jinja__ as alias` via `_BROKEN_COL_RE` cleanup pass
Result: **588/588** dbt models in this project parse successfully.
## YAML registry improvements
- Parses both `models:` and `sources:` blocks for SELECT * resolution
- Rescues model definitions accidentally nested inside another model's
`columns:` list (YAML authoring error in `_stg_mitlearn_models.yml`)
- Filters out nested-model entries from the parent model's column list
## Tests
85 pytest tests covering all commands, lib modules, and edge cases.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
cdfb9e5 to
18376fe
Compare
There was a problem hiding this comment.
Pull request overview
Introduces a new ol-dbt CLI workspace package (src/ol_dbt_cli/) intended to help engineers validate dbt model/YAML consistency and assess column-level downstream impact of in-progress changes in src/ol_dbt.
Changes:
- Adds
ol-dbt validateandol-dbt impactcommands (cyclopts + rich output) plus supporting parsing/registry libraries. - Implements YAML schema registry + sqlglot-based SQL parsing with Jinja stripping and partial SELECT * resolution.
- Adds a pytest suite covering registry/parsing/validation/impact helpers and wires the package into the uv workspace/dependencies.
Reviewed changes
Copilot reviewed 13 out of 18 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| uv.lock | Adds ol-dbt-cli workspace member and locks new deps (rich, sqlglot). |
| pyproject.toml | Adds src/ol_dbt_cli to uv workspace members and adds rich/sqlglot to root deps. |
| src/ol_dbt_cli/pyproject.toml | Declares the ol-dbt-cli package, deps, entrypoint, and ruff config. |
| src/ol_dbt_cli/ol_dbt_cli/init.py | Package marker. |
| src/ol_dbt_cli/ol_dbt_cli/cli.py | Defines ol-dbt CLI app and registers impact/validate commands. |
| src/ol_dbt_cli/ol_dbt_cli/commands/init.py | Commands package marker. |
| src/ol_dbt_cli/ol_dbt_cli/commands/validate.py | Implements structural validation checks and reporting/output. |
| src/ol_dbt_cli/ol_dbt_cli/commands/impact.py | Implements diff-based column change detection and downstream impact reporting. |
| src/ol_dbt_cli/ol_dbt_cli/lib/init.py | Lib package marker. |
| src/ol_dbt_cli/ol_dbt_cli/lib/yaml_registry.py | Parses dbt schema YAML into a registry of models/sources/columns. |
| src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py | Strips Jinja and uses sqlglot to extract output columns (+ some star resolution). |
| src/ol_dbt_cli/ol_dbt_cli/lib/manifest.py | Parses dbt manifest.json into a registry with lineage helpers. |
| src/ol_dbt_cli/ol_dbt_cli/lib/git_utils.py | Git helpers for changed files/models and retrieving file content at refs. |
| src/ol_dbt_cli/tests/init.py | Tests package marker. |
| src/ol_dbt_cli/tests/test_yaml_registry.py | Tests YAML registry model/source parsing and nested-model rescue behavior. |
| src/ol_dbt_cli/tests/test_validate.py | Tests validate checks, formatting, and model target resolution helpers. |
| src/ol_dbt_cli/tests/test_sql_parser.py | Tests Jinja stripping and SQL parsing/star handling/compiled-dir logic. |
| src/ol_dbt_cli/tests/test_impact.py | Tests column diff helper used by impact analysis. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- validate.py: fix _resolve_star_with_registry returning m_model.columns (dict[str, ManifestColumn]) instead of m_model.column_names (set[str]), which broke set operations in callers - validate.py: fix _resolve_model_targets silently skipping invalid directory tokens when earlier tokens already populated resolved[]; now tracks whether THIS token added any models before deciding whether to emit an unknown warning - validate.py: remove flawed upstream column cross-ref existence check that produced false positives because it cannot attribute a YAML column to a specific upstream ref without table-alias-level SQL AST analysis; replaced with a narrow unresolvable-ref warning only - manifest.py: replace list.pop(0) O(n) queue with collections.deque.popleft() O(1) in get_all_descendants() BFS - cli.py: fix help text examples that referenced non-existent --changed-only flag on impact and --format (correct flag is --output) - sql_parser.py: fix _parse_clean_sql compiled-SQL path to get refs/sources from the raw SQL via strip_jinja() rather than looking for 'ref_'/'source_' prefixes in compiled table names (which don't exist in real compiled SQL) - git_utils.py: fix get_changed_files() docstring claiming it includes untracked files (it only includes tracked changed files) - git_utils.py: fix get_changed_yaml_models() return type from list[str] to list[Path] (the function returns file paths, not model names) and remove the type: ignore suppression Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Inverse of --skip; accepts a comma-separated list of check names and runs only those checks, skipping all others. Mutually exclusive with --skip (exits with error if both are provided). Available check names: yaml_sql_sync, upstream_refs, docs_coverage, yaml_integrity, select_star. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
rachellougee
left a comment
There was a problem hiding this comment.
ol-dbt validate works well, as it detects the inconsistencies between sql and yml file.
ol-dbt impact has some limitations. It relies on the dbt compiled folder, so if downstream models have never been compiled locally, the result may be incomplete. So maybe we could add something like dbt compile --select model+ in the impact.py to ensure all downstream models are compiled before analysis?
Also, when I tested by renaming the user_email column in the stg__edxorg__program_entitlement, ol-dbt impact detected one downstream edxorg_to_mitxonline_users impact, but it missed the user_email reference in edxorg_to_mitxonline_program_entitlements. Not sure why it would miss that because both downstream models were compiled locally.
% ol-dbt impact
Using compiled SQL: /Users/rachellougee/repo/OL/ol-data-platform/src/ol_dbt/target/compiled/open_learning/models
Using manifest: /Users/rachellougee/repo/OL/ol-data-platform/src/ol_dbt/target/manifest.json
Analysing 1 model(s) for column-level impact
BREAKING stg__edxorg__program_entitlement
Column change(s) in 'stg__edxorg__program_entitlement' will break 1 downstream model(s).
Column changes:
🔄 user_email → user_emails
→ user_emails
Downstream impact:
↳ edxorg_to_mitxonline_users (uses: user_email)
↳ not_null_stg__edxorg__program_entitlement_program_title (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_program_type (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_organization_key (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_user_id (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_user_emails (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_purchase_date (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_expiration_date (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_number_of_entitlements (column metadata unavailable)
↳ not_null_stg__edxorg__program_entitlement_number_of_redeemed_entitlements (column metadata unavailable)
↳ dbt_expectations_expect_compound_columns_to_be_unique_stg__edxorg__program_entitlement_user_email__program_title (column metadata unavailable)
↳ not_null_edxorg_to_mitxonline_users_user_email (column metadata unavailable)
↳ unique_edxorg_to_mitxonline_users_user_email (column metadata unavailable)
↳ dbt_expectations_expect_compound_columns_to_be_unique_edxorg_to_mitxonline_program_entitlements_program_title__user_edxorg_email__order_id_is_null
(column metadata unavailable)
Summary: 1 breaking, 0 warning(s) across 1 changed model(s)
…, add stale SQL warning Phase 1 — SELECT * resolution improvements: - Fix _build_cte_column_map to handle exp.Union CTEs (UNION ALL) by extracting columns from the leftmost SELECT branch; resolves 8 of the 10 previously unresolved SELECT * models - Add VALUES inline table handling in _try_resolve_star: detects SELECT * FROM (VALUES …) AS alias (col1, col2) and extracts column names from the alias list - Add _resolve_star_source_through_ctes helper that walks thin passthrough CTE chains (SELECT * FROM ref_x) to find the ultimate ref/source placeholder, enabling registry-based resolution of the final two unresolved stars - Add source_path field to ParsedModel to track the raw SQL source file - After full pipeline: 0 unresolved SELECT * across all 588 models Phase 2 — Impact false-negative fix (rachellougee's PR feedback): - Add _get_columns_read_from_ref() to impact.py: uses sqlglot AST analysis to find columns a downstream model reads from an upstream ref, even when those columns are aliased to different output names - Update _get_downstream_manifest and _get_downstream_yaml to fall back to SQL-level column-read analysis when manifest/YAML column intersection returns empty (the false-negative case: upstream col read but renamed) Phase 3 — Stale compiled SQL warning: - After analysis, emit a warning listing downstream models that had no compiled SQL, with a dbt compile command to fix it Tests: - 15 new test cases in test_sql_parser.py: Union CTE resolution, passthrough star chains, VALUES column extraction - 4 new test cases in test_impact.py: column-read analysis including the aliased-column false-negative scenario - Total: 98 tests, all passing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Phase 3 — --auto-compile flag for impact command:
- Add --auto-compile flag to 'ol-dbt impact' that runs 'dbt compile
--select <changed_model>+' before analysis to ensure compiled SQL is
fresh for downstream models
- Gracefully handles missing dbt binary (warning, continues with raw SQL)
- Reloads parsed models after successful compile to pick up new output
Phase 4 — Jinja2 rendering engine:
- Replace regex-based Jinja stripping with a Jinja2 rendering engine as
the primary path in strip_jinja()
- _SqlSafeUndefined: custom Jinja2 Undefined subclass that renders
unknown variables/macros as safe SQL placeholder tokens (__undefined__,
__macro__) instead of raising TemplateSyntaxError
- _make_jinja_env(): builds a dbt-compatible Jinja2 environment with
ref(), source(), config(), var(), env_var(), is_incremental() and
common dbt adapter namespaces (dbt_utils, adapter, this, etc.)
- Key benefit: {% if %}/{% for %} control blocks are properly evaluated
(taking one branch) rather than stripped, yielding more valid SQL
for sqlglot parsing
- Regex path (_strip_jinja_regex) retained as fallback for TemplateError
- _BROKEN_COL_RE updated to match both __jinja__ (regex) and __macro__/
__undefined__ (Jinja2) placeholders
- Add jinja2>=3.1 and markupsafe>=2.1 as explicit dependencies in
pyproject.toml (were already transitive deps)
Tests:
- 8 new test cases in test_sql_parser.py: Jinja2 if/for rendering,
ref/source collection, undefined macro placeholder, fallback behavior
- Total: 106 tests, all passing
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds --auto-compile to 'ol-dbt validate' which runs 'dbt parse' before
validation to regenerate manifest.json.
'dbt parse' is fast and offline (no database connection required) — it
just parses and validates the project, generating a fresh manifest.json.
This improves upstream column resolution accuracy (check 2: upstream_refs)
and SELECT * expansion when the manifest was stale or missing.
Behaviour:
- Runs dbt parse in the dbt project directory before loading the manifest
- Gracefully degrades if dbt is not installed (warning, continues without)
- Falls back cleanly if dbt parse fails (validation continues with
existing or no manifest)
For compiled SQL accuracy (check 1: yaml_sql_sync), run 'dbt compile'
manually first — it is not run automatically here since it requires a
database connection.
Example:
ol-dbt validate --auto-compile
ol-dbt validate --auto-compile --changed-only
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both 'ol-dbt validate --auto-compile' and 'ol-dbt impact --auto-compile'
now default to --target dev_local, which uses a local DuckDB instance
(~/.ol-dbt/local.duckdb) and requires no network access or credentials.
Changes:
- Add --target / -t parameter to both validate and impact (default: dev_local)
- validate: passes --target to 'dbt parse'
- impact: passes --target to 'dbt compile'
- Update help text and docstring examples to document the default and
how to override with dev_qa/production targets
The dev_local target is defined in profiles.yml as a DuckDB adapter,
making local compilation fast and friction-free for day-to-day use.
Examples:
ol-dbt validate --auto-compile # uses dev_local
ol-dbt validate --auto-compile --target dev_qa # override target
ol-dbt impact --auto-compile # uses dev_local
ol-dbt impact --auto-compile --target production # override target
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a new
ol-dbtCLI package atsrc/ol_dbt_cli/with two commands for the engineering team to use with thesrc/ol_dbtdbt project.Commands
ol-dbt validateStructural consistency checks for dbt models:
ref()/source()calls that don't resolve to known models--warn-select-starto promote)Options:
--model/-m— filter by model name(s), comma-separated list, or directory undermodels/--errors-only— suppress INFO/WARN, show only errors--warn-select-star— treat unresolved SELECT * as warnings--compiled-dir— usetarget/compiled/SQL for Jinja-heavy modelsol-dbt impactColumn-level downstream impact analysis driven by git diffs:
git diff [--cached]to find changed/removed SQL column aliasesJinja pre-processing
All 588 dbt models in this project parse successfully via a regex-based Jinja stripper that handles:
{{ ref() }}/{{ source() }}→ stable SQL identifiers with reverse maps{{ ... }}→ unquoted placeholders (fixes doubled-quote bug:like '{{ var }}'→like '__jinja__'notlike ''__jinja__''){{ deduplicate_raw_table(...) }}) → SQL comment}}boundary (e.g.{{ array_join('partial SQL', "path") }}) → collapsed to__jinja__ as aliasvia post-cleanup regex passYAML registry improvements
models:andsources:blocks for SELECT * column resolutioncolumns:list (authoring error in_stg_mitlearn_models.yml)Tests
85 pytest tests covering all commands, library modules, and edge cases.