feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI by blarghmatey · Pull Request #1998 · mitodl/ol-data-platform

blarghmatey · 2026-03-15T21:56:45Z

Summary

Adds a new ol-dbt CLI package at src/ol_dbt_cli/ with two commands for the engineering team to use with the src/ol_dbt dbt project.

Commands

`ol-dbt validate`

Structural consistency checks for dbt models:

Check	Level	Description
YAML/SQL sync	ERROR	Columns in YAML but absent from SQL output, or vice-versa
Missing upstream refs	ERROR	`ref()`/`source()` calls that don't resolve to known models
Upstream column existence	ERROR	Column referenced from upstream doesn't exist in that model
Duplicate aliases	ERROR	Repeated column alias within a single SELECT
SELECT *	INFO/WARN	Unresolved star expansion (use `--warn-select-star` to promote)

Options:

ol-dbt validate [--model MODEL]... [--errors-only] [--warn-select-star]
                [--compiled-dir PATH] [--output text|json]

--model/-m — filter by model name(s), comma-separated list, or directory under models/
--errors-only — suppress INFO/WARN, show only errors
--warn-select-star — treat unresolved SELECT * as warnings
--compiled-dir — use target/compiled/ SQL for Jinja-heavy models

`ol-dbt impact`

Column-level downstream impact analysis driven by git diffs:

ol-dbt impact [--base BRANCH] [--staged] [--compiled-dir PATH] [--output text|json]

Reads git diff [--cached] to find changed/removed SQL column aliases
Traverses upstream → downstream to find all models referencing each changed column
Reports broken references, risky usages, and models needing review

Jinja pre-processing

All 588 dbt models in this project parse successfully via a regex-based Jinja stripper that handles:

{{ ref() }} / {{ source() }} → stable SQL identifiers with reverse maps
All {{ ... }} → unquoted placeholders (fixes doubled-quote bug: like '{{ var }}' → like '__jinja__' not like ''__jinja__'')
Block-level macro calls alone on a line (e.g. {{ deduplicate_raw_table(...) }}) → SQL comment
Macros that split a SQL column expression across the }} boundary (e.g. {{ array_join('partial SQL', "path") }}) → collapsed to __jinja__ as alias via post-cleanup regex pass

YAML registry improvements

Parses both models: and sources: blocks for SELECT * column resolution
Rescues model definitions accidentally nested inside another model's columns: list (authoring error in _stg_mitlearn_models.yml)

Tests

85 pytest tests covering all commands, library modules, and edge cases.

src/ol_dbt_cli/ol_dbt_cli/commands/validate.py

…is CLI Adds a new `ol-dbt` CLI package at `src/ol_dbt_cli/` with two commands: ## `ol-dbt validate` Structural consistency checks for dbt models: 1. **YAML/SQL sync** — warns when columns declared in a model's YAML schema are absent from its compiled SELECT output, or vice-versa. 2. **Missing upstream refs** — flags `ref()` or `source()` calls that resolve to models/sources not present in the project. 3. **Upstream column existence** — validates that columns referenced from an upstream model actually exist in that model's output. 4. **Duplicate column aliases** — detects accidental repeated alias names within a single model. 5. **SELECT \*** — INFO-level notice for unresolved star expansions; `--warn-select-star` promotes these to warnings. Options: - `--model/-m` (repeatable): filter to one or more model names, comma-separated lists, or directory paths under `models/` - `--errors-only`: suppress INFO/WARN output, show only errors - `--warn-select-star`: treat unresolved SELECT * as warnings - `--compiled-dir`: point at `target/compiled/` for Jinja-heavy models - `--output json`: machine-readable output ## `ol-dbt impact` Column-level downstream impact analysis driven by git diffs: - Reads `git diff [--cached] [base]` to find changed/removed SQL column aliases - Traverses the manifest or parsed SQL to find all downstream models that reference each changed column - Reports broken references, risky usages, and models needing review ## Jinja pre-processing Robust regex-based Jinja stripping so sqlglot can parse the raw `.sql` files without a running dbt environment: - `{{ ref() }}` / `{{ source() }}` → stable SQL identifiers with reverse maps - `{{ var() }}` and bare `{{ variable }}` → unquoted placeholders (avoids the `''__jinja__''` doubled-quote bug when templates surround with SQL quotes) - Block-level macro calls alone on a line → SQL comment - Broken column expressions where a macro splits a SQL column boundary (e.g. `{{ array_join('partial SQL', "path") }}`) → collapsed to `__jinja__ as alias` via `_BROKEN_COL_RE` cleanup pass Result: **588/588** dbt models in this project parse successfully. ## YAML registry improvements - Parses both `models:` and `sources:` blocks for SELECT * resolution - Rescues model definitions accidentally nested inside another model's `columns:` list (YAML authoring error in `_stg_mitlearn_models.yml`) - Filters out nested-model entries from the parent model's column list ## Tests 85 pytest tests covering all commands, lib modules, and edge cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Introduces a new ol-dbt CLI workspace package (src/ol_dbt_cli/) intended to help engineers validate dbt model/YAML consistency and assess column-level downstream impact of in-progress changes in src/ol_dbt.

Changes:

Adds ol-dbt validate and ol-dbt impact commands (cyclopts + rich output) plus supporting parsing/registry libraries.
Implements YAML schema registry + sqlglot-based SQL parsing with Jinja stripping and partial SELECT * resolution.
Adds a pytest suite covering registry/parsing/validation/impact helpers and wires the package into the uv workspace/dependencies.

Reviewed changes

Copilot reviewed 13 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
uv.lock	Adds `ol-dbt-cli` workspace member and locks new deps (rich, sqlglot).
pyproject.toml	Adds `src/ol_dbt_cli` to uv workspace members and adds rich/sqlglot to root deps.
src/ol_dbt_cli/pyproject.toml	Declares the `ol-dbt-cli` package, deps, entrypoint, and ruff config.
src/ol_dbt_cli/ol_dbt_cli/init.py	Package marker.
src/ol_dbt_cli/ol_dbt_cli/cli.py	Defines `ol-dbt` CLI app and registers `impact`/`validate` commands.
src/ol_dbt_cli/ol_dbt_cli/commands/init.py	Commands package marker.
src/ol_dbt_cli/ol_dbt_cli/commands/validate.py	Implements structural validation checks and reporting/output.
src/ol_dbt_cli/ol_dbt_cli/commands/impact.py	Implements diff-based column change detection and downstream impact reporting.
src/ol_dbt_cli/ol_dbt_cli/lib/init.py	Lib package marker.
src/ol_dbt_cli/ol_dbt_cli/lib/yaml_registry.py	Parses dbt schema YAML into a registry of models/sources/columns.
src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py	Strips Jinja and uses sqlglot to extract output columns (+ some star resolution).
src/ol_dbt_cli/ol_dbt_cli/lib/manifest.py	Parses dbt `manifest.json` into a registry with lineage helpers.
src/ol_dbt_cli/ol_dbt_cli/lib/git_utils.py	Git helpers for changed files/models and retrieving file content at refs.
src/ol_dbt_cli/tests/init.py	Tests package marker.
src/ol_dbt_cli/tests/test_yaml_registry.py	Tests YAML registry model/source parsing and nested-model rescue behavior.
src/ol_dbt_cli/tests/test_validate.py	Tests validate checks, formatting, and model target resolution helpers.
src/ol_dbt_cli/tests/test_sql_parser.py	Tests Jinja stripping and SQL parsing/star handling/compiled-dir logic.
src/ol_dbt_cli/tests/test_impact.py	Tests column diff helper used by impact analysis.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

src/ol_dbt_cli/ol_dbt_cli/lib/manifest.py

src/ol_dbt_cli/ol_dbt_cli/commands/validate.py

src/ol_dbt_cli/ol_dbt_cli/cli.py

src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py

src/ol_dbt_cli/ol_dbt_cli/lib/git_utils.py

- validate.py: fix _resolve_star_with_registry returning m_model.columns (dict[str, ManifestColumn]) instead of m_model.column_names (set[str]), which broke set operations in callers - validate.py: fix _resolve_model_targets silently skipping invalid directory tokens when earlier tokens already populated resolved[]; now tracks whether THIS token added any models before deciding whether to emit an unknown warning - validate.py: remove flawed upstream column cross-ref existence check that produced false positives because it cannot attribute a YAML column to a specific upstream ref without table-alias-level SQL AST analysis; replaced with a narrow unresolvable-ref warning only - manifest.py: replace list.pop(0) O(n) queue with collections.deque.popleft() O(1) in get_all_descendants() BFS - cli.py: fix help text examples that referenced non-existent --changed-only flag on impact and --format (correct flag is --output) - sql_parser.py: fix _parse_clean_sql compiled-SQL path to get refs/sources from the raw SQL via strip_jinja() rather than looking for 'ref_'/'source_' prefixes in compiled table names (which don't exist in real compiled SQL) - git_utils.py: fix get_changed_files() docstring claiming it includes untracked files (it only includes tracked changed files) - git_utils.py: fix get_changed_yaml_models() return type from list[str] to list[Path] (the function returns file paths, not model names) and remove the type: ignore suppression Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Inverse of --skip; accepts a comma-separated list of check names and runs only those checks, skipping all others. Mutually exclusive with --skip (exits with error if both are provided). Available check names: yaml_sql_sync, upstream_refs, docs_coverage, yaml_integrity, select_star. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

rachellougee

ol-dbt validate works well, as it detects the inconsistencies between sql and yml file.

ol-dbt impact has some limitations. It relies on the dbt compiled folder, so if downstream models have never been compiled locally, the result may be incomplete. So maybe we could add something like dbt compile --select model+ in the impact.py to ensure all downstream models are compiled before analysis?

Also, when I tested by renaming the user_email column in the stg__edxorg__program_entitlement, ol-dbt impact detected one downstream edxorg_to_mitxonline_users impact, but it missed the user_email reference in edxorg_to_mitxonline_program_entitlements. Not sure why it would miss that because both downstream models were compiled locally.

% ol-dbt impact
Using compiled SQL: /Users/rachellougee/repo/OL/ol-data-platform/src/ol_dbt/target/compiled/open_learning/models
Using manifest: /Users/rachellougee/repo/OL/ol-data-platform/src/ol_dbt/target/manifest.json

Analysing 1 model(s) for column-level impact

 BREAKING stg__edxorg__program_entitlement
   Column change(s) in 'stg__edxorg__program_entitlement' will break 1 downstream model(s).
   Column changes:
     🔄 user_email → user_emails
     →  user_emails
   Downstream impact:
       ↳ edxorg_to_mitxonline_users (uses: user_email)
       ↳ not_null_stg__edxorg__program_entitlement_program_title (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_program_type (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_organization_key (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_user_id (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_user_emails (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_purchase_date (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_expiration_date (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_number_of_entitlements (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_number_of_redeemed_entitlements (column metadata unavailable)
       ↳ dbt_expectations_expect_compound_columns_to_be_unique_stg__edxorg__program_entitlement_user_email__program_title (column metadata unavailable)
         ↳ not_null_edxorg_to_mitxonline_users_user_email (column metadata unavailable)
         ↳ unique_edxorg_to_mitxonline_users_user_email (column metadata unavailable)
         ↳ dbt_expectations_expect_compound_columns_to_be_unique_edxorg_to_mitxonline_program_entitlements_program_title__user_edxorg_email__order_id_is_null 
(column metadata unavailable)

Summary: 1 breaking, 0 warning(s) across 1 changed model(s)

…, add stale SQL warning Phase 1 — SELECT * resolution improvements: - Fix _build_cte_column_map to handle exp.Union CTEs (UNION ALL) by extracting columns from the leftmost SELECT branch; resolves 8 of the 10 previously unresolved SELECT * models - Add VALUES inline table handling in _try_resolve_star: detects SELECT * FROM (VALUES …) AS alias (col1, col2) and extracts column names from the alias list - Add _resolve_star_source_through_ctes helper that walks thin passthrough CTE chains (SELECT * FROM ref_x) to find the ultimate ref/source placeholder, enabling registry-based resolution of the final two unresolved stars - Add source_path field to ParsedModel to track the raw SQL source file - After full pipeline: 0 unresolved SELECT * across all 588 models Phase 2 — Impact false-negative fix (rachellougee's PR feedback): - Add _get_columns_read_from_ref() to impact.py: uses sqlglot AST analysis to find columns a downstream model reads from an upstream ref, even when those columns are aliased to different output names - Update _get_downstream_manifest and _get_downstream_yaml to fall back to SQL-level column-read analysis when manifest/YAML column intersection returns empty (the false-negative case: upstream col read but renamed) Phase 3 — Stale compiled SQL warning: - After analysis, emit a warning listing downstream models that had no compiled SQL, with a dbt compile command to fix it Tests: - 15 new test cases in test_sql_parser.py: Union CTE resolution, passthrough star chains, VALUES column extraction - 4 new test cases in test_impact.py: column-read analysis including the aliased-column false-negative scenario - Total: 98 tests, all passing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py

Phase 3 — --auto-compile flag for impact command: - Add --auto-compile flag to 'ol-dbt impact' that runs 'dbt compile --select <changed_model>+' before analysis to ensure compiled SQL is fresh for downstream models - Gracefully handles missing dbt binary (warning, continues with raw SQL) - Reloads parsed models after successful compile to pick up new output Phase 4 — Jinja2 rendering engine: - Replace regex-based Jinja stripping with a Jinja2 rendering engine as the primary path in strip_jinja() - _SqlSafeUndefined: custom Jinja2 Undefined subclass that renders unknown variables/macros as safe SQL placeholder tokens (__undefined__, __macro__) instead of raising TemplateSyntaxError - _make_jinja_env(): builds a dbt-compatible Jinja2 environment with ref(), source(), config(), var(), env_var(), is_incremental() and common dbt adapter namespaces (dbt_utils, adapter, this, etc.) - Key benefit: {% if %}/{% for %} control blocks are properly evaluated (taking one branch) rather than stripped, yielding more valid SQL for sqlglot parsing - Regex path (_strip_jinja_regex) retained as fallback for TemplateError - _BROKEN_COL_RE updated to match both __jinja__ (regex) and __macro__/ __undefined__ (Jinja2) placeholders - Add jinja2>=3.1 and markupsafe>=2.1 as explicit dependencies in pyproject.toml (were already transitive deps) Tests: - 8 new test cases in test_sql_parser.py: Jinja2 if/for rendering, ref/source collection, undefined macro placeholder, fallback behavior - Total: 106 tests, all passing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Adds --auto-compile to 'ol-dbt validate' which runs 'dbt parse' before validation to regenerate manifest.json. 'dbt parse' is fast and offline (no database connection required) — it just parses and validates the project, generating a fresh manifest.json. This improves upstream column resolution accuracy (check 2: upstream_refs) and SELECT * expansion when the manifest was stale or missing. Behaviour: - Runs dbt parse in the dbt project directory before loading the manifest - Gracefully degrades if dbt is not installed (warning, continues without) - Falls back cleanly if dbt parse fails (validation continues with existing or no manifest) For compiled SQL accuracy (check 1: yaml_sql_sync), run 'dbt compile' manually first — it is not run automatically here since it requires a database connection. Example: ol-dbt validate --auto-compile ol-dbt validate --auto-compile --changed-only Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py

src/ol_dbt_cli/ol_dbt_cli/commands/impact.py

Both 'ol-dbt validate --auto-compile' and 'ol-dbt impact --auto-compile' now default to --target dev_local, which uses a local DuckDB instance (~/.ol-dbt/local.duckdb) and requires no network access or credentials. Changes: - Add --target / -t parameter to both validate and impact (default: dev_local) - validate: passes --target to 'dbt parse' - impact: passes --target to 'dbt compile' - Update help text and docstring examples to document the default and how to override with dev_qa/production targets The dev_local target is defined in profiles.yml as a DuckDB adapter, making local compilation fast and friction-free for day-to-day use. Examples: ol-dbt validate --auto-compile # uses dev_local ol-dbt validate --auto-compile --target dev_qa # override target ol-dbt impact --auto-compile # uses dev_local ol-dbt impact --auto-compile --target production # override target Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sentry bot reviewed Mar 15, 2026

View reviewed changes

src/ol_dbt_cli/ol_dbt_cli/commands/validate.py Outdated Show resolved Hide resolved

src/ol_dbt_cli/ol_dbt_cli/commands/validate.py Outdated Show resolved Hide resolved

blarghmatey force-pushed the feat/ol-dbt-cli branch from cdfb9e5 to 18376fe Compare March 16, 2026 12:52

blarghmatey requested a review from Copilot March 16, 2026 13:30

Copilot started reviewing on behalf of blarghmatey March 16, 2026 13:30 View session

Copilot AI reviewed Mar 16, 2026

View reviewed changes

blarghmatey and others added 2 commits March 16, 2026 09:40

rachellougee self-assigned this Mar 16, 2026

rachellougee reviewed Mar 17, 2026

View reviewed changes

sentry bot reviewed Mar 18, 2026

View reviewed changes

src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py Show resolved Hide resolved

blarghmatey and others added 2 commits March 18, 2026 14:58

sentry bot reviewed Mar 18, 2026

View reviewed changes

src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py Show resolved Hide resolved

sentry bot reviewed Mar 18, 2026

View reviewed changes

src/ol_dbt_cli/ol_dbt_cli/commands/impact.py Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI#1998

feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI#1998
blarghmatey wants to merge 7 commits intomainfrom
feat/ol-dbt-cli

blarghmatey commented Mar 15, 2026

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rachellougee left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

blarghmatey commented Mar 15, 2026

Summary

Commands

ol-dbt validate

ol-dbt impact

Jinja pre-processing

YAML registry improvements

Tests

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rachellougee left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`ol-dbt validate`

`ol-dbt impact`