Skip to content

feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI#1998

Open
blarghmatey wants to merge 7 commits intomainfrom
feat/ol-dbt-cli
Open

feat(ol-dbt-cli): add SQL/YAML linting and column-level impact analysis CLI#1998
blarghmatey wants to merge 7 commits intomainfrom
feat/ol-dbt-cli

Conversation

@blarghmatey
Copy link
Member

Summary

Adds a new ol-dbt CLI package at src/ol_dbt_cli/ with two commands for the engineering team to use with the src/ol_dbt dbt project.

Commands

ol-dbt validate

Structural consistency checks for dbt models:

Check Level Description
YAML/SQL sync ERROR Columns in YAML but absent from SQL output, or vice-versa
Missing upstream refs ERROR ref()/source() calls that don't resolve to known models
Upstream column existence ERROR Column referenced from upstream doesn't exist in that model
Duplicate aliases ERROR Repeated column alias within a single SELECT
SELECT * INFO/WARN Unresolved star expansion (use --warn-select-star to promote)

Options:

ol-dbt validate [--model MODEL]... [--errors-only] [--warn-select-star]
                [--compiled-dir PATH] [--output text|json]
  • --model/-m — filter by model name(s), comma-separated list, or directory under models/
  • --errors-only — suppress INFO/WARN, show only errors
  • --warn-select-star — treat unresolved SELECT * as warnings
  • --compiled-dir — use target/compiled/ SQL for Jinja-heavy models

ol-dbt impact

Column-level downstream impact analysis driven by git diffs:

ol-dbt impact [--base BRANCH] [--staged] [--compiled-dir PATH] [--output text|json]
  • Reads git diff [--cached] to find changed/removed SQL column aliases
  • Traverses upstream → downstream to find all models referencing each changed column
  • Reports broken references, risky usages, and models needing review

Jinja pre-processing

All 588 dbt models in this project parse successfully via a regex-based Jinja stripper that handles:

  • {{ ref() }} / {{ source() }} → stable SQL identifiers with reverse maps
  • All {{ ... }} → unquoted placeholders (fixes doubled-quote bug: like '{{ var }}'like '__jinja__' not like ''__jinja__'')
  • Block-level macro calls alone on a line (e.g. {{ deduplicate_raw_table(...) }}) → SQL comment
  • Macros that split a SQL column expression across the }} boundary (e.g. {{ array_join('partial SQL', "path") }}) → collapsed to __jinja__ as alias via post-cleanup regex pass

YAML registry improvements

  • Parses both models: and sources: blocks for SELECT * column resolution
  • Rescues model definitions accidentally nested inside another model's columns: list (authoring error in _stg_mitlearn_models.yml)

Tests

85 pytest tests covering all commands, library modules, and edge cases.

…is CLI

Adds a new `ol-dbt` CLI package at `src/ol_dbt_cli/` with two commands:

## `ol-dbt validate`

Structural consistency checks for dbt models:

1. **YAML/SQL sync** — warns when columns declared in a model's YAML schema
   are absent from its compiled SELECT output, or vice-versa.
2. **Missing upstream refs** — flags `ref()` or `source()` calls that
   resolve to models/sources not present in the project.
3. **Upstream column existence** — validates that columns referenced from
   an upstream model actually exist in that model's output.
4. **Duplicate column aliases** — detects accidental repeated alias names
   within a single model.
5. **SELECT \*** — INFO-level notice for unresolved star expansions;
   `--warn-select-star` promotes these to warnings.

Options:
- `--model/-m` (repeatable): filter to one or more model names, comma-separated
  lists, or directory paths under `models/`
- `--errors-only`: suppress INFO/WARN output, show only errors
- `--warn-select-star`: treat unresolved SELECT * as warnings
- `--compiled-dir`: point at `target/compiled/` for Jinja-heavy models
- `--output json`: machine-readable output

## `ol-dbt impact`

Column-level downstream impact analysis driven by git diffs:

- Reads `git diff [--cached] [base]` to find changed/removed SQL column aliases
- Traverses the manifest or parsed SQL to find all downstream models that
  reference each changed column
- Reports broken references, risky usages, and models needing review

## Jinja pre-processing

Robust regex-based Jinja stripping so sqlglot can parse the raw `.sql` files
without a running dbt environment:

- `{{ ref() }}` / `{{ source() }}` → stable SQL identifiers with reverse maps
- `{{ var() }}` and bare `{{ variable }}` → unquoted placeholders (avoids
  the `''__jinja__''` doubled-quote bug when templates surround with SQL quotes)
- Block-level macro calls alone on a line → SQL comment
- Broken column expressions where a macro splits a SQL column boundary
  (e.g. `{{ array_join('partial SQL', "path") }}`) → collapsed to
  `__jinja__ as alias` via `_BROKEN_COL_RE` cleanup pass

Result: **588/588** dbt models in this project parse successfully.

## YAML registry improvements

- Parses both `models:` and `sources:` blocks for SELECT * resolution
- Rescues model definitions accidentally nested inside another model's
  `columns:` list (YAML authoring error in `_stg_mitlearn_models.yml`)
- Filters out nested-model entries from the parent model's column list

## Tests

85 pytest tests covering all commands, lib modules, and edge cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a new ol-dbt CLI workspace package (src/ol_dbt_cli/) intended to help engineers validate dbt model/YAML consistency and assess column-level downstream impact of in-progress changes in src/ol_dbt.

Changes:

  • Adds ol-dbt validate and ol-dbt impact commands (cyclopts + rich output) plus supporting parsing/registry libraries.
  • Implements YAML schema registry + sqlglot-based SQL parsing with Jinja stripping and partial SELECT * resolution.
  • Adds a pytest suite covering registry/parsing/validation/impact helpers and wires the package into the uv workspace/dependencies.

Reviewed changes

Copilot reviewed 13 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
uv.lock Adds ol-dbt-cli workspace member and locks new deps (rich, sqlglot).
pyproject.toml Adds src/ol_dbt_cli to uv workspace members and adds rich/sqlglot to root deps.
src/ol_dbt_cli/pyproject.toml Declares the ol-dbt-cli package, deps, entrypoint, and ruff config.
src/ol_dbt_cli/ol_dbt_cli/init.py Package marker.
src/ol_dbt_cli/ol_dbt_cli/cli.py Defines ol-dbt CLI app and registers impact/validate commands.
src/ol_dbt_cli/ol_dbt_cli/commands/init.py Commands package marker.
src/ol_dbt_cli/ol_dbt_cli/commands/validate.py Implements structural validation checks and reporting/output.
src/ol_dbt_cli/ol_dbt_cli/commands/impact.py Implements diff-based column change detection and downstream impact reporting.
src/ol_dbt_cli/ol_dbt_cli/lib/init.py Lib package marker.
src/ol_dbt_cli/ol_dbt_cli/lib/yaml_registry.py Parses dbt schema YAML into a registry of models/sources/columns.
src/ol_dbt_cli/ol_dbt_cli/lib/sql_parser.py Strips Jinja and uses sqlglot to extract output columns (+ some star resolution).
src/ol_dbt_cli/ol_dbt_cli/lib/manifest.py Parses dbt manifest.json into a registry with lineage helpers.
src/ol_dbt_cli/ol_dbt_cli/lib/git_utils.py Git helpers for changed files/models and retrieving file content at refs.
src/ol_dbt_cli/tests/init.py Tests package marker.
src/ol_dbt_cli/tests/test_yaml_registry.py Tests YAML registry model/source parsing and nested-model rescue behavior.
src/ol_dbt_cli/tests/test_validate.py Tests validate checks, formatting, and model target resolution helpers.
src/ol_dbt_cli/tests/test_sql_parser.py Tests Jinja stripping and SQL parsing/star handling/compiled-dir logic.
src/ol_dbt_cli/tests/test_impact.py Tests column diff helper used by impact analysis.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

blarghmatey and others added 2 commits March 16, 2026 09:40
- validate.py: fix _resolve_star_with_registry returning m_model.columns
  (dict[str, ManifestColumn]) instead of m_model.column_names (set[str]),
  which broke set operations in callers
- validate.py: fix _resolve_model_targets silently skipping invalid directory
  tokens when earlier tokens already populated resolved[]; now tracks whether
  THIS token added any models before deciding whether to emit an unknown warning
- validate.py: remove flawed upstream column cross-ref existence check that
  produced false positives because it cannot attribute a YAML column to a
  specific upstream ref without table-alias-level SQL AST analysis; replaced
  with a narrow unresolvable-ref warning only
- manifest.py: replace list.pop(0) O(n) queue with collections.deque.popleft()
  O(1) in get_all_descendants() BFS
- cli.py: fix help text examples that referenced non-existent --changed-only
  flag on impact and --format (correct flag is --output)
- sql_parser.py: fix _parse_clean_sql compiled-SQL path to get refs/sources
  from the raw SQL via strip_jinja() rather than looking for 'ref_'/'source_'
  prefixes in compiled table names (which don't exist in real compiled SQL)
- git_utils.py: fix get_changed_files() docstring claiming it includes
  untracked files (it only includes tracked changed files)
- git_utils.py: fix get_changed_yaml_models() return type from list[str] to
  list[Path] (the function returns file paths, not model names) and remove
  the type: ignore suppression

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Inverse of --skip; accepts a comma-separated list of check names and
runs only those checks, skipping all others. Mutually exclusive with
--skip (exits with error if both are provided).

Available check names: yaml_sql_sync, upstream_refs, docs_coverage,
yaml_integrity, select_star.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rachellougee rachellougee self-assigned this Mar 16, 2026
Copy link
Contributor

@rachellougee rachellougee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ol-dbt validate works well, as it detects the inconsistencies between sql and yml file.

ol-dbt impact has some limitations. It relies on the dbt compiled folder, so if downstream models have never been compiled locally, the result may be incomplete. So maybe we could add something like dbt compile --select model+ in the impact.py to ensure all downstream models are compiled before analysis?

Also, when I tested by renaming the user_email column in the stg__edxorg__program_entitlement, ol-dbt impact detected one downstream edxorg_to_mitxonline_users impact, but it missed the user_email reference in edxorg_to_mitxonline_program_entitlements. Not sure why it would miss that because both downstream models were compiled locally.

% ol-dbt impact
Using compiled SQL: /Users/rachellougee/repo/OL/ol-data-platform/src/ol_dbt/target/compiled/open_learning/models
Using manifest: /Users/rachellougee/repo/OL/ol-data-platform/src/ol_dbt/target/manifest.json

Analysing 1 model(s) for column-level impact

 BREAKING stg__edxorg__program_entitlement
   Column change(s) in 'stg__edxorg__program_entitlement' will break 1 downstream model(s).
   Column changes:
     🔄 user_email → user_emails
     →  user_emails
   Downstream impact:
       ↳ edxorg_to_mitxonline_users (uses: user_email)
       ↳ not_null_stg__edxorg__program_entitlement_program_title (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_program_type (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_organization_key (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_user_id (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_user_emails (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_purchase_date (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_expiration_date (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_number_of_entitlements (column metadata unavailable)
       ↳ not_null_stg__edxorg__program_entitlement_number_of_redeemed_entitlements (column metadata unavailable)
       ↳ dbt_expectations_expect_compound_columns_to_be_unique_stg__edxorg__program_entitlement_user_email__program_title (column metadata unavailable)
         ↳ not_null_edxorg_to_mitxonline_users_user_email (column metadata unavailable)
         ↳ unique_edxorg_to_mitxonline_users_user_email (column metadata unavailable)
         ↳ dbt_expectations_expect_compound_columns_to_be_unique_edxorg_to_mitxonline_program_entitlements_program_title__user_edxorg_email__order_id_is_null 
(column metadata unavailable)

Summary: 1 breaking, 0 warning(s) across 1 changed model(s)

…, add stale SQL warning

Phase 1 — SELECT * resolution improvements:
- Fix _build_cte_column_map to handle exp.Union CTEs (UNION ALL) by
  extracting columns from the leftmost SELECT branch; resolves 8 of the
  10 previously unresolved SELECT * models
- Add VALUES inline table handling in _try_resolve_star: detects
  SELECT * FROM (VALUES …) AS alias (col1, col2) and extracts column
  names from the alias list
- Add _resolve_star_source_through_ctes helper that walks thin passthrough
  CTE chains (SELECT * FROM ref_x) to find the ultimate ref/source
  placeholder, enabling registry-based resolution of the final two
  unresolved stars
- Add source_path field to ParsedModel to track the raw SQL source file
- After full pipeline: 0 unresolved SELECT * across all 588 models

Phase 2 — Impact false-negative fix (rachellougee's PR feedback):
- Add _get_columns_read_from_ref() to impact.py: uses sqlglot AST
  analysis to find columns a downstream model reads from an upstream
  ref, even when those columns are aliased to different output names
- Update _get_downstream_manifest and _get_downstream_yaml to fall back
  to SQL-level column-read analysis when manifest/YAML column intersection
  returns empty (the false-negative case: upstream col read but renamed)

Phase 3 — Stale compiled SQL warning:
- After analysis, emit a warning listing downstream models that had no
  compiled SQL, with a dbt compile command to fix it

Tests:
- 15 new test cases in test_sql_parser.py: Union CTE resolution,
  passthrough star chains, VALUES column extraction
- 4 new test cases in test_impact.py: column-read analysis including
  the aliased-column false-negative scenario
- Total: 98 tests, all passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
blarghmatey and others added 2 commits March 18, 2026 14:58
Phase 3 — --auto-compile flag for impact command:
- Add --auto-compile flag to 'ol-dbt impact' that runs 'dbt compile
  --select <changed_model>+' before analysis to ensure compiled SQL is
  fresh for downstream models
- Gracefully handles missing dbt binary (warning, continues with raw SQL)
- Reloads parsed models after successful compile to pick up new output

Phase 4 — Jinja2 rendering engine:
- Replace regex-based Jinja stripping with a Jinja2 rendering engine as
  the primary path in strip_jinja()
- _SqlSafeUndefined: custom Jinja2 Undefined subclass that renders
  unknown variables/macros as safe SQL placeholder tokens (__undefined__,
  __macro__) instead of raising TemplateSyntaxError
- _make_jinja_env(): builds a dbt-compatible Jinja2 environment with
  ref(), source(), config(), var(), env_var(), is_incremental() and
  common dbt adapter namespaces (dbt_utils, adapter, this, etc.)
- Key benefit: {% if %}/{% for %} control blocks are properly evaluated
  (taking one branch) rather than stripped, yielding more valid SQL
  for sqlglot parsing
- Regex path (_strip_jinja_regex) retained as fallback for TemplateError
- _BROKEN_COL_RE updated to match both __jinja__ (regex) and __macro__/
  __undefined__ (Jinja2) placeholders
- Add jinja2>=3.1 and markupsafe>=2.1 as explicit dependencies in
  pyproject.toml (were already transitive deps)

Tests:
- 8 new test cases in test_sql_parser.py: Jinja2 if/for rendering,
  ref/source collection, undefined macro placeholder, fallback behavior
- Total: 106 tests, all passing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds --auto-compile to 'ol-dbt validate' which runs 'dbt parse' before
validation to regenerate manifest.json.

'dbt parse' is fast and offline (no database connection required) — it
just parses and validates the project, generating a fresh manifest.json.
This improves upstream column resolution accuracy (check 2: upstream_refs)
and SELECT * expansion when the manifest was stale or missing.

Behaviour:
- Runs dbt parse in the dbt project directory before loading the manifest
- Gracefully degrades if dbt is not installed (warning, continues without)
- Falls back cleanly if dbt parse fails (validation continues with
  existing or no manifest)

For compiled SQL accuracy (check 1: yaml_sql_sync), run 'dbt compile'
manually first — it is not run automatically here since it requires a
database connection.

Example:
    ol-dbt validate --auto-compile
    ol-dbt validate --auto-compile --changed-only

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Both 'ol-dbt validate --auto-compile' and 'ol-dbt impact --auto-compile'
now default to --target dev_local, which uses a local DuckDB instance
(~/.ol-dbt/local.duckdb) and requires no network access or credentials.

Changes:
- Add --target / -t parameter to both validate and impact (default: dev_local)
- validate: passes --target to 'dbt parse'
- impact: passes --target to 'dbt compile'
- Update help text and docstring examples to document the default and
  how to override with dev_qa/production targets

The dev_local target is defined in profiles.yml as a DuckDB adapter,
making local compilation fast and friction-free for day-to-day use.

Examples:
    ol-dbt validate --auto-compile                    # uses dev_local
    ol-dbt validate --auto-compile --target dev_qa    # override target
    ol-dbt impact --auto-compile                      # uses dev_local
    ol-dbt impact --auto-compile --target production  # override target

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants