diff --git a/.claude/agents/docs-dev.md b/.claude/agents/docs-dev.md index 5bffce8..5daa070 100644 --- a/.claude/agents/docs-dev.md +++ b/.claude/agents/docs-dev.md @@ -1,6 +1,6 @@ --- name: docs-dev -description: Owns the dbdocs package — the click CLI plus the extract/site pipeline that turns dbt artifacts into a self-contained single-page-app (SPA). Use for CLI commands (generate/serve/deploy), the artifact extractors (nodes/erd/graph/column-lineage), the data-dict assembly + base64 injection, the bundled vanilla-JS SPA, and the pytest suite. Scope is `dbdocs/` and `tests/`. +description: Owns the dbdocs package — the click CLI plus the extract/site pipeline that turns dbt artifacts into a single-page-app (SPA) with an external gzip data payload. Use for CLI commands (generate/serve/deploy), the artifact extractors (nodes/erd/graph/column-lineage/health), the data-dict assembly + external payload, the bundled vanilla-JS SPA, and the pytest suite. Scope is `dbdocs/` and `tests/`. tools: Read, Edit, Write, Glob, Grep, Bash model: sonnet memory: project @@ -8,9 +8,10 @@ memory: project You own the `dbdocs/` package. It reads dbt artifacts (`manifest.json` / `catalog.json`) via the `dbterd` Python API, derives one project data dict, and -builds a **self-contained single-page-app** — a single `site/index.html` with all -data base64-injected as `window.dbdocsData`, plus vendored JS assets. dbdocs is -an **alternative dbt docs site = dbt docs + ERD + column-level lineage**. It is a +builds a **single-page-app**: a small `site/index.html` plus an **external** +`site/dbdocs-data.json.gz` that a hand-written vanilla-JS SPA fetches and +decompresses client-side (the data is never inlined into the HTML). dbdocs is an +**alternative dbt docs site = dbt docs + ERD + column-level lineage**. It is a **doc generator**, not a dbt or dbterd reimplementation. There is no mkdocs, mkdocs-material, mike, or Jinja2 templating — those are gone. 
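Review note on the hunk above: the updated description says the build now emits a small `site/index.html` plus an external `site/dbdocs-data.json.gz` that the SPA fetches and decompresses client-side. A minimal sketch of producing and reading back such a payload, assuming plain stdlib `json` and `gzip`. `write_payload` and `read_payload` are hypothetical names used for illustration only; they are not taken from `dbdocs/site/builder`, whose actual serialization details are not shown in this diff.

```python
import gzip
import json
from pathlib import Path


def write_payload(data: dict, site_dir: Path) -> Path:
    """Write the data dict as gzip-compressed JSON next to index.html (hypothetical helper)."""
    target = site_dir / "dbdocs-data.json.gz"
    # Compact separators keep the uncompressed JSON small before gzip runs.
    raw = json.dumps(data, separators=(",", ":")).encode("utf-8")
    target.write_bytes(gzip.compress(raw))
    return target


def read_payload(path: Path) -> dict:
    """Decompress and parse the payload, mirroring what a client would do after fetching it."""
    return json.loads(gzip.decompress(path.read_bytes()).decode("utf-8"))
```

Keeping the payload in its own file means `index.html` stays the same size regardless of project size, which is the trade the description implies by saying the data is never inlined.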
@@ -19,16 +20,22 @@ mkdocs-material, mike, or Jinja2 templating — those are gone. - `dbdocs/cli/main.py` — the click command group and subcommands (`generate`, `serve`, `deploy`). - `dbdocs/extract/` — derive doc data from artifacts: `nodes` (models/sources/ - seeds/snapshots → display records + nav tree), `erd` (Mermaid ERDs via - dbterd), `graph` (the node-level DAG), `column_lineage` + - `_sqlglot_lineage` (column-level lineage via sqlglot). -- `dbdocs/site/` — `builder` (assemble the data dict + write the site), - `inject` (base64 `window.dbdocsData`), `deploy` (hand-rolled versioning), and - the `bundle/` SPA (`index.html` + `assets/app.js` + `assets/style.css` + - `assets/vendor/`). + seeds/snapshots → display records + nav tree), `erd` + `erd_json` (structured + ERD `{nodes, edges}` via a dbterd `json` target adapter — not Mermaid text; the + SPA renders it with React Flow), `graph` (the node-level DAG), `column_lineage` + + `_sqlglot_lineage` (column-level lineage via sqlglot), and the `health/` + sub-package (the always-built Health Check section from `run_results.json`). +- `dbdocs/site/` — `builder` (assemble the one data dict + write the site), + `inject` (`strip_marker` removes the `` placeholder — the + data is external, not inlined), `deploy` (hand-rolled versioning), and the + `bundle/` SPA (`index.html` + `assets/{css,js,vendor,graph}/`; `js/` is the + 3-tier `data → service → ui` ES modules, `graph/` the committed React Flow + bundle). - `dbdocs/core/` — `config` (`DbDocsConfig` from `dbdocs.yml`), `artifacts` (artifact loading), `exceptions`, and the colored `log` singleton. -- `pytest` coverage at 100%. +- `pytest` coverage at 100% (`tests/`), **plus** the Playwright E2E specs at + `frontend/test/e2e/spa.spec.ts` that cover the rendered SPA — extend them + whenever you change bundle behavior (this is your one window outside `dbdocs/`). ## Non-responsibilities @@ -46,6 +53,15 @@ mkdocs-material, mike, or Jinja2 templating — those are gone. 4. 
Run `uv run pytest --cov=dbdocs --cov-report=term-missing`. 5. Ensure coverage is 100%. Add tests before reporting done. Only `# pragma: no cover` lines that are genuinely untestable I/O boundaries, and say so. +6. **If the change touches the rendered SPA** — the bundle (`site/bundle/**`: + `index.html`, `assets/{css,js}/`) or the React Flow graph (`frontend/**`) — + pytest does **not** exercise it; the Playwright E2E suite is the only thing + that does. Run `task frontend:e2e` (Node + a real demo build; one-time + `task frontend:e2e:install` for the browser). The E2E suite is **independent + of the 100% coverage gate** — green pytest coverage says nothing about the + SPA. Add/extend a spec in `frontend/test/e2e/spa.spec.ts` for new + user-visible behavior, and for a graph-source change also + `task frontend:build` to refresh the committed `assets/graph/` bundle. ## Conventions @@ -53,6 +69,9 @@ mkdocs-material, mike, or Jinja2 templating — those are gone. top of the file, one class per file (exception: multiple exception classes may share one file), no nested functions/classes. - Use specific exception types, never bare `except:` or `except Exception`. +- Keep comments sparse and present-tense — add one only when the code isn't + self-evident, describing what it does now, never historically (no "now / no + longer / used to / as before" changelog framing; git holds the history). - Keep presentation in the SPA assets; keep the Python the thin glue that loads artifacts and assembles the data dict. - Follow DRY in tests — share fixtures via `tests/conftest.py`. diff --git a/.github/scripts/check_dpe_rules.py b/.github/scripts/check_dpe_rules.py new file mode 100644 index 0000000..a800b9d --- /dev/null +++ b/.github/scripts/check_dpe_rules.py @@ -0,0 +1,106 @@ +"""Compare the published dbt-project-evaluator rule set against dbdocs's rules. 
+ +Scrapes each DPE rules category page for its rule heading anchors, derives the +anchors dbdocs already implements from the live ``DIMENSION_RULES`` registry (via +the same ``docs_url`` builder the findings use), and prints any DPE rule dbdocs +hasn't implemented yet. The watcher workflow turns a non-empty result into a +``feat:`` issue. + +Output (stdout) is GitHub-Actions friendly: a ``missing<`` anchors (see below). +DPE_CATEGORIES = ("modeling", "testing", "documentation", "structure", "performance", "governance") +_BASE = "https://dbt-labs.github.io/dbt-project-evaluator/latest/rules" + +# mkdocs-material renders each rule as an ``

`` heading (the page +# title is the ``

``, any sub-detail is ``

+``); matching only ``

`` keeps +# non-rule headings out, so a new section heading can't masquerade as a rule. +_RULE_HEADING_ID = re.compile(r']*\bid="([^"]+)"', re.IGNORECASE) + +# A handful of ``

`` slugs that aren't a rule (page-structure headings). +_NON_RULE_ANCHORS = set(DPE_CATEGORIES) | { + "rules", + "overview", + "exceptions", + "customization", +} + + +def _fetch(url: str) -> str: + request = urllib.request.Request(url, headers={"User-Agent": "dbdocs-dpe-watcher"}) + with urllib.request.urlopen(request, timeout=30) as response: # noqa: S310 - fixed HTTPS host + return response.read().decode("utf-8", "replace") + + +def published_anchors(category: str) -> "set[str]": + """The rule heading anchors published on a DPE category page. + + Raises ``ValueError`` if the page yields no rule headings — a structural change + upstream — so the run fails loudly rather than reporting every rule as missing. + """ + html = _fetch(f"{_BASE}/{category}/") + anchors = {a for a in _RULE_HEADING_ID.findall(html) if a not in _NON_RULE_ANCHORS} + if not anchors: + raise ValueError( + f"No rule headings found on the DPE {category} page — page structure changed?" + ) + return anchors + + +def implemented_anchors(category: str) -> "set[str]": + """The DPE anchors dbdocs implements for *category* (via the rules' docs_url).""" + out = set() + for rule in DIMENSION_RULES.get(category, []): + url = docs_url(category, rule.__name__) + out.add(url.rsplit("#", 1)[-1]) + return out + + +def find_missing() -> "dict[str, list[str]]": + """Map each category to the DPE rule anchors dbdocs hasn't implemented yet.""" + missing = {} + for category in DPE_CATEGORIES: + gap = sorted(published_anchors(category) - implemented_anchors(category)) + if gap: + missing[category] = gap + return missing + + +def render(missing: "dict[str, list[str]]") -> str: + lines = [] + for category in DPE_CATEGORIES: + for anchor in missing.get(category, []): + url = f"{_BASE}/{category}/#{anchor}" + lines.append(f"- **{category}** — `{anchor}` ([docs]({url}))") + return "\n".join(lines) + + +def main() -> int: + missing = find_missing() + report = render(missing) + github_output = 
os.environ.get("GITHUB_OUTPUT") + if github_output: + with open(github_output, "a", encoding="utf-8") as handle: + handle.write(f"has_missing={'true' if missing else 'false'}\n") + handle.write(f"missing< N direct model children is flagged # too_many_joins: 7 # >= N upstream dependencies is flagged # chained_view_dependencies: 4 # >= N-deep view/ephemeral chain is flagged +# documentation_coverage: 100 # < N% of models documented is flagged # -# # Disable individual rules by name. The full built-in set: -# # testing: test_coverage, missing_primary_key_tests -# # modeling: direct_join_to_source, duplicate_sources, model_fanout, -# # multiple_sources_joined, rejoining_of_upstream_concepts, +# # Disable individual rules by name. The full built-in set (one-to-one with +# # the dbt-project-evaluator rules): +# # testing: test_coverage, missing_primary_key_tests, +# # missing_source_freshness +# # modeling: direct_join_to_source, +# # downstream_models_dependent_on_source, duplicate_sources, +# # hard_coded_references, model_fanout, multiple_sources_joined, +# # rejoining_of_upstream_concepts, # # root_models, source_fanout, staging_dependent_on_staging, # # staging_dependent_on_marts_or_intermediate, unused_sources, # # too_many_joins -# # documentation: undocumented_models, undocumented_sources, -# # undocumented_source_tables +# # documentation: documentation_coverage, undocumented_models, +# # undocumented_sources, undocumented_source_tables # # structure: model_naming_conventions, model_directories, -# # source_directories +# # source_directories, test_directories # # performance: chained_view_dependencies, exposure_parents_materializations # # governance: public_models_without_contracts, undocumented_public_models, # # exposures_dependent_on_private_models diff --git a/dbdocs/extract/health/dimensions.py b/dbdocs/extract/health/dimensions.py index f80bca6..0f90b5d 100644 --- a/dbdocs/extract/health/dimensions.py +++ b/dbdocs/extract/health/dimensions.py @@ -50,6 
+50,13 @@ def __init__(self, manifest: "Any | None", thresholds: "dict | None" = None) -> self.models = [n for uid, n in self._nodes.items() if uid.startswith("model.")] self.sources = list(self._sources.values()) self.exposures = list(self._exposures.values()) + # Singular tests are custom-SQL test nodes (no test_metadata); generic + # tests (unique/not_null/…) carry test_metadata and are excluded. + self.singular_tests = [ + n + for uid, n in self._nodes.items() + if uid.startswith("test.") and getattr(n, "test_metadata", None) is None + ] # Rule thresholds: per-run overrides layered over the DPE defaults. self._thresholds = {**DEFAULT_THRESHOLDS, **(thresholds or {})} @@ -147,6 +154,21 @@ def access(model: Any) -> str: access = access or getattr(model, "access", None) return str(access or "protected").lower() + @staticmethod + def has_source_freshness(source: Any) -> bool: + """Whether a source has a freshness check: a ``loaded_at_field`` plus a + ``warn_after``/``error_after`` threshold count.""" + if not str(getattr(source, "loaded_at_field", "") or "").strip(): + return False + freshness = getattr(source, "freshness", None) + if freshness is None: + return False + for bound in ("warn_after", "error_after"): + period = getattr(freshness, bound, None) + if period is not None and getattr(period, "count", None) is not None: + return True + return False + @staticmethod def contract_enforced(model: Any) -> bool: """Whether the model has an enforced contract (``contract.enforced``).""" diff --git a/dbdocs/extract/health/rules/__init__.py b/dbdocs/extract/health/rules/__init__.py index 283ad51..370cd03 100644 --- a/dbdocs/extract/health/rules/__init__.py +++ b/dbdocs/extract/health/rules/__init__.py @@ -24,6 +24,7 @@ from dbdocs.extract.health.rules.base import DEFAULT_THRESHOLDS, NON_PHYSICAL, docs_url, finding from dbdocs.extract.health.rules.dimensions.documentation import ( + documentation_coverage, undocumented_models, undocumented_source_tables, 
undocumented_sources, @@ -35,7 +36,9 @@ ) from dbdocs.extract.health.rules.dimensions.modeling import ( direct_join_to_source, + downstream_models_dependent_on_source, duplicate_sources, + hard_coded_references, model_fanout, multiple_sources_joined, rejoining_of_upstream_concepts, @@ -54,8 +57,13 @@ model_directories, model_naming_conventions, source_directories, + test_directories, +) +from dbdocs.extract.health.rules.dimensions.testing import ( + missing_primary_key_tests, + missing_source_freshness, + test_coverage, ) -from dbdocs.extract.health.rules.dimensions.testing import missing_primary_key_tests, test_coverage from dbdocs.extract.health.rules.registry import ( DIMENSION_RULES, ENTRY_POINT_GROUP, @@ -81,8 +89,11 @@ # rule functions (re-exported for direct access / testing) "test_coverage", "missing_primary_key_tests", + "missing_source_freshness", "direct_join_to_source", + "downstream_models_dependent_on_source", "duplicate_sources", + "hard_coded_references", "model_fanout", "multiple_sources_joined", "rejoining_of_upstream_concepts", @@ -92,12 +103,14 @@ "staging_dependent_on_marts_or_intermediate", "unused_sources", "too_many_joins", + "documentation_coverage", "undocumented_models", "undocumented_sources", "undocumented_source_tables", "model_naming_conventions", "model_directories", "source_directories", + "test_directories", "chained_view_dependencies", "exposure_parents_materializations", "public_models_without_contracts", diff --git a/dbdocs/extract/health/rules/base.py b/dbdocs/extract/health/rules/base.py index 2e70db3..b40c360 100644 --- a/dbdocs/extract/health/rules/base.py +++ b/dbdocs/extract/health/rules/base.py @@ -14,6 +14,16 @@ "model_fanout": 3, "too_many_joins": 7, "chained_view_dependencies": 4, + "documentation_coverage": 100, +} + +# DPE anchors that don't equal the rule name kebab-cased. 
The auto-derived anchor +# (``rule.replace("_", "-")``) is the DPE heading for most rules; these few have a +# different published heading, so map the rule name to its real anchor explicitly. +_RULE_ANCHORS = { + "too_many_joins": "models-with-too-many-joins", + "staging_dependent_on_marts_or_intermediate": "staging-models-dependent-on-downstream-models", + "staging_dependent_on_staging": "staging-models-dependent-on-other-staging-models", } # Materializations that are *not* a physical table (for chained-view detection). @@ -28,8 +38,13 @@ def docs_url(category: str, rule: str) -> str: - """The DPE docs URL for a rule under *category* (anchor = the rule name).""" - return f"{_RULES_BASE}/{category}/#{rule.replace('_', '-')}" + """The DPE docs URL for a rule under *category*. + + The anchor is the rule name kebab-cased, except for the few rules whose DPE + heading differs (see ``_RULE_ANCHORS``). + """ + anchor = _RULE_ANCHORS.get(rule, rule.replace("_", "-")) + return f"{_RULES_BASE}/{category}/#{anchor}" def finding(rule: str, category: str, node: str, node_type: str, message: str) -> dict: diff --git a/dbdocs/extract/health/rules/dimensions/documentation.py b/dbdocs/extract/health/rules/dimensions/documentation.py index ff06c10..e389de1 100644 --- a/dbdocs/extract/health/rules/dimensions/documentation.py +++ b/dbdocs/extract/health/rules/dimensions/documentation.py @@ -54,6 +54,36 @@ def undocumented_sources(graph: "ManifestGraph") -> "list[dict]": return out +def documentation_coverage(graph: "ManifestGraph") -> "list[dict]": + """One project-level finding when the share of documented models is below target. + + Mirrors DPE's ``fct_documentation_coverage`` — an aggregate percentage rather + than a per-model flag. The ``documentation_coverage`` threshold (default 100) + is the minimum acceptable percentage; coverage at or above it is clean. 
+ + Being an aggregate, its finding's ``node`` is the pseudo-id ``"project"`` (not a + real ``unique_id``); the SPA renders an unresolvable node as plain text rather + than a node-page link, so this degrades gracefully. + """ + models = graph.models + if not models: + return [] + documented = sum(1 for m in models if str(getattr(m, "description", "") or "").strip()) + coverage = round(documented * 100 / len(models), 1) + if coverage >= graph.threshold("documentation_coverage"): + return [] + return [ + finding( + "documentation_coverage", + "documentation", + "project", + "project", + f"Only {coverage}% of models are documented " + f"({documented}/{len(models)}) — below the configured target.", + ) + ] + + def undocumented_source_tables(graph: "ManifestGraph") -> "list[dict]": """Source tables (the individual relations) with no description.""" out = [] diff --git a/dbdocs/extract/health/rules/dimensions/modeling.py b/dbdocs/extract/health/rules/dimensions/modeling.py index 9731f08..03b1e4b 100644 --- a/dbdocs/extract/health/rules/dimensions/modeling.py +++ b/dbdocs/extract/health/rules/dimensions/modeling.py @@ -4,14 +4,29 @@ for the shared finding shape and the registry that assembles these. """ +import re from typing import Any +import sqlglot +from sqlglot import exp +from sqlglot.errors import SqlglotError +from sqlglot.optimizer.scope import build_scope + from dbdocs.extract.health.rules.base import finding # ``ManifestGraph`` annotations are strings (forward ref) to avoid a circular # import (rules ← graph ← analyzer); resolved lazily by the type checker only. ManifestGraph = Any +# A ``{{ ref(...) }}`` / ``{{ source(...) }}`` call is rewritten to a uniquely +# numbered sentinel table before parsing (``__dbt_ref_0__``, ``__dbt_ref_1__``, …) +# — unique so sqlglot's scope analysis doesn't choke on a repeated alias — so any +# *other* real table left in the SQL is a hard-coded relation. Any remaining +# ``{{ ... 
}}`` macro is blanked so it doesn't masquerade as a table. +_REF_SENTINEL_PREFIX = "__dbt_ref_" +_REF_OR_SOURCE = re.compile(r"{{[-\s]*(?:ref|source)\s*\(.*?\)[-\s]*}}", re.IGNORECASE | re.DOTALL) +_OTHER_JINJA = re.compile(r"{[{%].*?[%}]}", re.DOTALL) + def direct_join_to_source(graph: "ManifestGraph") -> "list[dict]": """Models that ref a model AND a source in the same query (missing staging).""" @@ -121,6 +136,87 @@ def rejoining_of_upstream_concepts(graph: "ManifestGraph") -> "list[dict]": return out +def downstream_models_dependent_on_source(graph: "ManifestGraph") -> "list[dict]": + """Non-staging models that ref a source directly (only staging should touch sources).""" + out = [] + for model in graph.models: + if graph.layer(model) == "staging": + continue + if any(p.startswith("source.") for p in graph.parents(model.unique_id)): + out.append( + finding( + "downstream_models_dependent_on_source", + "modeling", + model.unique_id, + "model", + "A non-staging model selects from a source — route it through a staging model.", + ) + ) + return out + + +def _number_refs(raw_code: str) -> str: + """Replace each ref()/source() jinja call with a uniquely numbered sentinel.""" + out = [] + last = 0 + for index, match in enumerate(_REF_OR_SOURCE.finditer(raw_code)): + out.append(raw_code[last : match.start()]) + out.append(f"{_REF_SENTINEL_PREFIX}{index}__") + last = match.end() + out.append(raw_code[last:]) + return "".join(out) + + +def _hard_coded_relations(raw_code: str) -> "list[str]": + """Real table names a model's raw SQL selects from outside ref()/source(). + + dbt references are jinja (``{{ ref(...) }}`` / ``{{ source(...) }}``), so they + are rewritten to uniquely numbered sentinel tables before parsing (so sqlglot's + scope analysis doesn't choke on a repeated alias); any other real table its + scope analysis finds is a hard-coded relation. CTEs are excluded by walking + ``scope.selected_sources`` rather than every ``exp.Table``. 
Unparseable SQL + yields nothing (fail-soft — one model never sinks the pass). + """ + sql = _number_refs(raw_code) + sql = _OTHER_JINJA.sub(" ", sql) + if not sql.strip(): + return [] + try: + # parse_one returns a node or raises for non-empty input (empty is guarded + # above); build_scope returns None for a non-query statement (DDL/SET). + root = build_scope(sqlglot.parse_one(sql)) + if root is None: + return [] + relations = [ + source.sql() + for scope in root.traverse() + for _alias, (_node, source) in scope.selected_sources.items() + if isinstance(source, exp.Table) and not source.name.startswith(_REF_SENTINEL_PREFIX) + ] + except SqlglotError: # OptimizeError (e.g. duplicate alias) is a SqlglotError subclass. + return [] + return relations + + +def hard_coded_references(graph: "ManifestGraph") -> "list[dict]": + """Models whose SQL selects from a literal relation instead of ref()/source().""" + out = [] + for model in graph.models: + relations = _hard_coded_relations(str(getattr(model, "raw_code", "") or "")) + if relations: + shown = ", ".join(sorted(set(relations))) + out.append( + finding( + "hard_coded_references", + "modeling", + model.unique_id, + "model", + f"Hard-coded relation(s) {shown} — replace with ref()/source().", + ) + ) + return out + + def root_models(graph: "ManifestGraph") -> "list[dict]": """Models with zero parents (no ref/source) — untraceable lineage.""" out = [] diff --git a/dbdocs/extract/health/rules/dimensions/structure.py b/dbdocs/extract/health/rules/dimensions/structure.py index 271f804..99012a0 100644 --- a/dbdocs/extract/health/rules/dimensions/structure.py +++ b/dbdocs/extract/health/rules/dimensions/structure.py @@ -57,6 +57,24 @@ def model_directories(graph: "ManifestGraph") -> "list[dict]": return out +def test_directories(graph: "ManifestGraph") -> "list[dict]": + """Singular (custom SQL) tests not stored under a ``tests/`` directory.""" + out = [] + for test in graph.singular_tests: + path = str(getattr(test, 
"original_file_path", "") or getattr(test, "path", "") or "") + if "tests" not in path.split("/"): + out.append( + finding( + "test_directories", + "structure", + test.unique_id, + "test", + "Singular test is not under a 'tests/' directory.", + ) + ) + return out + + def source_directories(graph: "ManifestGraph") -> "list[dict]": """Source YAML not under a staging/ directory.""" out = [] diff --git a/dbdocs/extract/health/rules/dimensions/testing.py b/dbdocs/extract/health/rules/dimensions/testing.py index cbad2fe..2842e8d 100644 --- a/dbdocs/extract/health/rules/dimensions/testing.py +++ b/dbdocs/extract/health/rules/dimensions/testing.py @@ -49,3 +49,22 @@ def missing_primary_key_tests(graph: "ManifestGraph") -> "list[dict]": ) ) return out + + +def missing_source_freshness(graph: "ManifestGraph") -> "list[dict]": + """Sources with no freshness check (no ``loaded_at_field`` + warn/error threshold).""" + out = [] + for src in graph.sources: + if not graph.has_source_freshness(src): + name = str(getattr(src, "source_name", "") or "") or src.unique_id + table = str(getattr(src, "name", "") or "") + out.append( + finding( + "missing_source_freshness", + "testing", + src.unique_id, + "source", + f"Source '{name}.{table}' has no freshness check configured.", + ) + ) + return out diff --git a/dbdocs/extract/health/rules/registry.py b/dbdocs/extract/health/rules/registry.py index f59e1a6..00f6bc3 100644 --- a/dbdocs/extract/health/rules/registry.py +++ b/dbdocs/extract/health/rules/registry.py @@ -12,6 +12,7 @@ from dbdocs.core.log import logger from dbdocs.extract.health.rules.dimensions.documentation import ( + documentation_coverage, undocumented_models, undocumented_source_tables, undocumented_sources, @@ -23,7 +24,9 @@ ) from dbdocs.extract.health.rules.dimensions.modeling import ( direct_join_to_source, + downstream_models_dependent_on_source, duplicate_sources, + hard_coded_references, model_fanout, multiple_sources_joined, rejoining_of_upstream_concepts, @@ -42,8 
+45,13 @@ model_directories, model_naming_conventions, source_directories, + test_directories, +) +from dbdocs.extract.health.rules.dimensions.testing import ( + missing_primary_key_tests, + missing_source_freshness, + test_coverage, ) -from dbdocs.extract.health.rules.dimensions.testing import missing_primary_key_tests, test_coverage # The built-in rules grouped by dimension, in DPE's published order. These are the # *built-ins*. ``DIMENSION_RULES`` (below) starts as a deep copy and is what @@ -53,10 +61,13 @@ "testing": ( test_coverage, missing_primary_key_tests, + missing_source_freshness, ), "modeling": ( direct_join_to_source, + downstream_models_dependent_on_source, duplicate_sources, + hard_coded_references, model_fanout, multiple_sources_joined, rejoining_of_upstream_concepts, @@ -68,6 +79,7 @@ too_many_joins, ), "documentation": ( + documentation_coverage, undocumented_models, undocumented_sources, undocumented_source_tables, @@ -76,6 +88,7 @@ model_naming_conventions, model_directories, source_directories, + test_directories, ), "performance": ( chained_view_dependencies, diff --git a/dbdocs/site/bundle/assets/css/style.css b/dbdocs/site/bundle/assets/css/style.css index ee12fff..d1a45a6 100644 --- a/dbdocs/site/bundle/assets/css/style.css +++ b/dbdocs/site/bundle/assets/css/style.css @@ -84,7 +84,26 @@ code, pre { font-family: "SF Mono", "Roboto Mono", Menlo, Consolas, monospace; } display: block; padding: 8px 12px; color: var(--text); border-bottom: 1px solid var(--border); } .search-results a:hover, .search-results a.active { background: var(--accent-soft); text-decoration: none; } +.search-results .sr-empty { padding: 8px 12px; color: var(--text-soft); font-size: 13px; } +.search-results .sr-title { display: block; } .search-results .sr-meta { color: var(--text-soft); font-size: 12px; } +.search-results .sr-snippet { + display: flex; align-items: baseline; gap: 6px; margin-top: 3px; + font-size: 12px; color: var(--text-soft); min-width: 0; +} 
+.search-results .sr-snippet-field { + flex-shrink: 0; font-size: 10px; text-transform: uppercase; letter-spacing: 0.03em; + color: var(--text-soft); border: 1px solid var(--border); border-radius: 4px; padding: 0 4px; +} +.search-results .sr-snippet-text { + overflow: hidden; text-overflow: ellipsis; white-space: nowrap; font-family: monospace; +} +.search-results .sr-snippet-text mark { background: var(--accent-soft); color: var(--text); border-radius: 2px; padding: 0 1px; } +/* Off-screen but readable by assistive tech (the #search-status live region). */ +.visually-hidden { + position: absolute; width: 1px; height: 1px; margin: -1px; padding: 0; + overflow: hidden; clip: rect(0 0 0 0); clip-path: inset(50%); border: 0; white-space: nowrap; +} .topbar-actions { display: flex; align-items: center; gap: 10px; margin-left: auto; } .icon-btn { width: 34px; height: 34px; display: inline-flex; align-items: center; justify-content: center; diff --git a/dbdocs/site/bundle/assets/js/service.js b/dbdocs/site/bundle/assets/js/service.js index 413014e..beda94f 100644 --- a/dbdocs/site/bundle/assets/js/service.js +++ b/dbdocs/site/bundle/assets/js/service.js @@ -6,8 +6,11 @@ var DATA = { metadata: {}, nodes: {}, lineage: {}, columnLineage: {}, erd: {}, tree: { byDatabase: {} }, readme: "", health: { enabled: false } }; var DOWNSTREAM = null; +// Cache of the per-node indexed search text, keyed by id, populated by +// searchDocs() and read by searchSnippet() (both pure, DOM-free). Reset on init(). 
+var SEARCH_TEXT = null; -export function init(data) { DATA = data; DOWNSTREAM = null; } +export function init(data) { DATA = data; DOWNSTREAM = null; SEARCH_TEXT = null; } export function meta() { return DATA.metadata; } export function nodes() { return DATA.nodes; } @@ -18,6 +21,171 @@ export function counts() { return DATA.metadata.counts || {}; } export function shortName(id) { return String(id).split(".").pop(); } +/* Flatten possibly-HTML-bearing text to plain words for the search index: drop + tags, decode the entities mdInline emits, collapse whitespace. A node's own + `description` is raw markdown (rendered through mdInline at display time), while + a column's `description` is pre-escaped HTML carrying
; this handles both. */ +function stripHtml(text) { + return String(text == null ? "" : text) + .replace(/<[^>]*>/g, " ") + .replace(/&amp;/g, "&").replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&quot;/g, '"') + .replace(/\s+/g, " ") + .trim(); +} + +/* Above this node count we drop the SQL body (raw + compiled) from the search + index. Indexing every model's SQL doubles that corpus into MiniSearch's + in-memory inverted index; on a multi-thousand-model project that's tens of MB + resident for a marginal-value field. Name/column/tag/macro coverage stays; + only the full-SQL haystack is traded away at scale. */ +var SQL_INDEX_NODE_CAP = 2000; + +/* The indexed fields, ordered most→least human-relevant. The snippet builder + walks this order to pick which matched field to excerpt, and ui mirrors it for + the MiniSearch `fields` list. `label` is the result title (never excerpted as a + snippet); the rest are the searchable surface: the human-facing text (name, + description, tags) plus the structural surface a reader searches by: column + names + descriptions, the warehouse relation, the package, the macros a model + calls, and the model SQL (raw + compiled). Each excerptable entry carries a + display label (what ui shows above a snippet, mkdocs-material style) so the + dropdown can say *why* a result matched; label/name are the title itself and + are never excerpted, so their display label is null. */ +export var SEARCH_FIELDS = [ + { key: "label", label: null }, + { key: "name", label: null }, + { key: "columns", label: "Column" }, + { key: "columnDescriptions", label: "Column docs" }, + { key: "tags", label: "Tag" }, + { key: "description", label: "Description" }, + { key: "relation", label: "Relation" }, + { key: "package", label: "Package" }, + { key: "macros", label: "Macro" }, + { key: "code", label: "SQL" }, +]; + +/* The documents fed to the full-text index (one per node). DOM-free: ui builds + the MiniSearch instance from these. 
Indexes the fields in SEARCH_FIELDS; the + SQL body (`code`) is dropped on large projects (see SQL_INDEX_NODE_CAP) to keep + the index small. storeFields (label/resource_type/schema) are what the dropdown + renders. The raw per-field text is cached in SEARCH_TEXT so searchSnippet() can + excerpt the matched field without a second pass. */ +export function searchDocs() { + var N = DATA.nodes; + var ids = Object.keys(N); + var indexCode = ids.length <= SQL_INDEX_NODE_CAP; + SEARCH_TEXT = {}; + return ids.map(function (id) { + var n = N[id]; + var cols = n.columns || []; + var doc = { + id: id, + label: n.label, + name: n.name, + resource_type: n.resource_type, + schema: n.schema, + description: stripHtml(n.description), + columns: cols.map(function (c) { return c.name; }).join(" "), + columnDescriptions: cols.map(function (c) { return stripHtml(c.description); }).join(" "), + tags: (n.tags || []).join(" "), + relation: n.relation_name || "", + package: n.package || "", + macros: (n.macros || []).map(function (m) { return m.name; }).join(" "), + code: indexCode ? stripHtml((n.raw_code || "") + " " + (n.compiled_code || "")) : "", + }; + SEARCH_TEXT[id] = doc; + return doc; + }); +} + +/* The MiniSearch `match` object maps each matched document-term to the list of + fields it hit. Collapse that to the single most-relevant matched field by + walking SEARCH_FIELDS order (skipping `label`/`name` — the title already shows + that). Returns the SEARCH_FIELDS entry ({ key, label }), or null when only + `label`/`name` matched and the title already explains the hit. 
*/ +function topMatchedField(match) { + var hitFields = {}; + Object.keys(match || {}).forEach(function (term) { + (match[term] || []).forEach(function (f) { hitFields[f] = true; }); + }); + for (var i = 0; i < SEARCH_FIELDS.length; i++) { + var f = SEARCH_FIELDS[i]; + if (f.key === "label" || f.key === "name") continue; + if (hitFields[f.key]) return f; + } + return null; +} + +/* A windowed excerpt of `text` centered on the first occurrence of any matched + term, with ~`pad` chars of context either side and ellipses where clipped. + Case-insensitive find; returns the head of the text when no term is located + (e.g. a fuzzy/prefix hit whose surface term differs). Pure string work. */ +function excerpt(text, terms, pad) { + var hay = String(text || ""); + if (!hay) return ""; + var lower = hay.toLowerCase(); + var at = -1; + for (var i = 0; i < terms.length; i++) { + var p = lower.indexOf(String(terms[i]).toLowerCase()); + if (p !== -1 && (at === -1 || p < at)) at = p; + } + if (at === -1) return hay.length > pad * 2 ? hay.slice(0, pad * 2) + "…" : hay; + var start = Math.max(0, at - pad); + var end = Math.min(hay.length, at + pad); + return (start > 0 ? "…" : "") + hay.slice(start, end).trim() + (end < hay.length ? "…" : ""); +} + +/* mkdocs-material-style match context for a search hit: which field matched and + a short excerpt of it around the query term(s), so the dropdown explains *why* + a result is there (e.g. searching "unique" surfaces the SQL line it lives on). + `terms` are the matched document terms (hit.terms from MiniSearch). Returns + { field, text } or null when only the title matched. DOM-free — ui highlights + the terms and renders. 
*/ +export function searchSnippet(id, match, terms) { + var field = topMatchedField(match); + if (!field) return null; + var doc = (SEARCH_TEXT && SEARCH_TEXT[id]) || {}; + var text = excerpt(doc[field.key], terms || [], 40); + if (!text) return null; + return { field: field.label, text: text }; +} + +/* Inline search operators (mkdocs-material style) the user types into the box: + type: restrict hits to a resource_type (model/source/…) + label: / name: match only against the name/label fields, + skipping the SQL/description/column noise + Operators combine with the free-text remainder: `type:model orders` finds + models whose text matches "orders"; `label:stg` matches names containing + "stg". A bare query (no operators) searches everything. */ +var SEARCH_OPERATOR = /(\w+):(\S+)/g; +var NAME_FIELDS = ["label", "name"]; + +/* Parse a raw query into MiniSearch inputs. Returns: + { text } the free-text remainder to search (operators stripped) + { fields } restrict matching to these fields, or null for all + { filterType } restrict hits to this resource_type, or null + { operators } the recognized operators (for ui to echo as chips) + DOM-free — ui owns running mini.search() and rendering. */ +export function parseSearchQuery(raw) { + var fields = null; + var filterType = null; + var operators = []; + var text = String(raw || "").replace(SEARCH_OPERATOR, function (m, key, val) { + var k = key.toLowerCase(); + if (k === "type" || k === "resource_type") { + filterType = val.toLowerCase(); + operators.push({ key: "type", value: filterType }); + return " "; + } + if (k === "label" || k === "name") { + fields = NAME_FIELDS; + operators.push({ key: "label", value: val }); + return " " + val + " "; // the operator's value is still the text to match + } + return m; // unrecognized prefix (e.g. 
a "schema:foo" the index doesn't carry) — leave it as a literal term + }).replace(/\s+/g, " ").trim(); + return { text: text, fields: fields, filterType: filterType, operators: operators }; +} + /* The searchable text for a node in the sidebar tree filter: its table name and its fully-qualified database.schema.table, lowercased. Lets the tree filter match either "orders" or "shaman.jf.orders". DOM-free (ui does the show/hide). */ diff --git a/dbdocs/site/bundle/assets/js/ui.js b/dbdocs/site/bundle/assets/js/ui.js index 1aaa7bd..7b56092 100644 --- a/dbdocs/site/bundle/assets/js/ui.js +++ b/dbdocs/site/bundle/assets/js/ui.js @@ -651,37 +651,157 @@ function toggleFullscreen(host) { } } +var SEARCH_RESULT_CAP = 12; + function buildSearch() { var input = document.getElementById("search"); var results = document.getElementById("search-results"); + var status = document.getElementById("search-status"); if (typeof MiniSearch === "undefined") return; - var NODES = svc.nodes(); - var docs = Object.keys(NODES).map(function (id) { - var n = NODES[id]; - return { id: id, label: n.label, resource_type: n.resource_type, schema: n.schema, - description: n.description, columns: (n.columns || []).map(function (c) { return c.name; }).join(" ") }; + var docs = svc.searchDocs(); + var mini = new MiniSearch({ + fields: svc.SEARCH_FIELDS.map(function (f) { return f.key; }), + storeFields: ["label", "resource_type", "schema"], + // Names/columns/tags expand fuzzily + by prefix so a half-typed model name + // still hits. The bulk text fields (description, column docs, SQL) match only + // on whole terms and at a low weight — otherwise a generic SQL keyword like + // "unique" fuzzy-floods nearly every model. boost ranks name/column hits well + // above an incidental SQL-body hit so the title results stay on top. 
+ searchOptions: { + prefix: true, + fuzzy: 0.2, + boost: { label: 6, name: 6, columns: 3, tags: 2, relation: 2, description: 1, columnDescriptions: 1, code: 0.4 }, + }, }); - var mini = new MiniSearch({ fields: ["label", "description", "columns"], storeFields: ["label", "resource_type", "schema"], searchOptions: { prefix: true, fuzzy: 0.2, boost: { label: 3 } } }); mini.addAll(docs); + // Build the matched-field snippet (mkdocs-material style): a label for the + // field that matched + the excerpt, with the matched terms wrapped in <mark>. + // Highlighting is done by splitting on the matched terms and building text + + // nodes (no innerHTML) so user-derived text can never inject markup. + function highlightNodes(text, terms) { + var safe = (terms || []).map(function (t) { return String(t).replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); }).filter(Boolean); + if (!safe.length) return [text]; + var re = new RegExp("(" + safe.join("|") + ")", "ig"); + return String(text).split(re).map(function (piece, i) { + // split() with a capturing group puts the matched delimiters at odd indices. + return i % 2 === 1 ? el("mark", null, [piece]) : piece; + }); + } + + function snippetNode(h) { + var snip = svc.searchSnippet(h.id, h.match, h.terms); + if (!snip) return null; + return el("span", { class: "sr-snippet" }, [ + el("span", { class: "sr-snippet-field" }, [snip.field]), + el("span", { class: "sr-snippet-text" }, highlightNodes(snip.text, h.terms)), + ]); + } + + // Roving active-descendant state for keyboard nav: index into the rendered + // option rows, or -1 for "input itself focused, nothing selected". + var activeIndex = -1; + + function setExpanded(open) { + results.hidden = !open; + input.setAttribute("aria-expanded", open ?
"true" : "false"); + if (!open) { + activeIndex = -1; + input.removeAttribute("aria-activedescendant"); + status.textContent = ""; + } + } + + function optionRows() { return results.querySelectorAll('[role="option"]'); } + + // Move the roving selection by `delta` (wrapping), reflect it as the .active + // class + aria-selected, and point the input's aria-activedescendant at it so + // a screen reader announces the row without moving DOM focus off the input. + function moveActive(delta) { + var rows = optionRows(); + if (!rows.length) return; + if (activeIndex >= 0 && rows[activeIndex]) rows[activeIndex].setAttribute("aria-selected", "false"); + activeIndex = (activeIndex + delta + rows.length) % rows.length; + rows.forEach(function (r, i) { r.classList.toggle("active", i === activeIndex); }); + var active = rows[activeIndex]; + active.setAttribute("aria-selected", "true"); + input.setAttribute("aria-activedescendant", active.id); + active.scrollIntoView({ block: "nearest" }); + } + function render(hits) { clear(results); - if (!hits.length) { results.hidden = true; return; } - hits.slice(0, 12).forEach(function (h) { - results.appendChild(el("a", { href: "#/node/" + encodeURIComponent(h.id), onclick: function () { results.hidden = true; input.value = ""; setSearchOpen(false); } }, [ - el("span", { class: "dot " + h.resource_type }), " " + h.label, - el("span", { class: "sr-meta" }, [" " + h.resource_type + " · " + h.schema]), - ])); + activeIndex = -1; + input.removeAttribute("aria-activedescendant"); + if (!hits.length) { + // A non-empty query with no hits gets an explicit cue, mirrored into the + // #search-status live region for screen readers. + var noMatches = "No matches."; + results.appendChild(el("div", { class: "sr-empty" }, [noMatches])); + status.textContent = noMatches; + setExpanded(true); + return; + } + var shown = Math.min(hits.length, SEARCH_RESULT_CAP); + status.textContent = shown === 1 ? "1 result." 
: shown + " results."; + hits.slice(0, SEARCH_RESULT_CAP).forEach(function (h, i) { + var children = [ + el("span", { class: "sr-title" }, [ + el("span", { class: "dot " + h.resource_type }), " " + h.label, + el("span", { class: "sr-meta" }, [" " + h.resource_type + " · " + h.schema]), + ]), + ]; + var snip = snippetNode(h); + if (snip) children.push(snip); + results.appendChild(el("a", { + id: "search-opt-" + i, + role: "option", + "aria-selected": "false", + href: "#/node/" + encodeURIComponent(h.id), + onclick: function () { closeSearch(); }, + }, children)); }); - results.hidden = false; + setExpanded(true); + } + + function closeSearch() { setExpanded(false); input.value = ""; setSearchOpen(false); } + // Map service's parsed operators to MiniSearch's field restriction + a + // resource_type filter. A bare `type:` (no free text) scans for the first + // SEARCH_RESULT_CAP nodes of that type — O(cap), not O(N) on a huge project. + function runQuery(raw) { + var p = svc.parseSearchQuery(raw); + if (!p.text) { + if (!p.filterType) return []; + var picked = []; + for (var i = 0; i < docs.length && picked.length < SEARCH_RESULT_CAP; i++) { + if (docs[i].resource_type === p.filterType) picked.push(docs[i]); + } + return picked; + } + var opts = {}; + if (p.fields) opts.fields = p.fields; + if (p.filterType) opts.filter = function (h) { return h.resource_type === p.filterType; }; + return mini.search(p.text, opts); } input.addEventListener("input", function () { var q = input.value.trim(); - if (!q) { results.hidden = true; return; } - render(mini.search(q)); + if (!q) { setExpanded(false); return; } + render(runQuery(q)); + }); + input.addEventListener("focus", function () { if (input.value.trim()) render(runQuery(input.value.trim())); }); + // Keyboard nav: ↑/↓ rove the options, Enter follows the active one (or the + // first when none is roved), Escape closes. Other keys fall through to typing. 
+ input.addEventListener("keydown", function (e) { + if (results.hidden) return; + if (e.key === "ArrowDown") { e.preventDefault(); moveActive(1); } + else if (e.key === "ArrowUp") { e.preventDefault(); moveActive(-1); } + else if (e.key === "Enter") { + var rows = optionRows(); + var target = rows[activeIndex >= 0 ? activeIndex : 0]; + if (target) { e.preventDefault(); location.hash = target.getAttribute("href"); closeSearch(); } + } else if (e.key === "Escape") { setExpanded(false); } }); - input.addEventListener("focus", function () { if (input.value.trim()) render(mini.search(input.value.trim())); }); - document.addEventListener("click", function (e) { if (!results.contains(e.target) && e.target !== input) results.hidden = true; }); + document.addEventListener("click", function (e) { if (!results.contains(e.target) && e.target !== input) setExpanded(false); }); } function initTheme() { diff --git a/dbdocs/site/bundle/index.html b/dbdocs/site/bundle/index.html index 3aea69a..576abe5 100644 --- a/dbdocs/site/bundle/index.html +++ b/dbdocs/site/bundle/index.html @@ -21,8 +21,9 @@
- - + + +