From 9f22659e1aacfa9e86a3885ff546436d1d5cda68 Mon Sep 17 00:00:00 2001 From: Hopefan Date: Thu, 16 Apr 2026 22:46:37 +1000 Subject: [PATCH] feat: enhance metadata-enrich skill with data profiling and glossary linking - Add Step 3b: profile columns via PostgreSQL MCP (null rates, value ranges, distinct counts, top categorical values) to ground descriptions in real data - Add Step 3c: extract dbt test context (not_null, unique, accepted_values, relationships) to infer semantic role of each column - Add Step 3d: match column names against OpenMetadata glossary terms and suggest links in the review table - Apply glossary term links via patch_entity alongside description updates - Update review table to include Glossary column - Update Step 5 report to show glossary links applied - Update .mcp.json to read JWT token from OPENMETADATA_JWT_TOKEN env variable Co-Authored-By: Claude Sonnet 4.6 --- .claude/skills/metadata-enrich/SKILL.md | 118 ++++++++++++++++++++---- .mcp.json | 2 +- 2 files changed, 100 insertions(+), 20 deletions(-) diff --git a/.claude/skills/metadata-enrich/SKILL.md b/.claude/skills/metadata-enrich/SKILL.md index 0b225a5..495defa 100644 --- a/.claude/skills/metadata-enrich/SKILL.md +++ b/.claude/skills/metadata-enrich/SKILL.md @@ -1,6 +1,6 @@ --- name: metadata-enrich -description: Audit OpenMetadata for missing or drifted descriptions across all dbt layers, generate AI descriptions, write confirmed descriptions back to dbt YAML (source of truth), then sync to OpenMetadata via patch_entity. Triggers include "enrich metadata", "missing descriptions", "which tables have no description", "fill metadata", "generate descriptions", "update catalog", "sync descriptions". 
+description: Audit OpenMetadata for missing or drifted descriptions across all dbt layers, generate AI descriptions grounded in real data profiles and dbt tests, match columns to glossary terms, write confirmed descriptions back to dbt YAML (source of truth), then sync to OpenMetadata via patch_entity. Triggers include "enrich metadata", "missing descriptions", "which tables have no description", "fill metadata", "generate descriptions", "update catalog", "sync descriptions". --- # Metadata Enrichment @@ -83,21 +83,75 @@ Stop and wait for user input. For the chosen table(s): -**Gather context:** +**3a. Gather static context:** - Identify which dbt YAML file contains it -- Read the YAML — use existing descriptions as the base +- Read the YAML — use existing descriptions as the base, and extract dbt tests per column - Read the SQL file if it exists (staging, intermediate, marts — not raw sources) -- For drift columns: show current dbt description vs current OpenMetadata value +- For drift columns: note current dbt description vs current OpenMetadata value + +**3b. Profile the data via PostgreSQL MCP** — skip for raw source tables that are not materialised as dbt models. + +Run the following queries against the actual table. 
Use the model name as the table name (models are materialised in the `marketing` schema): + +```sql +-- Row count and per-column null rate + distinct count +SELECT + COUNT(*) AS total_rows, + COUNT({col}) AS non_null_count, + ROUND(100.0 * COUNT({col}) / NULLIF(COUNT(*), 0), 1) AS non_null_pct, + COUNT(DISTINCT {col}) AS distinct_count +FROM marketing.{table}; + +-- For numeric columns: range and zero split +SELECT + MIN({col}) AS min_val, + MAX({col}) AS max_val, + ROUND(AVG({col})::numeric, 2) AS avg_val, + SUM(CASE WHEN {col} = 0 THEN 1 ELSE 0 END) AS zero_count, + SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS null_count +FROM marketing.{table}; + +-- For categorical / low-cardinality columns (distinct_count <= 20): top values +SELECT {col}, COUNT(*) AS freq +FROM marketing.{table} +GROUP BY 1 ORDER BY 2 DESC LIMIT 10; +``` + +**Use profile results to enrich descriptions with:** +- Null rate: only mention if > 0% (e.g. "Nullable — 12% of rows have no value") +- Zero vs null split: mention if zeros are meaningful (e.g. "23% are zero, not null — campaigns with no spend that day") +- Value range for numeric columns (e.g. "Range: 0–48,320") +- Enumerated values for categoricals (e.g. "Values: google_ads, facebook, email, tv") +- If `non_null_pct = 100` and `distinct_count = total_rows` → flag as a unique key in description + +**3c. Extract dbt test context from YAML:** + +| dbt test on column | What to add to description | +|---|---| +| `not_null` | "Always populated." | +| `not_null` + `unique` | "Primary key / grain of this table. Always populated and unique." | +| `accepted_values` | "Accepted values: {list from test config}" | +| `relationships` | "Foreign key to `{referenced_table}.{referenced_column}`" | + +**3d. Match columns to glossary terms:** -**Generate for missing only** — do not regenerate descriptions that already exist in dbt YAML, just flag drifted ones for sync. 
+For every column in the table, call `search_metadata` with the column name as the query and `entity_type: "glossaryTerm"`. Match on name similarity — exact match first, then partial. + +Rules: +- Only suggest a glossary link if the match confidence is high (exact name match or the glossary term name is a clear substring of the column name, e.g. `total_roas` → `ROAS`) +- If no glossary exists yet, skip this step silently — do not error +- One column can match at most one glossary term — take the best match +- Store the matched glossary term FQN (e.g. `Marketing Analytics.KPIs.ROAS`) for use in Step 4 + +**Generate for missing only** — do not regenerate descriptions that already exist in dbt YAML, just flag drifted ones for sync. Incorporate profile, test, and glossary findings into all generated descriptions. **Style by layer:** -- **Raw sources**: factual — what the raw field represents in the source system -- **Staging**: what was cleaned, cast, or renamed; note source field if renamed -- **Intermediate**: what business logic or aggregation was applied -- **Marts**: business definition in plain language; include formula for calculated metrics (e.g. "revenue / spend"); max 2 sentences +- **Raw sources**: factual — what the raw field represents in the source system; include value range or top values if profiled +- **Staging**: what was cleaned, cast, or renamed; note source field if renamed; include null rate if non-zero +- **Intermediate**: what business logic or aggregation was applied; include range and zero-split for metrics +- **Marts**: business definition in plain language; include formula for calculated metrics (e.g. 
"revenue / spend"); include data-driven caveats (nulls, zeros, skew); max 3 sentences -**Present for review:** +**Present for review — include a Glossary column:** ``` ## Review: {table} @@ -107,17 +161,20 @@ For the chosen table(s): |---|---| | dbt YAML | (empty) | | OpenMetadata | (empty) | -| Generated | Aggregated session metrics grouped by campaign and date... | +| Generated | Aggregated session metrics grouped by campaign and date. 18,432 rows. | ### Columns -| Column | dbt YAML | OpenMetadata | Action | Generated/Sync value | -|--------|----------|--------------|--------|----------------------| -| total_sessions | "Total number of distinct sessions" | (empty) | SYNC | "Total number of distinct sessions" | -| unique_users | (empty) | (empty) | GENERATE | "Count of distinct users who had sessions on this date." | -| avg_session_duration | "Average session duration in seconds" | "Average session duration in seconds" | OK | — | +| Column | Action | Generated/Sync value | Glossary | +|--------|--------|----------------------|----------| +| total_sessions | SYNC | "Total number of distinct sessions" | — | +| unique_users | GENERATE | "Count of distinct users who had at least one session. Range: 1–4,821. Always populated." | — | +| channel | GENERATE | "Marketing channel. Values: google_ads (42%), facebook (31%), email (18%), tv (9%). Always populated." | — | +| roas | GENERATE | "Return on ad spend. Calculated as revenue / spend. Range: 0.2–8.4." | → KPIs > ROAS | +| cpa | GENERATE | "Cost per acquisition. Calculated as spend / conversions. Nullable — 8% of rows have no conversions." | → KPIs > CPA | +| avg_session_duration | OK | — | — | ``` -Then say: **"Reply with changes (e.g. 'change unique_users to X') or 'confirm' to apply. Say 'skip {column}' to leave a column unchanged."** +Then say: **"Reply with changes (e.g. 'change unique_users to X') or 'confirm' to apply. Say 'skip {column}' to leave a column or its glossary link unchanged."** Stop and wait. 
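The Step 3b rules for turning profile results into description fragments can be sketched as a small helper. This is a minimal sketch, not part of the skill: the function name and argument shapes are illustrative assumptions, and the output strings mirror the example phrasing used above.

```python
def profile_fragments(total_rows, non_null_count, distinct_count,
                      zero_count=None, min_val=None, max_val=None,
                      top_values=None):
    """Render Step 3b profile stats into description fragments.

    Illustrative helper only: argument shapes are assumptions, and the
    generated phrases follow the rules above (mention nulls only if > 0%,
    mention zeros only when meaningful, flag fully-unique columns).
    """
    frags = []
    non_null_pct = round(100.0 * non_null_count / total_rows, 1) if total_rows else 0.0
    null_pct = round(100.0 - non_null_pct, 1)
    if null_pct > 0:
        # Only mention nulls when the rate is non-zero, per the rules above.
        frags.append(f"Nullable — {null_pct:.0f}% of rows have no value.")
    if zero_count:
        zero_pct = round(100.0 * zero_count / total_rows)
        frags.append(f"{zero_pct}% are zero, not null.")
    if min_val is not None and max_val is not None:
        frags.append(f"Range: {min_val:,}–{max_val:,}.")
    if top_values:
        frags.append("Values: " + ", ".join(top_values) + ".")
    if non_null_pct == 100 and distinct_count == total_rows:
        # non_null_pct = 100 and distinct_count = total_rows → unique key.
        frags.append("Unique across all rows (candidate key).")
    return " ".join(frags)
```

For example, a fully-populated, fully-distinct column profiled with a 0–48,320 range would yield "Range: 0–48,320. Unique across all rows (candidate key)."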
@@ -145,6 +202,23 @@ Build a JSON Patch array for only the fields that changed: Call `patch_entity` with `entity_type: "table"`, the FQN, and the patch array. +**D. Apply glossary term links** +For every column that matched a glossary term in Step 3d and was not skipped by the user, add the glossary tag to the column patch: + +```json +{ + "op": "add", + "path": "/columns/{matched_index}/tags/-", + "value": { + "tagFQN": "{glossary_term_fqn}", + "source": "Glossary", + "labelType": "Automated" + } +} +``` + +Include these operations in the same `patch_entity` call as the description updates — do not make a separate call. Skip glossary linking silently if the glossary term FQN no longer exists (verify with `search_metadata` before patching). + --- ### Step 5: Validate @@ -166,6 +240,11 @@ Report: - Columns synced (drift fixed): Y - Columns skipped (OK or user skipped): Z +### Glossary Links +- Linked: roas → KPIs > ROAS +- Linked: cpa → KPIs > CPA +- No match: total_sessions, unique_users, channel, avg_session_duration + dbt YAML and OpenMetadata are now in sync. ``` @@ -185,9 +264,10 @@ When user says `all staging`, `all intermediate`, or `all marts`: ## MCP Tools -- **`search_metadata`** (read) — Fallback if FQN lookup fails +- **`search_metadata`** (read) — Two uses: (1) fallback if FQN lookup fails, (2) match column names against existing glossary terms in Step 3d - **`get_entity_details`** (read) — Fetch descriptions + column list with indices for name matching -- **`patch_entity`** (write) — Push confirmed descriptions to OpenMetadata +- **`patch_entity`** (write) — Push confirmed descriptions and glossary term links to OpenMetadata in a single call +- **`execute_sql`** (read, Postgres MCP) — Profile columns: null rates, distinct counts, value ranges, top categorical values. Used in Step 3b to ground generated descriptions in real data. 
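The single-call requirement in Step 4 (description updates and glossary tag links in one `patch_entity` payload) can be sketched as follows. The two dict arguments are illustrative shapes for this sketch, not a real API; `columns` is the ordered column list from `get_entity_details`.

```python
def build_table_patch(columns, confirmed_descriptions, glossary_links):
    """Assemble one JSON Patch array combining Step 4 description updates
    and Step 3d glossary links, so a single patch_entity call suffices.

    Sketch only: `confirmed_descriptions` maps column name -> confirmed
    description, `glossary_links` maps column name -> glossary term FQN.
    """
    # Column order from get_entity_details gives the indices used in paths.
    index = {col["name"]: i for i, col in enumerate(columns)}
    ops = []
    for name, desc in confirmed_descriptions.items():
        # RFC 6902 "add" also overwrites an existing member, so one op
        # covers both empty and drifted descriptions.
        ops.append({"op": "add",
                    "path": f"/columns/{index[name]}/description",
                    "value": desc})
    for name, term_fqn in glossary_links.items():
        # Append the glossary tag to the column's tags array ("/-").
        ops.append({"op": "add",
                    "path": f"/columns/{index[name]}/tags/-",
                    "value": {"tagFQN": term_fqn,
                              "source": "Glossary",
                              "labelType": "Automated"}})
    return ops
```

Keeping both op types in one array is what lets the skill make a single `patch_entity` call rather than separate description and tagging passes.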
## Local Tools diff --git a/.mcp.json b/.mcp.json index cedc2cf..afc180e 100644 --- a/.mcp.json +++ b/.mcp.json @@ -21,7 +21,7 @@ "--header", "Authorization:${AUTH_HEADER}" ], "env": { - "AUTH_HEADER": "Bearer " + "AUTH_HEADER": "Bearer ${OPENMETADATA_JWT_TOKEN}" } } }
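With the `.mcp.json` change above, the client substitutes `${OPENMETADATA_JWT_TOKEN}` into the `Authorization` header at start-up, so the token must be in the environment beforehand. A minimal sketch, assuming the token is kept in a local secrets file (the path is illustrative):

```shell
# Export the JWT before launching the MCP client so the
# ${OPENMETADATA_JWT_TOKEN} placeholder in .mcp.json resolves.
# The secrets file path below is illustrative.
export OPENMETADATA_JWT_TOKEN="$(cat ~/.secrets/openmetadata_jwt)"
```

This keeps the token out of the checked-in `.mcp.json`, which was the point of the change.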