118 changes: 99 additions & 19 deletions .claude/skills/metadata-enrich/SKILL.md
@@ -1,6 +1,6 @@
---
name: metadata-enrich
description: Audit OpenMetadata for missing or drifted descriptions across all dbt layers, generate AI descriptions, write confirmed descriptions back to dbt YAML (source of truth), then sync to OpenMetadata via patch_entity. Triggers include "enrich metadata", "missing descriptions", "which tables have no description", "fill metadata", "generate descriptions", "update catalog", "sync descriptions".
description: Audit OpenMetadata for missing or drifted descriptions across all dbt layers, generate AI descriptions grounded in real data profiles and dbt tests, match columns to glossary terms, write confirmed descriptions back to dbt YAML (source of truth), then sync to OpenMetadata via patch_entity. Triggers include "enrich metadata", "missing descriptions", "which tables have no description", "fill metadata", "generate descriptions", "update catalog", "sync descriptions".
---

# Metadata Enrichment
@@ -83,21 +83,75 @@ Stop and wait for user input.

For the chosen table(s):

**Gather context:**
**3a. Gather static context:**
- Identify which dbt YAML file contains it
- Read the YAML — use existing descriptions as the base
- Read the YAML — use existing descriptions as the base, and extract dbt tests per column
- Read the SQL file if it exists (staging, intermediate, marts — not raw sources)
- For drift columns: show current dbt description vs current OpenMetadata value
- For drift columns: note current dbt description vs current OpenMetadata value

**3b. Profile the data via PostgreSQL MCP** — skip for raw source tables that are not materialised as dbt models.

Run the following queries against the actual table. Use the model name as the table name (models are materialised in the `marketing` schema):

```sql
-- Row count plus per-column non-null rate and distinct count
-- (null rate = 100 - non_null_pct)
SELECT
    COUNT(*) AS total_rows,
    COUNT({col}) AS non_null_count,
    ROUND(100.0 * COUNT({col}) / NULLIF(COUNT(*), 0), 1) AS non_null_pct,
    COUNT(DISTINCT {col}) AS distinct_count
FROM marketing.{table};

-- For numeric columns: range and zero split
SELECT
    MIN({col}) AS min_val,
    MAX({col}) AS max_val,
    ROUND(AVG({col})::numeric, 2) AS avg_val,
    SUM(CASE WHEN {col} = 0 THEN 1 ELSE 0 END) AS zero_count,
    SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS null_count
FROM marketing.{table};

-- For categorical / low-cardinality columns (distinct_count <= 20): top values
SELECT {col}, COUNT(*) AS freq
FROM marketing.{table}
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;
```

**Use profile results to enrich descriptions with:**
- Null rate: only mention if > 0% (e.g. "Nullable — 12% of rows have no value")
- Zero vs null split: mention if zeros are meaningful (e.g. "23% are zero, not null — campaigns with no spend that day")
- Value range for numeric columns (e.g. "Range: 0–48,320")
- Enumerated values for categoricals (e.g. "Values: google_ads, facebook, email, tv")
- If `non_null_pct = 100` and `distinct_count = total_rows` → flag the column as a unique key in the description
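
The enrichment rules above could be sketched roughly as follows. This is only an illustration: the function name `profile_fragments`, its parameters, and the exact wording of each fragment are hypothetical and not part of the skill. The inputs are assumed to be the values returned by the profiling queries.

```python
from typing import List, Optional, Sequence, Union

Number = Union[int, float]


def profile_fragments(
    total_rows: int,
    non_null_count: int,
    distinct_count: int,
    min_val: Optional[Number] = None,
    max_val: Optional[Number] = None,
    zero_count: int = 0,
    top_values: Optional[Sequence[str]] = None,
) -> List[str]:
    """Turn profile numbers into short description fragments (hypothetical helper)."""
    fragments: List[str] = []
    if total_rows <= 0:
        return fragments  # nothing to say about an empty table

    # Null rate: only mentioned when at least one row is null.
    null_count = total_rows - non_null_count
    if null_count > 0:
        null_pct = round(100.0 * null_count / total_rows, 1)
        fragments.append(f"Nullable: {null_pct:g}% of rows have no value.")

    # Zero vs null split: reported whenever zeros exist; whether they are
    # meaningful is left to the reviewer.
    if zero_count > 0:
        zero_pct = round(100.0 * zero_count / total_rows, 1)
        fragments.append(f"{zero_pct:g}% are zero, not null.")

    # Value range for numeric columns.
    if min_val is not None and max_val is not None:
        fragments.append(f"Range: {min_val:,}–{max_val:,}.")

    # Enumerated values for low-cardinality columns.
    if top_values:
        fragments.append("Values: " + ", ".join(top_values) + ".")

    # Fully populated and fully distinct: flag as a unique key.
    if null_count == 0 and distinct_count == total_rows:
        fragments.append("Unique key.")

    return fragments
```

For example, `profile_fragments(100, 88, 40)` yields a single nullable note, while a fully populated column with as many distinct values as rows gets the unique-key flag.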

**3c. Extract dbt test context from YAML:**

| dbt test on column | What to add to description |
|---|---|
| `not_null` | "Always populated." |
| `not_null` + `unique` | "Primary key / grain of this table. Always populated and unique." |
| `accepted_values` | "Accepted values: {list from test config}" |
| `relationships` | "Foreign key to `{referenced_table}.{referenced_column}`" |
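
As a rough sketch, the mapping in the table above might be applied like this. It assumes the column's tests have already been parsed from YAML into the classic dbt shape (a bare string such as `not_null`, or a one-key dict such as `{"accepted_values": {"values": [...]}}`); newer dbt layouts that nest arguments differently are not handled. The helper name `describe_dbt_tests` is hypothetical.

```python
import re
from typing import Any, Dict, List, Set


def describe_dbt_tests(tests: List[Any]) -> List[str]:
    """Map a column's parsed dbt tests to description phrases (hypothetical helper)."""
    names: Set[str] = set()
    configs: Dict[str, Dict[str, Any]] = {}
    for entry in tests:
        if isinstance(entry, str):
            names.add(entry)
        elif isinstance(entry, dict):
            for name, cfg in entry.items():
                names.add(name)
                configs[name] = cfg or {}

    phrases: List[str] = []
    if {"not_null", "unique"} <= names:
        phrases.append("Primary key / grain of this table. Always populated and unique.")
    elif "not_null" in names:
        phrases.append("Always populated.")

    if "accepted_values" in names:
        values = configs.get("accepted_values", {}).get("values", [])
        phrases.append("Accepted values: " + ", ".join(str(v) for v in values))

    if "relationships" in names:
        cfg = configs.get("relationships", {})
        target = str(cfg.get("to", ""))
        # Pull the model name out of ref('model'); otherwise keep the raw value.
        match = re.search(r"ref\(\s*['\"]([^'\"]+)['\"]\s*\)", target)
        table = match.group(1) if match else target
        phrases.append(f"Foreign key to `{table}.{cfg.get('field', '')}`")

    return phrases
```

A column with only `unique` (and no `not_null`) produces no phrase here, since the table does not define one for that case.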

**3d. Match columns to glossary terms:**

**Generate for missing only** — do not regenerate descriptions that already exist in dbt YAML, just flag drifted ones for sync.
For every column in the table, call `search_metadata` with the column name as the query and `entity_type: "glossaryTerm"`. Match on name similarity — exact match first, then partial.

Rules:
- Only suggest a glossary link if the match confidence is high (exact name match or the glossary term name is a clear substring of the column name, e.g. `total_roas` → `ROAS`)
- If no glossary exists yet, skip this step silently — do not error
- One column can match at most one glossary term — take the best match
- Store the matched glossary term FQN (e.g. `Marketing Analytics.KPIs.ROAS`) for use in Step 4
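
One possible reading of these matching rules, sketched in Python. The helper `match_glossary_term` is hypothetical, and two assumptions are baked in: "clear substring" is interpreted as a whole-token match (so `cpa` does not match a column named `cpac`), and glossary term names are assumed not to contain dots, so the last FQN segment is the term name. The list of FQNs would come from `search_metadata` results.

```python
import re
from typing import List, Optional, Sequence


def _tokens(name: str) -> List[str]:
    """Lower-case a name and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]


def _contains_run(haystack: List[str], needle: List[str]) -> bool:
    """True if `needle` occurs as a contiguous run of tokens inside `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def match_glossary_term(column: str, term_fqns: Sequence[str]) -> Optional[str]:
    """Return the best-matching glossary term FQN for a column, or None."""
    col = _tokens(column)
    candidates = [(fqn, _tokens(fqn.rsplit(".", 1)[-1])) for fqn in term_fqns]
    candidates = [(fqn, term) for fqn, term in candidates if term]

    # Exact name match wins outright.
    for fqn, term in candidates:
        if term == col:
            return fqn

    # Otherwise take the longest term that appears whole inside the column name.
    partial = [(fqn, term) for fqn, term in candidates if _contains_run(col, term)]
    if not partial:
        return None  # no glossary, or no confident match: skip silently
    return max(partial, key=lambda pair: len(pair[1]))[0]
```

Returning `None` for an empty term list covers the "no glossary exists yet" case without raising.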

**Generate for missing only** — do not regenerate descriptions that already exist in dbt YAML, just flag drifted ones for sync. Incorporate profile, test, and glossary findings into all generated descriptions.

**Style by layer:**
- **Raw sources**: factual — what the raw field represents in the source system
- **Staging**: what was cleaned, cast, or renamed; note source field if renamed
- **Intermediate**: what business logic or aggregation was applied
- **Marts**: business definition in plain language; include formula for calculated metrics (e.g. "revenue / spend"); max 2 sentences
- **Raw sources**: factual — what the raw field represents in the source system; include value range or top values if profiled
- **Staging**: what was cleaned, cast, or renamed; note source field if renamed; include null rate if non-zero
- **Intermediate**: what business logic or aggregation was applied; include range and zero-split for metrics
- **Marts**: business definition in plain language; include formula for calculated metrics (e.g. "revenue / spend"); include data-driven caveats (nulls, zeros, skew); max 3 sentences

**Present for review:**
**Present for review — include a Glossary column:**

```
## Review: {table}
@@ -107,17 +161,20 @@ For the chosen table(s):
|---|---|
| dbt YAML | (empty) |
| OpenMetadata | (empty) |
| Generated | Aggregated session metrics grouped by campaign and date... |
| Generated | Aggregated session metrics grouped by campaign and date. 18,432 rows. |

### Columns
| Column | dbt YAML | OpenMetadata | Action | Generated/Sync value |
|--------|----------|--------------|--------|----------------------|
| total_sessions | "Total number of distinct sessions" | (empty) | SYNC | "Total number of distinct sessions" |
| unique_users | (empty) | (empty) | GENERATE | "Count of distinct users who had sessions on this date." |
| avg_session_duration | "Average session duration in seconds" | "Average session duration in seconds" | OK | — |
| Column | Action | Generated/Sync value | Glossary |
|--------|--------|----------------------|----------|
| total_sessions | SYNC | "Total number of distinct sessions" | — |
| unique_users | GENERATE | "Count of distinct users who had at least one session. Range: 1–4,821. Always populated." | — |
| channel | GENERATE | "Marketing channel. Values: google_ads (42%), facebook (31%), email (18%), tv (9%). Always populated." | — |
| roas | GENERATE | "Return on ad spend. Calculated as revenue / spend. Range: 0.2–8.4." | → KPIs > ROAS |
| cpa | GENERATE | "Cost per acquisition. Calculated as spend / conversions. Nullable — 8% of rows have no conversions." | → KPIs > CPA |
| avg_session_duration | OK | — | — |
```

Then say: **"Reply with changes (e.g. 'change unique_users to X') or 'confirm' to apply. Say 'skip {column}' to leave a column unchanged."**
Then say: **"Reply with changes (e.g. 'change unique_users to X') or 'confirm' to apply. Say 'skip {column}' to leave a column or its glossary link unchanged."**

Stop and wait.

@@ -145,6 +202,23 @@ Build a JSON Patch array for only the fields that changed:

Call `patch_entity` with `entity_type: "table"`, the FQN, and the patch array.

**D. Apply glossary term links**
For every column that matched a glossary term in Step 3d and was not skipped by the user, add the glossary tag to the column patch:

```json
{
  "op": "add",
  "path": "/columns/{matched_index}/tags/-",
  "value": {
    "tagFQN": "{glossary_term_fqn}",
    "source": "Glossary",
    "labelType": "Automated"
  }
}
```

Include these operations in the same `patch_entity` call as the description updates — do not make a separate call. Skip glossary linking silently if the glossary term FQN no longer exists (verify with `search_metadata` before patching).
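
A minimal sketch of assembling one combined patch array for a column. The tag operation mirrors the JSON above. The description operation (an `add` on `/columns/{index}/description`) is an assumption, since the description patch format is not visible in this hunk, and the helper name `build_column_patch` is hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def build_column_patch(
    column_index: int,
    description: Optional[str] = None,
    glossary_fqn: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Collect description and glossary operations for one column (hypothetical helper)."""
    ops: List[Dict[str, Any]] = []

    if description is not None:
        # Assumed shape of the description update; adjust to the skill's actual format.
        ops.append({
            "op": "add",
            "path": f"/columns/{column_index}/description",
            "value": description,
        })

    if glossary_fqn is not None:
        # "/tags/-" appends to the column's tag array (JSON Patch, RFC 6902).
        ops.append({
            "op": "add",
            "path": f"/columns/{column_index}/tags/-",
            "value": {
                "tagFQN": glossary_fqn,
                "source": "Glossary",
                "labelType": "Automated",
            },
        })

    return ops


if __name__ == "__main__":
    patch = build_column_patch(3, "Return on ad spend.", "Marketing Analytics.KPIs.ROAS")
    print(json.dumps(patch, indent=2))
```

Concatenating the lists returned for each column gives a single array that can be sent in one `patch_entity` call.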

---

### Step 5: Validate
@@ -166,6 +240,11 @@ Report:
- Columns synced (drift fixed): Y
- Columns skipped (OK or user skipped): Z

### Glossary Links
- Linked: roas → KPIs > ROAS
- Linked: cpa → KPIs > CPA
- No match: total_sessions, unique_users, channel, avg_session_duration

dbt YAML and OpenMetadata are now in sync.
```

@@ -185,9 +264,10 @@ When user says `all staging`, `all intermediate`, or `all marts`:

## MCP Tools

- **`search_metadata`** (read) — Fallback if FQN lookup fails
- **`search_metadata`** (read) — Two uses: (1) fallback if FQN lookup fails, (2) match column names against existing glossary terms in Step 3d
- **`get_entity_details`** (read) — Fetch descriptions + column list with indices for name matching
- **`patch_entity`** (write) — Push confirmed descriptions to OpenMetadata
- **`patch_entity`** (write) — Push confirmed descriptions and glossary term links to OpenMetadata in a single call
- **`execute_sql`** (read, Postgres MCP) — Profile columns: null rates, distinct counts, value ranges, top categorical values. Used in Step 3b to ground generated descriptions in real data.

## Local Tools

2 changes: 1 addition & 1 deletion .mcp.json
@@ -21,7 +21,7 @@
"--header", "Authorization:${AUTH_HEADER}"
],
"env": {
"AUTH_HEADER": "Bearer <YOUR_OPENMETADATA_JWT_TOKEN>"
"AUTH_HEADER": "Bearer ${OPENMETADATA_JWT_TOKEN}"
}
}
}