104 changes: 104 additions & 0 deletions .claude/skills/metadata-exposure-enrichment/SKILL.md
@@ -0,0 +1,104 @@
---
name: metadata-exposure-enrichment
description: Enrich dbt exposure definitions by querying Metabase directly via MCP. Discovers dashboard cards, maps table and column references to dbt models, audits the existing _exposures.yml for gaps, and writes back a fully enriched exposure. Triggers include "enrich exposure", "exposure enrichment", "document dashboard", "update exposures", "what does the dashboard use".
---

# Exposure Enrichment

Discovers what a Metabase dashboard actually contains and writes that context back into dbt exposures. Works with the Metabase and Postgres MCP servers alone; no metadata platform is required.

## How It Works

1. **Parse `$ARGUMENTS`** -- Dashboard name, dashboard ID (e.g. `2`), or `all`. If empty, default to `all`. Extract any optional `--dry-run` flag (report only, no file write).
2. **Check file exists** -- Read `dbt/models/marts/_exposures.yml`. If missing, stop and output:
> ERROR: `dbt/models/marts/_exposures.yml` not found. Create a barebones version first with `name`, `type`, `url`, and `depends_on` fields, then re-run this skill.
3. **Discover via Metabase MCP** -- Execute sequentially:
a. `metabase-list-dashboards` -- confirm the target dashboard exists and get its ID
b. `metabase-get-dashboard` with the dashboard ID -- extract the `dashcards` array to get all `card_id` values
c. For each `card_id`: call `metabase-get-question` -- collect card name, display type, and `dataset_query` (MBQL `source-table` or native SQL)
d. `metabase-get-database-metadata` for the database -- map internal Metabase table IDs to real table names in the `marketing` schema
e. `metabase-get-current-user` -- capture email for the exposure `owner` field
4. **Cross-reference to dbt** -- For each card's source table, determine whether it maps to a dbt mart model or a raw source:
- Read `dbt/models/marts/` SQL files and `_marts.yml` to confirm mart models
- Read `dbt/models/staging/_sources.yml` to identify raw source tables
- Mart models become `ref('model_name')` in `depends_on`
- Raw source tables become `source('marketing_raw', 'table_name')` in `depends_on`
- Flag any table that doesn't map to either
5. **Audit the existing exposure** -- Read `_exposures.yml` and check each exposure:
- Is `description` present and non-empty?
- Is `owner` present with name and email?
- Is `maturity` set?
   - Does `depends_on` include ALL models and sources mapped in step 4?
- Are card-level details documented in the description?
- Are key columns documented?
6. **Report** -- Print a structured summary before offering any write:
- Dashboard: name, URL, total card count
- Card inventory: ID, name, display type, source table, columns used (aggregation + breakout)
- Audit gaps: what `_exposures.yml` is missing vs what was discovered
- End with: `GAPS: N fields missing | Cards: N discovered | Models: N mapped`
7. **Offer to enrich** -- Propose the enriched YAML and confirm with user before writing. If `--dry-run`, print proposed content only. On confirmation:
- Write to `dbt/models/marts/_exposures.yml`
- Report what changed (before vs after)
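
The discovery loop in step 3 can be sketched in Python. This is a minimal sketch, assuming the payload shape returned by `metabase-get-dashboard` (a top-level `dashcards` list whose entries carry a `card_id`); treat the field names as assumptions if your Metabase version differs.

```python
# Sketch of step 3b: collect card IDs from a metabase-get-dashboard
# payload. The "dashcards"/"card_id" field names follow Metabase's
# dashboard API and are assumptions here, not a verified contract.
def extract_card_ids(dashboard: dict) -> list[int]:
    ids = []
    for dashcard in dashboard.get("dashcards", []):
        card_id = dashcard.get("card_id")
        if card_id is not None:  # text and heading tiles carry no card
            ids.append(card_id)
    return ids
```

Each returned ID then feeds the `metabase-get-question` call in step 3c.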

## Description Format

Plain text with bracketed headers, following the same convention as `_marts.yml` descriptions; both dbt YAML and OpenMetadata render plain text.

**Exposure description**:
`[Business Purpose]` what decisions the dashboard drives and who uses it.
`[Cards]` list each card: name, chart type, what it measures.
`[Key Columns]` columns surfaced in the dashboard (aggregation columns + breakout dimensions).
`[Data Sources]` which dbt models and sources feed the dashboard, and how.
`[Known Issues / Caveats]` date range defaults, missing channels, filter behavior, archived cards.

## Reference

**Target dashboard**: "Agentic Data Modeling Demo" (ID=2), URL `http://localhost:3000/dashboard/2`

**Known card inventory on dashboard 2**:
- Card 40: ROAS (smartscalar) -- avg(roas) from campaign_performance, grouped by date
- Card 41: CR% (smartscalar) -- avg(conversion_rate) from campaign_performance, grouped by date
- Card 42: Target Revenue (progress) -- sum(total_revenue) from campaign_performance, grouped by date
- Card 43: Daily Spend by Channel (bar) -- sum(spend) from campaigns_daily, grouped by date + channel
- Card 44: Desktop Per Channel (pie) -- sum(desktop_sessions) from campaign_performance, grouped by channel
- Card 45: Mobile Per Channel (pie) -- sum(mobile_sessions) from campaign_performance, grouped by channel

**Standalone card NOT on dashboard 2** (do not include):
- Card 38: ROAS (table, native SQL) -- archived, not on any active dashboard

**Metabase table to dbt model map**:
- `campaign_performance` -> mart model, use `ref('campaign_performance')`
- `campaigns_daily` -> raw source table staged as `stg_campaigns_daily`, use `source('marketing_raw', 'campaigns_daily')`
- `daily_summary` -> mart model, not directly queried by any card but is a rollup of campaign_performance

**dbt mart models**: `campaign_performance`, `daily_summary`, `user_journey`, `channel_attribution`
**Files**: `dbt/models/marts/_exposures.yml`, `dbt/models/marts/_marts.yml`
**Sources**: `dbt/models/staging/_sources.yml` (source name: `marketing_raw`, schema: `marketing`)
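
The table-to-dbt mapping above (step 4) can be sketched as a small helper. The paths, the `marketing_raw` source name, and the regex scan over `_sources.yml` mirror this repo's layout but are assumptions, not a definitive implementation.

```python
# Hypothetical sketch of step 4: map Metabase source tables to dbt
# depends_on entries by scanning the mart SQL files and sources YAML.
from pathlib import Path
import re

def build_depends_on(tables, marts_dir="dbt/models/marts",
                     sources_yml="dbt/models/staging/_sources.yml",
                     source_name="marketing_raw"):
    # Mart model names are the .sql file stems in the marts directory
    mart_models = {p.stem for p in Path(marts_dir).glob("*.sql")}
    # Crude YAML scan: every "- name: x" entry counts as a source table
    text = Path(sources_yml).read_text() if Path(sources_yml).exists() else ""
    source_tables = set(re.findall(r"-\s*name:\s*(\w+)", text))
    depends_on, unmapped = [], []
    for table in tables:
        if table in mart_models:
            depends_on.append(f"ref('{table}')")
        elif table in source_tables:
            depends_on.append(f"source('{source_name}', '{table}')")
        else:
            unmapped.append(table)  # flag tables that map to neither
    return depends_on, unmapped
```

Anything landing in `unmapped` is surfaced in the step 6 report rather than written into `_exposures.yml`.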

## Output Format

```
## Exposure Enrichment: {dashboard_name} (ID={id})

### Dashboard Discovery
- Cards on dashboard: {count}
- Source tables: {table} ({n} cards), ...
- dbt mapping: {table} -> {ref or source}

### Card Inventory
| Card | Name | Type | Source Table | Columns Used |
|------|------|------|--------------|--------------|
| ... | ... | ... | ... | ... |

### Audit: _exposures.yml gaps
- description: {PRESENT | MISSING}
- owner: {PRESENT | MISSING}
- maturity: {PRESENT | MISSING}
- depends_on: {complete | missing: list}
- card documentation: {PRESENT | MISSING}

### Proposed enrichment
{full enriched YAML}

GAPS: {n} fields missing | Cards: {n} discovered | Models: {n} mapped
```
8 changes: 8 additions & 0 deletions .mcp.json
@@ -23,6 +23,14 @@
"env": {
"AUTH_HEADER": "Bearer <YOUR_OPENMETADATA_JWT_TOKEN>"
}
},
"metabase": {
"command": "npx",
"args": ["-y", "@getnao/metabase-mcp-server@latest"],
"env": {
"METABASE_URL": "http://localhost:3000",
"METABASE_API_KEY": "<YOUR_METABASE_API_KEY>"
}
}
}
}
27 changes: 27 additions & 0 deletions QUICKSTART.md
@@ -11,6 +11,7 @@ A step by step guide on how to get started with this project.
OPENMETADATA_JWT_TOKEN=your_jwt_token_here
METABASE_USERNAME=your_metabase_username
METABASE_PASSWORD=your_metabase_password
METABASE_API_KEY=your_metabase_api_key_here
```

2. Run the docker container:
@@ -91,6 +92,23 @@ Replace `<YOUR_OPENMETADATA_JWT_TOKEN>` in `.mcp.json` with the token you generated

> **Docs**: [OpenMetadata MCP Reference](https://docs.open-metadata.org/v1.10.x/how-to-guides/mcp/reference)

### Metabase MCP

The `/metadata-exposure-enrichment` skill queries Metabase dashboards directly via the [nao-metabase-mcp-server](https://github.com/getnao/nao-mcp-servers). It is already configured in `.mcp.json`; you only need to supply an API key.

**Generate an API key:**
1. Go to `http://localhost:3000/admin/settings/authentication/api-keys`
2. Create a new API key
3. Add it to your `.env`:

```bash
METABASE_API_KEY=your_api_key_here
```

Replace `<YOUR_METABASE_API_KEY>` in `.mcp.json` with your key.

> **Docs**: [nao-metabase-mcp-server](https://github.com/getnao/nao-mcp-servers)

### Verify `.mcp.json`

Your `.mcp.json` should look like this (already included in the repo):
@@ -121,6 +139,14 @@ Your `.mcp.json` should look like this (already included in the repo):
"env": {
"AUTH_HEADER": "Bearer <YOUR_OPENMETADATA_JWT_TOKEN>"
}
},
"metabase": {
"command": "npx",
"args": ["-y", "@getnao/metabase-mcp-server@latest"],
"env": {
"METABASE_URL": "http://localhost:3000",
"METABASE_API_KEY": "<YOUR_METABASE_API_KEY>"
}
}
}
}
@@ -136,6 +162,7 @@ Then use the MCP servers to ask questions such as:
- "Who owns the Agentic Data Modeling Demo dashboard?"
- "Is `user_journey` ready to be consumed by an AI agent?"
- "Create a business glossary from our dbt models"
- "Enrich the exposure for the Agentic Data Modeling Demo dashboard"

---

17 changes: 11 additions & 6 deletions README.md
@@ -25,12 +25,13 @@ This project connects Claude to the data stack through **two MCP servers**, giving
|---|---|---|
| **OpenMetadata MCP** | Metadata catalog — lineage, search, glossaries, entity details | `search_metadata`, `get_entity_lineage`, `get_entity_details`, `create_glossary_term` |
| **PostgreSQL MCP** | Direct database access — query data, profile columns, validate models | `execute_sql`, `list_tables`, `list_table_stats` |
| **Metabase MCP** | Direct dashboard access — discover cards, questions, database metadata | `metabase-list-dashboards`, `metabase-get-dashboard`, `metabase-get-question` |

The **PostgreSQL MCP** uses [Google GenAI Toolbox](https://github.com/googleapis/genai-toolbox) (pre-downloaded binary in `bin/toolbox`) to give Claude direct SQL access to the local PostgreSQL instance. This enables data profiling, edge case discovery, and validation queries — capabilities used heavily by the AI Readiness skill.

The **OpenMetadata MCP** connects to the OpenMetadata server's native MCP endpoint, providing metadata search, lineage tracing, and glossary management through natural language.

Both servers are configured in `.mcp.json` at the project root, with permissions managed in `.claude/settings.local.json`.
All three servers are configured in `.mcp.json` at the project root, with permissions managed in `.claude/settings.local.json`.

### What this enables

@@ -43,7 +44,7 @@ Both servers are configured in `.mcp.json` at the project root, with permissions

## 🛠️ Claude Code Skills

The project includes three custom **Claude Code skills** (in `.claude/skills/`) that encode repeatable data engineering workflows as slash commands. These skills combine the OpenMetadata and PostgreSQL MCP tools with local file analysis to automate common tasks:
The project includes four custom **Claude Code skills** (in `.claude/skills/`) that encode repeatable data engineering workflows as slash commands. These skills combine the OpenMetadata and PostgreSQL MCP tools with local file analysis to automate common tasks:

### `/metadata-impact-analysis`
Analyze downstream impact before making schema changes. Traces lineage through dbt models and dashboards to identify what breaks if a column is renamed, dropped, or its type changes.
@@ -54,6 +55,9 @@ Audit and enrich dbt mart models for AI consumption. Checks schema quality, quer
### `/metadata-glossary`
Manage an OpenMetadata glossary derived from dbt models. Parses dbt YAML for column names and descriptions, groups them into business categories, and creates/syncs glossary terms via OpenMetadata.

### `/metadata-exposure-enrichment`
Enrich dbt exposure definitions by querying Metabase directly via MCP. Discovers dashboard cards, maps table and column references to dbt models, audits the existing `_exposures.yml` for gaps, and writes back a fully enriched exposure.

## 📚 Documentation

This project includes comprehensive documentation to help you get started:
@@ -90,8 +94,8 @@ This setup enables a complete data analytics workflow where:
2. dbt transforms and models the data locally
3. Metabase provides interactive dashboards
4. OpenMetadata centralizes metadata from all components via **YAML-based ingestion** (not UI), providing unified lineage and metadata views
5. Claude connects via two MCP servers (OpenMetadata + PostgreSQL) for metadata exploration and direct data access
6. Custom skills (`/metadata-impact-analysis`, `/metadata-ai-readiness`, `/metadata-glossary`) automate repeatable data engineering workflows
5. Claude connects via three MCP servers (OpenMetadata + PostgreSQL + Metabase) for metadata exploration, direct data access, and dashboard discovery
6. Custom skills (`/metadata-impact-analysis`, `/metadata-ai-readiness`, `/metadata-glossary`, `/metadata-exposure-enrichment`) automate repeatable data engineering workflows

**Key Feature:** All OpenMetadata ingestion is configured through YAML files, enabling Infrastructure as Code (IaC) practices. Ingestion runs on-demand using Docker Compose profiles, giving you control over when metadata is synchronized. While OpenMetadata provides a UI for configuration, this project uses YAML files for version control, automation, and reproducibility.

@@ -101,8 +105,9 @@ This setup enables a complete data analytics workflow where:
│ └── skills/ # Custom Claude Code skills
│ ├── metadata-impact-analysis/
│ ├── metadata-ai-readiness/
│ └── metadata-glossary/
├── .mcp.json # MCP server definitions (Postgres + OpenMetadata)
│ ├── metadata-glossary/
│ └── metadata-exposure-enrichment/
├── .mcp.json # MCP server definitions (Postgres + OpenMetadata + Metabase)
├── bin/
│ └── toolbox # Google GenAI Toolbox binary (Postgres MCP)
├── dbt/ # dbt project
37 changes: 37 additions & 0 deletions dbt/models/marts/_exposures.yml
@@ -0,0 +1,37 @@
version: 2

exposures:
- name: agentic_data_modeling_demo
type: dashboard
maturity: low
url: http://localhost:3000/dashboard/2
description: >
[Business Purpose] Marketing performance dashboard used to monitor campaign ROI,
conversion efficiency, revenue targets, and channel-level spend and device breakdown.
Supports daily decision-making on budget allocation and channel optimization.

[Cards] 6 cards:
(1) ROAS -- smartscalar showing average return on ad spend over time.
(2) CR% -- smartscalar showing average conversion rate over time.
(3) Target Revenue -- progress bar tracking cumulative revenue against a 100k goal.
(4) Daily Spend by Channel -- stacked bar chart of daily spend broken down by marketing channel.
(5) Desktop Per Channel -- pie chart of total desktop sessions by channel.
(6) Mobile Per Channel -- pie chart of total mobile sessions by channel.

[Key Columns] roas, conversion_rate, total_revenue, spend, desktop_sessions,
mobile_sessions, date, channel.

[Data Sources] campaign_performance mart (cards 1-3, 5-6) and campaigns_daily
raw source (card 4). daily_summary is an indirect dependency as a rollup of
campaign_performance.

[Known Issues / Caveats] Dashboard date filter defaults to past 7 days.
Card 43 (Daily Spend) queries the raw campaigns_daily source table directly
rather than a mart model.
owner:
name: Alejandro Aboy
email: aboyalejandro@gmail.com
depends_on:
- ref('campaign_performance')
- ref('daily_summary')
- source('marketing_raw', 'campaigns_daily')